Multi-GPU nodes are becoming the platform of choice for graph processing. However, in a multi-GPU environment there are two main challenges in designing a graph processing system. First, the system suffers from high communication overhead: GPUs and CPUs are connected through PCIe, whose bandwidth is far smaller than that of GPU memory. Second, the system is developed based on BSP (Bulk Synchronous...
Compute-intensive GPU architectures allow the use of high-order 3D stencils for better computational accuracy. These stencils are usually compute-bound. While current state-of-the-art register allocators are satisfactory for most applications, they are unable to effectively manage register pressure for such complex high-order stencils, resulting in sub-optimal code with a large number of register...
A 19.2 Gb/s per lane link with IBM's latest POWER8 processor module has been analyzed. This paper presents an overview of the high-speed link design from the signal integrity point of view. Design approaches in the package and printed circuit board (PCB) to support the target data rate have been discussed. The end-to-end communication bus is modeled from the extracted post-route design with a 3-D full-wave...
Deep Learning (DL) algorithms have become ubiquitous in data analytics. As a result, major computing vendors — including NVIDIA, Intel, AMD and IBM — have architectural road-maps influenced by DL workloads. Furthermore, several vendors have recently advertised new computing products as accelerating DL workloads. Unfortunately, it is difficult for data scientists to quantify the potential of these...
The massive parallelism and high memory bandwidth of GPUs are particularly well matched with the exigencies of Big Data analytics applications, for which many independent computations and high data throughput are prevalent. These applications often produce (intermediate or final) results in the form of key-value (KV) pairs, and hash tables are particularly well-suited for storing these KV pairs in...
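The KV-storage pattern the abstract refers to can be sketched with a minimal open-addressing hash table using linear probing — the flat-array layout commonly chosen for GPU hash tables because it avoids pointer chasing. This is an illustrative host-side sketch, not the paper's implementation; on a GPU, each slot would typically be claimed with an atomic compare-and-swap, while here a plain Python list stands in for device memory.

```python
# Minimal open-addressing hash table with linear probing, sketching
# the flat-array KV storage pattern used by GPU hash tables.
# Illustrative only: a real GPU version would claim slots with
# atomic compare-and-swap; a Python list stands in for device memory.

EMPTY = object()  # sentinel marking an unused slot

class ProbingHashTable:
    def __init__(self, capacity=16):
        self.capacity = capacity
        self.slots = [EMPTY] * capacity  # holds (key, value) pairs

    def _probe(self, key):
        # Linear probing: scan from the hashed slot until we find
        # either the key itself or an empty slot.
        start = hash(key) % self.capacity
        for i in range(self.capacity):
            idx = (start + i) % self.capacity
            slot = self.slots[idx]
            if slot is EMPTY or slot[0] == key:
                return idx
        raise RuntimeError("table full")

    def put(self, key, value):
        self.slots[self._probe(key)] = (key, value)

    def get(self, key, default=None):
        slot = self.slots[self._probe(key)]
        return default if slot is EMPTY else slot[1]
```

The flat-array layout is what makes this structure GPU-friendly: probes are coalesced memory reads rather than linked-list traversals.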
It has been shown that a newly proposed micro-modeling method for deriving a concise passive circuit of a large-scale EM problem is highly suitable for GPU parallel computation. However, due to the GPU's memory bandwidth limit, GPU utilization falls far short of peak performance because more than 97% of the processing time is occupied by frequent data transactions. This paper proposes an effective...
Convolution is a fundamental operation in many applications, such as computer vision, natural language processing, image processing, etc. Recent successes of convolutional neural networks in various deep learning applications put even higher demand on fast convolution. The high computation throughput and memory bandwidth of graphics processing units (GPUs) make GPUs a natural choice for accelerating...
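The operation in question can be made concrete with a direct ("naive") single-channel 2D convolution — the baseline that GPU kernels accelerate with tiling, shared memory, and algorithmic variants such as im2col or Winograd. This is a generic sketch of the operation itself, not any specific paper's method; `conv2d` and its "valid"-padding convention are illustrative choices.

```python
# Direct 2D convolution (cross-correlation convention) over a single
# channel with "valid" padding: the output shrinks by kernel size - 1.
# This is the baseline computation that GPU kernels accelerate.
def conv2d(image, kernel):
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = ih - kh + 1, iw - kw + 1  # "valid" output size
    out = [[0.0] * ow for _ in range(oh)]
    for y in range(oh):
        for x in range(ow):
            acc = 0.0
            for ky in range(kh):          # slide the kernel window
                for kx in range(kw):
                    acc += image[y + ky][x + kx] * kernel[ky][kx]
            out[y][x] = acc
    return out
```

The four nested loops expose the abundant independent work per output element that makes the operation such a natural fit for GPUs.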
MeteoSwiss, the Swiss national weather forecast institute, has selected densely populated accelerator servers as their primary system to compute weather forecast simulations. Servers with multiple accelerator devices that are primarily connected by a PCI-Express (PCIe) network achieve a significantly higher energy efficiency. Memory transfers between accelerators in such a system are subjected to PCIe...
Today, accelerator cards like GPUs are an important constituent of HPC clusters. For certain GPU-intense applications, the trend is shifting toward multi-GPU systems with four or more GPUs per compute node. This can increase the performance per dollar and the performance per watt. The Linpack benchmark is the standard tool for measuring the compute performance of supercomputers. Its standard implementation,...
The memory wall problem is one of the major obstacles to realizing extremely fast, large-scale simulations. Stencil computations, which are important kernels for CFD simulations, have achieved high speed on GPU clusters thanks to the high memory bandwidth and computation speed of accelerators. However, their problem scales have been limited by the small capacity of GPU device memory...
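The memory-bound access pattern of a stencil kernel can be illustrated with a single Jacobi sweep of a 3-point 1D stencil: each output point reads only a small fixed neighborhood of the input grid, so performance is dominated by memory bandwidth rather than arithmetic. This is a generic 1D sketch for illustration; CFD codes like those in the abstract typically use higher-order 3D stencils.

```python
# One Jacobi sweep of a 3-point 1D stencil: each interior point is
# replaced by the average of its neighborhood. The tiny fixed ratio
# of arithmetic to memory traffic is why stencils are memory-bound.
def jacobi_step(u):
    v = list(u)  # boundary values are kept fixed
    for i in range(1, len(u) - 1):
        v[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
    return v
```

Repeating such sweeps over a grid larger than device memory is exactly the situation where the capacity limit mentioned above starts to bite.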
This work introduces an alternative architecture of a GNSS signal simulator, where the multiple GNSS services in the full GNSS bandwidth from L5 to L1 are generated and mixed in digital form. The digital-to-analog conversion and up-conversion to L-band is then applied to the single compounded wideband digital signal. The digital signal generation and mixing is implemented on a pair of strong GPUs...
Accelerated computing has become pervasive for increasing the computational power and energy efficiency in terms of GFLOPs/Watt. For application areas with highest demands, for instance high performance computing, data warehousing and high performance analytics, accelerators like GPUs or Intel’s MICs are distributed throughout the cluster. Since current analyses and predictions show that data movement...
Manycore architecture systems include a large number of processing elements to improve performance while respecting power constraints. Accelerating heterogeneous manycore computing elements involves a huge amount of memory copying, computation, and thread management. Applications of manycore architectures range from desktop computers to warehouse-scale computers. In this paper, the state-of-the-art trends...
Many GPU applications perform data transfers to and from GPU memory at regular intervals, for example because the data does not fit into GPU memory, or because of inter-node communication at the end of each time step. Overlapping GPU computation with CPU-GPU communication can reduce the cost of moving data. Several different techniques exist for transferring data to and from GPU memory and for overlapping...
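The overlap pattern can be sketched host-side as a two-stage pipeline: while chunk i is being processed, chunk i+1 is already in flight. This is a conceptual sketch, not one of the paper's techniques; on a real GPU the same shape maps to pinned host buffers, `cudaMemcpyAsync`, and one CUDA stream per buffer, whereas here a one-worker thread pool stands in for the asynchronous copy engine, and `transfer`/`compute` are hypothetical stand-ins.

```python
# Two-stage pipeline sketch of copy/compute overlap: the "transfer"
# of the next chunk runs on a worker thread while the current chunk
# is processed. On a GPU this maps to pinned buffers, cudaMemcpyAsync,
# and per-buffer CUDA streams.
from concurrent.futures import ThreadPoolExecutor

def transfer(chunk):   # stand-in for a host-to-device copy
    return list(chunk)

def compute(chunk):    # stand-in for the GPU kernel
    return [x * 2 for x in chunk]

def pipelined(chunks):
    if not chunks:
        return []
    results = []
    with ThreadPoolExecutor(max_workers=1) as copier:
        pending = copier.submit(transfer, chunks[0])
        for nxt in chunks[1:]:
            ready = pending.result()                 # wait for current copy
            pending = copier.submit(transfer, nxt)   # start next copy early
            results.extend(compute(ready))           # overlaps with the copy
        results.extend(compute(pending.result()))    # drain the last chunk
    return results
```

The key line is the `submit` issued before `compute`: the next transfer is launched first, so copy and compute proceed concurrently instead of strictly alternating.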
Neural network simulators that take into account the spiking behavior of neurons are useful for studying brain mechanisms and for engineering applications. Spiking neural network (SNN) simulators have been traditionally simulated on large-scale clusters, super-computers, or on dedicated hardware architectures. Alternatively, graphics processing units (GPUs) can provide a low-cost, programmable, and...
This article consists of a collection of slides from the author's conference presentation on NVIDIA's GeForce 8800 GTX family of products. Some of the specific topics discussed include: the special features, system specifications, and system design for these products; GPU computing capabilities; system architectures; applications for use; platforms supported; processing capabilities; memory capabilities;...