The GPU's SIMD architecture is a double-edged sword when confronting parallel tasks with control-flow divergence. On the one hand, it provides a high-performance yet power-efficient platform that accelerates applications via massive parallelism; on the other hand, irregularities induce inefficiencies due to the warp's lockstep traversal of all diverging execution paths. In this work, we present a software...
GPUs have become prevalent and more general purpose, but GPU programming remains challenging and time-consuming for the majority of programmers. In addition, it is not always clear which codes will benefit from being ported to the GPU. Therefore, having a tool that estimates GPU performance for a piece of code before a GPU implementation is written is highly desirable. To this end, we propose Cross-Architecture...
Supporting dynamic parallelism is important for GPUs to benefit a broad range of applications. There are currently two fundamental ways for programs to exploit dynamic parallelism on GPUs: a software-based approach with software-managed worklists, and a hardware-based approach through dynamic subkernel launches. Neither is satisfactory. The former is complicated to program and is often subject to some...
Porting CUDA programs to other heterogeneous and many-core platforms, especially native processors, is valuable for extending the range of CUDA applications, exploiting the many cores of the target platform, and supporting national industries. Traditional binary translation techniques are not adequate for this task. From the standpoint of software reverse engineering, it is feasible to design a new migration...
Modern video cameras normally only capture a single color per pixel, commonly arranged in a Bayer pattern. This means that we must restore the missing color channels in the image or the video frame in post-processing, a process referred to as debayering. In a live video scenario, this operation must be performed efficiently in order to output each frame in real-time, while also yielding acceptable...
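For intuition, a naive debayering of an RGGB mosaic can reconstruct each 2x2 cell by taking that cell's R and B samples directly and averaging its two greens. This is a toy, half-resolution sketch for illustration only, far simpler than the quality- and throughput-focused methods such work targets:

```python
def debayer_rggb(raw):
    """Naive demosaic of an RGGB Bayer mosaic (2D list of samples):

        R G     Each 2x2 cell yields one RGB pixel, so the output
        G B     is half resolution in each dimension.
    """
    h, w = len(raw), len(raw[0])
    out = []
    for y in range(0, h, 2):
        row = []
        for x in range(0, w, 2):
            r = raw[y][x]
            g = (raw[y][x + 1] + raw[y + 1][x]) / 2  # average the two greens
            b = raw[y + 1][x + 1]
            row.append((r, g, b))
        out.append(row)
    return out
```

Real-time pipelines instead interpolate the missing channels at full resolution (e.g. bilinearly or edge-aware), which maps naturally onto one GPU thread per output pixel.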
In this paper, we propose a high-throughput parallel long-integer multiplication algorithm for parallel workstations. Among integer arithmetic operations, long-integer multiplication is the most time-consuming and most critical one. Public-key cryptosystems such as RSA and Diffie-Hellman require long-integer multiplication, and the operation is performed heavily for the...
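As a rough illustration of why this operation dominates, here is a minimal schoolbook multiplication over base-2^32 limbs (a hypothetical sequential sketch, not the paper's parallel algorithm): it performs O(n·m) limb products, and it is this product loop that parallel formulations distribute across threads.

```python
def limb_mul(a, b, base=2**32):
    """Schoolbook multiplication of two little-endian limb arrays.

    Computes len(a) * len(b) partial products with carry propagation;
    returns the product as a little-endian limb array.
    """
    out = [0] * (len(a) + len(b))
    for i, ai in enumerate(a):
        carry = 0
        for j, bj in enumerate(b):
            cur = out[i + j] + ai * bj + carry
            out[i + j] = cur % base
            carry = cur // base
        out[i + len(b)] += carry
    return out

def to_int(limbs, base=2**32):
    """Interpret a little-endian limb array as a Python integer."""
    return sum(d * base**i for i, d in enumerate(limbs))
```

For RSA-sized operands (thousands of bits, i.e. dozens of limbs), this quadratic inner loop is exactly where the bulk of the cycles go.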
HEVC (High Efficiency Video Coding) is the newest video compression standard. Compared with previous standards, its coding efficiency is greatly improved at the cost of much higher codec complexity. Consequently, many efforts improve the HEVC algorithm at both the hardware and software levels. For the IQ/IT (inverse quantization/inverse transform) part, HEVC processes TU blocks one by one. And...
In this paper, we design and optimize an ultrasound B-mode imaging pipeline, including a computationally demanding beamformer, on a commercial GPU. For performance optimization, we explore the design space spanned by different memory types, instruction-scheduling and thread-mapping strategies, etc. Then, with the developed B-mode imaging code, we conduct performance evaluations on various...
Heterogeneous systems consisting of multiple CPUs and GPUs are increasingly attractive as platforms for high-performance computing. Such platforms are usually programmed in OpenCL, which provides program portability by allowing the same program to execute on different types of device. As such systems become more mainstream, they will move from application-dedicated devices to platforms that need...
In this paper, we introduce a new Java-based parallel GPGPU simulator, GpuTejas. GpuTejas is a fast trace-driven simulator that uses relaxed synchronization and non-blocking data structures to derive its speedups. It also introduces a novel scheduling and partitioning scheme for parallelizing a GPU simulator. We evaluate the performance of our simulator with a set of Rodinia benchmarks. We...
Modern power systems analysis becomes more challenging due to increasing complexity, greater interconnectivity of elements and sub-systems, and the use of alternative generation systems, among other factors. To conduct these analyses efficiently, it is necessary to use modern computational and numerical techniques, such as parallel processing and fast periodic steady-state solution techniques...
Graphics Processing Units (GPU) have been used extensively for accelerating parallelizable applications in general, and scientific computations in particular. Stencil based algorithms are used intensively in various research areas and represent good candidates for GPU based acceleration. Since scientific computations have high accuracy requirements, herein we focus on stencil based double precision...
In traditional link-level simulation, the multiple-input multiple-output (MIMO) channel model is one of the most time-consuming modules, and more realistic geometry-based channel models consume even more time. In this paper, we propose an efficient simulator implementation of the geometry-based spatial channel model (SCM) on a graphics processing unit (GPU). We first analyze the potential parallelism...
An increasing number of MPI applications are being ported to take advantage of the compute power offered by GPUs. Data movement continues to be the major bottleneck on GPU clusters, more so when data is non-contiguous, which is common in scientific applications. The existing techniques of optimizing MPI data type processing, to improve performance of non-contiguous data movement, handle only certain...
Cyber-physical systems (CPS) must perform complex algorithms at very high speed to monitor and control complex real-world phenomena. The GPU, with its large number of cores and extremely high degree of parallelism, promises better computation if the data parallelism often found in real-world CPS scenarios can be exploited. Nevertheless, its performance is limited by the latency incurred when data are...
There is a stage in the GPU computing pipeline where a grid of thread-blocks, or space of computation, is mapped to the problem domain. Normally, the space of computation is a k-dimensional bounding box (BB) that covers a k-dimensional problem. Threads that fall inside the problem domain perform computations and threads that fall outside are discarded, all happening at runtime. For problems with non-square...
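The bounding-box mapping described above can be modeled on the host side. The following hypothetical Python sketch (names and domain are illustrative, not from the abstract) flattens a linear thread id into (x, y) coordinates over a square bounding box and discards threads outside a triangular, non-square domain, mimicking the runtime branch a GPU kernel would take:

```python
def bb_map(tid, width):
    """Map a linear thread id to (x, y) inside a width x width bounding box."""
    return tid % width, tid // width

def run_grid(width, in_domain, work):
    """Simulate launching width*width threads over the bounding box;
    only threads that fall inside the problem domain do useful work."""
    results = {}
    for tid in range(width * width):
        x, y = bb_map(tid, width)
        if in_domain(x, y):          # threads outside the domain are discarded
            results[(x, y)] = work(x, y)
    return results

# Lower-triangular domain: nearly half the bounding-box threads are wasted.
tri = run_grid(4, lambda x, y: x <= y, lambda x, y: x + y)
```

For the 4x4 box above, only 10 of 16 simulated threads do work; alternative space-of-computation mappings aim to shrink that wasted fraction.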
We consider two recently introduced massively multi-core architectures designed for high-performance computing: the Xeon Phi coprocessor and the Kepler graphics processor. We discuss the OpenCL programming model as one that allows us to view the platforms in a unified way and to construct efficient algorithms for both of them. As an example application we investigate a typical algorithm employed in finite...
Sparse matrix-vector and multi-vector multiplications (SpMV and SpMM) are performance-bottleneck operations in numerous HPC applications. A variety of SpMV GPU kernels using different matrix storage formats have been developed to accelerate these applications. Unlike SpMV, where matrix elements are accessed only once, multiplying by k vectors requires accessing matrix elements k times. In this paper...
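The access pattern described above is easy to see in a CSR-format SpMV (a sequential Python sketch for clarity; GPU kernels parallelize the outer row loop). Implementing SpMM as k independent SpMV calls, as the naive `csr_spmm` below does, re-reads every matrix element once per vector:

```python
def csr_spmv(row_ptr, col_idx, vals, x):
    """y = A @ x with A in CSR format: each matrix element is read once."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += vals[k] * x[col_idx[k]]
        y.append(acc)
    return y

def csr_spmm(row_ptr, col_idx, vals, X):
    """Y = A @ X for k column vectors (X given as a list of rows).
    Naively one SpMV per column, so each element of A is read k times."""
    return [csr_spmv(row_ptr, col_idx, vals, col) for col in zip(*X)]
```

Optimized SpMM kernels instead load each matrix element once and apply it to all k vector entries, trading the repeated reads for register or shared-memory reuse.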
Graphics processing units (GPUs) provide a low-cost platform for accelerating high-performance computations. New programming languages, such as CUDA and OpenCL, make GPU programming attractive to programmers. However, programming GPUs is still a cumbersome task for two reasons: tedious performance optimization and lack of portability. First, optimizing an algorithm for a specific GPU is a time-consuming...
Gaussian Elimination is commonly used to solve dense linear systems in scientific models. In a large number of applications, a need arises to solve many small size problems, instead of few large linear systems. The size of each of these small linear systems depends on the number of the ordinary differential equations (ODEs) used in the model, and can be on the order of hundreds of unknowns. To efficiently...
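For reference, the standard Gaussian elimination with partial pivoting that such batched small-system solvers parallelize looks as follows. This is a plain sequential sketch of the textbook algorithm, not the GPU kernel the abstract describes:

```python
def gauss_solve(A, b):
    """Solve Ax = b for a small dense system by Gaussian elimination
    with partial pivoting. A is a list of rows; b is the RHS vector."""
    n = len(A)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]  # augmented matrix [A | b]
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(M[r][k]))  # partial pivot
        M[k], M[p] = M[p], M[k]
        for r in range(k + 1, n):                 # eliminate below the pivot
            f = M[r][k] / M[k][k]
            for c in range(k, n + 1):
                M[r][c] -= f * M[k][c]
    x = [0.0] * n
    for k in range(n - 1, -1, -1):                # back substitution
        x[k] = (M[k][n] - sum(M[k][c] * x[c] for c in range(k + 1, n))) / M[k][k]
    return x
```

With systems of only hundreds of unknowns, one such solve cannot saturate a GPU on its own, which is why batched formulations assign many independent small systems to the device at once.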