The Infona portal uses cookies, i.e. strings of text saved by a browser on the user's device. The portal can access those files and use them to remember the user's data, such as their chosen settings (screen view, interface language, etc.), or their login data. By using the Infona portal the user accepts automatic saving and using this information for portal operation purposes. More information on the subject can be found in the Privacy Policy and Terms of Service. By closing this window the user confirms that they have read the information on cookie usage, and they accept the privacy policy and the way cookies are used by the portal. You can change the cookie settings in your browser.
The changing times have caused the requirements to change, causing a revolution in the field of parallel computing. The emergence of parallel computing as a necessity has boosted the use of GPGPUs for this purpose. With such an emergence comes a drastic improvement in many real world applications of GPGPUs as well. In this paper we discuss about GPGPUs, their evolution, and their contribution to many...
Training dictionaries for sparse representations is a time consuming task, due to the large size of the data involved and to the complexity of the training algorithms. We investigate a parallel version of the approximate K-SVD algorithm, where multiple atoms are updated simultaneously, and implement it using OpenCL, for execution on graphics processing units (GPU). This not only allows reducing the...
As today's computer architectures are becoming more and more heterogeneous, a plethora of options including CPUs, GPUs, DSPs, reconfigurable logic (FPGAs), and other application-specific processors come into consideration for close-to-sensor processing. Especially, in the domain of image processing on mobile devices, among numerous design challenges, a very stringent energy budget is of utmost importance,...
Graphics processing units are being widely used in embedded systems as they can achieve high performance and energy efficiency. In such systems, the problem of computation and data mapping for multiple applications while minimizing the completion time is quite challenging due to a large size of the policy space, including heterogeneous application characteristics, complex application structure, data...
Heterogeneous systems equipped with traditional processors (CPUs) and graphics processing units (GPUs) have enabled processing large data sets. With new programming models, such as OpenCL and CUDA, programmers are encouraged to offload data parallel workloads to GPUs as much as possible in order to fully utilize the available resources. Unfortunately, offloading work is strictly limited by the size...
Todays' hardware diversity exacerbates the need for optimizing compilers. A problem that arises when exploiting hardware accelerators (FPGA, GPU, dedicated boards) is how to automatically perform kernel/function offloading or outlining (as opposed to function inlining). The principle is to outsource part of the computation (the kernel to be performed on the accelerator) to a more efficient but more...
GT5D is a nuclear fusion simulation program which aims to analyze the turbulence phenomena in tokamak plasma. In this research, we optimize it for GPU clusters with multiple GPUs on a node. Based on the profile result of GT5D on a CPU node, we decide to offload the whole of the time development part of the program to GPUs except MPI communication. We achieved 3.37 times faster performance in maximum...
In this paper we propose a highly optimized parallel and distributed BFS on GPU for Graph500 benchmark. We evaluate the performance of our implementation using TSUBAME2.0 supercomputer. We achieve 317 GTEPS (billion traversed edges per second) with scale 35 (a large graph with 34.4 billion vertices and 550 billion edges) using 1366 nodes and 4096 GPUs. With this score, TSUBAME2.0 supercomputer is...
The proliferation of heterogeneous computing systems presents the parallel computing community with the challenge of porting legacy and emerging applications to multiple processors with diverse programming abstractions. OpenCL is a vendor-agnostic and industry-supported programming model that offers code portability on heterogeneous platforms, allowing applications to be developed once and deployed...
We present a fully automated approach to project the relative performance of an OpenCL program over different GPUs. Performance projections can be made within a small amount of time, and the projection overhead stays relatively constant with the input data size. As a result, the technique can help runtime tools make dynamic decisions about which GPU would run faster for a given kernel. Usage cases...
In this paper, we describe a novel technique to optimize longest common subsequence (LCS) algorithm for one-to-many matching problem on GPUs by transforming the computation into bit-wise operations and a post-processing step. The former can be highly optimized and achieves more than a trillion operations (cell updates) per second (CUPS)-a first for LCS algorithms. The latter is more efficiently done...
This paper first briefly introduces the principle of Ortho-Rectification of line-array image, then designed a parallel processing method based on GPU and proposes a shared memory optimizing strategy of POS data to avoid performance bottle-neck due frequently accessing data in global memory, at last do a system experiment using ADS40 image based on Tesla C2050 GPU and invalidate the parallel processing...
Graphics Processing Units (GPUs) have become a popular choice for general-purpose high-performance computing. Encryption and decryption algorithms such as the Advanced Encryption Standard (AES) have been implemented on GPUs to gain significant speedup. However, the security of the GPU architecture is not well studied, making it potentially risky to offload sensitive computation to GPUs. In this paper,...
In this paper, we present a high throughput and low latency LDPC (low-density parity-check) decoder implementation on GPUs (graphics processing units). The existing GPU-based LDPC decoder implementations suffer from low throughput and long latency, which prevent them from being used in practical SDR (software-defined radio) systems. To overcome this problem, we present optimization techniques for...
In this paper, we present design and implementation of a fast runtime visualizer for a GPU-based 3D-FDTD electromagnetic simulation. We focus on improving the productivity of simulator development without compromising simulation performance. In order to keep the portability, we implemented a visualizer with the MVC model, where simulation kernels and visualization process were completely separated...
Feature detection and extraction are essential in computer vision applications such as image matching and object recognition. The Scale-Invariant Feature Transform (SIFT) algorithm is one of the most robust approaches to detect and extract distinctive invariant features from images. However, high computational complexity makes it difficult to apply the SIFT algorithm to mobile applications. Recent...
This paper introduce a parallel computing method to improve the efficiency of prediction of membrane protein types by SVM. With early hardware limitations of the GPU(lack of synchronization primitives and limited memory caching mechanisms)can make GPU-based computation inefficient. We present this efficient method for prediction of membrane protein type for Intel(R) Core(TM) i3–3110m quad-core and...
Scale Invariance Feature Transform (SIFT) is quite suitable for image matching because of its invariance to image scaling, rotation and slight changes in illumination or viewpoint. However, due to high computation complexity it's technically challenging to deploy SIFT in real time application situations. To address this problem, we propose CLSIFT, an OpenCL based highly speeded up and performance...
The skeletons of the objects in 3D images can be extracted by using 3D image thinning. The application of 3D image thinning for image analysis is hampered by its considerable computation time. By employing the graphics processing unit (GPU), which has tremendous powerful computing power at an incomparable performance-to-cost ratio, the calculation of 3D image thinning can be accelerated. In this paper,...
Matrix multiplication is one of the basic operations in linear algebra that mostly used in computer science. For ages, applying naive algorithm to complete it has done it, and it has a standard complexity O(n3). Many researches are concluded to find more efficient and effective algorithm to process this operation, and one day Strassen has one that overcome the naive algorithm complexity with only...
Set the date range to filter the displayed results. You can set a starting date, ending date or both. You can enter the dates manually or choose them from the calendar.