We present parallel algorithms for Binary Decision Diagram (BDD) manipulation optimized for efficient execution on Graphics Processing Units (GPUs). Compared to a sequential CPU-based BDD package with the same capabilities, our GPU implementation achieves at least 5 orders of magnitude speedup. To the best of our knowledge, this is the first work on using GPUs to accelerate a BDD package.
The scheduling algorithm of Linux operating systems has to fulfill several conflicting objectives: fast process response time, high throughput for background jobs, avoidance of process starvation, reconciliation of the needs of low- and high-priority processes, etc. The set of rules used to determine when and how to select a new process to run is called the scheduling policy. The current Linux kernel uses...
Today's hardware diversity exacerbates the need for optimizing compilers. A problem that arises when exploiting hardware accelerators (FPGA, GPU, dedicated boards) is how to automatically perform kernel/function offloading or outlining (as opposed to function inlining). The principle is to outsource part of the computation (the kernel to be performed on the accelerator) to a more efficient but more...
We present AMGE, a programming framework and runtime system to decompose data and GPU kernels and execute them on multiple GPUs concurrently. AMGE exploits the remote memory access capability of recent GPUs to guarantee data accessibility regardless of its physical location, thus allowing AMGE to safely decompose and distribute arrays across GPU memories. AMGE also includes a compiler analysis to...
The problem of obtaining high computational throughput from sparse matrix multiple-vector multiplication routines is considered. Current sparse matrix formats and algorithms have high bandwidth requirements and poor reuse of cache and register loaded entries, which restrict their performance. We propose the mapped blocked row format: a bitmapped sparse matrix format that stores entries as blocks without...
This paper first briefly introduces the principle of ortho-rectification of line-array images, then designs a GPU-based parallel processing method and proposes a shared-memory optimization strategy for POS data to avoid the performance bottleneck caused by frequent accesses to data in global memory. Finally, it conducts a system experiment using ADS40 images on a Tesla C2050 GPU and validates the parallel processing...
The safety and security of the kernel are key to the security of an embedded system, and in safety-critical embedded applications the kernel must even be formally verified. In this paper we introduce the design and implementation of a microkernel model based on spatial-temporal isolation in Haskell, a functional language. This not only could significantly improve the...
In this paper, we propose an implementation of a parallel one-dimensional fast Fourier transform (FFT) on GPU clusters. This implementation is based on the six-step FFT algorithm. Because the parallel one-dimensional FFT requires three all-to-all communications, one goal for parallel FFTs on GPU clusters is to minimize the PCI Express transfer time and the MPI communication time. We demonstrate that...
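The six-step decomposition named in the abstract above can be sketched on a single machine as follows. This is a minimal pure-Python illustration, not the authors' GPU-cluster implementation: the function names (`dft`, `transpose`, `six_step_fft`) and the factorization parameters `n1` and `n2` are assumptions chosen for the example, and a naive DFT stands in for the short row FFTs. In a distributed setting, the three transposes are where the all-to-all communications would occur.

```python
import cmath


def dft(v):
    """Naive O(n^2) DFT, standing in for the short row FFTs."""
    n = len(v)
    return [
        sum(v[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def six_step_fft(x, n1, n2):
    """Six-step FFT of a length n1*n2 sequence (hypothetical helper)."""
    n = n1 * n2
    assert len(x) == n
    # View x as an n2 x n1 row-major matrix: rows[j2][j1] = x[j1 + j2*n1].
    rows = [x[j2 * n1:(j2 + 1) * n1] for j2 in range(n2)]
    # Step 1: transpose to n1 x n2.
    t = transpose(rows)
    # Step 2: n1 independent transforms of length n2.
    a = [dft(r) for r in t]
    # Step 3: multiply element (j1, k2) by the twiddle factor w_N^(j1*k2).
    a = [
        [a[j1][k2] * cmath.exp(-2j * cmath.pi * j1 * k2 / n) for k2 in range(n2)]
        for j1 in range(n1)
    ]
    # Step 4: transpose to n2 x n1.
    b = transpose(a)
    # Step 5: n2 independent transforms of length n1.
    c = [dft(r) for r in b]
    # Step 6: transpose to n1 x n2 and flatten, so that X[k2 + k1*n2] = d[k1][k2].
    d = transpose(c)
    return [value for row in d for value in row]
```

Calling `six_step_fft(x, 3, 4)` on a 12-element sequence should agree with `dft(x)` up to floating-point rounding.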
Heterogeneous computing utilizing both CPU and FPGA requires access to data in the main memory from both devices. While a typical system relies on software executing on the CPU to orchestrate all data movements between the FPGA and the main memory, our demo presents a complementary FPGA-centric approach that allows gateware to directly access the virtual memory space as part of the executing process...
This paper presents simple and efficient optimization techniques for an OpenCL compiler that targets reconfigurable processors. The target architecture consists of a general-purpose processor core and an embedded reconfigurable accelerator with vector units. The accelerator is able to switch its architecture between the VLIW mode and the Coarse Grained Reconfigurable Array (CGRA) mode to achieve high...
The skeletons of the objects in 3D images can be extracted by using 3D image thinning. The application of 3D image thinning for image analysis is hampered by its considerable computation time. By employing the graphics processing unit (GPU), which offers tremendous computing power at an incomparable performance-to-cost ratio, the calculation of 3D image thinning can be accelerated. In this paper,...
Colour Filter Array (CFA) demosaicking is a process to interpolate missing colour values in order to produce a full colour image when a single image sensor is used. For smooth regions, a higher order of interpolation will usually achieve higher accuracy. However, when there is a colour edge, a lower order of interpolation is desirable as it will avoid interpolation across an edge without blurring it...
Code optimization improves program performance through program analysis and program transformation, which transforms the program into an equivalent form. The basis of optimization is data flow analysis and control flow analysis. The paper first analyzes the characteristics of Mgrid and the kernel Resid routine, including architecture analysis, data flow analysis, and dependence analysis, which is the...
Breadth-First Search (BFS) is a basis for many graph traversal and analysis algorithms. In this paper, we present a direction-optimizing BFS implementation on CPU-GPU heterogeneous platforms to fully exploit the computing power of both the multi-core CPU and GPU. For each level of the BFS algorithm, we dynamically choose the best implementation from: a sequential top-down execution on CPU, a parallel...
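The per-level choice of traversal direction described above can be sketched as follows. This is a sequential, single-device illustration under stated assumptions, not the paper's CPU-GPU implementation: the graph is assumed undirected and given as adjacency lists, and the function name `bfs_direction_optimizing` and the switching threshold `alpha` are hypothetical (the abstract does not say how the best implementation is chosen).

```python
def bfs_direction_optimizing(adj, source, alpha=0.25):
    """Return BFS distances from source; -1 marks unreachable vertices.

    adj   : adjacency lists of an undirected graph (assumption)
    alpha : hypothetical threshold; a level runs bottom-up when the
            frontier holds more than alpha * n vertices
    """
    n = len(adj)
    dist = [-1] * n
    dist[source] = 0
    frontier = {source}
    level = 0
    while frontier:
        level += 1
        next_frontier = set()
        if len(frontier) > alpha * n:
            # Bottom-up: each unvisited vertex looks for a parent in the frontier
            # and stops scanning its neighbours as soon as one is found.
            for v in range(n):
                if dist[v] == -1:
                    for u in adj[v]:
                        if u in frontier:
                            dist[v] = level
                            next_frontier.add(v)
                            break
        else:
            # Top-down: each frontier vertex pushes to its unvisited neighbours.
            for u in frontier:
                for v in adj[u]:
                    if dist[v] == -1:
                        dist[v] = level
                        next_frontier.add(v)
        frontier = next_frontier
    return dist
```

Both directions compute the same distances; only the amount of edge inspection per level differs, which is why a per-level choice can pay off when the frontier is large.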
In this paper, we propose a new re-ordering technique for improving the performance of Sparse Matrix Vector Multiplication (SpMV) for systems supported with Graphics Processing Units (GPUs). We conducted the test by applying SpMV on solver based applications which are widely used in the domain of engineering and science. We studied and analyzed the existing representations and storage structures of...
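For orientation, the SpMV kernel that such re-ordering techniques target can be sketched for the common compressed sparse row (CSR) layout. This is a generic baseline, not the re-ordering proposed in the paper, and CSR is assumed here only as one example of an existing storage structure; the function and parameter names are illustrative.

```python
def csr_spmv(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a matrix A stored in CSR form.

    values  : nonzero entries, row by row
    col_idx : column index of each entry in values
    row_ptr : row r occupies values[row_ptr[r]:row_ptr[r + 1]]
    """
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            # The indirect access x[col_idx[k]] is the irregular memory
            # pattern that row or column re-ordering tries to make more local.
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y
```

On a GPU, one row (or group of rows) is typically assigned to a thread or warp, so the order of rows and columns influences both load balance and memory coalescing.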
Hardware-supported multithreading can mask memory latency by switching the execution to ready threads, which is particularly effective on irregular applications. FPGAs provide an opportunity to have multithreaded data paths customized to each individual application. In this paper we describe the compiler generation of these hardware structures from a C subset targeting a Convey HC-2ex machine. We describe...
We propose a numerical implementation based on a Graphics Processing Unit (GPU) for the acceleration of the execution time of the Lattice Boltzmann Method (LBM). The study focuses on the application of the LBM for patient-specific blood flow computations, and hence, to obtain higher accuracy, double precision computations are employed. The LBM specific operations are grouped into two kernels, whereas...
A domain-specific processor for energy-efficient execution of Recognition and Data Mining (RM) workloads is presented. The processor consists of a 2-D array of processing elements and a streaming memory hierarchy and interconnect network that are customized to efficiently execute dominant computational kernels (matrix-vector multiplication, vector dot product, L1 norm, and L2 norm) from a wide range...
Stochastic Rotation Dynamics (SRD) is a novel particle-based simulation method that can be used to model complex fluids [1], [2], such as binary and ternary mixtures [3], and polymer solutions [4]-[6], in either two or three dimensions. Although SRD is efficient compared to traditional methods, it is still computationally expensive for large system sizes, e.g. when using a large array of particles...
We present CrowdCL, an open-source framework for the rapid development of volunteer computing and OpenCL applications on the web. Drawing inspiration from existing GPU libraries like PyCUDA, CrowdCL provides an abstraction layer for WebCL aimed at reducing boilerplate and improving code readability. CrowdCL also provides developers with a framework to easily run computations in the background of a...