Search results

chapter

Efficient parallel GPU algorithms for BDD manipulation

Miroslav N. Velev, Ping Gao

2014 19th Asia and South Pacific Design Automation Conference (ASP-DAC) > 750 - 755

2014 19th Asia and South Pacific Design Automation Conference (ASP-DAC)

We present parallel algorithms for Binary Decision Diagram (BDD) manipulation optimized for efficient execution on Graphics Processing Units (GPUs). Compared to a sequential CPU-based BDD package with the same capabilities, our GPU implementation achieves at least 5 orders of magnitude speedup. To the best of our knowledge, this is the first work on using GPUs to accelerate a BDD package.

chapter

On the Fairness of Linux O(1) Scheduler

Jyothish Jose, Oravanpadath Sujisha, Malayamparambath Gilesh, Thayyil Bindima

2014 5th International Conference on Intelligent Systems, Modelling and Simulation > 668 - 674

2014 5th International Conference on Intelligent Systems, Modelling and Simulation (ISMS)

The scheduling algorithm of Linux operating systems has to fulfill several conflicting objectives: fast process response time, higher throughput for background jobs, avoidance of process starvation, reconciliation of the needs of low- and high-priority processes, etc. The set of rules used to determine when and how to select a new process to run is called the scheduling policy. The current Linux kernel uses...

chapter

Data-reuse optimizations for pipelined tiling with parametric tile sizes

Alexandre Isoard

2014 23rd International Conference on Parallel Architecture and Compilation (PACT) > 509 - 510

2014 23rd International Conference on Parallel Architecture and Compilation (PACT)

Today's hardware diversity exacerbates the need for optimizing compilers. A problem that arises when exploiting hardware accelerators (FPGA, GPU, dedicated boards) is how to automatically perform kernel/function offloading or outlining (as opposed to function inlining). The principle is to outsource part of the computation (the kernel to be performed on the accelerator) to a more efficient but more...

chapter

Automatic execution of single-GPU computations across multiple GPUs

Javier Cabezas, Lluis Vilanova, Isaac Gelado, Thomas B. Jablin, more

2014 23rd International Conference on Parallel Architecture and Compilation (PACT) > 467 - 468

2014 23rd International Conference on Parallel Architecture and Compilation (PACT)

We present AMGE, a programming framework and runtime system to decompose data and GPU kernels and execute them on multiple GPUs concurrently. AMGE exploits the remote memory access capability of recent GPUs to guarantee data accessibility regardless of its physical location, thus allowing AMGE to safely decompose and distribute arrays across GPU memories. AMGE also includes a compiler analysis to...

chapter

Efficient sparse matrix multiple-vector multiplication using a bitmapped format

Ramaseshan Kannan

20th Annual International Conference on High Performance Computing > 286 - 294

2013 20th International Conference on High Performance Computing (HiPC)

The problem of obtaining high computational throughput from sparse matrix multiple-vector multiplication routines is considered. Current sparse matrix formats and algorithms have high bandwidth requirements and poor reuse of cache and register loaded entries, which restrict their performance. We propose the mapped blocked row format: a bitmapped sparse matrix format that stores entries as blocks without...
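
To make the idea of a bitmapped block format concrete, here is a minimal sketch of multiplying a sparse matrix by several vectors at once. The layout is an assumption made for illustration and is not taken from the paper: each nonzero block is stored as a hypothetical tuple (starting column, bitmap of occupied positions, packed nonzero values), so zeros inside a block cost one bit rather than one stored entry. The function names and the block size `b` are likewise illustrative.

```python
def to_bitmap_blocks(dense, b=2):
    """Pack a dense matrix (list of rows) into rows of bitmapped b x b blocks.

    Hypothetical layout: each block is (col_start, bitmap, values), where bit
    i*b + j of bitmap is set when entry (i, j) of the block is nonzero and
    values holds only those nonzeros, in bit order.
    """
    rows, cols = len(dense), len(dense[0])
    block_rows = []
    for row_start in range(0, rows, b):
        blocks = []
        for col_start in range(0, cols, b):
            bitmap, values = 0, []
            for i in range(b):
                for j in range(b):
                    r, c = row_start + i, col_start + j
                    if r < rows and c < cols and dense[r][c] != 0:
                        bitmap |= 1 << (i * b + j)
                        values.append(dense[r][c])
            if bitmap:  # blocks that are entirely zero are not stored
                blocks.append((col_start, bitmap, values))
        block_rows.append(blocks)
    return block_rows


def spmm(block_rows, x, rows, b=2):
    """Compute Y = A @ X, where X is a list of rows with k columns (k vectors)."""
    k = len(x[0])
    y = [[0.0] * k for _ in range(rows)]
    for block_index, blocks in enumerate(block_rows):
        for col_start, bitmap, values in blocks:
            pos = 0  # index of the next packed value in this block
            for bit in range(b * b):
                if (bitmap >> bit) & 1:
                    i, j = divmod(bit, b)
                    a = values[pos]
                    pos += 1
                    y_row = y[block_index * b + i]
                    x_row = x[col_start + j]
                    # One loaded matrix entry is reused across all k vectors.
                    for t in range(k):
                        y_row[t] += a * x_row[t]
    return y
```

The inner loop over `t` is where the multiple-vector case pays off in this sketch: each matrix entry is read once and applied to every right-hand side.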

chapter

The Study of Parallel Ortho-rectification Method of Line-Array Image Based on GPU

Yuxia Yang, Zhaohua Liu, Jingyu Yang

2013 International Conference on Computer Sciences and Applications > 615 - 618

2013 International Conference on Computer Sciences and Applications (CSA)

This paper first briefly introduces the principle of ortho-rectification of line-array images, then designs a GPU-based parallel processing method and proposes a shared-memory optimization strategy for POS data to avoid the performance bottleneck caused by frequent accesses to global memory. Finally, a system experiment is run on ADS40 imagery using a Tesla C2050 GPU to validate the parallel processing...

chapter

A Model of Microkernel Based on Spatial-Temporal Isolation in Haskell

Fan Zhang, Xiaopeng Wang

2013 International Conference on Computer Sciences and Applications > 564 - 569

2013 International Conference on Computer Sciences and Applications (CSA)

The safety and security of the kernel is the key to the security of an embedded system, and in safety-critical embedded applications the kernel even has to be formally verified. In this paper we introduce the design and implementation of a microkernel model based on spatial-temporal isolation in Haskell, a functional language. This not only could significantly improve the...

chapter

Implementation of Parallel 1-D FFT on GPU Clusters

Daisuke Takahashi

2013 IEEE 16th International Conference on Computational Science and Engineering > 174 - 180

2013 IEEE 16th International Conference on Computational Science and Engineering (CSE)

In this paper, we propose an implementation of a parallel one-dimensional fast Fourier transform (FFT) on GPU clusters. This implementation is based on the six-step FFT algorithm. Because the parallel one-dimensional FFT requires three all-to-all communications, one goal for parallel FFTs on GPU clusters is to minimize the PCI Express transfer time and the MPI communication time. We demonstrate that...
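
As a rough illustration of the six-step FFT structure mentioned in the abstract, the following single-process sketch factors a length-N transform as N = n1 * n2 and performs three transposes, two batches of short transforms, and one twiddle multiplication. On a cluster the three transposes are the steps that would correspond to all-to-all communication. Everything here is an assumption made for illustration: the function names are hypothetical, the short transforms use a naive DFT rather than an optimized FFT, and nothing is distributed or run on a GPU.

```python
import cmath


def dft(v):
    """Naive O(n^2) DFT, used here as a stand-in for the short FFTs."""
    n = len(v)
    return [
        sum(v[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def six_step_fft(x, n1, n2):
    """DFT of x (length n1 * n2) using the six-step decomposition."""
    n = n1 * n2
    assert len(x) == n
    # View x as an n1 x n2 row-major matrix: a[j1][j2] = x[j1 * n2 + j2].
    a = [list(x[r * n2:(r + 1) * n2]) for r in range(n1)]
    # Step 1: transpose to n2 x n1.
    b = transpose(a)
    # Step 2: n2 independent transforms of length n1.
    c = [dft(row) for row in b]
    # Step 3: multiply entry (j2, k1) by the twiddle factor exp(-2*pi*i*j2*k1/N).
    c = [
        [c[j2][k1] * cmath.exp(-2j * cmath.pi * j2 * k1 / n) for k1 in range(n1)]
        for j2 in range(n2)
    ]
    # Step 4: transpose to n1 x n2.
    d = transpose(c)
    # Step 5: n1 independent transforms of length n2.
    e = [dft(row) for row in d]
    # Step 6: transpose to n2 x n1; row-major order now gives X[k1 + k2 * n1].
    f = transpose(e)
    return [value for row in f for value in row]
```

Comparing `six_step_fft(x, 3, 4)` with `dft(x)` for a length-12 input is a quick way to check that the index mapping and the twiddle factors are consistent.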

chapter

Direct virtual memory access from FPGA for high-productivity heterogeneous computing

Ho-Cheung Ng, Yuk-Ming Choi, Hayden Kwok-Hay So

2013 International Conference on Field-Programmable Technology (FPT) > 458 - 461

2013 International Conference on Field-Programmable Technology (FPT)

Heterogeneous computing utilizing both CPU and FPGA requires access to data in the main memory from both devices. While a typical system relies on software executing on the CPU to orchestrate all data movements between the FPGA and the main memory, our demo presents a complementary FPGA-centric approach that allows gateware to directly access the virtual memory space as part of the executing process...

chapter

An OpenCL optimizing compiler for reconfigurable processors

Jeongho Nah, Jun Lee, Hongjune Kim, Jinseok Lee, more

2013 International Conference on Field-Programmable Technology (FPT) > 184 - 191

2013 International Conference on Field-Programmable Technology (FPT)

This paper presents simple and efficient optimization techniques for an OpenCL compiler that targets reconfigurable processors. The target architecture consists of a general-purpose processor core and an embedded reconfigurable accelerator with vector units. The accelerator is able to switch its architecture between the VLIW mode and the Coarse Grained Reconfigurable Array (CGRA) mode to achieve high...

chapter

GPU-Accelerated Parallel 3D Image Thinning

Bingfeng Hu, Xuan Yang

2013 IEEE 10th International Conference on High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing > 149 - 152

2013 IEEE International Conference on High Performance Computing and Communications (HPCC) & 2013 IEEE International Conference on Embedded and Ubiquitous Computing (EUC)

The skeletons of the objects in 3D images can be extracted by using 3D image thinning. The application of 3D image thinning for image analysis is hampered by its considerable computation time. By employing the graphics processing unit (GPU), which has tremendous computing power at an incomparable performance-to-cost ratio, the calculation of 3D image thinning can be accelerated. In this paper,...

chapter

Adaptive CFA Demosaicking Using Bilateral Filters for Colour Edge Preservation

Jim S. Jimmy Li, Sharmil Randhawa

2013 2nd IAPR Asian Conference on Pattern Recognition > 451 - 455

2013 2nd IAPR Asian Conference on Pattern Recognition (ACPR)

Colour Filter Array (CFA) demosaicking is a process to interpolate missing colour values in order to produce a full colour image when a single image sensor is used. For smooth regions, a higher order of interpolation will usually achieve higher accuracy. However, when there is a colour edge, a lower order of interpolation is desirable as it will avoid interpolation across an edge without blurring it...

chapter

Combining Program Analysis and Empirical Search to Optimize Programs

Pingjing Lu, Bao Li, Zhengbin Pang, Ying Zhang, more

2013 IEEE 10th International Conference on High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing > 1896 - 1901

2013 IEEE International Conference on High Performance Computing and Communications (HPCC) & 2013 IEEE International Conference on Embedded and Ubiquitous Computing (EUC)

Code optimization improves program performance through program analysis and program transformation, which transforms the program into an equivalent form. The basis of optimization is data flow analysis and control flow analysis. The paper first analyzes the characterization of Mgrid and the kernel Resid routine, including architecture analysis, data flow analysis, and dependence analysis, which is the...

chapter

Direction-Optimizing Breadth-First Search on CPU-GPU Heterogeneous Platforms

Dan Zou, Yong Dou, Qiang Wang, Jinbo Xu, more

2013 IEEE 10th International Conference on High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing > 1064 - 1069

2013 IEEE International Conference on High Performance Computing and Communications (HPCC) & 2013 IEEE International Conference on Embedded and Ubiquitous Computing (EUC)

Breadth-First Search (BFS) is a basis for many graph traversal and analysis algorithms. In this paper, we present a direction-optimizing BFS implementation on CPU-GPU heterogeneous platforms to fully exploit the computing power of both the multi-core CPU and GPU. For each level of the BFS algorithm, we dynamically choose the best implementation from: a sequential top-down execution on CPU, a parallel...
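
The per-level choice of direction can be sketched as follows. This is a single-threaded illustration of the general direction-optimizing idea, not the CPU-GPU implementation described in the paper: the switching heuristic, the `alpha` parameter, and the function name are assumptions, and the bottom-up step assumes an undirected graph given as symmetric adjacency lists.

```python
def direction_optimizing_bfs(adj, source, alpha=4):
    """Return BFS levels from source (-1 for unreachable vertices).

    adj is a list of adjacency lists for an undirected graph. At each level
    the search goes bottom-up when the frontier touches many edges relative
    to the edges of still-unvisited vertices (hypothetical heuristic), and
    top-down otherwise.
    """
    n = len(adj)
    dist = [-1] * n
    dist[source] = 0
    frontier = [source]
    level = 0
    while frontier:
        level += 1
        frontier_edges = sum(len(adj[u]) for u in frontier)
        unvisited_edges = sum(len(adj[v]) for v in range(n) if dist[v] < 0)
        next_frontier = []
        if frontier_edges * alpha > unvisited_edges:
            # Bottom-up: each unvisited vertex looks for any parent in the
            # frontier and stops at the first one it finds.
            in_frontier = set(frontier)
            for v in range(n):
                if dist[v] < 0:
                    for u in adj[v]:
                        if u in in_frontier:
                            dist[v] = level
                            next_frontier.append(v)
                            break
        else:
            # Top-down: each frontier vertex claims its unvisited neighbours.
            for u in frontier:
                for v in adj[u]:
                    if dist[v] < 0:
                        dist[v] = level
                        next_frontier.append(v)
        frontier = next_frontier
    return dist
```

Both directions produce the same levels; only the amount of edge inspection differs, which is why the choice can be made independently at every level.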

chapter

Comparing SpMV for solver applications

Rohit Patel, Vibha Patel, Bhavin Patel

2013 Nirma University International Conference on Engineering (NUiCONE) > 1 - 6

2013 Nirma University International Conference on Engineering (NUiCONE)

In this paper, we propose a new re-ordering technique for improving the performance of Sparse Matrix Vector Multiplication (SpMV) for systems supported with Graphics Processing Units (GPUs). We conducted the test by applying SpMV on solver based applications which are widely used in the domain of engineering and science. We studied and analyzed the existing representations and storage structures of...

chapter

Compiled multithreaded data paths on FPGAs for dynamic workloads

Robert J. Halstead, Walid Najjar

2013 International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES) > 1 - 10

2013 International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES)

Hardware supported multithreading can mask memory latency by switching the execution to ready threads, which is particularly effective on irregular applications. FPGAs provide an opportunity to have multithreaded data paths customized to each individual application. In this paper we describe the compiler generation of these hardware structures from a C subset targeting a Convey HC-2ex machine. We describe...

chapter

GPU accelerated blood flow computation using the Lattice Boltzmann Method

Cosmin Nita, Lucian Mihai Itu, Constantin Suciu

2013 IEEE High Performance Extreme Computing Conference (HPEC) > 1 - 6

2013 IEEE High Performance Extreme Computing Conference (HPEC)

We propose a numerical implementation based on a Graphics Processing Unit (GPU) to reduce the execution time of the Lattice Boltzmann Method (LBM). The study focuses on the application of the LBM for patient-specific blood flow computations, and hence, to obtain higher accuracy, double precision computations are employed. The LBM specific operations are grouped into two kernels, whereas...

chapter

Energy-efficient recognition and mining processor using scalable effort design

Vinay K. Chippa, Hrishikesh Jayakumar, Debabrata Mohapatra, Kaushik Roy, more

Proceedings of the IEEE 2013 Custom Integrated Circuits Conference > 1 - 4

2013 IEEE Custom Integrated Circuits Conference - CICC 2013

A domain-specific processor for energy-efficient execution of Recognition and Data Mining (RM) workloads is presented. The processor consists of a 2-D array of processing elements and a streaming memory hierarchy and interconnect network that are customized to efficiently execute dominant computational kernels (matrix-vector multiplication, vector dot product, L1 norm, and L2 norm) from a wide range...

chapter

Accelerating a novel particle-based fluid simulation on the GPU

Zhilu Chen, James Kingsley, Xinming Huang, Erkan Tuzel

2013 IEEE High Performance Extreme Computing Conference (HPEC) > 1 - 6

2013 IEEE High Performance Extreme Computing Conference (HPEC)

Stochastic Rotation Dynamics (SRD) is a novel particle-based simulation method that can be used to model complex fluids [1], [2], such as binary and ternary mixtures [3], and polymer solutions [4]-[6], in either two or three dimensions. Although SRD is efficient compared to traditional methods, it is still computationally expensive for large system sizes, e.g. when using a large array of particles...

chapter

CrowdCL: Web-based volunteer computing with WebCL

Tommy MacWilliam, Cris Cecka

2013 IEEE High Performance Extreme Computing Conference (HPEC) > 1 - 6

2013 IEEE High Performance Extreme Computing Conference (HPEC)

We present CrowdCL, an open-source framework for the rapid development of volunteer computing and OpenCL applications on the web. Drawing inspiration from existing GPU libraries like PyCUDA, CrowdCL provides an abstraction layer for WebCL aimed at reducing boilerplate and improving code readability. CrowdCL also provides developers with a framework to easily run computations in the background of a...

INFONA - science communication portal
