Search results

chapter

Optimal Performance Prediction of ADAS Algorithms on Embedded Parallel Architectures

Romain Saussard, Boubker Bouzid, Marius Vasiliu, Roger Reynaud

2015 IEEE 17th International Conference on High Performance Computing and Communications, 2015 IEEE 7th International Symposium on Cyberspace Safety and Security, and 2015 IEEE 12th International Conference on Embedded Software and Systems > 213 - 218

2015 IEEE 17th International Conference on High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS) and 2015 IEEE 12th International Conf on Embedded Software and Systems (ICESS)

ADAS (Advanced Driver Assistance Systems) algorithms increasingly use heavy image processing operations. To embed this type of algorithm, semiconductor companies offer many heterogeneous architectures. These SoCs (Systems on Chip) are composed of different processing units, with different capabilities, and often with a massively parallel computing unit. Due to the complexity of these SoCs, predicting...

chapter

Run Time Approximation of Non-blocking Service Rates for Streaming Systems

Jonathan Curtis Beard, Roger Dean Chamberlain

2015 IEEE 17th International Conference on High Performance Computing and Communications, 2015 IEEE 7th International Symposium on Cyberspace Safety and Security, and 2015 IEEE 12th International Conference on Embedded Software and Systems > 792 - 797

2015 IEEE 17th International Conference on High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS) and 2015 IEEE 12th International Conf on Embedded Software and Systems (ICESS)

Stream processing is a compute paradigm that promises safe and efficient parallelism. Its realization requires optimization of multiple parameters such as kernel placement and communications. Most techniques to optimize streaming systems use queueing network models or network flow models, which often require estimates of the execution rate of each compute kernel. This is known as the non-blocking...

chapter

SBIOS: An SSD-based Block I/O Scheduler with improved system performance

Jiayang Guo, Yimin Hu, Bo Mao

2015 IEEE International Conference on Networking, Architecture and Storage (NAS) > 357 - 358

2015 IEEE International Conference on Networking, Architecture and Storage (NAS)

This paper presents an SSD-based Block I/O Scheduler, short for SBIOS. SBIOS fully exploits the internal parallelism to improve the system performance. It dispatches the read requests to different blocks to make full use of SSD internal parallelism. For write requests, it tries to dispatch write requests to the same block to alleviate the block cross penalty and garbage collection overhead. The evaluation...

chapter

Parallel implementation of low light level image enhancement using CUDA

Peiyi Shen, Liang Zhang, Juan Song, Xilu Peng, et al.

2015 IEEE International Conference on Information and Automation > 673 - 677

2015 IEEE International Conference on Information and Automation (ICIA)

Enhancement algorithms can give low light level images a clear visual effect like that of images captured during the daytime. However, due to their high complexity and computational cost, low light level image enhancement algorithms usually struggle to meet real-time requirements, which makes them difficult to use widely in practical applications. For this situation, a parallel optimization algorithm...

chapter

Scaling number of cores in GPGPU: A comparative performance analysis

Winnie Thomas, Rohin D. Daruwala

2015 International Conference on Advances in Computing, Communications and Informatics (ICACCI) > 501 - 507

2015 International Conference on Advances in Computing, Communications and Informatics (ICACCI)

Graphics Processing Units (GPUs) based on the Single Instruction Multiple Thread (SIMT) architecture are emerging as more efficient than Multiple Instruction Multiple Data (MIMD) architectures in exploiting parallelism. A GPU has numerous shader cores and thousands of simultaneous fine-grained active threads. These threads are grouped into Cooperative Thread Arrays (CTAs). All the threads within a CTA...

chapter

Tracking Many Solution Paths of a Polynomial Homotopy on a Graphics Processing Unit in Double Double and Quad Double Arithmetic

Jan Verschelde, Xiangcheng Yu

2015 IEEE 17th International Conference on High Performance Computing and Communications, 2015 IEEE 7th International Symposium on Cyberspace Safety and Security, and 2015 IEEE 12th International Conference on Embedded Software and Systems > 371 - 376

2015 IEEE 17th International Conference on High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS) and 2015 IEEE 12th International Conf on Embedded Software and Systems (ICESS)

Polynomial systems occur in many areas of science and engineering. Unlike general nonlinear systems, the algebraic structure enables to compute all solutions of a polynomial system. We describe our massively parallel predictor-corrector algorithms to track many solution paths of a polynomial homotopy. The data parallelism that provides the speedups stems from the evaluation and differentiation of the...

chapter

CUDA Grid-Level Task Progression Algorithms

Christos Kartsaklis, Wayne Joubert, Oscar R. Hernandez, Markus Eisenbach, et al.

2015 IEEE 17th International Conference on High Performance Computing and Communications, 2015 IEEE 7th International Symposium on Cyberspace Safety and Security, and 2015 IEEE 12th International Conference on Embedded Software and Systems > 1628 - 1632

2015 IEEE 17th International Conference on High Performance Computing and Communications (HPCC), 2015 IEEE 7th International Symposium on Cyberspace Safety and Security (CSS) and 2015 IEEE 12th International Conf on Embedded Software and Systems (ICESS)

Tasking is a prominent parallel programming model. In this paper we conduct a first study into the feasibility of task-parallel execution at the CUDA grid, rather than the stream/kernel level, for regular, fixed in-out dependency task graphs, similar to those found in wavefront computational patterns, making the findings broadly applicable. We propose and evaluate three CUDA task progression algorithms,...

chapter

Accelerating persistent scatterer pixel selection for InSAR processing

Tahsin Reza, Aaron Zimmer, Parwant Ghuman, Tanuj kr Aasawat, et al.

2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP) > 49 - 56

2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP)

Interferometric Synthetic Aperture Radar (InSAR) is a remote sensing technology used for estimating displacement of the earth's surface. Phase unwrapping is the most important step in InSAR processing and relies on successful selection of points that appear stable across a set of satellite images taken over time. This paper presents a new algorithm for selecting these points, a problem known as persistent...

chapter

Loop coarsening in C-based High-Level Synthesis

Moritz Schmid, Oliver Reiche, Frank Hannig, Jurgen Teich

2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP) > 166 - 173

2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP)

Current tools for High-Level Synthesis (HLS) excel at exploiting Instruction-Level Parallelism (ILP), but their support for Data-Level Parallelism (DLP), one of the key advantages of Field Programmable Gate Arrays (FPGAs), is very limited. This work examines the exploitation of DLP on FPGAs using code generation for C-based HLS of image filters and streaming pipelines, consisting of point and...

chapter

Function and speed portability of audio fingerprint extraction across computing platforms

Fu-Hai Frank Wu, Jyh-Shing Roger Jang

2015 IEEE International Conference on Consumer Electronics - Taiwan > 216 - 217

2015 IEEE International Conference on Consumer Electronics - Taiwan (ICCE-TW)

Audio Fingerprinting (AFP) is a technology that demands huge computing power for responsiveness, accuracy, and robustness to noise. In this study, we improve the computing speed of fingerprint extraction in an AFP system using the parallel language OpenCL. We also explore function and speed portability across different platforms. The experimental results show that the portability...

chapter

Optimizing the Bayesian Inference of Phylogeny on Graphic Processors

Cheng Ling, Chunbao Zhou, Arong Luo, Guoguang Zhao, et al.

2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing > 333 - 342

2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)

Searching for the evolutionary relationships between groups of organisms has become a routine procedure in molecular biology. MrBayes is a popular model-based phylogenetic inference tool using Bayesian statistics. Unfortunately, the computational cost is very high, resulting in undesirably long execution time. In this paper, we present what we believe is the fastest solution of the MrBayes MC3 algorithm...

chapter

No PAIN, No Gain? The Utility of PArallel Fault INjections

Stefan Winter, Oliver Schwahn, Roberto Natella, Neeraj Suri, et al.

2015 IEEE/ACM 37th IEEE International Conference on Software Engineering > 1 > 494 - 505

2015 IEEE/ACM 37th IEEE International Conference on Software Engineering (ICSE)

Software Fault Injection (SFI) is an established technique for assessing the robustness of a software under test by exposing it to faults in its operational environment. Depending on the complexity of this operational environment, the complexity of the software under test, and the number and type of faults, a thorough SFI assessment can entail (a) numerous experiments and (b) long experiment run times,...

chapter

GPU-based Parallel R-tree Construction and Querying

Sushil K. Prasad, Michael McDermott, Xi He, Satish Puri

2015 IEEE International Parallel and Distributed Processing Symposium Workshop > 618 - 627

2015 IEEE International Parallel and Distributed Processing Symposium Workshop (IPDPSW)

An R-tree is a data structure for organizing and querying multi-dimensional non-uniform and overlapping data. Efficient parallelization of R-tree is an important problem due to societal applications such as geographic information systems (GIS), spatial database management systems, and VLSI layout which employ R-trees for spatial analysis tasks such as map-overlay. As graphics processing units (GPUs)...

chapter

Lowering the complexity of k-means clustering by BFS-dijkstra method for graph computing

Anna Zhang, Jun Yao, Yasuhiko Nakashima

2015 IEEE Symposium in Low-Power and High-Speed Chips (COOL CHIPS XVIII) > 1 - 3

2015 IEEE Symposium in Low-Power and High-Speed Chips (COOL CHIPS XVIII)

K-means is a method of vector quantization, which is now popularly used for clustering analysis in massive data mining. Due to its heavily computational-intensive feature for iteratively re-computing and sorting distances, the execution of k-means takes a huge amount of time, especially when processing large graph data such as the practical social networks. This paper studies an alternative method...

chapter

Coordinating GPU Threads for OpenMP 4.0 in LLVM

Carlo Bertolli, Samuel F. Antao, Alexandre E. Eichenberger, Kevin OBrien, Zehra Sura, et al.

2014 LLVM Compiler Infrastructure in HPC > 12 - 21

2014 LLVM Compiler Infrastructure in HPC (LLVM-HPC)

GPU devices are becoming critical building blocks of High-Performance platforms for performance and energy efficiency reasons. As a consequence, parallel programming environments such as OpenMP were extended to support offloading code to such devices. OpenMP compilers are faced with offering an efficient implementation of device-targeting constructs. One main issue in implementing OpenMP on a GPU is...

chapter

Parallel background subtraction in video streams using OpenCL on GPU platforms

Grzegorz Szwoch

2014 Signal Processing: Algorithms, Architectures, Arrangements, and Applications (SPA) > 54 - 59

2014 Signal Processing: Algorithms, Architectures, Arrangements, and Applications (SPA)

An implementation of the background subtraction algorithm using the OpenCL platform is presented. The algorithm processes a live stream of video frames from a surveillance camera in on-line mode. Processing is performed using a host machine and a parallel computing device. The work focuses on optimizing an OpenCL algorithm implementation for GPU devices by taking into account specific features of the GPU...

chapter

Implementation of Kalman filter and Sonar image processing on FPGA platform

Radha Guha

2015 International Conference on Industrial Engineering and Operations Management (IEOM) > 1 - 7

2015 International Conference on Industrial Engineering and Operations Management (IEOM)

In recent years, the emergence of many intelligent autonomous systems has become possible due to the tremendous advancement of technologies such as computer vision, automation and control engineering, and sensor technology. One such intelligent system is the autonomous underwater vehicle (AUV) for ocean floor mapping by SONAR technology. Success of this autonomous smart and precise intelligent system depends...

chapter

Locality aware concurrent start for stencil applications

Sunil Shrestha, Guang R. Gao, Joseph Manzano, Andres Marquez, et al.

2015 IEEE/ACM International Symposium on Code Generation and Optimization (CGO) > 157 - 166

2015 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)

Stencil computations are at the heart of many physical simulations used in scientific codes. Thus, there exists a plethora of optimization efforts for this family of computations. Among these techniques, tiling techniques that allow concurrent start have proven to be very efficient in providing better performance for these critical kernels. Nevertheless, with many core designs being the norm, these...

chapter

Intermediate representation for heterogeneous multi-core: A survey

Meena Belwal, Sudarshan TSB

2015 International Conference on VLSI Systems, Architecture, Technology and Applications (VLSI-SATA) > 1 - 6

2015 International Conference on VLSI Systems, Architecture, Technology and Applications (VLSI-SATA)

One of the necessary conditions to gain performance improvement through heterogeneous multi-core is to exploit the parallelism in the program. Compiler applies various transformations to the code to achieve execution efficiency. Code optimization is one of the important tasks performed by the compiler before generating the target code. With the availability of various parallel programming models in...

chapter

Free launch: Optimizing GPU dynamic kernel launches through thread reuse

Guoyang Chen, Xipeng Shen

2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO) > 407 - 419

2015 48th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)

Supporting dynamic parallelism is important for GPU to benefit a broad range of applications. There are currently two fundamental ways for programs to exploit dynamic parallelism on GPU: a software-based approach with software-managed worklists, and a hardware-based approach through dynamic subkernel launches. Neither is satisfactory. The former is complicated to program and is often subject to some...

INFONA - science communication portal
