High-end FPGAs are widely adopted as hardware accelerators due to their power efficiency, flexibility, and high-performance computing capability. They are therefore extremely useful devices for a project with challenges and constraints such as the Square Kilometre Array (SKA). However, traditional design methods require expert hardware knowledge and long development times for each of the SKA's...
A histogram is a popular graphical representation of the distribution of numerical input data. Although sequential histogram computation is simple, it is no longer suitable for processing high volumes of data. With the recent advancement of high-performance computing (HPC), aided by the accelerating growth of General-Purpose Graphics Processing Units (GPGPUs),...
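The standard way to parallelize histogram computation on GPGPUs is privatization: each worker builds a private histogram over its chunk of the input, and the private histograms are then reduced into one, avoiding contention on shared bins. The sketch below illustrates that pattern with plain threads; it is not the paper's implementation, and all names and the binning rule are illustrative.

```python
# Privatized parallel histogram: per-worker private histograms, then a
# bin-by-bin reduction. On a GPU the "workers" would be thread blocks
# with private histograms in shared memory.
from concurrent.futures import ThreadPoolExecutor

def partial_histogram(chunk, num_bins):
    """Build a private histogram for one worker's slice of the input."""
    hist = [0] * num_bins
    for value in chunk:
        hist[value % num_bins] += 1   # trivial binning for the sketch
    return hist

def parallel_histogram(data, num_bins, workers=4):
    step = (len(data) + workers - 1) // workers
    chunks = [data[i:i + step] for i in range(0, len(data), step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = pool.map(lambda c: partial_histogram(c, num_bins), chunks)
    # Reduction phase: merge the private histograms bin by bin.
    result = [0] * num_bins
    for hist in partials:
        for b, count in enumerate(hist):
            result[b] += count
    return result

print(parallel_histogram(list(range(100)), 10))  # each bin receives 10 values
```

The reduction phase is what makes this scale: workers never write to a shared bin concurrently, so no atomics are needed until the cheap final merge.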
Domain-specific languages (DSLs) have been used in a variety of fields to express complex scientific problems concisely and to provide automated performance optimization for a range of computational architectures. As such, DSLs provide a powerful mechanism to speed up scientific Python computation that goes beyond traditional vectorization and pre-compilation approaches, while allowing domain...
The current diversity in nodal parallel computer architectures is seen in machines based upon multicore CPUs, GPUs, and Intel Xeon Phis. One class of approaches for enabling scalability of complex applications on such architectures is based upon Asynchronous Many-Task software architectures, such as that in the Uintah framework used for the parallel solution of solid and fluid mechanics problems...
This paper presents challenges encountered while parallelizing an existing sequential algorithm. A breadth-first search (BFS) implementation in CUDA C++ with quadratic time complexity is used. Even though BFS might seem like an easily parallelizable problem due to its many independent iterations over graph vertices, there are other important aspects which need to be considered. Properties like granularity, communication...
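Parallel BFS is typically organized level-synchronously: every vertex in the current frontier is expanded in the same step, which maps directly onto one-thread-per-vertex GPU execution. A minimal sketch of that structure, not the paper's CUDA C++ code, with illustrative names:

```python
def bfs_levels(adj, source):
    """Level-synchronous BFS over an adjacency list. Each outer iteration
    expands the entire frontier, mirroring how GPU threads would process
    frontier vertices in parallel."""
    n = len(adj)
    level = [-1] * n          # -1 marks unvisited vertices
    level[source] = 0
    frontier = [source]
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        for u in frontier:            # on a GPU: one thread per vertex
            for v in adj[u]:
                if level[v] == -1:    # on a GPU this check-and-set needs an atomic
                    level[v] = depth
                    next_frontier.append(v)
        frontier = next_frontier
    return level
```

The serial loop hides exactly the aspects the abstract flags: the atomic needed on the visited check, and the communication cost of rebuilding the frontier each level.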
Sparse matrix-vector multiplication (SpMV) is a key operation in scientific computing and engineering applications. This paper presents an optimization strategy to improve SpMV performance on multi-GPU systems by adopting OpenMP threads and multiple CUDA streams. We propose an efficient scheme in which OpenMP threads control multiple GPUs to jointly complete SpMV computations. Moreover,...
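The one-thread-per-device control scheme the abstract describes can be sketched by partitioning the rows of a CSR matrix across workers, each computing its slice of the output independently. This is an illustrative sketch using Python threads in place of OpenMP threads driving GPUs; the function names are assumptions, not the paper's API.

```python
from concurrent.futures import ThreadPoolExecutor

def spmv_csr(row_ptr, col_idx, vals, x, row_begin, row_end, y):
    """Compute y[row_begin:row_end] = A[row_begin:row_end] @ x, with A in
    CSR form. This is the work one device would be assigned."""
    for r in range(row_begin, row_end):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += vals[k] * x[col_idx[k]]
        y[r] = acc

def multi_device_spmv(row_ptr, col_idx, vals, x, devices=2):
    """One controller thread per 'device', each owning a disjoint row block,
    so the partial results never overlap and need no reduction."""
    n = len(row_ptr) - 1
    y = [0.0] * n
    step = (n + devices - 1) // devices
    with ThreadPoolExecutor(max_workers=devices) as pool:
        futures = [pool.submit(spmv_csr, row_ptr, col_idx, vals, x,
                               b, min(b + step, n), y)
                   for b in range(0, n, step)]
        for f in futures:
            f.result()   # propagate any worker exception
    return y
```

Row partitioning keeps each device's writes disjoint; the vector x, however, must be replicated (or broadcast) to every device, which is one of the costs multi-GPU SpMV schemes must manage.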
The GraphBLAS effort to standardize a set of graph-algorithm building blocks in terms of linear algebra primitives promises to deliver high-performing graph algorithms and to greatly impact the analysis of big data. However, there are challenges with this approach, which our data analytics miniapp miniTri exposes. In this paper, we improve upon a previously proposed task-parallel approach to linear...
Programming models like CUDA, OpenMP, OpenACC and OpenCL are designed to offload compute-intensive workloads to accelerators efficiently. However, the naive offload model, which synchronously copies and executes in sequence, requires extensive hand-tuning with techniques such as pipelining to overlap computation and communication. Therefore, we propose an easy-to-use, directive-based pipelining extension...
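The hand-tuned pipelining that such an extension would automate works by splitting the offloaded data into chunks so the transfer of chunk i+1 overlaps the computation on chunk i. A minimal sketch of that idea, with plain threads and a bounded queue standing in for CUDA streams and device buffers; all names are illustrative.

```python
import queue
import threading

def pipelined_offload(data, chunk_size, compute):
    """Chunked copy/compute pipeline: a 'copier' thread stages chunks
    (modeling host-to-device transfers) while the main loop consumes them
    (modeling kernel launches), so the two phases overlap."""
    staged = queue.Queue(maxsize=2)    # bound of 2 models double buffering

    def copier():
        for i in range(0, len(data), chunk_size):
            staged.put(data[i:i + chunk_size])
        staged.put(None)               # end-of-stream marker

    results = []
    t = threading.Thread(target=copier)
    t.start()
    while (chunk := staged.get()) is not None:
        results.extend(compute(chunk))
    t.join()
    return results

doubled = pipelined_offload(list(range(8)), 3, lambda c: [v * 2 for v in c])
```

The `maxsize=2` bound is the essence of double buffering: the copier may run at most one chunk ahead of the compute stage, capping staging memory while still hiding transfer latency.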
This paper proposes a parallel implementation of the finite volume method based on the weighted average flux (WAF) to solve the shallow water equations on a graphics processing unit. We develop two parallel programs using 1-dimensional and 2-dimensional thread blocks, respectively. We compare the performance of these two versions with a sequential program. The numerical experiment is performed...
This paper presents a new type of coarse-grained reconfigurable architecture (CGRA) for the object inference domain in machine learning. The proposed CGRA is optimized for stream processing, and a corresponding programming model called the dual-track model is proposed. The CGRA is realized in Verilog HDL and implemented in a SMIC 55 nm process, with a footprint of 3.79 mm² and a power consumption of 1.79 W at 500 MHz...
General-Purpose Graphics Processing Units (GPGPUs) exploit several levels of caches to hide memory latency and provide data for thousands of simultaneously executing threads. The L1 data cache and the L2 cache are critical to the performance of GPGPUs, as the L1 data cache must provide data for all threads within its Streaming Multiprocessor (SM) and the L2 cache must service memory requests...
Compression is a promising technique for increasing the effective capacity of caches. Due to the latency overhead of decompression, most previous studies applied compression to lower-level caches. General-Purpose Graphics Processing Units (GPGPUs) are throughput-oriented computing platforms which execute hundreds to thousands of threads simultaneously. The massive number of threads makes GPGPUs less sensitive...
In the field of embedded vision systems, meeting the constraints on design criteria such as performance, area, and power consumption can be a real challenge. In fact, to alleviate the well-known “Memory Wall”, it is mandatory to provide efficient memory hierarchies so that the system being designed reaches usable performance when it has to handle non-linear image processing. To address this problem,...
Many recent studies suggest that energy efficiency should be a primary design goal, on par with performance, in building both hardware and software. As a first step toward finding a good compromise between these two conflicting design goals, we need a deep understanding of the performance and energy behavior of different application kernels. In this paper, we focus...
Coarse-Grained Reconfigurable Architecture (CGRA) is a promising architecture offering high performance, high power efficiency, and appealing flexibility. The compute-intensive parts of an application (e.g. loops) are often mapped onto a CGRA for acceleration. Given the highly parallel memory demands of the PEs and the extremely high cost of single-bank multi-port memory, architectures with multi-bank...
Coarse-Grained Reconfigurable Architectures (CGRAs) promise both low power and high performance coupled with flexibility; however, automatic mapping of applications to such platforms remains a great research challenge. Efficient manual mapping of the data-centric kernels of applications yields great results; however, these kernels contain internal control-flow-specific tasks, which introduce mapping irregularities...
The use of FPGAs as compute accelerators has been demonstrated by numerous researchers as an effective way to meet performance requirements across many application domains. However, the design productivity of developing FPGA accelerators remains much lower than that of a typical software development flow. Although high-level design tools may partly alleviate this shortcoming,...
Identifying objects of interest in a video sequence is a fundamental part of many vision systems. A common method to achieve this goal is background subtraction. For automated surveillance systems with multiple cameras, real-time background subtraction is particularly important. In this paper, we examine how to exploit GPU parallelism to accelerate the single Gaussian background...
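Single-Gaussian background subtraction keeps one Gaussian (mean and variance) per pixel and flags a pixel as foreground when it deviates from its model by more than a few standard deviations; each pixel is independent, which is exactly what makes it GPU-friendly. The sketch below shows the per-pixel logic on grayscale values; the learning rate, threshold, and initial variance are illustrative choices, not the paper's parameters.

```python
import math

class SingleGaussianBG:
    """Per-pixel single-Gaussian background model over a flat grayscale
    frame. In the GPU version, each pixel's test-and-update would run on
    its own thread since pixels are fully independent."""

    def __init__(self, first_frame, alpha=0.05, k=2.5):
        self.mean = [float(p) for p in first_frame]
        self.var = [25.0] * len(first_frame)   # illustrative initial variance
        self.alpha, self.k = alpha, k

    def apply(self, frame):
        """Return a boolean foreground mask and update the model."""
        mask = []
        for i, p in enumerate(frame):
            d = p - self.mean[i]
            fg = abs(d) > self.k * math.sqrt(self.var[i])
            mask.append(fg)
            if not fg:  # adapt the background only from background pixels
                self.mean[i] += self.alpha * d
                self.var[i] += self.alpha * (d * d - self.var[i])
        return mask
```

Updating only on background pixels keeps a passing object from being absorbed into the model, at the cost of never adapting to a genuinely changed scene; real systems add a recovery mechanism for that.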
In the last decade, OpenCL has sparked the interest of the computing world as it is a language based on an open standard that can run on many different heterogeneous platforms. This standard is continuously evolving to adapt to various use cases of different platforms. For example, with requests from the FPGA community, the pipe construct was added to the standard to facilitate the implementation...
Exchanging data in noncontiguous user buffers is a dominant communication pattern in many scientific applications. The OpenSHMEM specification introduces a new set of communication routines to support strided data communication. Most high-performance implementations of the OpenSHMEM specification support strided data communication via either a packing/unpacking scheme or a multiple-reads/writes scheme,...
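The packing/unpacking scheme mentioned above gathers the strided elements into one contiguous buffer, ships it as a single message, and scatters it back out on the receiver, trading copy overhead for fewer network operations. A minimal sketch of that mechanics; the function names are illustrative and are not the OpenSHMEM API.

```python
def pack_strided(buf, offset, stride, nelems):
    """Gather nelems strided elements into a contiguous list (the single
    message a pack-based implementation would transmit)."""
    return [buf[offset + i * stride] for i in range(nelems)]

def unpack_strided(dest, offset, stride, packed):
    """Scatter a contiguous message back into a strided destination."""
    for i, v in enumerate(packed):
        dest[offset + i * stride] = v

# Ship every 3rd element of src into every 2nd slot of dst as ONE message,
# instead of nelems separate reads/writes.
src = list(range(12))
dst = [0] * 8
message = pack_strided(src, 0, 3, 4)   # contiguous payload: [0, 3, 6, 9]
unpack_strided(dst, 0, 2, message)
```

The alternative multiple-reads/writes scheme skips both copies but issues one network operation per element, so which scheme wins depends on element count and per-message latency.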