Search results

chapter

PACENet: Energy efficient acceleration for convolutional network on embedded platform

Adwaya Kulkarni, Tahmid Abtahi, Colin Shea, Amey Kulkarni, more

2017 IEEE International Symposium on Circuits and Systems (ISCAS) > 1 - 4

2017 IEEE International Symposium on Circuits and Systems (ISCAS)

Lightweight convolutional neural network (CNN) on tiny embedded platforms can offer energy efficient solution for today's IoT devices. However, CNN implementation on embedded system faces processing bottleneck in convolutional layers and memory storage issues in fully connected layers. In past years, heterogeneous acceleration, where compute intensive tasks are performed on kernel specific cores,...

chapter

Highly parallel online bioelectrical signal processing on GPU architecture

Z. Juhasz

2017 40th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO) > 340 - 346

2017 40th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO)

Signal processing is of central importance in biomedical systems, in which pre-processing steps are unavoidable in order to reduce noise, remove unwanted artefacts, segment time series into smaller epochs, or extract statistical and other descriptive features that can be used in consecutive classification stages. The high sampling rates and electrode counts used e.g. in advanced EEG or body-surface...

chapter

Neneta: Heterogeneous computing complex-valued neural network framework

Vladimir Lekic, Zdenka Babic

2017 40th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO) > 192 - 196

2017 40th International Convention on Information and Communication Technology, Electronics and Microelectronics (MIPRO)

Due to increased demand for computational efficiency for the training, validation and testing of artificial neural networks, many open source software frameworks have emerged. Almost exclusively GPU programming model of choice in such software frameworks is CUDA. Symptomatic is also lack of the support for complex-valued neural networks. With our research going exactly in that direction, we developed...

chapter

Relaxations for High-Performance Message Passing on Massively Parallel SIMT Processors

Benjamin Klenk, Holger Froening, Hans Eberle, Larry Dennison

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) > 855 - 865

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Accelerators, such as GPUs, have proven to be highly successful in reducing execution time and power consumption of compute-intensive applications. Even though they are already used pervasively, they are typically supervised by general-purpose CPUs, which results in frequent control flow switches and data transfers as CPUs are handling all communication tasks. However, we observe that accelerators...

chapter

Power Efficient Sharing-Aware GPU Data Management

Abdulaziz Tabbakh, Murali Annavaram, Xuehai Qian

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) > 698 - 707

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

The power consumed by memory system in GPUs is a significant fraction of the total chip power. As thread level parallelism increases, GPUs are likely to stress cache and memory bandwidth even more, thereby exacerbating power consumption. We observe that neighboring concurrent thread arrays (CTAs) within GPU applications share considerable amount of data. However, the default GPU scheduling policy...

chapter

A software technique to enhance register utilization of Convolutional Neural Networks on GPGPUs

Che-Huai Lin, An-Ting Cheng, Bo-Cheng Lai

2017 International Conference on Applied System Innovation (ICASI) > 614 - 617

2017 International Conference on Applied System Innovation (ICASI)

CNNs (Convolutional Neural Networks) have demonstrated superior results in a wide range of applications. However, the time-consuming convolution operations required by CNNs pose great challenges to designers. GPGPUs (General Purpose Graphic Processing Units) have been widely used to exploiting the massive parallelism of convolution operations. This paper proposes a software-based loop-unrolling technique...

chapter

Co-Run Scheduling with Power Cap on Integrated CPU-GPU Systems

Qi Zhu, Bo Wu, Xipeng Shen, Li Shen, more

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) > 967 - 977

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

This paper presents the first systematic study on co-scheduling independent jobs on integrated CPU-GPU systems with power caps considered. It reveals the performance degradations caused by the co-run contentions at the levels of both memory and power. It then examines the problem of using job co-scheduling to alleviate the degradations in this less understood scenario. It offers several algorithms...

chapter

Directive-Based Partitioning and Pipelining for Graphics Processing Units

Xuewen Cui, Thomas R. W. Scogland, Bronis R. de Supinski, Wu-chun Feng

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) > 575 - 584

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

The community needs simpler mechanisms to access the performance available in accelerators, such as GPUs, FPGAs, and APUs, due to their increasing use in state-of-the-art supercomputers. Programming models like CUDA, OpenMP, OpenACC and OpenCL can efficiently offload compute-intensive workloads to these devices. By default these models naively offload computation without overlapping it with communication...

chapter

Mini-Gunrock: A Lightweight Graph Analytics Framework on the GPU

Yangzihao Wang, Sean Baxter, John D. Owens

2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) > 616 - 626

2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)

Existing GPU graph analytics frameworks are typically built from specialized, bottom-up implementations of graph operators that are customized to graph computation. In this work we describe Mini-Gunrock, a lightweight graph analytics framework on the GPU. Unlike existing frameworks, Mini-Gunrock is built from graph operators implemented with generic transform-based data-parallel primitives. Using...

chapter

Portable Implementation of Advanced Driver-Assistance Algorithms on Heterogeneous Architectures

Oliver Jakob Arndt, Fabian David Trager, Tobias MoB, Holger Blume

2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) > 6 - 17

2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)

The increased use of application-specific computational devices turns even low-power chips into high-performance computers. Not only additional accelerators (e.g., GPU, DSP, or even FPGA), but also heterogeneous CPU clusters form modern computer systems. Programming these chips is however challenging, due to management overhead, data transfer delays, and a missing unification of the programming flow...

chapter

Offloading Communication Control Logic in GPU Accelerated Applications

Elena Agostini, Davide Rossetti, Sreeram Potluri

2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID) > 248 - 257

2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)

NVIDIA GPUDirect is a family of technologiesaimed at optimizing data movement among GPUs (P2P) orbetween GPUs and third-party devices (RDMA). GPUDirectAsync, introduced in CUDA 8.0, is a new addition whichallows direct synchronization between GPU and third partydevices. For example, Async allows an NVIDIA GPU to directlytrigger and poll for completion of communication operationsqueued to an InfiniBand...

chapter

Communication Optimization on GPU: A Case Study of Sequence Alignment Algorithms

Jie Wang, Xinfeng Xie, Jason Cong

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) > 72 - 81

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Data movement is increasingly becoming the bottleneck of both performance and energy efficiency in modern computation. Until recently, it was the case that there is limited freedom for communication optimization on GPUs, as conventional GPUs only provide two types of methods for inter-thread communication: using shared memory or global memory. However, a new warp shuffle instruction has been introduced...

chapter

Auto-Tuning Strategies for Parallelizing Sparse Matrix-Vector (SpMV) Multiplication on Multi- and Many-Core Processors

Kaixi Hou, Wu-chun Feng, Shuai Che

2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) > 713 - 722

2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)

Because sparse matrix-vector multiplication (SpMV) is an important and widely used computational kernel in many real-world applications, it behooves us to accelerate SpMV on modern multi- and many-core architectures. While many storage formats have been developed to facilitate SpMV operations, the compressed sparse row (CSR) format is still the most popular and general storage format. However, parallelizing...

chapter

Static WCET Analysis of GPUs with Predictable Warp Scheduling

Yijie Huangfu, Wei Zhang

2017 IEEE 20th International Symposium on Real-Time Distributed Computing (ISORC) > 101 - 108

2017 IEEE 20th International Symposium on Real-Time Distributed Computing (ISORC)

The capability of GPUs to accelerate general-purpose applications that can be parallelized into massive number of threads makes it promising to apply GPUs to real-time applications as well, where high throughput and intensive computation are also needed. However, due to the different architecture and programming model of GPUs, the worst-case execution time (WCET) analysis methods and techniques designed...

chapter

Clustering Throughput Optimization on the GPU

Michael Gowanlock, Cody M. Rude, David M. Blair, Justin D. Li, more

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) > 832 - 841

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Large datasets in astronomy and geoscience often require clustering and visualizations of phenomena at different densities and scales in order to generate scientific insight. We examine the problem of maximizing clustering throughput for concurrent dataset clustering in spatial dimensions. We introduce a novel hybrid approach that uses GPUs in conjunction with multicore CPUs for algorithmic throughput...

chapter

Exploiting Decoupled OpenCL Work-Items with Data Dependencies on FPGAs: A Case Study

Javier Alejandro Varela, Norbert Wehn, Qian Liang, Songyin Tang

2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) > 124 - 131

2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)

In the field of high performance heterogeneous computing systems, field programmable gate arrays (FPGAs) have shown great advantages in terms of acceleration and energy efficiency. And with the inclusion of the OpenCL framework for parallel programming, the design complexity has been greatly reduced. However, the parallel implementation of applications containing data-dependent branches usually experiences...

chapter

Performance-Portable Sparse Matrix-Matrix Multiplication for Many-Core Architectures

Mehmet Deveci, Christian Trott, Sivasankaran Rajamanickam

2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) > 693 - 702

2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)

We consider the problem of writing performance portablesparse matrix-sparse matrix multiplication (SPGEMM) kernelfor many-core architectures. We approach the SPGEMMkernel from the perspectives of algorithm design and implementation, and its practical usage. First, we design ahierarchical, memory-efficient SPGEMM algorithm. We thendesign and implement thread scalable data structures thatenable us to...

chapter

Use of Synthetic Benchmarks for Machine-Learning-Based Performance Auto-Tuning

Tianyi David Han, Tarek S. Abdelrahman

2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) > 1350 - 1361

2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)

We explore the use of synthetic benchmarks for the training phase of machine-learning-based automatic performance tuning. We focus on the problem of predicting if the use of local memory on a GPU is beneficial for caching a single target array in a GPU kernel. We show that the use of only 13 real benchmarks leads to poor prediction accuracy (about to 58%) of the 13 leave-one-out models trained using...

chapter

A Sampling Based Strategy to Automatic Performance Tuning of GPU Programs

Wilson Feng, Tarek S. Abdelrahman

2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) > 1342 - 1349

2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)

We present a novel strategy for automatic performance tuning of GPU computational kernels. The strategy combines heuristic search with regression trees to prune the optimization space. It samples configurations in the space and uses these samples to build a regression tree. It then focuses the search on the leaf region of the tree with the best mean sample performance. Additional configurations are...

chapter

GPU accelerated foreground segmentation using CodeBook model and shadow removal using CUDA

Praveen Gudivaka, Nayaneesh Mishra, Anupam Agrawal

2017 International Conference on Computing, Communication and Automation (ICCCA) > 765 - 770

2017 International Conference on Computing, Communication and Automation (ICCCA)

Background Subtraction is the major important step in many image processing applications which can be applied in much of video surveillances. The major result of this method is accuracy as well as processing time. So we mainly focused on these two challenges. We parallelized the Two Layered CodeBook Model on Graphical Processing Unit (GPU) for increasing the processing speed and the accuracy of the...

INFONA - science communication portal

Search results

PACENet: Energy efficient acceleration for convolutional network on embedded platform

Highly parallel online bioelectrical signal processing on GPU architecture

Neneta: Heterogeneous computing complex-valued neural network framework

Relaxations for High-Performance Message Passing on Massively Parallel SIMT Processors

Power Efficient Sharing-Aware GPU Data Management

A software technique to enhance register utilization of Convolutional Neural Networks on GPGPUs

Co-Run Scheduling with Power Cap on Integrated CPU-GPU Systems

Directive-Based Partitioning and Pipelining for Graphics Processing Units

Mini-Gunrock: A Lightweight Graph Analytics Framework on the GPU

Portable Implementation of Advanced Driver-Assistance Algorithms on Heterogeneous Architectures

Offloading Communication Control Logic in GPU Accelerated Applications

Communication Optimization on GPU: A Case Study of Sequence Alignment Algorithms

Auto-Tuning Strategies for Parallelizing Sparse Matrix-Vector (SpMV) Multiplication on Multi- and Many-Core Processors

Static WCET Analysis of GPUs with Predictable Warp Scheduling

Clustering Throughput Optimization on the GPU

Exploiting Decoupled OpenCL Work-Items with Data Dependencies on FPGAs: A Case Study

Performance-Portable Sparse Matrix-Matrix Multiplication for Many-Core Architectures

Use of Synthetic Benchmarks for Machine-Learning-Based Performance Auto-Tuning

A Sampling Based Strategy to Automatic Performance Tuning of GPU Programs

GPU accelerated foreground segmentation using CodeBook model and shadow removal using CUDA

Filter options

Publication date

Keywords

INFONA - science communication portal

Search results

Add recipient

Sending message cancelled

Are you sure you want to cancel sending this message?

Send message

Filter options

Publication date

Date range setting

Set the date range to filter the displayed results. You can set a starting date, ending date or both. You can enter the dates manually or choose them from the calendar.

Keywords

Reporting an error / abuse

Sending the report failed

Accessibility options