Search results

chapter

FLiT: Cross-platform floating-point result-consistency tester and workload

Geof Sawaya, Michael Bentley, Ian Briggs, Ganesh Gopalakrishnan, more

2017 IEEE International Symposium on Workload Characterization (IISWC) > 229 - 238

Understanding the extent to which computational results can change across platforms, compilers, and compiler flags can go a long way toward supporting reproducible experiments. In this work, we offer the first automated testing aid called FLiT (Floating-point Litmus Tester) that can show how much these results can vary for any user-given collection of computational kernels. Our approach is to take...

chapter

Evaluating high-level design strategies on FPGAs for high-performance computing

Artur Podobas, Hamid Reza Zohouri, Naoya Maruyama, Satoshi Matsuoka

2017 27th International Conference on Field Programmable Logic and Applications (FPL) > 1 - 4

Field-Programmable Gate Arrays (FPGAs) are gaining considerable momentum in mainstream high-performance systems in recent years due to their flexibility and low power consumption. Still, FPGAs remain largely unavailable to software programmers due to programming and debugging difficulties that are inherent to standard Hardware Description Languages. The performance that hardware-oblivious software...

chapter

Vectorization-Aware Loop Optimization with User-Defined Code Transformations

Hiroyuki Takizawa, Thorsten Reimann, Kazuhiko Komatsu, Takashi Soga, more

2017 IEEE International Conference on Cluster Computing (CLUSTER) > 685 - 692

The cost of maintaining an application code would significantly increase if the application code is branched into multiple versions, each of which is optimized for a different architecture. In this work, default and vector versions of a real-world application code are refactored to be a single version, and the differences between the versions are expressed as user-defined code transformations. As a...

chapter

3D tomography back-projection parallelization on FPGAs using OpenCL

Maxime Martelli, Nicolas Gac, Alain Merigot, Cyrille Enderli

2017 Conference on Design and Architectures for Signal and Image Processing (DASIP) > 1 - 6

This paper deals with the evaluation of the resurgence of FPGAs for hardware acceleration, applied to computed tomography on the back-projection operator used in iterative reconstruction algorithms. We focus our attention on the tools developed by FPGA manufacturers, in particular the Intel FPGA SDK for OpenCL, which promises a new level of hardware abstraction from the developer's perspective, allowing a...

chapter

Loop Overhead Reduction Techniques for Coarse Grained Reconfigurable Architectures

Kanishkan Vadivel, Mark Wijtvliet, Roel Jordans, Henk Corporaal

2017 Euromicro Conference on Digital System Design (DSD) > 14 - 21

Due to their flexibility and high performance, Coarse Grained Reconfigurable Arrays (CGRAs) are a topic of increasing research interest. However, CGRAs also have the potential to achieve very high energy efficiency in comparison to other reconfigurable architectures when hardware optimizations are applied. Some of these optimizations are common for more traditional processors but can also lead to large...

chapter

Performance Analysis and Optimization of the FFTXlib on the Intel Knights Landing Architecture

Michael Wagner, Victor Lopez, Julian Morillo, Carlo Cavazzoni, more

2017 46th International Conference on Parallel Processing Workshops (ICPPW) > 243 - 250

In this paper, we address the decreasing performance of the FFTXlib, the Fast Fourier Transformation (FFT) kernel of Quantum ESPRESSO, when scaling to a full KNL node. An increased performance in the FFTXlib will likewise increase the performance of the entire Quantum ESPRESSO code, one of the most used plane-wave DFT codes in the materials science community. Our approach focuses on, first, overlapping...

chapter

Analyzing Performance of Multi-cores and Applications with Cache-aware Roofline Model

Diogo Marques, Helder Duarte, Leonel Sousa, Aleksandar Ilic

2017 International Conference on High Performance Computing & Simulation (HPCS) > 933 - 934

To satisfy growing computational demands of modern applications, significant enhancements have been introduced in the contemporary processor architectures with the aim to increase their attainable performance, such as increased number of cores, improved capability of memory subsystem and enhancements in the processor pipeline [1]. Therefore, the performance improvements are usually coupled with an...

chapter

GPU-based coevolutionary particle swarm optimization

Zhao Liang, Zhu Yanxing, Zhang Jianyu, Ye Zhencheng

2017 36th Chinese Control Conference (CCC) > 9883 - 9887

The coevolutionary particle swarm optimization (CPSO) algorithm has been widely investigated and applied in the real world. When tackling large-scale and complex real-time optimization problems, the running time of the CPSO algorithm is a barrier. In this paper, a Graphics Processing Unit (GPU) is introduced to provide speedup in order to meet real-time requirements. The CPSO algorithm has been implemented...

chapter

Comparative Performance and Optimization of Chapel in Modern Manycore Architectures

Engin Kayraklioglu, Wo Chang, Tarek El-Ghazawi

2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) > 1105 - 1114

Chapel is an emerging scalable, productive parallel programming language. In this work, we analyze Chapel's performance using The Parallel Research Kernels on two different manycore architectures including a state-of-the-art Intel Knights Landing processor. We discuss implementation techniques in Chapel and their relation to the OpenMP implementations of the PRK. We also suggest and prototype several...

chapter

Communication Optimization on GPU: A Case Study of Sequence Alignment Algorithms

Jie Wang, Xinfeng Xie, Jason Cong

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) > 72 - 81

Data movement is increasingly becoming the bottleneck of both performance and energy efficiency in modern computation. Until recently, there was limited freedom for communication optimization on GPUs, as conventional GPUs only provided two types of methods for inter-thread communication: using shared memory or global memory. However, a new warp shuffle instruction has been introduced...

chapter

Automatic generation of fast BLAS3-GEMM: A portable compiler approach

Xing Su, Xiangke Liao, Jingling Xue

2017 IEEE/ACM International Symposium on Code Generation and Optimization (CGO) > 122 - 133

GEMM is the main computational kernel in BLAS3. Its micro-kernel is either hand-crafted in assembly code or generated from C code by general-purpose compilers (guided by architecture-specific directives or auto-tuning). Therefore, either performance or portability suffers. We present a POrtable Compiler Approach, Poca, implemented in LLVM, to automatically generate and optimize this micro-kernel in...

chapter

A space- and energy-efficient code compression/decompression technique for coarse-grained reconfigurable architectures

Bernhard Egger, Hochan Lee, Duseok Kang, Mansureh S. Moghaddam, more

2017 IEEE/ACM International Symposium on Code Generation and Optimization (CGO) > 197 - 209

We present an effective code compression technique to reduce the area and energy overhead of the configuration memory for coarse-grained reconfigurable architectures (CGRA). Based on a statistical analysis of existing code, the proposed method reorders the storage locations of the reconfigurable entities and splits the wide configuration memory into a number of partitions. Code compression is achieved...

chapter

DFGenTool: A Dataflow Graph Generation Tool for Coarse Grain Reconfigurable Architectures

Manideepa Mukherjee, Alexander Fell, Apala Guha

2017 30th International Conference on VLSI Design and 2017 16th International Conference on Embedded Systems (VLSID) > 67 - 72

In this paper, DFGenTool, a dataflow graph (DFG) generation tool, is presented, which converts loops in a sequential program given in a high-level language such as C, into a DFG. DFGenTool adapts DFGs for mapping to Coarse Grain Reconfigurable Architectures (CGRA) to enable a variety of CGRA implementations and compilers to be benchmarked against a standard set of DFGs. Several kernels have been converted...

chapter

Automated Optimal Architecture of Deep Convolutional Neural Networks for Image Recognition

Saleh Albelwi, Ausif Mahmood

2016 15th IEEE International Conference on Machine Learning and Applications (ICMLA) > 53 - 60

Recent advancements in deep Convolutional Neural Networks (CNNs) have led to impressive progress in computer vision, especially in image classification. CNNs involve numerous hyperparameters that identify the network's structure such as depth of the network, kernel size, number of feature maps, stride, pooling size and pooling regions etc. These hyperparameters have a significant impact on the classification...

chapter

The Vectorization of the Tersoff Multi-body Potential: An Exercise in Performance Portability

Markus Hohnerbach, Ahmed E. Ismail, Paolo Bientinesi

SC16: International Conference for High Performance Computing, Networking, Storage and Analysis > 69 - 81

Molecular dynamics simulations, an indispensable research tool in computational chemistry and materials science, consume a significant portion of the supercomputing cycles around the world. We focus on multi-body potentials and aim at achieving performance portability. Compared with well-studied pair potentials, multibody potentials deliver increased simulation accuracy but are too complex for effective...

chapter

Investigation and performance analysis of OpenVX optimizations on computer vision applications

Djamila Dekkiche, Bastien Vincke, Alain Merigot

2016 14th International Conference on Control, Automation, Robotics and Vision (ICARCV) > 1 - 6

The development of Advanced Driver Assistance Systems (ADAS), such as pedestrian detection, requires real-time update rates at high image resolution. Fortunately, heterogeneous architectures with high computing performance have been developed for this purpose. To benefit from this hardware performance, different programming languages and acceleration frameworks have been developed. The OpenVX framework...

chapter

Evaluating and Optimizing the NERSC Workload on Knights Landing

Taylor Barnes, Brandon Cook, Jack Deslippe, Douglas Doerfler, more

2016 7th International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS) > 43 - 53

NERSC has partnered with 20 representative application teams to evaluate performance on the Xeon-Phi Knights Landing architecture and develop an application-optimization strategy for the greater NERSC workload on the recently installed Cori system. In this article, we present early case studies and summarized results from a subset of the 20 applications highlighting the impact of important architecture...

chapter

Tuning Stencil codes in OpenCL for FPGAs

Qi Jia, Huiyang Zhou

2016 IEEE 34th International Conference on Computer Design (ICCD) > 249 - 256

OpenCL is designed as a parallel programming framework to support heterogeneous computing platforms. The implicit or explicit parallelism in OpenCL kernel code enables efficient FPGA implementation from a high-level programming abstraction. However, FPGA architecture is completely different from GPU architecture, for which OpenCL is widely used. Tuning OpenCL codes to achieve high performance on FPGAs...

chapter

A Benchmark on Multi Improvement Neighborhood Search Strategies in CPU/GPU Systems

Eyder Rios, Igor M. Coelho, Luiz Satoru Ochi, Cristina Boeres, more

2016 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW) > 49 - 54

In combinatorial optimization problems, the neighborhood search (NS) is a fundamental component for local search based heuristics. It consists of selecting a solution from a high cardinality set of neighbor solutions, by means of operations called moves. To perform this search, NS algorithms usually adopt two main approaches: selecting the first or best improving move. The Multi Improvement (MI) strategy...
