Search results

Items from 41 to 60 out of 473 results

chapter

Performance Evaluation of Parallel Sparse Tensor Decomposition Implementations

Thomas B. Rolinger, Tyler A. Simon, Christopher D. Krieger

2016 6th Workshop on Irregular Applications: Architecture and Algorithms (IA3) > 54 - 57

Tensor decomposition, the higher-order analogue to singular value decomposition, has emerged as a useful tool for finding relationships in large, sparse, multidimensional data sets. As this technique matures and is applied to increasingly larger data sets, the need for high performance implementations becomes critical. In this work, we perform an objective empirical evaluation of three popular parallel...

chapter

Evaluating and Optimizing the NERSC Workload on Knights Landing

Taylor Barnes, Brandon Cook, Jack Deslippe, Douglas Doerfler, et al.

2016 7th International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS) > 43 - 53

NERSC has partnered with 20 representative application teams to evaluate performance on the Xeon-Phi Knights Landing architecture and develop an application-optimization strategy for the greater NERSC workload on the recently installed Cori system. In this article, we present early case studies and summarized results from a subset of the 20 applications highlighting the impact of important architecture...

chapter

Performance Evaluation of Parallelizing Algorithm Using Spanning Tree for Stream-Based Computing

Guyue Wang, Koichi Wada, Shinichi Yamagiwa

2016 Fourth International Symposium on Computing and Networking (CANDAR) > 497 - 503

This paper presents a detailed performance evaluation of a spanning-tree algorithm that automatically exploits parallelism and determines the execution order of multiple kernel programs in a distributed environment. In stream-based computing, efficient parallel execution requires careful scheduling of the invocation of the kernel programs. By mapping a kernel to a node and an I/O stream between...

chapter

V-Hadoop: Virtualized Hadoop using containers

Srihari Radhakrishnan, Bryan J. Muscedere, Khuzaima Daudjee

2016 IEEE 15th International Symposium on Network Computing and Applications (NCA) > 237 - 241

MapReduce is a popular programming model used to process large amounts of data by exploiting parallelism. Open-source implementations of MapReduce such as Hadoop are generally best suited for large, homogeneous clusters of commodity machines. However, many businesses cannot afford to invest in such infrastructure and others are reluctant to use cloud services due to data security and privacy concerns...

chapter

Tuning Stencil codes in OpenCL for FPGAs

Qi Jia, Huiyang Zhou

2016 IEEE 34th International Conference on Computer Design (ICCD) > 249 - 256

OpenCL is designed as a parallel programming framework to support heterogeneous computing platforms. The implicit or explicit parallelism in OpenCL kernel code enables efficient FPGA implementation from a high-level programming abstraction. However, FPGA architecture is completely different from GPU architecture, for which OpenCL is widely used. Tuning OpenCL codes to achieve high performance on FPGAs...

chapter

Memos: A full hierarchy hybrid memory management framework

Lei Liu, Hao Yang, Yong Li, Mengyao Xie, et al.

2016 IEEE 34th International Conference on Computer Design (ICCD) > 368 - 371

In this paper, we introduce memos, which integrates suitable memory management policies and schedules resources over the entire memory hierarchy in a hybrid memory system. Powered by an OS kernel-level monitoring tool, memos captures memory patterns online and then leverages them to guide memory page placement and data mapping. Experimental results show, on average, memos can benefit memory utilization,...

chapter

Hyperspectral image classification using a parallel implementation of the linear SVM on a Massively Parallel Processor Array (MPPA) platform

D. Madronal, R. Lazcano, H. Fabelo, S. Ortega, et al.

2016 Conference on Design and Architectures for Signal and Image Processing (DASIP) > 154 - 160

This paper presents a study of the parallel exploitation of a Support Vector Machine (SVM) classifier with a linear kernel running on a Massively Parallel Processor Array (MPPA) platform. The system combines 256 cores working in parallel, grouped into 16 clusters. The main objective of the research has been to develop an optimal implementation of the SVM classifier on an MPPA platform whilst...

chapter

A Benchmark on Multi Improvement Neighborhood Search Strategies in CPU/GPU Systems

Eyder Rios, Igor M. Coelho, Luiz Satoru Ochi, Cristina Boeres, et al.

2016 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW) > 49 - 54

In combinatorial optimization problems, the neighborhood search (NS) is a fundamental component for local search based heuristics. It consists of selecting a solution from a high cardinality set of neighbor solutions, by means of operations called moves. To perform this search, NS algorithms usually adopt two main approaches: selecting the first or best improving move. The Multi Improvement (MI) strategy...

chapter

Profiling-based task graph extraction on multiprocessor system-on-chip

Sodam Han, Yonghee Yun, Young Hwan Kim

2016 IEEE Asia Pacific Conference on Circuits and Systems (APCCAS) > 510 - 513

This paper proposes a profiling-based method to extract a task graph that describes the system behavior of a multiprocessor system-on-chip running Android OS. The proposed method computes the resource usage of each task and extracts dependencies among tasks from run-time system profiling results. It calculates the CPU resource usage and I/O waiting time of each task by analyzing CPU...

chapter

Characterizing Performance and Power towards Efficient Synchronization of GPU Kernels

Islam Harb, Wu-Chun Feng

2016 IEEE 24th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS) > 451 - 456

The lack of support for explicit synchronization between the streaming multiprocessors (SMs) in GPUs adversely impacts the ability of GPUs to perform inter-block communication efficiently. In this paper, we present several approaches to inter-block synchronization using explicit/implicit CPU-based and dynamic parallelism (DP) mechanisms. Although this topic has been addressed in previous...

chapter

Designing and Enabling Simulation of Real-World GPU Network Applications with ns-3 and DCE

Jared Ivey, George Riley, Brian Swenson, Margaret Loper

2016 IEEE 24th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS) > 445 - 450

The ability to execute the original source code for network protocols and applications within a network simulation environment frees the simulation modeler from the time-consuming task of creating, testing, and debugging models that represent these applications. This work extends the functionality of the Direct Code Execution (DCE) framework of ns-3 by incorporating the ability to call NVIDIA CUDA kernels...

chapter

Unified and lightweight tasks and conduits: A high level parallel programming framework

Chao Liu, Miriam Leeser

2016 IEEE High Performance Extreme Computing Conference (HPEC) > 1 - 7

Computing platforms for high performance and parallel applications have changed rapidly during the past few years, from single to multiple cores, and from traditional Central Processing Units (CPUs) to hybrid systems that combine CPUs with accelerators such as Graphics Processing Units (GPUs), Intel Xeon Phi, etc. These developments bring more and more challenges to application developers, especially...

chapter

Performance evaluation of the parallel object tracking algorithm employing the particle filter

Grzegorz Szwoch

2016 Signal Processing: Algorithms, Architectures, Arrangements, and Applications (SPA) > 119 - 124

An algorithm based on particle filters is employed to track moving objects in video streams from fixed and non-fixed cameras. Particle weighting is based on color histograms computed in the iHLS color space. Particle computations are parallelized with the CUDA framework. The algorithm was tested on various GPU devices: a desktop GPU card, a mobile chipset, and two embedded GPU platforms. The processing...

chapter

Kokkos/Qthreads task-parallel approach to linear algebra based graph analytics

Michael M. Wolf, H. Carter Edwards, Stephen L. Olivier

2016 IEEE High Performance Extreme Computing Conference (HPEC) > 1 - 7

The Graph BLAS effort to standardize a set of graph algorithm building blocks in terms of linear algebra primitives promises to deliver high-performing graph algorithms and greatly impact the analysis of big data. However, there are challenges with this approach, which our data analytics miniapp miniTri exposes. In this paper, we improve upon a previously proposed task-parallel approach to linear...

chapter

Efficient HEVC decoder for heterogeneous CPU with GPU systems

Biao Wang, Mauricio Alvarez-Mesa, Chi Ching Chi, Ben Juurlink, et al.

2016 IEEE 18th International Workshop on Multimedia Signal Processing (MMSP) > 1 - 6

The High Efficiency Video Coding (HEVC) standard provides higher compression efficiency than other video coding standards but at the cost of increased computational load, which makes it hard to achieve real-time encoding/decoding of high-resolution, high-quality video sequences. In this paper, we investigate how Graphics Processing Units (GPUs) can be employed to accelerate HEVC decoding. GPUs are...

chapter

Basic k-mer operations using massive parallel processing on heterogeneus architectures

Nelson Enrique Vera-Parra, Cristian Alejandro Rojas-Quintero, Jose Nelson Perez-Castillo

2016 7th IEEE International Conference on Software Engineering and Service Science (ICSESS) > 193 - 196

This article presents and assesses a massively parallel processing model for basic operations on k-mers from genomic sequences, based on functions defined in terms of N-dimensional spaces. The model is implemented using a set of OpenCL cores available at github.com/bioinfud/k-merscl and assessed using a heterogeneous CPU/GPU platform and a dataset of randomly generated k-mers. The results...

chapter

Bridging the FPGA programmability-portability Gap via automatic OpenCL code generation and tuning

Konstantinos Krommydas, Ruchira Sasanka, Wu-chun Feng

2016 IEEE 27th International Conference on Application-specific Systems, Architectures and Processors (ASAP) > 213 - 218

Programming FPGAs has been an arduous task that requires extensive knowledge of hardware design languages (HDLs), such as Verilog or VHDL, and low-level hardware details. With OpenCL support for FPGAs, the design, prototyping and implementation of an FPGA is increasingly moving towards a much higher level of abstraction, when compared to the intrinsically low-level nature of HDLs. On the other hand,...

chapter

Parallel adaptive sparsity-constrained NMF algorithm for hyperspectral unmixing

Wenhong Wang, Yuntao Qian

2016 IEEE International Geoscience and Remote Sensing Symposium (IGARSS) > 6137 - 6140

Sparsity-constrained nonnegative matrix factorization (NMF) has proven to be an effective method for hyperspectral unmixing. However, the optimization procedure of sparsity-constrained NMF is computationally demanding, which may limit its application in time-constrained conditions. In this paper, a parallel L_1/2 sparsity-constrained NMF unmixing method on Graphics Processing Units (GPUs) is proposed,...

chapter

Empowering OpenMP with automatically generated hardware

Artur Podobas, Mats Brorsson

2016 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (SAMOS) > 245 - 252

OpenMP enables productive software development that targets shared-memory general purpose systems. However, OpenMP compilers today have little support for future heterogeneous systems — systems that will more than likely contain Field Programmable Gate Arrays (FPGAs) to compensate for the lack of parallelism available in general purpose systems. We have designed a high-level synthesis flow that automatically...

chapter

Architecture exploration of a programmable neural network processor for embedded systems

Wonyong Sung, Jinhwan Park

2016 International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (SAMOS) > 124 - 131

Deep neural network algorithms show very high performance; however, their increased arithmetic and memory access demands hinder their adoption in embedded systems. This paper explores a programmable neural network processing architecture that can efficiently execute feed-forward, recurrent, and convolutional deep neural networks. The neural network algorithms are transformed to matrix-vector multiplication...

Keywords:
KERNEL
PARALLEL PROCESSING

INFONA - science communication portal
