Search results

Items from 21 to 40 out of 473 results

chapter

MOCHA: Morphable Locality and Compression Aware Architecture for Convolutional Neural Networks

Syed Mohammad Asad Hassan Jafri, Ahmed Hemani, Kolin Paul, Naeem Abbas

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) > 276 - 286

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Today, machine learning based on neural networks has become mainstream, in many application domains. A small subset of machine learning algorithms, called Convolutional Neural Networks (CNN), are considered as state-ofthe- art for many applications (e.g. video/audio classification). The main challenge in implementing the CNNs, in embedded systems, is their large computation, memory, and bandwidth...

chapter

Parameterized Diamond Tiling for Parallelizing stencil computations

T. Wijesinghe, K. Senevirathne, C. Siriwardhana, W. Visitha, more

2017 Moratuwa Engineering Research Conference (MERCon) > 99 - 104

2017 Moratuwa Engineering Research Conference (MERCon)

Loop tiling is a useful technique used to achieve cache optimization in scientific computations. However, general loop tiling techniques usually fail to improve parallelism in certain scientific computations due to dependences among execution steps. In this paper we implement and experiment on a tiling technique known as Parameterized Diamond Tiling designed based on the data dependences in the program...

chapter

Static Versus Dynamic Task Scheduling of the Lu Factorization on ARM big. LITTLE Architectures

Sandra Catalan, Rafael Rodriguez-Sanchez, Enrique S. Quintana-Orti, Jose R. Herrero

2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) > 733 - 742

2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)

We investigate several parallel algorithmic variants of the LU factorization with partial pivoting (LUpp) that trade off the exploitation of increasing levels of task-parallelism in exchange for a more cache-oblivious execution. In particular, our first variant corresponds to the classical implementation of LUpp in the legacy version of LAPACK, which constrains the concurrency exploited to that intrinsic...

chapter

Exploring optimized accelerator design for binarized convolutional neural networks

Kodai Ueyoshi, Kota Ando, Kentaro Orimo, Masayuki Ikebe, more

2017 International Joint Conference on Neural Networks (IJCNN) > 2510 - 2516

2017 International Joint Conference on Neural Networks (IJCNN)

The convolutional neural network (CNN) is a state-of-the-art model that can achieve significantly high accuracy in many machine-learning tasks. Recently, for further developing the practical applications of CNNs, efficient hardware platforms for accelerating CNN have been throughly studied. A binarized neural network has been reported to minimize the multipliers, which consume a large amount of resources,...

chapter

Scalable Parallelization of a Markov Coalescent Genealogy Sampler

Philip E. Davis, Adam M. Terwilliger, David Zeitler, Greg Wolffe

2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) > 293 - 302

2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)

Coalescent genealogy samplers are effective tools for the study of population genetics. They are used to estimate the historical parameters of a population based upon the sampling of present-day genetic information. A popular approach employs Markov chain Monte Carlo (MCMC) methods. While effective, these methods are very computationally intensive, often taking weeks to run. Although attempts have...

chapter

Sparse Supernodal Solver Using Block Low-Rank Compression

Gregoire Pichon, Eric Darve, Mathieu Faverge, Pierre Ramet, more

2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) > 1138 - 1147

2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)

This paper presents two approaches using a Block Low-Rank (BLR) compression technique to reduce the memory footprint and/or the time-to-solution of the sparse supernodal solver PASTIX. This flat, non-hierarchical, compression method allows to take advantage of the low-rank property of the blocks appearing during the factorization of sparse linear systems, which come from the discretization of partial...

chapter

Analyzing OpenCL 2.0 workloads using a heterogeneous CPU-GPU simulator

Li Wang, Ren-Wei Tsai, Shao-Chung Wang, Kun-Chih Chen, more

2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) > 127 - 128

2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)

Heterogeneous CPU-GPU systems have recently emerged as an energy-efficient computing platform. A robust integrated CPU-GPU simulator is essential to facilitate researches in this direction. While few integrated CPU-GPU simulators are available, similar tools that support OpenCL 2.0, a widely used new standard with promising heterogeneous computing features, are currently missing. In this paper, we...

chapter

Real-Time 3D Ball Tracking with CPU-GPU Acceleration Using Particle Filter with Multi-command Queues and Stepped Parallelism Iteration

Yilin Hou, Xina Cheng, Takeshi Ikenaga

2017 2nd International Conference on Multimedia and Image Processing (ICMIP) > 235 - 239

2017 2nd International Conference on Multimedia and Image Processing (ICMIP)

3D ball tracking is a critical function in manyapplications such as game and players behavior analysis, andreal time implementation has become increasingly importantfor it can be used for live broadcast and TV contents. To reacha high accuracy, algorithms usually are time consuming due toa large set of calculations which is challenging to meet realtime demanding. This paper proposes multiple commandqueues,...

chapter

Controlled Kernel Launch for Dynamic Parallelism in GPUs

Xulong Tang, Ashutosh Pattnaik, Huaipan Jiang, Onur Kayiran, more

2017 IEEE International Symposium on High Performance Computer Architecture (HPCA) > 649 - 660

2017 IEEE International Symposium on High Performance Computer Architecture (HPCA)

Dynamic parallelism (DP) is a promising feature for GPUs, which allows on-demand spawning of kernels on the GPU without any CPU intervention. However, this feature has two major drawbacks. First, the launching of GPU kernels can incur significant performance penalties. Second, dynamically-generated kernels are not always able to efficiently utilize the GPU cores due to hardware-limits. To address...

chapter

FlexFlow: A Flexible Dataflow Accelerator Architecture for Convolutional Neural Networks

Wenyan Lu, Guihai Yan, Jiajun Li, Shijun Gong, more

2017 IEEE International Symposium on High Performance Computer Architecture (HPCA) > 553 - 564

2017 IEEE International Symposium on High Performance Computer Architecture (HPCA)

Convolutional Neural Networks (CNN) are verycomputation-intensive. Recently, a lot of CNN accelerators based on the CNN intrinsic parallelism are proposed. However, we observed that there is a big mismatch between the parallel types supported by computing engine and the dominant parallel types of CNN workloads. This mismatch seriously degrades resource utilization of existing accelerators. In this...

chapter

Developing dynamic profiling and debugging support in OpenCL for FPGAs

Anshuman Verma, Huiyang Zhou, Skip Booth, Robbie King, more

2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC) > 1 - 6

2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC)

With FPGAs emerging as a promising accelerator for general-purpose computing, there is a strong demand to make them accessible to software developers. Recent advances in OpenCL compilers for FPGAs pave the way for synthesizing FPGA hardware from OpenCL kernel code. To enable broader adoption of this paradigm, significant challenges remain. This paper presents our efforts in developing dynamic profiling...

chapter

Latency-aware packet processing on CPU-GPU heterogeneous systems

Arian Maghazeh, Unmesh D. Bordoloi, Usman Dastgeer, Alexandru Andrei, more

2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC) > 1 - 6

2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC)

In response to the tremendous growth of the Internet, towards what we call the Internet of Things (IoT), there is a need to move from costly, high-time-to-market specific-purpose hardware to flexible, low-time-to-market general-purpose devices for packet processing. Among several such devices, GPUs have attracted attention in the past, mainly because the high computing demand of packet processing...

chapter

Work-in-progress: REDEFINE – a case for WCET-friendly hardware accelerators for real time applications

Kavitha Madhu, Tarun Singla, S K Nandy, Ranjani Narayan, more

2017 International Conference on Compilers, Architectures and Synthesis For Embedded Systems (CASES) > 1 - 2

2017 International Conference on Compilers, Architectures and Synthesis For Embedded Systems (CASES)

REDEFINE is a distributed dynamic dataow architecture, designed for exploiting parallelism at various granularities as an embedded system-on-chip (SoC). is paper dwells on the exibility of REDEFINE architecture and its execution model in accelerating real-time applications coupled with a WCET analyzer that computes execution time bounds of real time applications.

chapter

Caffeinated FPGAs: FPGA framework For Convolutional Neural Networks

Roberto DiCecco, Griffin Lacey, Jasmina Vasiljevic, Paul Chow, more

2016 International Conference on Field-Programmable Technology (FPT) > 265 - 268

2016 International Conference on Field-Programmable Technology (FPT)

Convolutional Neural Networks (CNNs) have gained significant traction in the field of machine learning, particularly due to their high accuracy in visual recognition. Recent works have pushed the performance of GPU implementations of CNNs showing significant improvements in their classification and training times. With these improvements, many frameworks have become available for implementing CNNs...

chapter

Multiple submodels parallel support vector machine on spark

Chang Liu, Bin Wu, Yi Yang, Zhihong Guo

2016 IEEE International Conference on Big Data (Big Data) > 945 - 950

2016 IEEE International Conference on Big Data (Big Data)

The Support Vector Machine (SVM) is a classical classification algorithm that has a wide range of application. With kernel function, SVM can dispose the datasets that are not linearly separable in their original feature space, making it more flexible in practical use compared with linear model. However, its complexity in training is an obstacle to large-scale dataset handling. This paper proposes...

chapter

An overview of micron's automata processor

Ke Wang, Kevin Angstadt, Chunkun Bo, Nathan Brunelle, more

2016 International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS) > 1 - 3

2016 International Conference on Hardware/Software Codesign and System Synthesis (CODES+ISSS)

Micron's new Automata Processor (AP) architecture exploits the very high and natural level of parallelism found in DRAM technologies to achieve native-hardware implementation of non-deterministic finite automata (NFAs). The use of DRAM technology to implement the NFA states provides high capacity and therefore provide extraordinary parallelism for pattern recognition. In this paper, we give an overview...

chapter

Histogram optimization with CUDA

Keh Kok Yong, Sheera Shaheera Othman Talib

2016 IEEE Industrial Electronics and Applications Conference (IEACon) > 312 - 318

2016 IEEE Industrial Electronics and Applications Conference (IEACon)

Histogram is a popular analytic graphical representation of data distribution resulting from processing a given numerical input data. Although the sequential histogram computation may be simple, it is no longer suitable in processing high volume of data. With recent advancement of high performance computing (HPC), aided by the accelerating growth of General Purpose Graphic Processing Unit (GPGPU),...

chapter

GPU implementation of multi-scale Retinex image enhancement algorithm

Hui Li, Weihao Xie, Xingang Wang, Shousheng Liu, more

2016 IEEE/ACS 13th International Conference of Computer Systems and Applications (AICCSA) > 1 - 5

2016 IEEE/ACS 13th International Conference of Computer Systems and Applications (AICCSA)

Multi-scale Retinex algorithm is an image enhancement algorithm that aims at image reconstruction. The algorithm maintains the high fidelity and the dynamic range compression of the image, so the enhancement effect is obvious. The algorithm exploits a large number of convolution operations to achieve dynamic range compression and color/brightness rendition, and the calculation time increased significantly...

chapter

Optimizing PLASMA Eigensolver on Large Shared Memory Systems

Cheng Liao

2016 7th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA) > 73 - 80

2016 7th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems (ScalA)

Performance of the PLASMA dense symmetric Eigensolver is optimized for large shared memory computer systems using multiple Householder domains for dense to band reduction and a communication reducing kernel for bulge chasing. The mr3-smp code by Petschow and Bientinesi is used for the tridiagonal eigensolution and the eigenvector back-transformations employ a 1D parallel decomposition. The input matrix,...

chapter

A Fast Level-Set Segmentation Algorithm for Image Processing Designed For Parallel Architectures

Julian Gutierrez, Fanny Nina-Paravecino, David Kaeli

2016 6th Workshop on Irregular Applications: Architecture and Algorithms (IA3) > 66 - 69

2016 6th Workshop on Irregular Applications: Architecture and Algorithms (IA3)

Among the many choices to perform image segmentation, Level-Set Methods have demonstrated great potential for unstructured images. However, the usefulness of Level-Set Methods have been limited by their irregular workload characteristics such as high degree of branch divergence and input dependencies, as well as the high computational costs required to solve partial differential equations (PDEs).In...

Keywords:
KERNEL
PARALLEL PROCESSING

Publication date

Set your own date range

Content availability

Available (468)
None (5)

Keywords

INSTRUCTION SETS (149)
GRAPHICS PROCESSING UNITS (132)
GRAPHICS PROCESSING UNIT (98)
COMPUTER ARCHITECTURE (92)
HARDWARE (89)
GPU (82)
COMPUTATIONAL MODELING (73)
CUDA (58)
FIELD PROGRAMMABLE GATE ARRAYS (58)
PROGRAMMING (56)
OPTIMIZATION (53)
COPROCESSORS (50)
ARRAYS (46)
ALGORITHM DESIGN AND ANALYSIS (44)
PROGRAM PROCESSORS (42)
COMPUTER GRAPHIC EQUIPMENT (38)
MEMORY MANAGEMENT (38)
PERFORMANCE EVALUATION (35)
GPGPU (34)
ACCELERATION (33)
MULTIPROCESSING SYSTEMS (32)
BENCHMARK TESTING (31)
REGISTERS (30)
YARN (29)
OPENCL (28)
RUNTIME (26)
PARALLEL PROGRAMMING (24)
BANDWIDTH (23)
FPGA (23)
SYNCHRONIZATION (22)
COMPUTER GRAPHICS (21)
DATA MINING (21)
MULTICORE PROCESSING (21)
PARALLEL COMPUTING (21)
CENTRAL PROCESSING UNIT (18)
LIBRARIES (18)
MICROPROCESSOR CHIPS (18)
PIXEL (18)
THROUGHPUT (18)
IMAGE PROCESSING (17)
PIPELINES (17)
TRAINING (17)
PARALLEL ARCHITECTURES (16)
CONVOLUTION (15)
HEURISTIC ALGORITHMS (15)
COMPUTE UNIFIED DEVICE ARCHITECTURE (14)
SPARSE MATRICES (14)
LINUX (13)
SERVERS (13)
SUPPORT VECTOR MACHINES (13)
MULTI-THREADING (12)
RANDOM ACCESS MEMORY (12)
VECTORS (12)
CONTEXT (11)
DATA STRUCTURES (11)
DATABASES (11)
EMBEDDED SYSTEMS (11)
INDEXES (11)
RECONFIGURABLE ARCHITECTURES (11)
TILES (11)
ACCURACY (10)
COMPUTERS (10)
DECODING (10)
GRAPHIC PROCESSING UNIT (10)
MAGNETIC CORES (10)
MATHEMATICAL MODEL (10)
MESSAGE PASSING (10)
MESSAGE SYSTEMS (10)
PARALLEL ALGORITHMS (10)
RESOURCE MANAGEMENT (10)
APPLICATION PROGRAM INTERFACES (9)
DIGITAL SIGNAL PROCESSING (9)
HIGH PERFORMANCE COMPUTING (9)
MICROPROCESSORS (9)
OPENMP (9)
RESOURCE ALLOCATION (9)
SCHEDULING (9)
CPU (8)
ENCODING (8)
FEATURE EXTRACTION (8)
GPU COMPUTING (8)
MULTI-CORE (8)
OPTIMISATION (8)
PARALLEL (8)
PROCESSOR SCHEDULING (8)
REAL-TIME SYSTEMS (8)
SCHEDULES (8)
ANALYTICAL MODELS (7)
BIOINFORMATICS (7)
CLOCKS (7)
GRAPHICS (7)
IMAGE COLOR ANALYSIS (7)
JACOBIAN MATRICES (7)
LINEAR ALGEBRA (7)
MATRIX MULTIPLICATION (7)
SCALABILITY (7)
SIMD (7)
SOFTWARE (7)
more

INFONA - science communication portal

Search results

MOCHA: Morphable Locality and Compression Aware Architecture for Convolutional Neural Networks

Parameterized Diamond Tiling for Parallelizing stencil computations

Static Versus Dynamic Task Scheduling of the Lu Factorization on ARM big. LITTLE Architectures

Exploring optimized accelerator design for binarized convolutional neural networks

Scalable Parallelization of a Markov Coalescent Genealogy Sampler

Sparse Supernodal Solver Using Block Low-Rank Compression

Analyzing OpenCL 2.0 workloads using a heterogeneous CPU-GPU simulator

Real-Time 3D Ball Tracking with CPU-GPU Acceleration Using Particle Filter with Multi-command Queues and Stepped Parallelism Iteration

Controlled Kernel Launch for Dynamic Parallelism in GPUs

FlexFlow: A Flexible Dataflow Accelerator Architecture for Convolutional Neural Networks

Developing dynamic profiling and debugging support in OpenCL for FPGAs

Latency-aware packet processing on CPU-GPU heterogeneous systems

Work-in-progress: REDEFINE – a case for WCET-friendly hardware accelerators for real time applications

Caffeinated FPGAs: FPGA framework For Convolutional Neural Networks

Multiple submodels parallel support vector machine on spark

An overview of micron's automata processor

Histogram optimization with CUDA

GPU implementation of multi-scale Retinex image enhancement algorithm

Optimizing PLASMA Eigensolver on Large Shared Memory Systems

A Fast Level-Set Segmentation Algorithm for Image Processing Designed For Parallel Architectures

Filter options

Publication date

Content availability

Keywords

INFONA - science communication portal

Search results

Add recipient

Sending message cancelled

Are you sure you want to cancel sending this message?

Send message

Filter options

Publication date

Date range setting

Set the date range to filter the displayed results. You can set a starting date, ending date or both. You can enter the dates manually or choose them from the calendar.

Content availability

Keywords

Reporting an error / abuse

Sending the report failed

Accessibility options