Many neural architectures, including RBF, SVM, and FSVC classifiers, as well as deep-learning solutions, require the efficient implementation of neuron layers, each having a given number m of neurons and a specific set of parameters, and operating on a training or test set of N feature vectors, each of dimension n. Herein we investigate how to allocate the computation on GPU kernels and how to better...
All semiconductor market domains are converging to concurrent platforms. This trend has created a real challenge: developing application software that effectively uses these concurrent processors to achieve efficiency and performance goals. This paper argues that computer-systems-related courses are natural places to introduce parallelism, and that the earlier students are exposed to parallel computing concepts...
Software Transactional Memory (STM) allows encapsulating shared-data accesses within transactions, executed with atomicity and isolation guarantees. The assessment of the consistency of a running transaction is performed by the STM layer at specific points of its execution, such as when a read or write access to a shared object occurs, or upon a commit attempt. However, performance and energy efficiency...
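The validation points this abstract describes — on each shared-data access and again at commit — can be illustrated with a minimal versioned-STM sketch in Python. All class and method names here are hypothetical stand-ins, not the paper's implementation:

```python
import threading

class Conflict(Exception):
    """Raised when a transaction's read set is no longer consistent."""

class Tx:
    def __init__(self):
        self.reads = {}    # key -> version observed at first read
        self.writes = {}   # key -> buffered value

class STM:
    """Minimal versioned STM: every shared cell carries a version counter.
    Consistency is checked incrementally on each read and once more at commit."""
    def __init__(self):
        self._lock = threading.Lock()
        self._values = {}
        self._versions = {}

    def read(self, tx, key):
        with self._lock:
            seen = tx.reads.setdefault(key, self._versions.get(key, 0))
            if self._versions.get(key, 0) != seen:
                raise Conflict()          # validation at a read access
            return tx.writes.get(key, self._values.get(key))

    def write(self, tx, key, value):
        tx.writes[key] = value            # buffered until commit

    def commit(self, tx):
        with self._lock:
            for key, seen in tx.reads.items():
                if self._versions.get(key, 0) != seen:
                    raise Conflict()      # final validation at commit time
            for key, value in tx.writes.items():
                self._values[key] = value
                self._versions[key] = self._versions.get(key, 0) + 1
```

A transaction that observed a cell later overwritten by a concurrent committer fails its commit-time validation and can be retried, which is where the performance and energy costs discussed in the paper arise.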
GPU-based clusters are widely chosen for accelerating a variety of scientific applications in high-end cloud environments. With their growing popularity, there is a necessity for improving the system throughput and decreasing the turnaround time for co-executing applications on the same GPU device. However, resource contention among multiple applications on a multi-tasked GPU leads to the performance...
This paper explains a general-purpose approach to parallel pixel processing on the GPU. It presents the essential dataset structuring, correct type assignment, and kernel configuration for the CUDA application interface. The paper also explains data movement and optimal computation saturation. Transfers are analyzed in correlation with the computation, especially for the embarrassingly parallel problem...
Performance modeling plays an important role in optimal hardware design and optimized application implementation. This paper presents a very low overhead performance model, called VLAG, to approximate the data localities exploited by GPU kernels. VLAG takes source-code-level information as input to estimate per-memory-access-instruction, per-data-array, and per-kernel localities within GPU kernels. VLAG...
GPUs continue to increase the number of compute resources with each new generation. Many data-parallel applications have been re-engineered to leverage the thousands of cores on the GPU. But not every kernel can fully utilize all the resources available. Many applications contain multiple kernels that could potentially be run concurrently. To better utilize the massive resources on the GPU, device...
To ensure robustness of integrated systems, the TRAnsition-X (TRAX) fault model has been used with on-chip test and diagnosis hardware, utilizing fault dictionaries for diagnosis. Generating a fault dictionary requires fault simulation with no fault dropping, requiring extensive computational resources. This paper presents the design and implementation of an efficient fault simulator for the TRAX...
In this paper we present MLTBiqCrunch, a hierarchically parallelized version of the open-source solver BiqCrunch [1]. More precisely, this version has two levels of parallelization: a coarse grain, which assigns a thread to each node evaluation, and a fine grain, which parallelizes a node evaluation when some threads are idle. We present experiments on some classical binary quadratic optimization problems...
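The two-level scheme this abstract describes can be sketched generically in Python with nested thread pools. The node-evaluation workload below is a hypothetical placeholder, not BiqCrunch's actual bounding computation:

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate_node(node, fine_pool=None):
    # Hypothetical node evaluation: the bound is a sum of independent parts.
    parts = [range(i, 40, 4) for i in range(4)]
    if fine_pool is not None:
        # Fine grain: split one node's evaluation across otherwise idle threads.
        return node + sum(fine_pool.map(sum, parts))
    return node + sum(sum(p) for p in parts)

def solve(nodes, coarse_threads=2, fine_threads=2):
    # Coarse grain: one task per branch-and-bound node evaluation.
    with ThreadPoolExecutor(coarse_threads) as coarse, \
         ThreadPoolExecutor(fine_threads) as fine:
        return list(coarse.map(lambda n: evaluate_node(n, fine), nodes))
```

Using two separate pools keeps the sketch deadlock-free: coarse-grain tasks never compete with the fine-grain subtasks they spawn for the same workers.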
Fault-tolerance is becoming increasingly important as we enter the era of exascale computing. Increasing the number of cores results in a smaller mean time between failures, and consequently, higher probability of errors. Among the different software fault tolerance techniques, checkpoint/restart is the most commonly used method in supercomputers, the de-facto standard for large-scale systems. Although...
This paper presents the design and implementation of a fault simulator for the TRAnsition-X fault model (TRAX for short) on a graphics processing unit (GPU). Fault dictionaries are an important aspect of on-chip fault detection and diagnosis. Generating a fault dictionary requires fault simulation with no fault dropping, requiring extensive computational resources. The inherent parallelism of the...
We propose a design for a fine-grained lock-based skiplist optimized for Graphics Processing Units (GPUs). While GPUs are often used to accelerate streaming parallel computations, it remains a significant challenge to efficiently offload concurrent computations with more complicated data-irregular access and fine-grained synchronization. Natural building blocks for such computations would be concurrent...
Most GPU-based graph systems cannot handle large-scale graphs that do not fit in the GPU memory. The ever-increasing graph size demands a scale-up graph system, which can run on a single GPU with optimized memory access efficiency and well-controlled data transfer overhead. However, existing systems either incur redundant data transfers or fail to use shared memory. In this paper we present Graphie,...
SIMD vectors help improve the performance of certain applications. The code gets vectorized into SIMD form either by hand, or automatically with auto-vectorizing compilers. The Superword-Level Parallelism (SLP) vectorization algorithm is a widely used algorithm for vectorizing straight-line code and is part of most industrial compilers. The algorithm attempts to pack scalar instructions into vectors...
Word2Vec is a popular set of machine learning algorithms that use a neural network to generate dense vector representations of words. These vectors have proven to be useful in a variety of machine learning tasks. In this work, we propose new methods to increase the speed of the Word2Vec skip gram with hierarchical softmax architecture on multi-core shared memory CPU systems, and on modern NVIDIA GPUs...
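In the hierarchical softmax architecture that this abstract targets, the probability of a word is a product of sigmoids along its root-to-leaf path in a binary tree over the vocabulary. A minimal sketch of that quantity (the tree scores in the test are arbitrary placeholders):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hs_probability(path_scores, codes):
    """P(word | context) under hierarchical softmax: one sigmoid factor per
    inner node on the word's root-to-leaf path. path_scores[i] is the dot
    product of the i-th inner node's vector with the hidden state; codes[i]
    is the branch taken at that node (0 = left, 1 = right)."""
    p = 1.0
    for score, code in zip(path_scores, codes):
        p *= sigmoid(score) if code == 0 else sigmoid(-score)
    return p
```

Because sigmoid(-s) = 1 - sigmoid(s), the leaf probabilities sum to one by construction, so no normalization over the full vocabulary is needed — the source of hierarchical softmax's speedup over a flat softmax.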
Next-generation memory technologies, which we denote as new memory, are both nonvolatile and byte-addressable. These characteristics are expected to bring changes to the conventional computer system structure. Most previous research on the use of new memory has focused on how to efficiently store files, objects, and data structures while exploiting persistence in new memory. Unlike...
Nowadays, there are many embedded systems with different architectures that have incorporated GPUs. However, it is difficult to develop CPU-GPU embedded systems using component-based development (CBD), since existing CBD approaches have no support for GPU development. In this context, when targeting a particular CPU-GPU platform, the component developer is forced to construct hardware-specific components,...
Tone mapping operators map high dynamic range images so that they can be displayed with a high-dynamic-range appearance on a limited-range medium. However, due to their large computational complexity, sequential CPU implementations of these operators cannot achieve the frame rates needed for real-time video processing. In this paper, we revisit these operators to simplify them so that we can...
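As context for why such operators parallelize well on a GPU, a classic global tone mapping operator is the Reinhard-style mapping L/(1+L), which compresses each luminance independently per pixel. The sketch below is illustrative only, not the paper's simplified operators:

```python
def tonemap_global(luminances):
    """Reinhard-style global operator: compress each luminance L to L/(1+L),
    mapping [0, inf) into [0, 1). Each pixel is independent of every other,
    so the loop maps directly onto one GPU thread per pixel."""
    return [l / (1.0 + l) for l in luminances]
```

Local operators, which adapt to each pixel's neighborhood, are costlier, which is where sequential CPU implementations fall short of real-time frame rates.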
Today, the Convolutional Neural Network (CNN) is adopted in many areas such as computer vision and natural language processing. By employing hardware accelerators such as the graphics processing unit (GPU), significant speedups can be achieved for CNNs, and many studies have proposed such acceleration methods. However, it is not straightforward to parallelize a CNN on a hardware accelerator because...
OpenCL continues to gather momentum on both desktop and mobile devices. The new features of OpenCL 2.0 provide developers with greater expressive power for programming heterogeneous computing environments. Currently, in the experimental simulation environment, gem5-gpu supports only CUDA, but GPGPU-Sim can support OpenCL by compiling OpenCL kernel code to PTX using a real GPU driver. However, this driver compilation...