CNNs (Convolutional Neural Networks) have demonstrated superior results in a wide range of applications. However, the time-consuming convolution operations required by CNNs pose great challenges to designers. GPGPUs (General-Purpose Graphics Processing Units) have been widely used to exploit the massive parallelism of convolution operations. This paper proposes a software-based loop-unrolling technique...
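The abstract is truncated, so the paper's GPU-specific technique is not shown here; the following is only a minimal sketch of the general loop-unrolling idea applied to a 1-D convolution, with hypothetical function names:

```python
def conv1d_naive(signal, kernel):
    # Straightforward convolution: one multiply-add per inner iteration.
    n, k = len(signal), len(kernel)
    out = [0.0] * (n - k + 1)
    for i in range(len(out)):
        acc = 0.0
        for j in range(k):
            acc += signal[i + j] * kernel[j]
        out[i] = acc
    return out

def conv1d_unrolled4(signal, kernel):
    # Same computation with the inner loop unrolled by a factor of 4,
    # reducing loop-control overhead (assumes kernel length % 4 == 0).
    n, k = len(signal), len(kernel)
    assert k % 4 == 0
    out = [0.0] * (n - k + 1)
    for i in range(len(out)):
        acc = 0.0
        for j in range(0, k, 4):
            acc += (signal[i + j]     * kernel[j]
                  + signal[i + j + 1] * kernel[j + 1]
                  + signal[i + j + 2] * kernel[j + 2]
                  + signal[i + j + 3] * kernel[j + 3])
        out[i] = acc
    return out
```

On a GPU the same transformation is typically applied inside each thread's loop body, where fewer branches and more independent multiply-adds help the compiler schedule instructions.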
In state-of-the-art Software Transactional Memory (STM) systems, threads carry out the execution of transactions as non-interruptible tasks. Hence, a thread can react to the injection of a higher priority transactional task and take care of its processing only at the end of the currently executed transaction. In this article we pursue a paradigm shift where the execution of an in-memory transaction...
In this paper we examine separate aspects of the design and analysis of quantum circuit specifications on computing platforms with parallelism support. The idea of multi-threaded programming for quantum circuit synthesis and selection is presented, taking into consideration the subsequent use of the Linear Nearest Neighbor (LNN) approach. Our experience on several multi-core platforms is reported.
Stencil computation arises from a broad set of scientific and engineering applications and often plays a critical role in the performance of extreme-scale simulations. Due to its memory-bound nature, it is a challenging task to optimize stencil computation kernels on modern supercomputers, which offer relatively high computing throughput but relatively low data-moving capability. This work serves as...
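For readers unfamiliar with the term, a stencil updates each grid point from a fixed neighborhood of points; the memory-bound character comes from touching several neighbors per point while doing only a few arithmetic operations. A minimal 1-D example (not the paper's kernel):

```python
def jacobi_step(u):
    # One sweep of a 3-point stencil: each interior point becomes the
    # average of itself and its two neighbors; boundary values are fixed.
    # Three loads per point for three flops -- a low arithmetic intensity
    # that makes performance bound by memory bandwidth.
    v = u[:]
    for i in range(1, len(u) - 1):
        v[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
    return v
```

Real kernels are 2-D or 3-D and are optimized with tiling and blocking to reuse loaded neighbors from fast memory, which is exactly the data-movement problem the abstract refers to.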
This paper presents an extension of the Valgrind framework for dynamic binary code analysis to support the MIPS MSA instruction set, which includes instructions for vector (SIMD) processing of integer and floating-point data of different widths. First, background on MIPS and its MSA extension is given. Then, Valgrind's features for code instrumentation are described. Several changes have been made to Valgrind...
Data movement is increasingly becoming the bottleneck of both performance and energy efficiency in modern computation. Until recently, there was limited freedom for communication optimization on GPUs, as conventional GPUs only provided two methods for inter-thread communication: using shared memory or global memory. However, a new warp shuffle instruction has been introduced...
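Warp shuffle lets threads in the same warp exchange register values directly, without a round trip through shared memory. A common use is the butterfly reduction; the sketch below models that communication pattern in plain Python (lane indices standing in for threads, as with CUDA's `__shfl_xor_sync`), and is an illustration of the pattern, not of the paper's optimizations:

```python
def butterfly_reduce(lanes):
    # Butterfly (XOR) reduction across a warp: at each step, every lane
    # adds the value held by the lane whose index differs in one bit.
    # After log2(width) steps, every lane holds the full sum.
    vals = lanes[:]
    width = len(vals)          # assumed a power of two, e.g. a 32-thread warp
    offset = width // 2
    while offset > 0:
        vals = [vals[i] + vals[i ^ offset] for i in range(width)]
        offset //= 2
    return vals
```

Because each exchange is register-to-register, this pattern avoids the shared-memory traffic and synchronization that a conventional tree reduction would require.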
The capability of GPUs to accelerate general-purpose applications that can be parallelized into a massive number of threads makes it promising to apply GPUs to real-time applications as well, where high throughput and intensive computation are also needed. However, due to the different architecture and programming model of GPUs, the worst-case execution time (WCET) analysis methods and techniques designed...
A new approach to introducing computer architecture in introductory logic design courses based on Babbage's difference engine is described. The difference engine is the thread that students follow from logic design to the concepts associated with a basic load/store microprocessor architecture.
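The pedagogical appeal of the difference engine is that it tabulates any polynomial using only repeated addition, which maps naturally onto registers and adders. A small sketch of the method of finite differences (hypothetical function name, not from the course materials):

```python
def difference_engine(initial, steps):
    # Tabulates a polynomial using only additions, as Babbage's engine does.
    # `initial` holds f(0) followed by its leading finite differences;
    # each step updates the registers top-down, so each register adds the
    # not-yet-updated difference below it, then emits the new f value.
    regs = initial[:]
    table = [regs[0]]
    for _ in range(steps):
        for i in range(len(regs) - 1):
            regs[i] += regs[i + 1]
        table.append(regs[0])
    return table
```

For f(x) = x², the initial registers are [f(0), Δf(0), Δ²f(0)] = [0, 1, 2], and the engine produces 0, 1, 4, 9, 16, ... with no multiplications — a concrete bridge from adder circuits to a load/store datapath.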
Today's supercomputers are moving towards deployment of many-core processors like Intel Xeon Phi Knights Landing (KNL), to deliver high compute and memory capacity. Applications executing on such many-core platforms with improved vectorization require high memory bandwidth. To improve performance, architectures like Knights Landing include a high bandwidth and low capacity in-package high bandwidth...
Many of today's important everyday applications, e.g. weather forecasting, the design of plane and car shapes, medical analysis, or even search-engine queries, depend on massively parallel computer programs that are executed in data centers hosting thousands of computers. A large amount of electrical energy is used to power them, and it is of primary importance to compute more efficiently to sustain...
MSA (Multiple Sequence Alignment) is a challenging task in the bioinformatics domain. Many methods are available to align the input sequences, and each method produces a different output for the same input. Among the many alignment tools, MSAProbs is one of the fastest available. However, the tool is not optimal: it contains defects that limit its efficiency. To optimize MSAProbs...
K-means is a compute-intensive iterative algorithm; each iteration consists of two steps: data assignment and recalculation of the K centroids. In order to accelerate the compute-intensive portions of k-means, the data assignment and centroid recalculation steps are offloaded to the GPU in parallel. Only the initialization and convergence-test steps are performed by the CPU. In addition, this new version...
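The two per-iteration steps described above can be sketched as follows — a minimal 1-D version with a hypothetical function name, showing the structure that is offloaded to the GPU rather than the GPU implementation itself:

```python
def kmeans_step(points, centroids):
    # Step 1 (data assignment): each point goes to its nearest centroid.
    # This is embarrassingly parallel -- one thread per point on a GPU.
    assign = [min(range(len(centroids)),
                  key=lambda c: abs(p - centroids[c])) for p in points]
    # Step 2 (centroid recalculation): each centroid becomes the mean of
    # its assigned points (kept unchanged if no points were assigned).
    new_centroids = []
    for c in range(len(centroids)):
        members = [p for p, a in zip(points, assign) if a == c]
        new_centroids.append(sum(members) / len(members) if members
                             else centroids[c])
    return assign, new_centroids
```

The CPU-side driver would call this in a loop until the assignments (or centroids) stop changing — the convergence test the abstract keeps on the host.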
Optimization, portability and development of GPGPU applications are not trivial tasks, since the capabilities and organization of GPU processing elements and memory subsystem greatly differ from the traditional CPU concepts, as well as among different GPU architectures. This work goes a step further in aiding this process by delivering a set of visual models that can be used by GPU programmers to...
Pre-silicon simulation is one of the key toolsets for computer architects to evaluate and optimize their future designs. As Graphics Processing Units (GPUs) have become the platform of choice in many computing communities due to their impressive processing capabilities, computer architecture researchers need a simulation framework that allows them to quantitatively consider design tradeoffs. In this...
Modern GPUs feature complex memory system designs. One GPU may contain many types of memory of different properties. The best way to place data in memory is sensitive to many factors (e.g., program inputs, architectures), making portable optimizations of GPU data placement a difficult challenge. PORPLE is a recently proposed method that overcomes the difficulties by enabling online optimizations of...
As process technology scales, electronic devices become more susceptible to soft errors induced by radiation. The stack in memory implements procedure calls, and its behavior under soft errors has not yet been studied. To analyze the effects of soft errors on stack behavior, we conduct a series of fault-injection experiments on the IA-32 instruction set architecture. The injection targets are the...
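The core operation of such a campaign is the injected fault itself — typically a single-event upset modeled as one flipped bit in a stored value. A minimal sketch of that primitive (hypothetical helper, not the paper's injection harness):

```python
def inject_bit_flip(value, bit, width=32):
    # Model a single-event upset: flip one bit of a stored word
    # (e.g. a register, return address, or stack slot), masked to
    # the machine word width so the result stays representable.
    return (value ^ (1 << bit)) & ((1 << width) - 1)
```

A campaign then runs the workload many times, each run flipping a randomly chosen bit at a randomly chosen point, and classifies the outcome (masked, wrong result, crash) — flipping the same bit twice restores the original value, which is why XOR is the natural model.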
The IBM Power9 processor has an enhanced core and chip architecture that provides superior thread performance and higher throughput. The core and chip architectures are optimized for emerging workloads to support the needs of next-generation computing. Multiple variants of silicon target the scale-out and scale-up markets. With a new core microarchitecture design, along with an innovative I/O fabric...
Dynamic Binary Translation (DBT) is often used in hardware/software co-design to take advantage of one architecture model while using binaries from another. The co-development of the DBT engine and of the execution architecture leads to architectures with special support for these mechanisms. In this work, we propose a hardware-accelerated Dynamic Binary Translation where the first steps of the DBT...
We design and implement Mars, a MapReduce framework, on graphics processors (GPUs). MapReduce is a distributed programming framework originally proposed by Google for the ease of development of web search applications on a large number of commodity CPUs. Compared with CPUs, GPUs have an order of magnitude higher computation power and memory bandwidth, but are harder to program since their architectures...
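The programming model Mars implements can be summarized in a few lines — a sketch of MapReduce itself (word count as the usual example), not of the Mars GPU runtime:

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    # Minimal MapReduce skeleton: map each record to (key, value) pairs,
    # group values by key, then reduce each group independently.
    groups = defaultdict(list)
    for rec in records:
        for key, val in map_fn(rec):
            groups[key].append(val)
    return {k: reduce_fn(vs) for k, vs in groups.items()}

counts = map_reduce(["a b a", "b c"],
                    lambda line: [(w, 1) for w in line.split()],
                    sum)
# counts == {"a": 2, "b": 2, "c": 1}
```

The appeal for GPUs is that both the map phase and each key's reduce are independent, so they parallelize across thousands of threads; the hard part, which frameworks like Mars address, is the grouping stage on hardware without dynamic memory allocation in kernels.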
The integrated architecture that features both CPU and GPU on the same die is an emerging and promising architecture for fine-grained CPU-GPU collaboration. However, the integration also brings forward several programming and system optimization challenges, especially for irregular applications. The complex interplay between heterogeneity and irregularity leads to very low processor utilization of...