Search results

Items from 1 to 20 out of 81 results

chapter

RCU-HTM: Combining RCU with HTM to Implement Highly Efficient Concurrent Binary Search Trees

Dimitrios Siakavaras, Konstantinos Nikas, Georgios Goumas, Nectarios Koziris

2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT) > 1 - 13

2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT)

In this paper we introduce RCU-HTM, a technique that combines Read-Copy-Update (RCU) with Hardware Transactional Memory (HTM) to implement highly efficient concurrent Binary Search Trees (BSTs). Similarly to RCU-based algorithms, we perform the modifications of the tree structure in private copies of the affected parts of the tree rather than in-place. This allows threads that traverse the tree to...

chapter

Understanding the Impact of Fine-Grained Data Sharing and Thread Communication on Heterogeneous Workload Development

Tuan Ta, David Troendle, Xiaoqi Hu, Byunghyun Jang

2017 16th International Symposium on Parallel and Distributed Computing (ISPDC) > 132 - 139

2017 16th International Symposium on Parallel and Distributed Computing (ISPDC)

The conventional OpenCL 1.x style CPU-GPU heterogeneous computing paradigm treats the CPU and GPU processors as loosely connected separate entities. At best each executes independent tasks, but, more commonly, the CPU idles while waiting for results from the GPU. No data-sharing and communications are allowed during kernel execution. This model limits the number of applications that can harness the...

chapter

Analyzing the scalability of managed language applications with speedup stacks

Jennifer B. Sartor, Kristof Du Bois, Stijn Eyerman, Lieven Eeckhout

2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) > 23 - 32

2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)

Understanding the reasons why multi-threaded applications do not achieve perfect scaling on modern multicore hardware is challenging. Furthermore, more and more modern programs are written in managed languages, which have extra service threads (e.g., to perform memory management), which may retard scalability and complicate performance analysis. In this paper, we extend speedup stacks, a previously-presented...

chapter

GPUguard: Towards supporting a predictable execution model for heterogeneous SoC

Bjorn Forsberg, Andrea Marongiu, Luca Benini

Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017 > 318 - 321

2017 Design, Automation & Test in Europe Conference & Exhibition (DATE)

The deployment of real-time workloads on commercial off-the-shelf (COTS) hardware is attractive, as it reduces the cost and time-to-market of new products. Most modern high-end embedded SoCs rely on a heterogeneous design, coupling a general-purpose multi-core CPU to a massively parallel accelerator, typically a programmable GPU, sharing a single global DRAM. However, because of non-predictable hardware...

chapter

A novel way to efficiently simulate complex full systems incorporating hardware accelerators

Tampouratzis Nikolaos, Konstantinos Georgopoulos, Yannis Papaefstathiou

Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017 > 658 - 661

2017 Design, Automation & Test in Europe Conference & Exhibition (DATE)

The breakdown of Dennard scaling coupled with the persistently growing transistor counts severally increased the importance of application-specific hardware acceleration; such an approach offers significant performance and energy benefits compared to general-purpose solutions. In order to thoroughly evaluate such architectures, the designer should perform a quite extensive design space exploration...

chapter

Compiler techniques to reduce the synchronization overhead of GPU redundant multithreading

Manish Gupta, Daniel Lowell, John Kalamatianos, Steven Raasch, more

2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC) > 1 - 6

2017 54th ACM/EDAC/IEEE Design Automation Conference (DAC)

Redundant Multi-Threading (RMT) provides a potentially low cost mechanism to increase GPU reliability by replicating computation at the thread level. Prior work has shown that RMT's high performance overhead stems not only from executing redundant threads, but also from the synchronization overhead between the original and redundant threads. The overhead of inter-thread synchronization can be especially...

chapter

COMET: Communication-optimised multi-threaded error-detection technique

Konstantina Mitropoulou, Vasileios Porpodas, Timothy M. Jones

2016 International Conference on Compliers, Architectures, and Sythesis of Embedded Systems (CASES) > 1 - 10

2016 International Conference on Compliers, Architectures, and Sythesis of Embedded Systems (CASES)

Relentless technology scaling has made transistors more vulnerable to soft, or transient, errors. To keep systems robust against these, current error detection techniques use different types of redundancy at the hardware or the software level. A consequence of these additional protection mechanisms is that these systems tend to become slower. In particular, software error-detection techniques degrade...

chapter

Characterizing Performance and Power towards Efficient Synchronization of GPU Kernels

Islam Harb, Wu-Chun Feng

2016 IEEE 24th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS) > 451 - 456

2016 IEEE 24th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS)

There is a lack of support for explicit synchronization in GPUs between the streaming multiprocessors (SMs) adversely impacts the performance of the GPUs to efficiently perform inter-block communication. In this paper, we present several approaches to inter-block synchronization using explicit/implicit CPU-based and dynamic parallelism (DP) mechanisms. Although this topic has been addressed in previous...

chapter

A Review of Lightweight Thread Approaches for High Performance Computing

Adrian Castello, Antonio J. Pena, Sangmin Seo, Rafael Mayo, more

2016 IEEE International Conference on Cluster Computing (CLUSTER) > 471 - 480

2016 IEEE International Conference on Cluster Computing (CLUSTER)

High-level, directive-based solutions are becoming the programming models (PMs) of the multi/many-core architectures. Several solutions relying on operating system (OS) threads perfectly work with a moderate number of cores. However, exascale systems will spawn hundreds of thousands of threads in order to exploit their massive parallel architectures and thus conventional OS threads are too heavy for...

chapter

GPU concurrency choices in graph analytics

Masab Ahmad, Omer Khan

2016 IEEE International Symposium on Workload Characterization (IISWC) > 1 - 10

2016 IEEE International Symposium on Workload Characterization (IISWC)

Graph analytics is becoming ever more ubiquitous in today's world. However, situational dynamic changes in input graphs, such as changes in traffic and weather patterns, lead to variations in concurrency. Moreover, graph algorithms are known to have data dependent loops and fine-grain synchronization that makes them hard to scale on parallel machines. Recent trends in computing indicate the rise of...

chapter

Multiple core PLC CPU with tight thread synchronization

Adam Milik, Miroslaw Chmiel, Edward Hrynkiewicz

2016 International Conference on Signals and Electronic Systems (ICSES) > 253 - 258

2016 International Conference on Signals and Electronic Systems (ICSES)

The paper presents the architecture of PLC CPU consisting of multiple cores enabling parallel processing of control algorithms. Control programs consist of many program fragments that are suitable for parallel execution. Proposed architecture is constructed from independent logic and arithmetic units. They share common data memories of respective types. In order to enable tight coupling of processing...

chapter

Speculatively exploiting cross-invocation parallelism

Jialu Huang, Prakash Prabhu, Thomas B. Jablin, Soumyadeep Ghosh, more

2016 International Conference on Parallel Architecture and Compilation Techniques (PACT) > 207 - 219

2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)

Automatic parallelization has shown promise in producing scalable multi-threaded programs for multi-core architectures. Most existing automatic techniques parallelize independent loops and insert global synchronization between loop invocations. For programs with many loop invocations, frequent synchronization often becomes the performance bottleneck. Some techniques exploit cross-invocation parallelism...

chapter

Characterizing and optimizing the performance of multithreaded programs under interference

Yong Zhao, Jia Rao, Qing Yi

2016 International Conference on Parallel Architecture and Compilation Techniques (PACT) > 287 - 297

2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)

As virtualization becomes ubiquitous in datacenters, there is a growing interest in characterizing application performance in multi-tenant environments to improve datacenter resource management. The performance of parallel programs is notoriously difficult to reason about in virtualized environments. Although performance degradations caused by virtualization and interferences have been extensively...

chapter

Thrifty-malloc: A HW/SW codesign for the dynamic management of hardware transactional memory in embedded multicore systems

Thomas Carle, Dimitra Papagiannopoulou, Tali Moreshet, Andrea Marongiu, more

2016 International Conference on Compliers, Architectures, and Sythesis of Embedded Systems (CASES) > 1 - 10

2016 International Conference on Compliers, Architectures, and Sythesis of Embedded Systems (CASES)

We present thrifty-malloc: a transaction-friendly dynamic memory manager for high-end embedded multicore systems. The manager combines modularity, ease-of-use and hardware transactional memory (HTM) compatibility in a lightweight and memory-effcient design. Thrifty-malloc is easy to deploy and configure for non-expert programmers, yet provides good performance with low memory overhead for highly-parallel...

chapter

Reproducible floating-point atomic addition in data-parallel environment

David Defour, Sylvain Collange

2015 Federated Conference on Computer Science and Information Systems (FedCSIS) > 721 - 728

2015 Federated Conference on Computer Science and Information Systems (FedCSIS)

Floating-point additions in concurrent execution environment are known to be hazardous, as the result depends on the order in which operations are performed. This problem is encountered in data parallel execution environments such as GPUs, where reproducibility involving floating-point atomic addition is challenging. This problem is due to the rounding error or cancellation that appears for each operation,...

chapter

Effective hardware-level thread synchronization for high performance and power efficiency in application specific multi-threaded embedded processors

Mahanama Wickramasinghe, Hui Guo

2015 33rd IEEE International Conference on Computer Design (ICCD) > 311 - 318

2015 33rd IEEE International Conference on Computer Design (ICCD)

Multi-threaded processors interleave the execution of several threads to reduce processor stalling time. Instruction cache misses usually account for a significant fraction of the overall stalling time due to frequent instruction fetches. Apart from incurring extended execution time (hence its direct impact on energy consumption), cache misses also lead to indirect power overheads and increased thread...

chapter

BSSync: Processing Near Memory for Machine Learning Workloads with Bounded Staleness Consistency Models

Joo Hwan Lee, Jaewoong Sim, Hyesoon Kim

2015 International Conference on Parallel Architecture and Compilation (PACT) > 241 - 252

2015 International Conference on Parallel Architecture and Compilation (PACT)

Parallel machine learning workloads have become prevalent in numerous application domains. Many of these workloads are iterative convergent, allowing different threads to compute in an asynchronous manner, relaxing certain read-after-write data dependencies to use stale values. While considerable effort has been devoted to reducing the communication latency between nodes by utilizing asynchronous...

chapter

TSXProf: Profiling Hardware Transactions

Yujie Liu, Justin Gottschlich, Gilles Pokam, Michael Spear

2015 International Conference on Parallel Architecture and Compilation (PACT) > 75 - 86

2015 International Conference on Parallel Architecture and Compilation (PACT)

The availability of commercial hardware transactionalmemory (TM) systems has not yet been met with a rise in the numberof large-scale programs that use memory transactions explicitly. Asignificant impediment to the use of TM is the lack of tool support, specifically profilers that can identify and explain performance anomalies. In this paper, we introduce an end-to-end system that enables lowoverheadperformance...

chapter

Efficient parallel packet processing using a shared memory many-core processor with hardware support to accelerate communication

Farrukh Hijaz, Brian Kahne, Peter Wilson, Omer Khan

2015 IEEE International Conference on Networking, Architecture and Storage (NAS) > 122 - 129

2015 IEEE International Conference on Networking, Architecture and Storage (NAS)

Software IP forwarding routers provide flexibility, programmability and extensibility, while enabling fast deployment. The key question is whether they can keep up with the efficiency of special purpose hardware counterparts. Shared memory stands out as sine qua non for parallel programming of many commercial multicore processors, so it is the paradigm of choice to implement software routers. For...

chapter

OpenMP-MCA: Leveraging Multiprocessor Embedded Systems Using Industry Standards

Peng Sun, Sunita Chandrasekaran, Barbara Chapman

2015 IEEE International Parallel and Distributed Processing Symposium Workshop > 679 - 688

2015 IEEE International Parallel and Distributed Processing Symposium Workshop (IPDPSW)

Multicore embedded systems are rapidly emerging. Hardware designers are packing more and more features into their design. Introducing heterogeneity in these systems, i.e. Adding cores of varying types does provide opportunities to solve problems in different aspects. However, this presents several challenges to embedded system programmers since software is still not mature enough to efficiently exploit...

Data set:
ieee
Keywords:
SYNCHRONIZATION
HARDWARE
INSTRUCTION SETS
Publication type:
book

Publication date

Set your own date range

Content availability

Available (80)
None (1)

Keywords

COMPUTER ARCHITECTURE (15)
KERNEL (12)
MESSAGE SYSTEMS (12)
PARALLEL PROCESSING (10)
GRAPHICS PROCESSING UNITS (9)
MULTICORE PROCESSING (9)
BENCHMARK TESTING (8)
EMBEDDED SYSTEMS (7)
MULTITHREADING (7)
OPTIMIZATION (7)
REGISTERS (7)
RUNTIME (7)
COHERENCE (6)
FIELD PROGRAMMABLE GATE ARRAYS (6)
GRAPHICS PROCESSING UNIT (6)
LIBRARIES (6)
CONTEXT (5)
PARALLEL PROGRAMMING (5)
PROTOCOLS (5)
REAL TIME SYSTEMS (5)
SCALABILITY (5)
SHARED MEMORY SYSTEMS (5)
SYSTEM-ON-CHIP (5)
COMPUTATIONAL MODELING (4)
INSTRUMENTS (4)
INTERFERENCE (4)
MICROPROCESSOR CHIPS (4)
MULTIPROCESSING SYSTEMS (4)
PERFORMANCE (4)
PIPELINES (4)
PROGRAMMING (4)
SYNCHRONISATION (4)
THROUGHPUT (4)
TRANSACTIONAL MEMORY (4)
CLOCKS (3)
COMPUTER BUGS (3)
CONCURRENCY CONTROL (3)
CONCURRENT COMPUTING (3)
FPGA (3)
MEMORY MANAGEMENT (3)
MONITORING (3)
MPSOC (3)
MULTI-THREADING (3)
OPENMP (3)
POWER DEMAND (3)
RESOURCE MANAGEMENT (3)
SOFTWARE (3)
SWITCHES (3)
SYSTEM RECOVERY (3)
ARRAYS (2)
ATOMIC LAYER DEPOSITION (2)
ATOMIC OPERATION (2)
BANDWIDTH (2)
BASEBAND (2)
BUFFER STORAGE (2)
CACHE STORAGE (2)
DATA STRUCTURES (2)
DELAY (2)
FAULT TOLERANCE (2)
GPU (2)
HARDWARE-SOFTWARE CODESIGN (2)
JAVA (2)
LINUX (2)
MANY-CORE (2)
MULTI-CORE PROCESSORS (2)
MULTIPROCESSING PROGRAMS (2)
OPENCL (2)
OPERATING SYSTEMS (2)
PARALLEL ARCHITECTURES (2)
PIPELINE PROCESSING (2)
PROGRAM COMPILERS (2)
PROGRAM PROCESSORS (2)
PROPOSALS (2)
PROTOTYPES (2)
RADIATION DETECTORS (2)
RECONFIGURABLE ARCHITECTURES (2)
RUNTIME LIBRARY (2)
SCHEDULES (2)
SERVERS (2)
SPINNING (2)
THREAD SYNCHRONIZATION (2)
TOOLS (2)
VECTORS (2)
WIRELESS SENSOR NETWORKS (2)
ABORTING TRANSACTION (1)
ABSTRACT RTOS MODEL (1)
ACCELERATION (1)
ACCELERATION MECHANISM (1)
ADAPTATION MODELS (1)
AGGRESSIVE SNOOP REDUCTION (1)
AM BAND (1)
APACHE (1)
APPLICATION PROGRAM INTERFACES (1)
APPLICATION-DRIVEN CUSTOMIZATION TECHNIQUE (1)
APPROXIMATE VALUE LOCALITY (1)
APPROXIMATION METHODS (1)
ARCHITECTURAL SUPPORT (1)
more

INFONA - science communication portal

Search results

RCU-HTM: Combining RCU with HTM to Implement Highly Efficient Concurrent Binary Search Trees

Understanding the Impact of Fine-Grained Data Sharing and Thread Communication on Heterogeneous Workload Development

Analyzing the scalability of managed language applications with speedup stacks

GPUguard: Towards supporting a predictable execution model for heterogeneous SoC

A novel way to efficiently simulate complex full systems incorporating hardware accelerators

Compiler techniques to reduce the synchronization overhead of GPU redundant multithreading

COMET: Communication-optimised multi-threaded error-detection technique

Characterizing Performance and Power towards Efficient Synchronization of GPU Kernels

A Review of Lightweight Thread Approaches for High Performance Computing

GPU concurrency choices in graph analytics

Multiple core PLC CPU with tight thread synchronization

Speculatively exploiting cross-invocation parallelism

Characterizing and optimizing the performance of multithreaded programs under interference

Thrifty-malloc: A HW/SW codesign for the dynamic management of hardware transactional memory in embedded multicore systems

Reproducible floating-point atomic addition in data-parallel environment

Effective hardware-level thread synchronization for high performance and power efficiency in application specific multi-threaded embedded processors

BSSync: Processing Near Memory for Machine Learning Workloads with Bounded Staleness Consistency Models

TSXProf: Profiling Hardware Transactions

Efficient parallel packet processing using a shared memory many-core processor with hardware support to accelerate communication

OpenMP-MCA: Leveraging Multiprocessor Embedded Systems Using Industry Standards

Filter options

Publication date

Content availability

Keywords

INFONA - science communication portal

Search results

Add recipient

Sending message cancelled

Are you sure you want to cancel sending this message?

Send message

Filter options

Publication date

Date range setting

Set the date range to filter the displayed results. You can set a starting date, ending date or both. You can enter the dates manually or choose them from the calendar.

Content availability

Keywords

Reporting an error / abuse

Sending the report failed

Accessibility options