Search results

chapter

Optimization of scan algorithms on multi- and many-core processors

Qiao Sun, Chao Yang

2014 21st International Conference on High Performance Computing (HiPC) > 1 - 10

2014 21st International Conference on High Performance Computing (HiPC)

Scan is a basic building block widely utilized in many applications. With the emergence of multi-core and many-core processors, the study of highly scalable parallel scan algorithms becomes increasingly important. In this paper, we first propose a novel parallel scan algorithm based on the fine grain dynamic task scheduling in QUARK, and then derive a cache-friendly framework for any parallel scan...

chapter

Fine-grained GPU parallelization of pairwise local sequence alignment

Chirag Jain, Subodh Kumar

2014 21st International Conference on High Performance Computing (HiPC) > 1 - 10

2014 21st International Conference on High Performance Computing (HiPC)

The Smith-Waterman algorithm is used in Bio-informatics to perform pairwise local alignment between a query sequence and a subject sequence. We present a GPU based parallel version of this algorithm that is able to perform pair-wise alignment faster than previous algorithms. In particular, it parallelizes each alignment, rather than relying on parallelism across multiple pair alignments, which many...

chapter

Energy-Efficient Stencil Computations on Distributed GPUs Using Dynamic Parallelism and GPU-Controlled Communication

Lena Oden, Benjamin Klenk, Holger Froning

2014 Energy Efficient Supercomputing Workshop > 31 - 40

2014 Energy Efficient Supercomputing Workshop (E2SC)

GPUs are widely used in high performance computing, due to their high computational power and high performance per Watt. Still, one of the main bottlenecks of GPU-accelerated cluster computing is the data transfer between distributed GPUs. This not only affects performance, but also power consumption. The most common way to utilize a GPU cluster is a hybrid model, in which the GPU is used to accelerate...

chapter

Analysis of Linux UDP Sockets Concurrent Performance

Diego Rivera, Eduardo Acha, Jose Piquer, Javier Bustos-Jimenez

2014 33rd International Conference of the Chilean Computer Science Society (SCCC) > 65 - 69

2014 33rd International Conference of the Chilean Computer Science Society (SCCC)

Almost all DNS queries that traverse the Internet are transported via UDP in small, self-contained packets. Therefore, with no restriction on packet ordering, intuition would say that adding thread-based parallelism to the servers will increase their performance, but it does not. This paper studies the problem of serialized access to UDP sockets, and states the problem in the way the packets are enqueued...

chapter

Mainstream Components for Near Hard Real-Time Distributed Simulation and Testing

Fernand Quartier, Pierre Verhoyen, Nadie Rousse, Frederic Manon

2014 IEEE/ACM 18th International Symposium on Distributed Simulation and Real Time Applications > 11 - 17

2014 IEEE/ACM 18th International Symposium on Distributed Simulation and Real Time Applications (DS-RT)

At CNES, each new satellite simulation and testing system significantly increases processing requirements and real-time constraints. While mainstream systems allow adding almost unlimited computing resources, whenever there are stronger timing constraints, we enter largely unknown territory. To prepare for the future, several R&D projects have been carried out that were focusing on related...

chapter

Leveraging OmpSs to Exploit Hardware Accelerators

Florentino Sainz, Sergi Mateo, Vicenc Beltran, Jose L. Bosque, more

2014 IEEE 26th International Symposium on Computer Architecture and High Performance Computing > 112 - 119

2014 26th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)

CUDA and OpenCL are the most widely used programming models to exploit hardware accelerators. Both programming models provide a C-based programming language to write accelerator kernels and a host API used to glue the host and kernel parts. Although this model is a clear improvement over a low-level and ad-hoc programming model for each hardware accelerator, it is still too complex and cumbersome...

chapter

Graph processing on GPUs: Where are the bottlenecks?

Qiumin Xu, Hyeran Jeon, Murali Annavaram

2014 IEEE International Symposium on Workload Characterization (IISWC) > 140 - 149

2014 IEEE International Symposium on Workload Characterization (IISWC)

Large graph processing is now a critical component of many data analytics. Graph processing is used from social networking web sites that provide context-aware services from user connectivity data to medical informatics that diagnose a disease from a given set of symptoms. Graph processing has several inherently parallel computation steps interspersed with synchronization needs. Graphics processing...

chapter

A comparison of parallel SystemC simulation approaches at RTL

Bastian Haetzer, Martin Radetzki

Proceedings of the 2014 Forum on Specification and Design Languages (FDL) > 978-2-9530504-9-3 > 1 - 8

2014 Forum on Specification and Design Languages (FDL)

This paper presents a holistic comparison of different parallel SystemC simulation approaches at the register transfer level (RTL). The effect of RTL modeling styles and simulation strategies on performance will be evaluated to show potentials and limitations of state of the art parallel simulation techniques on shared memory machines. Experiments show that the simulation performance strongly depends...

chapter

Understanding synchronization in TCP Cubic

Sonia Belhareth, Dino Lopez-Pacheco, Lucile Sassatelli, Denis Collange, more

2014 26th International Teletraffic Congress (ITC) > 1 - 9

2014 26th International Teletraffic Congress (ITC)

TCP Cubic is designed to better utilize high bandwidth-delay product paths in IP networks. It is currently the default TCP version in the Linux kernel. Our objective in this work is to better understand the performance of TCP Cubic in scenarios with a large number of competing long-lived TCP flows, as can be observed, e.g., in cloud environments. In such situations, Cubic connections tend to synchronize...

chapter

Scalable Graph500 design with MPI-3 RMA

Mingzhe Li, Xiaoyi Lu, Sreeram Potluri, Khaled Hamidouche, more

2014 IEEE International Conference on Cluster Computing (CLUSTER) > 230 - 238

2014 IEEE International Conference On Cluster Computing (CLUSTER)

The MPI two-sided programming model has been widely used for scientific applications. However, the benefits of MPI one-sided communication are still not well exploited. Recently, MPI-3 Remote Memory Access (RMA) was introduced with several advanced features which provide better performance, programmability, and flexibility over MPI-2 RMA. However, few studies have shown the benefits of using MPI-3...

chapter

Disruption-free software updates in automation systems

Michael Wahler, Manuel Oriol

Proceedings of the 2014 IEEE Emerging Technology and Factory Automation (ETFA) > 1 - 8

2014 IEEE Emerging Technology and Factory Automation (ETFA)

Automation systems must primarily be deterministic and reliable, especially in safety-critical environments. With recent trends such as mass customization or Industry 4.0, there is an increasing need for automation systems to be dynamic. Changing parts of the software of today's automation systems, however, typically requires rebooting the controller, which makes software updates a complex and costly...

chapter

How Processor Speedups Can Slow Down I/O Performance

Hung-Ching Chang, Bo Li, Matthew Grove, Kirk W. Cameron

2014 IEEE 22nd International Symposium on Modelling, Analysis & Simulation of Computer and Telecommunication Systems > 395 - 404

2014 IEEE 22nd International Symposium on Modelling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS)

Power states in power-scalable systems are managed to maximize performance and reduce energy waste. Power-scalable processor capabilities (e.g., Intel Turbo Boost) embrace a "faster is better" approach to power management. While these technologies can vastly improve performance and energy efficiency, there is a growing body of evidence that "faster is not always better". For example,...

chapter

CASITA: A Tool for Identifying Critical Optimization Targets in Distributed Heterogeneous Applications

Felix Schmitt, Jonas Stolle, Robert Dietrich

2014 43rd International Conference on Parallel Processing Workshops > 186 - 195

2014 43rd International Conference on Parallel Processing Workshops (ICCPW)

Programming of high performance computing systems has become more complex over time. Several layers of parallelism need to be exploited to efficiently utilize the available resources. To support application developers and performance analysts we propose a technique for identifying the most performance critical optimization targets in distributed heterogeneous applications. We have developed CASITA,...

chapter

Adaptive Algorithm and Tool Flow for Accelerating System C on Many-Core Architectures

Christoph Roth, Simon Reder, Harald Bucher, Oliver Sander, more

2014 17th Euromicro Conference on Digital System Design > 137 - 145

2014 17th Euromicro Conference on Digital System Design (DSD)

Within this paper an adaptive approach for parallel simulation of SystemC RTL models on future many-core architectures like the Single-chip Cloud Computer (SCC) from Intel is presented. It is based on a configurable parallel SystemC kernel that preserves the partial order defined by the SystemC delta cycles while avoiding global synchronization as far as possible. The underlying algorithm relies on...

chapter

DMCTCP: Desynchronized Multi-Channel TCP for high speed access networks with tiny buffers

Cheng Cui, Lin Xue, Chui-Hui Chiu, Praveenkumar Kondikoppa, more

2014 23rd International Conference on Computer Communication and Networks (ICCCN) > 1 - 8

2014 23rd International Conference on Computer Communication and Networks (ICCCN)

The past few years have witnessed debate on how to improve link utilization of high-speed tiny-size buffer routers. Widely argued proposals for TCP traffic to realize acceptable link capacities mandate: (i) over-provisioned core link bandwidth; and (ii) non-bursty flows; and (iii) tens of thousands of asynchronous flows. However, in high speed access networks where flows are bursty, sparse and synchronous,...

chapter

Times square - marriage of real-time and logical-time in GALS and synchronous languages

HeeJong Park, Avinash Malik, Zoran Salcic

2014 IEEE 20th International Conference on Embedded and Real-Time Computing Systems and Applications > 1 - 10

2014 IEEE 20th International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA)

In this paper we introduce exact and non-exact real-time waits in reactive Globally Asynchronous Locally Synchronous (GALS) programming languages and synchronous languages as their subset. The language constructs that allow use of real-time waits are illustrated on the SystemJ GALS language. They allow system designers to explicitly use, at the specification level, not only logical time but also the...

chapter

A Flexible and Scalable Affinity Lock for the Kernel

Benlong Zhang, Junbin Kang, Tianyu Wo, Yuda Wang, more

2014 IEEE Intl Conf on High Performance Computing and Communications, 2014 IEEE 6th Intl Symp on Cyberspace Safety and Security, 2014 IEEE 11th Intl Conf on Embedded Software and Syst (HPCC,CSS,ICESS) > 34 - 37

2014 IEEE International Conference on High Performance Computing and Communications (HPCC), 2014 IEEE 6th International Symposium on Cyberspace Safety and Security (CSS) and 2014 IEEE 11th International Conference on Embedded Software and Systems (ICESS)

A number of NUMA-aware synchronization algorithms have been proposed lately to stress the scalability inefficiencies of existing locks. However their presupposed local lock granularity, a physical processor, is often not the optimum configuration for various workloads. This paper further explores the design space by taking into consideration the physical affinity between the cores within a single...

chapter

Data Interception through Broken Concurrency in Kernel Land

Julian L. Rrushi

2014 IEEE Intl Conf on High Performance Computing and Communications, 2014 IEEE 6th Intl Symp on Cyberspace Safety and Security, 2014 IEEE 11th Intl Conf on Embedded Software and Syst (HPCC,CSS,ICESS) > 785 - 793

2014 IEEE International Conference on High Performance Computing and Communications (HPCC), 2014 IEEE 6th International Symposium on Cyberspace Safety and Security (CSS) and 2014 IEEE 11th International Conference on Embedded Software and Systems (ICESS)

We present a kernel data interception technique that is undetectable by existing approaches to malware detection, and propose practical methods to detect it. The technique is based on breaking concurrency in a way that enables the attack code to take over the synchronization established by target kernel modules. That level of control allows the attack code to interpose between those modules, and thus...

chapter

Multi Sloth: An Efficient Multi-core RTOS Using Hardware-Based Scheduling

Rainer Muller, Daniel Danner, Wolfgang Schroder Preikschat, Daniel Lohmann

2014 26th Euromicro Conference on Real-Time Systems > 189 - 198

2014 26th Euromicro Conference on Real-Time Systems (ECRTS)

Multi-core operating systems inherently face the problem of concurrent access to internal kernel state held in shared memory. Previous work on the Sloth real-time kernel proposed to offload the scheduling decisions to the interrupt hardware, thus removing the need for a software scheduler; no state has to be managed in software. While our existing design covers single-core platforms only, we now present...

chapter

Sparse matrix computations on clusters with GPGPUs

Valeria Cardellini, Alessandro Fanfarillo, Salvatore Filippone

2014 International Conference on High Performance Computing & Simulation (HPCS) > 23 - 30

2014 International Conference on High Performance Computing & Simulation (HPCS)

Hybrid nodes containing GPUs are rapidly becoming the norm in parallel machines. We have conducted some experiments regarding how to plug GPU-enabled computational kernels into PSBLAS, an MPI-based library specifically geared towards sparse matrix computations. In this paper, we present our findings on which strategies are more promising in the quest for the optimal compromise among raw performance,...

INFONA - science communication portal
