Graph models of social information systems typically contain trillions of edges. Such big graphs cannot be processed on a single machine. The graph object must be partitioned and distributed among machines and processed in parallel on a computer cluster. Programming such systems is very challenging. In this work, we present DH-Falcon, a graph DSL (domain-specific language) which can be used to implement...
Failure-tolerant data encoding and storage is of paramount importance for data centers, supercomputers, data transfers, and many aspects of information technology. Reed-Solomon erasure codes and their variants are the basis for many applications in this field. Efficient implementation of these codes is challenging because they require computations in Galois fields, which are not supported...
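To make the Galois-field requirement concrete, the following C sketch multiplies two bytes in GF(2^8) with the 0x11d reduction polynomial commonly used by Reed-Solomon coders. The coefficients and data bytes in main are arbitrary illustrative values, not taken from the paper.

    #include <stdio.h>
    #include <stdint.h>

    /* Multiply in GF(2^8) with reduction polynomial x^8+x^4+x^3+x^2+1
       (0x11d). "Addition" is XOR; each product is a shift-and-XOR loop
       with conditional reduction, so there is no ordinary CPU multiply. */
    static uint8_t gf256_mul(uint8_t a, uint8_t b) {
        uint8_t p = 0;
        while (b) {
            if (b & 1) p ^= a;          /* add current multiple of a */
            uint8_t hi = a & 0x80;
            a <<= 1;
            if (hi) a ^= 0x1d;          /* reduce modulo 0x11d */
            b >>= 1;
        }
        return p;
    }

    int main(void) {
        /* A Reed-Solomon parity byte is a GF dot-product of data bytes
           with fixed coefficients, e.g. parity = c1*d1 ^ c2*d2. */
        uint8_t d[2] = {0x57, 0x13}, c[2] = {0x02, 0x03}, parity = 0;
        for (int i = 0; i < 2; i++) parity ^= gf256_mul(c[i], d[i]);
        printf("parity byte: 0x%02x\n", parity);
        return 0;
    }

In practice these shift/XOR loops are replaced by table lookups or carry-less multiply instructions, and how well that mapping works on a given CPU is precisely what makes efficient implementation challenging.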
Checkpoint/restart has been widely used to cope with fail-stop errors. The checkpointing frequency is most often optimized under the assumption of an exponential failure distribution. However, field studies show that failures rarely follow a constant-rate exponential distribution. Therefore, the optimal checkpointing frequency should be computed and tuned considering the different distributions...
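For reference, under the exponential assumption that this abstract questions, the classical Young/Daly first-order approximation puts the optimal amount of work between checkpoints at W = sqrt(2 * C * MTBF), where C is the checkpoint cost. A minimal sketch with assumed cost and MTBF values:

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        /* Young/Daly first-order optimum; only valid under the
           constant-rate (exponential) failure assumption. Both inputs
           below are assumed values for illustration. */
        double C    = 60.0;          /* checkpoint cost, seconds      */
        double mtbf = 24.0 * 3600;   /* platform MTBF, seconds        */
        double W    = sqrt(2.0 * C * mtbf);  /* work between checkpoints */
        printf("checkpoint every %.0f s (~%.1f min)\n", W, W / 60.0);
        return 0;
    }

When the failure distribution is not exponential (e.g. Weibull, as field studies suggest), this closed form no longer applies, which motivates recomputing the optimum per distribution as the abstract proposes.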
Fault tolerance is one of the major design goals for HPC. The emergence of non-volatile memories (NVM) provides a solution for building fault-tolerant HPC systems. Data in NVM-based main memory are not lost when the system crashes because of the non-volatile nature of NVM. However, because of volatile caches, data must be logged and explicitly flushed from caches into NVM to ensure consistency and correctness...
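The logging-and-flushing discipline can be illustrated with x86 cache-control intrinsics. The sketch below shows a generic undo-log update, not the paper's scheme; the log layout is invented, and real systems would prefer CLWB/CLFLUSHOPT over CLFLUSH where available.

    #include <stdint.h>
    #include <stdio.h>
    #include <immintrin.h>

    typedef struct { uint64_t *addr; uint64_t old; } log_entry_t;

    /* Undo-log update of one word in (assumed) NVM-backed memory.
       Caches are volatile, so both the log entry and the data must be
       explicitly flushed and ordered before the update counts as durable. */
    static void persistent_store(log_entry_t *log, uint64_t *p, uint64_t v) {
        log->addr = p;                 /* 1. record undo information       */
        log->old  = *p;
        _mm_clflush(log);              /* 2. flush the log entry to NVM    */
        _mm_sfence();                  /* 3. order log before the update   */
        *p = v;                        /* 4. perform the actual update     */
        _mm_clflush(p);                /* 5. flush the new value           */
        _mm_sfence();                  /* 6. durable; log entry can be freed */
    }

    int main(void) {
        static uint64_t balance = 100; /* stand-in for an NVM-resident word */
        static log_entry_t entry;
        persistent_store(&entry, &balance, 250);
        printf("balance = %llu\n", (unsigned long long)balance);
        return 0;
    }

Libraries such as PMDK package this flush-and-fence discipline behind transactional APIs; the cost of the extra flushes and fences is the overhead such papers try to reduce.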
We consider the problem of orchestrating the execution of workflow applications structured as Directed Acyclic Graphs (DAGs) on parallel computing platforms that are subject to fail-stop failures. The objective is to minimize expected overall execution time, or makespan. A solution to this problem consists of a schedule of the workflow tasks on the available processors and of a decision of which application...
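As a baseline for what "makespan" means here, the failure-free makespan of a DAG with enough processors is the longest weighted path, computable in one pass over a topological order. A toy sketch with an invented four-task DAG, not taken from the paper:

    #include <stdio.h>

    #define NT 4

    int main(void) {
        /* Hypothetical DAG: 0 -> {1,2} -> 3, task weights in seconds.
           finish[t] = max finish time of predecessors + weight of t. */
        double w[NT] = {2.0, 3.0, 5.0, 1.0};
        int dep[NT][NT] = {{0,1,1,0},{0,0,0,1},{0,0,0,1},{0,0,0,0}};
        double finish[NT] = {0};
        for (int t = 0; t < NT; t++) {    /* tasks are in topological order */
            double ready = 0.0;
            for (int p = 0; p < NT; p++)
                if (dep[p][t] && finish[p] > ready) ready = finish[p];
            finish[t] = ready + w[t];
        }
        double makespan = 0.0;
        for (int t = 0; t < NT; t++)
            if (finish[t] > makespan) makespan = finish[t];
        printf("failure-free makespan: %.1f s\n", makespan);  /* 2+5+1 = 8.0 */
        return 0;
    }

Under fail-stop failures the objective becomes the expected value of this quantity, with checkpoint placement trading checkpoint overhead against re-execution time, which is the optimization the abstract targets.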
In this research we describe the development and optimisation of a new Monte Carlo neutral particle transport mini-app, neutral. In spite of the success of previous research efforts to load balance the algorithm at scale, it is not clear how to take advantage of the diverse architectures being installed in the newest supercomputers. We explore different algorithmic approaches, and perform extensive...
Tasks coupled in an in situ workflow may not process data at the same speed, potentially causing overflows in the communication channel between them. To prevent this problem, software infrastructures for in situ workflows usually impose a strict FIFO policy that has the side-effect of slowing down faster tasks to the speed of the slower ones. This may not be the desired behavior; for example, a scientist...
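One alternative to strict FIFO back-pressure is a bounded channel that drops the oldest buffered item when full, so the producer never stalls. The single-threaded sketch below illustrates only that policy with invented parameters; it is not the infrastructure the abstract describes, and a real in situ coupler would add locking or lock-free synchronization.

    #include <stdio.h>

    #define CAP 4

    /* Bounded channel between a fast producer task and a slow consumer.
       push() evicts the oldest item instead of blocking the producer. */
    typedef struct { int buf[CAP]; int head, count, dropped; } channel_t;

    static void push(channel_t *ch, int item) {
        if (ch->count == CAP) {                 /* full: evict oldest */
            ch->head = (ch->head + 1) % CAP;
            ch->count--;
            ch->dropped++;
        }
        ch->buf[(ch->head + ch->count) % CAP] = item;
        ch->count++;
    }

    static int pop(channel_t *ch, int *item) {
        if (ch->count == 0) return 0;
        *item = ch->buf[ch->head];
        ch->head = (ch->head + 1) % CAP;
        ch->count--;
        return 1;
    }

    int main(void) {
        channel_t ch = {{0}, 0, 0, 0};
        for (int step = 0; step < 12; step++) {
            push(&ch, step);                /* producer emits every step  */
            if (step % 3 == 2) {            /* consumer keeps up 1 in 3   */
                int v;
                if (pop(&ch, &v)) printf("consumed %d\n", v);
            }
        }
        printf("dropped %d item(s)\n", ch.dropped);
        return 0;
    }

Whether dropping, sampling, or blocking is appropriate is application-specific, which is exactly why a single hard-wired FIFO policy can be the wrong default.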
Markov Chain Monte Carlo methods provide a tool for tackling high-dimensional problems. With many-core systems readily available today, it is no surprise that leveraging parallelism in these samplers has been a subject of recent research. The focus has been on solutions for shared-memory architectures; however, these perform poorly in a distributed-memory environment. This paper introduces a fully...
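The simplest distributed-memory baseline is to run one independent Metropolis chain per MPI rank and pool the estimates at the end. This is not the paper's sampler, just a sketch of the setting; the target density, proposal width, and use of rand() are illustrative simplifications.

    #include <mpi.h>
    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Target: standard normal, log-density up to a constant. */
    static double logp(double x) { return -0.5 * x * x; }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        srand(1234u + (unsigned)rank);   /* crude per-rank random stream */
        const int steps = 100000;
        double x = 0.0, sum = 0.0;
        for (int i = 0; i < steps; i++) {
            double prop = x + 0.5 * (2.0 * rand() / (double)RAND_MAX - 1.0);
            double u = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
            if (log(u) < logp(prop) - logp(x)) x = prop;  /* accept/reject */
            sum += x;
        }
        double local_mean = sum / steps, global_mean;
        MPI_Reduce(&local_mean, &global_mean, 1, MPI_DOUBLE, MPI_SUM,
                   0, MPI_COMM_WORLD);
        if (rank == 0)
            printf("pooled mean estimate: %f\n", global_mean / size);
        MPI_Finalize();
        return 0;
    }

Independent chains avoid communication entirely but do not speed up a single chain's mixing; schemes that parallelize within a chain need coordination, which is where shared-memory designs break down on distributed memory.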
Stencil-based applications such as CFD have succeeded in obtaining high performance on GPU supercomputers. The problem sizes of these applications are limited by the GPU device memory capacity, which is typically smaller than the host memory. On GPU supercomputers, a locality-improvement technique that combines temporal blocking with memory swapping between host and device enables large computation...
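The idea of temporal blocking can be shown on a 1D three-point stencil: each spatial block is copied with a halo of width T, advanced T time steps locally (recomputing halo cells redundantly), and written back, so several time steps are fused per data transfer. A CPU-only sketch with assumed sizes; the host-device swapping layer the abstract refers to is omitted.

    #include <stdio.h>
    #include <string.h>

    #define N 1024   /* grid points */
    #define B 128    /* spatial block width (divides N) */
    #define T 8      /* fused time steps = halo width */

    static double a[N], out[N];

    /* Advance the whole grid by T steps, one block at a time, with
       overlapped tiling and fixed (Dirichlet) endpoints. */
    static void temporal_blocked_sweep(void) {
        double buf[B + 2 * T], tmp[B + 2 * T];
        for (int s = 0; s < N; s += B) {
            int lo = s - T < 0 ? 0 : s - T;          /* clamped halo */
            int hi = s + B + T > N ? N : s + B + T;
            int w = hi - lo;
            memcpy(buf, &a[lo], w * sizeof(double));
            for (int t = 0; t < T; t++) {
                for (int i = 1; i < w - 1; i++)
                    tmp[i] = 0.5 * buf[i] + 0.25 * (buf[i-1] + buf[i+1]);
                tmp[0] = buf[0];
                tmp[w-1] = buf[w-1];
                memcpy(buf, tmp, w * sizeof(double));
            }
            /* After T steps only cells >= T from a non-physical buffer
               edge are exact, which is exactly the block interior. */
            memcpy(&out[s], &buf[s - lo], B * sizeof(double));
        }
        memcpy(a, out, N * sizeof(double));
    }

    int main(void) {
        for (int i = 0; i < N; i++) a[i] = (i == N / 2) ? 1.0 : 0.0;
        for (int sweep = 0; sweep < 4; sweep++) temporal_blocked_sweep();
        printf("a[N/2] after %d steps: %f\n", 4 * T, a[N / 2]);
        return 0;
    }

On a GPU, each block-plus-halo would be swapped from host to device, advanced T steps on-device, and swapped back, amortizing the PCIe transfer over T time steps instead of one.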
Job runtime estimates provided by users are widely acknowledged to be overestimates, and runtime overestimation can greatly degrade job scheduling performance. Previous studies focus on improving the accuracy of job runtime estimates by reducing overestimation, but fail to address the opposite problem of runtime underestimation. Using an underestimated runtime is catastrophic...
High performance computing systems will need to operate within fixed power budgets while maximizing performance in the exascale era. Such systems are built with power-aware components, whose collective peak power may exceed the specified power budget. Cluster-level power-bounded computing addresses this challenge by coordinating power among components within compute nodes and further adjusting...
The need for parallel task execution has been steadily growing in recent years, since manufacturers mainly improve processor performance by scaling the number of installed cores rather than the clock frequency. An essential technique for making use of this potential and increasing the parallelism of a program is to parallelize loops. However, a main restriction of available tools for automatic loop...
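For a loop with no loop-carried dependence, parallelization can be as simple as one OpenMP directive (compile with -fopenmp); the hard part for automatic tools is proving that independence. A minimal example:

    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    int main(void) {
        static double a[N], b[N];
        for (int i = 0; i < N; i++) a[i] = i * 0.5;

        /* Iterations are independent: no iteration reads a value another
           iteration writes, so the loop can be split across cores. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            b[i] = 2.0 * a[i] + 1.0;

        printf("b[42] = %f (max threads: %d)\n", b[42], omp_get_max_threads());
        return 0;
    }

When independence cannot be established statically (aliasing, indirect indexing, early exits), the loop must stay serial or be transformed first, which keeps fully automatic parallelization hard in the general case.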
Virtual machine (VM) consolidation is necessary for increasing server utilization; however, it also leads to VM performance degradation. This work presents a method to predict the performance of consolidated VMs from critical system-event data. Experiments are designed to demonstrate the effect of system events such as interrupts, page faults, mutex operations, and context switching on the consolidated...
As the memory and storage hierarchy gets deeper and more complex, it is important to have new benchmarks and evaluation tools that allow us to explore emerging middleware solutions for using this hierarchy. Skel is a tool aimed at automating and refining this process of studying HPC I/O performance. It works by generating application I/O kernels/benchmarks as determined by a domain-specific model....
Traditional machine learning algorithms often require computations on centralized data, but modern datasets are collected and stored in a distributed way. In addition to the cost of moving data to centralized locations, increasing concerns about privacy and security warrant distributed approaches. We propose keybin, a distributed key-based binning clustering algorithm for high-dimensional spaces....
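Generic key-based binning (a sketch in the spirit of, but not identical to, keybin) quantizes each point's coordinates onto a grid and packs the per-dimension cell indices into a single key; clustering then operates on key counts rather than raw points, which is what makes it cheap to distribute. All constants below are illustrative.

    #include <stdio.h>
    #include <stdlib.h>
    #include <stdint.h>

    #define NPTS 8
    #define DIM  2

    /* Hypothetical data; a real run would read points in. Coordinates
       are assumed non-negative with fewer than 1024 cells per dimension
       so indices pack into one 64-bit key. */
    static const double pts[NPTS][DIM] = {
        {0.10,0.20},{0.15,0.22},{0.12,0.18},{0.90,0.95},
        {0.92,0.90},{0.88,0.93},{0.50,0.10},{0.11,0.21}};

    static int cmp_u64(const void *a, const void *b) {
        uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
        return (x > y) - (x < y);
    }

    int main(void) {
        const double h = 0.2;               /* bin width per dimension */
        uint64_t keys[NPTS];
        for (int p = 0; p < NPTS; p++) {
            uint64_t key = 0;
            for (int d = 0; d < DIM; d++)   /* pack cell indices */
                key = key * 1024 + (uint64_t)(pts[p][d] / h);
            keys[p] = key;
        }
        /* In a distributed setting each worker computes counts locally
           and only (key, count) pairs are merged; raw points never move,
           which also limits what the data exchange reveals. */
        qsort(keys, NPTS, sizeof keys[0], cmp_u64);
        for (int i = 0; i < NPTS; ) {
            int j = i;
            while (j < NPTS && keys[j] == keys[i]) j++;
            printf("bin %llu: %d point(s)\n",
                   (unsigned long long)keys[i], j - i);
            i = j;
        }
        return 0;
    }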
In-memory key-value stores are a crucial building block of large-scale web architectures. Given the growth of data volume and the need for low-latency responses, cost-effective storage expansion and fast large-message processing are the major challenges. In this paper, we explore the design of key-value middleware that takes advantage of modern NVMe SSDs and RDMA interconnects to achieve high performance...
Task mapping is an important problem in parallel and distributed computing. The goal in task mapping is to find an optimal layout of the processes of an application (or a task) onto a given network topology. We target this problem in the context of staging applications. A staging application consists of two or more parallel applications (also referred to as staging tasks) which run concurrently and...
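A common greedy heuristic for task mapping places processes one at a time on the free node that minimizes communication cost to the processes already placed, weighting message volume by network distance. The sketch below uses invented communication and distance matrices and is not the paper's algorithm.

    #include <stdio.h>

    #define NT 4   /* processes */
    #define NN 4   /* nodes, one slot each, line topology */

    int main(void) {
        /* comm[i][j]: message volume between processes i and j.
           dist[u][v]: hop count between nodes u and v. */
        double comm[NT][NT] = {
            {0, 9, 1, 0}, {9, 0, 8, 1}, {1, 8, 0, 7}, {0, 1, 7, 0}};
        int dist[NN][NN] = {
            {0,1,2,3}, {1,0,1,2}, {2,1,0,1}, {3,2,1,0}};
        int map[NT], used[NN] = {0};

        for (int t = 0; t < NT; t++) {
            int best = -1;
            double best_cost = 1e30;
            for (int n = 0; n < NN; n++) {
                if (used[n]) continue;
                double cost = 0;        /* cost to already-placed peers */
                for (int p = 0; p < t; p++)
                    cost += comm[t][p] * dist[n][map[p]];
                if (cost < best_cost) { best_cost = cost; best = n; }
            }
            map[t] = best;
            used[best] = 1;
            printf("process %d -> node %d\n", t, best);
        }
        return 0;
    }

For staging applications the extra twist is that two or more coupled parallel programs must be mapped jointly, so the "communication matrix" spans tasks from different applications.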
Scientific computing requires trust in results. In high-performance computing, trust is impeded by silent data corruption (SDC), that is, corruption that remains unnoticed. Numerical integration solvers are especially sensitive to SDCs because an SDC introduced in a certain step affects all following steps. SDCs can even cause the solver to become unstable. Adaptive solvers can change the...
Aggregating millions of hardware components to construct an exascale computing platform will pose significant resilience challenges. In addition to slowdowns associated with detected errors, silent errors are likely to further degrade application performance. Moreover, silent data corruption (SDC) has the potential to undermine the integrity of the results produced by important scientific applications...
In this paper, we present a non-parametric data-analytic soft-error detector. Our detector uses two key properties of Gaussian process regression. First, because Gaussian process regression provides confidence on the prediction, this confidence can be used to automate construction of the detection range. Second, because the correlation model of a Gaussian process captures the similarity among neighboring...
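The detection idea can be sketched end to end: fit a GP to a few trusted neighboring observations, predict the next value with its variance, and flag the observed value if it falls outside a mean plus/minus 3-sigma band. The kernel choice, length scale, and data below are invented for illustration and are not the paper's configuration.

    #include <stdio.h>
    #include <math.h>

    #define N 4   /* trusted neighboring observations */

    /* Squared-exponential kernel, length scale 0.5 (assumed). */
    static double kern(double a, double b) {
        double d = (a - b) / 0.5;
        return exp(-0.5 * d * d);
    }

    int main(void) {
        double xs[N] = {0.0, 0.25, 0.5, 0.75};
        double ys[N] = {0.000, 0.247, 0.479, 0.682};  /* ~ sin(x) */
        double xstar = 1.0, observed = 5.0;           /* corrupted value */

        /* Cholesky factor L of K + jitter*I (K = L L^T). */
        double K[N][N], L[N][N] = {{0}};
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                K[i][j] = kern(xs[i], xs[j]) + (i == j ? 1e-6 : 0.0);
        for (int i = 0; i < N; i++)
            for (int j = 0; j <= i; j++) {
                double s = K[i][j];
                for (int k = 0; k < j; k++) s -= L[i][k] * L[j][k];
                L[i][j] = (i == j) ? sqrt(s) : s / L[j][j];
            }

        /* alpha = K^{-1} y via forward then backward substitution. */
        double tmp[N], alpha[N];
        for (int i = 0; i < N; i++) {
            double s = ys[i];
            for (int k = 0; k < i; k++) s -= L[i][k] * tmp[k];
            tmp[i] = s / L[i][i];
        }
        for (int i = N - 1; i >= 0; i--) {
            double s = tmp[i];
            for (int k = i + 1; k < N; k++) s -= L[k][i] * alpha[k];
            alpha[i] = s / L[i][i];
        }

        /* Predictive mean and variance at xstar:
           mean = k*^T alpha, var = k(x*,x*) - ||L^{-1} k*||^2. */
        double ks[N], v[N], mean = 0.0, var = kern(xstar, xstar);
        for (int i = 0; i < N; i++) {
            ks[i] = kern(xstar, xs[i]);
            mean += ks[i] * alpha[i];
        }
        for (int i = 0; i < N; i++) {
            double s = ks[i];
            for (int k = 0; k < i; k++) s -= L[i][k] * v[k];
            v[i] = s / L[i][i];
            var -= v[i] * v[i];
        }

        double sigma = sqrt(var > 0 ? var : 0);
        printf("predicted %.3f +/- %.3f, observed %.3f -> %s\n",
               mean, 3 * sigma, observed,
               fabs(observed - mean) > 3 * sigma
                   ? "FLAG: possible SDC" : "ok");
        return 0;
    }

The predictive variance widens automatically where the data are sparse or noisy, which is what lets the detection range be constructed without hand-tuned thresholds.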