Hardware design is an essential part of research in high performance computing. Initial efforts in hardware research consist of analyzing design ideas in a software simulator. This allows chip designers to minimize the amount of manufacturing, which would be too costly, and to avoid FPGA designs, which are even more time consuming. Simulating a hardware design involves running many tests that try...
It is widely understood that in situ systems will play a significant role in next-generation HPC systems. The rate at which next-generation leadership machines will be able to generate data will exceed the bandwidth of the planned I/O systems, creating a need for in situ processing to reduce the resulting data. There have been a number of techniques proposed for in situ workflow...
The rapidly growing number of large network analysis problems has led to the emergence of many parallel and distributed graph processing systems—one survey in 2014 identified over 80. Determining the best approach for a given problem is infeasible for most developers. We present an approach and associated software for analyzing the performance and scalability of parallel, open-source graph libraries...
We evaluate the on-node interference caused when co-locating traditional high-performance computing applications with a big-data application. Using kernel benchmarks from the NPB suite and a state-of-the-art graph analytics code, we explore different process placements and the effects they have on application performance. Our results show that the most memory-intensive HPC application (MG) experienced the...
Recent breakthroughs in DNA sequencing have opened up new avenues for bioinformatics, and we have seen increasing demand to make such advanced biomedical technologies cheaper and more accessible. Sequence alignment, the process of matching two gene fragments, is a major bottleneck in Whole Genome Sequencing (WGS). We explored the potential of accelerating the Smith-Waterman sequence alignment algorithm through...
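For readers unfamiliar with the algorithm named in this abstract, the sketch below shows the core Smith-Waterman dynamic-programming recurrence for local alignment. It is a minimal scalar illustration with linear gap penalties; the function name and scoring constants are ours, not taken from the paper, whose acceleration approach the excerpt does not describe.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Minimal Smith-Waterman local alignment score (linear gap penalty).
// Scoring constants are illustrative, not taken from the paper.
int smith_waterman(const std::string& a, const std::string& b,
                   int match = 2, int mismatch = -1, int gap = -2) {
    std::vector<std::vector<int>> H(a.size() + 1,
                                    std::vector<int>(b.size() + 1, 0));
    int best = 0;
    for (size_t i = 1; i <= a.size(); ++i) {
        for (size_t j = 1; j <= b.size(); ++j) {
            int diag = H[i-1][j-1] + (a[i-1] == b[j-1] ? match : mismatch);
            // Local alignment: scores are clamped at zero.
            H[i][j] = std::max({0, diag, H[i-1][j] + gap, H[i][j-1] + gap});
            best = std::max(best, H[i][j]);
        }
    }
    return best;  // highest-scoring local alignment between a and b
}
```

The quadratic table fill, with each cell depending on three neighbors, is what makes the algorithm a bottleneck in WGS and a natural target for acceleration.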
We evaluate the vector performance of the Halide domain-specific language for a computational photography application targeted at Android devices. Our application has existing implementations in C++ and ARM NEON, and these are used as a baseline for performance comparisons with Halide. We give a brief introduction to Halide concepts and describe the structure of our application. We describe how...
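To illustrate the Halide concepts the abstract refers to, here is a minimal sketch of a Halide pipeline. The key idea is the separation of the algorithm (what is computed) from the schedule (how it is vectorized and parallelized). The brighten stage is a placeholder of our own; the paper's actual photography pipeline is not given in the excerpt.

```cpp
#include "Halide.h"
using namespace Halide;

int main() {
    // Algorithm: a simple brighten stage (illustrative placeholder).
    ImageParam input(UInt(8), 2);
    Var x, y;
    Func brighten;
    brighten(x, y) = cast<uint8_t>(min(cast<int>(input(x, y)) * 3 / 2, 255));

    // Schedule: vectorize the inner loop (maps to NEON lanes on ARM)
    // and parallelize rows across cores.
    brighten.vectorize(x, 8).parallel(y);

    brighten.compile_jit();  // JIT-compile; AOT compilation is also possible
    return 0;
}
```

Because the schedule is a separate annotation, the same algorithm can be retuned for different targets without rewriting it, which is the basis for comparing Halide against hand-written NEON code.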
Modern high performance processors are equipped with very wide SIMD instruction sets. SVE (Scalable Vector Extension) is an ARM® SIMD technology that supports vector lengths from 128 bits to 2048 bits. One of its promising features is "vector-length agnostic" programming, which allows the same SVE code to run on hardware of any vector length without any modification of the code. This...
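A minimal sketch (ours, not from the paper) of what vector-length agnostic code looks like with ACLE SVE intrinsics: the element count and loop predicate are queried at run time, so the same binary runs correctly whether the hardware vector is 128 or 2048 bits wide.

```cpp
#include <arm_sve.h>
#include <cstdint>

// Vector-length agnostic a[i] += b[i]: svcntw() returns the number of
// 32-bit lanes on the current hardware, and the predicate masks off
// the tail, so no fixed vector width is baked into the code.
void vla_add(float* a, const float* b, int64_t n) {
    for (int64_t i = 0; i < n; i += (int64_t)svcntw()) {
        svbool_t pg = svwhilelt_b32_s64(i, n);   // active lanes this strip
        svfloat32_t va = svld1_f32(pg, &a[i]);
        svfloat32_t vb = svld1_f32(pg, &b[i]);
        svst1_f32(pg, &a[i], svadd_f32_m(pg, va, vb));
    }
}
```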
The cost of maintaining an application code increases significantly if the code is branched into multiple versions, each of which is optimized for a different architecture. In this work, the default and vector versions of a real-world application code are refactored into a single version, and the differences between the versions are expressed as user-defined code transformations. As a...
As the SIMD width of modern microprocessors widens to keep up with the computational demands of HPC systems, the vector architecture has recently come back into the spotlight. Moreover, a modern vector architecture that retains a large SIMD width and a high B/F ratio has survived and evolved in the HPC community. In this paper, to clarify the potential of the modern vector architecture,...
In recent years, many computer simulation codes have been developed as open-source software. Meanwhile, major processors have adopted vector processing for high performance computing. Hence, computer simulation codes need to be written in a vector-friendly manner to benefit from the computational potential of vector processing. Our study evaluates and analyzes the performance of...
This paper revisits the failure temporal independence hypothesis, which is omnipresent in the analysis of resilience methods for HPC. We explain why a previous approach is incorrect, and we propose a new method to detect failure cascades, i.e., series of non-independent consecutive failures. We use this new method to assess whether public archive failure logs contain failure cascades. Then we design...
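The excerpt does not describe the paper's actual detection method. Purely as an illustration of the underlying idea, one simple (assumed) heuristic groups failures whose inter-arrival times fall below a threshold, on the reasoning that failures much closer together than the mean inter-arrival time are unlikely to be independent.

```cpp
#include <vector>

// Hypothetical cascade grouping, NOT the paper's method: consecutive
// failures closer than `window` seconds are treated as one cascade.
// Assumes `failure_times` is sorted in ascending order.
std::vector<std::vector<double>> group_cascades(
        const std::vector<double>& failure_times, double window) {
    std::vector<std::vector<double>> cascades;
    for (double t : failure_times) {
        if (cascades.empty() || t - cascades.back().back() > window)
            cascades.push_back({t});       // start a new cascade
        else
            cascades.back().push_back(t);  // extend the current cascade
    }
    return cascades;  // singleton groups correspond to isolated failures
}
```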
Future high-performance computing (HPC) systems with ever-increasing resource capacity (such as compute cores, memory, and storage) may significantly increase the risks to reliability. Silent data corruptions (SDCs), or silent errors, are among the major sources of corrupted HPC execution results. Unlike fail-stop errors, SDCs are harmful and dangerous in that they cannot be detected by hardware....
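The excerpt stops before the paper's method. A commonly used generic approach to software SDC detection, sketched here under our own assumptions rather than as this paper's technique, checks each new value of an evolving dataset against a prediction-based bound, exploiting the fact that HPC simulation state usually changes smoothly over time.

```cpp
#include <cmath>
#include <vector>

// Illustrative prediction-based SDC check (not the paper's method):
// flag a time series if a value deviates from a last-value prediction
// by more than `bound`, since an abrupt jump may indicate a bit flip.
bool looks_corrupted(const std::vector<double>& series, double bound) {
    for (size_t i = 1; i < series.size(); ++i)
        if (std::fabs(series[i] - series[i - 1]) > bound)
            return true;  // possible silent data corruption
    return false;
}
```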
Fault tolerance is becoming increasingly important as we enter the era of exascale computing. Increasing the number of cores results in a smaller mean time between failures and, consequently, a higher probability of errors. Among the different software fault tolerance techniques, checkpoint/restart is the most commonly used method in supercomputers and the de facto standard for large-scale systems. Although...
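For concreteness, here is a minimal sketch of application-level checkpoint/restart: periodically serialize the application state, and on startup restore it if a checkpoint exists. The file name and state layout are ours; production systems add atomic file replacement, checksums, and parallel I/O.

```cpp
#include <fstream>
#include <vector>

// Minimal application-level checkpoint/restart sketch (illustrative).
void checkpoint(const std::vector<double>& state, int step,
                const char* path = "app.ckpt") {
    std::ofstream out(path, std::ios::binary | std::ios::trunc);
    out.write(reinterpret_cast<const char*>(&step), sizeof(step));
    out.write(reinterpret_cast<const char*>(state.data()),
              state.size() * sizeof(double));
}

// Returns false if no checkpoint exists (start from scratch).
// Assumes `state` has already been resized to its checkpointed length.
bool restart(std::vector<double>& state, int& step,
             const char* path = "app.ckpt") {
    std::ifstream in(path, std::ios::binary);
    if (!in) return false;
    in.read(reinterpret_cast<char*>(&step), sizeof(step));
    in.read(reinterpret_cast<char*>(state.data()),
            state.size() * sizeof(double));
    return true;
}
```

The checkpoint interval is the key tuning knob: frequent checkpoints waste I/O bandwidth, while infrequent ones lose more work per failure.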
The continuous growth of high-performance computing (HPC) systems has led to Fault Tolerance (FT) being identified as one of the major challenges for exascale computing, due to the expected decrease in Mean Time Between Failures (MTBF). One source of faults is soft errors, which can cause bit corruptions in the data held in memory. Current solutions for protection against these errors include hardware...
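One generic software-side protection scheme against such bit corruptions, shown here as a sketch under our own assumptions rather than as this paper's specific technique, stores a checksum alongside the data and re-verifies it before use; a mismatch signals that a bit flipped in memory.

```cpp
#include <cstddef>
#include <cstdint>

// Illustrative Fletcher-style checksum over a buffer. Store the result
// next to the data; if a later recomputation disagrees, a soft error
// has corrupted the memory. Detection only; codes like ECC can correct.
uint64_t simple_checksum(const uint32_t* data, size_t n) {
    uint64_t lo = 0, hi = 0;
    for (size_t i = 0; i < n; ++i) {
        lo += data[i];  // running sum catches single bit flips
        hi += lo;       // position-weighted sum catches reorderings
    }
    return (hi << 32) | (lo & 0xffffffffu);
}
```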
Due to the growing size of compute clusters, large-scale parallel applications increasingly have to deal with hardware malfunctions and other failure scenarios during execution. The overall goal of this research is to achieve good performance for MapReduce applications despite failures. The paper focuses on evaluating the performance of two representative Hadoop MapReduce applications, 'WordCount' and...
Current HPC environments require parallel programs that are both malleable and fault-tolerant. Malleability denotes the ability to embrace system-initiated resource changes, and fault tolerance denotes the ability to cope with, e.g., permanent node failures. This paper considers the task pool pattern, specifically its lifeline-based variant. It builds on a previous fault-tolerant realization and integrates...
Today's high-performance computing (HPC) systems are heavily instrumented, generating logs that contain information about abnormal events (such as critical conditions, faults, errors, and failures), about system resource utilization, and about the resource usage of user applications. Once fully analyzed and correlated, these logs can produce detailed information about system health, the root causes of failures,...
The size and complexity of contemporary High Performance Computing (HPC) systems continue to grow. While the reliability of a single component or compute node is high, the huge number of components comprising these systems means that defects occur regularly. This drives the need to manage failure situations. Common issues are component failures or node soft lock-ups that typically...
This paper discusses the motivation and implementation behind Cray's Project Caribou. Project Caribou enables users to correlate HPC job performance with Lustre file systems through collected metrics and events. We discuss use cases, the sources of the metrics that are collected, correlation, and how the data is visualized. Additional topics include the events and alerts that are available, as well as...
System monitoring is an established tool for measuring the utilization and health of HPC systems. Usually, system monitoring infrastructures make no connection to job information and do not utilize hardware performance monitoring (HPM) data. To make the use of HPC systems more efficient, automatic and continuous performance monitoring of jobs is an essential component. It can help to identify pathological...