Silent data corruption (SDC) poses a great challenge for high-performance computing (HPC) applications as we move to extreme-scale systems. Mechanisms have been proposed that are able to detect SDC in HPC applications by using the peculiarities of the data (more specifically, its “smoothness” in time and space) to make predictions. However, these data-analytic solutions are still far from fully protecting...
System monitoring is an established tool to measure the utilization and health of HPC systems. Usually, system monitoring infrastructures make no connection to job information and do not utilize hardware performance monitoring (HPM) data. To increase the efficient use of HPC systems, automatic and continuous performance monitoring of jobs is an essential component. It can help to identify pathological...
Because data collection in HPC systems happens on the nodes and is easily related to the job running on the node, tools presenting the data and subsequent analyses to the user generally present them at the job level. Our position is that this is the wrong level of abstraction and thus limits the value of the analyses, often dissuading users from using any of the offered tools. In this paper we present...
The rise of graph analytic systems has created a need for ways to measure and compare the capabilities of these systems. Graph analytics present unique scalability difficulties. The machine learning, high performance computing, and visual analytics communities have wrestled with these difficulties for decades and developed methodologies for creating challenges to move these communities forward. The...
In this study we use a triangular basis function set to solve a second-kind fuzzy integral equation, which can be converted to a system of two integral equations in the crisp case. We also apply a collocation method to solve the equation approximately.
Parallel applications are highly irregular and high performance computing (HPC) infrastructures are very complex. The HPC applications of interest herein are timestepping scientific applications (TSSA). Often, TSSA involve the repeated execution of multiple parallel loops with thousands of iterations and irregular behavior. Dynamic loop scheduling (DLS) techniques were developed over time and have...
Cloud computing enables end users to execute high-performance computing applications by renting the required computing power. This pay-for-use approach enables small enterprises and startups to run HPC-related businesses with a significant saving in capital investment and a short time to market. When deploying an application in the cloud, the users may a) fail to understand the interactions of the...
Performance monitoring is essential for all computing systems, especially high-performance computing systems. These systems are sensitive to errors and failures, which can lead to data loss and severely impact organizations. Consequently, resource information in these systems (e.g., CPU usage, memory usage, disk I/O usage, etc.) during operation must be collected through the system monitoring...
The interconnection network has a large influence on total cost, application performance, energy consumption, and overall system efficiency of a supercomputer. Unfortunately, today's routing algorithms do not utilize this important resource most efficiently. We first demonstrate this by defining the dark fiber metric as a measure of unused resource in networks. To improve the utilization, we propose...
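The abstract defines the dark fiber metric only loosely, as "a measure of unused resource in networks." As a purely hypothetical illustration (not the paper's definition), one could treat it as the fraction of directed links that carry no traffic under a fixed routing; the function below and its inputs are assumptions:

```python
# Hypothetical toy: "dark fiber" as the fraction of directed links
# that no route traverses under a given (static) routing.
def dark_fiber_fraction(links, routes):
    """links: iterable of (u, v) directed links in the network.
    routes: iterable of paths, each a list of nodes visited in order."""
    used = set()
    for path in routes:
        used.update(zip(path, path[1:]))   # directed links this route traverses
    links = set(links)
    return len(links - used) / len(links)

# A 4-node bidirectional ring with a single route 0 -> 1 -> 2
# leaves 6 of 8 directed links dark, i.e. a fraction of 0.75.
ring = [(0, 1), (1, 2), (2, 3), (3, 0), (1, 0), (2, 1), (3, 2), (0, 3)]
```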
In this paper we introduce a novel, dense, system-on-chip many-core Lenovo NeXtScale System® server based on the Cavium THUNDERX® ARMv8 processor, designed for performance, energy efficiency, and programmability. The THUNDERX processor was designed to scale up to 96 cores in a cache-coherent, shared-memory architecture. Furthermore, this hardware system has a power interface board (PIB) that measures...
About ten years ago, we presented the results of an effort to identify the "right metric" for efficient supercomputing at this workshop, The Workshop on High-Performance, Power-Aware Computing. In this paper, we review the advances that the community has made in this area of research. The intention of this ten-year retrospective is two-fold: (1) to acknowledge the past work through a historical...
With each technology improvement, parallel systems get larger, and the impact of interconnection networks becomes more prominent. Random topologies and their variants received more and more attention lately due to their low diameter, low average shortest path length and high scalability. However, existing supercomputers still prefer torus and fat-tree topologies, because a number of existing parallel...
Deduplication has become essential in disk-based backup systems, but there have been few long-term studies of backup workloads. Most past studies either were of a small static snapshot or covered only a short period that was not representative of how a backup system evolves over time. For this paper, we collected 21 months of data from a shared user file system; 33 users and over 4,000 snapshots are...
As supercomputer systems scale up, the performance gap between the compute and storage systems increases dramatically. The traditional speedup metric measures only the performance of the compute system. In this paper, we first propose a speedup metric that takes the I/O constraint into account. The new metric unifies computing and I/O performance, and evaluates the practical speedup of a parallel application under...
Monitoring of High Performance Computing clusters is currently geared towards providing system administrators the information they need to make informed decisions about the resources used in the cluster. However, this emphasis leaves out end users, who utilize the cluster resources for projects and programs and are not given information on how their workflow is impacting the cluster...
A detailed understanding of HPC applications' resource needs and their complex interactions with each other and with HPC platform resources is critical to achieving scalability and performance. Such understanding has been difficult to achieve because typical application profiling tools do not capture the behavior of codes under the potentially wide spectrum of actual production conditions and because...
While HPC system monitoring is a necessary and accepted practice, applications are still basically opaque in the production environment. For better HPC platform management and utilization, especially as platforms push towards exascale size, HPC applications need to be more transparent in their execution in the production environment. PROMON is a framework for application monitoring in the production...
As we move towards the Exascale era of supercomputing, node-level failures are becoming more commonplace; frequent checkpointing is currently used to recover from such failures in long-running science applications. While compute performance has steadily improved year on year, parallel I/O performance has stalled, meaning checkpointing is fast becoming a performance bottleneck. Using current...
The paper considers techniques for the measurement and calculation of security metrics that take into account attack graphs and service dependencies. The techniques are based on several assessment levels (topological, attack-graph, attacker, event, and system levels) and important aspects (zero-day attacks, cost-efficiency characteristics). They allow understanding of the current security situation,...