The Infona portal uses cookies, i.e. strings of text saved by a browser on the user's device. The portal can access those files and use them to remember the user's data, such as their chosen settings (screen view, interface language, etc.), or their login data. By using the Infona portal the user accepts automatic saving and using this information for portal operation purposes. More information on the subject can be found in the Privacy Policy and Terms of Service. By closing this window the user confirms that they have read the information on cookie usage, and they accept the privacy policy and the way cookies are used by the portal. You can change the cookie settings in your browser.
Parallel computing architectures like GPUs have traditionally been used to accelerate applications with dense and highly-structured workloads; however, many important applications in science and engineering are irregular and dynamic in nature, making their effective parallel implementation a daunting task. Numerical simulation of charged particle beam dynamics is one such application where the distribution...
Data staging has been shown to be very effective for supporting data intensive in-situ workflows and coupling of applications. Experimental sciences are increasingly becoming collaborative among geographically distributed teams, and include experimental instruments and HPC facilities. This new way of doing science poses new challenges due to data sizes, complexity of computation, and the use of wide...
To reduce energy consumption and carbon emission, many data centers have deployed (or anticipate to build) their own renewable-energy power plants. However, the renewable energy (such as wind, tide, and solar energy) has the serious issues of intermittency and variability that prevent the green energy from being utilized effectively in practice. To cope with the issues, new power-supply management...
To overcome scalability and performance issues for process queries over a web-scale RDF data, various studies have proposed RDF SPARQL query processing algorithm using parallel processing manners. However, it is hard to resolve the scalability and performance issues together because the problem of communication overhead between nodes is closely related to the data distribution for parallel processing...
Modern distributed storage systems often store redundant data in multiple replications or erasure coding according to their access frequencies. Multiple replications scheme is well-performance for hot data while erasure coding scheme is storage-efficient for warm and cold data. When hot data turn cold, an encoding procedure starts to do the conversion. However, due to sequential striping, current...
This paper implements a Smoothed Particle Hydrodynamics simulation code and distributes it on a heterogeneous cluster. The theoretical analysis results show that treating GPU as equivalent peer of CPU rather than an assistant or a substitute is the most efficient way of using a CPU+GPU compute node. However, it raises complex challenges of heterogeneous cooperation. Our strategies of hybrid-level...
OpenMP is the de facto standard application programming interface (API) for on-node parallelism. The most popular OpenMP runtimes rely on POSIX threads (pthreads) implementations that offer an excellent performance for coarse-grained parallelism and match perfectly with the current hardware. However, a recent trend in runtimes/applications points in the direction of leveraging massive on-node parallelism...
The matrix-matrix multiplication is an essential building block that can be found in various scientific and engineering applications. High-performance implementations of the matrix-matrix multiplication on state-of-the-art processors may be of great importance for both the vendors and the users. In this paper, we present a detailed methodology of implementing and optimizing the double-precision general...
Machine Learning (ML) approaches are widelyused classification/regression methods for data mining applications. However, the time-consuming training process greatly limits the efficiency of ML approaches. We use the example of SVM (traditional ML algorithm) and DNN (state-of-the-art ML algorithm) to illustrate the idea in this paper. For SVM, a major performance bottleneck of current tools is that...
Parity declustering is widely deployed in erasure coded storage systems so as to provide fast recovery and high data availability. However, to perform scaling on such RAIDs, it is necessary to preserve the parity declustered data layout so as to guarantee the RAID performance after scaling. Unfortunately, existing scaling algorithms fail to achieve this goal so they can not be applied for scaling...
Parallel and distributed processing is employed to accelerate training for many deep-learning applications with large models and inputs. As it reduces synchronization and communication overhead by tolerating stale gradient updates, asynchronous stochastic gradient descent (ASGD), derived from stochastic gradient descent (SGD), is widely used. Recent theoretical analyses show ASGD converges with linear...
HPCG and Graph500 can be regarded as the two most relevant benchmarks for high-performance computing systems. Existing supercomputer designs, however, tend to focus on floating-point peak performance, a metric less relevant for these two benchmarks, leaving resources underutilized, and resulting in little performance improvements, for these benchmarks, over time. In this work, we analyze the implementation...
Dynamic task graph schedulers automatically balance work across processor cores by scheduling tasks among available threads while preserving dependences. In this paper, we design NABBITC, a provably efficient dynamic task graph scheduler that accounts for data locality on NUMA systems. NABBITC allows users to assign a color to each task representing the location (e.g., a processor core) that has the...
Power is a critical factor that limits the performance and scalability of modern high performance computer systems. Considering power as a first-order constraint and a scarce system resource, power-bounded computing represents a new perspective to address the power challenge in HPC.In this work we present an application-aware, multi-dimensional power allocation framework to support power-bounded parallel...
We present a scalable distributed memory library for generating and computations involving structured dense matrices, such as those produced by boundary integral equation formulations. Such matrices are dense, but have special structure that can be exploited to obtain efficient storage and matrix-vector product evaluations and consequently the fast solution of linear systems. At the core of the methods...
Apache Storm is a fault-tolerant, distributed inmemory computation system for processing large volumes of high-velocity data in real-time. As an integral part of the fault-tolerance mechanism, Storm's state management is achieved by a checkpointing framework, which commits states regularly and recovers lost states from the latest checkpoint. However, this method involves a remote data store for state...
Set the date range to filter the displayed results. You can set a starting date, ending date or both. You can enter the dates manually or choose them from the calendar.