The numerical treatment of variational problems gives rise to large sparse matrices, which are typically assembled by coalescing elementary contributions. As the explicit matrix form is required by numerical solvers, the assembly step can be a potential bottleneck, especially in implicit and time dependent settings where considerable updates are needed. On standard HPC platforms, this process can...
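As an illustration of what "coalescing elementary contributions" can mean, here is a minimal sketch, not taken from the paper. It assumes a 1D Laplacian discretized with linear finite elements on a uniform mesh. The function name `assemble_1d_laplacian` and its parameters are hypothetical. Each element contributes a 2x2 block, and entries that land on the same (row, column) pair are summed.

```python
from collections import defaultdict


def assemble_1d_laplacian(n_elements, length=1.0):
    """Assemble a sparse stiffness matrix as a {(row, col): value} dict.

    Assumption for illustration: uniform 1D mesh, linear two-node elements.
    """
    h = length / n_elements
    # Elementary (per-element) stiffness matrix of a linear element.
    k_local = [[1.0 / h, -1.0 / h],
               [-1.0 / h, 1.0 / h]]

    # Step 1: emit one (row, col, value) triplet per elementary contribution.
    triplets = []
    for e in range(n_elements):
        nodes = (e, e + 1)  # global indices of the element's two nodes
        for a in range(2):
            for b in range(2):
                triplets.append((nodes[a], nodes[b], k_local[a][b]))

    # Step 2: coalesce, i.e. sum all triplets that share the same (row, col).
    matrix = defaultdict(float)
    for i, j, value in triplets:
        matrix[(i, j)] += value
    return dict(matrix)


if __name__ == "__main__":
    K = assemble_1d_laplacian(4)
    for (i, j), value in sorted(K.items()):
        print(i, j, value)
```

Interior nodes are shared by two elements, so their diagonal entries receive two contributions. The coalescing step is the part that typically has to be redone when the matrix must be updated.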
The cube attack is a flexible cryptanalysis technique with a simple and fascinating theoretical framework. It combines offline exhaustive searches over selected tweakable public/IV bits (the sides of the “cube”) with an online key-recovery phase. Although virtually applicable to any cipher, and generally praised by the research community, the real potential of the attack is still in question, and...
Input/output (I/O) devices such as a graphics processing unit and a solid-state drive are inserted into I/O slots of a host in data center platforms. With this sort of configuration the I/O devices are used exclusively by the host with resultant inefficient resource usage. In addition, the maximum number of I/O devices that can be assigned to each host is limited by the number of its I/O slots. This...
ChaCha20 is an encryption cipher selected by Google to replace the now obsolete RC4 in the Chrome browser and Android devices. The current article discusses the performance implications of parallelizing ChaCha20 across multicore CPU and GPU. The serial implementation used to derive the parallel code is part of BoringSSL encryption library. We used OpenMP and OpenCL to accelerate the cipher and obtain...
The increasing adoption of GPUs as mainstream computing devices, coupled with the imminent availability of large high-bandwidth caches based on die-stacked memory, makes it important to analyze and understand modern GPU compute applications from the perspective of their memory access and data reuse characteristics. This paper presents detailed workload characterization studies on four GPU compute applications...
A bilateral filter is an edge-preserving, smoothing, non-linear filter. It consists of two kernels, namely the spatial and range kernels, which can be constant or arbitrary. Algorithms for bilateral filtering with constant-time computational complexity exist today, but their execution time is too high for real-time applications. Also, hardware latency and throughput sometimes reduce...
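To make the role of the two kernels concrete, the following is a minimal brute-force sketch on a 1D signal. It is not the constant-time algorithm the abstract refers to. It assumes Gaussian spatial and range kernels, and the names `bilateral_filter_1d`, `radius`, `sigma_s` and `sigma_r` are hypothetical.

```python
import math


def bilateral_filter_1d(signal, radius=2, sigma_s=1.0, sigma_r=0.1):
    """Brute-force bilateral filter on a list of floats.

    Assumption for illustration: both kernels are Gaussian.
    """
    n = len(signal)
    out = []
    for i in range(n):
        weighted_sum = 0.0
        weight_total = 0.0
        # Only visit neighbours inside the window and inside the signal.
        for j in range(max(0, i - radius), min(n, i + radius + 1)):
            # Spatial kernel: weight falls off with distance in position.
            w_spatial = math.exp(-((i - j) ** 2) / (2.0 * sigma_s ** 2))
            # Range kernel: weight falls off with difference in value,
            # which keeps samples across an edge from being averaged in.
            diff = signal[i] - signal[j]
            w_range = math.exp(-(diff ** 2) / (2.0 * sigma_r ** 2))
            w = w_spatial * w_range
            weighted_sum += w * signal[j]
            weight_total += w
        # weight_total is never zero: the j == i term contributes 1.0.
        out.append(weighted_sum / weight_total)
    return out


if __name__ == "__main__":
    step = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
    print(bilateral_filter_1d(step))
```

With a small `sigma_r`, samples on the far side of the step receive a negligible weight, so the edge stays sharp while each flat region is smoothed. The cost of this direct form grows with the window size, which is why constant-time variants matter.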
Reproducibility for High Performance Computing (HPC) systems has been discussed for some time already, but more work should be carried out to cover the latest accelerators that equip the fastest supercomputers such as the ones listed in Top500. In this paper, we perform a replication of a performance evaluation carried out using an N-Body OpenMP parallel application on a Xeon Phi accelerator. We also...
In recent years, heterogeneous HPC systems, which combine traditional processors with accelerator cards such as GPUs, have been shown to deliver superior performance and power efficiency. Since different scientific problems pose different demands on the computer architecture, some general purpose supercomputers consist of different types of nodes, where each type is suited best for certain applications...
Many scientific software applications that solve complex compute- or data-intensive problems, such as large parallel simulations of physics phenomena, increasingly use HPC systems in order to achieve scientifically relevant results. An increasing number of HPC systems adopt heterogeneous node architectures, combining traditional multi-core CPUs with energy-efficient massively parallel accelerators,...
The ray casting algorithm is a major component of direct volume rendering. It exhibits inherent parallelism, making it suitable for graphics processing units (GPUs). However, blindly mapping the ray casting algorithm onto a GPU's complex parallel architecture can result in an order-of-magnitude performance loss. In this paper, a novel computation-to-core mapping strategy, called Warp Marching, for the texture-based...
A low-rank approximation of a dense matrix plays an important role in many applications. To compute such an approximation, a common approach uses the QR factorization with column pivoting (QRCP). Though the reliability and efficiency of QRCP have been demonstrated, this deterministic approach requires costly communication at each step of the factorization. Since such communication is becoming increasingly...
Pedestrian detection is a challenging task, due to the wide variety of appearances, especially in complex real-world scenes. Real-time pedestrian detection is valuable for a broad range of applications in multiple domains, such as surveillance and Intelligent Transportation Systems. In this paper we present a fast implementation of a robust pedestrian detector by using OpenCL, which is a...
We describe a pilot project for the use of GPUs (Graphics Processing Units) in online triggering applications for high energy physics experiments. Two major trends can be identified in the development of trigger and DAQ systems for particle physics experiments: the massive use of general-purpose commodity systems for data acquisition, such as commercial multicore PC farms, and the reduction of trigger...
This paper reports on the current state of a work-in-progress port of a physical-optics simulation tool to NVIDIA's CUDA platform. Current accelerator APIs are briefly presented. Our choice of the CUDA platform is explained, as well as the data flow of the simulation tool. The current state of the implementation of the port is presented, as are first run-time measurements. The results are promising;...
The massive parallelism offered by Graphics Processing Units (GPUs) is now routinely exploited to accelerate computationally intensive tasks in a wide variety of application domains. Efficient GPU programming in languages such as CUDA and OpenCL requires careful application of hand optimisations to exploit parallelism and locality while minimising synchronisation. The effectiveness of such optimisations...
The paper describes heterogeneous parallel processing as a feature of hardware devices. Software supports the configuration of the hardware components and a new kind of system-software supports the distribution of data and the scheduling of tasks. The concept is supported by referring to the relatively recent Open Systems specification, OpenCL. This is briefly described and its likely evolution surmised...