With the rapidly increasing use of image and video processing in many fields, the requirements for high-performance and high-quality systems have led to the use of reconfigurable computing to accelerate traditional image processing platforms. In this work, an efficient runtime adaptable floating-point Gaussian filtering core is proposed to achieve not only high performance and quality but also kernel...
In this paper, we analyse performance and energy consumption of four OpenMP runtime systems over a NUMA platform. We present an experimental study to characterize OpenMP runtime systems on the three main kernels in dense linear algebra algorithms (Cholesky, LU and QR) in terms of performance and energy consumption. Our experimental results suggest that OpenMP runtime systems can be considered as a...
We show how to compute a relative-error low-rank approximation to any positive semidefinite (PSD) matrix in sublinear time, i.e., for any $n \times n$ PSD matrix $A$, in $\tilde{O}(n \cdot \mathrm{poly}(k/\epsilon))$ time we output a rank-$k$ matrix $B$, in factored form, for which $\|A - B\|_F^2 \le (1 + \epsilon)\|A - A_k\|_F^2$, where $A_k$ is the best...
Heterogeneous platforms that include diverse architectures such as multicore CPUs, FPGAs and GPUs are becoming very popular due to their superior performance and energy efficiency. Besides heterogeneity, a promising approach for minimizing energy consumption is through approximate computing which relaxes the requirement that all parts of a program are considered equally important to the output quality,...
In this paper we propose a vectorized sorted set intersection approach for the task of counting the exact number of triangles of a graph on CPU cores. The computation is factorized into reordering and counting kernels where the reordering kernel builds upon the Reverse Cuthill-McKee heuristic.
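The counting step described above can be illustrated with a minimal scalar sketch. It is not the paper's implementation: it omits both the vectorization and the Reverse Cuthill-McKee reordering, and the names `intersect_count` and `count_triangles` are hypothetical. It only shows how intersecting sorted adjacency lists yields an exact triangle count.

```python
def intersect_count(a, b):
    """Count elements common to two ascending sorted lists (two-pointer merge)."""
    i = j = count = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            count += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return count


def count_triangles(edges):
    """Count triangles in an undirected graph given as (u, v) pairs.

    Each edge is stored once, oriented from the smaller to the larger
    vertex id, so every triangle a < b < c is found exactly once, at
    edge (a, b), where c lies in both forward adjacency lists.
    """
    forward = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        lo, hi = (u, v) if u < v else (v, u)
        forward.setdefault(lo, set()).add(hi)
        forward.setdefault(hi, set())
    # Sorted lists are what make the merge-style intersection valid.
    adj = {u: sorted(vs) for u, vs in forward.items()}
    total = 0
    for u, neighbours in adj.items():
        for v in neighbours:
            total += intersect_count(neighbours, adj[v])
    return total
```

For example, passing the six edges of a complete graph on four vertices returns 4, one triangle per choice of three vertices.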
OpenCL is a standard that supports a parallel programming paradigm which enables heterogeneous multi-core systems and also offers a high level of portability for the application. Some of the systems that are used with OpenCL might have vector capabilities at the level of device compute units. There are several ways the vector capabilities could be exploited by the OpenCL device application, the most common...
In Gradient-Based Cross-Spectral Stereo Matching (GB-CSSM), output disparity maps tend to be coarse but, for the most part, reliable. However, general methods of improving the performance of disparity maps generated from the Cross-Spectral comparison of visual and full infrared input images are non-existent. In particular, previous works fail to address the role and interaction of...
While GPUs are becoming common in HPC systems, the CPU is still responsible for managing both GPU-side and CPU-side compute, communication, and synchronization operations. For instance, if a result from a GPU-side computation is to be transferred to a remote destination, then the CPU must synchronize on GPU compute completion before issuing a communication operation. Both CPU cycles and energy are consumed...
In this paper, we address the decreasing performance of the FFTXlib, the Fast Fourier Transform (FFT) kernel of Quantum ESPRESSO, when scaling to a full KNL node. An increased performance in the FFTXlib will likewise increase the performance of the entire Quantum ESPRESSO code, one of the most used plane-wave DFT codes in the materials science community. Our approach focuses on, first, overlapping...
We address the problem of optimizing global shared memory usage in deeply heterogeneous accelerators in the context of HPC systems running multiple applications with different quality of service levels. We explore predictive memory allocation algorithms, which serve up to 28% more high-priority requests when using a moving-average-based prediction in a low-workload scenario.
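As a rough illustration of moving-average-based prediction, the sketch below keeps a sliding window of recent high-priority demand and holds back that much memory when deciding whether to admit a low-priority request. The abstract does not describe the actual allocation policy, so the class `MovingAveragePredictor`, the function `admit_low_priority`, and the reserve rule are all hypothetical.

```python
from collections import deque


class MovingAveragePredictor:
    """Predict the next demand as the mean of the last `window` observations."""

    def __init__(self, window):
        # deque with maxlen drops the oldest sample automatically.
        self.samples = deque(maxlen=window)

    def observe(self, demand):
        self.samples.append(demand)

    def predict(self):
        if not self.samples:
            return 0.0
        return sum(self.samples) / len(self.samples)


def admit_low_priority(request_size, free_memory, predictor):
    """Hypothetical policy: admit a low-priority request only if the memory
    left afterwards still covers the predicted high-priority demand."""
    reserve = predictor.predict()
    return request_size <= free_memory - reserve
```

With a window of three and observed demands 10, 20, 30, 40, the prediction is the mean of the last three samples, so a low-priority request is admitted only if it leaves at least that much memory free.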
In order to effectively detect malware in Android, dynamic analysis techniques with Android emulators are widely adopted. Emulators can be deployed for large-scale malware detection and restored to an ensured clean state in a short period after each app analysis process such that dynamic analysis upon emulators can effectively detect malware. Moreover, emulators significantly reduce the detection...
The hybrid runtime (HRT) model offers a path towards high performance and efficiency. By integrating the OS kernel, runtime, and application, an HRT allows the runtime developer to leverage the full feature set of the hardware and specialize OS services to the runtime's needs. However, conforming to the HRT model currently requires a port of the runtime to the kernel level, for example to the Nautilus...
Future high-performance computing systems will need to include multiple specialized accelerators in a single heterogeneous system to overcome power-density limitations of CPU performance.
This paper presents the design and implementation of hardwired OS kernel circuitry inside a Java application processor to provide the system services that are traditionally implemented in software. The hardwired system functions in the proposed SoC include the thread manager, the memory manager, and the I/O subsystem interface. There are many advantages in making the OS kernel a hardware component,...
For problems of image or video segmentation, where clusters have a complex structure, a leading method is spectral clustering. It works by encoding the similarity between pairs of points into an affinity matrix and applying k-means in its low-order eigenspace, where the clustering structure is enhanced. When the number of points is large, an approximation is necessary to limit the runtime even if...
Increasing architectural diversity makes performance portability extremely important for parallel simulation codes. Emerging on-node parallelization frameworks such as Kokkos and RAJA decouple the work done in kernels from the parallelization mechanism, allowing for a single source kernel to be tuned for different architectures at compile time. However, computational demands in production applications...
In the field of high-performance heterogeneous computing systems, field programmable gate arrays (FPGAs) have shown great advantages in terms of acceleration and energy efficiency. With the inclusion of the OpenCL framework for parallel programming, the design complexity has been greatly reduced. However, the parallel implementation of applications containing data-dependent branches usually experiences...
The main challenge of architecting modern industrial control and automation systems (ICASs) is that they need to fulfill quality attributes (QAs) traditional to real-time systems, such as timeliness and predictability, as well as those of modern software engineering, such as modularity or reusability. QAs often conflict, which entails difficult trade-offs. As a consequence, even the architecture of closely...
In this paper, we present our experience designing and testing an energy saving strategy for mobile phones, implemented at operating system level, using Android OS. Our approach was to deploy kernel extensions that assess the status of the device, and enable economic profiles without user intervention. Our experiments showed that the power management kernel extension was able to extend the battery runtime...
The fast-extract algorithm is a well-known algebraic method for factoring and decomposing Boolean expressions. Since it uses pairwise comparisons between cubes to find factors, the runtime is degraded for networks whose primary outputs are expressed in terms of primary inputs and have Boolean functions with thousands of cubes. This paper describes a new implementation of the fast-extract algorithm,...