Recognition, Mining, and Synthesis (RMS) applications are expected to make up much of the computing workloads of the future. Many of these applications (e.g., recommender systems and search engines) are formulated as finding eigenvalues and eigenvectors of large-scale matrices. These applications are inherently error-tolerant, and it is often unnecessary, sometimes even impossible, to calculate all the eigenpairs...
Energy has emerged as the most important resource for computing systems. Despite its exceptional importance, reducing energy demand at the application and system level remains a challenging task for programmers and engineers. This is aggravated by the fact that traditional energy-saving approaches are not only error-prone but can even lead to adverse consequences (i.e., increased energy consumption)...
State-of-the-art mobile systems-on-chip (SoCs) include heterogeneity in various forms for accelerated and energy-efficient execution of a diverse range of applications. Modern SoCs include programmable cores such as CPUs and GPUs with very different functionality. The SoCs also integrate performance-heterogeneous cores with different power-performance characteristics but the same instruction set...
Understanding three-dimensional seismic wave propagation in complex media is still one of the main challenges of quantitative seismology. Because of its simplicity and numerical efficiency, the finite-difference method is one of the standard techniques used to solve the elastodynamic equation. Additionally, this class of modeling relies heavily on parallel architectures in order to tackle...
Many-core systems and heterogeneous systems are becoming increasingly common and may soon enter the mainstream market. To exploit their full potential, the runtime system's scheduling policies have to be adapted and, in many cases, tailored to the specific system. The runtime system can be either an operating system or the management infrastructure of an infrastructure as a service...
Numerous applications focus on the analysis of entities and the connections between them, and such data are naturally represented as graphs. In particular, the detection of a small subset of vertices with anomalous coordinated connectivity is of broad interest, for problems such as detecting strange traffic in a computer network or unknown communities in a social network. Eigenspace analysis of large-scale...
OpenCL is a portable interface that can be used to program cluster nodes with heterogeneous compute devices. The OpenCL specification tightly binds its workflow abstraction, or "command queue," to a specific device for the entire program. For best performance, the user has to find the ideal queue-to-device mapping at command queue creation time, an effort that requires a thorough understanding...
In this work, we present the characterization of a set of scientific kernels that are representative of the behavior of fundamental and applied physics applications across a wide range of fields. We collect performance attributes in the form of micro-operation mix and off-chip memory bandwidth measurements for these kernels. Using these measurements, we apply two clustering methodologies to show which...
OpenCL is an open standard for programming parallel heterogeneous systems. It is designed for portability and is therefore used in embedded system programming as well as in high performance computing (HPC). Because it is applicable to different platforms, OpenCL library vendors have a certain freedom in implementing parts of the OpenCL execution model. Multiple versions of the standard...
Heterogeneous platforms combine different processing units. The key to using them efficiently is workload partitioning. Both static and dynamic partitioning strategies have been defined in previous work, but their applicability and performance differ significantly depending on the application to be executed. In this paper, we propose an application-driven method to select the best partitioning...
The OpenACC standard has been developed to simplify parallel programming of heterogeneous systems. Based on a set of high-level compiler directives, it allows application developers to offload code regions from a host CPU to an accelerator without the need for low-level programming with CUDA or OpenCL. Details are implicit in the programming model and managed by OpenACC API-enabled compilers and...
Hardware and software stack complexity makes programming GPGPUs difficult and limits application portability. This article first discusses challenges imposed by the current hardware and software model in GPGPU systems, which relies heavily on the HOST device (CPU). We then identify system bottlenecks both in the hardware design and in the software stack and present two ideas to extend the HOST and DEVICE...
Nowadays, heterogeneous system architectures, integrating CPUs and one or more kinds of accelerators (e.g., GPUs or HW accelerators), are a promising solution to achieve high performance for data-intensive workloads while fulfilling other system-level requirements on the available power/energy budgets. However, heterogeneity comes at the cost of greater design and management complexity leading to...
The process scheduling algorithm plays a crucial role in operating system performance, as does the data structure used to implement it. A scheduler is designed to ensure that resources are distributed fairly among tasks while maximizing CPU utilization. The Completely Fair Scheduler (CFS), the default scheduler of Linux (since kernel version 2.6.23), ensures equal opportunity among...
Enhancement algorithms can give low-light-level images a clear visual effect similar to that of images captured in daytime. However, because of their high complexity and computational cost, low-light-level image enhancement algorithms usually struggle to meet real-time requirements, which limits their use in practical applications. To address this, a parallel optimization algorithm...
Intelligent GPU cache bypassing can improve the efficiency of using GPU memory bandwidth, which can benefit GPU performance. In this paper, we study a pure hardware-based GPU cache bypassing method that can be applied to GPU applications without having to recompile the programs. Moreover, we introduce a hybrid method that can exploit profiling information to further enhance the hardware-based bypassing...
As the core density of future processors keeps increasing, MPI+Threads is becoming a promising programming model for large-scale SMP clusters. Generally speaking, a hybrid MPI+Threads runtime can greatly improve intra-node parallelism and data sharing on shared-memory architectures. However, it does not help much with inter-node communication due to the inefficient integration of existing communication...
We present JolokiaC++, a compiler framework to ease coding of irregular data applications on GPUs. The effectiveness of the compiler and runtime systems of JolokiaC++ is tested using three kernels (IRREG, MOLDYN, and NBF) executed on NVIDIA GPUs. We developed extensions for the generic parallel constructs that allow portable and efficient programming of codes with irregular accesses on the GPU. We present...
Tasking is a prominent parallel programming model. In this paper we conduct a first study into the feasibility of task-parallel execution at the CUDA grid level, rather than the stream/kernel level, for regular, fixed in-out dependency task graphs, similar to those found in wavefront computational patterns, making the findings broadly applicable. We propose and evaluate three CUDA task progression algorithms,...
Graph-structured data has come into wide use in various fields where graphs are the natural data structure to model networks. As a result, the comparison of two graphs has become a research focus. Traditional approaches to graph comparison face a common problem: they either incur increasing runtime for large graphs or simplify the graph representation, ignoring part of the topological information...