Automatically parallelizing loop nests into CUDA kernels must exploit the full potential of GPUs to obtain high performance. One state-of-the-art approach makes use of the polyhedral model to extract parallelism from a loop nest by applying a sequence of affine transformations to the loop nest. However, how to automate this process to exploit both intra- and inter-SM parallelism for GPUs remains a...
The web is moving from an era of "search" to that of "discovery". Collaborative filtering (CF) recommender systems are now commonly used to predict a user's preference for an unknown item from past ratings. To be scalable or effective, they are typically deployed in distributed clusters and operate on extremely large a priori datasets. Improvement of the efficiency of these systems...
In this paper, we demonstrate the use of tiling with noise to generate rich procedural textures. We introduce the idea of storing tiles which consist of only the gradients stored at the integer lattice points and constructing a texture on the GPU from these tiles. We also introduce the idea of using mipmapped tiles to store gradients for turbulence. Finally, we demonstrate a novel use of mipmaps to...
Multi-agent path planning on grid maps is a challenging problem and has numerous real-life applications ranging from robotics to real-time strategy games and non-player characters in video games. A* is a cost-optimal forward search algorithm for path planning which scales poorly in practice since both the search space and the branching factor grow exponentially in the number of agents. In this...
Fragment shaders in a graphics pipeline are used to compute the color for each pixel, where lighting, texture loading, and other calculations are involved. The required computing power is proportional to the number of input fragments. In order to improve the power efficiency of mobile GPUs, a content-adaptive sampling scheme is proposed to reduce the number of fragments. The proposed scheme is based on tile-based...
The past decade has witnessed a major paradigm shift in high performance computing with the introduction of accelerators as general purpose processors. These computing devices make available very high parallel computing power at low cost and power consumption, transforming current high performance platforms into heterogeneous CPU-GPU equipped systems. Although the theoretical performance achieved...
Heterogeneous parallel systems, including accelerators such as Graphics Processing Units (GPUs), are expected to play a major role in architecting the largest systems in the world, as well as the most powerful embedded devices. Impressive computational speedups have been reported for numerous algorithms in the fields of medical image processing, digital signal processing, astrophysics, modeling and simulations...
We present a tile-based GPU design which is modeled in a full system simulation platform. The full system simulation platform includes a functional Linux-based system on which the GPU is incorporated for design explorations. To accurately estimate the execution time of the application graphics software, an execution time synchronization mechanism for the virtual platform is developed. We extend the...
Power dissipation and energy consumption are becoming increasingly important architectural design constraints in different types of computers, from embedded systems to large-scale supercomputers. To continue the scaling of performance, it is essential that we build parallel processor chips that make the best use of exponentially increasing numbers of transistors within the power and energy budgets...
Diverse IP cores are integrated on a modern system-on-chip and share resources. Off-chip memory bandwidth is often the scarcest resource and requires careful allocation. Two of the most important cores, the CPU and the GPU, can both simultaneously demand high bandwidth. We demonstrate that conventional quality-of-service allocation techniques can severely constrict GPU performance by allowing the...
Pharmaceutical companies that package tablets in blister strips need to make sure that the tablets are free from defects before they go into the packing box. The purpose of this project is to speed up the inspection process by implementing the image processing algorithm on a GPU. Morphological and mathematical operations have been implemented on both GPU...
Multicore architectures enhanced with multiple GPUs are likely to become mainstream High Performance Computing (HPC) platforms in the near future. In this paper, we present the design and implementation of an LU factorization using a tile algorithm that can fully exploit the potential of such platforms in spite of their complexity. We use a methodology derived from previous work on Cholesky and QR factorizations...
For enabling immersive user experiences for interactive TV services and automating camera view selection and framing, knowledge of the location of persons in a scene is essential. We describe an architecture for detecting and tracking persons in high-resolution panoramic video streams, obtained from the Omni Cam, a panoramic camera stitching video streams from 6 HD resolution tiles. We use a CUDA...
An image mosaic is a large image assembled from many smaller tiles, each of which is itself an actual image. In this research, we introduce an efficient method for making image mosaics. Our method is based on log-polar mapping, which enables us to detect changes in color and shape. We also successfully implement a version of the method that exploits GPU power. Our algorithm is simple, easy to implement, gives...
Auto-tuning has emerged as an important practical method for creating highly optimized code. However, the growing complexity of architectures and applications has resulted in a prohibitively large search space that precludes empirical auto-tuning. Here, we focus on the challenge to auto-tuning presented by applications that require auto-tuning of not just a small number of distinct kernels, but a large...
Performance portability is a major challenge faced today by developers on heterogeneous high performance computers, consisting of an interconnect, memory with non-uniform access, many-cores and accelerators like GPUs. Recent studies have successfully demonstrated that dense linear algebra operations can be efficiently handled by runtime systems using a DAG representation. In this work, we present...
Many applications require real-time decoding of high-resolution video pictures, for example, quick editing of video sequences in video editing applications. To increase decoding speed, parallelism can be exploited, yet block-based image and video coding standards are difficult to decode in parallel because of the high number of dependencies between blocks. This paper investigates the parallel decoding...
Physics-based simulations are actively used in the design, testing, and operations phases of surface and near-surface planetary space missions. One of the challenges in real-time simulations is the ability to handle large multi-resolution terrain data sets within models as well as for visualization. In this paper, we describe special techniques that we have developed for visualization, paging, and...
One of the major trends in the design of exascale architectures is the use of multicore nodes enhanced with GPU accelerators. Exploiting all resources of a hybrid accelerator-based node at their maximum potential is thus a fundamental step towards exascale computing. In this article, we present the design of a highly efficient QR factorization for such a node. Our method has three steps. The first...