The Infona portal uses cookies, i.e. strings of text saved by a browser on the user's device. The portal can access those files and use them to remember the user's data, such as their chosen settings (screen view, interface language, etc.), or their login data. By using the Infona portal the user accepts automatic saving and using this information for portal operation purposes. More information on the subject can be found in the Privacy Policy and Terms of Service. By closing this window the user confirms that they have read the information on cookie usage, and they accept the privacy policy and the way cookies are used by the portal. You can change the cookie settings in your browser.
GPUs have been proven very effective for structured applications. However, emerging data intensive applications are increasingly unstructured — irregular in their memory and control flow behavior over massive data sets. While the irregularity in these applications can result in poor workload balance among fine-grained threads or coarse-grained blocks, one can still observe dynamically formed pockets...
Graphics Processing Units (GPU) have been used extensively for accelerating parallelizable applications in general, and scientific computations in particular. Stencil based algorithms are used intensively in various research areas and represent good candidates for GPU based acceleration. Since scientific computations have high accuracy requirements, herein we focus on stencil based double precision...
Current GPUs (Graphics Processing Units) offer high computational power at relatively low cost, nonetheless, this enhanced performance often comes at the expenses of flexibility and code complexity. Efficient GPU programming requires detailed knowledge on certain hardware aspects. The scan operator is an important building block for a wide range of algorithms. In this paper, we present a number of...
Fast Oil Paint Image filter is a performance optimized oil paint algorithm. Current oil paint algorithms are CPU intensive and take a long time to produce the output; further the time taken increases exponentially with increasing quality. One of the main causes is re-computation. The proposed algorithm significantly reduces re-computation reducing the processing time by approximately 86%. The research...
An increasing number of MPI applications are being ported to take advantage of the compute power offered by GPUs. Data movement continues to be the major bottleneck on GPU clusters, more so when data is non-contiguous, which is common in scientific applications. The existing techniques of optimizing MPI data type processing, to improve performance of non-contiguous data movement, handle only certain...
Short-lag spatial coherence (SLSC) imaging, a coherence-based alternative beamforming technique, creates images related to the spatial covariance in backscatter. Because spatial covariance estimation is a computationally intensive process, efficient techniques are crucial to implementing SLSC imaging in real-time. Sparse sampling methods that take advantage of the statistical properties of spatial...
We present a tool that reduces the development time of GPU-executable code. We implement a catalogue of common optimizations specific to the GPU architecture. Through the tool, the programmer can semi-automatically transform a computationally-intensive code section into GPU-executable form and apply optimizations thereto. Based on experiments, the code generated by the tool can be 3-256X faster than...
Application mapping algorithm for reconfigurable architecture is one of the major research direction in reconfigurable computing. In this paper, we analyze the data memory bank conflict problem of the ACRPs (Application Customized Reconfigurable Pipelines) when exploiting the data parallelism and a conflict-free iteration-data mapping algorithm based on the operation mapping results is proposed to...
Sparse matrix-vector and multi-vector multiplications (SpMV and SpMM) are performance bottlenecks operations in numerous HPC applications. A variety of SpMV GPU kernels using different matrix storage formats have been developed to accelerate these applications. Unlike SpMV, where matrix elements are accessed only once, multiplying by k vectors requires accessing matrix elements k times. In this paper...
Graphics processing units(GPUs) provide a low cost platform for accelerating high performance computations. New programming languages, such as CUDA and OpenCL, make GPU programming attractive to programmers. However, programming GPUs is still a cumbersome task for two reasons, tedious performance optimizations and lack of portability. First, optimizing an algorithm for a specific GPU is a time-consuming...
Memory accesses limit the performance of stream processors. The stream compiler exploits the reuse of records distributed on different ALU clusters by introducing inter-cluster communications, which decreases the program performance. The paper presents the Stream Transpose (ST) approach to exploit such reuse. The approach, by reorganizing the data, puts data that have been distributed on neighboring...
We present JolokiaC++, an annotation based compiler framework which generates high quality CUDA (Compute Unified Device Architecture) code for GPUs. Our contributions include: (1) developing explicit and implicit annotations with illustrations of their use in C++, (2) showing the utility of these annotations by providing comparison code snippets, which demonstrates the ease of programming and performance...
Outlier detection is a data mining task consisting in the discovery of observations which deviate substantially from the rest of the data, and has many important practical applications. Outlier detection in very large data sets is however computationally very demanding and the size limit of the data that can be elaborated is considerably pushed forward by mixing three ingredients: efficient algorithms,...
Graph500 is a data intensive application for high performance computing and it is an increasingly important workload because graphs are a core part of most analytic applications. So far there is no work that examines if Graph500 is suitable for vectorization mostly due a lack of vector memory instructions for irregular memory accesses. The Xeon Phi is a massively parallel processor recently released...
A multi-GPU parallelization of exact string matching algorithms based on the backward-search procedure by using indexing techniques, such as the Burrows-Wheeler Transform and the FM-Index, is proposed in this paper. To attain an efficient execution on highly heterogeneous parallel platforms, the proposed parallelization adopted an unified OpenCL implementation that allows its execution either in CPUs...
Bias-scalability in analog CMOS circuits refers to a current-mode design paradigm where the operation of the circuit remains invariant to the operating conditions (weak-inversion, moderate-inversion or strong-inversion) of the transistors. In this paper we present the design and implementation of a bias-scalable analog support vector machine (SVM) based on our previously reported margin propagation...
In general-purpose graphics processing unit (GPGPU) computing, data is processed by concurrent threads executing the same function. This model, dubbed single-instruction/multiple-thread (SIMT), requires programmers to coordinate the synchronous execution of similar operations across thousands of data elements. To alleviate this programmer burden, Gaster and Howes outlined the channel abstraction,...
A new unified mathematical framework for sensor array processing is proposed. The proposed framework combines Bayesian estimation with stochastic geometry to accommodate prior information, uncertainty in array parameters, and unknown and stochastically time-varying number of nonstationary sources. A system model for a common signal setting is constructed to demonstrate the proposed framework.
Computer game, a new field of artificial intelligence, as the name suggests, is to make the computer learn to think and play chess games like human beings. As one of the important research field of the artificial intelligence, computer game, which is considered as the touchstone of the artificial intelligence, has brought many important methods and theories to the field. Connect6, is a newly introduced...
The paging mechanism is widely used in most modern systems to handle the virtual memory. Many page replacement algorithms have been proposed. Therefore, the cor-rectness and reliability of virtual memory management systems become very important. It is essential to formalize and verify the system in a formal way. In this paper, we model the virtual memory management system with MSVL, which is a parallel...
Set the date range to filter the displayed results. You can set a starting date, ending date or both. You can enter the dates manually or choose them from the calendar.