As hardware becomes more flexible in terms of programming, software APIs must expose hardware features in a portable way. Additions in the OpenCL 2.0 API expose thread communication through the newly defined work-group functions. In this paper we focus on two implementations of the work-group functions in the OpenCL compiler backend for Intel's GPUs. We first describe the particularities of Intel's GEN...
Document is unavailable: This DOI was registered to an article that was not presented by the author(s) at this conference. As per section 8.2.1.B.13 of IEEE's "Publication Services and Products Board Operations Manual," IEEE has chosen to exclude this article from distribution. We regret any inconvenience.
With GPUs (graphics processing units) taking part in general-purpose computing, a heterogeneous system usually achieves higher performance and efficiency. Many studies address how to improve the performance of a heterogeneous system, and a number of them do so by allocating workload across processors using different strategies. In this paper, we implement a task allocation...
Finding the roots of polynomials is a very important part of solving real-life problems, but the higher the degree of the polynomial, the harder it becomes. In this paper, we present two different parallel algorithms of the Ehrlich-Aberth method to find roots of sparse and fully defined polynomials of high degrees. Both algorithms are based on CUDA technology to be implemented on multi-GPU computing...
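For orientation, the Ehrlich-Aberth iteration named in the abstract above can be sketched sequentially. This is a minimal illustration, not the paper's CUDA multi-GPU algorithms; the function names, the starting circle (radius from the Cauchy bound, angular offset 0.4), and the tolerance are assumptions made for this example. It assumes simple roots and a derivative that is nonzero at every iterate.

```python
import cmath


def horner(coeffs, z):
    """Return (p(z), p'(z)); coefficients are listed highest degree first."""
    p, dp = 0j, 0j
    for c in coeffs:
        dp = dp * z + p
        p = p * z + c
    return p, dp


def ehrlich_aberth(coeffs, tol=1e-12, max_iter=200):
    """Approximate all roots of a polynomial simultaneously (hypothetical helper)."""
    n = len(coeffs) - 1
    lead = coeffs[0]
    # Assumed starting points: a circle whose radius is the Cauchy root bound,
    # rotated by 0.4 rad so the guesses are not symmetric about the real axis.
    radius = 1 + max(abs(c / lead) for c in coeffs[1:])
    roots = [radius * cmath.exp(1j * (2 * cmath.pi * k / n + 0.4)) for k in range(n)]
    for _ in range(max_iter):
        largest_step = 0.0
        for k in range(n):
            p, dp = horner(coeffs, roots[k])
            if p == 0:
                continue  # this estimate is already an exact root
            w = p / dp  # Newton correction
            # The Aberth term pushes the estimate away from the other estimates.
            repulsion = sum(1 / (roots[k] - roots[j]) for j in range(n) if j != k)
            step = w / (1 - w * repulsion)
            roots[k] -= step
            largest_step = max(largest_step, abs(step))
        if largest_step < tol:
            break
    return roots
```

Calling `ehrlich_aberth([1, -3, 2])` approximates the roots of z^2 - 3z + 2. Each root's update depends only on the current estimates of the others, which is what makes the method a natural fit for one-thread-per-root parallelization.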
This article presents and assesses a massively parallel processing model for basic operations with k-mers from genomic sequences, based on defined functions in terms of N-dimensional spaces. The model is implemented using a set of OpenCL cores available at github.com/bioinfud/k-merscl and assessed using a heterogeneous platform CPU/GPU and a dataset based on randomly generated k-mers. The results...
In modern networks, different applications generate various types of network traffic. To improve the performance of network management, it is important to identify and classify Internet traffic. The machine learning (ML) technique based on per-flow statistics has been widely used in traffic classification. Different from traditional classification methods,...
Sparsity-constrained nonnegative matrix factorization (NMF) has been proved to be an effective method for hyperspectral unmixing. However, the optimization procedure of sparsity-constrained NMF is computationally demanding, which may limit its application in time-constrained conditions. In this paper, a parallel L1/2 sparsity-constrained NMF unmixing method on Graphics Processing Units (GPUs) is proposed,...
Synthetic Aperture Radar (SAR) has been widely used in airborne remote sensing and satellite ocean observation to reduce the effect of weather conditions and sun illumination. As technology has developed, swath and resolution requirements have increased in terrain, which results in a huge increase in echo data and simulation time [1]. With the development of the graphics processing unit (GPU), it can reduce...
Computations on small-scale data are common in scientific computing and application domains, and an efficient method for such small computations can unlock the potential of many calculations and applications. In this paper, a novel self-adaptive parallel computing method based on the graphics processing unit (GPU) architecture for batches of small-scale computing tasks is proposed...
This paper presents several novel GPU optimization techniques to accelerate the SRCNN (Super-Resolution Convolutional Neural Network), one of the best super-resolution algorithms. We first directly parallelize and implement the SRCNN, then accelerate the convolution by making use of the hierarchical feature of GPU memory. We explore different optimization methods on each convolution and select the...
Many new cloud-focused applications such as deep learning and graph analytics have started to rely on the high computing throughput of GPUs, but cloud providers cannot currently support fine-grained time-sharing on GPUs to enable multi-tenancy for these types of applications. Currently, scheduling is performed by the GPU driver in combination with a hardware thread dispatcher to maximize utilization. However,...
The maximum common subgraph of two graphs, G1 and G2, is the largest subgraph in G1 that is isomorphic to a subgraph in G2. Finding the maximum common subgraph of two given graphs is known to be an NP-complete problem. An exact solution for the maximum common subgraph problem can be found by an algorithm that transforms the maximum common subgraph problem into a maximal clique enumeration problem....
We examine the implementation of block compressed row storage (BCSR) sparse matrix-vector multiplication (SpMV) for sparse matrices with dense block substructure, optimized for blocks with sizes from 2x2 to 32x32, on CPU, Intel many-integrated-core, and GPU architectures. Previous research on SpMV for matrices with dense block substructure has largely focused on the design of novel data structures...
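As a point of reference for the abstract above, BCSR stores a sparse matrix as dense r x r blocks indexed per block row, and SpMV multiplies each stored block by the matching slice of the input vector. The following is a minimal sequential sketch, not the optimized kernels the paper examines; the function name and the array names `row_ptr`, `col_idx`, and `blocks` are assumptions chosen for this example, and square blocks stored as row-major nested lists are assumed.

```python
def bcsr_spmv(block_size, row_ptr, col_idx, blocks, x):
    """Compute y = A @ x for a matrix A held in BCSR form (hypothetical helper).

    row_ptr[bi] .. row_ptr[bi + 1] delimits the blocks of block row bi,
    col_idx[k] is the block column of the k-th stored block, and
    blocks[k] is that block as a block_size x block_size nested list.
    """
    r = block_size
    n_block_rows = len(row_ptr) - 1
    y = [0.0] * (n_block_rows * r)
    for bi in range(n_block_rows):
        for k in range(row_ptr[bi], row_ptr[bi + 1]):
            bj = col_idx[k]
            block = blocks[k]
            # Dense r x r block times the matching r-element slice of x.
            for i in range(r):
                acc = 0.0
                for j in range(r):
                    acc += block[i][j] * x[bj * r + j]
                y[bi * r + i] += acc
    return y


# A 4x4 matrix made of three 2x2 blocks at block positions (0,0), (1,0), (1,1).
row_ptr = [0, 1, 3]
col_idx = [0, 0, 1]
blocks = [
    [[1, 2], [3, 4]],
    [[0, 1], [1, 0]],
    [[5, 6], [7, 8]],
]
y = bcsr_spmv(2, row_ptr, col_idx, blocks, [1, 1, 1, 1])
```

The inner two loops are a small dense matrix-vector product with a fixed trip count per block size, which is the part that block-size-specific optimization typically targets.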
The clustering coefficient and the transitivity ratio are concepts often used in network analysis, which creates a need for fast practical algorithms for counting triangles in large graphs. Previous research in this area focused on sequential algorithms, MapReduce parallelization, and fast approximations. In this paper we propose a parallel triangle counting algorithm for CUDA GPU. We describe the...
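To make the quantities in the abstract above concrete, triangles can be counted by orienting each edge toward its higher-ranked endpoint and intersecting the resulting neighbor sets; the transitivity ratio is then three times the triangle count divided by the number of connected triples. This is a minimal sequential sketch of that standard approach, not the paper's CUDA algorithm; the function names and the (degree, id) ranking are assumptions for this example, and vertex ids are assumed to be mutually comparable (for example, integers).

```python
def build_adjacency(edges):
    """Build undirected adjacency sets, ignoring self-loops and duplicate edges."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def count_triangles(edges):
    """Count each triangle exactly once via rank-oriented set intersection."""
    adj = build_adjacency(edges)
    # Rank vertices by (degree, id) and keep only edges pointing to higher rank.
    rank = {v: (len(adj[v]), v) for v in adj}
    forward = {v: {w for w in adj[v] if rank[w] > rank[v]} for v in adj}
    total = 0
    for u in forward:
        for v in forward[u]:
            # Common higher-ranked neighbors close a triangle with edge (u, v).
            total += len(forward[u] & forward[v])
    return total


def transitivity(edges):
    """Return 3 * triangles / connected triples, or 0.0 if there are no triples."""
    adj = build_adjacency(edges)
    triples = sum(len(nbrs) * (len(nbrs) - 1) // 2 for nbrs in adj.values())
    if triples == 0:
        return 0.0
    return 3 * count_triangles(edges) / triples
```

For a triangle with one pendant vertex, `count_triangles([(0, 1), (1, 2), (0, 2), (2, 3)])` finds one triangle among five connected triples, so the transitivity ratio is 3/5.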
ChaCha20 is an encryption cipher selected by Google to replace the now obsolete RC4 in the Chrome browser and Android devices. The current article discusses the performance implications of parallelizing ChaCha20 across multicore CPUs and GPUs. The serial implementation used to derive the parallel code is part of the BoringSSL encryption library. We used OpenMP and OpenCL to accelerate the cipher and obtain...
A numerical approach to frequency response problems usually requires that the system governing equation be solved repeatedly at many frequencies. The computational efficiency of the overall process can be increased by departing from the traditional sequential computing model in favor of the parallel processing capability commonly offered by modern hardware. In this paper, we consider a hybrid...
Image filtering is a process of reducing noise which degrades the performance of image processing. In some applications such as segmentation or classification, denoising has been designed to smooth the homogeneous areas while keeping and enhancing the edges. In several applications such as video analysis, image-guided surgical interventions or visual servoing, real-time denoising is needed. The devoted...
Visual pattern recognition is a key research topic in the field of image processing and computer vision. Texture analysis based on steerable Riesz wavelets is powerful, but requires computing pixel-wise operations, resulting in a run time on the order of days when large volumes of data are processed. To overcome this limitation we propose a Graphics Processing Unit (GPU) based solution. A standard...
In this paper we present microbenchmarks in OpenCL to measure the most important performance characteristics of GPUs. Microbenchmarks try to measure individual characteristics that influence the performance. First, performance, in operations or bytes per second, is measured with respect to the occupancy and as such provides an occupancy roofline curve. The curve shows at which occupancy level peak...