We designed and implemented software for a Remote Inter-Processor Communication (RIPC) architecture on Xeon Phi coprocessors and built a testbed to verify it. We also implemented a lightweight kernel, together with RIPC transmitter/receiver application threads that run on the lightweight kernel on Xeon Phi coprocessors. This paper proposes RIPC methods to communicate between user threads in separate Xeon Phi nodes using...
In this paper, we propose a parallel block-based Viterbi decoder (PBVD) on the graphics processing unit (GPU) platform for the decoding of convolutional codes. The decoding procedure is simplified and parallelized, and the characteristics of the trellis are exploited to reduce the metric computation. Based on the compute unified device architecture (CUDA), two kernels with different parallelism are designed...
This paper presents OpenSwarm, a lightweight easy-to-use open-source operating system. To our knowledge, it is the first operating system designed for and deployed on miniature robots. OpenSwarm operates directly on a robot's microcontroller. It has a memory footprint of 1 kB RAM and 12 kB ROM. OpenSwarm enables a robot to execute multiple processes simultaneously. It provides a hybrid kernel that...
Availability of OpenCL for FPGAs has raised new questions about the efficiency of massive thread-level parallelism on FPGAs. The general trend is toward creating deep pipelining and in-order execution of many OpenCL threads across a shared data-path. While this can be a very effective approach for regular kernels, its efficiency significantly diminishes for irregular kernels with runtime-dependent...
Graphics Processing Units (GPUs) have become a prevalent platform for high throughput general purpose computing. The peak computational throughput of GPUs has been steadily increasing with each technology node by scaling the number of cores on the chip. Although this vastly improves the performance of several compute-intensive applications, our experiments show that some applications can achieve peak...
Sparse matrix-vector multiplication (SpMV) is a key operation in scientific computing and engineering applications. This paper presents an optimization strategy to improve SpMV performance on multi-GPU systems by adopting OpenMP threads and multiple CUDA streams. We propose an efficient scheme that uses OpenMP threads to control multiple GPUs so that they jointly complete SpMV computations. Moreover,...
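As a rough illustration of the operation this abstract refers to, the sketch below computes SpMV on a matrix stored in compressed sparse row (CSR) form and shows how the rows might be split into contiguous blocks, one per worker. It is plain sequential Python, not the paper's OpenMP/CUDA scheme; the CSR layout, the function names `spmv_csr_rows` and `split_rows`, and the block-wise row partitioning are all assumptions made for this example.

```python
def spmv_csr_rows(row_ptr, col_idx, values, x, start, end):
    """Compute y[start:end] of y = A @ x for a CSR matrix A (hypothetical helper)."""
    y = []
    for row in range(start, end):
        acc = 0.0
        # Nonzeros of this row live in values[row_ptr[row]:row_ptr[row + 1]].
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


def split_rows(n_rows, n_parts):
    """Split rows into n_parts contiguous (start, end) blocks of near-equal size."""
    base, extra = divmod(n_rows, n_parts)
    bounds, start = [], 0
    for part in range(n_parts):
        end = start + base + (1 if part < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds


def spmv_partitioned(row_ptr, col_idx, values, x, n_parts):
    """Run each row block independently and concatenate the partial results.

    In a multi-device setting each block could be handed to a different
    device; here the blocks are simply processed one after another.
    """
    y = []
    for start, end in split_rows(len(row_ptr) - 1, n_parts):
        y.extend(spmv_csr_rows(row_ptr, col_idx, values, x, start, end))
    return y
```

Because each output row depends only on its own nonzeros and on `x`, the row blocks are independent, which is what makes this kind of partitioning across devices possible in principle.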
This work performs real-time haze removal with CUDA, based on the atmospheric scattering model and a temporal coherence algorithm. First, a hierarchical search method based on quadtree subdivision replaces the original algorithm for obtaining the atmospheric light, and the number of parallel threads is set to the number of pixels, which processes the required calculation of pixels, the intermediate results...
In combinatorial optimization problems, the neighborhood search (NS) is a fundamental component for local search based heuristics. It consists of selecting a solution from a high cardinality set of neighbor solutions, by means of operations called moves. To perform this search, NS algorithms usually adopt two main approaches: selecting the first or best improving move. The Multi Improvement (MI) strategy...
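The two selection rules named in this abstract can be sketched as follows. This is a minimal, generic illustration for a minimization problem, not the paper's implementation; the function names and the `delta` callback, which returns the cost change a move would cause, are assumptions made for this example.

```python
def first_improving(moves, delta):
    """Return the first move that lowers the cost, or None if none does."""
    for move in moves:
        if delta(move) < 0:
            return move
    return None


def best_improving(moves, delta):
    """Return the move with the largest cost reduction, or None if none improves."""
    best_move, best_delta = None, 0
    for move in moves:
        d = delta(move)
        if d < best_delta:
            best_move, best_delta = move, d
    return best_move
```

The first-improving rule can stop early, whereas the best-improving rule always evaluates the whole neighborhood before choosing.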
This paper proposes a profiling-based method to extract a task graph, which describes the system behavior of a multiprocessor system-on-chip running Android OS. The proposed method computes the resource usage of each task and extracts dependencies among tasks using run-time system profiling results. The proposed method calculates the CPU resource usage and I/O waiting time of each task by analyzing CPU...
The lack of support for explicit synchronization between the streaming multiprocessors (SMs) of GPUs adversely impacts their ability to perform inter-block communication efficiently. In this paper, we present several approaches to inter-block synchronization using explicit/implicit CPU-based and dynamic parallelism (DP) mechanisms. Although this topic has been addressed in previous...
Stencil computations form the basis for computer simulations across almost every field of science, such as computational fluid dynamics, data mining, and image processing. Their mostly regular data access patterns potentially enable them to take advantage of the high computation and data bandwidth of GPUs, but only if data buffering and other issues are handled properly. Finding a good code generation...
This paper presents a parallel motion estimation algorithm on Graphics Processing Units (GPUs) with a GPU-based fast Coding Unit (CU) splitting mechanism to speed up High Efficiency Video Coding (HEVC). Parallel motion estimation algorithms only supply motion vectors to the HEVC encoder, but the CU splitting decision in HEVC still needs more information to speed up the encoder. Therefore,...
The emergence of the new era of the Internet of Things (IoT) has encouraged intensive, if not extensive, use of modern mobile apps, so multicore processors equipped with multiple ISAs have great potential for more efficient binary instruction processing in the near future. To support this ISA diversity of computing platforms, mixed modes of static and dynamic Binary Translation and Optimization...
The performance of a CUDA kernel often depends on the number of threads per thread-block (thread-block size), and the optimal configuration differs according to the graphics processing unit (GPU) hardware and the data size given to the kernel. In particular, in linear algebra libraries such as Basic Linear Algebra Subprograms (BLAS), most routines support a wide range of problem sizes and various...
The non-equispaced fast Fourier transform (NFFT) has attracted significant interest for its applications in tomography and remote sensing, where visualization and image reconstruction require non-equispaced data. Here we present an efficient implementation of a high-accuracy NFFT on an NVIDIA GPU (Graphics Processing Unit). We focused on the convolution step in the computation of the NFFT, since it is the most...
The High Efficiency Video Coding (HEVC) standard provides higher compression efficiency than other video coding standards but at the cost of increased computational load, which makes it hard to achieve real-time encoding/decoding of high-resolution, high-quality video sequences. In this paper, we investigate how Graphics Processing Units (GPUs) can be employed to accelerate HEVC decoding. GPUs are...
Document is unavailable: This DOI was registered to an article that was not presented by the author(s) at this conference. As per section 8.2.1.B.13 of IEEE's "Publication Services and Products Board Operations Manual," IEEE has chosen to exclude this article from distribution. We regret any inconvenience.
Stream graphs can provide a natural way to represent many applications in the multimedia and DSP domains. Though the exposed parallelism of stream graphs makes it relatively easy to map them to general-purpose GPUs (GPGPUs), handling very large stream graphs and best exploiting multi-GPU platforms to achieve scalable performance pose great challenges for stream graph mapping. Previous work considers either...
This paper studies two parallelization techniques, GPGPU and Pthreads for multiprocessor architectures, for implementing an SPSO algorithm applied to the optimization of electromagnetic field devices. The GPGPU and Pthreads implementations are compared in terms of solution quality and speedup. The electromagnetic optimization problems chosen for testing the efficiency of the parallelization techniques...
Calculating features for a large number of images is time-consuming. The GPU-based CUDA framework offers an affordable solution for calculating image features in parallel. The research focused on an empirical study of different implementations of a general-purpose GPU-based solution for calculating Gray-Level Co-occurrence Matrices (GLCM) and associated features of diffraction images of biological cells...
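For readers unfamiliar with the quantity being computed, the sketch below builds a gray-level co-occurrence matrix for one pixel offset and derives the standard contrast feature from it. It is a sequential Python illustration, not the GPU implementation the abstract studies; the function names, the non-symmetric counting convention, and the default offset of one pixel to the right are assumptions made for this example.

```python
def glcm(image, levels, dx=1, dy=0):
    """Count how often gray level i has gray level j at offset (dx, dy).

    `image` is a list of rows of integers in range(levels). Pairs whose
    neighbor falls outside the image are skipped.
    """
    matrix = [[0] * levels for _ in range(levels)]
    rows, cols = len(image), len(image[0])
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dy, c + dx
            if 0 <= r2 < rows and 0 <= c2 < cols:
                matrix[image[r][c]][image[r2][c2]] += 1
    return matrix


def contrast(matrix):
    """Contrast feature: sum of (i - j)^2 weighted by the normalized GLCM."""
    total = sum(sum(row) for row in matrix)
    if total == 0:
        return 0.0
    weighted = sum(
        (i - j) ** 2 * count
        for i, row in enumerate(matrix)
        for j, count in enumerate(row)
    )
    return weighted / total
```

Each pixel pair contributes to the matrix independently of the others, which is why the counting step is a natural candidate for parallel execution.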