The Infona portal uses cookies, i.e. strings of text saved by a browser on the user's device. The portal can access those files and use them to remember the user's data, such as their chosen settings (screen view, interface language, etc.), or their login data. By using the Infona portal the user accepts automatic saving and using this information for portal operation purposes. More information on the subject can be found in the Privacy Policy and Terms of Service. By closing this window the user confirms that they have read the information on cookie usage, and they accept the privacy policy and the way cookies are used by the portal. You can change the cookie settings in your browser.
A dramatic improvement in energy efficiency is mandatory for sustainable supercomputing and has been identified as a major challenge. Affordable energy solution continues to be of great concern in the development of the next generation of supercomputers. Low power processors, dynamic control of processor frequency and heterogeneous systems are being proposed to mitigate energy costs. However, the...
Amplicon Noise [1], an updated version of Py-ronoise [2], is a tool for removing noise from metagenomic data recorded by a 454 pyrosequencer. Amplicon Noise has shown to be effective in reducing overestimation of operational taxonomic units (OTUs) and chimera detection. Amplicon-Noise's noise removal method relies on clustering a large set of short sequences read by the sequencer. The DNA sequencing...
A distance matrix is simply an n×n two-dimensional array that contains pairwise distances of a set of n points in a metric space. It has a wide range of usage in several fields of scientific research e.g., data clustering, machine learning, pattern recognition, image analysis, information retrieval, signal processing, bioinformatics etc. However, as the size of n increases, the computation of distance...
Raptor code, a member of the fountain code family, is a significant theoretical improvement over the Luby transform code (LT code) for forward error correction (FEC) transmission. Graphics processing units (GPUs) have become a common place in the consumer market and are finding their way beyond graphics processing into general purpose computing. This paper investigates the suitability of GPU for Raptor...
Since the introduction of fully programmable vertex shader hardware, GPU computing has made tremendous advances. Exception support and speculative execution are the next steps to expand the scope and improve the usability of GPUs. However, traditional mechanisms to support exceptions and speculative execution are highly intrusive to GPU hardware design. This paper builds on two related insights to...
Heterogeneous computing system increases the performance of parallel computing in many domain of general purpose computing with CPU, GPU and other accelerators. Open Computing Language(OpenCL) is the first open, royalty-free standard for heterogenous computing on multi hardware platforms. In this paper, we propose a parallel Motion Estimation(ME) algorithm implemented using OpenCL and present several...
This paper presents OpenCL Remote framework that extends the native OpenCL platform model to network scale and utilizes the native OpenCL's support of heterogeneous computing. OpenCL Remote boosts performance by distributing computation over network to many compute devices in parallel.
Recently, the Graphic Processor Unit (GPU) has evolved into a highly parallel, multithreaded, many-core processor with tremendous computational horsepower and very high memory bandwidth. To improve the simulation efficiency of complex flow phenomena in the field of computational fluid dynamics, a CUDA-based simulation algorithm of large eddy simulation using multiple GPUs is proposed. Our implementation...
Current synchrotron experiments require state-of-the-art scientific cameras with sensors that provide several million pixels, each at a dynamic range of up to 16 bits and the ability to acquire hundreds of frames per second. The resulting data bandwidth of such a data stream reaches several Gigabits per second. These streams have to be processed in real-time to achieve a fast process response. In...
In this paper, we describe an implementation of Node Histogram Sampling Algorithm (NHBSA) on GPUs with CUDA and apply the algorithm to solve large scale QAP instances. To solve large scale QAP instances, we combined the taboo search with NHBSA. In this implementation, we used an efficient thread assignment method, Move-Cost Adjusted Thread Assignment (MATA), which is proposed in a previous study....
This paper introduces a novel implementation of the genetic algorithm exploiting a multi-GPU cluster. The proposed implementation employs an island-based genetic algorithm where every GPU evolves a single island. The individuals are processed by CUDA warps, which enables the solution of large knapsack instances and eliminates undesirable thread divergence. The MPI interface is used to exchange genetic...
A process of generating a digital hologram requires a lot of time-consuming computations. Therefore, it is important to reduce the computation time or the number of computations for achieving real-time digital holographic video generation. In this paper, we propose a method of parallelizing the computations using multiple GPUs with CUDA and OpenMP and an optimization method for reducing the computation...
Modern computer processing units tend towards simpler cores in greater numbers, favouring the development of data-parallel applications. Evolutionary algorithms are ideal for taking full advantage of SIMD (Single Instruction, Multiple Data) processing, which is available on both CPUs and GPUs. Creating software that runs on a GPU requires the use of specialised programming languages or styles, forcing...
This paper presents a new sparse matrix format ALIGNED_COO, an extension to COO format to optimize performance of large sparse matrix having skewed distribution of non-zero elements. Load balancing, alignment and synchronization free distribution of work load are three important factors to improve performance of sparse matrices representing power-law graph. Coordinate (COO) format is selected for...
The process of applying generalized minimum aberration criteria (GMAC) to non-regular fractional factorial designs is extremely computationally intensive. Constructing and ranking all designs can take hours if not days; therefore, exploitation of the massively parallel nature of modern graphics processing units (GPUs) are used to perform the task. The computation is not just ported to the GPU, but...
Database searching is a main method for protein identification in shotgun proteomics, and many research efforts are dedicated to improving its effectiveness. However, the efficiency of database searching is facing a serious challenge, due to the ever fast growth of protein and peptide databases resulted from genome translations, enzymatic digestions, and post-translational modifications (PTMs). On...
Software pipelines permit the decomposition of a repetitive sequential process into a succession of distinguishable sub-processes called stages, each of which can be concurrently executed on a distinct processing element. This paper presents a heterogeneous streaming pipeline implementation using the FastFlow skeletal library for a numerical linear algebra code. By introducing minimal memory management,...
Simulation of the behaviour of a ship operating in pack ice is a computationally intensive process to which General Purpose Computing on Graphical Processing Units (GPGPU) can be applied. In this paper we present an efficient parallel implementation of such a simulator developed using the NVIDIA Compute Unified Device Architecture (CUDA). We have conducted an experiment to measure the relative performance...
GPU is continuing its trend of vastly outperforming CPU while becoming more general purpose. In order to improve the efficiency of AES algorithm, this paper proposed a CUDA implementation of Electronic Codebook (ECB) mode encoding process and Cipher Feedback (CBC) mode decoding process on GPU. In our implementation, the frequently accessed T-boxes were allocated on on-chip shared memory and the granularity...
The Viola-Jones face detection algorithm represents a class of parallel algorithms that both memory accesses and work distributions are irregular, thereby hard to obtain high performance on GPUs. Furthermore, conventional GPU programming wisdom usually guides us on how to optimize data parallel workloads with regular inputs and outputs. While how to efficiently write task-level parallelism programs...
Set the date range to filter the displayed results. You can set a starting date, ending date or both. You can enter the dates manually or choose them from the calendar.