The Infona portal uses cookies, i.e. strings of text saved by a browser on the user's device. The portal can access those files and use them to remember the user's data, such as their chosen settings (screen view, interface language, etc.), or their login data. By using the Infona portal the user accepts automatic saving and using this information for portal operation purposes. More information on the subject can be found in the Privacy Policy and Terms of Service. By closing this window the user confirms that they have read the information on cookie usage, and they accept the privacy policy and the way cookies are used by the portal. You can change the cookie settings in your browser.
CPU-FPGA heterogeneous platforms offer a promising solution for high-performance and energy-efficient computing systems by providing specialized accelerators with post-silicon reconfigurability. To unleash the power of FPGA, however, the programmability gap has to be filled so that applications specified in high-level programming languages can be efficiently mapped and scheduled on FPGA. The above...
An important number of Numerical Linear Algebra methods to tackle problems in diverse fields of science and engineering, rely heavily on the solution of one or many sparse triangular linear systems. Since the early years, this has motivated numerous efforts that seek to produce efficientimplementations of this kernel for most hardware platforms. However, this operation implies strong data dependencies...
All semiconductor market domains are converging to concurrent platforms. This trend has certainly led real challenge to develop applications software that effectively uses these concurrent processors to achieve efficiency and performance goals. This paper argues that the Computer System related courses are natural places to introduce the parallelism, and the earlier to parallel computing concepts...
State-of-the-art storage devices that have parallel capability have significantly reduced the performance gap between processor and storage I/O. However, the internal parallelism makes it difficult to measure utilization that can be used as a basis of load balancing, which is a critical feature of performance improvement of parallel systems. When utilization of storage reaches to one hundred percent,...
In this paper we present MLTBiqCrunch, a hierarchically parallelized version of the open-source solver BiqCrunch [1]. More precisely, this version has two levels of parallelization: a coarse grain, assigning a thread to a node evaluation and a fine grain, parallelizing a node evaluation when some threads are not busy. We present experiments on some classical binary quadratic optimization problems...
With the help of parallelism provided by the fine-grained architecture, hardware accelerators on Field Programmable Gate Arrays (FPGAs) can significantly improve the performance of many applications. However, designers are typically required to have excellent hardware programming skills and unique optimization techniques to fully explore the potential of FPGA resources. In this work, we propose the...
SIMD vectors help improve the performance of certain applications. The code gets vectorized into SIMD form either by hand, or automatically with auto-vectorizing compilers. The Superword-Level Parallelism (SLP) vectorization algorithm is a widely used algorithm for vectorizing straight-line code and is part of most industrial compilers. The algorithm attempts to pack scalar instructions into vectors...
The need for systems capable of conducting inferential analysis and predictive analytics is ubiquitous in a global information society. With the recent advances in the areas of predictive machine learning models and massive parallel computing a new set of resources is now potentially available for the computer science community in order to research and develop new truly intelligent and innovative...
Today, Convolutional Neural Network (CNN) is adopted in a lot of areas such as computer vision and natural language processing. By employing hardware accelerators such as graphic processing unit (GPU), a significant amount of speedup can be achieved in CNN and many studies have proposed such acceleration methods. However, it is not straightforward to parallelize the CNN on a hardware accelerator because...
OpenCL continues to gather momentum on both desktop and mobile devices. The new features of OpenCL 2.0 provides developers better expressive power in programming heterogeneous computing environments. Currently in the experimental simulation environment, gem5-gpu only supports CUDA, but GPGPU-Sim can support OpenCL by compiling OpenCL kernel code to PTX using real GPU driver. However, this driver compilation...
Machine Learning (ML) approaches are widelyused classification/regression methods for data mining applications. However, the time-consuming training process greatly limits the efficiency of ML approaches. We use the example of SVM (traditional ML algorithm) and DNN (state-of-the-art ML algorithm) to illustrate the idea in this paper. For SVM, a major performance bottleneck of current tools is that...
We present a set of new batched CUDA kernels for the LU factorization of a large collection of independent problems of different size, and the subsequent triangular solves. All kernels heavily exploit the registers of the graphics processing unit (GPU) in order to deliver high performance for small problems. The development of these kernels is motivated by the need for tackling this embarrasingly-parallel...
String pattern matching with finite automata (FAs) is a well-established method across many areas in computer science. Until now, data dependencies inherent in the pattern matching algorithm have hampered effective parallelization. To overcome the dependency-constraint between subsequent matching steps, simultaneous deterministic finite automata (SFAs) have been recently introduced. Although an SFA...
Sparse general matrix-matrix multiplication (SpGEMM) is one of the key kernels of preconditioners such as algebraic multigrid method or graph algorithms. However, the performance of SpGEMM is quite low on modern processors due to random memory access to both input and output matrices. As well as the number and the pattern of non-zero elements in the output matrix, important for achieving locality,...
The exponential growth of available data has increased the need for interactive exploratory analysis. Dataset can no longer be understood through manual crawling and simple statistics. In Geographical Information Systems (GIS), the dataset is often composed of events localized in space and time; and visualizing such a dataset involves building a map of where the events occurred.We focus in this paper...
The goal of this paper is to implement an efficient FPGA-based hardware architectures for the design of fast artificial vision systems. The proposed architecture is capable of performing classification operations of a Convolutional Neural Network (CNN) in realtime. To show the effectiveness of the architecture, some design examples such as hand posture recognition, character recognition, and face...
The ever changing nature of network technology requires a flexible platform that can change as the technology evolves. In this work, a complete networking switch designed in OpenCL is presented, identifying several high-level constructs that form the building blocks of any network application targeting FPGAs. These include the notion of an on-chip global memory and kernels constantly processing data...
With the continuously increasing power of computation, especially in the region of parallel computing, computerbased texture analysis, computer-assisted classification methods, automated pathology detections, etc. are more and more commonly performed on medical images, like X-ray, Magnetic Resonance (MR) images, for clinical or scientific purposes. These procedures almost always include a stage of...
Recent advances in photonics and imaging technology allow the development of cutting-edge, lightweight hyperspectral sensors, both push-broom/line-scanning and snapshot/frame. At the same time, emerging applications in robotics, food inspection, medicine and earth observation are posing critical challenges on real-time processing and computational efficiency, both in terms of accuracy and power consumption...
Lightweight convolutional neural network (CNN) on tiny embedded platforms can offer energy efficient solution for today's IoT devices. However, CNN implementation on embedded system faces processing bottleneck in convolutional layers and memory storage issues in fully connected layers. In past years, heterogeneous acceleration, where compute intensive tasks are performed on kernel specific cores,...
Set the date range to filter the displayed results. You can set a starting date, ending date or both. You can enter the dates manually or choose them from the calendar.