Due to increasing demand for low-power computing and diminishing returns from technology scaling, industry and academia are turning with renewed interest toward energy-efficient programmable accelerators. This paper proposes an Integrated Programmable-Array accelerator (IPA) architecture based on an innovative execution model, targeted to accelerate both data and control-flow parts of deeply embedded...
For many intensive computing tasks, simultaneous data access into multi-dimensional data arrays is highly restricted by its data mapping strategy and memory port constraint. As such, to increase memory accessing bandwidth, innovative memory partitioning and mapping algorithms have been proposed to simultaneously access multiple memory blocks through physically distributing data elements in the same...
High-end FPGAs are widely adopted as hardware accelerators, due to their power efficiency, flexibility, and high-performance computing ability. They are, therefore, extremely useful devices for a project with challenges and constraints such as the Square Kilometre Array (SKA). However, the traditional design methods require expert hardware knowledge and long development times for each of the SKA's...
The use of FPGAs as compute accelerators has been demonstrated by numerous researchers as an effective solution to meet the performance requirement across many application domains. However, the design productivity of developing FPGA accelerators remains much lower compared to the use of a typical software development flow. Although the use of the high-level design tools may partly alleviate this shortcoming,...
In the last decade, OpenCL has sparked the interest of the computing world as it is a language based on an open standard that can run on many different heterogeneous platforms. This standard is continuously evolving to adapt to various use cases of different platforms. For example, with requests from the FPGA community, the pipe construct was added to the standard to facilitate the implementation...
To improve the real-time performance and reliability of the drive system for an infrared image array, this paper designs an embedded drive system. With the MPC8315 as the processing core, this system uses a reflective memory network as the transmission unit. To verify and analyze the performance of the embedded drive system for the infrared image array, this paper sets up a test platform...
We present the first accelerated implementation of BWA-MEM, a popular genome sequence alignment algorithm widely used in next generation sequencing genomics pipelines. The Smith-Waterman-like sequence alignment kernel requires a significant portion of overall execution time. We propose and evaluate a number of FPGA-based systolic array architectures, presenting optimizations generally applicable to...
Autonomous UAVs need an on-board vision system to be able to navigate, avoid collisions, and execute missions. Because of natural payload limitations, small UAVs can carry only a small form-factor vision system with low power consumption. It is therefore natural to use cellular sensor-processor arrays to implement the necessary vision functions. In this paper, we present a UAV collision warning algorithm...
Today's hardware diversity exacerbates the need for optimizing compilers. A problem that arises when exploiting hardware accelerators (FPGA, GPU, dedicated boards) is how to automatically perform kernel/function offloading or outlining (as opposed to function inlining). The principle is to outsource part of the computation (the kernel to be performed on the accelerator) to a more efficient but more...
Heterogeneous computing utilizing both CPU and FPGA requires access to data in the main memory from both devices. While a typical system relies on software executing on the CPU to orchestrate all data movements between the FPGA and the main memory, our demo presents a complementary FPGA-centric approach that allows gateware to directly access the virtual memory space as part of the executing process...
Hardware-supported multithreading can mask memory latency by switching execution to ready threads, which is particularly effective on irregular applications. FPGAs provide an opportunity to have multithreaded data paths customized to each individual application. In this paper we describe the compiler generation of these hardware structures from a C subset targeting a Convey HC-2ex machine. We describe...
In many application domains, data are represented using large graphs involving millions of vertices and billions of edges. Graph exploration algorithms, such as breadth-first search (BFS), are largely dominated by memory latency and are challenging to process efficiently. In this paper, we present a reconfigurable hardware methodology for efficient parallel processing of large-scale graph exploration...
In FPGA-based logic emulation systems, effective verification performance depends not only on the frequency at which the design clocks can be advanced, but also on the efficiency of various design element access tasks initiated by associated software applications such as high-level testbenches and GUIs. Although existing emulation systems achieve a high degree of parallelism in model execution by partitioning...
Recent progress in High-Level Synthesis (HLS) techniques has helped raise the abstraction level of FPGA programming. However, implementation and performance evaluation of the HLS-generated RTL involve lengthy logic synthesis and physical design flows. Moreover, mapping of different levels of coarse-grained parallelism onto hardware spatial parallelism affects the final FPGA-based performance both...
Gridding is a method of interpolating irregularly sampled data on to a uniform grid and is a critical image reconstruction step in several applications which operate on non-Cartesian sampled data. In this paper, we present an algorithm-architecture co-design framework for accelerating gridding using FPGAs. We present a parameterized hardware library for accelerating gridding to support both arbitrary...
We present an FPGA accelerator for the Non-uniform Fast Fourier Transform, which is a technique to reconstruct images from arbitrarily sampled data. We accelerate the compute-intensive interpolation step of the NuFFT Gridding algorithm by implementing it on an FPGA. In order to ensure efficient memory performance, we present a novel FPGA implementation for Geometric Tiling based sorting of the arbitrary...
This paper describes a System on Chip implementation of a reconfigurable digital signal processor. The device is suitable for execution of a wide range of applications exploiting a balanced mix of heterogeneous reconfigurable fabrics merged together by a flexible and efficient communication infrastructure based on a 64-bit Network On Chip. The SoC combines a fine grain embedded FPGA, a mid grain configurable...
Capacity of FPGAs has grown significantly, leading to increased complexity of designs targeting these chips. Traditional FPGA design methodology using HDLs is no longer sufficient and new methodologies are being sought. An attractive possibility is to use streaming languages. Streaming languages group data into streams, which are processed by computational nodes called kernels. They are suitable for...
With the proliferation of reconfigurable systems and flexible memory architectures, there has been intense interest in stream systems. While the existing stream systems require the programs to be written using special models, this paper demonstrates an approach to automatically generate stream programs from existing applications written for non-stream scalar processors. As a part of this approach,...
Fully-pipelined simple modular structures are presented in this paper for efficient hardware realization of discrete Hadamard transform (HT). From the kernel matrix of HT, we have derived four different pipelined modular designs for transform length N = 4. It is shown further that the HT of transform-length N = 8 can be obtained from two 4-point HT modules, and similarly, the HT of transform-length...
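As a rough software illustration of the decomposition mentioned in the abstract above (an 8-point HT built from two 4-point HT modules), the following Python sketch may help. It assumes the natural (Sylvester) ordering of the Hadamard matrix, which the abstract does not specify, and the function names `hadamard4` and `hadamard8` are hypothetical. It is not the pipelined hardware design from the paper, only the underlying arithmetic.

```python
def hadamard4(x):
    """4-point Hadamard transform, Sylvester order (assumed), via two butterfly stages."""
    a, b, c, d = x
    # Stage 1: 2-point butterflies on adjacent pairs.
    s0, s1, s2, s3 = a + b, a - b, c + d, c - d
    # Stage 2: combine the pair results.
    return [s0 + s2, s1 + s3, s0 - s2, s1 - s3]


def hadamard8(x):
    """8-point Hadamard transform built from two 4-point modules.

    Uses H8 = [[H4, H4], [H4, -H4]]: transform each half with the
    4-point module, then add and subtract the results element-wise.
    """
    top = hadamard4(x[:4])
    bottom = hadamard4(x[4:])
    sums = [t + b for t, b in zip(top, bottom)]
    diffs = [t - b for t, b in zip(top, bottom)]
    return sums + diffs


if __name__ == "__main__":
    print(hadamard8([1, 2, 3, 4, 5, 6, 7, 8]))
```

Because the unnormalized Hadamard matrix satisfies H·H = N·I, applying `hadamard8` twice returns the input scaled by 8, which gives a quick sanity check of the sketch.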