The Infona portal uses cookies, i.e. strings of text saved by a browser on the user's device. The portal can access those files and use them to remember the user's data, such as their chosen settings (screen view, interface language, etc.), or their login data. By using the Infona portal the user accepts automatic saving and using this information for portal operation purposes. More information on the subject can be found in the Privacy Policy and Terms of Service. By closing this window the user confirms that they have read the information on cookie usage, and they accept the privacy policy and the way cookies are used by the portal. You can change the cookie settings in your browser.
StarT-ng is a joint MIT-Motorola project to build a high-performance message passing machine from commercial systems. Each site of the machine consists of a PowerPC 620-based Motorola symmetric multiprocessor (SMP) running the AIX 4.1 operating system. Every processor is connected to a low-latency, high-bandwidth network that is directly accessible from user-level code. In addition to fast message...
Fast Fourier transform algorithms on large data sets achieve poor performance on various platforms because of the inefficient strided memory access patterns. These inefficient access patterns need to be reshaped to achieve high performance implementations. In this paper we formally restructure 1D, 2D and 3D FFTs targeting a generic machine model with a two-level memory hierarchy requiring block data...
3D-stacked integration of DRAM and logic layers using through-silicon via (TSV) technology has given rise to a new interpretation of near-data processing (NDP) concepts that were proposed decades ago. However, processing capability within the stack is limited by stringent power and thermal constraints. Simple processing mechanisms with intensive memory accesses, such as data reorganization, are an...
Efficiently supporting the communication needs of systems on chip with tens to hundreds of interacting modules requires a systematic and flexible network-on-chip (NoC) infrastructure. The freely available CONNECT generator lets users quickly navigate a range of design parameters to produce tailored NoC design instances in Verilog. To date, it has generated nearly 4,000 designs.
Facilitating DRAM access is an essential part of an application programming environment for FPGA computing. Existing FPGA application programming environments primarily focus on support for simple, regular memory access patterns, such as block copy and streaming. This paper presents CoRAM++, a programming environment for FPGA computing that is based on an extensible set of data-structure-specific...
Today's offerings of parameterized hardware IP generators permit very high degrees of performance and implementation customization. Nevertheless, it is ultimately still left to the IP users to set IP parameters to achieve the desired tuning effects. For the average IP user, the knowledge and effort required to navigate a complex IP's design space can significantly offset the productivity gain from...
In this paper we focus on common data reorganization operations such as shuffle, pack/unpack, swap, transpose, and layout transformations. Although these operations simply relocate the data in the memory, they are costly on conventional systems mainly due to inefficient access patterns, limited data reuse and roundtrip data traversal throughout the memory hierarchy. This paper presents a two pronged...
Over the last decade, the looming power wall has spurred a flurry of interest in developing heterogeneous systems with hardware accelerators. The questions, then, are what and how accelerators should be designed, and what software support is required. Our accelerator design approach stems from the observation that many efficient and portable software implementations rely on high performance software...
Real-time system level implementations of complex Synthetic Aperture Radar (SAR) image reconstruction algorithms have always been challenging due to their data intensive characteristics. In this paper, we propose a basis vector transform based novel algorithm to alleviate the data intensity and a 3D-stacked logic in memory based hardware accelerator as the implementation platform. Experimental results...
Memory layout transformations via data reorganization are very common operations, which occur as a part of the computation or as a performance optimization in data-intensive applications. These operations require inefficient memory access patterns and roundtrip data movement through the memory hierarchy, failing to utilize the performance and energy-efficiency potentials of the memory subsystem. This...
The complexity of SystemC virtual prototyping is continuously increasing. Accelerating RTL/TLM SystemC simulations is essential to control future SoC development cost and time-to-market. In this paper, we present RAVES, a highly-parallel special-purpose multicore architecture that achieves simulation performance more efficiently by parallel execution of light-weight user-level threads on many small...
As technology scaling is reaching its limits, pointing to the well-known memory and power wall problems, achieving high-performance and energy-efficient systems is becoming a significant challenge. Especially for data-intensive computing, efficient utilization of the memory subsystem is the key to achieve high performance and energy efficiency.We address this challenge in DRAM-optimized hardware accelerators...
Fast Fourier transform algorithms on large data sets achieve poor performance on various platforms because of the inefficient strided memory access patterns. These inefficient access patterns need to be reshaped to achieve high performance implementations. In this paper we formally restructure 1D, 2D and 3D FFTs targeting a generic machine model with a two-level memory hierarchy requiring block data...
Vertex-centric graph computations are widely used in many machine learning and data mining applications that operate on graph data structures. This paper presents GraphGen, a vertex-centric framework that targets FPGA for hardware acceleration of graph computations. GraphGen accepts a vertex-centric graph specification and automatically compiles it onto an application-specific synthesized graph processor...
This paper introduces a 3D-stacked logic-in-memory (LiM) system that integrates the 3D die-stacked DRAM architecture with the application-specific LiM IC to accelerate important data-intensive computing. The proposed system comprises a fine-grained rank-level 3D die-stacked DRAM device and extra LiM layers implementing logic-enhanced SRAM blocks that are dedicated to a particular application. Through...
Large scale 3D image localization requires computationally expensive matching between 2D feature points in the query image and a 3D point cloud. In this paper, we present a method to accelerate the matching process and to reduce the memory footprint by analyzing the view-statistics of points in a training corpus. Given a training image set that is representative of common views of a scene, our approach...
In this paper we present a framework to obtain highly efficient implementations for the narrow band level set method on commercial off-the-shelf (COTS) multicore CPU systems with a cache-based memory hierarchy such as Intel Xeon and Atom processors. The narrow-band level set algorithm tracks wave-fronts in discretized volumes (for instance, explosion shock waves), and is computationally very demanding...
Prevailing VLSI trends point to a growing gap between the scaling of on-chip processing throughput and off-chip memory bandwidth. An efficient use of memory bandwidth must become a first-class design consideration in order to fully utilize the processing capability of highly concurrent processing platforms like FPGAs. In this paper, we present key aspects of this challenge in developing FPGA-based...
Optical OFDM communication systems operating at data rates in the 40Gb/s (and higher) range require high-throughput/highly parallel fast Fourier transform (FFT) implementations. These consume a significant amount of chip resources; we aim to reduce costs by improving the system's accuracy per chip-area. For OFDM signals, we characterize the growth of data within the FFT and explore several cost-conscious...
We have investigated real-time DSP algorithms implementing electronic predistortion and optical OFDM. These studies have been carried out through experiments with FPGA-based transmitters, and the design and post-synthesis simulations of transceiver ASICs.
Set the date range to filter the displayed results. You can set a starting date, ending date or both. You can enter the dates manually or choose them from the calendar.