The Infona portal uses cookies, i.e. strings of text saved by a browser on the user's device. The portal can access those files and use them to remember the user's data, such as their chosen settings (screen view, interface language, etc.), or their login data. By using the Infona portal the user accepts automatic saving and using this information for portal operation purposes. More information on the subject can be found in the Privacy Policy and Terms of Service. By closing this window the user confirms that they have read the information on cookie usage, and they accept the privacy policy and the way cookies are used by the portal. You can change the cookie settings in your browser.
Bufferless, deflection-routed, Butterfly Fat Trees (BFTs) can outperform state-of-the-art FPGAs overlay NoCs such as Hoplite by as much as 2–5× on throughput and ≈5× on worst-case latency at identical PE counts, and by ≈1.5× on throughput at identical resource costs >16K LUTs for statistical traffic patterns. In this paper, we show how to modify the tree connectivity and routing function to support...
Deflection-routed FPGA overlay NoCs such as Hoplite suffer from high worst-case routing latencies due to the penalty of deflections at large system sizes. Segmentation of communication channels in such NoCs can (1) reduce worst-case packet routing latencies for FPGA traffic, (2) enable efficient composition of multi-application NoC workloads, and (3) ease the burden of supporting Partial Reconfiguration...
We can build lightweight bit-serial FPGA NoC routers thatcost 20 LUT, 17 FF per router and operate at 800–900 MHzspeeds. Each bit-serial router implements deflection-routing on aunidirectional torus topology requiring 1b-wide connection perport. The key ideas that enable this implementation are (1)reformulation of the dimension-ordered routing (DOR) functionusing compact 1 LUT, 1 FF streaming pattern...
We can enhance the performance and efficiency of deflection-routed FPGA overlay NoCs by exploiting the cascading featureof the Xilinx UltraScale BlockRAMs. This allows us to (1) hardenthe multiplexers in the NoC switch crossbars, and (2) efficientlyadd buffering support to deflection-routing. While buffering isnot required for correct operation of a deflection routed NoC, it can boost network throughputs...
We can deliver high performance and energy efficient operation on the multi-core NoC-based Adapteva Epiphany-III SoC for bulk-synchronous workloads using our proposed eBSP communication API. We characterize and automate performance tuning of spatial parallelism for supporting (1) random access load-store style traffic suitable for irregular sparse computations, as well as (2) variable, data-dependent...
Reducing worst case routing latencies while delivering high throughput and low energy are key design concerns in the engineering of overlay packet-switched NoCs for FPGA fabrics. Deflection routed torus NoCs are known to map particularly well to modern wire-rich FPGA substrates with fracturable LUT organizations while delivering high sustained bandwidths for various workloads and traffic patterns...
When discussing programming issues on social platforms (e.g, Stack Overflow, Twitter), developers often mention APIs in natural language texts. Extracting API mentions in natural language texts is a prerequisite for effective indexing and searching for API-related information in software engineering social content. However, the informal nature of social discussions creates two fundamental challenges...
Machine Learning approaches for automated selection of FPGA CAD tool parameters have been demonstrated to be useful for timing closure of FPGA designs [3], [4]. This is achieved by running the CAD tool multiple times with small variations in the the CAD parameter values. The timing slack from each run is recorded into a database along with all input parameter selections to help train a classifier...
We can exploit application-specific sparse structure and distribution of non-zero coefficients in Discrete Wavelet Transform (DWT) matrices to significantly improve the performance of 1-D DWT mapped to FPGA-based soft vector processors. We reformulate DWT computations specifically in terms of sparse matrix operations, where the transformation matrices have a repeating block with a fixed non-zero pattern,...
High-performance FPGA programming has typically been the exclusive domain of a small band of specialized hardware developers. They are capable of reasoning about implementation concerns at the register-transfer level (RTL) which is analogous to assembly-level programming in software. Sometimes these developers are required to push further down to manage even lower levels of abstraction closer to physical...
We can embed the crossbar functionality of NoC (network-on-chip) routers onto the hard multiplexers of Xilinx DSP48E primitives to support resource efficient mapping of FPGA overlay NoCs. This embedding also permits the use of dedicated hard wiring resources of the DSP cascade links to support vertical NoC channels. This unique mapping allows us to significantly reduce soft logic (LUTs+FFs) utilization...
FPGA-based embedded soft vector processors can exceed the performance and energy-efficiency of embedded GPUs and DSPs for lightweight deep learning applications. For low complexity deep neural networks targeting resource constrained platforms, we develop optimized Caffe-compatible deep learning library routines that target a range of embedded accelerator-based systems between 4 -- 8 W power budgets...
The management and optimization of communication in an NoC-based (network-on-chip) bespoke computing platform such as the Parallella (Zynq 7010 + Epiphany-III SoC) is critical for performance and energy-efficiency of floating-point bulk-synchronous workloads. In this paper, we explore the opportunities and capabilities of the Epiphany-III SoC for communication-intensive workloads. Using our communication...
We can use Cloud Computing and Machine Learning to help deliver timing closure of FPGA designs using InTime [2], [3]. This approach requires no modification to the input RTL and relies exclusively on manipulating the CAD tool parameters that drive the optimization heuristics. By running multiple combinations of the parameters in parallel, we learn from results and identify which parameters caused...
We can improve the performance of deflection-routed FPGA overlay networks-on-chip (NoCs) like Hoplite by as much as 10× (random traffic) at the expense of modest extra storage cost when combining static scheduling with packet switching in an efficient, hybrid manner. Deflection routed bufferless NoCs such as Hoplite, allow extremely lightweight packet switched routers on FPGAs, but suffer from high...
Programming-specific Q&A sites (e.g., Stack Overflow) are being used extensively by software developers for knowledge sharing and acquisition. Due to the cross-reference of questions and answers (note that users also reference URLs external to the Q&A site. In this paper, URL sharing refers to internal URLs within the Q&A site, unless otherwise stated), knowledge is diffused in the Q&A...
Software engineering social content, such as Q&A discussions on Stack Overflow, has become a wealth of information on software engineering. This textual content is centered around software-specific entities, and their usage patterns, issues-solutions, and alternatives. However, existing approaches to analyzing software engineering texts treat software-specific entities in the same way as other...
Off-the-shelf accelerator-based embedded platforms offer a competitive energy-efficient solution for lightweight deep learning computations over CPU-based systems. Low-complexity classifiers used in power-constrained and performance-limited scenarios are characterized by operations on small image maps with 2– 3 deep layers and few class labels. For these use cases, we consider a range of embedded...
This year we have a strong FPT program with a full paper acceptance rate of ∼21% from a pool of 106 submissions. This is a ∼25% increase in the number of submissions over last year while still maintaining a near identical number of accepted papers. We strictly enforced all advertised deadlines, and significantly revised the Program Committee to help improve review quality. We had 132 original registrations...
Scatter-gather direct memory access (DMA) transfers can be used to efficiently fetch graph memory data for onchip processing of graph applications. We present a hardware controlled graph DMA engine which can operate autonomously without the need for CPU interaction. Graph processing algorithms can asynchronously request graph data which is fetched from memory and streamed to the processing core. An...
Set the date range to filter the displayed results. You can set a starting date, ending date or both. You can enter the dates manually or choose them from the calendar.