The Infona portal uses cookies, i.e. strings of text saved by a browser on the user's device. The portal can access those files and use them to remember the user's data, such as their chosen settings (screen view, interface language, etc.), or their login data. By using the Infona portal the user accepts automatic saving and using this information for portal operation purposes. More information on the subject can be found in the Privacy Policy and Terms of Service. By closing this window the user confirms that they have read the information on cookie usage, and they accept the privacy policy and the way cookies are used by the portal. You can change the cookie settings in your browser.
Stereo matching systems that generate dense, accurate, robust and real-time disparity maps are quite attractive for a variety of applications. Most of the existing stereo matching systems that fulfill to all of these requirements adopt the Semi-Global Matching (SOM) technique. This work proposes a scalable architecture based on a systolic array, fully pipeline. The design builds on a combination of...
The optimization of legacy codes for fully exploiting the parallelism opportunities provided by modern heterogeneous architectures is a difficult task. Multiple levels of parallelism can be exploited in order to gain the expected performance. This work describes the lessons learned in the performance optimization of a real-world reservoir engineering application composed of thousands of code lines...
With the rapidly increasing applications of deep learning, LSTM-RNNs are widely used. Meanwhile, the complex data dependence and intensive computation limit the performance of the accelerators. In this paper, we first proposed a hybrid network expansion model to exploit the finegrained data parallelism. Based on the model, we implemented a Reconfigurable Processing Unit(RPU) using Processing In Memory(PIM)...
All semiconductor market domains are converging to concurrent platforms. This trend has certainly led real challenge to develop applications software that effectively uses these concurrent processors to achieve efficiency and performance goals. This paper argues that the Computer System related courses are natural places to introduce the parallelism, and the earlier to parallel computing concepts...
Traditional processor design approaches using CISC and RISC philosophies suffer from low performance. One of alternative approaches to improve system performance is instruction level parallelism (ILP). Among the processor architectures supporting ILP, very long instruction word (VLIW) processors offer some advantages such as low power consumption and hardware complexity. In this paper, we introduce...
A challenging task in numerical programming modern computer systems is to effectively exploit the parallelism available in the architecture and manage the CPU caches to increase performance. Loop nest tiling allows for both coarsening parallel code and improving code locality. In this paper, we explore a new way to generate tiled code and derive the free schedule of tiles by means of the transitive...
This paper explores hardware acceleration to significantly improve the runtime of computing the forward algorithm on Pair-HMM models, a crucial step in analyzing mutations in sequenced genomes. We describe 1) the design and evaluation of a novel accelerator architecture that can efficiently process real sequence data without performing wasteful work; and 2) aggressive memoization techniques that can...
Stream join is a fundamental and computationally expensive data mining operation for relating information from different data streams. This paper presents two FPGA-based architectures that accelerate stream join processing. The proposed hardware-based systems were implemented on a multi-FPGA hybrid system with high memory bandwidth. The experimental evaluation shows that our proposed systems can outperform...
Iterative sparse linear solvers are an important class of algorithm in high performance computing, and form a crucial component of many scientific codes. As intra and inter node parallelism continues to increase rapidly, the design of new, scalable solvers which can target next generation architectures becomes increasingly important. In this work we present TeaLeaf, a recent mini-app constructed to...
The availability and amount of sequenced genomes have been rapidly growing in recent years because of the adoption of next-generation sequencing (NGS) technologies that enable high-throughput short-read generation at highly competitive cost. Since this trend is expected to continue in the foreseeable future, the design and implementation of efficient and scalable NGS bioinformatics algorithms are...
Implementing convolutional neural networks for scene labelling is a current hot topic in the field of advanced driver assistance systems. The massive computational demands under hard real-time and energy constraints can only be tackled using specialized architectures. Also, cost-effectiveness is an important factor when targeting lower quantities. In this PhD thesis, a vector processor architecture...
With the help of parallelism provided by the fine-grained architecture, hardware accelerators on Field Programmable Gate Arrays (FPGAs) can significantly improve the performance of many applications. However, designers are typically required to have excellent hardware programming skills and unique optimization techniques to fully explore the potential of FPGA resources. In this work, we propose the...
Finite State Automata (FSA) are powerful computational models for extracting patterns from large streams (TBs/PBs) of unstructured data such as system logs, social media posts, emails, and news articles. FSA are also widely used in network security [6], bioinformatics [4] to enable efficient pattern matching. Compute-centric architectures like CPUs and GPG-PUs perform poorly on automata processing...
Processor manufacturers have adopted SIMD for decades because of its superior performance and power efficiency. The configurations of SIMD registers (i.e., the number and width) have evolved and diverged rapidly through various ISA extensions on different architectures. However, migrating legacy or proprietary applications optimized for one guest ISA to another host ISA that has fewer but longer SIMD...
Deep Learning techniques like Convolutional Neural Networks (CNN) are getting popular for image classification with the broad usage spanning across automotive, industrial, medicine, robotics etc. Typical CNN network consists of multiple layers of convolutions, non-linearity, spatial pooling and fully connected layer, with 2D convolutions constituting more than 95% of overall computations. In this...
This paper presents a study of the par alle lization possibilities of a Non-Linear Iterative Partial Least Squares algorithm and its adaptation to a Massively Parallel Processor Array manycore architecture, which assembles 256 cores distributed over 16 clusters. The aim of this work is twofold: first, to test the behavior of iterative, complex algorithms in a manycore architecture; and, secondly,...
High performance computing applications are far more difficult to write, therefore, practitioners expect a well-tuned software to last long and provide optimized performance even when the hardware is upgraded. It may also be necessary to write software using sufficient abstraction over the hardware so that it is capable of running on heterogeneous architecture. Therefore, it is required to have a...
Due to high accuracy, inherent redundancy, and embarrassingly parallel nature, the neural networks are fast becoming mainstream machine learning algorithms. However, these advantages come at the cost of high memory and processing requirements (that can be met by either GPUs, FPGAs or ASICs). For embedded systems, the requirements are particularly challenging because of stiff power and timing budgets...
Binary Edwards Curves (BEC) constitute an alternative to the standardized Weierstrass elliptic curve (EC) equations since the latter have intrinsic side channel attack vulnerabilities due to their lack of point operation uniformity. Thus, BECs have gained popularity over the past few years due to their uniformity, operation regularity, completeness and implementation attractiveness. However, BEC Scalar...
In this paper, a novel, low-complexity, and hardware efficient signal detection algorithm and its corresponding VLSI architecture are proposed for massive multiple-input multiple-output (MIMO) systems. This method is based on the parallel Gauss-Seidel (PGS) iterative method, and achieves comparable detection performance as the linear minimum mean-square error (MMSE) detection. It successfully avoids...
Set the date range to filter the displayed results. You can set a starting date, ending date or both. You can enter the dates manually or choose them from the calendar.