Search results

chapter

Hardware module for low-resource and real-time stereo vision engine using semi-global matching approach

Lucas F. S. Cambuim, Joao P. F. Barbosa, Edna N. S. Barros

2017 30th Symposium on Integrated Circuits and Systems Design (SBCCI) > 53 - 58

Stereo matching systems that generate dense, accurate, robust and real-time disparity maps are quite attractive for a variety of applications. Most of the existing stereo matching systems that fulfill all of these requirements adopt the Semi-Global Matching (SGM) technique. This work proposes a scalable, fully pipelined architecture based on a systolic array. The design builds on a combination of...

chapter

A Case Study of Performance Optimization in a Heterogeneous Environment

Leandro Pereira, Cristiana Bentes, Maria Clicia S. de Castro, Eduardo Garcia

2017 International Symposium on Computer Architecture and High Performance Computing Workshops (SBAC-PADW) > 13 - 18

The optimization of legacy codes for fully exploiting the parallelism opportunities provided by modern heterogeneous architectures is a difficult task. Multiple levels of parallelism can be exploited in order to gain the expected performance. This work describes the lessons learned in the performance optimization of a real-world reservoir engineering application composed of thousands of code lines...

chapter

Processing LSTM in memory using hybrid network expansion model

Yu Gong, Tingting Xu, Bo Liu, Wei Ge, et al.

2017 IEEE International Workshop on Signal Processing Systems (SiPS) > 1 - 6

With the rapidly increasing applications of deep learning, LSTM-RNNs are widely used. Meanwhile, complex data dependences and intensive computation limit the performance of accelerators. In this paper, we first propose a hybrid network expansion model to exploit fine-grained data parallelism. Based on the model, we implemented a Reconfigurable Processing Unit (RPU) using Processing In Memory (PIM)...

chapter

Introducing parallel computing concepts in computer system related courses

Han Wan, Xiaopeng Gao, Xiang Long, Bo Jiang

2017 IEEE Frontiers in Education Conference (FIE) > 1 - 7

All semiconductor market domains are converging to concurrent platforms. This trend has made it a real challenge to develop application software that effectively uses these concurrent processors to achieve efficiency and performance goals. This paper argues that the Computer System related courses are natural places to introduce parallelism, and the earlier to parallel computing concepts...

chapter

Adaptable VLIW processor: The reconfigurable technology approach

Cuong Pham-Quoc, Binh Kieu-Do-Nguyen, Anh-Vu Dinh-Duc

2017 International Conference on Advanced Technologies for Communications (ATC) > 120 - 125

Traditional processor design approaches using CISC and RISC philosophies suffer from low performance. One alternative approach to improving system performance is instruction-level parallelism (ILP). Among the processor architectures supporting ILP, very long instruction word (VLIW) processors offer some advantages such as low power consumption and hardware complexity. In this paper, we introduce...

chapter

Optimizing numerical code by means of the transitive closure of dependence graphs

Marek Palkowski, Wlodzimierz Bielecki

2017 Federated Conference on Computer Science and Information Systems (FedCSIS) > 523 - 526

A challenging task in numerical programming for modern computer systems is to effectively exploit the parallelism available in the architecture and manage the CPU caches to increase performance. Loop nest tiling allows for both coarsening parallel code and improving code locality. In this paper, we explore a new way to generate tiled code and derive the free schedule of tiles by means of the transitive...

chapter

On accelerating pair-HMM computations in programmable hardware

Subho S. Banerjee, Mohamed el-Hadedy, Ching Y. Tan, Zbigniew T. Kalbarczyk, et al.

2017 27th International Conference on Field Programmable Logic and Applications (FPL) > 1 - 8

This paper explores hardware acceleration to significantly improve the runtime of computing the forward algorithm on Pair-HMM models, a crucial step in analyzing mutations in sequenced genomes. We describe 1) the design and evaluation of a novel accelerator architecture that can efficiently process real sequence data without performing wasteful work; and 2) aggressive memoization techniques that can...

chapter

A generic high throughput architecture for stream processing

Christos Rousopoulos, Ektoras Karandeinos, Grigorios Chrysos, Apostolos Dollas, et al.

2017 27th International Conference on Field Programmable Logic and Applications (FPL) > 1 - 5

Stream join is a fundamental and computationally expensive data mining operation for relating information from different data streams. This paper presents two FPGA-based architectures that accelerate stream join processing. The proposed hardware-based systems were implemented on a multi-FPGA hybrid system with high memory bandwidth. The experimental evaluation shows that our proposed systems can outperform...

chapter

TeaLeaf: A Mini-Application to Enable Design-Space Explorations for Iterative Sparse Linear Solvers

Simon McIntosh-Smith, Matthew Martineau, Tom Deakin, Grzegorz Pawelczak, et al.

2017 IEEE International Conference on Cluster Computing (CLUSTER) > 842 - 849

Iterative sparse linear solvers are an important class of algorithm in high performance computing, and form a crucial component of many scientific codes. As intra- and inter-node parallelism continues to increase rapidly, the design of new, scalable solvers which can target next-generation architectures becomes increasingly important. In this work we present TeaLeaf, a recent mini-app constructed to...

chapter

S-Aligner: Ultrascalable Read Mapping on Sunway Taihu Light

Xiaohui Duan, Kai Xu, Yuandong Chan, Christian Hundt, et al.

2017 IEEE International Conference on Cluster Computing (CLUSTER) > 36 - 46

The availability and amount of sequenced genomes have been rapidly growing in recent years because of the adoption of next-generation sequencing (NGS) technologies that enable high-throughput short-read generation at highly competitive cost. Since this trend is expected to continue in the foreseeable future, the design and implementation of efficient and scalable NGS bioinformatics algorithms are...

chapter

Application-specific soft-core vector processor for advanced driver assistance systems

Stephan Nolting, Florian Giesemann, Julian Hartig, Achim Schmider, et al.

2017 27th International Conference on Field Programmable Logic and Applications (FPL) > 1 - 2

Implementing convolutional neural networks for scene labelling is a current hot topic in the field of advanced driver assistance systems. The massive computational demands under hard real-time and energy constraints can only be tackled using specialized architectures. Also, cost-effectiveness is an important factor when targeting lower quantities. In this PhD thesis, a vector processor architecture...

chapter

PolyPC: Polymorphic parallel computing framework on embedded reconfigurable system

Hongyuan Ding, Miaoqing Huang

2017 27th International Conference on Field Programmable Logic and Applications (FPL) > 1 - 8

With the help of parallelism provided by the fine-grained architecture, hardware accelerators on Field Programmable Gate Arrays (FPGAs) can significantly improve the performance of many applications. However, designers are typically required to have excellent hardware programming skills and unique optimization techniques to fully explore the potential of FPGA resources. In this work, we propose the...

chapter

Cache Automaton: Repurposing Caches for Automata Processing

Arun Subramaniyan, Jingcheng Wang, Ezhil R. M. Balasubramanian, David Blaauw, et al.

2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT) > 373

Finite State Automata (FSA) are powerful computational models for extracting patterns from large streams (TBs/PBs) of unstructured data such as system logs, social media posts, emails, and news articles. FSA are also widely used in network security [6] and bioinformatics [4] to enable efficient pattern matching. Compute-centric architectures like CPUs and GPGPUs perform poorly on automata processing...

chapter

Exploiting Asymmetric SIMD Register Configurations in ARM-to-x86 Dynamic Binary Translation

Yu-Ping Liu, Ding-Yong Hong, Jan-Jan Wu, Sheng-Yu Fu, et al.

2017 26th International Conference on Parallel Architectures and Compilation Techniques (PACT) > 343 - 355

Processor manufacturers have adopted SIMD for decades because of its superior performance and power efficiency. The configurations of SIMD registers (i.e., the number and width) have evolved and diverged rapidly through various ISA extensions on different architectures. However, migrating legacy or proprietary applications optimized for one guest ISA to another host ISA that has fewer but longer SIMD...

chapter

CNN inference: VLSI architecture for convolution layer for 1.2 TOPS

Mihir Mody, Manu Mathew, Shyam Jagannathan, Arthur Redfern, et al.

2017 30th IEEE International System-on-Chip Conference (SOCC) > 158 - 162

Deep learning techniques like Convolutional Neural Networks (CNNs) are becoming popular for image classification, with broad usage spanning automotive, industrial, medical, and robotics applications. A typical CNN consists of multiple layers of convolution, non-linearity, spatial pooling and fully connected layers, with 2D convolutions constituting more than 95% of overall computations. In this...

chapter

Parallel implementation of an iterative PCA algorithm for hyperspectral images on a manycore platform

R. Lazcano, D. Madronal, H. Fabelo, S. Ortega, et al.

2017 Conference on Design and Architectures for Signal and Image Processing (DASIP) > 1 - 6

This paper presents a study of the parallelization possibilities of a Non-Linear Iterative Partial Least Squares algorithm and its adaptation to a Massively Parallel Processor Array manycore architecture, which assembles 256 cores distributed over 16 clusters. The aim of this work is twofold: first, to test the behavior of iterative, complex algorithms in a manycore architecture; and, secondly,...

chapter

An Empirical Evaluation of Design Abstraction and Performance of Thrust Framework

Ajai V. George, Sankar Manoj, Sanket Rajan Gupte, Santonu Sarkar

2017 46th International Conference on Parallel Processing Workshops (ICPPW) > 233 - 242

High performance computing applications are far more difficult to write; therefore, practitioners expect well-tuned software to last long and provide optimized performance even when the hardware is upgraded. It may also be necessary to write software using sufficient abstraction over the hardware so that it is capable of running on heterogeneous architectures. Therefore, it is required to have a...

chapter

SPEED: Open-Source Framework to Accelerate Speech Recognition on Embedded GPUs

Syed Mohammad Asad Hassan Jafri, Ahmed Hemani, Leonardo Intesa

2017 Euromicro Conference on Digital System Design (DSD) > 94 - 101

Due to their high accuracy, inherent redundancy, and embarrassingly parallel nature, neural networks are fast becoming mainstream machine learning algorithms. However, these advantages come at the cost of high memory and processing requirements (that can be met by either GPUs, FPGAs or ASICs). For embedded systems, the requirements are particularly challenging because of stiff power and timing budgets...

chapter

A Design Strategy for Digit Serial Multiplier Based Binary Edwards Curve Scalar Multiplier Architectures

Apostolos P. Fournaris, Charalambos Dimopoulos, Odysseas Koufopavlou

2017 Euromicro Conference on Digital System Design (DSD) > 221 - 228

Binary Edwards Curves (BEC) constitute an alternative to the standardized Weierstrass elliptic curve (EC) equations since the latter have intrinsic side channel attack vulnerabilities due to their lack of point operation uniformity. Thus, BECs have gained popularity over the past few years due to their uniformity, operation regularity, completeness and implementation attractiveness. However, BEC Scalar...

chapter

Hardware efficient detection for massive MIMO uplink with parallel Gauss-Seidel method

Zhizhen Wu, Ye Xue, Xiaohu You, Chuan Zhang

2017 22nd International Conference on Digital Signal Processing (DSP) > 1 - 5

In this paper, a novel, low-complexity, and hardware-efficient signal detection algorithm and its corresponding VLSI architecture are proposed for massive multiple-input multiple-output (MIMO) systems. This method is based on the parallel Gauss-Seidel (PGS) iterative method, and achieves detection performance comparable to that of linear minimum mean-square error (MMSE) detection. It successfully avoids...
