Search results

chapter

Optimizations of Two Compute-Bound Scientific Kernels on the SW26010 Many-Core Processor

James Lin, Zhigeng Xu, Akira Nukada, Naoya Maruyama, more

2017 46th International Conference on Parallel Processing (ICPP) > 432 - 441

The home-grown SW26010 many-core processor enabled the production of China’s first independently developed number-one ranked supercomputer, the Sunway TaihuLight. Its limited off-chip memory bandwidth, however, renders the SW26010 a highly memory-bound processor. To compensate for this limitation, the processor was designed with a unique hardware feature, "Register Level Communication"...

chapter

In-Place Irregular Computation for Message-Passing Chip-Multiprocessors

Zhang Youhui, Zhang Youyang, Li Yanhua, Fei Xiang, more

2017 46th International Conference on Parallel Processing Workshops (ICPPW) > 69 - 76

As CMP (Chip-Multiprocessor) scale increases, moving data to computation on chip becomes more expensive. Accordingly, moving computation to data has the potential to improve efficiency. We propose an in-place computation co-design of many-simple-core CMP for irregular applications. The computing paradigm is that an application's critical irregular data (or part of them) is partitioned into on-chip...

chapter

Hierarchical Read/Write Analysis for Pointer-Based OpenCL Programs on RRAM

Lin-Ya Yu, Shao-Chung Wang, Jenq-Kuen Lee

2017 46th International Conference on Parallel Processing Workshops (ICPPW) > 45 - 52

Heterogeneous computing platforms containing a wide range of computing resources, from CPUs to specialized hardware accelerators, are the trend today, resulting from the physical limitations on processor speed and the increasing demand for computing performance. Hence, many optimization strategies are studied to get better throughput and lower energy consumption in heterogeneous systems. Various memory...

chapter

Autotuning GPU Kernels via Static and Predictive Analysis

Robert Lim, Boyana Norris, Allen Malony

2017 46th International Conference on Parallel Processing (ICPP) > 523 - 532

Optimizing the performance of GPU kernels is challenging for both human programmers and code generators. For example, CUDA programmers must set thread and block parameters for a kernel, but might not have the intuition to make a good choice. Similarly, compilers can generate working code, but may miss tuning opportunities by not targeting GPU models or performing code transformations. Although empirical...

chapter

An efficient FPGA-Based architecture for convolutional neural networks

Wen-Jyi Hwang, Yun-Jie Jhang, Tsung-Ming Tai

2017 40th International Conference on Telecommunications and Signal Processing (TSP) > 582 - 588

The goal of this paper is to implement an efficient FPGA-based hardware architecture for the design of fast artificial vision systems. The proposed architecture is capable of performing classification operations of a Convolutional Neural Network (CNN) in real time. To show the effectiveness of the architecture, some design examples such as hand posture recognition, character recognition, and face...

chapter

Jamming resistant encoding for non-uniformly distributed information

Batya Karp, Yerucham Berkowitz, Osnat Keren

2017 IEEE 23rd International Symposium on On-Line Testing and Robust System Design (IOLTS) > 169 - 173

Codes that aim to detect any error regardless of its multiplicity are referred to as security oriented codes. Most of these codes are designed to protect uniformly distributed codewords; there are few solutions which are used in protecting systems with non-uniformly distributed words. The paper introduces a new encoding method, termed “Level-Out encoding”, for cases in which some words are more likely...

chapter

Understanding the Impact of Fine-Grained Data Sharing and Thread Communication on Heterogeneous Workload Development

Tuan Ta, David Troendle, Xiaoqi Hu, Byunghyun Jang

2017 16th International Symposium on Parallel and Distributed Computing (ISPDC) > 132 - 139

The conventional OpenCL 1.x style CPU-GPU heterogeneous computing paradigm treats the CPU and GPU processors as loosely connected separate entities. At best each executes independent tasks, but, more commonly, the CPU idles while waiting for results from the GPU. No data sharing or communication is allowed during kernel execution. This model limits the number of applications that can harness the...

chapter

Multiverse: Easy Conversion of Runtime Systems into OS Kernels via Automatic Hybridization

Kyle C. Hale, Conor Hetland, Peter Dinda

2017 IEEE International Conference on Autonomic Computing (ICAC) > 177 - 186

The hybrid runtime (HRT) model offers a path towards high performance and efficiency. By integrating the OS kernel, runtime, and application, an HRT allows the runtime developer to leverage the full feature set of the hardware and specialize OS services to the runtime's needs. However, conforming to the HRT model currently requires a port of the runtime to the kernel level, for example to the Nautilus...

chapter

Binarized Convolutional Neural Networks with Separable Filters for Efficient Hardware Acceleration

Jeng-Hau Lin, Tianwei Xing, Ritchie Zhao, Zhiru Zhang, more

2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) > 344 - 352

State-of-the-art convolutional neural networks are enormously costly in both compute and memory, demanding massively parallel GPUs for execution. Such networks strain the computational capabilities and energy available to embedded and mobile processing platforms, restricting their use in many important applications. In this paper, we propose BCNN with Separable Filters (BCNNw/SF), which applies Singular...

chapter

A Model Driven Approach for Device Driver Development

Yunwei Dong, Yuanyuan He, Yin Lu, Hong Ye

2017 IEEE International Conference on Software Quality, Reliability and Security Companion (QRS-C) > 122 - 129

In order to facilitate the development and maintenance of device drivers integrated into the operating system, a model driven approach is proposed in this paper for driver design and verification before coding. An architecture model and a behavior model are created to illustrate both static and dynamic characteristics of device drivers, together with a device model and a device-driver-O.S. interaction model...

chapter

OpenMP device offloading to FPGA accelerators

Lukas Sommer, Jens Korinth, Andreas Koch

2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP) > 201 - 205

Future high-performance computing systems will need to include multiple specialized accelerators in a single heterogeneous system to overcome power-density limitations of CPU performance.

chapter

Automated monitoring and detection of resource-limited NFV-based services

Steven Van Rossem, Wouter Tavernier, Didier Colle, Mario Pickavet, more

2017 IEEE Conference on Network Softwarization (NetSoft) > 1 - 5

The growing demand for flexibility and cost reduction in the telecommunication landscape directs the focus of service development heavily to programmability and softwarization. In the domain of Network Function Virtualization (NFV), one of the goals is to replace dedicated hardware devices (such as switches, routers, firewalls) with software-based network functionalities, showing comparable performance...

chapter

A survey on decoding schedules of LDPC convolutional codes and associated hardware architectures

Hayfa Ben Thameur, Bertrand Le Gal, Nadia Khouja, Fethi Tlili, more

2017 IEEE Symposium on Computers and Communications (ISCC) > 898 - 905

Low-density parity-check convolutional codes (LDPC-CC) have interesting error correction features. They have great potential to become key error-correcting codes for enhancing the reliability of modern digital communication systems, optical systems and storage devices. On the implementation side, however, the design of low-cost, low-power and high-throughput LDPC-CC decoders remains challenging. This...

chapter

Exploring the Granularity of Sparsity in Convolutional Neural Networks

Huizi Mao, Song Han, Jeff Pool, Wenshuo Li, more

2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) > 1927 - 1934

Sparsity helps reduce the computational complexity of DNNs by skipping multiplications with zeros. The granularity of sparsity affects the efficiency of the hardware architecture and the prediction accuracy. In this paper we quantitatively measure the accuracy-sparsity relationship at different granularities. Coarse-grained sparsity brings a more regular sparsity pattern, making it easier for hardware...

chapter

Renovate high performance user-level stacks' innovation utilizing commodity network adaptors

Mao Miao, Xiaohui Luo, Fengyuan Ren, Wenxue Cheng, more

2017 IEEE Symposium on Computers and Communications (ISCC) > 906 - 911

Today's data center servers are equipped with high-speed, complex network adaptors featuring an array of functions, e.g. hardware TX/RX queues, packet filters, rate limiters, etc. Recent work such as IX, Arrakis, and MultiStack has rekindled innovation in user-level network stacks that utilize these commodity network adaptors. In this paper, we revisit the idea to move stacks' design from in-kernel...

chapter

Unified Model for Contrast Enhancement and Denoising

Alex Pappachen James, Olga Krestinskaya, Joshin John Mathew

2017 IEEE Computer Society Annual Symposium on VLSI (ISVLSI) > 379 - 384

In this paper, we attempt a challenging task: to unify two important complementary operations, i.e. contrast enhancement and denoising, that are required in most image processing applications. The proposed method is implemented using practical analog circuit configurations that can lead to near real-time processing capabilities suitable for integration with vision sensors. Metrics used for performance...

chapter

Reconfigurable Support Vector Machine Classifier with Approximate Computing

Martin Van Leussen, Jos Huisken, Lei Wang, Hailong Jiao, more

2017 IEEE Computer Society Annual Symposium on VLSI (ISVLSI) > 13 - 18

The Support Vector Machine (SVM) is one of the most popular machine learning algorithms. An energy-efficient SVM classifier is proposed in this paper, where approximate computing is utilized to reduce energy consumption and silicon area. A hardware architecture with reconfigurable kernels and an overflow-resilient limiter is presented. For different applications, different kernels can be chosen and configured...

chapter

Hardwiring the OS kernel into a Java application processor

Chun-Jen Tsai, Cheng-Ju Lin, Cheng-Yang Chen, Yan-Hung Lin, more

2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP) > 53 - 60

This paper presents the design and implementation of a hardwired OS kernel circuitry inside a Java application processor to provide the system services that are traditionally implemented in software. The hardwired system functions in the proposed SoC include the thread manager, the memory manager, and the I/O subsystem interface. There are many advantages in making the OS kernel a hardware component,...

chapter

Poster Abstract: KLEP: A Kernel Level Energy Profiling Tool for Android

Dong Li, Ripeng Du, Li Cui, Guoliang Xing

2017 16th ACM/IEEE International Conference on Information Processing in Sensor Networks (IPSN) > 305 - 306

We propose a kernel-level energy profiling tool, KLEP, that can work with diverse APIs of Android. KLEP addresses the challenges of the tail energy problem and the complex interrelation between hardware components in the device energy consumption profile. KLEP collects energy-sensitive events in the kernel and measures the real energy consumption of the device at the same time, and employs an LSTM neural-network-based...

chapter

Digital architecture for real-time CNN-based face detection for video processing

Smrity Bhattarai, Arjuna Madanayake, Renato J. Cintra, Stefan Duffner, more

2017 Cognitive Communications for Aerospace Applications Workshop (CCAA) > 1 - 6

In this paper, we propose a hardware computing architecture for face detection that classifies an image as a face or non-face. The computing architecture was first designed, modeled and tested in MATLAB Simulink using the Xilinx block set and was later tested on a Virtex-6 FPGA ML605 Evaluation Kit. The system uses learned filters which were previously extracted by training on a set of face and non-face...
