Search results

chapter

The Changing Relevance of the TLB

Jessica R. Jones, James H. Davenport, Russell Bradford

2013 12th International Symposium on Distributed Computing and Applications to Business, Engineering & Science > 110 - 114

2013 12th International Symposium on Distributed Computing and Applications to Business, Engineering & Science (DCABES)

A little over a decade ago, Goto and van de Geijn wrote about the importance of the treatment of the translation lookaside buffer (TLB) on the performance of matrix multiplication. Crucially, they did not say how important, nor did they provide results that would allow the reader to make his own judgement. In this paper, we revisit their work and look at the effect on the performance of their algorithm...

chapter

Early Experiences for Adaptation of Auto-tuning by ppOpen-AT to an Explicit Method

Takahiro Katagiri, Satoshi Ito, Satoshi Ohshima

2013 IEEE 7th International Symposium on Embedded Multicore Socs > 153 - 158

2013 IEEE 7th International Symposium on Embedded Multicore Socs (MCSoC)

We present a code optimization technique by adapting an auto-tuning (AT) function to an explicit method with the static code generator FIBER. The AT function is evaluated with current multicore processors to match situations with high-thread parallelism (HTP). The results of performance evaluations indicate that the AT function is crucial for HTP, as the speedups of the explicit method with a static...

chapter

Measuring the gap between programmable and fixed-function accelerators: A case study on speech recognition

Yunsup Lee, David Sheffield, Andrew Waterman, Michael Anderson, more

2013 IEEE Hot Chips 25 Symposium (HCS) > 1 - 2

2013 IEEE Hot Chips 25 Symposium (HCS)

As power and energy consumption have become the key design constraint of mobile systems, mobile system-on-chip (SoC) architects have dedicated a progressively larger area budget to custom accelerators: graphics processors, audio/video codecs, and image signal processors abound. Fixed-function accelerators now occupy more than half of the die area of these chips [2], and we foresee this trend only...

chapter

A HW/SW Co-design of Execution Migration for Shared-ISA Heterogeneous Chip Multiprocessors

Hongwei Liu, Bo Sang, Jing Huang, Ji Qiu, more

2013 IEEE Eighth International Conference on Networking, Architecture and Storage > 23 - 30

2013 IEEE 8th International Conference on Networking, Architecture, and Storage (NAS)

Heterogeneous multi-core processors have strong potential for performance improvement, energy efficiency, and area efficiency compared to homogeneous multi-core processors. Present methods of execution migration for heterogeneous multi-core processors suffer in efficiency, cost, compatibility, or programmability. In this paper, we propose a HW/SW co-design migration method based on binary-instrumentation...

chapter

Parallelization of elastic bunch graph matching (EBGM) algorithm for fast face recognition

Xianming Chen, Chaoyang Zhang, Fan Dong, Zhaoxian Zhou

2013 IEEE China Summit and International Conference on Signal and Information Processing > 201 - 205

2013 IEEE China Summit and International Conference on Signal and Information Processing (ChinaSIP)

This paper presents a parallel method for EBGM face recognition. Compared with other methods such as principal component analysis (PCA) and linear discriminant analysis (LDA), EBGM has the advantage of higher accuracy, but at the cost of more computation time and memory usage, which makes it less practical. We propose a parallel method for EBGM by balancing the unit of images. We distribute the...

chapter

Towards implementation of Virtual-Clustered multiprocessor scheduling in Linux

Syed Md Jakaria Abdullah, Nima Moghaddami Khalilzad, Moris Behnam, Thomas Nolte

2013 8th IEEE International Symposium on Industrial Embedded Systems (SIES) > 97 - 100

2013 8th IEEE International Symposium on Industrial Embedded Systems (SIES)

Cluster based multiprocessor scheduling can be seen as a hybrid approach combining benefits of both partitioned and global scheduling. Virtual clustering further enhances it by providing dynamic cluster resource allocation and applying hierarchical scheduling techniques. Over the years, the study of virtual cluster scheduling has been limited to theoretical analysis. In this paper, we present our...

chapter

Increasing the trustworthiness of commodity hardware through software

Kevin Elphinstone, Yanyan Shen

2013 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN) > 1 - 6

2013 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)

Advances in formal software verification have produced an operating system that is guaranteed mathematically to be correct and to enforce access isolation. Such an operating system could potentially consolidate safety- and security-critical software on a single device where previously multiple devices were used. One of the barriers to consolidation on commodity hardware is the lack of hardware dependability...

chapter

Virtual Systolic Array for QR Decomposition

Jakub Kurzak, Piotr Luszczek, Mark Gates, Ichitaro Yamazaki, more

2013 IEEE 27th International Symposium on Parallel and Distributed Processing > 251 - 260

2013 IEEE International Symposium on Parallel & Distributed Processing (IPDPS)

Systolic arrays offer a very attractive, data centric, execution model as an alternative to the von Neumann architecture. Hardware implementations of systolic arrays turned out not to be viable solutions in the past. This article shows how the systolic design principles can be applied to a software solution to deliver an algorithm with unprecedented strong scaling capabilities. Systolic array for...

chapter

Redefining the relationship between scalar and parallel units in SIMD architectures

Yaohua Wang, Shuming Chen, Jianghua Wan, Kai Zhang

2013 IEEE International Symposium on Circuits and Systems (ISCAS2013) > 781 - 784

2013 IEEE International Symposium on Circuits and Systems (ISCAS)

SIMD architectures, comprising both scalar and parallel units, have been widely used in media processors. To further improve performance, much effort has been made to enhance the design of both units, while little attention has been paid to the relationship between the units. This paper demonstrates that a dynamic coupling mechanism, which can dynamically transform the scalar and parallel...

chapter

High performance 3D-FFT implementation

U Nidhi, Kolin Paul, Ahmed Hemani, Anshul Kumar

2013 IEEE International Symposium on Circuits and Systems (ISCAS2013) > 2227 - 2230

2013 IEEE International Symposium on Circuits and Systems (ISCAS)

3D FFT is a highly data- and compute-intensive kernel encountered in many applications. We report a high-performance design and implementation of 3D-FFT on a CGRA which supports partial reconfiguration. The hardware/software multi-clock design uses dynamic reconfiguration to reduce the required communication bandwidth and achieves a sustained throughput of 40 GOPS on a word size of 48 bits. Performance...

chapter

Implementation of a volume rendering on coarse-grained reconfigurable multiprocessor

Seunghun Jin, Sangheon Lee, Moo-Kyoung Chung, Yeongon Cho, more

2012 International Conference on Field-Programmable Technology > 243 - 246

2012 International Conference on Field-Programmable Technology (FPT)

In this paper, we present a reconfigurable multiprocessor architecture for volume rendering. The multiprocessor consists of sixteen reconfigurable processors to exploit the data parallelism of volume rendering. Each processor has a VLIW core and a reconfigurable coarse-grained array, specialized for the control-intensive and data-intensive parts of the program, respectively. The coarse-grained array can be configured dynamically,...

chapter

Energy Efficient High-Performance Computing Using ARM Cortex-A9 Cores

D. Pleiter, M. Richter

2012 IEEE International Conference on Green Computing and Communications > 607 - 610

2012 IEEE International Conference on Green Computing and Communications (GreenCom)

In this paper we investigate the energy efficiency of processors based on ARM Cortex-A9 cores for scientific numerical applications. We study the performance for a few numerical kernels which appear in a larger set of scientific applications. From power measurements that were performed on different platforms we estimate the energy consumed when executing these kernels.

chapter

A forensic hypervisor for process tracking and exploit discovery

Stephen Kuhn, Stephen Taylor

MILCOM 2012 - 2012 IEEE Military Communications Conference > 1 - 5

MILCOM 2012 - 2012 IEEE Military Communications Conference

Real-time forensic reconstruction of a process's memory and interaction history is impractical in modern computing environments because the volume of data processed by a typical server is immense. Having this information would speed the search for zero-day exploits and designate precisely which system components could have been affected by an intrusion. Unfortunately, it may be several months after...

chapter

Implementing Basic Computational Kernels of Linear Algebra on Multicore

Panagiotis D. Michailidis, Konstantinos G. Margaritis

2012 16th Panhellenic Conference on Informatics > 217 - 222

2012 16th Panhellenic Conference on Informatics (PCI)

This paper implements basic computational kernels of scientific computing, such as matrix-vector product, matrix product, and Gaussian elimination, on multi-core platforms using several parallel programming tools. Specifically, these tools are Pthreads, OpenMP, Intel Cilk++, Intel TBB, Intel ArBB, SMPSs, SWARM and Fast Flow. The aim of this paper is to present a unified quantitative and qualitative...

chapter

Enabling an OS kernel for large data with a SIMD Unit

Shogo Saito, Shuichi Oikawa

The 1st IEEE Global Conference on Consumer Electronics 2012 > 597 - 601

2012 IEEE 1st Global Conference on Consumer Electronics (GCCE)

Embedded systems now handle larger data than ever before, and the size of that data can be expected to keep growing. Ordinarily, these complex requirements are met by adopting an operating system (OS) kernel. Improving the performance of the OS kernel's data processing is therefore meaningful for many embedded solutions. To achieve this improvement, we...

chapter

Enabling an OpenCL Compiler for Embedded Multicore DSP Systems

Jia-Jhe Li, Chi-Bang Kuan, Tung-Yu Wu, Jenq Kuen Lee

2012 41st International Conference on Parallel Processing Workshops > 545 - 552

2012 41st International Conference on Parallel Processing Workshops (ICPPW)

OpenCL is an industry attempt to unify heterogeneous multicore programming. With its programming model defining SPMD kernels, vector types, and address space qualifiers, OpenCL allows programmers to exploit data parallelism with multicore processors and SIMD instructions as well as data locality with the memory hierarchy. Recently, OpenCL has gained success on many architectures, including multicore...

chapter

OpenMP-based Synergistic Parallelization and HW Acceleration for On-Chip Shared-Memory Clusters

Paolo Burgio, Andrea Marongiu, Dominique Heller, Cyrille Chavet, more

2012 15th Euromicro Conference on Digital System Design > 751 - 758

2012 15th Euromicro Conference on Digital System Design (DSD)

Modern embedded MPSoC designs increasingly couple hardware accelerators to processing cores to trade between energy efficiency and platform specialization. To assist effective design of such systems there is the need on one hand for clear methodologies to streamline accelerator definition and instantiation, on the other for architectural templates and run-time techniques that minimize processors-to-accelerator...

chapter

Built-in Device Simulator for OS Performance Evaluation

Junjie Mao, Yu Chen, Yaozu Dong

2012 IEEE International Conference on Cluster Computing > 538 - 541

2012 IEEE International Conference on Cluster Computing (CLUSTER)

I/O devices are evolving rapidly, while OS optimization is always slower because of its dependence on physical devices. This inevitably prevents the latest devices from reaching their rated performance, which remains a big problem for performance-critical applications. Though I/O device simulators can help carry out performance evaluation before physical devices are ready, the existing simulator...

chapter

Reducing Scalability Collapse via Requester-Based Locking on Multicore Systems

Yan Cui, Yingxin Wang, Yu Chen, Yuanchun Shi, more

2012 IEEE 20th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems > 298 - 307

2012 IEEE 20th International Symposium on Modelling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS)

In response to the increasing ubiquity of multicore processors, there has been widespread development of multithreaded applications that strive to realize their full potential. Unfortunately, lock contention within operating systems can limit the scalability of multicore systems so severely that an increase in the number of cores can actually lead to reduced performance (i.e. scalability collapse)...

chapter

An OpenCL Runtime Library for Embedded Multi-Core Accelerator

Ryuichi Sakamoto, Mikiko Sato, Yusuke Koizumi, Hideharu Amano, more

2012 IEEE International Conference on Embedded and Real-Time Computing Systems and Applications > 419 - 422

2012 IEEE 18th International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA 2012)

In recent years, improving energy efficiency and computational performance has become a major issue as smartphones and tablets have become popular. To achieve high performance, multi-core accelerators consisting of general-purpose processors and accelerators are often used. But to use these multi-core accelerators efficiently, programmers have to consider synchronization and data transfer between...

INFONA - science communication portal

