Search results for: Xuehai Qian

Items from 1 to 12 out of 12 results

chapter

Power Efficient Sharing-Aware GPU Data Management

Abdulaziz Tabbakh, Murali Annavaram, Xuehai Qian

2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS) > 698 - 707

The power consumed by the memory system in GPUs is a significant fraction of the total chip power. As thread-level parallelism increases, GPUs are likely to stress cache and memory bandwidth even more, thereby exacerbating power consumption. We observe that neighboring concurrent thread arrays (CTAs) within GPU applications share a considerable amount of data. However, the default GPU scheduling policy...

chapter

BulkCommit: Scalable and fast commit of atomic blocks in a lazy multiprocessor environment

Xuehai Qian, Josep Torrellas, Benjamin Sahelices, Depei Qian

2013 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO) > 371 - 382

To help improve the programmability and performance of shared-memory multiprocessors, there are proposals of architectures that continuously execute atomic blocks of instructions — also called Chunks. To be competitive, these architectures must support chunk operations very efficiently. In particular, in a large manycore with lazy conflict detection, they must support efficient chunk commit. This...

chapter

PipeLayer: A Pipelined ReRAM-Based Accelerator for Deep Learning

Linghao Song, Xuehai Qian, Hai Li, Yiran Chen

2017 IEEE International Symposium on High Performance Computer Architecture (HPCA) > 541 - 552

Convolutional neural networks (CNNs) are at the heart of deep learning applications. Recent works PRIME [1] and ISAAC [2] demonstrated the promise of using resistive random access memory (ReRAM) to perform neural computations in memory. We found that training cannot be efficiently supported with the current schemes. First, they do not consider weight updates and complex data dependencies in the training procedure...

chapter

Optimized Register Renaming Scheme for Stack-Based x86 Operations

Xuehai Qian, He Huang, Zhenzhong Duan, Junchao Zhang, et al.

Lecture Notes in Computer Science > Architecture of Computing Systems - ARCS 2007 > 43-56

The stack-based floating point unit (FPU) in the x86 architecture limits its floating point (FP) performance. The flat register file can improve FP performance but affect x86 compatibility. This paper presents an optimized two-phase floating point register renaming scheme used in implementing an x86-compliant processor. The two-phase renaming scheme eliminates the implicit dependencies between the...

article

Improving multiprocessor performance with fine-grain coherence bypass

Hui Wang, Rui Wang, ZhongZhi Luan, XueHai Qian, et al.

Science China Information Sciences > 2015 > 58 > 1 > 1-15

An efficient and scalable cache coherence protocol is crucial to high-performance shared-memory servers. The directory-based cache coherence protocol scales better than the snooping-based protocol. However, even for the former, scaling to a large number of cores is challenging due to the additional area requirements of the directories. We observed...

chapter

OmniOrder: Directory-based conflict serialization of transactions

Xuehai Qian, Benjamin Sahelices, Josep Torrellas

2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA) > 421 - 432

Effective execution of atomic blocks of instructions (also called transactions) can enhance the performance and programmability of multiprocessors. Atomic blocks can be demarcated in software as in Transactional Memory (TM) or dynamically generated by the hardware as in aggressive implementations of strict memory consistency. In most current designs, when two atomic blocks conflict, one is squashed...

chapter

Pacifier: Record and replay for relaxed-consistency multiprocessors with distributed directory protocol

Xuehai Qian, Benjamin Sahelices, Depei Qian

2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA) > 433 - 444

Record and Deterministic Replay (R&R) of multithreaded programs on relaxed-consistency multiprocessors with distributed directory protocol has been a long-standing open problem. The independently developed RelaxReplay [8] solves the problem by assuming write atomicity. This paper proposes Pacifier, the first R&R scheme to provide a solution without assuming write atomicity. R&R for relaxed-consistency...

chapter

Rainbow: Efficient memory dependence recording with high replay parallelism for relaxed memory model

Xuehai Qian, He Huang, Benjamin Sahelices, Depei Qian

2013 IEEE 19th International Symposium on High Performance Computer Architecture (HPCA) > 554 - 565

Architectures for record-and-replay (R&R) of multithreaded applications ease program debugging, intrusion analysis and fault-tolerance. Among the large body of previous works, Strata enables efficient memory dependence recording with little hardware overhead and can be applied smoothly to snoopy protocols. However, Strata records imprecise happens-before relations and assumes Sequential Consistency...

chapter

BulkSMT: Designing SMT processors for atomic-block execution

Xuehai Qian, Benjamin Sahelices, Josep Torrellas

2012 IEEE 18th International Symposium on High Performance Computer Architecture (HPCA) > 1 - 12

Multiprocessor architectures that continuously execute atomic blocks (or chunks) of instructions can improve performance and software productivity. However, all of the prior proposals for such architectures assume single-context cores as building blocks — rather than the widely-used Simultaneous Multithreading (SMT) cores. As a result, they are underutilizing hardware resources. This paper presents...

chapter

ScalableBulk: Scalable Cache Coherence for Atomic Blocks in a Lazy Environment

Xuehai Qian, Wonsun Ahn, Josep Torrellas

2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2010) > 447 - 458

Recently-proposed architectures that continuously operate on atomic blocks of instructions (also called chunks) can boost the programmability and performance of shared-memory multiprocessing. However, they must support chunk operations very efficiently. In particular, in lazy conflict-detection environments, it is key that they provide scalable chunk commits. Unfortunately, current proposals typically...

chapter

Circuit implementation of floating point range reduction for trigonometric functions

Xuehai Qian, Hao Zhang, Jingang Yang, He Huang, et al.

2007 IEEE International Symposium on Circuits and Systems > 3010 - 3013

Range reduction is important in evaluating trigonometric functions, but little work has been done on its hardware implementation. A hardware floating point range reduction implementation is presented. The whole reduction is divided into two steps: the first is based on the double-residue modular range reduction method, and the second adopts a novel method described in this paper. The...

chapter

Design and Implementation of Floating Point Stack on General RISC Architecture

Xuehai Qian, He Huang, Hao Zhang, Guoping Long, et al.

15th EUROMICRO International Conference on Parallel, Distributed and Network-Based Processing (PDP'07) > 238 - 245

This paper presents a framework for implementing the x86 FP stack in an x86-compliant processor based on a general RISC architecture. Architectural support is added to a typical RISC architecture to maintain the FP stack status. Some speculative techniques are applied to the decode stage to enable pipelined and efficient FP operations. An optimized register renaming scheme is proposed to eliminate...


INFONA - science communication portal
