Search results

Items from 21 to 40 out of 402 results

chapter

Scalable NUMA-Aware Wilson-Dirac on Supercomputers

Claude Tadonki

2017 International Conference on High Performance Computing & Simulation (HPCS) > 315 - 324

2017 International Conference on High Performance Computing & Simulation (HPCS)

We revisit the Wilson-Dirac operator, also referred as Dslash, on NUMA manycore vector machines and thereby seek an efficient supercomputing implementation. Quantum Chro- moDynamics (QCD) is the theory of the strong nuclear force and its discrete formalism is the so-called Lattice Quantum ChromoDynamics (LQCD). Wilson-Dirac is the major computing kernel in LQCD, where a special attention is paid to...

chapter

Asymmetric Feature Maps with Application to Sketch Based Retrieval

Giorgos Tolias, Ondrej Chum

2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) > 6185 - 6193

2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)

We propose a novel concept of asymmetric feature maps (AFM), which allows to evaluate multiple kernels between a query and database entries without increasing the memory requirements. To demonstrate the advantages of the AFM method, we derive a short vector image representation that, due to asymmetric feature maps, supports efficient scale and translation invariant sketch-based image retrieval. Unlike...

chapter

Cache Partitioning + Loop Tiling: A Methodology for Effective Shared Cache Management

Vasilios Kelefouras, Georgios Keramidas, Nikolaos Voros

2017 IEEE Computer Society Annual Symposium on VLSI (ISVLSI) > 477 - 482

2017 IEEE Computer Society Annual Symposium on VLSI (ISVLSI)

In this paper, we present a new methodology that provides i) a theoretical analysis of the two most commonly used approaches for effective shared cache management (i.e., cache partitioning and loop tiling) and ii) a unified framework to fine tuning those two mechanisms in tandem (not separately). Our approach manages to lower the number of main memory accesses by one order of magnitude keeping at...

chapter

Hardwiring the OS kernel into a Java application processor

Chun-Jen Tsai, Cheng-Ju Lin, Cheng-Yang Chen, Yan-Hung Lin, more

2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP) > 53 - 60

2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP)

This paper presents the design and implementation of a hardwired OS kernel circuitry inside a Java application processor to provide the system services that are traditionally implemented in software. The hardwired system functions in the proposed SoC include the thread manager, the memory manager, and the I/O subsystem interface. There are many advantages in making the OS kernel a hardware component,...

chapter

Parallel Multi Channel convolution using General Matrix Multiplication

Aravind Vasudevan, Andrew Anderson, David Gregg

2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP) > 19 - 24

2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP)

Convolutional neural networks (CNNs) have emerged as one of the most successful machine learning technologies for image and video processing. The most computationally-intensive parts of CNNs are the convolutional layers, which convolve multi-channel images with multiple kernels. A common approach to implementing convolutional layers is to expand the image into a column matrix (im2col) and perform...

chapter

Taming Performance Degradation of Containers in the Case of Extreme Memory Overcommitment

Rina Nakazawa, Kazunori Ogata, Seetharami Seelam, Tamiya Onodera

2017 IEEE 10th International Conference on Cloud Computing (CLOUD) > 196 - 204

2017 IEEE 10th International Conference on Cloud Computing (CLOUD)

The efficiency of datacenters is important consideration for cloud service providers to make their datacenters always ready for fulfilling the increasing demand for computing resources. Container-based virtualization is one approach to improving efficiency by reducing the overhead of virtualization. Resource overcommitment is another approach, but cloud providers tend to make conservative allocations...

chapter

Introducing approximate memory support in Linux Kernel

Giulia Stazi, Francesco Menichelli, Antonio Mastrandrea, Mauro Olivieri

2017 13th Conference on Ph.D. Research in Microelectronics and Electronics (PRIME) > 97 - 100

2017 13th Conference on Ph.D. Research in Microelectronics and Electronics (PRIME)

This paper describes the implementation of approximate memory support in Linux operating system kernel. The new functionality allows the kernel to distinguish between normal memory banks, which are composed by standard memory cells that retain data without corruption, and approximate memory banks, where memory cells are subject to read/write faults with controlled probability. Approximate memories...

chapter

Memory fartitioning-based modulo scheduling for high-level synthesis

Tianyi Lu, Shouyi Yin, Xianqing Yao, Zhicong Xie, more

2017 IEEE International Symposium on Circuits and Systems (ISCAS) > 1 - 4

2017 IEEE International Symposium on Circuits and Systems (ISCAS)

High-Level Synthesis (HLS) has been widely recognized as an efficient compilation process targeting FPGAs for algorithm evaluation and product prototyping. However, the massively parallel memory access demands and the extremely expensive cost of single-bank memory with multi-port have impeded loop pipelining performance. Thus, based on an alternative multi-bank memory architecture, a joint approach...

chapter

A fast FPGA-based deep convolutional neural network using pseudo parallel memories

Muluken Hailesellasie, Syed Rafay Hasan

2017 IEEE International Symposium on Circuits and Systems (ISCAS) > 1 - 4

2017 IEEE International Symposium on Circuits and Systems (ISCAS)

Deep learning is gaining popularity in the recent years due to its impressive performance in different application areas. Convolutional Neural Network (CNN) is the state-of-the-art deep learning architecture that is being used widely in the areas of image recognition, speech recognition and many other applications. CNN is computationally intensive and resource hungry architecture. Hence, its efficient...

chapter

Large-scale image classification using fast SVM with deep quasi-linear kernel

Peifeng Liang, Weite Li, Donghang Liu, Jinglu Hu

2017 International Joint Conference on Neural Networks (IJCNN) > 1064 - 1071

2017 International Joint Conference on Neural Networks (IJCNN)

In this paper, a novel fast support vector machine (SVM) method combining with the deep quasi-linear kernel (DQLK) learning is proposed for large scale image classification. This method can train large-scale dataset with SVM fast using less memory space and less training time. Since SVM classifiers are constructed by support vectors (SVs) that lie close to the separation boundary, removing the other...

chapter

A New File System I/O Mode for Efficient User-Level Caching

Jiwoong Park, Cheolgi Min, HeonYoung Yeom

2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID) > 649 - 658

2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)

A large number of cloud datastores have been developed to handle the cloud OLTP workload. Double caching problem where the same data resides both at the user buffer and the kernel buffer has been identified as one of the problems and has been largely solved by using direct I/O mode to bypass the kernel buffer. However, maintaining the caching layer only in user-level has the disadvantage that the...

chapter

Sparse Supernodal Solver Using Block Low-Rank Compression

Gregoire Pichon, Eric Darve, Mathieu Faverge, Pierre Ramet, more

2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) > 1138 - 1147

2017 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)

This paper presents two approaches using a Block Low-Rank (BLR) compression technique to reduce the memory footprint and/or the time-to-solution of the sparse supernodal solver PASTIX. This flat, non-hierarchical, compression method allows to take advantage of the low-rank property of the blocks appearing during the factorization of sparse linear systems, which come from the discretization of partial...

chapter

Efficient GPGPU Computing with Cross-Core Resource Sharing and Core Reconfiguration

Ashutosh Dhar, Deming Chen

2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM) > 48 - 55

2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)

GPUs are capable of running a variety of applications, however their generic parallel-architecture can lead to inefficient use of resources and reduced power efficiency, due to algorithmic or architectural constraints. In this work, taking inspiration from CGRAs (coarse-grained reconfigurable architectures), we demonstrate resource sharing and re-distribution as a solution that can be leveraged by...

chapter

Predicting memory page stability and its application to memory deduplication and live migration

Karim Elghamrawy, Diana Franklin, Frederic T. Chong

2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) > 125 - 126

2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)

There are various applications and operations in virtualized environments that rely on memory page stability to achieve satisfactory performance. These applications include VM live migration and memory deduplication. Unfortunately, there is a large gap between existing prediction mechanisms and actual behavior. This is the gap we hope to narrow.

chapter

Work-In-Progress: Protecting Real-Time GPU Applications on Integrated CPU-GPU SoC Platforms

Waqar Ali, Heechul Yun

2017 IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS) > 141 - 144

2017 IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS)

Integrated CPU-GPU architecture provides excellent acceleration capabilities for data parallel applications on embedded platforms while meeting the size, weight and power (SWaP) requirements. However, sharing of main memory between CPU applications and GPU kernels can severely affect the execution of GPU kernels and diminish the performance gain provided by GPU. In the NVIDIA Tegra TK1 platform which...

chapter

srcSlice: A Tool for Efficient Static Forward Slicing

Christian D. Newman, Tessandra Sage, Michael L. Collard, Hakam W. Alomari, more

2016 IEEE/ACM 38th International Conference on Software Engineering Companion (ICSE-C) > 621 - 624

2016 IEEE/ACM 38th International Conference on Software Engineering Companion (ICSE-C)

An efficient lightweight forward static slicing tool is presented.The tool is implemented on top of srcML, an XML representation of source code.The approach does not compute the full program dependence graph but instead dependency information is computed as needed while computing the slice on a variable.The result is a list of line numbers, dependent variables, aliases, and function calls that are...

chapter

Chain-NN: An energy-efficient 1D chain architecture for accelerating deep convolutional neural networks

Shihao Wang, Dajiang Zhou, Xushen Han, Takeshi Yoshimura

Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017 > 1032 - 1037

2017 Design, Automation & Test in Europe Conference & Exhibition (DATE)

Deep convolutional neural networks (CNN) have shown their good performances in many computer vision tasks. However, the high computational complexity of CNN involves a huge amount of data movements between the computational processor core and memory hierarchy which occupies the major of the power consumption. This paper presents Chain-NN, a novel energy-efficient 1D chain architecture for accelerating...

chapter

Optimisation opportunities and evaluation for GPGPU applications on low-end mobile GPUs

Matina Maria Trompouki, Leonidas Kosmidis

Design, Automation & Test in Europe Conference & Exhibition (DATE), 2017 > 950 - 953

2017 Design, Automation & Test in Europe Conference & Exhibition (DATE)

Previous works in the literature have shown the feasibility of general purpose computations for non-visual applications on low-end mobile graphics processors using graphics APIs. These works focused only on the functional aspects of the software, ignoring the implementation details and therefore their performance implications due to their particular micro-architecture. Since various steps in such...

chapter

14.1 A 2.9TOPS/W deep convolutional neural network SoC in FD-SOI 28nm for intelligent embedded systems

Giuseppe Desoli, Nitin Chawla, Thomas Boesch, Surinder-pal Singh, more

2017 IEEE International Solid-State Circuits Conference (ISSCC) > 238 - 239

2017 IEEE International Solid- State Circuits Conference - (ISSCC)

A booming number of computer vision, speech recognition, and signal processing applications, are increasingly benefiting from the use of deep convolutional neural networks (DCNN) stemming from the seminal work of Y. LeCun et al. [1] and others that led to winning the 2012 ImageNet Large Scale Visual Recognition Challenge with AlexNet [2], a DCNN significantly outperforming classical approaches for...

chapter

Taming warp divergence

Jayvant Anantpur, R. Govindarajan

2017 IEEE/ACM International Symposium on Code Generation and Optimization (CGO) > 50 - 60

2017 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)

Graphics Processing Units (GPUs) are designed to exploit large amount of parallelism. However, warp-level divergence occurring due to different amounts of work, memory access latency experienced, etc., results in warps of a thread block (TB) finishing kernel execution at different points in time. This, in effect, reduces utilization of resources of SMs and hence performance of the GPU. We propose...

Keywords:
KERNEL
MEMORY MANAGEMENT

Publication date

Set your own date range

Content availability

Available (398)
None (4)

Keywords

INSTRUCTION SETS (81)
LINUX (80)
HARDWARE (74)
GRAPHICS PROCESSING UNITS (58)
RANDOM ACCESS MEMORY (55)
RESOURCE MANAGEMENT (50)
GRAPHICS PROCESSING UNIT (45)
GPU (42)
BENCHMARK TESTING (38)
PARALLEL PROCESSING (38)
BANDWIDTH (37)
SERVERS (36)
OPTIMIZATION (33)
LIBRARIES (28)
ARRAYS (27)
STORAGE MANAGEMENT (25)
COMPUTATIONAL MODELING (24)
CUDA (24)
REGISTERS (24)
FIELD PROGRAMMABLE GATE ARRAYS (23)
PROGRAMMING (23)
RUNTIME (21)
OPERATING SYSTEM (20)
OPERATING SYSTEMS (COMPUTERS) (19)
PERFORMANCE EVALUATION (19)
VIRTUAL MACHINING (19)
EMBEDDED SYSTEMS (18)
SECURITY (18)
COPROCESSORS (17)
OPERATING SYSTEMS (17)
PROTOCOLS (17)
ALGORITHM DESIGN AND ANALYSIS (16)
MONITORING (16)
OPERATING SYSTEM KERNELS (16)
COMPUTER GRAPHIC EQUIPMENT (15)
DATA STRUCTURES (15)
MULTIPROCESSING SYSTEMS (15)
PROGRAM PROCESSORS (15)
SYNCHRONIZATION (15)
THROUGHPUT (14)
VIRTUAL MACHINE MONITORS (14)
INDEXES (13)
OPENCL (13)
VIRTUAL MACHINES (13)
ACCELERATION (12)
DATA MINING (12)
PARALLEL PROGRAMMING (12)
VIRTUALIZATION (12)
CACHE STORAGE (11)
CLOUD COMPUTING (11)
GPGPU (11)
IMAGE PROCESSING (11)
PREFETCHING (11)
REAL TIME SYSTEMS (11)
YARN (11)
COMPUTE UNIFIED DEVICE ARCHITECTURE (10)
FPGA (10)
MULTICORE PROCESSING (10)
NONVOLATILE MEMORY (10)
RADIATION DETECTORS (10)
CONVOLUTION (9)
DRIVER CIRCUITS (9)
POWER DEMAND (9)
RELIABILITY (9)
STREAMING MEDIA (9)
VECTORS (9)
EQUATIONS (8)
HIGH PERFORMANCE COMPUTING (8)
LATTICES (8)
REAL-TIME SYSTEMS (8)
SCALABILITY (8)
SUPPORT VECTOR MACHINES (8)
SYSTEM-ON-A-CHIP (8)
TRAINING (8)
VIRTUAL MACHINE (8)
COMPUTER ARCHITECTURE (7)
COMPUTER GRAPHICS (7)
DATA TRANSFER (7)
MEMORY (7)
MEMORY ARCHITECTURE (7)
NEURAL NETWORKS (7)
PIXEL (7)
PROCESSOR SCHEDULING (7)
RECONFIGURABLE ARCHITECTURES (7)
SCHEDULES (7)
SWITCHES (7)
ACCURACY (6)
APPLICATION PROGRAM INTERFACES (6)
CLOCKS (6)
COMPLEXITY THEORY (6)
CONTEXT (6)
DATABASES (6)
DIGITAL SIGNAL PROCESSING (6)
ENERGY CONSUMPTION (6)
GRAPHICS (6)
INSTRUMENTS (6)
INTERNET (6)
LOGIC GATES (6)
more

INFONA - science communication portal

Search results

Scalable NUMA-Aware Wilson-Dirac on Supercomputers

Asymmetric Feature Maps with Application to Sketch Based Retrieval

Cache Partitioning + Loop Tiling: A Methodology for Effective Shared Cache Management

Hardwiring the OS kernel into a Java application processor

Parallel Multi Channel convolution using General Matrix Multiplication

Taming Performance Degradation of Containers in the Case of Extreme Memory Overcommitment

Introducing approximate memory support in Linux Kernel

Memory fartitioning-based modulo scheduling for high-level synthesis

A fast FPGA-based deep convolutional neural network using pseudo parallel memories

Large-scale image classification using fast SVM with deep quasi-linear kernel

A New File System I/O Mode for Efficient User-Level Caching

Sparse Supernodal Solver Using Block Low-Rank Compression

Efficient GPGPU Computing with Cross-Core Resource Sharing and Core Reconfiguration

Predicting memory page stability and its application to memory deduplication and live migration

Work-In-Progress: Protecting Real-Time GPU Applications on Integrated CPU-GPU SoC Platforms

srcSlice: A Tool for Efficient Static Forward Slicing

Chain-NN: An energy-efficient 1D chain architecture for accelerating deep convolutional neural networks

Optimisation opportunities and evaluation for GPGPU applications on low-end mobile GPUs

14.1 A 2.9TOPS/W deep convolutional neural network SoC in FD-SOI 28nm for intelligent embedded systems

Taming warp divergence

Filter options

Publication date

Content availability

Keywords

INFONA - science communication portal

Search results

Add recipient

Sending message cancelled

Are you sure you want to cancel sending this message?

Send message

Filter options

Publication date

Date range setting

Set the date range to filter the displayed results. You can set a starting date, ending date or both. You can enter the dates manually or choose them from the calendar.

Content availability

Keywords

Reporting an error / abuse

Sending the report failed

Accessibility options