Programming accelerators such as GPUs with low-level APIs and languages such as OpenCL and CUDA is difficult, error-prone, and not performance-portable. Automatic parallelization and domain-specific languages (DSLs) have been proposed to hide complexity and regain performance portability. We present PENCIL, a rigorously defined subset of GNU C99, enriched with additional language constructs, that...
Nested thread-level parallelism (TLP) is pervasive in real applications. For example, 75% (14 out of 19) of the applications in the Rodinia benchmark for heterogeneous accelerators contain kernels with nested thread-level parallelism. Efficiently mapping the enclosed nested parallelism to the GPU threads in the C-to-CUDA compilation (OpenACC in this paper) is becoming more and more important. This...
The polyhedral model is a powerful algebraic framework that has enabled significant advances to analysis and transformation of sequential affine (sub)programs, relative to traditional AST-based approaches. However, given the rapid growth of parallel software, there is a need for increased attention to using polyhedral frameworks to optimize explicitly parallel programs. An interesting side effect of supporting...
In this paper we present a machine-learning approach to predict the total communication time of parallel applications. Communication time is heavily dependent on a very wide set of parameters relevant to the architecture, runtime configuration and application communication profile. We focus our study on parameters that can be easily extracted from the application and the process mapping ahead of execution...
With the introduction of low-power System on a Chip (SoC) processor architectures in enterprise server configurations, there is a growing need to develop software that supports the scale-out, data-intensive cloud applications deployed in data centers today. In this paper, we describe the design and implementation of a low-latency, user-space, fully compliant TCP/IP socket stack on a low...
This paper presents an SSD-based Block I/O Scheduler (SBIOS). SBIOS fully exploits the internal parallelism of SSDs to improve system performance. It dispatches read requests to different blocks to make full use of that parallelism. For write requests, it tries to dispatch them to the same block to alleviate the block cross penalty and garbage collection overhead. The evaluation...
Intelligent GPU cache bypassing can improve the efficiency of using GPU memory bandwidth, which can benefit GPU performance. In this paper, we study a pure hardware-based GPU cache bypassing method that can be applied to GPU applications without having to recompile the programs. Moreover, we introduce a hybrid method that can exploit profiling information to further enhance the hardware-based bypassing...
Next-generation embedded systems will massively adopt on-chip many-core architectures to provide both performance and energy efficiency. This trend will definitely establish the convergence of embedded computing and high-performance computing. In such a context, one major design challenge will concern the choice of adequate architecture parameters given system requirements. Moreover, it will affect...
Scene classification for high-resolution remotely sensed imagery has been widely investigated in recent years. However, there are few public, widely accepted, large-scale datasets for benchmarking different methods. This paper presents a new, large dataset consisting of 5000 high-resolution remote sensing images that are manually labeled into 20 semantic classes for scene classification. Each class...
Although Cloud infrastructures can be used as High Performance Computing (HPC) platforms, many issues arising from virtualization overhead have kept the two apart. However, with the advent of container-based virtualizers, this scenario acquires new perspectives, because this technique promises to decrease the virtualization overhead and achieve near-native performance. In this work, we analyzed the performance...
Two nonlinear methods for producing short-term spatio-temporal wind speed forecasts are presented. From the relatively new class of kernel methods, a kernel least mean squares algorithm and a kernel recursive least squares algorithm are introduced and used to produce 1 to 6 hour-ahead predictions of wind speed at six locations in the Netherlands. The performance of the proposed methods is compared to...
In this paper, we present a system for sketch classification and similarity search. We used deep convolutional neural networks (ConvNets), the state of the art in the field of image recognition. They enable both classification and medium/high-level feature extraction. We make use of ConvNet features as a basis for similarity search using k-Nearest Neighbors (kNN). Evaluations are performed on the TU-Berlin...
In this paper, we propose a CPU allocation technique that uses adaptive reservation to solve the problem of running volunteer applications on a system with the Completely Fair Scheduler (CFS). Our allocation technique works across user boundaries without requiring administrative privileges. We implemented and evaluated our technique on a Linux-based system with the CFS. Our technique could mitigate performance...
This article presents the design of a dynamically reconfigurable hybrid multiprocessor system on a chip (SoC), in which individual reconfiguration partitions (RPs) are time-multiplexed according to the demands of a task. Scheduling of the RPs is designed to be handled by a modified Linux kernel. The design is partially implemented on an experimental platform, has been tested with multiple benchmarks, and will be extended in the future.
Modern operating system kernels, such as Linux, address the trade-off between portability and performance by exposing a generic interface to user space programs, while maintaining architecture-dependent functionality as a set of separate components inside the kernel space. In particular, performance can only be achieved by ensuring that the architecture-dependent code takes advantage of the facilities...
Linux has supported transparent huge pages since version 2.6.38 and can map huge pages automatically. However, this implementation fails to adjust to page alignment in memory allocation and thus cannot use huge pages in some situations. The design is not efficient. Our work aims to increase huge page allocation, so as to improve the utilization ratio of huge pages and overall performance. The experimental results show...
Cache memories have been introduced in recent generations of Graphics Processing Units (GPUs) to benefit general-purpose computing on GPUs (GPGPUs). In this work, we analyze the memory access patterns of GPGPU applications and propose a cost-effective profiling-based method to identify the data accesses that should bypass the L1 data cache to improve performance. The evaluation indicates that the...
Unified Memory is an emerging technology supported by CUDA 6.X. Before CUDA 6.X, the CUDA programming model relied on programmers to explicitly manage data between CPU and GPU, which increased programming complexity. CUDA 6.X introduces Unified Memory to provide a new programming model that defines CPU and GPU memory space as a single coherent...
The popular and diverse hardware accelerator ecosystem makes apples-to-apples comparisons between platforms rather difficult. SPEC ACCEL tries to offer a yardstick to compare different accelerator hardware and software ecosystems. This paper uses this SPEC benchmark to compare an AMD GPU, an NVIDIA GPU and an Intel Xeon Phi with respect to performance and energy consumption. It also provides observations...
Heterogeneous computing, which combines devices with different architectures, is rising in popularity, and promises increased performance combined with reduced energy consumption. OpenCL has been proposed as a standard for programming such systems, and offers functional portability. It does, however, suffer from poor performance portability: code tuned for one device must be re-tuned to achieve good...