Heterogeneous processing has lately gained popularity in the high-performance computing (HPC) area, and it appears to have great potential for future data centers. In this regard, accelerators such as GPUs and the Intel Xeon Phi have already started to play a significant role in HPC systems, offering a high degree of parallelism to application developers. Furthermore, hardware virtualization is gaining...
This paper presents the implementation of image edge detection on Heterogeneous System Architecture (HSA). The HSA platform, which includes an ARM processor, a coprocessor, and an FPGA, is compared with an x64 CPU in terms of performance and power consumption. The experimental results show that although the x64 CPU achieves the best execution time, HSA is 50 times more energy efficient. Also, HSA can exploit coprocessors and...
We designed and implemented Remote Inter-Processor Communication (RIPC) architecture software on Xeon Phi coprocessors and built a testbed to verify it. We also implemented a lightweight kernel, along with RIPC transmitter and receiver application threads that run on the lightweight kernel on Xeon Phi coprocessors. This paper proposes RIPC methods to communicate between user threads in separate Xeon Phi nodes using...
The Adapteva Epiphany MIMD architecture is a scalable 2D array of RISC cores with a fast network-on-chip (NoC) for parallel processing. The work presented here discusses the suitability of the architecture for software-defined radio (SDR) applications such as Finite Impulse Response (FIR) filters. This paper discusses the implementation of the Hilbert filter using the COPRTHR 2.0 SDK which...
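As a point of reference for the FIR filters mentioned in this abstract, the following is a minimal sketch of a direct-form FIR filter in plain Python. It is not the paper's Epiphany or COPRTHR implementation; the function name `fir_filter` and the example coefficients are illustrative assumptions only.

```python
def fir_filter(samples, taps):
    """Direct-form FIR filter (hypothetical helper, not from the paper).

    Computes y[n] = sum over k of taps[k] * x[n - k],
    treating x[n] as 0 for n < 0.
    """
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k, h in enumerate(taps):
            if n - k >= 0:  # skip samples before the start of the input
                acc += h * samples[n - k]
        out.append(acc)
    return out


# Feeding a unit impulse returns the taps themselves (the impulse response).
impulse_response = fir_filter([1.0, 0.0, 0.0, 0.0], [0.5, 0.25])
# A two-tap moving average applied to a constant signal.
averaged = fir_filter([1.0, 1.0, 1.0], [0.5, 0.5])
```

Each output sample depends only on a fixed window of input samples, which is one reason FIR filtering is often considered a natural fit for arrays of independent cores.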
The Parallella is a hybrid computing platform that came into existence as the result of a Kickstarter project by Adapteva. It is composed of the high-performance, energy-efficient manycore Epiphany chip (used as a co-processor) and one Zynq-7000 series chip, which normally runs a regular Linux OS version, serves as the main processor, and implements "glue logic" in its internal...
In the past few years nonlocal filters have emerged as a serious contender for denoising synthetic aperture radar (SAR) images, offering superior noise reduction and detail preservation compared to many other filters. In this manuscript we analyze how nonlocal filters, whose computational costs were so far prohibitive for large scale processing, can be implemented efficiently on graphics processing...
Using multiple streams can improve overall system performance by mitigating the data transfer overhead on heterogeneous systems. Prior work focuses largely on GPUs, but little is known about the performance impact on the Intel Xeon Phi. In this work, we apply multiple streams to six real-world applications on Phi. We then systematically evaluate the performance benefits of using multiple streams...
Our target in this work is to study ways of exploring the parallelism offered by vectorization on accelerators with very wide vector units. To this end, we implemented two kernels that derive from the Wilson Dslash operator and investigate several data layout techniques for increasing the scalability of lattice QCD scientific kernels suitable for the Intel Xeon Phi. In parts of the application where...
Flexibility and high efficiency are common design drivers in the embedded systems domain. Coarse-grained reconfigurable coprocessors can tackle these issues, but they suffer from complex design, debugging, and application mapping problems. In this paper, we propose an automated design flow that aids developers in designing and managing coarse-grained reconfigurable coprocessors. It provides both the hardware...
OpenACC is an application programming interface (API) that aims to unleash the power of heterogeneous systems composed of CPUs and accelerators such as graphics processing units (GPUs) or Intel Xeon Phi coprocessors. This directive-based programming model is intended to enable developers to accelerate their application's execution with much less effort. Coprocessors offer significant computing power...
FPGA-based reconfigurable computing is finding its way into a wide range of application areas in which high performance and low power consumption are paramount. However, FPGA-application development using hardware-description languages (HDLs) faces many productivity challenges that limit its wide adoption, including a steep learning curve and lengthy compilation. High-level synthesis (HLS) languages...
Aligning sequencing reads to a reference genome is often essential in comparative genomics pipelines. With the maturation of next-generation DNA sequencing (NGS) technologies, an enormous amount of sequence data has been generated, which calls for the development of faster read alignment programs. In this paper we present an OpenCL implementation of the short read aligner BarraCUDA [1], which...
Sequence analysis plays a critical role in bioinformatics, and most of its applications have compute-intensive kernels that consume over 70% of total execution time. By exploiting the compute-intensive execution stages of popular sequence analysis applications, we present and evaluate a VLSI architecture with a focus on those that target biological sequences directly, including pairwise alignment,...
This article presents the design of a dynamically reconfigurable hybrid multiprocessor system on a chip (SoC), where individual reconfiguration partitions (RPs) are time-multiplexed according to the demands of a task. Scheduling of the RPs is designed to be done by a modified Linux kernel. The design is partially implemented on the experimental platform, has been tested with multiple benchmarks, and will be extended in the future.
The present work implements OpenCL solvers based on the FGMRES and preconditioned BCGSTAB algorithms. These solvers are integrated in a 3-D simulation tool for nanoscaled MOSFET transistors. Simulations are launched on two different devices: NVIDIA Tesla S2050 and Intel Xeon Phi 3120A. The resulting execution times are compared against the optimized PSPARSLIB version of the FGMRES solver...
Popular accelerator programming models rely on offloading computation operations and their corresponding data transfers to the coprocessors, leveraging synchronization points where needed. In this paper we identify and explore how such a programming model enables optimization opportunities not utilized in traditional checkpoint/restart systems, and we analyze them as the building blocks for an efficient...
Ever since accelerators and coprocessors became the mainstream hardware for throughput-oriented HPC workloads, various programming techniques have been proposed to increase productivity in terms of both the performance and ease-of-use. We evaluate these aspects of OpenCL on a number of hardware platforms for an important subset of dense linear algebra operations that are relevant to a wide range of...
In the present article we describe the implementation of the finite element numerical integration algorithm for the Xeon Phi coprocessor. The coprocessor is an extension of the idea of the many-core specialized unit for calculations and, by assumption, its performance has to be competitive with the current families of GPUs. Its main advantage is the built-in set of 512-bit vector registers and the...
Pattern libraries are important tools for high-productivity application development. Their struggle for best performance is complicated by the fact that they are used to execute user-provided code, which is not known during their creation. This makes pattern libraries good candidates for automatic software tuning. In this paper, we deal with automatic online parameter tuning of the HyPHI hybrid pattern...