In this paper, we present our work to enable optimized one-sided communication operations on the ARM v8 architecture using a high-performance InfiniBand network interconnect, as well as an evaluation of our implementation. For this study, we started with an OpenSHMEM implementation based on Open MPI/SHMEM, and combined it with the UCX framework and the XPMEM kernel extension for shared memory communication...
Technological advancements have created a need for teaching GPU computing effectively. This need is driven by the increasing use of parallel computing and by the growing use of GPUs for computationally intensive tasks. This paper addresses that need by describing a semester-long course on CUDA programming. The course has significant...
Big Data expressed as "Big Graphs" is growing in importance. Looking forward, there is also increasing interest in streaming versions of the associated analytics. This paper develops an initial template for the relationship between "traditional" batch graph problems and streaming forms. Variations of streaming problems are discussed, along with their relationship to existing...
Chapel is an emerging scalable, productive parallel programming language. In this work, we analyze Chapel's performance using The Parallel Research Kernels on two different manycore architectures including a state-of-the-art Intel Knights Landing processor. We discuss implementation techniques in Chapel and their relation to the OpenMP implementations of the PRK. We also suggest and prototype several...
Graph algorithms play an important role in several fields of science and engineering. Prominent among them are the All-Pairs-Shortest-Paths (APSP) and related problems. Indeed, there are several efficient implementations for such problems on a variety of modern multi- and many-core architectures. For several graph problems, however, parallelism offers only limited success as current...
Reconfigurable datapaths can be used to implement multiple applications on the same hardware. Switching between applications can be realized by loading new configuration information into the datapath. In this contribution, we want to use such datapaths for high frequency event processing. We have developed the toolset ReEP, which takes multiple problem descriptions and superposes them into one reconfigurable...
Deep Learning (DL) algorithms have become ubiquitous in data analytics. As a result, major computing vendors — including NVIDIA, Intel, AMD and IBM — have architectural road-maps influenced by DL workloads. Furthermore, several vendors have recently advertised new computing products as accelerating DL workloads. Unfortunately, it is difficult for data scientists to quantify the potential of these...
Some of the newer processor architectures are no longer based on registers in order to increase their potential of instruction-level parallelism. Instead, they expose their data paths to the compiler so that the program is able to directly move data values between function units using suitable instructions. Some of these architectures require a synchronous transfer of data values while others use...
Field Programmable Gate Arrays (FPGAs) are usually perceived as difficult to exploit due to the high level of expertise required to program them. In recent years, the major FPGA vendors have produced different High Level Synthesis (HLS) tools to help programmers accelerate their algorithms on the hardware architecture. However, these tools often use languages considered...
In the field of high performance heterogeneous computing systems, field programmable gate arrays (FPGAs) have shown great advantages in terms of acceleration and energy efficiency. With the inclusion of the OpenCL framework for parallel programming, design complexity has also been greatly reduced. However, the parallel implementation of applications containing data-dependent branches usually experiences...
We describe an integration of the CHiLL compiler with OpenTuner to reduce the programmer burden in using autotuning. As a case study, we optimize the smooth operator and its associated stencil computations in the context of Geometric Multigrid (GMG), a hierarchical linear solver that operates at multiple grid resolutions (levels). Smooth is the most performance-critical operation that runs multiple...
We introduce Dynamic Dual Fixed Point (DDFX) CORDIC, which relies on run-time alteration of the numerical format of the Dual Fixed Point (DFX) CORDIC hardware. This allows for enhanced dynamic range and accuracy. Fixed Point, Dual Fixed Point, Floating Point, and Dynamic Dual Fixed Point CORDIC units are compared in terms of resources and accuracy. Results show that the hardware/software approach achieves...
Nature has proved to be a source of inspiration for engineering solutions. Spiking Neural Networks are exemplary from this perspective, because they can be used not only to simulate biological networks of neurons but also to work effectively as classifiers and artificial intelligence systems. Another interesting nature-inspired paradigm is Swarm Intelligence, mainly applied to optimization...
Determining key characteristics of High Performance Computing machines that allow users to predict their performance is an old and recurrent dream. This was, for example, the rationale behind the design of the LogP model that later evolved into many variants (LogGP, LogGPS, LoGPS, ...) to cope with the evolution and complexity of network technology. Although the network has received a lot of attention,...
A number of computational science algorithms lead to discretizations that require a large number of independent small matrix solves. Examples include small non-linear coupled chemistry and flow systems, one-dimensional sub-systems in climate and diffusion simulations and semi-implicit time integrators, among others. We introduce an approach for solving large quantities of independent banded matrix...
Scientists who want to exploit the computing power of the latest parallel architectures are faced with a diverse set of architectures and a number of programming languages, models and approaches. Among these are the directive-based programming models OpenMP and OpenACC. This paper explores the similarities and the functionality gaps between the two models and presents insights...
Today's supercomputers are moving towards deployment of many-core processors like Intel Xeon Phi Knights Landing (KNL), to deliver high compute and memory capacity. Applications executing on such many-core platforms with improved vectorization require high memory bandwidth. To improve performance, architectures like Knights Landing include a high bandwidth and low capacity in-package high bandwidth...
Performance and power consumption are key features for evaluating any processor design. In this paper, we pay close attention to the impact on power and energy consumption of a customized Instruction Set Architecture (ISA) designed by means of High Level Synthesis (HLS) tools. We compare these results against a full ISA soft processor, MicroBlaze. Our customized ISA processors greatly reduce the...
The use of reconfigurable chips such as FPGAs in embedded systems for many runtime applications is limited by long reconfiguration times. Techniques to circumvent this limitation rely on hardware task reuse, which preserves certain circuits on the chip. However, the frequent addition and removal of circuits while preserving others on the chip will inevitably lead to fragmentation of its area, in an...
Thanks to the availability of new biomedical technologies and analysis methodologies, the quality of clinical exams and medical research is increasing. These improvements have made it possible to analyze large amounts of data with a higher level of accuracy. Therefore, processors able to handle compute-intensive algorithms and large datasets are needed, and the use of homogeneous processors is...