Tensor decomposition, the higher-order analogue to singular value decomposition, has emerged as a useful tool for finding relationships in large, sparse, multidimensional data sets. As this technique matures and is applied to increasingly larger data sets, the need for high performance implementations becomes critical. In this work, we perform an objective empirical evaluation of three popular parallel...
NERSC has partnered with 20 representative application teams to evaluate performance on the Xeon-Phi Knights Landing architecture and develop an application-optimization strategy for the greater NERSC workload on the recently installed Cori system. In this article, we present early case studies and summarized results from a subset of the 20 applications highlighting the impact of important architecture...
This paper presents a detailed performance evaluation of a spanning-tree-based algorithm that automatically exploits parallelism and determines the execution order of multiple kernel programs in a distributed environment. In stream-based computing, efficient parallel execution requires careful scheduling of the invocation of the kernel programs. By mapping a kernel to a node and an I/O stream between...
MapReduce is a popular programming model used to process large amounts of data by exploiting parallelism. Open-source implementations of MapReduce such as Hadoop are generally best suited for large, homogeneous clusters of commodity machines. However, many businesses cannot afford to invest in such infrastructure and others are reluctant to use cloud services due to data security and privacy concerns...
OpenCL is designed as a parallel programming framework to support heterogeneous computing platforms. The implicit or explicit parallelism in OpenCL kernel code enables efficient FPGA implementation from a high-level programming abstraction. However, FPGA architecture is completely different from GPU architecture, for which OpenCL is widely used. Tuning OpenCL codes to achieve high performance on FPGAs...
In this paper, we introduce memos, which integrates suitable memory management policies and schedules resources over the entire memory hierarchy in hybrid memory systems. Powered by an OS kernel-level monitoring tool, memos captures memory patterns online and then leverages them to guide memory page placement and data mapping. Experimental results show that, on average, memos can benefit memory utilization,...
This paper presents a study of the parallelization of a Support Vector Machine (SVM) classifier with a linear kernel running on a Massively Parallel Processor Array (MPPA) platform. The platform combines 256 cores working in parallel, grouped into 16 clusters. The main objective of the research has been to develop an optimal implementation of the SVM classifier on an MPPA platform whilst...
In combinatorial optimization problems, the neighborhood search (NS) is a fundamental component for local search based heuristics. It consists of selecting a solution from a high cardinality set of neighbor solutions, by means of operations called moves. To perform this search, NS algorithms usually adopt two main approaches: selecting the first or best improving move. The Multi Improvement (MI) strategy...
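The two classical move-selection rules named in this abstract can be illustrated with a minimal sketch. It is not the paper's implementation and does not show the Multi Improvement (MI) strategy; the toy tour problem, the swap move, and all function names (`tour_cost`, `swap_neighbors`, `first_improving`, `best_improving`) are assumptions chosen only for illustration.

```python
from itertools import combinations


def tour_cost(tour, dist):
    """Length of the closed tour under the distance matrix `dist`."""
    n = len(tour)
    return sum(dist[tour[i]][tour[(i + 1) % n]] for i in range(n))


def swap_neighbors(tour):
    """Yield every neighbor reachable by one swap move (exchange two positions)."""
    for i, j in combinations(range(len(tour)), 2):
        neighbor = list(tour)
        neighbor[i], neighbor[j] = neighbor[j], neighbor[i]
        yield neighbor


def first_improving(tour, dist):
    """Return the first neighbor that is strictly better, or None at a local optimum."""
    current = tour_cost(tour, dist)
    for neighbor in swap_neighbors(tour):
        if tour_cost(neighbor, dist) < current:
            return neighbor
    return None


def best_improving(tour, dist):
    """Scan the whole neighborhood and return the best strictly better neighbor, or None."""
    current = tour_cost(tour, dist)
    best = min(swap_neighbors(tour), key=lambda t: tour_cost(t, dist), default=None)
    if best is not None and tour_cost(best, dist) < current:
        return best
    return None


# Hypothetical toy instance: five cities placed on a line at positions 0..4.
dist = [[abs(a - b) for b in range(5)] for a in range(5)]
start = [0, 2, 1, 3, 4]
print(tour_cost(start, dist), first_improving(start, dist), best_improving(start, dist))
```

First improvement stops as soon as any better neighbor is found, so it evaluates fewer moves per step; best improvement always evaluates the full neighborhood before committing to a move.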
This paper proposes a profiling-based method to extract a task graph, which describes the system behavior of a multiprocessor system-on-chip running Android OS. The proposed method computes the resource usage of each task and extracts dependencies among tasks using run-time system profiling results. The proposed method calculates the CPU resource usage and I/O waiting time of each task by analyzing CPU...
The lack of support for explicit synchronization between the streaming multiprocessors (SMs) in GPUs adversely impacts their ability to perform inter-block communication efficiently. In this paper, we present several approaches to inter-block synchronization using explicit/implicit CPU-based and dynamic parallelism (DP) mechanisms. Although this topic has been addressed in previous...
The ability to execute the original source code for network protocols and applications within a network simulation environment frees the simulation modeler from the time consuming task of having to create, test and debug models representing these applications. This work extends the functionality of the Direct Code Execution (DCE) framework of ns-3 by incorporating the ability to call NVIDIA CUDA kernels...
Computing platforms for high performance and parallel applications have changed rapidly during the past few years, from single to multiple cores, and from traditional Central Processing Units (CPUs) to hybrid systems which combine CPUs with accelerators such as Graphics Processing Units (GPUs), Intel Xeon Phi, etc. These developments bring more and more challenges to application developers, especially...
An algorithm based on particle filters is employed to track moving objects in video streams from fixed and non-fixed cameras. Particle weighting is based on color histograms computed in the iHLS color space. Particle computations are parallelized with the CUDA framework. The algorithm was tested on various GPU devices: a desktop GPU card, a mobile chipset and two embedded GPU platforms. The processing...
The Graph BLAS effort to standardize a set of graph algorithm building blocks in terms of linear algebra primitives promises to deliver high-performing graph algorithms and greatly impact the analysis of big data. However, there are challenges with this approach, which our data analytics miniapp miniTri exposes. In this paper, we improve upon a previously proposed task-parallel approach to linear...
The High Efficiency Video Coding (HEVC) standard provides higher compression efficiency than other video coding standards but at the cost of increased computational load, which makes it hard to achieve real-time encoding/decoding of high-resolution, high-quality video sequences. In this paper, we investigate how Graphics Processing Units (GPUs) can be employed to accelerate HEVC decoding. GPUs are...
This article presents and assesses a massively parallel processing model for basic operations with k-mers from genomic sequences, based on functions defined in terms of N-dimensional spaces. The model is implemented using a set of OpenCL cores available at github.com/bioinfud/k-merscl and assessed using a heterogeneous CPU/GPU platform and a dataset based on randomly generated k-mers. The results...
Programming FPGAs has been an arduous task that requires extensive knowledge of hardware design languages (HDLs), such as Verilog or VHDL, and low-level hardware details. With OpenCL support for FPGAs, the design, prototyping and implementation of an FPGA is increasingly moving towards a much higher level of abstraction, when compared to the intrinsically low-level nature of HDLs. On the other hand,...
Sparsity-constrained nonnegative matrix factorization (NMF) has proven to be an effective method for hyperspectral unmixing. However, the optimization procedure of sparsity-constrained NMF is computationally demanding, which may limit its application in time-constrained conditions. In this paper, a parallel L1/2 sparsity-constrained NMF unmixing method on Graphics Processing Units (GPUs) is proposed,...
OpenMP enables productive software development that targets shared-memory general purpose systems. However, OpenMP compilers today have little support for future heterogeneous systems — systems that will more than likely contain Field Programmable Gate Arrays (FPGAs) to compensate for the lack of parallelism available in general purpose systems. We have designed a high-level synthesis flow that automatically...
Deep neural network algorithms show very high performance; however, their increased arithmetic and memory-access demands hinder their adoption in embedded systems. This paper explores a programmable neural network processing architecture that can efficiently execute feed-forward, recurrent, and convolutional deep neural networks. The neural network algorithms are transformed to matrix-vector multiplication...