CNNs (Convolutional Neural Networks) have demonstrated superior results in a wide range of applications. However, the time-consuming convolution operations required by CNNs pose great challenges to designers. GPGPUs (General-Purpose Graphics Processing Units) have been widely used to exploit the massive parallelism of convolution operations. This paper proposes a software-based loop-unrolling technique...
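The abstract is truncated, so the paper's GPU-specific technique is not shown here; the following is only a minimal sketch of the general loop-unrolling idea applied to a 1-D convolution, with hypothetical function names:

```python
def conv1d_naive(signal, kernel):
    # Straightforward convolution: one multiply-add per inner iteration.
    n, k = len(signal), len(kernel)
    out = [0.0] * (n - k + 1)
    for i in range(len(out)):
        acc = 0.0
        for j in range(k):
            acc += signal[i + j] * kernel[j]
        out[i] = acc
    return out

def conv1d_unrolled4(signal, kernel):
    # Same computation with the inner loop unrolled by a factor of 4,
    # reducing loop-control overhead (assumes kernel length % 4 == 0).
    n, k = len(signal), len(kernel)
    assert k % 4 == 0
    out = [0.0] * (n - k + 1)
    for i in range(len(out)):
        acc = 0.0
        for j in range(0, k, 4):
            acc += (signal[i + j]     * kernel[j]
                  + signal[i + j + 1] * kernel[j + 1]
                  + signal[i + j + 2] * kernel[j + 2]
                  + signal[i + j + 3] * kernel[j + 3])
        out[i] = acc
    return out
```

On a GPU the same transformation is typically applied inside each thread's loop body, where fewer branches and more independent multiply-adds help the compiler schedule instructions.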
In state-of-the-art Software Transactional Memory (STM) systems, threads carry out the execution of transactions as non-interruptible tasks. Hence, a thread can react to the injection of a higher priority transactional task and take care of its processing only at the end of the currently executed transaction. In this article we pursue a paradigm shift where the execution of an in-memory transaction...
In this paper we examine separate aspects of the design and analysis of quantum circuit specifications on computing platforms with parallelism support. The idea of multi-threaded programming for quantum circuit synthesis and selection is presented, taking into consideration the subsequent use of the Linear Nearest Neighbor (LNN) approach. Our experience on several multi-core platforms is reported.
Stencil computation arises from a broad set of scientific and engineering applications and often plays a critical role in the performance of extreme-scale simulations. Due to its memory-bound nature, it is a challenging task to optimize stencil computation kernels on modern supercomputers, which offer relatively high computing throughput but relatively low data-moving capability. This work serves as...
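For readers unfamiliar with the term, a stencil updates each grid point from a fixed neighborhood of points; the memory-bound character comes from touching several neighbors per point while doing only a few arithmetic operations. A minimal 1-D example (not the paper's kernel):

```python
def jacobi_step(u):
    # One sweep of a 3-point stencil: each interior point becomes the
    # average of itself and its two neighbors; boundary values are fixed.
    # Three loads per point for three flops -- a low arithmetic intensity
    # that makes performance bound by memory bandwidth.
    v = u[:]
    for i in range(1, len(u) - 1):
        v[i] = (u[i - 1] + u[i] + u[i + 1]) / 3.0
    return v
```

Real kernels are 2-D or 3-D and are optimized with tiling and blocking to reuse loaded neighbors from fast memory, which is exactly the data-movement problem the abstract refers to.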
This paper presents an extension of the Valgrind framework for dynamic binary code analysis to support the MIPS MSA instruction set, which includes instructions for vector (SIMD) processing of integer and floating-point data of different widths. First, background on MIPS and its MSA extension is given. Then, Valgrind's features for code instrumentation are described. Several changes have been made to Valgrind...
Data movement is increasingly becoming the bottleneck of both performance and energy efficiency in modern computation. Until recently, there was limited freedom for communication optimization on GPUs, as conventional GPUs only provided two methods for inter-thread communication: using shared memory or global memory. However, a new warp shuffle instruction has been introduced...
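Warp shuffle lets threads in the same warp exchange register values directly, without a round trip through shared memory. A common use is the butterfly reduction; the sketch below models that communication pattern in plain Python (lane indices standing in for threads, as with CUDA's `__shfl_xor_sync`), and is an illustration of the pattern, not of the paper's optimizations:

```python
def butterfly_reduce(lanes):
    # Butterfly (XOR) reduction across a warp: at each step, every lane
    # adds the value held by the lane whose index differs in one bit.
    # After log2(width) steps, every lane holds the full sum.
    vals = lanes[:]
    width = len(vals)          # assumed a power of two, e.g. a 32-thread warp
    offset = width // 2
    while offset > 0:
        vals = [vals[i] + vals[i ^ offset] for i in range(width)]
        offset //= 2
    return vals
```

Because each exchange is register-to-register, this pattern avoids the shared-memory traffic and synchronization that a conventional tree reduction would require.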
The capability of GPUs to accelerate general-purpose applications that can be parallelized into a massive number of threads makes it promising to apply GPUs to real-time applications as well, where high throughput and intensive computation are also needed. However, due to the different architecture and programming model of GPUs, the worst-case execution time (WCET) analysis methods and techniques designed...
A new approach to introducing computer architecture in introductory logic design courses based on Babbage's difference engine is described. The difference engine is the thread that students follow from logic design to the concepts associated with a basic load/store microprocessor architecture.
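The pedagogical appeal of the difference engine is that it tabulates any polynomial using only repeated addition, which maps naturally onto registers and adders. A small sketch of the method of finite differences (hypothetical function name, not from the course materials):

```python
def difference_engine(initial, steps):
    # Tabulates a polynomial using only additions, as Babbage's engine does.
    # `initial` holds f(0) followed by its leading finite differences;
    # each step updates the registers top-down, so each register adds the
    # not-yet-updated difference below it, then emits the new f value.
    regs = initial[:]
    table = [regs[0]]
    for _ in range(steps):
        for i in range(len(regs) - 1):
            regs[i] += regs[i + 1]
        table.append(regs[0])
    return table
```

For f(x) = x², the initial registers are [f(0), Δf(0), Δ²f(0)] = [0, 1, 2], and the engine produces 0, 1, 4, 9, 16, ... with no multiplications — a concrete bridge from adder circuits to a load/store datapath.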
Today's supercomputers are moving towards deployment of many-core processors like Intel Xeon Phi Knights Landing (KNL), to deliver high compute and memory capacity. Applications executing on such many-core platforms with improved vectorization require high memory bandwidth. To improve performance, architectures like Knights Landing include a high bandwidth and low capacity in-package high bandwidth...
Many of today's important everyday applications, e.g. weather forecasting, the design of plane and car shapes, medical analysis, or even search-engine queries, depend on massively parallel computer programs that are executed in data centers hosting thousands of computers. A large amount of electrical energy is used to power them, and it is of primary importance to compute more efficiently to sustain...
MSA (Multiple Sequence Alignment) is a challenging task in the bioinformatics domain. Many methods are available to align the input sequences, and each method produces a different output for the same input. Among the many alignment tools, MSAProbs is one of the fastest available. However, the tool is not optimal: it contains defects that limit its efficiency. To optimize MSAProbs...
K-means is a compute-intensive iterative algorithm; each iteration consists of two steps: data assignment and recalculation of the K centroids. In order to accelerate the compute-intensive portions of k-means, the data assignment and centroid recalculation steps are offloaded to the GPU in parallel. Only the initialization and convergence-test steps are performed by the CPU. In addition, this new version...
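The two per-iteration steps described above can be sketched as follows — a minimal 1-D version with a hypothetical function name, showing the structure that is offloaded to the GPU rather than the GPU implementation itself:

```python
def kmeans_step(points, centroids):
    # Step 1 (data assignment): each point goes to its nearest centroid.
    # This is embarrassingly parallel -- one thread per point on a GPU.
    assign = [min(range(len(centroids)),
                  key=lambda c: abs(p - centroids[c])) for p in points]
    # Step 2 (centroid recalculation): each centroid becomes the mean of
    # its assigned points (kept unchanged if no points were assigned).
    new_centroids = []
    for c in range(len(centroids)):
        members = [p for p, a in zip(points, assign) if a == c]
        new_centroids.append(sum(members) / len(members) if members
                             else centroids[c])
    return assign, new_centroids
```

The CPU-side driver would call this in a loop until the assignments (or centroids) stop changing — the convergence test the abstract keeps on the host.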
Optimization, portability and development of GPGPU applications are not trivial tasks, since the capabilities and organization of GPU processing elements and memory subsystem greatly differ from the traditional CPU concepts, as well as among different GPU architectures. This work goes a step further in aiding this process by delivering a set of visual models that can be used by GPU programmers to...
Pre-silicon simulation is one of the key toolsets for computer architects to evaluate and optimize their future designs. As Graphics Processing Units (GPUs) have become the platform of choice in many computing communities due to their impressive processing capabilities, computer architecture researchers need a simulation framework that allows them to quantitatively consider design tradeoffs. In this...
Modern GPUs feature complex memory system designs. One GPU may contain many types of memory of different properties. The best way to place data in memory is sensitive to many factors (e.g., program inputs, architectures), making portable optimizations of GPU data placement a difficult challenge. PORPLE is a recently proposed method that overcomes the difficulties by enabling online optimizations of...
As process technology scales, electronic devices become more susceptible to soft errors induced by radiation. The stack in memory implements procedure calls, and its behavior under soft errors has not yet been studied. To analyze the effects of soft errors on stack behavior, we conduct a series of fault-injection experiments on the IA-32 instruction set architecture. The injection targets are the...
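The core operation of such a campaign is the injected fault itself — typically a single-event upset modeled as one flipped bit in a stored value. A minimal sketch of that primitive (hypothetical helper, not the paper's injection harness):

```python
def inject_bit_flip(value, bit, width=32):
    # Model a single-event upset: flip one bit of a stored word
    # (e.g. a register, return address, or stack slot), masked to
    # the machine word width so the result stays representable.
    return (value ^ (1 << bit)) & ((1 << width) - 1)
```

A campaign then runs the workload many times, each run flipping a randomly chosen bit at a randomly chosen point, and classifies the outcome (masked, wrong result, crash) — flipping the same bit twice restores the original value, which is why XOR is the natural model.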
The IBM Power9 processor has an enhanced core and chip architecture that provides superior thread performance and higher throughput. The core and chip architectures are optimized for emerging workloads to support the needs of next-generation computing. Multiple variants of silicon target the scale-out and scale-up markets. With a new core microarchitecture design, along with an innovative I/O fabric...
Dynamic Binary Translation (DBT) is often used in hardware/software co-design to take advantage of one architecture model while using binaries from another. The co-development of the DBT engine and of the execution architecture leads to architectures with special support for these mechanisms. In this work, we propose a hardware-accelerated Dynamic Binary Translation where the first steps of the DBT...
We design and implement Mars, a MapReduce framework, on graphics processors (GPUs). MapReduce is a distributed programming framework originally proposed by Google for the ease of development of web search applications on a large number of commodity CPUs. Compared with CPUs, GPUs have an order of magnitude higher computation power and memory bandwidth, but are harder to program since their architectures...
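The programming model Mars implements can be summarized in a few lines — a sketch of MapReduce itself (word count as the usual example), not of the Mars GPU runtime:

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    # Minimal MapReduce skeleton: map each record to (key, value) pairs,
    # group values by key, then reduce each group independently.
    groups = defaultdict(list)
    for rec in records:
        for key, val in map_fn(rec):
            groups[key].append(val)
    return {k: reduce_fn(vs) for k, vs in groups.items()}

counts = map_reduce(["a b a", "b c"],
                    lambda line: [(w, 1) for w in line.split()],
                    sum)
# counts == {"a": 2, "b": 2, "c": 1}
```

The appeal for GPUs is that both the map phase and each key's reduce are independent, so they parallelize across thousands of threads; the hard part, which frameworks like Mars address, is the grouping stage on hardware without dynamic memory allocation in kernels.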
The integrated architecture that features both CPU and GPU on the same die is an emerging and promising architecture for fine-grained CPU-GPU collaboration. However, the integration also brings forward several programming and system optimization challenges, especially for irregular applications. The complex interplay between heterogeneity and irregularity leads to very low processor utilization of...