A little over a decade ago, Goto and van de Geijn wrote about the importance of the treatment of the translation lookaside buffer (TLB) on the performance of matrix multiplication. Crucially, they did not say how important, nor did they provide results that would allow the reader to make his own judgement. In this paper, we revisit their work and look at the effect on the performance of their algorithm...
We present a code optimization technique by adapting an auto-tuning (AT) function to an explicit method with the static code generator FIBER. The AT function is evaluated with current multicore processors to match situations with high-thread parallelism (HTP). The results of performance evaluations indicate that the AT function is crucial for HTP, as the speedups of the explicit method with a static...
As power and energy consumption have become the key design constraint of mobile systems, mobile system-on-chip (SoC) architects have dedicated a progressively larger area budget to custom accelerators: graphics processors, audio/video codecs, and image signal processors abound. Fixed-function accelerators now occupy more than half of the die area of these chips [2], and we foresee this trend only...
Heterogeneous multi-core processors have strong potential for performance improvement, energy efficiency and area efficiency compared to homogeneous multi-core processors. Existing methods of execution migration for heterogeneous multi-core processors suffer in efficiency, cost, compatibility, or programmability. In this paper, we propose a HW/SW co-design migration method based on binary-instrumentation...
This paper presents a parallel method for EBGM face recognition. Compared with other methods such as principal component analysis (PCA) and linear discriminant analysis (LDA), EBGM has the advantage of higher accuracy, but it requires more computation time and memory, which makes it less practical. We propose a parallel method for EBGM by balancing the unit of images. We distribute the...
Cluster-based multiprocessor scheduling can be seen as a hybrid approach combining the benefits of both partitioned and global scheduling. Virtual clustering further enhances it by providing dynamic cluster resource allocation and applying hierarchical scheduling techniques. Over the years, the study of virtual cluster scheduling has been limited to theoretical analysis. In this paper, we present our...
Advances in formal software verification have produced an operating system that is mathematically guaranteed to be correct and to enforce access isolation. Such an operating system could potentially consolidate safety- and security-critical software on a single device where previously multiple devices were used. One of the barriers to consolidation on commodity hardware is the lack of hardware dependability...
Systolic arrays offer a very attractive, data-centric execution model as an alternative to the von Neumann architecture. Hardware implementations of systolic arrays turned out not to be viable solutions in the past. This article shows how the systolic design principles can be applied to a software solution to deliver an algorithm with unprecedented strong scaling capabilities. Systolic array for...
SIMD architectures, comprising both scalar and parallel units, have been widely used in media processors. To further improve performance, much effort has been made to enhance the design of both units, while little attention has been paid to the relationship between the units. This paper demonstrates that a dynamic coupling mechanism, which can dynamically transform the scalar and parallel...
3D FFT is a very data- and compute-intensive kernel encountered in many applications. We report a high-performance design and implementation of 3D FFT on a CGRA which supports partial reconfiguration. The hardware-software multi-clock design uses dynamic reconfiguration to reduce the required communication bandwidth and achieves a sustained throughput of 40 GOPS on a word size of 48 bits. Performance...
In this paper, we present a reconfigurable multiprocessor architecture for volume rendering. The multiprocessor consists of sixteen reconfigurable processors to exploit the data parallelism of volume rendering. Each processor has a VLIW core and a reconfigurable coarse-grained array, specialized for the control-intensive and data-intensive parts of the program, respectively. The coarse-grained array can be configured dynamically,...
In this paper we investigate the energy efficiency of processors based on ARM Cortex-A9 cores for scientific numerical applications. We study the performance for a few numerical kernels which appear in a larger set of scientific applications. From power measurements that were performed on different platforms we estimate the energy consumed when executing these kernels.
Real-time forensic reconstruction of a process's memory and interaction history is impractical in modern computing environments because the volume of data processed by a typical server is immense. Having this information would speed the search for zero-day exploits and designate precisely which system components could have been affected by an intrusion. Unfortunately, it may be several months after...
This paper implements basic computational kernels of scientific computing, such as matrix-vector product, matrix product and Gaussian elimination, on multi-core platforms using several parallel programming tools. Specifically, these tools are Pthreads, OpenMP, Intel Cilk++, Intel TBB, Intel ArBB, SMPSs, SWARM and Fast Flow. The aim of this paper is to present a unified quantitative and qualitative...
Nowadays, embedded systems handle more data than ever before, and the size of the data they handle can be expected to keep increasing. Ordinarily, these complicated requirements are met by adopting an OS (operating system) kernel in the system. Improving the performance of the OS kernel's data processing is therefore meaningful for many embedded solutions. To achieve this improvement, we...
OpenCL is an industry attempt to unify heterogeneous multicore programming. With its programming model defining SPMD kernels, vector types, and address space qualifiers, OpenCL allows programmers to exploit data parallelism with multicore processors and SIMD instructions, as well as data locality with the memory hierarchy. Recently, OpenCL has gained success on many architectures, including multicore...
Modern embedded MPSoC designs increasingly couple hardware accelerators to processing cores to trade between energy efficiency and platform specialization. To assist effective design of such systems there is the need on one hand for clear methodologies to streamline accelerator definition and instantiation, on the other for architectural templates and run-time techniques that minimize processors-to-accelerator...
I/O devices are evolving rapidly, while OS optimization is always slower because of its dependence on physical devices. This inevitably prevents the latest devices from reaching their rated performance, which remains a big problem for performance-critical applications. Though I/O device simulators can help carry out performance evaluation before physical devices are ready, the existing simulator...
In response to the increasing ubiquity of multicore processors, there has been widespread development of multithreaded applications that strive to realize their full potential. Unfortunately, lock contention within operating systems can limit the scalability of multicore systems so severely that an increase in the number of cores can actually lead to reduced performance (i.e. scalability collapse)...
In recent years, improving energy efficiency and computational performance has become a major issue as smartphones and tablets have become popular. To achieve high performance, multi-core accelerators consisting of general-purpose processors and accelerators are often used. But to use these multi-core accelerators efficiently, programmers have to consider synchronization and data transfer between...