An approximate kernel for the discrete cosine transform (DCT) of length 4 is derived from the 4-point DCT defined by the High Efficiency Video Coding (HEVC) standard and used for the computation of DCT and inverse DCT (IDCT) of power-of-two lengths. There are two reasons for considering the DCT of length 4 as the basic module. First, it allows computation of DCTs of lengths 4, 8, 16, and 32 prescribed...
RDMA (Remote Direct Memory Access) is a technology that enables user applications to perform direct data transfer between the virtual memory of processes on remote endpoints, without operating system involvement or intermediate data copies. Achieving zero intermediate data copies using RDMA requires specialized network interface hardware. Software RDMA drivers emulate RDMA semantics in software to...
Due to their flexibility and high performance, Coarse-Grained Reconfigurable Arrays (CGRAs) are a topic of increasing research interest. However, CGRAs also have the potential to achieve very high energy efficiency in comparison to other reconfigurable architectures when hardware optimizations are applied. Some of these optimizations are common for more traditional processors but can also lead to large...
Heterogeneous platforms with large numbers of processing elements (PEs) have been proposed to satisfy the computational requirements of computer vision applications. Limiting the incurred communication cost here is key to meeting the power constraints of embedded devices. We present a new heuristic to reduce communication among PEs and to external memory by aggregating inter-process communication and...
Attacks on memory that reveal secrets, for example via DMA or cold boot, are a long-known problem. In this paper, we present TransCrypt, a concept for transparent and guest-agnostic, dynamic kernel and user main memory encryption using a custom minimal hypervisor. The concept utilizes the address translation features provided by hardware-based virtualization support of modern CPUs to restrict the...
The Beaglebone Black single-board computer is well-suited for real-time embedded applications because its system-on-a-chip contains two "Programmable Real-time Units" (PRUs): 200-MHz microcontrollers that run concurrently with the main 1-GHz CPU that runs Linux. This paper introduces "Cyclops": a web-browser-based IDE that facilitates the development of embedded applications on...
Nowadays, there are many embedded systems with different architectures that have incorporated GPUs. However, it is difficult to develop CPU-GPU embedded systems using component-based development (CBD), since existing CBD approaches have no support for GPU development. In this context, when targeting a particular CPU-GPU platform, the component developer is forced to construct hardware-specific components,...
The home-grown SW26010 many-core processor enabled the production of China’s first independently developed number-one ranked supercomputer, the Sunway TaihuLight. Its limited off-chip memory bandwidth, however, renders the SW26010 a highly memory-bound processor. To compensate for this limitation, the processor was designed with a unique hardware feature, "Register Level Communication"...
With the increase of CMP (Chip-Multiprocessor) scale, moving data to computation on chip becomes more expensive. Accordingly, moving computation to data has potential to improve efficiency. We propose an in-place computation co-design of many-simple-core CMP for irregular applications. The computing paradigm is that an application's critical irregular data (or part of them) is partitioned into on-chip...
Heterogeneous computing platforms containing a wide range of computing resources, from CPUs to specialized hardware accelerators, are the trend today, resulting from the physical limitations on processor speed and the increasing demand for computing performance. Hence many optimization strategies are studied to get better throughput and lower energy consumption in heterogeneous systems. Various memory...
Optimizing the performance of GPU kernels is challenging for both human programmers and code generators. For example, CUDA programmers must set thread and block parameters for a kernel, but might not have the intuition to make a good choice. Similarly, compilers can generate working code, but may miss tuning opportunities by not targeting GPU models or performing code transformations. Although empirical...
The goal of this paper is to implement an efficient FPGA-based hardware architecture for the design of fast artificial vision systems. The proposed architecture is capable of performing classification operations of a Convolutional Neural Network (CNN) in real time. To show the effectiveness of the architecture, some design examples such as hand posture recognition, character recognition, and face...
Codes that aim to detect any error regardless of its multiplicity are referred to as security oriented codes. Most of these codes are designed to protect uniformly distributed codewords; there are few solutions which are used in protecting systems with non-uniformly distributed words. The paper introduces a new encoding method, termed “Level-Out encoding”, for cases in which some words are more likely...
The conventional OpenCL 1.x style CPU-GPU heterogeneous computing paradigm treats the CPU and GPU processors as loosely connected separate entities. At best each executes independent tasks, but, more commonly, the CPU idles while waiting for results from the GPU. No data sharing or communication is allowed during kernel execution. This model limits the number of applications that can harness the...
The hybrid runtime (HRT) model offers a path towards high performance and efficiency. By integrating the OS kernel, runtime, and application, an HRT allows the runtime developer to leverage the full feature set of the hardware and specialize OS services to the runtime's needs. However, conforming to the HRT model currently requires a port of the runtime to the kernel level, for example to the Nautilus...
State-of-the-art convolutional neural networks are enormously costly in both compute and memory, demanding massively parallel GPUs for execution. Such networks strain the computational capabilities and energy available to embedded and mobile processing platforms, restricting their use in many important applications. In this paper, we propose BCNN with Separable Filters (BCNNw/SF), which applies Singular...
In order to facilitate the development and maintenance of device drivers integrated into the operating system, a model-driven approach is proposed in this paper for driver design and verification before coding. An architecture model and a behavior model are created to illustrate both static and dynamic characteristics of device drivers, together with a device model and a device-driver-OS interaction model...
Future high-performance computing systems will need to include multiple specialized accelerators in a single heterogeneous system to overcome power-density limitations of CPU performance.
The growing demand for flexibility and cost reduction in the telecommunication landscape directs the focus of service development heavily to programmability and softwarization. In the domain of Network Function Virtualization (NFV), one of the goals is to replace dedicated hardware devices (such as switches, routers, firewalls) with software-based network functionalities, showing comparable performance...
Low-density parity-check convolutional codes (LDPC-CC) have interesting error correction features. They have great potential to become key error-correcting codes for enhancing the reliability of modern digital communication systems, optical systems, and storage devices. On the implementation side, however, the design of low-cost, low-power, and high-throughput LDPC-CC decoders remains challenging. This...