FPGAs are becoming an attractive choice as a heterogeneous computing unit for scientific computing because FPGA vendors are adding floating-point-optimized architectures to their product lines. Additionally, high-level synthesis (HLS) tools such as Altera OpenCL SDK are emerging, which could potentially break the FPGA programming wall and provide a streamlined flow for domain experts in scientific...
In this paper we propose a novel CNN hardware accelerator, called AIScale, capable of accelerating the convolutional, pooling, fully-connected and addition layers of CNNs. In contrast to most existing solutions, AIScale offers a complete solution for full CNN acceleration. AIScale is designed as a coarse-grained reconfigurable architecture that uses rapid, dynamic reconfiguration during CNN layer processing...
RDMA (Remote Direct Memory Access) is a technology that enables user applications to perform direct data transfer between the virtual memory of processes on remote endpoints, without operating system involvement or intermediate data copies. Achieving zero intermediate data copies using RDMA requires specialized network interface hardware. Software RDMA drivers emulate RDMA semantics in software to...
Heterogeneous platforms with large numbers of processing elements (PEs) have been proposed to satisfy the computational requirements of computer vision applications. Limiting the incurred communication cost is key to meeting the power constraints of embedded devices. We present a new heuristic to reduce communication among PEs and with external memory by aggregating inter-process communication and...
This paper presents the design and implementation of a hardwired OS kernel circuitry inside a Java application processor to provide the system services that are traditionally implemented in software. The hardwired system functions in the proposed SoC include the thread manager, the memory manager, and the I/O subsystem interface. There are many advantages in making the OS kernel a hardware component,...
This paper describes the implementation of approximate memory support in the Linux operating system kernel. The new functionality allows the kernel to distinguish between normal memory banks, which are composed of standard memory cells that retain data without corruption, and approximate memory banks, where memory cells are subject to read/write faults with controlled probability. Approximate memories...
High-Level Synthesis (HLS) has been widely recognized as an efficient compilation process targeting FPGAs for algorithm evaluation and product prototyping. However, massively parallel memory access demands and the extremely high cost of multi-port single-bank memories have impeded loop pipelining performance. Thus, based on an alternative multi-bank memory architecture, a joint approach...
To protect the integrity of operating system kernels, we present the Vigilare system, a kernel integrity monitor architected to snoop the bus traffic of the host system from separate, independent hardware. This snoop-based monitoring, enabled by the Vigilare system, overcomes the limitations of the snapshot-based monitoring employed in previous kernel integrity monitoring solutions. Being based...
Graphics Processing Units (GPUs) are designed to exploit large amounts of parallelism. However, warp-level divergence, arising from differing amounts of work, differing memory access latencies, etc., results in the warps of a thread block (TB) finishing kernel execution at different points in time. This, in effect, reduces the utilization of SM resources and hence the performance of the GPU. We propose...
For many compute-intensive tasks, simultaneous access into multi-dimensional data arrays is highly restricted by the data mapping strategy and memory port constraints. As such, to increase memory access bandwidth, innovative memory partitioning and mapping algorithms have been proposed to simultaneously access multiple memory blocks through physically distributing data elements in the same...
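The partitioning idea in the abstract above can be illustrated with the common cyclic scheme, in which array element i is placed in bank i mod N; any N consecutive elements then land in distinct banks and can be fetched in the same cycle. A minimal sketch follows; the bank count and mapping functions are illustrative assumptions, not details taken from the paper:

```python
# Cyclic memory partitioning: element i of an array is stored in
# bank (i mod N) at offset (i // N). With N banks, any window of N
# consecutive elements touches N distinct banks, so a pipelined loop
# reading a[i], a[i+1], ..., a[i+N-1] each cycle sees no bank
# conflict. Real partitioners also handle block and block-cyclic
# mappings and multi-dimensional arrays.

N_BANKS = 4

def bank(i):
    return i % N_BANKS

def offset(i):
    return i // N_BANKS

def conflict_free(indices):
    """True if all accesses issued in one cycle hit distinct banks."""
    banks = [bank(i) for i in indices]
    return len(set(banks)) == len(banks)

# A sliding window of N_BANKS consecutive elements is always
# conflict-free under cyclic partitioning...
assert all(conflict_free(range(i, i + N_BANKS)) for i in range(64))

# ...whereas a stride equal to the bank count serializes accesses.
assert not conflict_free([0, 4, 8, 12])
```

The stride-4 failure case is why partitioning must be chosen jointly with the loop's access pattern, as the joint approach above suggests.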
Convolution operations dominate the total execution time of deep convolutional neural networks (CNNs). In this paper, we aim at enhancing the performance of the state-of-the-art convolution algorithm (called Winograd convolution) on the GPU. Our work is based on two observations: (1) CNNs often have abundant zero weights and (2) the performance benefit of Winograd convolution is limited mainly due...
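Winograd convolution, named above, trades multiplications for cheap additions. As a hedged illustration of the general algorithm (not the paper's GPU kernels), the minimal 1-D variant F(2,3) computes two outputs of a 3-tap filter with 4 multiplies instead of the 6 needed directly:

```python
# Winograd minimal filtering F(2,3): two outputs of a 3-tap filter
# from a 4-element input tile using only 4 multiplications.
# Illustrative sketch of the textbook algorithm, not the paper's
# sparse/GPU-optimized implementation.

def winograd_f23(d, g):
    """d: 4 input values, g: 3 filter taps -> 2 outputs."""
    m0 = (d[0] - d[2]) * g[0]
    m1 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m2 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m3 = (d[1] - d[3]) * g[2]
    return [m0 + m1 + m2, m1 - m2 - m3]

def direct(d, g):
    """Reference: direct sliding-window computation (6 multiplies)."""
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]

d = [1.0, 2.0, 3.0, 4.0]
g = [0.5, -1.0, 2.0]
assert winograd_f23(d, g) == direct(d, g)  # both give [4.5, 6.0]
```

Note that the multiplies disappear entirely wherever a transformed weight term is zero, which is why the zero-weight observation in the abstract interacts with Winograd's benefit.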
Publishing scientific results without the detailed execution environments describing how the results were collected makes it difficult or even impossible for the reader to reproduce the work. However, the configurations of the execution environments are too complex to be described easily by authors. To solve this problem, we propose a framework facilitating the conduct of reproducible research by...
Despite its popularity, deploying Convolutional Neural Networks (CNNs) on a portable system is still challenging due to large data volume, intensive computation and frequent memory access. Although previous FPGA acceleration schemes generated by high-level synthesis tools (e.g., HLS, OpenCL) have allowed for fast design optimization, hardware inefficiency still exists when allocating FPGA resources...
This paper introduces a software policy for memory management in heterogeneous memory systems in order to improve the trade-offs between performance and power consumption, while attempting to make the best use of different characteristics of the underlying memory technologies. In this policy, the operating system and the application co-schedule page management in order to make informed decisions about...
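As a loose illustration of such a co-scheduled policy (the tier names, hotness threshold, and hint mechanism here are invented for the sketch, not taken from the paper), a placement decision might route frequently accessed pages to fast but power-hungry memory and cold pages to a slow, low-power tier, while letting the application override the OS default:

```python
# Toy page-placement policy for a two-tier heterogeneous memory
# system: hot pages go to fast DRAM (higher power), cold pages to a
# slower, low-power tier such as NVM. Threshold and tier names are
# illustrative assumptions.

FAST_TIER = "dram"
SLOW_TIER = "nvm"
HOT_THRESHOLD = 100  # accesses per sampling interval (assumed)

def place_page(access_count, app_hint=None):
    """Pick a tier from measured hotness; an application-supplied
    hint overrides the OS default, modeling the OS/application
    co-scheduling described above."""
    if app_hint in (FAST_TIER, SLOW_TIER):
        return app_hint
    return FAST_TIER if access_count >= HOT_THRESHOLD else SLOW_TIER

assert place_page(500) == "dram"                  # hot -> fast tier
assert place_page(3) == "nvm"                     # cold -> low-power tier
assert place_page(3, app_hint="dram") == "dram"   # application hint wins
```

The hint path is the "informed decision" ingredient: the application knows future access patterns that raw OS-level counters cannot predict.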
In this paper, we advocate the use of code polymorphism as an efficient means to improve security at several levels in electronic devices. We analyse the threats that polymorphism could help thwart, and present the solution that we plan to demonstrate in the scope of a collaborative research project called COGITO. We expect our solution to be effective to improve security, to comply with the computing...
In the presence of known and unknown vulnerabilities in program code and control flow, virtual-machine-like isolation and sandboxing, which confine a malicious process by monitoring and controlling the behaviour of the untrusted application, are an effective strategy. A confined malicious application cannot affect system resources or other applications running on the same operating system. But present...
In the new era of cyber-physical systems, software must adapt itself to ever-changing environmental conditions and situations. This is currently not reflected in the design of embedded operating systems, since they are primarily optimized for fixed usage scenarios with tight resource constraints. We discuss the idea of interpreted operating system kernels, which can form a new foundation for highly...
The survivability of the OS is critical to the whole system, because the OS is the foundation of any information or network system. Based on an analysis of the resources, services and functions of the OS, and owing to the particularity of OS survivability, this paper proposes the concept of an integrity running environment (IRE) and then puts forward a new definition, namely that OS survivability is that...
High utilization of hardware resources is key to designing performance- and power-optimized GPU applications. The efficiency of applications and kernels that do not fully utilize GPU resources can be improved through concurrent execution with independent kernels and/or applications. Hyper-Q enables multiple CPU threads or processes to launch work on a single GPU simultaneously for increased...
Traditional PC-based operating systems load most of their components during the boot process along with the kernel. This mechanism, though effective for a broad range of objectives, is seldom fully utilized by the majority of users, as they usually perform a specific job that does not require every component of the OS. It has been observed that operating systems which are designed keeping in mind the nature of the job,...