As the number of processors and the size of the memory of computing systems keep increasing, the likelihood of CPU core failures, memory errors, and bus failures increases and can threaten system availability. Software components can be hardened against such failures by running several replicas of a component on hardware replicas that fail independently and that are coordinated by a State-Machine...
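The coordination principle this abstract alludes to can be illustrated with a minimal sketch, assuming deterministic replicas and majority voting over their outputs; the `Replica` class and `vote` helper are illustrative names, not from the paper.

```python
# Minimal sketch of state-machine replication with majority voting.
# Assumes deterministic replicas that fail independently; all names
# here are illustrative, not from the paper.
from collections import Counter

class Replica:
    """A deterministic state machine: a counter driven by commands."""
    def __init__(self):
        self.state = 0

    def apply(self, command):
        op, arg = command
        if op == "add":
            self.state += arg
        return self.state

def vote(outputs):
    """Majority voting masks a minority of faulty replica outputs."""
    value, count = Counter(outputs).most_common(1)[0]
    if count <= len(outputs) // 2:
        raise RuntimeError("no majority: too many replica failures")
    return value

# Three replicas process the same command log in lockstep; the voter
# returns the agreed result even if one replica's output is corrupted.
replicas = [Replica() for _ in range(3)]
for cmd in [("add", 5), ("add", 2)]:
    result = vote([r.apply(cmd) for r in replicas])
```

With three replicas, a single CPU or memory fault corrupting one output is outvoted by the other two; the `vote` helper raises an error only when a majority can no longer be formed.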
ARTICo3 is an architecture that allows an arbitrary number of reconfigurable hardware accelerators to be set up dynamically, each containing a given number of threads fixed at design time according to High Level Synthesis constraints. However, the replication of these modules can be decided at runtime to accelerate kernels by increasing the overall number of threads, add modular redundancy to increase...
Cyber-Physical Systems need to handle increasingly complex tasks which, additionally, may have variable operating conditions over time. Therefore, dynamic resource management is required to adapt the system to different needs. In this paper, a new bus-based architecture, called ARTICo3, which by means of Dynamic Partial Reconfiguration allows the replication of hardware tasks to support module redundancy,...
Buggy device drivers are a major threat to the reliability of their host operating system. There have been myriad attempts to protect the kernel, but most of them either require driver modifications or incur substantial performance overhead. This paper describes an isolated device driver execution system called SIDE (Streamlined Isolated Driver Execution), which focuses specifically on unmodified...
The advent of general-purpose GPUs (GPGPUs) has made heterogeneous computing broadly accessible and has transformed general-purpose and parallel computing on GPUs. Heterogeneous systems have been adopted by large-scale high-performance computers. Fault tolerance techniques are now essential for scientific computing at this scale, but...
Availability is increased by recovery based on component microreboot instead of whole-system reboot. There are unique challenges that must be overcome in order to apply microreboot to low-level system software. These challenges arise from the need to interact with immutable hardware components on the one hand and, on the other, with a wide variety of higher-level workloads whose characteristics...
Linux servers with heterogeneous architectures present a new challenge for fault management. With the significant increase in the numbers and types of hardware components, separate fault management becomes more complex and inefficient. It is clear that centralized management, automatic recovery, and scalable design must be incorporated into a modern fault management system. Based on the ccNUMA architecture,...
High performance and relatively low cost of GPU-based platforms provide an attractive alternative for general purpose high performance computing (HPC). However, the emerging HPC applications have usually stricter output correctness requirements than typical GPU applications (i.e., 3D graphics). This paper first analyzes the error resiliency of GPGPU platforms using a fault injection tool we have...
While RAID is the prevailing method of creating reliable secondary storage infrastructure, many users desire more flexibility than offered by current implementations. Traditionally, RAID capabilities have been implemented largely in hardware in order to achieve the best performance possible, but hardware RAID has rigid designs that are costly to change. Software implementations are much more flexible,...
Fault injection technology provides an efficient way to verify the fault tolerance of computers and to detect the vulnerabilities of software systems. In this paper, we present a Xen-based fault injection technology for software vulnerability test (XFISV) intended as an efficient and general-purpose software test model, which injects faults into the interaction layer between software applications and...
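The core idea of software-implemented fault injection can be sketched in a few lines: perturb a value at the boundary between two layers and check whether the fault is detected downstream. This is a hedged illustration only; `flip_bit` and `injected` are hypothetical helpers, not part of XFISV.

```python
# Hedged sketch of software-implemented fault injection: flip one bit
# in a value crossing a layer boundary, then compare against the
# fault-free "golden" run. Helper names are illustrative only.
def flip_bit(value, bit):
    """Inject a single-bit transient fault into an integer."""
    return value ^ (1 << bit)

def injected(func, bit):
    """Wrap func so its integer result carries an injected fault."""
    def wrapper(*args):
        return flip_bit(func(*args), bit)
    return wrapper

def checksum(data):
    """A toy target: the software under test."""
    return sum(data) & 0xFF

golden = checksum([1, 2, 3])            # fault-free reference run
faulty = injected(checksum, 3)([1, 2, 3])  # run with bit 3 flipped
detected = golden != faulty             # did the fault propagate?
```

Comparing each injected run against the golden run is how campaigns classify outcomes (masked, detected, or silent data corruption); a real tool injects at the hypervisor or interaction layer rather than by wrapping a Python function.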
This paper presents an approach to conducting experimental studies for the characterization and comparison of the error behavior in different computing systems. The proposed approach is applied to characterize and compare the error behavior of three commercial systems (Linux 2.6 on Pentium 4, Solaris 10 on UltraSPARC IIIi, and AIX 5.3 on POWER 5) under hardware transient faults. The data is obtained...
This paper describes the design and implementation of a small real time operating system (OS) called Minos and its application in an onboard active safety project for general aviation. The focus of the operating system is predictability, stability, safety and simplicity. We introduce fault tolerance aspects in software by the concept of a very fast reboot procedure and by an error correcting flight...
Live migration of virtual machine (VM) is a desirable feature for distributed computing such as grid computing and recent cloud computing by facilitating fault tolerance, load balance, and hardware maintenance. Virtual machine monitor (VMM) enforced process protection is a newly advocated approach to provide a trustworthy execution environment for processes running on commodity operating systems. While...
Checkpoint-restart is considered one of the most natural approaches to achieving fault tolerance in a high-performance cluster. While experience has focused attention on user-level solutions, the advent of efficient system-level virtualization software, such as Xen and VMware, has opened the door to the possibility of efficient and scalable cluster-level virtualization. In this paper we present an...
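The checkpoint-restart idea itself can be sketched at application level, assuming in-memory snapshots stand in for system-level VM checkpoints; `run` and its parameters are illustrative, not from the paper.

```python
# Minimal sketch of checkpoint-restart: the loop snapshots its state
# every k iterations and, after a simulated failure, resumes from the
# last snapshot instead of from scratch. In-memory deep copies stand
# in for VM-level checkpoints; names are illustrative only.
import copy

def run(n_steps, checkpoint_every, fail_at=None):
    state = {"i": 0, "total": 0}
    snapshot = copy.deepcopy(state)      # initial checkpoint
    while state["i"] < n_steps:
        if state["i"] == fail_at:
            fail_at = None               # fail only once
            state = copy.deepcopy(snapshot)  # restart from checkpoint
            continue
        state["total"] += state["i"]     # one unit of useful work
        state["i"] += 1
        if state["i"] % checkpoint_every == 0:
            snapshot = copy.deepcopy(state)  # periodic checkpoint
    return state["total"]
```

A run that fails mid-way produces the same final result as a failure-free run, at the cost of re-executing only the work since the last checkpoint; the checkpoint interval trades snapshot overhead against the amount of lost work.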
We report the computational advances that have enabled the first micron-scale simulation of a Kelvin-Helmholtz (KH) instability using molecular dynamics (MD). The advances are in three key areas for massively parallel computation such as on BlueGene/L (BG/L): fault tolerance, application kernel optimization, and highly efficient parallel I/O. In particular, we have developed novel capabilities for...