As the computing power of large-scale computing systems increases exponentially with time, their failure rates are increasing exponentially as well. While current high-performance computing (HPC) systems experience failures of some type every few days, projections indicate that next-generation exascale machines will experience failures up to several times an hour. The resilience techniques implemented...
Most prior work on hardware reliability makes use of module (spatial) redundancy or time redundancy. In the first case, these methods assume that each module is exactly the same. Multiple module replicas implementing the same logic function are executed in different hardware channels, and a voting scheme detects whether the outputs match. In the second case, they re-compute the result using the same...
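The spatial-redundancy scheme described in the abstract above (identical replicas plus a voter) can be illustrated with a minimal Python sketch. It is not taken from the paper: the names `majority_vote` and `run_redundant` are hypothetical, and software function calls stand in for the separate hardware channels.

```python
from collections import Counter

def majority_vote(outputs):
    """Return (value, agreed): the most common output and whether
    a strict majority of the replicas produced it."""
    value, count = Counter(outputs).most_common(1)[0]
    return value, count > len(outputs) // 2

def run_redundant(module, x, replicas=3):
    """Run the same module `replicas` times and vote on the outputs.
    Assumption: calling `module` repeatedly models independent channels."""
    outputs = [module(x) for _ in range(replicas)]
    return majority_vote(outputs)

if __name__ == "__main__":
    # One replica output (7) disagrees; the voter masks it.
    print(majority_vote([5, 5, 7]))
    # No two replicas agree; the voter reports a detected mismatch.
    print(majority_vote([1, 2, 3]))
```

With three replicas, the voter masks a single faulty output. When no strict majority exists, it can only report that an error was detected.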
In the multicore era, enabled by the decrease in transistor size, networks on chips (NoCs) emerged as a fast and scalable replacement for bus systems. While it enables high performance, transistor miniaturization affects the dependability of these systems because fault rates increase with the susceptibility of transistors, wires and connections at deep submicron scale...
As hardware components are expected to become ever more unreliable due to technology scaling, hardware errors have become unavoidable. Dependable systems that rely on correct functionality often use redundancy to detect such hardware faults during operation. However, to design cost-efficient reliable systems, it is crucial to effectively exploit the available redundancy. Thus, researchers have...
Existing SRAM-based Field Programmable Gate Arrays (FPGAs) are very sensitive to Single Event Effects (SEE) in harsh environments. To protect applications running on SRAM-based FPGAs from SEE, those applications mainly rely on resource redundancy approaches, which involve significant resource overhead. Newly proposed fault mitigation approaches use Partial Dynamic Reconfiguration to overcome...
In recent years, research on parallel digital terrain analysis has become a hot spot. Using parallel computing technology to solve data-intensive problems has become a development trend in digital terrain analysis. On the other hand, with the development of hardware technology and new applications, how to ensure the reliability of the computing results is one of the key problems...
Modern computing systems are becoming increasingly vulnerable to reliability issues due to both permanent (hard) and transient/intermittent (soft) errors. Various techniques have been proposed to incorporate redundancy into the hardware or software in order to achieve the desired fault tolerance. We present a technique that allows a task to be executed multiple times on multiprocessor systems...
Heterogeneous many-core architectures combined with scratch-pad memories are attractive because they promise better energy efficiency than conventional architectures and a good balance between single-thread performance and multi-thread throughput. However, programmers will need an environment for finding and managing the large degree of parallelism, locality, and system resilience. We propose a Python-based...
Dependability is a key decision factor in today's global business environment. Fault injection is a powerful method for evaluating the dependability of a system. The principle of this approach is to insert faults into the system and to monitor its responses in order to observe its behavior in the presence of faults. Several fault injection techniques and tools have been developed and...
Arithmetic error coding schemes (AN codes) are a well-known and effective technique for soft error mitigation. Although coding theory is a rich area of mathematics, implementing these codes seems fairly easy. However, compliance with the theory can easily be lost while moving towards an actual implementation, ultimately jeopardizing the desired fault-tolerance characteristics. In this paper, we...
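As background to the abstract above, the basic idea of an AN code can be shown in a short Python sketch. It is a textbook-style illustration, not the implementation studied in the paper. The constant `A = 59` and the helper names are assumptions chosen for the example. A value n is stored as A·n, and any word that is not a multiple of A is treated as corrupted.

```python
A = 59  # hypothetical odd constant; real systems choose A with care

def encode(n):
    """Encode an integer as the code word A * n."""
    return A * n

def is_valid(codeword):
    """A code word is valid only if it is a multiple of A."""
    return codeword % A == 0

def decode(codeword):
    """Recover the original value, rejecting corrupted code words."""
    if not is_valid(codeword):
        raise ValueError("corrupted code word")
    return codeword // A

if __name__ == "__main__":
    # Addition is preserved: A*x + A*y == A*(x + y).
    total = encode(7) + encode(5)
    print(decode(total))
    # A single bit flip changes the word by a power of two, which an
    # odd A > 1 never divides, so the flip is detected.
    flipped = encode(7) ^ (1 << 3)
    print(is_valid(flipped))
```

The sketch covers only encoding, checked addition and single-bit-flip detection. The gap between such a simple picture and a theory-compliant implementation is the concern the abstract raises.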
Fault injection has been an important mechanism to test the dependability properties of a system. Through this mechanism, it is possible to analyze the behavior of a computer program in case of anomalies and to obtain useful statistics to measure the effectiveness of techniques for fault tolerance. In areas such as telecommunications, aviation and finance, the use of fault tolerance is a common practice,...
Today, system reliability, availability, serviceability, and manageability (RASM) are becoming more crucial as computer-based systems continue to increase in complexity and importance to our daily lives. Redundancy is a viable approach to improve the RASM attributes of a system. There are many forms of fault-tolerant/redundant system architectures employed in both commercial and military/aerospace...
Integrated circuits fabricated in deep sub-micron technology are vulnerable to intermittent or transient faults, which are the predominant cause of system failures. With continued scaling, operating voltage levels have been reduced and, with the resulting decrease in noise margins, the possibility of transient faults is likely to increase. Also, during operation in adverse environments, transient faults occur upon...
The advent of multi- and many-core processors comes with new challenges and opportunities for the designer of embedded real-time applications. By using parallel programming techniques (e.g. OpenMP), software engineers can leverage the available hardware parallelism and speed up the algorithms. The inherent redundancy of multi-core architectures can also be used to implement fault tolerance by...
Reliability is one of the most critical factors to be considered during the design phase of any product. Many factors contribute to making a system more reliable, in terms of area, power, operating frequency and accuracy. This paper proposes the design of a 4-bit fault-tolerant ALU system using backend design. Parallel processing along with triple modular redundancy (TMR)...
Achieving dependable computing systems is becoming increasingly difficult as CMOS integrated circuit technology scaling reaches sub-22nm ranges and faces physical limitations. Dependable computing is also a major concern with the various new technologies that are being investigated to overcome the physical limitations of CMOS technology. 3D integration, though initially proposed as a way of...
We propose a new methodology for hardware/software co-design of embedded systems which is specifically aimed at mitigating SET effects. A hardening infrastructure is used to generate different versions of the design using several combinations of hardware and software hardening, which are evaluated with respect to SET effects. The advantages of the proposed approach are demonstrated by means of a case...
Reliability and manufacturability have emerged as dominant concerns for today's multi-billion transistor chips. In this paper, we investigate how to degrade a chip multiprocessor (CMP) gracefully in the presence of faults, by keeping its architected functionality intact at the expense of some loss of performance. The proposed solution involves sharing critical execution resources among cores to survive...
This paper addresses the issue of error detection in transactional memory, and proposes a new method of error detection based on redundant transactions (EDRT). This method creates a copy of every transaction, executes both the original transactions and the transaction copies on adequate processor cores, and achieves error detection by comparing the execution results. EDRT utilizes the data-versioning...
This paper proposes: 1) a dynamically scheduled Process-Level Redundancy (PLR) for enhancing the reliability of multi-core systems, 2) a comparison between PLR and Thread-Level Redundancy (TLR), and 3) a fault study on the thread selector unit of a modern processor. The proposed technique employs underutilized CPU resources to improve the fault tolerance of a system. The evaluation of PLR reliability...