With the development of neural-network-based machine learning and its use in mission-critical applications, voices are rising against the black-box nature of neural networks, as it becomes crucial to understand their limits and capabilities. With the rise of neuromorphic hardware, it is even more critical to understand how a neural network, as a distributed system, tolerates the failures of its...
Cloud computing provides support for hosting clients' applications. The cloud is a distributed platform that provides hardware, software, and network resources both to execute consumers' applications and to store and manage users' data. Clouds are also used to execute scientific workflow applications, which are generally more complex than other applications. Since the cloud is a distributed...
As the computing power of large scale computing systems increases exponentially with time, their failure rates are increasing exponentially as well. While current high performance computing (HPC) systems experience failures of some type every few days, projections indicate that the next generation exascale machines will experience failures up to several times an hour. The resilience techniques implemented...
We present a resilient task-based domain-decomposition preconditioner for partial differential equations (PDEs) built on top of User Level Failure Mitigation MPI (ULFM-MPI). The algorithm reformulates the PDE as a sampling problem, followed by a robust regression-based solution update that is resilient to silent data corruptions (SDCs). We adopt a server-client model where all...
A self-repairing robot utilising a spiking astrocyte-neuron network is presented in this paper. It uses the output spike frequency of neurons to control motor speed and robot activation. A software model of the astrocyte-neuron network previously demonstrated self-detection of faults and a self-repair capability. In this paper, a mobile-robotics application demonstrator is employed to...
ARTICo3 is an architecture that permits an arbitrary number of reconfigurable hardware accelerators to be set up dynamically, each containing a number of threads fixed at design time according to High-Level Synthesis constraints. However, the replication of these modules can be decided at runtime to accelerate kernels by increasing the overall number of threads, add modular redundancy to increase...
We present a domain-decomposition-based preconditioner for the solution of partial differential equations (PDEs) that is resilient to both soft and hard faults. The algorithm is based on the following steps: first, the computational domain is split into overlapping subdomains; second, the target PDE is solved on each subdomain for sampled values of the current local boundary conditions; third, the...
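The robust-regression update hinted at in these two abstracts can be illustrated with a toy sketch. Suppose a subdomain solve is repeated for sampled boundary values and one result is silently corrupted; a robust estimator still recovers the underlying linear dependence. The Theil-Sen median-of-slopes fit below is a stand-in chosen for illustration, not necessarily the regression used in the papers, and all names and numbers are hypothetical.

```python
from statistics import median

def theil_sen(xs, ys):
    """Robust linear fit: median of all pairwise slopes, then median intercept.

    Up to ~29% of the samples can be arbitrarily corrupted without
    breaking the fit, which is what makes it tolerant to SDC-like outliers.
    """
    slopes = [(ys[j] - ys[i]) / (xs[j] - xs[i])
              for i in range(len(xs)) for j in range(i + 1, len(xs))]
    b = median(slopes)
    a = median(y - b * x for x, y in zip(xs, ys))
    return a, b

# Hypothetical sampled boundary values and corresponding subdomain solves.
# The true relation is y = 2 + 3x; the third solve suffered a silent
# data corruption and returned garbage.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0, 5.0, 1e6, 11.0, 14.0]   # 1e6 is the corrupted sample

a, b = theil_sen(xs, ys)           # recovers a = 2, b = 3 despite the outlier
```

An ordinary least-squares fit over the same samples would be dragged far off by the single corrupted value, which is why a robust loss is the natural choice for SDC resilience.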
Since before the birth of computers we have strived to make intelligent machines that share some of the properties of our own brains. We have tried to make devices that quickly solve problems we find time consuming, that adapt to our needs, and that learn and derive new information. In more recent years we have tried to add new capabilities to our devices: self-adaptation, fault tolerance, self-repair,...
Task migration has been applied as an efficient mechanism for handling faulty processing elements (PEs) in Multi-Processor Systems-on-Chip (MPSoCs). However, current task migration solutions are either implemented or emulated in software, intrinsically compromising predictability and degrading system robustness. Moreover, the initial placement and mapping of tasks in the MPSoC plays an important...
The computing cloud manages huge numbers of virtualized resources to provide uniquely beneficial computing paradigms for scientific research. A modern cloud can behave, in a virtual context, much like a local homogeneous computer cluster to deliver High Performance Computing (HPC) platforms that provide public users with access, cut purchase costs, and eliminate the maintenance burden of sophisticated...
Over the past decade, high-performance applications have embraced parallel programming and computing models. While parallel computing offers advantages such as good utilization of dedicated hardware resources, it also has several drawbacks, such as poor fault tolerance, limited scalability, and a limited ability to harness available resources at run-time. The advent of cloud computing presents a viable and promising...
Taking inspiration from the cell-based structure of biological organisms, this paper focuses on the modeling and implementation of bio-inspired artificial hardware structures for high-reliability industrial control applications. As is well known, living organisms offer the ability to grow with fault tolerance and self-repair. These remarkable capabilities can be associated with principles to engineer complex novel...
Soft errors in hardware can affect the reliability of a computer system. To estimate system reliability, it is important to know the effects of soft errors on it. This paper explores the effects of soft errors on computer system reliability. We propose a new approach to measure system reliability with respect to soft errors. In our approach, the reliability of hardware components is considered first...
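A common first step for this kind of component-first reliability estimate, sketched below under assumptions that are mine rather than the paper's, is to give each component a constant soft-error rate, take R_i(t) = exp(-lambda_i * t) for each, and combine them in series so the system survives only if every component does. The component names and rates are purely illustrative.

```python
import math

# Hypothetical per-component soft-error rates (failures per hour);
# illustrative values, not measurements from the paper.
rates = {"cpu": 1e-5, "cache": 5e-5, "dram": 2e-4}

def component_reliability(lam, t):
    """R_i(t) = exp(-lambda_i * t) under a constant soft-error rate."""
    return math.exp(-lam * t)

def system_reliability(rates, t):
    """Series model: the system survives only if every component survives,
    so system reliability is the product of the component reliabilities."""
    r = 1.0
    for lam in rates.values():
        r *= component_reliability(lam, t)
    return r

r = system_reliability(rates, 1000.0)  # probability of surviving 1000 hours
```

Because the exponentials multiply, the series model is equivalent to a single component whose rate is the sum of the individual rates, which is why adding any unreliable component can only lower the system figure.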
Due to the continuous shrinking of transistor sizes, strongly driven by Moore's law, reliability has become a dominant design challenge for embedded systems. Reliability problems arise from permanent errors due to manufacturing defects, process variations, and aging, as well as from soft errors. As a result, the hardware will consist of unreliable components and hence, the development of embedded systems has...
As supercomputers and clusters increase in size and complexity, system failures are inevitable. Different hardware components (such as memory, disk, or network) of such systems can have different failure rates. Prior works assume failures affect an application uniformly, whereas our goal is to provide failure models for applications that reflect their specific component usage. This is challenging because...
The negative selection algorithm is one of the most widely used techniques in the field of artificial immune systems. This paper proposes an FPGA-based implementation of the negative selection algorithm aimed at fault detection problems. The negative selection algorithm generally uses binary matching rules to discriminate self from non-self. First, the three most widely used binary matching rules were...
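To make the self/non-self discrimination concrete, here is a minimal software sketch of negative selection using the r-contiguous-bits rule, one of the standard binary matching rules in the artificial immune systems literature. Which rules and parameters the paper actually implements on the FPGA is not stated in this snippet, so the threshold, string length, and self set below are assumptions.

```python
import random

R = 3  # r-contiguous-bits threshold (assumed for illustration)
L = 8  # bit-string length (assumed for illustration)

def r_contiguous_match(a, b, r=R):
    """True if bit strings a and b agree on at least r contiguous positions."""
    run = 0
    for x, y in zip(a, b):
        run = run + 1 if x == y else 0
        if run >= r:
            return True
    return False

def censor_detectors(self_set, n_detectors, rng):
    """Negative selection: generate random candidate detectors and keep
    only those that match no string in the self set."""
    detectors = []
    while len(detectors) < n_detectors:
        cand = "".join(rng.choice("01") for _ in range(L))
        if not any(r_contiguous_match(cand, s) for s in self_set):
            detectors.append(cand)
    return detectors

rng = random.Random(0)
self_set = {"00000000", "00001111"}   # hypothetical normal-behavior signatures
detectors = censor_detectors(self_set, 5, rng)

# At run time, a sample is flagged as non-self (a possible fault)
# if any detector matches it.
sample = "11110000"
flagged = any(r_contiguous_match(sample, d) for d in detectors)
```

The appeal for hardware is that the r-contiguous comparison is just a sliding window of XNOR gates and a small counter, which is why the rule maps naturally onto an FPGA fabric.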
In light of its powerful computing capacity and high energy efficiency, the GPU (graphics processing unit) has become a focus in the research field of HPC (High Performance Computing). CPU-GPU heterogeneous parallel systems have become a new development trend in supercomputing. However, the inherent unreliability of GPU hardware deteriorates the reliability of supercomputers. We have conducted research on...
Modern many-core architectures with hundreds of cores provide high computational potential, making them particularly interesting for scientific high-performance computing and simulation technology. Like all nanoscale semiconductor devices, many-core processors are prone to reliability-harming factors such as variations and soft errors. One way to improve the reliability of such systems is software-based...
Recent research in multi-agent systems incorporates fault-tolerance concepts. However, this research does not explore the extension and implementation of such ideas for large-scale parallel computing systems. The work reported in this paper investigates a swarm array computing approach, namely 'Intelligent Agents'. In this approach, a task to be executed on a parallel computing system is decomposed...
Evaluating and possibly improving fault-tolerance and error-detection mechanisms is becoming a key issue when designing safety-critical electronic systems. The proposed approach is based on simulation-based fault injection and allows analysis of the system's behavior when faults occur. The paper describes how a microprocessor board employed in an automated light-metro control system has been...