Failure of a task running on a Hadoop cluster is highly expensive in terms of computational time. A failure occurring even in the final phase of the task may require redoing the entire task. Thus it is really important to deploy fault-tolerant techniques. Hadoop deploys a checkpointing technique to prevent data loss. However, computational time loss still poses a grim threat to critical applications...
The Hadoop architecture provides one level of fault tolerance by rescheduling the job from faulty nodes to other nodes in the network. However, this approach is inefficient when a fault occurs after most of the job has been executed. Thus, it is necessary to predict the fault at the node at an early stage so that rescheduling the job is not costly in terms of time and efficiency. Prediction...
HPC systems run applications that can take several hours to execute and have to deal with the occurrence of a potentially large number of faults. Most of the existing fault-tolerance strategies for these systems assume crash faults that are permanent events easily detected. This is not the case in several real systems, in particular in shared clusters, in which even the load variation may cause performance...
With the development of multicore clusters, the task scheduling problem in heterogeneous clusters has become a hot research topic. The method used to solve this problem in cloud computing is virtualization, which makes the heterogeneous nodes isomorphic and then uses the MapReduce model for task scheduling on the isomorphic nodes. But the approach has some shortcomings: virtualization itself will cause the...
Requirement verification is an important part of the development process, and the increasing system complexity has exacerbated the need for integrating this step into a formalized model driven development process, providing a dedicated methodology as well as tool support. In this paper the authors propose an extension for Modelica, an equation-based language for system modeling, that will allow to...
A wireless sensor network (WSN) is a set of autonomous sensor nodes dedicated to sensing physical quantities in a geographical area of interest. The quantities so collected are converted to numerical data to be transmitted to a specific node called the base station or sink. After some appropriate processing, the data are sent out to a monitoring center. Therefore, a sink takes over a vital role in a WSN since...
Recently, software practitioners, using model-based engineering and similar methods, have begun developing software from models. After creating a model of the required system behavior, a developer can obtain assurance of the model by validating that it captures the intended behavior and verifying that it satisfies critical properties. Invariants are important to both validation, as a check that the...
The engineering of resilient cyber-physical systems requires collaborative development and analysis of models from different disciplines, including discrete-event models of software and continuous-time models of physical plant. This paper describes a rigorous approach to the model-based design of such systems through co-simulation of discrete-event models in the Vienna Development Method (VDM) and...
Production grids exhibit high failure rates hampering the development of many large scale scientific applications. End users require robust experiment production environments ensuring efficient resubmission of failed tasks. Proper parameterization of resubmission strategies is a complex problem that depends on the non-stationary workload conditions experienced by the infrastructure. In order to determine...
Distributed systems are used in numerous applications where failures can be costly. Due to concerns that some of the nodes may become faulty, critical services are usually replicated across several nodes, which execute distributed algorithms to ensure correct service in spite of failures. To prevent replica-exhaustion, it is fundamental to detect errors and trigger appropriate recovery actions. In...
Achieving self-management can be challenging, particularly in dynamic environments with resource churn (joins/leaves/failures). Dealing with the effect of churn on management increases the complexity of the management logic and thus makes its development time consuming and error prone. We propose the abstraction of robust management elements (RMEs), which are able to heal themselves under continuous...
With the increasing scale and complexity of HLA-based simulations, fault tolerance is gradually becoming a pressing problem. This paper addresses the challenges in realizing a failover federate to support fault tolerance for HLA-based simulations. Based on an analysis of the fault tolerance problem, the failover federate is first described. It comprises a primary federate and a standby federate...
An approach to designing a simulation environment for the on-line monitoring of a fault-tolerant flight control computer is presented in this paper. The simulation environment is designed to evaluate an improved on-line monitoring technique for processors with a built-in cache. This technique assumes that a monitor checks on-line whether the execution of a program is in accordance with the control...
In this paper we show that it is possible to implement a perfect failure detector P (one that suspects a process if and only if that process has failed) in a non-synchronous distributed system. To realize that, we introduce the partitioned synchronous system (Spa), which is weaker than the conventional synchronous system. From some properties we introduce (such as strong partitioned synchrony)...
In this paper, I propose a new architecture for PHM, which is characterized by a life-system approach: treating PHM as a hierarchical system with fundamental properties similar to those of life systems. Conceptually, besides drawing on important concepts from existing PHM theory and practice such as life cycle, condition-based maintenance (CBM), and remaining useful lifetime (RUL), I draw on the dynamic...
There exists a class of scientific applications for which utilizing distributed resources is critical for reducing the time-to-solution. In this paper, we discuss a specific class of applications - Replica-Exchange simulations - where the orchestration of many distributed jobs in a dynamic and inherently unreliable distributed environment is essential for a successful completion. We describe the design,...
In a distributed P2P (peer-to-peer) network, each computer is able to act as a server for the others. Collaboration and resource sharing are the main purposes of this distributed heterogeneous network. Users need to promptly access the vast amount of data and easily use other users' results. In other words, the processing ability is improved. In this paper, a novel model named FQ (fault tolerant and...
The use of discrete-event simulators in the design and development of distributed systems is appealing due to their efficiency and scalability. Their core abstractions of process and event map neatly to the components and interactions of modern-day distributed systems and allow designing realistic simulation scenarios. MONARC, a multi-threaded, process oriented simulation framework designed for modeling...
Present and future semiconductor technologies are characterized by increasing parameter variations as well as an increasing susceptibility to external disturbances. Transient errors during system operation are no longer restricted to memories but also affect random logic, and a robust design becomes mandatory to ensure reliable system operation. Self-checking circuits rely on redundancy to detect...