Search results

chapter

Hadoop cluster monitoring and fault analysis in real time

Joey Pinto, Pooja Jain, Tapan Kumar

2016 International Conference on Recent Advances and Innovations in Engineering (ICRAIE) > 1 - 6

2016 International Conference on Recent Advances and Innovations in Engineering (ICRAIE)

Failure of a task running on a Hadoop cluster is highly expensive in terms of computational time. A failure occurring even at the end phase of the task may cause the need to redo the entire task. Thus it is really important to deploy fault-tolerant techniques. Hadoop deploys a technique of checkpointing to prevent data loss. However, computational time loss still poses a grim threat to critical applications...

chapter

Hadoop distributed computing clusters for fault prediction

Joey Pinto, Pooja Jain, Tapan Kumar

2016 International Computer Science and Engineering Conference (ICSEC) > 1 - 6

2016 International Computer Science and Engineering Conference (ICSEC)

Hadoop architecture provides one level of fault tolerance by rescheduling the jobs on faulty nodes to other nodes in the network. But this approach is inefficient when a fault occurs after most of the job is executed. Thus, it's necessary to predict the fault at the node at quite an early stage so that the rescheduling of the job is not costly in terms of time and efficiency. Prediction...

chapter

Running Resilient MPI Applications on a Dynamic Group of Recommended Processes

Edson Tavares De Camargo, Elias P. Duarte

2016 Seventh Latin-American Symposium on Dependable Computing (LADC) > 15 - 24

2016 Seventh Latin-American Symposium on Dependable Computing (LADC)

HPC systems run applications that can take several hours to execute and have to deal with the occurrence of a potentially large number of faults. Most of the existing fault-tolerance strategies for these systems assume crash faults that are permanent events easily detected. This is not the case in several real systems, in particular in shared clusters, in which even the load variation may cause performance...

chapter

MapReduce Model Implementation on MPI Platform

Guo Yucheng

2014 13th International Symposium on Distributed Computing and Applications to Business, Engineering and Science > 88 - 91

2014 13th International Symposium on Distributed Computing and Applications to Business, Engineering and Science (DCABES)

With the development of multicore clusters, the task scheduling problem in heterogeneous clusters has become a hot research topic. The method used to solve this problem in cloud computing is virtualization, which makes the heterogeneous nodes isomorphic so that the MapReduce model can be used for task scheduling across them. But the approach has some shortcomings: virtualization itself will cause the...

chapter

Requirement Verification and Dependency Tracing During Simulation in Modelica

Lena Buffoni-Rogovchenko, Peter Fritzson, Mattias Nyberg, Alfredo Garro, more

2013 8th EUROSIM Congress on Modelling and Simulation > 561 - 566

2013 8th EUROSIM Congress on Modelling and Simulation (EUROSIM)

Requirement verification is an important part of the development process, and the increasing system complexity has exacerbated the need for integrating this step into a formalized model driven development process, providing a dedicated methodology as well as tool support. In this paper the authors propose an extension for Modelica, an equation-based language for system modeling, that will allow to...

chapter

Resilient sinks for long lived wireless sensor networks

Makhlouf Aliouat, Zibouda Aliouat, Chafiq Titouna

2012 International Symposium on Computer Applications and Industrial Electronics (ISCAIE) > 267 - 272

2012 IEEE Symposium on Computer Applications and Industrial Electronics (ISCAIE)

A wireless sensor network is a set of autonomous sensor nodes dedicated to sensing physical quantities in a geographical area of interest. The quantities so collected are converted to numerical data to be transmitted to a specific node called the base station or sink. After some appropriate processing, the data are sent out to a monitoring center. Therefore, a sink takes over a vital role in a WSN since...

chapter

Direct generation of invariants for reactive models

Elizabeth I. Leonard, Myla M. Archer, Constance L. Heitmeyer, Ralph D. Jeffords

Tenth ACM/IEEE International Conference on Formal Methods and Models for Codesign (MEMOCODE2012) > 119 - 130

2012 10th IEEE/ACM International Conference on Formal Methods and Models for Codesign (MEMOCODE 2012)

Recently, software practitioners, using model-based engineering and similar methods, have begun developing software from models. After creating a model of the required system behavior, a developer can obtain assurance of the model by validating that it captures the intended behavior and verifying that it satisfies critical properties. Invariants are important to both validation, as a check that the...

chapter

A rigorous approach to the design of resilient cyber-physical systems through co-simulation

John Fitzgerald, Ken Pierce, Carl Gamble

IEEE/IFIP International Conference on Dependable Systems and Networks Workshops (DSN 2012) > 1 - 6

2012 IEEE/IFIP 42nd International Conference on Dependable Systems and Networks Workshops (DSN-W)

The engineering of resilient cyber-physical systems requires collaborative development and analysis of models from different disciplines, including discrete-event models of software and continuous-time models of physical plant. This paper describes a rigorous approach to the model-based design of such systems through co-simulation of discrete-event models in the Vienna Development Method (VDM) and...

chapter

Efficient Resubmission Strategies to Design Robust Grid Production Environments

D Lingrand, J Montagnat

2010 IEEE Sixth International Conference on e-Science > 198 - 205

E-Science 2010. 6th IEEE International Conference on E-Science (E-Science 2010)

Production grids exhibit high failure rates hampering the development of many large scale scientific applications. End users require robust experiment production environments ensuring efficient resubmission of failed tasks. Proper parameterization of resubmission strategies is a complex problem that depends on the non-stationary workload conditions experienced by the infrastructure. In order to determine...

chapter

Monitoring Local Progress with Watchdog Timers Deduced from Global Properties

R Barbosa

2010 29th IEEE Symposium on Reliable Distributed Systems > 131 - 140

2010 29th IEEE International Symposium on Reliable Distributed Systems (SRDS)

Distributed systems are used in numerous applications where failures can be costly. Due to concerns that some of the nodes may become faulty, critical services are usually replicated across several nodes, which execute distributed algorithms to ensure correct service in spite of failures. To prevent replica-exhaustion, it is fundamental to detect errors and trigger appropriate recovery actions. In...

chapter

Achieving Robust Self-Management for Large-Scale Distributed Applications

Ahmad Al-Shishtawy, M A Fayyaz, K Popov, V Vlassov

2010 Fourth IEEE International Conference on Self-Adaptive and Self-Organizing Systems > 31 - 40

2010 4th IEEE International Conference on Self-Adaptive and Self-Organizing Systems (SASO 2010)

Achieving self-management can be challenging, particularly in dynamic environments with resource churn (joins/leaves/failures). Dealing with the effect of churn on management increases the complexity of the management logic and thus makes its development time consuming and error prone. We propose the abstraction of robust management elements (RMEs), which are able to heal themselves under continuous...

chapter

Design and Implementation of Failover Federates Supporting Fault Tolerance for HLA Based Simulations

Han Yibo, Wang Qun, Zhang Wei

2010 International Conference on Measuring Technology and Mechatronics Automation > 1 > 946 - 949

2010 International Conference on Measuring Technology and Mechatronics Automation (ICMTMA 2010)

With the increasing scale and complexity of HLA based simulations, fault tolerance is gradually becoming a pressing problem. This paper addresses the challenges in realizing a failover federate to support fault tolerance for HLA based simulations. Based on the analysis of the fault tolerance problem, the failover federate is described firstly. It comprises a primary federate and a standby federate...

chapter

A Simulation Environment for the On-Line Monitoring of a Fault Tolerant Flight Control Computer

M. Punt, J. Djordjevic, M. Tomasevic

2009 First IEEE Eastern European Conference on the Engineering of Computer Based Systems > 100 - 109

2009 First IEEE Eastern European Regional Conference on the Engineering of Computer Based Systems (ECBS-EERC 2009)

An approach of designing a simulation environment for the on-line monitoring of a fault tolerant flight control computer is presented in this paper. The simulation environment is designed to evaluate an improved on-line monitoring technique for processors with a built-in cache. This technique assumes that a monitor checks on-line whether the execution of a program is in accordance with the control...

chapter

Perfect Failure Detection in the Partitioned Synchronous Distributed System Model

R.J. de Araujo Macedo, S. Gorender

2009 International Conference on Availability, Reliability and Security > 273 - 280

2009 International Conference on Availability, Reliability and Security. ARES 2009

In this paper we show that it is possible to implement a perfect failure detector P (one that detects all faulty processes if and only if those processes failed) in a non-synchronous distributed system. To realize that, we introduce the partitioned synchronous system (Spa) that is weaker than the conventional synchronous system. From some properties we introduce (such as strong partitioned synchrony)...

chapter

A new life system approach to the Prognostic and Health Management (PHM) with survival analysis, dynamic hybrid fault models, evolutionary game theory, and three-layer survivability analysis

Zhanshan Ma

2009 IEEE Aerospace conference > 1 - 20

2009 IEEE Aerospace Conference

In this paper, I propose a new architecture for PHM, which is characterized by a life-system approach: treating PHM as a hierarchical system with fundamental properties similar to those of life systems. Conceptually, besides drawing on the important concepts from existing PHM theory and practice such as life cycle, condition-based maintenance (CBM), remaining useful lifetime (RUL), I draw on the dynamic...

chapter

Distributed Replica-Exchange Simulations on Production Environments Using SAGA and Migol

A. Luckow, S. Jha, Joohyun Kim, A. Merzky, more

2008 IEEE Fourth International Conference on eScience > 253 - 260

2008 IEEE Fourth International Conference on eScience

There exists a class of scientific applications for which utilizing distributed resources is critical for reducing the time-to-solution. In this paper, we discuss a specific class of applications - Replica-Exchange simulations - where the orchestration of many distributed jobs in a dynamic and inherently unreliable distributed environment is essential for a successful completion. We describe the design,...

chapter

Towards a Multi-agent Framework for Fault Tolerance and QoS Guarantee in P2P Networks

N. Dayhim, A.M. Rahmani, S.N. Gelyan, G. Zarrinzad

2008 Third International Conference on Convergence and Hybrid Information Technology > 2 > 166 - 171

2008 Third International Conference on Convergence and Hybrid Information Technology (ICCIT)

In a distributed P2P (peer to peer) network, each computer is able to act as a server for the others. Collaboration and resource sharing are the main purposes of this distributed heterogeneous network. Users need to promptly access the vast amount of data and easily use other users' results. In other words, the processing ability is improved. In this paper, a novel model named FQ (fault tolerant and...

chapter

A Simulation Framework for Dependable Distributed Systems

C. Dobre, F. Pop, V. Cristea

2008 International Conference on Parallel Processing - Workshops > 181 - 187

2008 International Conference on Parallel Processing Workshops (ICPP-W)

The use of discrete-event simulators in the design and development of distributed systems is appealing due to their efficiency and scalability. Their core abstractions of process and event map neatly to the components and interactions of modern-day distributed systems and allow designing realistic simulation scenarios. MONARC, a multi-threaded, process oriented simulation framework designed for modeling...

chapter

Verification and Analysis of Self-Checking Properties through ATPG

M. Hunger, S. Hellebrand

2008 14th IEEE International On-Line Testing Symposium > 25 - 30

14th IEEE International On-Line Testing Symposium

Present and future semiconductor technologies are characterized by increasing parameter variations as well as an increasing susceptibility to external disturbances. Transient errors during system operation are no longer restricted to memories but also affect random logic, and a robust design becomes mandatory to ensure reliable system operation. Self-checking circuits rely on redundancy to detect...

INFONA - science communication portal
