The Hadoop architecture provides a basic level of fault tolerance by rescheduling tasks from a faulty node to other nodes in the network. This approach is inefficient, however, when a fault occurs after most of the job has already executed. It is therefore necessary to predict a node fault early enough that rescheduling the job does not incur a heavy cost in time and efficiency. Predicting these faults provides the time needed to shift the task load onto other nodes, preventing loss of data or computation time. An implementation is built on the MATLAB SVM kernel and Ganglia, with Java as the interfacing language; Ganglia is used to monitor system statistics across the network. The system is trained on statistics from normal task runs and can therefore detect deviations from them in real time. The experimental results indicate that the occurrence of a fault can be predicted from previously gained knowledge with minimal time delay, so that the job can either be rescheduled or the cluster itself upscaled. The reinforcement learning module reduces false positives with each run, making a truly fault-tolerant cluster practical.
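To make the detection step concrete, the sketch below trains a one-class SVM on metric vectors gathered during normal runs and flags live samples that deviate from the learned profile. It is a minimal illustration, not the paper's implementation: it uses libsvm's Java bindings as a stand-in for the MATLAB SVM kernel, and the class name, method names, and feature layout (e.g., CPU load, memory use, I/O wait per node) are assumptions for the example.

```java
import libsvm.*;

/**
 * Sketch of the deviation-detection idea: fit a one-class SVM to
 * per-node statistics sampled during normal task runs, then flag
 * live samples that fall outside the learned region as likely faults.
 * Uses libsvm's Java API; the feature layout is hypothetical.
 */
public class NodeFaultDetector {

    private svm_model model;

    /** Train on rows of per-node metrics sampled during normal runs. */
    public void train(double[][] normalRuns) {
        svm_problem prob = new svm_problem();
        prob.l = normalRuns.length;
        prob.x = new svm_node[prob.l][];
        prob.y = new double[prob.l];           // labels are unused by ONE_CLASS
        for (int i = 0; i < prob.l; i++) {
            prob.x[i] = toNodes(normalRuns[i]);
            prob.y[i] = 1.0;
        }

        svm_parameter param = new svm_parameter();
        param.svm_type = svm_parameter.ONE_CLASS;
        param.kernel_type = svm_parameter.RBF;
        param.gamma = 1.0 / normalRuns[0].length;
        param.nu = 0.05;                       // tolerated fraction of outliers
        param.cache_size = 100;
        param.eps = 1e-3;
        model = svm.svm_train(prob, param);
    }

    /** Returns true if a live sample deviates from the normal profile. */
    public boolean isDeviant(double[] sample) {
        return svm.svm_predict(model, toNodes(sample)) < 0; // -1 = outlier
    }

    private static svm_node[] toNodes(double[] row) {
        svm_node[] nodes = new svm_node[row.length];
        for (int j = 0; j < row.length; j++) {
            nodes[j] = new svm_node();
            nodes[j].index = j + 1;            // libsvm indices are 1-based
            nodes[j].value = row[j];
        }
        return nodes;
    }
}
```

In a deployment along the lines the paper describes, the feature vectors would be polled from Ganglia's per-node metrics, a deviant prediction would trigger rescheduling or upscaling, and confirmed false positives could be folded back into the training set, which is the role the reinforcement learning module plays here.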