Wyniki wyszukiwania

Pozycje od 1 do 11 spośród 11 wyników

rozdział

cudaCR: An In-Kernel Application-Level Checkpoint/Restart Scheme for CUDA-Enabled GPUs

Behnam Pourghassemi, Aparna Chandramowlishwaran

2017 IEEE International Conference on Cluster Computing (CLUSTER) > 725 - 732

2017 IEEE International Conference on Cluster Computing (CLUSTER)

Fault-tolerance is becoming increasingly important as we enter the era of exascale computing. Increasing the number of cores results in a smaller mean time between failures, and consequently, higher probability of errors. Among the different software fault tolerance techniques, checkpoint/restart is the most commonly used method in supercomputers, the de-facto standard for large-scale systems. Although...

rozdział

Design and Implementation for Checkpointing of Distributed Resources Using Process-Level Virtualization

Kapil Arya, Rohan Garg, Artem Y. Polyakov, Gene Cooperman

2016 IEEE International Conference on Cluster Computing (CLUSTER) > 402 - 412

2016 IEEE International Conference on Cluster Computing (CLUSTER)

System-level checkpoint-restart is a critical technology for long-running jobs in high-performance computing. Yet, only two approaches to checkpointing MPI applications continue to survive in wide use today. One approach is to use the kernel module-based BLCR in combination with an MPI checkpoint-restart service particular to the MPI implementation in use. Unfortunately, this lacks support for some...

rozdział

Power-Check: An Energy-Efficient Checkpointing Framework for HPC Clusters

Raghunath Raja Chandrasekar, Akshay Venkatesh, Khaled Hamidouche, Dhabaleswar K. Panda

2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing > 261 - 270

2015 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)

Checkpoint-restart is a predominantly used reactive fault-tolerance mechanism for applications running on HPC systems. While there are innumerable studies in literature that have analyzed, and optimized for, the performance and scalability of a variety of check pointing protocols, not much research has been done from an energy or power perspective. The limited number of studies conducted along this...

rozdział

VOCL-FT: introducing techniques for efficient soft error coprocessor recovery

Antonio J. Peña, Wesley Bland, Pavan Balaji

SC15: International Conference for High Performance Computing, Networking, Storage and Analysis > 1 - 12

SC15: International Conference for High Performance Computing, Networking, Storage and Analysis

Popular accelerator programming models rely on offloading computation operations and their corresponding data transfers to the coprocessors, leveraging synchronization points where needed. In this paper we identify and explore how such a programming model enables optimization opportunities not utilized in traditional checkpoint/restart systems, and we analyze them as the building blocks for an efficient...

rozdział

Performance evaluation of checkpoint/restart techniques: For MPI applications on Amazon cloud

Basma Abdel Azeem, Manal Helal

2014 9th International Conference on Informatics and Systems > PDC-49 - PDC-57

2014 9th International Conference on Informatics and Systems (INFOS)

Distributed applications running on a large cluster environment, such as the cloud instances will have shorter execution time. However, the application might suffer from sudden termination due to unpredicted computing node failures, thus loosing the whole computation. Checkpoint/restart is a fault tolerance technique used to solve this problem. In this work we evaluated the performance of two of the...

rozdział

An Application-Assisted Checkpoint-Restart Mechanism for Java Applications

Diana Andreea Popescu, Eliana-Dina Tirsa, Mugurel Ionut Andreica, Valentin Cristea

2013 IEEE 12th International Symposium on Parallel and Distributed Computing > 190 - 197

2013 IEEE 12th International Symposium on Parallel and Distributed Computing (ISPDC)

In this paper we present a novel application-assisted checkpoint-restart mechanism for Java applications. The checkpoint-restart API provides the application developers with full control over what data needs to be check-pointed. The novelty of our system is that it allows different checkpoint periods for different data items. Our implementation makes full use of the Java Reflection API.

rozdział

Combining Partial Redundancy and Checkpointing for HPC

James Elliott, Kishor Kharbas, David Fiala, Frank Mueller, więcej

2012 IEEE 32nd International Conference on Distributed Computing Systems > 615 - 626

2012 IEEE 32nd International Conference on Distributed Computing Systems (ICDCS)

Today's largest High Performance Computing (HPC) systems exceed one Petaflops (10^15) floating point operations per second) and exascale systems are projected within seven years. But reliability is becoming one of the major challenges faced by exascale computing. With billion-core parallelism, the mean time to failure is projected to be in the range of minutes or hours instead of days. Failures are...

rozdział

RDMA-Based Job Migration Framework for MPI over InfiniBand

Xiangyong Ouyang, Sonya Marcarelli, Raghunath Rajachandrasekar, Dhabaleswar K Panda

2010 IEEE International Conference on Cluster Computing > 116 - 125

2010 IEEE International Conference on Cluster Computing (CLUSTER 2010)

Coordinated checkpoint and recovery is a common approach to achieve fault tolerance on large-scale systems. The traditional mechanism dumps the process image to a local disk or a central storage area of all the processes involved in the parallel job. When a failure occurs, the processes are restarted and restored to the latest checkpoint image. However, this kind of approach is unable to provide the...

rozdział

Accelerating Checkpoint Operation by Node-Level Write Aggregation on Multicore Systems

Xiangyong Ouyang, K. Gopalakrishnan, D.K. Panda

2009 International Conference on Parallel Processing > 34 - 41

2009 International Conference on Parallel Processing (ICPP 2009)

Clusters and applications continue to grow in size while their mean time between failure (MTBF) is getting smaller. Checkpoint/restart is becoming increasingly important for large scale parallel jobs. However, the performance of the checkpoint/restart mechanism does not scale well with increasing job size due to constraints within the file system. Furthermore, with the advent of multi-core architecture,...

rozdział

DMTCP: Transparent checkpointing for cluster computations and the desktop

J. Ansel, K. Arya, G. Cooperman

2009 IEEE International Symposium on Parallel&Distributed Processing > 1 - 12

2009 IEEE International Symposium on Parallel & Distributed Processing (IPDPS)

DMTCP (distributed multithreaded checkpointing) is a transparent user-level checkpointing package for distributed applications. Checkpointing and restart is demonstrated for a wide range of over 20 well known applications, including MATLAB, Python, TightVNC, MPICH2, OpenMPI, and runCMS. RunCMS runs as a 680 MB image in memory that includes 540 dynamic libraries, and is used for the CMS experiment...

rozdział

Transparent system-level migration of PGAS applications using Xen on InfiniBand

D.P. Scarpazza, P. Mullaney, O. Villa, F. Petrini, więcej

2007 IEEE International Conference on Cluster Computing > 74 - 83

2007 IEEE International Conference on Cluster Computing (CLUSTER)

Checkpoint-restart is considered one of the most natural approaches to achieving fault-tolerance in a high-performance cluster. While experiences has focused attention on user-level solutions, the advent of efficient system-level virtualization software, such as Xen and VMWare, has opened the door to the possibility of efficient and scalable cluster-level virtualization. In this paper we present an...

Opcje filtrowania

Zbiór danych:
ieee
Słowa kluczowe:
KERNEL
LIBRARIES
CHECKPOINTING

Data publikacji

Ustaw własny zakres dat

Słowa kluczowe

FAULT TOLERANCE (6)
FAULT TOLERANT SYSTEMS (3)
CHECKPOINT (2)
CHECKPOINT/RESTART (2)
DATA MINING (2)
DMTCP (2)
FAULT TOLERANT COMPUTING (2)
INSTRUCTION SETS (2)
LINUX (2)
MVAPICH2 (2)
PROTOCOLS (2)
REGISTERS (2)
RUNTIME (2)
AGGREGATE REMOTE MEMORY COPY INTERFACE (1)
AMAZON EC2 (1)
APPLICATION PROGRAM INTERFACES (1)
ARMCI ONE-SIDED COMMUNICATION LIBRARY (1)
BACKPLANES (1)
BLCR (1)
BUFFER STORAGE (1)
CHECKPOINT OPERATION (1)
CHECKPOINT TIMES (1)
CHECKPOINT-RESTART (1)
CHECKPOINT-RESTART APPROACH (1)
CHECKPOINT-RESTART MODULE (1)
CHECKPOINT/RESTART MECHANISM (1)
CHECKPOINTING LIBRARY (1)
CHECKPOINTING PACKAGES (1)
CLOUD COMPUTING (1)
CLUSTER COMPUTATIONS (1)
COMPUTATIONAL MODELING (1)
CONTEXT (1)
COPROCESSORS (1)
DISTRIBUTED CORES (1)
DISTRIBUTED MULTITHREADED CHECKPOINTING (1)
DRIVER CIRCUITS (1)
EDUCATIONAL INSTITUTIONS (1)
ELECTRONICS PACKAGING (1)
ENERGY-EFFICIENCY (1)
ERROR CORRECTION CODES (1)
FILE ORGANISATION (1)
FILE SYSTEM (1)
FORKED CHECKPOINTING (1)
GLOBAL RECOVERY LINE IDENTIFICATION (1)
GPU (1)
GRAPHICS PROCESSING UNITS (1)
HARDWARE (1)
HIGH-PERFORMANCE CLUSTER FAULT-TOLERANCE (1)
IMAGE TRANSMISSION (1)
INFINIBAND (1)
INFINIBAND NETWORK (1)
IP NETWORKS (1)
JAVA (1)
LARGE-SCALE SYSTEMS (1)
LINUX KERNELS (1)
MATLAB (1)
MEAN TIME BETWEEN FAILURE (1)
MEMORY MANAGEMENT (1)
MESSAGE PASSING (1)
MIDDLEWARE (1)
MPI (1)
MPI APPLICATION (1)
MPI-2 (1)
MPICH2 (1)
MULTI-THREADING (1)
MULTICORE ARCHITECTURE (1)
MULTICORE SYSTEMS (1)
MULTIPROCESSING SYSTEMS (1)
NAS PARALLEL BENCHMARKS (1)
NODE-LEVEL WRITE AGGREGATION (1)
OBJECT RECOGNITION (1)
OPEN FILE DESCRIPTORS (1)
OPENMPI (1)
OPERATING SYSTEM KERNELS (1)
OPTIMIZATION (1)
PARALLEL JOBS (1)
PARALLEL PROCESSING (1)
PARENT-CHILD PROCESS RELATIONSHIPS (1)
PARTITIONED GLOBAL ADDRESS SPACE (1)
PATTERN CLUSTERING (1)
PEER TO PEER COMPUTING (1)
PERFORMANCE DEGRADATION (1)
PGAS PROGRAMMING MODEL (1)
PID VIRTUALIZATION (1)
POWER-CAPPING (1)
PROACTIVE FAULT TOLERANCE (1)
PROACTIVE JOB MIGRATION SCHEME (1)
PROCESS-MIGRATION (1)
PYTHON (1)
RAPL (1)
RDMA (1)
REDUNDANCY (1)
RESTART (1)
RESTART SCHEME (1)
RESTART TIMES (1)
RUN-TIME SYSTEM (1)
RUNCMS (1)
więcej

INFONA - portal komunikacji naukowej

Wyniki wyszukiwania

Dodaj adresata

Anulowanie wysłania wiadomości

Czy na pewno chcesz anulować wysłanie wiadomości?

Wyślij wiadomość

Opcje filtrowania

Data publikacji

Ustawianie zakresu dat

Podaj zakres dat dla filtrowania wyświetlonych wyników. Możesz podać datę początkową, końcową lub obie daty. Daty możesz wpisać ręcznie lub wybrać za pomocą kalendarza.

Słowa kluczowe

Zgłaszanie błędu / nadużycia

Nieudane wysłanie zgłoszenia

Ułatwienia dostępu