High-performance computing often involves sets of jobs or workloads that must be scheduled. If there are dependencies in the ordering of the jobs (e.g., pipelines or directed acyclic graphs) the user often has to carefully, manually submit the jobs in the right order and/or delay submitting dependent jobs until other jobs have finished. If the user can submit the entire workload with dependencies, then the scheduler has more information about future jobs in the workflow.
We have designed and implemented TrellisDAG, a system that combines the use of placeholder scheduling and a subsystem for describing workflows to provide novel mechanisms for computing non-trivial workloads with inter-job dependencies. TrellisDAG also has a modular architecture for implementing different scheduling policies, which will be the topic of future work. Currently, TrellisDAG supports: 1
A spectrum of mechanisms for users to specify both simple and complicated workflows.
2
The ability to load balance across multiple administrative domains.
3
A convenient tool to monitor complicated workflows.