Abstract | Orchestration methods at the Infrastructure-as-a-Service (IaaS) level automate the deployment, scaling, and management of virtualized resources, typically across multiple hosts and data centres. While orchestration provides many advantages, it also introduces several challenges in the testing and debugging phases, particularly because of the distributed nature of the virtualized resources. Even the initial deployment of interdependent virtual machines (VMs) may end in fatal errors, since unpredictable timing conditions can change the overall initialisation process and lead to abnormal behaviour: in complex, non-deterministic environments, the set of VM configurations can drift from the expected states (‘configuration drift’). The overall motivation of our research is to improve the reliability of cloud-based infrastructures with minimal user interaction and to automate a significant part of the time-consuming debugging process. This paper examines the behaviour of cloud-based infrastructures during their deployment phase. We continued the adaptation of macrostep, a debugging technique based on replay and active control, to the field of cloud orchestration. To provide efficient support for developers troubleshooting major deployment-related errors, the fundamental macrostep mechanisms have been significantly extended with 1) the automated generation of collective breakpoint sets, 2) a parallel and robust method for traversing these consistent global states, and 3) the automated evaluation of global predicates in each global state of the VM set. Furthermore, the novel methods have been 4) generalized to a wider range of user scenarios by also targeting the Terraform orchestration tool (in addition to the already supported Occopus).
The paper describes the significantly enhanced approach, our design choices, and the implementation of the experimental debugger tool, which is validated through a use case: the deployment of a SLURM (HPC) cluster. |
---|