A Dynamic Resource Management Framework and Reconfiguration Strategies for Cloud-native Bulk Synchronous Parallel Applications
David Johnson
Sumaiya Shomaji
Many High Performance Computing (HPC) applications following the Bulk Synchronous Parallel
(BSP) model are increasingly deployed in cloud-native, multi-tenant container environments such
as Kubernetes. Unlike dedicated HPC clusters, these shared platforms introduce resource virtualization
and variability, making BSP applications more susceptible to performance fluctuations.
Workload imbalance across supersteps can trigger the straggler effect, where faster tasks wait
at synchronization barriers for slower ones, increasing overall execution time. Existing BSP resource
management approaches typically assume static workloads and reuse a single configuration
throughout execution. However, real-world workloads vary due to dynamic data and system conditions,
making static configurations suboptimal. This limitation underscores the need for adaptive
resource management strategies that respond to workload changes while considering reconfiguration
costs.
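The straggler effect described above can be illustrated with a minimal sketch. It assumes the simplest BSP timing model, where a superstep finishes only when its slowest task reaches the barrier; the function names and sample task times are hypothetical and are not taken from the framework itself.

```python
def superstep_time(task_times):
    """A superstep ends when the slowest task reaches the barrier."""
    return max(task_times)


def barrier_idle_time(task_times):
    """Total time faster tasks spend waiting at the barrier for the slowest one."""
    slowest = max(task_times)
    return sum(slowest - t for t in task_times)


def total_execution_time(supersteps):
    """Sum of per-superstep times; communication cost is ignored in this sketch."""
    return sum(superstep_time(tasks) for tasks in supersteps)


# Hypothetical per-task times (seconds) for two supersteps on four workers.
# Both workloads perform the same total amount of work (80 task-seconds).
balanced = [[10, 10, 10, 10], [10, 10, 10, 10]]
imbalanced = [[5, 5, 5, 25], [10, 10, 10, 10]]

print(total_execution_time(balanced))
print(total_execution_time(imbalanced))
print(barrier_idle_time(imbalanced[0]))
```

Under this model, one slow task in the first superstep lengthens the whole run even though the total work is unchanged, which is the effect a static configuration cannot correct when the imbalance shifts over time.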
To address this limitation, we evaluate a dynamic, data-driven resource management framework
tailored for cloud-native BSP applications. The framework integrates workload profiling,
time-series forecasting, and predictive performance modeling to estimate task execution behavior
under varying workload and resource conditions. The framework explicitly models the trade-off
between performance gains achieved through reconfiguration and the associated checkpointing
and migration costs incurred during container reallocation. Multiple reconfiguration strategies
are evaluated, spanning simple window-based heuristics, dynamic programming methods, and
reinforcement learning approaches. In our experimental evaluation, the framework
reduces total execution time by up to 24.5% compared to a baseline static configuration.
Furthermore, we systematically analyze the performance of each strategy under varying
workload characteristics, simulation lengths, and checkpoint penalties, and provide guidance on
selecting the most appropriate strategy for a given workload environment.
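The trade-off between reconfiguration gains and checkpointing costs can be sketched as a small dynamic program. This is an illustrative sketch under stated assumptions, not the framework's actual method: it assumes per-superstep execution times have already been predicted for each candidate configuration, and that every configuration change incurs one fixed penalty. The names `plan_reconfigurations`, `pred`, and `switch_cost` are hypothetical.

```python
def plan_reconfigurations(pred, switch_cost, start_config=0):
    """Choose one configuration per superstep to minimize total time.

    pred[t][c]  : predicted time of superstep t under configuration c (assumed given)
    switch_cost : fixed checkpoint-and-migration penalty per change (assumption)
    Returns (minimum total time, list of chosen configurations).
    """
    n_cfg = len(pred[0])

    def penalty(prev, cur):
        return 0 if prev == cur else switch_cost

    # best[c] = minimum total time so far if the current superstep runs under c
    best = [pred[0][c] + penalty(start_config, c) for c in range(n_cfg)]
    back = []  # back[t-1][c] = best predecessor configuration for c at superstep t

    for t in range(1, len(pred)):
        new_best, choice = [], []
        for c in range(n_cfg):
            prev = min(range(n_cfg), key=lambda p: best[p] + penalty(p, c))
            new_best.append(best[prev] + penalty(prev, c) + pred[t][c])
            choice.append(prev)
        best = new_best
        back.append(choice)

    # Walk the predecessor table backwards to recover the plan.
    last = min(range(n_cfg), key=lambda c: best[c])
    plan = [last]
    for choice in reversed(back):
        plan.append(choice[plan[-1]])
    plan.reverse()
    return best[last], plan


# Hypothetical predictions: configuration 0 is faster early on,
# configuration 1 is faster once the workload grows at superstep 2.
pred = [[10, 12], [10, 12], [20, 12], [20, 12], [20, 12]]

cheap_cost, cheap_plan = plan_reconfigurations(pred, switch_cost=5)
costly_cost, costly_plan = plan_reconfigurations(pred, switch_cost=30)
print(cheap_cost, cheap_plan)
print(costly_cost, costly_plan)
```

With a small penalty the plan switches configurations when the workload shifts; with a large penalty the same predictions favor staying on the initial configuration, which mirrors the sensitivity to checkpoint penalties that the evaluation examines.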