A Dynamic Resource Management Framework and Reconfiguration Strategies for Cloud-native Bulk Synchronous Parallel Applications


Student Name: Krishna Chaitanya Reddy Chitta
Defense Date:
Location: Eaton Hall, Room 2001B
Chair: Hongyang Sun

Committee Members: David Johnson, Sumaiya Shomaji

Abstract:

Many High Performance Computing (HPC) applications following the Bulk Synchronous Parallel (BSP) model are increasingly deployed in cloud-native, multi-tenant container environments such as Kubernetes. Unlike dedicated HPC clusters, these shared platforms introduce resource virtualization and variability, making BSP applications more susceptible to performance fluctuations. Workload imbalance across supersteps can trigger the straggler effect, where faster tasks wait at synchronization barriers for slower ones, increasing overall execution time. Existing BSP resource management approaches typically assume static workloads and reuse a single configuration throughout execution. However, real-world workloads vary due to dynamic data and system conditions, making static configurations suboptimal. This limitation underscores the need for adaptive resource management strategies that respond to workload changes while considering reconfiguration costs.
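As a rough illustration of the straggler effect described above, the following minimal sketch assumes each superstep ends at a barrier, so its duration equals that of its slowest task. The function names and the per-task timings are hypothetical and chosen only for illustration; they are not taken from the thesis.

```python
def bsp_makespan(superstep_task_times):
    """Total run time when every superstep ends at a synchronization barrier.

    Each inner list holds the per-task times of one superstep; the superstep
    lasts as long as its slowest task.
    """
    return sum(max(times) for times in superstep_task_times)


def barrier_idle_time(superstep_task_times):
    """Total time that faster tasks spend waiting at barriers."""
    return sum(
        max(times) - t
        for times in superstep_task_times
        for t in times
    )


# Hypothetical timings: both workloads perform the same total work (24 units).
balanced = [[4.0, 4.0, 4.0], [4.0, 4.0, 4.0]]
imbalanced = [[2.0, 4.0, 6.0], [6.0, 2.0, 4.0]]

print(bsp_makespan(balanced), barrier_idle_time(balanced))
print(bsp_makespan(imbalanced), barrier_idle_time(imbalanced))
```

Under these assumed numbers the imbalanced workload takes longer even though the total work is identical, because every barrier waits for the slowest task.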

 

To address these limitations, we evaluate a dynamic, data-driven resource management framework tailored for cloud-native BSP applications. The framework integrates workload profiling, time-series forecasting, and predictive performance modeling to estimate task execution behavior under varying workload and resource conditions. It explicitly models the trade-off between the performance gains achieved through reconfiguration and the checkpointing and migration costs incurred during container reallocation. Multiple reconfiguration strategies are evaluated, spanning simple window-based heuristics, dynamic programming methods, and reinforcement learning approaches. In extensive experiments, the framework achieves up to a 24.5% improvement in total execution time compared to a static baseline configuration. Furthermore, we systematically analyze the performance of each strategy under varying workload characteristics, simulation lengths, and checkpoint penalties, and provide guidance on selecting the most appropriate strategy for a given workload environment.
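The trade-off between reconfiguration gains and checkpoint penalties can be sketched as a small dynamic program. This is a simplified illustration under stated assumptions, not the thesis's actual formulation: `pred_time[t][c]` is a hypothetical predicted duration of superstep `t` under configuration `c`, and `switch_cost` is a hypothetical flat penalty (checkpointing plus migration) charged whenever the configuration changes.

```python
def plan_reconfigurations(pred_time, switch_cost, start_config=0):
    """Return (total_time, schedule) minimizing predicted time plus switch penalties.

    pred_time[t][c]: assumed predicted duration of superstep t under config c.
    switch_cost: assumed flat penalty paid each time the configuration changes.
    """
    n_cfg = len(pred_time[0])
    # best[c] = minimal total time so far, given the current superstep runs in config c
    best = [
        pred_time[0][c] + (0.0 if c == start_config else switch_cost)
        for c in range(n_cfg)
    ]
    back = []  # back[t-1][c] = config used in superstep t-1 on the best path to (t, c)
    for t in range(1, len(pred_time)):
        cheapest_prev = min(range(n_cfg), key=lambda c: best[c])
        new_best, choices = [], []
        for c in range(n_cfg):
            stay = best[c]
            switch = best[cheapest_prev] + switch_cost
            if stay <= switch:
                new_best.append(stay + pred_time[t][c])
                choices.append(c)
            else:
                new_best.append(switch + pred_time[t][c])
                choices.append(cheapest_prev)
        best = new_best
        back.append(choices)

    # Trace the optimal schedule backwards from the cheapest final configuration.
    cfg = min(range(n_cfg), key=lambda c: best[c])
    total = best[cfg]
    schedule = [cfg]
    for choices in reversed(back):
        cfg = choices[cfg]
        schedule.append(cfg)
    schedule.reverse()
    return total, schedule


# Hypothetical predictions: config 0 suits the first two supersteps, config 1 the rest.
pred = [[10, 14], [10, 14], [16, 9], [16, 9], [16, 9]]
print(plan_reconfigurations(pred, switch_cost=5))
print(plan_reconfigurations(pred, switch_cost=100))
```

With the low assumed penalty the plan switches once, when the workload shifts; with the high penalty it stays on the initial configuration throughout, which mirrors why the preferred strategy depends on the checkpoint penalty.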

Degree: MS Thesis Defense (CS)
Degree Type: MS Thesis Defense
Degree Field: Computer Science