Connecting the Dots: Towards a Unified Multi-Cluster AI/ML Experience - Qing Hao, Red Hat & Chen Yu, Microsoft

Published: 04 September 2024
on channel: CNCF [Cloud Native Computing Foundation]

Connecting the Dots: Towards a Unified Multi-Cluster AI/ML Experience - Qing Hao, Red Hat & Chen Yu, Microsoft

Cloud-native infrastructure is now vital for AI/ML, and administrative complexity together with the growing demand for compute resources is driving developers towards multi-cluster patterns. Batch scheduling projects such as Kueue are valuable for efficient AI/ML training in a single Kubernetes cluster. Multi-cluster management platforms such as OCM and Fleet simplify cluster management and provide advanced scheduling features. We hope to combine the best of both worlds to simplify user operations and reduce confusion between different systems. In this talk, we will show how SIG Multi-Cluster's newly proposed API, ClusterProfile, combined with OCM, Fleet, and Kueue, can address these challenges. We will demonstrate that MultiKueue setup can be easily automated with the help of the ClusterProfile API, and that, with a few tweaks, users can use OCM's and Fleet's advanced scheduling features through MultiKueue to place AI/ML jobs intelligently across clusters, maximizing utilization of resources such as GPUs and saving costs.
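
As a rough illustration of the automation described above, the sketch below derives MultiKueue objects from a list of ClusterProfile objects. It is not the speakers' implementation. The API versions, the `multikueue_manifests` helper, the `multikueue-config` name, and the `<cluster>-kubeconfig` Secret naming are all assumptions made for the example, and the field layout may differ between Kueue releases. How the kubeconfig Secrets get created is out of scope here.

```python
from typing import Any

# Assumed API version; check the Kueue release you run.
KUEUE_API = "kueue.x-k8s.io/v1alpha1"


def multikueue_manifests(
    cluster_profiles: list[dict[str, Any]],
    config_name: str = "multikueue-config",  # hypothetical name
    secret_suffix: str = "-kubeconfig",      # hypothetical Secret naming scheme
) -> list[dict[str, Any]]:
    """Build one MultiKueueCluster per ClusterProfile plus one MultiKueueConfig."""
    names = sorted(p["metadata"]["name"] for p in cluster_profiles)

    clusters = [
        {
            "apiVersion": KUEUE_API,
            "kind": "MultiKueueCluster",
            "metadata": {"name": name},
            "spec": {
                # Points at a Secret holding the worker cluster's kubeconfig.
                "kubeConfig": {
                    "locationType": "Secret",
                    "location": name + secret_suffix,
                }
            },
        }
        for name in names
    ]

    config = {
        "apiVersion": KUEUE_API,
        "kind": "MultiKueueConfig",
        "metadata": {"name": config_name},
        "spec": {"clusters": names},
    }
    return clusters + [config]


if __name__ == "__main__":
    # Minimal ClusterProfile-shaped inputs; only metadata.name is read.
    profiles = [
        {"kind": "ClusterProfile", "metadata": {"name": "worker-b"}},
        {"kind": "ClusterProfile", "metadata": {"name": "worker-a"}},
    ]
    for manifest in multikueue_manifests(profiles):
        print(manifest["kind"], manifest["metadata"]["name"])
```

The resulting dictionaries could be serialized to YAML and applied to the manager cluster. A real controller would instead watch ClusterProfile objects and reconcile these resources as clusters join or leave.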
