Posted in

Kubernetes 1.31:任务的 Pod 失败策略正式发布_AI阅读总结 — 包阅AI

包阅导读总结

1.

关键词:Kubernetes 1.31、Pod 失败政策、Jobs、控制成本、规则

2.

总结:本文介绍了 Kubernetes 1.31 中 Jobs 的 Pod 失败政策,包括其用途、工作机制、如何指定规则和动作,还提到了 Kubernetes 发起的 Pod 中断情况及示例,此政策有助于控制成本,相关工作仍在进展。

3.

主要内容:

– Kubernetes 1.31 中 Jobs 的 Pod 失败政策稳定可用

– 关于 Pod 失败政策

– 运行时 Pod 可能因多种原因失败,Jobs 应能应对

– 虽有 backoffLimit 字段,但设置大值会增加成本

– Pod 失败政策的作用

– 控制非可重试失败时立即失败 Job

– 区分可重试和非可重试错误

– 工作机制

– 在 Job 规范中指定,以规则列表形式

– 基于容器退出代码或 Pod 条件定义规则和动作

– Kubernetes 发起的 Pod 中断

– 引入 DisruptionTarget Pod 条件匹配失败规则

– 示例

– 展示了不同条件和动作的配置

– 更多信息

– 基于此概念的更多工作在进展

– 感谢相关人员的参与

思维导图:

文章地址:https://kubernetes.io/blog/2024/08/19/kubernetes-1-31-pod-failure-policy-for-jobs-goes-ga/

文章来源:kubernetes.io

作者:Kubernetes Blog

发布时间:2024/8/19 0:00

语言:英文

总字数:1092字

预计阅读时间:5分钟

评分:87分

标签:Kubernetes,Pod 失败策略,任务管理,资源优化,容器编排


以下为原文内容

本内容来源于用户推荐转载,旨在分享知识与观点,如有侵权请联系删除 联系邮箱 media@ilingban.com

Kubernetes 1.31: Pod Failure Policy for Jobs Goes GA

This post describes Pod failure policy, which graduates to stable in Kubernetes1.31, and how to use it in your Jobs.

About Pod failure policy

When you run workloads on Kubernetes, Pods might fail for a variety of reasons.Ideally, workloads like Jobs should be able to ignore transient, retriablefailures and continue running to completion.

To allow for these transient failures, Kubernetes Jobs include the backoffLimitfield, which lets you specify a number of Pod failures that you’re willing to tolerateduring Job execution. However, if you set a large value for the backoffLimit fieldand rely solely on this field, you might notice unnecessary increases in operatingcosts as Pods restart excessively until the backoffLimit is met.

This becomes particularly problematic when running large-scale Jobs withthousands of long-running Pods across thousands of nodes.

The Pod failure policy extends the backoff limit mechanism to help you reducecosts in the following ways:

  • Gives you control to fail the Job as soon as a non-retriable Pod failure occurs.
  • Allows you to ignore retriable errors without increasing the backoffLimit field.

For example, you can use a Pod failure policy to run your workload on more affordable spot machinesby ignoring Pod failures caused bygraceful node shutdown.

The policy allows you to distinguish between retriable and non-retriable Podfailures based on container exit codes or Pod conditions in a failed Pod.

How it works

You specify a Pod failure policy in the Job specification, represented as a listof rules.

For each rule you define match requirements based on one of the following properties:

  • Container exit codes: the onExitCodes property.
  • Pod conditions: the onPodConditions property.

Additionally, for each rule, you specify one of the following actions to takewhen a Pod matches the rule:

  • Ignore: Do not count the failure towards the backoffLimit or backoffLimitPerIndex.
  • FailJob: Fail the entire Job and terminate all running Pods.
  • FailIndex: Fail the index corresponding to the failed Pod.This action works with the Backoff limit per index feature.
  • Count: Count the failure towards the backoffLimit or backoffLimitPerIndex.This is the default behavior.

When Pod failures occur in a running Job, Kubernetes matches thefailed Pod status against the list of Pod failure policy rules, in the specifiedorder, and takes the corresponding actions for the first matched rule.

Note that when specifying the Pod failure policy, you must also set the Job’sPod template with restartPolicy: Never. This prevents race conditions betweenthe kubelet and the Job controller when counting Pod failures.

Kubernetes-initiated Pod disruptions

To allow matching Pod failure policy rules against failures caused bydisruptions initiated by Kubernetes, this feature introduces the DisruptionTargetPod condition.

Kubernetes adds this condition to any Pod, regardless of whether it’s managed bya Job controller, that fails because of a retriabledisruption scenario.The DisruptionTarget condition contains one of the following reasons thatcorresponds to these disruption scenarios:

In all other disruption scenarios, like eviction due to exceedingPod container limits,Pods don’t receive the DisruptionTarget condition because the disruptions werelikely caused by the Pod and would reoccur on retry.

Example

The Pod failure policy snippet below demonstrates an example use:

podFailurePolicy:  rules:  - action: Ignore    onPodConditions:    - type: DisruptionTarget  - action: FailJob    onPodConditions:    - type: ConfigIssue  - action: FailJob    onExitCodes:      operator: In      values: [ 42 ]

In this example, the Pod failure policy does the following:

  • Ignores any failed Pods that have the built-in DisruptionTargetcondition. These Pods don’t count towards Job backoff limits.
  • Fails the Job if any failed Pods have the custom user-suppliedConfigIssue condition, which was added either by a custom controller or webhook.
  • Fails the Job if any containers exited with the exit code 42.
  • Counts all other Pod failures towards the default backoffLimit (orbackoffLimitPerIndex if used).

Learn more

Based on the concepts introduced by Pod failure policy, the following additional work is in progress:

Get involved

This work was sponsored bybatch working groupin close collaboration with theSIG Apps,and SIG Node,and SIG Schedulingcommunities.

If you are interested in working on new features in the space we recommendsubscribing to our Slackchannel and attending the regular community meetings.

Acknowledgments

I would love to thank everyone who was involved in this project over the years -it’s been a journey and a joint community effort! The list below ismy best-effort attempt to remember and recognize people who made an impact.Thank you!