EC2 Simplified Automatic Recovery conflicts with Karpenter's termination behavior #8821

@ellistarn

Description

EC2 Simplified Automatic Recovery is enabled by default for many instance types. When triggered, recovery can keep an impaired node in place for much longer than it would take Karpenter to simply replace it.

Karpenter's interruption controller handles spot interruptions and scheduled maintenance via EventBridge, but doesn't watch the system status checks that trigger EC2 Auto Recovery.

As a potential design option, we could:

  1. Always disable auto recovery in maintenanceOptions on Karpenter-managed launch templates
  2. Watch for status check failures via DescribeInstanceStatus and trigger node replacement
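
To make the two options concrete, here is a minimal sketch in Python (Karpenter itself is written in Go, so this is illustrative only). It assumes the launch template data accepts `MaintenanceOptions.AutoRecovery` set to `disabled`, and that status data arrives in the shape of a `DescribeInstanceStatus` response. The function name `impaired_instance_ids` and the two-minute grace period are hypothetical, not anything Karpenter defines.

```python
from datetime import datetime, timedelta, timezone

# Option 1: launch template data fragment that opts out of auto recovery.
LAUNCH_TEMPLATE_OVERRIDES = {"MaintenanceOptions": {"AutoRecovery": "disabled"}}


def impaired_instance_ids(response, now, grace=timedelta(minutes=2)):
    """Option 2: pick instances whose system status check has been failing
    for at least `grace` (a hypothetical threshold), given a dict shaped
    like a DescribeInstanceStatus response."""
    impaired = []
    for status in response.get("InstanceStatuses", []):
        system = status.get("SystemStatus", {})
        if system.get("Status") != "impaired":
            continue
        # ImpairedSince is only reported on checks that have failed.
        since = [
            detail["ImpairedSince"]
            for detail in system.get("Details", [])
            if detail.get("Status") == "failed" and "ImpairedSince" in detail
        ]
        if since and now - min(since) >= grace:
            impaired.append(status["InstanceId"])
    return impaired
```

In a real controller the `response` would come from a polling call such as boto3's `describe_instance_status`, and the returned IDs would be handed to whatever path already replaces interrupted nodes. The grace period is there only to avoid reacting to transient failures; its value is a guess.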

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments; they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Metadata

Labels

bug (Something isn't working), needs-triage (Issues that need to be triaged)
