Improve e2e test isolation, error reporting, and extensibility #171

@willie-yao

Description

The e2e tests were recently converted from bash scripts to Go/Ginkgo. They work well but carry over some design choices from the bash era that make them harder to debug, extend, and maintain. This tracks the overall cleanup effort.

Key areas:

  • Error messages: Several Eventually calls swallow errors, making failures hard to debug (e.g. pod-not-found shows up as a generic timeout after 120s)
  • Test isolation: All 8 test manifests are deployed together in BeforeSuite with a global observedGPUs map tracking state across tests. Tests can't run independently and order matters (test7 is deliberately deployed early)
  • Boilerplate: The sharing tests (3, 4, 5, 6) repeat nearly identical verification logic with minor variations
  • Per-test driver config: The driver is installed once with fixed Helm values, so no test can override them. This also makes future work such as upgrade testing harder

Planned improvements:

  • Make each test self-contained (deploy/cleanup per test context)
  • Reduce boilerplate with table-driven tests and shared helpers
  • Support per-test Helm values (foundation for upgrade tests)
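
The three items above could fit together roughly as follows. This is a stdlib-only sketch under assumptions, not the planned implementation: sharingCase, its fields, the Helm key names, and the expected GPU counts are all hypothetical, and the real sharing tests may vary along different dimensions. Each case carries its own Helm overrides and its own expectations, and the shared helper tracks observed GPUs per case instead of in a global map. In Ginkgo, the same table shape maps onto DescribeTable and Entry.

```go
package main

import (
	"fmt"
	"sort"
)

// sharingCase is a hypothetical description of one self-contained sharing test.
type sharingCase struct {
	name       string
	helmValues map[string]string // hypothetical per-test Helm overrides
	pods       []string          // pods whose GPU assignment is checked
	wantGPUs   int               // distinct GPUs the pods should observe
}

// helmSetArgs renders overrides as "--set key=value" pairs in sorted key
// order, so the generated command line is deterministic.
func helmSetArgs(values map[string]string) []string {
	keys := make([]string, 0, len(values))
	for k := range values {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	args := make([]string, 0, 2*len(keys))
	for _, k := range keys {
		args = append(args, "--set", k+"="+values[k])
	}
	return args
}

// verifySharing is the shared helper: it counts the distinct GPUs seen by the
// case's pods. The observed set is local to the call, so cases do not depend
// on each other or on execution order.
func verifySharing(c sharingCase, gpuByPod map[string]string) error {
	observed := map[string]bool{}
	for _, pod := range c.pods {
		gpu, ok := gpuByPod[pod]
		if !ok {
			return fmt.Errorf("%s: pod %q reported no GPU", c.name, pod)
		}
		observed[gpu] = true
	}
	if len(observed) != c.wantGPUs {
		return fmt.Errorf("%s: pods %v observed %d distinct GPUs, want %d",
			c.name, c.pods, len(observed), c.wantGPUs)
	}
	return nil
}

func main() {
	cases := []sharingCase{
		{name: "shared", helmValues: map[string]string{"example.mode": "shared"},
			pods: []string{"pod0", "pod1"}, wantGPUs: 1},
		{name: "exclusive", helmValues: map[string]string{"example.mode": "exclusive"},
			pods: []string{"pod0", "pod1"}, wantGPUs: 2},
	}
	// Fake observations standing in for what a real test would read from the pods.
	fake := map[string]map[string]string{
		"shared":    {"pod0": "gpu-a", "pod1": "gpu-a"},
		"exclusive": {"pod0": "gpu-a", "pod1": "gpu-b"},
	}
	for _, c := range cases {
		fmt.Println(c.name, helmSetArgs(c.helmValues), verifySharing(c, fake[c.name]))
	}
}
```

In a real suite, each case would install the driver with its own arguments, deploy its manifest, run the shared check, and clean up, which is what allows a later upgrade test to start from different values.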

Projects

Status

🏗 In progress