Improve e2e test isolation, error reporting, and extensibility

The e2e tests were recently converted from bash scripts to Go/Ginkgo. They work well but carry over some design choices from the bash era that make them harder to debug, extend, and maintain. This tracks the overall cleanup effort.

Key areas:
- **Error messages**: Several `Eventually` calls swallow the underlying error, so failures are hard to debug (e.g. a pod-not-found error shows up only as a generic timeout after 120s)
- **Test isolation**: All 8 test manifests are deployed together in `BeforeSuite` with a global `observedGPUs` map tracking state across tests. Tests can't run independently and order matters (test7 is deliberately deployed early)
- **Boilerplate**: The sharing tests (3, 4, 5, 6) repeat nearly identical verification logic with minor variations
- **Per-test driver config**: The driver is installed once with fixed Helm values. No test can use different values, and this makes future work like upgrade testing harder
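
To illustrate the error-message point, here is a minimal stdlib-only sketch of a polling loop that keeps the last error and wraps it into the timeout message. It is not Gomega's `Eventually` and not the suite's current code; `pollUntil`, `errPodNotFound`, and the demo durations are hypothetical names picked for the example.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Short demo durations so the example finishes quickly; a real suite would
// use its own timeout (the issue mentions 120s).
const (
	demoTimeout  = 50 * time.Millisecond
	demoInterval = 10 * time.Millisecond
)

// errPodNotFound is a hypothetical error standing in for a failed pod lookup.
var errPodNotFound = errors.New("pod not found")

// pollUntil retries check until it returns nil or the timeout elapses.
// On timeout it wraps the last error from check, so the failure message
// names the underlying cause instead of reporting only "timed out".
func pollUntil(timeout, interval time.Duration, check func() error) error {
	deadline := time.Now().Add(timeout)
	for attempt := 1; ; attempt++ {
		err := check()
		if err == nil {
			return nil
		}
		if !time.Now().Before(deadline) {
			return fmt.Errorf("timed out after %s (%d attempts): %w", timeout, attempt, err)
		}
		time.Sleep(interval)
	}
}

func main() {
	// The check always fails, so the returned error carries both the
	// timeout and the pod-not-found cause.
	err := pollUntil(demoTimeout, demoInterval, func() error { return errPodNotFound })
	fmt.Println(err)
}
```

The same idea should carry over to the Ginkgo suite by having the polled function return its error rather than discarding it, so the matcher failure can include it.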

Planned improvements:

- [ ] Make each test self-contained (deploy/cleanup per test context)
- [ ] Reduce boilerplate with table-driven tests and shared helpers
- [ ] Support per-test Helm values (foundation for upgrade tests)
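
As a rough sketch of how the table-driven and per-test Helm values items could fit together, the example below gives each table entry its own manifest and value overrides, layered over suite defaults and rendered as `--set` flags. Everything here is an assumption for illustration: `sharingCase`, `mergeValues`, `helmSetArgs`, the manifest file names, and the value keys (`image.tag`, `logLevel`) are hypothetical and not taken from the repo or the chart.

```go
package main

import (
	"fmt"
	"sort"
)

// sharingCase is a hypothetical table entry: one self-contained test with
// its own manifest and optional Helm value overrides.
type sharingCase struct {
	name      string
	manifest  string
	overrides map[string]string
}

// suiteDefaults holds hypothetical Helm values shared by all cases.
var suiteDefaults = map[string]string{"image.tag": "dev", "logLevel": "info"}

// cases lists hypothetical entries; the manifest names are placeholders.
var cases = []sharingCase{
	{name: "test3", manifest: "test3.yaml"},
	{name: "test4", manifest: "test4.yaml", overrides: map[string]string{"logLevel": "debug"}},
}

// mergeValues layers overrides on top of defaults without mutating either map.
func mergeValues(defaults, overrides map[string]string) map[string]string {
	merged := make(map[string]string, len(defaults)+len(overrides))
	for k, v := range defaults {
		merged[k] = v
	}
	for k, v := range overrides {
		merged[k] = v
	}
	return merged
}

// helmSetArgs renders values as "--set key=value" pairs, sorted by key so
// the resulting command line is deterministic.
func helmSetArgs(values map[string]string) []string {
	keys := make([]string, 0, len(values))
	for k := range values {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	args := make([]string, 0, 2*len(keys))
	for _, k := range keys {
		args = append(args, "--set", k+"="+values[k])
	}
	return args
}

func main() {
	// Each case computes its own install flags, so no state is shared
	// between entries.
	for _, tc := range cases {
		fmt.Println(tc.name, tc.manifest, helmSetArgs(mergeValues(suiteDefaults, tc.overrides)))
	}
}
```

In the real suite each entry would presumably drive a per-context install, deploy, verify, and cleanup cycle; the sketch only covers the value layering.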


Improve e2e test isolation, error reporting, and extensibility #171
