Install a separate driver instance for each test #201

Open
willie-yao wants to merge 7 commits into kubernetes-sigs:main from willie-yao:separate-driver-install

Conversation

@willie-yao
Contributor

Each It block now installs its own DRA driver release with a unique driver name, and uninstalls it on teardown.

Fixes #171

Changes:

  • New test/e2e/driver_test.go with installDriver(ctx, DriverConfig) helper. Shells out to helm (already required by setup-e2e.sh); honors HELM_CHART_PATH env var.
  • deployManifest accepts a DriverConfig and substitutes gpu.example.com -> per-test driver name (and example.com/gpu -> per-test extended resource name) in demo manifests before applying.
  • Beyond helm --wait, polls explicitly for DeviceClass + ResourceSlices + webhook readiness before returning from install.
  • Per-test DeferCleanup LIFO: driver-log diagnostics -> helm uninstall -> namespace delete with foreground propagation + termination wait.
  • setup-e2e.sh no longer installs the driver. Cert-manager + kind cluster + image build stay (cluster-wide infra).
  • Webhook specs share one driver pinned to gpu.example.com and run Ordered, Serial so their static testdata stays valid.
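The LIFO teardown order in the list above can be sketched without Ginkgo. This is a minimal illustration, assuming the arrows describe execution order; `cleanupStack` and `teardownOrder` are hypothetical names standing in for Ginkgo's `DeferCleanup`, not code from the PR.

```go
package main

import "fmt"

// cleanupStack is a hypothetical stand-in for Ginkgo's DeferCleanup:
// functions registered first run last.
type cleanupStack struct {
	fns []func()
}

// Defer registers fn to run during teardown.
func (s *cleanupStack) Defer(fn func()) {
	s.fns = append(s.fns, fn)
}

// Run executes the registered functions in reverse registration order.
func (s *cleanupStack) Run() {
	for i := len(s.fns) - 1; i >= 0; i-- {
		s.fns[i]()
	}
	s.fns = nil
}

// teardownOrder registers the three cleanup steps and returns the order
// in which they actually ran.
func teardownOrder() []string {
	var ran []string
	var s cleanupStack
	// Register in the reverse of the desired execution order.
	s.Defer(func() { ran = append(ran, "delete namespace") })
	s.Defer(func() { ran = append(ran, "helm uninstall") })
	s.Defer(func() { ran = append(ran, "dump driver logs") })
	s.Run()
	return ran
}

func main() {
	for i, step := range teardownOrder() {
		fmt.Printf("%d. %s\n", i+1, step)
	}
}
```

Registering the namespace deletion first means it runs last, so the driver logs are still available when the diagnostics step runs.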

Signed-off-by: William Yao <william2000yao@gmail.com>
@k8s-ci-robot k8s-ci-robot requested review from byako and pohly May 13, 2026 17:42
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels May 13, 2026
Comment thread test/e2e/driver_test.go

// verifyWebhook waits until the validating webhook is serving for the given
// DeviceClass by creating a dry-run ResourceClaim until it succeeds.
func verifyWebhook(ctx context.Context, deviceClassName string) {
Contributor Author

This code is mostly moved from e2e_setup_test.go at line 107: https://github.com/willie-yao/dra-example-driver/blob/main/test/e2e/e2e_setup_test.go#L107

Changes: it now takes a deviceClassName parameter, the test claim's name is suffixed with that DeviceClass name because webhook installs can now run in parallel, and the DeviceClass is included in error messages.

Comment thread test/e2e/driver_test.go
Comment on lines +211 to +237
driverPods, err := clientset.CoreV1().Pods(cfg.Namespace).List(ctx, metav1.ListOptions{
	LabelSelector: driverPodSelector,
})
if err != nil {
	fmt.Fprintf(GinkgoWriter, "Failed to list driver pods: %v\n", err)
	return
}
for _, pod := range driverPods.Items {
	for _, c := range pod.Spec.Containers {
		stream, err := clientset.CoreV1().Pods(cfg.Namespace).GetLogs(pod.Name, &corev1.PodLogOptions{
			Container: c.Name,
			TailLines: &tailLines,
		}).Stream(ctx)
		if err != nil {
			fmt.Fprintf(GinkgoWriter,
				"Driver pod %s, container %s: failed to get logs: %v\n",
				pod.Name, c.Name, err)
			continue
		}
		buf := new(bytes.Buffer)
		_, _ = io.Copy(buf, stream)
		stream.Close()
		fmt.Fprintf(GinkgoWriter,
			"Driver pod %s, container %s (last %d lines):\n%s\n",
			pod.Name, c.Name, tailLines, buf.String())
	}
}

@willie-yao
Contributor Author

/assign @nojnhuh

Comment thread test/e2e/driver_test.go Outdated

// runHelm executes a helm subcommand and returns its combined output.
func runHelm(ctx context.Context, args ...string) (string, error) {
	cmd := exec.CommandContext(ctx, "helm", args...)
Contributor

Can we use Helm's Go library instead of invoking the binary? To keep dependencies separate, we should turn the test directory into a separate module like CAPZ.

Contributor Author

Done in 784f6e9. CAPZ doesn't actually have an entirely separate submodule; it uses a .golangci.yml instead. I think a sub-module would work better here anyway.

Contributor

Ah, CAPI does this but I forgot CAPZ doesn't.

Signed-off-by: William Yao <william2000yao@gmail.com>
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: willie-yao
Once this PR has been reviewed and has the lgtm label, please ask for approval from nojnhuh. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels May 13, 2026
Comment thread .github/workflows/lint.yaml Outdated
Comment thread test/go.mod Outdated
@nojnhuh nojnhuh moved this from 🆕 New to 👀 In review in Dynamic Resource Allocation May 13, 2026
Drop the partitionable-devices spec and helmUpgradeDriver helper from main:
they assume a single global driver, which is incompatible with the per-test
driver install model on this branch. The spec is reintroduced adapted to
the new model in a follow-up commit.

The spec from kubernetes-sigs#150 reconfigured a single global driver via helmUpgradeDriver
and ran Serial. With the per-test driver install model on this branch it can
just install its own release with kubeletPlugin.gpuPartitions=4 via
DriverConfig.ExtraValues and run in parallel with everything else.
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 13, 2026
Signed-off-by: William Yao <william2000yao@gmail.com>
Contributor

@nojnhuh nojnhuh left a comment

Sorry for the annoying comments on the workspace/lint periphery here. I promise I'm starting to look at the main point of this PR.

Comment thread Makefile Outdated
Comment on lines +82 to +83
golangci-lint run ./...
cd test && golangci-lint run --build-tags=e2e ./...
Contributor

This hides lint failures in the test module if any issues are found in the main module. Can we combine into one call?

Suggested change
- golangci-lint run ./...
- cd test && golangci-lint run --build-tags=e2e ./...
+ golangci-lint run ./...
+ golangci-lint run --build-tags=e2e ./... ./test/...

Contributor Author

I tried combining them with golangci-lint run --build-tags=e2e ./... ./test/..., but it silently skips the test module: it logs "pattern ./test/...: directory prefix test does not contain main module or its selected dependencies" and exits 0 with 0 issues. I'm thinking of handling it by adding a separate make target for the test module:

.PHONY: lint lint-root lint-test
lint:
	@$(MAKE) -k lint-root lint-test

lint-root:
	golangci-lint run ./...

lint-test:
	cd test && golangci-lint run --build-tags=e2e ./...

Contributor

With a workspace set up, it seems to work for me: willie-yao/dra-example-driver@separate-driver-install...nojnhuh:dra-example-driver:workspace-lint

% make lint
golangci-lint run --build-tags=e2e ./... ./test/...
cmd/dra-example-webhook/main.go:183:1: Comment should end in a period (godot)
// function
^
test/e2e/e2e_setup_test.go:54:1: Comment should end in a period (godot)
// driverPodSelector finds kubelet plugin Pods within an installed driver's release namespace
^
2 issues:
* godot: 2
make: *** [Makefile:83: lint] Error 1

Contributor Author

Ah, that was completely my bad. I had v0.0.0 set from testing something earlier. Setting it to v0.2.1 fixed it! Thanks for checking this for me.

Comment thread .github/workflows/lint.yaml Outdated
Comment thread test/go.mod Outdated
k8s.io/api v0.36.0
k8s.io/apimachinery v0.36.0
k8s.io/client-go v0.36.0
sigs.k8s.io/dra-example-driver v0.0.0
Contributor

go mod tidy seems to want to put v0.2.1 here if this doesn't already exist. That should still ultimately resolve to the local version though and not that actual tagged version.

Contributor Author

I confirmed locally that with replace => ../ present, v0.2.1 or any other version resolves to the local checkout. Should I change this to v0.2.1?

Comment thread test/go.mod
Comment thread Makefile Outdated
Signed-off-by: William Yao <william2000yao@gmail.com>
Signed-off-by: William Yao <william2000yao@gmail.com>
@willie-yao willie-yao force-pushed the separate-driver-install branch from e5f08a6 to 27b36c3 Compare May 15, 2026 17:17
Comment thread test/e2e/driver_test.go
Comment on lines +140 to +143
// runUpgradeOrInstall emulates `helm upgrade --install`: if a release with
// the configured name already exists it is upgraded in place, otherwise a
// fresh install is performed.
func runUpgradeOrInstall(ctx context.Context, actionCfg *action.Configuration, install *action.Install, chrt *chart.Chart, values map[string]any) {
Contributor

In practice, will this ever actually perform an upgrade of an existing Helm release? If we expect each test to have a separate instance of the driver, then I think we'd want to error if that driver already exists instead of upgrade and risk affecting a different test.

The script used helm upgrade --install before mostly to be copy-pasteable, but that was always expected to first be run on a fresh kind cluster without the driver installed where a plain helm install should have worked.

Comment thread test/e2e/driver_test.go
actionCfg := newHelmActionConfig(cfg.Namespace, registryClient)

install := action.NewInstall(actionCfg)
install.ReleaseName = cfg.ReleaseName
Contributor

Could we automatically generate a release name instead? That will make sure there aren't conflicts between tests, tests don't have to decide for themselves what the driver should be called, and we don't have to validate them up front.
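One way to generate such names is a fixed prefix plus a random suffix, checked against the DNS-1123 label rules. This is a sketch under assumptions: `generateName` and the `dra-driver` prefix are hypothetical, and the regular expression is a local approximation of the label check rather than the apimachinery validation helper.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
	"regexp"
)

// dns1123Label matches a lowercase RFC 1123 label; the 63-character length
// limit is checked separately.
var dns1123Label = regexp.MustCompile(`^[a-z0-9]([-a-z0-9]*[a-z0-9])?$`)

// generateName returns prefix plus "-" and 8 random hex characters.
func generateName(prefix string) (string, error) {
	buf := make([]byte, 4)
	if _, err := rand.Read(buf); err != nil {
		return "", err
	}
	name := prefix + "-" + hex.EncodeToString(buf)
	if len(name) > 63 || !dns1123Label.MatchString(name) {
		return "", fmt.Errorf("generated name %q is not a valid DNS-1123 label", name)
	}
	return name, nil
}

func main() {
	// "dra-driver" is an illustrative prefix, not a name used by the PR.
	name, err := generateName("dra-driver")
	if err != nil {
		panic(err)
	}
	fmt.Println(name)
}
```

The same helper could produce the per-driver namespace name, so tests never have to pick or validate either value themselves.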

Comment thread test/e2e/driver_test.go
Expect(validation.IsDNS1123Label(cfg.ReleaseName)).To(BeEmpty(),
	"DriverConfig.ReleaseName %q must be a valid DNS-1123 label", cfg.ReleaseName)
if cfg.Namespace == "" {
	cfg.Namespace = cfg.ReleaseName
Contributor

Like the release name, can we generate a new, separate namespace for each instance of the driver?

Comment thread test/e2e/driver_test.go
Comment on lines +61 to +64
// maxDriverNameLen is the longest driver name that fits within Linux's
// 108-byte UNIX_PATH_MAX after the kubelet appends its registrar socket
// prefix and per-pod UID suffix.
maxDriverNameLen = 28
Contributor

Can you expand on how this value was calculated?
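A length budget of this kind is usually the path limit minus the fixed parts of the socket path. The sketch below shows that arithmetic only; the directory and suffix are assumptions about the socket path layout, not values confirmed anywhere in this thread, and `maxNameLen` is a hypothetical helper.

```go
package main

import "fmt"

// maxNameLen returns how many bytes remain for a name embedded in a socket
// path of the form prefix + name + suffix, given a total byte limit.
func maxNameLen(limit int, prefix, suffix string) int {
	return limit - len(prefix) - len(suffix)
}

func main() {
	const unixPathMax = 108 // the limit cited in the code comment under review

	// Assumed layout, not confirmed by the PR:
	//   <registry dir>/<driver name>-<pod UID>-reg.sock
	prefix := "/var/lib/kubelet/plugins_registry/"
	suffix := "-" + "00000000-0000-0000-0000-000000000000" + "-reg.sock"

	fmt.Println(maxNameLen(unixPathMax, prefix, suffix))
}
```

Under these assumptions the program prints 28. It does not reserve a byte for a trailing NUL, and whether the author derived the constant this way is not stated in the thread.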

Comment thread test/e2e/e2e_test.go
name string
fileName string
}{
{name: "v1 ResourceClaim", fileName: "invalid_rc_v1.yaml"},
Contributor

Does it simplify things if we construct these ResourceClaims inline vs. reading them from a file?

// with the per-test driver name and (when set) extended resource name.
func substituteDriverIdentifiers(raw string, drv DriverConfig) string {
	GinkgoHelper()
	out := strings.ReplaceAll(raw, defaultDeviceClassName, drv.DriverName)
Contributor

Eventually, the manifests will be referencing other drivers in addition to gpu.example.com. Some manifests may even rely on multiple separate instances of the driver running different profiles. Not a blocker for this PR, but we should find a better way to handle dynamic driver names.
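One possible direction for multiple driver names is a placeholder-to-name map applied in a single pass. This is only a sketch of that idea, not a proposal from the thread: `substituteNames`, `tpu.example.com`, and the replacement names are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// substituteNames rewrites every placeholder key found in raw to its
// per-test value in one pass. Longer placeholders take priority so that a
// key which is a prefix of another key cannot shadow it.
func substituteNames(raw string, names map[string]string) string {
	keys := make([]string, 0, len(names))
	for k := range names {
		keys = append(keys, k)
	}
	sort.Slice(keys, func(i, j int) bool {
		if len(keys[i]) != len(keys[j]) {
			return len(keys[i]) > len(keys[j])
		}
		return keys[i] < keys[j]
	})
	pairs := make([]string, 0, 2*len(keys))
	for _, k := range keys {
		pairs = append(pairs, k, names[k])
	}
	// strings.NewReplacer compares old strings in argument order and does
	// not rescan text it has already replaced.
	return strings.NewReplacer(pairs...).Replace(raw)
}

func main() {
	manifest := "deviceClassName: gpu.example.com\nother: tpu.example.com\n"
	out := substituteNames(manifest, map[string]string{
		"gpu.example.com": "gpu-a1b2.example.com", // hypothetical per-test names
		"tpu.example.com": "tpu-c3d4.example.com",
	})
	fmt.Print(out)
}
```

Unlike chained `strings.ReplaceAll` calls, a single replacer pass never re-substitutes text it has already produced, which matters once generated names can contain other placeholders.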

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Improve e2e test isolation, error reporting, and extensibility

3 participants