disk: get disk quantity and identify LingJun node from metadata #1605

Open
huww98 wants to merge 10 commits into kubernetes-sigs:master from huww98:metadata-eflo

@huww98
Contributor

@huww98 huww98 commented Jan 2, 2026

What type of PR is this?

/kind cleanup

What this PR does / why we need it:

Currently, the InstanceType query logic is coupled to the volume count logic, primarily because we need both Node.status.volumesAttached (for currently managed disks) and node.annotation (for DiskQuantity) from the same get-node response, and the status must be fetched in the middle of the volume count logic. This makes it hard to extract more info (e.g. nvmeSupport) from the fetched metadata apart from DiskQuantity.

So I moved the DiskQuantity logic to cloud/metadata. It fits well there because we have 3 different places to get this info.
To implement this, two major refactors were made in cloud/metadata:

  • Support non-string values. In-package components now all return any for the value, but we still keep all public APIs type-safe. This allows us to add 2 new non-string metadata fields: DiskQuantity (int32) and MachineKind (enum: ECS/LingJun).
  • Session support. Introduce a new m.WithSession(ctx) API to inject a context and allow retrying previously failed fetchers, so that fetching is retried whenever the NodeGetInfo CSI gRPC call is retried.

Then, two new fetchers, for ECS DescribeInstanceTypes and EFLO DescribeNodeType, are added. The K8s fetcher is extended to parse the annotation.

The disk driver is changed to use the added metadata fields.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

/hold
based on #1599, merge it first

Does this PR introduce a user-facing change?

NONE

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. labels Jan 2, 2026
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jan 2, 2026
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 7, 2026
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 8, 2026
@huww98
Contributor Author

huww98 commented Jan 8, 2026

/unhold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 8, 2026
@huww98
Contributor Author

huww98 commented Jan 9, 2026

Log on ECS

I0108 18:29:33.229251    9131 metadata.go:248] "retrieved metadata" provider="IMDS" key="RegionID" value="cn-beijing"
I0108 18:29:33.242843    9131 metadata.go:248] "retrieved metadata" provider="IMDS" key="InstanceID" value="i-2zedlg2qt0av5phg18uq"
E0108 18:29:33.265988    9131 nodeserver.go:206] get vmoc failed: unknown metadata key
I0108 18:29:33.834886    9131 metadata.go:248] "retrieved metadata" method="/csi.v1.Node/NodeGetInfo" provider="IMDS" key="MachineKind" value=1
I0108 18:29:33.834908    9131 eflo.go:89] "skip EFLO metadata fetcher" method="/csi.v1.Node/NodeGetInfo" machineKind=1
I0108 18:29:33.834913    9131 metadata.go:248] "retrieved metadata" method="/csi.v1.Node/NodeGetInfo" provider="IMDS" key="InstanceType" value="ecs.g8y.xlarge"
I0108 18:29:33.872589    9131 metadata.go:248] "retrieved metadata" method="/csi.v1.Node/NodeGetInfo" provider="ECS_Instance_Type" key="DiskQuantity" value=16
I0108 18:29:33.973957    9131 metadata.go:248] "retrieved metadata" method="/csi.v1.Node/NodeGetInfo" provider="IMDS" key="ZoneID" value="cn-beijing-i"

Log on LingJun

I0109 08:55:19.682164 3555667 metadata.go:248] "retrieved metadata" provider="env" key="RegionID" value="cn-wulanchabu"
I0109 08:55:19.759666 3555667 metadata.go:248] "retrieved metadata" provider="lingjun" key="InstanceID" value="e01-cn-zqb46i0iv7y"
E0109 08:55:19.872533 3555667 nodeserver.go:206] get vmoc failed: unknown metadata key
I0109 08:55:21.202312 3555667 metadata.go:248] "retrieved metadata" method="/csi.v1.Node/NodeGetInfo" provider="lingjun" key="MachineKind" value=2
I0109 08:55:21.383911 3555667 metadata.go:248] "retrieved metadata" method="/csi.v1.Node/NodeGetInfo" provider="EFLO" key="DiskQuantity" value=0
I0109 08:55:21.424987 3555667 metadata.go:248] "retrieved metadata" method="/csi.v1.Node/NodeGetInfo" provider="lingjun" key="ZoneID" value="cn-wulanchabu-c"

When /etc/eflo_config/lingjun_config is not found (works now!)

I0109 08:58:47.358973 3609408 metadata.go:248] "retrieved metadata" provider="env" key="RegionID" value="cn-wulanchabu"
I0109 08:58:47.416114 3609408 metadata.go:248] "retrieved metadata" provider="IMDS" key="InstanceID" value="e01-cn-zqb46i0iv7y"
E0109 08:58:47.480204 3609408 nodeserver.go:206] get vmoc failed: unknown metadata key
I0109 08:59:29.345592 3609408 metadata.go:248] "retrieved metadata" method="/csi.v1.Node/NodeGetInfo" provider="Kubernetes" key="MachineKind" value=2
I0109 08:59:29.564357 3609408 metadata.go:248] "retrieved metadata" method="/csi.v1.Node/NodeGetInfo" provider="EFLO" key="DiskQuantity" value=0
I0109 08:59:29.596003 3609408 metadata.go:248] "retrieved metadata" method="/csi.v1.Node/NodeGetInfo" provider="IMDS" key="ZoneID" value="cn-wulanchabu-c"

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 12, 2026
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jan 29, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: huww98
Once this PR has been reviewed and has the lgtm label, please assign huww98, mowangdk for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:
  • OWNERS [huww98]

    Need more approvers for rest parts.

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@huww98
Contributor Author

huww98 commented Feb 5, 2026

/hold
merge #1614 first

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 5, 2026
@huww98 huww98 force-pushed the metadata-eflo branch 2 times, most recently from 5d75c08 to ba18e54 Compare February 6, 2026 09:36
@huww98
Contributor Author

huww98 commented Feb 6, 2026

/unhold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 6, 2026
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 27, 2026
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 27, 2026
@mowangdk
Contributor

mowangdk commented May 3, 2026

Please resolve the conflict

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 3, 2026
Contributor

@mowangdk mowangdk left a comment


A Lingjun node type e2e test is still waiting to be merged. Need to merge this after that one.

 func (f *OpenAPIFetcher) FetchFor(ctx *mcontext, key MetadataKey) (middleware, error) {
 	switch key {
-	case InstanceID, ZoneID, InstanceType, AccountID:
+	case InstanceID, ZoneID, InstanceType:
Contributor


Where is AccountID?

Contributor Author


Moved to sts.go


instanceId, err := f.mPre.Get(InstanceID)
kind, err := f.mPre.GetAny(ctx, machineKind)
if err == nil && kind != MachineKindECS { // skip for non-ECS instances
Contributor


If we support metadata for all Lingjun instances in the future, we’ll need a more specific type.

Contributor Author


I'm not sure I understand this. The LingJun OpenAPI logic has moved to eflo.go, while IMDS support remains in imds.go, which works, as you can see in #1605 (comment)

return v, nil
}

func newImmutableProvider(provider MetadataProvider, name string) *immutableProvider {
Contributor


Please add comments to these providers. I forgot why this one is needed, and I’m not sure what ‘immutable’ means here.

Contributor Author


// immutable fetches metadata from next only once and caches the result.
// Print a log with name, key, and value when metadata is retrieved

Comment added
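
For readers following this thread, the behaviour described by that comment (fetch from next once, cache the result, log name/key/value on retrieval) might look roughly like the sketch below. Only newImmutableProvider's signature comes from the diff; the MetadataProvider interface shape, the funcProvider helper, the use of the standard log package instead of the project's logger, and the choice not to cache errors are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"log"
	"sync"
)

// MetadataProvider is an assumed, simplified interface for this sketch.
type MetadataProvider interface {
	Get(key string) (string, error)
}

// funcProvider adapts a plain function to MetadataProvider (hypothetical helper).
type funcProvider func(key string) (string, error)

func (f funcProvider) Get(key string) (string, error) { return f(key) }

// immutableProvider fetches each key from next at most once and caches the
// result; the value is assumed never to change afterwards.
type immutableProvider struct {
	next  MetadataProvider
	name  string
	mu    sync.Mutex
	cache map[string]string
}

func newImmutableProvider(provider MetadataProvider, name string) *immutableProvider {
	return &immutableProvider{next: provider, name: name, cache: map[string]string{}}
}

func (p *immutableProvider) Get(key string) (string, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if v, ok := p.cache[key]; ok {
		return v, nil // served from cache, next is not consulted again
	}
	v, err := p.next.Get(key)
	if err != nil {
		return "", err // assumption: failures are not cached
	}
	p.cache[key] = v
	log.Printf("retrieved metadata provider=%q key=%q value=%q", p.name, key, v)
	return v, nil
}

func main() {
	imds := funcProvider(func(key string) (string, error) { return "cn-beijing", nil })
	p := newImmutableProvider(imds, "IMDS")
	v, err := p.Get("RegionID")
	fmt.Println(v, err)
}
```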

Comment thread pkg/disk/utils.go
}

unmanaged := 0
for _, disk := range attachedDisks {
Contributor


Do we have any e2e tests for disk availability? Please add more tests here.

Contributor Author


We have a unit test for this function. We also have the external-storage e2e suite, which asserts that the disk limit isn't too high: it tries to attach as many disks as the plugin reports and ensures all of them can be attached and used by pods.

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label May 5, 2026
@huww98 huww98 force-pushed the metadata-eflo branch 2 times, most recently from 8aa08f9 to 39d9fd0 Compare May 5, 2026 09:00
huww98 added 10 commits May 15, 2026 17:05

Allow us to integrate non-string metadata.

Returns LingJun if:
- /etc/eflo_config/lingjun_config exists
- Node has label alibabacloud.com/lingjun-worker

Returns ECS if InstanceType has "ecs." prefix.
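
The detection rules in this commit message can be condensed into a single function for illustration. In the PR the checks appear to live in separate fetchers (the logs show MachineKind coming from the "IMDS", "lingjun", and "Kubernetes" providers), so the combined function, the nodeFacts struct, the MachineKindUnknown and MachineKindLingJun names, the precedence order, and the presence-only label check are all assumptions. The numeric values 1 (ECS) and 2 (LingJun) match the logged values.

```go
package main

import (
	"fmt"
	"strings"
)

// MachineKind distinguishes ECS from LingJun nodes.
type MachineKind int

const (
	MachineKindUnknown MachineKind = iota // 0: could not be determined
	MachineKindECS                        // 1, as logged on ECS
	MachineKindLingJun                    // 2, as logged on LingJun
)

const lingjunWorkerLabel = "alibabacloud.com/lingjun-worker"

// nodeFacts collects the inputs of the rules (hypothetical struct).
type nodeFacts struct {
	LingjunConfigExists bool              // whether /etc/eflo_config/lingjun_config exists
	NodeLabels          map[string]string // labels of the Kubernetes Node object
	InstanceType        string            // e.g. "ecs.g8y.xlarge"
}

// detectMachineKind applies the rules from the commit message in one place.
func detectMachineKind(f nodeFacts) MachineKind {
	if f.LingjunConfigExists {
		return MachineKindLingJun
	}
	// Assumption: only the presence of the label matters, not its value.
	if _, ok := f.NodeLabels[lingjunWorkerLabel]; ok {
		return MachineKindLingJun
	}
	if strings.HasPrefix(f.InstanceType, "ecs.") {
		return MachineKindECS
	}
	return MachineKindUnknown
}

func main() {
	fmt.Println(detectMachineKind(nodeFacts{InstanceType: "ecs.g8y.xlarge"}))
	fmt.Println(detectMachineKind(nodeFacts{LingjunConfigExists: true}))
}
```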

IMDS stands for Instance Metadata Service. It can now also be accessed from LingJun instances.

Introduce a new `m.WithSession(ctx)` API to inject a context and allow retry of
previously failed fetchers.

The errors are moved to the Metadata type from lazyInit, so all the errors can
be replaced at once. A slot is reserved for each type of fetcher, assuming each
type is used only once in the hierarchy.

A *mcontext argument is passed along to every fetcher and middleware, with ctx
from session and logger extracted from context. New inMemory mode is introduced
to minimize network requests. For example, if we have fetcher A failed but B
succeeded, then in the new session, the error from A is cleared, but we should
still use data from B because it is already present in memory.

Use a real JSON copied from a LingJun instance.

We should handle the case when server returned incorrect or multiple items.
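
As an illustration of the last commit message, a defensive check on a lookup response could look like the sketch below. The instanceTypeItem struct and pickInstanceType function are hypothetical stand-ins for the SDK response types; only the idea of rejecting empty, multiple, or mismatched items comes from the commit message.

```go
package main

import "fmt"

// instanceTypeItem is a hypothetical, trimmed-down response item.
type instanceTypeItem struct {
	InstanceTypeID string
	DiskQuantity   int32
}

// pickInstanceType accepts the response only if it contains exactly one item
// and that item is the instance type that was requested.
func pickInstanceType(items []instanceTypeItem, want string) (instanceTypeItem, error) {
	if len(items) != 1 {
		return instanceTypeItem{}, fmt.Errorf("expected exactly 1 instance type for %q, got %d", want, len(items))
	}
	if items[0].InstanceTypeID != want {
		return instanceTypeItem{}, fmt.Errorf("requested instance type %q, but server returned %q", want, items[0].InstanceTypeID)
	}
	return items[0], nil
}

func main() {
	resp := []instanceTypeItem{{InstanceTypeID: "ecs.g8y.xlarge", DiskQuantity: 16}}
	item, err := pickInstanceType(resp, "ecs.g8y.xlarge")
	fmt.Println(item.DiskQuantity, err)
}
```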
Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

3 participants