Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
77 changes: 77 additions & 0 deletions .agents/skills/debug-csi/SKILL.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,77 @@
---
name: debug-csi
description: Build, deploy, and debug the alibaba-cloud-csi-driver on a test cluster. Use this skill when the user wants to build images, deploy via helm, check logs, restart components, or troubleshoot the CSI plugin/provisioner. Also trigger for "deploy csi", "check csi logs", "restart csi", "debug csi", "build and push", or any question about how to iterate on this driver in a dev cluster.
---

# Debug CSI Driver

This skill covers the build-deploy-debug loop for the CSI driver on an ACK test cluster.

Prerequisite: `.local/test-values.yaml` must exist. If it doesn't (or the cluster it refers to is unreachable), run `/init-dev-env` first.

## Build and deploy

### Uninstall component-center version (if present)

Before deploying via helm, check whether the CSI driver was installed via the ACK component center:

```bash
helm list -n kube-system
```

If there's a `csi` release, it's already a debug version — skip uninstall. Otherwise run:

```bash
../../init-dev-env/scripts/uninstall-csi-addon.sh
```

### Build and push images

```bash
PLATFORM=linux/amd64 IMAGE_REPO=<registry>/<namespace> IMAGE_TAG=latest ./build/build-all-multi.sh
```
Refer to `.local/test-values.yaml` for the registry and namespace.
`IMAGE_TAG` defaults to `latest`; set it to a custom tag (e.g. `v1.35.3-dev`) if needed.
The script builds and pushes all images: `csi-plugin`, `csi-plugin:init`, `csi-plugin:controller`, `csi-ossfs`, `csi-ossfs2`, `csi-alinas`, and their variants.

### Deploy via helm

```bash
helm upgrade -f .local/test-values.yaml -n kube-system ack-csi-plugin ./deploy/charts/alibaba-cloud-csi-driver
# One-time: set imagePullPolicy to Always
scripts/set-image-pull-policy.sh
```

If only images changed (no chart changes), rolling-restart instead of helm upgrade:
```bash
kubectl -n kube-system rollout restart deploy/csi-provisioner
kubectl -n kube-system rollout restart daemonset/csi-plugin
```

## Debugging and logs

**csi-plugin** (daemonset, runs on each node, handles NodePublish/NodeUnpublish):
```bash
kubectl -n kube-system logs -l app=csi-plugin -c csi-plugin --tail=100
```

**csi-provisioner** (deployment, handles ControllerPublish/ControllerUnpublish):
```bash
kubectl -n kube-system logs -l app=csi-provisioner -c csi-provisioner --tail=100
```

## Restarting components

Patching the pod template (env, imagePullPolicy, etc.) automatically triggers a rolling update.
`rollout restart` is only needed when the patch reports "no change" (e.g. image tag is the same).

## Test pod images

Use `mirrors-ssl.aliyuncs.com/busybox` or `mirrors-ssl.aliyuncs.com/debian` as test pod images.
Also use these for ephemeral containers and `kubectl debug` commands.
Docker Hub and other registries outside China are unreliable due to network issues.
`mirrors-ssl.aliyuncs.com` mirrors all popular official Docker Hub images.

## OSS / OSSFS

For OSSFS-related debugging (fuse pod image config, fuse pod logs, fuse pod lifecycle), read `references/oss.md`.
37 changes: 37 additions & 0 deletions .agents/skills/debug-csi/references/oss.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,37 @@
# OSS / OSSFS Debugging

OSSFS mounts are served by fuse pods managed by the csi-provisioner.

## Set up OSSFS debug environment

Run the setup script to configure everything in one shot (reads registry, namespace, and secret name from `.local/test-values.yaml`):

```bash
scripts/setup-oss-debug.sh [image-tag]
```

This does three things:
1. Sets the OSSFS and OSSFS2 image tags in the `csi-plugin` configmap (defaults to `latest`)
2. Patches `IMAGE_REPOSITORY_PREFIX` on csi-provisioner
3. Sets image pull secrets on both service accounts (`default` and `csi-fuse-ossfs`) in `ack-csi-fuse`

Rollout restart csi-provisioner if the patches report no change.

## Debugging fuse pods

List fuse pods (one per volume):

```bash
kubectl -n ack-csi-fuse get pods
kubectl -n ack-csi-fuse logs <pod-name>
```

## Restarting fuse pods

The fuse pod lifecycle is tied to ControllerPublish/ControllerUnpublish. To recreate a fuse pod with a new image:

1. Delete the consumer pod
2. Wait for ControllerUnpublish to clean up the fuse pod
3. Recreate the consumer pod

Simply deleting the fuse pod is not enough — it will not be recreated automatically.
23 changes: 23 additions & 0 deletions .agents/skills/debug-csi/scripts/set-image-pull-policy.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,23 @@
#!/bin/bash
set -e

kubectl -n kube-system patch deploy/csi-provisioner -p '
spec:
template:
spec:
containers:
- name: csi-provisioner
imagePullPolicy: Always
'

kubectl -n kube-system patch daemonset/csi-plugin -p '
spec:
template:
spec:
containers:
- name: csi-plugin
imagePullPolicy: Always
initContainers:
- name: init
imagePullPolicy: Always
'
53 changes: 53 additions & 0 deletions .agents/skills/debug-csi/scripts/setup-oss-debug.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,53 @@
#!/bin/bash
set -e

REPO_ROOT="$(git rev-parse --show-toplevel)"
VALUES_FILE="$REPO_ROOT/.local/test-values.yaml"

if [ ! -f "$VALUES_FILE" ]; then
echo "Error: $VALUES_FILE not found. Run /init-dev-env first." >&2
exit 1
fi

REGISTRY=$(yq '.images.registry' "$VALUES_FILE")
NAMESPACE=$(yq '.images.controller.repo | split("/") | .[0]' "$VALUES_FILE")
SECRET_NAME=$(yq '.imagePullSecrets[0]' "$VALUES_FILE")

IMAGE_REPOSITORY_PREFIX="$REGISTRY/$NAMESPACE"
IMAGE_TAG="${1:-latest}"

echo "Registry prefix: $IMAGE_REPOSITORY_PREFIX"
echo "Image tag: $IMAGE_TAG"
echo "Pull secret: $SECRET_NAME"

echo "--- Setting OSSFS image tags via configmap ---"
kubectl apply --server-side --field-manager=debug-ossfs -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
name: csi-plugin
namespace: kube-system
data:
fuse-ossfs: |
image-tag=$IMAGE_TAG
fuse-ossfs2: |
image-tag=$IMAGE_TAG
EOF

echo "--- Patching csi-provisioner IMAGE_REPOSITORY_PREFIX ---"
kubectl -n kube-system patch deploy/csi-provisioner -p "
spec:
template:
spec:
containers:
- name: csi-provisioner
env:
- name: IMAGE_REPOSITORY_PREFIX
value: $IMAGE_REPOSITORY_PREFIX
"

echo "--- Patching ack-csi-fuse service accounts ---"
kubectl -n ack-csi-fuse patch sa default -p "{\"imagePullSecrets\": [{\"name\": \"$SECRET_NAME\"}]}"
kubectl -n ack-csi-fuse patch sa csi-fuse-ossfs -p "{\"imagePullSecrets\": [{\"name\": \"$SECRET_NAME\"}]}"

echo "Done."
176 changes: 176 additions & 0 deletions .agents/skills/init-dev-env/SKILL.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,176 @@
---
name: init-dev-env
description: Initialize the local development environment for the alibaba-cloud-csi-driver project. Use this skill when the user wants to set up their dev environment, configure test cluster access, create image pull secrets, or write test-values.yaml. Also trigger when the user mentions "init dev", "setup dev env", "configure test cluster", or asks how to push/pull test images to ACR.
---

# Initialize Development Environment

This skill walks through setting up the local development environment for building, pushing, and deploying the CSI driver to a test ACK cluster.

## Overview of steps

1. Probe the environment (kubectl, aliyun CLI)
2. Gather ACR registry info
3. Confirm all details with the user
4. Uninstall existing CSI addon (if installed via ACK component center)
5. Write `.local/test-values.yaml`
6. Create image pull secrets

## Step 1: Probe environment

Run these two commands to gather cluster and identity info:

```bash
kubectl get cm -n kube-system ack-cluster-profile -ojsonpath='{.data}'
```

This returns JSON with `clusterid`, `uid` (Alibaba Cloud account ID), and `vpcid` among other fields.

```bash
aliyun sts GetCallerIdentity
```

This returns the current RAM identity including `AccountId`. If the `AccountId` matches the `uid` from the configmap, the aliyun CLI is properly configured for this cluster's account.

### If kubectl fails (no cluster accessible)

List available clusters from aliyun CLI for the user to select:

```bash
aliyun cs GET "/clusters" | jq '[.[] | {cluster_id, name, region_id, state}]'
```

If a suitable cluster is found, after user approval, write its kubeconfig to `~/.kube/config`:

```bash
aliyun cs GET "/k8s/<cluster-id>/user_config" | jq -r .config > ~/.kube/config
chmod 600 ~/.kube/config
```

If no cluster exists, ask the user if they want to create one. If `.local/create-cluster.sh` exists, offer to reuse it. Otherwise, gather parameters and write the script first:

List existing VPCs for the user to select:
```bash
aliyun vpc DescribeVpcs --RegionId <region> | jq '[.Vpcs.Vpc[] | {VpcId, VpcName, CidrBlock}]'
```

Parameters to gather:
- `N_NODES` (default 1)
- `ACK_REGION` (default cn-beijing)
- `ACK_ZONE` (default "cn-beijing-h cn-beijing-i cn-beijing-j", use a single zone for N_NODES=1)
- `VPC_ID` (optional, set to reuse an existing VPC; leave unset to create a new one)
- Cluster name (default `ack-csi-driver-test`)

Write `.local/create-cluster.sh` with the chosen parameters:
```bash
#!/bin/bash
set -e
source "$(git rev-parse --show-toplevel)/test/cluster/cluster_setup.sh"
VPC_ID=<vpc-id> N_NODES=<n> ACK_ZONE="<zone>" create-ack-cluster <cluster-name>
```

Then run it to create the cluster.

### If aliyun CLI fails

Check memory or other loaded context for account info. If nothing found, tell the user to install and configure aliyun CLI:
- Install: https://help.aliyun.com/cli/
- Configure: `aliyun configure` with their AccessKey ID/Secret and region

## Step 2: Gather ACR info

List registries already configured in `~/.docker/config.json`:

```bash
jq -r '.auths | keys[]' ~/.docker/config.json
```

Present these for the user to select from.

Get the cluster region:

```bash
aliyun cs GET "/clusters/<cluster-id>" | jq -r .region_id
```

If the selected registry is an `aliyuncs.com` endpoint whose region matches the cluster region, derive the VPC endpoint automatically:
- Personal edition: `registry.<region>.aliyuncs.com` → VPC: `registry-vpc.<region>.aliyuncs.com`
- Enterprise edition: `<name>-registry.<region>.cr.aliyuncs.com` → VPC: `<name>-registry-vpc.<region>.cr.aliyuncs.com`

If the registry region does NOT match the cluster region (or it's a custom domain where VPC endpoint cannot be derived), set both `registry` and `registryVPC` to the same value and warn the user that in-cluster pulls will go over the public network.

Then ask the user for their **ACR namespace** (the first path segment in image URLs, e.g., `huweiwen-test` in `registry.cn-beijing.aliyuncs.com/huweiwen-test/csi-plugin`).

## Step 3: Confirm with user

Present a summary for confirmation:
- Cluster ID
- Account UID
- Region
- ACR registry endpoint (public)
- ACR registry endpoint (VPC, for in-cluster pulls)
- ACR namespace
- Image pull secret name (convention: `<namespace>-registry`)

## Step 4: Uninstall existing CSI addon

Before deploying via helm, any existing CSI installation from the ACK component center must be removed. The bundled script `scripts/uninstall-csi-addon.sh` handles this automatically:

- Skips if CSI is already managed by Helm (debug version)
- Detects installed CSI addons (`csi-plugin`, `csi-provisioner`, `managed-csiprovisioner`) via the ACK API
- Uninstalls in dependency order: `csi-plugin` first, then `csi-provisioner` or `managed-csiprovisioner` (they are mutually exclusive)
- Waits for each removal to complete before proceeding
- Cleans up leftover resources (ack-csi-fuse namespace, stale ServiceAccounts)

```bash
scripts/uninstall-csi-addon.sh
```

Note: `managed-csiprovisioner` and `csi-provisioner` cannot coexist — only one will be present.

## Step 5: Write `.local/test-values.yaml`

Template:

```yaml
images:
registry: <registry>
registryVPC: <registryVPC>
controller:
repo: <namespace>/csi-plugin
tag: "latest-controller"
plugin:
repo: <namespace>/csi-plugin
tag: "latest"
pluginInit:
repo: <namespace>/csi-plugin
tag: "latest-init"

imagePullSecrets:
- <namespace>-registry

deploy:
clusterID: "<cluster-id>"
ramToken: v1 # Use v1 because it exists even when ACK csi addon is not installed
```

Replace placeholders with actual values gathered in previous steps.

## Step 6: Create image pull secrets

The bundled script `scripts/create-image-pull-secret.sh` handles creating the Kubernetes secret. It extracts the docker auth entry from local `~/.docker/config.json` and passes it directly as a `dockerconfigjson` secret (same format, no parsing needed).

The script reads registry and secret name from `.local/test-values.yaml`, so it requires no configuration. It passes through all arguments to `kubectl create secret`, enabling namespace selection.

Prerequisite: user must have run `docker login <registry>` locally first. If credentials are missing, the script tells them to do so.

Run the script to create secrets in both required namespaces:

```bash
scripts/create-image-pull-secret.sh -n kube-system
scripts/create-image-pull-secret.sh -n ack-csi-fuse
```

## Post-setup notes

After running the skill, remind the user of the build & deploy workflow with `/debug-csi`.
22 changes: 22 additions & 0 deletions .agents/skills/init-dev-env/scripts/create-image-pull-secret.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,22 @@
#!/bin/bash
set -e

ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")" && git rev-parse --show-toplevel)"
VALUES_FILE="$ROOT/.local/test-values.yaml"

REGISTRY=$(yq '.images.registry' "$VALUES_FILE")
SECRET_NAME=$(yq '.imagePullSecrets[0]' "$VALUES_FILE")

AUTH=$(jq -r --arg server "$REGISTRY" '.auths[$server].auth // empty' ~/.docker/config.json)
if [ -z "$AUTH" ]; then
echo "No docker credentials found for $REGISTRY"
echo "Please run: docker login $REGISTRY"
exit 1
fi

jq -n --arg server "$REGISTRY" --arg auth "$AUTH" \
'{"auths": {($server): {"auth": $auth}}}' | \
kubectl create secret generic "$SECRET_NAME" \
--from-file=.dockerconfigjson=/dev/stdin \
--type=kubernetes.io/dockerconfigjson \
"$@"
Loading