extensions/nvidia: per-distro version detection + runtime auto-disable on no-GPU hosts #9845
igorpecovnik wants to merge 12 commits into
📝 Walkthrough
Removes a hardcoded NVIDIA driver default; selects and installs driver packages during post-install by querying the chroot's apt index (highest numeric version, or fallbacks); deploys a systemd oneshot and script that blacklists and purges NVIDIA packages when no NVIDIA PCI device is present; removes firstrun's dmesg-based purge.
Changes
NVIDIA Driver Version Selection & Autodetect
Sequence Diagram
sequenceDiagram
participant BuildScript
participant AptIndex
participant Chroot
participant DKMS
participant HostSystemd
BuildScript->>AptIndex: Check for nvidia-dkms-<N> and unversioned packages
AptIndex-->>BuildScript: Return highest numeric version or unversioned candidates
BuildScript->>Chroot: Install resolved nvidia-dkms / nvidia-driver packages
Chroot->>DKMS: DKMS builds (monitor /var/lib/dkms/*/build/make.log)
BuildScript->>Chroot: Deploy armbian-nvidia-autodetect script + systemd oneshot
HostSystemd->>Chroot: Run oneshot at boot (lspci + dpkg-query)
HostSystemd-->>Chroot: If no NVIDIA -> blacklist modules & purge NVIDIA packages
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@extensions/nvidia.sh`:
- Line 46: The grep regex is using a double-quoted pattern
'^nvidia-dkms-[0-9]+\$' so the backslash before $ is treated as a literal and
prevents anchoring; update the grep invocation that contains the string
'^nvidia-dkms-[0-9]+\$' to remove the backslash before $ (i.e., use
'^nvidia-dkms-[0-9]+$') so the end-of-line anchor is passed to grep correctly
and only exact package names match.
- Around line 45-48: The pipeline that sets latest using chroot_sdcard ... |
grep -E '^nvidia-dkms-[0-9]+\$' ... may return non-zero when there are no
matches and cause a set -e abort; modify the pipeline inside the command
substitution that assigns latest (the chroot_sdcard ... | grep ... | sed ... |
sort -nr | head -1 sequence) to append || true so the subshell always exits zero
and the subsequent [[ -n "$latest" ]] fallback check can run; update the
invocation around the latest variable assignment in extensions/nvidia.sh (look
for the latest=... chroot_sdcard pipeline) to include || true at the end of the
pipeline.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 20e2b57b-3257-40d7-b3e5-ed306a176cc4
📒 Files selected for processing (1)
extensions/nvidia.sh
Actionable comments posted: 2
Inline comments:
In `@extensions/nvidia.sh`:
- Around line 130-156: The script currently exits early when lspci finds NVIDIA
hardware, but never removes the blacklist file
(/etc/modprobe.d/armbian-nvidia-disabled.conf) or restores removed packages, so
once a host boots without NVIDIA it can't recover; update the control flow so
when lspci -nn | grep -qiE '\[10de:' succeeds you remove that blacklist file (if
present) and, if NVIDIA_PKGS is empty because packages were purged earlier,
re-install the driver package set (use the same dpkg-query pattern used earlier
to determine package names, i.e. NVIDIA_PKGS=$(dpkg-query -W
-f='${binary:Package}\n' 'nvidia-dkms-*' 'nvidia-driver-*' 'nvidia-settings'
'nvidia-common' 2>/dev/null | tr '\n' ' ')) and run apt-get install for those
packages; ensure the existing purge/autoremove code (NVIDIA_PKGS / apt-get -y
-qq purge ... and autoremove) is only run when no NVIDIA hardware is detected
and that the recovery branch runs when lspci finds '[10de:' (affecting the
current lspci early-exit branch and the later re-evaluation logic around lines
183-187).
- Around line 151-153: The dpkg-query that builds NVIDIA_PKGS only matches
patterns 'nvidia-dkms-*' and 'nvidia-driver-*' so plain metapackages like
'nvidia-dkms' or 'nvidia-driver' can be missed; update the query that sets
NVIDIA_PKGS to also include the exact package names 'nvidia-dkms' and
'nvidia-driver' (and any other known metapackage names you expect) so the purge
set removes both hyphenated variants and the plain metapackages; adjust the
dpkg-query argument list where NVIDIA_PKGS is defined to add those exact names.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: bf5e13d6-4d95-4671-b3cc-4d89f98dbfb2
📒 Files selected for processing (2)
extensions/nvidia.sh
packages/bsp/common/usr/lib/armbian/armbian-firstrun
🧹 Nitpick comments (1)
extensions/nvidia.sh (1)
202-206: 💤 Low value
Comment overstates "re-evaluation" behavior.
The comment claims hot-pluggable scenarios (eGPU, Thunderbolt) "get re-evaluated," but the script only disables—it never restores. On subsequent boots with a newly attached GPU, the script exits early at line 144 without removing the blacklist or reinstalling packages.
Consider rewording to reflect the actual behavior: auto-disable on first no-GPU boot, manual re-enable required thereafter.
📝 Suggested comment update
      # Enable the unit so it fires at every boot. Cheap when NVIDIA is
      # present (early exit on the lspci check) and idempotent when not
      # (apt-purge is a no-op on a system where the packages are already
    - # gone). Running every boot means hot-pluggable scenarios (eGPU,
    - # Thunderbolt) get re-evaluated.
    + # gone). Once packages are purged and the blacklist is written,
    + # manually delete /etc/modprobe.d/armbian-nvidia-disabled.conf and
    + # reinstall the driver packages to re-enable NVIDIA on a new host.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@extensions/nvidia.sh` around lines 202 - 206, Update the explanatory comment that starts with "Enable the unit so it fires at every boot..." to remove the claim that hot-pluggable scenarios "get re-evaluated" and instead state the actual behavior: the script performs an early exit on the lspci check (the lspci check branch) and only disables components (apt-purge/blacklist) on a no-GPU boot, so it will not automatically restore or reinstall when a GPU is later hot-plugged; note that manual re-enabling is required. Refer to the lspci check and the apt-purge/blacklist behavior in the comment so readers understand the one-way disable behavior.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 0bb8a7d6-bac2-47e9-9373-8099e153f454
📒 Files selected for processing (1)
extensions/nvidia.sh
5bd211f to a6d6ccd
Drops the hardcoded NVIDIA_DRIVER_VERSION=580 default (with its honest @todo comment about per-release / Debian-vs-Ubuntu drift). The post_install hook now resolves the package set against the chroot's actual apt index at install time:
1. If NVIDIA_DRIVER_VERSION is set explicitly (env or config), pin to it; the operator override always wins.
2. Otherwise, ask apt for the highest `nvidia-dkms-<N>` available in the target distribution/release. This is the common Ubuntu shape across noble / resolute / etc.; versions vary (535, 550, 560, 580, ...).
3. Fall through to the unversioned Debian metapackage `nvidia-dkms` if no numeric variants exist (bookworm, trixie).
4. If none of the above applies, skip with a warning instead of crashing the build on an opaque 'package not found'.
Closes the long-standing @todo and removes the silent build failures on releases that don't ship nvidia-dkms-580 specifically.
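The four resolver cases above can be sketched as a small shell function. This is an illustration only, not the code in extensions/nvidia.sh: the function name `pick_nvidia_dkms_package` is hypothetical, and it reads candidate package names from stdin (for example the output of `apt-cache pkgnames`) instead of querying the chroot.

```shell
#!/usr/bin/env bash
# Hypothetical helper for illustration; not the function used by the extension.
# $1    : optional pinned version (stands in for NVIDIA_DRIVER_VERSION)
# stdin : candidate package names, one per line
pick_nvidia_dkms_package() {
	local pinned="${1:-}" names latest
	# Case 1: an explicit version always wins.
	if [ -n "$pinned" ]; then
		echo "nvidia-dkms-${pinned}"
		return 0
	fi
	names="$(cat)"
	# Case 2: highest numbered nvidia-dkms-<N>. The trailing `|| true` keeps a
	# no-match grep from failing the pipeline under `set -e -o pipefail`.
	latest="$(printf '%s\n' "$names" | grep -E '^nvidia-dkms-[0-9]+$' \
		| sed 's/^nvidia-dkms-//' | sort -nr | head -n 1 || true)"
	if [ -n "$latest" ]; then
		echo "nvidia-dkms-${latest}"
	# Case 3: the unversioned metapackage.
	elif printf '%s\n' "$names" | grep -qx 'nvidia-dkms'; then
		echo "nvidia-dkms"
	fi
	# Case 4: print nothing; the caller is expected to warn and skip.
	return 0
}
```

A caller would treat empty output as case 4, e.g. `pkg="$(apt-cache pkgnames | pick_nvidia_dkms_package "${NVIDIA_DRIVER_VERSION:-}")"` followed by a `-n` check on `$pkg`.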
The previous solution lived in
packages/bsp/common/usr/lib/armbian/armbian-firstrun as a single
line:
[[ -n "$(dmesg | grep "No NVIDIA GPU found")" ]] && \
sudo apt-get -y -qq purge nvidia-dkms-510 nvidia-driver-510 \
nvidia-settings nvidia-common \
>> /dev/null
Two reasons it was unreliable:
1. dmesg-grep for "No NVIDIA GPU found" only sees the line if the
driver actually bound far enough to print it. On many boots the
line never appears (driver couldn't load at all) or has already
rotated out of the kernel ring buffer by the time firstrun runs.
2. Hardcoded nvidia-dkms-510 / nvidia-driver-510 — wrong on every
distro/release that ships a different driver branch, and
especially wrong now that the install path auto-picks the
highest available version.
Replace it with a build-time-installed detector + systemd one-shot
under extensions/nvidia.sh:
- /usr/lib/armbian/armbian-nvidia-autodetect
probes the PCI bus directly (lspci -nn, vendor 10de). Works
regardless of whether any driver module loaded.
If no NVIDIA hardware found:
a. drops /etc/modprobe.d/armbian-nvidia-disabled.conf
(blacklist nvidia / nvidia_drm / nvidia_modeset / nvidia_uvm)
so the driver doesn't try to load on the next boot.
b. dpkg-query's the actually-installed nvidia-dkms-* /
nvidia-driver-* / nvidia-settings / nvidia-common packages
(no hardcoded version!) and apt-purges them. DKMS stops
rebuilding the module on every kernel update.
- /etc/systemd/system/armbian-nvidia-autodetect.service
Type=oneshot, runs Before=display-manager.service /
graphical.target. WantedBy=multi-user.target, so it fires every
boot. Cheap (early exit when NVIDIA is present) and idempotent
(no-op on a system where the packages are already purged). For
hot-pluggable scenarios (eGPU added later), the reverse direction
is handled by removing the modprobe.d file manually.
Removes the dmesg-grep line from armbian-firstrun and leaves a
breadcrumb pointing at the new location.
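As a rough sketch of the detection step described in this commit message: the check only needs the `lspci -nn` output and the NVIDIA vendor ID 10de. The function names below are hypothetical and the sketch only reports the decision; the real helper at /usr/lib/armbian/armbian-nvidia-autodetect also writes the blacklist and purges packages.

```shell
#!/usr/bin/env bash
# Hypothetical names for illustration; not the shipped script.

# Reads `lspci -nn` style output on stdin and succeeds when any device
# carries the NVIDIA PCI vendor ID (10de).
has_nvidia_pci() {
	grep -qiE '\[10de:'
}

# Prints the action the detector would take for the PCI listing on stdin.
nvidia_action() {
	if has_nvidia_pci; then
		echo "keep"     # hardware present: leave the driver packages alone
	else
		echo "disable"  # no hardware: blacklist modules and purge packages
	fi
}

# Intended use on a real host (skipped when lspci is not installed):
#   command -v lspci >/dev/null && lspci -nn | nvidia_action
```

Probing the bus this way does not depend on any driver having loaded, which is the property the dmesg-based check lacked.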
The runtime armbian-nvidia-autodetect helper in this extension calls lspci to probe the PCI bus for an NVIDIA card. lspci lives in pciutils, which isn't in the Debian/Ubuntu base install and isn't guaranteed to be pulled by every desktop metapackage transitively. The helper defensively no-ops when lspci is missing — which would leave images without auto-disable on no-GPU hosts (the exact thing this PR is meant to fix). Append pciutils to PACKAGE_LIST_ADDITIONAL in extension_finish_config so it lands in the rootfs alongside the other build-time prerequisites.
The runtime autodetect's dpkg-query argument list used only the globs 'nvidia-dkms-*' and 'nvidia-driver-*' for the numbered Ubuntu shape. The trailing dash makes those globs miss the bare 'nvidia-dkms' / 'nvidia-driver' metapackages — which the install branch deliberately falls through to on Debian (case 3 of the resolver added in this PR). Add the exact names alongside the globs so the purge covers both shapes. Without this fix a Debian image installed with the extension on a host that turns out to have no NVIDIA GPU would correctly drop the modprobe blacklist but leave the package set behind, defeating the DKMS-rebuild-avoidance half of the autodisable design.
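The glob miss described above can be reproduced without dpkg. This sketch assumes dpkg-query patterns behave like ordinary shell globs, which is how the commit message treats them; `matches_any` is a hypothetical helper, not part of the extension.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds when $1 matches any of the remaining glob
# patterns, using the shell's own `case` pattern matching.
matches_any() {
	local name="$1" pattern
	shift
	for pattern in "$@"; do
		# $pattern is deliberately unquoted so `case` treats it as a glob.
		case "$name" in
			$pattern) return 0 ;;
		esac
	done
	return 1
}

# The trailing dash in 'nvidia-dkms-*' requires at least "nvidia-dkms-",
# so the bare metapackage name only matches once it is listed explicitly.
matches_any nvidia-dkms 'nvidia-dkms-*' || echo "bare name missed by glob"
matches_any nvidia-dkms 'nvidia-dkms-*' 'nvidia-dkms' && echo "bare name matched"
```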
…chable
chroot_sdcard wraps its argument with `bash -e -o pipefail -c …`, so the pipeline `apt-cache pkgnames … | grep … | sed … | sort | head -1` returns 1 when grep finds no numbered nvidia-dkms-<N> entries, which is exactly the Debian shape that the install resolver's case-3 fall-through was designed to handle. Under the build framework's outer `set -e` (compile.sh) the substitution `latest=$(chroot_sdcard …)` then aborts the build at that assignment, which means case-3 (unversioned `nvidia-dkms` metapackage) was unreachable in practice.
Append `|| true` to the inner pipeline so the substitution always succeeds with `$latest` empty on no-match, and the `if/elif` chain below can pick case-2 (number found), case-3 (Debian fallback) or case-4 (skip with warn) on real data.
Reproduced and verified locally: without `|| true` the assignment aborts; with it, latest='' and the fallback executes.
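The exit-status behaviour is easy to reproduce outside the build. The sketch below stands in for chroot_sdcard with a hypothetical `run_like_chroot_sdcard` that only applies the same `bash -e -o pipefail -c` wrapping; there is no chroot and no apt involved, and the package list is a fixed Debian-shaped sample.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in: same shell flags as described above, no chroot.
run_like_chroot_sdcard() {
	bash -e -o pipefail -c "$1"
}

# Debian-shaped input: only the unversioned name exists, so grep matches nothing.
no_match='printf "nvidia-dkms\n" | grep -E "^nvidia-dkms-[0-9]+\$" | sed "s/^nvidia-dkms-//" | sort -nr | head -1'

# Without the guard, pipefail surfaces grep's exit status 1.
latest="$(run_like_chroot_sdcard "$no_match")" && rc_plain=0 || rc_plain=$?

# With `|| true` the substitution succeeds and latest is simply empty.
latest="$(run_like_chroot_sdcard "$no_match || true")" && rc_guarded=0 || rc_guarded=$?

echo "without guard: rc=$rc_plain, with guard: rc=$rc_guarded, latest='$latest'"
```

The `&& ... || ...` capture is only there so the demo itself survives an outer `set -e`; in the build, the unguarded assignment is what aborted.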
… redundant)
Two reasons:
1. PACKAGE_LIST_ADDITIONAL is sealed `readonly` by the time
extension_finish_config__* hooks run, so the
`declare -g PACKAGE_LIST_ADDITIONAL+=" pciutils"` line aborted
the build with:
/armbian/extensions/nvidia.sh: line 20: declare:
PACKAGE_LIST_ADDITIONAL: readonly variable
2. It was redundant anyway. `pciutils` is already listed in
config/cli/common/main/packages.additional, which ships in
every non-minimal CLI image. This extension early-returns on
BUILD_MINIMAL=yes, so we never reach a context that wouldn't
have pciutils already present.
Replace the now-broken line with a comment pointing at the canonical
source so a future maintainer doesn't try to add it again.
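For reference, the readonly failure can be reproduced in isolation. This is a minimal sketch: the initial value "vim" is made up, and the snippet runs in a child `bash -e` so the demo itself keeps going after the child aborts.

```shell
#!/usr/bin/env bash
# Child script: seal the variable, then try to append to it as the
# extension's old line did. The initial value is illustrative only.
snippet='
readonly PACKAGE_LIST_ADDITIONAL="vim"
declare -g PACKAGE_LIST_ADDITIONAL+=" pciutils"
echo "still running"
'

# LC_ALL=C keeps the error message in English; -e mirrors the build's set -e.
output="$(LC_ALL=C bash -e -c "$snippet" 2>&1)" && rc=0 || rc=$?

echo "rc=$rc"
echo "$output"
```

The child exits non-zero at the `declare` line and never reaches the final `echo`, which matches the abort quoted above.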
post_install_kernel_debs runs BEFORE armbian-bsp-cli is installed in
the chroot (per its own docstring at the call site). Anything we
wrote into /usr/lib/armbian/ or /etc/systemd/system/ there was
getting clobbered by the BSP install or swept by later rootfs steps
— operator reported the autodetect script + service simply weren't
in the resulting rootfs even though the firstrun edit shipped
(because firstrun ships through the BSP package which is
dpkg-tracked).
Split the responsibilities:
- post_install_kernel_debs__build_nvidia_kernel_module
keeps doing the apt-get install of nvidia-dkms-* / nvidia-driver-*
(works fine before BSP — dependencies resolve, dkms builds).
- post_family_tweaks__build_nvidia_kernel_module_autodetect (NEW)
calls install_armbian_nvidia_autodetect_helper. post_family_tweaks
fires AFTER `install_artifact_deb_chroot "armbian-bsp-cli"` so
/usr/lib/armbian/ already exists with BSP-owned content and our
untracked drop sits beside it without being overwritten.
The autodetect remains extension-gated (only on images built with the
nvidia extension enabled), not BSP-common — per operator preference,
to avoid every SBC's bsp-cli carrying nvidia-related plumbing it has
no use for.
The case-3 fallback ("No nvidia-dkms package in ... apt sources")
hits on noble even though nvidia-dkms-* lives in restricted/, where
the rootfs is supposed to include the restricted component. Without
seeing the chroot's apt state at the moment of failure, there's no
way to tell whether:
- restricted is missing from sources.list.d at all,
- it's listed but the indices were never fetched,
- or apt-cache pkgnames is filtering everything else for some
arch / component reason.
Before the existing "skipping nVidia install" warn, dump:
- `apt-cache pkgnames | grep -c ^nvidia` from inside the chroot
- listing of /etc/apt/sources.list.d/
- sources files that mention "restricted"
- apt/lists entries containing restricted/multiverse (proves
whether indices were refreshed)
All purely diagnostic; no behaviour change on the happy path.
chroot_sdcard wraps the inner command with `bash -e -o pipefail -c`. The debug-dump pipelines added in the previous commit had grep/ls calls that legitimately return rc=1 when nothing matches; pipefail propagates that as the pipeline's exit status, the outer set -e aborts the build mid-function, and bash emits a confusing `pop_var_context: head of shell_variables not a function context` instead of the actual diagnostic.
Tail every chroot_sdcard "..." with `|| true` so empty matches stay rc=0 and the diagnostic lines actually print.
Also simplify the "restricted" probe from `grep -lE '(^|\s)restricted(\s|$)'` to `grep -lF restricted`: the regex form was both fragile under nested double-quote escaping and overkill for what we need (presence of the literal word in a sources file).
apt-cache pkgnames reads from /var/cache/apt/pkgcache.bin, which is built from /var/lib/apt/lists/. If the rootfs was cached before `restricted` was added to ubuntu.sources, or if the framework hasn't run `apt-get update` since the final sources.list was finalized, the indices for the restricted component are simply absent, and pkgnames returns nothing for nvidia-dkms-*, even though sources.list lists the component.
Verified locally that on a stale chroot the pkgnames pipeline returns empty; after `apt-get update`, it returns the full nvidia-dkms-N set.
`apt-get update -qq || true`: quiet on success, doesn't abort the build if the proxy hiccups or one of the suite indices fails (apt returns non-zero on partial failures).
If a previous boot ran on the same rootfs without NVIDIA hardware, the autodetect helper wrote /etc/modprobe.d/armbian-nvidia-disabled.conf to keep the kernel modules from auto-loading. When the same rootfs later boots with NVIDIA hardware (card added, SSD swapped into a GPU-equipped host), the early `exit 0` left that file in place, so the modules still wouldn't load even though they're present and the GPU is wired up. The detector was effectively one-way.
When lspci finds [10de:], clear the blacklist file if it exists (rm -f is idempotent for the common case where it never existed). Log via systemd-cat so the action shows up in `journalctl -u armbian-nvidia-autodetect` for triage.
Deliberately /not/ auto-reinstalling NVIDIA packages on the recovery path: proprietary driver auto-install without operator consent, and without guaranteed network/apt-sources, is out of scope for a boot-time detector. If packages were previously purged the operator runs apt install manually; the freshly-cleared blacklist file makes that work the next boot.
…boot when GPU is found
Previous flow on a host without an NVIDIA GPU:
- First boot: kernel modules auto-load from initrd udev → probe
fails → noisy dmesg ("nvidia: probe failed", DKMS rebuild
artefacts in journal, etc.)
- armbian-nvidia-autodetect runs in userspace → writes blacklist
+ purges packages
- Second boot: clean
Inverting the default so the first boot is also clean:
1. Build-time write of /etc/modprobe.d/armbian-nvidia-disabled.conf
BEFORE the apt install. nvidia-dkms postinst triggers
update-initramfs which now bakes the blacklist into initramfs,
so initrd udev doesn't try to load nvidia* at all.
2. Boot-time autodetect:
- lspci finds [10de:] → rm -f blacklist file + modprobe
nvidia_drm modeset=1 (pulls nvidia + nvidia_modeset via
deps; Wayland-friendly KMS). Display-manager (we're
Before= it) sees the driver loaded.
- No [10de:] → keep blacklist + purge packages (unchanged).
Self-healing for hosts that gain a GPU later: rootfs blacklist file
deleted on first NVIDIA-detected boot; next kernel upgrade
regenerates initramfs from the (now blacklist-free) rootfs, so
subsequent boots are clean directly from initrd. Until that kernel
upgrade, the runtime modprobe covers the gap each boot.
|| true on modprobe handles the edge case where packages were
previously purged on a no-GPU run, then the operator swapped in a
GPU but hasn't re-apt-installed. Operator runs apt install manually
in that case; the cleared blacklist makes it work on next boot.
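The inverted default can be summarised as two small steps, sketched here against a scratch root directory. Function names are hypothetical, and the sketch only prints the action where the real helper would run modprobe or apt-get; it is meant to show the file handling, not to replace the script.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration. $1 is a root directory so the
# sketch can run against a temp dir instead of the live /etc.

blacklist_path() {
	echo "$1/etc/modprobe.d/armbian-nvidia-disabled.conf"
}

# Build time: write the blacklist before the driver packages are installed,
# so the initramfs generated by their postinst already contains it.
write_nvidia_blacklist() {
	local file
	file="$(blacklist_path "$1")"
	mkdir -p "${file%/*}"
	# printf repeats the format once per module name.
	printf 'blacklist %s\n' nvidia nvidia_drm nvidia_modeset nvidia_uvm > "$file"
}

# Boot time: $2 is "yes" when lspci reported an NVIDIA device.
boot_time_action() {
	local file
	file="$(blacklist_path "$1")"
	if [ "$2" = "yes" ]; then
		rm -f "$file"   # idempotent: fine if the file is already gone
		echo "load"     # real helper: modprobe nvidia_drm modeset=1 || true
	else
		echo "purge"    # real helper: keep the blacklist, purge the packages
	fi
}
```

With this shape a no-GPU host never attempts to load the modules, while a GPU host removes the blacklist on its first boot and loads the driver at runtime until the next initramfs rebuild.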
75325a5 to 98a0649
Two related fixes in `extensions/nvidia.sh`
1. Auto-pick the right driver version per distribution
Drops the hardcoded `NVIDIA_DRIVER_VERSION=580` default (with its own honest `@TODO: this might vary per-release and Debian/Ubuntu` comment). The post-install hook now resolves package names against the chroot's apt index at install time.
2. Runtime auto-disable on hosts without NVIDIA hardware
Replaces an unreliable dmesg-grep line that lived in `packages/bsp/common/usr/lib/armbian/armbian-firstrun`:
```bash
# REMOVED:
[[ -n "$(dmesg | grep "No NVIDIA GPU found")" ]] && \
sudo apt-get -y -qq purge nvidia-dkms-510 nvidia-driver-510 \
nvidia-settings nvidia-common >> /dev/null
```
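Two reasons it didn't work:
1. The dmesg grep for "No NVIDIA GPU found" only matches if the driver bound far enough to print that line. On many boots the line never appears, or it has already rotated out of the kernel ring buffer by the time firstrun runs.
2. The package names were hardcoded to `nvidia-dkms-510` / `nvidia-driver-510`, which is wrong on every distro/release that ships a different driver branch.
New solution (installed by the extension itself, so it ships in every image built with this extension):
- `/usr/lib/armbian/armbian-nvidia-autodetect`, which probes the PCI bus directly (`lspci -nn`, vendor 10de) and works regardless of whether any driver module loaded.
- `/etc/systemd/system/armbian-nvidia-autodetect.service`, a `Type=oneshot` unit that runs before `display-manager.service` / `graphical.target`.
Behaviour:
- If no NVIDIA hardware is found, the script drops `/etc/modprobe.d/armbian-nvidia-disabled.conf` (blacklisting nvidia / nvidia_drm / nvidia_modeset / nvidia_uvm) and purges the NVIDIA packages that dpkg-query reports as installed, with no hardcoded version.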
The unit fires every boot so an eGPU added later just works (run autodetect, see lspci hit, exit). To reverse direction (eGPU removed and the user wants it back), the script's preamble documents that deleting `/etc/modprobe.d/armbian-nvidia-disabled.conf` re-enables.
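For orientation, a unit with the properties named in this PR (oneshot, ordered before the display manager, wanted by multi-user.target) could be written out as below. This is a sketch under assumptions: the `Description=` text and the helper function name are made up, and the shipped unit may carry additional settings.

```shell
#!/usr/bin/env bash
# Hypothetical helper: writes an illustrative unit file into directory $1.
write_autodetect_unit() {
	local dest="$1"
	mkdir -p "$dest"
	cat > "$dest/armbian-nvidia-autodetect.service" <<'EOF'
[Unit]
# Description is an assumption made for this sketch.
Description=Disable NVIDIA driver when no NVIDIA GPU is present
Before=display-manager.service graphical.target

[Service]
Type=oneshot
ExecStart=/usr/lib/armbian/armbian-nvidia-autodetect

[Install]
WantedBy=multi-user.target
EOF
}
```

Calling `write_autodetect_unit /tmp/unit-preview` produces a file that can be inspected before anything is installed under /etc/systemd/system.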
File changes
```
extensions/nvidia.sh | +47/-4 (PR #1) + +120 (PR #2)
packages/bsp/common/usr/lib/armbian/armbian-firstrun | +12/-1
```
Test plan