
extensions/nvidia: per-distro version detection + runtime auto-disable on no-GPU hosts #9845

Open: igorpecovnik wants to merge 12 commits into `main` from `nvidia-auto-detect-driver-version`

@igorpecovnik igorpecovnik commented May 17, 2026

Two related fixes in `extensions/nvidia.sh`

1. Auto-pick the right driver version per distribution

Drops the hardcoded `NVIDIA_DRIVER_VERSION=580` default (with its own honest `@TODO: this might vary per-release and Debian/Ubuntu` comment). The post-install hook now resolves package names against the chroot's apt index at install time.

| # | Condition | Result |
|---|-----------|--------|
| 1 | `NVIDIA_DRIVER_VERSION` set (env/config) | `nvidia-dkms-${ver}` / `nvidia-driver-${ver}`; operator override wins |
| 2 | One or more `nvidia-dkms-<N>` in apt | Pick highest `N` (Ubuntu shape) |
| 3 | Unversioned `nvidia-dkms` only | Use metapackage (Debian shape) |
| 4 | Nothing matches | Skip with `warn` log line; no opaque "package not found" |
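
The four-way resolution in the table can be sketched as a small standalone function. This is an illustration, not the code in `extensions/nvidia.sh`: the function name `pick_nvidia_packages`, reading candidate names from stdin, and the wording of the warning are all assumptions made so the logic can be run outside a chroot.

```bash
# Sketch only: pick_nvidia_packages is a hypothetical name, not the hook in nvidia.sh.
#   $1     pinned version (may be empty), standing in for NVIDIA_DRIVER_VERSION
#   stdin  candidate package names, one per line (e.g. output of `apt-cache pkgnames`)
#   stdout package names to install, or nothing when no candidate matches
pick_nvidia_packages() {
    local pinned="${1:-}" names latest

    # Case 1: an explicit pin wins; the candidate list is not consulted.
    if [[ -n "$pinned" ]]; then
        echo "nvidia-dkms-${pinned} nvidia-driver-${pinned}"
        return 0
    fi

    names="$(cat)"

    # Case 2: highest numeric nvidia-dkms-<N>. `|| true` keeps "no match" non-fatal.
    latest="$(grep -E '^nvidia-dkms-[0-9]+$' <<< "$names" \
        | sed 's/^nvidia-dkms-//' | sort -nr | head -n 1 || true)"
    if [[ -n "$latest" ]]; then
        echo "nvidia-dkms-${latest} nvidia-driver-${latest}"
        return 0
    fi

    # Case 3: only the unversioned metapackage exists.
    if grep -qx 'nvidia-dkms' <<< "$names"; then
        echo "nvidia-dkms nvidia-driver"
        return 0
    fi

    # Case 4: nothing matches; warn on stderr and let the caller skip the install.
    echo "warn: no nvidia-dkms package found, skipping NVIDIA install" >&2
    return 0
}
```

Called as `apt-cache pkgnames | pick_nvidia_packages "${NVIDIA_DRIVER_VERSION:-}"`, an empty result means "skip", which keeps the caller's control flow to a single `[[ -n ... ]]` check.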

2. Runtime auto-disable on hosts without NVIDIA hardware

Replaces an unreliable dmesg-grep line that lived in `packages/bsp/common/usr/lib/armbian/armbian-firstrun`:

```bash

# REMOVED:
[[ -n "$(dmesg | grep "No NVIDIA GPU found")" ]] && \
    sudo apt-get -y -qq purge nvidia-dkms-510 nvidia-driver-510 \
        nvidia-settings nvidia-common >> /dev/null
```

Two reasons it didn't work:

  • dmesg-grep is fragile. The "No NVIDIA GPU found" line is only printed if the driver bound far enough to print it, and it falls off the kernel ring buffer on busy boots — so the check often missed and the purge never fired.
  • Hardcoded `nvidia-dkms-510`: wrong for every release that ships a different driver. After fix #1 above, the install path uses different version numbers per distro, so a fixed purge name was guaranteed to miss.

New solution (installed by the extension itself, so it ships in every image built with this extension):

| File | Purpose |
|------|---------|
| `/usr/lib/armbian/armbian-nvidia-autodetect` | Probes PCI bus via `lspci -nn` (vendor `10de`). If no NVIDIA: drops a `/etc/modprobe.d` blacklist + `apt purge` of every actually-installed `nvidia-dkms-<N>` / `nvidia-driver-<N>` package (no hardcoded version; uses `dpkg-query`). |
| `/etc/systemd/system/armbian-nvidia-autodetect.service` | Type=oneshot, `After=local-fs.target`, `Before=display-manager.service graphical.target`. Enabled at build time. |
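
The PCI probe described above boils down to one pattern match on `lspci -nn` output. A minimal sketch follows; the function name `has_nvidia_pci` is an assumption, while the `[10de:` pattern is the one quoted in the review comments on this PR.

```bash
# Sketch only: has_nvidia_pci is a hypothetical name.
# Reads `lspci -nn` output on stdin and succeeds if any device carries
# PCI vendor ID 10de (NVIDIA), regardless of whether a driver is loaded.
has_nvidia_pci() {
    grep -qiE '\[10de:'
}

# Typical call on a live system (requires pciutils):
#   lspci -nn | has_nvidia_pci && echo "NVIDIA hardware present"
```

Taking the listing on stdin rather than calling `lspci` inside the function keeps the check testable on machines without pciutils.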

Behaviour:

| Host hardware | First boot | Subsequent boots |
|---------------|------------|------------------|
| NVIDIA GPU present | lspci hit → exit 0 | lspci hit → exit 0 |
| No NVIDIA GPU | blacklist conf written, packages purged | lspci miss → dpkg-query returns empty → apt-purge no-op → done in ms |

The unit fires every boot so an eGPU added later just works (run autodetect, see lspci hit, exit). To reverse direction (eGPU removed and the user wants it back), the script's preamble documents that deleting `/etc/modprobe.d/armbian-nvidia-disabled.conf` re-enables.
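
For orientation, here is a sketch of how a helper could write such a unit into a target rootfs. The unit keys `Type=oneshot`, `After=`, `Before=` and `WantedBy=multi-user.target` come from this PR's description and commit messages; the function name `write_autodetect_unit`, the `Description=` text and the `ExecStart=` line are assumptions, and the real `install_armbian_nvidia_autodetect_helper` may differ.

```bash
# Sketch only: write_autodetect_unit is a hypothetical name.
#   $1  path to the target rootfs (any directory works for a dry run)
write_autodetect_unit() {
    local rootfs="$1"
    local unit_dir="${rootfs}/etc/systemd/system"
    mkdir -p "$unit_dir"
    # Quoted 'EOF' so nothing inside the unit is expanded by the shell.
    cat > "${unit_dir}/armbian-nvidia-autodetect.service" << 'EOF'
[Unit]
Description=Disable NVIDIA driver when no NVIDIA hardware is present
After=local-fs.target
Before=display-manager.service graphical.target

[Service]
Type=oneshot
ExecStart=/usr/lib/armbian/armbian-nvidia-autodetect

[Install]
WantedBy=multi-user.target
EOF
}
```

Running `write_autodetect_unit "$(mktemp -d)"` is a cheap way to inspect the generated file before wiring it into a build.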

File changes

```
extensions/nvidia.sh | +47/-4 (PR #1) + +120 (PR #2)
packages/bsp/common/usr/lib/armbian/armbian-firstrun | +12/-1
```

Test plan

  • Build a desktop image for Ubuntu noble with this extension; confirm the auto-detected nvidia-dkms version installs and the kernel module builds.
  • Build for Debian trixie; confirm fall-through to the unversioned `nvidia-dkms` metapackage.
  • Build for a release with no nvidia-dkms; confirm the build completes with a "skipping nVidia install" warning rather than failing.
  • Build with `NVIDIA_DRIVER_VERSION=550` pinned via env; confirm operator override still wins.
  • Boot a desktop image with an NVIDIA GPU; confirm `armbian-nvidia-autodetect.service` shows green and no packages were removed.
  • Boot the same image without an NVIDIA GPU; confirm `/etc/modprobe.d/armbian-nvidia-disabled.conf` is in place after first boot and that `nvidia-dkms-` / `nvidia-driver-` have been purged.
  • Confirm `/usr/lib/armbian/armbian-firstrun` no longer purges anything nvidia-related.

Summary by CodeRabbit

  • New Features

    • Adds a boot-time NVIDIA autodetect service that checks for NVIDIA hardware, blacklists NVIDIA modules and removes NVIDIA packages when no GPU is present.
    • Ensures PCI utilities are installed for reliable hardware detection and moves autodetection out of first-run into a dedicated boot-time helper.
  • Bug Fixes

    • Improved driver installation: auto-selects the appropriate NVIDIA driver/DKMS packages at install time, tracks DKMS build output, and warns if no suitable packages are found.


coderabbitai Bot commented May 17, 2026


Walkthrough

Removes a hardcoded NVIDIA driver default; selects and installs driver packages during post-install by querying the chroot's apt index (highest numeric version, then fallbacks); deploys a systemd oneshot and script that blacklist the NVIDIA modules and purge NVIDIA packages when no NVIDIA PCI device is present; removes firstrun's dmesg-based purge.

Changes

NVIDIA Driver Version Selection & Autodetect

| Layer | File(s) | Summary |
|-------|---------|---------|
| Defer version default decision | `extensions/nvidia.sh` | Stops declaring a default `NVIDIA_DRIVER_VERSION` in `extension_finish_config__build_nvidia_kernel_module`; forces `INSTALL_HEADERS=yes` and adds `pciutils` so `lspci` is available at runtime. |
| Post-install package resolution and install | `extensions/nvidia.sh` | `post_install_kernel_debs__build_nvidia_kernel_module` resolves packages by: using a pinned `NVIDIA_DRIVER_VERSION`; auto-detecting highest numeric `nvidia-dkms-<N>` from the chroot apt index and setting `NVIDIA_DRIVER_VERSION`; or falling back to unversioned `nvidia-dkms`/`nvidia-driver`. If no matches, it warns and skips. Installs resolved packages in chroot and preserves DKMS build log error tracking; deploys the autodetect helper. |
| Install runtime autodetect/disable helper | `extensions/nvidia.sh` | Adds `install_armbian_nvidia_autodetect_helper()` which writes an `armbian-nvidia-autodetect` script and a systemd oneshot into the target rootfs; on boot it uses `lspci` to detect NVIDIA (vendor 0x10de), and if absent writes a modprobe blacklist and purges installed NVIDIA packages (including versioned and unversioned metapackages). |
| Remove firstrun hardcoded purge | `packages/bsp/common/usr/lib/armbian/armbian-firstrun` | Removes the dmesg-based purge of hardcoded `nvidia-*-510` packages and adds a comment indicating NVIDIA autodetection is handled by the systemd helper. |

Sequence Diagram

sequenceDiagram
  participant BuildScript
  participant AptIndex
  participant Chroot
  participant DKMS
  participant HostSystemd
  BuildScript->>AptIndex: Check for nvidia-dkms-<N> and unversioned packages
  AptIndex-->>BuildScript: Return highest numeric version or unversioned candidates
  BuildScript->>Chroot: Install resolved nvidia-dkms / nvidia-driver packages
  Chroot->>DKMS: DKMS builds (monitor /var/lib/dkms/*/build/make.log)
  BuildScript->>Chroot: Deploy armbian-nvidia-autodetect script + systemd oneshot
  HostSystemd->>Chroot: Run oneshot at boot (lspci + dpkg-query)
  HostSystemd-->>Chroot: If no NVIDIA -> blacklist modules & purge NVIDIA packages

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐰 I nibble code in tidy rows,
I stop the pins and version woes.
I sniff the bus with tiny nose,
If no GPU, the blacklist grows.
A helper hops and cleans my toes.

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
|------------|--------|-------------|------------|
| Linked Issues check | ⚠️ Warning | The linked issue requests adding an issues tab to the Cubietruck device page, but the PR changes NVIDIA driver detection and auto-disable logic unrelated to this request. | Verify the linked issue is correct. If this PR addresses NVIDIA issues, link relevant issues instead. If unrelated, clarify the connection or remove the incorrect issue link. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 66.67% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|------------|--------|-------------|
| Description Check | ✅ Passed | Check skipped - CodeRabbit's high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately describes the main changes: per-distro version detection and runtime auto-disable on no-GPU hosts. |
| Out of Scope Changes check | ✅ Passed | All changes relate to NVIDIA driver version detection and runtime auto-disable functionality, with no unrelated modifications detected. |


@github-actions github-actions Bot added size/medium PR with more then 50 and less then 250 lines 05 Milestone: Second quarter release Needs review Seeking for review Framework Framework components labels May 17, 2026

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@extensions/nvidia.sh`:
- Line 46: The grep regex is using a double-quoted pattern
'^nvidia-dkms-[0-9]+\$' so the backslash before $ is treated as a literal and
prevents anchoring; update the grep invocation that contains the string
'^nvidia-dkms-[0-9]+\$' to remove the backslash before $ (i.e., use
'^nvidia-dkms-[0-9]+$') so the end-of-line anchor is passed to grep correctly
and only exact package names match.
- Around line 45-48: The pipeline that sets latest using chroot_sdcard ... |
grep -E '^nvidia-dkms-[0-9]+\$' ... may return non-zero when there are no
matches and cause a set -e abort; modify the pipeline inside the command
substitution that assigns latest (the chroot_sdcard ... | grep ... | sed ... |
sort -nr | head -1 sequence) to append || true so the subshell always exits zero
and the subsequent [[ -n "$latest" ]] fallback check can run; update the
invocation around the latest variable assignment in extensions/nvidia.sh (look
for the latest=... chroot_sdcard pipeline) to include || true at the end of the
pipeline.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 20e2b57b-3257-40d7-b3e5-ed306a176cc4

📥 Commits

Reviewing files that changed from the base of the PR and between 1c3d2b4 and e1dd0ec.

📒 Files selected for processing (1)
  • extensions/nvidia.sh

Comment thread extensions/nvidia.sh Outdated
Comment thread extensions/nvidia.sh
@igorpecovnik igorpecovnik requested a review from a team as a code owner May 17, 2026 15:16
@igorpecovnik igorpecovnik requested review from catalinii and swissiety and removed request for a team May 17, 2026 15:16
@github-actions github-actions Bot added the BSP Board Support Packages label May 17, 2026
@igorpecovnik igorpecovnik changed the title extensions/nvidia: auto-detect highest nvidia-dkms available per distro extensions/nvidia: per-distro version detection + runtime auto-disable on no-GPU hosts May 17, 2026

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@extensions/nvidia.sh`:
- Around line 130-156: The script currently exits early when lspci finds NVIDIA
hardware, but never removes the blacklist file
(/etc/modprobe.d/armbian-nvidia-disabled.conf) or restores removed packages, so
once a host boots without NVIDIA it can't recover; update the control flow so
when lspci -nn | grep -qiE '\[10de:' succeeds you remove that blacklist file (if
present) and, if NVIDIA_PKGS is empty because packages were purged earlier,
re-install the driver package set (use the same dpkg-query pattern used earlier
to determine package names, i.e. NVIDIA_PKGS=$(dpkg-query -W
-f='${binary:Package}\n' 'nvidia-dkms-*' 'nvidia-driver-*' 'nvidia-settings'
'nvidia-common' 2>/dev/null | tr '\n' ' ')) and run apt-get install for those
packages; ensure the existing purge/autoremove code (NVIDIA_PKGS / apt-get -y
-qq purge ... and autoremove) is only run when no NVIDIA hardware is detected
and that the recovery branch runs when lspci finds '[10de:' (affecting the
current lspci early-exit branch and the later re-evaluation logic around lines
183-187).
- Around line 151-153: The dpkg-query that builds NVIDIA_PKGS only matches
patterns 'nvidia-dkms-*' and 'nvidia-driver-*' so plain metapackages like
'nvidia-dkms' or 'nvidia-driver' can be missed; update the query that sets
NVIDIA_PKGS to also include the exact package names 'nvidia-dkms' and
'nvidia-driver' (and any other known metapackage names you expect) so the purge
set removes both hyphenated variants and the plain metapackages; adjust the
dpkg-query argument list where NVIDIA_PKGS is defined to add those exact names.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: bf5e13d6-4d95-4671-b3cc-4d89f98dbfb2

📥 Commits

Reviewing files that changed from the base of the PR and between e1dd0ec and b43ab16.

📒 Files selected for processing (2)
  • extensions/nvidia.sh
  • packages/bsp/common/usr/lib/armbian/armbian-firstrun

Comment thread extensions/nvidia.sh
Comment thread extensions/nvidia.sh

@coderabbitai coderabbitai Bot left a comment


🧹 Nitpick comments (1)
extensions/nvidia.sh (1)

202-206: 💤 Low value

Comment overstates "re-evaluation" behavior.

The comment claims hot-pluggable scenarios (eGPU, Thunderbolt) "get re-evaluated," but the script only disables—it never restores. On subsequent boots with a newly attached GPU, the script exits early at line 144 without removing the blacklist or reinstalling packages.

Consider rewording to reflect the actual behavior: auto-disable on first no-GPU boot, manual re-enable required thereafter.

📝 Suggested comment update
 	# Enable the unit so it fires at every boot. Cheap when NVIDIA is
 	# present (early exit on the lspci check) and idempotent when not
 	# (apt-purge is a no-op on a system where the packages are already
-	# gone). Running every boot means hot-pluggable scenarios (eGPU,
-	# Thunderbolt) get re-evaluated.
+	# gone). Once packages are purged and the blacklist is written,
+	# manually delete /etc/modprobe.d/armbian-nvidia-disabled.conf and
+	# reinstall the driver packages to re-enable NVIDIA on a new host.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@extensions/nvidia.sh` around lines 202 - 206, Update the explanatory comment
that starts with "Enable the unit so it fires at every boot..." to remove the
claim that hot-pluggable scenarios "get re-evaluated" and instead state the
actual behavior: the script performs an early exit on the lspci check (the lspci
check branch) and only disables components (apt-purge/blacklist) on a no-GPU
boot, so it will not automatically restore or reinstall when a GPU is later
hot-plugged; note that manual re-enabling is required. Refer to the lspci check
and the apt-purge/blacklist behavior in the comment so readers understand the
one-way disable behavior.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 0bb8a7d6-bac2-47e9-9373-8099e153f454

📥 Commits

Reviewing files that changed from the base of the PR and between 90efd56 and 19fb105.

📒 Files selected for processing (1)
  • extensions/nvidia.sh

@igorpecovnik igorpecovnik force-pushed the nvidia-auto-detect-driver-version branch 3 times, most recently from 5bd211f to a6d6ccd Compare May 18, 2026 03:58
@github-actions github-actions Bot added size/large PR with 250 lines or more and removed size/medium PR with more then 50 and less then 250 lines labels May 18, 2026
Drops the hardcoded NVIDIA_DRIVER_VERSION=580 default (with its
honest @todo comment about per-release / Debian-vs-Ubuntu drift).
The post_install hook now resolves the package set against the
chroot's actual apt index at install time:

  1. If NVIDIA_DRIVER_VERSION is set explicitly (env or config), pin
     to it — operator override always wins.
  2. Otherwise, ask apt for the highest `nvidia-dkms-<N>` available
     in the target distribution/release. Common Ubuntu shape across
     noble / resolute / etc. — versions vary (535, 550, 560, 580, ...).
  3. Fall through to the unversioned Debian metapackage `nvidia-dkms`
     if no numeric variants exist (bookworm, trixie).
  4. None of the above — skip with a warning instead of crashing the
     build on an opaque 'package not found'.

Closes the long-standing @todo and removes the silent build failures
on releases that don't ship nvidia-dkms-580 specifically.
The previous solution lived in
packages/bsp/common/usr/lib/armbian/armbian-firstrun as a single
line:

  [[ -n "$(dmesg | grep "No NVIDIA GPU found")" ]] && \
      sudo apt-get -y -qq purge nvidia-dkms-510 nvidia-driver-510 \
                                 nvidia-settings nvidia-common \
      >> /dev/null

Two reasons it was unreliable:

  1. dmesg-grep for "No NVIDIA GPU found" only sees the line if the
     driver actually bound far enough to print it. On many boots the
     line never appears (driver couldn't load at all) or has already
     rotated out of the kernel ring buffer by the time firstrun runs.

  2. Hardcoded nvidia-dkms-510 / nvidia-driver-510 — wrong on every
     distro/release that ships a different driver branch, and
     especially wrong now that the install path auto-picks the
     highest available version.

Replace it with a build-time-installed detector + systemd one-shot
under extensions/nvidia.sh:

  - /usr/lib/armbian/armbian-nvidia-autodetect
    probes the PCI bus directly (lspci -nn, vendor 10de). Works
    regardless of whether any driver module loaded.
    If no NVIDIA hardware found:
      a. drops /etc/modprobe.d/armbian-nvidia-disabled.conf
         (blacklist nvidia / nvidia_drm / nvidia_modeset / nvidia_uvm)
         so the driver doesn't try to load on the next boot.
      b. dpkg-query's the actually-installed nvidia-dkms-* /
         nvidia-driver-* / nvidia-settings / nvidia-common packages
         (no hardcoded version!) and apt-purges them. DKMS stops
         rebuilding the module on every kernel update.

  - /etc/systemd/system/armbian-nvidia-autodetect.service
    Type=oneshot, runs Before=display-manager.service /
    graphical.target. WantedBy=multi-user.target — fires every
    boot. Cheap (early exit when NVIDIA present), idempotent (no-op
    on a system where the packages are already purged), and handles
    hot-pluggable scenarios (eGPU added later → reverse direction
    handled by removing the modprobe.d file manually).

Removes the dmesg-grep line from armbian-firstrun and leaves a
breadcrumb pointing at the new location.
The runtime armbian-nvidia-autodetect helper in this extension calls
lspci to probe the PCI bus for an NVIDIA card. lspci lives in
pciutils, which isn't in the Debian/Ubuntu base install and isn't
guaranteed to be pulled by every desktop metapackage transitively.
The helper defensively no-ops when lspci is missing — which would
leave images without auto-disable on no-GPU hosts (the exact thing
this PR is meant to fix).

Append pciutils to PACKAGE_LIST_ADDITIONAL in
extension_finish_config so it lands in the rootfs alongside the
other build-time prerequisites.
The runtime autodetect's dpkg-query argument list used only the globs
'nvidia-dkms-*' and 'nvidia-driver-*' for the numbered Ubuntu shape.
The trailing dash makes those globs miss the bare 'nvidia-dkms' /
'nvidia-driver' metapackages — which the install branch deliberately
falls through to on Debian (case 3 of the resolver added in this PR).

Add the exact names alongside the globs so the purge covers both
shapes. Without this fix a Debian image installed with the extension
on a host that turns out to have no NVIDIA GPU would correctly drop
the modprobe blacklist but leave the package set behind, defeating
the DKMS-rebuild-avoidance half of the autodisable design.
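
The glob gap is easy to reproduce with plain shell pattern matching. The sketch below assumes `dpkg-query` package-name patterns behave like shell globs for these cases; both function names are hypothetical, and `case` stands in for the real `dpkg-query -W` call.

```bash
# Sketch only: both function names are hypothetical.
# globs_only NAME: the pre-fix purge set (a trailing dash is required by the globs).
globs_only() {
    case "$1" in
        nvidia-dkms-* | nvidia-driver-*) return 0 ;;
        nvidia-settings | nvidia-common) return 0 ;;
        *) return 1 ;;
    esac
}

# in_purge_set NAME: the widened purge set, with the bare metapackage names added.
in_purge_set() {
    case "$1" in
        nvidia-dkms-* | nvidia-driver-*) return 0 ;;   # versioned (Ubuntu shape)
        nvidia-dkms | nvidia-driver) return 0 ;;       # bare metapackages (Debian shape)
        nvidia-settings | nvidia-common) return 0 ;;
        *) return 1 ;;
    esac
}
```

With these definitions `globs_only nvidia-dkms` fails while `in_purge_set nvidia-dkms` succeeds, which is the Debian case the commit describes.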
…chable

chroot_sdcard wraps its argument with `bash -e -o pipefail -c …`, so
the pipeline `apt-cache pkgnames … | grep … | sed … | sort | head -1`
returns 1 when grep finds no numbered nvidia-dkms-<N> entries — the
exact Debian shape that the install resolver's case-3 fall-through
was designed to handle.

Under the build framework's outer `set -e` (compile.sh) the
substitution `latest=$(chroot_sdcard …)` then aborts the build at
that assignment, which means case-3 (unversioned `nvidia-dkms`
metapackage) was unreachable in practice.

Append `|| true` to the inner pipeline so the substitution always
succeeds with `$latest` empty on no-match, and the `if/elif` chain
below can pick case-2 (number found), case-3 (Debian fallback) or
case-4 (skip with warn) on real data.

Reproduced and verified locally — without `|| true` the assignment
aborts; with it, latest='' and the fallback executes.
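
The failure mode can be reproduced without a chroot. In this sketch a plain `bash -e -o pipefail -c` stands in for `chroot_sdcard`, and the function names `latest_strict` and `latest_tolerant` are hypothetical; the pipeline itself follows the one quoted in the commit message.

```bash
# Sketch only: function names are hypothetical; `bash -e -o pipefail -c` stands in
# for chroot_sdcard. $1 is a newline-separated list of package names.

# Without `|| true`: grep's exit status 1 on "no match" becomes the pipeline's status.
latest_strict() {
    printf '%s\n' "$1" | bash -e -o pipefail -c \
        'grep -E "^nvidia-dkms-[0-9]+\$" | sed "s/^nvidia-dkms-//" | sort -nr | head -n 1'
}

# With `|| true`: an empty match yields empty output and exit status 0.
latest_tolerant() {
    printf '%s\n' "$1" | bash -e -o pipefail -c \
        'grep -E "^nvidia-dkms-[0-9]+\$" | sed "s/^nvidia-dkms-//" | sort -nr | head -n 1 || true'
}
```

On the Debian-shaped input `nvidia-dkms`, `latest_strict` returns a non-zero status, so `latest=$(...)` under an outer `set -e` would abort, whereas `latest_tolerant` returns success with empty output and lets the fallback chain run.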
… redundant)

Two reasons:

  1. PACKAGE_LIST_ADDITIONAL is sealed `readonly` by the time
     extension_finish_config__* hooks run, so the
     `declare -g PACKAGE_LIST_ADDITIONAL+=" pciutils"` line aborted
     the build with:
       /armbian/extensions/nvidia.sh: line 20: declare:
         PACKAGE_LIST_ADDITIONAL: readonly variable

  2. It was redundant anyway. `pciutils` is already listed in
     config/cli/common/main/packages.additional, which ships in
     every non-minimal CLI image. This extension early-returns on
     BUILD_MINIMAL=yes, so we never reach a context that wouldn't
     have pciutils already present.

Replace the now-broken line with a comment pointing at the canonical
source so a future maintainer doesn't try to add it again.
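
The commit resolves this by dropping the append entirely. Purely as an illustration of the failure mode, the sketch below shows one way a hook could test whether a variable is sealed before touching it; `is_readonly_var` is a hypothetical helper and is not part of the Armbian framework.

```bash
# Sketch only: is_readonly_var is a hypothetical helper.
# Succeeds if the named variable exists and carries the readonly attribute.
is_readonly_var() {
    local decl
    # `declare -p` prints e.g. `declare -r NAME="value"`; it fails for unset names.
    decl="$(declare -p "$1" 2> /dev/null)" || return 1
    [[ "$decl" =~ ^declare\ -[A-Za-z]*r ]]
}

# Possible guard (not what the PR does; the PR removes the append instead):
#   is_readonly_var PACKAGE_LIST_ADDITIONAL || PACKAGE_LIST_ADDITIONAL+=" pciutils"
```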
post_install_kernel_debs runs BEFORE armbian-bsp-cli is installed in
the chroot (per its own docstring at the call site). Anything we
wrote into /usr/lib/armbian/ or /etc/systemd/system/ there was
getting clobbered by the BSP install or swept by later rootfs steps
— operator reported the autodetect script + service simply weren't
in the resulting rootfs even though the firstrun edit shipped
(because firstrun ships through the BSP package which is
dpkg-tracked).

Split the responsibilities:

  - post_install_kernel_debs__build_nvidia_kernel_module
    keeps doing the apt-get install of nvidia-dkms-* / nvidia-driver-*
    (works fine before BSP — dependencies resolve, dkms builds).

  - post_family_tweaks__build_nvidia_kernel_module_autodetect (NEW)
    calls install_armbian_nvidia_autodetect_helper. post_family_tweaks
    fires AFTER `install_artifact_deb_chroot "armbian-bsp-cli"` so
    /usr/lib/armbian/ already exists with BSP-owned content and our
    untracked drop sits beside it without being overwritten.

The autodetect remains extension-gated (only on images built with the
nvidia extension enabled), not BSP-common — per operator preference,
to avoid every SBC's bsp-cli carrying nvidia-related plumbing it has
no use for.
The case-4 fallback ("No nvidia-dkms package in ... apt sources")
hits on noble even though nvidia-dkms-* lives in restricted/, where
the rootfs is supposed to include the restricted component. Without
seeing the chroot's apt state at the moment of failure, there's no
way to tell whether:
  - restricted is missing from sources.list.d at all,
  - it's listed but the indices were never fetched,
  - or apt-cache pkgnames is filtering everything else for some
    arch / component reason.

Before the existing "skipping nVidia install" warn, dump:
  - `apt-cache pkgnames | grep -c ^nvidia` from inside the chroot
  - listing of /etc/apt/sources.list.d/
  - sources files that mention "restricted"
  - apt/lists entries containing restricted/multiverse (proves
    whether indices were refreshed)

All purely diagnostic; no behaviour change on the happy path.
chroot_sdcard wraps the inner command with `bash -e -o pipefail -c`.
The debug-dump pipelines added in the previous commit had grep/ls
calls that legitimately return rc=1 when nothing matches; pipefail
propagates that as the pipeline's exit, the outer set -e aborts the
build mid-function, and bash emits a confusing
  pop_var_context: head of shell_variables not a function context
instead of the actual diagnostic.

Tail every chroot_sdcard "..." with `|| true` so empty matches stay
rc=0 and the diagnostic lines actually print.

Also simplify the "restricted" probe from `grep -lE '(^|\s)restricted(\s|$)'`
to `grep -lF restricted` - the regex form was both fragile under
nested double-quote escaping and overkill for what we need (presence
of the literal word in a sources file).
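
A sketch of a diagnostic dump in the same spirit, written so that no step can abort the caller. It inspects a rootfs directory from the outside rather than running inside the chroot, so the `apt-cache pkgnames` count is left out; the function name `dump_apt_diagnostics` and the message texts are assumptions.

```bash
# Sketch only: dump_apt_diagnostics is a hypothetical name.
#   $1  path to the rootfs to inspect
# Every probe ends in `|| true` so an empty match never becomes a fatal status.
dump_apt_diagnostics() {
    local rootfs="$1"

    echo "sources.list.d entries:"
    ls -1 "${rootfs}/etc/apt/sources.list.d/" 2> /dev/null || true

    echo "sources files mentioning 'restricted':"
    # Fixed-string match (-F): presence of the literal word is all that is needed.
    grep -lF restricted "${rootfs}"/etc/apt/sources.list.d/* 2> /dev/null || true

    echo "fetched indices for restricted/multiverse:"
    ls -1 "${rootfs}/var/lib/apt/lists/" 2> /dev/null | grep -E 'restricted|multiverse' || true
}
```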
apt-cache pkgnames reads from /var/cache/apt/pkgcache.bin, which is
built from /var/lib/apt/lists/. If the rootfs was cached before
`restricted` was added to ubuntu.sources, or if the framework hasn't
run `apt-get update` since the final sources.list was finalized, the
indices for the restricted component are simply absent - and pkgnames
returns nothing for nvidia-dkms-*, even though sources.list lists
the component.

Verified locally that on a stale chroot the pkgnames pipeline returns
empty; after `apt-get update`, it returns the full nvidia-dkms-N set.

`apt-get update -qq || true`: quiet on success, doesn't abort the
build if the proxy hiccups or one of the suite indices fails (apt
returns non-zero on partial failures).
If a previous boot ran on the same rootfs without NVIDIA hardware,
the autodetect helper wrote /etc/modprobe.d/armbian-nvidia-disabled.conf
to keep the kernel modules from auto-loading. When the same rootfs
later boots with NVIDIA hardware (card added, SSD swapped into a
GPU-equipped host), the early `exit 0` left that file in place, so
the modules still wouldn't load even though they're present and the
GPU is wired up. The detector was effectively one-way.

When lspci finds [10de:], clear the blacklist file if it exists
(rm -f is idempotent for the common case where it never existed).
Log via systemd-cat so the action shows up in `journalctl -u
armbian-nvidia-autodetect` for triage.

Deliberately /not/ auto-reinstalling NVIDIA packages on the recovery
path - proprietary driver auto-install without operator consent, and
without guaranteed network/apt-sources, is out of scope for a
boot-time detector. If packages were previously purged the operator
runs apt install manually; the freshly-cleared blacklist file makes
that work the next boot.
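
The two-way behaviour can be sketched as a function that keeps the blacklist file in sync with the detection result. The module names and the file name `armbian-nvidia-disabled.conf` come from this PR; the function name `sync_blacklist`, its arguments, and the omission of the purge and logging steps are simplifications for illustration.

```bash
# Sketch only: sync_blacklist is a hypothetical name; purge and logging are omitted.
#   $1  "yes" if NVIDIA hardware was detected, anything else otherwise
#   $2  path of the blacklist file
#       (on a real system: /etc/modprobe.d/armbian-nvidia-disabled.conf)
sync_blacklist() {
    local gpu_present="$1" conf="$2"
    if [[ "$gpu_present" == "yes" ]]; then
        # Recovery path: rm -f is a no-op when the file never existed.
        rm -f "$conf"
    else
        mkdir -p "$(dirname "$conf")"
        # printf repeats the format once per argument: one blacklist line per module.
        printf 'blacklist %s\n' nvidia nvidia_drm nvidia_modeset nvidia_uvm > "$conf"
    fi
}
```

Both branches are idempotent, so running the function on every boot is safe whichever way the hardware has changed since the previous boot.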
…boot when GPU is found

Previous flow on a host without an NVIDIA GPU:
  - First boot: kernel modules auto-load from initrd udev → probe
    fails → noisy dmesg ("nvidia: probe failed", DKMS rebuild
    artefacts in journal, etc.)
  - armbian-nvidia-autodetect runs in userspace → writes blacklist
    + purges packages
  - Second boot: clean

Inverting the default so the first boot is also clean:

  1. Build-time write of /etc/modprobe.d/armbian-nvidia-disabled.conf
     BEFORE the apt install. nvidia-dkms postinst triggers
     update-initramfs which now bakes the blacklist into initramfs,
     so initrd udev doesn't try to load nvidia* at all.

  2. Boot-time autodetect:
       - lspci finds [10de:] → rm -f blacklist file + modprobe
         nvidia_drm modeset=1 (pulls nvidia + nvidia_modeset via
         deps; Wayland-friendly KMS). Display-manager (we're
         Before= it) sees the driver loaded.
       - No [10de:] → keep blacklist + purge packages (unchanged).

Self-healing for hosts that gain a GPU later: rootfs blacklist file
deleted on first NVIDIA-detected boot; next kernel upgrade
regenerates initramfs from the (now blacklist-free) rootfs, so
subsequent boots are clean directly from initrd. Until that kernel
upgrade, the runtime modprobe covers the gap each boot.

|| true on modprobe handles the edge case where packages were
previously purged on a no-GPU run, then the operator swapped in a
GPU but hasn't re-apt-installed. Operator runs apt install manually
in that case; the cleared blacklist makes it work on next boot.
@igorpecovnik igorpecovnik force-pushed the nvidia-auto-detect-driver-version branch from 75325a5 to 98a0649 Compare May 18, 2026 05:31

Labels

05 Milestone: Second quarter release · BSP (Board Support Packages) · Framework (Framework components) · Needs review (Seeking for review) · size/large (PR with 250 lines or more)

Development

Successfully merging this pull request may close these issues.

could you please add this issues tab for cubietruck page so we can report issues? thank you

1 participant