Refactor mount operations to use procfd and new mount API by giuseppe · Pull Request #2094 · containers/crun

giuseppe · 2026-05-15T13:56:05Z

Open a /proc fd early and thread it through all mount helpers, replacing hardcoded /proc/self/fd string paths with fd-relative operations. This removes the dependency on /proc being mounted during container setup.
Migrate mount operations to the new mount API (fsopen/fsmount/move_mount/mount_setattr).
Move pivot_root before container mounts and defer mount namespace creation (CLONE_NEWNS) to libcrun_do_pivot_root(), consolidating it in a single location.
Add OPEN_TREE_NAMESPACE support (kernel 7.0+) to replace the unshare(CLONE_NEWNS) + bind mount rootfs + pivot_root sequence entirely.

Closes #2086

gemini-code-assist

Code Review

This pull request refactors the container mount logic to utilize the new Linux mount API, enabling detached mounts and the pre-opening of file descriptors before pivot_root to maintain access to host paths. It introduces setup_mount_namespace, adds support for MS_SYNCHRONOUS and MS_DIRSYNC flags, and enhances rootless device node handling. The review feedback identifies several bugs related to empty path strings when processing the root directory and suggests using AT_FDCWD for consistent directory descriptor resolution.

gemini-code-assist · 2026-05-15T14:00:29Z

+          if (get_private_data (container)->mount_fds)
+            {
+              struct libcrun_fd_map *mount_fds = get_private_data (container)->mount_fds;


Directly checking get_private_data (container)->mount_fds might skip pre-opening bind mounts if no non-bind mounts have been processed yet (as mount_fds is lazily initialized). Use get_fd_map (container) to ensure the map is initialized before use.

struct libcrun_fd_map *mount_fds = get_fd_map (container); if (mount_fds) {

gemini-code-assist · 2026-05-15T14:00:29Z

+                  if (old_root_fd >= 0 && source && source[0] == '/')
+                    {
+                      crun_error_release (&tmp_err);
+                      bindfd = get_bind_mount (old_root_fd, source + 1,


If source is exactly /, source + 1 will be an empty string. get_bind_mount calls syscall_open_tree, which requires the AT_EMPTY_PATH flag to handle an empty path. Since get_bind_mount does not currently pass this flag, the call will fail. Using . instead of an empty string avoids this issue.

bindfd = get_bind_mount (old_root_fd, (source[1] == '\0') ? "." : source + 1,

gemini-code-assist · 2026-05-15T14:00:29Z

-                      if (UNLIKELY (ret < 0))
-                        return crun_make_error (err, errno, "bind mount `/sys` from the host");
+                      crun_error_release (err);
+                      mountfd = get_bind_mount (-1, "/sys", true, false, false, MS_PRIVATE, err);


Use AT_FDCWD instead of -1 for the directory file descriptor to ensure standard behavior for relative path resolution in get_bind_mount.

mountfd = get_bind_mount (AT_FDCWD, "/sys", true, false, false, MS_PRIVATE, err);

gemini-code-assist · 2026-05-15T14:00:29Z

-      /* Do not resolve the symlink only when src-nofollow and copy-symlink are used.  */
-      ret = get_file_type (&src_mode, (extra_flags & (OPTION_SRC_NOFOLLOW | OPTION_COPY_SYMLINK)) ? true : false, path);
+          if (old_root_fd >= 0)
+            ret = get_file_type_at (old_root_fd, &src_mode, nofollow, path + 1);


If path is exactly /, path + 1 will be an empty string. get_file_type_at only applies the AT_EMPTY_PATH flag if the path argument is NULL. Passing an empty string will result in an ENOENT error. Suggest passing NULL when the path is the root directory.

ret = get_file_type_at (old_root_fd, &src_mode, nofollow, (path[1] == '\0') ? NULL : path + 1);

gemini-code-assist · 2026-05-15T14:00:29Z

+          if (old_root_fd >= 0 && mount->source && mount->source[0] == '/')
+            {
+              bind_dirfd = old_root_fd;
+              bind_source = mount->source + 1;


If mount->source is exactly /, mount->source + 1 will be an empty string, which causes get_bind_mount to fail as it lacks the AT_EMPTY_PATH flag for its internal open_tree call. Using . ensures the directory itself is opened correctly.

bind_source = (mount->source[1] == '\0') ? "." : mount->source + 1;

gemini-code-assist · 2026-05-15T14:00:29Z

+                {
+                  libcrun_error_t tmp_err = NULL;
+
+                  mount_fds->fds[i] = get_bind_mount (-1, def->mounts[i]->source,


Use AT_FDCWD instead of -1 for the directory file descriptor to ensure standard behavior for relative path resolution in get_bind_mount.

mount_fds->fds[i] = get_bind_mount (AT_FDCWD, def->mounts[i]->source,

packit-as-a-service · 2026-05-15T14:06:50Z

TMT tests failed. @containers/packit-build please check.

giuseppe · 2026-05-15T15:00:38Z

CI fuzzing fix in #2096

The check was inverted as it dereferenced src_nofollow when the pointer was NULL instead of when it was non-NULL. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

Replace setxattr via /proc/self/fd path with openat(procfd) + fsetxattr. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

Replace chmod/chown via /proc/self/fd path with fchmodat/fchownat using the trusted procfd directory fd. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

Use procfd-relative paths for get_bind_mount and fstatfs instead of /proc/self/fd string paths. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

Use get_bind_mount with the cached procfd and relative self/fd path instead of formatting an absolute /proc/self/fd path and going through do_mount. Apply the extra mount flags from fstatfs via do_mount_setattr on the detached mount before moving it into place. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

Replace libcrun_get_cgroup_process() with direct read via procfd to avoid dependency on /proc being mounted. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

Use mount_setattr() when available before falling back to the mount() syscall. mount_setattr() is fd-based and does not need /proc/self/fd paths. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

Use mount_setattr() when available before falling back to the mount() syscall. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

Replace the slash-counting loop bound with fstat-based root detection. The old code counted '/' characters in the path string to cap iterations; for relative paths like "rootfs" (zero slashes) the loop ran only once, which could be insufficient. Compare inode/device of each directory with its parent instead: when ".." resolves to the same inode we have reached the filesystem root. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

The keep_flags path in do_masked_or_readonly_path passed the inherited mount flags (MS_RDONLY, MS_NOSUID, etc.) as the `clear` argument to do_mount_setattr instead of `set`, effectively clearing the flags rather than applying them. Swap the arguments so the inherited flags are set on the readonly bind mount as intended. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

Try fsopen_mount()+fs_move_mount_to() for filesystem mounts and get_bind_mount()+fs_move_mount_to() for bind mounts before falling back to mount(). For propagation changes, try mount_setattr() first. This removes the dependency on /proc/self/fd paths for the initial mount and propagation steps when the new mount API is available. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

In user namespace containers, devices under /dev are created via bind mounts from the host. Pre-open the 6 standard devices (/dev/null, /dev/zero, /dev/full, /dev/tty, /dev/random, /dev/urandom) using open_tree(OPEN_TREE_CLONE) in the parent process after clone(), and send them to the child via the existing sync socket fd-passing mechanism. This reuses the same send_mounts/receive_mounts pattern already used for custom devices (dev_fds) and bind mount sources (mount_fds). This is a preparatory change for OPEN_TREE_NAMESPACE support, where host paths are not accessible after setns() into the new mount namespace. When open_tree() is not available or fails (e.g. rootless without CAP_SYS_ADMIN), the fds remain -1 and the child falls back to the existing bind mount path. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

Replace the openat(procfd) + fchmod/fchown approach with direct fchmodat/fchownat calls. These operate on the directory entry without opening the device file, avoiding ENXIO errors on character devices like /dev/tty that fail to open when no controlling terminal exists. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

and rename it Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

Use the new mount API (fsopen/fsconfig/fsmount) to create detached mounts for each non-bind filesystem type before pivot_root. After the pivot, place them via move_mount so the kernel mnt_already_visible check is satisfied. set_mounts then mounts fresh instances on top with the correct OCI flags. This fixes mounting proc/sysfs/cgroup in containers that use a user namespace, where the kernel denies these mounts unless a mount of the same type is already visible. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

The masked-dir-nlink test checked that st_nlink >= 2 to verify a masked path is a directory. This fails on btrfs where directories have st_nlink=1 (btrfs does not count . and .. in nlink). Add an 'isdir' command to the test init binary and use S_ISDIR to verify the masked path is a directory, which works regardless of filesystem type. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

Read symlink targets for copy-symlink mounts while the host filesystem is still fully accessible (before pivot_root). After pivot_root the original symlink source is not reachable, so cache the readlink result in private_data and use it from handle_copy_symlink and process_single_mount. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

Move CLONE_NEWNS out of both the clone() call (non-userns path) and init_container()'s unshare() (userns path), consolidating it into libcrun_do_pivot_root(). This makes mount namespace creation happen at a single location, immediately before it is needed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

Use open_tree(OPEN_TREE_NAMESPACE) + setns(CLONE_NEWNS) to replace the traditional unshare(CLONE_NEWNS) + bind mount rootfs + pivot_root sequence. OPEN_TREE_NAMESPACE creates a new mount namespace with the rootfs as the root mount. setns() enters that namespace directly, so no bind mount or pivot_root is needed. The kernel automatically sets the process root and cwd to the new namespace's root when the old root is not reachable. On older kernels (< 7.0) or when OPEN_TREE_NAMESPACE is not supported, the code falls back to the traditional path. Closes: containers#2086 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

giuseppe mentioned this pull request May 15, 2026

Use OPEN_TREE_NAMESPACE #2086

Open

gemini-code-assist Bot reviewed May 15, 2026

View reviewed changes

giuseppe force-pushed the procfd-refactor branch from 690c6e7 to 935db62 Compare May 15, 2026 15:00

giuseppe and others added 23 commits May 15, 2026 18:10

linux: fix NULL pointer dereference in is_bind_mount

dec7637

The check was inverted as it dereferenced src_nofollow when the pointer was NULL instead of when it was non-NULL. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

linux: open procfd early and store in private_data

843e20d

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

linux: use fsetxattr with procfd in do_mount

2d23082

Replace setxattr via /proc/self/fd path with openat(procfd) + fsetxattr. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

linux: use fchmodat/fchownat with procfd in libcrun_create_dev

4d9fd4b

Replace chmod/chown via /proc/self/fd path with fchmodat/fchownat using the trusted procfd directory fd. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

linux: use procfd to read /proc/self/cgroup in do_mount_cgroup_v1

1313ad4

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

linux: use procfd to stat source_mountfd in process_single_mount

17a8f72

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

linux: use procfd in do_masked_or_readonly_path

1202b23

Use procfd-relative paths for get_bind_mount and fstatfs instead of /proc/self/fd string paths. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

linux: use procfd in get_shared_empty_dir_cached and mount_masked_dir

c19298f

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

linux: use procfd to read unified cgroup path

eb7e3ea

Replace libcrun_get_cgroup_process() with direct read via procfd to avoid dependency on /proc being mounted. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

linux: try mount_setattr in make_parent_mount_private

0829392

Use mount_setattr() when available before falling back to the mount() syscall. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

linux: change signature for open_mount_of_type

84fa224

and rename it Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

giuseppe force-pushed the procfd-refactor branch from 935db62 to 33d8053 Compare May 15, 2026 16:10

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor mount operations to use procfd and new mount API#2094

Refactor mount operations to use procfd and new mount API#2094
giuseppe wants to merge 23 commits into
containers:mainfrom
giuseppe:procfd-refactor

giuseppe commented May 15, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot May 15, 2026

Uh oh!

gemini-code-assist Bot May 15, 2026

Uh oh!

gemini-code-assist Bot May 15, 2026

Uh oh!

gemini-code-assist Bot May 15, 2026

Uh oh!

gemini-code-assist Bot May 15, 2026

Uh oh!

gemini-code-assist Bot May 15, 2026

Uh oh!

packit-as-a-service Bot commented May 15, 2026

Uh oh!

giuseppe commented May 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

giuseppe commented May 15, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 15, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 15, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 15, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 15, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 15, 2026

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist Bot May 15, 2026

Choose a reason for hiding this comment

Uh oh!

packit-as-a-service Bot commented May 15, 2026

Uh oh!

giuseppe commented May 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant