Skip to content

Refactor mount operations to use procfd and new mount API#2094

Draft
giuseppe wants to merge 23 commits into
containers:mainfrom
giuseppe:procfd-refactor
Draft

Refactor mount operations to use procfd and new mount API#2094
giuseppe wants to merge 23 commits into
containers:mainfrom
giuseppe:procfd-refactor

Conversation

@giuseppe
Copy link
Copy Markdown
Member

  • Open a /proc fd early and thread it through all mount helpers, replacing hardcoded /proc/self/fd string paths with fd-relative operations. This removes the dependency on /proc being mounted during container setup.
  • Migrate mount operations to the new mount API (fsopen/fsmount/move_mount/mount_setattr).
  • Move pivot_root before container mounts and defer mount namespace creation (CLONE_NEWNS) to libcrun_do_pivot_root(), consolidating it in a single location.
  • Add OPEN_TREE_NAMESPACE support (kernel 7.0+) to replace the unshare(CLONE_NEWNS) + bind mount rootfs + pivot_root sequence entirely.

Closes #2086

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request refactors the container mount logic to utilize the new Linux mount API, enabling detached mounts and the pre-opening of file descriptors before pivot_root to maintain access to host paths. It introduces setup_mount_namespace, adds support for MS_SYNCHRONOUS and MS_DIRSYNC flags, and enhances rootless device node handling. The review feedback identifies several bugs related to empty path strings when processing the root directory and suggests using AT_FDCWD for consistent directory descriptor resolution.

Comment thread src/libcrun/linux.c Outdated
Comment on lines +3438 to +3440
if (get_private_data (container)->mount_fds)
{
struct libcrun_fd_map *mount_fds = get_private_data (container)->mount_fds;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

Directly checking get_private_data (container)->mount_fds might skip pre-opening bind mounts if no non-bind mounts have been processed yet (as mount_fds is lazily initialized). Use get_fd_map (container) to ensure the map is initialized before use.

          struct libcrun_fd_map *mount_fds = get_fd_map (container);
          if (mount_fds)
            {

Comment thread src/libcrun/linux.c Outdated
if (old_root_fd >= 0 && source && source[0] == '/')
{
crun_error_release (&tmp_err);
bindfd = get_bind_mount (old_root_fd, source + 1,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If source is exactly /, source + 1 will be an empty string. get_bind_mount calls syscall_open_tree, which requires the AT_EMPTY_PATH flag to handle an empty path. Since get_bind_mount does not currently pass this flag, the call will fail. Using . instead of an empty string avoids this issue.

                      bindfd = get_bind_mount (old_root_fd, (source[1] == '\0') ? "." : source + 1,

Comment thread src/libcrun/linux.c Outdated
if (UNLIKELY (ret < 0))
return crun_make_error (err, errno, "bind mount `/sys` from the host");
crun_error_release (err);
mountfd = get_bind_mount (-1, "/sys", true, false, false, MS_PRIVATE, err);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Use AT_FDCWD instead of -1 for the directory file descriptor to ensure standard behavior for relative path resolution in get_bind_mount.

                      mountfd = get_bind_mount (AT_FDCWD, "/sys", true, false, false, MS_PRIVATE, err);

Comment thread src/libcrun/linux.c Outdated
/* Do not resolve the symlink only when src-nofollow and copy-symlink are used. */
ret = get_file_type (&src_mode, (extra_flags & (OPTION_SRC_NOFOLLOW | OPTION_COPY_SYMLINK)) ? true : false, path);
if (old_root_fd >= 0)
ret = get_file_type_at (old_root_fd, &src_mode, nofollow, path + 1);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If path is exactly /, path + 1 will be an empty string. get_file_type_at only applies the AT_EMPTY_PATH flag if the path argument is NULL. Passing an empty string will result in an ENOENT error. Suggest passing NULL when the path is the root directory.

            ret = get_file_type_at (old_root_fd, &src_mode, nofollow, (path[1] == '\0') ? NULL : path + 1);

Comment thread src/libcrun/linux.c Outdated
if (old_root_fd >= 0 && mount->source && mount->source[0] == '/')
{
bind_dirfd = old_root_fd;
bind_source = mount->source + 1;
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

If mount->source is exactly /, mount->source + 1 will be an empty string, which causes get_bind_mount to fail as it lacks the AT_EMPTY_PATH flag for its internal open_tree call. Using . ensures the directory itself is opened correctly.

              bind_source = (mount->source[1] == '\0') ? "." : mount->source + 1;

Comment thread src/libcrun/linux.c Outdated
{
libcrun_error_t tmp_err = NULL;

mount_fds->fds[i] = get_bind_mount (-1, def->mounts[i]->source,
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Use AT_FDCWD instead of -1 for the directory file descriptor to ensure standard behavior for relative path resolution in get_bind_mount.

                  mount_fds->fds[i] = get_bind_mount (AT_FDCWD, def->mounts[i]->source,

@packit-as-a-service
Copy link
Copy Markdown

TMT tests failed. @containers/packit-build please check.

@giuseppe
Copy link
Copy Markdown
Member Author

CI fuzzing fix in #2096

giuseppe and others added 23 commits May 15, 2026 18:10
The check was inverted as it dereferenced src_nofollow when the
pointer was NULL instead of when it was non-NULL.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Replace setxattr via /proc/self/fd path with openat(procfd) +
fsetxattr.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Replace chmod/chown via /proc/self/fd path with fchmodat/fchownat
using the trusted procfd directory fd.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Use procfd-relative paths for get_bind_mount and fstatfs instead of
/proc/self/fd string paths.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Use get_bind_mount with the cached procfd and relative self/fd path
instead of formatting an absolute /proc/self/fd path and going through
do_mount.  Apply the extra mount flags from fstatfs via
do_mount_setattr on the detached mount before moving it into place.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Replace libcrun_get_cgroup_process() with direct read via procfd to
avoid dependency on /proc being mounted.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Use mount_setattr() when available before falling back to the
mount() syscall.  mount_setattr() is fd-based and does not need
/proc/self/fd paths.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Use mount_setattr() when available before falling back to the
mount() syscall.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Replace the slash-counting loop bound with fstat-based root
detection.  The old code counted '/' characters in the path string
to cap iterations; for relative paths like "rootfs" (zero slashes)
the loop ran only once, which could be insufficient.

Compare inode/device of each directory with its parent instead:
when ".." resolves to the same inode we have reached the filesystem
root.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
The keep_flags path in do_masked_or_readonly_path passed the inherited
mount flags (MS_RDONLY, MS_NOSUID, etc.) as the `clear` argument to
do_mount_setattr instead of `set`, effectively clearing the flags
rather than applying them.

Swap the arguments so the inherited flags are set on the readonly bind
mount as intended.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Try fsopen_mount()+fs_move_mount_to() for filesystem mounts and
get_bind_mount()+fs_move_mount_to() for bind mounts before falling
back to mount().  For propagation changes, try mount_setattr()
first.

This removes the dependency on /proc/self/fd paths for the initial
mount and propagation steps when the new mount API is available.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
In user namespace containers, devices under /dev are created via bind
mounts from the host.  Pre-open the 6 standard devices (/dev/null,
/dev/zero, /dev/full, /dev/tty, /dev/random, /dev/urandom) using
open_tree(OPEN_TREE_CLONE) in the parent process after clone(), and
send them to the child via the existing sync socket fd-passing
mechanism.

This reuses the same send_mounts/receive_mounts pattern already used
for custom devices (dev_fds) and bind mount sources (mount_fds).

This is a preparatory change for OPEN_TREE_NAMESPACE support, where
host paths are not accessible after setns() into the new mount
namespace.

When open_tree() is not available or fails (e.g. rootless without
CAP_SYS_ADMIN), the fds remain -1 and the child falls back to the
existing bind mount path.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Replace the openat(procfd) + fchmod/fchown approach with direct
fchmodat/fchownat calls.  These operate on the directory entry
without opening the device file, avoiding ENXIO errors on character
devices like /dev/tty that fail to open when no controlling terminal
exists.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
and rename it

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Use the new mount API (fsopen/fsconfig/fsmount) to create detached
mounts for each non-bind filesystem type before pivot_root.  After
the pivot, place them via move_mount so the kernel mnt_already_visible
check is satisfied.  set_mounts then mounts fresh instances on top
with the correct OCI flags.

This fixes mounting proc/sysfs/cgroup in containers that use a user
namespace, where the kernel denies these mounts unless a mount of
the same type is already visible.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
The masked-dir-nlink test checked that st_nlink >= 2 to verify a
masked path is a directory.  This fails on btrfs where directories
have st_nlink=1 (btrfs does not count . and .. in nlink).

Add an 'isdir' command to the test init binary and use S_ISDIR to
verify the masked path is a directory, which works regardless of
filesystem type.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Read symlink targets for copy-symlink mounts while the host filesystem
is still fully accessible (before pivot_root).  After pivot_root the
original symlink source is not reachable, so cache the readlink result
in private_data and use it from handle_copy_symlink and
process_single_mount.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Move CLONE_NEWNS out of both the clone() call (non-userns path) and
init_container()'s unshare() (userns path), consolidating it into
libcrun_do_pivot_root().  This makes mount namespace creation happen
at a single location, immediately before it is needed.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Use open_tree(OPEN_TREE_NAMESPACE) + setns(CLONE_NEWNS) to replace the
traditional unshare(CLONE_NEWNS) + bind mount rootfs + pivot_root
sequence.

OPEN_TREE_NAMESPACE creates a new mount namespace with the rootfs as
the root mount.  setns() enters that namespace directly, so no bind
mount or pivot_root is needed.  The kernel automatically sets the
process root and cwd to the new namespace's root when the old root is
not reachable.

On older kernels (< 7.0) or when OPEN_TREE_NAMESPACE is not
supported, the code falls back to the traditional path.

Closes: containers#2086

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Use OPEN_TREE_NAMESPACE

1 participant