Refactor mount operations to use procfd and new mount API#2094
Refactor mount operations to use procfd and new mount API#2094giuseppe wants to merge 23 commits into
Conversation
There was a problem hiding this comment.
Code Review
This pull request refactors the container mount logic to utilize the new Linux mount API, enabling detached mounts and the pre-opening of file descriptors before pivot_root to maintain access to host paths. It introduces setup_mount_namespace, adds support for MS_SYNCHRONOUS and MS_DIRSYNC flags, and enhances rootless device node handling. The review feedback identifies several bugs related to empty path strings when processing the root directory and suggests using AT_FDCWD for consistent directory descriptor resolution.
| if (get_private_data (container)->mount_fds) | ||
| { | ||
| struct libcrun_fd_map *mount_fds = get_private_data (container)->mount_fds; |
There was a problem hiding this comment.
Directly checking get_private_data (container)->mount_fds might skip pre-opening bind mounts if no non-bind mounts have been processed yet (as mount_fds is lazily initialized). Use get_fd_map (container) to ensure the map is initialized before use.
struct libcrun_fd_map *mount_fds = get_fd_map (container);
if (mount_fds)
{| if (old_root_fd >= 0 && source && source[0] == '/') | ||
| { | ||
| crun_error_release (&tmp_err); | ||
| bindfd = get_bind_mount (old_root_fd, source + 1, |
There was a problem hiding this comment.
If source is exactly /, source + 1 will be an empty string. get_bind_mount calls syscall_open_tree, which requires the AT_EMPTY_PATH flag to handle an empty path. Since get_bind_mount does not currently pass this flag, the call will fail. Using . instead of an empty string avoids this issue.
bindfd = get_bind_mount (old_root_fd, (source[1] == '\0') ? "." : source + 1,| if (UNLIKELY (ret < 0)) | ||
| return crun_make_error (err, errno, "bind mount `/sys` from the host"); | ||
| crun_error_release (err); | ||
| mountfd = get_bind_mount (-1, "/sys", true, false, false, MS_PRIVATE, err); |
| /* Do not resolve the symlink only when src-nofollow and copy-symlink are used. */ | ||
| ret = get_file_type (&src_mode, (extra_flags & (OPTION_SRC_NOFOLLOW | OPTION_COPY_SYMLINK)) ? true : false, path); | ||
| if (old_root_fd >= 0) | ||
| ret = get_file_type_at (old_root_fd, &src_mode, nofollow, path + 1); |
There was a problem hiding this comment.
If path is exactly /, path + 1 will be an empty string. get_file_type_at only applies the AT_EMPTY_PATH flag if the path argument is NULL. Passing an empty string will result in an ENOENT error. Suggest passing NULL when the path is the root directory.
ret = get_file_type_at (old_root_fd, &src_mode, nofollow, (path[1] == '\0') ? NULL : path + 1);| if (old_root_fd >= 0 && mount->source && mount->source[0] == '/') | ||
| { | ||
| bind_dirfd = old_root_fd; | ||
| bind_source = mount->source + 1; |
There was a problem hiding this comment.
If mount->source is exactly /, mount->source + 1 will be an empty string, which causes get_bind_mount to fail as it lacks the AT_EMPTY_PATH flag for its internal open_tree call. Using . ensures the directory itself is opened correctly.
bind_source = (mount->source[1] == '\0') ? "." : mount->source + 1;| { | ||
| libcrun_error_t tmp_err = NULL; | ||
|
|
||
| mount_fds->fds[i] = get_bind_mount (-1, def->mounts[i]->source, |
|
TMT tests failed. @containers/packit-build please check. |
|
CI fuzzing fix in #2096 |
The check was inverted as it dereferenced src_nofollow when the pointer was NULL instead of when it was non-NULL. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Replace setxattr via /proc/self/fd path with openat(procfd) + fsetxattr. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Replace chmod/chown via /proc/self/fd path with fchmodat/fchownat using the trusted procfd directory fd. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Use procfd-relative paths for get_bind_mount and fstatfs instead of /proc/self/fd string paths. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Use get_bind_mount with the cached procfd and relative self/fd path instead of formatting an absolute /proc/self/fd path and going through do_mount. Apply the extra mount flags from fstatfs via do_mount_setattr on the detached mount before moving it into place. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Replace libcrun_get_cgroup_process() with direct read via procfd to avoid dependency on /proc being mounted. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Use mount_setattr() when available before falling back to the mount() syscall. mount_setattr() is fd-based and does not need /proc/self/fd paths. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Use mount_setattr() when available before falling back to the mount() syscall. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Replace the slash-counting loop bound with fstat-based root detection. The old code counted '/' characters in the path string to cap iterations; for relative paths like "rootfs" (zero slashes) the loop ran only once, which could be insufficient. Compare inode/device of each directory with its parent instead: when ".." resolves to the same inode we have reached the filesystem root. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
The keep_flags path in do_masked_or_readonly_path passed the inherited mount flags (MS_RDONLY, MS_NOSUID, etc.) as the `clear` argument to do_mount_setattr instead of `set`, effectively clearing the flags rather than applying them. Swap the arguments so the inherited flags are set on the readonly bind mount as intended. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Try fsopen_mount()+fs_move_mount_to() for filesystem mounts and get_bind_mount()+fs_move_mount_to() for bind mounts before falling back to mount(). For propagation changes, try mount_setattr() first. This removes the dependency on /proc/self/fd paths for the initial mount and propagation steps when the new mount API is available. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
In user namespace containers, devices under /dev are created via bind mounts from the host. Pre-open the 6 standard devices (/dev/null, /dev/zero, /dev/full, /dev/tty, /dev/random, /dev/urandom) using open_tree(OPEN_TREE_CLONE) in the parent process after clone(), and send them to the child via the existing sync socket fd-passing mechanism. This reuses the same send_mounts/receive_mounts pattern already used for custom devices (dev_fds) and bind mount sources (mount_fds). This is a preparatory change for OPEN_TREE_NAMESPACE support, where host paths are not accessible after setns() into the new mount namespace. When open_tree() is not available or fails (e.g. rootless without CAP_SYS_ADMIN), the fds remain -1 and the child falls back to the existing bind mount path. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Replace the openat(procfd) + fchmod/fchown approach with direct fchmodat/fchownat calls. These operate on the directory entry without opening the device file, avoiding ENXIO errors on character devices like /dev/tty that fail to open when no controlling terminal exists. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
and rename it Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Use the new mount API (fsopen/fsconfig/fsmount) to create detached mounts for each non-bind filesystem type before pivot_root. After the pivot, place them via move_mount so the kernel mnt_already_visible check is satisfied. set_mounts then mounts fresh instances on top with the correct OCI flags. This fixes mounting proc/sysfs/cgroup in containers that use a user namespace, where the kernel denies these mounts unless a mount of the same type is already visible. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
The masked-dir-nlink test checked that st_nlink >= 2 to verify a masked path is a directory. This fails on btrfs where directories have st_nlink=1 (btrfs does not count . and .. in nlink). Add an 'isdir' command to the test init binary and use S_ISDIR to verify the masked path is a directory, which works regardless of filesystem type. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Read symlink targets for copy-symlink mounts while the host filesystem is still fully accessible (before pivot_root). After pivot_root the original symlink source is not reachable, so cache the readlink result in private_data and use it from handle_copy_symlink and process_single_mount. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Move CLONE_NEWNS out of both the clone() call (non-userns path) and init_container()'s unshare() (userns path), consolidating it into libcrun_do_pivot_root(). This makes mount namespace creation happen at a single location, immediately before it is needed. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Use open_tree(OPEN_TREE_NAMESPACE) + setns(CLONE_NEWNS) to replace the traditional unshare(CLONE_NEWNS) + bind mount rootfs + pivot_root sequence. OPEN_TREE_NAMESPACE creates a new mount namespace with the rootfs as the root mount. setns() enters that namespace directly, so no bind mount or pivot_root is needed. The kernel automatically sets the process root and cwd to the new namespace's root when the old root is not reachable. On older kernels (< 7.0) or when OPEN_TREE_NAMESPACE is not supported, the code falls back to the traditional path. Closes: containers#2086 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Closes #2086