Skip to content

libtcb: skip privilege drop in unprivileged user namespaces#38

Open
Blarse wants to merge 2 commits into
openwall:mainfrom
Blarse:libtcb-unpriv-userns-support
Open

libtcb: skip privilege drop in unprivileged user namespaces#38
Blarse wants to merge 2 commits into
openwall:mainfrom
Blarse:libtcb-unpriv-userns-support

Conversation

@Blarse
Copy link
Copy Markdown

@Blarse Blarse commented Apr 28, 2026

When an unprivileged user sets up a user namespace without newuidmap, it must first write "deny" to /proc/self/setgroups before writing gid_map, after which setgroups() becomes permanently forbidden[1]. tcb_drop_priv_r() then trips over sys_setgroups(0, NULL) and bails out with EPERM, even though euid 0 inside the namespace is not real root and there is nothing to drop in the first place.

In my case this breaks mkosi tool that builds images inside an unprivileged userns, where any tcb-aware utility aborts on the setgroups step. The patch detects the "deny" state via a small setgroups_allowed() helper and short-circuits tcb_drop_priv_r() with PRIV_MAGIC_NONROOT, the same path already taken for non-root callers. No behavioural change outside user namespaces; non-Linux builds get a stub that always returns 1.

[1] https://man7.org/linux/man-pages/man7/user_namespaces.7.html

@solardiz solardiz requested a review from ldv-alt May 2, 2026 20:17
@solardiz
Copy link
Copy Markdown
Member

solardiz commented May 2, 2026

Thank you @Blarse! Are these changes already in use in ALT's package perhaps?

I was concerned at first that the deny mode could possibly be enabled by a user in some other scenario, such as planning to attack libtcb as used by a privileged program, but reading up on it and experimenting with it a little bit I see it's very specialized to unprivileged containers.

Another approach I'd consider is only checking for deny after setgroups fails, and letting the uid/gid changes proceed anyway (and similar when restoring, so would need another magic to identify this state, or maybe mark it with number_of_groups = -1). However, this isn't obviously better - it skips the extra logic in the common case (faster?) and never skips uid/gid switching (safer against the unknown?), but adds complexity (and effort since it's different from what you already have in this PR). So just thinking out loud.

Comment thread libs/libtcb.c Outdated
@solardiz
Copy link
Copy Markdown
Member

solardiz commented May 2, 2026

Another approach I'd consider is only checking for deny after setgroups fails, and letting the uid/gid changes proceed anyway (and similar when restoring, so would need another magic to identify this state, or maybe mark it with number_of_groups = -1). However, this isn't obviously better - it skips the extra logic in the common case (faster?) and never skips uid/gid switching (safer against the unknown?)

"The unknown" includes possible future changes to the kernel, where the deny mode could possibly become usable in some other scenario. It isn't currently intended to disable any security measures that userspace tools take, so us doing essentially that is a risk and may not be future-proof.

@Blarse
Copy link
Copy Markdown
Author

Blarse commented May 4, 2026

Hi,

Thank you @Blarse! Are these changes already in use in ALT's package perhaps?

Not yet, but I'm working on porting mkosi to ALT Linux and stumbled upon a systemd-sysusers segfault caused by this. I can confirm that this patch fixes it.

Apparently, when useradd invoked libtcb and libtcb could not drop privileges after a failed setgroups() call, it left an empty shadow file, which in turn triggered another issue in ALT's downstream patch to systemd-sysusers that adds tcb support.

I was concerned at first that the deny mode could possibly be enabled by a user in some other scenario, such as planning to attack libtcb as used by a privileged program, but reading up on it and experimenting with it a little bit I see it's very specialized to unprivileged containers.

(For reference, virtiofsd hit the same setgroups + user namespace issue and applied a similar fix in !207.)

Another approach I'd consider is only checking for deny after setgroups fails, and letting the uid/gid changes proceed anyway (and similar when restoring, so would need another magic to identify this state, or maybe mark it with number_of_groups = -1). However, this isn't obviously better - it skips the extra logic in the common case (faster?) and never skips uid/gid switching (safer against the unknown?), but adds complexity (and effort since it's different from what you already have in this PR). So just thinking out loud.

@Blarse Blarse force-pushed the libtcb-unpriv-userns-support branch from 9f42bc3 to ad67dc3 Compare May 4, 2026 06:21
@solardiz
Copy link
Copy Markdown
Member

solardiz commented May 6, 2026

Thank you for revising the proc file reading/check per my request - this part passes my review now.

For reference, virtiofsd hit the same setgroups + user namespace issue and applied a similar fix in !207.

This looks like they're skipping only setgroups, whereas your changes here will also skip uid/gid switching. So the way you do it may be higher risk.

@ldv-alt Do we need to wait for your review here?

@ldv-alt
Copy link
Copy Markdown
Collaborator

ldv-alt commented May 6, 2026

I'd definitely prefer if the code would do the following:

  • attempt to drop supplementary groups using setgroups() unconditionally;
  • if setgroups() failed, check whether setgroups() is disabled;
  • if setgroups() is disabled, skipped just dropping/restoring of supplementary groups.

@Blarse
Copy link
Copy Markdown
Author

Blarse commented May 6, 2026

For reference, virtiofsd hit the same setgroups + user namespace issue and applied a similar fix in !207.

This looks like they're skipping only setgroups, whereas your changes here will also skip uid/gid switching. So the way you do it may be higher risk.

You're right. According to the man page referenced above:

•  One of the following two cases applies:

   (a)  Either the writing process has the CAP_SETUID (CAP_SETGID)
        capability in the parent user namespace.

        •  No further restrictions apply: the process can make
           mappings to arbitrary user IDs (group IDs) in the
           parent user namespace.

   (b)  Or otherwise all of the following restrictions apply:

        •  The data written to uid_map (gid_map) must consist of a
           single line that maps the writing process's effective
           user ID (group ID) in the parent user namespace to a
           user ID (group ID) in the user namespace.

        •  The writing process must have the same effective user
           ID as the process that created the user namespace.

        •  In the case of gid_map, use of the setgroups(2) system
           call must first be denied by writing "deny" to the
           /proc/pid/setgroups file (see below) before writing to
           gid_map.

The current PR effectively handles only case (b), where it makes no sense to switch uid/gid (though without the single line mapping check that the virtiofsd patch has). In case (a) the mapping can cover an arbitrary range, and /proc/pid/setgroups may be either allow or deny. So we shouldn't skip the whole drop just because setgroups=deny. With a range mapping, uid/gid switching still works and is needed.

I'll update the patch to skip only setgroups(0, NULL) when it's denied, while still calling ch_uid()/ch_gid(). tcb_gain_priv_r() needs a matching change to skip the setgroups() restoration in that case.

@Blarse
Copy link
Copy Markdown
Author

Blarse commented May 6, 2026

I'd definitely prefer if the code would do the following:

* attempt to drop supplementary groups using setgroups() unconditionally;

* if setgroups() failed, check whether setgroups() is disabled;

* if setgroups() is disabled, skipped just dropping/restoring of supplementary groups.

Would something like this work for you?

 	}
 	free(dir);
 
-	res = getgroups(p->number_of_groups, p->grplist);
-	if (res < 0 || res > p->number_of_groups)
-		return -1;
-
-	p->number_of_groups = res;
+	if (setgroups_allowed()) {
+		res = getgroups(p->number_of_groups, p->grplist);
+		if (res < 0 || res > p->number_of_groups)
+			return -1;
+		p->number_of_groups = res;
+		if (sys_setgroups(0, NULL) == -1)
+			return -1;
+	} else {
+		p->number_of_groups = -1;
+	}
 
-	if (sys_setgroups(0, NULL) == -1)
-		return -1;
 	if (!ch_gid(shadow_gid, &p->old_gid))
 		return -1;
 	if (!ch_uid(st.st_uid, &p->old_uid))

And for tcb_gain_priv_r:

 		return -1;
 	if (!ch_gid(p->old_gid, NULL))
 		return -1;
-	if (sys_setgroups(p->number_of_groups, p->grplist) == -1)
+	if (p->number_of_groups >= 0 &&
+		sys_setgroups(p->number_of_groups, p->grplist) == -1)
 		return -1;

@ldv-alt
Copy link
Copy Markdown
Collaborator

ldv-alt commented May 6, 2026

I'd definitely prefer if the code would do the following:

* attempt to drop supplementary groups using setgroups() unconditionally;

* if setgroups() failed, check whether setgroups() is disabled;

* if setgroups() is disabled, skipped just dropping/restoring of supplementary groups.

Would something like this work for you?

 	}
 	free(dir);
 
-	res = getgroups(p->number_of_groups, p->grplist);
-	if (res < 0 || res > p->number_of_groups)
-		return -1;
-
-	p->number_of_groups = res;
+	if (setgroups_allowed()) {
+		res = getgroups(p->number_of_groups, p->grplist);
+		if (res < 0 || res > p->number_of_groups)
+			return -1;
+		p->number_of_groups = res;
+		if (sys_setgroups(0, NULL) == -1)
+			return -1;
+	} else {
+		p->number_of_groups = -1;
+	}
 
-	if (sys_setgroups(0, NULL) == -1)
-		return -1;

I'd rather invoked sys_setgroups(0, NULL) unconditionally and started checking setgroups_allowed() only if that setgroups call failed.

Blarse added 2 commits May 6, 2026 11:22
Detect /proc/self/setgroups == "deny" to recognize an unprivileged
user namespace where setgroups(2) is permanently denied by the kernel.
No-op on non-Linux.
In a user namespace where /proc/self/setgroups is "deny",
setgroups(2) is permanently rejected by the kernel.  Perform the
regular privilege drop and only tolerate sys_setgroups(0, NULL)
failing with EPERM in such a namespace; in that case the kernel
guarantees no supplementary group could have been gained via the
namespace, so leaving the list in place is safe.  Record this
with a new PRIV_MAGIC_NOSETGROUPS state so that tcb_gain_priv_r()
skips the matching setgroups() call.

Fixes failures of pam_tcb, libnss_tcb, tcb_unconvert and shadow's
shadowtcb_drop_priv() when running under rootless container.
@Blarse Blarse force-pushed the libtcb-unpriv-userns-support branch from ad67dc3 to b8805df Compare May 6, 2026 09:25
@Blarse
Copy link
Copy Markdown
Author

Blarse commented May 6, 2026

I'd rather invoked sys_setgroups(0, NULL) unconditionally and started checking setgroups_allowed() only if that setgroups call failed.

I've updated the patch according to your suggestions.

@Blarse
Copy link
Copy Markdown
Author

Blarse commented May 22, 2026

Hi, is there anything else I need to address here, or is this good to merge?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants