[Diagnostics] Add in-proc crash report watchdog by mdh1418 · Pull Request #128281 · dotnet/runtime

mdh1418 · 2026-05-16T05:44:09Z

Adds a watchdog for the in-proc crash report generation so a hung crash reporter cannot leave the process stuck indefinitely.

The in-proc crash reporter runs while the process is already handling a fatal signal. If the reporter hangs, OS-level watchdogs are not reliable across all relevant app locations, especially worker/background-thread crashes. This bounds reporter execution time and ensures the process eventually terminates instead of remaining stuck.

The watchdog is initialized outside the crash path, uses a pipe-backed notification channel, and keeps the crash-reporting path limited to async-signal-safe write() calls. If report generation starts but does not finish before the configured timeout, the watchdog aborts the process with SIGABRT.
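The notification channel described above can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: `CreateWatchdogPipe`, `NotifyWatchdog`, and the command byte values are hypothetical names chosen for the example, and making both pipe ends nonblocking is an assumption.

```cpp
#include <cassert>
#include <cerrno>
#include <fcntl.h>
#include <unistd.h>

// Hypothetical command bytes; the PR's actual values are not shown here.
constexpr char kStartedCommand = 'S';
constexpr char kFinishedCommand = 'F';

// Creates the notification pipe and makes both ends nonblocking.
// Intended to run during startup, never from the crash path.
bool CreateWatchdogPipe(int fds[2])
{
    assert(fds != nullptr);
    if (pipe(fds) != 0)
        return false;

    for (int i = 0; i < 2; i++)
    {
        int flags = fcntl(fds[i], F_GETFL);
        if (flags == -1 || fcntl(fds[i], F_SETFL, flags | O_NONBLOCK) != 0)
        {
            close(fds[0]);
            close(fds[1]);
            return false;
        }
    }
    return true;
}

// Crash-path side: the only call made is write(), which POSIX lists as
// async-signal-safe. errno is saved and restored so the interrupted code
// does not observe a changed value.
void NotifyWatchdog(int writeFd, char command)
{
    int savedErrno = errno;
    ssize_t result;
    do
    {
        result = write(writeFd, &command, sizeof(command));
    } while (result == -1 && errno == EINTR);
    errno = savedErrno;
}
```

In this sketch the crash path would call `NotifyWatchdog(fd, kStartedCommand)` when report generation begins and `NotifyWatchdog(fd, kFinishedCommand)` when it exits. Because the write end is nonblocking, a full pipe makes the write fail instead of stalling the crashing thread.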

- Adds inproccrashreportwatchdog.{h,cpp}.
- Arms the watchdog when InProcCrashReporter::CreateReport() begins and disarms it when report generation exits.
- Uses a detached watchdog thread plus a nonblocking pipe instead of semaphores for POSIX compatibility.
- Blocks fatal signals on the watchdog thread so process-directed crash signals do not land there.
- Adds DOTNET_CrashReportTimeoutSeconds.
  - Default: 30
  - 0 disables the watchdog for diagnostics/debugging.
- Keeps watchdog initialization best-effort; if initialization fails, crash reporting proceeds without the watchdog.
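Blocking fatal signals on the watchdog thread, as listed above, might look like the sketch below. The function names and the exact set of signals are assumptions made for illustration; the PR's actual signal list is not shown on this page.

```cpp
#include <cassert>
#include <pthread.h>
#include <signal.h>

// Fills signalSet with the signals this sketch treats as fatal.
// The list is an assumption, not the set used by the PR.
void BuildFatalSignalSet(sigset_t* signalSet)
{
    assert(signalSet != nullptr);
    sigemptyset(signalSet);
    sigaddset(signalSet, SIGSEGV);
    sigaddset(signalSet, SIGBUS);
    sigaddset(signalSet, SIGILL);
    sigaddset(signalSet, SIGFPE);
    sigaddset(signalSet, SIGABRT);
    sigaddset(signalSet, SIGTRAP);
}

// Blocks the fatal signals on the calling thread only. Other threads keep
// their own masks, so process-directed signals are delivered to them.
bool BlockFatalSignalsOnCurrentThread()
{
    sigset_t fatalSignals;
    BuildFatalSignalSet(&fatalSignals);
    return pthread_sigmask(SIG_BLOCK, &fatalSignals, nullptr) == 0;
}
```

A new thread inherits the signal mask of its creator, so one way to avoid a window where the watchdog thread runs unmasked is to block the signals in the creating thread before `pthread_create` and restore the old mask afterwards. Calling the helper at the top of the thread routine is the simpler variant shown here.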

Add a pipe-backed watchdog for in-proc crash reporting, using an async-signal-safe write from the crash path and a detached watchdog thread initialized during startup. Expose best-effort initialization through TryInitialize, document process-lifetime watchdog state, and use a conservative 30-second default timeout.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

dotnet-policy-service · 2026-05-16T05:45:02Z

Tagging subscribers to this area: @steveisok, @tommcdon, @dotnet/dotnet-diag
See info in area-owners.md if you want to be subscribed.

Copilot

Pull request overview

Adds a POSIX in-process crash-report watchdog so a hung crash report generation path is bounded by a configurable timeout and eventually aborts the process.

Changes:

- Adds a pipe-backed detached watchdog thread and RAII scope to arm/disarm it from the crash-report path.
- Wires watchdog initialization into in-proc crash reporter startup.
- Adds parsing for DOTNET_CrashReportTimeoutSeconds, defaulting to 30 seconds with 0 disabling the watchdog.
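The RAII arm/disarm scope mentioned in the first item could be sketched like this. The class name and the `'S'`/`'F'` command bytes are hypothetical; only the idea of notifying on scope entry and exit comes from the summary above.

```cpp
#include <cassert>
#include <unistd.h>

// Arms the watchdog on construction and disarms it on destruction by
// writing one command byte to the pipe. A writeFd of -1 means the watchdog
// was never initialized, in which case the scope does nothing.
class CrashReportWatchdogScopeSketch
{
public:
    explicit CrashReportWatchdogScopeSketch(int writeFd)
        : m_writeFd(writeFd)
    {
        assert(writeFd >= -1);
        Send('S'); // hypothetical "started" command
    }

    ~CrashReportWatchdogScopeSketch()
    {
        Send('F'); // hypothetical "finished" command
    }

    CrashReportWatchdogScopeSketch(const CrashReportWatchdogScopeSketch&) = delete;
    CrashReportWatchdogScopeSketch& operator=(const CrashReportWatchdogScopeSketch&) = delete;

private:
    void Send(char command) const
    {
        if (m_writeFd < 0)
            return;
        // Only write() is used, so this stays async-signal-safe.
        ssize_t ignored = write(m_writeFd, &command, sizeof(command));
        (void)ignored;
    }

    int m_writeFd;
};
```

Declaring such a scope at the top of the report-generation function means every exit path, including early returns, sends the disarm byte.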

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| `src/coreclr/vm/crashreportstackwalker.cpp` | Reads and passes the crash report timeout setting during crash report configuration. |
| `src/coreclr/debug/crashreport/inproccrashreportwatchdog.h` | Declares watchdog initialization and scope APIs. |
| `src/coreclr/debug/crashreport/inproccrashreportwatchdog.cpp` | Implements watchdog thread, pipe protocol, timeout handling, and abort behavior. |
| `src/coreclr/debug/crashreport/inproccrashreporter.h` | Extends reporter settings with a timeout value. |
| `src/coreclr/debug/crashreport/inproccrashreporter.cpp` | Initializes the watchdog and scopes crash report generation with arm/disarm notifications. |
| `src/coreclr/debug/crashreport/CMakeLists.txt` | Adds the watchdog implementation to the crashreport object library. |

+    unsigned long timeoutSeconds = strtoul(timeoutString, &end, 10);
+    if (errno != 0 || end == timeoutString || *end != '\0' || timeoutSeconds > UINT32_MAX)
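For context, a self-contained version of this kind of `strtoul` validation might look like the sketch below. `TryParseTimeoutSeconds` is a hypothetical helper, not the PR's function. Two details are worth noting: `errno` has to be cleared before the call, because `strtoul` only sets it on failure, and `strtoul` accepts leading whitespace and a sign, so this sketch rejects any input that does not start with a digit (a stricter rule than the hunk above shows).

```cpp
#include <cassert>
#include <cerrno>
#include <cstdint>
#include <cstdlib>

// Parses a decimal timeout in seconds. Returns false for null, empty,
// signed, non-numeric, or out-of-range input and leaves the output untouched.
bool TryParseTimeoutSeconds(const char* timeoutString, uint32_t* timeoutSeconds)
{
    assert(timeoutSeconds != nullptr);

    // strtoul would accept leading whitespace and '+' or '-'; require a digit.
    if (timeoutString == nullptr || timeoutString[0] < '0' || timeoutString[0] > '9')
        return false;

    char* end = nullptr;
    errno = 0; // strtoul reports overflow only through errno
    unsigned long value = strtoul(timeoutString, &end, 10);
    if (errno != 0 || end == timeoutString || *end != '\0' || value > UINT32_MAX)
        return false;

    *timeoutSeconds = static_cast<uint32_t>(value);
    return true;
}
```

A caller would fall back to the default (30 seconds per the description) when parsing fails, and treat a parsed value of 0 as "watchdog disabled".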


+        char command = CrashReportWatchdogStartedCommand;
+        (void)write(static_cast<int>(writeFd), &command, sizeof(command));


+        char command = CrashReportWatchdogFinishedCommand;
+        (void)write(static_cast<int>(writeFd), &command, sizeof(command));


lateralusX · 2026-05-19T15:40:08Z

+    }
+
+    const time_t maxTime = std::numeric_limits<time_t>::max();
+    if (static_cast<unsigned long long>(s_crashReportTimeoutSeconds) > static_cast<unsigned long long>(maxTime))


Maybe make s_crashReportTimeoutSeconds a time_t; then you don't need all these casts.

lateralusX · 2026-05-19T15:54:37Z

+// Signal and timeout helpers.
+
+static void
+CrashReportWatchdogBuildFatalSignalSet(sigset_t* signalSet)


Feels like these belong in a class instead of prefixing every function with CrashReportWatchdog.

lateralusX · 2026-05-19T16:44:20Z

+CrashReportWatchdogConfigurePipeFd(int fd)
+{
+    int descriptorFlags = fcntl(fd, F_GETFD);
+    if (descriptorFlags == -1 || fcntl(fd, F_SETFD, descriptorFlags | FD_CLOEXEC) != 0)


Usage of FD_CLOEXEC is normally guarded by defines, but if we only compile this code when opting into the crash reporter, and FD_CLOEXEC exists on every platform that enables it, then it's fine.
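A guarded variant of the hunk above could be sketched as follows. `SetCloseOnExec` is a hypothetical helper name, and treating a platform without `FD_CLOEXEC` as success is an assumption of this sketch rather than something the review settled on.

```cpp
#include <cassert>
#include <fcntl.h>
#include <unistd.h>

// Sets the close-on-exec flag on fd where the platform defines FD_CLOEXEC.
// On platforms without it there is nothing to set, so this reports success.
bool SetCloseOnExec(int fd)
{
    assert(fd >= 0);
#ifdef FD_CLOEXEC
    int descriptorFlags = fcntl(fd, F_GETFD);
    if (descriptorFlags == -1)
        return false;
    return fcntl(fd, F_SETFD, descriptorFlags | FD_CLOEXEC) != -1;
#else
    (void)fd;
    return true;
#endif
}
```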

lateralusX · 2026-05-19T17:01:51Z

+// Watchdog thread wait loop.
+
+static int
+CrashReportWatchdogGetRemainingMilliseconds(const struct timespec* deadline)


This whole timeout implementation and deadline creation seems overly complex. We only work with seconds anyway, so maybe we should just stick to that and keep the implementation simple.

lateralusX · 2026-05-19T17:04:57Z

+    }
+
+    time_t timeoutSeconds = static_cast<time_t>(s_crashReportTimeoutSeconds);
+    if (deadline->tv_sec > maxTime - timeoutSeconds)


You don't need to bother about overflow for seconds. This function could be much simpler.
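One way to read the two comments above is a deadline kept in whole seconds with 64-bit arithmetic, so no overflow check is needed. The sketch below is an illustration under that reading; `MonotonicSeconds` and `RemainingSeconds` are hypothetical names, and using `CLOCK_MONOTONIC` is an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <time.h>

// Current monotonic time, truncated to whole seconds.
int64_t MonotonicSeconds()
{
    struct timespec now;
    int result = clock_gettime(CLOCK_MONOTONIC, &now);
    assert(result == 0);
    (void)result;
    return static_cast<int64_t>(now.tv_sec);
}

// Seconds left until startSeconds + timeoutSeconds, clamped at zero.
// A 32-bit timeout added to a monotonic seconds value fits comfortably
// in 64 bits for any realistic clock reading.
int64_t RemainingSeconds(int64_t startSeconds, uint32_t timeoutSeconds, int64_t nowSeconds)
{
    int64_t deadline = startSeconds + static_cast<int64_t>(timeoutSeconds);
    return nowSeconds >= deadline ? 0 : deadline - nowSeconds;
}
```

The trade-off is precision: truncating to whole seconds can make the timeout fire up to one second early, which is negligible next to a 30-second default.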

lateralusX · 2026-05-19T17:06:22Z

+}
+
+static bool
+CrashReportWatchdogWaitForCommand(char expectedCommand, const struct timespec* deadline)


Maybe you should use the same convention as other wait APIs: pass in the milliseconds to wait, where -1 means infinite.
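The suggested signature maps directly onto `poll`, whose timeout argument is already in milliseconds with -1 meaning "wait forever". The sketch below shows that shape; `WaitForCommand` here is a hypothetical rewrite, not the PR's function, and it simplifies by restarting the full timeout after `EINTR` or an unrelated byte.

```cpp
#include <cassert>
#include <cerrno>
#include <poll.h>
#include <unistd.h>

// Waits up to timeoutMs for expectedCommand to arrive on readFd.
// timeoutMs == -1 waits indefinitely. Returns true when the command is
// received; false on timeout, error, or when the write end is closed.
bool WaitForCommand(int readFd, char expectedCommand, int timeoutMs)
{
    assert(timeoutMs >= -1);
    for (;;)
    {
        struct pollfd pollFd;
        pollFd.fd = readFd;
        pollFd.events = POLLIN;
        pollFd.revents = 0;

        int ready = poll(&pollFd, 1, timeoutMs);
        if (ready == 0)
            return false; // timed out
        if (ready < 0)
        {
            if (errno == EINTR)
                continue; // interrupted; wait again
            return false;
        }

        char command = 0;
        ssize_t bytesRead = read(readFd, &command, sizeof(command));
        if (bytesRead == 1 && command == expectedCommand)
            return true;
        if (bytesRead == 0)
            return false; // write end closed
        if (bytesRead < 0 && errno != EINTR && errno != EAGAIN)
            return false;
        // Otherwise an unrelated byte or a transient error: keep waiting.
    }
}
```

With this shape the watchdog loop would wait for the "started" command with -1 and then for the "finished" command with the configured timeout converted to milliseconds.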

lateralusX · 2026-05-19T17:09:21Z

+
+    while (true)
+    {
+        if (!CrashReportWatchdogWaitForCommand(CrashReportWatchdogStartedCommand, nullptr))


We should probably have some logging when the watchdog starts monitoring and when it detects a timeout and aborts the process.
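As a rough idea of what such logging could look like without pulling in formatting or allocation, the sketch below writes fixed string literals with a single `write` call. The helper names and message texts are hypothetical, and the runtime may well prefer its own logging facilities over raw writes to a descriptor.

```cpp
#include <cassert>
#include <cstddef>
#include <unistd.h>

// Writes a string literal to fd with one write() call. No formatting and
// no allocation, so it remains usable while the process is crashing.
template <size_t N>
void WatchdogLogLiteral(int fd, const char (&message)[N])
{
    assert(fd >= 0);
    ssize_t ignored = write(fd, message, N - 1); // N - 1 drops the trailing '\0'
    (void)ignored;
}

// Hypothetical message for the point where the watchdog starts monitoring.
void LogWatchdogArmed(int fd)
{
    WatchdogLogLiteral(fd, "crash report watchdog: monitoring report generation\n");
}

// Hypothetical message for the point just before the process is aborted.
void LogWatchdogTimeout(int fd)
{
    WatchdogLogLiteral(fd, "crash report watchdog: timeout expired, aborting process\n");
}
```

Passing `STDERR_FILENO` as `fd` would send both messages to standard error.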

lateralusX · 2026-05-19T17:10:48Z

+        return false;
+    }
+
+    if (static_cast<unsigned long long>(timeoutSeconds) > static_cast<unsigned long long>(std::numeric_limits<time_t>::max()))


uint32_t will never be larger than time_t, so this check feels unnecessary.

lateralusX · 2026-05-19T17:11:32Z

+        }
+
+        struct pollfd pollFd;
+        pollFd.fd = s_crashReportWatchdogPipe[0];


Should we validate that we have a valid pipe before using it?

lateralusX · 2026-05-19T17:57:43Z

+{
+    // This runs from the crash-reporting path. Keep this and any future callees
+    // async-signal-safe.
+    sig_atomic_t writeFd = s_crashReportWatchdogWriteFd;


Instead of having these statics shared across classes, maybe you could add a method on the CrashReporterWatchdog that does the work?

lateralusX · 2026-05-19T17:57:52Z

+{
+    // This runs from the crash-reporting path. Keep this and any future callees
+    // async-signal-safe.
+    sig_atomic_t writeFd = s_crashReportWatchdogWriteFd;


Same comment as above.

lateralusX · 2026-05-19T18:03:23Z

+    // remains available to disable the watchdog for diagnostics.
+    static constexpr DWORD DefaultTimeoutSeconds = 30;
+
+    CLRConfigNoCache timeoutCfg = CLRConfigNoCache::Get("CrashReportTimeoutSeconds", /*noprefix*/ false, &getenv);


TryAsInteger?

lateralusX · 2026-05-19T18:05:01Z

+// Process-lifetime watchdog state. Successful initialization intentionally keeps
+// the detached thread and pipe open until process exit; failed initialization
+// paths close the pipe before returning.
+static pthread_t s_crashReportWatchdogThread;


Maybe we should make these members of a crash reporter watchdog class. Then we can have a static instance of that class or, alternatively, allocate it when the crash reporter starts.
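The class-based layout suggested here could be sketched as below. Everything in it is hypothetical: the class name, the method names, and the command bytes are chosen for illustration, and thread creation is left out to keep the example short.

```cpp
#include <cassert>
#include <unistd.h>

// Owns the watchdog pipe so that no file-scope statics need to be shared
// with the crash reporter. Creating the watchdog thread is omitted here.
class CrashReportWatchdogSketch
{
public:
    // Best-effort and idempotent: returns true once the pipe exists.
    bool TryInitialize()
    {
        if (m_initialized)
            return true;
        if (pipe(m_pipe) != 0)
            return false;
        m_initialized = true;
        return true;
    }

    // Crash-path entry points; each one only calls write().
    void NotifyStarted() const { Send('S'); }
    void NotifyFinished() const { Send('F'); }

    int ReadFd() const { return m_pipe[0]; }

private:
    void Send(char command) const
    {
        assert(command == 'S' || command == 'F');
        if (!m_initialized)
            return; // watchdog unavailable; crash reporting proceeds without it
        ssize_t ignored = write(m_pipe[1], &command, sizeof(command));
        (void)ignored;
    }

    int m_pipe[2] = { -1, -1 };
    bool m_initialized = false;
};

// Single process-lifetime instance, initialized during startup.
static CrashReportWatchdogSketch g_crashReportWatchdog;
```

With this layout the reporter would call `g_crashReportWatchdog.NotifyStarted()` and `NotifyFinished()` instead of reading a shared write descriptor itself.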

mdh1418 requested review from Copilot and lateralusX May 16, 2026 05:44

mdh1418 added the area-Diagnostics-coreclr label May 16, 2026

Copilot started reviewing on behalf of mdh1418 May 16, 2026 05:44

dotnet-policy-service Bot assigned mdh1418 May 16, 2026

Copilot AI reviewed May 16, 2026

This was referenced May 16, 2026

Unable to pull image from mcr.microsoft.com #117164

Open

ExceptionHandlingExpressions.ExpressionsUnwrapeExternallyThrownRuntimeWrappedException failure #128164

Open

lateralusX reviewed May 19, 2026

3 participants