[Diagnostics] Add in-proc crash report watchdog#128281
Conversation
Add a pipe-backed watchdog for in-proc crash reporting, using an async-signal-safe write from the crash path and a detached watchdog thread initialized during startup. Expose best-effort initialization through TryInitialize, document process-lifetime watchdog state, and use a conservative 30-second default timeout. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
Tagging subscribers to this area: @steveisok, @tommcdon, @dotnet/dotnet-diag |
There was a problem hiding this comment.
Pull request overview
Adds a POSIX in-process crash-report watchdog so a hung crash report generation path is bounded by a configurable timeout and eventually aborts the process.
Changes:
- Adds a pipe-backed detached watchdog thread and RAII scope to arm/disarm it from the crash-report path.
- Wires watchdog initialization into in-proc crash reporter startup.
- Adds parsing for
DOTNET_CrashReportTimeoutSeconds, defaulting to 30 seconds with0disabling the watchdog.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
src/coreclr/vm/crashreportstackwalker.cpp |
Reads and passes the crash report timeout setting during crash report configuration. |
src/coreclr/debug/crashreport/inproccrashreportwatchdog.h |
Declares watchdog initialization and scope APIs. |
src/coreclr/debug/crashreport/inproccrashreportwatchdog.cpp |
Implements watchdog thread, pipe protocol, timeout handling, and abort behavior. |
src/coreclr/debug/crashreport/inproccrashreporter.h |
Extends reporter settings with a timeout value. |
src/coreclr/debug/crashreport/inproccrashreporter.cpp |
Initializes the watchdog and scopes crash report generation with arm/disarm notifications. |
src/coreclr/debug/crashreport/CMakeLists.txt |
Adds the watchdog implementation to the crashreport object library. |
| unsigned long timeoutSeconds = strtoul(timeoutString, &end, 10); | ||
| if (errno != 0 || end == timeoutString || *end != '\0' || timeoutSeconds > UINT32_MAX) |
| char command = CrashReportWatchdogStartedCommand; | ||
| (void)write(static_cast<int>(writeFd), &command, sizeof(command)); |
| char command = CrashReportWatchdogFinishedCommand; | ||
| (void)write(static_cast<int>(writeFd), &command, sizeof(command)); |
| } | ||
|
|
||
| const time_t maxTime = std::numeric_limits<time_t>::max(); | ||
| if (static_cast<unsigned long long>(s_crashReportTimeoutSeconds) > static_cast<unsigned long long>(maxTime)) |
There was a problem hiding this comment.
Maybe let s_crashReportTimeoutSeconds be of time_t and you don't need all these casts.
| // Signal and timeout helpers. | ||
|
|
||
| static void | ||
| CrashReportWatchdogBuildFatalSignalSet(sigset_t* signalSet) |
There was a problem hiding this comment.
Feels like these belong into a class instead of prefixing all functions CrashReportWatchdog.
| CrashReportWatchdogConfigurePipeFd(int fd) | ||
| { | ||
| int descriptorFlags = fcntl(fd, F_GETFD); | ||
| if (descriptorFlags == -1 || fcntl(fd, F_SETFD, descriptorFlags | FD_CLOEXEC) != 0) |
There was a problem hiding this comment.
Usage of FD_CLOEXEC is normally protected under defines, but if we only compile this code when opting into crash reporter assuming FD_CLOEXEC exists on platforms enabling it, then its fine.
| // Watchdog thread wait loop. | ||
|
|
||
| static int | ||
| CrashReportWatchdogGetRemainingMilliseconds(const struct timespec* deadline) |
There was a problem hiding this comment.
This whole timeout implementation and deadline creation seems overly complex, we just work with seconds anyways, so maybe we should just stick to that keeping the implementation things simple.
| } | ||
|
|
||
| time_t timeoutSeconds = static_cast<time_t>(s_crashReportTimeoutSeconds); | ||
| if (deadline->tv_sec > maxTime - timeoutSeconds) |
There was a problem hiding this comment.
You don't need to bother about overflow for seconds. This function could be much simpler.
| } | ||
|
|
||
| static bool | ||
| CrashReportWatchdogWaitForCommand(char expectedCommand, const struct timespec* deadline) |
There was a problem hiding this comment.
Maybe you should use the same API as other wait API's, pass in milliseconds to wait, -1 means infinite.
|
|
||
| while (true) | ||
| { | ||
| if (!CrashReportWatchdogWaitForCommand(CrashReportWatchdogStartedCommand, nullptr)) |
There was a problem hiding this comment.
We should probably have some logging when the watchdog starts to monitor and when it detects timeout and abort the process.
| return false; | ||
| } | ||
|
|
||
| if (static_cast<unsigned long long>(timeoutSeconds) > static_cast<unsigned long long>(std::numeric_limits<time_t>::max())) |
There was a problem hiding this comment.
uint32_t will never be larger than time_t, so this check feels unnecessary.
| } | ||
|
|
||
| struct pollfd pollFd; | ||
| pollFd.fd = s_crashReportWatchdogPipe[0]; |
There was a problem hiding this comment.
Should we validate that we have a valid pipe before using them?
| { | ||
| // This runs from the crash-reporting path. Keep this and any future callees | ||
| // async-signal-safe. | ||
| sig_atomic_t writeFd = s_crashReportWatchdogWriteFd; |
There was a problem hiding this comment.
Instead of having these statics shared across classes, maybe you could add a method on the CrashReporterWatchdog that did the work?
| { | ||
| // This runs from the crash-reporting path. Keep this and any future callees | ||
| // async-signal-safe. | ||
| sig_atomic_t writeFd = s_crashReportWatchdogWriteFd; |
| // remains available to disable the watchdog for diagnostics. | ||
| static constexpr DWORD DefaultTimeoutSeconds = 30; | ||
|
|
||
| CLRConfigNoCache timeoutCfg = CLRConfigNoCache::Get("CrashReportTimeoutSeconds", /*noprefix*/ false, &getenv); |
| // Process-lifetime watchdog state. Successful initialization intentionally keeps | ||
| // the detached thread and pipe open until process exit; failed initialization | ||
| // paths close the pipe before returning. | ||
| static pthread_t s_crashReportWatchdogThread; |
There was a problem hiding this comment.
Maybe we should put this as members of a crash reporter watchdog class, then we can have a static instance of that class, alternative allocate it when the crash reporter started.
Adds a watchdog for the in-proc crash report generation so a hung crash reporter cannot leave the process stuck indefinitely.
The in-proc crash reporter runs while the process is already handling a fatal signal. If the reporter hangs, OS-level watchdogs are not reliable across all relevant app locations, especially worker/background-thread crashes. This bounds reporter execution time and ensures the process eventually terminates instead of remaining stuck.
The watchdog is initialized outside the crash path, uses a pipe-backed notification channel, and keeps the crash-reporting path limited to async-signal-safe
write()calls. If report generation starts but does not finish before the configured timeout, the watchdog aborts the process with SIGABRT.inproccrashreportwatchdog.{h,cpp}.InProcCrashReporter::CreateReport()begins and disarms it when report generation exits.DOTNET_CrashReportTimeoutSeconds.300disables the watchdog for diagnostics/debugging.