Skip to content

[Diagnostics] Add in-proc crash report watchdog#128281

Open
mdh1418 wants to merge 1 commit into
dotnet:mainfrom
mdh1418:inproc_crashreport_watchdog
Open

[Diagnostics] Add in-proc crash report watchdog#128281
mdh1418 wants to merge 1 commit into
dotnet:mainfrom
mdh1418:inproc_crashreport_watchdog

Conversation

@mdh1418
Copy link
Copy Markdown
Member

@mdh1418 mdh1418 commented May 16, 2026

Adds a watchdog for the in-proc crash report generation so a hung crash reporter cannot leave the process stuck indefinitely.

The in-proc crash reporter runs while the process is already handling a fatal signal. If the reporter hangs, OS-level watchdogs are not reliable across all relevant app locations, especially worker/background-thread crashes. This bounds reporter execution time and ensures the process eventually terminates instead of remaining stuck.

The watchdog is initialized outside the crash path, uses a pipe-backed notification channel, and keeps the crash-reporting path limited to async-signal-safe write() calls. If report generation starts but does not finish before the configured timeout, the watchdog aborts the process with SIGABRT.

  • Adds inproccrashreportwatchdog.{h,cpp}.
  • Arms the watchdog when InProcCrashReporter::CreateReport() begins and disarms it when report generation exits.
  • Uses a detached watchdog thread plus a nonblocking pipe instead of semaphores for POSIX compatibility.
  • Blocks fatal signals on the watchdog thread so process-directed crash signals do not land there.
  • Adds DOTNET_CrashReportTimeoutSeconds.
    • Default: 30
    • 0 disables the watchdog for diagnostics/debugging.
  • Keeps watchdog initialization best-effort; if initialization fails, crash reporting proceeds without the watchdog.

Add a pipe-backed watchdog for in-proc crash reporting, using an async-signal-safe write from the crash path and a detached watchdog thread initialized during startup.

Expose best-effort initialization through TryInitialize, document process-lifetime watchdog state, and use a conservative 30-second default timeout.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@dotnet-policy-service
Copy link
Copy Markdown
Contributor

Tagging subscribers to this area: @steveisok, @tommcdon, @dotnet/dotnet-diag
See info in area-owners.md if you want to be subscribed.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a POSIX in-process crash-report watchdog so a hung crash report generation path is bounded by a configurable timeout and eventually aborts the process.

Changes:

  • Adds a pipe-backed detached watchdog thread and RAII scope to arm/disarm it from the crash-report path.
  • Wires watchdog initialization into in-proc crash reporter startup.
  • Adds parsing for DOTNET_CrashReportTimeoutSeconds, defaulting to 30 seconds with 0 disabling the watchdog.

Reviewed changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
src/coreclr/vm/crashreportstackwalker.cpp Reads and passes the crash report timeout setting during crash report configuration.
src/coreclr/debug/crashreport/inproccrashreportwatchdog.h Declares watchdog initialization and scope APIs.
src/coreclr/debug/crashreport/inproccrashreportwatchdog.cpp Implements watchdog thread, pipe protocol, timeout handling, and abort behavior.
src/coreclr/debug/crashreport/inproccrashreporter.h Extends reporter settings with a timeout value.
src/coreclr/debug/crashreport/inproccrashreporter.cpp Initializes the watchdog and scopes crash report generation with arm/disarm notifications.
src/coreclr/debug/crashreport/CMakeLists.txt Adds the watchdog implementation to the crashreport object library.

Comment on lines +475 to +476
unsigned long timeoutSeconds = strtoul(timeoutString, &end, 10);
if (errno != 0 || end == timeoutString || *end != '\0' || timeoutSeconds > UINT32_MAX)
Comment on lines +346 to +347
char command = CrashReportWatchdogStartedCommand;
(void)write(static_cast<int>(writeFd), &command, sizeof(command));
Comment on lines +358 to +359
char command = CrashReportWatchdogFinishedCommand;
(void)write(static_cast<int>(writeFd), &command, sizeof(command));
}

const time_t maxTime = std::numeric_limits<time_t>::max();
if (static_cast<unsigned long long>(s_crashReportTimeoutSeconds) > static_cast<unsigned long long>(maxTime))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe let s_crashReportTimeoutSeconds be of time_t and you don't need all these casts.

// Signal and timeout helpers.

static void
CrashReportWatchdogBuildFatalSignalSet(sigset_t* signalSet)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Feels like these belong into a class instead of prefixing all functions CrashReportWatchdog.

CrashReportWatchdogConfigurePipeFd(int fd)
{
int descriptorFlags = fcntl(fd, F_GETFD);
if (descriptorFlags == -1 || fcntl(fd, F_SETFD, descriptorFlags | FD_CLOEXEC) != 0)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usage of FD_CLOEXEC is normally protected under defines, but if we only compile this code when opting into crash reporter assuming FD_CLOEXEC exists on platforms enabling it, then its fine.

// Watchdog thread wait loop.

static int
CrashReportWatchdogGetRemainingMilliseconds(const struct timespec* deadline)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This whole timeout implementation and deadline creation seems overly complex, we just work with seconds anyways, so maybe we should just stick to that keeping the implementation things simple.

}

time_t timeoutSeconds = static_cast<time_t>(s_crashReportTimeoutSeconds);
if (deadline->tv_sec > maxTime - timeoutSeconds)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You don't need to bother about overflow for seconds. This function could be much simpler.

}

static bool
CrashReportWatchdogWaitForCommand(char expectedCommand, const struct timespec* deadline)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe you should use the same API as other wait API's, pass in milliseconds to wait, -1 means infinite.


while (true)
{
if (!CrashReportWatchdogWaitForCommand(CrashReportWatchdogStartedCommand, nullptr))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably have some logging when the watchdog starts to monitor and when it detects timeout and abort the process.

return false;
}

if (static_cast<unsigned long long>(timeoutSeconds) > static_cast<unsigned long long>(std::numeric_limits<time_t>::max()))
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

uint32_t will never be larger than time_t, so this check feels unnecessary.

}

struct pollfd pollFd;
pollFd.fd = s_crashReportWatchdogPipe[0];
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we validate that we have a valid pipe before using them?

{
// This runs from the crash-reporting path. Keep this and any future callees
// async-signal-safe.
sig_atomic_t writeFd = s_crashReportWatchdogWriteFd;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of having these statics shared across classes, maybe you could add a method on the CrashReporterWatchdog that did the work?

{
// This runs from the crash-reporting path. Keep this and any future callees
// async-signal-safe.
sig_atomic_t writeFd = s_crashReportWatchdogWriteFd;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above.

// remains available to disable the watchdog for diagnostics.
static constexpr DWORD DefaultTimeoutSeconds = 30;

CLRConfigNoCache timeoutCfg = CLRConfigNoCache::Get("CrashReportTimeoutSeconds", /*noprefix*/ false, &getenv);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TryAsInteger ?

// Process-lifetime watchdog state. Successful initialization intentionally keeps
// the detached thread and pipe open until process exit; failed initialization
// paths close the pipe before returning.
static pthread_t s_crashReportWatchdogThread;
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should put this as members of a crash reporter watchdog class, then we can have a static instance of that class, alternative allocate it when the crash reporter started.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants