Skip to content

Azure AI Agents: add next-step guidance and doctor diagnostics#8198

Open
antriksh30 wants to merge 84 commits into
Azure:mainfrom
antriksh30:antrikshjain/next-step-implementation
Open

Azure AI Agents: add next-step guidance and doctor diagnostics#8198
antriksh30 wants to merge 84 commits into
Azure:mainfrom
antriksh30:antrikshjain/next-step-implementation

Conversation

@antriksh30
Copy link
Copy Markdown
Contributor

@antriksh30 antriksh30 commented May 15, 2026

Summary

  • add context-aware Next: guidance for the Azure AI Agents extension across init, run, invoke, show, deploy-hook, and doctor flows
  • add azd ai agent doctor with local and remote checks for project setup, environment variables, authentication, Foundry reachability/RBAC, hosted agent status, identity roles, model deployments, connections, and toolboxes
  • improve post-deploy guidance with README-first sample payload hints, omit active-agent show next steps, and stream text-mode doctor checks as they complete

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot wasn't able to review this pull request because it exceeds the maximum number of lines (20,000). Try reducing the number of changed lines and requesting a review from Copilot again.

@antriksh30 antriksh30 changed the title Azure AI Agents: add next-step guidance and doctor diagnostics [WIP] Azure AI Agents: add next-step guidance and doctor diagnostics May 15, 2026
@antriksh30 antriksh30 marked this pull request as draft May 15, 2026 04:08
@antriksh30 antriksh30 force-pushed the antrikshjain/next-step-implementation branch from 401ca65 to 829063a Compare May 15, 2026 04:17
@antriksh30 antriksh30 requested a review from Copilot May 15, 2026 04:55
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot wasn't able to review this pull request because it exceeds the maximum number of lines (20,000). Try reducing the number of changed lines and requesting a review from Copilot again.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 73 out of 73 changed files in this pull request and generated 10 comments.

Comment thread cli/azd/extensions/azure.ai.agents/internal/cmd/doctor.go
Comment thread cli/azd/extensions/azure.ai.agents/internal/cmd/doctor.go
Comment thread cli/azd/extensions/azure.ai.agents/internal/cmd/invoke.go
Comment thread cli/azd/extensions/azure.ai.agents/internal/cmd/doctor_format.go
Comment thread cli/azd/extensions/azure.ai.agents/internal/cmd/invoke.go
Copy link
Copy Markdown
Member

@jongio jongio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid foundation for the doctor + nextstep subsystems - clean module boundaries, interface-driven testability, and well-structured check pipeline. A few things to sort out before this leaves draft:

invoke.go:541-543 is the one I'd prioritize - it silently changes CLI semantics for --name. The rest are lower-stakes but worth a look.

Comment thread cli/azd/extensions/azure.ai.agents/internal/cmd/invoke.go
Comment thread cli/azd/extensions/azure.ai.agents/internal/cmd/doctor.go
Comment thread cli/azd/extensions/azure.ai.agents/internal/cmd/doctor/checks_auth.go Outdated
Comment thread cli/azd/extensions/azure.ai.agents/internal/cmd/doctor.go
@antriksh30 antriksh30 marked this pull request as ready for review May 18, 2026 04:55
antriksh30 pushed a commit to antriksh30/azure-dev that referenced this pull request May 18, 2026
The remote.auth doctor check surfaced the raw user principal name in
both the Message string and the structured Details map regardless of
the --unredacted flag, in contrast with the rest of the doctor checks
(checks_rbac.go, checks_agent_identity_roles.go) which already gate
identity values on opts.Unredacted.

This change threads Options into the auth check function and adds two
small helpers that reuse the existing redactedPlaceholder constant:

  - redactUPN(upn, unredacted) returns the value to surface in Message
    text: raw when --unredacted, the shared <redacted> placeholder when
    a UPN was discovered but should be scrubbed, and empty when none
    was found so composeAuthMessage cleanly drops the prefix.
  - authDetails(upn, minutes, unredacted) builds the Details map and
    omits the "upn" key entirely unless --unredacted is set, so
    machine consumers do not see the raw value by default.

PASS, WARN, and expired-FAIL branches now compose their messages from
the redacted display value. Existing tests that asserted the raw UPN
were updated to pass Options{Unredacted: true}; new table tests cover
the default-redacted and --unredacted contracts on every branch, the
empty-UPN drop, and both helpers in isolation.

Resolves PR Azure#8198 review comment from @jongio.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
antriksh30 pushed a commit to antriksh30/azure-dev that referenced this pull request May 18, 2026
TestErrorCodeWireValues pins the lowercase JSON wire values of every
exported enum the nextstep package consumes from the Agents API, but
the AgentVersionStatus map was missing the "idle" entry. The idle
status is actively read at show.go:207 and nextstep/resolver.go:248,
so silent drift on that one literal would regress the show command's
idle branch and the resolver's deployment-pending hint without any
test failure.

Add the missing "idle": string(AgentVersionIdle) case.

Resolves PR Azure#8198 Copilot review comments (ids 3246075889, 3246075800).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@antriksh30 antriksh30 changed the title [WIP] Azure AI Agents: add next-step guidance and doctor diagnostics Azure AI Agents: add next-step guidance and doctor diagnostics May 20, 2026
Antriksh Jain and others added 2 commits May 20, 2026 11:13
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add the foundation for context-aware `Next:` guidance described in PR Azure#8057:

- New `internal/cmd/nextstep` package with `Suggestion`, `State`, `ServiceState`, `AuthState` types and a format-agnostic `PrintNext` writer that aligns commands on the longest entry and caps output at one primary + one secondary line.

- Add an `isTerminal(fd uintptr) bool` helper in `internal/cmd/helpers.go` wrapping `golang.org/x/term`; promote that module from indirect to direct in `go.mod`.

- Register `nextstep` in the repo cspell dictionary.

No callers yet; resolvers, state assembly, and command wiring land in subsequent commits.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Antriksh Jain and others added 29 commits May 20, 2026 11:31
…local success (P5.1 C4)

Fixes B7: `azd ai agent invoke --local` success previously emitted
only a single `azd deploy` suggestion. Per issue Azure#7975 lines 168-181,
the natural follow-up loop is "ship to Azure, then watch the live
logs" — surface both commands in one block:

  Next: azd deploy                       -- deploy the agent to Azure
        azd ai agent monitor --follow    -- view logs after deploying

Rationale: by the time `invoke --local` returns success, the user
has already provisioned (dependencies exist) and the agent itself
works. The next loop is the deploy + verify dance, so the live-log
feed is the right secondary.

Also updates the deploy command's description from the conversational
"the local invoke worked — ship it to Azure" to the spec-aligned
"deploy the agent to Azure" so the description column stays
functional rather than narrative.

Renderer fit: PrintNext (used by invoke.go) caps at 2 lines, which
exactly accommodates the new shape; no callsite change needed. The
existing remote-success path already returned 2 suggestions
(`show <agent>` + `monitor`), so this brings the local- and
remote-success paths into structural parity.

Scope: ~40 LoC + doc/test updates. Independent of C1-C3.

Out of scope (issue Azure#7975 lines 183-191 multi-agent variant):
The multi-agent local-invoke output uses an "After deploying:"
prose subsection with per-agent invoke commands. That requires
threading state into ResolveAfterInvoke (today's call site passes
state=nil — see invoke.go:222 doc) and either a new Suggestion
sub-type for the indented subsection or a layout change to
PrintNext. Deferred to a follow-up commit. The single-agent
project case — by far the common one — is fully covered by this
change.

Tests:
- `TestResolveAfterInvoke_Success` / "local success → deploy +
  monitor" updated to assert the 2-Suggestion shape, both command
  strings, both descriptions, and that neither is Trailing.

Source of truth: issue Azure#7975 lines 168-181, P5.1 commit plan C4.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ure#7975 vocabulary (P5.1 C5)

Fixes B8: `azd ai agent show` returned `monitor --tail 100` on
failed status and a `show` re-check on empty status. Per issue
Azure#7975 lines 208-214 the vocabulary is:

  IF status == "active" OR status == "idle":
      -> "azd ai agent invoke <agent> 'Hello!'"
  IF status == "failed" OR status == "":
      -> "azd ai agent monitor --follow"
  ELSE:
      -> "azd ai agent show <agent>"  (transitional re-check)

Three changes:

1. New `AgentVersionIdle = "idle"` constant in error_codes.go.
   The verified platform enum (see error_codes.go:64-71 doc)
   only emits `active` today, but the issue spec treats `idle`
   as a "ready" synonym. We add it defensively so a future API
   surface change wouldn't make this branch return the wrong
   suggestion. The Active branch's switch case now reads
   `case AgentVersionActive, AgentVersionIdle:`.

2. `AgentVersionFailed` switch arm now emits
   `azd ai agent monitor --follow` (was `--tail 100`). By the
   time `show` surfaces the failure, the interactive user wants
   to watch the live tail of the next reconcile attempt, not
   capture a fixed 100-line window. The pre-C5 `--tail 100`
   behavior is preserved in the *invoke*-failure paths
   (resolver.go:291,300; error_codes.go:143) — those are
   post-mortem inspections after a 5xx response and have
   different intent.

3. Empty status (`AgentStatus == ""`) is now combined with the
   Failed arm via `case AgentVersionFailed, "":`. The previous
   fall-through to the unknown/transitional `show` re-check
   masked the case where the platform genuinely had no status
   to report — the live log feed is the most useful next view
   when "we don't know yet".

Genuinely unknown statuses (anything not in the AgentVersionStatus
enum and not "") still fall through to the unknown branch, which
emits an `azd ai agent show <svc>` re-check.

Tests:
- resolver_test.go TestResolveAfterShow table: bumped `Failed`
  case from `monitor --tail 100` → `monitor --follow`, added
  `Idle` row routing to invoke, flipped `empty status` row from
  re-check → `monitor --follow`.
- show_test.go TestResolveNextStepFromSource_NonActiveBranches:
  bumped the `failed` row from `--tail 100` → `--follow`.

Doc comment on `ResolveAfterShow` now includes the full status
mapping table (single source of truth in the code).

Scope: ~40 LoC + doc/test updates. Independent of C1-C4.

Source of truth: issue Azure#7975 lines 207-214 + P5.1 commit plan C5.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… (P5.1 C6, fix B9)

Fix B9 from issue Azure#7975 (lines 228-242). Pre-C6, `ResolveAfterDeploy`
stripped the agent name when there was exactly one agent service in
state, emitting:

  Next:  azd ai agent show                       -- verify the deployed agent is running
         azd ai agent invoke '<payload>'         -- send a sample request to the deployed agent

That output had two problems:

1. **`azd ai agent show` (no name)** runs interactive resolution that
   relies on the user's terminal state. When the artifact note is
   copy-pasted into a different project (or read days later when the
   single-agent project may have grown a second agent), the unqualified
   command picks the wrong target or prompts. The spec example at
   lines 231-232 shows the qualified form even in the single-agent
   case for exactly this copy-paste-safety reason.

2. **Description copy was generic** ("verify the deployed agent is
   running" / "send a sample request to the deployed agent"). Per
   spec lines 231-241 the descriptions are intentionally tighter and
   identify which agent each row refers to in the multi-agent layout.

C6 changes:

- `ResolveAfterDeploy` now always emits service-qualified commands
  regardless of `len(state.Services)`. The pre-C6 unqualified branch
  is gone.

- Descriptions reshape:
  * Single agent (`len(state.Services) == 1`):
    `verify it's running` / `test the deployment`
  * Multi-agent (`len(state.Services) >= 2`):
    `verify <name> is running` / `test <name>`

- Multi-agent layout now groups all `show` lines first, then all
  `invoke` lines (was interleaved per service). Matches spec example
  at lines 238-241. Single-agent output is unchanged in layout —
  with one service the pass-1/pass-2 split still produces show-then-
  invoke order.

- README hint placement: in the new pass-2 invoke loop the per-agent
  hint is emitted immediately after that agent's invoke line, so a
  reader can scan rows top-to-bottom and find each agent's hint in
  context. Single-agent placement is identical to pre-C6.

- `AfterDeployOpts.ForceQualified` is kept as a NO-OP for backward
  compatibility. Callers (doctor.go line 239 passes it for filtered
  states) still compile and produce identical output. The doc
  comment is updated to mark it as a no-op and explain why
  (the single-agent unqualified heuristic it overrode is gone).

Tests:

- `TestResolveAfterDeploy` rewritten — all single-agent expectations
  now assert qualified commands (`azd ai agent show echo` /
  `azd ai agent invoke echo '...'`) and per-spec descriptions.
- New subtest for multi-agent grouped ordering (shows-then-invokes)
  asserts shows in service-declaration order, then invokes in the
  same order, each row carrying the per-agent descriptive text.
- New subtest for README hint placement in the multi-agent layout
  asserts the hint follows the invoke line for the service that
  triggered it, even with the new grouped ordering.
- The two `ForceQualified` subtests are kept and rewritten as
  backward-compat assertions — they now compare against the
  no-opts baseline and prove the flag is a true no-op.

Caller impact:
- `doctor.go:235-239` passes `ForceQualified: totalServices > 1`.
  Output is identical to its old behavior (totalServices==1 used
  to emit unqualified; now emits qualified, but doctor's
  filtered-state callsite always wanted qualified anyway — that's
  literally the comment at the callsite).

Spec source: issue Azure#7975 lines 228-242 + the C6 row in the P5.1
commit plan.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… ready" Next: (P5.1 C7)

Issue Azure#7975 lines 96-103 specify that the init "everything ready"
output should surface two commands, not one:

  Next:  azd ai agent run                      -- start the agent locally
         azd ai agent invoke --local "Hello!"  -- test it in another terminal

Pre-C7 the resolver emitted only `azd ai agent run` (plus the trailing
`azd deploy` reminder). Users hit the "now what?" wall after the agent
bound its port: they had to remember `azd ai agent invoke --local …`
and figure out the right payload for their protocol.

Source of truth: issue Azure#7975 lines 96-103. Spec example uses the
unqualified `azd ai agent invoke --local "Hello!"` form regardless of
how many services the project declares.

# Change

`ResolveAfterInit`'s default branch (everything-ready case) now appends
a second `Suggestion`:

  Command:     azd ai agent invoke --local <payload>
  Description: test it in another terminal

The trailing `azd deploy` reminder stays in place (Priority 90,
Trailing:true), so the rendered block is three lines on a fresh,
fully-provisioned project.

Payload selection:

- `len(state.Services) == 1` → `defaultInvokePayload(&state.Services[0])`
  which already maps `ProtocolInvocations` → `'{"message": "Hello!"}'`
  (JSON envelope) and everything else → `"Hello!"`. So a single-agent
  invocations-protocol project gets the JSON shape, single-agent
  responses-protocol gets the literal, matching what `ResolveAfterRun`
  has done since commit 2.3.
- `len(state.Services) != 1` (zero or multi-agent) → literal `"Hello!"`.
  Multi-agent: the unqualified `azd ai agent invoke --local` doesn't
  know which agent the user will pick at runtime, so picking one
  service's protocol arbitrarily would be wrong half the time. The
  responses-style literal is what the spec example uses.

The suppression branches are untouched:

- `hasPlaceholders` (case 2) — neither `run` nor `invoke --local`
  emitted. Running locally with literal `{{NAME}}` values produces a
  broken agent; the spec gates `run` on placeholder-clear state, and
  the invoke secondary is paired with `run` so it inherits the gate.
- `len(state.MissingManualVars) > 0` (case 1 of the default-cases
  block) — the manual-vars renderer ships its own `run` follow-up
  ("start the agent locally once the values above are set"). The
  invoke-local secondary is NOT added there: the manual-vars example
  in the spec (lines 119-127) deliberately stops at `run` to keep the
  "set values → run" call-to-action focused.

# Renderer

Both init.go (`init.go:1643`) and init_from_code.go (`:148`) call
`nextstep.PrintAllNext`, which has NO line cap (uncapped renderer
per commit 4.7's G1 fix). The new third suggestion fits cleanly;
no truncation risk.

`PrintNext`'s `maxRendered = 2` cap is irrelevant here — it's used
by mid-flow resolvers (invoke, show) where ≤2 suggestions naturally
occur.

# Tests

resolver_test.go adds `TestResolveAfterInit_EverythingReady_EmitsInvokeLocalSecondary`
with five subcases:

1. Zero services → unqualified invoke with responses payload. Pins
   the spec-mandated unqualified form. Asserts Priority ordering
   (run < invoke) so the renderer always emits them in the
   user-expected order.
2. Single-agent responses protocol → `azd ai agent invoke --local "Hello!"`.
3. Single-agent invocations protocol → `azd ai agent invoke --local '{"message": "Hello!"}'`.
   Locks the protocol-aware shape.
4. Multi-agent (mixed protocols) → invoke stays unqualified with
   responses payload. Anti-regression: protects against accidentally
   picking `state.Services[0].Protocol` for a multi-agent project.
5. Placeholders present → neither `run` nor `invoke --local` emitted.
   Anti-regression: pairs with the existing
   `TestResolveAfterInit_UnresolvedPlaceholders` table.

Existing `TestResolveAfterInit` table cases (happy path,
"existing project chosen, all vars set") still pass without
modification — they assert `out[0].Command` (still
`azd ai agent run`) and `out[len(out)-1].Command` (still
`azd deploy`); the new invoke-local slot inserts in the middle
and they don't pin block length.

# Affected callers

`init.go` and `init_from_code.go` print the new line automatically
via `PrintAllNext`. `doctor.go:243`'s no-deploy branch flows through
`ResolveAfterInit` too — pre-deploy doctor guidance gets the same
upgrade.

# Verified

- gofmt -s -w . (clean)
- go vet ./... (clean)
- go test ./... -count=1 (all packages green; nextstep 6.5s,
  cmd 17.3s, doctor 4.9s)
- golangci-lint run ./internal/cmd/... (0 issues)
- cspell lint internal/cmd/nextstep/**/*.go --config ../../.vscode/cspell.yaml (0 issues)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… C9)

Adds the 7th local doctor check, surfacing the same MissingManualVars
signal that the post-init `Next:` block renders — but at any point in
the flow (post-deploy, mid-debug, before reporting a bug). Closes the
"manual config values missing" branch from issue Azure#7975
lines 117-127 on the doctor side.

## Why

The doctor's job is to short-circuit user frustration: "I get the same
guidance from `azd ai agent doctor` regardless of where I am in the
flow." Today the manual-vars signal only surfaces at the tail of `azd
ai agent init`, so a user who hits the issue mid-cycle (e.g. they
cloned a friend's project, ran `azd env select`, and went straight to
`azd ai agent run`) sees no clear pointer to the missing config.

## What

- `internal/cmd/doctor/checks_manual_env.go` — new check produces ID
  `local.manual-env-vars`, name "manual env vars set". On Pass, message
  is "no manual env vars are missing"; on Fail, lists every missing var
  in the message and stages a paste-ready `azd env set <FIRST> <value>`
  suggestion (sorted alphabetically — the renderer paints Suggestion as
  a single line, multi-line breaks indentation). When 2+ vars are
  missing the suggestion adds a "Repeat for each of the other
  variables listed above." clause; when exactly 1 is missing the bare
  command is the suggestion (no misleading "and others" wording).
- `internal/cmd/doctor/checks_local.go` — adds the check as the 7th
  entry in `NewLocalChecks` (after `local.agent-yaml-valid`); introduces
  a lowercase `assembleState` field on the exported `Dependencies`
  struct as a test seam (production code leaves it nil; tests inject
  directly).
- Skip cascade: skips when AzdClient is nil OR when
  `local.agent-yaml-valid` failed/skipped OR when
  `local.environment-selected` failed/skipped. The first guard
  transitively covers azure-yaml → agent-service-detected →
  agent-yaml-valid; the second is required because env-selection is a
  sibling chain (not upstream of agent-yaml-valid) and
  `nextstep.AssembleState` silently early-exits its detectMissingVars
  block when no env is selected (state.go: `if project != nil &&
  envName != ""`). Without the env guard the check would falsely Pass
  in a state where no env values were ever examined.

## Architecture: why reuse `nextstep.AssembleState`

The "manual vs infra" classification logic lives in nextstep — the same
place that drives every other `Next:` recommendation. Adding a separate
classifier inside the doctor would split the source of truth, and
future improvements (e.g. C1's Bicep-output discovery replacing the
`AZURE_` prefix shortcut) would have to be ported twice. Forwarding to
`AssembleState` keeps the doctor as a presentation layer.

## Tests

12 new sub-tests in `checks_manual_env_test.go`:

- No client → Skip
- Prior `local.agent-yaml-valid` failed → Skip
- Prior `local.agent-yaml-valid` skipped → Skip (cascade propagation)
- Prior `local.environment-selected` failed → Skip (regression for
  Opus xhigh review HIGH finding; asserts assembler is NOT called via
  t.Fatalf in the fake)
- Prior `local.environment-selected` skipped → Skip (symmetric)
- 0 missing → Pass with "no manual env vars are missing"
- 1 missing → Fail with bare `azd env set <NAME> <value>` suggestion
  (asserts no "Repeat" / "likewise" wording — regression for Sonnet
  4.6 review MEDIUM finding)
- 4 missing → Fail with sorted list; suggestion has "Repeat for each
  of the other variables" clause
- nil State from assembler → Fail with assembly error message
- nil State + nil errs → Fail with fallback "unknown error" message
- Non-fatal errs but State populated → Pass (state.MissingManualVars
  is the authoritative signal; ancillary errors like missing
  AI_AGENT_PENDING_PROVISION key don't dirty the result)
- Existing `TestNewLocalChecks_OrderAndIDs` extended from 6 to 7 checks;
  new `TestNewLocalChecks_IncludesManualEnvVarsLast` pins the
  agent-yaml-valid → manual-env-vars ordering invariant so a future
  reorder can't silently break the skip-cascade.

## Out of scope

- Bicep-output classifier (B1 fix / C1) — `assembleState` today routes
  non-`AZURE_*` Bicep outputs (e.g. `TOOLBOX_*`) into MissingManualVars
  rather than MissingInfraVars. This check will surface that bug
  verbatim until C1 lands; that's the correct phasing because (a)
  every Phase 5 consumer benefits from the same fix at once, and (b)
  showing the current behavior in the doctor makes the bug debuggable
  in the meantime.

Refs: Azure#7975 lines 117-127, 308-312 (issue spec)
Refs: Azure#8057 (PR being implemented)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Wire the Cobra/runner pipe for remote (network-dependent) doctor
checks so commits C11-C17 can add individual checks (auth, foundry
endpoint reachability, RBAC, agent status) without touching the
doctor command's Cobra surface or runner internals.

Most of C10's framework was already in place from Phase 4:

  - `Check.Remote bool` (runner.go:32-37)
  - `Runner.Run` skips remote checks under `Options.LocalOnly`
    (runner.go:74-82)
  - `report.Remote = true` flipped whenever an executed check is
    Remote (runner.go:128-130)
  - `--local-only` / `--unredacted` flags bound on the Cobra surface
    and threaded through `doctor.Options`

What was missing was the *factory slot* — a `NewRemoteChecks` mirror
of `NewLocalChecks` that the doctor command appends to its check
list. Without that slot the only place to add a remote check was the
command file itself, breaking the convention established by
`NewLocalChecks` (every doctor check lives in the `doctor` package;
the command file is plumbing only).

Changes
-------

* New `internal/cmd/doctor/checks_remote.go`:
  - `NewRemoteChecks(deps Dependencies) []Check` — empty today;
    documents the conventions C11+ remote checks must follow
    (Remote: true, ctx cancellation, skip-cascade against the
    local chain via `priorBlocked`, redaction discipline under
    `!Options.Unredacted`).
  - Names the four follow-up checks the slot is reserved for so
    a future reader can immediately see the scope without
    cross-referencing the plan: C11 auth, C12 foundry endpoint,
    C16 RBAC, C17 agent status.
  - Explicitly documents that local-checks-then-remote-checks
    ordering is load-bearing for `priorBlocked` skip-cascade.

* `internal/cmd/doctor.go`:
  - `runDoctor` now builds the runner from
    `append(NewLocalChecks(deps), NewRemoteChecks(deps)...)`
    instead of `NewLocalChecks(deps)` alone. Comment on the call
    site pins the ordering contract.
  - `doctorFlags` doc comment rewritten to describe today's
    reality (the wire is fully exercised; remote-checks factory
    is empty but populated transparently when C11+ land) instead
    of the pre-C10 wording "no-op today, reserved for an upcoming
    pass."
  - `--local-only` and `--unredacted` user-visible help text
    trimmed of internal plan-tracking jargon (no more "subsequent
    commits" or "(P5 C11+)"). The first sentence now stands on
    its own and reads cleanly under `--help`.

* `internal/cmd/doctor/types.go`:
  - `Options.LocalOnly` doc comment updated from "no-op in phase 4
    — no remote checks are wired yet" to match the new post-C10
    reality (the factory returns empty today; the wire is
    exercised).

* New `internal/cmd/doctor/checks_remote_test.go` (5 tests):
  - `TestNewRemoteChecks_EmptyTodayButCallable` — pins the
    contract; when C11 lands the empty-slice assertion fires and
    forces the author to update the count.
  - `TestNewLocalAndRemoteChecks_ProductionCompositionLocalsFirst`
    — pins the load-bearing local-then-remote ordering by reading
    the actual production factories (not synthesized checks).
    Asserts (a) every check in NewLocalChecks has Remote=false,
    (b) every check in NewRemoteChecks has Remote=true, (c) no
    local check appears after any remote check in the combined
    slice. Catches a future contributor swapping the append order
    or forgetting the Remote flag on a remote check.
  - `TestRunner_LocalThenRemote_RemoteSeesLocalPriorResults` —
    asserts the runner preserves the slice order so a remote
    check's `priorBlocked` guard reads the local results.
  - `TestRunner_LocalOnly_AppendedRemoteCheck_NotInvoked` —
    exercises the production-shaped `append(local, remote...)`
    slice under `LocalOnly: true`.
  - `TestRunner_RemoteCheck_RanProducesReportRemoteFlag` —
    asserts `report.Remote = true` against the production-shaped
    slice when a remote check executes.

User-visible behavior
---------------------

None. The remote-checks factory is empty, so:

  - `azd ai agent doctor` produces the same 7-local-check report
    it did before this commit.
  - `azd ai agent doctor --local-only` produces the same 7-check
    report (no remote checks to skip).
  - `azd ai agent doctor --output json` envelope has `"remote":
    false` (no remote check executed).

The change is exclusively in the *plumbing* for the follow-up
commits.

Preflight
---------

* gofmt -s -w . — clean
* go vet ./... — clean
* go build ./... — clean
* go test ./... -count=1 — green (cmd 15.0s, doctor 6.1s, nextstep
  3.7s, etc.)
* golangci-lint run ./internal/cmd/... — 0 issues
* npx cspell lint "internal/cmd/doctor.go" "internal/cmd/doctor/**/*.go"
  --relative --config ../../.vscode/cspell.yaml --no-progress —
  0 issues across 7 files

Closes Phase 5 commit slot C10.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Implements Check 7 from the doctor remote-checks design as the first
populated entry in the remote chain wired up by C10. The check proves
that `azd auth token` can mint a token for the Foundry data-plane
scope and surfaces the result with branched severity:

  - Pass: "<UPN> · token valid for <N> minutes"
  - Warn (< 5 min remaining): suggest `azd auth login` proactively
  - Fail (expired or acquisition error): suggest `azd auth login` with
    a learn.microsoft.com link to the `azd auth login` reference
  - Skip (user cancelled): no suggestion — Ctrl-C is not an auth bug
  - Fail (probe deadline): distinct "timed out" message that does NOT
    immediately accuse the user of being logged out

Skip-cascade: `local.environment-selected`. Per the design dependency
matrix, an env must be selected before remote checks have a project
to test against; otherwise the remote chain would run against the
wrong context. `local.grpc-extension` is intentionally NOT a
precondition — the auth probe goes straight to azidentity and does
not need the AzdClient gRPC channel.

Implementation notes:

  - Uses `azidentity.NewAzureDeveloperCLICredential` (same as
    `agent_context.go:newAgentCredential`) so the probe mirrors the
    production credential exactly.
  - Requests the production data-plane scope `https://ai.azure.com/
    .default` (same as `agent_api/operations.go`) so a Pass here
    matches what the runtime invoke flow needs — not a different
    scope that might succeed when the runtime would fail.
  - Wraps the probe in a 10s timeout to bound stuck shells. After the
    probe returns the check classifies `context.Canceled` /
    `context.DeadlineExceeded` separately from generic auth errors
    so user Ctrl-C maps to Skip (not Fail) and a probe timeout maps
    to a distinct Fail message instead of telling the user to log in
    again when the real cause is a stuck `azd` invocation.
  - UPN extraction is best-effort JWT payload decoding (stdlib only,
    no third-party JWT lib). Tries `upn`, `unique_name`,
    `preferred_username`, `email` in order. Decode failures return an
    empty UPN and never an error: the auth check cares about the
    token's validity, not how readable its claims are.
  - The raw access token is never logged, returned, or exposed
    outside `realProbeAuth`.
  - Wrong-tenant detection is intentionally NOT done here — that's
    the job of `remote.foundry-endpoint` (C12) where a 403 maps to a
    precise "wrong tenant or insufficient RBAC" suggestion.
    Surfacing the same failure mode here would produce false
    positives — flagging auth as broken when the user IS
    authenticated, just against the wrong tenant.
  - `formatTokenWindow` substitutes "less than 1 minute" for any
    sub-minute positive window so the Warn message can never read
    "token expires in 0 minutes" — that wording is indistinguishable
    from expiry to a reader scanning the report.
  - `firstLine` strips a trailing `\r` after splitting on `\n` so
    Windows CRLF stderr output from `os/exec`-invoked `azd` does not
    leak a carriage return into the doctor report message.
  - Every Fail branch that suggests `azd auth login` now carries the
    same `authLoginLink` (a single package constant), so all four
    suggestion paths point at the canonical MS Learn reference.

Testability:

  - New `probeAuth` test seam on `Dependencies` matches the pattern
    used by `assembleState`. Production wiring leaves it nil; the
    check falls back to `realProbeAuth`. Tests inject deterministic
    `authProbeResult` values to exercise each branch.
  - 22 new tests cover: skip-cascade (Fail + Skip cases); default
    seam fallback; all severity branches (Pass, Warn, sub-minute
    Warn, expired Fail with Links, acquisition-error Fail with
    Links, cancellation Skip, deadline Fail); singular / plural and
    sub-minute formatting; 5-minute Warn/Pass boundary; UPN claim
    ordering / empty / whitespace / non-string variants; and JWT
    decode failure modes (empty token, wrong segment count, invalid
    base64, non-JSON payload, non-object JSON root).

Files:

  - checks_auth.go (new): newCheckAuth + realProbeAuth + extractUPN
    + format helpers + authLoginLink constant
  - checks_auth_test.go (new): comprehensive table-driven coverage
  - checks_local.go: add `probeAuth` test seam to Dependencies
  - checks_remote.go: append newCheckAuth(deps) to the remote chain
  - checks_remote_test.go: replace the empty-slice contract test
    with one pinning the auth-only shape, so any future addition has
    to touch this one assertion

Reviewer findings applied (3-reviewer pass: Opus xhigh + Sonnet 4.6 +
GPT-5.5):

  - HIGH (Sonnet): Expired-token Fail was missing `Links` — fixed,
    every Fail with `azd auth login` suggestion now carries the
    reference link via the new `authLoginLink` constant.
  - MEDIUM (GPT-5.5): Cancellation / timeout was classified as
    generic auth failure — fixed, classified separately with
    appropriate Skip / Fail and distinct messages.
  - MEDIUM (Sonnet + Opus convergent): Sub-minute positive validity
    rendered as "token expires in 0 minutes" — fixed via
    `formatTokenWindow` substituting "less than 1 minute".
  - MEDIUM (Sonnet): `firstLine` left trailing `\r` on Windows CRLF —
    fixed with `strings.TrimRight(s[:i], "\r")` + CRLF test case.
  - LOW (Sonnet): Doc said "false negatives" where logic required
    "false positives" — corrected.
  - LOW (Opus): 5-minute Warn/Pass boundary was not directly
    tested — added `TestCheckAuth_WarnPassBoundaryAtFiveMinutes`.
  - LOW (Opus): Doc comment cited a non-existent design path —
    updated to the actual `.tmp/pr-8057/...` location.

Preflight: gofmt -s -w . ✓, go vet ./... ✓, go test ./... -count=1
(all packages) ✓, golangci-lint run ./internal/cmd/... ✓, cspell ✓.

Refs: Azure#7975
Refs: design `.tmp/pr-8057/azd-ai-agent-doctor-remote-checks.md`
      Check 7 / dependency matrix lines 112-131

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… C12)

Implements Check 8 from the doctor remote-checks design as the second
populated entry in the remote chain. The check proves that the
configured Foundry project endpoint is reachable AND that the bearer
token minted by C11 actually authorizes against it — a combined
"can we talk to the project at all" signal that gates every other
remote check (models / agent status / RBAC) from the design's
dependency matrix.

The probe issues a single `GET <endpoint>/agents?api-version=<v>&limit=1`
with a 10s timeout, no retries, using the same credential + scope
as the production runtime path (`agent_context.go:newAgentCredential`
+ `agent_api/operations.go`). The `limit=1` parameter matches the
production agent_api client exactly so a Pass here proves the same
query shape the runtime invoke flow uses (the earlier `$top=1`
choice was a divergence flagged by reviewers).

The status-code response is mapped 1:1 to user-actionable outcomes:

  - 200            → Pass: "endpoint reachable (HTTP 200)"
  - 401            → Fail: "token expired or scope mismatch" +
                     suggest `azd auth login` (link: auth login docs)
  - 403            → Fail: "wrong tenant or insufficient RBAC" +
                     suggest re-auth in correct tenant and let
                     `remote.rbac` flag the role-assignment gap
                     (deliberately NO `azd auth login` here, but DO
                     carry a docs Link to the Foundry RBAC quickstart
                     so the suggestion is actionable — matches the
                     C11 "every actionable Fail carries Links"
                     convention)
  - 404            → Fail: "endpoint is wrong or project is gone" +
                     suggest `azd provision` / `azd env set
                     AZURE_AI_PROJECT_ENDPOINT`
  - 5xx            → Fail: "service-side error" + suggest retry
  - other          → Fail: "unexpected HTTP <N>" + verbose hint
  - transport err  → Fail: "could not reach <host>: <first line>" +
                     network / VPN / firewall guidance
  - ctx canceled   → Skip (user aborted)
  - 10s elapsed    → Fail: "did not respond within 10s" + retry hint

Skip-cascade: `local.project-endpoint-set` AND `remote.auth`. The
former gives us the endpoint to probe; the latter gives us the
token to authenticate with. Skipping when either failed prevents
double-reporting the same root cause. `local.environment-selected`
is transitively cascaded via `local.project-endpoint-set`.

Implementation notes:

  - The api-version is read from the package-level
    `DefaultAgentAPIVersion` constant by the Cobra wiring in
    `cmd/doctor.go` and passed in through `Dependencies.AgentAPIVersion`.
    This honors the design's "single source of truth" requirement
    while keeping the doctor package self-contained (no import cycle
    against `internal/cmd`).
  - `makeRealProbeFoundryEndpoint(apiVersion string) func(...)` is
    a closure factory rather than a top-level function so the
    api-version is captured at construction time without becoming a
    global.
  - `buildFoundryProbeURL` parses the user-supplied endpoint FIRST,
    then mutates `u.Path` (trim-right + "/agents"), clears
    `u.Fragment` / `u.RawFragment`, and only then sets RawQuery.
    This prevents a stray `?api-version=evil` or `#fragment` in the
    endpoint from displacing the `/agents` path segment — a
    silent-misdiagnosis bug the prior `endpoint + "/agents"` string
    concatenation could trigger. Regression tests now positively
    assert `/agents?` is present in every successful build output.
    The builder also returns an error for any endpoint that is not
    an absolute HTTPS URL with a non-empty host, so a malformed
    env value cannot leak a bearer token to the wrong scheme/host.
  - A new `validateFoundryEndpoint` helper runs at the check-level
    BEFORE the probe is invoked, BEFORE any token is acquired. A
    non-HTTPS, relative, or otherwise-malformed
    `AZURE_AI_PROJECT_ENDPOINT` surfaces a precise Fail with an
    `azd env set AZURE_AI_PROJECT_ENDPOINT <https://...>` suggestion
    instead of either a generic transport error (with the token
    leak that would have come with it) or the builder's defensive
    error wrapped in a less-helpful network-VPN-firewall message.
  - Cancellation classification mirrors C11's pattern:
    `errors.Is(ctx.Err(), context.Canceled)` → StatusSkip (user
    aborted); `errors.Is(probeCtx.Err(), context.DeadlineExceeded)`
    → StatusFail (we hit our own 10s bound, not the parent ctx).
  - Multi-line transport errors are reduced to their first line
    via the shared `firstLine` helper from C11 so the resulting
    Message stays readable.
  - The `Details` map carries the endpoint, request URL, and
    HTTP status code (when available) for `--output json` consumers
    and `--unredacted` debugging. No raw tokens, no response body
    excerpts.
  - 24 tests cover skip-cascade (env-not-selected, endpoint-not-set,
    auth-failed, AgentAPIVersion-missing), every status-code branch,
    cancellation vs timeout disambiguation, URL builder safety
    (junk query in endpoint, trailing slashes, fragment, blank
    api-version, non-HTTPS / relative / malformed endpoint),
    endpoint validation (HTTPS-only, non-empty host, well-formed),
    helper functions (`endpointHost`, `readProjectEndpoint`,
    `firstLine` reuse), a TLS httptest server smoke test asserting
    the built URL lands on `/agents` on the wire, and a token-leak
    sanity check on Details / Message / Suggestion strings.

Behavior: with this commit, `azd ai agent doctor` now produces two
remote checks (`remote.auth` from C11 + `remote.foundry-endpoint`
from C12) instead of just one. The full remote chain still requires
C13+ to be useful end-to-end, but every subsequent check can now
take a Pass on this one as proof that the project URL works.

Refs: Azure#7975, PR Azure#8057 design-spec
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds the third remote doctor check, `remote.rbac`, implementing
Check Azure#10 of issue Azure#7975 / Check 10 of the design doc
(`azd-ai-agent-doctor-remote-checks.md` lines 159-180): "Developer
has required role on Foundry project".

The check queries the developer's role assignments on the Foundry
project's ARM scope via the Microsoft Graph + ARM stack, then
classifies the result:

  * Pass: "<display> has the required role on project '<acct>/<proj>'"
    (display name replaced with "the current principal" in the
    default redacted mode for UPN safety).
  * Fail: templated `az role assignment create --role "Azure AI User"
    --assignee <oid> --scope <scope>` + a learn.microsoft.com link
    to the RBAC concepts page. Principal ID and scope ARN are
    substituted with shell-safe ALL_CAPS placeholders
    (OBJECT_ID / PROJECT_SCOPE) in redacted mode — NOT `<redacted>`,
    because bash/zsh interpret `<word>` as input redirection.
  * Skip: precondition unmet (no AzdClient, no env, auth Failed,
    `AZURE_AI_PROJECT_ID` missing/malformed/unset, service-principal
    token detected, transient Graph/ARM error, cancellation).

Architecture
------------
Two new files:

  internal/project/developer_rbac_query.go (~175 LoC)
    - QueryDeveloperRBAC: side-effect-free wrapper returning
      a structured DeveloperRBACResult. Reuses package-private
      parseAgentIdentityInfo / hasAnyRoleAssignment /
      sufficientAIUserRoles from developer_rbac_check.go.
    - ValidateProjectResourceID: shape check for AZURE_AI_PROJECT_ID,
      returns wrapped ErrInvalidProjectResourceID sentinel.
    - ErrInvalidProjectResourceID, ErrSPNDelegatedAuthRequired:
      sentinel errors for diagnostic classification.
    - Graph /me failure detection: routes app-only/SPN token
      rejections onto ErrSPNDelegatedAuthRequired (case-insensitive
      message match) so doctor surfaces a SPN-aware Skip.

  internal/cmd/doctor/checks_rbac.go (~430 LoC)
    - newCheckRBAC: skip-cascade + projectID lookup + upfront
      ValidateProjectResourceID gate + probe dispatch.
    - classifyRBACProbeError: sentinel-keyed error classification
      (Canceled / SPN / InvalidProjectID / generic transient).
    - classifyRBACResult: pure Pass/Fail mapping for diagnostic
      consumers.
    - sanitizeScopeARNs: regex-based scope+GUID scrubber for
      probe error text.
    - readProjectResourceID: AZURE_AI_PROJECT_ID env lookup via
      gRPC EnvironmentService.
    - redactID / redactScope / redactDisplay: centralized
      placeholder substitution helpers.

Two new seams on Dependencies:
  probeDeveloperRBAC       - replaces project.QueryDeveloperRBAC
  readProjectResourceIDFn  - replaces readProjectResourceID

Skip-cascade rationale
----------------------
Per design dependency matrix (line 115), `remote.rbac` cascades
against `local.environment-selected` + `remote.auth` but NOT
`remote.foundry-endpoint`. RBAC reads ARM, not the data plane;
a transient DNS / proxy / outage on the data-plane check must not
prevent the user from learning their role assignment is missing.
`TestCheckRBAC_DoesNotSkipOnFoundryEndpointFail` pins this.

Probe errors → Skip (not Fail) to avoid false alarms on transient
Graph/ARM hiccups. Cancellation similarly Skips with a clean
message rather than rendering an error trace.

Review fixes applied
--------------------
Following the 3-reviewer pass (Opus xhigh + Sonnet 4.6 + GPT-5.5)
of an earlier draft (commit 0c4d5ee), the following findings
were addressed:

  * MEDIUM (Opus + GPT-5.5): probe-error path leaked raw
    subscription/scope IDs via azcore.ResponseError.Error()'s first
    line (`GET https://management.azure.com/subscriptions/...`)
    into Message + Details. Fix: sanitizeScopeARNs regex pass
    strips ARM scopes + bare GUIDs from Message in redacted mode;
    Details["probeError"] is OMITTED entirely unless --unredacted.
  * MEDIUM (Sonnet 4.6): malformed AZURE_AI_PROJECT_ID got
    "check your network" Suggestion despite no network call.
    Fix: upfront ValidateProjectResourceID gate runs before the
    probe; surfaces "is not a valid Foundry project ARM resource
    ID" with an `azd env set` Suggestion.
  * MEDIUM (GPT-5.5): PrincipalDisplay rendered verbatim in
    Message even in default redacted mode; display names can
    contain UPN fragments (e.g., "Alice (alice@contoso.com)").
    Fix: redactedDisplayLabel ("the current principal")
    substitutes for raw display in default mode; unredacted mode
    still echoes the real display.
  * MEDIUM (GPT-5.5): service-principal `azd auth login` cannot
    use Graph /me — confusing generic transient Skip. Fix:
    ErrSPNDelegatedAuthRequired sentinel + case-insensitive
    detection of the canonical "delegated authentication flow"
    Graph response; doctor surfaces a SPN-aware Skip with
    user-delegated guidance.
  * LOW/Nit (Opus): `<redacted>` is a bash input-redirection
    token; `--assignee <redacted>` would fail with
    `redacted: No such file or directory` on copy-paste. Fix:
    shellSafePlaceholderID/Scope constants render `OBJECT_ID` /
    `PROJECT_SCOPE` in the templated az command.

Verified clean (no action): skip-cascade structure, firstLine
helper, AZURE_AI_PROJECT_ID provenance (set by
init_foundry_resources_helpers.go:327), seam design, account-scope
check missing (acceptable per `assignedTo()` inheriting parent-
scope assignments).

Testing
-------
22 unit tests in checks_rbac_test.go and developer_rbac_query
covering: skip cascades, probe-error branches (transient, parse,
SPN, defensive sentinel, cancellation), end-to-end probe Pass/
Fail, classifyRBACResult mapping with both redaction modes,
display-name fallback, sensitive-identifier leak prevention,
shell-safe placeholder substitution, ValidateProjectResourceID
shape coverage, sanitizeScopeARNs regex coverage, all three
redaction helpers (redactID/Scope/Display) with empty-input + flag
permutations.

Preflight (from cli/azd/extensions/azure.ai.agents)
---------------------------------------------------
  gofmt -s -w .                                   - clean
  go vet ./...                                    - clean
  go build ./...                                  - clean
  go test ./... -count=1                          - all packages green
  golangci-lint run ./internal/cmd/...            - 0 issues
                  ./internal/project/...
  npx cspell lint                                 - 0 issues
    "internal/cmd/doctor/**/*.go"
    "internal/project/developer_rbac_query.go"

Refs: Azure#7975 (PR Azure#8057, Phase 5 / C16)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds the fourth (and final per the current phase-5 scope) remote
doctor check: `remote.agent-status`. For each hosted-agent service
declared in azure.yaml, the check resolves the deployed agent
name + version from the active azd environment (the AGENT_<KEY>_NAME
and AGENT_<KEY>_VERSION values written by service_target_agent.go
on a successful `azd deploy`) and probes Foundry's
`GET /agents/{name}/versions/{version}` endpoint, mapping the
lifecycle status to a doctor classification.

Per-service classification

  - Active                  → Pass (one of N agents active)
  - Creating                → Warn (azd ai agent monitor --follow)
  - Failed                  → Fail (azd ai agent monitor --follow,
                                    NOT azd deploy — read logs first)
  - 404 / Deleting / Deleted → Fail (azd deploy)
  - AGENT_<KEY>_NAME unset, OR NAME set but AGENT_<KEY>_VERSION
    unset                    → Fail (azd deploy — the service is
                                     declared but has not been
                                     deployed cleanly; the post-deploy
                                     hook writes both env vars
                                     atomically so a half-pair is a
                                     deterministic config issue)
  - Unrecognized status      → Fail with raw status surfaced so the
                              user can search; suggests the Foundry
                              portal
  - Probe error / 401-403    → service-scoped transient skip; does
                              NOT fail the aggregate when other
                              services are healthy

Aggregate

The per-service entries fold into a single doctor Result via a
ranked classifier (`agentClassRank`). Worst-class drives the
aggregate Status:

  - All Active                                       → Pass
  - Worst is Failed                                  → Fail (monitor)
  - Worst is Missing (404 / Deleted / Deleting)      → Fail (deploy)
  - Worst is NotDeployed (no env var / half-pair)    → Fail (deploy)
  - Worst is Unknown                                 → Fail (portal)
  - Worst is Deploying / Creating                    → Warn (monitor)
  - Worst is Transient AND ≥1 Active                 → Pass with
                                                       Message note
  - All Transient                                    → Skip (retry)

When multiple failing classes coexist (e.g., one Failed agent and
one Missing agent in the same project) the dominant class drives
the headline and Suggestion. Detail lines are filtered to the
dominant class only so the headline count and the rendered body
always agree; a short "other agents have additional issues" hint
is appended to the Suggestion telling the user to read Details for
the secondary fix path.

The Message lists up to 3 failing services (with `(N more)` after
that). Active services are listed in the all-active Pass message.
Details["services"] always carries the full per-service list for
JSON consumers.

Fan-out

Probes execute concurrently across services with a bounded worker
pool (probeConcurrency = 4) so wall-clock cost is bounded by the
slowest probe rather than the sum of probes. Each probe still
enforces its own 6 s timeout via agentStatusProbeTimeout; the
parent ctx propagates cancellation to all in-flight workers.

Skip-cascade (per design dependency matrix lines 110-117 of
.tmp/pr-8057/azd-ai-agent-doctor-remote-checks.md):

  - local.environment-selected     (env reads would fail)
  - local.agent-service-detected   (no service list ⇒ Pass = bug)
  - remote.auth                    (no token ⇒ every probe 401s)
  - remote.foundry-endpoint        (no reachability ⇒ N transports)

We deliberately do NOT cascade from `remote.rbac`: agent-list /
agent-get is Reader-level, and developers with read-only access
on the Foundry project still benefit from knowing whether their
agents are healthy. Pinned by
TestCheckAgentStatus_DoesNotSkipOnRBACFail.

Review findings applied (Opus 4.7 xhigh + Sonnet 4.6 + GPT-5.5)

  - MEDIUM (Sonnet Azure#1 + Opus Azure#1): doc/code/test disagreement on
    Active + Transient mix. The godoc promised "Pass with note"
    but the code returned Skip and the test pinned the wrong
    behavior. Implemented the documented Pass-with-note so a
    transient probe failure for one service does not mask the
    healthy state of the rest. Test renamed to
    TestCheckAgentStatus_Aggregate_ActiveAndTransient_PassWithNote
    and asserts the Message text.

  - MEDIUM (Sonnet Azure#2 + Opus Azure#2 + GPT-5.5 Azure#3): headline count and
    rendered detail body diverged when entries from multiple
    failing classes coexisted (e.g., "1 of 2 agents are in a
    failed state" with 2 detail lines including a Missing entry).
    The aggregate now filters detail lines to the dominant class
    via `detailsForClass(worst)`, and the Suggestion is enriched
    with an "Other agents have additional issues (<classes>); see
    the per-service Details" hint when other failing classes are
    present. Pinned by
    TestCheckAgentStatus_Aggregate_FailedDominatesMissing.

  - MEDIUM (GPT-5.5 Azure#1): AGENT_<KEY>_NAME present with
    AGENT_<KEY>_VERSION empty was classified as `transient`,
    surfacing "retry doctor" — wrong for a deterministic config
    issue. Reclassified as `not-deployed` so the user is told to
    `azd deploy`. Test
    TestCheckAgentStatus_MismatchedNameVersion_NotDeployedForService
    + TestProbeOneService_NameSetVersionEmpty_NotDeployed.

  - MEDIUM (GPT-5.5 Azure#2): serial probes — 100 services × 6 s could
    drag the doctor run past 10 minutes. The design spec calls it
    "fan out", so probes now run in parallel via a bounded
    4-worker pool (probeConcurrency). Order preservation guarantees
    deterministic Details rendering.

  - LOW (Sonnet Azure#3): added TestProbeOneService_DeletingStatus_Missing
    covering the `Deleting` lifecycle branch that previously had no
    direct test (only `Deleted` was covered).

Files

  - internal/cmd/doctor/checks_agent_status.go
      * newCheckAgentStatus + the Check shape
      * probeOneService — per-service classification body
        (now-corrected name-set-version-empty path)
      * probeAllServices — bounded-concurrency fan-out helper
      * classifyAgentStatusAggregate — folds entries into one
        Result with class-filtered detail lines + mixed-class
        Suggestion hint + Active+Transient Pass-with-note branch
      * makeRealProbeAgentStatus — production probe closure (uses
        agent_api.GetAgentVersion + azidentity.NewAzureDeveloperCLI
        Credential, the same auth path the runtime invoke flow uses)
      * readAgentNameVersion + readAgentServices helpers
      * doctorServiceKey (mirrors cmd.toServiceKey; duplicated to
        avoid an import cycle, same rationale as agentHost in
        checks_project.go)
      * Lifecycle constants (Active / Creating / Failed / Deleted /
        Deleting) sourced from
        vienna:Contracts/V2/Generated/Agents/AgentVersionStatus.cs
      * truncateLines / serviceNamesByClass / firstTransient

  - internal/cmd/doctor/checks_agent_status_test.go (~640 LoC, 34
    tests; +2 new tests over v1)
      * Skip-cascade gates × 9
      * Per-service classification × 7 (incl. Deleting, NameNoVersion)
      * Status case-insensitive matching
      * Aggregate ranking × 5 (incl. Active+Transient Pass-with-note,
        FailedDominatesMissing with Message-text assertion)
      * Aggregate truncation at 3 + "(N more)"
      * probeOneService transport branches × 6 (incl. context.Canceled
        and context.DeadlineExceeded handling)
      * Service-key edge cases, rank fallback, truncateLines
        boundary, makeRealProbeAgentStatus closure check, plus a
        ServerHandler-based smoke test that the
        azcore.ResponseError code is surfaced via statusCode

  - internal/cmd/doctor/checks_local.go
      * Dependencies struct grows two new seams:
        probeAgentStatus + readAgentNameVersionFn (mirrors the
        probeDeveloperRBAC + readProjectResourceIDFn pattern from C16)

  - internal/cmd/doctor/checks_remote.go
      * NewRemoteChecks adds newCheckAgentStatus(deps) as the 4th
        entry, after auth / foundry-endpoint / rbac

  - internal/cmd/doctor/checks_remote_test.go
      * TestNewRemoteChecks_HasAuthFoundryEndpointRBACAndAgentStatus
        pins the 4-entry shape

Out of scope (deferred)

  - Sharing the credential across services: each probe currently
    constructs its own azidentity.NewAzureDeveloperCLICredential.
    Since the credential is essentially a thin shell around
    `azd auth token`, the cost is negligible (a single in-process
    call per probe) and threading it through complicates the
    test-seam shape. Will revisit if benchmarks surface it.

  - Re-using readAgentNameVersion for the doctor's eventual ENV
    pretty-print mode. Out of scope for the check itself; the
    helper is unexported and can be promoted when the renderer
    needs it.

Preflight (from cli/azd/extensions/azure.ai.agents)

  - gofmt -s -w .                  clean
  - go vet ./...                   clean
  - go build ./...                 clean
  - go test ./... -count=1         32 doctor tests + full ext suite PASS
  - golangci-lint run ./...        0 issues
  - cspell lint ... (cspell.yaml)  0 issues

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Fixes two pre-existing UX bugs in `azd ai agent run`:

B5 (Next: race): pre-C8 the Next: block was rendered BEFORE
`proc.Start()`, so its "in another terminal, try: <example>" line could
land in the user's terminal seconds before the agent's port was bound.
Following the suggestion verbatim while the agent was still starting
yielded a connection-refused.

B6 (stale cached OpenAPI): pre-C8 the OpenAPI probe was strictly
cache-only. The first `run` ever issued — and every `run` against an
agent whose schema had drifted — surfaced a protocol-generic
`<payload>` literal instead of the agent's actual example.

Scope — only `run.go` uses the live probe. `show.go` (`WithOpenAPIProbe
(name, "remote")`), `deploy`/artifact-note path, `init.go`, and
`init_from_code.go` remain cache-only by design.

Changes
-------

nextstep/state.go
  - New functional option `WithLiveOpenAPIProbe(fetch func(context.
    Context) ([]byte, error))`. Stores the fetcher in
    `cfg.openAPILiveFetch`.
  - `populateOpenAPIPayload` now takes `(ctx, cfg, projectPath, envName,
    state)`. Order of resolution: (1) live, on success → use it;
    (2) cache (existing `WithOpenAPIProbe` path), on success → use it;
    (3) leave HasOpenAPI=false. Every failure is silent.
  - `assembleState` threads `ctx` through to the new signature.
  - Doc comment on `WithOpenAPIProbe` updated to note that
    `WithLiveOpenAPIProbe` overrides it when both are supplied.

run.go
  - Removed the pre-Start `nextstep.AssembleState` + `PrintNext` block
    that was emitting Next: before the agent bound its port.
  - After `proc.Start()`, spawn `emitNextAfterBind` in a goroutine with
    a `nextDone` channel. After `proc.Wait()` returns, the main flow
    calls `cancel()` and waits on `<-nextDone` so the goroutine is
    fully joined before stdout writes from `runRun`'s caller resume —
    closes a stdout race on shutdown.
  - `emitNextAfterBind` early-returns when stdout is not a terminal,
    honoring the documented nextstep call-site TTY-gating contract
    (matches `invoke.go:217`, `show.go:381`). Redirected stdout
    (`run > log`) no longer receives the banner or Next: block.
  - After waitForPortReady succeeds, builds a closure that wraps
    `fetchLiveOpenAPI(ctx, port)` and passes BOTH
    `WithOpenAPIProbe(serviceName, "local")` (cache fallback) and
    `WithLiveOpenAPIProbe(...)` to `AssembleState`. The state
    assembler picks live first, falls back to cache silently if the
    live fetch fails.
  - Re-checks `ctx.Err()` after `AssembleState` returns so a Ctrl+C
    arriving mid-call doesn't surface "Agent ready" after
    "Agent stopped." was already printed.
  - Four new constants: `portReadyBudget` (5 s),
    `portReadyPollInterval` (100 ms), `portReadyDialTimeout` (50 ms),
    `liveOpenAPITimeout` (3 s).
  - `waitForPortReady(ctx, port, budget) bool`: bounded TCP dial-loop
    that honors ctx.
  - `fetchLiveOpenAPI(ctx, port) ([]byte, error)`: uses
    `http.NewRequestWithContext` to GET
    `http://localhost:<port>/invocations/docs/openapi.json`. The route
    matches the cache-side fetcher in `helpers.go:368` and the
    user-facing curl tip in `nextstep/resolver.go:226`. Non-200
    responses are returned as errors so the assembler falls back to
    cache rather than ingesting a 404 body via `ExtractInvokeExample`.

Tests
-----

state_test.go (+5 new TestAssembleState_WithLiveOpenAPIProbe_* cases +
expanded `TestOptionsApplyCleanly`):
  - PrefersLiveOverCache, FallsBackToCacheOnError,
    FallsBackToCacheOnEmptyBody, LiveWorksEvenWithoutCacheProbe,
    LiveFailureWithoutCacheLeavesUnset.

run_test.go (+8 new tests + `listenLoopback` helper):
  - 3× waitForPortReady (bound port, budget elapse, ctx cancel).
  - 3× fetchLiveOpenAPI (200 body asserts
    `/invocations/docs/openapi.json` path, non-200 error, ctx
    deadline).
  - 2× emitNextAfterBind (never-binds, ctx cancelled — both pass nil
    azdClient through the safe early-return paths to verify the helper
    exits silently without panic or goroutine leak).

Preflight clean: gofmt -s -w, go vet, go build,
go test ./internal/cmd/... ./internal/cmd/nextstep/... -count=1
(cmd 10.5 s, doctor 1.6 s, nextstep 1.9 s), golangci-lint run
./internal/cmd/... (0 issues), cspell on the four modified files (0).

Review fixes (3-reviewer pass)
------------------------------

Three independent reviewers (Opus xhigh, Sonnet 4.6, GPT-5.5) reached
consensus on three correctness findings before merge:
  - Live probe URL corrected to /invocations/docs/openapi.json
    (matches existing cache fetcher and user-facing curl tip).
  - Banner + PrintNext now gated on isTerminal(os.Stdout.Fd()) to
    honor the nextstep call-site contract.
  - emitNextAfterBind goroutine is now joined via nextDone channel
    after proc.Wait, and re-checks ctx.Err() before printing so the
    banner cannot land after "Agent stopped."
  - Replaced misleading "ReturnsSilentlyWhenPortNeverBinds" test that
    only exercised waitForPortReady with two tests that actually call
    emitNextAfterBind with nil azdClient on the safe early-return
    paths.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…P5.1 C12)

Adds the final doctor check (P5.1 C12) that surfaces deployed-agent
managed-identity role assignments at three ARM scopes — project,
account, and resource group. For each agent classified active by
the upstream `remote.agent-status` check, the new
`remote.agent-identity-roles` check fetches the agent's
`instance_identity.principal_id` from Foundry and lists ARM role
assignments at all three scopes via the existing
`armauthorization` SDK already pulled in for the developer-RBAC
check.

## What lands

`internal/project/agent_identity_query.go` (new, ~340 LoC):

  - Public API `QueryAgentIdentityRoles(ctx, azdClient,
    projectResourceID, principals) (*AgentIdentityRolesResult,
    error)` reuses `parseAgentIdentityInfo` from
    `agent_identity_rbac.go` to derive the three scope ARNs, looks
    up the user-access tenant via `LookupTenant`, builds an
    `AzureDeveloperCLICredential` pinned to that tenant, and fans
    out per-principal role-assignment listings with `wg.Go`.
  - Public types `AgentPrincipal`, `AgentScopeRoles`,
    `AgentIdentityRolesEntry`, `AgentIdentityRolesScopes`,
    `AgentIdentityRolesResult` form the structured listing the
    doctor renders.
  - `queryAgentIdentityRolesWithLister` separates the per-scope
    listing strategy from credential acquisition so unit tests
    drive the inner classifier without standing up ARM fakes.
  - `listRoleNamesAtScope` lists role assignments with ARM's
    server-side `assignedTo('<principal>')` filter, then resolves
    role-definition IDs to human-readable names via
    `RoleDefinitions.Get`. Failures on individual role-name
    resolution downgrade gracefully (omitted from the listing).

`internal/cmd/doctor/checks_agent_identity_roles.go` (new, ~640 LoC
including doc comments):

  - `newCheckAgentIdentityRoles(deps)` builds the check Closure.
    Skip cascade against `local.environment-selected`,
    `local.agent-service-detected`, `remote.auth`,
    `remote.foundry-endpoint`, and `remote.agent-status`'s Pass
    (per the design's "for each active agent found in check 11").
  - `readActiveAgents(prior)` enumerates agents reachable to this
    check by reading the upstream `remote.agent-status` Details'
    `services` slice and filtering to Classification == "active".
  - `classifyOneAgent` buckets a single agent into fine /
    underscoped / empty / unknown per the design's pass condition
    ("project + (account|RG) covered"). `describeOneAgent`
    renders the one-line per-agent breakdown
    (`<agent>: project=N, account=M, resource-group=K`, with `?`
    on probe-error scopes).
  - `classifyAgentIdentityRolesAggregate` folds per-agent entries
    into a single doctor Result: all "fine" → Info; any "empty"
    → Fail (smoking-gun for "every tool call 403s"); worst
    "underscoped" → Warn; worst "unknown" → Warn.
  - `makeRealProbeAgentPrincipal` builds the production probe
    closure (mirrors `makeRealProbeAgentStatus` byte-for-byte
    apart from the field consumed — `InstanceIdentity.PrincipalID`
    vs `Status`).

## Renderer additions

`StatusInfo` joins the existing Pass/Warn/Fail/Skip status set so the
"all agents are fine" case can surface as an informational role
listing without flagging the run yellow.

  - `types.go`: `StatusInfo Status = "info"` + `Summary.Info int`
    (JSON tag added).
  - `runner.go`: canonical validation switch + summarize switch
    extended for Info; `ExitCode` treats Info as a "useful
    diagnostic completed" status (matches Pass for exit-code
    purposes).
  - `doctor_format.go`: glyph "ⓘ" and label "INFO" added to
    `statusGlyphAndLabel`; `writeSummaryLine` appends ", N info"
    when Info > 0 (preserves existing test assertions otherwise).

## Dependencies wiring

`internal/cmd/doctor/checks_local.go` adds two test seams on
`Dependencies`:

  - `probeAgentPrincipal` — replaces the production
    `GetAgentVersion` call with a unit-test fake. Same signature
    shape as `probeAgentStatus`.
  - `queryAgentIdentityRoles` — replaces the production
    `project.QueryAgentIdentityRoles` call. Signature mirrors the
    public API so wiring is a single substitution.

`internal/cmd/doctor/checks_remote.go` appends
`newCheckAgentIdentityRoles(deps)` after the existing
`remote.agent-status` entry in `NewRemoteChecks`. The append-after
ordering is load-bearing — every skip-cascade guard in C12 reads
`remote.agent-status`'s Result from `prior []Result`, and the
local-then-remote ordering invariant remains intact (verified by
the existing `TestNewLocalAndRemoteChecks_ProductionCompositionLocalsFirst`).

## Tests

`internal/cmd/doctor/checks_agent_identity_roles_test.go` (new, 16 KB):

  - Skip-cascade gates: nil AzdClient, `remote.agent-status` not
    Passed, project endpoint missing, no active agents,
    project-resource-ID unset, project-resource-ID malformed.
  - Aggregate classification: Info when all fine; Fail when any
    agent has zero roles; Warn when worst is underscoped; Warn on
    transient query error.
  - Per-agent classifier table (six cases: project+account,
    project+RG, project-only, account-only, all-empty,
    all-errored).
  - Detail formatting: scope counts and `?` for probe-error
    scopes.
  - Missing-principal degradation: agent with no
    `instance_identity` surfaces as a warning rather than a fail.
  - `readActiveAgents` filtering invariants (active-only,
    missing-name dropped, nil-return on missing upstream).

`internal/cmd/doctor/checks_remote_test.go` updated: the
`NewRemoteChecks` contract test now pins five entries (auth →
foundry-endpoint → rbac → agent-status → agent-identity-roles)
with their ID / Name / Remote / Fn invariants.

## Preflight

- gofmt -s -w .                                          clean
- go vet ./...                                           clean
- go build ./...                                         clean
- go test ./... -count=1                                 all green
  - internal/cmd                                         14.6s
  - internal/cmd/doctor                                  1.6s
  - internal/cmd/nextstep                                3.8s
  - internal/pkg/agents/agent_api                        10.9s
  - internal/pkg/agents/agent_yaml                       1.0s
  - internal/pkg/azure                                   12.7s
  - internal/project                                     5.5s
- golangci-lint run ./internal/cmd/... ./internal/project/...    0 issues
- cspell on new files (after "underscoped" added to cspell.yaml) 0 issues
- copyright-check.sh on extension                        clean

## Design notes

- The spec at `.tmp/pr-8057/azd-ai-agent-doctor-remote-checks.md`
  lines 193–223 specifies a per-agent fan-out at three scopes
  with the "fine" pass condition (project + (account|RG)). Renders
  as INFO rather than PASS because the design's intent is a
  diagnostic listing — operators inspect it on `--output json`
  and confirm no MI is starved; the check should not flip the
  doctor green on its own.
- C12 uses the `wg.Go` Go 1.26 idiom for per-principal fan-out;
  per-scope probes within one principal run sequentially (3
  scopes × 1 ARM listing each is well under budget and avoids
  the goroutine-per-scope-explosion).
- `probeAgentPrincipal` deliberately does NOT extend C17's
  `probeAgentStatus` surface — extending it would couple two
  independent checks. The mirror cost is one ~40-line factory
  function shared by both.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…P5.1 C2)

Adds a best-effort manifest walker that surfaces model / toolbox /
connection resources from each service's `agent.manifest.yaml`
into `nextstep.State`. Unblocks the doctor checks C13 (model
deployments), C14 (toolboxes), and C15 (connections), all of which
need to know whether the relevant resource kinds are declared
before they can decide to run or skip.

## State additions

- `State.HasModels`, `State.HasToolboxes`, `State.HasConnections` —
  aggregate boolean flags. True when at least one matching resource
  is found across all `azure.ai.agent` services.
- `State.ModelRefs`, `State.Toolboxes`, `State.Connections` —
  sorted `[]ResourceRef` per kind.
- `ResourceRef{Name, ServiceName, Detail}` — slim doctor-facing
  shape. Detail carries the kind-specific identifier (model id,
  connection `<Category> | <Target>`, empty for toolboxes).

## Walker semantics

- File names probed (in order): `agent.manifest.yaml`,
  `agent.manifest.yml`. `agent.yaml` is deliberately excluded —
  it describes the container, not declared resources.
- Uses `agent_yaml.ExtractResourceDefinitions` directly (NOT
  `LoadAndValidateAgentManifest`) so a manifest with an absent /
  partial `template` block — common during init — still surfaces
  its `resources:` declarations.
- Best-effort: missing file, unreadable bytes, malformed YAML,
  zero resources, and unknown resource kinds all silently degrade
  (Has* flags stay false; lists stay nil). Walker never adds to
  the `errs` slice so a manifest-in-flight (which init re-writes
  mid-flow) never blocks the rest of state assembly.
- Dedup key is `(ServiceName, Name)`. Same name twice in one
  service collapses to one entry (first-occurrence wins, matching
  agent_yaml's manifest semantics). Same name under two services
  surfaces twice so per-service doctor failures remain
  individually addressable.
- Result slices are sorted by `(Name, ServiceName)` so doctor
  output snapshots and downstream renderers are deterministic.

## Why this is its own commit

The walker is a pure data-collection step with no resolver-side
consumers in this commit. Doctor checks C13/C14/C15 (following
commits) gate-skip themselves on `state.Has{Models,Toolboxes,
Connections}` and iterate the matching ref slice. Landing the
walker first keeps each downstream commit focused on its single
check.

## Tests

8 new tests in `manifest_test.go`:
- All three kinds present → flags + lists populated, sort order
  + detail formatting locked.
- Missing manifest → silent, no errors logged through the
  walker.
- Malformed YAML → silent, no errors.
- Manifest with no `resources:` key → silent, flags false.
- Multi-service aggregation → entries sorted by Name, ties
  broken by ServiceName.
- Duplicate `(service, name)` within one manifest → first
  occurrence wins.
- `.yaml` wins over `.yml` when both exist.
- `agent.yaml` (not a manifest) is ignored even if its content
  happens to parse as one.
- `connectionDetail` table-driven test covers all four
  category/target combinations.

## Preflight

- `gofmt -s -w .` — clean
- `go vet ./...` — clean
- `go test ./... -count=1` — full extension suite green
- `golangci-lint run ./internal/cmd/...` — 0 issues
- `cspell` over the touched files — 0 issues

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds the seventh doctor remote check (`remote.model-deployments`) which
verifies that every model resource declared in any service's
`agent.manifest.yaml` (collected by the C2 manifest walker into
`State.ModelRefs`) has a corresponding Cognitive Services deployment on
the Foundry project's underlying account.

# What it does

For each project run:

1. Skip-cascade gates (in order): `AzdClient` nil →
   `local.environment-selected` → `local.azure-yaml` /
   `local.agent-service-detected` → `remote.auth` →
   `remote.foundry-endpoint` → `!state.HasModels` →
   `AZURE_AI_PROJECT_ID` unreadable / cannot be parsed. Every gate
   produces a single, actionable Skip message that points the user at
   the upstream check.

2. Parse `AZURE_AI_PROJECT_ID` (Foundry project ARM resource ID) for
   `(subscription, resourceGroup, accountName)` via the new
   `parseAccountFromProjectID` helper. Deployments live at the
   Cognitive Services *account* level, NOT the project level, so the
   account name is the load-bearing parameter here.

3. Issue exactly one `armcognitiveservices/v2.DeploymentsClient
   .NewListPager` round trip via the new `realProbeModelDeployments`
   helper, capped at 10s (matches the design's per-probe budget in
   `.tmp/pr-8057/azd-ai-agent-doctor-remote-checks.md`). Returns
   `[]string` of deployment names. Transport errors short-circuit the
   check to Skip with the error verbatim plus an actionable retry
   suggestion — we can not distinguish "deployment missing" from "ARM
   unreachable" without a successful round trip.

4. `classifyModelDeployments` joins `state.ModelRefs` to the deployment
   set on name. All match → Pass with the matched count. One or more
   missing → Fail with the missing names listed in the Message and
   structured under `Details["missingModels"]` (each entry carries
   both `name` and `service` so the user can locate the offending
   manifest entry). Suggestion: `azd provision` to create the missing
   deployments, or update `agent.manifest.yaml` `resources[].name` to
   match deployments that already exist.

# Aggregation

The walker may surface `ModelRefs` from multiple services. Every
service in an azd project belongs to the same Foundry project (and
therefore the same Cognitive Services account), so the check issues
exactly one deployments list per run regardless of how many services /
model refs exist. The same model referenced by two services collapses
to a single match check; a missing model referenced by two services
surfaces as two `missingModels` entries (one per service) so the user
can pinpoint each affected manifest.

# Test seam

`Dependencies.probeModelDeployments` (lowercase, package-internal)
matches the established pattern from `probeAuth`,
`probeFoundryEndpoint`, `probeDeveloperRBAC`, `probeAgentStatus`,
`probeAgentPrincipal`. Production wiring leaves it nil; tests inject a
closure that returns canned `(names, err)` tuples and (optionally)
captures the `(subscription, resourceGroup, accountName)` it was
called with.

`Dependencies.assembleState` and `Dependencies.readProjectResourceIDFn`
are reused from earlier checks; no new top-level seam is added besides
the probe.

# Files

- `internal/cmd/doctor/checks_model_deployments.go` — new check
  factory `newCheckModelDeployments`, `parseAccountFromProjectID`,
  `classifyModelDeployments`, `realProbeModelDeployments`,
  `listDeploymentNames`. 363 lines.
- `internal/cmd/doctor/checks_model_deployments_test.go` — 11 tests:
  skip-cascade (1 + table of 5 upstream-blocked permutations), no
  manifest models, unset project ID, unparsable project ID, probe
  transport error, all-match Pass, partial-match Fail, all-missing
  Fail, parser table (canonical / mixed-case / missing segments /
  garbage), factory shape pin.
- `internal/cmd/doctor/checks_local.go` — adds the
  `probeModelDeployments` field to `Dependencies` next to its
  same-shape siblings.
- `internal/cmd/doctor/checks_remote.go` — appends
  `newCheckModelDeployments` after `newCheckAgentIdentityRoles` in
  `NewRemoteChecks`.
- `internal/cmd/doctor/checks_remote_test.go` — extends the
  composition pin test to assert 6 checks (was 5) including the new
  `remote.model-deployments` slot.

# Preflight

- `gofmt -s -w .` clean.
- `go vet ./...` clean.
- `go build ./...` clean.
- Full extension test suite: green (`cmd`, `cmd/doctor`,
  `cmd/nextstep`, `exterrors`, `agents/agent_api`, `agents/agent_yaml`,
  `pkg/azure`, `project` — all pass).
- `golangci-lint run ./internal/cmd/doctor/...` 0 issues.
- `cspell` 0 issues on production file.
- No `go.mod` or `go.sum` changes (uses already-imported
  `armcognitiveservices/v2`).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Adds the eighth local doctor check, `local.toolboxes`, which verifies
that every ToolboxResource declared in any service's
agent.manifest.yaml has its canonical MCP endpoint env var
(TOOLBOX_<NORMALIZED_NAME>_MCP_ENDPOINT) set in the active azd
environment.

Why local (Remote: false). The check only reads the active azd
environment via the existing gRPC env service — no ARM / Foundry
round trips. Tagging it local means `--local-only` still runs it
(which is exactly what we want: a missing TOOLBOX_*_MCP_ENDPOINT is
diagnosable without network access).

Skip cascade. Skips on AzdClient==nil, local.environment-selected,
local.azure-yaml, local.agent-service-detected, and when
state.HasToolboxes is false. Deliberately NOT gated on remote.auth /
remote.foundry-endpoint — a down Foundry must not poison a local env
diagnostic.

Classification.
  - All endpoints set → Pass with matchedCount.
  - One or more missing → Fail with the missing toolbox names + env
    var keys in the Message, Suggestion points at `azd provision`
    (the canonical fix) or `azd env set` as the manual override,
    Details["missingToolboxes"] carries a structured list (name,
    service, envVar) for JSON consumers.
  - Env lookup transport error → Fail (NOT Skip). Divergent from
    C13's model-deployments which Skips on probe error, because env
    lookup is local; a transport failure here means the user's azd
    config / extension is broken and a Skip would silently swallow
    that signal. Suggestion points at `azd env list` / `azd env
    get-values`.
  - Empty / whitespace-only value → treated as missing (matches
    detectMissingVars semantics in nextstep/state.go).

Convention. TOOLBOX_<NORMALIZED_NAME>_MCP_ENDPOINT, with name
upper-cased and `-` / `.` / ` ` mapped to `_`. Matches the
hosted-toolbox Bicep sample output names. The prefix and suffix are
pinned in code (not derived from the env) so the Fail message can
name the exact env var the user must grep their Bicep template
for.

Dedup. classifyToolboxEndpoints dedupes on the canonical env key
because the C2 manifest walker dedupes on (ServiceName, Name) — the
same toolbox referenced by two services would otherwise produce two
env lookups and two missing-list entries. Exposed
`dedupToolboxKeys` for callers (renderer / future telemetry) that
want the expected-key list up front; the classifier does its own
inline dedup so it does not depend on this helper.

Test seam. `Dependencies.lookupToolboxEnv toolboxEnvLookupFn`
matches the established seam pattern (probeAuth,
probeFoundryEndpoint, probeModelDeployments). Production wiring
leaves it nil; the check binds `makeRealToolboxEnvLookup(deps
.AzdClient)` on first call, which calls
`client.Environment().GetValue` — the canonical one-key env reader
used by service_target_agent.go and checks_rbac.go.

Tests (15). Skip cascade (azdClient nil + 3 priors), not-gated-on-
remote-priors invariant, state emptiness (no toolboxes / nil
state), 3 classifier paths (all-set / partial / all-missing),
whitespace-as-missing, transport-error-is-Fail, cross-service
dedup, normalizeToolboxName table (8 cases), toolboxEndpointKey
roundtrip, dedupToolboxKeys table, factory-shape pin.

Wired into NewLocalChecks (now 8 entries); local-checks pin test
updated. NewRemoteChecks unchanged (still 6 entries).

Preflight clean: gofmt, vet, build, full extension test suite green
(cmd 16.7s, doctor 2.9s, nextstep 6.7s, etc.), golangci-lint 0
issues, cspell 0 issues on production files.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Issue: Doctor previously had no way to verify that Foundry connections
referenced by agent manifests (e.g. `bing-grounding`, key-vault-backed
auth connections) actually exist on the project. Failure surfaced
later at invoke time as 401/403 from the upstream tool, with no clear
path back to the missing connection.

Approach: New `remote.connections` doctor check that enumerates manifest
ConnectionResource entries (already discovered by the C2 manifest
walker into `state.Connections`), calls
`FoundryProjectsClient.GetAllConnections(ctx)` on the active project,
and reports any manifest-declared connection that isn't present on the
project as a Fail. Missing-entry rendering format:
`<name> [<detail>] (service <svc>)` when Detail is non-empty, falling
back to `<name> (service <svc>)` to avoid a bare `[]`. Detail
typically renders as `<Category> | <Target>` for connections
(types.go:183-189; manifest.go:152-162).

Classification: REMOTE check (Remote: true). Calls the Foundry API.
Skip cascade mirrors C13 (`remote.model-deployments`):

  AzdClient → environment-selected → azure.yaml / agent-service-detected
  → remote.auth → remote.foundry-endpoint → !state.HasConnections
  → unparsable project ID

Probe error → Skip (matching C13's pattern — distinguishing transport
failure from "missing connection" requires a successful round-trip).
10s probe timeout.

Wiring: `newCheckConnections(deps)` added as the 7th and final entry
in `NewRemoteChecks` (after C12 agent-identity-roles, C13 model-
deployments). Pin test `TestNewRemoteChecks_HasAuthFoundryEndpointRBAC
AgentStatusIdentityRolesModelDeployments` renamed to
`...ModelDeploymentsConnections`, Len bumped 6 → 7, 4 new
index-6 assertions for ID / Title / Description / Remote.

Test seam: New `probeFoundryConnections` field appended to
`Dependencies` matching the existing seam pattern from
`probeModelDeployments` (C13). Production wiring uses
`realProbeFoundryConnections` which constructs a credential via
`azidentity.NewAzureDeveloperCLICredential` (matching the rest of
the extension per `agent_context.go:101-109`).

New helper: `parseAccountProjectFromProjectID(projectID) (account,
project, err)` — sibling of C13's `parseAccountFromProjectID`. Returns
two segments instead of the four C13 needs; kept separate to avoid
churning C13's signature for a single new caller. Both case-insensitive
on segment markers. Follow-up: consolidate into a single parser when
a third caller appears.

Tests: `checks_connections_test.go` — 13 tests mirroring C13 patterns:

  - Skip cascade table (5 rows: AzdClient, environment, azure.yaml /
    agent-service-detected, auth, foundry-endpoint).
  - State emptiness (HasConnections false → Skip).
  - Project ID unset → Skip.
  - Project ID unparsable → Skip.
  - Probe error → Skip.
  - All-match → Pass.
  - Partial mismatch → Fail with missing names + service tags.
  - All-missing → Fail.
  - Empty-Detail rendering omits `[]`.
  - Parser table (5 cases: canonical, mixed case, missing project,
    missing account, garbage).
  - Factory shape pin (Remote: true, ID, Title, Description).

Preflight:
  - gofmt -s -w .  (clean)
  - go vet ./... (clean)
  - go build ./... (clean)
  - go test ./... -count=1 (all packages pass; doctor 5.474s)
  - golangci-lint run ./internal/cmd/doctor/... (0 issues)
  - cspell lint "internal/cmd/doctor/*.go" (14 files, 0 issues)
  - copyright header verified on both new files

Files:
  - internal/cmd/doctor/checks_connections.go        (NEW, +332)
  - internal/cmd/doctor/checks_connections_test.go   (NEW, +326)
  - internal/cmd/doctor/checks_local.go              (probeFoundryConnections seam +5)
  - internal/cmd/doctor/checks_remote.go             (wire +newCheckConnections +1)
  - internal/cmd/doctor/checks_remote_test.go        (pin test 6 → 7, +14)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Two reviewer-consensus findings from the batched code review of
commits 385b147 (C13 remote.model-deployments), 87e1dcc (C14
local.toolboxes), and 1b3143f (C15 remote.connections):

Fix 1 (MEDIUM, Opus + GPT-5.5): toolbox env-key normalizer
divergence.

  C14's `normalizeToolboxName` only mapped `-`, `.`, and ` ` to `_`
  rune-by-rune, while the production helpers
  `init.go:toolboxMCPEndpointEnvKey` (manifest injection) and
  `listen.go:toolboxMCPEndpointEnvKey` (runtime env write) both use
  the regex `[^A-Z0-9]+` -> `_` (run-collapsing, all
  non-alphanumerics). The two algorithms agreed only on the subset of
  inputs the test table exercised (`web-search-tools`,
  `my.toolbox.v2`, `my toolbox`, ...) and diverged on inputs like
  `my--tool`, `my+tool`, `my:tool`, `my(tool)`, `my\ttool`. A user
  with such a toolbox name would see the doctor flag a missing
  endpoint under a key (`TOOLBOX_MY__TOOL_MCP_ENDPOINT`,
  `TOOLBOX_MY+TOOL_MCP_ENDPOINT`, ...) that nothing in the system
  ever writes.

  Resolution: hoist the canonical helper into a shared
  `internal/pkg/envkey` package so the producing and diagnostic
  sides cannot drift again.

    - new internal/pkg/envkey/envkey.go         -- `ToolboxMCPEndpoint`
    - new internal/pkg/envkey/envkey_test.go    -- 13 cases incl.
                                                   double-hyphen run,
                                                   `+`, `:`, `/`, tab,
                                                   parens, empty
    - internal/cmd/listen.go                    -- drop local helper,
                                                   drop `regexp` import,
                                                   route through envkey
    - internal/cmd/init.go                      -- route through envkey
    - internal/cmd/init_test.go                 -- delete duplicated
                                                   table (covered by
                                                   envkey package test)
    - internal/cmd/doctor/checks_toolboxes.go   -- drop local
                                                   normalizeToolboxName /
                                                   toolboxEndpointKey
                                                   /toolbox{Prefix,
                                                   Suffix}, route 2
                                                   callsites through
                                                   envkey
    - internal/cmd/doctor/checks_toolboxes_test.go
                                                -- replace the
                                                   normalize-table test
                                                   with a thin pin test
                                                   verifying the
                                                   doctor's renderer
                                                   helper routes
                                                   through envkey
    - cspell.yaml                               -- allowlist `envkey`

Fix 2 (MEDIUM, Sonnet): assembler errors silently swallowed.

  C13/C14/C15 all used `state, _ := assembler(...)` and reported a
  `Skip` with "no X declared in any service's agent.manifest.yaml"
  whenever `state == nil || !state.HasX`. The existing pattern at
  `checks_manual_env.go:95-109` instead captures `errs` and Fails
  with the actual cause when `state == nil` (defensive against a
  future contract change where AssembleState may return a nil
  state with a populated errs slice).

  Resolution: mirror the established pattern in all three new
  checks. The Skip for `!state.HasX` is preserved; only the
  `state == nil` branch becomes a Fail surfacing
  `errs[0].Error()`.

    - checks_model_deployments.go     -- Fail-on-nil with cause
    - checks_toolboxes.go             -- Fail-on-nil with cause
    - checks_connections.go           -- Fail-on-nil with cause
    - checks_model_deployments_test.go
                                       -- new test: nil state
                                          surfaces errs[0]
    - checks_toolboxes_test.go        -- update existing
                                          `SkipsWhenAssemblerReturnsNil`
                                          to `FailsWhenAssembler
                                          ReturnsNilState` plus new
                                          test asserting errs[0]
                                          surfaces in the Fail message
    - checks_connections_test.go      -- new test: nil state
                                          surfaces errs[0]

Not addressed (deferred):

  LOW (GPT-5.5): `parseAccountProjectFromProjectID` (C15) accepts
  partial paths; `parseAccountFromProjectID` (C13) does not. Opus
  reviewed and called the dual-parser duplication "defensible for
  two callers"; commit 1b3143f's message already notes the
  follow-up to consolidate when a third caller appears.

Preflight:
  - gofmt -s -w .  (clean)
  - go build ./... (clean)
  - go vet ./...   (clean)
  - go test ./... -count=1 (all packages pass; envkey 1.837s,
                            doctor 6.276s, cmd 14.401s)
  - golangci-lint run ./... (0 issues)
  - cspell lint <new+touched> (17 files, 0 issues)

Files (10):
  - internal/pkg/envkey/envkey.go              (NEW)
  - internal/pkg/envkey/envkey_test.go         (NEW)
  - internal/cmd/listen.go                     (MOD)
  - internal/cmd/init.go                       (MOD)
  - internal/cmd/init_test.go                  (MOD)
  - internal/cmd/doctor/checks_toolboxes.go    (MOD)
  - internal/cmd/doctor/checks_toolboxes_test.go (MOD)
  - internal/cmd/doctor/checks_model_deployments.go (MOD)
  - internal/cmd/doctor/checks_model_deployments_test.go (MOD)
  - internal/cmd/doctor/checks_connections.go  (MOD)
  - internal/cmd/doctor/checks_connections_test.go (MOD)
  - cspell.yaml                                (MOD)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When Doctor/post-deploy guidance has no cached OpenAPI-derived sample payload
but a service README is present, don't suggest a concrete protocol-generic
payload that may fail for that sample's schema. Emit the README pointer first,
then an invoke command with an explicit '<payload>' placeholder.

Cached OpenAPI payloads still produce runnable invoke commands, and services
without a README still get the protocol-generic fallback payload with a generic
label.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Active/idle show output should stay an inspection view of the hosted agent
resource. Remove the active-state invoke suggestion from ResolveAfterShow and
avoid attaching next_step in show JSON when the agent is already healthy.

Non-active states keep actionable guidance:
- creating -> monitor --type system --follow
- failed/empty -> monitor --follow
- deleting/deleted -> azd deploy
- unknown -> azd ai agent show <service>

This also avoids state/OpenAPI assembly work for active show output because no
active-state guidance is rendered.

Validation:
- go test ./internal/cmd ./internal/cmd/nextstep -run 'TestResolveAfterShow|TestResolveNextStepFromStatus|TestShowResultJSON|TestPrintAgentVersionJSON|TestPrintAgentVersionTable|TestResolveAfterInvoke_Success|TestResolveAfterInit_UnresolvedPlaceholders|TestResolveAfterRun' -count=1
- go test ./internal/cmd/... -count=1
- go vet ./internal/cmd/...
- golangci-lint run ./internal/cmd/...
- cspell lint touched show/nextstep files

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Stream text-mode doctor output by observing each finalized check result, while keeping JSON output buffered and unchanged. Split the text formatter into header/check/footer pieces so the streaming path preserves the existing report shape.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove unnecessary explanatory comments across the Azure AI Agents extension and shorten the remaining comments to focus on non-obvious contracts and behavior. Also drops the unused .tmp/ ignore entry.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove the reserved doctor --output flag path, apply go fix modernizations, and reword cspell-sensitive text.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Restore comments that predated this PR in existing files so the cleanup only affects new or changed comment surfaces.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add project-root validation for service-relative file probes and reads, including symlink-aware containment checks and root-service handling.\n\nCo-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove the unused nextstep auth probe option and state fields that were added by the PR but never consumed by the resolver or doctor wiring.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Route PR-added nextstep stdout emission through one cmd helper so TTY gating stays consistent across init, invoke, run, show, and doctor text output.

This intentionally applies the nextstep call-site TTY contract to the PR-added init next-step blocks as well, keeping redirected output free of human-only guidance.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The remote.auth doctor check surfaced the raw user principal name in
both the Message string and the structured Details map regardless of
the --unredacted flag, in contrast with the rest of the doctor checks
(checks_rbac.go, checks_agent_identity_roles.go) which already gate
identity values on opts.Unredacted.

This change threads Options into the auth check function and adds two
small helpers that reuse the existing redactedPlaceholder constant:

  - redactUPN(upn, unredacted) returns the value to surface in Message
    text: raw when --unredacted, the shared <redacted> placeholder when
    a UPN was discovered but should be scrubbed, and empty when none
    was found so composeAuthMessage cleanly drops the prefix.
  - authDetails(upn, minutes, unredacted) builds the Details map and
    omits the "upn" key entirely unless --unredacted is set, so
    machine consumers do not see the raw value by default.

PASS, WARN, and expired-FAIL branches now compose their messages from
the redacted display value. Existing tests that asserted the raw UPN
were updated to pass Options{Unredacted: true}; new table tests cover
the default-redacted and --unredacted contracts on every branch, the
empty-UPN drop, and both helpers in isolation.

Resolves PR Azure#8198 review comment from @jongio.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
TestErrorCodeWireValues pins the lowercase JSON wire values of every
exported enum the nextstep package consumes from the Agents API, but
the AgentVersionStatus map was missing the "idle" entry. The idle
status is actively read at show.go:207 and nextstep/resolver.go:248,
so silent drift on that one literal would regress the show command's
idle branch and the resolver's deployment-pending hint without any
test failure.

Add the missing "idle": string(AgentVersionIdle) case.

Resolves PR Azure#8198 Copilot review comments (ids 3246075889, 3246075800).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Restores the .tmp/ entry in .gitignore (accidentally trimmed during a
mid-rebase comment-cleanup commit) and removes the design-spec mirror
and PR comment scratch files that were swept into the index by an
overbroad `git add -A` during conflict resolution. The .tmp/ folder
remains on disk locally as documented scratch space, but must not be
tracked by git.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@antriksh30 antriksh30 force-pushed the antrikshjain/next-step-implementation branch from 6312729 to da2efb5 Compare May 20, 2026 06:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Guide developers to the right next step after every azd ai agent command and add azd ai agent doctor

3 participants