Feature 1.2 — Health Check & Lifecycle State on Dashboard¶
Ready-to-use instruction file for GitHub Copilot Agent / Claude Code.

Project: AgentLens (github.com/PawelHaracz/Agentlens)
Tier: 1 — MUST HAVE for launch
Effort: M (3–5 days)
Goal¶
Transform AgentLens from a "glorified JSON store" into a live registry that shows the real-time state of every catalog entry. Platform engineers must be able to glance at the dashboard and immediately see: which agents are alive, which are degraded, which are gone, and how fast they respond.
Why this matters¶
89% of organizations running AI agents in production have already deployed observability for them. A registry that doesn't reflect runtime state of its entries is a non-starter for those teams. Competitors (AGNTCY, Apicurio, Nacos) all surface health state. Without this feature the launch on HN/Reddit will be torn apart with "looks nice but how do I know if anything actually works".
Scope (in)¶
- Lifecycle state machine on `CatalogEntry`
- Periodic health probe worker (already partially configurable — extend it)
- State transitions persisted in storage (both SQLite and PostgreSQL)
- `status`, `lastSeen`, `latencyMs` fields exposed via REST API
- Dashboard: colored badges, last-seen relative timestamp, latency display
- Manual "probe now" action from the dashboard (admin/editor only)
Scope (out — explicitly NOT this feature)¶
- Alerting / notifications (email, Slack, webhooks) — backlog
- Historical health charts / time-series — backlog
- SLA tracking, uptime % — backlog
- Custom probe scripts per agent — backlog
- Probing the inside of A2A skills (only the transport endpoint is probed)
Domain model changes¶
Lifecycle states¶
registered → active → degraded → offline → deprecated
| State | Meaning | Trigger |
|---|---|---|
| `registered` | Entry created but not yet probed | Initial state on `POST /api/v1/catalog` |
| `active` | Last probe succeeded within `healthyThreshold` | Probe → 2xx within timeout |
| `degraded` | Last probe slow (latency > `degradedLatencyMs`) OR 1 failure within window | Probe → 2xx but slow, OR single failure |
| `offline` | `failureThreshold` consecutive failures | Probe → error / 5xx / timeout `failureThreshold` times in a row |
| `deprecated` | Manually set by admin via API/UI | `PATCH /api/v1/catalog/{id}/lifecycle` |

`deprecated` is a terminal manual state — the probe worker must skip deprecated entries entirely.
CatalogEntry additions¶
Add to `internal/model/catalog_entry.go` (do not rename to `Agent` — see Known Traps):

```go
type LifecycleState string

const (
	LifecycleRegistered LifecycleState = "registered"
	LifecycleActive     LifecycleState = "active"
	LifecycleDegraded   LifecycleState = "degraded"
	LifecycleOffline    LifecycleState = "offline"
	LifecycleDeprecated LifecycleState = "deprecated"
)

type Health struct {
	State               LifecycleState
	LastProbedAt        *time.Time
	LastSuccessAt       *time.Time
	LastError           string // last non-empty error message (truncated to 512 chars)
	LatencyMs           int64  // latency of the last successful probe
	ConsecutiveFailures int
}
```
The existing `Validity` struct (with `From`, `To`, `LastSeen`, `IsActiveAt()`) stays untouched. `Health.LastSuccessAt` mirrors into `Validity.LastSeen` on a successful probe so `IsActiveAt()` keeps working — but the source of truth for "is it up right now" is `Health.State`.
Storage layer¶
Migration¶
Create a new versioned migration: `internal/store/migrations/0NN_health_state.sql` (use the next free sequence number — check the existing migrations folder).

```sql
-- +migrate Up
ALTER TABLE catalog_entries ADD COLUMN health_state TEXT NOT NULL DEFAULT 'registered';
ALTER TABLE catalog_entries ADD COLUMN health_last_probed_at TIMESTAMP NULL;
ALTER TABLE catalog_entries ADD COLUMN health_last_success_at TIMESTAMP NULL;
ALTER TABLE catalog_entries ADD COLUMN health_last_error TEXT NOT NULL DEFAULT '';
ALTER TABLE catalog_entries ADD COLUMN health_latency_ms INTEGER NOT NULL DEFAULT 0;
ALTER TABLE catalog_entries ADD COLUMN health_consecutive_failures INTEGER NOT NULL DEFAULT 0;
CREATE INDEX idx_catalog_entries_health_state ON catalog_entries(health_state);
CREATE INDEX idx_catalog_entries_health_last_probed_at ON catalog_entries(health_last_probed_at);

-- +migrate Down
DROP INDEX IF EXISTS idx_catalog_entries_health_last_probed_at;
DROP INDEX IF EXISTS idx_catalog_entries_health_state;
ALTER TABLE catalog_entries DROP COLUMN health_consecutive_failures;
ALTER TABLE catalog_entries DROP COLUMN health_latency_ms;
ALTER TABLE catalog_entries DROP COLUMN health_last_error;
ALTER TABLE catalog_entries DROP COLUMN health_last_success_at;
ALTER TABLE catalog_entries DROP COLUMN health_last_probed_at;
ALTER TABLE catalog_entries DROP COLUMN health_state;
```
Both dialects must compile and run cleanly. PostgreSQL accepts the same DDL above; if anything diverges (e.g. BOOLEAN vs INTEGER truthiness), gate it via the existing dialect helper used by the project.
Store interface additions¶
```go
// internal/store/store.go
type Store interface {
	// ...existing...
	UpdateHealth(ctx context.Context, entryID string, h model.Health) error
	ListForProbing(ctx context.Context, olderThan time.Time, limit int) ([]model.CatalogEntry, error)
	SetLifecycle(ctx context.Context, entryID string, state model.LifecycleState) error
}
```
`ListForProbing` returns entries where `(health_last_probed_at IS NULL OR health_last_probed_at < olderThan) AND health_state != 'deprecated'` (note the parentheses — the `OR` must not swallow the deprecated check), ordered by `health_last_probed_at` with `NULLS FIRST`, capped by `limit`. This gives the worker a fair, deterministic batch.
Implement for both SQLite and PostgreSQL backends.
Health probe worker¶
Create `internal/service/health/prober.go`:

```go
type Prober struct {
	store            store.Store
	httpClient       *http.Client
	interval         time.Duration // default 30s
	timeout          time.Duration // default 5s
	concurrency      int           // default 8
	degradedLatency  time.Duration // default 1500ms
	failureThreshold int           // default 3
	logger           *slog.Logger
}

func (p *Prober) Run(ctx context.Context) error
func (p *Prober) ProbeOnce(ctx context.Context, entry model.CatalogEntry) (model.Health, error)
```
Probe algorithm (per entry)¶
- Resolve probe URL:
  - For A2A entries: `supportedInterfaces[0].url` if present, otherwise the legacy `url` field
  - For MCP entries: the entry's primary endpoint (existing field)
- If no URL → set state to `offline`, error `"no probeable endpoint"`, do not perform an HTTP request.
- HTTP `GET <probeURL>` with `p.timeout`. Respect `https_skip_verify` if env says so (existing config).
- Measure wall-clock latency.
- Decide the new state from `(httpResult, latency, currentHealth)`:
  - HTTP error / non-2xx → `failures++`. If `failures >= failureThreshold` → `offline`. Else → `degraded`.
  - 2xx and `latency > degradedLatency` → `degraded`, `failures = 0`.
  - 2xx and `latency <= degradedLatency` → `active`, `failures = 0`.
- Always update `LastProbedAt = now`. On success update `LastSuccessAt = now` and write through to `Validity.LastSeen`.
- Persist via `store.UpdateHealth`.
Worker loop¶
- Tick every `interval`. On each tick fetch `ListForProbing(now - interval, concurrency * 4)`.
- Run probes in a worker pool with `concurrency` goroutines.
- The worker loop must respect `ctx.Done()` for clean shutdown (the microkernel `Stop()` lifecycle).
- Register the prober as a microkernel plugin in `internal/plugins/health/plugin.go` so it follows `Register → Init → Start → Stop`.
Configuration¶
Reuse the existing config struct in `internal/config/config.go`. Add:

```go
type HealthConfig struct {
	Enabled          bool          `yaml:"enabled" env:"AGENTLENS_HEALTH_ENABLED" default:"true"`
	Interval         time.Duration `yaml:"interval" env:"AGENTLENS_HEALTH_INTERVAL" default:"30s"`
	Timeout          time.Duration `yaml:"timeout" env:"AGENTLENS_HEALTH_TIMEOUT" default:"5s"`
	Concurrency      int           `yaml:"concurrency" env:"AGENTLENS_HEALTH_CONCURRENCY" default:"8"`
	DegradedLatency  time.Duration `yaml:"degradedLatency" env:"AGENTLENS_HEALTH_DEGRADED_LATENCY" default:"1500ms"`
	FailureThreshold int           `yaml:"failureThreshold" env:"AGENTLENS_HEALTH_FAILURE_THRESHOLD" default:"3"`
}
```
REST API¶
`GET /api/v1/catalog` and `GET /api/v1/catalog/{id}`¶
Extend the response DTO (do not leak the storage struct directly):
```json
{
  "id": "...",
  "displayName": "...",
  "categories": ["A2A"],
  "health": {
    "state": "active",
    "lastProbedAt": "2026-04-07T11:42:13Z",
    "lastSuccessAt": "2026-04-07T11:42:13Z",
    "latencyMs": 142,
    "consecutiveFailures": 0,
    "lastError": ""
  }
}
```
For backward compatibility also expose a flat `status` field (string, same value as `health.state`) — the feature list explicitly requires this.
`PATCH /api/v1/catalog/{id}/lifecycle`¶
Body: `{ "state": "deprecated" }`.

- Allowed values: `deprecated`, `active` (un-deprecate).
- Permission: `editor` or `admin` (existing RBAC). `viewer` → 403.
- Returns the updated entry DTO.
- Audit log entry written via the existing audit hook.
`POST /api/v1/catalog/{id}/probe`¶
Triggers an immediate single-shot probe for one entry, bypassing the worker queue.

- Permission: `editor` or `admin`.
- Returns the resulting `health` object.
- Rate limit: max 1 call per entry per 5s (an in-memory token bucket is fine).
Filtering¶
`GET /api/v1/catalog?state=active,degraded` — comma-separated allow-list. Unknown states → 400.
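The allow-list check is small enough to sketch directly (the helper name and return shape are assumptions):

```go
package main

import "strings"

// validStates mirrors the five lifecycle values.
var validStates = map[string]bool{
	"registered": true, "active": true, "degraded": true,
	"offline": true, "deprecated": true,
}

// parseStateFilter splits the comma-separated ?state= value. A false second
// return means an unknown state was present and the handler should respond 400.
func parseStateFilter(raw string) ([]string, bool) {
	if raw == "" {
		return nil, true // no filter requested
	}
	parts := strings.Split(raw, ",")
	out := make([]string, 0, len(parts))
	for _, p := range parts {
		p = strings.TrimSpace(p)
		if !validStates[p] {
			return nil, false // unknown state → 400
		}
		out = append(out, p)
	}
	return out, true
}
```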
Frontend (React 18 + Tailwind + shadcn/ui)¶
Hard rule: raw HTML elements styled with Tailwind are not acceptable for new components. Use shadcn/ui primitives (`Badge`, `Button`, `Tooltip`, `DropdownMenu`, `Alert`). See Known Traps.
Catalog list view (`web/src/routes/catalog/list.tsx`)¶
Add a Status column between Provider and Categories.
Use a shadcn `Badge` with semantic color via `variant`:
| State | Variant | Color suggestion | Label |
|---|---|---|---|
| `active` | `default` (or `success` if defined) | green | "Active" |
| `degraded` | `warning` | amber | "Degraded" |
| `offline` | `destructive` | red | "Offline" |
| `registered` | `secondary` | gray | "Pending" |
| `deprecated` | `outline` | slate | "Deprecated" |
Next to the badge:

- Latency: small muted text `142 ms` if `state ∈ {active, degraded}`
- Last seen: relative time — `5s ago`, `2m ago`, `3h ago`. On hover show the absolute UTC timestamp in a tooltip.
Add a filter bar above the table — a multi-select shadcn `DropdownMenu` with the five states. Selected states are pushed to the URL as `?state=...` and used in the API query.
Detail drawer / view¶
New section "Health" with a small grid:
```
State            Active
Last probed      2s ago (2026-04-07 11:42:15 UTC)
Last successful  2s ago
Latency          142 ms
Failures (run)   0
Last error       —
```
Action buttons in the section header (admin/editor only — gate via the existing `useCurrentUser().permissions`):

- Probe now → `POST /api/v1/catalog/{id}/probe`, optimistic loading state, show toast on result
- Deprecate / Un-deprecate → `PATCH /api/v1/catalog/{id}/lifecycle`, confirmation dialog before deprecating
Empty / loading states¶
- Loading: skeleton bar in the badge column
- No probe yet: badge "Pending" with tooltip "Will be probed within the next interval"
- All filter values exclude every entry: empty state message "No entries match the selected status filter — clear filters" with a Clear button
Localization¶
The frontend uses no i18n yet. Keep all strings in English as in the rest of the codebase. Do not introduce a translation library.
Tests¶
Backend¶
- Unit: `prober_test.go` covering all state-transition cases. Use `httptest.Server` for the probed agent. Test cases (minimum):
  - Fresh entry → 200 fast → `active`
  - Active → 200 slow → `degraded`
  - Active → 500 once → `degraded`, failures=1
  - Degraded → 500 twice more → `offline`, failures=3
  - Offline → 200 fast → `active`, failures=0
  - Deprecated entry passed in → returns early, no HTTP call (use a recording transport that fails the test if invoked)
  - No URL → `offline`, no HTTP call
  - Probe times out → counted as failure
- Store: for both SQLite and PostgreSQL backends, test `UpdateHealth` and `ListForProbing` — including the `NULLS FIRST` ordering and the `deprecated` exclusion. Use the existing store test harness if it exists; otherwise write a small one.
- API: handler tests for `PATCH /lifecycle`, `POST /probe`, the new `state` filter, and 403 for viewers.
E2E (Playwright)¶
Extend an existing spec (do not create a new file just for this):
- Seed a catalog entry pointing to a stub server controlled by the test.
- Wait for the badge to flip from "Pending" → "Active".
- Stop the stub server, wait for the badge to flip to "Offline" (use a short interval override via env in the test runner).
- Click "Probe now" while still offline → toast appears with the failure reason.
- Click "Deprecate" → confirm dialog → badge becomes "Deprecated", probe worker stops touching the entry (verify by checking that `lastProbedAt` does not advance over the next interval).
Acceptance criteria¶
A reviewer must be able to verify each of these without reading source code.
- ✅ Fresh `docker compose up` shows demo agents flipping from "Pending" to "Active" within 30s
- ✅ `curl -s http://localhost:8080/api/v1/catalog | jq '.[0].health.state'` returns one of the 5 lifecycle values
- ✅ Killing a demo agent flips it to "Degraded" within one interval and to "Offline" after 3 intervals
- ✅ Bringing it back flips it to "Active" on the next probe
- ✅ `PATCH /api/v1/catalog/{id}/lifecycle` with `{"state":"deprecated"}` works for admin, returns 403 for viewer
- ✅ Deprecated entries are visibly distinct in the dashboard and are not probed
- ✅ `?state=active,degraded` filter works in both API and dashboard URL
- ✅ Both SQLite and PostgreSQL backends pass the same store-level test suite
- ✅ Worker shuts down cleanly on SIGTERM (no goroutine leaks — verify with `goleak` if available)
- ✅ All new strings on the dashboard come from shadcn/ui components, not raw `<div className="bg-red-500">`
Known traps (read before writing code)¶
These are the recurring AI-coding-agent mistakes in this codebase. Do not commit any code that violates them.
- ❌ Do not rename `CatalogEntry` to `Agent`. This rename has been undone three times. The catalog stores A2A agents, MCP servers, and (future) A2UI surfaces — `Agent` is wrong for two of those three.
- ❌ Do not add a top-level `LastSeen` field to `CatalogEntry`. It already lives inside `Validity`. The new `Health.LastSuccessAt` mirrors into `Validity.LastSeen` on success.
- ❌ Do not add `Tags`, `Team`, `Namespace` fields. These were superseded by the archetype model (`categories`, `Provider`, `Metadata map[string]string`).
- ❌ Do not store JWTs / session tokens in `localStorage`. Use the existing httpOnly cookie auth flow.
- ❌ Do not write raw `<button>`, `<input>`, `<table>` styled with Tailwind. Use shadcn/ui primitives.
- ❌ Do not break the dual-dialect contract. Every migration and every `Store` query must run on both SQLite and PostgreSQL. Run the full test suite against both before opening the PR.
- ❌ Do not gate this feature behind the enterprise license. Health monitoring is OSS Core. The license-gated features remain SSO, RBAC federation, audit log federation, multi-region.
- ❌ Do not bypass the microkernel lifecycle. The prober is a plugin: `Register → Init → Start → Stop`. Do not start a goroutine from `init()` or from `main`.
- ❌ Do not block the request handler on a probe. The `POST /probe` endpoint runs the probe synchronously but with the `p.timeout` ceiling, returning a structured error if exceeded — never an open-ended wait.
- ❌ Do not invent a new `pkg/healthcheck` package outside `internal/`. The project keeps everything under `internal/` for now to avoid premature API commitments.
Execution order (suggested commit sequence)¶
- Migration + model fields (compiles, no behavior)
- `Store` interface methods + SQLite impl + PostgreSQL impl + store tests
- Prober service + unit tests (no plugin wiring yet)
- Microkernel plugin wiring + config + integration into `cmd/agentlens`
- REST API: response DTO extension + `state` filter
- REST API: `PATCH /lifecycle` + `POST /probe` + audit hooks
- Frontend: badge + column + filter
- Frontend: detail view "Health" section + actions
- E2E test extension
- README snippet update — one paragraph on health monitoring under Features
Each step should be a separate commit so the PR is reviewable.
Out-of-band notes¶
- `degraded` is intentionally a "soft" state that flips on a single failure. This gives operators a fast warning signal. `offline` requires `failureThreshold` failures to avoid badge flapping during transient network blips.
- The `failureThreshold` default of 3 with a 30s interval means an entry takes ~90s to be marked offline. This is intentionally conservative — the launch story is "trustworthy state", not "instant alarm".
- `lastError` is truncated to 512 chars to keep the row small in SQLite. Full error details are not persisted on purpose; that is observability's job, not the registry's.
- Future Tier 2 work will add OpenTelemetry spans for each probe (feature 2.1) — keep the prober's hot path easy to instrument by passing `ctx` through everything.