📊 Runtime dashboard

Automatic skill testing, live.

This is the control surface for DriftLoom’s testing engine: coverage, freshness, backlog, regressions, and the latest runtime evidence across the baseline safety lane and the deeper follow-on functionality lane.

Dashboard read: healthy runtime coverage means more than a large tested count. Look for low live failure pressure, a shrinking backlog, and functionality coverage beyond the baseline lane, rather than treating raw volume as proof.

Coverage and queue health

Backlog read: a trustworthy runtime lane does not just accumulate tests — it keeps the baseline backlog from ballooning and keeps functionality backlog from quietly starving the deeper evidence tier.
Eligible catalog
5126 · community source-scanned skills that can enter the automatic runtime program
Baseline coverage
209 · 4% of eligible skills have baseline safety evidence
Functionality coverage
194 · 94% of baseline-cleared skills also have deeper follow-on functionality evidence
Baseline backlog
4917 · eligible skills still waiting for their first baseline safety run
Functionality backlog
12 · baseline-passed skills still waiting for deeper follow-on checks
Latest touch
6m ago · most recent run landed at 2026-03-16 01:15 UTC
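
The card figures can be cross-checked with a little arithmetic. The sketch below is illustrative only: the variable names and the `pct` helper are hypothetical, and the choice of denominators (eligible skills for baseline coverage, baseline-passed skills for functionality coverage) is an assumption that happens to reproduce the numbers shown on this page.

```python
def pct(part: int, whole: int) -> int:
    """Whole-number percentage, rounded to the nearest integer."""
    return round(100 * part / whole) if whole else 0


# Figures copied from the dashboard cards and status chips on this page.
eligible = 5126             # eligible catalog
baseline_tested = 209       # skills with baseline safety evidence
baseline_passed = 206       # "baseline passed" chip
functionality_tested = 194  # skills with follow-on functionality evidence

# Assumed denominators; they reproduce the displayed 4% and 94%.
baseline_coverage = pct(baseline_tested, eligible)
functionality_coverage = pct(functionality_tested, baseline_passed)

# Backlogs are what is left over in each lane.
baseline_backlog = eligible - baseline_tested
functionality_backlog = baseline_passed - functionality_tested

print(baseline_coverage, functionality_coverage)
print(baseline_backlog, functionality_backlog)
```

Under these assumptions the two backlog values come out to 4917 and 12, matching the cards above.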

Coverage by depth

Depth read: baseline-only coverage proves the generic safety lane ran, but stronger confidence comes from follow-on functionality checks and fixture-backed evidence because those layers test more of the skill’s claimed shape.
Untested: 4917 / 5126 · 96%
Baseline checks only: 15 / 5126 · 0%
Follow-on functionality checks: 177 / 5126 · 3%
Fixture / example-backed proof: 17 / 5126 · 0%

Why these buckets matter

The goal is not just “some tests happened.” It is depth you can reason about. “Baseline checks only” means the skill cleared the generic safety lane. “Follow-on functionality checks” means it also survived repo-aware checks. “Fixture / example-backed proof” means the repo shipped richer evidence, such as test corpora, eval assets, or bundled examples, that the worker could validate directly.
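
One way to read the buckets is as a mutually exclusive classification where the deepest available evidence wins. That reading is an assumption, though it is consistent with the four bucket counts summing to the eligible catalog. The `SkillEvidence` type and its field names below are hypothetical, not DriftLoom's actual schema.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class SkillEvidence:
    # Hypothetical flags describing the evidence recorded for one skill.
    baseline_run: bool
    functionality_run: bool
    fixture_backed: bool


def depth_bucket(e: SkillEvidence) -> str:
    """Assign a skill to exactly one depth bucket, deepest evidence first."""
    if not e.baseline_run:
        return "Untested"
    if e.fixture_backed:
        return "Fixture / example-backed proof"
    if e.functionality_run:
        return "Follow-on functionality checks"
    return "Baseline checks only"


# Tiny made-up sample, only to show the shape of the tally.
sample = [
    SkillEvidence(False, False, False),
    SkillEvidence(True, False, False),
    SkillEvidence(True, True, False),
    SkillEvidence(True, True, True),
]
tally = Counter(depth_bucket(e) for e in sample)
print(tally)
```

Because each skill lands in exactly one bucket, the bucket counts always add up to the number of skills classified.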

baseline passed: 206 · functionality passed: 150 · baseline failing now: 3 · functionality failing now: 44

Freshness

Freshness read: a healthy runtime lane is not just active today — it also keeps stale baseline and stale functionality counts low enough that older receipts do not dominate the picture.
Skills touched in 24h
112 · distinct skills that received any runtime run in the last day
Fresh baseline rows
102 · latest baseline-v3 rows produced in the last 24h
Fresh functionality rows
102 · latest functionality-v2 rows produced in the last 24h
Stale baseline rows
0 · baseline rows older than 7 days
Stale functionality rows
0 · functionality-v2 rows older than 7 days
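
The fresh and stale counts can be sketched as two time-window filters over each skill's latest row timestamp. This is a minimal illustration, assuming timezone-aware UTC timestamps and the 24-hour and 7-day thresholds named on the cards. The function name and input shape are hypothetical.

```python
from datetime import datetime, timedelta, timezone

FRESH_WINDOW = timedelta(hours=24)  # "produced in the last 24h"
STALE_AFTER = timedelta(days=7)     # "older than 7 days"


def freshness_counts(latest_rows: list[datetime], now: datetime) -> tuple[int, int]:
    """Return (fresh, stale) counts for a list of latest-row timestamps."""
    fresh = sum(1 for ts in latest_rows if now - ts <= FRESH_WINDOW)
    stale = sum(1 for ts in latest_rows if now - ts > STALE_AFTER)
    return fresh, stale


# Made-up timestamps relative to an assumed "now" for illustration.
now = datetime(2026, 3, 16, 1, 21, tzinfo=timezone.utc)
rows = [
    now - timedelta(minutes=6),  # fresh
    now - timedelta(days=3),     # neither fresh nor stale
    now - timedelta(days=8),     # stale
]
print(freshness_counts(rows, now))
```

Rows between 24 hours and 7 days old count toward neither figure, which is why fresh and stale counts need not add up to the total.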

Recent regressions

These are the latest failed rows that had at least one earlier pass for the same suite. The confidence chips below distinguish fresh failures from reproduced ones, so the site does not pretend every red row carries the same weight.
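
The regression rule stated above (latest row failed, with at least one earlier pass in the same suite) can be sketched as a small predicate. The function name and the list-of-booleans history format are hypothetical. The sketch does not model the fresh versus reproduced distinction, since this page does not define it.

```python
def is_regression(history: list[bool]) -> bool:
    """history holds pass (True) / fail (False) results for one skill and
    one suite, ordered oldest first. A regression is a latest failure that
    was preceded by at least one pass."""
    if not history:
        return False
    *earlier, latest = history
    return (not latest) and any(earlier)


print(is_regression([True, True, False]))  # passed before, fails now
print(is_regression([False, False]))       # never passed, so not a regression
```

A skill that has only ever failed is excluded by this rule, which keeps the regression list focused on behavior that used to work.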

Latest failures worth inspecting

Latest runs