📊 Runtime dashboard

Automatic skill testing, live.

This is the control surface for DriftLoom’s testing engine: coverage, freshness, backlog, regressions, and the latest runtime evidence across the baseline safety lane and the deeper follow-on functionality lane.


Dashboard read: healthy runtime coverage means more than a big tested count — look for low live failure pressure, shrinking backlog, and some functionality depth instead of treating raw volume as proof.

Coverage and queue health

Backlog read: a trustworthy runtime lane does not just accumulate tests — it keeps the baseline backlog from ballooning and keeps functionality backlog from quietly starving the deeper evidence tier.

Eligible catalog

5126 · community source-scanned skills that can enter the automatic runtime program

Baseline coverage

209 · 4% of eligible skills have baseline safety evidence

Functionality coverage

194 · 94% of baseline-cleared skills also have deeper follow-on functionality evidence

Baseline backlog

4917 · eligible skills still waiting for their first baseline safety run

Functionality backlog

12 · baseline-passed skills still waiting for deeper follow-on checks

Latest touch

6m ago · most recent run landed at 2026-03-16 01:15 UTC
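The coverage and backlog cards above are tied together by simple arithmetic. The sketch below reproduces them under two assumptions that the page does not state outright: baseline coverage is measured against the eligible catalog, and functionality coverage is measured against the 206 skills reported as "baseline passed". The variable names and the `pct` helper are illustrative only.

```python
# Figures as read from the dashboard cards; the denominators are assumptions.
eligible = 5126                # eligible catalog
baseline_evidence = 209        # skills with baseline safety evidence
baseline_passed = 206          # "baseline passed" count from the depth section
functionality_evidence = 194   # baseline-cleared skills with follow-on evidence


def pct(part: int, whole: int) -> int:
    """Whole-number percentage, rounded to the nearest integer."""
    return round(100 * part / whole)


baseline_coverage = pct(baseline_evidence, eligible)
functionality_coverage = pct(functionality_evidence, baseline_passed)
baseline_backlog = eligible - baseline_evidence
functionality_backlog = baseline_passed - functionality_evidence

print(baseline_coverage, functionality_coverage,
      baseline_backlog, functionality_backlog)
```

Under these assumptions the four values come out as 4, 94, 4917, and 12, which matches the cards. This suggests the reading is plausible, though it does not confirm how the site actually computes them.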

Coverage by depth

Depth read: baseline-only coverage proves the generic safety lane ran, but stronger confidence comes from follow-on functionality checks and fixture-backed evidence because those layers test more of the skill’s claimed shape.

Untested: 4917 / 5126 · 96%

Baseline checks only: 15 / 5126 · 0%

Follow-on functionality checks: 177 / 5126 · 3%

Fixture / example-backed proof: 17 / 5126 · 0%

Why these buckets matter

The goal is not just “some tests happened.” It is depth you can reason about. Baseline checks only means the skill cleared the generic safety lane. Follow-on functionality checks means it also survived repo-aware checks. Fixture / example-backed proof means the repo shipped richer evidence like test corpora, eval assets, or bundled examples that the worker could validate directly.
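One way to read the buckets described above is as a "deepest tier reached" classification. The sketch below is an assumption about how such bucketing could work, not the site's implementation: the three boolean flags are hypothetical, and it assumes fixture-backed proof outranks follow-on checks.

```python
from collections import Counter
from typing import Dict, List


def depth_bucket(has_baseline: bool,
                 has_functionality: bool,
                 has_fixture_proof: bool) -> str:
    """Return the deepest evidence tier a skill has reached (hypothetical flags)."""
    if not has_baseline:
        return "Untested"
    if has_fixture_proof:
        return "Fixture / example-backed proof"
    if has_functionality:
        return "Follow-on functionality checks"
    return "Baseline checks only"


def bucket_shares(buckets: List[str]) -> Dict[str, int]:
    """Share of each bucket as a whole-number percentage of all skills."""
    counts = Counter(buckets)
    total = len(buckets)
    return {name: round(100 * n / total) for name, n in counts.items()}


# Rebuild the dashboard's distribution from its published counts.
catalog = (
    [depth_bucket(False, False, False)] * 4917
    + [depth_bucket(True, False, False)] * 15
    + [depth_bucket(True, True, False)] * 177
    + [depth_bucket(True, True, True)] * 17
)
shares = bucket_shares(catalog)
```

With the published counts, `shares` rounds to 96, 0, 3, and 0 percent. The two 0% rows are not empty buckets; they are small counts (15 and 17) rounded down against a catalog of 5126.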

baseline passed: 206 · functionality passed: 150 · baseline failing now: 3 · functionality failing now: 44

Freshness

Freshness read: a healthy runtime lane is not just active today — it also keeps stale baseline and stale functionality counts low enough that older receipts do not dominate the picture.

Skills touched in 24h

112 · distinct skills that received any runtime run in the last day

Fresh baseline rows

102 · latest baseline-v3 rows produced in the last 24h

Fresh functionality rows

102 · latest functionality-v2 rows produced in the last 24h

Stale baseline rows

0 · baseline rows older than 7 days

Stale functionality rows

0 · functionality-v2 rows older than 7 days
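The freshness cards above imply two thresholds: a row counts as fresh if it was produced in the last 24 hours and as stale once it is older than 7 days. A minimal sketch of that classification follows. The function name, the "aging" label for rows in between, and the example timestamps other than the 01:15 UTC run are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone

FRESH_WINDOW = timedelta(hours=24)  # "produced in the last 24h"
STALE_AFTER = timedelta(days=7)     # "older than 7 days"


def freshness(row_time: datetime, now: datetime) -> str:
    """Classify a runtime row by age; 'aging' is a hypothetical middle label."""
    age = now - row_time
    if age <= FRESH_WINDOW:
        return "fresh"
    if age > STALE_AFTER:
        return "stale"
    return "aging"


# Assumed "now": six minutes after the latest run shown on the dashboard.
now = datetime(2026, 3, 16, 1, 21, tzinfo=timezone.utc)
latest_run = datetime(2026, 3, 16, 1, 15, tzinfo=timezone.utc)
old_run = datetime(2026, 3, 8, 0, 0, tzinfo=timezone.utc)
mid_run = datetime(2026, 3, 12, 0, 0, tzinfo=timezone.utc)

labels = [freshness(t, now) for t in (latest_run, mid_run, old_run)]
```

Keeping the two thresholds as named constants makes it easy to see that a row can be neither fresh nor stale, which is why zero stale rows does not by itself mean every row was refreshed in the last day.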

Recent regressions

These are latest failed rows that had at least one earlier pass for the same suite. The confidence chips below distinguish fresh failures from reproduced ones, so the site does not pretend every red row carries the same weight.

regression · functionality-v2 · failure repeated in more than one run · regression after earlier pass · 2026-03-15 21:31 UTC

veille

romain-grosos

blocked_on_external_service · 0/1 passed
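The regression rule described in this section (a latest failed row with at least one earlier pass for the same suite, labeled as either a fresh or a reproduced failure) can be sketched as follows. This is an illustration under assumptions: the `Run` record, the function name, and the reading of "reproduced" as more than one consecutive failure at the end of the history are not taken from the site.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Run:
    suite: str    # e.g. "functionality-v2"
    passed: bool


def regression_status(runs: List[Run], suite: str) -> Optional[str]:
    """Return 'fresh', 'reproduced', or None for one skill's run history.

    `runs` is assumed to be ordered oldest first.
    """
    history = [r for r in runs if r.suite == suite]
    if not history or history[-1].passed:
        return None  # latest row is not a failure

    # Count the failures at the end of the history.
    trailing_failures = 0
    for run in reversed(history):
        if run.passed:
            break
        trailing_failures += 1

    if trailing_failures == len(history):
        return None  # never passed, so this is not a regression

    return "reproduced" if trailing_failures > 1 else "fresh"


history = [
    Run("functionality-v2", True),
    Run("baseline-v3", True),
    Run("functionality-v2", False),
    Run("functionality-v2", False),
]
status = regression_status(history, "functionality-v2")
```

In this example `status` is "reproduced" because the suite passed once and then failed in two consecutive runs. The `baseline-v3` row is ignored because the rule compares rows within the same suite only.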