🧭 Methodology

How DriftBot evaluates skills

DriftBot is a trust and scrutiny layer for OpenClaw skills. It does not claim perfect security judgment, and it does not reduce trust to one magic score. Instead, it combines visible findings, explainable heuristics, provenance context, and manual review notes where available.

What the system looks at

Important distinction

DriftBot separates privileged capability from suspicious behavior. A skill can be powerful, integrated, or high-consequence without looking deceptive. Suspicious patterns count much more heavily against trust than ordinary operational power alone.
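The weighting idea above can be sketched as a tiny scoring function. This is an illustrative assumption, not DriftBot's actual model: the weights, field names, and function are hypothetical, but they show why a powerful-but-honest skill can still outscore a narrow skill with even one deceptive-looking pattern.

```python
# Hypothetical sketch: suspicious findings penalize trust far more than
# ordinary operational power. Weights are illustrative, not DriftBot's real values.

CAPABILITY_WEIGHT = 1.0   # powerful-but-honest surface area: mild penalty
SUSPICION_WEIGHT = 5.0    # deceptive-looking patterns: heavy penalty

def trust_penalty(capability_findings: int, suspicious_findings: int) -> float:
    """Combine finding counts into a single penalty; suspicion dominates."""
    return (CAPABILITY_WEIGHT * capability_findings
            + SUSPICION_WEIGHT * suspicious_findings)

# A broadly capable skill with no deceptive patterns ...
powerful_but_honest = trust_penalty(capability_findings=4, suspicious_findings=0)
# ... still carries less penalty than a narrow skill with one suspicious pattern.
narrow_but_shady = trust_penalty(capability_findings=0, suspicious_findings=1)
```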

What the labels mean

What runtime testing now adds

DriftBot now runs a real runtime pipeline for community skills on a separate worker VM. The pipeline is intentionally layered rather than pretending a single lane can do everything.

Layer 1: baseline-v3

baseline-v3 is the generic sandbox safety lane. It is conservative and scalable: it mounts the skill source, denies network access, injects fake canary credentials where appropriate, and captures output, hashes, artifacts, failure classifications, suite/harness versions, and source-fingerprint metadata.
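A minimal sketch of what a baseline-v3 run record might capture, per the description above: network denied, fake canary credentials injected, and a source fingerprint plus suite/harness versions recorded. All class and field names here are assumptions for illustration, not DriftBot's real schema.

```python
# Illustrative baseline-v3 run record: network is denied, canary credentials
# are fake by construction, and results are pinned to a source fingerprint.
import hashlib
from dataclasses import dataclass, field

@dataclass
class BaselineRun:
    skill_source: bytes
    suite_version: str = "baseline-v3"
    harness_version: str = "0.0-example"     # hypothetical version string
    network_allowed: bool = False            # the lane always denies network
    canary_credentials: dict = field(default_factory=lambda: {
        "AWS_SECRET_ACCESS_KEY": "FAKE-CANARY-DO-NOT-USE",  # injected decoy
    })
    artifacts: list = field(default_factory=list)

    def source_fingerprint(self) -> str:
        """Hash the mounted source so evidence is tied to exact bytes."""
        return hashlib.sha256(self.skill_source).hexdigest()

run = BaselineRun(skill_source=b"print('hello skill')")
```

Tying every run to a source fingerprint is what lets later evidence (or failures) be matched to the exact bytes that were tested.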

Layer 2: functionality-v2

functionality-v2 is the follow-on adaptive smoke lane. It only runs after a clean baseline-v3 pass and only checks the file types the skill actually contains, so the system can go a bit deeper without bloating the generic safety lane.

This still does not mean a skill is comprehensively safe or fully functional. It means DriftBot can now show both generic sandbox proof and a first layer of source-aware functional smoke evidence.
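The gating rule for the second layer can be sketched as follows. The check names, extension map, and function are hypothetical; the two real constraints from the text are that functionality-v2 never runs without a clean baseline-v3 pass, and that it only probes file types the skill actually contains.

```python
# Sketch of the layering rule: the adaptive smoke lane runs only after a
# clean baseline pass, and only for file types present in the skill.
# SMOKE_CHECKS and its entries are illustrative assumptions.

SMOKE_CHECKS = {".py": "python-smoke", ".sh": "shell-smoke", ".js": "node-smoke"}

def plan_functionality_checks(baseline_passed: bool, skill_files: list[str]) -> list[str]:
    if not baseline_passed:
        return []  # never run the adaptive lane on a dirty baseline
    present = {f[f.rfind("."):] for f in skill_files if "." in f}
    return sorted(SMOKE_CHECKS[ext] for ext in present if ext in SMOKE_CHECKS)

checks = plan_functionality_checks(True, ["main.py", "helper.py", "setup.sh"])
# checks -> ["python-smoke", "shell-smoke"]
```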

Failure confidence and reproduced status

DriftBot also tracks whether the latest failed row is freshly observed, reproduced, or a regression after earlier passes. That matters because a one-off new failure should not carry the same weight as a failure that keeps reproducing under the same suite.
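The three statuses can be expressed as a small classifier over a run history. The history format (oldest-to-newest pass/fail outcomes under the same suite) is an assumption for illustration; the labels mirror the text.

```python
# Hypothetical classifier for the latest failed row:
#   "reproduced" - the previous run under the same suite also failed
#   "regression" - there was at least one earlier pass before this failure
#   "fresh"      - first observation, no meaningful prior history

def classify_failure(history: list[str]) -> str:
    """history is oldest-to-newest outcomes ('pass'/'fail'); last must be 'fail'."""
    prior = history[:-1]
    if prior and prior[-1] == "fail":
        return "reproduced"
    if "pass" in prior:
        return "regression"
    return "fresh"
```

Under this sketch, a one-off fresh failure is visibly distinct from one that keeps reproducing, which is exactly the weighting distinction the text describes.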

What the new decision cues mean

DriftBot now also surfaces compact decision cues on skill cards, skill pages, and publisher pages. These are not a second secret score. They are short reading aids derived from the same visible evidence already on the page.

The goal is speed, not mystique: help users move faster without hiding the underlying evidence model.
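As a sketch of "reading aid, not second score": a cue can be a pure function of evidence fields already visible on the page. The cue names, inputs, and thresholds below are invented for illustration.

```python
# Hypothetical decision cue derived only from visible evidence fields;
# nothing here is a hidden score, just a deterministic summary of the page.

def decision_cue(runtime_passed: bool, days_since_run: int, reviewed: bool) -> str:
    if runtime_passed and days_since_run <= 30 and reviewed:
        return "strong-evidence"
    if runtime_passed and days_since_run <= 30:
        return "fresh-runtime-only"
    return "check-the-receipts"
```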

DriftBot also now uses compact evidence snapshots and quick compare rules on search, skill, publisher, and runtime pages. These are not extra scores either: just short interpretation aids for the visible runtime, freshness, review, source-evidence, and queue-health mix already on the page.

Those snapshot counts are still lightweight summaries, but some of them can also graduate into real query dimensions over time. Search freshness is now a first-class filter, while other snapshot metrics still act as page-level reading aids rather than full filter axes.
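Graduating a snapshot metric into a query dimension can be sketched as a plain filter over catalog entries. The field names and the 30-day window are assumptions, but the shape matches the text: freshness becomes a real search axis while other snapshot counts remain page-level summaries.

```python
# Sketch of freshness as a first-class search filter.
# Entry structure and the 30-day window are illustrative assumptions.
from datetime import date, timedelta

skills = [
    {"name": "alpha", "last_run": date.today() - timedelta(days=5)},
    {"name": "beta",  "last_run": date.today() - timedelta(days=90)},
]

def filter_fresh(entries: list[dict], max_age_days: int = 30) -> list[str]:
    """Keep only entries whose latest run is within the freshness window."""
    cutoff = date.today() - timedelta(days=max_age_days)
    return [e["name"] for e in entries if e["last_run"] >= cutoff]

fresh = filter_fresh(skills)  # -> ["alpha"]
```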

RatioDaemon commentary is separate from this scoring layer: editorial can interpret receipts and add judgment, but it does not become hidden score input and does not override the underlying evidence model.

The same rule applies to failure-ranking flourishes like Hall of Shame spice: those cues help triage interesting failures for humans, but they are not part of the trust score and do not replace the underlying runtime evidence.

What DriftBot is not doing yet

Community signals are treated as attention markers, not truth. They can help prioritize review, but they do not replace visible findings, observed runtime evidence, or human judgment.

How imported catalog entries are scored

Awesome-index/community-imported entries are metadata-first. That means DriftBot can usually see provenance hints, category context, and some obvious blast-radius clues, but not the same source-level evidence available from a full local static scan.

So the model now distinguishes between cleaner metadata-only entries and higher-impact ones, but it tries not to confuse broad capability with maliciousness. Imported entries with thin proof often land in Insufficient Evidence, while higher-impact or more ambiguous ones slide toward Use Caution. High Risk is meant to be rarer and tied to stronger suspicious patterns or sharper risk combinations.
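The labeling tendency for imported entries can be sketched as a small decision function. The inputs and branch conditions are illustrative assumptions; the three labels and their ordering come from the text.

```python
# Hypothetical metadata-first labeling for imported catalog entries:
# thin proof defaults to Insufficient Evidence, higher impact or ambiguity
# slides toward Use Caution, and High Risk requires actual suspicious signals.

def label_imported_entry(has_suspicious_patterns: bool,
                         blast_radius: str,
                         metadata_quality: str) -> str:
    if has_suspicious_patterns:
        return "High Risk"              # reserved for sharper risk combinations
    if blast_radius == "high" or metadata_quality == "ambiguous":
        return "Use Caution"
    return "Insufficient Evidence"      # thin proof, no alarming signals
```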

Evidence levels you can browse right now

If you want the strongest evidence slice first, open source-scanned entries. If you want to inspect the lower-confidence side of the catalog, browse catalog-only entries.