How DriftBot evaluates skills
DriftBot is a trust and scrutiny layer for OpenClaw skills. It does not claim perfect security judgment, and it does not reduce trust to one magic score. Instead, it combines visible findings, explainable heuristics, provenance context, and manual review notes where available.
What the system looks at
- Environment variable and secret requirements
- Network references and external services
- Shell and subprocess capability
- File-writing or publishing behavior
- Browser automation references
- Higher-blast-radius domains like auth, messaging, posting, and finance
- Suspicious implementation patterns
Important distinction
DriftBot separates privileged capability from suspicious behavior. A skill can be powerful, integrated, or high-consequence without looking deceptive. Suspicious patterns count much more heavily against trust than ordinary operational power alone.
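That weighting can be sketched in a few lines. The signal names and weights below are invented for illustration; DriftBot's real model is richer than a two-weight sum:

```python
# Hypothetical sketch: suspicious patterns outweigh ordinary capability.
# Signal names and weights are illustrative, not DriftBot's actual model.

CAPABILITY_WEIGHT = 1   # powerful but explainable surface area
SUSPICIOUS_WEIGHT = 5   # deceptive-looking patterns count much more

def risk_points(capability_signals, suspicious_signals):
    """Combine signal counts so a single suspicious pattern outweighs
    several ordinary operational capabilities."""
    return (len(capability_signals) * CAPABILITY_WEIGHT
            + len(suspicious_signals) * SUSPICIOUS_WEIGHT)

# A skill with broad but honest capability...
powerful = risk_points(["shell", "network", "file_write"], [])
# ...scores lower than a narrow skill with one deceptive pattern.
sneaky = risk_points(["network"], ["obfuscated_eval"])
```

The point of the asymmetry is exactly the distinction above: operational power alone should not be enough to push a skill into a hostile-looking bucket.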
What the labels mean
- Trusted – the current evidence looks understandable, the capability surface is explainable, and no stronger suspicious signals dominate the picture
- Insufficient Evidence – not necessarily bad, but the available proof is still too thin for a stronger call
- Use Caution – the skill may be legitimate, but it touches higher-impact surfaces or has enough ambiguity that extra scrutiny is warranted
- High Risk – reserved for stronger suspicious patterns or sharper combinations of risk signals, not just ordinary power-user capability
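Read together, the labels imply a rough precedence: suspicious patterns dominate, thin evidence blocks a stronger call, and higher impact pushes toward caution. A minimal sketch of that precedence, with invented thresholds and parameter names:

```python
# Illustrative label precedence; thresholds and field names are
# assumptions, not DriftBot's real decision rules.

def label(suspicious_signals, high_impact, evidence_strength):
    """Return a trust label from three simplified inputs."""
    if len(suspicious_signals) >= 2:       # sharper risk combinations
        return "High Risk"
    if evidence_strength < 2:              # proof still too thin
        return "Insufficient Evidence"
    if suspicious_signals or high_impact:  # ambiguity or blast radius
        return "Use Caution"
    return "Trusted"
```

Note that a single input never decides the outcome on its own, which mirrors the "no one magic score" framing above.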
What runtime testing now adds
DriftBot now runs a real runtime pipeline for community skills on a separate worker VM. It is intentionally layered instead of pretending one lane can do everything.
Layer 1: baseline-v3
baseline-v3 is the generic sandbox safety lane. It is conservative and scalable: it mounts the skill source, denies network access, injects fake canary credentials where appropriate, and captures output, hashes, artifacts, failure classifications, suite/harness versions, and source-fingerprint metadata.
- Mounted-source and write-boundary checks – prove the staged skill executed against a real source tree without mutating `/source`
- No-network checks – confirm the sandbox blocks both hostname-based and raw-IP outbound fetch attempts
- Fake-env and secret-isolation checks – observe canary credential handling while confirming common secret paths and docker socket access stay absent
- Runtime receipts – commands, output excerpts, artifact files, image digests, stdout/stderr sha256, and explicit failure classes
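As one example, the no-network check can be approximated with a plain outbound connect probe against both a hostname and a raw IP. The hosts, port, and function names here are illustrative, not baseline-v3's actual harness:

```python
# Hypothetical sketch of a no-network probe: inside a network-denied
# sandbox, both hostname-based and raw-IP connects must fail.
import socket

def outbound_blocked(host, port=443, timeout=2.0):
    """Return True when an outbound TCP connect fails, as it
    should in a network-denied sandbox."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return False  # connection succeeded: sandbox leak
    except OSError:
        return True       # DNS failure, refusal, or timeout: blocked

def no_network_check():
    # Probe a hostname and a raw IP; both must be unreachable.
    return outbound_blocked("example.com") and outbound_blocked("203.0.113.1")
```

A real lane would also record the attempted command and its output as a runtime receipt, so the failure is provable rather than merely asserted.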
Layer 2: functionality-v2
functionality-v2 is the follow-on adaptive smoke lane. It only runs after a clean baseline-v3 pass and only checks the file types the skill actually contains, so the system can go a bit deeper without bloating the generic safety lane.
- Structure checks – basic source-shape and `SKILL.md` sanity for repos that expose those files
- OpenClaw manifest sanity – `_meta.json` and `skill.json` shape checks catch malformed metadata and broken skill manifests
- Identity coherence – `_meta.json` owner / slug / history must match the actual skill identity, not merely parse cleanly
- JSON / YAML / CSV parsing – parse real `.json`, `.yaml`, `.yml`, and `.csv` files found in the repo instead of relying on metadata vibes
- Entrypoint smoke checks – lightweight Python / Node `--help` or `help` runs catch scripts that parse but do not even start cleanly
- Docs and fixture integrity – repo-local Markdown links, shipped `test.yaml` / `EVALS.json` / inventory assets, and packaged example corpora can all be checked directly
- Common project-file checks – `package.json` shape validation, root-package entrypoint existence, `package-lock.json` sanity, `pyproject.toml` parsing, and `requirements.txt` checks when those files exist
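The parse lane is the easiest one to picture. A minimal sketch that walks a repo and parses only the `.json` and `.csv` files it actually finds (YAML would be handled the same way with a YAML parser); the paths and result shape are assumptions:

```python
# Sketch of a source-aware parse lane in the spirit of functionality-v2:
# only check the file types the repo actually contains.
import csv
import json
from pathlib import Path

def parse_checks(repo):
    """Try to parse every .json and .csv file under `repo`;
    report per-file pass/fail instead of trusting metadata."""
    results = {}
    for path in Path(repo).rglob("*"):
        try:
            if path.suffix == ".json":
                json.loads(path.read_text())
            elif path.suffix == ".csv":
                list(csv.reader(path.read_text().splitlines()))
            else:
                continue  # file type not present in this lane
            results[str(path)] = "pass"
        except (json.JSONDecodeError, csv.Error, OSError) as exc:
            results[str(path)] = f"fail: {exc}"
    return results
```

Because the lane only exercises files that exist, it can deepen coverage for rich repos without bloating the generic safety lane for simple ones.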
This still does not mean a skill is comprehensively safe or fully functional. It means DriftBot can now show both generic sandbox proof and a first layer of source-aware functional smoke evidence.
Failure confidence and reproduced status
DriftBot also tracks whether the latest failed row is freshly observed, reproduced, or a regression after earlier passes. That matters because a one-off new failure should not carry the same weight as a failure that keeps reproducing under the same suite.
- Fresh failure – the current latest row is failed and no earlier failed row exists for that suite
- Failure reproduced – the current latest row is failed and earlier failed rows already existed
- Regression – the current latest row is failed but earlier passes existed for the same suite
- Recovered – the current latest row is passing after earlier failures
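The four statuses above reduce to a small function over the per-suite run history. A sketch, assuming rows are ordered oldest to newest and are simply `"passed"` or `"failed"` strings:

```python
# Sketch of the fresh / reproduced / regression / recovered logic.
# Row representation is an assumption for illustration.

def failure_status(rows):
    """Classify the latest run against the suite's earlier history."""
    if not rows:
        return "no runs"
    latest, earlier = rows[-1], rows[:-1]
    if latest == "passed":
        return "recovered" if "failed" in earlier else "passing"
    if "failed" in earlier:
        return "failure reproduced"
    if "passed" in earlier:
        return "regression"
    return "fresh failure"
```

The ordering matters: a latest failure with any earlier failure counts as reproduced even if passes also exist, which is what lets reproduced failures carry more weight than one-off new ones.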
What the new decision cues mean
DriftBot now also surfaces compact decision cues on skill cards, skill pages, and publisher pages. These are not a second secret score. They are short reading aids derived from the same visible evidence already on the page.
- Review first usually means a sandbox failure or stronger suspicious signal already exists, so the safe move is to inspect receipts before install.
- Lower-friction candidate for local use usually means a clean runtime plus a lower visible surface area, not a promise that the skill is universally safe.
- Better than average evidence means the page has stronger support like human review or source-level evidence, not that judgment is no longer required.
- Thin evidence means metadata and heuristics are doing more of the work, so confidence should stay conservative.
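One way to picture how the cues derive from visible evidence rather than a hidden score is a plain precedence over evidence fields. The field names and ordering here are illustrative assumptions, not DriftBot's actual cue logic:

```python
# Illustrative cue derivation: a precedence over visible evidence
# fields, not a second score. Field names are assumptions.

def decision_cue(evidence):
    """Map a dict of visible evidence flags to one compact cue."""
    if evidence.get("sandbox_failed") or evidence.get("suspicious"):
        return "Review first"
    if evidence.get("clean_runtime") and evidence.get("surface") == "low":
        return "Lower-friction candidate for local use"
    if evidence.get("human_review") or evidence.get("source_scanned"):
        return "Better than average evidence"
    return "Thin evidence"
```

Because every branch reads from evidence that is already on the page, a user can always reconstruct why a cue appeared.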
The goal is speed, not mystique: help users move faster without hiding the underlying evidence model.
DriftBot now also uses compact evidence snapshots and quick compare rules on search, skill, publisher, and runtime pages. These are not extra scores either – just short interpretation aids for the visible runtime, freshness, review, source-evidence, and queue-health mix already on the page.
Those snapshot counts are still lightweight summaries, but some of them can also graduate into real query dimensions over time. Search freshness is now a first-class filter, while other snapshot metrics still act as page-level reading aids rather than full filter axes.
RatioDaemon commentary is separate from this scoring layer: editorial can interpret receipts and add judgment, but it does not become hidden score input and does not override the underlying evidence model.
The same rule applies to failure-ranking flourishes like Hall of Shame spice: those cues help triage interesting failures for humans, but they are not part of the trust score and do not replace the underlying runtime evidence.
What DriftBot is not doing yet
- Deep skill-specific functional validation for every skill
- Full environment/integration testing against real third-party services
- Comprehensive provenance verification
- Runtime observability across the entire catalog
- Crowd voting as a replacement for evidence
Community signals are treated as attention markers, not truth. They can help prioritize review, but they do not replace visible findings, observed runtime evidence, or human judgment.
How imported catalog entries are scored
Awesome-index/community-imported entries are metadata-first. That means DriftBot can usually see provenance hints, category context, and some obvious blast-radius clues, but not the same source-level evidence available from a full local static scan.
So the model now distinguishes between cleaner metadata-only entries and higher-impact ones, but it tries not to confuse broad capability with maliciousness. Imported entries with thin proof often land in Insufficient Evidence, while higher-impact or more ambiguous ones slide toward Use Caution. High Risk is meant to be rarer and tied to stronger suspicious patterns or sharper risk combinations.
Evidence levels you can browse right now
- Source-scanned – DriftBot has local source-level evidence, so findings like shell usage, env vars, network references, suspicious patterns, and capability inferences are grounded in a real scan.
- Catalog-only – DriftBot mostly has metadata, provenance hints, naming/category context, and other weaker signals. These entries can still be useful, but confidence is lower by default.
If you want the strongest evidence slice first, open source-scanned entries. If you want to inspect the lower-confidence side of the catalog, browse catalog-only entries.