Public methodology
ProofStack Trust Standard v0.1
The versioned methodology behind every score — published so developers, buyers, competitors, and auditors can scrutinize, reproduce, and challenge it. A trust signal is only as credible as the standard behind it.
Status: v0.1 (draft, pre-accreditation). This is an independent, evidence-based assessment standard — not yet a third-party-accredited certification. See Limitations.
1. Dimensions & weights
Five dimensions with fixed weights (sum = 100). The weights are defined once and shared by the code and this standard; changing them requires a version bump.
| Dimension | Weight | What it measures |
|---|---|---|
| Problem Fidelity | 25 | Does it faithfully solve the defined problem, including the hard parts? |
| Live Behavior | 25 | Does the deployed app actually work — performance, stability, UX? |
| Code Quality | 18 | Structure, typing, tests, maintainability. |
| Security | 17 | Secret exposure, OWASP risks, auth/payment handling, injection resistance. |
| Code Integrity | 15 | Authenticity/ownership: incremental authorship vs. dumped/forked code. |
2. Scoring method — evidence by model, score by rule
- Evidence extraction (model). Specialist analyzers read the artifacts and produce a per-dimension 0–100 score with concrete evidence (file:line, URL, commit). Ungrounded scores are rejected by a validator.
- Aggregation (deterministic rule). The overall score is not taken from the model — it is recomputed by a fixed weighted average over measured dimensions. Identical dimension scores always yield an identical overall score.
- Tier (deterministic band). Tier is derived from the overall score by the fixed bands below, never from model free-text.
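The aggregation step above can be sketched as follows. This is a minimal illustration, not the engine's actual code: the function name `overall_score` is hypothetical, and the handling of unmeasured dimensions (dropping them and renormalizing the remaining weights) is an assumption, since the standard only says the average runs "over measured dimensions".

```python
# Weights copied from the table in section 1; they must sum to 100.
WEIGHTS = {
    "Problem Fidelity": 25,
    "Live Behavior": 25,
    "Code Quality": 18,
    "Security": 17,
    "Code Integrity": 15,
}
assert sum(WEIGHTS.values()) == 100


def overall_score(dimension_scores):
    """Recompute the overall score as a weighted average (hypothetical helper).

    ASSUMPTION: a dimension that was not measured is simply absent from
    `dimension_scores`, and the remaining weights are renormalized.
    """
    if not dimension_scores:
        raise ValueError("no measured dimensions")
    for name, score in dimension_scores.items():
        if name not in WEIGHTS:
            raise ValueError(f"unknown dimension: {name}")
        if not 0 <= score <= 100:
            raise ValueError(f"score out of range for {name}: {score}")
    total_weight = sum(WEIGHTS[name] for name in dimension_scores)
    weighted_sum = sum(
        WEIGHTS[name] * score for name, score in dimension_scores.items()
    )
    return weighted_sum / total_weight


example = {
    "Problem Fidelity": 90,
    "Live Behavior": 80,
    "Code Quality": 70,
    "Security": 60,
    "Code Integrity": 50,
}
print(overall_score(example))  # 72.8
```

Because the function has no randomness and no model call, the same dimension scores always produce the same overall score, which is the property this section relies on.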
3. Trust tiers
| Overall | Tier | Buyer guidance |
|---|---|---|
| 85–100 | Verified Pro | Core business, payments/auth-sensitive work. |
| 70–84 | Trusted Practitioner | General feature development and MVPs. |
| 55–69 | Growing Developer | Clearly-specced work with light oversight. |
| 40–54 | Entry Level | Learning / low-risk work, mentoring assumed. |
| 0–39 | Needs More Evidence | Insufficient objective evidence; submit more. |
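The banding in the table above could be implemented as a simple threshold lookup. This is a sketch under one stated assumption: the table lists integer bands, so fractional overall scores (for example 84.6) are assigned here to the band whose lower bound they have reached, with no rounding. The names `TIER_BANDS` and `tier_for` are hypothetical.

```python
# (lower bound, tier) pairs from the table above, highest band first.
TIER_BANDS = [
    (85, "Verified Pro"),
    (70, "Trusted Practitioner"),
    (55, "Growing Developer"),
    (40, "Entry Level"),
    (0, "Needs More Evidence"),
]


def tier_for(overall):
    """Map an overall score in [0, 100] to its tier (hypothetical helper).

    ASSUMPTION: no rounding is applied; a score belongs to the highest
    band whose lower bound it meets or exceeds.
    """
    if not 0 <= overall <= 100:
        raise ValueError(f"overall score out of range: {overall}")
    for lower_bound, tier in TIER_BANDS:
        if overall >= lower_bound:
            return tier
    raise AssertionError("unreachable: the last band starts at 0")


print(tier_for(72.8))  # Trusted Practitioner
```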
4. Anti-gaming rules
Attempts to inflate a score are themselves trust signals. Violations apply caps and are disclosed.
- Text-only submission → confidence low, tier capped at Growing Developer.
- Fork/clone with little first-party work → Code Integrity capped at 40.
- Deploy URL that doesn't load → Live Behavior recorded as a low score, not skipped.
- Single bulk-dump commit → lower Code Integrity confidence; “no incremental history” noted.
- AI-generated code is NOT penalized per se — we judge whether it works, is safe, and you control it.
- README/claims vs. reality gap → Problem Fidelity docked by the gap, and disclosed.
- Prompt-injection attempts in the repo → treated as untrusted data, never instructions; recorded as a Security concern.
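Two of the rules above are explicit caps and can be sketched directly. The flags `mostly_forked` and `text_only`, the function names, and the disclosure wording are all hypothetical; how the engine actually detects these conditions is not described in this standard. The sketch assumes the Code Integrity cap is applied to the dimension score before aggregation and the tier cap after banding.

```python
# Tiers from section 3, ordered lowest to highest.
TIER_ORDER = [
    "Needs More Evidence",
    "Entry Level",
    "Growing Developer",
    "Trusted Practitioner",
    "Verified Pro",
]


def cap_dimension_scores(dimension_scores, mostly_forked=False):
    """Apply the fork/clone rule: Code Integrity capped at 40.

    Returns a new dict plus a list of disclosures (hypothetical format).
    """
    capped = dict(dimension_scores)
    disclosures = []
    if mostly_forked and "Code Integrity" in capped:
        capped["Code Integrity"] = min(capped["Code Integrity"], 40)
        disclosures.append("Fork/clone with little first-party work: "
                           "Code Integrity capped at 40")
    return capped, disclosures


def cap_tier(tier, text_only=False):
    """Apply the text-only rule: tier capped at Growing Developer."""
    if not text_only:
        return tier
    ceiling = TIER_ORDER.index("Growing Developer")
    return TIER_ORDER[min(TIER_ORDER.index(tier), ceiling)]


scores, notes = cap_dimension_scores(
    {"Code Integrity": 90, "Security": 70}, mostly_forked=True
)
print(scores["Code Integrity"], notes)
print(cap_tier("Verified Pro", text_only=True))
```

A cap only ever lowers a result: a tier already below the ceiling is returned unchanged.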
5. Neutrality
ProofStack verifies connections; it does not broker transactions. We earn nothing from a hire's value (no GMV cut), so we have no incentive to inflate scores. Transaction-based platforms do have that incentive, which is why they cannot credibly run strict verification.
6. Reproducibility, provenance & audit trail
Every score carries engine metadata: standard version, scoring method (deterministic-weighted), the model versions used, and a timestamp. Each dimension retains an evidence array (finding + location). For public repos anyone can re-verify the cited locations; for private repos, source is used transiently and deleted, evidence pointers are retained, and re-verification requires the owner's consent.
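One possible shape for such a record is sketched below. The class and field names are hypothetical, and the model name, score, finding, and file location are invented placeholders for illustration only; the standard names the kinds of data that are kept, not their schema.

```python
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class Evidence:
    finding: str   # what was observed
    location: str  # file:line, URL, or commit


@dataclass(frozen=True)
class EngineMetadata:
    standard_version: str
    scoring_method: str
    model_versions: tuple
    timestamp: str  # ISO 8601, UTC


@dataclass
class DimensionResult:
    name: str
    score: float
    evidence: list = field(default_factory=list)


metadata = EngineMetadata(
    standard_version="v0.1",
    scoring_method="deterministic-weighted",
    model_versions=("example-model-1",),  # placeholder, not a real model name
    timestamp=datetime.now(timezone.utc).isoformat(),
)
security = DimensionResult(
    name="Security",
    score=62,  # invented example value
    evidence=[Evidence("API key committed in config", "config/settings.py:14")],
)

# A plain-dict form that could be serialized and stored with the score.
record = {"engine": asdict(metadata), "dimensions": [asdict(security)]}
print(record["engine"]["standard_version"])
```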
7. Limitations — what this is not
- Not yet a third-party-accredited certification. The methodology is self-published and not (yet) externally audited. Authority is being built via reproducibility + an accumulating outcome-calibration dataset — not asserted.
- A snapshot, not a guarantee. A score reflects submitted artifacts at a point in time and does not guarantee future work.
- Scores are opinions derived from automated analysis of the cited evidence. Developers may dispute a score; the dispute path and evidence are retained.
- Outcome calibration is early. Predictive validity (score ↔ real-world result) strengthens as field outcomes accumulate; until then, treat calibration claims as provisional.
8. Versioning & governance
This standard is versioned (vMAJOR.MINOR); weight/tier/rule changes bump the version, and every published score records the version it was graded under. Roadmap to accreditation: public standard (this document) → advisory review → published calibration statistics → external methodology audit.
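Since every score records the version it was graded under, a consumer may want to parse and compare those labels. The sketch below assumes labels follow the `vMAJOR.MINOR` form exactly (for example `v0.1`); the helper name `parse_version` is hypothetical.

```python
import re

# Matches labels such as "v0.1" or "v2.13" and nothing else.
VERSION_PATTERN = re.compile(r"v(\d+)\.(\d+)")


def parse_version(label):
    """Turn 'vMAJOR.MINOR' into a (major, minor) tuple of ints."""
    match = VERSION_PATTERN.fullmatch(label)
    if match is None:
        raise ValueError(f"not a vMAJOR.MINOR label: {label!r}")
    return int(match.group(1)), int(match.group(2))


# Tuples compare numerically, so v0.10 sorts after v0.9.
print(parse_version("v0.10") > parse_version("v0.9"))
```

Comparing parsed tuples rather than raw strings avoids the lexical trap where "v0.10" would sort before "v0.9".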
9. “Are you a certification authority?”
10. “By whose standard?” — traceability
A measurement instrument earns trust through three things (reliability, traceability, and validity), not a stamp. We don't invent measurement criteria in a vacuum: each dimension traces to already-recognized standards and observable facts. What ProofStack adds is a transparent, versioned way to weight and combine them. These recognized bases are our equivalent of an instrument's reference units.
| Dimension | Traces to (recognized basis) |
|---|---|
| Problem Fidelity | Spec-vs-implementation gap — observable: stated requirements (README/brief) compared against what the code and live app actually do. |
| Live Behavior | Observable runtime facts — HTTP/Core-Web-Vitals-style load & performance signals, error states, reachability. Facts, not opinion: it loads or it doesn't. |
| Code Quality | Established static-analysis constructs — cyclomatic complexity, test presence/coverage, type safety, structure/duplication — from accepted software-engineering measurement. |
| Security | Recognized security standards — OWASP Top 10, CWE, CVE / dependency advisories, secret-exposure patterns. These are themselves industry standards. |
| Code Integrity | Objective git history — commit cadence, incremental authorship, contribution consistency. Recorded facts, not inference. |
Reliability: aggregation is deterministic. Identical dimension scores always yield an identical overall score and tier (see §2), so that step has perfect test-retest reliability, unlike non-reproducible "AI vibe" ratings. Validity: the ultimate test is whether a score predicts real-world results, which the outcome-calibration loop measures and feeds back.
Honest about the one invented part: the initial weights (25/25/18/17/15) are a documented expert prior, not an empirically fitted constant. The outcome-calibration loop is precisely the mechanism that moves them from prior toward fitted values. This is a self-correcting, versioned standard, not a frozen invention. So the honest answer to "by whose standard?" is: by recognized sub-standards you can check, combined by a public rule you can audit, and corrected by real outcomes you can verify.
11. Per-dimension predictive validity
§10 says the ultimate test of a measurement is whether it predicts real-world results. Here is that test, live and per dimension: across verified field outcomes, how often each dimension's score was confirmed by what actually happened. This is the empirical anchor that turns the standard from a published method into a self-correcting one — and it updates as outcomes accumulate.
Where accuracy is low or the sample is thin, that is itself disclosed rather than hidden — a dimension with weak predictive validity is a signal to recalibrate its weight, not to quietly trust it.
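A per-dimension confirmation rate with a thin-sample flag could be computed along these lines. This is a sketch only: the input format (pairs of dimension name and whether the outcome confirmed the score), the function name, and the `min_sample` threshold of 30 are all assumptions, and the data in the example is invented. The standard does not define how "confirmed" is judged or what sample size counts as thin.

```python
from collections import defaultdict


def confirmation_rates(outcomes, min_sample=30):
    """Summarize how often each dimension's score was confirmed.

    `outcomes` is an iterable of (dimension, confirmed) pairs.
    ASSUMPTION: `min_sample` is an illustrative threshold below which
    a dimension is flagged as having a thin sample.
    """
    counts = defaultdict(lambda: [0, 0])  # dimension -> [total, confirmed]
    for dimension, confirmed in outcomes:
        counts[dimension][0] += 1
        if confirmed:
            counts[dimension][1] += 1
    return {
        dimension: {
            "n": total,
            "confirmed_rate": hits / total,
            "thin_sample": total < min_sample,
        }
        for dimension, (total, hits) in counts.items()
    }


# Invented example data, not real field outcomes.
sample = [
    ("Security", True),
    ("Security", False),
    ("Security", True),
    ("Security", True),
]
print(confirmation_rates(sample))
```

Reporting `n` and the thin-sample flag next to the rate keeps a weak or under-sampled dimension visible instead of presenting a bare percentage.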