Public methodology

ProofStack Trust Standard v0.1

The versioned methodology behind every score — published so developers, buyers, competitors, and auditors can scrutinize, reproduce, and challenge it. A trust signal is only as credible as the standard behind it.

Status: v0.1 (draft, pre-accreditation). This is an independent, evidence-based assessment standard — not yet a third-party-accredited certification. See Limitations.

1. Dimensions & weights

Five dimensions, fixed weights (sum = 100). The weights are defined once and shared by the code and this standard; changing them requires a version bump.

Dimension | Weight | What it measures
Problem Fidelity | 25 | Does it faithfully solve the defined problem, including the hard parts?
Live Behavior | 25 | Does the deployed app actually work: performance, stability, UX?
Code Quality | 18 | Structure, typing, tests, maintainability.
Security | 17 | Secret exposure, OWASP risks, auth/payment handling, injection resistance.
Code Integrity | 15 | Authenticity/ownership: incremental authorship vs. dumped/forked code.

2. Scoring method — evidence by model, score by rule

  • Evidence extraction (model). Specialist analyzers read the artifacts and produce a per-dimension 0–100 score with concrete evidence (file:line, URL, commit). Ungrounded scores are rejected by a validator.
  • Aggregation (deterministic rule). The overall score is not taken from the model — it is recomputed by a fixed weighted average over measured dimensions. Identical dimension scores always yield an identical overall score.
  • Tier (deterministic band). Tier is derived from the overall score by the fixed bands below, never from model free-text.
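
The aggregation rule can be sketched as follows. This is a minimal illustration, not the production engine: the function name, the snake_case dimension keys, the renormalization over measured dimensions, and the round-half-up step are all assumptions that this standard does not spell out.

```python
import math

# Weights from section 1 of this standard (they sum to 100).
WEIGHTS = {
    "problem_fidelity": 25,
    "live_behavior": 25,
    "code_quality": 18,
    "security": 17,
    "code_integrity": 15,
}


def overall_score(dimension_scores):
    """Weighted average over the dimensions that were actually measured.

    Assumption: a dimension whose value is None (or missing) counts as
    unmeasured, and the remaining weights are renormalized.
    """
    measured = {k: v for k, v in dimension_scores.items() if v is not None}
    unknown = set(measured) - set(WEIGHTS)
    if unknown:
        raise ValueError(f"unknown dimensions: {sorted(unknown)}")
    if not measured:
        raise ValueError("no measured dimensions")
    for name, score in measured.items():
        if not 0 <= score <= 100:
            raise ValueError(f"{name} score out of range: {score}")

    total_weight = sum(WEIGHTS[k] for k in measured)
    weighted_sum = sum(WEIGHTS[k] * measured[k] for k in measured)
    # Round half up to an integer (assumed; tiers use integer bands).
    return math.floor(weighted_sum / total_weight + 0.5)
```

Because the function depends only on its inputs and the fixed weights, calling it twice with the same dimension scores always returns the same overall score.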

3. Trust tiers

Overall | Tier | Buyer guidance
85–100 | Verified Pro | Core business, payments/auth-sensitive work.
70–84 | Trusted Practitioner | General feature development and MVPs.
55–69 | Growing Developer | Clearly-specced work with light oversight.
40–54 | Entry Level | Learning / low-risk work, mentoring assumed.
0–39 | Needs More Evidence | Insufficient objective evidence; submit more.
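
The bands above map to a simple lookup. The sketch below is illustrative only; the names `TIER_BANDS` and `tier_for` are hypothetical, while the thresholds are the ones published in this section.

```python
# (lower bound, tier name), highest band first, as published in section 3.
TIER_BANDS = [
    (85, "Verified Pro"),
    (70, "Trusted Practitioner"),
    (55, "Growing Developer"),
    (40, "Entry Level"),
    (0, "Needs More Evidence"),
]


def tier_for(overall):
    """Return the tier whose band contains the overall score (0-100)."""
    if not 0 <= overall <= 100:
        raise ValueError(f"overall score out of range: {overall}")
    for lower_bound, name in TIER_BANDS:
        if overall >= lower_bound:
            return name
    # Unreachable: the last band starts at 0.
    raise AssertionError("no band matched")
```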

4. Anti-gaming rules

Attempts to inflate a score are themselves trust signals. Violations apply caps and are disclosed.

  • Text-only submission → confidence low, tier capped at Growing Developer.
  • Fork/clone with little first-party work → Code Integrity capped at 40.
  • Deploy URL that doesn't load → Live Behavior recorded as a low score, not skipped.
  • Single bulk-dump commit → lower Code Integrity confidence; “no incremental history” noted.
  • AI-generated code is NOT penalized per se — we judge whether it works, is safe, and you control it.
  • README/claims vs. reality gap → Problem Fidelity docked by the gap, and disclosed.
  • Prompt-injection attempts in the repo → treated as untrusted data, never instructions; recorded as a Security concern.
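
Two of the caps above are numeric enough to sketch in code. The flag names, function names, and the split into a dimension-level step and a tier-level step are assumptions for illustration; only the cap values (Code Integrity at 40, tier at Growing Developer) come from this standard.

```python
# Tiers from lowest to highest, as listed in section 3.
TIER_ORDER = [
    "Needs More Evidence",
    "Entry Level",
    "Growing Developer",
    "Trusted Practitioner",
    "Verified Pro",
]


def cap_dimensions(dimension_scores, flags):
    """Apply dimension-level caps; returns (capped scores, disclosures)."""
    scores = dict(dimension_scores)  # never mutate the caller's data
    disclosures = []
    # Hypothetical flag name for "fork/clone with little first-party work".
    if "fork_little_first_party_work" in flags:
        current = scores.get("code_integrity")
        if current is not None and current > 40:
            scores["code_integrity"] = 40
        disclosures.append("Code Integrity capped at 40 (fork/clone)")
    return scores, disclosures


def cap_tier(tier, flags):
    """Apply tier-level caps; returns (capped tier, disclosures)."""
    disclosures = []
    # Hypothetical flag name for a text-only submission.
    if "text_only_submission" in flags:
        cap = "Growing Developer"
        if TIER_ORDER.index(tier) > TIER_ORDER.index(cap):
            tier = cap
        disclosures.append("Text-only submission: tier capped at Growing Developer")
    return tier, disclosures
```

Each function returns its disclosures alongside the result, mirroring the rule that violations are disclosed rather than silently applied.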

5. Neutrality

ProofStack verifies connections; it does not broker transactions. We earn nothing from a hire's value (no GMV cut), so we have no incentive to inflate scores. Transaction-based platforms do have that incentive, which is why they cannot credibly run strict verification.

6. Reproducibility, provenance & audit trail

Every score carries engine metadata: standard version, scoring method (deterministic-weighted), the model versions used, and a timestamp. Each dimension retains an evidence array (finding + location). For public repos anyone can re-verify the cited locations; for private repos, source is used transiently and deleted, evidence pointers are retained, and re-verification requires the owner's consent.
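
A score record of this shape might look like the sketch below. The class and field names are hypothetical; the standard only states which pieces of information are retained (standard version, scoring method, model versions, timestamp, and per-dimension evidence with finding and location). The validator illustrates the section 2 rule that ungrounded scores are rejected.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    finding: str   # what was observed
    location: str  # file:line, URL, or commit


@dataclass(frozen=True)
class DimensionResult:
    name: str
    score: int
    evidence: tuple  # tuple of Evidence items


@dataclass(frozen=True)
class EngineMetadata:
    standard_version: str  # e.g. "v0.1"
    scoring_method: str    # e.g. "deterministic-weighted"
    model_versions: tuple  # identifiers of the models used
    timestamp: str         # ISO 8601 string (assumed format)


def validate_grounded(result):
    """Reject a dimension score that cites no concrete evidence."""
    if not result.evidence:
        raise ValueError(f"{result.name}: score has no evidence")
    for item in result.evidence:
        if not item.finding.strip() or not item.location.strip():
            raise ValueError(f"{result.name}: evidence lacks finding or location")
    return result
```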

7. Limitations — what this is not

  • Not yet a third-party-accredited certification. The methodology is self-published and not (yet) externally audited. Authority is being built via reproducibility + an accumulating outcome-calibration dataset — not asserted.
  • A snapshot, not a guarantee. A score reflects submitted artifacts at a point in time and does not guarantee future work.
  • Scores are opinions derived from automated analysis of the cited evidence. Developers may dispute a score; the dispute path and evidence are retained.
  • Outcome calibration is early. Predictive validity (score ↔ real-world result) strengthens as field outcomes accumulate; until then, treat calibration claims as provisional.

8. Versioning & governance

This standard is versioned (vMAJOR.MINOR); weight/tier/rule changes bump the version, and every published score records the version it was graded under. Roadmap to accreditation: public standard (this document) → advisory review → published calibration statistics → external methodology audit.

9. “Are you a certification authority?”

No, and on purpose. ProofStack is a measurement instrument, not an accreditation body. We don't ask you to trust a stamp; we ask you to inspect the evidence. That is a stronger position than a credential, for four reasons:

  • Transparency over authority. The method (this document), the fixed weights, the deterministic recompute, and per-finding evidence (file:line) are all public. A certifier is a black box; we are a white box you can audit and reproduce.
  • Neutrality the incumbents can't copy. We take no cut of any hire, so we have no incentive to inflate scores. Marketplaces earn on transaction volume, which is exactly why their "verified" badges are self-reported.
  • Predictive validity, not a seal. Authority is earned by an accumulating dataset of score ↔ real-world outcome. As field results arrive, each score is recalibrated against what actually happened, and the result is shown as the Outcome-calibrated signal on a developer's report.
  • Honest scope. A score is a snapshot of cited artifacts, not a guarantee — and we say so up front (see Limitations).

The most trusted private standards started exactly here, not as government-appointed authorities. FICO credit scores, Moody's / S&P bond ratings, SOC 2 (an accounting-association self-standard), and UL safety marks all became required line-items through reproducibility and adoption, not a charter. ProofStack walks the same path: public standard → advisory review → published calibration statistics → external methodology audit.

Becoming a standard and not being a certification authority are different axes, so they never collide. We hold one direction (we measure and publish; you verify) and never the certifier's "we issue, we guarantee."

Certifier framing (rejected) | Instrument framing (adopted)
Badge = "ProofStack Certified ✓ — passed" | Badge = "ProofStack Score 76/100 — independently verifiable"
A signature means "we guarantee it passed" | A signature means "this number is unmodified": integrity, not endorsement
External audit grants us authority to certify | External audit calibrates the accuracy of our instrument
We interpret the score into "safe to hire" | We supply score + evidence; the buyer judges

ProofStack does not issue certifications; it is the instrument that gets calibrated. A signed badge is not a certificate; it is a tamper-evident measurement (a thermometer with a seal is still a thermometer). And we never call ProofStack "the standard": a standard is earned through adoption, not declared. FICO never announced itself; the market named it. That instrument-first discipline is exactly what keeps the ambition to become a standard and the "not a certification authority" stance from ever contradicting each other.
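
To make "a signature means this number is unmodified" concrete, here is a minimal tamper-evidence sketch. It is not ProofStack's actual signing scheme, which this document does not specify. It assumes an HMAC-SHA256 over a canonical JSON payload with a shared secret, purely for simplicity; a public badge would more plausibly use an asymmetric signature so that third parties can verify without holding the secret. All function names and payload fields are hypothetical.

```python
import hashlib
import hmac
import json


def canonical_bytes(payload):
    """Serialize the payload deterministically so equal data signs equally."""
    return json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_badge(payload, secret_key):
    """Return a hex HMAC-SHA256 tag over the canonical payload."""
    return hmac.new(secret_key, canonical_bytes(payload), hashlib.sha256).hexdigest()


def verify_badge(payload, signature, secret_key):
    """True only if the payload is byte-for-byte what was signed."""
    expected = sign_badge(payload, secret_key)
    return hmac.compare_digest(expected, signature)
```

A valid tag says nothing about whether the score is good, only that the score and its metadata have not been altered since signing. That is the integrity-not-endorsement distinction drawn above.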

10. “By whose standard?” — traceability

A measurement instrument earns trust through three things (reliability, traceability, and validity), not a stamp. We don't invent measurement criteria in a vacuum: each dimension traces to already-recognized standards and observable facts. What ProofStack adds is a transparent, versioned way to weight and combine them. These recognized bases are our equivalent of an instrument's reference units.

Dimension | Traces to (recognized basis)
Problem Fidelity | Spec-vs-implementation gap (observable): stated requirements (README/brief) compared against what the code and live app actually do.
Live Behavior | Observable runtime facts: HTTP/Core-Web-Vitals-style load & performance signals, error states, reachability. Facts, not opinion: it loads or it doesn't.
Code Quality | Established static-analysis constructs (cyclomatic complexity, test presence/coverage, type safety, structure/duplication) from accepted software-engineering measurement.
Security | Recognized security standards: OWASP Top 10, CWE, CVE / dependency advisories, secret-exposure patterns. These are themselves industry standards.
Code Integrity | Objective git history: commit cadence, incremental authorship, contribution consistency. Recorded facts, not inference.

Reliability: the aggregation step is deterministic. Identical dimension scores always yield an identical overall score and tier (perfect test-retest reliability at that step), unlike non-reproducible "AI vibe" ratings. Validity: the ultimate test is whether a score predicts real-world results, which the outcome-calibration loop measures and feeds back.

Honest about the one invented part: the initial weights (25/25/18/17/15) are a documented expert prior, not an empirically fitted constant. The outcome-calibration loop is precisely the mechanism that moves them from prior toward fitted values. This is a self-correcting, versioned standard, not a frozen invention. So the honest answer to "by whose standard?" is: by recognized sub-standards you can check, combined by a public rule you can audit, and corrected by real outcomes you can verify.

11. Per-dimension predictive validity

§10 says the ultimate test of a measurement is whether it predicts real-world results. Here is that test, live and per dimension: across verified field outcomes, how often each dimension's score was confirmed by what actually happened. This is the empirical anchor that turns the standard from a published method into a self-correcting one — and it updates as outcomes accumulate.

Where accuracy is low or the sample is thin, that is itself disclosed rather than hidden — a dimension with weak predictive validity is a signal to recalibrate its weight, not to quietly trust it.
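
One simple way to compute such a per-dimension figure is a confirmation rate with an explicit thin-sample flag. The sketch below is an assumption about how this could be done, not the published calibration method; the function name, the input shape, and the `MIN_SAMPLE` threshold of 30 are hypothetical.

```python
from collections import defaultdict

# Hypothetical threshold below which a dimension's sample is disclosed as thin.
MIN_SAMPLE = 30


def predictive_validity(outcomes, min_sample=MIN_SAMPLE):
    """Summarize how often each dimension's score was confirmed.

    `outcomes` is an iterable of (dimension_name, confirmed) pairs, where
    `confirmed` is True if the field outcome matched what the score implied.
    """
    counts = defaultdict(lambda: [0, 0])  # dimension -> [confirmed, total]
    for dimension, confirmed in outcomes:
        counts[dimension][1] += 1
        if confirmed:
            counts[dimension][0] += 1

    report = {}
    for dimension, (hits, total) in counts.items():
        report[dimension] = {
            "confirmed": hits,
            "total": total,
            "accuracy": hits / total,
            # Disclosed rather than hidden, per this section.
            "thin_sample": total < min_sample,
        }
    return report
```

Reporting the sample size next to the accuracy keeps a high rate on a handful of outcomes from being read as strong evidence.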