Public methodology

ProofStack Trust Standard v0.1

The versioned methodology behind every score — published so developers, buyers, competitors, and auditors can scrutinize, reproduce, and challenge it. A trust signal is only as credible as the standard behind it.

Status: v0.1 (draft, pre-accreditation). This is an independent, evidence-based assessment standard — not yet a third-party-accredited certification. See Limitations.

1. Dimensions & weights

Five dimensions, fixed weights (sum = 100). The weights are defined once and shared by the code and this standard; changing them requires a version bump.

Dimension	Weight	What it measures
Problem Fidelity	25	Does it faithfully solve the defined problem, including the hard parts?
Live Behavior	25	Does the deployed app actually work — performance, stability, UX?
Code Quality	18	Structure, typing, tests, maintainability.
Security	17	Secret exposure, OWASP risks, auth/payment handling, injection resistance.
Code Integrity	15	Authenticity/ownership: incremental authorship vs. dumped/forked code.

2. Scoring method — evidence by model, score by rule

Evidence extraction (model). Specialist analyzers read the artifacts and produce a per-dimension 0–100 score with concrete evidence (file:line, URL, commit). Ungrounded scores are rejected by a validator.
Aggregation (deterministic rule). The overall score is not taken from the model — it is recomputed by a fixed weighted average over measured dimensions. Identical dimension scores always yield an identical overall score.
Tier (deterministic band). Tier is derived from the overall score by the fixed bands below, never from model free-text.
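
The aggregation rule above can be sketched roughly as follows. This is an illustration under stated assumptions, not ProofStack's actual code: the function name `overall_score`, the dictionary keys, the renormalization of weights over only the measured dimensions, and the rounding to an integer are all hypothetical choices. The source states only that the overall score is a fixed weighted average over measured dimensions.

```python
# Published v0.1 weights; they sum to 100. Key names are hypothetical.
WEIGHTS = {
    "problem_fidelity": 25,
    "live_behavior": 25,
    "code_quality": 18,
    "security": 17,
    "code_integrity": 15,
}
assert sum(WEIGHTS.values()) == 100


def overall_score(dimension_scores: dict[str, float]) -> int:
    """Recompute the overall score as a weighted average of 0-100 scores.

    Assumption: a dimension absent from `dimension_scores` counts as
    unmeasured, and the remaining weights are renormalized.
    """
    unknown = set(dimension_scores) - set(WEIGHTS)
    if unknown:
        raise ValueError(f"unknown dimensions: {sorted(unknown)}")
    if not dimension_scores:
        raise ValueError("no measured dimensions")
    for name, score in dimension_scores.items():
        if not 0 <= score <= 100:
            raise ValueError(f"{name} score out of range: {score}")
    total_weight = sum(WEIGHTS[name] for name in dimension_scores)
    weighted_sum = sum(WEIGHTS[name] * s for name, s in dimension_scores.items())
    # Rounding to the nearest integer is an assumption (Python rounds
    # exact halves to the even neighbour).
    return round(weighted_sum / total_weight)


example = {
    "problem_fidelity": 80,
    "live_behavior": 70,
    "code_quality": 90,
    "security": 60,
    "code_integrity": 100,
}
print(overall_score(example))  # 7890 / 100 = 78.9, printed as 79
```

Because the function has no model call and no randomness, the same dimension scores always produce the same overall score, which is the property the rule is meant to guarantee.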

3. Trust tiers

Overall	Tier	Buyer guidance
85–100	Verified Pro	Core business, payments/auth-sensitive work.
70–84	Trusted Practitioner	General feature development and MVPs.
55–69	Growing Developer	Clearly-specced work with light oversight.
40–54	Entry Level	Learning / low-risk work, mentoring assumed.
0–39	Needs More Evidence	Insufficient objective evidence; submit more.
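
The bands in the table can be expressed as a small lookup. The sketch below assumes an integer overall score and uses hypothetical names (`TIER_BANDS`, `tier_for`); only the band boundaries and tier names come from the table.

```python
# (inclusive lower bound, tier) pairs from the v0.1 table, highest first.
TIER_BANDS = [
    (85, "Verified Pro"),
    (70, "Trusted Practitioner"),
    (55, "Growing Developer"),
    (40, "Entry Level"),
    (0, "Needs More Evidence"),
]


def tier_for(overall: int) -> str:
    """Map an overall 0-100 score to its tier using the fixed bands."""
    if not 0 <= overall <= 100:
        raise ValueError(f"overall score out of range: {overall}")
    for lower_bound, tier in TIER_BANDS:
        if overall >= lower_bound:
            return tier
    raise AssertionError("unreachable: the last band starts at 0")
```

For example, `tier_for(84)` falls in the 70–84 band and `tier_for(85)` in the 85–100 band, so a one-point difference at a boundary changes the tier deterministically rather than by model judgment.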

4. Anti-gaming rules

Attempts to inflate a score are themselves trust signals. Violations apply caps and are disclosed.

Text-only submission → confidence low, tier capped at Growing Developer.
Fork/clone with little first-party work → Code Integrity capped at 40.
Deploy URL that doesn't load → Live Behavior recorded as a low score, not skipped.
Single bulk-dump commit → lower Code Integrity confidence; “no incremental history” noted.
AI-generated code is NOT penalized per se — we judge whether it works, is safe, and you control it.
README/claims vs. reality gap → Problem Fidelity docked by the gap, and disclosed.
Prompt-injection attempts in the repo → treated as untrusted data, never instructions; recorded as a Security concern.
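
Two of the caps above (the Code Integrity cap for forks and the tier cap for text-only submissions) could be applied along these lines. This is a minimal sketch: the function names, flag names, and disclosure strings are hypothetical, and it assumes dimension-level caps run before aggregation while tier-level caps run after banding. How each condition is detected is outside the sketch.

```python
TIER_ORDER = [
    "Needs More Evidence",
    "Entry Level",
    "Growing Developer",
    "Trusted Practitioner",
    "Verified Pro",
]


def cap_dimension_scores(scores: dict[str, int], *, fork_with_little_first_party_work: bool):
    """Apply dimension-level caps before aggregation.

    Returns (capped_scores, disclosures); the input dict is not mutated.
    """
    capped = dict(scores)
    disclosures = []
    if fork_with_little_first_party_work and capped.get("code_integrity", 0) > 40:
        capped["code_integrity"] = 40
        disclosures.append(
            "Code Integrity capped at 40: fork/clone with little first-party work"
        )
    return capped, disclosures


def cap_tier(tier: str, *, text_only: bool):
    """Apply tier-level caps after banding; returns (tier, disclosures)."""
    ceiling = "Growing Developer"
    if text_only and TIER_ORDER.index(tier) > TIER_ORDER.index(ceiling):
        return ceiling, ["Tier capped at Growing Developer: text-only submission"]
    return tier, []
```

Returning the disclosure strings alongside the capped values keeps the "caps are disclosed" part of the rule attached to the same code path that applies the cap.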

5. Neutrality

ProofStack verifies connections; it does not broker transactions. We earn nothing from a hire's value (no GMV cut), so we have no incentive to inflate scores. Transaction-based platforms do have that incentive, which is why they cannot credibly run strict verification.

6. Reproducibility, provenance & audit trail

Every score carries engine metadata: standard version, scoring method (deterministic-weighted), the model versions used, and a timestamp. Each dimension retains an evidence array (finding + location). For public repos anyone can re-verify the cited locations; for private repos, source is used transiently and deleted, evidence pointers are retained, and re-verification requires the owner's consent.
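
One possible shape for that record is sketched below. The class and field names are assumptions made for illustration, as are the example model identifier and file location; the source specifies only which pieces of information are carried (standard version, scoring method, model versions, timestamp, and per-dimension evidence with finding and location).

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


@dataclass
class Evidence:
    finding: str
    location: str  # e.g. a file:line reference, a URL, or a commit hash


@dataclass
class DimensionResult:
    score: int
    evidence: list[Evidence] = field(default_factory=list)


@dataclass
class ScoreRecord:
    standard_version: str
    scoring_method: str
    model_versions: list[str]
    timestamp: str
    dimensions: dict[str, DimensionResult]


def build_record(dimensions: dict[str, DimensionResult], model_versions: list[str]) -> ScoreRecord:
    """Attach engine metadata to a set of per-dimension results."""
    return ScoreRecord(
        standard_version="v0.1",
        scoring_method="deterministic-weighted",
        model_versions=list(model_versions),
        timestamp=datetime.now(timezone.utc).isoformat(),
        dimensions=dict(dimensions),
    )


# Hypothetical example values, not real findings.
record = build_record(
    {"security": DimensionResult(70, [Evidence("hard-coded API key", "config/settings.py:12")])},
    ["example-model-1"],
)
payload = json.loads(json.dumps(asdict(record)))
```

Keeping the evidence pointers in the same serialized record as the version and method is what would let a third party re-check the cited locations for a public repo.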

7. Limitations — what this is not

Not yet a third-party-accredited certification. The methodology is self-published and not (yet) externally audited. Authority is being built via reproducibility + an accumulating outcome-calibration dataset — not asserted.
A snapshot, not a guarantee. A score reflects submitted artifacts at a point in time and does not guarantee future work.
Scores are opinions derived from automated analysis of the cited evidence. Developers may dispute a score; the dispute path and evidence are retained.
Outcome calibration is early. Predictive validity (score ↔ real-world result) strengthens as field outcomes accumulate; until then, treat calibration claims as provisional.

8. Versioning & governance

This standard is versioned (vMAJOR.MINOR); weight/tier/rule changes bump the version, and every published score records the version it was graded under. Roadmap to accreditation: public standard (this document) → advisory review → published calibration statistics → external methodology audit.

9. “Are you a certification authority?”

No, and on purpose. ProofStack is a measurement instrument, not an accreditation body. We don't ask you to trust a stamp; we ask you to inspect the evidence. That is a stronger position than a credential, for four reasons:

Transparency over authority. The method (this document), the fixed weights, the deterministic recompute, and per-finding evidence (file:line) are all public. A certifier is a black box; we are a white box you can audit and reproduce.
Neutrality the incumbents can't copy. We take no cut of any hire, so we have no incentive to inflate scores. Marketplaces earn on transaction volume, which is exactly why their "verified" badges are self-reported.
Predictive validity, not a seal. Authority is earned by an accumulating dataset of score ↔ real-world outcome. As field results arrive, each score is recalibrated against what actually happened, shown as the Outcome-calibrated signal on a developer's report.
Honest scope. A score is a snapshot of cited artifacts, not a guarantee — and we say so up front (see Limitations).

The most trusted private standards started exactly here, not as government-appointed authorities. FICO credit scores, Moody's / S&P bond ratings, SOC 2 (an accounting-association self-standard), and UL safety marks all became required line-items through reproducibility and adoption, not a charter. ProofStack walks the same path: public standard → advisory review → published calibration statistics → external methodology audit.

Becoming a standard and not being a certification authority are different axes, so they never collide. We hold one direction (we measure and publish; you verify) and never adopt the certifier's "we issue, we guarantee."

Certifier framing — rejected	Instrument framing — adopted
Badge = "ProofStack Certified ✓ — passed"	Badge = "ProofStack Score 76/100 — independently verifiable"
A signature means "we guarantee it passed"	A signature means "this number is unmodified" — integrity, not endorsement
External audit grants us authority to certify	External audit calibrates the accuracy of our instrument
We interpret the score into "safe to hire"	We supply score + evidence; the buyer judges

ProofStack does not issue certifications; it is the instrument that gets calibrated. A signed badge is not a certificate; it is a tamper-evident measurement (a thermometer with a seal is still a thermometer). And we never call ProofStack "the standard": a standard is earned through adoption, not declared. FICO never announced itself; the market named it. That instrument-first discipline is exactly what keeps the standard ambition and the "not a certification authority" stance from ever contradicting each other.
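
To make "a signature means this number is unmodified" concrete, here is a minimal integrity-check sketch. The source does not say which signature scheme ProofStack uses; this example swaps in HMAC-SHA256 from the Python standard library purely to show the tamper-evidence idea. HMAC is symmetric, so a badge that anyone can verify without a shared secret would need an asymmetric signature instead. The function names, key, and payload fields are hypothetical.

```python
import hashlib
import hmac
import json


def sign_badge(payload: dict, key: bytes) -> str:
    """Return a hex HMAC-SHA256 tag over a canonical JSON encoding."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(key, canonical, hashlib.sha256).hexdigest()


def badge_is_unmodified(payload: dict, tag: str, key: bytes) -> bool:
    """Check integrity only; a valid tag says nothing about endorsement."""
    return hmac.compare_digest(sign_badge(payload, key), tag)


key = b"demo-key-not-for-production"
badge = {"score": 76, "standard_version": "v0.1"}
tag = sign_badge(badge, key)
```

A verifier that recomputes the tag over an altered payload (say, a score edited from 76 to 96) gets a mismatch, which is all the seal is meant to show.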

10. “By whose standard?” — traceability

A measurement instrument earns trust through three things (reliability, traceability, and validity), not a stamp. We don't invent measurement criteria in a vacuum: each dimension traces to already-recognized standards and observable facts. What ProofStack adds is a transparent, versioned way to weight and combine them. These recognized bases are our equivalent of an instrument's reference units.

Dimension	Traces to (recognized basis)
Problem Fidelity	Spec-vs-implementation gap — observable: stated requirements (README/brief) compared against what the code and live app actually do.
Live Behavior	Observable runtime facts — HTTP/Core-Web-Vitals-style load & performance signals, error states, reachability. Facts, not opinion: it loads or it doesn't.
Code Quality	Established static-analysis constructs — cyclomatic complexity, test presence/coverage, type safety, structure/duplication — from accepted software-engineering measurement.
Security	Recognized security standards — OWASP Top 10, CWE, CVE / dependency advisories, secret-exposure patterns. These are themselves industry standards.
Code Integrity	Objective git history — commit cadence, incremental authorship, contribution consistency. Recorded facts, not inference.

Reliability: aggregation is deterministic, so identical dimension scores always yield an identical overall score and tier (perfect test-retest reliability at the aggregation step), unlike non-reproducible "AI vibe" ratings. Validity: the ultimate test is whether a score predicts real-world results, which the outcome-calibration loop measures and feeds back.

Honest about the one invented part: the initial weights (25/25/18/17/15) are a documented expert prior, not an empirically fitted constant. The outcome-calibration loop is precisely the mechanism that moves them from prior toward fitted values. This is a self-correcting, versioned standard, not a frozen invention. So the honest answer to "by whose standard?" is: by recognized sub-standards you can check, combined by a public rule you can audit, and corrected by real outcomes you can verify.

Prior → calibrated weights (live)

The published prior is the starting point. As verified field outcomes confirm or contradict each dimension's predictions, the active weights move toward what actually predicts results; they are bounded, renormalized, and epoch-stamped on every score. Here is that movement, live:
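
As a rough illustration of what "bounded, renormalized, and epoch-stamped" could mean in code, the sketch below moves each weight a fraction of the way toward a target, clamps it to a band around the published prior, and rescales so the weights sum to 100. The update rule, the `step` and `max_drift` parameters, and all names are assumptions; the source does not publish the actual calibration procedure.

```python
PRIOR = {
    "problem_fidelity": 25,
    "live_behavior": 25,
    "code_quality": 18,
    "security": 17,
    "code_integrity": 15,
}


def recalibrate(active: dict, target: dict, epoch: int,
                step: float = 0.2, max_drift: float = 5.0) -> dict:
    """Move active weights toward `target`, bounded and renormalized.

    `step` (fraction of the gap closed per epoch) and `max_drift`
    (allowed distance from the prior) are hypothetical placeholders.
    """
    moved = {}
    for name, weight in active.items():
        proposed = weight + step * (target[name] - weight)
        low, high = PRIOR[name] - max_drift, PRIOR[name] + max_drift
        moved[name] = min(max(proposed, low), high)  # bounded
    total = sum(moved.values())
    # Renormalize to sum to 100; this rescaling can nudge a clamped
    # weight slightly past its bound, which a real system would handle.
    weights = {name: 100 * w / total for name, w in moved.items()}
    return {"epoch": epoch + 1, "weights": weights}
```

Stamping each result with the epoch is what would let a published score state exactly which set of active weights it was computed under.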

11. Per-dimension predictive validity

§10 says the ultimate test of a measurement is whether it predicts real-world results. Here is that test, live and per dimension: across verified field outcomes, how often each dimension's score was confirmed by what actually happened. This is the empirical anchor that turns the standard from a published method into a self-correcting one — and it updates as outcomes accumulate.

Where accuracy is low or the sample is thin, that is itself disclosed rather than hidden — a dimension with weak predictive validity is a signal to recalibrate its weight, not to quietly trust it.
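
A per-dimension confirmation rate with a thin-sample flag could be computed along these lines. This is a sketch under assumptions: the input shape (pairs of dimension name and whether the outcome confirmed the score), the function name, and the `min_sample` threshold are hypothetical, and the source does not define how "confirmed" is decided.

```python
from collections import defaultdict


def predictive_validity(outcomes, min_sample: int = 30) -> dict:
    """Summarize how often each dimension's score was confirmed.

    `outcomes` is an iterable of (dimension, confirmed) pairs.
    `min_sample` is a placeholder threshold for flagging thin samples.
    """
    counts = defaultdict(lambda: [0, 0])  # dimension -> [confirmed, total]
    for dimension, confirmed in outcomes:
        counts[dimension][0] += int(confirmed)
        counts[dimension][1] += 1
    return {
        dimension: {
            "confirmed_rate": hits / total,
            "sample_size": total,
            "thin_sample": total < min_sample,  # disclosed, not hidden
        }
        for dimension, (hits, total) in counts.items()
    }


# Made-up outcomes for illustration only.
outcomes = [
    ("security", True), ("security", False), ("security", True), ("security", True),
    ("live_behavior", True),
]
report = predictive_validity(outcomes, min_sample=3)
```

Reporting the sample size and the thin-sample flag next to the rate keeps a weakly supported number from being read as a strong one.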

12. Problem complexity — a separate axis, not a multiplier

A score answers how well the work was done. It does not, on its own, answer how hard the problem was. A flawless to-do list and a working payment-reconciliation flow are not the same proof, even at the same score. So we record problem complexity as a separate axis — High / Medium / Low / Trivial — and annotate it on the report, badge, and match, rather than folding it into the number.

We deliberately do not multiply complexity into the score. Doing so would break determinism (same dimension scores → same score) and create a gaming surface where merely claiming an ambitious project inflates the result. Complexity is graded on what was actually achieved (working, implemented hard parts), never on what was attempted, and every grade carries evidence the same way dimension scores do.

Instead of guessing the weight, we earn it. Below, verified field outcomes are stratified by difficulty: does a high score on a High-complexity problem actually predict good real-world results? Only once that holds up across enough outcomes does difficulty earn a place in the score — by evidence, not assumption.

In matching, complexity is used immediately as a fit signal (not a capability score): a developer whose verified work is all Low-complexity is a weaker fit for a payment-grade job, and we say so rather than ranking on raw score alone.
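
Keeping complexity as a separate axis might look like the following sketch, in which the score and the complexity grade sit side by side and only the matching step reads the grade. The class names, the integer ordering of the grades, and the "highest verified complexity" fit rule are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum


class Complexity(IntEnum):
    TRIVIAL = 0
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class Report:
    overall: int            # deterministic score, unaffected by complexity
    complexity: Complexity  # annotated alongside the score, never multiplied in


def is_weaker_fit(verified: list[Report], job: Complexity) -> bool:
    """True when no verified work reaches the job's complexity level."""
    highest = max((r.complexity for r in verified), default=Complexity.TRIVIAL)
    return highest < job
```

Under this rule a developer with two high-scoring Low-complexity reports is still flagged as a weaker fit for a High-complexity job, while the scores themselves stay untouched.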