perea.ai Research · 1.0 · Essay

The Validated Learning Taxonomy: A Falsifiability-Forcing Schema for Pinnacle Gecko Experiments

Why 62 logged experiments can produce zero validated learning — and the multi-axis taxonomy that fixes it.

Author: Dante Perea
Published: 9 May 2026 06:30
Length: 2,758 words · 13 min read
Audience: Founders running experiment trackers who are accumulating logs without compounding learnings
License: CC BY 4.0

#The Validated Learning Taxonomy

A falsifiability-forcing schema for the Pinnacle Gecko experiment loop. Built from the convergent recommendations of every major hypothesis-testing framework, normalized into one operational taxonomy.


#Part 0 — The Failure Mode

A founder logs 62 experiments. Review the corpus and every entry reads like:

"If @vercel/blob is reinstalled and handleUpload is called with a properly generated client token, then the upload-token route will return valid JSON instead of throwing on .json()."

"If the background dispatcher is manually triggered, then EngineSignal timestamps will update and the '8 days' KPI warning will clear."

These are bug-fix verifications. They belong in a commit log, not a learning corpus. Eric Ries (Lean Startup) is precise: "the unit of progress is validated learning, not output." An infrastructure repair produces output. It produces no validated learning about user behavior, market demand, willingness to pay, growth mechanics, or any other belief about the world.

The Pinnacle Gecko Protocol (Perea, 2026) identifies two distinct loops:

  • Loop A — Hypothesis → demand test → MVP → 48-hour decision (validated learning)
  • Loop B — Code commit → production in 5–15 minutes (delivery)

When every entry in the experiment tracker is a Loop B operational fix mislabeled as Loop A, the tracker stops compounding insight. The fix is structural enforcement at the schema level, not better discipline.

This document specifies that schema.


#Part 1 — The Eight-Axis Taxonomy

Every experiment is classified along eight orthogonal axes. Some axes are mandatory at creation (gating the "Run" button). Others are mandatory at close (gating the "Win/Kill" verdict). The schema rejects experiments that fail the falsifiability gate.

#Axis 1 — Loop Class (mandatory at create)

The single most important field. Without it, Loop A and Loop B contaminate each other.

Code | Name | Definition | Belongs in learning corpus?
L0 | Operational | Bug fix, infrastructure repair, refactor, deps upgrade | ❌ No — log as a commit, not a hypothesis
L1 | Discovery | Tests a belief about the world (user behavior, market, willingness to pay) | ✅ Yes
L2 | Optimization | A/B test on existing live product flow | ✅ Yes (lower weight)

Rule: L0 entries should never use the words "hypothesis" or "learning." The system should auto-suggest L0 for entries whose stated outcome is a system state ("returns valid JSON," "timestamp updates," "warning clears") rather than a user behavior.
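
A minimal sketch of that auto-suggestion, assuming Postgres and the loop_class column added in Part 3; the pattern list is illustrative, not exhaustive:

-- Suggest L0 when the stated outcome is a system state rather than a user behavior.
-- Patterns are illustrative; tune them against your own corpus.
SELECT id,
       hypothesis,
       CASE
         WHEN hypothesis ~* '(returns? (valid )?json|timestamps? update|warning clears?|reinstall|refactor|upgrade)'
           THEN 'L0'   -- operational: route to the commit log
         ELSE 'L1'     -- default to discovery; founder confirms at create time
       END AS suggested_loop_class
FROM experiments
WHERE loop_class IS NULL;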

#Axis 2 — Risk Dimension (mandatory for L1/L2) — Cagan / SVPG

The Four Big Risks from Marty Cagan's Inspired. Every L1/L2 experiment must declare the dominant risk it tests:

Code | Risk | Question
VAL | Value | Will customers buy/use it?
USA | Usability | Can users figure out how to use it?
FEA | Feasibility | Can we build/operate it sustainably?
VIA | Viability | Does it work for the business (margins, brand, channels, legal)?

This forces specificity. "We're testing the new onboarding" is rejected. "We're testing the value of the new onboarding (will users complete it?)" is accepted.

#Axis 3 — Hypothesis Class (mandatory for L1) — Ries

The Lean Startup's two foundational classes:

Code | Class | Tests
VAL-H | Value Hypothesis | Does the product deliver value to the user once they're using it?
GRO-H | Growth Hypothesis | How will new users discover and adopt the product?

A founder who has only logged Value-hypothesis experiments has not tested how the business grows. A tracker that surfaces this distribution daily forces balance.
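
One way to surface the distribution, sketched as a Postgres rollup; hypothesis_class is the column added in Part 3, and created_at is an assumed timestamp column:

-- Daily hypothesis-class balance over the last 30 days of L1 experiments.
SELECT hypothesis_class,                                          -- VAL-H or GRO-H
       COUNT(*) AS experiments,
       ROUND(100.0 * COUNT(*) / SUM(COUNT(*)) OVER (), 1) AS pct
FROM experiments
WHERE loop_class = 'L1'
  AND created_at >= NOW() - INTERVAL '30 days'
GROUP BY hypothesis_class;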

#Axis 4 — AARRR Stage (mandatory for L1/L2) — McClure

Already in your schema. The taxonomy keeps it as-is:

Code | Stage
ACQ | Acquisition
ACT | Activation
RET | Retention
REF | Referral
REV | Revenue

#Axis 5 — Evidence Method (mandatory for L1) — Kromatic / Bland

How you'll gather signal. Maps directly to the Testing Business Ideas experiment library (44 cataloged methods). The taxonomy collapses these into 7 buckets ordered by evidence strength:

Code | Method | Evidence Strength | Cycle Time
INT | Customer interview (1:1) | Low (anecdotal) | 30 min
OBS | Observation / session replay | Low–Medium | hours
FAK | Fake-door / smoke test landing page | Medium (intent only) | 1–4 hours
CON | Concierge MVP (user knows you're human) | High (real workflow) | hours–days
WOZ | Wizard of Oz (user thinks it's automated) | High (real workflow) | hours–days
AB | A/B test on live traffic | Statistical | days
PAY | Real payment / pre-order / Stripe link | Strongest (cash) | minutes–hours

Rule: the system should warn if a high-stakes (mission-critical) hypothesis is being tested only with INT-class evidence.
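
A sketch of that warning as a scheduled query; the stakes column is hypothetical, standing in for however you mark a hypothesis as mission-critical:

-- Warn when a mission-critical hypothesis rests on interview evidence alone.
-- 'stakes' is a hypothetical column; substitute your own criticality marker.
SELECT id, hypothesis
FROM experiments
WHERE loop_class = 'L1'
  AND stakes = 'mission-critical'
  AND evidence_method = 'INT';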

#Axis 6 — Hypothesis Statement (mandatory, falsifiability gate)

The system rejects any L1/L2 experiment whose hypothesis does not contain all six slots. This is the single highest-leverage change.

We believe that  [SPECIFIC CHANGE]
for             [SPECIFIC USER SEGMENT]
will result in  [SPECIFIC USER BEHAVIOR]
measured by     [SPECIFIC METRIC]
hitting         [SPECIFIC THRESHOLD]
within          [SPECIFIC TIMEFRAME].

We will kill this if [METRIC] is below [KILL THRESHOLD] by [DEADLINE].

Accepted example (Loop A discovery):

"We believe that adding a Stripe payment link to the founder-dashboard onboarding for unifounder.ai users in the AI-startup ICP will result in pre-launch payment commitments measured by paid checkout completions hitting ≥3 within 48 hours of the tweet. We will kill this if completions are below 1 by hour 24."

Rejected example (current corpus):

"If @vercel/blob is reinstalled, then the upload-token route returns JSON." Reason: no user segment, no behavior, no measurable threshold, no timeframe. Auto-classified as L0 (operational), routed out of the learning corpus.

This template is canonical across Lean Startup (Ries), Lean UX (Gothelf), Strategyzer (Bland/Osterwalder), and Continuous Discovery (Torres). Every framework converges on these six slots; only the field names differ.
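
At the database level, the gate can be one CHECK constraint over the slot columns added in Part 3; a sketch assuming Postgres:

-- Reject any L1/L2 row missing one of the six slots or the kill criterion.
-- L0 rows are exempt: they never carry a hypothesis.
ALTER TABLE experiments ADD CONSTRAINT falsifiability_gate CHECK (
  loop_class = 'L0'
  OR (segment        IS NOT NULL
      AND behavior       IS NOT NULL
      AND metric         IS NOT NULL
      AND threshold      IS NOT NULL
      AND timeframe      IS NOT NULL
      AND kill_threshold IS NOT NULL)
);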

#Axis 7 — Validated Learning Extraction (mandatory at close, for L1)

When the verdict is win, kill, or inconclusive, the system forces a structured learning record. Until this block is populated, the experiment cannot be closed.

We believed   [BELIEF]
We learned    [OBSERVATION]
Confidence    [HIGH | MEDIUM | LOW]   ← based on sample size + evidence method
Generalizes to [THIS-EXPERIMENT-ONLY | THIS-SEGMENT | THIS-MARKET | UNIVERSAL]
Implication   [PERSEVERE | PIVOT | KILL | DOUBLE-DOWN]
Pivot type    [if pivot — Customer-Segment / Customer-Need / Channel / Pricing /
               Value-Prop / Zoom-In / Zoom-Out / Tech / Platform / Business-Model]
Next bet      [link to chained experiment ID]

The pivot taxonomy is adapted from the pivot catalog in Eric Ries's Lean Startup (Chapter 8). Forcing the founder to name the pivot type prevents the silent drift where "we adjusted the product" hides what was actually a customer-segment pivot.

Confidence calibration:

  • HIGH: ≥10 paying users OR statistical significance OR triangulated across ≥3 evidence methods
  • MEDIUM: 3–9 users with consistent signal OR one strong evidence method
  • LOW: 1–2 users OR single weak method (interview only, observation only)
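
This calibration can be pre-computed from logged signals; a sketch assuming Postgres and a signals.experiment_id foreign key. Signal count stands in for user count, distinct sources stand in for evidence methods, and statistical significance is left for the founder to assert manually:

-- First-pass confidence suggestion per L1 experiment; the founder can override.
SELECT e.id,
       CASE
         WHEN COUNT(*) FILTER (WHERE s.signal_type = 'payment') >= 10
              OR COUNT(DISTINCT s.source) >= 3 THEN 'HIGH'    -- cash or triangulation
         WHEN COUNT(*) BETWEEN 3 AND 9         THEN 'MEDIUM'
         ELSE 'LOW'
       END AS suggested_confidence
FROM experiments e
JOIN signals s ON s.experiment_id = e.id
WHERE e.loop_class = 'L1'
GROUP BY e.id;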

#Axis 8 — Signal Quality Tags (recorded per logged signal)

Each signal under an experiment carries metadata. The source field is already in your schema; extend it with:

Field | Values
Source | dm, usage, observation, call, review, analytics, payment
Signal type | behavior, stated-preference, payment, referral, friction, abandonment
Evidence weight | anecdotal-1 (one user), triangulated-3 (3–5 users agree), quantitative (statistical)
Polarity | confirmatory, disconfirming, neutral

Why polarity matters: confirmation bias is the silent killer of validated learning. A tracker that only collects confirmatory signals while ignoring disconfirming ones is a vanity tracker. Forcing polarity tagging surfaces this.


#Part 2 — The Anti-Pattern Detector

The taxonomy is not just classification; it is a detection layer that flags drift from the Pinnacle Gecko Protocol. The system runs these checks daily and surfaces violations in the dashboard:

Check | Trigger | Diagnosis
Loop B contamination | >70% of last 14 days are L0 | "You're maintaining, not validating. Run a Loop A."
Hypothesis class imbalance | Zero GRO-H in last 30 days | "You're optimizing value but not testing growth."
Risk-dimension blind spot | Zero VIA experiments ever | "You've never tested business viability — pricing, margins, channels."
AARRR funnel hole | Zero RET experiments | "You're acquiring but not testing whether anyone stays."
Falsifiability decay | Average kill-threshold specificity < 0.5 | "Your hypotheses aren't precise enough to fail."
Confirmation bias | Disconfirming signal ratio < 15% | "You're only logging signals that support your hypothesis."
Cycle time bloat | Average L1 cycle > 7 days | "Compress. The protocol target is a 48-hour decision."
Free-first violation | L1 experiment with no PAY-class evidence | "No price test = no real demand test (Marc Lou rule)."
Dangling success | Win without Next-bet field | "A win that doesn't chain into the next experiment is a wasted win."
Stale flag | Win + 30 days passed + feature flag still alive | "Anti-pattern #10 from the protocol — clean up the flag."

These checks turn the experiment tracker from a passive log into an active coach.
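
Each check is one small scheduled query. The Loop B contamination check, for instance, sketched in Postgres with an assumed created_at column:

-- Loop B contamination: fires when >70% of the last 14 days' entries are L0.
SELECT ROUND(100.0 * COUNT(*) FILTER (WHERE loop_class = 'L0')
             / NULLIF(COUNT(*), 0), 1) AS pct_l0
FROM experiments
WHERE created_at >= NOW() - INTERVAL '14 days'
HAVING COUNT(*) FILTER (WHERE loop_class = 'L0') > 0.7 * COUNT(*);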


#Part 3 — How This Maps to Your Existing Schema

Your current experiments table already has most fields. The taxonomy requires:

#Add to experiments table

ALTER TABLE experiments ADD COLUMN loop_class      VARCHAR(2);        -- L0|L1|L2
ALTER TABLE experiments ADD COLUMN risk_dimension  VARCHAR(3);        -- VAL|USA|FEA|VIA
ALTER TABLE experiments ADD COLUMN hypothesis_class VARCHAR(5);       -- VAL-H|GRO-H
ALTER TABLE experiments ADD COLUMN evidence_method VARCHAR(3);        -- INT|OBS|FAK|CON|WOZ|AB|PAY
ALTER TABLE experiments ADD COLUMN segment         TEXT;              -- forced segment slot
ALTER TABLE experiments ADD COLUMN behavior        TEXT;              -- forced behavior slot
ALTER TABLE experiments ADD COLUMN metric          TEXT;              -- forced metric slot
ALTER TABLE experiments ADD COLUMN threshold       TEXT;              -- forced threshold slot
ALTER TABLE experiments ADD COLUMN timeframe       TEXT;              -- forced timeframe slot
ALTER TABLE experiments ADD COLUMN kill_threshold  TEXT;              -- forced kill criterion
ALTER TABLE experiments ADD COLUMN confidence      VARCHAR(6);        -- HIGH|MEDIUM|LOW
ALTER TABLE experiments ADD COLUMN generalizes_to  VARCHAR(20);       -- THIS-ONLY|SEGMENT|MARKET|UNIVERSAL
ALTER TABLE experiments ADD COLUMN pivot_type      VARCHAR(20);       -- nullable, only if pivot
ALTER TABLE experiments ADD COLUMN next_bet_id     VARCHAR(64);       -- chains to next experiment

Your existing experiment_type enum (product, pricing, messaging, ...) becomes redundant once risk_dimension + hypothesis_class are populated. Migrate or drop.

#Add to signals table

ALTER TABLE signals ADD COLUMN signal_type     VARCHAR(20);   -- behavior|stated-preference|payment|...
ALTER TABLE signals ADD COLUMN evidence_weight VARCHAR(20);   -- anecdotal-1|triangulated-3|quantitative
ALTER TABLE signals ADD COLUMN polarity        VARCHAR(15);   -- confirmatory|disconfirming|neutral
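
A tagged signal row then looks like this; the experiment ID and the note column are illustrative, assuming an experiment_id foreign key:

-- Example: a disconfirming payment-funnel signal, tagged at log time.
INSERT INTO signals (experiment_id, source, signal_type, evidence_weight, polarity, note)
VALUES ('exp-042', 'payment', 'abandonment', 'anecdotal-1', 'disconfirming',
        'Clicked the pre-order link, abandoned at card entry');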

#Backfill the 62 existing experiments

Run a one-time LLM pass (Grok 4.3 or Claude) over the 62 entries. The model:

  1. Reads each hypothesis field
  2. Classifies as L0 / L1 / L2 (most will be L0)
  3. Routes L0 entries out of the active learning corpus (move to operations table or tag archived-as-ops)
  4. For L1/L2 entries: extracts segment/behavior/metric/threshold from free text, fills the new columns, flags any that fail the falsifiability check

Expected outcome: 50–55 of 62 entries reclassified as L0. The remaining 7–12 become the foundation of a real validated-learning corpus.
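
Once the LLM pass has written loop_class back, step 3 reduces to one statement; a sketch using a hypothetical archived_as_ops flag rather than a separate operations table:

-- Route operational entries out of the active learning corpus (step 3).
ALTER TABLE experiments ADD COLUMN archived_as_ops BOOLEAN DEFAULT FALSE;

UPDATE experiments SET archived_as_ops = TRUE WHERE loop_class = 'L0';

-- Sanity check: expect roughly 50–55 of the 62 rows here.
SELECT COUNT(*) FROM experiments WHERE archived_as_ops;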


#Part 4 — The Forced-Structure Form (UI implication)

The dashboard's "New experiment" form must change. Instead of two free-text fields (hypothesis, success_criteria), it becomes a structured wizard:

Step 1: Loop class                        [L0 / L1 / L2]
        ↓ if L0, route to Operations log (different form)
        ↓ if L1/L2, continue

Step 2: Risk dimension                    [VAL / USA / FEA / VIA]
Step 3: Hypothesis class (L1 only)        [Value / Growth]
Step 4: AARRR stage                       [ACQ / ACT / RET / REF / REV]
Step 5: Evidence method                   [INT / OBS / FAK / CON / WOZ / AB / PAY]

Step 6: Hypothesis statement (six slots)
        ┌────────────────────────────────────────────────┐
        │ We believe that [_________________________]    │
        │ for             [_________________________]    │
        │ will result in  [_________________________]    │
        │ measured by     [_________________________]    │
        │ hitting         [_________________________]    │
        │ within          [_________________________]    │
        │                                                │
        │ Kill if         [_________________________]    │
        │ by              [_________________________]    │
        └────────────────────────────────────────────────┘

  → "Run experiment" button DISABLED until all slots filled
  → Auto-validation: threshold and kill threshold must be numeric
  → Auto-validation: timeframe must be ≤7 days (Pinnacle Gecko target)

The form's job is to make malformed hypotheses uncreatable, not to teach discipline after the fact.
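
The same auto-validations can be backstopped in the database so malformed rows cannot enter by any path; a sketch that assumes bare-number thresholds and adds a hypothetical timeframe_hours column to make the 7-day cap machine-checkable:

-- Database-side backstop for the form's client-side validation.
ALTER TABLE experiments ADD COLUMN timeframe_hours INTEGER;

ALTER TABLE experiments ADD CONSTRAINT numeric_thresholds CHECK (
  loop_class = 'L0' OR (threshold          ~ '^[0-9]+(\.[0-9]+)?$'
                        AND kill_threshold ~ '^[0-9]+(\.[0-9]+)?$')
);

ALTER TABLE experiments ADD CONSTRAINT pinnacle_gecko_window CHECK (
  loop_class = 'L0' OR timeframe_hours <= 168   -- 7 days
);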


#Part 5 — Validated Learning Extraction at Close

Closing an experiment requires the structured learning block. The dashboard prompts:

You set out to test:
  [original hypothesis statement, rendered]

You logged N signals:
  [signal summary]

What did you actually learn?
  We learned that [_________________________________]
  Confidence:    [HIGH / MEDIUM / LOW] (based on sample of N)
  Generalizes:   [THIS-ONLY / SEGMENT / MARKET / UNIVERSAL]

Implication:
  ( ) Persevere — same hypothesis, more data
  ( ) Pivot     — choose pivot type:
                  ( ) Customer-Segment   ( ) Customer-Need
                  ( ) Channel            ( ) Pricing
                  ( ) Value-Prop         ( ) Zoom-In  ( ) Zoom-Out
                  ( ) Tech               ( ) Platform
                  ( ) Business-Model
  ( ) Kill      — this branch of exploration is closed
  ( ) Double-down — increase investment, run scaled version

Next bet (chained experiment):
  [auto-creates a draft experiment, pre-fills "We believe that..." with this learning as input]

The closed experiment becomes a node in a learning graph. The next_bet_id field creates explicit edges. Over time, the graph reveals which lines of inquiry compounded into product-market fit and which dead-ended.
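
The graph is directly queryable; a recursive walk down one chain, sketched in Postgres with an illustrative root ID:

-- Walk one line of inquiry from a root experiment down its next_bet_id edges.
-- Assumes the chain is acyclic (a cycle would loop forever).
WITH RECURSIVE chain AS (
  SELECT id, hypothesis, next_bet_id, 1 AS depth
  FROM experiments
  WHERE id = 'exp-001'              -- illustrative root bet
  UNION ALL
  SELECT e.id, e.hypothesis, e.next_bet_id, c.depth + 1
  FROM experiments e
  JOIN chain c ON e.id = c.next_bet_id
)
SELECT depth, id, hypothesis FROM chain ORDER BY depth;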


#Part 6 — What This Buys You

#Immediate (week 1 after schema migration)

  • 62 existing entries get triaged. The ~10 that are real Loop A learnings become your true corpus. The ~52 that are L0 ops fixes get archived where they belong.
  • The "Validation rate" metric becomes meaningful (currently it's measured against a corpus that is 80% bug fixes, so the number is meaningless).
  • Daily-report AI synthesis stops hallucinating insights from commit-log entries.

#30-day horizon

  • The anti-pattern detector starts flagging your blind spots. ("Zero GRO-H experiments in 30 days — you're not testing growth.")
  • Signal polarity tagging surfaces your confirmation bias quantitatively.
  • The learning graph (chained next_bet_id) reveals which hypotheses compound. Dead-end branches become visible.

#90-day horizon

  • The corpus accumulates ~30–60 well-formed L1 experiments instead of ~250 mixed-quality entries.
  • Cycle time analytics become reliable. "Average L1 cycle: 38 hours" is a real number you can compress.
  • The validated-learning records become a personal RAG corpus for your Claude/Cursor sessions. Every new hypothesis is grounded in what you've already tested. The KB becomes a compounding asset rather than an archive.

#Part 7 — Authoritative Sources

The taxonomy synthesizes the convergent prescriptions of:

  • Eric Ries — The Lean Startup (2011): Value/Growth hypotheses, pivot taxonomy, validated learning, innovation accounting, build-measure-learn loop.
  • David Bland & Alex Osterwalder — Testing Business Ideas (2019): 44-experiment library, Desirability/Viability/Feasibility/Adaptability axes, evidence-strength grading, "We believe that…" statement format.
  • Marty Cagan — Inspired (2017) & SVPG "Four Big Risks" article: Value/Usability/Feasibility/Business-Viability risk taxonomy.
  • Teresa Torres — Continuous Discovery Habits (2021): opportunity solution tree, assumption testing, weekly customer interview cadence.
  • Steve Blank — The Four Steps to the Epiphany (2005): Customer Discovery → Validation → Creation → Company Building, hypothesis testing as scientific method.
  • Tony Ulwick — What Customers Want / Outcome-Driven Innovation (2005): desired-outcome statements (devoid of solutions, measurable, controllable, stable).
  • Dave McClure — Startup Metrics for Pirates (2007): AARRR funnel taxonomy.
  • Sean Ellis — Product-market fit "very disappointed" >40% threshold (referenced in the Pinnacle Gecko Protocol's 48-hour decision rule).
  • Marc Lou — "Sell first" rule: no demand validation without payment friction tested.
  • Jeff Gothelf — Lean UX: hypothesis statement template ("We believe… for… will result in… measured by…").
  • Kromatic — Generative vs. Evaluative experiments framing, Concierge/Wizard-of-Oz canonical definitions.
  • Barry O'Reilly — Hypothesis-Driven Development: falsifiability as the entry gate to running any test.

The Pinnacle Gecko Protocol (Perea, 2026) is the operational target this taxonomy serves: minutes-to-hours cycle times, payment-validated demand, 48-hour decisions, anti-pattern refusal.


#Closing

A learning corpus is not the same as a log. A log accumulates entries. A corpus compounds knowledge. The difference is structural: forced classes, forced slots, forced extraction at close.

Your 62 experiments are a log. The schema in this document turns the next 62 into a corpus.
