perea.ai Research · 1.0 · Public draft

Vertical Corpus Moats: Building the Defensible Data Asset Beneath a Vertical Agent

Why proprietary corpus is the strongest moat post-Cowork — and the 90-day field manual to build one across legal, healthcare, insurance, accounting, CRE, and construction, validated by the 6-vertical State-of-Vertical-Agents canon (Hippocratic AI's 7,500+ clinicians + 180M patient interactions, EvenUp's $2B valuation on hundreds of thousands of PI cases, Thomson Reuters' Westlaw copyright ruling, and Tractable's vehicle-damage corpus)

AuthorDante Perea
Published7 May 2026 08:28
Length5,147 words · 23 min read
AudienceFounders building vertical AI agents and weighing model quality vs corpus depth as the source of defensibility. Operators inside vertical-AI incumbents (Harvey, Legora, EvenUp, Hippocratic AI, Abridge, OpenEvidence, Sixfold, Tractable, Trullion, Vic.ai, Real Brokerage, CoStar, Procore, Autodesk Construction Cloud) calibrating corpus strategy against incumbent advantages. Investors pricing the corpus moat into vertical-AI multiples after Cowork commoditized model quality and CaseText/EvenUp/Hippocratic precedents reset acquisition comps.
LicenseCC BY 4.0

#Vertical Corpus Moats: Building the Defensible Data Asset Beneath a Vertical Agent

#Foreword

This is the first cross-vertical operator playbook in the perea.ai/research canon, derived from the just-completed 6-vertical State-of-Vertical-Agents quarterly series — legal Q3 2026 (#16, the Anthropic Cowork SaaSpocalypse + Harvey/Legora capital concentration), insurance Q4 2026 (#17, Duck Creek Agentic Platform inflection + Three-State Compliance Test), founder-velocity meta-paper (#18, the cross-vertical operator playbook), healthcare Q1 2027 (#19, the 4-AI-native-unicorn density + Mount Sinai inflection + Five-Framework Compliance Test), accounting Q2 2027 (#20, the 44.6% CAGR + Workday Sana pre-inflection + dual-incumbent Big-4 dynamic), CRE Q3 2027 (#21, the Real Brokerage / RE/MAX $880M consolidation + 92%-piloted-only-5%-achieved implementation gap), and construction Q4 2027 (#22, the Rebar 2x-ARR-in-6-weeks velocity benchmark + structural-labor-crisis-as-forcing-function).

The frame this paper holds: in 2026, model quality is commoditized, infrastructure is rented, compliance is table-stakes — but proprietary corpus is the durable moat. Anthropic Cowork (legal Jan 30 2026) demonstrated that frontier-model capability now ships pre-bundled into every horizontal AI workspace, neutralizing model-novelty as a vertical differentiator. What the horizontal workspaces cannot index is private corpora behind contractual licensing, BAA chains, attorney-client privilege, HIPAA gating, and proprietary outcome labels. That is where the durable moat lives — and where founder-velocity in vertical AI now competes.

This paper synthesizes five canonical corpus precedents validated across the 6-vertical canon: Thomson Reuters' Westlaw editorial corpus (legally hardened by the February 2025 Thomson Reuters v. ROSS Intelligence U.S. AI copyright ruling), Hippocratic AI's Polaris Safety Constellation (7,500+ U.S.-licensed clinicians + 180M+ patient interactions + 99.90% clinical accuracy + 0.00% severe harm), EvenUp's Piai (hundreds of thousands of personal-injury cases + millions of medical visits + $2B valuation + 1,500+ PI firms generating $7B+ in claims), Tractable's vehicle-damage corpus (600% revenue growth in 24 months at $1B+ valuation), and Harvey + Legora pricing power ($1,000-$1,200/seat/month and $3,000/user/year respectively, with pricing directly attributable to corpus quality). Out of those precedents, this paper extracts a cross-vertical operator playbook covering the four corpus archetypes, the build-vs-license-vs-co-create decision matrix, the validation-panel design, the legal pitfalls, and the 90-day field manual.

The paper closes by mapping the cross-vertical 4-moat framework (corpus + workflow integration + compliance + network effects) — the universal pattern visible across all 6 verticals in the State-of-Vertical-Agents canon. Defensible vertical AI requires building two of the four moats; the strongest defensibility composite combines corpus + workflow integration + compliance.

#Executive Summary

  1. Proprietary corpus is the dominant defensibility factor in vertical-AI businesses post-Anthropic-Cowork. Cowork commoditized frontier-model quality across the horizontal workspace category — every founder now starts with Claude Sonnet 4.6 + GPT-5 + Gemini 2.0 + Mistral SLMs at parity. What does not commoditize is proprietary corpus accessed through licensing, customer-data agreements, validation-panel curation, and outcome labels accumulated over years of operator-customer interaction. The 6-vertical canon confirms this universally: legal Vault 100 firms pay Westlaw + LexisNexis pricing because of editorial corpus, not because of model quality; healthcare systems pay Hippocratic AI for Polaris-validated outputs, not for the underlying Llama or Mistral fine-tune; insurance carriers pay Tractable for vehicle-damage-corpus accuracy, not for model architecture novelty.

  2. Five canonical corpus precedents anchor the 2026 vertical-corpus-moat playbook. (a) Thomson Reuters acquired Casetext for $650M in 2023, integrated it into CoCounsel inside the Westlaw subscription, and then won the first major U.S. AI copyright ruling in February 2025 against ROSS Intelligence for unauthorized training on Westlaw headnotes — legally hardening the Westlaw editorial corpus into a defensible asset that no AI vendor can replicate without licensing. (b) Hippocratic AI's Polaris validation framework uses 7,500+ U.S.-licensed clinicians (5,969 nurses + 265 physicians early-stage, scaled to 7,500+ at Polaris 5.0), 180M+ patient interactions, 99.90% correct clinical advice, and 0.00% severe harm events — a validation panel that took 18-24 months and $5-10M to build and now powers a productivity ceiling no horizontal AI can match. (c) EvenUp's Piai is trained on hundreds of thousands of personal-injury cases plus millions of medical visits, deployed across 1,500+ PI firms generating $7B+ in claims, and was valued at $2B+ in October 2025 ($150M Series E from $1B Series D 12 months earlier — doubling valuation in one year). (d) Tractable's vehicle-damage image corpus drove 600% revenue growth in 24 months to a $1B+ valuation in insurance — corpus density (millions of crash-damage images labeled with repair-cost outcomes) is the moat. (e) Harvey at $1,000-$1,200/seat/month enterprise pricing and Legora at $3,000/user/year + $30K ACV floor — both pricing power is directly attributable to corpus quality (Harvey's BigLaw-firm-collaborative-corpus + Legora's European-jurisprudence-corpus), not model novelty.

  3. Four corpus archetypes cover the buildable corpus surface in any vertical. (a) Case-law / regulatory + treatise corpus — codified text written by the regulator or the standard-setter (Westlaw editorial, LexisNexis Practical Guidance, Federal Reserve regulations, IRS code, HHS HIPAA guidance, OSHA standards, AICPA-CIMA standards, MLS data agreements). Build pattern: license aggressively, signal scale, harden with copyright + trade-secret claims (Thomson Reuters playbook). (b) Customer-document + workflow corpus — documents and workflows generated by customers using your product (Karbon practice-management workflows, Procore RFI corpus, Karmen email-and-ERP-routing data, Articulate drawing analysis labels, FloQast close-day records). Build pattern: ship product → accumulate customer data via Terms of Service grants → scale corpus with customer count. (c) Outcome + label corpus — paired input/output records where the output is a verified business outcome (EvenUp's PI-claim-outcomes, Tractable's repair-cost outcomes, Hippocratic's clinical-accuracy outcomes, Sixfold's loss-ratio outcomes, Trullion's audit-finding outcomes, BlackLine's close-day outcomes). Build pattern: validation panel + customer-feedback loop + ground-truth verification (Polaris Safety Constellation as canonical). (d) Sensor + media corpus — non-text data captured by hardware or in-process media (Buildots 360-camera footage, OpenSpace 75K+ project visual capture, DroneDeploy aerial-and-ground site media, Tractable vehicle-damage images, Mass General Brigham clinical imaging). Build pattern: hardware-software co-design + sensor-network density + multi-customer aggregation.

  4. The build-vs-license-vs-co-create decision matrix has clear cost benchmarks. License: $0.5-50M depending on corpus scope (Westlaw editorial license: $50M+; LexisNexis subset license: $5-15M; specialty-medical-database license: $1-5M; MLS multi-jurisdiction license: $0.5-3M). Speed: 3-6 months to operational. Risk: licensor termination + price escalation + non-exclusivity (multiple competitors can license the same corpus). Co-create with customers: $1-10M in product investment to bake corpus-capture into customer onboarding + Terms of Service grants. Speed: 12-24 months to defensible scale (50-200 customers ÷ corpus-per-customer = scale). Risk: customer-IP claims + data-portability obligations + GDPR Article 22 / CCPA Right-to-Be-Forgotten. Build with validation panel: $5-15M for clinician-or-domain-expert panel + curation + ground-truth verification (Polaris template). Speed: 18-24 months to first-version-deployable. Risk: panel cost overrun + cross-vertical-applicability constraints + regulatory-credentialing gates (FDA SaMD, HIPAA, PCI-DSS, FINRA). Build via customer-IP-pooling: $0.5-3M legal + product investment + customer-cohort-formation. Speed: 24-36 months. Risk: anti-trust scrutiny on customer-IP-pools + customer-defection if pool weakens.

  5. Validation-panel design is the most under-appreciated discipline in vertical corpus building. Hippocratic AI's Polaris Safety Constellation is the canonical 2026 validation-panel template: 7,500+ U.S.-licensed clinicians (mix of nurses + physicians + pharmacists), 4-phase certification (automated benchmark → simulated calls → customer-approval → production-monitoring), 0-incident-record across 180M+ interactions. The methodology cost: $5-10M to construct + 18-24-month build cycle, and the cross-vertical applicability is direct — actuarial-validation-panel for insurance, audit-firm-review-panel for accounting, broker-licensing-panel for CRE, and trade-credentialing-panel for construction. The validation panel is simultaneously the corpus moat AND the trust-and-validation moat AND the regulatory-positioning moat — the highest-multiplier capital deployment in vertical-AI.

  6. Legal pitfalls require day-one corpus-strategy alignment. PHI (HIPAA + HITECH + state): every BAA chain across foundation-model providers + cloud + customer-EHR-integrations must validate before clinical-corpus build can scale. FERPA (education): student-data corpus requires written-disclosure + opt-out paths under 20 U.S.C. § 1232g. Attorney-client privilege: legal-corpus build via co-creation must avoid privilege-waiver — corpus access by AI vendor without explicit client consent breaks privilege. GDPR Article 6/9 (lawful processing) + Article 22 (automated decisions) + Article 17 (Right to Erasure): EU customer corpus requires lawful basis + erasure mechanisms architected into corpus storage. Section 1031 + IRC 6694(a) (accounting + CRE): tax-return corpus access requires preparer-penalty alignment. MLS data agreements (CRE): MLS jurisdictional data licenses prohibit redistribution + impose state-specific-restrictions. Specialty-trade safety standards (construction): OSHA + NECA + SMACNA + ASA standards govern safety-corpus collection. Founders who skip day-one legal alignment hit corpus-rebuild costs of $2-10M when discovered at Series B due-diligence.

  7. The 4-moat framework — corpus + workflow integration + compliance + network effects — is the universal cross-vertical pattern from the 6-vertical canon. Defensible vertical AI requires building two of these four moats; the strongest defensibility composite is corpus + workflow integration + compliance. Hippocratic AI: corpus + compliance + workflow integration (3 of 4 moats — the strongest example in 2026). EvenUp: corpus + workflow integration (2 of 4, with $2B+ valuation). Trullion: corpus + workflow integration + compliance (3 of 4). Sixfold: corpus + workflow integration (2 of 4). Tractable: corpus + workflow integration (2 of 4). Procore Agent Builder: workflow integration + network effects (2 of 4, but 2 weak moats). Karmen: corpus + workflow integration (2 of 4, scaling). Real Brokerage post-RE/MAX: workflow integration + network effects + corpus (3 of 4, strongest distribution-side moat in the 6-vertical canon).

#Part I — Why Corpus Beats Model Quality Post-Cowork

Anthropic Cowork's January 30, 2026 launch was the inflection point that closed the model-quality-as-moat era for vertical AI. Cowork ships frontier Claude Sonnet 4.6 + Opus 4.7 capability into every horizontal AI workspace, with native Slack + Teams + Salesforce + Zendesk + GitHub + Linear + Asana + Notion integration. The capability gap that vertical AI vendors used to claim — "horizontal model lacks vertical context" — closed in a single product launch. (Legal paper #16 documented this in detail.)

What Cowork (and ChatGPT Enterprise, and Microsoft 365 Copilot, and Google Gemini for Workspace) cannot index:

  • Private case-law databases behind paywall + license (Westlaw editorial, LexisNexis Practical Guidance, Bloomberg Law treatises) — the Thomson Reuters v. ROSS Intelligence February 2025 ruling legally hardened these as defensible assets that AI vendors cannot train on without licensing.
  • Patient and clinical data behind HIPAA + BAA chains (Mass General Brigham clinical records, UCSF medical imaging, Veterans Affairs longitudinal patient data) — accessible only to vendors with validated BAA chains across foundation-model providers + cloud infrastructure + customer-EHR-integrations.
  • Customer-firm proprietary documents under confidentiality (BigLaw firm matter files, Big-4 audit working papers, Top-3-broker deal-flow data, ENR Top-400 RFI archives, AICPA-CIMA member firm transaction histories) — accessible only via co-create-with-customer corpus-building motion (Karbon, FloQast, Karmen, Articulate templates).
  • Sensor and media data captured in operations (Buildots 360-camera footage, OpenSpace site capture, DroneDeploy aerial media, Tractable vehicle-damage images, Mass General Brigham radiology imaging) — accessible only via hardware-software co-design + sensor-network deployment.
  • Outcome labels generated by validation panels (Hippocratic AI's Polaris Safety Constellation 0-incident record across 180M+ interactions, EvenUp's PI-claim-outcome labels across 1,500+ firms, Tractable's repair-cost outcome verification) — accessible only via $5-15M validation-panel investment + 18-24-month build.

The result: vertical AI vendors compete on corpus depth + corpus quality + corpus accessibility, not on model quality. Pricing power is directly attributable. Harvey at $1,000-$1,200/seat/month enterprise commands that floor because of BigLaw-collaborative-corpus depth + matter-file-integration density. Legora at $3,000/user/year + $30K ACV floor commands that pricing because of European-jurisprudence-corpus depth + EU-AI-Act-compliance positioning. Hippocratic AI commands enterprise health-system pricing because of Polaris-Safety-Constellation validation outcomes. EvenUp doubled valuation from $1B to $2B+ in one year because of $7B+ in PI-claim outcomes generated across 1,500+ firms — a corpus-and-outcome scale no horizontal AI can replicate.

The frame: post-Cowork, vertical-AI investors price corpus depth into multiples directly. Founders who build corpus first, model second, win on revenue multiple uplift at Series B onward.

#Part II — The Four Corpus Archetypes

Archetype 1 — Case-law / Regulatory + Treatise. Codified text written by regulators, standard-setters, or editorial houses. Examples: Westlaw editorial, LexisNexis Practical Guidance, Federal Reserve regulations, IRS code, HHS HIPAA guidance, OSHA standards, AICPA-CIMA standards, MLS data agreements, EU AI Act Articles 9-15, GDPR Article 22, FDA 21 CFR Part 11. Defensibility mechanism: license aggressively, signal scale, harden with copyright + trade-secret claims. Canonical example: Thomson Reuters' Westlaw editorial corpus + the February 2025 ROSS Intelligence U.S. AI copyright ruling that legally hardened it. Cost: $5-50M for license; $0 for public regulatory text (but with 6-12 months of curation to make AI-actionable). Vertical applicability: legal (strongest), insurance (mid — Three-State Test corpus), healthcare (strong — FDA + HIPAA + ONC), accounting (mid — IRS + AICPA-CIMA + Section 199A), CRE (mid — RESPA + Fair Housing + state-broker), construction (mid — OSHA + Davis-Bacon + state-contractor-licensing).

Archetype 2 — Customer-Document + Workflow. Documents and workflows generated by customers using your product. Examples: Karbon practice-management workflows, Procore RFI corpus, Karmen email-and-ERP-routing data, Articulate drawing-analysis labels, FloQast close-day records, BlackLine close-automation transaction data, Trullion lease-and-SOX abstraction logs, Outlex template + lawyer-feedback corpus. Defensibility mechanism: ship product → accumulate customer data via Terms of Service grants → scale corpus with customer count. Canonical example: Procore Agent Builder's RFI Creation Agent (corpus accumulates from customer-RFI-data flows). Cost: $1-10M product investment to bake corpus-capture into onboarding. Speed: 12-24 months to defensible scale (50-200 customers × per-customer corpus = scale). Vertical applicability: universal — every vertical has customer-document + workflow accumulation potential.

Archetype 3 — Outcome + Label. Paired input/output records where the output is a verified business outcome. Examples: EvenUp's PI-claim-outcomes ($7B+ across 1,500+ firms), Tractable's repair-cost outcomes (millions of vehicle-damage assessments), Hippocratic's clinical-accuracy outcomes (180M+ patient interactions with 99.90% accuracy), Sixfold's loss-ratio outcomes (4pp improvement at top-quartile carriers), Trullion's audit-finding outcomes, BlackLine's close-day outcomes (sub-30% reduction in pilots), Tractable's 600%-24-month-revenue-growth outcome density. Defensibility mechanism: validation panel + customer-feedback loop + ground-truth verification. Canonical example: Hippocratic AI's Polaris Safety Constellation. Cost: $5-15M validation-panel construction + 18-24-month build. Vertical applicability: universal but strongest in healthcare, insurance, legal-PI, and accounting where outcome verification is contractually-anchored to renewal terms.

Archetype 4 — Sensor + Media. Non-text data captured by hardware or in-process media. Examples: Buildots 360-camera footage, OpenSpace 75K+ project visual capture, DroneDeploy aerial-and-ground site media, Tractable vehicle-damage images, Mass General Brigham clinical imaging, Roche / Genentech molecular imaging, autonomous-vehicle Waymo + Cruise driving-data corpora. Defensibility mechanism: hardware-software co-design + sensor-network density + multi-customer aggregation. Canonical example: OpenSpace's $199M raised + Disperse acquisition February 2026 = 75,000+ projects worth of 360-camera + human-verified-CV-labeling corpus. Cost: $20-100M for hardware-software co-design + sensor-network deployment. Vertical applicability: construction (strongest), insurance (vehicle-damage), healthcare (clinical imaging), CRE (drone + 360 visual capture for property ops). Lower applicability in legal, accounting, financial-services where text-and-document corpora dominate.

The four archetypes are not mutually exclusive — most $1B+ vertical-AI businesses combine 2 or 3 archetypes. EvenUp combines Archetype 2 (customer-document workflow from PI firm onboarding) + Archetype 3 (PI-claim-outcome labels). Hippocratic AI combines Archetype 1 (FDA + HIPAA regulatory corpus) + Archetype 3 (Polaris validation outcomes) + Archetype 4 (clinical-call audio for nurse-co-pilot product). Tractable combines Archetype 3 (repair-cost outcomes) + Archetype 4 (vehicle-damage images). Real Brokerage post-RE/MAX combines Archetype 2 (180K agent + property-listing workflow) + Archetype 4 (multi-MLS aggregation + property visual capture).

#Part III — Build vs License vs Co-Create: Decision Matrix

The choice of corpus-build motion is a function of vertical-corpus-archetype × time-to-market × capital-availability × regulatory-context. Founders default-pick incorrectly in 2 of 3 cases — most either over-build (trying to construct a 10-year regulatory corpus organically when a $5M license is available) or under-license (trying to scrape what's behind a $50M Westlaw paywall and ending up sued).

Build (in-house corpus from scratch). Cost: $5-50M product + curation + validation-panel investment. Speed: 18-36 months to defensible scale. Best when: corpus is novel-and-not-yet-licensable (Hippocratic Polaris); when target vertical is greenfield (Davis-Bacon-compliance-AI in construction); when validation panel is the strategic lever (healthcare, insurance, financial-services). Worst when: an established licensable corpus exists at a fraction of the build cost.

License (acquire corpus from existing aggregator). Cost: $0.5-50M depending on scope. Speed: 3-6 months to operational. Best when: an aggregator already exists (Westlaw, LexisNexis, FactSet, Bloomberg Terminal, MLS); when speed-to-market matters (founder racing to first-Procore-marketplace-ranking); when corpus quality is non-negotiable (legal-AI vendors must license Westlaw or LexisNexis to compete). Worst when: licensor termination risk + price escalation can hollow the moat (Thomson Reuters can refuse renewal); when non-exclusivity destroys the moat (multiple competitors can license the same corpus).

Co-create with customers (corpus accumulates via product use). Cost: $1-10M product investment to bake corpus capture into onboarding + Terms-of-Service-grant negotiation + customer-data-portability obligations. Speed: 12-24 months to defensible scale. Best when: customer-document + workflow archetype dominates (Karbon, FloQast, Trullion, Karmen); when product-led growth is achievable; when corpus accumulates faster as customer count grows (network effect on corpus). Worst when: customer-IP claims surface during Series B + later due-diligence (re-negotiation risk); when data-portability + GDPR Article 22 + CCPA Right-to-Be-Forgotten obligations require expensive corpus-mutation infrastructure.

Acquire (M&A-of-corpus). Cost: $50M-$1B for an established corpus-holder (Casetext-by-Thomson-Reuters $650M; Disperse-by-OpenSpace; ROSS-Intelligence-acqui-hire). Speed: 6-12 months including deal close + integration. Best when: corpus-holder is sub-scale and willing to sell; when integration risk is manageable; when acquired corpus closes a critical gap in current corpus stack. Worst when: integration-overhang depresses post-merger growth (Disperse-OpenSpace integration carry); when antitrust scrutiny applies (Big-4-acquisition-of-AI-native-vendors triggers regulatory review).

Customer-IP-pooling (corpus aggregated across a customer cohort). Cost: $0.5-3M legal + product investment + customer-cohort-formation. Speed: 24-36 months. Best when: customer cohort can pool data without breaking individual-firm-confidentiality (anonymized + aggregated outcome labels — Hippocratic clinical outcome cohort across 90+ health systems); when network effects are achievable from cohort-cross-customer-learning. Worst when: anti-trust scrutiny applies (customer-IP pools can trigger Sherman Act review); when single customer defection weakens the cohort.

The decision matrix favors license-first for case-law / regulatory corpora, co-create-first for customer-document / workflow corpora, build-first for outcome / label corpora, and acquire-or-hardware-deploy-first for sensor / media corpora. Founders who pick the right motion compress time-to-defensibility from 36 months to 12 months; founders who pick the wrong motion burn 2-3x the capital before achieving the same corpus depth.

#Part IV — Validation Panel Design: The Polaris Template

Hippocratic AI's Polaris Safety Constellation is the canonical 2026 validation-panel template, applicable cross-vertically to insurance, accounting, CRE, construction, and legal-PI verticals.

Panel composition. 7,500+ U.S.-licensed clinicians (mix of nurses, physicians, pharmacists, social workers). The original Polaris 1.0 panel was 5,969 nurses + 265 physicians evaluating 307,038 unique calls. Polaris 5.0 expanded to 7,500+ clinicians with 180M+ patient interactions. Cross-vertical translation: actuarial-validation-panel (200-500 actuaries + claim-adjusters for insurance), audit-firm-review-panel (200-500 partner-CPAs for accounting), broker-licensing-panel (300-1,000 licensed brokers for CRE), trade-credentialing-panel (200-500 NECA + SMACNA + ASA-credentialed tradespeople for construction), PI-attorney-panel (300-1,000 PI attorneys for legal-PI).

Multi-phase certification. Phase 1: Automated benchmark. Run synthetic-input benchmark against pre-trained model to establish baseline accuracy. Cost: $0.1-0.5M. Speed: 1-3 months. Phase 2: Simulated-call validation. Run validation panel against simulated end-to-end interactions (Hippocratic 307K simulated calls; insurance equivalent: 50-150K simulated claim-adjustments; CRE equivalent: 20-50K simulated lease-abstractions). Cost: $1-3M. Speed: 6-12 months. Phase 3: Customer-approval gating. Each new customer requires sign-off from the validation panel before production deployment. Cost: $0.2-0.8M per major customer. Speed: 1-3 months per customer. Phase 4: Production-monitoring. Ongoing monitoring of production interactions with validation-panel review of edge cases, failure modes, and outcome verification. Cost: $1-5M annually. Speed: continuous.

Validation outcomes. Hippocratic Polaris track record: clinical accuracy 80% (pre-Polaris) → 96.79% (Polaris 1.0) → 98.75% (Polaris 2.0) → 99.38% (Polaris 3.0) → 99.90% (Polaris 5.0). 0.00% severe harm events across 180M+ interactions. Cross-vertical translation: founders ship a comparable accuracy + safety scorecard publicly to compete on validation-panel-quality, not just model-quality.

Methodology cost: $5-10M to construct the panel + 18-24-month build cycle. Operational cost: $3-8M annually thereafter. The validation panel is simultaneously the corpus moat AND the trust-and-validation moat AND the regulatory-positioning moat — the highest-multiplier capital deployment in vertical AI. The cross-vertical translation is direct. Insurance: actuarial-validation-panel + claim-adjuster-panel mirrors the nurse-physician panel structure; outcome verification is loss-ratio-improvement-at-top-quartile vs. clinical-accuracy. Accounting: partner-CPA-review-panel mirrors the physician panel; outcome verification is close-day-reduction + audit-finding-accuracy. CRE: licensed-broker-panel mirrors the nurse-licensure surface; outcome verification is deal-cycle-compression. Construction: trade-credentialing-panel mirrors specialty-trade safety standards; outcome verification is RFI-cycle-compression + safety-incident-reduction. Legal-PI: PI-attorney-panel mirrors the legal-vetting surface; outcome verification is claim-outcome-vs-EvenUp-benchmark.

PHI (HIPAA + HITECH + state-specific privacy). Healthcare-corpus build requires validated BAA chains across foundation-model providers (Anthropic, OpenAI, Google, Mistral, Microsoft Azure, AWS Bedrock) + cloud + customer-EHR integrations. Hippocratic AI ships Claude-Sonnet-4.6-based Polaris models in HIPAA-validated architecture; foundation-model swaps require BAA re-validation. Founder-velocity hit if skipped: 9-15 months to retrofit HIPAA-validated BAA chain post-corpus-build.

FERPA (education). Student-data corpus requires written-disclosure + opt-out paths under 20 U.S.C. § 1232g. Education-AI vendors (Khan Academy AI Khanmigo, Carnegie Learning, Speak Languages) must architect FERPA-compliance into corpus capture from day one.

Attorney-client privilege. Legal-corpus build via co-creation must avoid privilege waiver — corpus access by AI vendor without explicit client consent breaks privilege. Harvey, Legora, EvenUp, Outlex all require client-consent gating in their corpus-capture workflows. Founder-velocity hit if skipped: BigLaw firms refuse corpus-share without client-consent infrastructure.

GDPR Article 6 (lawful processing) + Article 9 (special-category data) + Article 22 (automated decisions) + Article 17 (Right to Erasure). EU customer corpus requires lawful basis + erasure mechanisms architected into corpus storage. Per A-12 in the roadmap (gdpr-ccpa-agent-memory-compliance) — the corpus-erasure architecture is now a discrete sub-discipline.

Section 1031 + IRC 6694(a) (accounting + CRE crossover). Tax-return corpus access requires preparer-penalty alignment. Accounting-AI vendors that integrate with 1099 + W-2 + 1040 + 1065 + 1120 corpus must comply with IRS preparer-penalty regimes (paper #20 documented in detail).

MLS data agreements (CRE). MLS jurisdictional data licenses prohibit redistribution + impose state-specific restrictions. CRE-AI vendors must license MLS data state-by-state — a 50-state Three-State-Test-style compliance overhead (paper #21 documented in detail).

Specialty-trade safety standards (construction). OSHA + NECA + SMACNA + ASA standards govern safety-corpus collection. Construction-AI vendors that capture safety-incident data must align with state-by-state OSHA reporting (paper #22 documented in detail).

EU AI Act Article 9-15 (high-risk AI systems). Vertical-AI vendors that ship into EU customers must comply with EU AI Act high-risk-AI-system requirements by August 2, 2026 deadline. Corpus-related obligations: training-data-quality requirements (Article 10), record-keeping (Article 12), human-oversight (Article 14). (Paper A-31 documents the Five-Framework-Compliance methodology for healthcare; A-29 documents the Three-State-Test methodology for insurance.)

Founder rule: spend $0.3-1M on legal review of corpus strategy before first customer pilot. Founders who skip this step hit corpus-rebuild costs of $2-10M when discovered at Series B due-diligence — a 10-20x cost ratio penalty for late discovery.

#Part VI — The 90-Day Field Manual

Day 0-7 — Pick the corpus archetype. From Part II, pick exactly one of: case-law/regulatory, customer-document/workflow, outcome/label, sensor/media. The choice constrains build motion + cost benchmark + speed-to-defensibility.

Day 7-21 — Pick the build motion. From Part III's decision matrix: build, license, co-create, acquire, or customer-IP-pool. Anchor the choice in vertical context (legal favors license, healthcare favors build, accounting favors co-create, construction favors hardware-deploy, CRE favors customer-IP-pool, insurance favors build + outcome-label).

Day 21-45 — Legal review. Allocate $0.3-1M to legal review of corpus strategy across PHI, FERPA, attorney-client privilege, GDPR Article 22 + 17, IRC 6694(a), MLS data agreements, specialty-trade safety standards, EU AI Act Article 9-15. Document the corpus-legal-architecture in a Day-90-deliverable memo.

Day 45-60 — Validation-panel design (if Archetype 3 or 4). If outcome/label or sensor/media archetype is primary, allocate $5-15M and 18-24 months to validation-panel construction. Anchor methodology to Polaris Safety Constellation template (Part IV). Recruit 200-500 domain experts in Phase-1; budget Phase 2-4 for validation-panel-operational cost.

Day 60-75 — First-customer corpus pilot. Ship corpus-capture-bake-into-onboarding to first 5-10 paying customers. Measure corpus-per-customer + time-to-corpus-utility + customer-acceptance of Terms-of-Service grants. Iterate on corpus-capture friction + customer-acceptance gap.

Day 75-90 — Corpus-defensibility review. Quantify corpus depth (records / labels / hours / images) + corpus uniqueness (vs licensable alternatives) + corpus speed-of-growth (records per month per customer). Validate against the 4-moat framework (corpus + workflow integration + compliance + network effects). Commit to building 2-of-4 moats by Series A; 3-of-4 by Series B.

Day 90+ — Series A pitch with corpus-as-defensibility. The corpus depth + uniqueness + growth-rate is the central pitch artifact. Series A investors price corpus into multiples directly — a defensible corpus commands 1.5-2x revenue multiple uplift over a corpus-light competitor.

#Part VII — The 4-Moat Framework

The cross-vertical pattern from the 6-vertical State-of-Vertical-Agents canon: defensible vertical AI requires building two of four moats, with the strongest composite combining corpus + workflow integration + compliance.

Moat 1 — Corpus. Proprietary data accumulation via license, co-create, build, acquire, or customer-IP-pool. Documented in this paper.

Moat 2 — Workflow integration. Deep embedding into customer workflows + ERP + CRM + EHR + practice-management + ENR-Top-400 enterprise systems. Examples: Trullion-Big-4-co-deployment, Hippocratic-EHR-integration, Karmen-email-and-ERP-routing, Procore-marketplace-integration. Switching cost is the durable mechanism — once a customer integrates 5-15 workflows, replacement cost is 6-18 months of customer-side engineering + corpus-rebuild.

Moat 3 — Compliance. Regulatory certification + audit + multi-framework alignment. Examples: HIPAA + HITECH + FDA SaMD + EU MDR + EU AI Act Article 9-15 (Hippocratic AI's Five-Framework-Compliance per paper A-31); NAIC + state-by-state insurance compliance + CFPB (Sixfold's Three-State-Test per paper A-29); SOC 2 + ISO 27001 + AICPA-CIMA standards alignment (Trullion + FloQast). Compliance-as-marketed-feature commands 30-50% pricing premium (documented across papers #17, #19, #20).

Moat 4 — Network effects. Cohort-cross-customer-learning, MLP-community-density, marketplace-effects, prestige-led-distribution (paper A-26). Examples: Real Brokerage's 180K-agent platform post-RE/MAX, Procore Marketplace, Autodesk Construction Cloud Marketplace, AICPA-CIMA member network, Munich Re reinsurer-as-AI-pioneer template. Network effects are the weakest moat in vertical AI — most vertical-AI businesses cannot scale past $200-500M ARR on network effects alone, but combined with corpus + workflow integration, they multiply defensibility.

Defensible vertical AI = build any 2 of 4 moats. Strongest composite = corpus + workflow integration + compliance (3 of 4). Hippocratic AI is the canonical 3-of-4 example. EvenUp at 2-of-4 (corpus + workflow integration) achieved $2B+ valuation in October 2025. Procore Agent Builder at 2-of-4 (workflow integration + network effects) is weaker because both moats are easily replicated by Autodesk Construction Cloud's native AI investment.

The strategic implication: founders who pick a 2-of-4 moat composite that includes corpus first achieve faster Series A and higher revenue multiple uplift than founders who pick 2-of-4 without corpus. The 2026 capital concentration data validates this: Harvey ($8B valuation), Legora ($5.55B valuation), Hippocratic ($3.5B valuation), EvenUp ($2B+ valuation), Tractable ($1B+ valuation), Basis AI ($1.15B valuation) all anchor on corpus-first defensibility.

#Closing

Three furniture pieces a founder should carry away.

Pick the corpus archetype before the model. The 6-vertical canon validates that pre-Cowork "model-first vertical AI" no longer wins. Pick exactly one of the four archetypes (case-law/regulatory, customer-document/workflow, outcome/label, sensor/media), then pick the build motion (license, co-create, build, acquire, customer-IP-pool) that maps to the archetype × vertical × capital × time-to-market constraints. The choice constrains everything downstream — pricing, distribution, exit positioning.

Spend $0.3-1M on legal review of corpus strategy before first customer pilot. PHI + FERPA + attorney-client + GDPR Article 22+17 + IRC 6694(a) + MLS data agreements + specialty-trade safety + EU AI Act Article 9-15 are non-trivial corpus-strategy gating constraints. Founders who skip Day-21-45 legal review hit corpus-rebuild costs of $2-10M at Series B — a 10-20x ratio penalty for late discovery.

Build 2 of 4 moats by Series A; 3 of 4 by Series B; the strongest composite is corpus + workflow integration + compliance. Hippocratic AI's 3-of-4 composite (corpus + workflow integration + compliance) is the canonical 2026 defensibility template. EvenUp's 2-of-4 (corpus + workflow integration) achieved $2B+ valuation. Procore Agent Builder's 2-of-4 (workflow integration + network effects) without corpus is weaker — easily replicated by Autodesk Construction Cloud's native AI. The opportunity in 2026 is to walk into a vertical, license or co-create or build the corpus that horizontal AI cannot index, validate it with a Polaris-grade panel where outcome verification is the moat, and ship a 3-of-4-moat composite within 18 months. Investors price corpus into multiples directly. Founders who do this earn Harvey-Hippocratic-EvenUp-Tractable-Basis revenue-multiple-uplift trajectories. Founders who skip corpus and bet on model novelty get commoditized by Cowork's next release. The choice is no longer optional.

#References

[1] Thomson Reuters / Westlaw. (2023-2026). Casetext Acquisition $650M; Integration into CoCounsel within Westlaw Subscription.

[2] Thomson Reuters Inc v ROSS Intelligence Inc. (2025, February). First Major U.S. AI Copyright Ruling — Westlaw Editorial Headnotes Protected from Unauthorized AI Training.

[3] Hippocratic AI. (2026). Polaris 5.0 Launch — 7,500+ U.S.-Licensed Clinicians, 180M+ Patient Interactions, 99.90% Correct Clinical Advice, 0.00% Severe Harm Events; Real World Evaluation of Large Language Models in Healthcare (RWE-LLM) Framework.

[4] Hippocratic AI. (2024-2026). Polaris 1.0 → 5.0 Validation Trajectory — 80% (pre-Polaris) → 96.79% → 98.75% → 99.38% → 99.90% Clinical Accuracy.

[5] EvenUp Law. (2025-2026). Piai Personal-Injury AI Trained on Hundreds of Thousands of Cases + Millions of Medical Visits; $150M Series E at $2B+ Valuation October 2025; 1,500+ PI Firms Generating $7B+ in Claims.

[6] Tractable. (2024-2026). Vehicle-Damage Image Corpus — 600% Revenue Growth in 24 Months at $1B+ Valuation.

[7] Harvey AI. (2026). Enterprise Pricing $1,000-$1,200 per Seat per Month for BigLaw Deployments.

[8] Legora. (2026, March). $550M Series D at $5.55B Valuation; Enterprise Pricing $3,000 per User per Year + $30K Annual Contract Value Floor.

[9] Anthropic. (2026, January 30). Anthropic Cowork Launch — Frontier Claude Sonnet 4.6 + Opus 4.7 Capability into Horizontal Workspace.

[10] Sixfold. (2026, January). $52M Series B; Insurance Underwriting Loss-Ratio 4pp-Improvement-at-Top-Quartile-Carrier Outcomes.

[11] Trullion + Karbon + FloQast + BlackLine + Vic.ai. (2025-2026). Co-Created Customer-Document + Workflow Corpus — Vertical Accounting AI.

[12] Procore + Autodesk Construction Cloud + Buildots + OpenSpace + DroneDeploy. (2026). Construction-AI Sensor + Media Corpus — 75K+ Projects + 3M+ Sites + 360-Camera + Aerial-and-Ground.

[13] Mass General Brigham + UCSF + Veterans Affairs. (2024-2026). PHI + Clinical-Imaging-Corpus Examples Behind HIPAA + BAA Chains.

[14] U.S. Department of Education. (2024-2026). FERPA 20 U.S.C. § 1232g — Student-Data Corpus Privacy Requirements.

[15] European Parliament. (2024-2026). EU AI Act Articles 9-15 — High-Risk AI System Training-Data + Record-Keeping + Human-Oversight Requirements; August 2 2026 Deadline.

[16] European Parliament. (2018-2026). GDPR Article 6 + 9 + 22 + 17 — Lawful Processing + Special-Category Data + Automated Decisions + Right to Erasure.

[17] Menlo Ventures + Greylock + Insight Partners. (2026). Vertical AI Defensibility Frameworks — 4-Moat Pattern: Corpus + Workflow Integration + Compliance + Network Effects.

[18] perea.ai Research. (2026). State of Vertical Agents Q3 2026 Legal #16 + Q4 2026 Insurance #17 + Founder Velocity Field Studies #18 + Q1 2027 Healthcare #19 + Q2 2027 Accounting #20 + Q3 2027 CRE #21 + Q4 2027 Construction #22 — 6-Vertical Canon Closure.

[19] perea.ai Research. (2026). A-30 Polaris Clinical Validation Panel Methodology + A-29 Three-State-Test Compliance Methodology + A-31 Five-Framework Compliance Methodology Healthcare + A-26 Prestige-Led Distribution Playbook + A-27 Acquired-by-Platform Exit Playbook + A-32 Dual-Incumbent Dynamic Playbook + A-33 Implementation-Gap-Conversion Playbook Cross-Vertical.

[20] Crunchbase News. (2025, October). EvenUp Doubles Valuation to $2B in Series E — Legal Tech AI Funding at Record High.

perea.ai Research

One deep piece a month. Three weekly signals.

Get every B2A field report, protocol update, and benchmark from real audits — published before the rest of the market sees it. No filler. Unsubscribe in one click.