FORAGE: Earning Trust Through Evidence

Executive Summary

Every major financial institution relies on human teams to extract critical data from documents and validate operational processes. These teams are large, the work is repetitive, and the review processes designed to catch errors are mathematically limited in ways that are rarely examined.

This paper makes three arguments:

Manual review processes have a ceiling. Mathematical analysis of maker/checker controls demonstrates that, even in the best-case blind double process, a 9 percentage point degradation in individual accuracy eliminates the entire mathematical advantage — yielding more than double the cost for zero improvement. A companion paper, The Economics of Dual-Control Processes, provides the full derivation. In the linear review processes typical of high-volume operations, where the checker's independence is compromised by anchoring to the maker's output, the breakeven threshold is higher still. The institution may be paying double for a worse result than a single accountable operator would deliver.
The barrier to AI adoption is measurement, not capability. Large language models now achieve extraction accuracy comparable to trained human operators. But institutions lack a rigorous framework for measuring, monitoring, and demonstrating that accuracy — which is the real barrier.
The solution is a modular accuracy framework, not a monolithic platform. We present FORAGE (Framework for Orchestrated AI Retrieval, Grounding & Extraction) — a set of independently deployable components that can measure human accuracy, automate extraction with calibrated confidence, validate outputs against independent context, and govern the transition from manual to automated processes with statistical rigour.

The question is no longer whether AI can extract data from documents as well as humans.

The question is whether your institution can prove it.

And the deeper question: can you prove your humans are doing it accurately?

This paper is written for two audiences. Business leaders, risk officers, and operations heads will find the strategic argument and governance framework in the first seven sections sufficient. Technology leaders seeking implementation detail will find it in the sections that follow.

1. The Hidden Cost of Human Extraction

1.1 The Scale of Manual Document Processing

Across global banking, hundreds of thousands of people spend their days reading documents and entering data into systems. The documents are varied — loan agreements running to hundreds of pages, trade confirmations arriving in dozens of formats, regulatory filings with nested cross-references — but the work follows a common pattern: read, interpret, type, check.

The scale is rarely appreciated from outside operations:

A mid-sized custodian bank processes 50,000–100,000 loan events per year, each requiring data extraction from notices, agreements, and amendments.
A global investment bank's trade support function may handle 500,000+ confirmations annually, matching and reconciling extracted terms across counterparties.
Regulatory reporting teams extract and validate data from thousands of source documents per quarter, under strict submission deadlines with material consequences for errors.

Each of these operations employs dedicated teams — often dozens or hundreds of people — performing extraction tasks that are cognitively demanding but fundamentally repetitive.

1.2 Why Errors Persist Despite Review

Financial institutions universally apply maker/checker controls to these tasks. One person extracts the data (the maker); another reviews it (the checker). This dual-control model is mandated by regulatory frameworks including SOX and embedded in operational risk policies globally.

The intuition is sound: two pairs of eyes catch more errors than one. But a detailed mathematical analysis reveals a more nuanced picture. The full derivation is presented in the companion paper, The Economics of Dual-Control Processes. The key findings are summarised here.

The Corrected Blind Double

In a corrected blind double, two operators independently extract the same data and their outputs are compared. The combined error rate is (1 − a)², where a is individual accuracy. At 99% individual accuracy, the combined result is 99.99% — an impressive improvement.

But this only holds if both operators maintain their solo accuracy. Three well-documented cognitive effects work against this: safety net recalibration, or social loafing — the robust tendency for individuals to expend less effort when working collectively (Karau & Williams, 1993); accountability diffusion, where the presence of other agents actively reduces outcome monitoring (Beyer et al., 2017); and routine attenuation, the vigilance decrement first identified in sustained attention research (Mackworth, 1948; See et al., 1995), where both operators develop the same blind spots over time, eroding the independence that the mathematics depend on. The companion paper develops this analysis in full, with detailed mathematical derivations and additional citations on anchoring bias in linear review.

The true cost also exceeds the headline 200%, because every disagreement between operators requires arbitration by a third reviewer. For each field, three outcomes are possible: both correct (no cost), both wrong with the same error (no arbitration but an undetected error — the most dangerous outcome), or disagreement (arbitration needed). The arbitration rate is approximately 2e − e², which grows rapidly as individual accuracy degrades.

Table 1: Corrected blind double — accuracy and true cost at selected degradation levels. Solo baseline: 99% accuracy at 100% cost. Full analysis in The Economics of Dual-Control Processes.
Individual Accuracy	Combined Accuracy	Arbitration Rate	True Cost	vs. Solo (99%)
99% (no degradation)	99.99%	2.0%	~202%	+0.99%
95% (moderate)	99.75%	9.8%	~210%	+0.75%
90% (severe)	99.00%	19.0%	~219%	0.00%

The Crossover Point

At 90% individual accuracy, the blind double produces exactly 99% combined accuracy — the same as a solo operator. But nearly one in five fields requires arbitration, bringing the true cost to 219%. The institution has more than doubled its cost for zero improvement.

Linear Review Is Worse

Most operations do not use a corrected blind double. They use linear review: the checker reviews the maker's output. This breaks the independence assumption that the blind double relies on. The checker is anchored by the maker's answer; their review becomes confirmatory rather than independent. The effective accuracy improvement is smaller, and the crossover to negative returns occurs at a higher individual accuracy.

The AI Alternative

The economics create a natural question: what happens when the second human role is filled by AI instead?

Replace one human with AI (Economics paper, Stage 6.2): independence is preserved (the AI extracts from the source without seeing the human's output), the second operator's cost is eliminated, and disagreements are surfaced in structured interfaces that reduce arbitration cost. Total cost is approximately 110–130% of a solo operator — the human at 100% plus structured arbitration on disagreements — compared to the blind double's 200–220%. If the AI comparison also raises human attention — operators knowing their work is compared against a system that does not tire, does not develop shortcuts, does not have favourites, and does not develop the lazy pattern analysis that routine produces — individual accuracy may actually increase.
Measured solo with AI validation (Economics paper, Stage 6.3): one accountable human, AI-powered contextual validation against reference data and historical patterns, and statistical governance determining review rates. Total cost of 105–120% for accuracy that equals or exceeds the original blind double — with superior audit evidence, and lower cost than the comparison approach because review is targeted rather than exhaustive.

Beyond these scenarios, the companion paper describes a full progressive pathway (Stages 6.1–6.5): once AI accuracy exceeds human accuracy on specific fields, roles reverse — AI becomes maker, human becomes sampled checker (Stage 6.4, approximately 25–35% of solo cost). At the final stage, a second AI fills the checker role, with human oversight retained through sampling and exception handling (Stage 6.5, approximately 5–15% of solo cost). Each transition is gated by measurement evidence and can proceed field by field, document type by document type, without requiring an all-or-nothing switchover.

Both scenarios are precisely what FORAGE enables: Phase 2 (Compare) replaces the second human with AI extraction; Phase 3 (Transition) moves to the measured-solo model with progressive autonomy.

The Core Problem: No Measurement

The critical enabler of all the degradation described above is invisibility. Nobody measures individual accuracy, checker effectiveness, or combined outcomes at the field level. The institution does not even know its true error rate — only its detected error rate, surfaced by downstream breaks and client complaints. Latent errors — a misread maturity date that won't manifest until the instrument matures, an incorrect option trigger dormant until specific market conditions, a covenant threshold that may never be tested — remain invisible. The institution cannot distinguish a high-performing four-eye process from one that is actively degrading accuracy.

The four-eye check is not inherently wrong. It is inherently unmeasured.

The first step is not to remove the checker. It is to measure what the checker is actually contributing. That measurement — and the evidence it produces — is where FORAGE begins.

1.3 The Measurement Gap

The most significant finding is not that reviews sometimes fail. It is that most institutions have no systematic way of knowing whether their review processes are working. Volume, exceptions, and cycle time are measured. Field-level accuracy, checker effectiveness, accuracy trends, and field-specific weakness almost never are.

This is the measurement gap. It is the primary barrier to AI adoption, to process improvement, and to demonstrating control effectiveness to regulators.

2. FORAGE: A Modular Accuracy Framework

FORAGE (Framework for Orchestrated AI Retrieval, Grounding & Extraction) is designed as a set of composable components. Deployed together, they form an end-to-end document intelligence engine. Deployed independently, each component solves a distinct problem that extends well beyond document extraction.

Table 2: FORAGE components. Each is independently deployable and valuable. Together they form a complete accuracy assurance platform.
Component	What It Does	Standalone Applications
Extract	Multi-model ensemble extraction from any document type via configurable JSON schemas	Document intelligence, regulatory filing extraction, contract analysis, sustainability reporting
Measure	Field-level accuracy scoring grounded in observable evidence (MQS), with calibrated confidence and three reporting lenses	Measuring human accuracy in manual processes, comparing team performance, benchmarking AI vs. human, any maker/checker quality measurement
Validate	Contextual validation against independent sources: reference data reconciliation, historical plausibility, and metadata provenance	Cross-system reconciliation, SoR consistency checking, anomaly detection in any data pipeline, post-entry plausibility screening
Govern	Formal validation governance: statistical sampling, drift detection, threshold management, progressive autonomy	Governing any automated or manual process, regulatory audit evidence, control effectiveness demonstration

Each component addresses a different layer of the problem. A useful distinction — often lost in operational frameworks — is between two types of error:

✓Errors in primary business data: the extracted value is wrong. An incorrect interest rate, a misread maturity date, a transposed notional amount. These are errors in what was produced.
✓Errors in the control architecture: the process designed to catch those errors is not working effectively. The checker misses mistakes, the sampling rate is too low, governance has drifted. These are errors in how production is supervised.

Conflating these layers — as most operational frameworks do — obscures both the root cause and the remedy. FORAGE separates them deliberately, and the deployment model follows from this separation.

The critical insight is that these components are not locked to AI extraction. Measure can be applied to a purely human maker/checker process. Validate can check any data against reference sources — whether the data was extracted by AI, entered by a human, or imported from a feed. Govern can manage any workflow where accuracy matters. This modular design means an institution can start anywhere and extend to AI extraction when the evidence supports it.

2.1 Deployment in Three Phases

The recommended deployment follows three phases, each of which delivers standalone value and each of which represents a distinct business decision with its own approval, risk profile, and stakeholders.

Phase 1: Measure

The first phase requires no AI. It does require one operational change: the review process must capture corrections at the field level, not just binary approval.

In most maker/checker operations today, the checker opens the source document alongside the extracted data, scans for errors, and either approves the record or overwrites incorrect fields. The problem is that this correction activity is largely unrecorded. The system knows that a record was approved or that certain fields changed — but it does not systematically capture which fields were corrected, what the original value was, or why the correction was made.

Measure introduces a structured review interface that captures this information. Checkers perform the same work they do today — comparing extracted data against source documents and correcting errors — but the interface records each correction as a discrete event: original value, corrected value, field identifier, source reference, and reviewer. The review task does not change; the recording of it does.

This produces structured, field-level accuracy data for the first time: which fields are being corrected, by whom, how often, and with what patterns. Wilson confidence intervals provide defensible bounds on accuracy estimates, not just point values. The output is a statistical accuracy baseline — not a single number, but a detailed profile by team, by field type, by document category, and over time.

Where systems of record exist, Measure can supplement checker-derived data with automated reconciliation. For fields that are independently recorded elsewhere — trade economics in a booking system, loan terms in a servicing platform, reference data in a master database — the extracted values can be compared programmatically. This provides an additional accuracy signal that does not depend on the checker's own accuracy, but it covers only fields with an external reference point. Interpretive fields (covenant classifications, MAE scope, regulatory categorisations) still rely on expert human review as the ground truth.

This baseline enables an immediate governance improvement: risk-based sampling. Once accuracy is measurable at the group level, review intensity can be aligned to observed risk. High-performing teams earn reduced review rates. Low-performing teams receive targeted oversight. The inverse sampling principle (Section 3) turns accuracy measurement directly into cost optimisation.

Components: Measure + Govern. Primary outcome: defensible accuracy baseline, right-sized review controls, and the evidence base for any subsequent automation decision.

Phase 2: Compare

The second phase introduces AI extraction alongside the existing human process. The human workflow continues unchanged and retains full production responsibility. AI operates in parallel with no dependency on its output.

The mechanism is specific. As documents enter the existing workflow, a copy is routed to the Extract pipeline. Extract produces structured field outputs from the same source documents. These outputs are then compared against the final human output — the post-checker corrected record — which serves as the ground truth proxy.

This comparison is field-by-field and scored by Measure using the same metrics applied to the human teams in Phase 1. The result is an auditable record: for each field type, does the AI agree with the human outcome? Where it disagrees, is the AI wrong, or did it catch something the humans missed?

A ground truth limitation must be acknowledged. When both the maker and checker miss an error, the final human record is wrong — and an AI that extracts the correct value will be scored as a disagreement. This is why discrepancies where the AI diverges from the human consensus are flagged for expert review rather than automatically scored as AI errors. Over time, this expert adjudication produces a higher-quality ground truth and a more accurate picture of both human and AI performance.

Where Validate is deployed, this phase also compares AI outputs against independent reference data — providing a second opinion that does not depend on human accuracy at all. This is particularly powerful for fields where both human and AI might make the same error (for example, misreading an ambiguous date format).

This phase transforms the AI adoption conversation from opinion to evidence. The question is no longer "do we trust AI?" but "what does the data show, field by field, measured by the same system that measured the humans?"

Components: Extract + Measure + Validate + Govern. Primary outcome: auditable human-vs-AI comparison, model selection evidence, and the business case for Phase 3.

Phase 3: Transition

The third phase begins progressive autonomy: high-confidence AI extractions are auto-accepted, medium-confidence items are statistically sampled, and low-confidence items receive full human review. Govern manages the thresholds, monitors for drift, and ensures that every step in the reduction of human oversight is backed by statistical evidence.

Validate plays a critical role in this phase. An item that passes MQS confidence thresholds and also passes contextual validation is a strong auto-accept candidate. An item that scores high on extraction confidence but fails contextual validation is escalated regardless — the system read the document correctly but the value does not match what the institution's other systems expect. This catches a category of error that extraction confidence alone cannot detect.

Critically, this transition is gradual and reversible. The system can pause at any autonomy level indefinitely. If accuracy degrades, thresholds tighten automatically and review rates increase. The evidence trail is continuous: at any point, the institution can demonstrate to regulators exactly what accuracy is being achieved, at what confidence level, with what governance.

Components: Extract + Measure + Validate + Govern. Primary outcome: material reduction in manual effort while maintaining statistical defensibility and full reversibility.

Table 3: Three deployment phases. Each is a distinct business decision with standalone value.
Phase	Decision	What Changes	Components
1. Measure	"Do we know how accurate we are?"	Review process captures field-level corrections. Accuracy becomes visible. Review effort aligns to observed risk.	Measure, Govern
2. Compare	"Can AI match our best teams?"	AI extracts in parallel. Outputs compared field-by-field. Contextual validation provides independent cross-check.	Extract, Measure, Validate, Govern
3. Transition	"Where can we reduce manual effort?"	Progressive autonomy: evidence-gated, statistically governed, fully reversible. Validate gates auto-acceptance.	Extract, Measure, Validate, Govern

Phase 1 sits in the control architecture layer: it measures whether the existing review process is effective and aligns its cost to observed risk. Phases 2 and 3 sit in the data accuracy layer: introducing AI extraction and progressively reducing human involvement as the evidence justifies.

This sequence is recommended but not mandatory. Each component is independently deployable. An institution that already has confidence in its human accuracy data can begin with Extract. One that needs to demonstrate control effectiveness to regulators can deploy Govern on existing processes. One that has reference data it wants to leverage immediately can deploy Validate as a standalone reconciliation layer. The phased approach provides the strongest evidence base for each subsequent step.

3. Measure: Establishing the Accuracy Baseline

Before an institution automates anything, it needs to answer a question that surprisingly few can: how accurate are our people?

Measure applied to manual processes provides the baseline that makes every subsequent decision evidence-based. Without it, an institution cannot set meaningful targets for AI, cannot identify which processes to automate first, and cannot demonstrate to regulators that a process change — human or automated — has improved outcomes.

3.1 How Field-Level Measurement Works

In any operations team of scale, accuracy varies between individuals, between teams, between offices, and between shifts. This variation is almost never measured because the maker/checker process treats all output as either "passed" or "failed" — a binary that discards the rich information contained in the pattern of errors.

Measure instruments the existing review process to capture field-level accuracy data. Over time, this reveals:

Team-Level Accuracy Profiles

Which teams are consistently high-performing? Which are struggling? The data may confirm intuitions — or overturn them.

Field-Specific Weakness

Perhaps one team is excellent at extracting dates and amounts but weak on covenant classifications. Targeted training becomes possible.

Temporal Patterns

Does accuracy degrade on Fridays, at month-end, during staff shortages? The data reveals operational risk that was previously invisible.

3.2 The Inverse Sampling Principle

Once accuracy is measurable at the group level, a powerful economic principle emerges: the relationship between sampling rate and accuracy can be inverted.

The Inverse Sampling Principle

In traditional maker/checker, every item is reviewed regardless of the maker's track record. This means a team with 99.5% accuracy receives the same review burden as a team with 92% accuracy — the same cost for radically different risk profiles.

Measure enables evidence-based sampling: high-accuracy teams earn lower sampling rates. Low-accuracy teams receive higher sampling rates. Accuracy drives economics.

This creates a virtuous cycle with three effects:

1.Cost reduction for high performers. A team demonstrating sustained 99%+ accuracy at the field level moves from 100% review to statistical sampling — perhaps 10–20% review. The review cost drops proportionally.
2.Targeted intervention for low performers. A team showing 85% accuracy on a specific field type receives increased review and targeted training — resources applied where they have maximum impact.
3.Positive incentive alignment. Teams know that sustained accuracy earns reduced oversight. The cognitive effect shifts from complacency ("someone will catch it") to ownership ("my accuracy record determines my review burden").

This is not theoretical. It is a direct application of the maker/checker analysis in Section 1: when individual accountability replaces diffused responsibility, maker accuracy rises and checker effectiveness improves. Measurement creates the conditions under which four-eye checks actually deliver on their promise.

3.3 Building the Case for Automation

Measure applied to human processes creates the evidence base that makes AI adoption safe. Instead of asking "is AI good enough?" (an unanswerable question without human baseline data), the institution can ask: "does the AI, measured by the same framework and the same metrics, perform at or above the level of our best human teams?"

This reframes the conversation from a leap of faith to a data-driven comparison. It also provides the regulatory evidence trail: we measured our humans, we measured our AI, and here is the auditable comparison at the field level.

4. Extract: Multi-Model Ensemble Intelligence

When the institution is ready to introduce AI extraction — whether alongside humans or as a phased replacement — Extract provides the engine.

4.1 Schema-Driven Universality

Extract processes any document type through a JSON extraction schema that defines fields, their expected types, validation rules, and extraction hints. Changing the schema changes the document type — no code modifications required. A single deployment can process loan agreements, trade confirmations, sustainability reports, and regulatory filings using different schema configurations.

4.2 Why Ensemble Matters

FORAGE does not rely on a single AI model. Every document is processed by two or more independent extraction models — which may include locally-deployed open-source models and cloud-based API models — and their outputs are compared field by field.

This is analogous to the corrected blind double process from the maker/checker framework, but with a critical difference: where human reviewers suffer cognitive degradation under routine supervision, AI models do not experience complacency, fatigue, or diffusion of responsibility. Their error patterns are independent and statistically characterisable.

The ensemble approach transforms the problem. Instead of asking "is the AI right?" (which requires ground truth we do not have), we ask "do multiple independent AIs agree?" (which is directly observable). Agreement across independent models is a strong, auditable signal of correctness.

4.3 Model Flexibility: Local and API

The framework supports both locally-deployed open-source models (running entirely on-premise with zero data exfiltration) and commercial API models. Specialised extraction models such as NuExtract 2.0 (8 billion parameters, open-source, MIT-licensed) now outperform general-purpose frontier models on structured extraction while running on standard GPU hardware. The framework benchmarks every combination, showing the exact cost-accuracy trade-off of each configuration.

4.4 Benchmarking Against Academic Standards

FORAGE includes a built-in benchmarking harness that evaluates extraction against established datasets including MAUD (Merger Agreement Understanding Dataset — 47,000+ expert annotations across 92 legal deal point questions), CUAD (Contract Understanding Atticus Dataset), and DocVQA (document visual question answering). This produces per-field accuracy breakdowns, cost-accuracy frontier analysis, and confidence calibration verification.

5. Model Quality Score: Replacing Self-Reported Confidence

Both Extract and Measure use the Model Quality Score (MQS) — a composite measure built from independent, observable signals rather than self-reported certainty. This applies whether the "model" is an AI or a human team.

5.1 The Five Signals

1. Inter-Model Agreement

Do multiple independent extractors (AI or human) produce the same value? This is the strongest signal.

2. Source Grounding

Can the extracted value be traced to specific text in the source document?

3. Schema Validation

Does the value conform to expected types, ranges, and formats?

4. Historical Field Accuracy

For this field type, what is the extractor's track record on previously reviewed items?

5. Extraction Difficulty

How inherently complex is this field? Structured fields (dates, amounts) score higher than interpretive fields (legal classifications).

The MQS is calibrated against actual accuracy outcomes using isotonic regression, ensuring that an MQS of 0.95 means the system has historically been correct 95% of the time at that score. This calibration evolves through a maturity roadmap from heuristic weights through empirical tuning to Bayesian updating.

5.2 The Three Lenses: Honest Accuracy Reporting

FORAGE distinguishes between items that have been actively verified and items that have merely been accepted without review, using three reporting lenses:

Table 4: The three accuracy lenses prevent a common failure: presenting accuracy figures derived mainly from easy cases while excluding harder ones.
Lens	Definition	Use
Conservative	Count only items actively validated. Treat all unreviewed items as potentially incorrect.	Floor estimate. Minimum defensible accuracy.
Optimistic	Treat all unreviewed items accepted at high confidence as correct.	Ceiling estimate. Maximum plausible accuracy.
Expected	Use calibrated MQS probabilities for unreviewed items.	Best estimate. The most statistically honest view.

6. Validate: Contextual Intelligence

Extract reads the document. Measure scores how well it was read. But neither asks a critical follow-up question: does the extracted value make sense given everything else we know?

Validate provides that question. It compares extracted outputs — whether produced by AI, entered by a human, or imported from a feed — against independent contextual sources. Where Measure asks "was the extraction accurate?", Validate asks "does the result make sense in context?"

These are distinct questions with distinct failure modes. A value can be extracted perfectly and still be wrong — the document itself may contain an error. A value can pass every schema validation rule and still be implausible — the format is correct but the number is anomalous. Validate catches what extraction confidence cannot.

6.1 Three Types of Contextual Validation

Reference Reconciliation

The extracted value is compared against an independently recorded value in a system of record. The settlement amount on a drawdown notice should match the calculation derived from facility terms in the loan servicing platform. The counterparty LEI should resolve in the GLEIF database. The ISIN should match the instrument description in the securities master.

This is the strongest form of validation: an independent source provides ground truth. Where reference data exists, disagreement is a high-signal event that warrants investigation regardless of how confident the extraction was.

Historical Plausibility

When no independent reference value exists, the extracted value is compared against historical patterns. A new trade at 450 basis points when the same counterparty's last twenty trades averaged 180 basis points is not necessarily wrong — but it is unusual enough to warrant a second look. An interest rate of 12% on a senior secured facility in a low-rate environment is structurally valid but statistically implausible.

Historical plausibility uses entity-level and population-level baselines to identify outliers. Unlike rule-based thresholds (which require manual calibration and generate high false-positive rates), plausibility scoring adapts to the institution's own data patterns and evolves as those patterns change.

Metadata Context

The document's provenance provides validation signals independent of its content. A loan amendment arriving from an unfamiliar email domain, a confirmation with a timestamp outside business hours for the relevant market, a document routed through an unexpected workflow path — these metadata signals do not prove an error, but they adjust the prior probability that the content deserves closer scrutiny.

Metadata validation is particularly valuable in high-volume environments where adversarial or erroneous documents may enter the processing pipeline. It functions as an early warning layer before content extraction even begins.

6.2 The Decision Matrix: MQS Meets Validation

When Validate operates alongside Measure, each extracted field produces two independent assessments: an MQS score (extraction confidence) and a validation outcome (contextual plausibility). The combination creates four distinct scenarios:

Table 5: The four-outcome decision matrix. The high-MQS / validation-fail quadrant is uniquely valuable — it catches errors invisible to extraction confidence alone.
MQS Score	Validation	Interpretation	Action
High	Pass	We read it correctly and it matches context.	Auto-accept candidate.
High	Fail	We read it correctly but it does not match context. Genuine change, document error, or SoR error.	Escalate for investigation.
Low	Pass	Extraction uncertain, but the value aligns with context.	Spot-check. Likely correct.
Low	Fail	Low confidence all round.	Full human review.

The high-MQS / validation-fail quadrant deserves emphasis. This is the case where the extraction is confident — all models agree, the value is well-grounded in the source text — but the extracted value does not match what the institution's other systems expect. This could indicate a genuine change (the counterparty has restructured), a document error (the notice contains a typo), or a system-of-record error (the reference data is stale). In all three cases, the discrepancy warrants investigation. Without Validate, this category of issue passes through the pipeline undetected.

6.3 Validate as a Standalone Component

Validate does not require Extract or Measure to deliver value. Any data pipeline that has access to reference data or historical patterns can deploy Validate independently:

Post-entry reconciliation: Human-entered data checked against systems of record. Catches keying errors, transposition mistakes, and stale reference data without changing the entry process.
Feed validation: Incoming data feeds from counterparties, market data providers, or internal systems checked for contextual plausibility before processing. Catches transmission errors, format drift, and anomalous values.
Migration assurance: During system migrations, Validate can compare extracted values from the legacy system against the target system's reference data, catching mapping errors and transformation bugs at the field level.
Regulatory filing cross-check: Reported values checked against internal calculations and market data. Catches reporting errors before submission.

In each case, Validate operates as a contextual plausibility layer that is agnostic to how the data was produced. The data could come from AI extraction, manual entry, system-to-system feeds, or legacy migration — the validation logic is the same.

6.4 Why This Is Different

Institutions routinely perform reconciliation. What makes Validate distinct is the integration of three capabilities that are rarely combined:

Multi-source triangulation. Rather than checking one value against one reference, Validate can compare an extracted field against multiple independent sources simultaneously — the system of record, historical patterns, and metadata signals — producing a composite plausibility score.
Integration with extraction confidence. The four-outcome decision matrix means Validate does not operate in isolation. It amplifies the value of MQS by providing an orthogonal quality signal. The two assessments are deliberately independent: MQS measures how well the document was read; Validate measures whether what was read makes sense.
Adaptive baselines. Historical plausibility thresholds are not static rules. They evolve as the institution's data evolves, reducing false positives over time without manual recalibration. Entity-level baselines mean that what is normal for one counterparty is not assumed to be normal for another.

The combination is greater than the sum. Extraction confidence tells you that the system read the document correctly. Contextual validation tells you that what the document says is consistent with reality. Both are necessary; neither alone is sufficient.

7. Govern: The Statistical Foundation

Govern provides the governance framework that transforms any of the preceding components from technology solutions into auditable controls.

7.1 Progressive Autonomy

Whether applied to AI extraction or human processes, Govern manages autonomy levels:

Table 6: Autonomy levels apply equally to AI extraction and human processes with inverse sampling. Level 0 to Level 4 represents a 97.25% reduction in human effort.
Level	Name	Description	Human Involvement
0	Manual Baseline	Traditional maker/checker: one person extracts, another reviews every item	200%
1	AI-Assisted	AI suggests, humans review every field	100% (review)
2	Selective Automation	High-confidence items auto-accepted. Medium and low reviewed.	30–50%
3	Adaptive Autonomy	Thresholds adjust based on observed accuracy.	10–25%
4	Supervised Autonomy	Majority auto-accepted. Humans review samples and exceptions.	5–10%

The first reduction — from Level 0 (200%) to Level 1 (100%) — removes the checker. As the analysis in Section 1 demonstrates, in unmanaged four-eye processes the second human frequently degrades accuracy rather than improving it — while doubling cost. But this reduction is only justified after Phase 1 has established the accuracy baseline: you must first demonstrate the checker's actual contribution (positive or negative) before acting on the evidence. Once that data exists, the case is often compelling.

Subsequent reductions from Level 1 through Level 4 replace human review with statistical sampling, and each step requires accumulated accuracy evidence to justify.

Critically, the system moves bidirectionally. If accuracy degrades, thresholds tighten and review rates increase automatically.

7.2 Formal Sampling

Minimum validation rates are computed using statistical sampling theory with Wilson confidence intervals. Unlike simple percentage estimates, Wilson intervals account for the fact that accuracy is bounded between 0% and 100%, producing reliable estimates even with small sample sizes or extreme accuracy rates — both of which are common in financial document processing where error rates may be below 1%.

The Wilson interval provides a lower bound on estimated accuracy at a specified confidence level. For example, if 200 extracted fields are sampled and 197 are correct, the observed accuracy is 98.5%. But the Wilson 95% lower bound is 95.8% — meaning we can state with 95% confidence that true accuracy is at least 95.8%. This lower bound is what determines whether a process qualifies for reduced review rates: the threshold must be cleared not by the point estimate, but by the confidence interval. This prevents a small run of good results from prematurely reducing oversight.

Sample sizes are calculated to achieve a specified margin of error at the required confidence level. Higher confidence demands larger samples. Tighter margins demand larger samples. The framework computes the minimum sample size needed for each accuracy tier, ensuring that validation effort is proportionate to the precision required for governance decisions.

Stratified selection combines MQS-weighted priority (items near decision boundaries are most informative) with a random floor (preventing selection bias from over-representing hard cases). This ensures accuracy estimates reflect the full population, not just the items the system was least certain about.

7.3 Drift Detection

Three types of drift are monitored: acceptance drift (is the auto-acceptance rate changing?), distribution drift (have the characteristics of incoming items changed?), and outcome drift (is the accuracy of reviewed items degrading?). Statistical control charts flag when performance exits normal operating bounds.

8. The Business Case

8.1 Multiple Entry Points

The modular architecture means an institution does not need to commit to the full platform. Each component delivers standalone value:

Table 7: Entry points. Start where the need is greatest.
Starting Point	Component	Immediate Value	Builds Toward
"We don't know our current accuracy"	Measure	Field-level accuracy baselines, team comparison, evidence for regulators	Data-driven case for AI extraction or process change
"Our data doesn't match our reference systems"	Validate	Cross-system reconciliation, anomaly detection, SoR consistency assurance	Evidence base for automated extraction, feed quality improvement
"We want to automate extraction"	Extract + Measure	AI extraction with measurable, calibrated accuracy assurance	Progressive autonomy, human review reduction
"We need to prove our controls work"	Govern	Auditable sampling, drift detection, control effectiveness evidence	SR 11-7 compliance, regulatory examination readiness

8.2 The Inverse Sampling Economics

For institutions that have completed Phase 1 and deployed Measure on their manual processes, the inverse sampling principle creates a direct, quantifiable cost reduction:

Before Measurement

A team of 40 extractors reviewed 100% by 20 checkers: 60 FTEs total.

After Measurement

After measurement reveals three accuracy tiers: 15 high-accuracy extractors (10% sampling), 15 medium (40% sampling), 10 low (100% review + training). Checker requirement drops to approximately 10 FTEs — a 50% reduction in review cost, with better risk targeting.

The cost saving is real and immediate. But the greater value is the creation of a measurement culture: accuracy becomes visible, improvable, and directly linked to operational economics.

8.3 Risk Considerations

Model risk: The framework operates within SR 11-7 principles. Ongoing monitoring, performance documentation, and threshold governance provide the evidence trail.
Client choice of models: The framework is model-agnostic. Institutions select the models that meet their security, performance, and cost requirements — from fully on-premise open-source models to commercial API services — with comparative benchmarking across all options.
Transition risk: Progressive autonomy ensures human oversight is maintained throughout. The system can pause at any level indefinitely.

9. Technical Architecture

This section provides additional detail for architecture and technology teams.

9.1 Component Integration

When deployed together, the components form a pipeline: Extract handles document ingestion and multi-model extraction. Measure scores each extracted field using MQS. Validate checks extracted values against independent reference data, historical patterns, and metadata context. Govern manages the routing, sampling, and threshold decisions. Each component communicates through well-defined interfaces, and any component can be replaced or bypassed.

9.2 Model Abstraction

All extraction models — local or API — conform to a unified interface:

Table 8: Supported model families. The framework is model-agnostic.
Model	Type	Parameters	Strength
NuExtract 2.0	Local (open-source)	2B / 8B	Highest extraction precision. Zero marginal cost.
Qwen2.5-VL	Local (open-source)	7B / 72B	Vision-language. Processes scanned documents directly.
Claude Sonnet	API (Anthropic)	—	Strong reasoning. Effective as adjudicator.
GPT-4.1	API (OpenAI)	—	Fast, cost-effective for high-volume extraction.

9.3 MQS Calibration Roadmap

Table 9: MQS calibration evolves as the system accumulates data.
Version	Method	When
MQS v0	Heuristic weights (expert-assigned)	Initial deployment
MQS v1	Empirically tuned from 200+ validated extractions	After initial review phase
MQS v2	Isotonic regression mapping MQS to observed P(correct)	After 500+ validations per field
MQS v3	Bayesian updating with drift-adjusted priors	Steady-state production
MQS v4	RL-optimised weights minimising review cost subject to accuracy floor	Mature deployment

9.4 Deployment Options

Table 10: Deployment scenarios. Human measurement, reconciliation, and governance require no GPU infrastructure.
Scenario	Infrastructure	Components	Data Sovereignty
Human Measurement Only	Application server	Measure + Govern	Full (no AI models)
Reconciliation Overlay	Application server	Validate	Full (no AI models)
AI Extraction POC	GPU server (24GB+)	Extract + Measure	Partial (API calls)
Full Platform	Multi-GPU + application	All components	Configurable
Air-Gapped Production	GPU cluster only	All (local models)	Full (zero exfiltration)

Conclusion

The challenge facing financial institutions is not whether AI can extract data from documents. The evidence is clear that it can. The challenge is trust. And trust requires measurement.

The current state — large teams of human extractors reviewed by large teams of human checkers, with no systematic measurement of field-level accuracy — is expensive, mathematically limited, and invisible to governance.

FORAGE addresses this at every level:

Measure

Makes accuracy visible — whether the extractor is human or AI — and enables the inverse sampling principle that rewards accuracy with lower cost.

Extract

Provides AI extraction that is schema-driven, model-agnostic, and benchmarked against academic standards.

Validate

Adds contextual intelligence — checking extracted values against reference data, historical patterns, and metadata signals to catch errors that extraction confidence alone cannot detect.

Govern

Provides the statistical foundation that satisfies regulators, risk teams, and audit — ensuring every claim of accuracy is backed by auditable evidence.

The phased deployment model — measure, compare, transition — ensures that each step is evidence-based and reversible. The first phase requires no AI at all: simply measuring human accuracy with statistical rigour and using the results to right-size review controls. The final phase is not full automation but supervised autonomy, where human oversight is proportionate to demonstrated accuracy and every governance decision is backed by auditable confidence intervals.

The result is a 97.25% reduction in human effort from traditional maker/checker — where the first half of that reduction is justified by Phase 1 measurement data revealing the checker's actual contribution (which, in unmanaged four-eye processes, is frequently negative), and the second half is earned through accumulated accuracy evidence at each subsequent level.

FORAGEEarning Trust Through Evidence

Executive Summary

1. The Hidden Cost of Human Extraction

1.1 The Scale of Manual Document Processing

1.2 Why Errors Persist Despite Review

The Corrected Blind Double

Linear Review Is Worse

The AI Alternative

The Core Problem: No Measurement

1.3 The Measurement Gap

2. FORAGE: A Modular Accuracy Framework

2.1 Deployment in Three Phases

Phase 1: Measure

Phase 2: Compare

Phase 3: Transition

3. Measure: Establishing the Accuracy Baseline

3.1 How Field-Level Measurement Works

Team-Level Accuracy Profiles

Field-Specific Weakness

Temporal Patterns

3.2 The Inverse Sampling Principle

3.3 Building the Case for Automation

4. Extract: Multi-Model Ensemble Intelligence

4.1 Schema-Driven Universality

4.2 Why Ensemble Matters

4.3 Model Flexibility: Local and API

4.4 Benchmarking Against Academic Standards

5. Model Quality Score: Replacing Self-Reported Confidence

5.1 The Five Signals

1. Inter-Model Agreement

2. Source Grounding

3. Schema Validation

4. Historical Field Accuracy

5. Extraction Difficulty

5.2 The Three Lenses: Honest Accuracy Reporting

6. Validate: Contextual Intelligence

6.1 Three Types of Contextual Validation

Reference Reconciliation

Historical Plausibility

Metadata Context

6.2 The Decision Matrix: MQS Meets Validation

6.3 Validate as a Standalone Component

6.4 Why This Is Different

7. Govern: The Statistical Foundation

7.1 Progressive Autonomy

7.2 Formal Sampling

7.3 Drift Detection

8. The Business Case

8.1 Multiple Entry Points

8.2 The Inverse Sampling Economics

Before Measurement

After Measurement

8.3 Risk Considerations

9. Technical Architecture

9.1 Component Integration

9.2 Model Abstraction

9.3 MQS Calibration Roadmap

9.4 Deployment Options

Conclusion

Measure

Extract

Validate

Govern

References

About Prospect 33

FORAGE
Earning Trust Through Evidence