

Where AI Beats Traditional OCR — and Where It Still Needs Human Review

This report evaluates whether AI-based document extraction can replace traditional OCR for an organization processing three document workflows — across 1,497 documents and six evaluation stages.

The Challenge

The scenario: an organization processes thousands of standardized documents each month across three workflows. Two invoice workflows (Types A and B) run on an LSTM-based OCR engine with positional extraction, achieving about 96% field accuracy. A third workflow for scanned receipts (Type C) was piloted with a deep-learning OCR library but achieved only about 76% accuracy—not viable for production.

On invoices, the same types of extraction errors recur systematically. Employees must visually verify every field, correct errors, and manually transcribe failed extractions. The correction burden is persistent and costly.

On receipts, the problem is worse. Receipt layout diversity—multiple total lines, trade name versus registered entity differences, varied thermal print quality—defeats heuristic extraction. At about 76% accuracy, roughly one in four fields requires correction, and the workflow cannot be deployed. Receipts are currently processed entirely by hand.

The evaluation was designed to provide a controlled comparison between the current OCR pipeline and an AI alternative, with statistical evidence and time analysis sufficient to support a deployment decision.


Why This Kind of Testing Matters

Whether AI extracts fields more accurately than rule-based OCR is only the first question. The harder question is whether the improvement reduces review burden enough to justify the deployment complexity and ongoing operational requirements of an AI system in production.

This evaluation answers that by processing the same documents through both methods, comparing accuracy field by field, and computing the time impact of each approach at realistic volumes. It also tests whether the accuracy improvement translates into actual time savings, whether multi-implementation consensus adds value, and whether an auto-accept model can safely eliminate human review for high-confidence documents. Pre-registered predictions establish expectations before each stage, and supporting artifacts are listed in Section 12.


The Approach

Every document in the evaluation dataset was processed by both the baseline workflow and the AI workflow. This paired comparison design controls for document-level variation—image quality, legibility, layout complexity—ensuring that observed differences are attributable to the extraction method, not the document.

The evaluation proceeded through six progressive stages:

  1. Requirements and success criteria—defined what to test and how to measure it
  2. Baseline measurement—established current-state accuracy and time benchmarks
  3. AI extraction—measured accuracy improvement and processing speed
  4. Multi-implementation consensus—tested whether agreement improves accuracy
  5. Auto-accept model—identified documents safe to skip human review
  6. Integrated time analysis—computed end-to-end time savings and cost framework

Stage 1 (requirements defined) → Stage 2 (baseline measured) → Stage 3 (AI extraction evaluated) → Stage 4 (consensus tested; hypothesis failed) → Stage 5 (auto-accept model built) → Stage 6 (end-to-end time savings quantified)

Stage 2: Measuring the Baseline — The Gap That Demanded Investigation

The evaluation began by measuring current-state performance. All 1,497 documents (493 Type A, 504 Type B, 500 Type C) were processed through the existing OCR pipelines: the LSTM-based engine for invoices and the deep-learning OCR library for receipts.

Type A invoices achieved about 97% field accuracy. Type B achieved about 96%. Type C receipts managed only about 76%.

Type C’s accuracy is what motivated the AI evaluation. At about 76%, nearly one in four fields requires correction—a rate that makes automated processing impractical.

Figure: Baseline OCR accuracy across three document types. Types A and B perform adequately at ~96%; Type C at 75.7% does not meet production requirements.

Stage 3: AI Extraction Results — Better Accuracy, but Slower

Five vision-language models were evaluated across nine configurations in a pilot study. A 9-billion parameter VLM with 8-bit quantization was selected based on accuracy (97% pilot accuracy), VRAM efficiency (fits on a single consumer-grade GPU), and structured output capability.

Across all 1,497 documents, the AI workflow improved accuracy by about 9 percentage points overall. The largest improvement was on Type C: from about 76% to about 97%.

Extraction errors dropped from 683 at baseline to 127 with AI—an 81% reduction. Empty extraction failures, where the OCR returned nothing for a field, dropped from 232 to 14.

A representative example: a scanned receipt from a retail chain. The deep-learning OCR baseline extracted the company name as garbled characters and the monetary total as an incorrect numeric value—a misread name and a wrong financial figure. The VLM extracted all four fields correctly. In the baseline workflow, that receipt would have forced the employee to manually look up the company name and re-enter multiple fields from scratch. In the AI workflow, the employee would only need a quick verification pass.

AI processing takes about 5.5 seconds per document on average, compared to under 0.4 seconds for the LSTM-based engine and about 3.4 seconds for the deep-learning OCR library. More accurate, but slower—a tradeoff that the later stages address directly.

Figure: AI accuracy versus baseline for each document type. Type C shows the largest gain (+21.6pp), from 75.7% to 97.3%.

Stage 4: Consensus Testing Outcome — When the Hypothesis Failed

Three extraction implementations ran in parallel on the 997 invoices (Types A and B): the LSTM-based engine (about 94%), the VLM (about 99.6%), and a transformer-based document OCR engine (about 97%). Each field was classified by agreement pattern: GREEN (all three agree), YELLOW (two agree), or RED (all disagree).

The hypothesis was that consensus—weighted majority voting—would be more accurate than any single implementation. It was not. That was a setback: the planned mechanism for hardening extraction accuracy simply did not work. The VLM alone outperformed consensus because the two traditional OCR engines agreeing on a wrong value could outvote the correct VLM extraction.

The agreement pattern proved valuable for a different purpose: as a confidence signal. What had failed as a voting mechanism worked as a triage signal. In the observed data, when all three implementations agreed on a field value (GREEN), the VLM extraction was correct 100% of the time. For example, on a sample invoice, the LSTM-based engine extracted a total with extraneous label text, the transformer OCR engine included different surrounding formatting, and the VLM returned only the clean numeric value—three independent systems with different raw formats converging on the same financial value.

Figure: Single-implementation versus consensus accuracy. The VLM alone outperforms consensus, but the agreement pattern serves as a reliable confidence signal.

Stage 5: Auto-Accept Eligibility — The Reframe That Salvaged the Approach

Using consensus for value selection produced only 3.1% review time savings—far below the 10% target. But the agreement categories, reframed as confidence signals, enabled a more powerful strategy: auto-accepting documents where all implementations agreed, bypassing human review entirely.

At the balanced threshold (P ≥ 0.970), about 22% of invoices could be auto-accepted with over 99% precision. One false accept occurred in 220 auto-accepted documents—one document containing an error was approved without review. At production scale (10,000 documents/month with 22% auto-accept), that rate translates to roughly 10 false accepts per month.

The ultra-conservative threshold (P ≥ 0.980) produced zero false accepts in the observed data, at the cost of a lower auto-accept rate (about 6%). This is the recommended starting point. Whether that conservative auto-accept rate is enough to justify AI deployment—whether the confidence signal actually reverses the time penalty—is what Stage 6 tested.

Figure: Auto-accept rate as a function of confidence threshold. The balanced threshold (P ≥ 0.970) auto-accepts 22% of documents with over 99% precision.

Stage 6: End-to-End Time Impact — The Counterintuitive Reversal

Here the evaluation turned up a counterintuitive finding. For invoices, AI extraction alone is slower than baseline OCR. The AI processing time (about 5.7 seconds for Type A and 7.7 for Type B, versus sub-second traditional OCR) outweighs the modest review time reduction. AI-only increases total per-document time by about 6.5%.

Adding auto-accept reverses this. With AI + auto-accept, total processing time drops by about 54% for invoices combined (Type A: 61%, Type B: 48%). For Type C receipts, AI achieves about 58% time savings compared to manual entry, on a workflow that currently cannot be deployed at all.

Sensitivity analysis varied all parameters by up to 50%. Even at worst case, AI + auto-accept remained faster than baseline.

Figure: Per-document total time across scenarios for Types A and B. AI-only is slower than baseline; AI + auto-accept reduces total time by 53.6%.
Figure: Waterfall breakdown of invoice processing time. Auto-accept offsets the AI processing penalty by eliminating review for 22% of documents.

The Result

Four deployment recommendations, each traceable to specific stage evidence:

  1. Deploy Type C first. Largest accuracy improvement (+21.6 percentage points) and about 58% time savings versus manual entry. Addresses a workflow that currently cannot be deployed and does not depend on the auto-accept pipeline. Infrastructure cost scales favorably when combined with Types A/B deployment. (Stages 3 and 6)
  2. Deploy Types A and B only with auto-accept. AI extraction alone does not offset the processing time penalty. The auto-accept layer makes it viable by eliminating human review for about 22% of documents. (Stage 6)
  3. Use consensus for confidence, not value selection. The VLM alone is more accurate than weighted majority voting. The agreement pattern serves as a confidence signal only. (Stages 4 and 5)
  4. Start at the ultra-conservative threshold (P ≥ 0.980). Zero false accepts in observed data. Relax the threshold as operational confidence grows. (Stage 5)

What This Means for Deployment

Type C benefits most from AI extraction. The receipt workflow achieves 58% time savings versus manual entry and the largest accuracy improvement (from about 76% to about 97%), making it viable for production. Auto-accept is not available for receipts because the consensus pipeline requires multiple independent extraction implementations, and only the VLM supports receipt extraction—all Type C documents require human review. Infrastructure cost is shared when deployed alongside Types A/B.

Types A and B require the full pipeline. AI extraction alone is slower than baseline (+6.5% total time). Adding consensus-based auto-accept reverses this, achieving 53.6% steady-state time savings by eliminating review for high-confidence documents. The conservative starting threshold (P ≥ 0.980) produces a lower initial savings rate until production evidence justifies relaxation toward the balanced threshold. Volume thresholds depend on organization-specific labor rates and infrastructure costs—see the parametric framework in Section 9.

The parametric time model translates accuracy improvements into hours saved at any document volume. A reader-configurable cost framework provides the formulas and parameter definitions needed to produce organization-specific dollar estimates. The observed time savings are large enough to justify an organization-specific economic review; because labor rates, infrastructure choices, and baseline operating costs vary widely, this report leaves dollar conversion to the reader.


Why This Recommendation Is Low-Risk to Deploy

The AI workflow was designed to minimize operational disruption.

Same employee workflow. The review interface is identical—same side-by-side view, same correction process. Employees see better pre-filled values, not a different tool. No retraining required on the employee review workflow.

Same audit schema. AI workflow records use the same append-only audit table as the baseline. The only difference is the extraction_method field value. Historical baseline records remain accessible alongside AI records, enabling direct cross-method queries on a single table.

Reversible rollout. The transition can be reversed at any time without data loss or process changes. If the AI workflow does not meet performance targets in production, the system reverts to baseline OCR extraction.

Parallel-run validation. Before full cutover, documents can be processed by both methods simultaneously. This provides real-world validation and gives the processing team confidence before the baseline is retired.

Fallback for model failures. If the AI model is unavailable, times out, or produces malformed output, the document is flagged and recoverable—no document is silently dropped. Every upload results in either a completed processing record or an explicit error record in the audit log.

Internal processing. Document images are processed in isolated containers via internal gRPC communication. No document data is transmitted externally. Access control and data handling follow existing organizational policies.


What This Evaluation Demonstrates

  • Baseline-first evaluation. The existing system was measured before any AI comparison, not after—establishing an honest comparison target rather than testing the AI in isolation.
  • Separating model improvement from business value. AI extraction alone makes invoices slower. The evaluation identified this and built the auto-accept layer that converts accuracy gains into actual time savings.
  • Quantified review burden. Instead of treating OCR quality as an abstract percentage, the evaluation computed the time impact of every extraction error—connecting accuracy to measurable operational impact.
  • Phased deployment over blanket replacement. The recommendation deploys Type C first (largest gain, lowest risk), adds invoice auto-accept at a conservative threshold, and relaxes constraints only with production evidence.
  • Pre-registered predictions. Twenty directional predictions (five per stage, Stages 3–6) were registered before the corresponding stage ran. Four failed, and those failures were reported—not buried.
  • Transferable methodology. The six-stage evaluation structure (baseline-first, staged, pre-registered, limitation-transparent) is designed to be rebuilt on any document-extraction pipeline; specific results must always be validated on the target dataset.

Part II

Technical Detail

The narrative above intentionally simplified the underlying methodology. The following section contains the complete statistical framework, stage-by-stage analysis, and evidence supporting every finding and recommendation in Part I.

Dataset: Invoice evaluation dataset (Types A/B) + Receipt evaluation dataset (Type C) | Model: 9B-parameter VLM (8-bit quantized)

1. Executive Summary

This evaluation assessed whether AI-based document extraction can replace traditional OCR for an organization processing invoices and receipts. Across six stages, 1,497 documents were processed through both baseline (LSTM-based OCR / deep-learning OCR) and AI (vision-language model) workflows. The AI workflow improved accuracy by +9.25 percentage points overall, with the largest gain on Type C receipts (+21.6pp). When combined with a consensus-based auto-accept model, the full pipeline achieves 53.6% time savings on invoices and 58% on receipts versus manual entry.

Table T1: Stage Overview

Stage | Question Answered | Documents | Key Output
1 | What are the evaluation goals and success criteria? | n/a | Requirements, time/cost framework, evaluation design
2 | How accurate is baseline OCR? | 1,497 | Types A/B ~96%, Type C 76%
3 | Can VLM extraction improve accuracy? | 1,497 | +9.25pp overall, +21.6pp Type C
4 | Does multi-implementation consensus help? | 997 | No — VLM alone beats consensus
5 | Can we auto-accept high-confidence documents? | 997 | 22% auto-accepted at 99.55% precision
6 | What are the end-to-end time savings? | 1,497 | 53.6% A/B, 58% C vs manual

Table T2: Decision Funnel

Stage | Decision | Carried Forward
2 | Baseline measurements establish comparison target | Accuracy and time benchmarks, cost model structure
3 | VLM selected; accuracy improvement confirmed | VLM pipeline, label-stripping correction
4 | Consensus < VLM; agreement pattern is a confidence signal | Color categories (GREEN/YELLOW/RED)
5 | Auto-accept at P ≥ 0.970: 22% rate, 99.55% precision | Balanced threshold, conditional probabilities
6 | AI-only slower; AI+AA delivers 53.6% savings | Deployment recommendation, time/cost model

2. Methodology

2.1 Pipeline Architecture

Both the baseline and AI workflows follow the same three-stage DAG: validate (confirm document is processable) → extract (produce field values) → format (normalize output). The baseline uses the LSTM-based OCR engine (Types A/B) or deep-learning OCR library (Type C) with positional/pattern extraction rules. The AI workflow uses a 9B-parameter vision-language model with structured JSON output.
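
As a minimal sketch of the shared three-stage DAG (names, types, and stub values below are illustrative, not the report's code):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Document:
    doc_type: str        # "A", "B", or "C"
    image_bytes: bytes

def validate(doc: Document) -> bool:
    # Stage 1: confirm the document is processable.
    return bool(doc.image_bytes) and doc.doc_type in {"A", "B", "C"}

def extract(doc: Document) -> dict:
    # Stage 2: produce raw field values. A real implementation would call
    # the OCR engine (Types A/B) or the VLM here; this stub is illustrative.
    return {"total": " 123.45 ", "company": "ACME Corp"}

def format_fields(fields: dict) -> dict:
    # Stage 3: normalize raw values (here, just whitespace stripping).
    return {k: v.strip() for k, v in fields.items()}

def run_pipeline(doc: Document) -> Optional[dict]:
    if not validate(doc):
        return None
    return format_fields(extract(doc))
```

Both workflows share this structure; only the body of the extract stage differs between baseline OCR and the VLM.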

2.2 Evaluation Metrics

Metric | Definition
Field accuracy rate | Fraction of fields where the extracted value matches ground truth (exact or fuzzy match)
Review time | Simulated employee time to verify correct fields, correct errors, and enter missing values
Total document time | Processing time (OCR/VLM) + review time
Cost per document | Reader-configurable parametric model (Section 9.2, Appendix B): infrastructure + processing + labor + maintenance

2.3 Statistical Framework

All accuracy comparisons use the Wilcoxon signed-rank test (paired, non-parametric) with effect size measured by Cohen’s d. Confidence intervals are computed via percentile bootstrap with 10,000 iterations. Pre-registered construct validity predictions (5 per stage, Stages 3–6) provide independent confirmation that the evaluation framework measures what it intends to measure.
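
The statistical machinery above can be sketched with synthetic data (the arrays below are illustrative draws, not the study's measurements):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Illustrative paired per-document accuracy scores (not the study's data).
baseline = rng.uniform(0.7, 1.0, size=200)
ai = np.clip(baseline + rng.normal(0.05, 0.03, size=200), 0.0, 1.0)

# Wilcoxon signed-rank test: paired, non-parametric.
stat, p = stats.wilcoxon(ai, baseline)

# Cohen's d for paired samples: mean difference over SD of differences.
diff = ai - baseline
d = diff.mean() / diff.std(ddof=1)

# Percentile bootstrap CI on the mean difference, 10,000 iterations.
boot = rng.choice(diff, size=(10_000, diff.size), replace=True).mean(axis=1)
ci_low, ci_high = np.percentile(boot, [2.5, 97.5])
```

The paired design means each document serves as its own control, which is why the signed-rank test rather than an unpaired test is appropriate here.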

2.4 Review Time Model

Employee review time is simulated computationally for reproducibility. The model computes: fixed overhead + (correct fields × verification time) + (incorrect fields × correction time) + (failed fields × entry time). Field-level times vary by complexity category (short_numeric, short_text, long_text, constrained_choice, icon). See Appendix A for the full parameter table.
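
A minimal sketch of that computation (the per-category seconds and multipliers below are placeholders, not the Appendix A values):

```python
# Placeholder per-category verification times in seconds (assumed values).
VERIFY_S = {"short_numeric": 1.0, "short_text": 1.5, "long_text": 3.0,
            "constrained_choice": 1.0, "icon": 0.5}
CORRECT_MULT = 4.0      # assumed: correcting an error costs ~4x verifying
ENTRY_MULT = 6.0        # assumed: entering a missing value from scratch
FIXED_OVERHEAD_S = 5.0  # assumed per-document overhead

def review_time(fields):
    """fields: list of (complexity_category, status) pairs, with status
    in {'correct', 'incorrect', 'failed'}."""
    total = FIXED_OVERHEAD_S
    for category, status in fields:
        base = VERIFY_S[category]
        if status == "correct":
            total += base                   # quick visual verification
        elif status == "incorrect":
            total += base * CORRECT_MULT    # find and fix the error
        else:                               # 'failed': empty extraction
            total += base * ENTRY_MULT      # manual entry from the image
    return total
```

The structure mirrors the formula in the text: failed (empty) extractions dominate the cost, which is why reducing empty failures from 232 to 14 matters so much for review time.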

2.5 Document Scope

Stages 2, 3, and 6 evaluate all 1,497 documents (493 Type A + 504 Type B + 500 Type C). Stages 4 and 5 evaluate only the 997 invoice documents (Types A/B) because the consensus methodology requires three independent extraction implementations per field, and Type C (receipts) has only the VLM implementation—the LSTM-based engine and the transformer OCR engine do not support receipt extraction. This scope difference is noted in each stage section.


3. Stage 2: Baseline Measurements

Per-Type Accuracy

Document Type | Mean Accuracy | Processing (ms) | Review (ms) | Total (ms)
Type A (n=493) | 96.77% | 321.6 | 51,154.2 | 51,475.8
Type B (n=504) | 95.84% | 372.3 | 65,259.9 | 65,632.2
Type C (n=500) | 75.70% | 3,411.9 | 36,142.0 | 39,553.9

Error Type Distribution

Type | Substitution | Insertion | Deletion | Empty
A | 30 | 0 | 51 | 68
B | 108 | 0 | 93 | 79
C | 234 | 40 | 127 | 85

Baseline Time Budget (per document)

Component | Type A | Type B | Type C
Processing time | 0.32s | 0.37s | 3.41s
Review time | 51.15s | 65.26s | 36.14s
Total document time | 51.48s | 65.63s | 39.55s

These time measurements are the primary baseline data. Section 9 presents the parametric model for converting time into dollar cost estimates using configurable labor and infrastructure parameters.

Construct Validity Predictions (registered for Stage 3 evaluation)

# | Prediction | Expected Direction
1 | AI accuracy > baseline accuracy (overall) | AI > baseline
2 | AI improvement larger on long_text fields than short_numeric fields | Δ(long_text) > Δ(short_numeric)
3 | AI has fewer ‘empty’ extraction failures than baseline | AI < baseline
4 | AI review time < baseline review time | AI < baseline
5 | AI improvement on Type C > AI improvement on Types A/B | Δ(type_c) > Δ(type_a), Δ(type_c) > Δ(type_b)

Decisions Carried Forward: Baseline measurements established for paired comparison; construct validity predictions registered for Stage 3 evaluation; time and cost model structure defined for later AI comparison.

Figure: Baseline accuracy distribution across document types. The gap between Types A/B (~96%) and Type C (75.7%) motivated the AI evaluation.

4. Stage 3: AI Extraction

4.1 Model Selection

Five vision-language models were evaluated across nine model-strategy configurations. The table below shows the top 5 configurations by pilot accuracy.

Pilot Model Ranking

Rank | Model | Strategy | Pilot Accuracy
1 | VLM-A (9B, 8-bit quantized) | structured | 97.09%
2 | VLM-A (4B, 4-bit quantized) | minimal | 95.70%
3 | VLM-A (4B, 4-bit quantized) | structured | 95.70%
4 | VLM-A (9B, 8-bit quantized) | minimal | 90.77%
5 | VLM-B (11B, 4-bit quantized) | structured | 48.77%

4.2 Accuracy Comparison

Type | N | Baseline | AI | Delta | Cohen’s d | p-value | 95% CI
A | 493 | 96.77% | 99.39% | +2.62pp | 0.497 | 4.07e-14 | [1.99, 3.28]
B | 504 | 95.84% | 99.32% | +3.48pp | 0.761 | 1.74e-26 | [2.95, 4.04]
C | 500 | 75.70% | 97.30% | +21.60pp | 1.236 | 1.34e-53 | [19.55, 23.60]
Overall | 1,497 | 89.42% | 98.67% | +9.25pp | 0.718 | 5.34e-89 | [8.42, 10.12]

4.3 Error Analysis

Of the fields evaluated, 127 AI extraction errors were found (down from 683 at baseline). Label-stripping correction resolved 34 of these, leaving 93 errors.

Error Categories

Category | Count | Percentage
gt_includes_label | 51 | 40.2%
receipt_ocr_noise | 38 | 29.9%
empty_extraction | 14 | 11.0%
insertion_wrong_value | 10 | 7.9%
multiline_truncation | 8 | 6.3%
other | 6 | 4.7%

4.4 Speed Comparison

Type | Baseline Processing (ms) | AI Processing (ms) | Baseline Review (ms) | AI Review (ms)
A | 321.6 | 5,684.5 | 51,154.2 | 49,154.2
B | 372.3 | 7,735.0 | 65,259.9 | 62,244.0
C | 3,411.9 | 3,100.5 | 36,142.0 | 30,075.0

4.5 Time Impact Summary (AI-Only)

Type | Review Time Saved/Doc (s) | AI Processing Penalty (s) | Net Time Change/Doc (s)
A | 2.0 | +5.4 | +3.4 (slower)
B | 3.0 | +7.4 | +4.4 (slower)
C | 6.1 | -0.3 | -6.4 (faster)

AI-only extraction is net slower for Types A/B — the processing penalty exceeds review savings. The auto-accept layer (Stage 5) reverses this. Type C is net faster per document.

4.6 Construct Validity

# | Prediction | Result
1 | AI accuracy > baseline accuracy (overall) | PASS
2 | AI improvement larger on long_text fields than short_numeric fields | FAIL
3 | AI has fewer ‘empty’ extraction failures than baseline | PASS
4 | AI review time < baseline review time | PASS
5 | AI improvement on Type C > AI improvement on Types A/B | PASS

Decisions Carried Forward: VLM selected as extraction engine; prompt strategy confirmed (structured JSON output); label-stripping correction applied to ground truth evaluation.

Figure: Per-type accuracy comparison between baseline and AI extraction. Type C shows the largest improvement (+21.6pp), from 75.7% to 97.3%.
Figure: Error type distribution, baseline versus AI. Empty extractions — the most costly error type for employees — drop from 232 to 14 (94% reduction).
Figure: Pilot accuracy ranking across 5 model candidates. The selected VLM (9B, 8-bit quantized) with structured output achieved 97.09%, well ahead of the next configuration (95.70%).

5. Stage 4: Multi-Implementation Consensus

Three extraction implementations ran in parallel on 997 invoice documents (Types A and B only—Type C excluded because only the VLM supports receipt extraction).

Implementation Accuracy (Overall, 10,746 in-template fields)

Implementation | Overall Accuracy | 95% CI
VLM (9B, 8-bit quantized) | 99.61% | [99.49, 99.72]
Consensus (majority vote) | 99.12% | [98.94, 99.28]
Transformer OCR engine | 96.95% | [96.61, 97.26]
LSTM-based OCR engine | 93.74% | [93.25, 94.19]

Color Category Distribution (in-template fields)

Type | GREEN | YELLOW | RED
A | 44.2% | 40.6% | 15.2%
B | 40.0% | 40.2% | 19.7%

Consensus Accuracy by Category

Category | Accuracy | Field Count
GREEN | 100.00% | 4,483
YELLOW | 98.11% | 4,342
RED | 99.32% | 1,921

VLM alone (99.61%) outperformed consensus (99.12%). Majority voting with weaker implementations degrades the best single model.
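
The outvoting failure mode is easy to reproduce with a toy example. (The study used weighted majority voting; the sketch below uses a simple unweighted majority, and the extracted values are illustrative.)

```python
from collections import Counter

def majority_vote(values):
    # Return the most common extracted value across implementations.
    value, _count = Counter(values).most_common(1)[0]
    return value

# Two traditional engines agree on a wrong total; the VLM is correct.
extractions = {"lstm_ocr": "1,000", "transformer_ocr": "1,000", "vlm": "1,080"}
chosen = majority_vote(list(extractions.values()))
# The two agreeing engines outvote the correct VLM value.
```

Whenever the single best model is more accurate than its voting partners are jointly reliable, this pattern drags consensus below the best model's solo accuracy.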

Construct Validity: 3/5 PASS

# | Prediction | Result
1 | Consensus accuracy ≥ VLM accuracy (all in-template fields) | FAIL
2 | GREEN field rate higher for Type A than Type B | PASS
3 | Prioritized review time < standard review time | PASS
4 | GREEN actual accuracy > YELLOW > RED | FAIL
5 | φ(transformer_ocr, lstm_ocr) > φ(transformer_ocr, vlm) | PASS

Decisions Carried Forward: Use agreement as confidence signal (not for value selection); VLM extraction is the production path.

Figure: Accuracy comparison of individual implementations versus consensus. The VLM outperforms all alternatives, including consensus.

6. Stage 5: Auto-Accept Confidence Model

Stage 4’s consensus approach reframed: instead of selecting values by agreement, use agreement patterns as confidence signals. The prioritized-review approach from Stage 4 yielded only 3.1% time savings; the auto-accept reframing achieves 30.8%.

Conditional Probabilities (with Wilson 95% CIs)

Category | P(VLM Correct) | Field Count | 95% CI
GREEN | 1.000000 | 4,483 | [0.9991, 1.0000]
YELLOW | 0.994012 | 4,342 | [0.9912, 0.9959]
RED | 0.991671 | 1,921 | [0.9865, 0.9949]
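
The Wilson 95% intervals above can be reproduced directly. For example, the GREEN row (4,483 of 4,483 fields correct) gives a lower bound of about 0.9991:

```python
import math

def wilson_ci(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (95% for z = 1.96)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, min(center + half, 1.0)

lo, hi = wilson_ci(4483, 4483)   # GREEN row: every observed field correct
# lo ≈ 0.9991, hi = 1.0000 — matching the table's [0.9991, 1.0000]
```

Unlike a normal-approximation interval, the Wilson interval stays informative at proportions of exactly 0 or 1, which is why it suits the all-correct GREEN category.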

Threshold Selection (5 key rows from 21)

Threshold | AA Rate | Precision | False Accepts | Time Savings
0.960 | 53.36% | 96.99% | 16 | 60.2%
0.965 | 34.50% | 97.67% | 8 | 42.7%
0.970 | 22.07% | 99.55% | 1 | 30.8%
0.975 | 11.63% | 99.14% | 1 | 20.8%
0.980 | 6.22% | 100.00% | 0 | 15.4%
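
One plausible document-level scoring rule is sketched below. The report references a P_doc confidence but does not spell out its aggregation; the independence-based product rule here is an assumption for illustration. The conditional probabilities are the observed values from the table above.

```python
# Observed P(VLM correct | category) from the conditional-probability table.
P_CORRECT = {"GREEN": 1.000000, "YELLOW": 0.994012, "RED": 0.991671}

def doc_confidence(field_categories):
    # Assumed rule: P_doc = probability that every field is correct,
    # treating fields as independent (an illustrative simplification).
    p = 1.0
    for category in field_categories:
        p *= P_CORRECT[category]
    return p

def auto_accept(field_categories, threshold=0.970):
    # Skip human review only when document confidence clears the threshold.
    return doc_confidence(field_categories) >= threshold
```

Under this rule an all-GREEN document is always auto-accepted, while a handful of YELLOW or RED fields quickly pushes P_doc below the 0.970 threshold and routes the document back to human review.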

Cross-Validation: 5-fold CV confirmed threshold stability (standard deviation < 0.001 across folds).

Construct Validity: 4/5 PASS

# | Prediction | Result
1 | P(VLM correct | GREEN) = 1.0 in all CV folds | PASS
2 | Auto-accept rate higher for Type A than Type B | PASS
3 | CV precision at P ≥ 0.970 ≥ 98% | PASS
4 | P_doc calibration error < 5pp | FAIL
5 | Review time savings at P ≥ 0.970 ≥ 20% | PASS

Decisions Carried Forward: Balanced threshold P ≥ 0.970 for deployment.

Figure: Auto-accept rate versus confidence threshold. Higher thresholds accept fewer documents but with greater precision.
Figure: Precision versus confidence threshold. Precision increases sharply above P ≥ 0.970, reaching 100% at P ≥ 0.980.

7. Stage 6: Integrated Time Analysis

Scenario Definitions

# | Scenario | Description | Applies To
1 | Baseline | LSTM-based OCR / DL-OCR + full review | All
2 | AI-only | VLM + full review (no auto-accept) | All
3 | AI+Auto-Accept | VLM + auto-accept + reduced review | A/B
4 | Type C Manual | Manual data entry (no OCR) | C only
5 | Type C AI | VLM + full review | C only

Types A/B Per-Document Time (ms)

Scenario | Type A | Type B
Baseline | 51,475.8 | 65,632.2
AI-only | 54,838.7 | 69,979.1
AI+Auto-Accept | 19,985.8 | 34,306.7

Type C Per-Document Time (ms)

Scenario | Total (ms)
Baseline OCR | 39,553.9
AI extraction | 33,175.5
Manual entry | 79,000.0

Monthly Time Projections (hours/month)

Volume Tier | Type | Baseline (hrs) | AI+AA (hrs) | Savings
300/mo | A | 4.3 | 1.7 | 61.1%
300/mo | B | 5.5 | 2.9 | 47.7%
1,000/mo | A | 14.3 | 5.5 | 61.2%
1,000/mo | B | 18.2 | 9.5 | 47.7%
5,000/mo | A | 71.5 | 27.8 | 61.2%
5,000/mo | B | 91.2 | 47.6 | 47.7%

Type C Monthly Time Projections (hours/month)

Volume Tier | Manual (hrs) | AI (hrs) | Savings
300/mo | 6.6 | 2.8 | 58.0%
1,000/mo | 21.9 | 9.2 | 58.0%
5,000/mo | 109.7 | 46.1 | 58.0%

Sensitivity Analysis: Top 5 Parameters by Impact

Rank | Parameter | Swing (pp) | Range
1 | YELLOW Discount Factor | 15.1 | 44.5%–59.5%
2 | Pre-Fill Overhead Reduction | 8.2 | 47.9%–56.1%
3 | RED Discount Factor | 6.7 | 49.3%–56.0%
4 | Per-Field Verification Time | 5.4 | 48.5%–53.8%
5 | GREEN Discount Factor | 3.9 | 48.1%–52.0%

Threshold Sensitivity Sweep

Threshold | Type A AA Rate | Type B AA Rate | Combined A/B Savings
0.955 | 89.9% | 35.7% | 69.3%
0.960 | 79.9% | 27.4% | 65.4%
0.965 | 58.0% | 11.5% | 58.0%
0.970 | 39.8% | 4.8% | 53.6%
0.975 | 22.9% | 0.6% | 50.2%
0.980 | 12.6% | 0.0% | 48.6%
0.985 | 5.7% | 0.0% | 47.7%

Robustness: The minimum savings across all one-at-a-time parameter sweeps was 44.5% (YELLOW discount factor at worst case). AI + auto-accept remained faster than baseline under every variation tested.

Construct Validity: 5/5 PASS

# | Prediction | Result
CV-1 | AI+AA total time < Baseline total time (Types A/B) | PASS
CV-2 | AI+review total < Manual entry (Type C) | PASS
CV-3 | AI+AA advantage for Types A/B holds under ALL OAT parameter variations | PASS
CV-4 | Type C AI advantage over manual holds under ALL parameter variations | PASS
CV-5 | Sensitivity tornado: yellow_discount has largest swing | PASS

Decisions Carried Forward: AI+Auto-Accept confirmed as the deployment recommendation for Types A/B; AI-only insufficient (slower than baseline); Type C strongest standalone business case; all conclusions robust under parameter variation.

Figure: Per-document time comparison across scenarios for invoice types.
Figure: Waterfall showing how the processing time penalty is offset by review time savings.
Figure: Monthly time projections at three volume tiers.

8. Cross-Stage Synthesis

8.1 Metrics Consistency Reconciliation

Metric | Value A | Value B | Explanation
Overall AI accuracy | 0.9867 | 0.9876 | Pre vs post label-stripping correction. 34 of 127 AI errors resolved. Report uses 0.9867 (conservative, pre-correction)
Review time savings | 30.8% | 53.6% | Stage 5 scope: all fields (31 fields). Stage 6 scope: template-specific. Report uses 53.6%
Consensus vs VLM accuracy | 99.12% | 99.61% | VLM alone outperforms consensus; agreement reframed as confidence signal

8.2 Cumulative Construct Validity

Stage | Total | PASS | FAIL | Rate
3 | 5 | 4 | 1 | 80%
4 | 5 | 3 | 2 | 60%
5 | 5 | 4 | 1 | 80%
6 | 5 | 5 | 0 | 100%
Cumulative | 20 | 16 | 4 | 80%

Figure: Construct validity results by stage. 16 of 20 predictions confirmed (80%).

8.3 Stage-to-Stage Decision Propagation

Stage 2 established baseline measurements. Stage 3 confirmed the VLM improves accuracy but is slower. Stage 4 found that consensus did not improve on the VLM, but that agreement patterns work as confidence signals. Stage 5 converted those signals into an auto-accept model. Stage 6 integrated all components and confirmed that the auto-accept layer, not AI accuracy alone, drives the time savings case for invoices.


9. Unified Time and Cost Analysis

9.1 Time Savings Summary

The primary evaluation metric is per-document time—the sum of processing time and employee review time. Time savings are independent of labor rates and infrastructure costs, making them the universal measure of operational improvement.

Per-Document Total Time by Scenario (seconds)

Type | Baseline | AI-Only | AI+Auto-Accept | Time Saved (AI+AA)
A | 51.5 | 54.8 | 20.0 | 31.5 (61.2%)
B | 65.6 | 70.0 | 34.3 | 31.3 (47.7%)
C (vs manual) | 79.0 | 33.2 | n/a | 45.8 (58.0%)
C (vs baseline OCR) | 39.6 | 33.2 | n/a | 6.4 (16.1%)

AI-only extraction is slower than baseline for Types A/B (processing penalty outweighs review savings). The auto-accept layer reverses this by eliminating review entirely for high-confidence documents. Type C’s primary comparison is against manual entry (since the baseline OCR workflow was not viable for production).
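
The percentages above follow directly from the per-document totals; a quick arithmetic check, using the Stage 6 per-document times and the evaluation dataset volumes:

```python
# Per-document totals in milliseconds, from the Stage 6 tables.
baseline_ms = {"A": 51_475.8, "B": 65_632.2}
ai_aa_ms = {"A": 19_985.8, "B": 34_306.7}
volumes = {"A": 493, "B": 504}   # document counts in the evaluation dataset

per_type_savings = {t: 1 - ai_aa_ms[t] / baseline_ms[t] for t in baseline_ms}
combined_savings = 1 - (
    sum(volumes[t] * ai_aa_ms[t] for t in volumes)
    / sum(volumes[t] * baseline_ms[t] for t in volumes)
)
# per_type_savings: A ≈ 0.612, B ≈ 0.477; combined_savings ≈ 0.536
```

The 53.6% combined figure is a volume-weighted average of the per-type savings, so it shifts with the A/B mix in production.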

9.2 Reader-Configurable Cost Framework

This framework is provided as a planning tool. Readers should substitute their own labor rates and infrastructure costs to produce organization-specific estimates.

cost_per_doc(volume) = monthly_fixed / volume + labor_per_doc
labor_per_doc = (total_document_time_seconds / 3600) × hourly_rate

Where:

  • monthly_fixed = combined infrastructure + maintenance cost per month
  • labor_per_doc = employee time cost computed from total document time (processing wait + review)
  • volume = documents processed per month
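
The two equations above can be sketched directly. This is a minimal illustration, not the report's implementation; the sample values are the reference parameters from Appendix B ($40/hr, $4,000/month CPU baseline) and the Type A baseline time from Section 9.1 (51.5 s).

```python
def cost_per_doc(volume, monthly_fixed, doc_time_s, hourly_rate):
    """Total per-document cost: amortized fixed cost plus labor.

    labor_per_doc = (total_document_time_seconds / 3600) * hourly_rate
    """
    labor_per_doc = (doc_time_s / 3600) * hourly_rate
    return monthly_fixed / volume + labor_per_doc

# Type A baseline OCR at two volume tiers (reference parameters).
print(round(cost_per_doc(1_000, 4_000, 51.5, 40.0), 2))   # 4.57
print(round(cost_per_doc(10_000, 4_000, 51.5, 40.0), 2))  # 0.97
```

At 1,000 docs/month the $4,000 fixed cost dominates ($4.00 of the $4.57); at 10,000 docs/month labor dominates, which is why break-even behavior is volume-driven.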

9.3 Worked Example

A worked example applying the framework above to reference parameters is provided in Appendix B. Readers can substitute organization-specific values for labor rate and infrastructure cost to produce context-appropriate estimates.

9.4 Sensitivity to Labor Rate

Break-even volumes shift with labor rates. Higher rates increase per-document savings, lowering the volume required to justify infrastructure investment. A sensitivity table across a range of labor rates is provided in Appendix B alongside the worked example.

Per-document total cost versus monthly volume at reference parameters. Break-even points are marked where the AI method becomes cheaper than the baseline.

10. Final Recommendation

10.1 Per-Type Recommendation

| Document Type | Recommendation | Evidence | Operational Fit |
| --- | --- | --- | --- |
| Type C (receipts) | Deploy AI extraction (VLM + mandatory review) | +21.6pp accuracy, 58% time savings vs manual | Strongest standalone case; enables currently non-viable workflow |
| Type A (standard invoices) | Deploy AI + Auto-Accept if volume justifies | +2.6pp accuracy, 61.2% time savings | Requires auto-accept layer and conservative rollout |
| Type B (detailed invoices) | Deploy AI + Auto-Accept if volume justifies | +3.5pp accuracy, 47.7% time savings | Requires auto-accept layer and conservative rollout |

Operational fit is based on time savings evidence and deployment dependencies. Volume-dependent cost analysis is available via the parametric framework in Section 9.2 with reference parameters in Appendix B.

10.2 Selection Rationale

Combined deployment shares infrastructure cost. The GPU infrastructure serves all three document types. Deploying Types A/B and Type C together means each type shares the incremental infrastructure cost above the existing CPU baseline, improving per-type economics.

Type C benefits most from accuracy improvement. The receipt workflow has the largest accuracy gap (75.7% to 97.3%) and addresses a workflow that currently cannot be deployed. It does not depend on the auto-accept pipeline. Every receipt currently requires full manual data entry; the AI workflow eliminates 58% of that time.

Types A/B require auto-accept. AI extraction alone makes invoices slower (+6.5% total time). The auto-accept model reverses this by eliminating review for high-confidence documents. Without auto-accept, the time penalty is not recovered. With auto-accept, 53.6% steady-state time savings are achieved; whether this justifies infrastructure cost depends on volume and labor rate.

10.3 Caveats

  • Simulated review times (not validated with real employees)
  • Single VLM architecture family tested; other VLM architectures untested
  • Invoice evaluation dataset is synthetic (real invoices may have different error patterns)
  • Auto-accept model calibrated on 997 documents (50 templates); new templates require recalibration
  • Future scope capabilities (unknown-layout classification and dynamic form generation) were not evaluated

11. Limitations

Data

  • L-02 [medium]: Invoice dataset is synthetic; real invoices may have more variability.
  • L-03 [low]: Receipt dataset limited to 4 fields; real receipts may require more.
  • L-07 [medium]: Consensus analysis limited to Types A/B; Type C excluded because only one OCR implementation exists for that format.
  • L-09 [low]: GRAY fields (~65.2% of field instances) excluded from consensus.
  • L-13 [medium]: Type C excluded from auto-accept confidence model. All Type C documents always require manual review.
  • L-16 [low]: Type C manual processing time (79s) is theoretical, not measured.

Methodology

  • L-01 [high]: Review times simulated, not empirically measured. The review time model uses estimated per-field verification, correction, and entry times based on field complexity categories.
  • L-06 [medium]: Prompt strategies optimized on same data used for evaluation. No held-out test set was used.
  • L-08 [low]: Three implementations may not capture all possible error patterns.
  • L-10 [high]: Review discount factors (GREEN=0%, YELLOW=60%, RED=80%) are assumed, not empirically validated.
  • L-11 [medium]: Conditional probabilities computed and evaluated on same dataset (mitigated by 5-fold CV).
  • L-14 [medium]: Integrated time model compounds estimated parameters from multiple stages.
  • L-15 [low]: OAT sensitivity analysis; parameter interactions not modeled.

Model

  • L-04 [medium]: Single VLM evaluated (9B, 8-bit quantized); other models may differ.
  • L-12 [medium]: Independence assumption for document-level confidence may not hold for uniformly degraded scans.

Deployment

  • L-05 [medium]: Inference time depends on GPU hardware; results specific to one consumer-grade GPU.
  • L-17 [low]: Monthly time projections assume constant per-document processing time.
  • L-18 [low]: Time estimates exclude document acquisition, routing, and post-processing overhead.

Additional Limitations (Stage 7)

  • L-19 [medium]: Parametric cost model uses assumed hourly rates and GPU costs; organization-specific values will change break-even volumes. Mitigation: Report provides parametric model with adjustable inputs.
  • L-20 [high]: No real employee review validation at any stage; all time savings derived from simulated review model. Mitigation: Sensitivity analysis shows all conclusions robust to ±50% parameter variation.
  • L-21 [medium]: Auto-accept precision (99.55%) means ~1 false accept per 220 auto-accepted documents. At 10,000 docs/month with 22% auto-accept rate, this is ~10 false accepts/month. Mitigation: Ultra-conservative threshold (P ≥ 0.980) reduces false accepts to 0 in observed data.
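
The arithmetic behind L-21 can be checked directly; this is only a restatement of the figures above, assuming the 22% auto-accept rate and 99.55% precision reported in Stage 5.

```python
precision = 0.9955
false_accept_rate = 1 - precision           # 0.45% of auto-accepted documents
docs_per_false_accept = 1 / false_accept_rate
monthly_false_accepts = 10_000 * 0.22 * false_accept_rate

print(round(docs_per_false_accept))   # 222 -> the "~1 per 220" figure
print(round(monthly_false_accepts))   # 10 false accepts/month at 10,000 docs
```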

12. Artifact Index

Table A: Stage Reports

| Stage | File | Description |
| --- | --- | --- |
| 2 | results/doc_ocr_baseline/analysis/baseline_results_summary.md | Baseline OCR measurements |
| 3 | results/doc_ocr_ai/analysis/final_report.md | AI extraction evaluation |
| 4 | results/doc_ocr_consensus/analysis/final_report.md | Multi-implementation consensus |
| 5 | results/doc_ocr_autoaccept/analysis/final_report.md | Auto-accept confidence model |
| 6 | results/doc_ocr_integrated/analysis/final_report.md | Integrated time analysis |

Table B: Key Data Files

| File | Stage | Description |
| --- | --- | --- |
| paired_accuracy_comparison.json | 3 | Per-type accuracy with statistical tests |
| rescore_label_stripping.json | 3 | Post-correction accuracy analysis |
| consensus_vs_single.json | 4 | VLM vs consensus accuracy |
| auto_accept_analysis.json | 5 | Auto-accept rates and precision |
| threshold_table.json | 5 | 21 threshold options with metrics |
| scenario_time_comparison.json | 6 | 5 scenarios × 3 types time comparison |
| sensitivity_tornado.json | 6 | Parameter sensitivity ranking |
| consolidated_metrics.json | 7 | All headline metrics consolidated |

Appendix A: Review Time Model Parameters

Review Time Parameters (seconds)

The model defines five complexity categories. Three (short_numeric, short_text, long_text) are used in the current evaluation; two (constrained_choice, icon) are defined for future document types.

| Parameter | Value (s) |
| --- | --- |
| Overhead per document | 15.0 |
| short_numeric — verification | 2.5 |
| short_numeric — correction | 6.5 |
| short_numeric — entry | 10.0 |
| short_text — verification | 3.5 |
| short_text — correction | 10.0 |
| short_text — entry | 15.0 |
| long_text — verification | 5.0 |
| long_text — correction | 16.0 |
| long_text — entry | 24.0 |

Invoice Dataset Field Complexity Mapping (31 fields)

  • long_text (8): recipient addresses, entity names, payment terms, notes, and similar multi-word fields
  • short_numeric (12): monetary amounts (totals, subtotals, discounts) and tax values at various rates
  • short_text (11): dates, tax identification numbers, document numbers, seller contact details, and document titles

Receipt Dataset Field Complexity Mapping (4 fields)

  • entity name: short_text
  • address: long_text
  • date: short_text
  • monetary total: short_numeric
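
As an illustration of how these parameters combine, the sketch below computes the best-case review time for one receipt, assuming every field was extracted correctly so review is verification only. The full model (not shown) also adds correction or entry time for erroneous fields.

```python
# Per-field verification times (seconds) from the parameter table above.
VERIFY = {"short_numeric": 2.5, "short_text": 3.5, "long_text": 5.0}
OVERHEAD = 15.0  # fixed per-document overhead (seconds)

# Receipt field -> complexity category, per the mapping above.
receipt_fields = {
    "entity name": "short_text",
    "address": "long_text",
    "date": "short_text",
    "monetary total": "short_numeric",
}

# Best case: all four fields correct, so each needs only verification.
review_time = OVERHEAD + sum(VERIFY[cat] for cat in receipt_fields.values())
print(review_time)  # 29.5 seconds
```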

Appendix B: Cost Model Framework

Cost Model Equations

cost_per_doc(V) = monthly_fixed / V + labor_per_doc
labor_per_doc = (total_document_time_seconds / 3600) * hourly_rate
break_even_volume = additional_fixed_monthly / net_labor_savings_per_doc

Configurable Parameters

| Parameter | Reference Value | Description |
| --- | --- | --- |
| Employee hourly rate | $40.00/hr | Fully-loaded labor cost per hour (sensitivity: $35–$50) |
| AI deployment (monthly) | $8,000 | GPU-accelerated hosting: infrastructure + maintenance + inference (all types) |
| Baseline deployment (monthly) | $4,000 | CPU-only hosting: infrastructure + maintenance (Types A/B) |
| Manual processing (monthly) | $0 | No infrastructure; pure labor cost (Type C current state) |
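
The break-even equation can be applied to these reference parameters as a quick sketch. The inputs are the time savings from Section 9.1; the results land close to the rounded break-even figures reported below.

```python
def break_even(additional_fixed_monthly, time_saved_s, hourly_rate=40.0):
    """Monthly volume at which labor savings offset the added fixed cost."""
    net_labor_savings_per_doc = (time_saved_s / 3600) * hourly_rate
    return additional_fixed_monthly / net_labor_savings_per_doc

# Type A, AI+Auto-Accept vs baseline OCR: $8,000 - $4,000 added fixed cost,
# 31.5 s saved per document (Section 9.1).
print(round(break_even(4_000, 31.5)))   # 11429 -> reported as ~11,500/mo
# Type C vs manual entry: full $8,000 fixed cost, 45.8 s saved per document.
print(round(break_even(8_000, 45.8)))   # 15721 -> reported as ~15,800/mo
```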

Worked Example: Dollar Cost at Reference Parameters

The tables below use reference parameter values to illustrate the cost model’s output. They are not claims about expected savings for any specific organization. Reference parameters: hourly labor rate = $40.00/hr, AI deployment (GPU) = $8,000/month, baseline deployment (CPU) = $4,000/month.

Per-Document Cost at Key Volume Tiers (reference parameters)

| Type | Scenario | 1,000/mo | 5,000/mo | 10,000/mo | 20,000/mo |
| --- | --- | --- | --- | --- | --- |
| A | Baseline | $4.57 | $1.37 | $0.97 | $0.77 |
| A | AI-Only | $8.61 | $2.21 | $1.41 | $1.01 |
| A | AI+Auto-Accept | $8.22 | $1.82 | $1.02 | $0.62 |
| B | Baseline | $4.73 | $1.53 | $1.13 | $0.93 |
| B | AI-Only | $8.78 | $2.38 | $1.58 | $1.18 |
| B | AI+Auto-Accept | $8.38 | $1.98 | $1.18 | $0.78 |
| C | AI-Only | $8.37 | $1.97 | $1.17 | $0.77 |
| C | Manual | $0.88 | $0.88 | $0.88 | $0.88 |

At lower volumes, infrastructure cost dominates and AI is significantly more expensive per document. At higher volumes, infrastructure amortizes and the labor savings from AI become the deciding factor.

Break-Even Volumes (reference parameters, $40/hr)

| Type | AI-Only Break-Even | AI+AA Break-Even |
| --- | --- | --- |
| A | No break-even (AI-only is slower) | ~11,500/mo |
| B | No break-even (AI-only is slower) | ~11,600/mo |
| C | ~15,800/mo (standalone, vs manual) | n/a |

Sensitivity to Labor Rate

| Hourly Rate | Type A AA Break-Even | Type B AA Break-Even | Type C Break-Even (standalone) |
| --- | --- | --- | --- |
| $35/hr | 13,100 | 13,200 | 18,000 |
| $40/hr | 11,500 | 11,600 | 15,800 |
| $45/hr | 10,200 | 10,300 | 14,000 |
| $50/hr | 9,200 | 9,300 | 12,600 |

Disclosure: Results on this page are derived from controlled benchmarks, and are not a guarantee of performance in other environments.

© 2026 RCTK. All rights reserved. This study may not be reproduced or distributed without prior written permission.