Where AI Beats Traditional OCR — and Where It Still Needs Human Review
This report evaluates whether AI-based document extraction can replace traditional OCR for an organization processing three document workflows — across 1,497 documents and six evaluation stages.
The Challenge
The scenario: an organization processes thousands of standardized documents each month across three workflows. Two invoice workflows (Types A and B) run on an LSTM-based OCR engine with positional extraction, achieving about 96% field accuracy. A third workflow for scanned receipts (Type C) was piloted with a deep-learning OCR library but achieved only about 76% accuracy—not viable for production.
On invoices, the same types of extraction errors recur systematically. Employees must visually verify every field, correct errors, and manually transcribe failed extractions. The correction burden is persistent and costly.
On receipts, the problem is worse. Receipt layout diversity—multiple total lines, trade name versus registered entity differences, varied thermal print quality—defeats heuristic extraction. At about 76% accuracy, roughly one in four fields requires correction, and the workflow cannot be deployed. Receipts are currently processed entirely by hand.
The evaluation was designed to provide a controlled comparison between the current OCR pipeline and an AI alternative, with statistical evidence and time analysis sufficient to support a deployment decision.
Why This Kind of Testing Matters
Whether AI extracts fields more accurately than rule-based OCR is only the first question. The harder question is whether the improvement reduces review burden enough to justify the deployment complexity and ongoing operational requirements of an AI system in production.
This evaluation answers that by processing the same documents through both methods, comparing accuracy field by field, and computing the time impact of each approach at realistic volumes. It also tests whether the accuracy improvement translates into actual time savings, whether multi-implementation consensus adds value, and whether an auto-accept model can safely eliminate human review for high-confidence documents. Pre-registered predictions establish expectations before each stage, and supporting artifacts are listed in Section 12.
The Approach
Every document in the evaluation dataset was processed by both the baseline workflow and the AI workflow. This paired comparison design controls for document-level variation—image quality, legibility, layout complexity—ensuring that observed differences are attributable to the extraction method, not the document.
The evaluation proceeded through six progressive stages:
- Requirements and success criteria — defined what to test and how to measure it
- Baseline measurement — established current-state accuracy and time benchmarks
- AI extraction — measured accuracy improvement and processing speed
- Multi-implementation consensus — tested whether agreement improves accuracy
- Auto-accept model — identified documents safe to skip human review
- Integrated time analysis — computed end-to-end time savings and cost framework
Stage 1: Requirements and success criteria defined
Stage 2: Baseline accuracy measured
Stage 3: AI extraction evaluated
Stage 4: Consensus tested (hypothesis failed)
Stage 5: Auto-accept model built
Stage 6: End-to-end time savings quantified

Stage 2: Measuring the Baseline — The Gap That Demanded Investigation
The evaluation began by measuring current-state performance. 1,497 documents (493 Type A, 504 Type B, 500 Type C) were processed through the existing OCR pipelines: the LSTM-based engine for invoices, the deep-learning OCR library for receipts.
Type A invoices achieved about 97% field accuracy. Type B achieved about 96%. Type C receipts managed only about 76%.
Type C’s accuracy is what motivated the AI evaluation. At about 76%, nearly one in four fields requires correction—a rate that makes automated processing impractical.

Stage 3: AI Extraction Results — Better Accuracy, but Slower
Five vision-language models were evaluated across nine configurations in a pilot study. A 9-billion parameter VLM with 8-bit quantization was selected based on accuracy (97% pilot accuracy), VRAM efficiency (fits on a single consumer-grade GPU), and structured output capability.
Across all 1,497 documents, the AI workflow improved accuracy by about 9 percentage points overall. The largest improvement was on Type C: from about 76% to about 97%.
Extraction errors dropped from 683 at baseline to 127 with AI—an 81% reduction. Empty extraction failures, where the OCR returned nothing for a field, dropped from 232 to 14.
A representative example: a scanned receipt from a retail chain. The deep-learning OCR baseline extracted the company name as garbled characters and the monetary total as an incorrect numeric value—a misread name and a wrong financial figure. The VLM extracted all four fields correctly. In the baseline workflow, that receipt would have forced the employee to manually look up the company name and re-enter multiple fields from scratch. In the AI workflow, the employee would only need a quick verification pass.
AI processing takes about 5.5 seconds per document on average, compared to under 0.4 seconds for the LSTM-based engine and about 3.4 seconds for the deep-learning OCR library. More accurate, but slower—a tradeoff that the later stages address directly.

Stage 4: Consensus Testing Outcome — When the Hypothesis Failed
Three extraction implementations ran in parallel on the 997 invoices (Types A and B): the LSTM-based engine (about 94%), the VLM (about 99.6%), and a transformer-based document OCR engine (about 97%). Each field was classified by agreement pattern: GREEN (all three agree), YELLOW (two agree), or RED (all disagree).
The hypothesis was that consensus—weighted majority voting—would be more accurate than any single implementation. It was not. That was a setback: the planned mechanism for hardening extraction accuracy simply did not work. The VLM alone outperformed consensus because the two traditional OCR engines agreeing on a wrong value could outvote the correct VLM extraction.
The agreement pattern proved valuable for a different purpose: as a confidence signal. What had failed as a voting mechanism worked as a triage signal. In the observed data, when all three implementations agreed on a field value (GREEN), the VLM extraction was correct 100% of the time. For example, on a sample invoice, the LSTM-based engine extracted a total with extraneous label text, the transformer OCR engine included different surrounding formatting, and the VLM returned only the clean numeric value—three independent systems with different raw formats converging on the same financial value.

Stage 5: Auto-Accept Eligibility — The Reframe That Salvaged the Approach
Using consensus for value selection produced only 3.1% review time savings—far below the 10% target. But the agreement categories, reframed as confidence signals, enabled a more powerful strategy: auto-accepting documents where all implementations agreed, bypassing human review entirely.
At the balanced threshold (P ≥ 0.970), about 22% of invoices could be auto-accepted with over 99% precision. One false accept occurred in 220 auto-accepted documents—one document containing an error was approved without review. At production scale (10,000 documents/month with 22% auto-accept), that rate translates to roughly 10 false accepts per month.
The ultra-conservative threshold (P ≥ 0.980) produced zero false accepts in the observed data, at the cost of a lower auto-accept rate (about 6%). This is the recommended starting point. Whether that conservative auto-accept rate is enough to justify AI deployment—whether the confidence signal actually reverses the time penalty—is what Stage 6 tested.

Stage 6: End-to-End Time Impact — The Counterintuitive Reversal
Here the evaluation turned up a counterintuitive finding. For invoices, AI extraction alone is slower than baseline OCR. The AI processing penalty (about 5.7 seconds for Type A, 7.7 for Type B, versus sub-second traditional OCR) outweighs the modest review time reduction. AI-only increases total per-document time by about 6.5%.
Adding auto-accept reverses this. With AI + auto-accept, total processing time drops by about 54% for invoices combined (Type A: 61%, Type B: 48%). For Type C receipts, AI achieves about 58% time savings compared to manual entry, on a workflow that currently cannot be deployed at all.
Sensitivity analysis varied all parameters by up to 50%. Even at worst case, AI + auto-accept remained faster than baseline.


The Result
Four deployment recommendations, each traceable to specific stage evidence:
- Deploy Type C first. Largest accuracy improvement (+21.6 percentage points) and about 58% time savings versus manual entry. Addresses a workflow that currently cannot be deployed and does not depend on the auto-accept pipeline. Infrastructure cost scales favorably when combined with Types A/B deployment. (Stages 3 and 6)
- Deploy Types A and B only with auto-accept. AI extraction alone does not offset the processing time penalty. The auto-accept layer makes it viable by eliminating human review for about 22% of documents. (Stage 6)
- Use consensus for confidence, not value selection. The VLM alone is more accurate than weighted majority voting. The agreement pattern serves as a confidence signal only. (Stages 4 and 5)
- Start at the ultra-conservative threshold (P ≥ 0.980). Zero false accepts in observed data. Relax the threshold as operational confidence grows. (Stage 5)
What This Means for Deployment
Type C benefits most from AI extraction. The receipt workflow achieves 58% time savings versus manual entry and the largest accuracy improvement (from about 76% to about 97%), making it viable for production. Auto-accept is not available for receipts because the consensus pipeline requires multiple independent extraction implementations, and only the VLM supports receipt extraction—all Type C documents require human review. Infrastructure cost is shared when deployed alongside Types A/B.
Types A and B require the full pipeline. AI extraction alone is slower than baseline (+6.5% total time). Adding consensus-based auto-accept reverses this, achieving 53.6% steady-state time savings by eliminating review for high-confidence documents. The conservative starting threshold (P ≥ 0.980) produces a lower initial savings rate until production evidence justifies relaxation toward the balanced threshold. Volume thresholds depend on organization-specific labor rates and infrastructure costs—see the parametric framework in Section 9.
The parametric time model translates accuracy improvements into hours saved at any document volume. A reader-configurable cost framework provides the formulas and parameter definitions needed to produce organization-specific dollar estimates. The observed time savings are large enough to justify an organization-specific economic review; because labor rates, infrastructure choices, and baseline operating costs vary widely, this report leaves dollar conversion to the reader.
Why This Recommendation Is Low-Risk to Deploy
The AI workflow was designed to minimize operational disruption.
Same employee workflow. The review interface is identical—same side-by-side view, same correction process. Employees see better pre-filled values, not a different tool. No retraining required on the employee review workflow.
Same audit schema. AI workflow records use the same append-only audit table as the baseline. The only difference is the extraction_method field value. Historical baseline records remain accessible alongside AI records, enabling direct cross-method queries on a single table.
Reversible rollout. The transition can be reversed at any time without data loss or process changes. If the AI workflow does not meet performance targets in production, the system reverts to baseline OCR extraction.
Parallel-run validation. Before full cutover, documents can be processed by both methods simultaneously. This provides real-world validation and gives the processing team confidence before the baseline is retired.
Fallback for model failures. If the AI model is unavailable, times out, or produces malformed output, the document is flagged and recoverable—no document is silently dropped. Every upload results in either a completed processing record or an explicit error record in the audit log.
Internal processing. Document images are processed in isolated containers via internal gRPC communication. No document data is transmitted externally. Access control and data handling follow existing organizational policies.
What This Evaluation Demonstrates
- Baseline-first evaluation. The existing system was measured before any AI comparison, not after—establishing an honest comparison target rather than testing the AI in isolation.
- Separating model improvement from business value. AI extraction alone makes invoices slower. The evaluation identified this and built the auto-accept layer that converts accuracy gains into actual time savings.
- Quantified review burden. Instead of treating OCR quality as an abstract percentage, the evaluation computed the time impact of every extraction error—connecting accuracy to measurable operational impact.
- Phased deployment over blanket replacement. The recommendation deploys Type C first (largest gain, lowest risk), adds invoice auto-accept at a conservative threshold, and relaxes constraints only with production evidence.
- Pre-registered predictions. Twenty directional predictions were registered (five per stage, each set registered before its stage ran). Four failed, and those failures were reported—not buried.
- Transferable methodology. The six-stage evaluation structure (baseline-first, staged, pre-registered, limitation-transparent) is designed to be rebuilt on any document-extraction pipeline; specific results must always be validated on the target dataset.
Part II
Technical Detail
The narrative above intentionally simplified the underlying methodology. The following section contains the complete statistical framework, stage-by-stage analysis, and evidence supporting every finding and recommendation in Part I.
Dataset: Invoice evaluation dataset (Types A/B) + Receipt evaluation dataset (Type C) | Model: 9B-parameter VLM (8-bit quantized)
1. Executive Summary
This evaluation assessed whether AI-based document extraction can replace traditional OCR for an organization processing invoices and receipts. Across six stages, 1,497 documents were processed through both baseline (LSTM-based OCR / deep-learning OCR) and AI (vision-language model) workflows. The AI workflow improved accuracy by +9.25 percentage points overall, with the largest gain on Type C receipts (+21.6pp). When combined with a consensus-based auto-accept model, the full pipeline achieves 53.6% time savings on invoices and 58% on receipts versus manual entry.
Table T1: Stage Overview
| Stage | Question Answered | Documents | Key Output |
|---|---|---|---|
| 1 | What are the evaluation goals and success criteria? | — | Requirements, time/cost framework, evaluation design |
| 2 | How accurate is baseline OCR? | 1,497 | Types A/B ~96%, Type C 76% |
| 3 | Can VLM extraction improve accuracy? | 1,497 | +9.25pp overall, +21.6pp Type C |
| 4 | Does multi-implementation consensus help? | 997 | No — VLM alone beats consensus |
| 5 | Can we auto-accept high-confidence documents? | 997 | 22% auto-accepted at 99.55% precision |
| 6 | What are the end-to-end time savings? | 1,497 | 53.6% A/B, 58% C vs manual |
Table T2: Decision Funnel
| Stage | Decision | Carried Forward |
|---|---|---|
| 2 | Baseline measurements establish comparison target | Accuracy and time benchmarks, cost model structure |
| 3 | VLM selected; accuracy improvement confirmed | VLM pipeline, label-stripping correction |
| 4 | Consensus < VLM; agreement pattern is a confidence signal | Color categories (GREEN/YELLOW/RED) |
| 5 | Auto-accept at P ≥ 0.970: 22% rate, 99.55% precision | Balanced threshold, conditional probabilities |
| 6 | AI-only slower; AI+AA delivers 53.6% savings | Deployment recommendation, time/cost model |
2. Methodology
2.1 Pipeline Architecture
Both the baseline and AI workflows follow the same three-stage DAG: validate (confirm document is processable) → extract (produce field values) → format (normalize output). The baseline uses the LSTM-based OCR engine (Types A/B) or deep-learning OCR library (Type C) with positional/pattern extraction rules. The AI workflow uses a 9B-parameter vision-language model with structured JSON output.
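The shared three-stage DAG can be sketched as follows. This is a minimal illustration, not the production API: the function names, the `ExtractionResult` type, and the toy `vlm` engine are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class ExtractionResult:
    fields: dict   # field name -> raw extracted value
    method: str    # e.g. "lstm_ocr", "dl_ocr", "vlm"

def validate(document: bytes) -> bytes:
    """Confirm the document is processable (non-empty in this sketch)."""
    if not document:
        raise ValueError("empty document")
    return document

def extract(document: bytes, engine) -> ExtractionResult:
    """Produce field values using the configured extraction engine."""
    return ExtractionResult(fields=engine(document), method=engine.__name__)

def format_output(result: ExtractionResult) -> dict:
    """Normalize output and keep the extraction method for the audit log."""
    return {
        "extraction_method": result.method,
        "fields": {k: str(v).strip() for k, v in result.fields.items()},
    }

def run_pipeline(document: bytes, engine) -> dict:
    # validate -> extract -> format, as in both workflows
    return format_output(extract(validate(document), engine))

# Toy engine standing in for the VLM's structured JSON output.
def vlm(document: bytes) -> dict:
    return {"total": " 1,234 ", "company": "Example Co."}

print(run_pipeline(b"scan", vlm))
```

Because both workflows share this DAG, swapping the baseline engine for the VLM changes only the `extract` stage; validation and formatting are unchanged.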
2.2 Evaluation Metrics
| Metric | Definition |
|---|---|
| Field accuracy rate | Fraction of fields where extracted value matches ground truth (exact or fuzzy match) |
| Review time | Simulated employee time to verify correct fields, correct errors, and enter missing values |
| Total document time | Processing time (OCR/VLM) + review time |
| Cost per document | Reader-configurable parametric model (Section 9.2, Appendix B): infrastructure + processing + labor + maintenance |
2.3 Statistical Framework
All accuracy comparisons use the Wilcoxon signed-rank test (paired, non-parametric) with effect size measured by Cohen’s d. Confidence intervals are computed via percentile bootstrap with 10,000 iterations. Pre-registered construct validity predictions (5 per stage, Stages 3–6) provide independent confirmation that the evaluation framework measures what it intends to measure.
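A stdlib-only sketch of two pieces of this framework. The Wilcoxon test itself would come from `scipy.stats.wilcoxon(baseline, ai)`; shown here are one common paired formulation of Cohen's d (mean difference over the SD of differences) and the percentile bootstrap CI. The per-document accuracies are toy values, not the evaluation data.

```python
import random
import statistics

def cohens_d_paired(baseline, ai):
    """Cohen's d for paired samples: mean difference / SD of differences."""
    diffs = [a - b for a, b in zip(ai, baseline)]
    return statistics.mean(diffs) / statistics.stdev(diffs)

def bootstrap_ci(baseline, ai, iterations=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired accuracy delta."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(ai, baseline)]
    n = len(diffs)
    means = sorted(
        statistics.mean(rng.choices(diffs, k=n)) for _ in range(iterations)
    )
    lo = means[int(alpha / 2 * iterations)]
    hi = means[int((1 - alpha / 2) * iterations) - 1]
    return lo, hi

# Toy per-document accuracy fractions, paired by document.
baseline = [0.90, 0.88, 0.95, 0.75, 0.80, 0.93, 0.85, 0.78]
ai       = [0.98, 0.97, 0.99, 0.95, 0.96, 0.99, 0.97, 0.94]
print(cohens_d_paired(baseline, ai), bootstrap_ci(baseline, ai))
```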
2.4 Review Time Model
Employee review time is simulated computationally for reproducibility. The model computes: fixed overhead + (correct fields × verification time) + (incorrect fields × correction time) + (failed fields × entry time). Field-level times vary by complexity category (short_numeric, short_text, long_text, constrained_choice, icon). See Appendix A for the full parameter table.
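The review time formula above can be transcribed directly. The overhead and per-field times below are illustrative placeholders, not the Appendix A parameters, and only three of the five complexity categories are shown.

```python
# Placeholder parameters (seconds) — not the Appendix A values.
FIXED_OVERHEAD_S = 5.0   # open document, load review interface
VERIFY_S  = {"short_numeric": 1.0, "short_text": 1.5, "long_text": 3.0}
CORRECT_S = {"short_numeric": 4.0, "short_text": 6.0, "long_text": 12.0}
ENTRY_S   = {"short_numeric": 5.0, "short_text": 8.0, "long_text": 15.0}

def review_time_s(fields):
    """fields: list of (complexity_category, outcome), where outcome is
    'correct' (verify only), 'incorrect' (correct the value), or
    'failed' (empty extraction, enter from scratch)."""
    total = FIXED_OVERHEAD_S
    for category, outcome in fields:
        if outcome == "correct":
            total += VERIFY_S[category]
        elif outcome == "incorrect":
            total += CORRECT_S[category]
        else:
            total += ENTRY_S[category]
    return total

doc = [("short_numeric", "correct"), ("short_text", "incorrect"),
       ("long_text", "correct"), ("short_numeric", "failed")]
print(review_time_s(doc))  # 5.0 + 1.0 + 6.0 + 3.0 + 5.0 = 20.0
```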
2.5 Document Scope
Stages 2, 3, and 6 evaluate all 1,497 documents (493 Type A + 504 Type B + 500 Type C). Stages 4 and 5 evaluate only the 997 invoice documents (Types A/B) because the consensus methodology requires three independent extraction implementations per field, and Type C (receipts) has only the VLM implementation—the LSTM-based engine and the transformer OCR engine do not support receipt extraction. This scope difference is noted in each stage section.
3. Stage 2: Baseline Measurements
Per-Type Accuracy
| Document Type | Mean Accuracy | Processing (ms) | Review (ms) | Total (ms) |
|---|---|---|---|---|
| Type A (n=493) | 96.77% | 321.6 | 51,154.2 | 51,475.8 |
| Type B (n=504) | 95.84% | 372.3 | 65,259.9 | 65,632.2 |
| Type C (n=500) | 75.70% | 3,411.9 | 36,142.0 | 39,553.9 |
Error Type Distribution
| Type | Substitution | Insertion | Deletion | Empty |
|---|---|---|---|---|
| A | 30 | 0 | 51 | 68 |
| B | 108 | 0 | 93 | 79 |
| C | 234 | 40 | 127 | 85 |
Baseline Time Budget (per document)
| Component | Type A | Type B | Type C |
|---|---|---|---|
| Processing time | 0.32s | 0.37s | 3.41s |
| Review time | 51.15s | 65.26s | 36.14s |
| Total document time | 51.48s | 65.63s | 39.55s |
These time measurements are the primary baseline data. Section 9 presents the parametric model for converting time into dollar cost estimates using configurable labor and infrastructure parameters.
Construct Validity Predictions (registered for Stage 3 evaluation)
| # | Prediction | Expected Direction |
|---|---|---|
| 1 | AI accuracy > baseline accuracy (overall) | AI > baseline |
| 2 | AI improvement larger on long_text fields than short_numeric fields | Δ(long_text) > Δ(short_numeric) |
| 3 | AI has fewer ‘empty’ extraction failures than baseline | AI < baseline |
| 4 | AI review time < baseline review time | AI < baseline |
| 5 | AI improvement on Type C > AI improvement on Types A/B | Δ(type_c) > Δ(type_a), Δ(type_c) > Δ(type_b) |
Decisions Carried Forward: Baseline measurements established for paired comparison; construct validity predictions registered for Stage 3 evaluation; time and cost model structure defined for later AI comparison.

4. Stage 3: AI Extraction
4.1 Model Selection
Five vision-language models were evaluated across nine model-strategy configurations. The table below shows the top 5 configurations by pilot accuracy.
Pilot Model Ranking
| Rank | Model | Strategy | Pilot Accuracy |
|---|---|---|---|
| 1 | VLM-A (9B, 8-bit quantized) | structured | 97.09% |
| 2 | VLM-A (4B, 4-bit quantized) | minimal | 95.70% |
| 3 | VLM-A (4B, 4-bit quantized) | structured | 95.70% |
| 4 | VLM-A (9B, 8-bit quantized) | minimal | 90.77% |
| 5 | VLM-B (11B, 4-bit quantized) | structured | 48.77% |
4.2 Accuracy Comparison
| Type | N | Baseline | AI | Delta | Cohen’s d | p-value | 95% CI |
|---|---|---|---|---|---|---|---|
| A | 493 | 96.77% | 99.39% | +2.62pp | 0.497 | 4.07e-14 | [1.99, 3.28] |
| B | 504 | 95.84% | 99.32% | +3.48pp | 0.761 | 1.74e-26 | [2.95, 4.04] |
| C | 500 | 75.70% | 97.30% | +21.60pp | 1.236 | 1.34e-53 | [19.55, 23.60] |
| Overall | 1,497 | 89.42% | 98.67% | +9.25pp | 0.718 | 5.34e-89 | [8.42, 10.12] |
4.3 Error Analysis
Of the fields evaluated, 127 AI extraction errors were found (down from 683 at baseline). Label-stripping correction resolved 34 of these, leaving 93 errors.
Error Categories
| Category | Count | Percentage |
|---|---|---|
| gt_includes_label | 51 | 40.2% |
| receipt_ocr_noise | 38 | 29.9% |
| empty_extraction | 14 | 11.0% |
| insertion_wrong_value | 10 | 7.9% |
| multiline_truncation | 8 | 6.3% |
| other | 6 | 4.7% |
4.4 Speed Comparison
| Type | Baseline Processing (ms) | AI Processing (ms) | Baseline Review (ms) | AI Review (ms) |
|---|---|---|---|---|
| A | 321.6 | 5,684.5 | 51,154.2 | 49,154.2 |
| B | 372.3 | 7,735.0 | 65,259.9 | 62,244.0 |
| C | 3,411.9 | 3,100.5 | 36,142.0 | 30,075.0 |
4.5 Time Impact Summary (AI-Only)
| Type | Review Time Saved/Doc (s) | AI Processing Penalty (s) | Net Time Change/Doc (s) |
|---|---|---|---|
| A | 2.0 | +5.4 | +3.4 (slower) |
| B | 3.0 | +7.4 | +4.4 (slower) |
| C | 6.1 | -0.3 | -6.4 (faster) |
AI-only extraction is net slower for Types A/B — the processing penalty exceeds review savings. The auto-accept layer (Stage 5) reverses this. Type C is net faster per document.
4.6 Construct Validity
| # | Prediction | Result |
|---|---|---|
| 1 | AI accuracy > baseline accuracy (overall) | PASS |
| 2 | AI improvement larger on long_text fields than short_numeric fields | FAIL |
| 3 | AI has fewer ‘empty’ extraction failures than baseline | PASS |
| 4 | AI review time < baseline review time | PASS |
| 5 | AI improvement on Type C > AI improvement on Types A/B | PASS |
Decisions Carried Forward: VLM selected as extraction engine; prompt strategy confirmed (structured JSON output); label-stripping correction applied to ground truth evaluation.



5. Stage 4: Multi-Implementation Consensus
Three extraction implementations ran in parallel on 997 invoice documents (Types A and B only—Type C excluded because only the VLM supports receipt extraction).
Implementation Accuracy (Overall, 10,746 in-template fields)
| Implementation | Overall Accuracy | 95% CI |
|---|---|---|
| VLM (9B, 8-bit quantized) | 99.61% | [99.49, 99.72] |
| Consensus (majority vote) | 99.12% | [98.94, 99.28] |
| Transformer OCR engine | 96.95% | [96.61, 97.26] |
| LSTM-based OCR engine | 93.74% | [93.25, 94.19] |
Color Category Distribution (in-template fields)
| Type | GREEN | YELLOW | RED |
|---|---|---|---|
| A | 44.2% | 40.6% | 15.2% |
| B | 40.0% | 40.2% | 19.7% |
Consensus Accuracy by Category
| Category | Accuracy | Field Count |
|---|---|---|
| GREEN | 100.00% | 4,483 |
| YELLOW | 98.11% | 4,342 |
| RED | 99.32% | 1,921 |
VLM alone (99.61%) outperformed consensus (99.12%). Majority voting with weaker implementations degrades the best single model.
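A toy illustration of this failure mode: two weaker engines that agree on the same misread outvote the correct VLM extraction. The values are invented for the example, and the tie-break fallback is an assumption of this sketch.

```python
from collections import Counter

def majority_vote(values):
    """Pick the value at least two implementations agree on; fall back to
    the first (VLM) value under full disagreement (a sketch assumption)."""
    value, count = Counter(values).most_common(1)[0]
    return value if count >= 2 else values[0]

ground_truth = "1234"
extractions = {
    "vlm": "1234",             # correct
    "lstm_ocr": "1284",        # identical misread by both
    "transformer_ocr": "1284", # traditional engines
}
voted = majority_vote(list(extractions.values()))
print(voted == ground_truth)               # consensus picks the wrong value
print(extractions["vlm"] == ground_truth)  # the VLM alone was right
```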
Construct Validity: 3/5 PASS
| # | Prediction | Result |
|---|---|---|
| 1 | Consensus accuracy ≥ VLM accuracy (all in-template fields) | FAIL |
| 2 | GREEN field rate higher for Type A than Type B | PASS |
| 3 | Prioritized review time < standard review time | PASS |
| 4 | GREEN actual accuracy > YELLOW > RED | FAIL |
| 5 | φ(transformer_ocr, lstm_ocr) > φ(transformer_ocr, vlm) | PASS |
Decisions Carried Forward: Use agreement as confidence signal (not for value selection); VLM extraction is the production path.

6. Stage 5: Auto-Accept Confidence Model
Stage 4’s consensus approach was reframed: instead of selecting values by agreement, agreement patterns serve as confidence signals. The prioritized-review approach from Stage 4 yielded only 3.1% time savings; the auto-accept reframing achieves 30.8%.
Conditional Probabilities (with Wilson 95% CIs)
| Category | P(VLM Correct) | Field Count | 95% CI |
|---|---|---|---|
| GREEN | 1.000000 | 4,483 | [0.9991, 1.0000] |
| YELLOW | 0.994012 | 4,342 | [0.9912, 0.9959] |
| RED | 0.991671 | 1,921 | [0.9865, 0.9949] |
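The Wilson score interval used for these CIs can be computed as follows; the inputs are the correct-field count `k` out of `n` fields in a category. Applied to the GREEN row (4,483 of 4,483 correct), it reproduces the [0.9991, 1.0000] interval above.

```python
import math

def wilson_ci(k, n, z=1.96):
    """Wilson score interval for a binomial proportion k/n at ~95%."""
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

lo, hi = wilson_ci(4483, 4483)
print(round(lo, 4), round(hi, 4))  # 0.9991 1.0
```

Unlike the naive normal-approximation interval, Wilson stays inside [0, 1] and gives a non-degenerate lower bound even when every observed field is correct, which is exactly the GREEN case here.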
Threshold Selection (5 key rows from 21)
| Threshold | AA Rate | Precision | False Accepts | Time Savings |
|---|---|---|---|---|
| 0.960 | 53.36% | 96.99% | 16 | 60.2% |
| 0.965 | 34.50% | 97.67% | 8 | 42.7% |
| 0.970 | 22.07% | 99.55% | 1 | 30.8% |
| 0.975 | 11.63% | 99.14% | 1 | 20.8% |
| 0.980 | 6.22% | 100.00% | 0 | 15.4% |
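The report does not spell out how the per-category probabilities combine into the document-level score P that the thresholds apply to; one plausible reading, sketched here under a field-independence assumption, is the product of the conditional probabilities of each field's category. Treat this as an illustration of the thresholding mechanics, not the report's exact model.

```python
# Per-category P(VLM correct) from the conditional probability table.
P_CORRECT = {"GREEN": 1.000000, "YELLOW": 0.994012, "RED": 0.991671}

def p_doc(field_categories):
    """Document confidence as the product of per-field probabilities
    (independence is a modeling assumption of this sketch)."""
    p = 1.0
    for cat in field_categories:
        p *= P_CORRECT[cat]
    return p

def auto_accept(field_categories, threshold=0.980):
    """Bypass human review when document confidence meets the threshold."""
    return p_doc(field_categories) >= threshold

# Mostly-GREEN documents clear the conservative threshold...
print(auto_accept(["GREEN"] * 9 + ["YELLOW"]))
# ...while a few YELLOW/RED fields push P_doc below it.
print(auto_accept(["GREEN"] * 5 + ["YELLOW"] * 3 + ["RED"] * 2))
```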
Cross-Validation: 5-fold CV confirmed threshold stability (standard deviation < 0.001 across folds).
Construct Validity: 4/5 PASS
| # | Prediction | Result |
|---|---|---|
| 1 | P(VLM correct \| GREEN) = 1.0 in all CV folds | PASS |
| 2 | Auto-accept rate higher for Type A than Type B | PASS |
| 3 | CV precision at P ≥ 0.970 ≥ 98% | PASS |
| 4 | P_doc calibration error < 5pp | FAIL |
| 5 | Review time savings at P ≥ 0.970 ≥ 20% | PASS |
Decisions Carried Forward: Balanced threshold P ≥ 0.970 for deployment.


7. Stage 6: Integrated Time Analysis
Scenario Definitions
| # | Scenario | Description | Applies To |
|---|---|---|---|
| 1 | Baseline | LSTM-based OCR / DL-OCR + full review | All |
| 2 | AI-only | VLM + full review (no auto-accept) | All |
| 3 | AI+Auto-Accept | VLM + auto-accept + reduced review | A/B |
| 4 | Type C Manual | Manual data entry (no OCR) | C only |
| 5 | Type C AI | VLM + full review | C only |
Types A/B Per-Document Time (ms)
| Scenario | Type A | Type B |
|---|---|---|
| Baseline | 51,475.8 | 65,632.2 |
| AI-only | 54,838.7 | 69,979.1 |
| AI+Auto-Accept | 19,985.8 | 34,306.7 |
Type C Per-Document Time (ms)
| Scenario | Total (ms) |
|---|---|
| Baseline OCR | 39,553.9 |
| AI extraction | 33,175.5 |
| Manual entry | 79,000.0 |
Monthly Time Projections (hours/month)
| Volume Tier | Type | Baseline (hrs) | AI+AA (hrs) | Savings |
|---|---|---|---|---|
| 300/mo | A | 4.3 | 1.7 | 61.1% |
| 300/mo | B | 5.5 | 2.9 | 47.7% |
| 1,000/mo | A | 14.3 | 5.5 | 61.2% |
| 1,000/mo | B | 18.2 | 9.5 | 47.7% |
| 5,000/mo | A | 71.5 | 27.8 | 61.2% |
| 5,000/mo | B | 91.2 | 47.6 | 47.7% |
Type C Monthly Time Projections (hours/month)
| Volume Tier | Manual (hrs) | AI (hrs) | Savings |
|---|---|---|---|
| 300/mo | 6.6 | 2.8 | 58.0% |
| 1,000/mo | 21.9 | 9.2 | 58.0% |
| 5,000/mo | 109.7 | 46.1 | 58.0% |
Sensitivity Analysis: Top 5 Parameters by Impact
| Rank | Parameter | Swing (pp) | Range |
|---|---|---|---|
| 1 | YELLOW Discount Factor | 15.1 | 44.5%–59.5% |
| 2 | Pre-Fill Overhead Reduction | 8.2 | 47.9%–56.1% |
| 3 | RED Discount Factor | 6.7 | 49.3%–56.0% |
| 4 | Per-Field Verification Time | 5.4 | 48.5%–53.8% |
| 5 | GREEN Discount Factor | 3.9 | 48.1%–52.0% |
Threshold Sensitivity Sweep
| Threshold | Type A AA Rate | Type B AA Rate | Combined A/B Savings |
|---|---|---|---|
| 0.955 | 89.9% | 35.7% | 69.3% |
| 0.960 | 79.9% | 27.4% | 65.4% |
| 0.965 | 58.0% | 11.5% | 58.0% |
| 0.970 | 39.8% | 4.8% | 53.6% |
| 0.975 | 22.9% | 0.6% | 50.2% |
| 0.980 | 12.6% | 0.0% | 48.6% |
| 0.985 | 5.7% | 0.0% | 47.7% |
Robustness: The minimum savings across all one-at-a-time parameter sweeps was 44.5% (YELLOW discount factor at worst case). AI + auto-accept remained faster than baseline under every variation tested.
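The one-at-a-time (OAT) sweep behind the tornado table can be sketched as follows: each parameter is varied across its range while the others stay at nominal values, and the swing is the resulting spread in combined savings. The savings function below is a simple monotone stand-in with invented coefficients, not the report's full time model.

```python
NOMINAL = {"yellow_discount": 0.5, "prefill_reduction": 0.3, "verify_s": 1.5}

def combined_savings(params):
    """Illustrative stand-in for the Stage 6 time model (coefficients
    are invented; only the OAT mechanics are the point here)."""
    return (0.536
            + 0.3 * (params["yellow_discount"] - 0.5)
            + 0.2 * (params["prefill_reduction"] - 0.3)
            - 0.02 * (params["verify_s"] - 1.5))

def oat_sweep(nominal, spans, steps=11):
    """Vary one parameter at a time across its span; report the swing."""
    swings = {}
    for name, (lo, hi) in spans.items():
        values = []
        for i in range(steps):
            p = dict(nominal)
            p[name] = lo + (hi - lo) * i / (steps - 1)
            values.append(combined_savings(p))
        swings[name] = max(values) - min(values)
    return swings

spans = {"yellow_discount": (0.25, 0.75),   # ±50% around nominal
         "prefill_reduction": (0.15, 0.45),
         "verify_s": (0.75, 2.25)}
print(oat_sweep(NOMINAL, spans))
```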
Construct Validity: 5/5 PASS
| # | Prediction | Result |
|---|---|---|
| CV-1 | AI+AA total time < Baseline total time (Types A/B) | PASS |
| CV-2 | AI+review total < Manual entry (Type C) | PASS |
| CV-3 | AI+AA advantage for Types A/B holds under ALL OAT parameter variations | PASS |
| CV-4 | Type C AI advantage over manual holds under ALL parameter variations | PASS |
| CV-5 | Sensitivity tornado: yellow_discount has largest swing | PASS |
Decisions Carried Forward: AI+Auto-Accept confirmed as the deployment recommendation for Types A/B; AI-only insufficient (slower than baseline); Type C strongest standalone business case; all conclusions robust under parameter variation.



8. Cross-Stage Synthesis
8.1 Metrics Consistency Reconciliation
| Metric | Value A | Value B | Explanation |
|---|---|---|---|
| Overall AI accuracy | 0.9867 | 0.9876 | Pre vs post label-stripping correction. 34 of 127 AI errors resolved. Report uses 0.9867 (conservative, pre-correction) |
| Review time savings | 30.8% | 53.6% | Stage 5 scope: all fields (31 fields). Stage 6 scope: template-specific. Report uses 53.6% |
| Consensus vs VLM accuracy | 99.12% | 99.61% | VLM alone outperforms consensus; agreement reframed as confidence signal |
8.2 Cumulative Construct Validity
| Stage | Total | PASS | FAIL | Rate |
|---|---|---|---|---|
| 3 | 5 | 4 | 1 | 80% |
| 4 | 5 | 3 | 2 | 60% |
| 5 | 5 | 4 | 1 | 80% |
| 6 | 5 | 5 | 0 | 100% |
| Cumulative | 20 | 16 | 4 | 80% |

8.3 Stage-to-Stage Decision Propagation
Stage 2 established baseline measurements. Stage 3 confirmed the VLM improves accuracy but is slower. Stage 4 found that consensus did not improve on the VLM, but that agreement patterns work as confidence signals. Stage 5 converted those signals into an auto-accept model. Stage 6 integrated all components and confirmed that the auto-accept layer, not AI accuracy alone, drives the time savings case for invoices.
9. Unified Time and Cost Analysis
9.1 Time Savings Summary
The primary evaluation metric is per-document time—the sum of processing time and employee review time. Time savings are independent of labor rates and infrastructure costs, making them the universal measure of operational improvement.
Per-Document Total Time by Scenario (seconds)
| Type | Baseline | AI-Only | AI+Auto-Accept | Time Saved (AI+AA) |
|---|---|---|---|---|
| A | 51.5 | 54.8 | 20.0 | 31.5 (61.2%) |
| B | 65.6 | 70.0 | 34.3 | 31.3 (47.7%) |
| C (vs manual) | 79.0 | 33.2 | — | 45.8 (58.0%) |
| C (vs baseline OCR) | 39.6 | 33.2 | — | 6.4 (16.1%) |
AI-only extraction is slower than baseline for Types A/B (processing penalty outweighs review savings). The auto-accept layer reverses this by eliminating review entirely for high-confidence documents. Type C’s primary comparison is against manual entry (since the baseline OCR workflow was not viable for production).
9.2 Reader-Configurable Cost Framework
This framework is provided as a planning tool. Readers should substitute their own labor rates and infrastructure costs to produce organization-specific estimates.
cost_per_doc(volume) = monthly_fixed / volume + labor_per_doc
labor_per_doc = (total_document_time_seconds / 3600) × hourly_rate

Where:
- monthly_fixed = combined infrastructure + maintenance cost per month
- labor_per_doc = employee time cost computed from total document time (processing wait + review)
- volume = documents processed per month
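The framework above transcribes directly to code. The monthly fixed cost and hourly rate below are placeholder figures for illustration only; substitute organization-specific values.

```python
def cost_per_doc(volume, monthly_fixed, total_document_time_s, hourly_rate):
    """cost_per_doc(volume) = monthly_fixed / volume + labor_per_doc,
    with labor_per_doc = (total_document_time_s / 3600) * hourly_rate."""
    labor_per_doc = (total_document_time_s / 3600) * hourly_rate
    return monthly_fixed / volume + labor_per_doc

# Type A at 1,000 docs/month, using the per-document times from Section 9.1.
# The 500.0 monthly fixed cost and 30.0 hourly rate are placeholders.
baseline = cost_per_doc(1000, monthly_fixed=0.0,
                        total_document_time_s=51.5, hourly_rate=30.0)
ai_aa = cost_per_doc(1000, monthly_fixed=500.0,
                     total_document_time_s=20.0, hourly_rate=30.0)
print(round(baseline, 3), round(ai_aa, 3))  # 0.429 0.667
```

With these placeholder figures the AI workflow costs more per document at 1,000 docs/month: the labor saved per document (about 0.26 in these units) has to amortize the fixed cost, giving an illustrative break-even near 500 / 0.26 ≈ 1,900 documents/month. Actual break-even volumes depend entirely on the substituted parameters.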
9.3 Worked Example
A worked example applying the framework above to reference parameters is provided in Appendix B. Readers can substitute organization-specific values for labor rate and infrastructure cost to produce context-appropriate estimates.
9.4 Sensitivity to Labor Rate
Break-even volumes shift with labor rates. Higher rates increase per-document savings, lowering the volume required to justify infrastructure investment. A sensitivity table across a range of labor rates is provided in Appendix B alongside the worked example.

10. Final Recommendation
10.1 Per-Type Recommendation
| Document Type | Recommendation | Evidence | Operational Fit |
|---|---|---|---|
| Type C (receipts) | Deploy AI extraction (VLM + mandatory review) | +21.6pp accuracy, 58% time savings vs manual | Strongest standalone case; enables currently non-viable workflow |
| Type A (standard invoices) | Deploy AI + Auto-Accept if volume justifies | +2.6pp accuracy, 61.2% time savings | Requires auto-accept layer and conservative rollout |
| Type B (detailed invoices) | Deploy AI + Auto-Accept if volume justifies | +3.5pp accuracy, 47.7% time savings | Requires auto-accept layer and conservative rollout |
Operational fit is based on time savings evidence and deployment dependencies. Volume-dependent cost analysis is available via the parametric framework in Section 9.2 with reference parameters in Appendix B.
10.2 Selection Rationale
Combined deployment shares infrastructure cost. The GPU infrastructure serves all three document types. Deploying Types A/B and Type C together means each type shares the incremental infrastructure cost above the existing CPU baseline, improving per-type economics.
Type C benefits most from accuracy improvement. The receipt workflow has the largest accuracy gap (75.7% to 97.3%) and addresses a workflow that currently cannot be deployed. It does not depend on the auto-accept pipeline. Every receipt currently requires full manual data entry; the AI workflow eliminates 58% of that time.
Types A/B require auto-accept. AI extraction alone makes invoices slower (+6.5% total time). The auto-accept model reverses this by eliminating review for high-confidence documents. Without auto-accept, the time penalty is not recovered. With auto-accept, 53.6% steady-state time savings are achieved; whether this justifies infrastructure cost depends on volume and labor rate.
10.3 Caveats
- Simulated review times (not validated with real employees)
- Single VLM architecture family tested; other VLM architectures untested
- Invoice evaluation dataset is synthetic (real invoices may have different error patterns)
- Auto-accept model calibrated on 997 documents (50 templates); new templates require recalibration
- Future scope capabilities (unknown-layout classification and dynamic form generation) were not evaluated
11. Limitations
Data
- L-02 [medium]: Invoice dataset is synthetic; real invoices may have more variability.
- L-03 [low]: Receipt dataset limited to 4 fields; real receipts may require more.
- L-07 [medium]: Consensus analysis limited to Types A/B; Type C excluded because only one OCR implementation exists for that format.
- L-09 [low]: GRAY fields (~65.2% of field instances) excluded from consensus.
- L-13 [medium]: Type C excluded from auto-accept confidence model. All Type C documents always require manual review.
- L-16 [low]: Type C manual processing time (79s) is theoretical, not measured.
Methodology
- L-01 [high]: Review times simulated, not empirically measured. The review time model uses estimated per-field verification, correction, and entry times based on field complexity categories.
- L-06 [medium]: Prompt strategies optimized on same data used for evaluation. No held-out test set was used.
- L-08 [low]: Three implementations may not capture all possible error patterns.
- L-10 [high]: Review discount factors (GREEN=0%, YELLOW=60%, RED=80%) are assumed, not empirically validated.
- L-11 [medium]: Conditional probabilities computed and evaluated on same dataset (mitigated by 5-fold CV).
- L-14 [medium]: Integrated time model compounds estimated parameters from multiple stages.
- L-15 [low]: OAT sensitivity analysis; parameter interactions not modeled.
Model
- L-04 [medium]: Single VLM evaluated (9B, 8-bit quantized); other models may differ.
- L-12 [medium]: Independence assumption for document-level confidence may not hold for uniformly degraded scans.
Deployment
- L-05 [medium]: Inference time depends on GPU hardware; results specific to one consumer-grade GPU.
- L-17 [low]: Monthly time projections assume constant per-document processing time.
- L-18 [low]: Time estimates exclude document acquisition, routing, and post-processing overhead.
Additional Limitations (Stage 7)
- L-19 [medium]: Parametric cost model uses assumed hourly rates and GPU costs; organization-specific values will change break-even volumes. Mitigation: Report provides parametric model with adjustable inputs.
- L-20 [high]: No real employee review validation at any stage; all time savings derived from simulated review model. Mitigation: Sensitivity analysis shows all conclusions robust to ±50% parameter variation.
- L-21 [medium]: Auto-accept precision (99.55%) means ~1 false accept per 220 auto-accepted documents. At 10,000 docs/month with 22% auto-accept rate, this is ~10 false accepts/month. Mitigation: Ultra-conservative threshold (P ≥ 0.980) reduces false accepts to 0 in observed data.
12. Artifact Index
Table A: Stage Reports
| Stage | File | Description |
|---|---|---|
| 2 | results/doc_ocr_baseline/analysis/baseline_results_summary.md | Baseline OCR measurements |
| 3 | results/doc_ocr_ai/analysis/final_report.md | AI extraction evaluation |
| 4 | results/doc_ocr_consensus/analysis/final_report.md | Multi-implementation consensus |
| 5 | results/doc_ocr_autoaccept/analysis/final_report.md | Auto-accept confidence model |
| 6 | results/doc_ocr_integrated/analysis/final_report.md | Integrated time analysis |
Table B: Key Data Files
| File | Stage | Description |
|---|---|---|
| paired_accuracy_comparison.json | 3 | Per-type accuracy with statistical tests |
| rescore_label_stripping.json | 3 | Post-correction accuracy analysis |
| consensus_vs_single.json | 4 | VLM vs consensus accuracy |
| auto_accept_analysis.json | 5 | Auto-accept rates and precision |
| threshold_table.json | 5 | 21 threshold options with metrics |
| scenario_time_comparison.json | 6 | 5 scenarios × 3 types time comparison |
| sensitivity_tornado.json | 6 | Parameter sensitivity ranking |
| consolidated_metrics.json | 7 | All headline metrics consolidated |
Appendix A: Review Time Model Parameters
Review Time Parameters (seconds)
The model defines five complexity categories. Three (short_numeric, short_text, long_text) are used in the current evaluation; two (constrained_choice, icon) are defined for future document types.
| Parameter | Value (s) |
|---|---|
| Overhead per document | 15.0 |
| short_numeric — verification | 2.5 |
| short_numeric — correction | 6.5 |
| short_numeric — entry | 10.0 |
| short_text — verification | 3.5 |
| short_text — correction | 10.0 |
| short_text — entry | 15.0 |
| long_text — verification | 5.0 |
| long_text — correction | 16.0 |
| long_text — entry | 24.0 |
Invoice Dataset Field Complexity Mapping (31 fields)
- long_text (8): recipient addresses, entity names, payment terms, notes, and similar multi-word fields
- short_numeric (12): monetary amounts (totals, subtotals, discounts) and tax values at various rates
- short_text (11): dates, tax identification numbers, document numbers, seller contact details, and document titles
Receipt Dataset Field Complexity Mapping (4 fields)
- entity name: short_text
- address: long_text
- date: short_text
- monetary total: short_numeric
Appendix B: Cost Model Framework
Cost Model Equations
cost_per_doc(V) = monthly_fixed / V + labor_per_doc
labor_per_doc = (total_document_time_seconds / 3600) * hourly_rate
break_even_volume = additional_fixed_monthly / net_labor_savings_per_doc
Configurable Parameters
| Parameter | Reference Value | Description |
|---|---|---|
| Employee hourly rate | $40.00/hr | Fully-loaded labor cost per hour (sensitivity: $35–$50) |
| AI deployment (monthly) | $8,000 | GPU-accelerated hosting: infrastructure + maintenance + inference (all types) |
| Baseline deployment (monthly) | $4,000 | CPU-only hosting: infrastructure + maintenance (Types A/B) |
| Manual processing (monthly) | $0 | No infrastructure; pure labor cost (Type C current state) |
Worked Example: Dollar Cost at Reference Parameters
The tables below use reference parameter values to illustrate the cost model’s output. They are not claims about expected savings for any specific organization. Reference parameters: hourly labor rate = $40.00/hr, AI deployment (GPU) = $8,000/month, baseline deployment (CPU) = $4,000/month.
Per-Document Cost at Key Volume Tiers (reference parameters)
| Type | Scenario | 1,000/mo | 5,000/mo | 10,000/mo | 20,000/mo |
|---|---|---|---|---|---|
| A | Baseline | $4.57 | $1.37 | $0.97 | $0.77 |
| A | AI-Only | $8.61 | $2.21 | $1.41 | $1.01 |
| A | AI+Auto-Accept | $8.22 | $1.82 | $1.02 | $0.62 |
| B | Baseline | $4.73 | $1.53 | $1.13 | $0.93 |
| B | AI-Only | $8.78 | $2.38 | $1.58 | $1.18 |
| B | AI+Auto-Accept | $8.38 | $1.98 | $1.18 | $0.78 |
| C | AI-Only | $8.37 | $1.97 | $1.17 | $0.77 |
| C | Manual | $0.88 | $0.88 | $0.88 | $0.88 |
At lower volumes, infrastructure cost dominates and AI is significantly more expensive per document. At higher volumes, infrastructure amortizes and the labor savings from AI become the deciding factor.
Break-Even Volumes (reference parameters, $40/hr)
| Type | AI-Only Break-Even | AI+AA Break-Even |
|---|---|---|
| A | No break-even (AI-only is slower) | ~11,500/mo |
| B | No break-even (AI-only is slower) | ~11,600/mo |
| C | ~15,800/mo (standalone, vs manual) | — |
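The break-even figures follow from the equations at the top of this appendix; a minimal check at reference parameters (function name is illustrative):

```python
def break_even_volume(additional_fixed_monthly: float,
                      seconds_saved_per_doc: float, hourly_rate: float) -> float:
    """Monthly volume at which labor savings offset the added fixed cost."""
    net_labor_savings = (seconds_saved_per_doc / 3600.0) * hourly_rate
    return additional_fixed_monthly / net_labor_savings

# Type A AI+Auto-Accept: $8,000 GPU vs $4,000 CPU baseline, 31.5 s saved/doc, $40/hr
print(round(break_even_volume(8000.0 - 4000.0, 31.5, 40.0)))  # 11429
```

This agrees with the ~11,500/mo figure above to within rounding of the published per-document times.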
Sensitivity to Labor Rate
| Hourly Rate | Type A AA Break-Even | Type B AA Break-Even | Type C Break-Even (standalone) |
|---|---|---|---|
| $35/hr | 13,100 | 13,200 | 18,000 |
| $40/hr | 11,500 | 11,600 | 15,800 |
| $45/hr | 10,200 | 10,300 | 14,000 |
| $50/hr | 9,200 | 9,300 | 12,600 |
Disclosure: Results on this page are derived from controlled benchmarks and are not a guarantee of performance in other environments.
© 2026 RCTK. All rights reserved. This study may not be reproduced or distributed without prior written permission.