Where AI Beats Traditional OCR — and Where It Still Needs Human Review
This report evaluates whether AI-based document extraction can replace traditional OCR for an organization processing three document workflows — across 1,497 documents and six evaluation stages.
The Challenge
The scenario: an organization processes thousands of standardized documents each month across three workflows. Two invoice workflows (Types A and B) run on an LSTM-based OCR engine with positional extraction, achieving about 96% field accuracy. A third workflow for scanned receipts (Type C) was piloted with a deep-learning OCR library but achieved only about 76% accuracy—not viable for production.
On invoices, the same types of extraction errors recur systematically. Employees must visually verify every field, correct errors, and manually transcribe failed extractions. The correction burden is persistent and costly.
On receipts, the problem is worse. Receipt layout diversity—multiple total lines, trade name versus registered entity differences, varied thermal print quality—defeats heuristic extraction. At about 76% accuracy, roughly one in four fields requires correction, and the workflow cannot be deployed. Receipts are currently processed entirely by hand.
The evaluation was designed to provide a controlled comparison between the current OCR pipeline and an AI alternative, with statistical evidence and time analysis sufficient to support a deployment decision.
Why This Kind of Testing Matters
Whether AI extracts fields more accurately than rule-based OCR is only the first question. The harder question is whether the improvement reduces review burden enough to justify the deployment complexity and ongoing operational requirements of an AI system in production.
This evaluation answers that by processing the same documents through both methods, comparing accuracy field by field, and computing the time impact of each approach at realistic volumes. It also tests whether the accuracy improvement translates into actual time savings, whether multi-implementation consensus adds value, and whether an auto-accept model can safely eliminate human review for high-confidence documents. Pre-registered predictions establish expectations before each stage, and supporting artifacts are listed in Section 12.
The Approach
Every document in the evaluation dataset was processed by both the baseline workflow and the AI workflow. This paired comparison design controls for document-level variation—image quality, legibility, layout complexity—ensuring that observed differences are attributable to the extraction method, not the document.
The evaluation proceeded through six progressive stages:
- Requirements and success criteria — defined what to test and how to measure it
- Baseline measurement — established current-state accuracy and time benchmarks
- AI extraction — measured accuracy improvement and processing speed
- Multi-implementation consensus — tested whether agreement improves accuracy
- Auto-accept model — identified documents safe to skip human review
- Integrated time analysis — computed end-to-end time savings and cost framework
Stage 1: Requirements and success criteria defined
Stage 2: Baseline accuracy measured
Stage 3: AI extraction evaluated
Stage 4: Consensus tested (hypothesis failed)
Stage 5: Auto-accept model built
Stage 6: End-to-end time savings quantified

Stage 2: Measuring the Baseline — The Gap That Demanded Investigation
The evaluation began by measuring current-state performance. 1,497 documents (493 Type A, 504 Type B, 500 Type C) were processed through the existing OCR pipelines: the LSTM-based engine for invoices, the deep-learning OCR library for receipts.
Type A invoices achieved about 97% field accuracy. Type B achieved about 96%. Type C receipts managed only about 76%.
Type C’s accuracy is what motivated the AI evaluation. At about 76%, nearly one in four fields requires correction—a rate that makes automated processing impractical.

Stage 3: AI Extraction Results — Better Accuracy, but Slower
Five vision-language models were evaluated across nine configurations in a pilot study. A 9-billion parameter VLM with 8-bit quantization was selected based on accuracy (97% pilot accuracy), VRAM efficiency (fits on a single consumer-grade GPU), and structured output capability.
Across all 1,497 documents, the AI workflow improved accuracy by about 9 percentage points overall. The largest improvement was on Type C: from about 76% to about 97%.
Extraction errors dropped from 683 at baseline to 127 with AI—an 81% reduction. Empty extraction failures, where the OCR returned nothing for a field, dropped from 232 to 14.
A representative example: a scanned receipt from a retail chain. The deep-learning OCR baseline extracted the company name as garbled characters and the monetary total as an incorrect numeric value—a misread name and a wrong financial figure. The VLM extracted all four fields correctly. In the baseline workflow, that receipt would have forced the employee to manually look up the company name and re-enter multiple fields from scratch. In the AI workflow, the employee would only need a quick verification pass.
AI processing takes about 5.5 seconds per document on average, compared to under 0.4 seconds for the LSTM-based engine and about 3.4 seconds for the deep-learning OCR library. More accurate, but slower—a tradeoff that the later stages address directly.

Stage 4: Consensus Testing Outcome — When the Hypothesis Failed
Three extraction implementations ran in parallel on the 997 invoices (Types A and B): the LSTM-based engine (about 94%), the VLM (about 99.6%), and a transformer-based document OCR engine (about 97%). Each field was classified by agreement pattern: GREEN (all three agree), YELLOW (two agree), or RED (all disagree).
The hypothesis was that consensus—weighted majority voting—would be more accurate than any single implementation. It was not. That was a setback: the planned mechanism for hardening extraction accuracy simply did not work. The VLM alone outperformed consensus because the two traditional OCR engines agreeing on a wrong value could outvote the correct VLM extraction.
The agreement pattern proved valuable for a different purpose: as a confidence signal. What had failed as a voting mechanism worked as a triage signal. In the observed data, when all three implementations agreed on a field value (GREEN), the VLM extraction was correct 100% of the time. For example, on a sample invoice, the LSTM-based engine extracted a total with extraneous label text, the transformer OCR engine included different surrounding formatting, and the VLM returned only the clean numeric value—three independent systems with different raw formats converging on the same financial value.

Stage 5: Auto-Accept Eligibility — The Reframe That Salvaged the Approach
Using consensus for value selection produced only 3.1% review time savings—far below the 10% target. But the agreement categories, reframed as confidence signals, enabled a more powerful strategy: auto-accepting documents where all implementations agreed, bypassing human review entirely.
At the balanced threshold (P ≥ 0.970), about 22% of invoices could be auto-accepted with over 99% precision. One false accept occurred in 220 auto-accepted documents—one document containing an error was approved without review. At production scale (10,000 documents/month with 22% auto-accept), that rate translates to roughly 10 false accepts per month.
The ultra-conservative threshold (P ≥ 0.980) produced zero false accepts in the observed data, at the cost of a lower auto-accept rate (about 6%). This is the recommended starting point. Whether that conservative auto-accept rate is enough to justify AI deployment—whether the confidence signal actually reverses the time penalty—is what Stage 6 tested.

Stage 6: End-to-End Time Impact — The Counterintuitive Reversal
Here the evaluation turned up a counterintuitive finding. For invoices, AI extraction alone is slower than baseline OCR. The AI processing penalty (about 5.7 seconds for Type A, 7.7 for Type B, versus sub-second traditional OCR) outweighs the modest review time reduction. AI-only increases total per-document time by about 6.5%.
Adding auto-accept reverses this. With AI + auto-accept, total processing time drops by about 54% for invoices combined (Type A: 61%, Type B: 48%). For Type C receipts, AI achieves about 58% time savings compared to manual entry, on a workflow that currently cannot be deployed at all.
Sensitivity analysis varied all parameters by up to 50%. Even at worst case, AI + auto-accept remained faster than baseline.


The Result
Four deployment recommendations, each traceable to specific stage evidence:
- Deploy Type C first. Largest accuracy improvement (+21.6 percentage points) and about 58% time savings versus manual entry. Addresses a workflow that currently cannot be deployed and does not depend on the auto-accept pipeline. Infrastructure cost scales favorably when combined with Types A/B deployment. (Stages 3 and 6)
- Deploy Types A and B only with auto-accept. AI extraction alone does not offset the processing time penalty. The auto-accept layer makes it viable by eliminating human review for about 22% of documents. (Stage 6)
- Use consensus for confidence, not value selection. The VLM alone is more accurate than weighted majority voting. The agreement pattern serves as a confidence signal only. (Stages 4 and 5)
- Start at the ultra-conservative threshold (P ≥ 0.980). Zero false accepts in observed data. Relax the threshold as operational confidence grows. (Stage 5)
What This Means for Deployment
Type C benefits most from AI extraction. The receipt workflow achieves 58% time savings versus manual entry and the largest accuracy improvement (from about 76% to about 97%), making it viable for production. Auto-accept is not available for receipts because the consensus pipeline requires multiple independent extraction implementations, and only the VLM supports receipt extraction—all Type C documents require human review. Infrastructure cost is shared when deployed alongside Types A/B.
Types A and B require the full pipeline. AI extraction alone is slower than baseline (+6.5% total time). Adding consensus-based auto-accept reverses this, achieving 53.6% steady-state time savings by eliminating review for high-confidence documents. The conservative starting threshold (P ≥ 0.980) produces a lower initial savings rate until production evidence justifies relaxation toward the balanced threshold. Volume thresholds depend on organization-specific labor rates and infrastructure costs—see the parametric framework in Section 9.
The parametric time model translates accuracy improvements into hours saved at any document volume. A reader-configurable cost framework provides the formulas and parameter definitions needed to produce organization-specific dollar estimates. The observed time savings are large enough to justify an organization-specific economic review; because labor rates, infrastructure choices, and baseline operating costs vary widely, this report leaves dollar conversion to the reader.
Why This Recommendation Is Low-Risk to Deploy
The AI workflow was designed to minimize operational disruption.
Same employee workflow. The review interface is identical—same side-by-side view, same correction process. Employees see better pre-filled values, not a different tool. No retraining required on the employee review workflow.
Same audit schema. AI workflow records use the same append-only audit table as the baseline. The only difference is the extraction_method field value. Historical baseline records remain accessible alongside AI records, enabling direct cross-method queries on a single table.
Reversible rollout. The transition can be reversed at any time without data loss or process changes. If the AI workflow does not meet performance targets in production, the system reverts to baseline OCR extraction.
Parallel-run validation. Before full cutover, documents can be processed by both methods simultaneously. This provides real-world validation and gives the processing team confidence before the baseline is retired.
Fallback for model failures. If the AI model is unavailable, times out, or produces malformed output, the document is flagged and recoverable—no document is silently dropped. Every upload results in either a completed processing record or an explicit error record in the audit log.
Internal processing. Document images are processed in isolated containers via internal gRPC communication. No document data is transmitted externally. Access control and data handling follow existing organizational policies.
What This Evaluation Demonstrates
- Baseline-first evaluation. The existing system was measured before any AI comparison, not after—establishing an honest comparison target rather than testing the AI in isolation.
- Separating model improvement from business value. AI extraction alone makes invoices slower. The evaluation identified this and built the auto-accept layer that converts accuracy gains into actual time savings.
- Quantified review burden. Instead of treating OCR quality as an abstract percentage, the evaluation computed the time impact of every extraction error—connecting accuracy to measurable operational impact.
- Phased deployment over blanket replacement. The recommendation deploys Type C first (largest gain, lowest risk), adds invoice auto-accept at a conservative threshold, and relaxes constraints only with production evidence.
- Pre-registered predictions. Twenty directional predictions were registered (five per stage, each set registered before its stage ran). Four failed, and those failures were reported—not buried.
- Transferable methodology. The six-stage evaluation structure (baseline-first, staged, pre-registered, limitation-transparent) is designed to be rebuilt on any document-extraction pipeline; specific results must always be validated on the target dataset.
Part II
Technical Detail
The narrative above intentionally simplified the underlying methodology. The following section contains the complete statistical framework, stage-by-stage analysis, and evidence supporting every finding and recommendation in Part I.
Dataset: Invoice evaluation dataset (Types A/B) + Receipt evaluation dataset (Type C) | Model: 9B-parameter VLM (8-bit quantized)
1. Executive Summary
This evaluation assessed whether AI-based document extraction can replace traditional OCR for an organization processing invoices and receipts. Across six stages, 1,497 documents were processed through both baseline (LSTM-based OCR / deep-learning OCR) and AI (vision-language model) workflows. The AI workflow improved accuracy by +9.25 percentage points overall, with the largest gain on Type C receipts (+21.6pp). When combined with a consensus-based auto-accept model, the full pipeline achieves 53.6% time savings on invoices and 58% on receipts versus manual entry.
Table T1: Stage Overview
| Stage | Question Answered | Documents | Key Output |
|---|---|---|---|
| 1 | What are the evaluation goals and success criteria? | — | Requirements, time/cost framework, evaluation design |
| 2 | How accurate is baseline OCR? | 1,497 | Types A/B ~96%, Type C 76% |
| 3 | Can VLM extraction improve accuracy? | 1,497 | +9.25pp overall, +21.6pp Type C |
| 4 | Does multi-implementation consensus help? | 997 | No — VLM alone beats consensus |
| 5 | Can we auto-accept high-confidence documents? | 997 | 22% auto-accepted at 99.55% precision |
| 6 | What are the end-to-end time savings? | 1,497 | 53.6% A/B, 58% C vs manual |
Table T2: Decision Funnel
| Stage | Decision | Carried Forward |
|---|---|---|
| 2 | Baseline measurements establish comparison target | Accuracy and time benchmarks, cost model structure |
| 3 | VLM selected; accuracy improvement confirmed | VLM pipeline, label-stripping correction |
| 4 | Consensus < VLM; agreement pattern is a confidence signal | Color categories (GREEN/YELLOW/RED) |
| 5 | Auto-accept at P ≥ 0.970: 22% rate, 99.55% precision | Balanced threshold, conditional probabilities |
| 6 | AI-only slower; AI+AA delivers 53.6% savings | Deployment recommendation, time/cost model |
2. Methodology
2.1 Pipeline Architecture
Both the baseline and AI workflows follow the same three-stage DAG: validate (confirm document is processable) → extract (produce field values) → format (normalize output). The baseline uses the LSTM-based OCR engine (Types A/B) or deep-learning OCR library (Type C) with positional/pattern extraction rules. The AI workflow uses a 9B-parameter vision-language model with structured JSON output.
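The shared three-stage DAG can be sketched as follows. This is a minimal illustration, not the production API: the function names, the `ExtractionResult` type, and the toy `vlm` engine are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class ExtractionResult:
    fields: dict   # field name -> raw extracted value
    method: str    # e.g. "lstm_ocr", "dl_ocr", "vlm"

def validate(document: bytes) -> bytes:
    """Confirm the document is processable (non-empty in this sketch)."""
    if not document:
        raise ValueError("empty document")
    return document

def extract(document: bytes, engine) -> ExtractionResult:
    """Produce field values using the configured extraction engine."""
    return ExtractionResult(fields=engine(document), method=engine.__name__)

def format_output(result: ExtractionResult) -> dict:
    """Normalize output and keep the extraction method for the audit log."""
    return {
        "extraction_method": result.method,
        "fields": {k: str(v).strip() for k, v in result.fields.items()},
    }

def run_pipeline(document: bytes, engine) -> dict:
    # validate -> extract -> format, as in both workflows
    return format_output(extract(validate(document), engine))

# Toy engine standing in for the VLM's structured JSON output.
def vlm(document: bytes) -> dict:
    return {"total": " 1,234 ", "company": "Example Co."}

print(run_pipeline(b"scan", vlm))
```

Because both workflows share this DAG, swapping the baseline engine for the VLM changes only the `extract` stage; validation and formatting are unchanged.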
2.2 Evaluation Metrics
| Metric | Definition |
|---|---|
| Field accuracy rate | Fraction of fields where extracted value matches ground truth (exact or fuzzy match) |
| Review time | Simulated employee time to verify correct fields, correct errors, and enter missing values |
| Total document time | Processing time (OCR/VLM) + review time |
| Cost per document | Reader-configurable parametric model (Section 9.2, Appendix B): infrastructure + processing + labor + maintenance |
2.3 Statistical Framework
All accuracy comparisons use the Wilcoxon signed-rank test (paired, non-parametric) with effect size measured by Cohen’s d. Confidence intervals are computed via percentile bootstrap with 10,000 iterations. Pre-registered construct validity predictions (5 per stage, Stages 3–6) provide independent confirmation that the evaluation framework measures what it intends to measure.
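A stdlib-only sketch of two pieces of this framework. The Wilcoxon test itself would come from `scipy.stats.wilcoxon(baseline, ai)`; shown here are one common paired formulation of Cohen's d (mean difference over the SD of differences) and the percentile bootstrap CI. The per-document accuracies are toy values, not the evaluation data.

```python
import random
import statistics

def cohens_d_paired(baseline, ai):
    """Cohen's d for paired samples: mean difference / SD of differences."""
    diffs = [a - b for a, b in zip(ai, baseline)]
    return statistics.mean(diffs) / statistics.stdev(diffs)

def bootstrap_ci(baseline, ai, iterations=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired accuracy delta."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(ai, baseline)]
    n = len(diffs)
    means = sorted(
        statistics.mean(rng.choices(diffs, k=n)) for _ in range(iterations)
    )
    lo = means[int(alpha / 2 * iterations)]
    hi = means[int((1 - alpha / 2) * iterations) - 1]
    return lo, hi

# Toy per-document accuracy fractions, paired by document.
baseline = [0.90, 0.88, 0.95, 0.75, 0.80, 0.93, 0.85, 0.78]
ai       = [0.98, 0.97, 0.99, 0.95, 0.96, 0.99, 0.97, 0.94]
print(cohens_d_paired(baseline, ai), bootstrap_ci(baseline, ai))
```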
2.4 Review Time Model
Employee review time is simulated computationally for reproducibility. The model computes: fixed overhead + (correct fields × verification time) + (incorrect fields × correction time) + (failed fields × entry time). Field-level times vary by complexity category (short_numeric, short_text, long_text, constrained_choice, icon). See Appendix A for the full parameter table.
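The review time formula above can be transcribed directly. The overhead and per-field times below are illustrative placeholders, not the Appendix A parameters, and only three of the five complexity categories are shown.

```python
# Placeholder parameters (seconds) — not the Appendix A values.
FIXED_OVERHEAD_S = 5.0   # open document, load review interface
VERIFY_S  = {"short_numeric": 1.0, "short_text": 1.5, "long_text": 3.0}
CORRECT_S = {"short_numeric": 4.0, "short_text": 6.0, "long_text": 12.0}
ENTRY_S   = {"short_numeric": 5.0, "short_text": 8.0, "long_text": 15.0}

def review_time_s(fields):
    """fields: list of (complexity_category, outcome), where outcome is
    'correct' (verify only), 'incorrect' (correct the value), or
    'failed' (empty extraction, enter from scratch)."""
    total = FIXED_OVERHEAD_S
    for category, outcome in fields:
        if outcome == "correct":
            total += VERIFY_S[category]
        elif outcome == "incorrect":
            total += CORRECT_S[category]
        else:
            total += ENTRY_S[category]
    return total

doc = [("short_numeric", "correct"), ("short_text", "incorrect"),
       ("long_text", "correct"), ("short_numeric", "failed")]
print(review_time_s(doc))  # 5.0 + 1.0 + 6.0 + 3.0 + 5.0 = 20.0
```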
2.5 Document Scope
Stages 2, 3, and 6 evaluate all 1,497 documents (493 Type A + 504 Type B + 500 Type C). Stages 4 and 5 evaluate only the 997 invoice documents (Types A/B) because the consensus methodology requires three independent extraction implementations per field, and Type C (receipts) has only the VLM implementation—the LSTM-based engine and the transformer OCR engine do not support receipt extraction. This scope difference is noted in each stage section.
3. Stage 2: Baseline Measurements
Per-Type Accuracy
| Document Type | Mean Accuracy | Processing (ms) | Review (ms) | Total (ms) |
|---|---|---|---|---|
| Type A (n=493) | 96.77% | 321.6 | 51,154.2 | 51,475.8 |
| Type B (n=504) | 95.84% | 372.3 | 65,259.9 | 65,632.2 |
| Type C (n=500) | 75.70% | 3,411.9 | 36,142.0 | 39,553.9 |
Error Type Distribution
| Type | Substitution | Insertion | Deletion | Empty |
|---|---|---|---|---|
| A | 30 | 0 | 51 | 68 |
| B | 108 | 0 | 93 | 79 |
| C | 234 | 40 | 127 | 85 |
Baseline Time Budget (per document)
| Component | Type A | Type B | Type C |
|---|---|---|---|
| Processing time | 0.32s | 0.37s | 3.41s |
| Review time | 51.15s | 65.26s | 36.14s |
| Total document time | 51.48s | 65.63s | 39.55s |
These time measurements are the primary baseline data. Section 9 presents the parametric model for converting time into dollar cost estimates using configurable labor and infrastructure parameters.
Construct Validity Predictions (registered for Stage 3 evaluation)
| # | Prediction | Expected Direction |
|---|---|---|
| 1 | AI accuracy > baseline accuracy (overall) | AI > baseline |
| 2 | AI improvement larger on long_text fields than short_numeric fields | Δ(long_text) > Δ(short_numeric) |
| 3 | AI has fewer ‘empty’ extraction failures than baseline | AI < baseline |
| 4 | AI review time < baseline review time | AI < baseline |
| 5 | AI improvement on Type C > AI improvement on Types A/B | Δ(type_c) > Δ(type_a), Δ(type_c) > Δ(type_b) |
Decisions Carried Forward: Baseline measurements established for paired comparison; construct validity predictions registered for Stage 3 evaluation; time and cost model structure defined for later AI comparison.

4. Stage 3: AI Extraction
4.1 Model Selection
Five vision-language models were evaluated across nine model-strategy configurations. The table below shows the top 5 configurations by pilot accuracy.
Pilot Model Ranking
| Rank | Model | Strategy | Pilot Accuracy |
|---|---|---|---|
| 1 | VLM-A (9B, 8-bit quantized) | structured | 97.09% |
| 2 | VLM-A (4B, 4-bit quantized) | minimal | 95.70% |
| 3 | VLM-A (4B, 4-bit quantized) | structured | 95.70% |
| 4 | VLM-A (9B, 8-bit quantized) | minimal | 90.77% |
| 5 | VLM-B (11B, 4-bit quantized) | structured | 48.77% |
4.2 Accuracy Comparison
| Type | N | Baseline | AI | Delta | Cohen’s d | p-value | 95% CI |
|---|---|---|---|---|---|---|---|
| A | 493 | 96.77% | 99.39% | +2.62pp | 0.497 | 4.07e-14 | [1.99, 3.28] |
| B | 504 | 95.84% | 99.32% | +3.48pp | 0.761 | 1.74e-26 | [2.95, 4.04] |
| C | 500 | 75.70% | 97.30% | +21.60pp | 1.236 | 1.34e-53 | [19.55, 23.60] |
| Overall | 1,497 | 89.42% | 98.67% | +9.25pp | 0.718 | 5.34e-89 | [8.42, 10.12] |
4.3 Error Analysis
Of the fields evaluated, 127 AI extraction errors were found (down from 683 at baseline). Label-stripping correction resolved 34 of these, leaving 93 errors.
Error Categories
| Category | Count | Percentage |
|---|---|---|
| gt_includes_label | 51 | 40.2% |
| receipt_ocr_noise | 38 | 29.9% |
| empty_extraction | 14 | 11.0% |
| insertion_wrong_value | 10 | 7.9% |
| multiline_truncation | 8 | 6.3% |
| other | 6 | 4.7% |
4.4 Speed Comparison
| Type | Baseline Processing (ms) | AI Processing (ms) | Baseline Review (ms) | AI Review (ms) |
|---|---|---|---|---|
| A | 321.6 | 5,684.5 | 51,154.2 | 49,154.2 |
| B | 372.3 | 7,735.0 | 65,259.9 | 62,244.0 |
| C | 3,411.9 | 3,100.5 | 36,142.0 | 30,075.0 |
4.5 Time Impact Summary (AI-Only)
| Type | Review Time Saved/Doc (s) | AI Processing Penalty (s) | Net Time Change/Doc (s) |
|---|---|---|---|
| A | 2.0 | +5.4 | +3.4 (slower) |
| B | 3.0 | +7.4 | +4.4 (slower) |
| C | 6.1 | -0.3 | -6.4 (faster) |
AI-only extraction is net slower for Types A/B — the processing penalty exceeds review savings. The auto-accept layer (Stage 5) reverses this. Type C is net faster per document.
4.6 Construct Validity
| # | Prediction | Result |
|---|---|---|
| 1 | AI accuracy > baseline accuracy (overall) | PASS |
| 2 | AI improvement larger on long_text fields than short_numeric fields | FAIL |
| 3 | AI has fewer ‘empty’ extraction failures than baseline | PASS |
| 4 | AI review time < baseline review time | PASS |
| 5 | AI improvement on Type C > AI improvement on Types A/B | PASS |
Decisions Carried Forward: VLM selected as extraction engine; prompt strategy confirmed (structured JSON output); label-stripping correction applied to ground truth evaluation.



5. Stage 4: Multi-Implementation Consensus
Three extraction implementations ran in parallel on 997 invoice documents (Types A and B only—Type C excluded because only the VLM supports receipt extraction).
Implementation Accuracy (Overall, 10,746 in-template fields)
| Implementation | Overall Accuracy | 95% CI |
|---|---|---|
| VLM (9B, 8-bit quantized) | 99.61% | [99.49, 99.72] |
| Consensus (majority vote) | 99.12% | [98.94, 99.28] |
| Transformer OCR engine | 96.95% | [96.61, 97.26] |
| LSTM-based OCR engine | 93.74% | [93.25, 94.19] |
Color Category Distribution (in-template fields)
| Type | GREEN | YELLOW | RED |
|---|---|---|---|
| A | 44.2% | 40.6% | 15.2% |
| B | 40.0% | 40.2% | 19.7% |
Consensus Accuracy by Category
| Category | Accuracy | Field Count |
|---|---|---|
| GREEN | 100.00% | 4,483 |
| YELLOW | 98.11% | 4,342 |
| RED | 99.32% | 1,921 |
VLM alone (99.61%) outperformed consensus (99.12%). Majority voting with weaker implementations degrades the best single model.
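A toy illustration of this failure mode: two weaker engines that agree on the same misread outvote the correct VLM extraction. The values are invented for the example, and the tie-break fallback is an assumption of this sketch.

```python
from collections import Counter

def majority_vote(values):
    """Pick the value at least two implementations agree on; fall back to
    the first (VLM) value under full disagreement (a sketch assumption)."""
    value, count = Counter(values).most_common(1)[0]
    return value if count >= 2 else values[0]

ground_truth = "1234"
extractions = {
    "vlm": "1234",             # correct
    "lstm_ocr": "1284",        # identical misread by both
    "transformer_ocr": "1284", # traditional engines
}
voted = majority_vote(list(extractions.values()))
print(voted == ground_truth)               # consensus picks the wrong value
print(extractions["vlm"] == ground_truth)  # the VLM alone was right
```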
Construct Validity: 3/5 PASS
| # | Prediction | Result |
|---|---|---|
| 1 | Consensus accuracy ≥ VLM accuracy (all in-template fields) | FAIL |
| 2 | GREEN field rate higher for Type A than Type B | PASS |
| 3 | Prioritized review time < standard review time | PASS |
| 4 | GREEN actual accuracy > YELLOW > RED | FAIL |
| 5 | φ(transformer_ocr, lstm_ocr) > φ(transformer_ocr, vlm) | PASS |
Decisions Carried Forward: Use agreement as confidence signal (not for value selection); VLM extraction is the production path.

6. Stage 5: Auto-Accept Confidence Model
Stage 4’s consensus approach was reframed: instead of selecting values by agreement, agreement patterns serve as confidence signals. The prioritized-review approach from Stage 4 yielded only 3.1% time savings; the auto-accept reframing achieves 30.8%.
Conditional Probabilities (with Wilson 95% CIs)
| Category | P(VLM Correct) | Field Count | 95% CI |
|---|---|---|---|
| GREEN | 1.000000 | 4,483 | [0.9991, 1.0000] |
| YELLOW | 0.994012 | 4,342 | [0.9912, 0.9959] |
| RED | 0.991671 | 1,921 | [0.9865, 0.9949] |
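The Wilson score interval used for these CIs can be computed as follows; the inputs are the correct-field count `k` out of `n` fields in a category. Applied to the GREEN row (4,483 of 4,483 correct), it reproduces the [0.9991, 1.0000] interval above.

```python
import math

def wilson_ci(k, n, z=1.96):
    """Wilson score interval for a binomial proportion k/n at ~95%."""
    p = k / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return center - half, center + half

lo, hi = wilson_ci(4483, 4483)
print(round(lo, 4), round(hi, 4))  # 0.9991 1.0
```

Unlike the naive normal-approximation interval, Wilson stays inside [0, 1] and gives a non-degenerate lower bound even when every observed field is correct, which is exactly the GREEN case here.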
Threshold Selection (5 key rows from 21)
| Threshold | AA Rate | Precision | False Accepts | Time Savings |
|---|---|---|---|---|
| 0.960 | 53.36% | 96.99% | 16 | 60.2% |
| 0.965 | 34.50% | 97.67% | 8 | 42.7% |
| 0.970 | 22.07% | 99.55% | 1 | 30.8% |
| 0.975 | 11.63% | 99.14% | 1 | 20.8% |
| 0.980 | 6.22% | 100.00% | 0 | 15.4% |
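The report does not spell out how the per-category probabilities combine into the document-level score P that the thresholds apply to; one plausible reading, sketched here under a field-independence assumption, is the product of the conditional probabilities of each field's category. Treat this as an illustration of the thresholding mechanics, not the report's exact model.

```python
# Per-category P(VLM correct) from the conditional probability table.
P_CORRECT = {"GREEN": 1.000000, "YELLOW": 0.994012, "RED": 0.991671}

def p_doc(field_categories):
    """Document confidence as the product of per-field probabilities
    (independence is a modeling assumption of this sketch)."""
    p = 1.0
    for cat in field_categories:
        p *= P_CORRECT[cat]
    return p

def auto_accept(field_categories, threshold=0.980):
    """Bypass human review when document confidence meets the threshold."""
    return p_doc(field_categories) >= threshold

# Mostly-GREEN documents clear the conservative threshold...
print(auto_accept(["GREEN"] * 9 + ["YELLOW"]))
# ...while a few YELLOW/RED fields push P_doc below it.
print(auto_accept(["GREEN"] * 5 + ["YELLOW"] * 3 + ["RED"] * 2))
```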
Cross-Validation: 5-fold CV confirmed threshold stability (standard deviation < 0.001 across folds).
Construct Validity: 4/5 PASS
| # | Prediction | Result |
|---|---|---|
| 1 | P(VLM correct \| GREEN) = 1.0 in all CV folds | PASS |
| 2 | Auto-accept rate higher for Type A than Type B | PASS |
| 3 | CV precision at P ≥ 0.970 ≥ 98% | PASS |
| 4 | P_doc calibration error < 5pp | FAIL |
| 5 | Review time savings at P ≥ 0.970 ≥ 20% | PASS |
Decisions Carried Forward: Balanced threshold P ≥ 0.970 for deployment.


7. Stage 6: Integrated Time Analysis
Scenario Definitions
| # | Scenario | Description | Applies To |
|---|---|---|---|
| 1 | Baseline | LSTM-based OCR / DL-OCR + full review | All |
| 2 | AI-only | VLM + full review (no auto-accept) | All |
| 3 | AI+Auto-Accept | VLM + auto-accept + reduced review | A/B |
| 4 | Type C Manual | Manual data entry (no OCR) | C only |
| 5 | Type C AI | VLM + full review | C only |
Types A/B Per-Document Time (ms)
| Scenario | Type A | Type B |
|---|---|---|
| Baseline | 51,475.8 | 65,632.2 |
| AI-only | 54,838.7 | 69,979.1 |
| AI+Auto-Accept | 19,985.8 | 34,306.7 |
Type C Per-Document Time (ms)
| Scenario | Total (ms) |
|---|---|
| Baseline OCR | 39,553.9 |
| AI extraction | 33,175.5 |
| Manual entry | 79,000.0 |
Monthly Time Projections (hours/month)
| Volume Tier | Type | Baseline (hrs) | AI+AA (hrs) | Savings |
|---|---|---|---|---|
| 300/mo | A | 4.3 | 1.7 | 61.1% |
| 300/mo | B | 5.5 | 2.9 | 47.7% |
| 1,000/mo | A | 14.3 | 5.5 | 61.2% |
| 1,000/mo | B | 18.2 | 9.5 | 47.7% |
| 5,000/mo | A | 71.5 | 27.8 | 61.2% |
| 5,000/mo | B | 91.2 | 47.6 | 47.7% |
Type C Monthly Time Projections (hours/month)
| Volume Tier | Manual (hrs) | AI (hrs) | Savings |
|---|---|---|---|
| 300/mo | 6.6 | 2.8 | 58.0% |
| 1,000/mo | 21.9 | 9.2 | 58.0% |
| 5,000/mo | 109.7 | 46.1 | 58.0% |
Sensitivity Analysis: Top 5 Parameters by Impact
| Rank | Parameter | Swing (pp) | Range |
|---|---|---|---|
| 1 | YELLOW Discount Factor | 15.1 | 44.5%–59.5% |
| 2 | Pre-Fill Overhead Reduction | 8.2 | 47.9%–56.1% |
| 3 | RED Discount Factor | 6.7 | 49.3%–56.0% |
| 4 | Per-Field Verification Time | 5.4 | 48.5%–53.8% |
| 5 | GREEN Discount Factor | 3.9 | 48.1%–52.0% |
Threshold Sensitivity Sweep
| Threshold | Type A AA Rate | Type B AA Rate | Combined A/B Savings |
|---|---|---|---|
| 0.955 | 89.9% | 35.7% | 69.3% |
| 0.960 | 79.9% | 27.4% | 65.4% |
| 0.965 | 58.0% | 11.5% | 58.0% |
| 0.970 | 39.8% | 4.8% | 53.6% |
| 0.975 | 22.9% | 0.6% | 50.2% |
| 0.980 | 12.6% | 0.0% | 48.6% |
| 0.985 | 5.7% | 0.0% | 47.7% |
Robustness: The minimum savings across all one-at-a-time parameter sweeps was 44.5% (YELLOW discount factor at worst case). AI + auto-accept remained faster than baseline under every variation tested.
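The one-at-a-time (OAT) sweep behind the tornado table can be sketched as follows: each parameter is varied across its range while the others stay at nominal values, and the swing is the resulting spread in combined savings. The savings function below is a simple monotone stand-in with invented coefficients, not the report's full time model.

```python
NOMINAL = {"yellow_discount": 0.5, "prefill_reduction": 0.3, "verify_s": 1.5}

def combined_savings(params):
    """Illustrative stand-in for the Stage 6 time model (coefficients
    are invented; only the OAT mechanics are the point here)."""
    return (0.536
            + 0.3 * (params["yellow_discount"] - 0.5)
            + 0.2 * (params["prefill_reduction"] - 0.3)
            - 0.02 * (params["verify_s"] - 1.5))

def oat_sweep(nominal, spans, steps=11):
    """Vary one parameter at a time across its span; report the swing."""
    swings = {}
    for name, (lo, hi) in spans.items():
        values = []
        for i in range(steps):
            p = dict(nominal)
            p[name] = lo + (hi - lo) * i / (steps - 1)
            values.append(combined_savings(p))
        swings[name] = max(values) - min(values)
    return swings

spans = {"yellow_discount": (0.25, 0.75),   # ±50% around nominal
         "prefill_reduction": (0.15, 0.45),
         "verify_s": (0.75, 2.25)}
print(oat_sweep(NOMINAL, spans))
```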
Construct Validity: 5/5 PASS
| # | Prediction | Result |
|---|---|---|
| CV-1 | AI+AA total time < Baseline total time (Types A/B) | PASS |
| CV-2 | AI+review total < Manual entry (Type C) | PASS |
| CV-3 | AI+AA advantage for Types A/B holds under ALL OAT parameter variations | PASS |
| CV-4 | Type C AI advantage over manual holds under ALL parameter variations | PASS |
| CV-5 | Sensitivity tornado: yellow_discount has largest swing | PASS |
Decisions Carried Forward: AI+Auto-Accept confirmed as the deployment recommendation for Types A/B; AI-only insufficient (slower than baseline); Type C strongest standalone business case; all conclusions robust under parameter variation.



8. Cross-Stage Synthesis
8.1 Metrics Consistency Reconciliation
| Metric | Value A | Value B | Explanation |
|---|---|---|---|
| Overall AI accuracy | 0.9867 | 0.9876 | Pre vs post label-stripping correction. 34 of 127 AI errors resolved. Report uses 0.9867 (conservative, pre-correction) |
| Review time savings | 30.8% | 53.6% | Stage 5 scope: all fields (31 fields). Stage 6 scope: template-specific. Report uses 53.6% |
| Consensus vs VLM accuracy | 99.12% | 99.61% | VLM alone outperforms consensus; agreement reframed as confidence signal |
8.2 Cumulative Construct Validity
| Stage | Total | PASS | FAIL | Rate |
|---|---|---|---|---|
| 3 | 5 | 4 | 1 | 80% |
| 4 | 5 | 3 | 2 | 60% |
| 5 | 5 | 4 | 1 | 80% |
| 6 | 5 | 5 | 0 | 100% |
| Cumulative | 20 | 16 | 4 | 80% |

8.3 Stage-to-Stage Decision Propagation
Stage 2 established baseline measurements. Stage 3 confirmed the VLM improves accuracy but is slower. Stage 4 found that consensus did not improve on the VLM, but that agreement patterns work as confidence signals. Stage 5 converted those signals into an auto-accept model. Stage 6 integrated all components and confirmed that the auto-accept layer, not AI accuracy alone, drives the time savings case for invoices.
9. Unified Time and Cost Analysis
9.1 Time Savings Summary
The primary evaluation metric is per-document time—the sum of processing time and employee review time. Time savings are independent of labor rates and infrastructure costs, making them the universal measure of operational improvement.
Per-Document Total Time by Scenario (seconds)
| Type | Baseline | AI-Only | AI+Auto-Accept | Time Saved (AI+AA) |
|---|---|---|---|---|
| A | 51.5 | 54.8 | 20.0 | 31.5 (61.2%) |
| B | 65.6 | 70.0 | 34.3 | 31.3 (47.7%) |
| C (vs manual) | 79.0 | 33.2 | — | 45.8 (58.0%) |
| C (vs baseline OCR) | 39.6 | 33.2 | — | 6.4 (16.1%) |
AI-only extraction is slower than baseline for Types A/B (processing penalty outweighs review savings). The auto-accept layer reverses this by eliminating review entirely for high-confidence documents. Type C’s primary comparison is against manual entry (since the baseline OCR workflow was not viable for production).
9.2 Reader-Configurable Cost Framework
This framework is provided as a planning tool. Readers should substitute their own labor rates and infrastructure costs to produce organization-specific estimates.
cost_per_doc(volume) = monthly_fixed / volume + labor_per_doc
labor_per_doc = (total_document_time_seconds / 3600) × hourly_rate

Where:
- monthly_fixed = combined infrastructure + maintenance cost per month
- labor_per_doc = employee time cost computed from total document time (processing wait + review)
- volume = documents processed per month
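The framework above transcribes directly to code. The monthly fixed cost and hourly rate below are placeholder figures for illustration only; substitute organization-specific values.

```python
def cost_per_doc(volume, monthly_fixed, total_document_time_s, hourly_rate):
    """cost_per_doc(volume) = monthly_fixed / volume + labor_per_doc,
    with labor_per_doc = (total_document_time_s / 3600) * hourly_rate."""
    labor_per_doc = (total_document_time_s / 3600) * hourly_rate
    return monthly_fixed / volume + labor_per_doc

# Type A at 1,000 docs/month, using the per-document times from Section 9.1.
# The 500.0 monthly fixed cost and 30.0 hourly rate are placeholders.
baseline = cost_per_doc(1000, monthly_fixed=0.0,
                        total_document_time_s=51.5, hourly_rate=30.0)
ai_aa = cost_per_doc(1000, monthly_fixed=500.0,
                     total_document_time_s=20.0, hourly_rate=30.0)
print(round(baseline, 3), round(ai_aa, 3))  # 0.429 0.667
```

With these placeholder figures the AI workflow costs more per document at 1,000 docs/month: the labor saved per document (about 0.26 in these units) has to amortize the fixed cost, giving an illustrative break-even near 500 / 0.26 ≈ 1,900 documents/month. Actual break-even volumes depend entirely on the substituted parameters.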
9.3 Worked Example
A worked example applying the framework above to reference parameters is provided in Appendix B. Readers can substitute organization-specific values for labor rate and infrastructure cost to produce context-appropriate estimates.
9.4 Sensitivity to Labor Rate
Break-even volumes shift with labor rates. Higher rates increase per-document savings, lowering the volume required to justify infrastructure investment. A sensitivity table across a range of labor rates is provided in Appendix B alongside the worked example.

10. Final Recommendation
10.1 Per-Type Recommendation
| Document Type | Recommendation | Evidence | Operational Fit |
|---|---|---|---|
| Type C (receipts) | Deploy AI extraction (VLM + mandatory review) | +21.6pp accuracy, 58% time savings vs manual | Strongest standalone case; enables currently non-viable workflow |
| Type A (standard invoices) | Deploy AI + Auto-Accept if volume justifies | +2.6pp accuracy, 61.2% time savings | Requires auto-accept layer and conservative rollout |
| Type B (detailed invoices) | Deploy AI + Auto-Accept if volume justifies | +3.5pp accuracy, 47.7% time savings | Requires auto-accept layer and conservative rollout |
Operational fit is based on time savings evidence and deployment dependencies. Volume-dependent cost analysis is available via the parametric framework in Section 9.2 with reference parameters in Appendix B.
10.2 Selection Rationale
Combined deployment shares infrastructure cost. The GPU infrastructure serves all three document types. Deploying Types A/B and Type C together means each type shares the incremental infrastructure cost above the existing CPU baseline, improving per-type economics.
Type C benefits most from accuracy improvement. The receipt workflow has the largest accuracy gap (75.7% to 97.3%) and addresses a workflow that currently cannot be deployed. It does not depend on the auto-accept pipeline. Every receipt currently requires full manual data entry; the AI workflow eliminates 58% of that time.
Types A/B require auto-accept. AI extraction alone makes invoices slower (+6.5% total time). The auto-accept model reverses this by eliminating review for high-confidence documents. Without auto-accept, the time penalty is not recovered. With auto-accept, 53.6% steady-state time savings are achieved; whether this justifies infrastructure cost depends on volume and labor rate.
10.3 Caveats
- Simulated review times (not validated with real employees)
- Single VLM architecture family tested; other VLM architectures untested
- Invoice evaluation dataset is synthetic (real invoices may have different error patterns)
- Auto-accept model calibrated on 997 documents (50 templates); new templates require recalibration
- Future scope capabilities (unknown-layout classification and dynamic form generation) were not evaluated
11. Limitations
Data
- L-02 [medium]: Invoice dataset is synthetic; real invoices may have more variability.
- L-03 [low]: Receipt dataset limited to 4 fields; real receipts may require more.
- L-07 [medium]: Consensus analysis limited to Types A/B; Type C excluded because only one OCR implementation exists for that format.
- L-09 [low]: GRAY fields (~65.2% of field instances) excluded from consensus.
- L-13 [medium]: Type C excluded from auto-accept confidence model. All Type C documents always require manual review.
- L-16 [low]: Type C manual processing time (79s) is theoretical, not measured.
Methodology
- L-01 [high]: Review times simulated, not empirically measured. The review time model uses estimated per-field verification, correction, and entry times based on field complexity categories.
- L-06 [medium]: Prompt strategies optimized on same data used for evaluation. No held-out test set was used.
- L-08 [low]: Three implementations may not capture all possible error patterns.
- L-10 [high]: Review discount factors (GREEN=0%, YELLOW=60%, RED=80%) are assumed, not empirically validated.
- L-11 [medium]: Conditional probabilities computed and evaluated on same dataset (mitigated by 5-fold CV).
- L-14 [medium]: Integrated time model compounds estimated parameters from multiple stages.
- L-15 [low]: OAT sensitivity analysis; parameter interactions not modeled.
Model
- L-04 [medium]: Single VLM evaluated (9B, 8-bit quantized); other models may differ.
- L-12 [medium]: Independence assumption for document-level confidence may not hold for uniformly degraded scans.
Deployment
- L-05 [medium]: Inference time depends on GPU hardware; results specific to one consumer-grade GPU.
- L-17 [low]: Monthly time projections assume constant per-document processing time.
- L-18 [low]: Time estimates exclude document acquisition, routing, and post-processing overhead.
Additional Limitations (Stage 7)
- L-19 [medium]: Parametric cost model uses assumed hourly rates and GPU costs; organization-specific values will change break-even volumes. Mitigation: Report provides parametric model with adjustable inputs.
- L-20 [high]: No real employee review validation at any stage; all time savings derived from simulated review model. Mitigation: Sensitivity analysis shows all conclusions robust to ±50% parameter variation.
- L-21 [medium]: Auto-accept precision (99.55%) means ~1 false accept per 220 auto-accepted documents. At 10,000 docs/month with 22% auto-accept rate, this is ~10 false accepts/month. Mitigation: Ultra-conservative threshold (P ≥ 0.980) reduces false accepts to 0 in observed data.
12. Artifact Index
Table A: Stage Reports
| Stage | File | Description |
|---|---|---|
| 2 | results/doc_ocr_baseline/analysis/baseline_results_summary.md | Baseline OCR measurements |
| 3 | results/doc_ocr_ai/analysis/final_report.md | AI extraction evaluation |
| 4 | results/doc_ocr_consensus/analysis/final_report.md | Multi-implementation consensus |
| 5 | results/doc_ocr_autoaccept/analysis/final_report.md | Auto-accept confidence model |
| 6 | results/doc_ocr_integrated/analysis/final_report.md | Integrated time analysis |
Table B: Key Data Files
| File | Stage | Description |
|---|---|---|
| paired_accuracy_comparison.json | 3 | Per-type accuracy with statistical tests |
| rescore_label_stripping.json | 3 | Post-correction accuracy analysis |
| consensus_vs_single.json | 4 | VLM vs consensus accuracy |
| auto_accept_analysis.json | 5 | Auto-accept rates and precision |
| threshold_table.json | 5 | 21 threshold options with metrics |
| scenario_time_comparison.json | 6 | 5 scenarios × 3 types time comparison |
| sensitivity_tornado.json | 6 | Parameter sensitivity ranking |
| consolidated_metrics.json | 7 | All headline metrics consolidated |
Appendix A: Review Time Model Parameters
Review Time Parameters (seconds)
The model defines five complexity categories. Three (short_numeric, short_text, long_text) are used in the current evaluation; two (constrained_choice, icon) are defined for future document types.
| Parameter | Value (s) |
|---|---|
| Overhead per document | 15.0 |
| short_numeric — verification | 2.5 |
| short_numeric — correction | 6.5 |
| short_numeric — entry | 10.0 |
| short_text — verification | 3.5 |
| short_text — correction | 10.0 |
| short_text — entry | 15.0 |
| long_text — verification | 5.0 |
| long_text — correction | 16.0 |
| long_text — entry | 24.0 |
Invoice Dataset Field Complexity Mapping (31 fields)
- long_text (8): recipient addresses, entity names, payment terms, notes, and similar multi-word fields
- short_numeric (12): monetary amounts (totals, subtotals, discounts) and tax values at various rates
- short_text (11): dates, tax identification numbers, document numbers, seller contact details, and document titles
Receipt Dataset Field Complexity Mapping (4 fields)
- entity name: short_text
- address: long_text
- date: short_text
- monetary total: short_numeric
Appendix B: Cost Model Framework
Cost Model Equations
cost_per_doc(V) = monthly_fixed / V + labor_per_doc
labor_per_doc = (total_document_time_seconds / 3600) * hourly_rate
break_even_volume = additional_fixed_monthly / net_labor_savings_per_doc
Configurable Parameters
| Parameter | Reference Value | Description |
|---|---|---|
| Employee hourly rate | $40.00/hr | Fully-loaded labor cost per hour (sensitivity: $35–$50) |
| AI deployment (monthly) | $8,000 | GPU-accelerated hosting: infrastructure + maintenance + inference (all types) |
| Baseline deployment (monthly) | $4,000 | CPU-only hosting: infrastructure + maintenance (Types A/B) |
| Manual processing (monthly) | $0 | No infrastructure; pure labor cost (Type C current state) |
Worked Example: Dollar Cost at Reference Parameters
The tables below use reference parameter values to illustrate the cost model’s output. They are not claims about expected savings for any specific organization. Reference parameters: hourly labor rate = $40.00/hr, AI deployment (GPU) = $8,000/month, baseline deployment (CPU) = $4,000/month.
Per-Document Cost at Key Volume Tiers (reference parameters)
| Type | Scenario | 1,000/mo | 5,000/mo | 10,000/mo | 20,000/mo |
|---|---|---|---|---|---|
| A | Baseline | $4.57 | $1.37 | $0.97 | $0.77 |
| A | AI-Only | $8.61 | $2.21 | $1.41 | $1.01 |
| A | AI+Auto-Accept | $8.22 | $1.82 | $1.02 | $0.62 |
| B | Baseline | $4.73 | $1.53 | $1.13 | $0.93 |
| B | AI-Only | $8.78 | $2.38 | $1.58 | $1.18 |
| B | AI+Auto-Accept | $8.38 | $1.98 | $1.18 | $0.78 |
| C | AI-Only | $8.37 | $1.97 | $1.17 | $0.77 |
| C | Manual | $0.88 | $0.88 | $0.88 | $0.88 |
At lower volumes, infrastructure cost dominates and AI is significantly more expensive per document. At higher volumes, infrastructure amortizes and the labor savings from AI become the deciding factor.
Break-Even Volumes (reference parameters, $40/hr)
| Type | AI-Only Break-Even | AI+AA Break-Even |
|---|---|---|
| A | No break-even (AI-only is slower) | ~11,500/mo |
| B | No break-even (AI-only is slower) | ~11,600/mo |
| C | ~15,800/mo (standalone, vs manual) | — |
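The break-even figures follow from the equations at the top of this appendix; a minimal check at reference parameters (function name is illustrative):

```python
def break_even_volume(additional_fixed_monthly: float,
                      seconds_saved_per_doc: float, hourly_rate: float) -> float:
    """Monthly volume at which labor savings offset the added fixed cost."""
    net_labor_savings = (seconds_saved_per_doc / 3600.0) * hourly_rate
    return additional_fixed_monthly / net_labor_savings

# Type A AI+Auto-Accept: $8,000 GPU vs $4,000 CPU baseline, 31.5 s saved/doc, $40/hr
print(round(break_even_volume(8000.0 - 4000.0, 31.5, 40.0)))  # 11429
```

This agrees with the ~11,500/mo figure above to within rounding of the published per-document times.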
Sensitivity to Labor Rate
| Hourly Rate | Type A AA Break-Even | Type B AA Break-Even | Type C Break-Even (standalone) |
|---|---|---|---|
| $35/hr | 13,100 | 13,200 | 18,000 |
| $40/hr | 11,500 | 11,600 | 15,800 |
| $45/hr | 10,200 | 10,300 | 14,000 |
| $50/hr | 9,200 | 9,300 | 12,600 |
Disclosure: Results on this page are derived from controlled benchmarks and are not a guarantee of performance in other environments.
© 2026 RCTK. All rights reserved. This study may not be reproduced or distributed without prior written permission.