
Reclaiming AI Document Search Quality Through Configuration Testing And Parameter Sweeps

This report documents the systematic optimization of a Retrieval-Augmented Generation (RAG) pipeline through four rounds of evidence-based testing: 16,143 evaluations narrowing 40+ configurations to a single recommendation.

The Challenge

When organizations use AI to answer questions from their own documents—contracts, research papers, technical manuals, policy guides—the quality of those answers depends on dozens of configuration decisions. How should documents be split into searchable pieces? How many pieces should the system retrieve? How should they be ranked? How should the AI be instructed to formulate its response?

These choices directly affect whether answers are reliable enough to trust in real work. When the configuration is wrong, the consequences are concrete: unreliable answers, teams that revert to manual search, and support escalations that the system was supposed to prevent. Yet most teams either accept default settings or make these choices based on intuition, and without testing, there is no way to know how much quality is being left on the table.

I took a different approach: systematic, evidence-based testing. I designed a series of experiments to measure the impact of every major configuration decision, compare the alternatives head-to-head, and follow the evidence to a final recommendation.

To stress-test the configurations, I used a deliberately challenging multi-domain benchmark: expert-level questions spanning medicine, law, science, and technology. These are the kinds of complex, cross-domain queries that expose weaknesses in a document search system, providing stronger evidence than testing on routine questions alone.

The result: four rounds of progressively focused testing (more than 16,000 evaluations in five days) that narrowed more than 40 configurations down to one.


Why This Kind of Testing Matters

Most AI document search systems are deployed with settings that feel plausible but have never been tested against alternatives. Over four rounds, evidence-based testing overturned conventional assumptions, caught a silent failure in the evaluation system itself, and identified a configuration that no amount of intuition would have reached.

When leadership asks “why this configuration?”, every choice traces back to a specific round of testing. When a component turns out to hurt rather than help, it is caught before deployment. When the measurement system itself stops working, that is caught too.


The Approach: Test, Measure, Narrow, Repeat

The evaluation pipeline had been validated in prior rounds of internal testing on different datasets before this benchmark began. With tested infrastructure in hand, I designed a four-round optimization funnel. Each round answered one specific question, and its findings constrained the next:

  • Round 1: Which document-splitting method produces the best answers? Four methods tested across 7,350 evaluations.
  • Round 2: How should the selected method be fine-tuned? Twenty-eight parameter combinations tested across 5,600 evaluations.
  • Round 3: Are the measurements trustworthy? A measurement audit that re-scored 1,600 answers and revealed a critical flaw in the scoring system.
  • Round 4: Which configuration produces the most helpful answers? An independent AI judge, calibrated against human ratings, scored 1,600 answers for real-world usefulness.

Each round eliminated weaker options and sharpened the focus for the next:

Round 1:  40+ configurations tested
  Round 2:  Selected method fine-tuned
    Round 3:  Measurements audited and fixed
      Round 4:  Best configuration identified

The methodology (select, optimize, audit, validate) is designed to be reapplied on other AI pipelines, though the specific results must be validated on the target dataset and use case.


Round 1 — Finding the Right Foundation

The first and most impactful decision in any document search system is how to break documents into searchable pieces. A long technical manual or research paper must be divided into smaller segments that the system can retrieve individually. How those segments are created has a cascading effect on every answer the system produces.

I tested four fundamentally different splitting methods across 7,350 question-answer evaluations, measuring each on fidelity to source material, question coverage, and effective use of retrieved information.

No single method dominated every measure. One led on source fidelity; another led on question coverage and information usage. The selected foundation was sentence-window retrieval—an approach that retrieves a targeted passage along with its immediate neighbors, balancing precision with enough surrounding context to answer well. It offered the strongest overall balance across quality, operational stability, and tunability for further optimization.

Figure: Four document-splitting methods scored across five quality measures. No single method dominated. The pattern of strengths and weaknesses across methods guided the selection of sentence-window retrieval as the strongest overall foundation.

This round also uncovered a counterintuitive finding: a commonly recommended processing step called “re-ranking,” a second-pass filter intended to improve retrieval quality, actually made answers worse on this dataset. I flagged it for confirmation in the next round.


Round 2 — Fine-Tuning the Selected Method

With sentence-window retrieval as the foundation, Round 2 tested variations of its key parameters across 5,600 evaluations. The most important was “window size”—how much surrounding context to include around each retrieved sentence, ranging from minimal (just the neighboring sentences) to broad (several sentences in each direction).

The smallest window—just the immediate neighbors—produced the highest quality. Larger windows diluted the answer with less relevant material.

Figure: Overall quality score by window size. The smallest window produced the highest quality. Larger windows, which include more surrounding context, diluted answers with less relevant material.

This round also confirmed the earlier finding about re-ranking. Across multiple parameter combinations, disabling re-ranking consistently improved quality. Two independent rounds of testing, thousands of evaluations each, reached the same conclusion. Without testing, this step would have shipped by default—adding latency while reducing answer quality.

I also tested how different instructions affect the AI’s responses. A detailed template—one that asked the AI to provide thorough, well-sourced answers—outperformed the default, and became part of the recommended configuration.


Round 3 — When the Measurements Stopped Working

After two rounds of optimization, I had a strong configuration with validated parameter choices. The next step was testing how the AI formulates its responses: how creative versus deterministic it should be, response length limits, and prompt variations. I tested eight configurations.

The top four scored almost identically. Their quality scores differed by less than two-tenths of a percent. It looked like a plateau.

I could have stopped here. Instead, I investigated why the scores were so similar.

Three of the five quality measurements had “ceiling effects.” Imagine a thermometer that only goes up to 100 degrees being used to measure temperatures of 105 or 110—everything above 100 reads the same. The majority of answers were scoring at the maximum on several metrics. In one metric, 87% of all answers received the maximum score.

Figure: The same answers scored two different ways. The old method (orange/red) piles scores at the maximum, so everything looks the same. The improved method (blue) reveals a full range of quality differences that were previously hidden.

The fix was redesigning the scoring from simple pass/fail (did the answer exceed a threshold?) to graduated measurement (how far above or below?). Quality checks that had been failing under the old system passed under the new one. The old measurements were hiding real differences.
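The difference between the two scoring schemes can be shown with a toy sketch (hypothetical similarity values and thresholds, not the study's code): under pass/fail scoring, every answer above the threshold reads as identical, while graduated scoring preserves the differences.

```python
# Illustrative sketch of pass/fail vs graduated scoring of the same values.

def binary_score(similarity: float, threshold: float = 0.55) -> float:
    """Old approach: anything above the threshold reads as a perfect 1.0."""
    return 1.0 if similarity >= threshold else 0.0

def graduated_score(similarity: float) -> float:
    """New approach: report the raw value, preserving differences."""
    return similarity

answers = [0.58, 0.72, 0.91]                   # hypothetical similarities
print([binary_score(s) for s in answers])      # [1.0, 1.0, 1.0] - all identical
print([graduated_score(s) for s in answers])   # [0.58, 0.72, 0.91] - differences visible
```

With three answers of genuinely different quality, the binary scorer reports three perfect scores; the graduated scorer reports the spread that the rankings in later rounds depend on.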

The most important output of Round 3 was not a better configuration—it was a better way of measuring. When the evidence stops making sense, fix the measurement first.


Round 4 — Adding a Helpfulness Lens

With improved measurements, the top four configurations still scored within a narrow band on automated quality metrics. The scores were no longer artificially compressed, but the configurations genuinely produced answers of similar measurable quality.

Automated metrics capture whether an answer is technically correct and well-sourced. They do not directly measure whether a person would find it helpful. A technically correct answer that is poorly organized or unnecessarily terse might score well on automated metrics while frustrating an actual user.

To add a more user-centered signal, I deployed an independent AI model as a quality judge—a separate system with no knowledge of which configuration produced which answer. This judge evaluated all 1,600 answers on a 1-to-5 helpfulness scale: Does the answer directly address the question? Does it use the provided information effectively? Is it appropriately detailed? Would a domain expert find it useful?

Before scoring at scale, I calibrated the judge against human ratings on a sample of 50 answers. The correlation was statistically significant, and 88% of the judge’s scores fell within one point of the corresponding human rating. This did not replace full-scale human evaluation, but it provided a calibrated supplementary signal, sufficient for distinguishing between configurations that automated metrics could not separate.
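A calibration check of this kind reduces to two numbers: the correlation between judge and human scores, and the share of judge scores within one point of the human rating. A minimal sketch with invented ratings (the actual calibration used a 50-answer sample):

```python
# Hypothetical judge-vs-human calibration check. The rating lists below are
# invented for illustration.
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation between two equal-length rating lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx ** 0.5 * vy ** 0.5)

def within_one_point(human, judge):
    """Fraction of judge scores within one point of the human rating."""
    return sum(abs(h - j) <= 1 for h, j in zip(human, judge)) / len(human)

human = [4, 5, 3, 4, 2, 5, 4, 3]
judge = [4, 4, 3, 5, 2, 5, 4, 4]
print(pearson(human, judge), within_one_point(human, judge))
```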

Figure: Helpfulness scores from the independent AI judge. Blue bars are the top four candidates from automated testing. Even among these top performers, the judge identified meaningful differences.

The judge found real differences within the top four. The strongest configuration gave the AI more room to answer fully (50% more than the default), which was associated with more complete and thorough responses. This single change produced measurably more helpful answers.


The Result

After 16,143 evaluations across four rounds, one configuration emerged as the best overall recommendation. Every design choice traces back to a specific round of evidence:

  • Document splitting: Sentence-window retrieval with minimal surrounding context, providing focused retrieval without dilution (Round 1, confirmed Round 2)
  • Re-ranking: Disabled; the conventional second-pass filter reduced answer quality on this content (Round 1, confirmed Round 2)
  • Response instructions: A detailed template asking for thorough, well-sourced answers (Round 2)
  • Response length: 50% more room to write than the default, associated with more helpful answers (Round 4)
  • Retrieved context: The 5 most relevant document sections per question (Round 2)

This configuration achieved:

  • 4.4 out of 5.0 helpfulness rating from the independent AI judge
  • Less than 0.5% failure rate across over 13,000 evaluation runs
  • The only configuration to rank in the top 2 on both automated quality metrics and AI-judged helpfulness

What This Means for a Deployment

The specific settings above are validated for this benchmark and model. Different content, models, or deployment contexts may shift the optimal configuration—earlier testing on single-domain medical and technical corpora produced different optimal settings, including cases where re-ranking helped. What follows are the broader patterns.

The re-ranking step, a commonly recommended component, would have shipped by default. Testing showed it reduced answer quality on this content. Two independent rounds confirmed it. That is the kind of mistake that structured evaluation catches before production.

The scoring system appeared to show four equivalent configurations. Investigation revealed it was failing to measure real differences. Without the audit, the final recommendation would have been arbitrary.

Every component in the final configuration earned its place through evidence. Components that did not—larger context windows, default response limits—were removed or changed. The result is a simpler system, because complexity that did not improve answers was cut.

The four-round funnel itself (select, optimize, audit, validate) transfers to other pipelines. It is designed to produce recommendations specific to the target content, not to assume one answer fits all.

Part II

Technical Detail

The narrative above intentionally simplified the underlying methodology. The following section contains the complete statistical framework, version-by-version analysis, and evidence supporting every finding and recommendation in Part I.

Dataset: ExpertQA | Model: Nemotron-Nano-9B-v2-NVFP4 | Embedding: Qwen3-Embedding-0.6B

1. Executive Summary

Four iterations of RAG benchmark optimization (v5–v8) were conducted on the ExpertQA dataset using a 6-step retrieval-augmented generation pipeline powered by Nemotron-Nano-9B-v2-NVFP4 via vLLM. The study processed 14,545 benchmark requests plus 1,598 LLM judge evaluations over 5 days, progressively narrowing the configuration search space through a decision funnel: technique selection (v5), parameter optimization (v6), measurement reform (v7), and helpfulness assessment (v8). Each version answered a specific question, and its findings constrained the next iteration’s search space. The final recommendation is maxtok_768, a sentence_window_1 configuration with max_tokens=768, temperature=0.1, top_k=5, reranking disabled, and a detailed prompt template—which achieves the best balance of automated metric quality (composite rank #2, score 0.5823) and human-perceived helpfulness (answer relevance rank #2, score 4.365/5.0).

Table T1: Version Overview

Version | Question Asked | Requests | Configs | Duration | Key Outcome
v5 | Which chunking technique is best? | 7,350 | 4 techniques + 8 sweeps | ~6.3h | sentence_window dominates; all 4 techniques KEEP
v6 | Which window size and parameters? | 5,597 | 4 window sizes + 3 sweeps | ~9.2h | sw_1 best (0.769); reranking=false +0.041; 6/6 validity
v7 | Are binary metrics hiding differences? | 1,598 | 8 generation configs | reeval | Yes — ceilings at 22.8/86.7/80.7% eliminated; top-4 plateau at 0.579–0.583
v8 | Can AR break the top-4 tie? | 1,598 | 8 generation configs | ~16min | Yes — Friedman p=0.006; maxtok_768 wins (AR 4.365)

Table T2: Decision Funnel

Version | Search Space | Decision | Carried Forward
v5 | 4 techniques × 8 sweeps | sentence_window best coverage-fidelity tradeoff | Technique = sentence_window
v6 | 4 window sizes + 3 sweeps | sw_1 + reranking=false = 0.810 composite | Window = 1, reranking = false
v7 | 8 generation configs | Binary metrics plateau; continuous reformulation | Continuous metrics; top-4 cluster identified
v8 | Top-4 cluster | maxtok_768 > anchor_detailed (d=0.221), > topk_3 (d=0.218) | maxtok_768 = recommended

Final recommendation: The maxtok_768 configuration is the recommended operating point across both automated composite metrics and LLM-judged answer relevance. See Section 8 for the complete configuration card.


2. Methodology

2.1 Pipeline Architecture

All four versions share a common 6-step RAG pipeline:

  1. Validate Input — Schema checks on question + ground truth
  2. Retrieve — Dense retrieval via ChromaDB (Qwen3-Embedding-0.6B, 1024-dim) with optional BM25 hybrid fusion
  3. Rerank — Cross-encoder re-ranking (bge-reranker-v2-m3) with configurable enable/disable
  4. Generate — LLM answer generation via Nemotron-Nano-9B-v2-NVFP4 (vLLM, Marlin backend, RTX 4090)
  5. Evaluate — Automated metrics computed against ground-truth answers and retrieved contexts
  6. Compare — Pairwise statistical analysis across configurations
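The per-request flow (steps 1–5; step 6, Compare, runs across configurations) can be sketched structurally as follows. Component names and signatures here are illustrative stand-ins, not the project's actual APIs.

```python
# Minimal structural sketch of the 6-step RAG pipeline described above.
from dataclasses import dataclass

@dataclass
class RunResult:
    answer: str
    metrics: dict

def run_pipeline(question, ground_truth, retriever, reranker, llm, evaluator,
                 rerank_enabled=True):
    # 1. Validate Input: schema checks on question + ground truth
    if not question or not ground_truth:
        raise ValueError("question and ground truth are required")
    # 2. Retrieve: dense retrieval of candidate chunks
    chunks = retriever(question)
    # 3. Rerank: optional cross-encoder pass (disabled in the final config)
    if rerank_enabled:
        chunks = reranker(question, chunks)
    # 4. Generate: LLM answer from the retrieved context
    answer = llm(question, chunks)
    # 5. Evaluate: automated metrics vs ground truth and retrieved contexts
    metrics = evaluator(answer, ground_truth, chunks)
    return RunResult(answer=answer, metrics=metrics)
```

Keeping the rerank step behind a flag is what allows the reranking on/off comparisons reported in v5 and v6 to run over otherwise identical pipelines.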

Dataset: ExpertQA — 150 stratified test cases drawn from a multi-domain expert-sourced QA corpus (medicine, law, science, technology, etc.). The same 150 test cases were reused across all versions for comparability.

Infrastructure: AI Manager service orchestrating Docker containers. Single Nemotron-9B NVFP4 instance (~12.2 GiB model, ~23 GiB peak VRAM). PostgreSQL + Redis backing stores. All runs on a single RTX 4090 (24 GiB).

Embedding model: Qwen3-Embedding-0.6B with asymmetric query/document encoding. Chunked documents stored in ChromaDB collections with per-technique indexing.

2.2 Evaluation Metrics

Table T3: Metric Definitions

Metric | Binary Formulation (v5–v6) | Continuous Formulation (v7–v8) | Range
Adherence | Cosine similarity of answer-chunk pairs > threshold (0.50) | Mean cosine similarity across all answer-chunk sentence pairs | [0, 1]
Relevance | Cosine similarity of answer-question > threshold (0.55) | Cosine similarity between full answer and question embeddings | [0, 1]
Utilization | Fraction of retrieved chunks with similarity > threshold (0.60) | Mean cosine similarity between answer and each retrieved chunk | [0, 1]
Completeness | Proportion of ground-truth sentences covered | Proportion of ground-truth sentences with cosine sim > threshold | [0, 1]
NLI Faithfulness | Fraction of answer sentences entailed by context (DeBERTa NLI) | (dropped in v7) | [0, 1]
Answer Relevance | (not used) | LLM-judged helpfulness score (Qwen3.5-9B, 1–5 Likert) | [1, 5]

NLI Faithfulness was dropped after v6 due to an 88.8% ceiling effect and independent evaluation model concerns. Answer Relevance was added in v8 as a holistic helpfulness measure orthogonal to component metrics.
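The continuous formulations in Table T3 are averages of cosine similarities. A toy sketch using plain Python lists in place of real sentence embeddings (illustrative only):

```python
# Toy versions of two continuous metrics from Table T3, with plain lists
# standing in for embedding vectors.
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def adherence(answer_sents, chunk_sents):
    """Mean cosine similarity across all answer-chunk sentence pairs."""
    sims = [cosine(a, c) for a in answer_sents for c in chunk_sents]
    return sum(sims) / len(sims)

def utilization(answer_vec, chunk_vecs):
    """Mean cosine similarity between the answer and each retrieved chunk."""
    return sum(cosine(answer_vec, c) for c in chunk_vecs) / len(chunk_vecs)
```

Because these return raw means rather than threshold comparisons, their outputs occupy the full [0, 1] range, which is what eliminates the ceiling effects discussed in Section 5.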

2.3 Composite Score Design

The composite score evolved across versions as metric quality findings accumulated:

Table T4: Composite Weight Evolution

Metric | v5–v6 Weights | v7 Weights (original) | v7 Weights (reeval) | v8
Adherence | 0.30 | 0.30 | 0.35 | 0.35
Relevance | 0.25 | 0.25 | 0.25 | 0.25
NLI Faithfulness | 0.20 | 0.20 | — (dropped) | —
Utilization | 0.15 | 0.15 | 0.20 | 0.20
Completeness | 0.10 | 0.10 | 0.20 | 0.20

Key changes:

  • v7 re-evaluation dropped NLI (ceiling) and redistributed its 0.20 weight to utilization (+0.05) and completeness (+0.10)
  • The v7 re-evaluation composite (template_v2_no_nli) was selected for strongest discrimination (F=6.81 vs F=3.80 for original weights)
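With NLI dropped, the composite reduces to a weighted sum over four metrics. A minimal sketch using the v7-reeval/v8 weights from Table T4:

```python
# Weighted composite score over the four continuous metrics, using the
# post-NLI weights (v7 re-evaluation and v8).
V8_WEIGHTS = {
    "adherence": 0.35,
    "relevance": 0.25,
    "utilization": 0.20,
    "completeness": 0.20,
}

def composite(scores: dict, weights: dict = V8_WEIGHTS) -> float:
    """Weighted sum of per-metric scores; weights must sum to 1."""
    assert abs(sum(weights.values()) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(weights[m] * scores[m] for m in weights)
```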

2.4 Statistical Framework

All versions use a consistent statistical framework:

  • Paired tests: Wilcoxon signed-rank (non-parametric) for pairwise comparisons on the same test cases
  • Multiple comparison correction: Holm-Bonferroni step-down procedure
  • Effect sizes: Cohen’s d with benchmarks: |d| < 0.2 = Negligible, 0.2–0.5 = Small, 0.5–0.8 = Medium, > 0.8 = Large
  • Omnibus tests: Friedman test (non-parametric repeated measures) for rank-based group differences; one-way ANOVA as parametric complement
  • Confidence intervals: 10,000-iteration bootstrap (BCa) for technique means
  • Bayesian evidence: BIC-based Wagenmakers (2007) Bayes factors
  • Construct validity: Pre-registered directional predictions tested as strict (must hold) or informational (expected but not required)
  • Discrimination power: ANOVA F-statistics and eta-squared (η²) for between-technique variance
  • Ceiling/floor analysis: Percentage of observations at metric bounds; >50% classified as “Unacceptable”

Note on effect sizes: Cohen’s d is computed on raw paired score differences (not rank-transformed), which is standard practice when the underlying continuous metrics have interval-scale properties. While rank-biserial correlation would be the natural nonparametric companion for Wilcoxon signed-rank tests, Cohen’s d was chosen for interpretability and comparability with power analysis conventions. Kendall’s W was not reported alongside Friedman tests but would be a useful addition in future iterations.
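Two pieces of this framework are compact enough to sketch directly: Cohen's d on raw paired differences, and the Holm-Bonferroni step-down. These are illustrative implementations, not the study's code.

```python
# Cohen's d on paired score differences, and Holm-Bonferroni step-down
# correction over a family of p-values.
from statistics import mean, stdev

def cohens_d_paired(a, b):
    """Effect size: mean of paired differences over their standard deviation."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / stdev(diffs)

def holm_bonferroni(p_values, alpha=0.05):
    """Return reject/accept flags for each p-value, in the original order."""
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    m = len(p_values)
    reject = [False] * m
    for rank, i in enumerate(order):
        # Compare the rank-th smallest p-value against alpha / (m - rank).
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # step-down: stop at the first non-rejection
    return reject
```

The step-down structure is why Holm is uniformly more powerful than plain Bonferroni while still controlling the family-wise error rate: only the smallest p-value faces the full alpha/m hurdle.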


3. v5 — Technique Selection

Question: Which chunking technique produces the best RAG responses on ExpertQA?

Design: 4 techniques (semantic, sentence_400, recursive_400, sentence_window_3) evaluated at default parameters with 150 samples each (600 main comparisons), plus 8 parameter sweeps (chunk_size, prompt_template, final_top_k, similarity_threshold, reranking, retrieval_method, adherence_support, chunk_size for sentence_window). 7,350 completed requests, 99.0% success rate. Runtime ~6.3 hours.

Table T5: Technique Ranking (v5)

Rank | Technique | Adherence | Relevance | Utilization | NLI Faith. | Completeness | Best At
1 | semantic | 0.733 | 0.589 | 0.646 | 0.994 | 0.256 | Fidelity
2 | sentence_window_3 | 0.688 | 0.808 | 0.836 | 0.908 | 0.235 | Coverage
3 | sentence_400 | 0.674 | 0.571 | 0.573 | 0.975 | 0.261 | Completeness
4 | recursive_400 | 0.634 | 0.524 | 0.553 | 0.988 | 0.259 | —

The results revealed a fundamental fidelity-coverage tradeoff: semantic excelled at adherence (0.733) and NLI faithfulness (0.994), while sentence_window_3 led on relevance (0.808) and utilization (0.836). The gap was substantial, with Medium effect sizes (d=0.64–0.73) on coverage metrics.

Table T6: Key Sweep Findings (v5)

Parameter | Finding | Effect
Reranking | Enabled 0.682 adherence vs disabled 0.701 | Reranking hurts on ExpertQA (negative)
Hybrid retrieval | Dense: adh 0.682, NLI 0.966; Hybrid: adh 0.602, NLI 0.657 | Hybrid severely degrades fidelity (−0.309 NLI)
Chunk size (sentence) | 200 > 400 > 600 on adherence (0.722 > 0.674 > 0.659) | Smaller consistently better
Window size (sw) | sw_1: adh 0.753, NLI 0.982; sw_3: rel 0.808, util 0.836 | Window=1 for fidelity, =3 for coverage
Top_k | 3→15: relevance 0.688→0.468; completeness 0.243→0.262 | top_k=5 is balanced
Prompt template | Detailed > default for adherence on some techniques | Small effect, worth exploring
Similarity threshold | 0.40→0.55: adherence 0.829→0.550 | Impactful but tradeoff with recall
Figure 3.1: Metric heatmap across all 4 techniques.
Figure 3.2: Dense vs hybrid retrieval — hybrid severely degrades fidelity on ExpertQA.
Figure 3.3: Reranking impact — counterintuitively negative on this dataset.
Figure 3.4: Discrimination power — completeness fails to discriminate (p=0.15).
Figure 3.5: NLI faithfulness distribution — 88.8% ceiling effect.
Figure 3.6: Prompt template sweep on adherence.

Construct Validity (v5): 4/5 Strict PASS

Test | Prediction | Result | Status
top_k increases → relevance decreases | Monotonic decrease | 0.688 → 0.468 | PASS
top_k increases → completeness increases | Monotonic increase | 0.243 → 0.262 | PASS
Reranking improves adherence | Enabled > disabled | 0.682 < 0.701 | FAIL
Smaller chunks → higher relevance | 200 > 400 > 600 | 0.570 > 0.548 > 0.527 | PASS
Similarity threshold → adherence decreases | Monotonic decrease | 0.829 → 0.550 | PASS

Test 3 failed because reranking hurts on ExpertQA, unlike techqa/covidqa. This was an important dataset-specific finding carried forward.

Metric Quality Alerts

  • NLI Faithfulness: 88.8% ceiling — unreliable as discriminator (recommended drop in v6+)
  • Completeness: ANOVA F=1.77, p=0.15 — zero discriminative power; eta-squared=0.009 (negligible)
  • Relevance: 37.3% ceiling (sentence_window_3 at 67.3%) — marginal

Decisions Carried Forward

  1. Technique = sentence_window — selected over semantic (adherence leader, 0.733) because sentence_window provides the best coverage-fidelity balance, is fastest (13.1s vs 14.8s mean), has the lowest failure rate (0.3% vs 0.8%), and its window_size parameter enables further optimization in v6. All 4 techniques exceeded the 0.60 adherence threshold and were retained as viable candidates; the v6 carry-forward decision was which to prioritize.
  2. Investigate window_size=1 — highest adherence (0.753) and NLI (0.982) among all configs
  3. Reranking = investigate further — counterintuitive negative effect needs confirmation
  4. Flag completeness and NLI as non-discriminating — under binary formulation, neither metric discriminates between techniques; reassess after measurement reform
  5. Dense retrieval only — hybrid severely degraded fidelity

4. v6 — Window Size and Parameter Optimization

Question: Which sentence_window size and parameter combination maximizes composite quality?

Design: 4 window sizes (1, 2, 3, 4) at 200 samples each, plus 3 parameter sweeps (final_top_k, reranking_enabled, prompt_template). 5,597 completed requests out of 5,617 submitted (99.6% completion rate; 5 failures, 15 other non-completed). Runtime ~9.2 hours across 4 runs (1 main + 3 sweeps).

Table T7: Window Size Ranking (v6)

Rank | Window Size | Composite | Adherence | Relevance | NLI Faith. | Utilization | Completeness
1 | sw_1 | 0.769 | 0.766 | 0.784 | 0.979 | 0.818 | 0.244
2 | sw_2 | 0.759 | 0.739 | 0.803 | 0.932 | 0.837 | 0.241
3 | sw_3 | 0.746 | 0.691 | 0.817 | 0.902 | 0.868 | 0.234
4 | sw_4 | 0.722 | 0.662 | 0.807 | 0.837 | 0.872 | 0.230

Window_size=1 won the composite ranking through leading fidelity metrics (adherence, NLI). The fidelity-coverage tradeoff from v5 repeated at finer granularity: sw_1 led on adherence (+0.104 vs sw_4, d=0.42) while sw_3/sw_4 led on utilization (+0.054, not significant).

Table T8: Best Observed Configurations (v6)

Configuration | Composite | Adherence | Relevance | NLI Faith. | Utilization | CI (95%)
sw_1 + reranking=false | 0.810 | 0.785 | 0.912 | 0.977 | 0.849 | [0.792, 0.827]
sw_1 + template=detailed | 0.793 | 0.795 | 0.830 | 0.981 | 0.829 | —
sw_1 + default | 0.769 | 0.766 | 0.784 | 0.979 | 0.818 | —
sw_1 + reranking=true | 0.769 | 0.766 | 0.784 | 0.979 | 0.818 | —

Disabling reranking improved composite by +0.041, confirming v5’s counterintuitive finding. The effect was driven primarily by relevance (0.784 → 0.912, +0.128) and utilization (0.818 → 0.849). A plausible mechanism is that the cross-encoder filtered out chunks that were topically relevant but not the closest match—on ExpertQA’s diverse multi-domain content, this filtering may have been counterproductive.

Sweep Details

Top_k sweep (k = 3, 5, 7, 10): Composite flat across k=5–10 (0.749–0.755). k=3 dropped to 0.722. Decision: k=5 retained (default, balanced).

Reranking sweep: Disabled significantly better (+0.041 composite). Bayesian evidence for the overall window-size adherence gradient: BF10=308.6 (extreme) for sw_1 vs sw_4.

Template sweep: Detailed template +0.024 composite over default. Chain-of-thought comparable but 2.1× slower (~29s vs ~14s mean generation time).

Figure 4.1: Window size composite score comparison.
Figure 4.2: Multi-metric radar chart across 4 window sizes.
Figure 4.3: Window size × top_k interaction heatmap.
Figure 4.4: Reranking impact confirmation — disabled is better.
Figure 4.5: Bayesian evidence for pairwise window size comparisons.

Construct Validity (v6): 6/6 PASS (Highest of Any Version)

Test | Prediction | Result | Status
top_k increases → relevance decreases | Monotonic decrease | 0.822 → 0.751 | PASS
top_k increases → completeness increases | Monotonic increase | 0.224 → 0.253 | PASS
Window size increases → NLI decreases | Monotonic decrease | 0.979 → 0.837 | PASS
Window size increases → adherence decreases | Monotonic decrease | 0.766 → 0.662 | PASS
Reranking impacts relevance (info) | Directional | Confirmed | PASS
Larger window → higher utilization (info) | Directional | Confirmed | PASS

Ceiling Problem Foreshadowing

Despite 6/6 construct validity (the highest of any version), v6 revealed a severe measurement problem:

Metric | Ceiling % | Assessment
NLI Faithfulness | 71.6% | Unacceptable
Relevance | 66.9% | Unacceptable
Utilization | 66.8% | Unacceptable
Adherence | 29.6% | Acceptable
Completeness | 0.0% | Ideal

Three of five metrics had unacceptable ceiling effects (>50% of observations at maximum). This meant the composite score was increasingly driven by the few metrics with remaining variance, raising the question: is this optimizing metrics or actual quality?
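The ceiling check itself is simple to state: the fraction of observations at the metric's upper bound, with >50% classified as "Unacceptable". A minimal sketch; the exact banding between "Ideal" and "Acceptable" below the cutoff is an assumption here, with only the >50% rule stated in the text.

```python
# Ceiling-effect analysis: share of observations at the metric's upper bound.
def ceiling_rate(scores, upper=1.0, tol=1e-9):
    """Fraction of scores at (or within tol of) the upper bound."""
    at_ceiling = sum(1 for s in scores if s >= upper - tol)
    return at_ceiling / len(scores)

def assess(scores):
    """>50% at the bound is 'Unacceptable' (as stated); other bands assumed."""
    rate = ceiling_rate(scores)
    if rate > 0.50:
        return "Unacceptable"
    return "Ideal" if rate == 0.0 else "Acceptable"
```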

Decisions Carried Forward

  1. Window size = 1 — best composite, fidelity leader
  2. Reranking = disabled — confirmed negative effect
  3. Template = detailed — modest gain, retained
  4. top_k = 5 — flat region, no need to change
  5. Address ceiling effects — measurement reform needed before trusting generation-level optimization

5. v7 — Measurement Reform

Question: Are binary metrics hiding meaningful differences between generation configurations?

Design: 8 named generation configurations varying prompt_template (detailed, grounded_detailed, precise, structured), temperature (0.0 vs 0.1 default), max_tokens (768 vs 512 default), and top_k (3, 5, 7). All used sentence_window_1 with system_prompt=/no_think and other defaults from v6. 1,598 completed requests (200 per config, minus 2 missing). The same responses were scored twice: once with original binary metrics, once with continuous reformulations using an independent evaluation model.

The 8 Configurations

All configs share: system_prompt=/no_think, reranking=false, dense retrieval, similarity_threshold=0.45.

Config | Temperature | Max Tokens | Top_k | Template | What it tests
anchor_detailed | 0.1 | 512 | 5 | detailed | Baseline (anchor)
maxtok_768 | 0.1 | 768 | 5 | detailed | More generation budget
topk_3 | 0.1 | 512 | 3 | detailed | Fewer, higher-quality chunks
topk_7 | 0.1 | 512 | 7 | detailed | More context coverage
temp_greedy | 0.0 | 512 | 5 | detailed | Deterministic decoding
grounded_detailed | 0.1 | 512 | 5 | grounded_detailed | Grounding-focused template
precise | 0.1 | 512 | 5 | precise | Adherence-maximizing template
structured | 0.1 | 512 | 5 | structured | Structured output format

The Plateau Discovery

With binary metrics, the top-4 configurations (topk_3, maxtok_768, temp_greedy, anchor_detailed) scored 0.830–0.832 composite — a spread of just 0.0017. The ranking appeared converged: further optimization seemed pointless.

But investigation revealed this convergence was an artifact of metric ceilings, not genuine equivalence.

Table T9: Binary vs Continuous Comparison

Metric | Binary Mean | Binary Ceiling | Continuous Mean | Continuous Ceiling | Change
Adherence | 0.755 | 22.8% | 0.616 | 0.0% | Ceiling eliminated
Relevance | 0.912 | 86.7% | 0.581 | 0.0% | Ceiling eliminated
Utilization | 0.927 | 80.7% | 0.753 | 0.0% | Ceiling eliminated

The binary relevance metric had 86.7% of all observations at the ceiling (1.0). Any configuration that exceeded the threshold got a perfect score, erasing all differences above that line. The continuous formulation, using raw cosine similarity instead of threshold-based binary classification, restored the full measurement range.

Table T10: Configuration Rankings — Binary vs Continuous

Config | Binary Rank | Binary Composite | Continuous Rank | Continuous Composite | Rank Change
topk_3 | 1 | 0.832 | 1 | 0.583 | —
maxtok_768 | 2 | 0.832 | 2 | 0.582 | —
temp_greedy | 3 | 0.831 | 3 | 0.580 | —
anchor_detailed | 4 | 0.830 | 4 | 0.579 | —
grounded_detailed | 5 | 0.824 | 6 | 0.571 | −1
topk_7 | 6 | 0.819 | 5 | 0.574 | +1
precise | 7 | 0.796 | 7 | 0.555 | —
structured | 8 | 0.789 | 8 | 0.552 | —

The top-4 order was preserved, but the spread widened from 0.0017 to 0.0036, though still narrow. The continuous composite revealed 16 significant pairwise differences (vs fewer with binary), with 5 Large or Medium effects (4 with d > 0.8, plus structured vs topk_7 at d = 0.71), all involving structured as the worst config.

Figure 5.1: Binary vs continuous metric distributions — ceiling elimination.
Figure 5.2: Configuration ranking under continuous composite.

Construct Validity Revolution: 0/2 → 2/2

Under binary metrics, both strict construct validity tests failed — the threshold-based scoring masked the expected monotonic relationship between top_k and relevance. After continuous reformulation:

| Test | Binary Result | Continuous Result |
|---|---|---|
| topk_3 > anchor (relevance) | FAIL (both at ceiling) | PASS (0.595 > 0.580) |
| anchor > topk_7 (relevance) | FAIL (both at ceiling) | PASS (0.580 > 0.570) |

This was the strongest evidence that binary metrics were not just noisy but actively misleading — they could not detect real differences that continuous metrics revealed.

The Top-4 Cluster Problem

Even with continuous metrics, the top-4 configs (composite 0.579–0.583) remained statistically indistinguishable in pairwise tests (all Holm-corrected p > 0.05, all d < 0.15). The composite metric had reached its discriminative limit: the four best configurations produced responses of equivalent measured quality across adherence, relevance, utilization, and completeness.

This raised the question: is there a quality dimension not being measured?

Decisions Carried Forward

  1. Continuous metrics adopted — binary formulations permanently retired
  2. NLI dropped — 88.8% ceiling, independent eval model concerns
  3. Top-4 cluster identified — topk_3, maxtok_768, temp_greedy, anchor_detailed
  4. Need a new discriminator — automated composite cannot break the tie
  5. Explore answer relevance — holistic helpfulness judgment as complementary signal

6. v8 — Answer Relevance Judge

Question: Can an LLM-judged answer relevance score discriminate between the top-4 configs that automated metrics cannot?

Design: Qwen3.5-9B (BF16 via vLLM, Docker container) scored all 1,598 v7 responses on a 1–5 Likert scale for answer relevance: “how helpful and relevant is this answer to the user’s question, considering completeness, accuracy, and specificity?” Temperature=0.0 for deterministic scoring. Scoring time: 969 seconds (~1.6 responses/second).

Judge Calibration

Before scoring, the judge was calibrated against 50 human-labeled samples:

| Metric | Value | Threshold | Status |
|---|---|---|---|
| Cohen’s kappa (weighted) | 0.469 | ≥ 0.6 | Soft FAIL |
| MAE | 0.580 | ≤ 1.0 | PASS |
| Within-1 agreement | 88% | | Strong |
| Spearman rho | 0.509 | | p=0.00016 |
| Parse rate | 100% | > 95% | PASS |

Kappa fell below the 0.6 threshold, indicating moderate (not substantial) agreement. However, three factors supported proceeding: (1) 88% of judgments were within 1 point of human labels, (2) Spearman correlation was significant (p<0.001), and (3) the primary analysis uses relative ranking rather than absolute scores, which is robust to systematic bias.
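These calibration statistics can all be computed with short stdlib routines. The sketch below uses a tiny made-up label set, not the study's 50 samples, and assumes quadratic weighting for the kappa (the report says only “weighted”):

```python
from collections import Counter

def quadratic_weighted_kappa(human, judge, k=5):
    """Cohen's kappa with quadratic weights for ordinal labels in 1..k."""
    n = len(human)
    obs = Counter(zip(human, judge))          # joint label counts
    h_marg, j_marg = Counter(human), Counter(judge)
    num = den = 0.0
    for i in range(1, k + 1):
        for j in range(1, k + 1):
            w = (i - j) ** 2 / (k - 1) ** 2   # quadratic disagreement weight
            num += w * obs[(i, j)] / n
            den += w * (h_marg[i] / n) * (j_marg[j] / n)
    return 1.0 - num / den

def mae(human, judge):
    return sum(abs(h - j) for h, j in zip(human, judge)) / len(human)

def within_1(human, judge):
    return sum(abs(h - j) <= 1 for h, j in zip(human, judge)) / len(human)

# Tiny hypothetical calibration sample (illustration only).
human = [5, 4, 4, 3, 5, 2, 4, 3, 5, 4]
judge = [5, 4, 5, 3, 4, 3, 4, 4, 5, 5]

print(round(mae(human, judge), 2))       # 0.5
print(round(within_1(human, judge), 2))  # 1.0
print(round(quadratic_weighted_kappa(human, judge), 3))
```

In practice a library implementation (e.g. scikit-learn's `cohen_kappa_score` with `weights="quadratic"`) would be the usual choice; the point here is only to make the four calibration numbers concrete.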

Table T11: Combined Ranking (Composite + Answer Relevance)

| Config | Composite Rank | Composite | AR Rank | AR Mean | AR Std | Top-4? |
|---|---|---|---|---|---|---|
| topk_3 | 1 | 0.583 | 7 | 4.225 | 0.899 | Yes |
| maxtok_768 | 2 | 0.582 | 2 | 4.365 | 0.803 | Yes |
| temp_greedy | 3 | 0.580 | 3 | 4.310 | 0.792 | Yes |
| anchor_detailed | 4 | 0.579 | 6 | 4.246 | 0.844 | Yes |
| topk_7 | 5 | 0.574 | 4 | 4.281 | 0.786 | No |
| grounded_detailed | 6 | 0.571 | 5 | 4.265 | 1.020 | No |
| precise | 7 | 0.555 | 8 | 4.045 | 1.204 | No |
| structured | 8 | 0.552 | 1 | 4.440 | 0.831 | No |

The AR-Composite Tension

The most striking finding was structured: ranked dead last on composite (#8, score 0.552) but first on answer relevance (#1, AR 4.44). This suggests that the structured template produces responses that users find helpful despite scoring poorly on embedding-based metrics. This tension is informative but does not change the recommendation; structured’s composite deficit (d > 0.9 vs top-4) is too large to justify on automated metrics alone.

Top-4 Discrimination: AR Breaks the Tie

The central question: does AR discriminate within the top-4 cluster that composite cannot separate?

Overall Friedman test (all 8 configs): χ² = 47.12, p = 5.3e-08. AR discriminates across the full set.

Top-4 Friedman test: χ² = 12.41, p = 0.006. AR successfully discriminates within the top-4 cluster.

Top-4 Pairwise Comparisons (Wilcoxon + Holm)

| Pair | Mean Diff | Cohen’s d | p (Holm) | Significant? |
|---|---|---|---|---|
| maxtok_768 > anchor_detailed | +0.116 | 0.221 | 0.014 | Yes |
| maxtok_768 > topk_3 | +0.140 | 0.218 | 0.013 | Yes |
| maxtok_768 > temp_greedy | +0.055 | 0.102 | 0.305 | No |
| temp_greedy > anchor_detailed | +0.065 | 0.125 | 0.267 | No |
| temp_greedy > topk_3 | +0.085 | 0.131 | 0.291 | No |
| anchor_detailed > topk_3 | +0.025 | 0.038 | 0.566 | No |

maxtok_768 significantly outperforms both anchor_detailed (d=0.221, p=0.014) and topk_3 (d=0.218, p=0.013). These are Small effects, but they are the only statistically significant differences within a cluster that composite could not crack.
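In practice the Wilcoxon signed-rank tests themselves would come from a statistics library (e.g. `scipy.stats.wilcoxon`); the Holm step-down adjustment applied to the resulting family of p-values is simple enough to sketch directly. The six raw p-values below are hypothetical stand-ins, not the study's:

```python
def holm_correction(pvals):
    """Holm step-down adjustment for a family of m raw p-values."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # ascending p-values
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        adj = min(1.0, (m - rank) * pvals[idx])       # multiply by m, m-1, ...
        running_max = max(running_max, adj)           # enforce monotonicity
        adjusted[idx] = running_max
    return adjusted

# Hypothetical raw p-values for six pairwise tests (illustration only).
raw = [0.0023, 0.0022, 0.1525, 0.0890, 0.0970, 0.5660]
adj = holm_correction(raw)
for r, a in zip(raw, adj):
    print(f"raw={r:.4f} holm={a:.4f} significant={a < 0.05}")
```

Only the two smallest p-values survive correction at alpha = 0.05, mirroring the pattern in the table above where just two of six pairs remain significant.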

Figure 6.1: Answer relevance scores by configuration.
Figure 6.2: Answer relevance score distributions for all 8 configurations.
Figure 6.3: Top-4 pairwise discrimination via AR (maxtok_768 significantly outperforms anchor_detailed and topk_3).
Figure 6.4: Correlation matrix between answer relevance and automated metrics.

AR is Not a Proxy for Existing Metrics

Correlations between AR and automated metrics were modest:

| Metric | Spearman rho |
|---|---|
| Composite (continuous) | 0.331 |
| Adherence | 0.256 |
| Relevance | 0.244 |
| Utilization | 0.226 |
| Completeness | 0.088 |

Even the strongest correlation (rho=0.331, with the composite) is modest: AR is related to the automated metrics but far from redundant with them. If AR were simply a noisy version of the composite, one would expect rho > 0.7.

Construct Validity (v8): 1/2 Strict PASS

| Test | Prediction | Result | Status |
|---|---|---|---|
| precise < anchor (AR) | Constrained template → lower helpfulness | 4.045 < 4.246 (−0.201) | PASS |
| grounded < anchor (AR) | Grounded template → lower helpfulness | 4.265 > 4.246 (+0.019) | FAIL |

Test 2 failed by a negligible margin (+0.019, d=0.03). The grounded_detailed template performed slightly better than expected, possibly because its emphasis on sourcing improves perceived answer quality, though this was not directly tested.

Decisions Carried Forward

maxtok_768 is the final selection. It is the only configuration that ranks in the top-2 on both dimensions: composite rank #2 (0.582) and AR rank #2 (4.365). It significantly outperforms 2 of 3 other top-4 members on AR while maintaining equivalent composite performance.


7. Cross-Version Synthesis

Optimization Trajectory

The four versions trace a clear narrowing funnel:

v5: 4 techniques × 8 sweeps = 32 cells → sentence_window selected
v6: 4 windows × 4 sweeps = 16 cells → sw_1 + reranking=false
v7: 8 configs × 2 formulations = measurement reform → top-4 cluster
v8: top-4 → maxtok_768 (AR tiebreaker)

Each version sharply narrowed the search space while adding measurement sophistication. The total optimization path evaluated 14,545 benchmark requests plus 1,598 LLM judge evaluations (16,143 total) across more than 40 unique configurations.

Metric Evolution

The evaluation framework evolved significantly through evidence-based decisions:

| Change | Version | Evidence | Impact |
|---|---|---|---|
| Completeness flagged as non-discriminating | v5 | ANOVA F=1.77, p=0.15, η²=0.009 | Flagged; retained after continuous reformulation restored discrimination |
| NLI faithfulness dropped | v7 | 88.8% ceiling (v5), 71.6% ceiling (v6) | Removed from composite |
| Binary → continuous formulation | v7 | 3 metrics with >66% ceiling; 0/2 validity → 2/2 | Full discrimination restored |
| Answer relevance added | v8 | Top-4 composite plateau (spread 0.004) | Broke tie, identified maxtok_768 |
| Completeness discrimination restored | v7 | Continuous: Friedman χ²=785.22, η²=0.175 | Weight increased 0.10 → 0.20 |
| Composite weights redistributed | v7 | NLI removal, completeness reweighted | Stronger discrimination (F: 3.80 → 6.81) |
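For reference, the continuous composite is a weighted sum of the four automated metrics. The sketch below is illustrative only: the 0.20 completeness weight is stated above, but the split of the remaining 0.80 across adherence, relevance, and utilization is a hypothetical placeholder, not the study's actual weights.

```python
# Hypothetical weights: only the 0.20 completeness weight is documented;
# the remaining 0.80 is split here purely for illustration.
WEIGHTS = {
    "adherence": 0.30,
    "relevance": 0.30,
    "utilization": 0.20,
    "completeness": 0.20,  # raised from 0.10 after the continuous reformulation
}

def composite(metrics):
    """Weighted sum of the four continuous metric scores."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
    return sum(WEIGHTS[k] * metrics[k] for k in WEIGHTS)

# Continuous metric values for maxtok_768 from the Expected Performance table.
example = {"adherence": 0.608, "relevance": 0.581,
           "utilization": 0.768, "completeness": 0.354}
print(round(composite(example), 3))  # 0.581 under these illustrative weights
```

The exact weights live with the analysis artifacts (e.g. config_ranking_continuous.json); the point of the sketch is only the shape of the computation.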

Construct Validity Trend

| Version | Strict Tests | Passed | Pass Rate | Notes |
|---|---|---|---|---|
| v5 | 5 | 4 | 80% | Reranking test failed (dataset-specific) |
| v6 | 4 | 4 | 100% | Best construct validity |
| v7 (binary) | 2 | 0 | 0% | Ceiling effects masked all signals |
| v7 (continuous) | 2 | 2 | 100% | Measurement reform recovered validity |
| v8 | 2 | 1 | 50% | Grounded template slightly exceeded expectation |

The v7 binary → continuous transition is the most dramatic result: the same data, same configurations, same predictions went from 0% to 100% validity solely through metric reformulation. This is strong evidence that metric design is at least as important as pipeline design.

Diminishing Returns

| Transition | Composite Change | AR Change | Effort |
|---|---|---|---|
| v5 → v6 | +0.041 (reranking=false) | | 5,597 requests |
| v6 → v7 | −0.217 (scale change) | | 1,598 + re-eval |
| v7 top-4 | 0.004 spread (plateau) | | Identified ceiling as cause |
| v8 maxtok_768 | −0.001 vs topk_3 | +0.140 vs topk_3 | 1,598 judge calls |

The composite gains diminished rapidly: v6’s reranking discovery was the single largest improvement (+0.041). By v7, automated metrics had plateaued. v8’s contribution was not in composite improvement but in adding a new dimension (AR) that revealed maxtok_768 as measurably more helpful despite near-identical composite scores.

Key Findings Across All Versions

  1. Sentence_window_1 is the optimal chunking strategy for ExpertQA with this model. It maximizes fidelity (adherence, NLI) while maintaining adequate coverage.
  2. Reranking hurts on ExpertQA. Confirmed in both v5 and v6 across multiple window sizes. The cross-encoder filters out relevant chunks in this multi-domain dataset. This finding may not generalize to narrow-domain datasets.
  3. Binary threshold metrics have a hard ceiling problem. When most responses exceed a quality threshold, the metric saturates and can no longer discriminate. Continuous formulations are essential for high-performing systems.
  4. Composite metrics plateau before quality converges. The top-4 configs are equivalent on automated metrics but differ on AR. This suggests that automated metrics capture necessary but insufficient quality dimensions.
  5. max_tokens=768 is a sweet spot. More generation budget (768 vs the 512 default) produces measurably more helpful answers (AR d=0.221 vs anchor, p=0.014). The additional generation budget plausibly allows the model to elaborate rather than truncating, though this hypothesis was not directly tested.
  6. Temperature has minimal impact. temp_greedy (0.0) vs anchor_detailed (0.1) differ by just d=0.125 on AR and d=0.008 on composite. Even this small temperature gap produces negligible differences.
  7. Prompt template matters more than expected. The structured template produces the most user-helpful responses (AR 4.44) but worst composite (0.552). This tension between automated and human-aligned metrics deserves future investigation.

8. Final Recommendation

Table T12: Best Configuration Card

| Parameter | Value | Source Version | Evidence |
|---|---|---|---|
| Technique | sentence_window | v5 | Best coverage-fidelity tradeoff (d=0.64–0.73, Medium) |
| Window size | 1 | v6 | Composite 0.769 (rank #1); adherence 0.766 |
| Reranking | disabled | v5, v6 | +0.041 composite; confirmed across window sizes |
| Top_k | 5 | v6 | Flat region k=5–10; k=3 too restrictive |
| Retrieval | dense only | v5 | Hybrid: −0.309 NLI, severe fidelity loss |
| Prompt template | detailed | v6, v7 | +0.024 composite over default |
| Max tokens | 768 | v7, v8 | AR 4.365 (#2); composite 0.582 (#2) |
| Temperature | 0.1 | v7 | Default; negligible difference vs 0.0 |
| System prompt | /no_think | v7 | All 8 v7 configs used /no_think (workflow default) |
| Similarity threshold | 0.45 | v5 | ExpertQA-calibrated (mean sim 0.440) |

Expected Performance

| Metric | Expected Value | Source |
|---|---|---|
| Composite (continuous) | 0.582 | v7 config_ranking_continuous.json |
| Adherence (continuous) | 0.608 | v7 config_ranking_continuous.json |
| Relevance (continuous) | 0.581 | v7 config_ranking_continuous.json |
| Utilization (continuous) | 0.768 | v7 config_ranking_continuous.json |
| Completeness | 0.354 | v7 config_ranking_continuous.json |
| Answer Relevance | 4.365 / 5.0 | v8 config_ranking.json |
| 95th pct latency | ~19–20s | v6 latency analysis |
| Failure rate | < 0.5% | v5–v6 observed rates |

Table T13: Finalist Comparison

| Config | Comp. Rank | Composite | AR Rank | AR Mean | Latency | Fail Rate | Key Differentiator |
|---|---|---|---|---|---|---|---|
| maxtok_768 | 2 | 0.582 | 2 | 4.365 | ~12.5s | <0.5% | +256 tokens → AR gain |
| topk_3 | 1 | 0.583 | 7 | 4.225 | ~12.5s | <0.5% | Best composite, weak AR |
| temp_greedy | 3 | 0.580 | 3 | 4.310 | ~12.5s | <0.5% | Deterministic; no sig. edge |
| anchor_detailed | 4 | 0.579 | 6 | 4.246 | ~12.5s | <0.5% | Baseline anchor config |
| structured | 8 | 0.552 | 1 | 4.440 | ~12.5s | <0.5% | Best AR but worst composite |

Latency estimates from v6 sw_1 main run (~12–13s mean generation time). Failure rates from v5–v6 observed rates across all sw_1 runs. Composite and AR from v7 re-evaluation and v8 judge scoring, respectively.

Operational Characteristics

  • Latency: ~12.5s mean, ~19–20s p95 (v6 sw_1 measurements)
  • Failure rate: <0.5% (v5–v6 observed across >13,000 requests)
  • Throughput: ~0.175 req/s sustained (v5 measurement, single-instance)
  • Resource requirements: Single RTX 4090, ~23 GiB peak VRAM (12.2 GiB model + KV cache + evaluation models)

Selection Rationale

Decision rule: Select the configuration that ranks in the top tier on automated composite metrics AND performs best on answer relevance, excluding any configuration with a large automated-metric deficit.

maxtok_768 is the recommended configuration under this rule:

  1. Composite: Rank #2 (0.582), within 0.001 of the best (topk_3, 0.583). The difference is not statistically significant (p > 0.8).
  2. Answer Relevance: Rank #2 (4.365), significantly better than anchor_detailed (#6, d=0.221, p=0.014) and topk_3 (#7, d=0.218, p=0.013).
  3. No other config ranks top-2 on both dimensions: topk_3 is #1 composite but #7 AR. temp_greedy is #3 on both. structured is #1 AR but #8 composite.
  4. max_tokens=768 is the differentiator: The only parameter change from the anchor config (512 tokens) is +256 max tokens. This extra generation budget produces measurably more helpful responses (d=0.221 on AR) at negligible composite cost.
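The decision rule above can be expressed as a few lines over the finalist table. A minimal sketch (the tuples encode Table T13; the `top_tier=4` cutoff operationalizes "top tier" and "large deficit" as stated in the rule):

```python
# Finalists from Table T13: (config, composite_rank, ar_rank, composite, ar_mean).
finalists = [
    ("maxtok_768",      2, 2, 0.582, 4.365),
    ("topk_3",          1, 7, 0.583, 4.225),
    ("temp_greedy",     3, 3, 0.580, 4.310),
    ("anchor_detailed", 4, 6, 0.579, 4.246),
    ("structured",      8, 1, 0.552, 4.440),
]

def select(candidates, top_tier=4):
    """Keep configs in the composite top tier (screening out large automated-metric
    deficits such as structured's), then pick the best answer-relevance mean."""
    tier = [c for c in candidates if c[1] <= top_tier]
    return max(tier, key=lambda c: c[4])[0]

print(select(finalists))  # maxtok_768
```

The screen removes structured despite its best-in-class AR; among the remaining four, maxtok_768 has the highest AR mean.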

Caveats

  • All results are specific to ExpertQA with Nemotron-9B-NVFP4. Generalization to other datasets or models requires validation.
  • Answer relevance calibration showed moderate agreement (kappa=0.469). The relative ranking is more trustworthy than absolute scores.
  • The optimization used one-factor-at-a-time (OFAT) sweeps, not factorial designs. Interaction effects may exist between parameters that were not explored.

9. Limitations

  1. Single dataset: All results are from ExpertQA only. Findings like “reranking hurts” may not generalize to narrow-domain datasets where v3 (covidqa) and v4 (techqa) showed reranking benefits.
  2. Single model: Nemotron-Nano-9B-v2-NVFP4 is a specific quantized model. Larger or differently quantized models may shift optimal configurations, particularly for max_tokens and prompt template effects.
  3. OFAT sweep design: Parameters were swept one at a time. Factorial designs would reveal interaction effects (e.g., window_size × reranking, temperature × max_tokens) but at exponentially higher cost.
  4. Moderate calibration kappa: The AR judge achieved kappa=0.469 (moderate), below the 0.6 threshold for substantial agreement. While 88% within-1 agreement and significant Spearman correlation support relative ranking reliability, absolute AR scores should be interpreted cautiously.
  5. Metric reformulation mid-study: Switching from binary to continuous metrics between v6 and v7 means composite scores are not directly comparable across that boundary. The composite scale shifted from ~0.77–0.81 (binary) to ~0.55–0.58 (continuous).
  6. No large-scale human evaluation: The AR judge is an automated proxy for human judgment. The 50-sample calibration set may not capture the full distribution of human preferences. A larger human evaluation (N > 200) would strengthen confidence in AR-based conclusions.
  7. Completeness metric history: Completeness showed zero discriminative power under binary formulation in v5 (F=1.77, p=0.15) and was flagged for removal. However, the v7 continuous reformulation restored strong discrimination (Friedman χ²=785.22, η²=0.175, the strongest of all metrics), justifying the weight increase from 0.10 to 0.20. The metric’s utility is formulation-dependent.
  8. Non-independent test cases: The same 150 test cases were used across all versions and configurations. While paired analysis accounts for within-subject correlation, there is no guarantee these 150 cases represent the full distribution of real-world queries. Performance on unseen questions may differ.

10. Artifact Index

Analysis Reports

| File | Version | Content |
|---|---|---|
| results/v5/v5_session1_analysis/session_1_analysis.md | v5 | Full statistical analysis (77 PNGs, 23 JSONs) |
| results/v6/v6_session1_analysis/session_1_analysis.md | v6 | Window size analysis (78 artifacts) |
| results/v7/v7_reevaluation_analysis/reevaluation_analysis.md | v7 | Binary vs continuous comparison |
| results/v8/v8_judge_analysis/judge_analysis.md | v8 | Answer relevance analysis |

Key Data Files

| File | Version | Content |
|---|---|---|
| results/v7/.../config_ranking_continuous.json | v7 | Continuous composite rankings (8 configs) |
| results/v7/.../formulation_comparison.json | v7 | Ceiling rates and F-statistics |
| results/v8/.../config_ranking.json | v8 | AR rankings (8 configs) |
| results/v8/.../top4_discrimination.json | v8 | Friedman + pairwise tests |
| results/v8/.../construct_validity.json | v8 | 4 construct validity tests |
| results/v8/.../metric_correlations.json | v8 | AR vs automated metric correlations |
| results/v8/.../discrimination.json | v8 | Overall Friedman (all 8 configs) |

Design Documentation

| File | Version | Content |
|---|---|---|
| documentation/260311_01_.../design_analysis.md | v5 | Approved design (4 techniques, 8 sweeps) |
| documentation/260313_01_.../design_analysis.md | v6 | Approved design (4 window sizes, 4 sweeps) |
| documentation/260314_01_.../design_analysis.md | v7 | Re-evaluation methodology |
| documentation/260315_01_.../design_analysis.md | v8 | AR judge design |

Appendix A: Answer Relevance Judge Protocol

Judge Model

  • Model: Qwen3.5-9B (BF16 precision)
  • Inference engine: vLLM with enforce_eager=True, max_model_len=4096, gpu_memory_utilization=0.85
  • Temperature: 0.0 (deterministic scoring)
  • Max output tokens: 1024
  • Thinking mode: Disabled (enable_thinking: False)

Scoring Scale

| Score | Anchor |
|---|---|
| 1 | Completely unhelpful: response is irrelevant, incoherent, or empty |
| 2 | Mostly unhelpful: response touches the topic but fails to address the question |
| 3 | Partially helpful: response addresses some aspects but misses key parts or is vague |
| 4 | Mostly helpful: response addresses the question well with minor gaps or unnecessary content |
| 5 | Fully helpful: response directly and completely addresses the question using the provided context |

Evaluation Criteria

  1. Does the response directly answer the user’s question?
  2. Does the response use the provided context effectively?
  3. Is the response appropriately detailed (not too brief, not excessively verbose)?
  4. Would a domain expert find this response useful?

Additional scoring rules instruct the judge not to reward verbosity, to use the full 1–5 scale (reserving 5 for truly excellent responses), and to assign 3 to responses that are on-topic but have significant gaps.

Prompt and Rubric

The exact judge prompt is defined in scripts/rag_benchmark_v8/config.py (JUDGE_USER_TEMPLATE). The system prompt instructs the model to act as an expert answer quality judge and output valid JSON only. The user prompt provides the question, retrieved context chunks (numbered), and the response to evaluate, followed by the rubric and scoring rules.

Output format: {"score": <1-5>, "reasoning": "<brief justification>"}. Parsing uses strict JSON first, then regex fallback. Parse rate: 100% across all 1,598 responses.
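The exact implementation lives in scripts/rag_benchmark_v8; a minimal sketch of the described strict-then-regex strategy might look like this:

```python
import json
import re

def parse_judge_output(text):
    """Parse a judge response: strict JSON first, then a regex fallback
    that hunts for a "score": 1-5 field anywhere in the text."""
    try:
        obj = json.loads(text)
        score = int(obj["score"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        m = re.search(r'"score"\s*:\s*([1-5])', text)
        if not m:
            return None  # unparseable: caller decides how to handle
        score = int(m.group(1))
    return score if 1 <= score <= 5 else None

print(parse_judge_output('{"score": 4, "reasoning": "mostly helpful"}'))  # 4
print(parse_judge_output('Sure! {"score": 5, "reasoning": "..."}'))       # 5 (regex fallback)
print(parse_judge_output("no score here"))                                # None
```

The fallback handles the common failure mode where the model wraps valid JSON in conversational filler, which is likely how the 100% parse rate was achieved.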

Blinding and Ordering

  • Blinding: The judge sees only the question, context chunks, and response; no configuration identity, run name, or parameter values are provided.
  • Presentation order: Responses are scored sequentially in CSV row order (one response per judge call). No within-response randomization is applicable since each call evaluates a single response.
  • Position bias: Not applicable in this protocol (single-response evaluation, not pairwise comparison).

Calibration

50 human-labeled samples scored by the judge before full deployment. Results: kappa=0.469 (moderate), MAE=0.580, within-1 agreement=88%, Spearman rho=0.509. See scripts/rag_benchmark_v8/calibrate_judge.py for the calibration protocol.

Known Biases and Limitations

  • Verbosity bias: Possible but not measured. Longer responses may receive higher scores from the judge regardless of quality. Anti-verbosity instructions are included in the prompt, but their effectiveness was not independently validated.
  • Rubric iteration: The rubric was revised once after initial calibration (iteration 1 showed judge bias toward score 5). The deployed rubric includes explicit anti-inflation instructions.
  • Kappa below threshold: Weighted kappa (0.469) fell below the 0.6 target for substantial agreement. Relative ranking is more robust than absolute scores.

© 2026 RCTK. All rights reserved. This study may not be reproduced or distributed without prior written permission.