Recent Study
Reclaiming AI Document Search Quality Through Configuration Testing And Parameter Sweeps
This report documents the systematic optimization of a Retrieval-Augmented Generation (RAG) pipeline through four rounds of evidence-based testing: 16,143 evaluations narrowing 40+ configurations to a single recommendation.
The Challenge
When organizations use AI to answer questions from their own documents—contracts, research papers, technical manuals, policy guides—the quality of those answers depends on dozens of configuration decisions. How should documents be split into searchable pieces? How many pieces should the system retrieve? How should they be ranked? How should the AI be instructed to formulate its response?
These choices directly affect whether answers are reliable enough to trust in real work. When the configuration is wrong, the consequences are concrete: unreliable answers, teams that revert to manual search, and support escalations that the system was supposed to prevent. Yet most teams either accept default settings or make these choices based on intuition, and without testing, there is no way to know how much quality is being left on the table.
I took a different approach: systematic, evidence-based testing. I designed a series of experiments to measure the impact of every major configuration decision, compare the alternatives head-to-head, and follow the evidence to a final recommendation.
To stress-test the configurations, I used a deliberately challenging multi-domain benchmark: expert-level questions spanning medicine, law, science, and technology. These are the kinds of complex, cross-domain queries that expose weaknesses in a document search system, providing stronger evidence than testing on routine questions alone.
The result: four rounds of progressively focused testing (over 16,000 evaluations conducted over 5 days) that narrowed more than 40 configurations down to one.
Why This Kind of Testing Matters
Most AI document search systems are deployed with settings that feel plausible but have never been tested against alternatives. Over four rounds, evidence-based testing overturned conventional assumptions, caught a silent failure in the evaluation system itself, and identified a configuration that no amount of intuition would have reached.
When leadership asks “why this configuration?”, every choice traces back to a specific round of testing. When a component turns out to hurt rather than help, it is caught before deployment. When the measurement system itself stops working, that is caught too.
The Approach: Test, Measure, Narrow, Repeat
The evaluation pipeline had been validated in prior rounds of internal testing on different datasets before this benchmark began. With tested infrastructure in hand, I designed a four-round optimization funnel. Each round answered one specific question, and its findings constrained the next:
- Round 1— Which document-splitting method produces the best answers? Four methods tested across 7,350 evaluations.
- Round 2— How should the selected method be fine-tuned? Twenty-eight parameter combinations tested across 5,600 evaluations.
- Round 3— Are the measurements trustworthy? A measurement audit that re-scored 1,600 answers and revealed a critical flaw in the scoring system.
- Round 4— Which configuration produces the most helpful answers? An independent AI judge, calibrated against human ratings, scored 1,600 answers for real-world usefulness.
Each round eliminated weaker options and sharpened the focus for the next:
Round 1: 40+ configurations tested
Round 2: Selected method fine-tuned
Round 3: Measurements audited and fixed
Round 4: Best configuration identifiedThe methodology (select, optimize, audit, validate) is designed to be reapplied on other AI pipelines, though the specific results must be validated on the target dataset and use case.
Round 1 — Finding the Right Foundation
The first and most impactful decision in any document search system is how to break documents into searchable pieces. A long technical manual or research paper must be divided into smaller segments that the system can retrieve individually. How those segments are created has a cascading effect on every answer the system produces.
I tested four fundamentally different splitting methods across 7,350 question-answer evaluations, measuring each on fidelity to source material, question coverage, and effective use of retrieved information.
No single method dominated every measure. One led on source fidelity; another led on question coverage and information usage. The selected foundation was sentence-window retrieval—an approach that retrieves a targeted passage along with its immediate neighbors, balancing precision with enough surrounding context to answer well. It offered the strongest overall balance across quality, operational stability, and tunability for further optimization.

This round also uncovered a counterintuitive finding: a commonly recommended processing step called “re-ranking,” a second-pass filter intended to improve retrieval quality, actually made answers worse on this dataset. I flagged it for confirmation in the next round.
Round 2 — Fine-Tuning the Selected Method
With sentence-window retrieval as the foundation, Round 2 tested variations of its key parameters across 5,600 evaluations. The most important was “window size”—how much surrounding context to include around each retrieved sentence, ranging from minimal (just the neighboring sentences) to broad (several sentences in each direction).
The smallest window—just the immediate neighbors—produced the highest quality. Larger windows diluted the answer with less relevant material.

This round also confirmed the earlier finding about re-ranking. Across multiple parameter combinations, disabling re-ranking consistently improved quality. Two independent rounds of testing, thousands of evaluations each, reached the same conclusion. Without testing, this step would have shipped by default—adding latency while reducing answer quality.
I also tested how different instructions affect the AI’s responses. A detailed template—one that asked the AI to provide thorough, well-sourced answers—outperformed the default, and became part of the recommended configuration.
Round 3 — When the Measurements Stopped Working
After two rounds of optimization, I had a strong configuration with validated parameter choices. The next step was testing how the AI formulates its responses: how creative versus deterministic it should be, response length limits, and prompt variations. I tested eight configurations.
The top four scored almost identically. Their quality scores differed by less than two-tenths of a percent. It looked like a plateau.
I could have stopped here. Instead, I investigated why the scores were so similar.
Three of the five quality measurements had “ceiling effects.” Imagine a thermometer that only goes up to 100 degrees being used to measure temperatures of 105 or 110—everything above 100 reads the same. The majority of answers were scoring at the maximum on several metrics. In one metric, 87% of all answers received the maximum score.

The fix was redesigning the scoring from simple pass/fail (did the answer exceed a threshold?) to graduated measurement (how far above or below?). Quality checks that had been failing under the old system passed under the new one. The old measurements were hiding real differences.
The most important output of Round 3 was not a better configuration—it was a better way of measuring. When the evidence stops making sense, fix the measurement first.
Round 4 — Adding a Helpfulness Lens
With improved measurements, the top four configurations still scored within a narrow band on automated quality metrics. The scores were no longer artificially compressed, but the configurations genuinely produced answers of similar measurable quality.
Automated metrics capture whether an answer is technically correct and well-sourced. They do not directly measure whether a person would find it helpful. A technically correct answer that is poorly organized or unnecessarily terse might score well on automated metrics while frustrating an actual user.
To add a more user-centered signal, I deployed an independent AI model as a quality judge—a separate system with no knowledge of which configuration produced which answer. This judge evaluated all 1,600 answers on a 1-to-5 helpfulness scale: Does the answer directly address the question? Does it use the provided information effectively? Is it appropriately detailed? Would a domain expert find it useful?
Before scoring at scale, I calibrated the judge against human ratings on a sample of 50 answers. The correlation was statistically significant, and 88% of the judge’s scores fell within one point of the corresponding human rating. This did not replace full-scale human evaluation, but it provided a calibrated supplementary signal, sufficient for distinguishing between configurations that automated metrics could not separate.

The judge found real differences within the top four. The strongest configuration gave the AI more room to answer fully (50% more than the default), which was associated with more complete and thorough responses. This single change produced measurably more helpful answers.
The Result
After 16,143 evaluations across four rounds, one configuration emerged as the best overall recommendation. Every design choice traces back to a specific round of evidence:
- Document splitting: Sentence-window retrieval with minimal surrounding context, providing focused retrieval without dilution (Round 1, confirmed Round 2)
- Re-ranking: Disabled; the conventional second-pass filter reduced answer quality on this content (Round 1, confirmed Round 2)
- Response instructions: A detailed template asking for thorough, well-sourced answers (Round 2)
- Response length: 50% more room to write than the default, associated with more helpful answers (Round 4)
- Retrieved context: The 5 most relevant document sections per question (Round 2)
This configuration achieved:
- 4.4 out of 5.0 helpfulness rating from the independent AI judge
- Less than 0.5% failure rate across over 13,000 evaluation runs
- The only configuration to rank in the top 2 on both automated quality metrics and AI-judged helpfulness
What This Means for a Deployment
The specific settings above are validated for this benchmark and model. Different content, models, or deployment contexts may shift the optimal configuration—earlier testing on single-domain medical and technical corpora produced different optimal settings, including cases where re-ranking helped. What follows are the broader patterns.
The re-ranking step, a commonly recommended component, would have shipped by default. Testing showed it reduced answer quality on this content. Two independent rounds confirmed it. That is the kind of mistake that structured evaluation catches before production.
The scoring system appeared to show four equivalent configurations. Investigation revealed it was failing to measure real differences. Without the audit, the final recommendation would have been arbitrary.
Every component in the final configuration earned its place through evidence. Components that did not—larger context windows, default response limits—were removed or changed. The result is a simpler system, because complexity that did not improve answers was cut.
The four-round funnel itself (select, optimize, audit, validate) transfers to other pipelines. It is designed to produce recommendations specific to the target content, not to assume one answer fits all.
Part II
Technical Detail
The narrative above intentionally simplified the underlying methodology. The following section contains the complete statistical framework, version-by-version analysis, and evidence supporting every finding and recommendation in Part I.
Dataset: ExpertQA | Model: Nemotron-Nano-9B-v2-NVFP4 | Embedding: Qwen3-Embedding-0.6B
1. Executive Summary
Four iterations of RAG benchmark optimization (v5–v8) were conducted on the ExpertQA dataset using a 6-step retrieval-augmented generation pipeline powered by Nemotron-Nano-9B-v2-NVFP4 via vLLM. The study processed 14,545 benchmark requests plus 1,598 LLM judge evaluations over 5 days, progressively narrowing the configuration search space through a decision funnel: technique selection (v5), parameter optimization (v6), measurement reform (v7), and helpfulness assessment (v8). Each version answered a specific question, and its findings constrained the next iteration’s search space. The final recommendation is maxtok_768, a sentence_window_1 configuration with max_tokens=768, temperature=0.1, top_k=5, reranking disabled, and a detailed prompt template—which achieves the best balance of automated metric quality (composite rank #2, score 0.5823) and human-perceived helpfulness (answer relevance rank #2, score 4.365/5.0).
Table T1: Version Overview
| Version | Question Asked | Requests | Configs | Duration | Key Outcome |
|---|---|---|---|---|---|
| v5 | Which chunking technique is best? | 7,350 | 4 techniques + 8 sweeps | ~6.3h | sentence_window dominates; all 4 techniques KEEP |
| v6 | Which window size and parameters? | 5,597 | 4 window sizes + 3 sweeps | ~9.2h | sw_1 best (0.769); reranking=false +0.041; 6/6 validity |
| v7 | Are binary metrics hiding differences? | 1,598 | 8 generation configs | reeval | Yes—ceilings at 22.8/86.7/80.7% eliminated; top-4 plateau at 0.579–0.583 |
| v8 | Can AR break the top-4 tie? | 1,598 | 8 generation configs | ~16min | Yes—Friedman p=0.006; maxtok_768 wins (AR 4.365) |
Table T2: Decision Funnel
| Version | Search Space | Decision | Carried Forward |
|---|---|---|---|
| v5 | 4 techniques × 8 sweeps | sentence_window best coverage-fidelity tradeoff | Technique = sentence_window |
| v6 | 4 window sizes + 3 sweeps | sw_1 + reranking=false = 0.810 composite | Window = 1, reranking = false |
| v7 | 8 generation configs | Binary metrics plateau; continuous reformulation | Continuous metrics; top-4 cluster identified |
| v8 | Top-4 cluster | maxtok_768 > anchor_detailed (d=0.221), > topk_3 (d=0.218) | maxtok_768 = recommended |
Final recommendation: The maxtok_768 configuration is the recommended operating point across both automated composite metrics and LLM-judged answer relevance. See Section 8 for the complete configuration card.
2. Methodology
2.1 Pipeline Architecture
All four versions share a common 6-step RAG pipeline:
- Validate Input — Schema checks on question + ground truth
- Retrieve — Dense retrieval via ChromaDB (Qwen3-Embedding-0.6B, 1024-dim) with optional BM25 hybrid fusion
- Rerank — Cross-encoder re-ranking (bge-reranker-v2-m3) with configurable enable/disable
- Generate — LLM answer generation via Nemotron-Nano-9B-v2-NVFP4 (vLLM, Marlin backend, RTX 4090)
- Evaluate — Automated metrics computed against ground-truth answers and retrieved contexts
- Compare — Pairwise statistical analysis across configurations
Dataset: ExpertQA: 150 stratified test cases drawn from a multi-domain expert-sourced QA corpus (medicine, law, science, technology, etc.). The same 150 test cases were reused across all versions for comparability.
Infrastructure: AI Manager service orchestrating Docker containers. Single Nemotron-9B NVFP4 instance (~12.2 GiB model, ~23 GiB peak VRAM). PostgreSQL + Redis backing stores. All runs on a single RTX 4090 (24 GiB).
Embedding model: Qwen3-Embedding-0.6B with asymmetric query/document encoding. Chunked documents stored in ChromaDB collections with per-technique indexing.
2.2 Evaluation Metrics
Table T3: Metric Definitions
| Metric | Binary Formulation (v5–v6) | Continuous Formulation (v7–v8) | Range |
|---|---|---|---|
| Adherence | Cosine similarity of answer-chunk pairs > threshold (0.50) | Mean cosine similarity across all answer-chunk sentence pairs | [0, 1] |
| Relevance | Cosine similarity of answer-question > threshold (0.55) | Cosine similarity between full answer and question embeddings | [0, 1] |
| Utilization | Fraction of retrieved chunks with similarity > threshold (0.60) | Mean cosine similarity between answer and each retrieved chunk | [0, 1] |
| Completeness | Proportion of ground-truth sentences covered | Proportion of ground-truth sentences with cosine sim > threshold | [0, 1] |
| NLI Faithfulness | Fraction of answer sentences entailed by context (DeBERTa NLI) | (dropped in v7) | [0, 1] |
| Answer Relevance | (not used) | LLM-judged helpfulness score (Qwen3.5-9B, 1–5 Likert) | [1, 5] |
NLI Faithfulness was dropped after v6 due to an 88.8% ceiling effect and independent evaluation model concerns. Answer Relevance was added in v8 as a holistic helpfulness measure orthogonal to component metrics.
2.3 Composite Score Design
The composite score evolved across versions as metric quality findings accumulated:
Table T4: Composite Weight Evolution
| Metric | v5–v6 Weights | v7 Weights (original) | v7 Weights (reeval) | v8 |
|---|---|---|---|---|
| Adherence | 0.30 | 0.30 | 0.35 | 0.35 |
| Relevance | 0.25 | 0.25 | 0.25 | 0.25 |
| NLI Faithfulness | 0.20 | 0.20 | — (dropped) | — |
| Utilization | 0.15 | 0.15 | 0.20 | 0.20 |
| Completeness | 0.10 | 0.10 | 0.20 | 0.20 |
Key changes:
- v7 re-evaluation dropped NLI (ceiling) and redistributed its 0.20 weight to utilization (+0.05) and completeness (+0.10)
- The v7 re-evaluation composite (template_v2_no_nli) was selected for strongest discrimination (F=6.81 vs F=3.80 for original weights)
2.4 Statistical Framework
All versions use a consistent statistical framework:
- Paired tests: Wilcoxon signed-rank (non-parametric) for pairwise comparisons on the same test cases
- Multiple comparison correction: Holm-Bonferroni step-down procedure
- Effect sizes: Cohen’s d with benchmarks: |d| < 0.2 = Negligible, 0.2–0.5 = Small, 0.5–0.8 = Medium, > 0.8 = Large
- Omnibus tests: Friedman test (non-parametric repeated measures) for rank-based group differences; one-way ANOVA as parametric complement
- Confidence intervals: 10,000-iteration bootstrap (BCa) for technique means
- Bayesian evidence: BIC-based Wagenmakers (2007) Bayes factors
- Construct validity: Pre-registered directional predictions tested as strict (must hold) or informational (expected but not required)
- Discrimination power: ANOVA F-statistics and eta-squared (η²) for between-technique variance
- Ceiling/floor analysis: Percentage of observations at metric bounds; >50% classified as “Unacceptable”
Note on effect sizes: Cohen’s d is computed on raw paired score differences (not rank-transformed), which is standard practice when the underlying continuous metrics have interval-scale properties. While rank-biserial correlation would be the natural nonparametric companion for Wilcoxon signed-rank tests, Cohen’s d was chosen for interpretability and comparability with power analysis conventions. Kendall’s W was not reported alongside Friedman tests but would be a useful addition in future iterations.
3. v5 — Technique Selection
Question: Which chunking technique produces the best RAG responses on ExpertQA?
Design: 4 techniques (semantic, sentence_400, recursive_400, sentence_window_3) evaluated at default parameters with 150 samples each (600 main comparisons), plus 8 parameter sweeps (chunk_size, prompt_template, final_top_k, similarity_threshold, reranking, retrieval_method, adherence_support, chunk_size for sentence_window). 7,350 completed requests, 99.0% success rate. Runtime ~6.3 hours.
Table T5: Technique Ranking (v5)
| Rank | Technique | Adherence | Relevance | Utilization | NLI Faith. | Completeness | Best At |
|---|---|---|---|---|---|---|---|
| 1 | semantic | 0.733 | 0.589 | 0.646 | 0.994 | 0.256 | Fidelity |
| 2 | sentence_window_3 | 0.688 | 0.808 | 0.836 | 0.908 | 0.235 | Coverage |
| 3 | sentence_400 | 0.674 | 0.571 | 0.573 | 0.975 | 0.261 | Completeness |
| 4 | recursive_400 | 0.634 | 0.524 | 0.553 | 0.988 | 0.259 | — |
The results revealed a fundamental fidelity-coverage tradeoff: semantic excelled at adherence (0.733) and NLI faithfulness (0.994), while sentence_window_3 led on relevance (0.808) and utilization (0.836). The gap was substantial, with Medium effect sizes (d=0.64–0.73) on coverage metrics.
Table T6: Key Sweep Findings (v5)
| Parameter | Finding | Effect |
|---|---|---|
| Reranking | Enabled 0.682 adherence vs disabled 0.701 | Reranking hurts on ExpertQA (negative) |
| Hybrid retrieval | Dense: adh 0.682, NLI 0.966; Hybrid: adh 0.602, NLI 0.657 | Hybrid severely degrades fidelity (−0.309 NLI) |
| Chunk size (sentence) | 200 > 400 > 600 on adherence (0.722 > 0.674 > 0.659) | Smaller consistently better |
| Window size (sw) | sw_1: adh 0.753, NLI 0.982; sw_3: rel 0.808, util 0.836 | Window=1 for fidelity, =3 for coverage |
| Top_k | 3→15: relevance 0.688→0.468; completeness 0.243→0.262 | top_k=5 is balanced |
| Prompt template | Detailed > default for adherence on some techniques | Small effect, worth exploring |
| Similarity threshold | 0.40→0.55: adherence 0.829→0.550 | Impactful but tradeoff with recall |






Construct Validity (v5): 4/5 Strict PASS
| Test | Prediction | Result | Status |
|---|---|---|---|
| top_k increases → relevance decreases | Monotonic decrease | 0.688 → 0.468 | PASS |
| top_k increases → completeness increases | Monotonic increase | 0.243 → 0.262 | PASS |
| Reranking improves adherence | Enabled > disabled | 0.682 < 0.701 | FAIL |
| Smaller chunks → higher relevance | 200 > 400 > 600 | 0.570 > 0.548 > 0.527 | PASS |
| Similarity threshold → adherence decreases | Monotonic decrease | 0.829 → 0.550 | PASS |
Test 3 failed because reranking hurts on ExpertQA, unlike techqa/covidqa. This was an important dataset-specific finding carried forward.
Metric Quality Alerts
- NLI Faithfulness: 88.8% ceiling — unreliable as discriminator (recommended drop in v6+)
- Completeness: ANOVA F=1.77, p=0.15 — zero discriminative power; eta-squared=0.009 (negligible)
- Relevance: 37.3% ceiling (sentence_window_3 at 67.3%) — marginal
Decisions Carried Forward
- Technique = sentence_window— selected over semantic (adherence leader, 0.733) because sentence_window provides the best coverage-fidelity balance, is fastest (13.1s vs 14.8s mean), has the lowest failure rate (0.3% vs 0.8%), and its window_size parameter enables further optimization in v6. All 4 techniques exceeded the 0.60 adherence threshold and were retained as viable candidates; the v6 carry-forward decision was which to prioritize.
- Investigate window_size=1 — highest adherence (0.753) and NLI (0.982) among all configs
- Reranking = investigate further — counterintuitive negative effect needs confirmation
- Flag completeness and NLI as non-discriminating — under binary formulation, neither metric discriminates between techniques; reassess after measurement reform
- Dense retrieval only — hybrid severely degraded fidelity
4. v6 — Window Size and Parameter Optimization
Question: Which sentence_window size and parameter combination maximizes composite quality?
Design: 4 window sizes (1, 2, 3, 4) at 200 samples each, plus 3 parameter sweeps (final_top_k, reranking_enabled, prompt_template). 5,597 completed requests out of 5,617 submitted (99.6% completion rate; 5 failures, 15 other non-completed). Runtime ~9.2 hours across 4 runs (1 main + 3 sweeps).
Table T7: Window Size Ranking (v6)
| Rank | Window Size | Composite | Adherence | Relevance | NLI Faith. | Utilization | Completeness |
|---|---|---|---|---|---|---|---|
| 1 | sw_1 | 0.769 | 0.766 | 0.784 | 0.979 | 0.818 | 0.244 |
| 2 | sw_2 | 0.759 | 0.739 | 0.803 | 0.932 | 0.837 | 0.241 |
| 3 | sw_3 | 0.746 | 0.691 | 0.817 | 0.902 | 0.868 | 0.234 |
| 4 | sw_4 | 0.722 | 0.662 | 0.807 | 0.837 | 0.872 | 0.230 |
Window_size=1 won the composite ranking through leading fidelity metrics (adherence, NLI). The fidelity-coverage tradeoff from v5 repeated at finer granularity: sw_1 led on adherence (+0.104 vs sw_4, d=0.42) while sw_3/sw_4 led on utilization (+0.054, not significant).
Table T8: Best Observed Configurations (v6)
| Configuration | Composite | Adherence | Relevance | NLI Faith. | Utilization | CI (95%) |
|---|---|---|---|---|---|---|
| sw_1 + reranking=false | 0.810 | 0.785 | 0.912 | 0.977 | 0.849 | [0.792, 0.827] |
| sw_1 + template=detailed | 0.793 | 0.795 | 0.830 | 0.981 | 0.829 | — |
| sw_1 + default | 0.769 | 0.766 | 0.784 | 0.979 | 0.818 | — |
| sw_1 + reranking=true | 0.769 | 0.766 | 0.784 | 0.979 | 0.818 | — |
Disabling reranking improved composite by +0.041, confirming v5’s counterintuitive finding. The effect was driven primarily by relevance (0.784 → 0.912, +0.128) and utilization (0.818 → 0.849). A plausible mechanism is that the cross-encoder filtered out chunks that were topically relevant but not the closest match—on ExpertQA’s diverse multi-domain content, this filtering may have been counterproductive.
Sweep Details
Top_k sweep (k = 3, 5, 7, 10): Composite flat across k=5–10 (0.749–0.755). k=3 dropped to 0.722. Decision: k=5 retained (default, balanced).
Reranking sweep: Disabled significantly better (+0.041 composite). Bayesian evidence for the overall window-size adherence gradient: BF10=308.6 (extreme) for sw_1 vs sw_4.
Template sweep: Detailed template +0.024 composite over default. Chain-of-thought comparable but 2.1× slower (~29s vs ~14s mean generation time).





Construct Validity (v6): 6/6 PASS (Highest of Any Version)
| Test | Prediction | Result | Status |
|---|---|---|---|
| top_k increases → relevance decreases | Monotonic decrease | 0.822 → 0.751 | PASS |
| top_k increases → completeness increases | Monotonic increase | 0.224 → 0.253 | PASS |
| Window size increases → NLI decreases | Monotonic decrease | 0.979 → 0.837 | PASS |
| Window size increases → adherence decreases | Monotonic decrease | 0.766 → 0.662 | PASS |
| Reranking impacts relevance (info) | Directional | Confirmed | PASS |
| Larger window → higher utilization (info) | Directional | Confirmed | PASS |
Ceiling Problem Foreshadowing
Despite 6/6 construct validity (the highest of any version), v6 revealed a severe measurement problem:
| Metric | Ceiling % | Assessment |
|---|---|---|
| NLI Faithfulness | 71.6% | Unacceptable |
| Relevance | 66.9% | Unacceptable |
| Utilization | 66.8% | Unacceptable |
| Adherence | 29.6% | Acceptable |
| Completeness | 0.0% | Ideal |
Three of five metrics had unacceptable ceiling effects (>50% of observations at maximum). This meant the composite score was increasingly driven by the few metrics with remaining variance, raising the question: is this optimizing metrics or actual quality?
Decisions Carried Forward
- Window size = 1 — best composite, fidelity leader
- Reranking = disabled — confirmed negative effect
- Template = detailed — modest gain, retained
- top_k = 5 — flat region, no need to change
- Address ceiling effects — measurement reform needed before trusting generation-level optimization
5. v7 — Measurement Reform
Question: Are binary metrics hiding meaningful differences between generation configurations?
Design: 8 named generation configurations varying prompt_template (detailed, grounded_detailed, precise, structured), temperature (0.0 vs 0.1 default), max_tokens (768 vs 512 default), and top_k (3, 5, 7). All used sentence_window_1 with system_prompt=/no_think and other defaults from v6. 1,598 completed requests (200 per config, minus 2 missing). The same responses were scored twice: once with original binary metrics, once with continuous reformulations using an independent evaluation model.
The 8 Configurations
All configs share: system_prompt=/no_think, reranking=false, dense retrieval, similarity_threshold=0.45.
| Config | Temperature | Max Tokens | Top_k | Template | What it tests |
|---|---|---|---|---|---|
| anchor_detailed | 0.1 | 512 | 5 | detailed | Baseline (anchor) |
| maxtok_768 | 0.1 | 768 | 5 | detailed | More generation budget |
| topk_3 | 0.1 | 512 | 3 | detailed | Fewer, higher-quality chunks |
| topk_7 | 0.1 | 512 | 7 | detailed | More context coverage |
| temp_greedy | 0.0 | 512 | 5 | detailed | Deterministic decoding |
| grounded_detailed | 0.1 | 512 | 5 | grounded_detailed | Grounding-focused template |
| precise | 0.1 | 512 | 5 | precise | Adherence-maximizing template |
| structured | 0.1 | 512 | 5 | structured | Structured output format |
The Plateau Discovery
With binary metrics, the top-4 configurations (topk_3, maxtok_768, temp_greedy, anchor_detailed) scored 0.830–0.832 composite — a spread of just 0.0017. The ranking appeared converged: further optimization seemed pointless.
But investigation revealed this convergence was an artifact of metric ceilings, not genuine equivalence.
Table T9: Binary vs Continuous Comparison
| Metric | Binary Mean | Binary Ceiling | Continuous Mean | Continuous Ceiling | Change |
|---|---|---|---|---|---|
| Adherence | 0.755 | 22.8% | 0.616 | 0.0% | Ceiling eliminated |
| Relevance | 0.912 | 86.7% | 0.581 | 0.0% | Ceiling eliminated |
| Utilization | 0.927 | 80.7% | 0.753 | 0.0% | Ceiling eliminated |
The binary relevance metric had 86.7% of all observations at the ceiling (1.0). Any configuration that exceeded the threshold got a perfect score, erasing all differences above that line. The continuous formulation, using raw cosine similarity instead of threshold-based binary classification, restored the full measurement range.
Table T10: Configuration Rankings — Binary vs Continuous
| Config | Binary Rank | Binary Composite | Continuous Rank | Continuous Composite | Rank Change |
|---|---|---|---|---|---|
| topk_3 | 1 | 0.832 | 1 | 0.583 | — |
| maxtok_768 | 2 | 0.832 | 2 | 0.582 | — |
| temp_greedy | 3 | 0.831 | 3 | 0.580 | — |
| anchor_detailed | 4 | 0.830 | 4 | 0.579 | — |
| grounded_detailed | 5 | 0.824 | 6 | 0.571 | −1 |
| topk_7 | 6 | 0.819 | 5 | 0.574 | +1 |
| precise | 7 | 0.796 | 7 | 0.555 | — |
| structured | 8 | 0.789 | 8 | 0.552 | — |
The top-4 order was preserved, but the spread widened from 0.0017 to 0.0036, though still narrow. The continuous composite revealed 16 significant pairwise differences (vs fewer with binary), with 5 Large or Medium effects (4 with d > 0.8, plus structured vs topk_7 at d = 0.71), all involving structured as the worst config.


Construct Validity Revolution: 0/2 → 2/2
Under binary metrics, both strict construct validity tests failed — the threshold-based scoring masked the expected monotonic relationship between top_k and relevance. After continuous reformulation:
| Test | Binary Result | Continuous Result |
|---|---|---|
| topk_3 > anchor (relevance) | FAIL (both at ceiling) | PASS (0.595 > 0.580) |
| anchor > topk_7 (relevance) | FAIL (both at ceiling) | PASS (0.580 > 0.570) |
This was the strongest evidence that binary metrics were not just noisy but actively misleading — they could not detect real differences that continuous metrics revealed.
The Top-4 Cluster Problem
Even with continuous metrics, the top-4 configs (composite 0.579–0.583) remained statistically indistinguishable in pairwise tests (all Holm-corrected p > 0.05, all d < 0.15). The composite metric had reached its discriminative limit: the four best configurations produced responses of equivalent measured quality across adherence, relevance, utilization, and completeness.
This raised the question: is there a quality dimension not being measured?
Decisions Carried Forward
- Continuous metrics adopted — binary formulations permanently retired
- NLI dropped — 88.8% ceiling, independent eval model concerns
- Top-4 cluster identified — topk_3, maxtok_768, temp_greedy, anchor_detailed
- Need a new discriminator — automated composite cannot break the tie
- Explore answer relevance — holistic helpfulness judgment as complementary signal
6. v8 — Answer Relevance Judge
Question: Can an LLM-judged answer relevance score discriminate between the top-4 configs that automated metrics cannot?
Design: Qwen3.5-9B (BF16 via vLLM, Docker container) scored all 1,598 v7 responses on a 1–5 Likert scale for answer relevance — “how helpful and relevant is this answer to the user’s question, considering completeness, accuracy, and specificity?” Temperature=0.0 for deterministic scoring. Scoring time: 969 seconds (1.5 responses/second).
Judge Calibration
Before scoring, the judge was calibrated against 50 human-labeled samples:
| Metric | Value | Threshold | Status |
|---|---|---|---|
| Cohen’s kappa (weighted) | 0.469 | ≥ 0.6 | Soft FAIL |
| MAE | 0.580 | ≤ 1.0 | PASS |
| Within-1 agreement | 88% | — | Strong |
| Spearman rho | 0.509 | — | p=0.00016 |
| Parse rate | 100% | > 95% | PASS |
Kappa fell below the 0.6 threshold, indicating moderate (not substantial) agreement. However, three factors supported proceeding: (1) 88% of judgments were within 1 point of human labels, (2) Spearman correlation was significant (p<0.001), and (3) the primary analysis uses relative ranking rather than absolute scores, which is robust to systematic bias.
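The calibration statistics above follow standard formulas. The sketch below computes MAE, within-1 agreement, Spearman's rho, and a quadratic-weighted Cohen's kappa on made-up toy labels — not the study's 50 calibration samples; the actual protocol lives in scripts/rag_benchmark_v8/calibrate_judge.py.

```python
import numpy as np
from scipy.stats import spearmanr

def quadratic_weighted_kappa(a, b, n_classes=5):
    """Quadratic-weighted Cohen's kappa for 1..n_classes ordinal labels."""
    a = np.asarray(a, dtype=int) - 1  # map 1..5 -> 0..4
    b = np.asarray(b, dtype=int) - 1
    O = np.zeros((n_classes, n_classes))
    for i, j in zip(a, b):
        O[i, j] += 1
    O /= O.sum()                                 # observed agreement matrix
    E = np.outer(O.sum(axis=1), O.sum(axis=0))   # chance agreement matrix
    idx = np.arange(n_classes)
    W = (idx[:, None] - idx[None, :]) ** 2 / (n_classes - 1) ** 2
    return 1.0 - (W * O).sum() / (W * E).sum()

def calibration_report(human, judge):
    """Judge-vs-human agreement on a 1-5 Likert scale."""
    human = np.asarray(human, dtype=float)
    judge = np.asarray(judge, dtype=float)
    diff = np.abs(human - judge)
    rho, p = spearmanr(human, judge)
    return {
        "kappa": quadratic_weighted_kappa(human, judge),  # target: >= 0.6
        "mae": float(diff.mean()),                        # target: <= 1.0
        "within_1": float((diff <= 1).mean()),
        "spearman_rho": float(rho),
        "spearman_p": float(p),
    }

# toy labels for illustration — NOT the study's calibration set
human = [5, 4, 4, 3, 5, 2, 4, 3, 5, 4]
judge = [4, 4, 5, 3, 5, 3, 4, 2, 5, 4]
print(calibration_report(human, judge))
```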
Table T11: Combined Ranking (Composite + Answer Relevance)
| Config | Composite Rank | Composite | AR Rank | AR Mean | AR Std | Top-4? |
|---|---|---|---|---|---|---|
| topk_3 | 1 | 0.583 | 7 | 4.225 | 0.899 | Yes |
| maxtok_768 | 2 | 0.582 | 2 | 4.365 | 0.803 | Yes |
| temp_greedy | 3 | 0.580 | 3 | 4.310 | 0.792 | Yes |
| anchor_detailed | 4 | 0.579 | 6 | 4.246 | 0.844 | Yes |
| topk_7 | 5 | 0.574 | 4 | 4.281 | 0.786 | No |
| grounded_detailed | 6 | 0.571 | 5 | 4.265 | 1.020 | No |
| precise | 7 | 0.555 | 8 | 4.045 | 1.204 | No |
| structured | 8 | 0.552 | 1 | 4.440 | 0.831 | No |
The AR-Composite Tension
The most striking finding was structured: ranked dead last on composite (#8, score 0.552) but first on answer relevance (#1, AR 4.44). This suggests the structured template produces responses that users find helpful despite scoring poorly on embedding-based metrics. The tension is informative but does not change the recommendation: structured's composite deficit (d > 0.9 vs the top-4) is too large to set aside on the strength of a single LLM-judged dimension.
Top-4 Discrimination: AR Breaks the Tie
The central question: does AR discriminate within the top-4 cluster that composite cannot separate?
Overall Friedman test (all 8 configs): chi2 = 47.12, p = 5.3e-08. AR discriminates across the full set.
Top-4 Friedman test: chi2 = 12.41, p = 0.006. AR successfully discriminates within the top-4 cluster.
Top-4 Pairwise Comparisons (Wilcoxon + Holm)
| Pair | Mean Diff | Cohen’s d | p (Holm) | Significant? |
|---|---|---|---|---|
| maxtok_768 > anchor_detailed | +0.116 | 0.221 | 0.014 | Yes |
| maxtok_768 > topk_3 | +0.140 | 0.218 | 0.013 | Yes |
| maxtok_768 > temp_greedy | +0.055 | 0.102 | 0.305 | No |
| temp_greedy > anchor_detailed | +0.065 | 0.125 | 0.267 | No |
| temp_greedy > topk_3 | +0.085 | 0.131 | 0.291 | No |
| anchor_detailed > topk_3 | +0.025 | 0.038 | 0.566 | No |
maxtok_768 significantly outperforms both anchor_detailed (d=0.221, p=0.014) and topk_3 (d=0.218, p=0.013). These are Small effects, but they are the only statistically significant differences within a cluster that composite could not crack.
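The omnibus-then-pairwise procedure (Friedman test, then Wilcoxon signed-rank with Holm step-down correction) can be sketched with scipy. The paired scores below are random stand-ins, not the study's actual AR data.

```python
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

rng = np.random.default_rng(0)
# synthetic per-question AR scores for 4 configs over 150 paired cases
base = rng.normal(4.2, 0.8, size=150)
scores = {
    "maxtok_768":      np.clip(base + 0.15 + rng.normal(0, 0.3, 150), 1, 5),
    "temp_greedy":     np.clip(base + 0.05 + rng.normal(0, 0.3, 150), 1, 5),
    "anchor_detailed": np.clip(base + rng.normal(0, 0.3, 150), 1, 5),
    "topk_3":          np.clip(base - 0.05 + rng.normal(0, 0.3, 150), 1, 5),
}

# omnibus test across the four paired samples
chi2, p = friedmanchisquare(*scores.values())
print(f"Friedman chi2={chi2:.2f}, p={p:.4g}")

# pairwise Wilcoxon signed-rank tests with Holm step-down correction
names = list(scores)
pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
raw = [wilcoxon(scores[a], scores[b]).pvalue for a, b in pairs]
order = np.argsort(raw)
m = len(raw)
holm = [None] * m
running_max = 0.0
for rank, idx in enumerate(order):
    adj = min(1.0, (m - rank) * raw[idx])   # multiply by remaining tests
    running_max = max(running_max, adj)     # enforce monotonicity
    holm[idx] = running_max
for (a, b), p_adj in zip(pairs, holm):
    print(f"{a} vs {b}: Holm p={p_adj:.3f}")
```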




AR is Not a Proxy for Existing Metrics
Correlations between AR and automated metrics were modest:
| Metric | Spearman rho |
|---|---|
| Composite (continuous) | 0.331 |
| Adherence | 0.256 |
| Relevance | 0.244 |
| Utilization | 0.226 |
| Completeness | 0.088 |
Even the strongest correlation (rho=0.331, with composite) is modest, indicating that AR captures signal not present in the automated metrics. If AR were simply a noisy version of composite, one would expect rho > 0.7.
Construct Validity (v8): 1/2 Strict PASS
| Test | Prediction | Result | Status |
|---|---|---|---|
| precise < anchor (AR) | Constrained template → lower helpfulness | 4.045 < 4.246 (−0.201) | PASS |
| grounded < anchor (AR) | Grounded template → lower helpfulness | 4.265 > 4.246 (+0.019) | FAIL |
Test 2 failed by a negligible margin (+0.019, d=0.03). The grounded_detailed template performed slightly better than expected, possibly because its emphasis on sourcing improves perceived answer quality, though this was not directly tested.
Decisions Carried Forward
maxtok_768 is the final selection. It is the only configuration that ranks in the top-2 on both dimensions: composite rank #2 (0.582) and AR rank #2 (4.365). It significantly outperforms 2 of 3 other top-4 members on AR while maintaining equivalent composite performance.
7. Cross-Version Synthesis
Optimization Trajectory
The four versions trace a clear narrowing funnel:
v5: 4 techniques x 8 sweeps = 32 cells -> sentence_window selected
v6: 4 windows x 4 sweeps = 16 cells -> sw_1 + reranking=false
v7: 8 configs x 2 formulations = measurement reform -> top-4 cluster
v8: top-4 -> maxtok_768 (AR tiebreaker)

Each version reduced the search space by roughly 4× while adding measurement sophistication. The total optimization path evaluated 14,545 benchmark requests plus 1,598 LLM judge evaluations (16,143 total) across more than 40 unique configurations.
Metric Evolution
The evaluation framework evolved significantly through evidence-based decisions:
| Change | Version | Evidence | Impact |
|---|---|---|---|
| Completeness flagged as non-discriminating | v5 | ANOVA F=1.77, p=0.15, η²=0.009 | Flagged; retained after continuous reformulation restored discrimination |
| NLI faithfulness dropped | v7 | 88.8% ceiling (v5), 71.6% ceiling (v6) | Removed from composite |
| Binary → continuous formulation | v7 | 3 metrics with >66% ceiling; 0/2 validity → 2/2 | Full discrimination restored |
| Completeness discrimination restored | v7 | Continuous: Friedman χ²=785.22, η²=0.175 | Weight increased 0.10 → 0.20 |
| Composite weights redistributed | v7 | NLI removal, completeness reweighted | Stronger discrimination (F: 3.80 → 6.81) |
| Answer relevance added | v8 | Top-4 composite plateau (spread 0.004) | Broke tie, identified maxtok_768 |
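For illustration, the composite can be sketched as a weighted sum of the four continuous metrics. Only completeness's 0.20 weight is documented above; the other three weights below are placeholders chosen so the arithmetic lands near the reported composite, not the study's actual values.

```python
# ASSUMED weights — only completeness's 0.20 is documented in this report.
WEIGHTS = {
    "adherence": 0.35,     # placeholder
    "relevance": 0.25,     # placeholder
    "utilization": 0.20,   # placeholder
    "completeness": 0.20,  # documented: raised from 0.10 after v7 reformulation
}

def composite(metrics: dict) -> float:
    """Weighted sum of continuous metric scores."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
    return sum(WEIGHTS[k] * metrics[k] for k in WEIGHTS)

# maxtok_768's v7 continuous scores (from the Expected Performance table)
print(round(composite({
    "adherence": 0.608,
    "relevance": 0.581,
    "utilization": 0.768,
    "completeness": 0.354,
}), 3))
```

Under these placeholder weights the sum happens to land near the reported 0.582, but that is illustrative arithmetic, not a recovery of the real weight vector.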
Construct Validity Trend
| Version | Strict Tests | Pass | Rate | Notes |
|---|---|---|---|---|
| v5 | 5 | 4 | 80% | Reranking test failed (dataset-specific) |
| v6 | 4 | 4 | 100% | Best construct validity |
| v7 (binary) | 2 | 0 | 0% | Ceiling effects masked all signals |
| v7 (continuous) | 2 | 2 | 100% | Measurement reform recovered validity |
| v8 | 2 | 1 | 50% | Grounded template slightly exceeded expectation |
The v7 binary → continuous transition is the most dramatic result: the same data, same configurations, same predictions went from 0% to 100% validity solely through metric reformulation. This is strong evidence that metric design is at least as important as pipeline design.
Diminishing Returns
| Transition | Composite Change | AR Change | Effort |
|---|---|---|---|
| v5 → v6 | +0.041 (reranking=false) | — | 5,597 requests |
| v6 → v7 | −0.217 (scale change) | — | 1,598 + re-eval |
| v7 top-4 | 0.004 spread (plateau) | — | Identified ceiling as cause |
| v8 maxtok_768 | +0.001 vs topk_3 | +0.140 vs topk_3 | 1,598 judge calls |
The composite gains diminished rapidly: v6’s reranking discovery was the single largest improvement (+0.041). By v7, automated metrics had plateaued. v8’s contribution was not in composite improvement but in adding a new dimension (AR) that revealed maxtok_768 as measurably more helpful despite near-identical composite scores.
Key Findings Across All Versions
- Sentence_window_1 is the optimal chunking strategy for ExpertQA with this model. It maximizes fidelity (adherence, NLI) while maintaining adequate coverage.
- Reranking hurts on ExpertQA. Confirmed in both v5 and v6 across multiple window sizes. The cross-encoder filters out relevant chunks in this multi-domain dataset. This finding may not generalize to narrow-domain datasets.
- Binary threshold metrics have a hard ceiling problem. When most responses exceed a quality threshold, the metric saturates and can no longer discriminate. Continuous formulations are essential for high-performing systems.
- Composite metrics plateau before quality converges. The top-4 configs are equivalent on automated metrics but differ on AR. This suggests that automated metrics capture necessary but insufficient quality dimensions.
- max_tokens=768 is a sweet spot. More generation budget (768 vs the 512 default) produces measurably more helpful answers (AR d=0.221 vs anchor, p=0.014). The additional generation budget plausibly allows the model to elaborate rather than truncating, though this hypothesis was not directly tested.
- Temperature has minimal impact. temp_greedy (0.0) vs anchor_detailed (0.1) differ by just d=0.125 on AR and d=0.008 on composite. Even this small temperature gap produces negligible differences.
- Prompt template matters more than expected. The structured template produces the most user-helpful responses (AR 4.44) but worst composite (0.552). This tension between automated and human-aligned metrics deserves future investigation.
8. Final Recommendation
Table T12: Best Configuration Card
| Parameter | Value | Source Version | Evidence |
|---|---|---|---|
| Technique | sentence_window | v5 | Best coverage-fidelity tradeoff (d=0.64–0.73 Medium) |
| Window size | 1 | v6 | Composite 0.769 (rank #1); adherence 0.766 |
| Reranking | disabled | v5, v6 | +0.041 composite; confirmed across window sizes |
| Top_k | 5 | v6 | Flat region k=5–10; k=3 too restrictive |
| Retrieval | dense only | v5 | Hybrid: −0.309 NLI, severe fidelity loss |
| Prompt template | detailed | v6, v7 | +0.024 composite over default |
| Max tokens | 768 | v7, v8 | AR 4.365 (#2); composite 0.582 (#2) |
| Temperature | 0.1 | v7 | Default; negligible difference vs 0.0 |
| System prompt | /no_think | v7 | All 8 v7 configs used /no_think (workflow default) |
| Similarity threshold | 0.45 | v5 | ExpertQA-calibrated (mean sim 0.440) |
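The configuration card translates naturally into a config object for deployment. The key names below are illustrative — the pipeline's real parameter names may differ.

```python
# Recommended configuration (Table T12) as a plain dict.
# Key names are hypothetical; values come from the card above.
RECOMMENDED_CONFIG = {
    "technique": "sentence_window",
    "window_size": 1,
    "reranking": False,           # disabled: +0.041 composite on ExpertQA
    "top_k": 5,
    "retrieval": "dense",         # hybrid caused severe fidelity loss
    "prompt_template": "detailed",
    "max_tokens": 768,            # the v8 differentiator (vs 512 default)
    "temperature": 0.1,
    "system_prompt": "/no_think",
    "similarity_threshold": 0.45, # ExpertQA-calibrated
}

print(RECOMMENDED_CONFIG)
```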
Expected Performance
| Metric | Expected Value | Source |
|---|---|---|
| Composite (continuous) | 0.582 | v7 config_ranking_continuous.json |
| Adherence (continuous) | 0.608 | v7 config_ranking_continuous.json |
| Relevance (continuous) | 0.581 | v7 config_ranking_continuous.json |
| Utilization (continuous) | 0.768 | v7 config_ranking_continuous.json |
| Completeness | 0.354 | v7 config_ranking_continuous.json |
| Answer Relevance | 4.365 / 5.0 | v8 config_ranking.json |
| 95th pct latency | ~19–20s | v6 latency analysis |
| Failure rate | < 0.5% | v5–v6 observed rates |
Table T13: Finalist Comparison
| Config | Comp. Rank | Composite | AR Rank | AR Mean | Latency | Fail Rate | Key Differentiator |
|---|---|---|---|---|---|---|---|
| maxtok_768 | 2 | 0.582 | 2 | 4.365 | ~12.5s | <0.5% | +256 tokens → AR gain |
| topk_3 | 1 | 0.583 | 7 | 4.225 | ~12.5s | <0.5% | Best composite, weak AR |
| temp_greedy | 3 | 0.580 | 3 | 4.310 | ~12.5s | <0.5% | Deterministic; no sig. edge |
| anchor_detailed | 4 | 0.579 | 6 | 4.246 | ~12.5s | <0.5% | Baseline anchor config |
| structured | 8 | 0.552 | 1 | 4.440 | ~12.5s | <0.5% | Best AR but worst composite |
Latency estimates from v6 sw_1 main run (~12–13s mean generation time). Failure rates from v5–v6 observed rates across all sw_1 runs. Composite and AR from v7 re-evaluation and v8 judge scoring, respectively.
Operational Characteristics
- Latency: ~12.5s mean, ~19–20s p95 (v6 sw_1 measurements)
- Failure rate: <0.5% (v5–v6 observed across >13,000 requests)
- Throughput: ~0.175 req/s sustained (v5 measurement, single-instance)
- Resource requirements: Single RTX 4090, ~23 GiB peak VRAM (12.2 GiB model + KV cache + evaluation models)
Selection Rationale
Decision rule: Select the configuration that ranks in the top tier on automated composite metrics AND performs best on answer relevance, excluding any configuration with a large automated-metric deficit.
maxtok_768 is the recommended configuration under this rule:
- Composite: Rank #2 (0.582), within 0.001 of the best (topk_3, 0.583). The difference is not statistically significant (p > 0.8).
- Answer Relevance: Rank #2 (4.365), significantly better than anchor_detailed (#6, d=0.221, p=0.014) and topk_3 (#7, d=0.218, p=0.013).
- No other config ranks top-2 on both dimensions: topk_3 is #1 composite but #7 AR. temp_greedy is #3 on both. structured is #1 AR but #8 composite.
- max_tokens=768 is the differentiator: The only parameter change from the anchor config (512 tokens) is +256 max tokens. This extra generation budget produces measurably more helpful responses (d=0.221 on AR) at negligible composite cost.
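The decision rule reduces to a small filter over the rank data from Tables T11 and T13. The sketch below uses a tier width of 2, matching the "top-2 on both dimensions" criterion in the rationale.

```python
# (composite_rank, ar_rank) per config, from Tables T11/T13
ranks = {
    "topk_3": (1, 7),
    "maxtok_768": (2, 2),
    "temp_greedy": (3, 3),
    "anchor_detailed": (4, 6),
    "topk_7": (5, 4),
    "grounded_detailed": (6, 5),
    "precise": (7, 8),
    "structured": (8, 1),
}

# keep configs in the top tier on BOTH dimensions, then take best AR rank
TOP_TIER = 2
finalists = [c for c, (comp, ar) in ranks.items()
             if comp <= TOP_TIER and ar <= TOP_TIER]
winner = min(finalists, key=lambda c: ranks[c][1])
print(winner)  # maxtok_768 — the only config top-2 on both dimensions
```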
Caveats
- All results are specific to ExpertQA with Nemotron-9B-NVFP4. Generalization to other datasets or models requires validation.
- Answer relevance calibration showed moderate agreement (kappa=0.469). The relative ranking is more trustworthy than absolute scores.
- The optimization used one-factor-at-a-time (OFAT) sweeps, not factorial designs. Interaction effects may exist between parameters that were not explored.
9. Limitations
- Single dataset: All results are from ExpertQA only. Findings like “reranking hurts” may not generalize to narrow-domain datasets where v3 (covidqa) and v4 (techqa) showed reranking benefits.
- Single model: Nemotron-Nano-9B-v2-NVFP4 is a specific quantized model. Larger or differently quantized models may shift optimal configurations, particularly for max_tokens and prompt template effects.
- OFAT sweep design: Parameters were swept one at a time. Factorial designs would reveal interaction effects (e.g., window_size × reranking, temperature × max_tokens) but at exponentially higher cost.
- Moderate calibration kappa: The AR judge achieved kappa=0.469 (moderate), below the 0.6 threshold for substantial agreement. While 88% within-1 agreement and significant Spearman correlation support relative ranking reliability, absolute AR scores should be interpreted cautiously.
- Metric reformulation mid-study: Switching from binary to continuous metrics between v6 and v7 means composite scores are not directly comparable across that boundary. The composite scale shifted from ~0.77–0.81 (binary) to ~0.55–0.58 (continuous).
- No large-scale human evaluation: The AR judge is an automated proxy for human judgment. The 50-sample calibration set may not capture the full distribution of human preferences. A larger human evaluation (N > 200) would strengthen confidence in AR-based conclusions.
- Completeness metric history: Completeness showed zero discriminative power under binary formulation in v5 (F=1.77, p=0.15) and was flagged for removal. However, the v7 continuous reformulation restored strong discrimination (Friedman χ²=785.22, η²=0.175, the strongest of all metrics), justifying the weight increase from 0.10 to 0.20. The metric’s utility is formulation-dependent.
- Non-independent test cases: The same 150 test cases were used across all versions and configurations. While paired analysis accounts for within-subject correlation, there is no guarantee these 150 cases represent the full distribution of real-world queries. Performance on unseen questions may differ.
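To make the OFAT-vs-factorial cost gap concrete, a toy sketch with illustrative parameter grids (not the study's exact sweep values):

```python
from itertools import product

# hypothetical parameter grids — sizes chosen for illustration only
grid = {
    "window_size": [1, 3, 5, 12],
    "reranking": [True, False],
    "top_k": [3, 5, 7],
    "max_tokens": [512, 768],
    "temperature": [0.0, 0.1],
}

# OFAT: one anchor run plus each single-parameter variant
ofat_cells = 1 + sum(len(v) - 1 for v in grid.values())
# full factorial: every combination of every parameter
factorial_cells = len(list(product(*grid.values())))
print(ofat_cells, factorial_cells)  # -> 9 96
```

Even this small grid inflates roughly tenfold under a full factorial; interaction effects come at that multiplicative cost.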
10. Artifact Index
Analysis Reports
| File | Version | Content |
|---|---|---|
| results/v5/v5_session1_analysis/session_1_analysis.md | v5 | Full statistical analysis (77 PNGs, 23 JSONs) |
| results/v6/v6_session1_analysis/session_1_analysis.md | v6 | Window size analysis (78 artifacts) |
| results/v7/v7_reevaluation_analysis/reevaluation_analysis.md | v7 | Binary vs continuous comparison |
| results/v8/v8_judge_analysis/judge_analysis.md | v8 | Answer relevance analysis |
Key Data Files
| File | Version | Content |
|---|---|---|
| results/v7/.../config_ranking_continuous.json | v7 | Continuous composite rankings (8 configs) |
| results/v7/.../formulation_comparison.json | v7 | Ceiling rates and F-statistics |
| results/v8/.../config_ranking.json | v8 | AR rankings (8 configs) |
| results/v8/.../top4_discrimination.json | v8 | Friedman + pairwise tests |
| results/v8/.../construct_validity.json | v8 | 4 construct validity tests |
| results/v8/.../metric_correlations.json | v8 | AR vs automated metric correlations |
| results/v8/.../discrimination.json | v8 | Overall Friedman (all 8 configs) |
Design Documentation
| File | Version | Content |
|---|---|---|
| documentation/260311_01_.../design_analysis.md | v5 | Approved design (4 techniques, 8 sweeps) |
| documentation/260313_01_.../design_analysis.md | v6 | Approved design (4 window sizes, 4 sweeps) |
| documentation/260314_01_.../design_analysis.md | v7 | Re-evaluation methodology |
| documentation/260315_01_.../design_analysis.md | v8 | AR judge design |
Appendix A: Answer Relevance Judge Protocol
Judge Model
- Model: Qwen3.5-9B (BF16 precision)
- Inference engine: vLLM with enforce_eager=True, max_model_len=4096, gpu_memory_utilization=0.85
- Temperature: 0.0 (deterministic scoring)
- Max output tokens: 1024
- Thinking mode: Disabled (enable_thinking: False)
Scoring Scale
| Score | Anchor |
|---|---|
| 1 | Completely unhelpful: response is irrelevant, incoherent, or empty |
| 2 | Mostly unhelpful: response touches the topic but fails to address the question |
| 3 | Partially helpful: response addresses some aspects but misses key parts or is vague |
| 4 | Mostly helpful: response addresses the question well with minor gaps or unnecessary content |
| 5 | Fully helpful: response directly and completely addresses the question using the provided context |
Evaluation Criteria
- Does the response directly answer the user’s question?
- Does the response use the provided context effectively?
- Is the response appropriately detailed (not too brief, not excessively verbose)?
- Would a domain expert find this response useful?
Additional scoring rules instruct the judge not to reward verbosity, to use the full 1–5 scale (reserving 5 for truly excellent responses), and to score 3 for on-topic but gapped responses.
Prompt and Rubric
The exact judge prompt is defined in scripts/rag_benchmark_v8/config.py (JUDGE_USER_TEMPLATE). The system prompt instructs the model to act as an expert answer quality judge and output valid JSON only. The user prompt provides the question, retrieved context chunks (numbered), and the response to evaluate, followed by the rubric and scoring rules.
Output format: {"score": <1-5>, "reasoning": "<brief justification>"}. Parsing uses strict JSON first, then regex fallback. Parse rate: 100% across all 1,598 responses.
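The two-stage parsing strategy can be sketched as follows; this is a minimal stand-in for illustration, not the production parser in scripts/rag_benchmark_v8.

```python
import json
import re

def parse_judge_output(text: str):
    """Extract a 1-5 score: strict JSON first, then a regex fallback."""
    try:
        obj = json.loads(text)
        score = int(obj["score"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        # fallback: pull "score": N out of surrounding chatter
        m = re.search(r'"score"\s*:\s*([1-5])', text)
        if not m:
            return None
        score = int(m.group(1))
    return score if 1 <= score <= 5 else None

print(parse_judge_output('{"score": 4, "reasoning": "mostly helpful"}'))  # 4
print(parse_judge_output('Sure! {"score": 5, "reasoning": "..."}'))       # 5
```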
Blinding and Ordering
- Blinding: The judge sees question, context chunks, and response only: no configuration identity, run name, or parameter values are provided.
- Presentation order: Responses are scored sequentially in CSV row order (one response per judge call). No within-response randomization is applicable since each call evaluates a single response.
- Position bias: Not applicable in this protocol (single-response evaluation, not pairwise comparison).
Calibration
50 human-labeled samples scored by the judge before full deployment. Results: kappa=0.469 (moderate), MAE=0.580, within-1 agreement=88%, Spearman rho=0.509. See scripts/rag_benchmark_v8/calibrate_judge.py for the calibration protocol.
Known Biases and Limitations
- Verbosity bias: Possible but not measured. Longer responses may receive higher scores from the judge regardless of quality. Anti-verbosity instructions are included in the prompt, but their effectiveness was not independently validated.
- Rubric iteration: The rubric was revised once after initial calibration (iteration 1 showed judge bias toward score 5). The deployed rubric includes explicit anti-inflation instructions.
- Kappa below threshold: Weighted kappa (0.469) fell below the 0.6 target for substantial agreement. Relative ranking is more robust than absolute scores.
© 2026 RCTK. All rights reserved. This study may not be reproduced or distributed without prior written permission.