Recent Study
Reclaiming AI Document Search Quality Through Configuration Testing And Parameter Sweeps
This report documents the systematic optimization of a Retrieval-Augmented Generation (RAG) pipeline through four rounds of evidence-based testing: 16,143 evaluations narrowing 40+ configurations to a single recommendation.
The Challenge
When organizations use AI to answer questions from their own documents—contracts, research papers, technical manuals, policy guides—the quality of those answers depends on dozens of configuration decisions. How should documents be split into searchable pieces? How many pieces should the system retrieve? How should they be ranked? How should the AI be instructed to formulate its response?
These choices directly affect whether answers are reliable enough to trust in real work. When the configuration is wrong, the consequences are concrete: unreliable answers, teams that revert to manual search, and support escalations that the system was supposed to prevent. Yet most teams either accept default settings or make these choices based on intuition, and without testing, there is no way to know how much quality is being left on the table.
I took a different approach: systematic, evidence-based testing. I designed a series of experiments to measure the impact of every major configuration decision, compare the alternatives head-to-head, and follow the evidence to a final recommendation.
To stress-test the configurations, I used a deliberately challenging multi-domain benchmark: expert-level questions spanning medicine, law, science, and technology. These are the kinds of complex, cross-domain queries that expose weaknesses in a document search system, providing stronger evidence than testing on routine questions alone.
The result: four rounds of progressively focused testing (over 16,000 evaluations conducted over 5 days) that narrowed more than 40 configurations down to one.
Why This Kind of Testing Matters
Most AI document search systems are deployed with settings that feel plausible but have never been tested against alternatives. Over four rounds, evidence-based testing overturned conventional assumptions, caught a silent failure in the evaluation system itself, and identified a configuration that no amount of intuition would have reached.
When leadership asks “why this configuration?”, every choice traces back to a specific round of testing. When a component turns out to hurt rather than help, it is caught before deployment. When the measurement system itself stops working, that is caught too.
The Approach: Test, Measure, Narrow, Repeat
The evaluation pipeline had been validated in prior rounds of internal testing on different datasets before this benchmark began. With tested infrastructure in hand, I designed a four-round optimization funnel. Each round answered one specific question, and its findings constrained the next:
- Round 1— Which document-splitting method produces the best answers? Four methods tested across 7,350 evaluations.
- Round 2— How should the selected method be fine-tuned? Twenty-eight parameter combinations tested across 5,600 evaluations.
- Round 3— Are the measurements trustworthy? A measurement audit that re-scored 1,600 answers and revealed a critical flaw in the scoring system.
- Round 4— Which configuration produces the most helpful answers? An independent AI judge, calibrated against human ratings, scored 1,600 answers for real-world usefulness.
Each round eliminated weaker options and sharpened the focus for the next:
Round 1: 40+ configurations tested
Round 2: Selected method fine-tuned
Round 3: Measurements audited and fixed
Round 4: Best configuration identifiedThe methodology (select, optimize, audit, validate) is designed to be reapplied on other AI pipelines, though the specific results must be validated on the target dataset and use case.
Round 1 — Finding the Right Foundation
The first and most impactful decision in any document search system is how to break documents into searchable pieces. A long technical manual or research paper must be divided into smaller segments that the system can retrieve individually. How those segments are created has a cascading effect on every answer the system produces.
I tested four fundamentally different splitting methods across 7,350 question-answer evaluations, measuring each on fidelity to source material, question coverage, and effective use of retrieved information.
No single method dominated every measure. One led on source fidelity; another led on question coverage and information usage. The selected foundation was sentence-window retrieval—an approach that retrieves a targeted passage along with its immediate neighbors, balancing precision with enough surrounding context to answer well. It offered the strongest overall balance across quality, operational stability, and tunability for further optimization.

This round also uncovered a counterintuitive finding: a commonly recommended processing step called “re-ranking,” a second-pass filter intended to improve retrieval quality, actually made answers worse on this dataset. I flagged it for confirmation in the next round.
Round 2 — Fine-Tuning the Selected Method
With sentence-window retrieval as the foundation, Round 2 tested variations of its key parameters across 5,600 evaluations. The most important was “window size”—how much surrounding context to include around each retrieved sentence, ranging from minimal (just the neighboring sentences) to broad (several sentences in each direction).
The smallest window—just the immediate neighbors—produced the highest quality. Larger windows diluted the answer with less relevant material.

This round also confirmed the earlier finding about re-ranking. Across multiple parameter combinations, disabling re-ranking consistently improved quality. Two independent rounds of testing, thousands of evaluations each, reached the same conclusion. Without testing, this step would have shipped by default—adding latency while reducing answer quality.
I also tested how different instructions affect the AI’s responses. A detailed template—one that asked the AI to provide thorough, well-sourced answers—outperformed the default, and became part of the recommended configuration.
Round 3 — When the Measurements Stopped Working
After two rounds of optimization, I had a strong configuration with validated parameter choices. The next step was testing how the AI formulates its responses: how creative versus deterministic it should be, response length limits, and prompt variations. I tested eight configurations.
The top four scored almost identically. Their quality scores differed by less than two-tenths of a percent. It looked like a plateau.
I could have stopped here. Instead, I investigated why the scores were so similar.
Three of the five quality measurements had “ceiling effects.” Imagine a thermometer that only goes up to 100 degrees being used to measure temperatures of 105 or 110—everything above 100 reads the same. The majority of answers were scoring at the maximum on several metrics. In one metric, 87% of all answers received the maximum score.

The fix was redesigning the scoring from simple pass/fail (did the answer exceed a threshold?) to graduated measurement (how far above or below?). Quality checks that had been failing under the old system passed under the new one. The old measurements were hiding real differences.
The most important output of Round 3 was not a better configuration—it was a better way of measuring. When the evidence stops making sense, fix the measurement first.
Round 4 — Adding a Helpfulness Lens
With improved measurements, the top four configurations still scored within a narrow band on automated quality metrics. The scores were no longer artificially compressed, but the configurations genuinely produced answers of similar measurable quality.
Automated metrics capture whether an answer is technically correct and well-sourced. They do not directly measure whether a person would find it helpful. A technically correct answer that is poorly organized or unnecessarily terse might score well on automated metrics while frustrating an actual user.
To add a more user-centered signal, I deployed an independent AI model as a quality judge—a separate system with no knowledge of which configuration produced which answer. This judge evaluated all 1,600 answers on a 1-to-5 helpfulness scale: Does the answer directly address the question? Does it use the provided information effectively? Is it appropriately detailed? Would a domain expert find it useful?
Before scoring at scale, I calibrated the judge against human ratings on a sample of 50 answers. The correlation was statistically significant, and 88% of the judge’s scores fell within one point of the corresponding human rating. This did not replace full-scale human evaluation, but it provided a calibrated supplementary signal, sufficient for distinguishing between configurations that automated metrics could not separate.

The judge found real differences within the top four. The strongest configuration gave the AI more room to answer fully (50% more than the default), which was associated with more complete and thorough responses. This single change produced measurably more helpful answers.
The Result
After 16,143 evaluations across four rounds, one configuration emerged as the best overall recommendation. Every design choice traces back to a specific round of evidence:
- Document splitting: Sentence-window retrieval with minimal surrounding context, providing focused retrieval without dilution (Round 1, confirmed Round 2)
- Re-ranking: Disabled; the conventional second-pass filter reduced answer quality on this content (Round 1, confirmed Round 2)
- Response instructions: A detailed template asking for thorough, well-sourced answers (Round 2)
- Response length: 50% more room to write than the default, associated with more helpful answers (Round 4)
- Retrieved context: The 5 most relevant document sections per question (Round 2)
This configuration achieved:
- 4.4 out of 5.0 helpfulness rating from the independent AI judge
- Less than 0.5% failure rate across over 13,000 evaluation runs
- The only configuration to rank in the top 2 on both automated quality metrics and AI-judged helpfulness
What This Means for a Deployment
The specific settings above are validated for this benchmark and model. Different content, models, or deployment contexts may shift the optimal configuration—earlier testing on single-domain medical and technical corpora produced different optimal settings, including cases where re-ranking helped. What follows are the broader patterns.
The re-ranking step, a commonly recommended component, would have shipped by default. Testing showed it reduced answer quality on this content. Two independent rounds confirmed it. That is the kind of mistake that structured evaluation catches before production.
The scoring system appeared to show four equivalent configurations. Investigation revealed it was failing to measure real differences. Without the audit, the final recommendation would have been arbitrary.
Every component in the final configuration earned its place through evidence. Components that did not—larger context windows, default response limits—were removed or changed. The result is a simpler system, because complexity that did not improve answers was cut.
The four-round funnel itself (select, optimize, audit, validate) transfers to other pipelines. It is designed to produce recommendations specific to the target content, not to assume one answer fits all.
Part II
Technical Detail
The narrative above intentionally simplified the underlying methodology. The following section contains the complete statistical framework, version-by-version analysis, and evidence supporting every finding and recommendation in Part I.
Dataset: ExpertQA | Model: Nemotron-Nano-9B-v2-NVFP4 | Embedding: Qwen3-Embedding-0.6B
1. Executive Summary
Four iterations of RAG benchmark optimization (v5–v8) were conducted on the ExpertQA dataset using a 6-step retrieval-augmented generation pipeline powered by Nemotron-Nano-9B-v2-NVFP4 via vLLM. The study processed 14,545 benchmark requests plus 1,598 LLM judge evaluations over 5 days, progressively narrowing the configuration search space through a decision funnel: technique selection (v5), parameter optimization (v6), measurement reform (v7), and helpfulness assessment (v8). Each version answered a specific question, and its findings constrained the next iteration’s search space. The final recommendation is maxtok_768, a sentence_window_1 configuration with max_tokens=768, temperature=0.1, top_k=5, reranking disabled, and a detailed prompt template—which achieves the best balance of automated metric quality (composite rank #2, score 0.5823) and human-perceived helpfulness (answer relevance rank #2, score 4.365/5.0).
Table T1: Version Overview
| Version | Question Asked | Requests | Configs | Duration | Key Outcome |
|---|---|---|---|---|---|
| v5 | Which chunking technique is best? | 7,350 | 4 techniques + 8 sweeps | ~6.3h | sentence_window dominates; all 4 techniques KEEP |
| v6 | Which window size and parameters? | 5,597 | 4 window sizes + 3 sweeps | ~9.2h | sw_1 best (0.769); reranking=false +0.041; 6/6 validity |
| v7 | Are binary metrics hiding differences? | 1,598 | 8 generation configs | reeval | Yes—ceilings at 22.8/86.7/80.7% eliminated; top-4 plateau at 0.579–0.583 |
| v8 | Can AR break the top-4 tie? | 1,598 | 8 generation configs | ~16min | Yes—Friedman p=0.006; maxtok_768 wins (AR 4.365) |
Table T2: Decision Funnel
| Version | Search Space | Decision | Carried Forward |
|---|---|---|---|
| v5 | 4 techniques × 8 sweeps | sentence_window best coverage-fidelity tradeoff | Technique = sentence_window |
| v6 | 4 window sizes + 3 sweeps | sw_1 + reranking=false = 0.810 composite | Window = 1, reranking = false |
| v7 | 8 generation configs | Binary metrics plateau; continuous reformulation | Continuous metrics; top-4 cluster identified |
| v8 | Top-4 cluster | maxtok_768 > anchor_detailed (d=0.221), > topk_3 (d=0.218) | maxtok_768 = recommended |
Final recommendation: The maxtok_768 configuration is the recommended operating point across both automated composite metrics and LLM-judged answer relevance. See Section 8 for the complete configuration card.
2. Methodology
2.1 Pipeline Architecture
All four versions share a common 6-step RAG pipeline:
- Validate Input — Schema checks on question + ground truth
- Retrieve — Dense retrieval via ChromaDB (Qwen3-Embedding-0.6B, 1024-dim) with optional BM25 hybrid fusion
- Rerank — Cross-encoder re-ranking (bge-reranker-v2-m3) with configurable enable/disable
- Generate — LLM answer generation via Nemotron-Nano-9B-v2-NVFP4 (vLLM, Marlin backend, RTX 4090)
- Evaluate — Automated metrics computed against ground-truth answers and retrieved contexts
- Compare — Pairwise statistical analysis across configurations
Dataset: ExpertQA: 150 stratified test cases drawn from a multi-domain expert-sourced QA corpus (medicine, law, science, technology, etc.). The same 150 test cases were reused across all versions for comparability.
Infrastructure: AI Manager service orchestrating Docker containers. Single Nemotron-9B NVFP4 instance (~12.2 GiB model, ~23 GiB peak VRAM). PostgreSQL + Redis backing stores. All runs on a single RTX 4090 (24 GiB).
Embedding model: Qwen3-Embedding-0.6B with asymmetric query/document encoding. Chunked documents stored in ChromaDB collections with per-technique indexing.
2.2 Evaluation Metrics
Table T3: Metric Definitions
| Metric | Binary Formulation (v5–v6) | Continuous Formulation (v7–v8) | Range |
|---|---|---|---|
| Adherence | Cosine similarity of answer-chunk pairs > threshold (0.50) | Mean cosine similarity across all answer-chunk sentence pairs | [0, 1] |
| Relevance | Cosine similarity of answer-question > threshold (0.55) | Cosine similarity between full answer and question embeddings | [0, 1] |
| Utilization | Fraction of retrieved chunks with similarity > threshold (0.60) | Mean cosine similarity between answer and each retrieved chunk | [0, 1] |
| Completeness | Proportion of ground-truth sentences covered | Proportion of ground-truth sentences with cosine sim > threshold | [0, 1] |
| NLI Faithfulness | Fraction of answer sentences entailed by context (DeBERTa NLI) | (dropped in v7) | [0, 1] |
| Answer Relevance | (not used) | LLM-judged helpfulness score (Qwen3.5-9B, 1–5 Likert) | [1, 5] |
NLI Faithfulness was dropped after v6 due to an 88.8% ceiling effect and independent evaluation model concerns. Answer Relevance was added in v8 as a holistic helpfulness measure orthogonal to component metrics.
2.3 Composite Score Design
The composite score evolved across versions as metric quality findings accumulated:
Table T4: Composite Weight Evolution
| Metric | v5–v6 Weights | v7 Weights (original) | v7 Weights (reeval) | v8 |
|---|---|---|---|---|
| Adherence | 0.30 | 0.30 | 0.35 | 0.35 |
| Relevance | 0.25 | 0.25 | 0.25 | 0.25 |
| NLI Faithfulness | 0.20 | 0.20 | — (dropped) | — |
| Utilization | 0.15 | 0.15 | 0.20 | 0.20 |
| Completeness | 0.10 | 0.10 | 0.20 | 0.20 |
Key changes:
- v7 re-evaluation dropped NLI (ceiling) and redistributed its 0.20 weight to utilization (+0.05) and completeness (+0.10)
- The v7 re-evaluation composite (template_v2_no_nli) was selected for strongest discrimination (F=6.81 vs F=3.80 for original weights)
2.4 Statistical Framework
All versions use a consistent statistical framework:
- Paired tests: Wilcoxon signed-rank (non-parametric) for pairwise comparisons on the same test cases
- Multiple comparison correction: Holm-Bonferroni step-down procedure
- Effect sizes: Cohen’s d with benchmarks: |d| < 0.2 = Negligible, 0.2–0.5 = Small, 0.5–0.8 = Medium, > 0.8 = Large
- Omnibus tests: Friedman test (non-parametric repeated measures) for rank-based group differences; one-way ANOVA as parametric complement
- Confidence intervals: 10,000-iteration bootstrap (BCa) for technique means
- Bayesian evidence: BIC-based Wagenmakers (2007) Bayes factors
- Construct validity: Pre-registered directional predictions tested as strict (must hold) or informational (expected but not required)
- Discrimination power: ANOVA F-statistics and eta-squared (η²) for between-technique variance
- Ceiling/floor analysis: Percentage of observations at metric bounds; >50% classified as “Unacceptable”
Note on effect sizes: Cohen’s d is computed on raw paired score differences (not rank-transformed), which is standard practice when the underlying continuous metrics have interval-scale properties. While rank-biserial correlation would be the natural nonparametric companion for Wilcoxon signed-rank tests, Cohen’s d was chosen for interpretability and comparability with power analysis conventions. Kendall’s W was not reported alongside Friedman tests but would be a useful addition in future iterations.
3. v5 — Technique Selection
Question: Which chunking technique produces the best RAG responses on ExpertQA?
Design: 4 techniques (semantic, sentence_400, recursive_400, sentence_window_3) evaluated at default parameters with 150 samples each (600 main comparisons), plus 8 parameter sweeps (chunk_size, prompt_template, final_top_k, similarity_threshold, reranking, retrieval_method, adherence_support, chunk_size for sentence_window). 7,350 completed requests, 99.0% success rate. Runtime ~6.3 hours.
Table T5: Technique Ranking (v5)
| Rank | Technique | Adherence | Relevance | Utilization | NLI Faith. | Completeness | Best At |
|---|---|---|---|---|---|---|---|
| 1 | semantic | 0.733 | 0.589 | 0.646 | 0.994 | 0.256 | Fidelity |
| 2 | sentence_window_3 | 0.688 | 0.808 | 0.836 | 0.908 | 0.235 | Coverage |
| 3 | sentence_400 | 0.674 | 0.571 | 0.573 | 0.975 | 0.261 | Completeness |
| 4 | recursive_400 | 0.634 | 0.524 | 0.553 | 0.988 | 0.259 | — |
The results revealed a fundamental fidelity-coverage tradeoff: semantic excelled at adherence (0.733) and NLI faithfulness (0.994), while sentence_window_3 led on relevance (0.808) and utilization (0.836). The gap was substantial, with Medium effect sizes (d=0.64–0.73) on coverage metrics.
Table T6: Key Sweep Findings (v5)
| Parameter | Finding | Effect |
|---|---|---|
| Reranking | Enabled 0.682 adherence vs disabled 0.701 | Reranking hurts on ExpertQA (negative) |
| Hybrid retrieval | Dense: adh 0.682, NLI 0.966; Hybrid: adh 0.602, NLI 0.657 | Hybrid severely degrades fidelity (−0.309 NLI) |
| Chunk size (sentence) | 200 > 400 > 600 on adherence (0.722 > 0.674 > 0.659) | Smaller consistently better |
| Window size (sw) | sw_1: adh 0.753, NLI 0.982; sw_3: rel 0.808, util 0.836 | Window=1 for fidelity, =3 for coverage |
| Top_k | 3→15: relevance 0.688→0.468; completeness 0.243→0.262 | top_k=5 is balanced |
| Prompt template | Detailed > default for adherence on some techniques | Small effect, worth exploring |
| Similarity threshold | 0.40→0.55: adherence 0.829→0.550 | Impactful but tradeoff with recall |






Construct Validity (v5): 4/5 Strict PASS
| Test | Prediction | Result | Status |
|---|---|---|---|
| top_k increases → relevance decreases | Monotonic decrease | 0.688 → 0.468 | PASS |
| top_k increases → completeness increases | Monotonic increase | 0.243 → 0.262 | PASS |
| Reranking improves adherence | Enabled > disabled | 0.682 < 0.701 | FAIL |
| Smaller chunks → higher relevance | 200 > 400 > 600 | 0.570 > 0.548 > 0.527 | PASS |
| Similarity threshold → adherence decreases | Monotonic decrease | 0.829 → 0.550 | PASS |
Test 3 failed because reranking hurts on ExpertQA, unlike techqa/covidqa. This was an important dataset-specific finding carried forward.
Metric Quality Alerts
- NLI Faithfulness: 88.8% ceiling — unreliable as discriminator (recommended drop in v6+)
- Completeness: ANOVA F=1.77, p=0.15 — zero discriminative power; eta-squared=0.009 (negligible)
- Relevance: 37.3% ceiling (sentence_window_3 at 67.3%) — marginal
Decisions Carried Forward
- Technique = sentence_window— selected over semantic (adherence leader, 0.733) because sentence_window provides the best coverage-fidelity balance, is fastest (13.1s vs 14.8s mean), has the lowest failure rate (0.3% vs 0.8%), and its window_size parameter enables further optimization in v6. All 4 techniques exceeded the 0.60 adherence threshold and were retained as viable candidates; the v6 carry-forward decision was which to prioritize.
- Investigate window_size=1 — highest adherence (0.753) and NLI (0.982) among all configs
- Reranking = investigate further — counterintuitive negative effect needs confirmation
- Flag completeness and NLI as non-discriminating — under binary formulation, neither metric discriminates between techniques; reassess after measurement reform
- Dense retrieval only — hybrid severely degraded fidelity
4. v6 — Window Size and Parameter Optimization
Question: Which sentence_window size and parameter combination maximizes composite quality?
Design: 4 window sizes (1, 2, 3, 4) at 200 samples each, plus 3 parameter sweeps (final_top_k, reranking_enabled, prompt_template). 5,597 completed requests out of 5,617 submitted (99.6% completion rate; 5 failures, 15 other non-completed). Runtime ~9.2 hours across 4 runs (1 main + 3 sweeps).
Table T7: Window Size Ranking (v6)
| Rank | Window Size | Composite | Adherence | Relevance | NLI Faith. | Utilization | Completeness |
|---|---|---|---|---|---|---|---|
| 1 | sw_1 | 0.769 | 0.766 | 0.784 | 0.979 | 0.818 | 0.244 |
| 2 | sw_2 | 0.759 | 0.739 | 0.803 | 0.932 | 0.837 | 0.241 |
| 3 | sw_3 | 0.746 | 0.691 | 0.817 | 0.902 | 0.868 | 0.234 |
| 4 | sw_4 | 0.722 | 0.662 | 0.807 | 0.837 | 0.872 | 0.230 |
Window_size=1 won the composite ranking through leading fidelity metrics (adherence, NLI). The fidelity-coverage tradeoff from v5 repeated at finer granularity: sw_1 led on adherence (+0.104 vs sw_4, d=0.42) while sw_3/sw_4 led on utilization (+0.054, not significant).
Table T8: Best Observed Configurations (v6)
| Configuration | Composite | Adherence | Relevance | NLI Faith. | Utilization | CI (95%) |
|---|---|---|---|---|---|---|
| sw_1 + reranking=false | 0.810 | 0.785 | 0.912 | 0.977 | 0.849 | [0.792, 0.827] |
| sw_1 + template=detailed | 0.793 | 0.795 | 0.830 | 0.981 | 0.829 | — |
| sw_1 + default | 0.769 | 0.766 | 0.784 | 0.979 | 0.818 | — |
| sw_1 + reranking=true | 0.769 | 0.766 | 0.784 | 0.979 | 0.818 | — |
Disabling reranking improved composite by +0.041, confirming v5’s counterintuitive finding. The effect was driven primarily by relevance (0.784 → 0.912, +0.128) and utilization (0.818 → 0.849). A plausible mechanism is that the cross-encoder filtered out chunks that were topically relevant but not the closest match—on ExpertQA’s diverse multi-domain content, this filtering may have been counterproductive.
Sweep Details
Top_k sweep (k = 3, 5, 7, 10): Composite flat across k=5–10 (0.749–0.755). k=3 dropped to 0.722. Decision: k=5 retained (default, balanced).
Reranking sweep: Disabled significantly better (+0.041 composite). Bayesian evidence for the overall window-size adherence gradient: BF10=308.6 (extreme) for sw_1 vs sw_4.
Template sweep: Detailed template +0.024 composite over default. Chain-of-thought comparable but 2.1× slower (~29s vs ~14s mean generation time).





Construct Validity (v6): 6/6 PASS (Highest of Any Version)
| Test | Prediction | Result | Status |
|---|---|---|---|
| top_k increases → relevance decreases | Monotonic decrease | 0.822 → 0.751 | PASS |
| top_k increases → completeness increases | Monotonic increase | 0.224 → 0.253 | PASS |
| Window size increases → NLI decreases | Monotonic decrease | 0.979 → 0.837 | PASS |
| Window size increases → adherence decreases | Monotonic decrease | 0.766 → 0.662 | PASS |
| Reranking impacts relevance (info) | Directional | Confirmed | PASS |
| Larger window → higher utilization (info) | Directional | Confirmed | PASS |
Ceiling Problem Foreshadowing
Despite 6/6 construct validity (the highest of any version), v6 revealed a severe measurement problem:
| Metric | Ceiling % | Assessment |
|---|---|---|
| NLI Faithfulness | 71.6% | Unacceptable |
| Relevance | 66.9% | Unacceptable |
| Utilization | 66.8% | Unacceptable |
| Adherence | 29.6% | Acceptable |
| Completeness | 0.0% | Ideal |
Three of five metrics had unacceptable ceiling effects (>50% of observations at maximum). This meant the composite score was increasingly driven by the few metrics with remaining variance, raising the question: is this optimizing metrics or actual quality?
Decisions Carried Forward
- Window size = 1 — best composite, fidelity leader
- Reranking = disabled — confirmed negative effect
- Template = detailed — modest gain, retained
- top_k = 5 — flat region, no need to change
- Address ceiling effects — measurement reform needed before trusting generation-level optimization
5. v7 — Measurement Reform
Question: Are binary metrics hiding meaningful differences between generation configurations?
Design: 8 named generation configurations varying prompt_template (detailed, grounded_detailed, precise, structured), temperature (0.0 vs 0.1 default), max_tokens (768 vs 512 default), and top_k (3, 5, 7). All used sentence_window_1 with system_prompt=/no_think and other defaults from v6. 1,598 completed requests (200 per config, minus 2 missing). The same responses were scored twice: once with original binary metrics, once with continuous reformulations using an independent evaluation model.
The 8 Configurations
All configs share: system_prompt=/no_think, reranking=false, dense retrieval, similarity_threshold=0.45.
| Config | Temperature | Max Tokens | Top_k | Template | What it tests |
|---|---|---|---|---|---|
| anchor_detailed | 0.1 | 512 | 5 | detailed | Baseline (anchor) |
| maxtok_768 | 0.1 | 768 | 5 | detailed | More generation budget |
| topk_3 | 0.1 | 512 | 3 | detailed | Fewer, higher-quality chunks |
| topk_7 | 0.1 | 512 | 7 | detailed | More context coverage |
| temp_greedy | 0.0 | 512 | 5 | detailed | Deterministic decoding |
| grounded_detailed | 0.1 | 512 | 5 | grounded_detailed | Grounding-focused template |
| precise | 0.1 | 512 | 5 | precise | Adherence-maximizing template |
| structured | 0.1 | 512 | 5 | structured | Structured output format |
The Plateau Discovery
With binary metrics, the top-4 configurations (topk_3, maxtok_768, temp_greedy, anchor_detailed) scored 0.830–0.832 composite — a spread of just 0.0017. The ranking appeared converged: further optimization seemed pointless.
But investigation revealed this convergence was an artifact of metric ceilings, not genuine equivalence.
Table T9: Binary vs Continuous Comparison
| Metric | Binary Mean | Binary Ceiling | Continuous Mean | Continuous Ceiling | Change |
|---|---|---|---|---|---|
| Adherence | 0.755 | 22.8% | 0.616 | 0.0% | Ceiling eliminated |
| Relevance | 0.912 | 86.7% | 0.581 | 0.0% | Ceiling eliminated |
| Utilization | 0.927 | 80.7% | 0.753 | 0.0% | Ceiling eliminated |
The binary relevance metric had 86.7% of all observations at the ceiling (1.0). Any configuration that exceeded the threshold got a perfect score, erasing all differences above that line. The continuous formulation, using raw cosine similarity instead of threshold-based binary classification, restored the full measurement range.
Table T10: Configuration Rankings — Binary vs Continuous
| Config | Binary Rank | Binary Composite | Continuous Rank | Continuous Composite | Rank Change |
|---|---|---|---|---|---|
| topk_3 | 1 | 0.832 | 1 | 0.583 | — |
| maxtok_768 | 2 | 0.832 | 2 | 0.582 | — |
| temp_greedy | 3 | 0.831 | 3 | 0.580 | — |
| anchor_detailed | 4 | 0.830 | 4 | 0.579 | — |
| grounded_detailed | 5 | 0.824 | 6 | 0.571 | −1 |
| topk_7 | 6 | 0.819 | 5 | 0.574 | +1 |
| precise | 7 | 0.796 | 7 | 0.555 | — |
| structured | 8 | 0.789 | 8 | 0.552 | — |
The top-4 order was preserved, but the spread widened from 0.0017 to 0.0036, though still narrow. The continuous composite revealed 16 significant pairwise differences (vs fewer with binary), with 5 Large or Medium effects (4 with d > 0.8, plus structured vs topk_7 at d = 0.71), all involving structured as the worst config.


Construct Validity Revolution: 0/2 → 2/2
Under binary metrics, both strict construct validity tests failed — the threshold-based scoring masked the expected monotonic relationship between top_k and relevance. After continuous reformulation:
| Test | Binary Result | Continuous Result |
|---|---|---|
| topk_3 > anchor (relevance) | FAIL (both at ceiling) | PASS (0.595 > 0.580) |
| anchor > topk_7 (relevance) | FAIL (both at ceiling) | PASS (0.580 > 0.570) |
This was the strongest evidence that binary metrics were not just noisy but actively misleading — they could not detect real differences that continuous metrics revealed.
The Top-4 Cluster Problem
Even with continuous metrics, the top-4 configs (composite 0.579–0.583) remained statistically indistinguishable in pairwise tests (all Holm-corrected p > 0.05, all d < 0.15). The composite metric had reached its discriminative limit: the four best configurations produced responses of equivalent measured quality across adherence, relevance, utilization, and completeness.
This raised the question: is there a quality dimension not being measured?
Decisions Carried Forward
- Continuous metrics adopted — binary formulations permanently retired
- NLI dropped — 88.8% ceiling, independent eval model concerns
- Top-4 cluster identified — topk_3, maxtok_768, temp_greedy, anchor_detailed
- Need a new discriminator — automated composite cannot break the tie
- Explore answer relevance — holistic helpfulness judgment as complementary signal
6. v8 — Answer Relevance Judge
Question: Can an LLM-judged answer relevance score discriminate between the top-4 configs that automated metrics cannot?
Design: Qwen3.5-9B (BF16 via vLLM, Docker container) scored all 1,598 v7 responses on a 1–5 Likert scale for answer relevance — “how helpful and relevant is this answer to the user’s question, considering completeness, accuracy, and specificity?” Temperature=0.0 for deterministic scoring. Scoring time: 969 seconds (1.5 responses/second).
Judge Calibration
Before scoring, the judge was calibrated against 50 human-labeled samples:
| Metric | Value | Threshold | Status |
|---|---|---|---|
| Cohen’s kappa (weighted) | 0.469 | ≥ 0.6 | Soft FAIL |
| MAE | 0.580 | ≤ 1.0 | PASS |
| Within-1 agreement | 88% | — | Strong |
| Spearman rho | 0.509 | — | p=0.00016 |
| Parse rate | 100% | > 95% | PASS |
Kappa fell below the 0.6 threshold, indicating moderate (not substantial) agreement. However, three factors supported proceeding: (1) 88% of judgments were within 1 point of human labels, (2) Spearman correlation was significant (p<0.001), and (3) the primary analysis uses relative ranking rather than absolute scores, which is robust to systematic bias.
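The calibration statistics above follow standard formulas. The sketch below computes MAE, within-1 agreement, Spearman's rho, and a quadratic-weighted Cohen's kappa on made-up toy labels — not the study's 50 calibration samples; the actual protocol lives in scripts/rag_benchmark_v8/calibrate_judge.py.

```python
import numpy as np
from scipy.stats import spearmanr

def quadratic_weighted_kappa(a, b, n_classes=5):
    """Quadratic-weighted Cohen's kappa for 1..n_classes ordinal labels."""
    a = np.asarray(a, dtype=int) - 1  # map 1..5 -> 0..4
    b = np.asarray(b, dtype=int) - 1
    O = np.zeros((n_classes, n_classes))
    for i, j in zip(a, b):
        O[i, j] += 1
    O /= O.sum()                                 # observed agreement matrix
    E = np.outer(O.sum(axis=1), O.sum(axis=0))   # chance agreement matrix
    idx = np.arange(n_classes)
    W = (idx[:, None] - idx[None, :]) ** 2 / (n_classes - 1) ** 2
    return 1.0 - (W * O).sum() / (W * E).sum()

def calibration_report(human, judge):
    """Judge-vs-human agreement on a 1-5 Likert scale."""
    human = np.asarray(human, dtype=float)
    judge = np.asarray(judge, dtype=float)
    diff = np.abs(human - judge)
    rho, p = spearmanr(human, judge)
    return {
        "kappa": quadratic_weighted_kappa(human, judge),  # target: >= 0.6
        "mae": float(diff.mean()),                        # target: <= 1.0
        "within_1": float((diff <= 1).mean()),
        "spearman_rho": float(rho),
        "spearman_p": float(p),
    }

# toy labels for illustration — NOT the study's calibration set
human = [5, 4, 4, 3, 5, 2, 4, 3, 5, 4]
judge = [4, 4, 5, 3, 5, 3, 4, 2, 5, 4]
print(calibration_report(human, judge))
```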
Table T11: Combined Ranking (Composite + Answer Relevance)
| Config | Composite Rank | Composite | AR Rank | AR Mean | AR Std | Top-4? |
|---|---|---|---|---|---|---|
| topk_3 | 1 | 0.583 | 7 | 4.225 | 0.899 | Yes |
| maxtok_768 | 2 | 0.582 | 2 | 4.365 | 0.803 | Yes |
| temp_greedy | 3 | 0.580 | 3 | 4.310 | 0.792 | Yes |
| anchor_detailed | 4 | 0.579 | 6 | 4.246 | 0.844 | Yes |
| topk_7 | 5 | 0.574 | 4 | 4.281 | 0.786 | No |
| grounded_detailed | 6 | 0.571 | 5 | 4.265 | 1.020 | No |
| precise | 7 | 0.555 | 8 | 4.045 | 1.204 | No |
| structured | 8 | 0.552 | 1 | 4.440 | 0.831 | No |
The AR-Composite Tension
The most striking finding was structured: ranked dead last on composite (#8, score 0.552) but first on answer relevance (#1, AR 4.44). This suggests the structured template produces responses that users find helpful despite scoring poorly on embedding-based metrics. The tension is informative but does not change the recommendation: structured's composite deficit (d > 0.9 vs the top-4) is too large to set aside on the strength of a single LLM-judged dimension.
Top-4 Discrimination: AR Breaks the Tie
The central question: does AR discriminate within the top-4 cluster that composite cannot separate?
Overall Friedman test (all 8 configs): chi2 = 47.12, p = 5.3e-08. AR discriminates across the full set.
Top-4 Friedman test: chi2 = 12.41, p = 0.006. AR successfully discriminates within the top-4 cluster.
Top-4 Pairwise Comparisons (Wilcoxon + Holm)
| Pair | Mean Diff | Cohen’s d | p (Holm) | Significant? |
|---|---|---|---|---|
| maxtok_768 > anchor_detailed | +0.116 | 0.221 | 0.014 | Yes |
| maxtok_768 > topk_3 | +0.140 | 0.218 | 0.013 | Yes |
| maxtok_768 > temp_greedy | +0.055 | 0.102 | 0.305 | No |
| temp_greedy > anchor_detailed | +0.065 | 0.125 | 0.267 | No |
| temp_greedy > topk_3 | +0.085 | 0.131 | 0.291 | No |
| anchor_detailed > topk_3 | +0.025 | 0.038 | 0.566 | No |
maxtok_768 significantly outperforms both anchor_detailed (d=0.221, p=0.014) and topk_3 (d=0.218, p=0.013). These are Small effects, but they are the only statistically significant differences within a cluster that composite could not crack.
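The omnibus-then-pairwise procedure (Friedman test, then Wilcoxon signed-rank with Holm step-down correction) can be sketched with scipy. The paired scores below are random stand-ins, not the study's actual AR data.

```python
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon

rng = np.random.default_rng(0)
# synthetic per-question AR scores for 4 configs over 150 paired cases
base = rng.normal(4.2, 0.8, size=150)
scores = {
    "maxtok_768":      np.clip(base + 0.15 + rng.normal(0, 0.3, 150), 1, 5),
    "temp_greedy":     np.clip(base + 0.05 + rng.normal(0, 0.3, 150), 1, 5),
    "anchor_detailed": np.clip(base + rng.normal(0, 0.3, 150), 1, 5),
    "topk_3":          np.clip(base - 0.05 + rng.normal(0, 0.3, 150), 1, 5),
}

# omnibus test across the four paired samples
chi2, p = friedmanchisquare(*scores.values())
print(f"Friedman chi2={chi2:.2f}, p={p:.4g}")

# pairwise Wilcoxon signed-rank tests with Holm step-down correction
names = list(scores)
pairs = [(a, b) for i, a in enumerate(names) for b in names[i + 1:]]
raw = [wilcoxon(scores[a], scores[b]).pvalue for a, b in pairs]
order = np.argsort(raw)
m = len(raw)
holm = [None] * m
running_max = 0.0
for rank, idx in enumerate(order):
    adj = min(1.0, (m - rank) * raw[idx])   # multiply by remaining tests
    running_max = max(running_max, adj)     # enforce monotonicity
    holm[idx] = running_max
for (a, b), p_adj in zip(pairs, holm):
    print(f"{a} vs {b}: Holm p={p_adj:.3f}")
```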




AR is Not a Proxy for Existing Metrics
Correlations between AR and automated metrics were modest:
| Metric | Spearman rho |
|---|---|
| Composite (continuous) | 0.331 |
| Adherence | 0.256 |
| Relevance | 0.244 |
| Utilization | 0.226 |
| Completeness | 0.088 |
Even the strongest correlation (rho=0.331, with composite) is modest, indicating that AR captures signal not present in the automated metrics. If AR were simply a noisy version of composite, one would expect rho > 0.7.
Construct Validity (v8): 1/2 Strict PASS
| Test | Prediction | Result | Status |
|---|---|---|---|
| precise < anchor (AR) | Constrained template → lower helpfulness | 4.045 < 4.246 (−0.201) | PASS |
| grounded < anchor (AR) | Grounded template → lower helpfulness | 4.265 > 4.246 (+0.019) | FAIL |
Test 2 failed by a negligible margin (+0.019, d=0.03). The grounded_detailed template performed slightly better than expected, possibly because its emphasis on sourcing improves perceived answer quality, though this was not directly tested.
Decisions Carried Forward
maxtok_768 is the final selection. It is the only configuration that ranks in the top-2 on both dimensions: composite rank #2 (0.582) and AR rank #2 (4.365). It significantly outperforms 2 of 3 other top-4 members on AR while maintaining equivalent composite performance.
7. Cross-Version Synthesis
Optimization Trajectory
The four versions trace a clear narrowing funnel:
v5: 4 techniques x 8 sweeps = 32 cells -> sentence_window selected
v6: 4 windows x 4 sweeps = 16 cells -> sw_1 + reranking=false
v7: 8 configs x 2 formulations = measurement reform -> top-4 cluster
v8: top-4 -> maxtok_768 (AR tiebreaker)

Each version reduced the search space by roughly 4× while adding measurement sophistication. The total optimization path evaluated 14,545 benchmark requests plus 1,598 LLM judge evaluations (16,143 total) across more than 40 unique configurations.
Metric Evolution
The evaluation framework evolved significantly through evidence-based decisions:
| Change | Version | Evidence | Impact |
|---|---|---|---|
| Completeness flagged as non-discriminating | v5 | ANOVA F=1.77, p=0.15, η²=0.009 | Flagged; retained after continuous reformulation restored discrimination |
| NLI faithfulness dropped | v7 | 88.8% ceiling (v5), 71.6% ceiling (v6) | Removed from composite |
| Binary → continuous formulation | v7 | 3 metrics with >66% ceiling; 0/2 validity → 2/2 | Full discrimination restored |
| Completeness discrimination restored | v7 | Continuous: Friedman χ²=785.22, η²=0.175 | Weight increased 0.10 → 0.20 |
| Composite weights redistributed | v7 | NLI removal, completeness reweighted | Stronger discrimination (F: 3.80 → 6.81) |
| Answer relevance added | v8 | Top-4 composite plateau (spread 0.004) | Broke tie, identified maxtok_768 |
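For illustration, the composite can be sketched as a weighted sum of the four continuous metrics. Only completeness's 0.20 weight is documented above; the other three weights below are placeholders chosen so the arithmetic lands near the reported composite, not the study's actual values.

```python
# ASSUMED weights — only completeness's 0.20 is documented in this report.
WEIGHTS = {
    "adherence": 0.35,     # placeholder
    "relevance": 0.25,     # placeholder
    "utilization": 0.20,   # placeholder
    "completeness": 0.20,  # documented: raised from 0.10 after v7 reformulation
}

def composite(metrics: dict) -> float:
    """Weighted sum of continuous metric scores."""
    assert abs(sum(WEIGHTS.values()) - 1.0) < 1e-9
    return sum(WEIGHTS[k] * metrics[k] for k in WEIGHTS)

# maxtok_768's v7 continuous scores (from the Expected Performance table)
print(round(composite({
    "adherence": 0.608,
    "relevance": 0.581,
    "utilization": 0.768,
    "completeness": 0.354,
}), 3))
```

Under these placeholder weights the sum happens to land near the reported 0.582, but that is illustrative arithmetic, not a recovery of the real weight vector.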
Construct Validity Trend
| Version | Strict Tests | Pass | Rate | Notes |
|---|---|---|---|---|
| v5 | 5 | 4 | 80% | Reranking test failed (dataset-specific) |
| v6 | 4 | 4 | 100% | Best construct validity |
| v7 (binary) | 2 | 0 | 0% | Ceiling effects masked all signals |
| v7 (continuous) | 2 | 2 | 100% | Measurement reform recovered validity |
| v8 | 2 | 1 | 50% | Grounded template slightly exceeded expectation |
The v7 binary → continuous transition is the most dramatic result: the same data, same configurations, same predictions went from 0% to 100% validity solely through metric reformulation. This is strong evidence that metric design is at least as important as pipeline design.
Diminishing Returns
| Transition | Composite Change | AR Change | Effort |
|---|---|---|---|
| v5 → v6 | +0.041 (reranking=false) | — | 5,597 requests |
| v6 → v7 | −0.217 (scale change) | — | 1,598 + re-eval |
| v7 top-4 | 0.004 spread (plateau) | — | Identified ceiling as cause |
| v8 maxtok_768 | +0.001 vs topk_3 | +0.140 vs topk_3 | 1,598 judge calls |
The composite gains diminished rapidly: v6’s reranking discovery was the single largest improvement (+0.041). By v7, automated metrics had plateaued. v8’s contribution was not in composite improvement but in adding a new dimension (AR) that revealed maxtok_768 as measurably more helpful despite near-identical composite scores.
Key Findings Across All Versions
- Sentence_window_1 is the optimal chunking strategy for ExpertQA with this model. It maximizes fidelity (adherence, NLI) while maintaining adequate coverage.
- Reranking hurts on ExpertQA. Confirmed in both v5 and v6 across multiple window sizes. The cross-encoder filters out relevant chunks in this multi-domain dataset. This finding may not generalize to narrow-domain datasets.
- Binary threshold metrics have a hard ceiling problem. When most responses exceed a quality threshold, the metric saturates and can no longer discriminate. Continuous formulations are essential for high-performing systems.
- Composite metrics plateau before quality converges. The top-4 configs are equivalent on automated metrics but differ on AR. This suggests that automated metrics capture necessary but insufficient quality dimensions.
- max_tokens=768 is a sweet spot. More generation budget (768 vs the 512 default) produces measurably more helpful answers (AR d=0.221 vs anchor, p=0.014). The additional generation budget plausibly allows the model to elaborate rather than truncating, though this hypothesis was not directly tested.
- Temperature has minimal impact. temp_greedy (0.0) vs anchor_detailed (0.1) differ by just d=0.125 on AR and d=0.008 on composite. Even this small temperature gap produces negligible differences.
- Prompt template matters more than expected. The structured template produces the most user-helpful responses (AR 4.44) but worst composite (0.552). This tension between automated and human-aligned metrics deserves future investigation.
8. Final Recommendation
Table T12: Best Configuration Card
| Parameter | Value | Source Version | Evidence |
|---|---|---|---|
| Technique | sentence_window | v5 | Best coverage-fidelity tradeoff (d=0.64–0.73 Medium) |
| Window size | 1 | v6 | Composite 0.769 (rank #1); adherence 0.766 |
| Reranking | disabled | v5, v6 | +0.041 composite; confirmed across window sizes |
| Top_k | 5 | v6 | Flat region k=5–10; k=3 too restrictive |
| Retrieval | dense only | v5 | Hybrid: −0.309 NLI, severe fidelity loss |
| Prompt template | detailed | v6, v7 | +0.024 composite over default |
| Max tokens | 768 | v7, v8 | AR 4.365 (#2); composite 0.582 (#2) |
| Temperature | 0.1 | v7 | Default; negligible difference vs 0.0 |
| System prompt | /no_think | v7 | All 8 v7 configs used /no_think (workflow default) |
| Similarity threshold | 0.45 | v5 | ExpertQA-calibrated (mean sim 0.440) |
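The configuration card translates naturally into a config object for deployment. The key names below are illustrative — the pipeline's real parameter names may differ.

```python
# Recommended configuration (Table T12) as a plain dict.
# Key names are hypothetical; values come from the card above.
RECOMMENDED_CONFIG = {
    "technique": "sentence_window",
    "window_size": 1,
    "reranking": False,           # disabled: +0.041 composite on ExpertQA
    "top_k": 5,
    "retrieval": "dense",         # hybrid caused severe fidelity loss
    "prompt_template": "detailed",
    "max_tokens": 768,            # the v8 differentiator (vs 512 default)
    "temperature": 0.1,
    "system_prompt": "/no_think",
    "similarity_threshold": 0.45, # ExpertQA-calibrated
}

print(RECOMMENDED_CONFIG)
```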
Expected Performance
| Metric | Expected Value | Source |
|---|---|---|
| Composite (continuous) | 0.582 | v7 config_ranking_continuous.json |
| Adherence (continuous) | 0.608 | v7 config_ranking_continuous.json |
| Relevance (continuous) | 0.581 | v7 config_ranking_continuous.json |
| Utilization (continuous) | 0.768 | v7 config_ranking_continuous.json |
| Completeness | 0.354 | v7 config_ranking_continuous.json |
| Answer Relevance | 4.365 / 5.0 | v8 config_ranking.json |
| 95th pct latency | ~19–20s | v6 latency analysis |
| Failure rate | < 0.5% | v5–v6 observed rates |
Table T13: Finalist Comparison
| Config | Comp. Rank | Composite | AR Rank | AR Mean | Latency | Fail Rate | Key Differentiator |
|---|---|---|---|---|---|---|---|
| maxtok_768 | 2 | 0.582 | 2 | 4.365 | ~12.5s | <0.5% | +256 tokens → AR gain |
| topk_3 | 1 | 0.583 | 7 | 4.225 | ~12.5s | <0.5% | Best composite, weak AR |
| temp_greedy | 3 | 0.580 | 3 | 4.310 | ~12.5s | <0.5% | Deterministic; no sig. edge |
| anchor_detailed | 4 | 0.579 | 6 | 4.246 | ~12.5s | <0.5% | Baseline anchor config |
| structured | 8 | 0.552 | 1 | 4.440 | ~12.5s | <0.5% | Best AR but worst composite |
Latency estimates from v6 sw_1 main run (~12–13s mean generation time). Failure rates from v5–v6 observed rates across all sw_1 runs. Composite and AR from v7 re-evaluation and v8 judge scoring, respectively.
Operational Characteristics
- Latency: ~12.5s mean, ~19–20s p95 (v6 sw_1 measurements)
- Failure rate: <0.5% (v5–v6 observed across >13,000 requests)
- Throughput: ~0.175 req/s sustained (v5 measurement, single-instance)
- Resource requirements: Single RTX 4090, ~23 GiB peak VRAM (12.2 GiB model + KV cache + evaluation models)
Selection Rationale
Decision rule: Select the configuration that ranks in the top tier on automated composite metrics AND performs best on answer relevance, excluding any configuration with a large automated-metric deficit.
maxtok_768 is the recommended configuration under this rule:
- Composite: Rank #2 (0.582), within 0.001 of the best (topk_3, 0.583). The difference is not statistically significant (p > 0.8).
- Answer Relevance: Rank #2 (4.365), significantly better than anchor_detailed (#6, d=0.221, p=0.014) and topk_3 (#7, d=0.218, p=0.013).
- No other config ranks top-2 on both dimensions: topk_3 is #1 composite but #7 AR. temp_greedy is #3 on both. structured is #1 AR but #8 composite.
- max_tokens=768 is the differentiator: The only parameter change from the anchor config (512 tokens) is +256 max tokens. This extra generation budget produces measurably more helpful responses (d=0.221 on AR) at negligible composite cost.
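The decision rule reduces to a small filter over the rank data from Tables T11 and T13. The sketch below uses a tier width of 2, matching the "top-2 on both dimensions" criterion in the rationale.

```python
# (composite_rank, ar_rank) per config, from Tables T11/T13
ranks = {
    "topk_3": (1, 7),
    "maxtok_768": (2, 2),
    "temp_greedy": (3, 3),
    "anchor_detailed": (4, 6),
    "topk_7": (5, 4),
    "grounded_detailed": (6, 5),
    "precise": (7, 8),
    "structured": (8, 1),
}

# keep configs in the top tier on BOTH dimensions, then take best AR rank
TOP_TIER = 2
finalists = [c for c, (comp, ar) in ranks.items()
             if comp <= TOP_TIER and ar <= TOP_TIER]
winner = min(finalists, key=lambda c: ranks[c][1])
print(winner)  # maxtok_768 — the only config top-2 on both dimensions
```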
Caveats
- All results are specific to ExpertQA with Nemotron-9B-NVFP4. Generalization to other datasets or models requires validation.
- Answer relevance calibration showed moderate agreement (kappa=0.469). The relative ranking is more trustworthy than absolute scores.
- The optimization used one-factor-at-a-time (OFAT) sweeps, not factorial designs. Interaction effects may exist between parameters that were not explored.
9. Limitations
- Single dataset: All results are from ExpertQA only. Findings like “reranking hurts” may not generalize to narrow-domain datasets where v3 (covidqa) and v4 (techqa) showed reranking benefits.
- Single model: Nemotron-Nano-9B-v2-NVFP4 is a specific quantized model. Larger or differently quantized models may shift optimal configurations, particularly for max_tokens and prompt template effects.
- OFAT sweep design: Parameters were swept one at a time. Factorial designs would reveal interaction effects (e.g., window_size × reranking, temperature × max_tokens) but at exponentially higher cost.
- Moderate calibration kappa: The AR judge achieved kappa=0.469 (moderate), below the 0.6 threshold for substantial agreement. While 88% within-1 agreement and significant Spearman correlation support relative ranking reliability, absolute AR scores should be interpreted cautiously.
- Metric reformulation mid-study: Switching from binary to continuous metrics between v6 and v7 means composite scores are not directly comparable across that boundary. The composite scale shifted from ~0.77–0.81 (binary) to ~0.55–0.58 (continuous).
- No large-scale human evaluation: The AR judge is an automated proxy for human judgment. The 50-sample calibration set may not capture the full distribution of human preferences. A larger human evaluation (N > 200) would strengthen confidence in AR-based conclusions.
- Completeness metric history: Completeness showed zero discriminative power under binary formulation in v5 (F=1.77, p=0.15) and was flagged for removal. However, the v7 continuous reformulation restored strong discrimination (Friedman χ²=785.22, η²=0.175, the strongest of all metrics), justifying the weight increase from 0.10 to 0.20. The metric’s utility is formulation-dependent.
- Non-independent test cases: The same 150 test cases were used across all versions and configurations. While paired analysis accounts for within-subject correlation, there is no guarantee these 150 cases represent the full distribution of real-world queries. Performance on unseen questions may differ.
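To make the OFAT-vs-factorial cost gap concrete, a toy sketch with illustrative parameter grids (not the study's exact sweep values):

```python
from itertools import product

# hypothetical parameter grids — sizes chosen for illustration only
grid = {
    "window_size": [1, 3, 5, 12],
    "reranking": [True, False],
    "top_k": [3, 5, 7],
    "max_tokens": [512, 768],
    "temperature": [0.0, 0.1],
}

# OFAT: one anchor run plus each single-parameter variant
ofat_cells = 1 + sum(len(v) - 1 for v in grid.values())
# full factorial: every combination of every parameter
factorial_cells = len(list(product(*grid.values())))
print(ofat_cells, factorial_cells)  # -> 9 96
```

Even this small grid inflates roughly tenfold under a full factorial; interaction effects come at that multiplicative cost.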
10. Artifact Index
Analysis Reports
| File | Version | Content |
|---|---|---|
| results/v5/v5_session1_analysis/session_1_analysis.md | v5 | Full statistical analysis (77 PNGs, 23 JSONs) |
| results/v6/v6_session1_analysis/session_1_analysis.md | v6 | Window size analysis (78 artifacts) |
| results/v7/v7_reevaluation_analysis/reevaluation_analysis.md | v7 | Binary vs continuous comparison |
| results/v8/v8_judge_analysis/judge_analysis.md | v8 | Answer relevance analysis |
Key Data Files
| File | Version | Content |
|---|---|---|
| results/v7/.../config_ranking_continuous.json | v7 | Continuous composite rankings (8 configs) |
| results/v7/.../formulation_comparison.json | v7 | Ceiling rates and F-statistics |
| results/v8/.../config_ranking.json | v8 | AR rankings (8 configs) |
| results/v8/.../top4_discrimination.json | v8 | Friedman + pairwise tests |
| results/v8/.../construct_validity.json | v8 | 4 construct validity tests |
| results/v8/.../metric_correlations.json | v8 | AR vs automated metric correlations |
| results/v8/.../discrimination.json | v8 | Overall Friedman (all 8 configs) |
Design Documentation
| File | Version | Content |
|---|---|---|
| documentation/260311_01_.../design_analysis.md | v5 | Approved design (4 techniques, 8 sweeps) |
| documentation/260313_01_.../design_analysis.md | v6 | Approved design (4 window sizes, 4 sweeps) |
| documentation/260314_01_.../design_analysis.md | v7 | Re-evaluation methodology |
| documentation/260315_01_.../design_analysis.md | v8 | AR judge design |
Appendix A: Answer Relevance Judge Protocol
Judge Model
- Model: Qwen3.5-9B (BF16 precision)
- Inference engine: vLLM with enforce_eager=True, max_model_len=4096, gpu_memory_utilization=0.85
- Temperature: 0.0 (deterministic scoring)
- Max output tokens: 1024
- Thinking mode: Disabled (enable_thinking: False)
Scoring Scale
| Score | Anchor |
|---|---|
| 1 | Completely unhelpful: response is irrelevant, incoherent, or empty |
| 2 | Mostly unhelpful: response touches the topic but fails to address the question |
| 3 | Partially helpful: response addresses some aspects but misses key parts or is vague |
| 4 | Mostly helpful: response addresses the question well with minor gaps or unnecessary content |
| 5 | Fully helpful: response directly and completely addresses the question using the provided context |
Evaluation Criteria
- Does the response directly answer the user’s question?
- Does the response use the provided context effectively?
- Is the response appropriately detailed (not too brief, not excessively verbose)?
- Would a domain expert find this response useful?
Additional scoring rules instruct the judge not to reward verbosity, to use the full 1–5 scale (reserving 5 for truly excellent responses), and to score 3 for on-topic but gapped responses.
Prompt and Rubric
The exact judge prompt is defined in scripts/rag_benchmark_v8/config.py (JUDGE_USER_TEMPLATE). The system prompt instructs the model to act as an expert answer quality judge and output valid JSON only. The user prompt provides the question, retrieved context chunks (numbered), and the response to evaluate, followed by the rubric and scoring rules.
Output format: {"score": <1-5>, "reasoning": "<brief justification>"}. Parsing uses strict JSON first, then regex fallback. Parse rate: 100% across all 1,598 responses.
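The two-stage parsing strategy can be sketched as follows; this is a minimal stand-in for illustration, not the production parser in scripts/rag_benchmark_v8.

```python
import json
import re

def parse_judge_output(text: str):
    """Extract a 1-5 score: strict JSON first, then a regex fallback."""
    try:
        obj = json.loads(text)
        score = int(obj["score"])
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        # fallback: pull "score": N out of surrounding chatter
        m = re.search(r'"score"\s*:\s*([1-5])', text)
        if not m:
            return None
        score = int(m.group(1))
    return score if 1 <= score <= 5 else None

print(parse_judge_output('{"score": 4, "reasoning": "mostly helpful"}'))  # 4
print(parse_judge_output('Sure! {"score": 5, "reasoning": "..."}'))       # 5
```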
Blinding and Ordering
- Blinding: The judge sees question, context chunks, and response only: no configuration identity, run name, or parameter values are provided.
- Presentation order: Responses are scored sequentially in CSV row order (one response per judge call). No within-response randomization is applicable since each call evaluates a single response.
- Position bias: Not applicable in this protocol (single-response evaluation, not pairwise comparison).
Calibration
50 human-labeled samples scored by the judge before full deployment. Results: kappa=0.469 (moderate), MAE=0.580, within-1 agreement=88%, Spearman rho=0.509. See scripts/rag_benchmark_v8/calibrate_judge.py for the calibration protocol.
Known Biases and Limitations
- Verbosity bias: Possible but not measured. Longer responses may receive higher scores from the judge regardless of quality. Anti-verbosity instructions are included in the prompt, but their effectiveness was not independently validated.
- Rubric iteration: The rubric was revised once after initial calibration (iteration 1 showed judge bias toward score 5). The deployed rubric includes explicit anti-inflation instructions.
- Kappa below threshold: Weighted kappa (0.469) fell below the 0.6 target for substantial agreement. Relative ranking is more robust than absolute scores.
© 2026 RCTK. All rights reserved. This study may not be reproduced or distributed without prior written permission.