HydroAgent: Closing the Gap Between Frontier LLMs and Human Experts in Hydrologic Model Calibration via Simulator-Grounded RL

Li, Zhi; Yan, Songkun; Cao, Jie; Zhang, Mofan; Wei, Anjiang; Yoo, Jinwoong; Hong, Yang

HydroAgent: Closing the Gap Between Frontier LLMs and Human Experts in Hydrologic Model Calibration via Simulator-Grounded RL

Zhi Li¹, Songkun Yan², Jie Cao², Mofan Zhang³, Anjiang Wei³, Jinwoong Yoo⁴, Yang Hong²

¹University of Colorado Boulder ²University of Oklahoma ³Stanford University ⁴NASA Goddard Space Flight Center

Paper Code arXiv 🤗 Data

Overview of the HydroAgent hydrologic calibration benchmark and training pipeline

Overview. We benchmark nine frontier LLM agents on calibrating the operational CREST/EF5 distributed hydrologic model, then introduce HydroAgent — an open-weight Qwen3-4B policy fine-tuned with supervised learning on expert calibration trajectories and reinforcement learning with simulation feedback (RLSF), where the simulator itself is the verifier.

Abstract

Calibrating distributed hydrologic models is a critical bottleneck across operational water-resources management — streamflow prediction, water-supply assessment, reservoir operation, drought monitoring, infrastructure design, and flood forecasting all depend on it: each basin demands a domain expert to translate hydrograph signatures into adjustments of a high-dimensional parameter vector, and the resulting workflow does not transfer between watersheds. We ask a sharper version of a now-common question: can frontier large language model (LLM) agents replace the human hydrologic modeler, and if not, what would it take? We benchmark nine frontier LLM agents — Claude Opus 4.6/4.7, Sonnet 4.6, GPT-5/5.4/5.4-pro, and Gemini 2.5-pro/3.1-pro/3-flash — on the calibration of the operational CREST distributed hydrologic model used by the U.S. National Weather Service for flash-flood forecasting. Mean best-of-twenty-rounds Nash–Sutcliffe Efficiency (NSE) across four held-out gauges spanning 329–40,792 km² ranges from −0.16 (GPT-5.4) to 0.75 (Sonnet 4.6); the ceiling is reproducible across all three frontier vendors and across capability tiers, and no model reaches the human-expert reference except for Opus-4.7 at one testing gauge. We argue this gap is not a parameter-count problem: it is a domain-grounding problem. We then propose HydroAgent, a recipe that fine-tunes the open-weight Qwen3-4B model with supervised fine-tuning on 2,576 expert calibration trajectories and Group-Relative Policy Optimization using NSE as a verifiable reward sourced from online CREST simulations — reinforcement learning with simulation feedback (RLSF). Our central thesis is that, for Earth-system science, a small domain-tuned policy with simulator-in-the-loop RL is a more compute-efficient and physically faithful path than scaling generic frontier models, and that the multi-modal richness of Earth data makes domain agents an unusually leveraged direction for AI in physical science.

Key Contributions

Frontier benchmark

The first systematic benchmark of nine frontier LLM agents on calibrating an operational distributed hydrologic model, with a public release of all 9 × 20 calibration trajectories for reproducibility.

HydroAgent recipe

A domain-specific recipe that fine-tunes Qwen3-4B with SFT + GRPO using NSE as a verifiable, simulator-grounded reward — reproducible on 4×H100.

A position, quantified

For Earth-system tasks with cheap-to-evaluate simulators, a domain-tuned 4B agent can substantially close the gap to frontier generalists — grounding, not scale, is the bottleneck.

Method: SFT + Reinforcement Learning with Simulation Feedback (RLSF)

HydroAgent calibration loop: set parameters, run simulation, evaluate, parse failure

A calibration episode exposes four tools to the agent — set_parameters (write 13 CREST scalar multipliers), run_simulation (invoke the EF5 binary and parse the simulated-vs-observed hydrograph), evaluate (return NSE plus a multi-criteria diagnostic), and a parse_failure reasoning step. Training proceeds in two phases:

Phase 1 — Supervised Fine-Tuning. Distill 2,576 expert calibration trajectories across 29 U.S. gauges into Qwen3-4B, teaching the tool-call grammar, parameter-bounds convention, and a basic hydrologic-reasoning style.
Phase 2 — RLSF (GRPO). Draw K=8 multi-turn rollouts per prompt; each rollout proposes a parameter set, runs an online CREST/EF5 simulation, and is scored by a clipped NSE plus shaped per-turn signals. The simulator is the verifier — there is no learned reward model. An explicit improvement bonus trains the policy to iterate-until-you-cannot-improve, which is what a hydrologist actually does.

Training Recipe & Reproducibility

The RL stack is verl 0.5 GRPO with SGLang multi-turn rollouts and an in-process EF5 tool, trained in full BF16 under FSDP (no LoRA) on 4×H100 (80 GB). For each prompt the engine draws K rollouts; each rollout is a multi-turn calibration episode that dispatches set_parameters → run_simulation → evaluate to registered tool classes, with EF5 invocations gated by a 32-way semaphore to avoid CPU/IO contention. A checkpoint cadence lands roughly every 5 hours; resume_mode: auto recovers across Modal's 24 h function-timeout cycle.

Setting	Value
Base model	`Qwen/Qwen3-4B-Instruct-2507` (full FT, BF16, FSDP)
Advantage estimator	GRPO (group-relative, critic-free)
Rollouts per prompt (K)	6–8 · batch = 4 prompts/step
Max calibration turns / rollout	50
Actor learning rate	1×10⁻⁶ (5e−6 caused collapse)
KL anchor coeff.	0.2 (strong anchor to SFT init)
Entropy coeff.	0.01
Sampling	temperature 1.0, top-p 0.95
Epochs	30 over the 10-gauge training set
Infrastructure	verl 0.5 + SGLang on 4×H100, EF5 concurrency 32

Full configs and the eval harness are in the code repository.

Where Do Frontier Agents Stand?

Each agent calibrates four held-out gauges inside an identical Linux sandbox, with a 20-round budget (up to 200 EF5 simulations per gauge). Best-NSE and rounds-used are tightly coupled: the strongest models use the full horizon, while several pro-tier reasoning models terminate after 1–2 rounds with budget remaining. Performance bands follow Moriasi et al. (2015); the dashed line at NSE = 0.85 is the human-expert reference (an experienced hydrologist using domain knowledge + the DREAM sampler). Use the carousel to browse all four held-out gauges.

Gauge `02338660` (329 km²) — the canonical evaluation gauge. No frontier model crosses the human-expert reference at NSE = 0.85.

Gauge `01403060` — per-model best-NSE with rounds-used annotated on the right.

Gauge `06279500` (40,792 km²) — the largest, hardest basin; the one case where Opus-4.7 reaches the human-expert reference.

Gauge `07144100` — the qualitative ranking and rounds-used/best-NSE coupling persist across the panel.

How Does HydroAgent Improve Streamflow Prediction?

Relative change in four hydrologic-fit metrics for HydroAgent-4B vs the Qwen3-4B baseline

Relative change in four hydrologic-fit metrics for HydroAgent-4B versus the untuned Qwen3-4B-Instruct baseline, on the four held-out gauges. Blue = improvement, red = degradation; size encodes magnitude. Simulator-grounded post-training improves the metrics calibration is judged by: NSE improves on all four basins (from +5% to over +200%), discharge correlation improves on all four, and timing offset (|Lag|) contracts on all four. The lone residual cost is peak-ratio error on gauge 02338660, consistent with the volume- and peak-weighted shaping of the RLSF reward.

Per-gauge ablation of the SFT + RLSF stack

Gauge	Method	Best NSE	Sims	Turns	Parse fail
`07144100`	Baseline	0.34	15	50	0
	SFT-only	0.07	1	15	4
	HydroAgent	0.65	13	50	0
`06279500`	Baseline	−1.41	13	50	1
	SFT-only	−2.27	1	13	4
	HydroAgent	−0.84	17	50	0
`02338660`	Baseline	0.65	13	50	0
	SFT-only	−17.53	1	11	4
	HydroAgent	0.68	12	50	0
`01403060`	Baseline	−0.15	16	50	0
	SFT-only	0.58	4	16	4
	HydroAgent	0.40	14	50	0

Higher NSE is better. SFT alone is unstable; the full SFT + RLSF stack restores long-horizon iteration and zero parse failures.

Calibrated Hydrographs

Observed vs simulated hydrographs for HydroAgent and the Qwen3-4B baseline across four held-out gauges

Observed (black) versus simulated discharge for HydroAgent-4B (orange) and the Qwen3-4B baseline (gray) across the four held-out gauges, with precipitation forcing shown on the inverted top axis. Simulator-grounded RL aligns event peaks and timing while stabilizing volume across distinct hydroclimatic regions.

Benchmark Basins

Map of 10 training and 4 testing CONUS gauges, sized by basin area

Ten CONUS training gauges (539–2,401 km²) and four geographically held-out testing gauges (329–40,792 km²), spanning distinct hydroclimatic regions, hourly forcings (MRMS gauge-corrected precipitation, daily PET), and 13 calibrated CREST parameters. Each window is a clear flood event (rising + receding limbs) selected by a flood-window audit over the observation series.

Training set (10 gauges)

Gauge ID	Basin (km²)	Lat	Lon	Window (UTC)
`11383500`	539	40.014	−121.948	2018-05-19 → 07-17
`11043000`	575	33.480	−117.144	2019-03-15 → 05-13
`11152000`	632	36.281	−121.323	2018-05-29 → 07-27
`02294781`	1,064	27.825	−81.802	2018-04-29 → 06-27
`02312000`	1,476	28.480	−82.178	2018-11-15 → 01-13
`07195430`	1,489	36.109	−94.533	2018-01-04 → 03-04
`11179000`	1,639	37.587	−121.961	2018-06-03 → 08-01
`14301000`	1,727	45.704	−123.755	2018-09-11 → 11-09
`14207500`	1,828	45.351	−122.676	2018-04-09 → 06-07
`11376000`	2,401	40.387	−122.239	2018-09-21 → 11-19

Testing set (4 held-out gauges)

Gauge ID	Basin (km²)	Lat	Lon	Window (UTC)
`02338660`	329	33.236	−84.988	2018-07-01 → 08-31
`01403060`	2,033	40.551	−74.548	2018-11-11 → 01-09
`06279500`	40,792	44.759	−108.182	2018-06-13 → 08-11
`07144100`	3,209	37.883	−97.425	2019-03-30 → 05-28

Acknowledgement

We appreciate Modal for sponsoring the computing credits for this research. The code is released under the MIT license; see the project repository for training, evaluation, and data-preparation scripts.

BibTeX

@misc{li2026hydroagentclosinggapfrontier,
      title={HydroAgent: Closing the Gap Between Frontier LLMs and Human Experts
             in Hydrologic Model Calibration via Simulator-Grounded RL},
      author={Zhi Li and Songkun Yan and Jie Cao and Mofan Zhang
             and Anjiang Wei and Jinwoong Yoo and Yang Hong},
      year={2026},
      eprint={2605.17792},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2605.17792},
}

HydroAgent: Closing the Gap Between Frontier LLMs and Human Experts in Hydrologic Model Calibration via Simulator-Grounded RL

Abstract

Key Contributions

Method: SFT + Reinforcement Learning with Simulation Feedback (RLSF)

Training Recipe & Reproducibility

Where Do Frontier Agents Stand?

Gauge 02338660 (329 km2) — the canonical evaluation gauge. No frontier model crosses the human-expert reference at NSE = 0.85.

Gauge 01403060 — per-model best-NSE with rounds-used annotated on the right.

Gauge 06279500 (40,792 km2) — the largest, hardest basin; the one case where Opus-4.7 reaches the human-expert reference.

Gauge 07144100 — the qualitative ranking and rounds-used/best-NSE coupling persist across the panel.

How Does HydroAgent Improve Streamflow Prediction?

Per-gauge ablation of the SFT + RLSF stack

Calibrated Hydrographs

Benchmark Basins

Training set (10 gauges)

Testing set (4 held-out gauges)

Acknowledgement

BibTeX

Gauge `02338660` (329 km²) — the canonical evaluation gauge. No frontier model crosses the human-expert reference at NSE = 0.85.

Gauge `01403060` — per-model best-NSE with rounds-used annotated on the right.

Gauge `06279500` (40,792 km²) — the largest, hardest basin; the one case where Opus-4.7 reaches the human-expert reference.

Gauge `07144100` — the qualitative ranking and rounds-used/best-NSE coupling persist across the panel.