HydroAgent: Closing the Gap Between Frontier LLMs and Human Experts in Hydrologic Model Calibration via Simulator-Grounded RL

¹University of Colorado Boulder   ²University of Oklahoma   ³Stanford University   ⁴NASA Goddard Space Flight Center
Overview of the HydroAgent hydrologic calibration benchmark and training pipeline

Overview. We benchmark nine frontier LLM agents on calibrating the operational CREST/EF5 distributed hydrologic model, then introduce HydroAgent — an open-weight Qwen3-4B policy fine-tuned with supervised learning on expert calibration trajectories and reinforcement learning with simulation feedback (RLSF), where the simulator itself is the verifier.

Abstract

Calibrating distributed hydrologic models is a critical bottleneck across operational water-resources management: streamflow prediction, water-supply assessment, reservoir operation, drought monitoring, infrastructure design, and flood forecasting all depend on it. Each basin demands a domain expert to translate hydrograph signatures into adjustments of a high-dimensional parameter vector, and the resulting workflow does not transfer between watersheds. We ask a sharper version of a now-common question: can frontier large language model (LLM) agents replace the human hydrologic modeler, and if not, what would it take? We benchmark nine frontier LLM agents (Claude Opus 4.6/4.7, Sonnet 4.6, GPT-5/5.4/5.4-pro, and Gemini 2.5-pro/3.1-pro/3-flash) on the calibration of the operational CREST distributed hydrologic model used by the U.S. National Weather Service for flash-flood forecasting. Mean best-of-twenty-rounds Nash–Sutcliffe Efficiency (NSE) across four held-out gauges spanning 329–40,792 km² ranges from −0.16 (GPT-5.4) to 0.75 (Sonnet 4.6). This ceiling is reproducible across all three frontier vendors and across capability tiers, and no model reaches the human-expert reference, with the single exception of Opus 4.7 at one testing gauge. We argue that this gap is not a parameter-count problem but a domain-grounding problem. We then propose HydroAgent, a recipe that fine-tunes the open-weight Qwen3-4B model with supervised fine-tuning on 2,576 expert calibration trajectories, followed by Group-Relative Policy Optimization (GRPO) using NSE as a verifiable reward sourced from online CREST simulations, i.e., reinforcement learning with simulation feedback (RLSF). Our central thesis is that, for Earth-system science, a small domain-tuned policy with simulator-in-the-loop RL is a more compute-efficient and physically faithful path than scaling generic frontier models, and that the multi-modal richness of Earth data makes domain agents an unusually leveraged direction for AI in physical science.

Key Contributions

  Frontier benchmark

The first systematic benchmark of nine frontier LLM agents on calibrating an operational distributed hydrologic model, with a public release of all 9 × 20 calibration trajectories for reproducibility.

  HydroAgent recipe

A domain-specific recipe that fine-tunes Qwen3-4B with SFT + GRPO using NSE as a verifiable, simulator-grounded reward — reproducible on 4×H100.

  A position, quantified

For Earth-system tasks with cheap-to-evaluate simulators, a domain-tuned 4B agent can substantially close the gap to frontier generalists — grounding, not scale, is the bottleneck.

Method: SFT + Reinforcement Learning with Simulation Feedback (RLSF)

HydroAgent calibration loop: set parameters, run simulation, evaluate, parse failure

A calibration episode exposes four tools to the agent — set_parameters (write 13 CREST scalar multipliers), run_simulation (invoke the EF5 binary and parse the simulated-vs-observed hydrograph), evaluate (return NSE plus a multi-criteria diagnostic), and a parse_failure reasoning step. Training proceeds in two phases:

  • Phase 1 — Supervised Fine-Tuning. Distill 2,576 expert calibration trajectories across 29 U.S. gauges into Qwen3-4B, teaching the tool-call grammar, parameter-bounds convention, and a basic hydrologic-reasoning style.
  • Phase 2 — RLSF (GRPO). Draw K=8 multi-turn rollouts per prompt; each rollout proposes a parameter set, runs an online CREST/EF5 simulation, and is scored by a clipped NSE plus shaped per-turn signals. The simulator is the verifier — there is no learned reward model. An explicit improvement bonus trains the policy to iterate-until-you-cannot-improve, which is what a hydrologist actually does.
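The reward and advantage computation described in Phase 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the released implementation: the NSE formula is the standard definition, but the clipping floor, the bonus weight, and the function names are hypothetical, and the shaped per-turn signals are reduced here to a single improvement bonus. The group-relative advantage standardizes rewards within the K rollouts of one prompt, which is the usual GRPO formulation (population standard deviation is used here; implementations differ on this detail).

```python
import statistics


def nse(observed, simulated):
    """Nash-Sutcliffe Efficiency: 1 - SSE / spread of observations about their mean."""
    mean_obs = sum(observed) / len(observed)
    sse = sum((o - s) ** 2 for o, s in zip(observed, simulated))
    spread = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - sse / spread


def episode_reward(nse_per_turn, floor=-1.0, bonus_weight=0.1):
    """Clipped best NSE plus a bonus for every turn that beats the running best.

    floor and bonus_weight are illustrative assumptions, not values from the source.
    """
    clipped = [max(floor, min(1.0, value)) for value in nse_per_turn]
    best = clipped[0]
    bonus = 0.0
    for value in clipped[1:]:
        if value > best:
            bonus += bonus_weight * (value - best)
            best = value
    return best + bonus


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize each rollout's reward within its group (critic-free)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(reward - mean) / (std + eps) for reward in rewards]
```

Because the advantage is computed relative to the group, a rollout is rewarded only for calibrating better than its siblings on the same gauge, which removes the need for a learned value function.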

Training Recipe & Reproducibility

The RL stack is verl 0.5 GRPO with SGLang multi-turn rollouts and an in-process EF5 tool, trained in full BF16 under FSDP (no LoRA) on 4×H100 (80 GB). For each prompt the engine draws K rollouts; each rollout is a multi-turn calibration episode that dispatches set_parameters → run_simulation → evaluate to registered tool classes, with EF5 invocations gated by a 32-way semaphore to avoid CPU/IO contention. Checkpoints are saved roughly every 5 hours, and resume_mode: auto resumes training across Modal's 24 h function-timeout cycle.
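The semaphore gating mentioned above can be illustrated with a small asyncio sketch. Everything here other than the limit of 32 is an assumption: run_ef5 is a hypothetical stand-in that sleeps instead of launching the EF5 binary, and the real tool may use threads or processes rather than asyncio.

```python
import asyncio

EF5_CONCURRENCY = 32  # concurrency limit stated in the text


async def run_ef5(params, gate, tracker):
    """Hypothetical stand-in for one EF5 invocation; sleeps instead of simulating."""
    async with gate:  # at most `limit` coroutines are inside this block at once
        tracker["active"] += 1
        tracker["peak"] = max(tracker["peak"], tracker["active"])
        await asyncio.sleep(0.001)  # placeholder for running the simulator
        tracker["active"] -= 1
    return {"params": params}


async def run_batch(param_sets, limit=EF5_CONCURRENCY):
    """Launch all simulations concurrently, bounded by the semaphore."""
    gate = asyncio.Semaphore(limit)
    tracker = {"active": 0, "peak": 0}
    results = await asyncio.gather(*(run_ef5(p, gate, tracker) for p in param_sets))
    return results, tracker["peak"]
```

Calling `asyncio.run(run_batch(list(range(100))))` submits 100 simulated runs while never allowing more than 32 to execute at the same time.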

| Setting | Value |
|---|---|
| Base model | Qwen/Qwen3-4B-Instruct-2507 (full FT, BF16, FSDP) |
| Advantage estimator | GRPO (group-relative, critic-free) |
| Rollouts per prompt (K) | 6–8 (batch = 4 prompts/step) |
| Max calibration turns per rollout | 50 |
| Actor learning rate | 1×10⁻⁶ (5×10⁻⁶ caused collapse) |
| KL anchor coeff. | 0.2 (strong anchor to SFT init) |
| Entropy coeff. | 0.01 |
| Sampling | temperature 1.0, top-p 0.95 |
| Epochs | 30 over the 10-gauge training set |
| Infrastructure | verl 0.5 + SGLang on 4×H100, EF5 concurrency 32 |

Full configs and the eval harness are in the code repository.

Where Do Frontier Agents Stand?

Each agent calibrates four held-out gauges inside an identical Linux sandbox, with a 20-round budget (up to 200 EF5 simulations per gauge). Best-NSE and rounds-used are tightly coupled: the strongest models use the full horizon, while several pro-tier reasoning models terminate after 1–2 rounds with budget remaining. Performance bands follow Moriasi et al. (2015); the dashed line at NSE = 0.85 is the human-expert reference (an experienced hydrologist using domain knowledge + the DREAM sampler).
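The headline statistic, mean best-of-rounds NSE, can be computed as in the sketch below. The per-round values are placeholders made up for illustration and are not benchmark results; only the 0.85 reference comes from the text. An agent that stops early simply contributes a shorter list.

```python
HUMAN_REFERENCE_NSE = 0.85  # human-expert reference stated in the text


def best_of_rounds(nse_by_round):
    """Best NSE reached within the rounds the agent actually used."""
    return max(nse_by_round)


def mean_best_nse(per_gauge_rounds):
    """Average the per-gauge best NSE across all gauges."""
    bests = [best_of_rounds(rounds) for rounds in per_gauge_rounds.values()]
    return sum(bests) / len(bests)


def reaches_reference(per_gauge_rounds, reference=HUMAN_REFERENCE_NSE):
    """Per-gauge flag: did the agent match or exceed the reference NSE?"""
    return {g: best_of_rounds(r) >= reference for g, r in per_gauge_rounds.items()}


# Placeholder trajectories (NOT real results): one agent, two gauges.
example = {
    "gauge_a": [0.10, 0.40, 0.30],  # used three rounds
    "gauge_b": [-0.50, 0.20],       # terminated after two rounds
}
```

With these placeholder inputs, `mean_best_nse(example)` averages 0.40 and 0.20, and `reaches_reference(example)` is false for both gauges.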

How Does HydroAgent Improve Streamflow Prediction?

Relative change in four hydrologic-fit metrics for HydroAgent-4B vs the Qwen3-4B baseline

Relative change in four hydrologic-fit metrics for HydroAgent-4B versus the untuned Qwen3-4B-Instruct baseline, on the four held-out gauges. Blue = improvement, red = degradation; size encodes magnitude. Simulator-grounded post-training improves the metrics calibration is judged by: NSE improves on all four basins (from +5% to over +200%), discharge correlation improves on all four, and timing offset (|Lag|) contracts on all four. The lone residual cost is peak-ratio error on gauge 02338660, consistent with the volume- and peak-weighted shaping of the RLSF reward.
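For readers unfamiliar with the diagnostics named in the caption, the sketch below gives one plausible definition of each. These definitions are assumptions: the source does not state how correlation, peak ratio, or lag are computed, so Pearson correlation, the ratio of maximum discharges, and the correlation-maximizing time shift are used here as common choices.

```python
def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    norm_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (norm_x * norm_y)


def peak_ratio(observed, simulated):
    """Simulated peak discharge divided by observed peak discharge (1.0 is ideal)."""
    return max(simulated) / max(observed)


def best_lag(observed, simulated, max_lag=6):
    """Time shift (in steps) that maximizes correlation.

    Positive values mean the simulated hydrograph responds later than observed.
    """
    best_shift, best_r = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        if lag >= 0:
            obs, sim = observed[: len(observed) - lag], simulated[lag:]
        else:
            obs, sim = observed[-lag:], simulated[: len(simulated) + lag]
        r = pearson(obs, sim)
        if r > best_r:
            best_shift, best_r = lag, r
    return best_shift
```

The functions assume that neither series is constant over any compared window; a production version would guard against the resulting division by zero.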

Per-gauge ablation of the SFT + RLSF stack

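| Gauge | Method | Best NSE | Sims | Turns | Parse fail |
|---|---|---|---|---|---|
| 07144100 | Baseline | 0.34 | 15 | 50 | 0 |
| 07144100 | SFT-only | 0.07 | 1 | 15 | 4 |
| 07144100 | HydroAgent | 0.65 | 13 | 50 | 0 |
| 06279500 | Baseline | −1.41 | 13 | 50 | 1 |
| 06279500 | SFT-only | −2.27 | 1 | 13 | 4 |
| 06279500 | HydroAgent | −0.84 | 17 | 50 | 0 |
| 02338660 | Baseline | 0.65 | 13 | 50 | 0 |
| 02338660 | SFT-only | −17.53 | 1 | 11 | 4 |
| 02338660 | HydroAgent | 0.68 | 12 | 50 | 0 |
| 01403060 | Baseline | −0.15 | 16 | 50 | 0 |
| 01403060 | SFT-only | 0.58 | 4 | 16 | 4 |
| 01403060 | HydroAgent | 0.40 | 14 | 50 | 0 |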

Higher NSE is better. SFT alone is unstable; the full SFT + RLSF stack restores long-horizon iteration and zero parse failures.
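As a worked check on the table, the sketch below computes the relative NSE change of HydroAgent over the baseline for each gauge. The best-NSE values are copied from the table; the relative-change definition, which divides by the absolute baseline so that negative baselines keep the right sign, is an assumption about how such percentages are derived.

```python
# Best NSE per gauge, copied from the ablation table.
BEST_NSE = {
    "07144100": {"baseline": 0.34, "hydroagent": 0.65},
    "06279500": {"baseline": -1.41, "hydroagent": -0.84},
    "02338660": {"baseline": 0.65, "hydroagent": 0.68},
    "01403060": {"baseline": -0.15, "hydroagent": 0.40},
}


def relative_change(new, old):
    """Signed change relative to the magnitude of the baseline value."""
    return (new - old) / abs(old)


def nse_gains(table):
    """Relative NSE change of HydroAgent over the baseline, per gauge."""
    return {
        gauge: relative_change(row["hydroagent"], row["baseline"])
        for gauge, row in table.items()
    }


if __name__ == "__main__":
    for gauge, gain in nse_gains(BEST_NSE).items():
        print(f"{gauge}: {gain:+.0%}")
```

Under this definition all four gains are positive, with the smallest at gauge 02338660 (about +5%) and the largest at gauge 01403060 (well over +200%).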

Calibrated Hydrographs

Observed vs simulated hydrographs for HydroAgent and the Qwen3-4B baseline across four held-out gauges

Observed (black) versus simulated discharge for HydroAgent-4B (orange) and the Qwen3-4B baseline (gray) across the four held-out gauges, with precipitation forcing shown on the inverted top axis. Simulator-grounded RL aligns event peaks and timing while stabilizing volume across distinct hydroclimatic regions.

Benchmark Basins

Map of 10 training and 4 testing CONUS gauges, sized by basin area

The benchmark comprises ten CONUS training gauges (539–2,401 km²) and four geographically held-out testing gauges (329–40,792 km²) spanning distinct hydroclimatic regions. Each gauge uses hourly forcings (MRMS gauge-corrected precipitation, daily PET) and 13 calibrated CREST parameters. Each window is a clear flood event (rising and receding limbs) selected by a flood-window audit over the observation series.

Training set (10 gauges)

| Gauge ID | Basin (km²) | Lat | Lon | Window (UTC) |
|---|---|---|---|---|
| 11383500 | 539 | 40.014 | −121.948 | 2018-05-19 → 07-17 |
| 11043000 | 575 | 33.480 | −117.144 | 2019-03-15 → 05-13 |
| 11152000 | 632 | 36.281 | −121.323 | 2018-05-29 → 07-27 |
| 02294781 | 1,064 | 27.825 | −81.802 | 2018-04-29 → 06-27 |
| 02312000 | 1,476 | 28.480 | −82.178 | 2018-11-15 → 01-13 |
| 07195430 | 1,489 | 36.109 | −94.533 | 2018-01-04 → 03-04 |
| 11179000 | 1,639 | 37.587 | −121.961 | 2018-06-03 → 08-01 |
| 14301000 | 1,727 | 45.704 | −123.755 | 2018-09-11 → 11-09 |
| 14207500 | 1,828 | 45.351 | −122.676 | 2018-04-09 → 06-07 |
| 11376000 | 2,401 | 40.387 | −122.239 | 2018-09-21 → 11-19 |

Testing set (4 held-out gauges)

| Gauge ID | Basin (km²) | Lat | Lon | Window (UTC) |
|---|---|---|---|---|
| 02338660 | 329 | 33.236 | −84.988 | 2018-07-01 → 08-31 |
| 01403060 | 2,033 | 40.551 | −74.548 | 2018-11-11 → 01-09 |
| 06279500 | 40,792 | 44.759 | −108.182 | 2018-06-13 → 08-11 |
| 07144100 | 3,209 | 37.883 | −97.425 | 2019-03-30 → 05-28 |

Acknowledgement

We thank Modal for sponsoring the compute credits for this research. The code is released under the MIT license; see the project repository for training, evaluation, and data-preparation scripts.

BibTeX

@misc{li2026hydroagentclosinggapfrontier,
      title={HydroAgent: Closing the Gap Between Frontier LLMs and Human Experts
             in Hydrologic Model Calibration via Simulator-Grounded RL},
      author={Zhi Li and Songkun Yan and Jie Cao and Mofan Zhang
             and Anjiang Wei and Jinwoong Yoo and Yang Hong},
      year={2026},
      eprint={2605.17792},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2605.17792},
}