NCAR RAL

April 16, 2026

Before We Reproduce AI-Assisted Science, We Need to Make It Reproducible

6th Chameleon User Meeting | Applying AI to Accelerate Computational R&R

Victor Weeks
Software Engineer, NSF NCAR | 2026 BSSw Fellow

This material is based upon work supported by the NSF National Center for Atmospheric Research, a major facility sponsored by the U.S. National Science Foundation and managed by the University Corporation for Atmospheric Research. This work is also supported by the Better Scientific Software Fellowship Program, funded by the U.S. Department of Energy and National Science Foundation.

AI to Reproduce Computational Artifacts

  • Reconstruct software environments
  • Resolve dependency conflicts
  • Navigate platform-specific quirks
  • Automate artifact evaluation

AI to Build Computational Artifacts

  • Write scientific software
  • Configure computational experiments
  • Make architectural decisions
  • Generate analysis pipelines

“Vibe Coding”

Conversational AI assists with code generation and refactoring

MCP & Tools

Agents access external tools, APIs, and platform infrastructure

Autonomous Agents

AI makes multi-step decisions, configures systems, and executes workflows

Failure modes distinct from standard software bugs:

  • Numerical Precision Errors: AI agents may introduce floating-point assumptions that silently degrade accuracy in physical simulations
  • Subtle Physics Violations: Generated code can violate conservation laws or domain constraints not captured by standard test suites
  • Opaque Decision Histories: When an agent makes architectural choices, the rationale may not be recorded, making the artifact harder to describe and evaluate
[Figure: satellite image of a hurricane, representing atmospheric science]
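The first two failure modes can often be caught with inexpensive, domain-aware assertions that a generic test suite would not think to make. The Python sketch below is illustrative only: the 1D upwind advection scheme, the hypothetical `upwind_step_leaky` function standing in for an agent-introduced bug, and the tolerances are assumptions chosen for the example, not part of any specific SAE workflow. `precision_ok` flags code that silently assumes floating-point summation order does not matter, using the standard library's `math.fsum` as an exactly rounded reference.

```python
import math


def naive_sum(values):
    """Left-to-right accumulation, the kind of loop generated code often uses."""
    total = 0.0
    for v in values:
        total += v
    return total


def precision_ok(values, rel_tol=1e-12):
    """Compare naive accumulation against math.fsum (exactly rounded sum)."""
    return math.isclose(naive_sum(values), math.fsum(values), rel_tol=rel_tol)


def upwind_step_periodic(q, courant):
    """Conservative first-order upwind step; q[i - 1] wraps at i = 0 (periodic)."""
    return [q[i] - courant * (q[i] - q[i - 1]) for i in range(len(q))]


def upwind_step_leaky(q, courant):
    """Hypothetical buggy variant: zero inflow at the left edge, so mass that
    leaves through the right edge is never returned to the domain."""
    return [q[i] - courant * (q[i] - (q[i - 1] if i > 0 else 0.0))
            for i in range(len(q))]


def mass_conserved(before, after, dx, rel_tol=1e-12):
    """Domain check: total mass sum(q) * dx must not change on a periodic domain."""
    return math.isclose(math.fsum(before) * dx, math.fsum(after) * dx,
                        rel_tol=rel_tol)


if __name__ == "__main__":
    # The large terms cancel and swallow the small one in naive accumulation.
    print(precision_ok([1e16, 1.0, -1e16]))  # False: naive gives 0.0, fsum gives 1.0
    q = [1.0, 2.0, 3.0, 4.0]
    print(mass_conserved(q, upwind_step_periodic(q, 0.5), dx=1.0))  # True
    print(mass_conserved(q, upwind_step_leaky(q, 0.5), dx=1.0))     # False
```

On a periodic domain the conservative update only moves mass between cells, so any change in the total indicates a bug rather than physics, even when the output field looks plausible.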

Validation Loops

Domain-specific checks integrated into agent workflows — not just at the end, but at each decision point
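A validation loop can be as simple as running every registered domain check after every step, rather than once on the final output. In this minimal sketch, which assumes a workflow whose state is a plain dictionary, the step names, `run_with_validation`, and the Courant-number (CFL) check are hypothetical; a real agent framework would hook the same idea into its own tool-call or planning cycle.

```python
from typing import Callable, Dict, List, Optional, Tuple

State = Dict[str, float]
Step = Callable[[State], State]
Check = Callable[[State], Optional[str]]  # returns a problem description or None


class ValidationError(RuntimeError):
    """Raised when a domain check fails after a workflow step."""


def run_with_validation(steps: List[Tuple[str, Step]], checks: List[Check],
                        state: State) -> State:
    """Apply each named step in order, running every check after every step."""
    for name, step in steps:
        state = step(state)
        for check in checks:
            problem = check(state)
            if problem is not None:
                raise ValidationError(f"after step '{name}': {problem}")
    return state


def cfl_check(state: State) -> Optional[str]:
    """A first-order explicit upwind advection scheme is unstable when the
    Courant number u*dt/dx exceeds 1."""
    if not all(key in state for key in ("u", "dt", "dx")):
        return None  # not enough information yet; re-checked after later steps
    courant = abs(state["u"]) * state["dt"] / state["dx"]
    if courant > 1.0:
        return f"Courant number {courant:.2f} exceeds 1"
    return None


if __name__ == "__main__":
    steps = [
        ("choose_grid", lambda s: {**s, "dx": 1000.0}),     # 1 km grid spacing
        ("choose_timestep", lambda s: {**s, "dt": 120.0}),  # time step in seconds
    ]
    try:
        run_with_validation(steps, [cfl_check], {"u": 20.0})  # wind speed in m/s
    except ValidationError as err:
        print(err)  # after step 'choose_timestep': Courant number 2.40 exceeds 1
```

Failing at `choose_timestep` points the reviewer at the decision that caused the problem, instead of leaving them to reverse-engineer an unstable run after the fact.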

Human Oversight

Defined points where humans review agent decisions, especially for choices with scientific implications
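Defined review points also help with opaque decision histories if each decision is recorded with its rationale at the moment it is made. The sketch below assumes a hypothetical `Decision` schema and uses an `approve` callback as a stand-in for a human reviewer (in practice a prompt, a pull-request review, or a dashboard). Routine choices pass through automatically, while choices flagged as scientifically significant wait for sign-off.

```python
import json
from dataclasses import asdict, dataclass
from typing import Callable, List


@dataclass
class Decision:
    """One agent decision, stored with its rationale (hypothetical schema)."""
    step: str
    choice: str
    rationale: str
    scientific_impact: bool  # True if the choice could change scientific results


def commit_decision(log: List[dict], decision: Decision,
                    approve: Callable[[Decision], bool]) -> bool:
    """Record a decision, asking the reviewer only when it has scientific impact.

    Returns True if the agent may proceed with the choice.
    """
    entry = asdict(decision)
    # Routine choices proceed automatically; significant ones need sign-off.
    entry["approved"] = approve(decision) if decision.scientific_impact else True
    log.append(entry)
    return entry["approved"]


def to_jsonl(log: List[dict]) -> str:
    """Serialize the decision log as JSON Lines for archiving with the artifact."""
    return "\n".join(json.dumps(entry, sort_keys=True) for entry in log)


def reject_all(decision: Decision) -> bool:
    """Stand-in reviewer that rejects every decision it is shown."""
    return False


if __name__ == "__main__":
    log: List[dict] = []
    commit_decision(log, Decision("lint", "apply formatter", "style only", False),
                    reject_all)
    commit_decision(log, Decision("solver", "use float32 state arrays",
                                  "halves memory use", True), reject_all)
    print(to_jsonl(log))
```

JSON Lines is one possible way to give an Artifact Description a machine-readable record of why the artifact looks the way it does; the format is an assumption here, not a requirement of any artifact evaluation process.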

Standardized Tooling

Reusable MCP server templates, scientific skills, and agent configurations that encode domain knowledge

If AI agents build computational artifacts without guardrails, those artifacts become harder to describe in an Artifact Description and harder to validate in an Artifact Evaluation.

On the reproduce side, AI agents that reconstruct environments and run evaluations need the same kinds of guardrails: validation that the reconstructed environment actually matches the original, and domain-aware checks on whether the reproduction is scientifically valid.
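The simplest version of checking that a reconstructed environment matches the original is a comparison against a manifest recorded when the artifact was built. This sketch assumes such a `{package: version}` manifest exists (the package names and versions in the usage example are hypothetical) and uses the standard-library `importlib.metadata` module to read the installed versions.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, List


def environment_mismatches(expected: Dict[str, str]) -> List[str]:
    """Compare a recorded {package: version} manifest with the packages
    installed in the reconstructed environment; return human-readable problems."""
    problems = []
    for name, wanted in sorted(expected.items()):
        try:
            found = version(name)
        except PackageNotFoundError:
            problems.append(f"{name}: missing (expected {wanted})")
            continue
        if found != wanted:  # exact match; deliberately strict
            problems.append(f"{name}: found {found}, expected {wanted}")
    return problems


if __name__ == "__main__":
    # Hypothetical manifest captured when the artifact was originally built.
    manifest = {"numpy": "1.26.4", "xarray": "2024.3.0"}
    for line in environment_mismatches(manifest):
        print(line)
```

Exact string comparison is strict by design; a looser policy (for example, accepting patch releases) would need real version parsing, such as the `packaging` library provides. Matching Python packages is also only a first layer: compilers, MPI libraries, and hardware can still differ, which is why domain-aware checks on the reproduced results still matter.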

The same structural patterns that Scientific Agentic Engineering (SAE) defines (validation loops, domain-specific safeguards, and standardized tooling) serve both sides of this coin.

Build-time discipline is upstream of reproduce-time reliability.

[Figure: multi-agent AI workflow, showing scientific agentic engineering in action]

Scientific Agentic Engineering (SAE)

  • Framework & methodology
  • Community repository of scientific MCP templates
  • Reference workflow & tutorials
  • Workshops

All coming over the next 12 months.

vweeks.github.io/scientific-agentic-engineering
