Train Once, Infer a Million Times: Bayesian Inference That's (Finally) Cheap at Query Time
Alex Andorra

MCMC re-runs from scratch on every new dataset. Amortized Bayesian inference pays the cost once -- then gives you a posterior in milliseconds, forever. Stefan Radev (BayesFlow creator) and I built an Agent Skill that makes it reliable. Come join us, your first trained amortizer is ~3 minutes away 😉

And how Stefan Radev and I built an Agent Skill to stop your coding agent from getting it silently wrong.

Imagine for a moment you're running a clinical study. Same model for each participant... but you have one million participants, and you need a posterior for each of them.

With MCMC, you're in trouble. Even if a single run takes 30 seconds, you're looking at ~347 days of sampling. Clusters, parallelization, nutpie: sure, knock it down to a week or two. But now multiply by every update to the data, every sensitivity check, every "what if we got another hundred patients?" question from the stakeholders -- that's a long time if you can't take a sabbatical.
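The back-of-the-envelope numbers above are easy to reproduce. Here is a minimal sketch, assuming perfect parallel scaling; the helper name `mcmc_wall_clock_days` and the choice of 32 workers are hypothetical, picked only for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def mcmc_wall_clock_days(n_datasets, seconds_per_fit, n_workers=1):
    """Wall-clock days to fit every dataset, assuming perfect parallel scaling."""
    total_seconds = n_datasets * seconds_per_fit
    return total_seconds / n_workers / SECONDS_PER_DAY


# One million participants, 30 seconds per MCMC run (figures from the text)
serial_days = mcmc_wall_clock_days(1_000_000, 30)

# Hypothetical cluster with 32 workers and no scheduling overhead
parallel_days = mcmc_wall_clock_days(1_000_000, 30, n_workers=32)

print(f"serial: {serial_days:.0f} days")
print(f"32 workers: {parallel_days:.1f} days")
```

Under these assumptions the serial run lands at roughly 347 days and the 32-worker run at roughly 11 days, and that is for a single pass over the data.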

This isn't a contrived scenario, by the way. It's the one Marvin Schmitt described on LBS episode 107 -- a real motivating use case where Bayesian inference as we normally practice it is just… infeasible.

This is where amortized Bayesian inference (ABI) comes in. And it's why Stefan Radev (the creator of BayesFlow) and I built an Agent Skill to make coding agents save your year -- correctly.

The philosophy

The philosophy behind this skill is the same as the bayesian-workflow and causal-inference skills before it: enforce the workflow, not just the code.

BayesFlow gives you all the building blocks. The skill tells the agent which block to reach for, in what order, with which guardrails, and how to prove to you that the result is trustworthy: architecture choice, simulator sanity, diagnostic gates, report structure.

Amortized inference isn't magic -- it's a lot of small decisions where getting one wrong silently breaks everything. The skill's job is to make sure none of them are silent, and that the trustworthy path is the default path.

Train once. Infer a million times. And know that your posterior is worth trusting.

Wait -- amortized what?

The word "amortized" comes from finance -- I know, sexy right? If you amortize a loan, you pay a big cost upfront and spread it out, so the marginal cost of each transaction becomes small. Amortized Bayesian inference does the same thing... with posteriors.

The trade is this: instead of running MCMC every time you get new data, you train a neural network once -- typically for a few minutes to a few hours -- to approximate the posterior for the entire family of possible datasets your model could generate. After that, getting a posterior for any new dataset is just a forward pass through the network. Milliseconds. No warmup. No tuning. No divergences to stare at.

A million participants? That's a million forward passes. Practical.

And there's a second superpower: you don't need a likelihood function. As Jonas Arruda demonstrated beautifully on LBS episode 151, ABI falls under the broader umbrella of simulation-based inference -- you hand it a simulator (prior + forward model) and it learns the posterior from simulations alone. If you have a mechanistic model for epidemics, cosmology, or your sport of choice (yes, soccer, of course), and the likelihood is intractable or prohibitive to compute -- you can still do Bayesian inference. It's like being able to sip on your favorite espresso while camping deep in the forest -- a pretty big deal!
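To make "train once, then infer from simulations alone" concrete, here is a deliberately tiny sketch in plain Python. It is not BayesFlow code, and every name in it is hypothetical. It assumes a toy model (prior mu ~ Normal(0, 1), ten observations from Normal(mu, 1)) and swaps the neural network for a least-squares line, so it only learns the posterior mean rather than a full posterior. The point is the shape of the workflow: the simulator is called many times up front, the likelihood is never evaluated, and inference afterwards is a single cheap function call.

```python
import random

random.seed(0)
N_OBS = 10  # observations per dataset (toy assumption)


def simulate():
    """Prior draw plus forward model; no likelihood evaluation anywhere."""
    mu = random.gauss(0.0, 1.0)                            # prior: mu ~ Normal(0, 1)
    data = [random.gauss(mu, 1.0) for _ in range(N_OBS)]   # forward model
    return mu, data


def summary(data):
    """Hand-picked summary statistic: the sample mean."""
    return sum(data) / len(data)


# "Training", done once: least-squares fit of the parameter on the summary
sims = [simulate() for _ in range(20_000)]
xs = [summary(data) for _, data in sims]
ys = [mu for mu, _ in sims]
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)
slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
    (x - x_bar) ** 2 for x in xs
)
intercept = y_bar - slope * x_bar


def amortized_posterior_mean(data):
    """Inference for any new dataset: one cheap function evaluation."""
    return intercept + slope * summary(data)


# For this conjugate toy model the exact posterior mean is n * mean / (n + 1)
exact_slope = N_OBS / (N_OBS + 1)
print(f"learned slope: {slope:.3f}, exact slope: {exact_slope:.3f}")
```

Because this toy model is conjugate, the learned slope can be checked against the closed-form answer. With a real simulator there is no closed form, which is exactly why the diagnostics discussed later matter.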

The diagnostic flip

Here's the part that delighted me when Stefan first walked me through it.

With MCMC, simulation-based calibration (SBC) is the gold standard for checking whether your sampler produces calibrated posteriors. The catch is that doing SBC properly means re-fitting your model on ~1000 simulated datasets. For a moderately complex PyMC model that takes a minute to sample, that's about 17 hours (1000 minutes) just to check calibration -- which is why, let's be honest, most of us skip it.

With an amortized estimator, SBC is cheap. You trained the network on simulations, remember? Running it on 1000 more simulated datasets is seconds. What was previously a luxury becomes routine -- like getting upgraded to first class on all your intercontinental flights.

And that flips the Bayesian workflow on its head. Diagnostics that were previously "we'd love to but can't afford it" become mandatory. You can actually afford to prove your posterior is well-calibrated before trusting it.
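For readers who haven't seen SBC spelled out, here is a minimal sketch of the rank statistic behind it, in plain Python rather than BayesFlow. It assumes a toy conjugate model (prior mu ~ Normal(0, 1), ten observations from Normal(mu, 1)) so that the exact posterior is available. The `posterior_sd_scale` knob and both function names are hypothetical, added only to mimic an overconfident estimator.

```python
import random

random.seed(1)
N_OBS, N_SIMS, N_DRAWS = 10, 1000, 99


def sbc_ranks(posterior_sd_scale):
    """Rank of each true parameter among its posterior draws.

    posterior_sd_scale=1.0 uses the exact conjugate posterior;
    values below 1.0 mimic an overconfident estimator.
    """
    ranks = []
    for _ in range(N_SIMS):
        mu_true = random.gauss(0.0, 1.0)  # draw from the prior
        x_bar = sum(random.gauss(mu_true, 1.0) for _ in range(N_OBS)) / N_OBS
        post_mean = N_OBS * x_bar / (N_OBS + 1)
        post_sd = posterior_sd_scale * (1.0 / (N_OBS + 1)) ** 0.5
        draws = [random.gauss(post_mean, post_sd) for _ in range(N_DRAWS)]
        ranks.append(sum(d < mu_true for d in draws))  # integer in 0..N_DRAWS
    return ranks


def tail_fraction(ranks, width=10):
    """Share of ranks in the lowest or highest `width` positions."""
    return sum(r < width or r > N_DRAWS - width for r in ranks) / len(ranks)


calibrated = tail_fraction(sbc_ranks(1.0))
overconfident = tail_fraction(sbc_ranks(0.5))
print(f"calibrated tails: {calibrated:.2f}, overconfident tails: {overconfident:.2f}")
```

If the posterior is calibrated, the ranks are uniform, so about 20% of them fall in the 20 extreme positions out of 100. A posterior that is too narrow piles ranks up at both ends, which is the pattern a calibration ECDF plot makes visible.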

So why a skill?

Here's the thing: BayesFlow is brilliant, but the surface area is large. There are several summary network options (SetTransformer, TimeSeriesTransformer, FusionTransformer, ConvolutionalNetwork), several inference networks (FlowMatching, DiffusionModel, StableConsistencyModel, coupling flows), three training regimes (online, offline, disk), and an adapter system that needs to route data to the right slots.

Miss one choice, and things fail in the worst possible way: silently. The loss curve goes down. The code runs, yes, but the posterior is wrong.

For example, if you have N exchangeable observations (like a regression dataset, or repeated measurements), you must route them through summary_variables with a SetTransformer. If your agent flattens them into inference_conditions, training converges, inference runs, the numbers look plausible -- and the posterior is invalid because the network doesn't know it's looking at a set.
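Why does "knowing it's a set" matter? A minimal illustration, in plain Python and not using BayesFlow at all: the two hypothetical functions below stand in for the two routing choices. A flattened vector changes when you reorder the observations, while an order-free summary does not, so only the latter guarantees the same answer for the same dataset.

```python
import math


def flatten(observations):
    """Treats the dataset as an ordered vector: position carries meaning."""
    return tuple(observations)


def set_summary(observations):
    """Order-free summary: count, mean, and variance of the observations."""
    n = len(observations)
    mean = math.fsum(observations) / n  # fsum is exact, so order cannot matter
    var = math.fsum((x - mean) ** 2 for x in observations) / n
    return (n, mean, var)


data = [2.1, -0.4, 0.9, 1.7, -1.2]
reordered = data[::-1]  # same dataset, different order

print(flatten(data) == flatten(reordered))          # the flat view changes
print(set_summary(data) == set_summary(reordered))  # the set view does not
```

A set-based summary network builds this invariance into the architecture. A network fed the flat vector has to discover it from training data, and nothing in the loss curve tells you whether it did.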

Or, if you try to do image denoising (your inferential target is a 28×28 image), you need a DiffusionModel with a UNet subnet. Use the default setup and your agent will confidently "train a model" that doesn't even have the right tensor shapes to represent the output.

An agent doing this without a skill will very often get the architecture wrong -- because the defaults look like they apply and the error messages don't appear until much later, when you ask why the posterior looks strange.

The skill enforces the right choices and calls them out as MUST and NEVER rules.

Making ABI fast at iteration time

ABI is fast at inference time -- that's the whole point. But v2 of the skill, which Stefan led, tackles a different kind of speed: how fast can you go from "I have a generative story" to "I have diagnostics I trust"?

He added three things:

  • Offline-first by default. The first pass always uses fit_offline with a pre-simulated pilot budget of 20k datasets (for fast simulators) or 3–5k (for slow ones), trained for 100 epochs. On our benchmarks, that's about 3 minutes to a trained amortizer for typical problems. Online training becomes a refinement step, not the starting point.
  • Mandatory structured report. Every training + diagnostics run writes a report.md in a slug-named folder, with fixed figure names (loss.png, recovery.png, calibration_ecdf.png, coverage.png, z_score_contraction.png) and a template with structured descriptions. You don't get "some output" -- you get a publication-shaped artifact.
  • Programmatic diagnostic interpretation. The skill ships a check_diagnostics.py script that takes a metrics DataFrame and returns per-parameter qualitative ratings ("excellent calibration; good recovery; high contraction") and a suggest_next_steps() function that combines training + diagnostic reports into an ordered action list. The skill diagnoses the diagnostics -- it doesn't just run them and leave you guessing.
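One practical upside of fixed file names is that a run folder can be checked mechanically. Here is a minimal sketch of such a check. The file names come from the list above, but the `slugify` rule and the `missing_report_files` helper are hypothetical and are not taken from the skill itself.

```python
import re
import tempfile
from pathlib import Path

# File names as listed in the post; everything else here is an assumption
EXPECTED_FILES = [
    "report.md",
    "loss.png",
    "recovery.png",
    "calibration_ecdf.png",
    "coverage.png",
    "z_score_contraction.png",
]


def slugify(title):
    """Hypothetical slug rule: lowercase, non-alphanumeric runs become '-'."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def missing_report_files(run_dir):
    """Names of expected artifacts that are absent from the run folder."""
    return [name for name in EXPECTED_FILES if not (run_dir / name).exists()]


# Demo on a throwaway folder containing only two of the six artifacts
with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / slugify("AR(2) time series")
    run_dir.mkdir()
    (run_dir / "report.md").write_text("# placeholder\n")
    (run_dir / "loss.png").touch()
    missing = missing_report_files(run_dir)

print(missing)
```

An empty list means the run produced every artifact the report template expects; anything else is a prompt to rerun the diagnostics step.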

This last one is important. Before v2, the agent still had to interpret calibration and contraction numbers from a DataFrame -- and interpretation is exactly where agents are weakest. Now it's programmatic.
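To give a feel for what "programmatic interpretation" can look like, here is a small sketch that turns three numbers into the kind of rating string quoted above. It is not the skill's check_diagnostics.py: the function names, the choice of metrics, and every cutoff are hypothetical placeholders.

```python
def rate(value, cutoffs, labels):
    """Return the label of the first cutoff that `value` does not exceed."""
    for cutoff, label in zip(cutoffs, labels):
        if value <= cutoff:
            return label
    return labels[-1]  # worse than every cutoff


def rate_parameter(calibration_error, nrmse, contraction):
    """Qualitative rating for one parameter (all cutoffs are made up)."""
    calibration = rate(calibration_error, (0.02, 0.05), ("excellent", "good", "poor"))
    recovery = rate(nrmse, (0.1, 0.3), ("excellent", "good", "poor"))
    # Contraction is better when higher, so rate its complement
    shrinkage = rate(1.0 - contraction, (0.2, 0.6), ("high", "moderate", "low"))
    return f"{calibration} calibration; {recovery} recovery; {shrinkage} contraction"


print(rate_parameter(calibration_error=0.01, nrmse=0.2, contraction=0.9))
print(rate_parameter(calibration_error=0.10, nrmse=0.5, contraction=0.1))
```

The design point is that the thresholds live in code, where they can be reviewed and versioned, instead of being re-derived by the agent on every run.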

The numbers

We ran 7 eval scenarios covering the terrain: Gaussian location-scale, multi-parameter constrained models, variable-N regression, AR(2) time series, non-identifiable mixtures, offline simulation banks, and Bayesian image denoising (Fashion MNIST).

Scenario                       With skill      Without skill
Gaussian location-scale        100%            86%
Multi-param constraints        100%            92%
Regression, varying N          100%            75%
Time series (AR2)              100%            92%
Non-identifiable mixture       100%            75%
Offline simulation bank        100%            92%
Bayesian denoising (images)    100%            83%
Overall                        100% (86/86)    84.9% (73/86)

A +15.1 percentage-point lift.

Where does the gap come from? Three recurring failure modes in the without-skill runs:

  • Wrong routing. Exchangeable observations pushed into inference_conditions instead of summary_variables with a SetTransformer. Code runs; posterior is wrong.
  • Wrong architecture for images. When the target is an image, you need DiffusionModel(subnet=UNet). Agents default to the usual coupling-flow setup, which can't even represent a 2D spatial field properly.
  • Skipped diagnostics gates. Without the skill, training loss going down is often taken as "done." The skill enforces SBC, coverage, and posterior contraction checks with house thresholds before touching real data.
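The third failure mode is easier to picture with numbers. Here is a minimal sketch of two of the gated quantities, coverage and posterior contraction, on a toy conjugate model (prior mu ~ Normal(0, 1), ten observations from Normal(mu, 1)) where the exact posterior is known. It is plain Python, not BayesFlow, and the gate thresholds at the end are hypothetical rather than the skill's house thresholds.

```python
import random
from statistics import NormalDist

random.seed(3)
N_OBS, N_SIMS = 10, 2000
PRIOR_VAR = 1.0
POST_VAR = 1.0 / (N_OBS + 1)  # exact posterior variance for this toy model

# Posterior contraction: how much the data shrink the prior uncertainty
contraction = 1.0 - POST_VAR / PRIOR_VAR

# Coverage: how often the central 90% interval contains the true value
z90 = NormalDist().inv_cdf(0.95)
half_width = z90 * POST_VAR ** 0.5
hits = 0
for _ in range(N_SIMS):
    mu_true = random.gauss(0.0, PRIOR_VAR ** 0.5)
    x_bar = sum(random.gauss(mu_true, 1.0) for _ in range(N_OBS)) / N_OBS
    post_mean = N_OBS * x_bar / (N_OBS + 1)
    hits += abs(mu_true - post_mean) <= half_width
coverage = hits / N_SIMS

# Hypothetical gate: refuse to touch real data unless both checks pass
gate_passed = abs(coverage - 0.90) <= 0.03 and contraction >= 0.5
print(f"coverage={coverage:.3f}, contraction={contraction:.3f}, gate={gate_passed}")
```

A falling loss curve says nothing about either number. An estimator can train smoothly and still produce 90% intervals that contain the truth far less than 90% of the time.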

These aren't subtle failures. They're the difference between "I have a trained network" and "I have inference I can defend."

A word on Stefan

I want to pause here, because this skill wouldn't exist without Stefan Radev.

Stefan is the creator of BayesFlow -- the library the skill is built on -- and is one of the most generous collaborators I've worked with. When I first reached out about a skill for amortized inference, he didn't just give feedback from the sidelines. He wrote code, opened PRs, and ran pilot studies with his grad students to figure out what made agents stumble. V2's architecture came out of his direct experience watching people and agents use BayesFlow in practice.

This matters to me because open-source Bayesian tooling runs on this kind of collaboration, and that work isn't always rewarded as much as it should be. So if you try this skill and it works for you, please star BayesFlow and baygent-skills, and tell a friend! This is how the ecosystem gets better.

The honest caveats

A few things to keep in mind:

  • The amortization gap. Your trained amortizer is only reliable over the prior you trained it on. If your real data lives in a region the prior doesn't cover well, inference will quietly degrade. This is why mandatory in-silico diagnostics before real-data inference exist -- they catch it.
  • ABI isn't a drop-in replacement. For a one-off analysis on a single dataset, MCMC is often simpler and its diagnostics are more mature. Amortization shines when you'll re-infer many times, when the likelihood is intractable, or when you genuinely need millisecond inference.
  • Diagnostics are still required. "Fast neural inference" doesn't mean "skip the workflow". The skill enforces SBC, coverage, and contraction checks with house thresholds before you're allowed to trust the estimator on real data.
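For the first caveat, one cheap sanity check is to ask whether a summary of your real data looks like anything the simulator produced during training. Here is a minimal sketch on a toy model (prior mu ~ Normal(0, 1), ten observations from Normal(mu, 1)). The helper names and the choice of the sample mean as the summary are assumptions made for illustration, and this is not a procedure taken from the skill.

```python
import random

random.seed(4)
N_OBS = 10


def prior_predictive_means(n_sims):
    """Sample means of datasets simulated from the prior and forward model."""
    means = []
    for _ in range(n_sims):
        mu = random.gauss(0.0, 1.0)
        means.append(sum(random.gauss(mu, 1.0) for _ in range(N_OBS)) / N_OBS)
    return means


def tail_probability(observed, simulated):
    """Two-sided share of simulated summaries at least as extreme as observed."""
    below = sum(s <= observed for s in simulated) / len(simulated)
    return 2 * min(below, 1 - below)


sims = prior_predictive_means(5000)
typical = tail_probability(0.3, sims)        # well inside the training range
out_of_range = tail_probability(6.0, sims)   # far outside anything simulated
print(f"typical: {typical:.2f}, out of range: {out_of_range:.4f}")
```

A tail probability near zero means the amortizer is being asked to extrapolate, and its posterior for that dataset deserves suspicion no matter how good the in-silico diagnostics looked.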

Install

Just ask your agent: "Install the amortized-workflow skill: https://github.com/Learning-Bayesian-Statistics/baygent-skills/tree/main/amortized-workflow" -- and you're good to go. Unlike causal-inference, this skill is standalone: it doesn't depend on bayesian-workflow.

As always, it works with Claude Code, Cursor, Gemini CLI, Kimi Code, and any agent supporting the Agent Skills spec.

You'll also need BayesFlow: pip install "bayesflow>=2.0". JAX is the recommended backend; PyTorch and TensorFlow also work.
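BayesFlow 2 is built on Keras 3, which picks its backend from the KERAS_BACKEND environment variable at import time. Here is a minimal sketch of selecting JAX and checking that the package is visible, without actually importing it; the variable name `bayesflow_installed` is just for illustration.

```python
import importlib.util
import os

# Keras reads this variable when it is first imported, so set it before
# importing bayesflow; setdefault keeps any backend you already chose
os.environ.setdefault("KERAS_BACKEND", "jax")

# Check that the package can be found without importing it
bayesflow_installed = importlib.util.find_spec("bayesflow") is not None

print(f"backend: {os.environ['KERAS_BACKEND']}")
print(f"bayesflow installed: {bayesflow_installed}")
```

Setting the variable after Keras has already been imported has no effect, so this belongs at the very top of a script or notebook.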

The amortized-workflow skill is open source and part of the baygent-skills collection. If you try it and something doesn't work, open an issue -- Stefan and I both read them.

And if you want to go deeper on the ideas behind this, the podcast episodes that got me hooked are:

On that note, PyMCheers, my dear Bayesians!

Alexandre Andorra -- with Stefan Radev