Listen on your favorite platform:

In this episode, Alex Andorra sits down with Stefan Radev, an assistant professor at RPI and creator of BayesFlow — the Python library for amortized simulation-based inference. Stefan's path here runs through psychology, neuroscience, and a summer spent reading a Matlab textbook on the beach, which is a better origin story than most. The episode covers what amortized inference actually is, what the sim-to-real framing means in practice, and a live demo of an AI agent skill Stefan and Alex co-developed to guide you through a full amortized Bayesian workflow — automated, open source, and diagnostic-first.

From Psychology to Amortized Inference

Stefan started out as a psychology student in Heidelberg with no particular plan, landed an unpaid internship at a neuroscience lab, and was handed a copy of Zero to Hero in Matlab on the condition he learned to code over the summer. He did. From there: a parallel CS degree, a PhD in statistical modeling, mentors who shaped everything he knows about deep learning and Bayesian statistics, and eventually a faculty position at RPI.

What drew him to amortized inference was the same thing that drew a lot of people to it around 2016–2018: the realization that you could train a neural network to do what MCMC does, but without paying the cost again every time. The Bayesian revolution in the social sciences was happening, the replication crisis had made rigorous modeling feel urgent, and compute was finally cheap enough to experiment.

The Sim-to-Real Framing

The central idea behind simulation-based inference is simple: you have a model that's too complex to write down analytically, but you can simulate from it. You generate enormous quantities of labeled simulations (parameter values and the data they produce) and train a neural network to invert that mapping. Then you deploy the trained network on real, unlabeled data.
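As a toy illustration of that loop (not the BayesFlow API: a one-parameter Gaussian stands in for the simulator, and a linear least-squares fit stands in for the neural network):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(theta):
    # Toy simulator: 10 noisy observations of a single parameter.
    return theta + rng.normal(0.0, 1.0, size=10)

# 1. Generate labeled simulations: parameters from the prior, data from the model.
thetas = rng.normal(0.0, 2.0, size=5000)
sims = np.stack([simulate(t) for t in thetas])
features = np.column_stack([np.ones(len(thetas)), sims.mean(axis=1)])

# 2. Learn to invert the mapping data -> parameter (least squares stands in
#    for the neural network a real SBI method would train).
weights, *_ = np.linalg.lstsq(features, thetas, rcond=None)

# 3. Deploy on real, unlabeled data: inference is a single forward pass.
observed = rng.normal(1.5, 1.0, size=10)
estimate = weights @ np.array([1.0, observed.mean()])
```

Real amortized SBI learns a full posterior rather than a point estimate, but the train-on-simulations, deploy-on-data structure is the same.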

Stefan calls this sim-to-real: train on simulations, deploy on reality. It's distinct from the more common use of synthetic data in machine learning, where the goal is augmentation. Here the simulation is an epistemic tool — a mechanistic model of a process you're trying to understand. The guarantees come from the simulations; the network just makes it fast.

The payoff is what Stefan calls amortization: you pay the training cost once, and every subsequent inference is essentially free. For a study with 1.2 million participants, this meant a day of training instead of a projected year of MCMC. For 5 million participants, the numbers are now under an hour each way.

The BayesFlow Skill: AI-Guided Amortized Workflows

The highlight of the episode is a live demo — so if you're listening rather than watching, the YouTube version with chapters is worth the detour.

Stefan and Alex have built a Bayesian skill for AI agents that encodes a full end-to-end amortized inference workflow. The skill lives in the Bayesian Skills repo, is open source, and can be installed by pointing your agent at the link. The demo runs the skill on the Heston model — a stochastic volatility model from finance — with no real data, just a simulator, as a clean proof of concept.

What makes the skill interesting isn't just what it automates. It's what it prioritizes. The skill is built around progressive disclosure (load only what's needed), pilot runs before full training (iterate in under five minutes or you're doing it wrong), and a structured report at the end that covers convergence diagnostics, parameter recovery, calibration coverage, and — crucially — suggested next steps. The agent doesn't just run diagnostics and stop. It tells you what to do next, in the language of the research.
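Calibration coverage, one of the diagnostics the report covers, can be illustrated with a self-contained toy: simulate ground truths, compute a posterior for each (here an exact conjugate Gaussian posterior stands in for the trained network, so the numbers and setup are this sketch's own), and check how often the nominal 90% interval contains the truth.

```python
import numpy as np

rng = np.random.default_rng(1)
tau2, sigma2, n = 4.0, 1.0, 10   # prior variance, noise variance, obs per dataset

trials, hits = 2000, 0
for _ in range(trials):
    theta_true = rng.normal(0.0, np.sqrt(tau2))
    y = theta_true + rng.normal(0.0, np.sqrt(sigma2), size=n)
    # The exact conjugate posterior stands in for the trained network here.
    post_var = 1.0 / (1.0 / tau2 + n / sigma2)
    post_mean = post_var * n * y.mean() / sigma2
    half = 1.645 * np.sqrt(post_var)  # nominal 90% credible interval
    hits += (post_mean - half <= theta_true <= post_mean + half)

coverage = hits / trials  # close to 0.90 iff the posteriors are calibrated
```

A trained network whose empirical coverage drifts away from the nominal level is exactly what these plots are designed to expose.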

For the Heston demo, that meant correctly identifying that five of six parameters were recoverable, flagging the correlation parameter as essentially unidentifiable from price paths alone, noting that training could run a further 50 epochs, and recommending augmentation with option price data if identifiability is a real concern. All of this matches what Stefan's group would expect from domain knowledge — which is exactly the point.

Hierarchical Models: The Hard Problem

Most simulation-based inference today is flat — non-hierarchical, one dataset, one inference. Hierarchical models are the frontier, and for good reason. Simulating a two-level model across hundreds of groups is expensive. Simulating a three-level model with a slow simulator is often infeasible.

Stefan's group has been working on compositional score-based modeling — a way to decompose the problem so you never have to simulate the full hierarchy exhaustively. The core insight is that the joint posterior of a hierarchical model factorizes in a structured way. If you exploit that factorization, you can train one network per level, then chain them together at inference time. You simulate one unit of the hierarchy, not all of them simultaneously.
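In symbols (notation mine, not the paper's): with global parameters phi, group-level parameters theta_g, and per-group data y_g, conditional independence across groups gives

```latex
p(\phi, \theta_{1:G} \mid y_{1:G})
  \propto p(\phi) \prod_{g=1}^{G} p(\theta_g \mid \phi)\, p(y_g \mid \theta_g)
```

Because the product ranges over exchangeable groups, a network for the group level can be trained on simulations of a single group and reused G times at inference, which is what lets you avoid simulating the full hierarchy.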

This is available now in the development branch of BayesFlow, with a full tutorial from Jonas Arruda. For up to a few hundred groups — which covers most participant-level cognitive modeling in psychology — it's ready to use. For models with thousands of groups or more, the approximation starts to degrade in ways the calibration diagnostics will catch, and there's active work ongoing to close that gap.

When Amortized Inference Gets It Wrong

The most important part of the conversation, and the one easiest to skip past, is Stefan's honest account of where the method fails.

The core risk is model misspecification: your real data is atypical under the simulator, and the neural network gives you something that doesn't match what an oracle MCMC sampler would have returned. This can happen silently. The network produces a posterior, it looks reasonable, and you write the paper — but the estimates have drifted from what proper Bayesian inference would give you, in ways that are hard to detect without explicit diagnostics.

The practical defense is straightforward: always run out-of-distribution detection on your real data relative to the simulation space. Always look at calibration coverage plots — in-sample and out-of-sample. If the model is misspecified in ways you can anticipate (unmodeled noise, unknown guessing processes in human data), add a simple noise model. A uniform guessing process is often enough to bring the neural network estimates back in line with MCMC, at a small cost in precision.
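A minimal sketch of what such an out-of-distribution check can look like, assuming a toy simulator and summary statistics of my own choosing (real workflows use stronger checks than this):

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate(theta, rng):
    # Toy simulator standing in for the real one: 10 noisy observations of theta.
    return theta + rng.normal(0.0, 1.0, size=10)

# Reference cloud of summary statistics under the prior and simulator.
thetas = rng.normal(0.0, 2.0, size=5000)
sims = np.stack([simulate(t, rng) for t in thetas])
sim_summaries = np.column_stack([sims.mean(axis=1), sims.std(axis=1)])

def ood_score(data, reference):
    """Squared Mahalanobis distance of one dataset's summaries from the cloud."""
    s = np.array([data.mean(), data.std()])
    d = s - reference.mean(axis=0)
    return float(d @ np.linalg.solve(np.cov(reference.T), d))

score_in = ood_score(rng.normal(1.0, 1.0, size=10), sim_summaries)   # plausible
score_out = ood_score(rng.normal(0.0, 8.0, size=10), sim_summaries)  # atypical
```

A large score says the real data lives far from anything the network ever saw in training, which is exactly the silent-failure regime described above.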

Stefan is also candid that the field has used terms inconsistently — "robust inference" means different things in different papers, and the distinction between robustifying the network and robustifying the underlying model matters a great deal. More conceptual work is needed there. But for practitioners: don't read the posterior at face value, run the diagnostics, and treat any workflow that skips them as unfinished.

Looking Ahead

This is part one of two — Stefan and Alex ran out of questions before they ran out of time, and a second conversation is already planned. Stefan is continuing to develop the compositional hierarchical modeling work toward larger scales, and the BayesFlow skill itself is a living document: the V2 that shipped just before this episode was partially shaped by student stress-testing in a recent workshop with Paul Bürkner.

If you want to contribute a skill for your own domain, open an issue or PR on the Bayesian Skills GitHub repo. If you find the ABI skill misbehaving on a real problem, Stefan is the right person to hear about it.

You can also interact with the episode on NotebookLM! Ask questions, generate flashcards, and more.

Chapters

00:00:00 How does amortized inference fit into the Bayesian workflow?

00:12:03 What does "sim-to-real" mean in simulation-based inference?

00:15:57 Why is amortized inference particularly suited to psychology and neuroscience?

00:21:51 What is the amortized inference agent skill?

00:39:00 What is calibration coverage and how do you interpret it?

00:41:50 How do you decide what to do next after your first training run?

00:44:53 How do actionable insights make Bayesian workflows more usable?

00:49:08 What are the unique challenges of hierarchical models in amortized inference?

01:00:51 What is the current state of BayesFlow's support for hierarchical models?

01:05:00 What are the main failure modes of amortized inference and how do you handle model misspecification?

Amortized Bayesian Workflow Skill

Soccer Factor Model

Soccer Factor Model App

Stefan’s website

Stefan on LinkedIn

Stefan on GitHub

Bayes Ops Lab

BayesFlow package

Compositional amortized inference for large-scale hierarchical Bayesian models: paper and tutorial

Hope you enjoyed it, and see you in two weeks, my dear Bayesians!

My guest today is the brilliant Stefan Radev, an assistant professor at RPI where he runs the Bayes Ops Lab.

He's also the creator of BayesFlow, the Python library for amortized simulation-based inference that we first covered back in episode 107 with Marvin Schmitt.

Since then, Stefan and his team have pushed the framework forward considerably, and I've been a big fan of the whole amortized inference paradigm ever since.

So yes, we'll dig into what amortized inference is, why it's particularly well suited to messy, noisy, low-resolution data, the kind you get in psychology and neuroscience, for instance, and what the sim-to-real framing is all about.

You train on simulations, you deploy on real, unlabeled data.

But the really exciting part is that Stefan is going to do a live demo on screen, so if you're listening, you'll want to jump to YouTube for this part.

And that's going to be a live demo of an AI agent skill he and I co-developed, to guide you through a state-of-the-art amortized inference workflow.

We'll run it, go through diagnostics, parameter recovery, calibration coverage, and actionable next steps.

All automated, all open source.

This is Learning Bayesian Statistics, episode 157, recorded April 24, 2026.

Welcome to Learning Bayesian Statistics, a podcast about Bayesian inference, the methods, the projects, and the people who make it possible.

I'm your host, Alex Andorra.

You can follow me on Twitter at alex_andorra, like the country.

For any info about the show, LearnBayesStats.com is Laplace to be.

Show notes, becoming a corporate sponsor, unlocking Bayesian merch, supporting the show on Patreon, everything is in there. That's LearnBayesStats.com.

If you're interested in one-on-one mentorship, online courses, or statistical consulting, feel free to reach out and book a call at topmate.io/alex_andorra.

See you around, folks, and best Bayesian wishes to you all.

Hello my dear Bayesians, just a few words to let you know that our Soccer Factor Model is now available as an app on the App Store for iPhones in 175 countries.

It's all thanks to Maximilien Goebel, who's been working on that for months.

Check it out if you want to see what a Bayesian model looks like on your phone, and also to sound smart during the coming World Cup starting in June. From some messy code in a Jupyter notebook to a slick-looking app, who would have thought?

So if you want to see what a production Bayesian model looks like in the wild, check out the website; I put it in the show notes.

Of course, check out the app on the Apple App Store.

I also put that in the show notes.

And if you want more details about the Soccer Factor Model, well, you have that on the website, of course, but you can also listen to Maximilien Goebel on this show. I put the link to his episode in the related episodes for that one.

And I think it's a good idea to have a refresher with the World Cup coming up.

We'll probably have some even better football news for you coming up in the coming weeks, but I don't want to spoil it for you.

So in the meantime, let's listen to Stefan Radev and talk about amortized Bayesian inference.

Stefan Radev, welcome to Learning Bayesian Statistics.

Good to see you, Alex.

Yeah, it's great to finally have you on the show.

We've been collaborating for a few months, maybe even some years now. I don't know, it's a bit blurry in my memory.

And yeah, I've been, of course, talking with you and the rest of the BayesFlow team since I discovered BayesFlow in episode 107 with Marvin Schmitt.

And yeah, since then I've been a really big fan of BayesFlow and amortized inference in general.

I think it's also a great setting to learn more about neural networks and their power and how they can help for a type of inference that is dear to our heart here on the show.

And so yeah, it was long overdue to have you on the show.

And actually I'm very happy that it happened now because we have something really cool to share with people today, which I was also super happy to collaborate on with you.

But before that, you know, as usual, let's start with your origin story.

You know, what are you doing nowadays, and how did you end up doing that?

So let's keep the long story short.

I'm currently an assistant professor at RPI, where they let me set up my Bayes lab, the Bayes Ops Lab.

I didn't start out as a Bayesian, as most people don't, I guess.

I started out as a psychology student in Heidelberg University.

I had no idea what I wanted to do, but I really liked neuroscience.

So I managed to join an unpaid internship at a neuroscience lab in Mannheim.

And there, my mentors were really kind and smart people who let me participate in different experiments, even though I was not of much help at that point.

But because I showed some motivation, one of the advisors there offered me a job on the condition that I learn how to program over the summer.

So I took her up on the challenge.

I got a book, Zero to Hero in Matlab.

Oh, okay.

Which is, yeah, not a typical language to do any kind of Bayesian analysis with, to begin with.

But I managed to learn a lot during the summer, grabbing the book, going to the beach with the book.

And yeah, when I came back, suddenly I found out you can do a lot of things with programming, right?

And this was the pre-AI era.

So coding skills were still important.

And of course, I introduced a lot of bugs and so on, the usual ones. I didn't crash any space shuttles in the process, but I fell in love with programming.

So I even decided to add a parallel bachelor's degree in computer science.

And at that time, I also started another student assistant job at the Department of Quantitative Research Methods in Heidelberg.

That's where I met a PhD candidate, Ulf Mertens, who was already towards the end of his PhD.

He was working on something called adaptive design optimization, which back then was very hyped.

And it's a cool idea; it basically says: make your experiment part of your model and try to take informative measurements as you go.

And the way they were selling this idea was: suppose you have a neuroscience experiment that takes four sessions, each session two hours per participant. That's a lot of time. What if you could do much better? What if you gained the same information in just 20 minutes? That's how they were selling it.

We were reading a lot of papers by Jay Myung from Ohio State University. I only knew him from papers; I actually had the pleasure of meeting him last year at a conference. And we started implementing this in C++.

Yeah, eventually we found out about Stan, so we looped in Stan, and it was very slow, as you can imagine, because this was also the pre-neural-network era in Bayesian statistics.

At some point you already had amortized inference and all that, but enthusiasm about adaptive design optimization has waned. But I digress.

So I figured out that's what I want to do at that point in time.

And I decided that I have to try to avoid the labor market as long as possible.

And there happened to be this graduate school about statistical modeling with a focus on psychology.

I applied, I got the PhD position, and during that time I met two of my mentors: Ullrich Köthe from Heidelberg and later Paul Bürkner.

So everything I know about deep learning I know from Ullrich Köthe. Everything I know about Bayesian statistics I know from Paul.

So it's important to have smart and open-minded mentors, I would say.

So I was rather quick with my PhD, two and a half years, then I transitioned to a postdoc, the usual story. And a colleague of mine just randomly forwarded me an email with my current job description and said, okay, maybe it's a good idea to start applying to something beyond postdocs. I thought, okay, let's do a practice run.

So I applied and I ended up getting the job and now I'm here.

Wow. Okay. Yeah, I love it.

That's definitely somewhat serendipitous, but I like how it sounds like you've been conscious of all the choices you've made throughout the journey.

I think it's, yeah, it's beautiful to see that it's really something you were really interested in and passionate about since the beginning.

And so that's the beginning, and how you were drawn to Bayesian inference and deep learning in particular. Is that also when you started being interested in amortized methods and, you know, basically ABI and the themes we're going to talk about today?

I would say Bayesian inference happened to me: at the time I was transitioning from my master's to my PhD, the so-called Bayesian revolution in the social sciences was happening, partially in response to the so-called replication crisis.

Now, for viewers who don't know what the replication crisis is, it was the precarious situation that a lot of psychological effects tend not to reproduce in independent studies.

And there were some very big papers around 2015-16 showing that one third to two thirds of the effects cannot be reproduced. It's still not clear what caused the crisis, whether it was a theory crisis or a methods crisis. The methods-crisis camp proposed as a solution swapping out p-values for Bayes factors, basically.

And there was an outpouring of gazillions of papers on how to do any classical analysis in a Bayesian way.

There was a lot of excitement about Bayesian analysis, and in parallel to adaptive design optimization, we also started fiddling with approximate Bayesian computation, which is sort of the brute-force precursor of amortized simulation-based inference, for when your model is too complex to formulate as a set of analytical equations. So what you're doing is using simulations: you compare the simulations to real data and keep the simulations that are as close as possible to your real data. And the parameters corresponding to the retained simulations are your posterior.
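Rejection ABC, as described here, fits in a few lines. This is a toy sketch in the editor's notation (one parameter, mean-distance criterion), not code from the episode:

```python
import numpy as np

rng = np.random.default_rng(3)

def simulator(theta):
    # Toy stand-in: 20 noisy observations of a single parameter.
    return theta + rng.normal(0.0, 1.0, size=20)

observed = simulator(2.0)  # pretend this is the real data

# Rejection ABC: draw parameters from the prior, simulate, and keep only
# the draws whose simulations land closest to the observed data.
proposals = rng.uniform(-5.0, 5.0, size=20_000)
distances = np.array([abs(simulator(t).mean() - observed.mean()) for t in proposals])
accepted = proposals[distances <= np.quantile(distances, 0.01)]  # closest 1%
```

The accepted draws approximate the posterior; the brute force is visible in the 20,000 simulations spent to keep 200 of them.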

And we implemented a lot of these algorithms; I even implemented a graphical user interface for these methods at that time.

And at some point, I think during a lab lunch, my psychology advisor suggested: well, can't you just train a neural network on this data? And I said, actually, that's a good idea, let's try it out.

And it worked. In retrospect, the idea is very obvious.

I think the idea was also developed independently in various areas.

For example, neuroscience, particle physics, population genetics.

I was myself a bit oblivious to a lot of the related work.

When we were starting out, we were mostly driven by excitement, because something big was happening in this neural network space, and AI was already cool.

It was not super cool like today, but it was starting to get very cool.

Hmm. Okay. Yeah, all of that makes sense.

It does sound like you arrived at the intersection of both fields at exactly the right moment, where there was a lot of very interesting work to do, and probably also the computing power that was now enabling that work, which is something you definitely need.

So we've talked on the show, as I was saying, about amortized inference with BayesFlow.

I will put these in the related episodes and show notes of this episode: episode 151 with Jonas Arruda and episode 107 with Marvin Schmitt.

I also put episode 18, just because it's about the replication crisis, with Daniel Lakens, and also episode 51 with Aubrey Clayton, author of the book Bernoulli's Fallacy, about exactly that.

Very good episode.

So that's for folks who want to dig into that.

Now, to get back to amortized inference: we don't need to re-cover that ground, so don't worry. We can just pick up where episode 151 left off.

But I do want to understand your specific framing of it: I've seen you write and talk about something you call sim-to-real, in the sense that you train on simulations and you deploy on real data.

I think this is quite a change of mindset for a lot of listeners.

So what do you mean by that concretely?

Good question.

So I would say, if you're working in the deep learning realm, there are two major ways in which you can use synthetic data. The predominant approach is what's called synthetic-for-ML. The goal here is: you create a bunch of fake data that ideally resembles the real data, and you use it as data augmentation, especially in cases where real data is scarce.

Think, again, of a lot of brain-computer interface research; neuroscience typically deals with low participant counts, we're talking about a few thousand, for example. That's not a great starting point for deep learning.

Wouldn't it be dreamy if in these settings you had a simulator that can produce infinite streams of training data?

And even better, if you could annotate this data.

And this is a very fruitful approach.

Some recent reviews show this for particular niche cases, for example.

I recently came across a review on data augmentation for EEG studies on automatic diagnosis of major depression. And the results there were that data augmentation like this can increase downstream classification accuracy anywhere between 1% and 40%.

So it really depends on the application, but it's a cool approach, right?

And in these settings, generally speaking, you're not training purely on simulated data. You're in a semi-supervised setting: you have basically two domains, real data and simulated data, and you try to extract as much information as possible from both. Now, the less popular approach, which is where our research falls, is the simulation-based inference approach, where you use simulations to identify mechanistic models. The simulation itself is an epistemic tool to gain insights into some process.

A lot of the work on SBI is sim-to-real.

You basically train your network on gazillions of labeled simulations. And then you deploy the network on your real data, which is unlabeled by construction, because you don't know which parameters generated your data.

So that's the basic idea. But I have to say the lines are getting very blurry. For example, nowadays you have world models, right? World models are neural networks that can simulate physics and state spaces just by training on video sequences. In this case you have real-to-sim, but then you can use world models as simulators for other downstream neural networks, so you have sort of a loop: real-to-sim, sim-to-real.

And there are also a few SBI (simulation-based inference) papers which in fact do try to plug the real data into the loop as well.

Okay, we were also guilty of a few papers in this realm, where the claim is basically that the real data provides additional signal.

Not only that, it can robustify the networks.

I guess in cases where the simulation is an imperfect representation of reality. But I think we'll come to talk about this in more detail.

Yeah, this is really, really fascinating.

And I love how it complements, you know, the modeling workflow.

So you have something to share with us today live, where you'll do a demo of what we've been up to. But before we do that, maybe one good way to establish again what all these methods are about is, as you were saying, that you started working on cognition and psychology, and so a lot of your applied work is in these fields. So that means messy models, noisy data.

What makes amortized inference particularly well suited to that domain?

Yeah, great question.

Unfortunately, I wish I did more psychology than I actually did. Even towards the end of my PhD, there were some questions: so, your thesis was supposed to be at least a little bit about cognitive modeling, but all you did was deep learning. But psychology remains a great test bed for our methods, because, as you say, we're dealing with a lot of noise there, and thanks to the big data revolution we now also have access to gazillions of data points, from participants taking online experiments, for example. But let's take a step back. Not only is this data very noisy, it's also of very low resolution.

You don't have a window into the mind.

You are confined to very indirect measures, like response times, like mouse tracking data.

If you're lucky, eye-tracking data.

So you can get closer and closer to the brain, but generally speaking, you're dealing with very impoverished data.

But you have a lot of it.

And if you were doing model-based reasoning or model-based inference, which is still not the default approach in psychology, right?

In many instances, in psychology and social sciences, you wouldn't think about mechanistically modeling how this behavioral data comes about.

You would rather fit a classical statistical model: regression, mixed-effects models, et cetera.

In the still infrequent cases where you do have a claim about the underlying processes that generate the behavioral data, you have a model that you want to apply to a lot of datasets. Each dataset in this case could be a participant, or it could be a dataset of multiple participants, and you may have multiple datasets of multiple participants at different sites.

So you have the same model, you want to apply it many, many times.

If you're coming from a classical, I hesitate to say classical, Bayesian perspective, applying your favorite Markov chain Monte Carlo (MCMC) sampler, you will be applying it from scratch to each and every participant to obtain the underlying cognition.

With amortized inference, you train once.

You simulate the cognitive process, you train your network to predict the parameters from the data.

This takes some time.

When we were starting out, for example, we got our hands on data from Project Implicit. This is an extremely cool repository of data from years of response-time experiments in psychology.

And the data is ginormous.

We have more than 5 million participants.

So we pre-processed around 1.2 million participants, which at that time was probably the largest model-based analysis anyone had done in this realm.

And it took us around a day, I think, to train the network.

And inference took a few hours.

If you repeat this analysis with the infrastructure we have nowadays, and we actually did in an upcoming paper, now with 5 million people and different questions, training is going to take less than an hour, and inference is going to take around that time as well.

We're talking about enormous gains in compute power. Back in the day, we extrapolated the time it would have taken to do this naively with MCMC, without any kind of amortization, and it was something on the order of a year.

Now that is not a great timeline if you want to graduate.

And especially if you then get prompted to fit other models, et cetera, et cetera.

So this is what we call amortization.

The cost of inference amortizes across repeated evaluations of the model.

Yeah, I think it's a really great way to explain it and illustrate what the advantage is.

And we'll probably talk a bit more about the nuances of amortized inference; I still have a few questions for you about that, but I want to make sure you have the time to demo what you wanted to demo today.

So do you want to get into that?

um So basically, you're going to show us an agent skill.

We've been working together to help people do amortized inference in the real world, in production, with a very serious and validated process, because it was validated by you.

So who better than you to do that?

And yeah, basically a way to teach AI agents to do state-of-the-art amortized inference.

So, on that note: this is merged, by the way, on the Bayesian Skills repo, folks, and it's all open source. So if you want to see that, the link is in the show notes, and you can just ask your AI agent to install the skill from the link.

You just give it the link and it will figure out how to install the skill.

And now, Stefan, you're going to show us what this is all about by sharing your screen. So if you're listening instead of watching, this is a portion of the episode you might want to tune into on YouTube, using the chapters to jump in there very easily.

All right, let's get into it.

So do we want to first give a short overview on what a skill actually is?

No, I think that's fine, I think people know what this is. But yeah, here's the pitch, folks.

I think the easiest way to understand it is: if you code with an AI agent, which I'm sure everybody here does, sometimes the agent is not good enough at what you're trying to do. And amortized inference is something it's usually not good enough at, because it's new.

There is not enough training data in the whole internet.

And so making sure you train and you teach your agent is very important.

That's basically what a skill is doing.

If it's still blurry to you, I will refer you to the blog section of the LearnBayesStats website, which is new, but I've written a few blog posts exactly about that, about the different skills I've developed, and I will for sure write one with Stefan to go through the amortized inference skill.

So yeah, basically that's that and you can take it from there, Stefan.

Excellent.

Thank you for the elevator pitch. So what we have here is the entry point of the skill, which is, in a sense, the skeleton of the skill.

This is the description which should express user intent.

So when you make a request to your agent,

it's going to see if the request matches any of its skills.

This saves you a lot of tokens, basically, because skills follow what we call in computer science, progressive disclosure.

You don't load the whole thing into memory; you only load things on demand, if certain conditions are met.

So here's the way progressive disclosure is going to work. I would ask the agent the following prompt, and I have already prepared this for the podcast, because I didn't want it to run in the background for five to ten minutes, which is usually the time it takes.

it's also interactive, right?

It's gonna ask you, do you wanna run this script?

And so I'm just going straight to the assets.

Yeah, so my prompt was this.

I want to calibrate the Heston model in Heston.py using simulation-based inference.

I don't have real data yet.

I just want you to set up, run and interpret an amortized Bayesian workflow.

Okay?

Now this prompt triggered the intent.

And this is the first step in validating any skill: of course, you have to make sure that the intent is clearly expressed and that the skill is triggered by the relevant problem space.

Now, what is this model, the so-called Heston model, all about?

I took something that I myself am by no means an expert in, but it's something that people in my lab are fiddling with.

So I said, why not?

This is also a great demo because I myself am not an expert in the modeling domain.

So the Heston model is a model from the 90s, and it's a way to simulate how an asset's price evolves over time when volatility is not constant.

So it's a popular model used in options pricing.

Now, I'm not really sure if it's that useful in practice, but it's an interesting toy model.

And the way this model works is basically that there are two processes.

There is the asset price.

So you're modeling the asset price as a geometric Brownian motion.

And there is also a process for the variance; this is called the volatility process.

So you have stochastic volatility, the variance of the geometric Brownian motion.

And there is a correlation between the two processes.

Now this model has six parameters.

We don't have to understand all the parameters.

These parameters mean something to quants.

For example, we have the annualized drift of the asset, which tells us how fast the price tends upwards on average.

Another interesting parameter is, for example, omega.

This is the long run mean variance of the process.

Where does variance settle in the long run?

And so on.

We have four more of these.

Okay?

But this model is easy to simulate.

That is, on this computer it probably takes around five milliseconds to generate this whole ensemble of price paths, okay, for a given trading day.

And the model has access to the simulator.

The simulator is in this file.

Okay.
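For intuition, a Heston-style simulator along these lines can be sketched with a simple Euler-Maruyama scheme. This is a generic sketch, not the contents of Heston.py; the parameter names are illustrative assumptions:

```python
import numpy as np

def simulate_heston(mu, kappa, omega, xi, rho, v0,
                    s0=100.0, n_steps=252, dt=1 / 252, rng=None):
    """Euler-Maruyama simulation of one Heston price path (sketch).

    Illustrative parameter names (six parameters, as in the episode):
    mu    -- annualized drift of the asset
    kappa -- mean-reversion speed of the variance
    omega -- long-run mean variance
    xi    -- volatility of volatility
    rho   -- correlation between price and variance shocks
    v0    -- initial variance
    """
    rng = np.random.default_rng() if rng is None else rng
    s, v = np.empty(n_steps + 1), np.empty(n_steps + 1)
    s[0], v[0] = s0, v0
    for t in range(n_steps):
        z1 = rng.standard_normal()
        # correlated shock for the variance process
        z2 = rho * z1 + np.sqrt(1.0 - rho**2) * rng.standard_normal()
        # variance process, reflected at zero to stay nonnegative
        v[t + 1] = np.abs(v[t] + kappa * (omega - v[t]) * dt
                          + xi * np.sqrt(v[t] * dt) * z2)
        # log-Euler step for the price (geometric Brownian motion)
        s[t + 1] = s[t] * np.exp((mu - 0.5 * v[t]) * dt
                                 + np.sqrt(v[t] * dt) * z1)
    return s, v
```

The summary network would then see something like `np.diff(np.log(s))`, the log differences of the price path mentioned later in the episode.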

Now, how does the skill work?

Okay.

Now, the skill tells the agent what the essential steps of an amortized Bayesian workflow are.

So this is basically an outpouring of my expertise in fitting amortized models, edited by Alex.

And we're not going to go over the whole skill, right?

But the skill basically has hard rules.

These are the musts and the nevers.

So these are the harnesses that we're using to constrain the LLM.

It has installation instructions.

It has also instructions on how to set up the networks.

So for that, we have prepared a set of quick references.

So references are important for skills.

Think of these as quick checklists, just as a pilot has.

These quick-reference checklists are ways for the agent not to guess or guesstimate certain parameterizations, but to immediately fetch predefined

configurations.

Here, for example, based on experience, I have defined different model sizes for the neural networks.

For transformers, for example, the agent can choose between a small and an XL model depending on the problem setting.

Okay, and there are also a lot of built-in heuristics here as well.

Now going back to the skill.md.

Yeah, as you get back to the skill.md, I think it's a great way to understand a skill.

It's basically being a manager, I think.

So it's trying to give guidance without micromanaging and over-controlling, so that the agent goes down the right path, but you don't tell it what the right path is, because it might be a new one that you didn't even think about.

So basically it's trying to make sure it doesn't go the wrong way, and then letting it be creative enough to find a good way.

That's an excellent way to put it.

It's automation with opportunities for creative expression.

Now, the body of the skill basically specifies the outline of a full end-to-end amortized workflow.

Now, this is using all the BayesFlow interfaces.

So as you can see, there is also a very minimal amount of code, which is part of BayesFlow's core design philosophy: get you from idea to inference in the least number of lines of code.

uh Now, here is another interesting aspect of the skill.

Now, we're also going to ask the agent to produce a publication-ready report, format it in an aesthetically pleasing way for you to look at, interpret the main diagnostics in an amortized Bayesian workflow for you, and suggest next steps for improvement.

Now one important heuristic that I've built in here is being parsimonious and frugal with your time.

So for any new problem the agent is instructed to first do a pilot run.

Okay, so instead of spending a lot of time simulating, training, and designing things just to figure out that the model was not properly set up.

Advising the agent to do this is something I also gleaned from personal experience.

For us, it's always good to just have a development set, do a quick run, iterate for less than five minutes.

That's kind of my personal benchmark: if you're iterating for longer than five minutes during the development phase, you're doing something wrong.

So I try to encode this preference into the agent.

So it's always going to pre-simulate a few thousand simulated data sets, depending on how fast simulation is.

Right.
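The pre-simulation step described here can be sketched as a budgeted loop. `simulator` and `prior_sampler` are hypothetical callables for illustration, not the skill's actual interface:

```python
import time

def presimulate(simulator, prior_sampler, max_sets=2000, max_seconds=60.0):
    """Build a small development set for a pilot run (sketch).

    prior_sampler() returns one parameter draw; simulator(theta) returns one
    simulated dataset. Stops at whichever budget is hit first, so a slow
    simulator cannot blow past the five-minute iteration benchmark.
    """
    params, data = [], []
    start = time.monotonic()
    while len(params) < max_sets and time.monotonic() - start < max_seconds:
        theta = prior_sampler()
        params.append(theta)
        data.append(simulator(theta))
    return params, data
```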

Now there may also be the case where you don't have access to the simulator at all; you're just given a pre-simulated dataset.

The agent could deal with that as well.

It knows based on the

problem characteristics, how to choose the best possible dev set, right?

To get you very quickly from your idea to the first report.

All right?

So let me now show you what this agent generated.

And this is available here.

So the run proceeds as follows.

You can actually, just to show you that this is genuine, right?

So here is the reasoning trace for when I ran the model.

It gives me a little summary, but everything I need is now in this HestonSBI folder.

The agent is also instructed to collect everything neatly into its own folder, and to prepend version names and so on, so you can actually keep track of the different runs.

So let's look at this very first run.

Here is the report markdown.

Let's render this.

And the way this is currently structured is like this.

We have the training and network configuration, always handy for you to look at.

In this case, we had a flow matching inference network.

This is one of the frontier generative AI model families, which we're using to sample from the posterior.

We have the summary backbone.

In this case, it correctly selected a time series transformer because our data is price paths, or more precisely,

log differences of the price path.

It trained 400 epochs, batch size 32, some general deep learning information.

The first thing it did was inspect the convergence, right?

Convergence, in deep learning terms.

We look at the training and validation loss trajectory.

We look at differences between training and validation.

If you're a deep learning person, you know all that.

If you're coming purely from a statistics background, this is new.

You're used to

interpreting your MCMC trace plots.

Think of this as the trace plot.

If this looks bad, you don't proceed.

Just as if your trace plots don't look good.

You just cycle back.

Okay.

Now here's the model assessment.

That's an interesting thing.

This text here is scripted; it explains what the loss should look like.

But this is the creativity part.

The model judged that training looks healthy.

The loss decreased steadily from 1.8 to around 1 over 100 epochs.

No NaN spikes.

The validation loss is very close to the training loss.

There's no overfitting.

It also spotted that the loss was still marginally decreasing towards the end.

And it may be helpful to train for maybe 50 more epochs to reap those last performance droplets from this application.

Next, we see the parameter recovery.

uh This is something that is typically very expensive to do with non-amortized methods because it requires you to fit your model on hundreds of simulated data sets.

For amortized inference, this is just a nice side effect.
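A minimal version of such a recovery check, assuming you already have posterior draws for many simulated datasets, can be written in a few lines (generic NumPy, not the skill's actual code):

```python
import numpy as np

def recovery_summary(true_params, posterior_samples):
    """Quick parameter-recovery check (sketch).

    true_params       -- (n_datasets, n_params) ground-truth prior draws
    posterior_samples -- (n_datasets, n_draws, n_params) amortized posterior draws
    Returns the per-parameter Pearson correlation between ground truth and
    posterior mean: values near 1 mean the parameter is recoverable,
    values near 0 mean it is essentially unrecoverable.
    """
    post_mean = posterior_samples.mean(axis=1)
    n_params = true_params.shape[1]
    return np.array([np.corrcoef(true_params[:, k], post_mean[:, k])[0, 1]
                     for k in range(n_params)])
```

Because amortized inference makes posterior sampling nearly free, running this over hundreds of simulated datasets costs almost nothing, which is exactly the "nice side effect" described above.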

So we're seeing here the ground truth versus the estimate, and we see that, I would say, three of the parameters can be estimated somewhat precisely.

Two of the parameters show signs of recovery, but not impressive.

And one parameter, rho, the correlation between the two stochastic processes, is completely unrecoverable.

And if I talk to my group, these things are actually expected.

So this run, I would say, doesn't produce anything surprising in this case.

Let's continue to some more.

So here, can you just tell listeners why that's not surprising in this case?

We have to note here that this way of fitting these option-pricing models is absolutely non-standard.

In the finance literature, this goes under model calibration; they have other tools. And some papers have already indicated that it's really, really hard to estimate the correlation rho.

And now I personally, I haven't done any mathematical analysis on this model to prove that it's the case.

But that's something you always see with these kinds of models.

Yeah, or expect to see, unless you have, let's say, much longer time series, or some other information; maybe in some way you measure the volatility, or you get an indicator of it. Because all the model sees here is a price path, right?

The path of the volatility is latent.

Yeah.

In settings like that, it can be very hard to estimate a correlation like this.

Yeah.

Yeah. Okay, now, calibration coverage.

I think we don't need to go into detail here.

These are more comprehensive diagnostics, which are basically great.

Yeah, yeah

I definitely recommend people to use these plots.

I use them all the time now.

They are in the new ArviZ 1.0, and in BayesFlow they have been there since the inception, which is something I always really loved about the package.

And that's also why I've bugged Osvaldo Martin, I think, to have them in ArviZ too.

Because honestly, they are extremely useful and very important.

Also, they give you a lot of information about how good your model is at recovering parameters when you're developing and also how good it is at actually predicting.

So very, very important.

And I love that you can do them for out-of-sample data and for in-sample data.

You'll see some very interesting patterns with hierarchical models, which usually, you know, underfit in sample.

So if you compare models only in sample, it can be like, my God, the hierarchical model is actually bad.

And then out of sample, the hierarchical model almost always becomes the best one, because thanks to the underfitting in sample, it actually becomes much better out of sample.

So this is a very interesting pattern.
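The calibration checks being praised here can be sketched with simulation-based calibration (SBC) ranks; a minimal NumPy version (not BayesFlow's or ArviZ's actual implementation) looks like this:

```python
import numpy as np

def sbc_ranks(true_params, posterior_samples):
    """Simulation-based calibration ranks for one parameter (sketch).

    true_params       -- (n_datasets,) prior draws used to simulate each dataset
    posterior_samples -- (n_datasets, n_draws) posterior draws for that parameter
    If the posterior is well calibrated, the rank of the true value among the
    posterior draws is uniform on {0, ..., n_draws}.
    """
    return (posterior_samples < true_params[:, None]).sum(axis=1)
```

Plotting a histogram of these ranks is the usual diagnostic: a U-shape indicates an overconfident posterior, a central hump an underconfident one, and a flat histogram good calibration.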

Yeah.

I would absolutely recommend never ignoring these diagnostics, especially when they can be computed so handily.

Yeah, exactly.

Now they can be done very easily, whether you're using BayesFlow or ArviZ.

So yeah, definitely do not skip them.

If you're using the Bayesian workflow skill from the Bayesian skill repo, it will do that automatically for you.

So it will make sure you don't forget that because I made sure of that because I love these plots.

Yeah, there's no way around it.

Yeah, exactly.

Keep going, Stefan.

Sorry.

I derailed you with my weird passion for coverage plots.

No, this is great.

So what we also get at the end is the same information in the form of a numerical summary, because for each of these diagnostic plots you can produce a single-number summary, which loses, of course, some information, but gives a general idea of how the model is doing.

You can also verify things are reasonable.

You see the calibration errors.

They're all

in the order of 0.01 to 0.025.

This is a good result.

The model also currently does a little bit of qualitative interpretation here, which also matches what we see. For the correlation rho, it says: poor recovery, excellent calibration, which of course is something that throws off newcomers, that this can happen.

But of course you should remember the prior is always well calibrated.

So even if you learn nothing, at least you recover the prior, which is, by construction, well calibrated.

All right.

And the most interesting part is this segment over here, which contains the suggested next steps.

Let's see if we agree with what the agent suggested.

And the first suggestion is to extend training to 150 or 200 epochs at most.

Right?

The reasoning here is that the loss curve still had some remaining slope towards the end.

So I would say that's a correct interpretation at this point.

Another suggestion is to augment with option price data for rho identifiability.

This is actually making a suggestion here on how to make the model more identifiable.

That's interesting.

See, this is not something that you would get from a plain automated workflow.

It's also suggesting tighter priors for rho if domain knowledge supports it.

Fourth, it suggests switching to online training as a refinement step, which I think is also reasonable.

Now that we've seen we can get recovery for five out of six parameters, the question is, what is the maximum performance that we can squeeze out of this workflow?

So online training is the natural next step.

in this case, it's also something that I would have done, especially if the model is so cheap to simulate.

Simulation-based inference profits enormously if you can quickly simulate.

It literally means an infinite stream of training data.
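Online training, in the sense used here, can be sketched as a loop that draws fresh simulations for every batch instead of reusing a fixed table. This is a generic sketch, not BayesFlow's actual API; `train_step` stands in for one gradient update of the inference network:

```python
import numpy as np

def online_training(prior_sampler, simulator, train_step,
                    n_steps=1000, batch_size=32):
    """Generic online (simulation-on-the-fly) training loop sketch.

    Every batch is freshly simulated, so the network never sees the same
    training example twice; with a cheap simulator this amounts to an
    effectively infinite training set.
    """
    for _ in range(n_steps):
        params = np.stack([prior_sampler() for _ in range(batch_size)])
        data = np.stack([simulator(p) for p in params])
        train_step(params, data)
```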

And finally, proceed to real data inference when price data are available.

Okay, well, thank you, agent.

No, that's really awesome.

That's really something you've added in the second iteration of the skill that we just merged. And yeah, I think this is super, super valuable to have this kind of harness on the report, making sure the report is actionable.

I really love that.

I'm totally gonna steal that idea for the other skills, like the causal inference skill and the Bayesian workflow skill.

I think they would both benefit from that.

Yeah, you'll probably see a PR from me in the coming days where I implement that and adapt it, obviously, to the previous skills, because I think it makes the agent's work even more actionable for the human.

And I think this is basically what we want to do here with these kinds of skills, which is like, okay, do that and then come back to me and give me the diagnostics, but also: what's next?

You know, what are the next steps?

What do you recommend based on the state of the art research?

And I think this is really something Aki Vehtari and I talked about on the show the last time he came, where he has been very focused on, I don't remember what he was calling it at the time, but today you would call it an AI-assisted Bayesian workflow.

And his idea was really about multiplying the number of people who can recommend you state-of-the-art research without having them with you.

So here, it would be like doing amortized inference with you looking over my shoulder and giving me advice, which is extremely valuable.

Fantastic.

Well, thank you.

Thank you so much, Stefan, for walking us through that and also contributing that skill.

I think it's really amazing and super helpful.

um Is the notebook you just presented available somewhere already?

It is not, but I can definitely just contribute it.

Yeah, we should.

My pleasure, because I think this is really going to be a continuous development process.

Yeah.

During this skill development, I noticed that we lack a scientific way to construct these skills.

We're completely reliant on our sometimes tacit knowledge and experience, which we first have to verbalize.

Also, in a way, it's sort of an exercise for us to see if we can...

Oh, totally.

...explicate some domain expertise, right?

But we don't have a scientific way to generate these skills, to act as scientists and say, okay, let's systematically vary certain aspects of the skill, randomize all the others, and see what works best.

Yeah.

Yeah, yeah.

Exactly.

Yeah.

Something I also do to test them is make sure I'm testing them on something I really know, or a model I've worked on, and then see what the skill does.

And especially if it does something really weird or recommends something that is not good, that helps me stress test it.

So yeah, I agree with you.

This is also the super fun part: you have to be very explicit about your knowledge and basically write that down in the markdown files.

And this is super valuable.

So yeah, let's do that.

Once you have the notebook available publicly, let me know and I will add it to the show notes, or you can do that yourself since you have access to the document.

So that will be in the show notes, and I'll also be working on the blog post to announce and explain the skill, to put on the website of Learning Bayesian Statistics.

And I will, of course, run the blog post by you.

We'll add a link to the notebook.

for people who want to see and read the demo.

And of course, a link to your episode.

That way people have different ways of understanding what's going on and what they can do with amortized Bayesian inference with this skill.

And I think it's going to be a very nice ecosystem.

Let's say they have the episode, the blog post, and the notebook to dig even deeper.

So I'm super excited about that.

um

Anything you want to add about this section or can I ask you other conceptual questions about Bayesian inference that I've had for a while and that I'm very happy to be able to

ask you today?

So we can move on.

That was an excellent summary and maybe just add that we want people to use these things, right?

Stress test them on your own problems and let us know when they fail.

That's how we can make them better.

Yeah, definitely.

Yeah, please do that.

When you use any of the skills in the Bayesian skills repo, if there are issues, you can always contact me, but I have a lot of work, so my bandwidth can be limited.

If you open an issue on the GitHub repo, I'm not the only one to see it.

Stefan will see it, and especially if it's about the ABI skill, he'll be even better than me at answering you.

I definitely encourage you to do that.

Actually, the V2 of the skill, Stefan, was inspired by your own workshop that you taught with Paul Bürkner a few days ago, where you were actually stress testing the skill with students.

So do that, folks.

And also, if, like Stefan, you're an expert in your field and think that you could contribute a skill to help your workflow, but also everybody else's.

Please open an issue on the GitHub repo, or even a PR for me to review; that'd be even better.

And/or contact me by email or LinkedIn, and we'll get that going.

I'm always happy to welcome new folks on this open source project.

That's super fun.

On that note, I need to ask you about hierarchical models, Stefan, because that's been hard to do with amortized inference historically.

So I'd like to see where we are right now, because you've worked on that a lot, especially on compositional amortized inference for large hierarchical models.

So yeah, what makes hierarchical models hard to amortize, and how does composition help?

Yeah, now we're getting into the hard questions.

Now, hierarchical models, you can treat them as the reward at the end, the final level of SBI.

Almost all SBI, with very few exceptions, is done on flat, aka non-hierarchical, models.

There's a good reason for that.

Hierarchical models are super challenging.

To begin with...

What is the simplest hierarchical model that you can have?

It's a two-level model.

What is your favorite two-level model?

So you mean a kind of model on a kind of dataset?

Yeah, what example should we use here?

One I really like and started working with was doing electoral forecasting in France.

So here the hierarchy is very interesting, because you've got cities inside things that we call departments, which are, like, you know, regions.

And then you've got the whole country.

So you have that pyramid of three levels.

I think it's an interesting one.

So, so let's stick with that.

Right.

So, so we have to simulate this model across three levels.

which are now... well, one level is always there, that's the flat model, but the new levels that you added, let's say the region and the country, mean you have two more dimensions that you have to simulate, which may not seem very problematic if your simulator is fast. But suppose that your simulator is slow, like it's already...

Pretty hard to simulate even one instance. Suppose you're modeling the brain, right, and you have a brain emulator, so it takes a few minutes to generate one sequence.

Now suppose you want to model hundreds of brains at the same time. That's hundreds of minutes to simulate, and this is just one training instance.

Okay, so there's no way you can train this model efficiently if you proceed like that.

Okay.

And this also brings memory issues and the need to design specialized networks.

Now, what do I mean here?

When you have hierarchical models, you're modeling at least two different categories of parameters.

You have the local parameters, which vary, to stick with your case again, by location, but also some global parameters that capture what is shared among the locations.

And you don't want to estimate them separately. What you really care about, if you are a proper Bayesian, is the joint distribution of all these parameters conditioned on all the available data, to get this precious shrinkage that you're after in the hierarchical model.

This puts you in a tricky situation.

when designing your neural networks.

Because now you could just say, okay, I'm going to take all my parameters and put them in a single vector and say, okay, that's a high-dimensional parameter space, and neural networks can deal with high-dimensional parameter spaces, images.

No, no, no, you can't do this, because your problem has a symmetry, right?

So this joint posterior factorizes in a particularly nice way, if you think about it.

And so you can actually pose this problem as estimating each of the local parameters conditioned on its local data, but also conditioned on the global parameters.

So you have, let's say, many of these problems, and then you still have to solve one big problem: estimating the global parameters conditioned on all the data.

You can do it.

We have a paper on that, and others have papers on that: you can do it by chaining different networks together, right? And the insight here is the so-called inverse factorization.

This is something that's not commonly discussed, let's say, in hierarchical modeling, because MCMC always gives you the joint distribution.

I know it's a little bit technical, but for neural networks you need to consider

how this joint distribution factorizes, right?

And for these different factors, when you have repeated factors at the same level, you just use one shared neural network that only knows how to estimate the parameters for one factor, okay?

Because you want this network to generalize to different numbers of factors; that's the main idea.

So it's the good old story of inductive bias, right?

You're trying to encode the probabilistic symmetry into the network instead of hoping that the network can learn it out of the box.

Now, this is very, very hard.

For the two-level model it's easy to imagine; for three- and four-level models with many factors, the inverse factorizations are not unique.

In fact, even for a two-level model, the inverse factorization is not unique.

We have two possible inverse factorizations.

One of these

is nicer to amortize than the other.

And for three- and four-level and, generally speaking, graphical models, there are some papers that have already looked at which of these inverse factorizations are particularly

favorable for amortization.

OK?

So theoretically, you can build a heuristic algorithm that already gives you the inverse factorization that requires the minimal number of networks chained together.

But even if you could do this, you still have to simulate the whole thing at once.

And you have to do it multiple times for simulation-based training.

So now you should put your computer science glasses on.

And you ask, is there a way, maybe, to think about divide and conquer?

Can you solve the whole problem by partitioning it into multiple easier problems?

Sort of, in a way: can I solve this in a non-hierarchical way, by just simulating one unit of the hierarchy, training a network that is competent at estimating the parameters for this one unit, and then somehow aggregating after the fact?

So I can use all the existing infrastructure, but maybe just train two networks now, right?

One for individual units, one for the global parameters, and I never ever have to simulate the full hierarchical model exhaustively.

That is what we are currently extremely excited about.

So this is what we call compositional score-based modeling.

And we took the idea from earlier papers. For this, you need a diffusion model that estimates a score, right? It estimates the gradient of the log density instead of the density itself.

This makes everything that factorizes additive in score space.

And it lets you aggregate this through some very simple mathematical reformulations.

And this is great.
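As a toy check of this additivity, here is the compositional aggregation rule on Gaussians, where every score is known in closed form. This is an illustrative sketch, not BayesFlow's implementation:

```python
import numpy as np

# Compositional score aggregation on a conjugate Gaussian toy model.
# Prior: theta ~ N(0, 1); per-group data y_i ~ N(theta, 1).
# The compositional rule for the joint posterior score is:
#   score_full(theta) = sum_i score_i(theta) - (N - 1) * score_prior(theta)

def prior_score(theta):
    return -theta                      # d/dtheta log N(theta; 0, 1)

def group_posterior_score(theta, y_i):
    # p(theta | y_i) = N(y_i / 2, 1/2), so the score is -2*theta + y_i
    return -2.0 * theta + y_i

def full_posterior_score(theta, y):
    # p(theta | y_1..N) = N(sum(y) / (N + 1), 1 / (N + 1))
    n = len(y)
    return -(n + 1) * theta + np.sum(y)

rng = np.random.default_rng(0)
y = rng.standard_normal(5) + 0.7       # data for 5 groups
theta = 0.3
aggregated = sum(group_posterior_score(theta, yi) for yi in y) \
             - (len(y) - 1) * prior_score(theta)
# `aggregated` matches full_posterior_score(theta, y) up to float error
```

In practice the per-group scores come from a trained score network instead of a closed form, but the aggregation step is exactly this simple sum.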

This has already been shown for simple exchangeable models where you don't even have a hierarchical model.

You have a stream of data coming one at a time.

If you tackle such a problem in the typical way, you will exhaustively simulate all sequence lengths and then train some kind of transformer that generalizes across all sequence lengths.

But this way, with composition, you enable proper Bayesian updating.

The same idea has been applied to complete pooling.

So basically aggregating information from different experiments without hierarchical structure.

And we showed it for the first time for the case where you have a hierarchical structure.

And we also tried to scale it up.

to really big problems where you have hundreds of thousands of hierarchical groups.

Now, this works somewhat, but there is a caveat.

We're still not at a point where we can aggregate hundreds of thousands of groups without loss of calibration.

So there's a lot of work to be done there, but I'm confident to say this is already implemented in the dev version of BayesFlow.

For up to a few thousand groups, you can now use amortized inference and treat the problem as if it were not a hierarchical problem.

So feel free to test it out.

It's going to be out in the next release of BayesFlow.

But there are still some unsolved issues there that we're currently working on, in making this fully feature-complete and fully scalable to the really big problems.

I think it also makes sense that this is a hard problem.

If this works, it addresses most problems in the social sciences.

Okay.

Yeah.

I mean, that's already extremely good progress.

So maybe to summarize: what's the current state of BayesFlow? That's the state-of-the-art package to do amortized Bayesian inference.

So, for listeners who want to try ABI with a hierarchical model, can they already do that right now in BayesFlow, or do they have to wait for the next release?

And what are the limits of what is currently available? Which kinds of hierarchical models will not work well with what's currently available, and what's coming in the very next release?

Yes.

The interfaces are available on the development branch on GitHub.

There is also a full end-to-end tutorial by Jonas Saruda, who was also on the podcast.

He was the main engine behind this paper.

So I can say, based on the current state, if you have a moderate hierarchical model with around 1,000 groups, so 1,000 geographical locations with data available in each, and you want to aggregate this, go ahead.

Most likely, it's going to work.

And of course, if it doesn't, this comes with all the diagnostics that are available anyways.

So at least it's very easy to diagnose when things fail in this case.

When you have a lot of data, right, it is still going to work, in the sense that it's going to give you an approximation that is probably not very sharp and not very well calibrated, and you will notice it in the diagnostics, right?

So we are ourselves unsure at which point the method just fails silently, in the sense that you keep aggregating information but your estimate is not getting sharper in any way; for different models this occurs at around a few thousand groups.

So this is still in the works, I would say, but feel free to use it.

There are many cases in the social sciences. Your most typical model is: you have participants, each participant is a dataset, and you want to aggregate over different participants.

If you have a couple of hundred participants, this is ready to go.

If you have something crazy, crazy high dimensional, then

Just break it.

Let us know.

Yeah.

Yeah, exactly.

Always let us know, please.

And please make sure to add this paper and this example to the show notes for this episode, because I am sure people are going to want to check it out.

There's already a link to BayesFlow, but I think the link to that paper in particular, and to that tutorial from Jonas's work, is going to be very interesting and also very practical for people to apply.

So let's do that, please.

And so I'm going to have to start winding us down here.

But a practical question I have for you, which you started touching on with hierarchical models: in general, when does amortized inference give you fast but wrong answers, and you would not necessarily know it?

And what are the failure modes that people don't talk about enough and need to be aware of?

So you really want to put our hands into the conceptual beehive now.

Yeah, yeah, exactly.

I love that.

And I think it's very important.

Like, you know, this show is very practical.

So I want people to be able to know: okay, that method is really cool, and you can use it for this case, but in that case it's not appropriate; you need that one instead.

Yeah.

So let's start with how simulation-based inference actually matured.

Right.

When we were starting out, it was very exciting.

Just the fact that neural networks produced anything useful that resembles a posterior. Then more and more people entered the field, and we showed, okay, we could actually get fully Bayesian inference under ideal conditions. Now, what happens in less-than-ideal conditions?

Namely when you're working with a misspecified model.

Now, model misspecification is very dangerous ground to tread, because there are different definitions, right?

And depending on which field you're coming from, people understand different things.

But suppose, basically, the situation where you have your simulator:

You've trained the networks on simulations, but your real data is super atypical under the simulator.

What do you expect to happen there?

There is no general answer here.

As a proper Bayesian, you expect to get the posterior that you would have gotten from an oracle MCMC sampler, from a convergent sampler.

This is what we're calling the correct posterior under the wrong model.

Okay?

Now this may raise some eyebrows.

I've been working a lot with physicists, right?

And some physicists don't like the idea that you should be interpreting this object, even though it is the proper object that Bayes' rule implies.

If the model is wrong, why do you want to deal with this posterior?

No, we should improve the model.

Now, you talk to social scientists.

Nobody's making an ontological claim that we have the right model.

Actually, the working assumption is that the model is wrong anyway.

We still care about the posterior that we get from MCMC.

And unfortunately, neural networks are not guaranteed to give you the posterior that an oracle MCMC would have given you if your real data is very rare under the simulator.

And for that, we have diagnostics.

So we noticed, for example, that you can use known techniques from machine learning, from out-of-distribution detection or domain generalization, where you basically construct something like a representation space of your simulations, which is interpretable and low-dimensional, and you look at where the real data lands in this space.

If it's an outlier in this space, you probably shouldn't trust your inference, right?

In fact, all guarantees just drop out in this case.

And we've shown it on very different stress tests, misspecified priors, wrong likelihoods, unmodeled noise, that the degree of misspecification correlates with the degree of deviation between the neural-network-implied and the MCMC-implied posterior.
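A minimal sketch of the summary-space outlier check Stefan describes, under loud assumptions: the summaries here are stand-in Gaussian draws rather than outputs of a trained summary network, and the rule (flag data whose Mahalanobis distance exceeds an empirical quantile of the simulations' own distances) is one simple way to operationalize "where does the real data land in this space", not any library's built-in diagnostic.

```python
import numpy as np

rng = np.random.default_rng(0)

def mahalanobis_sq(x, mean, cov_inv):
    """Squared Mahalanobis distance of a single summary vector."""
    d = x - mean
    return float(d @ cov_inv @ d)

# Stand-in for summaries produced by a trained summary network:
# 5,000 simulated datasets, each compressed to a 4-dimensional summary.
sim_summaries = rng.normal(size=(5000, 4))

mean = sim_summaries.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(sim_summaries, rowvar=False))

# Empirical threshold: the 99th percentile of the simulations' own distances.
sim_d2 = np.array([mahalanobis_sq(s, mean, cov_inv) for s in sim_summaries])
threshold = np.quantile(sim_d2, 0.99)

def is_typical(observed_summary):
    """True if the observed summary lies within the bulk of the simulations."""
    return mahalanobis_sq(observed_summary, mean, cov_inv) <= threshold

print(is_typical(np.zeros(4)))       # in the bulk: inference can be trusted
print(is_typical(np.full(4, 10.0)))  # extreme outlier: guarantees drop out
```

If the observed summary is an outlier by this check, the amortized posterior is an extrapolation of the network and shouldn't be taken at face value.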

And there are ways to mitigate this.

We've proposed a way to diagnose it.

Both others and we have also looked at different ways to align the neural network estimates with what you would expect as a proper Bayesian.

Now, an unfortunate consequence of the inherent conceptual murkiness of the field is that people have used different terms in different ways, or used terms as synonyms that are not really synonyms.

We are also ourselves guilty of this in our first paper.

We didn't get model misspecification quite right in that case.

We actually looked at differences between distributions, but it turns out it's important to look at differences in only one direction: for example, whether your model is over-dispersed relative to reality versus under-dispersed relative to reality, et cetera, et cetera.

And when people talk about robust inference, it's really important to specify what they mean.

Do they mean robustifying the neural networks so that they give you the expected estimate under the assumption of proper Bayesianism, or robustifying the underlying Bayesian model, which actually entails changing the model, right?

As in, for example, swapping out the Gaussian likelihood for a Student-t in a regression case, or assuming a mixture between a noise process and a signal process.

Right?

So in practice, by the way, it turns out a very simple way to at least have a high probability of being aligned is to simply add noise to the data in some way.

For example, for human data it's very easy: assume some sort of random guessing process, a simple uniform process.

That already suffices to bring you in line with what MCMC would give you, at the cost of, of course, a small loss of accuracy when the model is well specified.
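The noise-augmentation idea can be sketched as a simulator wrapper. Everything here is a toy: the lognormal response-time model stands in for a real cognitive model, and the names `guess_prob` and `rt_range` are hypothetical parameters for the uniform random-guessing contaminant, not anything from BayesFlow.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_rt(theta, n_trials):
    """Toy response-time simulator (a stand-in for a real cognitive model)."""
    return rng.lognormal(mean=theta, sigma=0.3, size=n_trials)

def simulate_robust(theta, n_trials, guess_prob=0.05, rt_range=(0.2, 3.0)):
    """Contamination mixture: with probability guess_prob, a trial is
    replaced by a uniform 'random guess', so during training the networks
    see data that the core model alone could never produce."""
    rts = simulate_rt(theta, n_trials)
    guess = rng.random(n_trials) < guess_prob
    rts[guess] = rng.uniform(*rt_range, size=guess.sum())
    return rts

data = simulate_robust(theta=0.0, n_trials=1000)
print(data.shape)
```

Training on the contaminated simulator is what buys the robustness; at inference time nothing changes.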

I think this is the sort of fundamental trade-off in robust methods.

There's also no way around this.

There are other approaches which try to, for example, do some kind of post-hoc correction to the neural networks, trading off amortization through some sort of semi-amortized method with post-hoc optimization.

This is also promising.

Then again, others are saying, let's go in the direction of generalized Bayes.

So learn something like a power-scaled Bayesian model, a power-scaled posterior, which deviates from the correct posterior under the proper Bayesian model, but has higher predictive performance under misspecification.
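The power-scaling idea can be written down exactly for a conjugate Gaussian toy model. The tempering exponent `eta` is the illustrative knob here: `eta = 1` recovers standard Bayes, while `eta < 1` downweights a possibly misspecified likelihood and widens the posterior; this is a sketch of the general concept, not the specific method Stefan alludes to.

```python
import numpy as np

def power_posterior(x, sigma=1.0, eta=1.0):
    """Power-scaled posterior for x_i ~ N(theta, sigma^2), theta ~ N(0, 1):
    p_eta(theta | x) is proportional to p(x | theta)**eta * p(theta).
    eta = 1 is standard Bayes; eta < 1 tempers the likelihood."""
    n = x.size
    prec = 1.0 + eta * n / sigma**2     # prior precision + tempered data precision
    mean = eta * x.sum() / sigma**2 / prec
    return mean, prec ** -0.5           # posterior mean and standard deviation

x = np.array([1.2, 0.8, 1.5, 0.9, 1.1])
m1, s1 = power_posterior(x, eta=1.0)
m5, s5 = power_posterior(x, eta=0.5)
print(s5 > s1)   # True: tempering the likelihood widens the posterior
```

The extra width is exactly the "deviation from the correct posterior" being traded for robustness.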

So I would say this is one of the frontiers now of amortized inference, of simulation-based inference.

I mean, there's some conceptual work to be done bringing all these methods under a common conceptual framework, noticing the differences and similarities with other deep learning fields like domain generalization.

For example, if you think about it, in SBI we're doing domain generalization.

So it's good that you mentioned that because it's still one of the itchy problems of the field.

But the bottom line is if you're a practitioner, don't just plug your data into the network, get the results at face value and write your paper.

Always diagnose.

And so it's workflows, not analyses.

Yeah, exactly.

I think people...

I think people are aware of that on this show.

Actually, you know what, Stefan?

I think it's time to call it a show, but I do still have a lot of questions for you, as you can see from the Google Doc.

And I think these are very important questions, and very interesting questions not only to me, but to the whole audience.

So you know what?

Let's do a part two of this discussion and record it later, because I think it's very important.

And as a cliffhanger for people, I will not ask you right now the last two questions I ask every guest at the end of the show.

I will ask you those at the end of the second part.

And so that way you folks have to tune in for the second part.

How does that sound to you?

Excellent.

Thank you for the invitation, which I happily accept.

And sorry for talking too much.

No, that's perfect.

I think it's great.

Yeah.

I think it's great.

I mean, you're very passionate about what you're doing, and you love to explain and educate and teach.

So I think it's really perfect.

You make my job easier, to be honest.

So that's perfect.

So let's do that, folks.

This is the end of part one.

We'll see you very soon for part two.

Stefan, thanks a lot for taking the time and being on the show.

This has been another episode of Learning Bayesian Statistics.

Be sure to rate, review and follow the show on your favorite podcatcher, and visit learnbayesstats.com for more resources about today's topics, as well as access to more episodes to help you reach a true Bayesian state of mind.

That's learnbayesstats.com.

Our theme music is "Good Bayesian" by Baba Brinkman, feat. MC Lars and Mega Ran.

Check out his awesome work at bababrinkman.com.

I'm your host, Alex Andorra.

You can follow me on Twitter at alex_andorra, like the country.

You can support the show and unlock exclusive benefits by visiting patreon.com/LearnBayesStats.

Thank you so much for listening and for your support.

You're truly a good Bayesian.

Change your predictions after taking information in.

And if you're thinking I'll be less than amazing, let's adjust those expectations.

Let me show you how to be a good Bayesian. Change calculations after taking fresh data in. Those predictions that your brain is making, let's get them on a solid foundation.

Key Takeaways

Simulation-based inference (SBI) uses a mechanistic simulator as an epistemic tool: you train a neural network on a large number of labeled simulations and then deploy it on real, unlabeled data. The "sim-to-real" framing captures the key asymmetry -- your network never sees real data during training, only simulations, but it generalizes to real observations at inference time. This is the opposite of the more common "synthetic-for-ML" approach, where fake data is used purely to augment real training data.
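The sim-to-real loop can be sketched end to end with a toy conjugate model, where the "network" is just a least-squares fit from a summary statistic to the parameter. Everything here (the Gaussian simulator, the mean summary, the linear estimator) is an illustrative stand-in under stated assumptions, not BayesFlow's API; the point is the shape of the workflow: simulate labeled pairs, train once, deploy instantly.

```python
import numpy as np

rng = np.random.default_rng(2)

# --- Training phase: the estimator only ever sees labeled simulations ---
n_sims, n_obs, sigma = 20000, 10, 1.0
theta = rng.normal(size=n_sims)                 # parameters drawn from the prior
x = theta[:, None] + sigma * rng.normal(size=(n_sims, n_obs))
xbar = x.mean(axis=1)                           # hand-crafted summary statistic

# "Amortized" posterior-mean estimator: one least-squares fit, reused forever.
A = np.stack([xbar, np.ones(n_sims)], axis=1)
coef, *_ = np.linalg.lstsq(A, theta, rcond=None)

# --- Deployment phase: instant inference on new, unlabeled data ---------
def posterior_mean(data):
    return coef[0] * data.mean() + coef[1]

# For this conjugate model the exact posterior mean is n * xbar / (n + sigma^2),
# so the learned slope should land close to 10 / 11, roughly 0.909.
print(coef[0])
```

With a real amortized method the linear fit becomes a neural network and the point estimate becomes a full posterior, but the asymmetry is the same: all training cost is paid up front on simulations.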

The live demo in the episode centers on an open-source AI agent skill, co-developed by Stefan and Alex, that teaches an AI coding agent to run a complete, state-of-the-art amortized inference workflow. Because amortized inference is recent enough that it's underrepresented in LLM training data, vanilla agents tend to get it wrong. The skill injects the right methodology: it guides the agent to set up the simulator, choose the right network architecture, run a pilot, train with appropriate diagnostics, and produce an actionable report -- without the user needing to know the details.


Calibration coverage tells you whether your posterior uncertainty is honest -- whether your credible intervals actually contain the true parameter at the right frequency. A model can show poor parameter recovery yet still be well-calibrated (because it's falling back on the prior), or it can appear to recover parameters while being poorly calibrated. Running calibration diagnostics both in-sample and out-of-sample is especially revealing for hierarchical models, which often appear to underfit in-sample but generalize much better out-of-sample thanks to shrinkage.
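Coverage can be checked with nothing more than a simulator and an interval rule. A toy sketch, assuming a conjugate Gaussian model where the exact posterior is available, so the empirical coverage of 90% credible intervals should land near the nominal 0.90; with an amortized posterior you would substitute the network's intervals for the analytic ones.

```python
import numpy as np

rng = np.random.default_rng(3)

n_sims, n_obs, sigma = 5000, 10, 1.0
z = 1.6448536269514722        # standard-normal 95th percentile -> 90% interval

covered = 0
for _ in range(n_sims):
    theta = rng.normal()                       # true parameter from the N(0, 1) prior
    x = theta + sigma * rng.normal(size=n_obs)
    prec = 1.0 + n_obs / sigma**2              # conjugate posterior precision
    m = x.sum() / sigma**2 / prec              # conjugate posterior mean
    s = prec ** -0.5
    lo, hi = m - z * s, m + z * s              # central 90% credible interval
    covered += lo <= theta <= hi

coverage = covered / n_sims
print(coverage)   # lands near the nominal 0.90 for a well-calibrated posterior
```

Systematic over- or under-coverage at this step is exactly the dishonesty in posterior uncertainty the takeaway warns about.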

Hierarchical models introduce nested structure -- parameters at the group level that govern individual-level parameters -- that amortized networks have historically struggled to handle efficiently. BayesFlow is actively evolving to address this, but it remains an area of active development. The calibration coverage diagnostics are particularly useful here because they reveal the characteristic in-sample underfitting that makes hierarchical models look worse than they are when evaluated naively.

Model misspecification means your simulator doesn't accurately capture the real data-generating process. In MCMC, misspecification shows up in predictive checks and posterior behaviour. In amortized inference, it shows up as a divergence between what the neural network infers and what MCMC would have inferred for the same data. Stefan's team has shown that the degree of misspecification correlates directly with this divergence -- and that a practical first line of defence is adding noise to the simulator's output during training, which robustifies the network at a small cost to accuracy when the model is correct.

Never treat your results as ground truth without running diagnostics. The whole point of talking about "workflows" rather than "analyses" is that inference is iterative -- you train, diagnose, identify what's working and what isn't, and cycle back. An amortized inference network that's misspecified or undertrained will give you confident-looking results that don't mean what you think they mean. The skill and BayesFlow's built-in diagnostics are designed to make skipping this step impossible.

Related Episodes