This is the second half of my conversation with Stefan Radev, and where part one focused on the practical mechanics of amortized inference (and a live demo of the BayesFlow agent skill we built together), part two is where we get into the bigger ideas -- the ones I find most intellectually exciting about this whole research program.

The thread that runs through everything we discuss is the same: Bayesian statistics has always had a computational tax, and most of us have quietly accepted that tax as the price of doing things properly. We skip prior predictive checks because they take time. We don't run simulation-based calibration because it requires running inference thousands of times. We don't do sensitivity analyses unless reviewers force us to. And in doing so, we systematically underestimate uncertainty and miss problems with our models. Amortized methods change that calculus in ways that go far beyond just making inference faster.

Simulations Across the Full Bayesian Workflow

A natural starting point is the paper Stefan co-wrote with Paul Bürkner on simulations in statistical workflows, which decomposes the Bayesian modeling process into four stages and asks, at each one, how simulations can help.

Stage one is model specification -- building the model, choosing priors, and verifying that what comes out of the joint prior looks like something that could plausibly happen in the world. This is where prior predictive checks live, and Stefan's blunt observation is that despite a decade of advocacy from people like Michael Betancourt, these checks remain criminally underused in published work. The barrier isn't technical; it's cultural. People want to fit data, not think about what their model says before they see data.
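As a minimal sketch of what such a check can look like, here is a toy example that is not taken from the paper: a hypothetical Gaussian model of adult heights in centimeters, where the priors, the plausible range of 50 to 250 cm, and all function names are illustrative assumptions. It simulates from the joint prior and counts how often the model produces impossible people.

```python
import random

def prior_predictive_heights(mu_prior_sd, n_draws=5000, seed=1):
    """Simulate heights (cm) from the joint prior of a toy Gaussian model."""
    rng = random.Random(seed)
    heights = []
    for _ in range(n_draws):
        mu = rng.gauss(170.0, mu_prior_sd)    # prior on the mean height
        sigma = rng.expovariate(1 / 10.0)     # prior on the spread, mean 10 cm
        heights.append(rng.gauss(mu, sigma))  # one simulated observation
    return heights

def share_implausible(heights, low=50.0, high=250.0):
    """Fraction of simulated adults outside a (hypothetical) plausible range."""
    return sum(h < low or h > high for h in heights) / len(heights)

vague = share_implausible(prior_predictive_heights(mu_prior_sd=100.0))
informed = share_implausible(prior_predictive_heights(mu_prior_sd=10.0))
print(f"vague prior: {vague:.1%} implausible, informed prior: {informed:.1%}")
```

Under these assumptions, the supposedly "uninformative" prior puts a large share of its mass on heights nobody has ever had, which is exactly the kind of problem that is visible before any data is seen.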

Stage two is model verification -- parameter recovery, simulation-based calibration, calibration coverage, all the checks that tell you whether your inference is actually informative. Classically, these are expensive: you have to simulate thousands of synthetic datasets and run full inference on each. With amortized methods, the cost is essentially zero -- once you've trained an inference network, evaluating it on a thousand simulated datasets takes seconds. What used to be a publishable contribution becomes a default sanity check.
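To illustrate the mechanics of simulation-based calibration, here is a small sketch under loud assumptions: it uses a conjugate Normal model whose exact posterior is known, and that exact posterior stands in for whatever sampler you actually use (in an amortized workflow, the trained network would take its place). The function names and the `posterior_sd_scale` knob are hypothetical and exist only to show what a miscalibrated posterior looks like.

```python
import math
import random

def sbc_ranks(posterior_sd_scale=1.0, n_sims=1000, n_obs=10, n_draws=99, seed=0):
    """Rank of the true parameter among posterior draws, over many simulated datasets."""
    rng = random.Random(seed)
    ranks = []
    for _ in range(n_sims):
        theta = rng.gauss(0.0, 1.0)                        # draw from the prior N(0, 1)
        y = [rng.gauss(theta, 1.0) for _ in range(n_obs)]  # simulate data given theta
        post_mean = sum(y) / (n_obs + 1)                   # exact conjugate posterior mean
        post_sd = math.sqrt(1.0 / (n_obs + 1)) * posterior_sd_scale
        draws = [rng.gauss(post_mean, post_sd) for _ in range(n_draws)]
        ranks.append(sum(d < theta for d in draws))        # rank statistic in 0..n_draws
    return ranks

def tail_share(ranks, n_draws=99):
    """Share of ranks in the lowest or highest 10% of positions (about 0.2 if calibrated)."""
    lo, hi = 0.1 * (n_draws + 1), 0.9 * (n_draws + 1)
    return sum(r < lo or r >= hi for r in ranks) / len(ranks)

calibrated = tail_share(sbc_ranks(posterior_sd_scale=1.0))
overconfident = tail_share(sbc_ranks(posterior_sd_scale=0.5))
print(calibrated, overconfident)
```

With a calibrated posterior the ranks are uniform, so about 20% of them land in the two outer tenths. A posterior that is too narrow pushes far more true values into the tails. The loop runs inference once per simulated dataset, which is why the check is expensive with a classical sampler and cheap once inference is amortized.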

Stages three and four cover inference itself and model comparison, where similar points apply. The takeaway isn't that amortization replaces classical methods -- it's that it removes the budgetary objection to doing the diagnostics that everyone agrees you should be doing.

Prior Elicitation Without a Distributional Family

One thread from that paper I find particularly compelling is prior elicitation: how do you turn a domain expert's vague intuitions into a probability distribution?

The classical answer is to ask the expert to choose a family (Normal? Beta? Lognormal?) and specify hyperparameters. This is hopeless for almost anyone who isn't already a statistician. The expert knows what realistic data looks like; they don't think in distributional families.

The newer approach -- being developed by Paul Bürkner, Luigi Acerbi, and others -- flips the question. Don't ask the expert to specify a prior; ask them to evaluate samples from one. You generate synthetic datasets, the expert says which ones look plausible and which don't, and a generative model in the loop adapts until the implied prior matches their intuitions. The hard part is stability -- inverse problems like this can admit many solutions -- but the basic idea is one of the most promising directions for making Bayesian modeling accessible to non-statisticians. It's also one of the places where current LLMs can clearly help: translating qualitative domain knowledge into structured probabilistic constraints is exactly what they're good at, especially as a semantic front-end to a more rigorous numerical pipeline.

Foundation Models for Bayesian Inference: Closer to Stockfish than ChatGPT

The natural next question is whether we'll eventually have a foundation model for Bayesian inference -- a single network you can prompt with a model and some data and have it return a posterior.

Stefan's framing here is one of the most clarifying I've heard on this topic. His view is that the right analogy is chess, not language. If you ask current LLMs to play chess, they will routinely make illegal moves -- teleporting pieces, inventing positions, ignoring rules. The solution is not to fine-tune the LLM to play chess from scratch. The solution is to teach the LLM when to call Stockfish.

The same logic should apply to Bayesian inference. You don't want a general LLM to do MCMC numerically; you want a semantic layer that understands the analysis goal and calls a specialized inference engine -- MCMC, variational, or amortized -- to do the actual work. Agent skills are an early step toward this architecture. The longer-term vision is what's increasingly being called neuro-symbolic systems: neural networks providing the semantic glue, with auditable, interpretable specialized components doing the rigorous numerical work.

What this means concretely for amortized inference research is a push toward generalized engines -- networks that don't need retraining when a single assumption changes. Stefan's lab has been working on amortized engines that generalize across families of GLMs, across priors, across choices of which parameters to fix. The scaling result here is striking: thanks to weight sharing, amortizing over ten priors doesn't cost ten times what amortizing over one does. The marginal cost of generality is small, and the practical implications are large.

The Iceberg of Uncertainty

The most important section of the conversation, from a practitioner's standpoint, is about sensitivity analysis -- and it opens with an example I'd never seen before that I think every Bayesian should know.

In the early days of the COVID pandemic, a UK modeling group ran an exercise: they gave the same data to nine independent analysis teams and asked them to estimate R, the reproduction number -- the most famous number on the news at the time. The teams came back with very different estimates. The lowest team's credible interval included the possibility that the pandemic was receding (R < 1). The highest team's interval included the possibility that each infected person was infecting one and a half others on average.

The key observation: the variation across teams was larger than the uncertainty within any team. Every team was confidently reporting a posterior that, taken on its own, dramatically underestimated total epistemic uncertainty. The implicit analytical choices each team made -- about model structure, preprocessing, priors -- mattered as much as the data itself, and none of those choices were being surfaced in the final report.
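One way to make this observation quantitative is the law of total variance: if you treat the teams as an equally weighted mixture, total variance is the average within-team variance plus the variance of the team means. The sketch below uses entirely made-up numbers for nine hypothetical teams. They are not the estimates from the UK exercise and only illustrate the arithmetic.

```python
import statistics

# Hypothetical (made-up) posterior means and standard deviations for nine teams.
team_means = [0.85, 0.95, 1.00, 1.05, 1.10, 1.20, 1.25, 1.35, 1.45]
team_sds = [0.05, 0.06, 0.05, 0.07, 0.05, 0.06, 0.05, 0.06, 0.05]

# Law of total variance for an equally weighted mixture of the team posteriors.
within = statistics.mean([sd ** 2 for sd in team_sds])  # average within-team variance
between = statistics.pvariance(team_means)              # variance of the team means
total_sd = (within + between) ** 0.5

print(f"largest single-team sd: {max(team_sds):.3f}, pooled sd: {total_sd:.3f}")
```

In this toy setup the pooled standard deviation is several times larger than what any single team reports, which is the iceberg in numerical form.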

This is the iceberg. Every Bayesian analysis you've ever read sits on top of a set of implicit choices that, if varied, could change the answer. We almost never surface that uncertainty -- partly out of disinclination, but mostly because doing so classically is prohibitively expensive.

Multiverse Analysis Made Affordable

The principled response to the iceberg is multiverse analysis: instead of running your analysis once, run it under every reasonable combination of choices -- preprocessing pipelines, priors, observation models, model parameterizations -- and report the distribution of results. If 99% of configurations agree, that's a much stronger statement than any single configuration could make. If they disagree, that disagreement is itself an important scientific finding.

The classical catch is cost. If a single Bayesian analysis takes four hours, running a thousand configurations takes nearly six months. That's why multiverse analyses are mostly published when reviewers demand them, and almost never run during exploratory work.
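The arithmetic behind that estimate can be sketched as follows. The four analysis dimensions and their sizes are placeholder assumptions, chosen only so that the grid has a thousand cells. The four hours per fit comes from the paragraph above.

```python
from itertools import product

# Hypothetical analysis choices; the labels are placeholders, not from the paper.
choices = {
    "preprocessing": [f"pipeline_{i}" for i in range(10)],
    "prior": [f"prior_{i}" for i in range(5)],
    "likelihood": [f"likelihood_{i}" for i in range(5)],
    "parameterization": [f"param_{i}" for i in range(4)],
}

# Every combination of choices is one "universe" in the multiverse.
multiverse = [dict(zip(choices, combo)) for combo in product(*choices.values())]

hours_per_fit = 4
total_hours = hours_per_fit * len(multiverse)
total_days = total_hours / 24
print(len(multiverse), total_hours, round(total_days, 1))
```

A thousand configurations at four hours each is 4,000 hours, or roughly 167 days of sequential compute, and the grid grows multiplicatively with every extra choice you add.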

This is where amortized inference is genuinely transformative. Stefan and colleagues have shown that amortizing over multiple analytical configurations -- different priors, different likelihoods, different preprocessing -- exhibits sub-linear scaling. Doubling the number of configurations doesn't double the simulation cost; it barely increases it.

The reason is weight sharing: the network learns shared structure across related models, so the marginal cost of generality drops as you add configurations. The concrete recommendation in the paper is to run multiverse analyses as a default, not as an optional robustness check buried in an appendix.

One of the case studies looks at climate-change forecasting models: when you run the sensitivity analysis, those models turn out to be relatively insensitive to choice of prior but much more sensitive to model parameterization. That kind of finding -- which choices matter and which don't -- is impossible to surface without doing the analysis.

A related idea Stefan brings up is ensembles as diagnostics. Train multiple inference networks with different architectures or initializations on the same problem. If they agree, you have additional confidence. If they disagree, that disagreement is a signal -- it often points to identifiability problems or to phenomena like dancing Bayes factors, where small data perturbations cause model comparison results to swing wildly. BayesFlow now includes an ensemble workflow that makes this almost free to run.

Hybrid Workflows

It's worth being explicit about something that comes up repeatedly in the conversation: Stefan is not arguing that amortized inference replaces MCMC. He's arguing for hybrid workflows.

The natural division of labor is to use amortized inference for exploration, sensitivity analysis, and tasks requiring repeated inference; and to use MCMC for verification on the cases that matter most.

Amortized methods have known failure modes -- model misspecification, out-of-distribution data -- and the right defense is to keep classical methods in the toolbox as a reference point. Conversely, MCMC's computational expense makes it impractical for the multiverse-style work that's needed to honestly characterize uncertainty.

This framing matters because parts of the Bayesian community have treated amortized methods with suspicion -- as a deep-learning encroachment on principled statistics. Stefan's framing is exactly the right one: these are complementary tools, and the practitioner's job is to use both well.

Looking Ahead

Three priorities for BayesFlow over the next stretch: generalized amortized engines that don't need retraining when assumptions change; closing the gap between SBI research and production-ready software (a lot of good ideas live in papers but haven't made it into BayesFlow yet); and going after harder problems -- high-dimensional data, expensive simulators, small-data regimes -- where SBI is currently underutilized.

That last point is the most exciting one to me. Most SBI benchmarks are toy problems. The methodology has matured to the point where the field needs to stop benchmarking on simulated examples and start engaging with the messy, high-stakes scientific problems where probabilistic surrogate models could genuinely change what's computationally possible.

Check out the full episode above, and the show notes for links to the papers we mentioned. If you missed part one, go listen to that first -- it covers Stefan's origin story, the foundations of amortized inference, and the live demo of our BayesFlow agent skill.

You can also interact with the episode on NotebookLM! Ask questions, generate flashcards, and more.

Hope you enjoyed this two-parter, and see you in two weeks, my dear Bayesians!

Chapters

00:00 How does amortized inference fit into modern Bayesian workflows?

06:01 What role do simulations play across the full Bayesian workflow?

12:12 How do you elicit priors from a domain expert who doesn't think in distributions?

19:01 What would a foundation model for Bayesian inference actually look like?

35:32 What is self-consistency in amortized inference and why does it matter?

39:22 How does semi-supervised learning improve simulation-based inference?

43:16 Why is sensitivity analysis so important yet so underused in Bayesian practice?

47:40 What is multiverse analysis and how does it change how we report Bayesian results?

51:32 How does amortized inference make sensitivity and multiverse analysis affordable?

01:02:47 How do amortized inference and classical MCMC complement each other?

01:10:08 What are the next major directions for BayesFlow and amortized inference research?

Welcome to part two of my conversation with Stefan Radev, and I promise you won't regret staying for it, because this is where we go deeper into the ideas that I find most intellectually exciting in the whole amortized inference space.

We start with the role of simulations across the full Bayesian workflow and from there we get into prior elicitation.

How do you translate the domain expert's vague intuitions into a probability distribution, and can generative AI help?

Then we tackle foundation models for Bayesian inference and what one would actually look like. And then comes the part I found most practically useful: sensitivity analysis and multiverse analysis, and how amortized inference can make running a full multiverse analysis nearly as cheap as running a single analysis.

Welcome to Learning Bayesian Statistics, a podcast about Bayesian inference, the methods, the projects, and the people who make it possible.

I'm your host, Alex Andorra.

You can follow me on Twitter at Alex underscore Andorra, like the country.

For any info about the show, learnbayesstats.com is the place to be.

Show notes, becoming a corporate sponsor, unlocking Bayesian merch, supporting the show on Patreon, everything is in there.

That's learnbayesstats.com.

If you're interested in one-on-one mentorship, online courses, or statistical consulting, feel free to reach out and book a call at topmate.io slash alex underscore andorra.

See you around, folks, and best Bayesian wishes to you all.

Hello, my dear Bayesians!

We set up a new way for you to support the show at no extra cost to you.

Because yes, we have a Friends of the Show page now.

And on it, you will find a bunch of discounts for you on products that I actually genuinely use myself, starting with Claude Code, my AI pair programmer in the terminal, the tool I reach for to keep the podcast website evolving.

Riverside, for podcast recordings.

That's the software I use for the show itself.

Wise, for international transfers and multi-currency spending.

I've been loving them for years.

They have super low fees and instant wire transfers.

Rovmise, to get the most out of credit card points and airline miles.

And a lot more, honestly.

Each one comes with a perk for you.

So make sure to check them out at learnbayesstats.com slash friends of the show.

It's in the show notes. I swear, no fluff, no sponsored posts, just tools I genuinely use and recommend, and think that you'd like too.

Now, let's tune in for the second part of our episode with Stefan Radev.

Stefan Radev, welcome back to Learning Bayesian Statistics.

Good to see you again, Alex.

Yeah, it's always great to have you on the show, honestly.

So, that's a delight.

So I've had recurring guests, but usually I don't do two-part episodes.

It's not really a hard rule, it's just how it ended up being.

But I did do that at the very beginning of the show, so eight years ago, something like that.

Colin Carroll had a two-part episode also.

That was the third episode at the time.

It was numbered 3.1 and 3.2, you know, software convention, of course. But for this one, I will do the normal integer counts.

That's actually just easier for people.

So yeah, this is only the second time this happens, so this is a very rare event.

The Poisson rate on two-part episodes is very low.

So thank you for taking the time again.

I'm starting to get used to this, but I honestly don't know what I'm going to be doing with my morning next Friday.

So yeah, let's get back into it. Folks, I obviously encourage you to listen to the first part, because that's going to be the whole context of this new episode.

In this one, we're just going to jump in where we left off at the end of the first part.

So if you want all the background about Stefan, about amortized Bayesian inference, the agent skill we've built together, a live demo of that, and then the limitations of these workflows, that's the first part.

Now, I'd like to talk about something a bit more theory-oriented, at least at the beginning, because you do that a lot.

And something I'm very curious about is that you have a paper with a friend of the show, Paul Bürkner, on simulations in statistical workflows.

And I think it's a very interesting one.

We definitely need it in the show notes.

What's the core argument of that paper?

We do have a lot of papers with Paul.

This is just one of these conceptual slash theoretical papers.

And the core argument is basically Bayesian statistics has never been more computational than it is now.

And we basically parcellated the minimal Bayesian workflow into four stages.

And then we discuss how simulations can accelerate, bootstrap, or help in various ways in each of these stages.

So let's start first with stage number one, which we call model specification.

And everybody is familiar with this stage; you can call it model development.

And this is where all existing Bayesian workflows recommend doing prior predictive checks, okay, which you can do rather informally, by just running a few simulations from the model and ensuring that what comes out meets a set of expectations that you have.

You could also do it very formally, Michael Betancourt style.

So prior pushforward checks, where you take the high-dimensional outputs of your model and transform them into a low-dimensional, more interpretable representation, which you can then also formally check: is it under- or overdispersed relative to reality or your domain, et cetera.

So you can use these would-be or what-if worlds from the simulations to basically validate your model before you see any data. It's actually surprising how many models you could simply discard by doing these types of prior predictive checks, and I still think that, despite all these workflows, they are sort of underutilized in practice, from what I see in a lot of papers.

Another aspect of this is prior elicitation, which we may touch upon later down the line, but we also have another set of ideas on how you can translate non-probabilistic expert knowledge into probabilistic knowledge, into distributions, also using simulations.

Of course, a lot of this paper is about simulation-based inference, which is unsurprising given that this is our bread and butter of research.

But still, we should remember that Markov chain Monte Carlo methods were also termed simulation-based methods a couple of decades ago, simply because you're using random numbers to simulate the distribution.

Although this notion has changed a little bit nowadays.

When you talk about simulations, you talk about simulating from the model, not simulating the posterior, as it used to be called.

So this is stage number one, model specification.

Stage number two is what we term model verification.

And this is another part of the workflow, where you basically want to check whether your inferences are informative and whether your inferences are well calibrated.

So this contains classical procedures like simulation-based calibration and parameter recovery checks, all very important, all traditionally associated with very high computational costs.

Because these checks are basically brute force.

You simulate sometimes thousands of synthetic data sets and you run inference on each of these data sets.

And this, of course, is also the reason why you don't see a lot of these checks in papers.

In research papers, even doing such a check, a parameter recovery study for a new model, is a valuable contribution, simply because you had to pay a certain computational price to get it.

So this is also one aspect, by the way, which is greatly accelerated by amortized methods and simulation-based inference, because all these checks just come for free if you're doing amortized inference.

It's literally just a few seconds to obtain millions of posterior samples, for something that traditionally used to take maybe a few days, depending on your model.

So after the stage of model verification, we have the model inference part.

And this is basically simulation-based inference.

The somewhat more modern idea that you now use simulations for model identification, starting with approximate Bayesian computation, Bayesian projection methods, and into the modern realm of neural simulation-based inference.

By the way, we also talked about some frequentist approaches, which we should also not disregard, as they exist.

And the fourth stage.

That's something everybody is familiar with, like posterior predictive checks: model checking, as we call it.

There's also a nice aspect of simulation-based inference being embedded into the general Bayesian workflow, because you also just inherit all these best practices that you, as a Bayesian, are doing anyway, or at least were taught to do.

Yeah, yeah, yeah.

No.

And I think the paper does a very good job at summarizing all these steps.

So let's definitely add that in the show notes, the link to that paper.

And since you touched on prior elicitation a bit, this, as you were saying, is a classic headache in Bayesian workflows, as are all the checks we just talked about. I do think the agent skills help a lot with that, because they will be able to just automate doing that.

And this is basically a very repetitive part of the workflow of our job, which is not, you know, where we most need to be in the loop, and it's a lot of checks.

So I think this can actually be much more automated.

I've been doing that myself in my own workflow with the Bayesian workflow skill, for instance, where I've added a lot of that.

The new ArviZ 1.0+ helps a lot, because they have started to write functions that actually give you back text reports, which are perfect for LLMs to parse through and basically see if all the steps passed.

So I think this is going to be much more helpful, because if something doesn't look good, you can just tell the agent what to do, what to change, and then do another pass on all the checks, and it just runs in the background.

You don't have to supervise that.

I think it's going to be very helpful.

It already has been, in my experience.

But so yeah, you also work a lot on this particular part of simulations, which is prior elicitation.

And you've worked on simulation-based approaches to it, if I understood correctly.

So what does that look like in practice now?

We did two papers on this.

This work was led by a graduate student of Paul Bürkner, Florence Bockting.

She is now...

a research software developer for acubectoring.

So the logic of this work was as follows: prior elicitation techniques have been around for decades.

Prior elicitation is a fancy term to say, okay, we ask a bunch of experts to tell us what are meaningful ranges, let's say, for parameters.

And the realization in a lot of elicitation work has been that typically you can't just ask experts, give me a prior distribution.

Okay, because that's not how human knowledge is typically represented.

It's not represented in a strictly formal probabilistic way.

It's represented sometimes in text structures.

And so elicitation is a translation process, basically.

And so the perennial problem there is translating this tacit knowledge, or non-probabilistic knowledge, into prior distributions that conform to the expert's expectations, which may not be obvious.

An expert can even tell you, these are the priors that you should use, but then these priors, when you actually do prior checking or model checking, lead to surprising results.

For example, your model diverges a lot.

Okay.

So the idea here was to automate this with a simulation-based procedure in which, okay, you still ask your experts or expert to provide certain quantities.

Let's say that the expert gives you quantiles, a mean, and that's all you've got.

So what you can then do is parameterize your priors with what we call hyperparameters.

For example, instead of fixing the mean and the standard deviation of a Gaussian prior, you let them vary and you start simulating from this hyper model.

And when you simulate this, you also compute the model-implied elicited quantities.

We compute the model-implied quantiles and means, let's say, and we basically compare them to the ones given by the expert, right?

Initially, they'll be very far apart.

And so we now need a way to propagate this error, the divergence between the model-implied and the expert-implied quantities.

And if you can keep your pipeline differentiable end to end, that's nice.

You use gradient descent, right?

You can do various reparameterization tricks to make sampling differentiable, et cetera.

If you can't keep the whole pipeline differentiable, you can use black-box optimization, something like Bayesian optimization.

And you run this iterative procedure and you end up, ideally, with the hyperparameters of the prior that most closely match what the expert had in mind.
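Editorial illustration, not part of the conversation: the loop Stefan describes can be sketched in a few lines. Everything specific here is an assumption made for the example: the expert's three quantiles, the toy model (a Gaussian prior on a log-scale parameter, pushed through an exponential to get the observable), and the plain grid search, which stands in for the gradient descent or Bayesian optimization mentioned above. It is not the method from the papers.

```python
import math
import random
import statistics

# Hypothetical expert statements about an observable (say, seconds to finish a task).
expert_quantiles = {"q10": 0.5, "q50": 1.0, "q90": 2.0}

# Fixed standard-normal draws, reused for every candidate (reparameterization:
# theta = mu + sigma * z), so the loss surface is not blurred by simulation noise.
rng = random.Random(42)
z_draws = [rng.gauss(0.0, 1.0) for _ in range(2000)]

def model_implied_quantiles(mu, sigma):
    """Simulate theta ~ Normal(mu, sigma), push it through y = exp(theta), summarize."""
    y = [math.exp(mu + sigma * z) for z in z_draws]
    deciles = statistics.quantiles(y, n=10)  # nine cut points: 10%, 20%, ..., 90%
    return {"q10": deciles[0], "q50": deciles[4], "q90": deciles[8]}

def discrepancy(mu, sigma):
    """Squared distance between model-implied and expert quantiles, on the log scale."""
    implied = model_implied_quantiles(mu, sigma)
    return sum((math.log(implied[k]) - math.log(v)) ** 2
               for k, v in expert_quantiles.items())

# Black-box search over the hyperparameters (a plain grid, for simplicity).
grid = [(i / 10, j / 20) for i in range(-10, 11) for j in range(2, 31)]
best_mu, best_sigma = min(grid, key=lambda hp: discrepancy(*hp))
print(best_mu, best_sigma)
```

For these made-up quantiles the analytic answer is a prior mean of 0 and a standard deviation of about 0.54 on the log scale, so the search should land close to that. Replacing the grid with an optimizer changes the efficiency, not the logic: simulate, summarize, compare with the expert, adjust.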

So that's, that was the idea of the first paper.

And we showed that this, this can be very useful for a bunch of case studies.

Now in the follow-up paper, we decided to take this one step further.

In the first approach, it's true that you have some flexibility, but you're still confined to a certain parametric family.

Yeah, you may optimize for the hyperparameters of a Gaussian, but you're still in a Gaussian world.

So we asked here, can you do better?

Can you have some sort of nonparametric procedure that is going to give you the prior directly, without assuming its distributional family a priori?

And to do this, of course, we turn to generative AI.

So we basically replace the prior with an untrained generative network.

We used normalizing flows for the proof of concept, and we applied exactly the same procedure, but now with a normalizing flow as a surrogate for the prior.

And this sounds really cool and all, but of course there are some caveats.

Because this prior is very flexible, there are a lot of non-identifiability issues.

You can have cases where you end up with a lot of possible solutions.

And you have to deal with stability issues.

So I think that the general idea is still very viable, because if you can take this approach now and add some further constraints, you end up with an end-to-end pipeline for finding priors based on non-probabilistic expert knowledge, which in 2026 can also theoretically be followed by an agent.

So that is the basic idea.

Yeah, this is super interesting and practical, I think.

Are you guys still doing research on that?

What is the state of what you're doing right now on these topics?

I have personally drifted away from these topics, but Paul Bürkner is still very interested in this and he has a couple of works lined up.

Luigi Acerbi from Aalto is also doing some research on this topic.

This is from my close surroundings.

I'm pretty sure others are also interested.

Okay.

Well, sounds like I should have Paul back on the show, and maybe Luigi.

That sounds like it would make for good episodes.

I think Luigi never came on the show, so it could be cool to have him for the first time.

And so getting back to what you do, something I wanted to ask you about, and we talked about that a bit with Jonas Arruda in episode 151, is foundation models. Of course, there is a lot of excitement about foundation models for Bayesian inference, where, if I understood correctly, the main idea is that you have one network that works across many models.

My question for you is: first, can you remind listeners what foundation models are and why they would be useful in this case, and second, do you think what I just described is a realistic near-term goal, or is it still far off?

Yeah, that's a great futuristic question.

A foundation model is basically a general-purpose neural network that has typically been trained with self-supervised learning, without labels, and that can then be plugged into various downstream tasks.

So it's a general-purpose multi-task model.

ChatGPT is a foundation model.

Gemini is a foundation model.

So these are the large frontier foundation models.

Typically, it is understood that a foundation model also works with different modalities.

So they're multimodal.

Meta recently published a foundation model for in silico neuroscience.

This is called the TRIBE V2 foundation model, which processes text, audio, or videos, sequences of images, and emulates brain activations.

They trained this on 700 brain scans.

We already jumped on it, by the way: we showed that, using BayesFlow, you can invert such a model for doing in silico decoding.

But that is the idea.

Now, a foundation model for Bayesian inference.

It sounds like a goal that you may have, but I think we need to make clear what we mean by that, or in what sense existing foundation models are not already foundation models for Bayesian inference.

So let us imagine what the ideal inference engine would be.

At least from my perspective, the ideal engine would be: I take my data, I have a prompt, and I tell it, okay, I want you to test these models on this data and give me all model-implied quantities, like posteriors, diagnostics, et cetera.

So I plug this into the engine, and out comes an ensemble of posteriors and diagnostics.

So that would maybe be a foundation model for Bayesian inference.

And now we have to ask the question: is this even desirable?

Right?

So should we take, say, an existing multimodal LLM and fine-tune it to be able to do something like this, to solve a numerical task like that, to work with numerical data?

I think the answer is no.

I think the question is similar to asking, shall we have a foundation model of chess?

Now, you know, if you try to play chess with any of the frontier LLMs, occasionally they will teleport pieces, right?

Or they will just invent a new position.

So you may think, shall we make an effort to fine-tune this model to be able to play chess, or should we just tell the model when to call Stockfish or a specific chess engine? And I think the same is going to apply to Bayesian inference.

We're going to have some kind of semantic layer with which we interact, and this semantic layer is then going to call on specific procedures, like an MCMC sampler or an amortized sampler, which is going to do the job, right? And I think skills are already a step in this direction. And if you start thinking about adding various harnesses to this, then you get a neuro-symbolic agent, and I think neuro-symbolic is what the field is now excited about: neural networks plus a lot of harnesses which are auditable, which are more interpretable.

So I think, if we're being very futuristic, very general here, this is what practitioners will eventually be using.

And this also now asks the question, existing labs, can already...

write perfect stand-gode in my experience.

I taught a class which used a lot of Bayesian inference this semester and there was almost no model that GPT-5 cannot just write given a good description.

I think this also comes back to what Stephen Wolfram defined as computational thinking.

He has a very beautiful definition that he gave in one interview: computational thinking is the ability to express ideas clearly enough so that an arbitrarily smart computer can follow them.

And in a way, we are now in this situation where we have approximately arbitrarily smart computers with LLMs.

So again, it's the clarity of ideas that maybe separates us from the end goal.

Now, all that being said, I think there's still a lot of value in having amortized engines that generalize across large spaces.

This idea has already been explored in a certain depth and breadth in the field of SBI.

We also contributed to PABAR recently, which is an amortized engine that can generalize to different GLM-type models, with a lot of regressors and a lot of changing assumptions.

So what we introduced is a model that generalizes to a power set of configurations.

A power set scales as 2 to the power of the number of elements, so the number of configurations you want to cover grows very large.

And this is the case of simple regression without interactions: if you want all possible combinations of predictors, it scales as the size of the power set.

Now, if you want to have a simple regression with all possible predictors and all possible interactions, two-way, three-way, four-way, n-way interactions, you're looking at the scaling of the power set of the power set.

It is a ginormous number.
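To get a feel for these counts, here is a small sketch. It assumes the simplest reading of the argument: every subset of predictors is a candidate model, and with interactions every non-empty subset of predictors is a candidate term, with no hierarchy constraint forcing main effects to accompany interactions. The function names are made up for illustration.

```python
from itertools import combinations

def count_main_effect_models(num_predictors: int) -> int:
    """Number of predictor subsets: the size of the power set, 2**p."""
    return 2 ** num_predictors

def count_interaction_models(num_predictors: int) -> int:
    """Every non-empty subset of predictors is a candidate term (a main
    effect or a k-way interaction), and a model is any subset of terms."""
    num_terms = 2 ** num_predictors - 1
    return 2 ** num_terms

def enumerate_subsets(predictors):
    """Explicitly list all subsets, to check the 2**p count on small inputs."""
    return [
        subset
        for size in range(len(predictors) + 1)
        for subset in combinations(predictors, size)
    ]

for p in (2, 3, 4, 5):
    print(p, count_main_effect_models(p), count_interaction_models(p))
```

With five predictors there are 32 main-effect models but already 2**31 models once all interactions are allowed, which is why retraining one network per configuration is not an option.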

So I think there's a lot of value in having simply more general amortized engines that you don't need to retrain when a small assumption changes.

When a practitioner now comes and says, oh, I actually didn't want to use this prior, maybe this engine can adapt to the prior.

We have worked on that as well.

Or maybe the practitioner says, OK, let me fix this parameter.

Suddenly, I know I want this parameter to be zero.

I don't want to retrain the whole thing now.

I want to get an immediate result.

I want this to reconfigure itself.

This, I think, is a very interesting direction, and it is also where we are going with all of our research.

Yeah.

This is definitely super exciting.

And to make sure I understand what you're saying: basically, your bet is that the way this is going to look is, yes, we can have some foundation models for Bayesian inference, but they would be specialized in Bayesian inference.

So this wouldn't be like you just go to chatgpt.com or Claude.ai and you just ask for that.

This would probably look more like a coding agent. But if I understood correctly, where it's different from what we already get with agent skills is that you're saying the specialization in Bayesian inference would happen upstream, during training of the foundation models, and not downstream, which is what skills do, where the model is already trained as a generalist and is then oriented more towards that part of the knowledge. And you're also not saying this would look like fine-tuning, because that would still be post-training. You're saying that this specialization would happen during training of the models themselves. Did I understand that correctly?

That would be my bet, yes. This is an excellent way to summarize it.

So it's going to be something more than skills, but closer to skills than to fine-tuning.

And so you mean closer to fine-tuning?

Closer to skills than to fine-tuning.

Closer to skills.

But that would happen at training.

But I mean, unless I'm the one training the model, I would not have control over what goes into that training. And why do you think this is a more interesting approach? Because what I'm thinking is: if we already have these generalist models, why do we spend time training them on a subpart of the knowledge instead of just specializing them afterwards, whether it's with fine-tuning or skills?

Why do you think it's going to be more interesting to do it that way?

What would we get by doing that?

So that is certainly also a possibility.

I think ultimately it is going to be an empirical question of what works best.

There already exist certain specialized GPTs, for example, FinGPT for fintech and TimeGPT for time series.

So I'm mostly astutely observing what is happening in this ecosystem.

And you can certainly see signs of both approaches. Now, which one is ultimately going to come out on top?

It's very difficult to predict.

I think what is going to happen is an evolutionary selection of the model that just works best.

It's just going to establish itself.

No, and I mean, beyond that prediction step here, I think my question is more about what you would like to do if you had this possibility.

And it sounds to me like it's something that you think is promising to try that.

And so my question is...

How come you're saying that?

Like how come you think that doing that training would be more interesting?

What would we get from that?

Simply because we, as researchers, we like playing with things.

We like having control and to be at the source of things.

We have a certain disliking towards black boxes.

I think, I can't speak for everyone, but...

I think I would much rather be the trainer than the user.

Of course, there is a possibility that in academia we simply don't have the resources to train anything like that.

There may even be, right, the ecosystem may parcel itself into different solutions.

Because I'm not even sure if we should not just wait for a few more generations of agentic models, which may already be just perfect Bayesian code generators.

Then our job is going to be fine-tuning skills, finding out, as I think I mentioned during our last conversation, the science of skills.

What makes a good skill, and what are the factors that can make it so that we have new harnesses?

um Yeah.

Yeah.

But I'm definitely curious, like, if we could do that, basically train a specialized model to help us do Bayesian inference, like basically a Bayesian Claude.

That's the idea, right?

Or a Bayesian Gemini or something like that.

Okay.

Yeah.

Yeah.

That'd be super fun to do for sure.

Let's see how it goes.

You know, maybe we'll have this opportunity at some point.

We'll see.

I'll be happy to see it in any case.

Yeah.

Super, and very fascinating topics.

Also, another paper you have is on self-consistency, and we didn't talk about that yet.

And I think, if I understood correctly, you see that as a way to make amortized inference more data efficient.

So can you tell us what self-consistency is and what problem it is solving?

Absolutely.

We actually have a few papers on self-consistency.

You may even say that we really like to be self-consistent in that regard.

Now, the idea started out as a question. In many, many simulation-based inference applications, you still have access to the likelihood, right?

Be it the analytic likelihood or a synthetic likelihood of sorts.

And the way most simulation-based inference pipelines work is that you don't consider this information in your training.

You just use the simulations as training data, but you don't use likelihood information. And so we asked: in these cases where we have the likelihood, can we devise better loss functions that simply use this information as an additional training signal for the networks? And what we came up with was almost trivial, right? So you take Bayes' rule: you have the posterior on the left-hand side and you have the evidence on the right-hand side, as the normalizer of the posterior, and you just swap their places.

So now we have the evidence on the left-hand side and the posterior on the right-hand side.

That is not a familiar picture for Bayesians.

You practically never do this because the posterior is never available.

It's the object which you're trying to get in the first place.

But the game changes completely if you are now training a neural network that is going to become a surrogate for the posterior.

Now, in fact, you do have this object.

And you can evaluate its density if you're using models with closed-form density, like normalizing flows or flow matching.

So if you can do that, then this new object, this new right-hand side, is a calculator for the evidence.

So you have something that is independent of your parameters in theory, right?

This is a cool property: if you plug in any parameter value, the effects cancel out, so you get the same evidence. That is the interpretation here. But now, if you have an imperfect approximator for your posterior, you start having variance, right?

You treat this now as a marginal likelihood or evidence estimator, basically, which has a variance.

This variance is a direct proxy for approximation error.

It's differentiable.

So you can turn this into a loss function.

You can plug it into a network, you can differentiate through it, and you can propagate gradients back into your network.

And these gradients use more information than your traditional simulation-based loss.

So that's the core idea here.
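A toy illustration of the swapped Bayes' rule, p(x) = p(theta) p(x | theta) / p(theta | x), may help. The sketch below is not the training code from the papers: it assumes a conjugate Gaussian model (theta ~ N(0, 1), x | theta ~ N(theta, 1)) so that the exact posterior N(x / 2, 1 / 2) and the exact evidence N(0, 2) are known, and it uses a Gaussian with a wrong mean and variance as a stand-in for an imperfect neural approximator. Drawing the theta values from the prior is also just a convenience here.

```python
import math
import random
import statistics

def log_normal_pdf(value, mean, var):
    return -0.5 * (math.log(2.0 * math.pi * var) + (value - mean) ** 2 / var)

def log_evidence_estimates(x, thetas, post_mean, post_var):
    """log p(theta) + log p(x | theta) - log q(theta | x), once per theta."""
    return [
        log_normal_pdf(t, 0.0, 1.0)                # prior: theta ~ N(0, 1)
        + log_normal_pdf(x, t, 1.0)                # likelihood: x | theta ~ N(theta, 1)
        - log_normal_pdf(t, post_mean, post_var)   # approximate posterior q
        for t in thetas
    ]

def self_consistency_loss(x, thetas, post_mean, post_var):
    """Variance of the log-evidence estimates across parameter values."""
    return statistics.pvariance(log_evidence_estimates(x, thetas, post_mean, post_var))

rng = random.Random(0)
x = 1.3
thetas = [rng.gauss(0.0, 1.0) for _ in range(64)]

# Exact posterior: every theta yields the same log evidence, so the loss is ~0.
exact_loss = self_consistency_loss(x, thetas, post_mean=x / 2, post_var=0.5)
# Misspecified approximator: the estimates disagree, so the loss is positive.
biased_loss = self_consistency_loss(x, thetas, post_mean=x / 2 + 0.3, post_var=0.8)
true_log_evidence = log_normal_pdf(x, 0.0, 2.0)

print(exact_loss, biased_loss, true_log_evidence)
```

In an actual pipeline the approximate posterior would be a network with tractable density, and the variance would be minimized by gradient descent alongside the usual simulation-based loss.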

And at first, we just tested this for cases with limited simulation budgets.

We're talking about a few hundred simulations, and we saw improvements.

And later on, we started thinking about semi-supervised approaches: okay, so how can we use both synthetic and real data during training?

We realized that such self-consistency losses give you the perfect proxy to just plug the real data in there, because this is an object that is, in theory, completely independent of any ground-truth parameters, right?

So you now have your two sets.

You have a set of labeled simulations; these go into your canonical simulation-based loss. You have your set of unlabeled real data; this goes into your self-consistency loss.

Okay, so you have yourself a really nice semi-supervised objective that trains your entire pipeline, and it uses information from reality and simulations.
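Written out, the two-part objective described here might look like the following. This is a sketch under assumptions rather than the exact formulation from the papers: the negative log-density form of the simulation-based loss, the weight \lambda, the proposal \pi(\theta) used to draw parameter values, and the symbol p^{*}(x) for the distribution of real data are all choices made for illustration.

```latex
\mathcal{L}(\phi)
= \underbrace{\mathbb{E}_{(\theta, x) \sim p(\theta)\, p(x \mid \theta)}
  \bigl[ -\log q_\phi(\theta \mid x) \bigr]}_{\text{simulation-based loss on labeled simulations}}
+ \lambda \,
  \underbrace{\mathbb{E}_{x^{*} \sim p^{*}(x)}
  \Bigl[ \operatorname{Var}_{\theta \sim \pi(\theta)}
  \bigl( \log p(\theta) + \log p(x^{*} \mid \theta) - \log q_\phi(\theta \mid x^{*}) \bigr) \Bigr]}_{\text{self-consistency loss on unlabeled real data}}
```

The second term never needs a ground-truth parameter for x^{*}, which is why real observations can be plugged in directly, but it does need the likelihood p(x^{*} \mid \theta) to be evaluable.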

What's the catch?

The catch is you still need a likelihood or you need a very accurate likelihood approximation.

So I think where our money is now is finding a way to have something like this, which is just another loss function, but doesn't depend on a likelihood.

So then you can port this into likelihood-free settings.

And by the way, we're also trying this out.

A student of Paul is trying this for model comparison as well, a somewhat neglected setting in simulation-based inference.

But this very same logic applies to any conditional distribution which is subject to Bayes' rule.

So we're now talking about this as a family of self-consistency losses.

I think there's some really interesting theoretical research here.

We did some theory on it in the second paper, but you can do much more.

You can start thinking about deriving error bounds for different types of self-consistency estimators.

I think this is a very interesting theoretical strand of this work.

Because the second paper was also about robustness, right?

We also noticed that if you now start using the real data, your network is no longer surprised by the real data.

It has seen the real data and it's consistent on the real data, in the sense that you're directly minimizing the residual error on the real data if you're doing this.

So you bring your neural estimator in line with the theoretical oracle MCMC sampler if you're doing this.

You minimize or mitigate the so-called extrapolation errors that we talked about last time.

So it's also a really nice way to make neural simulation-based inference more attractive to people who are more familiar with the traditional MCMC workflow.

Right.

Yeah.

Okay.

This definitely needs to be in the show notes because it sounds like it's super helpful.

And yeah, I mean, you guys have so many papers that I found super practical that need to be distilled into either some functions in BayesFlow and/or some agent skills that we should work on together, Stefan, to make sure we have all this great knowledge and guidance distilled for practitioners, even without them realizing it, you know, but at least their models will be much, much more robust and powerful.

Another one, and I don't think we talked about that yet: another paper you have is on sensitivity-aware amortized inference.

So that's not the same thing as self-consistency.

So what does sensitivity mean in this context and why does it matter for practitioners?

This is another very exciting topic.

It's sort of orthogonal to self-consistency.

So self-consistency, just view it as a way to justify your training or make it a bit more simulation efficient.

Now, sensitivity analysis came as an inspiration from a Nature opinion paper.

The paper was titled One Statistical Analysis Must Not Rule Them All.

Those who get it will get it.

The core argument of the paper was really straightforward.

Any analysis, be it Bayesian or frequentist, hides an iceberg of uncertainty.

Typically, this uncertainty is not explicated enough.

The paper starts with a great example from 2020, the early days of the COVID pandemic, where the UK modeling group asked nine analysis teams

to carry out an analysis of the basic reproduction number as an indicator of the rate of spread.

If you remember, this was the most famous number for a while on the news.

Just as a reminder, a basic reproduction number of one is a stationary point; if it's below one, the pandemic is receding.

If it's above one, it means we're on an exponential trajectory.

These nine teams came up with very different estimates.

And if you looked at the lowest estimate, its credible interval contained the possibility that the pandemic was receding.

If you looked at the highest estimate, it contained the possibility that one person infects more than one and a half people on average.

A highly raging trajectory in this case.

The key observation was that the uncertainty across teams was higher than the uncertainty within any team.

Let's just pause and ponder for a moment to realize what this means.

Every single team underestimated the epistemic uncertainty, simply basing their analysis on an implicit set of assumptions.

The authors of this opinion paper also used this example to motivate what is now called in the field many-analysts teams, where you task different teams to analyze the same data.

You don't prescribe how the data should be analyzed, you just let them roam freely and then report the results, and then you see what kind of heterogeneity there is in the results.

So we looked at this paper and realized, well, isn't amortized inference kind of the perfect method for carrying out such analyses, precisely because you can have a single model that distills all these possible configurations?

Now, to give you a practical example of what this might look like: when I was analyzing neural data and I started scripting, I noticed that I have to choose among gazillions of settings.

Like how to filter the data, how to do artifact rejection, how to do other sorts of pre-processing, how to aggregate the data.

The universe of choices was just vast.

And because I was writing scripts and could very quickly run different analyses, I also made a somewhat startling observation, which I had as a student: that you can bend the analysis one way or another depending on the settings you choose.

This may seem obvious now after the fact, and it was obvious, because many methodologically refined people in the field were also noticing this, which eventually gave rise to the idea of multiverse analysis.

Multiverse analysis, you've got to love these fancy terms.

Multiverse analysis is simply a way to run all of your analysis for every different configuration.

Basically, you take a bunch of configurations, you run the analysis with every configuration, and, suppose you're a null hypothesis tester, you compute the p-value for your final effect under every one of these configurations.

A p-value is usually stochastic, even though it's not treated that way.

But now you have a distribution of p-values, right?

Every configuration gives one p-value.

And now you take, for example, some sort of threshold here and say, OK, if 99% of these settings give me a significant p-value, then I can be more confident in my statistical conclusion than I can be under any of these individual configurations.
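As a rough sketch of what such a multiverse looks like in code, the snippet below runs the same two-group comparison under six preprocessing configurations and collects one p-value per configuration. The data are synthetic, the configurations (an outlier clip and an optional log transform) are made up for illustration, and the p-value uses a normal approximation to the Welch statistic to keep the example dependency-free.

```python
import math
import random
import statistics
from itertools import product

def welch_p_value(a, b):
    """Two-sided p-value for a difference in means, using a normal
    approximation to the Welch statistic (a simplification for brevity)."""
    se = math.sqrt(statistics.variance(a) / len(a) + statistics.variance(b) / len(b))
    z = (statistics.mean(a) - statistics.mean(b)) / se
    return math.erfc(abs(z) / math.sqrt(2.0))

def preprocess(sample, clip, use_log):
    """One preprocessing configuration: optional clipping, optional log."""
    out = [min(v, clip) for v in sample] if clip is not None else list(sample)
    return [math.log(v) for v in out] if use_log else out

rng = random.Random(1)
control = [rng.lognormvariate(0.0, 0.5) for _ in range(80)]
treated = [rng.lognormvariate(0.2, 0.5) for _ in range(80)]

# The multiverse: every combination of clipping threshold and transform.
configs = list(product([None, 3.0, 2.0], [False, True]))
p_values = {
    cfg: welch_p_value(preprocess(treated, *cfg), preprocess(control, *cfg))
    for cfg in configs
}
share_significant = sum(p < 0.05 for p in p_values.values()) / len(p_values)

for cfg, p in p_values.items():
    print(cfg, round(p, 4))
print("share of configurations with p < 0.05:", share_significant)
```

The quantity of interest is the whole distribution of p-values, or a summary such as `share_significant`, rather than any single entry.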

Of course, there are much smarter ways to aggregate these things because not all settings are equal.

Some produce junk results.

Also, you can have adaptive weighting and so on.

There's research on that as well.

There's also research on how to aggregate results from amortized estimators.

There's a great paper by Yuling Yao and colleagues, which also shows different methods to aggregate results.

This is generally a fascinating field, by the way, simply aggregating results.

Simple question, very complex answers.

So anyway...

You have this situation where you want to run a multiverse analysis, but you have to consider that this is very costly.

If you're in a Bayesian setting and running even one configuration takes you a few hours, well, you don't want to be doing this 1,000 times four hours.

This would be frankly annoying.

Again, you won't see a lot of these sensitivity analyses in research unless reviewers ask for them.

And then you have to do it, but otherwise you are not required to.

So what we proposed is, first, let's taxonomize different sources of sensitivity in Bayesian analysis.

We found four basic sources.

Source number one is the prior.

That's the obvious one, prior sensitivity analysis.

You almost always get asked this question in an analysis.

What if you had a different prior?

Like, how would that skew your results, right?

So that's the obvious start. The second target is the observation model itself, right?

What if, suppose, I have a different noise model?

How would that change the conclusions?

The third source is data pre-processing.

What if I tweak the data a little bit, right?

Would that also give me the same answer, or are my results simply contingent on the fact that I have this one very influential data point in my data?

And because this was all conceptualized in a simulation-based inference setting, we also asked: what about your approximator?

What about your neural estimator?

This is also a determining factor.

What if you change something, you use a different architecture, right?

Do you expect to get the same posterior or not?

You see, amortized inference now gives you a way to simply expand the scope of this. This is sort of a precursor to our GLM-like, train-once-for-many-models work. But the key empirical finding that you have to have if you're proposing something like this, so it doesn't sound trivial, is to say: okay, if I want to generalize over 10 different priors with one neural net, do I have to simulate 10 times as much as I would for one prior?

So in other words, you want to have sub-linear scaling for this.

You want to generalize over 10 priors by, let's say, simulating a bit more than what you would do if you were only amortizing over a single configuration.

So that's what we showed in this paper, basically.

You see sub-linear scaling, which we attribute to weight sharing.

Weight sharing is this observation that if your network is learning about different problems, learning how to infer the posterior under prior A is beneficial for learning how to infer the posterior under prior B.

We think the worst case would be that you're training a network for completely independent, unrelated problems, in which case the network has no choice but to somehow internally split itself into two networks with a selector.

But in practice, this never happens because you're analyzing the data in a given domain.

So all models will be similar to each other, right?

Or at least they'd be more dissimilar to models from other domains.

So weight sharing always kicks in in this situation.

For example, one of our findings here, apart from toy examples exploring scaling, etc., was that if you take climate change forecasting models and you subject them to this kind of sensitivity analysis, and not just prior sensitivity analysis but sensitivity analysis in general, you see that they're relatively insensitive to the choice of prior and more sensitive to the type of model parameterization.

They're also relatively insensitive to the overall scenario that you're assuming, let's say, business as usual versus middle of the road.

Your forecasts for the critical threshold do not change so much.

They change much more when you vary the underlying single model assumptions.

So you can do all this with amortized inference for a sub-linear cost.

We actually advocate this as a default approach.

But there's also one last point here, which is what about the different neural networks?

So we proposed a somewhat brute-force approach here: train an ensemble of neural networks.

And then look at the disagreement between members of the ensemble.

If they disagree a lot, this points to some deeper problems.

One of these problems in a complex model comparison could be, for example, so-called dancing Bayes factors, which is the observation that Bayes factors tend to dance around a lot if you make small perturbations to the data.

So we noticed something similar in our analysis.

But now you can reveal this within one analysis.

You don't have to perturb the data or do any sort of cross-validation.

You can just see it by the fact that your ensemble members suggest different models for the same data set.
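One simple way to quantify this kind of disagreement is sketched below. It assumes each ensemble member outputs a vector of posterior model probabilities for the same data set; the function name and the two summary statistics are illustrative choices, not the diagnostics used in the paper or in BayesFlow.

```python
from collections import Counter

def ensemble_disagreement(member_probs):
    """member_probs: one posterior model probability vector per ensemble member."""
    # Which model does each member prefer?
    votes = Counter(max(range(len(p)), key=p.__getitem__) for p in member_probs)
    top_model, top_count = votes.most_common(1)[0]
    agreement = top_count / len(member_probs)
    # Largest range of any model's probability across members.
    num_models = len(member_probs[0])
    max_spread = max(
        max(p[m] for p in member_probs) - min(p[m] for p in member_probs)
        for m in range(num_models)
    )
    return {"top_model": top_model, "agreement": agreement, "max_spread": max_spread}

# Hypothetical outputs of three networks for two candidate models.
stable = [[0.90, 0.10], [0.88, 0.12], [0.93, 0.07]]
unstable = [[0.80, 0.20], [0.35, 0.65], [0.55, 0.45]]

print(ensemble_disagreement(stable))
print(ensemble_disagreement(unstable))
```

In the `unstable` case one member prefers a different model and the probabilities range over almost half the unit interval, which is the within-analysis signal of instability described above.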

Yeah, and I think now in BayesFlow, with the new version, it's very easy.

We have an ensemble workflow, where you basically list the different networks you have, and you can do all that in a single workflow.

Again, not for the time that it would cost you to train these sequentially, because we also have different ways to reuse data during batching.

This is all available, and we recommend this as a default, basically, because it's just so natural to do in an amortized setting.

Okay, so basically as a default whenever you do amortized inference?

Yeah, I would just call for an ensemble of networks.

It's also a way to cheaply do a naive hyperparameter optimization.

Maybe as we wrote in the skill, right?

Check models of different sizes as a heuristic to see how much performance gains can you reap by increasing your model size.

um Okay.

Yeah.

And so is it already the default in BayesFlow or not?

This is not the default.

As a user, you have to choose it.

It's called an ensemble workflow in BayesFlow.

And now, if you want to do different kinds of prior sensitivity or likelihood sensitivity, you still need to simulate under these conditions.

But from the perspective of the library, the only thing that you need to change is that you need to tell the network that these different configurations exist, right?

A simple case would be exponentiated priors, right?

Beta, Gaussian: you add an exponent, an alpha exponent, that shrinks or widens your priors, basically, because it has an effect on the variance, and you just let the network know that this factor exists.

You train the network, and at inference time you just do a sweep over a range of alphas.

And you now report an ensemble of posteriors, and you just visualize or quantify if you want.

If you don't want to be stuck with qualitative analysis, you just quantify the difference formally to see: is this a difference that really makes a difference or not?
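To make the alpha sweep concrete, here is a sketch on a model where everything is available in closed form. Raising a Gaussian prior N(m, s^2) to the power alpha gives, after normalization, N(m, s^2 / alpha), so alpha > 1 narrows the prior and alpha < 1 widens it. In the amortized setting a trained network would receive alpha as an extra input; the analytic posterior below merely stands in for that network, and the function name and data are made up.

```python
import math

def power_scaled_posterior(data, alpha, prior_mean=0.0, prior_var=1.0, noise_var=1.0):
    """Posterior mean and variance of a Gaussian mean when the
    N(prior_mean, prior_var) prior is raised to the power alpha,
    i.e. becomes N(prior_mean, prior_var / alpha)."""
    prior_precision = alpha / prior_var
    data_precision = len(data) / noise_var
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (prior_precision * prior_mean + sum(data) / noise_var)
    return post_mean, post_var

data = [2.1, 1.7, 2.4, 1.9, 2.2]
alphas = [0.25, 0.5, 1.0, 2.0, 4.0]

# One posterior per alpha: the "ensemble of posteriors" from the sweep.
sweep = {a: power_scaled_posterior(data, a) for a in alphas}

# Quantify sensitivity: shift of the posterior mean, in units of the
# baseline (alpha = 1) posterior standard deviation.
base_mean, base_var = sweep[1.0]
shift = {a: (mean - base_mean) / math.sqrt(base_var) for a, (mean, _) in sweep.items()}

for a in alphas:
    print(a, sweep[a], round(shift[a], 3))
```

With a network in place of the closed form, the sweep costs only forward passes, which is what makes this kind of check cheap enough to run routinely.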

Yeah, yeah. Okay.

So that's why you don't have that as a default in BayesFlow, right?

You can't really anticipate what factors of sensitivity are going to be particularly interesting for this user group.

It seems though that we...

Do we have that already in the skill or not?

No, but we should.

Yeah, we should.

Sounds like we should.

Because I think it's something where we need the user's input before setting that up.

But we can definitely do that in the skill.

For instance, I did that in the causal inference skill, where it won't kick off its work before it validates with you the DAG and the assumptions that are absolutely necessary for the causal interpretation to be valid, because these are assumptions that are not in the data.

So the agent cannot figure them out.

It has to ask you. That's literally prior elicitation, where the agent will go: so I think the best method for that is, I don't know, diff-in-diff.

The core assumptions of diff-in-diff are this and that.

Do you think this is true in this case or not?

If it's not, then it will suggest another method if there is one, or otherwise it can say, well, we can do that, but the causal interpretation will be weakened.

So I think here it's something we can definitely have in the skill, where basically if the agent sees this ensemble is useful, and it sounds like that's going to be the case almost always from what you're saying, then it's going to prompt the user with a few pointed questions.

You enter yes, or you enter whatever you need, and then the agent goes and does its work.

I think the skill framework is very useful for that.

Yeah.

There goes my weekend.

And mine too.

But I mean, this is a good way to spend the weekend.

It is.

That's why.

But I mean, for sure, let's add that when you have time, and also when I have time, I can definitely add that in the skill, because I think it's going to make it even more helpful and practical for people.

Yeah.

I think that the key consideration there is also manageability.

From my experience, everybody agrees sensitivity analysis is good.

And you have these papers on meta-science that all tell you sensitivity analysis is good.

But many practitioners are simply overwhelmed.

They don't have to deal with one result, they have to deal with an ensemble of results, right?

And they don't teach you that at school.

Like, what do I do now if I have two competing explanations suggested by my results?

Perhaps it is on us to just push these as defaults.

In my experience, yes.

And that's also what you see from the behavioral economics literature, where the power of defaults is extremely important.

And so if we think this is a better behavior for users to have in their models, then having it as the default is going to be much, much more powerful, because they'll just use it by default without even knowing it, because they don't really need to.

And it just gives them better models, better predictions, better inferences, for free.

I think it's a great deal.

The power of defaults.

Never to be underestimated.

You've heard it, listeners.

Now we have to do it.

We'll do it.

We'll get to it at some point.

It's pretty fun to work on these skills, and even more fun when there are at least two of us working on that.

Yeah.

That's cool.

So let's start playing us out here, because we're approaching the hour already.

Yeah.

I'm curious, basically after everything we've said in this episode and the previous one, if you had in front of you someone coming from PyMC or Stan, you know, a classic PPL, when should they actually consider switching to amortized methods?

I'm saying switching here, but it's not really a switch, especially with the future work you guys have on the deck.

And we'll talk about that on the podcast with other guests.

That's a teaser.

So yeah, when should they actually consider switching to amortized methods and when should they just stick with what they know?

That's a great question.

And I think the way you pose it also calls for a criterion in a skill itself.

So that if you're a user who doesn't really care about any of the methods, you're just facing a problem, a skill can recommend: go for an amortized or go for a non-amortized approach.

But I think the way we are also positioning our work in BayesFlow is in the direction of complementarity.

Modelers, practitioners, statisticians should have amortized inference in their toolkit, as a complementary set of methods for whenever the classical toolkit is infeasible for a variety of reasons.

We even have a paper that now has a dedicated website and a repo; it's called Amortized Bayesian Workflow, paying homage to the Bayesian workflow.

This is joint work with Paul, Aki, and Nubija Chary, again a very strong student from Aalto.

In this work, we actually outline a step-by-step workflow that uses both amortized neural inference and MCMC, and it tries to balance, basically, the Pareto frontier of accuracy versus speed.

I think the main trade-off, if you're in a business, the main trade-off here is speed versus accuracy.

If you would attach utilities to speed versus accuracy, at some point, suppose MCMC is a viable option, but the cost of applying MCMC grows linearly, unless you're parallelizing things, et cetera.

Most people are not parallelizing a lot.

So the cost grows approximately linearly.

The cost of amortized inference is sub-linear.

So if you approach it as an economist, there is a break-even point, even in a case where MCMC is feasible, for instance when you have a stream of data coming in and you have to update your inferences a lot.
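The break-even argument can be made concrete with a back-of-the-envelope calculation. The sketch below assumes the simplest cost model: MCMC costs a fixed amount per data set, while amortized inference costs a one-off training budget plus a small per-data-set cost. The function name and all numbers are illustrative, not measurements.

```python
import math

def break_even_datasets(training_cost, amortized_cost_per_dataset, mcmc_cost_per_dataset):
    """Smallest number of data sets at which amortized inference is cheaper
    in total than running MCMC separately on every data set."""
    saving = mcmc_cost_per_dataset - amortized_cost_per_dataset
    if saving <= 0:
        return None  # MCMC is never more expensive per data set
    return math.floor(training_cost / saving) + 1

# Illustrative costs in hours: 8 h of simulation and training up front,
# about 4 s per amortized fit, and 30 min per MCMC fit.
n_star = break_even_datasets(
    training_cost=8.0,
    amortized_cost_per_dataset=0.001,
    mcmc_cost_per_dataset=0.5,
)
print(n_star)
```

Under these made-up numbers the upfront training pays for itself from the 17th data set onward, which a data stream with frequent updates reaches quickly.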

Think of a pandemic monitoring system where data is arriving from different locations, at different time points.

So you need Bayesian updating.

At some point you just cannot afford to wait for more than an hour to get the new result.

Amortized inference gives it to you instantly.

You have to worry about other things.

Can I trust the result, basically?

So, complementarity simply means that you start with amortized first, in a setting where, suppose, you have 100 data sets and you need to apply one model to these 100 data sets.

You run the amortized pipeline, you get your results.

And then you diagnose, basically, which of these results are trustworthy or not.

Right?

This is putting emphasis on trustworthiness proxies.

Suppose that this diagnostic flags 10 of these results as ones we can't trust.

So what we propose is to try a correction first.

A post-hoc correction, Pareto-smoothed importance sampling or something similar, which basically just tries to correct for the bias that you have.

Okay, now you rerun your diagnostic.

Eight out of the ten data sets now pass the diagnostic.

You're left with two problematic data sets.

What do you do?

Well, this is where you again enter the MCMC world, but you don't have to start from scratch.

What we show is that you can now use the neural network to initialize the sampler.

So what you can do now is, again, another aspect of complementarity.

You use the neural network outputs to have a very informative guess on where MCMC should start, which then lets you run massively parallel or partially parallel MCMC where you do a lot of short chains, because in the ideal case, you skip the initial exploration phase of the sampler, right?

You basically land in the typical set.

So you start reaping benefits in terms of reducing approximation error.

And that's how you close the loop on the pipeline.
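The control flow of this triage can be sketched in a few lines. This is only the skeleton of the idea, not the code from the paper: the four callables are hypothetical placeholders, and the toy stand-ins below replace a trained network, a trustworthiness diagnostic, an importance-sampling correction, and an MCMC sampler with trivial arithmetic so the example runs end to end.

```python
from collections import Counter

def hybrid_workflow(datasets, amortized_fit, diagnose, correct, mcmc_fit):
    """Three-stage triage: amortized draws first, then a post-hoc correction,
    then MCMC initialized from the neural draws for whatever still fails."""
    results = {}
    for name, data in datasets.items():
        draws = amortized_fit(data)
        if diagnose(data, draws):
            results[name] = ("amortized", draws)
            continue
        corrected = correct(data, draws)
        if diagnose(data, corrected):
            results[name] = ("corrected", corrected)
            continue
        # Last resort: short MCMC chains started from the network's draws.
        results[name] = ("mcmc", mcmc_fit(data, init=draws))
    return results

# Toy stand-ins. Each "data set" is just a difficulty score.
def toy_amortized_fit(difficulty):
    return {"error": difficulty}

def toy_diagnose(difficulty, draws):
    return draws["error"] < 1.0

def toy_correct(difficulty, draws):
    return {"error": draws["error"] * 0.5}

def toy_mcmc_fit(difficulty, init):
    return {"error": 0.0, "initialized_from": init}

datasets = {f"d{i}": 0.1 for i in range(90)}
datasets.update({f"d{i}": 1.5 for i in range(90, 98)})
datasets.update({f"d{i}": 3.0 for i in range(98, 100)})

results = hybrid_workflow(datasets, toy_amortized_fit, toy_diagnose, toy_correct, toy_mcmc_fit)
counts = Counter(stage for stage, _ in results.values())
print(counts)
```

The toy numbers are chosen to mirror the example in the conversation: 90 data sets pass straight away, 8 pass after correction, and 2 fall through to MCMC.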

So we're envisioning a lot of these hybrid workflows, for which we also need a skill eventually. But generally speaking, I can't give a good rule, because it depends on expertise and on some notion of the speed-accuracy trade-off, which you currently need to approximate in your mind: how long is it going to take to run a classical analysis?

If a classical analysis is not feasible at all, right, then there is no choice to be made.

You just basically go for SBI.

Yeah.

Yeah.

And I love your answer because it shows that this is not an either-or case, but actually a case where we would love to have a hybrid workflow where ABI would come and help the classic MCMC workflow.

And so yeah, as you were saying, there are a bunch of things we need for that.

The first one is probably a PyMC-BayesFlow bridge, which I heard is coming very soon.

Yeah, this already exists, but I won't say more about it.

Yeah, we'll talk about that on the show.

And then, yeah, probably an agent skill that is basically the orchestrator between these two workflows and can make the choices between them.

I think this would be super powerful and helpful for a lot of models and people.

Yeah.

Yeah.

I think that the future is very exciting.

Yes.

Yeah, definitely.

This is really, really super fun to work on.

If anybody wants to come and help us, please contact me and/or Stefan. We'll for sure appreciate the help, and it will make things go much faster, because otherwise Stefan and his team have a limited amount of time per day and, as it turns out, I do too.

So that's a shame.

So Stefan, looking forward, actually, what are your plans for BayesFlow?

We have a lot of plans for BayesFlow.

I would say maybe let's start with the long-term plan for BayesFlow.

It's to become the workflow layer for amortized inference: fully AI-ready, agentic-ready.

This should be the gold-standard workflow layer for amortized inference that interacts very well with agents.

It's fully interoperable with PyMC and Stan.

We're working in all these directions, and we're also working on changing the ecosystem.

So there is also a continuous development effort happening.

I think a huge strength is our positioning in Keras 3, the multi-backend Keras 3, because this lets us interoperate with the three most popular deep learning backends: TensorFlow, PyTorch, and JAX.

So you can plug your own custom networks in there, and it works.

You can download a network from Keras Hub, and it just works out of the box.

So the workflow layer is this long-term goal.

More short term, we're looking at introducing these very generalized amortized engines and piloting them in different fields.

For now, we have a pilot within cognitive modeling because, as they say, one fourth of my heart still belongs to cognitive modeling.

And these models are excellent test beds because they keep changing. They're running targets.

Researchers keep changing the assumptions that they have.

So it's just a set of test beds for our methods.

Having this enhanced generalizability, which doesn't come at the cost of any kind of amortization gap or robustness gap, is the way to go.

We still have a bit of homework to do in terms of robustness.

This is, I think, where more theoretical work is needed.

But also more implementation work, right?

Because, as you said, we have all these scattered ideas in different papers, even in our own papers, which still haven't made it into the software, either because they're not production-ready yet and are still more at the research stage, or because there's not enough energy to put into them.

But since BayesFlow is now a community effort, I think this is happening a lot.

It's just that we're trying to uphold good software development standards.

It just takes some time to make sure that this is not just an artifact of research, but actually something that can be deployed.

So bridging this existing gap between research on SBI and the deployment of SBI is also a major goal of BayesFlow.

We'll keep working on this, we'll keep expanding.

We have a lot of ongoing research.

Also, in a third strand of our research, we are targeting more and more challenging applications.

In our recent review paper on diffusion models for simulation-based inference, we sort of categorized problems according to difficulty, and now we're getting more ambitious with targeting the very difficult problems: high-dimensional problems, expensive simulators, not a lot of data.

I think this is where SBI is currently completely underutilized.

And these applications also provide great challenges for the methods themselves.

Because we just can't continue benchmarking on our favorite toy examples and pretend that everything works.

Yeah, I love that.

I really love how you guys are approaching these topics in your lab.

And well, I think it's in big part thanks to you, since you lead the lab.

But I really love what you're doing, and that's also on brand with Paul Bürkner, you know; that's really his kind of research too.

It's strictly a team effort.

Yeah.

Yeah, no, exactly.

And I love how all of you guys think about it as: yeah, we're doing state-of-the-art research, which is extremely complex and fascinating, but we never lose sight of, okay, but how are people going to even use that?

And is that even useful?

Or should we maybe work on something else?

And I think this is the most impactful research there is.

And I really love that.

And it's also the brand of the show.

That's why you've been here twice already in just two weeks.

Amazing.

Well, Stefan, I think it's time to call it a show, finally.

So that means I'm going to ask you the last two questions I ask every guest at the end of the show.

First one, if you had unlimited time and resources, which problem would you try to solve?

Yeah, so I didn't think about this a lot.

Well, let's confine ourselves to local research problems and not big world problems, I would say.

But if we're thinking very locally, I would just solve the problem of automating statistics, the more complex statistics, for everybody.

Just take away the strain of dealing with computational issues and unlock creativity.

Yes.

Simply free of computational considerations.

I love that.

Yeah, of course, I resonate with that.

And I'm sure a lot of people will.

Second question.

If you could have dinner with any great scientific mind, dead, alive or fictional, who would it be?

That's also a very difficult question, but I would probably pick Ed Thorp, the mathematician who developed card-counting strategies in blackjack.

He then went on to start the first hedge fund and basically demonstrated what it means to think independently, to think outside the box.

He's still alive.

93 going strong.

So this is also very, very inspirational.

So that would be my aspirational pick.

I love that.

Yeah.

Sounds like a fun dinner for sure.

Well, Stefan, I think that's it.

So yeah, those were of course two dense shows.

I hope you folks liked it. And please make sure to add the different papers and content we just mentioned today to the show notes, Stefan, for people who want to dig deeper.

And well, thank you again for taking the time and being again on this show.

Thank you again for the invite.

I'm already getting nostalgic.

This has been another episode of Learning Bayesian Statistics.

Be sure to rate, review, and follow the show on your favorite podcatcher, and visit learnbayesstats.com for more resources about today's topics, as well as access to more episodes to help you reach true Bayesian state of mind.

That's learnbayesstats.com.

Our theme music is Good Bayesian by Baba Brinkman, feat. MC Lars and Mega Ran.

Check out his awesome work at bababrinkman.com.

I'm your host, Alex Andorra.

You can follow me on Twitter at alex underscore andorra, like the country.

You can support the show and unlock exclusive benefits by visiting Patreon.com slash LearnBayesStats.

Thank you so much for listening and for your support.

You're truly a good Bayesian.

Change your predictions after taking information in, and if you're thinking I'll be less than amazing,

Let's adjust those expectations.

Let me show you how to be a good Bayesian.

Change calculations after taking fresh data in. Those predictions that your brain is making, let's get them on a solid foundation.

Key Takeaways

Prior predictive checks are underused because researchers don't always think to run them before seeing data -- but also because doing them rigorously (in the style Michael Betancourt advocates, with prior push-forward checks on interpretable summaries) takes effort. Simulations make it cheap to generate thousands of “what-if world” datasets from your model and check whether they look plausible, catching bad priors before you ever touch real data.
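A prior predictive check can be as small as the sketch below. The model is an illustrative assumption, not one from the episode: log-normal reaction times with a normal prior on the log-scale location and a half-normal prior on the log-scale spread. The plausibility window of 0.1 to 5 seconds for the median reaction time is a hypothetical choice that a domain expert would supply.

```python
import math
import random
import statistics


def prior_predictive_medians(prior_mu_sd, n_datasets, n_obs, rng, prior_mu_mean=-0.5):
    """Simulate datasets from the prior predictive and return each dataset's median RT."""
    medians = []
    for _ in range(n_datasets):
        mu = rng.gauss(prior_mu_mean, prior_mu_sd)  # prior draw: log-scale location
        sigma = abs(rng.gauss(0.0, 0.5)) + 1e-6     # prior draw: half-normal spread
        rts = [math.exp(rng.gauss(mu, sigma)) for _ in range(n_obs)]
        medians.append(statistics.median(rts))
    return medians


def fraction_plausible(medians, low=0.1, high=5.0):
    """Share of simulated datasets whose median RT (seconds) lies in a plausible range."""
    return sum(low <= m <= high for m in medians) / len(medians)


rng = random.Random(3)
vague = prior_predictive_medians(prior_mu_sd=10.0, n_datasets=1000, n_obs=50, rng=rng)
informed = prior_predictive_medians(prior_mu_sd=0.5, n_datasets=1000, n_obs=50, rng=rng)

frac_vague = fraction_plausible(vague)
frac_informed = fraction_plausible(informed)
print(f"plausible datasets under the vague prior: {frac_vague:.0%}")
print(f"plausible datasets under the informed prior: {frac_informed:.0%}")
```

The "uninformative" prior puts most of its mass on reaction times of microseconds or hours, which is exactly the kind of problem that is obvious on an interpretable summary and invisible on the parameter scale.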

Rather than forcing a domain expert to choose a distributional family and parameterize it, you can use a generative model to translate their qualitative knowledge directly into a prior. The expert describes what realistic data should look like; the generative model produces synthetic datasets matching that description; those datasets are used to fit a prior distribution. It removes the assumption that experts can think in terms of parameters and replaces it with the more natural question: does this look like your data?

Stefan's bet is that it won't be a fine-tuned general LLM. The right analogy is chess: you don't fine-tune GPT to play chess, you teach it when to call Stockfish. For Bayesian inference, you'd want a semantic layer – an LLM that understands the analysis goal – calling specialized numerical engines (MCMC samplers, amortized inference networks) that do the actual computation. Agent skills are already a step in this direction; the longer-term vision is engines that have been trained from scratch to generalize across large families of models and priors.

Self-consistency is the idea that when you train a network on simulations, you can use the network's own outputs as additional training signal – essentially creating a feedback loop where the network teaches itself to be more internally consistent. This can significantly improve data efficiency, letting you get better posteriors from the same number of simulations. It's particularly useful when running the simulator is expensive.
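One way to make "internally consistent" concrete, as a sketch in notation assumed here rather than taken from the episode: Bayes' rule implies that the log marginal likelihood can be computed from any parameter value, so the right-hand side of the first line below must be constant in $\theta$. Replacing the true posterior with the network's approximation $q_\phi$ generally breaks that constancy, and the variance across parameter draws from some proposal $\tilde{p}(\theta)$ (an assumption) can be penalized as an extra loss term.

```latex
\log p(x) = \log p(\theta) + \log p(x \mid \theta) - \log p(\theta \mid x)
\quad \text{for every } \theta \text{ with } p(\theta \mid x) > 0,

\mathcal{L}_{\mathrm{SC}}(\phi) =
\operatorname{Var}_{\theta \sim \tilde{p}(\theta)}
\left[ \log p(\theta) + \log p(x \mid \theta) - \log q_\phi(\theta \mid x) \right].
```

Evaluating this penalty requires a likelihood that can be evaluated, or a learned surrogate for it, and it needs no ground-truth parameters for $x$. That is why it can squeeze extra training signal out of a fixed simulation budget.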

Every Bayesian analysis rests on a set of implicit choices – the prior, the likelihood, the data preprocessing steps, the specific model architecture. Change any of those and you might get a different answer. A striking example from the 2020 COVID pandemic: nine analysis teams asked to estimate the same reproduction number R came up with estimates so different that the between-team uncertainty dwarfed the within-team uncertainty. Every team was underestimating total epistemic uncertainty by treating their analytical choices as fixed. Sensitivity analysis makes this uncertainty explicit – but classically it's so expensive that it only gets done when reviewers force it.

Multiverse analysis runs an analysis under every reasonable combination of analytical choices – different preprocessing pipelines, priors, noise models – and reports the distribution of results rather than a single answer. If 99% of configurations give you the same conclusion, you can be much more confident in it. If conclusions vary wildly across configurations, that variability is itself important scientific information. The concept emerged partly from the replication crisis in psychology, where the same data could be made to support very different conclusions depending on analytical choices.
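The bookkeeping behind a multiverse analysis is a loop over the grid of analytical choices. In the sketch below, every specific is a hypothetical stand-in: the synthetic data with two gross outliers, the three outlier cutoffs, the prior and noise scales, and the conjugate normal model that makes each "analysis" a closed-form computation. In a real study, each cell would be a full inference run, which is where amortization pays off.

```python
import itertools
import math
import random


def posterior_normal_mean(data, prior_sd, noise_sd):
    """Conjugate posterior for a mean: N(0, prior_sd^2) prior, known noise_sd."""
    precision = 1.0 / prior_sd ** 2 + len(data) / noise_sd ** 2
    mean = (sum(data) / noise_sd ** 2) / precision
    return mean, math.sqrt(1.0 / precision)


def prob_positive(mean, sd):
    """Posterior probability that the effect is greater than zero."""
    return 0.5 * (1.0 + math.erf(mean / (sd * math.sqrt(2.0))))


rng = random.Random(42)
raw = [rng.gauss(0.3, 1.0) for _ in range(80)] + [6.0, -7.0]  # two gross outliers

# The analytical choices that define the multiverse (all hypothetical).
cutoffs = [2.5, 3.0, float("inf")]  # preprocessing: outlier exclusion rule
prior_sds = [0.5, 1.0, 10.0]        # prior scale
noise_sds = [0.8, 1.0, 1.5]         # noise model

results = []
for cutoff, prior_sd, noise_sd in itertools.product(cutoffs, prior_sds, noise_sds):
    kept = [x for x in raw if abs(x) <= cutoff]
    mean, sd = posterior_normal_mean(kept, prior_sd, noise_sd)
    results.append({
        "cutoff": cutoff,
        "prior_sd": prior_sd,
        "noise_sd": noise_sd,
        "post_mean": mean,
        "p_positive": prob_positive(mean, sd),
    })

# Report the distribution of conclusions, not a single answer.
share = sum(r["p_positive"] > 0.95 for r in results) / len(results)
post_means = [r["post_mean"] for r in results]
print(f"{len(results)} configurations")
print(f"share with P(effect > 0) > 0.95: {share:.0%}")
print(f"posterior mean ranges from {min(post_means):.2f} to {max(post_means):.2f}")
```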

The key result is sub-linear scaling. Amortizing over 10 different priors doesn't cost 10 times as much as amortizing over one – it costs only somewhat more, because the network learns a shared structure across configurations (what Stefan calls “weight sharing”). This means a full sensitivity analysis that would take thousands of MCMC hours classically can be run for a small premium over a single amortized analysis. Stefan advocates this as a default, not an optional extra.

By training multiple neural networks with different architectures or initializations on the same problem and comparing their posteriors. If the ensemble members agree, that's reassuring. If they disagree strongly, that disagreement is a diagnostic signal – it often points to identifiability problems, unstable Bayes factors, or data-dependent variability that would otherwise go unnoticed. BayesFlow now includes an ensemble workflow that makes this easy to run.
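A simple disagreement score illustrates the idea without relying on BayesFlow's actual ensemble API, which is not shown here. The sketch assumes you already have posterior draws for one parameter from each ensemble member. The score compares the spread of the members' posterior means with their typical posterior width. Both the score and the 0.5 threshold are hypothetical choices for illustration.

```python
import random
import statistics


def ensemble_disagreement(member_draws):
    """Between-member spread of posterior means, in units of typical posterior width."""
    means = [statistics.fmean(draws) for draws in member_draws]
    widths = [statistics.stdev(draws) for draws in member_draws]
    return statistics.pstdev(means) / statistics.fmean(widths)


rng = random.Random(7)

# Stand-ins for draws from five ensemble members (assumption: Gaussian posteriors).
agreeing = [[rng.gauss(1.0, 0.2) for _ in range(2000)] for _ in range(5)]
disagreeing = [
    [rng.gauss(center, 0.2) for _ in range(2000)]
    for center in (0.4, 0.9, 1.0, 1.1, 1.8)
]

score_agree = ensemble_disagreement(agreeing)
score_disagree = ensemble_disagreement(disagreeing)

THRESHOLD = 0.5  # hypothetical cutoff for flagging a dataset
for name, score in [("agreeing", score_agree), ("disagreeing", score_disagree)]:
    print(f"{name}: score={score:.2f}, flagged={score > THRESHOLD}")
```

A flagged dataset is a natural candidate for the correction or MCMC fallback discussed in the episode, since disagreement between members means at least some of them are wrong.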

Stefan's vision is hybrid workflows rather than replacement. Amortized inference is fast and excels at tasks requiring repeated inference, sensitivity analysis, and large-scale studies. MCMC is slower but provides exact asymptotic guarantees and is the right reference when you need to stress-test an amortized result. The two can be used in sequence: amortize to explore, then run MCMC to verify on the cases that matter most.

Related Episodes