#151 Diffusion Models for SBI in Python, with Jonas Arruda
• Join this channel to get access to perks:
https://www.patreon.com/c/learnbayesstats
• Proudly sponsored by PyMC Labs! Get in touch at alex.andorra@pymc-labs.com
• Intro to Bayes Course (first 2 lessons free): https://topmate.io/alex_andorra/503302
• Advanced Regression Course (first 2 lessons free): https://topmate.io/alex_andorra/1011122
Our theme music is « Good Bayesian », by Baba Brinkman (feat MC Lars and Mega Ran). Check out his awesome work at https://bababrinkman.com/ !
Chapters:
00:00 Exploring Generative AI and Scientific Modeling
10:27 Understanding Simulation-Based Inference (SBI) and Its Applications
15:59 Diffusion Models in Simulation-Based Inference
19:22 Live Coding Session: Implementing Baseflow for SBI
34:39 Analyzing Results and Diagnostics in Simulation-Based Inference
46:18 Hierarchical Models and Amortized Bayesian Inference
48:14 Understanding Simulation-Based Inference (SBI) and Its Importance
49:14 Diving into Diffusion Models: Basics and Mechanisms
50:38 Forward and Backward Processes in Diffusion Models
53:03 Learning the Score: Training Diffusion Models
54:57 Inference with Diffusion Models: The Reverse Process
57:36 Exploring Variants: Flow Matching and Consistency Models
01:01:43 Benchmarking Different Models for Simulation-Based Inference
01:06:41 Hierarchical Models and Their Applications in Inference
01:14:25 Intervening in the Inference Process: Adding Constraints
01:25:35 Summary of Key Concepts and Future Directions
Thank you to my Patrons (https://learnbayesstats.com/#patrons ) for making this episode possible!
Links from the show:
https://www.fieldofplay.co.uk/
Further reading for more mathematical details: Holderrieth & Erives https://arxiv.org/abs/2506.02070
Today we are exploring the intersection of generative AI and scientific modeling, specifically how the tech behind image generation is revolutionizing how we handle black
box simulations.
My guest is Jonas Arruda, a mathematician and PhD researcher at the University of Bonn who is a key contributor to the BayesFlow library.
In this episode we unpack
the power of simulation-based inference and why diffusion models, the same architecture behind DALL-E and Midjourney, are becoming a game-changer for approximating complex
posteriors when traditional likelihoods are not available.
We discuss the trade-offs between speed and accuracy in generative modeling, how neural networks learn to denoise statistical distributions, and what it takes to scale these
methods to hierarchical models.
Plus, we move into technical
territory with a live coding session where Jonas walks us through a BayesFlow implementation step by step to show exactly how these models work under the hood.
So I recommend tuning in on YouTube so that you can check out the video on the Learning Bayesian Statistics channel.
It's going to be much better in video, but if you prefer the audio, I'm happy to be in your earbuds.
This is Learning Bayesian Statistics, episode 151.
Welcome to Learning Bayesian Statistics, a podcast about Bayesian inference, the methods, the projects, and the people who make it possible.
I'm your host, Alex Andorra.
You can follow me on Twitter at alex.andorra, like the country.
For any info about the show, learnbayesstats.com is Laplace to be.
Show notes, becoming a corporate sponsor, unlocking Bayesian merch, supporting the show on Patreon, everything is in there.
That's learnbayesstats.com.
If you're interested in one-on-one mentorship, online courses, or statistical consulting, feel free to reach out and book a call.
topmate.io slash alex underscore andorra. See you around, folks, and best Bayesian wishes to you all.
Hello my dear Bayesians!
Before today's episode, I wanted to let you know that this year, we'll be talking about Bayesian modeling in soccer at the Field of Play conference in Manchester, UK, on March 27,
2026.
So if you want to meet me, you can come there in the audience if you want, but also as a speaker, because we have already locked in most of the speakers, with announcements coming
soon!
Stay tuned and follow on LinkedIn.
Last year we had speakers from baseball, cycling, education, fantasy sports, soccer obviously, because it's Manchester, and that mix genuinely raised the level of
conversation.
The theme for this year, 2026, is communicating complex ideas: how do you take something technical, nuanced, uncertain, like models, probabilities, trade-offs,
and make it understandable and useful for people who are not data experts.
Like last year, we are opening up one of the final speaker slots.
So if this theme resonates with you or someone you know, whether they work in football or somewhere completely different, feel free to contact me and I will take a look.
And in any case, you can already buy your tickets: go to Field of Play's website or LinkedIn page,
and you'll have all the information there.
I'm really looking forward to seeing you there, and, well, I will for sure have some LBS merch with me, so please come say hello, and come to my talk also, so that then I
can say that the room was full, because otherwise, I don't know what I will do.
Thank you so much, people, I will see you there, and now, let's go on with the episode.
Jonas Arruda, welcome to Learning Bayesian Statistics.
Thanks Alex for having me.
I think the word you're looking for is obrigado.
Obrigado, yeah.
thank you.
That's the extent of my Brazilian.
So let's stop here.
No, but yeah, thanks a lot for joining.
This is awesome to have you on the show.
You guys have done amazing work on the BayesFlow
side, and you in particular with that new paper and tutorial that we're going to talk about today and that you're actually going to live demo.
First we're going to do kind of a hybrid episode where we talk about your background, and then we'll dive into the code, which is great because we have some people joining from the
US East Coast, where it's 8 a.m. right now,
and so that gives them time to, you know, sip on coffee and
wake up while we talk about your background, and then we'll dive in. And when you join, folks in the chat, please let us know where you are from, and, yeah, what time it
is for you.
I have Ron here.
It's 6 a.m.
So it starts to be a competition. Damn, Scottsdale, Arizona.
Oh Ron.
Okay.
Yeah, you know, I know which Ron it is.
Well, well done, Ron. Ron is a...
He's a patron of the show.
He's there all the time.
Thank you so much, Ron, for your unwavering support, even at 6 a.m.
I see Philippe from France.
Arniel from Ontario, Canada.
Well done.
8 a.m. for you too.
Vikas from New York.
8 a.m., damn, you guys are much more dedicated than I am.
I'm not sure I would wake up for that.
And we have Oscar from London, Raphael from Germany.
Maybe that's someone you know.
I know there is a Raphael in your group.
Could be, from Freiburg.
Yeah, yeah, maybe. Kelvin, 4 p.m. from Kenya.
Wow.
Okay.
Damn.
Amazing.
This is amazing.
We have a great diversity.
Douglas from the UK.
Fantastic.
Yeah.
That's amazing.
Kelvin, I was in Kenya recently and yeah, that was incredible.
So thank you for the welcome, even though you are not personally doing it.
But yeah, that's great to see people from all over the world.
And now let's do it Jonas.
Yeah.
Well, what about you, Jonas?
What's your origin story?
What are you doing nowadays?
And how did you end up working on that?
Yeah.
So I'm a mathematician by training.
So I studied mathematics at the University of Bonn in Germany.
And currently I'm doing my PhD here in a very fantastic research group.
Just briefly, our research group is really big and we do a lot of different things.
So we are at the intersection of mathematics and life science.
So always driven by problems from the life sciences, we try to use math to solve these kinds of problems.
For example, some of us look at like how genes link to our immune system.
So which functions do they have?
Some others of our group look at more specific modeling projects.
So where we try to model the immune system and so on.
But I, and also some others here in the group, work more on the inference part.
So in these models, there are unknown quantities, and we are trying to infer these quantities to better describe the biological processes we are looking at here.
And I, myself, I especially like to work on inference problems where you can't use traditional methods.
So I'm using neural networks then to do inference on these kinds of problems.
And I develop new algorithms there.
Okay.
Yeah, that sounds like a lot of fun.
How did you end up working on that?
Is it something you've been doing pretty much since after high school or was it a meandering way to where you are right now?
So during my studies in mathematics, I did pure theory, mathematics with no real applications.
And I was always wondering, okay, how to actually apply math to problems which you feel the
world is needing. And then there was the pandemic.
So looking at the immune system is really important, not only at that time, of course, but especially at that time it was important.
And I had a nice professor here at the University of Bonn who was looking directly at the COVID pandemic from this perspective, also from the mathematical perspective.
And that's how I ended up going in this direction.
So I was always dreaming of doing math all the time.
And now I can do math and solve real world problems.
So that's really cool.
Yeah, I can relate for sure.
And so today you're going to talk a lot about a concept that's called simulation-based inference.
That is, SBI.
Maybe you'll hear us say that acronym.
Not to be confused with SVI.
Yeah, it's SBI.
So yeah, can you briefly tell us about SBI, knowing that...
For you folks listening and in the chat, if you haven't listened to it, there is episode 107 with Marvin Schmitt, who talks exactly about BayesFlow and what exactly
amortized Bayesian inference is, which is really something we're going to dive into today.
So if you want even more background than what Jonas is going to tell us, listen to that one.
I think the previous episode, 150, about fast Bayesian deep learning with another group in Germany, from Munich, is
also going to be very interesting to you guys as background.
So yeah, basically, Jonas, can you tell us what that is?
Like, what's the difference between SBI and amortized Bayesian inference?
And basically for each of those, can you tell us the what, the why, the when?
Simulation-based inference is a setting where you have, let's say, a real-world phenomenon, a real-world problem, and you want to describe this problem with a model.
Let's say you have this model and you can simulate from it, so we call it a simulator.
So that's where the simulation part is coming from.
And now you can run a lot of simulations and
try to find out: there are some unknown quantities in my simulator, and I need to calibrate it to find the best parameters of my model, which then describe my real-world
problem.
And this process of finding these quantities based purely on simulation, that's simulation-based inference.
And there are lots of different algorithms out there which can do that.
Some of them are based on neural networks, others are not.
So this is not really
a neural network thing, but a very general topic.
And I myself, I mostly work on the neural network part, but there are also other, non-neural methods which are part of simulation-based inference.
And the real key of simulation-based inference is that you don't need anything more than simulations to do the inference part.
So when you think of your favorite Bayesian methods, like MCMC, you would
need to define some probability distributions, which you need to evaluate.
And in simulation-based inference, usually you only use the simulations.
So you can use very complicated simulators, and that's also the why.
So you want to use a really complicated simulator, which can really accurately describe your problem.
Then simulation-based inference is a great tool to calibrate your simulator to your data.
And in contrast, the difference to amortized Bayesian inference is:
there are certain simulation-based inference methods which, for example, use neural networks to do this inference process, going from data to your inferred quantities.
And the idea in amortized Bayesian inference is that you can learn one neural network to solve many inference problems.
So not only one, but many.
And in this way, the training of these neural networks is amortized by applying
this one network to many datasets.
So amortized Bayesian inference is like a subtopic of simulation-based inference.
Yeah, okay.
And what are the use cases that you would recommend people use ABI for?
So, amortized Bayesian inference is a really, really nice tool to do diagnostics on your neural network.
So usually when you have a Bayesian workflow, you want to make sure, okay, my simulator and my inference algorithm work together and actually produce valid
posterior distributions, for example. And to do this kind of diagnostics, you actually need to do inference on many datasets.
So simulated datasets, for example, and in this case, amortized Bayesian inference is already a great tool.
So not only when you want to apply it to many datasets, but also just for diagnosing your Bayesian workflow.
So that's one part.
And the other part is often you have problems you can actually separate into smaller sub-problems.
Like, for example, when you have a hierarchical model, you have a population with individuals; then you can say, okay, maybe I can train a neural network only on
individuals and then put all the stuff back together into the hierarchical model.
And then you don't amortize over multiple datasets, but you amortize over individuals in your dataset.
So you have lower training costs compared to a non-amortized method.
So there are really
different applications where amortized Bayesian inference is a useful
tool.
Yeah.
So for even more background about that, folks: again, episode 107 with Marvin, where we dug into that pretty well.
But today I want to focus on this new tutorial you have out, which is on diffusion models in simulation-based inference.
And that's based on a paper actually.
So I will link to the tutorial.
In the show notes, of course, but I'll also put it in the chat now, folks, if you want to follow along.
But first Jonas, can you tell us what this is all about?
Like give us the elevator pitch, why that's useful, what you're doing in this and what will people get out of going through that notebook?
Yeah.
So the notebook is basically an introduction to simulation-based inference,
what I just told you about, but also to how to use a diffusion model for simulation-based inference.
So you might know diffusion models from DALL-E or so for image generation, but they can actually also be used to learn parameters for your simulator.
And they are pretty powerful.
So basically last year, my colleagues and I wrote a large review on how to use diffusion models for simulation-based inference and also on what makes them especially cool,
because after training, you can also use these models to do cool stuff.
So they are also very different from previous neural networks which could be used in simulation-based inference.
They're like a game changer.
And so this is why we wrote a review.
There were tons of new papers coming out just last year, which we tried to synthesize.
And basically the notebook is a first introduction to this whole topic.
And if you want to then dive deeper, then the paper is the right thing.
to continue learning about this.
Yeah.
And that's a very good tutorial.
I really enjoyed going through it.
So thank you so much for doing that, Jonas.
I think it's very practical, which I really like.
That's like the mark of this show, right?
Giving you guys the latest research, but in a digestible, practical, and applicable way.
So let's do that.
Actually, Jonas, do you want to
share your screen now and live code some parts of the notebook, and also basically explain to us what this is all about?
So yeah, now folks, you should see the notebook from Jonas in the chat.
um And actually, if you have any questions, please tell us in the chat.
So any live questions, I will be monitoring the questions all the time.
So depending on the type of question, I will ask them in the moment, or I will wait for a better moment, or maybe until the end of the presentation if it's a bit more generic.
So yeah, please send me your questions in the chat.
If you are a patron of the show, you also have access to Discord.
I saw that Ron, for instance, already sent two questions in advance.
So, yes, I will ask them here in priority.
But if you're a patron, please during the live show, ask your questions here on Riverside in the chat because navigating both channels will be hard for me.
So, okay.
Now that that's out of the way,
Let's do the fun stuff.
You can take it away, Jonas.
Okay.
So let's briefly talk about what this tutorial is going to cover.
So we want to talk about what is simulation based inference.
So I'm going to show you a very simple problem and how to solve that problem.
I'm going to talk about what diffusion models actually are.
So what is the idea behind them and try to give you an intuition.
And then I will show you one example of why diffusion models are so special for SBI.
So let's start.
with a problem and we're going to look at the so-called inverse kinematics problem.
In this problem, we have a robot arm, and this robot arm has three joints where it can move.
So you see here the three joints and an initial height, which can be changed.
And then this robot arm ends in a final position, which is marked by this red cross.
And in this toy example here, the problem
which we want to solve is: given this final position,
so this final red cross here, what are all the possible angles which the robot could have used to get to this final position?
In a little bit more mathematical terms, we have these parameters, which are the initial height offset and the three joint angles.
And we want to do inference on these parameters given the end position.
Why is this a nice example?
It is a nice example because we can visualize it nicely, first of all.
And the second reason is that this problem allows us to analyze how good our inference method is.
Why?
Because if you look at one of these positions here, there are actually many different angles and initial heights which lead to this final position.
So it's made in a way that
our posterior later on will be multimodal, and this is a good stress test for any inference algorithm.
So how do we want to solve this problem?
What we will do here is train a neural network to approximate the posterior distribution of this simple problem.
And for that, we're going to first take one step back and talk again about simulation-based inference, or SBI.
And of course, as we are here on
a Bayesian podcast, we are interested in the Bayesian posterior distribution:
the probability distribution of parameters given our observed data.
And as in any standard Bayesian workflow, what you need is a prior and a likelihood to be able to compute your posterior.
But usually, what you need in a standard Bayesian workflow is to be able to evaluate your likelihood and your prior.
So you need the densities.
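For reference, the target being described here is the usual Bayesian posterior; in standard notation (nothing specific to the notebook):

$$
p(\theta \mid x) \;=\; \frac{p(x \mid \theta)\,p(\theta)}{p(x)} \;\propto\; p(x \mid \theta)\,p(\theta)
$$

The point of SBI is that you never evaluate the densities $p(x \mid \theta)$ or $p(\theta)$; you only need to be able to sample from them.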
And the key idea now in simulation-based inference is: if you can simulate data from your model, so that is basically sampling parameters from your prior, using these parameters
then in your model, and doing a simulation, we can actually create training data for a neural network without ever evaluating the likelihood density or the prior density.
So the idea is, okay, let's generate
many training pairs, which consist of pairs of my data and the parameters which created these simulations.
And then we train a neural network, which basically learns a mapping from my data to this posterior distribution, or something related to that mapping.
And after training, where the training was purely done on simulations, we use our neural network to plug in the actual observation we want to do inference on,
the red cross position, and then generate posterior samples.
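To make that training-data generation concrete, here is a minimal sketch in plain NumPy. The function names (`sample_prior`, `simulate`) and the toy simulator are hypothetical stand-ins, not the notebook's actual code:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_prior():
    # hypothetical prior: four standard-normal parameters
    return rng.standard_normal(4)

def simulate(theta):
    # stand-in for any black-box simulator (ODE solver, C++ code, ...)
    return theta[:2] + 0.1 * rng.standard_normal(2)

# training pairs (theta_i, x_i) ~ p(theta) p(x | theta):
# the only supervision signal the neural network ever sees
thetas, xs = [], []
for _ in range(10_000):
    theta = sample_prior()
    thetas.append(theta)
    xs.append(simulate(theta))
thetas, xs = np.stack(thetas), np.stack(xs)
```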
And that's actually what we are also doing in BayesFlow.
So I'm one of the core developers of BayesFlow, and BayesFlow is a Python library which does simulation-based inference with neural networks and also includes diffusion models, as
we're going to see later.
And here really the idea is: okay, you only need some simulation functions.
Then you can use BayesFlow to learn, with the help of neural networks, the distribution of your posteriors, but also other kinds of distributions; you could also estimate
the likelihood and so on.
But for now, we're going to stick only to posterior distributions.
And for those who have worked with neural networks before: we actually support all of the major Python libraries which can be used to build neural networks,
so TensorFlow, JAX, and PyTorch.
BayesFlow is basically backend agnostic, so you can build your own neural networks in your favorite library and use them within BayesFlow.
But we also provide a bunch of neural networks, so you don't have to code them yourself.
So the only thing you have to provide to BayesFlow is basically the simulation function you want to do inference on.
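On the backend point that comes up next: BayesFlow 2.x is built on Keras 3, so the backend is typically selected through an environment variable before the import. A minimal sketch (check the BayesFlow documentation for the exact recommendation for your installed version):

```python
import os

# must be set before importing bayesflow; Keras 3 supports
# "jax", "torch", and "tensorflow" as backends
os.environ["KERAS_BACKEND"] = "jax"

import bayesflow as bf
```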
And actually here, Jonas, I think it's a good time to ask you one of the questions that I had in the Discord
from Ron, because it's exactly about the backend. I think it's going to be a fast answer.
So Ron was asking: when I tried the notebook on my Mac, everything worked well with the JAX backend.
And then I tried Torch for the backend thinking I would take advantage of MPS support in Torch.
And it was 10 times slower.
Is this to be expected?
So I would not expect 10 times slower, but I would expect JAX
to be much faster than Torch.
That's because of how JAX is implemented in the background; JAX has nice support for fast evaluation of the neural networks.
So I would actually expect JAX to be much faster than Torch.
However, it could be that the Torch environment did not correctly recognize the MPS, and that makes it slower as well.
I'm not sure about that.
But I'm not so surprised that JAX is so much faster.
I hope that answers the question.
Yeah.
Yeah.
That answers the question perfectly.
Well, at least I think so, but Ron, tell us at some point if it doesn't.
Okay.
Good.
So let's continue Jonas.
Okay.
So let's have a look at how to do simulation-based inference in BayesFlow.
Now, let's look a little bit at the code overview here.
So actually we start by importing BayesFlow, and then we
have to define a prior and a simulator.
So those are basically the only things we have to define ourselves.
And the prior is just a function which generates parameters from some distribution, which we can define and which can also be implicit.
And our observation model then takes in these parameters and generates some simulation data.
And both of these functions return these dictionaries here, so that we later know
what was what; and then BayesFlow provides a nice convenience function to stack these two things together.
So the prior and observation model together become one simulator:
a function which we can generate simulations from.
And then in BayesFlow, we have something which is called a basic workflow.
And that's basically the main object in BayesFlow, which takes everything you have:
your definition of your simulator,
and also where you define which kind of neural networks you want to use for inference.
In this setting, I'm using a diffusion model, but we also have other generative models which you could potentially use here.
And we're going to discuss the choice here later a bit.
And then there are actually just two more lines of code which we need.
One is fitting the neural networks: we just say, okay, this function here generates some simulations in the background,
and then goes over the simulations to train the neural networks, in this case 100 times, over datasets which are generated anew in every epoch.
And then we have one function which we use to sample our posterior samples.
Because in BayesFlow, usually we do amortized Bayesian inference.
So during training, there are only simulations and no real data used, but then here we can plug in our real data and say,
okay,
let's sample our posterior samples.
And then BayesFlow also provides a lot of diagnostic tools, which we can use to see: okay, did our neural network actually converge?
What do our posterior samples look like?
Are these samples the correct ones, and so on?
And these you can also find in the documentation of BayesFlow.
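Condensed, the overview Jonas just walked through looks roughly like the sketch below. This is a loose reconstruction following the BayesFlow 2.x API (names such as `make_simulator`, `BasicWorkflow`, `fit_online`, and `networks.DiffusionModel` may differ slightly between versions), with a made-up toy simulator rather than the notebook's actual one:

```python
import numpy as np
import bayesflow as bf

def prior():
    # dict outputs so the workflow later knows what is what
    return {"parameters": np.random.normal(0.0, 0.5, size=4)}

def observation_model(parameters):
    # stand-in simulator: anything mapping parameters -> observables
    return {"observables": parameters[:2] + 0.05 * np.random.normal(size=2)}

# stack prior + observation model into one simulator
simulator = bf.make_simulator([prior, observation_model])

workflow = bf.BasicWorkflow(
    simulator=simulator,
    inference_network=bf.networks.DiffusionModel(),
    inference_variables=["parameters"],    # what we infer
    inference_conditions=["observables"],  # what we condition on
)

# train on freshly simulated batches, then sample the amortized posterior
history = workflow.fit_online(epochs=100, batch_size=64)
samples = workflow.sample(
    conditions={"observables": np.array([[1.0, 0.2]])}, num_samples=2000
)
```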
So let's fill these functions here with a bit of life and look at our concrete problem.
So what I'm doing here, I'm using...
I'm importing BayesFlow, and you'll see here that I'm using the JAX backend.
And then I define the prior function, which in this case is just a random normal distribution with different scales.
So this is given by the problem in this case, and it returns this dict of our parameters.
Then I have this observation model, which takes in my parameters, which were the height of the arm, right, and these three angles here.
And then with some lines of math, you can compute the observable, right?
The end position of my robot arm.
And really, the key here also is that this function could be anything.
Yeah.
So you can use any kind of simulator, also a super complex C++ simulator, which you call in this function.
Yeah.
So what this function only needs to do is take in some parameters and give us some observations.
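The "some lines of math" here are just planar forward kinematics. A hedged sketch of what such an observation model could look like; the segment lengths are made up for illustration and may differ from the notebook's values:

```python
import numpy as np

ARM_LENGTHS = (0.5, 0.5, 1.0)  # hypothetical segment lengths

def forward_kinematics(params):
    """End position of the arm; params = (height, angle1, angle2, angle3)."""
    height, angles = params[0], params[1:]
    x, y, direction = 0.0, height, 0.0
    for length, angle in zip(ARM_LENGTHS, angles):
        direction += angle              # each joint rotates relative to the last
        x += length * np.cos(direction)
        y += length * np.sin(direction)
    return np.array([x, y])             # the observable: the red cross position
```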
So this is defined; then let's define, using this convenience function,
the full simulator consisting of the prior and our observation model.
And let's just generate 10,000 samples from the simulator to have a training dataset.
This is a nice and easy simulator, so it's super fast.
And we see that we get our training data out of our simulator, which is now a dictionary.
And the dataset here now consists of parameters and observables.
So you see here the 10,000 simulations with the four parameters, and here the 10,000 simulations with the corresponding observations.
And this is the only supervision signal we're going to use for training, this dataset here.
So there's no real data involved.
So let's just visualize what we have done here.
So what you see here in the first row are the four parameters and the prior distributions.
So you see the Gaussians here.
Those are the initial values we're going to start from.
And here you see again some random simulations, similar to the
simulations which you have already seen before.
But now, I just want to show you: if I run them again and again, you see how the arm of the robot jiggles a bit.
So now that we have defined our prior and our observation model, let's now put this all together in our basic workflow.
So our basic workflow now takes in the simulator and our inference network, which here is just a diffusion model.
And we also define an adapter.
So an adapter is something special in our BayesFlow library, which basically tells our neural networks, okay, what
part of the training data is what.
So there are two fixed namings, which are the inference variables and the inference conditions.
And we need to tell our network, okay, what are the actual conditions,
so the data whose distribution we are conditioning on: the observables.
So this is now the entry of the dict of our simulator.
And this is the inference variables we want to do inference on: the parameters.
Because the simulator could put out any dictionary, this is just a way to tell
the workflow what is what.
And the adapter also takes care of the fact that neural networks like data in a specific format; it does data conversion for you and can also do more complex
transformations if needed.
So after defining our basic workflow, what we can do is train our neural networks, and we're going to train them just for a short amount of time, so that it actually finishes during
this presentation here. So we take just 10 epochs of our whole training data, and let's do this.
So we see, as I'm using this JAX backend, training is super fast.
After some initialization, it's just one second per epoch, and we're done.
Yeah.
And actually, while we're... well, I mean, we're done, but Ron had a question about that.
Wait, about the adapter, Ron, I think.
Yes.
Would that be a good time to ask Jonas about the adapter?
Yeah, sure.
Yeah.
So.
Ron was asking: by changing the adapter, how would you change from NPE to NLE, for instance?
Ah, you would just change the naming here.
So then parameters would become the thing you want to condition on, and observables would become the thing you do inference on.
So you would just put this name down here and the other one up here, and that's it.
Nice.
Okay.
And so can you also articulate what NPE and NLE mean in this context?
So NPE would be neural posterior estimation:
estimating the posterior distribution with a neural network. And NLE would be neural likelihood estimation:
estimating the likelihood of your model with a neural network.
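In other words, switching between NPE and NLE is just swapping which dictionary entries play which role. Purely illustrative, with the key names from the example above:

```python
# NPE: learn p(parameters | observables)
npe_roles = dict(
    inference_variables=["parameters"],
    inference_conditions=["observables"],
)

# NLE: learn p(observables | parameters) -- the roles are simply swapped
nle_roles = dict(
    inference_variables=["observables"],
    inference_conditions=["parameters"],
)
```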
Okay.
Yeah, we can go ahead.
So now that we've trained our model, let's do some inference.
So we define now our observables.
It's the same key entry as before.
And we take now some random observation.
So this could also be real data if you were working on an actual problem.
And we tell our workflow now to sample given this observation.
So we pass this condition and we tell it how many samples we want to
sample. Yeah, and we're going to sample now and plot them.
So this takes a few seconds only.
So done.
Now we see the prior here in gray behind, and this pinkish color, this is the posterior.
We actually learned nothing:
the prior and posterior look the same.
But that already tells us one important thing.
And that is, these neural networks, before they can actually learn the posterior, they need to learn the prior.
And after they have learned the prior, if you train them longer, they're going to learn the posterior.
So if you have a
very complicated prior, that also complicates your learning task, because you first need to estimate the prior before you can then go on to learn the posterior.
So what we actually have to do is train much longer than only 10 epochs.
So let's define our network here again.
It's the same definition as before, but now we train for 200 epochs, and we're going to do this now.
And while this is running, I want to talk again
about amortized Bayesian inference a bit.
So, as we already talked about a bit before, SBI can also be used in an amortized way, meaning we train the neural network once and can then reuse it for many observations.
And the idea here is really that we learn a global conditional model.
So instead of just solving one inference problem at a time,
we are actually solving many inference problems at once.
And of course, this is nice when you have many datasets, but it's also nice when you can dissect your problem into many smaller problems, and then you can amortize
basically over all of these smaller problems.
But one important thing one always has to keep in mind is the so-called amortization gap, which means:
Okay, maybe I have now trained a neural network which is super powerful.
And I want to apply it to a new dataset.
One has to make sure that the training set my neural network was trained on actually also covers this new data point.
So that actually means, in Bayesian terms, that my prior, or rather my prior predictive distribution, must also cover the data I later want to do inference on;
otherwise the neural network doesn't know what to do with it.
So basically...
Right, so does that mean basically you have to include the out-of-sample scenarios in your training, in your prior?
Exactly.
Yeah.
So, because if you have data which looks really, really different from your simulations, yeah, your neural networks won't be able to work with that.
They just have never seen this kind of data.
So they don't know what to do with it.
And so one has to be a bit careful:
okay,
how do I design my simulator and also my prior so that it actually matches my data in the end.
Yeah.
Yeah.
Yeah.
So it's not that the out-of-sample data actually has to be in the training dataset, because otherwise that's not out of sample, but you need the model to know that
it's something that's possible.
That some other scenarios than the ones that are in the training set are possible.
So the prior is actually wider than what's actually been observed so far.
Exactly.
So, I mean, if you only want to do inference for one specific dataset, then you don't need this wide prior, right?
It's just that if you keep in mind, I want to do inference later also on different things, then you have to keep this in mind already during training.
So that can be a problem, but often it's also very helpful to have this amortization feature.
Yeah, no, for sure.
That's like, for me, for instance: I always work on cases where I need to do out-of-sample predictions at some point.
So it's actually the whole point of the model.
no, no, for sure.
Super important.
Yeah, I mean, that's the same as when you use HSGPs, for instance: you have to include, in the hyperparameters of the GP approximation, the fact that the inputs can go further
than what's actually observed, because otherwise the approximation bounds will stop before the end of your test dataset, and then the predictions are extremely bad.
Okay.
Yeah.
Awesome.
So, wait, we have a question from Ivan.
Is it possible to introduce measurement error?
Suppose that the position and angles are known to a certain precision.
Yeah, I mean, you can. So measurement error is also just a noise model which you put on top of your simulator.
So basically what you could do is, let's go back to our simulator here.
Instead of giving back this deterministic output, we could also add noise on top,
and this noise could also depend on some parameters which you want to estimate.
And then we also have measurement error in our simulations, and then the neural networks will also be capable of working with the measurement errors.
Yeah, I hope this answers your question.
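Concretely, that could look like the sketch below, reusing the hypothetical `forward_kinematics` from the earlier sketch; `noise_scale` could itself be an unknown parameter you infer alongside the others:

```python
import numpy as np

def observation_model_with_noise(parameters, noise_scale=0.05):
    # deterministic simulator output ...
    position = forward_kinematics(parameters)
    # ... plus a measurement-error model on top
    return {"observables": position + noise_scale * np.random.normal(size=2)}
```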
Yeah, I mean, this is not surprising, as neural networks are basically GPs.
But yeah, if you're used to working with GPs, this is very, very similar.
Like, in your GP model, if you wanted to add some measurement error, you would use a noise kernel here.
Exactly.
And that would be your measurement error.
Yeah, exactly.
Okay.
Awesome.
So now actually our training here finished while we were talking.
That's great.
It only took like two minutes, but now we can use this newly trained model and do inference again on the same observation we had before.
So let's do that.
And we see now, ah, there it changed something.
So we see, for example, here for the first angle, we see a bimodal distribution.
So something seems to have been learned.
And we actually can also visualize these samples nicely.
So we just use the posterior samples and plot them again here in this graph.
So you see the different arm configurations, which lead to this final end position, which was our observation.
And you see how many different
em like angles and so on and initial heights were actually possible to reach this fire position.
So we see this clearly this multimodal distribution is also reflected here in this depiction here.
So seems that we have learned something useful.
We get a nice bimodal distribution here.
And the question now is obviously, is that actually the correct posterior?
Or did we only learn something which looks nice?
So what we can do is now use the amortization property of our neural networks and do inference not only on this one dataset, but on 100.
So let's do that.
So I'm generating here 100 test datasets, then I'm passing the test datasets as conditions to our inference network.
And I'm sampling 100 posterior samples for each of them.
And so we're generating now 10,000 posterior samples, and we're going to
plot them with our coverage statistics.
So BayesFlow provides a lot of diagnostics tools.
This is one of them, which simply tells us: is the posterior which we are getting out of our neural network actually covering the ground truth parameter we
used to generate our data?
And we see that for all of these four parameters, this is actually the case.
So this is a good indication that we actually learned something meaningful.
But there are also lots of different diagnostic tools which then also help you check: okay, did my network actually converge?
There's something called simulation-based calibration, which is also super powerful and a good way to look into your posterior.
But I won't go into detail on that today.
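BayesFlow ships its own calibration diagnostics, but the idea behind such a coverage check is simple enough to sketch by hand: for a given credible level, count how often the central interval contains the ground truth:

```python
import numpy as np

def central_interval_coverage(samples, truths, level=0.8):
    """samples: (n_datasets, n_draws, n_params); truths: (n_datasets, n_params)."""
    lo = np.quantile(samples, 0.5 - level / 2, axis=1)
    hi = np.quantile(samples, 0.5 + level / 2, axis=1)
    inside = (truths >= lo) & (truths <= hi)
    # a well-calibrated posterior gives coverage close to `level` per parameter
    return inside.mean(axis=0)
```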
Yeah, these coverage plots are really amazing.
I use them all the time.
And they are actually, if you're using the Bayesian framework, they actually mean what you want them to mean.
So basically, if you're expecting the true parameter to be in there 80% of the time, then you should expect more or less an 80% coverage from
the HDI, with some noise, which is uniformly distributed.
And yeah, so that plot that you see here on the diagonal, that's that.
If you use ArviZ 1.0, which is to be released very, very soon, that plot is done horizontally.
So it's plotting only the difference between the two,
so that you can see the difference more easily, because the numbers are smaller, that's all.
But that's the same plot if you're using ArviZ.
And yeah, I definitely
encourage everybody to start using that in the workflow, because this is one of the great smoking guns to know if a model is really good or if it's underfitting or overfitting.
And this is actually a great plot to demonstrate over fitting, for instance, from some model, whereas another one won't.
And actually, there's a very interesting pattern that you'll see, which I teach when I do...
so I have a workshop that I do from time to time about hierarchical models.
It's funny, because hierarchical models look like they underfit when you look at the in-sample predictions, and especially the coverage.
So if you look at the in-sample coverage from a hierarchical model versus a completely unpooled model, it will look like the unpooled model is much better and much closer to
perfect coverage.
The hierarchical model will have a coverage curve like an inverse U when you look at the plot,
which is supposed to be horizontal; an inverse U is
the pattern of underfitting.
But then if you look out of sample, the hierarchical model will have really great coverage, whereas the unpooled model will show overfitting, because it was actually
overfitting and overconfident.
So yeah, this is a nice pattern
that you'll see from hierarchical models, and it is counterintuitive.
So don't get discouraged if you see that your hierarchical model looks like it's underfitting in sample.
It's supposed to do that because it's a compromise and it's trading off worse fit in sample for better fit out of sample.
And actually, there was a question, so I'm not doing that only to talk about hierarchical models, although I love them.
There was a question for you, Jonas, from Oscar, who was asking exactly about using ABI for hierarchical models.
And that sounded like a really interesting use case to him.
And he was asking if he was right in thinking
we wouldn't need to fit the whole hierarchical model in this setup.
I'm not sure what he means by that, but you seem to understand.
So please enlighten me.
Yes, because I have written a paper myself on that topic.
So for hierarchical models, you can actually decompose your...
So let's say, to make it a little bit more intuitive, let's say your hierarchical model is about some individuals in a population.
And basically what you can do is train a neural network only on simulations for your single individuals in your population.
And then use your amortization over the individuals to estimate the full hierarchical model.
Yeah.
So you save a lot of simulations, because you basically
train only on the individuals instead of the full population.
Okay.
So how do you infer the whole population then?
Well, that's exactly why diffusion models are super cool.
I think we should talk about that later on.
Okay.
Yeah, let's do that.
Perfect.
Yeah.
And a point that Ron is making in the chat, which is a good one: the key thing here, when we do the coverage plots, and you were talking about SBC, simulation-based calibration,
is that it usually involves fitting the model multiple times.
So if you're using HMC and it's a model that's hard to fit, it's going to take you a long time.
But here, the key thing is precisely that amortized inference makes SBC fast, but also that you really need SBC for SBI to be confident in the inference.
So that's very important.
And that's actually a point that you make in the paper, Jonas.
Yeah.
Yeah.
So thanks, Ron, for this very good point.
Let's continue, Jonas.
Indeed.
Yeah.
So what we have seen now was basically the first part of the tutorial:
what simulation-based inference is
and how to use it in BayesFlow.
And what I'm talking about now is: what are diffusion models?
So we're going to look now under the hood at what BayesFlow is actually doing here.
So diffusion models are generative models that create samples by iteratively denoising random noise.
So what does this mean?
So let's have a look at this visualization here.
So let's think of some easy distribution which we can easily sample from, which would be a simple Gaussian.
And that's the distribution we're going to start at.
And what a diffusion model then does is it iteratively removes some noise from my distribution.
And it does this step by step: remove some more noise, and some more noise.
You see the bimodal distribution showing up, until we reach the posterior distribution.
So that's the basic idea of a diffusion model: we start with a distribution which is easy to sample from,
a random normal distribution, and then we remove noise until we end up in the distribution we actually want to sample from: the posterior distribution.
Now, taking one more step back: this works not only for posterior distributions, but also for likelihood distributions which you want to estimate.
And the basic idea is: okay, we have some target, say our posterior distribution, and some easy distribution.
What we define is a process which goes in both directions.
So one direction is the forward process, which goes from my
difficult posterior distribution, or whatever distribution I want to estimate, to my easy distribution, so the Gaussian distribution.
And the other one is the backward process, which goes back.
So let's have a look directly at these two processes, to understand what they actually do and how they help us train a diffusion model and do inference.
So the forward process is the training part.
And in the training part, what we do is:
we start with a clean sample of our parameters.
So that's here the theta-naught.
So let's plot them here.
So we start with... so these are now just the prior distributions, which we see here.
And since we have a training dataset, we actually know the ground truth parameter here.
Right.
And what we do during the forward process is add some noise on top.
So we add some more noise and more noise, until we end in the fully Gaussian case.
Now, here we also started in Gaussian distributions, because our prior is Gaussian, but that could be any distribution, right?
So the idea here is really: okay, we start with our ground truth parameters and we add noise until we end up in a setting where we only have pure noise.
And there are two functions here, when you look at this formula for how you add the noise, and these functions are this alpha and this sigma, which are the so-called noise schedule.
So they determine how fast
and how much noise you add during this diffusion time.
Diffusion time means that time zero is the posterior distribution and time one is only the Gaussian distribution.
And the idea of these noise schedules is that you can think about them as hyperparameters, which you might need to tune for the performance of your
diffusion models.
But this is a topic I'm not going to discuss today. If you want to learn more about different noise schedules, what they look like, and when to choose which noise schedule,
this is something you will find in the review paper we wrote.
But for now, just think about this alpha here as a function which is decreasing over time,
so you have less and less signal, and the sigma as a function which is increasing over time,
so that when you reach the noisy distribution, you only have noise in the end.
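As a toy illustration of such a schedule (a cosine schedule is one common choice; the notebook may use a different one), the forward noising step is just the formula mentioned above:

```python
import numpy as np

def alpha(t):   # signal coefficient: decreasing, alpha(0) = 1, alpha(1) = 0
    return np.cos(0.5 * np.pi * t)

def sigma(t):   # noise coefficient: increasing, sigma(0) = 0, sigma(1) = 1
    return np.sin(0.5 * np.pi * t)

def noised(theta, t, rng):
    """One draw of the forward process: z_t = alpha(t) * theta + sigma(t) * eps."""
    eps = rng.standard_normal(np.shape(theta))
    return alpha(t) * theta + sigma(t) * eps
```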
And what is the network then actually learning during training?
So the network is learning this guy here.
So that's... I hope you can see the cursor here.
So this: the score.
And what is the score?
The score is the direction in which you need to move your noisy parameter to increase its probability under your noisy posterior distribution.
So when we try to visualize that,
it looks basically like this.
So these gray arrows are pointing towards the distribution we want to go to.
And this is the so-called score.
And what diffusion models then learn during training is exactly to predict these scores, given our current noisy estimate, which we get just by using this formula here.
And because the
true value is in our training set, we actually know it.
And we also give the neural network the corresponding simulation which belongs to this parameter.
And then one can actually write down a very simple loss function with the score in there, because the score is analytically computable just from my noise schedule.
So when I define my alpha and my sigma here, I can give you a formula for the score.
I won't go into the detail here of how to derive that, but
basically think: okay, given my forward process, I actually know, and can compute, the direction in which my parameters need to go to reach the posterior distribution.
So this is the forward process.
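A sketch of that loss, continuing the toy cosine schedule from above. For a Gaussian forward kernel, the conditional score at the noised point is simply `-eps / sigma(t)`, so training reduces to a regression on the sampled noise; `score_net` is a stand-in for the actual neural network:

```python
import numpy as np

def score_matching_loss(score_net, theta, x, rng):
    """Denoising score matching for one (theta, x) training pair."""
    t = rng.uniform(0.01, 1.0)                  # random diffusion time
    eps = rng.standard_normal(np.shape(theta))
    z_t = alpha(t) * theta + sigma(t) * eps     # noised parameters
    target = -eps / sigma(t)                    # analytic conditional score
    pred = score_net(z_t, x, t)                 # network prediction
    return np.mean((pred - target) ** 2)
```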
Then in the reverse process, doing inference, we don't know the ground truth parameters anymore, right?
Because we want to infer them.
Yeah.
Yeah.
So he was wondering if there is an intuitive way to understand this, and also if you have some pointers to additional reading for learning how these arise in detail.
Yeah.
So the intuition really is, you have to think: this is kind of a direction you're learning.
And the direction is pointing towards... the reason why it is so complicated is because this direction is a direction in probability space.
Yeah.
Because what you
actually do when you do sampling is you start in this Gaussian distribution and you move to your posterior distribution.
So you move between two probability spaces, and basically how to move between these two spaces, this is the direction which the score is telling you.
So that's why that's a complicated part.
There's lots of math involved to write the formulas down for that.
And if you want to...
But if you want to dive deep into the mathematics, then I can also point to our review paper.
And referenced in the review paper, there's another paper from two PhD students at MIT.
One of them is Peter Holderrieth.
He wrote a really nice tutorial on how to derive these score functions, with all the mathematical details.
So if you want to have a look at this literature, that's really a great resource.
Okay.
Yeah, yeah, for sure.
We'll put them in the show notes for the episode; that way,
we'll make sure people have that handy when they check out the video.
And Ron says this is very helpful.
thank you, Jonas.
And actually, Prashant, I don't know where you're joining from, Prashant, but thank you for your question.
He's asking if there is a convexity requirement for this to work.
A convexity requirement for, like, the problem, so that you have a convex problem?
Yeah, I think so.
There is no more detail to the question than that, but I think that's what he means.
If it's not, Prashant, please tell us.
So I think there is not, but maybe I need to discuss with him directly what he means, to be more precise in my answer.
Yeah.
Yeah.
So please Prashant, give us a bit more detail in the chat and then...
Also feel free to reach out to me and we can discuss that.
Yeah.
So in the meantime, let's continue.
Yeah.
Okay.
So we talked about the forward process.
Now let's talk about the reverse process.
So the inference part.
So basically, we already trained our network to learn these score functions.
And what we do now is, because we know how we derived the forward process, there's actually a direct link to the reverse process.
So just by plugging in my noise schedule and so on, I can write down this stochastic differential equation,
which tells me how to get back from my noisy space to my posterior distribution.
And in this formula, the only thing that appears which we need to know, besides the noise schedule, is this score function here.
And this is learned by our neural network.
So what we do in the reverse process is actually iteratively plugging our current noisy estimate and our data into our diffusion model, which then predicts the next
update step
of our reverse process, and we do this reverse process until we end in our posterior distribution.
So that's on a very high level, how this reverse process works.
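To see the mechanics without any neural network, here is a self-contained toy: a one-dimensional Gaussian "posterior" whose noised score is known in closed form, sampled with an Euler-Maruyama discretization of the reverse SDE under a simple variance-exploding schedule. All choices here are illustrative, not what BayesFlow does internally:

```python
import numpy as np

rng = np.random.default_rng(0)

MU, S2 = 2.0, 0.25        # toy target: N(2, 0.5^2)
SIGMA_MAX = 10.0          # noise scale at t = 1, with sigma(t) = SIGMA_MAX * t

def score(z, t):
    # score of the noised marginal N(MU, S2 + sigma(t)^2), known analytically
    return (MU - z) / (S2 + (SIGMA_MAX * t) ** 2)

def reverse_sde_sample(n_samples=10_000, n_steps=500):
    dt = 1.0 / n_steps
    z = SIGMA_MAX * rng.standard_normal(n_samples)  # roughly pure noise at t = 1
    for i in range(n_steps):
        t = 1.0 - i * dt
        g2 = 2.0 * SIGMA_MAX**2 * t                 # g(t)^2 = d sigma(t)^2 / dt
        z = (z + g2 * score(z, t) * dt
               + np.sqrt(g2 * dt) * rng.standard_normal(n_samples))
    return z

samples = reverse_sde_sample()
print(samples.mean(), samples.std())  # should land near 2.0 and 0.5
```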
So just to briefly summarize: the diffusion model learns a score function, which is basically something like a direction, and which you need in order to solve the reverse SDE.
And sampling is really done by this iterative denoising, basically in the direction which the score is telling us.
What this gives us is a really highly expressive posterior approximation, which is useful for multimodal posteriors, as we have seen before, but also for high-dimensional
problems, because diffusion models, before SBI, were actually developed for image generation.
And in this case, each pixel can be seen as a parameter.
So we have huge dimensions.
So this is also really a nice thing, that diffusion models are scalable
to these high dimensions, and as we're going to see later on, we can also do post-hoc modifications during inference.
So besides diffusion models, there are variants of these kinds of models which are also very popular and useful for SBI.
And two of them are the so-called flow matching and consistency models.
And I just want to briefly mention them here, because I consider them also very nice, and they're also very useful.
But how they differ from the diffusion model is basically just in the way you parameterize the neural network.
Because the diffusion model is learning this score function, which I was telling you about before.
And you can also rewrite the notation of your reverse process a bit.
Then what you actually get out is, instead of a stochastic process, you can write down a deterministic process.
And in this deterministic process, there's just one direction, one
velocity field, you need to predict, and this is basically what flow matching is doing.
So instead of directly predicting the score function, flow matching, for example, uses a deterministic reverse path.
And furthermore, um if you want to do like very fast sampling, then there are these consistency models, which actually learn to directly jump to the end of your um reverse
process.
So you don't do this iteratively denoising, but you would directly jump.
to the end of your process.
But during training and so on, you also need these forward and backward processes defined.
So that's how they all fit together.
So you define this forward process and then you know how to do the reverse process by some math formulas.
So, and the cool thing is all of these models are also readily available in ASO.
So you don't have to write down the models yourself, but you can also just use them.
So I already prepared
Some of these models which are already trained.
So in Baseflow you can also save your model and then load it later for inference tasks.
So I defined here a flow matching model and a consistency model.
And I'm loading them here from my laptop.
But besides that, I trained them exactly as I did before with the diffusion model.
So let's load them.
oh
sampling with them.
So that was the first model, this was the fusion model.
Then we did the flow matching model and you see the last model, the sampling was way faster.
So that was the consistency model, which does this few step sample, few step predictions.
And you see all of these three models, like generative models actually learned our posterior distributions, but the posteriors look a bit different for all of them.
So you see, for example, that here for a consistency model, you have some outliers, which are not ending actually in the um
in the final red cross.
So accuracy of this consistency model is not as good as for the diffusion models.
But if you have a setting where you want to do fast inference, then these consistency models are great because they can be way faster.
So actually there are lot of choices you have to do when you are using a diffusion model for your simulation based inference tasks.
So you need to decide, okay, what is the actual noise schedule?
I'm using for these alphas and sigmas and which kind of family I want to actually use.
So diffusion model, flow matching, consistency model, whatsoever.
And there's also one thing we did in our review, we benchmarked all of these different methods and looked, okay, in which setting I actually want to use which model and we tried
to explain it in the paper.
So if you want to dive into em when to use which of these generative models, then I think this is also a good resource.
because we try really try to tell you, what are the choices out there and what is the choice you should use for your problem.
So just to give you a teaser, for example, we have run all of the different choices, can like all of the different architectures and parameterizations and so on, like a rather
famous, I would say, simulation-based inference benchmark from Lickman et al., where we can...
Basically then we compared on these 10 different problems, compared like tons of different versions of the diffusion model with different settings and looked which was performing
the best.
So if you look here, for example, at this Lotka-Volterra example, which is one of the most difficult low dimensional problems here in this benchmark, can, for example, see that the
flow matching models here like performed best.
So the score here is a C to a C score, which meaning that
A score of 0.5 would be the best score you can achieve.
So, the true posterior distribution would then be indistinguishable of your distributions of your neural networks.
And actually what we also see here with this dotted line, we show what was the benchmark result for the previous methods, which I would say were the normalizing flows.
So, this is a normalizing flow here in Crayon.
We see all of the different
Diffusion models, consistency models, flow matching models, they all beat the benchmark by far.
So they are way more expressive than uh new networks which were used before in the context of simulation-based inference.
And we also see, for example, that the consistency models are not as accurate as the other models, but they allow for this fast sampling.
So you have this trade-off of faster sampling versus accuracy, and it depends on what your task actually is.
So that was now the second part of my tutorial.
So I have one last thing I want to talk about.
And we're also going to talk briefly here again about hierarchical models, as I promised.
Nice. Yeah.
Actually Jonas, I have a few questions for you from the chat.
So let me read them.
From Raphael: how many parameters slash weights do the models have?
How do they scale with the number of parameters in your posterior?
"In your posterior", I guess he meant to say.
Yeah.
So in our review paper, we looked at problems with a few hundred parameters and even more.
And for that, we used neural networks which had roughly a few hundred thousand parameters, weights and biases and so on.
So maybe a million or so, so not that big in terms of neural network sizes.
And of course, if you go to larger dimensions, you will also need a larger neural network, but that also depends on how high-dimensional your observation is.
So it's not only the inference target that can be high-dimensional, but also your observation, because you need to somehow incorporate that information.
So you have different aspects which lead to larger or smaller neural networks, I would say.
Okay, yeah.
Yeah, I was thinking about the biggest model I've had to fit.
I think the biggest I've done so far is a huge model with more than 300,000 parameters, and these were huge GPs all over the place.
And I had tried... well, not really ABI or SBI here, because that was not really a good fit for the use case. In the sense that, if I understood correctly, and maybe that changed since I talked to Marvin, but basically in that show he was saying that you mostly want to use ABI when you have a constant structure in your model but your test set changes and can be extremely big.
My use case was: yes, we have big data, but not very, very big.
And the structure of the model wasn't fixed.
So that meant that ABI was not very...
Yeah, basically you have to fix the structure of your model, right?
That's really important.
That's the observation model we defined before: that has to be a fixed thing, with parameters inside and so on.
So for amortized Bayesian inference problems, you usually have to fix the structure.
Yes.
Yeah.
Yeah.
Otherwise you can't amortize.
That's why you're doing it in the first place; otherwise you're going to need to sample a lot longer.
And for these models, actually, HMC is really good if you're very careful about the prior and have a smart structure in the code for the model.
That model I was fitting in 15 minutes on an M4 Pro, so you know, for a huge model like that, it's very, very good.
Elham, or Elam, has a question for you about the score function: he's wondering if the score function is always tractable.
That's a good question.
So thanks for that, because the score function actually is always tractable.
And the reason is that you get it directly from the forward process: because you define this simple noise schedule, you have a closed-form formula for the score, which you then use in the reverse process.
And actually, if you look a little bit into the math, what you will find is that you don't need that conditional score, which is indeed intractable; you only need a score defined on your noised parameters, an unconditional score.
And the optimization procedure will then lead to a diffusion model which actually learns this unconditional score.
So yes, the score you need for training the diffusion model is always tractable.
And in this case, you also don't need any derivatives from your simulator.
So don't get me wrong: there's also something called score matching, where they try to learn the derivatives of your likelihood function directly, or something like that.
That's different.
Here, in this case, you really get a score based on this noising process, and from this, the score is always tractable.
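To make the tractability point concrete, here is a small numpy sketch of the closed-form conditional score under a Gaussian noise schedule, which is the regression target in denoising score matching. The numbers are arbitrary and the formula is the standard one, not BayesFlow-specific code.

```python
# For a Gaussian forward process x_t = alpha_t * x0 + sigma_t * eps, the
# conditional score grad_{x_t} log q(x_t | x0) = -(x_t - alpha_t * x0) / sigma_t**2
# is available in closed form -- no simulator derivatives involved.
import numpy as np

rng = np.random.default_rng(42)

def conditional_score(x_t, x0, alpha_t, sigma_t):
    """Closed-form score of the Gaussian noising kernel q(x_t | x0)."""
    return -(x_t - alpha_t * x0) / sigma_t**2

# One training pair: noise a clean parameter draw, read off the target score.
x0 = rng.normal(size=3)          # "clean" parameters, e.g. a prior draw
alpha_t, sigma_t = 0.8, 0.6      # noise schedule evaluated at some time t
eps = rng.normal(size=3)
x_t = alpha_t * x0 + sigma_t * eps

# Denoising score matching regresses a network s(x_t, t, data) onto this
# target; averaged over x0, the network ends up learning the marginal score.
target = conditional_score(x_t, x0, alpha_t, sigma_t)
print(np.allclose(target, -eps / sigma_t))   # True: target == -eps / sigma_t
```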
Okay.
Yeah.
Yeah.
Interesting.
I didn't know about that.
Nice, there were some emojis.
I didn't know we could do that in Riverside, folks.
So that's cool.
Apparently you can react with emojis.
That's very useful for me because that makes my job easier.
So thank you.
And uh Ivan is asking if these benchmark examples are available somewhere in code?
Yeah, they are.
So you can have a look at them: if you Google "SBI benchmark suite", you will find them.
This is an open-source repository, not from me.
But also, if you look at the code for our review paper, there's a repository there where you will find all the links to how we did the benchmark, and where you find the simulators for doing the benchmark, and so on.
Nice.
Yeah.
Can you link to that in the show notes for this episode?
That'd be great.
Yeah.
Cool.
Sure.
And so Prussian had a follow-up on his question, but it's a big one, so let's see.
It looks like I can show the question on the stream, so let's try that.
yes.
Oh, look at that.
So the question is a follow-up to his convexity question from before; maybe that helps, Jonas.
If it doesn't, then we'll just go to something else.
The question is: aren't there local minima? Does the model learn the global minimum in the example?
What if the relationship between the parameter and the score is affine but not convex, perhaps due to the noise added?
So, in a diffusion model, I think... okay, let me start again.
Of course, your neural network can get stuck in local minima.
But that's a local minimum in the training scheme of your neural network, and with different learning rates or more training data you can overcome this.
So that's one part.
The other thing is if you're thinking that the noise schedule you're adding, so the alphas and sigmas, is not enough to learn the relationship needed to get to your posterior distribution.
Maybe there's some mathematical theory which will tell us in the future whether these alphas and sigmas are actually enough.
But for now, what I can say is that empirically, this is enough.
Yeah.
So the right noise schedule for the alphas and sigmas will give you a score which is good enough to approximate any kind of posterior distribution you will see.
You can again think about image generation, where they use the same kind of noise schedules, and these models basically learn the densities of all these images, which are really complex densities.
So, at least empirically, I have not seen that you get stuck in any local minima or anything like that.
I hope that helps.
Yeah.
Yeah.
Okay.
So I think we're good for now.
So let's continue with the end of the tutorial and talking about hierarchical models.
Yeah.
Okay.
So.
In the end now, I want to briefly talk about why diffusion models are so special.
We have already seen that they are good at tackling difficult and high-dimensional posteriors.
But the real game changer with diffusion models is that you can change the score in the reverse process.
What do I mean by that? For example, you might think: ah, I now have a constraint on my parameters that I did not encode before in my prior.
For example, that my parameters have to lie in this blue circle here above.
Yeah.
Then what I can do is intervene in my inference process, where I use this denoising process, and add the constraint directly to my score.
And what I end up with are samples which not only follow my posterior distribution but also satisfy the constraint I added on top.
And this actually also allows us, for example, to
change the prior without retraining the model.
And it also allows us to combine multiple factors.
So if you have multiple observations and you want to combine them into one single parameter estimate, you can just estimate all the different scores for your different observations and add them together, and you basically end up with a global posterior which corresponds to the posterior given all your observations.
That's so-called compositional inference.
And actually, this compositional inference is also related to hierarchical models, because in a hierarchical model that's exactly what I have when I try to estimate, for example, the population parameters of the individuals I was talking about before.
So when I have these multiple individuals, I train my diffusion model only to estimate the parameters for one of these individuals.
But what I can do then is estimate a score for each of these individuals and just add these scores up, and I basically get an estimate for my full hierarchical model.
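A hedged sketch of that score addition: since log-posteriors add across conditionally independent observations (minus repeated prior terms), the scores add too. `score_net` and `prior_score` are hypothetical callables standing in for the trained network and the prior's score; note that at intermediate noise levels this additive combination is an approximation, which is part of the active research mentioned later.

```python
# Sketch of compositional inference: combine per-individual posterior scores.
# Under Bayes' rule, log p(theta | y_1..n) = sum_i log p(theta | y_i)
#                                            - (n - 1) * log p(theta) + const,
# so gradients (scores) combine the same way. `score_net` and `prior_score`
# are hypothetical callables, not actual BayesFlow functions.
import numpy as np

def compositional_score(theta, observations, score_net, prior_score):
    """Score of the joint posterior given all individuals' observations."""
    n = len(observations)
    total = -(n - 1) * prior_score(theta)       # undo the repeated prior
    for y in observations:
        total += score_net(theta, y)            # score of p(theta | y_i)
    return total

# Each reverse-process step then uses this combined score in place of the
# single-observation one -- the network itself is never retrained.
```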
I hope this also answers Alex's question from before: how do you actually combine the estimates from the diffusion model for hierarchical models?
Okay.
So this is basically why you can infer the whole population based on only the individuals.
Okay.
Yeah, this is really cool.
So it's like fitting an unpooled model, but you still get a hierarchical model.
exactly.
Yeah.
So in the training part, you already have to take a bit of care in how you design your neural networks and so on.
But in the end, you basically train on the single individuals and get the full hierarchical model.
Nice.
And what about out of sample prediction here?
So there are two cases for hierarchical models.
One, you observe a new individual in an existing group; and two, you observe a new individual in a new group.
Yeah.
So a new individual in a new group would mean what? Can you explain the setting for me a bit?
Yeah.
So that would be like... let's say you have players, right?
So you fit your model on players, and players are part of groups.
But if it's a new player, like a complete rookie, you don't know which group they're part of yet.
And yeah, the way I usually do that in a classic hierarchical model, if I fit it with HMC... I mean, there's no closed answer, right?
This is a question where you have to use your domain knowledge and statistical judgment.
But most of the time, the way I do it is I use the prior information I have about the player, and then I blend that, as a prior, with the population-level estimates that I got from sampling on the training data set.
So yeah, how would that look like here?
So here, basically, the question is whether the prior you initially designed for your individuals already covers the new player you are adding.
If this prior is not wide enough, so it's not covering this new player because it's a complete rookie and you never had a complete rookie before, then your neural networks won't be able to work on this new player, because it's not part of their training distribution.
But if your prior was wide enough that this new rookie is still part of the prior distribution, then you can apply your neural network straight away to this new player as well.
Yeah.
Okay.
So again, we come back to: make sure you add that scenario of possibilities to the prior.
Yeah.
So that's going back exactly to this amortization discussion we had before, right?
You have to design your prior in a clever way so that you actually cover the use cases you want to cover.
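As a tiny, hypothetical illustration of that coverage check: draw from the prior you trained on and ask whether the new case's plausible parameter values land inside its bulk. The normal prior and the cutoff quantiles here are made up for the example.

```python
# Hypothetical check: does the training prior cover a new case?
import numpy as np

rng = np.random.default_rng(1)
prior_draws = rng.normal(loc=0.0, scale=1.0, size=10_000)  # training prior

new_player_theta = 3.8                                     # candidate value
lo, hi = np.quantile(prior_draws, [0.001, 0.999])
covered = lo <= new_player_theta <= hi
print(f"covered by training prior: {covered}")             # False here
```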
And what about the other case, the easier one, which is a new member of an existing group?
So here, that would be a new player from a team you've already sampled from.
Or, if you think about houses, it could be a new house, but in a county that was already in your dataset.
Yeah.
So that's exactly the same explanation I just gave you.
If the prior covers the new house, or the new player in the same team, then you're good.
And I would say it's more likely that you will cover this new house, because it's part of an existing group.
Yeah.
Yeah.
Okay.
Cool.
I mean, that means it's always the same answer.
So that's great in a way, you know: once you know you have to do that, you're covered.
Exactly.
Amazing.
Okay.
No questions so far, so you can, you can continue.
Yeah.
I have one last thing to show basically.
And what we do now is we want to add one constraint to the problem we were looking at before, so to this robot arm inference problem.
And the setting we could be in would be: while the robot was doing something, I actually observed it and I saw, ah yeah, the elbow is up.
So I could say: okay, let me change my prior now and try to encode this elbow-up thing, figuring out what it means for my parameters and so on.
That could be one option, but then I would have to retrain my neural network on this new prior.
But what I also can do with diffusion models, and that's why they're so cool, is use the model I already trained before, the one I showed you, and say: let's add a constraint.
And this constraint is just that the sine of the first angle is larger than or equal to zero, because that actually means that the elbow is going up.
So I defined the constraint here, and then what I can do is just pass it to BayesFlow.
Now let's look at the code.
I basically just told my sample function: okay, here I have an additional constraint.
And looking at these images: on the left, you see the diffusion model from before, with the normal score used in the reverse process.
And on the right, I have set the strength to zero, so it's still the same, because the constraint is multiplied by this strength parameter.
But if I now increase the strength here, I end up with the elbows being up.
Yeah.
And just to show you again: I'm using the exact same model, one run with the constraint and the other one without.
And then I can also change my constraint; now let's say I think the elbow was down, and the elbow ends up down.
So with the same neural network, I can now intervene in this reverse process to add knowledge that I have which I didn't have while I trained the model.
So that's super cool.
And that's something you cannot do with the other kinds of neural networks which have been used in simulation-based inference before.
So in summary, what have we seen here today?
Wait, about that last point: why wouldn't that be possible for another neural network? Can you make that explicit for the audience?
Yeah. So, for example, if you look at a normalizing flow, which is another kind of neural network that has been used a lot in simulation-based inference, what it does is a direct prediction from a Gaussian distribution to your posterior distribution.
So it has a one-step pass.
But with these diffusion models, you have this iterative denoising process, where in each of these steps you actually evaluate your network to get the score and go in the correct direction.
And because you have this process written out, you can just intervene and add the constraint, in this case to the score, at every one of these iterations.
And by this you change the reverse process toward this new posterior distribution which satisfies the constraint you added.
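Here is a rough, annealed-Langevin-style sketch of what such an intervened denoising step can look like: the learned score gets the constraint's gradient added, scaled by a strength parameter, at every iteration. All the names (`score_net`, `constraint_grad`) are illustrative stand-ins, not BayesFlow's actual internals, and the update rule is a simplification of a real diffusion sampler.

```python
# Sketch of score guidance inside the reverse (denoising) process.
# `score_net` and `constraint_grad` are hypothetical callables.
import numpy as np

def guided_step(theta, t, step_size, score_net, constraint_grad,
                strength=0.0, rng=None):
    """One Langevin-style update with a constraint added to the score."""
    rng = rng if rng is not None else np.random.default_rng()
    score = score_net(theta, t) + strength * constraint_grad(theta)
    noise = rng.normal(size=theta.shape)
    return theta + 0.5 * step_size * score + np.sqrt(step_size) * noise

# strength = 0.0 recovers the unconstrained posterior; increasing it pushes
# samples toward the constrained region (e.g., sin(first angle) >= 0, the
# "elbow up" constraint from the demo).
```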
Okay.
Okay, so it's because we have that iterative process in the first place that we can do these kinds of things.
Yeah.
Okay.
Cool.
Awesome.
Yeah.
Go ahead with the summary.
Yeah.
So this is super advanced already, right?
Because you are intervening in this reverse process.
But there's lots of research going on at the moment about what you can do with that.
For example, the compositional part is basically the same: intervening in this reverse process and adding scores on top of each other.
So there are constraints, there's prior change.
There are lots of cool papers which came out last year, actually, or the year before, doing this kind of stuff.
So that's really new, state-of-the-art research, I would say.
So let's briefly summarize what we did here today.
We talked about simulation-based inference and how we do likelihood-free Bayesian inference with neural networks trained purely from simulations.
We saw that amortization makes inference cheaper at test time, so that we can do diagnostics for the neural network.
We have seen what a diffusion model is, that it provides strong posterior expressiveness, and that this iterative process is really the key part of the diffusion model.
And this then allows us to do post-hoc interventions.
And yeah, if you want to learn more about diffusion models, then I can point you to the tutorial review we wrote.
That basically gives you all the math, and also the whole introduction to simulation-based inference again, and when to use which kind of diffusion model, which parameterization, which noise schedule, and so on.
And all of these different kinds of models are also already implemented in BayesFlow, so they're ready for you to use there if you want to apply them to your own simulation-based inference problems.
And if you have any further questions, feel free to ask them now or also reach out to me.
Yeah.
Yeah.
And please also add these links to the show notes of the episode, to the document I shared with you, and that way we'll make sure to add them to the show notes when we publish the episode.
Folks, feel free to send your questions right now, the last questions before we sign off.
There is one already from the one and only Ron.
Jonas, how does diffusion compare to the other methods in BayesFlow, such as normalizing flows, for instance?
And he asks about the comparison in performance slash ease of use.
Yeah.
So, thanks Ron for your many questions.
I enjoyed them.
So, normalizing flows, I would say, are easier in the sense of analyzing their convergence.
You plot the loss curve of these neural networks, you see a nice convergence, and at some point they stop improving.
That's a nice feature of a normalizing flow, and it makes them easy to diagnose.
Diffusion models, on the other hand: because you have this noisy forward process, where you're adding noise all the time, your loss curve will always look similar.
So you don't see the improvement directly in your loss, which makes them harder to analyze, to see whether they have already converged.
But in terms of performance, normalizing flows are no match for diffusion models.
Diffusion models are way more expressive than normalizing flows, because you have these constraints on the architecture in your normalizing flow which you don't have in a diffusion model.
So you can use any kind of neural network as a backbone in your diffusion model, and that makes them way more expressive.
But as I said, you need maybe a little more training, more tuning.
So if you have an easy task, your normalizing flow might be the better choice because it's easier to diagnose.
But if you have a hard task, definitely try out the diffusion models.
Okay.
Thanks, Jonas.
That's very clear.
Awesome.
So I think you can stop sharing your screen now.
So we're done with the presentation and now I'll play us out.
But thank you so much, Jonas.
That was a great presentation.
No code issues, so well done.
This almost never happens in live code presentations, and perfect timing too.
Well done.
Impressive.
And I learned a lot, and I'm sure everybody here in the chat did too.
I see a lot of happy faces and happy emojis.
So thank you so much.
And a lot of you guys joined, so thank you so much.
And for some of you, very early in the morning.
So, fantastic.
Jonas, to play us out, maybe can you tell us where you see these methods going in the next few months and years?
And also personally for you, I'm wondering: what's next?
What are you especially excited about?
So, about me personally, I'm excited to write my doctoral thesis now; I hope to defend in the summer, and then I'm looking for postdoc positions.
If you have any, contact me.
But from the methods perspective, I think what we are really looking forward to this year is looking at this guidance part of the diffusion model and trying to figure out: under which settings do we actually still get a valid posterior distribution?
Yeah, because you can do many things with diffusion models.
And the question is whether the posterior distribution you get out in the end is actually still valid, and how to make it a valid posterior distribution.
I think some more theory needs to be developed there.
And that's one thing I'm really curious about and want to work on this year.
Okay.
What do you mean here? I mean, how do you define a valid posterior distribution in these cases?
Yeah, that's the first question, right?
So when I add this constraint, what should we assume the true posterior looks like?
Yeah.
So defining all of this clearly, I think there's some work to be done, because otherwise you can't just apply these methods and say: I get nice plots out, and the fit looks good.
I still want to know: is the uncertainty in my posterior distribution actually the correct one?
Does it really reflect the uncertainty in my data, and so on?
Yeah, yeah, that makes sense.
Awesome. Well, Jonas, you've already been with us for a long time now, so I'm going to have to let you go.
But first I'm going to ask you the last two questions I ask every guest at the end of the show.
First one, if you had unlimited time and resources, which problem would you try to solve?
So, I love collaborating with the people from the life science institutes here.
And one of the major things people are working on here is cancer.
So if I had unlimited resources, I would try to find the best methods to help people build models which then, hopefully, at some point can cure cancer.
That would be great.
Yeah.
Please do that.
And second question, if you could have dinner with any great scientific mind, dead, alive or fictional, who would it be?
So I saw that you asked this question of the other people on your podcast too, and it's actually a difficult question to answer.
I think I would like to have dinner with Marie Curie.
Because she faced exceptional challenges, she did exceptional work, and she loved to talk about her science.
So that would be a great dinner conversation.
And I like her view that science is beautiful.
I think so too, so that's already a great conversation starter.
That sounds good.
And do invite us all to that dinner.
For sure.
For sure.
He's still on the chat.
uh Fantastic.
Well, let's call it a show.
Thank you so much, everybody, for joining.
That was super fun.
I see you guys are also happy, and I see you want more of these.
So I will definitely do more.
That's awesome.
Thanks, Jonas, for having done that and for being one of the first guests on these kinds of hybrid shows.
And most importantly, thank you so much for all your work.
I think it's extremely valuable and also extremely practical.
So thanks a lot for contributing to BayesFlow and making that available to everybody in the world, because we need this science outside of research papers only, so that people can actually use it in their workflow.
I hope we helped everybody do that today.
And well, Jonas will add everything, folks, by the way, to the show notes.
Make sure to check that out.
And Jonas, again, thank you so much for taking the time and being on this show.
Thanks for having me.
This has been another episode of Learning Bayesian Statistics.
Be sure to rate, review, and follow the show on your favorite podcatcher, and visit learnbayesstats.com for more resources about today's topics, as well as access to more episodes to help you reach true Bayesian state of mind.
That's learnbayesstats.com.
Our theme music is "Good Bayesian" by Baba Brinkman, feat. MC Lars and Mega Ran.
Check out his awesome work at bababrinkman.com.
I'm your host, Alex Andorra.
You can follow me on Twitter at alex_andorra, like the country.
You can support the show and unlock exclusive benefits by visiting patreon.com/learnbayesstats.
Thank you so much for listening and for your support.
You're truly a good Bayesian.
Change your predictions after taking information in, and if you're thinking I'll be less than amazing, let's adjust those expectations.
Let me show you how to be a good Bayesian.
Change calculations after taking fresh data in.
Those predictions that your brain is making? Let's get them on a solid foundation.