Proudly sponsored by PyMC Labs, the Bayesian Consultancy. Book a call, or get in touch!
In this episode, Marvin Schmitt introduces the concept of amortized Bayesian inference, where the upfront training phase of a neural network is followed by fast posterior inference.
Marvin will guide us through this new concept, discussing his work in probabilistic machine learning and uncertainty quantification, using Bayesian inference with deep neural networks.
He also introduces BayesFlow, a Python library for amortized Bayesian workflows, and discusses its use cases in various fields, while also touching on the concept of deep fusion and its relation to multimodal simulation-based inference.
A PhD student in computer science at the University of Stuttgart, Marvin is supervised by two LBS guests you surely know — Paul Bürkner and Aki Vehtari. Marvin’s research combines deep learning and statistics, to make Bayesian inference fast and trustworthy.
In his free time, Marvin enjoys board games and is a passionate guitar player.
Our theme music is « Good Bayesian », by Baba Brinkman (feat MC Lars and Mega Ran). Check out his awesome work at https://bababrinkman.com/ !
Thank you to my Patrons for making this episode possible!
Yusuke Saito, Avi Bryant, Ero Carrera, Giuliano Cruz, Tim Gasser, James Wade, Tradd Salvo, William Benton, James Ahloy, Robin Taylor, Chad Scherrer, Zwelithini Tunyiswa, Bertrand Wilden, James Thompson, Stephen Oates, Gian Luca Di Tanna, Jack Wells, Matthew Maldonado, Ian Costley, Ally Salim, Larry Gill, Ian Moran, Paul Oreto, Colin Caprani, Colin Carroll, Nathaniel Burbank, Michael Osthege, Rémi Louf, Clive Edelsten, Henri Wallen, Hugo Botha, Vinh Nguyen, Marcin Elantkowski, Adam C. Smith, Will Kurt, Andrew Moskowitz, Hector Munoz, Marco Gorelli, Simon Kessell, Bradley Rode, Patrick Kelley, Rick Anderson, Casper de Bruin, Philippe Labonde, Michael Hankin, Cameron Smith, Tomáš Frýda, Ryan Wesslen, Andreas Netti, Riley King, Yoshiyuki Hamajima, Sven De Maeyer, Michael DeCrescenzo, Fergal M, Mason Yahr, Naoya Kanai, Steven Rowland, Aubrey Clayton, Jeannine Sue, Omri Har Shemesh, Scott Anthony Robson, Robert Yolken, Or Duek, Pavel Dusek, Paul Cox, Andreas Kröpelin, Raphaël R, Nicolas Rode, Gabriel Stechschulte, Arkady, Kurt TeKolste, Gergely Juhasz, Marcus Nölke, Maggi Mackintosh, Grant Pezzolesi, Avram Aelony, Joshua Meehl, Javier Sabio, Kristian Higgins, Alex Jones, Gregorio Aguilar, Matt Rosinski, Bart Trudeau, Luis Fonseca, Dante Gates, Matt Niccolls, Maksim Kuznecov, Michael Thomas, Luke Gorrie, Cory Kiser, Julio, Edvin Saveljev, Frederick Ayala, Jeffrey Powell, Gal Kampel, Adan Romero, Will Geary and Blake Walters.
Visit https://www.patreon.com/learnbayesstats to unlock exclusive Bayesian swag 😉
Takeaways:
- Amortized Bayesian inference combines deep learning and statistics to make posterior inference fast and trustworthy.
- Bayesian neural networks can be used for full Bayesian inference on neural network weights.
- Amortized Bayesian inference decouples the training phase and the posterior inference phase, making posterior sampling much faster.
- BayesFlow is a Python library for amortized Bayesian workflows, providing a user-friendly interface and modular architecture.
- Self-consistency loss is a technique that combines simulation-based inference and likelihood-based Bayesian inference, with a focus on amortization.
- The BayesFlow package aims to make amortized Bayesian inference more accessible and provides sensible default values for neural networks.
- Deep fusion techniques allow for the fusion of multiple sources of information in neural networks.
- Generative models that are expressive and have one-step inference are an emerging topic in deep learning and probabilistic machine learning.
- Foundation models, which have a large training set and can handle out-of-distribution cases, are another intriguing area of research.
Chapters:
00:00 Introduction to Amortized Bayesian Inference
07:39 Bayesian Neural Networks
11:47 Amortized Bayesian Inference and Posterior Inference
23:20 BayesFlow: A Python Library for Amortized Bayesian Workflows
38:15 Self-consistency loss: Bridging Simulation-Based Inference and Likelihood-Based Bayesian Inference
41:35 Amortized Bayesian Inference
43:53 Fusing Multiple Sources of Information
45:19 Compensating for Missing Data
56:17 Emerging Topics: Expressive Generative Models and Foundation Models
01:06:18 The Future of Deep Learning and Probabilistic Machine Learning
Links from the show:
- Marvin’s website: https://www.marvinschmitt.com/
- Marvin on GitHub: https://github.com/marvinschmitt
- Marvin on Linkedin: https://www.linkedin.com/in/marvin-schmitt/
- Marvin on Twitter: https://twitter.com/MarvinSchmittML
- The BayesFlow package for amortized Bayesian workflows: https://bayesflow.org/
- BayesFlow Forums for users: https://discuss.bayesflow.org
- BayesFlow software paper (JOSS): https://joss.theoj.org/papers/10.21105/joss.05702
- Tutorial on amortized Bayesian inference with BayesFlow (Python): https://colab.research.google.com/drive/1ub9SivzBI5fMbSTwVM1pABsMlRupgqRb?usp=sharing
- Towards Reliable Amortized Bayesian Inference: https://www.marvinschmitt.com/speaking/pdf/slides_reliable_abi_botb.pdf
- Expand the model space that we amortize over (multiverse analyses, power scaling, …): “Sensitivity-Aware Amortized Bayesian Inference” https://arxiv.org/abs/2310.11122
- Use heterogeneous data sources in amortized inference: “Fuse It or Lose It: Deep Fusion for Multimodal Simulation-Based Inference” https://arxiv.org/abs/2311.10671
- Use likelihood density information (explicit or even learned on the fly): “Leveraging Self-Consistency for Data-Efficient Amortized Bayesian Inference” https://arxiv.org/abs/2310.04395
- LBS #98 Fusing Statistical Physics, Machine Learning & Adaptive MCMC, with Marylou Gabrié: https://learnbayesstats.com/episode/98-fusing-statistical-physics-machine-learning-adaptive-mcmc-marylou-gabrie/
- LBS #101 Black Holes Collisions & Gravitational Waves, with LIGO Experts Christopher Berry & John Veitch: https://learnbayesstats.com/episode/101-black-holes-collisions-gravitational-waves-ligo-experts-christopher-berry-john-veitch/
- Deep Learning book: https://www.deeplearningbook.org/
- Statistical Rethinking: https://xcelab.net/rm/
Transcript
This is an automatic transcript and may therefore contain errors. Please get in touch if you’re willing to correct them.
In this episode, Marvin Schmitt introduces the concept of amortized Bayesian inference, where the upfront training phase of a neural network is followed by fast posterior inference.

Marvin will guide us through this new concept, discussing his work in probabilistic machine learning and uncertainty quantification using Bayesian inference with deep neural networks.

He also introduces BayesFlow, a Python library for amortized Bayesian workflows, and discusses its use cases in various fields, while also touching on the concept of deep fusion and its relation to multimodal simulation-based inference.
Yeah, that is a very deep episode and also a fascinating one. I've been personally diving much more into amortized Bayesian inference with BayesFlow since the folks there have been kind enough to invite me to the team, and I can tell you, this is super promising technology.

A PhD student in computer science at the University of Stuttgart, Marvin is supervised actually by two LBS guests you surely know, Paul Bürkner and Aki Vehtari. Marvin's research combines deep learning and statistics to make Bayesian inference fast and trustworthy. In his free time, Marvin enjoys board games and is a passionate guitar player.
This is Learning Bayesian Statistics, episode 107, recorded April 3, 2024.

Welcome to Learning Bayesian Statistics, a podcast about Bayesian inference, the methods, the projects, and the people who make it possible. I'm your host, Alex Andorra. You can follow me on Twitter at alex_andorra, like the country, for any info about the show. LearnBayesStats.com is Laplace to be: show notes, becoming a corporate sponsor, unlocking Bayesian merch, supporting the show on Patreon, everything is in there. That's LearnBayesStats.com. If you're interested in one-on-one mentorship, online courses, or statistical consulting, feel free to reach out and book a call at topmate.io/alex_andorra. See you around, folks, and best Bayesian wishes to you all.
Today, I want to thank the fantastic Adan Romero, Will Geary, and Blake Walters for supporting the show on Patreon. Your support is truly invaluable and literally makes this show possible. I can't wait to talk with you guys in the Slack channel.

Second, the first part of our modeling webinar series on Gaussian processes is out for everyone. So if you want to see how to use the new HSGP approximation in PyMC, head over to the LBS YouTube channel and you'll see Juan Orduz, a fellow PyMC core dev and mathematician, explain how to do fast and efficient Gaussian processes in PyMC. I'm actually working on the next part in this series as we speak, so stay tuned for more, and follow the LBS YouTube channel if you don't want to miss it. Okay, back to the show now.
Marvin Schmitt, Willkommen nach Learning Bayesian Statistics.

Thanks, Alex, thanks for having me.

Actually, my German is very rusty, do you say nach or zu?

Well, welcome to Learning Bayesian Statistics. Maybe welcome im Podcast? Nah.

Obviously, obviously, like, there was a third hidden option.

Damn. It's a secret third thing, right?

Yeah, always. In Germany, it's always that.

Man, damn. Well, that's okay. I got embarrassed in front of the world, but I'm used to that in each episode. So thanks a lot for taking the time, Marvin. Thanks a lot to Matt Rosinski, actually, for recommending to do an episode with you. Matt was kind enough to take some of his time to write to me and put me in contact with you. I think you guys met in Australia at a very fun conference, Bayes on the Beach. I think it happens every two years. Definitely, when I go there in two years, I'll do a live episode there. That's a project; I wanted to do that this year, but that didn't go well with my traveling dates. So in two years, I'm definitely going to try to do that. So yeah, listeners and Marvin, you can hold me accountable on that promise.

Absolutely. We will.

So Marvin, before we talk a bit more about what you're a specialist in and also what you presented in Australia, can you tell us what you're doing nowadays and also how you ended up working on this?
Yeah, of course. So these days I'm mostly doing methods development, broadly in probabilistic machine learning. I care a lot about uncertainty quantification, and so essentially I'm doing Bayesian inference with deep neural networks. So taking Bayesian inference, which is notoriously slow at times, which might be a bottleneck, and then using generative neural networks to speed up this process, but still maintaining all the explainability, all these nice benefits that we have from using Bayesian inference.

I have a background in both psychology and computer science. That's also how I ended up in Bayesian inference, because during my psychology studies I took a few statistics courses, then started as a statistics tutor, mainly doing frequentist statistics. And then I took a seminar on Bayesian statistics in Heidelberg in Germany, and it was the hardest seminar that I ever took. It was super hard. We read papers every single week. Everyone had to prepare every single paper for every single week, and then at the start of each session the professor would just shuffle and randomly pick someone to present.

Oh my God.

That was tough, but somehow, I don't know, it stuck with me. And I had this aha moment where I felt like, okay, all this statistics stuff that I've been doing before was more of, you know, following a recipe, which is very strict. But then this holistic Bayesian, probabilistic take just gave me a much broader overview of statistics in general. Somehow I followed the path.

Yeah.
I'm curious, what does that mean, to do Bayesian stats on deep neural networks, concretely? What is the thing you would do if you had to do that? Let's say, does that mean you mainly develop the deep neural network and then you add some Bayesian layer on that, or do you have to have the Bayesian framework from the beginning? How does that work?

Yeah, that's a great question. And in fact, that's a common point of confusion there as well, because Bayesian inference is just a general, almost philosophical framework for reasoning about uncertainty. So you have some latent quantities, call them parameters, whatever, some latent unknowns, and you want to do inference on them. You want to know what these latent quantities are, but all you have are actual observables, and you want to know how these are related to each other. And so with Bayesian neural networks, for instance, these parameters would be the neural network weights, and so you want full Bayesian inference on the neural network weights. And fitting normal neural networks is already hard as it is.

Like a posterior distribution?

Exactly, over these neural network weights. Exactly. So that's one approach of doing Bayesian deep learning, but that's not what I'm currently doing. Instead, I'm coming from the Bayesian side. So we have a normal Bayesian model, which has statistical parameters. So you can imagine it like a mechanistic model, like a simulation program, and we want to estimate these scientific parameters. So for example, if you have a cognitive decision-making task from the cognitive sciences, these parameters might be something like the non-decision time, the actual motor reaction time that you need to move your muscles, some information uptake rates, some bias, and all these things that researchers are actually interested in. And usually you would then formulate your model in, for example, PyMC or Stan, or however you want to formulate your statistical model, and then run MCMC for parameter inference.

Now, where the neural networks come in in my research is that we replace MCMC with a neural network. So we still have our Bayesian model, but we don't use MCMC for posterior inference. Instead, we use a neural network just for posterior inference. And this neural network is trained by maximum likelihood. So the neural network itself, the weights there, are not probabilistic. There are no posterior distributions over the weights. But we just want to somehow model the actual posterior distributions of our statistical model parameters using a neural network.
Okay, I think so. That's quite new to me, so I'm going to rephrase that and see how much I understood. So that means the deep neural network is already trained beforehand?

No, we have to train it. And that's the cool part about this.

Okay, so you train it at the same time you're also trying to infer the underlying parameters of your model?

And that's the cool part now. Because in MCMC, you would do both at the same time, right? You have your fixed model that you write down in PyMC or Stan, and then you have your one observed data set, and you want to fit your model to the data set. And so, you know, you run, for example, your Hamiltonian Monte Carlo algorithm to traverse your parameter space and then do the sampling. So you couple your approximation phase and your inference phase: you learn about the posterior distribution based on your data set, and then you also want to generate posterior samples while you're exploring this parameter space.

And in the line of work that I'm doing, which we call amortized Bayesian inference, we decouple those two phases. So the first phase is actually training those neural networks, and that's the hard task. You essentially take your Bayesian model and generate a lot of training data from the model, because you can just run prior predictive samples. So generate prior predictive samples, and those are your training data for the neural network. And you use the neural network to essentially learn a surrogate for the posterior distribution. So for each data set that you have, you want to take those as conditions, and then have a generative neural network learn somehow how these data and the parameters are related to each other. And this upfront training phase takes quite some time, and usually takes longer than the equivalent MCMC would take, given that you can run MCMC at all.

Now, the cool thing is, as you said, when your neural network is trained, the posterior inference is super fast. If you want to generate posterior samples, there's no approximation anymore, because you've already done all the approximation. So now you're really just doing sampling. That means just generating some random numbers in some latent space and having one pass through the neural network, which is essentially just a series of matrix multiplications. So once you've done this hard part and trained your generative neural network, then actually doing the posterior sampling takes a fraction of a second for 10,000 posterior samples.
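To make the two phases concrete, here is a minimal toy sketch of the idea (my own illustration, not BayesFlow's actual API): a small network is trained once on simulated (parameter, data) pairs to output a Gaussian approximation of the posterior, standing in for the normalizing flows used in practice; afterwards, posterior draws for any new data set cost one forward pass.

```python
# Minimal sketch of the two phases on a toy Gaussian model
# (illustration only, not BayesFlow's actual API).
import torch
import torch.nn as nn

# Toy Bayesian model: theta ~ N(0, 1); y_1..y_10 | theta ~ N(theta, 1).
def simulate(batch_size, n_obs=10):
    theta = torch.randn(batch_size, 1)
    y = theta + torch.randn(batch_size, n_obs)
    return theta, y

# Phase 1 (expensive, done once): train a network mapping a data set to a
# Gaussian approximation of p(theta | y), standing in for a normalizing flow.
net = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)
for step in range(2000):
    theta, y = simulate(128)
    mu, log_sigma = net(y).chunk(2, dim=-1)
    # Maximize log q(theta | y), i.e. minimize the Gaussian negative log-likelihood:
    loss = (log_sigma + 0.5 * ((theta - mu) / log_sigma.exp()) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Phase 2 (cheap, repeated): posterior draws for any new data set are just
# a forward pass plus random numbers in the latent space.
y_obs = torch.full((1, 10), 0.8)
mu, log_sigma = net(y_obs).chunk(2, dim=-1)
samples = mu + log_sigma.exp() * torch.randn(10_000, 1)
# Sanity check: the exact posterior here is N(n * y_bar / (n + 1), 1 / (n + 1)).
```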
Okay, yeah, that's really cool. And how generalizable is your deep neural network then? Because I can see the really cool thing of having a neural network that's customized to each of your models. That's really cool. But at the same time, as you were saying, it's really expensive to train a neural network each time you have to sample a model. And so I was thinking, okay, maybe what you want is to have generalized categories of deep neural networks. That would probably be the next killer feature. But let's say I have a deep neural network for linear regressions. Whether they are generalized or just plain normal likelihood, you would use that deep neural network for linear regressions, and then the inference is super fast, because you only have to train the neural network once, and then posterior inference on the linear regression parameters themselves is super fast. So yeah, that's a long question, but did you get what I'm asking?

Yeah, absolutely. So if I get your question right, you're asking: if you don't want to run linear regression, but want to run some slightly different model, can I still use my pre-trained neural network to do that?

Yes, exactly. And also, yeah, like in general, how does that work? How are you thinking about that? Are there already some best practices, or is it really, for now, cutting-edge research where all the questions are still in the air?
Yeah. So first of all, the general use case for this type of amortized Bayesian inference is usually when your model is fixed, but you have many new data sets. So assume you have some quite complex model where MCMC would take a few minutes to run, for one fixed data set that you actually want to sample from. Now, instead of running MCMC on it, you say, okay, I'm going to train this neural network. So this won't yet be worth it for just one data set. The cool thing is, if you want to keep your actual model, so whatever you write down in PyMC or Stan, fixed, but now plug in different data sets, that's where amortized inference really shines.

So for instance, there was this one huge analysis in the UK where they had intelligence study data from more than 1 million participants. And for each of those participants, they again had a set of observations. And for each of those 1 million participants, they want to perform posterior inference. That means, if you want to do this with something like MCMC or anything non-amortized, you would need to fit one million models. You might argue now, okay, but you can parallelize this across a thousand cores, but still, that's a lot. That's a lot of compute. Now, the cool thing is, the model was the same every single time. You just had a million different data sets. And so what these people did is train a neural network once. It will train for a few hours, of course, but then you can just sequentially feed in all these 1 million data sets, and for each of these 1 million data sets it takes way, way less than one second to generate tens of thousands of posterior samples.

But that didn't really answer your question. Your question was about how we can generalize in the model space. And that's a really hard problem, because essentially what these neural networks learn is to give you some posterior function if you feed in a data set. Now, if you have a domain shift in the model space, so you want inference based on a different model, then this neural network has never learned to do that. So that's tough. That's a hard problem. And essentially what you could do, and what we are currently doing in our research, but that's cutting edge, is expanding the model space. So you would have a very general formulation of a model and then try to amortize over this model, so that different configurations of this model, different variations, could just be extracted as special cases of the model, essentially.
Can you take an example, maybe, to give an idea to listeners how that would work?

Absolutely. We have one preprint about sensitivity-aware amortized Bayesian inference. What we do there is essentially have a kind of multiverse analysis built into the neural network training. To give some background, multiverse analysis basically says: okay, what are all the pre-processing steps that you could take in your analysis? And you encode those. And now you're interested in: what if? What if I had chosen a different pre-processing technique? What if I had chosen a different way to standardize my data? Then also the classical prior sensitivity or likelihood sensitivity analysis: what happens if I do power scaling on my prior, or power scaling on my posterior? So we also encode this. What happens if I bootstrap some of my data, or just have a perturbation of my data? What if I add a bit of noise to my data? These are all slightly different models.

What we do, essentially, is keep track of that during the training phase and encode it into a vector, and say: okay, now we're doing pre-processing choice number seven, scaling the prior to the power of two, not scaling the likelihood, and not doing any perturbation; and feed this as additional information into the neural network. Now, the cool thing is, during the inference phase, once we're done with the training, you can say: hey, here's a data set. Now pretend that we chose pre-processing technique number 11 and prior scaling of power 0.5. What's the posterior now? Because we've amortized over this larger, more general model space, we also get valid posterior inference, if we've trained for long enough, over these different configurations of the model. And essentially, if you were to do this with MCMC, for instance, you would refit your model every single time. And here you don't have to do that.
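As a rough illustration of what "encoding the configuration into a vector" can look like, here is a hedged toy sketch (my own example, not the paper's code) where the simulator also draws a prior power-scaling exponent that the network then conditions on:

```python
# Toy sketch of the sensitivity-aware idea: the simulator draws a
# configuration (here, a prior power-scaling exponent alpha) alongside the
# parameters and data, and the network is conditioned on it during training.
import numpy as np

rng = np.random.default_rng()

def simulate_with_config(n_obs=10):
    alpha = rng.uniform(0.5, 2.0)           # power-scale the prior: p(theta)^alpha
    # For a N(0, 1) prior, the renormalized p(theta)^alpha is N(0, 1/alpha):
    theta = rng.normal() / np.sqrt(alpha)
    y = theta + rng.normal(size=n_obs)      # likelihood left untouched here
    return alpha, theta, y                  # (configuration, parameters, data)

# Training tuples now carry the configuration. At inference time you feed in
# the observed data plus any alpha you like ("pretend the prior was scaled
# by 0.5") and read off the corresponding posterior without refitting.
```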
Okay. Yeah, I see. That's super... yeah, that's super cool. And I feel like the main use cases would be, as you were saying, when you're getting into really high-data territory, and what's changing is mainly the data. And to be even more precise, not really the data set, but the data values, because the data set is supposed to be quite the same; you would have the same columns, for instance, but the values of the columns would change all the time, and the model at the same time doesn't change. Is that, for now at least, the best use case for that kind of method?

Yes. And this might seem like a very niche case, but if you look at Bayesian workflows in practice, this scheme of many model refits doesn't necessarily mean that you have a large number of data sets. It might also just mean you want extensive cross-validation. So assume that you have one data set with 1,000 observations. Now you want to run leave-one-out cross-validation, but for some reason you can't do the Pareto-smoothed importance sampling version, which would be much faster. So you would need 1,000 model refits, even though you just have one data set, because you want 1,000 cross-validation refits.

Maybe can you explain what you mean by cross-validation here? Because that's not a term that's used a lot in the Bayesian framework, I think.

Yeah, of course. So especially in a Bayesian setting, there's this approach of leave-one-out cross-validation, where you would fit your posterior based on all data points but one. And that's why it's called leave-one-out: you take one out, then fit your model, fit your posterior, on the rest of the data. And now you're interested in the posterior predictive performance on this one left-out observation.

Yeah. And that's called cross-validation. Go ahead.

Yeah, no, I'm going to let you finish, but for listeners familiar with the frequentist framework, cross-validation is something that's really heavily used in that framework.

And it's very similar to the machine learning concept of cross-validation. But in the machine learning area, you would rather have something like fivefold or, in general, k-fold cross-validation, where you would have larger splits of your data and then use parts of your whole data set as the training data set and the rest for evaluation. Leave-one-out cross-validation essentially just pushes this to the extreme: everything but one data point is your training data set.
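For listeners who want to see the brute-force version spelled out, here is a small sketch on a conjugate toy model (normal likelihood with known unit variance, standard normal prior on the mean), where each "refit" happens to be analytic. With MCMC, each of the 1,000 refits would be a full sampling run; with an amortized sampler, each one is just another forward pass.

```python
# Brute-force leave-one-out CV on a conjugate toy model.
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(0.7, 1.0, size=1000)

loo_lpd = 0.0
for i in range(len(y)):
    y_train = np.delete(y, i)                  # "refit" without observation i
    n = len(y_train)
    post_mean = y_train.sum() / (n + 1)        # conjugate posterior N(mean, var)
    post_var = 1.0 / (n + 1)
    pred_var = post_var + 1.0                  # posterior predictive variance
    loo_lpd += -0.5 * (np.log(2 * np.pi * pred_var)
                       + (y[i] - post_mean) ** 2 / pred_var)
print("LOO expected log predictive density:", loo_lpd)
```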
Yeah. Okay. Damn, that's super fun. And is there already a way for people to try that out, or is it mainly implemented for papers for now? I'm guessing you're working on that with Aki and all his group in Finland, to make that more open source and help people use packages to do that. What's the state of things here?
Yeah, that's a great question. And in fact, the state of usable open-source software is far behind what we have for likelihood-based, MCMC-based inference. So we currently don't have something that's comparable to PyMC or Stan. Our group is actively developing a software package called BayesFlow. The name: Bayes, because we're doing Bayesian inference, and the first neural network architecture that was used for this amortized Bayesian inference was so-called normalizing flows, conditional normalizing flows to be precise. That's why the name BayesFlow came to be. But now we actually have a bit of a different take, because now we have a whole lot of generative neural networks and not only normalizing flows. So now we can also use, for example, score-based diffusion models, which are mainly used for image generation in AI, or consistency models, which are essentially a distilled version of score-based diffusion models. And so now "flow" doesn't really capture that anymore. What the BayesFlow Python library specializes in is defining principled amortized Bayesian workflows. So the meaning of the name has slightly shifted, to amortized Bayesian workflows, and hence the name BayesFlow.

The focus and aim of BayesFlow are twofold. First, we want a library that's good for actual users. This might be researchers who just say: hey, here's my data set, here's my model, my simulation program, please just give me fast posterior samples. So we want a usable high-level interface with sensible default values that mostly work out of the box, and an interface that's mostly self-explanatory. Also, of course, good teaching material and all this. But that's only one side of the coin, because the other large goal of BayesFlow is that it should be usable for machine learning researchers who want to advance amortized Bayesian inference methods as well. And so the software in general is structured in a very modular way. So for instance, you could just say: hey, take my current pipeline, my current workflow, but now try out a different loss function, because I have a new fancy idea. I want to incorporate more likelihood information, and so I want to alter my loss function. Because of the modular architecture, you could just take the current loss function and replace it with a different one that adheres to the same API. We're trying to do both and serve both interests: the user-friendly side for applied researchers who are currently using BayesFlow, but also the machine learning researchers, with completely different requirements for this piece of software. Maybe we can also put the BayesFlow documentation and the current project website in the notes.
Yeah, we should definitely do that. I'm definitely going to try that out myself. It sounds like fun. I need a use case, but as soon as I have a use case, I'm definitely going to try that out, because it sounds like a lot of fun. Several questions based on that, and thanks a lot for being so clear and so detailed on these. So first, we talked about normalizing flows in episode 98 with Marylou Gabrié; I definitely recommend listeners listen to that for some background. And a question: so BayesFlow, yeah, definitely we need that in the show notes, and I'm going to install that in my environment. And I'm guessing, so you're saying that that's in Python, right? The package?

Yes, the core package is in Python, and we're currently refactoring to Keras. So by the time this podcast episode is aired, we will have a new major release version, hopefully.

Okay, nice.

So you're agnostic to the actual machine learning backend. You can choose TensorFlow, PyTorch, or JAX, whatever integrates best with what you're currently proficient in and what you might be using in other parts of a project.

Okay, that was going to be my question, because I think, while preparing for the episode, I saw that you were mainly using PyTorch. So that was going to be my question: what is that based on? So the backend could be PyTorch, JAX, or...

What did you think the last one was?

TensorFlow.

Yeah, I always forget about all these names. I really know PyTorch, so that's why I forget the other ones. And JAX, of course, for PyMC.
And then, so my question is: what would the workflow look like if you're using BayesFlow? Because you were saying the model, you could write it in standard PyMC or TensorFlow, for instance. Although I don't know if you can write Bayesian models with TensorFlow anymore. Anyways, let's say PyMC or Stan. You write your model, but then the sampling of the model is done with the neural network, so that means, for instance, PyTorch or JAX. How does that work? Do you have to write the model in a JAX-compatible way, or is the translation done by the package itself?

Yeah, that's a great question. It touches on many different topics and considerations, and also on the future roadmap for BayesFlow.
This class of algorithms that are implemented in BayesFlow, these amortized Bayesian inference algorithms, to give you some background there, originally started in simulation-based inference. It's also sometimes called likelihood-free inference. So essentially, it is Bayesian inference when you don't bring a closed-form likelihood function to the table, but instead you only have some generic forward simulation program. So you would just have your prior as some Python function or C++ function, whatever, any function that you could call, and it would return you a sample from the prior distribution. You don't need to write it down in terms of distributions, actually; you only need to be able to sample from it. And then the same for the likelihood. So you don't need to write down your likelihood, like in PyMC or Stan, in terms of a probability distribution, in terms of densities. Instead, it just has to be some simulation program which takes in parameters and then outputs data. What happens between these parameters and the data is not necessarily probabilistic in terms of closed-form distributions. It could also be some non-tractable differential equations. It could be essentially anything.

So for BayesFlow, this means that you don't have to input something like a PyMC or a Stan model, which you write down in terms of distributions; it's just a generic forward model that you can call, and you will get a tuple of a parameter draw and a data set. You'd usually just do it in NumPy. So if I were using BayesFlow, I would write it in NumPy; that would probably be the easiest way. You could probably also write it in JAX or in PyTorch or in TensorFlow or TensorFlow Probability, whatever you want to use behind the scenes. But essentially, all we care about is that the model gives a tuple of parameters, and then data that has been generated from these parameters, for the neural network training process.
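To illustrate what such a generic forward model can look like, here is a hedged NumPy sketch (the function names and the particular process are made up for illustration). Note that no density is ever written down, only a program you can call:

```python
# Sketch of a generic forward model: the "likelihood" is implicit in the
# simulation steps, never written down as a density.
import numpy as np

rng = np.random.default_rng()

def prior():
    # e.g. (drift, noise scale), sampled from uniform priors
    return rng.uniform([0.0, 0.1], [2.0, 1.0])

def simulator(theta, n_steps=100, dt=0.01):
    drift, noise = theta
    x, path = 0.0, []
    for _ in range(n_steps):
        # A stochastic process whose density would be awkward to write down:
        x += drift * dt + noise * np.sqrt(dt) * rng.normal()
        path.append(x)
    return np.array(path)

def sample_tuple():
    theta = prior()
    return theta, simulator(theta)   # one (parameters, data) training pair
```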
That's super fun. Yeah, definitely want to see that. Do you already have some Jupyter notebook examples up on the repo, or are you working on that?

Yeah, currently it's a full-fledged library. It's been under development for a few years now, and we also have an active user base. It's quite small compared to other Bayesian packages, but we're growing it.

Yeah, that's cool.

In the documentation, there are currently, I think, seven or eight tutorial notebooks. And then also for Bayes on the Beach, this conference in Australia that we just talked about earlier, we prepared a workshop. And we're also going to link to this Jupyter notebook in the show notes.

Yeah, definitely, we should link to some of these Jupyter notebooks in the show notes.
And so, I'm thinking, if you're down, you should definitely come back to the show, but for a webinar. I have another format, the modeling webinar, where you would come to the show and share your screen and go through the model code live, and people can ask questions and so on. I've done that already on a variety of things. The last one was about causal inference and propensity scores; the next one is going to be about the Hilbert space GP decomposition. So yeah, if you're down, you should definitely come and do a demonstration of BayesFlow and amortized Bayesian inference. I think that would be super fun and very interesting to people.
Absolutely.

Then, to answer the last part of your question: if you currently have a model that's written down in PyMC or Stan, that's a bit more tricky to integrate, because essentially all we need in BayesFlow are samples from the prior predictive distribution, if you talk in Bayesian terminology. And if your current model can do that, that's fine. That's all you need right now. And then BayesFlow builds on that.

So you can have, like, a PyMC model and just do pm.sample_prior_predictive, save that as a big NumPy multidimensional array, and pass that to BayesFlow?

Yes. All you need are tuples of the ground-truth parameters and the data generated from them. So essentially, the result of your prior call and then the result of your likelihood call with those prior parameters.

So you mean what the likelihood samples look like once you fix the prior parameters to some value?

Yes. In practice, you would just call your prior function, get a sample from the prior, so a parameter vector, and then plug this parameter vector into the likelihood function. And then you get one simulated, synthetic data set. And you just need those two.
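For PyMC users, a minimal sketch of that handoff might look like this (a hedged example with a toy model; since nothing is observed here, both the parameters and the simulated data end up in the prior group of the returned InferenceData):

```python
# Sketch: exporting prior predictive tuples from a toy PyMC model.
import pymc as pm

with pm.Model():
    theta = pm.Normal("theta", 0.0, 1.0)
    y = pm.Normal("y", mu=theta, sigma=1.0, shape=10)
    idata = pm.sample_prior_predictive(draws=5000)

# Both variables live in idata.prior here, since nothing is observed:
thetas = idata.prior["theta"].values.reshape(-1, 1)   # (5000, 1) parameter draws
ys = idata.prior["y"].values.reshape(-1, 10)          # (5000, 10) simulated data sets
# These arrays are exactly the (parameters, data) pairs the network trains on.
```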
Okay, super cool. Yeah, definitely sounds like a lot of fun, and we should definitely do a webinar about that. I'm very excited about that.

Yeah, fantastic.

And so that was one of my main questions on that. The other question is: I'm guessing there are a lot of people working on that, right? Because the roadmap you just talked about is super big. Having a package that's designed for users, but also for researchers, is really a lot of work. So I'm hoping you're not alone doing that.

No, we're currently a team of about a dozen people.

Yeah, that makes sense.

It's an interdisciplinary team. So, a few people with a hardcore software engineering background, some people with a machine learning background, some people from the cognitive sciences, and also a handful of physicists. Because, in fact, these amortized Bayesian inference methods are particularly interesting for physicists, for example, astrophysicists who have these gravitational wave inference problems where they have massive data sets. Running MCMC on those would be quite cumbersome. So if you have this huge stream of incoming data, and you don't have the underlying likelihood density, but just some simulation program that might generate sensible gravitational waves, then amortized Bayesian inference really shines there.

Okay. So that's exactly the case you were talking about, where the model doesn't change, but you have a lot of different data sets.

Yeah, exactly. Because, I mean, what you're trying to run inference on is your physical model, and that doesn't change. I mean, it does, but then again, physicists have a very good understanding and very good models of the world around them. And that's one of the largest differences to people from the cognitive sciences, where, you know, the models of the human brain, for instance, are just such a tough thing to model, and there's so much unknown there, and so much uncertainty in the model-building process.
Yeah, for sure. Okay, yeah, I think I'm starting to understand the idea. And actually, episode 101 was exactly about that: black holes, collisions, gravitational waves. I was talking with LIGO researchers Christopher Berry and John Veitch, and we talked exactly about that, their problem with big data sets. They are mainly using sequential Monte Carlo, but I'm guessing they would also be interested in amortized Bayesian inference. So yeah, Christopher and John, if you're listening, feel free to reach out to Marvin and use BayesFlow. And listeners, this episode will also be in the show notes if you want to give it a listen. That's a really fun one, also learning a lot of stuff about the crazy universe we live in.

Actually, a weird question I have is: why is it called amortized Bayesian inference?

The reason is that we have this two-stage process, where we would first pay upfront with this long neural network training phase. But then, once we're done with this, the cost of the upfront training phase amortizes over all the posterior samples that we can draw within a few milliseconds.

That makes sense.
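As a back-of-the-envelope illustration of that amortization (all numbers below are made up for the example):

```python
# If training once costs T, each amortized posterior costs f, and each MCMC
# fit costs m, then training pays off after T / (m - f) data sets.
T, f, m = 2 * 3600, 0.1, 5 * 60   # 2 h training, 0.1 s per forward pass, 5 min per MCMC fit
print(T / (m - f))                # ~24 data sets; with a million, it's a no-brainer
```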
And so I think something you're also working on is something called deep fusion, and you do that in particular for multimodal simulation-based inference. How is that related to amortized Bayesian inference, if at all? And what is it about?

I'm going to answer these two questions in reverse order. So first, about the relation between simulation-based inference and amortized Bayesian inference. To give you a bit of history there, simulation-based inference is essentially Bayesian inference based on simulations, where we don't assume that we have access to a likelihood density, but instead we just assume that we can sample from the likelihood, essentially simulate from the model. In fact, the likelihood is still present, but it's only implicitly defined, and we don't have access to the density. That's why "likelihood-free inference" doesn't really capture what's happening here. Instead, in recent years people have started adopting the term simulation-based inference, because we do Bayesian inference based on simulations instead of likelihood densities.

Methods that have been used for quite a long time now in the simulation-based inference research area are, for example, rejection ABC, so approximate Bayesian computation, or then ABC-SMC, combining ABC with sequential Monte Carlo. Essentially, the next iteration there was throwing neural networks at simulation-based inference. That's exactly this neural posterior estimation that I talked about earlier. And what researchers noticed is: hey, when we train a neural network for simulation-based inference, instead of running rejection approximate Bayesian computation, we get amortization for free, as a side product. It's just a by-product of using a neural network for simulation-based inference. And so in the last maybe four to five years, people have mainly focused on this algorithm called neural posterior estimation for simulation-based inference. And all the developments that happened there, almost all the research, sorry, focused on cases where we don't have any likelihood density. So we're purely in the simulation-based case.

Now, with our view of things, coming from a Bayesian inference, likelihood-based setting, we can say: hey, amortization is not just a random, coincidental byproduct; it's a feature, and we should focus on this feature. And so what we're currently doing is moving this idea of amortized Bayesian inference with neural networks back into a likelihood-based setting. So we've started using likelihood information again: for example, using likelihood densities if they're available, or learning information about the likelihood, like a surrogate model, on the fly, and then again using this information for better posterior inference. So we're essentially bridging simulation-based inference and likelihood-based Bayesian inference, with the larger goal of amortization if we can do it.
And this work on deep fusion essentially addresses one huge shortcoming of neural networks when we want to use them for amortized Bayesian inference, and that is situations where we have multiple different sources of data. So for example, imagine you're a cognitive scientist and you run an experiment with subjects, and for each test subject you give them a decision-making task. But at the same time, while your subjects solve the decision-making task, you wire them up with an EEG to measure brain activity. So for each subject, across maybe 100 trials, what you now have is both an EEG and the data from the decision-making task.

Now, if you want to analyze this with PyMC or Stan, what you would just do is say: hey, we have two data-generating processes that are governed by a set of shared parameters. So the first part of the likelihood would just be this Wiener process for the decision-making task, where you model the reaction times, a fairly standard procedure in the cognitive sciences. And then we have a second part of the likelihood that we evaluate, which somehow handles these EEG measurements: for example, a spatio-temporal process, or just some summary statistics that are being computed there, however you would usually process your EEG. Then you add both to the log PDF of the likelihood, and then you can call it a day.

You cannot do that with neural networks, because you have no straightforward, sensible way to combine these reaction times from the decision-making task and the EEG data. You cannot just take them and slap them together. They are not compatible with each other, because these data sources are heterogeneous. So you somehow need a way to fuse these sources of information, so that you can then feed them into the neural network.
That's essentially what we're studying in this paper, where you can get very creative and have different schemes to fuse the data. So you could use these attention schemes that are very hip in large language models right now, with transformers essentially, and have these different data sources attend, or listen, essentially, to each other. With cross-attention, you could let the EEG data inform your decision-making data, or have the decision-making data inform the EEG data. So you can get very creative there. You could also just learn some representation of both individually, then concatenate them and feed them to the neural network. Or you could do very creative and weird mixes of all those approaches. And in this paper, we essentially have a systematic investigation of these different options. And we find that the most straightforward option works the best overall, and that's just learning fixed-size embeddings of your data sources individually, and then just concatenating them. It turns out we can then use information from both sources in an efficient way, even though we're doing inference with neural networks.
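A minimal sketch of that winning "embed each source, then concatenate" scheme might look like this (shapes and layer sizes are illustrative, not the paper's architecture):

```python
# Toy sketch of late fusion: embed each heterogeneous source separately,
# then concatenate the fixed-size embeddings.
import torch
import torch.nn as nn

class LateFusion(nn.Module):
    def __init__(self, dim_rt=100, dim_eeg=500, dim_emb=32):
        super().__init__()
        self.embed_rt = nn.Sequential(nn.Linear(dim_rt, 64), nn.ReLU(),
                                      nn.Linear(64, dim_emb))   # reaction times
        self.embed_eeg = nn.Sequential(nn.Linear(dim_eeg, 64), nn.ReLU(),
                                       nn.Linear(64, dim_emb))  # EEG features
        self.head = nn.Linear(2 * dim_emb, 16)  # feeds the posterior network

    def forward(self, rt, eeg):
        z = torch.cat([self.embed_rt(rt), self.embed_eeg(eeg)], dim=-1)
        return self.head(z)                     # fused fixed-size summary

fused = LateFusion()(torch.randn(8, 100), torch.randn(8, 500))  # shape (8, 16)
```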
And maybe what's interesting for practitioners is that we can compensate for missing data in individual sources. In the paper, we essentially induced missing data by taking these EEG data and decision-making data and randomly dropping some of them. And when we do this fusion process, the neural networks learn to compensate for partial missingness in both sources. So if you remove some of the decision-making data, the neural networks learn to use the EEG data to inform your posterior. Even though the data in one of the sources are missing, the inference is pretty robust. And again, all this happens without model refits. You would just account for that during training; of course, you have to do this random dropping of data during the training phase as well, and then you also get it during the inference phase.

Yeah, that sounds... yeah, that's really cool.

Maybe, as a small piece of this paper in our larger roadmap: this is essentially about taking this amortized Bayesian inference up to the level of trustworthiness and robustness and all these gold standards that we currently have for likelihood-based inference in PyMC or Stan. And there's still a lot of work to do, because, of course, there's no free lunch, and of course there are many problems with trustworthiness. And that's also one of the reasons why I'm here with Aki right now, because Aki is so great at Bayesian workflow and trustworthiness, good diagnostics. Those are, you know, all the things that we currently still need for trustworthy amortized Bayesian inference.

Yeah. So maybe you want to talk a bit more about that and what you're doing on that? That sounds like something very interesting.
So one huge advantage of an amortized Bayesian sampler is that evaluations and diagnostics are extremely cheap. For example, there's this gold-standard method called simulation-based calibration, where you would sample from your model, from your prior predictive space, then refit your model and look at your coverage, for instance; in general, look at the calibration of your model on this potentially very large prior predictive space. So you naturally need many model refits, but your model is fixed. If you do it with MCMC, it's a gold-standard evaluation technique, but it's very expensive to run, especially if your model is complex. Now, if you have an amortized estimator, simulation-based calibration on thousands of data sets takes a few seconds.

So essentially, and that's my goal for this research visit with Aki here in Finland, I'm trying to figure out which diagnostics are gold standard but potentially very expensive, up to a point where it's infeasible to run them on a larger scale with MCMC, but which we can easily run with an amortized estimator, with the goal of figuring out: can we trust this estimator, yes or no? Because, as you might know, with neural networks we just have no idea what's happening inside the neural network. And so we currently don't have the strong diagnostics that we have for MCMC, like, for example, R-hat. There's no comparable thing for neural networks. So one of my goals here is to come up with more good diagnostics that are either possible with MCMC but very expensive, so we don't run them, while they would be very cheap with an amortized estimator; or, the second thing, specific to an amortized estimator, just like R-hat is specific to MCMC.
800
Okay.
801
Yeah, I see.
802
Yeah, that makes tons of sense.
803
well.
804
And actually, so I would have more
technical questions on these, but I see
805
the time running out.
806
I think something I'm mainly curious about
is the challenges, the biggest challenges
807
you face when applying amortized spatial
inference and diffusion techniques in your
808
projects, but also like in the projects
you see.
809
I think that's going to also give a sense
to listeners of when and where to use
810
these kinds of methods.
811
That's a great question. And I'm more than happy to talk about all these challenges that we have, because there's so much room for improvement. These amortized methods have so much potential, but we still have a long way to go until they are as usable and as straightforward to use as current MCMC samplers.

In general, one challenge for practitioners is that we have most of the problems and hardships that we have in PyMC or Stan. And that is that researchers have to think about their model in a probabilistic way, in a mechanistic way. So instead of just saying, hey, I click on t-test or linear regression in some graphical user interface, they actually have to come up with a data-generating process and have to specify their model. And this whole topic of model specification is just the same in an amortized workflow, because some way we need to specify the Bayesian model.

And now, on top of all this, we have a huge additional layer of complexity, and that is defining the neural networks. In amortized Bayesian inference, nowadays we have two neural networks. The first one is a so-called summary network, which essentially learns a latent embedding of the data set. Essentially, those are like optimal learned summary statistics, and optimal doesn't mean that they have to be optimal to reconstruct the data; instead, optimal means they're optimal to inform the posterior. For example, in a very, very simple toy model, if you have just a Gaussian model and you just want to perform inference on the mean, then a sufficient summary statistic for posterior inference on the mean would be the mean. Because that's all you need to reconstruct the mean; it sounds very tautological, but yeah. Then again, the mean is obviously not enough to reconstruct the data, because all the variance information is missing. What the summary network learns is something like the mean: summary statistics that are optimal for posterior inference.

And then the second network is the actual generative neural network, like a normalizing flow, a score-based diffusion model, a consistency model, flow matching, whatever conditional generative model you want. And this will handle the sampling from the posterior. And these two networks are learned end to end. So you would learn your summary statistic, output it, feed it into the posterior network, the generative model, and then have one evaluation of the loss function and optimize both end to end. So we have two neural networks, long story short, which is substantially harder than just hitting sample on a PyMC or Stan program. And that's an additional hardship for practitioners.
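Here is a toy end-to-end sketch of that two-network setup (my own simplification, not BayesFlow's API): a mean-pooling "deep set" summary network for exchangeable data, plus a Gaussian posterior head standing in for the conditional generative model, trained with a single loss:

```python
# Toy end-to-end sketch: summary network + posterior network, one loss.
import torch
import torch.nn as nn

class SummaryNet(nn.Module):                      # respects exchangeability
    def __init__(self, dim_sum=8):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, dim_sum))
    def forward(self, y):                         # y: (batch, n_obs, 1)
        return self.phi(y).mean(dim=1)            # pooled learned summary statistics

summary_net = SummaryNet()
posterior_net = nn.Linear(8, 2)                   # outputs (mu, log_sigma) for theta
opt = torch.optim.Adam([*summary_net.parameters(), *posterior_net.parameters()], lr=1e-3)

for step in range(2000):                          # end-to-end training
    theta = torch.randn(128, 1)                   # prior draws
    y = theta.unsqueeze(1) + torch.randn(128, 50, 1)   # simulated data sets
    mu, log_sigma = posterior_net(summary_net(y)).chunk(2, dim=-1)
    loss = (log_sigma + 0.5 * ((theta - mu) / log_sigma.exp()) ** 2).mean()
    opt.zero_grad()
    loss.backward()                               # one loss updates both networks
    opt.step()
```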
Now, in BayesFlow, what we do is provide sensible default values for the generative neural networks, which work in maybe 80 or 90% of the cases. It's often sufficient to have, for example, a neural spline flow, some sort of normalizing flow, with, I don't know, six layers and a certain number of units, some regularization for robustness, cosine decay of the learning rate; all these machine learning parts we try to take away from the user if they don't want to mess with them. But still, if things don't work, they would need to somehow diagnose the problems and then play with the number of layers and the neural network architecture.

And then for the summary network: the summary network essentially needs to be informed by the data. So if you have time series, you would look at something like an LSTM, these long short-term memory time series neural networks, or a recurrent neural network, or, nowadays, a time series transformer; they're also called temporal fusion transformers. If you have IID data, you would have something like a deep set or a set transformer, which respects the exchangeable structure of the data. So again, we can give all the recommendations and sensible default values, like: if you have a time series, try a time series transformer. Then again, if things don't work out, users need to play around with these settings. So that's definitely one hardship of amortized Bayesian inference in general.
And for the second part of your question,
hardships of this deep fusion.
882
It's essentially if you have more and more
information sources, then things can get
883
very complicated.
884
Example, just a few days ago, we discussed
about a
885
case where someone has 60 different
sources of information and they're all
886
streams of time series.
887
Now we could say, hey, just slap 60
summary networks on this problem, like one
888
summary network for each domain.
889
That's going to be very complex and very
hard to train, especially if we don't
890
bring that many data sets to the table for
the neural network training.
891
And so there we somehow need to find a
compromise.
892
Okay, what information can we condense and
group together?
893
So maybe some of the time series sources
are somewhat similar and actually
894
compatible with each other.
895
So we could, for example, come up with six
groups of 10 time series each.
896
Then we would only need six neural
networks for the summary embeddings and
897
all these practical considerations.
898
That makes things just like as hard as in
likelihood based MCMC based inference, but
899
just a bit harder because of all the
neural network stuff that's happening.
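As a sketch of that compromise: six summary networks, one per group of similar series, fused by concatenating their embeddings into a single condition vector. The group count, the GRU summaries, and concatenation-based fusion are all illustrative choices, not a prescribed recipe:

```python
import torch
import torch.nn as nn

# Six GRU-based summary networks, one per group of 10 similar time
# series; their final hidden states are concatenated into one
# condition vector for the posterior network. Shapes are illustrative.
group_nets = nn.ModuleList([
    nn.GRU(input_size=10, hidden_size=16, batch_first=True)
    for _ in range(6)
])

def fuse(groups):  # groups: list of 6 tensors, each (batch, T, 10)
    embeddings = []
    for net, g in zip(group_nets, groups):
        _, h = net(g)              # final hidden state as the group summary
        embeddings.append(h[-1])   # (batch, 16)
    return torch.cat(embeddings, dim=-1)  # (batch, 96) condition vector

groups = [torch.randn(8, 1000, 10) for _ in range(6)]
condition = fuse(groups)
```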
Did this address your question?

Yeah, it gives me more questions, but for sure, that does answer the question.
When you're talking about a transformer for time series, are you talking about the transformers, the neural networks used in large language models, or is it something else?

It's essentially the same, but slightly adjusted for time series, so that the summary statistics or latent embeddings you output still respect the time series structure, where typically you would have this autoregressive structure. So it's not exactly the same as a standard transformer; you would enrich it to respect the probabilistic structure in your data. But at the core, it's just the same: an attention mechanism, like multi-head attention, where the different parts of your dataset can essentially talk or listen to each other.
Okay, that's interesting. I didn't know that existed for time series. That means, because one of the main things with a transformer is that you have to tokenize the inputs, right? So is there a tokenization happening on the time series data here?

You don't have to tokenize here. The reason you have to tokenize in large language models, or natural language processing in general, is that you want to somehow encode your characters or words into numbers. And we don't need that in Bayesian inference, because our data already comes as numbers, so we don't need tokenization here. Of course, if we had text data, then we would need tokenization.
Yeah, OK, it makes more sense to me now. All right, that's fun; I didn't know that existed. Do you have any resources about transformers for time series that we could put in the show notes?

Absolutely. There is a paper called Temporal Fusion Transformers, I think. I will send you the link.

Awesome, thanks.
Definitely. We have this time series transformer, the temporal fusion transformer, implemented in BayesFlow. So now it's a very usable interface: you just input your data and you get your latent embeddings. You can say, I want to input my data and get 20 learned summary statistics as the output. That's all you need to do there.
Okay. And you can go crazy with it. So what would you do with these results, basically the outputs of the transformer? What would you use them for?

Those are the learned summary statistics, which you would then treat as a compressed, fixed-length version of your data for the posterior network, this generative model.
So then you use that afterwards in the model?

Exactly. The transformer is just used to learn summary statistics of the data sets we input. For instance, we did this for COVID time series. If you have a COVID time series for, say, a three-year period with daily reporting, you would have a time series with about a thousand time steps. That's quite long to pass into a neural network as a condition. And also, if you don't have a thousand days but a thousand and one days, then the length of your input to the neural network would change, and the neural network couldn't handle that. So what you do with a time series transformer is compress this time series of maybe 1,000 or maybe 1,050 time steps into a fixed-length vector of summary statistics. Maybe you extract 200 summary statistics from that.
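A sketch of that compression step, using a transformer encoder followed by mean pooling; positional encoding is omitted for brevity, and the architecture and sizes are illustrative rather than BayesFlow's implementation:

```python
import torch
import torch.nn as nn

# Variable-length daily series in, fixed-length summary vector out.
embed = nn.Linear(1, 64)              # scalar observations -> 64-dim tokens
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
    num_layers=2,
)
head = nn.Linear(64, 200)             # 200 learned summary statistics

def summarize(x):                     # x: (batch, T, 1), any length T
    tokens = encoder(embed(x))        # (batch, T, 64)
    return head(tokens.mean(dim=1))   # (batch, 200), fixed length

short = summarize(torch.randn(4, 1000, 1))  # 1,000 daily observations
long = summarize(torch.randn(4, 1050, 1))   # 1,050 days: same output shape
```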
Okay, I see. And then you can use that in your neural network, in the model that's going to be sampling your model.

In the neural network that's going to be sampling your model, yes. We already see that we're heavily overloading terminology here. So what's a model, actually? We have to differentiate between the actual Bayesian model that we're trying to fit, and the neural network, the generative model or generative neural network, that we're using as a replacement for MCMC. There's a lot of taxonomy that's odd when you're at the interface of deep learning and statistics. Another one of those hiccups is parameters: in Bayesian inference, parameters are your inference targets.
So you want posterior distributions on a handful of model parameters. When you talk to people from deep learning about parameters, they understand the neural network weights. So sometimes I have to be careful with the terminology and the words used to describe things, because we have different types of people operating at different levels of abstraction here, in different functions.
Yeah, exactly. So that means, in this case, the transformer takes in the time values and summarizes them, and it passes that on to the neural network that's going to be used to sample the Bayesian model.

Exactly. And they are passed in as the conditions, as in conditional probability, which totally makes sense, because this generative neural network learns the distribution of parameters conditional on the data, or on summary statistics of the data. And that's the exact definition of the Bayesian posterior distribution: a distribution of the Bayesian model parameters conditional on the data.
Yeah, I see. And that means... so in this case, I think my question was going to be: why would you use this kind of additional layer on the time series data? But you already answered that. It's: well, what if your time series data is too big, or something like that?

Exactly. It's not just about being too big, but also about variable length. Because the generative neural network always wants fixed-length inputs. In the case of the COVID model, it could only handle input conditions of length 200. And the time series transformer handles the part where our actual raw data have variable length, because time series transformers can handle data of variable length. So they would just take a time series of maybe 500 to 2,000 time steps and always compress it to 200 summary statistics. So this generative neural network, which is much more strict about the shape and form of its input data, will always see inputs of the same length.
Yeah, I see. That makes sense. Awesome, super cool. And so, as you were saying, this is already available in BayesFlow; people can use this kind of transformer for time series.

Yeah, absolutely. For time series and also for sets, so for IID data. Because if you just take an IID data set and input it into a neural network, the neural network doesn't know that your observations are exchangeable. So it will assume much more structure than there actually is in your data. So again, the summary network has a dual function: compressing data, encoding the probabilistic structure of the data, and outputting a fixed-length representation. That would be a set transformer, or a deep set is another option. Both are also implemented in BayesFlow.
Super cool. And so let's start winding down here, because I've already taken a lot of your time. Maybe one of the last few questions would be: what are some emerging topics within deep learning and probabilistic machine learning that you find particularly intriguing? Because we've talked here a lot about the nitty-gritty, the statistical details and so on; but now let's zoom out a bit and think more long-term.
Yeah. I'm very excited about two large topics. The first one is generative models that are very expressive, so unconstrained neural network architectures, but that at the same time have one-step inference. For example, people have been using score-based diffusion models and flow matching a lot for image generation, like Stable Diffusion. You might be familiar with this tool: you input a text prompt and you get fantastic images. Now, this takes quite some time, a few seconds for each image, and only because it runs on a fancy cluster; if you run it locally on your computer, it takes much longer. And that's because the score-based diffusion model needs many discretization steps in this denoising process during inference time. Throughout the last year, there have been a few attempts at having these very expressive and super powerful neural networks that are much, much faster, because they don't have these many denoising steps. Instead, they directly learn a one-step inference. So they can generate an image not in a thousand steps, but in only one step. That's very cutting edge, or bleeding edge if you will, because they don't work that great yet. But I think there's much potential in there: it's both expressive and fast. And then again, we've used some of those for amortized Bayesian inference. We used consistency models, and they have super high potential, in my opinion. With these advances in deep learning, oftentimes we can use them for amortized Bayesian inference; we just reformulate these generative models and slightly tune them to our tasks. So I'm very excited about this.
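Schematically, the contrast looks like this; both models below are hypothetical stand-ins, and real diffusion samplers involve noise schedules and stochastic terms that are omitted here:

```python
import torch

# Many-step denoising (diffusion-style): the network is called once per
# discretization step, starting from pure noise.
def diffusion_sample(denoiser, shape, n_steps=1000):
    x = torch.randn(shape)
    for t in reversed(range(n_steps)):  # many small denoising steps
        x = denoiser(x, t)
    return x

# One-step generation (consistency-style): noise maps to a sample in a
# single network call.
def consistency_sample(one_step_model, shape):
    return one_step_model(torch.randn(shape))

denoiser = lambda x, t: 0.999 * x                   # dummy stand-in
one_step_model = lambda z: 2.0 * z                  # dummy stand-in
slow = diffusion_sample(denoiser, (4, 2))           # 1,000 network calls
fast = consistency_sample(one_step_model, (4, 2))   # a single call
```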
And the second area I'm very excited about is foundation models. I guess most people in AI are these days. Foundation models build on the fact that neural networks are very good at in-distribution tasks. Whatever is in the training data set, neural networks are typically very good at finding patterns similar to what they saw in the training set. Now, in the open world, if we are out of distribution, we have a domain shift, distribution shift, model misspecification, however you want to call it, and neural networks typically aren't that good there. So what we could do is either make them slightly better out of distribution, or just extend the in-distribution region to a huge space. And that's what foundation models do. For example, GPT-4 would be a foundation model, because it's just trained on so much data; I don't know how much, it's not terabytes anymore, it's essentially the entire internet. So the training set that this neural network has been trained on is just huge, and essentially we don't really have out-of-distribution cases anymore, just because the training set is so huge. That's also one area that could be very useful for amortized Bayesian inference, to overcome the very initial shortcoming that you talked about, where we would also like to amortize over different Bayesian models.
Hmm, I see. Yeah, that would definitely be super fun. I'm really impressed and interested to see this interaction of deep learning, artificial intelligence, and then the Bayesian framework coming on top of that. That is really super cool; I love that. It makes me super curious to try that stuff out. So to play us out, Marvin: this is a very active area of research, so what advice would you give to beginners interested in diving into this intersection of deep learning and probabilistic machine learning?
That's a great question. Essentially, I have two recommendations. The first one is to really try to simulate stuff. Whatever it is that you are curious about, just try to write a simulation program and simulate some of the data that you might be interested in. For example, if you're really interested in soccer, then code up a simulation program that simulates soccer matches and their outcomes. That way you can really get a feeling for the data generating processes that are happening, because probabilistic machine learning at its very core is all about data generating processes and reasoning about those processes.
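A toy version of that exercise, assuming a deliberately simple data generating process with made-up Poisson scoring rates:

```python
import numpy as np

# Simulate soccer match outcomes: goals per team drawn from Poisson
# distributions with team-specific scoring rates (rates are made up).
rng = np.random.default_rng(42)

def simulate_match(rate_home=1.5, rate_away=1.1):
    return rng.poisson(rate_home), rng.poisson(rate_away)

results = [simulate_match() for _ in range(10_000)]
home_wins = sum(h > a for h, a in results) / len(results)
print(f"Simulated home win probability: {home_wins:.2f}")
```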
And I think it was Richard Feynman who said: what I cannot create, I do not understand. That's essentially at the heart of simulation-based inference in a narrow setting, of probabilistic machine learning more broadly, or even of science more broadly. So yeah, simulating and running simulation studies can be super helpful, both to understand what's happening in the background and to get a feeling for programming, and to get better at programming as well. The second piece of advice would be to find a balance between these hands-on, getting-your-hands-dirty kinds of things, like implementing a model in PyTorch or Keras, or solving some Kaggle tasks, just some machine learning tasks; but at the same time also reading books and finding new information, to make sure that you actually know what you're doing, know what you don't know, and know what the next steps are to get better on the theoretical side.
And there are two books that I can really recommend. The first one is Deep Learning by Ian Goodfellow. It's also available for free online; we can link to it in the show notes. It's a great book, and it covers so much. And if you come from a Bayesian or statistics background, you'll see a lot of conditional probabilities in there, because a lot of deep learning is just conditional generative modeling. And the second book would in fact be Statistical Rethinking by Richard McElreath. It's a great book, and it's not only about Bayesian inference; there's also a lot of causal inference, of course, and just thinking about probability and the philosophy behind this whole probabilistic modeling topic more broadly.
Earlier today, I had a chat with one of the student assistants I'm supervising, and he said: hey Marvin, I read Statistical Rethinking a few weeks ago, and today I read something about score-based diffusion models, these state-of-the-art deep learning models used to generate images. He said: because I read Statistical Rethinking, it all made sense; there's so much probability going on in these score-based diffusion models, and Statistical Rethinking really helped me understand that. At first I couldn't believe it, but it totally makes sense, because Statistical Rethinking is not just a book about Bayesian workflow and Bayesian modeling, but more about reasoning about probabilities and uncertainty in a more general way. And it's a beautiful book. So I'd recommend those two.
Nice. Yeah, definitely, let's put those two in the show notes, Marvin. Of course, I've read Statistical Rethinking several times, so I definitely agree. The first one, about deep learning, I haven't read yet, but I definitely will, because that sounds really fascinating; I really want to get that book. Fantastic. Well, thanks a lot, Marvin, that was really awesome. I really learned a lot, and I'm pretty sure listeners did too, so that's super fun. You definitely need to come back to do a modeling webinar with us and show us in action what we talked about today with the BayesFlow package. I guess it's also going to inspire people to use it and maybe contribute to it. But before that, of course, I'm going to ask you the last two questions I ask every guest at the end of the show. First one: if you had unlimited time and resources, which problem would you try to solve?
That's a very loaded question, because there are so many very, very important problems to solve. Big-picture problems like peace, world hunger, global warming, all those. I'm afraid that with my background, I don't really know how to contribute significantly, with a huge impact, to those problems. So my consideration is essentially a trade-off between how important the problem is, what impact solving or addressing the problem would have, and what impact I could have on solving it. And so I think what would be very nice is to make probabilistic inference, or Bayesian inference more particularly, accessible, usable, easy, and fast for everyone. And that doesn't just mean methods and machine learning researchers; it essentially means anyone who works with data in any way. And there's so much to do. The actual Bayesian model in the background could be huge, like a BayesGPT: like ChatGPT, but just for Bayes, with the sheer scope of amortization, different models, different settings, and so on. So that's a huge, huge challenge on the backend side. But then on the frontend and API side, there are also many different subproblems, because it would mean people could just write down a description of their model in plain natural language, like with a large language model, and not actually specify everything by programming. Maybe also just sketch out some data, like expert elicitation, and all those different topics. I think there's this bigger picture where thousands of researchers worldwide are working on so many niche topics, but having this overarching BayesGPT kind of thing would be really cool. So I'd probably choose that to work on. It's a very risky thing, so that's why I'm not currently working on it.
Yeah, I love that. That sounds awesome. Feel free to cooperate and collaborate with me on that; I would definitely be down. That sounds absolutely amazing. So send me an email when you start working on that. I'll be happy to join the team.
And second question: if you could have dinner with any great scientific mind, dead, alive, or fictional, who would it be?

Again, a very loaded question, and a super interesting one. I mean, there are two big choices. I could either go with someone who's currently alive, where I feel like I want their take on the current state of the art, future directions, and so on. Or the second big option, which I guess many people would go with, is someone who's been dead for two or three centuries. And I think I'd go with the second choice, so really take someone from way back in the past. And that's for two reasons. Of course, speaking to today's scientists is super interesting and I would love to do that, but they have access to all the state-of-the-art technology and they know about all the latest advancements. So if they come up with some groundbreaking creative ideas to share, they can just implement them and make them actionable. And the second reason is that today's scientists have a huge platform because they're on the internet. So if they really want to express an idea, they can just do it on Twitter or wherever. There are other ways to engage with them apart from, you know, having a magical dinner, right? So I would choose someone from the past, and in particular, I think Ada Lovelace would be super interesting for me to talk to. Essentially because she's widely considered the first programmer, and the craziest thing about that is that she never had access to anything like a modern computer. So she wrote the first program, but the machine wasn't there yet. That's such a huge leap of creativity and genius. And I'd really be interested in: if Ada Lovelace saw what's happening today, all the technology that we have with generative AI, GPU clusters, and all these possibilities, what's the next leap forward? What's today's equivalent of writing the first program without having the computer? I'd really love to know this answer, and there's currently no other way to get it except for your magical dinner invitation. So that's why I go with this option.
Awesome, I love it. That definitely sounds like a marvelous dinner. Awesome. Thanks a lot, Marvin, that was really a blast. I'm going to let you go now, because you've been talking for a long time and I'm guessing you need a break. But that was really amazing, so thanks a lot for taking the time. Thanks again to Matt Rosinski for this awesome recommendation; I hope you loved it, Marvin, and also Matt. Me, I did, so that was really awesome. As usual, I'll put resources and a link to your website in the show notes, and Marvin is going to add material for those who want to dig deeper. Thank you again, Marvin, for taking the time and being on this show.

Thank you very much for having me, Alex. I appreciate it.
This has been another episode of Learning Bayesian Statistics. Be sure to rate, review, and follow the show on your favorite podcatcher, and visit learnbayesstats.com for more resources about today's topics, as well as access to more episodes to help you reach a true Bayesian state of mind. That's learnbayesstats.com. Our theme music is Good Bayesian by Baba Brinkman, feat. MC Lars and Mega Ran. Check out his awesome work at bababrinkman.com. I'm your host, Alex Andorra. You can follow me on Twitter at alex_andorra, like the country. You can support the show and unlock exclusive benefits by visiting patreon.com/learnbayesstats. Thank you so much for listening and for your support.
You're truly a good Bayesian, change your predictions after taking information in. And if you're thinking I'll be less than amazing, let's adjust those expectations. Let me show you how to be a good Bayesian, change calculations after taking fresh data in. Those predictions that your brain is making, let's get them on a solid foundation.