Proudly sponsored by PyMC Labs, the Bayesian Consultancy. Book a call, or get in touch!
In this episode, Marvin Schmitt introduces the concept of amortized Bayesian inference, where the upfront training phase of a neural network is followed by fast posterior inference.
Marvin will guide us through this new concept, discussing his work in probabilistic machine learning and uncertainty quantification, using Bayesian inference with deep neural networks.
He also introduces BayesFlow, a Python library for amortized Bayesian workflows, and discusses its use cases in various fields, while also touching on the concept of deep fusion and its relation to multimodal simulation-based inference.
A PhD student in computer science at the University of Stuttgart, Marvin is supervised by two LBS guests you surely know — Paul Bürkner and Aki Vehtari. Marvin’s research combines deep learning and statistics, to make Bayesian inference fast and trustworthy.
In his free time, Marvin enjoys board games and is a passionate guitar player.
Our theme music is « Good Bayesian », by Baba Brinkman (feat MC Lars and Mega Ran). Check out his awesome work at https://bababrinkman.com/ !
Thank you to my Patrons for making this episode possible!
Yusuke Saito, Avi Bryant, Ero Carrera, Giuliano Cruz, Tim Gasser, James Wade, Tradd Salvo, William Benton, James Ahloy, Robin Taylor, Chad Scherrer, Zwelithini Tunyiswa, Bertrand Wilden, James Thompson, Stephen Oates, Gian Luca Di Tanna, Jack Wells, Matthew Maldonado, Ian Costley, Ally Salim, Larry Gill, Ian Moran, Paul Oreto, Colin Caprani, Colin Carroll, Nathaniel Burbank, Michael Osthege, Rémi Louf, Clive Edelsten, Henri Wallen, Hugo Botha, Vinh Nguyen, Marcin Elantkowski, Adam C. Smith, Will Kurt, Andrew Moskowitz, Hector Munoz, Marco Gorelli, Simon Kessell, Bradley Rode, Patrick Kelley, Rick Anderson, Casper de Bruin, Philippe Labonde, Michael Hankin, Cameron Smith, Tomáš Frýda, Ryan Wesslen, Andreas Netti, Riley King, Yoshiyuki Hamajima, Sven De Maeyer, Michael DeCrescenzo, Fergal M, Mason Yahr, Naoya Kanai, Steven Rowland, Aubrey Clayton, Jeannine Sue, Omri Har Shemesh, Scott Anthony Robson, Robert Yolken, Or Duek, Pavel Dusek, Paul Cox, Andreas Kröpelin, Raphaël R, Nicolas Rode, Gabriel Stechschulte, Arkady, Kurt TeKolste, Gergely Juhasz, Marcus Nölke, Maggi Mackintosh, Grant Pezzolesi, Avram Aelony, Joshua Meehl, Javier Sabio, Kristian Higgins, Alex Jones, Gregorio Aguilar, Matt Rosinski, Bart Trudeau, Luis Fonseca, Dante Gates, Matt Niccolls, Maksim Kuznecov, Michael Thomas, Luke Gorrie, Cory Kiser, Julio, Edvin Saveljev, Frederick Ayala, Jeffrey Powell, Gal Kampel, Adan Romero, Will Geary and Blake Walters.
Visit https://www.patreon.com/learnbayesstats to unlock exclusive Bayesian swag 😉
Takeaways:
- Amortized Bayesian inference combines deep learning and statistics to make posterior inference fast and trustworthy.
- Bayesian neural networks can be used for full Bayesian inference on neural network weights.
- Amortized Bayesian inference decouples the training phase and the posterior inference phase, making posterior sampling much faster.
- BayesFlow is a Python library for amortized Bayesian workflows, providing a user-friendly interface and modular architecture.
- Self-consistency loss is a technique that combines simulation-based inference and likelihood-based Bayesian inference, with a focus on amortization.
- The BayesFlow package aims to make amortized Bayesian inference more accessible and provides sensible default values for neural networks.
- Deep fusion techniques allow for the fusion of multiple sources of information in neural networks.
- Generative models that are expressive and have one-step inference are an emerging topic in deep learning and probabilistic machine learning.
- Foundation models, which have a large training set and can handle out-of-distribution cases, are another intriguing area of research.
Chapters:
00:00 Introduction to Amortized Bayesian Inference
07:39 Bayesian Neural Networks
11:47 Amortized Bayesian Inference and Posterior Inference
23:20 BayesFlow: A Python Library for Amortized Bayesian Workflows
38:15 Self-consistency loss: Bridging Simulation-Based Inference and Likelihood-Based Bayesian Inference
41:35 Amortized Bayesian Inference
43:53 Fusing Multiple Sources of Information
45:19 Compensating for Missing Data
56:17 Emerging Topics: Expressive Generative Models and Foundation Models
01:06:18 The Future of Deep Learning and Probabilistic Machine Learning
Links from the show:
- Marvin’s website: https://www.marvinschmitt.com/
- Marvin on GitHub: https://github.com/marvinschmitt
- Marvin on LinkedIn: https://www.linkedin.com/in/marvin-schmitt/
- Marvin on Twitter: https://twitter.com/MarvinSchmittML
- The BayesFlow package for amortized Bayesian workflows: https://bayesflow.org/
- BayesFlow Forums for users: https://discuss.bayesflow.org
- BayesFlow software paper (JOSS): https://joss.theoj.org/papers/10.21105/joss.05702
- Tutorial on amortized Bayesian inference with BayesFlow (Python): https://colab.research.google.com/drive/1ub9SivzBI5fMbSTwVM1pABsMlRupgqRb?usp=sharing
- Towards Reliable Amortized Bayesian Inference: https://www.marvinschmitt.com/speaking/pdf/slides_reliable_abi_botb.pdf
- Expand the model space that we amortize over (multiverse analyses, power scaling, …): “Sensitivity-Aware Amortized Bayesian Inference” https://arxiv.org/abs/2310.11122
- Use heterogeneous data sources in amortized inference: “Fuse It or Lose It: Deep Fusion for Multimodal Simulation-Based Inference” https://arxiv.org/abs/2311.10671
- Use likelihood density information (explicit or even learned on the fly): “Leveraging Self-Consistency for Data-Efficient Amortized Bayesian Inference” https://arxiv.org/abs/2310.04395
- LBS #98 Fusing Statistical Physics, Machine Learning & Adaptive MCMC, with Marylou Gabrié: https://learnbayesstats.com/episode/98-fusing-statistical-physics-machine-learning-adaptive-mcmc-marylou-gabrie/
- LBS #101 Black Holes Collisions & Gravitational Waves, with LIGO Experts Christopher Berry & John Veitch: https://learnbayesstats.com/episode/101-black-holes-collisions-gravitational-waves-ligo-experts-christopher-berry-john-veitch/
- Deep Learning book: https://www.deeplearningbook.org/
- Statistical Rethinking: https://xcelab.net/rm/
Transcript
This is an automatic transcript and may therefore contain errors. Please get in touch if you’re willing to correct them.
In this episode, Marvin Schmitt introduces the concept of amortized Bayesian inference, where the upfront training phase of a neural network is followed by fast posterior inference.
Marvin will guide us through this new concept, discussing his work in probabilistic machine learning and uncertainty quantification using Bayesian inference with deep neural networks.
He also introduces BayesFlow, a Python library for amortized Bayesian workflows, and discusses its use cases in various fields, while also touching on the concept of deep fusion and its relation to multimodal simulation-based inference.
Yeah, that is a very deep episode and also a fascinating one. I've been personally diving much more into amortized Bayesian inference with BayesFlow since the folks there have been kind enough to invite me to the team, and I can tell you, this is super promising technology.
A PhD student in computer science at the University of Stuttgart, Marvin is supervised actually by two LBS guests you surely know, Paul Bürkner and Aki Vehtari. Marvin's research combines deep learning and statistics to make Bayesian inference fast and trustworthy. In his free time, Marvin enjoys board games and is a passionate guitar player.
This is Learning Bayesian Statistics.
Welcome to Learning Bayesian Statistics, a podcast about Bayesian inference, the methods, the projects, and the people who make it possible. I'm your host, Alex Andorra. You can follow me on Twitter at alex_andorra, like the country, for any info about the show. LearnBayesStats.com is the place to be. Show notes, becoming a corporate sponsor, unlocking Bayesian merch, supporting the show on Patreon, everything is in there. That's LearnBayesStats.com. If you're interested in one-on-one mentorship, online courses, or statistical consulting, feel free to reach out and book a call at topmate.io/alex_andorra. See you around, folks, and best Bayesian wishes to you all.
Today, I want to thank the fantastic Adan Romero, Will Geary, and Blake Walters for supporting the show on Patreon. Your support is truly invaluable and literally makes this show possible. I can't wait to talk with you guys in the Slack channel.
Second, the first part of our modeling webinar series on Gaussian processes is out for everyone. So if you want to see how to use the new HSGP approximation in PyMC, head over to the LBS YouTube channel and you'll see Juan Orduz, a fellow PyMC core dev and mathematician, explain how to do fast and efficient Gaussian processes in PyMC. I'm actually working on the next part in this series as we speak, so stay tuned for more and follow the LBS YouTube channel if you don't want to miss it.
Okay, back to the show now.
Marvin Schmitt, willkommen nach Learning Bayesian Statistics.
Thanks, Alex, thanks for having me.
Actually, my German is very rusty. Do you say nach or zu?
Well, welcome Learning Bayesian Statistics. Maybe welcome in podcast?
Nah. Obviously, obviously like it was a third hidden option. Damn.
It's a secret third thing, right?
Yeah, always in Germany. It's always that. Man, damn. Well, that's okay. I got embarrassed in front of the world, but I'm used to that in each episode.
So thanks a lot for taking the time, Marvin. Thanks a lot to Matt Rosinski, actually, for recommending to do an episode with you. Matt was kind enough to take some of his time to write to me and put me in contact with you. I think you guys met in Australia at a very fun conference, Bayes on the Beach. I think it happens every two years. I definitely want to go there in two years and do a live episode there. Definitely that's a project. I wanted to do that this year, but that didn't go well with my traveling dates. So in two years, definitely going to try to do that. So yeah, listeners and Marvin, you can hold me accountable on that promise.
Absolutely. We will.
So Marvin, before we talk a bit more about what you're a specialist in and also what you presented in Australia, can you tell us what you're doing nowadays and also how you ended up working on this?
Yeah, of course. So these days, I'm mostly doing methods development. So broadly in probabilistic machine learning, I care a lot about uncertainty quantification. And so essentially, I'm doing Bayesian inference with deep neural networks. So taking Bayesian inference, which is notoriously slow at times, which might be a bottleneck, and then using generative neural networks to speed up this process, but still maintaining all the explainability, all these nice benefits that we have from using Bayesian inference.
I have a background in both psychology and computer science. That's also how I ended up in Bayesian inference, because during my psychology studies, I took a few statistics courses, then started as a statistics tutor, mainly doing frequentist statistics. And then I took a seminar on Bayesian statistics in Heidelberg in Germany, and it was the hardest seminar that I ever took. Well, it's super hard. We read like papers every single week. Everyone had to prepare every single paper for every single week. And then at the start of each session, the professor would just shuffle and randomly pick someone to present.
Oh my God.
That was tough, but somehow, I don't know, it stuck with me. And I had like this aha moment where I felt like, okay, all this statistics stuff that I've been doing before was more of, you know, following a recipe, which is very strict. But then this like holistic Bayesian probabilistic take just gave me a much broader overview of statistics in general. Somehow I followed the path.
Yeah. I'm curious what that... So what does that mean to do Bayesian stats on deep neural networks concretely? What is the thing you would do if you had to do that? Let's say, does that mean you mainly... you develop the deep neural network and then you add some Bayesian layer on that, or you have to have the Bayesian framework from the beginning? How does that work?
Yeah, that's a great question. And in fact, that's a common point of confusion there as well, because Bayesian inference is just like a general, almost philosophical framework for reasoning about uncertainty. So you have some latent quantities, call them parameters, whatever, some latent unknowns. And you want to do inference on them. You want to know what these latent quantities are, but all you have are actual observables. And you want to know how these are related to each other. And so with Bayesian neural networks, for instance, these parameters would be the neural network weights. And so you want full Bayesian inference on the neural network weights. And fitting normal neural networks already supports that.
Like a posterior distribution.
Exactly.
Over these neural network weights.
Exactly. So that's one approach of doing Bayesian deep learning, but that's not what I'm currently doing.
Instead, I'm coming from the Bayesian side. So we have like a normal Bayesian model, which has statistical parameters. So you can imagine it like a mechanistic model for like a simulation program. And we want to estimate these scientific parameters. So for example, if you have a cognitive decision-making task from the cognitive sciences, these parameters might be something like the non-decision time, the actual motor reaction time that you need to move your muscles, and some information uptake rates, some bias and all these things that researchers are actually interested in. And usually you would then formulate your model in, for example, PyMC or Stan or however you want to formulate your statistical model and then run MCMC for parameter inference.
And now where the neural networks come in in my research is that we replace MCMC with a neural network. So we still have our Bayesian model. But we don't use MCMC for posterior inference. Instead, we use a neural network just for posterior inference. And this neural network is trained by maximum likelihood. So the neural network itself, the weights there are not probabilistic. There are no posterior distributions over the weights. But we just want to somehow model the actual posterior distributions of our statistical model parameters using a neural network.
So the neural net, I think so. That's quite new to me. So I'm going to rephrase that and see how much I understood. So that means the deep neural network is already trained beforehand?
No, we have to train it. And that's the cool part about this.
OK, so you train it at the same time. You train it at the same time you're also trying to infer the underlying parameters of your model.
And that's the cool part now.
Because in MCMC, you would do both at the same time, right? You have your fixed model that you write down in PyMC or Stan, and then you have your one observed data set, and you want to fit your model to the data set. And so, you know, you do, for example, your Hamiltonian Monte Carlo algorithm to, you know, traverse your parameter space and then do the sampling. So you couple your approximating phase and your inference phase. Like you learn about the posterior distribution based on your data set. And then you also want to generate posterior samples while you're exploring this parameter space.
And in the line of work that I'm doing, which we call amortized Bayesian inference, we decouple those two phases. So the first phase is actually training those neural networks. And that's the hard task. And then you essentially take your Bayesian model, generate a lot of training data from the model because you can just run prior predictive samples. So generate prior predictive samples. And those are your training data for the neural network. And use the neural network to essentially learn a surrogate for the posterior distribution. So for each data set that you have, you want to take those as conditions and then have a generative neural network to learn somehow how these data and the parameters are related to each other. And this upfront training phase takes quite some time and usually takes longer than the equivalent MCMC would take, given that you can run MCMC.
Now, the cool thing is, as you said, when your neural network is trained, then the posterior inference is super fast. Then if you want to generate posterior samples, there's no approximation anymore because you've already done all the approximation. So now you're really just doing sampling. That means just generating some random numbers in some latent space and having one pass through the neural network, which is essentially just a series of matrix multiplications. So once you've done this hard part and trained your generative neural network, then actually doing the posterior sampling takes like a fraction of a second for 10,000 posterior samples.
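As a rough illustration of the two phases described here, the following is a minimal sketch and not the BayesFlow implementation. It assumes a toy model (theta ~ Normal(0, 1), ten observations y_i ~ Normal(theta, 1)) and swaps the generative neural network for a much simpler stand-in: a linear-Gaussian conditional density fitted by maximum likelihood on prior predictive simulations. All function names are hypothetical. For this toy model the exact posterior is Normal(10 * mean(y) / 11, 1 / 11), so the fitted slope and variance can be checked against it.

```python
import random

N_OBS = 10  # observations per data set (assumed for this toy example)

def sample_prior(rng):
    """Draw one parameter from the prior: theta ~ Normal(0, 1)."""
    return rng.gauss(0.0, 1.0)

def simulate(theta, rng):
    """Simulate one data set: y_i ~ Normal(theta, 1) for i = 1..N_OBS."""
    return [rng.gauss(theta, 1.0) for _ in range(N_OBS)]

def summary(y):
    """Summarize a data set by its sample mean."""
    return sum(y) / len(y)

def train(n_sims, rng):
    """Phase 1 (slow, done once): fit q(theta | y) = Normal(a + b * mean(y), s2)
    by maximum likelihood on prior predictive simulations."""
    thetas, stats = [], []
    for _ in range(n_sims):
        theta = sample_prior(rng)
        thetas.append(theta)
        stats.append(summary(simulate(theta, rng)))
    mean_t = sum(thetas) / n_sims
    mean_s = sum(stats) / n_sims
    cov = sum((s - mean_s) * (t - mean_t) for s, t in zip(stats, thetas))
    var = sum((s - mean_s) ** 2 for s in stats)
    b = cov / var
    a = mean_t - b * mean_s
    s2 = sum((t - (a + b * s)) ** 2 for s, t in zip(stats, thetas)) / n_sims
    return a, b, s2

def sample_posterior(params, y, n_samples, rng):
    """Phase 2 (fast, reusable): draw latent noise and transform it."""
    a, b, s2 = params
    loc, scale = a + b * summary(y), s2 ** 0.5
    return [loc + scale * rng.gauss(0.0, 1.0) for _ in range(n_samples)]

rng = random.Random(0)
params = train(20_000, rng)

# The same trained estimator is reused for any number of new data sets.
for true_theta in (-1.0, 0.5, 2.0):
    y_obs = simulate(true_theta, rng)
    draws = sample_posterior(params, y_obs, 10_000, rng)
    print(true_theta, sum(draws) / len(draws))
```

The cost sits in `train`, which never sees the observed data. Each later call to `sample_posterior` is one cheap transformation of random numbers, which is the decoupling described above.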
Okay, yeah, that's really cool. And how generalizable is your deep neural network then? Do you have like, is that, because I can see the really cool thing to have a neural network that's customized to each of your models. That's really cool. But at the same time, as you were saying, that's really expensive to train a neural network each time you have to sample a model. And so I was thinking, OK, so then maybe what you want is have generalized categories of deep neural network. So that would probably be overkill. But let's say I have a deep neural network for linear regressions. Whether they are generalized or just plain normal likelihood, you would use that deep neural network for linear regressions. And then the inference is super fast, because you only have to train the neural network once and then inference, posterior inference on the linear regression parameters themselves is super fast. So yeah, like that's a long question, but did you get what I'm asking?
Yeah, absolutely. So if I get your question right, now you're asking like, if you don't want to run linear regression, but want to run some slightly different model, can I still use my pre-trained neural network to do that?
Yes, exactly. And also, yeah, like in general, how does that work? Like, how are you thinking about that? Are there already some best practices or is it like really for now, really cutting edge research and all the questions are in the air?
Yeah. So first of all, the general use case for this type of amortized Bayesian inference is usually when your model is fixed, but you have many new datasets. So assume you have some quite complex model where MCMC would take a few minutes to run for one fixed data set that you actually want to sample from. And now instead of running MCMC on it, you say, okay, I'm going to train this neural network. So this won't yet be worth it for just one data set. Now the cool thing is if you want to keep your actual model, so whatever you write down in PyMC or Stan, we want to keep that fixed, but now plug in different data sets. That's where amortized inference really shines.
So for instance, there was this one huge analysis in the UK where they had like intelligence study data from more than 1 million participants. And so for each of those participants, they again had a set of observations. And so for each of those 1 million participants, they want to perform posterior inference. It means if you want to do this with something like MCMC or anything non-amortized, you would need to fit one million models. So you might argue now, okay, but you can parallelize this across like a thousand cores, but still that's, that's a lot. That's a lot of compute.
Now the cool thing is the model was the same every single time. You just had a million different data sets. And so what these people did then is train a neural network once. And then like it will train for a few hours, of course, but then you can just sequentially feed in all these 1 million data sets. And for each of these 1 million data sets, it takes way, way less than one second to generate tens of thousands of posterior samples.
But that didn't really answer your question. So your question was about how can we generalize in the model space? And that's a really hard problem because essentially what these neural networks learn is to give you some posterior function if you feed in a data set. Now, if you have a domain shift in the model space, so now you want inference based on a different model, this neural network has never learned to do that. So that's tough. That's a hard problem. And essentially what you could do and what we are currently doing in our research, but that's cutting edge, is expanding the model space. So you would have a very general formulation of a model and then try to amortize over this model, so that different configurations of this model, different variations, could just be extracted as special cases, essentially.
Can you take an example maybe to give an idea to listeners how that would work?
Absolutely. We have one preprint about sensitivity-aware amortized Bayesian inference. What we do there is essentially have a kind of multiverse analysis built into the neural network training. To give some background, multiverse analysis basically says, okay, what are all the pre-processing steps that you could take in your analysis? And you encode those. And now you're interested in like, what if, what if I had chosen a different pre-processing technique? What if I had chosen a different way to standardize my data? Then also the classical like prior sensitivity or likelihood sensitivity analysis. Like what happens if I do power scaling on my prior? Power scaling on my likelihood? So we also encode this. What happens if I bootstrap some of my data or just have a perturbation of my data? What if I add a bit of noise to my data? So these are all slightly different models. What we do is essentially keep track of that during the training phase and just encode it into a vector and say, well, okay, now we're doing pre-processing choice number seven, and scale the prior to the power of two, don't scale the likelihood and don't do any perturbation, and feed this as an additional information into the neural network.
Now the cool thing is during inference phase, once we're done with the training, you can say, hey, here's a data set. Now pretend that we chose pre-processing technique number 11 and prior scaling of power 0.5. What's the posterior now? Because we've amortized over this large or more general model space, we also get valid posterior inference if we've trained for long enough over these different configurations of model. And essentially, if you were to do this with MCMC, for instance, you would refit your model every single time. And so here you don't have to do that.
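To make the idea of a configuration vector and of power scaling concrete, here is a small sketch under stated assumptions. The encoding in `encode_context` is hypothetical and not taken from the preprint. The toy model (theta ~ Normal(0, 1), y_i ~ Normal(theta, 1)) is chosen only because power scaling has a closed form there: raising the prior density to the power alpha and the likelihood to the power beta gives a normal posterior with precision alpha + beta * n and mean beta * sum(y) / (alpha + beta * n).

```python
def encode_context(preproc_choice, n_preproc, prior_power, lik_power, noise_sd):
    """Encode one analysis configuration as a flat vector that could be fed
    to a network alongside the data (hypothetical encoding)."""
    one_hot = [1.0 if i == preproc_choice else 0.0 for i in range(n_preproc)]
    return one_hot + [prior_power, lik_power, noise_sd]

def power_scaled_posterior(y, prior_power=1.0, lik_power=1.0):
    """Exact posterior for theta ~ Normal(0, 1), y_i ~ Normal(theta, 1)
    when the prior density is raised to `prior_power` and the likelihood
    to `lik_power`. Returns (mean, variance)."""
    n = len(y)
    precision = prior_power + lik_power * n
    mean = lik_power * sum(y) / precision
    return mean, 1.0 / precision

y = [0.8, 1.2, 1.0, 1.4]

# Pre-processing choice 7 of 12, prior to the power of two,
# likelihood unscaled, no perturbation.
context = encode_context(7, 12, prior_power=2.0, lik_power=1.0, noise_sd=0.0)
print(context)

# How the posterior moves as the prior is weakened or strengthened.
for alpha in (0.5, 1.0, 2.0):
    print(alpha, power_scaled_posterior(y, prior_power=alpha))
```

With MCMC, each value of `alpha` would mean a separate model fit. An amortized estimator trained over such context vectors would instead take the vector as an extra input at inference time.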
Okay. Yeah, I see. That's super... Yeah, that's super cool. And I feel like, so the main use cases would be, as you were saying, when you're getting into really high data territory and what's changing is mainly the data side, mainly the data set. And to be even more precise, not really the data set, but the data values, because the data set is supposed to be like quite the same, like you would have the same columns, for instance, but the values of the columns would change all the time. And the model at the same time doesn't change. Is that like, that's really for now, at least the best use case for that kind of method?
Yes. And this might seem like a very niche case. But then if you look at like Bayesian workflows in practice, this topic of this scheme of many model research doesn't necessarily mean that you have a large number of data sets. This might also just mean you want extensive cross validation. So assume that you have one data set with ... Now you want to run leave-one-out cross validation, but for some reason you can't do the Pareto smoothed importance sampling version, which would be much faster. So you would need ... though you just have one data set, because you want ...
Maybe can you make explicit what you mean by cross validation here? Because that's not a term that's used a lot in the Bayesian framework, I think.
Yeah, of course. So especially in a Bayesian setting, there's this approach of leave-one-out cross validation, where you would fit your posterior based on all data points but one. And that's why it's called leave-one-out, because you take one out and then fit your model, fit your posterior on the rest of the data. And now you're interested in the posterior predictive performance of this one left out observation.
Yeah.
And that's called cross validation.
Yeah. Go ahead.
Yeah, no, just I'm going to let you finish, but yeah, for listeners familiar with the frequentist framework, that's something that's really heavily used in that framework, cross validation.
And it's very similar to the machine learning concept of cross validation. But in the machine learning area, you would rather have something like fivefold or, in general, k-fold cross validation, where you would have larger splits of your data and then use parts of your whole dataset as the training dataset and the rest for evaluation. Essentially, leave-one-out cross validation just puts it to the extreme. Everything but one data point is your training dataset.
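The difference between the two splitting schemes can be sketched in a few lines of plain Python. The helper names are hypothetical, and the folds are contiguous and unshuffled to keep the example short. Each training set would then get its own posterior fit, which is why leave-one-out needs as many fits as there are observations.

```python
def leave_one_out_splits(data):
    """Yield (training_set, held_out_point) pairs, one per observation."""
    for i in range(len(data)):
        yield data[:i] + data[i + 1:], data[i]

def k_fold_splits(data, k):
    """Yield (training_set, evaluation_set) pairs for k contiguous folds."""
    n = len(data)
    bounds = [round(j * n / k) for j in range(k + 1)]
    for lo, hi in zip(bounds, bounds[1:]):
        yield data[:lo] + data[hi:], data[lo:hi]

data = list(range(10))
loo = list(leave_one_out_splits(data))
five_fold = list(k_fold_splits(data, 5))

# Ten model fits for leave-one-out versus five for fivefold.
print(len(loo), len(five_fold))
```

Setting `k` equal to the number of observations makes `k_fold_splits` produce the same splits as `leave_one_out_splits`, which is the "extreme" case mentioned above.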
Yeah. Yeah. Okay. Yeah. Damn, that's super fun. And is there already a way for people to try that out or is it mainly for now implemented for papers? And you are probably, I'm guessing, working on that with Aki and all his group in Finland to make that more open source, helping people use packages to do that. What's the state of things here?
Yeah, that's a great question. And in fact, the state of usable open source software is far behind what we have for likelihood-based, MCMC-based inference. So we currently don't have something that's comparable to PyMC or Stan. Our group is actively developing a software that's called BayesFlow. The name is like that: Bayes, because we're doing Bayesian inference, and essentially the first neural network architecture that was used for this amortized Bayesian inference are so-called normalizing flows. Conditional normalizing flows, to be precise. And that's why the name BayesFlow came to be.
But now we actually have a bit of a different take because now we have a whole lot of generative neural networks and not only normalizing flows. So now we can also use, for example, score-based diffusion models that are mainly used for image generation in AI, or consistency models, which are essentially like a distilled version of score-based diffusion models. And so now BayesFlow doesn't really capture that anymore. But now what the BayesFlow Python library specializes in is defining principled amortized Bayesian workflows. So the meaning of BayesFlow slightly shifted to amortized Bayesian workflows, and hence the name BayesFlow.
And the focus of BayesFlow and the aim of BayesFlow is twofold. So first we want a library that's good for actual users. So this might be researchers who just say, hey, here's my data set, here's my model, my simulation program, and please just give me fast posterior samples. So we want a usable high-level interface with sensible default values that mostly work out of the box and an interface that's mostly self-explanatory. Also of course, good teaching material and all this.
But that's only one side of the coin because the other large goal of BayesFlow is that it should be usable for machine learning researchers who want to advance amortized Bayesian inference methods as well. And so the software in general is structured in a very modular way. So for instance, you could just say, hey, take my current pipeline, my current workflow. But now try out a different loss function because I have a new fancy idea. I want to incorporate more likelihood information. And so I want to alter my loss function. So you would have your general program and, because of the modular architecture there, you could just say, take the current loss function and replace it with a different one that adheres to the API. And we're trying to do both and serve both interests: the user-friendly side for actually applied researchers who are also currently using BayesFlow, but then also the machine learning researchers with completely different requirements for this piece of software. Maybe we can also put the BayesFlow documentation and the current project website in the notes.
Yeah, we should definitely do that. Definitely gonna try that out myself. It sounds like fun. I need a use case, but as soon as I have a use case, I'm definitely gonna try that out because it sounds like a lot of fun. Yeah, several questions based on that and thanks a lot for being so clear and so detailed on these. So first, we talked about normalizing flows in episode 98 with Marylou Gabrié. Definitely recommend listeners to listen to that for some background. And question, so BayesFlow, yeah, definitely we need that in the show notes and I'm going to install that in my environment. And I'm guessing, so you're saying that that's in Python, right? The package?
Yes, the core package is in Python and we're currently refactoring to Keras. So by the time this podcast episode is aired, we will have a new major release version, hopefully.
OK, nice.
So you're agnostic to the actual machine learning back end. So then you could choose TensorFlow, PyTorch, or JAX, whatever integrates best with what you're currently proficient in and what you might be currently using in other parts of a project.
449
:OK, that was going to be my question.
450
:Because I think while preparing for the
episode, I saw that you were mainly using
451
:PyTorch.
452
:So that was going to be my question.
453
:What is that based on?
454
:So the back end could be PyTorch, JAX, or...
455
:What did you say the last one was?
456
:TensorFlow.
457
:Yeah, I always forget about all these
names.
458
:I really know PyTorch.
459
:So that's why I like the other ones.
460
:And JAX, of course, for PyMC.
461
:And then, so my question is, the workflow,
what would it look like if you're using
462
:BayesFlow?
463
:Because you were saying the model, you
could write it in standard PyMC or
464
:TensorFlow, for instance.
465
:Although I don't know if you can write
466
:Bayesian models with TensorFlow anymore.
467
:Anyways, let's say PyMC or Stan.
468
:You write your model.
469
:But then the sampling of the model is done
with the neural network.
470
:So that means, for instance, PyTorch or
JAX.
471
:How does that work?
472
:Do you have then to write the model in a
JAX-compatible way?
473
:Or is the translation done by the package
itself?
474
:Yeah, that's a great question.
475
:It touches on many different topics and
considerations and also on future roadmap
476
:for BayesFlow.
477
:So.
478
:This class of algorithms that are
implemented in BayesFlow, these amortized
479
:Bayesian inference algorithms, to give you
some background there, they originally
480
:started in simulation-based inference.
481
:It's also sometimes called
likelihood-free inference.
482
:So essentially it is Bayesian inference
when you don't bring a closed-form
483
:likelihood function to the table.
484
:But instead, you only have some generic
forward simulation program.
485
:So you would just have your prior as
some...
486
:Python function or C++ function, whatever,
any function that you could call and it
487
:would return you a sample from the prior
distribution.
488
:You don't need to write it down in terms
of distributions actually, but you only
489
:need to be able to sample from it.
490
:And then the same for the likelihood.
491
:So you don't need to write down your
likelihood in, like, PyMC or Stan in terms
492
:of a probability distribution, in terms of
density distribution or densities.
493
:But instead it's
494
:just got to be some simulation program,
which takes in parameters and then outputs
495
:data.
496
:What happens between these parameters and
the data is not necessarily probabilistic
497
:in terms of closed form distributions.
498
:It could also be some intractable
differential equations.
499
:It could be essentially everything.
500
:So for BayesFlow, this means that you
don't have to input something like a PyMC
501
:or a Stan model, which you write down in
terms of
502
:distributions, but it's just a generic
forward model that you can call and you
503
:will get a tuple of a parameter draw and a
data set.
504
:So you'd usually just do it in NumPy.
505
:So you would write, if I'm using BayesFlow,
I would write it in NumPy.
506
:It would probably be the easiest way.
507
:You could probably also write it in JAX or
in PyTorch or in TensorFlow or TensorFlow
508
:Probability, whatever you want to use and
like behind the scenes.
509
:But essentially what we just care about is
that the model gets a tuple of parameters
510
:and then data that has been generated from
these parameters.
511
:for the neural network training process.
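A minimal sketch of the kind of forward model described here, written in NumPy. The Gaussian model, the function names `prior`, `likelihood`, and `simulate_batch`, and the array shapes are illustrative assumptions, not BayesFlow's API; the only point is that each call yields a (parameters, data) pair.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

def prior():
    """Draw one parameter vector (mu, sigma). Only sampling is needed, no density."""
    mu = rng.normal(0.0, 1.0)
    sigma = rng.gamma(shape=2.0, scale=1.0)
    return np.array([mu, sigma])

def likelihood(theta, n_obs=50):
    """Simulate one data set from a parameter vector. Any forward program would do."""
    mu, sigma = theta
    return rng.normal(mu, sigma, size=n_obs)

def simulate_batch(batch_size):
    """Stack (parameters, data) pairs into two arrays for neural network training."""
    thetas = np.stack([prior() for _ in range(batch_size)])
    data = np.stack([likelihood(theta) for theta in thetas])
    return thetas, data

thetas, data = simulate_batch(128)
print(thetas.shape, data.shape)  # (128, 2) (128, 50)
```

The simulator never evaluates a density, which is why the same pattern would also cover simulators built around differential equations or other programs without a closed-form likelihood.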
512
:That's super fun.
513
:Yeah, yeah, yeah.
514
:Definitely want to see that.
515
:Do you have already some Jupyter notebook
examples up on the repo or are you working
516
:on that?
517
:Yeah, currently it's a full-fledged
library.
518
:It's been under development for a few
years now.
519
:And we also have an active user base right
now.
520
:It's quite small compared to other
Bayesian packages.
521
:We're growing it.
522
:Yeah, that's cool.
523
:In the documentation, there are currently, I
think, seven or eight tutorial notebooks.
524
:And then also for Bayes on the Beach,
like this conference in Australia that we
525
:just talked about earlier, we also
prepared a workshop.
526
:And we're also going to link to this
Jupyter notebook in the show notes.
527
:Yeah, definitely we should, we should link
to some of these Jupyter notebooks in the
528
:show notes.
529
:And Sean, I'm thinking you should...
530
:Like if you're down, you should definitely
come back to the show, but for a webinar.
531
:I have another format that's modeling
webinar where you could, you would come to
532
:the show and share your screen and, and go
through the model code live and people can
533
:ask questions and so on.
534
:I've done that already on a variety of
things.
535
:Last one was about causal inference and
propensity scores.
536
:Next one is going to be about Hilbert
space GP decomposition.
537
:So yeah, if you're down, you should
definitely come and do a demonstration of
538
:BayesFlow and amortized Bayesian
inference.
539
:I think that would be super fun and very
interesting to people.
540
:Absolutely.
541
:Then to answer the last part of your
question.
542
:Yeah.
543
:Like if you currently have a model that's
written down in PyMC or Stan, that's a bit
544
:more tricky to integrate because
essentially all we need in BayesFlow
545
:are samples from the prior predictive
distribution.
546
:If you talk in Bayesian terminology.
547
:Yeah.
548
:And if your current model can do that,
that's fine.
549
:That's all you need right now.
550
:And then BayesFlow builds...
551
:You can have, like, a PyMC model and just do
pm.sample_prior_predictive, save that as a
552
:big NumPy multidimensional array and pass
that to BayesFlow.
553
:Yes.
554
:Okay.
555
:Just, all you need are tuples of the
ground-truth parameters and the data from the
556
:data-generating process.
557
:So essentially like the result of your
prior call and then the result of your
558
:likelihood call with those prior
parameters.
559
:So you mean what the likelihood samples
look like once you fix the prior
560
:parameters to some value?
561
:Yes.
562
:So like in practice, you would just call
your prior function.
563
:Yeah.
564
:Then get a sample from the prior.
565
:So parameter vector.
566
:Yeah.
567
:And then plug this parameter vector into
the likelihood function.
568
:And then you get one simulated synthetic
data set.
569
:And you just need those two.
570
:Okay.
571
:Super cool.
572
:Yeah.
573
:Definitely sounds like a lot of fun and
should definitely do a webinar about that.
574
:I'm very excited about that.
575
:Yeah.
576
:Fantastic.
577
:And so that was one of my main questions
on that.
578
:Other question is, I'm guessing there are a
lot of people working on that, right?
579
:Because your roadmap that you just talked
about is super big.
580
:Because having a package that's designed
for users, but also for researchers is
581
:quite, that's really a lot of work.
582
:So I'm hoping you're not alone doing
that.
583
:No, we're currently a team of about a
dozen people.
584
:No, yeah, that makes sense.
585
:It's an interdisciplinary team.
586
:So like a few people with a hardcore like
software engineering background, like some
587
:people with a machine learning background,
and some people from the cognitive
588
:sciences and also a handful of physicists.
589
:Because in fact, these amortized Bayesian
inference methods are particularly
590
:interesting for physicists.
591
:For example, for astrophysicists who have these
gravitational wave inference problems
592
:where they have massive data sets.
593
:And running MCMC on those would be quite
cumbersome.
594
:So if you have this huge in-stream data
and you don't have this underlying
595
:likelihood density, but just some
simulation program that might generate
596
:sensible, like gravitational waves, then
amortized Bayesian inference really shines
597
:there.
598
:Okay.
599
:So that's exactly the case you were
talking about where the model doesn't
600
:change, but you have a lot of different
datasets.
601
:Yeah, exactly.
602
:Because I mean, what you're trying to run
inference on is your physical model.
603
:And that doesn't change.
604
:I mean, it does.
605
:And then again, physicists have a very
good understanding and very good models of
606
:the world around them.
607
:And that's maybe one of the largest
differences
608
:to people from the cognitive sciences, where,
you know, the, the models of the human
609
:brain, for instance, are just, it's such a
tough thing to model and there's so much
610
:not there and so much uncertainty in the
model building process.
611
:Yeah, for sure.
612
:Okay, yeah, I think I'm starting to
understand the idea.
613
:And yeah, so actually, episode 101 was
exactly about that.
614
:Black holes, collisions, gravitational
waves.
615
:And I was talking with LIGO researchers,
Christopher Berry and John Veitch.
616
:And we talked exactly about that, their
problem with big data sets.
617
:They are mainly using sequential Monte
Carlo, but I'm guessing they would also be
618
:interested in a Monte...
619
:amortized Bayesian inference.
620
:So yeah, Christopher and John, if you're
listening, feel free to reach out to
621
:Marvin and use BayesFlow.
622
:And listeners, this episode will be in the
show notes also if you want to give it a
623
:listen.
624
:That's a really fun one also, learning a
lot of stuff about the crazy universe we
625
:live in.
626
:Actually, a weird question I have is: why
627
:is it called amortized Bayesian
inference?
628
:The reason is that we have this two-stage
process where we would first pay upfront
629
:with this long neural network training
phase.
630
:But then once we're done with this, this
cost of the upfront training phase
631
:amortizes over all the posterior samples
that we can draw within a few
632
:milliseconds.
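The amortization argument can be made concrete with a bit of arithmetic. In the sketch below, all timings (two hours of training, 60 seconds per MCMC fit, 10 milliseconds per amortized inference) are hypothetical numbers chosen for illustration, not measurements from the episode.

```python
import math

def total_cost_mcmc(n_datasets, seconds_per_fit):
    """Every data set needs a fresh fit."""
    return n_datasets * seconds_per_fit

def total_cost_amortized(n_datasets, training_seconds, seconds_per_inference):
    """Pay for training once, then a small cost per data set."""
    return training_seconds + n_datasets * seconds_per_inference

def break_even(training_seconds, seconds_per_fit, seconds_per_inference):
    """Smallest number of data sets for which the amortized route is strictly cheaper."""
    saving_per_dataset = seconds_per_fit - seconds_per_inference
    if saving_per_dataset <= 0:
        return None  # amortization never pays off
    return math.floor(training_seconds / saving_per_dataset) + 1

# Hypothetical timings, for illustration only.
n = break_even(training_seconds=7200.0, seconds_per_fit=60.0, seconds_per_inference=0.01)
print(n)  # 121
```

With these made-up numbers, the upfront training cost is recovered after 121 data sets; for a single data set, the per-fit approach would be far cheaper.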
633
:That makes sense.
634
:That makes sense.
635
:And so I think something you're also
working on is something that's called deep
636
:fusion.
637
:And you do that in particular for
multimodal simulation-based inference.
638
:How is that related to amortized Bayesian
inference, if at all?
639
:And what is it about?
640
:I'm gonna answer these two questions in
reverse order.
641
:So first about the relation between
simulation-based inference and amortized
642
:Bayesian inference.
643
:So to give you a bit of history there,
simulation-based inference is essentially
644
:Bayesian inference based on simulations
where we don't assume that we have access
645
:to a likelihood density, but instead we
just assume that we can sample from the
646
:likelihood.
647
:Essentially simulate from the model.
648
:In fact, the likelihood is still
649
:present, but it's only implicitly defined
and we don't have access to the density.
650
:That's why likelihood-free inference
doesn't really hit what's happening here.
651
:But instead, like in the recent years,
people have started adopting the term
652
:simulation-based inference because we do
Bayesian inference based on simulations
653
:instead of likelihood densities.
654
:So methods that have been used
655
:for quite a long time now in the
simulation-based inference research area are,
656
:for example, rejection ABC, so approximate
Bayesian computation, or then ABC SMC, so
657
:combining ABC with sequential Monte Carlo.
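For readers who have not seen rejection ABC, here is a minimal sketch on a toy Gaussian model. The model, the choice of the sample mean as summary statistic, the tolerance `eps`, and all function names are assumptions made for illustration; only prior draws and simulations are used, never a likelihood density.

```python
import numpy as np

rng = np.random.default_rng(0)
N_OBS = 20

def simulate(theta):
    """Forward model: N_OBS Gaussian observations for each parameter value in theta."""
    theta = np.atleast_1d(theta)
    return rng.normal(theta[:, None], 1.0, size=(theta.size, N_OBS))

def rejection_abc(observed, n_proposals=20_000, eps=0.1):
    theta = rng.normal(0.0, 1.0, size=n_proposals)               # 1. draw from the prior
    simulated = simulate(theta)                                   # 2. simulate one data set per draw
    distance = np.abs(simulated.mean(axis=1) - observed.mean())  # 3. compare summary statistics
    return theta[distance < eps]                                  # 4. keep draws whose data look like ours

observed = simulate(0.5)[0]
posterior_draws = rejection_abc(observed)
print(posterior_draws.size, posterior_draws.mean())
```

Note that every new observed data set requires rerunning the whole procedure, which is the cost that a trained neural estimator avoids.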
658
:Essentially, the next iteration there was
throwing neural networks at
659
:simulation-based inference.
660
:That's exactly this neural posterior
estimation that I talked about earlier.
661
:And now what researchers noticed is, hey,
when we train a neural network for
662
:simulation-based inference, instead of
running rejection approximate Bayesian
663
:computation, then we get amortization for
free as a side product.
664
:It's just a by-product of using a neural
network for simulation-based inference.
665
:And so in the last maybe four to five
years,
666
:People have mainly focused on this
algorithm that's called neural posterior
667
:estimation for simulation-based inference.
668
:And so all developments that happened
there and all the research that happened
669
:there, almost all the research, sorry,
focused on cases where we don't have any
670
:likelihood density.
671
:So we're purely in the simulation based
case.
672
:Now with our view of things, when we come
from a Bayesian inference, like, likelihood-based
673
:setting,
674
:we can say, hey, amortization is not just a
random coincidental byproduct, but it's a
675
:feature and we should focus on this
feature.
676
:And so now what we're currently doing is
moving this idea of amortized Bayesian
677
:inference with neural networks back into a
likelihood-based setting.
678
:So we've started using likelihood
information again.
679
:For example, using likelihood densities if
they're available or learning information
680
:about the likelihood.
681
:So like a surrogate model on the fly, and
then again, using this information for
682
:better posterior inference.
683
:So we're essentially bridging
simulation-based inference and likelihood-based
684
:Bayesian inference again with this goal, a
larger goal of amortization if we can do
685
:it.
686
:And so this work on deep fusion.
687
:essentially addresses one huge shortcoming
of neural networks when we want to use
688
:them for amortized Bayesian inference.
689
:And that is in situation where we have
multiple different sources of data.
690
:So for example,
691
:Imagine you're a cognitive scientist and
you run an experiment with subjects and
692
:for each test subject, you give them a
decision-making task.
693
:But at the same time, while your subjects
solve the decision-making task, you wire
694
:them up with an EEG to measure the brain
activity.
695
:So for each subject across maybe 100
trials, what you now have is both an EEG
696
:and the data from the decision-making
task.
697
:Now, if you want to analyze this with PyMC
or Stan, what you would just do is say,
698
:hey, well, we have two data-generating
processes that are governed by a set of
699
:shared parameters.
700
:So the first part of the likelihood would
just be this Wiener process for the
701
:decision-making task where you just model
the reaction time,
702
:a fairly standard procedure there in the
cognitive sciences.
703
:And then for the second part, we have a
second part of the likelihood that we
704
:evaluate that somehow handles these EEG
measurements.
705
:For example, a spatiotemporal process or
just like some summary statistics that are
706
:being computed there.
707
:However, you would usually compute your
EEG.
708
:Then you add both to the log PDF of the
likelihood, and then you can call it a
709
:day.
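In code, "adding both to the log PDF" looks roughly like the sketch below. The two Gaussian likelihood terms are simple stand-ins (a real analysis would use a proper reaction-time model and an EEG model), and the shared parameter `theta`, the factor 2.0, and the noise levels are assumptions for illustration.

```python
import numpy as np

def normal_logpdf(x, mean, sd):
    return -0.5 * np.log(2.0 * np.pi * sd**2) - 0.5 * ((x - mean) / sd) ** 2

def loglik_reaction_times(theta, log_rt):
    """Stand-in likelihood for the log reaction times from the decision-making task."""
    return normal_logpdf(log_rt, mean=theta, sd=0.3).sum()

def loglik_eeg(theta, eeg_summary):
    """Stand-in likelihood for EEG summary statistics, driven by the same theta."""
    return normal_logpdf(eeg_summary, mean=2.0 * theta, sd=1.0).sum()

def joint_loglik(theta, log_rt, eeg_summary):
    """Two data sources, one shared parameter: the log-likelihoods simply add."""
    return loglik_reaction_times(theta, log_rt) + loglik_eeg(theta, eeg_summary)

rng = np.random.default_rng(0)
theta_true = -0.5
log_rt = rng.normal(theta_true, 0.3, size=100)
eeg_summary = rng.normal(2.0 * theta_true, 1.0, size=100)

grid = np.linspace(-2.0, 2.0, 401)
best = grid[np.argmax([joint_loglik(t, log_rt, eeg_summary) for t in grid])]
print(best)
```

The addition works because both terms are densities over a shared parameter; a neural network that consumes raw reaction times and raw EEG has no such common currency, which is the problem the fusion schemes address.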
710
:You cannot do that in neural networks
because you have no straightforward
711
:sensible way to combine these reaction
times from the decision-making task and
712
:the EEG data.
713
:Because you cannot just take them and slap
them together.
714
:They are not compatible with each other
because these information data sources are
715
:heterogeneous.
716
:So you somehow need a way to fuse these
sources of information.
717
:so that you can then feed them into the
neural network.
718
:That's essentially what we're studying in
this paper, where you could just get very
719
:creative and have different schemes to
fuse the data.
720
:So you could use these attention schemes
that are very hip in large language models
721
:right now with transformers essentially,
and have these different data sources
722
:attend or listen essentially to each
other.
723
:With cross attention, you could just let
the EEG data inform
724
:your decision-making data or just have
the decision-making data inform the EEG
725
:data.
726
:So you can get very creative there.
727
:You could also just learn some
representation of both individually, then
728
:concatenate them and feed them to the
neural network.
729
:Or you could do very creative and weird
mixes of all those approaches.
730
:And in this paper, we essentially have a
systematic investigation of these
731
:different options.
732
:And we find that the most straightforward
option works the best.
733
:overall, and that's just learning fixed
size embeddings of your data sources
734
:individually, and then just concatenating
them.
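A minimal sketch of that "embed each source, then concatenate" scheme, in plain NumPy. The random weight matrices stand in for trained summary networks, and the array shapes and embedding sizes are assumptions for illustration, not the architecture from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(x, weights):
    """Map an (n_rows, n_features) array to a fixed-size vector:
    one nonlinear layer per row, then mean pooling over rows."""
    hidden = np.tanh(x @ weights)
    return hidden.mean(axis=0)

# Random matrices standing in for learned weights of two separate summary networks.
W_rt = rng.normal(size=(1, 8))
W_eeg = rng.normal(size=(32, 16))

reaction_times = rng.lognormal(size=(100, 1))  # 100 trials, one reaction time each
eeg = rng.normal(size=(250, 32))               # 250 time points, 32 channels

# Fuse by concatenating the two fixed-size embeddings.
fused = np.concatenate([embed(reaction_times, W_rt), embed(eeg, W_eeg)])
print(fused.shape)  # (24,)
```

The fused vector has the same length no matter how many trials or time points each source contains, so it can serve as the condition for the posterior network.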
735
:It turns out then we can use information
from both sources in an efficient way,
736
:even though we're doing inference with
neural networks.
737
:And maybe what's interesting for
practitioners is that we can compensate
738
:for missing data in individual sources.
739
:In the paper, we essentially induced
missing data by just taking these EEG data
740
:and decision-making data and just
randomly dropping some of them.
741
:And the neural networks have learned, like
when we do this fusion process, the neural
742
:networks learn to compensate for partial
missingness in both sources.
743
:So if you just remove some of the
decision-making data, the neural network learns to
744
:use the EEG data to inform your posterior.
745
:Even though the data in one of the
sources are missing, the inference is
746
:pretty robust then.
747
:And again, all this happens without model
refits.
748
:So you would just account for that during
training.
749
:Of course you have to do this like random
dropping of data during a training phase
750
:as well.
751
:And then you can also get it during the
inference phase.
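One way the random dropping could be sketched is shown below. The masking scheme (zeroing rows and passing a 0/1 keep mask to a masked mean pooling) is an assumption made for illustration and is not necessarily how the paper encodes missingness.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_drop(x, drop_prob):
    """Zero out random rows and return the data together with a 0/1 keep mask."""
    keep = (rng.random(x.shape[0]) >= drop_prob).astype(float)
    return x * keep[:, None], keep

def masked_embedding(x, keep, weights):
    """Mean-pool the per-row features over the rows that were kept."""
    hidden = np.tanh(x @ weights) * keep[:, None]
    return hidden.sum(axis=0) / max(keep.sum(), 1.0)

weights = rng.normal(size=(32, 16))   # stand-in for learned weights
eeg = rng.normal(size=(250, 32))

eeg_dropped, keep = random_drop(eeg, drop_prob=0.3)
summary = masked_embedding(eeg_dropped, keep, weights)
print(summary.shape)  # (16,)
```

Applying `random_drop` to every simulated data set during training exposes the networks to the same kind of missingness they will meet at inference time.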
752
:yeah, that sounds, yeah, that's really
cool.
753
:Maybe that's a bit of a, like a small
piece of this paper in our larger roadmap.
754
:This is essentially taking this amortized
Bayesian inference
755
:up to the level of trustworthiness and
robustness and all these gold standards
756
:that we currently have for
likelihood-based inference in PyMC or Stan.
757
:Yeah.
758
:Yeah.
759
:And there's still a lot of work to do
because of course, like there's no free
760
:lunch.
761
:And, of course, there are many problems
with trustworthiness.
762
:And that's also one of the reasons why I'm
here with Aki right now.
763
:Because Aki is so great at Bayesian workflow
and trustworthiness, good diagnostics.
764
:That's all, you know, all the things that
we currently still need for trustworthy,
765
:amortized Bayesian inference.
766
:Yeah.
767
:So maybe you want to.
768
:talk a bit more about that and what you're
doing on that.
769
:That sounds like something very
interesting.
770
:So one huge advantage of an amortized
Bayesian sampler is that evaluations and
771
:diagnostics are extremely cheap.
772
:So for example, there's this gold standard
method that's called simulation-based
773
:calibration, where you would sample from
your model and then like a sample from
774
:your prior predictive space and then refit
your model and look at your coverage, for
775
:instance.
776
:In general, look at the calibration of
your model on this potentially very large
777
:prior predictive space.
778
:So you naturally need many model refits,
but your model is fixed.
779
:So if you do it with MCMC, it's a gold
standard evaluation technique, but it's
780
:very expensive to run, especially if your
model is complex.
781
:Now, if you have an amortized estimator,
simulation-based calibration on thousands
782
:of datasets takes a few seconds.
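A sketch of simulation-based calibration on a toy conjugate Gaussian model. The exact posterior plays the role of the trained amortized sampler here (an assumption that keeps the example self-contained); with a real amortized estimator, `sample_posterior` would be a single network call over all data sets.

```python
import numpy as np

rng = np.random.default_rng(0)
N_OBS = 10

def sample_posterior(data, n_draws):
    """Stand-in for a trained amortized sampler: the exact conjugate posterior
    for theta ~ N(0, 1) and y_i ~ N(theta, 1)."""
    post_var = 1.0 / (N_OBS + 1)
    post_mean = data.sum(axis=-1) * post_var
    return rng.normal(post_mean[..., None], np.sqrt(post_var),
                      size=(*post_mean.shape, n_draws))

def sbc_ranks(n_datasets=2000, n_draws=99):
    theta_true = rng.normal(0.0, 1.0, size=n_datasets)                       # draw from the prior
    data = rng.normal(theta_true[:, None], 1.0, size=(n_datasets, N_OBS))   # simulate data sets
    draws = sample_posterior(data, n_draws)                                  # posterior draws per data set
    return (draws < theta_true[:, None]).sum(axis=1)                         # rank of the true value

ranks = sbc_ranks()
counts = np.bincount(ranks, minlength=100)
print(counts.min(), counts.max())
```

For a well-calibrated sampler the ranks are uniform on 0 to 99, so the histogram in `counts` should look flat; systematic U-shapes or slopes would point to over- or underconfident posteriors.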
783
:So essentially, and that's my goal for
this research visit with Aki here in
784
:Finland, is trying to figure out what are
some diagnostics that are gold standard,
785
:but potentially very expensive, up to a
point where it's infeasible to run on a
786
:larger scale with MCMC.
787
:But we can easily do it with an amortized
estimator.
788
:With the goal of figuring out, like, can
we trust this estimator?
789
:Yes or no?
790
:It's like, as you might know from neural
networks, we just have no idea what's
791
:happening inside the neural network.
792
:And so we currently don't have these
strong diagnostics that we have for MCMC.
793
:Like, for example, R-hat.
794
:There's no comparable thing for neural
networks.
795
:So one of my goals here is to come up with
more good diagnostics that are either
796
:possible with MCMC, but very expensive so
we don't run them, but they would be very
797
:cheap with an amortized estimator.
798
:Or the second thing just specific to an
amortized estimator, just like R-hat is
799
:specific to MCMC.
800
:Okay.
801
:Yeah, I see.
802
:Yeah, that makes tons of sense.
803
:well.
804
:And actually, so I would have more
technical questions on these, but I see
805
:the time running out.
806
:I think something I'm mainly curious about
is the challenges, the biggest challenges
807
:you face when applying amortized Bayesian
inference and deep fusion techniques in your
808
:projects, but also like in the projects
you see.
809
:I think that's going to also give a sense
to listeners of when and where to use
810
:these kinds of methods.
811
:That's a great question.
812
:And I'm more than happy to talk about all
these challenges that we have because
813
:there's so much room for improvement
because, like, these amortized methods, they
814
:have so much potential, but we still have
a long way to go until they are as usable
815
:and as straightforward to use as current
MCMC samplers.
816
:And in general, one challenge for
practitioners,
817
:is that we have most of the problems and
hardships that we have in PyMC or Stan.
818
:And that is that researchers have to think
about their model in a probabilistic way,
819
:in a mechanistic way.
820
:So instead of just saying, hey, I click on
t-test or linear regression in some
821
:graphical user interface, they actually
have to come up with a data generating
822
:process.
823
:and have to specify their model.
824
:And this whole topic of model
specification is just the same in
825
:the amortized workflow because in some way we
need to specify the Bayesian model.
826
:And now on top of all this, we have a huge
additional layer of complexity and this is
827
:defining the neural networks.
828
:In amortized Bayesian inference, nowadays
we have two neural networks.
829
:The first one is a so-called summary
network.
830
:which essentially learns a latent
embedding of the data set.
831
:Essentially those are like optimal learned
summary statistics and optimal doesn't
832
:mean that they have to be optimal to
reconstruct the data, but instead optimal
833
:means they're optimal to inform the
posterior.
834
:For example, in a very, very simple toy
model, if you have just like a Gaussian
835
:model and you just want to perform
inference on the mean.
836
:then a sufficient summary statistic for
posterior inference on the mean would be
837
:the mean.
838
:Because that's all you need to reconstruct
the mean.
839
:It sounds very tautological, but yeah.
840
:Then again, the mean is obviously not
enough to reconstruct the data because all
841
:the variance information is missing.
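The toy example can be checked numerically. The sketch assumes a Gaussian model with known standard deviation and a Gaussian prior on the mean (both assumptions for illustration): two data sets with the same size and the same sample mean, but very different spread, give the same posterior for the mean.

```python
import numpy as np

def posterior_of_mean(data, sigma=1.0, prior_mean=0.0, prior_sd=1.0):
    """Conjugate posterior for the mean of a Gaussian with known sd sigma.
    The data enter only through their sum (equivalently, the sample mean) and their size."""
    n = data.size
    precision = 1.0 / prior_sd**2 + n / sigma**2
    mean = (prior_mean / prior_sd**2 + data.sum() / sigma**2) / precision
    return float(mean), float(np.sqrt(1.0 / precision))

tight = np.array([1.5, 2.0, 2.5])    # sample mean 2.0, small spread
wide = np.array([-1.0, 2.0, 5.0])    # sample mean 2.0, large spread

print(posterior_of_mean(tight))  # (1.5, 0.5)
print(posterior_of_mean(wide))   # (1.5, 0.5)
```

The sample mean is enough for posterior inference on the mean, but it clearly cannot tell the two data sets apart, which is the distinction between "optimal for the posterior" and "optimal for reconstruction".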
842
:What the summary network learns is
something like the mean.
843
:So summary statistics that are optimal for
posterior inference.
844
:And then the second network is the actual
generative neural network.
845
:So like a normalizing flow, score-based
diffusion model, consistency model, flow
846
:matching, whatever conditional generative
model you want.
847
:And this will handle the sampling from the
posterior.
848
:And these two networks are learned end to
end.
849
:So you would learn your summary statistic,
output it, feed it into the posterior
850
:network, the generative model, and then
have one
851
:evaluation of the loss function, optimize
both end to end.
852
:And so we have two neural networks, long
story short, which is substantially harder
853
:than just hitting, like, sample on a PyMC or
Stan program.
854
:And that's an additional hardship for
practitioners.
855
:Now in BayesFlow, what we do is we provide
sensible default values for the generative
856
:neural networks, which work in maybe like
80 or 90% of the cases.
857
:It's just sufficient to have, for example,
like a neural spline flow, like some sort of
858
:normalizing flow with, I don't know, like,
859
:six layers and a certain number of units,
some regularization for robustness and,
860
:you know, cosine decay of the learning
rates, and all these machine learning
861
:parts, we try to take them away from the
user if they don't want to mess with it.
862
:But still, if things don't work, they
would need to somehow diagnose the
863
:problems and then, you know, play with the
number of layers and this neural network
864
:architecture.
865
:And then for the summary network, the
summary network essentially needs to be
866
:informed by the data.
867
:So if you have time series, you would
868
:look at something like an LSTM.
869
:So these, like, long short-term memory time
series neural networks.
870
:Or you would have, like, a recurrent neural
network or nowadays a time series
871
:transformer.
872
:They're also called temporal fusion
transformers.
873
:If you have IID data, you would have
something like a deep set or a set
874
transformer, which respects this
exchangeable structure of the data.
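A minimal sketch of why a deep-set style summary suits IID data: pooling over observations makes the output independent of their order, whereas a network applied to the flattened data set is not. The random weights stand in for trained ones, and the layer sizes and function names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N_OBS, N_FEATURES = 50, 3

W_inner = rng.normal(size=(N_FEATURES, 16))        # applied to every observation separately
W_outer = rng.normal(size=(16, 4))                 # applied after pooling
W_flat = rng.normal(size=(N_OBS * N_FEATURES, 4))  # naive layer on the flattened data set

def deep_set_summary(x):
    """Embed each observation, average over observations, then transform the pooled vector."""
    pooled = np.tanh(x @ W_inner).mean(axis=0)
    return np.tanh(pooled @ W_outer)

def flat_summary(x):
    """Naive alternative: treats position in the data set as meaningful."""
    return x.reshape(-1) @ W_flat

x = rng.normal(size=(N_OBS, N_FEATURES))
x_shuffled = x[rng.permutation(N_OBS)]

print(np.allclose(deep_set_summary(x), deep_set_summary(x_shuffled)))
print(np.allclose(flat_summary(x), flat_summary(x_shuffled)))
```

Because of the mean pooling, the deep-set summary also accepts data sets with a different number of observations, while the flattened variant is tied to exactly `N_OBS` rows.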
875
:So again, we can give all the
recommendations and sensible default
876
:values like
877
:If you have a time series, try a time
series transformer.
878
:Then again, if things don't work out,
users need to play around with these
879
:settings.
880
:So that's definitely one hardship of
amortized Bayesian inference in general.
881
:And for the second part of your question,
hardships of this deep fusion.
882
:It's essentially if you have more and more
information sources, then things can get
883
:very complicated.
884
:For example, just a few days ago, we discussed a
885
:case where someone has 60 different
sources of information and they're all
886
:streams of time series.
887
:Now we could say, hey, just slap 60
summary networks on this problem, like one
888
:summary network for each domain.
889
:That's going to be very complex and very
hard to train, especially if we don't
890
:bring that many data sets to the table for
the neural network training.
891
:And so there we somehow need to find a
compromise.
892
:Okay, what information can we condense and
group together?
893
:So maybe some of the time series sources
are somewhat similar and actually
894
:compatible with each other.
895
:So we could, for example, come up with six
groups of 10 time series each.
896
:Then we would only need six neural
networks for the summary embeddings and
897
:all these practical considerations.
898
:That makes things just like as hard as in
likelihood-based, MCMC-based inference, but
899
:just a bit harder because of all the
neural network stuff that's happening.
900
:Did this address your question?
901
:Yeah.
902
:Yeah.
903
:It gives me more questions, but yeah, for
sure.
904
:That does answer the question.
905
:When you're talking about transformer for
time series, are you talking about the
906
:transformers, the neural network that's
used in large language models or is it
907
:something else?
908
:It's essentially the same, but slightly
adjusted for time series so that the...
909
:statistics or these latent embeddings that
you output still respect the time series
910
:structure where typically you would have
this autoregressive structure.
911
:So it's not exactly the same as a standard
transformer, but you would just enrich it
912
:to respect the probabilistic structure in
your data.
913
:But at the core, it's just the same.
914
:So at the core, it's an attention
mechanism, like multi-head attention
915
:where
916
:Like the different parts of your dataset
could essentially talk or listen to each
917
:other.
918
:So it's just the same.
919
:Okay.
920
:Yeah, that's interesting.
921
:I didn't know that existed for time
series.
922
:That's interesting.
923
:That means, so because the transformer
takes like one of the main thing is you
924
:have to tokenize the inputs.
925
:Right?
926
:So here you would tokenize like that there
is a tokenization happening of the time
927
:series data.
928
:You don't have to tokenize here because
the reason why you have to tokenize
929
:in large language models or natural
language processing in general is that you
930
:want to somehow encode your characters or
your words
931
:into numbers, essentially, and
we don't need that in Bayesian inference
932
:in general because we already have numbers.
Yeah. So our data already comes in numbers,
933
:so we don't need tokenization here.
934
:Of course if we had text data
935
:Then we would need tokenization.
936
:Yeah.
937
:Yeah.
938
:Yeah.
939
:OK.
940
:OK.
941
:Yeah, it makes more sense to me.
942
:All right, that's fun.
943
:I didn't know that existed.
944
:Do you have any resources about
transformer for time series that we could
945
:put in the show notes?
946
:Absolutely.
947
:There is a paper that's called Temporal
Fusion Transformers, I think.
948
:I will send you the link.
949
:yeah.
950
:Awesome.
951
:Yeah, thanks.
952
:Definitely.
953
:We have this time series transformer,
temporal fusion transformer implemented
954
:in BayesFlow.
955
:So now it's just like a very usable
interface where you would just input your
956
:data and then you get your latent
embeddings.
957
:You can say like, I want to input my data
and I want as an output 20 learned summary
958
:statistics.
959
:So that's all you need to do there.
960
:Okay.
961
:And you can go crazy.
962
:So what would you do with it?
963
:Good.
964
:Yeah, what would you do with these
results?
965
:Basically the outputs of the transformer,
what would you use that for?
966
:Those are the learned summary statistics.
967
:That you would then treat as a compressed
fixed length version of your data for the
968
:posterior network for this generative
model.
969
:So then you use that afterwards in the
model?
970
:Exactly.
971
:Yeah.
972
:So the transformer is just used to learn
summary statistics of the data sets that
973
:we input.
974
:For instance, if you have time series,
like we did this for COVID time series.
975
:If you have a COVID time series
976
:for, like, a three-year period with
daily reporting, you would have a
977
:time series with about a thousand time
steps.
978
:That's quite long as a condition into a
neural network to pass in there.
979
:And also like if now you don't have a
thousand days, but a thousand and one
980
:days, then the length of your input to the
neural network would change and your
981
:neural network wouldn't do that.
982
:So what you do with a time series
transformer is compress this time series
983
:of maybe 1,000 or maybe 1,050 time steps
into a fixed length vector of summary
984
:statistics.
985
:Maybe you extract 200 summary statistics
from that.
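The compression from a variable-length series to a fixed-length vector can be sketched with a single attention-pooling step. This is not the temporal fusion transformer and not BayesFlow's implementation; the learned-query pooling, the normalized time index used as a positional feature, and all sizes and names are assumptions for illustration, with random weights standing in for trained ones.

```python
import numpy as np

rng = np.random.default_rng(0)
D_MODEL, N_SUMMARY = 16, 200

W_in = rng.normal(size=(2, D_MODEL))              # embeds (value, time index) per time step
queries = rng.normal(size=(N_SUMMARY, D_MODEL))   # one query vector per summary statistic
w_out = rng.normal(size=D_MODEL)                  # reduces each pooled vector to one number

def summarize(series):
    """Compress a series of any length into N_SUMMARY numbers via attention pooling."""
    n_steps = series.shape[0]
    inputs = np.column_stack([series, np.linspace(0.0, 1.0, n_steps)])
    tokens = np.tanh(inputs @ W_in)                        # (n_steps, D_MODEL)
    scores = queries @ tokens.T / np.sqrt(D_MODEL)         # (N_SUMMARY, n_steps)
    scores = scores - scores.max(axis=1, keepdims=True)    # numerically stable softmax
    attention = np.exp(scores)
    attention = attention / attention.sum(axis=1, keepdims=True)
    return (attention @ tokens) @ w_out                    # (N_SUMMARY,)

print(summarize(rng.normal(size=1000)).shape)  # (200,)
print(summarize(rng.normal(size=1001)).shape)  # (200,)
```

Each query attends over however many time steps are present, so adding a day to the series changes nothing about the shape of what the posterior network receives.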
986
:Hey, okay, I see.
987
:And then you can use that in your neural
network, in the model that's going to be
988
:sampling your model.
989
:In the neural network that's going to be
sampling your model.
990
:We already see that we're heavily
overloading terminology here.
991
:So what's a model actually?
992
:So then we have to differentiate between
the actual Bayesian model that we're
993
:trying to fit.
994
:And then the neural network, the
generative model or generative neural
995
:network that we're using as a replacement
for MCMC.
996
:So it's, it's a lot of this taxonomy
that's, that's odd when you're at the
997
:interface of deep learning and statistics.
998
:Another one of those hiccups are
parameters.
999
:Like, in Bayesian inference, parameters are
your inference targets.
::
So you want posterior distributions on a
handful of model parameters.
::
When you talk to people from deep learning
about parameters,
::
they understand the neural network
weights.
::
So sometimes you have to be careful with
the, I have to be careful with the
::
terminology and words used to describe
things because we have different types of
::
people going on different levels of
abstraction here in different functions.
::
Yeah.
::
Yeah, exactly.
::
So that means in this case, it's the
transformer takes in time values, it
::
summarizes them.
::
And it passes that on to the neural
network that's going to be used to sample
::
the Bayesian model.
Exactly. And they are passed in as the conditions, as in conditional probability, which totally makes sense because this generative neural network learns the distribution of the parameters conditional on the data, or on summary statistics of the data. So that's the exact definition of the Bayesian posterior distribution: a distribution of the Bayesian model parameters conditional on the data. It's the exact definition of the posterior.
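In symbols, this could be sketched as follows. The notation is an assumption for illustration and was not used in the conversation: θ stands for the Bayesian model parameters, x for the data, h_ψ for the summary network (for instance the time series transformer), and q_φ for the generative neural network.

```latex
p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{p(x)},
\qquad
q_\phi\bigl(\theta \mid h_\psi(x)\bigr) \approx p(\theta \mid x)
```

The left-hand side is Bayes' rule for the posterior. The right-hand side says that the generative network, conditioned on the learned summaries, is trained to approximate that same posterior.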
Yeah, I see. And that means... So in this case, yeah, no, I think my question was going to be: why would you use this kind of additional layer on the time series data? But you've already answered that: what if your time series data is too big, or something like that?
Exactly. It's not just about being too big, but also about variable length. Because the generative neural network always wants fixed-length inputs. In the case of the COVID model, it could only handle input conditions of length 200. And the time series transformer handles the fact that our actual raw data have variable length. Time series transformers can handle data of variable length. So they would, you know, just take a time series of however many time steps and then always compress it to 200 summary statistics. So this generative neural network, which is much more strict about the shape and form of the input data, will always see inputs of the same length.
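To make the shape argument concrete, here is a minimal sketch in plain Python. It is not a time series transformer and not BayesFlow code: the function `summarize`, its random untrained "query" weights, and the toy sine series are all hypothetical. It only illustrates the property discussed above, namely that an attention-style weighted average maps a series of any length to a fixed number of summaries.

```python
import math
import random


def summarize(series, num_summaries=200, seed=0):
    """Compress a 1-D series of any length into `num_summaries` numbers.

    Hypothetical stand-in for a learned summary network: every output is
    an attention-style weighted average of the series, using fixed random
    weights instead of trained ones.
    """
    if not series:
        raise ValueError("series must contain at least one time step")
    rng = random.Random(seed)
    n = len(series)
    summaries = []
    for _ in range(num_summaries):
        # One random "query": how much to attend to value and to position.
        a, b = rng.uniform(-3, 3), rng.uniform(-3, 3)
        scores = [a * x + b * (t / max(n - 1, 1)) for t, x in enumerate(series)]
        # Softmax over time steps (shifted by the max for numerical stability).
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        summaries.append(sum(w * x for w, x in zip(weights, series)) / total)
    return summaries


# Two raw series of different lengths give summaries of the same length.
short_series = [math.sin(0.01 * t) for t in range(1000)]
long_series = [math.sin(0.01 * t) for t in range(1050)]
print(len(summarize(short_series)), len(summarize(long_series)))
```

In a real amortized workflow the summary network's weights would be trained jointly with the generative network, so the 200 numbers would be informative about the parameters rather than random projections.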
Yeah. Okay. Yeah, I see. That makes sense. Awesome. Yeah, super cool. And so, as you were saying, this is already available in BayesFlow; people can use this kind of transformer for time series.
Yeah, absolutely. For time series and also for sets. So for IID data.
Yeah.
Because if you just take an IID data set and input it into a neural network, the neural network doesn't know that your observations are exchangeable. So it will assume much more structure than there actually is in your data. So again, it has a dual function: compressing the data and encoding the probabilistic structure of the data, while also outputting a fixed-length representation. So this would be a set transformer, or a deep set is another option. It's also implemented in BayesFlow.
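The exchangeability point can be illustrated with a small deep-set-style sketch. This is an assumption-laden toy, not the BayesFlow implementation: the hand-written `embed` function stands in for a learned per-observation network, and mean pooling is one common choice of permutation-invariant aggregation.

```python
import math


def embed(x):
    """Hypothetical per-observation features; a real deep set learns these."""
    return [x, x * x, math.tanh(x)]


def deep_set_summary(observations):
    """Embed each observation separately, then average across the set.

    Averaging ignores the order of the observations, so the summary is
    permutation invariant and has a fixed length for any set size.
    """
    embedded = [embed(x) for x in observations]
    n = len(embedded)
    # math.fsum returns a correctly rounded sum, so the result does not
    # depend on the order in which the observations are added.
    return [math.fsum(column) / n for column in zip(*embedded)]


data = [0.3, -1.2, 2.5, 0.8]
shuffled = [2.5, 0.8, 0.3, -1.2]
print(deep_set_summary(data) == deep_set_summary(shuffled))
print(len(deep_set_summary(data)), len(deep_set_summary(data + [1.1, -0.4])))
```

A plain fully connected network fed the raw list would treat position 1 and position 2 as different inputs, which is the extra structure mentioned above that IID data do not have.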
Super cool. Yeah. And so let's start winding down here, because I've already taken a lot of your time. Maybe a last few questions: what are some emerging topics that you see within deep learning and probabilistic machine learning that you find particularly intriguing? Because we've been talking a lot here about the nitty-gritty, the statistical detail, and so on, but now let's zoom out a bit and start thinking more long-term.
Yeah. I'm very excited about two large topics. The first one is generative models that are very expressive, so unconstrained neural network architectures, but at the same time have one-step inference. For example, people have been using score-based diffusion models or flow matching a lot for image generation, like Stable Diffusion. You might be familiar with this tool: you input a text prompt and then you get fantastic images. Now, this takes quite some time. So a few seconds for each image, but only because it runs on a fancy cluster. If you run it locally on a computer, it takes much longer. And that's because the score-based diffusion model needs many discretization steps in this denoising process at inference time.
And throughout the last year, there have been a few attempts at having these very expressive and super powerful neural networks, but much, much faster, because they don't have these many denoising steps. Instead, they directly learn one-step inference. So they could generate an image not in a thousand steps, but in only one step. And that's very cutting edge, or bleeding edge if you will, because they don't work that great yet. But I think there's much potential in there, because it's both expressive and fast. And then again, we've used some of those for amortized Bayesian inference. We use consistency models, and they have super high potential in my opinion. So with these advances in deep learning, oftentimes we can use them for amortized Bayesian inference. We just reformulate these generative models and slightly tune them to our tasks. So I'm very excited about this.
And the second area I'm very excited about is foundation models. I guess most people in AI are these days. The idea behind foundation models is essentially this: neural networks are very good at in-distribution tasks. So whatever is in the training data set, neural networks are typically very good at finding patterns that are similar to what they saw in the training set. Now in the open world, if we are out of distribution, so we have a domain shift, distribution shift, model misspecification, however you want to call it, neural networks typically aren't that good. So what we could do is either make them slightly better out of distribution, or we just extend the in-distribution to a huge space. And that's what foundation models do. For example, GPT-4 would be a foundation model, because it's just trained on so much data. I don't know how much; it's not terabytes anymore. It's essentially the entire internet. So it's just a huge training set. And so the world that this neural network has been trained on is just huge, and essentially we don't really have out-of-distribution cases anymore, just because our training set is so huge. And that's also one area that could be very useful for amortized Bayesian inference, to overcome the very initial shortcoming that you talked about, where we would also like to amortize over different Bayesian models.
Hmm. I see. Yeah, yeah, yeah. Yeah, that would definitely be super fun. I'm really impressed and interested to see this interaction of deep learning, artificial intelligence, and then the Bayesian framework coming on top of that. That is really super cool. I love that. It makes me super curious to try that stuff out. So, to play us out, Marvin: this is a very active area of research. What advice would you give to beginners interested in diving into this intersection of deep learning and probabilistic machine learning?
That's a great question. Essentially, I would have two recommendations. The first one is to really try to simulate stuff. Whatever it is that you are curious about, just try to write a simulation program and simulate some of the data that you might be interested in. For example, if you're really interested in soccer, then code up a simulation program that just simulates soccer matches and their outcomes. That way you can really get a feeling for the data-generating processes that are happening, because probabilistic machine learning at its very core is all about data-generating processes and reasoning about these processes.
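As one possible starting point for the soccer example, here is a small simulator in plain Python. Everything in it is an assumption made for illustration: goals are modeled as independent Poisson draws, and the scoring rates (1.6 for the home team, 1.1 for the away team) are made-up numbers, not estimates from real matches.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw one Poisson(rate) sample with Knuth's multiplication method."""
    threshold = math.exp(-rate)
    count, product = 0, 1.0
    while True:
        product *= rng.random()
        if product <= threshold:
            return count
        count += 1


def simulate_match(home_rate, away_rate, rng):
    """Simulate one match and return (home_goals, away_goals)."""
    return sample_poisson(home_rate, rng), sample_poisson(away_rate, rng)


def simulate_season(n_matches=1000, home_rate=1.6, away_rate=1.1, seed=1):
    """Simulate many matches and count the three possible outcomes."""
    rng = random.Random(seed)
    outcomes = {"home_win": 0, "draw": 0, "away_win": 0}
    for _ in range(n_matches):
        home_goals, away_goals = simulate_match(home_rate, away_rate, rng)
        if home_goals > away_goals:
            outcomes["home_win"] += 1
        elif home_goals < away_goals:
            outcomes["away_win"] += 1
        else:
            outcomes["draw"] += 1
    return outcomes


print(simulate_season())
```

Changing the rates and watching how the outcome counts move is a quick way to build intuition for the data-generating process. The same kind of simulator, with the rates drawn from a prior, is what simulation-based inference would use to produce training data.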
And I think it was Richard Feynman who said, "What I cannot create, I do not understand." That's essentially at the heart of simulation-based inference in a more narrow setting, and of probabilistic machine learning, or even science, more broadly. So yeah, simulating and running simulation studies can be super helpful, both to understand what's happening in the background and to get a feeling for programming and get better at it. Then the second piece of advice would be to find a balance between these hands-on, getting-your-hands-dirty types of things, like implementing a model in PyTorch or Keras or solving some Kaggle tasks, just some machine learning tasks, and, at the same time, reading books and finding new information, to make sure that you actually know what you're doing, and also know what you don't know and what the next steps are to get better on the theoretical side.
And there are two books that I can really recommend. The first one is Deep Learning by Ian Goodfellow. It's also available for free online. You can also link to this in the show notes. It's a great book and it covers so much. And if you come from a Bayesian or statistics background, you see a lot of conditional probabilities in there, because a lot of deep learning is just conditional generative modeling. And the second book would in fact be Statistical Rethinking by Richard McElreath. It's a great book, and it's not limited to Bayesian inference: there's also a lot of causal inference, of course, and also just thinking about probability and the philosophy behind this whole probabilistic modeling topic more broadly. Earlier today, I had a chat with one of the student assistants that I'm supervising, and he said, "Hey Marvin, I read Statistical Rethinking a few weeks ago. And today I read something about score-based diffusion models", so these state-of-the-art deep learning models that are used to generate images. He said, "Because I read Statistical Rethinking, it all made sense. There's so much probability going on in these score-based diffusion models, and Statistical Rethinking really helped me understand that." At first I couldn't believe it, but it totally makes sense, because Statistical Rethinking is not just a book about Bayesian workflow and Bayesian modeling, but more about reasoning about probabilities and uncertainty in a more general way. And it's a beautiful book. So I'd recommend those.
Nice. Yeah. So definitely let's put those two in the show notes, Marvin. I will. Of course I've read Statistical Rethinking several times, so I definitely agree. The first one, about deep learning, I haven't read yet, but I definitely will, because that sounds really fascinating. So I really want to get that book. Fantastic. Well, thanks a lot, Marvin. That was really awesome. I really learned a lot. I'm pretty sure listeners did too, so that's super fun. You definitely need to come back to do a modeling webinar with us and show us in action what we talked about today with the BayesFlow package. It's also, I guess, going to inspire people to use it and maybe contribute to it. But before that, of course, I'm going to ask you the last two questions I ask every guest at the end of the show. First one: if you had unlimited time and resources, which problem would you try to solve?
That's a very loaded question because there are so many very, very important problems to solve. Big-picture problems like peace, world hunger, global warming, all those. I'm afraid that, with my background, I don't really know how to contribute significantly, with a huge impact, to those problems. So my consideration is essentially a trade-off between how important the problem is, what impact solving or addressing it has, and what impact I could have on solving it. And so I think what would be very nice is to make probabilistic inference, or Bayesian inference in particular, accessible, usable, easy, and fast for everyone. And that doesn't just mean methods and machine learning researchers, but essentially anyone who works with data in any way. And there's so much to do. The actual Bayesian model in the background could be huge, like a BayesGPT: like ChatGPT, but just for Bayes. Just with the sheer scope of amortization, different models, different settings, and so on. So that's a huge, huge challenge on the backend side. But then on the front end and API side, I think it also has many different sub-problems, because it would mean people could just write down a description of their model in plain-text language, like with a large language model, and not actually specify everything by programming. Maybe also just sketch out some data, like expert elicitation, and all those different topics. I think there's this bigger picture, where thousands of researchers worldwide are working on so many niche topics. But having this overarching BayesGPT kind of thing would be really cool. So I'd probably choose that to work on. It's a very risky thing, so that's why I'm not currently working on it.
Yeah, I love that. That sounds awesome. Feel free to cooperate and collaborate with me on that. I would definitely be down. That sounds absolutely amazing. So send me an email when you start working on that. I'll be happy to join the team. And second question: if you could have dinner with any great scientific mind, dead, alive, or fictional, who would it be?
Again, a very loaded question. Super interesting question. I mean, there are two huge choices. I could either go with someone who's currently alive, because I'd want their take on the current state of the art, future directions, and so on. And the second huge option, which I guess many people would go with, is someone who's been dead for two to three centuries. And I think I'd go with the second choice, so really take someone from way back in the past. And that's for two reasons. Of course, speaking to today's scientists is super interesting and I would love to do that. But they have access to all the state-of-the-art technology and they know about all the latest advancements. So if they have some groundbreaking creative ideas that they come up with, they could just implement them and make them actionable. And the second reason is that today's scientists have a huge platform because they're on the internet. So if they really want to express an idea, they could just do it on Twitter or wherever. So there are other ways to engage with them apart from, you know, having a magical dinner. So I would choose someone from the past, and in particular, I think Ada Lovelace would be super interesting for me to talk to, essentially because she's widely considered the first programmer. The craziest thing about that is that she never had access to a modern computer. She wrote the first program, but the machine wasn't there yet. That's such a huge leap of creativity and genius. And so I'd really be interested: if Ada Lovelace saw what's happening today, all the technology that we have with generative AI, GPU clusters, and all these possibilities, what's the next leap forward? What's today's equivalent of writing the first program without having the computer? I'd really love to know the answer, and there's currently no other way to get it except for your magical dinner invitation. So that's why I'd go with this option.
Yeah. Yeah. No, awesome. Awesome. I love it. That definitely sounds like a marvelous dinner. So yeah. Awesome. Thanks a lot, Marvin. That was really a blast. I'm going to let you go now because you've been talking for a long time; I'm guessing you need a break. But that was really amazing. So yeah, thanks a lot for taking the time. Thanks again to Matt Rosinski for this awesome recommendation. I hope you loved it, Marvin, and also Matt. Me, I did. So that was really awesome. As usual, I'll put resources and a link to your website in the show notes. And also, Marvin is going to add stuff to the show notes for those who want to dig deeper. Thank you again, Marvin, for taking the time and being on this show.
Thank you very much for having me, Alex. I appreciate it.
This has been another episode of Learning Bayesian Statistics. Be sure to rate, review, and follow the show on your favorite podcatcher, and visit learnbayesstats.com for more resources about today's topics, as well as access to more episodes to help you reach a true Bayesian state of mind. That's learnbayesstats.com. Our theme music is Good Bayesian by Baba Brinkman, feat. MC Lars and Mega Ran. Check out his awesome work at bababrinkman.com. I'm your host, Alex Andorra. You can follow me on Twitter at alex_andorra, like the country. You can support the show and unlock exclusive benefits by visiting patreon.com/learnbayesstats. Thank you so much for listening and for your support.
You're truly a good Bayesian.
Change your predictions after taking information
And if you're thinking I'll be less than amazing
Let's adjust those expectations
Let me show you how to be a good Bayesian
Change calculations after taking fresh data in
Those predictions that your brain is making
Let's get them on a solid foundation