Proudly sponsored by PyMC Labs, the Bayesian Consultancy. Book a call, or get in touch!
Our theme music is « Good Bayesian », by Baba Brinkman (feat MC Lars and Mega Ran). Check out his awesome work!
Visit our Patreon page to unlock exclusive Bayesian swag 😉
Takeaways:
- Use mini-batch methods to efficiently process large datasets within Bayesian frameworks in enterprise AI applications.
- Apply approximate inference techniques, like stochastic gradient MCMC and Laplace approximation, to optimize Bayesian analysis in practical settings.
- Explore thermodynamic computing to significantly speed up Bayesian computations, enhancing model efficiency and scalability.
- Leverage the Posteriors Python package for flexible and integrated Bayesian analysis in modern machine learning workflows.
- Overcome challenges in Bayesian inference by simplifying complex concepts for non-expert audiences, ensuring the practical application of statistical models.
- Address the intricacies of model assumptions and communicate effectively to non-technical stakeholders to enhance decision-making processes.
Chapters:
00:00 Introduction to Large-Scale Machine Learning
11:26 Scalable and Flexible Bayesian Inference with Posteriors
25:56 The Role of Temperature in Bayesian Models
32:30 Stochastic Gradient MCMC for Large Datasets
36:12 Introducing Posteriors: Bayesian Inference in Machine Learning
41:22 Uncertainty Quantification and Improved Predictions
52:05 Supporting New Algorithms and Arbitrary Likelihoods
59:16 Thermodynamic Computing
01:06:22 Decoupling Model Specification, Data Generation, and Inference
Thank you to my Patrons for making this episode possible!
Yusuke Saito, Avi Bryant, Ero Carrera, Giuliano Cruz, Tim Gasser, James Wade, Tradd Salvo, William Benton, James Ahloy, Robin Taylor, Chad Scherrer, Zwelithini Tunyiswa, Bertrand Wilden, James Thompson, Stephen Oates, Gian Luca Di Tanna, Jack Wells, Matthew Maldonado, Ian Costley, Ally Salim, Larry Gill, Ian Moran, Paul Oreto, Colin Caprani, Colin Carroll, Nathaniel Burbank, Michael Osthege, Rémi Louf, Clive Edelsten, Henri Wallen, Hugo Botha, Vinh Nguyen, Marcin Elantkowski, Adam C. Smith, Will Kurt, Andrew Moskowitz, Hector Munoz, Marco Gorelli, Simon Kessell, Bradley Rode, Patrick Kelley, Rick Anderson, Casper de Bruin, Philippe Labonde, Michael Hankin, Cameron Smith, Tomáš Frýda, Ryan Wesslen, Andreas Netti, Riley King, Yoshiyuki Hamajima, Sven De Maeyer, Michael DeCrescenzo, Fergal M, Mason Yahr, Naoya Kanai, Steven Rowland, Aubrey Clayton, Jeannine Sue, Omri Har Shemesh, Scott Anthony Robson, Robert Yolken, Or Duek, Pavel Dusek, Paul Cox, Andreas Kröpelin, Raphaël R, Nicolas Rode, Gabriel Stechschulte, Arkady, Kurt TeKolste, Gergely Juhasz, Marcus Nölke, Maggi Mackintosh, Grant Pezzolesi, Avram Aelony, Joshua Meehl, Javier Sabio, Kristian Higgins, Alex Jones, Gregorio Aguilar, Matt Rosinski, Bart Trudeau, Luis Fonseca, Dante Gates, Matt Niccolls, Maksim Kuznecov, Michael Thomas, Luke Gorrie, Cory Kiser, Julio, Edvin Saveljev, Frederick Ayala, Jeffrey Powell, Gal Kampel, Adan Romero, Will Geary, Blake Walters, Jonathan Morgan and Francesco Madrisotti.
Links from the show:
- Sam on Twitter: https://x.com/Sam_Duffield
- Sam on Scholar: https://scholar.google.com/citations?user=7wm_ka8AAAAJ&hl=en&oi=ao
- Sam on Linkedin: https://www.linkedin.com/in/samduffield/
- Sam on GitHub: https://github.com/SamDuffield
- Posteriors paper (new!): https://arxiv.org/abs/2406.00104
- Blog post introducing Posteriors: https://blog.normalcomputing.ai/posts/introducing-posteriors/posteriors.html
- Posteriors docs: https://normal-computing.github.io/posteriors/
- Paper introducing Posteriors – Scalable Bayesian Learning with posteriors: https://arxiv.org/abs/2406.00104v1
- Normal Computing scholar: https://scholar.google.com/citations?hl=en&user=jGCLWRUAAAAJ&view_op=list_works
- Thermo blogs: https://blog.normalcomputing.ai/posts/2023-11-09-thermodynamic-inversion/thermo-inversion.html
- https://blog.normalcomputing.ai/posts/thermox/thermox.html
- Great paper on SGMCMC: https://proceedings.neurips.cc/paper_files/paper/2015/file/9a4400501febb2a95e79248486a5f6d3-Paper.pdf
- David MacKay textbook on Sustainable Energy: https://www.withouthotair.com/
- LBS #107 – Amortized Bayesian Inference with Deep Neural Networks, with Marvin Schmitt: https://learnbayesstats.com/episode/107-amortized-bayesian-inference-deep-neural-networks-marvin-schmitt/
- LBS #98 – Fusing Statistical Physics, Machine Learning & Adaptive MCMC, with Marylou Gabrié: https://learnbayesstats.com/episode/98-fusing-statistical-physics-machine-learning-adaptive-mcmc-marylou-gabrie/
Transcript
This is an automatic transcript and may therefore contain errors. Please get in touch if you’re willing to correct them.
Folks, strap in, because today's episode is a deep dive into the fascinating world of large-scale machine learning. And who better to guide us through this journey than Sam Duffield? Currently honing his expertise at Normal Computing, Sam has an impressive background that bridges the theoretical and practical realms of Bayesian statistics, from quantum computation to the cutting edge of AI technology.

In our discussion, Sam breaks down complex topics such as the posteriors Python package, mini-batch methods, approximate inference, and the intriguing world of thermodynamic hardware for statistics. Yeah, I didn't know what that was either. We delve into how these advanced methods, like stochastic gradient MCMC and the Laplace approximation, are not just theoretical concepts but pivotal in shaping enterprise AI models today.

And Sam is not just about algorithms and models: he is a sports enthusiast who loves football, tennis and squash, and he recently returned from an awe-inspiring trip to the Faroe Islands. So join us as we explore the future of AI with Bayesian methods. This is Learning Bayesian Statistics, episode 110, recorded May 31, 2024.
Welcome to Learning Bayesian Statistics, a podcast about Bayesian inference, the methods, the projects, and the people who make it possible. I'm your host, Alex Andorra. You can follow me on Twitter at alex_andorra, like the country. For any info about the show, learnbayesstats.com is Laplace to be. Show notes, becoming a corporate sponsor, unlocking Bayesian merch, supporting the show on Patreon, everything is in there. That's learnbayesstats.com. If you're interested in one-on-one mentorship, online courses, or statistical consulting, feel free to reach out and book a call at topmate.io/alex_andorra. See you around, folks, and best Bayesian wishes to you all.
Sam Duffield, welcome to Learning Bayesian Statistics.

Thanks, thank you very much.

Yeah, thank you so much for taking the time. I invited you on the show because I saw what you guys at Normal Computing were doing, especially with the posteriors Python package. And I am personally always learning new stuff. Right now I'm learning a lot about sports analytics, because that's always been a personal pet project of mine, and Bayesian stats is extremely useful in that field. But I'm also, in conjunction, working a lot on LLMs and their interaction with the Bayesian framework. I've been working much more on the BayesFlow package, which we talked about with Marvin Schmitt in episode 107. So, yeah, I'm working on developing a PyMC bridge to BayesFlow, so that you can write your model in PyMC and then use amortized Bayesian inference for your PyMC models. It's still way, way down the road. I need to learn about all that stuff, but that's really fascinating. I love that. And so of course, when I saw what you were doing with posteriors, I was like, that sounds awesome, I want to learn more about that. So I'm going to ask you a lot of questions, a lot of things I don't know. So that's great. But first, can you give us a brief overview of your research interests and how Bayesian methods play a role in your work?
Yeah, no, I know. Thanks again for the invite. I think, yeah, sports analytics, Bayesian statistics, language models, I think we have a lot to talk about. Should be fun. Bayesian methods in my work, yes. So at Normal we have a lot of problems where we think that Bayes is the right answer, if you could compute it exactly. So what we're trying to do is look at different approximations, how they scale across different methods and different settings, and how we can get as close as possible to the exact Bayes, the exact integral and updating under uncertainty that can provide us with some of those benefits. Yeah.

OK. Yeah. That's interesting. I, of course, agree. Of course. Actually, do you remember when you were first introduced to Bayesian inference? Because you have an extensive background, you've studied a lot. When, in those studies, were you introduced to the Bayesian framework? And also, how did you end up working on what you're working on nowadays?
Yeah, okay. I'll try not to rant too long about this. But yeah, so I guess: mathematics undergraduate at Imperial. I was very young at that stage, we're all very young in our undergraduates, so not really sure what we want to do. At some point, it came to me that statistics, within the field of mathematics, is where I could work on applied problems, and where the field is going. And that's what got me excited. Statistics at undergraduate is different at different places, but, probably in all courses, you get different points of view: you get your frequentist hypothesis testing, and then you have your Bayesian methods as well. And the Bayesian approach really settled with me as being more natural, in the sense that you just write down the equation: you have your forward model and your prior, and then Bayes' theorem handles everything else. As one of the lecturers in my first year said, mathematicians are lazy, they want to do as little as possible. So Bayes' theorem is kind of nice there, because you just write down your likelihood, you write down your prior, and then Bayes' theorem handles the rest. You do the minimum possible work: you have your data, likelihood, prior, and then done. So that was really compelling to me. And that led me to my PhD, which was in the engineering department in Cambridge. I had a few thoughts on what to do for my PhD. There was some more theoretical stuff, but I wanted to get into some problems, get into the weeds a bit. So yeah: the engineering department at Cambridge, working on Bayesian statistics, state space models, and, within state space models, sequential Monte Carlo.
And, for terminology's sake, I use state space model and hidden Markov model to mean the same thing. So yeah, you have this time-series-style data, and working on that sort of data... I feel like the propagation of uncertainty really shines there, because you need to take into account your uncertainty from the previous experiments, say, when you update for your new ones. That was really compelling for me. That was, I guess, my route into Bayesian statistics.
Yeah, okay. Actually, here I could ask you a lot of questions about those time series models. I'm always fascinated by time series models. I don't know, I love them for some reason. I find there is a kind of magic in the ability of a model to take time dependencies into account. I love using Gaussian processes for that. So I could definitely go down that rabbit hole, but I'm afraid then I won't have enough time for you to talk about posteriors.
Let me just say one minute about it. So I'll just say, yeah, Gaussian processes are really cool. You can think of a Gaussian process as a continuous-time (or continuous-space, or whatever the varying axis is; let's call it time) version of a state space model. And a state space model, or hidden Markov model, is to me the canonical extension of a static Bayesian inference model to the time-varying setting. And they kind of unify each other: you can write smoothing in a state space model as one big static Bayesian inference problem, and you can write a static Bayesian inference problem (recovering x from y) as a single step of a state space model. So the techniques that you build just overlap, at least conceptually, on the mathematical level. When you actually get into the approximations and the computation, there are different things to consider, different axes of scalability, but conceptually, I really like that. I probably ranted for a bit more than a minute there, so I apologize.
No, no, that's fine. I love that. Yeah. I have much more knowledge and experience with GPs, but I'm definitely super curious to also apply these state space models and so on. So I'm definitely going to read the paper you sent me about skill rating of football players, where you're using, if I understand correctly, some state space models. That's going to be two birds with one stone. So thanks a lot for writing that.
The whole point of that paper is to say that rating systems, Elo, TrueSkill, are, and should be reframed as, state space models. And then you just have your full Bayesian understanding of them.
Yeah, yeah. I mean, for sure. I'm working myself on a project like that on football data. And yeah, the first thing I was doing is like, okay, I'm going to write the simple model. But then, as soon as I have that down, I'm going to add a GP to it. It's like, I have to take these nonlinearities into account. So yeah, I'm super excited about that. So thanks a lot for giving me some weekend reading.
So actually, now let's go into your posteriors package, because I have so many questions about that. Could you give us an overview of the package, what motivated its development, and also put it in the context of large-scale AI models?
Yeah, so as I said, we at Normal think that Bayes is the right answer. And we're interested in large-scale enterprise AI models. So we need to be able to scale to big, big models, big parameter sizes, and big data at the same time. And this is what the posteriors Python package, built on PyTorch, really hopes to bring. It's built with flexibility and research in mind: we want to try out different methods, for different datasets and different goals, and see what's going to be the best approach for us. That's the motivation of the posteriors package.
When would people use it? For instance, for which use cases would I use posteriors?
There's a lot of just genuinely fantastic Bayesian software out there. But most of it has focused on the full-batch setting, as is classically the case with the Metropolis-Hastings accept step. And we feel like we're moving, or have already moved, into the mini-batch era, the big data era. So posteriors is mini-batch first. So if you have a small model but a lot of data, and you want to try posterior sampling with mini-batches, to see if that can speed up your inference rather than doing full-batch on every step, then posteriors is the place for that, even with small models.
So you can just write down your model in Pyro, in PyTorch, and then use posteriors to do that. But that's moving from classical Bayesian statistics into the mini-batch setting. Then there are also benefits of very crude approximations to the Bayesian posterior in these really large-scale models. So, like language models, big neural networks: you're not going to be able to do your convergence checks and that sort of thing in those models, but you might still be able to get some advantages: out-of-distribution detection, improved predictive performance, a sort of continual learning. And these are the sorts of things we're investigating. If you just trained with gradient descent, essentially, you wouldn't necessarily get these things. But even very crude Bayesian approximations will hopefully provide these benefits. I think I will talk about this more later.
Yeah, okay. So basically, what I understand is that you can use posteriors for basically any model.
So, I mean, we're still very young, and it doesn't have the support of... I don't know, if you want to do Gaussian processes, we're not going to have a whole suite of kernels that you can just type up. But fundamentally, it just takes a function, a log-posterior function, and then you'll be able to try out different methods. But as I said, the big data regime is much less researched, and the big parameter regime is much harder, at least. So it's not going to be a silver bullet. Basically, posteriors is a tool for research a lot of the time, where you're going to research what inference methods you can use, where they fail, and hopefully where they succeed as well.
Okay. Okay. I see. And so, to make sure listeners understand, you can do both in posteriors, right? You can write your model in posteriors and then sample from it? Or is it only model definition, or only model sampling?
So it only does approximate posterior sampling. So you're given some data and you write down the log posterior, or the joint, you could say. It doesn't have the sophisticated support of Stan or PyMC, where you can actually write down the model with support for all the distributions and do forward samples; it leans on other tools like Pyro or PyTorch itself for that. Posteriors is about approximate inference in the posterior space, in the sample space. So you can do a Laplace approximation and these sorts of things and compare them. And importantly, it's mini-batch first: every method only expects to receive data batch by batch, so you can support the large-data regime.
Okay, so I think there are a bunch of terms we need to define here for listeners.

Okay, yeah, sorry about that.

Can you define mini-batch? Can you define approximate inference, and in particular the Laplace approximation?
Okay, so mini-batch is the important one, of course. Yeah, so normally, in traditional Bayesian statistics, if you're running random-walk Metropolis-Hastings or HMC, you will be seeing your whole dataset, all n data points, at every step of the iteration. And there's beautiful theory about that. But a lot of the time in machine learning you have a billion data points. Or if you're doing a foundation model, it's like all of Wikipedia, billions of data points or something like that. And there's just no way that every time you do a gradient step you can sum over a billion data points. So you take, say, ten of them, and you form an unbiased approximation. And this doesn't propagate through the exponential, which you need for the Metropolis-Hastings step. So it rules out a lot of traditional Bayesian methods, but there's still been research on this. This is the scalable Bayesian learning that we talk about with posteriors. So we're investigating mini-batch methods: methods that only use a small amount of the data at each step, as is very common with gradient descent, stochastic gradient descent, in optimization terms. So hopefully that covers it.
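To make the mini-batch idea concrete, here is a minimal, self-contained sketch (a made-up Gaussian toy model, not posteriors code): the sum of log-likelihoods over a random batch, rescaled by N divided by the batch size, is an unbiased estimate of the full-data sum.

```python
import math
import random

# Hypothetical toy model: each data point has a Gaussian log-likelihood.
def log_lik_point(x, mu):
    return -0.5 * (x - mu) ** 2 - 0.5 * math.log(2 * math.pi)

def full_log_lik(data, mu):
    # Full-batch: sum over all N points (what Metropolis-Hastings would need).
    return sum(log_lik_point(x, mu) for x in data)

def minibatch_log_lik(data, mu, batch_size, rng):
    # Mini-batch: sum over a small random batch, rescaled by N / batch_size,
    # which makes it an unbiased estimate of the full sum.
    batch = rng.sample(data, batch_size)
    return len(data) / batch_size * sum(log_lik_point(x, mu) for x in batch)

rng = random.Random(0)
data = [rng.gauss(1.0, 1.0) for _ in range(1000)]

full = full_log_lik(data, 0.5)
# Averaging many mini-batch estimates recovers the full-batch value.
est = sum(minibatch_log_lik(data, 0.5, 10, rng) for _ in range(5000)) / 5000
```

Unbiasedness holds for the log-likelihood sum itself, but, as Sam notes, it does not survive the exponential needed for a Metropolis-Hastings accept step.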
Mini-batches, okay. You said approximate inference. So approximate... okay, yeah, inference is a very loaded term. Maybe I should try not to use it, but when I say approximate inference, I mean approximate Bayesian inference. So you can write down mathematically the posterior distribution, p(theta given y), which is proportional to p(theta) times p(y given theta). But you only have access to pointwise evaluations of that, and potentially even only mini-batch pointwise evaluations. So approximate inference is forming some approximation to that posterior distribution, whether that's a Gaussian approximation or Monte Carlo samples, so just an ensemble of points. So that's approximate inference. And you have different fidelities of this posterior approximation.
Last one, the Laplace approximation. The Laplace approximation is arguably the simplest approximation to the posterior distribution in the machine learning setting. It's just a Gaussian distribution, so all you need to define is a mean and a covariance. You define the mean by running an optimization procedure on your log posterior, or just the log likelihood, and that gives you a point, which gives you your mean. And then... okay, the Laplace approximation gets quite into the weeds, but ideally you then do a Taylor expansion around that point. A second-order Taylor expansion gives you the Hessian, and the negative inverse Hessian would be your approximate covariance. But there are subtleties there, and you can use the Fisher information instead. And yeah, there's lots you can read on it; I'm sure you've had people on the podcast explain it better than me.

Yeah.
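The two-step recipe described here (optimize to find the mode, then use the curvature at the mode) can be sketched in one dimension. This is a hypothetical toy, not posteriors' implementation; the target is deliberately a Gaussian so the fit is exact.

```python
import math

# Toy 1-D log-posterior (hypothetical): N(2, 0.25) up to an additive constant,
# so the Laplace fit should recover mean 2 and variance 0.25 exactly.
def log_post(theta):
    return -0.5 * (theta - 2.0) ** 2 / 0.25

def laplace_1d(log_p, theta0=0.0, lr=0.05, steps=2000, h=1e-4):
    # Step 1: gradient ascent on the log-posterior to find the mode (the mean).
    theta = theta0
    for _ in range(steps):
        grad = (log_p(theta + h) - log_p(theta - h)) / (2 * h)
        theta += lr * grad
    # Step 2: second-order Taylor expansion at the mode; the negative inverse
    # of the (here 1-D) Hessian is the approximate covariance.
    hess = (log_p(theta + h) - 2 * log_p(theta) + log_p(theta - h)) / h ** 2
    return theta, -1.0 / hess

mean, var = laplace_1d(log_post)
```

For a non-Gaussian target the same code runs, but the returned Gaussian is only a local quadratic caricature of the posterior, which is exactly the "crude approximation" caveat from the conversation.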
For Laplace, no. Actually, that's why I asked you to define it.

I'm happy to go down into the weeds if you want.

Yeah, if you think that's useful. Otherwise, we can definitely also do an episode with someone you'd recommend to talk about the Laplace approximation. Something I'd like listeners to understand is: yeah, we say approximation, but at the same time, MCMC is an approximation itself. So that can be a bit confusing. Can you talk about why these kinds of methods, like the Laplace approximation, and I think VI, variational inference, would also fall into this bucket, are called approximations, in contrast to MCMC? What's the main difference here?
Honestly, I would say MCMC is also an approximation in the same terminology, but the difference is that we talk about bias: some methods are asymptotically unbiased, which MCMC is, and stochastic gradient MCMC, which is what posteriors does as well, under some caveats (and there are caveats for normal MCMC as well). But then you have your Gaussian approximations from variational inference and the Laplace approximation. And these are very much approximations in the sense that there's no axis along which you can increase compute to infinity and recover the posterior. You cannot do that with the Gaussian approximations unless your posterior is known to be Gaussian, and the interesting cases like that are limited, Gaussian processes and things. So they don't have this asymptotically unbiased feature that MCMC does, or importance sampling, or sequential Monte Carlo, which is very useful because it allows you to trade compute for accuracy. You can't do that with a Laplace approximation or VI, beyond extending, say, from a diagonal covariance to a full covariance or things like that. And this is very useful in the case that you have extra compute available. So I'm a big fan of the asymptotic unbiasedness property, because it means that you can increase your compute safely. Yeah.
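The "trade compute for accuracy" point can be illustrated with plain Monte Carlo (a generic toy, unrelated to any specific posteriors method): the error of a sample average keeps shrinking as you spend more compute, which is exactly the knob a fixed Gaussian approximation does not offer.

```python
import math
import random

rng = random.Random(0)

def draw(rng):
    # Sample from a skewed toy target, Exponential(1) with true mean 1.0,
    # via inverse-transform sampling.
    return -math.log(1.0 - rng.random())

true_mean = 1.0
rough = sum(draw(rng) for _ in range(100)) / 100            # little compute
refined = sum(draw(rng) for _ in range(100_000)) / 100_000  # more compute
```

The Monte Carlo error decays like one over the square root of the sample count, so the estimate can be made as accurate as the compute budget allows; a misspecified fixed-form approximation keeps its bias no matter how long it runs.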
Yeah. Yeah. Great explanation. Thanks a lot. And so, yeah, as you were saying, there is no asymptotic unbiasedness for these approximations, but at the same time, that means they can be way faster. So if you're in the right use case, then it really makes sense to use them. But you have to be careful about the conditions where the approximation falls down. Can you maybe dive a bit deeper into stochastic gradient descent, which is the method that posteriors is using, and how that fits into these different methods that you just talked about?
Actually, stochastic gradient descent is not a method that posteriors is using per se. Stochastic gradient descent is the workhorse of most machine learning algorithms, but posteriors would kind of say that perhaps it shouldn't be, or not in all cases. Stochastic gradient descent is what you use if you have extremely large data and you just want to find the MLE, the maximum likelihood estimate, or the minimum of a loss, you might say. So that is just an optimization routine: you just want to find the parameters that minimize something. If you're doing variational inference, what you can do is tractably get the KL divergence between your specified variational distribution and the posterior. And then you have parameters: the parameters of the variational distribution over your model parameters. And then you use stochastic gradient descent on that. This is nice, because it means that you can throw the workhorse of machine learning at a Bayesian problem and get a Bayesian approximation out. Again, as we mentioned, it doesn't have the asymptotic unbiasedness feature, which is maybe less of a concern in machine learning models, where you have less ability to trade compute because you've kind of filled your compute budget with your gigantic model. Although we think this might change over the coming years. But yeah, maybe not. Maybe we'll just go even bigger and bigger and bigger. You... okay, sorry, I got lost. You were asking about stochastic gradient descent.
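As a cartoon of the variational-inference recipe just described, with a deliberately easy 1-D Gaussian target so the KL divergence is available in closed form (everything here is a made-up toy, not posteriors' API):

```python
import math

# KL divergence from a variational Gaussian N(m, s^2) to a target posterior
# N(2, 1), in closed form (standard formula for KL between two Gaussians).
def kl_to_target(m, s, m_t=2.0, s_t=1.0):
    return math.log(s_t / s) + (s ** 2 + (m - m_t) ** 2) / (2 * s_t ** 2) - 0.5

def fit(m=0.0, s=0.5, lr=0.05, steps=2000, h=1e-5):
    # Gradient descent on the variational parameters (m, s); numeric central
    # differences keep the sketch dependency-free.
    for _ in range(steps):
        gm = (kl_to_target(m + h, s) - kl_to_target(m - h, s)) / (2 * h)
        gs = (kl_to_target(m, s + h) - kl_to_target(m, s - h)) / (2 * h)
        m, s = m - lr * gm, s - lr * gs
    return m, s

m, s = fit()  # converges to the target's mean 2 and std 1
```

In realistic models the KL (really the negative ELBO) is not available in closed form, so it is estimated from mini-batches and Monte Carlo samples, and the same stochastic-gradient machinery takes over; that is the "workhorse of machine learning thrown at a Bayesian problem" idea.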
So actually, there's something interesting to say here. And that also relates to the main characteristics of posteriors, so that people really understand the use case of posteriors here.

Yeah.
So we didn't want to... okay. So yeah, there's a key thing about the way we've written posteriors, which is that we like, where possible, to have stochastic gradient descent, so optimization, as a limit under some hyperparameter specifications of the algorithms. And it turns out that in a lot of cases... so we talked about MCMC, and then we talked about stochastic gradient MCMC, which are MCMC methods that strictly handle mini-batches. A lot of the time, you can write down the temperature parameter of your posterior distribution. If the temperature is very high, your posterior distribution is heated up: the tails are inflated and it's much closer to a uniform distribution. Take it very cold and it becomes very pointed and focused around the optima. So we write the algorithms so that there's this convenient transition through the temperature: set the temperature to zero and you just get optimization. And this is a key thing about posteriors: the posteriors stochastic gradient MCMC methods have this temperature parameter which, if you set it to zero, becomes a variant of stochastic gradient descent. So you can unify gradient descent and stochastic gradient MCMC, and it's nice: you have your Langevin dynamics, which, tempered down to zero, just becomes vanilla gradient descent; you have underdamped Langevin dynamics, or stochastic gradient HMC, stochastic gradient Hamiltonian Monte Carlo, and if you set the temperature to zero, you've just got stochastic gradient descent with momentum. So yeah, this is a nice thing about posteriors: it unifies these approaches, and it hopefully makes it less scary to use Bayesian approaches, because you know you always have gradient descent, and you can sanity check by just fiddling with the temperature parameter.
442
Okay.
443
So it's like, it's a bit like the
temperature parameter in the, in the
444
transformers that, that like make sure, I
mean, in the LLMs that
445
It's like adding a bit of variation on top
of the prediction stat that the LL could
446
make.
447
Yeah, so it's exactly the same as that.
448
So when you use this in language models or
natural language generation, you
449
temperature the generative distribution so
that the logits get tempered.
450
So if you set the temperature there to
zero, you get greedy sampling.
451
But we're doing this in parameter space.
452
So it's, yeah.
453
It has this, yeah, exactly.
454
Distribution tempering is a broad thing,
particularly in, I'm not going to go too
455
philosophical, but I mean, I've first met
with like tempering, then we thought about
456
it in the settings of sequential Monte
Carlo, and it's like, is it the natural
457
way?
458
Is it something that's natural to do?
459
But in the context of Bayes, because
Bayes' theorem is multiplicative, right,
460
you have your P of theta, P of y given
theta, it kind of makes sense to temper
461
because it means like, okay, I'll just
introduce the likelihood a little bit.
462
and sort of tempering as a natural way to
do it because there's multiplicative
463
feature of Bayes' theorem.
464
So, I kind of settled with me after
thinking about it like that.
465
Yeah, no, I mean, that makes perfect
sense.
466
And I was really surprised to see that was
used in LLMs when I first read about the
467
algorithms.
468
And I was pleasantly surprised because
I've worked a lot on electoral forecasting
469
models.
470
That's how I were introduced to Bayesian
stats.
471
Actually, I've done that without knowing
it.
472
So first I'm using the softmax all the
time because they're called forecasting.
473
Unless you're doing that in the U S you
need a multinomial likelihood.
474
The multinomial needs a probability
distribution.
475
And how do you get that from the softmax
function, which is actually a very
476
important one in the LLM framework.
477
And, and, and also the thing is your
probability is not, it's like the latent.
478
observation of popularity of each party,
but you never observe it, right?
479
And so the polls, you could, you could
like conceptualize them as a tempered
480
version of the true latent popularity.
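As a rough sketch of that idea, with made-up numbers: latent party scores pass through a softmax to give the (never observed) popularity, and a poll can be conceptualized as a flattened, tempered version of it that then feeds a multinomial likelihood. The scores, temperature, and sample size here are purely illustrative:

```python
import numpy as np

def softmax(scores):
    z = np.exp(scores - np.max(scores))
    return z / z.sum()

# Hypothetical latent support scores for three parties.
latent = np.array([1.2, 0.8, 0.3])
popularity = softmax(latent)        # latent popularity (never observed)
poll = softmax(latent / 1.5)        # a "tempered" poll: flatter than the truth

# The poll probabilities then feed a multinomial likelihood,
# e.g. counts out of 1000 respondents:
counts = np.random.default_rng(0).multinomial(1000, poll)
```

Tempering with a temperature above one flattens the distribution, which is one way to model polls being noisier and less decisive than the underlying popularity.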
And so that was really interesting. I was like, damn, this stuff is much more powerful than I thought, because I was applying it only to electoral forecasting models, which is a very niche application of these models, and actually there are so many applications of that in the wild.

Yeah, tempering in general is very widespread, and also, I would say, not particularly well understood. There's been research on this cold posterior effect, which is a somewhat annoying thing for Bayesian modeling of neural networks. As I said, you have this temperature parameter that transitions between optimization and the Bayesian posterior: zero is optimization, one is the Bayesian posterior.
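The interpolation being described is often written as a tempered posterior; conventions vary (some authors temper only the likelihood), but a common form is:

```latex
p_T(\theta \mid y) \;\propto\; \left[ p(\theta)\, p(y \mid \theta) \right]^{1/T},
\qquad
\begin{cases}
T = 1 & \text{Bayesian posterior} \\
T \to 0 & \text{concentrates on the mode (optimization)} \\
0 < T < 1 & \text{the empirically favoured ``cold'' regime}
\end{cases}
```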
And empirically, we see better predictive performance, which is a lot of the time what we care about in machine learning, with temperatures less than one. Which is annoying, because we're Bayesians and we think the Bayesian posterior is optimal for decision-making under uncertainty. But at least in our experiments, we found this so-called cold posterior effect to be much more prominent under Gaussian approximations, which we only believe to be very crude approximations to the posterior anyway. And if we do more MCMC or deep ensemble stuff... We've got a paper we'll be able to put on arXiv shortly that describes deep ensembles. In deep ensembles, you just run gradient descent in parallel with different initializations and batch shuffling. So say you run 10 ensembles, 10 optimizations in parallel, then you've got 10 parameter values at the end: a Monte Carlo approximation to the posterior of size 10.
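A deep ensemble, as described, is just independent optimizations from different random starts. Here is a toy sketch with a stand-in quadratic loss; everything here is illustrative, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(42)

def grad_loss(theta):
    # Toy quadratic loss centred at 3.0 (a stand-in for a neural-net loss).
    return 2.0 * (theta - 3.0)

def train(theta0, lr=0.1, steps=200):
    # Plain gradient descent from one initialization.
    theta = theta0
    for _ in range(steps):
        theta -= lr * grad_loss(theta)
    return theta

# "Deep ensemble": 10 independent optimizations, different initializations.
ensemble = [train(rng.normal(scale=5.0)) for _ in range(10)]
# Treat the 10 parameters as a size-10 Monte Carlo approximation:
prediction_mean = np.mean(ensemble)
```

With a deterministic unimodal loss like this, every member finds the same minimum, which illustrates the point made next: plain ensembling gives no guaranteed posterior spread, and adding the right temperature-scaled noise is what turns these parallel runs into asymptotically unbiased SGMCMC.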
And then we describe in the paper how to get this asymptotically unbiased property by using that temperature. Because, as we said earlier, SGMCMC becomes SGD at temperature zero. So you can reverse this for deep ensembles: you add the noise, and then asymptotically biased deep ensembles become asymptotically unbiased SGMCMC. And in those cases, when you have the non-Gaussian approximation, we found much less of the cold posterior effect. So maybe the cold posterior effect is a natural thing, because tempering isn't really Bayes' theorem anymore. It still needs to be better understood. At least in my head, I'm not fully clear on whether the cold posterior effect is something we should be surprised about.

Okay, yeah. Me neither, if that makes you feel any better, because I just learned about it. So I don't have any strong opinion.
Okay, I think we're getting clearer now on what Posteriors is, for listeners. So then, one of the last questions about the algorithms underlying all of that: stochastic gradient MCMC. That's where I got confused. I hear "stochastic gradient" and I think stochastic gradient descent, but no, it's SGMCMC, not SGD. So, Posteriors really likes to use SGMCMC. Why would you do that and not use classic MCMC, like HMC from Stan or PyMC?

Yeah, so it's not just for SGMCMC. There's also variational inference, the Laplace approximation, the extended Kalman filter, and we're really excited to have more methods as well as we look to maintain and expand the library. Why would you use SGMCMC? I think we've already touched on this. The thing is, if you've got loads of data, it's just going to be inefficient to sum over all of that data at every iteration of your MCMC algorithm, as Stan would do. And there are mathematical reasons why you can't just subsample in Stan. The Metropolis-Hastings ratio has this exponential of the log posterior, but log space is the only place you can get the unbiased approximation, which is what you need if you did want to naively subsample. So you can't do the Metropolis-Hastings accept-reject, and you have to use different tools. In its simplest form, SGMCMC just omits it and runs a Langevin: it runs your Hamiltonian Monte Carlo without the accept-reject. There's more theory on top of this, and you need to control the discretization error and things like that, but I won't go into the weeds of that.
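In its simplest form this is stochastic gradient Langevin dynamics (SGLD): a Langevin step with a mini-batch gradient and no accept-reject. A toy sketch on a conjugate Gaussian model, where the mini-batch gradient is rescaled to stay unbiased (model, step size, and batch size are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: y_i ~ N(theta, 1), with a broad prior theta ~ N(0, 10^2).
data = rng.normal(2.0, 1.0, size=10_000)

def minibatch_grad_log_post(theta, batch_size=100):
    """Unbiased mini-batch estimate of the gradient of the log posterior."""
    batch = rng.choice(data, size=batch_size, replace=False)
    n = len(data)
    grad_lik = n * np.mean(batch - theta)  # rescale so the estimate is unbiased
    grad_prior = -theta / 10.0**2
    return grad_lik + grad_prior

def sgld(theta0, eps=1e-4, steps=5000):
    theta = theta0
    samples = []
    for _ in range(steps):
        g = minibatch_grad_log_post(theta)
        # Langevin step with no Metropolis-Hastings accept-reject:
        theta += 0.5 * eps * g + np.sqrt(eps) * rng.normal()
        samples.append(theta)
    return np.array(samples)

samples = sgld(0.0)
```

Note the injected noise is scaled to the step size; with a fixed step like this the mini-batch gradient noise contributes extra spread, which is exactly the discretization-error issue the theory worries about.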
Okay, yeah. And that's tied to mini-batching, basically. The power that SGMCMC gives you in a high-data regime is tied to the mini-batching, if I understand correctly. That's the difference between MCMC and SGMCMC?

Yeah, stochastic gradient. You can't actually get the exact gradient like you need in Hamiltonian Monte Carlo and for the Metropolis-Hastings step; you only get an unbiased approximation. And then there's theory about this: sometimes you can deploy the central limit theorem, and then you've got a covariance attached to your gradients, and you can do nice theory and improve the convergence like that.
Okay, all clear now. Awesome. And I think that's the first time we talk about that on the show, so it was definitely useful to be extra clear about it, so that listeners understand, and me, myself, so that I understand. Thanks a lot.

In some settings it's actually much simpler, because you remove the tools that you have available to you by removing the Metropolis-Hastings step. So it makes the implementation a bit simpler, but you lose some theory in that. And then a lot of the argument is: if you use a decreasing step size, then your noise from the mini-batch, your noise from the stochastic gradient, decreases as epsilon squared, which is faster. So if you decrease your step size and run for infinite time, then eventually you'll just be running the continuous-time dynamics, which are exact and do have the right stationary distribution. So if you run it with decreasing step size, then you are asymptotically unbiased. But running with decreasing step size is really annoying, because you then don't move as far. As we know from normal MCMC, we want to increase our step size and move and explore the posterior more. So there's lots of research to be done here.
I hope, and I feel, that it's not the last time you'll talk about stochastic gradient MCMC on this podcast.

Yeah. I mean, that sounds super interesting. I'm really interested also to really understand the differences between these algorithms. Right now, that's really at the frontier of research. You not only have a lot of research on how to make HMC more efficient, but you have all these new approximate algorithms, as we said before: variational inference, the Laplace approximation, stuff like that. But also now you have normalizing flows. We talked about that in episode 98 with Marilou Gabrié. Actually, I don't know why I said the second part with a Spanish pronunciation; my Spanish is really available in my brain right now. She's French, so that's Marilou Gabrié. Episode 98, it's in the show notes. And episode 107, I already mentioned it, with Marvin Schmitt, about amortized Bayesian inference. Actually, do you know about amortized Bayesian inference and normalizing flows?

I know a bit about normalizing flows. Amortized Bayesian inference I would be less comfortable with.

Okay. But I mean, if you could explain it...

Yeah, I haven't listened to that episode yet.

Well, we released it yesterday. I'm a bit disappointed, Sam, but that's fine. It's just one day, you know. If you listen to it just after the recording, I'll forgive you. That's okay. No, kidding aside, I'm actually curious to hear you speak about the difference between normalizing flows and SGMCMC. Can you talk a bit about that, if you're comfortable with it?
I can try. It's been a while since I've read about normalizing flows. When I did read about them, I understood them to be essentially a form of variational inference where you define a more elaborate variational family, essentially through a triangular mapping. Someone might say: why can't you just use any neural network as your variational distribution? And it's not so easy, because you need a tractable form. Hang on a second, let me remember. But the thing with normalizing flows is you can get this because they're invertible. That's it. Normalizing flows are invertible. So you can write the change-of-variables formula, and then you can essentially just do maximum likelihood, using these normalizing flows to fit to a distribution.
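The change-of-variables trick being recalled here can be shown with the smallest possible "flow", a single invertible affine map; the numbers are purely illustrative:

```python
import numpy as np

# The smallest possible "flow": an invertible affine map y = a*z + b
# applied to a base variable z ~ N(0, 1).
a, b = 2.0, 1.0

def flow_log_prob(y):
    """Change of variables: log p(y) = log N(z; 0, 1) - log|dy/dz|."""
    z = (y - b) / a                    # invert the flow
    log_base = -0.5 * (z**2 + np.log(2 * np.pi))
    return log_base - np.log(abs(a))   # subtract the log-determinant

# This matches the analytic density of N(b, a^2), so the flow's parameters
# can be fitted to data by straightforward maximum likelihood.
```

Real flows compose many such invertible maps (often with triangular Jacobians, so the determinant stays cheap) and train the parameters by maximizing exactly this log-likelihood, which is why they act as a much richer variational family.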
Whereas SGMCMC doesn't. With normalizing flows, you have to define an ansatz that will fit to your distribution. I think normalizing flows are really exciting and really interesting, but you have to specify that ansatz. So there's another specification on top: rather than just writing the log posterior, you then need to find an approximate ansatz which you think will fit the posterior, or whatever distribution you're targeting. Whereas SGMCMC is just: log posterior, go. Which is sort of what we're trying to do with Posteriors; we're trying to automate, well, not automate, we're trying to research, so much for that. But normalizing flows, as I said, I think it's really interesting that you can get these more expressive variational families through triangular mappings.
Yeah, super interesting. And amortized Bayesian inference is related, in the sense that you first fit a deep neural network on your model, and then once it's fit, you get posterior inference for free, basically. So that's quite different from what I understand SGMCMC to be, but it's also extremely interesting. That's also why I'm hammering you with the different use cases of SGMCMC, so that myself and listeners have a kind of decision tree in their heads: okay, my use case is more appropriate for SGMCMC, or, no, here I'd like to try amortized Bayesian inference, or, here I can just stick to plain vanilla HMC. I think that's very interesting. But thanks for that question that was completely improvised; I definitely appreciate you taking the time to rack your brain about the difference with normalizing flows.

No, I'd love to talk more on that. I'd need to refresh myself. I've written down some notes on normalizing flows, and I was quite comfortable with them, but it's just been a while since I refreshed. So I would love to refresh, and then we can chat about them. Because I'd love to do a project on them, or work on them, because I think they're a great way to fit a distribution to data, which is, after all, a lot of what we do.

Yeah. So that makes me think we should probably do another episode about normalizing flows. So, listeners, if there is a researcher you like who does a lot of normalizing flows and you think would be a good guest on the show, please reach out to me and I'll make that happen.
Now let's get you closer to home, Sam, and talk about Posteriors again. Basically, if I understood correctly, Posteriors aims to address uncertainty quantification in deep learning. Am I right here? And if so, why is this particularly important for neural networks, and how does the package help in managing overconfidence in model predictions?

Yeah, so that's our primary use case. The normal way to use Posteriors is approximate Bayes: we're getting as close to Bayes as we can, which is probably not that close, but still getting somewhere on the way to the Bayesian posterior in big deep learning models. But we built Posteriors to be as modular and general as possible. So, as I said, if you have a classical Bayesian model you could write down in Pyro, but you've got loads of data, then go ahead: Posteriors should be well suited to that. In terms of what advantages we want to see from uncertainty quantification, or this approximate Bayesian inference in deep learning models, there are three key things that we distilled it down to. You mentioned overconfidence in out-of-distribution predictions. So yeah, we should be able to improve our performance in predicting on inputs that we haven't seen in the training set; I'll talk about that in a moment. The second one is continual learning, where we think: if you can do Bayes' theorem exactly, you have your prior, you get some data, you have a likelihood, you get a posterior, then you get some more data, and then your posterior becomes your prior and you do the update. You can just write it like that, if you can do Bayes' theorem exactly.
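Sketching that in the one case where Bayes is exact, a conjugate Gaussian mean model (the numbers are made up): the posterior after each observation becomes the prior for the next, and the result is identical whatever order the data arrives in.

```python
import numpy as np

def gaussian_update(prior_mean, prior_var, y, noise_var=1.0):
    """Conjugate update for y ~ N(theta, noise_var), theta ~ N(mean, var)."""
    post_var = 1.0 / (1.0 / prior_var + 1.0 / noise_var)
    post_mean = post_var * (prior_mean / prior_var + y / noise_var)
    return post_mean, post_var

data = [1.3, 0.7, 1.1, 0.9]

# Sequential updating: the posterior after each datum becomes the next prior.
mean, var = 0.0, 10.0
for y in data:
    mean, var = gaussian_update(mean, var, y)

# Same data in reverse order gives the same posterior:
mean_rev, var_rev = 0.0, 10.0
for y in reversed(data):
    mean_rev, var_rev = gaussian_update(mean_rev, var_rev, y)
```

This order-exchangeability is exactly the "no catastrophic forgetting" property discussed below; it holds for exact Bayes, and the question is how much of it survives the approximations needed in big models.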
And you can extend it even further: with some sort of evolution along your parameters you have a state-space model, and in the exact linear-Gaussian setting you've got a Kalman filter. So continual learning, in this case, is something Bayes' theorem does exactly. And in continual learning research in machine learning settings, they have this term of avoiding catastrophic forgetting. If you just continue to do gradient descent, there's no memory there, so apart from the initialization, you would just forget what you've done previously, and there's lots of evidence for this. Whereas Bayes' theorem is completely exchangeable in the order of the data that you see. So if you're doing Bayes' theorem exactly, there's no forgetting; you're only limited by the capacity of the model. So that's where we see Bayes solving continual learning, but, as I said, you can't do Bayes' theorem exactly in a billion-dimensional model.

And then the last one, we'll call it decomposition of uncertainty in your predictions. So if you just have a gradient descent model and you're predicting someone's reviews, and you have to predict the stars, it will just give you, as you said, your softmax; it'll just give you this distribution over the stars, and that's it. But what you really want is some indication, also for out-of-distribution detection, right: you want to know, okay, am I confident in my prediction? And you might get a review like: the food was terrible, but the service was amazing. And let's say we're perfect modelers of how people review things: we could still have quite a lot of uncertainty on that review, because we don't know how the reviewer weighs those different things. So we might have a completely uniform distribution over the stars for that review, but we'd be confident in that distribution. What Bayes gives you is the ability to do this sort of second-order uncertainty quantification: if you have a distribution over parameters, and so a distribution over the logits at the end, over the predictions, you can split it into what information theory calls aleatoric and epistemic uncertainty. Aleatoric uncertainty, or data uncertainty, is what I just described there: natural uncertainty in the model and the data-generating process. Epistemic uncertainty is uncertainty that would be removed in the infinite-data limit, so that's where the model doesn't know. And that's really important for us to quantify.
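The decomposition mentioned here can be made concrete: with predictions from several plausible parameter draws, the entropy of the averaged prediction splits into expected entropy (aleatoric) plus mutual information (epistemic). A toy sketch with hypothetical five-star predictions:

```python
import numpy as np

def entropy(p):
    p = np.clip(np.asarray(p), 1e-12, 1.0)
    return -np.sum(p * np.log(p))

def decompose(preds):
    """preds: one predictive distribution per posterior parameter sample."""
    total = entropy(preds.mean(axis=0))                # entropy of the average
    aleatoric = np.mean([entropy(p) for p in preds])   # expected entropy
    epistemic = total - aleatoric                      # mutual information
    return total, aleatoric, epistemic

# Every plausible model agrees on a flat 5-star distribution: pure aleatoric
# uncertainty (the mixed-review case from the conversation).
agree = np.tile(np.full(5, 0.2), (4, 1))
# Models confidently disagree: mostly epistemic ("the model doesn't know").
disagree = np.eye(5)[:4]
```

A single point estimate only ever sees the averaged prediction, so the two cases look identical to it; the distribution over parameters is what makes the split possible.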
Okay.

Yeah, I rambled on a bit there. I can, in 30 seconds, elaborate on the point you specifically mentioned about out-of-distribution performance, and improving performance out of distribution. I think it's quite compelling from a Bayesian point of view, because in a supervised learning setting, gradient descent just fits one parameter configuration that's plausible given the training data. Bayes' theorem says: I find the whole distribution of parameter configurations that are plausible given the data. And then when we make predictions, we average over those. So it's perfectly natural to think that a single configuration might overfit, and might just be very confident in its predictions when it sees out-of-distribution data. Averaging over plausible model configurations doesn't necessarily fix a bad model, but it should be more honest to the model and the data-generating process you've specified when you come to your test data. So that's quite a compelling argument, to me, for improving performance on out-of-distribution predictions, like the accuracy of them. And there's a fair bit of empirical evidence for this, with the caveat, again, being that the Bayesian posterior in high-dimensional machine learning models is pretty hard to approximate: cold posterior effect, caveats, things like that.
Okay, yeah, I see. Super interesting. So now I understand better what you have on the Posteriors website about the different kinds of uncertainty. That's definitely something I recommend listeners give a read; I'll put it in the show notes, both your blog post introducing Posteriors and the docs, because combined with your explanation right now, I think it makes that clear. Something I was also wondering: if I understood correctly, the package is built on top of PyTorch, right?

Yeah, that's correct.

Okay. And did I also understand correctly that you can integrate Posteriors with pre-trained LLMs like Llama 2 and Mistral, and you do that with Hugging Face's Transformers package?

So, yeah. Posteriors is open source; we fully support the open-source community for machine learning and for statistics. And we're sort of in the fine-tuning era: there are these open-source models and you can't get away from them. We have Llama 2, Llama 3, Mistral. And basically we want to harness this power, right? But, as I mentioned previously, there are some issues that we'd like to remedy with Bayesian techniques. The majority of these open-source models are built in PyTorch. I'm also a big JAX fan, I use JAX a lot, so I was very happy to see and work with the torch.func sub-library, which basically lets you write your PyTorch code, and use Llama 3 or Mistral with PyTorch, but writing functional code. So that's what we've done with Posteriors. So yeah, with Hugging Face Transformers you can download the models; that's where they're all hosted and how you access them. But then what you get is just a PyTorch model. And then you throw that in and it composes all nicely with the Posteriors updates. Or you write your own new updates in the Posteriors framework and you can use that as well, still with Llama 3 or Mistral.
Okay, nice. And so what does that mean concretely for users? It means you can use these pre-trained LLMs with Posteriors, adding a layer of uncertainty quantification on top of those models?

Yeah. You need data as well; Bayes' theorem is a training theorem. So you take your pre-trained model, which is a transformer, or it could be another type of model, an image model or something like that, and then you give it some new data, which we would call fine-tuning, and then you use Posteriors to combine the two, and you have your new model at the end of the day, which has uncertainty quantification. It's difficult. As I said, we're sort of in this fine-tuning era of open-source large language models, and there's still lots of research to do here. It's different from our classical Bayesian regime, where there's only one source of data and it's what we give the model. In this case, there are two sources of data: whatever Llama 3 saw in its original training data, and then your own data. Can we hope to get uncertainty quantification on the data they used in the original training? Probably not. But we might be able to get uncertainty quantification and improved predictions based on the data that we've provided. So there's lots for us to try out here and learn, because we are still learning about the fine-tuning side. But that's what Posteriors is there for: to make these sorts of questions as easy as possible to ask and answer.
Okay, fantastic. That's so exciting. It's a bit frustrating to me, because I'd love to try that, and learn it, and contribute to that kind of package. At the same time, I have to work, I have to do the podcast, and I have all the packages I'm already contributing to. So I'm like, my god, too many choices.

No, come on, Alex. We're going to see an Alex pull request soon enough.

Actually, does the ability to use these pre-trained transformer models help facilitate the adoption of new algorithms in Posteriors? Because if I understand correctly, you can support new algorithms pretty easily, and you can support arbitrary likelihoods. How do you do that?
I wouldn't say that the existence of the pre-trained models necessarily allows us to support new algorithms. I feel like we've built Posteriors to be suitably general and suitably modular that it's kind of agnostic to your model choice and your log-posterior choice, in terms of arbitrary likelihoods. But yeah, the arbitrary likelihoods are relevant, because a lot of machine learning essentially boils down to classification or regression. And because of that, a lot of machine learning packages will essentially constrain you to classification or regression: at the end, you either have your softmax cross-entropy or you have your mean squared error. In Posteriors, we haven't done that. We're more faithful to the Bayesian setting, where you just write down your log posterior, and you can write down whatever you want. This allows you greater flexibility in the case that you did want to try out a different likelihood, and even simple cases are often more sophisticated than plain classification or regression a lot of the time, like sequence generation, where you have the sequence and then the cross-entropy over all of it. It just allows you to be more flexible and write the code how you want. And there are additional things to be taken into account: sometimes, if you were doing a regression, you might have knowledge of the observation noise variance. If we don't constrain things like this, it's just much easier to write your code, much cleaner code, than if we did. And it's also future-proofing. We don't know what's going to happen going forward. In multimodal models we may see text and images together, in which case we will support that. You have to supply the compute and the data, which might be the harder thing, but we'll support those likelihoods.
Okay, I see. Yeah, that's very interesting. Is that related to the fact that, I think I've read in your blog post or on the website, you say that Posteriors is swappable? What does that mean, and how does that flexibility benefit users?

Yeah. The point of swappable, when I say that, is that you can change between methods. As I said, Posteriors is a research toolbox, and it's for us to investigate which inference method is appropriate in different settings, which might be different if you care about decomposing predictive uncertainty, and different again if you care about avoiding catastrophic forgetting in your continual learning. The way it's written, you can go from SGMCMC to the Laplace approximation, or to VI, just by changing one line of code. And the way it works is you have your build: transform equals posteriors, dot, inference method, dot, build, plus any configuration arguments, step size, things like that, which are algorithm-specific. And after that it's all unified. You have your init around the parameters that you want to do Bayes on, and then you iterate through your data loader, through your data, and it just updates based on each batch. And batch can be very general. So that's what it means.
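The build/init/update pattern being described can be mocked up in plain Python to show why one line swaps the method. These stand-in "methods" are illustrative only, NOT the real posteriors API:

```python
from types import SimpleNamespace

# Two interchangeable toy "inference methods", each exposing the same
# init/update interface via a build() factory (hypothetical stand-ins).

def sgd_build(lr=0.1):
    def init(params):
        return SimpleNamespace(params=params)
    def update(state, batch):
        grad = 2 * (state.params - batch)  # toy gradient
        return SimpleNamespace(params=state.params - lr * grad)
    return SimpleNamespace(init=init, update=update)

def averaging_build():
    def init(params):
        return SimpleNamespace(params=params, n=0)
    def update(state, batch):
        n = state.n + 1
        params = state.params + (batch - state.params) / n  # running mean
        return SimpleNamespace(params=params, n=n)
    return SimpleNamespace(init=init, update=update)

# Swap the inference method by changing this one line:
transform = sgd_build(lr=0.1)          # or: transform = averaging_build()

state = transform.init(0.0)
for batch in [1.0, 2.0, 3.0]:          # stand-in data loader
    state = transform.update(state, batch)
```

Because both builders return the same init/update interface, the training loop below the swap line never changes; that is the design idea behind the swappability.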
So you can just change one line of code to swap between variational inference and SGMCMC, or the extended Kalman filter, or any of the new methods that listeners are going to add in the future.

Okay, I see. I have so many more questions for you about Posteriors, but let's start to wrap that up, because I also want to ask you about another project you're working on. So, maybe to close off on Posteriors: what are the future plans, and are there any upcoming features or integrations you can share with us?

So, we're quite happy with the framework at the moment. There are lots of little tweaks on a list of GitHub issues that we want to go through, which are mostly, and excitingly, about adding new methods and new applications. That's really what we're excited about now: actually using it in the wild, and hopefully experimenting with all these questions we've discussed. Like, how does it make sense, and how do we get the benefits of true Bayesian inference, on fine-tuning, or on large models and large data. So yeah, we're really excited to add more methods. If listeners have mini-batch, big-data Bayesian methods that they want to try out on a large data model, then hopefully we'll accept those. I do promote generality and doing things in a way that's flexible. We want to add methods that somehow feel natural, and one test is whether they extend and compose with other methods. So if something requires a very complicated last layer just for a classification-specific method, we're probably not going to add it. It has to be methods that stick within the Posteriors framework, which is this arbitrary-likelihood, Bayesian, swappable computation.

Okay, yeah, that makes sense. Because you have that vision of wanting it to be a research tool, basically. So it makes sense to keep that under control, let's say. Something I want to ask you in the last few minutes of the show is about thermodynamic computing.
Speaker:
I've seen you, you are working on that.
Speaker:
And you've told me you're working on that.
Speaker:
So yeah, I don't know anything about that.
Speaker:
So can you like, what's that about?
Speaker:
Yeah, so I mean, this is yeah, this is
something that's very normal, normal
Speaker:
computing.
Speaker:
And it's like,
Speaker:
It's something that we have.
Speaker:
Yeah, we have this hardware team.
Speaker:
It's like a full stack AI company.
Speaker:
And we, yeah, on the posterior side, on
the client side, we look at how we can
Speaker:
bring in principle Bayesian uncertainty
quantification and help us solve the
Speaker:
issues with machine learning pipelines
like we've already discussed.
Speaker:
And on the other side, there's lots of
parts to this.
Speaker:
More just like traditional MCMC is
difficult sometimes because
Speaker:
Or just it's just like about simulating
SDEs essentially as what the thermodynamic
Speaker:
hardware is simulating SDEs Normally, you
have this real pain with the step size and
Speaker:
as the mention grows steps, let's get
really small and so SDEs, where do we see
Speaker:
SDEs?
Speaker:
You see SDEs in physics all the time and
physics is real we can use physics so it's
Speaker:
doing so it's building physical hardware
analog hardware that We can hopefully that
Speaker:
evolves as SDEs
Speaker:
then we can harness that SDEs by encoding,
you know, like currents and voltages and
Speaker:
things like that.
Speaker:
So I'm not a physicist, so I don't know
exactly how it is.
Speaker:
But I'm always reassured at how the when I
speak to the hardware team, how simple the
Speaker:
they talk about these things, it's like,
yeah, we can just stick some resistors and
Speaker:
capacitors on a chip, and then it'll then
it'll do this SDE.
Speaker:
So this is the and then we want to use
those SDEs for scientific computation.
Speaker:
And with a real focus on statistics and
machine learning.
Speaker:
So yeah, we want to be able to do an HMC
Speaker:
on device, on an analog device.
Speaker:
The first step is to do it with a
linear case, so we'll have a Gaussian posterior,
Speaker:
or a linear drift in terms of the SDE.
Speaker:
This is an Ornstein-Uhlenbeck process, and
we've developed hardware to do this. It
Speaker:
turns out that with an Ornstein-Uhlenbeck
process, because it has a Gaussian
Speaker:
stationary distribution,
you can input the
Speaker:
precision matrix and output the covariance
matrix, and that's matrix inversion.
Speaker:
And your physical device
just does this.
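The matrix-inversion trick can be sketched in a few lines: simulate dX = -Theta X dt + sqrt(2) dW, whose stationary distribution is N(0, Theta^-1), and read the inverse off the empirical covariance of the trajectory (a numerical toy standing in for the analog device; Theta here is a made-up example matrix):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical symmetric positive-definite precision matrix to "invert".
Theta = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

# Euler-Maruyama simulation of the OU process dX = -Theta X dt + sqrt(2) dW,
# whose stationary distribution is N(0, inv(Theta)).
dt, n_steps, n_burn = 0.01, 150_000, 10_000
x = np.zeros(2)
samples = np.empty((n_steps - n_burn, 2))
for t in range(n_steps):
    x = x - (Theta @ x) * dt + np.sqrt(2 * dt) * rng.standard_normal(2)
    if t >= n_burn:
        samples[t - n_burn] = x

# The empirical covariance of the stationary trajectory approximates inv(Theta).
empirical_cov = np.cov(samples.T)
print(empirical_cov)
print(np.linalg.inv(Theta))
```

The physical chip does the integration loop for free; the noise term is supplied by the physics rather than a random number generator.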
Speaker:
And because it's an SDE, it has noise
and is kind of noise-aware, which is
Speaker:
different to classical analog computation,
which is really old,
Speaker:
but has
historically been plagued by noise.
Speaker:
And it's like, yeah, there's all this
noise in physics.
Speaker:
And because we're doing SDEs, we want the
noise.
Speaker:
So yeah, that's the whole idea.
Speaker:
It's obviously very young, but it's fun.
Speaker:
It's fun stuff.
Speaker:
Yeah.
Speaker:
So that's basically to...
Speaker:
accelerate computing?
Speaker:
That's hardware first, so that computing
is accelerated?
Speaker:
We want to, I mean, it's a baby field.
Speaker:
So we're trying to accelerate different
components.
Speaker:
What we worked out is that the simplest
thermodynamic chip we can build is this
Speaker:
linear chip with the Ornstein-Uhlenbeck
process.
Speaker:
And that can speed things up, with
Speaker:
some error, but it has asymptotic speed-ups
for linear algebra routines, so
Speaker:
inverting a matrix or solving a linear
system.
Speaker:
That's awesome.
Speaker:
In this case, it would speed up a certain
component, but that could be useful in a
Speaker:
Laplace approximation or these sorts of
things, also in machine learning.
Speaker:
Okay, that must be very fun to work on.
Speaker:
Do you have any writing about that that we
can put in the show notes?
Speaker:
Because
Speaker:
I think it'd be super interesting for
listeners.
Speaker:
Yeah, yeah.
Speaker:
We've got the Normal Computing Scholar
page, which has a list of papers, but we also
Speaker:
have more accessible blogs, which I'll
make sure to put in the show notes.
Speaker:
Yeah, yeah, please do because, yeah, I
think it's super interesting.
Speaker:
And yeah, and when you have something to
present on that, feel free to reach out.
Speaker:
And I think that'd be fun to do an episode
about that, honestly.
Speaker:
That'd be great.
Speaker:
Yeah.
Speaker:
Yes, so maybe one last question before
asking you the last two questions.
Speaker:
Like, let's zoom out and be way less
technical.
Speaker:
We've been very technical through the
whole episode, which I love.
Speaker:
But maybe I'm thinking if you have any
advice to give to aspiring developers
Speaker:
interested in contributing to open source
projects like posteriors, what would it
Speaker:
be?
Speaker:
Okay, yeah, I don't know, I don't feel
like I'm necessarily best placed to say
Speaker:
all this, but yeah, I mean, I would just,
the most important thing is just to go for
Speaker:
it, just get stuck in, get in the weeds of
these libraries and see what's there.
Speaker:
And there's loads of people building such
cool stuff in the open source ecosystem
Speaker:
and it's really fun to, honestly, it's
really fun and rewarding to get involved
Speaker:
in it.
Speaker:
So just go for it, you'll learn so much
along the way.
Speaker:
For something more tangible:
Speaker:
I find that when I'm stuck,
when I don't understand
Speaker:
something in code or mathematics, I
often struggle to find it in papers per
Speaker:
se.
Speaker:
And I find that textbooks, I love
textbooks, textbooks are a real
Speaker:
source of gold for this because they
actually go to the depths of explaining
Speaker:
things, without having this sort of horse
in the race style writing that you often
Speaker:
find in papers.
Speaker:
So yeah, get stuck in, check textbooks
if you get lost
Speaker:
or you don't understand.
Speaker:
Or just ask as well.
Speaker:
Open source is all about asking and
communicating and bouncing ideas.
Speaker:
Yeah, yeah, yeah, for sure.
Speaker:
Yeah, that's usually what I do.
Speaker:
I ask a lot and I usually end up
surrounding myself with people way smarter
Speaker:
than me.
Speaker:
And that's exactly what you want.
Speaker:
That's exactly how I learned.
Speaker:
Yeah, on textbooks, I would say I kind of
find the writing boring most of the time,
Speaker:
depends on the textbooks.
Speaker:
And also, it's expensive.
Speaker:
Yeah.
Speaker:
So that's kind of the problem of
textbooks, I would say.
Speaker:
I mean, you often can have them in PDFs,
but I just hate reading the PDF on my
Speaker:
computer.
Speaker:
So, you know, I'd want it as a book object
or to have it on Kindle or something like
Speaker:
that.
Speaker:
But that doesn't really that doesn't
really exist yet.
Speaker:
So.
Speaker:
Could be something that some publishers solve
someday, that'd be cool, I'd love that.
Speaker:
Awesome, Sam, that was great. Thank you so
much, we've covered so many topics and my
Speaker:
brain is burning, so that's a very good
sign. I've learned a lot and I'm sure our
Speaker:
listeners did too. Of course, before
letting you go I'm gonna ask you the last
Speaker:
two questions I ask every guest at the end
of the show. So, one:
Speaker:
If you had unlimited time and resources,
which problem would you try to solve?
Speaker:
I want to decouple the model specification,
the data generating process, how you go
Speaker:
from the something you don't know to the
data you do have,
Speaker:
that's your freedom as a data modeler,
Speaker:
I'd have you separate that from the
inference and the mathematical
Speaker:
computation.
Speaker:
So that's the way you do
your approximate Bayesian inference.
Speaker:
And you want to decouple those.
Speaker:
You want to make it as easy as possible.
Speaker:
Ideally, we just want to be doing that
one.
Speaker:
We just want to be doing the model
specification.
Speaker:
And this is like Stan and PyMC do this
really well.
Speaker:
It's just like,
Speaker:
you write down your model, we'll handle
the rest.
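That separation, "you write down your model, we'll handle the rest", amounts to this in code: the user supplies only a log-posterior function, and a generic inference routine never looks inside it. A minimal sketch with random-walk Metropolis (toy code, not the Stan, PyMC, or posteriors API; the model and numbers are hypothetical):

```python
import numpy as np

# Model specification: all the user writes is a log-posterior.
def log_posterior(theta, data):
    # Hypothetical model: Normal(0, 1) prior, Normal(theta, 1) likelihood.
    log_prior = -0.5 * theta**2
    log_lik = -0.5 * np.sum((data - theta) ** 2)
    return log_prior + log_lik

# Inference: a generic random-walk Metropolis that only sees log_posterior.
def metropolis(log_post, data, n_samples=5_000, step=0.5, seed=0):
    rng = np.random.default_rng(seed)
    theta, lp = 0.0, log_post(0.0, data)
    samples = np.empty(n_samples)
    for i in range(n_samples):
        prop = theta + step * rng.standard_normal()
        lp_prop = log_post(prop, data)
        if np.log(rng.uniform()) < lp_prop - lp:  # accept/reject
            theta, lp = prop, lp_prop
        samples[i] = theta
    return samples

data = np.array([1.2, 0.8, 1.0, 1.4, 0.6])
draws = metropolis(log_posterior, data)
print(draws[1000:].mean())  # posterior mean is analytically sum(data) / (n + 1)
```

Swapping the sampler (HMC, stochastic gradient MCMC, a variational method) changes nothing on the model side, which is exactly the decoupling being described.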
Speaker:
And that's kind of like the dream we have
as Bayesian or Bayesian software
Speaker:
developers.
Speaker:
And so with posteriors, we're trying
to do something like this, to
Speaker:
move towards this for big machine
learning models, so bigger models,
Speaker:
bigger data settings.
Speaker:
So that's kind of the dream there.
Speaker:
But then in machine learning, what does
machine learning have that's different from
Speaker:
statistics in that setting?
Speaker:
It's like, well, machine learning models
are less interesting than classical
Speaker:
Bayesian models.
Speaker:
The thing is they're more transferable,
right?
Speaker:
It's just a neural network, which we
believe, in machine learning, will solve
Speaker:
a whole suite of tasks.
Speaker:
So perhaps in terms of the machine
learning setting, where we decouple
Speaker:
modeling and inference and data, you kind
of want to remove the model one as well.
Speaker:
You want to have these general purpose
foundational models, you could say.
Speaker:
So really you want to let the user focus.
Speaker:
And so we're handling the inference.
Speaker:
We're also handling the model.
Speaker:
So really let the user just give it the
data and say, okay, let's do this data and
Speaker:
let's use this data to predict other
things and let the user handle that.
Speaker:
So that's potentially like a real
unlimited time and resources one.
Speaker:
Plenty of resources needed to do that.
Speaker:
But yeah, that's Sam May 2024's answer.
Speaker:
Yeah.
Speaker:
Yeah, that sounds...
Speaker:
That sounds amazing.
Speaker:
I agree with that.
Speaker:
That's a fantastic goal.
Speaker:
And yeah, also that reminds me, that's
also why I really love what you guys are
Speaker:
doing with posteriors, because it's like,
yeah, now that we start being
Speaker:
able to get there, making Bayesian
inference really scalable to really big
Speaker:
data and big models.
Speaker:
I'm super enthusiastic about that.
Speaker:
it would be just fantastic.
Speaker:
So thank you so much for taking the time
to do that guys.
Speaker:
Yeah we're doing it, we're gonna get
there.
Speaker:
Yeah yeah yeah I love that.
Speaker:
And second question, if you could have
dinner with any great scientific mind, dead,
Speaker:
alive, or fictional, who would it be?
Speaker:
Yeah, I was a bit intimidated by this
question.
Speaker:
Yeah you know you ask everyone.
Speaker:
again, it's a great question.
Speaker:
But then I thought about it for a little
bit.
Speaker:
And it wasn't too hard for me.
Speaker:
I think that David MacKay is someone who,
yeah, I mean, did amazing work.
Speaker:
David MacKay was doing Bayesian neural
networks in 1992.
Speaker:
And that's like, yeah, crazy, that's
before I was born.
Speaker:
Anyway, Bayesian neural networks in 1992,
then I've just been going through his
Speaker:
textbook, as I said, I love textbooks, so
going through his textbooks on information
Speaker:
theory and
Speaker:
Bayesian statistics, he is a Bayesian, or was a
Bayesian, so information theory and
Speaker:
statistics.
Speaker:
And there's something that he says like
right at the start of the textbook is
Speaker:
like, one of the themes of this book is
that data compression and data modeling
Speaker:
are one and the same.
Speaker:
And that's just really beautiful.
Speaker:
And he talked about stream codes, which is in
a very information theory style setting,
Speaker:
but it's just an auto-regressive
prediction model, just like a language
Speaker:
model.
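MacKay's "compression and modeling are one and the same" can be made concrete: the ideal code length of data under a model is its negative log-probability, so a model that predicts the data better compresses it better (a hypothetical two-symbol example, not from MacKay's book):

```python
import math

# Toy text over a two-symbol alphabet, 12 a's and 4 b's.
text = "aaab" * 4

def code_length_bits(text, prob_a):
    """Ideal code length (bits) under a model assigning P(a) = prob_a."""
    return sum(-math.log2(prob_a if c == "a" else 1 - prob_a) for c in text)

uniform_model = code_length_bits(text, 0.5)   # knows nothing: 1 bit per symbol
fitted_model = code_length_bits(text, 0.75)   # matches the data frequencies
print(uniform_model, fitted_model)
```

The better predictive model assigns the data higher probability and hence a shorter ideal code, which is the sense in which an autoregressive language model is also a compressor.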
Speaker:
So it's just someone with the ability to
distill information,
Speaker:
to help with
unification, and be so ahead of his time.
Speaker:
And then additionally, with a sort of like
groundbreaking book on sustainable energy.
Speaker:
So, like, also tackling one of the
greatest challenges we have at the moment.
Speaker:
So yeah, that's the sustainable energy
book is really wonderful.
Speaker:
It's one of my favorite books so far.
Speaker:
Nice.
Speaker:
Yeah, definitely put that in the show
notes.
Speaker:
I think.
Speaker:
Yes, definitely.
Speaker:
Yeah.
Speaker:
Yeah, I'd like to keep that one to read.
Speaker:
So
Speaker:
yeah, please also put that in the show notes,
that's going to be fantastic.
Speaker:
Great.
Speaker:
Well, I think we can call it a show.
Speaker:
That was fantastic.
Speaker:
Thank you so much, Sam.
Speaker:
I learned so much and now I feel like I
have to go and read and learn about so
Speaker:
many things.
Speaker:
And I can definitely tell that you are
extremely passionate about what you're doing.
Speaker:
So yeah, thank you so much for
Speaker:
taking the time and being on this show.
Speaker:
No, thank you very much.
Speaker:
I had a lot of fun.
Speaker:
Yeah.
Speaker:
Thank you for, yeah, being party to my
rantings.
Speaker:
I need that sometimes.
Speaker:
Yeah, that's what the show is about.
Speaker:
My girlfriend is extremely, extremely
happy that I have this show to rant about
Speaker:
Bayesian stats and other nerdy stuff.
Speaker:
Yeah, it's so true, yeah.
Speaker:
Well, Sam, you're welcome.
Speaker:
Anytime you need to do some nerdy rant.
Speaker:
thank you.
Speaker:
I'm sure I'll be...
Speaker:
This has been another episode of Learning
Bayesian Statistics.
Speaker:
Be sure to rate, review, and follow the
show on your favorite podcatcher, and
Speaker:
visit learnbayesstats.com for more
resources about today's topics, as well as
Speaker:
access to more episodes to help you reach
a true Bayesian state of mind.
Speaker:
That's learnbayesstats.com.
Speaker:
Our theme music is Good Bayesian by Baba
Brinkman.
Speaker:
Featuring MC Lars and Mega Ran.
Speaker:
Check out his awesome work at
bababrinkman.com.
Speaker:
I'm your host.
Speaker:
Alex Andorra.
Speaker:
You can follow me on Twitter at Alex
underscore Andorra, like the country.
Speaker:
You can support the show and unlock
exclusive benefits by visiting
Speaker:
patreon.com/LearnBayesStats.
Speaker:
Thank you so much for listening and for
your support.
Speaker:
You're truly a good Bayesian.
Speaker:
Change your predictions after taking
information in.
Speaker:
And if you're thinking of me less than
amazing, let's adjust those expectations.
Speaker:
Let me show you how to be a good Bayesian
Change calculations after taking fresh
Speaker:
data in Those predictions that your brain
is making Let's get them on a solid
Speaker:
foundation