Proudly sponsored by PyMC Labs, the Bayesian Consultancy. Book a call, or get in touch!
Our theme music is « Good Bayesian », by Baba Brinkman (feat MC Lars and Mega Ran). Check out his awesome work!
Visit our Patreon page to unlock exclusive Bayesian swag 😉
Takeaways:
- Use mini-batch methods to efficiently process large datasets within Bayesian frameworks in enterprise AI applications.
- Apply approximate inference techniques, like stochastic gradient MCMC and Laplace approximation, to optimize Bayesian analysis in practical settings.
- Explore thermodynamic computing to significantly speed up Bayesian computations, enhancing model efficiency and scalability.
- Leverage the Posteriors Python package for flexible and integrated Bayesian analysis in modern machine learning workflows.
- Overcome challenges in Bayesian inference by simplifying complex concepts for non-expert audiences, ensuring the practical application of statistical models.
- Address the intricacies of model assumptions and communicate effectively to non-technical stakeholders to enhance decision-making processes.
Chapters:
00:00 Introduction to Large-Scale Machine Learning
11:26 Scalable and Flexible Bayesian Inference with Posteriors
25:56 The Role of Temperature in Bayesian Models
32:30 Stochastic Gradient MCMC for Large Datasets
36:12 Introducing Posteriors: Bayesian Inference in Machine Learning
41:22 Uncertainty Quantification and Improved Predictions
52:05 Supporting New Algorithms and Arbitrary Likelihoods
59:16 Thermodynamic Computing
01:06:22 Decoupling Model Specification, Data Generation, and Inference
Thank you to my Patrons for making this episode possible!
Yusuke Saito, Avi Bryant, Ero Carrera, Giuliano Cruz, Tim Gasser, James Wade, Tradd Salvo, William Benton, James Ahloy, Robin Taylor, Chad Scherrer, Zwelithini Tunyiswa, Bertrand Wilden, James Thompson, Stephen Oates, Gian Luca Di Tanna, Jack Wells, Matthew Maldonado, Ian Costley, Ally Salim, Larry Gill, Ian Moran, Paul Oreto, Colin Caprani, Colin Carroll, Nathaniel Burbank, Michael Osthege, Rémi Louf, Clive Edelsten, Henri Wallen, Hugo Botha, Vinh Nguyen, Marcin Elantkowski, Adam C. Smith, Will Kurt, Andrew Moskowitz, Hector Munoz, Marco Gorelli, Simon Kessell, Bradley Rode, Patrick Kelley, Rick Anderson, Casper de Bruin, Philippe Labonde, Michael Hankin, Cameron Smith, Tomáš Frýda, Ryan Wesslen, Andreas Netti, Riley King, Yoshiyuki Hamajima, Sven De Maeyer, Michael DeCrescenzo, Fergal M, Mason Yahr, Naoya Kanai, Steven Rowland, Aubrey Clayton, Jeannine Sue, Omri Har Shemesh, Scott Anthony Robson, Robert Yolken, Or Duek, Pavel Dusek, Paul Cox, Andreas Kröpelin, Raphaël R, Nicolas Rode, Gabriel Stechschulte, Arkady, Kurt TeKolste, Gergely Juhasz, Marcus Nölke, Maggi Mackintosh, Grant Pezzolesi, Avram Aelony, Joshua Meehl, Javier Sabio, Kristian Higgins, Alex Jones, Gregorio Aguilar, Matt Rosinski, Bart Trudeau, Luis Fonseca, Dante Gates, Matt Niccolls, Maksim Kuznecov, Michael Thomas, Luke Gorrie, Cory Kiser, Julio, Edvin Saveljev, Frederick Ayala, Jeffrey Powell, Gal Kampel, Adan Romero, Will Geary, Blake Walters, Jonathan Morgan and Francesco Madrisotti.
Links from the show:
- Sam on Twitter: https://x.com/Sam_Duffield
- Sam on Scholar: https://scholar.google.com/citations?user=7wm_ka8AAAAJ&hl=en&oi=ao
- Sam on Linkedin: https://www.linkedin.com/in/samduffield/
- Sam on GitHub: https://github.com/SamDuffield
- Posteriors paper (new!): https://arxiv.org/abs/2406.00104
- Blog post introducing Posteriors: https://blog.normalcomputing.ai/posts/introducing-posteriors/posteriors.html
- Posteriors docs: https://normal-computing.github.io/posteriors/
- Normal Computing scholar: https://scholar.google.com/citations?hl=en&user=jGCLWRUAAAAJ&view_op=list_works
- Thermo blogs: https://blog.normalcomputing.ai/posts/2023-11-09-thermodynamic-inversion/thermo-inversion.html
- https://blog.normalcomputing.ai/posts/thermox/thermox.html
- Great paper on SGMCMC: https://proceedings.neurips.cc/paper_files/paper/2015/file/9a4400501febb2a95e79248486a5f6d3-Paper.pdf
- David MacKay textbook on Sustainable Energy: https://www.withouthotair.com/
- LBS #107 – Amortized Bayesian Inference with Deep Neural Networks, with Marvin Schmitt: https://learnbayesstats.com/episode/107-amortized-bayesian-inference-deep-neural-networks-marvin-schmitt/
- LBS #98 – Fusing Statistical Physics, Machine Learning & Adaptive MCMC, with Marylou Gabrié: https://learnbayesstats.com/episode/98-fusing-statistical-physics-machine-learning-adaptive-mcmc-marylou-gabrie/
Transcript
This is an automatic transcript and may therefore contain errors. Please get in touch if you're willing to correct them.
Folks, strap in, because today's episode is a deep dive into the fascinating world of large-scale machine learning. And who better to guide us through this journey than Sam Duffield. Currently honing his expertise at Normal Computing, Sam has an impressive background that bridges the theoretical and practical realms of Bayesian statistics, from quantum computation to the cutting edge of AI technology.

In our discussion, Sam breaks down complex topics such as the Posteriors Python package, mini-batch methods, approximate inference, and the intriguing world of thermodynamic hardware for statistics. Yeah, I didn't know what that was either. We delve into how these advanced methods, like stochastic gradient MCMC and Laplace approximation, are not just theoretical concepts but pivotal in shaping enterprise AI models today.

And Sam is not just about algorithms and models: he is a sports enthusiast who loves football, tennis and squash, and he recently returned from an awe-inspiring trip to the Faroe Islands. So join us as we explore the future of AI with Bayesian methods. This is Learning Bayesian Statistics.
Welcome to Learning Bayesian Statistics, a podcast about Bayesian inference, the methods, the projects, and the people who make it possible. I'm your host, Alex Andorra. You can follow me on Twitter at alex_andorra, like the country. For any info about the show, learnbayesstats.com is Laplace to be. Show notes, becoming a corporate sponsor, unlocking Bayesian merch, supporting the show on Patreon: everything is in there. That's learnbayesstats.com. If you're interested in one-on-one mentorship, online courses, or statistical consulting, feel free to reach out and book a call at topmate.io/alex_andorra. See you around, folks, and best Bayesian wishes to you all.
Sam Duffield, welcome to Learning Bayesian Statistics.

Thanks, thank you very much.

Yeah, thank you so much for taking the time. I invited you on the show because I saw what you guys at Normal Computing were doing, especially with the Posteriors Python package. And I am personally always learning new stuff. Right now I'm learning a lot about sports analytics, because that's always been a personal pet project of mine, and Bayes is extremely useful in that field. But I'm also, in conjunction, working a lot on LLMs and their interaction with the Bayesian framework. I've been working much more on the BayesFlow package, which we've talked about with Marvin Schmitt in episode 107. So, yeah, I'm working on developing a PyMC bridge to BayesFlow, so that you can write your model in PyMC and then use amortized Bayesian inference for your PyMC models. It's still way, way down the road. I need to learn about all that stuff, but that's really fascinating. I love that. And so of course, when I saw what you were doing with Posteriors, I was like, that sounds... awesome. I want to learn more about that. So I'm going to ask you a lot of questions, a lot of things I don't know. So that's great. But first, can you tell us, give us a brief overview of your research interests and how Bayesian methods play a role in your work?

Yeah, thanks again for the invite. I think, yeah, sports analytics, Bayesian statistics, language models: I think we have a lot to talk about. Should be fun.
Bayesian methods in my work: yes, so at Normal we have a lot of problems where we think that Bayes is the right answer, if you could compute it exactly. So what we're trying to do is look at different approximations, how they scale across different methods and different settings, and how we can get as close to exact Bayes, the exact integral and updating under uncertainty, in a way that can provide us with some of those benefits.

Yeah, OK. That's interesting. I, of course, agree. Of course. Actually, do you remember when you were first introduced to Bayesian inference? Because you have an extensive background; you've studied a lot. When, in those studies, were you introduced to the Bayesian framework? And also, how did you end up working on what you're working on nowadays?
Yeah, okay. I'll try not to rant too long about this. But yeah, so I guess: mathematics undergraduate at Imperial. I was very young at this stage, we were all very young in our undergraduates, so not really sure what we wanted to do. At some point, it came to me that statistics, within the field of mathematics, is where I could work on applied problems and on where the field is going. And that's what got me excited. Statistics at undergraduate is different at different places, but you get thrown a lot of different points of view: you get your frequentist hypothesis testing, and then you have your Bayesian methods as well. And the Bayesian approach really settled with me as being more natural, in the sense that you just write it down: you have your forward model and your prior, and then Bayes' theorem handles everything else. As one of the lecturers in my first year said, mathematicians are lazy; they want to do as little as possible. So Bayes' theorem is kind of nice there, because you just write down your likelihood, you write down your prior, and then Bayes' theorem handles the rest. You have to do the minimum possible work: data, likelihood, prior, and then done. So that was really compelling to me.

And that led me to my PhD, which was in the engineering department in Cambridge. I had a few thoughts on what to do for my PhD; there was some more theoretical stuff, but I wanted to get into some problems, get into the weeds a bit. So: engineering department at Cambridge, working on Bayesian statistics, state space models, and, in state space models, sequential Monte Carlo. And terminology-wise, I use state space model and hidden Markov model to mean the same thing. So you have this time-series-style data, and working on that sort of data, I feel like the propagation of uncertainty really shines there, because you need to take into account your uncertainty from the previous experiments, say, when you update for your new ones. That was really compelling for me. That was, I guess, my route into Bayesian statistics.
Yeah, okay. Actually, here I could ask you a lot of questions about those time series models. I'm always fascinated by time series models. I don't know, I love them for some reason. I find there is a kind of magic in the ability of a model to take time dependencies into account. I love using Gaussian processes for that. So I could definitely go down that rabbit hole, but I'm afraid then I won't have enough time for you to talk about Posteriors.
Let me just say one minute about it. Gaussian processes are really cool. You can think of a Gaussian process as a continuous-time (or continuous-space, whatever the time-varying axis is; we'll call it time) version of a state space model. And a state space model, or hidden Markov model, to me is the canonical extension of a static Bayesian inference model to the time-varying setting. They kind of unify each other, because you can write smoothing in a state space model as one big static Bayesian inference problem, and you can write a static Bayesian inference problem, recovering x from y, as a single step of a state space model. So the techniques that you build just overlap, at least conceptually, on the mathematical level. When you actually get into the approximations and the computation, there are different things to consider, different axes of scalability, but conceptually, I really like that. I probably ranted for a bit more than a minute there, so I apologize.
No, no, that's fine. I love that. Yeah. I have much more knowledge and experience on GPs, but I'm definitely super curious to also apply these state space models and so on. So I'm definitely going to read the paper you sent me about skill rating of football players, where you're using, if I understand correctly, some state space models. That's going to be two birds with one stone. So thanks a lot for writing that.

The whole point of that paper is to say that rating systems, Elo, TrueSkill, are and should be reframed as state space models. And then you just have your full Bayesian understanding of it.

Yeah, yeah. I mean, for sure. I'm working myself also on a project like that, on football data. And yeah, the first thing I was doing is like, okay, I'm gonna write the simple model. But then as soon as I have that down, I'm gonna add a GP to that. It's like, I have to take these nonlinearities into account. So yeah, I'm super excited about that. So thanks a lot for giving me some weekend reading.
So actually, now let's go into your Posteriors package, because I have so many questions about that. Could you give us an overview of the package, what motivated its development, and also put it in the context of large-scale AI models?

Yeah, so as I said, we at Normal think that Bayes is the right answer. And we're interested in large-scale enterprise AI models. So we need to be able to scale to big, big models, big parameter sizes, and big data at the same time. That is what the Posteriors Python package, built on PyTorch, really hopes to bring. It's built with flexibility and research in mind. So really, we want to try out different methods, for different datasets and different goals, and see what's going to be the best approach for us. That's the motivation of the Posteriors package.
When would people use it? For instance, for which use cases would I use Posteriors?

There's a lot of just genuinely fantastic Bayesian software out there. But most of it has focused on the full-batch setting, as is classically the case with the Metropolis-Hastings accept-reject step. And we feel like we're moving, or have already moved, into the mini-batch era, the big data era. So Posteriors is mini-batch first. If you have a lot of data, even with a small model, and you want to try posterior sampling with mini-batches, to see if that can speed up your inference rather than doing a full-batch pass on every step, then Posteriors is the place for that, even with small models. You can just write down your model in Pyro, in PyTorch, and then use Posteriors to do that.

But that's moving from classical Bayesian statistics into the mini-batch setting. There are also benefits of very crude approximations to the Bayesian posterior in these really large-scale models. So for language models and big neural networks, you're not going to be able to do your convergence checks and those sorts of things, but you might still be able to get some advantages: out-of-distribution detection, improved predictive performance, continual learning. These are the sorts of things we're investigating: if you had just trained with gradient descent, essentially, you wouldn't necessarily get these things, but even very crude Bayesian approximations will hopefully provide these benefits. I think I will talk about this more later.
Yeah, okay. So basically, what I understand is that you can use Posteriors for basically any model.

So, I mean, we're still very young, and it doesn't have the support for, I don't know, if you want to do Gaussian processes, we're not going to have a whole suite of kernels that you can just type up. But fundamentally, it just takes a function, a log posterior function, and then you'll be able to try out different methods. But as I said, the big data regime is much less researched, and the big parameter regime is much harder, at least. So it's not going to be a silver bullet. Posteriors is, a lot of the time, a tool for research, where you're going to research which inference methods you can use, where they fail, and hopefully where they succeed as well.
Okay, I see. And so, to make sure listeners understand: can you do both in Posteriors? Can you write your model in Posteriors and then sample from it? Or is it only model definition, or only model sampling?

So it only does approximate posterior sampling. You're given some data and you write down the log posterior, or the joint, you could say. It doesn't have the sophisticated model-specification support of Stan or PyMC, where you can actually write down the model with support for all the distributions and do forward samples; it leans on other tools, like Pyro or PyTorch itself, for that. It is about approximate inference in the posterior space, in the sample space. So you can do a Laplace approximation and these things, and compare them. And importantly, it's mini-batch first: every method only expects to receive data batch by batch, so it can support the large data regime.
Okay, so I think there are a bunch of terms we need to define here for listeners.

Okay, yeah, sorry about that.

Can you define mini-batch? Can you define approximate inference and, in particular, Laplace approximation?

Okay, so mini-batch is the important one, of course. Normally, in traditional Bayesian statistics, if you're running random-walk Metropolis-Hastings or HMC, you will be seeing your whole dataset, all n data points, at every step of the iteration. And there's beautiful theory about that. But a lot of the time in machine learning, you have a billion data points. Or if you're doing a foundation model, it's all of Wikipedia, billions of data points or something like that. And there's just no way that, every time you do a gradient step, you can sum over a billion data points. So you take, say, 10 of them, and you form an unbiased approximation. But that unbiasedness doesn't propagate through the exponential, which you need for the Metropolis-Hastings step. So it rules out a lot of traditional Bayesian methods, but there's still been research on this. This is the scalable Bayesian learning we talk about with Posteriors: we're investigating mini-batch methods, methods that only use a small amount of the data at each step, as is very common in optimization with gradient descent and stochastic gradient descent. So hopefully...
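To make the mini-batch idea concrete, here is a minimal sketch (a toy Gaussian model invented for illustration, not code from Posteriors): the full-data log-likelihood is a sum over all n points, and rescaling a random subsample by n / batch_size gives an unbiased estimate of that sum.

```python
import math
import random

# Toy model for illustration: y_i ~ N(theta, 1), unknown mean theta.

def log_lik(theta, y):
    return -0.5 * (y - theta) ** 2 - 0.5 * math.log(2 * math.pi)

def full_log_lik(theta, data):
    # Full-batch log-likelihood: a sum over all n data points.
    return sum(log_lik(theta, y) for y in data)

def minibatch_log_lik(theta, data, batch_size, rng):
    batch = rng.sample(data, batch_size)
    # Rescale by n / batch_size so the estimator is unbiased for the full sum.
    return len(data) / batch_size * sum(log_lik(theta, y) for y in batch)

rng = random.Random(0)
data = [rng.gauss(1.0, 1.0) for _ in range(10_000)]

full = full_log_lik(0.5, data)
# Averaging many mini-batch estimates should land close to the full-batch value.
est = sum(minibatch_log_lik(0.5, data, 100, rng) for _ in range(2_000)) / 2_000
```

Note that this unbiasedness holds for the log-likelihood sum itself, but not after a nonlinearity such as exponentiation, which is exactly why the Metropolis-Hastings acceptance ratio mentioned above is ruled out.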
Mini-batches, okay. You said approximate inference?

So, okay, yeah, "inference" is a very loaded term; maybe I should try not to use it. But when I say approximate inference, I mean approximate Bayesian inference. You can write down the posterior distribution mathematically, p(theta | y), proportional to p(theta) p(y | theta). But you only have access to pointwise evaluations of that, and potentially even only mini-batch pointwise evaluations. So approximate inference is forming some approximation to that posterior distribution, whether that's a Gaussian approximation or Monte Carlo samples, an ensemble of points. And you have different fidelities of this posterior approximation.
Last one: Laplace approximation. The Laplace approximation is arguably the simplest approximation to the posterior distribution, in the machine learning setting at least. It's just a Gaussian distribution, so all you need to define is a mean and a covariance. You define the mean by running an optimization procedure on your log posterior, or just the log likelihood, and that gives you a point: that's your mean. And then, okay, the Laplace approximation gets quite into the weeds, but ideally you then do a Taylor expansion around that point. A second-order Taylor expansion gives you the Hessian, and the inverse of the negative Hessian is your approximate covariance. But there are related quantities there, and you can use the Fisher information instead. And there's lots to read on that; I'm sure you've had people on the podcast explain it better than me.
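As a sketch of the recipe just described (a conjugate Gaussian toy example chosen so the exact answer is known; this is illustrative code, not the Posteriors implementation): find the mode by gradient ascent on the log posterior, then take the inverse of the negative second derivative at the mode as the variance.

```python
# 1D Laplace approximation sketch. Model: prior theta ~ N(0, 1),
# likelihood y_i ~ N(theta, 1). The posterior is exactly Gaussian here,
# N(sum(y) / (n + 1), 1 / (n + 1)), so the Laplace fit should match it.

data = [0.8, 1.2, 1.0, 0.6, 1.4]

def log_post(theta):
    lp = -0.5 * theta ** 2                             # log prior (up to const)
    lp += sum(-0.5 * (y - theta) ** 2 for y in data)   # log likelihood
    return lp

def grad(f, x, h=1e-5):
    # Central finite-difference first derivative.
    return (f(x + h) - f(x - h)) / (2 * h)

def hess(f, x, h=1e-4):
    # Finite-difference second derivative.
    return (f(x + h) - 2 * f(x) + f(x - h)) / h ** 2

# 1) Mean: find the mode by gradient ascent on the log posterior.
theta = 0.0
for _ in range(500):
    theta += 0.05 * grad(log_post, theta)

# 2) Covariance: inverse of the negative Hessian at the mode.
var = -1.0 / hess(log_post, theta)

exact_mean = sum(data) / (len(data) + 1)
exact_var = 1.0 / (len(data) + 1)
```

In this conjugate case the Laplace approximation recovers the posterior exactly; for non-Gaussian posteriors it only matches the local curvature at the mode.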
For Laplace, no. Actually, that's why I asked you to define it.

I'm happy to go down into the weeds if you want.

Yeah, if you think that's useful. Otherwise, we can definitely also do an episode with someone you'd recommend to talk about Laplace approximation. Something I'd like to communicate to listeners, for them to understand: we say approximation, but at the same time, MCMC is an approximation itself. So that can be a bit confusing. Can you talk about why these kinds of methods, like the Laplace approximation, and I think VI, variational inference, would fall into this bucket too, are called approximations, in contrast to MCMC? What's the main difference here?
Honestly, I would say MCMC is also an approximation, in the same terminology. But the difference is that we talk about bias: some methods are asymptotically unbiased, which MCMC is, and stochastic gradient MCMC, which is what Posteriors has as well, is too, under some caveats (and there are caveats for normal MCMC as well). But then you have your Gaussian approximations, from variational inference and the Laplace approximation. And these are very much approximations in the sense that there's no axis you can push to infinity to recover the posterior. You cannot do that with Gaussian approximations unless your posterior is known to be Gaussian, and I mean, there are interesting cases like that, like Gaussian processes and things. But they don't have this asymptotically unbiased feature that MCMC does, or that importance sampling and sequential Monte Carlo do, which is very useful because it allows you to trade compute for accuracy. You can't do that with a Laplace approximation or VI, beyond extending, say, from a diagonal covariance to a full covariance, or things like that. And trading compute for accuracy is very useful in the case that you have extra compute available. So I'm a big fan of the asymptotic unbiasedness property, because it means that you can increase your compute safely.
Yeah. Great explanation, thanks a lot. And so, yeah, as you were saying, these approximations don't have this asymptotic unbiasedness, but at the same time, that means they can be way faster. So if you're in the right use case, then it really makes sense to use them. But you have to be careful about the conditions where the approximation falls down. Can you maybe dive a bit deeper into stochastic gradient descent, which is the method that Posteriors is using, and how that fits into these different methods that you just talked about?
Actually, stochastic gradient descent is not a method that Posteriors is using per se. Stochastic gradient descent is the workhorse of most machine learning algorithms, but Posteriors would kind of say it shouldn't be, perhaps, or not in all cases. Stochastic gradient descent is what you use if you have extremely large data and you just want to find the MLE, the maximum likelihood estimate, or the minimum of a loss, you might say. That is just an optimization routine: you just want to find the parameters that minimize something. If you're doing variational inference, what you can do is tractably estimate the KL divergence between your specified variational distribution and the posterior. Then you have parameters, the parameters of the variational distribution over your model parameters, and you use stochastic gradient descent on that. This is nice because it means you can throw the workhorse from machine learning at a Bayesian problem and get a Bayesian approximation out. Again, as we mentioned, it doesn't have the asymptotically unbiased feature, which is maybe less of a concern in machine learning models, where you have less ability to trade compute because you've kind of filled your compute budget with your gigantic model. Although we think this might change over the coming years. But yeah, maybe not; maybe we'll just go even bigger and bigger and bigger.
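A minimal sketch of variational inference run with plain stochastic gradient ascent, as just described (a toy conjugate Gaussian model chosen so the exact posterior is known; illustrative code, not the Posteriors API): the optimized parameters are the mean and log standard deviation of the variational distribution, with gradients taken through the reparameterization theta = m + s * eps.

```python
import math
import random

# Model: prior theta ~ N(0, 1), y_i ~ N(theta, 1), so the exact posterior is
# N(sum(y) / (n + 1), 1 / (n + 1)). Variational family: q = N(m, s^2).

rng = random.Random(1)
data = [0.8, 1.2, 1.0, 0.6, 1.4]
n, S = len(data), sum(data)

def grad_log_post(theta):
    # d/dtheta [log prior + log likelihood] = -theta + sum_i (y_i - theta).
    return S - (n + 1) * theta

m, log_s = 0.0, 0.0
lr, n_steps, n_avg = 0.005, 20_000, 5_000
m_avg = s_avg = 0.0
for step in range(n_steps):
    eps = rng.gauss(0.0, 1.0)
    s = math.exp(log_s)
    theta = m + s * eps                    # reparameterised sample from q
    g = grad_log_post(theta)
    m += lr * g                            # stochastic ascent on the ELBO
    log_s += lr * (g * s * eps + 1.0)      # +1.0 is the entropy gradient
    if step >= n_steps - n_avg:            # average the final iterates
        m_avg += m / n_avg
        s_avg += math.exp(log_s) / n_avg

exact_mean = S / (n + 1)
exact_sd = (1.0 / (n + 1)) ** 0.5
```

Because this posterior is exactly Gaussian, the fitted variational mean and standard deviation should land on the exact values; in general VI carries the bias discussed above.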
Okay, sorry, I got lost. You were asking about stochastic gradient descent. So actually, there's something interesting to say here.

And that also gets at the main distinguishing characteristics of Posteriors, so that people really understand its use case here.

Yeah. Okay.
So yeah, there's a key thing about the way we've written Posteriors: where possible, we have stochastic gradient descent, so optimization, as a limit under certain hyperparameter specifications of the algorithms. And it turns out that in a lot of cases (so, we talked about MCMC, and then about stochastic gradient MCMC, which are MCMC methods that strictly handle mini-batches), you can write down a temperature parameter of your posterior distribution. If the temperature is very high, your posterior distribution is heated up: you've increased the tails, and it's much closer to a uniform distribution. If you take it very cold, it becomes very pointed, focused around optima. So we write the algorithms so that there's this convenient transition through the temperature: set the temperature to zero, and you just get optimization. This is a key thing about Posteriors: the stochastic gradient MCMC methods have this temperature parameter which, if you set it to zero, becomes a variant of stochastic gradient descent. So you can unify gradient descent and stochastic gradient MCMC, and it's nice: you have your Langevin dynamics, which, tempered down to zero, just becomes vanilla gradient descent; and you have underdamped Langevin dynamics, or stochastic gradient HMC, stochastic gradient Hamiltonian Monte Carlo, where you set the temperature to zero and you've just got stochastic gradient descent with momentum. So this is a nice thing about Posteriors, unifying these approaches, and hopefully it makes Bayesian approaches less scary to use, because you know you always have gradient descent, and you can sanity-check by just fiddling with the temperature parameter.
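The unification just described can be sketched with stochastic gradient Langevin dynamics (a simplified toy, not the Posteriors implementation): a single mini-batch update rule whose temperature interpolates between posterior sampling (temperature = 1) and stochastic gradient ascent to the mode (temperature = 0).

```python
import math
import random

# Toy conjugate model: prior N(0, 1), y_i ~ N(theta, 1), so the exact
# posterior is N(sum(y) / (n + 1), 1 / (n + 1)).

rng = random.Random(3)
n, batch_size, lr = 100, 20, 1e-3
data = [rng.gauss(1.0, 1.0) for _ in range(n)]

def minibatch_grad(theta):
    # Unbiased mini-batch estimate of the gradient of the log posterior.
    batch = rng.sample(data, batch_size)
    return -theta + (n / batch_size) * sum(y - theta for y in batch)

def sgld(temperature, n_steps=20_000):
    theta, trace = 0.0, []
    for _ in range(n_steps):
        # Injected noise scales with the temperature; at 0 it vanishes and
        # the update is exactly a stochastic gradient ascent step.
        noise = math.sqrt(lr * temperature) * rng.gauss(0.0, 1.0)
        theta += 0.5 * lr * minibatch_grad(theta) + noise
        trace.append(theta)
    return trace[n_steps // 2:]            # discard burn-in

hot = sgld(temperature=1.0)    # approximate samples from the posterior
cold = sgld(temperature=0.0)   # stochastic gradient ascent to the mode

post_mean, post_sd = sum(data) / (n + 1), (1.0 / (n + 1)) ** 0.5
hot_mean = sum(hot) / len(hot)
hot_sd = (sum((x - hot_mean) ** 2 for x in hot) / len(hot)) ** 0.5
cold_mean = sum(cold) / len(cold)
```

With temperature 1 the chain's spread should roughly match the posterior standard deviation (up to discretization and mini-batch gradient noise); with temperature 0 the very same code is just noisy gradient ascent concentrating at the mode.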
Okay, that's really cool. So it's a bit like the temperature parameter in transformers, I mean, in LLMs, that adds a bit of variation on top of the predictions that the LLM could make.

Yeah, it's exactly the same as that. When you use this in language models, in natural language generation, you temper the generative distribution, so the logits get tempered. If you set the temperature there to zero, you get greedy sampling. But we're doing this in parameter space. It has this, yeah, exactly.
:Distribution tempering is a broad thing,
particularly in, I'm not going to go too
455
:philosophical, but I mean, I've first met
with like tempering, then we thought about
456
:it in the settings of sequential Monte
Carlo, and it's like, is it the natural
457
:way?
458
:Is it something that's natural to do?
459
:But in the context of Bayes, because
Bayes' theorem is multiplicative, right,
460
:you have your P of theta, P of y given
theta, it kind of makes sense to temper
461
:because it means like, okay, I'll just
introduce the likelihood a little bit.
462
:and sort of tempering as a natural way to
do it because there's multiplicative
463
:feature of Bayes' theorem.
464
:So, I kind of settled with me after
thinking about it like that.
Yeah, no, that makes perfect sense. And I was really surprised to see that used in LLMs when I first read about the algorithms. And I was pleasantly surprised because I've worked a lot on electoral forecasting models; that's how I was introduced to Bayesian stats. Actually, I'd done that without knowing it. First, I'm using the softmax all the time, because for electoral forecasting, unless you're doing that in the US, you need a multinomial likelihood. The multinomial needs a probability distribution, and you get that from the softmax function, which is actually a very important one in the LLM framework. And also, the thing is, your probability is the latent popularity of each party, but you never observe it, right? And so you could conceptualize the polls as a tempered version of the true latent popularity. And so that was really interesting. I was like, damn, this stuff is much more powerful than what I thought, because I was applying it only to electoral forecasting models, which is a very niche application of these models, and actually there are so many applications of that in the wild.
486
Yeah, tempering in general is very widespread, and I would also say not particularly well understood. There's been research into this cold posterior effect, which is a somewhat annoying thing for Bayesian modeling of neural networks. As I said, you have this temperature parameter that transitions between optimization and the Bayesian posterior: zero is optimization, one is the Bayesian posterior. And empirically we see better predictive performance, which is a lot of the time what we care about in machine learning, with temperatures less than one. Which is annoying, because we're Bayesians and we think the Bayesian posterior gives optimal decision-making under uncertainty.

But at least in our experiments, we found this so-called cold posterior effect to be much more prominent under Gaussian approximations, which we only believe to be very crude approximations to the posterior anyway. If we do more MCMC or deep ensemble stuff, it's much weaker. We've got a paper we'll be able to put on arXiv shortly which describes deep ensembles: you just run gradient descent in parallel with different initializations and batch shuffling. Say you run 10 optimizations in parallel, then you've got 10 parameter configurations at the end, a Monte Carlo approximation to the posterior of size 10. And in the paper we describe how to get the asymptotically unbiased property by using that temperature. As we said earlier, SGMCMC becomes SGD at temperature zero, and you can reverse this for deep ensembles: you add the noise back in, and deep ensembles become asymptotically unbiased MCMC, something in between SGMCMC and deep ensembles. In those cases, where you have a non-Gaussian approximation, we found much less of the cold posterior effect.

So maybe the cold posterior effect is a natural thing, because tempering isn't really Bayes' theorem anymore. It still needs to be better understood. At least in my head, I'm not fully clear on whether the cold posterior effect is something we should be surprised about.
Okay, yeah. Me neither, if that makes you feel any better, because I just learned about it. So I don't have any strong opinion.

Okay, I think we're getting clearer now, for listeners, on what Posteriors is for. So one last question about the algorithms underlying all of that: stochastic gradient MCMC. That's where I got confused. I hear stochastic gradient and think of stochastic gradient descent, but no, it's SGMCMC, not SGD. So Posteriors really leans on SGMCMC. Why would you do that and not use classic MCMC, like HMC from Stan or PyMC?
Yeah, so it's not just SGMCMC. There's also variational inference, the Laplace approximation, the extended Kalman filter, and we're really excited to add more methods as we maintain and expand the library.

Why would you use SGMCMC? I think we've already touched on this. The thing is, if you've got loads of data, it's just inefficient to sum over all of that data at every iteration of your MCMC algorithm, as Stan would do. And there are mathematical reasons why you can't just subsample in Stan. The Metropolis-Hastings ratio has this exponential of the log posterior, but log space is the only place you can get an unbiased approximation, which is what you need if you did want to naively subsample. So you can't do the Metropolis-Hastings accept/reject step; you have to use different tooling. In its simplest form, SGMCMC just omits the accept/reject and runs a Langevin diffusion, so it runs your Hamiltonian Monte Carlo without the accept/reject step. There's more theory on top of this, and you need to control the discretization error and things like that, but I won't go into the weeds.
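In its simplest form, the recipe just described (a mini-batch gradient step plus scaled Gaussian noise, with no accept/reject) is stochastic gradient Langevin dynamics. A minimal illustrative sketch on a toy conjugate Gaussian model, in plain Python rather than the Posteriors implementation:

```python
import math
import random

random.seed(0)

# Synthetic data: y_i ~ N(theta_true, 1)
theta_true, n_data = 3.0, 1000
data = [theta_true + random.gauss(0, 1) for _ in range(n_data)]

prior_var = 10.0  # prior: theta ~ N(0, prior_var)

def minibatch_grad(theta, batch):
    """Unbiased estimate of the full log-posterior gradient:
    grad log prior + (N / batch size) * sum of batch log-likelihood gradients."""
    scale = n_data / len(batch)
    return -theta / prior_var + scale * sum(y - theta for y in batch)

# SGLD: a discretized Langevin diffusion, no Metropolis-Hastings accept/reject.
eps, theta, samples = 1e-4, 0.0, []
for step in range(5000):
    batch = random.sample(data, 32)
    theta += 0.5 * eps * minibatch_grad(theta, batch) + random.gauss(0, math.sqrt(eps))
    if step >= 1000:  # discard burn-in
        samples.append(theta)

posterior_mean = sum(samples) / len(samples)
print(posterior_mean)  # close to the sample mean of the data, near theta_true
```

Dropping the injected noise term recovers plain stochastic gradient ascent on the log posterior, which is exactly the temperature-zero limit discussed earlier.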
Okay, yeah. And that's tied to mini-batching, basically. The power that SGMCMC gives you in a high-data regime comes from the mini-batching, if I understand correctly.

Exactly, that's the difference between MCMC and SGMCMC.

Okay, so that's the main difference.
Yeah, stochastic gradient: you can't actually get the exact gradient that you'd need in Hamiltonian Monte Carlo or for a Metropolis-Hastings step. You only get an unbiased approximation. And there's theory about this: sometimes you can invoke the central limit theorem, so you've got a covariance attached to your gradients, and you can do nice theory and improve the convergence like that.
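That unbiasedness is easy to verify directly: rescaling the mini-batch gradient by N over the batch size and averaging over every possible mini-batch recovers the full-data gradient exactly (an illustrative sketch, assuming a toy Gaussian log-likelihood):

```python
from itertools import combinations

# Full-data gradient of sum_i log N(y_i | theta, 1) is sum_i (y_i - theta).
data = [1.0, 2.0, 4.0, 7.0]
theta = 0.5
full_grad = sum(y - theta for y in data)

# Rescaled mini-batch estimator: (N / m) * sum over the batch.
m = 2
estimates = [len(data) / m * sum(y - theta for y in batch)
             for batch in combinations(data, m)]

# Each individual estimate is noisy, but the average over all
# equally likely mini-batches is exactly the full gradient.
average = sum(estimates) / len(estimates)
print(full_grad, average)  # both 12.0
```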
Okay, all clear now. Awesome. And I think that's the first time we've talked about that on the show, so it was definitely useful to be extra clear, so that listeners understand, and me, myself, so that I understand. Thanks a lot.
In some settings it's actually much simpler, because you remove the accept/reject machinery, so the implementation gets a bit simpler. But you kind of lose the theory in that. And then a lot of the argument is, if you use a decreasing step size, then the noise from the mini-batch, the noise from the stochastic gradient, decreases as epsilon squared, which is faster. So if you decrease your step size and run for infinite time, you'll eventually just be running the continuous-time dynamics, which are exact and do have the right stationary distribution. So if you run with decreasing step size, you are asymptotically unbiased. But running with decreasing step size is really annoying, because you then don't move as far, and as we know from normal MCMC, we want to increase our step size and move, and explore the posterior more. So there's lots of research to be done here.
I hope and I feel that it's not the last time you'll talk about stochastic gradient MCMC on this podcast.

Yeah, no. I mean, that sounds super interesting, and I'm really interested to understand the differences between these algorithms. Right now that's really at the frontier of research. Not only is there a lot of research on how to make HMC more efficient, but you have all these new approximate algorithms, as we said before: variational inference, Laplace approximation, stuff like that. But also now you have normalizing flows. We talked about those in episode 98 with Marylou Gabrié. Actually, I don't know why I pronounced the second part the Spanish way; my Spanish is really available in my brain right now. She's French, so that's Marylou Gabrié, episode 98, it's in the show notes. And episode 107, which I already mentioned, with Marvin Schmitt, about amortized Bayesian inference. Actually, do you know about amortized Bayesian inference and normalizing flows?

I know a bit about normalizing flows. Amortized Bayesian inference I would be less comfortable with. But I mean, if you could explain it... Yeah, I haven't listened to that episode yet.

Yeah, I mean, we released it yesterday. I'm a bit disappointed, Sam, but that's fine. It's just one day, you know. If you listen to it just after the recording, I'll forgive you.

That's okay.

No, so, kidding aside, I'm actually curious to hear you speak about the difference between normalizing flows and SGMCMC. Can you talk a bit about that, if you're comfortable with it?
I can try, though it's been a while since I've read about normalizing flows. When I did read about them, I understood them to be essentially a form of variational inference where you define a more elaborate variational family, essentially through a triangular mapping. Someone might say, why can't you just use a neural network as your variational distribution? And it's not so easy, because you need a tractable form. The thing with normalizing flows is that you can get this because they're invertible. That's it: normalizing flows are invertible, so you can write down the change-of-variables formula and then fit the flow to a distribution essentially by maximum likelihood.

Whereas SGMCMC doesn't need that. With normalizing flows, you kind of have to define the ansatz that will fit your distribution. I think normalizing flows are really exciting and really interesting, but you have to specify that ansatz. It's another specification on top: rather than just writing the log posterior, you then need to find an approximate ansatz which you think will fit the posterior, or the distribution you're targeting. Whereas SGMCMC is just: log posterior, go. Which is sort of what we're trying to do with Posteriors. We're trying to automate... well, not automate, we're doing research, of course. But yeah, as I said, I think it's really interesting that you can get these more expressive variational families through triangular mappings.
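The change-of-variables formula mentioned here can be shown with the simplest possible flow, a 1-D affine map: the flow's log density is the base log density at the inverted point plus the log absolute Jacobian of the inverse. For the affine map below this reproduces a Gaussian N(b, a²) exactly (an illustrative sketch, not a library implementation):

```python
import math

def base_log_prob(z):
    """Standard normal base distribution."""
    return -0.5 * z * z - 0.5 * math.log(2 * math.pi)

def flow_log_prob(x, a=2.0, b=1.0):
    """Invertible affine flow x = a * z + b.
    Change of variables: log p_x(x) = log p_z((x - b) / a) - log|a|."""
    z = (x - b) / a
    return base_log_prob(z) - math.log(abs(a))

def gaussian_log_prob(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2 * math.pi)

# The flow density matches N(b, a^2) analytically.
x = 0.7
print(flow_log_prob(x), gaussian_log_prob(x, 1.0, 2.0))  # equal up to rounding
```

A real flow stacks many such invertible maps with learned parameters; the tractable log determinant is exactly what lets you train it by maximum likelihood.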
Yeah, super interesting. And amortized Bayesian inference is related in the sense that you first fit a deep neural network on your model, and then once it's fit, you get posterior inference for free, basically. So that's quite different from what I understand SGMCMC to be, but it's also extremely interesting. That's also why I'm hammering you on the different use cases of SGMCMC, so that listeners and I have a kind of decision tree in our heads: okay, my use case is more appropriate for SGMCMC; or no, here I'd like to try amortized Bayesian inference; or here I can just stick to plain vanilla HMC. I think that's very interesting. But thanks for that question that was completely improvised. I definitely appreciate you taking the time to rack your brain about the difference with normalizing flows.

No, I'd love to talk more on that. I'd need to refresh myself. I've written down some notes on normalizing flows and I was quite comfortable with them, but it's just been a while since I refreshed. So I would love to refresh, and then we can chat about them, because I'd love to do a project on them, or I'd love to work on them. They're a way to fit a distribution to data, which is, after all, a lot of what we do.

Yeah. So that makes me think we should probably do another episode about normalizing flows. So listeners, if there is a researcher you like who does a lot of normalizing flows and you think would be a good guest on the show, please reach out to me and I'll make that happen.
Now let's get you closer to home, Sam, and talk about Posteriors again. Basically, if I understood correctly, Posteriors aims to address uncertainty quantification in deep learning. Am I right here? And if that's the case, why is this particularly important for neural networks, and how does the package help in managing overconfidence in model predictions?
Yeah, that's our primary use case. The normal way to use Posteriors is approximate Bayes: we're getting as close to Bayes as we can, which is probably not that close, but still getting somewhere on the way to the Bayesian posterior in big deep learning models. But we built Posteriors to be as modular and general as possible. So as I said, if you have a classical Bayesian model, you can write it down in Pyro, and if you've got loads of data, then okay, go ahead, Posteriors should be well suited to that.

In terms of what advantages we want to see from uncertainty quantification, or this approximate Bayesian inference in deep learning models, there are three key things we've distilled it down to. You mentioned confidence on out-of-distribution predictions: we should be able to improve our performance in predicting on inputs that we haven't seen in the training set. I'll talk about that after this.
The second one is continual learning. If you can do Bayes' theorem exactly: you have your prior, you get some data and a likelihood, you have a posterior; then you get some more data, your posterior becomes your prior, and you do the update again. You can just chain it like that if you can do Bayes' theorem exactly. And you can extend it even further: with some sort of evolution on your parameters you have a state-space model, and in the exact linear-Gaussian setting you've got a Kalman filter. So continual learning is, in this sense, something Bayes' theorem does exactly. In continual learning research in machine learning settings, they have this term of avoiding catastrophic forgetting. If you just continue to do gradient descent, there's no memory there, so apart from the initialization you would just forget what you've done previously, and there's lots of evidence for this. Whereas Bayes' theorem is completely exchangeable in the order of the data that you see. So if you're doing Bayes' theorem exactly, there's no forgetting; you just have the capacity of the model. That's where we see Bayes solving continual learning. But, as I said, you can't do Bayes' theorem exactly in a billion-dimensional model.
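In a conjugate setting the no-forgetting property can be checked exactly: updating a Gaussian prior with one batch and then using the posterior as the prior for the next batch matches processing all the data at once, in any order. A small illustrative sketch:

```python
def gaussian_update(prior_mean, prior_var, data, noise_var=1.0):
    """Exact Bayes update for y_i ~ N(theta, noise_var) with theta ~ N(mean, var)."""
    precision = 1.0 / prior_var + len(data) / noise_var
    mean = (prior_mean / prior_var + sum(data) / noise_var) / precision
    return mean, 1.0 / precision

batch1, batch2 = [1.0, 2.0, 3.0], [4.0, 5.0]

# Continual learning: the posterior after batch 1 becomes the prior for batch 2.
m1, v1 = gaussian_update(0.0, 10.0, batch1)
m_seq, v_seq = gaussian_update(m1, v1, batch2)

# All data at once, and in reversed order: same posterior (exchangeability).
m_all, v_all = gaussian_update(0.0, 10.0, batch1 + batch2)
m_rev, v_rev = gaussian_update(*gaussian_update(0.0, 10.0, batch2), batch1)

print(m_seq, m_all, m_rev)  # all agree: no catastrophic forgetting
```

The billion-dimensional problem is precisely that no such exact update exists for a neural network, which is where the approximate methods above come in.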
And the last one is what we'd call decomposition of uncertainty in your predictions. If you just have a gradient-descent model and you're predicting, say, the star ratings of someone's reviews, it will just give you its softmax, a single distribution over the stars, and that's it. But what you really want is some indication, also for out-of-distribution detection, of whether you're confident in that prediction. You might get a review like, the food was terrible but the service was amazing, or the service was amazing but the food was terrible. Even if we had a perfect model of how people review things, we'd have quite a lot of uncertainty over that review, because we don't know how the reviewer weighs those different things. So we might have a completely uniform distribution over the stars for that review, but be confident in that distribution.

What Bayes gives you is this sort of second-order uncertainty quantification. If you have a distribution over parameters, and so a distribution over logits, over the predictions, you can split the uncertainty into what information theory calls aleatoric and epistemic uncertainty. Aleatoric uncertainty, or data uncertainty, is what I just described there: natural uncertainty in the model and the data-generating process. Epistemic uncertainty is uncertainty that would be removed in the infinite-data limit, so that's where the model doesn't know. And that's really important for us to quantify.
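This decomposition has a standard information-theoretic form: with an ensemble of predictive distributions, one per posterior parameter sample, total predictive entropy = expected entropy (aleatoric) + mutual information between parameters and prediction (epistemic). An illustrative sketch for categorical predictions:

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def decompose(ensemble):
    """ensemble: categorical predictive distributions, one per posterior sample."""
    k = len(ensemble[0])
    mean_pred = [sum(p[i] for p in ensemble) / len(ensemble) for i in range(k)]
    total = entropy(mean_pred)                                     # total uncertainty
    aleatoric = sum(entropy(p) for p in ensemble) / len(ensemble)  # expected entropy
    epistemic = total - aleatoric                                  # mutual information
    return total, aleatoric, epistemic

# Members agree on a flat distribution: high aleatoric, near-zero epistemic
# (the 'confidently uniform' review example above).
flat = decompose([[0.5, 0.5], [0.5, 0.5]])

# Members confidently disagree: low aleatoric, high epistemic
# (the model doesn't know).
disagree = decompose([[0.99, 0.01], [0.01, 0.99]])

print(flat, disagree)
```

Both cases have the same averaged prediction, a 50/50 split; only the decomposition distinguishes "genuinely uncertain data" from "the model doesn't know".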
Okay.

Yeah, I rambled on a bit there. But I can elaborate in about 30 seconds on the point you specifically mentioned, out-of-distribution predictions and improving performance out of distribution. I think that's quite compelling from a Bayesian point of view, because of what Bayes says in a supervised learning setting. Gradient descent just fits one parameter configuration that's plausible given the training data. Bayes' theorem says, I find the whole distribution of parameter configurations that are plausible given the data, and then when we make predictions, we average over those. So it's perfectly natural to think that a single configuration might overfit, and might just be very confident in its prediction when it sees out-of-distribution data. Averaging doesn't necessarily solve a bad model, but it should be more honest to the model and the data-generating process you've specified, if you average over plausible model configurations under the training data when you do your testing. So to me that's quite a compelling argument for improving performance on out-of-distribution predictions, like the accuracy of them. And there's a fair bit of empirical evidence for this, with the caveat again being that the Bayesian posterior in high-dimensional machine learning models is pretty hard to approximate: cold posterior effect, caveats, things like that.
Okay, yeah, I see. Super interesting. So now I understand better what you have on the Posteriors website about the different kinds of uncertainty. That's definitely something I recommend listeners give a read. I'll put it in the show notes, both your blog post introducing Posteriors and the Posteriors docs, because combined with your explanation right now, I think it makes that clear.

And something I was also wondering: if I understood correctly, the package is built on top of PyTorch, right?

Yeah, that's correct.

Okay. And also, did I understand correctly that you can integrate Posteriors with pre-trained LLMs like Llama 2 and Mistral, and you do that with Hugging Face's Transformers package?
Yeah. So Posteriors is open source, and we fully support the open-source community for machine learning and statistics. We're sort of in the fine-tuning era: there are these open-source models, Llama 2, Llama 3, Mistral, and you can't get away from them. Basically we want to harness that power, right? But as I mentioned previously, there are some issues that we'd like to remedy with Bayesian techniques.

So, the majority of these open-source models are built in PyTorch. I'm also a big JAX fan, I use JAX a lot, so I was very happy to see and work with the torch.func sub-library, which basically means you can write your PyTorch code, and use Llama 3 or Mistral with PyTorch, but writing functional code. That's what we've done with Posteriors. So yes, Hugging Face Transformers is where all the models are hosted and how you access them, but what you get is just a PyTorch model. It's just a PyTorch model. And then you throw that in and it composes nicely with the Posteriors updates. Or you write your own new updates in the Posteriors framework and you can use those as well, still with Llama 3 or Mistral.

Yeah. Okay, nice.
And so what does it mean concretely for users? It means you can use these pre-trained LLMs with Posteriors, and that means adding a layer of uncertainty quantification on top of those models?

Yeah. You need data as well; Bayes' theorem is a training theorem. So you take your pre-trained model, which is a transformer, or it could be another type of model, an image model or something like that, and then you give it some new data, which we would call fine-tuning, and you use Posteriors to combine the two. And then you have your new model out at the end of the day, which has uncertainty quantification.

It's difficult. As I said, we're sort of in this fine-tuning era of open-source large language models, and there's still lots of research to do here. It's different from our classical Bayesian regime, where there's only one source of data and it's what we give the model. In this case there are two sources of data: whatever Llama 3 saw in its original training data, and then your own data. Can we hope to get uncertainty quantification on the data they used in the original training? Probably not. But we might be able to get uncertainty quantification and improved predictions based on the data that we've provided. So there's lots for us to try out here and learn, because we are still learning about this fine-tuning setting. But that's what Posteriors is there for, to make these sorts of questions as easy as possible to ask and answer.
Okay, fantastic. That's so exciting. It's a bit frustrating to me, because I'd love to try that, and learn from it, and contribute to that kind of package. At the same time, I have to work, I have to do the podcast, and I have all the packages I'm already contributing to. So I'm like, my god, too many choices!

No, come on, Alex. We're going to see an Alex pull request soon enough.

Actually, does this ability to use these pre-trained transformer models help facilitate the adoption of new algorithms in Posteriors? Because if I understand correctly, you can support new algorithms pretty easily, and you can support arbitrary likelihoods. How do you do that?
I wouldn't say that the existence of the pre-trained models necessarily allows us to support new algorithms. I feel like we've built Posteriors to be suitably general and suitably modular that it's agnostic to your model choice and your log-posterior choice, and that's where arbitrary likelihoods come in.

The arbitrary-likelihood point is relevant because a lot of machine learning essentially boils down to classification or regression. That is true, and because of that, a lot of machine learning packages will essentially constrain you to classification or regression: at the end, you either have your softmax with cross-entropy or you have your mean squared error. In Posteriors, we haven't done that. We're more faithful to the Bayesian setting, where you just write down your log posterior, and you can write down whatever you want. This allows you greater flexibility in case you did want to try out a different likelihood, and even simple cases are often more sophisticated than plain classification or regression, like sequence generation, where you have the whole sequence and then the cross-entropy over all of it. It just allows you to be more flexible and write the code how you want. And there are additional things to be taken into account. Sometimes, if you were doing a regression, you might have knowledge of the observation noise variance, and it's just much easier to write cleaner code for that if we don't constrain things like this.

And it's also future-proofing. We don't know what's going to happen going forward. In multimodal models we may see text and images together, in which case, yeah, we will support that. You have to supply the compute and the data, which might be the harder thing, but we'll support those likelihoods.
Okay, I see. Yeah, that's very interesting. And that's related to the fact, I think I've read in your blog post or on the website, that you say Posteriors is swappable. What does that mean, and how does that flexibility benefit users?
Yeah. The point of swappable, when I say that, is that you can change between methods. As I said, Posteriors is a research toolbox, and it's for us to investigate which inference method is appropriate in the different settings, which might be different if you care about decomposing predictive uncertainty, or different if you care about avoiding catastrophic forgetting in your continual learning. The way it's written, you can go from SGHMC to the Laplace approximation, or to VI, just by changing one line of code. And the way it works is, you have your transform = posteriors.<inference_method>.build(...), plus any configuration arguments, step size, things like this, which are algorithm-specific. After that, it's all unified. You just have your init on the parameters that you want to be Bayesian about, and then you iterate through your data loader, through your data, and it just updates based on the batch. And batch can be very general. So that's what it means: you can change one line of code to swap between variational inference, SGHMC, the extended Kalman filter, or any of the new methods that the listeners are going to add in the future.
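The build/init/update pattern described here can be mimicked in a few lines of plain Python. This is a toy sketch of the interface shape only, not the actual Posteriors API: each method is packaged as a transform with identical init and update signatures, so swapping inference methods changes only the build line.

```python
import random
from collections import namedtuple

random.seed(1)

# Each inference method is packaged as a transform with the same
# init/update signatures; only the build line differs.
Transform = namedtuple("Transform", ["init", "update"])

def build_sgd(grad_fn, lr=0.1):
    """Toy 'optimization' method: gradient ascent on the log posterior."""
    return Transform(
        init=lambda params: params,
        update=lambda state, batch: state + lr * grad_fn(state, batch),
    )

def build_sgld(grad_fn, lr=0.1):
    """Toy 'Bayesian' method: the same step plus Langevin noise."""
    return Transform(
        init=lambda params: params,
        update=lambda state, batch: state + lr * grad_fn(state, batch)
        + random.gauss(0, (2 * lr) ** 0.5),
    )

def grad_log_post(theta, batch):
    return sum(y - theta for y in batch)  # toy Gaussian log-posterior gradient

data_loader = [[1.0, 2.0], [3.0, 4.0]] * 50

# Swapping the inference method means changing this one line only.
transform = build_sgd(grad_log_post)  # or: build_sgld(grad_log_post)
state = transform.init(0.0)
for batch in data_loader:
    state = transform.update(state, batch)
print(state)
```

In the real library the state would hold the parameters plus any algorithm-specific extras (momenta, covariance factors), but the loop itself stays identical across methods, which is the whole point of the design.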
Heh. Okay, I see.
And I have so many more questions for you on Posteriors, but let's start to wrap that up, because I also want to ask you about another project you're working on. So maybe, to close out on Posteriors: what are the future plans, and are there any upcoming features or integrations that you can share with us?
So we're quite happy with the framework at the moment. There are lots of little tweaks, a list of GitHub issues that we want to go through, which are mostly, and excitingly, about adding new methods and new applications. That's really what we're excited about now: actually using it in the wild and, hopefully, experimenting with all these questions that we've discussed. Like, how does it make sense, and how do we get the benefits of true Bayesian inference in fine-tuning, or on large models, or large data? So yeah, we're really excited to add more methods. If listeners have mini-batch, big-data Bayesian methods that they want to try out with a large model, then we'll hopefully accept them.

I do promote generality, and doing things in a way that's flexible. So we may think a lot about it. We want to add methods that somehow feel natural, and one way is to extend and compose with other methods. So if something requires a very complicated last-layer treatment just for a classification method, we're probably not going to add it. It has to be methods that stick within the Posteriors framework, which is this arbitrary-likelihood, Bayesian, swappable computation.

Okay, yeah, that makes sense. You have that vision of wanting to do it that way and having it as a research tool, basically. So yeah, that makes sense to keep that under control, let's say.
Something I want to ask you in the last few minutes of the show is about thermodynamic computing. I've seen you're working on that, and you've told me you're working on that. I don't know anything about it, so what's that about?
Yeah, so I mean, this is yeah, this is
something that's very normal, normal
::
computing.
::
And it's like,
::
It's something that we have.
::
Yeah, we have this hardware team.
::
It's like a full stack AI company.
::
And we, yeah, on the posterior side, on
the client side, we look at how we can
::
bring in principle Bayesian uncertainty
quantification and help us solve the
::
issues with machine learning pipelines
like we've already discussed.
::
And on the other side, there's lots of
parts to this.
::
Traditional MCMC is just difficult sometimes. The thermodynamic hardware is essentially about simulating SDEs. Normally you have this real pain with the step size: as the dimension grows, the steps have to get really small. So where do we see SDEs? You see SDEs in physics all the time, and physics is real, so we can use physics. We're building physical, analog hardware that evolves as an SDE, and then we can harness those SDEs by encoding things like currents and voltages.
I'm not a physicist, so I don't know exactly how it works. But I'm always reassured, when I speak to the hardware team, by how simply they talk about these things. It's like, yeah, we can just stick some resistors and capacitors on a chip, and then it'll do this SDE. And then we want to use those SDEs for scientific computation, with a real focus on statistics and machine learning.
So yeah, we want to be able to do HMC on device, on an analog device. The first step is the linear case, so we have a Gaussian posterior, an SDE with linear drift. That's an Ornstein-Uhlenbeck process, and we've developed hardware to do it. It turns out that because an Ornstein-Uhlenbeck process has a Gaussian stationary distribution, you can input the precision matrix and read out the covariance matrix, and that's matrix inversion. Your physical device just does this.
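For readers who want to see the trick concretely, here is a minimal digital sketch of the idea Sam describes, a toy NumPy simulation rather than the analog hardware. We integrate the OU SDE dX = -P X dt + sqrt(2) dW with Euler-Maruyama; its stationary distribution is N(0, P^-1), so the empirical covariance of stationary samples recovers the inverse of the precision matrix P. The matrix, step size, and chain counts are all made-up toy values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy precision matrix we want to invert "physically" (made-up values).
P = np.array([[2.0, 0.5],
              [0.5, 1.0]])

dt, n_steps, n_chains = 1e-3, 60_000, 64
x = np.zeros((n_chains, 2))
samples = []

# Euler-Maruyama discretisation of the OU SDE  dX = -P X dt + sqrt(2) dW.
# Its stationary distribution is N(0, inv(P)), so averaging outer products
# of stationary samples estimates the matrix inverse of P.
for step in range(n_steps):
    noise = rng.standard_normal((n_chains, 2))
    x = x - (x @ P) * dt + np.sqrt(2 * dt) * noise
    if step > n_steps // 2 and step % 10 == 0:  # discard burn-in, thin
        samples.append(x)

samples = np.concatenate(samples)
cov_est = samples.T @ samples / len(samples)

print(np.round(cov_est, 2))           # close to inv(P)
print(np.round(np.linalg.inv(P), 2))  # exact inverse for comparison
```

The digital loop is exactly the step-size-limited simulation Sam mentions; the point of the hardware is that the physical system integrates the SDE continuously.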
And because it's an SDE, it has noise and is kind of noise-aware, which is different from classical analog computation. Analog computing is really old, but it has historically been plagued by noise: there's all this noise in physics. Because we're doing SDEs, we want the noise. So yeah, that's the whole idea. It's obviously a very young field, but it's fun stuff.
Yeah. So that's basically to accelerate computing? It's hardware-first, so that computing is accelerated?
I mean, it's a baby field, so we're trying to accelerate different components. What we've worked out is that the simplest thermodynamic chip we can build is this linear chip with the Ornstein-Uhlenbeck process. It comes with some error, but it has asymptotic speed-ups for linear algebra routines, so inverting a matrix or solving a linear system.

That's awesome.

In this case it speeds up one particular component, but that can be useful in a Laplace approximation and these sorts of things in machine learning too.
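To make that last point concrete for readers, here is a toy sketch of where that matrix inversion shows up in a Laplace approximation. Everything here (the data, the logistic model, the standard normal prior) is invented for illustration; the takeaway is just that the final step, inverting the Hessian H, is the kind of linear-algebra routine such a chip could accelerate.

```python
import numpy as np

# Made-up logistic-regression data; model and prior are toy choices.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 3))
y = (X @ np.array([1.0, -2.0, 0.5]) + 0.3 * rng.standard_normal(200) > 0).astype(float)

# Find the MAP weights by Newton's method, with an N(0, I) prior.
w = np.zeros(3)
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))           # predicted probabilities
    grad = X.T @ (p - y) + w                      # gradient of neg log posterior
    H = (X.T * (p * (1.0 - p))) @ X + np.eye(3)   # Hessian of neg log posterior
    w = w - np.linalg.solve(H, grad)

# Recompute the Hessian at the converged MAP estimate.
p = 1.0 / (1.0 + np.exp(-(X @ w)))
H = (X.T * (p * (1.0 - p))) @ X + np.eye(3)

# Laplace approximation: posterior is roughly N(w, inv(H)).
# Inverting H, or solving linear systems in H, is exactly the routine
# a linear thermodynamic chip could in principle speed up.
cov = np.linalg.inv(H)
print(w)                       # MAP estimate
print(np.sqrt(np.diag(cov)))  # approximate posterior standard deviations
```

Each Newton step also solves a linear system in H, so the same primitive appears in both the optimization and the uncertainty estimate.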
Okay, that must be very fun to work on. Do you have any writing about that that we can put in the show notes? I think it'd be super interesting for listeners.

Yeah, yeah. The Normal Computing Scholar page has a list of papers, but we also have more accessible blog posts, which I'll make sure to put in the show notes.
Yeah, please do, because I think it's super interesting. And when you have something to present on that, feel free to reach out; I think it'd be fun to do an episode about that, honestly.

That'd be great, yeah.
Yes, so maybe one last question before asking you the final two questions. Let's zoom out and be way less technical; we've been very technical through the whole episode, which I love. If you have any advice to give to aspiring developers interested in contributing to open-source projects like posteriors, what would it be?
Okay, yeah, I don't feel like I'm necessarily the best placed to say all this, but the most important thing is just to go for it: get stuck in, get into the weeds of these libraries and see what's there. There are loads of people building such cool stuff in the open-source ecosystem, and honestly, it's really fun and rewarding to get involved. So just go for it; you'll learn so much along the way.
For something more tangible: when I'm stuck, when I don't understand something in code or mathematics, I often struggle to find it in papers per se. I love textbooks; I find textbooks a real source of gold here, because they actually go to the depths of explaining things, without the sort of horse-in-the-race style of writing you often find in papers. So yeah, get stuck in, and check textbooks if you get lost or don't understand something. Or just ask as well; open source is all about asking and communicating and bouncing ideas around.
Yeah, for sure. That's usually what I do: I ask a lot, and I usually end up surrounding myself with people way smarter than me. And that's exactly what you want; that's exactly how I learn. On the textbook advice, though, I'd say I often find the writing boring, depending on the textbook. And also, they're expensive. That's kind of the problem with textbooks, I would say.
I mean, you can often get them as PDFs, but I just hate reading a PDF on my computer. So I wonder about the book as an object, or having it on a Kindle or something like that, but that doesn't really exist yet. It could be something some publishers solve someday; that'd be cool, I'd love that.
Awesome, Sam, that was great, thank you so much. We've covered so many topics and my brain is burning, so that's a very good sign: I've learned a lot, and I'm sure our listeners did too. Of course, before letting you go, I'm going to ask you the last two questions I ask every guest at the end of the show. First, if you had unlimited time and resources, which problem would you try to solve?
I'd want to decouple the model specification, the data-generating process, how you go from the thing you don't know to the data you do have, which is your scientific freedom as a data modeler, from the inference and the mathematical computation, the way you do your approximate Bayesian inference. You want to decouple those and make it as easy as possible. Ideally, we just want to be doing the first one, the model specification. Stan and PyMC do this really well: you write down your model, and they handle the rest.
And that's kind of the dream we have as Bayesians, or as Bayesian software developers. With posteriors, we're trying to move towards this for bigger machine learning models, bigger-data settings. So that's kind of the dream there.
But then, what does machine learning have that's different from statistics in that setting? Well, machine learning models are less interesting than classical Bayesian models, but the thing is, they're more transferable, right? It's just a neural network, which we believe will solve a whole suite of tasks. So perhaps in the machine learning setting, where we decouple modeling, inference, and data, you want to remove the modeling part as well. You want these general-purpose foundation models, you could say.
So really you want to let the user focus. We're handling the inference, and we're also handling the model. Really, let the user just give it the data and say, okay, let's use this data to predict other things, and let the user handle that. So that's a real unlimited-time-and-resources answer; you'd need plenty of resources to do that.
Yeah, that sounds amazing. I agree with that; that's a fantastic goal. And that reminds me, that's also why I really love what you guys are doing with posteriors: now that we're starting to be able to get there, making Bayesian inference really scalable to really big data and big models. I'm super enthusiastic about that; it would just be fantastic. So thank you so much for taking the time to do that, guys.

Yeah, we're doing it, we're gonna get there.
Yeah, I love that. And second question: if you could have dinner with any great scientific mind, dead, alive, or fictional, who would it be?

Yeah, I was a bit intimidated by this question; you know, you ask everyone, and again, it's a great question. But then I thought about it for a little bit, and it wasn't too hard for me. I think David MacKay is someone who has done amazing work.
David MacKay was doing Bayesian neural networks back in the early nineties, which is crazy, before I was even born. And I've been going through his textbook; as I said, I love textbooks, so I've been going through his textbook on information theory and Bayesian statistics (he is, or was, a Bayesian). There's something he says right at the start of the book: one of the themes of this book is that data compression and data modeling are one and the same. And that's just really beautiful. He talks about stream codes in a very information-theory-style setting, but a stream code is just an autoregressive prediction model, just like our language models. So he had this ability to distill information, help the unification, and be so ahead of his time.
And then, additionally, he wrote a sort of groundbreaking book on sustainable energy, so he was also tackling one of the greatest challenges we have at the moment. The sustainable energy book is really wonderful, one of my favorite books so far.
Nice. Yeah, definitely put that in the show notes; I'd like to keep that one to read. So yeah, please also put that in the show notes, that's going to be fantastic.
Great. Well, I think we can call it a show. That was fantastic. Thank you so much, Sam. I learned so much, and now I feel like I have to go and read and learn about so many things. I can definitely tell that you're extremely passionate about what you're doing. So yeah, thank you so much for taking the time and being on this show.

No, thank you very much. I had a lot of fun. And thank you for being party to my rantings; I need that sometimes.
Yeah, that's what the show is about. My girlfriend is extremely happy that I have this show to rant about Bayesian stats and other nerdy stuff.

Yeah, it's so true.

Well, Sam, you're welcome back anytime you need to do a nerdy rant.

Thank you. I'm sure I'll be...
This has been another episode of Learning Bayesian Statistics. Be sure to rate, review, and follow the show on your favorite podcatcher, and visit learnbayesstats.com for more resources about today's topics, as well as access to more episodes to help you reach a true Bayesian state of mind. That's learnbayesstats.com. Our theme music is Good Bayesian by Baba Brinkman, feat. MC Lars and Mega Ran. Check out his awesome work at bababrinkman.com. I'm your host, Alex Andorra. You can follow me on Twitter at alex_andorra, like the country. You can support the show and unlock exclusive benefits by visiting patreon.com/learnbayesstats. Thank you so much for listening and for your support. You're truly a good Bayesian. Change your predictions after taking information in. And if you're thinking of me less than amazing, let's adjust those expectations. Let me show you how to be a good Bayesian. Change calculations after taking fresh data in. Those predictions that your brain is making, let's get them on a solid foundation.