Proudly sponsored by PyMC Labs, the Bayesian Consultancy. Book a call, or get in touch!
Our theme music is « Good Bayesian », by Baba Brinkman (feat MC Lars and Mega Ran). Check out his awesome work!
Visit our Patreon page to unlock exclusive Bayesian swag 😉
Takeaways:
- Use mini-batch methods to efficiently process large datasets within Bayesian frameworks in enterprise AI applications.
- Apply approximate inference techniques, like stochastic gradient MCMC and Laplace approximation, to optimize Bayesian analysis in practical settings.
- Explore thermodynamic computing to significantly speed up Bayesian computations, enhancing model efficiency and scalability.
- Leverage the Posteriors Python package for flexible and integrated Bayesian analysis in modern machine learning workflows.
- Overcome challenges in Bayesian inference by simplifying complex concepts for non-expert audiences, ensuring the practical application of statistical models.
- Address the intricacies of model assumptions and communicate effectively to non-technical stakeholders to enhance decision-making processes.
Chapters:
00:00 Introduction to Large-Scale Machine Learning
11:26 Scalable and Flexible Bayesian Inference with Posteriors
25:56 The Role of Temperature in Bayesian Models
32:30 Stochastic Gradient MCMC for Large Datasets
36:12 Introducing Posteriors: Bayesian Inference in Machine Learning
41:22 Uncertainty Quantification and Improved Predictions
52:05 Supporting New Algorithms and Arbitrary Likelihoods
59:16 Thermodynamic Computing
01:06:22 Decoupling Model Specification, Data Generation, and Inference
Thank you to my Patrons for making this episode possible!
Yusuke Saito, Avi Bryant, Ero Carrera, Giuliano Cruz, Tim Gasser, James Wade, Tradd Salvo, William Benton, James Ahloy, Robin Taylor, Chad Scherrer, Zwelithini Tunyiswa, Bertrand Wilden, James Thompson, Stephen Oates, Gian Luca Di Tanna, Jack Wells, Matthew Maldonado, Ian Costley, Ally Salim, Larry Gill, Ian Moran, Paul Oreto, Colin Caprani, Colin Carroll, Nathaniel Burbank, Michael Osthege, Rémi Louf, Clive Edelsten, Henri Wallen, Hugo Botha, Vinh Nguyen, Marcin Elantkowski, Adam C. Smith, Will Kurt, Andrew Moskowitz, Hector Munoz, Marco Gorelli, Simon Kessell, Bradley Rode, Patrick Kelley, Rick Anderson, Casper de Bruin, Philippe Labonde, Michael Hankin, Cameron Smith, Tomáš Frýda, Ryan Wesslen, Andreas Netti, Riley King, Yoshiyuki Hamajima, Sven De Maeyer, Michael DeCrescenzo, Fergal M, Mason Yahr, Naoya Kanai, Steven Rowland, Aubrey Clayton, Jeannine Sue, Omri Har Shemesh, Scott Anthony Robson, Robert Yolken, Or Duek, Pavel Dusek, Paul Cox, Andreas Kröpelin, Raphaël R, Nicolas Rode, Gabriel Stechschulte, Arkady, Kurt TeKolste, Gergely Juhasz, Marcus Nölke, Maggi Mackintosh, Grant Pezzolesi, Avram Aelony, Joshua Meehl, Javier Sabio, Kristian Higgins, Alex Jones, Gregorio Aguilar, Matt Rosinski, Bart Trudeau, Luis Fonseca, Dante Gates, Matt Niccolls, Maksim Kuznecov, Michael Thomas, Luke Gorrie, Cory Kiser, Julio, Edvin Saveljev, Frederick Ayala, Jeffrey Powell, Gal Kampel, Adan Romero, Will Geary, Blake Walters, Jonathan Morgan and Francesco Madrisotti.
Links from the show:
- Sam on Twitter: https://x.com/Sam_Duffield
- Sam on Scholar: https://scholar.google.com/citations?user=7wm_ka8AAAAJ&hl=en&oi=ao
- Sam on Linkedin: https://www.linkedin.com/in/samduffield/
- Sam on GitHub: https://github.com/SamDuffield
- Posteriors paper (new!): https://arxiv.org/abs/2406.00104
- Blog post introducing Posteriors: https://blog.normalcomputing.ai/posts/introducing-posteriors/posteriors.html
- Posteriors docs: https://normal-computing.github.io/posteriors/
- Normal Computing scholar: https://scholar.google.com/citations?hl=en&user=jGCLWRUAAAAJ&view_op=list_works
- Thermo blogs: https://blog.normalcomputing.ai/posts/2023-11-09-thermodynamic-inversion/thermo-inversion.html
- https://blog.normalcomputing.ai/posts/thermox/thermox.html
- Great paper on SGMCMC: https://proceedings.neurips.cc/paper_files/paper/2015/file/9a4400501febb2a95e79248486a5f6d3-Paper.pdf
- David MacKay textbook on Sustainable Energy: https://www.withouthotair.com/
- LBS #107 – Amortized Bayesian Inference with Deep Neural Networks, with Marvin Schmitt: https://learnbayesstats.com/episode/107-amortized-bayesian-inference-deep-neural-networks-marvin-schmitt/
- LBS #98 – Fusing Statistical Physics, Machine Learning & Adaptive MCMC, with Marylou Gabrié: https://learnbayesstats.com/episode/98-fusing-statistical-physics-machine-learning-adaptive-mcmc-marylou-gabrie/
Transcript
This is an automatic transcript and may therefore contain errors. Please get in touch if you're willing to correct them.
Folks, strap in, because today's episode is a deep dive into the fascinating world of large-scale machine learning. And who better to guide us through this journey than Sam Duffield. Currently honing his expertise at Normal Computing, Sam has an impressive background that bridges the theoretical and practical realms of Bayesian statistics, from quantum computation to the cutting edge of AI technology.

In our discussion, Sam breaks down complex topics such as the Posteriors Python package, mini-batch methods, approximate inference, and the intriguing world of thermodynamic hardware for statistics. Yeah, I didn't know what that was either. We delve into how these advanced methods, like stochastic gradient MCMC and Laplace approximation, are not just theoretical concepts but pivotal in shaping enterprise AI models today.

And Sam is not just about algorithms and models: he is a sports enthusiast who loves football, tennis and squash, and he recently returned from an awe-inspiring trip to the Faroe Islands. So join us as we explore the future of AI with Bayesian methods. This is Learning Bayesian Statistics.
Welcome to Learning Bayesian Statistics, a podcast about Bayesian inference, the methods, the projects, and the people who make it possible. I'm your host, Alex Andorra. You can follow me on Twitter at alex_andorra, like the country. For any info about the show, learnbayesstats.com is Laplace to be. Show notes, becoming a corporate sponsor, unlocking Bayesian merch, supporting the show on Patreon: everything is in there. That's learnbayesstats.com. If you're interested in one-on-one mentorship, online courses, or statistical consulting, feel free to reach out and book a call at topmate.io/alex_andorra. See you around, folks, and best Bayesian wishes to you all.
Sam Duffield, welcome to Learning Bayesian Statistics.

Thanks, thank you very much.

Yeah, thank you so much for taking the time. I invited you on the show because I saw what you guys at Normal Computing were doing, especially with the Posteriors Python package. And I am personally always learning new stuff. Right now I'm learning a lot about sports analytics, because that's always been a personal pet project of mine, and Bayes is extremely useful in that field. But I'm also, in conjunction, working a lot on LLMs and their interaction with the Bayesian framework. I've been working much more on the BayesFlow package, which we've talked about with Marvin Schmitt in episode 107. So, yeah, I'm working on developing a PyMC bridge to BayesFlow, so that you can write your model in PyMC and then use amortized Bayesian inference for your PyMC models. It's still way, way down the road. I need to learn about all that stuff, but that's really fascinating. I love that. And so of course, when I saw what you were doing with Posteriors, I was like, that sounds... awesome. I want to learn more about that. So I'm going to ask you a lot of questions, a lot of things I don't know. So that's great. But first, can you tell us, give us a brief overview of your research interests and how Bayesian methods play a role in your work?

Yeah, thanks again for the invite. I think, yeah, sports analytics, Bayesian statistics, language models: I think we have a lot to talk about. Should be fun.
Bayesian methods in my work: yes, so at Normal we have a lot of problems where we think that Bayes is the right answer, if you could compute it exactly. So what we're trying to do is look at different approximations, how they scale across different methods and different settings, and how we can get as close to exact Bayes, the exact integral and updating under uncertainty, in a way that can provide us with some of those benefits.

Yeah, OK. That's interesting. I, of course, agree. Of course. Actually, do you remember when you were first introduced to Bayesian inference? Because you have an extensive background; you've studied a lot. When, in those studies, were you introduced to the Bayesian framework? And also, how did you end up working on what you're working on nowadays?
Yeah, okay. I'll try not to rant too long about this. But yeah, so I guess: mathematics undergraduate at Imperial. I was very young at this stage, we were all very young in our undergraduates, so not really sure what we wanted to do. At some point, it came to me that statistics, within the field of mathematics, is where I could work on applied problems and on where the field is going. And that's what got me excited. Statistics at undergraduate is different at different places, but you get thrown a lot of different points of view: you get your frequentist hypothesis testing, and then you have your Bayesian methods as well. And the Bayesian approach really settled with me as being more natural, in the sense that you just write it down: you have your forward model and your prior, and then Bayes' theorem handles everything else. As one of the lecturers in my first year said, mathematicians are lazy; they want to do as little as possible. So Bayes' theorem is kind of nice there, because you just write down your likelihood, you write down your prior, and then Bayes' theorem handles the rest. You have to do the minimum possible work: data, likelihood, prior, and then done. So that was really compelling to me.

And that led me to my PhD, which was in the engineering department in Cambridge. I had a few thoughts on what to do for my PhD; there was some more theoretical stuff, but I wanted to get into some problems, get into the weeds a bit. So: engineering department at Cambridge, working on Bayesian statistics, state space models, and, in state space models, sequential Monte Carlo. And terminology-wise, I use state space model and hidden Markov model to mean the same thing. So you have this time-series-style data, and working on that sort of data, I feel like the propagation of uncertainty really shines there, because you need to take into account your uncertainty from the previous experiments, say, when you update for your new ones. That was really compelling for me. That was, I guess, my route into Bayesian statistics.
Yeah, okay. Actually, here I could ask you a lot of questions about those time series models. I'm always fascinated by time series models. I don't know, I love them for some reason. I find there is a kind of magic in the ability of a model to take time dependencies into account. I love using Gaussian processes for that. So I could definitely go down that rabbit hole, but I'm afraid then I won't have enough time for you to talk about Posteriors.
Let me just say one minute about it. Gaussian processes are really cool. You can think of a Gaussian process as a continuous-time (or continuous-space, whatever the time-varying axis is; we'll call it time) version of a state space model. And a state space model, or hidden Markov model, to me is the canonical extension of a static Bayesian inference model to the time-varying setting. They kind of unify each other, because you can write smoothing in a state space model as one big static Bayesian inference problem, and you can write a static Bayesian inference problem, recovering x from y, as a single step of a state space model. So the techniques that you build just overlap, at least conceptually, on the mathematical level. When you actually get into the approximations and the computation, there are different things to consider, different axes of scalability, but conceptually, I really like that. I probably ranted for a bit more than a minute there, so I apologize.
No, no, that's fine. I love that. Yeah. I have much more knowledge and experience on GPs, but I'm definitely super curious to also apply these state space models and so on. So I'm definitely going to read the paper you sent me about skill rating of football players, where you're using, if I understand correctly, some state space models. That's going to be two birds with one stone. So thanks a lot for writing that.

The whole point of that paper is to say that rating systems, Elo, TrueSkill, are and should be reframed as state space models. And then you just have your full Bayesian understanding of it.

Yeah, yeah. I mean, for sure. I'm working myself also on a project like that, on football data. And yeah, the first thing I was doing is like, okay, I'm gonna write the simple model. But then as soon as I have that down, I'm gonna add a GP to that. It's like, I have to take these nonlinearities into account. So yeah, I'm super excited about that. So thanks a lot for giving me some weekend reading.
So actually, now let's go into your Posteriors package, because I have so many questions about that. Could you give us an overview of the package, what motivated its development, and also put it in the context of large-scale AI models?

Yeah, so as I said, we at Normal think that Bayes is the right answer. And we're interested in large-scale enterprise AI models. So we need to be able to scale to big, big models, big parameter sizes, and big data at the same time. That is what the Posteriors Python package, built on PyTorch, really hopes to bring. It's built with flexibility and research in mind. So really, we want to try out different methods, for different datasets and different goals, and see what's going to be the best approach for us. That's the motivation of the Posteriors package.
When would people use it? For instance, for which use cases would I use Posteriors?

There's a lot of just genuinely fantastic Bayesian software out there. But most of it has focused on the full-batch setting, as is classically the case with the Metropolis-Hastings accept-reject step. And we feel like we're moving, or have already moved, into the mini-batch era, the big data era. So Posteriors is mini-batch first. If you have a lot of data, even with a small model, and you want to try posterior sampling with mini-batches, to see if that can speed up your inference rather than doing a full-batch pass on every step, then Posteriors is the place for that, even with small models. You can just write down your model in Pyro, in PyTorch, and then use Posteriors to do that.

But that's moving from classical Bayesian statistics into the mini-batch setting. There are also benefits of very crude approximations to the Bayesian posterior in these really large-scale models. So for language models and big neural networks, you're not going to be able to do your convergence checks and those sorts of things, but you might still be able to get some advantages: out-of-distribution detection, improved predictive performance, continual learning. These are the sorts of things we're investigating: if you had just trained with gradient descent, essentially, you wouldn't necessarily get these things, but even very crude Bayesian approximations will hopefully provide these benefits. I think I will talk about this more later.
Yeah, okay. So basically, what I understand is that you can use Posteriors for basically any model.

So, I mean, we're still very young, and it doesn't have the support for, I don't know, if you want to do Gaussian processes, we're not going to have a whole suite of kernels that you can just type up. But fundamentally, it just takes a function, a log posterior function, and then you'll be able to try out different methods. But as I said, the big data regime is much less researched, and the big parameter regime is much harder, at least. So it's not going to be a silver bullet. Posteriors is, a lot of the time, a tool for research, where you're going to research which inference methods you can use, where they fail, and hopefully where they succeed as well.
Okay, I see. And so, to make sure listeners understand: can you do both in Posteriors? Can you write your model in Posteriors and then sample from it? Or is it only model definition, or only model sampling?

So it only does approximate posterior sampling. You're given some data and you write down the log posterior, or the joint, you could say. It doesn't have the sophisticated model-specification support of Stan or PyMC, where you can actually write down the model with support for all the distributions and do forward samples; it leans on other tools, like Pyro or PyTorch itself, for that. It is about approximate inference in the posterior space, in the sample space. So you can do a Laplace approximation and these things, and compare them. And importantly, it's mini-batch first: every method only expects to receive data batch by batch, so it can support the large data regime.
Okay, so I think there are a bunch of terms we need to define here for listeners.

Okay, yeah, sorry about that.

Can you define mini-batch? Can you define approximate inference and, in particular, Laplace approximation?

Okay, so mini-batch is the important one, of course. Normally, in traditional Bayesian statistics, if you're running random-walk Metropolis-Hastings or HMC, you will be seeing your whole dataset, all n data points, at every step of the iteration. And there's beautiful theory about that. But a lot of the time in machine learning, you have a billion data points. Or if you're doing a foundation model, it's all of Wikipedia, billions of data points or something like that. And there's just no way that, every time you do a gradient step, you can sum over a billion data points. So you take, say, 10 of them, and you form an unbiased approximation. But that unbiasedness doesn't propagate through the exponential, which you need for the Metropolis-Hastings step. So it rules out a lot of traditional Bayesian methods, but there's still been research on this. This is the scalable Bayesian learning we talk about with Posteriors: we're investigating mini-batch methods, methods that only use a small amount of the data at each step, as is very common in optimization with gradient descent and stochastic gradient descent. So hopefully...
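To make the mini-batch idea concrete, here is a minimal sketch (a toy Gaussian model invented for illustration, not code from Posteriors): the full-data log-likelihood is a sum over all n points, and rescaling a random subsample by n / batch_size gives an unbiased estimate of that sum.

```python
import math
import random

# Toy model for illustration: y_i ~ N(theta, 1), unknown mean theta.

def log_lik(theta, y):
    return -0.5 * (y - theta) ** 2 - 0.5 * math.log(2 * math.pi)

def full_log_lik(theta, data):
    # Full-batch log-likelihood: a sum over all n data points.
    return sum(log_lik(theta, y) for y in data)

def minibatch_log_lik(theta, data, batch_size, rng):
    batch = rng.sample(data, batch_size)
    # Rescale by n / batch_size so the estimator is unbiased for the full sum.
    return len(data) / batch_size * sum(log_lik(theta, y) for y in batch)

rng = random.Random(0)
data = [rng.gauss(1.0, 1.0) for _ in range(10_000)]

full = full_log_lik(0.5, data)
# Averaging many mini-batch estimates should land close to the full-batch value.
est = sum(minibatch_log_lik(0.5, data, 100, rng) for _ in range(2_000)) / 2_000
```

Note that this unbiasedness holds for the log-likelihood sum itself, but not after a nonlinearity such as exponentiation, which is exactly why the Metropolis-Hastings acceptance ratio mentioned above is ruled out.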
Mini-batches, okay. You said approximate inference?

So, okay, yeah, "inference" is a very loaded term; maybe I should try not to use it. But when I say approximate inference, I mean approximate Bayesian inference. You can write down the posterior distribution mathematically, p(theta | y), proportional to p(theta) p(y | theta). But you only have access to pointwise evaluations of that, and potentially even only mini-batch pointwise evaluations. So approximate inference is forming some approximation to that posterior distribution, whether that's a Gaussian approximation or Monte Carlo samples, an ensemble of points. And you have different fidelities of this posterior approximation.
Last one: Laplace approximation. The Laplace approximation is arguably the simplest approximation to the posterior distribution, in the machine learning setting at least. It's just a Gaussian distribution, so all you need to define is a mean and a covariance. You define the mean by running an optimization procedure on your log posterior, or just the log likelihood, and that gives you a point: that's your mean. And then, okay, the Laplace approximation gets quite into the weeds, but ideally you then do a Taylor expansion around that point. A second-order Taylor expansion gives you the Hessian, and the inverse of the negative Hessian is your approximate covariance. But there are related quantities there, and you can use the Fisher information instead. And there's lots to read on that; I'm sure you've had people on the podcast explain it better than me.
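As a sketch of the recipe just described (a conjugate Gaussian toy example chosen so the exact answer is known; this is illustrative code, not the Posteriors implementation): find the mode by gradient ascent on the log posterior, then take the inverse of the negative second derivative at the mode as the variance.

```python
# 1D Laplace approximation sketch. Model: prior theta ~ N(0, 1),
# likelihood y_i ~ N(theta, 1). The posterior is exactly Gaussian here,
# N(sum(y) / (n + 1), 1 / (n + 1)), so the Laplace fit should match it.

data = [0.8, 1.2, 1.0, 0.6, 1.4]

def log_post(theta):
    lp = -0.5 * theta ** 2                             # log prior (up to const)
    lp += sum(-0.5 * (y - theta) ** 2 for y in data)   # log likelihood
    return lp

def grad(f, x, h=1e-5):
    # Central finite-difference first derivative.
    return (f(x + h) - f(x - h)) / (2 * h)

def hess(f, x, h=1e-4):
    # Finite-difference second derivative.
    return (f(x + h) - 2 * f(x) + f(x - h)) / h ** 2

# 1) Mean: find the mode by gradient ascent on the log posterior.
theta = 0.0
for _ in range(500):
    theta += 0.05 * grad(log_post, theta)

# 2) Covariance: inverse of the negative Hessian at the mode.
var = -1.0 / hess(log_post, theta)

exact_mean = sum(data) / (len(data) + 1)
exact_var = 1.0 / (len(data) + 1)
```

In this conjugate case the Laplace approximation recovers the posterior exactly; for non-Gaussian posteriors it only matches the local curvature at the mode.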
For Laplace, no. Actually, that's why I asked you to define it.

I'm happy to go down into the weeds if you want.

Yeah, if you think that's useful. Otherwise, we can definitely also do an episode with someone you'd recommend to talk about Laplace approximation. Something I'd like to communicate to listeners, for them to understand: we say approximation, but at the same time, MCMC is an approximation itself. So that can be a bit confusing. Can you talk about why these kinds of methods, like the Laplace approximation, and I think VI, variational inference, would fall into this bucket too, are called approximations, in contrast to MCMC? What's the main difference here?
Honestly, I would say MCMC is also an approximation, in the same terminology. But the difference is that we talk about bias: some methods are asymptotically unbiased, which MCMC is, and stochastic gradient MCMC, which is what Posteriors has as well, is too, under some caveats (and there are caveats for normal MCMC as well). But then you have your Gaussian approximations, from variational inference and the Laplace approximation. And these are very much approximations in the sense that there's no axis you can push to infinity to recover the posterior. You cannot do that with Gaussian approximations unless your posterior is known to be Gaussian, and I mean, there are interesting cases like that, like Gaussian processes and things. But they don't have this asymptotically unbiased feature that MCMC does, or that importance sampling and sequential Monte Carlo do, which is very useful because it allows you to trade compute for accuracy. You can't do that with a Laplace approximation or VI, beyond extending, say, from a diagonal covariance to a full covariance, or things like that. And trading compute for accuracy is very useful in the case that you have extra compute available. So I'm a big fan of the asymptotic unbiasedness property, because it means that you can increase your compute safely.
Yeah. Great explanation, thanks a lot. And so, yeah, as you were saying, these approximations don't have this asymptotic unbiasedness, but at the same time, that means they can be way faster. So if you're in the right use case, then it really makes sense to use them. But you have to be careful about the conditions where the approximation falls down. Can you maybe dive a bit deeper into stochastic gradient descent, which is the method that Posteriors is using, and how that fits into these different methods that you just talked about?
Actually, stochastic gradient descent is not a method that Posteriors is using per se. Stochastic gradient descent is the workhorse of most machine learning algorithms, but Posteriors would kind of say it shouldn't be, perhaps, or not in all cases. Stochastic gradient descent is what you use if you have extremely large data and you just want to find the MLE, the maximum likelihood estimate, or the minimum of a loss, you might say. That is just an optimization routine: you just want to find the parameters that minimize something. If you're doing variational inference, what you can do is tractably estimate the KL divergence between your specified variational distribution and the posterior. Then you have parameters, the parameters of the variational distribution over your model parameters, and you use stochastic gradient descent on that. This is nice because it means you can throw the workhorse from machine learning at a Bayesian problem and get a Bayesian approximation out. Again, as we mentioned, it doesn't have the asymptotically unbiased feature, which is maybe less of a concern in machine learning models, where you have less ability to trade compute because you've kind of filled your compute budget with your gigantic model. Although we think this might change over the coming years. But yeah, maybe not; maybe we'll just go even bigger and bigger and bigger.
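A minimal sketch of variational inference run with plain stochastic gradient ascent, as just described (a toy conjugate Gaussian model chosen so the exact posterior is known; illustrative code, not the Posteriors API): the optimized parameters are the mean and log standard deviation of the variational distribution, with gradients taken through the reparameterization theta = m + s * eps.

```python
import math
import random

# Model: prior theta ~ N(0, 1), y_i ~ N(theta, 1), so the exact posterior is
# N(sum(y) / (n + 1), 1 / (n + 1)). Variational family: q = N(m, s^2).

rng = random.Random(1)
data = [0.8, 1.2, 1.0, 0.6, 1.4]
n, S = len(data), sum(data)

def grad_log_post(theta):
    # d/dtheta [log prior + log likelihood] = -theta + sum_i (y_i - theta).
    return S - (n + 1) * theta

m, log_s = 0.0, 0.0
lr, n_steps, n_avg = 0.005, 20_000, 5_000
m_avg = s_avg = 0.0
for step in range(n_steps):
    eps = rng.gauss(0.0, 1.0)
    s = math.exp(log_s)
    theta = m + s * eps                    # reparameterised sample from q
    g = grad_log_post(theta)
    m += lr * g                            # stochastic ascent on the ELBO
    log_s += lr * (g * s * eps + 1.0)      # +1.0 is the entropy gradient
    if step >= n_steps - n_avg:            # average the final iterates
        m_avg += m / n_avg
        s_avg += math.exp(log_s) / n_avg

exact_mean = S / (n + 1)
exact_sd = (1.0 / (n + 1)) ** 0.5
```

Because this posterior is exactly Gaussian, the fitted variational mean and standard deviation should land on the exact values; in general VI carries the bias discussed above.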
Okay, sorry, I got lost. You were asking about stochastic gradient descent. So actually, there's something interesting to say here.

And that also gets at the main distinguishing characteristics of Posteriors, so that people really understand its use case here.

Yeah. Okay.
So yeah, there's a key thing about the way we've written Posteriors: where possible, we have stochastic gradient descent, so optimization, as a limit under certain hyperparameter specifications of the algorithms. And it turns out that in a lot of cases (so, we talked about MCMC, and then about stochastic gradient MCMC, which are MCMC methods that strictly handle mini-batches), you can write down a temperature parameter of your posterior distribution. If the temperature is very high, your posterior distribution is heated up: you've increased the tails, and it's much closer to a uniform distribution. If you take it very cold, it becomes very pointed, focused around optima. So we write the algorithms so that there's this convenient transition through the temperature: set the temperature to zero, and you just get optimization. This is a key thing about Posteriors: the stochastic gradient MCMC methods have this temperature parameter which, if you set it to zero, becomes a variant of stochastic gradient descent. So you can unify gradient descent and stochastic gradient MCMC, and it's nice: you have your Langevin dynamics, which, tempered down to zero, just becomes vanilla gradient descent; and you have underdamped Langevin dynamics, or stochastic gradient HMC, stochastic gradient Hamiltonian Monte Carlo, where you set the temperature to zero and you've just got stochastic gradient descent with momentum. So this is a nice thing about Posteriors, unifying these approaches, and hopefully it makes Bayesian approaches less scary to use, because you know you always have gradient descent, and you can sanity-check by just fiddling with the temperature parameter.
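The unification just described can be sketched with stochastic gradient Langevin dynamics (a simplified toy, not the Posteriors implementation): a single mini-batch update rule whose temperature interpolates between posterior sampling (temperature = 1) and stochastic gradient ascent to the mode (temperature = 0).

```python
import math
import random

# Toy conjugate model: prior N(0, 1), y_i ~ N(theta, 1), so the exact
# posterior is N(sum(y) / (n + 1), 1 / (n + 1)).

rng = random.Random(3)
n, batch_size, lr = 100, 20, 1e-3
data = [rng.gauss(1.0, 1.0) for _ in range(n)]

def minibatch_grad(theta):
    # Unbiased mini-batch estimate of the gradient of the log posterior.
    batch = rng.sample(data, batch_size)
    return -theta + (n / batch_size) * sum(y - theta for y in batch)

def sgld(temperature, n_steps=20_000):
    theta, trace = 0.0, []
    for _ in range(n_steps):
        # Injected noise scales with the temperature; at 0 it vanishes and
        # the update is exactly a stochastic gradient ascent step.
        noise = math.sqrt(lr * temperature) * rng.gauss(0.0, 1.0)
        theta += 0.5 * lr * minibatch_grad(theta) + noise
        trace.append(theta)
    return trace[n_steps // 2:]            # discard burn-in

hot = sgld(temperature=1.0)    # approximate samples from the posterior
cold = sgld(temperature=0.0)   # stochastic gradient ascent to the mode

post_mean, post_sd = sum(data) / (n + 1), (1.0 / (n + 1)) ** 0.5
hot_mean = sum(hot) / len(hot)
hot_sd = (sum((x - hot_mean) ** 2 for x in hot) / len(hot)) ** 0.5
cold_mean = sum(cold) / len(cold)
```

With temperature 1 the chain's spread should roughly match the posterior standard deviation (up to discretization and mini-batch gradient noise); with temperature 0 the very same code is just noisy gradient ascent concentrating at the mode.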
Okay, that's really cool. So it's a bit like the temperature parameter in transformers, I mean, in LLMs, that adds a bit of variation on top of the predictions that the LLM could make.

Yeah, it's exactly the same as that. When you use this in language models, in natural language generation, you temper the generative distribution, so the logits get tempered. If you set the temperature there to zero, you get greedy sampling. But we're doing this in parameter space. It has this, yeah, exactly.
:Distribution tempering is a broad thing,
particularly in, I'm not going to go too
455
:philosophical, but I mean, I've first met
with like tempering, then we thought about
456
:it in the settings of sequential Monte
Carlo, and it's like, is it the natural
457
:way?
458
:Is it something that's natural to do?
459
:But in the context of Bayes, because
Bayes' theorem is multiplicative, right,
460
:you have your P of theta, P of y given
theta, it kind of makes sense to temper
461
:because it means like, okay, I'll just
introduce the likelihood a little bit.
462
:and sort of tempering as a natural way to
do it because there's multiplicative
463
:feature of Bayes' theorem.
464
:So, I kind of settled with me after
thinking about it like that.
Yeah, no, that makes perfect sense. And I was really surprised to see that used in LLMs when I first read about the algorithms. And I was pleasantly surprised because I've worked a lot on electoral forecasting models; that's how I was introduced to Bayesian stats. Actually, I'd done that without knowing it. First, I'm using the softmax all the time, because for electoral forecasting, unless you're doing that in the US, you need a multinomial likelihood. The multinomial needs a probability distribution, and you get that from the softmax function, which is actually a very important one in the LLM framework. And also, the thing is, your probability is the latent popularity of each party, but you never observe it, right? And so you could conceptualize the polls as a tempered version of the true latent popularity. And so that was really interesting. I was like, damn, this stuff is much more powerful than what I thought, because I was applying it only to electoral forecasting models, which is a very niche application of these models, and actually there are so many applications of that in the wild.
486
Yeah, tempering in general is very widespread, and I would also say not particularly well understood. There's been research into this cold posterior effect, which is a somewhat annoying thing for Bayesian modeling of neural networks. As I said, you have this temperature parameter that transitions between optimization and the Bayesian posterior: zero is optimization, one is the Bayesian posterior. And empirically we see better predictive performance, which is a lot of the time what we care about in machine learning, with temperatures less than one. Which is annoying, because we're Bayesians and we think the Bayesian posterior gives optimal decision-making under uncertainty.

But at least in our experiments, we found this so-called cold posterior effect to be much more prominent under Gaussian approximations, which we only believe to be very crude approximations to the posterior anyway. If we do more MCMC or deep ensemble stuff, it's much weaker. We've got a paper we'll be able to put on arXiv shortly which describes deep ensembles: you just run gradient descent in parallel with different initializations and batch shuffling. Say you run 10 optimizations in parallel, then you've got 10 parameter configurations at the end, a Monte Carlo approximation to the posterior of size 10. And in the paper we describe how to get the asymptotically unbiased property by using that temperature. As we said earlier, SGMCMC becomes SGD at temperature zero, and you can reverse this for deep ensembles: you add the noise back in, and deep ensembles become asymptotically unbiased MCMC, something in between SGMCMC and deep ensembles. In those cases, where you have a non-Gaussian approximation, we found much less of the cold posterior effect.

So maybe the cold posterior effect is a natural thing, because tempering isn't really Bayes' theorem anymore. It still needs to be better understood. At least in my head, I'm not fully clear on whether the cold posterior effect is something we should be surprised about.
Okay, yeah. Me neither, if that makes you feel any better, because I just learned about it. So I don't have any strong opinion.

Okay, I think we're getting clearer now, for listeners, on what Posteriors is for. So one last question about the algorithms underlying all of that: stochastic gradient MCMC. That's where I got confused. I hear stochastic gradient and think of stochastic gradient descent, but no, it's SGMCMC, not SGD. So Posteriors really leans on SGMCMC. Why would you do that and not use classic MCMC, like HMC from Stan or PyMC?
Yeah, so it's not just SGMCMC. There's also variational inference, the Laplace approximation, the extended Kalman filter, and we're really excited to add more methods as we maintain and expand the library.

Why would you use SGMCMC? I think we've already touched on this. The thing is, if you've got loads of data, it's just inefficient to sum over all of that data at every iteration of your MCMC algorithm, as Stan would do. And there are mathematical reasons why you can't just subsample in Stan. The Metropolis-Hastings ratio has this exponential of the log posterior, but log space is the only place you can get an unbiased approximation, which is what you need if you did want to naively subsample. So you can't do the Metropolis-Hastings accept/reject step; you have to use different tooling. In its simplest form, SGMCMC just omits the accept/reject and runs a Langevin diffusion, so it runs your Hamiltonian Monte Carlo without the accept/reject step. There's more theory on top of this, and you need to control the discretization error and things like that, but I won't go into the weeds.
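In its simplest form, the recipe just described (a mini-batch gradient step plus scaled Gaussian noise, with no accept/reject) is stochastic gradient Langevin dynamics. A minimal illustrative sketch on a toy conjugate Gaussian model, in plain Python rather than the Posteriors implementation:

```python
import math
import random

random.seed(0)

# Synthetic data: y_i ~ N(theta_true, 1)
theta_true, n_data = 3.0, 1000
data = [theta_true + random.gauss(0, 1) for _ in range(n_data)]

prior_var = 10.0  # prior: theta ~ N(0, prior_var)

def minibatch_grad(theta, batch):
    """Unbiased estimate of the full log-posterior gradient:
    grad log prior + (N / batch size) * sum of batch log-likelihood gradients."""
    scale = n_data / len(batch)
    return -theta / prior_var + scale * sum(y - theta for y in batch)

# SGLD: a discretized Langevin diffusion, no Metropolis-Hastings accept/reject.
eps, theta, samples = 1e-4, 0.0, []
for step in range(5000):
    batch = random.sample(data, 32)
    theta += 0.5 * eps * minibatch_grad(theta, batch) + random.gauss(0, math.sqrt(eps))
    if step >= 1000:  # discard burn-in
        samples.append(theta)

posterior_mean = sum(samples) / len(samples)
print(posterior_mean)  # close to the sample mean of the data, near theta_true
```

Dropping the injected noise term recovers plain stochastic gradient ascent on the log posterior, which is exactly the temperature-zero limit discussed earlier.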
Okay, yeah. And that's tied to mini-batching, basically. The power that SGMCMC gives you in a high-data regime comes from the mini-batching, if I understand correctly.

Exactly, that's the difference between MCMC and SGMCMC.

Okay, so that's the main difference.
Yeah, stochastic gradient: you can't actually get the exact gradient that you'd need in Hamiltonian Monte Carlo or for a Metropolis-Hastings step. You only get an unbiased approximation. And there's theory about this: sometimes you can invoke the central limit theorem, so you've got a covariance attached to your gradients, and you can do nice theory and improve the convergence like that.
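That unbiasedness is easy to verify directly: rescaling the mini-batch gradient by N over the batch size and averaging over every possible mini-batch recovers the full-data gradient exactly (an illustrative sketch, assuming a toy Gaussian log-likelihood):

```python
from itertools import combinations

# Full-data gradient of sum_i log N(y_i | theta, 1) is sum_i (y_i - theta).
data = [1.0, 2.0, 4.0, 7.0]
theta = 0.5
full_grad = sum(y - theta for y in data)

# Rescaled mini-batch estimator: (N / m) * sum over the batch.
m = 2
estimates = [len(data) / m * sum(y - theta for y in batch)
             for batch in combinations(data, m)]

# Each individual estimate is noisy, but the average over all
# equally likely mini-batches is exactly the full gradient.
average = sum(estimates) / len(estimates)
print(full_grad, average)  # both 12.0
```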
Okay, all clear now. Awesome. And I think that's the first time we've talked about that on the show, so it was definitely useful to be extra clear, so that listeners understand, and me, myself, so that I understand. Thanks a lot.
In some settings it's actually much simpler, because you remove the accept/reject machinery, so the implementation gets a bit simpler. But you kind of lose the theory in that. And then a lot of the argument is, if you use a decreasing step size, then the noise from the mini-batch, the noise from the stochastic gradient, decreases as epsilon squared, which is faster. So if you decrease your step size and run for infinite time, you'll eventually just be running the continuous-time dynamics, which are exact and do have the right stationary distribution. So if you run with decreasing step size, you are asymptotically unbiased. But running with decreasing step size is really annoying, because you then don't move as far, and as we know from normal MCMC, we want to increase our step size and move, and explore the posterior more. So there's lots of research to be done here.
I hope and I feel that it's not the last time you'll talk about stochastic gradient MCMC on this podcast.

Yeah, no. I mean, that sounds super interesting, and I'm really interested to understand the differences between these algorithms. Right now that's really at the frontier of research. Not only is there a lot of research on how to make HMC more efficient, but you have all these new approximate algorithms, as we said before: variational inference, Laplace approximation, stuff like that. But also now you have normalizing flows. We talked about those in episode 98 with Marylou Gabrié. Actually, I don't know why I pronounced the second part the Spanish way; my Spanish is really available in my brain right now. She's French, so that's Marylou Gabrié, episode 98, it's in the show notes. And episode 107, which I already mentioned, with Marvin Schmitt, about amortized Bayesian inference. Actually, do you know about amortized Bayesian inference and normalizing flows?

I know a bit about normalizing flows. Amortized Bayesian inference I would be less comfortable with. But I mean, if you could explain it... Yeah, I haven't listened to that episode yet.

Yeah, I mean, we released it yesterday. I'm a bit disappointed, Sam, but that's fine. It's just one day, you know. If you listen to it just after the recording, I'll forgive you.

That's okay.

No, so, kidding aside, I'm actually curious to hear you speak about the difference between normalizing flows and SGMCMC. Can you talk a bit about that, if you're comfortable with it?
I can try, though it's been a while since I've read about normalizing flows. When I did read about them, I understood them to be essentially a form of variational inference where you define a more elaborate variational family, essentially through a triangular mapping. Someone might say, why can't you just use a neural network as your variational distribution? And it's not so easy, because you need a tractable form. The thing with normalizing flows is that you can get this because they're invertible. That's it: normalizing flows are invertible, so you can write down the change-of-variables formula and then fit the flow to a distribution essentially by maximum likelihood.

Whereas SGMCMC doesn't need that. With normalizing flows, you kind of have to define the ansatz that will fit your distribution. I think normalizing flows are really exciting and really interesting, but you have to specify that ansatz. It's another specification on top: rather than just writing the log posterior, you then need to find an approximate ansatz which you think will fit the posterior, or the distribution you're targeting. Whereas SGMCMC is just: log posterior, go. Which is sort of what we're trying to do with Posteriors. We're trying to automate... well, not automate, we're doing research, of course. But yeah, as I said, I think it's really interesting that you can get these more expressive variational families through triangular mappings.
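The change-of-variables formula mentioned here can be shown with the simplest possible flow, a 1-D affine map: the flow's log density is the base log density at the inverted point plus the log absolute Jacobian of the inverse. For the affine map below this reproduces a Gaussian N(b, a²) exactly (an illustrative sketch, not a library implementation):

```python
import math

def base_log_prob(z):
    """Standard normal base distribution."""
    return -0.5 * z * z - 0.5 * math.log(2 * math.pi)

def flow_log_prob(x, a=2.0, b=1.0):
    """Invertible affine flow x = a * z + b.
    Change of variables: log p_x(x) = log p_z((x - b) / a) - log|a|."""
    z = (x - b) / a
    return base_log_prob(z) - math.log(abs(a))

def gaussian_log_prob(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2 * math.pi)

# The flow density matches N(b, a^2) analytically.
x = 0.7
print(flow_log_prob(x), gaussian_log_prob(x, 1.0, 2.0))  # equal up to rounding
```

A real flow stacks many such invertible maps with learned parameters; the tractable log determinant is exactly what lets you train it by maximum likelihood.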
Yeah, super interesting. And amortized Bayesian inference is related in the sense that you first fit a deep neural network on your model, and then once it's fit, you get posterior inference for free, basically. So that's quite different from what I understand SGMCMC to be, but it's also extremely interesting. That's also why I'm hammering you on the different use cases of SGMCMC, so that listeners and I have a kind of decision tree in our heads: okay, my use case is more appropriate for SGMCMC; or no, here I'd like to try amortized Bayesian inference; or here I can just stick to plain vanilla HMC. I think that's very interesting. But thanks for that question that was completely improvised. I definitely appreciate you taking the time to rack your brain about the difference with normalizing flows.

No, I'd love to talk more on that. I'd need to refresh myself. I've written down some notes on normalizing flows and I was quite comfortable with them, but it's just been a while since I refreshed. So I would love to refresh, and then we can chat about them, because I'd love to do a project on them, or I'd love to work on them. They're a way to fit a distribution to data, which is, after all, a lot of what we do.

Yeah. So that makes me think we should probably do another episode about normalizing flows. So listeners, if there is a researcher you like who does a lot of normalizing flows and you think would be a good guest on the show, please reach out to me and I'll make that happen.
Now let's get you closer to home, Sam, and talk about Posteriors again. Basically, if I understood correctly, Posteriors aims to address uncertainty quantification in deep learning. Am I right here? And if that's the case, why is this particularly important for neural networks, and how does the package help in managing overconfidence in model predictions?
Yeah, that's our primary use case. The normal way to use Posteriors is approximate Bayes: we're getting as close to Bayes as we can, which is probably not that close, but still getting somewhere on the way to the Bayesian posterior in big deep learning models. But we built Posteriors to be as modular and general as possible. So as I said, if you have a classical Bayesian model, you can write it down in Pyro, and if you've got loads of data, then okay, go ahead, Posteriors should be well suited to that.

In terms of what advantages we want to see from uncertainty quantification, or this approximate Bayesian inference in deep learning models, there are three key things we've distilled it down to. You mentioned confidence on out-of-distribution predictions: we should be able to improve our performance in predicting on inputs that we haven't seen in the training set. I'll talk about that after this.
The second one is continual learning. If you can do Bayes' theorem exactly: you have your prior, you get some data and a likelihood, you have a posterior; then you get some more data, your posterior becomes your prior, and you do the update again. You can just chain it like that if you can do Bayes' theorem exactly. And you can extend it even further: with some sort of evolution on your parameters you have a state-space model, and in the exact linear-Gaussian setting you've got a Kalman filter. So continual learning is, in this sense, something Bayes' theorem does exactly. In continual learning research in machine learning settings, they have this term of avoiding catastrophic forgetting. If you just continue to do gradient descent, there's no memory there, so apart from the initialization you would just forget what you've done previously, and there's lots of evidence for this. Whereas Bayes' theorem is completely exchangeable in the order of the data that you see. So if you're doing Bayes' theorem exactly, there's no forgetting; you just have the capacity of the model. That's where we see Bayes solving continual learning. But, as I said, you can't do Bayes' theorem exactly in a billion-dimensional model.
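In a conjugate setting the no-forgetting property can be checked exactly: updating a Gaussian prior with one batch and then using the posterior as the prior for the next batch matches processing all the data at once, in any order. A small illustrative sketch:

```python
def gaussian_update(prior_mean, prior_var, data, noise_var=1.0):
    """Exact Bayes update for y_i ~ N(theta, noise_var) with theta ~ N(mean, var)."""
    precision = 1.0 / prior_var + len(data) / noise_var
    mean = (prior_mean / prior_var + sum(data) / noise_var) / precision
    return mean, 1.0 / precision

batch1, batch2 = [1.0, 2.0, 3.0], [4.0, 5.0]

# Continual learning: the posterior after batch 1 becomes the prior for batch 2.
m1, v1 = gaussian_update(0.0, 10.0, batch1)
m_seq, v_seq = gaussian_update(m1, v1, batch2)

# All data at once, and in reversed order: same posterior (exchangeability).
m_all, v_all = gaussian_update(0.0, 10.0, batch1 + batch2)
m_rev, v_rev = gaussian_update(*gaussian_update(0.0, 10.0, batch2), batch1)

print(m_seq, m_all, m_rev)  # all agree: no catastrophic forgetting
```

The billion-dimensional problem is precisely that no such exact update exists for a neural network, which is where the approximate methods above come in.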
And the last one is what we'd call decomposition of uncertainty in your predictions. If you just have a gradient-descent model and you're predicting, say, the star ratings of someone's reviews, it will just give you its softmax, a single distribution over the stars, and that's it. But what you really want is some indication, also for out-of-distribution detection, of whether you're confident in that prediction. You might get a review like, the food was terrible but the service was amazing, or the service was amazing but the food was terrible. Even if we had a perfect model of how people review things, we'd have quite a lot of uncertainty over that review, because we don't know how the reviewer weighs those different things. So we might have a completely uniform distribution over the stars for that review, but be confident in that distribution.

What Bayes gives you is this sort of second-order uncertainty quantification. If you have a distribution over parameters, and so a distribution over logits, over the predictions, you can split the uncertainty into what information theory calls aleatoric and epistemic uncertainty. Aleatoric uncertainty, or data uncertainty, is what I just described there: natural uncertainty in the model and the data-generating process. Epistemic uncertainty is uncertainty that would be removed in the infinite-data limit, so that's where the model doesn't know. And that's really important for us to quantify.
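This decomposition has a standard information-theoretic form: with an ensemble of predictive distributions, one per posterior parameter sample, total predictive entropy = expected entropy (aleatoric) + mutual information between parameters and prediction (epistemic). An illustrative sketch for categorical predictions:

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def decompose(ensemble):
    """ensemble: categorical predictive distributions, one per posterior sample."""
    k = len(ensemble[0])
    mean_pred = [sum(p[i] for p in ensemble) / len(ensemble) for i in range(k)]
    total = entropy(mean_pred)                                     # total uncertainty
    aleatoric = sum(entropy(p) for p in ensemble) / len(ensemble)  # expected entropy
    epistemic = total - aleatoric                                  # mutual information
    return total, aleatoric, epistemic

# Members agree on a flat distribution: high aleatoric, near-zero epistemic
# (the 'confidently uniform' review example above).
flat = decompose([[0.5, 0.5], [0.5, 0.5]])

# Members confidently disagree: low aleatoric, high epistemic
# (the model doesn't know).
disagree = decompose([[0.99, 0.01], [0.01, 0.99]])

print(flat, disagree)
```

Both cases have the same averaged prediction, a 50/50 split; only the decomposition distinguishes "genuinely uncertain data" from "the model doesn't know".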
Okay.

Yeah, I rambled on a bit there. But I can elaborate in about 30 seconds on the point you specifically mentioned, out-of-distribution predictions and improving performance out of distribution. I think that's quite compelling from a Bayesian point of view, because of what Bayes says in a supervised learning setting. Gradient descent just fits one parameter configuration that's plausible given the training data. Bayes' theorem says, I find the whole distribution of parameter configurations that are plausible given the data, and then when we make predictions, we average over those. So it's perfectly natural to think that a single configuration might overfit, and might just be very confident in its prediction when it sees out-of-distribution data. Averaging doesn't necessarily solve a bad model, but it should be more honest to the model and the data-generating process you've specified, if you average over plausible model configurations under the training data when you do your testing. So to me that's quite a compelling argument for improving performance on out-of-distribution predictions, like the accuracy of them. And there's a fair bit of empirical evidence for this, with the caveat again being that the Bayesian posterior in high-dimensional machine learning models is pretty hard to approximate: cold posterior effect, caveats, things like that.
Okay, yeah, I see. Super interesting. So now I understand better what you have on the Posteriors website about the different kinds of uncertainty. That's definitely something I recommend listeners give a read. I'll put it in the show notes, both your blog post introducing Posteriors and the Posteriors docs, because combined with your explanation right now, I think it makes that clear.

And something I was also wondering: if I understood correctly, the package is built on top of PyTorch, right?

Yeah, that's correct.

Okay. And also, did I understand correctly that you can integrate Posteriors with pre-trained LLMs like Llama 2 and Mistral, and you do that with Hugging Face's Transformers package?
Yeah. So Posteriors is open source, and we fully support the open-source community for machine learning and statistics. We're sort of in the fine-tuning era: there are these open-source models, Llama 2, Llama 3, Mistral, and you can't get away from them. Basically we want to harness that power, right? But as I mentioned previously, there are some issues that we'd like to remedy with Bayesian techniques.

So, the majority of these open-source models are built in PyTorch. I'm also a big JAX fan, I use JAX a lot, so I was very happy to see and work with the torch.func sub-library, which basically means you can write your PyTorch code, and use Llama 3 or Mistral with PyTorch, but writing functional code. That's what we've done with Posteriors. So yes, Hugging Face Transformers is where all the models are hosted and how you access them, but what you get is just a PyTorch model. It's just a PyTorch model. And then you throw that in and it composes nicely with the Posteriors updates. Or you write your own new updates in the Posteriors framework and you can use those as well, still with Llama 3 or Mistral.

Yeah. Okay, nice.
And so what does it mean concretely for users? It means you can use these pre-trained LLMs with Posteriors, and that means adding a layer of uncertainty quantification on top of those models?

Yeah. You need data as well; Bayes' theorem is a training theorem. So you take your pre-trained model, which is a transformer, or it could be another type of model, an image model or something like that, and then you give it some new data, which we would call fine-tuning, and you use Posteriors to combine the two. And then you have your new model out at the end of the day, which has uncertainty quantification.

It's difficult. As I said, we're sort of in this fine-tuning era of open-source large language models, and there's still lots of research to do here. It's different from our classical Bayesian regime, where there's only one source of data and it's what we give the model. In this case there are two sources of data: whatever Llama 3 saw in its original training data, and then your own data. Can we hope to get uncertainty quantification on the data they used in the original training? Probably not. But we might be able to get uncertainty quantification and improved predictions based on the data that we've provided. So there's lots for us to try out here and learn, because we are still learning about this fine-tuning setting. But that's what Posteriors is there for, to make these sorts of questions as easy as possible to ask and answer.
Okay, fantastic. That's so exciting. It's a bit frustrating to me, because I'd love to try that, and learn from it, and contribute to that kind of package. At the same time, I have to work, I have to do the podcast, and I have all the packages I'm already contributing to. So I'm like, my god, too many choices!

No, come on, Alex. We're going to see an Alex pull request soon enough.

Actually, does this ability to use these pre-trained transformer models help facilitate the adoption of new algorithms in Posteriors? Because if I understand correctly, you can support new algorithms pretty easily, and you can support arbitrary likelihoods. How do you do that?
I wouldn't say that the existence of the pre-trained models necessarily allows us to support new algorithms. I feel like we've built Posteriors to be suitably general and suitably modular that it's agnostic to your model choice and your log-posterior choice, and that's where arbitrary likelihoods come in.

The arbitrary-likelihood point is relevant because a lot of machine learning essentially boils down to classification or regression. That is true, and because of that, a lot of machine learning packages will essentially constrain you to classification or regression: at the end, you either have your softmax with cross-entropy or you have your mean squared error. In Posteriors, we haven't done that. We're more faithful to the Bayesian setting, where you just write down your log posterior, and you can write down whatever you want. This allows you greater flexibility in case you did want to try out a different likelihood, and even simple cases are often more sophisticated than plain classification or regression, like sequence generation, where you have the whole sequence and then the cross-entropy over all of it. It just allows you to be more flexible and write the code how you want. And there are additional things to be taken into account. Sometimes, if you were doing a regression, you might have knowledge of the observation noise variance, and it's just much easier to write cleaner code for that if we don't constrain things like this.

And it's also future-proofing. We don't know what's going to happen going forward. In multimodal models we may see text and images together, in which case, yeah, we will support that. You have to supply the compute and the data, which might be the harder thing, but we'll support those likelihoods.
Okay, I see. Yeah, that's very interesting. And that's related to the fact, I think I've read in your blog post or on the website, that you say Posteriors is swappable. What does that mean, and how does that flexibility benefit users?
Yeah. The point of swappable, when I say that, is that you can change between methods. As I said, Posteriors is a research toolbox, and it's for us to investigate which inference method is appropriate in the different settings, which might be different if you care about decomposing predictive uncertainty, or different if you care about avoiding catastrophic forgetting in your continual learning. The way it's written, you can go from SGHMC to the Laplace approximation, or to VI, just by changing one line of code. And the way it works is, you have your transform = posteriors.<inference_method>.build(...), plus any configuration arguments, step size, things like this, which are algorithm-specific. After that, it's all unified. You just have your init on the parameters that you want to be Bayesian about, and then you iterate through your data loader, through your data, and it just updates based on the batch. And batch can be very general. So that's what it means: you can change one line of code to swap between variational inference, SGHMC, the extended Kalman filter, or any of the new methods that the listeners are going to add in the future.
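The build/init/update pattern described here can be mimicked in a few lines of plain Python. This is a toy sketch of the interface shape only, not the actual Posteriors API: each method is packaged as a transform with identical init and update signatures, so swapping inference methods changes only the build line.

```python
import random
from collections import namedtuple

random.seed(1)

# Each inference method is packaged as a transform with the same
# init/update signatures; only the build line differs.
Transform = namedtuple("Transform", ["init", "update"])

def build_sgd(grad_fn, lr=0.1):
    """Toy 'optimization' method: gradient ascent on the log posterior."""
    return Transform(
        init=lambda params: params,
        update=lambda state, batch: state + lr * grad_fn(state, batch),
    )

def build_sgld(grad_fn, lr=0.1):
    """Toy 'Bayesian' method: the same step plus Langevin noise."""
    return Transform(
        init=lambda params: params,
        update=lambda state, batch: state + lr * grad_fn(state, batch)
        + random.gauss(0, (2 * lr) ** 0.5),
    )

def grad_log_post(theta, batch):
    return sum(y - theta for y in batch)  # toy Gaussian log-posterior gradient

data_loader = [[1.0, 2.0], [3.0, 4.0]] * 50

# Swapping the inference method means changing this one line only.
transform = build_sgd(grad_log_post)  # or: build_sgld(grad_log_post)
state = transform.init(0.0)
for batch in data_loader:
    state = transform.update(state, batch)
print(state)
```

In the real library the state would hold the parameters plus any algorithm-specific extras (momenta, covariance factors), but the loop itself stays identical across methods, which is the whole point of the design.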
Heh. Okay, I see.
And I have so many more questions for you on Posteriors, but let's start to wrap that up, because I also want to ask you about another project you're working on. So maybe, to close out on Posteriors: what are the future plans, and are there any upcoming features or integrations that you can share with us?
So we're quite happy with the framework at the moment. There are lots of little tweaks, a list of GitHub issues that we want to go through, which are mostly, and excitingly, about adding new methods and new applications. That's really what we're excited about now: actually using it in the wild and, hopefully, experimenting with all these questions that we've discussed. Like, how does it make sense, and how do we get the benefits of true Bayesian inference in fine-tuning, or on large models, or large data? So yeah, we're really excited to add more methods. If listeners have mini-batch, big-data Bayesian methods that they want to try out with a large model, then we'll hopefully accept them.

I do promote generality, and doing things in a way that's flexible. So we may think a lot about it. We want to add methods that somehow feel natural, and one way is to extend and compose with other methods. So if something requires a very complicated last-layer treatment just for a classification method, we're probably not going to add it. It has to be methods that stick within the Posteriors framework, which is this arbitrary-likelihood, Bayesian, swappable computation.

Okay, yeah, that makes sense. You have that vision of wanting to do it that way and having it as a research tool, basically. So yeah, that makes sense to keep that under control, let's say.
Something I want to ask you in the last few minutes of the show is about thermodynamic computing. I've seen you're working on that, and you've told me you're working on that. I don't know anything about it, so what's that about?
Yeah, so I mean, this is yeah, this is
something that's very normal, normal
::
computing.
::
And it's like,
::
It's something that we have.
::
Yeah, we have this hardware team.
::
It's like a full stack AI company.
::
And we, yeah, on the posterior side, on
the client side, we look at how we can
::
bring in principle Bayesian uncertainty
quantification and help us solve the
::
issues with machine learning pipelines
like we've already discussed.
::
And on the other side, there's lots of
parts to this.
::
Traditional MCMC is just difficult sometimes. The thermodynamic hardware is essentially about simulating SDEs. Normally you have this real pain with the step size: as the dimension grows, the steps have to get really small. So where do we see SDEs? You see SDEs in physics all the time, and physics is real, so we can use physics. We're building physical, analog hardware that evolves as an SDE, and then we can harness those SDEs by encoding things like currents and voltages.
I'm not a physicist, so I don't know exactly how it works. But I'm always reassured, when I speak to the hardware team, by how simply they talk about these things. It's like, yeah, we can just stick some resistors and capacitors on a chip, and then it'll do this SDE. And then we want to use those SDEs for scientific computation, with a real focus on statistics and machine learning.
So yeah, we want to be able to do HMC on device, on an analog device. The first step is the linear case, so we have a Gaussian posterior, an SDE with linear drift. That's an Ornstein-Uhlenbeck process, and we've developed hardware to do it. It turns out that because an Ornstein-Uhlenbeck process has a Gaussian stationary distribution, you can input the precision matrix and read out the covariance matrix, and that's matrix inversion. Your physical device just does this.
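For readers who want to see the trick concretely, here is a minimal digital sketch of the idea Sam describes, a toy NumPy simulation rather than the analog hardware. We integrate the OU SDE dX = -P X dt + sqrt(2) dW with Euler-Maruyama; its stationary distribution is N(0, P^-1), so the empirical covariance of stationary samples recovers the inverse of the precision matrix P. The matrix, step size, and chain counts are all made-up toy values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy precision matrix we want to invert "physically" (made-up values).
P = np.array([[2.0, 0.5],
              [0.5, 1.0]])

dt, n_steps, n_chains = 1e-3, 60_000, 64
x = np.zeros((n_chains, 2))
samples = []

# Euler-Maruyama discretisation of the OU SDE  dX = -P X dt + sqrt(2) dW.
# Its stationary distribution is N(0, inv(P)), so averaging outer products
# of stationary samples estimates the matrix inverse of P.
for step in range(n_steps):
    noise = rng.standard_normal((n_chains, 2))
    x = x - (x @ P) * dt + np.sqrt(2 * dt) * noise
    if step > n_steps // 2 and step % 10 == 0:  # discard burn-in, thin
        samples.append(x)

samples = np.concatenate(samples)
cov_est = samples.T @ samples / len(samples)

print(np.round(cov_est, 2))           # close to inv(P)
print(np.round(np.linalg.inv(P), 2))  # exact inverse for comparison
```

The digital loop is exactly the step-size-limited simulation Sam mentions; the point of the hardware is that the physical system integrates the SDE continuously.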
And because it's an SDE, it has noise and is kind of noise-aware, which is different from classical analog computation. Analog computing is really old, but it has historically been plagued by noise: there's all this noise in physics. Because we're doing SDEs, we want the noise. So yeah, that's the whole idea. It's obviously a very young field, but it's fun stuff.
Yeah. So that's basically to accelerate computing? It's hardware-first, so that computing is accelerated?
I mean, it's a baby field, so we're trying to accelerate different components. What we've worked out is that the simplest thermodynamic chip we can build is this linear chip with the Ornstein-Uhlenbeck process. It comes with some error, but it has asymptotic speed-ups for linear algebra routines, so inverting a matrix or solving a linear system.

That's awesome.

In this case it speeds up one particular component, but that can be useful in a Laplace approximation and these sorts of things in machine learning too.
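To make that last point concrete for readers, here is a toy sketch of where that matrix inversion shows up in a Laplace approximation. Everything here (the data, the logistic model, the standard normal prior) is invented for illustration; the takeaway is just that the final step, inverting the Hessian H, is the kind of linear-algebra routine such a chip could accelerate.

```python
import numpy as np

# Made-up logistic-regression data; model and prior are toy choices.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 3))
y = (X @ np.array([1.0, -2.0, 0.5]) + 0.3 * rng.standard_normal(200) > 0).astype(float)

# Find the MAP weights by Newton's method, with an N(0, I) prior.
w = np.zeros(3)
for _ in range(25):
    p = 1.0 / (1.0 + np.exp(-(X @ w)))           # predicted probabilities
    grad = X.T @ (p - y) + w                      # gradient of neg log posterior
    H = (X.T * (p * (1.0 - p))) @ X + np.eye(3)   # Hessian of neg log posterior
    w = w - np.linalg.solve(H, grad)

# Recompute the Hessian at the converged MAP estimate.
p = 1.0 / (1.0 + np.exp(-(X @ w)))
H = (X.T * (p * (1.0 - p))) @ X + np.eye(3)

# Laplace approximation: posterior is roughly N(w, inv(H)).
# Inverting H, or solving linear systems in H, is exactly the routine
# a linear thermodynamic chip could in principle speed up.
cov = np.linalg.inv(H)
print(w)                       # MAP estimate
print(np.sqrt(np.diag(cov)))  # approximate posterior standard deviations
```

Each Newton step also solves a linear system in H, so the same primitive appears in both the optimization and the uncertainty estimate.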
Okay, that must be very fun to work on. Do you have any writing about that that we can put in the show notes? I think it'd be super interesting for listeners.

Yeah, yeah. The Normal Computing Scholar page has a list of papers, but we also have more accessible blog posts, which I'll make sure to put in the show notes.
Yeah, please do, because I think it's super interesting. And when you have something to present on that, feel free to reach out; I think it'd be fun to do an episode about that, honestly.

That'd be great, yeah.
Yes, so maybe one last question before asking you the final two questions. Let's zoom out and be way less technical; we've been very technical through the whole episode, which I love. If you have any advice to give to aspiring developers interested in contributing to open-source projects like posteriors, what would it be?
Okay, yeah, I don't feel like I'm necessarily the best placed to say all this, but the most important thing is just to go for it: get stuck in, get into the weeds of these libraries and see what's there. There are loads of people building such cool stuff in the open-source ecosystem, and honestly, it's really fun and rewarding to get involved. So just go for it; you'll learn so much along the way.
For something more tangible: when I'm stuck, when I don't understand something in code or mathematics, I often struggle to find it in papers per se. I love textbooks; I find textbooks a real source of gold here, because they actually go to the depths of explaining things, without the sort of horse-in-the-race style of writing you often find in papers. So yeah, get stuck in, and check textbooks if you get lost or don't understand something. Or just ask as well; open source is all about asking and communicating and bouncing ideas around.
Yeah, for sure. That's usually what I do: I ask a lot, and I usually end up surrounding myself with people way smarter than me. And that's exactly what you want; that's exactly how I learn. On the textbook advice, though, I'd say I often find the writing boring, depending on the textbook. And also, they're expensive. That's kind of the problem with textbooks, I would say.
I mean, you can often get them as PDFs, but I just hate reading a PDF on my computer. So I wonder about the book as an object, or having it on a Kindle or something like that, but that doesn't really exist yet. It could be something some publishers solve someday; that'd be cool, I'd love that.
Awesome, Sam, that was great, thank you so much. We've covered so many topics and my brain is burning, so that's a very good sign: I've learned a lot, and I'm sure our listeners did too. Of course, before letting you go, I'm going to ask you the last two questions I ask every guest at the end of the show. First, if you had unlimited time and resources, which problem would you try to solve?
I'd want to decouple the model specification, the data-generating process, how you go from the thing you don't know to the data you do have, which is your scientific freedom as a data modeler, from the inference and the mathematical computation, the way you do your approximate Bayesian inference. You want to decouple those and make it as easy as possible. Ideally, we just want to be doing the first one, the model specification. Stan and PyMC do this really well: you write down your model, and they handle the rest.
And that's kind of the dream we have as Bayesians, or as Bayesian software developers. With posteriors, we're trying to move towards this for bigger machine learning models, bigger-data settings. So that's kind of the dream there.
But then, what does machine learning have that's different from statistics in that setting? Well, machine learning models are less interesting than classical Bayesian models, but the thing is, they're more transferable, right? It's just a neural network, which we believe will solve a whole suite of tasks. So perhaps in the machine learning setting, where we decouple modeling, inference, and data, you want to remove the modeling part as well. You want these general-purpose foundation models, you could say.
So really you want to let the user focus. We're handling the inference, and we're also handling the model. Really, let the user just give it the data and say, okay, let's use this data to predict other things, and let the user handle that. So that's a real unlimited-time-and-resources answer; you'd need plenty of resources to do that.
Yeah, that sounds amazing. I agree with that; that's a fantastic goal. And that reminds me, that's also why I really love what you guys are doing with posteriors: now that we're starting to be able to get there, making Bayesian inference really scalable to really big data and big models. I'm super enthusiastic about that; it would just be fantastic. So thank you so much for taking the time to do that, guys.

Yeah, we're doing it, we're gonna get there.
Yeah, I love that. And second question: if you could have dinner with any great scientific mind, dead, alive, or fictional, who would it be?

Yeah, I was a bit intimidated by this question; you know, you ask everyone, and again, it's a great question. But then I thought about it for a little bit, and it wasn't too hard for me. I think David MacKay is someone who has done amazing work.
David MacKay was doing Bayesian neural networks back in the early nineties, which is crazy, before I was even born. And I've been going through his textbook; as I said, I love textbooks, so I've been going through his textbook on information theory and Bayesian statistics (he is, or was, a Bayesian). There's something he says right at the start of the book: one of the themes of this book is that data compression and data modeling are one and the same. And that's just really beautiful. He talks about stream codes in a very information-theory-style setting, but a stream code is just an autoregressive prediction model, just like our language models. So he had this ability to distill information, help the unification, and be so ahead of his time.
And then, additionally, he wrote a sort of groundbreaking book on sustainable energy, so he was also tackling one of the greatest challenges we have at the moment. The sustainable energy book is really wonderful, one of my favorite books so far.
Nice. Yeah, definitely put that in the show notes; I'd like to keep that one to read. So yeah, please also put that in the show notes, that's going to be fantastic.
Great. Well, I think we can call it a show. That was fantastic. Thank you so much, Sam. I learned so much, and now I feel like I have to go and read and learn about so many things. I can definitely tell that you're extremely passionate about what you're doing. So yeah, thank you so much for taking the time and being on this show.

No, thank you very much. I had a lot of fun. And thank you for being party to my rantings; I need that sometimes.
Yeah, that's what the show is about. My girlfriend is extremely happy that I have this show to rant about Bayesian stats and other nerdy stuff.

Yeah, it's so true.

Well, Sam, you're welcome back anytime you need to do a nerdy rant.

Thank you. I'm sure I'll be...
This has been another episode of Learning Bayesian Statistics. Be sure to rate, review, and follow the show on your favorite podcatcher, and visit learnbayesstats.com for more resources about today's topics, as well as access to more episodes to help you reach a true Bayesian state of mind. That's learnbayesstats.com. Our theme music is Good Bayesian by Baba Brinkman, feat. MC Lars and Mega Ran. Check out his awesome work at bababrinkman.com. I'm your host, Alex Andorra. You can follow me on Twitter at alex_andorra, like the country. You can support the show and unlock exclusive benefits by visiting patreon.com/learnbayesstats. Thank you so much for listening and for your support. You're truly a good Bayesian. Change your predictions after taking information in. And if you're thinking of me less than amazing, let's adjust those expectations. Let me show you how to be a good Bayesian. Change calculations after taking fresh data in. Those predictions that your brain is making, let's get them on a solid foundation.