Learning Bayesian Statistics

Proudly sponsored by PyMC Labs, the Bayesian Consultancy. Book a call, or get in touch!

Our theme music is « Good Bayesian », by Baba Brinkman (feat MC Lars and Mega Ran). Check out his awesome work!

Visit our Patreon page to unlock exclusive Bayesian swag 😉

Takeaways:

  • Use mini-batch methods to efficiently process large datasets within Bayesian frameworks in enterprise AI applications.
  • Apply approximate inference techniques, like stochastic gradient MCMC and Laplace approximation, to optimize Bayesian analysis in practical settings.
  • Explore thermodynamic computing to significantly speed up Bayesian computations, enhancing model efficiency and scalability.
  • Leverage the Posteriors Python package for flexible and integrated Bayesian analysis in modern machine learning workflows.
  • Overcome challenges in Bayesian inference by simplifying complex concepts for non-expert audiences, ensuring the practical application of statistical models.
  • Address the intricacies of model assumptions and communicate effectively to non-technical stakeholders to enhance decision-making processes.

Chapters:

00:00 Introduction to Large-Scale Machine Learning

11:26 Scalable and Flexible Bayesian Inference with Posteriors

25:56 The Role of Temperature in Bayesian Models

32:30 Stochastic Gradient MCMC for Large Datasets

36:12 Introducing Posteriors: Bayesian Inference in Machine Learning

41:22 Uncertainty Quantification and Improved Predictions

52:05 Supporting New Algorithms and Arbitrary Likelihoods

59:16 Thermodynamic Computing

01:06:22 Decoupling Model Specification, Data Generation, and Inference

Thank you to my Patrons for making this episode possible!

Yusuke Saito, Avi Bryant, Ero Carrera, Giuliano Cruz, Tim Gasser, James Wade, Tradd Salvo, William Benton, James Ahloy, Robin Taylor, Chad Scherrer, Zwelithini Tunyiswa, Bertrand Wilden, James Thompson, Stephen Oates, Gian Luca Di Tanna, Jack Wells, Matthew Maldonado, Ian Costley, Ally Salim, Larry Gill, Ian Moran, Paul Oreto, Colin Caprani, Colin Carroll, Nathaniel Burbank, Michael Osthege, Rémi Louf, Clive Edelsten, Henri Wallen, Hugo Botha, Vinh Nguyen, Marcin Elantkowski, Adam C. Smith, Will Kurt, Andrew Moskowitz, Hector Munoz, Marco Gorelli, Simon Kessell, Bradley Rode, Patrick Kelley, Rick Anderson, Casper de Bruin, Philippe Labonde, Michael Hankin, Cameron Smith, Tomáš Frýda, Ryan Wesslen, Andreas Netti, Riley King, Yoshiyuki Hamajima, Sven De Maeyer, Michael DeCrescenzo, Fergal M, Mason Yahr, Naoya Kanai, Steven Rowland, Aubrey Clayton, Jeannine Sue, Omri Har Shemesh, Scott Anthony Robson, Robert Yolken, Or Duek, Pavel Dusek, Paul Cox, Andreas Kröpelin, Raphaël R, Nicolas Rode, Gabriel Stechschulte, Arkady, Kurt TeKolste, Gergely Juhasz, Marcus Nölke, Maggi Mackintosh, Grant Pezzolesi, Avram Aelony, Joshua Meehl, Javier Sabio, Kristian Higgins, Alex Jones, Gregorio Aguilar, Matt Rosinski, Bart Trudeau, Luis Fonseca, Dante Gates, Matt Niccolls, Maksim Kuznecov, Michael Thomas, Luke Gorrie, Cory Kiser, Julio, Edvin Saveljev, Frederick Ayala, Jeffrey Powell, Gal Kampel, Adan Romero, Will Geary, Blake Walters, Jonathan Morgan and Francesco Madrisotti.

Links from the show:

Transcript

This is an automatic transcript and may therefore contain errors. Please get in touch if you’re willing to correct them.

Speaker:

Folks, strap in, because today's episode

is a deep dive into the fascinating world

2

00:00:11,834 --> 00:00:14,382

of large-scale machine learning.

3

00:00:14,382 --> 00:00:19,162

And who better to guide us through this

journey than Sam Duffield.

4

00:00:19,162 --> 00:00:23,702

Currently honing his expertise at Normal

Computing, Sam has an impressive

5

00:00:23,702 --> 00:00:27,562

background that bridges the theoretical

and practical realms of Bayesian

6

00:00:27,562 --> 00:00:33,082

statistics, from quantum computation to

the cutting edge of AI technology.

7

00:00:33,082 --> 00:00:38,582

In our discussion, Sam breaks down complex

topics such as the Posteriors Python

8

00:00:38,582 --> 00:00:44,368

package, mini-batch methods, approximate

inference, and the intriguing world

9

00:00:44,368 --> 00:00:48,248

of thermodynamic hardware for statistics.

10

00:00:48,248 --> 00:00:52,110

Yeah, I didn't know what that was either.

11

00:00:52,110 --> 00:00:57,750

We delve into how these advanced methods

like stochastic gradient MCMC and Laplace

12

00:00:57,750 --> 00:01:03,510

approximation are not just theoretical

concepts but pivotal in shaping enterprise

13

00:01:03,510 --> 00:01:05,510

AI models today.

14

00:01:05,510 --> 00:01:11,610

And Sam is not just about algorithms and

models, he is a sports enthusiast who

15

00:01:11,610 --> 00:01:14,350

loves football, tennis and squash.

16

00:01:14,710 --> 00:01:21,130

and he recently returned from an awe

-inspiring trip to the Faroe Islands.

17

00:01:21,358 --> 00:01:26,218

So join us as we explore the future of AI

with Bayesian methods.

18

00:01:26,218 --> 00:01:33,718

This is Learning Bayesian Statistics,

episode 110, recorded May 31, 2024.

19

00:01:38,134 --> 00:01:55,794

Welcome to Learning Bayesian Statistics, a

podcast about Bayesian inference, the

20

00:01:55,794 --> 00:01:59,354

methods, the projects, and the people who

make it possible.

21

00:01:59,354 --> 00:02:01,584

I'm your host, Alex Andorra.

22

00:02:01,584 --> 00:02:05,102

You can follow me on Twitter at alex

underscore andorra,

23

00:02:05,102 --> 00:02:05,982

like the country.

24

00:02:05,982 --> 00:02:10,222

For any info about the show,

learnbayesstats.com is Laplace to be.

25

00:02:10,222 --> 00:02:14,762

Show notes, becoming a corporate sponsor,

unlocking Bayesian merch, supporting the

26

00:02:14,762 --> 00:02:17,402

show on Patreon, everything is in there.

27

00:02:17,402 --> 00:02:19,342

That's learnbayesstats.com.

28

00:02:19,342 --> 00:02:23,722

If you're interested in one-on-one

mentorship, online courses, or statistical

29

00:02:23,722 --> 00:02:28,882

consulting, feel free to reach out and

book a call at topmate .io slash alex

30

00:02:28,882 --> 00:02:30,342

underscore andorra.

31

00:02:30,342 --> 00:02:34,242

See you around, folks, and best Bayesian

wishes to you all.

32

00:02:37,582 --> 00:02:41,102

I'm Sam Duffield, welcome to Learning

Bayesian Statistics.

33

00:02:42,702 --> 00:02:44,762

Thanks, thank you very much.

34

00:02:45,082 --> 00:02:48,942

Yeah, thank you so much for taking the

time.

35

00:02:48,942 --> 00:02:55,642

I invited you on the show because I saw

what you guys at Normal Computing were

36

00:02:55,642 --> 00:02:59,342

doing, especially with the Posteriors

Python package.

37

00:02:59,342 --> 00:03:04,002

And I am personally always learning new

stuff.

38

00:03:04,042 --> 00:03:09,838

Right now I'm learning a lot about sports

analytics, because that's a

39

00:03:09,838 --> 00:03:13,818

Like that's always been a personal pet

peeve of mine, and Bayesian stats is extremely

40

00:03:13,818 --> 00:03:14,938

useful in that field.

41

00:03:14,938 --> 00:03:21,138

But I'm also, in conjunction, working a lot

on LLMs and their interaction with the

42

00:03:21,138 --> 00:03:22,858

Bayesian framework.

43

00:03:23,738 --> 00:03:29,298

I've been working much more on the

BayesFlow package, which we've talked about

44

00:03:29,298 --> 00:03:32,318

with Marvin Schmidt in episode 107.

45

00:03:32,318 --> 00:03:32,590

So.

46

00:03:32,590 --> 00:03:38,390

Yeah, working on developing a PyMC bridge

to BayesFlow so that you can write your

47

00:03:38,390 --> 00:03:47,750

model in PyMC and then, like, using

amortized Bayesian inference for your PyMC

48

00:03:47,750 --> 00:03:48,810

models.

49

00:03:48,830 --> 00:03:52,110

It's still like way, way down the road.

50

00:03:52,110 --> 00:03:55,420

I need to learn about all that stuff, but

that's really fascinating.

51

00:03:55,420 --> 00:03:56,430

I love that.

52

00:03:56,430 --> 00:04:01,170

And so of course, when I saw what you were

doing with Posteriors, I was like, that

53

00:04:01,170 --> 00:04:02,250

sounds...

54

00:04:02,286 --> 00:04:03,186

Awesome.

55

00:04:03,186 --> 00:04:05,526

I want to learn more about that.

56

00:04:05,526 --> 00:04:08,926

So I'm going to ask you a lot of

questions, a lot of things I don't know.

57

00:04:08,926 --> 00:04:10,666

So that's great.

58

00:04:11,566 --> 00:04:17,026

But first, can you tell us, give us a

brief overview of your research interests

59

00:04:17,026 --> 00:04:21,726

and how Bayesian methods play a role in

your work?

60

00:04:23,106 --> 00:04:24,406

Yeah, no, I know.

61

00:04:24,406 --> 00:04:25,766

Thanks again for the invite.

62

00:04:25,766 --> 00:04:30,486

I think, yeah, sports analytics, Bayesian

statistics, language models, I think we

63

00:04:30,486 --> 00:04:32,174

have a lot to talk about.

64

00:04:32,174 --> 00:04:33,854

should be fun.

65

00:04:33,854 --> 00:04:42,794

Bayesian methods in my work, yes, so at

normal we have a lot of problems where we

66

00:04:42,794 --> 00:04:48,614

think that Bayes is the right answer if

you could compute it exactly.

67

00:04:49,014 --> 00:04:53,574

So what we're trying to do is trying to

look at different approximations and

68

00:04:53,574 --> 00:04:57,262

different, like how they scale in

different methods and different settings.

69

00:04:57,262 --> 00:05:01,062

and how we can get as close to the exact

phase or the exact sort of integral and

70

00:05:01,062 --> 00:05:07,742

updating under uncertainty that can

provide us with some of those benefits.

71

00:05:08,382 --> 00:05:09,082

Yeah.

72

00:05:09,382 --> 00:05:09,952

OK.

73

00:05:09,952 --> 00:05:10,782

Yeah.

74

00:05:10,782 --> 00:05:11,912

That's interesting.

75

00:05:11,912 --> 00:05:13,921

I, of course, agree.

76

00:05:13,922 --> 00:05:15,402

Of course.

77

00:05:15,902 --> 00:05:22,962

Can you, like, actually, do you remember

when you were first introduced to Bayesian

78

00:05:22,962 --> 00:05:23,922

inference?

79

00:05:23,922 --> 00:05:25,773

Because you had a

80

00:05:25,773 --> 00:05:28,933

an extensive background you've studied a

lot.

81

00:05:28,933 --> 00:05:34,973

When in that, in those studies, were you

introduced to the Bayesian framework?

82

00:05:35,033 --> 00:05:40,093

And also, how did you end up working on

what you're working on nowadays?

83

00:05:41,573 --> 00:05:43,293

Yeah, okay.

84

00:05:43,333 --> 00:05:46,973

I'll try not to rant too long about this.

85

00:05:47,473 --> 00:05:51,823

But yeah, so I guess I, yeah, mathematics,

undergraduate at Imperial.

86

00:05:51,823 --> 00:05:53,549

So I think that's

87

00:05:53,549 --> 00:05:57,309

I was very young at this stage, we were

very young in our undergraduates, so not

88

00:05:57,309 --> 00:05:58,519

really sure what we want to do.

89

00:05:58,519 --> 00:06:03,369

At some point, it came to me that

statistics within the field of mathematics

90

00:06:03,369 --> 00:06:08,109

is kind of like where I felt I

should be working, on applied

91

00:06:08,109 --> 00:06:11,529

problems, and where the

field is going.

92

00:06:11,969 --> 00:06:14,009

And that's what got me excited.

93

00:06:14,009 --> 00:06:17,409

Statistics at undergraduate is different at

different places, but you get thrown a lot

94

00:06:17,409 --> 00:06:18,381

of different

95

00:06:18,381 --> 00:06:22,181

I mean, probably in all courses, you get

different, you get different point of view

96

00:06:22,181 --> 00:06:25,141

and you get like, yeah, you get your

frequentist methods, your hypothesis testing, and

97

00:06:25,141 --> 00:06:28,161

then you have your Bayesian method as

well.

98

00:06:28,161 --> 00:06:34,241

And that is just the Bayesian approach

really sort of settled with me as being

99

00:06:34,241 --> 00:06:40,621

more natural in terms of you just write

down the equation, and

100

00:06:40,621 --> 00:06:44,721

Bayes theorem handles it. You write down, you

have your forward model and your prior and

101

00:06:44,721 --> 00:06:45,921

then Bayes theorem handles everything

else.

102

00:06:45,921 --> 00:06:48,269

So you're kind of writing down it's like,

103

00:06:48,269 --> 00:06:52,519

mathematicians is kind of like one of the

lecturers in my first year said, yeah,

104

00:06:52,519 --> 00:06:53,749

mathematicians are lazy.

105

00:06:53,749 --> 00:06:55,649

They want to do as little as

possible.

106

00:06:55,649 --> 00:06:58,829

So Bayes' theorem is kind of nice there

because you just write down your

107

00:06:58,829 --> 00:07:01,639

likelihood, you write down your prior, and

then Bayes' theorem handles the rest.

108

00:07:01,639 --> 00:07:04,249

So you have to do, like, the minimum

possible work: you have your data,

109

00:07:04,249 --> 00:07:05,729

likelihood, prior, and then done.

110

00:07:05,729 --> 00:07:07,709

So that was that was really compelling to

me.

111

00:07:07,709 --> 00:07:14,369

And that led me to my PhD, which was

in the engineering department in

112

00:07:14,369 --> 00:07:14,769

Cambridge.

113

00:07:14,769 --> 00:07:17,869

So that was like, yeah, I had a few

114

00:07:17,869 --> 00:07:20,959

thoughts on what to do for my PhD.

115

00:07:20,959 --> 00:07:25,449

There was some more theoretical stuff and

I wanted to get into some problems, get

116

00:07:25,449 --> 00:07:26,369

into the weeds a bit.

117

00:07:26,369 --> 00:07:29,649

So yeah, engineering department of

Cambridge working on Bayesian statistics,

118

00:07:29,649 --> 00:07:33,629

state space models and, in a state space

model, sequential Monte Carlo.

119

00:07:34,109 --> 00:07:38,389

And I think, yeah, I mean, for terminology

wise, I use state space model and hidden

120

00:07:38,389 --> 00:07:39,939

Markov model as the same thing.

121

00:07:39,939 --> 00:07:45,749

So yeah, you have this time series style

data, and working on that sort of

122

00:07:45,749 --> 00:07:46,765

data gave me

123

00:07:46,765 --> 00:07:50,345

I feel like this propagation of

uncertainty really shines there because

124

00:07:50,345 --> 00:07:57,025

you need to take into account your

uncertainty from the previous experiments,

125

00:07:57,025 --> 00:07:59,485

say, when you update for your new ones.

126

00:07:59,485 --> 00:08:01,485

That was really compelling for me.

127

00:08:02,145 --> 00:08:07,465

That was, I guess, my route into Bayesian

statistics.

128

00:08:07,685 --> 00:08:09,085

Yeah, okay.

129

00:08:09,705 --> 00:08:12,685

Actually, here I could ask you a lot of

questions, but...

130

00:08:12,685 --> 00:08:14,525

those time series models.

131

00:08:14,525 --> 00:08:16,635

I'm always fascinated by time series

models.

132

00:08:16,635 --> 00:08:19,225

I don't know, I love them for some reason.

133

00:08:19,225 --> 00:08:25,785

I find there is a kind of magic in the

ability of a model to take time

134

00:08:25,785 --> 00:08:27,785

dependencies into account.

135

00:08:28,145 --> 00:08:31,085

I love using Gaussian processes for that.

136

00:08:31,305 --> 00:08:36,045

So I could definitely go down that rabbit

hole, but I'm afraid then I won't have

137

00:08:36,045 --> 00:08:39,053

enough time for you to talk about

Posteriors.

138

00:08:39,053 --> 00:08:42,013

Let me just say one minute about it.

139

00:08:42,013 --> 00:08:45,933

So I'll just say, yeah,

Gaussian processes are really cool.

140

00:08:45,933 --> 00:08:49,413

A Gaussian process you can think of

as a continuous time, or continuous

141

00:08:49,413 --> 00:08:53,653

space, or whatever the time-varying

axis is, call it a continuous time

142

00:08:53,653 --> 00:08:58,313

varying version of a state space model. And a

state space model or hidden Markov model,

143

00:08:58,313 --> 00:09:03,353

that to me is the

canonical extension of a static

144

00:09:03,353 --> 00:09:05,485

Bayesian inference model to

145

00:09:05,485 --> 00:09:09,205

the time-varying setting. And they kind of

unify each other because

146

00:09:09,205 --> 00:09:14,565

you can write smoothing in a state space

model as one big static Bayesian inference

147

00:09:14,565 --> 00:09:17,885

problem, and then you can write a static

Bayesian inference problem, which is just p

148

00:09:17,885 --> 00:09:24,105

of x given y, recovering x

from y, as a

149

00:09:24,105 --> 00:09:29,025

single step of a state space model. So the

techniques that you build just overlap, and

150

00:09:29,025 --> 00:09:32,345

you can, yeah, at least conceptually, on the

mathematical level. When you actually get

151

00:09:32,345 --> 00:09:34,253

into the approximations and the

computation,

152

00:09:34,253 --> 00:09:38,673

there are different things to consider,

different axes of scalability to consider,

153

00:09:38,673 --> 00:09:41,213

but conceptually, I really like that.
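Sam's point that a state space model is just Bayes' rule applied step by step can be made concrete with a tiny sketch. This is not code from the episode; it's a minimal linear-Gaussian (Kalman) filter with illustrative numbers, where a single `update` is exactly a static Gaussian Bayesian inference and chaining `predict`/`update` gives the time-varying version:

```python
# One step of a linear-Gaussian state space model (a Kalman filter),
# written as plain Bayes rule: Gaussian prior times Gaussian likelihood.
# All numbers here are illustrative.

def predict(mean, var, transition=1.0, process_noise=0.5):
    """Propagate the state estimate one step forward in time."""
    return transition * mean, transition ** 2 * var + process_noise

def update(mean, var, y, obs_noise=1.0):
    """Condition on a new observation y (conjugate Gaussian update)."""
    gain = var / (var + obs_noise)
    return mean + gain * (y - mean), (1.0 - gain) * var

# Static Bayesian inference is a single update step:
m, v = update(0.0, 10.0, y=2.0)      # prior N(0, 10), observe y = 2

# A state space model just chains predict/update over a time series:
for y in [2.0, 1.5, 2.5]:
    m, v = predict(m, v)
    m, v = update(m, v, y)
```

The same conjugate update appears once in the static problem and repeatedly in the sequential one, which is exactly the unification Sam describes.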

154

00:09:41,213 --> 00:09:44,373

I probably ranted for a bit more than a

minute there, so I apologize.

155

00:09:44,513 --> 00:09:45,913

No, no, that's fine.

156

00:09:45,913 --> 00:09:47,593

I love that.

157

00:09:48,773 --> 00:09:49,473

Yeah.

158

00:09:50,053 --> 00:09:55,893

I have much more knowledge and experience

on GPs, but I'm definitely super curious

159

00:09:55,893 --> 00:09:59,553

to also apply these state space models and

so on.

160

00:09:59,553 --> 00:10:01,037

So definitely going to read the...

161

00:10:01,037 --> 00:10:06,437

the paper you sent me about skill rating

of football players where you're using, if

162

00:10:06,437 --> 00:10:08,497

I understand correctly, some state space

models.

163

00:10:08,497 --> 00:10:10,877

That's going to be two birds with one

stone.

164

00:10:10,877 --> 00:10:13,057

So thanks a lot for writing that.

165

00:10:13,057 --> 00:10:18,957

The whole point of that paper is to say

that rating systems, Elo, TrueSkill, are

166

00:10:18,957 --> 00:10:22,097

and should be reframed as state space

models.

167

00:10:22,097 --> 00:10:25,437

And then you just have your full Bayesian

understanding of it.

168

00:10:26,557 --> 00:10:27,387

Yeah, yeah.

169

00:10:27,387 --> 00:10:28,617

I mean, for sure.

170

00:10:28,617 --> 00:10:31,021

I'm working myself also on a

171

00:10:31,021 --> 00:10:33,761

project like that on football data.

172

00:10:34,101 --> 00:10:37,721

And yeah, the first thing I was doing is

like, okay, I'm gonna write the simple

173

00:10:37,721 --> 00:10:38,261

model.

174

00:10:38,261 --> 00:10:43,301

But then as soon as I have that down, I'm

gonna add a GP to that.

175

00:10:43,301 --> 00:10:47,801

It's like, I have to take these

nonlinearities into account.

176

00:10:47,801 --> 00:10:50,821

So yeah, I'm like, super excited about

that.

177

00:10:50,821 --> 00:10:55,181

So thanks a lot for giving me some weekend

readings.

178

00:10:56,557 --> 00:11:01,237

So actually now let's go into your

Posteriors package because I have so many

179

00:11:01,237 --> 00:11:02,837

questions about that.

180

00:11:02,837 --> 00:11:07,997

So could you give us an overview of the

package, what motivated this development

181

00:11:07,997 --> 00:11:14,017

and also putting it in the context of

large scale AI models?

182

00:11:14,717 --> 00:11:21,357

Yeah, so as I said, we at Normal think that

Bayes is the right answer.

183

00:11:21,357 --> 00:11:26,717

So we want to get, we want to, but yeah,

we're interested in

184

00:11:26,829 --> 00:11:30,309

large scale enterprise AI models.

185

00:11:30,389 --> 00:11:36,149

So we need to be able to scale these to

big, big models, big, big parameter sizes

186

00:11:36,149 --> 00:11:37,829

and big data at the same time.

187

00:11:37,829 --> 00:11:45,469

So this is what the Posteriors Python package,

built on PyTorch, really hopes to bring.

188

00:11:45,469 --> 00:11:49,709

It's built with sort of flexibility and

research in mind.

189

00:11:49,709 --> 00:11:54,929

So really we want to try out different

methods and try out for different data

190

00:11:54,929 --> 00:11:56,589

sets and different goals.

191

00:11:56,589 --> 00:12:01,009

what's going to be the best approach for

us.

192

00:12:01,009 --> 00:12:04,069

That's the motivation of the Posteriors

package.

193

00:12:06,009 --> 00:12:09,189

When would people use it?

194

00:12:09,549 --> 00:12:14,689

For instance, for which use cases would I

use Posteriors?

195

00:12:18,089 --> 00:12:24,013

There's a lot of just genuinely fantastic

Bayesian software out there.

196

00:12:24,013 --> 00:12:28,853

But most of it has focused on the full

batch setting, as is classically the case

197

00:12:28,853 --> 00:12:32,513

with Metropolis-Hastings accept/reject

steps.

198

00:12:33,573 --> 00:12:41,193

And we feel like we're moving or we have

already moved into the mini batch era, the

199

00:12:41,193 --> 00:12:42,133

big data era.

200

00:12:42,133 --> 00:12:43,643

So Posteriors is mini-batch first.

201

00:12:43,643 --> 00:12:47,773

So if you have a lot of data, even if you

have a small model, and you have a lot of

202

00:12:47,773 --> 00:12:52,433

data, and you want to try posterior

sampling with mini-batches, you want to

203

00:12:52,433 --> 00:12:53,973

see how that...

204

00:12:54,729 --> 00:12:59,189

If that can speed up your inference rather

than doing full batch on every step, then

205

00:12:59,189 --> 00:13:01,729

Posteriors is the place for that, even with

small models.

206

00:13:01,729 --> 00:13:05,969

So you can just write down your model in

Pyro, in PyTorch, and then use Posteriors

207

00:13:05,969 --> 00:13:07,969

to do that.

208

00:13:10,989 --> 00:13:16,049

But then that's like moving from like

classical Bayesian statistics into like

209

00:13:16,049 --> 00:13:16,779

the mini-batch one.

210

00:13:16,779 --> 00:13:19,349

But then there are also benefits of

even

211

00:13:19,469 --> 00:13:23,729

very crude approximations to the Bayesian

posterior in these really large

212

00:13:23,729 --> 00:13:24,509

scale models.

213

00:13:24,509 --> 00:13:30,049

So like, yeah, like language models, big

neural networks, in these

214

00:13:30,049 --> 00:13:32,889

you're not going to be able

to do your convergence checks and these

215

00:13:32,889 --> 00:13:36,729

sort of things in those models, but you

might still be able to get some advantages

216

00:13:36,729 --> 00:13:40,249

out-of-distribution detection,

improved prediction

217

00:13:40,249 --> 00:13:44,349

performance, sort of continual learning,

and these are the sort of things we're

218

00:13:44,349 --> 00:13:46,538

investigating is if like,

219

00:13:46,988 --> 00:13:50,588

the sort of, what if you just trained with

gradient descent essentially, you wouldn't

220

00:13:50,588 --> 00:13:51,528

necessarily get these things.

221

00:13:51,528 --> 00:13:57,148

But even very crude, crude Bayesian

approximations will hopefully provide

222

00:13:57,148 --> 00:13:57,678

these benefits.

223

00:13:57,678 --> 00:13:59,578

I think I will talk about this more later.

224

00:13:59,578 --> 00:14:00,388

I think.

225

00:14:01,648 --> 00:14:03,228

Yeah, okay.

226

00:14:04,108 --> 00:14:09,008

So basically, what I understand is

that you can use Posteriors for basically any

227

00:14:09,008 --> 00:14:09,848

model.

228

00:14:10,668 --> 00:14:13,088

So I mean, we're still young.

229

00:14:13,088 --> 00:14:15,756

And it doesn't have like the

230

00:14:15,756 --> 00:14:20,476


support of, I don't know, if you want to

231

00:14:20,476 --> 00:14:23,436

do Gaussian processes, we're not going

to have a whole suite of kernels that

232

00:14:23,436 --> 00:14:26,056

you're going to be able to just type up.

233

00:14:26,056 --> 00:14:30,876

But fundamentally, it takes any, it just

takes a function, a log posterior

234

00:14:30,876 --> 00:14:35,956

function, and then you will be able to try

out different methods.

235

00:14:35,956 --> 00:14:41,616

But as I said, like the big data regime is

much less researched, and the

236

00:14:41,616 --> 00:14:43,948

sort of big parameter regime is much

harder.

237

00:14:43,948 --> 00:14:45,088

at least.

238

00:14:45,188 --> 00:14:48,658

So it's not going to be

like a silver bullet.

239

00:14:48,658 --> 00:14:51,688

There's research, basically:

Posteriors is a tool for

240

00:14:51,688 --> 00:14:57,688

research a lot of the time where you're

going to research what inference methods

241

00:14:57,688 --> 00:15:01,328

you can use, where they fail, and

hopefully where they succeed as well.

242

00:15:01,328 --> 00:15:01,808

Okay.

243

00:15:01,808 --> 00:15:02,168

Okay.

244

00:15:02,168 --> 00:15:03,508

I see.

245

00:15:03,668 --> 00:15:10,798

And so to make sure listeners understand,

well, you can do both in Posteriors, right?

246

00:15:10,798 --> 00:15:13,778

You can write your model in Posteriors.

247

00:15:14,252 --> 00:15:16,522

and then sample from it?

248

00:15:16,522 --> 00:15:20,772

Or is that only model definition or is

that only model sampling?

249

00:15:20,812 --> 00:15:23,412

So it only does approximate posterior

sampling.

250

00:15:23,412 --> 00:15:27,032

So you write down the log posterior,

you're given some data and you write down

251

00:15:27,032 --> 00:15:28,412

the log posterior.

252

00:15:29,712 --> 00:15:32,562

Or the joint, you could say.

253

00:15:32,562 --> 00:15:38,072

It doesn't have the sophisticated support

of Stan or PyMC, where you actually have

254

00:15:38,072 --> 00:15:40,532

the, you can write down the model.

255

00:15:40,716 --> 00:15:45,156

but it has the support for all the

distributions and doing forward samples.

256

00:15:45,976 --> 00:15:51,176

It leans on other tools like Pyro or

PyTorch itself for that, in any case.

257

00:15:51,176 --> 00:15:56,636

It is about approximate inference in the

posterior space, in the sample space.

258

00:15:56,876 --> 00:16:00,606

So you can do Laplace approximation with

these things and compare them.

259

00:16:00,606 --> 00:16:02,136

And importantly, it's mini-batch first.

260

00:16:02,136 --> 00:16:06,416

So every method only expects to receive

batch by batch.

261

00:16:06,416 --> 00:16:09,394

So you can support the large data regime.
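As a flavor of the batch-by-batch methods being described, here is the textbook stochastic gradient Langevin dynamics (SGLD) update in plain NumPy. This is a toy sketch on an illustrative conjugate model, not the Posteriors API: each step takes a gradient step on an unbiased mini-batch estimate of the log posterior gradient and injects noise scaled to the step size:

```python
import numpy as np

# Toy SGLD: x_i ~ N(theta, 1) with a flat prior, so the true posterior is
# approximately N(mean(x), 1/N). Every step only sees batch_size points.
rng = np.random.default_rng(42)
data = rng.normal(loc=2.0, scale=1.0, size=10_000)
n, batch_size, step = len(data), 100, 1e-5

theta, samples = 0.0, []
for _ in range(5_000):
    batch = rng.choice(data, size=batch_size, replace=False)
    grad_est = n / batch_size * np.sum(batch - theta)   # unbiased gradient estimate
    theta += 0.5 * step * grad_est + np.sqrt(step) * rng.normal()
    samples.append(theta)

posterior_mean_estimate = np.mean(samples[1_000:])      # discard burn-in
```

With a fixed step size the chain is biased (the asymptotics point discussed later in the episode), but its mean lands close to the full-data posterior mean without ever summing over all N points.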

262

00:16:10,124 --> 00:16:13,764

Okay, so I think there are a bunch of

terms we need to define here for

263

00:16:13,764 --> 00:16:14,743

listeners.

264

00:16:15,004 --> 00:16:17,224

Okay, yeah, sorry about that.

265

00:16:17,304 --> 00:16:19,044

Can you define mini-batch?

266

00:16:19,044 --> 00:16:23,724

Can you define approximate inference and

in particular, Laplace approximation?

267

00:16:24,124 --> 00:16:28,394

Okay, so mini-batch is the important one,

of course.

268

00:16:28,394 --> 00:16:33,304

Yeah, so normally in traditional Bayesian

statistics, if you're running random walk

269

00:16:33,304 --> 00:16:38,064

Metropolis-Hastings or HMC, you

will be seeing your whole dataset, all N

270

00:16:38,064 --> 00:16:40,172

data points at every step of the

iteration.

271

00:16:40,172 --> 00:16:42,432

And there's beautiful theory about that.

272

00:16:43,072 --> 00:16:48,962

But a lot of the time in machine learning,

you have a billion data points.

273

00:16:48,962 --> 00:16:52,992

Or if you're doing a foundation model,

it's like all of Wikipedia, it's billions

274

00:16:52,992 --> 00:16:54,852

of data points or something like that.

275

00:16:54,852 --> 00:17:01,992

And there's just no way that every time

you do a gradient step, you just can't sum

276

00:17:01,992 --> 00:17:03,352

over a billion data points.

277

00:17:03,352 --> 00:17:06,312

So you take 10 of them, you do this

unbiased approximation.

278

00:17:06,312 --> 00:17:10,252

And this doesn't propagate through the

exponential, which you need

279

00:17:10,252 --> 00:17:11,792

for the Metropolis-Hastings step.

280

00:17:11,792 --> 00:17:17,412

So it rules out a lot of traditional

Bayesian methods, but there's still been

281

00:17:17,412 --> 00:17:18,632

research on this.

282

00:17:18,632 --> 00:17:22,112

So this is the sort of scalable Bayesian

learning that we talk about with

283

00:17:22,112 --> 00:17:22,402

Posteriors.

284

00:17:22,402 --> 00:17:24,992

So we're investigating mini-batch methods.

285

00:17:24,992 --> 00:17:30,532

So yeah, methods that only use a small

amount of the data, as is very common in

286

00:17:30,532 --> 00:17:36,732

machine learning, like stochastic

gradient descent, in optimization terms.
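The rescaling trick Sam alludes to — take a few points and form an unbiased estimate of the full-data gradient — can be sketched in a few lines of NumPy (a toy Gaussian model with illustrative numbers, not Posteriors code):

```python
import numpy as np

# Model: x_i ~ N(theta, 1), so the full-data log-likelihood gradient is
# d/dtheta sum_i log p(x_i | theta) = sum_i (x_i - theta).
rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=1.0, size=100_000)
theta = 0.0

def full_grad(theta, data):
    return np.sum(data - theta)                 # touches all N points

def minibatch_grad(theta, batch, n_total):
    # Rescale by N / batch_size so the estimate is unbiased.
    return n_total / len(batch) * np.sum(batch - theta)

# Averaging the estimator over a disjoint partition of the data
# recovers the full-batch gradient:
batches = data.reshape(100, 1_000)
avg = np.mean([minibatch_grad(theta, b, len(data)) for b in batches])
```

The catch Sam mentions follows directly: the estimate is unbiased for the log posterior gradient, but exponentiating it, as a Metropolis-Hastings accept step requires, does not preserve unbiasedness.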

287

00:17:36,772 --> 00:17:38,028

So hopefully

288

00:17:38,028 --> 00:17:41,068

Mini-batches, okay, you said approximate

inference.

289

00:17:41,128 --> 00:17:44,328

So approximate, okay, yeah, inference is a

very loaded term.

290

00:17:44,328 --> 00:17:48,108

Maybe I should try not to use it, but when

I say approximate inference, I mean

291

00:17:48,108 --> 00:17:49,888

approximate Bayesian inference.

292

00:17:49,888 --> 00:17:54,588

So you can write down mathematically the

posterior distribution, P of theta given

293

00:17:54,588 --> 00:17:59,588

y, and then yeah, proportional to P of

theta, P of y given theta.

294

00:17:59,588 --> 00:18:01,196

But that's

295

00:18:01,196 --> 00:18:05,376

You only have access to pointwise

evaluations of that and potentially even

296

00:18:05,376 --> 00:18:07,276

only mini-batch pointwise

evaluations.

297

00:18:07,276 --> 00:18:12,236

So approximate inference is forming some

approximation to that posterior

298

00:18:12,236 --> 00:18:16,716

distribution, whether that's a Gaussian

approximation or through Monte Carlo

299

00:18:16,716 --> 00:18:17,376

samples.

300

00:18:17,376 --> 00:18:19,726

So yeah, just like an ensemble of points

as an approximation.

301

00:18:19,726 --> 00:18:21,116

So that's approximate inference.

302

00:18:21,116 --> 00:18:24,856

And yeah, you have different fidelities of

this posterior approximation.

303

00:18:25,916 --> 00:18:28,056

Last one, Laplace approximation.

304

00:18:28,056 --> 00:18:30,636

The Laplace approximation is,

305

00:18:30,636 --> 00:18:35,356

arguably, the simplest, in the machine

learning setting, approximation to the

306

00:18:35,356 --> 00:18:36,276

posterior distribution.

307

00:18:36,276 --> 00:18:38,706

So it's just a Gaussian distribution.

308

00:18:38,706 --> 00:18:41,036

So all you need to define is a mean and

covariance.

309

00:18:41,036 --> 00:18:45,536

You define the mean by doing an

optimization procedure on your log

310

00:18:45,536 --> 00:18:48,556

posterior or just log likelihood.

311

00:18:48,776 --> 00:18:52,456

And that will give you a point that will

give you your mean.

312

00:18:52,776 --> 00:18:54,790

And then

313

00:18:55,276 --> 00:18:59,196

And then you take okay, it gets quite in

the weeds the Laplace approximation, but

314

00:18:59,196 --> 00:19:02,296

ideally you then do a Taylor expansion

around the mode.

315

00:19:02,296 --> 00:19:05,536

Second order Taylor expansion will give

you the Hessian.

316

00:19:05,536 --> 00:19:09,556

We would recommend the Hessian being the

inverse of your approximate covariance.

317

00:19:09,636 --> 00:19:13,796

But there are related quantities there, like

using the Fisher information instead.
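The recipe just described — optimize for the mean, then use the curvature at that point for the covariance — fits in a few lines for a 1-D toy problem. This sketch (illustrative numbers, Newton's method, the exact Hessian rather than the Fisher variant) is an editorial illustration, not code from the episode:

```python
import math

# Laplace approximation in 1-D: k successes in n Bernoulli trials,
# z is the logit of the success probability, prior z ~ N(0, 10).
k, n, prior_var = 30, 100, 10.0

def grad(z):
    p = 1.0 / (1.0 + math.exp(-z))
    return k - n * p - z / prior_var                 # d/dz log posterior

def hess(z):
    p = 1.0 / (1.0 + math.exp(-z))
    return -n * p * (1.0 - p) - 1.0 / prior_var      # d^2/dz^2 log posterior

# Mean: optimize the log posterior (here with Newton's method).
z = 0.0
for _ in range(50):
    z -= grad(z) / hess(z)

# Covariance: negative inverse Hessian at the mode.
laplace_mean, laplace_var = z, -1.0 / hess(z)
```

The posterior over z is then approximated as N(laplace_mean, laplace_var); swapping the Hessian for the Fisher information changes only the `hess` function.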

318

00:19:13,796 --> 00:19:18,816

And yeah, there's lots you can read on

that. I'm sure you've had people on

319

00:19:18,816 --> 00:19:21,496

the podcast explain it better than me.
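To make the recipe just described concrete, here is a minimal sketch of the Laplace approximation in plain Python: optimize the log posterior to find the mean, then take a second-order Taylor expansion at that point, with the inverse of the negative Hessian as the covariance. The toy posterior and function names are illustrative only, not the posteriors package API.

```python
def log_posterior(theta):
    # Toy unnormalized 1D log posterior (slightly non-Gaussian).
    return -0.5 * theta**2 - 0.1 * theta**4

def laplace_approximation(log_post, theta0=2.0, lr=0.1, steps=1000, eps=1e-4):
    # 1) Find the mode (the MAP) by gradient ascent on the log posterior,
    #    using a central finite-difference gradient for simplicity.
    theta = theta0
    for _ in range(steps):
        grad = (log_post(theta + eps) - log_post(theta - eps)) / (2 * eps)
        theta += lr * grad
    # 2) Second-order Taylor expansion at the mode: the Gaussian covariance
    #    is the inverse of the negative Hessian of the log posterior.
    hess = (log_post(theta + eps) - 2 * log_post(theta) + log_post(theta - eps)) / eps**2
    return theta, -1.0 / hess  # mean and covariance of the Gaussian approximation

mean, cov = laplace_approximation(log_posterior)
print(mean, cov)  # mode near 0, covariance close to 1 for this toy posterior
```

In higher dimensions the Hessian becomes a matrix, and in practice one often substitutes the Fisher information or a diagonal approximation, as mentioned in the conversation.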

320

00:19:21,596 --> 00:19:22,156

Yeah.

321

00:19:22,156 --> 00:19:24,696

For Laplace, no.

322

00:19:25,640 --> 00:19:28,970

Actually, so that's why I asked you to

define it.

323

00:19:28,970 --> 00:19:32,220

I'm happy to go down into the weeds if you

want.

324

00:19:33,200 --> 00:19:36,520

Yeah, if you think that's useful.

325

00:19:36,520 --> 00:19:41,780

Otherwise, we can definitely do also an

episode with someone you'd recommend to

326

00:19:41,780 --> 00:19:43,620

talk about Laplace approximation.

327

00:19:43,800 --> 00:19:50,600

Something I'd like to communicate to

listeners, something for them to understand:

328

00:19:51,180 --> 00:19:55,680

Yeah, we say approximation, but at the

same time, MCMC is an approximation

329

00:19:55,680 --> 00:19:56,740

itself.

330

00:19:57,240 --> 00:20:00,220

So that can be a bit confusing.

331

00:20:01,360 --> 00:20:06,880

Can you talk about, like,

why these kinds of methods, like Laplace

332

00:20:06,880 --> 00:20:11,580

approximation, I think VI, variational

inference, would fall also into this

333

00:20:11,580 --> 00:20:12,460

bucket.

334

00:20:13,340 --> 00:20:17,728

Why are those methods called

approximations?

335

00:20:17,900 --> 00:20:20,780

in contrast to MCMC?

336

00:20:20,780 --> 00:20:24,640

What's the main difference here?

337

00:20:24,640 --> 00:20:32,100

Honestly, I would say MCMC is also an

approximation in the same terminology but

338

00:20:32,100 --> 00:20:37,760

yeah, the difference is that we talk about

bias, and some methods are

339

00:20:37,760 --> 00:20:43,320

asymptotically unbiased, which MCMC is, and

stochastic gradient MCMC, which is what

340

00:20:43,320 --> 00:20:46,764

posteriors uses as well, in some...

341

00:20:46,764 --> 00:20:51,104

under some caveats, and there are caveats

for MCMC, normal MCMC, as well.

342

00:20:51,104 --> 00:20:55,344

But yeah, so you have your Gaussian

approximations from variational inference

343

00:20:55,344 --> 00:20:56,504

and the Laplace approximation.

344

00:20:56,504 --> 00:21:01,984

And these are very much approximations in

the sense there's no axis which you can

345

00:21:01,984 --> 00:21:04,564

increase to infinity to

approach the posterior.

346

00:21:04,564 --> 00:21:08,904

You cannot do that with the Gaussian

approximations unless your posterior is

347

00:21:08,904 --> 00:21:13,384

known to be Gaussian, in which case...

I mean, there are

348

00:21:13,384 --> 00:21:16,172

interesting cases like that, like Gaussian

processes and things.

349

00:21:16,172 --> 00:21:21,612

But yeah, so they don't have this

asymptotically unbiased feature that MCMC

350

00:21:21,612 --> 00:21:25,392

does, or importance sampling and sequential

Monte Carlo does, which is very useful

351

00:21:25,392 --> 00:21:29,492

because it allows you to trade compute for

accuracy, which you can't do with a

352

00:21:29,492 --> 00:21:35,932

Laplace approximation or VI beyond

extending, like going from diagonal

353

00:21:35,932 --> 00:21:37,862

covariance to a full covariance or things

like that.

354

00:21:37,862 --> 00:21:41,552

And this is very useful in the case that

you have extra compute available.

355

00:21:41,552 --> 00:21:43,660

So I'm a big fan of the

356

00:21:43,660 --> 00:21:47,440

asymptotically unbiased property because it

means that you can increase your compute

357

00:21:47,440 --> 00:21:49,040

safely.

358

00:21:49,080 --> 00:21:49,270

Yeah.

359

00:21:49,270 --> 00:21:49,760

Yeah.

360

00:21:49,760 --> 00:21:50,510

Great explanation.

361

00:21:50,510 --> 00:21:51,720

Thanks a lot.

362

00:21:52,760 --> 00:22:00,380

And so yeah, as you were saying,

there isn't this asymptotic unbiasedness

363

00:22:00,380 --> 00:22:05,140

from these approximations, but at the same

time, that means they can be way faster.

364

00:22:05,740 --> 00:22:11,628

So it's like if you're in the right use

case in the right, in the right

365

00:22:11,628 --> 00:22:14,828

Yeah, in the right use case, then that

really makes sense to use them.

366

00:22:14,828 --> 00:22:19,268

But you have to be careful about the

conditions where the approximation falls

367

00:22:19,268 --> 00:22:20,268

down.

368

00:22:21,168 --> 00:22:26,008

Can you maybe dive a bit deeper into

stochastic gradient descent, which is the

369

00:22:26,008 --> 00:22:31,248

method that posteriors is using, and how

that fits into these different methods

370

00:22:31,248 --> 00:22:33,008

that you just talked about?

371

00:22:33,448 --> 00:22:38,228

Actually, stochastic gradient descent is

not a method that posteriors is using per

372

00:22:38,228 --> 00:22:39,348

se.

373

00:22:40,556 --> 00:22:44,976

Stochastic gradient descent is

the workhorse of most machine

374

00:22:44,976 --> 00:22:49,276

learning algorithms, but posteriors would

kind of be saying it

375

00:22:49,276 --> 00:22:52,706

shouldn't be, perhaps, or like,

not in all cases.

376

00:22:52,706 --> 00:22:54,866

stochastic gradient descent is what you

use.

377

00:22:54,866 --> 00:23:00,376

If you have extremely large data, and you

just want to find the MLE, so the

378

00:23:00,376 --> 00:23:04,336

maximum likelihood estimate, or the minimum of a

loss, you might say.

379

00:23:05,336 --> 00:23:06,284

So

380

00:23:06,284 --> 00:23:08,224

that is just an optimization routine.

381

00:23:08,224 --> 00:23:10,844

So you just want to find the parameters

that minimize something.

382

00:23:10,844 --> 00:23:15,164

If you're doing variational inference,

what you can do is you can tractably get

383

00:23:15,164 --> 00:23:20,084

the KL divergence between your specified

variational distribution and the log

384

00:23:20,084 --> 00:23:20,924

posterior.

385

00:23:20,924 --> 00:23:22,824

And then you have parameters.

386

00:23:22,924 --> 00:23:26,604

So they're like parameters of the

variational distribution over your model

387

00:23:26,604 --> 00:23:27,014

parameters.

388

00:23:27,014 --> 00:23:28,924

And then you use stochastic gradient

descent on that.

389

00:23:28,924 --> 00:23:32,944

So this is nice because it just means that

you can throw the workhorse from machine

390

00:23:32,944 --> 00:23:35,028

learning at a

391

00:23:35,028 --> 00:23:38,348

Bayesian problem and get the Bayesian

approximation out.

392

00:23:38,688 --> 00:23:45,368

Again, as we mentioned, it doesn't have

this asymptotic unbiased feature, which is

393

00:23:45,368 --> 00:23:49,708

maybe less of a concern in machine

learning models where you have less of

394

00:23:49,708 --> 00:23:53,848

ability to trade compute because you've

kind of filled your compute budget with

395

00:23:53,848 --> 00:23:55,648

your gigantic model.

396

00:23:55,648 --> 00:23:58,988

Although we may see this, I

think this might change over the coming

397

00:23:58,988 --> 00:23:59,808

years.

398

00:24:00,028 --> 00:24:01,558

But yeah, maybe not.

399

00:24:01,558 --> 00:24:03,668

Maybe we'll just go even bigger and bigger

and bigger.

400

00:24:03,980 --> 00:24:04,740

You...

401

00:24:04,740 --> 00:24:05,620

Okay, sorry.

402

00:24:05,620 --> 00:24:06,960

I got lost.

403

00:24:06,960 --> 00:24:09,960

You said you're asking about stochastic

gradient descent.

404

00:24:09,960 --> 00:24:11,450

So actually, there's something interesting

to say here.

405

00:24:11,450 --> 00:24:17,600

And then, also, what the main

distinguishing characteristics of posteriors

406

00:24:17,600 --> 00:24:23,820

are, so that people really

understand the use case of posteriors here.

407

00:24:23,820 --> 00:24:24,480

Yeah.

408

00:24:24,480 --> 00:24:26,160

So we didn't want to...

409

00:24:26,160 --> 00:24:26,580

Okay.

410

00:24:26,580 --> 00:24:32,364

So yeah, there's a key thing about the way

we've written posteriors: we like,

411

00:24:32,364 --> 00:24:39,104

where possible, to have stochastic gradient

descent, so optimization, as a sort of

412

00:24:39,104 --> 00:24:42,144

limit under some hyperparameter

specifications of the algorithms.

413

00:24:42,144 --> 00:24:47,224

And it turns out that in a lot of cases,

so we talked about MCMC, and then we

414

00:24:47,224 --> 00:24:51,124

talked about stochastic gradient MCMC,

which are MCMC methods that directly

415

00:24:51,124 --> 00:24:52,524

handle mini-batching.

416

00:24:52,524 --> 00:24:56,264

And a lot of the time, you can write down

the temperature, you have the temperature

417

00:24:56,264 --> 00:24:58,464

parameter of your posterior distribution.

418

00:24:58,464 --> 00:25:00,396

And then as you take that to zero,

419

00:25:00,396 --> 00:25:03,376

So the temperature is like, if the

temperature is very high, your posterior

420

00:25:03,376 --> 00:25:04,856

distribution is very heated up.

421

00:25:04,856 --> 00:25:09,196

So you've increased the tails and it's

a lot, like, much closer to sort of a

422

00:25:09,196 --> 00:25:10,116

uniform distribution.

423

00:25:10,116 --> 00:25:12,846

You take it very cold, it becomes very

pointed and focused around optima.

424

00:25:12,846 --> 00:25:17,536

So we write the algorithms so that there's

this convenient transition through the

425

00:25:17,536 --> 00:25:17,866

temperature.

426

00:25:17,866 --> 00:25:20,556

So you set the temperature to zero, you

just get optimization.

427

00:25:20,556 --> 00:25:23,016

And this is a key thing about posteriors.

428

00:25:23,076 --> 00:25:27,340

So we have the, so the posteriors

stochastic gradient MCMC methods have

429

00:25:27,340 --> 00:25:31,399

this temperature parameter which if you

set to zero will become a variant of

430

00:25:31,399 --> 00:25:32,680

stochastic gradient descent.

431

00:25:32,680 --> 00:25:38,980

So you can just sort of unify stochastic gradient

descent and stochastic gradient MCMC and

432

00:25:38,980 --> 00:25:42,400

it's nice so you have your yeah you have

your Langevin dynamics which tempered down

433

00:25:42,400 --> 00:25:45,280

to zero just becomes vanilla gradient

descent you have underdamped Langevin

434

00:25:45,280 --> 00:25:50,600

dynamics or stochastic gradient HMC,

stochastic gradient Hamiltonian Monte

435

00:25:50,600 --> 00:25:54,420

Carlo, you set the temperature to zero and

then you've just got stochastic gradient

436

00:25:54,420 --> 00:25:55,880

descent with momentum.
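The unification just described can be sketched in a few lines: a Langevin update whose injected noise scales with the temperature, so that setting the temperature to zero recovers plain gradient descent. This is an illustrative toy, not the actual posteriors API.

```python
import math
import random

def langevin_step(theta, grad_log_post, stepsize, temperature):
    # Langevin dynamics on the log posterior with a temperature parameter:
    # theta <- theta + stepsize * grad + sqrt(2 * stepsize * temperature) * xi.
    # At temperature 1 this targets the posterior; at temperature 0 the noise
    # vanishes and the update is exactly gradient ascent on the log posterior,
    # i.e. gradient descent on the loss.
    noise = math.sqrt(2 * stepsize * temperature) * random.gauss(0.0, 1.0)
    return theta + stepsize * grad_log_post(theta) + noise

grad_log_post = lambda t: -t  # standard Gaussian log posterior

theta = 5.0
for _ in range(200):
    theta = langevin_step(theta, grad_log_post, stepsize=0.1, temperature=0.0)
print(theta)  # temperature zero: plain gradient descent, converges to the mode at 0
```

Mini-batching the gradient turns this into stochastic gradient Langevin dynamics; adding momentum gives the underdamped variant that tempers down to SGD with momentum, as mentioned.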

437

00:25:56,332 --> 00:26:00,272

So yeah, this is a nice thing about

posteriors, to sort of unify these

438

00:26:00,272 --> 00:26:04,552

approaches and it hopefully will make it

less scary to use Bayesian approaches

439

00:26:04,552 --> 00:26:08,232

because you know you always have gradient

descent and you can sanity check by just

440

00:26:08,232 --> 00:26:11,552

setting the temp, just fiddling with the

temperature parameter.

441

00:26:12,032 --> 00:26:14,092

Okay, that's really cool.

442

00:26:14,252 --> 00:26:15,032

Okay.

443

00:26:15,072 --> 00:26:19,812

So it's like, it's a bit like the

temperature parameter in the, in the

444

00:26:19,812 --> 00:26:25,708

transformers, I

mean, in the LLMs, that

445

00:26:25,708 --> 00:26:31,868

It's like adding a bit of variation on top

of the predictions that the LLM could

446

00:26:31,868 --> 00:26:31,988

make.

447

00:26:31,988 --> 00:26:33,647

Yeah, so it's exactly the same as that.

448

00:26:33,647 --> 00:26:37,168

So when you use this in language models or

natural language generation, you

449

00:26:37,168 --> 00:26:41,188

temper the generative distribution so

that the logits get tempered.

450

00:26:41,188 --> 00:26:44,268

So if you set the temperature there to

zero, you get greedy sampling.
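For comparison, this is what the generation-space tempering just mentioned looks like: divide the logits by the temperature before normalizing, with zero temperature degenerating to greedy (argmax) sampling. Purely illustrative, not tied to any specific LLM library.

```python
import math

def tempered_softmax(logits, temperature):
    # High temperature flattens the distribution toward uniform;
    # low temperature concentrates mass on the largest logit,
    # and temperature zero is greedy (one-hot on the argmax).
    if temperature == 0.0:
        probs = [0.0] * len(logits)
        probs[max(range(len(logits)), key=lambda i: logits[i])] = 1.0
        return probs
    exps = [math.exp(l / temperature) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]
print(tempered_softmax(logits, 1.0))  # ordinary softmax over the logits
print(tempered_softmax(logits, 0.0))  # greedy: [1.0, 0.0, 0.0]
```

The tempering discussed in the episode applies the same idea in parameter space, to the posterior distribution itself, rather than to the token distribution.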

451

00:26:44,268 --> 00:26:47,668

But we're doing this in parameter space.

452

00:26:47,668 --> 00:26:49,128

So it's, yeah.

453

00:26:49,128 --> 00:26:50,948

It has this, yeah, exactly.

454

00:26:53,196 --> 00:26:58,076

Distribution tempering is a broad thing,

particularly in, I'm not going to go too

455

00:26:58,076 --> 00:27:03,356

philosophical, but I mean, I first met

tempering, then we thought about

456

00:27:03,356 --> 00:27:07,676

it in the settings of sequential Monte

Carlo, and it's like, is it the natural

457

00:27:07,676 --> 00:27:08,476

way?

458

00:27:08,476 --> 00:27:10,536

Is it something that's natural to do?

459

00:27:10,536 --> 00:27:14,496

But in the context of Bayes, because

Bayes' theorem is multiplicative, right,

460

00:27:14,496 --> 00:27:19,076

you have your P of theta, P of y given

theta, it kind of makes sense to temper

461

00:27:19,076 --> 00:27:22,380

because it means like, okay, I'll just

introduce the likelihood a little bit.

462

00:27:22,380 --> 00:27:25,040

and sort of tempering as a natural way to

do it, because of this multiplicative

463

00:27:25,040 --> 00:27:26,299

feature of Bayes' theorem.

464

00:27:26,299 --> 00:27:30,240

So, it kind of settled with me after

thinking about it like that.

465

00:27:30,520 --> 00:27:32,330

Yeah, no, I mean, that makes perfect

sense.

466

00:27:32,330 --> 00:27:38,980

And I was really surprised to see that was

used in LLMs when I first read about the

467

00:27:38,980 --> 00:27:40,020

algorithms.

468

00:27:40,640 --> 00:27:46,500

And I was pleasantly surprised because

I've worked a lot on electoral forecasting

469

00:27:46,500 --> 00:27:46,970

models.

470

00:27:46,970 --> 00:27:51,440

That's how I was introduced to Bayesian

stats.

471

00:27:51,756 --> 00:27:54,406

Actually, I've done that without knowing

it.

472

00:27:54,406 --> 00:27:58,516

So first, I'm using the softmax all the

time, because of electoral forecasting.

473

00:27:58,516 --> 00:28:02,736

Unless you're doing that in the US, you

need a multinomial likelihood.

474

00:28:02,876 --> 00:28:06,176

The multinomial needs a probability

distribution.

475

00:28:06,176 --> 00:28:10,016

And how do you get that from the softmax

function, which is actually a very

476

00:28:10,016 --> 00:28:13,056

important one in the LLM framework.

477

00:28:13,156 --> 00:28:19,212

And also, the thing is, your

probability is... it's like the latent

478

00:28:19,212 --> 00:28:23,281

popularity of each party,

but you never observe it, right?

479

00:28:23,281 --> 00:28:28,832

And so the polls, you could

like conceptualize them as a tempered

480

00:28:28,832 --> 00:28:32,532

version of the true latent popularity.

481

00:28:32,692 --> 00:28:34,672

And so that was really interesting.

482

00:28:34,672 --> 00:28:39,552

I was like, damn, this

stuff is much more powerful than what I

483

00:28:39,552 --> 00:28:43,552

thought, because I was applying it only

to electoral forecasting models, which is

484

00:28:43,552 --> 00:28:47,756

like a very niche application, you could

say, of these models, when

485

00:28:47,756 --> 00:28:51,556

actually there are so many applications of

that in the wild.

486

00:28:52,296 --> 00:28:57,016

No, it's so yeah, tempering in general is

very widespread and also I would say not

487

00:28:57,016 --> 00:28:59,416

particularly well understood.

488

00:28:59,416 --> 00:29:04,856

Like yeah, we have this thing, there's

been research in this cold posterior

489

00:29:04,856 --> 00:29:10,876

effect which is quite a, I don't know,

it's perhaps a...

490

00:29:12,364 --> 00:29:17,754

annoying thing for Bayesian modeling on

neural networks where you get, as I said,

491

00:29:17,754 --> 00:29:22,724

you have this temperature parameter that

transitions between optimization and the

492

00:29:22,724 --> 00:29:23,544

Bayesian posterior.

493

00:29:23,544 --> 00:29:26,324

So zero is optimization, one is the

Bayesian posterior.

494

00:29:27,144 --> 00:29:30,944

And empirically, we see better predictive

performance, which a lot of the time is what we

495

00:29:30,944 --> 00:29:34,904

care about in machine learning, with

temperatures less than one.

496

00:29:34,904 --> 00:29:38,764

So like, yeah, which is annoying because

we're Bayesians and we think that the

497

00:29:38,764 --> 00:29:41,524

Bayesian posterior gives the optimal decision

-making under uncertainty.

498

00:29:41,900 --> 00:29:47,820

So this is annoying, but at least in our

experiments, we found this so

499

00:29:47,820 --> 00:29:51,150

-called cold posterior effect much more

prominent under Gaussian approximations,

500

00:29:51,150 --> 00:29:55,790

which we only believe to be very crude

approximations to the posterior anyway.

501

00:29:55,790 --> 00:30:02,400

And if we do more MCMC or deep ensemble

stuff, where deep ensemble is, we've got a

502

00:30:02,400 --> 00:30:07,810

paper we'll be able to put on arXiv shortly,

which describes deep ensembles.

503

00:30:07,810 --> 00:30:11,116

In deep ensembles, you just run gradient

descent in parallel

504

00:30:11,116 --> 00:30:15,116

with different initializations and batch

shuffling.

505

00:30:15,116 --> 00:30:18,776

And then you just have, like, I don't know, you

run 10 ensembles, 10 optimizations in

506

00:30:18,776 --> 00:30:21,226

parallel, then you've got 10 parameter

settings at the end.

507

00:30:21,226 --> 00:30:23,656

So a Monte Carlo approximation of the

posterior, of size 10.
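A toy version of the deep ensemble recipe just described: run several gradient descents from different random initializations and keep the endpoints as a size-10 Monte Carlo approximation of the posterior. The double-well loss stands in for a neural network loss surface; everything here is illustrative rather than the paper's actual setup.

```python
import random

def gradient_descent(grad_loss, theta0, lr=0.05, steps=500):
    # Plain gradient descent to a local minimum from a given initialization.
    theta = theta0
    for _ in range(steps):
        theta -= lr * grad_loss(theta)
    return theta

# Toy double-well loss (theta^2 - 1)^2, with minima at -1 and +1:
# different initializations land in different modes.
grad_loss = lambda t: 4 * t * (t * t - 1)

random.seed(0)
ensemble = [gradient_descent(grad_loss, random.uniform(-2, 2)) for _ in range(10)]
print(ensemble)  # ten parameter values clustered near the two minima
```

Per the discussion that follows, injecting the right Langevin noise into each of these parallel runs is what recovers the asymptotically unbiased behavior.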

508

00:30:23,656 --> 00:30:28,676

And then we describe in the paper how

to get this asymptotically unbiased property

509

00:30:28,676 --> 00:30:31,556

by using that temperature.

510

00:30:31,616 --> 00:30:38,616

Because as we said earlier, you have SG

MCMC becomes SGD with temperature zero.

511

00:30:38,616 --> 00:30:39,948

So you can reverse this.

512

00:30:39,948 --> 00:30:43,928

for deep ensembles, so you add the noise

and then your deep

513

00:30:43,928 --> 00:30:48,048

ensembles become asymptotically

unbiased, basically parallel

514

00:30:48,048 --> 00:30:48,908

SGMCMC.

515

00:30:49,688 --> 00:30:54,828

But in those cases, when you have the

non-Gaussian approximation, we found much less

516

00:30:54,828 --> 00:30:56,288

of the cold posterior effect.

517

00:30:56,288 --> 00:31:01,908

So yeah, it's, but it's still not, maybe

the cold posterior effect is a natural

518

00:31:01,908 --> 00:31:04,368

thing because it's not really like Bayes'

theorem.

519

00:31:04,368 --> 00:31:06,928

Yeah, it still needs to be better

understood.

520

00:31:06,928 --> 00:31:08,844

I don't... at least in my head, I'm not

521

00:31:08,844 --> 00:31:12,464

fully clear on whether the cold posterior

effect is something we should be surprised

522

00:31:12,464 --> 00:31:13,164

about.

523

00:31:13,164 --> 00:31:14,844

Okay, yeah.

524

00:31:14,844 --> 00:31:16,404

Yeah, me neither.

525

00:31:16,404 --> 00:31:19,784

If that makes you feel any better, because I

just learned about that.

526

00:31:20,264 --> 00:31:22,944

So yeah, I don't have any strong opinion.

527

00:31:24,664 --> 00:31:32,304

Okay, I think we're getting clearer now on

what posteriors is, for

528

00:31:32,304 --> 00:31:33,004

listeners.

529

00:31:33,004 --> 00:31:35,056

So then I think one of

530

00:31:35,056 --> 00:31:39,476

the last questions about the algorithms

that are underlying all of that.

531

00:31:39,476 --> 00:31:43,516

So, stochastic gradient MCMC.

532

00:31:43,516 --> 00:31:45,036

That's, that's where I got confused.

533

00:31:45,036 --> 00:31:48,896

Like I hear stochastic gradient and think

stochastic gradient descent, but no, it's

534

00:31:48,896 --> 00:31:51,576

SGMCMC, not SGD.

535

00:31:51,576 --> 00:31:57,456

So, posteriors is really there to use

SGMCMC.

536

00:31:57,496 --> 00:32:02,188

Why, like, why would you do that and not

use MCMC?

537

00:32:02,188 --> 00:32:05,468

like the classic HMC from Stan or PyMC?

538

00:32:05,628 --> 00:32:08,458

Yeah, so I mean, it's not just for SGMCMC.

539

00:32:08,458 --> 00:32:13,188

There's also variational inference,

Laplace approximation, the extended Kalman

540

00:32:13,188 --> 00:32:16,708

filter, and we're really excited to have

more methods as well as we look to

541

00:32:16,708 --> 00:32:18,488

maintain and expand the library.

542

00:32:18,488 --> 00:32:20,838

Why would you use SGMCMC?

543

00:32:20,838 --> 00:32:23,168

So yeah, I think we've already touched

upon this.

544

00:32:23,168 --> 00:32:29,508

The thing is, if you've got loads of data,

it's just going to be inefficient to...

545

00:32:30,092 --> 00:32:35,332

sum over all of that data at every

iteration of your MCMC algorithm as Stan

546

00:32:35,332 --> 00:32:36,432

would do.

547

00:32:39,252 --> 00:32:43,672

But there's mathematical reasons why you

can't just do that in Stan.

548

00:32:43,672 --> 00:32:49,432

It's because the Metropolis-Hastings

ratio has this exponential of the log

549

00:32:49,432 --> 00:32:50,332

posterior.

550

00:32:50,332 --> 00:32:54,482

But log space is the only place

you can get the unbiased approximation,

551

00:32:54,482 --> 00:32:58,924

which is what you need if you did want to

naively subsample.

552

00:32:58,924 --> 00:33:03,744

So you need to... you can't do the

Metropolis-Hastings accept-reject.

553

00:33:03,744 --> 00:33:06,404

So you have to use different tooling.

554

00:33:06,424 --> 00:33:11,684

And in its simplest terms, SGMCMC just

omits it and just runs Langevin dynamics.

555

00:33:11,684 --> 00:33:15,904

So it just runs your Hamiltonian Monte

Carlo without the accept-reject.

556

00:33:15,904 --> 00:33:18,844

But there's more theory on top of this and

you need to control the discretization

557

00:33:18,844 --> 00:33:20,064

error and stuff like that.

558

00:33:20,064 --> 00:33:23,104

And I won't go into the weeds of that.
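The mini-batch trick at the heart of this can be shown directly, though: an unbiased estimate of the full-data log-posterior gradient from a random subsample, which is what SGMCMC plugs in where Stan-style MCMC would sum over every data point. The toy Gaussian model and function names are hypothetical, not the posteriors API.

```python
import random

def minibatch_grad_log_post(theta, data, batch_size, grad_log_prior, grad_log_lik):
    # Unbiased estimate of the full-data log-posterior gradient:
    # grad log p(theta) + (N / n) * sum of per-point likelihood gradients
    # over a random mini-batch of size n.
    batch = random.sample(data, batch_size)
    scale = len(data) / batch_size
    return grad_log_prior(theta) + scale * sum(grad_log_lik(theta, y) for y in batch)

# Toy model: y ~ Normal(theta, 1) with prior theta ~ Normal(0, 1).
grad_log_prior = lambda t: -t
grad_log_lik = lambda t, y: y - t

random.seed(1)
data = [random.gauss(2.0, 1.0) for _ in range(10000)]
estimate = minibatch_grad_log_post(0.0, data, 100, grad_log_prior, grad_log_lik)
print(estimate)  # noisy, but unbiased for the exact full-data gradient
```

Each iteration touches only 100 of the 10,000 points, which is where the efficiency on large datasets comes from.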

559

00:33:23,104 --> 00:33:24,084

Okay.

560

00:33:24,084 --> 00:33:25,264

Yeah.

561

00:33:25,724 --> 00:33:26,234

Okay.

562

00:33:26,234 --> 00:33:27,308

And that's

563

00:33:27,308 --> 00:33:30,068

And that's tied to mini-batching,

basically.

564

00:33:30,068 --> 00:33:37,608

Like the power that SGMCMC allows you when

you're in a high data regime is tied to

565

00:33:37,608 --> 00:33:40,268

the mini-batching, if I understand

correctly.

566

00:33:40,648 --> 00:33:43,848

It's the difference between MCMC and

SGMCMC.

567

00:33:43,848 --> 00:33:46,548

Okay, so that's like the main difference.

568

00:33:46,588 --> 00:33:46,848

Okay.

569

00:33:46,848 --> 00:33:48,148

Yeah, stochastic gradient.

570

00:33:48,148 --> 00:33:52,128

So you can't actually get the exact

gradient like you need in Hamiltonian Monte

571

00:33:52,128 --> 00:33:56,236

Carlo and for the Metropolis-Hastings step;

you only get an unbiased approximation.

572

00:33:56,236 --> 00:34:01,036

And then there's theory about this, like

sometimes you can deploy the central limit

573

00:34:01,036 --> 00:34:05,036

theorem and then you've got a

covariance attached to your gradients, and

574

00:34:05,036 --> 00:34:09,316

you could do nice theory and improve the

convergence like that, which, yeah.

575

00:34:09,976 --> 00:34:10,896

Okay.

576

00:34:11,516 --> 00:34:12,476

All clear now.

577

00:34:12,476 --> 00:34:12,916

All clear.

578

00:34:12,916 --> 00:34:13,376

Awesome.

579

00:34:13,376 --> 00:34:13,636

Yeah.

580

00:34:13,636 --> 00:34:16,586

And I think that's the first time we talk

about that on the show.

581

00:34:16,586 --> 00:34:20,476

So I think it's definitely useful

to be extra clear about that.

582

00:34:20,476 --> 00:34:25,516

And so that listeners understand and me,

like myself, so that I understand.

583

00:34:25,516 --> 00:34:26,156

Thanks a lot.

584

00:34:26,156 --> 00:34:31,336

It's in some settings actually much simpler

because you kind of like remove the tools

585

00:34:31,336 --> 00:34:33,776

that you have available to you by removing

that accept-reject step.

586

00:34:33,776 --> 00:34:36,956

So it makes the implementation a bit

simpler.

587

00:34:37,596 --> 00:34:40,716

But you kind of lose the theory in that.

588

00:34:40,716 --> 00:34:46,756

And then a lot of the argument is like if

you use a decreasing step size, then your

589

00:34:46,756 --> 00:34:50,616

noise from the mini-batch, your noise from

the stochastic gradient, decreases as epsilon

590

00:34:50,616 --> 00:34:52,196

squared, which is faster.

591

00:34:52,196 --> 00:34:53,580

So you

592

00:34:53,580 --> 00:34:57,220

If you decrease your step size and run it

for infinite time, then you'll just be

593

00:34:57,220 --> 00:35:00,820

running, eventually just be running the

continuous time dynamics, which are exact

594

00:35:00,820 --> 00:35:02,220

and do have the right stationary

distribution.

595

00:35:02,220 --> 00:35:05,880

So if you run it with decreasing step

size, then you are asymptotically

596

00:35:05,880 --> 00:35:07,100

unbiased.

597

00:35:07,160 --> 00:35:10,480

But running with decreasing step size is

really annoying because you then don't

598

00:35:10,480 --> 00:35:11,360

move as far.

599

00:35:11,360 --> 00:35:15,140

As we know from normal MCMC, we want to

increase our step size and move and

600

00:35:15,140 --> 00:35:16,942

explore the posterior more so.

601

00:35:17,232 --> 00:35:19,512

There's lots of research to be done here.

602

00:35:19,512 --> 00:35:23,492

I hope and I feel that it's not the last

time you'll talk about stochastic gradient

603

00:35:23,492 --> 00:35:25,291

MCMC on this podcast.

604

00:35:25,632 --> 00:35:26,372

Yeah, no.

605

00:35:26,372 --> 00:35:27,812

I mean, that sounds super interesting.

606

00:35:27,812 --> 00:35:30,832

I'm really interested also to really

understand the difference between these

607

00:35:30,832 --> 00:35:31,892

algorithms.

608

00:35:31,892 --> 00:35:34,992

Right now, that's really at the frontier

of research.

609

00:35:34,992 --> 00:35:41,552

You not only have a lot of research done

on how do you make HMC more efficient, but

610

00:35:41,552 --> 00:35:43,492

you have all these new algorithms.

611

00:35:44,236 --> 00:35:47,856

approximate algorithms as we said before.

612

00:35:48,396 --> 00:35:51,476

So, VI, Laplace approximation, stuff like

that.

613

00:35:51,476 --> 00:35:54,556

But also now you have normalizing flows.

614

00:35:54,556 --> 00:35:58,756

We talked about that in episode 98 with

Marilou Gabrié.

615

00:35:58,776 --> 00:36:03,316

Marilou Gabrié, actually, I don't know why

I said the second part with the Spanish.

616

00:36:04,056 --> 00:36:07,756

Because my Spanish is really available in

my brain right now.

617

00:36:08,536 --> 00:36:10,296

So, she's French.

618

00:36:10,296 --> 00:36:12,258

So, that's Marilou Gabrié.

619

00:36:12,300 --> 00:36:15,460

Episode 98, it's in the show notes.

620

00:36:15,700 --> 00:36:20,500

Episode 107, I already mentioned it with

Marvin Schmidt about amortized Bayesian

621

00:36:20,500 --> 00:36:21,480

inference.

622

00:36:21,560 --> 00:36:26,560

Actually, do you know about amortized

Bayesian inference and normalizing flows?

623

00:36:26,780 --> 00:36:28,460

I know a bit about normalizing flows.

624

00:36:28,460 --> 00:36:32,200

Amortized Bayesian inference I would be

less comfortable with.

625

00:36:32,200 --> 00:36:32,750

Okay.

626

00:36:32,750 --> 00:36:34,720

But I mean, if you could explain it.

627

00:36:34,720 --> 00:36:37,300

Yeah, I haven't watched or

listened to that episode.

628

00:36:37,300 --> 00:36:41,400

Yeah, I mean, we released it yesterday.

629

00:36:42,423 --> 00:36:44,744

Yeah, I don't...

630

00:36:44,744 --> 00:36:50,364

I'm a bit disappointed, Sam, but that's

fine.

631

00:36:50,364 --> 00:36:52,424

Like, it's just one day, you know.

632

00:36:52,424 --> 00:36:56,164

If you listen to it just after the

recording, I'll forgive you.

633

00:36:56,164 --> 00:36:57,744

That's okay.

634

00:37:00,204 --> 00:37:09,144

No, so, kidding aside, I'm actually

curious to hear you speak about the

635

00:37:09,144 --> 00:37:11,564

difference between normalizing flows

636

00:37:11,564 --> 00:37:13,674

and SGMCMC.

637

00:37:13,674 --> 00:37:18,184

Can you talk a bit about that if you're

comfortable with that?

638

00:37:19,044 --> 00:37:20,114

I mean, I can't.

639

00:37:20,114 --> 00:37:22,684

It's been a while since I've read about

normalizing flows.

640

00:37:22,684 --> 00:37:27,004

When I did read about them, I understood

it to be essentially a form of variational

641

00:37:27,004 --> 00:37:32,384

inference where you have more elaborate,

you define a more elaborate variational

642

00:37:32,384 --> 00:37:36,764

family through like, essentially through

like a triangular mapping.

643

00:37:36,764 --> 00:37:41,114

Like, the thing is,

someone might say,

644

00:37:42,072 --> 00:37:45,752

why can't you just use a neural network

as your variational distribution?

645

00:37:45,752 --> 00:37:50,832

And it's not so easy because you need to

have this tractable form.

646

00:37:51,432 --> 00:37:52,312

Hang on a second.

647

00:37:52,312 --> 00:37:53,492

Let me remember.

648

00:37:53,792 --> 00:37:59,932

But the thing is with normalizing flows,

you can get this because you can invert.

649

00:37:59,932 --> 00:38:00,402

That's it.

650

00:38:00,402 --> 00:38:01,232

They're invertible, right?

651

00:38:01,232 --> 00:38:02,712

Normalizing flows are invertible.

652

00:38:02,712 --> 00:38:03,752

So you can get this.

653

00:38:03,752 --> 00:38:07,732

You can write the change of variables

formula and then you can calculate

654

00:38:07,732 --> 00:38:09,708

essentially just via maximum likelihood,

655

00:38:09,708 --> 00:38:14,068

using these normalizing flows to fit

to a distribution.
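Here is the change-of-variables idea in its most stripped-down form: a single invertible affine map of a standard Gaussian, whose tractable log density can be fit by maximum likelihood with gradient ascent. A real normalizing flow composes many richer invertible layers; this toy and its data are purely illustrative.

```python
import math

def flow_log_density(y, shift, log_scale):
    # Invertible map y = shift + exp(log_scale) * z with z ~ Normal(0, 1).
    # Change of variables: log p(y) = log N(z; 0, 1) - log_scale,
    # where z = (y - shift) * exp(-log_scale) is the inverse map.
    z = (y - shift) * math.exp(-log_scale)
    return -0.5 * z * z - 0.5 * math.log(2 * math.pi) - log_scale

data = [1.8, 2.1, 2.4, 1.9, 2.2, 2.0, 2.3, 1.7]

# Fit shift and log_scale by maximum likelihood (analytic gradients
# of the mean log density for this affine flow).
shift, log_scale, lr = 0.0, 0.0, 0.05
for _ in range(2000):
    zs = [(y - shift) * math.exp(-log_scale) for y in data]
    grad_shift = sum(z * math.exp(-log_scale) for z in zs) / len(data)
    grad_log_scale = sum(z * z - 1 for z in zs) / len(data)
    shift += lr * grad_shift
    log_scale += lr * grad_log_scale

print(shift, math.exp(log_scale))  # recovers roughly the data mean and std
print(sum(flow_log_density(y, shift, log_scale) for y in data) / len(data))
```

The invertibility is what makes the log density, and hence the maximum likelihood objective, tractable; with a generic neural network as the map you would lose that.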

656

00:38:14,948 --> 00:38:17,647

Whereas SGMCMC doesn't.

657

00:38:17,647 --> 00:38:22,008

So you have to, in normalizing flows, you

kind of have to define your ansatz that

658

00:38:22,008 --> 00:38:23,268

will fit to your distribution.

659

00:38:23,268 --> 00:38:26,148

I think normalizing flows are really

exciting and really interesting, but yeah,

660

00:38:26,148 --> 00:38:27,688

you have to specify your ansatz.

661

00:38:27,688 --> 00:38:32,648

So it's another, so there's another tool

on top, another specification on top of

662

00:38:32,648 --> 00:38:35,564

how you.

663

00:38:35,564 --> 00:38:37,944

rather than just writing the log

posterior, you then need to find an

664

00:38:37,944 --> 00:38:42,944

approximate ansatz which you think will

fit the posterior or the distribution

665

00:38:42,944 --> 00:38:44,164

you're targeting.

666

00:38:44,244 --> 00:38:47,824

Whereas SGMCMC is just log posterior, go.

667

00:38:47,824 --> 00:38:50,464

Which is sort of what we're trying to do

with posteriors, is we're trying to

668

00:38:50,464 --> 00:38:54,014

automate, well not automate, we're trying

to research, of course, so much for that.

669

00:38:54,014 --> 00:38:58,104

But normalizing flows might be, yeah, as I

said, I think it's really interesting that

670

00:38:58,104 --> 00:39:03,464

you can get these more expressive

variational families through like

671

00:39:03,464 --> 00:39:04,524

triangular mappings, yeah.

672

00:39:04,524 --> 00:39:08,664

Yeah, super interesting.

673

00:39:09,044 --> 00:39:13,684

And yeah, amortized Bayesian inference

is related in the sense that you first

674

00:39:13,684 --> 00:39:16,604

fit a deep neural network on your model.

675

00:39:16,604 --> 00:39:21,004

And then once it's fit, you get posterior

inference for free, basically.

676

00:39:21,804 --> 00:39:26,744

So that's quite different from what I

understand SGMCMC to be.

677

00:39:26,744 --> 00:39:31,014

But that's also extremely interesting.

678

00:39:31,014 --> 00:39:32,364

That's also why I'm

679

00:39:32,364 --> 00:39:38,164

hammering you down on the different use

cases of SGMCMC so that myself and

680

00:39:38,164 --> 00:39:44,304

listeners have a kind of a tree in their

head of like, okay, my use case then is

681

00:39:44,304 --> 00:39:50,024

more appropriate for SGMCMC or, no, here

I'd like to try amortized Bayesian inference

682

00:39:50,024 --> 00:39:56,704

or, no, here I can just stick to plain

vanilla HMC.

683

00:39:56,704 --> 00:39:59,044

I think that's very interesting.

684

00:39:59,596 --> 00:40:03,736

But thanks for that question that was

completely improvised.

685

00:40:03,736 --> 00:40:08,576

I definitely appreciate you taking the

time to rack your brain about the

686

00:40:08,576 --> 00:40:10,516

difference with normalizing flows.

687

00:40:10,516 --> 00:40:12,386

No, I'd love to talk more on that.

688

00:40:12,386 --> 00:40:14,136

I'd need to refresh myself.

689

00:40:14,136 --> 00:40:16,996

I've written down some notes on

normalizing flows, and I was quite

690

00:40:16,996 --> 00:40:19,336

comfortable with them, but it's just been

a while since I refreshed.

691

00:40:19,336 --> 00:40:22,656

So I would love to refresh, and then we

can chat about them.

692

00:40:22,656 --> 00:40:25,876

Because I'd love to do a project on them,

or I'd love to work on them, because I

693

00:40:25,876 --> 00:40:29,024

think that's a

694

00:40:29,024 --> 00:40:34,104

way to fit distributions to data, which is,

after all, a lot of what we do.

695

00:40:34,484 --> 00:40:34,844

Yeah.

696

00:40:34,844 --> 00:40:35,464

Yeah.

697

00:40:35,464 --> 00:40:39,464

So that makes me think we should probably

do another episode about normalizing

698

00:40:39,464 --> 00:40:40,124

flows.

699

00:40:40,124 --> 00:40:46,824

So listeners, if there is a researcher you

like who does a lot of normalizing flows

700

00:40:46,824 --> 00:40:53,084

and you think would be a good guest on the

show, please reach out to me and I'll make

701

00:40:53,084 --> 00:40:54,198

that happen.

702

00:40:55,052 --> 00:41:01,072

Now let's get you closer to home,

Sam, and talk about posteriors again,

703

00:41:01,072 --> 00:41:07,512

Because basically, if I understood

correctly, posteriors aims to address

704

00:41:07,512 --> 00:41:13,452

uncertainty quantification in deep

learning. Is that right?

705

00:41:13,452 --> 00:41:19,272

And if that's the case, why is this

particularly important for neural networks

706

00:41:19,272 --> 00:41:22,316

And how does the package help in

707

00:41:22,316 --> 00:41:26,676

managing overconfidence in model

708

00:41:26,676 --> 00:41:27,716

predictions.

709

00:41:28,056 --> 00:41:31,926

Yeah, so that's our primary use case.

710

00:41:31,926 --> 00:41:36,736

And the goal is to use posteriors as

approximate Bayes; we're getting as close to

711

00:41:36,736 --> 00:41:40,876

Bayes as we can, which is probably not that

close, but still getting somewhere on the

712

00:41:40,876 --> 00:41:46,876

way to the Bayesian posterior in

big deep learning models.

713

00:41:46,876 --> 00:41:49,956

But we built posteriors to be as modular

and general as possible.

714

00:41:49,956 --> 00:41:51,884

So as I said, if you have a

715

00:41:51,884 --> 00:41:55,264

classical Bayesian model, you can write it

down in Pyro, but you've got loads of

716

00:41:55,264 --> 00:41:57,384

data, then okay, go ahead.

717

00:41:57,484 --> 00:41:59,944

And posteriors should be well suited to

that.

718

00:42:00,204 --> 00:42:08,164

In terms of what advantages we want to see

from uncertainty quantification or this

719

00:42:08,164 --> 00:42:13,224

approximate Bayesian inference in deep

learning models, there are three sorts of

720

00:42:13,224 --> 00:42:16,844

key things that we distilled it down to.

721

00:42:16,844 --> 00:42:18,744

So yeah, you mentioned

722

00:42:19,820 --> 00:42:21,920

overconfidence in out-of-distribution

predictions.

723

00:42:21,920 --> 00:42:28,940

So yeah, we should be able to improve our

performance in predicting on inputs that

724

00:42:28,940 --> 00:42:31,440

we haven't seen in the training set.

725

00:42:31,480 --> 00:42:34,620

So I'll talk about that after this.

726

00:42:34,620 --> 00:42:39,660

The second one is continual learning,

where we think that if you can do Bayes

727

00:42:39,660 --> 00:42:43,300

theorem exactly, you have your prior, you

have the

728

00:42:43,300 --> 00:42:46,572

likelihood, you get some data, you have a

posterior, then you get some more data.

729

00:42:46,572 --> 00:42:48,872

and then your posterior becomes your prior

and do the update.

730

00:42:48,872 --> 00:42:52,052

And you can just write like that if you

can do Bayes' theorem exactly.
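The "posterior becomes your prior" update being described can be made concrete with a conjugate toy model (a minimal sketch, not from the episode):

```python
# Conjugate Beta-Bernoulli updates: posterior is Beta(a + heads, b + tails).
def update(a, b, data):
    """One application of Bayes' rule for a Beta prior with 0/1 observations."""
    heads = sum(data)
    return a + heads, b + len(data) - heads

batch1 = [1, 0, 1, 1]
batch2 = [0, 1, 1]

# Sequential: yesterday's posterior is today's prior.
a, b = update(*update(1, 1, batch1), batch2)

# One-shot on all the data, and sequential in the reverse order.
a_all, b_all = update(1, 1, batch1 + batch2)
a_rev, b_rev = update(*update(1, 1, batch2), batch1)

# Exact Bayes is exchangeable: the same posterior regardless of ordering,
# which is why there is no "forgetting" when the update is done exactly.
assert (a, b) == (a_all, b_all) == (a_rev, b_rev) == (6, 3)
```

In the exact linear Gaussian version of the same recursion, with parameters evolving over time, this becomes the Kalman filter mentioned just below.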

731

00:42:52,192 --> 00:42:56,152

And then, yeah, this is, you can extend it

even further and then you have, with some

732

00:42:56,152 --> 00:42:59,412

sort of evolution along your parameters,

then you have a state space model, and

733

00:42:59,412 --> 00:43:02,172

then in the exact linear Gaussian setting,

you've got a Kalman filter.

734

00:43:02,172 --> 00:43:08,112

So for continual learning, in this case,

Bayes' theorem does it exactly.

735

00:43:08,112 --> 00:43:11,992

And in continual learning research in

machine learning settings, they have this

736

00:43:11,992 --> 00:43:14,972

term of avoiding catastrophic forgetting.

737

00:43:14,972 --> 00:43:16,052

So,

738

00:43:17,731 --> 00:43:22,712

If you just continue to do gradient

descent, there was no memory there, so you

739

00:43:22,712 --> 00:43:28,672

would just, apart from the initialization,

you would just forget what you've done

740

00:43:28,672 --> 00:43:31,152

previously and there's lots of evidence

for this, whereas Bayes' theorem is

741

00:43:31,152 --> 00:43:35,452

completely exchangeable in the

order of the data that you see.

742

00:43:35,452 --> 00:43:40,112

So you're doing Bayes' theorem exactly,

there's no forgetting, you just have the

743

00:43:40,112 --> 00:43:41,212

capacity of the model.

744

00:43:41,212 --> 00:43:45,252

So that's where we see Bayes solving

continual learning, but as I said, you

745

00:43:45,252 --> 00:43:45,952

can't

746

00:43:45,952 --> 00:43:49,632

do Bayes' theorem exactly in a

billion-dimensional model.

747

00:43:49,852 --> 00:43:56,372

And then the last one is, we'll call it

like decomposition of uncertainty in your

748

00:43:56,372 --> 00:43:57,072

predictions.

749

00:43:57,072 --> 00:44:02,752

So if you just have a gradient descent model

and you're predicting someone's

750

00:44:02,752 --> 00:44:06,532

reviews and you have to predict the stars,

it will just give you, as you said, it

751

00:44:06,532 --> 00:44:10,932

gives you your softmax, it'll just give

you this distribution over the stars and

752

00:44:10,932 --> 00:44:12,472

it'll be like that.

753

00:44:12,472 --> 00:44:15,564

But what you really want is you want to

have some indication of

754

00:44:15,564 --> 00:44:19,264

like out-of-distribution detection, right,

you want to know, okay, yeah, I'm

755

00:44:19,264 --> 00:44:21,764

confident in my, my prediction.

756

00:44:21,764 --> 00:44:27,844

And you might get a review that is like,

the food was terrible, but the service was

757

00:44:27,844 --> 00:44:31,154

amazing, or something like that, like,

service amazing, food was terrible.

758

00:44:31,154 --> 00:44:35,324

And then, let's say we have a perfect

model of this, we know how people

759

00:44:35,324 --> 00:44:39,044

review things, but we

have quite a lot of uncertainty on that

760

00:44:39,044 --> 00:44:41,764

review, because we don't know how the

reviewer values those different things.

761

00:44:41,764 --> 00:44:44,164

So we might have just a completely

uniform.

762

00:44:44,908 --> 00:44:47,408

distribution over the stars for that

review.

763

00:44:47,408 --> 00:44:49,398

But we'd be confident in that

distribution.

764

00:44:49,398 --> 00:44:52,888

But what Bayes gives you is it gives you

the ability to do this sort of second

765

00:44:52,888 --> 00:44:56,648

order uncertainty quantification: if

you have this distribution over parameters

766

00:44:56,648 --> 00:44:59,748

and you have a distribution over logits at

the end, the predictions, you can

767

00:44:59,748 --> 00:45:04,508

identify, you can split it, in

information theory terms, between what's called aleatoric and

768

00:45:04,508 --> 00:45:05,548

epistemic uncertainty.

769

00:45:05,548 --> 00:45:09,968

Aleatoric uncertainty or data uncertainty

is what I just described there, which is

770

00:45:09,968 --> 00:45:12,876

natural uncertainty in the model and the

data generating process.

771

00:45:12,876 --> 00:45:17,556

Epistemic uncertainty is uncertainty that

would be removed in the infinite data limit.

772

00:45:17,556 --> 00:45:19,946

So that would be where the model doesn't

know.

773

00:45:19,946 --> 00:45:22,996

So this is really important for us to

quantify that.
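One common way to compute this split is the mutual-information decomposition over an ensemble of parameter samples; a minimal numpy sketch (illustrative, not posteriors' actual code):

```python
import numpy as np

def entropy(p, axis=-1):
    # Shannon entropy in nats; the clip avoids log(0).
    return -np.sum(p * np.log(np.clip(p, 1e-12, 1.0)), axis=axis)

# Rows: softmax predictions from different parameter samples (e.g. SGMCMC draws).
# Case 1: every sample is individually uncertain in the same way -> aleatoric.
agree = np.array([[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]])
# Case 2: every sample is confident but they disagree -> epistemic.
disagree = np.array([[0.99, 0.01], [0.01, 0.99], [0.99, 0.01], [0.01, 0.99]])

def decompose(probs):
    total = entropy(probs.mean(axis=0))        # entropy of the averaged prediction
    aleatoric = entropy(probs).mean()          # average per-sample entropy
    epistemic = total - aleatoric              # mutual information: "model doesn't know"
    return total, aleatoric, epistemic

t1, a1, e1 = decompose(agree)
t2, a2, e2 = decompose(disagree)
assert e1 < 1e-9                 # agreement: zero epistemic uncertainty
assert e2 > a2                   # disagreement: mostly epistemic uncertainty
```

Both cases have the same averaged prediction (uniform over the two stars), so the single softmax alone cannot distinguish them; the distribution over parameters is what makes the split possible.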

774

00:45:24,196 --> 00:45:24,896

Okay.

775

00:45:25,096 --> 00:45:27,076

I, yeah, rambled on a bit there.

776

00:45:27,356 --> 00:45:33,236

I can in like 30 seconds elaborate on the

point you specifically mentioned on out-of-

777

00:45:33,236 --> 00:45:37,324

distribution performance and improving

performance out of distribution.

778

00:45:37,324 --> 00:45:40,984

And I think that's quite compelling from a

Bayesian point of view, because what Bayes

779

00:45:40,984 --> 00:45:46,824

says in like a supervised learning

setting is: gradient descent just

780

00:45:46,824 --> 00:45:51,944

fits one parameter, finds one parameter

configuration that's plausible given the

781

00:45:51,944 --> 00:45:52,984

training data.

782

00:45:53,044 --> 00:45:57,144

Bayes' theorem says, I find the whole

distribution of parameter configurations

783

00:45:57,144 --> 00:45:58,794

that's plausible given the data.

784

00:45:58,794 --> 00:46:01,204

And then when we make predictions, we

average over those.

785

00:46:01,204 --> 00:46:06,188

So it's perfectly natural to think that a

single configuration might overfit.

786

00:46:06,188 --> 00:46:11,148

and might just give, it might just be very

confident in its prediction when it sees

787

00:46:11,148 --> 00:46:12,268

out-of-distribution data.

788

00:46:12,268 --> 00:46:17,088

It doesn't necessarily fix a bad

model, but you should be more honest to the

789

00:46:17,088 --> 00:46:23,728

model and the data generating process

you've specified if you average over

790

00:46:23,728 --> 00:46:28,288

plausible model configurations under the

training data when you make your test predictions.

791

00:46:28,288 --> 00:46:32,748

So that's sort of quite a compelling, to

me, argument for improving

792

00:46:33,740 --> 00:46:39,380

performance on out-of-distribution

predictions, like the accuracy of them.

793

00:46:39,380 --> 00:46:43,150

And there's a fair bit of empirical

evidence for this, with the caveat again,

794

00:46:43,150 --> 00:46:46,600

being that the Bayesian posterior in high

dimensional models, machine learning

795

00:46:46,600 --> 00:46:52,880

models is pretty hard to approximate, cold

posterior effect, caveats, things like

796

00:46:52,880 --> 00:46:53,480

that.

797

00:46:53,480 --> 00:46:55,180

Okay, yeah, I see.

798

00:46:55,460 --> 00:46:57,620

Yeah, super interesting in that.

799

00:46:57,620 --> 00:46:59,838

So now I understand better.

800

00:46:59,884 --> 00:47:04,484

what you have on the posteriors website,

about the different kinds of uncertainty.

801

00:47:04,484 --> 00:47:09,984

So definitely that's something I recommend

listeners to give a read to.

802

00:47:10,104 --> 00:47:12,664

I put that in the show notes.

803

00:47:12,664 --> 00:47:19,014

So both your blog post introducing

posteriors and the docs for posteriors,

804

00:47:19,014 --> 00:47:25,684

because I think it makes that clear

combined with your explanation right now.

805

00:47:25,764 --> 00:47:26,904

Yeah.

806

00:47:27,564 --> 00:47:28,492

And...

807

00:47:28,492 --> 00:47:32,551

Something I was also wondering is that if

I understood correctly, the package is

808

00:47:32,551 --> 00:47:36,712

built on top of PyTorch, right?

809

00:47:37,012 --> 00:47:38,072

Yeah, that's correct.

810

00:47:38,072 --> 00:47:38,772

Yeah.

811

00:47:38,952 --> 00:47:39,512

Okay.

812

00:47:39,512 --> 00:47:48,412

So, and also, did I understand correctly

that you can integrate posteriors with pre

813

00:47:48,412 --> 00:47:56,396

-trained LLMs like Llama 2 and Mistral, and

you do that with a...

814

00:47:56,396 --> 00:47:58,796

Hugging Face Transformers package?

815

00:47:59,556 --> 00:48:04,076

So, yeah, so, I mean, yeah, posteriors is

open source.

816

00:48:04,496 --> 00:48:08,856

We fully support the open source

community for machine learning, for

817

00:48:08,856 --> 00:48:16,136

statistics, which is, and in terms of,

yeah, I mean, we're sort of in the fine

818

00:48:16,136 --> 00:48:21,016

tuning era or like we have like, there's

so much, there are these open source

819

00:48:21,016 --> 00:48:22,616

models and you can't get away from them.

820

00:48:22,616 --> 00:48:26,060

We have Llama 2, Llama 3, Mistral,

like, yeah.

821

00:48:26,060 --> 00:48:29,620

And basically we want to harness this

power, right?

822

00:48:29,620 --> 00:48:35,680

But as I mentioned previously, there are

some issues that we like to remedy with

823

00:48:35,680 --> 00:48:36,840

Bayesian techniques.

824

00:48:36,840 --> 00:48:40,820

So the majority of these open source

models are built in PyTorch.

825

00:48:40,820 --> 00:48:42,260

I'm also a big Jax fan.

826

00:48:42,260 --> 00:48:43,460

I also use Jax a lot.

827

00:48:43,460 --> 00:48:51,440

So I was very happy to see and work with

the torch.func sub-library, which

828

00:48:51,440 --> 00:48:53,406

basically makes it

829

00:48:54,060 --> 00:48:59,280

you can write your PyTorch code and you

can use Llama 3 or Mistral with PyTorch

830

00:48:59,280 --> 00:49:01,340

but writing functional code.
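A minimal sketch of what torch.func's functional_call gives you (standard PyTorch API; the tiny linear layer here stands in for a downloaded transformer):

```python
import torch
from torch.func import functional_call

# Any ordinary PyTorch module, e.g. a model loaded via Hugging Face Transformers.
model = torch.nn.Linear(3, 2)

# Pull the parameters out into a plain dict; the module then becomes a pure
# function of (params, inputs). That is what Bayesian methods need: the state
# over parameters can live outside the module and be updated functionally.
params = {k: v.detach().clone() for k, v in model.named_parameters()}

x = torch.randn(4, 3)
out = functional_call(model, params, (x,))

assert out.shape == (4, 2)
assert torch.allclose(out, model(x))  # same values, but now a pure function
```

Swapping the `params` dict for a perturbed or sampled copy gives predictions under different parameter configurations without ever mutating the module itself.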

831

00:49:01,740 --> 00:49:03,660

So that's what we've done with posteriors.

832

00:49:03,880 --> 00:49:07,480

So, yeah, Hugging Face Transformers, you

can download the models, that's where all

833

00:49:07,480 --> 00:49:09,960

they're hosted, and how you access them.

834

00:49:09,960 --> 00:49:12,060

But then what you get is just a PyTorch

model.

835

00:49:12,060 --> 00:49:13,720

It's just a PyTorch model.

836

00:49:13,780 --> 00:49:18,340

And then you throw that in and it composes

all nicely with the posteriors updates.

837

00:49:18,340 --> 00:49:23,100

Or you write your own new updates in the

posteriors framework and you can use that

838

00:49:23,100 --> 00:49:23,596

as well.

839

00:49:23,596 --> 00:49:25,056

still with Llama 3

840

00:49:25,066 --> 00:49:25,406

or

841

00:49:25,406 --> 00:49:26,156

Mistral.

842

00:49:26,596 --> 00:49:26,736

Yeah.

843

00:49:26,736 --> 00:49:27,336

Okay.

844

00:49:27,336 --> 00:49:28,036

Nice.

845

00:49:28,456 --> 00:49:30,746

And so what does it mean concretely for

users?

846

00:49:30,746 --> 00:49:40,796

That means you can use these pre -trained

LLMs with posteriors and that means adding

847

00:49:40,796 --> 00:49:45,336

a layer of uncertainty quantification on

top of those models?

848

00:49:45,936 --> 00:49:46,376

Yeah.

849

00:49:46,376 --> 00:49:49,816

So you need, I mean, Bayes theorem is a

training theorem.

850

00:49:49,816 --> 00:49:52,086

So you need data as well.

851

00:49:52,086 --> 00:49:52,780

So you take

852

00:49:52,780 --> 00:49:58,960

your pre-trained model, which

is, yeah, transformer, or it could be

853

00:49:58,960 --> 00:50:01,080

another type of model, it could be an

image model or something like that, and

854

00:50:01,080 --> 00:50:04,440

then you give it some new data, which we

would say was fine -tuning, and then you

855

00:50:04,440 --> 00:50:09,260

combine, use posterior to combine the two,

and then you have your new model out at

856

00:50:09,260 --> 00:50:11,740

the end of the day, which has uncertainty

quantification.

857

00:50:11,780 --> 00:50:16,540

It's difficult, as I said, we're sort of

in this fine -tuning era as open -source

858

00:50:16,540 --> 00:50:17,580

large language models.

859

00:50:17,580 --> 00:50:20,850

It's still to be worked out; this is different.

860

00:50:20,972 --> 00:50:23,872

There's still lots of research to do here

and it's different to our classical

861

00:50:23,872 --> 00:50:27,692

Bayesian regime where we just have our,

there's only one source of data and it's

862

00:50:27,692 --> 00:50:28,472

what we give it.

863

00:50:28,472 --> 00:50:32,312

In this case, there's two sources of data

because you have your data, whatever,

864

00:50:32,312 --> 00:50:36,652

whatever Llama 3 saw in its original

training data and then it has your own

865

00:50:36,652 --> 00:50:37,492

data.

866

00:50:38,072 --> 00:50:42,472

It's, yeah, can we hope to get uncertainty

quantification on the data that they used

867

00:50:42,472 --> 00:50:43,372

in the original training?

868

00:50:43,372 --> 00:50:46,772

Probably not, but we might be able to get

uncertainty quantification and improved

869

00:50:46,772 --> 00:50:47,692

predictions.

870

00:50:47,692 --> 00:50:49,312

based on the data that we've given it.

871

00:50:49,312 --> 00:50:55,132

So there's lots of lots for us to try out

here and learn because we are still

872

00:50:55,132 --> 00:50:58,852

learning on this in terms of the fine

tuning.

873

00:50:58,852 --> 00:51:04,132

But yeah, this is what posteriors is there

to make these sort of questions as easy as

874

00:51:04,132 --> 00:51:06,072

possible to ask and answer.

875

00:51:07,948 --> 00:51:09,308

Okay, fantastic.

876

00:51:09,308 --> 00:51:11,628

Yeah, that's, that's so exciting.

877

00:51:11,628 --> 00:51:17,508

It's just like, it's a bit frustrating to

me because I'm like, I'd love to try that

878

00:51:17,508 --> 00:51:21,288

and learn on that and like, contribute to

that kind of packages.

879

00:51:21,288 --> 00:51:25,408

At the same time, I have to work, I have

to do the podcast, and I have all the

880

00:51:25,408 --> 00:51:27,568

packages I'm already contributing to.

881

00:51:27,568 --> 00:51:32,068

So I'm like, my god, too many choices,

too many choices.

882

00:51:32,068 --> 00:51:33,688

No, come on, Alex, we're gonna see you.

883

00:51:33,688 --> 00:51:35,468

We're gonna see an Alex pull

request

884

00:51:35,468 --> 00:51:36,928

soon enough.

885

00:51:40,799 --> 00:51:51,380

Actually, does this

ability to have the transformers in, you

886

00:51:51,380 --> 00:51:59,280

know, to use these pre-trained models, does

that help facilitate the adoption of new

887

00:51:59,280 --> 00:52:02,360

algorithms in posteriors?

888

00:52:02,360 --> 00:52:05,068

Because if I understand correctly, you can

support

889

00:52:05,068 --> 00:52:10,378

new algorithms pretty easily and you can

support arbitrary likelihoods.

890

00:52:10,378 --> 00:52:11,928

How do you do that?

891

00:52:14,288 --> 00:52:22,528

I wouldn't say that the existence of the

pre -trained models necessarily allows us

892

00:52:22,528 --> 00:52:23,968

to support new algorithms.

893

00:52:23,968 --> 00:52:28,128

I feel like we've built posteriors to

be suitably general and suitably modular,

894

00:52:28,128 --> 00:52:33,368

that it's kind of agnostic to your model

choice and your log posterior choice.

895

00:52:33,676 --> 00:52:36,396

in terms of arbitrary likelihoods.

896

00:52:36,396 --> 00:52:37,976

But yeah, that's like a benefit.

897

00:52:37,976 --> 00:52:41,956

That's like, yeah, the

arbitrary likelihood is relevant, because a

898

00:52:41,956 --> 00:52:45,696

lot of machine learning packages...

899

00:52:46,016 --> 00:52:48,956

I mean, a lot of machine learning is

essentially boils down to

900

00:52:48,956 --> 00:52:50,016

classification or regression.

901

00:52:50,016 --> 00:52:51,156

And that is true.

902

00:52:51,156 --> 00:52:55,516

And because of that, a lot of

machine learning algorithms, a lot of

903

00:52:55,516 --> 00:52:58,796

machine learning packages will essentially

constrain it to classification or

904

00:52:58,796 --> 00:52:58,996

regression.

905

00:52:58,996 --> 00:53:01,676

At the end, you either have your softmax

or you have your mean squared error.

906

00:53:01,676 --> 00:53:03,970

Yeah, softmax cross-entropy or mean squared error.

907

00:53:04,044 --> 00:53:05,134

In posteriors, we haven't done that.

908

00:53:05,134 --> 00:53:08,164

We're more faithful to the sort of the

Bayesian setting where you just write down

909

00:53:08,164 --> 00:53:10,864

your log posterior and you can write down

whatever you want.

910

00:53:11,724 --> 00:53:16,304

And this allows you greater flexibility in

the case you did want to try out a

911

00:53:16,304 --> 00:53:22,664

different likelihood or like even in like

simple cases, like it's just more

912

00:53:22,664 --> 00:53:25,584

sophisticated than just classification or

regression a lot of the time.

913

00:53:25,584 --> 00:53:30,204

Like in sequence generation where you have

the sequence and then you have the cross

914

00:53:30,204 --> 00:53:31,532

entropy over all of that.

915

00:53:31,532 --> 00:53:35,712

It just allows you to be more flexible and

write the code how you want.

916

00:53:35,712 --> 00:53:38,082

And there's additional things to be taken

into account.

917

00:53:38,082 --> 00:53:41,032

Like sometimes if you were doing a

regression, you might have knowledge of

918

00:53:41,032 --> 00:53:42,292

the noise variance.

919

00:53:42,292 --> 00:53:44,672

And that's just the observation noise

variance.

920

00:53:44,672 --> 00:53:49,252

And that's just much easier; yeah, if

we don't constrain things like this, it's just

921

00:53:49,252 --> 00:53:54,412

much easier to write your code, much

cleaner code, than if we did.

922

00:53:54,412 --> 00:53:55,832

And it's also future -proofing.

923

00:53:55,832 --> 00:53:57,420

We don't know what's going to be

924

00:53:57,420 --> 00:54:00,320

happening going forward.

925

00:54:00,320 --> 00:54:04,680

We may see like, yeah, in multimodal

models, we may see like, text and images

926

00:54:04,680 --> 00:54:09,200

together, in which case, yeah, we will

support that.

927

00:54:09,400 --> 00:54:13,580

You have to supply the compute and the

data, which might be the harder thing, but

928

00:54:13,580 --> 00:54:15,660

we'll support those likelihoods.

929

00:54:15,660 --> 00:54:16,890

Okay, I see.

930

00:54:16,890 --> 00:54:17,210

I see.

931

00:54:17,210 --> 00:54:19,680

Yeah, that's very, very interesting.

932

00:54:20,080 --> 00:54:26,220

And that's related to the fact that, I think,

I've read in your blog post or on the

933

00:54:26,220 --> 00:54:27,276

website that

934

00:54:27,276 --> 00:54:32,076

You say that posteriors is swappable.

935

00:54:32,076 --> 00:54:33,736

What does that mean?

936

00:54:33,956 --> 00:54:37,096

And how does that flexibility benefit

users?

937

00:54:37,656 --> 00:54:38,116

Yeah.

938

00:54:38,116 --> 00:54:43,796

So, I mean, the point of swappable,

when I say that, is that you can

939

00:54:43,796 --> 00:54:48,676

change between methods if you want to.

As I said, posteriors is a research

940

00:54:48,676 --> 00:54:52,976

toolbox, and it's for us to investigate

which inference method is appropriate in

941

00:54:52,976 --> 00:54:57,260

the different settings, which might be

different if you care about decomposing.

942

00:54:57,260 --> 00:54:59,920

predictive uncertainty, it might be

different if you care about avoiding

943

00:54:59,920 --> 00:55:02,100

catastrophic forgetting in your

continual learning.

944

00:55:02,720 --> 00:55:07,320

So the thing is that you can just, the way

it's written is you can just swap, you can

945

00:55:07,320 --> 00:55:11,720

go from SGMCMC to the Laplace

approximation, or you can go to VI, just by

946

00:55:11,720 --> 00:55:12,940

changing one line of code.

947

00:55:12,940 --> 00:55:17,620

And the way it works is, you have your

build: you have your transform equals

948

00:55:17,620 --> 00:55:22,636

posteriors dot inference-method dot build, and then

any configuration arguments: step size,

949

00:55:22,636 --> 00:55:24,576

things like this, which are algorithm

specific.

950

00:55:24,576 --> 00:55:26,236

And then after that is all unified.

951

00:55:26,236 --> 00:55:32,476

So you just have your init around the

parameters that you want to do Bayes on.

952

00:55:32,476 --> 00:55:36,716

And then you iterate through your data

loader, you iterate through your data.

953

00:55:36,716 --> 00:55:38,516

And then it just updates based on the

batch.

954

00:55:38,516 --> 00:55:40,256

And batch can be very general.

955

00:55:40,256 --> 00:55:41,316

So that's what it means.

956

00:55:41,316 --> 00:55:44,716

So you can just change one line of code to

swap between variational inference and

957

00:55:44,716 --> 00:55:51,216

SGMCMC or extended Kalman filter or any and

all the new methods that the listeners are

958

00:55:51,216 --> 00:55:52,538

going to add in the future.
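The build / init / update pattern being described can be sketched in plain Python. This is a toy stand-in for illustration, not the actual posteriors API; all names are hypothetical:

```python
import random
from typing import Callable, NamedTuple

class Transform(NamedTuple):
    init: Callable
    update: Callable

# Toy 1-D target: each batch item y contributes gradient (y - theta)
# to the log-posterior; a stand-in for an arbitrary log_posterior function.
def grad_log_post(theta, batch):
    return sum(y - theta for y in batch)

def gd_build(lr=0.05):
    """Gradient-ascent 'method' exposing the unified init/update interface."""
    def init(theta):
        return {"theta": theta}
    def update(state, batch):
        return {"theta": state["theta"] + lr * grad_log_post(state["theta"], batch)}
    return Transform(init, update)

def sgld_build(lr=0.05):
    """A Langevin-flavoured variant: same interface, extra injected noise."""
    def init(theta):
        return {"theta": theta}
    def update(state, batch):
        g = grad_log_post(state["theta"], batch)
        return {"theta": state["theta"] + lr * g + random.gauss(0.0, (2 * lr) ** 0.5)}
    return Transform(init, update)

batches = [[1.0, 2.0], [3.0], [2.0, 2.0]]

transform = gd_build(lr=0.05)  # swap to sgld_build(...) by editing this one line
state = transform.init(0.0)
for batch in batches:
    state = transform.update(state, batch)

assert 0.0 < state["theta"] < 2.0  # moved toward the data mean (2.0)
```

Because every method hides behind the same init/update pair, the training loop never changes when the inference method does; that is the "one line of code" swap.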

959

00:55:52,588 --> 00:55:53,568

Heh.

960

00:55:56,140 --> 00:55:57,080

Okay.

961

00:55:57,080 --> 00:55:57,660

Okay.

962

00:55:57,660 --> 00:55:58,920

I see.

963

00:55:59,620 --> 00:56:05,560

And so I have so many more questions for

you and posterior's but let's start and

964

00:56:05,560 --> 00:56:11,560

wrap that up, because I also want to ask

you about another project you're working

965

00:56:11,560 --> 00:56:16,250

on. So maybe to close that out on

posteriors:

966

00:56:16,250 --> 00:56:21,440

What are the future plans for posteriors,

and are there any upcoming features or

967

00:56:21,440 --> 00:56:24,660

integrations that you can

share with us?

968

00:56:25,484 --> 00:56:29,834

So we're quite happy with the framework at

the moment.

969

00:56:29,834 --> 00:56:36,084

There's lots of little tweaks that we have

a list of GitHub issues that we want to go

970

00:56:36,084 --> 00:56:42,464

through, which are mostly and excitingly

about adding new methods and new

971

00:56:42,464 --> 00:56:43,064

applications.

972

00:56:43,064 --> 00:56:47,444

So that's really what we're excited about

now is actually use it in the wild and

973

00:56:47,444 --> 00:56:51,384

hopefully experiment all these questions

that we've discussed.

974

00:56:51,788 --> 00:56:56,168

Yeah, like, how does it make

sense, and how do we get the benefits of

975

00:56:56,168 --> 00:57:03,528

Bayesian, true Bayesian inference on fine

tuning or on large models or large data.

976

00:57:03,888 --> 00:57:08,368

And so yeah, we are really excited and to

add more methods.

977

00:57:08,368 --> 00:57:14,488

So if listeners have mini batch, big data

Bayesian methods that they want to

978

00:57:14,488 --> 00:57:19,568

try out with a large data model, then

we're happily accepting those.

979

00:57:19,568 --> 00:57:21,266

I do.

980

00:57:23,260 --> 00:57:35,120

I do like, I do promote generality,

doing it in a way that is sort of

981

00:57:35,120 --> 00:57:35,770

flexible and stuff.

982

00:57:35,770 --> 00:57:37,320

So we may have to think a lot.

983

00:57:37,320 --> 00:57:42,920

We want to add methods

that somehow feel natural, and one way

984

00:57:42,920 --> 00:57:46,390

is to extend and compose with other

methods.

985

00:57:46,390 --> 00:57:50,508

So it might be that if we've got some very

complicated last-layer method that

986

00:57:50,508 --> 00:57:53,648

requires classes just for classification,

we're probably not going to add

987

00:57:53,648 --> 00:57:53,908

it.

988

00:57:53,908 --> 00:57:59,008

So it has to be methods that stick within

the posterior framework, which is this

989

00:57:59,008 --> 00:58:03,288

arbitrary likelihood Bayesian swappable

computation.

990

00:58:03,708 --> 00:58:04,088

Okay.

991

00:58:04,088 --> 00:58:04,558

Okay.

992

00:58:04,558 --> 00:58:04,948

Yeah.

993

00:58:04,948 --> 00:58:06,708

Yeah, that makes sense.

994

00:58:06,948 --> 00:58:13,808

Yeah, because you have like, yeah, you

have that kind of vision of wanting to do

995

00:58:13,808 --> 00:58:19,928

that and having that as a as a research

tool, basically.

996

00:58:19,928 --> 00:58:20,524

So

997

00:58:20,524 --> 00:58:25,644

Yeah, that makes sense to keep that under

control, let's say.

998

00:58:26,444 --> 00:58:31,624

Something I want to ask you in the last

few minutes of the show is about

999

00:58:31,624 --> 00:58:34,104

thermodynamic computing.

Speaker:

00:58:34,344 --> 00:58:37,074

I've seen you, you are working on that.

Speaker:

00:58:37,074 --> 00:58:39,204

And you've told me you're working on that.

Speaker:

00:58:39,204 --> 00:58:41,094

So yeah, I don't know anything about that.

Speaker:

00:58:41,094 --> 00:58:43,604

So can you like, what's that about?

Speaker:

00:58:43,824 --> 00:58:47,584

Yeah, so I mean, this is yeah, this is

something that's very Normal, Normal

Speaker:

00:58:47,584 --> 00:58:48,044

Computing.

Speaker:

00:58:48,044 --> 00:58:49,744

And it's like,

Speaker:

00:58:50,807 --> 00:58:52,378

It's something that we have.

Speaker:

00:58:52,378 --> 00:58:53,728

Yeah, we have this hardware team.

Speaker:

00:58:53,728 --> 00:58:55,428

It's like a full stack AI company.

Speaker:

00:58:55,428 --> 00:59:00,128

And we, yeah, on the posteriors side, on

the client side, we look at how we can

Speaker:

00:59:00,128 --> 00:59:06,468

bring in principle Bayesian uncertainty

quantification and help us solve the

Speaker:

00:59:06,468 --> 00:59:10,228

issues with machine learning pipelines

like we've already discussed.

Speaker:

00:59:10,228 --> 00:59:12,548

And on the other side, there's lots of

parts to this.

Speaker:

00:59:12,548 --> 00:59:16,204

More just like traditional MCMC is

difficult sometimes because

Speaker:

00:59:16,204 --> 00:59:19,824

Well, it's just about simulating

SDEs, essentially; that's what the thermodynamic

Speaker:

00:59:19,824 --> 00:59:25,664

hardware is doing, simulating SDEs. Normally, you

have this real pain with the step size and

Speaker:

00:59:25,664 --> 00:59:32,344

as the dimension grows, step sizes get

really small. And so SDEs, where do we see

Speaker:

00:59:32,344 --> 00:59:32,634

SDEs?

Speaker:

00:59:32,634 --> 00:59:37,544

You see SDEs in physics all the time and

physics is real, we can use physics. So it's

Speaker:

00:59:37,544 --> 00:59:43,904

building physical hardware,

analog hardware, that hopefully

Speaker:

00:59:43,904 --> 00:59:45,100

evolves as SDEs

Speaker:

00:59:45,100 --> 00:59:49,920

then we can harness those SDEs by encoding,

you know, currents and voltages and

Speaker:

00:59:49,920 --> 00:59:50,350

things like that.

Speaker:

00:59:50,350 --> 00:59:52,760

I'm not a physicist, so I don't know

exactly how it works.

Speaker:

00:59:52,760 --> 00:59:57,800

But I'm always reassured, when I

speak to the hardware team, by how simply

Speaker:

00:59:57,800 --> 01:00:00,900

they talk about these things, it's like,

yeah, we can just stick some resistors and

Speaker:

01:00:00,900 --> 01:00:03,850

capacitors on a chip, and then

it'll do this SDE.

Speaker:

01:00:03,850 --> 01:00:08,100

So that's the idea, and then we want to use

those SDEs for scientific computation.

Speaker:

01:00:08,100 --> 01:00:11,730

And with a real focus on statistics and

machine learning.

Speaker:

01:00:11,730 --> 01:00:14,476

So yeah, we want to be able to do an HMC

Speaker:

01:00:14,476 --> 01:00:17,216

on device, on an analog device.

Speaker:

01:00:17,216 --> 01:00:21,976

The first step is to do the linear case,

so we'll have a Gaussian posterior

Speaker:

01:00:21,976 --> 01:00:24,816

or, in SDE terms, a linear drift.

Speaker:

01:00:24,816 --> 01:00:29,736

That's an Ornstein-Uhlenbeck process, and

we've developed hardware to do this. And it

Speaker:

01:00:29,736 --> 01:00:33,596

turns out that an Ornstein-Uhlenbeck

process, because it has a Gaussian

Speaker:

01:00:33,596 --> 01:00:37,196

stationary distribution, you can input the

Speaker:

01:00:37,196 --> 01:00:40,972

precision matrix and output the covariance

matrix, that's matrix inversion.

Speaker:

01:00:40,972 --> 01:00:45,432

And your physical device

just does this.
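
To make that concrete, here's a hypothetical software sketch of the same principle (illustrative only, not Normal Computing's hardware or code): simulate the Ornstein-Uhlenbeck SDE dx = -A x dt + sqrt(2) dW for a symmetric positive-definite precision matrix A, and the empirical covariance of the trajectory approaches inv(A), so simply running the SDE performs the matrix inversion.

```python
# Hypothetical illustration of "the SDE computes the inverse" --
# not Normal Computing's hardware or code.
# Simulate the Ornstein-Uhlenbeck process dx = -A x dt + sqrt(2) dW
# with Euler-Maruyama; its stationary covariance is inv(A) for
# symmetric positive-definite A.
import numpy as np

rng = np.random.default_rng(0)

A = np.array([[2.0, 0.5],
              [0.5, 1.0]])  # precision matrix (symmetric positive definite)

dt, burn_in, n_steps = 1e-2, 2_000, 200_000
x = np.zeros(2)
samples = np.empty((n_steps, 2))
for step in range(burn_in + n_steps):
    # One Euler-Maruyama step of the OU dynamics
    x = x - (A @ x) * dt + np.sqrt(2.0 * dt) * rng.standard_normal(2)
    if step >= burn_in:
        samples[step - burn_in] = x

empirical_cov = np.cov(samples.T)
print(empirical_cov)  # close to np.linalg.inv(A)
```

Up to Monte Carlo and discretisation error, feeding in the precision matrix and reading off the sample covariance inverts the matrix.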

Speaker:

01:00:45,432 --> 01:00:50,152

And because it's an SDE, it has noise

and is kind of noise-aware, which is

Speaker:

01:00:50,152 --> 01:00:54,972

different from classical analog computation,

which

Speaker:

01:00:54,972 --> 01:00:57,572

is really old, really old, but has

historically been plagued by noise.

Speaker:

01:00:57,572 --> 01:01:00,172

And it's like, yeah, there's all this

noise in physics.

Speaker:

01:01:00,172 --> 01:01:03,752

And because we're doing SDEs, we want the

noise.

Speaker:

01:01:03,752 --> 01:01:05,062

So yeah, that's the whole idea.

Speaker:

01:01:05,062 --> 01:01:07,012

It's obviously very young, but it's fun.

Speaker:

01:01:07,012 --> 01:01:07,832

It's fun stuff.

Speaker:

01:01:07,832 --> 01:01:08,112

Yeah.

Speaker:

01:01:08,112 --> 01:01:10,434

So that's basically to...

Speaker:

01:01:11,288 --> 01:01:12,928

accelerate computing?

Speaker:

01:01:13,648 --> 01:01:18,528

It's hardware-first, so that computing

is accelerated?

Speaker:

01:01:18,908 --> 01:01:21,878

We want to, I mean, it's a baby field.

Speaker:

01:01:21,878 --> 01:01:24,568

So we're trying to accelerate different

components.

Speaker:

01:01:24,568 --> 01:01:28,568

What we worked out is that the simplest

thermodynamic chip we can build is this

Speaker:

01:01:28,568 --> 01:01:30,528

linear chip with the Ornstein-Uhlenbeck

process.

Speaker:

01:01:30,528 --> 01:01:34,092

And that can speed things up, with

Speaker:

01:01:34,092 --> 01:01:38,432

some error, but it has asymptotic speed-ups

for linear algebra routines, so

Speaker:

01:01:38,432 --> 01:01:41,112

inverting a matrix or solving a linear

system.

Speaker:

01:01:41,252 --> 01:01:42,792

That's awesome.

Speaker:

01:01:44,712 --> 01:01:48,952

In this case, it would speed up a certain

component, but that could be useful in a

Speaker:

01:01:48,952 --> 01:01:52,912

Laplace approximation or these sorts of

things, also in machine learning.

Speaker:

01:01:52,912 --> 01:01:57,272

Okay, that must be very fun to work on.

Speaker:

01:01:57,652 --> 01:02:02,552

Do you have any writing about that that we

can put in the show notes?

Speaker:

01:02:02,552 --> 01:02:03,244

Because

Speaker:

01:02:03,244 --> 01:02:06,164

I think it'd be super interesting for

listeners.

Speaker:

01:02:06,604 --> 01:02:07,224

Yeah, yeah.

Speaker:

01:02:07,224 --> 01:02:12,184

We've got the Normal Computing Scholar

page, which has a list of papers, but we also

Speaker:

01:02:12,184 --> 01:02:16,404

have more accessible blogs, which I'll

make sure to put in the show notes.

Speaker:

01:02:16,424 --> 01:02:21,404

Yeah, yeah, please do because, yeah, I

think it's super interesting.

Speaker:

01:02:22,004 --> 01:02:25,514

And yeah, and when you have something to

present on that, feel free to reach out.

Speaker:

01:02:25,514 --> 01:02:29,244

And I think that'd be fun to do an episode

about that, honestly.

Speaker:

01:02:29,244 --> 01:02:30,324

That'd be great.

Speaker:

01:02:30,424 --> 01:02:31,124

Yeah.

Speaker:

01:02:32,684 --> 01:02:36,984

Yes, so maybe one last question before

asking you the last two questions.

Speaker:

01:02:37,244 --> 01:02:40,174

Let's zoom out and be way less

technical.

Speaker:

01:02:40,174 --> 01:02:44,624

We've been very technical through the

whole episode, which I love.

Speaker:

01:02:45,084 --> 01:02:52,684

But maybe I'm thinking if you have any

advice to give to aspiring developers

Speaker:

01:02:52,684 --> 01:02:57,484

interested in contributing to open source

projects like Posterior's, what would it

Speaker:

01:02:57,484 --> 01:02:58,424

be?

Speaker:

01:03:00,812 --> 01:03:05,632

Okay, yeah, I don't know, I don't feel

like I'm necessarily the best placed to say

Speaker:

01:03:05,632 --> 01:03:10,352

all this, but yeah, I mean, I would just,

the most important thing is just to go for

Speaker:

01:03:10,352 --> 01:03:17,292

it, just get stuck in, get in the weeds of

these libraries and see what's there.

Speaker:

01:03:17,292 --> 01:03:22,852

And there's loads of people building such

cool stuff in the open source ecosystem

Speaker:

01:03:22,852 --> 01:03:25,872

and honestly, it's

really fun and rewarding to get involved

Speaker:

01:03:25,872 --> 01:03:26,182

with it.

Speaker:

01:03:26,182 --> 01:03:28,832

So just go for it, you'll learn so much

along the way.

Speaker:

01:03:29,280 --> 01:03:30,960

Or maybe something more tangible:

Speaker:

01:03:31,200 --> 01:03:35,300

I find that when I'm stuck on

something, when I don't understand

Speaker:

01:03:35,300 --> 01:03:40,300

something in code or mathematics, then I

often struggle to find it in papers per

Speaker:

01:03:40,300 --> 01:03:40,420

se.

Speaker:

01:03:40,420 --> 01:03:44,620

And textbooks, I love

textbooks, I find them a real

Speaker:

01:03:44,620 --> 01:03:48,020

source of gold for these because they

actually go to the depths of explaining

Speaker:

01:03:48,020 --> 01:03:53,600

things, without this sort of horse-in-the-race

style of writing that you often

Speaker:

01:03:53,600 --> 01:03:54,360

find in papers.

Speaker:

01:03:54,360 --> 01:03:58,348

So yeah, get stuck in, and check textbooks

if you get lost.

Speaker:

01:03:58,348 --> 01:03:59,138

Or don't understand something.

Speaker:

01:03:59,138 --> 01:04:00,218

Or just ask as well.

Speaker:

01:04:00,218 --> 01:04:04,048

Open source is all about asking and

communicating and bouncing ideas.

Speaker:

01:04:04,188 --> 01:04:05,768

Yeah, yeah, yeah, for sure.

Speaker:

01:04:05,768 --> 01:04:07,208

Yeah, that's usually what I do.

Speaker:

01:04:07,208 --> 01:04:13,308

I ask a lot and I usually end up

surrounding myself with people way smarter

Speaker:

01:04:13,308 --> 01:04:14,508

than me.

Speaker:

01:04:14,508 --> 01:04:16,848

And that's exactly what you want.

Speaker:

01:04:17,508 --> 01:04:19,448

That's exactly how I learned.

Speaker:

01:04:19,848 --> 01:04:26,428

Yeah, textbooks, though, I would say I kind of

find the writing boring most of the time,

Speaker:

01:04:26,428 --> 01:04:28,168

depends on the textbooks.

Speaker:

01:04:28,236 --> 01:04:30,636

And also, they're expensive.

Speaker:

01:04:31,516 --> 01:04:32,216

Yeah.

Speaker:

01:04:32,716 --> 01:04:35,336

So that's kind of the problem of

textbooks, I would say.

Speaker:

01:04:35,336 --> 01:04:40,676

I mean, you can often have them as PDFs,

but I just hate reading a PDF on my

Speaker:

01:04:40,676 --> 01:04:41,636

computer.

Speaker:

01:04:41,636 --> 01:04:48,416

So, you know, I want the book object,

or to have it on a Kindle or something like

Speaker:

01:04:48,416 --> 01:04:48,556

that.

Speaker:

01:04:48,556 --> 01:04:51,416

But that doesn't

really exist yet.

Speaker:

01:04:51,416 --> 01:04:52,016

So.

Speaker:

01:04:54,252 --> 01:05:00,972

Could be something that some publishers solve

someday. That'd be cool, I'd love that.

Speaker:

01:05:00,972 --> 01:05:09,072

Awesome, Sam, that was great, thank you so

much. We've covered so many topics, and my

Speaker:

01:05:09,072 --> 01:05:15,072

brain is burning, so that's a very good

sign. I've learned a lot, and I'm sure our

Speaker:

01:05:15,072 --> 01:05:19,832

listeners did too. Of course, before

letting you go, I'm gonna ask you the last

Speaker:

01:05:19,832 --> 01:05:23,820

two questions I ask every guest at the end

of the show. So, one:

Speaker:

01:05:23,820 --> 01:05:28,600

If you had unlimited time and resources,

which problem would you try to solve?

Speaker:

01:05:53,676 --> 01:05:58,836

You want to decouple the model specification,

the data-generating process, how you go

Speaker:

01:05:58,836 --> 01:06:02,296

from the things you don't know to the

data you do have.

Speaker:

01:06:02,376 --> 01:06:04,216

That's your sort of freedom as a data modeler.

Speaker:

01:06:04,216 --> 01:06:07,916

And you define that separately from the

inference and the mathematical

Speaker:

01:06:07,916 --> 01:06:08,616

computation.

Speaker:

01:06:08,616 --> 01:06:12,316

So that's the way you do

your approximate Bayesian inference.

Speaker:

01:06:12,316 --> 01:06:13,396

And you want to decouple those.

Speaker:

01:06:13,396 --> 01:06:14,966

You want to make it as easy as possible.

Speaker:

01:06:14,966 --> 01:06:16,416

Ideally, we just want to be doing that

one.

Speaker:

01:06:16,416 --> 01:06:19,596

We just want to be doing the model

specification.

Speaker:

01:06:19,656 --> 01:06:22,246

And this is something Stan and PyMC do

really well.

Speaker:

01:06:22,246 --> 01:06:22,860

It's just like,

Speaker:

01:06:22,860 --> 01:06:25,160

you write down your model, we'll handle

the rest.
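
As a toy illustration of that decoupling (a minimal sketch with made-up names, not the Stan, PyMC, or Posteriors API): the model specification reduces to a log-posterior gradient function, and the inference routine, here an unadjusted Langevin, SGLD-style sampler, never needs to know anything else about the model.

```python
# Minimal sketch of decoupling model specification from inference.
# Hypothetical names -- this is not the Stan/PyMC/Posteriors API.
import numpy as np

def grad_log_posterior(theta):
    # Model specification: a standard Gaussian posterior,
    # so grad log p(theta) = -theta.
    return -theta

def langevin_sample(grad_fn, theta0, step_size=1e-2, n_steps=20_000, seed=0):
    # Inference: an unadjusted Langevin (SGLD-style) sampler.
    # It only ever sees a gradient function, never the model itself.
    rng = np.random.default_rng(seed)
    theta = np.asarray(theta0, dtype=float)
    samples = np.empty((n_steps,) + theta.shape)
    for i in range(n_steps):
        theta = (theta
                 + 0.5 * step_size * grad_fn(theta)
                 + np.sqrt(step_size) * rng.standard_normal(theta.shape))
        samples[i] = theta
    return samples

samples = langevin_sample(grad_log_posterior, theta0=np.zeros(1))
print(samples[5_000:].std())  # roughly 1 for a standard Gaussian posterior
```

Swapping in a different model is then just passing a different gradient function, which is exactly the separation being described: you write down the model, and the inference machinery handles the rest.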

Speaker:

01:06:25,160 --> 01:06:28,820

And that's kind of the dream we have

as Bayesians, or as Bayesian software

Speaker:

01:06:28,820 --> 01:06:29,920

developers.

Speaker:

01:06:31,020 --> 01:06:36,700

And so with Posteriors, we're trying

to do something like this, to

Speaker:

01:06:36,700 --> 01:06:42,780

move towards this for big machine

learning models, and so bigger models,

Speaker:

01:06:42,780 --> 01:06:44,220

bigger data settings.

Speaker:

01:06:44,360 --> 01:06:46,420

So that's kind of the dream there.

Speaker:

01:06:46,420 --> 01:06:50,080

But then in machine learning, what does

machine learning have that's different from

Speaker:

01:06:50,080 --> 01:06:51,404

statistics in that setting?

Speaker:

01:06:51,404 --> 01:06:56,484

It's like, well, machine learning models

are less interesting than classical

Speaker:

01:06:56,484 --> 01:06:57,564

Bayesian models.

Speaker:

01:06:57,984 --> 01:07:01,384

The thing is they're more transferable,

right?

Speaker:

01:07:01,384 --> 01:07:06,064

It's just a neural network, which, in machine

learning, we believe will solve

Speaker:

01:07:06,064 --> 01:07:07,624

a whole suite of tasks.

Speaker:

01:07:07,624 --> 01:07:11,744

So perhaps in terms of the machine

learning setting, where we decouple

Speaker:

01:07:11,744 --> 01:07:15,464

modeling and inference and data, you kind

of want to remove the model one as well.

Speaker:

01:07:15,464 --> 01:07:18,572

You want to have these general purpose

foundational models, you could say.

Speaker:

01:07:18,572 --> 01:07:20,892

So really you want to let the user focus.

Speaker:

01:07:20,892 --> 01:07:22,632

And so we're handling the inference.

Speaker:

01:07:22,632 --> 01:07:23,582

We're also handling the model.

Speaker:

01:07:23,582 --> 01:07:27,972

So really let the user just give us the

data and say, okay, let's use this data and

Speaker:

01:07:27,972 --> 01:07:30,882

let's use this data to predict other

things and let the user handle that.

Speaker:

01:07:30,882 --> 01:07:35,312

So that's potentially a real

unlimited-time-and-resources answer.

Speaker:

01:07:35,312 --> 01:07:37,152

You'd need plenty of resources to do that.

Speaker:

01:07:37,152 --> 01:07:43,672

But yeah, that's Sam's May 2024 answer.

Speaker:

01:07:44,712 --> 01:07:45,792

Yeah.

Speaker:

01:07:46,172 --> 01:07:47,852

Yeah, that sounds...

Speaker:

01:07:47,852 --> 01:07:49,352

That sounds amazing.

Speaker:

01:07:49,352 --> 01:07:50,402

I agree with that.

Speaker:

01:07:50,402 --> 01:07:52,872

That's a fantastic goal.

Speaker:

01:07:53,532 --> 01:07:57,932

And yeah, also that reminds me, that's

also why I really love what you guys are

Speaker:

01:07:57,932 --> 01:08:06,452

doing with Posteriors, because, yeah,

now that we're starting to be

Speaker:

01:08:06,452 --> 01:08:13,032

able to get there, making Bayesian

inference really scalable to really big

Speaker:

01:08:13,032 --> 01:08:14,492

data and big models.

Speaker:

01:08:15,072 --> 01:08:17,356

I'm super enthusiastic about that.

Speaker:

01:08:17,356 --> 01:08:20,676

It would be just fantastic.

Speaker:

01:08:21,536 --> 01:08:26,056

So thank you so much for taking the time

to do that, guys.

Speaker:

01:08:26,056 --> 01:08:28,716

Yeah, we're doing it, we're gonna get

there.

Speaker:

01:08:28,716 --> 01:08:31,456

Yeah yeah yeah I love that.

Speaker:

01:08:31,756 --> 01:08:36,736

And second question: if you could have

dinner with any great scientific mind, dead,

Speaker:

01:08:36,736 --> 01:08:39,976

alive, or fictional, who would it be?

Speaker:

01:08:41,256 --> 01:08:44,236

Yeah, I was a bit intimidated by this

question.

Speaker:

01:08:44,236 --> 01:08:45,228

Yeah, you know, you ask everyone.

Speaker:

01:08:45,228 --> 01:08:46,388

Again, it's a great question.

Speaker:

01:08:46,388 --> 01:08:48,548

But then I thought about it for a little

bit.

Speaker:

01:08:48,548 --> 01:08:49,928

And it wasn't too hard for me.

Speaker:

01:08:49,928 --> 01:08:55,548

I think David MacKay is someone who,

yeah, I mean, has done amazing work.

Speaker:

01:08:55,548 --> 01:08:58,448

David MacKay was doing Bayesian neural

networks in 1992.

Speaker:

01:08:58,768 --> 01:09:02,968

And that's like, yeah, crazy, before

I was born.

Speaker:

01:09:03,068 --> 01:09:10,248

Anyway, Bayesian neural networks in 1992,

then I've just been going through his

Speaker:

01:09:10,248 --> 01:09:13,508

textbook. As I said, I love textbooks, so

going through his textbook on information

Speaker:

01:09:13,508 --> 01:09:14,316

theory and

Speaker:

01:09:14,316 --> 01:09:18,276

Bayesian statistics, he is, or was, a

Bayesian, information theory and

Speaker:

01:09:18,276 --> 01:09:18,476

statistics.

Speaker:

01:09:18,476 --> 01:09:21,076

And there's something he says

right at the start of the textbook, it's

Speaker:

01:09:21,076 --> 01:09:25,616

like, one of the themes of this book is

that data compression and data modeling

Speaker:

01:09:25,616 --> 01:09:26,626

are one and the same.

Speaker:

01:09:26,626 --> 01:09:27,936

And that's just really beautiful.

Speaker:

01:09:27,936 --> 01:09:32,916

And he talks about stream codes, in

a very information-theory-style setting,

Speaker:

01:09:32,916 --> 01:09:36,336

but it's just an autoregressive

prediction model, just like a language

Speaker:

01:09:36,336 --> 01:09:37,056

model.

Speaker:

01:09:37,056 --> 01:09:42,796

So it's just, someone with the ability to

Speaker:

01:09:43,404 --> 01:09:47,914

distill information, help that

unification, and be so ahead of their time.

Speaker:

01:09:47,914 --> 01:09:52,524

And then additionally, a sort of

groundbreaking book on sustainable energy.

Speaker:

01:09:52,524 --> 01:09:58,184

So also tackling one of the

greatest challenges we have at the moment.

Speaker:

01:09:58,244 --> 01:10:01,884

So yeah, the sustainable energy

book is really wonderful.

Speaker:

01:10:01,884 --> 01:10:04,064

It's one of my favorite books so far.

Speaker:

01:10:04,144 --> 01:10:04,584

Nice.

Speaker:

01:10:04,584 --> 01:10:06,844

Yeah, definitely put that in the show

notes.

Speaker:

01:10:06,844 --> 01:10:07,564

I think.

Speaker:

01:10:07,564 --> 01:10:08,604

Yes, definitely.

Speaker:

01:10:08,604 --> 01:10:08,934

Yeah.

Speaker:

01:10:08,934 --> 01:10:11,164

Yeah, I'd like to keep that one to read.

Speaker:

01:10:11,164 --> 01:10:11,436

So

Speaker:

01:10:11,436 --> 01:10:14,996

Yeah, please also put that in the show notes,

and that's going to be fantastic.

Speaker:

01:10:15,776 --> 01:10:16,236

Great.

Speaker:

01:10:16,236 --> 01:10:19,586

Well, I think we can call it a show.

Speaker:

01:10:19,586 --> 01:10:20,796

That was fantastic.

Speaker:

01:10:20,796 --> 01:10:22,976

Thank you so much, Sam.

Speaker:

01:10:24,516 --> 01:10:29,656

I learned so much and now I feel like I

have to go and read and learn about so

Speaker:

01:10:29,656 --> 01:10:30,996

many things.

Speaker:

01:10:30,996 --> 01:10:36,856

And I can definitely tell that you are

extremely passionate about what you're doing.

Speaker:

01:10:37,036 --> 01:10:39,356

So yeah, thank you so much for

Speaker:

01:10:39,404 --> 01:10:41,964

taking the time and being on this show.

Speaker:

01:10:42,644 --> 01:10:43,414

No, thank you very much.

Speaker:

01:10:43,414 --> 01:10:44,264

I had a lot of fun.

Speaker:

01:10:44,264 --> 01:10:44,724

Yeah.

Speaker:

01:10:44,724 --> 01:10:48,454

Thank you for, yeah, being a party to my

rantings.

Speaker:

01:10:48,454 --> 01:10:50,264

I need that sometimes.

Speaker:

01:10:50,904 --> 01:10:53,164

Yeah, that's what the show is about.

Speaker:

01:10:53,624 --> 01:10:58,504

My girlfriend is extremely, extremely

happy that I have this show to rant about

Speaker:

01:10:58,504 --> 01:11:01,084

Bayesian stats and any nerdy stuff.

Speaker:

01:11:02,664 --> 01:11:04,464

Yeah, it's so true, yeah.

Speaker:

01:11:04,844 --> 01:11:06,104

Well, Sam, you're welcome.

Speaker:

01:11:06,104 --> 01:11:08,944

Anytime you need to do some nerdy rant.

Speaker:

01:11:10,356 --> 01:11:10,796

thank you.

Speaker:

01:11:10,796 --> 01:11:12,156

I'm sure I'll be...

Speaker:

01:11:15,596 --> 01:11:19,296

This has been another episode of Learning

Bayesian Statistics.

Speaker:

01:11:19,296 --> 01:11:24,236

Be sure to rate, review, and follow the

show on your favorite podcatcher, and

Speaker:

01:11:24,236 --> 01:11:29,176

visit learnbayestats.com for more

resources about today's topics, as well as

Speaker:

01:11:29,176 --> 01:11:33,896

access to more episodes to help you reach

a true Bayesian state of mind.

Speaker:

01:11:33,896 --> 01:11:35,856

That's learnbayestats.com.

Speaker:

01:11:35,856 --> 01:11:38,756

Our theme music is Good Bayesian by Baba

Brinkman.

Speaker:

01:11:38,756 --> 01:11:40,676

Feat. MC Lars and Mega Ran.

Speaker:

01:11:40,676 --> 01:11:43,826

Check out his awesome work at

bababrinkman.com.

Speaker:

01:11:43,826 --> 01:11:45,004

I'm your host.

Speaker:

01:11:45,004 --> 01:11:45,994

Alex Andorra.

Speaker:

01:11:45,994 --> 01:11:50,224

You can follow me on Twitter at Alex

underscore Andorra, like the country.

Speaker:

01:11:50,224 --> 01:11:55,304

You can support the show and unlock

exclusive benefits by visiting Patreon

Speaker:

01:11:55,304 --> 01:11:57,504

.com slash LearnBayesStats.

Speaker:

01:11:57,504 --> 01:11:59,924

Thank you so much for listening and for

your support.

Speaker:

01:11:59,924 --> 01:12:02,164

You're truly a good Bayesian.

Speaker:

01:12:02,164 --> 01:12:05,684

Change your predictions after taking

information in.

Speaker:

01:12:05,684 --> 01:12:12,332

And if you're thinking of me less than

amazing, let's adjust those expectations.

Speaker:

01:12:12,332 --> 01:12:17,712

Let me show you how to be a good Bayesian

Change calculations after taking fresh

Speaker:

01:12:17,712 --> 01:12:23,732

data in Those predictions that your brain

is making Let's get them on a solid

Speaker:

01:12:23,732 --> 01:12:25,452

foundation

Previous post