Learning Bayesian Statistics

Proudly sponsored by PyMC Labs, the Bayesian Consultancy. Book a call, or get in touch!

Changing perspective is often a great way to solve burning research problems. Riemannian spaces are such a perspective change, as Arto Klami, an Associate Professor of computer science at the University of Helsinki and member of the Finnish Center for Artificial Intelligence, will tell us in this episode.

He explains the concept of Riemannian spaces, their application in inference algorithms, how they can help sampling Bayesian models, and their similarity with normalizing flows, which we discussed in episode 98.

Arto also introduces PreliZ, a tool for prior elicitation, and highlights its benefits in simplifying the process of setting priors, thus improving the accuracy of our models.

When Arto is not solving mathematical equations, you’ll find him cycling, or around a good board game.

Our theme music is « Good Bayesian », by Baba Brinkman (feat MC Lars and Mega Ran). Check out his awesome work at https://bababrinkman.com/ !

Thank you to my Patrons for making this episode possible!

Yusuke Saito, Avi Bryant, Ero Carrera, Giuliano Cruz, Tim Gasser, James Wade, Tradd Salvo, William Benton, James Ahloy, Robin Taylor, Chad Scherrer, Zwelithini Tunyiswa, Bertrand Wilden, James Thompson, Stephen Oates, Gian Luca Di Tanna, Jack Wells, Matthew Maldonado, Ian Costley, Ally Salim, Larry Gill, Ian Moran, Paul Oreto, Colin Caprani, Colin Carroll, Nathaniel Burbank, Michael Osthege, Rémi Louf, Clive Edelsten, Henri Wallen, Hugo Botha, Vinh Nguyen, Marcin Elantkowski, Adam C. Smith, Will Kurt, Andrew Moskowitz, Hector Munoz, Marco Gorelli, Simon Kessell, Bradley Rode, Patrick Kelley, Rick Anderson, Casper de Bruin, Philippe Labonde, Michael Hankin, Cameron Smith, Tomáš Frýda, Ryan Wesslen, Andreas Netti, Riley King, Yoshiyuki Hamajima, Sven De Maeyer, Michael DeCrescenzo, Fergal M, Mason Yahr, Naoya Kanai, Steven Rowland, Aubrey Clayton, Jeannine Sue, Omri Har Shemesh, Scott Anthony Robson, Robert Yolken, Or Duek, Pavel Dusek, Paul Cox, Andreas Kröpelin, Raphaël R, Nicolas Rode, Gabriel Stechschulte, Arkady, Kurt TeKolste, Gergely Juhasz, Marcus Nölke, Maggi Mackintosh, Grant Pezzolesi, Avram Aelony, Joshua Meehl, Javier Sabio, Kristian Higgins, Alex Jones, Gregorio Aguilar, Matt Rosinski, Bart Trudeau, Luis Fonseca, Dante Gates, Matt Niccolls, Maksim Kuznecov, Michael Thomas, Luke Gorrie, Cory Kiser and Julio.

Visit https://www.patreon.com/learnbayesstats to unlock exclusive Bayesian swag 😉

Takeaways:

– Riemannian spaces offer a way to improve computational efficiency and accuracy in Bayesian inference by considering the curvature of the posterior distribution.

– Riemannian spaces can be used in Laplace approximation and Markov chain Monte Carlo algorithms to better model the posterior distribution and explore challenging areas of the parameter space.

– Normalizing flows are a complementary approach to Riemannian spaces, using non-linear transformations to warp the parameter space and improve sampling efficiency.

– Evaluating the performance of Bayesian inference algorithms in challenging cases is a current research challenge, and more work is needed to establish benchmarks and compare different methods. 

– PreliZ is a package for prior elicitation in Bayesian modeling that facilitates communication with users through visualizations of predictive and parameter distributions.

– Careful prior specification is important, and tools like PreliZ make the process easier and more reproducible.

– Teaching Bayesian machine learning is challenging due to the combination of statistical and programming concepts, but it is possible to teach the basic reasoning behind Bayesian methods to a diverse group of students.

– The integration of Bayesian approaches in data science workflows is becoming more accepted, especially in industries that already use deep learning techniques.

– The future of Bayesian methods in AI research may involve the development of AI assistants for Bayesian modeling and probabilistic reasoning.

Chapters:

00:00 Introduction and Background

02:05 Arto’s Work and Background

06:05 Introduction to Bayesian Inference

12:46 Riemannian Spaces in Bayesian Inference

27:24 Availability of Riemannian-based Algorithms

30:20 Practical Applications and Evaluation

37:33 Introduction to PreliZ

38:03 Prior Elicitation

39:01 Predictive Elicitation Techniques

39:30 PreliZ: Interface with Users

40:27 PreliZ: General Purpose Tool

41:55 Getting Started with PreliZ

42:45 Challenges of Setting Priors

45:10 Reproducibility and Transparency in Priors

46:07 Integration of Bayesian Approaches in Data Science Workflows

55:11 Teaching Bayesian Machine Learning

01:06:13 The Future of Bayesian Methods with AI Research

01:10:16 Solving the Prior Elicitation Problem

Links from the show:

Transcript

This is an automatic transcript and may therefore contain errors. Please get in touch if you’re willing to correct them.

Speaker:

Let me show you how to be a good b...

2

00:00:28,398 --> 00:00:33,258

how they can help sampling Bayesian models

and their similarity with normalizing

3

00:00:33,258 --> 00:00:36,978

flows that we discussed in episode 98.

4

00:00:37,278 --> 00:00:41,898

Arto also introduces PreliZ, a tool for

prior elicitation, and highlights its

5

00:00:41,898 --> 00:00:46,208

benefits in simplifying the process of

setting priors, thus improving the

6

00:00:46,208 --> 00:00:47,778

accuracy of our models.

7

00:00:47,778 --> 00:00:52,108

When Arto is not solving mathematical

equations, you'll find him cycling or

8

00:00:52,108 --> 00:00:54,018

around a good board game.

9

00:00:54,018 --> 00:00:57,198

This is Learning Bayesian Statistics,

episode 103.

10

00:00:57,198 --> 00:01:00,418

Recorded February 15, 2024.

11

00:01:15,898 --> 00:01:22,108

Welcome to Learning Bayesian Statistics, a

podcast about Bayesian inference, the

12

00:01:22,108 --> 00:01:25,558

methods, the projects, and the people who

make it possible.

13

00:01:25,558 --> 00:01:26,862

I'm your host.

14

00:01:26,862 --> 00:01:32,462

You can follow me on Twitter at Alex

underscore and Dora like the country for

15

00:01:32,462 --> 00:01:34,142

any info about the show.

16

00:01:34,142 --> 00:01:36,582

LearnBayesStats.com is Laplace to be.

17

00:01:36,582 --> 00:01:41,302

Show notes, becoming a corporate sponsor,

unlocking Bayesian Merch, supporting the

18

00:01:41,302 --> 00:01:42,422

show on Patreon.

19

00:01:42,422 --> 00:01:44,042

Everything is in there.

20

00:01:44,042 --> 00:01:45,782

That's LearnBayesStats.com.

21

00:01:45,782 --> 00:01:50,222

If you're interested in one-on-one

mentorship, online courses, or statistical

22

00:01:50,222 --> 00:01:55,432

consulting, feel free to reach out and

book a call at topmate.io slash alex

23

00:01:55,432 --> 00:01:56,174

underscore

24

00:01:56,174 --> 00:02:01,158

andorra. See you around, folks, and best

wishes to you all.

25

00:02:05,262 --> 00:02:08,562

Klami, welcome to Learning Bayesian

Statistics.

26

00:02:09,342 --> 00:02:10,462

Thank you.

27

00:02:10,902 --> 00:02:11,392

You're welcome.

28

00:02:11,392 --> 00:02:13,742

How was my Finnish pronunciation?

29

00:02:13,962 --> 00:02:16,078

Oh, I think that was excellent.

30

00:02:16,078 --> 00:02:22,958

For people who don't have the video, I

don't think that was true.

31

00:02:24,218 --> 00:02:26,498

So thanks a lot for taking the time,

Arto.

32

00:02:26,498 --> 00:02:28,998

I'm really happy to have you on the show.

33

00:02:29,518 --> 00:02:34,428

And I've had a lot of questions for you

for a long time, and the longer we

34

00:02:34,428 --> 00:02:37,738

postpone the episode, the more questions.

35

00:02:38,198 --> 00:02:42,158

So I'm gonna do my best to not take three

hours of your time.

36

00:02:42,498 --> 00:02:44,774

And let's start by...

37

00:02:44,846 --> 00:02:50,716

maybe defining the work you're doing

nowadays and well, how do you end up

38

00:02:50,716 --> 00:02:52,026

working on this?

39

00:02:52,666 --> 00:02:54,246

Yes, sure.

40

00:02:54,846 --> 00:02:59,246

So I personally identify as a machine

learning researcher.

41

00:02:59,246 --> 00:03:04,606

So I do machine learning research, but

very much from a Bayesian perspective.

42

00:03:05,106 --> 00:03:08,246

So my original background is in computer

science.

43

00:03:08,246 --> 00:03:13,236

I'm essentially a self-educated

statistician in the sense that I've never

44

00:03:13,236 --> 00:03:14,310

really

45

00:03:14,914 --> 00:03:20,744

kind of studied properly statistics

design, well except for a few courses here

46

00:03:20,744 --> 00:03:21,494

and there.

47

00:03:21,494 --> 00:03:27,744

But I've been building models, algorithms,

building on the Bayesian principles for

48

00:03:27,744 --> 00:03:30,654

addressing various kinds of machine

learning problems.

49

00:03:32,294 --> 00:03:40,014

So you're basically like a self-taught

statistician through learning, let's say.

50

00:03:40,094 --> 00:03:41,294

More or less, yes.

51

00:03:41,294 --> 00:03:44,094

I think the first things I started doing,

52

00:03:44,366 --> 00:03:49,016

with anything that had to do with Bayesian

statistics was pretty much already going

53

00:03:49,016 --> 00:03:55,146

to the deep end and trying to learn

posterior inference for fairly complicated

54

00:03:55,146 --> 00:04:00,266

models, even actually non-parametric

models in some ways.

55

00:04:00,486 --> 00:04:05,046

Yeah, we're going to dive a bit on that.

56

00:04:05,206 --> 00:04:12,656

Before that, can you tell us the topics

you are particularly focusing on through

57

00:04:12,656 --> 00:04:13,048

that

58

00:04:13,048 --> 00:04:15,558

umbrella of topics you've named.

59

00:04:15,558 --> 00:04:17,238

Yes, absolutely.

60

00:04:17,238 --> 00:04:23,038

So I think I actually have a few somewhat

distinct areas of interest.

61

00:04:23,038 --> 00:04:27,578

So on one hand, I'm working really on the

kind of core inference problem.

62

00:04:27,578 --> 00:04:33,528

So how do we computationally efficiently,

accurately enough approximate the

63

00:04:33,528 --> 00:04:35,138

posterior distributions?

64

00:04:36,198 --> 00:04:40,698

Recently, we've been especially working on

inference algorithms that build on

65

00:04:40,698 --> 00:04:42,926

concepts from Riemannian geometry.

66

00:04:42,926 --> 00:04:48,256

So we're trying to really kind of account for

the actual manifold induced by this

67

00:04:48,256 --> 00:04:53,616

posterior distribution and try to somehow

utilize these concepts to kind of speed up

68

00:04:53,616 --> 00:04:54,686

inference.

69

00:04:54,966 --> 00:04:58,226

So that's kind of one very technical

aspect.

70

00:04:58,226 --> 00:05:04,206

Then there's the other main theme on the

kind of Bayesian side is on priors.

71

00:05:04,206 --> 00:05:06,406

So we'll be working on prior elicitation.

72

00:05:06,406 --> 00:05:11,726

So how do we actually go about specifying

the prior distributions?

73

00:05:11,726 --> 00:05:14,106

and ideally maybe not even specifying.

74

00:05:14,106 --> 00:05:19,406

So how would we extract that knowledge

from a domain expert who doesn't

75

00:05:19,406 --> 00:05:23,126

necessarily even have any sort of

statistical training?

76

00:05:23,126 --> 00:05:28,266

And how do we flexibly represent their

true beliefs and then encode them as part

77

00:05:28,266 --> 00:05:29,326

of a model?

78

00:05:29,326 --> 00:05:35,486

That's maybe the main kind of technical

aspects there.

79

00:05:35,486 --> 00:05:35,726

Yeah.

80

00:05:35,726 --> 00:05:36,478

Yeah.

81

00:05:36,942 --> 00:05:38,022

No, super fun.

82

00:05:38,022 --> 00:05:43,302

And we're definitely going to dive into

those two aspects a bit later in the show.

83

00:05:43,302 --> 00:05:45,382

I'm really interested in that.

84

00:05:46,462 --> 00:05:51,222

Before that, do you remember how you first

got introduced to Bayesian inference,

85

00:05:51,222 --> 00:05:54,602

actually, and also why it sticks with you?

86

00:05:54,742 --> 00:05:59,462

Yeah, like I said, I'm in some sense

self-trained.

87

00:05:59,702 --> 00:06:04,282

I mean, coming with the computer science

background, we just, more or less,

88

00:06:04,282 --> 00:06:05,902

sometime during my PhD,

89

00:06:05,902 --> 00:06:10,582

I was working in a research group that was

led by Samuel Kaski.

90

00:06:11,162 --> 00:06:18,172

When I joined the group, we were working

on neural networks of the kind that people

91

00:06:18,172 --> 00:06:19,002

were interested in.

92

00:06:19,002 --> 00:06:20,442

That was like 20 years ago.

93

00:06:20,442 --> 00:06:24,162

So we were working on things like

self-organizing maps and these kind of

94

00:06:24,162 --> 00:06:25,042

methods.

95

00:06:25,622 --> 00:06:30,452

And then we started working on

applications where we really bumped into

96

00:06:30,452 --> 00:06:32,902

the kind of small sample size problems.

97

00:06:32,902 --> 00:06:34,828

So looking at...

98

00:06:34,954 --> 00:06:40,154

DNA microarray data that was kind of tens

of thousands of dimensions and medical

99

00:06:40,154 --> 00:06:42,674

applications with 20 samples.

100

00:06:42,794 --> 00:06:47,494

So we essentially figured out that we're

gonna need to take the kind of uncertainty

101

00:06:47,494 --> 00:06:48,994

into account properly.

102

00:06:48,994 --> 00:06:53,864

Started working on the Bayesian modeling

side of these and one of the very first

103

00:06:53,864 --> 00:07:00,134

things I was doing is kind of trying to

create Bayesian versions of some of these

104

00:07:00,134 --> 00:07:02,796

classical analysis methods that were

105

00:07:02,796 --> 00:07:04,986

especially canonical correlation analysis.

106

00:07:04,986 --> 00:07:09,196

The original derivation is like an

information theoretic formulation.

107

00:07:09,196 --> 00:07:14,826

So I kind of dive directly into this that

let's do Bayesian versions of models.

108

00:07:16,226 --> 00:07:22,226

But I actually do remember that around the

same time I also took a course, a course

109

00:07:22,226 --> 00:07:24,066

by Aki Vehtari.

110

00:07:24,206 --> 00:07:26,306

He's an author of this Gelman et al.

111

00:07:26,306 --> 00:07:27,376

book, one of the authors.

112

00:07:27,376 --> 00:07:30,702

I think the first version of the book had

been released.

113

00:07:30,926 --> 00:07:32,406

just before that.

114

00:07:32,406 --> 00:07:36,006

So Aki was giving a course where he was

teaching based on that book.

115

00:07:36,226 --> 00:07:40,916

And I think that's the kind of first real

official contact on trying to understand

116

00:07:40,916 --> 00:07:45,066

the actual details behind the principles.

117

00:07:46,866 --> 00:07:52,766

Yeah, and actually I'm pretty sure

listeners are familiar with Aki.

118

00:07:52,806 --> 00:07:59,346

He's been on the show already, so I'll

link to the episode, of course, where Aki

119

00:07:59,366 --> 00:08:00,622

was.

120

00:08:00,622 --> 00:08:02,582

And yeah, for sure.

121

00:08:02,582 --> 00:08:08,162

I also recommend going through these

episodes, show notes for people who are

122

00:08:08,162 --> 00:08:13,622

interested in, well, starting learning

about Bayesian stats and things like that.

123

00:08:14,102 --> 00:08:23,542

Something I'm wondering from what you just

explained is, so you define yourself as a

124

00:08:23,542 --> 00:08:25,222

machine learning researcher, right?

125

00:08:25,222 --> 00:08:27,790

And you work in artificial intelligence

too.

126

00:08:27,790 --> 00:08:31,070

But there is this interaction with the

Bayesian framework.

127

00:08:31,070 --> 00:08:36,160

How does that framework underpin your

research in statistical machine learning

128

00:08:36,160 --> 00:08:37,670

and artificial intelligence?

129

00:08:37,670 --> 00:08:39,830

How does that all combine?

130

00:08:40,990 --> 00:08:42,430

Yeah.

131

00:08:42,810 --> 00:08:45,050

Well, that's a broad topic.

132

00:08:45,310 --> 00:08:48,110

There's of course a lot in that

intersection.

133

00:08:49,270 --> 00:08:56,942

I personally do view all learning problems

in some sense from a Bayesian perspective.

134

00:08:56,942 --> 00:09:02,462

I mean, no matter what kind of a, whether

it's a very simple fitting a linear

135

00:09:02,462 --> 00:09:07,492

regression type of a problem or whether

it's figuring out the parameters of a

136

00:09:07,492 --> 00:09:12,492

neural network with 1 billion parameters,

it's ultimately still a statistical

137

00:09:12,492 --> 00:09:13,902

inference problem.

138

00:09:14,642 --> 00:09:20,872

I mean, most of the cases, I'm quite

confident that we can't figure out the

139

00:09:20,872 --> 00:09:21,902

parameters exactly.

140

00:09:21,902 --> 00:09:25,202

We need to somehow quantify for the

uncertainty.

141

00:09:25,518 --> 00:09:29,718

I'm not really aware of any other kind of

principled way of doing it.

142

00:09:29,718 --> 00:09:33,958

So I would just kind of think about it

that we're always doing Bayesian inference

143

00:09:33,958 --> 00:09:35,258

in some sense.

144

00:09:35,258 --> 00:09:38,998

But then there's the issue of how far can

we go in practice?

145

00:09:38,998 --> 00:09:40,818

So it's going to be approximate.

146

00:09:40,998 --> 00:09:44,138

It's possibly going to be very crude

approximations.

147

00:09:44,138 --> 00:09:50,308

But I would still view it through the lens

of Bayesian statistics in my own work.

148

00:09:50,308 --> 00:09:54,222

And that's what I do when I teach for my

BSc students, for example.

149

00:09:54,222 --> 00:09:59,272

I mean not all of them explicitly

formulate the learning algorithms kind of

150

00:09:59,272 --> 00:10:03,112

from these perspectives but we are still

kind of talking about that what's the

151

00:10:03,112 --> 00:10:07,392

relationship what can we assume about the

algorithms what can we assume about the

152

00:10:07,392 --> 00:10:13,182

result and how would it relate to like

like properly estimating everything

153

00:10:13,182 --> 00:10:16,682

through kind of exactly how it should be

done.

154

00:10:17,442 --> 00:10:21,302

Yeah okay that's an interesting

perspective yeah so basically putting that

155

00:10:21,302 --> 00:10:23,406

in a in that framework.

156

00:10:23,406 --> 00:10:32,766

And that means, I mean, that makes me

think then, how does that, how do you

157

00:10:32,766 --> 00:10:40,976

believe, what do you believe, sorry, the

impact of Bayesian machine learning is on

158

00:10:40,976 --> 00:10:42,286

the broader field of AI?

159

00:10:42,286 --> 00:10:45,446

What does that bring to that field?

160

00:10:46,926 --> 00:10:51,662

It's a, let's say it has a big effect.

161

00:10:51,662 --> 00:10:57,002

It has a very big impact in a sense that

pretty much most of the stuff that is

162

00:10:57,002 --> 00:11:01,272

happening on the machine learning front

and hence also on the kind of all learning

163

00:11:01,272 --> 00:11:03,142

based AI solutions.

164

00:11:03,142 --> 00:11:07,122

It is ultimately, I think a lot of people

are thinking about roughly in the same way

165

00:11:07,122 --> 00:11:11,832

as I am, that there is an underlying

learning problem that we would ideally

166

00:11:11,832 --> 00:11:16,542

want to solve more or less following

exactly the Bayesian principles.

167

00:11:17,294 --> 00:11:19,994

don't necessarily talk about it from this

perspective.

168

00:11:19,994 --> 00:11:26,584

So you might be happy to write algorithms,

all the justification on the choices you

169

00:11:26,584 --> 00:11:28,554

make comes from somewhere else.

170

00:11:28,554 --> 00:11:33,494

But I think a lot of people are kind of

accepting that it's the kind of

171

00:11:33,494 --> 00:11:35,534

probabilistic basis of these.

172

00:11:35,674 --> 00:11:42,014

So for instance, I think if you think

about the objectives that people are

173

00:11:42,014 --> 00:11:46,542

optimizing in deep learning, they're all

essentially likelihoods of some

174

00:11:46,542 --> 00:11:48,722

assumed probabilistic model.

175

00:11:48,942 --> 00:11:54,632

Most of the regularizers they are

considering do have an interpretation of

176

00:11:54,632 --> 00:11:56,622

some kind of a prior distribution.

177

00:11:57,542 --> 00:12:02,522

I think a lot of people are all the time

going deeper and deeper into actually

178

00:12:02,522 --> 00:12:04,882

explicitly thinking about it from these

perspectives.

179

00:12:04,882 --> 00:12:11,242

So we have a lot of these deep learning

type of approaches, various autoencoders,

180

00:12:11,242 --> 00:12:16,198

Bayesian neural networks, various kinds of

generative AI models that are

181

00:12:16,430 --> 00:12:19,990

They are actually even explicitly

formulated as probabilistic models and

182

00:12:19,990 --> 00:12:22,570

some sort of an approximate inference

scheme.

183

00:12:22,690 --> 00:12:27,290

So I think the kind of these things are,

they are the same two sides of the same

184

00:12:27,290 --> 00:12:27,660

coin.

185

00:12:27,660 --> 00:12:31,750

People are kind of more and more thinking

about them from the same perspective.

186

00:12:32,910 --> 00:12:34,870

Okay, yeah, that's super interesting.

187

00:12:35,510 --> 00:12:42,350

Actually, let's start diving into these

topics from a more technical perspective.

188

00:12:43,150 --> 00:12:45,634

So you've mentioned the

189

00:12:46,130 --> 00:12:52,230

research and advances you are working on

regarding Riemannian spaces.

190

00:12:52,230 --> 00:12:58,050

So I think it'd be super fun to talk about

that because we've never really talked

191

00:12:58,050 --> 00:12:59,490

about it on the show.

192

00:13:00,010 --> 00:13:07,050

So maybe can you give listeners a primer

on what a Riemannian space is?

193

00:13:07,190 --> 00:13:09,170

Why would you even care about that?

194

00:13:09,170 --> 00:13:15,162

And what you are doing in this regard,

what your research is in this regard.

195

00:13:15,598 --> 00:13:17,298

Yes, let's try.

196

00:13:17,298 --> 00:13:20,838

I mean, this is a bit of a mathematical

concept to talk about.

197

00:13:20,838 --> 00:13:26,148

But I mean, ultimately, if you think about

most of the learning algorithms, so we are

198

00:13:26,148 --> 00:13:30,798

kind of thinking that there are some

parameters that live in some space.

199

00:13:30,798 --> 00:13:34,308

So we essentially, without thinking about

it, that we just assume that it's a

200

00:13:34,308 --> 00:13:40,638

Euclidean space in a sense that we can

measure distances between two parameters,

201

00:13:40,638 --> 00:13:42,630

that how similar they are.

202

00:13:42,638 --> 00:13:46,568

It doesn't matter which direction we go,

if the distance is the same, we think that

203

00:13:46,568 --> 00:13:49,038

they are kind of equally far away.

204

00:13:49,118 --> 00:13:55,498

So now a Riemannian geometry is one that

is kind of curved in some sense.

205

00:13:55,498 --> 00:14:00,758

So we may be stretching the space in

certain ways and we'll be doing this

206

00:14:00,758 --> 00:14:02,398

stretching locally.

207

00:14:02,478 --> 00:14:07,498

So what it actually means, for example, is

that the shortest path between two

208

00:14:07,498 --> 00:14:08,206

possible

209

00:14:08,206 --> 00:14:12,236

values, maybe for example two parameter

configurations, that if you start

210

00:14:12,236 --> 00:14:17,216

interpolating between two possible values

for a parameter, it's going to be a

211

00:14:17,216 --> 00:14:22,976

shortest path in this Riemannian geometry,

which is not necessarily a straight line

212

00:14:22,976 --> 00:14:26,086

in an underlying Euclidean space.

213

00:14:26,206 --> 00:14:30,146

So that's what the Riemannian geometry is

in general.

214

00:14:30,146 --> 00:14:35,286

So it's kind of the tools and machinery we

need to work with these kind of settings.
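
[Editor's note: to make the "stretched space" idea above concrete, here is a minimal NumPy sketch, an illustration rather than code from the episode. It measures the length of a discretized path under a position-dependent metric tensor G(x); with G set to the identity this reduces to ordinary Euclidean length. The particular toy metric is made up for illustration.]

```python
import numpy as np

def metric(x):
    # Toy position-dependent metric that "stretches" space
    # more and more the further we get from the origin.
    return np.eye(2) * (1.0 + np.dot(x, x))

def curve_length(points, G=metric):
    # Approximate the Riemannian length of a discretized curve by
    # summing sqrt(dx^T G(x) dx) over segments, with G evaluated
    # at each segment midpoint.
    total = 0.0
    for a, b in zip(points[:-1], points[1:]):
        dx = b - a
        mid = 0.5 * (a + b)
        total += np.sqrt(dx @ G(mid) @ dx)
    return total

# A straight segment from (0, 0) to (1, 0), discretized into 100 points.
pts = np.linspace([0.0, 0.0], [1.0, 0.0], 100)
# Under the identity metric its length is 1; under the stretched
# metric the same path is longer, so the shortest Riemannian path
# between two points need not be the Euclidean straight line.
```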

215

00:14:35,470 --> 00:14:41,990

And now then the relationship to

statistical inference comes from trying to

216

00:14:41,990 --> 00:14:46,470

define such a Riemannian space that it has

somehow nice characteristics.

217

00:14:46,470 --> 00:14:52,590

So maybe the concept that most of the

people actually might be aware of would be

218

00:14:52,590 --> 00:15:00,420

the Fisher information matrix that kind of

characterizes the kind of the curvature

219

00:15:00,420 --> 00:15:03,950

induced by a particular probabilistic

model.
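
[Editor's note: the Fisher information mentioned here can be illustrated with a small Monte Carlo check. This is a hypothetical toy example, not from the episode: for a Bernoulli(theta) model, the expected squared score equals 1/(theta(1-theta)).]

```python
import numpy as np

def fisher_bernoulli(theta, n=200_000, seed=0):
    # Monte Carlo estimate of the Fisher information
    #   I(theta) = E[(d/dtheta log p(x | theta))^2]
    # which for Bernoulli(theta) is 1 / (theta * (1 - theta)).
    rng = np.random.default_rng(seed)
    x = rng.random(n) < theta                # Bernoulli(theta) draws
    score = x / theta - (~x) / (1 - theta)   # score of each draw
    return np.mean(score ** 2)

# fisher_bernoulli(0.3) should be close to 1 / (0.3 * 0.7)
```

In higher dimensions this becomes a matrix, which is what gives a probabilistic model a natural, curved geometry over its parameter space.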

220

00:15:03,950 --> 00:15:08,710

So these tools kind of then allow, for

example, a very recent thing that we did,

221

00:15:08,710 --> 00:15:14,800

it's going to come out later this spring

in AI stats, is an extension of the

222

00:15:14,800 --> 00:15:19,030

Laplace approximation in a Riemannian

geometry.

223

00:15:19,030 --> 00:15:22,980

So those of you who know what the Laplace

approximation is, it's essentially just

224

00:15:22,980 --> 00:15:26,690

fitting a normal distribution at the mode

of a distribution.

225

00:15:26,770 --> 00:15:30,920

But if we now fit the same normal

distribution in a suitably chosen

226

00:15:30,920 --> 00:15:32,462

Riemannian space,

227

00:15:32,462 --> 00:15:38,062

we can actually model also the kind of

curvature of the posterior mode and even

228

00:15:38,062 --> 00:15:39,132

kind of how it stretches.

229

00:15:39,132 --> 00:15:41,882

So we get a more flexible approximation.

230

00:15:42,102 --> 00:15:44,232

We are still fitting a normal

distribution.

231

00:15:44,232 --> 00:15:46,702

We're just doing it in a different space.
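
[Editor's note: for readers unfamiliar with the standard Euclidean Laplace approximation described above, here is a minimal sketch on a toy Gamma(3, 1) "posterior": find the mode, measure the curvature there, and fit a Normal. The Riemannian extension from Arto's AISTATS paper is not shown.]

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_post(x):
    # Negative log density of Gamma(3, 1), up to an additive constant.
    return np.sum(x - 2.0 * np.log(x))

# 1) Find the posterior mode (analytically it is x = 2).
res = minimize(neg_log_post, x0=np.array([1.0]),
               method="L-BFGS-B", bounds=[(1e-6, None)])
mode = float(res.x[0])

# 2) Curvature at the mode via a finite-difference second derivative
#    (in d dimensions this is the Hessian of the neg. log posterior).
h = 1e-4
curv = (neg_log_post(mode + h) - 2 * neg_log_post(mode)
        + neg_log_post(mode - h)) / h**2

# 3) The Laplace approximation is Normal(mode, 1 / curv); here the
#    curvature at the mode is 2 / mode**2 = 0.5, so the variance is 2.
```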

232

00:15:48,302 --> 00:15:52,542

Not sure how easy that was to follow, but

at least maybe it gives some sort of an

233

00:15:52,542 --> 00:15:53,382

idea.

234

00:15:53,662 --> 00:15:55,042

Yeah, yeah, yeah.

235

00:15:55,042 --> 00:16:01,614

That was actually, I think, a pretty

approachable.

236

00:16:01,614 --> 00:16:11,534

introduction and so if I understood

correctly then you're gonna use these

237

00:16:11,534 --> 00:16:18,784

Riemannian approximations to come up with

better algorithms is that what you do and

238

00:16:18,784 --> 00:16:25,554

why you focus on Riemannian spaces and yeah

if you can if you can introduce that and

239

00:16:25,554 --> 00:16:29,742

tell us basically why that is interesting

to then look

240

00:16:29,742 --> 00:16:36,832

at geometry from these different ways

instead of the classical Euclidean way of

241

00:16:36,832 --> 00:16:38,102

thinking about geometry.

242

00:16:38,562 --> 00:16:41,682

Yeah, I think that's exactly what it is

about.

243

00:16:41,682 --> 00:16:45,322

So one other thing, maybe another

perspective of thinking about it is that

244

00:16:45,322 --> 00:16:50,042

we've also been doing Markov chain Monte

Carlo algorithms, so MCMC in these

245

00:16:50,042 --> 00:16:51,402

Riemannian spaces.

246

00:16:51,402 --> 00:16:56,432

And what we can achieve with those is that

if you have, let's say, a posterior

247

00:16:56,432 --> 00:16:57,464

distribution,

248

00:16:57,464 --> 00:17:03,724

that has some sort of a narrow funnel,

some very narrow area that extends far

249

00:17:03,724 --> 00:17:06,694

away in one corner of your parameter

space.

250

00:17:06,694 --> 00:17:10,304

It's actually very difficult to get there

with something like standard Hamiltonian

251

00:17:10,304 --> 00:17:15,084

Monte Carlo, but with the Riemannian

methods we can kind of make these narrow

252

00:17:15,084 --> 00:17:19,594

funnels equally easy compared to the

flatter areas.
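
[Editor's note: the "narrow funnel" described here is typified by Neal's funnel. Below is a small sketch of its log density, together with the non-centered reparameterization commonly used as a Euclidean workaround; it is an illustration, not the Riemannian method discussed in the episode.]

```python
import numpy as np

def funnel_logp(v, x):
    # Neal's funnel: v ~ Normal(0, 3), x_i | v ~ Normal(0, exp(v / 2)).
    # The "neck" at very negative v is the narrow region a plain HMC
    # sampler with a fixed step size struggles to enter.
    logp_v = -0.5 * (v / 3.0) ** 2
    logp_x = np.sum(-0.5 * (x / np.exp(v / 2.0)) ** 2 - v / 2.0)
    return logp_v + logp_x

# Non-centered reparameterization: sample z ~ Normal(0, 1) and set
# x = exp(v / 2) * z, which flattens the geometry globally, a rough
# analogue of what a position-dependent Riemannian metric does locally.
rng = np.random.default_rng(0)
v = 3.0 * rng.standard_normal()
z = rng.standard_normal(5)
x = np.exp(v / 2.0) * z
```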

253

00:17:19,874 --> 00:17:23,744

Now of course this may sound like a magic

bullet that we should be doing all

254

00:17:23,744 --> 00:17:24,914

inference with these techniques.

255

00:17:24,914 --> 00:17:26,958

Of course it does come with

256

00:17:26,958 --> 00:17:28,778

certain computational challenges.

257

00:17:28,778 --> 00:17:33,308

So we do need to be, like I said, the

shortest paths are no longer straight

258

00:17:33,308 --> 00:17:33,638

lines.

259

00:17:33,638 --> 00:17:38,278

So we need numerical integration to follow

the geodesic paths in these metrics and so

260

00:17:38,278 --> 00:17:38,598

on.

261

00:17:38,598 --> 00:17:40,408

So it's a bit of a compromise, of course.

262

00:17:40,408 --> 00:17:42,918

So they have very nice theoretical

properties.

263

00:17:43,038 --> 00:17:46,928

We've been able to get them working also

in practice in many cases so that they are

264

00:17:46,928 --> 00:17:50,438

kind of comparable with the current state

of the art.

265

00:17:50,438 --> 00:17:52,578

But it's not always easy.

266

00:17:53,538 --> 00:17:55,138

Yeah, there is no free lunch.

267

00:17:55,138 --> 00:17:56,098

Yes.

268

00:17:56,218 --> 00:17:56,548

Yeah.

269

00:17:56,548 --> 00:17:57,200

Yeah.

270

00:17:57,240 --> 00:18:04,570

Do you have any resources about these?

271

00:18:05,030 --> 00:18:12,090

Well, first the concepts of Riemannian

spaces and then the algorithms that you

272

00:18:12,090 --> 00:18:17,520

folks derived in your group using these

Riemannian spaces for people who are

273

00:18:17,520 --> 00:18:18,410

interested?

274

00:18:19,350 --> 00:18:25,216

Yeah, I think I wouldn't know, let's say a

very particular

275

00:18:25,216 --> 00:18:28,686

resource I would recommend on Riemannian

geometry.

276

00:18:28,686 --> 00:18:33,166

It is actually a rather, let's say,

mathematically involved topic.

277

00:18:33,806 --> 00:18:37,946

But regarding the specific methods, I

think they are...

278

00:18:37,946 --> 00:18:42,066

It's a couple of my recent papers, so we

have this Laplace approximation is coming

279

00:18:42,066 --> 00:18:44,586

out in AI stats this year.

280

00:18:45,006 --> 00:18:51,086

The MCMC sampler we had, I think, two

years ago in AI stats, similarly, the

281

00:18:51,086 --> 00:18:53,742

first MCMC method building on these and

then...

282

00:18:53,742 --> 00:18:57,742

last year one paper in Transactions on

Machine Learning Research.

283

00:18:58,782 --> 00:19:01,942

I think they are more or less accessible.

284

00:19:03,202 --> 00:19:09,102

Let's definitely link to those papers if

you can in the show notes because I'm

285

00:19:09,102 --> 00:19:14,062

personally curious about it but also I

think listeners will be.

286

00:19:14,782 --> 00:19:20,732

It sounds from what you're saying that

this idea of doing algorithms in this

287

00:19:20,732 --> 00:19:22,730

Riemannian space is

288

00:19:22,730 --> 00:19:24,150

somewhat recent.

289

00:19:24,850 --> 00:19:26,130

Am I right?

290

00:19:26,130 --> 00:19:28,450

And why would it appear now?

291

00:19:28,450 --> 00:19:30,710

Why would it become interesting now?

292

00:19:31,110 --> 00:19:33,320

Well, it's not actually that recent.

293

00:19:33,320 --> 00:19:39,790

I think the basic principle goes back, I

don't know, maybe 20 years or so.

294

00:19:40,950 --> 00:19:46,412

I think the main reason why we've been

working on this right now is that

295

00:19:46,412 --> 00:19:50,002

We've been able to resolve some of the

computational challenges.

296

00:19:50,002 --> 00:19:54,982

So the fundamental problem with these

models is always this numeric integration

297

00:19:54,982 --> 00:19:59,142

of following the shortest paths depending

on an algorithm we needed for different

298

00:19:59,142 --> 00:20:04,522

reasons, but we always needed to do it,

which usually requires operations like

299

00:20:04,522 --> 00:20:10,112

inversion of a metric tensor, which has

the kind of a dimensionality of the

300

00:20:10,112 --> 00:20:11,562

parameter space.

301

00:20:11,802 --> 00:20:15,346

So we came up with a particular metric

302

00:20:15,470 --> 00:20:20,030

that happens to have a computationally

efficient inverse.

303

00:20:20,110 --> 00:20:24,850

So there are these kinds of concrete

algorithmic techniques that are kind of

304

00:20:24,850 --> 00:20:32,640

bringing the computational cost to the

level so that it's no longer notably more

305

00:20:32,640 --> 00:20:35,840

expensive than doing kind of standard

Euclidean methods.

306

00:20:35,840 --> 00:20:39,410

So we can, for example, scale them for

Bayesian neural networks.

307

00:20:39,410 --> 00:20:41,838

That's one of the application cases we are

looking at.

308

00:20:41,838 --> 00:20:47,058

We really have very high

-dimensional problems but are still able to do

309

00:20:47,058 --> 00:20:51,378

some of these Riemannian techniques or

approximations of them.
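
The computational trick described here, a metric whose inverse is cheap, can be sketched with a rank-one metric G = I + g g^T, whose inverse has a closed form via the Sherman-Morrison identity and costs O(d) instead of the O(d^3) of a general inversion. The rank-one form below is only an illustrative stand-in, not the exact metric from the papers.

```python
import numpy as np

def metric_inverse(grad, alpha=1.0):
    """Apply the inverse of G = I + alpha * g g^T via Sherman-Morrison.

    G^{-1} = I - (alpha / (1 + alpha * g^T g)) * g g^T, so applying it to
    a vector costs O(d) instead of the O(d^3) of a dense solve.
    """
    g = np.asarray(grad, dtype=float)
    denom = 1.0 + alpha * (g @ g)
    return lambda v: v - (alpha / denom) * g * (g @ v)

d = 5
rng = np.random.default_rng(0)
g = rng.normal(size=d)
v = rng.normal(size=d)

fast = metric_inverse(g)(v)                            # closed form, O(d)
slow = np.linalg.solve(np.eye(d) + np.outer(g, g), v)  # dense solve, O(d^3)
print(np.allclose(fast, slow))                         # True
```

The same pattern extends to diagonal-plus-low-rank metrics via the Woodbury identity, which is the general route to making Riemannian updates affordable in high dimensions.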

310

00:20:52,298 --> 00:20:55,238

That was going to be my next question.

311

00:20:55,238 --> 00:21:00,958

In which cases are these approximations

interesting?

312

00:21:00,958 --> 00:21:05,948

In which cases would you recommend

listeners to actually invest time to

313

00:21:05,948 --> 00:21:10,862

actually use these techniques because they

have a better chance of working than the

314

00:21:10,862 --> 00:21:16,202

classic Hamiltonian Monte Carlo samplers

that are the default in most probabilistic

315

00:21:16,202 --> 00:21:17,162

languages?

316

00:21:17,602 --> 00:21:23,762

Yeah, I think the easy answer is that when

the inference problem is hard.

317

00:21:23,982 --> 00:21:27,932

So essentially one very practical way

would be that if you realize that you

318

00:21:27,932 --> 00:21:33,192

can't really get a Hamiltonian Monte Carlo

to explore the space, the posterior

319

00:21:33,192 --> 00:21:39,054

properly. Though it may be difficult to find

out that this is happening.

320

00:21:39,054 --> 00:21:42,404

Of course, if you're never visiting a

certain corner, you wouldn't actually

321

00:21:42,404 --> 00:21:42,954

know.

322

00:21:42,954 --> 00:21:47,234

But if you have some sort of a reason to

believe that you really are dealing with

323

00:21:47,234 --> 00:21:52,994

such a complex posterior that I'm kind of

willing to spend a bit more extra

324

00:21:52,994 --> 00:21:58,454

computation to be careful so that I really

try to cover every corner there is.

325

00:21:58,854 --> 00:22:03,754

Another example is that we realized in the

scope of these Bayesian neural networks

326

00:22:03,754 --> 00:22:08,266

that there are certain kind of classical

327

00:22:08,782 --> 00:22:13,752

Well, certain kind of scenarios where we

can show that if you do inference with the

328

00:22:13,752 --> 00:22:16,992

two simple methods, so something in the

Euclidean metric with the standard

329

00:22:16,992 --> 00:22:22,842

Langevin dynamics type of a thing, what

we actually see is that if you switch to

330

00:22:22,842 --> 00:22:28,592

using better prior distributions in your

model, you don't actually see an advantage

331

00:22:28,592 --> 00:22:33,802

of those unless you at the same time

switch to using an inference algorithm

332

00:22:33,802 --> 00:22:36,902

that is kind of able to handle the extra

complexity.

333

00:22:36,902 --> 00:22:38,638

So if you have for example like

334

00:22:38,638 --> 00:22:43,478

heavy-tailed spike-and-slab type of priors

in the neural network.

335

00:22:43,478 --> 00:22:49,768

You just kind of fail to get any benefit

from these better priors if you don't pay

336

00:22:49,768 --> 00:22:53,158

a bit more attention into how you do the

inference.

337

00:22:54,498 --> 00:22:56,258

Okay, super interesting.

338

00:22:56,518 --> 00:23:01,758

And also, so that seems it's also quite

interesting to look at that when you have,

339

00:23:01,758 --> 00:23:05,518

well, or when you suspect that you have

multimodal posteriors.

340

00:23:08,430 --> 00:23:11,890

Yes, well yeah, multimodal posteriors are

interesting.

341

00:23:11,890 --> 00:23:17,850

I'm not, we haven't specifically studied

this question as such, but we

342

00:23:17,850 --> 00:23:22,290

have actually thought about some ideas of

creating metrics that would specifically

343

00:23:22,290 --> 00:23:27,090

encourage exploring the different modes

but we haven't done that concretely so we

344

00:23:27,090 --> 00:23:32,610

are now still focusing on these kinds of narrow

thin areas of posteriors and how can you

345

00:23:32,610 --> 00:23:34,770

kind of reach those.

346

00:23:35,430 --> 00:23:36,630

Okay.

347

00:23:37,774 --> 00:23:43,794

And do you know of normalizing flows?

348

00:23:44,494 --> 00:23:45,974

Sure, yes.

349

00:23:45,994 --> 00:23:51,124

So yeah, we've had Marylou Gabrié on

the show recently.

350

00:23:51,124 --> 00:23:52,894

It was episode 98.

351

00:23:53,114 --> 00:23:57,774

And so she's working a lot on these

normalizing flows and the idea of

352

00:23:57,774 --> 00:24:02,694

assisting MCMC sampling with these machine

learning methods.

353

00:24:02,974 --> 00:24:04,206

And it's amazing.

354

00:24:04,206 --> 00:24:08,726

It can sound somewhat similar to what you do

in your group.

355

00:24:08,726 --> 00:24:15,436

And so for listeners, could you explain

the difference between the two ideas and

356

00:24:15,436 --> 00:24:20,006

maybe also the use cases that both apply

to?

357

00:24:20,586 --> 00:24:22,886

Yeah, I think you're absolutely right.

358

00:24:22,886 --> 00:24:25,666

So they are very closely related.

359

00:24:25,666 --> 00:24:30,486

So there are, for example, the basic idea

of the neural transport that uses

360

00:24:30,486 --> 00:24:32,280

normalizing flows for

361

00:24:32,280 --> 00:24:38,960

essentially transforming the parameter

space in a suitable non-linear way and

362

00:24:38,960 --> 00:24:43,090

then running standard Euclidean

Hamiltonian Monte Carlo.

363

00:24:43,470 --> 00:24:45,190

It can actually be proven.

364

00:24:45,190 --> 00:24:48,730

I think it is in the original paper as

well that I mean it is actually

365

00:24:48,730 --> 00:24:55,608

mathematically equivalent to conducting

Riemannian inference in a suitable metric.

366

00:24:55,822 --> 00:25:01,612

So I would say that it's like a

complementary approach of solving exactly

367

00:25:01,612 --> 00:25:02,532

the same problem.

368

00:25:02,532 --> 00:25:09,222

So you have a way of somehow in a flexible

way warping your parameter space.

369

00:25:09,402 --> 00:25:14,762

You either do it through a metric or you

kind of do it as a pre-transformation.

370

00:25:14,802 --> 00:25:17,102

So there's a lot of similarities.

371

00:25:17,102 --> 00:25:22,286

It's also the computation in some sense

that if you think about mapping...

372

00:25:22,286 --> 00:25:24,536

sample through a normalizing flow.

373

00:25:24,536 --> 00:25:28,556

It's actually very close to what we do

with the Riemannian Laplace approximation

374

00:25:28,556 --> 00:25:34,766

that you start kind of take a sample and

you start propagating it through some sort

375

00:25:34,766 --> 00:25:35,666

of a transformation.

376

00:25:35,666 --> 00:25:39,726

It's just whether it's defined through a

metric or as a flow.
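
This equivalence can be checked numerically: taking a small step and measuring its plain Euclidean length in the warped space gives the same number as measuring the corresponding step in the original space under the pullback metric G = (J^-1)^T J^-1. The warp f below is a made-up funnel-like map, purely for illustration.

```python
import numpy as np

# Hypothetical smooth warp f: z -> x and its Jacobian (for illustration only).
def f(z):
    return np.array([z[0], z[1] * np.exp(z[0])])   # a funnel-like warp

def jacobian(z):
    return np.array([[1.0, 0.0],
                     [z[1] * np.exp(z[0]), np.exp(z[0])]])

z = np.array([0.3, -0.7])
dz = 1e-6 * np.array([1.0, 2.0])

# Length of the step in the flat (warped) z-space.
euclidean_len = np.linalg.norm(dz)

# The same step in x-space, measured under the pullback metric.
J = jacobian(z)
dx = f(z + dz) - f(z)                       # approximately J @ dz
J_inv = np.linalg.inv(J)
G = J_inv.T @ J_inv                         # pullback metric induced by the warp
riemannian_len = np.sqrt(dx @ G @ dx)

print(np.isclose(euclidean_len, riemannian_len, rtol=1e-4, atol=0))  # True
```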

377

00:25:40,846 --> 00:25:44,566

So yes, so they are kind of very close.

378

00:25:44,566 --> 00:25:48,346

So now the question is then that when

should I be using one of these?

379

00:25:48,346 --> 00:25:51,662

I'm afraid I don't really have an answer.

380

00:25:51,662 --> 00:25:59,562

in the sense that, I mean, there are

computational properties. Let's say, for

381

00:25:59,562 --> 00:26:03,992

example if you've worked with flows you do

need to pre-train them, so you do need to

382

00:26:03,992 --> 00:26:08,232

train some sort of a flow to be able to

use it in certain applications so it comes

383

00:26:08,232 --> 00:26:10,182

with some pre-training cost.

384

00:26:11,182 --> 00:26:15,102

Quite likely, when you're actually

using it, it's going to be faster than

385

00:26:15,102 --> 00:26:19,182

working in a Riemannian metric where you

need to invert some metric tensors and so

386

00:26:19,182 --> 00:26:20,342

on.

387

00:26:21,422 --> 00:26:24,002

So there's kind of like technical

differences.

388

00:26:24,362 --> 00:26:28,822

Then I think the bigger question is of

course that if we go to really challenging

389

00:26:28,822 --> 00:26:32,612

problems, for example, very high

dimensions, that which of these methods

390

00:26:32,612 --> 00:26:35,122

actually work well there.

391

00:26:35,922 --> 00:26:40,862

For that I don't quite now have an answer

in the sense that I would dare to say that

392

00:26:40,862 --> 00:26:46,062

or even speculate which of these is

better. I might miss some kind of obvious

393

00:26:46,062 --> 00:26:51,118

limitations of one of the approaches if

trying to kind of extrapolate too far

394

00:26:51,118 --> 00:26:53,538

from what we've actually tried in

practice.

395

00:26:53,818 --> 00:26:55,498

Yeah, that's what I was going to say.

396

00:26:55,498 --> 00:27:00,938

It's also that these methods are really at

the frontier of the science.

397

00:27:00,938 --> 00:27:07,978

So I guess we're lacking

for now the practical cases, right?

398

00:27:07,978 --> 00:27:13,598

And probably in a few years we'll have

more ideas of these and when one is more

399

00:27:13,598 --> 00:27:14,878

appropriate than another.

400

00:27:14,878 --> 00:27:18,062

But for now, I guess we have to try.

401

00:27:18,062 --> 00:27:21,362

those algorithms and see what we get back.

402

00:27:24,376 --> 00:27:33,546

And so actually, what if people want to

try these Riemannian-based algorithms?

403

00:27:33,546 --> 00:27:38,986

Do you have already packages that we can

link to that people can try and plug their

404

00:27:38,986 --> 00:27:40,326

own model into?

405

00:27:41,786 --> 00:27:43,726

Yes and no.

406

00:27:43,746 --> 00:27:50,206

So we have released open source code with

each of the research papers.

407

00:27:50,206 --> 00:27:53,446

So there is a reference implementation

that

408

00:27:53,582 --> 00:27:55,222

can be used.

409

00:27:58,002 --> 00:28:03,752

We have internally been integrating these,

kind of working a bit towards integrating

410

00:28:03,752 --> 00:28:08,622

them into proper open ecosystems that

would, for example, make model

411

00:28:08,622 --> 00:28:10,382

specification easy.

412

00:28:11,062 --> 00:28:12,592

It's not quite there yet.

413

00:28:12,592 --> 00:28:17,722

So there's one particular challenge is

that many of the environments don't

414

00:28:17,722 --> 00:28:22,712

actually have all the support

functionality you need for the Riemannian

415

00:28:22,712 --> 00:28:23,534

methods.

416

00:28:23,534 --> 00:28:28,864

They're essentially simplifying some of

the things by directly encoding the

417

00:28:28,864 --> 00:28:33,194

assumption that the shortest path is an

interpolation, a straight line.

418

00:28:33,374 --> 00:28:38,294

So you need a bit of an extra machinery

on top of the most established libraries.

419

00:28:38,294 --> 00:28:45,044

There are some libraries, I believe, that

are actually making it fairly easy to do

420

00:28:45,044 --> 00:28:48,074

kind of plug and play Riemannian metrics.

421

00:28:48,474 --> 00:28:53,414

I don't remember the names right now, but

that's where we've kind of been

422

00:28:53,422 --> 00:28:58,202

planning on putting in the algorithms, but

they're not really there yet.

423

00:28:58,602 --> 00:29:00,642

Hmm, OK, I see.

424

00:29:00,742 --> 00:29:05,322

Yeah, definitely that would be, I guess,

super, super interesting.

425

00:29:05,802 --> 00:29:11,602

If by the time of release, you see

something that people could try,

426

00:29:11,602 --> 00:29:17,122

definitely we'll link to that, because I

think listeners will be curious.

427

00:29:17,122 --> 00:29:19,802

And I'm definitely super curious to try

that.

428

00:29:19,802 --> 00:29:21,774

Any new stuff like that, you'd like to

429

00:29:21,774 --> 00:29:24,014

try and see what you can do with it.

430

00:29:24,014 --> 00:29:26,174

It's always super interesting.

431

00:29:26,174 --> 00:29:33,014

And I've already seen some very

interesting experiments done with

432

00:29:33,214 --> 00:29:42,234

normalizing flows, especially bayeux by

Colin Carroll and other people.

433

00:29:42,734 --> 00:29:46,974

Colin Carroll is one of the PyMC

developers also.

434

00:29:47,534 --> 00:29:50,830

And yeah, now you can use bayeux to take

any

435

00:29:50,830 --> 00:30:00,550

a JAX-ifiable model and you plug that into

it and you can use the flowMC algorithm

436

00:30:00,550 --> 00:30:03,610

to sample your JAX-ifiable PyMC model.

437

00:30:03,610 --> 00:30:06,150

So that's really super cool.

438

00:30:06,490 --> 00:30:12,270

And I'm really looking forward to more

experiments like that to see, well, okay,

439

00:30:12,270 --> 00:30:14,310

what can we do with those algorithms?

440

00:30:14,650 --> 00:30:20,718

Where can we push them to what extent, to

what degree, where do they fall down?

441

00:30:20,718 --> 00:30:25,078

That's really super interesting, at least

for me, because I'm not a mathematician.

442

00:30:25,158 --> 00:30:29,168

So when I see that, I find that super,

like, I love the idea of, basically the

443

00:30:29,168 --> 00:30:30,268

idea is somewhat simple.

444

00:30:30,268 --> 00:30:34,858

It's like, okay, we have that problem when

we think about geometry that way, because

445

00:30:34,858 --> 00:30:38,458

then the geometry becomes a funnel, for

instance, as you were saying.

446

00:30:38,458 --> 00:30:42,398

And then sampling at the bottom of the

funnel is just super hard in the way we do

447

00:30:42,398 --> 00:30:46,138

it right now, because of just super small

distances.

448

00:30:46,318 --> 00:30:48,910

What if we change the definition of

distance?

449

00:30:48,910 --> 00:30:54,440

What if we change the definition of

geometry, basically, which is this idea

450

00:30:54,440 --> 00:30:57,090

of, OK, let's switch to Riemannian space.

451

00:30:57,090 --> 00:31:01,190

And the way we do that, then, well, the

funnel disappears, and it just becomes

452

00:31:01,190 --> 00:31:02,970

something easier.

453

00:31:02,970 --> 00:31:09,790

It's just like going beyond the idea of

the centered versus non-centered

454

00:31:09,790 --> 00:31:13,490

parameterization, for instance, when you

do that in a model, right?

455

00:31:13,490 --> 00:31:16,970

But it's going big with that because it's

more general.
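
Neal's funnel is the textbook example of the geometry being discussed here, and the centered-versus-non-centered trick can be sketched in a few lines of sampling code. This is the standard reparameterization, not anything specific to the methods in this episode.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Neal's funnel: v ~ N(0, 3), x | v ~ N(0, exp(v/2)). Direct sampling of the
# joint is hard for gradient-based samplers because small v forms a narrow neck.

# Non-centered parameterization: sample in a "flat" space, then warp.
v = rng.normal(0.0, 3.0, size=n)
x_raw = rng.normal(0.0, 1.0, size=n)   # standard normal, easy geometry
x = x_raw * np.exp(v / 2.0)            # deterministic warp reintroduces the funnel

# The warped draws really do follow the funnel's conditional scale:
# wide where v is large, pinched where v is small.
print(np.std(x[v > 2]) > np.std(x[v < -2]))  # True
```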

456

00:31:17,450 --> 00:31:19,024

So I love that idea.

457

00:31:19,406 --> 00:31:23,726

I understand it, but I cannot really read

the math and be like, oh, OK, I see what

458

00:31:23,726 --> 00:31:24,866

that means.

459

00:31:25,566 --> 00:31:30,176

So I have to see the model and see what I

can do and where I can push it.

460

00:31:30,176 --> 00:31:35,126

And then I get a better understanding of

what that entails.

461

00:31:35,706 --> 00:31:40,266

Yeah, I think you gave a much better

summary of what it is doing than I did.

462

00:31:40,666 --> 00:31:42,206

So good for that.

463

00:31:42,946 --> 00:31:45,186

I mean, you are actually touching that, of

course.

464

00:31:45,186 --> 00:31:48,558

So the one point is making the

algorithms

465

00:31:48,558 --> 00:31:51,438

available so that everyone could try them

out.

466

00:31:51,438 --> 00:31:56,338

But then there's also the other aspect

that we need to worry about, which is the

467

00:31:56,338 --> 00:31:59,198

proper evaluation of what they're doing.

468

00:31:59,198 --> 00:32:04,038

I mean, of course, in most of the papers when

you release a new algorithm, you need to

469

00:32:04,038 --> 00:32:08,198

emphasize things like, in our case,

computational efficiency.

470

00:32:08,198 --> 00:32:13,268

And you do demonstrate that it, maybe for

example, quite explicitly showing

471

00:32:13,268 --> 00:32:17,486

that with these very strong funnels, it does

work better with those.

472

00:32:17,486 --> 00:32:22,456

But now then the question is of course

that how reliable these things are if used

473

00:32:22,456 --> 00:32:27,106

in a black-box manner, so that someone

just runs them on their favorite model.

474

00:32:27,286 --> 00:32:34,546

And one of the challenges we realized is

that it's actually very hard to evaluate

475

00:32:34,546 --> 00:32:39,506

how well an algorithm is working in an

extremely difficult case.

476

00:32:39,806 --> 00:32:41,486

Because there is no baseline.

477

00:32:41,486 --> 00:32:47,238

I mean, in some of the cases we've been

comparing that let's try to do...

478

00:32:47,982 --> 00:32:54,822

standard Hamiltonian MCMC or NUTS as

carefully as we can.

479

00:32:55,062 --> 00:32:59,822

And we kind of think that this is the

ground truth, this is the true posterior.

480

00:33:00,182 --> 00:33:02,542

But we don't really know whether that's

the case.

481

00:33:02,542 --> 00:33:08,542

So if it's a hard enough case, our kind of

supposed ground truth is failing as well.

482

00:33:09,402 --> 00:33:12,952

And it's very hard then. We

might be able to see that our solution

483

00:33:12,952 --> 00:33:14,414

differs from that.

484

00:33:14,414 --> 00:33:17,614

But then we would need to kind of

separately go and investigate that which

485

00:33:17,614 --> 00:33:19,114

one was wrong.

486

00:33:20,414 --> 00:33:27,274

And that is a practical challenge,

especially if you would like to have a

487

00:33:27,274 --> 00:33:30,734

broad set of models.

488

00:33:31,034 --> 00:33:35,404

And we would want to show somehow

transparently for the kind of end users

489

00:33:35,404 --> 00:33:39,184

that in these and these kind of problems,

this and that particular method, whether

490

00:33:39,184 --> 00:33:42,830

it's one of ours or something else, any

other new fancy method.

491

00:33:42,830 --> 00:33:46,210

When do they work and when don't they?

492

00:33:46,690 --> 00:33:52,550

Without relying that we really have some

particular method that they already trust

493

00:33:52,550 --> 00:33:58,330

and if it's just compared to

it, we can't kind of really convince

494

00:33:58,330 --> 00:34:04,440

others that it is correct when it is

differing from what we kind of used to

495

00:34:04,440 --> 00:34:05,870

rely on.

496

00:34:06,250 --> 00:34:10,300

Yeah, that's definitely a problem.

497

00:34:10,300 --> 00:34:12,747

That's also a question I asked Marylou

498

00:34:12,747 --> 00:34:16,787

when she was on the show and then that was

kind of the same answer if I remember

499

00:34:16,787 --> 00:34:22,157

correctly that for now it's kind of hard

to do benchmarks in a way, which is

500

00:34:22,157 --> 00:34:28,537

definitely an issue if you're trying to

work on that from a scientific perspective

501

00:34:28,537 --> 00:34:30,057

as well.

502

00:34:30,257 --> 00:34:34,297

If we were astrologists, that'd be great,

like then we'd be good.

503

00:34:34,297 --> 00:34:39,534

But if you're a scientist, then you want

to evaluate your methods and...

504

00:34:39,534 --> 00:34:43,774

And finding a method to evaluate the

method is almost as valuable as finding

505

00:34:43,774 --> 00:34:45,634

the method in the first place.

506

00:34:46,414 --> 00:34:50,314

And where do you think we are on that

regarding in your field?

507

00:34:50,314 --> 00:34:57,094

Is that an active branch of the research

to try and evaluate these algorithms?

508

00:34:57,094 --> 00:34:59,354

What would that even look like?

509

00:34:59,354 --> 00:35:05,894

Or are we still really, really at a very

early time for that work?

510

00:35:05,894 --> 00:35:07,086

That's a...

511

00:35:07,086 --> 00:35:08,246

Very good question.

512

00:35:08,246 --> 00:35:13,156

So I'm not aware of a lot of people that

would kind of specifically focus on

513

00:35:13,156 --> 00:35:14,066

evaluation.

514

00:35:14,066 --> 00:35:17,706

So for example, Aki Vehtari has of course been

working a lot on that, trying to kind of

515

00:35:17,706 --> 00:35:19,486

create diagnostics and so on.

516

00:35:19,486 --> 00:35:26,056

But then if we think about more on the

flexible machine learning side, I think my

517

00:35:26,056 --> 00:35:31,266

hunch is that the individual research

groups are kind of all circling around the

518

00:35:31,266 --> 00:35:36,426

same problems that they are kind of trying

to figure out that, okay,

519

00:35:36,846 --> 00:35:41,726

Every now and then someone invents a fancy

way of evaluating something.

520

00:35:41,726 --> 00:35:49,056

It introduces a particular type of

synthetic scenario. I think the

521

00:35:49,056 --> 00:35:55,006

most common one, what people do,

is that you create problems where you

522

00:35:55,006 --> 00:35:59,676

actually have an analytic posterior.

It's somehow like an artificial problem:

523

00:35:59,676 --> 00:36:04,796

you take a problem and you transform

it in a given way and then you assume that

524

00:36:04,796 --> 00:36:06,798

you didn't have the analytic one.
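
The "construct a problem with an analytic posterior" idea can be sketched with a conjugate Normal model: the exact posterior mean is known in closed form, and any sampler under test (a plain random-walk Metropolis stands in here, as a generic placeholder) can be checked against it.

```python
import numpy as np

rng = np.random.default_rng(3)

# Benchmark with a known analytic posterior: Normal likelihood with known sd,
# Normal prior on the mean (conjugate, so the posterior is available exactly).
mu_true, sigma, tau = 2.0, 1.0, 3.0          # true mean, data sd, prior sd
y = rng.normal(mu_true, sigma, size=50)

prec = 1 / tau**2 + y.size / sigma**2        # posterior precision
post_mean = (y.sum() / sigma**2) / prec      # analytic posterior mean
post_sd = np.sqrt(1 / prec)

# "Sampler under test": a simple random-walk Metropolis chain.
def log_post(m):
    return -0.5 * (m / tau) ** 2 - 0.5 * np.sum((y - m) ** 2) / sigma**2

samples, m = [], 0.0
for _ in range(20_000):
    prop = m + rng.normal(0, 0.5)
    if np.log(rng.uniform()) < log_post(prop) - log_post(m):
        m = prop
    samples.append(m)
samples = np.array(samples[5_000:])          # drop burn-in

print(abs(samples.mean() - post_mean) < 0.1)  # close to the analytic answer
```

The hard part discussed in the episode is exactly that such analytic benchmarks feel synthetic: once the posterior is difficult enough to be interesting, no closed-form ground truth exists.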

525

00:36:06,798 --> 00:36:10,348

But they are all, I mean, they feel a bit

artificial.

526

00:36:10,348 --> 00:36:12,058

They feel a bit synthetic.

527

00:36:12,338 --> 00:36:13,518

So let's see.

528

00:36:13,518 --> 00:36:17,308

It would maybe be something that the

community should kind of be talking a bit

529

00:36:17,308 --> 00:36:22,538

more about on a workshop or something

that, OK, let's try to really think about

530

00:36:22,538 --> 00:36:28,008

how to verify the robustness or possibly

identify that these things are not really

531

00:36:28,008 --> 00:36:33,978

ready or reliable for practical use in

very serious applications yet.

532

00:36:34,198 --> 00:36:35,182

Yeah.

533

00:36:35,182 --> 00:36:39,582

I haven't been following very closely

what's happening, so I may be missing some

534

00:36:39,582 --> 00:36:42,882

important works that are already out

there.

535

00:36:42,922 --> 00:36:44,202

Okay, yeah.

536

00:36:44,602 --> 00:36:50,582

Well, Aki, if you're listening, send us a

message if we forgot something.

537

00:36:50,982 --> 00:36:57,222

And second, that sounds like there are

some interesting PhDs to do on the issue,

538

00:36:57,222 --> 00:37:01,902

if that's still a very new branch of the

research.

539

00:37:01,902 --> 00:37:03,242

So, people...

540

00:37:04,078 --> 00:37:08,358

If you're interested in that, maybe

contact Arto and we'll see.

541

00:37:08,358 --> 00:37:12,848

Maybe in a few months or years, you can

come here on the show and answer the

542

00:37:12,848 --> 00:37:16,118

question I just asked.

543

00:37:18,318 --> 00:37:24,738

Another aspect of your work I really want

to talk about also that I really love and

544

00:37:24,738 --> 00:37:31,238

now listeners can relax because that's

going to be, I think, less abstract and

545

00:37:31,238 --> 00:37:33,614

closer to their user experience,

546

00:37:33,614 --> 00:37:35,114

is about priors.

547

00:37:35,174 --> 00:37:40,594

You talked about it a bit at the

beginning, especially you are working and

548

00:37:40,594 --> 00:37:46,154

you worked a lot on a package called

PreliZ that I really love.

549

00:37:46,234 --> 00:37:51,834

One of my friends and fellow PyMC

developers, Osvaldo Martin, is also

550

00:37:51,834 --> 00:37:54,074

collaborating on that.

551

00:37:54,094 --> 00:37:58,754

And you guys have done a tremendous job on

that.

552

00:37:58,874 --> 00:38:02,554

So yeah, can you give people a primer

about PreliZ?

553

00:38:02,554 --> 00:38:03,662

What is it?

554

00:38:03,662 --> 00:38:11,202

When could they use it and what's its

purpose in general?

555

00:38:12,002 --> 00:38:16,142

Maybe I need to start by saying that I

haven't worked a lot on PreliZ.

556

00:38:16,142 --> 00:38:21,472

Osvaldo has, and a couple of others, so

I've been kind of just hovering around and

557

00:38:21,472 --> 00:38:23,242

giving a bit of feedback.

558

00:38:23,282 --> 00:38:28,472

But yeah, so I'll maybe start a bit

further away, so not directly from

559

00:38:28,472 --> 00:38:31,382

PreliZ, but the whole question of prior

elicitation.

560

00:38:31,382 --> 00:38:32,362

So I think the...

561

00:38:32,362 --> 00:38:33,058

Yeah.

562

00:38:33,102 --> 00:38:38,422

What we've been working with is

prior elicitation, which I would

563

00:38:38,422 --> 00:38:43,352

frame as some sort of

usually iterative approach of

564

00:38:43,352 --> 00:38:49,842

communicating with the domain expert where

the goal is to estimate what their

565

00:38:49,842 --> 00:38:56,772

actual subjective prior knowledge is on

whatever parameters the model has and

566

00:38:56,772 --> 00:39:01,026

doing it so that it's like cognitively

easy for the expert.

567

00:39:01,374 --> 00:39:07,874

So many of the algorithms that we've been

working on are based on this idea of

568

00:39:07,874 --> 00:39:09,734

predictive elicitation.

569

00:39:09,754 --> 00:39:13,374

So if you have a model where the

parameters don't actually have a very

570

00:39:13,374 --> 00:39:19,054

concrete, easily understandable meaning,

you can't actually start asking questions

571

00:39:19,054 --> 00:39:21,654

from the expert about the parameters.

572

00:39:21,654 --> 00:39:25,354

It would require them to understand fully

the model itself.

573

00:39:26,134 --> 00:39:29,838

The predictive elicitation techniques kind

of ask

574

00:39:30,174 --> 00:39:34,964

communicate with the expert usually in the

space of the observable quantities.

575

00:39:34,964 --> 00:39:40,494

So they're asking: is this

somehow a more likely realization than this

576

00:39:40,494 --> 00:39:41,754

other one?

577

00:39:42,014 --> 00:39:46,994

And now this is where PreliZ comes

into play.

578

00:39:46,994 --> 00:39:52,494

So when we are communicating with the

user, most of the time the information

579

00:39:52,494 --> 00:39:58,248

we show for the user is some sort of

visualizations.

580

00:39:58,248 --> 00:40:03,688

of predictive distributions or possibly

also about the parameter distributions

581

00:40:03,688 --> 00:40:04,798

themselves.

582

00:40:04,798 --> 00:40:11,278

So we need an easy way of communicating

whether it's histograms of predicted

583

00:40:11,278 --> 00:40:13,118

values and whatnot.

584

00:40:13,418 --> 00:40:21,878

So how do we show those for a user in

scenarios where the model itself is some

585

00:40:21,878 --> 00:40:26,038

sort of a probabilistic program so we

can't kind of fixate on a given model

586

00:40:26,038 --> 00:40:26,998

family.

587

00:40:27,886 --> 00:40:34,246

That's actually the main role of

PreliZ: essentially making it easy to

588

00:40:34,246 --> 00:40:35,606

interface with the user.

589

00:40:35,606 --> 00:40:40,286

Of course, PreliZ also then includes

these algorithms themselves.

590

00:40:40,286 --> 00:40:46,466

So, algorithms for estimating the prior

and the kind of interface components for

591

00:40:46,466 --> 00:40:48,986

the expert to give information.

592

00:40:48,986 --> 00:40:54,896

So, make a selection, use a slider: I

would want my distribution to be a bit

593

00:40:54,896 --> 00:40:57,276

more skewed towards the right and so on.

594

00:40:57,806 --> 00:40:59,626

That's what we are aiming at.

595

00:40:59,626 --> 00:41:06,066

A general purpose tool that would be used,

it's essentially kind of a platform for

596

00:41:06,066 --> 00:41:11,486

developing and kind of bringing into use

all kinds of prior elicitation techniques.

597

00:41:11,486 --> 00:41:15,766

So it's not tied to any given algorithm or

anything but you just have the components

598

00:41:15,766 --> 00:41:21,396

and could then easily kind of commit,

let's say, a new type of prior elicitation

599

00:41:21,396 --> 00:41:23,526

algorithm into the library.

600

00:41:25,026 --> 00:41:27,386

Yeah, and I really encourage

601

00:41:28,066 --> 00:41:31,656

folks to go take a look at the PreliZ

package.

602

00:41:31,656 --> 00:41:38,726

I put the link in the show notes because,

yeah, as you were saying, that's a really

603

00:41:38,966 --> 00:41:47,016

easier way to specify your priors and also

elicit them if you need the intervention

604

00:41:47,016 --> 00:41:52,956

of non-statisticians in your model, which

you often do if the model is complex

605

00:41:52,956 --> 00:41:53,786

enough.

606

00:41:54,126 --> 00:41:55,662

So yeah, like...

607

00:41:55,662 --> 00:41:58,102

I'm using it myself quite a lot.

608

00:41:58,122 --> 00:42:01,922

So thanks a lot guys for this work.

609

00:42:02,182 --> 00:42:07,382

So Arto, as you were saying, Osvaldo

Martín is one of the main contributors,

610

00:42:07,622 --> 00:42:13,802

Oriol Abril-Pla also, and Alejandro

Icazatti, if I remember correctly.

611

00:42:13,802 --> 00:42:18,242

So at least these four people are the main

contributors.

612

00:42:19,262 --> 00:42:22,622

And yeah, so I definitely encourage people

to go there.

613

00:42:22,622 --> 00:42:25,774

What would you say, Arto, are the...

614

00:42:25,774 --> 00:42:32,174

like the Pareto effect, what would it be

if people want to get started with

615

00:42:32,174 --> 00:42:32,734

PreliZ?

616

00:42:32,734 --> 00:42:40,374

Like the 20% of uses that will give you

80% of the benefits of PreliZ for

617

00:42:40,374 --> 00:42:42,278

someone who doesn't know anything about it.

618

00:42:45,646 --> 00:42:47,586

That's a very good question.

619

00:42:48,046 --> 00:42:58,396

I think the most important thing actually

is to realize that we need to be careful

620

00:42:58,396 --> 00:43:00,326

when we set the priors.

621

00:43:00,966 --> 00:43:04,666

So simply being aware that you need a tool

for this.

622

00:43:04,686 --> 00:43:09,486

You need a tool that makes it easy to do

something like a prior predictive check.
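
A basic prior predictive check of the kind mentioned here can be sketched without special tooling; the toy regression model and the "plausible range" of roughly [-200, 200] below are invented for illustration, and PreliZ packages this workflow much more conveniently.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy model: y ~ Normal(alpha + beta * x, sigma). Suppose a domain expert says
# plausible y values live roughly within [-200, 200]. Compare what two
# candidate priors imply about the data before seeing any data.
x = np.linspace(0, 10, 50)

def prior_predictive(alpha_sd, beta_sd, draws=2_000):
    alpha = rng.normal(0, alpha_sd, size=draws)
    beta = rng.normal(0, beta_sd, size=draws)
    sigma = np.abs(rng.normal(0, 1, size=draws))
    noise = rng.normal(size=(draws, x.size))
    return alpha[:, None] + beta[:, None] * x + sigma[:, None] * noise

vague = prior_predictive(alpha_sd=100, beta_sd=100)  # "uninformative" priors
weak = prior_predictive(alpha_sd=10, beta_sd=2)      # weakly informative priors

# The vague prior puts far more predictive mass outside the plausible range.
print(np.mean(np.abs(vague) > 200) > np.mean(np.abs(weak) > 200))  # True
```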

623

00:43:09,606 --> 00:43:15,522

You need a tool that relieves you from

figuring out how to inspect

624

00:43:15,522 --> 00:43:19,122

my priors or the effects they have on the

model.

625

00:43:19,122 --> 00:43:22,162

That's actually where the real benefit is.

626

00:43:22,162 --> 00:43:23,972

You get most of the benefit

627

00:43:23,972 --> 00:43:28,632

when you kind of try to bring it as part

of your Bayesian workflow in a kind of a

628

00:43:28,632 --> 00:43:32,382

concrete step that you identify that I

need to do this.

629

00:43:32,422 --> 00:43:37,782

Then the kind of the remaining tail of

this thing is then of course that the...

630

00:43:37,782 --> 00:43:42,182

maybe in some cases you have such a

complicated model that you really need to

631

00:43:42,182 --> 00:43:43,694

deep dive and start...

632

00:43:43,694 --> 00:43:47,014

running algorithms that help you eliciting

the priors.

633

00:43:47,014 --> 00:43:52,224

And I would actually even say that the

elicitation algorithms, I do perceive them

634

00:43:52,224 --> 00:43:57,554

as useful even when the person is actually a

statistician.

635

00:43:57,554 --> 00:44:02,664

I mean, there's a lot of models that we

may think that we know how to set the

636

00:44:02,664 --> 00:44:03,414

priors.

637

00:44:03,414 --> 00:44:09,634

But what we are actually doing is

following some very vague ideas on what's

638

00:44:09,634 --> 00:44:10,614

the effect.

639

00:44:10,614 --> 00:44:12,390

And we may also make

640

00:44:12,660 --> 00:44:16,170

severe mistakes or spend a lot of time in

doing it.

641

00:44:16,170 --> 00:44:21,050

So to an extent these elicitation

interfaces, I believe that ultimately they

642

00:44:21,050 --> 00:44:27,340

will be helping even kind of hardcore

statisticians in just kind of doing it

643

00:44:27,340 --> 00:44:31,890

faster, doing it slightly better, doing it

perhaps in a better documented

644

00:44:31,890 --> 00:44:32,480

manner.

645

00:44:32,480 --> 00:44:40,164

So you could for example kind of store all

the interaction the modeler had.

646

00:44:40,238 --> 00:44:44,838

with these things and kind of put that

aside that this is where we got the prior

647

00:44:44,838 --> 00:44:51,258

from instead of just trial and error and

then we just see at the end the result.

648

00:44:51,298 --> 00:44:55,978

So you could kind of revisit the choices

you made during an elicitation process

649

00:44:55,978 --> 00:45:00,948

that I discarded these predictive

distributions for some reason and then you

650

00:45:00,948 --> 00:45:05,248

can later kind of, okay I made a mistake

there maybe I go and change my answer in

651

00:45:05,248 --> 00:45:09,804

that part and then an algorithm provides

you an updated prior.

652

00:45:10,126 --> 00:45:14,586

without you needing to actually go through

the whole prior specification process

653

00:45:14,586 --> 00:45:15,846

again.

654

00:45:16,086 --> 00:45:16,446

Yeah.

655

00:45:16,446 --> 00:45:17,066

Yeah.

656

00:45:17,066 --> 00:45:18,866

Yeah, I really love that.

657

00:45:19,546 --> 00:45:25,386

And that makes the process of setting

priors more reproducible, more transparent

658

00:45:25,386 --> 00:45:26,746

in a way.

659

00:45:26,806 --> 00:45:33,276

That makes me think a bit of the scikit-learn

pipelines that you use to transform

660

00:45:33,276 --> 00:45:33,886

the data.

661

00:45:33,886 --> 00:45:38,756

For instance, you just set up the pipeline

and you say, I want to standardize my

662

00:45:38,756 --> 00:45:40,046

data.

663

00:45:40,046 --> 00:45:41,416

And then you have that pipeline ready.

664

00:45:41,416 --> 00:45:44,866

And when you do the out-of-sample

predictions, you can use the pipeline and

665

00:45:44,866 --> 00:45:48,896

say, okay, now like do that same

transformation on these new data so that

666

00:45:48,896 --> 00:45:52,056

we're sure that it's done the right way,

but it's still transparent and people know

667

00:45:52,056 --> 00:45:53,416

what's going on here.
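The fit/transform pattern being described can be sketched without scikit-learn itself; `Standardizer` here is a hypothetical stand-in for a `StandardScaler` step inside a pipeline:

```python
import numpy as np

class Standardizer:
    """Minimal fit/transform step mimicking the scikit-learn pattern."""

    def fit(self, X):
        # Record the training statistics once...
        self.mean_ = X.mean(axis=0)
        self.std_ = X.std(axis=0)
        return self

    def transform(self, X):
        # ...and reuse them verbatim on any new data, so the exact same
        # transformation is applied again at prediction time.
        return (X - self.mean_) / self.std_

rng = np.random.default_rng(0)
X_train = rng.normal(5.0, 2.0, size=(100, 3))
X_new = rng.normal(5.0, 2.0, size=(10, 3))

scaler = Standardizer().fit(X_train)
Z_train = scaler.transform(X_train)  # standardized training data
Z_new = scaler.transform(X_new)      # new data, same recorded transformation
```

The point of the analogy: the transformation is declared once, stored, and replayed, rather than re-derived by hand each time.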

668

00:45:53,416 --> 00:45:56,306

It's a bit the same thing, but with the

priors.

669

00:45:56,306 --> 00:46:02,126

And I really love that because that makes

it also easier for people to think about

670

00:46:02,126 --> 00:46:06,866

the priors and to actually choose the

priors.

671

00:46:07,006 --> 00:46:07,374

Because.

672

00:46:07,374 --> 00:46:13,194

What I've seen in teaching is that

especially for beginners, even more when

673

00:46:13,194 --> 00:46:18,394

they come from the frequentist framework,

setting the priors can be just like

674

00:46:18,394 --> 00:46:19,194

paralyzing.

675

00:46:19,194 --> 00:46:20,814

It's like the paradox of choice.

676

00:46:20,814 --> 00:46:23,634

It's way too many, way too many choices.

677

00:46:23,634 --> 00:46:27,574

And then they end up not choosing anything

because they are too afraid to choose the

678

00:46:27,574 --> 00:46:28,634

wrong prior.

679

00:46:29,314 --> 00:46:31,514

Yes, I fully agree with that.

680

00:46:31,514 --> 00:46:36,494

I mean, there's a lot of very simple

models.

681

00:46:36,494 --> 00:46:43,594

that already start having six, seven,

eight different univariate priors there.

682

00:46:43,594 --> 00:46:50,764

And then I've been working with these

things for a long time and I still very

683

00:46:50,764 --> 00:46:55,224

easily make stupid mistakes that I'm

thinking that I increase the variance of

684

00:46:55,224 --> 00:47:00,684

this particular prior here, thinking that

what I'm achieving is, for example, higher

685

00:47:00,684 --> 00:47:02,494

predictive variance as well.

686

00:47:02,494 --> 00:47:04,574

And then I realized that, no, that's not

the case.

687

00:47:04,574 --> 00:47:06,234

It's actually...

688

00:47:06,542 --> 00:47:11,192

Later in the model, it plays some sort of

a role and it actually has the opposite

689

00:47:11,192 --> 00:47:11,942

effect.

690

00:47:11,942 --> 00:47:14,662

It's hard.

691

00:47:14,662 --> 00:47:15,322

Yeah.

692

00:47:15,322 --> 00:47:15,632

Yeah.

693

00:47:15,632 --> 00:47:18,742

That stuff is really hard and same here.

694

00:47:19,662 --> 00:47:26,192

When I discovered that, I'm extremely

frustrated because I'm like, I always lost

695

00:47:26,192 --> 00:47:31,562

hours on these, whereas if I had a more

reproducible pipeline, that would just have

696

00:47:31,562 --> 00:47:34,222

been handled automatically for me.

697

00:47:34,222 --> 00:47:34,702

So...

698

00:47:34,702 --> 00:47:35,982

Yeah, for sure.

699

00:47:35,982 --> 00:47:41,982

We're not there yet in the workflow, but

that definitely makes it way easier.

700

00:47:42,682 --> 00:47:45,662

So yeah, I absolutely agree that we are

not there yet.

701

00:47:45,662 --> 00:47:55,112

I mean, PreliZ is a very well-defined

tool that allows us to start

702

00:47:55,112 --> 00:47:55,882

working on it.

703

00:47:55,882 --> 00:48:02,606

But I mean, then the actual concrete

algorithms that would make it easy to

704

00:48:02,606 --> 00:48:07,856

let's say for example, avoid these kind of

stupid mistakes and be able to kind of

705

00:48:07,856 --> 00:48:09,856

really reduce the effort.

706

00:48:09,856 --> 00:48:15,936

So if it now takes two weeks for a PhD

student trying to think about and fiddle

707

00:48:15,936 --> 00:48:19,686

with the prior, so can we get to one day?

708

00:48:19,686 --> 00:48:21,346

Can we get it to one hour?

709

00:48:21,346 --> 00:48:24,926

Can we get it to two minutes of a quick

interaction?

710

00:48:24,926 --> 00:48:30,030

And probably not two minutes, but if we

can get it to one hour and it...

711

00:48:30,030 --> 00:48:32,060

It will require lots of things.

712

00:48:32,060 --> 00:48:36,310

It will require even better of this kind

of tooling.

713

00:48:36,310 --> 00:48:39,190

So how do we visualize, how do we play

around with it?

714

00:48:39,190 --> 00:48:48,450

But I think it's going to require quite a

bit better algorithms on how do you, from

715

00:48:48,450 --> 00:48:53,606

kind of maximally limited interaction, how

do you estimate.

716

00:48:54,030 --> 00:48:58,380

what the prior is and how you design the

kind of optimal questions you should be

717

00:48:58,380 --> 00:49:00,790

asking from the expert.

718

00:49:00,790 --> 00:49:05,990

There's no point in kind of reiterating

the same things just to fine-tune a bit

719

00:49:05,990 --> 00:49:12,160

one of the variances of the priors if

there is a massive mistake still somewhere

720

00:49:12,160 --> 00:49:18,310

in the prior and a single question would

be able to rule out half of the possible

721

00:49:18,310 --> 00:49:19,670

scenarios.
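The idea of a single question ruling out half the possible scenarios is essentially bisection over hypotheses. This toy sketch (all names and the question form are hypothetical) picks the yes/no question whose answer splits the remaining candidate priors most evenly:

```python
# Toy elicitation loop: each candidate "scenario" is a prior mean, and each
# yes/no question asks whether the expert's quantity exceeds a threshold.
candidates = list(range(100))   # hypothetical candidate prior means
true_answer = 37                # the expert's (unknown) belief

questions_asked = 0
while len(candidates) > 1:
    # Choose the threshold that splits the remaining candidates in half:
    # the most informative binary question we can ask.
    threshold = candidates[len(candidates) // 2]
    expert_says_yes = true_answer >= threshold
    candidates = [c for c in candidates if (c >= threshold) == expert_says_yes]
    questions_asked += 1

print(f"Identified prior mean {candidates[0]} in {questions_asked} questions")
```

With 100 candidate scenarios, roughly log2(100) ≈ 7 well-chosen questions suffice, which is the contrast with re-asking near-duplicate fine-tuning questions.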

722

00:49:20,230 --> 00:49:22,348

It's going to be an interesting...

723

00:49:22,550 --> 00:49:27,050

let's say, ripe research direction, I

would say, for the next 5, 10 years.

724

00:49:27,770 --> 00:49:29,570

Yeah, for sure.

725

00:49:29,570 --> 00:49:33,390

And very valuable also because very

practical.

726

00:49:33,750 --> 00:49:37,950

So for sure, again, a great PhD

opportunity, folks.

727

00:49:39,410 --> 00:49:40,490

Yeah, yeah.

728

00:49:40,490 --> 00:49:45,560

Also, I mean, that may be hard to find

those algorithms that you were talking

729

00:49:45,560 --> 00:49:48,090

about because it is hard, right?

730

00:49:48,090 --> 00:49:51,022

I know I worked on the...

731

00:49:51,022 --> 00:49:55,782

find_constrained_prior function that we

have in PyMC now.

732

00:49:56,082 --> 00:50:00,102

And it's just like, it seemed like a very

simple case.

733

00:50:00,102 --> 00:50:03,422

It's not even doing all the fancy stuff

that PreliZ is doing.

734

00:50:03,422 --> 00:50:10,642

It's mainly just optimizing distribution

so that it fits the constraints that you

735

00:50:10,642 --> 00:50:12,822

are giving it.

736

00:50:12,822 --> 00:50:18,682

Like for instance, I want a gamma with

95% of the mass between 2 and 6.

737

00:50:18,682 --> 00:50:20,198

Give me the...

738

00:50:20,910 --> 00:50:23,210

parameters that fit that constraint.
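As a rough sketch of the underlying optimization problem (not PyMC's actual `find_constrained_prior` implementation), one can solve for the gamma parameters directly with SciPy, assuming the leftover 5% of mass is split evenly between the two tails:

```python
import numpy as np
from scipy import optimize, stats

def constrained_gamma(lower=2.0, upper=6.0, mass=0.95):
    """Find (alpha, beta) so `mass` of a Gamma(alpha, beta) lies in
    [lower, upper], splitting the leftover mass evenly between tails."""
    tail = (1.0 - mass) / 2.0

    def residuals(params):
        alpha, beta = params
        dist = stats.gamma(alpha, scale=1.0 / beta)
        # Two equations in two unknowns: pin down both tail probabilities.
        return [dist.cdf(lower) - tail, dist.cdf(upper) - (1.0 - tail)]

    sol = optimize.least_squares(residuals, x0=[2.0, 1.0],
                                 bounds=([1e-6, 1e-6], [np.inf, np.inf]))
    return sol.x

alpha, beta = constrained_gamma()
dist = stats.gamma(alpha, scale=1.0 / beta)
print(f"alpha={alpha:.2f}, beta={beta:.2f}, "
      f"mass in [2, 6] = {dist.cdf(6.0) - dist.cdf(2.0):.4f}")
```

Even this simple sketch hides choices (how to split the tail mass, the initial guess, the parametrization), which is exactly the "surprisingly hard" part being described.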

739

00:50:23,610 --> 00:50:26,650

That's actually surprisingly hard

mathematically.

740

00:50:26,650 --> 00:50:30,720

You have a lot of choices to make, you

have a lot of things to really be careful

741

00:50:30,720 --> 00:50:31,730

about.

742

00:50:32,070 --> 00:50:38,430

And so I'm guessing that's also one of the

hurdles right now in that research.

743

00:50:38,430 --> 00:50:40,470

Yeah, it absolutely is.

744

00:50:40,470 --> 00:50:44,836

I mean, I would say at least I'm

approaching this.

745

00:50:44,910 --> 00:50:48,290

more or less from an optimization

perspective then that I mean, yes, we are

746

00:50:48,290 --> 00:50:53,060

trying to find a prior that best satisfies

whatever constraints we have and trying to

747

00:50:53,060 --> 00:50:57,330

formulate an optimization problem of some

kind that gets us there.

748

00:50:58,210 --> 00:51:03,410

This is also where I think there's a lot

of room for the, let's say flexible

749

00:51:03,410 --> 00:51:05,610

machine learning tools type of things.

750

00:51:05,610 --> 00:51:09,960

So, I mean, if you think about the prior

that satisfies these constraints, we could

751

00:51:09,960 --> 00:51:15,066

be specifying it with some sort of a

flexible

752

00:51:15,598 --> 00:51:20,938

not a particular parametric prior but some

sort of a flexible representation and then

753

00:51:20,938 --> 00:51:26,598

just kind of optimizing for within a much

broader set of this.

754

00:51:26,598 --> 00:51:29,868

But then of course it requires completely

different kinds of tools that we are used

755

00:51:29,868 --> 00:51:30,998

to working on.

756

00:51:30,998 --> 00:51:38,818

It also requires people accepting that our

priors may take arbitrary shapes.

757

00:51:38,818 --> 00:51:43,858

They may be distributions that we could

have never specified directly.

758

00:51:43,858 --> 00:51:45,876

Maybe they're multimodal.

759

00:51:45,902 --> 00:51:51,552

priors that we kind of just infer, that

this is your prior even if you couldn't really have specified it, and

760

00:51:51,552 --> 00:51:55,462

there's going to be also a lot of kind of

educational perspective on getting people

761

00:51:55,462 --> 00:51:56,962

to accept this.

762

00:51:56,962 --> 00:52:03,342

But even if I had to give you a perfect

algorithm that somehow cranks out a prior

763

00:52:03,342 --> 00:52:07,142

and then you look at the prior and you're

saying that I don't even know what

764

00:52:07,142 --> 00:52:12,682

distribution this is, I would have never

ever converged into this if I was manually

765

00:52:12,682 --> 00:52:13,722

doing this.

766

00:52:13,722 --> 00:52:15,374

So will you accept?

767

00:52:15,374 --> 00:52:19,874

that that's your prior or will you insist

that your method is doing something

768

00:52:19,874 --> 00:52:20,594

stupid?

769

00:52:20,594 --> 00:52:24,014

I mean, I still want to use my Gaussian

prior here.

770

00:52:24,694 --> 00:52:27,234

Yeah, that's a good point.

771

00:52:27,234 --> 00:52:34,014

And in a way that's kind of related to a

classic problem that you have when you're

772

00:52:34,014 --> 00:52:35,554

trying to automate a process.

773

00:52:35,554 --> 00:52:39,804

I think there's the same issue with the

automated cars, like those self-driving

774

00:52:39,804 --> 00:52:45,354

cars, where people actually trust the cars

more if they think they have

775

00:52:45,390 --> 00:52:47,130

some control over it.

776

00:52:47,130 --> 00:52:52,690

I've seen interesting experiments where

they put a placebo button in the car that

777

00:52:52,690 --> 00:52:58,120

people could push on to override if they

wanted to, but the button wasn't doing

778

00:52:58,120 --> 00:52:59,090

anything.

779

00:52:59,870 --> 00:53:03,730

People are saying they were more

trustworthy of these cars than the

780

00:53:03,730 --> 00:53:05,710

completely self -driving cars.

781

00:53:05,710 --> 00:53:09,320

That's also definitely something to take

into account, but that's more related to

782

00:53:09,320 --> 00:53:13,310

the human psychology than to the

algorithms per se.

783

00:53:15,054 --> 00:53:19,774

related to human psychology but it's also

related to this evaluation perspective.

784

00:53:19,774 --> 00:53:25,134

I mean of course if we did have a very

robust evaluation pattern that somehow

785

00:53:25,134 --> 00:53:30,724

tells that once you start using these

techniques your final conclusions in some

786

00:53:30,724 --> 00:53:38,484

sense will be better and if we can make

that kind of a very convincing then it

787

00:53:38,484 --> 00:53:39,514

will be easier.

788

00:53:39,514 --> 00:53:43,994

I mean if you think about, I mean there's

a lot of people that would say that

789

00:53:44,494 --> 00:53:49,244

a very massive neural network with four

billion parameters.

790

00:53:49,244 --> 00:53:54,294

It would never ever be able to answer a

question given in a natural language.

791

00:53:54,294 --> 00:53:57,984

A lot of people were saying that five

years ago that this is a pipe dream, it's

792

00:53:57,984 --> 00:53:59,534

never gonna happen.

793

00:53:59,694 --> 00:54:04,754

Now we do have it and now everyone is

ready to accept that yes, it can be done.

794

00:54:04,754 --> 00:54:09,124

And they are willing to actually trust

these ChatGPT type of models in a

795

00:54:09,124 --> 00:54:09,694

lot of things.

796

00:54:09,694 --> 00:54:14,212

And they are investing a lot of effort

into figuring out what to do with this.

797

00:54:14,798 --> 00:54:20,488

It just needs this kind of very concrete

demonstration that there is value and that

798

00:54:20,488 --> 00:54:22,138

it works well enough.

799

00:54:22,578 --> 00:54:26,918

It will still take time for people to

really accept it, but I mean, I think

800

00:54:26,918 --> 00:54:28,978

that's kind of the key ingredient.

801

00:54:29,218 --> 00:54:30,198

Yeah, yeah.

802

00:54:30,838 --> 00:54:33,658

I mean, it's also good in some way.

803

00:54:34,258 --> 00:54:37,438

Like that skepticism makes the tools

better.

804

00:54:37,438 --> 00:54:39,938

So that's good.

805

00:54:40,218 --> 00:54:41,678

I mean, so we could...

806

00:54:41,678 --> 00:54:45,758

Keep talking about PreliZ because I have

other technical questions about that.

807

00:54:46,098 --> 00:54:51,818

But actually, since you're like, that's a

perfect segue to a question I also had for

808

00:54:51,818 --> 00:54:56,778

you because you have a lot of experience

in that field.

809

00:54:57,018 --> 00:55:03,648

So how do you think industries can better

integrate the Bayesian approaches into

810

00:55:03,648 --> 00:55:04,978

their data science workflows?

811

00:55:04,978 --> 00:55:10,078

Because that's basically what we ended up

talking about right now without me nudging

812

00:55:10,078 --> 00:55:11,312

you towards it.

813

00:55:11,918 --> 00:55:16,878

Yeah, I have actually indeed been thinking

about that quite a bit.

814

00:55:16,878 --> 00:55:22,378

So I do a lot of collaboration with

industrial partners in different domains.

815

00:55:23,318 --> 00:55:28,008

I think there's a couple of perspectives

to this.

816

00:55:28,008 --> 00:55:34,248

So one is that, I mean, people are

finally, I think they are starting to

817

00:55:34,248 --> 00:55:37,848

accept the fact that probabilistic

programming with kind of black box

818

00:55:37,848 --> 00:55:41,198

automated inference is the only sensible

way of

819

00:55:41,198 --> 00:55:43,238

doing statistical modeling.

820

00:55:43,238 --> 00:55:47,778

So looking back like 10-15 years ago,

you would still have a lot of people,

821

00:55:47,778 --> 00:55:52,388

maybe not in industry but in research in

different disciplines, in meteorology or

822

00:55:52,388 --> 00:55:53,178

physics or whatever.

823

00:55:53,178 --> 00:55:57,568

People would actually be writing

Metropolis-Hastings algorithms from

824

00:55:57,568 --> 00:56:03,178

scratch, which is simply not reliable in

any sense.

825

00:56:03,178 --> 00:56:08,628

I mean, it took time for them to accept

that yes, we can actually now do it with

826

00:56:08,628 --> 00:56:09,954

something like Stan.

827

00:56:10,188 --> 00:56:16,448

I think this is of course the way that to

an extent that there are problems that fit

828

00:56:16,448 --> 00:56:21,488

well with what something like Stan or

PyMC offers.

829

00:56:21,648 --> 00:56:27,668

I think we've been educating long enough

master students who are kind of familiar

830

00:56:27,668 --> 00:56:28,498

with these concepts.

831

00:56:28,498 --> 00:56:32,868

Once they go to the industry they will use

them, they know roughly how to use them.

832

00:56:32,868 --> 00:56:33,768

So that's one side.

833

00:56:33,768 --> 00:56:36,688

But then the other thing is that I

think...

834

00:56:37,164 --> 00:56:42,084

Especially in many of these predictive

industries, so whether it's marketing or

835

00:56:42,084 --> 00:56:46,014

recommendation or sales or whatever,

people are anyway already doing a lot of

836

00:56:46,014 --> 00:56:48,164

deep learning types of models there.

837

00:56:48,164 --> 00:56:51,944

That's a routine tool in what they do.

838

00:56:52,104 --> 00:56:56,604

And now if we think about that, at least

in my opinion, that these fields are

839

00:56:56,604 --> 00:56:57,964

getting closer to each other.

840

00:56:57,964 --> 00:57:02,344

So we have more and more deep learning

techniques that are, like the variational

841

00:57:02,344 --> 00:57:07,092

autoencoder is a prime example, but it is

ultimately a Bayesian model in itself.

842

00:57:07,116 --> 00:57:12,646

It may actually be that it creeps in,

that all this Bayesian thinking

843

00:57:12,646 --> 00:57:19,346

and reasoning is actually getting into use

by the next generation of these deep

844

00:57:19,346 --> 00:57:20,776

learning techniques that they are doing.

845

00:57:20,776 --> 00:57:24,696

They've been building those models,

they've been figuring out that they cannot

846

00:57:24,696 --> 00:57:29,426

get reliable estimates of uncertainty,

they maybe tried some ensembles or

847

00:57:29,426 --> 00:57:30,380

whatnot.

848

00:57:30,380 --> 00:57:31,870

And they will be following.

849

00:57:31,870 --> 00:57:35,810

So once the tools are out there, there's

good enough tutorials on how to use those.

850

00:57:35,810 --> 00:57:40,280

So they might start using things like,

let's say, Bayesian neural networks or

851

00:57:40,280 --> 00:57:43,200

whatever the latest tool is at that point.

852

00:57:43,200 --> 00:57:48,000

And I think this may be the easiest way

for the industries to do so.

853

00:57:48,000 --> 00:57:52,110

They're not going to go switch back to

very simple classical linear models when

854

00:57:52,110 --> 00:57:53,600

they do their analysis.

855

00:57:53,600 --> 00:57:59,110

But they're going to make their deep

learning solutions Bayesian on some time

856

00:57:59,110 --> 00:57:59,852

scale.

857

00:57:59,852 --> 00:58:02,752

Maybe not tomorrow, but maybe in five

years.

858

00:58:04,152 --> 00:58:06,552

Yeah, that's a very good point.

859

00:58:07,512 --> 00:58:08,932

Yeah, I love that.

860

00:58:09,352 --> 00:58:14,022

And of course, I'm very happy about that,

being one of the actors making the

861

00:58:14,022 --> 00:58:15,432

industry more Bayesian.

862

00:58:16,072 --> 00:58:18,092

So I have a vested interest in these.

863

00:58:18,092 --> 00:58:22,992

But yeah, also, I've seen the same

evolution you were talking about.

864

00:58:23,772 --> 00:58:28,420

Right now, it's not even really an issue

of

865

00:58:28,428 --> 00:58:31,508

convincing people to use these kind of

tools.

866

00:58:31,888 --> 00:58:35,428

I mean, still from time to time, but less

and less.

867

00:58:35,528 --> 00:58:42,458

And now the question is really more in

making those tools more accessible, more

868

00:58:42,458 --> 00:58:49,788

versatile, easier to use, more reliable,

easier to deploy in industry, things like

869

00:58:49,788 --> 00:58:53,368

that, which is a really good point to be

at for sure.

870

00:58:53,548 --> 00:58:56,324

And to some extent, I think it's...

871

00:58:56,490 --> 00:58:59,800

It's an interesting question also from the

perspective of the tools.

872

00:58:59,800 --> 00:59:07,000

So to some extent, it may mean that we

just end up doing a lot of the kind of

873

00:59:07,000 --> 00:59:11,460

Bayesian analysis on top of what we would

now call deep learning frameworks.

874

00:59:11,460 --> 00:59:16,670

And it's going to be, of course, it's

going to be libraries building on top of

875

00:59:16,670 --> 00:59:17,000

those.

876

00:59:17,000 --> 00:59:20,460

So like Pyro is a library building on

PyTorch.

877

00:59:20,460 --> 00:59:26,540

But the syntax is kind of intentionally

similar to what people are

878

00:59:26,540 --> 00:59:30,120

used to in the deep learning type of

modeling.

879

00:59:30,120 --> 00:59:31,420

And this is perfectly fine.

880

00:59:31,420 --> 00:59:35,430

We are anyway using a lot of stochastic

optimization routines in Bayesian

881

00:59:35,430 --> 00:59:37,140

inference and so on.

882

00:59:37,140 --> 00:59:41,960

So they are actually very good tools for

building all kinds of Bayesian models.

883

00:59:41,960 --> 00:59:47,340

And I think this may be the layer where

the industry use happens, that it's going

884

00:59:47,340 --> 00:59:48,620

to be always.

885

00:59:48,620 --> 00:59:52,480

They need the GPU type of scaling and

everything there anyway.

886

00:59:52,640 --> 00:59:56,000

So we're just happy to have our systems

887

00:59:56,044 --> 00:59:59,324

work on top of these libraries.

888

01:00:00,244 --> 01:00:02,644

Yeah, very good point.

889

01:00:03,304 --> 01:00:11,164

And also to come back to one of the points

you've made in passing, where education is

890

01:00:11,164 --> 01:00:12,544

helping a lot with that.

891

01:00:12,544 --> 01:00:19,664

You have been educating now the data

scientists who go in industry.

892

01:00:19,664 --> 01:00:22,704

And I know in Finland, in France, not that

much.

893

01:00:22,704 --> 01:00:25,164

where I am originally from.

894

01:00:25,164 --> 01:00:29,004

But in Finland, I know there is this

really great integration between the

895

01:00:29,004 --> 01:00:31,924

research part, the university and the

industry.

896

01:00:31,924 --> 01:00:37,804

You can really see that in the PhD

positions, in the professorship positions

897

01:00:37,804 --> 01:00:38,584

and stuff like that.

898

01:00:38,584 --> 01:00:41,954

So I think that's really interesting and

that's why I wanted to talk to you about

899

01:00:41,954 --> 01:00:42,864

that.

900

01:00:43,084 --> 01:00:48,694

To go back to the education part, what

challenges and opportunities do you see in

901

01:00:48,694 --> 01:00:53,522

teaching Bayesian machine learning as you

do at the university level?

902

01:00:54,444 --> 01:00:57,624

Yeah, it's challenging.

903

01:00:57,624 --> 01:00:58,914

I must say that.

904

01:00:58,914 --> 01:01:04,004

I mean, especially if we get to the point

of well, Bayesian machine learning.

905

01:01:04,004 --> 01:01:10,244

So it is a combination of two topics that

are somewhat difficult in itself.

906

01:01:10,244 --> 01:01:15,514

So if we want to talk about normalizing

flows and then we want to talk about

907

01:01:15,514 --> 01:01:20,544

statistical properties of estimators or

MCMC convergence.

908

01:01:20,544 --> 01:01:23,788

So they require different kinds of

mathematical

909

01:01:23,788 --> 01:01:30,388

tools, they require a certain level of

expertise on the software, on the

910

01:01:30,388 --> 01:01:31,848

programming side.

911

01:01:31,908 --> 01:01:36,588

So what it means actually is that if we

look at the population of

912

01:01:36,588 --> 01:01:41,578

let's say data science students, we can

always have a lot of people that are

913

01:01:41,578 --> 01:01:44,428

missing background on one of these sites.

914

01:01:44,948 --> 01:01:50,938

So I think this is a difficult topic to

teach.

915

01:01:50,938 --> 01:01:53,676

If it was a small class, it would be fine.

916

01:01:53,676 --> 01:01:57,706

But it appears to be that at least our

students are really excited about these

917

01:01:57,706 --> 01:01:58,026

things.

918

01:01:58,026 --> 01:02:02,876

So I can launch a course explicitly with the

title of Bayesian machine learning,

919

01:02:02,876 --> 01:02:05,976

which is like an advanced level machine

learning course.

920

01:02:05,976 --> 01:02:11,856

And I would still get 60 to 100 students

enrolling on that course.

921

01:02:11,936 --> 01:02:15,426

And then that means that within that

group, there's going to be some CS

922

01:02:15,426 --> 01:02:18,756

students with almost no background on

statistics.

923

01:02:18,756 --> 01:02:21,804

There's going to be some statisticians who

924

01:02:21,804 --> 01:02:28,004

certainly know how to program but they're

not really used to thinking about GPU

925

01:02:28,004 --> 01:02:31,304

acceleration of a very large model.

926

01:02:31,844 --> 01:02:35,314

But it's interesting, I mean it's not an

impossible thing.

927

01:02:35,314 --> 01:02:41,704

I think it is also a topic that you can

kind of teach on a sufficient level for

928

01:02:41,704 --> 01:02:42,584

everyone.

929

01:02:42,584 --> 01:02:46,584

So everyone is able to understand

the basic reasoning of why we are doing

930

01:02:46,584 --> 01:02:47,944

these things.

931

01:02:48,244 --> 01:02:50,820

Some of the students may struggle,

932

01:02:50,924 --> 01:02:53,304

figuring out all the math behind it.

933

01:02:53,304 --> 01:02:57,904

But they might still be able to use these

tools very nicely.

934

01:02:57,904 --> 01:03:01,434

They might be able to say that if I do

this and that kind of modification, I

935

01:03:01,434 --> 01:03:04,584

realize that my estimates are better

calibrated.

936

01:03:04,964 --> 01:03:09,604

And some others are really then going

deeper into figuring out why these things

937

01:03:09,604 --> 01:03:10,204

work.

938

01:03:10,204 --> 01:03:16,224

So it just needs a bit of creativity on

how do we do it and what do we expect from

939

01:03:16,224 --> 01:03:17,024

the students.

940

01:03:17,024 --> 01:03:20,644

What should they know once they've

completed a course like this?

941

01:03:20,684 --> 01:03:24,024

Yeah, that makes sense.

942

01:03:27,304 --> 01:03:32,564

Have you also seen an increase in the

number of students in the recent years?

943

01:03:32,964 --> 01:03:37,924

Well, we get as many students as we can

take.

944

01:03:37,924 --> 01:03:43,384

So I mean, it's actually been for quite a

while already that in our university, by

945

01:03:43,384 --> 01:03:44,780

far the most...

946

01:03:44,780 --> 01:03:50,690

popular master's programs and bachelor's

programs are essentially data science and

947

01:03:50,690 --> 01:03:52,040

computer science.

948

01:03:52,080 --> 01:03:55,140

So we can't take in everyone we would

want.

949

01:03:55,140 --> 01:03:59,020

So it actually looks to us that it's more

or less like a stable number of students,

950

01:03:59,020 --> 01:04:04,680

but it's always been a large number since

we launched, for example, the data science

951

01:04:04,680 --> 01:04:05,220

program.

952

01:04:05,220 --> 01:04:07,500

So it went up very fast.

953

01:04:07,500 --> 01:04:09,780

So there's definitely interest.

954

01:04:10,180 --> 01:04:10,680

Yeah.

955

01:04:10,680 --> 01:04:10,900

Yeah.

956

01:04:10,900 --> 01:04:12,240

That's fantastic.

957

01:04:12,620 --> 01:04:13,860

And...

958

01:04:14,220 --> 01:04:16,180

So I've been taking a lot of your time.

959

01:04:16,180 --> 01:04:20,200

So we're going to start to close up the

show, but there are at least two questions

960

01:04:20,200 --> 01:04:23,980

I want to get your insight on.

961

01:04:24,680 --> 01:04:29,080

And the first one is, what do you think

the biggest hurdle in the Bayesian

962

01:04:29,080 --> 01:04:30,480

workflow currently is?

963

01:04:30,480 --> 01:04:35,040

We've talked about that a bit already, but

I'd still like to get your structured

964

01:04:35,040 --> 01:04:36,020

answer.

965

01:04:38,200 --> 01:04:43,688

Well, I think the first thing is that

getting people to actually start

966

01:04:43,688 --> 01:04:46,348

using more or less systematic workflows.

967

01:04:46,348 --> 01:04:48,468

I mean, the idea is great.

968

01:04:48,548 --> 01:04:56,018

We kind of know more or less how we should

be thinking about it, but it's a very

969

01:04:56,018 --> 01:04:57,588

complex object.

970

01:04:58,648 --> 01:05:04,268

So we're going to be able to tell experts,

statisticians that, yes, this is roughly

971

01:05:04,268 --> 01:05:05,118

how you should do it.

972

01:05:05,118 --> 01:05:10,478

Then we should still also convince them

that, like, almost force them to stick to

973

01:05:10,478 --> 01:05:11,186

it.

974

01:05:11,500 --> 01:05:15,850

But then especially if we then think about

newcomers, people who are just starting

975

01:05:15,850 --> 01:05:20,020

with these things, it's a very complicated

thing.

976

01:05:20,020 --> 01:05:24,320

So if you would need to read a 50-page book

or a 100-page book about the Bayesian workflow

977

01:05:24,320 --> 01:05:27,750

to even know how to do it, it's a

technical challenge.

978

01:05:27,750 --> 01:05:35,820

So I think in long term, we are going to

get essentially tools for assisting it.

979

01:05:35,820 --> 01:05:38,860

So really kind of streamlining the

process.

980

01:05:38,860 --> 01:05:44,300

thinking of something like an AI assistant

for a person building a model that they

981

01:05:44,300 --> 01:05:50,420

really kind of prompts you that now I see

that you are trying to go there and do

982

01:05:50,420 --> 01:05:54,280

this, but I see that you haven't done

prior predictive checks.

983

01:05:54,380 --> 01:05:56,940

I actually already created some plots for

you.

984

01:05:56,940 --> 01:06:01,060

Please take a look at these and confirm

that is this what you were expecting?

985

01:06:01,180 --> 01:06:05,900

And it's going to be a lot of effort in

creating those.

986

01:06:05,900 --> 01:06:08,820

It's something that we've been kind of

trying to think about.

987

01:06:08,940 --> 01:06:10,888

how to do it, but it's still.

988

01:06:13,036 --> 01:06:15,416

I think that's where the challenge is.

989

01:06:15,416 --> 01:06:19,966

We know most of the stuff within the

workflow, roughly how it should be done.

990

01:06:19,966 --> 01:06:22,236

At least we have good enough solutions.

991

01:06:22,836 --> 01:06:29,056

But then really kind of helping people to

actually follow these principles, that's

992

01:06:29,056 --> 01:06:30,516

gonna be hard.

993

01:06:31,336 --> 01:06:32,496

Yeah, yeah, yeah.

994

01:06:32,496 --> 01:06:34,856

But damn, that would be super cool.

995

01:06:34,856 --> 01:06:40,336

Like talking about something like a Jarvis,

you know, like the AI assistant

996

01:06:40,336 --> 01:06:42,316

environment, a Jarvis, but for...

997

01:06:42,316 --> 01:06:45,656

Bayesian models, how cool would that be?

998

01:06:45,656 --> 01:06:47,216

Love that.

999

01:06:48,636 --> 01:06:54,476

And looking forward, how do you see

Bayesian methods evolving with artificial

Speaker:

01:06:54,476 --> 01:06:56,198

intelligence research?

Speaker:

01:06:58,284 --> 01:07:00,728

Yeah, I think.

Speaker:

01:07:02,476 --> 01:07:06,356

For quite a while I was about to say that,

like I've been kind of building this basic

Speaker:

01:07:06,356 --> 01:07:10,486

idea that the deep learning models as such

will become more and more Bayesian in a

Speaker:

01:07:10,486 --> 01:07:10,976

way.

Speaker:

01:07:10,976 --> 01:07:13,276

So that's kind of a given.

Speaker:

01:07:13,276 --> 01:07:19,916

But now of course, now the recent very

large scale AI models, they're getting so

Speaker:

01:07:19,916 --> 01:07:25,656

big that then the question of

computational resources is, it's a major

Speaker:

01:07:25,656 --> 01:07:31,658

hurdle to do learning for those models,

even in the crudest possible way.

Speaker:

01:07:31,658 --> 01:07:37,678

So it may be that there's of course kind

of clear needs for uncertainty

Speaker:

01:07:37,678 --> 01:07:41,258

quantification in the large language model

type of scope.

Speaker:

01:07:41,258 --> 01:07:43,088

They are really kind of unreliable.

Speaker:

01:07:43,088 --> 01:07:47,558

They're really poor at, for example,

evaluating their own confidence.

Speaker:

01:07:47,558 --> 01:07:52,228

So there's been some examples that if you

ask how sure it is about these statements,

Speaker:

01:07:52,228 --> 01:07:55,238

more or less irrespective of the

statement, it gives a similar number.

Speaker:

01:07:55,238 --> 01:07:56,388

Yeah, 50% sure.

Speaker:

01:07:56,388 --> 01:07:57,888

I don't know.

Speaker:

01:07:58,708 --> 01:08:01,412

So it may be that the

Speaker:

01:08:01,580 --> 01:08:05,150

At least in the very short run, it's not

going to be the Bayesian

Speaker:

01:08:05,150 --> 01:08:10,040

techniques that really solve all the

uncertainty quantification in those type

Speaker:

01:08:10,040 --> 01:08:10,400

of models.

Speaker:

01:08:10,400 --> 01:08:13,560

In the long term, it maybe is.

Speaker:

01:08:13,560 --> 01:08:15,140

But I think there's a lot of...

Speaker:

01:08:15,140 --> 01:08:16,380

It's going to be interesting.

Speaker:

01:08:16,380 --> 01:08:21,480

It looks to me a bit that it's a lot of

stuff that's built on top of...

Speaker:

01:08:21,480 --> 01:08:27,058

To address specific limitations of these

large language models, it is...

Speaker:

01:08:27,058 --> 01:08:28,368

separate components.

Speaker:

01:08:28,368 --> 01:08:32,468

It's some sort of an external tool that

reads in those inputs or it's an external

Speaker:

01:08:32,468 --> 01:08:35,088

tool that the LLM can use.

Speaker:

01:08:35,248 --> 01:08:39,298

So maybe this is going to be this kind of

a separate element that somehow

Speaker:

01:08:39,298 --> 01:08:40,228

integrates.

Speaker:

01:08:40,228 --> 01:08:50,088

So an LLM, of course, could be having an

API interface where it can query, let's

Speaker:

01:08:50,088 --> 01:08:51,948

say, use Stan,

Speaker:

01:08:51,948 --> 01:08:56,418

to figure out an answer to a type of

question that requires probabilistic

Speaker:

01:08:56,418 --> 01:08:57,328

reasoning.

Speaker:

01:08:57,328 --> 01:09:03,248

So people have been plugging things in;

there are famous public examples where you can

Speaker:

01:09:03,248 --> 01:09:07,108

query like some mathematical reasoning

engines and so on.

Speaker:

01:09:07,108 --> 01:09:11,058

So that the LLM, if you ask a specific

type of question, it goes outside of its

Speaker:

01:09:11,058 --> 01:09:13,128

own realm and does something.

Speaker:

01:09:13,248 --> 01:09:17,578

It already kind of knows how to program,

so maybe we just need to teach LLMs to do

Speaker:

01:09:17,578 --> 01:09:19,214

statistical inference.

Speaker:

01:09:19,596 --> 01:09:24,716

by relying on actually running an MCMC

algorithm on a model that they kind of

Speaker:

01:09:24,716 --> 01:09:26,596

specify together with the user.
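To make that idea concrete for readers, here is a rough, entirely hypothetical sketch of such a "tool": instead of calling out to a real engine like Stan, `run_mcmc` runs a tiny random-walk Metropolis sampler on a toy coin-bias model of the kind an LLM and a user might specify together.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical external tool an LLM could call for probabilistic reasoning.
# Toy model: coin bias p with a flat prior, after observing `heads` out of
# `flips` tosses; sampled with random-walk Metropolis (a stand-in for a
# real inference engine such as Stan or PyMC).
def run_mcmc(heads, flips, n_samples=5000):
    def log_post(p):
        if not 0.0 < p < 1.0:
            return -np.inf  # flat prior on (0, 1)
        return heads * np.log(p) + (flips - heads) * np.log(1.0 - p)

    p = 0.5
    lp = log_post(p)
    samples = []
    for _ in range(n_samples):
        prop = p + rng.normal(0.0, 0.1)           # random-walk proposal
        lp_prop = log_post(prop)
        if np.log(rng.uniform()) < lp_prop - lp:  # Metropolis accept step
            p, lp = prop, lp_prop
        samples.append(p)
    return np.array(samples)

draws = run_mcmc(heads=7, flips=10)
print(draws.mean())  # should sit near the analytic posterior mean, 8/12
```

An LLM with tool access could hand a question like "how sure should I be the coin is biased?" to this kind of sampler and report the resulting posterior instead of guessing a confidence number.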

Speaker:

01:09:26,596 --> 01:09:29,086

I don't know whether anyone is actually

working on that.

Speaker:

01:09:29,086 --> 01:09:31,246

It's something that just came to my mind.

Speaker:

01:09:31,246 --> 01:09:34,096

So I haven't really thought about this too

much.

Speaker:

01:09:35,436 --> 01:09:41,436

Yeah, but again, we're getting so many PhD

ideas for people right now.

Speaker:

01:09:41,436 --> 01:09:42,576

We are.

Speaker:

01:09:42,576 --> 01:09:48,684

Yeah, I feel like we should be doing the

best of all your...

Speaker:

01:09:48,684 --> 01:09:50,564

Awesome PhD ideas.

Speaker:

01:09:51,804 --> 01:09:52,194

Awesome.

Speaker:

01:09:52,194 --> 01:09:59,324

Well, I still have so many questions for

you, but let's go to the show because I

Speaker:

01:09:59,324 --> 01:10:01,064

don't want to take too much of your time.

Speaker:

01:10:01,064 --> 01:10:02,884

I know it's getting late in Finland.

Speaker:

01:10:02,884 --> 01:10:07,344

So let's close up the show and ask you the

last two questions.

Speaker:

01:10:07,344 --> 01:10:10,124

I always ask at the end of the show.

Speaker:

01:10:10,124 --> 01:10:14,814

First one, if you had unlimited time and

resources, which problem would you try to

Speaker:

01:10:14,814 --> 01:10:15,684

solve?

Speaker:

01:10:16,780 --> 01:10:17,720

Let's see.

Speaker:

01:10:17,720 --> 01:10:23,760

The lazy answer is that I am now trying to

get unlimited resources, well, not

Speaker:

01:10:23,760 --> 01:10:28,160

unlimited resources, but I'm really trying

to tackle this prior elicitation question.

Speaker:

01:10:28,160 --> 01:10:32,900

I think most of the other parts of the

Bayesian workflow are kind of, we have

Speaker:

01:10:32,900 --> 01:10:36,750

reasonably good solutions for those, but

this whole question of really how to

Speaker:

01:10:36,750 --> 01:10:42,926

figure out complex multivariate priors

over arbitrarily complex models.

Speaker:

01:10:42,988 --> 01:10:47,348

That's a very practical thing that I am

investing in.

Speaker:

01:10:47,388 --> 01:10:51,588

But maybe if I'm kind of taking, if it

really is infinite, then maybe I could

Speaker:

01:10:51,588 --> 01:10:55,948

actually continue on the quick idea that

we just talked about.

Speaker:

01:10:55,948 --> 01:11:01,278

That I mean really getting this

probabilistic reasoning at the core of

Speaker:

01:11:01,278 --> 01:11:04,638

these large language model type of AI

applications.

Speaker:

01:11:04,638 --> 01:11:13,188

That it would really be reliably providing

proper probabilistic judgments on the

Speaker:

01:11:13,228 --> 01:11:17,048

kind of decision-making reasoning

problems that we ask from them.

Speaker:

01:11:17,148 --> 01:11:18,988

So that would be interesting.

Speaker:

01:11:19,308 --> 01:11:19,528

Yeah.

Speaker:

01:11:19,528 --> 01:11:21,668

Yeah, for sure.

Speaker:

01:11:22,748 --> 01:11:26,808

And second question, if you could have

dinner with any great scientific mind,

Speaker:

01:11:26,808 --> 01:11:29,988

dead or alive or fictional, who would it

be?

Speaker:

01:11:30,328 --> 01:11:34,228

Yes, this is something I actually thought

about, because I figured you would be

Speaker:

01:11:34,228 --> 01:11:36,208

asking me as well.

Speaker:

01:11:36,248 --> 01:11:39,708

And I chose a fictional character.

Speaker:

01:11:39,708 --> 01:11:41,238

I like fictional characters.

Speaker:

01:11:41,238 --> 01:11:43,052

So I went with...

Speaker:

01:11:43,052 --> 01:11:48,232

Daniel Waterhouse from Neal Stephenson's

The Baroque Cycle books.

Speaker:

01:11:48,492 --> 01:11:50,772

So they are kind of semi-historical

books.

Speaker:

01:11:50,772 --> 01:11:57,352

So they talk about the era where Isaac

Newton and others are kind of living and

Speaker:

01:11:57,352 --> 01:11:59,372

establishing the Royal Society.

Speaker:

01:11:59,372 --> 01:12:03,872

And there's a lot of high fantasy

components involved.

Speaker:

01:12:04,132 --> 01:12:12,970

And Daniel Waterhouse in those novels is

the roommate of Isaac Newton and a friend

Speaker:

01:12:12,980 --> 01:12:14,840

of Gottfried Leibniz.

Speaker:

01:12:14,840 --> 01:12:20,250

So he knows both sides of this great

debate on who invented calculus and who

Speaker:

01:12:20,250 --> 01:12:21,600

copied whom.

Speaker:

01:12:21,600 --> 01:12:27,020

So if I had a dinner with him, I would get

to talk about these innovations that I

Speaker:

01:12:27,020 --> 01:12:29,840

think are one of the foundational ones.

Speaker:

01:12:29,840 --> 01:12:34,170

But I wouldn't actually need to get

involved with either party.

Speaker:

01:12:34,170 --> 01:12:39,020

I wouldn't need to choose sides, whether

it's Isaac or Gottfried that I would be

Speaker:

01:12:39,020 --> 01:12:40,200

talking to.

Speaker:

01:12:41,164 --> 01:12:42,344

Love it.

Speaker:

01:12:42,344 --> 01:12:43,704

Yeah, love that answer.

Speaker:

01:12:43,704 --> 01:12:47,204

Make sure to record that dinner and post

it on YouTube.

Speaker:

01:12:47,204 --> 01:12:50,564

I'm pretty sure lots of people will be

interested in it.

Speaker:

01:12:50,564 --> 01:12:51,334

Fantastic.

Speaker:

01:12:51,334 --> 01:12:51,804

Thanks.

Speaker:

01:12:51,804 --> 01:12:53,284

Thanks a lot, Arto.

Speaker:

01:12:53,644 --> 01:12:56,184

That was a great discussion.

Speaker:

01:12:56,184 --> 01:13:01,214

Really happy we could go through the,

well, not the whole depth of what you do

Speaker:

01:13:01,214 --> 01:13:04,204

because you do so many things, but a good

chunk of it.

Speaker:

01:13:04,204 --> 01:13:06,114

So I'm really happy about that.

Speaker:

01:13:06,114 --> 01:13:08,108

As usual,

Speaker:

01:13:08,108 --> 01:13:12,008

I'll put resources and a link to your

website in the show notes for those who

Speaker:

01:13:12,008 --> 01:13:13,188

want to dig deeper.

Speaker:

01:13:13,288 --> 01:13:17,148

Thank you again, Arto, for taking the time

and being on this show.

Speaker:

01:13:18,348 --> 01:13:19,348

Thank you very much.

Speaker:

01:13:19,348 --> 01:13:20,718

It was my pleasure.

Speaker:

01:13:20,718 --> 01:13:22,856

I really enjoyed the discussion.

Speaker:

01:13:26,796 --> 01:13:30,496

This has been another episode of Learning

Bayesian Statistics.

Speaker:

01:13:30,496 --> 01:13:35,486

Be sure to rate, review, and follow the

show on your favorite podcatcher, and

Speaker:

01:13:35,486 --> 01:13:40,376

visit learnbayesstats.com for more

resources about today's topics, as well as

Speaker:

01:13:40,376 --> 01:13:45,116

access to more episodes to help you reach

true Bayesian state of mind.

Speaker:

01:13:45,116 --> 01:13:47,076

That's learnbayesstats.com.

Speaker:

01:13:47,076 --> 01:13:51,916

Our theme music is Good Bayesian by Baba

Brinkman, feat. MC Lars and Mega Ran.

Speaker:

01:13:51,916 --> 01:13:55,036

Check out his awesome work at bababrinkman.com.

Speaker:

01:13:55,036 --> 01:13:56,234

I'm your host.

Speaker:

01:13:56,234 --> 01:13:57,184

Alex Andorra.

Speaker:

01:13:57,184 --> 01:14:01,464

You can follow me on Twitter at alex_andorra, like the country.

Speaker:

01:14:01,464 --> 01:14:06,524

You can support the show and unlock

exclusive benefits by visiting patreon

Speaker:

01:14:06,524 --> 01:14:08,704

.com slash LearnBayesStats.

Speaker:

01:14:08,704 --> 01:14:11,144

Thank you so much for listening and for

your support.

Speaker:

01:14:11,144 --> 01:14:17,034

You're truly a good Bayesian change your

predictions after taking information in. And

Speaker:

01:14:17,034 --> 01:14:20,324

if you're thinking I'll be less than amazing,

Speaker:

01:14:20,364 --> 01:14:26,304

Let me show you how to be a good Bayesian.

Speaker:

01:14:26,304 --> 01:14:29,844

Change calculations after taking fresh

data in.

Speaker:

01:14:29,844 --> 01:14:33,124

Those predictions that your brain is

making.

Speaker:

01:14:33,124 --> 01:14:36,624

Let's get them on a solid foundation.
