*Proudly sponsored by **PyMC Labs**, the Bayesian Consultancy. **Book a call**, or **get in touch**!*

Changing perspective is often a great way to solve burning research problems. Riemannian spaces are such a perspective change, as Arto Klami, an Associate Professor of computer science at the University of Helsinki and member of the Finnish Center for Artificial Intelligence, will tell us in this episode.

He explains the concept of Riemannian spaces, their application in inference algorithms, how they can help sampling Bayesian models, and their similarity with normalizing flows, which we discussed in episode 98.

Arto also introduces PreliZ, a tool for prior elicitation, and highlights its benefits in simplifying the process of setting priors, thus improving the accuracy of our models.

When Arto is not solving mathematical equations, you’ll find him cycling, or around a good board game.

*Our theme music is « Good Bayesian », by Baba Brinkman (feat MC Lars and Mega Ran). Check out his awesome work at **https://bababrinkman.com/** !*

**Thank you to my Patrons for making this episode possible!**

*Yusuke Saito, Avi Bryant, Ero Carrera, Giuliano Cruz, Tim Gasser, James Wade, Tradd Salvo, William Benton, James Ahloy, Robin Taylor, Chad Scherrer, Zwelithini Tunyiswa, Bertrand Wilden, James Thompson, Stephen Oates, Gian Luca Di Tanna, Jack Wells, Matthew Maldonado, Ian Costley, Ally Salim, Larry Gill, Ian Moran, Paul Oreto, Colin Caprani, Colin Carroll, Nathaniel Burbank, Michael Osthege, Rémi Louf, Clive Edelsten, Henri Wallen, Hugo Botha, Vinh Nguyen, Marcin Elantkowski, Adam C. Smith, Will Kurt, Andrew Moskowitz, Hector Munoz, Marco Gorelli, Simon Kessell, Bradley Rode, Patrick Kelley, Rick Anderson, Casper de Bruin, Philippe Labonde, Michael Hankin, Cameron Smith, Tomáš Frýda, Ryan Wesslen, Andreas Netti, Riley King, Yoshiyuki Hamajima, Sven De Maeyer, Michael DeCrescenzo, Fergal M, Mason Yahr, Naoya Kanai, Steven Rowland, Aubrey Clayton, Jeannine Sue, Omri Har Shemesh, Scott Anthony Robson, Robert Yolken, Or Duek, Pavel Dusek, Paul Cox, Andreas Kröpelin, Raphaël R, Nicolas Rode, Gabriel Stechschulte, Arkady, Kurt TeKolste, Gergely Juhasz, Marcus Nölke, Maggi Mackintosh, Grant Pezzolesi, Avram Aelony, Joshua Meehl, Javier Sabio, Kristian Higgins, Alex Jones, Gregorio Aguilar, Matt Rosinski, Bart Trudeau, Luis Fonseca, Dante Gates, Matt Niccolls, Maksim Kuznecov, Michael Thomas, Luke Gorrie, Cory Kiser and Julio*.

Visit https://www.patreon.com/learnbayesstats to unlock exclusive Bayesian swag 😉

**Takeaways**:

– Riemannian spaces offer a way to improve computational efficiency and accuracy in Bayesian inference by considering the curvature of the posterior distribution.

– Riemannian spaces can be used in Laplace approximation and Markov chain Monte Carlo algorithms to better model the posterior distribution and explore challenging areas of the parameter space.

– Normalizing flows are a complementary approach to Riemannian spaces, using non-linear transformations to warp the parameter space and improve sampling efficiency.

– Evaluating the performance of Bayesian inference algorithms in challenging cases is a current research challenge, and more work is needed to establish benchmarks and compare different methods.

– PreliZ is a package for prior elicitation in Bayesian modeling that facilitates communication with users through visualizations of predictive and parameter distributions.

– Careful prior specification is important, and tools like PreliZ make the process easier and more reproducible.

– Teaching Bayesian machine learning is challenging due to the combination of statistical and programming concepts, but it is possible to teach the basic reasoning behind Bayesian methods to a diverse group of students.

– The integration of Bayesian approaches in data science workflows is becoming more accepted, especially in industries that already use deep learning techniques.

– The future of Bayesian methods in AI research may involve the development of AI assistants for Bayesian modeling and probabilistic reasoning.

**Chapters**:

00:00 Introduction and Background

02:05 Arto’s Work and Background

06:05 Introduction to Bayesian Inference

12:46 Riemannian Spaces in Bayesian Inference

27:24 Availability of Riemannian-based Algorithms

30:20 Practical Applications and Evaluation

37:33 Introduction to PreliZ

38:03 Prior Elicitation

39:01 Predictive Elicitation Techniques

39:30 PreliZ: Interface with Users

40:27 PreliZ: General Purpose Tool

41:55 Getting Started with PreliZ

42:45 Challenges of Setting Priors

45:10 Reproducibility and Transparency in Priors

46:07 Integration of Bayesian Approaches in Data Science Workflows

55:11 Teaching Bayesian Machine Learning

01:06:13 The Future of Bayesian Methods with AI Research

01:10:16 Solving the Prior Elicitation Problem

**Links from the show:**

- LBS #29, Model Assessment, Non-Parametric Models, And Much More, with Aki Vehtari: https://learnbayesstats.com/episode/model-assessment-non-parametric-models-aki-vehtari/
- LBS #20 Regression and Other Stories, with Andrew Gelman, Jennifer Hill & Aki Vehtari: https://learnbayesstats.com/episode/20-regression-and-other-stories-with-andrew-gelman-jennifer-hill-aki-vehtari/
- LBS #98 Fusing Statistical Physics, Machine Learning & Adaptive MCMC, with Marylou Gabrié: https://learnbayesstats.com/episode/98-fusing-statistical-physics-machine-learning-adaptive-mcmc-marylou-gabrie/
- Arto’s website: https://www.cs.helsinki.fi/u/aklami/
- Arto on Google Scholar: https://scholar.google.com/citations?hl=en&user=v8PeLGgAAAAJ
- Multi-source probabilistic inference Group: https://www.helsinki.fi/en/researchgroups/multi-source-probabilistic-inference
- FCAI web page: https://fcai.fi
- Probabilistic AI summer school lectures: https://www.youtube.com/channel/UCcMwNzhpePJE3xzOP_3pqsw
- Keynote: “Better priors for everyone” by Arto Klami: https://www.youtube.com/watch?v=mEmiEHsfWyc&ab_channel=ProbabilisticAISchool
- Variational Inference and Optimization I by Arto Klami: https://www.youtube.com/watch?v=60USDNc1nE8&list=PLRy-VW__9hV8s–JkHXZvnd26KgjRP2ik&index=3&ab_channel=ProbabilisticAISchool
- PreliZ, A tool-box for prior elicitation: https://preliz.readthedocs.io/en/latest/
- AISTATS paper that presents the new computationally efficient metric in the context of MCMC: https://researchportal.helsinki.fi/en/publications/lagrangian-manifold-monte-carlo-on-monge-patches
- TMLR paper that scales up the solution for larger models, using the metric for sampling-based inference in deep learning: https://openreview.net/pdf?id=dXAuvo6CGI
- Riemannian Laplace approximation (to appear in AISTATS’24): https://arxiv.org/abs/2311.02766
- Prior Knowledge Elicitation: The Past, Present, and Future: https://projecteuclid.org/journals/bayesian-analysis/advance-publication/Prior-Knowledge-Elicitation-The-Past-Present-and-Future/10.1214/23-BA1381.full

**Transcript**

*This is an automatic transcript and may therefore contain errors. Please **get in touch** if you’re willing to correct them.*


Let me show you how to be a good Bayesian...

...how they can help sampling Bayesian models, and their similarity with normalizing flows, which we discussed in episode 98. Arto also introduces PreliZ, a tool for prior elicitation, and highlights its benefits in simplifying the process of setting priors, thus improving the accuracy of our models. When Arto is not solving mathematical equations, you'll find him cycling or around a good board game.

This is Learning Bayesian Statistics, episode 103, recorded February 15, 2024.

Welcome to Learning Bayesian Statistics, a podcast about Bayesian inference, the methods, the projects, and the people who make it possible. I'm your host. You can follow me on Twitter at alex_andorra, like the country, for any info about the show. LearnBayesStats.com is Laplace to be: show notes, becoming a corporate sponsor, unlocking Bayesian merch, supporting the show on Patreon, everything is in there. That's LearnBayesStats.com. If you're interested in one-on-one mentorship, online courses, or statistical consulting, feel free to reach out and book a call at topmate.io/alex_andorra. See you around, folks, and best wishes to you all.

Arto Klami, welcome to Learning Bayesian Statistics.

Thank you.

You're welcome. How was my Finnish pronunciation?

Oh, I think that was excellent.

For people who don't have the video, I don't think that was true. So thanks a lot for taking the time, Arto. I'm really happy to have you on the show. I've had a lot of questions for you for a long time, and the longer we postponed the episode, the more questions. So I'm gonna do my best to not take three hours of your time. Let's start by maybe defining the work you're doing nowadays, and, well, how did you end up working on this?

Yes, sure. So I personally identify as a machine learning researcher. I do machine learning research, but very much from a Bayesian perspective. My original background is in computer science; I'm essentially a self-educated statistician, in the sense that I've never really properly studied statistics, well, except for a few courses here and there. But I've been building models and algorithms on the Bayesian principles for addressing various kinds of machine learning problems.

So you're basically like a self-taught statistician through machine learning, let's say.

More or less, yes. I think the first things I started doing with anything that had to do with Bayesian statistics was pretty much already going to the deep end and trying to learn posterior inference for fairly complicated models, even actually non-parametric models in some ways.

Yeah, we're going to dive a bit into that. Before that, can you tell us the topics you are particularly focusing on within that umbrella of topics you've named?

Yes, absolutely. So I think I actually have a few somewhat distinct areas of interest. On one hand, I'm working really on the kind of core inference problem: how do we computationally efficiently, and accurately enough, approximate the posterior distributions? Recently, we've been especially working on inference algorithms that build on concepts from Riemannian geometry. So we're trying to really account for the actual manifold induced by this posterior distribution, and try to somehow utilize these concepts to speed up inference. That's one very technical aspect. Then the other main theme, on the kind of Bayesian side, is priors. So we've been working on prior elicitation: how do we actually go about specifying the prior distributions, and ideally maybe not even specifying them ourselves? How would we extract that knowledge from a domain expert who doesn't necessarily even have any sort of statistical training? And how do we flexibly represent their true beliefs and then encode them as part of a model? Those are maybe the main technical aspects there.

Yeah, no, super fun. And we're definitely going to dive into those two aspects a bit later in the show; I'm really interested in that. Before that, do you remember how you first got introduced to Bayesian inference, and also why it stuck with you?

Yeah, like I said, I'm in some sense self-trained. Coming with the computer science background, more or less sometime during my PhD, I was working in a research group that was led by Samuel Kaski. When I joined the group, we were working on neural networks of the kind that people were interested in back then; that was like 20 years ago. So we were working on things like self-organizing maps and those kinds of methods. And then we started working on applications where we really bumped into small sample size problems: looking at DNA microarray data with tens of thousands of dimensions, and medical applications with 20 samples. So we essentially figured out that we were going to need to take uncertainty into account properly. We started working on the Bayesian modeling side of these, and one of the very first things I was doing was trying to create Bayesian versions of some classical analysis methods, especially canonical correlation analysis, whose original derivation is like an information-theoretic formulation. So I dove directly into this: let's do Bayesian versions of models. But I actually do remember that around the same time I also took a course by Aki Vehtari. He's one of the authors of this Gelman et al. book. I think the first version of the book had been released just before that, so Aki was giving a course where he was teaching based on that book. And I think that's the first real official contact on trying to understand the actual details behind the principles.

Yeah, and actually I'm pretty sure listeners are familiar with Aki. He's been on the show already, so I'll link to the episode, of course, where Aki was. And yeah, for sure, I also recommend going through those episodes' show notes for people who are interested in, well, starting to learn about Bayesian stuff and things like that. Something I'm wondering from what you just explained: so you define yourself as a machine learning researcher, right? And you work in artificial intelligence too. But there is this interaction with the Bayesian framework. How does that framework underpin your research in statistical machine learning and artificial intelligence? How does that all combine?

Yeah, well, that's a broad topic. There's of course a lot in that intersection. I personally do view all learning problems in some sense from a Bayesian perspective. No matter whether it's a very simple fitting-a-linear-regression type of problem, or whether it's figuring out the parameters of a neural network with 1 billion parameters, it's ultimately still a statistical inference problem. In most of the cases, I'm quite confident that we can't figure out the parameters exactly; we need to somehow quantify the uncertainty, and I'm not really aware of any other principled way of doing it. So I would just think about it as: we're always doing Bayesian inference in some sense. But then there's the issue of how far we can go in practice. So it's going to be approximate; it's possibly going to be very crude approximations. But I would still view it through the lens of Bayesian statistics in my own work. And that's what I do when I teach my BSc students, for example. Not all of them explicitly formulate the learning algorithms from these perspectives, but we are still talking about what the relationship is, what we can assume about the algorithms, what we can assume about the result, and how it would relate to properly estimating everything exactly how it should be done.

Yeah, okay, that's an interesting perspective, basically putting all of that in that framework. And that makes me think: what do you believe the impact of Bayesian machine learning is on the broader field of AI? What does it bring to that field?

Let's say it has a very big impact, in the sense that it touches pretty much most of the stuff that is happening on the machine learning front, and hence also on all the learning-based AI solutions. I think a lot of people are thinking about it roughly in the same way as I am: that there is an underlying learning problem that we would ideally want to solve more or less following exactly the Bayesian principles. They don't necessarily talk about it from this perspective; you might be happy to write algorithms where all the justification for the choices you make comes from somewhere else. But I think a lot of people are accepting the probabilistic basis of these. For instance, if you think about the objectives that people are optimizing in deep learning, they're all essentially likelihoods of some assumed probabilistic model. Most of the regularizers they are considering do have an interpretation as some kind of a prior distribution. And I think a lot of people are all the time going deeper and deeper into actually explicitly thinking about it from these perspectives. So we have a lot of these deep learning type of approaches, variational autoencoders, Bayesian neural networks, various kinds of generative AI models, that are actually even explicitly formulated as probabilistic models plus some sort of an approximate inference scheme. So I think these things are two sides of the same coin, and people are more and more thinking about them from the same perspective.

Okay, yeah, that's super interesting. Actually, let's start diving into these topics from a more technical perspective. So you've mentioned the research and advances you are working on regarding Riemannian spaces. I think it'd be super fun to talk about that, because we've never really talked about it on the show. So maybe, can you give listeners a primer on what a Riemannian space is? Why would you even care about that? And what are you doing in this regard, what is your research in this regard?

Yes, let's try. I mean, this is a bit of a mathematical concept to talk about. But ultimately, if you think about most of the learning algorithms, we are thinking that there are some parameters that live in some space. And essentially, without thinking about it, we just assume that it's a Euclidean space, in the sense that we can measure distances between two parameters, how similar they are, and it doesn't matter which direction we go: if the distance is the same, we think they are equally far away. Now, a Riemannian geometry is one that is curved in some sense. So we may be stretching the space in certain ways, and we do this stretching locally. What it actually means, for example, is that the shortest path between two possible values, say two parameter configurations, if you start interpolating between them, is a shortest path in this Riemannian geometry, which is not necessarily a straight line in the underlying Euclidean space. So that's what Riemannian geometry is in general: the tools and machinery we need to work with these kinds of settings. And now the relationship to statistical inference comes from trying to define a Riemannian space that has somehow nice characteristics. Maybe the concept that most people might actually be aware of would be the Fisher information matrix, which characterizes the curvature induced by a particular probabilistic model.
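To make the terms above concrete, here is the standard formalization (textbook background, not a claim from the episode). A Riemannian metric is a position-dependent inner product given by a metric tensor $G(\theta)$, and the length of a path $\gamma$ through parameter space is

$$
L(\gamma) = \int_0^1 \sqrt{\dot{\gamma}(t)^\top G(\gamma(t))\, \dot{\gamma}(t)}\; dt,
$$

with Euclidean geometry recovered when $G(\theta) = I$; a geodesic, the "shortest path" Arto mentions, is a path minimizing this length. The Fisher information metric he refers to is

$$
G(\theta) = \mathbb{E}_{x \sim p(x \mid \theta)}\!\left[\nabla_\theta \log p(x \mid \theta)\, \nabla_\theta \log p(x \mid \theta)^\top\right],
$$

which stretches distances in directions where the model's predictions are most sensitive to the parameters.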

These tools then allow, for example, a very recent thing that we did, coming out later this spring at AISTATS: an extension of the Laplace approximation to a Riemannian geometry. For those of you who know what the Laplace approximation is, it's essentially just fitting a normal distribution at the mode of a distribution. But if we now fit the same normal distribution in a suitably chosen Riemannian space, we can actually model also the curvature of the posterior around the mode, and even how it stretches. So we get a more flexible approximation: we are still fitting a normal distribution, we're just doing it in a different space. Not sure how easy that was to follow, but at least maybe it gives some sort of an idea.
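For listeners who want the baseline in code, here is a minimal sketch of the standard Euclidean Laplace approximation that Arto's method extends; the target density is a toy example of mine, not one from the episode:

```python
import numpy as np
from scipy.optimize import minimize

# Unnormalized log-density to approximate (a Gamma(3, 1), as a toy target).
def logp(x):
    return 2.0 * np.log(x) - x

# 1. Find the mode by minimizing the negative log-density.
res = minimize(lambda v: -logp(v[0]), x0=[1.0], bounds=[(1e-6, None)])
mode = res.x[0]

# 2. Curvature at the mode via a finite-difference second derivative.
eps = 1e-4
curv = (logp(mode + eps) - 2 * logp(mode) + logp(mode - eps)) / eps**2

# 3. Laplace approximation: Normal(mode, variance = -1 / curvature).
sigma = np.sqrt(-1.0 / curv)
print(f"Laplace approximation: N({mode:.2f}, {sigma:.2f}^2)")  # about N(2.00, 1.41^2)
```

The Riemannian version replaces the flat space in step 3 with a curved one, so the fitted normal can bend and stretch along the posterior; the AISTATS paper linked in the show notes has the details.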

Yeah, yeah. That was actually, I think, a pretty approachable introduction. So if I understood correctly, you're going to use these Riemannian approximations to come up with better algorithms? Is that what you do, and why you focus on Riemannian spaces? If you can introduce that, and tell us basically why it is interesting to look at geometry in these different ways instead of the classical Euclidean way of doing geometry.

Yeah, I think that's exactly what it is about. One other thing, maybe another perspective on it, is that we've also been doing Markov chain Monte Carlo algorithms, so MCMC, in these Riemannian spaces. And what we can achieve with those is that if you have, let's say, a posterior distribution that has some sort of a narrow funnel, some very narrow area that extends far away in one corner of your parameter space, it's actually very difficult to get there with something like standard Hamiltonian Monte Carlo. But with the Riemannian methods we can make these narrow funnels equally easy compared to the flatter areas.
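The classic illustration of the pathology Arto describes is Neal's funnel; this little PyMC model (my example, not code from the episode) reproduces it, and plain NUTS will typically report divergences near the funnel's neck:

```python
import pymc as pm

# Neal's funnel: the scale of theta shrinks exponentially as tau decreases,
# creating the narrow neck that Euclidean HMC/NUTS struggles to enter.
with pm.Model() as funnel:
    tau = pm.Normal("tau", mu=0.0, sigma=3.0)
    theta = pm.Normal("theta", mu=0.0, sigma=pm.math.exp(tau / 2), shape=9)
    idata = pm.sample()  # expect divergence warnings concentrated at small tau
```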

Now, of course, this may sound like a magic bullet, that we should be doing all inference with these techniques. But of course it does come with certain computational challenges. Like I said, the shortest paths are no longer straight lines, so we need numerical integration to follow the geodesic paths in these metrics, and so on. So it's a bit of a compromise, of course. They have very nice theoretical properties, and we've been able to get them working in practice in many cases, so that they are comparable with the current state of the art. But it's not always easy.

Yeah, there is no free lunch.

Yes.

Yeah. Do you have any resources about these? First on the concepts of Riemannian spaces, and then on the algorithms that you folks derived in your group using these Riemannian spaces, for people who are interested?

Yeah, I wouldn't know, let's say, a very particular resource I would recommend on the Riemannian geometry itself; it is actually a rather mathematically involved topic. But regarding the specific methods, I think they are covered in a couple of my recent papers. So we have this Laplace approximation coming out at AISTATS this year. The MCMC sampler we had, I think, two years ago at AISTATS, similarly, the first MCMC method building on these. And then, last year, one paper in Transactions on Machine Learning Research. I think they are more or less accessible.

Let's definitely link to those papers, if you can, in the show notes, because I'm personally curious about it, but I also think listeners will be. It sounds from what you're saying that this idea of doing algorithms in this Riemannian space is somewhat recent. Am I right? And why would it appear now? Why would it become interesting now?

Well, it's not actually that recent. I think the basic principle goes back, I don't know, maybe 20 years or so. I think the main reason why we've been working on this right now is that we've been able to resolve some of the computational challenges. The fundamental problem with these methods is always this numeric integration for following the shortest paths; depending on the algorithm we need it for different reasons, but we always need to do it, and it usually requires operations like inversion of a metric tensor, which has the dimensionality of the parameter space. So we came up with a particular metric that happens to have a computationally efficient inverse.
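As a hedged aside on why such a metric can be cheap: the group's AISTATS paper linked in the show notes ("Lagrangian Manifold Monte Carlo on Monge Patches") uses, if I read it correctly, a metric of the Monge-patch form, whose inverse is available in closed form via the Sherman-Morrison identity (standard linear algebra):

$$
G(\theta) = I + \alpha^2\, \nabla \ell(\theta)\, \nabla \ell(\theta)^\top,
\qquad
G(\theta)^{-1} = I - \frac{\alpha^2\, \nabla \ell(\theta)\, \nabla \ell(\theta)^\top}{1 + \alpha^2\, \lVert \nabla \ell(\theta) \rVert^2},
$$

where $\ell$ is the log-posterior. Inverting a general $d \times d$ metric tensor costs $O(d^3)$; this rank-one structure brings it down to $O(d)$, which is the kind of saving Arto is describing.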

So there are these concrete algorithmic techniques that are bringing the computational cost to a level where it's no longer notably more expensive than doing standard Euclidean methods. So we can, for example, scale them to Bayesian neural networks; that's one of the application cases we are looking at. We really have very high-dimensional problems there, but are still able to do some of these Riemannian techniques, or approximations of them.

That was going to be my next question: in which cases are these approximations interesting? In which cases would you recommend listeners actually invest the time to use these techniques, because they have a better chance of working than the classic Hamiltonian Monte Carlo samplers that are the default in most probabilistic programming languages?

Yeah, I think the easy answer is: when the inference problem is hard. So essentially, one very practical case would be if you realize that you can't really get Hamiltonian Monte Carlo to explore the space, the posterior, properly. It may be difficult to find out that this is happening; of course, if you're never visiting a certain corner, you wouldn't actually know. But if you have some sort of reason to believe that you really are dealing with such a complex posterior, then you might be willing to spend a bit of extra computation to be careful, so that you really try to cover every corner there is. Another example is that we realized, in the scope of these Bayesian neural networks, that there are certain kinds of scenarios where we can show the following: if you do inference with too simple methods, something in the Euclidean metric with the standard Langevin dynamics type of a thing, what you actually see is that if you switch to using better prior distributions in your model, you don't see an advantage from those unless you at the same time switch to an inference algorithm that is able to handle the extra complexity. So if you have, for example, heavy-tailed spike-and-slab type priors in the neural network, you just fail to get any benefit from these better priors if you don't pay a bit more attention to how you do the inference.

Okay, super interesting. And also, it seems it's also quite interesting to look at these methods when you have, well, or when you suspect that you have, multimodal posteriors.

Yes, well, yeah, multimodal posteriors are interesting. We haven't specifically studied that question. We have actually thought about some ideas for creating metrics that would specifically encourage exploring the different modes, but we haven't done that concretely, so we are now still focusing on these narrow, thin areas of posteriors and how you can reach those.

Okay. And you know normalizing flows?

Sure, yes.

So yeah, we've had Marylou Gabrié on the show recently; it was episode 98. And she's working a lot on these normalizing flows and the idea of assisting MCMC sampling with these machine learning methods. And it's amazing. It can sound somewhat similar to what you do in your group. So for listeners, could you explain the difference between the two ideas, and maybe also the use cases that each applies to?

Yeah, I think you're absolutely right. They are very closely related. There is, for example, the basic idea of neural transport, which uses normalizing flows for essentially transforming the parameter space in a suitable non-linear way and then running standard Euclidean Hamiltonian Monte Carlo. It can actually be proven, I think it is in the original paper as well, that this is mathematically equivalent to conducting Riemannian inference in a suitable metric. So I would say that it's a complementary approach to solving exactly the same problem. You have a way of flexibly warping your parameter space; you either do it through a metric, or you do it as a pre-transformation. So there's a lot of similarities. It's also similar computationally, in some sense: if you think about mapping a sample through a normalizing flow, it's actually very close to what we do with the Riemannian Laplace approximation, where you take a sample and start propagating it through some sort of a transformation. It's just whether that transformation is defined through a metric or as a flow. So yes, they are kind of very close.
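For the mathematically inclined, the equivalence Arto mentions rests on the change-of-variables formula (standard background, not a derivation from the episode). If a flow maps $\theta = f(z)$, then running Euclidean HMC on the transformed density

$$
p_Z(z) = p\big(f(z) \mid \text{data}\big)\, \left|\det J_f(z)\right|
$$

and pushing the samples through $f$ targets the original posterior, while the pullback of the Euclidean metric through $f^{-1}$ defines a Riemannian metric on the original parameter space. Warping the space and changing the metric are two descriptions of the same operation.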

So now the question is: when should I be using one of these? I'm afraid I don't really have an answer, beyond the computational properties. For example, if you work with flows, you do need to pre-train them: you need to train some sort of a flow to be able to use it in certain applications, so it comes with some pre-training cost. Quite likely, when you're actually using it, it's going to be faster than working in a Riemannian metric, where you need to invert some metric tensors and so on. So there are these technical differences. Then I think the bigger question is, of course, if we go to really challenging problems, for example very high dimensions, which of these methods actually work well there. For that I don't quite have an answer now, in the sense that I wouldn't dare to say, or even speculate, which of them does; I might miss some obvious limitation of one of the approaches if I tried to extrapolate too far from what we've actually tried in practice.

Yeah, that's what I was going to say. It's also that these methods are really at the frontier of the science. So I guess we're lacking, for now, the practical cases, right? And probably in a few years we'll have more ideas about these, and about when one is more appropriate than another. But for now, I guess we have to try those algorithms and see what we get back. And so actually, what if people want to try these Riemannian-based algorithms? Do you already have packages that we can link to, that people can try and plug their own model into?

Yes and no. So we have released open source code with each of the research papers, so there is a reference implementation that can be used. We have internally been working a bit towards integrating these into the proper open ecosystems that would make, for example, model specification easy. It's not quite there yet. One particular challenge is that many of the environments don't actually have all the support functionality you need for the Riemannian methods; they're essentially simplifying some things by directly encoding the assumption that the shortest path is an interpolation, that it's a line. So you need a bit of extra machinery for the most established libraries. There are some libraries, I believe, that are actually making it fairly easy to do plug-and-play Riemannian metrics. I don't remember the names right now, but that's where we've been planning on putting in the algorithms; they're not really there yet.

Hmm, okay, I see. Yeah, definitely, that would be, I guess, super interesting. If by the time of release you see something that people could try, we'll definitely link to that, because I think listeners will be curious, and I'm definitely super curious to try that. Any new stuff like that, you'd like to try it and see what you can do with it; it's always super interesting. And I've already seen some very interesting experiments done with normalizing flows, especially with Bayeux, by Colin Carroll and other people. Colin Carroll is one of the PyMC developers also. And yeah, now you can use Bayeux to take any JAX-ifiable model, you plug that into it, and you can use the flowMC algorithm to sample your JAX-ifiable PyMC model. So that's really super cool. And I'm really looking forward to more experiments like that, to see, well, okay, what can we do with those algorithms? Where can we push them, to what extent, to what degree, where do they fall down? That's really super interesting, at least for me, because I'm not a mathematician. So when I see that, I love it, because basically the idea is somewhat simple. It's like, okay, we have that problem when we think about geometry the classical way, because then the geometry becomes a funnel, for instance, as you were saying. And then sampling at the bottom of the funnel is just super hard in the way we do it right now, because of the super small distances. What if we change the definition of distance? What if we change the definition of geometry, basically, which is this idea of, okay, let's switch to a Riemannian space. And when we do that, well, the funnel disappears, and it just becomes something easier. It's like going beyond the idea of the centered versus non-centered parameterization, for instance, when you do that in a model, right? But it's going big with that, because it's more general. So I love that idea.
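To spell out the centered versus non-centered trick Alex refers to (a textbook reparameterization, shown here on the funnel from earlier; my sketch, not code from the episode): sampling a standardized variable and rescaling it deterministically removes the funnel geometry that plain HMC struggles with.

```python
import pymc as pm

# Non-centered parameterization of Neal's funnel: sample a standard-normal z,
# then rescale it, so the geometry the sampler sees is close to an isotropic Gaussian.
with pm.Model() as non_centered:
    tau = pm.Normal("tau", mu=0.0, sigma=3.0)
    z = pm.Normal("z", mu=0.0, sigma=1.0, shape=9)
    theta = pm.Deterministic("theta", z * pm.math.exp(tau / 2))
    idata = pm.sample()  # same model as before, but the funnel is gone
```

A Riemannian metric generalizes this: instead of hand-crafting one reparameterization per model, the metric rescales distances everywhere in parameter space.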

I understand it, but I cannot really read the math and go, oh, okay, I see what that means. So I have to see the model, see what I can do and where I can push it, and then I get a better understanding of what it entails.

Yeah, I think you gave a much better summary of what it is doing than I did, so good for that. I mean, you are actually touching on something important there. One point is making the algorithms available so that everyone can try them out. But then there's also the other aspect that we need to worry about, which is the proper evaluation of what they're doing. Of course, in most of the papers, when you release a new algorithm, you need to emphasize things like, in our case, computational efficiency, and you do demonstrate it, maybe for example by quite explicitly showing that with these very strong funnels it does work better. But then the question is, of course, how reliable these things are if used in a black-box manner, where someone just runs them on their favorite model. And one of the challenges we realized is that it's actually very hard to evaluate how well an algorithm is working in an extremely difficult case, because there is no baseline. In some of the cases, we've been comparing by trying to run standard Hamiltonian MCMC or NUTS as carefully as we can, and we kind of think that this is the ground truth, this is the true posterior. But we don't really know whether that's the case. If it's a hard enough case, our supposed ground truth is failing as well. We might be able to see that our solution differs from it, but then we would need to separately go and investigate which one was wrong. And that is a practical challenge, especially if you would like to have a broad set of models, and we would want to show somehow transparently for the end users that in these and these kinds of problems, this and that particular method, whether it's one of ours or any other new fancy method, works or doesn't. Without relying on some particular method that people already trust: if we just compare to it, we can't really convince others that our result is correct when it differs from what we're used to relying on.

Yeah, that's definitely a problem. That's also a question I asked Marylou when she was on the show, and it was kind of the same answer, if I remember correctly: for now it's kind of hard to do benchmarks, which is definitely an issue if you're trying to work on that from a scientific perspective as well. If we were astrologists, that'd be great; then we'd be good. But if you're a scientist, then you want to evaluate your methods, and finding a method to evaluate the method is almost as valuable as finding the method in the first place. Where do you think we are on that in your field? Is that an active branch of the research, to try and evaluate these algorithms? What would that even look like? Or are we still really, really at a very early time for that work?

That's a very good question. So I'm not aware of a lot of people who specifically focus on evaluation. For example, Aki has of course been working a lot on that, trying to create diagnostics and so on. But then, if we think more about the flexible machine learning side, my hunch is that the individual research groups are all circling around the same problems, trying to figure things out. Every now and then, someone invents a fancy way of evaluating something, introduces a particular type of synthetic scenario. I think the most common thing people try is to create problems where you actually have an analytic posterior: you take a problem, transform it in a given way, and then pretend you didn't have the analytic solution. But they all feel a bit artificial, a bit synthetic. So let's see. It would maybe be something that the community should be talking a bit more about, in a workshop or something: okay, let's try to really think about how to verify the robustness, or possibly identify that these things are not yet ready or reliable for practical use in very serious applications. I haven't been following very closely what's happening, though, so I may be missing some important works that are already out there.

Okay, yeah. Well, Aki, if you're listening, send us a message if we forgot something. And second, that sounds like there are some interesting PhDs to do on the issue, if that's still a very new branch of the research. So, people, if you're interested in that, maybe contact Arto, and we'll see; maybe in a few months or years, you can come here on the show and answer the question I just asked. Another aspect of your work I really want to talk about, one that I really love, and now listeners can relax, because this is going to be, I think, less abstract and closer to their user experience: it is about priors. You talked about it a bit at the beginning; in particular, you've worked a lot on a package called PreliZ that I really love. One of my friends and fellow PyMC developers, Osvaldo Martin, is also collaborating on it, and you guys have done a tremendous job. So yeah, can you give people a primer about PreliZ? What is it? When could they use it, and what's its purpose in general?

Maybe I need to start by saying that I haven't worked a lot on PreliZ; Osvaldo has, and a couple of others, so I've been kind of just hovering around and giving a bit of feedback. But yeah, I'll maybe start a bit further away, not directly from PreliZ, but from the whole question of prior elicitation. What we've been working on there, prior elicitation, I would frame as a usually iterative approach of communicating with the domain expert, where the goal is to estimate what their actual subjective prior knowledge is on whatever parameters the model has, and to do it so that it's cognitively easy for the expert. Many of the algorithms that we've been working on are based on this idea of predictive elicitation. If you have a model where the parameters don't actually have a very concrete, easily understandable meaning, you can't start asking the expert questions about the parameters; it would require them to fully understand the model itself. The predictive elicitation techniques instead communicate with the expert, usually in the space of the observable quantities. So they're asking: is this realization somehow more likely than this other one? And now this is where PreliZ comes into play. When we are communicating with the user, most of the time the information we show the user is some sort of visualization, of predictive distributions, or possibly also of the parameter distributions themselves. So we need an easy way of communicating, whether it's histograms of predicted values or whatnot. How do we show those to a user in scenarios where the model itself is some sort of a probabilistic program, so we can't fixate on a given model family? That's actually the main role of PreliZ: essentially making it easy to interface with the user. Of course, PreliZ also includes the algorithms themselves: algorithms for estimating the prior, and the interface components for the expert to give information. So, make a selection, use a slider to say I would want my distribution to be a bit more skewed towards the right, and so on. That's what we are aiming at: a general-purpose tool, essentially a platform for developing and bringing into use all kinds of prior elicitation techniques. It's not tied to any given algorithm; you just have the components, and you could then easily commit, let's say, a new type of prior elicitation algorithm into the library.
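As a small taste of the kind of interface Arto describes, here is a hedged sketch using PreliZ; check the documentation linked in the show notes for the current API, as exact function names and signatures may differ across versions:

```python
import preliz as pz

# Ask for the maximum-entropy Gamma with 95% of its mass between 2 and 6;
# maxent optimizes the distribution's parameters in place and plots the result.
dist = pz.Gamma()
pz.maxent(dist, lower=2, upper=6, mass=0.95)
print(dist)  # inspect the fitted alpha and beta

# Or explore a prior with interactive sliders, e.g. inside a Jupyter notebook.
pz.Gamma(mu=4, sigma=1).plot_interactive()
```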

And I really encourage folks to go take a look at the PreliZ package; I put the link in the show notes. Because, yeah, as you were saying, it's a much easier way to specify your priors, and also to elicit them if you need the input of non-statisticians in your model, which you often do if the model is complex enough. So yeah, I'm using it myself quite a lot, so thanks a lot, guys, for this work. So Arto, as you were saying, Osvaldo Martin is one of the main contributors, Oriol Abril-Pla also, and Alejandro Icazatti, if I remember correctly; at least these four people are the main contributors. And yeah, I definitely encourage people to go there. What would you say, Arto, is the Pareto effect here, if people want to get started with PreliZ? Like the 20% of uses that will give you 80% of the benefits of PreliZ, for someone who doesn't know anything about it?

That's a very good question. I think the most important thing, actually, is to realize that we need to be careful when we set the priors. So simply being aware that you need a tool for this. You need a tool that makes it easy to do something like a prior predictive check. You need a tool that relieves you from figuring out how do I inspect my priors, or the effects they have on the model. That's actually where the real benefit is. You get most of it when you bring the tool into your Bayesian workflow as a concrete step, one where you identify: I need to do this.
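For readers new to the term, a prior predictive check just means simulating data from the model using only the priors and checking whether it lands on a plausible scale. A minimal PyMC sketch, with made-up data of mine for illustration:

```python
import numpy as np
import pymc as pm
import arviz as az

x = np.linspace(0, 10, 50)                          # hypothetical predictor
y_obs = 2.5 * x + np.random.normal(0, 1, size=50)   # made-up observations

with pm.Model() as model:
    beta = pm.Normal("beta", mu=0, sigma=1)
    sigma = pm.HalfNormal("sigma", sigma=1)
    y = pm.Normal("y", mu=beta * x, sigma=sigma, observed=y_obs)
    prior_pred = pm.sample_prior_predictive(500)

# If the simulated outcomes span wildly implausible values,
# the priors need rethinking before you ever fit the real data.
az.plot_ppc(prior_pred, group="prior")
```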

Then the remaining tail of this is, of course, that maybe in some cases you have such a complicated model that you really need to deep-dive and start running algorithms that help you elicit the priors. And I would actually even say that the elicitation algorithms, I perceive them as useful even when the person is actually a statistician. I mean, there are a lot of models where we may think we know how to set the priors, but what we are actually doing is following some very vague ideas about the effect, and we may make severe mistakes, or spend a lot of time doing it. So to an extent, these elicitation interfaces, I believe, will ultimately be helping even hardcore statisticians do it faster, do it slightly better, do it perhaps in a better-documented manner. You could, for example, store all the interaction the modeler had with these tools and put it aside: this is where we got the prior from, instead of just trial and error where we only see the end result. So you could revisit the choices you made during an elicitation process, like: I discarded these predictive distributions for some reason; okay, I made a mistake there, maybe I go and change my answer in that part, and then the algorithm provides you an updated prior, without you needing to go through the whole prior specification process again.

Yeah. Yeah, I really love that. And that makes the process of setting priors more reproducible, more transparent in a way. That makes me think a bit of the scikit-learn pipelines that you use to transform the data. For instance, you just set up the pipeline and say, I want to standardize my data. And then you have that pipeline ready, and when you do the out-of-sample predictions, you can use the pipeline and say, okay, now do that same transformation on this new data, so that we're sure it's done the right way, but it's still transparent and people know what's going on. It's a bit the same thing, but with the priors. And I really love that, because it also makes it easier for people to think about the priors and to actually choose them. Because what I've seen in teaching is that, especially for beginners, even more when they come from the frequentist framework, setting the priors can be just paralyzing. It's like the paradox of choice: way too many choices. And then they end up not choosing anything, because they are too afraid to choose the wrong prior.

Yes, I fully agree with that. I mean, there are a lot of very simple models that already start having six, seven, eight different univariate priors in them. And I've been working with these things for a long time, and I still very easily make stupid mistakes, where I'm thinking that by increasing the variance of this particular prior here, what I'm achieving is, for example, higher predictive variance as well. And then I realize that, no, that's not the case: later in the model it plays some sort of a role, and it actually has the opposite effect. It's hard.

Yeah. That stuff is really hard, and same here. When I discover that, I'm extremely frustrated, because I'm like, I lost hours on this, whereas if I had a more reproducible pipeline, it would just have been handled automatically for me. So yeah, for sure, we're not there yet in the workflow, but that definitely makes it way easier.

So yeah, I absolutely agree that we are not there yet. I mean, PreliZ is a very well-defined tool that allows us to start working on it. But then there are the actual concrete algorithms that would make it easy to, let's say, avoid these kinds of stupid mistakes and be able to really reduce the effort. So if it now takes two weeks for a PhD student to think about and fiddle with the prior, can we get it to one day? Can we get it to one hour? Can we get it to two minutes of a quick interaction? Probably not two minutes, but even getting it to one hour will require lots of things. It will require even better tooling of this kind: how do we visualize, how do we play around with it? But I think it's also going to require quite a bit better algorithms for how you estimate, from maximally limited interaction, what the prior is, and how you design the optimal questions you should be asking the expert. There's no point in reiterating the same things just to fine-tune one of the prior variances a bit, if there is still a massive mistake somewhere in the prior and a single question would be able to rule out half of the possible scenarios. It's going to be an interesting, let's say, rich research direction, I would say, for the next 5 to 10 years.

Yeah, for sure. And very valuable too, because very practical. So for sure, again, a great PhD opportunity, folks. Also, I mean, it may be hard to find those algorithms that you were talking about, because it is hard, right? I know, I worked on the find_constrained_prior function that we have in PyMC now. And it seemed like a very simple case; it's not even doing all the fancy stuff that PreliZ is doing. It's mainly just optimizing a distribution so that it fits the constraints you are giving it. Like, for instance: I want a Gamma with 95% of the mass between 2 and 6; give me the parameters that fit that constraint. That's actually surprisingly hard mathematically. You have a lot of choices to make, a lot of things to really be careful about. And so I'm guessing that's also one of the hurdles right now in that research.
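For reference, the exact example Alex gives maps onto a short call. This is a hedged sketch of PyMC's find_constrained_prior; the signature below matches recent PyMC releases, but double-check the docs for your version:

```python
import pymc as pm

# Find the Gamma parameters that put ~95% of the prior mass between 2 and 6.
params = pm.find_constrained_prior(
    pm.Gamma,
    lower=2,
    upper=6,
    mass=0.95,
    init_guess={"alpha": 4.0, "beta": 1.0},  # starting point for the optimizer
)
print(params)  # a dict like {"alpha": ..., "beta": ...}
```

PreliZ's maxent, shown earlier, solves the same kind of problem while additionally picking the maximum-entropy, i.e. least informative, member among the distributions satisfying the constraint.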

743

Yeah, it absolutely is.

744

I mean, I would say at least I'm

approaching this.

745

more or less from an optimization

perspective then that I mean, yes, we are

746

trying to find a prior that best satisfies

whatever constraints we have and trying to

747

formulate an optimization problem of some

kind that gets us there.

748

This is also where I think there's a lot

of room for the, let's say flexible

749

machine learning tools type of things.

750

So, I mean, if you think about the prior

that satisfies these constraints, we could

751

be specifying it with some sort of a

flexible

752

not a particular parametric prior but some

sort of a flexible representation and then

753

just kind of optimizing for within a much

broader set of this.

754

But then of course it requires completely

different kinds of tools that we are used

755

to working on.

756

It also requires people accepting that our

priors may take arbitrary shapes.

757

They may be distributions that we could

have never specified directly.

758

Maybe they're multimodal.

759

priors that we kind of just infer that

this is what you couldn't really and

760

there's going to be also a lot of kind of

educational perspective on getting people

761

to accept this.

762

But even if I had to give you a perfect

algorithm that somehow cranks out a prior

763

and then you look at the prior and you're

saying that I don't even know what

764

distribution this is, I would have never

ever converged into this if I was manually

765

doing this.

766

So will you accept?

767

that that's your prior or will you insist

that your method is doing something

768

stupid?

769

I mean, I still want to use my my Gaussian

prior here.

770

Yeah, that's a good point.

771

And in a way that's kind of related to a

classic problem that you have when you're

772

trying to automate a process.

773

I think there's the same issue with the

automated cars, like those self -driving

774

cars, where people actually trust the cars

more if they think they have

775

some control over it.

776

I've seen interesting experiments where

they put a placebo button in the car that

777

people could push on to override if they

wanted to, but the button wasn't doing

778

anything.

779

People are saying they were more

trustworthy of these cars than the

780

completely self -driving cars.

781

That's also definitely something to take

into account, but that's more related to

782

the human psychology than to the

algorithms per se.

783

related to human psychology but it's also

related to this evaluation perspective.

784

I mean of course if we did have a very

robust evaluation pattern that somehow

785

tells that once you start using these

techniques your final conclusions in some

786

sense will be better and if we can make

that kind of a very convincing then it

787

will be easier.

788

I mean if you think about, I mean there's

a lot of people that would say that

789

very massive neural network with four

billion parameters.

790

It would never ever be able to answer a

question given in a natural language.

791

A lot of people were saying that five

years ago that this is a pipeline, it's

792

never gonna happen.

793

Now we do have it and now everyone is

ready to accept that yes, it can be done.

794

And they are willing to actually trust

these judge -y pity type of models in a

795

lot of things.

796

And they are investing a lot of effort

into figuring out what to do with this.

797

It just needs this kind of very concrete

demonstration that there is value and that

798

it works well enough.

799

It will still take time for people to

really accept it, but I mean, I think

800

that's kind of the key ingredient.

Yeah, yeah. I mean, it's also good in some way. That skepticism makes the tools better, so that's good. Now, we could keep talking about PreliZ, because I have other technical questions about it. But actually, that's a perfect segue to a question I also had for you, because you have a lot of experience in that field: how do you think industries can better integrate Bayesian approaches into their data science workflows? Because that's basically what we ended up talking about right now, without me nudging you towards it.

Yeah, I have actually indeed been thinking about that quite a bit, since I do a lot of collaboration with industrial partners in different domains. I think there are a couple of perspectives on this. One is that people are finally starting to accept the fact that probabilistic programming with black-box automated inference is the only sensible way of doing statistical modeling. Looking back 10 or 15 years, you would still have a lot of people, maybe not in industry but in research in different disciplines, in meteorology or physics or whatever, actually writing Metropolis-Hastings algorithms from scratch, which is simply not reliable in any sense. It took time for them to accept that yes, we can actually now do it with something like Stan. So this is the way, to the extent that there are problems that fit well with what something like Stan or PyMC offers. And I think we've been educating master's students who are familiar with these concepts for long enough. Once they go to industry, they will use these tools, and they know roughly how to use them. So that's one side.

But then the other thing is that, especially in many of these predictive industries, whether it's marketing or recommendation or sales or whatever, people are already doing a lot of deep learning types of models. That's a routine tool in what they do. And, at least in my opinion, these fields are getting closer to each other. We have more and more deep learning techniques that are ultimately Bayesian models in themselves; the variational autoencoder is a prime example. So it may actually be that all this Bayesian thinking and reasoning creeps into use through the next generation of these deep learning techniques. They've been building those models, they've been figuring out that they cannot get reliable estimates of uncertainty, they've maybe tried some ensembles or whatnot. And they will follow: once the tools are out there, with good enough tutorials on how to use them, they might start using things like, let's say, Bayesian neural networks, or whatever the latest tool is at that point. And I think this may be the easiest way for the industries to do it. They're not going to switch back to very simple classical linear models for their analyses, but they are going to make their deep learning solutions Bayesian on some time scale. Maybe not tomorrow, but maybe in five years.

Yeah, that's a very good point. I love that. And of course, I'm very happy about that, being one of the actors making the industry more Bayesian, so I have a vested interest in this. But I've also seen the same evolution you were talking about. Right now, it's not even really an issue of convincing people to use these kinds of tools anymore; I mean, still from time to time, but less and less. The question now is really more about making those tools more accessible, more versatile, easier to use, more reliable, easier to deploy in industry, things like that, which is a really good point to be at, for sure.

And to some extent, I think it's an interesting question also from the perspective of the tools. It may mean that we just end up doing a lot of Bayesian analysis on top of what we would now call deep learning frameworks, with libraries building on top of those. NumPyro, for example, is a library building on JAX, but its syntax is intentionally similar to what people are used to from deep learning types of modeling. And this is perfectly fine. We are using a lot of stochastic optimization routines in Bayesian inference anyway, so these frameworks are actually very good tools for building all kinds of Bayesian models. And I think this may be the layer where the industry use happens, because they need the GPU type of scaling and everything there anyway. So I'm happy to have our systems work on top of these libraries.
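
For a flavor of what "Bayesian analysis on top of a deep learning framework" looks like in practice, here is a minimal sketch, not from the episode, of a linear regression written in NumPyro. The model structure and priors are illustrative choices; because NumPyro builds on JAX, the same code runs on a GPU without changes.

```python
import jax.numpy as jnp
from jax import random

import numpyro
import numpyro.distributions as dist
from numpyro.infer import MCMC, NUTS

def model(x, y=None):
    # Priors; illustrative choices, not a recommendation.
    alpha = numpyro.sample("alpha", dist.Normal(0.0, 1.0))
    beta = numpyro.sample("beta", dist.Normal(0.0, 1.0))
    sigma = numpyro.sample("sigma", dist.HalfNormal(1.0))
    # Likelihood, written much like a forward pass in a deep learning model.
    numpyro.sample("y", dist.Normal(alpha + beta * x, sigma), obs=y)

# Simulate a small dataset with known parameters.
x = jnp.linspace(0.0, 1.0, 50)
y = 1.0 + 2.0 * x + 0.1 * random.normal(random.PRNGKey(0), (50,))

# NUTS is the same family of samplers Stan uses under the hood.
mcmc = MCMC(NUTS(model), num_warmup=500, num_samples=1000)
mcmc.run(random.PRNGKey(1), x, y=y)
mcmc.print_summary()
```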

Yeah, very good point. And to come back to one of the points you made in passing: education is helping a lot with that. You have been educating the data scientists who now go into industry. And I know that in Finland, unlike in France, where I'm originally from, there is this really great integration between the research side, the university, and the industry. You can really see that in the PhD positions, in the professorship positions, and so on. I think that's really interesting, and it's why I wanted to talk to you about it. So, to go back to education: what challenges and opportunities do you see in teaching Bayesian machine learning at the university level, as you do?

Yeah, it's challenging, I must say. Especially once we get to, well, Bayesian machine learning, it is a combination of two topics that are each somewhat difficult in themselves. If we want to talk about normalizing flows, and then about statistical properties of estimators or MCMC convergence, those require different kinds of mathematical tools, and they require a certain level of expertise on the software and programming side. So what it means, actually, is that if we look at the population of, let's say, data science students, there will always be a lot of people missing background on one of these sides. So I think this is a difficult topic to teach. If it was a small class, it would be fine. But it appears that at least our students are really excited about these things. I can launch a course with explicitly the title "Bayesian machine learning," which is an advanced-level machine learning course, and I would still get 60 to 100 students enrolling in it. And that means that within that group there are going to be some CS students with almost no background in statistics, and some statisticians who certainly know how to program but are not really used to thinking about GPU acceleration of a very large model. But it's interesting; I mean, it's not an impossible thing. I think it is also a topic that you can teach at a sufficient level for everyone, so that everyone is able to understand the basic reasoning behind why we are doing these things. Some of the students may struggle to figure out all the math behind it, but they might still be able to use these tools very nicely. They might be able to say, "If I make this and that kind of modification, I realize that my estimates are better calibrated." And some others will then really go deeper into figuring out why these things work. So it just needs a bit of creativity about how we do it and what we expect from the students: what should they know once they've completed a course like this?

Yeah, that makes sense. Have you also seen an increase in the number of students in recent years?

Well, we get as many students as we can take. For quite a while already, the most popular master's and bachelor's programs in our university have been, by far, essentially data science and computer science. So we can't take in everyone we would want to. To us it actually looks like a more or less stable number of students, but it's always been a large number since we launched, for example, the data science program. It went up very fast, so there's definitely interest.

Yeah. Yeah. That's fantastic. And... I've been taking a lot of your time, so we're going to start to close up the show, but there are at least two questions I want to get your insight on. The first one is: what do you think is currently the biggest hurdle in the Bayesian workflow? We've talked about that a bit already, but I'd love to get your structured answer.

Well, I think the first thing is getting people to actually start using more or less systematic workflows. The idea is great, and we know more or less how we should be thinking about it, but it's a very complex object. We can tell experts, statisticians, "Yes, this is roughly how you should do it," but then we still also need to convince them, almost force them, to stick to it. And especially if we think about newcomers, people who are just starting with these things, it's a very complicated thing. If you need to read a 50-page or 100-page book about the Bayesian workflow to even know how to do it, that's a technical challenge. So I think that in the long term we are going to get tools for assisting with it, really streamlining the process. I'm thinking of something like an AI assistant for a person building a model, one that would poke you: "I see that you are trying to go there and do this, but I see that you haven't done prior predictive checks. I actually already created some plots for you; please take a look at these and confirm, is this what you were expecting?" It's going to take a lot of effort to create those. It's something we've been trying to think about, how to do it, but it's still open. I think that's where the challenge is. We know most of the stuff within the workflow, roughly how it should be done; at least we have good enough solutions. But really helping people to actually follow these principles, that's gonna be hard.
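
For readers who want to see the prior predictive checks such an assistant would automate, here is a minimal sketch, not from the episode, using PyMC and ArviZ. The model and its priors are illustrative stand-ins.

```python
import numpy as np
import pymc as pm
import arviz as az

# Illustrative observed data; in practice this is your real dataset.
rng = np.random.default_rng(42)
data = rng.normal(loc=1.0, scale=2.0, size=100)

with pm.Model():
    mu = pm.Normal("mu", mu=0.0, sigma=10.0)   # deliberately wide prior
    sigma = pm.HalfNormal("sigma", sigma=5.0)
    pm.Normal("y", mu=mu, sigma=sigma, observed=data)
    # Simulate datasets from the priors alone, before fitting anything.
    idata = pm.sample_prior_predictive(random_seed=42)

# The kind of plot an assistant could generate and ask you to sign off on:
# do datasets simulated from your priors look remotely plausible?
az.plot_ppc(idata, group="prior")
```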

Yeah, yeah, yeah. But damn, that would be super cool. We're talking about something like a Jarvis, you know, the AI assistant, but for Bayesian models. How cool would that be? Love that. And looking forward, how do you see Bayesian methods evolving with artificial intelligence research?

Yeah, I think... For quite a while I was about to say that I've been building on this basic idea that deep learning models as such will become more and more Bayesian anyway. So that's kind of a given. But now, of course, the recent very large-scale AI models are getting so big that computational resources are a major hurdle for doing learning for those models, even in the crudest possible way. So there are, of course, clear needs for uncertainty quantification in the large language model type of scope. They are really quite unreliable, and really poor at, for example, evaluating their own confidence. There have been examples showing that if you ask them how sure they are about some statement, they give a similar number more or less irrespective of the statement: "Yeah, 50% sure. I don't know." So it may be that, at least in the very short run, it's not going to be Bayesian techniques that solve all the uncertainty quantification in those types of models. In the long term, maybe it is. But I think it's going to be interesting. It looks to me a bit like the stuff built on top of these large language models to address their specific limitations comes as separate components. It's some sort of external tool that reads in those inputs, or an external tool that the LLM can use. So maybe this is going to be a separate element that somehow integrates. An LLM could, of course, have an API interface where it can query, let's say, Stan to figure out the answer to the type of question that requires probabilistic reasoning. People have been plugging things in; there are famous public examples where you can query mathematical reasoning engines and so on, so that the LLM, if you ask a specific type of question, goes outside of its own realm and does something. It already kind of knows how to program, so maybe we just need to teach LLMs to do statistical inference by actually running an MCMC algorithm on a model that they specify together with the user. I don't know whether anyone is actually working on that. It's something that just came to my mind, so I haven't really thought about it too much.
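
As a rough illustration of that "LLM calls out to Stan" idea, here is a hypothetical sketch of what the external tool's side could look like, using CmdStanPy. The function name, the assumption that the LLM supplies valid Stan code and data, and the tool-calling plumbing around it are all invented for illustration; nothing like this is described in the episode.

```python
import tempfile
from pathlib import Path

from cmdstanpy import CmdStanModel

def run_probabilistic_query(stan_code: str, data: dict) -> dict:
    """Hypothetical tool an LLM could call: compile the Stan model it
    specified together with the user, run MCMC, and return posterior
    summaries as plain numbers the LLM can quote in its answer."""
    with tempfile.TemporaryDirectory() as tmp:
        stan_file = Path(tmp) / "model.stan"
        stan_file.write_text(stan_code)
        model = CmdStanModel(stan_file=str(stan_file))  # compiles the model
        fit = model.sample(data=data, chains=4, iter_sampling=1000)
        return fit.summary().to_dict()
```

The hard part, left out here, is the routing: deciding when a question requires probabilistic reasoning and getting the LLM to emit a well-posed model at all.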

Yeah, but again, we're getting so many PhD ideas for people right now.

We are.

Yeah, I feel like we should be doing a best-of of all your awesome PhD ideas. Awesome.

Well, I still have so many questions for you, but let's close the show, because I don't want to take too much of your time, and I know it's getting late in Finland. So let's close up the show with the last two questions I always ask at the end of the show. First one: if you had unlimited time and resources, which problem would you try to solve?

Let's see. The lazy answer is that I am now trying to get unlimited resources... well, not unlimited resources, but I'm really trying to tackle this prior elicitation question. I think for most of the other parts of the Bayesian workflow we have reasonably good solutions, but this whole question of how to figure out complex multivariate priors over arbitrarily complex models, that's a very practical thing that I'm investing in. But if it really is infinite, then maybe I could actually continue on the quick idea that we just talked about: really getting probabilistic reasoning into the core of these large language model types of AI applications, so that they would reliably give proper probabilistic judgments for the kind of decision-making and reasoning problems that we ask of them. So that would be interesting, yeah.

Yeah, for sure. And second question: if you could have dinner with any great scientific mind, dead, alive, or fictional, who would it be?

Yes, this is something I actually thought about, because I figured you would be asking it of me as well. And I went with a fictional character; I like fictional characters. So I chose Daniel Waterhouse from Neal Stephenson's Baroque Cycle books. They are semi-historical novels, set in the era when Isaac Newton and others were living and establishing the Royal Society, with a lot of high-fantasy components involved. Daniel Waterhouse in those novels is a roommate of Isaac Newton and a friend of Gottfried Leibniz, so he knows both sides of this great debate about who invented calculus and who copied whom. If I had dinner with him, I would get to talk about these innovations, which I think are among the foundational ones, but I wouldn't actually need to get involved with either party. I wouldn't need to choose sides between Isaac and Gottfried.

Love it. Yeah, love that answer. Make sure to record that dinner and post it on YouTube; I'm pretty sure lots of people would be interested in it. Fantastic. Thanks. Thanks a lot, Arto. That was a great discussion. I'm really happy we could go through, well, not the whole depth of what you do, because you do so many things, but a good chunk of it. As usual, I'll put resources and a link to your website in the show notes for those who want to dig deeper. Thank you again, Arto, for taking the time and being on this show.

Thank you very much. It was my pleasure. I really enjoyed the discussion.

This has been another episode of Learning Bayesian Statistics. Be sure to rate, review, and follow the show on your favorite podcatcher, and visit learnbayesstats.com for more resources about today's topics, as well as access to more episodes to help you reach a true Bayesian state of mind. That's learnbayesstats.com. Our theme music is "Good Bayesian" by Baba Brinkman, feat. MC Lars and Mega Ran. Check out his awesome work at bababrinkman.com. I'm your host, Alex Andorra. You can follow me on Twitter at alex_andorra, like the country. You can support the show and unlock exclusive benefits by visiting patreon.com/learnbayesstats. Thank you so much for listening and for your support. You're truly a good Bayesian. Change your predictions after taking information in. And if you're thinking I'll be less than amazing, let me show you how to be a good Bayesian. Change calculations after taking fresh data in. Those predictions that your brain is making? Let's get them on a solid foundation.