Learning Bayesian Statistics

Proudly sponsored by PyMC Labs, the Bayesian Consultancy. Book a call, or get in touch!

In this episode, Marvin Schmitt introduces the concept of amortized Bayesian inference, where the upfront training phase of a neural network is followed by fast posterior inference.

Marvin will guide us through this new concept, discussing his work in probabilistic machine learning and uncertainty quantification, using Bayesian inference with deep neural networks. 

He also introduces BayesFlow, a Python library for amortized Bayesian workflows, and discusses its use cases in various fields, while also touching on the concept of deep fusion and its relation to multimodal simulation-based inference.

A PhD student in computer science at the University of Stuttgart, Marvin is supervised by two LBS guests you surely know — Paul Bürkner and Aki Vehtari. Marvin’s research combines deep learning and statistics, to make Bayesian inference fast and trustworthy. 

In his free time, Marvin enjoys board games and is a passionate guitar player.

Our theme music is « Good Bayesian », by Baba Brinkman (feat MC Lars and Mega Ran). Check out his awesome work at https://bababrinkman.com/ !

Thank you to my Patrons for making this episode possible!

Yusuke Saito, Avi Bryant, Ero Carrera, Giuliano Cruz, Tim Gasser, James Wade, Tradd Salvo, William Benton, James Ahloy, Robin Taylor,, Chad Scherrer, Zwelithini Tunyiswa, Bertrand Wilden, James Thompson, Stephen Oates, Gian Luca Di Tanna, Jack Wells, Matthew Maldonado, Ian Costley, Ally Salim, Larry Gill, Ian Moran, Paul Oreto, Colin Caprani, Colin Carroll, Nathaniel Burbank, Michael Osthege, Rémi Louf, Clive Edelsten, Henri Wallen, Hugo Botha, Vinh Nguyen, Marcin Elantkowski, Adam C. Smith, Will Kurt, Andrew Moskowitz, Hector Munoz, Marco Gorelli, Simon Kessell, Bradley Rode, Patrick Kelley, Rick Anderson, Casper de Bruin, Philippe Labonde, Michael Hankin, Cameron Smith, Tomáš Frýda, Ryan Wesslen, Andreas Netti, Riley King, Yoshiyuki Hamajima, Sven De Maeyer, Michael DeCrescenzo, Fergal M, Mason Yahr, Naoya Kanai, Steven Rowland, Aubrey Clayton, Jeannine Sue, Omri Har Shemesh, Scott Anthony Robson, Robert Yolken, Or Duek, Pavel Dusek, Paul Cox, Andreas Kröpelin, Raphaël R, Nicolas Rode, Gabriel Stechschulte, Arkady, Kurt TeKolste, Gergely Juhasz, Marcus Nölke, Maggi Mackintosh, Grant Pezzolesi, Avram Aelony, Joshua Meehl, Javier Sabio, Kristian Higgins, Alex Jones, Gregorio Aguilar, Matt Rosinski, Bart Trudeau, Luis Fonseca, Dante Gates, Matt Niccolls, Maksim Kuznecov, Michael Thomas, Luke Gorrie, Cory Kiser, Julio, Edvin Saveljev, Frederick Ayala, Jeffrey Powell, Gal Kampel, Adan Romero, Will Geary and Blake Walters.

Visit https://www.patreon.com/learnbayesstats to unlock exclusive Bayesian swag 😉

Takeaways:

  • Amortized Bayesian inference combines deep learning and statistics to make posterior inference fast and trustworthy.
  • Bayesian neural networks can be used for full Bayesian inference on neural network weights.
  • Amortized Bayesian inference decouples the training phase and the posterior inference phase, making posterior sampling much faster.
  • BayesFlow is a Python library for amortized Bayesian workflows, providing a user-friendly interface and modular architecture.
  • Self-consistency loss is a technique that combines simulation-based inference and likelihood-based Bayesian inference, with a focus on amortization.
  • The BayesFlow package aims to make amortized Bayesian inference more accessible and provides sensible default values for neural networks.
  • Deep fusion techniques allow for the fusion of multiple sources of information in neural networks.
  • Generative models that are expressive and have one-step inference are an emerging topic in deep learning and probabilistic machine learning.
  • Foundation models, which have a large training set and can handle out-of-distribution cases, are another intriguing area of research.

Chapters:

00:00 Introduction to Amortized Bayesian Inference

07:39 Bayesian Neural Networks

11:47 Amortized Bayesian Inference and Posterior Inference

23:20 BayesFlow: A Python Library for Amortized Bayesian Workflows

38:15 Self-consistency loss: Bridging Simulation-Based Inference and Likelihood-Based Bayesian Inference

41:35 Amortized Bayesian Inference

43:53 Fusing Multiple Sources of Information

45:19 Compensating for Missing Data

56:17 Emerging Topics: Expressive Generative Models and Foundation Models

01:06:18 The Future of Deep Learning and Probabilistic Machine Learning

Links from the show:

Transcript

This is an automatic transcript and may therefore contain errors. Please get in touch if you’re willing to correct them.

Speaker:

In this episode, Marvin Schmitt introduces

the concept of amortized Bayesian

2

:

inference, where the upfront training

phase of a neural network is followed by

3

:

fast posterior inference.

4

:

Marvin will guide us through this new

concept, discussing his work in

5

:

probabilistic machine learning and

uncertainty quantification using Bayesian

6

:

inference with deep neural networks.

7

:

He also introduces BayesFlow, a

8

:

Python library for amortized Bayesian

workflows and discusses its use cases in

9

:

various fields while also touching on the

concept of deep fusion and its relation to

10

:

multimodal simulation-based inference.

11

:

Yeah, that is a very deep episode and also

a fascinating one.

12

:

I've been personally diving much more into

amortized Bayesian inference with BayesFlow

13

:

since the folks there have been kind

enough.

14

:

to invite me to the team, and I can tell

you, this is super promising technology.

15

:

A PhD student in computer science at the

University of Stuttgart, Marvin is

16

:

supervised actually by two LBS guests you

surely know, Paul Bürkner and Aki

17

:

Vehtari.

18

:

Marvin's research combines deep learning

and statistics to make Bayesian inference

19

:

fast and trustworthy.

20

:

In his free time, Marvin enjoys board

games and is a passionate guitar player.

21

:

This is Learning Bayesian Statistics,


22

:

Welcome to Learning Bayesian Statistics, a

podcast about Bayesian inference, the

23

:

methods, the projects,

24

:

and the people who make it possible.

25

:

I'm your host, Alex Andorra.

26

:

You can follow me on Twitter at alex_andorra, like the country, for any info

27

:

about the show.

28

:

LearnBayesStats.com is the place to be.

29

:

Show notes, becoming a corporate sponsor,

unlocking Bayesian Merch, supporting the

30

:

show on Patreon, everything is in there.

31

:

That's LearnBayesStats.com.

32

:

If you're interested in one -on -one

mentorship, online courses, or statistical

33

:

consulting,

34

:

Feel free to reach out and book a call at topmate.io/alex_andorra.

36

:

See you around, folks, and best Bayesian

wishes to you all.

37

:

Today, I want to thank the fantastic Adam

Romero, Will Geary, and Blake Walters for

38

:

supporting the show on Patreon.

39

:

Your support is truly invaluable and

literally makes this show possible.

40

:

I can't wait to talk with you guys in the

Slack channel.

41

:

Second, the first part of our modeling

webinar series on Gaussian processes is

42

:

out for everyone.

43

:

So if you want to see how to use the new

HSGP approximation in PyMC, head over to

44

:

the LBS YouTube channel and you'll see

Juan Orduz, a fellow PyMC core dev and

45

:

mathematician, explain how to do fast and

efficient Gaussian processes in PyMC.

46

:

I'm actually working on the next part in

this series as we speak, so stay tuned for

47

:

more and follow the LBS YouTube channel if

you don't want to miss it.

48

:

Okay, back to the show now.

49

:

Marvin Schmitt, willkommen nach Learning Bayesian Statistics.

50

:

Thanks Alex, thanks for having me.

51

:

Actually my German is very rusty, do you

say nach or zu?

52

:

Well, welcome to Learning Bayesian Statistics.

53

:

Maybe welcome in podcast?

54

:

Nah.

55

:

Obviously, obviously like it was a third

hidden option.

56

:

Damn.

57

:

it's a secret third thing, right?

58

:

Yeah, always in Germany.

59

:

It's always that.

60

:

Man, damn.

61

:

Well, that's okay.

62

:

I got embarrassed in front of the world,

but I'm used to that in each episode.

63

:

So thanks a lot for taking the time.

64

:

Marvin.

65

:

Thanks a lot to Matt Rosinski actually for

recommending to do an episode with you.

66

:

Matt was kind enough to take some of his

time to write to me and put me in contact

67

:

with you.

68

:

I think you guys met in Australia in a

very fun conference, Bayes on the Beach.

69

:

I think it happens every two years.

70

:

Definitely when I go there in two years

and do a live episode there.

71

:

Definitely that's a...

72

:

That's a project I wanted to do this

year, but that didn't go well with my

73

:

traveling dates.

74

:

So in two years, definitely going to try

to do that.

75

:

So yeah, listeners and Marvin, you can

hold me accountable to that promise.

76

:

Absolutely.

77

:

We will.

78

:

So Marvin, before we talk a bit more about

what you're a specialist in and also what

79

:

you presented in Australia, can you tell

us what you're doing nowadays and also how

80

:

you...

81

:

ended up working on this?

82

:

Yeah, of course.

83

:

So these days, I'm mostly doing methods

development.

84

:

So broadly in probabilistic machine

learning, I care a lot about uncertainty

85

:

quantification.

86

:

And so essentially, I'm doing Bayesian

inference with deep neural networks.

87

:

So taking Bayesian inference, which is

notoriously slow at times, which might be

88

:

a bottleneck, and then using generative

neural networks to speed up this process,

89

:

but still maintaining all the

explainability, all these nice benefits

90

:

that we have from using

91

:

I have a background in both psychology and

computer science.

92

:

That's also how I ended up in Bayesian

inference.

93

:

cause during my psychology studies, I took

a few statistics courses, then started as

94

:

a statistics tutor, mainly doing frequentist

statistics.

95

:

And then I took a seminar on Bayesian

statistics in Heidelberg in Germany.

96

:

and it was the hardest seminar that I ever

took.

97

:

Well, it's super hard.

98

:

We read like papers every single week.

99

:

Everyone had to prepare every single paper

for every single week.

100

:

And then at the start of each session, the

professor would just shuffle and randomly

101

:

pick someone to present.

102

:

my God.

103

:

That was tough, but somehow, I don't know,

it stuck with me.

104

:

And I had like this aha moment where I

felt like, okay, all this statistics stuff

105

:

that I've been doing before was more of,

you know, following a recipe, which is

106

:

very strict.

107

:

But then this like holistic Bayesian

probabilistic take.

108

:

just gave me a much broader overview of

statistics in general.

109

:

Somehow I followed the path.

110

:

Yeah.

111

:

I'm curious what that...

112

:

So what does that mean to do Bayesian stats

on deep neural networks concretely?

113

:

What is the thing you would do if you had

to do that?

114

:

Let's say, does that mean you mainly...

115

:

you develop the deep neural network and

then you add some Bayesian layer on that,

116

:

or you have to have the Bayesian framework

from the beginning.

117

:

How does that work?

118

:

Yeah, that's a great question.

119

:

And in fact, that's a common point of

confusion there as well, because Bayesian

120

:

inference is just like a general, almost

philosophical framework for reasoning

121

:

about uncertainty.

122

:

So you have some latent quantities, call

them parameters, whatever, some latent

123

:

unknowns.

124

:

And you want to do inference on them.

125

:

You want to know what these latent

quantities are, but all you have are

126

:

actual observables.

127

:

And you want to know how these are related

to each other.

128

:

And so with Bayesian neural networks, for

instance, these parameters would be the

129

:

neural network weights.

130

:

And so you want full Bayesian inference on

the neural network weights.

131

:

And fitting normal neural networks already

supports that.

132

:

Like a posterior distribution.

133

:

Exactly.

134

:

Over these neural network weights.

135

:

Exactly.

136

:

So that's one approach of doing Bayesian

deep learning, but that's not what I'm

137

:

currently doing.

138

:

Instead, I'm coming from the Bayesian

side.

139

:

So we have like a normal Bayesian model,

which has statistical parameters.

140

:

So you can imagine it like a mechanistic

model for like a simulation program.

141

:

And we want to estimate these scientific

parameters.

142

:

So for example, if you have a cognitive

decision -making task from the cognitive

143

:

sciences, and these parameters might be

something like the non -decision time, the

144

:

actual motor reaction time that you need

to

145

:

move your muscles and some information

uptake rates, some bias and all these

146

:

things that researchers are actually

interested in.

147

:

And usually you would then formulate your

model in, for example, PyMC or Stan or

148

:

however you want to formulate your

statistical model and then run MCMC for

149

:

parameter inference.

150

:

And now where the neural networks come in

in my research is that we replace MCMC

151

:

with a neural network.

152

:

So we still have our Bayesian model.

153

:

But we don't use MCMC for posterior

inference.

154

:

Instead, we use a neural network just for

posterior inference.

155

:

And this neural network is trained by

maximum likelihood.

156

:

So the neural network itself, the weights

there are not probabilistic.

157

:

There are no posterior distributions over

the weights.

158

:

But we just want to somehow model the

actual posterior distributions of our

159

:

statistical model parameters using a

neural network.

160

:

So the neural net, I think so.

161

:

That's quite new to me.

162

:

So I'm going to rephrase that and see how

much I understood.

163

:

So that means the deep neural network is

already trained beforehand?

164

:

No, we have to train it.

165

:

And that's the cool part about this.

166

:

OK, so you train it at the same time.

167

:

You train it at the same time.

168

:

You're also trying to infer the underlying

parameters of your model.

169

:

And that's the cool part now.

170

:

Because in MCMC, you would do both at the

same time, right?

171

:

You have your fixed model that you write

down in PyMC or Stan, and then you have

172

:

your one observed data set, and you want

to fit your model to the data set.

173

:

And so, you know, you do, for example,

your Hamiltonian Monte Carlo algorithm to,

174

:

you know, traverse your parameter space

and then do the sampling.

175

:

So you couple your approximating

176

:

phase and your inference phase.

177

:

Like you learn about the posterior

distribution based on your data set.

178

:

And then you also want to generate

posterior samples while you're exploring

179

:

this parameter space.

180

:

And in the line of work that I'm doing,

which we call amortized Bayesian

181

:

inference, we decouple those two phases.

182

:

So the first phase is actually training

those neural networks.

183

:

And that's the hard task.

184

:

And then you essentially take your

Bayesian model.

185

:

generate a lot of training data from the

model because you can just run prior

186

:

predictive samples.

187

:

So generate prior predictive samples.

188

:

And those are your training data for the

neural network.

189

:

And use the neural network to essentially

learn a surrogate for the posterior

190

:

distribution.

191

:

So for each data set that you have, you

want to take those as conditions and then

192

:

have a generative neural network to learn

somehow how these data and the parameters

193

:

are related to each other.

194

:

And this upfront training phase takes

quite some time and usually takes longer

195

:

than the equivalent MCMC would take, given

that you can run MCMC.

196

:

Now, the cool thing is, as you said, when

your neural network is trained, then the

197

:

posterior inference is super fast.

198

:

Then if you want to generate posterior

samples, there's no approximation anymore

199

:

because you've already done all the

approximation.

200

:

So now you're really just doing sampling.

201

:

That means just generating some random

numbers in some latent space and having

202

:

one pass through the neural network, which

is essentially just a series of matrix

203

:

multiplications.

204

:

So once you've done this hard part and

trained your generative neural network,

205

:

then actually doing the posterior sampling

takes like a fraction of a second for

206

:

10,000 posterior samples.
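To make the two phases concrete, here is a minimal, self-contained sketch of the idea — a toy conjugate Gaussian model and a simple Gaussian posterior family, assumed purely for illustration; this is not the BayesFlow API:

```python
# Amortized posterior inference in miniature (illustrative toy model, not BayesFlow).
# Model assumed for the sketch: mu ~ Normal(0, 1), y_i | mu ~ Normal(mu, 1), i = 1..N.
import numpy as np
import torch
import torch.nn as nn

N = 50  # observations per dataset

def simulate(batch_size):
    """Phase 1 training data: (parameter, dataset) pairs from the prior predictive."""
    mu = np.random.normal(0.0, 1.0, size=batch_size)               # prior draws
    y = np.random.normal(mu[:, None], 1.0, size=(batch_size, N))   # likelihood draws
    return torch.tensor(mu, dtype=torch.float32), torch.tensor(y, dtype=torch.float32)

def summarize(y):
    """Hand-crafted summary statistics (a summary network would learn these)."""
    return torch.stack([y.mean(dim=1), y.std(dim=1)], dim=1)

# Inference network: data summary -> (posterior mean, posterior log-std) for mu
net = nn.Sequential(nn.Linear(2, 64), nn.ReLU(), nn.Linear(64, 2))
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

# Phase 1: expensive upfront training, amortized over all later datasets
for step in range(2000):
    mu, y = simulate(256)
    loc, log_scale = net(summarize(y)).unbind(dim=1)
    # negative Gaussian log-density of the true parameter under the predicted posterior
    loss = (log_scale + 0.5 * ((mu - loc) / log_scale.exp()) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

# Phase 2: near-instant posterior sampling for a new observed dataset
y_obs = torch.tensor(np.random.normal(0.7, 1.0, size=(1, N)), dtype=torch.float32)
with torch.no_grad():
    loc, log_scale = net(summarize(y_obs)).unbind(dim=1)
    posterior_samples = torch.normal(loc.expand(10_000), log_scale.exp().expand(10_000))

# Analytic check for this conjugate toy model: posterior mean = N * y_bar / (N + 1)
print(posterior_samples.mean().item(), N * y_obs.mean().item() / (N + 1))
```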

207

:

Okay, yeah, that's really cool.

208

:

And how generalizable is your deep neural

network then?

209

:

Do you have like, is that, because I can

see the really cool thing to have a neural

210

:

network that's customized to each of your

models.

211

:

That's really cool.

212

:

But at the same time, as you were saying,

that's really expensive to train a neural

213

:

network each time you have to sample a

model.

214

:

And so I was thinking, OK, so then maybe

what you want is have generalized

215

:

categories of deep neural network.

216

:

So that would probably be another kill.

217

:

But let's say I have a deep neural network

for linear regressions.

218

:

Whether they are generalized or just plain

normal likelihood, you would use that deep

219

:

neural network for linear regressions.

220

:

And then the inference is super fast,

because you only have to train.

221

:

the neural network once and then

inference, posterior inference on the

222

:

linear regression parameters themselves is

super fast.

223

:

So yeah, like that's a long question, but

did you get what I'm asking?

224

:

Yeah, absolutely.

225

:

So if I get your question right, now

you're asking like, if you don't want to

226

:

run linear regression, but want to run

some slightly different model, can I still

227

:

use my pre -trained neural network to do

that?

228

:

Yes, exactly.

229

:

And also, yeah, like in general, how does

that work?

230

:

Like, how are you thinking about that?

231

:

Are there already some best practices or

is it like really for now, really cutting

232

:

edge research that and all the questions

are in the air?

233

:

Yeah.

234

:

So first of all, the general use case for

this type of amortized Bayesian inference

235

:

is usually when your model is fixed, but

you have many new datasets.

236

:

So assume you have some quite complex

model where MCMC would take a few minutes

237

:

to run.

238

:

And so instead for one fixed data set that

you actually want to sample from.

239

:

And now instead of running MCMC on it, you

say, okay, I'm going to train this neural

240

:

network.

241

:

So this won't yet be worth it for just one

data set.

242

:

Now the cool thing is if you want to keep

your actual model, so whatever you write

243

:

down in PyMC or Stan,

244

:

We want to keep that fixed, but now plug

in different data sets.

245

:

That's where amortized inference really

shines.

246

:

So for instance, there was this one huge

analysis in the UK where they had like

247

:

intelligence study data from more than 1

million participants.

248

:

And so for each of those participants,

they again had a set of observations.

249

:

And so for each of those 1 million

participants,

250

:

They want to perform posterior inference.

251

:

It means if you want to do this with

something like MCMC or anything non

252

:

-amortized, you would need to fit one

million models.

253

:

So you might argue now, okay, but you can

parallelize this across like a thousand

254

:

cores, but still that's, that's a lot.

255

:

That's a lot of compute.

256

:

Now the cool thing is the model was the

same every single time.

257

:

You just had a million different data

sets.

258

:

And so what these people did then is train

a neural network once.

259

:

And then like it will train for a few

hours, of course, but then you can just

260

:

sequentially feed in all these 1 million

data sets.

261

:

And for each of these 1 million data sets,

it takes way, way less than one second.

262

:

to generate tens of thousands of posterior

samples.

263

:

But that didn't really answer your

question.

264

:

So your question was about how can we

generalize in the model space?

265

:

And that's a really hard problem because

essentially what these neural networks

266

:

learn is to give you some posterior

function if you feed in a data set.

267

:

Now, if you have a domain shift in the

model space, so now you want inference

268

:

based on a different model, and this

neural network has never learned to do

269

:

that.

270

:

So that's tough.

271

:

That's a hard problem.

272

:

And essentially what you could do and what

we are currently doing in our research,

273

:

but that's cutting edge, is expanding the

model space.

274

:

So you would have a very general

formulation of a model and then try to

275

:

amortize over this model.

276

:

So that different configurations of this

model, different variations.

277

:

could just be extracted as special cases of the model

essentially.

278

:

Can you take an example maybe to give an

idea to listeners how that would work?

279

:

Absolutely.

280

:

We have one preprint about sensitivity

-aware amortized Bayesian inference.

281

:

What we do there is essentially have a

kind of multiverse analysis built into the

282

:

neural network training.

283

:

To give some background, multiverse analysis

basically says, okay, what are all the pre

284

:

-processing steps that you could take in

your analysis?

285

:

And you encode those.

286

:

And now you're interested in like, what

if, what if I had chosen a different pre

287

:

-processing technique?

288

:

What if I had chosen a different way to

standardize my data?

289

:

Then also the classical like prior

sensitivity or likelihood sensitivity

290

:

analysis.

291

:

Like what happens if I do power scaling on

my prior?

292

:

Power scaling on my likelihood?

293

:

So we also encode this.

294

:

What happens if I bootstrap some of my

data or just have a perturbation of my

295

:

data?

296

:

What if I add a bit of noise to my data?

297

:

So these are all slightly different

models.

298

:

What we do is essentially keep track of that

during the training phase and just encode

299

:

it into a vector and say, well, okay, now

we're doing pre -processing choice number

300

:

seven.

301

:

and scale the prior to the power of two,

don't scale the likelihood and don't do

302

:

any perturbation and feed this as an

additional information into the neural

303

:

network.

304

:

Now the cool thing is during inference

phase, once we're done with the training,

305

:

you can say, hey, here's a data set.

306

:

Now pretend that we chose pre -processing

technique number 11 and prior scaling of

307

:

power 0.5.

308

:

What's the posterior now?

309

:

Because we've amortized over this large or

more general model space, we also get

310

:

valid posterior inference if we've trained

for long enough over these different

311

:

configurations of model.

312

:

And essentially, if you were to do this

with MCMC, for instance, you would refit

313

:

your model every single time.

314

:

And so here you don't have to do that.
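A rough sketch of that idea, with made-up configuration switches (a prior power-scaling exponent and a standardization flag) standing in for the choices encoded in the paper:

```python
# Hedged sketch of sensitivity-aware training (hypothetical switches, not the paper's code):
# every simulated dataset carries a "context" vector describing the analysis choices used
# to produce it, and the inference network conditions on [summaries, context]. At inference
# time the same trained network is reused and the context is simply set to the configuration
# you want to probe, with no refit.
import numpy as np

rng = np.random.default_rng(1)
N = 50

def simulate_one():
    alpha = rng.uniform(0.5, 2.0)                 # prior power-scaling exponent
    standardize = rng.integers(0, 2)              # 0 = raw data, 1 = standardized data
    mu = rng.normal(0.0, 1.0 / np.sqrt(alpha))    # draw from the power-scaled N(0, 1) prior
    y = rng.normal(mu, 1.0, size=N)
    if standardize:
        y = (y - y.mean()) / y.std()
    summaries = np.array([y.mean(), y.std()])
    context = np.array([alpha, standardize])      # the "what-if" switches
    return mu, np.concatenate([summaries, context])  # network input = summaries + context

# Training pairs: the network learns p(mu | data, analysis configuration) in one go.
theta, conditions = zip(*(simulate_one() for _ in range(5)))
print(np.array(conditions).shape)  # (5, 4): two summaries + two context entries
```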

315

:

Okay.

316

:

Yeah, I see.

317

:

That's super.

318

:

Yeah, that's super cool.

319

:

And I feel like, so that would be mainly

the main use cases would be as you were

320

:

saying, when, when you're getting into

really high data territory and you have

321

:

what's changing is mainly the data side,

mainly the data.

322

:

set and to be even more precise, not

really the data set, but the data values,

323

:

because the data set is supposed to be

like quite the same, like you would have

324

:

the same columns, for instance, but the

values of the columns would change all the

325

:

time.

326

:

And the model at the same time doesn't

change.

327

:

Is that like, that's really for now, at

least the best use case for that kind of

328

:

method.

329

:

Yes.

330

:

And this might seem like a very niche

case.

331

:

But then if you look at like,

332

:

Bayesian workflows in practice, this topic

of this scheme of many model refits

333

:

doesn't necessarily mean that you have a

large number of data sets.

334

:

This might also just mean you want

extensive cross validation.

335

:

So assume that you have one data set.

336

:

Now you want to run leave-one-out cross

validation, but for some reason you can't

337

:

do the Pareto-smoothed importance sampling

version, which would be much faster.

338

So you would need one model refit per left-out data point, even though you just have one data set, because you want that many different posteriors.

340

:

Maybe can you explicit what your meaning

by cross validation here?

341

:

Because that's not a term that's used a

lot in the patient framework, I think.

342

:

Yeah, of course.

343

:

So especially in a Bayesian setting, there's

this approach of leave one out cross

344

:

validation, where you would fit your

posterior based on all data points, but

345

:

one.

346

:

And that's why it's called leave one out,

because you take one out and then fit your

347

:

model, fit your posterior on the rest of

the data.

348

:

And now you're interested in the posterior

predictive performance of this one left

349

:

out observation.

350

:

Yeah.

351

:

And that's called cross validation.

352

:

Yeah.

353

:

Go ahead.

354

:

Yeah, no, just I'm going to let you

finish, but yeah, for listeners familiar

355

:

with the frequentist framework, that's

something that's really heavily used in

356

:

that framework, cross validation.

357

:

And it's very similar to the machine

learning concept of cross validation.

358

:

But in the machine learning area, you

would rather have something like fivefold

359

:

in general, k -fold cross validation,

where you would have larger splits of your

360

:

data and then use parts of your

361

:

whole dataset as the training dataset and

the rest for evaluation.

362

:

Essentially, leave-one-out cross-validation

just puts it to the extreme.

363

:

Everything but one data point is your

train dataset.
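Schematically, leave-one-out cross-validation is just the following loop (the two helper functions are hypothetical stand-ins); with MCMC each `fit_posterior` call is a full refit, while with an amortized estimator it is a single cheap forward pass:

```python
# Hedged sketch of leave-one-out CV; `fit_posterior` and `log_pred_density` are
# placeholders for whatever sampler and predictive-density evaluation you use.
def loo_scores(y, fit_posterior, log_pred_density):
    scores = []
    for i in range(len(y)):
        y_train = [y[j] for j in range(len(y)) if j != i]   # all data but observation i
        posterior_samples = fit_posterior(y_train)          # expensive for MCMC, cheap if amortized
        scores.append(log_pred_density(y[i], posterior_samples))
    return scores
```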

364

:

Yeah.

365

:

Yeah.

366

:

Okay.

367

:

Yeah.

368

:

Damn, that's super fun.

369

:

And is there, is there already a way for

people to try that out or is it mainly for

370

:

now implemented for papers?

371

:

And you are probably.

372

:

I'm guessing working on that with Aki and

all his group in Finland to make that more

373

:

open source, helping people use packages

to do that.

374

:

What's the state of the things here?

375

:

Yeah, that's a great question.

376

:

And in fact, the state of usable open

source software is far behind what we have

377

:

for likelihood -based MCMC based

inference.

378

:

So we currently don't have something

that's comparable to PyMC or Stan.

379

:

Our group is developing or actively

developing a software that's called BayesFlow.

380

:


381

:

That's because like the name, because like

base, because we're doing Bayesian

382

:

inference.

383

:

And essentially the first neural network

architecture that was used for this

384

:

amortized Bayesian inference are so

-called normalizing flows.

385

:

Conditional normalizing flows to be

precise.

386

:

And that's why the name BayesFlow came to

be.

387

:

But now.

388

:

actually have a bit of a different take

because now we have a whole lot of

389

:

generative neural networks and not only

normalizing flows.

390

:

So now we can also use, for example, score

-based diffusion models that are mainly

391

:

used for image generation and AI or

consistency models, which are essentially

392

:

like a distilled version of score -based

diffusion models.

393

:

And so now the name BayesFlow doesn't really capture

that anymore.

394

:

But now what the BayesFlow Python library

specializes in is defining

395

:

Principled amortized Bayesian workflows.

396

:

So the meaning of Bayes slightly shifted

to amortized Bayesian workflows and hence

397

:

the name BayesFlow. And the focus of BayesFlow and the aim of BayesFlow is twofold.

398

:

So first we want a library.

399

:

that's good for actual users. So this might

be researchers who just say hey, here's my

400

:

data set.

401

:

Here's my model my simulation program and

Please just give me fast posterior

402

:

samples.

403

:

So we want

404

:

usable high level interface with sensible

default values that mostly work out of the

405

:

box and an interface that's mostly self

-explanatory.

406

:

Also of course, good teaching material and

all this.

407

:

But that's only one side of the coin

because the other large goal of BayesFlow

408

:

is that it should be usable for machine

learning researchers who want to advance

409

:

amortized Bayesian inference methods as

well.

410

:

And so the software in general,

411

:

is structured in a very modular way.

412

:

So for instance, you could just say, hey,

take my current pipeline, my current

413

:

workflow.

414

:

But now try out a different loss function

because I have a new fancy idea.

415

:

I want to incorporate more likelihood

information.

416

:

And so I want to alter my loss function.

417

:

So you would have your general program

because of the modular architecture there,

418

:

you could just say, take the current loss

function and replace it with a different

419

:

one.

420

:

that adheres to the API.

421

:

And we're trying to do both and serve

both interests, user friendly side for

422

:

actually applied researchers who are also

currently using BayesFlow.

423

:

But then also the machine learning

researchers with completely different

424

:

requirements for this piece of software.

425

:

Maybe we can also put the BayesFlow

documentation and the current project

426

:

website in the notes.

427

:

Yeah, we should definitely do that.

428

:

Definitely gonna try that out myself.

429

:

It sounds like fun.

430

:

I need a use case, but as soon as I have a

use case, I'm definitely gonna try that

431

:

out because it sounds like a lot of fun.

432

:

Yeah, several questions based on that and

thanks a lot for being so clear and so

433

:

detailed on these.

434

:

So first, we talked about normalizing

flows in episode 98 with Marylou

435

:

Gabrié.

436

:

Definitely recommend listeners to listen

to that for some background.

437

:

And question, so BayesFlow, yeah,

definitely we need that in the show notes

438

:

and I'm going to install that in my

environment.

439

:

And I'm guessing, so you're saying that

that's in Python, right?

440

:

The package?

441

:

Yes, the core package is in Python and

we're currently refactoring to Keras.

442

:

So by the time this podcast episode is

aired, we will have a new major release

443

:

version, hopefully.

444

:

OK, nice.

445

:

So you're agnostic to the actual machine

learning back end.

446

:

So then you could choose TensorFlow,

PyTorch, or JAX, whatever integrates best

447

:

with what you're currently proficient in

and what you might be currently using in

448

:

other parts of a project.

449

:

OK, that was going to be my question.

450

:

Because I think while preparing for the

episode, I saw that you were mainly using

451

:

PyTorch.

452

:

So that was going to be my question.

453

:

What is that based on?

454

:

So the back end could be PyTorch, JAX, or.

455

:

What did you think the last one was?

456

:

TensorFlow.

457

:

Yeah, I always forget about all these

names.

458

:

I really know PyTorch.

459

:

So that's why I like the other ones.

460

:

And JAX, of course, for PyMC.

461

:

And then, so my question is, the workflow,

what would it look like if you're using

462

:

BayesFlow?

463

:

Because you were saying the model, you

could write it in standard PyMC or

464

:

TensorFlow, for instance.

465

:

Although I don't know if you can write.

466

:

Bayesian models with TensorFlow anymore.

467

:

Anyways, let's say PyMC or Stan.

468

:

You write your model.

469

:

But then the sampling of the model is done

with the neural network.

470

:

So that means, for instance, PyTorch or

Jax.

471

:

How does that work?

472

:

Do you have then to write the model in a

Jax compatible way?

473

:

Or is the translation done by the package

itself?

474

:

Yeah, that's a great question.

475

:

It touches on many different topics and

considerations and also on future roadmap

476

:

for BayesFlow.

477

:

So.

478

:

This class of algorithms that are

implemented in BayesFlow, these amortized

479

:

Bayesian inference algorithms, to give you

some background there, they originally

480

:

started in simulation -based inference.

481

:

It's also sometimes called likelihood

-free inference.

482

:

So essentially it is Bayesian inference

when you don't bring a closed -form

483

:

likelihood function to the table.

484

:

But instead, you only have some generic

forward simulation program.

485

:

So you would just have your prior as

some...

486

:

Python function or C++ function, whatever,

any function that you could call and it

487

:

would return you a sample from the prior

distribution.

488

:

You don't need to write it down in terms

of distributions actually, but you only

489

:

need to be able to sample from it.

490

:

And then the same for the likelihood.

491

:

So you don't need to write down your

likelihood in like a PyMC or Stan in terms

492

:

of a probability distribution, in terms of

density distribution or densities.

493

:

But instead it's

494

:

just got to be some simulation program,

which takes in parameters and then outputs

495

:

data.

496

:

What happens between these parameters and

the data is not necessarily probabilistic

497

:

in terms of closed form distributions.

498

:

It could also be some non -tractable

differential equations.

499

:

It could be essentially everything.

500

:

So for BayesFlow, this means that you

don't have to input something like a PyMC

501

:

or a Stan model, which you write down in

terms of

502

:

distributions, but it's just a generic

forward model that you can call and you

503

:

will get a tuple of a parameter draw and a

data set.

504

:

So you'd usually just do it in NumPy.

505

:

So you would write, if I'm using BayesFlow,

I would write it in NumPy.

506

:

It would probably be the easiest way.

507

:

You could probably also write it in JAX or

in PyTorch or in TensorFlow or TensorFlow

508

:

probability, whatever you want to use and

like behind the scenes.

509

:

But essentially what we just care about is

that the model gets a tuple of parameters

510

:

and then data that has been generated from

these parameters.

511

:

for the neural network training process.

512

:

That's super fun.

513

:

Yeah, yeah, yeah.

514

:

Definitely want to see that.

515

:

Do you have already some Jupyter notebook

examples up on the repo or are you working

516

:

on that?

517

:

Yeah, currently it's a full -fledged

library.

518

:

It's been under development for a few

years now.

519

:

And we also have an active user base right

now.

520

:

It's quite small compared to other

Bayesian packages.

521

:

We're growing it.

522

:

Yeah, that's cool.

523

:

In documentation, there are currently, I

think, seven or eight tutorial notebooks.

524

:

And then also for Bayes on the Beach,

like this conference in Australia that we

525

:

just talked about earlier, we also

prepared a workshop.

526

:

And we're also going to link to this

Jupyter notebook in the show notes.

527

:

Yeah, definitely we should, we should link

to some of these Jupyter notebooks in the

528

:

show notes.

529

:

And Sean, I'm thinking you should...

530

:

Like if you're down, you should definitely

come back to the show, but for a webinar.

531

:

I have another format that's modeling

webinar where you could, you would come to

532

:

the show and share your screen and, and go

through the model code live and people can

533

:

ask questions and so on.

534

:

I've done that already on a variety of

things.

535

:

Last one was about causal inference and

propensity scores.

536

:

Next one is going to be about Hilbert

space GP decomposition.

537

:

So yeah, if you're down, you should

definitely come and do a demonstration of

538

:

BayesFlow and amortized Bayesian

inference.

539

:

I think that would be super fun and very

interesting to people.

540

:

Absolutely.

541

:

Then to answer the last part of your

question.

542

:

Yeah.

543

:

Like if you currently have a model that's

written down in PyMC or Stan, that's a bit

544

:

more tricky to integrate because

essentially all we need in BayesFlow

545

:

are samples from the prior predictive

distribution.

546

:

If you talk in Bayesian terminology.

547

:

Yeah.

548

:

And if your current model can do that,

that's fine.

549

:

That's all you need right now.

550

:

And then BayesFlow builds on that.

551

:

You can have like a PyMC model and just do

pm.sample_prior_predictive, save that as a

552

:

big NumPy multidimensional array and pass

that to BayesFlow.
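For instance, a PyMC model can serve purely as that simulator; a minimal sketch with an assumed toy model and variable names (only the prior predictive draws are used):

```python
# Hedged sketch: use a PyMC model only to generate the (parameter draw, simulated dataset)
# pairs that an amortized workflow needs. Model and variable names are made up for illustration.
import numpy as np
import pymc as pm

N = 50
with pm.Model():
    mu = pm.Normal("mu", 0.0, 1.0)
    y = pm.Normal("y", mu, 1.0, observed=np.zeros(N))  # placeholder data; resimulated below
    idata = pm.sample_prior_predictive(1000)

theta = idata.prior["mu"].values.reshape(-1, 1)               # (1000, 1) parameter draws
datasets = idata.prior_predictive["y"].values.reshape(-1, N)  # (1000, 50) matching simulated datasets
print(theta.shape, datasets.shape)
```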

553

:

Yes.

554

:

Okay.

555

:

Just all you need are tuples of the ground truth parameters and the data generated from them.

557

:

So essentially like the result of your

prior call and then the result of your

558

:

likelihood call with those prior

parameters.

559

:

So you mean what the likelihood samples

look like once you fix the prior

560

:

parameters to some value?

561

:

Yes.

562

:

So like in practice, you would just call

your prior function.

563

:

Yeah.

564

:

Then get a sample from the prior.

565

:

So parameter vector.

566

:

Yeah.

567

:

And then plug this parameter vector into

the likelihood function.

568

:

And then you get one simulated synthetic

data set.

569

:

And you just need those two.

570

:

Okay.

571

:

Super cool.

572

:

Yeah.

573

:

Definitely sounds like a lot of fun and

should definitely do a webinar about that.

574

:

I'm very excited about that.

575

:

Yeah.

576

:

Fantastic.

577

:

And so that was one of my main questions

on that.

578

:

Other question is, I'm guessing you are a

lot of people working on that, right?

579

:

Because your roadmap that you just talked

about is super big.

580

:

Because having a package that's designed

for users, but also for researchers is

581

:

quite, that's really a lot of work.

582

:

So I'm hoping you're not alone doing

that.

583

:

No, we're currently a team of about a

dozen people.

584

:

No, yeah, that makes sense.

585

:

It's an interdisciplinary team.

586

:

So like a few people with a hardcore like

software engineering background, like some

587

:

people with a machine learning background,

and some people from the cognitive

588

:

sciences and also a handful of physicists.

589

:

Because in fact, these amortized Bayesian

inference methods are particularly

590

:

interesting for physicists.

591

:

Example for astrophysicists who have these

gravitational wave inference problems

592

:

where they have massive data sets.

593

:

And running MCMC on those would be quite

cumbersome.

594

:

So if you have this huge in -stream data

and you don't have this underlying

595

:

likelihood density, but just some

simulation program that might generate

596

:

sensible, like gravitational waves, then

amortized Bayesian inference really shines

597

:

there.

598

:

Okay.

599

:

So that's exactly the case you were

talking about where the model doesn't

600

:

change, but you have a lot of different

datasets.

601

:

Yeah, exactly.

602

:

Because I mean, what you're trying to run

inference on is your physical model.

603

:

And that doesn't change.

604

:

I mean, it does.

605

:

And then again, physicists have a very

good understanding and very good models of

606

:

the world around them.

607

:

And that's maybe one of the largest

differences compared to

608

:

people from the cognitive sciences, where,

you know, the, the models of the human

609

:

brain, for instance, are just, it's such a

tough thing to model and there's so much

610

:

not there and so much uncertainty in the

model building process.

611

:

Yeah, for sure.

612

:

Okay, yeah, I think I'm starting to

understand the idea.

613

:

And yeah, so actually, episode 101 was

exactly about that.

614

:

Black holes, collisions, gravitational

waves.

615

:

And I was talking with LIGO researchers,

Christopher Berry and John Veitch.

616

:

And we talked exactly about that, their

problem with big data sets.

617

:

They are mainly using sequential Monte

Carlo, but I'm guessing they would also be

618

:

interested in a Monte...

619

:

amortized Bayesian inference.

620

:

So yeah, Christopher and John, if you're

listening, in the future reach out to

621

:

Marvin and use BayesFlow.

622

:

And listeners, this episode will be in the

show notes also if you want to give it a

623

:

listen.

624

:

That's a really fun one also learning a

lot of stuff about the crazy universe we

625

:

live in.

626

:

Actually, a weird question I have is why

627

:

is it called amortized Bayesian

inference?

628

:

The reason is that we have this two -stage

process where we would first pay upfront

629

:

with this long neural network training

phase.

630

:

But then once we're done with this, this

cost of the upfront training phase

631

:

amortizes over all the posterior samples

that we can draw within a few

632

:

milliseconds.
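As a back-of-the-envelope illustration with entirely made-up numbers, the break-even point is simply the upfront training cost divided by the per-dataset savings:

```python
# Illustrative arithmetic only; the costs below are hypothetical, not measurements.
mcmc_minutes_per_dataset = 5.0        # assumed per-dataset MCMC cost
training_minutes_upfront = 600.0      # assumed one-off neural network training cost
amortized_seconds_per_dataset = 0.1   # assumed forward-pass sampling cost

break_even = training_minutes_upfront / (mcmc_minutes_per_dataset - amortized_seconds_per_dataset / 60)
print(f"amortization pays off after about {break_even:.0f} datasets")  # roughly 120 with these numbers
```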

633

:

That makes sense.

634

:

That makes sense.

635

:

And so I think something you're also

working on is something that's called deep

636

:

fusion.

637

:

And you do that in particular for

multimodal simulation -based inference.

638

:

How is that related to amortized Bayesian

inference, if at all?

639

:

And what is it about?

640

:

I'm gonna answer these two questions in

reverse order.

641

:

So first about the relation between

simulation -based inference and amortized

642

:

Bayesian inference.

643

:

So to give you a bit of history there,

simulation-based inference is essentially

644

:

Bayesian inference based on simulations

where we don't assume that we have access

645

:

to a likelihood density, but instead we

just assume that we can sample from the

646

:

likelihood.

647

:

Essentially simulate from the model.

648

:

In fact, the likelihood is still

649

:

present, but it's only implicitly defined

and we don't have access to the density.

650

:

That's why likelihood -free inference

doesn't really hit what's happening here.

651

:

But instead, like in the recent years,

people have started adopting the term

652

:

simulation -based inference because we do

Bayesian inference based on simulations

653

:

instead of likelihood densities.

654

:

So methods that have been used...

655

:

for quite a long time now in the

simulation -based inference research area.

656

:

For example, rejection ABC, so approximate

Bayesian computation, or then ABC SMC, so

657

:

combining ABC with sequential Monte Carlo.

658

:

Essentially, the next iteration there was

throwing neural network at simulation

659

:

-based inference.

660

:

That's exactly this neural posterior

estimation that I talked about earlier.

661

:

And now what researchers noticed is, hey,

when we train a neural network for

662

:

simulation -based inference, instead of

running rejection approximate Bayesian

663

:

computation, then we get amortization for

free as a side product.

664

:

It's just a by -product of using a neural

network for simulation -based inference.

665

:

And so in the last maybe four to five

years,

666

:

People have mainly focused on this

algorithm that's called neural posterior

667

:

estimation for simulation based inference.

668

:

And so all developments that happened

there and all the research that happened

669

:

there, almost all the research, sorry,

focused on cases where we don't have any

670

:

likelihood density.

671

:

So we're purely in the simulation based

case.

672

:

Now with our view of things, when we come

from a Bayesian inference, like likelihood

673

:

based setting,

674

:

we can say, hey, amortization is not just a

random coincidental byproduct, but it's a

675

:

feature and we should focus on this

feature.

676

:

And so now what we're currently doing is

moving this idea of amortized Bayesian

677

:

inference with neural networks back into a

likelihood -based setting.

678

:

So we've started using likelihood

information again.

679

:

For example, using likelihood densities if

they're available or learning information

680

:

about the likelihood.

681

:

So like a surrogate model on the fly, and

then again, using this information for

682

:

better posterior inference.

683

:

So we're essentially bridging simulation

-based inference and likelihood -based

684

:

Bayesian inference again with this goal, a

larger goal of amortization if we can do

685

:

it.

686

:

And so this work on deep fusion.

687

:

essentially addresses one huge shortcoming

of neural networks when we want to use

688

:

them for amortized Bayesian inference.

689

:

And that is in situation where we have

multiple different sources of data.

690

:

So for example,

691

:

Imagine you're a cognitive scientist and

you run an experiment with subjects and

692

:

for each test subject, you give them a

decision -making task.

693

:

But at the same time, while your subjects

solve the decision -making task, you wire

694

:

them up with an EEG to measure the brain

activity.

695

:

So for each subject across maybe 100

trials, what you now have is both an EEG

696

:

and the data from the decision -making

task.

697

:

Now, if you want to analyze this with PyMC

or Stan, what you would just do is say,

698

:

hey, well, we have two data -generating

processes that are governed by a set of

699

:

shared parameters.

700

:

So the first part of the likelihood would

just be this Wiener process for the

701

:

decision -making task where you just model

the reaction time.

702

:

fairly standard procedure there in the

cognitive science.

703

:

And then for the second part, we have a

second part of the likelihood that we

704

:

evaluate that somehow handles these EEG

measurements.

705

:

For example, a spatial temporal process or

just like some summary statistics that are

706

:

being computed there.

707

:

However, you would usually compute your

EEG.

708

:

Then you add both to the log PDF of the

likelihood, and then you can call it a

709

:

day.

710

:

You cannot do that in neural networks

because you have no straightforward

711

:

sensible way to combine these reaction

times from the decision -making task and

712

:

the EEG data.

713

:

Because you cannot just take them and slap

them together.

714

:

They are not compatible with each other

because these information data sources are

715

:

heterogeneous.

716

:

So you somehow need a way to fuse these

sources of information.

717

:

so that you can then feed them into the

neural network.

718

:

That's essentially what we're studying in

this paper, where you could just get very

719

:

creative and have different schemes to

fuse the data.

720

:

So you could use these attention schemes

that are very hip in large language models

721

:

right now with transformers essentially,

and have these different data sources

722

:

attend or listen essentially to each

other.

723

:

With cross attention, you could just let

the EEG data inform

724

:

your decision -making data or just have

the decision -making data inform the EEG

725

:

data.

726

:

So you can get very creative there.

727

:

You could also just learn some

representation of both individually, then

728

:

concatenate them and feed them to the

neural network.

729

:

Or you could do very creative and weird

mixes of all those approaches.

730

:

And in this paper, we essentially have a

systematic investigation of these

731

:

different options.

732

:

And we find that the most straightforward

option works the best.

733

:

overall, and that's just learning fixed

size embeddings of your data sources

734

:

individually, and then just concatenating

them.

735

:

It turns out then we can use information

from both sources in an efficient way,

736

:

even though we're doing inference with

neural networks.

737

:

And maybe what's interesting for

practitioners is that we can compensate

738

:

for missing data in individual sources.

739

:

And the paper we essentially, we induced

missing data by just taking these EEG data

740

:

and decision -making data and just

randomly dropping some of them.

741

:

And the neural networks have learned, like

when we do this fusion process, the neural

742

:

networks learn to compensate for partial

missingness in both sources.

743

:

So if you just remove some of the decision

-making data, the neural network learn to

744

:

use the EEG data to inform your posterior.

745

:

Even though the data and one of the

sources are missing, the inference is

746

:

pretty robust then.

747

:

And again, all this happens without model

refits.

748

:

So you would just account for that during

training.

749

:

Of course you have to do this like random

dropping of data during a training phase

750

:

as well.

751

:

And then you can also get it during the

inference phase.
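A hedged sketch of that training-time trick, with hypothetical names: randomly zero out one source's embedding and pass the mask along, so the network sees, and learns to handle, missing sources:

```python
# Illustrative masking of one data source during training (names and shapes assumed).
import torch

def mask_sources(rt_emb, eeg_emb, p_missing=0.2):
    keep_eeg = (torch.rand(eeg_emb.shape[0], 1) > p_missing).float()   # 0 = EEG missing for that dataset
    eeg_emb = eeg_emb * keep_eeg                                       # zero out embeddings of missing EEG data
    return torch.cat([rt_emb, eeg_emb, keep_eeg], dim=-1)              # the mask flag tells the network what's present

print(mask_sources(torch.randn(8, 16), torch.randn(8, 16)).shape)  # torch.Size([8, 33])
```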

752

:

yeah, that sounds, yeah, that's really

cool.

753

:

Maybe that's a bit of a, like a small

piece of this paper in our larger roadmap.

754

:

This is essentially taking this amortized

Bayesian inference

755

:

up to the level of trustworthiness and

robustness and all these gold standards

756

:

that we currently have for likelihood

-based inference in PyMC or Stan.

757

:

Yeah.

758

:

Yeah.

759

:

And there's still a lot of work to do

because of course, like there's no free

760

:

lunch.

761

:

and, and of course there are many problems

with trustworthiness.

762

:

And that's also one of the reasons why I'm

here with Aki right now.

763

:

cause Aki is so great at Bayesian workflow

and trustworthiness, good diagnostics.

764

:

That's all, you know, all the things that

we currently still need for trustworthy,

765

:

amortized Bayesian inference.

766

:

Yeah.

767

:

So maybe you want to.

768

:

talk a bit more about that and what you're

doing on that.

769

:

That sounds like something very

interesting.

770

:

So one huge advantage of an amortized

Bayesian sampler is that evaluations and

771

:

diagnostics are extremely cheap.

772

:

So for example, there's this gold standard

method that's called simulation based

773

:

calibration, where you would sample from

your model and then like a sample from

774

:

your prior predictive space and then refit

your model and look at your coverage, for

775

:

instance.

776

:

In general, look at the calibration of

your model on this potentially very large

777

:

prior predictive space.

778

:

So you naturally need many model refits,

but your model is fixed.

779

:

So if you do it with MCMC, it's a gold

standard evaluation technique, but it's

780

:

very expensive to run, especially if your

model is complex.

781

:

Now, if you have an amortized estimator,

simulation -based calibration on thousands

782

:

of datasets takes a few seconds.
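Schematically, simulation-based calibration reduces to the following loop; here the amortized sampler is stood in by the exact conjugate posterior of a toy Gaussian model, purely so the sketch runs end to end:

```python
# Hedged SBC sketch: `amortized_posterior_samples` is a placeholder for a trained network's
# forward pass; with a real amortized estimator each iteration is a cheap forward pass.
import numpy as np

rng = np.random.default_rng(0)
N, S = 50, 200   # observations per dataset, posterior draws per dataset

def amortized_posterior_samples(y, size):
    # placeholder: exact conjugate posterior for mu ~ N(0, 1), y_i ~ N(mu, 1)
    post_mean = y.sum() / (len(y) + 1)
    post_sd = np.sqrt(1.0 / (len(y) + 1))
    return rng.normal(post_mean, post_sd, size=size)

ranks = []
for _ in range(1000):                          # 1000 "refits" for the price of 1000 forward passes
    mu_true = rng.normal(0.0, 1.0)             # draw a parameter from the prior
    y = rng.normal(mu_true, 1.0, size=N)       # simulate a dataset from it
    draws = amortized_posterior_samples(y, S)  # amortized posterior draws
    ranks.append((draws < mu_true).sum())      # rank of the true value among the draws

# For a well-calibrated sampler, the ranks should be close to uniform on 0..S
hist, _ = np.histogram(ranks, bins=10, range=(0, S))
print(hist)
```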

783

:

So essentially, and that's my goal for

this research visit with Aki here in

784

:

Finland, is trying to figure out what are

some diagnostics that are gold standard,

785

:

but potentially very expensive, up to a

point where it's infeasible to run on a

786

:

larger scale with MCMC.

787

:

But we can easily do it with an amortized

estimator.

788

:

With the goal of figuring out, like, can

we trust this estimator?

789

:

Yes or no?

790

:

It's like, as you might know from neural

networks, we just have no idea what's

791

:

happening inside their neural network.

792

:

And so we currently don't have these

strong diagnostics that we have for MCMC.

793

:

Like, for example, R-hat.

794

:

There's no comparable thing for neural

network.

795

:

So one of my goals here is to come up with

more good diagnostics that are either

796

:

possible with MCMC, but very expensive so

we don't run them, but they would be very

797

:

cheap with an amortized estimator.

798

:

Or the second thing just specific to an

amortized estimator, just like R-hat is

799

:

specific to MCMC.

800

:

Okay.

801

:

Yeah, I see.

802

:

Yeah, that makes tons of sense.

803

:

well.

804

:

And actually, so I would have more

technical questions on these, but I see

805

:

the time running out.

806

:

I think something I'm mainly curious about

is the challenges, the biggest challenges

807

:

you face when applying amortized Bayesian

inference and deep fusion techniques in your

808

:

projects, but also like in the projects

you see.

809

:

I think that's going to also give a sense

to listeners of when and where to use

810

:

these kinds of methods.

That's a great question. And I'm more than happy to talk about all these challenges, because there's so much room for improvement: these amortized methods have so much potential, but we still have a long way to go until they are as usable and as straightforward to use as current MCMC samplers.

In general, one challenge for practitioners is that we have most of the problems and hardships that we have in PyMC or Stan. And that is that researchers have to think about their model in a probabilistic way, in a mechanistic way. So instead of just saying, hey, I click on t-test or linear regression in some graphical user interface, they actually have to come up with a data generating process and have to specify their model. And this whole topic of model specification is just the same in an amortized workflow, because in some way we need to specify the Bayesian model.

And now, on top of all this, we have a huge additional layer of complexity, and this is defining the neural networks. In amortized Bayesian inference, nowadays we have two neural networks. The first one is a so-called summary network, which essentially learns a latent embedding of the data set. Essentially, those are like optimal learned summary statistics, and optimal doesn't mean that they have to be optimal to reconstruct the data; instead, optimal means they're optimal to inform the posterior. For example, in a very, very simple toy model, if you have just a Gaussian model and you just want to perform inference on the mean, then a sufficient summary statistic for posterior inference on the mean would be the mean, because that's all you need to reconstruct the mean. It sounds very tautological, but yeah. Then again, the mean is obviously not enough to reconstruct the data, because all the variance information is missing. What the summary network learns is something like the mean: summary statistics that are optimal for posterior inference.

And then the second network is the actual generative neural network. So a normalizing flow, a score-based diffusion model, a consistency model, flow matching, whatever conditional generative model you want. And this will handle the sampling from the posterior. And these two networks are learned end to end. So you would learn your summary statistics, output them, feed them into the posterior network, the generative model, and then have one evaluation of the loss function and optimize both end to end. And so we have two neural networks, long story short, which is substantially harder than just hitting sample on a PyMC or Stan program. And that's an additional hardship for practitioners.
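To make the end-to-end idea concrete, here is a minimal sketch of the two-network setup — not the BayesFlow implementation. It uses a toy Gaussian simulator, a deep-set style summary network, and, instead of a normalizing flow or diffusion model, a simple amortized Gaussian posterior head so the example stays short; all names and architecture choices are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Toy simulator: Gaussian model, infer the mean mu from n IID observations.
def simulate(batch_size, n_obs=50):
    mu = torch.randn(batch_size, 1)                          # prior: mu ~ N(0, 1)
    x = mu.unsqueeze(1) + torch.randn(batch_size, n_obs, 1)  # likelihood: x_i ~ N(mu, 1)
    return mu, x

# Summary network: learns a fixed-length embedding of the whole data set.
# Mean pooling over observations respects exchangeability (deep-set style).
class SummaryNet(nn.Module):
    def __init__(self, summary_dim=8):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, summary_dim))

    def forward(self, x):                  # x: (batch, n_obs, 1)
        return self.phi(x).mean(dim=1)     # (batch, summary_dim)

# Posterior network: here just an amortized Gaussian q(mu | summary) for brevity;
# in practice this would be a normalizing flow, flow matching, or a diffusion model.
class PosteriorNet(nn.Module):
    def __init__(self, summary_dim=8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(summary_dim, 32), nn.ReLU(), nn.Linear(32, 2))

    def log_prob(self, mu, summary):
        mean, log_std = self.net(summary).chunk(2, dim=-1)
        return torch.distributions.Normal(mean, log_std.exp()).log_prob(mu).sum(-1)

summary_net, posterior_net = SummaryNet(), PosteriorNet()
optimizer = torch.optim.Adam(
    list(summary_net.parameters()) + list(posterior_net.parameters()), lr=1e-3
)

# End-to-end training: a single loss, gradients flow through both networks.
for step in range(2000):
    mu, x = simulate(batch_size=64)
    loss = -posterior_net.log_prob(mu, summary_net(x)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Amortized phase: inference for a new data set is now just a forward pass.
_, x_new = simulate(batch_size=1)
print(posterior_net.net(summary_net(x_new)))  # posterior mean and log-std for mu
```

After the upfront training, posterior inference for any new data set is a single forward pass through both networks, which is what makes the approach amortized.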

Now, in BayesFlow, what we do is provide sensible default values for the generative neural networks, which work in maybe 80 or 90% of the cases. It's just sufficient to have, for example, a neural spline flow — some sort of normalizing flow with, I don't know, six layers and a certain number of units, some regularization for robustness, and, you know, cosine decay of the learning rate. All these machine learning parts, we try to take them away from the user if they don't want to mess with them. But still, if things don't work, users would need to somehow diagnose the problems and then, you know, play with the number of layers and the neural network architecture.

And then for the summary network: the summary network essentially needs to be informed by the data. So if you have time series, you would look at something like an LSTM — these long short-term memory time series neural networks — or you would have a recurrent neural network, or nowadays a time series transformer; they're also called temporal fusion transformers. If you have IID data, you would have something like a deep set or a set transformer, which respects this exchangeable structure of the data. So again, we can give all the recommendations and sensible default values, like: if you have a time series, try a time series transformer. Then again, if things don't work out, users need to play around with these settings. So that's definitely one hardship of amortized Bayesian inference in general.

And for the second part of your question, the hardships of this deep fusion: essentially, if you have more and more information sources, then things can get very complicated. For example, just a few days ago we discussed a case where someone has 60 different sources of information, and they're all streams of time series. Now we could say, hey, just slap 60 summary networks on this problem, one summary network for each domain. That's going to be very complex and very hard to train, especially if we don't bring that many data sets to the table for the neural network training. And so there we somehow need to find a compromise: okay, what information can we condense and group together? So maybe some of the time series sources are somewhat similar and actually compatible with each other. So we could, for example, come up with six groups of 10 time series each. Then we would only need six neural networks for the summary embeddings, and all these practical considerations. That makes things just as hard as in likelihood-based, MCMC-based inference, but just a bit harder because of all the neural network stuff that's happening.
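As a toy illustration of that grouping idea — my own sketch with invented dimensions, not code from the episode or from BayesFlow — each group of similar time series sources gets its own small summary network, and the group embeddings are concatenated into a single condition vector for the posterior network.

```python
import torch
import torch.nn as nn

# Hypothetical setup: 6 groups of 10 similar time series sources each.
n_groups, sources_per_group, n_time, summary_dim = 6, 10, 100, 8

# One small summary network (here an LSTM) per group instead of one per source.
group_nets = nn.ModuleList(
    nn.LSTM(input_size=sources_per_group, hidden_size=summary_dim, batch_first=True)
    for _ in range(n_groups)
)

def fuse(groups):
    """groups: list of tensors, each of shape (batch, n_time, sources_per_group)."""
    embeddings = []
    for net, g in zip(group_nets, groups):
        _, (h, _) = net(g)                  # final hidden state as group embedding
        embeddings.append(h.squeeze(0))     # (batch, summary_dim)
    # Concatenated group embeddings become the condition for the posterior network.
    return torch.cat(embeddings, dim=-1)    # (batch, n_groups * summary_dim)

data = [torch.randn(4, n_time, sources_per_group) for _ in range(n_groups)]
print(fuse(data).shape)                     # torch.Size([4, 48])
```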

Did this address your question?

Yeah, yeah. It gives me more questions, but for sure, that does answer the question. When you're talking about a transformer for time series, are you talking about the transformers, the neural network that's used in large language models, or is it something else?

It's essentially the same, but slightly adjusted for time series, so that the statistics, or these latent embeddings that you output, still respect the time series structure, where typically you would have this autoregressive structure. So it's not exactly the same as a standard transformer, but you would just enrich it to respect the probabilistic structure in your data. But at the core it's just the same. So at the core it's an attention mechanism, like multi-head attention, where the different parts of your dataset can essentially talk or listen to each other. So it's just the same.

Okay, yeah, that's interesting. I didn't know that existed for time series. That's interesting. That means — because with a transformer, one of the main things is you have to tokenize the inputs, right? So here you would tokenize too, like there is a tokenization happening of the time series data?

You don't have to tokenize here, because the reason why you have to tokenize in large language models, or in natural language processing in general, is that you want to somehow encode your characters or your words into numbers, essentially. And we don't need that in Bayesian inference in general, because we already have numbers. Our data already comes in numbers, so we don't need tokenization here. Of course, if we had text data, then we would need tokenization.

Yeah, yeah, yeah. Okay, okay. Yeah, it makes more sense to me. All right, that's fun. I didn't know that existed. Do you have any resources about transformers for time series that we could put in the show notes?

Absolutely. There is a paper that's called Temporal Fusion Transformers, I think. I will send you the link.

Oh yeah, awesome. Yeah, thanks.

Definitely. We have this time series transformer, the temporal fusion transformer, implemented in BayesFlow. So now it's just a very usable interface where you would just input your data and then you get your latent embeddings. You can say, like, I want to input my data and I want as an output 20 learned summary statistics. So that's all you need to do there.

Okay. And you can go crazy.

So what would you do with it? Good. Yeah, what would you do with these results, basically the outputs of the transformer — what would you use that for?

Those are the learned summary statistics, which you would then treat as a compressed, fixed-length version of your data for the posterior network, for this generative model.

So then you use that afterwards in the model?

Exactly. Yeah. So the transformer is just used to learn summary statistics of the data sets that we input. For instance, if you have time series — like we did this for COVID time series. If you have a COVID time series worth, say, a three-year period with daily reporting, you would have a time series with about a thousand time steps. That's quite long as a condition to pass into a neural network. And also, if now you don't have a thousand days but a thousand and one days, then the length of your input to the neural network would change, and your neural network couldn't handle that. So what you do with a time series transformer is compress this time series of maybe 1,000 or maybe 1,050 time steps into a fixed-length vector of summary statistics. Maybe you extract 200 summary statistics from that.
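To illustrate that compression step, here is a minimal toy sketch — not the temporal fusion transformer discussed in the episode — using an LSTM-based summary network, which was mentioned earlier as one option for time series. Its final hidden state is a fixed-length vector of learned summaries, whatever the number of time steps; all dimensions are made up for the example.

```python
import torch
import torch.nn as nn

# Toy summary network for time series of variable length: the LSTM's final
# hidden state serves as a fixed-length vector of learned summary statistics
# (16 here, instead of the ~200 mentioned for the COVID example).
class TimeSeriesSummary(nn.Module):
    def __init__(self, n_summaries=16):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=n_summaries, batch_first=True)

    def forward(self, x):              # x: (batch, n_time_steps, 1)
        _, (h_n, _) = self.lstm(x)     # h_n: (1, batch, n_summaries)
        return h_n.squeeze(0)          # (batch, n_summaries)

summary_net = TimeSeriesSummary()
series_1000 = torch.randn(8, 1000, 1)  # e.g. 1,000 daily observations
series_1050 = torch.randn(8, 1050, 1)  # e.g. 1,050 daily observations

# Both series lengths map to summaries of the same shape, so the downstream
# posterior network always sees fixed-length conditions.
print(summary_net(series_1000).shape)  # torch.Size([8, 16])
print(summary_net(series_1050).shape)  # torch.Size([8, 16])
```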

Hey, okay, I see. And then you can use that in your neural network, in the model that's going to be sampling your model — in the neural network that's going to be sampling your model.

We already see that we're heavily overloading terminology here. So what's a model, actually? We have to differentiate between the actual Bayesian model that we're trying to fit, and then the neural network — the generative model, or generative neural network — that we're using as a replacement for MCMC. So there's a lot of this taxonomy that's odd when you're at the interface of deep learning and statistics. Another one of those hiccups is parameters. In Bayesian inference, parameters are your inference targets: you want posterior distributions on a handful of model parameters. When you talk to people from deep learning about parameters, they understand the neural network weights. So sometimes you have to be careful — I have to be careful — with the terminology and the words used to describe things, because we have different types of people working on different levels of abstraction here, with different functions.

Yeah, yeah, exactly. So that means, in this case, the transformer takes in time values, it summarizes them, and it passes that on to the neural network that's going to be used to sample the Bayesian model.

Exactly. And they are passed in as the conditions, like a conditional probability, which totally makes sense, because this generative neural network learns the distribution of parameters conditional on the data, or on summary statistics of the data. So that's the exact definition of the Bayesian posterior distribution: a distribution of the Bayesian model parameters conditional on the data. It's the exact definition of the posterior.
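In symbols (my own notation, not from the episode): the conditional generative network q with weights φ, conditioned on the learned summaries s_ψ(x), is trained to approximate exactly that object,

```latex
q_{\phi}\bigl(\theta \mid s_{\psi}(x)\bigr) \;\approx\; p(\theta \mid x) \;=\; \frac{p(x \mid \theta)\, p(\theta)}{p(x)},
```

where θ are the Bayesian model parameters, x the observed data, and φ, ψ the weights of the posterior and summary networks.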

Yeah, I see. And that means... So in this case — yeah, no, I think my question was going to be: why would you use this kind of additional layer on the time series data? But you already answered that. Is it, well, what if your time series data is too big, or something like that?

Exactly. It's not just about being too big, but also about variable length. Because the generative neural network always wants fixed-length inputs — in this case of the COVID model, it could only handle input conditions of length 200. And now the time series transformer handles the part where our actual raw data have variable length, because time series transformers can handle data of variable length. So they would, you know, just take a time series of 1,000 or 1,050 time steps and then always compress it to 200 summary statistics. So this generative neural network, which is much more strict about the shapes and form of the input data, will always see inputs of the same length.

Yeah, okay. Yeah, I see, that makes sense. Awesome. Yeah, super cool. And so, as you were saying, this is already available in BayesFlow; people can use this kind of transformer for time series.

Yeah, absolutely. For time series, and also for sets, so for IID data. Because if you just take an IID data set and input it into a neural network, the neural network doesn't know that your observations are exchangeable. So it will assume much more structure than there actually is in your data. So again, it has a double function, a dual function: compressing the data, encoding the probabilistic structure of the data, and also outputting a fixed-length representation. So this would be a set transformer, or a deep set is another option. It's also implemented in BayesFlow.
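A tiny illustration of what respecting exchangeability buys you — again my own sketch, not BayesFlow code: a deep-set style summary that pools over observations gives the same embedding no matter how the exchangeable observations are ordered.

```python
import torch
import torch.nn as nn

# Deep-set style summary: embed each observation, then pool over the set.
phi = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 4))

def deep_set_summary(x):            # x: (n_obs, 1), one exchangeable data set
    return phi(x).mean(dim=0)       # pooling makes the summary order-invariant

x = torch.randn(30, 1)
x_shuffled = x[torch.randperm(30)]

# Same summary for any ordering of the observations.
print(torch.allclose(deep_set_summary(x), deep_set_summary(x_shuffled)))  # True
```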

Super cool. Yeah. And so, let's start winding down here, because I've already taken a lot of your time. Maybe a last few questions would be: what are some emerging topics within deep learning and probabilistic machine learning that you find particularly intriguing? Because we've talked here a lot about really the nitty-gritty, the statistical details and so on; but now, if we zoom out a bit and start thinking more long-term?

Yeah. I'm very excited about two large topics. The first one is generative models that are very expressive — so unconstrained neural network architectures — but that at the same time have one-step inference. For example, people have been using score-based diffusion models, or flow matching, a lot for image generation, like, for example, Stable Diffusion. You might be familiar with this tool where you input a text prompt and then you get fantastic images. Now, this takes quite some time, like a few seconds for each image, and that's only because it runs on a fancy cluster; if you run it locally on a computer, it takes much longer. And that's because the score-based diffusion model needs many discretization steps in this denoising process during inference time. And now, throughout the last year, there have been a few attempts at having these very expressive and super powerful neural networks that are much, much faster, because they don't have these many denoising steps. Instead, they directly learn a one-step inference. So they could generate an image not in, like, a thousand steps, but in only one step. And that's very cutting edge, or bleeding edge if you will, because they don't work that great yet. But I think there's much potential in there: it's both expressive and fast. And then again, we've used some of those for amortized Bayesian inference. So we use consistency models, and they have super high potential, in my opinion. So, you know, with these advances in deep learning, oftentimes we can use them for amortized Bayesian inference; we just reformulate these generative models and slightly tune them to our tasks. So I'm very excited about this.

And the second area I'm very excited about is foundation models — I guess most people in AI are these days. So, foundation models: essentially, neural networks are very good at in-distribution tasks. Whatever is in the training data set, neural networks are typically very good at finding patterns that are similar to what they saw in the training set. Now, in the open world — so if we are out of distribution, if we have a domain shift, distribution shift, model misspecification, however you want to call it — neural networks typically aren't that good. So what we could do is either make them slightly better out of distribution, or we just extend the in-distribution regime to a huge space. And that's what foundation models do. For example, GPT-4 would be a foundation model, because it's just trained on so much data. I don't know how much; it's not terabytes anymore, it's essentially the entire internet. So it's just a huge training set. And so the world, the training set that this neural network has been trained on, is just huge. And so essentially we don't really have out-of-distribution cases anymore, just because our training set is so huge. And that's also one area that could be very useful for amortized Bayesian inference, to overcome the very initial shortcoming that you talked about, where we would also like to amortize over different Bayesian models.

Hmm, I see. Yeah, yeah, yeah. Yeah, that would definitely be super fun. Yeah, I'm really impressed and interested to see this interaction of, like, deep learning, artificial intelligence, and then the Bayesian framework coming on top of that. That is really super cool. I love that. Yeah, it makes me super curious to try that stuff out.

So, to play us out, Marvin — actually, this is a very active area of research — what advice would you give to beginners interested in diving into this intersection of deep learning and probabilistic machine learning?

That's a great question. Essentially, I would have two recommendations. The first one is to really try to simulate stuff. Whatever it is that you are curious about, just try to write a simulation program and try to simulate some of the data that you might be interested in. For example, if you're really interested in soccer, then code up a simulation program that just simulates soccer matches and the outcomes of soccer matches. That way you can really get a feeling for the data generating processes that are happening, because probabilistic machine learning, at its very core, is all about data generating processes and reasoning about these processes. And I think it was Richard Feynman who said, "What I cannot create, I do not understand." That's essentially at the heart of simulation-based inference in a more narrow setting, of probabilistic machine learning more broadly, or of science more broadly, even. So yeah, definitely: simulating and running simulation studies can be super helpful, both to understand what's happening in the background and also to get a feeling for programming and to get better at programming as well.
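To give a flavor of what such a simulation program can look like, here is a deliberately simple sketch of a made-up data generating process for soccer scores; the Poisson assumption and the attack rates are just illustrative choices, not something prescribed in the episode.

```python
import numpy as np

rng = np.random.default_rng(42)

# A toy data generating process: each team's goal count in a match is Poisson
# distributed with a rate given by that team's (made-up) attack strength.
def simulate_match(attack_home, attack_away):
    return rng.poisson(attack_home), rng.poisson(attack_away)

# Simulate a small "season" and look at the distribution of outcomes.
results = [simulate_match(attack_home=1.6, attack_away=1.1) for _ in range(1000)]
home_wins = sum(h > a for h, a in results)
draws = sum(h == a for h, a in results)
print(f"home wins: {home_wins}, draws: {draws}, away wins: {1000 - home_wins - draws}")
```

Playing with the rates and comparing the simulated outcomes to real match data is exactly the kind of reasoning about a data generating process described above.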

Then the second piece of advice would be to essentially find a balance between these hands-on, getting-your-hands-dirty types of things — like implementing a model in PyTorch or Keras, or solving some Kaggle tasks, just some machine learning tasks — and, at the same time, reading books and finding new information, to make sure that you actually know what you're doing, and also know what you don't know and what the next steps are to get better on the theoretical side.

And there are two books that I can really recommend. The first one is Deep Learning by Ian Goodfellow. It's also available for free online; you can also link to it in the show notes. It's a great book and it covers so much. And if you come from this Bayesian or statistics background, you see a lot of conditional probabilities in there, because a lot of deep learning is just conditional generative modeling. And then the second book would, in fact, be Statistical Rethinking by Richard McElreath. It's a great book, and it's not only limited to Bayesian inference: there's also a lot of causal inference, of course, and also just thinking about probability and the philosophy behind this whole probabilistic modeling topic more broadly. So, earlier today I had a chat with one of the student assistants that I'm supervising, and he said: hey Marvin, I read Statistical Rethinking a few weeks ago, and today I read something about score-based diffusion models — these state-of-the-art deep learning models that are used to generate images — and because I read Statistical Rethinking, it all made sense. There's so much probability going on in these score-based diffusion models, and Statistical Rethinking really helped me understand that. At first I couldn't really believe it, but it totally makes sense, because Statistical Rethinking is not just a book about the Bayesian workflow and Bayesian modeling, but more about, you know, reasoning about probabilities and uncertainty in a more general way. And it's a beautiful book. So I'd recommend those.

Nice. Yeah. So definitely, let's put those two in the show notes; Marvin, I will. Of course, I've read Statistical Rethinking several times, so I definitely agree. The first one, about deep learning, I haven't yet, but I will definitely read it, because that sounds really fascinating. So I really want to get that book.

Fantastic. Well, thanks a lot, Marvin. That was really awesome. I really learned a lot, and I'm pretty sure listeners did too, so that's super fun. You definitely need to come back to do a modeling webinar with us and show us in action what we talked about today with the BayesFlow package. It's also, I guess, going to inspire people to use it and maybe contribute to it. But before that, of course, I'm going to ask you the last two questions I ask every guest at the end of the show.

First one: if you had unlimited time and resources, which problem would you try to solve?

That's a very loaded question, because there are so many very, very important problems to solve — big-picture problems like peace, world hunger, global warming, all those. I'm afraid that, with my background, I don't really know how to contribute significantly, with a huge impact, to those problems. So my consideration is essentially a trade-off between how important the problem is, what impact solving or addressing the problem would have, and what impact I could have on solving the problem.

And so I think what would be very nice is to make probabilistic inference — or Bayesian inference, more particularly — accessible, usable, easy and fast for everyone. And that doesn't just mean, you know, methods researchers, machine learning researchers; it essentially means anyone who works with data in any way. And there's so much to do. The actual Bayesian model in the background could be huge — like a BayesGPT, like ChatGPT, but just for Bayes — just with the sheer scope of amortization over different models, different settings and so on. So that's a huge, huge challenge on the backend side. But then on the frontend and API side, I think it also has many different sub-problems, because it would mean people could just, you know, write down a description of their model in plain text language, like with a large language model, and not actually specify everything by programming. Maybe also just sketch out some data, like expert elicitation, and all those different topics. I think there's this bigger picture where, you know, thousands of researchers worldwide are working on so many niche topics, but having this overarching BayesGPT kind of thing would be really cool. So I'd probably choose that to work on. It's a very risky thing, so that's why I'm not currently working on it.

Yeah, I love that. Yeah, that sounds awesome.

Feel free to cooperate and collaborate with me on that.

I would definitely be down. That sounds absolutely amazing. Yeah. So send me an email when you start working on that; I'll be happy to join the team.

And second question: if you could have dinner with any great scientific mind, dead, alive or fictional, who would it be?

Again, a very loaded question — a super interesting question. I mean, there are two huge choices. I could either go with someone who's currently alive, where I feel like I want their take on the current state of the art and future directions and so on. And the second huge option, which I guess many people would go with, is someone who's been dead for two to three centuries. And I think I'd go with the second choice, so really take someone from way back in the past. And that's because of two reasons. Of course, speaking to today's scientists is super interesting, and I would love to do that. But I mean, they have access to all the state-of-the-art technology and they know about all the latest advancements. So if they have some groundbreaking, creative ideas to share that they've come up with, they could just implement them and make them actionable. And the second reason is that today's scientists have a huge platform, because they're on the internet. So if they really want to express an idea, they can just do it on Twitter or wherever. So there are other ways to engage with them, apart from, you know, having a magical dinner.

So I would choose someone from the past, and in particular, I think Ada Lovelace would be super interesting for me to talk to. Essentially because she's widely considered the first programmer, and the craziest thing about that is that she never had access to anything like a modern computer. So she wrote the first program, but the machine wasn't there yet. That's such a huge leap of creativity and genius. And so I'd really be interested in, if Ada Lovelace saw what's happening today — all the technology that we have, with generative AI, GPU clusters and all these possibilities — what's the next leap forward? What's today's equivalent of writing the first program without having the computer? Yeah, I'd really love to know the answer, and there's currently no other way, except for your magical dinner invitation, to get it. So that's why I go with this option.

Yeah, yeah. No, awesome. Awesome. I love it. That definitely sounds like a marvelous dinner. So yeah, awesome. Thanks a lot, Marvin. That was really a blast. I'm going to let you go now, because you've been talking for a long time and I'm guessing you need a break. But that was really amazing. So yeah, thanks a lot for taking the time.

Thanks again to Matt Rosinski for this awesome recommendation. I hope you loved it — Marvin, and also Matt; me, I did. So that was really awesome. As usual, I'll put resources and a link to your website in the show notes, and Marvin is also going to add stuff to the show notes for those who want to dig deeper. Thank you again, Marvin, for taking the time and being on this show.

Thank you very much for having me, Alex. I appreciate it.

This has been another episode of Learning Bayesian Statistics. Be sure to rate, review and follow the show on your favorite podcatcher, and visit learnbayesstats.com for more resources about today's topics, as well as access to more episodes to help you reach a true Bayesian state of mind. That's learnbayesstats.com.

Our theme music is « Good Bayesian » by Baba Brinkman, feat MC Lars and Mega Ran. Check out his awesome work at bababrinkman.com. I'm your host, Alex Andorra. You can follow me on Twitter at alex_andorra, like the country. You can support the show and unlock exclusive benefits by visiting patreon.com/learnbayesstats. Thank you so much for listening and for your support.

You're truly a good Bayesian, change your predictions after taking information in. And if you're thinking I'll be less than amazing, let's adjust those expectations. Let me show you how to be a good Bayesian: change calculations after taking fresh data in. Those predictions that your brain is making, let's get them on a solid foundation.
