*Proudly sponsored by **PyMC Labs**, the Bayesian Consultancy. **Book a call**, or **get in touch**!*

In this episode, Marvin Schmitt introduces the concept of amortized Bayesian inference, where the upfront training phase of a neural network is followed by fast posterior inference.

Marvin will guide us through this new concept, discussing his work in probabilistic machine learning and uncertainty quantification, using Bayesian inference with deep neural networks.

He also introduces BayesFlow, a Python library for amortized Bayesian workflows, and discusses its use cases in various fields, while also touching on the concept of deep fusion and its relation to multimodal simulation-based inference.

A PhD student in computer science at the University of Stuttgart, Marvin is supervised by two LBS guests you surely know — Paul Bürkner and Aki Vehtari. Marvin’s research combines deep learning and statistics, to make Bayesian inference fast and trustworthy.

In his free time, Marvin enjoys board games and is a passionate guitar player.

*Our theme music is « Good Bayesian », by Baba Brinkman (feat MC Lars and Mega Ran). Check out his awesome work at **https://bababrinkman.com/** !*

**Thank you to my Patrons for making this episode possible!**

*Yusuke Saito, Avi Bryant, Ero Carrera, Giuliano Cruz, Tim Gasser, James Wade, Tradd Salvo, William Benton, James Ahloy, Robin Taylor, Chad Scherrer, Zwelithini Tunyiswa, Bertrand Wilden, James Thompson, Stephen Oates, Gian Luca Di Tanna, Jack Wells, Matthew Maldonado, Ian Costley, Ally Salim, Larry Gill, Ian Moran, Paul Oreto, Colin Caprani, Colin Carroll, Nathaniel Burbank, Michael Osthege, Rémi Louf, Clive Edelsten, Henri Wallen, Hugo Botha, Vinh Nguyen, Marcin Elantkowski, Adam C. Smith, Will Kurt, Andrew Moskowitz, Hector Munoz, Marco Gorelli, Simon Kessell, Bradley Rode, Patrick Kelley, Rick Anderson, Casper de Bruin, Philippe Labonde, Michael Hankin, Cameron Smith, Tomáš Frýda, Ryan Wesslen, Andreas Netti, Riley King, Yoshiyuki Hamajima, Sven De Maeyer, Michael DeCrescenzo, Fergal M, Mason Yahr, Naoya Kanai, Steven Rowland, Aubrey Clayton, Jeannine Sue, Omri Har Shemesh, Scott Anthony Robson, Robert Yolken, Or Duek, Pavel Dusek, Paul Cox, Andreas Kröpelin, Raphaël R, Nicolas Rode, Gabriel Stechschulte, Arkady, Kurt TeKolste, Gergely Juhasz, Marcus Nölke, Maggi Mackintosh, Grant Pezzolesi, Avram Aelony, Joshua Meehl, Javier Sabio, Kristian Higgins, Alex Jones, Gregorio Aguilar, Matt Rosinski, Bart Trudeau, Luis Fonseca, Dante Gates, Matt Niccolls, Maksim Kuznecov, Michael Thomas, Luke Gorrie, Cory Kiser, Julio, Edvin Saveljev, Frederick Ayala, Jeffrey Powell, Gal Kampel, Adan Romero, Will Geary and Blake Walters.*

Visit https://www.patreon.com/learnbayesstats to unlock exclusive Bayesian swag 😉

**Takeaways**:

- Amortized Bayesian inference combines deep learning and statistics to make posterior inference fast and trustworthy.
- Bayesian neural networks can be used for full Bayesian inference on neural network weights.
- Amortized Bayesian inference decouples the training phase and the posterior inference phase, making posterior sampling much faster.
- BayesFlow is a Python library for amortized Bayesian workflows, providing a user-friendly interface and modular architecture.
- Self-consistency loss is a technique that combines simulation-based inference and likelihood-based Bayesian inference, with a focus on amortization.
- The BayesFlow package aims to make amortized Bayesian inference more accessible and provides sensible default values for neural networks.
- Deep fusion techniques allow for the fusion of multiple sources of information in neural networks.
- Generative models that are expressive and have one-step inference are an emerging topic in deep learning and probabilistic machine learning.
- Foundation models, which have a large training set and can handle out-of-distribution cases, are another intriguing area of research.

**Chapters**:

00:00 Introduction to Amortized Bayesian Inference

07:39 Bayesian Neural Networks

11:47 Amortized Bayesian Inference and Posterior Inference

23:20 BayesFlow: A Python Library for Amortized Bayesian Workflows

38:15 Self-consistency loss: Bridging Simulation-Based Inference and Likelihood-Based Bayesian Inference

41:35 Amortized Bayesian Inference

43:53 Fusing Multiple Sources of Information

45:19 Compensating for Missing Data

56:17 Emerging Topics: Expressive Generative Models and Foundation Models

01:06:18 The Future of Deep Learning and Probabilistic Machine Learning

**Links from the show:**

- Marvin’s website: https://www.marvinschmitt.com/
- Marvin on GitHub: https://github.com/marvinschmitt
- Marvin on LinkedIn: https://www.linkedin.com/in/marvin-schmitt/
- Marvin on Twitter: https://twitter.com/MarvinSchmittML
- The BayesFlow package for amortized Bayesian workflows: https://bayesflow.org/
- BayesFlow Forums for users: https://discuss.bayesflow.org
- BayesFlow software paper (JOSS): https://joss.theoj.org/papers/10.21105/joss.05702
- Tutorial on amortized Bayesian inference with BayesFlow (Python): https://colab.research.google.com/drive/1ub9SivzBI5fMbSTwVM1pABsMlRupgqRb?usp=sharing
- Towards Reliable Amortized Bayesian Inference: https://www.marvinschmitt.com/speaking/pdf/slides_reliable_abi_botb.pdf
- Expand the model space that we amortize over (multiverse analyses, power scaling, …): “Sensitivity-Aware Amortized Bayesian Inference” https://arxiv.org/abs/2310.11122
- Use heterogeneous data sources in amortized inference: “Fuse It or Lose It: Deep Fusion for Multimodal Simulation-Based Inference” https://arxiv.org/abs/2311.10671
- Use likelihood density information (explicit or even learned on the fly): “Leveraging Self-Consistency for Data-Efficient Amortized Bayesian Inference” https://arxiv.org/abs/2310.04395
- LBS #98 Fusing Statistical Physics, Machine Learning & Adaptive MCMC, with Marylou Gabrié: https://learnbayesstats.com/episode/98-fusing-statistical-physics-machine-learning-adaptive-mcmc-marylou-gabrie/
- LBS #101 Black Holes Collisions & Gravitational Waves, with LIGO Experts Christopher Berry & John Veitch: https://learnbayesstats.com/episode/101-black-holes-collisions-gravitational-waves-ligo-experts-christopher-berry-john-veitch/
- Deep Learning book: https://www.deeplearningbook.org/
- Statistical Rethinking: https://xcelab.net/rm/

**Transcript**

*This is an automatic transcript and may therefore contain errors. Please **get in touch** if you’re willing to correct them.*


In this episode, Marvin Schmitt introduces the concept of amortized Bayesian inference, where the upfront training phase of a neural network is followed by fast posterior inference. Marvin will guide us through this new concept, discussing his work in probabilistic machine learning and uncertainty quantification, using Bayesian inference with deep neural networks. He also introduces BayesFlow, a Python library for amortized Bayesian workflows, and discusses its use cases in various fields, while also touching on the concept of deep fusion and its relation to multimodal simulation-based inference.

Yeah, that is a very deep episode and also a fascinating one. I've been personally diving much more into amortized Bayesian inference with BayesFlow since the folks there have been kind enough to invite me to the team, and I can tell you, this is super promising technology.

A PhD student in computer science at the University of Stuttgart, Marvin is supervised actually by two LBS guests you surely know, Paul Bürkner and Aki Vehtari. Marvin's research combines deep learning and statistics to make Bayesian inference fast and trustworthy. In his free time, Marvin enjoys board games and is a passionate guitar player.

This is Learning Bayesian Statistics, episode 107, recorded April 3, 2024.

Welcome to Learning Bayesian Statistics, a podcast about Bayesian inference, the methods, the projects, and the people who make it possible. I'm your host, Alex Andorra. You can follow me on Twitter at alex_andorra, like the country, for any info about the show. LearnBayesStats.com is the place to be. Show notes, becoming a corporate sponsor, unlocking Bayesian merch, supporting the show on Patreon, everything is in there. That's LearnBayesStats.com. If you're interested in one-on-one mentorship, online courses, or statistical consulting, feel free to reach out and book a call at topmate.io/alex_andorra. See you around, folks, and best Bayesian wishes to you all.

Today, I want to thank the fantastic Adan Romero, Will Geary, and Blake Walters for supporting the show on Patreon. Your support is truly invaluable and literally makes this show possible. I can't wait to talk with you guys in the Slack channel.

Second, the first part of our modeling webinar series on Gaussian processes is out for everyone. So if you want to see how to use the new HSGP approximation in PyMC, head over to the LBS YouTube channel and you'll see Juan Orduz, a fellow PyMC core dev and mathematician, explain how to do fast and efficient Gaussian processes in PyMC. I'm actually working on the next part in this series as we speak, so stay tuned for more and follow the LBS YouTube channel if you don't want to miss it.

Okay, back to the show now.

Marvin Schmitt, Willkommen nach Learning Bayesian Statistics.

Thanks Alex, thanks for having me.

Actually, my German is very rusty. Do you say nach or zu?

Well, welcome, Learning Bayesian Statistics. Maybe "welcome in podcast"?

Nah.

Obviously, obviously, like, it was a third hidden option. Damn.

It's a secret third thing, right?

Yeah, always in Germany. It's always that. Man, damn. Well, that's okay. I got embarrassed in front of the world, but I'm used to that in each episode. So thanks a lot for taking the time, Marvin.

Thanks a lot to Matt Rosinski, actually, for recommending to do an episode with you. Matt was kind enough to take some of his time to write to me and put me in contact with you. I think you guys met in Australia at a very fun conference, Bayes on the Beach. I think it happens every two years. I definitely want to go there in two years and do a live episode there. Definitely, that's a... that's a project. I wanted to do that this year, but that didn't go well with my traveling dates. So in two years, I'm definitely going to try to do that. So yeah, listeners and Marvin, you can hold me accountable on that promise.

Absolutely. We will.

So Marvin, before we talk a bit more about what you're a specialist in and also what you presented in Australia, can you tell us what you're doing nowadays and also how you ended up working on this?

Yeah, of course. So these days, I'm mostly doing methods development. So broadly in probabilistic machine learning, I care a lot about uncertainty quantification. And so essentially, I'm doing Bayesian inference with deep neural networks. So taking Bayesian inference, which is notoriously slow at times, which might be a bottleneck, and then using generative neural networks to speed up this process, but still maintaining all the explainability, all these nice benefits that we have from using Bayesian inference.

I have a background in both psychology and computer science. That's also how I ended up in Bayesian inference, because during my psychology studies, I took a few statistics courses, then started as a statistics tutor, mainly doing frequentist statistics. And then I took a seminar on Bayesian statistics in Heidelberg in Germany, and it was the hardest seminar that I ever took. It was super hard. We read papers every single week. Everyone had to prepare every single paper for every single week. And then at the start of each session, the professor would just shuffle and randomly pick someone to present.

Oh my God.

That was tough, but somehow, I don't know, it stuck with me. And I had this aha moment where I felt like, okay, all this statistics stuff that I've been doing before was more of, you know, following a recipe, which is very strict. But then this holistic Bayesian probabilistic take just gave me a much broader overview of statistics in general. Somehow I followed the path.

Yeah. I'm curious what that... So what does it mean to do Bayesian stats on deep neural networks, concretely? What is the thing you would do if you had to do that? Let's say, does that mean you mainly develop the deep neural network and then you add some Bayesian layer on that, or do you have to have the Bayesian framework from the beginning? How does that work?

Yeah, that's a great question. And in fact, that's a common point of confusion there as well, because Bayesian inference is just like a general, almost philosophical framework for reasoning about uncertainty. So you have some latent quantities, call them parameters, whatever, some latent unknowns. And you want to do inference on them. You want to know what these latent quantities are, but all you have are actual observables. And you want to know how these are related to each other. And so with Bayesian neural networks, for instance, these parameters would be the neural network weights. And so you want full Bayesian inference on the neural network weights. And fitting normal neural networks already supports that.

Like a posterior distribution?

Exactly.

Over these neural network weights?

Exactly.

So that's one approach of doing Bayesian deep learning, but that's not what I'm currently doing. Instead, I'm coming from the Bayesian side. So we have a normal Bayesian model, which has statistical parameters. So you can imagine it like a mechanistic model, or like a simulation program. And we want to estimate these scientific parameters. So for example, if you have a cognitive decision-making task from the cognitive sciences, these parameters might be something like the non-decision time, the actual motor reaction time that you need to move your muscles, and some information uptake rates, some bias, and all these things that researchers are actually interested in. And usually you would then formulate your model in, for example, PyMC or Stan, or however you want to formulate your statistical model, and then run MCMC for parameter inference.

And now where the neural networks come in in my research is that we replace MCMC with a neural network. So we still have our Bayesian model. But we don't use MCMC for posterior inference. Instead, we use a neural network just for posterior inference. And this neural network is trained by maximum likelihood. So the neural network itself, the weights there are not probabilistic. There are no posterior distributions over the weights. But we just want to somehow model the actual posterior distributions of our statistical model parameters using a neural network.

So the neural net... I think so. That's quite new to me. So I'm going to rephrase that and see how much I understood. So that means the deep neural network is already trained beforehand?

No, we have to train it. And that's the cool part about this.

OK, so you train it at the same time? You train it at the same time you're also trying to infer the underlying parameters of your model?

And that's the cool part now. Because in MCMC, you would do both at the same time, right? You have your fixed model that you write down in PyMC or Stan, and then you have your one observed data set, and you want to fit your model to the data set. And so, you know, you do, for example, your Hamiltonian Monte Carlo algorithm to, you know, traverse your parameter space and then do the sampling. So you couple your approximating phase and your inference phase. Like you learn about the posterior distribution based on your data set. And then you also want to generate posterior samples while you're exploring this parameter space.

And in the line of work that I'm doing, which we call amortized Bayesian inference, we decouple those two phases. So the first phase is actually training those neural networks. And that's the hard task. And then you essentially take your Bayesian model and generate a lot of training data from the model, because you can just run prior predictive samples. So generate prior predictive samples. And those are your training data for the neural network. And use the neural network to essentially learn a surrogate for the posterior distribution. So for each data set that you have, you want to take those as conditions and then have a generative neural network learn somehow how these data and the parameters are related to each other. And this upfront training phase takes quite some time and usually takes longer than the equivalent MCMC would take, given that you can run MCMC.

Now, the cool thing is, as you said, when your neural network is trained, then the posterior inference is super fast. Then if you want to generate posterior samples, there's no approximation anymore because you've already done all the approximation. So now you're really just doing sampling. That means just generating some random numbers in some latent space and having one pass through the neural network, which is essentially just a series of matrix multiplications. So once you've done this hard part and trained your generative neural network, then actually doing the posterior sampling takes like a fraction of a second for 10,000 posterior samples.
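The two phases described here can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not BayesFlow code: it assumes a one-parameter Gaussian model, uses the sample mean as a hand-picked summary of each data set, and stands in for the generative neural network with an affine map fitted by least squares (which is maximum likelihood for a Gaussian with constant spread). The names `simulate` and `sample_posterior` are hypothetical.

```python
import random
import statistics

random.seed(0)

N_OBS = 20       # observations per data set (toy choice)
PRIOR_SD = 1.0   # prior: theta ~ Normal(0, PRIOR_SD)
NOISE_SD = 1.0   # likelihood: x_i ~ Normal(theta, NOISE_SD)


def simulate():
    """One prior predictive draw: a parameter and the data set it generates."""
    theta = random.gauss(0.0, PRIOR_SD)
    data = [random.gauss(theta, NOISE_SD) for _ in range(N_OBS)]
    return theta, data


# Phase 1: training (slow, done once). Learn how theta relates to the data
# summary by fitting theta ~ intercept + slope * mean(data) on simulations.
pairs = [simulate() for _ in range(20_000)]
thetas = [theta for theta, _ in pairs]
summaries = [statistics.fmean(data) for _, data in pairs]
mean_t, mean_s = statistics.fmean(thetas), statistics.fmean(summaries)
slope = (
    sum((s - mean_s) * (t - mean_t) for s, t in zip(summaries, thetas))
    / sum((s - mean_s) ** 2 for s in summaries)
)
intercept = mean_t - slope * mean_s
spread = statistics.pstdev(
    [t - (intercept + slope * s) for s, t in zip(summaries, thetas)]
)


def sample_posterior(data, n_samples=10_000):
    """Phase 2: inference (fast). Push latent noise through the fitted map."""
    center = intercept + slope * statistics.fmean(data)
    return [center + spread * random.gauss(0.0, 1.0) for _ in range(n_samples)]


# Any new data set from the same model reuses the trained map, with no refit.
_, observed = simulate()
posterior_draws = sample_posterior(observed)
```

For this conjugate toy model the exact posterior is a Normal with mean 20/21 times the sample mean and variance 1/21, so the learned map can be checked directly. A real application would replace the affine map with a conditional generative network and learn the summary as well.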

Okay, yeah, that's really cool. And how generalizable is your deep neural network then? Because I can see the really cool thing of having a neural network that's customized to each of your models. That's really cool. But at the same time, as you were saying, it's really expensive to train a neural network each time you have to sample a model. And so I was thinking, OK, so then maybe what you want is to have generalized categories of deep neural networks. So that would probably be another kill. But let's say I have a deep neural network for linear regressions. Whether they are generalized or just plain normal likelihood, you would use that deep neural network for linear regressions. And then the inference is super fast, because you only have to train the neural network once, and then posterior inference on the linear regression parameters themselves is super fast. So yeah, that's a long question, but did you get what I'm asking?

Yeah, absolutely. So if I get your question right, you're asking: if you don't want to run linear regression, but want to run some slightly different model, can I still use my pre-trained neural network to do that?

Yes, exactly. And also, yeah, in general, how does that work? How are you thinking about that? Are there already some best practices, or is it for now really cutting-edge research, and all the questions are in the air?

Yeah. So first of all, the general use case for this type of amortized Bayesian inference is usually when your model is fixed, but you have many new data sets. So assume you have some quite complex model where MCMC would take a few minutes to run for one fixed data set that you actually want to sample from. And now instead of running MCMC on it, you say, okay, I'm going to train this neural network. So this won't yet be worth it for just one data set. Now the cool thing is if you want to keep your actual model, so whatever you write down in PyMC or Stan, we want to keep that fixed, but now plug in different data sets. That's where amortized inference really shines.

So for instance, there was this one huge analysis in the UK where they had intelligence study data from more than 1 million participants. And so for each of those participants, they again had a set of observations. And so for each of those 1 million participants, they want to perform posterior inference. It means if you want to do this with something like MCMC or anything non-amortized, you would need to fit one million models. So you might argue now, okay, but you can parallelize this across like a thousand cores, but still, that's a lot. That's a lot of compute. Now the cool thing is the model was the same every single time. You just had a million different data sets. And so what these people did then is train a neural network once. It will train for a few hours, of course, but then you can just sequentially feed in all these 1 million data sets. And for each of these 1 million data sets, it takes way, way less than one second to generate tens of thousands of posterior samples.

But that didn't really answer your question.

So your question was about how can we generalize in the model space? And that's a really hard problem, because essentially what these neural networks learn is to give you some posterior function if you feed in a data set. Now, if you have a domain shift in the model space, so now you want inference based on a different model, this neural network has never learned to do that. So that's tough. That's a hard problem. And essentially what you could do, and what we are currently doing in our research, but that's cutting edge, is expanding the model space. So you would have a very general formulation of a model and then try to amortize over this model, so that different configurations of this model, different variations, could just be extracted as special cases, essentially.

Can you take an example maybe to give an idea to listeners how that would work?

Absolutely. We have one preprint about sensitivity-aware amortized Bayesian inference. What we do there is essentially have a kind of multiverse analysis built into the neural network training. To give some background, multiverse analysis basically says, okay, what are all the pre-processing steps that you could take in your analysis? And you encode those. And now you're interested in, like, what if I had chosen a different pre-processing technique? What if I had chosen a different way to standardize my data? Then also the classical prior sensitivity or likelihood sensitivity analysis. Like, what happens if I do power scaling on my prior? Power scaling on my likelihood? So we also encode this. What happens if I bootstrap some of my data or just have a perturbation of my data? What if I add a bit of noise to my data? So these are all slightly different models.

What we do is essentially keep track of that during the training phase and just encode it into a vector and say, well, okay, now we're doing pre-processing choice number seven, scale the prior to the power of two, don't scale the likelihood, and don't do any perturbation, and feed this as additional information into the neural network. Now the cool thing is during the inference phase, once we're done with the training, you can say, hey, here's a data set. Now pretend that we chose pre-processing technique number 11 and prior scaling of power 0.5. What's the posterior now? Because we've amortized over this larger, more general model space, we also get valid posterior inference, if we've trained for long enough over these different configurations of the model. And essentially, if you were to do this with MCMC, for instance, you would refit your model every single time. And so here you don't have to do that.
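Encoding analysis choices into a vector that is fed to the network alongside the data can be sketched as follows. The field names, the number of pre-processing options, and the one-hot layout are assumptions made for illustration; the preprint's actual encoding may differ.

```python
import random
from dataclasses import dataclass

random.seed(0)

N_PREPROCESSING = 12  # number of pre-processing pipelines (hypothetical)


@dataclass(frozen=True)
class AnalysisConfig:
    """One point in the expanded model space (hypothetical layout)."""

    preprocessing: int       # index of the pre-processing choice
    prior_power: float       # exponent applied to the prior density
    likelihood_power: float  # exponent applied to the likelihood
    perturb_data: bool       # whether the data were perturbed or bootstrapped

    def to_context(self):
        """Flatten the configuration into a numeric context vector."""
        one_hot = [0.0] * N_PREPROCESSING
        one_hot[self.preprocessing] = 1.0
        return one_hot + [
            self.prior_power,
            self.likelihood_power,
            float(self.perturb_data),
        ]


def random_config():
    """Training phase: draw a fresh configuration for each simulated data set."""
    return AnalysisConfig(
        preprocessing=random.randrange(N_PREPROCESSING),
        prior_power=random.choice([0.5, 1.0, 2.0]),
        likelihood_power=random.choice([0.5, 1.0, 2.0]),
        perturb_data=random.random() < 0.5,
    )


# During training, the network would see (data, context) pairs like this one.
training_context = random_config().to_context()

# During inference, the same data set can be queried under any configuration,
# e.g. pre-processing choice 11 with the prior scaled to the power 0.5.
query = AnalysisConfig(
    preprocessing=11, prior_power=0.5, likelihood_power=1.0, perturb_data=False
)
query_context = query.to_context()
```

Because the context is just another network input, asking "what if" under a different configuration costs one more forward pass rather than one more model fit.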

Okay. Yeah, I see. That's super cool. And I feel like the main use cases would be, as you were saying, when you're getting into really high data territory and what's changing is mainly the data side, mainly the data set. And to be even more precise, not really the data set, but the data values, because the data set is supposed to be quite the same. Like, you would have the same columns, for instance, but the values of the columns would change all the time. And the model at the same time doesn't change. Is that, for now at least, the best use case for that kind of method?

Yes. And this might seem like a very niche case. But then if you look at Bayesian workflows in practice, this scheme of many model refits doesn't necessarily mean that you have a large number of data sets. This might also just mean you want extensive cross-validation. So assume that you have one data set with 1000 observations. Now you want to run leave-one-out cross-validation, but for some reason you can't do the Pareto smoothed importance sampling version, which would be much faster. So you would need 1000 model refits, even though you just have one data set, because you want 1000 cross-validation refits.

Maybe can you make explicit what you mean by cross-validation here? Because that's not a term that's used a lot in the Bayesian framework, I think.

Yeah, of course. So especially in a Bayesian setting, there's this approach of leave-one-out cross-validation, where you would fit your posterior based on all data points but one. And that's why it's called leave-one-out, because you take one out and then fit your model, fit your posterior on the rest of the data. And now you're interested in the posterior predictive performance on this one left-out observation.

Yeah.

And that's called cross-validation.

Yeah.

Go ahead.

Yeah, no, just, I'm going to let you finish, but yeah, for listeners familiar with the frequentist framework, that's something that's really heavily used in that framework, cross-validation.

And it's very similar to the machine learning concept of cross-validation. But in the machine learning area, you would rather have something like fivefold or, in general, k-fold cross-validation, where you would have larger splits of your data and then use parts of your whole data set as the training data set and the rest for evaluation. Essentially, leave-one-out cross-validation just puts it to the extreme. Everything but one data point is your training data set.
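To make the cost of brute-force leave-one-out cross-validation concrete, here is a small sketch with a conjugate Gaussian model, where each refit is cheap and exact. The model, prior scale, and function names are assumptions chosen for illustration. With a model that needs MCMC, every pass through the loop would be a full sampling run, which is the situation where an amortized posterior pays off.

```python
import math
import random

random.seed(1)

PRIOR_SD = 10.0  # prior: mu ~ Normal(0, PRIOR_SD)
NOISE_SD = 1.0   # likelihood: y_i ~ Normal(mu, NOISE_SD), NOISE_SD known


def fit_posterior(data):
    """Exact conjugate posterior for mu, returned as (mean, variance)."""
    precision = 1.0 / PRIOR_SD**2 + len(data) / NOISE_SD**2
    mean = (sum(data) / NOISE_SD**2) / precision
    return mean, 1.0 / precision


def log_predictive(y, post_mean, post_var):
    """Log posterior predictive density of one held-out observation."""
    var = post_var + NOISE_SD**2
    return -0.5 * (math.log(2 * math.pi * var) + (y - post_mean) ** 2 / var)


data = [random.gauss(2.0, NOISE_SD) for _ in range(1000)]

# One refit per held-out point: 1000 observations mean 1000 posterior fits.
elpd_loo = 0.0
n_refits = 0
for i, held_out in enumerate(data):
    training = data[:i] + data[i + 1:]
    post_mean, post_var = fit_posterior(training)
    elpd_loo += log_predictive(held_out, post_mean, post_var)
    n_refits += 1
```

The quantity `elpd_loo` sums the log predictive density of each left-out point; k-fold cross-validation would hold out larger blocks and need only k refits.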

Yeah. Okay. Damn, that's super fun. And is there already a way for people to try that out, or is it mainly for now implemented for papers? And you are probably, I'm guessing, working on that with Aki and all his group in Finland to make that more open source, helping people use packages to do that. What's the state of things here?

Yeah, that's a great question. And in fact, the state of usable open source software is far behind what we have for likelihood-based, MCMC-based inference. So we currently don't have something that's comparable to PyMC or Stan. Our group is actively developing a software that's called BayesFlow. The name: "Bayes" because we're doing Bayesian inference. And essentially the first neural network architecture that was used for this amortized Bayesian inference were so-called normalizing flows. Conditional normalizing flows, to be precise. And that's why the name BayesFlow came to be. But now we actually have a bit of a different take, because now we have a whole lot of generative neural networks and not only normalizing flows. So now we can also use, for example, score-based diffusion models that are mainly used for image generation in AI, or consistency models, which are essentially like a distilled version of score-based diffusion models. And so now the name BayesFlow doesn't really capture that anymore.

But now what the BayesFlow Python library specializes in is defining principled amortized Bayesian workflows. So the meaning of BayesFlow slightly shifted to amortized Bayesian workflows, and hence the name BayesFlow. And the focus of BayesFlow, the aim of BayesFlow, is twofold. So first, we want a library that's good for actual users. So this might be researchers who just say, hey, here's my data set, here's my model, my simulation program, and please just give me fast posterior samples. So we want a usable high-level interface with sensible default values that mostly work out of the box, and an interface that's mostly self-explanatory. Also, of course, good teaching material and all this.

But that's only one side of the coin, because the other large goal of BayesFlow is that it should be usable for machine learning researchers who want to advance amortized Bayesian inference methods as well. And so the software in general is structured in a very modular way. So for instance, you could just say, hey, take my current pipeline, my current workflow. But now try out a different loss function, because I have a new fancy idea. I want to incorporate more likelihood information. And so I want to alter my loss function. So you would have your general program, and because of the modular architecture there, you could just say, take the current loss function and replace it with a different one that fits the API. And we're trying to do both and serve both interests: the user-friendly side for actual applied researchers, who are also currently using BayesFlow, but then also the machine learning researchers with completely different requirements for this piece of software. Maybe we can also link the BayesFlow documentation and the current project website in the notes.

Yeah, we should definitely do that. Definitely gonna try that out myself. It sounds like fun. I need a use case, but as soon as I have a use case, I'm definitely gonna try that out because it sounds like a lot of fun. Yeah, several questions based on that, and thanks a lot for being so clear and so detailed on these. So first, we talked about normalizing flows in episode 98 with Marylou Gabrié. Definitely recommend listeners to listen to that for some background. And question, so BayesFlow, yeah, definitely we need that in the show notes, and I'm going to install that in my environment. And I'm guessing, so you're saying that that's in Python, right? The package?

Yes, the core package is in Python and we're currently refactoring to Keras. So by the time this podcast episode is aired, we will have a new major release version, hopefully.

OK, nice.

So you're agnostic to the actual machine learning back end. So then you could choose TensorFlow, PyTorch, or JAX, whatever integrates best with what you're currently proficient in and what you might be currently using in other parts of a project.

OK, that was going to be my question. Because I think while preparing for the episode, I saw that you were mainly using PyTorch. So that was going to be my question. What is that based on? So the back end could be PyTorch, JAX, or... What did you say the last one was?

TensorFlow.

Yeah, I always forget about all these names. I really know PyTorch. So that's why I like the other ones. And JAX, of course, for PyMC.

And then, so my question is, the workflow, what would it look like if you're using BayesFlow? Because you were saying the model, you could write it in standard PyMC or TensorFlow, for instance. Although I don't know if you can write Bayesian models with TensorFlow anymore. Anyways, let's say PyMC or Stan. You write your model. But then the sampling of the model is done with the neural network. So that means, for instance, PyTorch or JAX. How does that work? Do you have then to write the model in a JAX-compatible way? Or is the translation done by the package itself?

Yeah, that's a great question. It touches on many different topics and considerations and also on the future roadmap for BayesFlow.

So this class of algorithms that are implemented in BayesFlow, these amortized Bayesian inference algorithms, to give you some background there, they originally started in simulation-based inference. It's also sometimes called likelihood-free inference. So essentially it is Bayesian inference when you don't bring a closed-form likelihood function to the table, but instead you only have some generic forward simulation program. So you would just have your prior as some Python function or C++ function, whatever, any function that you could call and it would return you a sample from the prior distribution. You don't need to write it down in terms of distributions, actually; you only need to be able to sample from it.

And then the same for the likelihood. So you don't need to write down your likelihood like in PyMC or Stan, in terms of a probability distribution, in terms of densities. Instead, it's just got to be some simulation program which takes in parameters and then outputs data. What happens between these parameters and the data is not necessarily probabilistic in terms of closed-form distributions. It could also be some intractable differential equations. It could be essentially anything.

So for BayesFlow, this means that you don't have to input something like a PyMC or a Stan model, which you write down in terms of distributions; it's just a generic forward model that you can call, and you will get a tuple of a parameter draw and a data set. So you'd usually just do it in NumPy.

So you would write, if I'm using BayesFlow, I would write it in NumPy.

It would probably be the easiest way. You could probably also write it in JAX or in PyTorch or in TensorFlow or TensorFlow Probability, whatever you want to use behind the scenes. But essentially, what we just care about is that the neural network training process gets tuples of parameters and the data that has been generated from these parameters.
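To make the idea of a generic forward model concrete, here is a minimal NumPy sketch. It is not BayesFlow's API: the function names `prior`, `likelihood`, and `forward_model`, the toy Gaussian model, and the batch size are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

def prior():
    """Return one parameter draw; here a hypothetical (mu, sigma) pair."""
    mu = rng.normal(loc=0.0, scale=1.0)
    sigma = rng.gamma(shape=2.0, scale=0.5)  # strictly positive scale
    return np.array([mu, sigma])

def likelihood(theta, n_obs=50):
    """Simulate one data set from the parameters; no density is evaluated."""
    mu, sigma = theta
    return rng.normal(loc=mu, scale=sigma, size=n_obs)

def forward_model(n_obs=50):
    """Call the prior, then the simulator, and return (parameters, data)."""
    theta = prior()
    return theta, likelihood(theta, n_obs)

# A training batch is simply many such tuples stacked into arrays.
pairs = [forward_model() for _ in range(128)]
params = np.stack([theta for theta, _ in pairs])  # shape (128, 2)
data = np.stack([y for _, y in pairs])            # shape (128, 50)
```

Nothing in this sketch requires a closed-form likelihood: `likelihood` could just as well wrap a differential equation solver, as long as it maps parameters to simulated data.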

That's super fun. Yeah, yeah, yeah. Definitely want to see that. Do you have already some Jupyter notebook examples up on the repo or are you working on that?

Yeah, currently it's a full-fledged library. It's been under development for a few years now. And we also have an active user base right now. It's quite small compared to other Bayesian packages. We're growing it.

Yeah, that's cool.

In the documentation, there are currently, I think, seven or eight tutorial notebooks. And then also for Bayes on the Beach, this conference in Australia that we just talked about earlier, we also prepared a workshop. And we're also going to link to this Jupyter notebook in the show notes.

Yeah, definitely, we should link to some of these Jupyter notebooks in the show notes. And actually, I'm thinking you should... Like if you're down, you should definitely come back to the show, but for a webinar. I have another format, the modeling webinar, where you would come to the show and share your screen and go through the model code live, and people can ask questions and so on. I've done that already on a variety of things. Last one was about causal inference and propensity scores. Next one is going to be about Hilbert space GP decomposition. So yeah, if you're down, you should definitely come and do a demonstration of BayesFlow and amortized Bayesian inference. I think that would be super fun and very interesting to people.

Absolutely. Then to answer the last part of your question.

Yeah.

If you currently have a model that's written down in PyMC or Stan, that's a bit more tricky to integrate, because essentially all we need in BayesFlow are samples from the prior predictive distribution, if you talk in Bayesian terminology.

Yeah.

And if your current model can do that, that's fine. That's all you need right now. And then BayesFlow builds...

You can have like a PyMC model and just do pm.sample_prior_predictive, save that as a big NumPy multidimensional array and pass that to BayesFlow.

Yes.

Okay.

All you need are tuples of the ground-truth parameters and the data for the training process. So essentially the result of your prior call and then the result of your likelihood call with those prior parameters.

So you mean what the likelihood samples look like once you fix the prior parameters to some value?

Yes. So in practice, you would just call your prior function.

Yeah.

Then get a sample from the prior, so a parameter vector.

Yeah.

And then plug this parameter vector into the likelihood function. And then you get one simulated synthetic data set. And you just need those two.

Okay. Super cool. Yeah. Definitely sounds like a lot of fun and should definitely do a webinar about that. I'm very excited about that.

Yeah. Fantastic.

And so that was one of my main questions on that. Other question is, I'm guessing you are a lot of people working on that, right? Because your roadmap that you just talked about is super big. Because having a package that's designed for users, but also for researchers, that's really a lot of work. So I'm hoping you're not alone doing that.

No, we're currently a team of about a dozen people.

No, yeah, that makes sense.

It's an interdisciplinary team. So a few people with a hardcore software engineering background, some people with a machine learning background, and some people from the cognitive sciences, and also a handful of physicists. Because in fact, these amortized Bayesian inference methods are particularly interesting for physicists. For example, for astrophysicists who have these gravitational wave inference problems where they have massive data sets. And running MCMC on those would be quite cumbersome. So if you have this huge stream of data and you don't have this underlying likelihood density, but just some simulation program that might generate sensible gravitational waves, then amortized Bayesian inference really shines there.

Okay. So that's exactly the case you were talking about where the model doesn't change, but you have a lot of different datasets.

Yeah, exactly. Because I mean, what you're trying to run inference on is your physical model. And that doesn't change. I mean, it does, but then again, physicists have a very good understanding and very good models of the world around them. And that's maybe one of the largest differences to people from the cognitive sciences, where, you know, the models of the human brain, for instance, it's such a tough thing to model and there's so much not there and so much uncertainty in the model building process.

Yeah, for sure. Okay, yeah, I think I'm starting to understand the idea. And yeah, so actually, episode 101 was exactly about that. Black holes, collisions, gravitational waves. And I was talking with LIGO researchers, Christopher Berry and John Veitch. And we talked exactly about that, their problem with big data sets. They are mainly using sequential Monte Carlo, but I'm guessing they would also be interested in amortized Bayesian inference. So yeah, Christopher and John, if you're listening, feel free to reach out to Marvin and use BayesFlow. And listeners, this episode will be in the show notes also if you want to give it a listen. That's a really fun one also, learning a lot of stuff about the crazy universe we live in.

Actually, a weird question I have is, why is it called amortized Bayesian inference?

The reason is that we have this two-stage process where we would first pay upfront with this long neural network training phase. But then once we're done with this, the cost of the upfront training phase amortizes over all the posterior samples that we can draw within a few milliseconds.
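The amortization argument can be turned into a small back-of-the-envelope calculation. The sketch below assumes a fixed training cost, a fixed per-data-set inference cost, and a fixed per-data-set cost for a conventional refit; the function names and all timings are hypothetical numbers chosen for illustration, not measurements.

```python
import math

def total_cost_amortized(n_datasets, training_s, inference_s):
    """One upfront training phase plus cheap inference per data set."""
    return training_s + n_datasets * inference_s

def total_cost_refit(n_datasets, fit_s):
    """A full model fit (for example an MCMC run) for every data set."""
    return n_datasets * fit_s

def break_even(training_s, inference_s, fit_s):
    """Smallest number of data sets for which amortization is strictly cheaper."""
    if fit_s <= inference_s:
        raise ValueError("amortization never pays off if refitting is cheaper")
    return math.floor(training_s / (fit_s - inference_s)) + 1

# Hypothetical timings: 2 hours of training, 10 ms per amortized posterior,
# 60 s per conventional fit.
n_star = break_even(training_s=7200.0, inference_s=0.01, fit_s=60.0)
```

Under these made-up timings the upfront cost is recovered after roughly a hundred data sets; with a single data set, the conventional fit would be far cheaper.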

That makes sense. That makes sense. And so I think something you're also working on is something that's called deep fusion. And you do that in particular for multimodal simulation-based inference. How is that related to amortized Bayesian inference, if at all? And what is it about?

I'm gonna answer these two questions in reverse order. So first about the relation between simulation-based inference and amortized Bayesian inference. So to give you a bit of history there, simulation-based inference is essentially Bayesian inference based on simulations, where we don't assume that we have access to a likelihood density, but instead we just assume that we can sample from the likelihood, essentially simulate from the model. In fact, the likelihood is still present, but it's only implicitly defined and we don't have access to the density. That's why "likelihood-free inference" doesn't really hit what's happening here. Instead, in recent years, people have started adopting the term simulation-based inference, because we do Bayesian inference based on simulations instead of likelihood densities.

So there are methods that have been used for quite a long time now in the simulation-based inference research area. For example, rejection ABC, so approximate Bayesian computation, or then ABC-SMC, so combining ABC with sequential Monte Carlo.
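For readers who have not seen rejection ABC, a minimal sketch on a toy problem may help. The model (a Gaussian with unknown mean and unit variance), the prior, the choice of the sample mean as summary statistic, and the tolerance are all assumptions made for this example.

```python
import numpy as np

def rejection_abc(observed, n_proposals=20000, tolerance=0.1, seed=0):
    """Rejection ABC for the mean of a Normal(mu, 1) model (toy example)."""
    rng = np.random.default_rng(seed)
    obs_summary = observed.mean()
    # 1. Draw candidate parameters from the prior (here: mu ~ Normal(0, 2)).
    mu = rng.normal(0.0, 2.0, size=n_proposals)
    # 2. Simulate one data set per candidate; only sampling is needed.
    sims = rng.normal(mu[:, None], 1.0, size=(n_proposals, observed.size))
    # 3. Keep the candidates whose simulated summary is close to the observed one.
    distance = np.abs(sims.mean(axis=1) - obs_summary)
    return mu[distance < tolerance]

data_rng = np.random.default_rng(42)
observed = data_rng.normal(1.5, 1.0, size=30)
posterior_draws = rejection_abc(observed)
```

Note that all simulations are tied to this one observed data set: a new data set means running the whole procedure again, which is exactly the cost that amortized methods avoid.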

Essentially, the next iteration there was throwing neural networks at simulation-based inference. That's exactly this neural posterior estimation that I talked about earlier. And now what researchers noticed is, hey, when we train a neural network for simulation-based inference instead of running rejection approximate Bayesian computation, then we get amortization for free as a side product. It's just a by-product of using a neural network for simulation-based inference. And so in the last maybe four to five years, people have mainly focused on this algorithm that's called neural posterior estimation for simulation-based inference. And so all developments that happened there and all the research that happened there, almost all the research, sorry, focused on cases where we don't have any likelihood density. So we're purely in the simulation-based case.

Now with our view of things, when we come from a Bayesian inference, likelihood-based setting, we can say, hey, amortization is not just a random coincidental byproduct, but it's a feature, and we should focus on this feature. And so now what we're currently doing is moving this idea of amortized Bayesian inference with neural networks back into a likelihood-based setting. So we've started using likelihood information again. For example, using likelihood densities if they're available, or learning information about the likelihood, so like a surrogate model, on the fly, and then again using this information for better posterior inference. So we're essentially bridging simulation-based inference and likelihood-based Bayesian inference again, with this larger goal of amortization if we can do it.

And so this work on deep fusion essentially addresses one huge shortcoming of neural networks when we want to use them for amortized Bayesian inference. And that is in situations where we have multiple different sources of data. So for example, imagine you're a cognitive scientist and you run an experiment with subjects, and for each test subject, you give them a decision-making task. But at the same time, while your subjects solve the decision-making task, you wire them up with an EEG to measure the brain activity. So for each subject, across maybe 100 trials, what you now have is both an EEG and the data from the decision-making task.

Now, if you want to analyze this with PyMC or Stan, what you would just do is say, hey, well, we have two data-generating processes that are governed by a set of shared parameters. So the first part of the likelihood would just be this Wiener process for the decision-making task, where you just model the reaction time, a fairly standard procedure there in the cognitive sciences. And then we have a second part of the likelihood that we evaluate that somehow handles these EEG measurements. For example, a spatiotemporal process, or just some summary statistics that are being computed there, however you would usually handle your EEG data. Then you add both to the log PDF of the likelihood, and then you can call it a day.

You cannot do that in neural networks, because you have no straightforward, sensible way to combine these reaction times from the decision-making task and the EEG data. You cannot just take them and slap them together. They are not compatible with each other, because these data sources are heterogeneous. So you somehow need a way to fuse these sources of information so that you can then feed them into the neural network.

That's essentially what we're studying in this paper, where you could just get very creative and have different schemes to fuse the data. So you could use these attention schemes that are very hip in large language models right now, with transformers essentially, and have these different data sources attend, or listen essentially, to each other. With cross-attention, you could just let the EEG data inform your decision-making data, or have the decision-making data inform the EEG data. So you can get very creative there. You could also just learn some representation of both individually, then concatenate them and feed them to the neural network. Or you could do very creative and weird mixes of all those approaches.

And in this paper, we essentially have a systematic investigation of these different options. And we find that the most straightforward option works the best overall, and that's just learning fixed-size embeddings of your data sources individually, and then just concatenating them.

It turns out then we can use information from both sources in an efficient way, even though we're doing inference with neural networks. And maybe what's interesting for practitioners is that we can compensate for missing data in individual sources. In the paper, we essentially induced missing data by just taking these EEG data and decision-making data and randomly dropping some of them. And when we do this fusion process, the neural networks learn to compensate for partial missingness in both sources. So if you just remove some of the decision-making data, the neural networks learn to use the EEG data to inform your posterior. Even though the data in one of the sources are missing, the inference is pretty robust then. And again, all this happens without model refits. So you would just account for that during training. Of course, you have to do this random dropping of data during the training phase as well. And then you can also get it during the inference phase.
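The "embed each source separately, then concatenate" idea can be sketched in a few lines. In the sketch below, hand-written summary functions stand in for the learned summary networks, and the binary presence flags used to signal a missing source are an assumption of this example, not necessarily what the paper does; all names are hypothetical.

```python
import numpy as np

def embed_reaction_times(rt):
    """Stand-in for a learned summary network over reaction times."""
    if rt is None or len(rt) == 0:
        return np.zeros(3), 0.0  # zero embedding, flag 0.0 = source missing
    rt = np.asarray(rt, dtype=float)
    return np.array([rt.mean(), rt.std(), np.median(rt)]), 1.0

def embed_eeg(eeg):
    """Stand-in for a second summary network over EEG (channels x time)."""
    if eeg is None or np.size(eeg) == 0:
        return np.zeros(4), 0.0
    eeg = np.asarray(eeg, dtype=float)
    return np.array([eeg.mean(), eeg.std(), eeg.min(), eeg.max()]), 1.0

def fuse(rt, eeg):
    """Concatenate the fixed-size embeddings plus one presence flag per source."""
    rt_emb, rt_present = embed_reaction_times(rt)
    eeg_emb, eeg_present = embed_eeg(eeg)
    return np.concatenate([rt_emb, eeg_emb, [rt_present, eeg_present]])

rng = np.random.default_rng(0)
rt = rng.lognormal(mean=-0.5, sigma=0.3, size=100)  # 100 simulated trials
eeg = rng.normal(size=(32, 500))                    # 32 channels, 500 samples
both = fuse(rt, eeg)
eeg_only = fuse(None, eeg)  # decision-making data dropped
```

Because the fused vector has the same length whether or not a source is present, the same downstream network can be trained with randomly dropped sources and then used on incomplete data at inference time.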

Yeah, that sounds, yeah, that's really cool.

Maybe that's a bit of a, like a small piece of this paper in our larger roadmap. This is essentially taking this amortized Bayesian inference up to the level of trustworthiness and robustness and all these gold standards that we currently have for likelihood-based inference in PyMC or Stan.

Yeah. Yeah.

And there's still a lot of work to do because, of course, there's no free lunch, and of course there are many problems with trustworthiness. And that's also one of the reasons why I'm here with Aki right now, because Aki is so great at Bayesian workflow and trustworthiness, good diagnostics. That's all the things that we currently still need for trustworthy amortized Bayesian inference.

Yeah. So maybe you want to talk a bit more about that and what you're doing on that. That sounds like something very interesting.

So one huge advantage of an amortized Bayesian sampler is that evaluations and diagnostics are extremely cheap. So for example, there's this gold standard method that's called simulation-based calibration, where you would sample from your model, so sample from your prior predictive space, and then refit your model and look at your coverage, for instance. In general, you look at the calibration of your model on this potentially very large prior predictive space. So you naturally need many model refits, but your model is fixed. So if you do it with MCMC, it's a gold standard evaluation technique, but it's very expensive to run, especially if your model is complex. Now, if you have an amortized estimator, simulation-based calibration on thousands of datasets takes a few seconds.
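A small sketch of the rank-based form of simulation-based calibration may make the procedure concrete. To keep it self-contained, the exact conjugate posterior of a toy Gaussian model stands in for the amortized estimator; in practice that one line would be replaced by draws from the trained network. The model, the function name, and all sizes are assumptions of this example.

```python
import numpy as np

def sbc_ranks(n_sims=2000, n_obs=10, n_post=99, seed=0):
    """Rank statistics for a toy model: mu ~ Normal(0, 1), y_i ~ Normal(mu, 1)."""
    rng = np.random.default_rng(seed)
    # Ground-truth parameters from the prior and one data set for each.
    mu_true = rng.normal(0.0, 1.0, size=n_sims)
    y = rng.normal(mu_true[:, None], 1.0, size=(n_sims, n_obs))
    # Stand-in for the amortized posterior: the exact conjugate posterior,
    # Normal(sum(y) / (n + 1), 1 / (n + 1)), sampled for all data sets at once.
    post_mean = y.sum(axis=1) / (n_obs + 1)
    post_sd = np.sqrt(1.0 / (n_obs + 1))
    draws = rng.normal(post_mean[:, None], post_sd, size=(n_sims, n_post))
    # Rank of each true parameter among its posterior draws.
    return (draws < mu_true[:, None]).sum(axis=1)

ranks = sbc_ranks()
counts = np.bincount(ranks, minlength=100)  # histogram over ranks 0..99
```

For a calibrated posterior the ranks are uniformly distributed, so the histogram in `counts` should look flat; systematic U-shapes or slopes would point to over-confident or biased posteriors.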

So essentially, and that's my goal for this research visit with Aki here in Finland, is trying to figure out what are some diagnostics that are gold standard, but potentially very expensive, up to a point where it's infeasible to run on a larger scale with MCMC, but we can easily do it with an amortized estimator. With the goal of figuring out, can we trust this estimator, yes or no? As you might know from neural networks, we just have no idea what's happening inside the neural network. And so we currently don't have these strong diagnostics that we have for MCMC, like, for example, R-hat. There's no comparable thing for neural networks. So one of my goals here is to come up with more good diagnostics that are either possible with MCMC but very expensive, so we don't run them, but they would be very cheap with an amortized estimator. Or, the second thing, diagnostics specific to an amortized estimator, just like R-hat is specific to MCMC.

Okay. Yeah, I see. Yeah, that makes tons of sense as well.

And actually, so I would have more technical questions on these, but I see the time running out. I think something I'm mainly curious about is the challenges, the biggest challenges you face when applying amortized Bayesian inference and deep fusion techniques in your projects, but also in the projects you see. I think that's going to also give a sense to listeners of when and where to use these kinds of methods.

That's a great question. And I'm more than happy to talk about all these challenges that we have, because there's so much room for improvement. These amortized methods have so much potential, but we still have a long way to go until they are as usable and as straightforward to use as current MCMC samplers.

And in general, one challenge for practitioners is that we have most of the problems and hardships that we have in PyMC or Stan. And that is that researchers have to think about their model in a probabilistic way, in a mechanistic way. So instead of just saying, hey, I click on t-test or linear regression in some graphical user interface, they actually have to come up with a data-generating process and have to specify their model. And this whole topic of model specification is just the same in the amortized workflow, because in some way we need to specify the Bayesian model.

And now on top of all this, we have a huge additional layer of complexity, and this is defining the neural networks. In amortized Bayesian inference, nowadays we have two neural networks. The first one is a so-called summary network, which essentially learns a latent embedding of the data set. Essentially those are like optimal learned summary statistics, and optimal doesn't mean that they have to be optimal to reconstruct the data; instead, optimal means they're optimal to inform the posterior.

For example, in a very, very simple toy model, if you have just a Gaussian model and you just want to perform inference on the mean, then a sufficient summary statistic for posterior inference on the mean would be the mean. Because that's all you need to reconstruct the mean. It sounds very tautological, but yeah. Then again, the mean is obviously not enough to reconstruct the data, because all the variance information is missing. What the summary network learns is something like the mean. So summary statistics that are optimal for posterior inference.
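The Gaussian toy example can be checked numerically. The sketch below assumes a known noise standard deviation and a normal prior on the mean, evaluates the posterior on a grid from the full data, and compares two data sets that share the same sample mean and size but have very different spread. The function name, the grid, and the data values are made up for illustration.

```python
import numpy as np

def grid_posterior(y, grid, prior_sd=1.0, noise_sd=1.0):
    """Posterior over the mean on a grid, computed from the full data set."""
    y = np.asarray(y, dtype=float)
    log_prior = -0.5 * (grid / prior_sd) ** 2
    log_lik = -0.5 * (((y[:, None] - grid[None, :]) / noise_sd) ** 2).sum(axis=0)
    log_post = log_prior + log_lik
    post = np.exp(log_post - log_post.max())  # subtract max for stability
    return post / post.sum()

grid = np.linspace(-4.0, 4.0, 801)
y_a = np.array([0.9, 1.0, 1.1])   # sample mean 1.0, small spread
y_b = np.array([-1.0, 1.0, 3.0])  # sample mean 1.0, large spread
post_a = grid_posterior(y_a, grid)
post_b = grid_posterior(y_b, grid)
same_posterior = np.allclose(post_a, post_b)
```

Both data sets give the same posterior for the mean, so the sample mean (together with the sample size) carries everything the posterior needs, even though it clearly cannot reconstruct either data set.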

And then the second network is the actual generative neural network. So like a normalizing flow, score-based diffusion model, consistency model, flow matching, whatever conditional generative model you want. And this will handle the sampling from the posterior. And these two networks are learned end to end. So you would learn your summary statistic, output it, feed it into the posterior network, the generative model, and then have one evaluation of the loss function and optimize both end to end. And so we have two neural networks, long story short, which is substantially harder than just hitting sample on a PyMC or Stan program. And that's an additional hardship for practitioners.

Now in BayesFlow, what we do is we provide sensible default values for the generative neural networks, which work in maybe 80 or 90% of the cases. It's just sufficient to have, for example, a neural spline flow, like some sort of normalizing flow with, I don't know, six layers and a certain number of units, some regularization for robustness and, you know, cosine decay of the learning rate. All these machine learning parts, we try to take them away from the user if they don't want to mess with them. But still, if things don't work, they would need to somehow diagnose the problems and then, you know, play with the number of layers and the neural network architecture.

And then for the summary network, the summary network essentially needs to be informed by the data. So if you have time series, you would look at something like an LSTM, so these long short-term memory time series neural networks. Or you would have a recurrent neural network or, nowadays, a time series transformer. They're also called temporal fusion transformers. If you have IID data, you would have something like a deep set or a set transformer, which respect this exchangeable structure of the data. So again, we can give all the recommendations and sensible default values, like: if you have a time series, try a time series transformer. Then again, if things don't work out, users need to play around with these settings. So that's definitely one hardship of amortized Bayesian inference in general.

And for the second part of your question, hardships of this deep fusion: it's essentially that if you have more and more information sources, then things can get very complicated. For example, just a few days ago, we discussed a case where someone has 60 different sources of information and they're all streams of time series. Now we could say, hey, just slap 60 summary networks on this problem, one summary network for each domain. That's going to be very complex and very hard to train, especially if we don't bring that many data sets to the table for the neural network training. And so there we somehow need to find a compromise: okay, what information can we condense and group together? So maybe some of the time series sources are somewhat similar and actually compatible with each other. So we could, for example, come up with six groups of 10 time series each. Then we would only need six neural networks for the summary embeddings. All these practical considerations make things just as hard as in likelihood-based, MCMC-based inference, but just a bit harder because of all the neural network stuff that's happening. Did this address your question?

Yeah. Yeah. It gives me more questions, but yeah, for sure. That does answer the question. When you're talking about transformers for time series, are you talking about the transformers, the neural network that's used in large language models, or is it something else?

It's essentially the same, but slightly adjusted for time series, so that the statistics or these latent embeddings that you output still respect the time series structure, where typically you would have this autoregressive structure. So it's not exactly the same as a standard transformer, but you would just enrich it to respect the probabilistic structure in your data. But at the core, it's just the same. So at the core, it's an attention mechanism, like multi-head attention, where the different parts of your dataset could essentially talk or listen to each other. So it's just the same.

Okay. Yeah, that's interesting. I didn't know that existed for time series. That's interesting. That means, so because with the transformer, one of the main things is you have to tokenize the inputs, right? So here, there is a tokenization happening of the time series data?

You don't have to tokenize here, because the reason why you have to tokenize in large language models, or natural language processing in general, is that you want to somehow encode your characters or your words into numbers, essentially. And we don't need that in Bayesian inference in general, because we already have numbers.

Yeah.

So our data already comes in numbers, so we don't need tokenization here. Of course, if we had text data, then we would need tokenization.

Yeah. Yeah. Yeah. OK. OK. Yeah, it makes more sense to me. All right, that's fun. I didn't know that existed. Do you have any resources about transformers for time series that we could put in the show notes?

Absolutely. There is a paper that's called Temporal Fusion Transformers, I think. I will send you the link.

Yeah. Awesome. Yeah, thanks. Definitely.

We have this time series transformer, temporal fusion transformer, implemented in BayesFlow. So now it's just a very usable interface where you would just input your data and then you get your latent embeddings. You can say, I want to input my data and I want as an output 20 learned summary statistics. So that's all you need to do there.

Okay.

And you can go crazy.

So what would you do with it?

Good.

Yeah, what would you do with these results? Basically the outputs of the transformer, what would you use that for?

Those are the learned summary statistics that you would then treat as a compressed, fixed-length version of your data for the posterior network, for this generative model.

So then you use that afterwards in the model?

Exactly. Yeah. So the transformer is just used to learn summary statistics of the data sets that we input.

For instance, if you have time series,

like we did this for COVID time series.

975

If you have a COVID time series,

976

worth like for a three year period would

be and daily reporting, you would have a

977

time series with about a thousand time

steps.

978

That's quite long as a condition into a

neural network to pass in there.

979

And also like if now you don't have a

thousand days, but a thousand and one

980

days, then the length of your input to the

neural network would change and your

981

neural network wouldn't do that.

982

So what you do with a time series

transformer is compress this time series

983

of maybe 1,000 or maybe 1,050 time steps

into a fixed length vector of summary

984

statistics.

985

Maybe you extract 200 summary statistics

from that.
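
The compression step described here can be sketched in a few lines. This is a minimal illustration, not BayesFlow's API: `make_summary_weights` and `summarize` are hypothetical names, and the weights are fixed random numbers standing in for what a trained summary network (such as a time series transformer) would learn. The only point it makes is that pooling over the time axis yields an output whose length does not depend on the length of the input.

```python
import math
import random


def make_summary_weights(num_summaries, seed=0):
    """Draw fixed random weights; a trained summary network would learn these."""
    rng = random.Random(seed)
    return [(rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1))
            for _ in range(num_summaries)]


def summarize(series, weights):
    """Map a non-empty time series of any length to len(weights) summary values."""
    n = len(series)
    summaries = []
    for w_value, w_time, bias in weights:
        # Each feature sees the value and its relative position in [0, 1),
        # then is averaged over time so the output size ignores n.
        total = sum(
            math.tanh(w_value * x + w_time * (t / n) + bias)
            for t, x in enumerate(series)
        )
        summaries.append(total / n)
    return summaries


weights = make_summary_weights(num_summaries=20)
short_series = [math.sin(0.01 * t) for t in range(1000)]
long_series = [math.sin(0.01 * t) for t in range(1001)]

print(len(summarize(short_series, weights)))  # 20
print(len(summarize(long_series, weights)))   # 20
```

A series of 1,000 days and one of 1,001 days both come out as 20 numbers, so whatever network consumes the summaries always sees the same input shape.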

986

Ah, okay, I see.

987

And then you can use that in your neural

network, in the model that's going to be

988

sampling your model.

989

In the neural network that's going to be

sampling your model.

990

We already see that we're heavily

overloading terminology here.

991

So what's a model actually?

992

So then we have to differentiate between

the actual Bayesian model that we're

993

trying to fit.

994

And then the neural network, the

generative model or generative neural

995

network that we're using as a replacement

for MCMC.

996

So it's, it's a lot of this taxonomy

that's, that's odd when you're at the

997

interface of deep learning and statistics.

998

Another one of those hiccups is

parameters.

999

Like in Bayesian inference, parameters are

your inference targets.

Speaker:

So you want posterior distributions on a

handful of model parameters.

Speaker:

When you talk to people from deep learning

about parameters,

Speaker:

they understand that to mean the neural network

weights.

Speaker:

So sometimes I have to be

careful with the

Speaker:

terminology and words used to describe

things because we have different types of

Speaker:

people going on different levels of

abstraction here in different functions.

Speaker:

Yeah.

Speaker:

Yeah, exactly.

Speaker:

So that means in this case, the

transformer takes in the time series values, it

Speaker:

summarizes them.

Speaker:

And it passes that on to the neural

network that's going to be used to sample

Speaker:

the Bayesian model.

Speaker:

Exactly.

Speaker:

And they are passed in as the conditions,

like conditional probability, which

Speaker:

totally makes sense because like this

generative neural network, it learns the

Speaker:

distribution of parameters conditional on

the data or summary statistics of the

Speaker:

data.

Speaker:

So that's the exact definition of the

Bayesian posterior distribution.

Speaker:

Like a distribution of the Bayesian model

parameters conditional on the data.

Speaker:

It's the exact definition of the

posterior.
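
Written out, the relationship described here looks roughly as follows. The notation is an assumption added for illustration and does not come from the conversation: θ denotes the Bayesian model parameters, y the data, h_ψ the summary network, and q_φ the generative neural network.

```latex
p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)},
\qquad
q_\phi\big(\theta \mid h_\psi(y)\big) \approx p(\theta \mid y)
```

The left-hand side is the usual definition of the posterior; the right-hand side says that the generative network, conditioned on the learned summary statistics, is trained to approximate it.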

Speaker:

Yeah, I see.

Speaker:

And that means...

Speaker:

So in this case, yeah, no, I think my

question was going to be, so why would you

Speaker:

use this kind of additional layer on the

time series data?

Speaker:

But you already answered that.

Speaker:

It's that, well, what if your time series

data is too big or something like that?

Speaker:

Exactly.

Speaker:

It's not just being too big, but also just

a variable length.

Speaker:

Because the neural network, like the

generative neural network, it always wants

Speaker:

fixed length inputs.

Speaker:

Like it can only handle, in this case of

the COVID model, it could only handle

Speaker:

input conditions with length 200.

Speaker:

And now the time series transformer

handles the fact that our actual raw data

Speaker:

have variable length.

Speaker:

And time series transformers can handle

data of variable length.

Speaker:

So they would, you know, just take a time

series of length

Speaker:

maybe 500 time steps to 2,000 time steps,

and then always compress it to 200 summary

Speaker:

statistics.

Speaker:

So this generative neural network, which

is much more strict about the shapes and

Speaker:

form of the input data, will always see

the same length inputs.

Speaker:

Yeah.

Speaker:

Okay.

Speaker:

Yeah, I see.

Speaker:

That makes sense.

Speaker:

Awesome.

Speaker:

Yeah, super cool.

Speaker:

And so as you were saying, this is already

available in BayesFlow, people can use

Speaker:

this kind of transformer for time series.

Speaker:

Yeah, absolutely.

Speaker:

For time series and also for sets.

Speaker:

So for IID data.

Speaker:

Yeah.

Speaker:

Because if you just take

an IID data set and input it into a neural

Speaker:

network, the neural network doesn't know

that your observations are exchangeable.

Speaker:

So it will assume much more structure than

there actually is in your data.

Speaker:

So again, it has a double function, like a

dual function of like compressing data,

Speaker:

encoding the probabilistic structure of

the data, and also outputting a fixed

Speaker:

representation.

Speaker:

So this would be a set transformer or deep

set is another option.

Speaker:

It's also implemented in BayesFlow.
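
The exchangeability point can be made concrete with a small sketch in the spirit of a deep set. It is an illustration under stated assumptions, not BayesFlow code: `embed` and `deep_set_summary` are hypothetical names, and the hand-picked per-observation features stand in for a learned embedding. Because the features are pooled with a mean, reordering the observations cannot change the result, and the output length is fixed regardless of how many observations there are.

```python
import math
import random


def embed(x):
    """Per-observation features; a deep set would learn this map."""
    return (x, x * x, math.tanh(x))


def deep_set_summary(observations):
    """Mean-pool per-observation features, so order cannot matter."""
    n = len(observations)
    features = [embed(x) for x in observations]
    # math.fsum tracks partial sums exactly, so ordering has no practical effect.
    return tuple(math.fsum(f[k] for f in features) / n for k in range(3))


rng = random.Random(1)
data = [rng.gauss(0.0, 1.0) for _ in range(50)]
shuffled = data[:]
rng.shuffle(shuffled)

same = all(
    math.isclose(a, b, rel_tol=1e-12, abs_tol=1e-12)
    for a, b in zip(deep_set_summary(data), deep_set_summary(shuffled))
)
print(same)  # True
print(len(deep_set_summary(data[:10])), len(deep_set_summary(data)))  # 3 3
```

A plain feed-forward network fed the raw list would treat position 1 and position 2 as different inputs; the pooled version builds the "order does not matter" assumption into the architecture.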

Speaker:

Super cool.

Speaker:

Yeah.

Speaker:

And so let's start winding down here

because I've already taken a lot of your

Speaker:

time.

Speaker:

Maybe a last few questions would be what

are some emerging topics that you see

Speaker:

within deep learning and probabilistic

machine learning that you find

Speaker:

particularly intriguing?

Speaker:

Because we've been talking here a lot about

really the nitty-gritty, the statistical

Speaker:

detail.

Speaker:

And so on, but now if we zoom out a bit and

we start thinking more long-term.

Speaker:

Yeah.

Speaker:

I'm very excited about two large topics.

Speaker:

The first one is generative models that

are very expressive.

Speaker:

So unconstrained neural network

architectures, but at the same time have a

Speaker:

one-step inference.

Speaker:

So for example, people have been using

score-based diffusion models a lot, or

Speaker:

flow matching,

Speaker:

for image generation, like for example,

Stable Diffusion.

Speaker:

You might be familiar with this tool to

generate like, you know, input a text

Speaker:

prompt and then you get fantastic images.

Speaker:

Now this takes quite some time.

Speaker:

So like a few seconds for each image, but

only because it runs on a fancy cluster.

Speaker:

If you run it locally on a computer, it

takes much longer.

Speaker:

And that's because the score-based diffusion

model needs many discretization steps in

Speaker:

denoising, in this denoising process

during inference time.

Speaker:

And now, throughout the last

year, there have been a few attempts at

Speaker:

having these very expressive and super

powerful neural networks.

Speaker:

But they are much, much faster because

they don't have these many denoising

Speaker:

steps.

Speaker:

Instead, they directly learn a one-step

inference.

Speaker:

So they could generate an image not in, like, a

thousand steps, but in only one step.
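
The cost difference between many-step and one-step sampling can be seen in a toy example. Everything here is an assumption chosen so the math has a closed form: the target is a one-dimensional Gaussian with mean `MU` and standard deviation `SIGMA`, the multi-step sampler integrates a known velocity field with Euler steps (a stand-in for the discretized denoising process), and the one-step map is written down exactly, whereas a one-step model such as a consistency model would have to learn it.

```python
import math

MU, SIGMA = 2.0, 0.5  # toy target distribution N(MU, SIGMA^2), assumed


def velocity(x, t):
    """Exact velocity field that transports N(0, 1) at t=0 to the target at t=1."""
    var_t = (1.0 - t) ** 2 + (t * SIGMA) ** 2
    return MU + (t * SIGMA**2 - (1.0 - t)) / var_t * (x - t * MU)


def sample_multi_step(z, num_steps):
    """Euler integration: many small steps, one velocity evaluation each."""
    x, h = z, 1.0 / num_steps
    for k in range(num_steps):
        x += h * velocity(x, k * h)
    return x


def sample_one_step(z):
    """Direct noise-to-sample map; a one-step model would learn this."""
    return MU + SIGMA * z


z = 1.0
print(sample_one_step(z))  # 2.5
print(abs(sample_multi_step(z, 1000) - sample_one_step(z)) < 1e-2)  # True
```

Both samplers turn the same noise draw into essentially the same sample, but the first needs 1,000 function evaluations and the second needs one. With a neural network in place of `velocity`, each evaluation is a forward pass, which is where the speed-up comes from.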

Speaker:

And that's very cutting edge or bleeding

edge, if you will, because they don't work

Speaker:

that great yet.

Speaker:

But I think there's much potential in

there.

Speaker:

It's both expressive and fast.

Speaker:

And then again, we've used some of those

for amortized Bayesian inference.

Speaker:

So we use consistency models and they have

super high potential in my opinion.

Speaker:

So, you know, with these advances in deep

learning, we can always, oftentimes we can

Speaker:

use them for amortized Bayesian inference.

Speaker:

We just like reformulate these generative

models and slightly tune them to our

Speaker:

tasks.

Speaker:

So I'm very excited about this.

Speaker:

And the second area I'm very excited about

is foundation models.

Speaker:

I guess most people in AI are these days.

Speaker:

So foundation models essentially means

neural networks are very good at

Speaker:

in-distribution tasks.

Speaker:

So whatever is in the training data set,

neural networks are typically very good at

Speaker:

finding patterns that are similar to the

training set, what they saw in the

Speaker:

training set.

Speaker:

Now in the open world, so if we are out of

distribution, we have a domain shift,

Speaker:

distribution shift, model

misspecification, however you want to call

Speaker:

it, neural networks typically aren't that

good.

Speaker:

So what we could do is either make them

slightly better at out of distribution, or

Speaker:

we just extend the in-distribution to a

huge space.

Speaker:

And that's what foundation models do.

Speaker:

For example, GPT-4 would be a foundation

model,

Speaker:

because it's just trained on so much data.

Speaker:

I don't know how much, it's not terabytes

anymore.

Speaker:

It's like, like essentially the entire

internet.

Speaker:

So it's just a huge training set.

Speaker:

And so the world and the training set that

this neural network has been trained on is

Speaker:

just huge.

Speaker:

And so essentially we don't really have

out of distribution cases anymore, just

Speaker:

because our training set is so huge.

Speaker:

And that's also one area that could be

very useful for

Speaker:

amortized Bayesian inference and to

overcome the very initial shortcoming that

Speaker:

you talked about, where we would also like

to amortize over different Bayesian models.

Speaker:

Hmm.

Speaker:

I see.

Speaker:

Yeah, yeah, yeah.

Speaker:

Yeah, that would definitely be super fun.

Speaker:

Yeah, I'm really impressed and interested

to see this interaction of, like, deep

Speaker:

learning, artificial intelligence, and

then the Bayesian

Speaker:

framework coming on top of that.

Speaker:

That is really super cool.

Speaker:

I love that.

Speaker:

Yeah.

Speaker:

Yeah, it makes me super curious to try

that stuff out.

Speaker:

So to play us out, Marvin, actually, this

is a very active area of research.

Speaker:

So what advice would you give to beginners

interested in diving into this

Speaker:

intersection of deep learning and

probabilistic machine learning?

Speaker:

That's a great question.

Speaker:

Essentially, I would have two

recommendations.

Speaker:

The first one is to really try to simulate

stuff.

Speaker:

Whatever it is that you are curious about,

just try to write a simulation program and

Speaker:

try to simulate some of the data that you

might be interested in.

Speaker:

So for example, if you're really

interested in soccer, then code up a

Speaker:

simulation program

Speaker:

that just simulates soccer matches and the

outcomes of soccer matches.
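
As an illustration of how small such a simulation program can be, here is a minimal sketch. The modeling choices are assumptions made for the example and not something stated in the conversation: goals for each side are drawn from independent Poisson distributions, and the scoring rates of 1.5 and 1.1 goals per match are made-up numbers.

```python
import math
import random


def sample_poisson(rate, rng):
    """Knuth's method for drawing a Poisson-distributed goal count."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def simulate_match(home_rate, away_rate, rng):
    """Simulate one match; returns (home_goals, away_goals)."""
    return sample_poisson(home_rate, rng), sample_poisson(away_rate, rng)


def simulate_season(num_matches, home_rate=1.5, away_rate=1.1, seed=42):
    """Tally home wins, draws and away wins over many simulated matches."""
    rng = random.Random(seed)
    tally = {"home": 0, "draw": 0, "away": 0}
    for _ in range(num_matches):
        home, away = simulate_match(home_rate, away_rate, rng)
        outcome = "home" if home > away else "away" if away > home else "draw"
        tally[outcome] += 1
    return tally


tally = simulate_season(num_matches=1000)
print(tally)
```

Changing the rates and rerunning shows directly how the assumed data-generating process shapes the outcomes, which is the kind of intuition this advice is about.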

Speaker:

So you can really get a feeling of the

data generating processes that are

Speaker:

happening because probabilistic machine

learning at its very core is all about

Speaker:

data generating processes and reasoning

about these processes.

Speaker:

And I think it was Richard Feynman who

said, what I cannot create, I do not

Speaker:

understand.

Speaker:

That's essentially at the heart of

simulation-based inference in a more

Speaker:

narrow setting,

Speaker:

and probabilistic machine

learning more broadly, or science more

Speaker:

broadly even. So yeah, definitely,

simulating and running simulation studies

Speaker:

can be super helpful, both to understand

what's happening in the background and also to

Speaker:

get a feeling for programming and to get

better at programming as well. Then the

Speaker:

second piece of advice would be to essentially find

a balance between these hands-on, getting

Speaker:

your hands dirty type of things, like

implementing a model in

Speaker:

PyTorch or Keras, or solving some Kaggle

tasks, just some machine learning tasks.

Speaker:

But then at the same time, also finding

this balance with reading books and finding

Speaker:

new information to make sure that you

actually know what you're doing and also

Speaker:

know what you don't know and what the next

steps are to get better from the

Speaker:

theoretical part.

Speaker:

And there are two books that I can really

recommend.

Speaker:

The first one is Deep Learning by Ian

Goodfellow.

Speaker:

It's also available

Speaker:

for free online.

Speaker:

You can also link to this in the show

notes.

Speaker:

It's a great book and it covers so much.

Speaker:

And then if you come from this Bayesian or

statistics background, you see a lot of

Speaker:

conditional probabilities in there because

a lot of deep learning is just conditional

Speaker:

generative modeling.

Speaker:

And then the second book would in fact be

Statistical Rethinking by Richard

Speaker:

McElreath.

Speaker:

It's a great book and it's not only

limited to Bayesian inference, but more.

Speaker:

Also a lot of causal inference, of course.

Speaker:

Also just thinking about probability and

the philosophy behind this whole

Speaker:

probabilistic modeling topic more broadly.

Speaker:

So earlier today, I had a chat with one of

the student assistants that I'm

Speaker:

supervising and he said, Hey Marvin, like

I read Statistical Rethinking a few weeks

Speaker:

ago.

Speaker:

And today I read something about

score-based diffusion models.

Speaker:

So these like state of the art deep

learning models that are used to generate

Speaker:

images.

Speaker:

He said, like, because I read Statistical

Rethinking, it all made sense.

Speaker:

There's so much probability going on in

these score-based diffusion models.

Speaker:

And Statistical Rethinking really helped

me understand that.

Speaker:

And at first I didn't really, I couldn't

believe it, but it totally makes sense.

Speaker:

Because, like, Statistical Rethinking is not

just a book about Bayesian workflow and

Speaker:

Bayesian modeling, but more about, you

know, reasoning about probabilities and

Speaker:

uncertainty, in a more general way.

Speaker:

And it's a beautiful book.

Speaker:

So I'd recommend those.

Speaker:

Nice.

Speaker:

Yeah.

Speaker:

So definitely let's put those two in the

show notes.

Speaker:

Marvin, I will.

Speaker:

So of course I've read Statistical

Rethinking several times, so I definitely

Speaker:

agree.

Speaker:

The first one about deep learning, I

haven't yet, but I will definitely read it

Speaker:

because that sounds really fascinating.

Speaker:

So I really want to get that book.

Speaker:

Fantastic.

Speaker:

Well, thanks a lot, Marvin.

Speaker:

That was really awesome.

Speaker:

I really learned a lot.

Speaker:

I'm pretty sure listeners did too, so

that's super fun.

Speaker:

You definitely need to come back to do a

modeling webinar with us and show us in

Speaker:

action what we talked about today with the

BayesFlow package.

Speaker:

It's also, I guess, going to inspire

people to use it and maybe contribute to

Speaker:

it.

Speaker:

But before that, of course, I'm going to

ask you the last two questions I ask every

Speaker:

guest at the end of the show.

Speaker:

First one, if you had unlimited time and

resources, which problem would you try to

Speaker:

solve?

Speaker:

That's a very loaded question because

there's so many very, very important

Speaker:

problems to solve.

Speaker:

Like big picture problems, like peace,

world hunger, global warming, all those.

Speaker:

I'm afraid I couldn't, like with my

background, I don't really know how to

Speaker:

contribute significantly with a huge

impact to those problems.

Speaker:

So my consideration is essentially a

trade-off between, like...

Speaker:

how important is the problem and what

impact does solving the problem or

Speaker:

addressing the problem have and what

impact could I have on solving the

Speaker:

problem?

Speaker:

And so I think what would be very nice is

to make probabilistic inference or

Speaker:

Bayesian inference in particular, like,

accessible, usable, easy and fast for

Speaker:

everyone.

Speaker:

And that doesn't just mean, you know,

methods and machine learning researchers,

Speaker:

but essentially anyone who works

with data in any way.

Speaker:

And there's so much to do, like the actual

Bayesian model in the background, it could

Speaker:

be huge, be like a BayesGPT, like ChatGPT,

but just for Bayes.

Speaker:

Just with the sheer scope of amortization,

different models, different settings and

Speaker:

so on.

Speaker:

So that's a huge, huge challenge.

Speaker:

Like on the backend side, but then on the

front end and API side, I think it also

Speaker:

has...

Speaker:

many different sub problems there.

Speaker:

Because it would mean, like, people could

just, you know, write down a description

Speaker:

of their model in plain text language,

like a large language model.

Speaker:

And, you know, not actually specify

everything by programming.

Speaker:

Maybe also just sketch out some data like

expert elicitation and all those different

Speaker:

topics.

Speaker:

I think there's like this bigger picture,

that, you know, so like.

Speaker:

thousands of researchers worldwide are

working on so many niche topics there.

Speaker:

But having this overarching BayesGPT kind

of thing would be really cool.

Speaker:

So I'd probably choose that to work on.

Speaker:

It's a very risky thing, so that's why I'm

not currently working on it.

Speaker:

Yeah, I love that.

Speaker:

Yeah, that sounds awesome.

Speaker:

Feel free to cooperate

Speaker:

and collaborate with me on that.

Speaker:

I would definitely be down.

Speaker:

That sounds absolutely amazing.

Speaker:

Yeah.

Speaker:

So send me an email when you start working

on that, please.

Speaker:

I'll be happy to join the team.

Speaker:

And second question, if you could have

dinner with any great scientific mind,

Speaker:

dead, alive or fictional, who would it be?

Speaker:

Again, very loaded question.

Speaker:

Super interesting question.

Speaker:

I mean, there are two huge choices.

Speaker:

I could either go with someone who's

currently alive and

Speaker:

I feel like I want their take on the

current state of the art and future

Speaker:

directions and so on.

Speaker:

And the second huge option, what I guess

many people would go with is someone who's

Speaker:

been dead for two to three centuries.

Speaker:

And I think I'd go with the second choice.

Speaker:

So really take someone from way back in the

past.

Speaker:

And that's because of two reasons.

Speaker:

I think like, of course, speaking to

today's scientists is super interesting

Speaker:

and I would love to do that.

Speaker:

But I mean, they have access to all the

state of the art technology and they know

Speaker:

about all the latest advancements.

Speaker:

And so if they have some groundbreaking

creative ideas to share that they come up

Speaker:

with, they could just implement them and

make them actionable.

Speaker:

And the second reason is that today's

scientists have a huge platform because

Speaker:

they're on the internet.

Speaker:

So if they really want to express an idea,

they could just do it on

Speaker:

Twitter or wherever. So there's, like, other

ways to engage with them apart from, you

Speaker:

know, having a magical dinner. Right.

Speaker:

So I would choose someone from the past,

and in particular,

Speaker:

I think Ada Lovelace would be super

interesting for me to talk to, essentially

Speaker:

because she's widely considered the first

programmer. The craziest thing about that is

Speaker:

that she never had access to, like, a

modern computer.

Speaker:

So she wrote the first program, but the

machine wasn't there yet.

Speaker:

So that's such a huge leap of creativity

and genius.

Speaker:

And so I'd really be interested in like if

Ada Lovelace saw what's happening today,

Speaker:

like all the technology that we have with

generative AI, GPU clusters and all these

Speaker:

possibilities, like what's the next leap

forward?

Speaker:

Like what's today's equivalent of writing

Speaker:

the first program without having the

computer.

Speaker:

Yeah, I'd really love to know this answer

and there's currently no other way except

Speaker:

for your magical dinner invitation to get

this answer.

Speaker:

So that's why I go with this option.

Speaker:

Yeah.

Speaker:

Yeah.

Speaker:

No, awesome.

Speaker:

Awesome.

Speaker:

I love it.

Speaker:

That definitely sounds like a, like a

marvelous dinner.

Speaker:

So yeah.

Speaker:

Awesome.

Speaker:

Thanks a lot, Marvin.

Speaker:

That was, that was really a blast.

Speaker:

I'm going to let you go now because you've

been talking for a long time, I'm guessing you

Speaker:

need a break.

Speaker:

But that was really amazing.

Speaker:

So yeah, thanks a lot for taking the time.

Speaker:

Thanks again to Matt Rosinski for this

awesome recommendation.

Speaker:

I hope you loved it, Marvin.

Speaker:

And also Matt, me, I did.

Speaker:

So that was really awesome.

Speaker:

As usual, I'll put resources and a link to

your website.

Speaker:

And also, Marvin is going to add stuff to

the show notes for those who want to dig

Speaker:

deeper.

Speaker:

Thank you again, Marvin, for taking the

time and being on this show.

Speaker:

Thank you very much for having me, Alex.

Speaker:

I appreciate it.

Speaker:

This has been another episode of Learning

Bayesian Statistics.

Speaker:

Be sure to rate, review and follow the

show on your favorite podcatcher and visit

Speaker:

learnbayesstats.com for more resources

about today's topics as well as access to

Speaker:

more episodes to help you reach true

Bayesian state of mind.

Speaker:

That's learnbayesstats.com.

Speaker:

Our theme music is Good Bayesian by Baba

Brinkman, feat. MC Lars and Mega Ran.

Speaker:

Check out his awesome work at

bababrinkman.com.

Speaker:

I'm your host.

Speaker:

Alex Andorra.

Speaker:

You can follow me on Twitter at Alex

underscore Andorra, like the country.

Speaker:

You can support the show and unlock

exclusive benefits by visiting

Speaker:

Patreon.com slash LearnBayesStats.

Speaker:

Thank you so much for listening and for

your support.

Speaker:

You're truly a good Bayesian change your

predictions after taking information.

Speaker:

And if you're thinking I'll be less than

amazing, let's adjust those expectations.

Speaker:

Let me show you how to be a good Bayesian

Change calculations after taking fresh

Speaker:

data in Those predictions that your brain

is making Let's get them on a solid

Speaker:

foundation