Learning Bayesian Statistics

Proudly sponsored by PyMC Labs, the Bayesian Consultancy. Book a call, or get in touch!

In this episode, Marvin Schmitt introduces the concept of amortized Bayesian inference, where the upfront training phase of a neural network is followed by fast posterior inference.

Marvin will guide us through this new concept, discussing his work in probabilistic machine learning and uncertainty quantification, using Bayesian inference with deep neural networks. 

He also introduces BayesFlow, a Python library for amortized Bayesian workflows, and discusses its use cases in various fields, while also touching on the concept of deep fusion and its relation to multimodal simulation-based inference.

A PhD student in computer science at the University of Stuttgart, Marvin is supervised by two LBS guests you surely know — Paul Bürkner and Aki Vehtari. Marvin’s research combines deep learning and statistics, to make Bayesian inference fast and trustworthy. 

In his free time, Marvin enjoys board games and is a passionate guitar player.

Our theme music is « Good Bayesian », by Baba Brinkman (feat MC Lars and Mega Ran). Check out his awesome work at https://bababrinkman.com/ !

Thank you to my Patrons for making this episode possible!

Yusuke Saito, Avi Bryant, Ero Carrera, Giuliano Cruz, Tim Gasser, James Wade, Tradd Salvo, William Benton, James Ahloy, Robin Taylor, Chad Scherrer, Zwelithini Tunyiswa, Bertrand Wilden, James Thompson, Stephen Oates, Gian Luca Di Tanna, Jack Wells, Matthew Maldonado, Ian Costley, Ally Salim, Larry Gill, Ian Moran, Paul Oreto, Colin Caprani, Colin Carroll, Nathaniel Burbank, Michael Osthege, Rémi Louf, Clive Edelsten, Henri Wallen, Hugo Botha, Vinh Nguyen, Marcin Elantkowski, Adam C. Smith, Will Kurt, Andrew Moskowitz, Hector Munoz, Marco Gorelli, Simon Kessell, Bradley Rode, Patrick Kelley, Rick Anderson, Casper de Bruin, Philippe Labonde, Michael Hankin, Cameron Smith, Tomáš Frýda, Ryan Wesslen, Andreas Netti, Riley King, Yoshiyuki Hamajima, Sven De Maeyer, Michael DeCrescenzo, Fergal M, Mason Yahr, Naoya Kanai, Steven Rowland, Aubrey Clayton, Jeannine Sue, Omri Har Shemesh, Scott Anthony Robson, Robert Yolken, Or Duek, Pavel Dusek, Paul Cox, Andreas Kröpelin, Raphaël R, Nicolas Rode, Gabriel Stechschulte, Arkady, Kurt TeKolste, Gergely Juhasz, Marcus Nölke, Maggi Mackintosh, Grant Pezzolesi, Avram Aelony, Joshua Meehl, Javier Sabio, Kristian Higgins, Alex Jones, Gregorio Aguilar, Matt Rosinski, Bart Trudeau, Luis Fonseca, Dante Gates, Matt Niccolls, Maksim Kuznecov, Michael Thomas, Luke Gorrie, Cory Kiser, Julio, Edvin Saveljev, Frederick Ayala, Jeffrey Powell, Gal Kampel, Adan Romero, Will Geary and Blake Walters.

Visit https://www.patreon.com/learnbayesstats to unlock exclusive Bayesian swag 😉

Takeaways:

  • Amortized Bayesian inference combines deep learning and statistics to make posterior inference fast and trustworthy.
  • Bayesian neural networks can be used for full Bayesian inference on neural network weights.
  • Amortized Bayesian inference decouples the training phase and the posterior inference phase, making posterior sampling much faster.
  • BayesFlow is a Python library for amortized Bayesian workflows, providing a user-friendly interface and modular architecture.
  • Self-consistency loss is a technique that combines simulation-based inference and likelihood-based Bayesian inference, with a focus on amortization.
  • The BayesFlow package aims to make amortized Bayesian inference more accessible and provides sensible default values for neural networks.
  • Deep fusion techniques allow for the fusion of multiple sources of information in neural networks.
  • Generative models that are expressive and have one-step inference are an emerging topic in deep learning and probabilistic machine learning.
  • Foundation models, which have a large training set and can handle out-of-distribution cases, are another intriguing area of research.

Chapters:

00:00 Introduction to Amortized Bayesian Inference

07:39 Bayesian Neural Networks

11:47 Amortized Bayesian Inference and Posterior Inference

23:20 BayesFlow: A Python Library for Amortized Bayesian Workflows

38:15 Self-consistency loss: Bridging Simulation-Based Inference and Likelihood-Based Bayesian Inference

41:35 Amortized Bayesian Inference

43:53 Fusing Multiple Sources of Information

45:19 Compensating for Missing Data

56:17 Emerging Topics: Expressive Generative Models and Foundation Models

01:06:18 The Future of Deep Learning and Probabilistic Machine Learning

Links from the show:

Transcript

This is an automatic transcript and may therefore contain errors. Please get in touch if you’re willing to correct them.

Speaker:

In this episode, Marvin Schmitt introduces

the concept of amortized Bayesian

2

00:00:09,542 --> 00:00:14,322

inference, where the upfront training

phase of a neural network is followed by

3

00:00:14,322 --> 00:00:16,562

fast posterior inference.

4

00:00:16,562 --> 00:00:20,782

Marvin will guide us through this new

concept, discussing his work in

5

00:00:20,782 --> 00:00:24,862

probabilistic machine learning and

uncertainty quantification using Bayesian

6

00:00:24,862 --> 00:00:27,682

inference with deep neural networks.

7

00:00:27,682 --> 00:00:29,774

He also introduces BayesFlow, a

8

00:00:29,774 --> 00:00:34,654

Python library for amortized Bayesian

workflows and discusses its use cases in

9

00:00:34,654 --> 00:00:40,494

various fields while also touching on the

concept of deep fusion and its relation to

10

00:00:40,494 --> 00:00:43,114

multimodal simulation-based inference.

11

00:00:43,114 --> 00:00:47,754

Yeah, that is a very deep episode and also

a fascinating one.

12

00:00:47,754 --> 00:00:53,554

I've been personally diving much more into

amortized Bayesian inference with BayesFlow

13

00:00:53,554 --> 00:00:57,326

since the folks there have been kind

enough

14

00:00:57,326 --> 00:01:03,426

to invite me to the team, and I can tell

you, this is super promising technology.

15

00:01:03,546 --> 00:01:08,206

A PhD student in computer science at the

University of Stuttgart, Marvin is

16

00:01:08,206 --> 00:01:14,286

supervised actually by two LBS guests you

surely know, Paul Bürkner and Aki

17

00:01:14,286 --> 00:01:15,206

Vehtari.

18

00:01:15,206 --> 00:01:19,146

Marvin's research combines deep learning

and statistics to make Bayesian inference

19

00:01:19,146 --> 00:01:20,886

fast and trustworthy.

20

00:01:20,886 --> 00:01:26,592

In his free time, Marvin enjoys board

games and is a passionate guitar player.

21

00:01:26,638 --> 00:01:33,398

This is Learning Bayesian Statistics,

episode 107, recorded April 3, 2024.

22

00:01:50,178 --> 00:01:54,818

Welcome to Learning Bayesian Statistics, a

podcast about Bayesian inference,

23

00:01:54,818 --> 00:01:56,398

methods, the projects,

24

00:01:56,398 --> 00:01:58,398

and the people who make it possible.

25

00:01:58,398 --> 00:02:00,618

I'm your host, Alex Andorra.

26

00:02:00,618 --> 00:02:05,898

You can follow me on Twitter at alex

_andorra, like the country, for any info

27

00:02:05,898 --> 00:02:06,878

about the show.

28

00:02:06,878 --> 00:02:09,238

LearnBayesStats.com is the place to be.

29

00:02:09,238 --> 00:02:14,078

Show notes, becoming a corporate sponsor,

unlocking Bayesian Merch, supporting the

30

00:02:14,078 --> 00:02:16,698

show on Patreon, everything is in there.

31

00:02:16,698 --> 00:02:18,528

That's LearnBayesStats.com.

32

00:02:18,528 --> 00:02:22,978

If you're interested in one-on-one

mentorship, online courses, or statistical

33

00:02:22,978 --> 00:02:23,982

consulting,

34

00:02:23,982 --> 00:02:29,222

Feel free to reach out and book a call at

topmate.io slash alex underscore

35

00:02:29,222 --> 00:02:30,042

andorra.

36

00:02:30,042 --> 00:02:33,926

See you around folks and best Bayesian

wishes to you all.

37

00:02:38,030 --> 00:02:46,090

Today, I want to thank the fantastic Adam

Romero, Will Geary, and Blake Walters for

38

00:02:46,090 --> 00:02:47,890

supporting the show on Patreon.

39

00:02:47,890 --> 00:02:53,270

Your support is truly invaluable and

literally makes this show possible.

40

00:02:53,270 --> 00:02:56,686

I can't wait to talk with you guys in the

Slack channel.

41

00:02:56,686 --> 00:03:01,546

Second, the first part of our modeling

webinar series on Gaussian processes is

42

00:03:01,546 --> 00:03:02,846

out for everyone.

43

00:03:02,846 --> 00:03:08,626

So if you want to see how to use the new

HSGP approximation in PyMC, head over to

44

00:03:08,626 --> 00:03:13,466

the LBS YouTube channel and you'll see

Juan Orduz, a fellow PyMC Core Dev and

45

00:03:13,466 --> 00:03:18,062

mathematician, explain how to do fast and

efficient Gaussian processes in PyMC.

46

00:03:18,062 --> 00:03:24,182

I'm actually working on the next part in

this series as we speak, so stay tuned for

47

00:03:24,182 --> 00:03:28,322

more and follow the LBS YouTube channel if

you don't want to miss it.

48

00:03:28,322 --> 00:03:30,642

Okay, back to the show now.

49

00:03:31,542 --> 00:03:36,522

Marvin Schmitt, Willkommen nach Learning

Bayesian Statistics.

50

00:03:36,902 --> 00:03:38,682

Thanks Alex, thanks for having me.

51

00:03:38,682 --> 00:03:42,502

Actually my German is very rusty, do you

say nach or zu?

52

00:03:42,502 --> 00:03:46,082

Well, willkommen bei Learning Bayesian Statistics.

53

00:03:47,054 --> 00:03:49,454

Maybe willkommen im Podcast?

54

00:03:49,454 --> 00:03:50,834

Nah.

55

00:03:51,174 --> 00:03:56,034

Obviously, obviously like it was a third

hidden option.

56

00:03:56,034 --> 00:03:56,554

Damn.

57

00:03:56,554 --> 00:03:58,674

it's a secret third thing, right?

58

00:03:58,674 --> 00:04:00,054

Yeah, always in Germany.

59

00:04:00,054 --> 00:04:01,674

It's always that.

60

00:04:01,714 --> 00:04:03,033

Man, damn.

61

00:04:03,033 --> 00:04:04,524

Well, that's okay.

62

00:04:04,524 --> 00:04:09,574

I got embarrassed in front of the world,

but I'm used to that in each episode.

63

00:04:09,574 --> 00:04:12,974

So thanks a lot for taking the time.

64

00:04:12,974 --> 00:04:14,126

Marvin.

65

00:04:14,126 --> 00:04:19,586

Thanks a lot to Matt Rosinski actually for

recommending to do an episode with you.

66

00:04:19,586 --> 00:04:26,106

Matt was kind enough to take some of his

time to write to me and put me in contact

67

00:04:26,106 --> 00:04:27,006

with you.

68

00:04:27,006 --> 00:04:34,786

I think you guys met in Australia in a

very fun conference, Bayes on the Beach.

69

00:04:34,786 --> 00:04:36,326

I think it happens every two years.

70

00:04:36,326 --> 00:04:42,266

Definitely gonna go there in two years

and do a live episode there.

71

00:04:42,266 --> 00:04:43,662

Definitely that's a...

72

00:04:43,662 --> 00:04:47,922

That's a project I wanted to do this

year, but that didn't go well with my

73

00:04:47,922 --> 00:04:48,842

traveling dates.

74

00:04:48,842 --> 00:04:51,882

So in two years, definitely going to try

to do that.

75

00:04:51,882 --> 00:04:55,802

So yeah, listeners and Marvin, you can

hold me accountable on that promise.

76

00:04:56,382 --> 00:04:56,662

Absolutely.

77

00:04:56,662 --> 00:04:57,922

We will.

78

00:04:59,402 --> 00:05:06,282

So Marvin, before we talk a bit more about

what you're a specialist in and also what

79

00:05:06,282 --> 00:05:12,842

you presented in Australia, can you tell

us what you're doing nowadays and also how

80

00:05:12,842 --> 00:05:13,486

you...

81

00:05:13,486 --> 00:05:15,506

ended up working on this?

82

00:05:16,446 --> 00:05:17,626

Yeah, of course.

83

00:05:17,626 --> 00:05:20,566

So these days, I'm mostly doing methods

development.

84

00:05:20,566 --> 00:05:24,946

So broadly in probabilistic machine

learning, I care a lot about uncertainty

85

00:05:24,946 --> 00:05:26,186

quantification.

86

00:05:26,186 --> 00:05:30,366

And so essentially, I'm doing Bayesian

inference with deep neural networks.

87

00:05:30,466 --> 00:05:34,146

So taking Bayesian inference, which is

notoriously slow at times, which might be

88

00:05:34,146 --> 00:05:38,376

a bottleneck, and then using generative

neural networks to speed up this process,

89

00:05:38,376 --> 00:05:41,806

but still maintaining all the

explainability, all these nice benefits

90

00:05:41,806 --> 00:05:43,246

that we have from using Bayesian inference.

91

00:05:43,246 --> 00:05:48,346

I have a background in both psychology and

computer science.

92

00:05:48,346 --> 00:05:53,216

That's also how I ended up in Bayesian

inference.

93

00:05:53,826 --> 00:05:58,206

Because during my psychology studies, I took

a few statistics courses, then started as

94

00:05:58,206 --> 00:06:01,946

a statistics tutor, mainly doing frequentist

statistics.

95

00:06:01,946 --> 00:06:06,106

And then I took a seminar on Bayesian

statistics in Heidelberg in Germany.

96

00:06:06,446 --> 00:06:08,726

And it was the hardest seminar that I ever

took.

97

00:06:08,726 --> 00:06:10,326

Well, it's super hard.

98

00:06:10,326 --> 00:06:12,366

We read like papers every single week.

99

00:06:12,366 --> 00:06:15,686

Everyone had to prepare every single paper

for every single week.

100

00:06:15,686 --> 00:06:21,466

And then at the start of each session, the

professor would just shuffle and randomly

101

00:06:21,466 --> 00:06:23,226

pick someone to present.

102

00:06:23,746 --> 00:06:24,156

my God.

103

00:06:24,156 --> 00:06:28,016

That was tough, but somehow, I don't know,

it stuck with me.

104

00:06:28,016 --> 00:06:32,466

And I had like this aha moment where I

felt like, okay, all this statistics stuff

105

00:06:32,466 --> 00:06:36,846

that I've been doing before was more of,

you know, following a recipe, which is

106

00:06:36,846 --> 00:06:37,826

very strict.

107

00:06:37,826 --> 00:06:41,390

But then this like holistic Bayesian

probabilistic take

108

00:06:41,390 --> 00:06:46,230

just gave me a much broader overview of

statistics in general.

109

00:06:47,090 --> 00:06:49,870

Somehow I followed the path.

110

00:06:50,610 --> 00:06:51,570

Yeah.

111

00:06:52,010 --> 00:06:53,730

I'm curious what that...

112

00:06:53,730 --> 00:07:00,570

So what does that mean to do Bayesian stats

on deep neural networks concretely?

113

00:07:00,570 --> 00:07:05,610

What is the thing you would do if you had

to do that?

114

00:07:05,610 --> 00:07:10,290

Let's say, does that mean you mainly...

115

00:07:10,382 --> 00:07:15,502

you develop the deep neural network and

then you add some Bayesian layer on that,

116

00:07:15,502 --> 00:07:19,502

or do you have to have the Bayesian framework

from the beginning?

117

00:07:19,502 --> 00:07:21,022

How does that work?

118

00:07:21,982 --> 00:07:23,342

Yeah, that's a great question.

119

00:07:23,342 --> 00:07:28,902

And in fact, that's a common point of

confusion there as well, because Bayesian

120

00:07:28,902 --> 00:07:32,782

inference is just like a general, almost

philosophical framework for reasoning

121

00:07:32,782 --> 00:07:33,922

about uncertainty.

122

00:07:33,922 --> 00:07:38,762

So you have some latent quantities, call

them parameters, whatever, some latent

123

00:07:38,762 --> 00:07:39,630

unknowns.

124

00:07:39,630 --> 00:07:41,690

And you want to do inference on them.

125

00:07:41,690 --> 00:07:45,570

You want to know what these latent

quantities are, but all you have are

126

00:07:45,570 --> 00:07:46,930

actual observables.

127

00:07:46,930 --> 00:07:50,090

And you want to know how these are related

to each other.

128

00:07:50,090 --> 00:07:53,710

And so with Bayesian neural networks, for

instance, these parameters would be the

129

00:07:53,710 --> 00:07:54,550

neural network weights.

130

00:07:54,550 --> 00:07:57,530

And so you want full Bayesian inference on

the neural network weights.

131

00:07:57,790 --> 00:07:59,770

And fitting normal neural networks already

supports that.

132

00:07:59,770 --> 00:08:01,230

Like a posterior distribution?

133

00:08:01,230 --> 00:08:02,210

Exactly.

134

00:08:02,210 --> 00:08:03,870

Over these neural network weights.

135

00:08:03,870 --> 00:08:04,150

Exactly.

136

00:08:04,150 --> 00:08:08,630

So that's one approach of doing Bayesian

deep learning, but that's not what I'm

137

00:08:08,630 --> 00:08:09,758

currently doing.

138

00:08:09,806 --> 00:08:12,146

Instead, I'm coming from the Bayesian

side.

139

00:08:12,146 --> 00:08:16,606

So we have like a normal Bayesian model,

which has statistical parameters.

140

00:08:16,606 --> 00:08:22,606

So you can imagine it like a mechanistic

model, like a simulation program.

141

00:08:22,606 --> 00:08:26,266

And we want to estimate these scientific

parameters.

142

00:08:26,526 --> 00:08:31,306

So for example, if you have a cognitive

decision -making task from the cognitive

143

00:08:31,306 --> 00:08:36,066

sciences, and these parameters might be

something like the non-decision time, the

144

00:08:36,066 --> 00:08:38,957

actual motor reaction time that you need

to

145

00:08:38,957 --> 00:08:42,737

move your muscles and some information

uptake rates, some bias and all these

146

00:08:42,737 --> 00:08:45,576

things that researchers are actually

interested in.

147

00:08:45,937 --> 00:08:51,457

And usually you would then formulate your

model in, for example, PiMC or Stan or

148

00:08:51,457 --> 00:08:55,977

however you want to formulate your

statistical model and then run MCMC for

149

00:08:55,977 --> 00:08:57,397

parameter inference.

150

00:08:58,077 --> 00:09:03,997

And now where the neural networks come in

in my research is that we replace MCMC

151

00:09:03,997 --> 00:09:05,697

with a neural network.

152

00:09:05,857 --> 00:09:08,373

So we still have our Bayesian model.

153

00:09:09,710 --> 00:09:12,350

But we don't use MCMC for posterior

inference.

154

00:09:12,350 --> 00:09:15,750

Instead, we use a neural network just for

posterior inference.

155

00:09:16,090 --> 00:09:19,530

And this neural network is trained by

maximum likelihood.

156

00:09:19,530 --> 00:09:24,650

So the neural network itself, the weights

there are not probabilistic.

157

00:09:25,490 --> 00:09:27,850

There are no posterior distributions over

the weights.

158

00:09:27,850 --> 00:09:33,930

But we just want to somehow model the

actual posterior distributions of our

159

00:09:33,930 --> 00:09:37,810

statistical model parameters using a

neural network.

160

00:09:39,470 --> 00:09:43,170

So the neural net... okay, I see.

161

00:09:43,170 --> 00:09:44,770

That's quite new to me.

162

00:09:44,770 --> 00:09:49,530

So I'm going to rephrase that and see how

much I understood.

163

00:09:50,270 --> 00:09:55,210

So that means the deep neural network is

already trained beforehand?

164

00:09:55,510 --> 00:09:56,870

No, we have to train it.

165

00:09:56,870 --> 00:09:58,430

And that's the cool part about this.

166

00:09:58,430 --> 00:09:59,990

OK, so you train it at the same time.

167

00:09:59,990 --> 00:10:01,400

You train it at the same time.

168

00:10:01,400 --> 00:10:05,850

You're also trying to infer the underlying

parameters of your model.

169

00:10:06,130 --> 00:10:07,342

And that's the cool part now.

170

00:10:07,342 --> 00:10:11,012

Because in MCMC, you would do both at the

same time, right?

171

00:10:11,012 --> 00:10:15,822

You have your fixed model that you write

down in PyMC or Stan, and then you have

172

00:10:15,822 --> 00:10:20,462

your one observed data set, and you want

to fit your model to the data set.

173

00:10:20,462 --> 00:10:26,182

And so, you know, you do, for example,

your Hamiltonian Monte Carlo algorithm to,

174

00:10:26,182 --> 00:10:31,982

you know, traverse your parameter space

and then do the sampling.

175

00:10:34,301 --> 00:10:36,942

So you couple your approximation

176

00:10:36,942 --> 00:10:40,422

phase and your inference phase.

177

00:10:40,962 --> 00:10:44,242

Like you learn about the posterior

distribution based on your data set.

178

00:10:44,242 --> 00:10:47,942

And then you also want to generate

posterior samples while you're exploring

179

00:10:47,942 --> 00:10:49,482

this parameter space.

180

00:10:49,682 --> 00:10:53,922

And in the line of work that I'm doing,

which we call amortized Bayesian

181

00:10:53,922 --> 00:10:56,842

inference, we decouple those two phases.

182

00:10:56,842 --> 00:10:59,562

So the first phase is actually training

those neural networks.

183

00:10:59,562 --> 00:11:01,522

And that's the hard task.

184

00:11:01,602 --> 00:11:04,750

And then you essentially take your

Bayesian model.

185

00:11:04,750 --> 00:11:08,690

generate a lot of training data from the

model because you can just run prior

186

00:11:08,690 --> 00:11:10,290

predictive samples.

187

00:11:10,710 --> 00:11:12,930

So generate prior predictive samples.

188

00:11:13,130 --> 00:11:16,454

And those are your training data for the

neural network.

189

00:11:18,126 --> 00:11:23,106

And use the neural network to essentially

learn a surrogate for the posterior

190

00:11:23,106 --> 00:11:24,306

distribution.

191

00:11:24,546 --> 00:11:30,746

So for each data set that you have, you

want to take those as conditions and then

192

00:11:30,746 --> 00:11:37,266

have a generative neural network to learn

somehow how these data and the parameters

193

00:11:37,266 --> 00:11:38,946

are related to each other.

194

00:11:39,386 --> 00:11:43,526

And this upfront training phase takes

quite some time and usually takes longer

195

00:11:43,526 --> 00:11:47,614

than the equivalent MCMC would take, given

that you can run MCMC.

196

00:11:47,982 --> 00:11:52,622

Now, the cool thing is, as you said, when

your neural network is trained, then the

197

00:11:52,622 --> 00:11:54,682

posterior inference is super fast.

198

00:11:54,682 --> 00:11:58,282

Then if you want to generate posterior

samples, there's no approximation anymore

199

00:11:58,282 --> 00:12:00,402

because you've already done all the

approximation.

200

00:12:00,402 --> 00:12:02,342

So now you're really just doing sampling.

201

00:12:02,342 --> 00:12:07,162

That means just generating some random

numbers in some latent space and having

202

00:12:07,162 --> 00:12:11,302

one pass through the neural network, which

is essentially just a series of matrix

203

00:12:11,302 --> 00:12:12,622

multiplications.

204

00:12:12,622 --> 00:12:16,430

So once you've done this hard part and

trained your generative neural network,

205

00:12:16,430 --> 00:12:20,810

then actually doing the posterior sampling

takes like a fraction of a second for

206

00:12:20,810 --> 00:12:22,430

10,000 posterior samples.
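
(To make this concrete, here is a toy Python illustration of what "just sampling" means here — the functions are hand-rolled stand-ins, not the actual BayesFlow API or a real trained network:)

```python
import numpy as np

rng = np.random.default_rng(0)

def summary_net(data):
    # stand-in for a learned summary network: compress one data set
    # into a fixed-size vector of statistics
    return np.array([data.mean(), data.std()])

def flow_inverse(z, cond, n_obs):
    # stand-in for a trained conditional generative network: here a
    # simple affine map conditioned on the data summary (a real
    # network would be learned during the upfront training phase)
    shift, scale = cond[0], cond[1] / np.sqrt(n_obs)
    return shift + scale * z

data = rng.normal(3.0, 1.0, size=100)       # one observed data set
cond = summary_net(data)                    # condition on its summary
z = rng.standard_normal(10_000)             # random numbers in latent space
posterior_draws = flow_inverse(z, cond, len(data))  # one cheap forward pass
```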

207

00:12:24,494 --> 00:12:26,154

Okay, yeah, that's really cool.

208

00:12:26,154 --> 00:12:32,904

And how generalizable is your deep neural

network then?

209

00:12:32,904 --> 00:12:37,974

Do you have like, is that, because I can

see the really cool thing to have a neural

210

00:12:37,974 --> 00:12:40,694

network that's customized to each of your

models.

211

00:12:40,694 --> 00:12:41,414

That's really cool.

212

00:12:41,414 --> 00:12:45,194

But at the same time, as you were saying,

that's really expensive to train a neural

213

00:12:45,194 --> 00:12:49,136

network each time you have to sample a

model.

214

00:12:49,806 --> 00:12:54,606

And so I was thinking, OK, so then maybe

what you want is have generalized

215

00:12:54,606 --> 00:12:57,906

categories of deep neural network.

216

00:12:58,266 --> 00:13:00,216

So that would probably be another kill.

217

00:13:00,216 --> 00:13:04,546

But let's say I have a deep neural network

for linear regressions.

218

00:13:04,546 --> 00:13:10,546

Whether they are generalized or just plain

normal likelihood, you would use that deep

219

00:13:10,546 --> 00:13:13,886

neural network for linear regressions.

220

00:13:13,886 --> 00:13:18,446

And then the inference is super fast,

because you only have to train

221

00:13:18,446 --> 00:13:23,106

the neural network once and then

inference, posterior inference on the

222

00:13:23,106 --> 00:13:28,786

linear regression parameters themselves is

super fast.

223

00:13:29,366 --> 00:13:34,626

So yeah, like that's a long question, but

did you get what I'm asking?

224

00:13:34,626 --> 00:13:35,366

Yeah, absolutely.

225

00:13:35,366 --> 00:13:38,866

So if I get your question right, now

you're asking like, if you don't want to

226

00:13:38,866 --> 00:13:43,746

run linear regression, but want to run

some slightly different model, can I still

227

00:13:43,746 --> 00:13:46,026

use my pre-trained neural network to do

that?

228

00:13:46,830 --> 00:13:48,050

Yes, exactly.

229

00:13:48,050 --> 00:13:50,540

And also, yeah, like in general, how does

that work?

230

00:13:50,540 --> 00:13:53,750

Like, how are you thinking about that?

231

00:13:53,750 --> 00:13:59,670

Are there already some best practices or

is it like really for now, really cutting

232

00:13:59,670 --> 00:14:04,190

edge research that and all the questions

are in the air?

233

00:14:04,190 --> 00:14:04,630

Yeah.

234

00:14:04,630 --> 00:14:09,490

So first of all, the general use case for

this type of amortized Bayesian inference

235

00:14:09,490 --> 00:14:14,622

is usually when your model is fixed, but

you have many new datasets.

236

00:14:15,566 --> 00:14:20,746

So assume you have some quite complex

model where MCMC would take a few minutes

237

00:14:20,746 --> 00:14:21,886

to run.

238

00:14:21,886 --> 00:14:28,146

And you have one fixed data set that

you actually want to sample from.

239

00:14:29,066 --> 00:14:35,026

And now instead of running MCMC on it, you

say, okay, I'm going to train this neural

240

00:14:35,026 --> 00:14:36,066

network.

241

00:14:36,366 --> 00:14:40,246

So this won't yet be worth it for just one

data set.

242

00:14:40,266 --> 00:14:43,446

Now the cool thing is if you want to keep

your actual model, so whatever you write

243

00:14:43,446 --> 00:14:45,318

down in PyMC or Stan,

244

00:14:45,454 --> 00:14:49,734

We want to keep that fixed, but now plug

in different data sets.

245

00:14:50,354 --> 00:14:52,934

That's where amortized inference really

shines.

246

00:14:53,294 --> 00:15:00,174

So for instance, there was this one huge

analysis in the UK where they had like

247

00:15:00,174 --> 00:15:06,454

intelligence study data from more than 1

million participants.

248

00:15:07,054 --> 00:15:11,494

And so for each of those participants,

they again had a set of observations.

249

00:15:11,774 --> 00:15:14,808

And so for each of those 1 million

participants,

250

00:15:15,086 --> 00:15:17,446

They want to perform posterior inference.

251

00:15:18,386 --> 00:15:21,626

It means if you want to do this with

something like MCMC or anything non-

252

00:15:21,626 --> 00:15:26,810

amortized, you would need to fit one

million models.

253

00:15:28,558 --> 00:15:32,898

So you might argue now, okay, but you can

parallelize this across like a thousand

254

00:15:32,898 --> 00:15:34,898

cores, but still that's, that's a lot.

255

00:15:34,898 --> 00:15:36,078

That's a lot of compute.

256

00:15:36,078 --> 00:15:39,418

Now the cool thing is the model was the

same every single time.

257

00:15:39,418 --> 00:15:41,898

You just had a million different data

sets.

258

00:15:41,898 --> 00:15:46,738

And so what these people did then is train

a neural network once.

259

00:15:46,738 --> 00:15:52,438

And then like it will train for a few

hours, of course, but then you can just

260

00:15:52,438 --> 00:15:54,738

sequentially feed in all these 1 million

data sets.

261

00:15:54,738 --> 00:15:58,886

And for each of these 1 million data sets,

it takes way, way less than one second.

262

00:15:59,662 --> 00:16:02,762

to generate tens of thousands of posterior

samples.

263

00:16:03,102 --> 00:16:04,862

But that didn't really answer your

question.

264

00:16:04,862 --> 00:16:08,962

So your question was about how can we

generalize in the model space?

265

00:16:08,962 --> 00:16:12,102

And that's a really hard problem because

essentially what these neural networks

266

00:16:12,102 --> 00:16:21,798

learn is to give you some posterior

function if you feed in a data set.

267

00:16:23,886 --> 00:16:28,806

Now, if you have a domain shift in the

model space, so now you want inference

268

00:16:28,806 --> 00:16:32,246

based on a different model, and this

neural network has never learned to do

269

00:16:32,246 --> 00:16:32,646

that.

270

00:16:32,646 --> 00:16:33,646

So that's tough.

271

00:16:33,646 --> 00:16:35,746

That's a hard problem.

272

00:16:35,926 --> 00:16:39,746

And essentially what you could do and what

we are currently doing in our research,

273

00:16:39,746 --> 00:16:43,266

but that's cutting edge, is expanding the

model space.

274

00:16:43,266 --> 00:16:47,626

So you would have a very general

formulation of a model and then try to

275

00:16:47,626 --> 00:16:49,146

amortize over this model.

276

00:16:49,146 --> 00:16:53,092

So that different configurations of this

model, different variations,

277

00:16:53,134 --> 00:16:56,534

could just be extracted as special cases of the model

essentially.

278

00:16:59,594 --> 00:17:07,474

Can you take an example maybe to give an

idea to listeners how that would work?

279

00:17:07,474 --> 00:17:08,454

Absolutely.

280

00:17:08,454 --> 00:17:13,134

We have one preprint about sensitivity-

aware amortized Bayesian inference.

281

00:17:13,254 --> 00:17:19,234

What we do there is essentially have a

kind of multiverse analysis built into the

282

00:17:19,234 --> 00:17:20,834

neural network training.

283

00:17:21,134 --> 00:17:25,494

To give some background, multiverse analysis

basically says, okay, what are all the pre-

284

00:17:25,494 --> 00:17:28,734

processing steps that you could take in

your analysis?

285

00:17:28,734 --> 00:17:30,374

And you encode those.

286

00:17:30,374 --> 00:17:34,914

And now you're interested in like, what

if, what if I had chosen a different pre-

287

00:17:34,914 --> 00:17:35,834

processing technique?

288

00:17:35,834 --> 00:17:40,054

What if I had chosen a different way to

standardize my data?

289

00:17:40,154 --> 00:17:44,934

Then also the classical like prior

sensitivity or likelihood sensitivity

290

00:17:44,934 --> 00:17:46,414

analysis.

291

00:17:46,414 --> 00:17:50,190

Like what happens if I do power scaling on

my prior?

292

00:17:50,190 --> 00:17:51,670

power scaling on my posterior.

293

00:17:51,670 --> 00:17:54,350

So we also encode this.
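
(For reference, since power scaling comes up here: it means raising the prior or the likelihood to an exponent and checking how the posterior reacts — the standard formulation, added for clarity:)

```latex
% gamma_prior = gamma_lik = 1 recovers the original posterior
p_{\gamma}(\theta \mid y) \propto p(y \mid \theta)^{\gamma_{\mathrm{lik}}}\, p(\theta)^{\gamma_{\mathrm{prior}}}
```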

294

00:17:54,890 --> 00:17:58,970

What happens if I bootstrap some of my

data or just have a perturbation of my

295

00:17:58,970 --> 00:17:59,950

data?

296

00:18:00,390 --> 00:18:03,250

What if I add a bit of noise to my data?

297

00:18:03,250 --> 00:18:05,950

So these are all slightly different

models.

298

00:18:05,950 --> 00:18:10,930

What we do essentially keep track of that

during the training phase and just encode

299

00:18:10,930 --> 00:18:14,930

it into a vector and say, well, okay, now

we're doing pre-processing choice number

300

00:18:14,930 --> 00:18:16,570

seven,

301

00:18:17,422 --> 00:18:22,122

and scale the prior to the power of two,

don't scale the likelihood and don't do

302

00:18:22,122 --> 00:18:25,662

any perturbation and feed this as an

additional information into the neural

303

00:18:25,662 --> 00:18:26,702

network.

304

00:18:27,342 --> 00:18:30,922

Now the cool thing is during inference

phase, once we're done with the training,

305

00:18:30,922 --> 00:18:33,122

you can say, hey, here's a data set.

306

00:18:33,122 --> 00:18:40,342

Now pretend that we chose pre-processing

technique number 11 and prior scaling of

307

00:18:40,342 --> 00:18:42,422

power 0.5.

308

00:18:42,782 --> 00:18:44,448

What's the posterior now?

309

00:18:45,326 --> 00:18:50,926

Because we've amortized over this large or

more general model space, we also get

310

00:18:50,926 --> 00:18:55,026

valid posterior inference if we've trained

for long enough over these different

311

00:18:55,026 --> 00:18:56,846

configurations of model.

312

00:18:57,246 --> 00:19:02,906

And essentially, if you were to do this

with MCMC, for instance, you would refit

313

00:19:02,906 --> 00:19:04,710

your model every single time.

314

00:19:08,494 --> 00:19:10,814

And so here you don't have to do that.
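
(A hypothetical sketch of that encoding in Python — the names and the number of pre-processing choices are invented for illustration, not the preprint's actual code:)

```python
import numpy as np

N_PREPROCESSING = 12  # assumed number of possible pre-processing pipelines

def make_context(preproc_id, prior_power, likelihood_power, noise_sd):
    """Encode one analysis configuration as the extra condition vector
    fed into the neural network alongside the data."""
    one_hot = np.zeros(N_PREPROCESSING)
    one_hot[preproc_id] = 1.0
    return np.concatenate([one_hot, [prior_power, likelihood_power, noise_sd]])

# training time: e.g. pre-processing choice 7, prior scaled to the power of 2
train_ctx = make_context(7, prior_power=2.0, likelihood_power=1.0, noise_sd=0.0)
# inference time: "pretend we chose technique 11 and prior power 0.5"
query_ctx = make_context(11, prior_power=0.5, likelihood_power=1.0, noise_sd=0.0)
```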

315

00:19:10,814 --> 00:19:11,294

Okay.

316

00:19:11,294 --> 00:19:12,394

Yeah, I see.

317

00:19:12,394 --> 00:19:13,194

That's super.

318

00:19:13,194 --> 00:19:15,014

Yeah, that's super cool.

319

00:19:15,174 --> 00:19:21,074

And I feel like, so that would be mainly

the main use cases would be as you were

320

00:19:21,074 --> 00:19:30,434

saying, when, when you're getting into

really high data territory and you have

321

00:19:30,554 --> 00:19:36,206

what's changing is mainly the data side,

mainly the data.

322

00:19:36,206 --> 00:19:41,226

set and to be even more precise, not

really the data set, but the data values,

323

00:19:41,226 --> 00:19:44,646

because the data set is supposed to be

like quite the same, like you would have

324

00:19:44,646 --> 00:19:48,066

the same columns, for instance, but the

values of the columns would change all the

325

00:19:48,066 --> 00:19:49,186

time.

326

00:19:49,406 --> 00:19:53,406

And the model at the same time doesn't

change.

327

00:19:53,826 --> 00:19:58,506

Is that like, that's really for now, at

least the best use case for that kind of

328

00:19:58,506 --> 00:19:59,506

method.

329

00:19:59,506 --> 00:19:59,796

Yes.

330

00:19:59,796 --> 00:20:03,186

And this might seem like a very niche

case.

331

00:20:03,186 --> 00:20:05,230

But then if you look at like,

332

00:20:05,230 --> 00:20:13,490

Bayesian workflows in practice, this topic

of this scheme of many model refits

333

00:20:14,050 --> 00:20:17,910

doesn't necessarily mean that you have a

large number of data sets.

334

00:20:18,010 --> 00:20:21,610

This might also just mean you want

extensive cross validation.

335

00:20:22,010 --> 00:20:27,690

So assume that you have one data set with

1000 observations.

336

00:20:28,050 --> 00:20:31,570

Now you want to run leave-one-out cross

validation, but for some reason you can't

337

00:20:31,570 --> 00:20:34,982

do the Pareto-smoothed importance sampling

version, which would be much faster.

338

00:20:35,662 --> 00:20:41,202

So you would need 1000 model refits, even

though you just have one data set, because

339

00:20:41,202 --> 00:20:45,734

you want 1000 cross validation refits.

340

00:20:48,142 --> 00:20:53,842

Maybe can you explain what you mean

by cross validation here?

341

00:20:53,842 --> 00:20:58,882

Because that's not a term that's used a

lot in the Bayesian framework, I think.

342

00:20:58,882 --> 00:20:59,742

Yeah, of course.

343

00:20:59,742 --> 00:21:02,982

So especially in a Bayesian setting, there's

this approach of leave one out cross

344

00:21:02,982 --> 00:21:09,262

validation, where you would fit your

posterior based on all data points, but

345

00:21:09,262 --> 00:21:10,242

one.

346

00:21:10,362 --> 00:21:14,022

And that's why it's called leave one out,

because you take one out and then fit your

347

00:21:14,022 --> 00:21:16,842

model, fit your posterior on the rest of

the data.

348

00:21:17,518 --> 00:21:23,438

And now you're interested in the posterior

predictive performance of this one left

349

00:21:23,438 --> 00:21:24,808

out observation.

350

00:21:26,382 --> 00:21:26,802

Yeah.

351

00:21:26,802 --> 00:21:28,622

And that's called cross validation.

352

00:21:28,802 --> 00:21:28,942

Yeah.

353

00:21:28,942 --> 00:21:29,862

Go ahead.

354

00:21:29,862 --> 00:21:34,662

Yeah, no, just I'm going to let you

finish, but yeah, for listeners familiar

355

00:21:34,662 --> 00:21:39,282

with the frequentist framework, that's

something that's really heavily used in

356

00:21:39,282 --> 00:21:42,002

that framework, cross validation.

357

00:21:42,122 --> 00:21:46,002

And it's very similar to the machine

learning concept of cross validation.

358

00:21:46,002 --> 00:21:49,682

But in the machine learning area, you

would rather have something like fivefold

359

00:21:49,682 --> 00:21:53,362

in general, k-fold cross validation,

where you would have larger splits of your

360

00:21:53,362 --> 00:21:56,526

data and then use parts of your

361

00:21:56,526 --> 00:22:00,406

whole dataset as the training dataset and

the rest for evaluation.

362

00:22:00,926 --> 00:22:04,406

Essentially, leave-one-out cross-validation

just puts it to the extreme.

363

00:22:04,746 --> 00:22:08,226

Everything but one data point is your

train dataset.
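
(To make the cost concrete, a minimal brute-force version in Python with PyMC — a toy model, purely illustrative: every held-out point triggers a full refit, which is exactly the 1000-refits problem amortization sidesteps:)

```python
import numpy as np
import pymc as pm

def brute_force_loo(y):
    """Leave-one-out CV with one full refit per held-out observation."""
    lppd = []
    for i in range(len(y)):
        y_train = np.delete(y, i)
        with pm.Model():
            mu = pm.Normal("mu", 0, 10)        # toy model for illustration
            sigma = pm.HalfNormal("sigma", 5)
            pm.Normal("obs", mu, sigma, observed=y_train)
            idata = pm.sample(draws=500, chains=2, progressbar=False)
        mus = idata.posterior["mu"].values.ravel()
        sigmas = idata.posterior["sigma"].values.ravel()
        # log posterior-predictive density of the held-out point
        dens = np.exp(-0.5 * ((y[i] - mus) / sigmas) ** 2) / (
            sigmas * np.sqrt(2 * np.pi))
        lppd.append(np.log(dens.mean()))
    return np.array(lppd)  # len(y) refits for one single data set
```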

364

00:22:09,086 --> 00:22:09,396

Yeah.

365

00:22:09,396 --> 00:22:10,246

Yeah.

366

00:22:10,966 --> 00:22:11,586

Okay.

367

00:22:11,586 --> 00:22:12,266

Yeah.

368

00:22:12,266 --> 00:22:13,166

Damn, that's super fun.

369

00:22:13,166 --> 00:22:18,466

And is there, is there already a way for

people to try that out or is it mainly for

370

00:22:18,466 --> 00:22:20,886

now implemented for papers?

371

00:22:20,886 --> 00:22:23,278

And you are probably...

372

00:22:23,278 --> 00:22:30,298

I'm guessing working on that with Aki and

all his group in Finland to make that more

373

00:22:30,298 --> 00:22:33,558

open source, helping people use packages

to do that.

374

00:22:33,558 --> 00:22:37,178

What's the state of the things here?

375

00:22:37,178 --> 00:22:39,237

Yeah, that's a great question.

376

00:22:39,237 --> 00:22:44,798

And in fact, the state of usable open

source software is far behind what we have

377

00:22:44,798 --> 00:22:48,318

for likelihood-based, MCMC-based

inference.

378

00:22:48,318 --> 00:22:52,198

So we currently don't have something

that's comparable to PyMC or Stan.

379

00:22:53,326 --> 00:22:57,826

Our group is developing or actively

developing a software that's called

380

00:22:57,826 --> 00:22:58,686

BayesFlow.

381

00:22:59,526 --> 00:23:02,646

That's because like the name, because like

Bayes, because we're doing Bayesian

382

00:23:02,646 --> 00:23:03,606

inference.

383

00:23:03,606 --> 00:23:08,346

And essentially the first neural network

architecture that was used for this

384

00:23:08,346 --> 00:23:12,146

amortized Bayesian inference are so-

called normalizing flows.

385

00:23:13,346 --> 00:23:15,266

Conditional normalizing flows to be

precise.

386

00:23:15,266 --> 00:23:19,096

And that's why the name BayesFlow came to

be.

387

00:23:19,096 --> 00:23:20,270

But now.

388

00:23:20,270 --> 00:23:23,450

actually have a bit of a different take

because now we have a whole lot of

389

00:23:23,450 --> 00:23:26,250

generative neural networks and not only

normalizing flows.

390

00:23:26,250 --> 00:23:31,870

So now we can also use, for example, score-

based diffusion models that are mainly

391

00:23:31,870 --> 00:23:37,850

used for image generation and AI or

consistency models, which are essentially

392

00:23:37,850 --> 00:23:41,170

like a distilled version of score-based

diffusion models.

393

00:23:41,170 --> 00:23:43,990

And so now BayesFlow doesn't really capture

that anymore.

394

00:23:43,990 --> 00:23:50,028

But now what the BayesFlow Python library

specializes in is defining

395

00:23:50,094 --> 00:23:52,934

principled amortized Bayesian workflows.

396

00:23:52,934 --> 00:23:57,594

So the meaning of Bayes sort of slightly shifted

to amortized Bayesian workflows and hence

397

00:23:57,594 --> 00:24:04,094

the name BayesFlow. And the focus of BayesFlow

and the aim of BayesFlow is twofold.

398

00:24:04,094 --> 00:24:06,194

So first we want a library.

399

00:24:06,194 --> 00:24:12,294

that's good for actual users. So this might

be researchers who just say hey, here's my

400

00:24:12,294 --> 00:24:12,834

data set.

401

00:24:12,834 --> 00:24:18,214

Here's my model, my simulation program, and

please just give me fast posterior

402

00:24:18,214 --> 00:24:18,674

samples.

403

00:24:18,674 --> 00:24:20,356

So we want

404

00:24:23,118 --> 00:24:28,958

a usable high-level interface with sensible

default values that mostly work out of the

405

00:24:28,958 --> 00:24:33,878

box and an interface that's mostly self-

explanatory.

406

00:24:33,878 --> 00:24:36,398

Also of course, good teaching material and

all this.

407

00:24:36,418 --> 00:24:40,718

But that's only one side of the coin

because the other large goal of BayesFlow

408

00:24:40,798 --> 00:24:44,918

is that it should be usable for machine

learning researchers who want to advance

409

00:24:44,918 --> 00:24:48,678

amortized Bayesian inference methods as

well.

410

00:24:48,678 --> 00:24:51,414

And so the software in general,

411

00:24:51,598 --> 00:24:54,058

is structured in a very modular way.

412

00:24:54,498 --> 00:24:58,358

So for instance, you could just say, hey,

take my current pipeline, my current

413

00:24:58,358 --> 00:24:59,218

workflow.

414

00:24:59,218 --> 00:25:04,838

But now try out a different loss function

because I have a new fancy idea.

415

00:25:04,838 --> 00:25:06,918

I want to incorporate more likelihood

information.

416

00:25:06,918 --> 00:25:09,898

And so I want to alter my loss function.

417

00:25:10,118 --> 00:25:17,578

So you would have your general program

because of the modular architecture there,

418

00:25:17,578 --> 00:25:20,938

you could just say, take the current loss

function and replace it with a different

419

00:25:20,938 --> 00:25:21,518

one.

420

00:25:21,518 --> 00:25:23,172

that adheres to the API.

421

00:25:24,718 --> 00:25:30,458

And we're trying to do both and serve

both interests, the user-friendly side for

422

00:25:30,458 --> 00:25:33,938

actually applied researchers who are also

currently using BayesFlow.

423

00:25:33,998 --> 00:25:39,118

But then also the machine learning

researchers with completely different

424

00:25:39,118 --> 00:25:41,378

requirements for this piece of software.

425

00:25:41,378 --> 00:25:47,278

Maybe we can also put the BayesFlow

documentation and the current project

426

00:25:47,278 --> 00:25:48,858

website in the show notes.

427

00:25:49,738 --> 00:25:52,418

Yeah, we should definitely do that.

428

00:25:52,494 --> 00:25:54,854

Definitely gonna try that out myself.

429

00:25:54,874 --> 00:25:56,054

It sounds like fun.

430

00:25:56,054 --> 00:25:59,894

I need a use case, but as soon as I have a

use case, I'm definitely gonna try that

431

00:25:59,894 --> 00:26:02,514

out because it sounds like a lot of fun.

432

00:26:02,594 --> 00:26:08,014

Yeah, several questions based on that and

thanks a lot for being so clear and so

433

00:26:08,014 --> 00:26:09,874

detailed on these.

434

00:26:10,034 --> 00:26:15,294

So first, we talked about normalizing

flows in episode 98 with Marylou

435

00:26:15,294 --> 00:26:16,274

Gabrié.

436

00:26:16,394 --> 00:26:21,886

Definitely recommend listeners to listen

to that for some background.

437

00:26:22,106 --> 00:26:29,206

And question, so BayesFlow, yeah,

definitely we need that in the show notes

438

00:26:29,206 --> 00:26:32,526

and I'm going to install that in my

environment.

439

00:26:33,006 --> 00:26:37,006

And I'm guessing, so you're saying that

that's in Python, right?

440

00:26:37,006 --> 00:26:38,186

The package?

441

00:26:38,446 --> 00:26:42,766

Yes, the core package is in Python and

we're currently refactoring to Keras.

442

00:26:42,766 --> 00:26:47,706

So by the time this podcast episode is

aired, we will have a new major release

443

00:26:47,706 --> 00:26:49,164

version, hopefully.

444

00:26:49,354 --> 00:26:50,074

OK, nice.

445

00:26:50,074 --> 00:26:52,644

So you're agnostic to the actual machine

learning back end.

446

00:26:52,644 --> 00:26:57,314

So then you could choose TensorFlow,

PyTorch, or JAX, whatever integrates best

447

00:26:57,314 --> 00:27:01,274

with what you're currently proficient in

and what you might be currently using in

448

00:27:01,274 --> 00:27:02,854

other parts of a project.

449

00:27:03,094 --> 00:27:05,374

OK, that was going to be my question.

450

00:27:05,374 --> 00:27:10,034

Because I think while preparing for the

episode, I saw that you were mainly using

451

00:27:10,034 --> 00:27:10,674

PyTorch.

452

00:27:10,674 --> 00:27:11,914

So that was going to be my question.

453

00:27:11,914 --> 00:27:13,074

What is that based on?

454

00:27:13,074 --> 00:27:17,520

So the back end could be PyTorch, JAX, or...

455

00:27:17,710 --> 00:27:20,250

What did you think the last one was?

456

00:27:20,310 --> 00:27:21,030

TensorFlow.

457

00:27:21,030 --> 00:27:25,190

Yeah, I always forget about all these

names.

458

00:27:25,330 --> 00:27:26,590

I really know PyTorch.

459

00:27:26,590 --> 00:27:28,080

So that's why I forget the other ones.

460

00:27:28,080 --> 00:27:30,410

And JAX, of course, for PyMC.

461

00:27:31,150 --> 00:27:35,850

And then, so my question is, the workflow,

what would it look like if you're using

462

00:27:35,850 --> 00:27:37,070

BayesFlow?

463

00:27:37,510 --> 00:27:42,850

Because you were saying the model, you

could write it in standard PyMC or

464

00:27:42,850 --> 00:27:44,910

TensorFlow, for instance.

465

00:27:45,350 --> 00:27:46,606

Although I don't know if you can write.

466

00:27:46,606 --> 00:27:48,706

Bayesian models with TensorFlow anymore.

467

00:27:48,706 --> 00:27:52,246

Anyways, let's say PyMC or Stan.

468

00:27:52,366 --> 00:27:53,186

You write your model.

469

00:27:53,186 --> 00:27:59,476

But then the sampling of the model is done

with the neural network.

470

00:27:59,476 --> 00:28:03,686

So that means, for instance, PyTorch or

JAX.

471

00:28:03,906 --> 00:28:05,516

How does that work?

472

00:28:05,516 --> 00:28:10,506

Do you have then to write the model in a

JAX-compatible way?

473

00:28:10,506 --> 00:28:15,206

Or is the translation done by the package

itself?

474

00:28:15,694 --> 00:28:17,334

Yeah, that's a great question.

475

00:28:17,334 --> 00:28:21,614

It touches on many different topics and

considerations and also on future roadmap

476

00:28:21,614 --> 00:28:23,494

for BayesFlow.

477

00:28:23,754 --> 00:28:24,454

So.

478

00:28:26,062 --> 00:28:30,042

This class of algorithms that are

implemented in BayesFlow, these amortized

479

00:28:30,042 --> 00:28:35,142

Bayesian inference algorithms, to give you

some background there, they originally

480

00:28:35,142 --> 00:28:37,502

started in simulation -based inference.

481

00:28:37,502 --> 00:28:40,562

It's also sometimes called likelihood-

free inference.

482

00:28:40,722 --> 00:28:44,502

So essentially it is Bayesian inference

when you don't bring a closed-form

483

00:28:44,502 --> 00:28:46,362

likelihood function to the table.

484

00:28:46,362 --> 00:28:51,142

But instead, you only have some generic

forward simulation program.

485

00:28:51,142 --> 00:28:54,382

So you would just have your prior as

some...

486

00:28:54,382 --> 00:28:59,022

Python function or C++ function, whatever,

any function that you could call and it

487

00:28:59,022 --> 00:29:02,102

would return you a sample from the prior

distribution.

488

00:29:02,882 --> 00:29:06,582

You don't need to write it down in terms

of distributions actually, but you only

489

00:29:06,582 --> 00:29:08,422

need to be able to sample from it.

490

00:29:08,422 --> 00:29:10,682

And then the same for the likelihood.

491

00:29:10,682 --> 00:29:16,482

So you don't need to write down your

likelihood in like a PyMC or Stan in terms

492

00:29:16,482 --> 00:29:21,382

of a probability distribution, in terms of

density distribution or densities.

493

00:29:21,402 --> 00:29:23,814

But instead it's

494

00:29:23,822 --> 00:29:29,702

just got to be some simulation program,

which takes in parameters and then outputs

495

00:29:29,702 --> 00:29:30,590

data.

496

00:29:32,238 --> 00:29:41,438

What happens between these parameters and

the data is not necessarily probabilistic

497

00:29:41,438 --> 00:29:44,238

in terms of closed form distributions.

498

00:29:44,238 --> 00:29:49,898

It could also be some non-tractable

differential equations.

499

00:29:49,898 --> 00:29:52,118

It could be essentially everything.

500

00:29:54,298 --> 00:29:59,118

So for BayesFlow, this means that you

don't have to input something like a PyMC

501

00:29:59,118 --> 00:30:02,310

or a Stan model, which you write down in

terms of

502

00:30:02,830 --> 00:30:08,570

distributions, but it's just a generic

forward model that you can call and you

503

00:30:08,570 --> 00:30:13,542

will get a tuple of a parameter draw and a

data set.

504

00:30:15,310 --> 00:30:17,470

So you'd usually just do it in NumPy.

505

00:30:19,430 --> 00:30:23,730

So you would write, if I'm using BayesFlow,

I would write it in NumPy.

506

00:30:23,730 --> 00:30:25,990

It would probably be the easiest way.

507

00:30:26,030 --> 00:30:31,050

You could probably also write it in JAX or

in PyTorch or in TensorFlow or TensorFlow

508

00:30:31,050 --> 00:30:35,030

probability, whatever you want to use and

like behind the scenes.

509

00:30:35,290 --> 00:30:40,670

But essentially what we just care about is

that the model gets a tuple of parameters

510

00:30:40,850 --> 00:30:44,190

and then data that has been generated from

these parameters.

511

00:30:45,070 --> 00:30:47,250

for the neural network training process.
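
(A minimal sketch of such a forward program in NumPy — prior and simulator are hypothetical, just to show the (parameter, data) tuple structure the training loop consumes:)

```python
import numpy as np

rng = np.random.default_rng(42)

def prior():
    # any callable that returns one parameter draw
    return rng.normal(0.0, 1.0, size=2)        # e.g. [location, log-scale]

def simulator(theta, n_obs=50):
    # any forward program: parameters in, one synthetic data set out
    # (could just as well wrap an ODE solver or a C++ binary)
    loc, log_scale = theta
    return rng.normal(loc, np.exp(log_scale), size=n_obs)

def training_batch(batch_size=64):
    thetas = np.stack([prior() for _ in range(batch_size)])
    data = np.stack([simulator(t) for t in thetas])
    return thetas, data  # the (parameter, data set) tuples for training
```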

512

00:30:47,810 --> 00:30:50,550

That's super fun.

513

00:30:50,630 --> 00:30:51,010

Yeah, yeah, yeah.

514

00:30:51,010 --> 00:30:52,790

Definitely want to see that.

515

00:30:52,790 --> 00:30:59,310

Do you have already some Jupyter notebook

examples up on the repo or are you working

516

00:30:59,310 --> 00:31:00,130

on that?

517

00:31:00,130 --> 00:31:03,040

Yeah, currently it's a full-fledged

library.

518

00:31:03,040 --> 00:31:06,070

It's been under development for a few

years now.

519

00:31:06,070 --> 00:31:08,510

And we also have an active user base right

now.

520

00:31:08,510 --> 00:31:12,170

It's quite small compared to other

Bayesian packages.

521

00:31:12,370 --> 00:31:14,114

We're growing it.

522

00:31:14,766 --> 00:31:16,306

Yeah, that's cool.

523

00:31:16,306 --> 00:31:21,486

In documentation, there are currently, I

think, seven or eight tutorial notebooks.

524

00:31:21,506 --> 00:31:24,566

And then also for Bayes on the Beach,

like this conference in Australia that we

525

00:31:24,566 --> 00:31:29,106

just talked about earlier, we also

prepared a workshop.

526

00:31:29,106 --> 00:31:34,186

And we're also going to link to this

Jupyter notebook in the show notes.

527

00:31:34,986 --> 00:31:39,026

Yeah, definitely we should, we should link

to some of these Jupyter notebooks in the

528

00:31:39,026 --> 00:31:39,846

show notes.

529

00:31:39,846 --> 00:31:41,198

And so, I'm thinking you should...

530

00:31:41,198 --> 00:31:45,038

Like if you're down, you should definitely

come back to the show, but for a webinar.

531

00:31:45,038 --> 00:31:48,818

I have another format that's modeling

webinar where you could, you would come to

532

00:31:48,818 --> 00:31:54,938

the show and share your screen and, and go

through the model code live and people can

533

00:31:54,938 --> 00:31:56,298

ask questions and so on.

534

00:31:56,298 --> 00:31:59,678

I've done that already on a variety of

things.

535

00:31:59,698 --> 00:32:03,958

Last one was about causal inference and

propensity scores.

536

00:32:03,958 --> 00:32:08,618

Next one is going to be about Hilbert

space GP decomposition.

537

00:32:09,774 --> 00:32:13,274

So yeah, if you're down, you should

definitely come and do a demonstration of

538

00:32:13,274 --> 00:32:15,514

BayesFlow and amortized Bayesian

inference.

539

00:32:15,514 --> 00:32:20,754

I think that would be super fun and very

interesting to people.

540

00:32:21,374 --> 00:32:22,074

Absolutely.

541

00:32:22,334 --> 00:32:25,354

Then to answer the last part of your

question.

542

00:32:25,494 --> 00:32:26,314

Yeah.

543

00:32:26,314 --> 00:32:31,394

Like if you currently have a model that's

written down in PyMC or Stan, that's a bit

544

00:32:31,394 --> 00:32:37,094

more tricky to integrate because

essentially all we need in BayesFlow

545

00:32:37,094 --> 00:32:39,854

are samples from the prior predictive

distribution.

546

00:32:39,854 --> 00:32:42,774

If you talk in Bayesian terminology.

547

00:32:43,134 --> 00:32:44,134

Yeah.

548

00:32:44,214 --> 00:32:50,214

And if your current model can do that,

that's fine.

549

00:32:50,934 --> 00:32:52,634

That's all you need right now.

550

00:32:52,634 --> 00:32:53,994

And then BayesFlow builds...

551

00:32:53,994 --> 00:33:03,754

You can have like a PyMC model and just do

pm.sample_prior_predictive, save that as a

552

00:33:03,754 --> 00:33:08,854

big NumPy multidimensional array and pass

that to BayesFlow.

553

00:33:08,854 --> 00:33:09,710

Yes.

554

00:33:09,710 --> 00:33:10,490

Okay.

555

00:33:10,490 --> 00:33:16,350

Just all you need are tuples of the

ground truth parameters and the data for the

556

00:33:16,350 --> 00:33:17,710

training process.

557

00:33:18,030 --> 00:33:22,010

So essentially like the result of your

prior call and then the result of your

558

00:33:22,010 --> 00:33:24,550

likelihood call with those prior

parameters.

559

00:33:25,769 --> 00:33:32,090

So you mean what the likelihood samples

look like once you fix the prior

560

00:33:32,090 --> 00:33:34,010

parameters to some value?

561

00:33:34,010 --> 00:33:35,170

Yes.

562

00:33:35,170 --> 00:33:39,364

So like in practice, you would just call

your prior function.

563

00:33:39,950 --> 00:33:40,270

Yeah.

564

00:33:40,270 --> 00:33:41,790

Then get a sample from the prior.

565

00:33:41,790 --> 00:33:43,070

So parameter vector.

566

00:33:43,070 --> 00:33:43,170

Yeah.

567

00:33:43,170 --> 00:33:46,630

And then plug this parameter vector into

the likelihood function.

568

00:33:46,630 --> 00:33:50,089

And then you get one simulated synthetic

data set.

569

00:33:50,570 --> 00:33:54,390

And you just need those two.
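
(Sketched out with a toy PyMC model — the export step as I'd guess it from this description, with hypothetical variable names:)

```python
import numpy as np
import pymc as pm

with pm.Model() as model:
    mu = pm.Normal("mu", 0, 1)
    sigma = pm.HalfNormal("sigma", 1)
    pm.Normal("y", mu, sigma, shape=50)   # no observed data: we only simulate

with model:
    idata = pm.sample_prior_predictive(draws=5000)

# Stack the draws into plain NumPy arrays: one row per simulated data set.
params = np.stack([idata.prior["mu"].values.ravel(),
                   idata.prior["sigma"].values.ravel()], axis=-1)  # (5000, 2)
sims = idata.prior["y"].values.reshape(5000, 50)                   # (5000, 50)
# (ground-truth parameter, data set) pairs, ready to hand over for training
```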

570

00:33:56,170 --> 00:33:57,070

Okay.

571

00:33:57,070 --> 00:33:58,530

Super cool.

572

00:33:59,070 --> 00:33:59,540

Yeah.

573

00:33:59,540 --> 00:34:04,670

Definitely sounds like a lot of fun and

should definitely do a webinar about that.

574

00:34:04,670 --> 00:34:07,210

I'm very excited about that.

575

00:34:07,510 --> 00:34:07,910

Yeah.

576

00:34:07,910 --> 00:34:09,090

Fantastic.

577

00:34:10,182 --> 00:34:16,002

And so that was one of my main questions

on that.

578

00:34:16,002 --> 00:34:20,462

Other question is, I'm guessing you are a

lot of people working on that, right?

579

00:34:20,462 --> 00:34:24,422

Because your roadmap that you just talked

about is super big.

580

00:34:24,422 --> 00:34:28,982

Because having a package that's designed

for users, but also for researchers is

581

00:34:28,982 --> 00:34:32,222

quite, that's really a lot of work.

582

00:34:32,222 --> 00:34:34,902

So I'm hoping you're not alone doing

that.

583

00:34:34,902 --> 00:34:37,434

No, we're currently a team of about a

dozen people.

584

00:34:38,158 --> 00:34:39,358

No, yeah, that makes sense.

585

00:34:39,358 --> 00:34:41,698

It's an interdisciplinary team.

586

00:34:41,698 --> 00:34:46,518

So like a few people with a hardcore like

software engineering background, like some

587

00:34:46,518 --> 00:34:50,398

people with a machine learning background,

and some people from the cognitive

588

00:34:50,398 --> 00:34:53,518

sciences and also a handful of physicists.

589

00:34:53,918 --> 00:34:57,198

Because in fact, these amortized Bayesian

inference methods are particularly

590

00:34:57,198 --> 00:34:59,118

interesting for physicists.

591

00:34:59,398 --> 00:35:04,798

For example, astrophysicists who have these

gravitational wave inference problems

592

00:35:04,798 --> 00:35:07,476

where they have massive data sets.

593

00:35:07,950 --> 00:35:11,589

And running MCMC on those would be quite

cumbersome.

594

00:35:12,810 --> 00:35:18,430

So if you have this huge stream of incoming data

and you don't have this underlying

595

00:35:18,430 --> 00:35:24,270

likelihood density, but just some

simulation program that might generate

596

00:35:24,270 --> 00:35:29,710

sensible, like gravitational waves, then

amortized Bayesian inference really shines

597

00:35:29,710 --> 00:35:30,406

there.

598

00:35:31,950 --> 00:35:32,950

Okay.

599

00:35:32,950 --> 00:35:36,330

So that's exactly the case you were

talking about where the model doesn't

600

00:35:36,330 --> 00:35:39,630

change, but you have a lot of different

datasets.

601

00:35:39,850 --> 00:35:41,030

Yeah, exactly.

602

00:35:41,290 --> 00:35:46,690

Because I mean, what you're trying to run

inference on is your physical model.

603

00:35:46,810 --> 00:35:49,230

And that doesn't change.

604

00:35:49,230 --> 00:35:50,230

I mean, it does.

605

00:35:50,230 --> 00:35:56,950

And then again, physicists have a very

good understanding and very good models of

606

00:35:56,950 --> 00:35:58,290

the world around them.

607

00:35:58,290 --> 00:36:01,294

And that's one of the largest

differences to

608

00:36:01,294 --> 00:36:06,074

people from the cognitive sciences, where,

you know, the, the models of the human

609

00:36:06,074 --> 00:36:11,714

brain, for instance, are just, it's such a

tough thing to model and there's so much

610

00:36:11,714 --> 00:36:15,520

unknown and so much uncertainty in the

model building process.

611

00:36:17,870 --> 00:36:19,190

Yeah, for sure.

612

00:36:19,190 --> 00:36:23,370

Okay, yeah, I think I'm starting to

understand the idea.

613

00:36:23,450 --> 00:36:28,910

And yeah, so actually, episode 101 was

exactly about that.

614

00:36:28,910 --> 00:36:31,870

Black holes, collisions, gravitational

waves.

615

00:36:31,870 --> 00:36:38,130

And I was talking with LIGO researchers,

Christopher Berry and John Veitch.

616

00:36:38,150 --> 00:36:41,510

And we talked exactly about that, their

problem with big data sets.

617

00:36:41,510 --> 00:36:45,610

They are mainly using sequential Monte

Carlo, but I'm guessing they would also be

618

00:36:45,610 --> 00:36:47,054

interested in...

619

00:36:47,054 --> 00:36:49,234

amortized Bayesian inference.

620

00:36:49,794 --> 00:36:54,774

So yeah, Christopher and John, if you're

listening, feel free to reach out to

621

00:36:54,774 --> 00:36:57,014

Marvin and use BayesFlow.

622

00:36:57,794 --> 00:37:03,394

And listeners, this episode will be in the

show notes also if you want to give it a

623

00:37:03,394 --> 00:37:03,714

listen.

624

00:37:03,714 --> 00:37:09,274

That's a really fun one, also learning a

lot of stuff about the crazy universe we

625

00:37:09,274 --> 00:37:10,294

live in.

626

00:37:12,054 --> 00:37:16,366

Actually, a weird question I have is why

627

00:37:16,366 --> 00:37:19,546

it's called amortized Bayesian

inference?

628

00:37:22,266 --> 00:37:29,886

The reason is that we have this two-stage

process where we would first pay upfront

629

00:37:29,886 --> 00:37:32,986

with this long neural network training

phase.

630

00:37:33,146 --> 00:37:38,046

But then once we're done with this, this

cost of the upfront training phase

631

00:37:38,046 --> 00:37:43,646

amortizes over all the posterior samples

that we can draw within a few

632

00:37:43,646 --> 00:37:44,726

milliseconds.
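
[As a back-of-the-envelope illustration of the amortization, with all numbers invented:]

```python
train_cost = 3600.0   # say, one hour of upfront neural network training (seconds)
per_fit = 0.005       # a few milliseconds of posterior sampling per data set

for n_datasets in (1, 100, 100_000):
    total_per_dataset = train_cost / n_datasets + per_fit
    print(n_datasets, round(total_per_dataset, 4))  # the upfront cost amortizes away
```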

633

00:37:47,694 --> 00:37:49,334

That makes sense.

634

00:37:49,334 --> 00:37:51,594

That makes sense.

635

00:37:52,893 --> 00:38:00,714

And so I think something you're also

working on is something that's called deep

636

00:38:00,714 --> 00:38:01,474

fusion.

637

00:38:01,474 --> 00:38:07,274

And you do that in particular for

multimodal simulation-based inference.

638

00:38:07,614 --> 00:38:11,774

How is that related to amortized Bayesian

inference, if at all?

639

00:38:11,774 --> 00:38:13,694

And what is it about?

640

00:38:15,918 --> 00:38:19,538

I'm gonna answer these two questions in

reverse order.

641

00:38:19,538 --> 00:38:22,498

So first about the relation between

simulation-based inference and amortized

642

00:38:22,498 --> 00:38:23,698

Bayesian inference.

643

00:38:24,218 --> 00:38:29,618

So to give you a bit of history there,

simulation-based inference is essentially

644

00:38:29,618 --> 00:38:34,158

Bayesian inference based on simulations

where we don't assume that we have access

645

00:38:34,158 --> 00:38:38,078

to a likelihood density, but instead we

just assume that we can sample from the

646

00:38:38,078 --> 00:38:38,897

likelihood.

647

00:38:40,018 --> 00:38:42,758

Essentially simulate from the model.

648

00:38:43,558 --> 00:38:45,422

In fact, the likelihood is still

649

00:38:45,422 --> 00:38:49,822

present, but it's only implicitly defined

and we don't have access to the density.

650

00:38:49,902 --> 00:38:55,082

That's why likelihood-free inference

doesn't really capture what's happening here.

651

00:38:55,082 --> 00:38:59,322

But instead, in recent years,

people have started adopting the term

652

00:38:59,322 --> 00:39:03,382

simulation-based inference because we do

Bayesian inference based on simulations

653

00:39:03,382 --> 00:39:05,482

instead of likelihood densities.

654

00:39:07,942 --> 00:39:12,590

So methods that have been used...

655

00:39:12,590 --> 00:39:18,070

for quite a long time now in the

simulation-based inference research area.

656

00:39:18,510 --> 00:39:24,350

For example, rejection ABC, so approximate

Bayesian computation, or then ABC SMC, so

657

00:39:24,350 --> 00:39:28,210

combining ABC with sequential Monte Carlo.

658

00:39:29,110 --> 00:39:34,310

Essentially, the next iteration there was

throwing neural networks at simulation

659

00:39:34,310 --> 00:39:35,590

-based inference.

660

00:39:35,730 --> 00:39:40,896

That's exactly this neural posterior

estimation that I talked about earlier.

661

00:39:42,606 --> 00:39:48,026

And now what researchers noticed is, hey,

when we train a neural network for

662

00:39:48,026 --> 00:39:52,266

simulation-based inference, instead of

running rejection approximate Bayesian

663

00:39:52,266 --> 00:39:57,806

computation, then we get amortization for

free as a side product.

664

00:39:58,885 --> 00:40:06,246

It's just a by-product of using a neural

network for simulation-based inference.

665

00:40:06,766 --> 00:40:10,542

And so in the last maybe four to five

years,

666

00:40:10,542 --> 00:40:14,022

people have mainly focused on this

algorithm that's called neural posterior

667

00:40:14,022 --> 00:40:17,122

estimation for simulation-based inference.

668

00:40:17,282 --> 00:40:21,702

And so all developments that happened

there and all the research that happened

669

00:40:21,702 --> 00:40:27,222

there, almost all the research, sorry,

focused on cases where we don't have any

670

00:40:27,222 --> 00:40:28,342

likelihood density.

671

00:40:28,342 --> 00:40:31,242

So we're purely in the simulation-based

case.

672

00:40:33,442 --> 00:40:39,102

Now with our view of things, when we come

from a Bayesian inference, like a

673

00:40:39,102 --> 00:40:40,622

likelihood-based setting, we

674

00:40:40,622 --> 00:40:49,862

can say, hey, amortization is not just a

random coincidental byproduct, but it's a

675

00:40:49,862 --> 00:40:52,482

feature and we should focus on this

feature.

676

00:40:52,482 --> 00:40:57,702

And so now what we're currently doing is

moving this idea of amortized Bayesian

677

00:40:57,702 --> 00:41:01,502

inference with neural networks back into a

likelihood-based setting.

678

00:41:01,502 --> 00:41:04,646

So we've started using likelihood

information again.

679

00:41:06,542 --> 00:41:12,462

For example, using likelihood densities if

they're available or learning information

680

00:41:12,462 --> 00:41:13,842

about the likelihood.

681

00:41:13,842 --> 00:41:18,162

So like a surrogate model on the fly, and

then again, using this information for

682

00:41:18,162 --> 00:41:19,602

better posterior inference.

683

00:41:19,962 --> 00:41:24,442

So we're essentially bridging simulation

-based inference and likelihood-based

684

00:41:24,442 --> 00:41:29,922

Bayesian inference again with this goal, a

larger goal of amortization if we can do

685

00:41:29,922 --> 00:41:31,062

it.

686

00:41:32,122 --> 00:41:35,982

And so this work on deep fusion

687

00:41:35,982 --> 00:41:40,642

essentially addresses one huge shortcoming

of neural networks when we want to use

688

00:41:40,642 --> 00:41:42,862

them for amortized Bayesian inference.

689

00:41:43,222 --> 00:41:51,322

And that is in situations where we have

multiple different sources of data.

690

00:41:51,961 --> 00:41:53,822

So for example,

691

00:41:55,886 --> 00:42:03,806

Imagine you're a cognitive scientist and

you run an experiment with subjects and

692

00:42:03,806 --> 00:42:08,626

for each test subject, you give them a

decision-making task.

693

00:42:09,126 --> 00:42:14,766

But at the same time, while your subjects

solve the decision-making task, you wire

694

00:42:14,766 --> 00:42:18,040

them up with an EEG to measure the brain

activity.

695

00:42:19,726 --> 00:42:25,386

So for each subject across maybe 100

trials, what you now have is both an EEG

696

00:42:25,386 --> 00:42:28,346

and the data from the decision-making

task.

697

00:42:29,666 --> 00:42:34,416

Now, if you want to analyze this with PyMC

or Stan, what you would just do is say,

698

00:42:34,416 --> 00:42:38,386

hey, well, we have two data-generating

processes that are governed by a set of

699

00:42:38,386 --> 00:42:39,406

shared parameters.

700

00:42:39,406 --> 00:42:44,266

So the first part of the likelihood would

just be this Wiener process for the

701

00:42:44,266 --> 00:42:47,750

decision-making task where you just model

the reaction time.

702

00:42:49,678 --> 00:42:52,158

A fairly standard procedure there in the

cognitive science.

703

00:42:52,158 --> 00:42:57,378

And then for the second part, we have a

second part of the likelihood that we

704

00:42:57,378 --> 00:43:02,938

evaluate that somehow handles these EEG

measurements.

705

00:43:03,498 --> 00:43:08,458

For example, a spatio-temporal process or

just like some summary statistics that are

706

00:43:08,458 --> 00:43:09,618

being computed there.

707

00:43:09,618 --> 00:43:12,538

However you would usually process your

EEG data.

708

00:43:12,538 --> 00:43:18,338

Then you add both to the log PDF of the

likelihood, and then you can call it a

709

00:43:18,338 --> 00:43:18,918

day.
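
[For intuition, a minimal sketch of such a joint model in PyMC, with two Gaussian stand-ins sharing one parameter. The real application would use, e.g., a Wiener drift-diffusion likelihood for reaction times; everything here is illustrative:]

```python
import numpy as np
import pymc as pm

rng = np.random.default_rng(2)
rt_data = rng.normal(0.7, 0.2, size=100)        # reaction times (toy data)
eeg_data = rng.normal(0.0, 1.0, size=(100, 8))  # EEG summaries (toy data)

with pm.Model():
    drift = pm.Normal("drift", 0.0, 1.0)  # shared parameter
    # First part of the likelihood: the decision-making task.
    pm.Normal("rt", mu=0.5 + 0.3 * drift, sigma=0.2, observed=rt_data)
    # Second part: the EEG measurements. Both terms simply add to the
    # joint log PDF, and then you can call it a day.
    pm.Normal("eeg", mu=drift, sigma=1.0, observed=eeg_data)
    idata = pm.sample()
```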

710

00:43:20,302 --> 00:43:26,262

You cannot do that in neural networks

because you have no straightforward

711

00:43:26,262 --> 00:43:32,962

sensible way to combine these reaction

times from the decision-making task and

712

00:43:32,962 --> 00:43:34,282

the EEG data.

713

00:43:34,282 --> 00:43:37,342

Because you cannot just take them and slap

them together.

714

00:43:37,442 --> 00:43:41,902

They are not compatible with each other

because these information data sources are

715

00:43:41,902 --> 00:43:43,262

heterogeneous.

716

00:43:44,402 --> 00:43:49,192

So you somehow need a way to fuse these

sources of information.

717

00:43:49,262 --> 00:43:51,942

so that you can then feed them into the

neural network.

718

00:43:53,870 --> 00:44:00,390

That's essentially what we're studying in

this paper, where you could just get very

719

00:44:00,390 --> 00:44:04,390

creative and have different schemes to

fuse the data.

720

00:44:04,510 --> 00:44:09,929

So you could use these attention schemes

that are very hip in large language models

721

00:44:09,929 --> 00:44:14,110

right now with transformers essentially,

and have these different data sources

722

00:44:14,110 --> 00:44:16,930

attend or listen essentially to each

other.

723

00:44:17,590 --> 00:44:23,534

With cross attention, you could just let

the EEG data inform

724

00:44:23,534 --> 00:44:27,974

your decision-making data or just have

the decision-making data inform the EEG

725

00:44:27,974 --> 00:44:28,534

data.

726

00:44:28,534 --> 00:44:30,694

So you can get very creative there.

727

00:44:30,834 --> 00:44:35,734

You could also just learn some

representation of both individually, then

728

00:44:35,734 --> 00:44:38,094

concatenate them and feed them to the

neural network.

729

00:44:38,094 --> 00:44:42,494

Or you could do very creative and weird

mixes of all those approaches.

730

00:44:42,994 --> 00:44:48,354

And in this paper, we essentially have a

systematic investigation of these

731

00:44:48,354 --> 00:44:49,234

different options.

732

00:44:49,234 --> 00:44:53,166

And we find that the most straightforward

option works the best.

733

00:44:53,166 --> 00:44:59,906

overall, and that's just learning

fixed-size embeddings of your data sources

734

00:44:59,906 --> 00:45:03,266

individually, and then just concatenating

them.

735

00:45:03,326 --> 00:45:07,616

It turns out then we can use information

from both sources in an efficient way,

736

00:45:07,616 --> 00:45:09,946

even though we're doing inference with

neural networks.
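
[A minimal sketch of that winning scheme in Keras, with all layer sizes invented: one summary network per source, then a plain concatenation of the fixed-size embeddings:]

```python
import keras
from keras import layers

rt_in = keras.Input(shape=(100, 1))   # 100 decision-making trials
eeg_in = keras.Input(shape=(100, 8))  # 100 trials x 8 EEG channels

# Learn a fixed-size embedding of each source individually...
rt_emb = layers.GlobalAveragePooling1D()(layers.Dense(32, activation="relu")(rt_in))
eeg_emb = layers.GlobalAveragePooling1D()(layers.Dense(32, activation="relu")(eeg_in))

# ...then just concatenate them before the inference network.
fused = layers.Concatenate()([rt_emb, eeg_emb])
fusion_net = keras.Model([rt_in, eeg_in], layers.Dense(64, activation="relu")(fused))
```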

737

00:45:10,426 --> 00:45:14,826

And maybe what's interesting for

practitioners is that we can compensate

738

00:45:14,826 --> 00:45:18,056

for missing data in individual sources.

739

00:45:19,758 --> 00:45:25,398

In the paper, we essentially induced

missing data by just taking these EEG data

740

00:45:25,398 --> 00:45:29,218

and decision-making data and just

randomly dropping some of them.

741

00:45:29,598 --> 00:45:33,898

And the neural networks have learned, like

when we do this fusion process, the neural

742

00:45:33,898 --> 00:45:39,398

networks learn to compensate for partial

missingness in both sources.

743

00:45:39,398 --> 00:45:43,238

So if you just remove some of the decision

-making data, the neural networks learn to

744

00:45:43,238 --> 00:45:49,428

use the EEG data to inform your posterior.

745

00:45:49,646 --> 00:45:55,926

Even though the data in one of the

sources are missing, the inference is

746

00:45:55,926 --> 00:45:57,286

pretty robust then.

747

00:45:57,366 --> 00:46:00,186

And again, all this happens without model

refits.

748

00:46:00,666 --> 00:46:03,466

So you would just account for that during

training.

749

00:46:03,466 --> 00:46:08,026

Of course you have to do this like random

dropping of data during the training phase

750

00:46:08,026 --> 00:46:09,426

as well.

751

00:46:09,946 --> 00:46:12,894

And then you can also get it during the

inference phase.
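
[One simple way to implement that training-time augmentation; the paper's exact masking scheme may differ:]

```python
import numpy as np

rng = np.random.default_rng(0)

def randomly_drop(rt_batch, eeg_batch, p=0.2):
    # Zero out a random subset of trials per source during training,
    # so the networks learn to compensate for partial missingness.
    rt_mask = rng.random(rt_batch.shape[:2]) > p
    eeg_mask = rng.random(eeg_batch.shape[:2]) > p
    return rt_batch * rt_mask[..., None], eeg_batch * eeg_mask[..., None]
```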

752

00:46:16,290 --> 00:46:18,950

yeah, that sounds, yeah, that's really

cool.

753

00:46:19,130 --> 00:46:25,970

Maybe, like, this paper is a small

piece in our larger roadmap.

754

00:46:25,970 --> 00:46:29,630

This is essentially taking this amortized

Bayesian inference

755

00:46:31,406 --> 00:46:36,506

up to the level of trustworthiness and

robustness and all these gold standards

756

00:46:36,506 --> 00:46:41,390

that we currently have for likelihood

-based inference in PyMC or Stan.

757

00:46:42,894 --> 00:46:43,314

Yeah.

758

00:46:43,314 --> 00:46:43,654

Yeah.

759

00:46:43,654 --> 00:46:46,714

And there's still a lot of work to do

because of course, like there's no free

760

00:46:46,714 --> 00:46:47,654

lunch.

761

00:46:48,894 --> 00:46:52,144

And of course there are many problems

with trustworthiness.

762

00:46:52,144 --> 00:46:55,174

And that's also one of the reasons why I'm

here with Aki right now.

763

00:46:55,474 --> 00:46:59,714

Because Aki is so great at Bayesian workflow

and trustworthiness, good diagnostics.

764

00:46:59,714 --> 00:47:04,234

That's all, you know, all the things that

we currently still need for trustworthy

765

00:47:04,234 --> 00:47:06,494

amortized Bayesian inference.

766

00:47:09,774 --> 00:47:09,894

Yeah.

767

00:47:09,894 --> 00:47:11,342

So maybe you want to

768

00:47:11,342 --> 00:47:15,282

talk a bit more about that and what you're

doing on that.

769

00:47:15,282 --> 00:47:18,234

That sounds like something very

interesting.

770

00:47:19,726 --> 00:47:29,446

So one huge advantage of an amortized

Bayesian sampler is that evaluations and

771

00:47:29,446 --> 00:47:32,446

diagnostics are extremely cheap.

772

00:47:32,786 --> 00:47:36,105

So for example, there's this gold standard

method that's called simulation-based

773

00:47:36,105 --> 00:47:41,586

calibration, where you would sample from

your model and then like a sample from

774

00:47:41,586 --> 00:47:48,766

your prior predictive space and then refit

your model and look at your coverage, for

775

00:47:48,766 --> 00:47:49,560

instance.

776

00:47:49,774 --> 00:47:56,574

In general, look at the calibration of

your model on this potentially very large

777

00:47:56,574 --> 00:47:58,174

prior predictive space.

778

00:47:58,654 --> 00:48:03,354

So you naturally need many model refits,

but your model is fixed.

779

00:48:03,354 --> 00:48:07,574

So if you do it with MCMC, it's a gold

standard evaluation technique, but it's

780

00:48:07,574 --> 00:48:10,734

very expensive to run, especially if your

model is complex.

781

00:48:11,874 --> 00:48:16,014

Now, if you have an amortized estimator,

simulation-based calibration on thousands

782

00:48:16,014 --> 00:48:18,494

of datasets takes a few seconds.
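
[In NumPy, with hypothetical prior, simulator, and amortized_sampler functions, the SBC loop is just:]

```python
import numpy as np

def sbc_ranks(prior, simulator, amortized_sampler, n_sims=1000, n_draws=100):
    # For a calibrated posterior, the rank of the ground truth among the
    # posterior draws is uniform on {0, ..., n_draws}. With an amortized
    # sampler, each "refit" is a cheap forward pass.
    ranks = []
    for _ in range(n_sims):
        theta_true = prior()                   # ground-truth parameters
        x = simulator(theta_true)              # one synthetic data set
        draws = amortized_sampler(x, n_draws)  # shape (n_draws, dim)
        ranks.append(int((draws[:, 0] < theta_true[0]).sum()))
    return np.array(ranks)                     # then check for uniformity
```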

783

00:48:21,134 --> 00:48:25,434

So essentially, and that's my goal for

this research visit with Aki here in

784

00:48:25,434 --> 00:48:32,654

Finland, is trying to figure out what are

some diagnostics that are gold standard,

785

00:48:32,654 --> 00:48:37,193

but potentially very expensive, up to a

point where it's infeasible to run on a

786

00:48:37,193 --> 00:48:39,054

larger scale with MCMC.

787

00:48:39,834 --> 00:48:42,814

But we can easily do it with an amortized

estimator.

788

00:48:43,214 --> 00:48:48,184

With the goal of figuring out, like, can

we trust this estimator?

789

00:48:48,184 --> 00:48:49,326

Yes or no?

790

00:48:49,326 --> 00:48:54,886

It's like, as you might know from neural

networks, we just have no idea what's

791

00:48:54,886 --> 00:48:57,666

happening inside their neural network.

792

00:48:57,666 --> 00:49:03,426

And so we currently don't have these

strong diagnostics that we have for MCMC.

793

00:49:03,706 --> 00:49:05,266

Like for example, R-hat.

794

00:49:05,266 --> 00:49:08,326

There's no comparable thing for neural

networks.

795

00:49:10,094 --> 00:49:20,194

So one of my goals here is to come up with

more good diagnostics that are either

796

00:49:21,254 --> 00:49:26,134

possible with MCMC, but very expensive so

we don't run them, but they would be very

797

00:49:26,134 --> 00:49:28,354

cheap with an amortized estimator.

798

00:49:28,354 --> 00:49:35,214

Or the second thing just specific to an

amortized estimator, just like R-hat is

799

00:49:35,214 --> 00:49:36,934

specific to MCMC.

800

00:49:37,894 --> 00:49:39,034

Okay.

801

00:49:39,034 --> 00:49:40,192

Yeah, I see.

802

00:49:40,270 --> 00:49:42,850

Yeah, that makes tons of sense.

803

00:49:42,850 --> 00:49:43,950

as well.

804

00:49:44,130 --> 00:49:50,430

And actually, so I would have more

technical questions on these, but I see

805

00:49:50,430 --> 00:49:52,630

the time running out.

806

00:49:53,150 --> 00:49:59,690

I think something I'm mainly curious about

is the challenges, the biggest challenges

807

00:49:59,690 --> 00:50:05,370

you face when applying amortized Bayesian

inference and deep fusion techniques in your

808

00:50:05,370 --> 00:50:08,526

projects, but also like in the projects

you see.

809

00:50:08,526 --> 00:50:13,426

I think that's going to also give a sense

to listeners of when and where to use

810

00:50:13,426 --> 00:50:14,986

these kinds of methods.

811

00:50:14,986 --> 00:50:17,406

That's a great question.

812

00:50:17,486 --> 00:50:20,946

And I'm more than happy to talk about all

these challenges that we have because

813

00:50:20,946 --> 00:50:24,246

there's so much room for improvement

because like these amortized methods, they

814

00:50:24,246 --> 00:50:29,206

have so much potential, but we still have

a long way to go until they are as usable

815

00:50:29,206 --> 00:50:33,306

and as straightforward to use as current

MCMC samplers.

816

00:50:34,066 --> 00:50:37,742

And in general, one challenge for

practitioners,

817

00:50:37,742 --> 00:50:45,202

is that we have most of the problems and

hardships that we have in PyMC or Stan.

818

00:50:45,202 --> 00:50:51,042

And that is that researchers have to think

about their model in a probabilistic way,

819

00:50:51,042 --> 00:50:53,522

in a mechanistic way.

820

00:50:53,782 --> 00:50:59,702

So instead of just saying, hey, I click on

t-test or linear regression in some

821

00:50:59,702 --> 00:51:03,002

graphical user interface, they actually

have to come up with a data generating

822

00:51:03,002 --> 00:51:04,242

process

823

00:51:05,166 --> 00:51:06,646

and have to specify their model.

824

00:51:06,646 --> 00:51:13,646

And this whole topic of model

specification is just the same in

825

00:51:13,646 --> 00:51:18,106

an amortized workflow because somehow we

need to specify the Bayesian model.

826

00:51:18,746 --> 00:51:23,326

And now on top of all this, we have a huge

additional layer of complexity and this is

827

00:51:23,326 --> 00:51:25,266

defining the neural networks.

828

00:51:26,246 --> 00:51:30,346

In amortized Bayesian inference, nowadays

we have two neural networks.

829

00:51:30,346 --> 00:51:33,090

The first one is a so-called summary

network,

830

00:51:33,166 --> 00:51:38,326

which essentially learns a latent

embedding of the data set.

831

00:51:38,746 --> 00:51:45,906

Essentially those are like optimal learned

summary statistics and optimal doesn't

832

00:51:45,906 --> 00:51:50,446

mean that they have to be optimal to

reconstruct the data, but instead optimal

833

00:51:50,446 --> 00:51:53,442

means they're optimal to inform the

posterior.

834

00:51:55,758 --> 00:52:00,638

for example, in a very, very simple toy

model, if you have just like a Gaussian

835

00:52:00,638 --> 00:52:04,090

model and you just want to perform

inference on the mean.

836

00:52:06,190 --> 00:52:11,350

then a sufficient summary statistic for

posterior inference on the mean would be

837

00:52:11,350 --> 00:52:12,450

the mean.

838

00:52:12,990 --> 00:52:17,770

Because that's all you need to reconstruct

the mean.

839

00:52:18,630 --> 00:52:21,010

It sounds very tautological, but yeah.

840

00:52:21,010 --> 00:52:25,250

Then again, the mean is obviously not

enough to reconstruct the data because all

841

00:52:25,250 --> 00:52:27,190

the variance information is missing.

842

00:52:27,610 --> 00:52:32,050

What the summary network learns is

something like the mean.

843

00:52:32,050 --> 00:52:35,328

So summary statistics that are optimal for

posterior inference.
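
[A quick concrete check of that toy case, with a unit-variance Gaussian likelihood and a standard normal prior on the mean: the conjugate posterior depends on the data only through the sample mean.]

```python
import numpy as np

x = np.random.default_rng(4).normal(1.3, 1.0, size=50)
n, xbar = len(x), x.mean()

# Posterior for mu with prior N(0, 1) and likelihood N(mu, 1):
post_mean = n * xbar / (n + 1)  # only xbar enters, not the raw data
post_var = 1.0 / (n + 1)
print(post_mean, post_var)
```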

844

00:52:36,078 --> 00:52:38,778

And then the second network is the actual

generative neural network.

845

00:52:38,778 --> 00:52:43,078

So like a normalizing flow, score-based

diffusion model, consistency model, flow

846

00:52:43,078 --> 00:52:46,738

matching, whatever conditional generative

model you want.

847

00:52:46,918 --> 00:52:50,978

And this will handle the sampling from the

posterior.

848

00:52:51,178 --> 00:52:54,338

And these two networks are learned end to

end.

849

00:52:54,978 --> 00:53:01,178

So you would learn your summary statistic,

output it, feed it into the posterior

850

00:53:01,178 --> 00:53:04,398

network, the generative model, and then

have one

851

00:53:04,398 --> 00:53:07,814

evaluation of the loss function, optimize

both end to end.

852

00:53:09,742 --> 00:53:14,122

And so we have two neural networks, long

story short, which is substantially harder

853

00:53:14,122 --> 00:53:18,502

than just hitting, like, sample on a PyMC or

Stan program.

854

00:53:18,502 --> 00:53:21,242

And that's an additional hardship for

practitioners.

855

00:53:21,262 --> 00:53:26,222

Now in BayesFlow, what we do is we provide

sensible default values for the generative

856

00:53:26,222 --> 00:53:30,962

neural networks, which work in maybe like

80 or 90% of the cases.

857

00:53:30,962 --> 00:53:36,862

It's just sufficient to have, for example,

like a neural spline flow, like some sort of

858

00:53:36,862 --> 00:53:38,862

normalizing flow with, I don't know, like,

859

00:53:38,862 --> 00:53:44,342

six layers and a certain number of units,

some regularization for robustness and,

860

00:53:44,342 --> 00:53:47,482

you know, cosine decay of the learning

rates, and all these machine learning

861

00:53:47,482 --> 00:53:51,762

parts, we try to take them away from the

user if they don't want to mess with it.
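
[For flavor, here is one of those machine-learning parts, a cosine learning-rate decay, in Keras. The values are invented, and BayesFlow's actual defaults may differ:]

```python
import keras

schedule = keras.optimizers.schedules.CosineDecay(
    initial_learning_rate=1e-3,  # start here...
    decay_steps=10_000,          # ...and decay toward zero over this many steps
)
optimizer = keras.optimizers.Adam(learning_rate=schedule)
```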

862

00:53:52,262 --> 00:53:56,642

But still, if things don't work, they

would need to somehow diagnose the

863

00:53:56,642 --> 00:53:59,782

problems and then, you know, play with the

number of layers and this neural network

864

00:53:59,782 --> 00:54:00,842

architecture.

865

00:54:01,282 --> 00:54:04,782

And then for the summary network, the

summary network essentially needs to be

866

00:54:04,782 --> 00:54:06,112

informed by the data.

867

00:54:06,112 --> 00:54:08,750

So if you have time series, you would

868

00:54:08,750 --> 00:54:10,930

look at something like an LSTM.

869

00:54:10,930 --> 00:54:16,769

So these, like, long short-term memory time

series neural networks.

870

00:54:16,769 --> 00:54:20,650

Or you would have like a recurrent neural

network or nowadays a time series

871

00:54:20,650 --> 00:54:21,510

transformer.

872

00:54:21,510 --> 00:54:23,770

They're also called temporal fusion

transformers.

873

00:54:23,770 --> 00:54:28,030

If you have IID data, you would have

something like a deep set or a set

874

00:54:28,030 --> 00:54:33,830

transformer, which respects this

exchangeable structure of the data.
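
[Sketches of those two kinds of summary networks in Keras, with sizes illustrative: a recurrent encoder for time series, and a permutation-invariant deep-set-style encoder for IID data.]

```python
import keras
from keras import layers

# Time series -> LSTM -> fixed-size learned summary statistics.
ts_summary = keras.Sequential([layers.LSTM(64), layers.Dense(20)])

# IID data -> encode each observation, then pool with a symmetric
# function (the mean), so the output respects exchangeability.
iid_in = keras.Input(shape=(None, 3))  # a set of 3-dimensional observations
encoded = layers.Dense(64, activation="relu")(iid_in)
pooled = layers.GlobalAveragePooling1D()(encoded)
iid_summary = keras.Model(iid_in, layers.Dense(20)(pooled))
```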

875

00:54:33,830 --> 00:54:37,670

So again, we can give all the

recommendations and sensible default

876

00:54:37,670 --> 00:54:38,542

values, like:

877

00:54:38,542 --> 00:54:41,422

If you have a time series, try a time

series transformer.

878

00:54:41,422 --> 00:54:44,922

Then again, if things don't work out,

users need to play around with these

879

00:54:44,922 --> 00:54:45,702

settings.

880

00:54:45,702 --> 00:54:49,082

So that's definitely one hardship of

amortized Bayesian inference in general.

881

00:54:50,082 --> 00:54:54,582

And for the second part of your question,

hardships of this deep fusion.

882

00:54:54,582 --> 00:55:02,002

It's essentially if you have more and more

information sources, then things can get

883

00:55:02,002 --> 00:55:03,622

very complicated.

884

00:55:04,022 --> 00:55:08,526

For example, just a few days ago, we discussed

a

885

00:55:08,526 --> 00:55:18,526

case where someone has 60 different

sources of information and they're all

886

00:55:18,526 --> 00:55:20,292

streams of time series.

887

00:55:21,902 --> 00:55:27,482

Now we could say, hey, just slap 60

summary networks on this problem, like one

888

00:55:27,482 --> 00:55:29,862

summary network for each domain.

889

00:55:30,062 --> 00:55:34,982

That's going to be very complex and very

hard to train, especially if we don't

890

00:55:34,982 --> 00:55:39,342

bring that many data sets to the table for

the neural network training.

891

00:55:39,922 --> 00:55:43,282

And so there we somehow need to find a

compromise.

892

00:55:43,282 --> 00:55:49,594

Okay, what information can we condense and

group together?

893

00:55:49,646 --> 00:55:54,926

So maybe some of the time series sources

are somewhat similar and actually

894

00:55:54,926 --> 00:55:56,566

compatible with each other.

895

00:55:56,566 --> 00:56:01,746

So we could, for example, come up with six

groups of 10 time series each.

896

00:56:01,746 --> 00:56:05,226

Then we would only need six neural

networks for the summary embeddings and

897

00:56:05,226 --> 00:56:07,586

all these practical considerations.

898

00:56:07,586 --> 00:56:12,626

That makes things just like as hard as in

likelihood-based, MCMC-based inference, but

899

00:56:12,626 --> 00:56:16,064

just a bit harder because of all the

neural network stuff that's happening.

900

00:56:17,966 --> 00:56:20,086

Did this address your question?

901

00:56:20,826 --> 00:56:21,326

Yeah.

902

00:56:21,326 --> 00:56:22,046

Yeah.

903

00:56:22,046 --> 00:56:25,546

It gives me more questions, but yeah, for

sure.

904

00:56:26,166 --> 00:56:28,286

That does answer the question.

905

00:56:28,366 --> 00:56:32,586

When you're talking about transformer for

time series, are you talking about the

906

00:56:32,586 --> 00:56:38,626

transformers, the neural network that's

used in large language models or is it

907

00:56:38,626 --> 00:56:39,506

something else?

908

00:56:39,506 --> 00:56:47,822

It's essentially the same, but slightly

adjusted for time series so that the...

909

00:56:47,822 --> 00:56:53,682

statistics or these latent embeddings that

you output still respect the time series

910

00:56:53,682 --> 00:56:58,002

structure where typically you would have

this autoregressive structure.

911

00:56:58,482 --> 00:57:07,581

So it's not exactly the same like standard

transformer, but you would just enrich it

912

00:57:07,581 --> 00:57:11,002

to respect the probabilistic structure in

your data.

913

00:57:11,002 --> 00:57:12,782

But at the core, it's just the same.

914

00:57:12,782 --> 00:57:16,062

So at the core, it's an attention

mechanism, like multi-head attention,

915

00:57:16,062 --> 00:57:17,294

where

916

00:57:17,294 --> 00:57:23,594

like the different parts of your dataset

could essentially talk or listen to each

917

00:57:23,594 --> 00:57:24,262

other.

918

00:57:26,318 --> 00:57:28,158

So it's just the same.
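
[At its core, then, something like this: a single self-attention block over time steps, with sizes invented:]

```python
import keras
from keras import layers

x = keras.Input(shape=(None, 16))  # (time steps, features), variable length
# Each time step attends to ("listens to") every other time step.
attended = layers.MultiHeadAttention(num_heads=4, key_dim=16)(x, x)
pooled = layers.GlobalAveragePooling1D()(attended)
encoder = keras.Model(x, layers.Dense(20)(pooled))
```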

919

00:57:28,898 --> 00:57:29,788

Okay.

920

00:57:29,788 --> 00:57:30,708

Yeah, that's interesting.

921

00:57:30,708 --> 00:57:33,418

I didn't know that existed for time

series.

922

00:57:33,838 --> 00:57:35,118

That's interesting.

923

00:57:35,118 --> 00:57:43,398

That means, so with the transformer,

one of the main things is you

924

00:57:43,398 --> 00:57:45,898

have to tokenize the inputs.

925

00:57:45,898 --> 00:57:46,088

Right?

926

00:57:46,088 --> 00:57:50,858

So here, would you tokenize, like, is there

a tokenization happening of the time

927

00:57:50,858 --> 00:57:51,998

series data?

928

00:57:51,998 --> 00:57:56,486

You don't have to tokenize here because

the reason why you have to tokenize

929

00:57:56,822 --> 00:58:02,622

in large language models or natural

language processing in general is that you

930

00:58:02,622 --> 00:58:07,082

want to somehow encode your characters or

your words

931

00:58:07,082 --> 00:58:13,822

into numbers essentially, and

we don't need that in Bayesian inference

932

00:58:13,822 --> 00:58:20,942

in general because we already have numbers

Yeah. So our data already comes in numbers,

933

00:58:20,942 --> 00:58:22,722

so we don't need tokenization here.

934

00:58:22,722 --> 00:58:24,622

Of course if we had text data

935

00:58:24,622 --> 00:58:26,622

Then we would need tokenization.

936

00:58:26,922 --> 00:58:27,242

Yeah.

937

00:58:27,242 --> 00:58:27,602

Yeah.

938

00:58:27,602 --> 00:58:28,322

Yeah.

939

00:58:28,322 --> 00:58:28,752

OK.

940

00:58:28,752 --> 00:58:29,742

OK.

941

00:58:29,742 --> 00:58:31,962

Yeah, it makes more sense to me.

942

00:58:32,302 --> 00:58:33,322

All right, that's fun.

943

00:58:33,322 --> 00:58:34,492

I didn't know that existed.

944

00:58:34,492 --> 00:58:39,322

Do you have any resources about

transformers for time series that we could

945

00:58:39,322 --> 00:58:40,581

put in the show notes?

946

00:58:40,581 --> 00:58:41,232

Absolutely.

947

00:58:41,232 --> 00:58:44,862

There is a paper that's called Temporal

Fusion Transformers, I think.

948

00:58:44,862 --> 00:58:46,762

I will send you the link.

949

00:58:47,082 --> 00:58:47,902

yeah.

950

00:58:47,902 --> 00:58:48,662

Awesome.

951

00:58:48,662 --> 00:58:49,710

Yeah, thanks.

952

00:58:49,710 --> 00:58:50,730

Definitely.

953

00:58:50,750 --> 00:58:53,930

We have this time series transformer,

temporal fusion transformer, implemented

954

00:58:53,930 --> 00:58:55,130

in BayesFlow.

955

00:58:55,130 --> 00:59:01,170

So now it's just like a very usable

interface where you would just input your

956

00:59:01,170 --> 00:59:04,930

data and then you get your latent

embeddings.

957

00:59:05,490 --> 00:59:10,310

You can say like, I want to input my data

and I want as an output 20 learned summary

958

00:59:10,310 --> 00:59:11,070

statistics.

959

00:59:11,070 --> 00:59:13,110

So that's all you need to do there.

960

00:59:13,330 --> 00:59:14,580

Okay.

961

00:59:14,580 --> 00:59:15,570

And you can go crazy.

962

00:59:15,570 --> 00:59:17,170

So what would you do with it?

963

00:59:17,170 --> 00:59:17,794

Good.

964

00:59:19,054 --> 00:59:22,524

Yeah, what would you do with these

results?

965

00:59:22,524 --> 00:59:27,674

Basically the outputs of the transformer,

what would you use that for?

966

00:59:27,754 --> 00:59:30,288

Those are the learned summary statistics.

967

00:59:32,878 --> 00:59:38,958

That you would then treat as a compressed

fixed-length version of your data for the

968

00:59:38,958 --> 00:59:41,798

posterior network for this generative

model.

969

00:59:42,558 --> 00:59:47,198

So then you use that afterwards in the

model?

970

00:59:47,258 --> 00:59:48,198

Exactly.

971

00:59:48,198 --> 00:59:48,857

Yeah.

972

00:59:48,857 --> 00:59:54,378

So the transformer is just used to learn

summary statistics of the data sets that

973

00:59:54,378 --> 00:59:55,498

we input.

974

00:59:55,578 --> 01:00:01,038

For instance, if you have time series,

like we did this for COVID time series.

975

01:00:01,038 --> 01:00:02,862

If you have a COVID time series,

976

01:00:02,862 --> 01:00:08,602

worth, like, a three-year period with

daily reporting, you would have a

977

01:00:08,602 --> 01:00:11,501

time series with about a thousand time

steps.

978

01:00:11,662 --> 01:00:16,742

That's quite long to pass into a

neural network as a condition.

979

01:00:16,742 --> 01:00:24,322

And also like if now you don't have a

thousand days, but a thousand and one

980

01:00:24,322 --> 01:00:28,682

days, then the length of your input to the

neural network would change and your

981

01:00:28,682 --> 01:00:31,014

neural network couldn't handle that.

982

01:00:31,598 --> 01:00:35,617

So what you do with a time series

transformer is compress this time series

983

01:00:35,617 --> 01:00:45,118

of maybe 1,000 or maybe 1,050 time steps

into a fixed length vector of summary

984

01:00:45,118 --> 01:00:45,918

statistics.

985

01:00:45,918 --> 01:00:49,318

Maybe you extract 200 summary statistics

from that.
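
[So the summary network's contract is variable-length in, fixed-length out. For instance, with a GRU as a stand-in for the actual temporal fusion transformer:]

```python
import keras
from keras import layers

series = keras.Input(shape=(None, 1))  # 1,000 or 1,050 daily counts, either works
h = layers.GRU(128)(series)            # final hidden state, always the same size
summary_net = keras.Model(series, layers.Dense(200)(h))  # 200 summary statistics
```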

986

01:00:51,630 --> 01:00:53,290

Hey, okay, I see.

987

01:00:53,290 --> 01:01:04,830

And then you can use that in your neural

network, in the model that's going to be

988

01:01:04,830 --> 01:01:06,710

sampling your model.

989

01:01:06,710 --> 01:01:08,810

In the neural network that's going to be

sampling your model.

990

01:01:08,810 --> 01:01:13,250

We already see that we're heavily

overloading terminology here.

991

01:01:13,250 --> 01:01:15,090

So what's a model actually?

992

01:01:15,090 --> 01:01:17,650

So then we have to differentiate between

the actual Bayesian model that we're

993

01:01:17,650 --> 01:01:18,382

trying to fit.

994

01:01:18,382 --> 01:01:21,542

And then the neural network, the

generative model or generative neural

995

01:01:21,542 --> 01:01:25,062

network that we're using as a replacement

for MCMC.

996

01:01:25,062 --> 01:01:30,222

So it's, it's a lot of this taxonomy

that's, that's odd when you're at the

997

01:01:30,222 --> 01:01:32,302

interface of deep learning and statistics.

998

01:01:32,462 --> 01:01:35,262

Another one of those hiccups are

parameters.

999

01:01:35,402 --> 01:01:38,582

Like in Bayesian inference, parameters are

your inference targets.

Speaker:

01:01:38,582 --> 01:01:43,222

So you want posterior distributions on a

handful of model parameters.

Speaker:

01:01:43,562 --> 01:01:47,252

When you talk to people from deep learning

about parameters,

Speaker:

01:01:47,982 --> 01:01:50,662

they understand the neural network

weights.

Speaker:

01:01:51,682 --> 01:01:58,162

So sometimes you have to be careful with

the, I have to be careful with the

Speaker:

01:01:58,162 --> 01:02:03,682

terminology and words used to describe

things because we have different types of

Speaker:

01:02:03,682 --> 01:02:08,122

people going on different levels of

abstraction here in different functions.

Speaker:

01:02:08,522 --> 01:02:08,622

Yeah.

Speaker:

01:02:08,622 --> 01:02:10,262

Yeah, exactly.

Speaker:

01:02:10,262 --> 01:02:15,762

So that means in this case, the

transformer takes in time values and

Speaker:

01:02:15,762 --> 01:02:17,594

summarizes them.

Speaker:

01:02:17,742 --> 01:02:22,222

And it passes that on to the neural

network that's going to be used to sample

Speaker:

01:02:22,222 --> 01:02:23,582

the Bayesian model.

Speaker:

01:02:23,582 --> 01:02:24,422

Exactly.

Speaker:

01:02:24,422 --> 01:02:30,722

And they are passed in as the conditions,

like conditional probability, which

Speaker:

01:02:30,722 --> 01:02:36,282

totally makes sense because like this

generative neural network, it learns the

Speaker:

01:02:36,282 --> 01:02:41,182

distribution of parameters conditional on

the data or summary statistics of the

Speaker:

01:02:41,182 --> 01:02:41,992

data.

Speaker:

01:02:43,534 --> 01:02:47,034

So that's the exact definition of the

Bayesian posterior distribution.

Speaker:

01:02:47,034 --> 01:02:53,034

Like a distribution of the Bayesian model

parameters conditional on the data.

Speaker:

01:02:53,414 --> 01:02:55,650

It's the exact definition of the

posterior.
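
[In symbols, with the summary network written as s and the generative network as q, notation ours rather than from the episode:]

```latex
q_\phi\bigl(\theta \mid s_\psi(x)\bigr) \;\approx\; p(\theta \mid x)
  \;=\; \frac{p(x \mid \theta)\,p(\theta)}{p(x)}
```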

Speaker:

01:03:00,110 --> 01:03:01,510

Yeah, I see.

Speaker:

01:03:02,090 --> 01:03:04,198

And that means...

Speaker:

01:03:05,806 --> 01:03:10,986

So in this case, yeah, no, I think my

question was going to be, so why would you

Speaker:

01:03:10,986 --> 01:03:16,046

use these kind of additional layer on the

time series data?

Speaker:

01:03:16,046 --> 01:03:17,436

But you have to answer that.

Speaker:

01:03:17,436 --> 01:03:21,646

Is that, well, what if your time series

data is too big or something like that?

Speaker:

01:03:21,666 --> 01:03:22,326

Exactly.

Speaker:

01:03:22,326 --> 01:03:26,686

It's not just being too big, but also just

a variable length.

Speaker:

01:03:26,686 --> 01:03:31,666

Because the neural network, like the

generative neural network, it always wants

Speaker:

01:03:31,666 --> 01:03:33,556

fixed length inputs.

Speaker:

01:03:33,678 --> 01:03:38,498

Like it can only handle, in this case of

the COVID model, it could only handle

Speaker:

01:03:38,498 --> 01:03:41,598

input conditions with length 200.

Speaker:

01:03:42,238 --> 01:03:47,218

And now the time series transformer takes

part, so the time series transformer

Speaker:

01:03:47,218 --> 01:03:52,898

handles the part that our actual raw data

have variable length.

Speaker:

01:03:53,498 --> 01:03:58,178

And time series transformers can handle

data of variable length.

Speaker:

01:03:58,178 --> 01:04:03,374

So they would, you know, just take a time

series of length,

Speaker:

01:04:03,374 --> 01:04:09,614

maybe 500 time steps to 2000 time steps,

and then always compress it to 200 summary

Speaker:

01:04:09,614 --> 01:04:10,714

statistics.

Speaker:

01:04:11,154 --> 01:04:17,954

So this generative neural network, which

is much more strict about the shapes and

Speaker:

01:04:17,954 --> 01:04:23,470

form of the input data, will always see

the same length inputs.

Speaker:

01:04:25,166 --> 01:04:25,906

Yeah.

Speaker:

01:04:25,906 --> 01:04:26,296

Okay.

Speaker:

01:04:26,296 --> 01:04:27,306

Yeah, I see.

Speaker:

01:04:27,306 --> 01:04:28,886

That makes sense.

Speaker:

01:04:29,386 --> 01:04:29,746

Awesome.

Speaker:

01:04:29,746 --> 01:04:30,426

Yeah, super cool.

Speaker:

01:04:30,426 --> 01:04:36,446

And so as you were saying, this is already

available in BayesFlow, people can use

Speaker:

01:04:36,446 --> 01:04:38,806

this kind of transformer for time series.

Speaker:

01:04:38,806 --> 01:04:39,616

Yeah, absolutely.

Speaker:

01:04:39,616 --> 01:04:41,145

For time series and also for sets.

Speaker:

01:04:41,145 --> 01:04:42,426

So for IID data.

Speaker:

01:04:42,426 --> 01:04:42,746

Yeah.

Speaker:

01:04:42,746 --> 01:04:49,946

Because if you just take

an IID data set and input it into a neural

Speaker:

01:04:49,946 --> 01:04:53,756

network, the neural network doesn't know

that your observations are exchangeable.

Speaker:

01:04:54,094 --> 01:04:59,414

So it will assume much more structure than

there actually is in your data.

Speaker:

01:05:00,374 --> 01:05:06,674

So again, it has a double function, like a

dual function of like compressing data,

Speaker:

01:05:07,034 --> 01:05:11,814

encoding the probabilistic structure of

the data, and also outputting a fixed

Speaker:

01:05:11,814 --> 01:05:13,074

representation.

Speaker:

01:05:13,754 --> 01:05:17,014

So this would be a set transformer or deep

set is another option.

Speaker:

01:05:17,014 --> 01:05:19,714

It's also implemented in BayesFlow.

Speaker:

01:05:20,254 --> 01:05:21,614

Super cool.

Speaker:

01:05:21,614 --> 01:05:22,734

Yeah.

Speaker:

01:05:23,470 --> 01:05:29,150

And so let's start winding down here

because I've already taken a lot of your

Speaker:

01:05:29,150 --> 01:05:30,190

time.

Speaker:

01:05:31,090 --> 01:05:39,090

Maybe a last few questions would be what

are some emerging topics that you see

Speaker:

01:05:39,090 --> 01:05:43,050

within deep learning and probabilistic

machine learning that you find

Speaker:

01:05:43,050 --> 01:05:44,290

particularly intriguing?

Speaker:

01:05:44,290 --> 01:05:49,150

Because we've talked here a lot about

really the nitty-gritty, the statistical

Speaker:

01:05:49,150 --> 01:05:49,998

detail.

Speaker:

01:05:49,998 --> 01:05:55,378

And so on, but now if we do zoom a bit and

we start thinking more long-term.

Speaker:

01:05:55,858 --> 01:05:56,178

Yeah.

Speaker:

01:05:56,178 --> 01:05:59,838

I'm very excited about two large topics.

Speaker:

01:05:59,838 --> 01:06:05,798

The first one are generative models that

are very expressive.

Speaker:

01:06:05,798 --> 01:06:10,518

So unconstrained neural network

architectures, but at the same time have a

Speaker:

01:06:10,518 --> 01:06:11,978

one-step inference.

Speaker:

01:06:13,518 --> 01:06:18,158

So for example, people have been using

score-based diffusion models a lot, or

Speaker:

01:06:18,158 --> 01:06:18,766

flow matching,

Speaker:

01:06:18,766 --> 01:06:21,766

for image generation, like for example,

Stable Diffusion.

Speaker:

01:06:21,766 --> 01:06:24,766

You might be familiar with this tool to

generate like, you know, input a text

Speaker:

01:06:24,766 --> 01:06:27,786

prompt and then you get fantastic images.

Speaker:

01:06:28,046 --> 01:06:30,246

Now this takes quite some time.

Speaker:

01:06:30,246 --> 01:06:33,926

So like a few seconds for each image, but

only because it runs on a fancy cluster.

Speaker:

01:06:33,926 --> 01:06:36,486

If you run it locally on a computer, it

takes much longer.

Speaker:

01:06:36,486 --> 01:06:42,086

And that's because the score-based diffusion

model needs many discretization steps in

Speaker:

01:06:42,086 --> 01:06:48,166

denoising, in this denoising process

during inference time.

Speaker:

01:06:49,166 --> 01:06:53,646

And now there's, like, throughout the last

year, there have been a few attempts at

Speaker:

01:06:53,646 --> 01:06:57,066

having these very expressive and super

powerful neural networks.

Speaker:

01:06:57,726 --> 01:07:00,646

But they are much, much faster because

they don't have these many denoising

Speaker:

01:07:00,646 --> 01:07:01,066

steps.

Speaker:

01:07:01,066 --> 01:07:04,026

Instead, they directly learn a one-step

inference.

Speaker:

01:07:04,026 --> 01:07:09,486

So they could generate an image not like a

thousand steps, but only in one step.

Speaker:

01:07:10,166 --> 01:07:14,426

And that's very cutting edge or bleeding

edge, if you will, because they don't work

Speaker:

01:07:14,426 --> 01:07:15,526

that great yet.

Speaker:

01:07:15,526 --> 01:07:17,838

But I think there's much potential in

there.

Speaker:

01:07:17,838 --> 01:07:21,018

It's both expressive and fast.

Speaker:

01:07:21,118 --> 01:07:25,038

And then again, we've used some of those

for amortized Bayesian inference.

Speaker:

01:07:25,038 --> 01:07:30,878

So we use consistency models and they have

super high potential in my opinion.

Speaker:

01:07:31,098 --> 01:07:35,858

So, you know, with these advances in deep

learning, we can always, oftentimes we can

Speaker:

01:07:35,858 --> 01:07:38,218

use them for amortized Bayesian inference.

Speaker:

01:07:38,218 --> 01:07:42,218

We just like reformulate these generative

models and slightly tune them to our

Speaker:

01:07:42,218 --> 01:07:43,118

tasks.

Speaker:

01:07:43,318 --> 01:07:44,878

So I'm very excited about this.

Speaker:

01:07:44,878 --> 01:07:48,218

And the second area I'm very excited about

are foundation models.

Speaker:

01:07:48,218 --> 01:07:51,278

I guess most people are in AI these days.

Speaker:

01:07:51,278 --> 01:07:56,658

So foundation models essentially means

neural networks are very good at

Speaker:

01:07:56,658 --> 01:07:57,978

in-distribution tasks.

Speaker:

01:07:57,978 --> 01:08:04,458

So whatever is in the training data set,

neural networks are typically very good at

Speaker:

01:08:05,838 --> 01:08:09,658

finding patterns that are similar to the

training set, what they saw in the

Speaker:

01:08:09,658 --> 01:08:10,604

training set.

Speaker:

01:08:10,862 --> 01:08:15,052

Now in the open world, so if we are out of

distribution, we have a domain shift,

Speaker:

01:08:15,052 --> 01:08:17,822

distribution shift, model

mis-specification, however you want to call

Speaker:

01:08:17,822 --> 01:08:21,522

it, neural networks typically aren't that

good.

Speaker:

01:08:21,722 --> 01:08:27,641

So what we could do is either make them

slightly better at out of distribution, or

Speaker:

01:08:27,641 --> 01:08:32,962

we just extend the in-distribution to a

huge space.

Speaker:

01:08:32,962 --> 01:08:35,202

And that's what foundation models do.

Speaker:

01:08:36,042 --> 01:08:40,654

For example, GPT-4 would be a foundation

model.

Speaker:

01:08:40,654 --> 01:08:44,074

because it's just trained on so much data.

Speaker:

01:08:44,393 --> 01:08:49,314

I don't know how much; it's not terabytes

anymore.

Speaker:

01:08:49,314 --> 01:08:51,814

It's like, like essentially the entire

internet.

Speaker:

01:08:51,814 --> 01:08:54,534

So it's just a huge training set.

Speaker:

01:08:54,714 --> 01:08:58,314

And so the world and the training set that

this neural network has been trained on is

Speaker:

01:08:58,314 --> 01:08:59,514

just huge.

Speaker:

01:08:59,514 --> 01:09:04,614

And so essentially we don't really have

out of distribution cases anymore, just

Speaker:

01:09:04,614 --> 01:09:06,474

because our training set is so huge.

Speaker:

01:09:06,654 --> 01:09:10,350

And that's also one area that could be

very useful for

Speaker:

01:09:10,350 --> 01:09:14,950

amortized Bayesian inference and to

overcome the very initial shortcoming that

Speaker:

01:09:14,950 --> 01:09:20,090

you talked about, where we would also like

to amortize over different Bayesian models.

Speaker:

01:09:21,150 --> 01:09:22,250

Hmm.

Speaker:

01:09:23,110 --> 01:09:23,770

I see.

Speaker:

01:09:23,770 --> 01:09:24,810

Yeah, yeah, yeah.

Speaker:

01:09:24,810 --> 01:09:27,130

Yeah, that would definitely be super fun.

Speaker:

01:09:27,490 --> 01:09:34,950

Yeah, I'm really impressed and interested

to see this interaction of, like, deep

Speaker:

01:09:34,950 --> 01:09:39,260

learning, artificial intelligence, and

then the Bayesian.

Speaker:

01:09:39,374 --> 01:09:41,094

framework coming on top of that.

Speaker:

01:09:41,094 --> 01:09:42,974

That is really super cool.

Speaker:

01:09:42,974 --> 01:09:44,334

I love that.

Speaker:

01:09:45,334 --> 01:09:45,434

Yeah.

Speaker:

01:09:45,434 --> 01:09:49,734

Yeah, it makes me super curious to try

that stuff out.

Speaker:

01:09:50,574 --> 01:09:56,794

So to play us out, Marvin, actually, this

is a very active area of research.

Speaker:

01:09:57,614 --> 01:10:03,774

So what advice would you give to beginners

interested in diving into this

Speaker:

01:10:03,774 --> 01:10:07,654

intersection of deep learning and

probabilistic machine learning?

Speaker:

01:10:09,358 --> 01:10:10,938

That's a great question.

Speaker:

01:10:11,958 --> 01:10:14,438

Essentially, I would have two

recommendations.

Speaker:

01:10:14,438 --> 01:10:21,018

The first one is to really try to simulate

stuff.

Speaker:

01:10:21,018 --> 01:10:28,018

Whatever it is that you are curious about,

just try to write a simulation program and

Speaker:

01:10:28,018 --> 01:10:32,098

try to simulate some of the data that you

might be interested in.

Speaker:

01:10:32,098 --> 01:10:36,758

So for example, if you're really

interested in soccer, then code up a

Speaker:

01:10:36,758 --> 01:10:37,774

simulation program.

Speaker:

01:10:37,774 --> 01:10:41,774

that just simulates soccer matches and the

outcomes of soccer matches.
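
[In that spirit, a toy soccer simulator can be a handful of lines; the Poisson-scoring assumption here is ours:]

```python
import numpy as np

rng = np.random.default_rng(7)

def simulate_match(attack_home, attack_away, home_advantage=0.3):
    # Goals as Poisson counts with log-linear team strengths.
    goals_home = rng.poisson(np.exp(attack_home + home_advantage))
    goals_away = rng.poisson(np.exp(attack_away))
    return goals_home, goals_away

print(simulate_match(0.1, -0.2))  # e.g., (2, 1)
```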

Speaker:

01:10:42,114 --> 01:10:47,314

So you can really get a feeling of the

data generating processes that are

Speaker:

01:10:47,314 --> 01:10:51,674

happening because probabilistic machine

learning at its very core is all about

Speaker:

01:10:51,674 --> 01:10:55,134

data generating processes and reasoning

about these processes.

Speaker:

01:10:55,774 --> 01:10:59,554

And I think it was Richard Feynman who

said, what I cannot create, I do not

Speaker:

01:10:59,554 --> 01:11:00,774

understand.

Speaker:

01:11:00,774 --> 01:11:04,834

That's essentially at the heart of

simulation-based inference in a more

Speaker:

01:11:04,834 --> 01:11:05,806

narrow setting,

Speaker:

01:11:05,806 --> 01:11:09,046

probabilistic machine

learning more broadly, or science more

Speaker:

01:11:09,046 --> 01:11:14,766

broadly even. So yeah, definitely, like,

simulating and running simulation studies

Speaker:

01:11:14,766 --> 01:11:19,466

can be super helpful both to understand

what's happening in the background and also to

Speaker:

01:11:19,466 --> 01:11:25,706

get a feeling for programming and to get

better at programming as well. Then the

Speaker:

01:11:25,706 --> 01:11:30,766

second advice would be to essentially find

a balance between these hands-on, getting

Speaker:

01:11:30,766 --> 01:11:34,926

your hands dirty type of things, like

implement a model in

Speaker:

01:11:34,926 --> 01:11:40,726

PyTorch or Keras, or solve some Kaggle

tasks, just some machine learning tasks.

Speaker:

01:11:41,306 --> 01:11:47,906

But then at the same time, also finding

this balance to reading books and finding

Speaker:

01:11:47,906 --> 01:11:53,066

new information to make sure that you

actually know what you're doing and also

Speaker:

01:11:53,066 --> 01:11:56,666

know what you don't know and what the next

steps are to get better from the

Speaker:

01:11:56,666 --> 01:11:58,426

theoretical part.

Speaker:

01:11:59,026 --> 01:12:00,846

And there are two books that I can really

recommend.

Speaker:

01:12:00,846 --> 01:12:03,346

The first one is Deep Learning by Ian

Goodfellow.

Speaker:

01:12:03,346 --> 01:12:04,750

It's also available

Speaker:

01:12:04,750 --> 01:12:05,720

for free online.

Speaker:

01:12:05,720 --> 01:12:08,150

You can also link to this in the show

notes.

Speaker:

01:12:08,190 --> 01:12:11,050

It's a great book and it covers so much.

Speaker:

01:12:11,050 --> 01:12:17,030

And then if you come from this Bayesian or

statistics background, you see a lot of

Speaker:

01:12:17,030 --> 01:12:22,830

conditional probabilities in there because

a lot of deep learning is just conditional

Speaker:

01:12:22,830 --> 01:12:24,110

generative modeling.

Speaker:

01:12:24,710 --> 01:12:28,070

And then the second book would in fact be

Statistical Rethinking by Richard

Speaker:

01:12:28,070 --> 01:12:28,540

McAlrath.

Speaker:

01:12:28,540 --> 01:12:33,510

It's a great book and it's not only

limited to Bayesian inference, but more.

Speaker:

01:12:33,614 --> 01:12:35,574

Also a lot of causal inference, of course.

Speaker:

01:12:35,574 --> 01:12:40,694

Also just thinking about probability and

the philosophy behind this whole

Speaker:

01:12:40,694 --> 01:12:42,994

probabilistic modeling topic more broadly.

Speaker:

01:12:43,054 --> 01:12:47,574

So earlier today, I had a chat with one of

the student assistants that I'm

Speaker:

01:12:47,574 --> 01:12:52,374

supervising and he said, Hey Marvin, like

I read statistic rethinking a few weeks

Speaker:

01:12:52,374 --> 01:12:53,114

ago.

Speaker:

01:12:53,114 --> 01:12:56,894

And today I read something about score

-based diffusion models.

Speaker:

01:12:56,894 --> 01:13:00,094

So these like state of the art deep

learning models that are used to generate

Speaker:

01:13:00,094 --> 01:13:00,846

images.

Speaker:

01:13:00,846 --> 01:13:03,896

He said like, because I read statistical

rethinking, it all made sense.

Speaker:

01:13:03,896 --> 01:13:07,966

There's so much probability going on in

these score -based diffusion models.

Speaker:

01:13:07,966 --> 01:13:11,186

And statistical rethinking really helped

me understand that.

Speaker:

01:13:11,726 --> 01:13:16,526

And at first I didn't really, I couldn't

believe it, but it totally makes sense.

Speaker:

01:13:16,526 --> 01:13:20,766

Cause like statistical rethinking is not

just a book about Bayesian workflow and

Speaker:

01:13:20,766 --> 01:13:24,166

Bayesian modeling, but more about, you

know, reasoning about probabilities and

Speaker:

01:13:24,166 --> 01:13:26,506

uncertainty, in a more general way.

Speaker:

01:13:26,506 --> 01:13:27,826

And it's a beautiful book.

Speaker:

01:13:27,826 --> 01:13:29,306

So I'd recommend those.

Speaker:

01:13:30,926 --> 01:13:31,646

Nice.

Speaker:

01:13:31,646 --> 01:13:32,146

Yeah.

Speaker:

01:13:32,146 --> 01:13:37,046

So definitely let's put those two in the

show notes.

Speaker:

01:13:37,046 --> 01:13:38,426

Marvin, I will.

Speaker:

01:13:38,426 --> 01:13:44,326

So of course I've read statistical

rethinking several times, so I definitely

Speaker:

01:13:44,326 --> 01:13:45,406

agree.

Speaker:

01:13:45,566 --> 01:13:50,006

The first one about deep learning, I

haven't yet, but I will definitely read it

Speaker:

01:13:50,006 --> 01:13:52,046

because that sounds really fascinating.

Speaker:

01:13:52,046 --> 01:13:55,526

So really want to get that book.

Speaker:

01:13:56,006 --> 01:13:56,726

Fantastic.

Speaker:

01:13:56,726 --> 01:13:58,096

Well, thanks a lot, Marvin.

Speaker:

01:13:58,096 --> 01:14:00,326

That was really awesome.

Speaker:

01:14:00,878 --> 01:14:02,398

I really learned a lot.

Speaker:

01:14:02,398 --> 01:14:06,438

I'm pretty sure listeners did too, so

that's super fun.

Speaker:

01:14:06,478 --> 01:14:13,718

You definitely need to come back to do a

modeling webinar with us and show us in

Speaker:

01:14:13,718 --> 01:14:17,958

action what we talked about today with the

BayesFlow package.

Speaker:

01:14:17,958 --> 01:14:22,798

It's also, I guess, going to inspire

people to use it and maybe contribute to

Speaker:

01:14:22,798 --> 01:14:23,838

it.

Speaker:

01:14:24,198 --> 01:14:28,718

But before that, of course, I'm going to

ask you the last two questions I ask every

Speaker:

01:14:28,718 --> 01:14:30,620

guest at the end of the show.

Speaker:

01:14:30,638 --> 01:14:35,158

First one, if you had unlimited time and

resources, which problem would you try to

Speaker:

01:14:35,158 --> 01:14:36,038

solve?

Speaker:

01:14:36,318 --> 01:14:40,298

That's a very loaded question because

there's so many very, very important

Speaker:

01:14:40,298 --> 01:14:42,738

problems to solve.

Speaker:

01:14:42,738 --> 01:14:48,038

Like big picture problems, like peace,

world hunger, global warming, all those.

Speaker:

01:14:48,038 --> 01:14:51,658

I'm afraid that, like, with my

background, I don't really know how to

Speaker:

01:14:51,658 --> 01:14:55,578

contribute significantly with a huge

impact to those problems.

Speaker:

01:14:55,578 --> 01:14:59,374

So my consideration is essentially a

trade-off between, like...

Speaker:

01:14:59,374 --> 01:15:03,734

how important is the problem and what

impact does solving the problem or

Speaker:

01:15:03,734 --> 01:15:07,394

addressing the problem have and what

impact could I have on solving the

Speaker:

01:15:07,394 --> 01:15:08,454

problem?

Speaker:

01:15:08,734 --> 01:15:15,054

And so I think what would be very nice is

to make probabilistic inference or

Speaker:

01:15:15,054 --> 01:15:20,094

Bayesian inference in particular, like,

accessible, usable, easy, and fast for

Speaker:

01:15:20,094 --> 01:15:20,934

everyone.

Speaker:

01:15:20,934 --> 01:15:26,634

And that doesn't just mean, you know,

methods or machine learning researchers.

Speaker:

01:15:26,670 --> 01:15:30,670

But essentially means anyone who works

with data in any way.

Speaker:

01:15:30,970 --> 01:15:36,130

And there's so much to do, like the actual

Bayesian model in the background, it could

Speaker:

01:15:36,130 --> 01:15:41,230

be huge, be like a Bayes-GPT, like

ChatGPT, but just for Bayes.

Speaker:

01:15:41,750 --> 01:15:45,810

Just with the sheer scope of amortization,

different models, different settings and

Speaker:

01:15:45,810 --> 01:15:46,100

so on.

Speaker:

01:15:46,100 --> 01:15:48,670

So that's a huge, huge challenge.

Speaker:

01:15:49,530 --> 01:15:55,270

Like on the backend side, but then on the

front end and API side, I think it also

Speaker:

01:15:55,270 --> 01:15:55,790

has...

Speaker:

01:15:55,790 --> 01:15:58,510

many different sub-problems there.

Speaker:

01:15:58,530 --> 01:16:02,510

'Cause it would mean, like, people could

just, you know, write down a description

Speaker:

01:16:02,510 --> 01:16:07,530

of their model in plain text language,

like to a large language model.

Speaker:

01:16:07,530 --> 01:16:11,990

And, you know, don't actually specify

everything by programming.

Speaker:

01:16:12,350 --> 01:16:17,310

Maybe also just sketch out some data, like

expert elicitation, and all those different

Speaker:

01:16:17,310 --> 01:16:17,770

topics.

Speaker:

01:16:17,770 --> 01:16:23,782

I think there's, like, this bigger picture

where, you know,

Speaker:

01:16:24,014 --> 01:16:28,174

thousands of researchers worldwide are

working on so many niche topics there.

Speaker:

01:16:28,174 --> 01:16:33,294

But having this overarching Bayes-GPT kind

of thing would be really cool.

Speaker:

01:16:33,894 --> 01:16:37,314

So I'd probably choose that to work on.

Speaker:

01:16:37,314 --> 01:16:41,094

It's a very risky thing, so that's why I'm

not currently working on it.

Speaker:

01:16:42,034 --> 01:16:43,914

Yeah, I love that.

Speaker:

01:16:44,414 --> 01:16:46,654

Yeah, that sounds awesome.

Speaker:

01:16:46,654 --> 01:16:49,582

Feel free to cooperate

Speaker:

01:16:49,582 --> 01:16:51,702

and collaborate with me on that.

Speaker:

01:16:51,702 --> 01:16:53,182

I would definitely be down.

Speaker:

01:16:53,182 --> 01:16:54,782

That sounds absolutely amazing.

Speaker:

01:16:54,782 --> 01:16:55,242

Yeah.

Speaker:

01:16:55,242 --> 01:16:58,552

So send me an email when you start working

on that, please.

Speaker:

01:16:58,552 --> 01:17:00,602

I'll be happy to join the team.

Speaker:

01:17:02,242 --> 01:17:06,062

And second question, if you could have

dinner with any great scientific mind,

Speaker:

01:17:06,062 --> 01:17:09,362

dead, alive or fictional, who would it be?

Speaker:

01:17:10,042 --> 01:17:11,202

Again, very loaded question.

Speaker:

01:17:11,202 --> 01:17:12,622

Super interesting question.

Speaker:

01:17:12,622 --> 01:17:14,822

I mean, there are two huge choices.

Speaker:

01:17:14,822 --> 01:17:19,406

I could either go with someone who's

currently alive and

Speaker:

01:17:19,406 --> 01:17:24,846

I feel like I want their take on the

current state of the art and future

Speaker:

01:17:24,846 --> 01:17:26,246

directions and so on.

Speaker:

01:17:26,246 --> 01:17:30,386

And the second huge option, which I guess

many people would go with, is someone who's

Speaker:

01:17:30,386 --> 01:17:34,426

been dead for two to three centuries.

Speaker:

01:17:34,586 --> 01:17:37,666

And I think I'd go with the second choice.

Speaker:

01:17:37,666 --> 01:17:40,946

So really take someone from way back in

the past.

Speaker:

01:17:40,946 --> 01:17:43,046

And that's because of two reasons.

Speaker:

01:17:43,046 --> 01:17:46,706

I think like, of course, speaking to

today's scientists is super interesting

Speaker:

01:17:46,706 --> 01:17:48,358

and I would love to do that.

Speaker:

01:17:48,398 --> 01:17:52,958

But I mean, they have access to all the

state of the art technology and they know

Speaker:

01:17:52,958 --> 01:17:55,898

about all the latest advancements.

Speaker:

01:17:55,898 --> 01:18:02,758

And so if they have some groundbreaking

creative ideas to share that they come up

Speaker:

01:18:02,758 --> 01:18:06,518

with, they could just implement it and

make them actionable.

Speaker:

01:18:07,678 --> 01:18:11,498

And the second reason is that today

scientists have a huge platform because

Speaker:

01:18:11,498 --> 01:18:12,558

they're on the internet.

Speaker:

01:18:12,558 --> 01:18:17,838

So if they really want to express an idea,

they could just do it on

Speaker:

01:18:17,838 --> 01:18:24,258

Twitter or wherever. So there are, like, other

ways to engage with them apart from, you

Speaker:

01:18:24,258 --> 01:18:28,098

know, having a magical dinner. Right.

Speaker:

01:18:28,098 --> 01:18:29,978

So I would choose someone from the past,

and in particular,

Speaker:

01:18:29,978 --> 01:18:36,378

I think Ada Lovelace would be super

interesting for me to talk to. Essentially

Speaker:

01:18:36,378 --> 01:18:41,978

because she's widely considered the first

programmer. The craziest thing is

Speaker:

01:18:41,978 --> 01:18:46,694

that she never had access to, like, a

modern computer.

Speaker:

01:18:46,766 --> 01:18:50,456

So she wrote the first program, but the

machine wasn't there yet.

Speaker:

01:18:50,456 --> 01:18:55,686

So that's such a huge leap of creativity

and genius.

Speaker:

01:18:56,046 --> 01:19:02,366

And so I'd really be interested in, like, if

Ada Lovelace saw what's happening today,

Speaker:

01:19:02,366 --> 01:19:07,726

like all the technology that we have with

generative AI, GPU clusters and all these

Speaker:

01:19:07,726 --> 01:19:11,426

possibilities, like what's the next leap

forward?

Speaker:

01:19:11,986 --> 01:19:16,270

Like what's today's equivalent of writing

Speaker:

01:19:16,270 --> 01:19:19,164

the first program without having the

computer?

Speaker:

01:19:21,742 --> 01:19:26,342

Yeah, I'd really love to know this answer

and there's currently no other way except

Speaker:

01:19:26,342 --> 01:19:29,382

for your magical dinner invitation to get

this answer.

Speaker:

01:19:29,382 --> 01:19:32,522

So that's why I go with this option.

Speaker:

01:19:32,522 --> 01:19:33,082

Yeah.

Speaker:

01:19:33,082 --> 01:19:34,562

Yeah.

Speaker:

01:19:34,562 --> 01:19:35,222

No, awesome.

Speaker:

01:19:35,222 --> 01:19:35,602

Awesome.

Speaker:

01:19:35,602 --> 01:19:36,222

I love it.

Speaker:

01:19:36,222 --> 01:19:40,642

That definitely sounds like a, like a

marvelous dinner.

Speaker:

01:19:40,642 --> 01:19:41,822

So yeah.

Speaker:

01:19:41,822 --> 01:19:42,542

Awesome.

Speaker:

01:19:42,542 --> 01:19:43,422

Thanks a lot, Marvin.

Speaker:

01:19:43,422 --> 01:19:45,302

That was, that was really a blast.

Speaker:

01:19:45,302 --> 01:19:49,782

I'm going to let you go now because you've

been talking for a long time, guessing you

Speaker:

01:19:49,782 --> 01:19:50,922

need a break.

Speaker:

01:19:51,278 --> 01:19:52,938

But that was really amazing.

Speaker:

01:19:52,938 --> 01:19:55,468

So yeah, thanks a lot for taking the time.

Speaker:

01:19:55,468 --> 01:19:59,418

Thanks again to Matt Rosinski for this

awesome recommendation.

Speaker:

01:19:59,418 --> 01:20:01,818

I hope you loved it, Marvin.

Speaker:

01:20:02,278 --> 01:20:05,038

And also Matt. Me, I did.

Speaker:

01:20:05,038 --> 01:20:07,398

So that was really awesome.

Speaker:

01:20:07,458 --> 01:20:11,808

As usual, I'll put resources and a link to

your website.

Speaker:

01:20:11,808 --> 01:20:15,318

And also, Marvin is going to add stuff to

the show notes for those who want to dig

Speaker:

01:20:15,318 --> 01:20:16,178

deeper.

Speaker:

01:20:16,366 --> 01:20:19,526

Thank you again, Marvin, for taking the

time and being on this show.

Speaker:

01:20:19,526 --> 01:20:20,666

Thank you very much for having me, Alex.

Speaker:

01:20:20,666 --> 01:20:21,766

I appreciate it.

Speaker:

01:20:25,614 --> 01:20:29,354

This has been another episode of Learning

Bayesian Statistics.

Speaker:

01:20:29,354 --> 01:20:34,674

Be sure to rate, review and follow the

show on your favorite podcatcher and visit

Speaker:

01:20:34,674 --> 01:20:39,734

learnbayesstats.com for more resources

about today's topics as well as access to

Speaker:

01:20:39,734 --> 01:20:43,974

more episodes to help you reach a true

Bayesian state of mind.

Speaker:

01:20:43,974 --> 01:20:45,904

That's learnbayesstats.com.

Speaker:

01:20:45,904 --> 01:20:50,774

Our theme music is Good Bayesian by Baba

Brinkman, feat. MC Lars and Mega Ran.

Speaker:

01:20:50,774 --> 01:20:53,884

Check out his awesome work at

bababrinkman.com.

Speaker:

01:20:53,884 --> 01:20:55,054

I'm your host,

Speaker:

01:20:55,054 --> 01:20:56,034

Alex Andorra.

Speaker:

01:20:56,034 --> 01:21:00,274

You can follow me on Twitter at Alex

underscore Andorra, like the country.

Speaker:

01:21:00,274 --> 01:21:05,354

You can support the show and unlock

exclusive benefits by visiting Patreon

Speaker:

01:21:05,354 --> 01:21:07,534

.com slash LearnBayesStats.

Speaker:

01:21:07,534 --> 01:21:09,994

Thank you so much for listening and for

your support.

Speaker:

01:21:09,994 --> 01:21:15,734

You're truly a good Bayesian, change your

predictions after taking information in.

Speaker:

01:21:15,734 --> 01:21:22,494

And if you're thinking I'll be less than

amazing, let's adjust those expectations.

Speaker:

01:21:22,542 --> 01:21:27,722

Let me show you how to be a good Bayesian,

change calculations after taking fresh

Speaker:

01:21:27,722 --> 01:21:33,782

data in. Those predictions that your brain

is making, let's get them on a solid

Speaker:

01:21:33,782 --> 01:21:35,620

foundation.
