Learning Bayesian Statistics

Proudly sponsored by PyMC Labs, the Bayesian Consultancy. Book a call, or get in touch!

GPs are extremely powerful… but hard to handle. One of the bottlenecks is learning the appropriate kernel. What if you could learn the structure of GP kernels automatically? Sounds really cool, but also a bit futuristic, doesn’t it?

Well, think again, because in this episode, Feras Saad will teach us how to do just that! Feras is an Assistant Professor in the Computer Science Department at Carnegie Mellon University. He received his PhD in Computer Science from MIT, and, most importantly for our conversation, he’s the creator of AutoGP.jl, a Julia package for automatic Gaussian process modeling.

Feras discusses the implementation of AutoGP, how it scales, what you can do with it, and how you can integrate its outputs in your models.

Finally, Feras provides an overview of Sequential Monte Carlo and its usefulness in AutoGP, highlighting the ability of SMC to incorporate new data in a streaming fashion and explore multiple modes efficiently.

Our theme music is « Good Bayesian », by Baba Brinkman (feat MC Lars and Mega Ran). Check out his awesome work at https://bababrinkman.com/ !

Thank you to my Patrons for making this episode possible!

Yusuke Saito, Avi Bryant, Ero Carrera, Giuliano Cruz, Tim Gasser, James Wade, Tradd Salvo, William Benton, James Ahloy, Robin Taylor, Chad Scherrer, Zwelithini Tunyiswa, Bertrand Wilden, James Thompson, Stephen Oates, Gian Luca Di Tanna, Jack Wells, Matthew Maldonado, Ian Costley, Ally Salim, Larry Gill, Ian Moran, Paul Oreto, Colin Caprani, Colin Carroll, Nathaniel Burbank, Michael Osthege, Rémi Louf, Clive Edelsten, Henri Wallen, Hugo Botha, Vinh Nguyen, Marcin Elantkowski, Adam C. Smith, Will Kurt, Andrew Moskowitz, Hector Munoz, Marco Gorelli, Simon Kessell, Bradley Rode, Patrick Kelley, Rick Anderson, Casper de Bruin, Philippe Labonde, Michael Hankin, Cameron Smith, Tomáš Frýda, Ryan Wesslen, Andreas Netti, Riley King, Yoshiyuki Hamajima, Sven De Maeyer, Michael DeCrescenzo, Fergal M, Mason Yahr, Naoya Kanai, Steven Rowland, Aubrey Clayton, Jeannine Sue, Omri Har Shemesh, Scott Anthony Robson, Robert Yolken, Or Duek, Pavel Dusek, Paul Cox, Andreas Kröpelin, Raphaël R, Nicolas Rode, Gabriel Stechschulte, Arkady, Kurt TeKolste, Gergely Juhasz, Marcus Nölke, Maggi Mackintosh, Grant Pezzolesi, Avram Aelony, Joshua Meehl, Javier Sabio, Kristian Higgins, Alex Jones, Gregorio Aguilar, Matt Rosinski, Bart Trudeau, Luis Fonseca, Dante Gates, Matt Niccolls, Maksim Kuznecov, Michael Thomas, Luke Gorrie, Cory Kiser, Julio, Edvin Saveljev, Frederick Ayala, Jeffrey Powell and Gal Kampel.

Visit https://www.patreon.com/learnbayesstats to unlock exclusive Bayesian swag 😉

Takeaways:

– AutoGP is a Julia package for automatic Gaussian process modeling that learns the structure of GP kernels automatically (see the usage sketch after this list).

– It addresses the challenge of making structural choices for covariance functions by using a symbolic language and a recursive grammar to infer the expression of the covariance function given the observed data.

– AutoGP incorporates sequential Monte Carlo inference to handle scalability and uncertainty in structure learning.

– The package is implemented in Julia using the Gen probabilistic programming language, which provides support for sequential Monte Carlo and involutive MCMC.

– Sequential Monte Carlo (SMC) and involutive MCMC are used in AutoGP to infer the structure of the model.

– Integrating probabilistic models with language models can improve interpretability and trustworthiness in data-driven inferences.

– Challenges in Bayesian workflows include the need for automated model discovery and scalability of inference algorithms.

– Future developments in probabilistic reasoning systems include unifying people around data-driven inferences and improving the scalability and configurability of inference algorithms.
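
For readers who want to try this, here is a minimal usage sketch in Julia. The function and keyword names (GPModel, fit_smc!, Schedule.linear_schedule, covariance_kernels, predict) are written from memory of the AutoGP.jl tutorial and may differ from the current API, so treat it as an illustration and check the package documentation.

```julia
using AutoGP

# Toy monthly series: a seasonal pattern plus a slow trend and noise.
ds = collect(1.0:120.0)
y  = sin.(2π .* ds ./ 12) .+ 0.05 .* ds .+ 0.1 .* randn(length(ds))

# An ensemble of SMC particles, each carrying its own kernel structure.
model = AutoGP.GPModel(ds, y; n_particles=8)
AutoGP.fit_smc!(model; schedule=AutoGP.Schedule.linear_schedule(length(ds), 0.10),
                n_mcmc=50, n_hmc=10, verbose=false)

# Inspect the learned kernel expressions and forecast the next year.
foreach(println, AutoGP.covariance_kernels(model))
forecast = AutoGP.predict(model, collect(121.0:132.0); quantiles=[0.025, 0.5, 0.975])
```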

Chapters:

00:00 Introduction to AutoGP

26:28 Automatic Gaussian Process Modeling

45:05 AutoGP: Automatic Discovery of Gaussian Process Model Structure

53:39 Applying AutoGP to New Settings

01:09:27 The Biggest Hurdle in the Bayesian Workflow

01:19:14 Unifying People Around Data-Driven Inferences

Links from the show:

Transcript

This is an automatic transcript and may therefore contain errors. Please get in touch if you’re willing to correct them.

Speaker:

GPs are extremely powerful, but hard to

handle.

2

00:00:09,002 --> 00:00:12,422

One of the bottlenecks is learning the

appropriate kernels.

3

00:00:12,422 --> 00:00:17,882

Well, what if you could learn the

structure of GP's kernels automatically?

4

00:00:18,122 --> 00:00:19,532

Sounds really cool, right?

5

00:00:19,532 --> 00:00:23,402

But also, eh, a bit futuristic, doesn't

it?

6

00:00:23,402 --> 00:00:28,162

Well, think again, because in this

episode, Feras Saad will teach us how to

7

00:00:28,162 --> 00:00:29,422

do just that.

8

00:00:29,422 --> 00:00:33,522

Feras is an assistant professor in the

computer science department at Carnegie

9

00:00:33,522 --> 00:00:34,922

Mellon University.

10

00:00:34,922 --> 00:00:38,662

He received his PhD in computer science

from MIT.

11

00:00:38,662 --> 00:00:44,882

And most importantly for our conversation,

he's the creator of AutoGP.jl, a Julia

12

00:00:44,882 --> 00:00:48,062

package for automatic Gaussian process

modeling.

13

00:00:48,062 --> 00:00:52,802

Feras discusses the implementation of

AutoGP, how it scales, what you can do

14

00:00:52,802 --> 00:00:57,882

with it, and how you can integrate its

outputs in your Bayesian models.

15

00:00:58,062 --> 00:00:58,830

Finally,

16

00:00:58,830 --> 00:01:03,230

Feras provides an overview of

Sequential Monte Carlo and its usefulness

17

00:01:03,230 --> 00:01:08,550

in AutoGP, highlighting the ability of SMC

to incorporate new data in a streaming

18

00:01:08,550 --> 00:01:12,250

fashion and explore multiple modes

efficiently.

19

00:01:12,250 --> 00:01:19,810

This is Learning Bayesian Statistics,

episode 104, recorded February 23, 2024.

20

00:01:27,054 --> 00:01:41,334

Welcome to Learning Bayesian Statistics, a

podcast about Bayesian inference, the

21

00:01:41,334 --> 00:01:44,894

methods, the projects, and the people who

make it possible.

22

00:01:44,894 --> 00:01:47,114

I'm your host, Alex Andorra.

23

00:01:47,114 --> 00:01:52,394

You can follow me on Twitter at alex

_andorra, like the country, for any info

24

00:01:52,394 --> 00:01:53,274

about the show.

25

00:01:53,274 --> 00:01:55,734

LearnBayesStats.com is Laplace to be.

26

00:01:55,734 --> 00:01:56,654

Show notes,

27

00:01:56,654 --> 00:02:00,874

becoming a corporate sponsor, unlocking

Bayesian merch, supporting the show on

28

00:02:00,874 --> 00:02:03,254

Patreon, everything is in there.

29

00:02:03,254 --> 00:02:05,094

That's learnbayesstats.com.

30

00:02:05,094 --> 00:02:09,514

If you're interested in one-on-one

mentorship, online courses, or statistical

31

00:02:09,514 --> 00:02:14,674

consulting, feel free to reach out and

book a call at topmate.io slash alex

32

00:02:14,674 --> 00:02:16,614

underscore andorra.

33

00:02:16,614 --> 00:02:20,454

See you around, folks, and best Bayesian

wishes to you all.

34

00:02:24,270 --> 00:02:25,610

dear Bayesians.

35

00:02:25,630 --> 00:02:31,310

First, I want to thank Edvin Saveljev,

Frederick Ayala, Jeffrey Powell, and Gal

36

00:02:31,310 --> 00:02:33,220

Kampel for supporting the show.

37

00:02:33,220 --> 00:02:39,230

On Patreon, your support is invaluable, guys,

and literally makes this show possible.

38

00:02:39,230 --> 00:02:42,910

I cannot wait to talk with you in the

Slack channel.

39

00:02:42,990 --> 00:02:48,970

Second, I have an exciting modeling

webinar coming up on April 18 with Juan

40

00:02:48,970 --> 00:02:52,150

Orduz, a fellow PyMC core dev and

mathematician.

41

00:02:52,150 --> 00:02:57,230

In this modeling webinar, we'll learn how

to use the new HSGP approximation for fast

42

00:02:57,230 --> 00:03:01,510

and efficient Gaussian processes, we'll

simplify the foundational concepts,

43

00:03:01,510 --> 00:03:05,730

explain why this technique is so useful

and innovative, and of course, we'll show

44

00:03:05,730 --> 00:03:08,630

you a real-world application in PyMC.

45

00:03:08,630 --> 00:03:10,414

So if that sounds like fun,

46

00:03:10,414 --> 00:03:14,714

Go to topmate.io slash alex underscore

andorra to secure your seat.

47

00:03:14,714 --> 00:03:18,914

Of course, if you're a patron of the show,

you get bonuses like submitting questions

48

00:03:18,914 --> 00:03:22,794

in advance, early access to the recording,

et cetera.

49

00:03:22,854 --> 00:03:25,694

You are my favorite listeners after all.

50

00:03:25,694 --> 00:03:27,654

Okay, back to the show now.

51

00:03:27,654 --> 00:03:31,414

Feras Saad, welcome to Learning Bayesian

Statistics.

52

00:03:32,214 --> 00:03:33,064

Hi, thank you.

53

00:03:33,064 --> 00:03:33,944

Thanks for the invitation.

54

00:03:33,944 --> 00:03:35,534

I'm delighted to be here.

55

00:03:35,534 --> 00:03:37,774

Yeah, thanks a lot for taking the time.

56

00:03:37,774 --> 00:03:40,214

Thanks a lot to Colin Carroll.

57

00:03:40,270 --> 00:03:47,170

who of course listeners know, he was in

episode 3 of Learning Bayesian Statistics.

58

00:03:48,050 --> 00:03:52,220

Well I will of course put it in the show

notes, that's like a vintage episode now,

59

00:03:52,220 --> 00:03:54,430

from 4 years ago.

60

00:03:54,870 --> 00:04:02,390

I was a complete beginner in Bayesian

stats, so if you want to embarrass me,

61

00:04:02,630 --> 00:04:07,790

definitely that's one of the episodes you

should listen to, with my

62

00:04:07,790 --> 00:04:14,790

beginner's questions, and that's one of

the rare episodes I could do on site.

63

00:04:14,790 --> 00:04:20,410

I was with Colin in person to record

that episode in Boston.

64

00:04:20,930 --> 00:04:24,350

So, hi Colin, thanks a lot again.

65

00:04:24,730 --> 00:04:28,340

And Feras, let's talk about you first.

66

00:04:28,340 --> 00:04:31,350

How would you define the work you're doing

nowadays?

67

00:04:31,350 --> 00:04:34,424

And also, how did you end up doing that?

68

00:04:34,702 --> 00:04:35,962

Yeah, yeah, thanks.

69

00:04:35,962 --> 00:04:39,442

And yeah, thanks to Colin Carroll for

setting up this connection.

70

00:04:39,442 --> 00:04:44,062

I've been watching the podcast for a while

and I think it's really great how you've

71

00:04:44,062 --> 00:04:47,682

brought together lots of different people

in the Bayesian inference community, the

72

00:04:47,682 --> 00:04:50,102

statistics community to talk about their

work.

73

00:04:50,102 --> 00:04:54,442

So thank you and thank you to Colin for

that connection.

74

00:04:54,842 --> 00:04:57,422

Yeah, so a little background about me.

75

00:04:57,542 --> 00:05:01,958

I'm a professor at CMU and I'm working

in...

76

00:05:01,966 --> 00:05:07,906

a few different areas surrounding Bayesian

inference with my colleagues and students.

77

00:05:08,466 --> 00:05:12,966

One, I think, you know, I like to think of

the work I do as following different

78

00:05:12,966 --> 00:05:16,686

threads, which are all unified by this

idea of probability and computation.

79

00:05:16,686 --> 00:05:21,186

So one area that I work a lot in, and I'm

sure you have lots of experience in this,

80

00:05:21,186 --> 00:05:27,306

being one of the core developers of PyMC,

is probabilistic programming languages and

81

00:05:27,306 --> 00:05:30,606

developing new tools that help

82

00:05:30,606 --> 00:05:35,046

both high level users and also machine

learning experts and statistics experts

83

00:05:35,046 --> 00:05:40,406

more easily use Bayesian models and

inferences as part of their workflow.

84

00:05:41,026 --> 00:05:46,166

The, you know, putting my programming

languages hat on, it's important to think

85

00:05:46,166 --> 00:05:50,406

about not only how do we make it easier

for people to write up Bayesian inference

86

00:05:50,406 --> 00:05:55,186

workflows, but also what kind of

guarantees or what kind of help can we

87

00:05:55,186 --> 00:05:59,470

give them in terms of verifying the

correctness of their implementations or.

88

00:05:59,470 --> 00:06:04,230

automating the process of getting these

probabilistic programs to begin with using

89

00:06:04,230 --> 00:06:06,650

probabilistic program synthesis

techniques.

90

00:06:07,790 --> 00:06:14,630

So these are questions that are very

challenging and, you know, if we're able

91

00:06:14,630 --> 00:06:18,650

to solve them, you know, really can go a

long way.

92

00:06:19,350 --> 00:06:22,690

So there's a lot of work in the

probabilistic programming world that I do,

93

00:06:22,690 --> 00:06:26,210

and I'm specifically interested in

probabilistic programming languages that

94

00:06:26,210 --> 00:06:28,142

support programmable inference.

95

00:06:28,142 --> 00:06:32,682

So we can think of many probabilistic

programming languages like Stan or Bugs or

96

00:06:32,682 --> 00:06:37,042

PyMC as largely having a single inference

algorithm that they're going to use

97

00:06:37,042 --> 00:06:40,422

multiple times for all the different

programs you can express.

98

00:06:40,422 --> 00:06:47,242

So BUGS might use Gibbs sampling, Stan

uses HMC with NUTS, PyMC uses MCMC

99

00:06:47,242 --> 00:06:50,142

algorithms, and these are all great.

100

00:06:50,142 --> 00:06:54,002

But of course, one of the limitations is

there's no universal inference algorithm

101

00:06:54,002 --> 00:06:57,326

that works well for any problem you might

want to express.

102

00:06:57,326 --> 00:07:01,326

And that's where I think a lot of the

power of programmable inference comes in.

103

00:07:01,326 --> 00:07:04,616

A lot of where the interesting research is

as well, right?

104

00:07:04,616 --> 00:07:12,286

Like how can you support users writing

their own say MCMC proposal for a given

105

00:07:12,286 --> 00:07:16,286

Bayesian inference problem and verify that

that proposal distribution meets the

106

00:07:16,286 --> 00:07:19,926

theoretical conditions needed for

soundness, whether it's defining an

107

00:07:19,926 --> 00:07:25,966

irreducible chain, for example, or whether

it's aperiodic.

108

00:07:25,966 --> 00:07:29,946

or in the context of variational

inference, whether you define the

109

00:07:29,946 --> 00:07:35,206

variational family that is broad enough,

so its support encompasses the support of

110

00:07:35,206 --> 00:07:36,466

the target model.

111

00:07:36,466 --> 00:07:42,146

We have all of these conditions that we

usually hope are correct, but our systems

112

00:07:42,146 --> 00:07:46,646

don't actually verify that for us, whether

it's an MCMC or variational inference or

113

00:07:46,646 --> 00:07:49,006

importance sampling or sequential Monte

Carlo.

114

00:07:49,046 --> 00:07:52,886

And I think the more flexibility we give

programmers,

115

00:08:27,308 --> 00:08:29,948

And I touched upon this a little bit by

talking about probabilistic program

116

00:08:29,948 --> 00:08:33,308

synthesis, which is this idea of

probabilistic, automated probabilistic

117

00:08:33,308 --> 00:08:34,728

model discovery.

118

00:08:34,868 --> 00:08:44,808

And there, our goal is to use hierarchical

Bayesian models to specify prior

119

00:08:44,808 --> 00:08:48,108

distributions, not only over model

parameters, but also over model

120

00:08:48,108 --> 00:08:49,248

structures.

121

00:08:49,248 --> 00:08:53,868

And here, this is based on this idea that

traditionally in statistics, a data

122

00:08:53,868 --> 00:08:55,372

scientist or an expert,

123

00:08:55,372 --> 00:08:59,552

we'll hand design a Bayesian model for a

given problem, but oftentimes it's not

124

00:08:59,552 --> 00:09:01,932

obvious what's the right model to use.

125

00:09:01,972 --> 00:09:07,532

So the idea is, you know, how can we use

the observed data to guide our decisions

126

00:09:07,532 --> 00:09:11,572

about what is the right model structure to

even be using before we worry about

127

00:09:11,572 --> 00:09:13,052

parameter inference?

128

00:09:13,672 --> 00:09:17,632

So, you know, we've looked at this problem

in the context of learning models of time

129

00:09:17,632 --> 00:09:18,812

series data.

130

00:09:18,812 --> 00:09:21,232

Should my time series data have a periodic

component?

131

00:09:21,232 --> 00:09:23,162

Should it have polynomial trends?

132

00:09:23,162 --> 00:09:25,036

Should it have a change point?

133

00:09:25,036 --> 00:09:25,486

right?

134

00:09:25,486 --> 00:09:27,916

You know, how can we automate the

discovery of these different patterns and

135

00:09:27,916 --> 00:09:29,956

then learn an appropriate probabilistic

model?

136

00:09:29,956 --> 00:09:33,576

And I think it ties in very nicely to

probabilistic programming because

137

00:09:33,576 --> 00:09:39,136

probabilistic programs are so expressive

that we can express prior distributions on

138

00:09:39,136 --> 00:09:42,596

structures or prior distributions on

probabilistic programs all within the

139

00:09:42,596 --> 00:09:45,176

system using this unified technology.

140

00:09:45,176 --> 00:09:45,706

Yeah.

141

00:09:45,706 --> 00:09:49,946

Which is where, you know, these two

research areas really inform one another.

142

00:09:49,946 --> 00:09:51,628

If we're able to express

143

00:09:51,628 --> 00:09:55,168

rich probabilistic programming languages,

then we can start doing inference over

144

00:09:55,168 --> 00:09:59,988

probabilistic programs themselves and try

and synthesize these programs from data.

145

00:10:00,148 --> 00:10:05,748

Other areas that I've looked at are

tabular data or relational data models,

146

00:10:05,748 --> 00:10:10,968

different types of traditionally

structured data, and synthesizing models

147

00:10:10,968 --> 00:10:11,528

there.

148

00:10:11,528 --> 00:10:15,408

And the workhorse in that area is largely

Bayesian nonparametrics.

149

00:10:15,436 --> 00:10:21,696

So prior distributions over unbounded

spaces of latent variables, which are, I

150

00:10:21,696 --> 00:10:27,196

think, a very mathematically elegant way

to treat probabilistic structure discovery

151

00:10:27,196 --> 00:10:30,500

using Bayesian inferences as the workhorse

for that.

152

00:10:30,667 --> 00:10:33,887

And I'll just touch upon a few other areas

that I work in, which are also quite

153

00:10:33,887 --> 00:10:38,427

aligned, which a third area I work in is

more on the computational statistics side,

154

00:10:38,427 --> 00:10:43,347

which is now that we have probabilistic

programs and we're using them and they're

155

00:10:43,347 --> 00:10:48,147

becoming more and more routine in the

workflow of Bayesian inference, we need to

156

00:10:48,147 --> 00:10:52,327

start thinking about new statistical

methods and testing methods for these

157

00:10:52,327 --> 00:10:53,667

probabilistic programs.

158

00:10:53,667 --> 00:10:58,347

So for example, this is a little bit

different than traditional statistics

159

00:10:58,347 --> 00:11:00,622

where, you know, traditionally in

statistics we might

160

00:11:00,622 --> 00:11:06,062

some type of analytic mathematical

derivation on some probability model,

161

00:11:06,062 --> 00:11:06,302

right?

162

00:11:06,302 --> 00:11:11,282

So you might write up your model by hand,

and then you might, you know, if you want

163

00:11:11,282 --> 00:11:15,282

to compute some property, you'll treat the

model as some kind of mathematical

164

00:11:15,282 --> 00:11:16,082

expression.

165

00:11:16,082 --> 00:11:19,722

But now that we have programs, these

programs are often far too hard to

166

00:11:19,722 --> 00:11:22,142

formalize mathematically by hand.

167

00:11:22,142 --> 00:11:26,942

So if you want to analyze their

properties, how can we understand the

168

00:11:26,942 --> 00:11:27,962

properties of a program?

169

00:11:27,962 --> 00:11:29,420

By simulating it.

170

00:11:29,420 --> 00:11:35,100

So a very simple example of this would be,

say I wrote a probabilistic program for

171

00:11:35,100 --> 00:11:37,880

some given data, and I actually have the

data.

172

00:11:37,880 --> 00:11:40,680

Then I'd like to know whether the

probabilistic program I wrote is even a

173

00:11:40,680 --> 00:11:42,540

reasonable prior from that data.

174

00:11:42,540 --> 00:11:47,560

So this is a goodness of fit testing, or

how well does the probabilistic program I

175

00:11:47,560 --> 00:11:50,140

wrote explain the range of data sets I

might see?

176

00:11:50,764 --> 00:11:55,004

So, you know, if you do a goodness of fit

test using stats 101, you would look, all

177

00:11:55,004 --> 00:11:56,124

right, what is my distribution?

178

00:11:56,124 --> 00:11:57,104

What is the CDF?

179

00:11:57,104 --> 00:12:01,144

What are the parameters that I'm going to

derive some type of thing by hand?

180

00:12:01,144 --> 00:12:02,904

But for probabilistic programs, we can't do that.

181

00:12:02,904 --> 00:12:07,244

So we might like to simulate data from the

program and do some type of analysis based

182

00:12:07,244 --> 00:12:11,148

on samples of the program as compared to

samples of the observed data.
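
As a concrete illustration of this simulation-based checking idea (an illustrative sketch, not code from the episode), one can compare a summary statistic of the observed data against the same statistic computed on many simulations from the program; the function name and statistic below are arbitrary choices.

```julia
using Statistics

# `simulate_data` stands for any probabilistic program that returns a synthetic
# dataset; we check how often its simulations look at least as extreme as the
# observed data under a chosen summary statistic.
function simulation_check(simulate_data::Function, observed; stat=mean, n=1000)
    sims = [stat(simulate_data()) for _ in 1:n]
    return mean(sims .>= stat(observed))   # tail probability of the observed statistic
end

# Example: is a standard normal program a plausible prior for this data?
simulation_check(() -> randn(50), 0.3 .+ randn(50))
```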

183

00:12:11,148 --> 00:12:14,488

So these types of simulation-based

analyses of statistical properties of

184

00:12:14,488 --> 00:12:18,808

probabilistic programs for testing their

behavior or for quantifying the

185

00:12:18,808 --> 00:12:22,608

information between variables, things like

that.

186

00:12:22,888 --> 00:12:28,568

And then the final area I'll touch upon is

really more at the foundational level,

187

00:12:28,568 --> 00:12:29,324

which is.

188

00:12:29,324 --> 00:12:34,324

understanding what are the primitive

operations, a more rigorous or principled

189

00:12:34,324 --> 00:12:37,624

understanding of the primitive operations

on our computers that enable us to do

190

00:12:37,624 --> 00:12:38,964

random computations.

191

00:12:38,964 --> 00:12:40,724

So what do I mean by that?

192

00:12:40,724 --> 00:12:45,924

Well, you know, we love to assume that our

computers can freely compute over real

193

00:12:45,924 --> 00:12:46,804

numbers.

194

00:12:46,804 --> 00:12:49,944

But of course, computers don't have real

numbers built within them.

195

00:12:49,944 --> 00:12:54,144

They're built on finite precision

machines, right, which means I can't

196

00:12:54,144 --> 00:12:55,424

express.

197

00:12:55,500 --> 00:12:58,020

some arbitrary division between two real

numbers.

198

00:12:58,020 --> 00:13:01,240

Everything is at some level it's floating

point.

199

00:13:01,240 --> 00:13:06,220

And so this gives us a gap between the

theory and the practice.

200

00:13:06,220 --> 00:13:10,120

Because in theory, you know, whenever

we're writing our models, we assume

201

00:13:10,120 --> 00:13:13,380

everything is in this, you know,

infinitely precise universe.

202

00:13:13,380 --> 00:13:17,860

But when we actually implement it, there's

some level of approximation.
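
As a quick, concrete illustration of that gap (an aside, not something computed in the episode), Julia makes the finite grid of floating-point numbers easy to see:

```julia
# Between 1.0 and 2.0 there are only finitely many Float64 values, so any
# "real-valued" quantity we simulate actually lives on a discrete grid.
println(reinterpret(UInt64, 2.0) - reinterpret(UInt64, 1.0))  # 4503599627370496 = 2^52 values
println(eps(1.0))   # grid spacing near 1.0: 2.220446049250313e-16
```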

203

00:13:17,860 --> 00:13:22,040

So I'm interested in understanding first,

theoretically, what is this approximation?

204

00:13:22,040 --> 00:13:26,480

How important is it that I'm actually

treating my model as running on an

205

00:13:26,480 --> 00:13:29,620

infinitely precise machine where I

actually have finite precision?

206

00:13:29,620 --> 00:13:33,360

And second, what are the implications of

that gap for Bayesian inference?

207

00:13:33,360 --> 00:13:36,174

Does it mean that now I actually have some

208

00:13:36,174 --> 00:13:40,594

properties of my Markov chain that no

longer hold because I'm actually running

209

00:13:40,594 --> 00:13:44,614

it on a finite precision machine whereby

all my analysis was assuming I have an

210

00:13:44,614 --> 00:13:50,394

infinite precision or what does it mean

about the actual variables we generate?

211

00:13:50,394 --> 00:13:55,354

So, you know, we might generate a Gaussian

random variable, but in practice, the

212

00:13:55,354 --> 00:13:58,195

variable we're simulating has some other

distribution.

213

00:13:58,284 --> 00:14:03,084

Can we theoretically quantify that other

distribution and its error with respect to

214

00:14:03,084 --> 00:14:04,114

the true distribution?

215

00:14:04,114 --> 00:14:07,504

Or have we come up with sampling

procedures that are as close as possible

216

00:14:07,504 --> 00:14:10,744

to the ideal real value distribution?

217

00:14:10,744 --> 00:14:14,204

And so this brings together ideas from

information theory, from theoretical

218

00:14:14,204 --> 00:14:15,344

computer science.

219

00:14:15,344 --> 00:14:19,104

And one of the motivations is to thread

those results through into the actual

220

00:14:19,104 --> 00:14:22,304

Bayesian inference procedures that we

implement using probabilistic programming

221

00:14:22,304 --> 00:14:23,384

languages.

222

00:14:24,684 --> 00:14:28,304

So that's just, you know, an overview of

these three or four different areas that

223

00:14:28,304 --> 00:14:31,404

I'm interested in and I've been working on

recently.

224

00:14:31,684 --> 00:14:32,464

Yeah, that's amazing.

225

00:14:32,464 --> 00:14:37,104

Thanks a lot for these, like full panel of

what you're doing.

226

00:14:37,144 --> 00:14:42,604

And yeah, that's just incredible also that

you're doing so many things.

227

00:14:42,604 --> 00:14:44,284

I'm really impressed.

228

00:14:44,524 --> 00:14:50,004

And of course we're going to dive a bit

into these, at least some of these topics.

229

00:14:50,004 --> 00:14:53,516

I don't want to take three hours of your

time, but...

230

00:14:53,516 --> 00:15:00,616

Before that though, I'm curious if you

remembered when and how you first got

231

00:15:00,616 --> 00:15:04,796

introduced to Bayesian inference and also

why it's ticked with you because it seems

232

00:15:04,796 --> 00:15:10,196

like it's underpinning most of your work,

at least that idea of probabilistic

233

00:15:10,196 --> 00:15:11,336

programming.

234

00:15:11,636 --> 00:15:13,896

Yeah, that's a good question.

235

00:15:14,156 --> 00:15:20,156

I think I was first interested in

probability before I was interested in

236

00:15:20,156 --> 00:15:21,096

Bayesian inference.

237

00:15:21,096 --> 00:15:22,124

I remember...

238

00:15:22,124 --> 00:15:27,144

I used to read a book by Mosteller called

50 Challenging Problems in Probability.

239

00:15:27,324 --> 00:15:33,584

I took a course in high school and I

thought, how could I actually use these

240

00:15:33,584 --> 00:15:35,444

cool ideas for fun?

241

00:15:35,444 --> 00:15:40,824

And there was actually a very nice book

written back in the 50s by Mosteller.

242

00:15:40,824 --> 00:15:44,844

So that got me interested in probability

and how we can use probability to reason

243

00:15:44,844 --> 00:15:46,464

about real world phenomena.

244

00:15:46,464 --> 00:15:49,228

So the book that...

245

00:15:49,228 --> 00:15:54,408

that I used to read would sort of have

these questions about, you know, if

246

00:15:54,408 --> 00:15:57,688

someone misses a train and the train has a

certain schedule, what's the probability

247

00:15:57,688 --> 00:15:59,208

that they'll arrive at the right time?

248

00:15:59,208 --> 00:16:03,128

And it's a really nice book because it

ties in our everyday experiences with

249

00:16:03,128 --> 00:16:05,128

probabilistic modeling and inference.

250

00:16:05,128 --> 00:16:08,188

And so I thought, wow, this is actually a

really powerful paradigm for reasoning

251

00:16:08,188 --> 00:16:12,408

about the everyday things that we do,

like, you know, missing a bus and knowing

252

00:16:12,408 --> 00:16:15,248

something about its schedule and when's

the right time that I should arrive to

253

00:16:15,248 --> 00:16:19,182

maximize the probability of, you know,

some

254

00:16:19,182 --> 00:16:21,442

event of interest, things like that.

255

00:16:22,342 --> 00:16:25,662

So that really got me hooked to the idea

of probability.

256

00:16:26,022 --> 00:16:32,742

But I think what really connected Bayesian

inference to me was taking, I think this

257

00:16:32,742 --> 00:16:38,182

was as a senior or as a first year

master's student, a course by Professor

258

00:16:38,182 --> 00:16:43,282

Josh Tenenbaum at MIT, which is

computational cognitive science.

259

00:16:43,362 --> 00:16:46,342

And that course has evolved.

260

00:16:46,348 --> 00:16:50,088

quite a lot through the years, but the

version that I took was really a beautiful

261

00:16:50,088 --> 00:16:56,288

synthesis of lots of deep ideas of how

Bayesian inference can tell us something

262

00:16:56,288 --> 00:17:02,288

meaningful about how humans reason about,

you know, different empirical phenomena

263

00:17:02,288 --> 00:17:03,668

and cognition.

264

00:17:04,088 --> 00:17:07,500

So, you know, in cognitive science for,

you know, for...

265

00:17:07,500 --> 00:17:11,300

majority of the history of the field,

people would run these experiments on

266

00:17:11,300 --> 00:17:15,200

humans and they would try and analyze

these experiments using some type of, you

267

00:17:15,200 --> 00:17:19,140

know, frequentist statistics or they would

not really use generative models to

268

00:17:19,140 --> 00:17:23,459

describe how humans are solving a

particular experiment.

269

00:17:23,459 --> 00:17:29,772

But the, you know, Professor Tenenbaum's

approach was to use Bayesian models.

270

00:17:29,772 --> 00:17:33,612

as a way of describing or at least

emulating the cognitive processes that

271

00:17:33,612 --> 00:17:36,612

humans do for solving these types of

cognition tasks.

272

00:17:36,612 --> 00:17:40,552

And by cognition tasks, I mean, you know,

simple experiments you might ask a human

273

00:17:40,552 --> 00:17:44,852

to do, which is, you know, you might have

some dots on a screen and you might tell

274

00:17:44,852 --> 00:17:47,772

them, all right, you've seen five dots,

why don't you extrapolate the next five?

275

00:17:47,772 --> 00:17:55,308

Just simple things that, simple cognitive

experiments or, you know, yeah, so.

276

00:17:55,308 --> 00:18:00,088

I think that being able to use Bayesian

models to describe very simple cognitive

277

00:18:00,088 --> 00:18:04,908

phenomena was another really appealing

prospect to me throughout that course.

278

00:18:05,008 --> 00:18:10,156

I'm seeing all the ways in which that

manifested in very nice questions about.

279

00:18:10,156 --> 00:18:12,896

how do we do efficient inference in real

time?

280

00:18:12,896 --> 00:18:16,036

Because humans are able to do inference

very quickly.

281

00:18:16,216 --> 00:18:19,776

And Bayesian inference is obviously very

challenging to do.

282

00:18:19,776 --> 00:18:23,496

But then, if we actually want to engineer

systems, we need to think about the hard

283

00:18:23,496 --> 00:18:27,236

questions of efficient and scalable

inference in real time, maybe at human

284

00:18:27,236 --> 00:18:28,396

level speeds.

285

00:18:28,396 --> 00:18:32,056

Which brought in a lot of the reason for

why I'm so interested in inference as

286

00:18:32,056 --> 00:18:33,436

well.

287

00:18:33,436 --> 00:18:38,256

Because that's one of the harder aspects

of Bayesian computing.

288

00:18:38,348 --> 00:18:43,548

And then I think a third thing which

really hooked me to Bayesian inference was

289

00:18:43,548 --> 00:18:47,728

taking a machine learning course and kind

of comparing.

290

00:18:47,728 --> 00:18:52,628

So the way these machine learning courses

work is they'll teach you empirical risk

291

00:18:52,628 --> 00:18:57,128

minimization, and then they'll teach you

some type of optimization, and then

292

00:18:57,128 --> 00:18:59,588

there'll be a lecture called Bayesian

inference.

293

00:19:00,168 --> 00:19:00,684

And...

294

00:19:00,684 --> 00:19:05,444

What was so interesting to me at the time

was up until the time, up until the

295

00:19:05,444 --> 00:19:08,404

lecture where we learned anything about

Bayesian inference, all of these machine

296

00:19:08,404 --> 00:19:12,584

learning concepts seem to just be a

hodgepodge of random tools and techniques

297

00:19:12,584 --> 00:19:14,244

that people were using.

298

00:19:14,244 --> 00:19:17,304

So I, you know, there's the support vector

machine and it's good at classification

299

00:19:17,304 --> 00:19:19,304

and then there's the random forest and

it's good at this.

300

00:19:19,304 --> 00:19:22,324

But what's really nice about using

Bayesian inference in the machine learning

301

00:19:22,324 --> 00:19:25,944

setting, or at least what I found

appealing was how you have a very clean

302

00:19:25,944 --> 00:19:29,964

specification of the problem that you're

trying to solve in terms of number one, a

303

00:19:29,964 --> 00:19:30,638

prior distribution.

304

00:19:30,638 --> 00:19:36,488

over parameters and observable data, and

then the actual observed data, and three,

305

00:19:36,488 --> 00:19:38,938

which is the posterior distribution that

you're trying to infer.

306

00:19:38,938 --> 00:19:45,438

So you can use a very nice high -level

specification of what is even the problem

307

00:19:45,438 --> 00:19:49,324

you're trying to solve before you even

worry about how you solve it.

308

00:19:49,324 --> 00:19:53,444

you can very cleanly separate modeling and

inference, whereby most of the machine

309

00:19:53,444 --> 00:19:56,684

learning techniques that I was initially

reading or learning about seem to be only

310

00:19:56,684 --> 00:20:00,844

focused on how do I infer something

without crisply formalizing the problem

311

00:20:00,844 --> 00:20:02,764

that I'm trying to solve.

312

00:20:04,076 --> 00:20:05,526

And then, you know, just, yeah.

313

00:20:05,526 --> 00:20:06,876

And then, yeah.

314

00:20:06,876 --> 00:20:11,356

So once we have this Bayesian posterior

that we're trying to infer, then maybe

315

00:20:11,356 --> 00:20:13,976

we'll do fully Bayesian inference, or

maybe we'll do approximate Bayesian

316

00:20:13,976 --> 00:20:15,876

inference, or maybe we'll just do maximum

likelihood.

317

00:20:15,876 --> 00:20:17,456

That's maybe less of a detail.

318

00:20:17,456 --> 00:20:21,636

The more important detail is we have a

very clean specification for our problem

319

00:20:21,636 --> 00:20:24,036

and we can, you know, build in our

assumptions.

320

00:20:24,036 --> 00:20:26,976

And as we change our assumptions, we

change the specification.

321

00:20:26,976 --> 00:20:31,216

So it seemed like a very systematic way,

very systematic way to build machine

322

00:20:31,216 --> 00:20:33,696

learning and artificial intelligence

pipelines.

323

00:20:34,030 --> 00:20:38,060

using a principled process that I found

easy to reason about.

324

00:20:38,060 --> 00:20:41,530

And I didn't really find that in the other

types of machine learning approaches that

325

00:20:41,530 --> 00:20:43,290

we learned in the class.

326

00:20:44,310 --> 00:20:49,750

So yeah, so I joined the probabilistic

computing project at MIT, which is run by

327

00:20:49,750 --> 00:20:50,960

my PhD advisor, Dr.

328

00:20:50,960 --> 00:20:52,390

Vikash Mansinghka.

329

00:20:52,396 --> 00:20:57,996

And, um, you really got the opportunity to

explore these interests at the research

330

00:20:57,996 --> 00:21:00,096

level, not only in classes.

331

00:21:00,096 --> 00:21:02,736

And that's, I think where everything took

off afterwards.

332

00:21:02,736 --> 00:21:06,156

Those are the synthesis of various things,

I think that got me interested in the

333

00:21:06,156 --> 00:21:07,016

field.

334

00:21:07,016 --> 00:21:07,756

Yeah.

335

00:21:07,756 --> 00:21:11,216

Thanks a lot for that, for that, that

that's super interesting to see.

336

00:21:11,216 --> 00:21:18,916

And, uh, I definitely relate to the idea

of these, um, like the Bayesian framework

337

00:21:18,916 --> 00:21:21,420

being, uh, attractive.

338

00:21:21,420 --> 00:21:28,080

not because it's a toolbox, but because

it's more of a principle based framework,

339

00:21:28,080 --> 00:21:33,120

basically, where instead of thinking, oh

yeah, what tool do I need for that stuff,

340

00:21:33,120 --> 00:21:35,390

it's just always the same in a way.

341

00:21:35,390 --> 00:21:42,280

To me, it's cool because you don't have to

be smart all the time in a way, right?

342

00:21:42,280 --> 00:21:46,240

You're just like, it's the problem takes

the same workflow.

343

00:21:46,240 --> 00:21:48,460

It's not going to be the same solution.

344

00:21:48,460 --> 00:21:49,980

But it's always the same workflow.

345

00:21:49,980 --> 00:21:50,180

Okay.

346

00:21:50,180 --> 00:21:52,160

What does the data look like?

347

00:21:52,360 --> 00:21:53,830

How can we model that?

348

00:21:53,830 --> 00:21:56,000

Where is the data generative story?

349

00:21:56,000 --> 00:22:00,520

And then you have very different

challenges all the time and different

350

00:22:00,520 --> 00:22:06,720

kinds of models, but you're not thinking

about, okay, what is the ready made model

351

00:22:06,720 --> 00:22:08,380

that they can apply to these data?

352

00:22:08,380 --> 00:22:15,380

It's more like how can I create a custom

model to these data knowing the

353

00:22:15,380 --> 00:22:17,900

constraints I have about my problem?

354

00:22:17,900 --> 00:22:18,316

And.

355

00:22:18,316 --> 00:22:22,656

thinking in a principled way instead of

thinking in a toolkit way.

356

00:22:22,656 --> 00:22:24,046

I definitely relate to that.

357

00:22:24,046 --> 00:22:24,866

I find that amazing.

358

00:22:24,866 --> 00:22:29,316

I'll just add to that, which is this is

not only some type of aesthetic or

359

00:22:29,316 --> 00:22:30,336

theoretical idea.

360

00:22:30,336 --> 00:22:34,116

I think it's actually strongly tied into

good practice that makes it easier to

361

00:22:34,116 --> 00:22:35,056

solve problems.

362

00:22:35,056 --> 00:22:36,516

And by that, what do I mean?

363

00:22:36,516 --> 00:22:43,056

Well, so I did a very brief undergraduate

research project in a biology lab,

364

00:22:43,056 --> 00:22:44,836

computational biology lab.

365

00:22:44,956 --> 00:22:48,428

And just looking at the empirical workflow

that was done,

366

00:22:48,428 --> 00:22:52,728

made me very suspicious about the process,

which is, you know, you might have some

367

00:22:52,728 --> 00:22:57,948

data and then you'll hit it with PCA and

you'll get some projection of the data and

368

00:22:57,948 --> 00:23:00,928

then you'll use a random forest classifier

and you're going to classify it in

369

00:23:00,928 --> 00:23:01,408

different ways.

370

00:23:01,408 --> 00:23:04,487

And then you're going to use the

classification and some type of logistic

371

00:23:04,487 --> 00:23:04,868

regression.

372

00:23:04,868 --> 00:23:08,808

So you're just chaining these ad hoc

different data analyses to come up with

373

00:23:08,808 --> 00:23:10,108

some final story.

374

00:23:10,108 --> 00:23:14,008

And while that might be okay to get you

some specific result, it doesn't really

375

00:23:14,008 --> 00:23:18,382

tell you anything about how changing one

modeling choice in this pipeline.

376

00:23:18,382 --> 00:23:23,102

is going to impact your final inference

because this sort of mix and match

377

00:23:23,102 --> 00:23:29,182

approach of applying different ad hoc

estimators to solve different subtasks

378

00:23:29,182 --> 00:23:34,222

doesn't really give us a way to iterate on

our models, understand their limitations

379

00:23:34,222 --> 00:23:38,702

very well, knowing their sensitivity to

different choices, or even building

380

00:23:38,702 --> 00:23:42,222

computational systems that automate a lot

of these things, right?

381

00:23:42,222 --> 00:23:43,756

Like probabilistic programs.

382

00:23:43,756 --> 00:23:48,666

Like you're saying, we can write our data

generating process as the workflow itself,

383

00:23:48,666 --> 00:23:49,416

right?

384

00:23:49,416 --> 00:23:53,596

Rather than, you know, maybe in Matlab

I'll run PCA and then, you know, I'll use

385

00:23:53,596 --> 00:23:55,116

scikit-learn in Python.

386

00:23:55,116 --> 00:24:00,596

Without, I think, this type of prior

distribution over our data, it becomes

387

00:24:00,596 --> 00:24:07,308

very hard to reason formally about our

entire inference workflow, which would...

388

00:24:07,308 --> 00:24:11,848

know, which probabilistic programming

languages are trying to make easier and

389

00:24:11,848 --> 00:24:15,268

give a more principled approach that's

more amenable to engineering, to

390

00:24:15,268 --> 00:24:18,308

optimization, to things of that sort.

391

00:24:18,388 --> 00:24:18,528

Yeah.

392

00:24:18,528 --> 00:24:19,728

Yeah, yeah.

393

00:24:19,728 --> 00:24:21,148

Fantastic point.

394

00:24:21,188 --> 00:24:22,108

Definitely.

395

00:24:22,108 --> 00:24:27,548

And that's also the way I personally tend

to teach Bayesian stats.

396

00:24:27,548 --> 00:24:35,048

Now it's much more on a, let's say,

principle-based way instead of, and

397

00:24:35,048 --> 00:24:36,972

workflow-based instead of just...

398

00:24:36,972 --> 00:24:43,552

Okay, Poisson regression is this

multinomial regression is that I find that

399

00:24:43,552 --> 00:24:49,212

much more powerful because then when

students get out in the wild, they are

400

00:24:49,212 --> 00:24:56,572

used to first think about the problem and

then try to see how they could solve it

401

00:24:56,572 --> 00:25:03,686

instead of just trying to find, okay,

which model is going to be the most.

402

00:25:03,788 --> 00:25:09,048

useful here in the models that I already

know, because then if the data are

403

00:25:09,048 --> 00:25:11,908

different, you're going to have a lot of

problems.

404

00:25:12,128 --> 00:25:12,828

Yeah.

405

00:25:13,828 --> 00:25:22,928

And so you actually talked about the

different topics that you work on.

406

00:25:22,928 --> 00:25:24,798

There are a lot I want to ask you about.

407

00:25:24,798 --> 00:25:32,468

One of my favorites, and actually I think

Colin also has been working a bit on that

408

00:25:32,468 --> 00:25:33,324

lately.

409

00:25:33,324 --> 00:25:39,384

is the development of AutoGP.jl.

410

00:25:39,704 --> 00:25:44,584

So I think that'd be cool to talk about

that.

411

00:25:45,504 --> 00:25:51,264

What inspired you to develop that package,

which is in Julia?

412

00:25:51,404 --> 00:25:56,644

Maybe you can also talk about that if you

mainly develop in Julia most of the time,

413

00:25:56,644 --> 00:25:59,584

or if that was mostly useful for that

project.

414

00:25:59,584 --> 00:26:02,498

And how does this package...

415

00:26:02,604 --> 00:26:09,784

advance, like help the learning structure

of Gaussian process kernels because if I

416

00:26:09,784 --> 00:26:13,184

understand correctly, that's what the

package is mostly about.

417

00:26:13,184 --> 00:26:18,024

So yeah, if you can give a primer to

listeners about that.

418

00:26:18,024 --> 00:26:18,604

Definitely.

419

00:26:18,604 --> 00:26:19,004

Yes.

420

00:26:19,004 --> 00:26:26,644

So Gaussian Processes are a pretty

standard model that's used in many

421

00:26:26,644 --> 00:26:28,108

different application areas.

422

00:26:28,108 --> 00:26:31,728

spatial temporal statistics and many

engineering applications based on

423

00:26:31,728 --> 00:26:32,928

optimization.

424

00:26:33,268 --> 00:26:39,208

So these Gaussian process models are

parameterized by covariance functions,

425

00:26:39,208 --> 00:26:44,088

which specify how the data produced by

this Gaussian process co-varies across

426

00:26:44,088 --> 00:26:49,768

time, across space, across any domain

which you're able to define some type of

427

00:26:49,768 --> 00:26:51,208

covariance function.

428

00:26:51,208 --> 00:26:55,372

But one of the main challenges in using a

Gaussian process for modeling your data,

429

00:26:55,372 --> 00:27:00,252

is making the structural choice about what

should the covariance structure be.

430

00:27:01,552 --> 00:27:05,792

So, you know, the one of the universal

choices or the most common choices is to

431

00:27:05,792 --> 00:27:10,972

say, you know, some type of a radial basis

function for my data, the RBF kernel, or,

432

00:27:10,972 --> 00:27:15,172

you know, maybe a linear kernel or a

polynomial kernel, somehow hoping that

433

00:27:15,172 --> 00:27:18,232

you'll make the right choice to model your

data accurately.

434

00:27:18,232 --> 00:27:24,364

So the inspiration for auto GP or

automatic Gaussian process is to try and

435

00:27:24,364 --> 00:27:28,704

use the data not only to infer the numeric

parameters of the Gaussian process, but

436

00:27:28,704 --> 00:27:32,964

also the structural parameters or the

actual symbolic structure of this

437

00:27:32,964 --> 00:27:34,124

covariance function.

438

00:27:34,124 --> 00:27:37,884

And here we are drawing our inspiration

from work which is maybe almost 10 years

439

00:27:37,884 --> 00:27:43,644

now from David Duvenaud and colleagues

called the Automated Statistician Project,

440

00:27:44,424 --> 00:27:50,684

or ABCD, Automatic Bayesian Covariance

Discovery, which introduced this idea of

441

00:27:50,684 --> 00:27:52,492

defining a symbolic language.

442

00:27:52,492 --> 00:27:57,882

over Gaussian process covariance functions

or covariance kernels and using a grammar,

443

00:27:57,882 --> 00:28:03,612

using a recursive grammar and trying to

infer an expression in that grammar given

444

00:28:03,612 --> 00:28:04,892

the observed data.
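
To make the idea of a recursive grammar over kernels concrete, here is a small illustrative sketch written with Gen, the probabilistic programming language discussed later in the episode. The name kernel_prior, the leaf kernels, and the probabilities are invented for illustration; this is not AutoGP's actual prior.

```julia
using Gen

@gen function kernel_prior(depth::Int)
    # Stop at a leaf kernel with higher probability as the tree gets deeper;
    # otherwise combine two sub-kernels with a sum or a product.
    is_leaf ~ bernoulli(depth >= 3 ? 1.0 : 0.4)
    if is_leaf
        kind ~ categorical([1/3, 1/3, 1/3])   # 1 = linear, 2 = periodic, 3 = RBF
        param ~ gamma(1, 1)                   # slope / period / lengthscale
        return (kind, param)
    else
        op ~ bernoulli(0.5)                   # true => sum, false => product
        left ~ kernel_prior(depth + 1)
        right ~ kernel_prior(depth + 1)
        return (op ? :+ : :*, left, right)
    end
end

# Each simulated trace is a random symbolic kernel expression from the grammar.
trace = Gen.simulate(kernel_prior, (0,))
println(Gen.get_retval(trace))
```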

445

00:28:04,892 --> 00:28:10,532

So, you know, in a time series setting,

for example, you might have time on the

446

00:28:10,532 --> 00:28:13,452

horizontal axis and the variable on the y

-axis and you just have some variable

447

00:28:13,452 --> 00:28:14,732

that's evolving.

448

00:28:14,892 --> 00:28:17,632

You don't know necessarily the dynamics of

that, right?

449

00:28:17,632 --> 00:28:20,852

There might be some periodic structure in

the data or there might be multiple

450

00:28:20,852 --> 00:28:22,284

periodic effects.

451

00:28:22,284 --> 00:28:25,464

Or there might be a linear trend that's

overlaying the data.

452

00:28:25,464 --> 00:28:30,944

Or there might be a point in time in which

the data is switching between some process

453

00:28:30,944 --> 00:28:34,724

before the change point and some process

after the change point.

454

00:28:34,724 --> 00:28:38,864

Obviously, for example, in the COVID era,

almost all macroeconomic data sets had

455

00:28:38,864 --> 00:28:42,144

some type of change point around April

2020.

456

00:28:42,144 --> 00:28:45,384

And we see that in the empirical data that

we're analyzing today.

457

00:28:45,384 --> 00:28:50,424

So the question is, how can we

automatically surface these structural

458

00:28:50,424 --> 00:28:51,468

choices?

459

00:28:51,468 --> 00:28:53,068

using Bayesian inference.

460

00:28:53,088 --> 00:28:57,888

So the original approach that was in the

automated statistician was based on a type

461

00:28:57,888 --> 00:28:58,908

of greedy search.

462

00:28:58,908 --> 00:29:03,528

So they were trying to say, let's find the

single kernel that maximizes the

463

00:29:03,528 --> 00:29:05,008

probability of the data.

464

00:29:05,008 --> 00:29:05,288

Okay.

465

00:29:05,288 --> 00:29:09,368

So they're trying to do a greedy search

over these kernel structures for Gaussian

466

00:29:09,368 --> 00:29:13,828

processes using these different search

operators.

467

00:29:13,828 --> 00:29:18,168

And for each different kernel, you might

find the maximum likelihood parameter, et

468

00:29:18,168 --> 00:29:18,608

cetera.

469

00:29:18,608 --> 00:29:20,460

And I think that's a fine approach.

470

00:29:20,460 --> 00:29:23,760

But it does run into some serious

limitations, and I'll mention a few of

471

00:29:23,760 --> 00:29:24,460

them.

472

00:29:24,460 --> 00:29:29,560

One limitation is that greedy search is in

a sense not representing any uncertainty

473

00:29:29,560 --> 00:29:31,520

about what's the right structure.

474

00:29:31,520 --> 00:29:36,100

It's just finding a single best structure

to maximize some probability or maybe

475

00:29:36,100 --> 00:29:37,500

likelihood of the data.

476

00:29:37,500 --> 00:29:41,560

But we know just like parameters are

uncertain, structure can also be quite

477

00:29:41,560 --> 00:29:43,700

uncertain because the data is very noisy.

478

00:29:43,700 --> 00:29:45,700

We may have sparse data.

479

00:29:45,700 --> 00:29:49,420

And so, you know, we'd want type of

inference systems that are more robust.

480

00:29:49,420 --> 00:29:56,100

when discovering the temporal structure in

the data and that greedy search doesn't

481

00:29:56,100 --> 00:30:00,840

really give us that level of robustness

through expressing posterior uncertainty.

482

00:30:01,060 --> 00:30:06,099

I think another challenge with greedy

search is its scalability.

483

00:30:06,100 --> 00:30:11,740

And by that, if you have a very large data

set in a greedy search algorithm, we're

484

00:30:11,740 --> 00:30:15,740

typically at each stage of the search,

we're looking at the entire data set to

485

00:30:15,740 --> 00:30:16,556

score our model.

486

00:30:16,556 --> 00:30:20,166

And this is also a traditional Markov

chain Monte Carlo algorithms.

487

00:30:20,166 --> 00:30:24,976

We often score our data set, but in the

Gaussian process setting, scoring the data

488

00:30:24,976 --> 00:30:25,916

set is very expensive.

489

00:30:25,916 --> 00:30:29,316

If you have N data points, it's going to

cost you N cubed.

490

00:30:29,396 --> 00:30:34,056

And so it becomes quite infeasible to run

greedy search or even pure Markov chain

491

00:30:34,056 --> 00:30:38,396

Monte Carlo, where at each step, each time

you change the parameters or you change

492

00:30:38,396 --> 00:30:40,976

the kernel, you need to now compute the

full likelihood.
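
For reference, the N-cubed cost comes from factorizing the N-by-N covariance matrix inside the GP log marginal likelihood; a standard sketch for a zero-mean GP whose covariance matrix K already includes the observation noise:

```julia
using LinearAlgebra

# log p(y) = -0.5 * y' * inv(K) * y - 0.5 * logdet(K) - (n/2) * log(2π)
function gp_log_marginal(K::AbstractMatrix, y::AbstractVector; jitter=1e-8)
    n = length(y)
    C = cholesky(Symmetric(K) + jitter * I)   # the O(n^3) step
    α = C \ y                                 # triangular solves, O(n^2)
    return -0.5 * dot(y, α) - sum(log, diag(C.L)) - 0.5 * n * log(2π)
end
```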

493

00:30:40,996 --> 00:30:46,060

And so the second motivation in AutoGP is

to build an inference algorithm.

494

00:30:46,060 --> 00:30:52,640

that is not looking at the whole data set

at each point in time, but using subsets

495

00:30:52,640 --> 00:30:55,140

of the data set that are sequentially

growing.

496

00:30:55,140 --> 00:30:59,820

And that's where the sequential Monte

Carlo inference algorithm comes in.
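
Schematically, the growing-subsets strategy looks like the sketch below, written against Gen's particle-filter API. The generative function gp_model is hypothetical here (assumed to take the observed time points as its argument and to sample the data at addresses (:y, i)); AutoGP's real model and rejuvenation moves are richer than this.

```julia
using Gen

function smc_over_growing_data(gp_model, ts, ys; n_particles=8, batch=50)
    n = length(ys)
    constraints(lo, hi) = begin
        cm = Gen.choicemap()
        for i in lo:hi
            cm[(:y, i)] = ys[i]   # constrain the newly revealed observations
        end
        cm
    end
    k = min(batch, n)
    state = Gen.initialize_particle_filter(gp_model, (ts[1:k],),
                                           constraints(1, k), n_particles)
    while k < n
        Gen.maybe_resample!(state, ess_threshold=n_particles / 2)
        # AutoGP also rejuvenates each particle here: involutive MCMC moves over
        # the kernel expression and HMC moves over its numeric parameters.
        knew = min(k + batch, n)
        Gen.particle_filter_step!(state, (ts[1:knew],), (Gen.UnknownChange(),),
                                  constraints(k + 1, knew))
        k = knew
    end
    return Gen.sample_unweighted_traces(state, n_particles)
end
```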

497

00:30:59,980 --> 00:31:03,520

So AutoGP is implemented in Julia.

498

00:31:03,520 --> 00:31:07,940

And the API is that basically you give it

a one-dimensional time series.

499

00:31:07,940 --> 00:31:09,580

You hit infer.

500

00:31:09,580 --> 00:31:14,120

And then it's going to report an ensemble

of Gaussian processes or a sample from my

501

00:31:14,120 --> 00:31:17,860

posterior distribution, where each

Gaussian process has some particular

502

00:31:17,860 --> 00:31:19,510

structure and some numeric parameters.

503

00:31:19,510 --> 00:31:23,640

And you can show the user, hey, I've

inferred these hundred GPs from my

504

00:31:23,640 --> 00:31:24,340

posterior.

505

00:31:24,340 --> 00:31:27,820

And then they can start using them for

generating predictions.

506

00:31:27,820 --> 00:31:32,680

You can use them to find outliers because

these are probabilistic models.

507

00:31:32,680 --> 00:31:35,340

You can use them for a lot of interesting

tasks.

508

00:31:35,340 --> 00:31:37,644

Or you might say, you know,

509

00:31:37,644 --> 00:31:41,344

This particular model actually isn't

consistent with what I know about the

510

00:31:41,344 --> 00:31:41,534

data.

511

00:31:41,534 --> 00:31:45,324

So you might remove one of the posterior

samples from your ensemble.

512

00:31:46,744 --> 00:31:51,104

Yeah, so those are, you know, we used

AutoGP on the M3.

513

00:31:51,104 --> 00:31:53,884

We benchmarked it on the M3 competition

data.

514

00:31:53,884 --> 00:32:01,964

M3 is around, or the monthly data sets in

M3 are around 1,500 time series, you

515

00:32:01,964 --> 00:32:05,644

know, between 100 and 500 observations in

length.

516

00:32:05,644 --> 00:32:09,064

And we compared the performance against

different statistics baselines and machine

517

00:32:09,064 --> 00:32:10,204

learning baselines.

518

00:32:10,204 --> 00:32:14,744

And it's actually able to find pretty

common sense structures in these economic

519

00:32:14,744 --> 00:32:15,124

data.

520

00:32:15,124 --> 00:32:19,584

Some of them have seasonal features,

multiple seasonal effects as well.

521

00:32:19,584 --> 00:32:24,924

And what's interesting is we don't need to

customize the prior to analyze each data

522

00:32:24,924 --> 00:32:25,184

set.

523

00:32:25,184 --> 00:32:27,504

It's essentially able to discover.

524

00:32:27,664 --> 00:32:31,164

And what's also interesting is that

sometimes when the data set just looks

525

00:32:31,164 --> 00:32:34,644

like a random walk, it's going to learn a

covariance structure, which emulates a

526

00:32:34,644 --> 00:32:35,602

random walk.

527

00:32:35,628 --> 00:32:39,068

So by having a very broad prior

distribution on the types of covariance

528

00:32:39,068 --> 00:32:43,488

structures that you see, it's able to find

which of these are plausible explanation

529

00:32:43,488 --> 00:32:45,348

given the data.

530

00:32:46,128 --> 00:32:48,808

Yes, as you mentioned, we implemented this

in Julia.

531

00:32:48,808 --> 00:32:53,708

The reason is that AutoGP is built on the

Gen probabilistic programming language,

532

00:32:53,708 --> 00:32:56,648

which is embedded in the Julia language.

533

00:32:56,888 --> 00:33:03,532

And the reason that Gen, I think, is a

very useful system for this problem.

534

00:33:03,532 --> 00:33:09,752

So Gen was developed primarily by Marco

Cusumano-Towner, who wrote a PhD thesis.

535

00:33:09,752 --> 00:33:13,272

He was a colleague of mine at the MIT

Probabilistic Computing Project.

536

00:33:13,832 --> 00:33:18,532

And Gen really, it's a Turing complete

language and has programmable inference.

537

00:33:18,532 --> 00:33:22,812

So you're able to write a prior

distribution over these symbolic

538

00:33:22,812 --> 00:33:25,172

expressions in a very natural way.

539

00:33:25,172 --> 00:33:31,192

And you're able to customize an inference

algorithm that's able to solve this

540

00:33:31,192 --> 00:33:32,652

problem efficiently.

541

00:33:32,952 --> 00:33:33,132

And

542

00:33:33,132 --> 00:33:37,012

What really drew us to Gen for this

problem, I think, are twofold.

543

00:33:37,012 --> 00:33:39,952

The first is its support for sequential

Monte Carlo inference.

544

00:33:39,952 --> 00:33:43,872

So it has a pretty mature library for

doing sequential Monte Carlo.

545

00:33:43,892 --> 00:33:48,002

And sequential Monte Carlo construed more

generally than just particle filtering,

546

00:33:48,002 --> 00:33:51,712

but other types of inference over

sequences of probability distributions.

547

00:33:51,712 --> 00:33:54,932

So particle filters are one type of

sequential Monte Carlo algorithm you might

548

00:33:54,932 --> 00:33:55,372

write.

549

00:33:55,372 --> 00:33:59,392

But you might do some type of temperature

annealing or data annealing or other types

550

00:33:59,392 --> 00:34:01,676

of sequentialization strategies.

551

00:34:01,676 --> 00:34:05,236

And Gen provides a very nice toolbox and

abstraction for experimenting with

552

00:34:05,236 --> 00:34:08,136

different types of sequential Monte Carlo

approaches.

553

00:34:08,136 --> 00:34:11,036

And so we definitely made good use of that

library when developing our inference

554

00:34:11,036 --> 00:34:12,076

algorithm.

555

00:34:12,076 --> 00:34:18,276

The second reason I think that Gen was

very nice to use is its library for

556

00:34:18,276 --> 00:34:20,176

involutive MCMC.

557

00:34:20,276 --> 00:34:27,776

And involutive MCMC, it's a relatively new

framework.

558

00:34:27,776 --> 00:34:31,340

It was discovered, I think, concurrently

559

00:34:31,340 --> 00:34:36,180

and independently both by Marco and other

folks.

560

00:34:37,200 --> 00:34:40,940

And this is kind of, you can think of it

as a generalization of reversible jump

561

00:34:40,940 --> 00:34:41,900

MCMC.

562

00:34:41,900 --> 00:34:46,420

And it's really a unifying framework to

understand many different MCMC algorithms

563

00:34:46,420 --> 00:34:48,660

using a common terminology.

564

00:34:48,660 --> 00:34:54,200

And so there's a wonderful ICML paper

which lists 30 or so different algorithms

565

00:34:54,200 --> 00:34:58,620

that people use all the time like

Hamiltonian Monte Carlo, reversible jump

566

00:34:58,620 --> 00:35:01,196

MCMC, Gibbs sampling, Metropolis-Hastings,

567

00:35:01,196 --> 00:35:05,936

and expresses them using the language of

involutive MCMC.

568

00:35:05,936 --> 00:35:10,116

I believe the first author is Kirill Neklyudov,

although I might be mispronouncing that,

569

00:35:10,116 --> 00:35:11,996

sorry for that.

570

00:35:12,516 --> 00:35:18,556

So, Gen has a library for involutive MCMC,

which makes it quite easy to write

571

00:35:18,556 --> 00:35:24,516

different proposals for how you do this

inference over your symbolic expressions.

572

00:35:24,516 --> 00:35:29,096

Because when you're doing MCMC within the

inner loop of a sequential Monte Carlo

573

00:35:29,096 --> 00:35:29,964

algorithm,

574

00:35:29,964 --> 00:35:34,224

you need to somehow be able to improve

your current symbolic expressions for the

575

00:35:34,224 --> 00:35:36,564

covariance kernel, given the observed

data.

576

00:35:36,864 --> 00:35:41,784

And, uh, doing that is hard because

this is kind of a reversible jump

577

00:35:41,784 --> 00:35:44,064

algorithm where you make a structural

change.

578

00:35:44,064 --> 00:35:46,704

Then you need to maybe generate some new

parameters.

579

00:35:46,704 --> 00:35:49,124

You need the reverse probability of going

back.

580

00:35:49,124 --> 00:35:53,424

And so Gen has a lot of

automation and a library for implementing

581

00:35:53,424 --> 00:35:56,254

these types of structure moves in a very

high level way.

582

00:35:56,254 --> 00:35:59,500

And it automates the low-level math for

583

00:35:59,500 --> 00:36:03,600

computing the acceptance probability and

embedding all of that within an outer

584

00:36:03,600 --> 00:36:04,940

level SMC loop.

585

00:36:04,940 --> 00:36:08,740

And so this is, I think, one of my

favorite examples for what probabilistic

586

00:36:08,740 --> 00:36:14,000

programming can give us, which is very

expressive priors over these, you know,

587

00:36:14,000 --> 00:36:17,960

symbolic expressions generated by symbolic

grammars, powerful inference algorithms

588

00:36:17,960 --> 00:36:21,740

using combinations of sequential Monte

Carlo and involutive MCMC and reversible

589

00:36:21,740 --> 00:36:24,640

jump moves and gradient based inference

over the parameters.

590

00:36:24,640 --> 00:36:27,148

It really brings together a lot of the

591

00:36:27,148 --> 00:36:29,908

strengths of probabilistic

programming languages.

592

00:36:29,908 --> 00:36:34,548

And we showed at least on these M3

datasets that they can actually be quite

593

00:36:34,548 --> 00:36:38,028

competitive with state-of-the-art

solutions, both in statistics and in

594

00:36:38,028 --> 00:36:39,348

machine learning.

595

00:36:40,428 --> 00:36:45,948

I will say, though, that as with

traditional GPs, the scalability is really

596

00:36:45,948 --> 00:36:47,328

in the likelihood.

597

00:36:47,988 --> 00:36:52,928

So as for whether AutoGP can handle datasets with

10,000 data points, that's actually too

598

00:36:52,928 --> 00:36:55,084

hard because ultimately,

599

00:36:55,084 --> 00:36:59,304

Once you've seen all the data in your

sequential Monte Carlo, you will be forced

600

00:36:59,304 --> 00:37:02,804

to do this sort of N cubed scaling, which

then, you know, you need some type of

601

00:37:02,804 --> 00:37:06,404

improvements or some type of approximation

for handling larger data.

602

00:37:06,404 --> 00:37:11,084

But I think what's more interesting in

AutoGP is not necessarily that it's

603

00:37:11,084 --> 00:37:14,404

applied to inferring structures of

Gaussian processes, but that it's sort of

604

00:37:14,404 --> 00:37:18,444

a library for inferring probabilistic

structure and showing how to do that by

605

00:37:18,444 --> 00:37:21,404

integrating these different inference

methodologies.

606

00:37:21,784 --> 00:37:22,264

Hmm.

607

00:37:22,264 --> 00:37:23,276

Okay.

608

00:37:23,276 --> 00:37:25,696

Yeah, so many things here.

609

00:37:26,396 --> 00:37:33,936

So first, I put all the links to

AutoGP.jl in the show notes.

610

00:37:33,936 --> 00:37:41,496

I also put a link to the underlying paper

that you've written with some co-authors

611

00:37:41,496 --> 00:37:46,736

about, well, the sequential Monte Carlo

learning that you're doing to discover

612

00:37:46,736 --> 00:37:51,242

these time-series structures for people

who want to dig deeper.

613

00:37:51,340 --> 00:37:57,940

And I also put a link to all, well, most

of the LBS episodes where we talk about

614

00:37:57,940 --> 00:38:02,020

Gaussian processes for people who need a

bit more background information because

615

00:38:02,020 --> 00:38:06,800

here we're mainly going to talk about how

you do that and so on, and how useful it

616

00:38:06,800 --> 00:38:07,780

is.

617

00:38:07,780 --> 00:38:11,480

And we're not going to give a primer on

what Gaussian processes are.

618

00:38:11,480 --> 00:38:17,160

So if you want that, folks, there are a

bunch of episodes in the show notes for

619

00:38:17,160 --> 00:38:18,120

that.

620

00:38:18,540 --> 00:38:19,908

So...

621

00:38:20,492 --> 00:38:28,112

on that, basically the practical utility of

that time-series discovery.

622

00:38:28,172 --> 00:38:36,832

So if I understood correctly, for now, you

can do that only on one-dimensional input

623

00:38:36,832 --> 00:38:37,472

data.

624

00:38:37,472 --> 00:38:42,292

So that would be basically on a time

series.

625

00:38:42,292 --> 00:38:47,512

You cannot input, let's say,

categories.

626

00:38:47,512 --> 00:38:49,472

These could be age groups.

627

00:38:49,472 --> 00:38:50,156

So,

628

00:38:50,156 --> 00:38:55,736

you could one-hot encode them; usually I think that's

the way it's done: the way to give that to a

629

00:38:55,736 --> 00:39:00,176

GP would be to one-hot encode each of

these age groups.

630

00:39:00,176 --> 00:39:03,256

And then that means, let's say you have

four age groups.

631

00:39:03,256 --> 00:39:08,456

Now the input dimension of your GP is not

one, which is time, but it's five.

632

00:39:08,456 --> 00:39:12,396

So one for time and four for the age

groups.
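
As a concrete sketch of the five-column input just described (one time column plus four one-hot age-group columns; the variable names and values below are purely illustrative):

```julia
# Build a GP input matrix: one column for time, four one-hot columns
# for a categorical "age group" feature (illustrative data only).
t = collect(0.0:1.0:9.0)                   # 10 time points
group = [1, 2, 3, 4, 1, 2, 3, 4, 1, 2]     # age-group index per observation

onehot(g, K) = [g == k ? 1.0 : 0.0 for k in 1:K]

# Each row is (time, onehot(group)...), giving a 10 x 5 input matrix.
X = vcat([hcat(t[i], onehot(group[i], 4)...) for i in eachindex(t)]...)
@assert size(X) == (10, 5)
```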

633

00:39:12,596 --> 00:39:14,660

This would not work here, right?

634

00:39:15,116 --> 00:39:15,656

Right, yes.

635

00:39:15,656 --> 00:39:18,516

So at the moment, we're focused on, and

these are called, I guess, in

636

00:39:18,516 --> 00:39:22,396

econometrics, pure time series models,

where you're only trying to do inference

637

00:39:22,396 --> 00:39:24,996

on the time series based on its own

history.

638

00:39:24,996 --> 00:39:29,636

I think the extensions that you're

proposing are very natural to consider.

639

00:39:29,636 --> 00:39:34,136

You might have a multi-input Gaussian

process where you're not only looking at

640

00:39:34,136 --> 00:39:39,296

your own history, but you're also

considering some type of categorical

641

00:39:39,296 --> 00:39:40,316

variable.

642

00:39:40,316 --> 00:39:44,780

Or you might have exogenous covariates

evolving along with the time series.

643

00:39:44,780 --> 00:39:48,640

If you want to predict temperature, for

example, you might have the wind speed and

644

00:39:48,640 --> 00:39:51,420

you might want to use that as a feature

for your Gaussian process.

645

00:39:51,420 --> 00:39:54,230

Or you might have a multiple

output Gaussian process.

646

00:39:54,230 --> 00:39:58,380

You want a Gaussian process over multiple

different time series generally.

647

00:39:58,380 --> 00:40:02,580

And I think all of these variants are, you

know, they're possible to develop.

648

00:40:02,580 --> 00:40:06,680

There's no fundamental difficulty, but the

main, I think the main challenge is how

649

00:40:06,680 --> 00:40:11,040

can you define a domain specific language

over these covariance structures for

650

00:40:11,040 --> 00:40:13,324

multivariate input data? That

651

00:40:13,324 --> 00:40:14,844

becomes a little bit more challenging.

652

00:40:14,844 --> 00:40:19,864

So in the time series setting, what's nice

is we can interpret how any type of

653

00:40:19,864 --> 00:40:24,584

covariance kernel is going to impact the

actual prior over time series.

654

00:40:24,584 --> 00:40:27,484

Once we're in the multi-dimensional

setting, we need to think about how to

655

00:40:27,484 --> 00:40:31,004

combine the kernels for different

dimensions in a way that's actually

656

00:40:31,004 --> 00:40:34,884

meaningful for modeling and to ensure that

it's more tractable.

657

00:40:34,884 --> 00:40:40,204

But I think extensions of the DSL to

handle multiple inputs, exogenous

658

00:40:40,204 --> 00:40:41,932

covariates, multiple outputs,

659

00:40:41,932 --> 00:40:43,512

These are all great directions.

660

00:40:43,512 --> 00:40:47,712

And I'll just add on top of that, I think

another important direction is using some

661

00:40:47,712 --> 00:40:52,632

of the more recent approximations for

Gaussian processes.

662

00:40:52,632 --> 00:40:55,612

So we're not bottlenecked by the n cubed

scaling.

663

00:40:55,612 --> 00:40:59,502

So there are, I think, a few different

approaches that have been developed.

664

00:40:59,502 --> 00:41:05,272

There are approaches which are based on

stochastic PDEs or state space

665

00:41:05,272 --> 00:41:08,780

approximations of Gaussian processes,

which are quite promising.

666

00:41:08,780 --> 00:41:12,040

There's some other things like nearest

neighbor Gaussian processes, but I'm a

667

00:41:12,040 --> 00:41:16,520

little less confident about those because

we lose a lot of the nice affordances of

668

00:41:16,520 --> 00:41:19,720

GPs once we start doing nearest neighbor

approximations.

669

00:41:20,020 --> 00:41:26,910

But I think there's a lot of new methods

for approximate GPs.

670

00:41:26,910 --> 00:41:33,180

So we might do stochastic variational

inference, for example, an SVGP.

671

00:41:33,180 --> 00:41:37,540

So I think as we think about handling more

672

00:41:38,636 --> 00:41:42,216

richer types of data, then we should

also think about how to start introducing

673

00:41:42,216 --> 00:41:45,616

some of these more scalable approximations

to make sure we can still efficiently do

674

00:41:45,616 --> 00:41:48,456

the structure learning in that setting.

675

00:41:49,756 --> 00:41:53,556

Yeah, that would be awesome for sure.

676

00:41:53,556 --> 00:41:59,876

As someone much more on the practitioner

side than on the math side.

677

00:41:59,876 --> 00:42:02,396

Of course, that's where my head goes

first.

678

00:42:02,396 --> 00:42:06,556

You know, I'm like, oh, that'd be awesome,

but I would need to have that for it to be

679

00:42:06,556 --> 00:42:07,898

really practical.

680

00:42:07,980 --> 00:42:14,150

Um, and so if I use AutoGP.jl, so I give

it some time series data.

681

00:42:14,150 --> 00:42:17,570

Um, then what do I get back?

682

00:42:17,570 --> 00:42:28,040

Do I get back, um, the posterior samples of

the implied model, or do I get back

683

00:42:28,040 --> 00:42:31,520

the covariance structure?

684

00:42:32,000 --> 00:42:36,900

So that could be, I don't know what, what

form that could be, but I'm thinking, you

685

00:42:36,900 --> 00:42:37,164

know,

686

00:42:37,164 --> 00:42:42,804

Uh, often when I use GPs, I use them

inside other models, like I

687

00:42:42,804 --> 00:42:45,364

could use a GP in a linear regression, for

instance.

688

00:42:45,364 --> 00:42:51,364

And so I'm thinking that'd be cool if I'm

not sure about the covariance structure,

689

00:42:51,364 --> 00:42:56,344

especially if it can do the discovery of

the seasonality and things like that

690

00:42:56,344 --> 00:42:59,824

automatically, because seasonality is

always a bit weird and you have to

691

00:42:59,824 --> 00:43:03,484

add another GP that can handle

periodicity.

692

00:43:03,484 --> 00:43:06,860

Um, and then you have basically a sum of

GPs.

693

00:43:06,860 --> 00:43:10,500

And then you can take that sum of GPs and

put that in the linear predictor of the

694

00:43:10,500 --> 00:43:11,340

linear regression.

695

00:43:11,340 --> 00:43:13,060

That's usually how I use that.

696

00:43:13,060 --> 00:43:17,520

And very often, I'm using categorical

predictors almost always.

697

00:43:18,420 --> 00:43:23,420

And I'm thinking what would be super cool

is that I can outsource that discovery

698

00:43:23,420 --> 00:43:30,940

part of the GP to the computer like you're

doing with this algorithm.

699

00:43:30,940 --> 00:43:34,360

And then I get back under what form?

700

00:43:34,360 --> 00:43:34,920

I don't know yet.

701

00:43:34,920 --> 00:43:36,612

I'm just thinking about that.

702

00:43:36,620 --> 00:43:41,420

this covariance structure that I can just,

which would be an MV normal, like a

703

00:43:41,420 --> 00:43:45,120

multivariate normal in a way, that I just use

in my linear predictor.

704

00:43:45,120 --> 00:43:48,940

And then I can use that, for instance, in

a PyMC model or something like that,

705

00:43:48,940 --> 00:43:51,800

without having to specify the GP myself.

706

00:43:51,980 --> 00:43:54,160

Is it something that's doable?

707

00:43:54,160 --> 00:43:56,460

Yeah, yeah, I think that's absolutely

right.

708

00:43:56,460 --> 00:44:01,080

So you can, because Gaussian processes are

compositional, just, you know, you

709

00:44:01,080 --> 00:44:05,120

mentioned the sum of two Gaussian

processes, which corresponds to the sum of

710

00:44:05,120 --> 00:44:06,360

two kernels.

711

00:44:06,412 --> 00:44:11,392

So if I have Gaussian process one plus

Gaussian process two, that's the same as

712

00:44:11,392 --> 00:44:15,192

the Gaussian process whose covariance is

k1 plus k2.
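
As a minimal illustration of that additivity, with no particular GP library assumed and arbitrary example hyperparameters:

```julia
using LinearAlgebra, Distributions

# Two simple base kernels (hyperparameter values are arbitrary examples).
k_linear(x, y; c=0.5) = (x - c) * (y - c)
k_periodic(x, y; p=1.0, l=0.5) = exp(-2 * sin(pi * abs(x - y) / p)^2 / l^2)

# Sum kernel: the covariance of GP1 + GP2 when the two GPs are independent.
k_sum(x, y) = k_linear(x, y) + k_periodic(x, y)

xs = collect(0.0:0.25:3.0)
K = [k_sum(x, y) for x in xs, y in xs] + 1e-6 * I   # jitter for stability

# One draw from the combined GP prior at these inputs.
f = rand(MvNormal(zeros(length(xs)), K))
```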

713

00:44:15,192 --> 00:44:21,092

And so what that means is we can take our

synthesized kernel, which is comprised of

714

00:44:21,092 --> 00:44:24,992

some base kernels and then maybe sums and

products and change points, and we can

715

00:44:24,992 --> 00:44:33,972

wrap all of these in just one mega GP,

basically, which would encode the entire

716

00:44:33,972 --> 00:44:35,852

posterior distribution or, you know,

717

00:44:35,916 --> 00:44:39,216

a summary of all of the samples in one GP.

718

00:44:39,676 --> 00:44:43,096

Another, and I think you also mentioned an

important point, which is multivariate

719

00:44:43,096 --> 00:44:43,826

normals.

720

00:44:43,826 --> 00:44:47,976

You can also think of the posterior as

just a mixture of these multivariate

721

00:44:47,976 --> 00:44:48,616

normals.

722

00:44:48,616 --> 00:44:54,436

So let's say I'm not going to sort of

compress them into a single GP, but I'm

723

00:44:54,436 --> 00:44:59,236

actually going to represent the output of

AutoGP as a mixture of multivariate

724

00:44:59,236 --> 00:45:00,026

normals.

725

00:45:00,026 --> 00:45:02,636

And that would be another type of API.
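
A minimal sketch of that mixture-of-multivariate-normals view, using placeholder per-particle predictive means, covariances, and SMC weights:

```julia
using LinearAlgebra, Distributions

# Hypothetical outputs of a structure-learning run at 3 test points:
# one predictive MvNormal per particle plus normalized SMC weights.
means   = [zeros(3), ones(3)]
covs    = [Matrix(1.0I, 3, 3), 2.0 * Matrix(1.0I, 3, 3)]
weights = [0.7, 0.3]

components = [MvNormal(m, C) for (m, C) in zip(means, covs)]
predictive = MixtureModel(components, weights)

y_draw = rand(predictive)            # sample a joint trajectory at the test points
lp = logpdf(predictive, y_draw)      # score data under the weighted mixture
```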

726

00:45:02,636 --> 00:45:05,540

So depending on exactly

727

00:45:05,580 --> 00:45:10,340

how you're planning to use the GP, I think

you can use the output of AutoGP in the

728

00:45:10,340 --> 00:45:14,040

right way, because ultimately, it's

producing some covariance kernels, you

729

00:45:14,040 --> 00:45:19,580

might aggregate them all into a GP, or you

might compose them together to make a

730

00:45:19,580 --> 00:45:21,020

mixture of GPs.

731

00:45:21,020 --> 00:45:27,560

And you can export this to PyTorch, or

most of the current libraries for GPs

732

00:45:27,560 --> 00:45:32,240

support composing the GPs with one

another, et cetera.

733

00:45:33,100 --> 00:45:36,500

So I think depending on the use case, it

should be quite straightforward to figure

734

00:45:36,500 --> 00:45:40,960

out how to leverage the output of AutoGP

to use within the inner loop of some broader model,

735

00:45:40,960 --> 00:45:46,020

or within the internals of some larger

linear regression model or other type of

736

00:45:46,020 --> 00:45:47,120

model.

737

00:45:48,380 --> 00:45:55,420

Yeah, that's definitely super cool because

then you can, well, yeah, use that,

738

00:45:55,780 --> 00:46:01,516

outsource that part of the model where I

think the algorithm probably...

739

00:46:01,516 --> 00:46:08,536

If not now, in just a few years, it's

going to do a better job than most

740

00:46:08,536 --> 00:46:12,756

modelers, at least to have a rough first

draft.

741

00:46:12,836 --> 00:46:13,636

That's right.

742

00:46:13,636 --> 00:46:14,976

The first draft.

743

00:46:14,976 --> 00:46:20,416

A data scientist who's determined enough

to beat AutoGP, probably they can do it if

744

00:46:20,416 --> 00:46:23,056

they put in enough effort just to study

the data.

745

00:46:23,216 --> 00:46:27,596

But it's getting a first pass model that's

actually quite good as compared to other

746

00:46:27,596 --> 00:46:29,196

types of automated techniques.

747

00:46:29,196 --> 00:46:29,876

Yeah, exactly.

748

00:46:29,876 --> 00:46:31,006

I mean, that's right.

749

00:46:31,006 --> 00:46:37,296

It's like asking for a first draft of, I

don't know, a blog post from ChatGPT and then

750

00:46:37,296 --> 00:46:41,896

going in there yourself and improving it

instead of starting everything from

751

00:46:41,896 --> 00:46:42,956

scratch.

752

00:46:42,956 --> 00:46:49,816

Yeah, for sure you could do it, but that's

not where your value added really lies.

753

00:46:50,496 --> 00:46:51,256

So yeah.

754

00:46:51,256 --> 00:46:56,086

So what you get is these kind of samples.

755

00:46:56,086 --> 00:46:58,252

In a way, do you get back samples?

756

00:46:58,252 --> 00:47:01,752

or do you get symbolic variables back?

757

00:47:01,752 --> 00:47:06,212

You get symbolic expressions for the

covariance kernels as well as the

758

00:47:06,212 --> 00:47:08,112

parameters embedded within them.

759

00:47:08,112 --> 00:47:12,032

So you might get, let's say you asked for

five posterior samples, you're going to

760

00:47:12,032 --> 00:47:14,862

have maybe one posterior sample, which is

a linear kernel.

761

00:47:14,862 --> 00:47:18,392

And then another posterior sample, which

is a linear times linear, so a quadratic

762

00:47:18,392 --> 00:47:18,912

kernel.

763

00:47:18,912 --> 00:47:22,812

And then maybe a third posterior sample,

which is again, a linear, and each of them

764

00:47:22,812 --> 00:47:24,652

will have their different parameters.

765

00:47:24,652 --> 00:47:26,892

And because we're using sequential Monte

Carlo,

766

00:47:26,892 --> 00:47:30,712

all of the posterior samples are

associated with weights.

767

00:47:31,232 --> 00:47:35,432

The sequential Monte Carlo returns a

weighted particle collection, which is

768

00:47:35,432 --> 00:47:37,312

approximating the posterior.

769

00:47:37,332 --> 00:47:41,372

So you get back these weighted particles,

which are symbolic expressions.

770

00:47:41,372 --> 00:47:44,882

And we have, in AutoGP, we have a minimal

GP prediction library.

771

00:47:44,882 --> 00:47:48,192

So you can actually put these symbolic

expressions into a GP to get a functional

772

00:47:48,192 --> 00:47:52,932

GP, or you can export them to a text file

and then use your favorite GP library and

773

00:47:52,932 --> 00:47:55,692

embed them within that as well.

774

00:47:55,692 --> 00:47:58,352

And we also get noise parameters.

775

00:47:58,352 --> 00:48:02,212

So each kernel is going to be associated

with the output noise.

776

00:48:02,212 --> 00:48:05,892

Because obviously depending on what kernel

you use, you're going to infer a different

777

00:48:05,892 --> 00:48:07,212

noise level.

778

00:48:07,672 --> 00:48:12,752

So you get a kernel structure, parameters,

and noise for each individual particle in

779

00:48:12,752 --> 00:48:15,072

your SMC ensemble.
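
To make that concrete, here is a hedged sketch of fitting and inspecting such an ensemble; the names below (GPModel, fit_smc!, Schedule.linear_schedule, covariance_kernels, particle_weights) are assumed from the AutoGP.jl tutorials and should be checked against the current documentation:

```julia
import AutoGP

# Toy seasonal-plus-trend series (placeholder data).
ds = collect(1.0:100.0)
y = sin.(2pi .* ds ./ 12) .+ 0.01 .* ds

# Assumed API from the AutoGP.jl tutorials; verify against the docs.
model = AutoGP.GPModel(ds, y; n_particles=8)
AutoGP.fit_smc!(model;
    schedule=AutoGP.Schedule.linear_schedule(length(ds), 0.10),
    n_mcmc=50, n_hmc=10, verbose=false)

# Each particle carries a symbolic kernel, its parameters, a noise level,
# and an SMC weight.
for (k, w) in zip(AutoGP.covariance_kernels(model), AutoGP.particle_weights(model))
    println("weight = ", round(w; digits=3), "   kernel = ", k)
end
```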

780

00:48:15,992 --> 00:48:17,302

OK, I see.

781

00:48:17,302 --> 00:48:18,932

Yeah, super cool.

782

00:48:19,372 --> 00:48:23,468

And so yeah, you can get back that as a

text file.

783

00:48:23,468 --> 00:48:29,148

Like either you use it in a full Julia

program, or if you prefer R or Python, you

784

00:48:29,148 --> 00:48:32,368

could use AutoGP.jl just for that.

785

00:48:32,368 --> 00:48:39,388

Get back a text file and then use that in

R or in Python in another model, for

786

00:48:39,388 --> 00:48:40,328

instance.

787

00:48:41,288 --> 00:48:42,048

Okay.

788

00:48:42,048 --> 00:48:42,988

That's super cool.

789

00:48:42,988 --> 00:48:44,668

Do you have examples of that?

790

00:48:44,668 --> 00:48:45,828

Yeah.

791

00:48:45,828 --> 00:48:50,168

Do you have examples of that we can link

to for listeners in the show notes?

792

00:48:50,368 --> 00:48:52,148

We have tutorials.

793

00:48:52,148 --> 00:48:53,196

And so...

794

00:48:53,196 --> 00:48:58,416

The tutorial, I think, prints the learned

795

00:48:58,416 --> 00:49:01,766

structures into the output cells of the

IPython notebooks.

796

00:49:01,766 --> 00:49:05,116

And so you could take the printed

structure and just save it as a text file

797

00:49:05,116 --> 00:49:08,936

and write your own little parser for

extracting those structures and building

798

00:49:08,936 --> 00:49:12,756

an RGP or a PyTorch GP or any other GP.
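
For instance, a very small parser of that kind, restricted here to sums of two made-up base-kernel names just to show the idea:

```julia
# Minimal sketch: parse printed structures like "LIN(0.5) + PER(2.0, 0.7)"
# back into covariance functions. The kernel names and print format are
# made up for illustration, not AutoGP's actual output format.
k_lin(c) = (x, y) -> (x - c) * (y - c)
k_per(p, l) = (x, y) -> exp(-2 * sin(pi * abs(x - y) / p)^2 / l^2)

function parse_base(term::AbstractString)
    name, rest = split(term, "("; limit=2)
    args = parse.(Float64, strip.(split(rstrip(rest, ')'), ",")))
    name == "LIN" ? k_lin(args[1]) :
    name == "PER" ? k_per(args[1], args[2]) :
    error("unknown kernel: $name")
end

function parse_kernel(s::AbstractString)
    parts = parse_base.(strip.(split(s, "+")))
    (x, y) -> sum(kf(x, y) for kf in parts)
end

k = parse_kernel("LIN(0.5) + PER(2.0, 0.7)")
k(1.0, 3.0)   # evaluate the reconstructed covariance function
```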

799

00:49:13,096 --> 00:49:13,516

Okay.

800

00:49:13,516 --> 00:49:14,076

Yeah.

801

00:49:14,076 --> 00:49:15,086

That was super cool.

802

00:49:15,086 --> 00:49:16,476

That's awesome.

803

00:49:17,096 --> 00:49:22,860

And do you know if there is already an

implementation in R?

804

00:49:22,860 --> 00:49:27,540

and/or in Python of what you're doing in

AutoGP.jl?

805

00:49:27,540 --> 00:49:33,960

Yeah, so we, so this project was

implemented during my year at Google when

806

00:49:33,960 --> 00:49:38,020

I was, well, between finishing my PhD and

starting at CMU, I was at Google for a

807

00:49:38,020 --> 00:49:40,360

year as a visiting faculty scientist.

808

00:49:40,380 --> 00:49:44,700

And some of the prototype implementations

were also in Python.

809

00:49:44,700 --> 00:49:52,658

But I think the only public version at the

moment is the Julia version.

810

00:49:52,716 --> 00:50:00,236

But I think it's a little bit challenging

to reimplement this because one of the

811

00:50:00,236 --> 00:50:05,856

things we learned when trying to implement

it in Python is that we don't have Gen, or

812

00:50:05,856 --> 00:50:07,876

at least at the time we didn't.

813

00:50:08,476 --> 00:50:13,396

The reason we focused on Julia is that we

could use the power of the Gen

814

00:50:13,396 --> 00:50:19,716

probabilistic programming language in a

way that made model development and

815

00:50:19,716 --> 00:50:20,748

iterating.

816

00:50:20,748 --> 00:50:24,948

much more feasible than a pure Python

implementation or even, you know, an R

817

00:50:24,948 --> 00:50:27,368

implementation or in another language.

818

00:50:28,348 --> 00:50:28,558

Yeah.

819

00:50:28,558 --> 00:50:29,248

Okay.

820

00:50:29,248 --> 00:50:38,288

Um, and so actually, yeah, so I, I would

have so many more questions on that, but I

821

00:50:38,288 --> 00:50:42,868

think that's already a good, a good

overview of, of that project.

822

00:50:42,868 --> 00:50:48,828

Maybe I'm curious about the, the biggest

obstacle that you had on the path, uh,

823

00:50:48,828 --> 00:50:49,708

when developing

824

00:50:49,708 --> 00:50:56,198

that package, AutoGP.jl, and also what

are your future plans for this package?

825

00:50:56,198 --> 00:51:06,168

What would you like to see it become in

the coming months and years?

826

00:51:06,628 --> 00:51:07,448

Yeah.

827

00:51:07,448 --> 00:51:09,048

So thanks for those questions.

828

00:51:09,048 --> 00:51:15,348

So for the biggest challenge, I think

designing and implementing the inference

829

00:51:15,348 --> 00:51:17,324

algorithm that includes...

830

00:51:17,324 --> 00:51:20,264

sequential Monte Carlo and involutive MCMC.

831

00:51:20,264 --> 00:51:25,124

That was a challenge because there aren't

many works, prior works in the literature

832

00:51:25,124 --> 00:51:30,744

that have actually explored this type of a

combination, which is, um, you know, which

833

00:51:30,744 --> 00:51:35,484

is really at the heart of AutoGP, um,

designing the right proposal distributions

834

00:51:35,484 --> 00:51:38,684

for, I have some given structure and I

have my data.

835

00:51:38,684 --> 00:51:40,564

How do I do a data driven proposal?

836

00:51:40,564 --> 00:51:44,604

So I'm not just blindly proposing some new

structure from the prior or some new

837

00:51:44,604 --> 00:51:45,164

substructure,

838

00:51:45,164 --> 00:51:49,484

but actually use the observed data to come

up with a smart proposal for how I'm going

839

00:51:49,484 --> 00:51:52,384

to improve the structure in the inner loop

of MCMC.

840

00:51:52,384 --> 00:51:58,344

So we put a lot of thought into the actual

move types and how to use the data to come

841

00:51:58,344 --> 00:52:01,064

up with data -driven proposal

distributions.

842

00:52:01,884 --> 00:52:04,544

So the paper describes some of these

tricks.

843

00:52:04,684 --> 00:52:08,464

So there's moves which are based on

replacing a random subtree.

844

00:52:08,464 --> 00:52:13,708

There are moves which are detaching the

subtree and throwing everything away or...

845

00:52:13,708 --> 00:52:16,628

embedding the subtree within a new tree.

846

00:52:16,628 --> 00:52:19,288

So there are these different types of

moves, which we found are more helpful to

847

00:52:19,288 --> 00:52:19,988

guide the search.
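
To give a rough picture of what a replace-a-random-subtree move does on these kernel expressions, here is a self-contained Julia sketch; it is a plain tree manipulation with made-up type names, and a real involutive or reversible-jump move would additionally handle the parameters and the acceptance probability:

```julia
# Kernel expressions as a small symbolic tree (illustrative subset only).
abstract type Kern end
struct Linear   <: Kern; c::Float64 end
struct Periodic <: Kern; p::Float64; l::Float64 end
struct Plus     <: Kern; left::Kern; right::Kern end
struct Times    <: Kern; left::Kern; right::Kern end

# Draw a fresh expression from a simple recursive prior over structures.
function sample_kernel(depth=0)
    if depth >= 2 || rand() < 0.5
        rand() < 0.5 ? Linear(randn()) : Periodic(abs(randn()) + 0.1, abs(randn()) + 0.1)
    else
        op = rand() < 0.5 ? Plus : Times
        op(sample_kernel(depth + 1), sample_kernel(depth + 1))
    end
end

n_nodes(k::Kern) = k isa Plus || k isa Times ? 1 + n_nodes(k.left) + n_nodes(k.right) : 1

# Replace the subtree rooted at node `idx` (counted in preorder) with a
# freshly sampled expression.
function replace_subtree(k::Kern, idx::Int)
    idx == 1 && return sample_kernel()
    nl = n_nodes(k.left)
    idx <= 1 + nl ? typeof(k)(replace_subtree(k.left, idx - 1), k.right) :
                    typeof(k)(k.left, replace_subtree(k.right, idx - 1 - nl))
end

k0 = sample_kernel()
k1 = replace_subtree(k0, rand(1:n_nodes(k0)))   # one proposed structural move
```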

848

00:52:19,988 --> 00:52:23,888

And it was a challenging process to figure

out how to implement those moves and how

849

00:52:23,888 --> 00:52:25,068

to debug them.

850

00:52:25,068 --> 00:52:29,368

So that I think was, was part of the

challenge.

851

00:52:29,368 --> 00:52:33,058

I think another challenge which we were

facing was, of course,

852

00:52:33,058 --> 00:52:36,688

the fact that we were using these dense

Gaussian process models without the actual

853

00:52:36,688 --> 00:52:40,948

approximations that are needed to scale to

say tens or hundreds of thousands of data

854

00:52:40,948 --> 00:52:41,548

points.

855

00:52:41,548 --> 00:52:42,700

And so.

856

00:52:42,700 --> 00:52:47,240

This I think was part of the motivation

for thinking about what are other types of

857

00:52:47,240 --> 00:52:52,000

approximations of the GP that would let us

handle datasets of that size.

858

00:52:52,000 --> 00:52:57,960

In terms of what I'd like for AutoGP to be

in the future, I think there's two answers

859

00:52:57,960 --> 00:52:58,839

to that.

860

00:52:58,839 --> 00:53:02,720

One answer, and I think there's already a

nice success case here, but one answer is

861

00:53:02,720 --> 00:53:06,820

I'd like the implementation of AutoGP to

be a reference for how to do probabilistic

862

00:53:06,820 --> 00:53:08,720

structure discovery using GEN.

863

00:53:08,720 --> 00:53:10,604

So I expect that people...

864

00:53:10,604 --> 00:53:15,404

across many different disciplines have

this problem of not knowing what their

865

00:53:15,404 --> 00:53:18,364

specific model is for the data.

866

00:53:18,364 --> 00:53:21,484

And then you might have a prior

distribution over symbolic model

867

00:53:21,484 --> 00:53:25,164

structures and given your observed data,

you want to infer the right model

868

00:53:25,164 --> 00:53:25,964

structure.

869

00:53:25,964 --> 00:53:31,144

And I think in the AutoGP code base, we

have a lot of the important components

870

00:53:31,144 --> 00:53:34,744

that are needed to apply this workflow to

new settings.

871

00:53:34,744 --> 00:53:38,404

So I think we've really put a lot of

effort into having the code be

872

00:53:38,404 --> 00:53:39,820

self-documenting, in a sense,

873

00:53:39,820 --> 00:53:44,960

and make it easier for people to adapt the

code for their own purposes.

874

00:53:45,080 --> 00:53:50,840

And so there was a recent paper this year

presented at NeurIPS by Tracy Mills and Sam

875

00:53:50,840 --> 00:53:58,840

Shayet from Professor Tenenbaum's group

that extended the AutoGP package for a

876

00:53:58,840 --> 00:54:04,120

task in cognition, which was very nice to

see that the code isn't only valuable for

877

00:54:04,120 --> 00:54:08,432

its own purpose, but also adaptable by

others for other types of tasks.

878

00:54:08,492 --> 00:54:12,372

Um, and I think the second thing that I'd

like AutoGP, or at least AutoGP-type

879

00:54:12,372 --> 00:54:18,212

models to do is, um, you know, integrating

these with, and this goes back to the

880

00:54:18,212 --> 00:54:22,412

original automatic statistician that, uh,

that motivated AutoGP.

881

00:54:22,412 --> 00:54:24,212

That work is from, say, 10 years ago.

882

00:54:24,212 --> 00:54:28,612

Um, so the automated statistician had

this component, the natural language

883

00:54:28,612 --> 00:54:33,112

processing component, which is, you know,

at the time there was no ChatGPT or large

884

00:54:33,112 --> 00:54:33,742

language models.

885

00:54:33,742 --> 00:54:37,804

So they just wrote some simple rules to

take the learned Gaussian process

886

00:54:37,804 --> 00:54:40,124

and summarize it in terms of a report.

887

00:54:40,124 --> 00:54:44,324

But now we have much more powerful

language models.

888

00:54:44,324 --> 00:54:48,004

And one question could be, how can I use

the outputs of AutoGP and integrate it

889

00:54:48,004 --> 00:54:52,804

within a language model, not only for

reporting the structure, but also for

890

00:54:52,804 --> 00:54:54,414

answering now probabilistic queries.

891

00:54:54,414 --> 00:55:00,184

So you might say, find for me a time when

there could be a change point, or give me

892

00:55:00,184 --> 00:55:04,664

a numerical estimate of the covariance

between two different time slices, or

893

00:55:04,664 --> 00:55:06,092

impute the data.

894

00:55:06,092 --> 00:55:10,732

between these two different time regions,

or give me a 95% prediction interval.

895

00:55:10,732 --> 00:55:14,692

And so a data scientist can write these in

terms of natural language, or rather a

896

00:55:14,692 --> 00:55:17,332

domain specialist can write these in

natural language, and then you would

897

00:55:17,332 --> 00:55:21,912

compile it into different little programs

that are querying the GP learned by

898

00:55:21,912 --> 00:55:22,892

AutoGP.

899

00:55:22,892 --> 00:55:28,352

And so creating some type of a higher

level interface that makes it possible for

900

00:55:28,352 --> 00:55:33,032

people to not necessarily dive into the

guts of Julia or, you know, even implement

901

00:55:33,032 --> 00:55:34,828

an IPython notebook,

902

00:55:34,828 --> 00:55:38,708

but have the system learn the

probabilistic models and then have a

903

00:55:38,708 --> 00:55:43,028

natural language interface which you can

use to query those models, either for

904

00:55:43,028 --> 00:55:46,208

learning something about the structure of

the data, but also for solving prediction

905

00:55:46,208 --> 00:55:47,348

tasks.

906

00:55:47,348 --> 00:55:52,308

And in both cases, I think, you know, off

the shelf models may not work so well

907

00:55:52,308 --> 00:55:57,508

because, you know, they may not know how

to parse the AutoGP kernel to come up

908

00:55:57,508 --> 00:56:00,868

with a meaningful summary of what it

actually means in terms of the data, or

909

00:56:00,868 --> 00:56:03,724

they may not know how to translate natural

language into

910

00:56:03,724 --> 00:56:05,744

Julia code for AutoGP.

911

00:56:05,744 --> 00:56:08,824

So there's a little bit of research into

thinking about how do we fine tune these

912

00:56:08,824 --> 00:56:13,944

models so that they're able to interact

with the automatically learned

913

00:56:13,944 --> 00:56:15,624

probabilistic models.

914

00:56:16,624 --> 00:56:20,544

And I think, I'll just mention

here, one of the benefits of an

915

00:56:20,544 --> 00:56:23,224

AutoGP-like system is its

interpretability.

916

00:56:23,224 --> 00:56:27,524

So because Gaussian processes are, they're

quite transparent, like you said, they're

917

00:56:27,524 --> 00:56:30,444

ultimately at the end of the day, these

giant multivariate normals.

918

00:56:30,444 --> 00:56:35,084

We can explain to people who are using

these types of distributions and

919

00:56:35,084 --> 00:56:37,924

they're comfortable with them, what

exactly is the distribution that's been

920

00:56:37,924 --> 00:56:38,524

learned?

921

00:56:38,524 --> 00:56:42,044

It's not: these are some weights in some giant

neural network and here's the prediction

922

00:56:42,044 --> 00:56:43,194

and you have to live with it.

923

00:56:43,194 --> 00:56:45,924

Rather, you can say, well, here's our

prediction and the reason we made this

924

00:56:45,924 --> 00:56:49,804

prediction is, well, we inferred a

seasonal component with so-and-so

925

00:56:49,804 --> 00:56:50,624

frequency.

926

00:56:50,624 --> 00:56:53,324

And so you can get the predictions, but

you can also get some type of

927

00:56:53,324 --> 00:56:57,364

interpretable summary for why those

predictions were made, which maybe helps

928

00:56:57,364 --> 00:56:59,468

with the trustworthiness of the system.

929

00:56:59,468 --> 00:57:02,208

or just transparency more generally.

930

00:57:03,128 --> 00:57:04,468

Yeah.

931

00:57:04,808 --> 00:57:06,928

I'm signing up now.

932

00:57:06,928 --> 00:57:09,318

That sounds like an awesome tool.

933

00:57:09,318 --> 00:57:11,008

Yeah, for sure.

934

00:57:11,668 --> 00:57:13,848

That looks absolutely fantastic.

935

00:57:15,268 --> 00:57:18,608

And yeah, hopefully that will, these kind

of tools will help.

936

00:57:18,608 --> 00:57:25,468

I'm definitely curious to try that now in

my own models, basically.

937

00:57:26,428 --> 00:57:29,132

And yeah, see what...

938

00:57:29,132 --> 00:57:35,012

AutoGP.jl tells you about the covariance

structure, and then try and use that myself

939

00:57:35,012 --> 00:57:41,032

in a model of mine, probably in Python, so

that I'd have to get it out of Julia and

940

00:57:41,032 --> 00:57:44,672

see how that, like how you can plug that

into another model.

941

00:57:44,672 --> 00:57:47,612

That would be super, super interesting for

sure.

942

00:57:47,612 --> 00:57:47,782

Yeah.

943

00:57:47,782 --> 00:57:52,472

I'm going to try and find an excuse to do

that.

944

00:57:54,316 --> 00:58:01,576

Um, actually I'm curious now, um, we could

talk a bit about how that's done, right?

945

00:58:01,576 --> 00:58:06,256

How you do that discovery of the time

series structure.

946

00:58:06,256 --> 00:58:10,516

And you've mentioned that you're using

sequential Monte Carlo to do that.

947

00:58:10,516 --> 00:58:20,976

So SMC, um, can you give listeners an idea

of what SMC is and why that would be

948

00:58:20,976 --> 00:58:22,676

useful in that case?

949

00:58:22,676 --> 00:58:24,236

Uh, and also if

950

00:58:24,236 --> 00:58:28,896

the way you do it for these projects

differs from the classical way of doing

951

00:58:28,896 --> 00:58:30,196

SMC.

952

00:58:31,156 --> 00:58:31,636

Good.

953

00:58:31,636 --> 00:58:33,476

Yes, thanks for that question.

954

00:58:33,816 --> 00:58:38,316

So sequential Monte Carlo is a very broad

family of algorithms.

955

00:58:38,316 --> 00:58:42,476

And I think one of the confusing parts for

me when I was learning sequential Monte

956

00:58:42,476 --> 00:58:46,696

Carlo is that a lot of the introductory

material of sequential Monte Carlo are

957

00:58:46,696 --> 00:58:49,132

very closely married to particle filters.

958

00:58:49,324 --> 00:58:53,024

But particle filtering, which is only one

application of sequential Monte Carlo,

959

00:58:53,024 --> 00:58:54,644

isn't the whole story.

960

00:58:54,644 --> 00:58:59,444

And so I think, you know, there's now more

modern expositions of sequential Monte

961

00:58:59,444 --> 00:59:03,914

Carlo, which are really bringing to light

how general these methods are.

962

00:59:03,914 --> 00:59:08,174

And here I would like to recommend

Professor Nicholas Chopin's textbook,

963

00:59:08,174 --> 00:59:09,624

Introduction to Sequential Monte Carlo.

964

00:59:09,624 --> 00:59:11,664

It's a Springer 2020 textbook.

965

00:59:11,664 --> 00:59:16,084

I continue to use this in my research and,

you know, I think that it's a very well

966

00:59:16,084 --> 00:59:18,788

-written overview of really

967

00:59:18,892 --> 00:59:22,332

how general and how powerful sequential

Monte Carlo is.

968

00:59:22,332 --> 00:59:25,592

So a brief explanation of sequential Monte

Carlo.

969

00:59:25,592 --> 00:59:28,732

I guess maybe one way we could contrast it

is with traditional Markov chain Monte

970

00:59:28,732 --> 00:59:29,372

Carlo.

971

00:59:29,372 --> 00:59:33,632

So in traditional MCMC, we have some

particular latent state, let's call it

972

00:59:33,632 --> 00:59:34,592

theta.

973

00:59:34,932 --> 00:59:40,792

And we just, theta is supposed to be drawn

from P of theta given X, where that's our

974

00:59:40,792 --> 00:59:42,712

posterior distribution and X is the data.

975

00:59:42,712 --> 00:59:46,372

And we just apply some transition kernel

over and over and over again, and then we

976

00:59:46,372 --> 00:59:47,044

hope

977

00:59:47,308 --> 00:59:50,288

that, in the limit of applying these

transition kernels, we're going to

978

00:59:50,288 --> 00:59:52,588

converge to the posterior distribution.

979

00:59:52,588 --> 00:59:52,808

Okay.

980

00:59:52,808 --> 00:59:57,328

So MCMC is just like one iterative chain

that you run forever.

981

00:59:57,328 --> 00:59:59,078

You can do a little bit of modifications.

982

00:59:59,078 --> 01:00:04,108

You might have multiple chains, which are

independent of one another, but sequential

983

01:00:04,108 --> 01:00:09,088

Monte Carlo is, in a sense, trying to

go beyond that, which is anything you can

984

01:00:09,088 --> 01:00:13,468

do in a traditional MCMC algorithm, you

can do using sequential Monte Carlo.

985

01:00:13,648 --> 01:00:15,436

But in sequential Monte Carlo,

986

01:00:15,436 --> 01:00:19,176

you don't have a single chain, but you

have multiple different particles.

987

01:00:19,176 --> 01:00:22,816

And each of these different particles you

can think of as being analogous in some

988

01:00:22,816 --> 01:00:26,776

way to a particular MCMC chain, but

they're allowed to interact.

989

01:00:26,776 --> 01:00:32,936

And so you start with, say, some number of

particles, and you start with no data.

990

01:00:32,936 --> 01:00:35,936

And so what you would do is you would just

draw these particles from your prior

991

01:00:35,936 --> 01:00:37,216

distribution.

992

01:00:37,216 --> 01:00:41,036

And each of these draws from the prior are

basically draws from p of theta.

993

01:00:41,036 --> 01:00:43,516

And now I'd like to get them to p of theta

given x.

994

01:00:43,516 --> 01:00:44,556

That's my goal.

995

01:00:44,556 --> 01:00:48,096

So I start with a bunch of particles drawn

from p of theta, and I'd like to get them

996

01:00:48,096 --> 01:00:48,996

to p of theta given x.

997

01:00:48,996 --> 01:00:52,196

So how am I going to go from p of theta to

p of theta given x?

998

01:00:52,196 --> 01:00:55,196

There's many different ways you might do

that, and that's exactly what's

999

01:00:55,196 --> 01:00:55,896

sequential, right?

Speaker:

01:00:55,896 --> 01:00:58,376

How do you go from the prior to the

posterior?

Speaker:

01:00:58,376 --> 01:01:04,176

The approach we take in AutoGP is

based on this idea of data tempering.

Speaker:

01:01:04,176 --> 01:01:08,706

So let's say my data x consists of a

thousand measurements, okay?

Speaker:

01:01:08,706 --> 01:01:11,756

And I'd like to go from p of theta to p of

theta given x.

Speaker:

01:01:11,756 --> 01:01:15,136

Well, here's one sequential strategy that

I can use to bridge between these two

Speaker:

01:01:15,136 --> 01:01:16,076

distributions.

Speaker:

01:01:16,076 --> 01:01:20,316

I can start with P of theta, then I can

start with P of theta given X1, then P of

Speaker:

01:01:20,316 --> 01:01:23,636

theta given X1 and X2, then P of theta given X1, X2,

and X3, and so on.

Speaker:

01:01:23,636 --> 01:01:27,596

So I can anneal or I can temper these data

points into the prior.

Speaker:

01:01:27,596 --> 01:01:30,436

And the more data points I put in, the

closer I'm going to get to the full

Speaker:

01:01:30,436 --> 01:01:34,796

posterior, P of theta given X1 through

X1000 or something.

Speaker:

01:01:34,796 --> 01:01:36,876

Or you might introduce these data in

batches.

Speaker:

01:01:36,876 --> 01:01:41,708

But the key idea is that you start with

draws from some prior typically.

Speaker:

01:01:41,708 --> 01:01:45,448

and then you're just adding more and more

data and you're reweighting the particles

Speaker:

01:01:45,448 --> 01:01:48,548

based on the probability that they assign

to the new data.

Speaker:

01:01:48,548 --> 01:01:53,368

So if I have 10 particles and some

particle is always able to predict or it's

Speaker:

01:01:53,368 --> 01:01:57,108

always assigning a very high score to the

new data, I know that that's a particle

Speaker:

01:01:57,108 --> 01:01:59,168

that's explaining the data quite well.

Speaker:

01:01:59,168 --> 01:02:02,628

And so I might resample these particles

according to their weights to get rid of

Speaker:

01:02:02,628 --> 01:02:05,948

the particles that are not explaining the

new data well and to focus my

Speaker:

01:02:05,948 --> 01:02:09,156

computational effort on the particles that

are explaining the data well.

Speaker:

01:02:09,516 --> 01:02:12,576

And this is something that an MCMC

algorithm does not give us.

Speaker:

01:02:12,576 --> 01:02:17,496

Because even if we run like a hundred MCMC

chains in parallel, we don't know how to

Speaker:

01:02:17,496 --> 01:02:22,236

resample the chains, for example, because

they're all these independent executions

Speaker:

01:02:22,236 --> 01:02:26,276

and we don't have a principled way of

assigning a score to those different

Speaker:

01:02:26,276 --> 01:02:26,516

chains.

Speaker:

01:02:26,516 --> 01:02:28,016

You can't use the joint likelihood.

Speaker:

01:02:28,016 --> 01:02:32,956

That's not a valid or even a

meaningful statistic to use to

Speaker:

01:02:32,956 --> 01:02:34,668

measure the quality of a given chain.

Speaker:

01:02:34,668 --> 01:02:38,648

But SMC, because it's built on

importance sampling, has a principled way

Speaker:

01:02:38,648 --> 01:02:43,208

for us to assign weights to these

different particles and focus on the ones

Speaker:

01:02:43,208 --> 01:02:44,868

which are most promising.

Speaker:

01:02:44,928 --> 01:02:49,048

And then I think the final component

that's missing in my explanation is where

Speaker:

01:02:49,048 --> 01:02:50,728

does the MCMC come in?

Speaker:

01:02:50,728 --> 01:02:54,288

So traditionally in sequential Monte

Carlo, there was no MCMC.

Speaker:

01:02:54,288 --> 01:02:59,368

You would just have your particles, you

would add new data, you would reweight it

Speaker:

01:02:59,368 --> 01:03:02,168

based on the probability of the data, then

you would resample the particles.

Speaker:

01:03:02,168 --> 01:03:03,628

Then I'm going to add the

Speaker:

01:03:03,628 --> 01:03:07,008

next batch of data, resample, reweight,

et cetera.

Speaker:

01:03:07,008 --> 01:03:12,648

But you're also able to, in between adding

new data points, run MCMC in the inner

Speaker:

01:03:12,648 --> 01:03:14,608

loop of sequential Monte Carlo.

Speaker:

01:03:14,608 --> 01:03:20,318

And that does not sort of make the

algorithm incorrect.

Speaker:

01:03:20,318 --> 01:03:23,968

It preserves the correctness of the

algorithm, even if you run MCMC.

Speaker:

01:03:23,968 --> 01:03:28,908

And there the intuition is that, you know,

your prior draws are not going to be good.

Speaker:

01:03:28,908 --> 01:03:32,248

So now, after I've observed, say, 10%

of the data, I might actually run some

Speaker:

01:03:32,248 --> 01:03:37,288

MCMC on that subset of 10% of the data

before I introduce the next batch of data.

Speaker:

01:03:37,288 --> 01:03:42,048

So after you're reweighting the particles,

you're also using a little bit of MCMC to

Speaker:

01:03:42,048 --> 01:03:45,608

improve their structure given the data

that's been observed so far.

Speaker:

01:03:45,608 --> 01:03:49,288

And that's where the MCMC is run inside

the inner loop.
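
Schematically, and independent of AutoGP or Gen, the loop being described looks roughly like this; a toy Gaussian-mean model stands in for the structure prior, and the rejuvenation step is a plain random-walk Metropolis move:

```julia
using Distributions, StatsBase

# Toy SMC with data tempering: unknown mean theta, N(0, 5) prior, N(theta, 1)
# likelihood, data introduced in batches.
loglik(theta, xs) = sum(logpdf.(Normal(theta, 1.0), xs))

function smc_data_tempering(data; n_particles=200, batch=50, n_mcmc=5)
    theta = rand(Normal(0, 5), n_particles)     # particles drawn from the prior
    logw = zeros(n_particles)
    seen = Float64[]
    for start in 1:batch:length(data)
        newbatch = data[start:min(start + batch - 1, end)]
        # Reweight: how well does each particle explain the new batch?
        logw .+= [loglik(t, newbatch) for t in theta]
        append!(seen, newbatch)
        # Resample proportional to the weights, then reset them.
        w = exp.(logw .- maximum(logw)); w ./= sum(w)
        theta = theta[sample(1:n_particles, Weights(w), n_particles)]
        logw .= 0.0
        # Rejuvenate: a few Metropolis steps targeting p(theta | data seen so far).
        for _ in 1:n_mcmc, i in 1:n_particles
            prop = theta[i] + 0.3 * randn()
            a = loglik(prop, seen) + logpdf(Normal(0, 5), prop) -
                loglik(theta[i], seen) - logpdf(Normal(0, 5), theta[i])
            log(rand()) < a && (theta[i] = prop)
        end
    end
    return theta   # an equally weighted particle approximation of the posterior
end

draws = smc_data_tempering(randn(500) .+ 2.0)
```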

Speaker:

01:03:49,288 --> 01:03:53,428

So some of the benefits I think of this

kind of approach are, like I mentioned at

Speaker:

01:03:53,428 --> 01:03:57,168

the beginning, in MCMC you have to compute

the probability of all the data at each

Speaker:

01:03:57,168 --> 01:03:57,708

step.

Speaker:

01:03:57,708 --> 01:04:01,767

But in SMC, because we're sequentially

incorporating new batches of data, we can

Speaker:

01:04:01,767 --> 01:04:06,808

get away with only looking at, say, 10 or

20% of the data and get some initial

Speaker:

01:04:06,808 --> 01:04:10,488

inferences before we actually reach the

end and have processed all of the observed

Speaker:

01:04:10,488 --> 01:04:11,488

data.

Speaker:

01:04:12,268 --> 01:04:18,548

So that's, I guess, a high level overview

of the algorithm that AutoGP is using.

Speaker:

01:04:18,548 --> 01:04:20,908

It's annealing the data or tempering the

data.

Speaker:

01:04:20,908 --> 01:04:24,812

It's reassigning the scores of the

particles based on how well they're

Speaker:

01:04:24,812 --> 01:04:30,372

explaining the new batch of data and it's

running MCMC to improve their structure by

Speaker:

01:04:30,372 --> 01:04:33,572

applying these different moves like

removing a sub-expression, adding a

Speaker:

01:04:33,572 --> 01:04:36,282

sub-expression, different things of that

nature.

Speaker:

01:04:38,188 --> 01:04:39,348

Okay, yeah.

Speaker:

01:04:39,348 --> 01:04:43,508

Thanks a lot for this explanation because

that was a very hard question on my part

Speaker:

01:04:43,508 --> 01:04:52,228

and I think you've done a tremendous job

explaining the basics of SMC and when that

Speaker:

01:04:52,228 --> 01:04:53,608

would be useful.

Speaker:

01:04:53,608 --> 01:04:55,768

So, yeah, thank you very much.

Speaker:

01:04:55,768 --> 01:04:57,668

I think that's super helpful.

Speaker:

01:04:58,048 --> 01:05:04,528

And why in this case, when you're trying

to do these kind of time series

Speaker:

01:05:04,528 --> 01:05:06,188

discoveries, why...

Speaker:

01:05:06,188 --> 01:05:11,228

would SMC be more useful than a classic

MCMC?

Speaker:

01:05:11,568 --> 01:05:12,078

Yeah.

Speaker:

01:05:12,078 --> 01:05:15,468

So it's more useful, I guess, for several

reasons.

Speaker:

01:05:15,468 --> 01:05:19,638

One reason is that, well, you might

actually have a true streaming problem.

Speaker:

01:05:19,638 --> 01:05:24,688

So if your data is actually streaming, you

can't use MCMC because MCMC is operating

Speaker:

01:05:24,688 --> 01:05:25,968

on a static data set.

Speaker:

01:05:25,968 --> 01:05:31,928

So what if I'm running AutoGP in some type

of industrial process system where some

Speaker:

01:05:31,928 --> 01:05:33,068

data is coming in?

Speaker:

01:05:33,068 --> 01:05:36,098

and I'm updating the models in real time

as my data is coming in.

Speaker:

01:05:36,098 --> 01:05:41,248

That's a purely online setting, which

SMC is perfect for, but MCMC is not so

Speaker:

01:05:41,248 --> 01:05:46,388

well suited because you basically don't

have a way to, I mean, obviously you can

Speaker:

01:05:46,388 --> 01:05:50,768

always incorporate new data in MCMC, but

that's not the traditional algorithm where

Speaker:

01:05:50,768 --> 01:05:52,748

we know its correctness properties.

Speaker:

01:05:52,748 --> 01:05:56,228

So for when you have streaming data, that

might be extremely useful.

Speaker:

01:05:56,228 --> 01:05:59,180

But even if your data is not streaming,

Speaker:

01:05:59,180 --> 01:06:03,280

you know, theoretically there's results

that show that convergence can be much

Speaker:

01:06:03,280 --> 01:06:06,620

improved when you use the sequential Monte

Carlo approach.

Speaker:

01:06:06,620 --> 01:06:12,100

Because you have these multiple particles

that are interacting with one another.

Speaker:

01:06:12,100 --> 01:06:16,580

And what they can do is they can explore

multiple modes whereby an MCMC, you know,

Speaker:

01:06:16,580 --> 01:06:19,540

each individual MCMC chain might get

trapped in a mode.

Speaker:

01:06:19,540 --> 01:06:23,620

And unless you have an extremely accurate

posterior proposal distribution, you may

Speaker:

01:06:23,620 --> 01:06:25,298

never escape from that mode.

Speaker:

01:06:25,388 --> 01:06:28,708

But in SMC, we're able to resample these

different particles so that they're

Speaker:

01:06:28,708 --> 01:06:32,568

interacting, which means that you can

probably explore the space much more

Speaker:

01:06:32,568 --> 01:06:36,148

efficiently than you could with a single

chain that's not interacting with other

Speaker:

01:06:36,148 --> 01:06:36,708

chains.

Speaker:

01:06:36,708 --> 01:06:41,928

And this is especially important in the

types of posteriors that AutoGP is

Speaker:

01:06:41,928 --> 01:06:44,568

exploring, because these are symbolic

expression spaces.

Speaker:

01:06:44,568 --> 01:06:46,428

They are not Euclidean space.

Speaker:

01:06:46,428 --> 01:06:51,548

And so we expect there to be largely

non-smooth components, and we want to be able

Speaker:

01:06:51,548 --> 01:06:54,156

to jump efficiently through this space

through...

Speaker:

01:06:54,156 --> 01:07:00,736

the resampling procedure of SMC, which

Speaker:

01:07:00,736 --> 01:07:02,116

is why it's a suitable algorithm.

Speaker:

01:07:02,116 --> 01:07:06,256

And then the third component is because,

you know, this is more specific to GPs in

Speaker:

01:07:06,256 --> 01:07:11,516

particular, which is because GPs have a

cubic cost of evaluating the likelihood in

Speaker:

01:07:11,516 --> 01:07:14,486

MCMC, that's really going to bite you if

you're doing it at each step.

Speaker:

01:07:14,486 --> 01:07:17,676

If I have, say, a thousand

observations, I don't want to be doing

Speaker:

01:07:17,676 --> 01:07:22,476

that at each step, but in SMC, because the

data is being introduced in batches, what

Speaker:

01:07:22,476 --> 01:07:23,628

that means is

Speaker:

01:07:23,628 --> 01:07:28,068

I might be able to get some very accurate

predictions using only the first 10 % of

Speaker:

01:07:28,068 --> 01:07:31,928

the data, which is going to be quite cheap

to evaluate the likelihood.

Speaker:

01:07:31,928 --> 01:07:35,768

So you're somehow smoothly interpolating

between the prior, where you can get

Speaker:

01:07:35,768 --> 01:07:39,728

perfect samples, and the posterior, which

is hard to sample, using these

Speaker:

01:07:39,728 --> 01:07:44,148

intermediate distributions, which are

closer to one another than the distance

Speaker:

01:07:44,148 --> 01:07:46,068

between the prior and the posterior.

Speaker:

01:07:46,068 --> 01:07:49,168

And that's what makes inference hard,

essentially, which is the distance between

Speaker:

01:07:49,168 --> 01:07:50,700

the prior and the posterior.

Speaker:

01:07:50,700 --> 01:07:56,420

because SMC is introducing datasets in

smaller batches, it's making this sort of

Speaker:

01:07:56,420 --> 01:07:57,020

bridging.

Speaker:

01:07:57,020 --> 01:08:00,700

It's making it easier to bridge between

the prior and the posterior by having

Speaker:

01:08:00,700 --> 01:08:03,280

these partial posteriors, basically.

Speaker:

01:08:03,740 --> 01:08:06,460

Okay, I see.

Speaker:

01:08:06,460 --> 01:08:07,560

Yeah.

Speaker:

01:08:07,860 --> 01:08:08,660

Yeah, okay.

Speaker:

01:08:08,660 --> 01:08:12,650

That makes sense because of that batching

process, basically.

Speaker:

01:08:12,650 --> 01:08:13,540

Yeah, for sure.

Speaker:

01:08:13,540 --> 01:08:19,428

And the requirements also of MCMC coupled

to a GP that's...

Speaker:

01:08:19,436 --> 01:08:22,376

That's for sure making stuff hard.

Speaker:

01:08:22,376 --> 01:08:22,656

Yeah.

Speaker:

01:08:22,656 --> 01:08:23,816

Yeah.

Speaker:

01:08:25,216 --> 01:08:29,726

And well, I've already taken a lot of time

from you.

Speaker:

01:08:29,726 --> 01:08:30,886

So thanks a lot for that.

Speaker:

01:08:30,886 --> 01:08:32,326

I really appreciate it.

Speaker:

01:08:32,326 --> 01:08:35,306

And that's very, very fascinating.

Speaker:

01:08:35,306 --> 01:08:37,476

Everything you're doing.

Speaker:

01:08:38,156 --> 01:08:42,596

I'm curious also because you're a bit on

both sides, right?

Speaker:

01:08:42,596 --> 01:08:46,406

Where you see practitioners, but you're

also on the very theoretical side.

Speaker:

01:08:46,406 --> 01:08:48,268

And also you teach.

Speaker:

01:08:48,268 --> 01:08:54,368

So I'm wondering, in

your opinion, what's the biggest hurdle in

Speaker:

01:08:54,368 --> 01:08:56,136

the Bayesian workflow currently?

Speaker:

01:08:57,868 --> 01:08:59,828

Yeah, I think there's really a lot of

hurdles.

Speaker:

01:08:59,828 --> 01:09:01,928

I don't know if there's a biggest one.

Speaker:

01:09:01,948 --> 01:09:08,068

So obviously, you know, Professor Andrew

Gelman has an enormous manuscript on

Speaker:

01:09:08,068 --> 01:09:09,968

arXiv, which is called Bayesian

Workflow.

Speaker:

01:09:09,968 --> 01:09:13,908

And he goes through the nitty gritty of

all the different challenges with coming

Speaker:

01:09:13,908 --> 01:09:15,688

up with the Bayesian model.

Speaker:

01:09:15,688 --> 01:09:20,188

But for me, at least the one that's tied

closely to my research is where do we even

Speaker:

01:09:20,188 --> 01:09:21,288

start?

Speaker:

01:09:21,288 --> 01:09:22,868

Where do we start this workflow?

Speaker:

01:09:22,868 --> 01:09:27,222

And that's really what drives a lot of my

interest in automatic model discovery and

Speaker:

01:09:27,244 --> 01:09:29,164

probabilistic program synthesis.

Speaker:

01:09:29,164 --> 01:09:33,424

The idea is not that we want to discover

the model that we're going to use for the

Speaker:

01:09:33,424 --> 01:09:38,584

rest of our, for the rest of the lifetime

of the workflow, but come up with good

Speaker:

01:09:38,584 --> 01:09:42,704

explanations that we can use to bootstrap

this process, after which then we can

Speaker:

01:09:42,704 --> 01:09:44,534

apply the different stages of the

workflow.

Speaker:

01:09:44,534 --> 01:09:49,044

But I think it's getting from just data to

plausible explanations of that data.

Speaker:

01:09:49,044 --> 01:09:52,504

And that's what, you know, probabilistic

program synthesis or automatic model

Speaker:

01:09:52,504 --> 01:09:55,244

discovery is trying to solve.

Speaker:

01:09:56,204 --> 01:09:58,754

So I think that's a very large bottleneck.

Speaker:

01:09:58,754 --> 01:10:01,724

And then I'd say, you know, the second

bottleneck is the scalability of

Speaker:

01:10:01,724 --> 01:10:02,444

inference.

Speaker:

01:10:02,444 --> 01:10:07,404

I think that Bayesian inference has a poor

reputation in many corners because of how

Speaker:

01:10:07,404 --> 01:10:10,384

unscalable traditional MCMC algorithms

are.

Speaker:

01:10:10,384 --> 01:10:15,324

But I think in the last 10, 15 years,

we've seen many foundational developments

Speaker:

01:10:15,324 --> 01:10:21,884

in more scalable posterior inference

algorithms that are being used in many

Speaker:

01:10:21,884 --> 01:10:24,564

different settings in computational

science, et cetera.

Speaker:

01:10:24,564 --> 01:10:25,548

And I think...

Speaker:

01:10:25,548 --> 01:10:28,928

building probabilistic programming

technologies that better expose these

Speaker:

01:10:28,928 --> 01:10:35,868

different inference innovations is going

to help push Bayesian inference to the

Speaker:

01:10:35,868 --> 01:10:42,448

next level of applications that people

have traditionally thought are beyond

Speaker:

01:10:42,448 --> 01:10:45,648

reach because of the lack of scalability.

Speaker:

01:10:45,648 --> 01:10:49,168

So I think putting a lot of effort into

engineering probabilistic programming

Speaker:

01:10:49,168 --> 01:10:53,508

languages that really have fast, powerful

inference, whether it's sequential Monte

Speaker:

01:10:53,508 --> 01:10:54,668

Carlo, whether it's...

Speaker:

01:10:54,668 --> 01:10:58,308

Hamiltonian Monte Carlo with No-U-Turn

sampling, whether it's, you know, there's

Speaker:

01:10:58,308 --> 01:11:01,688

really a lot of options, like involutive

MCMC over discrete structures.

Speaker:

01:11:01,688 --> 01:11:03,598

These are all things that we've seen quite

recently.

Speaker:

01:11:03,598 --> 01:11:07,468

And I think if you put them together, we

can come up with very powerful inference

Speaker:

01:11:07,468 --> 01:11:08,628

machinery.

Speaker:

01:11:08,668 --> 01:11:13,588

And then I think the last thing I'll say

on that topic is, you know, we also need

Speaker:

01:11:13,588 --> 01:11:18,808

some new research into how to configure

our inference algorithms.

Speaker:

01:11:18,808 --> 01:11:22,408

So, you know, we spend a lot of time

thinking is our model the right model, but

Speaker:

01:11:22,408 --> 01:11:22,956

you know,

Speaker:

01:11:22,956 --> 01:11:27,176

I think now that we have probabilistic

programming and we have inference

Speaker:

01:11:27,176 --> 01:11:31,936

algorithms maybe themselves implemented as

probabilistic programs, we might think

Speaker:

01:11:31,936 --> 01:11:37,256

in a more mathematically principled way

about how to optimize the inference

Speaker:

01:11:37,256 --> 01:11:40,756

algorithms in addition to optimizing the

parameters of the model.

Speaker:

01:11:40,756 --> 01:11:44,056

I think of some type of joint inference

process where you're simultaneously using

Speaker:

01:11:44,056 --> 01:11:47,756

the right inference algorithm for your

given model and have some type of

Speaker:

01:11:47,756 --> 01:11:51,296

automation that's helping you make those

choices.

Speaker:

01:11:52,620 --> 01:11:59,160

Yeah, kind of like the automated

statistician that you were talking about

Speaker:

01:11:59,160 --> 01:12:01,740

at the beginning of the show.

Speaker:

01:12:01,880 --> 01:12:05,120

Yeah, that would be fantastic.

Speaker:

01:12:05,200 --> 01:12:12,300

Definitely kind of like having a stats

sidekick helping you when you're modeling.

Speaker:

01:12:12,300 --> 01:12:15,240

That would definitely be fantastic.

Speaker:

01:12:15,300 --> 01:12:21,260

Also, as you were saying, the workflow is

so big and diverse that...

Speaker:

01:12:21,260 --> 01:12:28,240

It's very easy to forget about something,

forget a step, neglect one, because we're

Speaker:

01:12:28,240 --> 01:12:31,500

all humans, you know, things like that.

Speaker:

01:12:31,500 --> 01:12:33,140

No, definitely.

Speaker:

01:12:33,140 --> 01:12:38,980

And as you were saying, you're also a

professor at CMU.

Speaker:

01:12:38,980 --> 01:12:45,780

So I'm curious how you approach teaching

these topics, teaching stats to prepare

Speaker:

01:12:45,780 --> 01:12:49,932

your students for all of these challenges,

especially given...

Speaker:

01:12:49,932 --> 01:12:54,372

challenges of probabilistic computing that

we've mentioned throughout this show.

Speaker:

01:12:55,820 --> 01:12:59,839

Yeah, yeah, that's something I think about

frequently actually, because, you know, I

Speaker:

01:12:59,839 --> 01:13:03,080

haven't been teaching for a very long time

and, over the course of the next

Speaker:

01:13:03,080 --> 01:13:08,900

few years, I'm gonna have to put a lot of

effort into thinking about how to give

Speaker:

01:13:08,900 --> 01:13:13,000

students who are interested in these areas

the right background so that they can

Speaker:

01:13:13,000 --> 01:13:14,660

quickly be productive.

Speaker:

01:13:14,800 --> 01:13:17,980

And what's especially challenging, at

least in my interest area, is that

Speaker:

01:13:17,980 --> 01:13:21,600

there's both the probabilistic modeling

component and there's also the programming

Speaker:

01:13:21,600 --> 01:13:22,916

languages component.

Speaker:

01:13:23,148 --> 01:13:27,428

And what I've learned is these two

communities don't talk much with one

Speaker:

01:13:27,428 --> 01:13:28,188

another.

Speaker:

01:13:28,188 --> 01:13:31,988

You have people who are doing statistics

who think like, oh, programming languages

Speaker:

01:13:31,988 --> 01:13:34,298

are just our scripts and that's really all

it is.

Speaker:

01:13:34,298 --> 01:13:37,688

And I never want to think about it because

that's the messy details.

Speaker:

01:13:37,748 --> 01:13:41,808

But programming languages, if we think

about them in a principled way and we

Speaker:

01:13:41,808 --> 01:13:46,828

start looking at the code as a

first-class citizen, just like our mathematical

Speaker:

01:13:46,828 --> 01:13:50,968

model is a first-class citizen, then we

need to really be thinking in a much more

Speaker:

01:13:50,968 --> 01:13:52,780

principled way about our programs.

Speaker:

01:13:52,780 --> 01:13:56,920

And I think the type of students who are

going to make a lot of strides in this

Speaker:

01:13:56,920 --> 01:14:00,960

research area are those who really value

the programming language, the programming

Speaker:

01:14:00,960 --> 01:14:05,380

languages theory, in addition to the

statistics and the Bayesian modeling

Speaker:

01:14:05,380 --> 01:14:08,220

that's actually used for the workflow.

Speaker:

01:14:08,580 --> 01:14:13,800

And so I think, you know, the type of

courses that we're going to need to

Speaker:

01:14:13,800 --> 01:14:17,520

develop at the graduate level or at the

undergraduate level are going to need to

Speaker:

01:14:17,520 --> 01:14:21,964

really bring together these two different

worldviews, the worldview of, you know,

Speaker:

01:14:21,964 --> 01:14:26,584

empirical data analysis, statistical model

building, things of that sort, but also

Speaker:

01:14:26,584 --> 01:14:31,004

the programming languages view where we're

actually being very formal about what are

Speaker:

01:14:31,004 --> 01:14:34,304

these actual systems, what they're doing,

what are their semantics, what are their

Speaker:

01:14:34,304 --> 01:14:39,284

properties, what are the type systems that

are enabling us to get certain guarantees,

Speaker:

01:14:39,284 --> 01:14:40,864

maybe compiler technologies.

Speaker:

01:14:40,864 --> 01:14:46,244

So I think there's elements of both of

these two different communities that need

Speaker:

01:14:46,244 --> 01:14:51,116

to be put into teaching people how to be

productive probabilistic programming

Speaker:

01:14:51,116 --> 01:14:54,876

researchers, bringing ideas from these two

different areas.

Speaker:

01:14:54,956 --> 01:15:00,016

So, you know, the students who I advise,

for example, I often try and get a sense

Speaker:

01:15:00,016 --> 01:15:02,776

for whether they're more in the

programming languages world and they need

Speaker:

01:15:02,776 --> 01:15:05,936

to learn a little bit more about the

Bayesian modeling stuff, or whether

Speaker:

01:15:05,936 --> 01:15:09,896

they're more squarely in Bayesian modeling

and they need to appreciate some of the PL

Speaker:

01:15:09,896 --> 01:15:11,116

aspects better.

Speaker:

01:15:11,116 --> 01:15:13,956

And that's sort of a game that you

have to play to figure out what are the

Speaker:

01:15:13,956 --> 01:15:17,956

right areas to be focusing on for

different students so that they can have a

Speaker:

01:15:17,956 --> 01:15:19,308

more holistic view of

Speaker:

01:15:19,308 --> 01:15:22,088

probabilistic programming and its goals

and probabilistic computing more

Speaker:

01:15:22,088 --> 01:15:25,828

generally, and building the technical

foundations that are needed to carry

Speaker:

01:15:25,828 --> 01:15:28,448

forward that research.

Speaker:

01:15:29,048 --> 01:15:31,008

Yeah, that makes sense.

Speaker:

01:15:31,208 --> 01:15:43,148

And related to that, are there any future

developments that you foresee or expect or

Speaker:

01:15:43,148 --> 01:15:48,848

hope for in probabilistic reasoning systems in

the coming years?

Speaker:

01:15:49,580 --> 01:15:50,890

Yeah, I think there's quite a few.

Speaker:

01:15:50,890 --> 01:15:55,220

And I think I already touched upon one of

them, which is, you know, the integration

Speaker:

01:15:55,220 --> 01:15:57,640

with language models, for example.

Speaker:

01:15:57,640 --> 01:16:00,340

I think there's a lot of excitement about

language models.

Speaker:

01:16:00,340 --> 01:16:04,480

I think from my perspective as a research

area, that's not what I do research in.

Speaker:

01:16:04,480 --> 01:16:08,080

But I think, you know, if we think about

how to leverage the things that they're

Speaker:

01:16:08,080 --> 01:16:12,770

good at, it might be for creating these

types of interfaces between, you know,

Speaker:

01:16:12,770 --> 01:16:16,400

automatically learned probabilistic

programs and natural language queries

Speaker:

01:16:16,400 --> 01:16:18,828

about these learned programs for solving

tasks,

Speaker:

01:16:18,828 --> 01:16:21,188

data analysis or data science tasks.

Speaker:

01:16:21,188 --> 01:16:25,428

And I think marrying

these two ideas is important, because if

Speaker:

01:16:25,428 --> 01:16:28,968

people are going to start using language

models for solving statistics, I would be

Speaker:

01:16:28,968 --> 01:16:30,028

very worried.

Speaker:

01:16:30,028 --> 01:16:34,628

I don't think language models in their

current form, which are not backed by

Speaker:

01:16:34,628 --> 01:16:38,488

probabilistic programs, are at all

appropriate for doing data science or data

Speaker:

01:16:38,488 --> 01:16:39,048

analysis.

Speaker:

01:16:39,048 --> 01:16:41,788

But I expect people will be pushing that

direction.

Speaker:

01:16:41,788 --> 01:16:45,468

The direction that I'd really like to see

thrive is the one where language models

Speaker:

01:16:45,468 --> 01:16:45,900

are

Speaker:

01:16:45,900 --> 01:16:50,180

interacting with probabilistic programs to

come up with better, more principled, more

Speaker:

01:16:50,180 --> 01:16:53,820

interpretable reasoning for answering an

end user question.

Speaker:

01:16:54,180 --> 01:16:59,260

So I think these types of probabilistic

reasoning systems, you know, will really

Speaker:

01:16:59,260 --> 01:17:04,040

make probabilistic programs more

accessible on the one hand, and will make

Speaker:

01:17:04,040 --> 01:17:06,440

language models more useful on the other

hand.

Speaker:

01:17:06,440 --> 01:17:10,060

That's something that I'd like to see from

the application standpoint.

Speaker:

01:17:10,060 --> 01:17:13,920

From the theory standpoint, I have many

theoretical questions, which maybe I won't

Speaker:

01:17:13,920 --> 01:17:14,924

get into,

Speaker:

01:17:14,924 --> 01:17:18,684

which are really related to the

foundations of random variate generation.

Speaker:

01:17:18,684 --> 01:17:22,744

Like I was mentioning at the beginning of

the talk, understanding in a more

Speaker:

01:17:22,744 --> 01:17:26,164

mathematically principled way the

properties of the inference algorithms or

Speaker:

01:17:26,164 --> 01:17:29,684

the probabilistic computations that we run

on our finite precision machines.

Speaker:

01:17:29,684 --> 01:17:34,164

I'd like to build a type of complexity

theory for these types of computations, or a theory about

Speaker:

01:17:34,164 --> 01:17:38,644

the error and complexity and the resource

consumption of Bayesian inference in the

Speaker:

01:17:38,644 --> 01:17:40,184

presence of finite resources.

Speaker:

01:17:40,184 --> 01:17:43,980

And that's a much longer-term vision, but

I think it will be quite valuable

Speaker:

01:17:43,980 --> 01:17:47,080

once we start understanding the

fundamental limitations of our

Speaker:

01:17:47,080 --> 01:17:52,040

computational processes for running

probabilistic inference and computation.

Speaker:

01:17:53,680 --> 01:17:57,080

Yeah, that sounds super exciting.

Speaker:

01:17:57,080 --> 01:17:58,040

Thanks, Feras.

Speaker:

01:17:58,740 --> 01:18:06,320

That's making me so hopeful for the coming

years to hear you talk in that way.

Speaker:

01:18:06,320 --> 01:18:11,880

I'm like, yeah, I'm super stoked about

the world that you are depicting here.

Speaker:

01:18:11,880 --> 01:18:13,932

And...

Speaker:

01:18:13,932 --> 01:18:19,732

Actually, I think I still had so

many questions for you because as I was

Speaker:

01:18:19,732 --> 01:18:21,462

saying, you're doing so many things.

Speaker:

01:18:21,462 --> 01:18:25,612

But I think I've taken enough of your

time.

Speaker:

01:18:25,612 --> 01:18:27,692

So let's call it a show.

Speaker:

01:18:27,812 --> 01:18:32,252

And before you go though, I'm going to ask

you the last two questions I ask every

Speaker:

01:18:32,252 --> 01:18:33,972

guest at the end of the show.

Speaker:

01:18:33,972 --> 01:18:39,272

If you had unlimited time and resources,

which problem would you try to solve?

Speaker:

01:18:39,292 --> 01:18:43,468

Yeah, that's a very tough question.

Speaker:

01:18:43,468 --> 01:18:46,088

I should have prepared for that one

better.

Speaker:

01:18:46,848 --> 01:18:55,448

Yeah, I think one area which would be

really worth solving, at least

Speaker:

01:18:55,448 --> 01:19:01,108

within the scope of Bayesian inference and

probabilistic modeling, is using these

Speaker:

01:19:01,108 --> 01:19:13,782

technologies to unify people around data,

around solid data-driven inferences.

Speaker:

01:19:14,028 --> 01:19:18,448

to have better discussions in empirical

fields, right?

Speaker:

01:19:18,448 --> 01:19:20,988

So obviously politics is extremely

divisive.

Speaker:

01:19:20,988 --> 01:19:26,348

People have all sorts of different

interpretations based on their political

Speaker:

01:19:26,348 --> 01:19:30,748

views and based on their aesthetics and

whatever, and all that's natural.

Speaker:

01:19:30,748 --> 01:19:36,828

But one question I think about, which is

how can we have a shared language when we

Speaker:

01:19:36,828 --> 01:19:41,848

talk about a given topic or the pros and

cons of those topics in terms of rigorous

Speaker:

01:19:41,848 --> 01:19:42,988

data-driven,

Speaker:

01:19:42,988 --> 01:19:48,708

or rigorous data-driven theses about why

we have these different views and try and

Speaker:

01:19:48,708 --> 01:19:53,628

disconnect the fundamental tensions and

bring down the temperature so that we can

Speaker:

01:19:53,628 --> 01:19:58,648

talk more about the data and have good

insights or leverage insights from the

Speaker:

01:19:58,648 --> 01:20:04,048

data and use that to guide our

decision-making across, especially the more

Speaker:

01:20:04,048 --> 01:20:07,868

divisive areas like public policy, things

of that nature.

Speaker:

01:20:07,868 --> 01:20:11,788

But I think part of the challenge, part of

why we don't do this, is that, well, you know,

Speaker:

01:20:11,788 --> 01:20:15,548

from the political standpoint, it's much

easier to not focus on what the data is

Speaker:

01:20:15,548 --> 01:20:19,098

saying because that could be expedient and

it appeals to a broader set of people.

Speaker:

01:20:19,098 --> 01:20:23,348

But at the same time, maybe we don't have

the right language of how we might use

Speaker:

01:20:23,348 --> 01:20:28,048

data to think more, you know, in a more

principled way about some of the main, the

Speaker:

01:20:28,048 --> 01:20:29,808

major challenges that we're facing.

Speaker:

01:20:29,808 --> 01:20:36,048

So I, yeah, I think I'd like to get to a

stage where we can focus more on, you

Speaker:

01:20:36,048 --> 01:20:40,620

know, principled discussions about hard

problems that are really grounded in data.

Speaker:

01:20:40,620 --> 01:20:45,160

And the way we would get those sort of

insights is by building good probabilistic

Speaker:

01:20:45,160 --> 01:20:49,660

models of the data and using them to

explain, you know, explain to policymakers

Speaker:

01:20:49,660 --> 01:20:52,880

why they shouldn't do a

certain thing, for example.

Speaker:

01:20:52,880 --> 01:20:58,260

So I think that's a very important problem

to solve because surprisingly many areas

Speaker:

01:20:58,260 --> 01:21:03,100

that are very high impact are not using

real-world inference and data to drive

Speaker:

01:21:03,100 --> 01:21:04,000

their decision-making.

Speaker:

01:21:04,000 --> 01:21:07,820

And that's quite shocking, whether that be

in medicine, you know, we're using very

Speaker:

01:21:07,820 --> 01:21:09,068

archaic

Speaker:

01:21:09,068 --> 01:21:13,068

inference technologies in medicine and

clinical trials, things of that nature,

Speaker:

01:21:13,068 --> 01:21:14,548

even economists, right?

Speaker:

01:21:14,548 --> 01:21:17,088

Like linear regression is still the

workhorse in economics.

Speaker:

01:21:17,088 --> 01:21:22,308

We're using very primitive data analysis

technologies.

Speaker:

01:21:22,308 --> 01:21:28,088

I'd like to see how we can use better data

technologies, better types of inference to

Speaker:

01:21:28,088 --> 01:21:31,908

think about these hard, hard challenging

problems.

Speaker:

01:21:32,808 --> 01:21:36,908

Yeah, couldn't agree more.

Speaker:

01:21:37,168 --> 01:21:37,900

And...

Speaker:

01:21:37,900 --> 01:21:42,020

And I'm coming from a political science

background, so for sure these topics are

Speaker:

01:21:42,020 --> 01:21:46,860

always very interesting to me, quite dear

to me.

Speaker:

01:21:47,300 --> 01:21:52,700

Even though in the last years, I have to

say I've become more and more pessimistic

Speaker:

01:21:52,700 --> 01:21:54,200

about these.

Speaker:

01:21:55,140 --> 01:22:02,280

And yeah, like I completely agree with

you, like, with the problems and the issues

Speaker:

01:22:02,280 --> 01:22:07,564

you have laid out, and as for the solutions, I am,

for now,

Speaker:

01:22:07,564 --> 01:22:10,204

completely out of them.

Speaker:

01:22:10,344 --> 01:22:16,384

Unfortunately, but yeah, like that I agree

that something has to be done.

Speaker:

01:22:16,384 --> 01:22:28,204

Because these kinds of political debates,

which are completely out of the

Speaker:

01:22:28,204 --> 01:22:33,704

scientific consensus, to me,

I'm like, but I don't know,

Speaker:

01:22:33,704 --> 01:22:37,164

we've talked about that, you know, we've

learned that, like,

Speaker:

01:22:37,164 --> 01:22:38,594

It's one of the things we know.

Speaker:

01:22:38,594 --> 01:22:41,044

I don't know why we're still arguing

about that.

Speaker:

01:22:41,044 --> 01:22:46,344

Or if we don't know, why don't we try and

find a way to, you know, find out instead

Speaker:

01:22:46,344 --> 01:22:52,744

of just being like, I know, but I'm right

because I think I'm right and my position

Speaker:

01:22:52,744 --> 01:22:54,664

actually makes sense.

Speaker:

01:22:54,884 --> 01:23:00,964

It's like one of the worst arguments like,

oh, well, it's common sense.

Speaker:

01:23:01,444 --> 01:23:07,122

Yeah, I think maybe there's some work we

have to do in having people trust, you

Speaker:

01:23:07,180 --> 01:23:12,360

know, science and data-driven inference

and data analysis more.

Speaker:

01:23:12,360 --> 01:23:16,500

That's about being more transparent, by

improving the ways in which they're being

Speaker:

01:23:16,500 --> 01:23:20,300

used, things of that nature, so that

people trust these and they become the

Speaker:

01:23:20,300 --> 01:23:24,480

gold standard for talking about different

political issues or social issues or

Speaker:

01:23:24,480 --> 01:23:26,040

economic issues.

Speaker:

01:23:26,580 --> 01:23:27,840

Yeah, for sure.

Speaker:

01:23:27,840 --> 01:23:32,820

But at the same time, and that's

definitely something I try to do at a very

Speaker:

01:23:32,820 --> 01:23:35,554

small scale with these podcasts,

Speaker:

01:23:35,660 --> 01:23:43,340

It's how do you communicate about science

and try to educate the general public

Speaker:

01:23:43,340 --> 01:23:43,859

better?

Speaker:

01:23:43,859 --> 01:23:46,380

And I definitely think it's useful.

Speaker:

01:23:46,380 --> 01:23:52,520

At the same time, it's a hard task because

it's hard.

Speaker:

01:23:52,740 --> 01:23:58,800

If you want to find out the truth, it's

often not intuitive.

Speaker:

01:23:58,800 --> 01:24:03,380

And so in a way you have to want it.

Speaker:

01:24:03,380 --> 01:24:05,284

It's like, eh.

Speaker:

01:24:05,644 --> 01:24:12,464

I know broccoli is better for my health

long term, but I still prefer to eat a

Speaker:

01:24:12,464 --> 01:24:15,404

very, very fat snack.

Speaker:

01:24:15,404 --> 01:24:17,664

I definitely prefer Snickers.

Speaker:

01:24:17,664 --> 01:24:22,464

And yet I know that eating lots of fruits

and vegetables is way better for my health

Speaker:

01:24:22,464 --> 01:24:23,604

long term.

Speaker:

01:24:23,604 --> 01:24:30,304

And I feel it's a bit of a similar issue

where it's like, I'm pretty sure people

Speaker:

01:24:30,304 --> 01:24:34,532

know it's long term better to...

Speaker:

01:24:35,020 --> 01:24:39,380

use these kinds of methods to find out

about the truth, even if it's a political

Speaker:

01:24:39,380 --> 01:24:42,400

issue, even more, I would say, if it's a

political issue.

Speaker:

01:24:44,080 --> 01:24:50,520

But it's just so easy right now, at least

given how the different political

Speaker:

01:24:50,520 --> 01:24:58,260

incentives are, especially in the Western

democracies, the different incentives that

Speaker:

01:24:58,260 --> 01:25:01,540

are created by the media structure and so

on.

Speaker:

01:25:01,540 --> 01:25:04,940

It's actually way easier to

Speaker:

01:25:04,940 --> 01:25:10,880

not care about that and just like, just

lie and say what you think is true, than

Speaker:

01:25:10,880 --> 01:25:13,100

to actually do the hard work.

Speaker:

01:25:13,100 --> 01:25:14,340

And I agree.

Speaker:

01:25:14,340 --> 01:25:16,080

It's like, it's very hard.

Speaker:

01:25:16,080 --> 01:25:23,040

How do you make that hard work look not

boring, but actually what you're supposed

Speaker:

01:25:23,040 --> 01:25:26,220

to do? And that, I don't know for now.

Speaker:

01:25:26,220 --> 01:25:26,740

Yeah.

Speaker:

01:25:26,740 --> 01:25:32,480

Um, that makes me think like, I mean, I,

I'm definitely always thinking about these

Speaker:

01:25:32,480 --> 01:25:33,452

things and so on.

Speaker:

01:25:33,452 --> 01:25:40,092

Something that definitely helped me at a

very small scale, my scale, because

Speaker:

01:25:40,092 --> 01:25:44,072

of course I'm always the scientist

around the table.

Speaker:

01:25:44,072 --> 01:25:48,952

So of course, when these kinds of topics

come up, I'm like, where does that come

Speaker:

01:25:48,952 --> 01:25:49,232

from?

Speaker:

01:25:49,232 --> 01:25:49,481

Right?

Speaker:

01:25:49,481 --> 01:25:51,202

Like, why are you saying that?

Speaker:

01:25:51,202 --> 01:25:53,092

Where, how do you know that's true?

Speaker:

01:25:53,092 --> 01:25:53,302

Right?

Speaker:

01:25:53,302 --> 01:25:55,832

What's your level of confidence and things

like that.

Speaker:

01:25:55,832 --> 01:26:01,732

There is actually a very interesting

framework which can teach you how

Speaker:

01:26:01,732 --> 01:26:03,108

to ask

Speaker:

01:26:03,276 --> 01:26:07,396

questions to actually really understand

where people are coming from and how they

Speaker:

01:26:07,396 --> 01:26:12,956

develop their positions more than trying

to argue with them about their position.

Speaker:

01:26:13,156 --> 01:26:17,476

And usually it ties in also with the

literature about that, about how to

Speaker:

01:26:17,476 --> 01:26:23,836

actually not debate, but talk with someone

who has very entrenched political views.

Speaker:

01:26:24,816 --> 01:26:28,496

And it's called street epistemology.

Speaker:

01:26:28,496 --> 01:26:30,456

I don't know if you've heard of that.

Speaker:

01:26:30,456 --> 01:26:32,476

That is super interesting.

Speaker:

01:26:32,476 --> 01:26:32,716

And

Speaker:

01:26:32,716 --> 01:26:34,216

I will link to that in the show notes.

Speaker:

01:26:34,216 --> 01:26:39,296

So there is a very good YouTube channel by

Anthony Magnabosco, who is one of the main

Speaker:

01:26:39,296 --> 01:26:42,876

people doing street epistemology.

Speaker:

01:26:42,876 --> 01:26:44,226

So I will link to that.

Speaker:

01:26:44,226 --> 01:26:50,536

You can watch his videos where he literally goes into

the street and just talks about

Speaker:

01:26:50,536 --> 01:26:54,736

very, very hot topics to random people in

the street.

Speaker:

01:26:54,736 --> 01:26:55,916

Can be politics.

Speaker:

01:26:55,916 --> 01:27:01,420

Very often it's about supernatural beliefs

about...

Speaker:

01:27:01,420 --> 01:27:06,580

religious beliefs, things like this.

Really, these are not light topics.

Speaker:

01:27:06,960 --> 01:27:11,260

But it's done through the framework of

street epistemology.

Speaker:

01:27:11,260 --> 01:27:13,660

That's super helpful, I find.

Speaker:

01:27:14,300 --> 01:27:19,320

And if you want like a more, a bigger

overview of these topics, there is a very

Speaker:

01:27:19,320 --> 01:27:25,800

good somewhat recent book that's called

How Minds Change by David McRaney, who's

Speaker:

01:27:25,800 --> 01:27:29,460

got a very good podcast also called You're

Not So Smart.

Speaker:

01:27:30,020 --> 01:27:30,572

So,

Speaker:

01:27:30,572 --> 01:27:32,412

definitely recommend those resources.

Speaker:

01:27:32,412 --> 01:27:34,326

I'll put them in the show notes.

Speaker:

01:27:36,300 --> 01:27:36,820

Awesome.

Speaker:

01:27:36,820 --> 01:27:41,660

Well, Feras, that was an unexpected end

to the show.

Speaker:

01:27:41,660 --> 01:27:42,430

Thanks a lot.

Speaker:

01:27:42,430 --> 01:27:46,600

I think we've covered so many different

topics.

Speaker:

01:27:46,980 --> 01:27:49,940

Well, actually, I still have a second

question to ask you.

Speaker:

01:27:49,940 --> 01:27:56,260

The second of the last two questions I ask every guest: if

you could have dinner with any great

Speaker:

01:27:56,260 --> 01:28:00,772

scientific mind, dead, alive, fictional,

who would it be?

Speaker:

01:28:03,468 --> 01:28:10,628

I think I will go with Hercule Poirot,

Agatha Christie's famous detective.

Speaker:

01:28:10,848 --> 01:28:16,988

So I read a lot of Hercule Poirot, and I

would ask him, because

Speaker:

01:28:16,988 --> 01:28:19,188

everything he does is based on inference.

Speaker:

01:28:19,188 --> 01:28:23,748

So I'd work with him to come up with a

formal model of the inferences that he's

Speaker:

01:28:23,748 --> 01:28:26,268

making to solve very hard crimes.

Speaker:

01:28:28,288 --> 01:28:29,708

I am not.

Speaker:

01:28:29,908 --> 01:28:33,132

That's the first time someone answers

Hercule Poirot.

Speaker:

01:28:33,132 --> 01:28:38,602

But I'm not surprised as to the

motivation.

Speaker:

01:28:38,602 --> 01:28:39,842

So I like it.

Speaker:

01:28:39,842 --> 01:28:40,632

I like it.

Speaker:

01:28:40,632 --> 01:28:43,632

I think I would do that with Sherlock

Holmes also.

Speaker:

01:28:43,632 --> 01:28:45,732

Sherlock Holmes has a very Bayesian mind.

Speaker:

01:28:45,732 --> 01:28:47,062

I really love that.

Speaker:

01:28:47,062 --> 01:28:48,572

Yeah, for sure.

Speaker:

01:28:48,832 --> 01:28:49,332

Awesome.

Speaker:

01:28:49,332 --> 01:28:50,642

Well, thanks a lot, Feras.

Speaker:

01:28:50,642 --> 01:28:52,512

That was a blast.

Speaker:

01:28:52,512 --> 01:28:53,882

We've talked about so many things.

Speaker:

01:28:53,882 --> 01:28:55,652

I've learned a lot about GPs.

Speaker:

01:28:55,652 --> 01:29:00,972

Definitely going to try AutoGP.jl.

Speaker:

01:29:01,580 --> 01:29:07,580

Thanks a lot for all the work you are

doing on that and all the different topics

Speaker:

01:29:07,580 --> 01:29:13,280

you are working on and were kind enough to

come here and talk about.

Speaker:

01:29:13,380 --> 01:29:18,860

As usual, I will put resources and links

to your website in the show notes for

Speaker:

01:29:18,860 --> 01:29:24,980

those who want to dig deeper, and feel free

to add anything yourself.

Speaker:

01:29:25,280 --> 01:29:29,600

And on that note, thank you again for

taking the time and being on this show.

Speaker:

01:29:29,600 --> 01:29:30,380

Thank you, Alex.

Speaker:

01:29:30,380 --> 01:29:31,876

I appreciate it.

Speaker:

01:29:35,756 --> 01:29:39,496

This has been another episode of Learning

Bayesian Statistics.

Speaker:

01:29:39,496 --> 01:29:44,456

Be sure to rate, review, and follow the

show on your favorite podcatcher, and

Speaker:

01:29:44,456 --> 01:29:49,356

visit learnbayesstats.com for more

resources about today's topics, as well as

Speaker:

01:29:49,356 --> 01:29:54,096

access to more episodes to help you reach

a true Bayesian state of mind.

Speaker:

01:29:54,096 --> 01:29:56,036

That's learnbayesstats.com.

Speaker:

01:29:56,036 --> 01:30:00,886

Our theme music is Good Bayesian by Baba

Brinkman, feat. MC Lars and Mega Ran.

Speaker:

01:30:00,886 --> 01:30:04,036

Check out his awesome work at

bababrinkman.com.

Speaker:

01:30:04,036 --> 01:30:05,196

I'm your host,

Speaker:

01:30:05,196 --> 01:30:06,196

Alex Andorra.

Speaker:

01:30:06,196 --> 01:30:10,456

You can follow me on Twitter at Alex

underscore Andorra, like the country.

Speaker:

01:30:10,456 --> 01:30:15,516

You can support the show and unlock

exclusive benefits by visiting patreon

Speaker:

01:30:15,516 --> 01:30:17,696

.com slash LearnBayesStats.

Speaker:

01:30:17,696 --> 01:30:20,136

Thank you so much for listening and for

your support.

Speaker:

01:30:20,136 --> 01:30:26,036

You're truly a good Bayesian, change your

predictions after taking information in, and

Speaker:

01:30:26,036 --> 01:30:29,396

if you're thinking I'll be less than

amazing,

Speaker:

01:30:29,396 --> 01:30:32,492

Let's adjust those expectations.

Speaker:

01:30:32,492 --> 01:30:37,892

Let me show you how to be a good Bayesian

Change calculations after taking fresh

Speaker:

01:30:37,892 --> 01:30:43,932

data in. Those predictions that your brain

is making, let's get them on a solid

Speaker:

01:30:43,932 --> 01:30:45,772

foundation.
