Proudly sponsored by PyMC Labs, the Bayesian Consultancy. Book a call, or get in touch!
GPs are extremely powerful… but hard to handle. One of the bottlenecks is learning the appropriate kernel. What if you could learn the structure of GP kernels automatically? Sounds really cool, but also a bit futuristic, doesn’t it?
Well, think again, because in this episode, Feras Saad will teach us how to do just that! Feras is an Assistant Professor in the Computer Science Department at Carnegie Mellon University. He received his PhD in Computer Science from MIT, and, most importantly for our conversation, he’s the creator of AutoGP.jl, a Julia package for automatic Gaussian process modeling.
Feras discusses the implementation of AutoGP, how it scales, what you can do with it, and how you can integrate its outputs in your models.
Finally, Feras provides an overview of Sequential Monte Carlo and its usefulness in AutoGP, highlighting the ability of SMC to incorporate new data in a streaming fashion and explore multiple modes efficiently.
Our theme music is « Good Bayesian », by Baba Brinkman (feat MC Lars and Mega Ran). Check out his awesome work at https://bababrinkman.com/ !
Thank you to my Patrons for making this episode possible!
Yusuke Saito, Avi Bryant, Ero Carrera, Giuliano Cruz, Tim Gasser, James Wade, Tradd Salvo, William Benton, James Ahloy, Robin Taylor, Chad Scherrer, Zwelithini Tunyiswa, Bertrand Wilden, James Thompson, Stephen Oates, Gian Luca Di Tanna, Jack Wells, Matthew Maldonado, Ian Costley, Ally Salim, Larry Gill, Ian Moran, Paul Oreto, Colin Caprani, Colin Carroll, Nathaniel Burbank, Michael Osthege, Rémi Louf, Clive Edelsten, Henri Wallen, Hugo Botha, Vinh Nguyen, Marcin Elantkowski, Adam C. Smith, Will Kurt, Andrew Moskowitz, Hector Munoz, Marco Gorelli, Simon Kessell, Bradley Rode, Patrick Kelley, Rick Anderson, Casper de Bruin, Philippe Labonde, Michael Hankin, Cameron Smith, Tomáš Frýda, Ryan Wesslen, Andreas Netti, Riley King, Yoshiyuki Hamajima, Sven De Maeyer, Michael DeCrescenzo, Fergal M, Mason Yahr, Naoya Kanai, Steven Rowland, Aubrey Clayton, Jeannine Sue, Omri Har Shemesh, Scott Anthony Robson, Robert Yolken, Or Duek, Pavel Dusek, Paul Cox, Andreas Kröpelin, Raphaël R, Nicolas Rode, Gabriel Stechschulte, Arkady, Kurt TeKolste, Gergely Juhasz, Marcus Nölke, Maggi Mackintosh, Grant Pezzolesi, Avram Aelony, Joshua Meehl, Javier Sabio, Kristian Higgins, Alex Jones, Gregorio Aguilar, Matt Rosinski, Bart Trudeau, Luis Fonseca, Dante Gates, Matt Niccolls, Maksim Kuznecov, Michael Thomas, Luke Gorrie, Cory Kiser, Julio, Edvin Saveljev, Frederick Ayala, Jeffrey Powell and Gal Kampel.
Visit https://www.patreon.com/learnbayesstats to unlock exclusive Bayesian swag 😉
Takeaways:
– AutoGP is a Julia package for automatic Gaussian process modeling that learns the structure of GP kernels automatically.
– It addresses the challenge of making structural choices for covariance functions by using a symbolic language and a recursive grammar to infer the expression of the covariance function given the observed data.
– AutoGP incorporates sequential Monte Carlo inference to handle scalability and uncertainty in structure learning.
– The package is implemented in Julia using the Gen probabilistic programming language, which provides support for sequential Monte Carlo and involutive MCMC.
– Sequential Monte Carlo (SMC) and involutive MCMC are used in AutoGP to infer the structure of the model.
– Integrating probabilistic models with language models can improve interpretability and trustworthiness in data-driven inferences.
– Challenges in Bayesian workflows include the need for automated model discovery and scalability of inference algorithms.
– Future developments in probabilistic reasoning systems include unifying people around data-driven inferences and improving the scalability and configurability of inference algorithms.
Chapters:
00:00 Introduction to AutoGP
26:28 Automatic Gaussian Process Modeling
45:05 AutoGP: Automatic Discovery of Gaussian Process Model Structure
53:39 Applying AutoGP to New Settings
01:09:27 The Biggest Hurdle in the Bayesian Workflow
01:19:14 Unifying People Around Data-Driven Inferences
Links from the show:
- Sign up to the Fast & Efficient Gaussian Processes modeling webinar: https://topmate.io/alex_andorra/901986
- Feras’ website: https://www.cs.cmu.edu/~fsaad/
- LBS #3.1, What is Probabilistic Programming & Why use it, with Colin Carroll: https://learnbayesstats.com/episode/3-1-what-is-probabilistic-programming-why-use-it-with-colin-carroll/
- LBS #3.2, How to use Bayes in industry, with Colin Carroll: https://learnbayesstats.com/episode/3-2-how-to-use-bayes-in-industry-with-colin-carroll/
- LBS #21, Gaussian Processes, Bayesian Neural Nets & SIR Models, with Elizaveta Semenova: https://learnbayesstats.com/episode/21-gaussian-processes-bayesian-neural-nets-sir-models-with-elizaveta-semenova/
- LBS #29, Model Assessment, Non-Parametric Models, And Much More, with Aki Vehtari: https://learnbayesstats.com/episode/model-assessment-non-parametric-models-aki-vehtari/
- LBS #63, Media Mix Models & Bayes for Marketing, with Luciano Paz: https://learnbayesstats.com/episode/63-media-mix-models-bayes-marketing-luciano-paz/
- LBS #83, Multilevel Regression, Post-Stratification & Electoral Dynamics, with Tarmo Jüristo: https://learnbayesstats.com/episode/83-multilevel-regression-post-stratification-electoral-dynamics-tarmo-juristo/
- AutoGP.jl, A Julia package for learning the covariance structure of Gaussian process time series models: https://probsys.github.io/AutoGP.jl/stable/
- Sequential Monte Carlo Learning for Time Series Structure Discovery: https://arxiv.org/abs/2307.09607
- Street Epistemology: https://www.youtube.com/@magnabosco210
- You Are Not So Smart podcast: https://youarenotsosmart.com/podcast/
- How Minds Change: https://www.davidmcraney.com/howmindschangehome
- Josh Tenenbaum’s lectures on computational cognitive science: https://www.youtube.com/playlist?list=PLUl4u3cNGP61RTZrT3MIAikp2G5EEvTjf
Transcript
This is an automatic transcript and may therefore contain errors. Please get in touch if you’re willing to correct them.
GPs are extremely powerful, but hard to handle. One of the bottlenecks is learning the appropriate kernels. Well, what if you could learn the structure of GP kernels automatically? Sounds really cool, right? But also, eh, a bit futuristic, doesn't it?

Well, think again, because in this episode, Feras Saad will teach us how to do just that. Feras is an assistant professor in the computer science department at Carnegie Mellon University. He received his PhD in computer science from MIT. And most importantly for our conversation, he's the creator of AutoGP.jl, a Julia package for automatic Gaussian process modeling. Feras discusses the implementation of AutoGP, how it scales, what you can do with it, and how you can integrate its outputs in your Bayesian models. Finally, Feras provides an overview of Sequential Monte Carlo and its usefulness in AutoGP, highlighting the ability of SMC to incorporate new data in a streaming fashion and explore multiple modes efficiently.

This is Learning Bayesian Statistics, episode 104, recorded February 23, 2024.
Welcome to Learning Bayesian Statistics, a podcast about Bayesian inference, the methods, the projects, and the people who make it possible. I'm your host, Alex Andorra. You can follow me on Twitter at alex_andorra, like the country, for any info about the show. LearnBayesStats.com is Laplace to be. Show notes, becoming a corporate sponsor, unlocking Bayesian merch, supporting the show on Patreon, everything is in there. That's learnbayesstats.com. If you're interested in one-on-one mentorship, online courses, or statistical consulting, feel free to reach out and book a call at topmate.io/alex_andorra. See you around, folks, and best Bayesian wishes to you all.
First, I want to thank Edvin Saveljev, Frederick Ayala, Jeffrey Powell, and Gal Kampel for supporting the show on Patreon. Your support is invaluable, guys, and literally makes this show possible. I cannot wait to talk with you in the Slack channel.

Second, I have an exciting modeling webinar coming up on April 18 with Juan Orduz, a fellow PyMC core dev and mathematician. In this modeling webinar, we'll learn how to use the new HSGP approximation for fast and efficient Gaussian processes. We'll simplify the foundational concepts, explain why this technique is so useful and innovative, and of course, we'll show you a real-world application in PyMC. So if that sounds like fun, go to topmate.io/alex_andorra to secure your seat. Of course, if you're a patron of the show, you get bonuses like submitting questions in advance, early access to the recording, et cetera. You are my favorite listeners after all.

Okay, back to the show now.
Feras Saad, welcome to Learning Bayesian Statistics.

Hi, thank you. Thanks for the invitation. I'm delighted to be here.

Yeah, thanks a lot for taking the time. Thanks a lot to Colin Carroll, who of course listeners know, he was in episode 3 of Learning Bayesian Statistics. Well, I will of course put it in the show notes, that's like a vintage episode now, from 4 years ago. I was a complete beginner in Bayesian stats, so if you wanna hear me embarrass myself, definitely that's one of the episodes you should listen to, with all my beginner's questions. And that's one of the rare episodes I could do on site; I was with Colin in person to record that episode in Boston. So, hi Colin, thanks a lot again. And Feras, let's talk about you first. How would you define the work you're doing nowadays? And also, how did you end up doing that?

Yeah, yeah, thanks. And yeah, thanks to Colin Carroll for setting up this connection. I've been watching the podcast for a while and I think it's really great how you've brought together lots of different people in the Bayesian inference community, the statistics community, to talk about their work. So thank you, and thank you to Colin for that connection.
74
Yeah, so a little background about me.
75
I'm a professor at CMU and I'm working
in...
76
a few different areas surrounding Bayesian
inference with my colleagues and students.
77
One, I think, you know, I like to think of
the work I do as following different
78
threads, which are all unified by this
idea of probability and computation.
79
So one area that I work a lot in, and I'm
sure you have lots of experience in this,
80
being one of the core developers of PyMC,
is probabilistic programming languages and
81
developing new tools that help
82
both high level users and also machine
learning experts and statistics experts
83
more easily use Bayesian models and
inferences as part of their workflow.
84
The, you know, putting my programming
languages hat on, it's important to think
85
about not only how do we make it easier
for people to write up Bayesian inference
86
workflows, but also what kind of
guarantees or what kind of help can we
87
give them in terms of verifying the
correctness of their implementations, or automating the process of getting these
probabilistic programs to begin with using
89
probabilistic program synthesis
techniques.
90
So these are questions that are very
challenging and, you know, if we're able
91
to solve them, you know, really can go a
long way.
92
So there's a lot of work in the
probabilistic programming world that I do,
93
and I'm specifically interested in
probabilistic programming languages that
94
support programmable inference.
95
So we can think of many probabilistic
programming languages like Stan or BUGS or
96
PyMC as largely having a single inference
algorithm that they're going to use
97
multiple times for all the different
programs you can express.
98
So BUGS might use Gibbs sampling, Stan uses HMC with NUTS, PyMC uses MCMC
99
algorithms, and these are all great.
100
But of course, one of the limitations is
there's no universal inference algorithm
101
that works well for any problem you might
want to express.
102
And that's where I think a lot of the
power of programmable inference comes in.
103
A lot of where the interesting research is
as well, right?
104
Like how can you support users writing
their own say MCMC proposal for a given
105
Bayesian inference problem and verify that
that proposal distribution meets the
106
theoretical conditions needed for
soundness, whether it's defining an irreducible chain, for example, or whether it's aperiodic.
108
or in the context of variational
inference, whether you define the
109
variational family that is broad enough, so its support encompasses the support of the target model.
111
We have all of these conditions that we
usually hope are correct, but our systems
112
don't actually verify that for us, whether
it's an MCMC or variational inference or
113
importance sampling or sequential Monte
Carlo.
114
And I think the more flexibility we give
programmers,
115
And I touched upon this a little bit by
talking about probabilistic program
116
synthesis, which is this idea of
probabilistic, automated probabilistic
117
model discovery.
118
And there, our goal is to use hierarchical
Bayesian models to specify prior
119
distributions, not only over model
parameters, but also over model
120
structures.
121
And here, this is based on this idea that
traditionally in statistics, a data
122
scientist or an expert,
123
we'll hand design a Bayesian model for a
given problem, but oftentimes it's not
124
obvious what's the right model to use.
125
So the idea is, you know, how can we use
the observed data to guide our decisions
126
about what is the right model structure to
even be using before we worry about
127
parameter inference?
128
So, you know, we've looked at this problem
in the context of learning models of time
129
series data.
130
Should my time series data have a periodic
component?
131
Should it have polynomial trends?
132
Should it have a change point?
133
right?
134
You know, how can we automate the
discovery of these different patterns and
135
then learn an appropriate probabilistic
model?
136
And I think it ties in very nicely to
probabilistic programming because
137
probabilistic programs are so expressive
that we can express prior distributions on
138
structures or prior distributions on
probabilistic programs all within the
139
system using this unified technology.
140
Yeah.
141
Which is where, you know, these two
research areas really inform one another.
142
If we're able to express
143
rich probabilistic programming languages,
then we can start doing inference over
144
probabilistic programs themselves and try
and synthesize these programs from data.
145
Other areas that I've looked at are
tabular data or relational data models,
146
different types of traditionally
structured data, and synthesizing models
147
there.
148
And the workhorse in that area is largely
Bayesian non -parametrics.
149
So prior distributions over unbounded
spaces of latent variables, which are, I
150
think, a very mathematically elegant way
to treat probabilistic structure discovery
151
using Bayesian inferences as the workhorse
for that.
152
And I'll just touch upon a few other areas
that I work in, which are also quite
153
aligned, which a third area I work in is
more on the computational statistics side,
154
which is now that we have probabilistic
programs and we're using them and they're
155
becoming more and more routine in the
workflow of Bayesian inference, we need to
156
start thinking about new statistical
methods and testing methods for these
157
probabilistic programs.
158
So for example, this is a little bit
different than traditional statistics
159
where, you know, traditionally in
statistics we might
160
some type of analytic mathematical
derivation on some probability model,
161
right?
162
So you might write up your model by hand,
and then you might, you know, if you want
163
to compute some property, you'll treat the
model as some kind of mathematical
164
expression.
165
But now that we have programs, these
programs are often far too hard to
166
formalize mathematically by hand.
167
So if you want to analyze their
properties, how can we understand the
168
properties of a program?
169
By simulating it.
170
So a very simple example of this would be,
say I wrote a probabilistic program for
171
some given data, and I actually have the
data.
172
Then I'd like to know whether the
probabilistic program I wrote is even a
173
reasonable prior from that data.
174
So this is a goodness of fit testing, or
how well does the probabilistic program I
175
wrote explain the range of data sets I
might see?
176
So, you know, if you do a goodness of fit
test using stats 101, you would look, all
177
right, what is my distribution?
178
What is the CDF?
179
What are the parameters that I'm going to
derive some type of thing by hand?
180
But for probabilistic programs, we can't do that.
181
So we might like to simulate data from the
program and do some type of analysis based
182
on samples of the program as compared to
samples of the observed data.
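To make that idea concrete, here is a minimal, library-agnostic sketch in Julia of such a simulation-based check; the generative program, the summary statistic, and all names below are illustrative placeholders rather than any particular package's API.

```julia
# Minimal sketch of a simulation-based goodness-of-fit check: compare a summary
# statistic of the observed data against the same statistic computed on data
# sets simulated from the probabilistic program. All names are placeholders.
using Statistics, Random

simulate_program(rng, n) = 1.0 .+ 2.0 .* randn(rng, n)   # stand-in generative program
statistic(y) = std(y)                                     # any summary of interest

function simulation_pvalue(y_obs; n_sims=1_000, rng=Random.MersenneTwister(1))
    t_obs  = statistic(y_obs)
    t_sims = [statistic(simulate_program(rng, length(y_obs))) for _ in 1:n_sims]
    mean(t_sims .>= t_obs)   # fraction of simulated data sets at least as extreme
end

y_obs = 1.0 .+ 2.5 .* randn(200)   # observed data with more spread than the program assumes
println(simulation_pvalue(y_obs))  # a very small value flags a poor prior fit
```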
183
So these type of simulation -based
analyses of statistical properties of
184
probabilistic programs for testing their
behavior or for quantifying the
185
information between variables, things like
that.
186
And then the final area I'll touch upon is
really more at the foundational level,
187
which is.
188
understanding what are the primitive
operations, a more rigorous or principled
189
understanding of the primitive operations
on our computers that enable us to do
190
random computations.
191
So what do I mean by that?
192
Well, you know, we love to assume that our
computers can freely compute over real
193
numbers.
194
But of course, computers don't have real
numbers built within them.
195
They're built on finite precision
machines, right, which means I can't
196
express.
197
some arbitrary division between two real
numbers.
198
Everything is at some level it's floating
point.
199
And so this gives us a gap between the
theory and the practice.
200
Because in theory, you know, whenever
we're writing our models, we assume
201
everything is in this, you know,
infinitely precise universe.
202
But when we actually implement it, there's
some level of approximation.
203
So I'm interested in understanding first,
theoretically, what is this approximation?
204
How important is it that I'm actually
treating my model as running on an
205
infinitely precise machine where I
actually have finite precision?
206
And second, what are the implications of
that gap for Bayesian inference?
207
Does it mean that now I actually have some
208
properties of my Markov chain that no
longer hold because I'm actually running
209
it on a finite precision machine whereby
all my analysis was assuming I have an
210
infinite precision or what does it mean
about the actual variables we generate?
211
So, you know, we might generate a Gaussian
random variable, but in practice, the
212
variable we're simulating has some other
distribution.
213
Can we theoretically quantify that other
distribution and its error with respect to
214
the true distribution?
215
Or have we come up with sampling
procedures that are as close as possible
216
to the ideal real value distribution?
217
And so this brings together ideas from
information theory, from theoretical
218
computer science.
219
And one of the motivations is to thread
those results through into the actual
220
Bayesian inference procedures that we
implement using probabilistic programming
221
languages.
222
So that's just, you know, an overview of
these three or four different areas that
223
I'm interested in and I've been working on
recently.
224
Yeah, that's amazing.
225
Thanks a lot for these, like full panel of
what you're doing.
226
And yeah, that's just incredible also that
you're doing so many things.
227
I'm really impressed.
228
And of course we're going to dive a bit
into these, at least some of these topics.
229
I don't want to take three hours of your
time, but...
230
Before that though, I'm curious if you
remembered when and how you first got
231
introduced to Bayesian inference and also
why it's ticked with you because it seems
232
like it's underpinning most of your work,
at least that idea of probabilistic
233
programming.
234
Yeah, that's a good question.
235
I think I was first interested in
probability before I was interested in
236
Bayesian inference.
237
I remember...
238
I used to read a book by Mosteller called Fifty Challenging Problems in Probability.
239
I took a course in high school and I
thought, how could I actually use these
240
cool ideas for fun?
241
And there was actually a very nice book
written back in the 50s by Mosteller.
242
So that got me interested in probability
and how we can use probability to reason
243
about real world phenomena.
244
So the book that...
245
that I used to read would sort of have
these questions about, you know, if
246
someone misses a train and the train has a
certain schedule, what's the probability
247
that they'll arrive at the right time?
248
And it's a really nice book because it
ties in our everyday experiences with
249
probabilistic modeling and inference.
250
And so I thought, wow, this is actually a
really powerful paradigm for reasoning
251
about the everyday things that we do,
like, you know, missing a bus and knowing
252
something about its schedule and when's
the right time that I should arrive to
253
maximize the probability of, you know, some event of interest, things like that.
255
So that really got me hooked to the idea
of probability.
256
But I think what really connected Bayesian
inference to me was taking, I think this
257
was as a senior or as a first year
master's student, a course by Professor
258
Josh Tenenbaum at MIT, which is computational cognitive science.
259
And that course has evolved quite a lot through the years, but the
version that I took was really a beautiful
261
synthesis of lots of deep ideas of how
Bayesian inference can tell us something
262
meaningful about how humans reason about,
you know, different empirical phenomena
263
and cognition.
264
So, you know, in cognitive science for,
you know, for...
265
majority of the history of the field,
people would run these experiments on
266
humans and they would try and analyze
these experiments using some type of, you
267
know, frequentist statistics or they would
not really use generative models to
268
describe how humans are are solving a
particular experiment.
269
But the, you know, Professor Tenenbaum's
approach was to use Bayesian models.
270
as a way of describing or at least
emulating the cognitive processes that
271
humans do for solving these types of
cognition tasks.
272
And by cognition tasks, I mean, you know,
simple experiments you might ask a human
273
to do, which is, you know, you might have
some dots on a screen and you might tell
274
them, all right, you've seen five dots,
why don't you extrapolate the next five?
275
Just simple things that, simple cognitive
experiments or, you know, yeah, so.
276
I think that being able to use Bayesian
models to describe very simple cognitive
277
phenomena was another really appealing
prospect to me throughout that course.
278
I'm seeing all the ways in which that
manifested in very nice questions about
how do we do efficient inference in real
time?
280
Because humans are able to do inference
very quickly.
281
And Bayesian inference is obviously very
challenging to do.
282
But then, if we actually want to engineer
systems, we need to think about the hard
283
questions of efficient and scalable
inference in real time, maybe at human
284
level speeds.
285
Which brought in a lot of the reason for
why I'm so interested in inference as
286
well.
287
Because that's one of the harder aspects
of Bayesian computing.
288
And then I think a third thing which
really hooked me to Bayesian inference was
289
taking a machine learning course and kind
of comparing.
290
So the way these machine learning courses
work is they'll teach you empirical risk
291
minimization, and then they'll teach you
some type of optimization, and then
292
there'll be a lecture called Bayesian
inference.
293
And...
294
What was so interesting to me at the time
was up until the time, up until the
295
lecture where we learned anything about
Bayesian inference, all of these machine
296
learning concepts seem to just be a
hodgepodge of random tools and techniques
297
that people were using.
298
So I, you know, there's the support vector
machine and it's good at classification
299
and then there's the random forest and
it's good at this.
300
But what's really nice about using
Bayesian inference in the machine learning
301
setting, or at least what I found
appealing was how you have a very clean
302
specification of the problem that you're
trying to solve in terms of number one, a
303
prior distribution over parameters and observable data, and
then the actual observed data, and three,
305
which is the posterior distribution that
you're trying to infer.
306
So you can use a very nice high -level
specification of what is even the problem
307
you're trying to solve before you even
worry about how you solve it.
308
you can very cleanly separate modeling and
inference, whereby most of the machine
309
learning techniques that I was initially
reading or learning about seem to be only
310
focused on how do I infer something
without crisply formalizing the problem
311
that I'm trying to solve.
312
And then, you know, just, yeah.
313
And then, yeah.
314
So once we have this Bayesian posterior
that we're trying to infer, then maybe
315
we'll do fully Bayesian inference, or
maybe we'll do approximate Bayesian
316
inference, or maybe we'll just do maximum
likelihood.
317
That's maybe less of a detail.
318
The more important detail is we have a
very clean specification for our problem
319
and we can, you know, build in our
assumptions.
320
And as we change our assumptions, we
change the specification.
321
So it seemed like a very systematic way,
very systematic way to build machine
322
learning and artificial intelligence
pipelines.
323
using a principled process that I found
easy to reason about.
324
And I didn't really find that in the other
types of machine learning approaches that
325
we learned in the class.
326
So yeah, so I joined the probabilistic
computing project at MIT, which is run by
327
my PhD advisor, Dr. Vikash Mansinghka.
329
And, um, I really got the opportunity to
explore these interests at the research
330
level, not only in classes.
331
And that's, I think where everything took
off afterwards.
332
Those are the synthesis of various things,
I think that got me interested in the
333
field.
334
Yeah.
335
Thanks a lot for that, that's super interesting to see. And, uh, I definitely relate to the idea of the Bayesian framework being, uh, attractive,
338
not because it's a toolbox, but because
it's more of a principle based framework,
339
basically, where instead of thinking, oh
yeah, what tool do I need for that stuff,
340
it's just always the same in a way.
341
To me, it's cool because you don't have to
be smart all the time in a way, right?
342
You're just like, it's the problem takes
the same workflow.
343
It's not going to be the same solution.
344
But it's always the same workflow.
345
Okay.
346
What does the data look like?
347
How can we model that?
348
Where is the data generative story?
349
And then you have very different
challenges all the time and different
350
kinds of models, but you're not thinking
about, okay, what is the ready made model
351
that they can apply to these data?
352
It's more like how can I create a custom
model to these data knowing the
353
constraints I have about my problem?
354
And
thinking in a principled way instead of
thinking in a toolkit way.
356
I definitely relate to that.
357
I find that amazing.
358
I'll just add to that, which is this is
not only some type of aesthetic or
359
theoretical idea.
360
I think it's actually strongly tied into
good practice that makes it easier to
361
solve problems.
362
And by that, what do I mean?
363
Well, so I did a very brief undergraduate
research project in a biology lab,
364
computational biology lab.
365
And just looking at the empirical workflow
that was done,
366
made me very suspicious about the process,
which is, you know, you might have some
367
data and then you'll hit it with PCA and
you'll get some projection of the data and
368
then you'll use a random forest classifier
and you're going to classify it in
369
different ways.
370
And then you're going to use the
classification and some type of logistic
371
regression.
372
So you're just chaining these ad hoc
different data analyses to come up with
373
some final story.
374
And while that might be okay to get you
some specific result, it doesn't really
375
tell you anything about how changing one
modeling choice in this pipeline
is going to impact your final inference
because this sort of mix and match
377
approach of applying different ad hoc
estimators to solve different subtasks
378
doesn't really give us a way to iterate on
our models, understand their limitations
379
very well, knowing their sensitivity to
different choices, or even building
380
computational systems that automate a lot
of these things, right?
381
Like probabilistic programs.
382
Like you're saying, we can write our data
generating process as the workflow itself,
383
right?
384
Rather than, you know, maybe in Matlab
I'll run PCA and then, you know, I'll use
385
scikit -learn and Python.
386
Without, I think, this type of prior
distribution over our data, it becomes
387
very hard to reason formally about our
entire inference workflow, which, you know, probabilistic programming
languages are trying to make easier and
389
give a more principled approach that's
more amenable to engineering, to
390
optimization, to things of that sort.
391
Yeah.
392
Yeah, yeah.
393
Fantastic point.
394
Definitely.
395
And that's also the way I personally tend to teach Bayesian stats now. It's much more, let's say, principle-based and workflow-based instead of just: okay, Poisson regression is this, multinomial regression is that. I find that
399
much more powerful because then when
students get out in the wild, they are
400
used to first think about the problem and
then try to see how they could solve it
401
instead of just trying to find, okay,
which model is going to be the most useful here in the models that I already
know, because then if the data are
403
different, you're going to have a lot of
problems.
404
Yeah.
405
And so you actually talked about the
different topics that you work on.
406
There are a lot I want to ask you about.
407
One of my favorites, and actually I think
Colin also has been working a bit on that
408
lately.
409
is the development of AutoGP.jl.
410
So I think that'd be cool to talk about
that.
411
What inspired you to develop that package,
which is in Julia?
412
Maybe you can also talk about that if you
mainly develop in Julia most of the time,
413
or if that was mostly useful for that
project.
414
And how does this package...
415
advance, like, help learning the structure of Gaussian process kernels, because if I
416
understand correctly, that's what the
package is mostly about.
417
So yeah, if you can give a primer to
listeners about that.
418
Definitely.
419
Yes.
420
So Gaussian Processes are a pretty
standard model that's used in many
421
different application areas.
422
spatial temporal statistics and many
engineering applications based on
423
optimization.
424
So these Gaussian process models are
parameterized by covariance functions,
425
which specify how the data produced by
this Gaussian process co -varies across
426
time, across space, across any domain
which you're able to define some type of
427
covariance function.
428
But one of the main challenges in using a
Gaussian process for modeling your data,
429
is making the structural choice about what
should the covariance structure be.
430
So, you know, the one of the universal
choices or the most common choices is to
431
say, you know, some type of a radial basis
function for my data, the RBF kernel, or,
432
you know, maybe a linear kernel or a
polynomial kernel, somehow hoping that
433
you'll make the right choice to model your
data accurately.
434
So the inspiration for auto GP or
automatic Gaussian process is to try and
435
use the data not only to infer the numeric
parameters of the Gaussian process, but
436
also the structural parameters or the
actual symbolic structure of this
437
covariance function.
438
And here we are drawing our inspiration
from work which is maybe almost 10 years old now, from David Duvenaud and colleagues, called the Automated Statistician Project,
440
or ABCD, Automatic Bayesian Covariance
Discovery, which introduced this idea of
441
defining a symbolic language.
442
over Gaussian process covariance functions
or covariance kernels and using a grammar,
443
using a recursive grammar and trying to
infer an expression in that grammar given
444
the observed data.
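As a rough illustration, such a recursive grammar can be written down as a small set of base kernels plus composition rules. The Julia sketch below uses invented type names and probabilities, purely for illustration; it is not AutoGP.jl's actual internal representation.

```julia
# Illustrative sketch of a symbolic kernel language and a recursive prior over
# kernel expressions. Type names and probabilities are invented for this
# example; they are not AutoGP.jl's actual internals.
abstract type Kernel end

struct Linear      <: Kernel; intercept::Float64 end
struct Periodic    <: Kernel; period::Float64; lengthscale::Float64 end
struct SquaredExp  <: Kernel; lengthscale::Float64 end
struct Sum         <: Kernel; left::Kernel; right::Kernel end
struct Product     <: Kernel; left::Kernel; right::Kernel end
struct ChangePoint <: Kernel; left::Kernel; right::Kernel; location::Float64 end

# Prior over expressions: recursively pick either a base kernel or a composite.
function sample_kernel(depth::Int=0)
    if depth >= 3 || rand() < 0.4                  # favor small expressions
        base = rand([:linear, :periodic, :se])
        base == :linear   && return Linear(randn())
        base == :periodic && return Periodic(12.0 * rand(), rand())
        return SquaredExp(rand())
    end
    op = rand([:sum, :product, :changepoint])
    left, right = sample_kernel(depth + 1), sample_kernel(depth + 1)
    op == :sum     && return Sum(left, right)
    op == :product && return Product(left, right)
    return ChangePoint(left, right, 100.0 * rand())
end
```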
445
So, you know, in a time series setting,
for example, you might have time on the
446
horizontal axis and the variable on the y
-axis and you just have some variable
447
that's evolving.
448
You don't know necessarily the dynamics of
that, right?
449
There might be some periodic structure in
the data or there might be multiple
450
periodic effects.
451
Or there might be a linear trend that's
overlaying the data.
452
Or there might be a point in time in which
the data is switching between some process
453
before the change point and some process
after the change point.
454
Obviously, for example, in the COVID era,
almost all macroeconomic data sets had
455
some type of change point around April
2020.
456
And we see that in the empirical data that
we're analyzing today.
457
So the question is, how can we
automatically surface these structural
458
choices?
459
using Bayesian inference.
460
So the original approach that was in the
automated statistician was based on a type
461
of greedy search.
462
So they were trying to say, let's find the
single kernel that maximizes the
463
probability of the data.
464
Okay.
465
So they're trying to do a greedy search
over these kernel structures for Gaussian
466
processes using these different search
operators.
467
And for each different kernel, you might
find the maximum likelihood parameter, et
468
cetera.
469
And I think that's a fine approach.
470
But it does run into some serious
limitations, and I'll mention a few of
471
them.
472
One limitation is that greedy search is in
a sense not representing any uncertainty
473
about what's the right structure.
474
It's just finding a single best structure
to maximize some probability or maybe
475
likelihood of the data.
476
But we know just like parameters are
uncertain, structure can also be quite
477
uncertain because the data is very noisy.
478
We may have sparse data.
479
And so, you know, we'd want type of
inference systems that are more robust.
480
when discovering the temporal structure in
the data and that greedy search doesn't
481
really give us that level of robustness
through expressing posterior uncertainty.
482
I think another challenge with greedy
search is its scalability.
483
And by that, if you have a very large data
set in a greedy search algorithm, we're
484
typically at each stage of the search,
we're looking at the entire data set to
485
score our model.
486
And this is also true of traditional Markov chain Monte Carlo algorithms.
487
We often score our data set, but in the
Gaussian process setting, scoring the data
488
set is very expensive.
489
If you have N data points, it's going to
cost you N cubed.
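For reference, that N-cubed cost comes from the Gaussian process log marginal likelihood, which requires factorizing an n-by-n covariance matrix:

$$\log p(\mathbf{y} \mid X, k_\theta) = -\tfrac{1}{2}\,\mathbf{y}^{\top}\!\left(K_\theta + \sigma^2 I\right)^{-1}\mathbf{y} \;-\; \tfrac{1}{2}\log\bigl|K_\theta + \sigma^2 I\bigr| \;-\; \tfrac{n}{2}\log 2\pi,$$

where $(K_\theta)_{ij} = k_\theta(x_i, x_j)$. The Cholesky factorization behind the matrix inverse and the log-determinant is the $O(n^3)$ step, and it has to be redone every time the kernel structure or its parameters change.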
490
And so it becomes quite infeasible to run
greedy search or even pure Markov chain
491
Monte Carlo, where at each step, each time
you change the parameters or you change
492
the kernel, you need to now compute the
full likelihood.
493
And so the second motivation in AutoGP is
to build an inference algorithm.
494
that is not looking at the whole data set
at each point in time, but using subsets
495
of the data set that are sequentially
growing.
496
And that's where the sequential Monte
Carlo inference algorithm comes in.
497
So AutoGP is implemented in Julia.
498
And the API is that basically you give it
a one -dimensional time series.
499
You hit infer.
500
And then it's going to report an ensemble
of Gaussian processes or a sample from my
501
posterior distribution, where each
Gaussian process has some particular
502
structure and some numeric parameters.
503
And you can show the user, hey, I've
inferred these hundred GPS from my
504
posterior.
505
And then they can start using them for
generating predictions.
506
You can use them to find outliers because
these are probabilistic models.
507
You can use them for a lot of interesting
tasks.
508
Or you might say, you know,
509
This particular model actually isn't
consistent with what I know about the
510
data.
511
So you might remove one of the posterior
samples from your ensemble.
512
Yeah, so those are, you know, we used
AutoGP on the M3.
513
We benchmarked it on the M3 competition
data.
514
M3 is around, or the monthly data sets in M3 are around 1,500 time series, you know, between 100 and 500 observations in length.
516
And we compared the performance against
different statistics baselines and machine
517
learning baselines.
518
And it's actually able to find pretty
common sense structures in these economic
519
data.
520
Some of them have seasonal features,
multiple seasonal effects as well.
521
And what's interesting is we don't need to
customize the prior to analyze each data
522
set.
523
It's essentially able to discover.
524
And what's also interesting is that
sometimes when the data set just looks
525
like a random walk, it's going to learn a
covariance structure, which emulates a
526
random walk.
527
So by having a very broad prior
distribution on the types of covariance
528
structures that you see, it's able to find
which of these are plausible explanation
529
given the data.
530
Yes, as you mentioned, we implemented this
in Julia.
531
The reason is that AutoGP is built on the
Gen probabilistic programming language,
532
which is embedded in the Julia language.
533
And the reason that Gen, I think, is a
very useful system for this problem.
534
So Gen was developed primarily by Marco Cusumano-Towner, who wrote a PhD thesis. He was a colleague of mine at the MIT Probabilistic Computing Project.
536
And Gen really, it's a Turing complete
language and has programmable inference.
537
So you're able to write a prior
distribution over these symbolic
538
expressions in a very natural way.
539
And you're able to customize an inference
algorithm that's able to solve this
540
problem efficiently.
541
And
542
What really drew us to Gen for this
problem, I think, are twofold.
543
The first is its support for sequential
Monte Carlo inference.
544
So it has a pretty mature library for
doing sequential Monte Carlo.
545
And sequential Monte Carlo construed more
generally than just particle filtering,
546
but other types of inference over
sequences of probability distributions.
547
So particle filters are one type of
sequential Monte Carlo algorithm you might
548
write.
549
But you might do some type of temperature
annealing or data annealing or other types
550
of sequentialization strategies.
551
And Gen provides a very nice toolbox and
abstraction for experimenting with
552
different types of sequential Monte Carlo
approaches.
553
And so we definitely made good use of that
library when developing our inference
554
algorithm.
555
The second reason I think that Gen was
very nice to use is its library for
556
involutive MCMC.
557
And involutive MCMC, it's a relatively new
framework.
558
It was discovered, I think, concurrently and independently both by Marco and other
folks.
560
And this is kind of, you can think of it
as a generalization of reversible jump
561
MCMC.
562
And it's really a unifying framework to
understand many different MCMC algorithms
563
using a common terminology.
564
And so there's a wonderful ICML paper
which lists 30 or so different algorithms
565
that people use all the time like
Hamiltonian Monte Carlo, reversible jump
566
MCMC, Gibbs sampling, Metropolis Hastings.
567
and expresses them using the language of
involutive MCMC.
568
I believe the author is Neklyudov,
although I might be mispronouncing that,
569
sorry for that.
570
So, Gen has a library for involutive MCMC,
which makes it quite easy to write
571
different proposals for how you do this
inference over your symbolic expressions.
572
Because when you're doing MCMC within the
inner loop of a sequential Monte Carlo
573
algorithm,
574
You need to somehow be able to improve
your current symbolic expressions for the
575
covariance kernel, given the observed
data.
576
And, uh, doing that is, is hard because
this is kind of a reversible jump
577
algorithm where you make a structural
change.
578
Then you need to maybe generate some new
parameters.
579
You need the reverse probability of going
back.
580
And so Gen has a high level, has a lot of
automation and a library for implementing
581
these types of structure moves in a very
high level way.
582
And it automates the low-level math for computing the acceptance probability and
embedding all of that within an outer
584
level SMC loop.
585
And so this is, I think, one of my
favorite examples for what probabilistic
586
programming can give us, which is very
expressive priors over these, you know,
587
symbolic expressions generated by symbolic
grammars, powerful inference algorithms
588
using combinations of sequential Monte
Carlo and involutive MCMC and reversible
589
jump moves and gradient based inference
over the parameters.
590
It really brings together a lot of the
591
a lot of the strengths of probabilistic
programming languages.
592
And we showed at least on these M3
datasets that they can actually be quite
593
competitive with state -of -the -art
solutions, both in statistics and in
594
machine learning.
595
I will say, though, that as with
traditional GPs, the scalability is really
596
in the likelihood.
597
So as for whether AutoGP can handle datasets with 10,000 data points, that's actually quite hard, because ultimately,
599
Once you've seen all the data in your
sequential Monte Carlo, you will be forced
600
to do this sort of N cubed scaling, which
then, you know, you need some type of
601
improvements or some type of approximation
for handling larger data.
602
But I think what's more interesting in
AutoGP is not necessarily that it's
603
applied to inferring structures of
Gaussian processes, but that it's sort of
604
a library for inferring probabilistic
structure and showing how to do that by
605
integrating these different inference
methodologies.
606
Hmm.
607
Okay.
608
Yeah, so many things here.
609
So first, I put all the links to AutoGP.jl in the show notes.
610
I also put a link to the underlying paper
that you've written with some co -authors
611
about, well, the sequential Monte Carlo
learning that you're doing to discover
612
these time -series structure for people
who want to dig deeper.
613
And I put also a link to all, well, most
of the LBS episodes where we talk about
614
Gaussian processes for people who need a
bit more background information because
615
here we're mainly going to talk about how
you do that and so on and how useful is
616
it.
617
And we're not going to give a primer on
what Gaussian processes are.
618
So if you want that, folks, there are a
bunch of episodes in the show notes for
619
that.
620
So...
621
on that, basically, the practical utility of that time-series discovery.
622
So if I understood correctly, for now, you
can do that only on one -dimensional input
623
data.
624
So that would be basically on a time
series.
625
You cannot input, let's say, that you have
categories.
626
These could be age groups.
627
So you could one-hot, usually I think that's the way it's done, the way to give that to a GP would be to one-hot encode each of these age groups. And then that means, let's say you have four age groups. Now the input dimension of your GP is not one, which is time, but it's five. So one for time and four for the age groups.
633
This would not work here, right?
634
Right, yes.
635
So at the moment, we're focused on, and
these are called, I guess, in
636
econometrics, pure time series models,
where you're only trying to do inference
637
on the time series based on its own
history.
638
I think the extensions that you're
proposing are very natural to consider.
639
You might have a multi -input Gaussian
process where you're not only looking at
640
your own history, but you're also
considering some type of categorical
641
variable.
642
Or you might have exogenous covariates
evolving along with the time series.
643
If you want to predict temperature, for
example, you might have the wind speed and
644
you might want to use that as a feature
for your Gaussian process.
645
Or you might have an output, a multiple
output Gaussian process.
646
You want a Gaussian process over multiple
different time series generally.
647
And I think all of these variants are, you
know, they're possible to develop.
648
There's no fundamental difficulty, but the
main, I think the main challenge is how
649
can you define a domain specific language
over these covariance structures for
650
multi, for multivariate input data?
651
That becomes a little bit more challenging.
652
So in the time series setting, what's nice
is we can interpret how any type of
653
covariance kernel is going to impact the
actual prior over time series.
654
Once we're in the multi -dimensional
setting, we need to think about how to
655
combine the kernels for different
dimensions in a way that's actually
656
meaningful for modeling to ensure that
it's more tractable.
657
But I think extensions of the DSL to
handle multiple inputs, exogenous
658
covariates, multiple outputs,
659
These are all great directions.
660
And I'll just add on top of that, I think
another important direction is using some
661
of the more recent approximations for
Gaussian processes.
662
So we're not bottlenecked by the n cubed
scaling.
663
So there are, I think, a few different
approaches that have been developed.
664
There are approaches which are based on
stochastic PDEs or state space
665
approximations of Gaussian processes,
which are quite promising.
666
There's some other things like nearest
neighbor Gaussian processes, but I'm a
667
little less confident about those because
we lose a lot of the nice affordances of
668
GPs once we start doing nearest neighbor
approximations.
669
But I think there's a lot of new methods
for approximate GPs.
670
So we might do a stochastic variational
inference, for example, an SVGP.
671
So I think as we think about handling more
672
more richer types of data, then we should
also think about how to start introducing
673
some of these more scalable approximations
to make sure we can still efficiently do
674
the structure learning in that setting.
675
Yeah, that would be awesome for sure.
676
As a more, much more on the practitioner
side than on the math side.
677
Of course, that's where my head goes
first.
678
You know, I'm like, oh, that'd be awesome,
but I would need to have that to have it
679
really practical.
680
Um, and so if I use AutoGP.jl, so I give it time series data. Um, then what do I get back? Do I get back, um, the posterior samples of the implied model, or do I get back the covariance structure?
684
So that could be, I don't know what, what
form that could be, but I'm thinking, you
685
know,
686
Uh, often when I use GPs, I use them
inside other models with other, like I
687
could use a GP in a linear regression, for
instance.
688
And so I'm thinking that'd be cool if I'm
not sure about the covariance structure,
689
especially if it can do the discovery of
the seasonality and things like that
690
automatically, because it's always
seasonality is a bit weird and you have to
691
add another GP that can handle
periodicity.
692
Um, and then you have basically a sum of
GP.
693
And then you can take that sum of GP and
put that in the linear predictor of the
694
linear regression.
695
That's usually how I use that.
696
And very often, I'm using categorical
predictors almost always.
697
And I'm thinking what would be super cool
is that I can outsource that discovery
698
part of the GP to the computer like you're
doing with this algorithm.
699
And then I get back under what form?
700
I don't know yet.
701
I'm just thinking about that.
702
this covariance structure that I can just, which would be an MV normal, like a multivariate normal in a way, that I just use in my linear predictor. And then I can use that, for instance, in a PyMC model or something like that, without having to specify the GP myself.
706
Is it something that's doable?
707
Yeah, yeah, I think that's absolutely
right.
708
So you can, because Gaussian processes are
compositional, just, you know, you
709
mentioned the sum of two Gaussian
processes, which corresponds to the sum of
710
two kernel.
711
So if I have Gaussian process one plus
Gaussian process two, that's the same as
712
the Gaussian process whose covariance is
k1 plus k2.
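In symbols, this is the closure property being described (for independent GPs):

$$f_1 \sim \mathcal{GP}(0, k_1),\;\; f_2 \sim \mathcal{GP}(0, k_2) \;\;\Longrightarrow\;\; f_1 + f_2 \sim \mathcal{GP}(0, k_1 + k_2).$$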
713
And so what that means is we can take our
synthesized kernel, which is comprised of
714
some base kernels and then maybe sums and
products and change points, and we can
715
wrap all of these in just one mega GP,
basically, which would encode the entire posterior distribution or, you know, a summary of all of the samples in one GP.
718
Another, and I think you also mentioned an
important point, which is multivariate
719
normals.
720
You can also think of the posterior as
just a mixture of these multivariate
721
normals.
722
So let's say I'm not going to sort of
compress them into a single GP, but I'm
723
actually going to represent the output of
auto GP as a mixture of multivariate
724
normals.
725
And that would be another type of API.
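Concretely, with M weighted particles from SMC, the posterior predictive at test inputs x* is approximated by a mixture of the per-particle GP predictive normals:

$$p(\mathbf{y}^{*} \mid \mathbf{x}^{*}, \mathcal{D}) \;\approx\; \sum_{i=1}^{M} w_i\, \mathcal{N}\!\bigl(\mathbf{y}^{*};\, \mu_i(\mathbf{x}^{*}),\, \Sigma_i(\mathbf{x}^{*})\bigr), \qquad \sum_{i=1}^{M} w_i = 1,$$

where particle i has its own kernel expression and parameters, w_i is its normalized SMC weight, and (mu_i, Sigma_i) are the usual GP posterior predictive mean and covariance under that kernel.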
726
So depending on exactly what type of
727
how you're planning to use the GP, I think
you can use the output of auto GP in the
728
right way, because ultimately, it's
producing some covariance kernels, you
729
might aggregate them all into a GP, or you
might compose them together to make a
730
mixture of GPs.
731
And you can export this to PyTorch, or
most of the current libraries for GPs
732
support composing the GPs with one
another, et cetera.
733
So I think depending on the use case, it
should be quite straightforward to figure
734
out how to leverage the output of AutoGP
to use within the inner loop of some broader model, or within the internals of some larger
linear regression model or other type of
736
model.
737
Yeah, that's definitely super cool because
then you can, well, yeah, use that,
738
outsource that part of the model where I
think the algorithm probably...
739
If not now, in just a few years, it's
going to do a better job than most
740
modelers, at least to have a rough first
draft.
741
That's right.
742
The first draft.
743
A data scientist who's determined enough
to beat AutoGP, probably they can do it if
744
they put in enough effort just to study
the data.
745
But it's getting a first pass model that's
actually quite good as compared to other
746
types of automated techniques.
747
Yeah, exactly.
748
I mean, that's recall.
749
It's like asking for a first draft of, I
don't know, blog post to ChatGPT and then
750
going yourself in there and improving it
instead of starting everything from
751
scratch.
752
Yeah, for sure you could do it, but that's
not where your value added really lies.
753
So yeah.
754
So what you get is these kind of samples.
755
In a way, do you get back samples?
756
or do you get symbolic variables back?
757
You get symbolic expressions for the
covariance kernels as well as the
758
parameters embedded within them.
759
So you might get, let's say you asked for
five posterior samples, you're going to
760
have maybe one posterior sample, which is
a linear kernel.
761
And then another posterior sample, which
is a linear times linear, so a quadratic
762
kernel.
763
And then maybe a third posterior sample,
which is again, a linear, and each of them
764
will have their different parameters.
765
And because we're using sequential Monte
Carlo,
766
all of the posterior samples are
associated with weights.
767
The sequential Monte Carlo returns a
weighted particle collection, which is
768
approximating the posterior.
769
So you get back these weighted particles,
which are symbolic expressions.
770
And we have, in AutoGP, we have a minimal
prediction GP library.
771
So you can actually put these symbolic
expressions into a GP to get a functional
772
GP, but you can export them to a text file
and then use your favorite GP library and
773
embed them within that as well.
774
And we also get noise parameters.
775
So each kernel is going to be associated
with the output noise.
776
Because obviously depending on what kernel
you use, you're going to infer a different
777
noise level.
778
So you get a kernel structure, parameters,
and noise for each individual particle in
779
your SMC ensemble.
780
OK, I see.
781
Yeah, super cool.
782
And so yeah, if you can get back that as a
text file.
783
Like either you use it in a full Julia
program, or if you prefer R or Python, you
784
could use AutoGP.jl just for that.
785
Get back a text file and then use that in
R or in Python in another model, for
786
instance.
787
Okay.
788
That's super cool.
789
Do you have examples of that?
790
Yeah.
791
Do you have examples of that we can link
to for listeners in the show notes?
792
We have tutorial.
793
And so...
794
The tutorial, I think, prints, it shows a
print of the, it prints the learned
795
structures into the output cells of the
IPython notebooks.
796
And so you could take the printed
structure and just save it as a text file
797
and write your own little parser for
extracting those structures and building
798
an RGP or a PyTorch GP or any other GP.
799
Okay.
800
Yeah.
801
That was super cool.
802
That's awesome.
803
And do you know if there is already an
implementation in R?
804
and or in Python of what you're doing in
AutoGP.jl?
805
Yeah, so we, so this project was
implemented during my year at Google when
806
I was so between starting at CMU and
finishing my PhD, I was at Google for a
807
year as a visiting faculty scientist.
808
And some of the prototype implementations
were also in Python.
809
But I think the only public version at the
moment is the Julia version.
810
But I think it's a little bit challenging
to reimplement this because one of the
811
things we learned when trying to implement
it in Python is that we don't have Gen, or
812
at least at the time we didn't.
813
The reason we focused on Julia is that we
could use the power of the Gen
814
probabilistic programming language in a
way that made model development and
815
iterating.
816
much more feasible than a pure Python
implementation or even, you know, an R
817
implementation or in another language.
818
Yeah.
819
Okay.
820
Um, and so actually, yeah, so I, I would
have so many more questions on that, but I
821
think that's already a good, a good
overview of, of that project.
822
Maybe I'm curious about the, the biggest
obstacle that you had on the path, uh,
823
when developing
824
that package, AutoGP.jl, and also what
are your future plans for this package?
825
What would you like to see it become in
the coming months and years?
826
Yeah.
827
So thanks for those questions.
828
So for the biggest challenge, I think
designing and implementing the inference
829
algorithm that includes...
830
sequential Monte Carlo and involutive MCMC.
That was a challenge because there aren't many prior works in the literature that have actually explored this type of combination, which is really at the heart of AutoGP: designing the right proposal distributions for "I have some given structure and I have my data, how do I do a data-driven proposal?" So I'm not just blindly proposing some new structure or sub-structure from the prior, but actually using the observed data to come up with a smart proposal for how I'm going to improve the structure in the inner loop of MCMC. We put a lot of thought into the actual move types and how to use the data to come up with data-driven proposal distributions, and the paper describes some of these tricks. There are moves based on replacing a random subtree. There are moves which detach a subtree and either throw everything away or embed the subtree within a new tree. So there are these different types of moves, which we found are more helpful to guide the search. And it was a challenging process to figure out how to implement those moves and how to debug them. So that, I think, was part of the challenge.
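To make the flavor of these structure moves concrete, here is a small self-contained sketch of a kernel expression tree and a "replace random subtree" move. This is not AutoGP.jl's actual implementation; the real moves are involutive MCMC kernels with data-driven proposals and proper acceptance ratios. The types and helper names below are invented for illustration only.

```julia
# Toy kernel expression grammar: leaves are base kernels, internal nodes
# combine sub-kernels by sum or product.
abstract type Kern end
struct SE <: Kern; lengthscale::Float64; end
struct Per <: Kern; period::Float64; end
struct Lin <: Kern; offset::Float64; end
struct Plus <: Kern; left::Kern; right::Kern; end
struct Times <: Kern; left::Kern; right::Kern; end

# Draw a fresh expression from a toy prior over structures.
function sample_structure(depth=0)
    if depth >= 2 || rand() < 0.5
        return rand([SE(rand()), Per(rand()), Lin(rand())])
    else
        op = rand([Plus, Times])
        return op(sample_structure(depth + 1), sample_structure(depth + 1))
    end
end

# Count nodes in pre-order so we can pick one uniformly at random.
nnodes(k::Kern) = k isa Union{Plus,Times} ? 1 + nnodes(k.left) + nnodes(k.right) : 1

# "Replace random subtree": substitute a fresh prior draw at a uniformly
# chosen node. (A real MCMC move would also compute an acceptance
# probability that uses the data, as discussed above.)
function replace_random_subtree(k::Kern)
    target = rand(1:nnodes(k))
    counter = Ref(0)
    function rebuild(node)
        counter[] += 1
        counter[] == target && return sample_structure()
        node isa Plus  && return Plus(rebuild(node.left), rebuild(node.right))
        node isa Times && return Times(rebuild(node.left), rebuild(node.right))
        return node
    end
    return rebuild(k)
end
```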
I think another challenge we were facing was, of course, the fact that we were using these dense Gaussian process models without the approximations that are needed to scale to, say, tens or hundreds of thousands of data points. And this, I think, was part of the motivation for thinking about what other types of GP approximations would let us handle datasets of that size.
In terms of what I'd like AutoGP to be in the future, I think there are two answers to that. One answer, and I think there's already a nice success case here, is that I'd like the implementation of AutoGP to be a reference for how to do probabilistic structure discovery using Gen. I expect that people across many different disciplines have this problem of not knowing what their specific model is for the data. You might have a prior distribution over symbolic model structures and, given your observed data, you want to infer the right model structure. And I think in the AutoGP code base we have a lot of the important components that are needed to apply this workflow to new settings. So we've really put a lot of effort into having the code be self-documenting, in a sense, and making it easier for people to adapt the code for their own purposes. And there was a recent paper this year, presented at NeurIPS by Tracy Mills and Sam Shayet from Professor Tenenbaum's group, that extended the AutoGP package for a task in cognition, which was very nice to see: the code isn't only valuable for its own purpose, but is also adaptable by others for other types of tasks.
And the second thing I'd like AutoGP, or at least AutoGP-type models, to do is to integrate with language models. This goes back to the original automatic statistician that motivated AutoGP; that work is, say, ten years old. The automated statistician had a natural language processing component, and at the time there was no ChatGPT or large language models, so they just wrote some simple rules to take the learned Gaussian process and summarize it in terms of a report. But now we have much more powerful language models. And one question could be: how can I use the outputs of AutoGP and integrate them within a language model, not only for reporting the structure, but also for answering probabilistic queries? So you might say: find for me a time when there could be a change point, or give me a numerical estimate of the covariance between two different time slices, or impute the data between these two different time regions, or give me a 95% prediction interval. A data scientist, or rather a domain specialist, can write these in natural language, and then you would compile them into different little programs that query the GP learned by AutoGP. So the idea is to create some type of higher-level interface that makes it possible for people to not necessarily dive into the guts of Julia, or even write an IPython notebook, but to have the system learn the probabilistic models and then have a natural language interface you can use to query those models, either for learning something about the structure of the data or for solving prediction tasks. And in both cases, I think off-the-shelf models may not work so well, because they may not know how to parse the AutoGP kernel and come up with a meaningful summary of what it actually means in terms of the data, or they may not know how to translate natural language into Julia code for AutoGP. So there's a little bit of research in thinking about how we fine-tune these models so that they're able to interact with the automatically learned probabilistic models.

And I'll just mention here one of the benefits of an AutoGP-like system, which is its interpretability. Because Gaussian processes are quite transparent, like you said, they're ultimately, at the end of the day, these giant multivariate normals, we can explain to people who use these types of distributions, and are comfortable with them, what exactly the distribution is that's been learned. It's not: here are some weights in some giant neural network, here's the prediction, and you have to live with it. Rather, you can say: well, here's our prediction, and the reason we made this prediction is that we inferred a seasonal component with such-and-such frequency. So you can get the predictions, but you can also get some type of interpretable summary of why those predictions were made, which maybe helps with the trustworthiness of the system, or just transparency more generally.
Yeah, I'm signing up now. That sounds like an awesome tool. Yeah, for sure, that looks absolutely fantastic. And hopefully these kinds of tools will help. I'm definitely curious to try that now in my own models, basically, and see what AutoGP.jl tells me about the covariance structure, and then try and use that myself in a model of mine, probably in Python, so that I have to get out of Julia and see how you can plug that into another model. That would be super, super interesting for sure. Yeah, I'm going to try and find an excuse to do that.
Actually, I'm curious now: we could talk a bit about how that's done, right? How you do that discovery of the time series structure. You've mentioned that you're using sequential Monte Carlo to do that. So, SMC: can you give listeners an idea of what SMC is and why it would be useful in this case? And also whether the way you do it for these projects differs from the classical way of doing SMC.
Good, yes, thanks for that question. So sequential Monte Carlo is a very broad family of algorithms. One of the confusing parts for me when I was learning sequential Monte Carlo is that a lot of the introductory material is very closely married to particle filters. But particle filtering, which is only one application of sequential Monte Carlo, isn't the whole story. There are now more modern expositions of sequential Monte Carlo which really bring to light how general these methods are, and here I would like to recommend Professor Nicolas Chopin's textbook, An Introduction to Sequential Monte Carlo. It's a Springer 2020 textbook. I continue to use it in my research, and I think it's a very well-written overview of how general and how powerful sequential Monte Carlo is.

So, a brief explanation of sequential Monte Carlo. Maybe one way to explain it is to contrast it with traditional Markov chain Monte Carlo. In traditional MCMC, we have some particular latent state, let's call it theta, which is supposed to be drawn from p of theta given x, where that's our posterior distribution and x is the data. And we just apply some transition kernel over and over and over again, and we hope that, in the limit of the applications of these transition kernels, we're going to converge to the posterior distribution. So MCMC is just one iterative chain that you run forever. You can do a little bit of modification, you might have multiple chains which are independent of one another, but sequential Monte Carlo is, in a sense, trying to go beyond that: anything you can do in a traditional MCMC algorithm, you can do using sequential Monte Carlo. But in sequential Monte Carlo, you don't have a single chain; you have multiple different particles. Each of these particles you can think of as being analogous in some way to a particular MCMC chain, but they're allowed to interact. So you start with, say, some number of particles, and you start with no data, and what you do is just draw these particles from your prior distribution. Each of these draws from the prior is basically a draw from p of theta. And now I'd like to get them to p of theta given x; that's my goal. So how am I going to go from p of theta to p of theta given x? There are many different ways you might do that, and that's exactly what's sequential, right?
How do you go from the prior to the posterior? The approach we take in AutoGP is based on this idea of data tempering. So let's say my data x consists of a thousand measurements, and I'd like to go from p of theta to p of theta given x. Here's one sequential strategy I can use to bridge between these two distributions: I can start with p of theta, then p of theta given x1, then p of theta given x1 and x2, then p of theta given x1 through x3, and so on. So I can anneal, or temper, these data points into the prior, and the more data points I put in, the closer I'm going to get to the full posterior, p of theta given x1 through x1000. Or you might introduce the data in batches.
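Written out, that bridge of intermediate targets is the standard data-tempering sequence (with batch boundaries t_1 < t_2 < ... chosen by the algorithm):

$$
p(\theta) \;\longrightarrow\; p(\theta \mid x_{1:t_1}) \;\longrightarrow\; p(\theta \mid x_{1:t_2}) \;\longrightarrow\; \cdots \;\longrightarrow\; p(\theta \mid x_{1:N}),
$$

and when a particle $\theta_i$ is carried from one target to the next, its incremental importance weight is the predictive probability it assigns to the new batch,

$$
w_i \;\propto\; p\big(x_{t_k+1:t_{k+1}} \mid x_{1:t_k}, \theta_i\big).
$$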
But the key idea is that you start with draws from some prior, typically, and then you're just adding more and more data and reweighting the particles based on the probability that they assign to the new data. So if I have 10 particles and some particle is always assigning a very high score to the new data, I know that that's a particle that's explaining the data quite well. And so I might resample these particles according to their weights, to get rid of the particles that are not explaining the new data well and to focus my computational effort on the particles that are explaining the data well. And this is something that an MCMC algorithm does not give us. Because even if we run, say, a hundred MCMC chains in parallel, we don't know how to resample the chains, for example, because they're all independent executions and we don't have a principled way of assigning a score to those different chains. You can't use the joint likelihood; that's not a valid or even a meaningful statistic for measuring the quality of a given chain. But SMC, because it's built on importance sampling, has a principled way for us to assign weights to these different particles and focus on the ones which are most promising.

And then I think the final component that's missing from my explanation is: where does the MCMC come in? Traditionally in sequential Monte Carlo there was no MCMC. You would just have your particles, you would add new data, you would reweight based on the probability of the data, then you would resample the particles, then add the next batch of data, reweight, resample, et cetera. But you're also able, in between adding new data points, to run MCMC in the inner loop of sequential Monte Carlo. And that does not make the algorithm incorrect; it preserves the correctness of the algorithm, even if you run MCMC. The intuition there is that your prior draws are not going to be good. So after I've observed, say, 10% of the data, I might actually run some MCMC on that subset of the data before I introduce the next batch. So after reweighting the particles, you're also using a little bit of MCMC to improve their structure given the data that's been observed so far. And that's where the MCMC is run, inside the inner loop.
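Putting those pieces together, here is a compact sketch of the overall loop: data tempering, reweighting by the predictive probability of each new batch, resampling, and MCMC rejuvenation. The helpers sample_prior, logpred, and mcmc_rejuvenate are placeholders for the model-specific parts, not AutoGP.jl's actual API, and a real implementation would resample adaptively (for example, when the effective sample size drops) rather than at every batch.

```julia
using StatsBase: sample, Weights

# Generic SMC with data tempering and MCMC rejuvenation (sketch).
# x is the full dataset; batches gives the number of new points per step.
function smc_data_tempering(x, batches; n_particles=10,
                            sample_prior, logpred, mcmc_rejuvenate)
    particles = [sample_prior() for _ in 1:n_particles]   # draws from p(theta)
    logw = zeros(n_particles)
    seen = 0
    for b in batches
        new_idx = (seen + 1):(seen + b)
        # Reweight: how well does each particle predict the new batch,
        # given the data incorporated so far?
        for i in 1:n_particles
            logw[i] += logpred(particles[i], x[1:seen], x[new_idx])
        end
        # Resample according to the weights, focusing compute on the
        # particles that explain the data well.
        w = exp.(logw .- maximum(logw))
        idx = sample(1:n_particles, Weights(w), n_particles)
        particles = particles[idx]
        logw .= 0.0
        seen += b
        # Rejuvenate: a few MCMC moves targeting p(theta | x[1:seen]),
        # e.g. structure moves of the kind described earlier.
        particles = [mcmc_rejuvenate(p, x[1:seen]) for p in particles]
    end
    return particles
end
```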
Some of the benefits of this kind of approach are, like I mentioned at the beginning: in MCMC you have to compute the probability of all the data at each step, but in SMC, because we're sequentially incorporating new batches of data, we can get away with only looking at, say, 10 or 20% of the data and get some initial inferences before we actually reach the end and have processed all of the observed data. So that's, I guess, a high-level overview of the algorithm AutoGP is using. It's annealing, or tempering, the data; it's reassigning the scores of the particles based on how well they're explaining the new batch of data; and it's running MCMC to improve their structure by applying these different moves, like removing a sub-expression, adding a sub-expression, things of that nature.
Okay, yeah. Thanks a lot for this explanation, because that was a very hard question on my part, and I think you've done a tremendous job explaining the basics of SMC and when it would be useful. So, yeah, thank you very much; I think that's super helpful. And why, in this case, when you're trying to do these kinds of time series discoveries, would SMC be more useful than classic MCMC?
Yeah, so it's more useful, I guess, for several reasons. One reason is that you might actually have a true streaming problem. If your data is actually streaming, you can't use MCMC, because MCMC operates on a static data set. What if I'm running AutoGP in some type of industrial process system where data is coming in and I'm updating the models in real time as the data arrives? That's a purely online setting for which SMC is perfect, but MCMC is not so well suited; obviously you can always incorporate new data in MCMC, but that's not the traditional algorithm for which we know the correctness properties. So when you have streaming data, SMC can be extremely useful.

But even if your data is not streaming, there are theoretical results showing that convergence can be much improved when you use the sequential Monte Carlo approach, because you have these multiple particles that are interacting with one another. What they can do is explore multiple modes, whereas in MCMC each individual chain might get trapped in a mode, and unless you have an extremely accurate posterior proposal distribution, you may never escape from that mode. But in SMC we're able to resample these different particles so that they're interacting, which means you can probably explore the space much more efficiently than you could with a single chain that's not interacting with other chains. And this is especially important for the types of posteriors that AutoGP is exploring, because these are symbolic expression spaces, not Euclidean spaces. We expect there to be largely non-smooth components, and we want to be able to jump efficiently through this space via the resampling procedure of SMC, which is why it's a suitable algorithm.
And then the third component, which is more specific to GPs in particular: because GPs have a cubic cost of evaluating the likelihood, MCMC is really going to bite you if you're doing that at each step. If I have a thousand, or a million, observations, I don't want to be doing that at each step. But in SMC, because the data is being introduced in batches, I might be able to get some very accurate predictions using only the first 10% of the data, for which the likelihood is quite cheap to evaluate.
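For reference, the quantity being evaluated at each such step is the GP log marginal likelihood,

$$
\log p(y \mid X, \theta) \;=\; -\tfrac{1}{2}\, y^\top \big(K_\theta + \sigma^2 I\big)^{-1} y \;-\; \tfrac{1}{2}\, \log\big\lvert K_\theta + \sigma^2 I \big\rvert \;-\; \tfrac{n}{2}\, \log 2\pi,
$$

whose Cholesky factorization of the $n \times n$ covariance matrix costs $O(n^3)$ time. That is why working with small early subsets of the data in SMC is so much cheaper than repeatedly scoring the full dataset inside MCMC.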
So you're smoothly interpolating between the prior, where you can get perfect samples, and the posterior, which is hard to sample, using these intermediate distributions, which are closer to one another than the prior is to the posterior. And that distance between the prior and the posterior is essentially what makes inference hard. Because SMC is introducing the data in smaller batches, it makes it easier to bridge between the prior and the posterior by going through these partial posteriors, basically.
Okay, I see. Yeah, okay, that makes sense because of that batching process, basically. Yeah, for sure. And the computational requirements of MCMC coupled to a GP, that's for sure making things hard. Yeah.
And, well, I've already taken a lot of time from you, so thanks a lot, Feras. I really appreciate it.
Speaker:
And that's very, very fascinating.
Speaker:
Everything you're doing.
Speaker:
I'm curious also because you're a bit on
both sides, right?
Speaker:
Where you see practitioners, but you're
also on the very theoretical side.
Speaker:
And also you teach.
Speaker:
So I'm wondering if like, what's the, in
your opinion, what's the biggest hurdle in
Speaker:
the Bayesian workflow currently?
Speaker:
Yeah, I think there's really a lot of
hurdles.
Speaker:
I don't know if there's a biggest one.
Speaker:
So obviously, you know, Professor Andrew Gelman has an enormous manuscript on the arXiv, which is called Bayesian Workflow.
Speaker:
And he goes through the nitty gritty of
all the different challenges with coming
Speaker:
up with the Bayesian model.
Speaker:
But for me, at least the one that's tied
closely to my research is where do we even
Speaker:
start?
Speaker:
Where do we start this workflow?
Speaker:
And that's really what drives a lot of my
interest in automatic model discovery.
Speaker:
probabilistic program synthesis.
Speaker:
The idea is not that we want to discover
the model that we're going to use for the
Speaker:
rest of our, for the rest of the lifetime
of the workflow, but come up with good
Speaker:
explanations that we can use to bootstrap
this process, after which then we can
Speaker:
apply the different stages of the
workflow.
Speaker:
But I think it's getting from just data to
plausible explanations of that data.
Speaker:
And that's what, you know, probabilistic
program synthesis or automatic model
Speaker:
discovery is trying to solve.
Speaker:
So I think that's a very large bottleneck.
Speaker:
And then I'd say, you know, the second
bottleneck is the scalability of
Speaker:
inference.
Speaker:
I think that Bayesian inference has a poor
reputation in many corners because of how
Speaker:
unscalable traditional MCMC algorithms
are.
Speaker:
But I think in the last 10, 15 years,
we've seen many foundational developments
Speaker:
in more scalable posterior inference
algorithms that are being used in many
Speaker:
different settings in computational
science, et cetera.
Speaker:
And I think...
Speaker:
building probabilistic programming
technologies that better expose these
Speaker:
different inference innovations is going
to help push Bayesian inference to the
Speaker:
next level of applications that people
have traditionally thought are beyond
Speaker:
reach because of the lack of scalability.
Speaker:
So I think putting a lot of effort into
engineering probabilistic programming
Speaker:
languages that really have fast, powerful
inference, whether it's sequential Monte
Speaker:
Carlo, whether it's...
Speaker:
Hamiltonian Monte Carlo with No-U-Turn sampling, or involutive MCMC over discrete structures; there are really a lot of different options. These are all things that we've seen quite recently.
Speaker:
And I think if you put them together, we
can come up with very powerful inference
Speaker:
machinery.
Speaker:
And then I think the last thing I'll say
on that topic is, you know, we also need
Speaker:
some new research into how to configure
our inference algorithms.
Speaker:
So, you know, we spend a lot of time
thinking is our model the right model, but
Speaker:
you know,
Speaker:
I think now that we have probabilistic programming, and we have inference algorithms maybe themselves implemented as probabilistic programs, we might think
Speaker:
in a more mathematically principled way
about how to optimize the inference
Speaker:
algorithms in addition to optimizing the
parameters of the model.
Speaker:
I think of some type of joint inference
process where you're simultaneously using
Speaker:
the right inference algorithm for your
given model and have some type of
Speaker:
automation that's helping you make those
choices.
Speaker:
Yeah, kind of like the automated
statistician that you were talking about
Speaker:
at the beginning of the show.
Speaker:
Yeah, that would be fantastic.
Speaker:
Definitely kind of like having a stats
sidekick helping you when you're modeling.
Speaker:
That would definitely be fantastic.
Speaker:
Also, as you were saying, the workflow is
so big and diverse that...
Speaker:
It's very easy to forget about something,
forget a step, neglect one, because we're
Speaker:
all humans, you know, things like that.
Speaker:
No, definitely.
Speaker:
And as you were saying, you're also a
professor at CMU.
Speaker:
So I'm curious how you approach teaching
these topics, teaching stats to prepare
Speaker:
your students for all of these challenges,
especially given...
Speaker:
challenges of probabilistic computing that
we've mentioned throughout this show.
Speaker:
Yeah, yeah, that's something I think about
frequently actually, because, you know, I
Speaker:
haven't been teaching for a very long time
and this is over the course of the next
Speaker:
few years, gonna have to put a lot of
effort into thinking about how to give
Speaker:
students who are interested in these areas
the right background so that they can
Speaker:
quickly be productive.
Speaker:
And what's especially challenging, at
least in my interest area, which is
Speaker:
there's both the probabilistic modeling
component and there's also the programming
Speaker:
languages component.
Speaker:
And what I've learned is these two
communities don't talk much with one
Speaker:
another.
Speaker:
You have people who are doing statistics
who think like, oh, programming language
Speaker:
is just our scripts and that's really all
it is.
Speaker:
And I never want to think about it because
that's the messy details.
Speaker:
But programming languages, if we think
about them in a principled way and we
Speaker:
start looking at the code as a first
-class citizen, just like our mathematical
Speaker:
model is a first -class citizen, then we
need to really be thinking in a much more
Speaker:
principled way about our programs.
Speaker:
And I think the type of students who are
going to make a lot of strides in this
Speaker:
research area are those who really value
the programming language, the programming
Speaker:
languages theory, in addition to the
statistics and the Bayesian modeling
Speaker:
that's actually used for the workflow.
Speaker:
And so I think, you know, the type of
courses that we're going to need to
Speaker:
develop at the graduate level or at the
undergraduate level are going to need to
Speaker:
really bring together these two different
worldviews, the worldview of, you know,
Speaker:
empirical data analysis, statistical model
building, things of that sort, but also
Speaker:
the programming languages view where we're
actually being very formal about what are
Speaker:
these actual systems, what they're doing,
what are their semantics, what are their
Speaker:
properties, what are the type systems that
are enabling us to get certain guarantees,
Speaker:
maybe compiler technologies.
Speaker:
So I think there are elements of both of these two different communities that need to be put into teaching people how to be productive probabilistic programming researchers, bringing ideas from these two different areas.
Speaker:
So, you know, the students who I advise,
for example, I often try and get a sense
Speaker:
for whether they're more in the
programming languages world and they need
Speaker:
to learn a little bit more about the
Bayesian modeling stuff, or whether
Speaker:
they're more squarely in Bayesian modeling
and they need to appreciate some of the PL
Speaker:
aspects better.
Speaker:
And that's the sort of a game that you
have to play to figure out what are the
Speaker:
right areas to be focusing on for
different students so that they can have a
Speaker:
more holistic view of
Speaker:
probabilistic programming and its goals
and probabilistic computing more
Speaker:
generally, and building the technical
foundations that are needed to carry
Speaker:
forward that research.
Speaker:
Yeah, that makes sense.
Speaker:
And related to that, are there any future
developments that you foresee or expect or
Speaker:
hope in probabilistic reasoning systems in
the coming years?
Speaker:
Yeah, I think there's quite a few.
Speaker:
And I think I already touched upon one of
them, which is, you know, the integration
Speaker:
with language models, for example.
Speaker:
I think there's a lot of excitement about
language models.
Speaker:
I think from my perspective as a research
area, that's not what I do research in.
Speaker:
But I think, you know, if we think about
how to leverage the things that they're
Speaker:
good at, it might be for creating these
types of interfaces between, you know,
Speaker:
automatically learned probabilistic
programs and natural language queries
Speaker:
about these learned programs for solving
tasks.
Speaker:
data analysis or data science tasks.
Speaker:
And I think this is an important, marrying
these two ideas is important because if
Speaker:
people are going to start using language
models for solving statistics, I would be
Speaker:
very worried.
Speaker:
I don't think language models in their
current form, which are not backed by
Speaker:
probabilistic programs, are at all
appropriate to doing data science or data
Speaker:
analysis.
Speaker:
But I expect people will be pushing that
direction.
Speaker:
The direction that I'd really like to see
thrive is the one where language models
Speaker:
are
Speaker:
interacting with probabilistic programs to
come up with better, more principled, more
Speaker:
interpretable reasoning for answering an
end user question.
Speaker:
So I think these types of probabilistic
reasoning systems, you know, will really
Speaker:
make probabilistic programs more
accessible on the one hand, and will make
Speaker:
language models more useful on the other
hand.
Speaker:
That's something that I'd like to see from
the application standpoint.
Speaker:
From the theory standpoint, I have many
theoretical questions, which maybe I won't
Speaker:
get into.
Speaker:
which are really related about the
foundations of random variate generation.
Speaker:
Like I was mentioning at the beginning of
the talk, understanding in a more
Speaker:
mathematically principled way the
properties of the inference algorithms or
Speaker:
the probabilistic computations that we run
on our finite precision machines.
Speaker:
I'd like to build a type of complexity
theory for these type or a theory about
Speaker:
the error and complexity and the resource
consumption of Bayesian inference in the
Speaker:
presence of finite resources.
Speaker:
And that's a much longer term vision, but
I think it will be quite valuable.
Speaker:
once we start understanding the
fundamental limitations of our
Speaker:
computational processes for running
probabilistic inference and computation.
Speaker:
Yeah, that sounds super exciting. Thanks, Feras. It makes me so hopeful for the coming years to hear you talk that way.
Speaker:
I'm like, yeah, it's super stoked about
the world that you are depicting here.
Speaker:
And...
Speaker:
Actually, I think I still have so many questions for you, because, as I was saying, you're doing so many things. But I think I've taken enough of your time, so let's call it a show.
Speaker:
And before you go though, I'm going to ask
you the last two questions I ask every
Speaker:
guest at the end of the show.
Speaker:
If you had unlimited time and resources,
which problem would you try to solve?
Speaker:
Yeah, that's a very tough question.
Speaker:
I should have prepared for that one
better.
Speaker:
Yeah, I think one area which would be
really worth solving is using, or at least
Speaker:
within the scope of Bayesian inference and
probabilistic modeling, is using these
Speaker:
technologies to unify people around solid, data-driven inferences, so we can have better discussions in empirical fields, right?
Speaker:
So obviously politics is extremely
divisive.
Speaker:
People have all sorts of different
interpretations based on their political
Speaker:
views and based on their aesthetics and
whatever, and all that's natural.
Speaker:
But one question I think about, which is
how can we have a shared language when we
Speaker:
talk about a given topic, or the pros and cons of that topic, in terms of rigorous data-driven theses about why we have these different views, and try to
Speaker:
disconnect the fundamental tensions and
bring down the temperature so that we can
Speaker:
talk more about the data and have good
insights or leverage insights from the
Speaker:
data and use that to guide our decision
-making across, especially the more
Speaker:
divisive areas like public policy, things
of that nature.
Speaker:
But I think part of the challenge is that
why we don't do this, well, you know,
Speaker:
From the political standpoint, it's much
easier to not focus on what the data is
Speaker:
saying because that could be expedient and
it appeals to a broader amount of people.
Speaker:
But at the same time, maybe we don't have
the right language of how we might use
Speaker:
data to think more, you know, in a more
principled way about some of the main, the
Speaker:
major challenges that we're facing.
Speaker:
So I, yeah, I think I'd like to get to a
stage where we can focus more about, you
Speaker:
know, principle discussions about hard
problems that are really grounded in data.
Speaker:
And the way we would get those sort of
insights is by building good probabilistic
Speaker:
models of the data and using it to
explain, you know, explain to policymakers
Speaker:
why they shouldn't, they shouldn't do a
different, a certain thing, for example.
Speaker:
So I think that's a very important problem
to solve because surprisingly many areas
Speaker:
that are very high impact are not using
real world inference and data to drive
Speaker:
their decision -making.
Speaker:
And that's quite shocking. Whether it be in medicine, where we're using very archaic inference technologies in clinical trials and things of that nature, or even in economics, right?
Speaker:
Like linear regression is still the
workhorse in economics.
Speaker:
We're using very primitive data analysis
technologies.
Speaker:
I'd like to see how we can use better data
technologies, better types of inference to
Speaker:
think about these hard, hard challenging
problems.
Speaker:
Yeah, couldn't agree more.
Speaker:
And...
Speaker:
And I'm coming from a political science
background, so for sure these topics are
Speaker:
always very interesting to me, quite dear
to me.
Speaker:
Even though in the last years, I have to
say I've become more and more pessimistic
Speaker:
about these.
Speaker:
And yeah, like I completely agree with
your, like with the problem and the issues
Speaker:
you have laid out and the solutions I am
for now.
Speaker:
completely out of them.
Speaker:
Unfortunately, but yeah, like that I agree
that something has to be done.
Speaker:
Because these kinds of political debates, which sit completely outside the scientific consensus, just baffle me. I'm like, but we've talked about that, we've learned that; it's one of the things we know. I don't know why we're still arguing about that.
Speaker:
Or if we don't know, why don't we try and
find a way to, you know, find out instead
Speaker:
of just being like, I know, but I'm right
because I think I'm right and my position
Speaker:
actually makes sense.
Speaker:
It's like one of the worst arguments like,
oh, well, it's common sense.
Speaker:
Yeah, I think maybe there's some work we
have to do in having people trust.
Speaker:
know, science and data -driven inference
and data analysis more.
Speaker:
That's about by being more transparent, by
improving the ways in which they're being
Speaker:
used, things of that nature, so that
people trust these and that it becomes the
Speaker:
gold standard for talking about different
political issues or social issues or
Speaker:
economic issues.
Speaker:
Yeah, for sure.
Speaker:
But at the same time, and that's
definitely something I try to do at a very
Speaker:
small scale with these podcasts,
Speaker:
It's how do you communicate about science
and try to educate the general public
Speaker:
better?
Speaker:
And I definitely think it's useful.
Speaker:
At the same time, it's a hard task because
it's hard.
Speaker:
If you want to find out the truth, it's
often not intuitive.
Speaker:
And so in a way you have to want it.
Speaker:
It's like, eh.
Speaker:
I know broccoli is better for my health long term, but I still prefer to eat a very, very fatty snack. I definitely prefer Snickers.
Speaker:
And yet I know that eating lots of fruits
and vegetables is way better for my health
Speaker:
long term.
Speaker:
And I feel it's a bit of a similar issue
where it's like, I'm pretty sure people
Speaker:
know it's long term better to...
Speaker:
use these kinds of methods to find out
about the truth, even if it's a political
Speaker:
issue, even more, I would say, if it's a
political issue.
Speaker:
But it's just so easy right now, at least
given how the different political
Speaker:
incentives are, especially in the Western
democracies, the different incentives that
Speaker:
are made with the media structure and so
on.
Speaker:
It's actually way easier to
Speaker:
not care about that and just like, just
lie and say what you think is true, then
Speaker:
actually doing the hard work.
Speaker:
And I agree.
Speaker:
It's like, it's very hard.
Speaker:
How do you make that hard work look not
boring, but actually what you're supposed
Speaker:
to do and that I don't know for now.
Speaker:
Yeah.
Speaker:
Um, that makes me think like, I mean, I,
I'm definitely always thinking about these
Speaker:
things and so on.
Speaker:
Something that definitely helped me at a
very small scale, my scale where, because
Speaker:
of course I'm always the, the scientists
around the table.
Speaker:
So of course, when these kinds of topics
come up, I'm like, where does that come
Speaker:
from?
Speaker:
Right?
Speaker:
Like, why are you saying that?
Speaker:
Where, how do you know that's true?
Speaker:
Right?
Speaker:
What's your level of confidence and things
like that.
Speaker:
There is actually a very interesting framework which can teach you how to ask
questions to actually really understand
where people are coming from and how they
Speaker:
develop their positions more than trying
to argue with them about their position.
Speaker:
And usually it ties in also with the
literature about that, about how to
Speaker:
actually not debate, but talk with someone
who has very entrenched political views.
Speaker:
And it's called street epistemology.
Speaker:
I don't know if you've heard of that.
Speaker:
That is super interesting.
Speaker:
And
Speaker:
I will link to that in the show notes.
Speaker:
So there is a very good YouTube channel by Anthony Magnabosco, who is one of the main people doing street epistemology.
Speaker:
So I will link to that.
Speaker:
You can watch his video where he goes in
the street literally and just talk about
Speaker:
very, very hot topics to random people in
the street.
Speaker:
Can be politics.
Speaker:
Very often it's about supernatural beliefs
about...
Speaker:
religious beliefs, things like this is
really, these are not light topics.
Speaker:
But it's done through the framework of
street epistemology.
Speaker:
That's super helpful, I find.
Speaker:
And if you want like a more, a bigger
overview of these topics, there is a very
Speaker:
good, somewhat recent book called How Minds Change by David McRaney, who's
Speaker:
got a very good podcast also called You're
Not So Smart.
Speaker:
So,
Speaker:
Definitely recommend those resources.
Speaker:
I'll put them in the show notes.
Speaker:
Awesome.
Speaker:
Well, for us, that was an unexpected end
to the show.
Speaker:
Thanks a lot.
Speaker:
I think we've covered so many different
topics.
Speaker:
Well, actually, I still have a second
question to ask you.
Speaker:
The second last question I ask you, so if
you could have dinner with any great
Speaker:
scientific mind, dead, alive, fictional,
who would it be?
Speaker:
I think I will go with Hercule Poirot, Agatha Christie's famous detective. I've read a lot of Hercule Poirot, and everything he does is based on inference. So I'd work with him to come up with a formal model of the inferences that he's making to solve very hard crimes.
Speaker:
That's the first time someone answers Hercule Poirot, but I'm not surprised as to the motivation.
Speaker:
So I like it.
Speaker:
I like it.
Speaker:
I think I would do that with Sherlock
Holmes also.
Speaker:
Sherlock Holmes has a very Bayesian mind.
Speaker:
I really love that.
Speaker:
Yeah, for sure.
Speaker:
Awesome.
Speaker:
Well, thanks a lot, Feras.
Speaker:
That was a blast.
Speaker:
We've talked about so many things.
Speaker:
I've learned a lot about GPs.
Speaker:
Definitely going to try AutoGP.jl.
Speaker:
Thanks a lot for all the work you are
doing on that and all the different topics
Speaker:
you are working on and were kind enough to
come here and talk about.
Speaker:
As usual, I will put resources and links
to your website in the show notes for
Speaker:
those who want to dig deeper and feel free
to add anything yourself or for people.
Speaker:
And on that note, thank you again for
taking the time and being on this show.
Speaker:
Thank you, Alex.
Speaker:
I appreciate it.
Speaker:
This has been another episode of Learning Bayesian Statistics. Be sure to rate, review, and follow the show on your favorite podcatcher, and visit learnbayesstats.com for more resources about today's topics, as well as access to more episodes to help you reach a true Bayesian state of mind. That's learnbayesstats.com. Our theme music is Good Bayesian by Baba Brinkman, featuring MC Lars and Mega Ran. Check out his awesome work at bababrinkman.com. I'm your host, Alex Andorra. You can follow me on Twitter at alex_andorra, like the country. You can support the show and unlock exclusive benefits by visiting patreon.com/learnbayesstats. Thank you so much for listening and for your support. You're truly a good Bayesian. Change your predictions after taking information in, and if you're thinking I'll be less than amazing, let's adjust those expectations. Let me show you how to be a good Bayesian. Change calculations after taking fresh data in. Those predictions that your brain is making, let's get them on a solid foundation.