Listen on your favorite platform:

• Sign up for Alex's first live cohort, about Hierarchical Model building: https://athlyticz.com/cohorts/alex-andorra/hierarchical

• Get 25% off "Building AI Applications for Data Scientists and Software Engineers": https://bit.ly/lbs

• Join this channel to get access to perks:

https://www.patreon.com/c/learnbayesstats

• Proudly sponsored by PyMC Labs. Get in touch at https://www.pymc-labs.com/!

Our theme music is « Good Bayesian », by Baba Brinkman (feat. MC Lars and Mega Ran). Check out his awesome work at https://bababrinkman.com/ !

Takeaways:

  • Why GPs still matter: Gaussian Processes remain a go-to for function estimation, active learning, and experimental design – especially when calibrated uncertainty is non-negotiable.
  • Scaling GP inference: Variational methods with inducing points (as in GPflow) make GPs practical on larger datasets without throwing away principled Bayes.
  • MCMC in practice: Clever parameterizations and gradient-based samplers tighten mixing and efficiency; use MCMC when you need gold-standard posteriors.
  • Bayesian deep learning, pragmatically: Stochastic-gradient training and approximate posteriors bring Bayesian ideas to neural networks at scale.
  • Uncertainty that ships: Monte Carlo dropout and related tricks provide fast, usable uncertainty – even if they’re approximations.

  • Model complexity ≠ model quality: Understanding capacity, priors, and inductive bias is key to getting trustworthy predictions.

  • Deep Gaussian Processes: Layered GPs offer flexibility for complex functions, with clear trade-offs in interpretability and compute.
  • Generative models through a Bayesian lens: GANs and friends benefit from explicit priors and uncertainty – useful for safety and downstream decisions.
  • Tooling that matters: Frameworks like GPflow lower the friction from idea to implementation, encouraging reproducible, well-tested modeling.
  • Where we’re headed: The future of ML is uncertainty-aware by default – integrating UQ tightly into optimization, design, and deployment.

Chapters:

08:44 Function Estimation and Bayesian Deep Learning

10:41 Understanding Deep Gaussian Processes

25:17 Choosing Between Deep GPs and Neural Networks

32:01 Interpretability and Practical Tools for GPs

43:52 Variational Methods in Gaussian Processes

54:44 Deep Neural Networks and Bayesian Inference

01:06:13 The Future of Bayesian Deep Learning

01:12:28 Advice for Aspiring Researchers

01:22:09 Tackling Global Issues with AI

Thank you to my Patrons for making this episode possible!

Yusuke Saito, Avi Bryant, Ero Carrera, Giuliano Cruz, James Wade, Tradd Salvo, William Benton, James Ahloy, Robin Taylor, Chad Scherrer, Zwelithini Tunyiswa, Bertrand Wilden, James Thompson, Stephen Oates, Gian Luca Di Tanna, Jack Wells, Matthew Maldonado, Ian Costley, Ally Salim, Larry Gill, Ian Moran, Paul Oreto, Colin Caprani, Colin Carroll, Nathaniel Burbank, Michael Osthege, Rémi Louf, Clive Edelsten, Henri Wallen, Hugo Botha, Vinh Nguyen, Marcin Elantkowski, Adam C. Smith, Will Kurt, Andrew Moskowitz, Hector Munoz, Marco Gorelli, Simon Kessell, Bradley Rode, Patrick Kelley, Rick Anderson, Casper de Bruin, Philippe Labonde, Michael Hankin, Cameron Smith, Tomáš Frýda, Ryan Wesslen, Andreas Netti, Riley King, Yoshiyuki Hamajima, Sven De Maeyer, Michael DeCrescenzo, Fergal M, Mason Yahr, Naoya Kanai, Aubrey Clayton, Jeannine Sue, Omri Har Shemesh, Scott Anthony Robson, Robert Yolken, Or Duek, Pavel Dusek, Paul Cox, Andreas Kröpelin, Raphaël R, Nicolas Rode, Gabriel Stechschulte, Arkady, Kurt TeKolste, Marcus Nölke, Maggi Mackintosh, Grant Pezzolesi, Joshua Meehl, Javier Sabio, Kristian Higgins, Matt Rosinski, Bart Trudeau, Luis Fonseca, Dante Gates, Matt Niccolls, Maksim Kuznecov, Michael Thomas, Luke Gorrie, Cory Kiser, Julio, Edvin Saveljev, Frederick Ayala, Jeffrey Powell, Gal Kampel, Adan Romero, Will Geary, Blake Walters, Jonathan Morgan, Francesco Madrisotti, Ivy Huang, Gary Clarke, Robert Flannery, Rasmus Hindström, Stefan, Corey Abshire, Mike Loncaric, David McCormick, Ronald Legere, Sergio Dolia, Michael Cao, Yiğit Aşık, Suyog Chandramouli and Adam Tilmar Jakobsen.

How do we bring rigorous uncertainty into modern machine learning without losing scalability?

Today I am joined by Maurizio Filippone, associate professor at KAUST and leader of the Bayesian Deep Learning Group, whose path from physics to machine learning has been guided

by a single obsession: function estimation, done the Bayesian way.

We dive into the frontier where GPs meet deep learning: deep Gaussian processes,

Bayesian neural networks trained with stochastic gradients, and pragmatic tools like Monte Carlo dropout for uncertainty quantification.

Along the way, we tackle trade-offs between interpretability and flexibility, when to reach for a GP versus a neural net, and how Bayesian ideas improve optimization,

experimental design, and even generative models.

Finally, we look ahead to the future where uncertainty isn't an afterthought

but a first-class citizen of AI, integrated, efficient, and indispensable.

This is Learn Bayesian Statistics, episode 144, recorded October 2nd, 2025.

Welcome to Learn Bayesian Statistics, a podcast about Bayesian inference, the methods, the projects, and the people who make it possible.

I'm your host, Alex Andorra.

You can follow me on Twitter at alex_andorra, like the country.

For any info about the show, learnbayesstats.com is the place to be.

Show notes, becoming a corporate sponsor, unlocking Bayesian merch, supporting the show on Patreon, everything is in there.

That's learnbayesstats.com.

If you're interested in one-on-one mentorship, online courses, or statistical consulting, feel free to reach out and book a call at topmate.io/alex_andorra.

See you around, folks.

And best Bayesian wishes to you all.

And if today's discussion sparked ideas for your business, well, our team at PyMC Labs can help bring them to life.

Check us out at pymc-labs.com.

Hello, my dear Bayesians.

Just a quick word to remind you that I'm running my first ever live workshop and it's going to be a live cohort.

We're going to kick this off with Athlyticz, and we're going to do hierarchical models in PyMC and Bambi on November 5 and 6.

And in two sessions, you will leave with a working multilevel model, with posterior checks, stakeholder-ready.

It's gonna be short, live, code-first.

Thanks to Athlyticz, we're gonna have pre-authenticated GCP VMs so you can model without setup frictions.

That means, if you wanna come learn live with me, and also join the Discord that we have with the Learn Bayes Stats patrons...

Well, now is the time to join.

We're gonna learn with sports analytics examples.

And other examples too, of course. We're pretty much at capacity, but there are still a few spots left, so I would love to see you in a few weeks, November 5 and 6.

All the details are in the show notes.

And well, if you have any questions, feel free to reach out.

Otherwise, I'll see you very soon.

Thank you, folks.

And now, let's talk about what a deep Gaussian process is, with Maurizio Filippone.

Maurizio Filippone, welcome to Learn Bayesian Statistics.

Well, thank you so much, Alex.

Thank you so much.

Great to be here.

Thanks a lot again to Hans for putting us in contact.

You're like the podcast matchmaker of LBS because we already had some colleague of yours, actually, Maurizio.

I'll put the link in the show notes.

We had Håvard Rue and Janet van Niekerk.

They were in episode 136.

So I'll put that into the show notes.

That was a really, really fun episode.

We talked about Bayesian inference at scale, everything about INLA.

So lots of good nuggets in this episode, folks.

We talked about penalized complexity priors, which are available out of the box in the INLA R package.

I guess they will be in the Python package.

If you're interested in INLA and Bayesian inference at scale, this is definitely an exciting time.

I think it's very good for us to tackle the idea that Bayesian inference is not able to scale.

And we'll keep doing that today, I guess, with you Maurizio.

We'll talk about GPs, about deep learning, all the very fun stuff you do.

But first, let's start with your origin story.

Can you tell us what you're doing nowadays?

And also, how did you end up working on that?

You know, because you're an Italian in Saudi Arabia.

How did that happen?

Yes, that's a long journey.


So first of all, I would like to thank you for the great service you're doing for the community.

I think this is very important, and I'm really happy to be here and to be part of this long list of great speakers, big important guests, that you had before me.

So I started with a master's in physics.

So that's where I started in Italy.

I got interested in dynamical systems at the time, and I took a course on neural networks.

This was many years ago; we were trying to understand whether we could predict time series using neural networks without knowledge of the physics.

So even if I was studying physics, we were trying to avoid having to know physics to make these predictions.

And it turns out that there's some nice mathematics behind the theory of dynamical systems allowing you to predict, even for chaotic systems: systems that are relatively simple to write in terms of differential equations, but whose trajectories evolve in a seemingly random way.

But actually it's not random.

It's just that the characteristics of these differential equations are such that there is this emergence of chaos.

You can still predict these time series really well.

And then I got interested in machine learning.

And at the time, people didn't believe that this would be a smart move, because it was more than 20 years ago.

But I believed that this would be something interesting to pursue.

And then I started a PhD in computer science, and that led me to move to the UK.

So one of my reviewers was Mark Girolami, at Glasgow at the time.

So I got interested in exploring the UK.

So I first did a postdoc in Sheffield and then eventually moved to Glasgow with Mark Girolami.

And then from there, we moved to UCL for a year while I was doing a postdoc, because he got a chair in the statistics department at UCL.

Then I got a lectureship back in Glasgow.

And after a few years in computer science at Glasgow, I decided to move to France.

There was a big opportunity to develop machine learning at scale at this institute and build something new there.

So it was an exciting opportunity.

And then, after eight years in which I successfully built something there,

I decided to explore something new, when there was this opportunity here at KAUST.

I knew a lot of great people working here, so I decided to give it a shot and now I'm here.

In this journey, I started from time series prediction, going through clustering and anomaly detection, and eventually to various applications: I've worked on applications in neuroscience, fraud detection, and industrial applications of various kinds.

And now here, there's a stronger focus on environmental sciences and yeah, it's really exciting.

Yeah, yeah, I agree.

And before diving into the technical details, can you also give us an idea of what your group's main goals and research themes are? Because you lead the Bayesian Deep Learning Group there at KAUST.

Yeah.

So in terms of themes, I think there is one big theme that is central to everything, which is function estimation.

So a lot of the things I do every day, a lot of the things that most people do every day in machine learning and statistics, is really function estimation.

So, you know, I started working on kernel methods back when I was doing my master's and PhD.

And then eventually, through the postdoc, we started working on probabilistic nonparametric models, so Gaussian processes.

Then eventually, you know, deep learning started to become quite popular and very powerful.

So naturally it felt like we had to think about these extensions to deeper models.

And so the natural thing for me was to take a Gaussian process and make it deep, right?

And so we started studying these deep Gaussian processes, which were already being proposed a few years back, but of course.

we started thinking about approximations to make them scalable and so on.

Then eventually, today, I don't have any preference, let's say; I don't see a distinct line between deep Gaussian processes and Bayesian deep neural networks.

In the end, you can sort of view deep Gaussian processes as a special case of a Bayesian neural network.

So for me now, a lot of the techniques that we've always used to do scalable inference for GPs, we sort of port them to Bayesian deep learning.

And yeah, this is an exciting space, because there is so much development going on and we're part of it.

So it's really great.

Yeah, that's super fun.

And yeah, I'm glad that you already established the connection there is between Bayesian neural networks and Gaussian processes.

Like in the end, everything is a Gaussian process.

And so I'm curious if you can define what a deep Gaussian process is, because I think my audience has a good idea of what a Bayesian neural network is.

And I've had, especially recently, Vincent Fortuin talk about that on the show.

I'll put that also in the show notes.

So these Bayesian deep learning, I think people are familiar with.

Can you tell us what a deep Gaussian process is?

Because I think people see what a Gaussian process is, but what makes it a deep one?

Great episode.

The one with Vincent, by the way.

I checked it out.

Thank you.

Because I guess he said a lot of things that I would probably also say in my episode.

So it was great to see it.

So yeah, a Gaussian process: there are many ways in which you can see it.

The easiest way is probably to start from a linear model.

I think I really like the construction from a linear model.

So if we start from a linear model.

And we make it Bayesian.

So we put a prior on the parameters.

Then we have analytical forms for the posterior, the predictions, everything is nice and Gaussian.

And so now one nice thing we can do is to start thinking about linear regression, but now with basis functions.

So we start introducing linear combinations, not of just the covariates or features, if you want to call them that.

But of a transformation of them, let's say sines and cosines, trigonometric functions of any kind; it could be polynomials.

And it turns out that you can use kernel tricks to be able to say what the predictive distribution is going to be for this.

The model is still linear in the parameters, but now what we can do is to take the number of basis functions to infinity.

So we can make an arbitrarily large polynomial.

And now the number of parameters will be infinite.

But what we can do is to use this kernel, so-called kernel trick to actually

express everything in terms of scalar products among this mapping of inputs to this polynomial.

And so if you do that, then what you can do is, instead of working with polynomials or these basis functions, define a so-called kernel function, which is the one that takes input features and spits out the scalar product of these induced polynomials in this very large, infinite-dimensional space.

So this kind of trick allows you to work with something which is infinitely powerful in a way, because it's infinitely flexible: you have an infinite number of parameters now.

But the great thing is that if you have only n observations, all you need to care about is what happens at these n observations.

And so you can construct this covariance matrix and, you know, everything is Gaussian again.

It's very nice.

The first time you generate a function from a Gaussian process, it's beautiful, because you get these nice functions and it's just a multivariate normal, really.

That's all it is, you know?

So I still remember the first time I generated a function from a GP, because it was a eureka moment, you know, where you realize how simple and beautiful this is.
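That eureka moment is easy to reproduce. As a minimal sketch (my own toy code, not from the episode; function names and parameters are illustrative), drawing functions from a GP prior really is just sampling a multivariate normal with an RBF covariance:

```python
import numpy as np

def rbf_kernel(x, lengthscale=0.2, amplitude=1.0):
    """RBF (squared-exponential) covariance matrix for 1-D inputs."""
    diffs = x[:, None] - x[None, :]
    return amplitude**2 * np.exp(-0.5 * (diffs / lengthscale) ** 2)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 100)
# Jitter on the diagonal keeps the covariance numerically positive definite.
K = rbf_kernel(x) + 1e-8 * np.eye(len(x))
# Each row of `samples` is one smooth random function evaluated at x.
samples = rng.multivariate_normal(np.zeros(len(x)), K, size=3)
```

Plotting the rows of `samples` gives the smooth wiggly curves a GP prior is known for; changing `lengthscale` and `amplitude` changes how wiggly and how large the sampled functions are.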

And now you can think of this as representing a distribution over functions.

So if you draw from this GP, you obtain samples that are functions.

And now what you can do is to say, well, what if I take this function and, instead of just observing it alone, I put it inside as an input to another Gaussian process.

So in a GP, you have inputs, which are your input data where you have observations.

So now you're mapping into functions.

And then this function can become the input to another GP, for example. And then you can even say, okay, let's take these inputs and map them not just to a univariate Gaussian process, where we have just one function, but maybe into 10 functions.

And then these 10 functions become the input to a new Gaussian process.

And so this would be a one layer deep Gaussian process, right?

So you have now one layer, which is first hidden functions, that then enter as input to another Gaussian process.
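A hedged sketch of that composition (again my own toy code, not a library API): sample a hidden function from one GP, then use its values as the inputs to a second GP. The result is a draw from a one-layer deep GP, whose marginals are no longer Gaussian.

```python
import numpy as np

def gp_sample(inputs, rng, lengthscale=0.3):
    """One draw from a zero-mean GP prior with an RBF kernel, at 1-D `inputs`."""
    diffs = inputs[:, None] - inputs[None, :]
    K = np.exp(-0.5 * (diffs / lengthscale) ** 2) + 1e-8 * np.eye(len(inputs))
    return rng.multivariate_normal(np.zeros(len(inputs)), K)

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 80)
hidden = gp_sample(x, rng)       # layer 1: map inputs x to hidden function values
output = gp_sample(hidden, rng)  # layer 2: the hidden values become the new inputs
```

Mapping x to several hidden functions instead of one gives a wider layer, which is the 10-function construction described above.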

What's the advantage of this?

Why do we do this?

Well, you know, with Gaussian processes, the characteristics of the functions that you generate are determined by the choice of the covariance function.

So if you take a covariance function which is an RBF, you're going to have infinitely smooth functions that you generate.

And the way these functions are going to be, the length scale of these functions and the amplitude, they're going to be determined by the parameters that you put in the covariance

function.

And of course, you know, there might be problems where you have non-stationarity.

So in a part of the space, functions should be nice and smooth.

In other parts of the space, maybe you want more flexibility.

And then, you know,

A Gaussian process with a standard covariance function cannot achieve that.

And so in order to increase flexibility, you either spend time designing kernels that actually can do crazy things, which is possible, but relatively hard because now you have

a lot of choices.

You can combine kernels in multiple ways.

And if you have a space of possible kernels you want to choose from, combining them, you know, becomes a combinatorial problem.

So you may say instead, let's just compose functions, and composition is very powerful.

And this is why deep learning works, because in deep learning you essentially have function compositions.

And so even if you compose simple things, the result is something very complicated, and you can try it yourself.

You know, take a sine function and put it into another sine function.

If you play around with the parameters, you can get things that oscillate in a crazy way.

And this is very simple, but very powerful.
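You can see the sine-in-a-sine example in a few lines (parameters picked arbitrarily for illustration): counting zero crossings gives a rough measure of how much faster the composed signal oscillates than the plain one.

```python
import numpy as np

x = np.linspace(0.0, 2.0 * np.pi, 1000)
simple = np.sin(3.0 * x)                   # a plain sine
composed = np.sin(8.0 * np.sin(3.0 * x))   # a sine fed into another sine

def zero_crossings(y):
    """Count sign changes as a rough measure of oscillation."""
    return int(np.sum(np.diff(np.sign(y)) != 0))

n_simple = zero_crossings(simple)
n_composed = zero_crossings(composed)
```

The composed curve crosses zero several times more often than the plain sine, even though each building block is trivially smooth.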

And so the idea of deep Gaussian process is exactly this, to try to enrich the kind of class of functions you can obtain by composing functions, composing Gaussian processes.

And of course, now the marginals, you know, in a Gaussian process, all the marginals are nice and Gaussian.

If you compose, these marginals become non-Gaussian.

And this is really, you know, getting to the point where you start thinking, well, why should we then restrict ourselves to composing processes that are Gaussian? Maybe we can do something else.

And then maybe thinking about other ways in which you can be flexible in the way you parametrize these complicated conditional distributions.

Okay.

Yeah.

Damn, this is super fun.

So it sounds to me like Fourier decomposition on steroids, basically.

It's like decomposing everything through these basis functions and plugging everything into each other.

You know, like these matryoshkas of Gaussian processes, basically.

So yeah.

And I can definitely see the power of that.

And yeah, it's like having very deep neural networks, basically.

So I see, I definitely see the connection and why that would be super helpful.

um And that helps, I'm guessing that helps uncover...

very complex non-linear patterns that are very hard to express in a functional form.

That functional form would be, well, you have to choose the kernels.

And sometimes, as you were saying, the out-of-the-box kernels can't express the complexity you see in the data.

So then having the machine basically discover the kernels by itself is much easier.

Yeah.

And it's really also about the marginals.

If you believe that your marginals can be Gaussian and you're happy with that, then it's all fine.

You can do kernel design.

You can spend a bit of time trying to find a good kernel that gives you good fit to the data, good modeling, good uncertainties.

But then there's still going to be this constraint in a way that you're working with the Gaussian process.

In the end, marginally, everything is Gaussian.

You may not want that in certain applications, where maybe the distributions are very skewed and other things, you know. And then maybe the skewness also is input-dependent, so there's non-stationarity also.

Again, you can encode it in certain kernels, you know, but it's just so much easier to compose.

I mean, in principle it's just mathematical composition; then of course, how to handle this computationally is another story.

yeah, yeah.

No, exactly.

I mean, you're basically trading something that's more comfortable for the user for something that's much harder for the computer to compute.

But yeah, in the end, that can also be something that is more transferable, because unless you're a deep expert in Gaussian processes, coming up with your own kernels each time you need to work on a project is very time-consuming.

So it can actually be worth your time to turn to the deep Gaussian process framework, throw computing power at it and, you know, go on your merry way working on something else in the meantime while the computer samples.

It definitely makes sense.

But again, the deep aspect carries other design choices.

Now you have to choose how many layers, what's the dimensionality of each layer.

And then there is this other problem of what kind of inference you choose, which definitely has an effect.

So we've done some studies on this, you know, trying to compare various approaches a little bit.

I mean, we did this a few years ago now because the deep, I think we started working on this right after TensorFlow came out.

So this was 2016.

So we did our deep GP with a certain kind of approximation that is not very popular. I mean, the community seems to have agreed that, you know, inducing point methods are very powerful for doing approximations.

You know, I've also done some work on that with some great people, particularly James Hensman, who developed GPflow with some other great guys.

But random features is what you said before, when you mentioned the Fourier transform on steroids. I mean, the idea is really that, for certain classes of kernels, you can do some sort of expansion and sort of linearize the Gaussian process.

So before, I was talking about going from a linear model to something with an infinite number of basis functions.

And now the idea is to just truncate this number of basis functions.

You know, you can do it in various ways; there is a randomized version that we use when we do these random features, and then you sort of truncate.

And so now, instead of working with this, you turn a Gaussian process into a linear model with a large number of basis functions.

And then linear models are nice to work with.

And then if you compose them, then that's when you get the deep Gaussian process.

Essentially you get a deep neural network with some stochasticity in the layers.

And that's all there is to it.
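Here is a hedged sketch of that truncation using random Fourier features (the randomized expansion mentioned above; this is generic textbook code, not the group's implementation). With m random cosine features, the exact RBF kernel matrix is approximated by the prior covariance of a plain linear model on those features:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, lengthscale = 200, 500, 0.5

x = np.linspace(-3.0, 3.0, n)[:, None]  # n one-dimensional inputs

# Frequencies are drawn from the kernel's spectral density: for the RBF
# kernel with this lengthscale, that's a Gaussian with std 1/lengthscale.
omega = rng.normal(0.0, 1.0 / lengthscale, size=(1, m))
phase = rng.uniform(0.0, 2.0 * np.pi, size=m)
phi = np.sqrt(2.0 / m) * np.cos(x @ omega + phase)  # n x m feature matrix

# Exact RBF kernel matrix, for comparison.
K_exact = np.exp(-0.5 * ((x - x.T) / lengthscale) ** 2)
# A Bayesian linear model on `phi` has this prior covariance over function values:
K_approx = phi @ phi.T
```

As m grows, `K_approx` converges to `K_exact`, so the infinite GP is replaced by a finite linear model; stacking several such layers gives exactly the deep-network-with-stochasticity picture described above.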

And so when we did this, we implemented it in TensorFlow, because it was the new thing and it was very scalable.

You know, we took some competitors, and we were really fast at converging to good solutions and getting good results.

And we have an implementation out there in TensorFlow, unfortunately.

We should now maybe port it to PyTorch, which has become what we work with more.

Hugo Bowne Anderson here, data and AI scientist, consultant and educator.

I'm a friend of Alex and I was on episode 122 of Learn Bayesian Statistics, talking about learning and teaching in the age of AI.

If you're building with LLMs and AI, and especially if you've hit that wall where your prototype works sometimes, but isn't reliable enough to ship, I've got something for you.

I'm teaching a four week course called Building AI Applications.

We focus on the actual software development life cycle.

Agents, evals, logging, RAG, fine-tuning, iterating, debugging and more.

I teach it with Stefan Krawczyk, who is currently working on AI agent infrastructure at Salesforce.

Students get over $1,200 in cloud credits from Modal, Pydantic Logfire, Chroma Cloud and more to build with immediately.

We're excited to offer you all 25% off.

The link is in the show notes.

You can also go to bit.ly slash LBS friends.

Class starts November 3rd.

Would love to see you there.

No, for sure.

I mean, yeah, that's definitely linked to that TensorFlow implementation that you have, because I'm very big on pointing people towards how they can apply that in practice.

And basically making the bridge between frontier research, as you're doing, and then helping people implement that in their own modeling workflows and problems.

So let's definitely do that.


And yeah, I was actually going to ask you, that's a great explanation and thank you so much for laying that out so clearly.

I think it's awesome to start from the linear representation, as you were saying, and go to the very big, deep GPs, which are in a way easier for me to represent to myself because, you know, it's like in the infinity, in the limit. It's easier, I find, to work with than deep neural networks, for instance.

But yes, can you give us a lay of the land of where the field is right now?

Let's start with the practicality of it.

What would you recommend for people?

In which cases would these deep GPs be useful?

That's the first question. And second: why wouldn't people just use deep neural networks instead of deep GPs?

Let's start with that.

I have a lot of other questions, but let's start with that.

think it's the most general.

Yeah.

Yeah.

I think, I mean, it's a great question. It's the mother of all questions, really.

I mean, what kind of model should you choose for your data?

And I think there is a lot of great work that is going to happen soon, where we're maybe going to be able to give more definite answers to this.

I think we're starting to realize that this overparametrization that we see in deep learning is not so bad after all.

So for someone working in Bayesian statistics, I think we have this image in mind where we should find the right complexity for the data that we have.

So there's going to be a sweet spot: a model that is sort of parsimonious given the data and not too parametrized.

But actually deep learning is telling us now a different story, which is not different from the story that we know for nonparametric modeling, you know, for Gaussian processes.

In Gaussian processes, we push the number of parameters to infinity, right?

And in deep learning now we're sort of doing the same, but in a slightly different mathematical form.

So where we're getting at is a point where actually this enormous complexity is in a way facilitating certain behaviors for these models, enabling them to represent our data in a very simple way.

So the emergence of simplicity seems to be connected to this explosion in parameters.

And I think Andrew Wilson has done some amazing work on this; it's recently published and I can link you to that paper, which says deep learning is not so mysterious.

And it's something I was reading recently, and it's a beautiful read.

And I think, you know, to go back to your question, so today, what should we do?

Should we stick to a GP?

Should we go for a deep neural network?

I think for certain problems, we may have some understanding of the kind of functions we want.

And so for those, if it's possible and easy to encode them with the GPs, I think it's definitely a good idea to go for that.

But there might be other problems where we have no idea, or maybe there are many complications in the way we can think about the uncertainties and other things.

And so, I mean, if we have a lot of data, maybe we can just throw something data-driven at it; we can go for an approach that is data-hungry and then, you know, we can leverage that.

And deep learning seems to be maybe the right choice there.

But of course, now there is also a lot of stuff happening in other spaces, let's say in terms of foundation models. Now there is this breed of new models that have been trained on a lot of data.

And then with some fine-tuning on your small data, you can actually adapt them.

You know, this transfer learning actually works. So there's this paper, again by Andrew Wilson, on predicting time series with language models.

So you take ChatGPT and you make it predict: you discretize your time series, you tokenize it and give it to GPT, you look at the predictions, you invert the transformation, and you get back scalar values.

And actually this seems to be working quite well.
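The encode/predict/decode loop can be sketched as follows (a toy illustration of the idea; the scaling and formatting choices are mine, not the paper's, and the LLM call itself is omitted):

```python
import numpy as np

def encode_series(values, decimals=2):
    """Scale a numeric series and render it as a digit string an LLM
    could be asked to continue (tokenization details are illustrative)."""
    scale = float(np.max(np.abs(values)))
    tokens = [f"{v / scale:.{decimals}f}" for v in values]
    return " , ".join(tokens), scale

def decode_series(text, scale):
    """Invert the transformation: parse predicted digits back to scalars."""
    return np.array([float(t) for t in text.split(" , ")]) * scale

series = np.array([10.0, 12.5, 11.0, 13.75])
text, scale = encode_series(series)     # this string would be sent to the model
recovered = decode_series(text, scale)  # and its completion decoded like this
```

The round trip recovers the series up to rounding, which is exactly the invert-the-transformation step described above.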

So we've now tried multivariate versions of this, probabilistic multivariate versions, and so on.

So we've done some work on that also.

But just to say, I mean, this is something also kind of new that is happening, you know, because before maybe it was really hard to train these models at such a large scale. But now you can train a model on the entire web with all the language, and language is Markovian in a way.

So, you know, these Markovian structures are sort of learned by these models.

And now if you feed these models with stuff that is Markovian, they will try to make a prediction that is actually going to be reasonable.

And this is what we've seen in the literature.

And all these things, I think, are going to change a lot of the way we think about designing a model for the data we have, and how we do inference, and all these things.

So as of today, I think it's maybe still relevant to think: if I have a particular type of data, I know that it makes sense to use a Gaussian process, because I want certain properties in the functions.

You know, the Matérn kernel, for example, gives us some sort of smoothness up to a certain degree.

And it's easy to encode length scales of these functions for the prior of the functions.

And this is great; for neural networks, this is very hard to do.

So we've done some work trying to map the two, right?

So we try to say, okay, can we make a neural network imitate what Gaussian processes do, so that we gain the interpretability and the nice properties of a Gaussian process?

But then we also inherit the flexibility and the power of these deep learning models, so that they can really perform well and also give us sound uncertainty quantification.

Yeah.

Okay, yeah, yeah.

So many things to unpack here.

ah I love it.

This is super exciting to me because I love working with these methods, but I also end up working with them a lot.

um GPs, GPs of course, as my listeners are tired of hearing.

uh But everything you just said here is something that resonates because what I love in GPs is their composability and their interpretability.

um

especially because, thanks to that, you can impose prior structure on the functions you're going to get.

And I find this is extremely useful.

um Yeah, so two questions on that.

First, do you still have the interpretability of GPs if you have deep GPs?

Like, does the length scale still mean something, and the amplitude, if you have, like, an exponential family kernel?

And second question, what are the state of the art packages that you would recommend people check out right now, both in Python or maybe just in Python because deep learning

is mostly Python centric.

But like, let's say I'm a listener, I find what you're saying very interesting for my use case.

I want to check out how to do deep GPs for my project.

um

and put that in competition with deep neural networks, and hopefully in competition with deep Bayesian neural networks.

But we talked with Vincent about the fact that for now there is no real um out of the box package that helps you do that in Bayesian neural networks.

yeah, two big questions.

but like, I think it's going to be super interesting.

Yeah.

Well, in terms of code, I think that GPflow is probably one of the most accessible ones.

James is a good friend.

Again, we were chatting at NeurIPS 2015, when we were presenting our paper together, and there was a presentation about TensorFlow, which was coming out.

So he said, okay, I'm going to do a software package for GPs

in TensorFlow, and this is something that he then developed over the years.

He moved to a startup company called Prowler for a few years.

He had a good team of developers helping him out.

So he did a really great job on that.

And I think GPflow is uh a really good starting point.

I think for some projects with my students in the past, we also relied on that.

And I think you can also...

Yeah, I'll put that in the show notes.

And James should come on the show, sounds like.

Absolutely, yeah.

He'll be a great guest for that, uh, like a GPflow episode.

Yeah, he's also a great cook, by the way.

He invited me and Alex Matthews for dinner once in Sheffield for pasta.

And I thought, okay, you know, he's going to make some normal pasta.

No, he made pasta from scratch.

A non-Italian, and the pasta was not overcooked.

I was very impressed.

He did a fantastic job.

It was really nice.

damn.

That's quite the endorsement.

That's cool.

So then, no, like, he needs to come on the show, but for a live show. Then I need to do a live show in Sheffield,

it sounds like.

Yes.

And so, yeah, I think there are also deep GPs you can easily do there.

With GPflow?

With GPflow.

Yes.

And I think you can.

I think the type of approximation you can use is based on most of James's work, which is based on inducing points, rather than random features, which is another way in which you can

approximate.

So with inducing points, instead of expressing the full process with N data points, you select M inducing points, as we call them, that allow you to express the entire process, but

having to do computations only with these M.

So you have to deal with matrices which are M by M, essentially.

So you have M-cubed complexity, rather than the N-cubed that you would have with a full GP.

And this worked really well.

And, you know, you have a nice, beautiful variational treatment for these models.

You can optimize the position of the inducing inputs.

And everything is really nice and beautiful.
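A rough numpy sketch of why M matters (this is a plain Nyström-style approximation to show the cost structure, not the full variational treatment that GPflow implements; the inputs and sizes are made up):

```python
import numpy as np

def rbf(a, b, ls=1.0):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls**2)

rng = np.random.default_rng(1)
N, M = 2000, 20                        # N data points vs M inducing points
X = rng.uniform(0, 10, N)
Z = np.linspace(0, 10, M)              # inducing inputs (fixed here; usually optimized)

Kzz = rbf(Z, Z) + 1e-6 * np.eye(M)     # M x M -- the only matrix we factorize
Kxz = rbf(X, Z)                        # N x M cross-covariance
# Low-rank approximation of the full N x N kernel: K ~ Kxz Kzz^{-1} Kzx.
# Cost is dominated by the M x M factorization: O(N M^2 + M^3), not O(N^3).
L = np.linalg.cholesky(Kzz)
A = np.linalg.solve(L, Kxz.T)          # M x N
K_approx_diag = np.sum(A**2, axis=0)   # diagonal of the approximation, never forming N x N
```

With M well below N, the savings are dramatic, which is the whole point of the inducing-point construction.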

There is a nice stream of papers by James.

I contributed to a couple of these where we also did some MCMC.

And later on, also with my group, we did some full-fledged MCMC, where we also sampled the inducing locations, which is something that people typically optimize.

But just to say, I think in GPflow, you can start with a lot of great examples that can take you very far.

As Vincent was saying, you had Vincent here in another episode, he's right that it's a bit of a pain point not having an accepted and widely used toolbox for Bayesian deep learning.

So I think that's something that we should work as a community.

There are many events where we are trying to participate, to get together, to reflect on:

What is the role of Bayes in the current state of AI?

So we had one in Dagstuhl last year, and we're going to have one in Abu Dhabi coming up soon at the end of this month.

And I think we should talk about this specifically, you know, how can we lead an initiative for code development.

But I think it's not easy because each one of us as professors, as academics, we have to serve certain priorities, which are in our case publications.

And maybe in my case, also engagement with the applications here in the kingdom is something very valued.

And so the effort of developing a software package, I think, goes a bit beyond that.

Right.

So there needs to be some nice conditions to be able to have a team of developers available to do something like that for a long time.

And I think that's a challenge, at least for us; and people working in industry also have, you know, certain

priorities coming from constraints from their company.

So I think that's a difficult one for everybody, but definitely very valuable.

But there was another part of your question that I think I missed.

Yeah.

We'll come back to that.

Don't worry.

Yeah.

Just to piggyback on what you're saying.

Yeah, for sure.

In industry, I would say, mostly you need to tie that to a project you have at work.

Like if you need that for work, then that's definitely something that can

make things happen much, much faster, because then you can get some budget to finance an open-source solution to that problem, which will make

the development cycle much faster than if you have to do it internally alone.

So yeah, for sure.

But it's very good that GPflow already has all the support for inducing points and

for deep GPs.

I would say that PyMC also has very good GP support.

I use that all the time for my Bayesian GPs.

Not only the vanilla GPs, but the inducing-point GPs too.

We have that in PyMC and PyMC Extras, both for marginal likelihood GPs, so for a normal likelihood, and for latent GPs,

so if you have a non-normal likelihood.

um

And of course the HSGP approximation has been a real game changer for using GPs in the wild.

And we have that in PyMC out of the box.

The great thing here, compared to GPflow, I'd say, is that you can compose that with other parts of your Bayesian model.

So it doesn't have to be a pure GP model.

It can be combined with

other random variables that you have in the model.

So you could have, like, a classic linear slope added to a GP, with a baseline.

So this is very interesting too.
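The kind of additive composition being described can be sketched as a prior draw in plain numpy (a hypothetical illustration of what a PPL model would express; all the values and hyperparameters here are made up):

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.linspace(0, 10, 100)

def rbf(a, b, ls, var):
    return var * np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls**2)

# One prior draw from an additive model: baseline + linear slope + smooth GP part.
# In a PPL, each piece would be a random variable with its own prior;
# here we just draw each component once to show the composition.
baseline = rng.normal(0.0, 1.0)
slope = rng.normal(0.0, 0.5)
K = rbf(x, x, ls=1.5, var=0.5) + 1e-6 * np.eye(len(x))
wiggle = rng.multivariate_normal(np.zeros(len(x)), K)
f = baseline + slope * x + wiggle      # the composed latent function
```

Because the GP is just another term in the sum, inference in a PPL treats its hyperparameters jointly with the slope and baseline.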

um And you get the different inference methods that you get with a classic PPL.

So not only MCMC, but ADVI, Pathfinder, and soon INLA.

So yeah, this is great.

So I encourage people to check that out.

I definitely encourage people to check GPflow out.

I think this is, as you were saying, a great baseline, very useful, great API.

And we definitely need James on the show to dive deeper into that, because I already want to dive deeper into that.

And I've never done a show about GPflow.

So I'll keep that in mind, but there is that.

We'll come back to the inference part afterwards.

But I asked you a question about the interpretation.

Do we keep the benefit of interpretability of the kernel parameters when we're using deep GPs?

Well, in the composition, obviously things become uh more obscure in a way because now a length scale parameter for the first GP is a length scale for functions that become hidden

variables, latent variables, for the next

Gaussian process.

So I think it's possible to think a little bit about the implications of this.

But you can start thinking about maybe how many oscillations you may expect by doing certain length scales over a certain domain.

You can start thinking, okay, if I take derivatives, maybe I can start looking at how many zeros I may expect from this.

It becomes much harder, I think, the deeper you go.

Of course, in the end, there is a lot of other beautiful theory that tells you that if you start pushing the number of Gaussian processes, so the dimensionality of the Gaussian

process to infinity, then you go back to something which is again a Gaussian process.

So all this nice work by Radford Neal in 1996 gives a lot of these nice limits for

even neural networks with stochasticity, as you push the number of layers to infinity or the number of neurons to infinity.

And there's been a lot of follow-up work on that, showing that convolutional neural networks, when you take the number of filters to large values, then they

become Gaussian processes and so on.

So the central limit theorem kicks in there, in a way, and then a lot of these things become Gaussian again.

So I think maybe you may recover some interpretability again when you start pushing things to some limits.

But then again, in the output you get Gaussians.

So then you lose in a way the flexibility that you wanted by introducing the composition.

So it's a trade off, right?

So how much you want to be flexible and how much you want to be interpretable, I think.

Yeah, yeah, okay.

Yeah, that makes sense.

um That makes a ton of sense.

So let's go back to the inference part now.

um Can you give us a lay of the land of the approximations and scalable GP methods?

Also feel free to talk about Bayesian or non-Bayesian deep neural networks.

How can people sample from these models?

And if you can, walk us through the most promising techniques.

Yeah, great.

Well, maybe I break up the answer into maybe GPs first and then we move on to maybe the neural networks.

I think for GPs, there hasn't been much development in the last few years, I would say.

I mean, there are still papers submitted and accepted at the major conferences, but I think they're really a small fraction compared to everything else that is happening.

I think a lot of people kind of settled now to some approximation methods and some inference methods.

Variational, I mean, remains one of the nice formulations to be able to treat these models when you start introducing latent... sorry, inducing points.

So with inducing points, it becomes kind of nice to work with this variational approximations.

There has been some great work by Michalis Titsias in 2009, which, you know, led to a lot of the developments that we see today in variational methods for GPs.

And so I would say variational methods for treating the latent variables in Gaussian processes are very predominant now, to be able to handle scalability and

any likelihoods you want.

We've done some work on MCMC, which also works quite well.

I spent a lot of time doing MCMC a long time ago when I was trying to sample parameters of the covariance along with latent variables.

So there have been

nice works by Ian Murray, for example, Ryan Adams, David MacKay himself.

I also have some work with these guys.

And at the time I was trying to do sampling and there is this problem of being a hierarchical model, it introduces some complications.

You have hyperparameters, latent variables, and data.

And because of this structure, sampling latent variables becomes quite tricky.

Sorry, sampling hyperparameters becomes quite tricky, because...

They're tightly coupled to the latent variables.

So when you sample from the posterior of latent variables, you're conditioning on data, but also on the hyperparameters.

And so imagine you have a length scale parameter.

It means that you're sampling your latent functions to be compatible with the length scale you have.

And then if you sample the length scale given the latent variables, the length scale is not going to change much, because the latent variables have a certain length scale, which

was informed by the length scale before.

So you have this

very, very slow convergence process for this MCMC.

So you have to break it up in a way.

So there has been a lot of work on these ancillary parameterizations, non-centered parameterizations.

People call them many different things.

And so you can start thinking about reparameterizing the Gaussian process, in a way that you view the latent variables as coming from a Gaussian distribution with covariance K.

You start saying, okay, K decomposes into L L-transpose, where L is the Cholesky factor.

And then you say, I write my latent functions F as L times nu, where nu now are variables that are standard normals.

And if you do that, now you kind of decouple, a little bit, in the prior at least, the dependence between the hyperparameters, which affect the Cholesky of the

covariance, and nu, which are now independent variables.

And so now you can sample a bit more efficiently.
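The whitened (non-centered) parameterization being described can be sketched in a few lines of numpy (an illustrative sketch with made-up inputs, not production sampler code):

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(0, 5, 50)

def rbf(a, b, ls):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls**2)

# Centered parameterization: f ~ N(0, K(theta)) couples f tightly to theta.
# Whitened (non-centered): nu ~ N(0, I) is independent of theta a priori,
# and f = L(theta) @ nu, with K = L L^T the Cholesky factorization.
nu = rng.standard_normal(len(x))       # drawn once, independent of the length scale
for ls in (0.5, 2.0):
    K = rbf(x, x, ls) + 1e-6 * np.eye(len(x))
    L = np.linalg.cholesky(K)
    f = L @ nu                          # same nu, different theta -> different f
    # Under either length scale, f has the correct prior covariance L L^T = K.
```

In an MCMC scheme, you sample `nu` and the hyperparameters instead of `f` and the hyperparameters, which is what breaks the tight coupling discussed above.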

Then people came up with even better ways of doing this kind of decoupling.

I've done some work on pseudo-marginal Markov chain Monte Carlo, where you sort of use importance sampling, or adaptive importance sampling, to integrate out latent variables

approximately.

So you can really sample much faster with, you know, faster convergence for the hyperparameters.

So MCMC for the hyperparameters is possible.

And I think, you know, now with computing becoming more and more

available and cheaper.

I think this is definitely something worth considering, especially because a lot of times people work on applications, especially, you know, these

expensive computer-simulation problems, where you have these simulations that run, you know, hours, if not days.

And then you have to fit a Gaussian process on these expensive observations to construct an emulator.

that they use to sort of calibrate certain parameters of these computer models.

So these things are very expensive.

So MCMC, maybe it's not that expensive after all, if you do that.

And for me, like, when I started doing this, I was working on some neuroimaging applications.

We were handling 68, I think, images from patients, and we were trying to do a classifier for Parkinson's with this data.

And we said, you know, we want to do a good job

in quantifying uncertainty in our predictions.

So we ran this MCMC for like a week, and yeah, we got, you know, long chains, good convergence, and yeah, we just did it, you know.

So this is just what I wanted to say maybe about uh GPs.

So we have implementations for this MCMC also now, for when you want to handle everything in a Bayesian way: you want to sample everything, you want to sample the

inducing inputs.

Inducing variables, hyperparameters, everything.

And this was an AISTATS paper we had in 2022, I think, or 2023.

But also in GPflow, again, going back there, you see a lot of nice code that you can just use to optimize some of these parameters.

And I think in many applications, this may work quite well.

Okay.

And so, for these, GPflow is a very good option.

And the paper you talked about, did you implement that in GPflow, or is that a custom implementation in TensorFlow?

So yeah, we started from GPflow as a code base.

Yes.

Okay.

Okay.

So if people want to replicate, for instance, your paper, they can do that.

Yeah.

The code is available.

You can download the code, and yes.

Yeah.

Also, pretty much every paper we do, we try to also release code to make it reproducible.

Yeah.

Yeah, so that's awesome.

But that's also great that it's reproducible with GPflow because it's a package that's evolving all the time, that's curated.

And then people can safely use that in an industrial production setting.

And that's, that's extremely helpful.

And I find that's also a very good piece of news, because that's also been my experience, that you can actually do a lot of MCMC

with GPs, even with

big data sets.

So, yeah, the bad priors that people have about it are usually not warranted when you actually try to do it.

So a few years ago, I gave a talk in Cambridge, at an event there, and O'Hagan was there.

So I was presenting these deep GPs with random features, and then...

There was a plot that I didn't like so much when I gave the presentation.

So people were actually asking me questions about that.

It was not so clear.

So then we went for lunch after my talk, and while we were in the queue, I just took my laptop and ran the code again to replicate that figure in a better way.

And uh I was showing this beautiful function.

So while we were queuing, I showed this to Professor O'Hagan, and he was very impressed that the code was running so fast.

And this was almost 10 years ago now.

Yeah.

Yeah.

And now we have even better MCMC samplers.

We have better personal computers.

So yeah, I've definitely run very big hierarchical GP models on my laptop, running in like 15 minutes in nutpie.

So I definitely encourage people to try much more of that, because, I mean, you see all these huge LLMs which are running...

Imagine that you can run a much more efficient GP model on your computer.

um For this paper, actually, do you remember the size of the dataset to give people an idea?

So yeah, you know, we were running MNIST, and I think this was already almost 10 years ago.

We were running MNIST on a laptop for a couple of hours or something, I don't know.

Okay.

Yeah.

So that's millions of data points.

Sorry, no, it's only 60,000, but we also ran on this MNIST 8 million, which is 8 million MNIST images.

So we ran it on that and again, we could run it on a laptop.

yeah.

Yeah.

Okay.

Yeah.

With GP flow.

No, this was our implementation of these deep GPs with random features.

Okay.

But now is that available in GP flow or?

No. So GPflow focuses exclusively on these inducing-point methods and not random features, at least as far as I know.

Maybe they, I don't know if they've evolved that part, but as far as I know, that was not something in their priorities.

Okay.

Yeah.

Actually, I don't think we made a clear distinction between inducing points and random features.

Yes.

Can you do that? Right, so with inducing points, you select a number of inputs, which you can then optimize afterwards if you want, and then you introduce new random variables that allow you to

express the full process as a function of only this small set of random variables.

With random features instead, you think of an expansion of your model as an infinite number of basis functions.

And then you truncate this expansion to a fixed number.

So for certain kernels, for example, the RBF kernel, these random features actually are random Fourier features.

So you can express the GP just as a weighted combination of sine and cosine with different uh frequencies sampled appropriately.

So in one case, you're approximating in space.

And so this would be the inducing points method.

And with random features, if you think about random Fourier features, you're doing some approximation of the spectrum of these processes.

If that makes sense.

One is in space, the other one is in frequency.

um Yeah.
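The random Fourier feature idea for the RBF kernel can be sketched in numpy (an illustrative sketch following the Rahimi and Recht construction; the inputs and feature count are made up):

```python
import numpy as np

rng = np.random.default_rng(4)

def rbf(a, b, ls=1.0):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls**2)

def rff(x, n_features, ls=1.0, rng=rng):
    # Frequencies are sampled from the kernel's spectral density, which for
    # the RBF kernel is a Gaussian; phases are uniform on [0, 2*pi).
    W = rng.normal(0.0, 1.0 / ls, size=n_features)
    b = rng.uniform(0.0, 2 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(x[:, None] * W[None, :] + b[None, :])

x = np.linspace(0, 3, 30)
Phi = rff(x, n_features=5000)
K_exact = rbf(x, x)
K_approx = Phi @ Phi.T      # converges to K_exact as n_features grows
```

The GP then becomes a weighted combination of these sines and cosines, which is exactly the "approximation in frequency" contrasted with inducing points above.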

Yeah.

Random features sound a lot like HSGP, actually.

Yeah.

It probably has some connections.

Yeah.

Okay.

Interesting.

um

Okay, that's cool.

So if people want to use random features, this is going to be your implementation from the paper.

Yeah, we have that.

I mean, it's a bit old now.

I think it was on TensorFlow, a very old version.

So we should probably try to maintain it or maybe release a Python, PyTorch version.

Yeah.

Yeah.

And otherwise, for inducing points, then try that in GPflow.

Yes.

But then of course there are other approximations.

So again, Andrew Wilson has done some work on

this KISS-GP, which is a way to do these scalable kernel approximations, which are pretty powerful.

So yeah, there are different ways, then...

Yeah, exactly.

And again, inducing points in PyMC, HSGPs, and also my approximation too.

You folks should give it a try if you want.

That's in PyMC also.

yeah, definitely a lot of great options.

um So as for GPs, let's turn to...

Deep neural networks.

Yeah.

Can you give us an idea of the land here?

So, well, I mean, one of the main things about deep learning is the possibility to do mini-batching.

So one of the great things about training a big neural network is that you just feed it small batches of data, and then, you know, you keep updating the model using stochastic gradient

optimization.

So what is the problem with doing something like this for inference?

So if we do a Bayesian

neural network, we want to get a posterior over these parameters of the neural network.

We want to sample from this.

How do we do it?

So actually it turns out that since 2013, 14, people started thinking about how to obtain versions of traditional samplers that exploit this stochastic gradients instead of just

having to compute the full objective every time you update the Markov chain.

And so then there is a beautiful paper by Max Welling and Yee Whye Teh

on the stochastic gradient Langevin dynamics sampler.

I think it's a 2011 paper.

And then there is a hybrid Monte Carlo, Hamiltonian Monte Carlo if you want, version using stochastic gradients, by Emily Fox and her group, which is also quite powerful.

And there is some nice theory around this.

Also, we worked a little bit on the theory as well in our group, to try to understand a bit more about

the properties. And essentially, you know, there is a way to show that even if you're using stochastic gradients, which are not exact, you

dampen these trajectories with some friction, and then you can show that if you do things right, you can avoid having to compute the entire likelihood when you accept or reject.

And therefore, you can really be scalable.
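The basic stochastic gradient Langevin dynamics recipe can be sketched on a toy conjugate problem (an illustrative sketch in numpy; the model, step size, and batch size are all made up for the demo):

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy model: y_i ~ N(theta, 1), prior theta ~ N(0, 10).
# Each SGLD step uses a mini-batch gradient of the log posterior, rescaled
# by N/batch, plus injected Gaussian noise scaled to the step size
# (the Welling & Teh construction); small steps control discretization error.
y = rng.normal(2.0, 1.0, size=10_000)
N, batch = len(y), 100
theta, eps = 0.0, 1e-4
samples = []
for t in range(5000):
    idx = rng.integers(0, N, batch)
    # Stochastic gradient of log posterior: prior term + rescaled likelihood term.
    grad = -theta / 10.0 + (N / batch) * np.sum(y[idx] - theta)
    theta += 0.5 * eps * grad + np.sqrt(eps) * rng.standard_normal()
    if t > 1000:                       # discard burn-in
        samples.append(theta)
posterior_mean = np.mean(samples)      # should sit near the data mean
```

The key point is that no full-data likelihood evaluation ever happens inside the loop, which is what makes the sampler compatible with mini-batch training.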

And we tried this with pretty big models.

Of course, if we talk about LLMs, we're still very small, but we've done this with the convolutional neural networks with my students some time ago.

We could sample easily models with a few tens of millions of parameters.

We were doing convergence checks, you know, R-hat statistics, the stuff that you need to do when you sample, to make sure that you're really sampling from the posterior.

Of course, you know, because the parameter space is so...

um

So the models are not identifiable.

So we actually do the convergence checks on the predictions.

So on some sort of projection of the parameters onto something that we can actually meaningfully understand.

Because, you know, if you sample from multiple modes that represent the same kind of configurations, of course, the Markov chains are very far away from each other when you do

multiple chains, but actually you're sampling from the same configuration.

So that's okay.

So MCMC is possible.

Sorry, just to conclude, maybe variational inference.

I don't think... I mean, there was a paper by Alex Graves in 2011, which was the first one that proposed variational inference for deep neural nets.

It doesn't work so well because people haven't spent enough time working on good priors.

And this is something we've addressed a bit in our work in 2022.

We have a general paper where we actually

try to address these sorts of problems of choosing good priors.

But then we tested this mostly with MCMC rather than variational.

And then a lot of people in the community are really excited about Laplace methods.

So Gaussian approximations with looking at the Hessian and so on.

But I think for deep learning, I don't know, this is maybe my outlier voice here in the community, but I don't think that's the right way of doing things, because these posteriors

are not Gaussian at all.

And we're in lots of dimensions.

There is a lot of redundancies in the parameter space so that this non-identifiability creates ridges in the parameter space where the likelihood is the same.

So I don't think that Gaussian approximation would do particularly well, but of course it's a very popular way of doing things and the community is really pushing that a lot,

but I don't think that's the right way of doing things.

Okay.

Okay.

So what, to you, would be the right way of doing things?

Like let's say listeners want to try um deep learning models right now.

Again, the Bayesian version is not very easy, but let's say they want to try deep learning models.

um What should they look at first?

Which packages, which methods, which inference methods?

I think the easiest thing... I mean, when I have students coming maybe for a short project, you know, the first thing I tell them is: try Monte Carlo dropout.

It's a very simple thing.

I mean, I know that a lot of people would disagree with me, but it's a very practical way of doing things, and there are connections with variational inference.

So yeah, you retain some principle, let's say, although the posterior now is very degenerate, because you're just switching some weights off and on, but it's a

very

intuitive way of doing things, very practical.

You can take pre-trained models and just, you know, introduce some dropout at test time.

Maybe fine-tune first with dropout at training time.

And then, you know, do it at test time.

It's a beautiful idea.

It's very simple.

I think it's a perfectly valid way to start, you know, at least to get some uncertainties and then, you know, what do you do with that?

Of course, it depends on the problem you have, but I think it's a good

starting point.
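The Monte Carlo dropout recipe can be sketched with a tiny hand-rolled network in numpy (an illustrative sketch in the spirit of Gal and Ghahramani; the architecture and weights here are made up, and in practice you would keep the dropout layers of an existing framework model active at test time):

```python
import numpy as np

rng = np.random.default_rng(6)

# Tiny random network: 1 input -> 64 ReLU hidden units -> 1 output.
W1 = rng.normal(0.0, 1.0, (1, 64))
W2 = rng.normal(0.0, 0.3, (64, 1))

def forward(x, p_drop=0.5, rng=rng):
    h = np.maximum(0.0, x @ W1)              # ReLU hidden layer
    mask = rng.random(h.shape) > p_drop      # dropout stays ON at test time
    h = h * mask / (1.0 - p_drop)            # inverted dropout scaling
    return h @ W2

x = np.array([[0.5]])
# Many stochastic forward passes: the mean is the prediction,
# the spread is a cheap uncertainty estimate.
preds = np.array([forward(x)[0, 0] for _ in range(200)])
mean, std = preds.mean(), preds.std()
```

The whole trick is that the dropout mask is resampled on every pass, so the variance across passes reflects the (approximate, degenerate) posterior discussed above.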

Otherwise, I think variational has good potential if you make the class of posterior distributions quite flexible.

And now we're seeing these diffusion models or other powerful generative models being used for variational inference.

I mean, this was the way normalizing flows were sort of proposed for variational inference, by Rezende and Mohamed.

And then, you know, it was uh ported

to just density estimation.

And now, you know, we have diffusion models that do a wonderful job at density estimation.

And now people are starting to use them for posterior sampling.

And so I think, you know, having sort of these flexible posteriors could be a good way forward.

And I think we're going to see more and more of that, because if the class of distributions you can represent

with your variational model is very large, you can really make the bound very tight.

So the variational bound really is going to give you the true marginal likelihood.

So eventually I think it would be nice to go in that direction.

Of course, for these huge models, it's very challenging, but yeah, there is a lot of great work now that people are doing on partial stochasticity.

So you may not need to be stochastic about the entire network, but just a few uh parameters in your model.

And to do that, what's a great first bet, Maurizio?

Are all of these methods available in PyTorch or TensorFlow, so that people can come up with their neural network model and use these inference engines, or are these too much

frontier methods so far?

So I think for Monte Carlo dropout, you really almost don't need any skill to do it.

You just take a model that is already there,

coded, and, you know, you just switch things on and off.

Actually, you switch on the dropout layers at training and test time.

That's it.

And for variational, I think in terms of implementations, I think Pyro has maybe a lot of these things already sort of embedded in the way they do things.

I've never used it myself.

I mean, we tend to develop a lot of code ourselves, because we have to break stuff and try stuff.

So we try to have

uh

code that we have under control ourselves.

So that's why I tend not to use too many packages myself, but I guess Pyro has maybe a lot of things already sort of implemented for doing this.

Yeah.

Okay.

Okay.

So I'll put a link to PyTorch and Pyro documentation in the show notes for em this episode.

And then...

People can give it a try, but it's great that, yeah, from what you're saying, it sounds like it's pretty easy to implement for practitioners and to try these methods out.

I think, yeah, these days, I mean, when I teach my class, what I do is, I say, you know, take an MNIST tutorial for deep learning and just turn it into variational.

You know, what you need to do is to add a few extra variables and, you know, it's a good exercise.

People can usually do it relatively easily and yeah.
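The "few extra variables" in that exercise are usually the reparameterization trick: each point-estimate weight becomes a mean and a scale parameter. A minimal numpy sketch of that one change (shapes and values are hypothetical, for a one-layer slice of a network):

```python
import numpy as np

rng = np.random.default_rng(7)

def softplus(x):
    return np.log1p(np.exp(x))

# Instead of a single weight matrix W, keep variational parameters (mu, rho):
# every forward pass samples w = mu + softplus(rho) * eps, eps ~ N(0, 1),
# so gradients can flow through mu and rho during training.
mu = rng.normal(0.0, 0.1, (1, 16))     # variational mean, replaces W
rho = np.full((1, 16), -3.0)           # parameterizes the std dev, softplus(-3) ~ 0.05

def sample_weights(rng=rng):
    eps = rng.standard_normal(mu.shape)
    return mu + softplus(rho) * eps    # one stochastic weight draw per forward pass

draws = np.stack([sample_weights() for _ in range(1000)])
# Empirically, the draws follow the variational distribution N(mu, softplus(rho)^2).
```

Training then optimizes `mu` and `rho` against the ELBO instead of optimizing `W` against the plain loss, which is exactly the "turn it into variational" step.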

And do you actually have... like, are your courses public, and can we put something in the show notes so that people can study what you're teaching over there at KAUST?

Maybe a link to a course, the exercises?

Yeah, we record everything, but we keep it private for now.

I don't think we can open it easily.

I also record... I mean, it's been 10 years now that I've been recording my courses, even when I was in France before, and when I was

at Glasgow.

But yeah, they remain restricted to the students.

I think I put in the notes a link to a tutorial we gave on Gaussian processes.

I think there is another tutorial that I should probably also include there.

I'm not sure I put the link to that.

We gave it at IJCAI.

uh

So on Bayesian deep learning.

We did a couple of tutorials, one on Gaussian processes, another one on Bayesian deep learning.

So yeah, yeah.

So let's definitely add that.

And I will add my own tutorial about GPs that I taught at PyData New York last year.

Awesome.

I did that with Chris Fonnesbeck.

He went into the different methods, the different algorithms that you can actually use to fit GPs,

mainly in PyMC, so in the Bayesian framework.

So vanilla GPs, inducing points, and HSGPs.

And the last half of the tutorial was myself going through an example tutorial for people, trying to infer player performance in soccer with GPs on three different

timescales, the days, the months, and the years, and pooling the GPs hierarchically

across players, while sharing the kernel structure.

So it's a pretty advanced use case.

um And you'll see it fits very fast on the laptop, folks.

So yeah, I'll put the link to the GitHub repo, and you have the link to the YouTube video at the beginning of the GitHub repo.

So yeah, let's put that in there, Maurizio.

I think it's going to be super, super interesting for people.

something I'm also...

curious about is how do Bayesian ideas integrate into modern deep learning, especially in terms of uncertainty quantification.

You talked a bit about that earlier, and it's actually a good question right now: where does Bayes fit into this new AI, and especially GenAI, landscape?

I'm curious to hear your thoughts about that.

Yeah.

I think one of the practical sides, you know, I think...

Many times we tried to do this, to start thinking about how do we put a prior over the parameters, and quickly realized that it's very difficult to do, because of this composition

and everything.

so it makes a lot of sense to think about priors over functions that you can represent with your model.

So this is also something that Vincent talked about, because he worked on this; we worked on this also in parallel.

And so the idea

is really that if we start thinking in that direction, then I think it's much more powerful to think about the kind of functions you can represent.

And I think it goes a lot in the spirit of the things that we were discussing at the beginning of what kind of complexity would you allow for your functions?

So are you happy with functions that have a certain degree of complexity?

And this idea of complexity is very profound because complexity is not just number of parameters.

Complexity is more about simplicity and Kolmogorov complexity in a way tells you a lot about that.

And here at KAUST, I'm interacting a lot with Professor Schmidhuber, who is here as one of the greatest minds in AI.

And he's been thinking about this stuff for a long time.

And whenever I get coffee with him, you know, I get a lecture on Kolmogorov complexity.

And so I've been thinking about this a lot myself now.

And also Andrew Wilson, again, has done some work on that, talking about these types of things.

I think in the end we are making progress, in a way, in understanding how much stochasticity we need in the networks to be able to represent at least any distribution we want.

But then we have to disentangle that in a way from the complexity of the functions that we can represent.

So there are these two aspects, I think: the complexity of the functions, and how crazy you want the uncertainties to be, or the distributions that you can represent a priori, before you look at any data.

And this is how you design a model.

Right.

And so there's this work by the group at Oxford, Tom Rainforth and Eric Nalisnick, who did this work on partial stochasticity, which I think is very fundamental

because it really gives you a practical way to say how many neurons in your neural network should you pick to be stochastic and how many you can just optimize.

And this gives already a guiding principle on how to think about these Bayesian neural networks in the future, I think.

He's not very excited about this work, though.

When I talked with Tom (I saw him a couple of weeks ago in Denmark at a workshop, and also at Dagstuhl last year),

I was telling him like, Tom, this is great.

This is one of the best things that happened in our community in a long time.

And he was like, come on, I don't think that this is so great.

He was downplaying this contribution a lot, which I think instead is very important, because imagine now if you can do MCMC on a much smaller dimensional space and still achieve the same representational power of a full-blown stochastic neural network with millions and millions of parameters.

And instead, maybe if your output is only 10 dimensional, you can get away with 10 neurons being stochastic.

That's powerful, you know?
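To make the idea of partial stochasticity concrete, here is a toy NumPy sketch. This is an editorial illustration, not the Oxford paper's construction: the feature layers are kept deterministic (here, fixed random ReLU features standing in for trained weights), and an exact Gaussian posterior is placed on the last linear layer only, so just a handful of parameters are stochastic:

```python
import numpy as np

rng = np.random.default_rng(0)

def features(X, W, b):
    # Deterministic hidden layer: a fixed ReLU feature map, the
    # non-stochastic part of the partially stochastic network.
    return np.maximum(X @ W + b, 0.0)

# Toy regression data.
n, d, h = 200, 3, 50
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

W, b = rng.normal(size=(d, h)), rng.normal(size=h)
Phi = features(X, W, b)

# Stochastic part: a conjugate Gaussian posterior over ONLY the last layer,
# p(w | data) = N(mu, Sigma), with prior N(0, 1/alpha) and noise 1/beta.
alpha, beta = 1.0, 100.0
Sigma = np.linalg.inv(alpha * np.eye(h) + beta * Phi.T @ Phi)
mu = beta * Sigma @ Phi.T @ y

# Predictive mean and epistemic variance from just h stochastic parameters.
phi_star = features(X[:1], W, b)
pred_mean = (phi_star @ mu).item()
pred_var = (phi_star @ Sigma @ phi_star.T).item() + 1.0 / beta
```

This last-layer construction (related to Bayesian last layers and last-layer Laplace approximations) is one simple instance of the broader point: you can often get the predictive uncertainty you need from far fewer stochastic parameters than a fully stochastic network, and inference on that small subset is cheap.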

And so I got excited about this stuff, and I started working on crazy things like GANs, which nobody looks at anymore because GANs, these generative adversarial networks, are now out of fashion; but actually they're based on neural networks themselves, you know, and they are partially stochastic.

So it fits perfectly in the narrative of the kind of things I was looking at.

And I got sucked into this and it's been a pain because optimizing these models is extremely difficult.

But at least now we have an understanding of this in an amazing way.

And it's very nice because we can now view not only GANs, but pretty much any generative model where you take a set of random variables and you have a complicated neural network mapping them into something complicated, a complicated P of X.

And this is the mother of all problems.

You know, if you estimate P of X, you can solve any problem you want, right?

And X can be, you know, if you have a supervised learning problem, it can be labels and inputs. If you're doing unsupervised learning, it's just your inputs.

So if you can do this well, you can do a lot of things.

And so this forces us to think a lot about regularization, model complexity and all these things.

And I'm really excited about this.

And this is really what we're working on at the moment with my group.

Yeah, this is fascinating.

And I agree, GANs are amazing.

I mean, this is fascinating. I really love the... I mean, it's generative models.

Of course I love it.

But I really love this idea of having two networks competing against each other.

This is super, super interesting, and can help you in cases of rare, like sparse, data actually.

So it can be extremely, extremely powerful.

And I see that you put a video tutorial about GANs precisely, and how they can be critical for...

Yeah, I was invited to give a presentation on this.

yeah.

Yeah.

Well, I'll definitely check that out and encourage people to do that.

Thanks.

I see you put indeed a lot of lectures already in the show notes.

That's fantastic for myself and for listeners.

It's going to be a great episode for show notes also, folks.

So definitely check them out.

And well, I'm going to start playing us out, Maurizio, because I could keep talking with you for a long time, because I'm really passionate about these topics and we work on very similar kinds of models.

So that's awesome.

But I need to respect your bedtime.

It's already late for you.

I'm curious, you know, more generally in the context of the current generative AI developments: where do you see Gaussian processes and Bayesian deep learning heading in the next few years?

And what advice do you give to young students, researchers, practitioners who want to dive deeper into Bayesian deep learning, or deep learning in general?

Yeah.

Making predictions about what's going to happen is very difficult, but I mean, I think a lot of this amortization through foundational models is happening really fast.

And I think we're not realizing how fast this is going.

so now...

So you mean amortized Bayesian inference, for instance?

Yes, amortize everything, you know: predictions, inference, everything.

And through these big models that have learned from other data and so on. It's a very powerful idea: you learn from lots of data sets, and then when you get a new data set, you know what to do, right? In a way, that makes a lot of sense.

In terms of GPs, I think they still play a pretty powerful role. I think there was a paper not long ago showing that GPs for Bayesian optimization actually still perform

pretty well compared to Bayesian neural networks of all kinds.

So they still have a place there for Bayesian optimization, experimental design, incremental experimental design, adaptive experimental design.

And also for these computer models, calibration of computer models.

I mean, this paper by Kennedy and O'Hagan, 2001, which is very fundamental for this, is a paper that I think is still quite relevant today.

I think there are still a lot of design choices you can make about the GPs that somehow allow you to emulate the code with uncertainty that is meaningful.

I think O'Hagan has done a tremendous amount of work on eliciting priors for these computer models.

So, you know, this is still very, very powerful and relevant, I think, and it's going to stay for some time.

I think GPs for spatio-temporal models, also thanks to people like Håvard Rue, here at KAUST.

I mean, they're going to stay for a long time.

I remember when I met Håvard in 2012.

He invited me to give a talk on MCMC for GPs because I was citing his work and he was working a lot on MCMC for Markov Random Fields.

And so he invited me for a keynote at one of his latent Gaussian models workshops.

And they had 120 seats, I still remember.

And it sold out in like an hour.

That was like a rock concert.

And everybody wants to use this, because so many people have problems that involve some spatio-temporal data, and they want to do it fast and they want to try stuff out.

They want to change models.

They want to change assumptions.

And the only way to do this fast is to have something that does the inference fast and accurately.

And what they developed is just tailored for that and works brilliantly.

So just to say that I think for these types of data, it's going to be pretty hard to beat GPs or Markov random fields.

We tried a bit with neural networks to do things to make the models more flexible, more non-stationary.

We've done some work on this, but you know, I still think there's the advantage of doing something so fast and so plug-and-play; really, I mean, you can just plug your data in, make a few assumptions, you know, about what you want, and then you just get the result.

That's very powerful.

Yeah, yeah, yeah, no, for sure.

And I'll put again into the show notes an episode I did with Marvin Schmitt about amortized Bayesian inference, and the work they do on the BayesFlow package.

If you, Maurizio, have some links you also want to add on amortized anything, especially practical Python packages people can use,

Definitely add that please.

In terms of the future of Bayesian deep learning, I think that's a much bigger question. I think as a community, we're trying to identify... I mean, there have been some nice works, and Vincent was mentioning this too, some nice works on various applications in healthcare, self-driving cars.

But I think we're still missing, you know, the kind of application that goes in the news, a killer application, an AlphaGo type of thing, where people are going to talk about it on BBC News or something like that; something that is going to convince everybody, and ourselves perhaps, that what we're doing is actually very meaningful.

I think we rely a lot on other types of applications, like computer vision problems, because people work a lot on these, or now LLMs have become popular.

So some of my friends and colleagues are actually showing that you can also make LLMs a bit Bayesian, with some low-rank Laplace approximations, for example.

And I think, yeah, so we're testing ourselves on these grounds, but ultimately uncertainty is what matters for decision-making.

so...

I think ultimately this is the kind of ground where we have to compete and try to evaluate ourselves, and see how well we can do with this, you know. And this is really also the difference: everybody talks about AI, but AI really is thinking about an agent that interacts with an environment, senses, reasons, and then acts on the environment.

And machine learning is the reasoning.

And then all this pipeline is AI.

At the moment there is no real AI; AI would mean that we have an agent that actually interacts with and intervenes on the environment.

I try to talk about this a lot with my students when I give lectures about machine learning, statistics, AI, what everything is.

I try to give some history about Thomas Bayes and what was going on back then, in the 1720s, when he was thinking about Bayes' theorem; what was happening in other fields, in other sciences, in other arts, and all these things.

And going all the way to the first statistics department, at UCL in 1911, to Turing, von Neumann, and all this.

But anyway, that will be material for another episode.

Yeah, for sure.

So, before I ask you the last two questions, do you have any advice you give to people who want to start working in that field, whether they are students, researchers, practitioners?

Yeah, you probably have heard this a lot, but I think working on the foundations is very important.

A deep understanding of the foundations is always what gives you an unparalleled advantage, because you really can

think in a very profound way about certain problems and what kind of problems we want to solve at a larger scale.

Many times it boils down to having a deeper understanding of the fundamentals.

And many times for me, I find it very useful to go back to linear models.

Whenever we develop new theory, new algorithms, new methods, we try to get some good grounding on linear models.

So what does it mean for a linear model to have this?

So we've been studying recently singular learning theory, to try to explain some scaling laws for uncertainty.

You know, people ask me all the time: I have so much data, why do I need to be Bayesian?

And now I can tell you, you know, we did work on the scaling laws.

We know when uncertainties, epistemic uncertainties, become small as the number of data points increases.

And now, you know, for a ResNet-18, I can tell you that you need, you know, 10 billion images before this uncertainty lands on the second digit of your probabilities, something like that.

Just to give practical advice to people to understand these things.

So we were trying to study the theory behind this and we think that singular learning theory can give us some intuition about this.

So Watanabe has done a lot of work on this, and has a nice book.

And so we were looking at this quantity, the generalization error, which was very mysterious.

And so we sort of derived it for linear models, and we understood what it means for real.

So many times this grounding on something that is tractable is really important, I think.
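As a tiny, tractable example of the kind of grounding being described here (an editorial sketch, not the singular-learning-theory result itself): for a conjugate Gaussian linear model, the epistemic variance has a closed form and shrinks like 1/n, which is the simplest possible version of an uncertainty scaling law, and exactly the sort of behavior that becomes much harder to characterize for a ResNet:

```python
def posterior_var(n, sigma2=1.0, tau2=10.0):
    # Conjugate model: y_i = w + eps_i with eps_i ~ N(0, sigma2),
    # prior w ~ N(0, tau2). The posterior variance of w is
    # (1/tau2 + n/sigma2)^-1, so epistemic uncertainty decays
    # like sigma2 / n for large n.
    return 1.0 / (1.0 / tau2 + n / sigma2)

v_small, v_large = posterior_var(10), posterior_var(1000)
# A hundredfold more data gives roughly a hundredfold less
# epistemic variance: the scaling law in its simplest form.
```

On this toy model you can read off exactly when "so much data" really does make the Bayesian correction negligible, which is the kind of question the scaling-law work asks for deep networks.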

Yeah.

Completely agree.

This is fascinating.

We need to have you back on the show at some point, Maurizio, to talk about these other topics, because otherwise it's going to be a three-hour episode.

I'd be fine with that, but I have a plane to catch and you have a bed to be in.

Let's play us out.

Well, for the last two questions, if you have listened to the show, you know them. First one is: if you had unlimited time and resources, which problem would you try to solve?

Awesome question.

I love this question.

And I would say nutrition is a huge, huge problem for me.

We see the statistics about the number of people with diabetes worldwide and it's insane.

Like, we're talking hundreds of millions. Just in the US, or, you know, even in India, I think it's 200 million adults with diabetes.

This is serious stuff. And I think now we have the tools to understand all this.

I mean, the food industry has done this experiment on all of us, right? And now we see the effects. So I think it's possible to draw some conclusions about all these things, and to understand, you know, optimal health based on what we eat.

And I think there are some people, you know, doing this.

There is a famous guy that is spending millions on this.

And I think, you know, I would probably spend time and energy on this if I had unlimited resources, because you would need a lot of resources to go against the common wisdom, against the food industry, government, regulations, and so on.

But I think, you know, there is definitely something we can optimize, and now we have more and more tools to measure things about ourselves.

Yeah, completely agree.

And I think it's also related to this incredible ability, I mean, weakness, we have as humans, which is our ability to entertain ourselves to death, which is definitely not one of our best instincts.

On that, I have actually at least two episodes to recommend.

The latest one is the one just before yours, actually Maurizio.

It's episode 143 with Christophe Bamberg.

And he does research exactly on that: appetite, how it's related to cognitive processes, and how it's related to self-esteem and things like that.

And second episode that is in the show notes for episode 143.

which is the one I recorded with Eric Trexler, who is much more focused on weight management and exercise and how that relates to appetite; and the environment that you're in is extremely important, basically.

To put it shortly.

So, second question, Maurizio: if you could have dinner with any great scientific mind, dead, alive, or fictional, who would it be?

Another great question.

I think it's very easy to overthink this.

As an Italian, I would say Leonardo da Vinci, who has been one of the greatest scientists, artists, architects, philosophers. He was just so much ahead of his time.

And I think whenever you interact with these people who are so much ahead of their time, you really see something new. You get so much inspiration.

It happened to me a few times.

One of the latest ones was when I interviewed here at KAUST.

I had a three-hour dinner with Jürgen Schmidhuber.

And I can tell you, that was an experience that I will never forget.

It was a great three hours of talking about wonderful things, and being challenged to think about things that I've never thought about.

And this is the kind of thing that, I think, as scientists we need: you know, to be challenged and get out of the comfort zone.

And I like doing that a lot, getting out of the comfort zone, you know?

Yeah.

Yeah.

I mean, I can tell that from your work, for sure.

And I think that's something, yeah, a lot of researchers have in common, for sure, because, like, you have to be comfortable being uncomfortable, because you're always at the frontier.

And so by definition, you don't know the answers.

um You don't even know if you'll get there.

So.

This is definitely something that's hard.

um Any type of research you do.

And it's definitely very awesome to have people like you in these kinds of jobs, because, well, you help us advance in all the domains you're touching, Maurizio.

So thank you so much.

And thank you so much for being on this show.

I think it was a great one.

It's time to wrap up now, but we'll have you on the show next time you have a fun paper or code or package to share with us.

Thanks again to Hans, I think it's Hans Munchel, I may be butchering your name, sorry about that.

But yeah, thank you so much for putting us in contact.

And Mauricio, thank you so much for taking the time and being on this show.

Well, thank you so much.

It has been a huge pleasure, and yeah, I hope this has been interesting for your audience and for you, and I'm happy to be back on the show whenever you want.

You are doing a great service and thank you so much for that.

Yeah, definitely was super fun and thank you for your kind words.

Definitely appreciate it.

This has been another episode of Learning Bayesian Statistics.

Be sure to rate, review, and follow the show on your favorite podcatcher, and visit learnbayesstats.com for more resources about today's topics, as well as access to more episodes to help you reach a true Bayesian state of mind.

That's learnbayesstats.com.

Our theme music is Good Bayesian by Baba Brinkman, feat. MC Lars and Mega Ran.

Check out his awesome work at bababrinkman.com.

I'm your host, Alex Andorra.

You can follow me on Twitter at alex underscore andorra, like the country.

You can support the show and unlock exclusive benefits by visiting patreon.com slash learnbayesstats.

Thank you so much for listening and for your support.

You're truly a good Bayesian.

Change your predictions after taking information in, and if you're thinking I'll be less than amazing, let's adjust those expectations.

Let me show you how to be a good Bayesian. Change calculations after taking fresh data in. Those predictions that your brain is making? Let's get them on a solid foundation.