#129 Bayesian Deep Learning & AI for Science with Vincent Fortuin
IMPORTANT: this new version fixes an editing glitch that sneaked into the original upload (https://youtu.be/3hYYGiucS0U)
• Join this channel to get access to perks:
https://www.youtube.com/@learningbayesianstatistics/join
• Proudly sponsored by PyMC Labs. Get in touch at https://www.pymc-labs.com/!
• Intro to Bayes Course (first 2 lessons free): https://topmate.io/alex_andorra/503302
• Advanced Regression Course (first 2 lessons free): https://topmate.io/alex_andorra/1011122
Our theme music is « Good Bayesian », by Baba Brinkman (feat MC Lars and Mega Ran). Check out his awesome work at https://bababrinkman.com/ !
Takeaways:
Chapters:
02:57 Exploring Bayesian Deep Learning
05:50 Vincent's Journey into Machine Learning
09:03 Current Focus in Bayesian Deep Learning
11:47 Understanding Bayesian Deep Learning
14:51 Libraries for Bayesian Deep Learning
18:14 Real-World Applications of Bayesian Deep Learning
21:01 When to Use Bayesian Deep Learning
23:48 Data Efficiency in AI
26:53 Generative AI and Bayesian Deep Learning
32:11 Integrating Bayesian Knowledge with Generative Models
33:27 The Role of Meta-Learning in Bayesian Deep Learning
36:20 Understanding PAC Bayesian Theory
40:10 Exploring Bayesian Deep Learning Algorithms
45:02 Advancements in Efficient Inference Techniques
51:06 The Future of AI Models and Their Reliability
54:14 Advice for Aspiring Researchers in AI
01:00:33 Vision for Solving Global Challenges with AI
Thank you to my Patrons for making this episode possible!
Yusuke Saito, Avi Bryant, Ero Carrera, Giuliano Cruz, Tim Gasser, James Wade, Tradd Salvo, William Benton, James Ahloy, Robin Taylor, Chad Scherrer, Zwelithini Tunyiswa, Bertrand Wilden, James Thompson, Stephen Oates, Gian Luca Di Tanna, Jack Wells, Matthew Maldonado, Ian Costley, Ally Salim, Larry Gill, Ian Moran, Paul Oreto, Colin Caprani, Colin Carroll, Nathaniel Burbank, Michael Osthege, Rémi Louf, Clive Edelsten, Henri Wallen, Hugo Botha, Vinh Nguyen, Marcin Elantkowski, Adam C. Smith, Will Kurt, Andrew Moskowitz, Hector Munoz, Marco Gorelli, Simon Kessell, Bradley Rode, Patrick Kelley, Rick Anderson, Casper de Bruin, Philippe Labonde, Michael Hankin, Cameron Smith, Tomáš Frýda, Ryan Wesslen, Andreas Netti, Riley King, Yoshiyuki Hamajima, Sven De Maeyer, Michael DeCrescenzo, Fergal M, Mason Yahr, Naoya Kanai, Steven Rowland, Aubrey Clayton, Jeannine Sue, Omri Har Shemesh, Scott Anthony Robson, Robert Yolken, Or Duek, Pavel Dusek, Paul Cox, Andreas Kröpelin, Raphaël R, Nicolas Rode, Gabriel Stechschulte, Arkady, Kurt TeKolste, Gergely Juhasz, Marcus Nölke, Maggi Mackintosh, Grant Pezzolesi, Avram Aelony, Joshua Meehl, Javier Sabio, Kristian Higgins, Alex Jones, Gregorio Aguilar, Matt Rosinski, Bart Trudeau, Luis Fonseca, Dante Gates, Matt Niccolls, Maksim Kuznecov, Michael Thomas, Luke Gorrie, Cory Kiser, Julio, Edvin Saveljev, Frederick Ayala, Jeffrey Powell, Gal Kampel, Adan Romero, Will Geary, Blake Walters, Jonathan Morgan, Francesco Madrisotti, Ivy Huang, Gary Clarke, Robert Flannery, Rasmus Hindström, Stefan, Corey Abshire, Mike Loncaric, David McCormick, Ronald Legere, Sergio Dolia, Michael Cao, Yiğit Aşık and Suyog Chandramouli.
Hi folks, you may be wondering why there is another episode 129 in your feed.
And well, you would be right.
And that is because AI cannot be trusted anymore.
No, seriously, we had an editing issue on episode 129, where, for some reason, the AI decided that some parts of my conversation with Vincent were just silence.
The AI model labeled these parts as silence and just got rid of them.
So that's the moral of the story.
We cannot fully trust the AI yet.
On that note, we re-edited the episode, so that we are sure that all the parts are here, even the ones that the AI deemed too bad, I guess, to be included.
But I will let you be the judge of that. You'll tell me in the comments, or on the Discord if you are a patron.
Okay, now I'm gonna leave you to it and see you very soon.
Bye, dear Bayesians.
Today I am excited to host Vincent Fortuin, a leading researcher in Bayesian deep learning and AI for science.
Vincent is a tenure-track research group leader at Helmholtz AI in Munich, where he leads the Efficient Learning and Probabilistic Inference for Science group.
In this episode, we explore why traditional deep learning often struggles in scientific applications and how incorporating prior knowledge and uncertainty quantification can enhance model reliability.
Vincent shares his insights on generative AI, meta-learning, and inference techniques like Laplace and subspace inference, explaining how they contribute to more efficient and robust AI models.
We'll also discuss the current landscape of Bayesian deep learning libraries, the challenges of real-world applications, and the role of PAC-Bayesian theory in providing generalization bounds.
Whether you're an AI researcher or someone interested in the intersection of deep learning and science, this episode is packed with insights into the future of reliable and data-efficient AI.
This is Learning Bayesian Statistics, episode 129, recorded November 22, 2024.
Welcome to Learning Bayesian Statistics, a podcast about Bayesian inference, the methods, the projects, and the people who make it possible.
I'm your host, Alex Andorra.
You can follow me on Twitter at alex_andorra, like the country.
For any info about the show, learnbayesstats.com is Laplace to be.
Show notes, becoming a corporate sponsor, unlocking Bayesian merch, supporting the show on Patreon, everything is in there.
That's learnbayesstats.com.
If you're interested in one-on-one mentorship, online courses, or statistical consulting, feel free to reach out and book a call at topmate.io/alex_andorra.
See you around, folks.
and best Bayesian wishes to you all.
And if today's discussion sparked ideas for your business, well, our team at PyMC Labs can help bring them to life.
Check us out at pymc-labs.com.
Hello my dear Bayesians!
Well, I hope that you are doing well, and before we dive into this episode, I wanna thank some new patrons, Yiğit Aşık and Suyog Chandramouli.
I hope I am not butchering your names, guys.
But thank you so much for supporting the show on Patreon.
Well, not exactly Patreon, actually on YouTube.
You guys are the first ones to support the show on YouTube, on the Good Bayesian tier and above.
So well done guys and thank you so much.
Thank you so much, not only for being the first ones to support the show on YouTube, but for supporting the show, period.
Really as you know, this is exactly the support that makes the show possible.
I pay for editing, hosting, recording, all that stuff that you don't see that goes into producing the episodes.
Well, I pay for that thanks to your support.
So thank you so much.
Make sure to link your YouTube account to your Discord account and that way you will be automatically added to the LBS Discord server.
And well, I can't wait to see you in there.
And if other people are interested in supporting the show on YouTube, well, you can just go to the YouTube channel.
LBS, Learning Bayesian Statistics, you look that up on YouTube, and then you will see a membership tab and you'll have all the info in there.
That should be super easy to set up with your YouTube account.
Then you just link your Discord account and you're all done.
So on that note, thank you again, guys.
I will see you in the Discord and now onto the show.
Vincent Fortuin, welcome to Learning Bayesian Statistics.
Hi Alex, great to be here.
Yeah, thank you for taking the time.
How was my Dutch pronunciation?
Perfect, I mean, impeccable.
Although, you know, I have to say there's this weird phoneme in Dutch that I personally can't actually pronounce that well either, because I grew up in Germany. But it's definitely as close as you would come, so well done.
And a huge thank you, before we start, to Marvin Schmitt, who put us in contact. So for listeners, I highly recommend Marvin's LBS episode, that was episode 107, I put that in the show notes, where we talked about all the fantastic work that Marvin is doing on amortized Bayesian inference and BayesFlow.
I want to contribute more to BayesFlow, and I keep trying, but things keep getting in the way.
My secret wish is that at some point I will find a way to use amortized Bayesian inference for the Marlins, and then I can actually contribute to BayesFlow from afar, for my job. That's what I'm trying to do.
But right now, I have to focus on other priorities.
Still contributing to other open source packages, you know.
Sorry about that, Marvin.
I'm trying, I'm trying.
So Vincent, let's talk, you see the French just came up.
Let's talk about you.
Can you share your journey into machine learning and
what sparked your interest in what you actually do, which is Bayesian deep learning?
Yeah, sure.
So I didn't have the most straightforward path.
I started off studying biochemistry in my undergrad.
And I don't know if you've ever been to a biochemistry lab.
Essentially, it's this kind of place where you pipette little watery liquids into each other, and it takes you a week.
And then you stick it into some machine that goes, merp, you did it wrong, start from scratch.
And so I'm caricaturing, but I guess I was just a bit too clumsy for the experiment, so they never worked out, and I found that a bit frustrating.
And I realized that I could handle computers much better than pipettes, so I moved on to bioinformatics.
And when I did my master's in bioinformatics, it was just about the time when you would see all these papers that claimed that all these algorithms that bioinformaticians had
worked on for decades.
were essentially being beaten by deep learning solutions, right?
So I wanted to be on the right side of history and moved into deep learning for my PhD.
But what I realized was that a lot of this hype in AI for science was not really delivering on the promises they made because like scientists really have a lot of prior
knowledge about their field that they wanted to get into these models.
And they also were really careful about
like the kind of predictions they made, right?
They didn't just want a high accuracy, but they wanted to have well-calibrated predictions that they can really generate insight out of.
And normal deep learning didn't quite fit the bill, right?
And so that's how I got into Bayesian deep learning, where then the hope is of course to marry the expressive power of deep learning with all the promises that Bayesian statistics
usually gives us, which is, like, putting prior knowledge into the models, getting uncertainties, and, you know, making optimal decisions in some sense.
Okay, okay.
I see.
That's, that's super cool.
I really like the meandering path, like, you know, that illustration of randomness, and how good things actually come from randomness.
So that's, that's great.
Oh, and so today, what are you focusing on?
And, you know, what does it mean to be a researcher in Bayesian deep learning?
Because to me, it sounds like, you know,
the conjunction of three extremely highly rated SEO keywords.
Yeah, for sure.
I mean, that's definitely like a lot of the work that we're doing in our lab that's still on the method side and trying to be better at doing inference in these Bayesian models and
trying to, you know, like just make them more reliable and robust.
But we also look a lot into application areas like AI for science.
So as I said, like my background was in science originally, so I'm still trying to follow through on that a little bit at least.
And a particularly interesting area that we're looking at right now is sequential learning.
So in the context of Bayesian optimization or Bayesian experimental design.
So I know you had Desi on the podcast recently, so that kind of stuff.
And obviously like these days, if you do anything related to deep learning, you can't ignore that we have these big foundation models and LLMs and these kinds of things now.
So some of the work we do is also trying to figure out how we can fit into that space and how we can make them more Bayesian in some way, which probably doesn't make much sense
on the pre-training because you have more or less infinite data anyway.
But in the fine tuning, we've done some work that is quite interesting where typically if you have your big GPT or whatever, LLM, and you want to fine tune it on a tiny data set in
your target domain, like this is where you really start caring about the uncertainties.
And then if you fine tune it in a way that's inspired by Bayesian updating, that usually gives you much better calibration than if you do the standard fine tuning.
Okay.
I see.
That's pretty cool.
I didn't know that.
So I should...
Well, let's dive a bit into that because I want to ask you a bit more about Bayesian deep learning and so on, but I'm curious about that.
Like how would that work?
Like, what would be the workflow here where you would use Bayesian statistics, or I'm guessing more prior knowledge, and infuse that into the fine-tuning of the LLMs?
Because, I mean, I'm not surprised by that, because the last time I checked and read more about these methods, fine-tuning was kind of a human-heavy element of the LLM workflow.
So I'm really curious to hear about that.
Yeah, definitely.
I think like philosophically, the way I think about normal fine tuning is also from a somewhat Bayesian viewpoint, right?
So if you pre-train your LLM on the entire text of the internet, you can somehow view that as a way to encode all the prior knowledge that is on the internet into some condensed form, right, which now comes as a point estimate of the parameters of some big transformer model.
And then what you typically do these days in fine tuning is this idea of parameter efficient fine tuning, right?
So you actually keep this big model fixed, like the backbone, as people would say, and then you add these low-rank adapters like LoRA, or there are more modern versions called VeRA or whatever.
But so what you end up doing is you have this big model with billions of parameters.
But then you have your small adapters with like just a few million parameters that you fine tune.
And they actually guide the big model towards the task you care about.
And so what we did is essentially like within these small parameter efficient adapters to then treat them in a Bayesian way, right?
So like the big model is still just a fixed backbone of a point estimate.
And that's essentially in some way our prior.
But then we use these small adapter layers to do Bayesian inference, because they're, you know, very small, as I said. So you can actually do Bayesian inference quite efficiently, as opposed to the big neural network or the big transformer, where you couldn't do it.
Okay, I see.
This is awesome.
I didn't know that was the case yet.
So actually, could you now actually define, you know, Bayesian deep learning?
You know, maybe what's deep learning in comparison to machine learning, for instance, and what makes it Bayesian?
Yeah, sure.
So, okay, I'm going to give you the more traditional view first and then a second one that I prefer.
So in the traditional sense, like if you think about what deep learning is, like it's essentially trying to learn functions that are parametrized by these artificial neural
networks, right?
So it's essentially an architecture that has an input layer where you put your data and then it propagates through several layers.
So that's where the deep comes from.
And at the end, there's an output layer that gives you the predictions, right?
um
So algebraically speaking, essentially it's just a bunch of matrices, which are your weights that you multiply with the input vector.
And then you apply some non-linearity, which is the activation function of the neural network.
uh Now this is really a powerful way of learning functions because you can prove that if you make your network big enough, it can approximate any function.
So that's what's called the universal approximation theorem.
And you know,
using techniques like backpropagation, we can actually do this quite efficiently in GPUs, for instance.
So that's why this whole deep learning paradigm has become so popular: we just happen to have the hardware that can do these matrix-vector products quite efficiently.
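As a minimal illustration of that "bunch of matrices plus a non-linearity" description (toy weights chosen arbitrarily, pure Python):

```python
import math

def matvec(W, x):
    # Multiply a weight matrix (list of rows) with an input vector.
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def forward(x, layers):
    # Propagate the input through each layer: matrix-vector product
    # followed by a tanh non-linearity (the activation function).
    h = x
    for W in layers:
        h = [math.tanh(z) for z in matvec(W, h)]
    return h

# A tiny two-layer network: 2 inputs -> 3 hidden units -> 1 output.
layers = [
    [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.8]],  # hidden weights (3x2)
    [[0.7, -0.5, 0.2]],                       # output weights (1x3)
]
y = forward([1.0, 2.0], layers)
```

Stacking more such layers is where the "deep" comes from; GPUs are fast precisely at the `matvec` step.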
And so now, once you have this neural network idea, this deep learning, then the classic idea to make it Bayesian is to say, instead of just learning a single setting of the
parameters, you learn a distribution over parameters.
So then you don't just have one setting for your weights, but you have, for instance, a Gaussian distribution over weights, which is then defined by a mean and a covariance
matrix.
And then you have to figure out how to get to that distribution.
And one intuitive way is to do it via Bayesian inference.
So you write down some distribution that is your prior, then you observe your data, you use Bayesian inference to update, and then you get a posterior.
And then from that posterior, you can sample
different parameters for the network, which will then give you different predictions.
So each sample of the parameters gives you a different set of predictions.
And then you can use that to quantify the uncertainty in your prediction space.
So this is kind of the classic textbook view of what Bayesian deep learning is, right?
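A toy sketch of that textbook recipe, assuming we already have an approximate Gaussian posterior over the weights (the means and standard deviations below are made up): each posterior sample is one full set of weights, and the spread of the resulting predictions quantifies the uncertainty.

```python
import math
import random

random.seed(0)

def forward(x, w):
    # Tiny 1-input, 2-hidden-unit, 1-output network with tanh activation;
    # w[0:2] are hidden weights, w[2:4] are output weights.
    h = [math.tanh(w[0] * x), math.tanh(w[1] * x)]
    return w[2] * h[0] + w[3] * h[1]

# Pretend Bayesian inference already gave us an approximate Gaussian
# posterior over the 4 weights (illustrative numbers only).
post_mean = [0.8, -0.5, 1.2, 0.3]
post_std = [0.1, 0.2, 0.1, 0.3]

# Each posterior sample of the weights is one function; the spread of
# their predictions at an input is our predictive uncertainty there.
preds = []
for _ in range(1000):
    w = [random.gauss(m, s) for m, s in zip(post_mean, post_std)]
    preds.append(forward(0.5, w))

mean_pred = sum(preds) / len(preds)
std_pred = math.sqrt(sum((p - mean_pred) ** 2 for p in preds) / len(preds))
```

Note this is also the function-space view Vincent prefers: we only ever look at the sampled predictions, never at individual weights.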
The way I personally like to rather think about it is slightly the other way around.
And I think this is where the difference between statistics and machine learning comes in, right?
So in statistics, you really care about the parameters of the model, right?
So if you build a model for how well different players end up performing, then these are actually interpretable parameters. Like, you know, which parameter is which player, and you infer a posterior, and that tells you something about the real world.
In this neural network setting, of course, these parameters are just arbitrary, right?
We just like build this big model.
We could have built it twice as big or half, and then these parameters would be different.
But ultimately what we care about is the function we're learning.
So we're trying to learn a function that maps from inputs to outputs.
And we just use this neural network as a convenient shape of that function because we know that it can approximate things well.
And so that's how I kind of like to think about Bayesian deep learning rather as Bayesian inference in the function space.
So very similar to how a Gaussian process would work, right?
So like in a Gaussian process, you essentially also have a distribution over functions.
but it's a very restricted one because it's Gaussian, right?
um And in the Bayesian deep learning sense, you could say we have a very flexible distribution over functions because we know that these neural networks can essentially fit
any function we want.
um And then the main thing we have to care about is like, how does our posterior and function space look like?
And we don't really care about the parameters.
That's just the means towards the end of getting a distribution over functions that fits our data well.
OK, very interesting, because I was going to ask you about that. We've talked already on the show about the fact that wide neural networks actually converge to Gaussian processes, and the way you described it was extremely close to what Gaussian processes are doing. So I was going to ask what the difference is between a deep neural network and a Gaussian process, but it seems like they are very close to each other anyway, so that answers it.
For instance, for someone who would like to uh start using Bayesian deep learning, Bayesian uh deep neural networks in their work, uh which library would you recommend
looking at?
Yeah, so that's a good question.
It's a bit of a pain point maybe.
So like right now, I guess we have this issue that there isn't like one library that rules them all kind of, right?
I think that's why you mentioned BayesFlow before, right? I think this is a great effort to try to put everyone on one ship.
So in Bayesian deep learning, we currently don't have that.
So we have different libraries for different types of inference.
So there's a Laplace library in PyTorch that does Laplace inference quite well.
And that's something that I've worked with quite a lot.
uh there's also one that is called Tihi, which is doing MCMC inference.
And there's one that's Bayesian Torch, which does variational inference.
And all these libraries are essentially maintained by different people and need slightly different ways of defining your model.
And the problem is really that, a priori, you don't actually know which is the right inference method, right? So like, if you really want to do it properly, you probably have to actually install all three libraries and try all of them.
And of course, in practice, like most practitioners don't want to do this.
So I think that's one of the main problems why, you know, we have a lot of papers where we show academically that it can really make a difference, but then in the real world, people
just don't want to go through that hassle of having to figure out what's the right library and stuff.
And if they can just use normal deep learning and two lines of code, they don't want to spend more than that on Bayesian methods.
So I think this is really something where the community still has to come together a bit more and build tools that maybe have a joint API that can then talk to all these other
libraries.
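For readers curious what Laplace inference, one of the methods mentioned above, actually does, here is a self-contained one-dimensional toy (not the laplace-torch API, just the underlying idea): find the posterior mode, measure the curvature of the log-posterior there, and use a Gaussian with that curvature as the approximate posterior.

```python
import math

def log_post(t):
    # Log-posterior of a logit parameter t: Bernoulli likelihood with
    # k successes out of n trials, plus a standard Gaussian prior on t.
    k, n = 7, 10
    p = 1.0 / (1.0 + math.exp(-t))
    return k * math.log(p) + (n - k) * math.log(1.0 - p) - 0.5 * t * t

# 1. Find the mode (MAP) with simple gradient ascent, using a
#    finite-difference gradient.
t, eps, lr = 0.0, 1e-5, 0.1
for _ in range(2000):
    grad = (log_post(t + eps) - log_post(t - eps)) / (2 * eps)
    t += lr * grad
t_map = t

# 2. Curvature at the mode via a finite-difference second derivative.
h = 1e-4
second = (log_post(t_map + h) - 2 * log_post(t_map)
          + log_post(t_map - h)) / (h * h)

# 3. Laplace approximation: a Gaussian centred at the MAP whose
#    variance is the negative inverse curvature.
var = -1.0 / second
```

In Bayesian deep learning the same three steps run over millions of weights, which is why practical libraries use Hessian approximations rather than exact curvature.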
I mean, that makes sense also because that's really at the frontier of research.
It's like, yeah, the time it takes for research to trickle down into tools you can actually use is quite normal.
That's what happens with PyMC, for instance, which I'm one of the core developers of. Like, for instance, if you take HSGPs, so Hilbert space decompositions of GPs, I think the seminal paper is like two years old, something like that, if I remember correctly, so 2022.
But then the time, you know, that we read the paper, we understand it deeply, we implement a first version, we deploy it into PyMC, we make sure everything works and doesn't break anything.
It takes time, it's like...
uh It's already quite fast, but it always takes quite some time.
uh But that's already...
It doesn't mean the alternative would be faster.
uh But yeah, so I'll link to these three libraries that you've named in the show notes anyways, uh for people who are interested.
For sure, having to understand which method to use beforehand, that's a pain point for practitioners, because I'm guessing most of them are not specialized.
And so they don't know. Like, it's not so much that they don't want to, I'm guessing it's that they absolutely don't know how to do that.
So that's for sure.
uh And then the classic deep learning libraries, uh I guess these are PyTorch, uh TensorFlow and always forget the third one.
Anyways, these are good references for people to try out if they want to.
uh You don't have the patient stuff with that, but that's already something to get familiar with neural networks, I'd say.
Yeah, and I mean, definitely, if people from the more open-source software engineering community want to get involved, I think there are a lot of open problems we could use help with, because we're more like these academic types. We write our little GitHub repo and link it in our paper, but then, you know,
just as you said like putting it into production like on the level of PyMC is a whole other problem for which you also need qualified people that actually know how to do this
well.
Yeah for sure, good point.
Actually, do you have, so I know you're more on the algorithm side, but I'm wondering if you have some real-world applications where Bayesian deep learning has significantly improved outcomes?
Yeah, that's a great question.
I think for the reason that I mentioned before: somehow we use it a lot in research, but it hasn't really been made easy enough for people to use.
I think people still have this preconception that Bayesian deep learning just doesn't work because they don't see it used very much, right?
But if you actually look into the papers that people write, there's all kinds of applications like healthcare, drug discovery, astrophysics, climate science, robotics,
autonomous driving, and so on, where it actually can make a difference.
And a lot of these are projects where then you have
some domain experts working with someone like me, right, like some researchers from Bayesian deep learning.
And then we can show on a project by project level, like that it actually has a positive impact.
ah But of course, yeah, for the wider impact, then we would need to make it more usable for people.
But I think it's really very promising to see that in all these different areas there have been attempts to use it, and that it actually has made a difference there.
And just for people that want to read a bit more about what the pros and cons are and how it's being used in all these fields, maybe I kind of shamelessly plug a little position
paper that we wrote and we published at ICML this year, which was co-written with a whole bunch of very, very good co-authors from different institutions.
And essentially like there, we try to make an argument why really like today we still need to use Bayesian deep learning, like probably more than ever, just because AI is so
pervasive in the world and to make it more reliable and trustworthy.
That's one way of doing it.
Yeah, I love that.
Yeah, for sure.
And I agree with your point that, yeah, seeing more real world applications is definitely something that's going to inspire people to use these kind of methods much more, in
addition to the whole uh workflow convenience and package convenience that we just mentioned before.
I'm actually curious: in which cases would you recommend people look at deep learning, or Bayesian deep learning in particular, and in which cases do you think, no, that wouldn't be useful here, or that's overkill?
Yeah, yeah, that's a great question.
So I think the main, the main properties that I think make a problem interesting for Bayesian deep learning is if you have some kind of prior knowledge, right, so typically in
a lot of sciences, that's the case.
Then secondly, if you have certain decisions that you want to make with the predictions that depend on your uncertainty, right?
So like and I mean medicine is a classic example, right?
Like, if you have a diagnostic machine learning system, you probably care about whether the system is 99.9% certain that the patient has a certain disease versus, like, 70%, because that might change whether you treat them immediately or run another test or something, right?
And I guess the third thing is like if your data are
kind of expensive to generate, right?
So typically in sciences, again, like often you, I don't know, if you have a chemical experiment and it costs you like a few thousand bucks to run each experiment, then you
can't generate billions of data points, but like you're, you're quite limited in how much data you can generate.
And that really helps you then to really get the most out of your data to, to be Bayesian.
On the contra side, again, if you want to pre-train a language model on, like, 15 trillion tokens that you scrape from the internet,
Probably you don't need to be Bayesian because you have a large data set and you probably don't have any better prior knowledge than what's written on the internet anyway.
So maybe there it's fine to just use normal deep learning.
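Vincent's criteria can be felt even in the simplest conjugate model. In this toy Beta-Bernoulli sketch (all numbers invented for illustration), an informative prior reaches a tighter posterior from the same ten expensive observations than a flat prior does:

```python
import math

def beta_std(a, b):
    # Standard deviation of a Beta(a, b) distribution.
    return math.sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))

data_k, data_n = 8, 10  # 8 successes in 10 expensive experiments

# Flat prior Beta(1, 1) vs an informative prior Beta(40, 10)
# (say, textbooks suggest roughly an 80% success rate).
results = {}
for a0, b0, name in [(1, 1, "flat"), (40, 10, "informative")]:
    # Conjugate update: add successes to a, failures to b.
    a, b = a0 + data_k, b0 + (data_n - data_k)
    results[name] = (a / (a + b), beta_std(a, b))
```

With the flat prior the posterior standard deviation comes out around 0.12; with the informative prior around 0.05, i.e. the prior did much of the work that extra data would otherwise have to do.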
Okay, yeah.
So basically, if you have prior knowledge, Bayesian deep learning; if you don't, and/or have a lot of data, then classic deep learning will be useful to you.
And how much data is enough data for classic deep learning, would you say?
Yeah, that's really, I mean, a super hard question.
I think it really depends on your problem, right?
I mean, first, it depends on the dimensionality.
Obviously, if you have a one dimensional problem, then probably if you have 100 data points, that's already like a lot in one dimension.
But if you have a problem that's a million dimensional, then you need, you know, more data points.
And then it depends how complex your problem is.
Like, if it's just a binary classification, maybe you can make do with just a few data points to fit some decision boundary.
But if you want to do a thousand class classification, like an image net or something, then uh you might need more data points.
Right.
So I think it's very hard to give any, you know, run-of-the-mill number. But yeah, I definitely think, like, if you look at your data and you randomly sample data points and they start looking very similar, then probably you have a lot of data.
Right.
Like if you sample data points and they all look very different, then
you're probably in the low data regime to some extent and then maybe there it helps you more to be Bayesian.
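That "sample points and see if they look similar" heuristic can be operationalized, for instance, with nearest-neighbour distances (a toy 1-D version; the dataset sizes are arbitrary):

```python
import random

random.seed(0)

def nn_dist(points):
    # Average distance from each point to its nearest neighbour.
    total = 0.0
    for i, p in enumerate(points):
        total += min(abs(p - q) for j, q in enumerate(points) if j != i)
    return total / len(points)

# Two "datasets" of different sizes drawn from the same distribution.
small = [random.uniform(0, 1) for _ in range(10)]
large = [random.uniform(0, 1) for _ in range(1000)]

# In the large dataset, each point has a very close neighbour (samples
# "look similar"); in the small one, neighbours are far apart, which is
# the low-data regime where being Bayesian helps more.
```

In high dimensions the same idea applies, but distances concentrate, so the required dataset sizes grow quickly, matching Vincent's point about dimensionality.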
Okay.
Yeah.
And also, from what I understood from talking with Marvin and the BayesFlow team, something that's very important for you to be able to apply at least amortized Bayesian inference, I don't know for the other methods, is the number of parameters, as you were saying, and dimensions in your model.
For instance, most of my data is very hierarchical and my models have tons of parameters and dimensions, and in these cases it seems like it's not very useful to use amortized Bayesian inference, because then, if I understood correctly, the neural network will be very hard to train.
Whereas if you have, as you were saying, something that's less dimensional but a lot of data, well then here that's a clearer use case.
Is that, am I summarizing that well?
Yeah.
I mean, I think for, for normal deep learning, that's definitely true.
I guess in Bayesian deep learning, it's always a question of how good your prior is, right?
So, you know, for the sake of argument, let's assume you actually already know what the solution is and you can write it down as a prior; then
you don't need any training data and you already have a good posterior, right?
So it also always depends on this.
Like, if your problem is very hard and high-dimensional, as you say, but you actually already kind of know what the right solution probably is, and you just want to fine-tune it a little bit to fit the last wiggles, then you can do that quite well.
But yeah, if you don't know much a priori, then of course the more complex the problem is, the harder it will be to learn.
Okay, yeah, yeah, I see, I see.
That's interesting for me to really understand that.
And something you work on quite a lot is also something that's called data-efficient AI.
I'm wondering what that is, mainly, you know, and if you could discuss your work in this area, and especially, if I understood correctly, there is a relationship between deep generative modeling and data-efficient AI?
Yeah, definitely.
yeah, I mean, the data efficiency is really like what comes from this idea of having prior knowledge, right?
As I just said, essentially, like if you have a perfect prior, then you're maximally data efficient because you don't need any data and you already solve your problem.
uh And this is really what in many scientific applications is quite useful, right?
Like if, you know, as I said, in chemistry, it's very expensive to
generate data, but on the other hand, you have a whole library full of chemistry books that tell you a lot about how that field works.
Then the hope would be that you can extract some of that prior knowledge, put it in your model, and then you don't need to see as many data points to make progress.
So the connection to generative AI is also quite interesting because in some level, like Bayesian deep learning and generative AI are quite related in that they both model joint
distributions.
In the Bayesian deep learning case, you have a joint distribution between the parameters of the model and the predictions, right?
While in generative AI, you typically model a joint distribution over the data itself, between inputs and outputs.
But, you know, like because they both care about modeling joint distributions over different things, you can quite fruitfully exchange ideas between the two.
So like typically the one way would be to say, like, we can actually use generative AI tools to do the Bayesian inference better.
And that's maybe along the lines where people might say these days we have these powerful diffusion models for generative modeling and we can use diffusion models now to model
posteriors of Bayesian neural networks, for instance.
And the other way is obviously the other way around where you could say like, let's take one of these big generative models and try to infuse it with some Bayesian prior knowledge
to make it more data efficient.
And so this is a bit like saying, yeah, if we already know what antibiotics look like, and I want to build a diffusion model that can produce new target molecules that look like antibiotics, then maybe I can put some prior knowledge in there so it doesn't have to see as many of them to learn how to model them.
Okay.
Yeah, it's fascinating.
I really like that.
And it's all intertwined in everything you're doing.
that's so cool.
And is that related to some... I should maybe also mention, I did write another position paper, on generative AI.
Maybe we can also put that in the show notes if people are interested.
Yeah.
Yeah, yeah, for sure.
For sure.
That's going to be super interesting.
Yeah.
And how is that related to another interest of yours, that's meta-learning?
How does that interact with your research, and what advancements have you observed in this area?
Yeah, that's another good question.
Yeah.
So, so essentially, as I said, like one of the main things in Bayesian deep learning is to have a good prior, right?
So like the better your prior is, the better your whole like model is going to be, and it's going to be more data efficient and hopefully more calibrated.
But writing down priors by hand can sometimes be challenging, right?
So sometimes, if you go to a medical doctor and you tell them, we're trying to build this model to predict some disease, what's your prior? They might not actually be able to tell you that much that you can put in there.
So one way to get these priors is to use meta-learning, which is essentially some way to look at other tasks that you've solved before that are similar to the problem you care
about.
And then you use the knowledge from those related tasks to make the performance in your target task better.
So you can actually view meta-learning as a hierarchical Bayesian model, where you essentially have a distribution over tasks at the top level, and then you have the different tasks as different Bayesian inference problems, but you use the previous task knowledge to inform the prior on every new task you see.
And so this is how meta-learning can really help you get better priors.
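To make that hierarchical picture concrete, here is a minimal empirical-Bayes sketch (toy numbers, not from the episode): point estimates from a few previous tasks are pooled into a Gaussian prior, which then shrinks the estimate on a new, data-poor task.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: five previous tasks, each with a true effect drawn
# from a shared task-level distribution N(2.0, 0.5^2).
task_estimates = rng.normal(2.0, 0.5, size=5)

# "Meta-learn" a prior by pooling the previous tasks (empirical Bayes):
prior_mu = task_estimates.mean()
prior_var = task_estimates.var(ddof=1)

# New task: only three noisy observations (known noise std = 1.0).
y = rng.normal(2.0, 1.0, size=3)
noise_var = 1.0

# Conjugate Gaussian update: the meta-learned prior shrinks the estimate
# toward what the previous tasks suggest, which is what makes the new
# task more data efficient.
post_var = 1.0 / (1.0 / prior_var + len(y) / noise_var)
post_mu = post_var * (prior_mu / prior_var + y.sum() / noise_var)
```

The posterior mean ends up between the pooled prior mean and the new task's sample mean, with weights set by how informative each source is.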
And then, of course, there's the other way around where you can actually say, we can also use meta-learning as a tool to learn how to do Bayesian inference.
And then that's, you know, the amortized inference idea that you mentioned, that Marvin and others are working on, where you really meta-learn how to do the Bayesian inference, so you don't actually have to run the whole Bayesian inference routine, but you can use some meta-learned neural network or something to do that inference for you.
And for instance, one class of models that do something like that are neural processes, which are kind of a way of framing, you know, Gaussian processes with neural networks.
So that comes back to what we talked about earlier.
And that's one of these meta learning frameworks for Bayesian inference that we're also quite interested in in my group.
Okay.
I see.
It's like, so meta-learning would be learning about the models themselves.
Am I understanding that right?
Yeah.
So meta-learning is really like, you try to look at some previous task and then say, okay, now that I've solved this task, what can I do better next time?
Right.
So I think in the normal world, you might think about, you know, if you learn several languages, right?
Like I know you speak different languages.
I think every new language gets a little bit easier, because you have all these previous ones and you can reuse some of that knowledge to have a better prior of what the next one might be like, right?
Okay.
Okay.
I see.
Interesting.
And so, is that related to PAC-Bayesian theory, which is another thing you're doing in your life?
So can you explain what that is, and yeah, basically why that's useful, why it's relevant to your work?
Yeah.
Yeah.
Yeah.
So it's not just moving on from meta-learning; there is a relationship, and I actually have a paper on using PAC-Bayesian theory for meta-learning.
It's not necessarily directly related, it's just something that I happen to do.
But maybe on a higher, more general level, I guess the idea of PAC-Bayes is that it's one of these things where people try to marry Bayesian and frequentist ideas, right?
So in the past, in the early 20th century, there were these fights, essentially, between the frequentists and the Bayesians, and they were solidly on different sides of statistics, arguing about things.
But these days, I think it's actually more ecumenical, right?
People really try to use ideas from Bayesian statistics and frequentist statistics and make them work where they work, and use the others otherwise.
PAC-Bayes is a great example of combining these two.
So PAC-Bayes essentially is a way to give you generalization bounds.
You essentially have a model that you trained, which could be a Bayesian model, or could actually be something else.
And you try to ask the question, for the test error that you care about, what's a bound on how bad it could be?
So, how well will your model do on unseen data?
And typically the form that these PAC-Bayes bounds take is to say: with high probability over the possible test data you might observe, the expected test error under your posterior will not be much larger than the expected train error under the posterior.
Right?
So if you take your Bayes posterior and you evaluate it on your training set and you get some number, so on MNIST maybe you get 1% or something, then the PAC-Bayes bound might tell you: with 95% probability, your test error on unseen data might not be worse than 1% plus x.
So it might be 2% or 3% or something.
And then how much slack there is, right, how much room between the test error bound and your actual train error, that depends on how many data points you've observed, and how high you want the probability to be, right?
So if you want it to be 99% instead of 95%, it will get a bit looser.
And it usually depends on the KL divergence between your prior and posterior in some way.
So this is where this PAC-Bayes idea comes in: you really have this prior and you compute the KL divergence.
And I guess some of your listeners might now think, okay, if I say train error plus something like a KL divergence, that sounds a bit like the ELBO, right?
The evidence lower bound in variational inference.
And indeed, you can see it in a very similar way.
So you can optimize a PAC-Bayes bound in the same way that you optimize the ELBO, and use it for model selection or to actually derive posterior measures.
And whereas the ELBO, if you optimize it, will essentially give you the proper Bayes posterior, with these PAC-Bayes bounds you can get something like pseudo-posteriors, which are also Gibbs measures, right?
So they have a very similar mathematical form to a Bayes posterior, but they might be more robust in certain ways because they deviate slightly.
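As an illustration of the kind of bound described here, a toy calculation with one common (McAllester-style) form of the PAC-Bayes bound; the exact constants vary across papers, and the numbers plugged in are made up:

```python
import math

# One McAllester-style PAC-Bayes bound (exact constants vary across
# papers). With probability >= 1 - delta over the training sample:
#   E_post[test error] <= E_post[train error]
#       + sqrt((KL(posterior || prior) + ln(2*sqrt(n)/delta)) / (2*(n - 1)))

def pac_bayes_bound(train_err, kl, n, delta=0.05):
    slack = math.sqrt((kl + math.log(2 * math.sqrt(n) / delta)) / (2 * (n - 1)))
    return train_err + slack

# Made-up numbers: 1% train error on 60,000 examples, a KL of 50 nats
# between posterior and prior, 95% confidence.
bound = pac_bayes_bound(0.01, kl=50.0, n=60_000)

# Asking for 99% confidence instead of 95% loosens the bound slightly,
# exactly as described above.
looser = pac_bayes_bound(0.01, kl=50.0, n=60_000, delta=0.01)
```

With these numbers the bound lands at a few percent, matching the "1% plus x" picture: more data tightens it, while a larger KL or a higher confidence level loosens it.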
Hmm, okay, okay. So I like how your work relates to all of the algorithms, the methods, how to make them better, how to use them more efficiently. So yeah, I understand why Marvin was interested in having you on the show, from that perspective. That's awesome. And I'm actually wondering, you know, because we've talked a bit about the different algorithms
for the Bayesian deep learning models.
Can you give us a rundown of these algorithms and when they are useful for which cases?
Yeah, definitely.
So there's essentially a spectrum of, you know, a trade-off between how expensive the algorithm is and how good it is, right?
Like how well you fit the posterior.
So on one side, we have things like Laplace inference, which I've been working on quite a lot recently, and which is very cheap.
So there essentially you just train your model as you would normally, right?
So you optimize your log posterior typically.
So you get a map estimate and you have a point estimate for your parameters.
That's your neural network.
And this is going to be your mean for the posterior.
And then, to get some distribution around it, you essentially have to approximate your Hessian.
So you have to compute a second-order derivative of the loss function.
And that Hessian then gives you the covariance for your Gaussian approximation, right?
So you just wrap a Gaussian around your optimized point estimate.
And that sounds very crude, right?
So like this clearly doesn't fit the entire posterior, but it turns out that it works quite okay, right?
So like for how cheap it is, it's quite a decent approximation.
And with modern automatic differentiation frameworks, you can do this Hessian approximation quite fast.
So this is something that these days people have actually successfully done even on GPT-2 or something, right?
So you can really scale this up quite a lot.
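The two Laplace steps described here, a MAP fit followed by a Gaussian wrapped around it, can be sketched on a toy one-parameter logistic regression (a stand-in for a neural network, not Vincent's actual code):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data for a one-parameter logistic regression.
x = rng.normal(size=200)
y = (rng.random(200) < 1 / (1 + np.exp(-1.5 * x))).astype(float)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Step 1: find the MAP estimate (the mode of the log posterior) by
# Newton's method, with a N(0, 1) prior on the weight.
w = 0.0
for _ in range(50):
    p = sigmoid(w * x)
    grad = np.sum((p - y) * x) + w           # gradient of the negative log posterior
    hess = np.sum(p * (1 - p) * x**2) + 1.0  # second derivative, always > 0 here
    w -= grad / hess

# Step 2: "wrap a Gaussian around" the mode: the covariance is the
# inverse Hessian of the negative log posterior, evaluated at the MAP.
p = sigmoid(w * x)
hess = np.sum(p * (1 - p) * x**2) + 1.0
posterior_mean, posterior_var = w, 1.0 / hess
```

In a neural network the same recipe applies per parameter block, except the Hessian is high dimensional and has to be approximated (diagonal, Kronecker-factored, or matrix-free).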
Now, if you want a slightly better posterior, you can do things like variational inference.
So there you can choose a bit more freely what your posterior shape might be.
So it doesn't have to be Gaussian.
You can use some...
other distribution and then just optimize the elbow between your true posterior and your approximate one.
And you can also do things like mixtures, like if you believe that your posterior might be multimodal, you can actually have a mixture distribution that you optimize.
And maybe a subset of that mixture variational inference is these kinds of particle-based approaches, where the mixture is actually a mixture of Dirac measures.
So you actually just have some point masses that you move around.
And you can show that you can move them around in a way that they cover the posterior quite nicely.
So this is actually an elegant way of saying, typically we care about drawing samples anyway, right?
And so, if I sampled from my posterior, I would naturally draw, say, 50 samples.
Then instead of first approximating the whole posterior and then drawing 50 samples, I can just start with 50 samples and move them around so they look like they were drawn from the posterior, and then I'm done.
And that's quite related to deep ensembles, which is also a slightly non-Bayesian baseline that people often use in practice, because it's easy to implement and easy to use.
And there are some kind of cool connections there: if you take a normal deep ensemble and you add a certain repulsive force between the ensemble members, then if you do that in the right way, it essentially recovers some Bayes posterior.
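The particle idea can be sketched with a Stein variational gradient descent (SVGD)-style update, one well-known member of this family, on a toy one-dimensional "posterior"; this is an illustration of moving point masses with attraction plus repulsion, not the specific repulsive-ensemble method mentioned:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy target "posterior": N(2, 1). Start 50 particles far away from it.
def grad_log_p(z):
    return -(z - 2.0)  # gradient of log N(2, 1)

z = rng.normal(-5.0, 0.5, size=50)

for _ in range(2000):
    diff = z[:, None] - z[None, :]                   # pairwise differences
    h = np.median(diff**2) / np.log(len(z)) + 1e-6   # median-heuristic bandwidth
    k = np.exp(-diff**2 / h)                         # RBF kernel matrix
    # SVGD update: a kernel-smoothed gradient pulls particles toward the
    # target, and the kernel-gradient term pushes them apart (repulsion).
    repulsion = (2 * diff / h * k).sum(axis=1)
    z += 0.1 * (k @ grad_log_p(z) + repulsion) / len(z)

# The moved particles now look roughly like draws from N(2, 1).
```

Without the repulsion term, every particle would just climb to the mode and collapse; the repulsion is what makes the final particle cloud cover the posterior rather than its peak.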
And then of course the more expensive ones are all these MCMC approaches, Markov chain Monte Carlo approaches like stochastic gradient Langevin dynamics and Hamiltonian Monte Carlo, where, you know, the longer you run the chain, at some point it mixes; I guess you know all that from PyMC.
So this is the best way to get actual samples that have some guarantees, but it's also very expensive.
So you might have to pay a huge amount of extra compute.
And it really depends like how much compute you're willing to spend in your particular problem, right?
Like if you pre-train a language model, like already one training run costs you millions of dollars.
You probably don't want to spend any extra compute on that.
But if you're a medical researcher and you spend already half a year generating your dataset, then maybe you don't care whether your neural network training now takes one day
or five days or something, right?
Like this is not the bottleneck in your project.
So it really depends on what the application is.
And sometimes if you can afford running more compute, then of course you should run the better inference.
Yeah, for sure.
That makes sense.
And what are the latest advancements when it comes to more efficient inference techniques for Bayesian deep learning?
Yeah, I mean, a lot of them are still being developed, like the Laplace stuff I talked about.
We've definitely had a lot of papers recently, some of them that I was involved in, that have pushed the scalability by using all kinds of clever tricks from matrix-free linear algebra and all these things that you can do these days in cool frameworks.
I think a general idea that's quite interesting is the idea of subspace inference.
So that's essentially the insight that you don't have to treat every single parameter of your neural network probabilistically in order to get a good enough posterior over
functions, right?
So this kind of comes back to what I talked about earlier.
Like, if you think about Bayesian neural networks as just a neural network that's now Bayesian, where you make every parameter a distribution, then it sounds like you might need to do this for all parameters.
But if you think about it just through the lens of saying we want to have a posterior over functions that makes sense, then it becomes obvious that, in your millions of parameters, maybe there's a subset that is enough to be random, to actually introduce enough randomness in the functions that are being implemented.
And so there's a lot of work that, for instance, just takes the last layer of the neural network, or just the first layer, or some sub-network inside the bigger neural network.
And as I told you before, in the case of language models, for instance, you can take the backbone and leave it fixed and frozen as a point estimate, and then just have these small parameter-efficient adapters that you treat as Bayesian.
So I think this is really where you get a lot of efficiency gains: figuring out, out of your huge neural network, which subset of parameters you need to treat in a Bayesian way to then make the function-space posterior fit what your function should be.
And then most of the other parameters you can just leave as point estimates.
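A minimal sketch of this subspace idea: freeze a backbone as a point estimate (here just a random feature map standing in for a trained network body) and treat only the last linear layer Bayesianly, where the Gaussian posterior is available in closed form:

```python
import numpy as np

rng = np.random.default_rng(3)

# A frozen "backbone": a random feature map standing in for a trained
# network body kept as a point estimate.
W_frozen = rng.normal(size=(1, 16))

def features(x):
    h = np.tanh(x[:, None] @ W_frozen)           # (n, 16) hidden features
    return np.hstack([h, np.ones((len(h), 1))])  # plus a bias feature

# Toy regression data.
x = rng.uniform(-3, 3, size=40)
y = np.sin(x) + rng.normal(0, 0.1, size=40)

# Treat only the last linear layer as Bayesian: with a N(0, 1/alpha)
# prior on its weights and N(0, 1/beta) observation noise, the posterior
# over the last-layer weights is Gaussian in closed form.
alpha, beta = 1.0, 100.0
Phi = features(x)
S = np.linalg.inv(alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi)  # covariance
m = beta * S @ Phi.T @ y                                              # mean

def predict(x_new):
    # Predictive mean and variance; the weight posterior contributes the
    # second variance term on top of the observation noise.
    phi = features(np.atleast_1d(x_new))
    return phi @ m, 1 / beta + np.einsum("ij,jk,ik->i", phi, S, phi)

mean_in, var_in = predict(0.5)     # a point inside the training range
mean_out, var_out = predict(10.0)  # a point far outside it
```

The whole backbone stays a point estimate; only the 17 last-layer weights carry a distribution, yet the model still produces a predictive variance for every input.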
Okay.
Okay.
I see.
Very interesting.
So yeah, if you have any links to dive deeper into that, for listeners and for myself, leave them in the show notes, because I'm always curious to see these latest developments.
Something I'm very excited about is the intertwining, I don't know if that's the word in English, of the initialization of MCMC chains with neural networks.
So normalizing flows, or the Pathfinder algorithm from Bob Carpenter, where you would basically use draws from Pathfinder or normalizing flows as the initialization of the MCMC chains.
And that makes your sampling not necessarily faster, because you still have to train a neural network first.
But sometimes, if you have really a lot of data and MCMC is really slow, then that's definitely useful, especially if you have access to a GPU.
And that also makes the variational inference much more practical, especially the Pathfinder option, because it usually gives you much better answers.
And then there is the normalizing flow initialization option, which, if you have a GPU, is definitely extremely helpful.
And actually there are ongoing efforts right now on the PyMC side to add Pathfinder VI to PyMC as an initialization option for
MCMC, which would be super efficient. That's also how Bob Carpenter designed it: basically, you run Pathfinder, use some draws to initialize MCMC, and then you can just run MCMC for a few iterations on those chains, and that's a much faster convergence than pure MCMC, but it should also be much more reliable than the classic VI that we have right now in PyMC. And then there is also an ongoing effort by Adrian Seyboldt in particular on the nutpie side, where he just added the ability to use normalizing flows as initialization for the MCMC inference in nutpie, so you can already use that in your PyMC or Stan models to try that out.
So definitely I'll put the links in the show notes for that option, because the Pathfinder option in PyMC is still under development, it's still a pull request, so not very useful to link to, but the nutpie thing with normalizing flows is definitely implemented already, so I'll put a link in the show notes to a Discourse post that Adrian posted recently explaining how you can use that, and I definitely encourage people to check that out.
Also, report to Adrian any GitHub issues on nutpie if there are any problems, because that's really new, and in this case it's extremely useful for open-source developers to hear from early adopters to fine-tune the details.
Keep in mind, and you'll see that in the Discourse post, that a GPU will help you a lot here, because you need to fit the neural network first.
So yeah, don't expect any model to run in less than 10 minutes.
But then if your model is already bigger than that and takes much more time than that, then that could be a viable option.
Yeah, cool.
That sounds very good.
that's really awesome.
I know I'll definitely link to that.
And also, another question I have for you, actually, is related to that.
But what advancements do you foresee, and maybe also wish for, in making these AI models more reliable and data efficient?
Yeah.
Yeah.
I mean, definitely, as you probably guessed, I'm personally hoping that Bayesian deep learning can play a role in that, right?
And I guess in our position paper that I mentioned before, we really tried to make this argument quite strongly.
I think what's really an issue is that often we don't do a good job of communicating to people what they need these uncertainties for, right?
I don't know if you see that in your consultancy and stuff, right?
But like I often talk to scientists and then they, you know, they're like, oh, we need this, this model to solve our task.
And I'm like, okay, what if we make it Bayesian?
I can also give you uncertainties.
And then they're like, but what should I, what should I get uncertainties for?
I don't care.
I just want like a good performance.
And I think it's a bit unfortunate that, in the communication, we don't make them understand that it's really about downstream decisions, right?
We don't care about uncertainties for their own sake; we care about making good decisions in the real world.
And as I said before, like often you really need the uncertainty to make the decision, right?
Like, if you're a doctor and you have a patient, you need to understand whether my algorithm is giving you a 99.9% accurate prediction, or whether it's just an 80/20 kind of guess, in which case, in the latter, you probably want to do another test or something, right?
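The doctor example can be phrased as a tiny expected-cost calculation; the costs below are invented for illustration, but they show how the same prediction leads to different decisions depending on its uncertainty:

```python
# Illustrative costs, invented for this example: treating a healthy
# patient, missing a sick one, and ordering one more test.
COST_FALSE_TREAT = 10.0
COST_MISS = 50.0
COST_EXTRA_TEST = 2.0

def best_action(p_sick):
    """Pick the action with the lowest expected cost, given the model's
    predicted probability that the patient is sick."""
    expected_cost = {
        "treat": (1 - p_sick) * COST_FALSE_TREAT,
        "wait": p_sick * COST_MISS,
        "extra test": COST_EXTRA_TEST,
    }
    return min(expected_cost, key=expected_cost.get)

# A confident 99.9% prediction is acted on directly, while an 80/20
# guess makes it worth paying for another test:
print(best_action(0.999))  # treat
print(best_action(0.20))   # extra test
```

A point prediction alone can't distinguish these cases; it's the calibrated probability that tells you when paying for more information beats acting now.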
And, you know, the problem is
that as long as we don't communicate this, then people also won't have an intrinsic motivation to try our methods.
So I think this is really something that maybe as a community, we could be a bit more kind of like better at communicating, right?
That we really don't care about uncertainties just because we want to have nice likelihood numbers in our tables, but we care about making good decisions in the real world.
And as long as people care about this, then I'm also happy if they find other ways to serve that purpose, right?
Like I said before, frequentist methods can also work quite well.
And if people want to use conformal prediction for some specific task where that works well, then I'm not going to force them to do a Bayesian thing, right?
I think as long as people actually start thinking more about why do I need reliable predictions, what do I use them for, how expensive is my data, can I maybe get more out of the small dataset that I paid a lot of money for, then I'm happy.
And I hope that most of them will find that Bayes is a good way of doing this, but some of them might find a different way, and that's also fine.
Yeah, in the end, whatever works, right?
That's important.
It's better to have a good enough model than no model at all.
Yeah, for sure.
So to play us out, I'm wondering if you have any advice to offer to uh those looking to pursue a career in topics you're working on, so whether deep learning or probabilistic
inference?
Yeah, definitely.
I mean, I think, you know, one of the big pieces of advice that I always give my students is to really focus on the foundations.
So I like to tell this anecdote.
When I started my PhD, like everyone was crazy about GANs.
I don't know if you remember that.
So like these generative adversarial networks, that was the time where, you know, they were everywhere and there were like hundreds of papers about them.
And so when I started my PhD, some people are like, you should really get into GANs, right?
You should like learn about that in detail and learn all the tricks to train them and whatever.
Which I didn't do, and I'm guessing if I had, none of it would be relevant anymore, right?
Because these days people don't use GANs anymore.
People use diffusion models, and you know, then there's going to be the next thing, right?
um But I think in contrast to that, if you look at things like Bayesian inference, like that's been used for 200 years and people still use it and it's still an important thing
to understand.
So I think if people focus more on these big ideas, right, these really foundational things, rather than the latest trends and fads, I think that serves a much better purpose.
And I understand that these days a lot of people are quite excited about language models and stuff, but you know, who knows how long we'll still care about them in this particular way, right?
Like maybe next year someone comes along and develops some cool new architecture that's very different from a transformer.
And suddenly like everyone does language models differently, or people stop doing autoregressive language modeling and use diffusion for language or whatever.
And then suddenly, like if you've only learned about this particular thing because it was cool right now, then your knowledge will be obsolete, right?
So I think that's maybe the main idea.
And then maybe another piece of advice that I always give people is to just talk to as many people as possible, right?
And I think what's really nice in our community is that people are quite open.
Like if you go to any machine learning conference, you can just talk to anyone and people are happy to have a chat.
Like nobody's going to turn you down.
And just to get an idea of the diversity of research that's being done, and to get an idea that not everyone just works on LLMs, but there's actually tons of interesting ideas that people work on.
And yeah, it's quite an exciting time to be in that field really.
Yeah, I second everything you just said.
Extremely welcoming community.
Feel free to ask questions.
Always politely, of course, you know.
But yeah, like extremely welcoming community.
And if you're persistent and really want to help people out and be active, I don't think you're going to have any problem getting a foot in the door, let's say.
Before I ask you the last two questions I ask every guest at the end of the show, I'm wondering what future projects or research areas you are currently excited about.
Because you do a lot of things, I'm really curious to hear where your mind is at right now.
Yeah, that's a good question.
Since I started my own research group, I feel like it's been diffusing around the edges a bit because suddenly you have PhD students and they obviously have their own ideas as they
definitely should, right?
So yeah, there's a lot of things we're looking into.
I mean, as I said before, I think AI-for-science applications are still something that I find really exciting.
And I feel like now is the time to show the world that the kinds of algorithms we've developed over the last 10 years or whatever actually make a difference.
And particularly, looking into how we can also benchmark them better, right?
So I think that's a bit of an issue right now:
we end up using a lot of benchmarks that other people have developed for their particular models to make them look good, right?
So typically if you look at these deep learning benchmarks like MNIST and CIFAR and whatever, they're really like very highly curated data sets that are very clean.
Like there's essentially no uncertainty about these digits, right?
So like if you train a normal neural network on MNIST, that works perfectly fine.
So arguably you don't need to make it Bayesian.
And then if we try to benchmark on this, then you don't see much of a benefit.
And then people will say, there's not much of a benefit to be Bayesian.
Like, why would you do this in the first place?
But it's because this dataset is just not interesting for that purpose, right?
And similarly, on the Bayesian side, a lot of the datasets that people use are very small and low dimensional, because that's where traditional Bayesian methods work well, right?
Like if you just have a Gaussian process, then you want to run it on some little, you know, five-dimensional regression thing.
So I think what we're lacking kind of right now are these benchmarks that are realistic type of data from the real world that is both high dimensional so that deep learning is
useful, but also has all these like complicated noise structure or some uncertainty.
And so I think this is something that we're also looking into a bit more now, to find the right niche for our product.
It's a bit like, obviously, people might say, if you have a hammer, you try to make everything look like a nail.
So I don't want to be that person, but I still believe that there are nails in the world, and I want to find them so I can use my hammer.
And I don't want to hammer on everything that's not a nail.
So I think that's something that we're definitely looking into right now.
Cool.
Yeah, that's very exciting.
Yeah, I love that.
Come back on the show as soon as you have something to tell us about that.
Definitely, yeah. That sounds like a plan.
Awesome.
Well, Vincent, I know you have a lot to do and I have to let you go because you have a hard stop.
I could still ask you a ton of questions, but let's call it a show.
I think we covered a lot of ground, learned a lot of things, and yeah, I really like it because lots of things are clearer to me now than before the show.
So that's awesome.
That's also why I do this show.
uh But before letting you go, I'm going to ask you the last two questions I ask every guest at the end of the show.
First one, if you had unlimited time and resources, which problem would you try to solve?
Yeah, good question.
oh
So I don't know if they use that slogan anymore, but DeepMind used to have that slogan where they said their goal was to solve intelligence and then use it to solve everything
else.
So I might make that, you know, solve Bayesian inference and then use it to solve everything else.
But jokes aside, I think like there's so many interesting problems in AI for science right now, you know, from healthcare to material science to climate.
And so I really hope that we can have some impact there by essentially building AI methods that scientists can use, which are strongly founded on Bayesian principles and are
therefore more reliable, more robust, and more trustworthy.
I love that, completely aligned with that objective.
And second question, if you could have dinner with any great scientific mind, dead, alive, or fictional, who would it be?
Yeah, I've been thinking about this.
I don't think I have a more creative answer than all your other guests before me.
I do think having dinner with Bayes or Laplace would be fun, right?
Although for the latter, I might have to brush up on my French a little bit.
Otherwise, one thing I'm quite sad about personally is that I never actually got to meet David MacKay before he passed away.
So I think it might also be very nice to meet him for dinner one time.
Yeah, for sure.
Well, Vincent, a pleasure to meet you, a pleasure to have you on the show.
Come back any time.
And as usual, I'll put a link to your website, your socials, and all the papers for this episode, and a lot of packages.
So that's great. The show notes are already very big.
So feel free to add anything in the show notes for those who want to dig deeper.
Thank you again, Vincent, for taking your time and being on this show.
Perfect.
Thanks, Alex.
It was great fun.
This has been another episode of Learning Bayesian Statistics.
Be sure to rate, review, and follow the show on your favorite podcatcher, and visit learnbayestats.com for more resources about today's topics, as well as access to more episodes to help you reach a true Bayesian state of mind.
That's learnbayestats.com.
Our theme music is Good Bayesian by Baba Brinkman, featuring MC Lars and Mega Ran.
Check out his awesome work at bababrinkman.com.
I'm your host, Alex Andorra.
You can follow me on Twitter at alex_andorra, like the country.
You can support the show and unlock exclusive benefits by visiting patreon.com/learnbayesstats.
Thank you so much for listening and for your support.
You're truly a good Bayesian.
Change your predictions after taking information in.
And if you're thinking I'll be less than amazing, let's adjust those expectations.
Let me show you how to be a good Bayesian, change calculations after taking fresh data in, those predictions that your brain is making, let's get them on a solid foundation.