Learning Bayesian Statistics

Proudly sponsored by PyMC Labs, the Bayesian Consultancy. Book a call, or get in touch!

Listen on Podurama

Today, we’re gonna learn about probabilistic numerics — what they are, what they are good for, and how they relate computation and inference in artificially intelligent systems.

To do this, I have the honor of hosting Philipp Hennig, a distinguished expert in this field, and the Chair for the Methods of Machine Learning at the University of Tübingen, Germany. Philipp studied in Heidelberg, also in Germany, and at Imperial College, London. Philipp received his PhD from the University of Cambridge, UK, under the supervision of David MacKay, before moving to Tübingen in 2011. 

Since his PhD, he has been interested in the connection between computation and inference. With international colleagues, he helped establish the idea of probabilistic numerics, which describes computation as Bayesian inference. His book, Probabilistic Numerics — Computation as Machine Learning, co-authored with Mike Osborne and Hans Kersting, was published by Cambridge University Press in 2022 and is also openly available online. 

So get comfy to explore the principles that underpin these algorithms, how they differ from traditional numerical methods, and how to incorporate uncertainty into the decision-making process of these algorithms.

Our theme music is « Good Bayesian », by Baba Brinkman (feat MC Lars and Mega Ran). Check out his awesome work at https://bababrinkman.com/ !

Thank you to my Patrons for making this episode possible!

Yusuke Saito, Avi Bryant, Ero Carrera, Giuliano Cruz, Tim Gasser, James Wade, Tradd Salvo, William Benton, James Ahloy, Robin Taylor, Chad Scherrer, Zwelithini Tunyiswa, Bertrand Wilden, James Thompson, Stephen Oates, Gian Luca Di Tanna, Jack Wells, Matthew Maldonado, Ian Costley, Ally Salim, Larry Gill, Ian Moran, Paul Oreto, Colin Caprani, Colin Carroll, Nathaniel Burbank, Michael Osthege, Rémi Louf, Clive Edelsten, Henri Wallen, Hugo Botha, Vinh Nguyen, Raul Maldonado, Marcin Elantkowski, Adam C. Smith, Will Kurt, Andrew Moskowitz, Hector Munoz, Marco Gorelli, Simon Kessell, Bradley Rode, Patrick Kelley, Rick Anderson, Casper de Bruin, Philippe Labonde, Michael Hankin, Cameron Smith, Tomáš Frýda, Ryan Wesslen, Andreas Netti, Riley King, Yoshiyuki Hamajima, Sven De Maeyer, Michael DeCrescenzo, Fergal M, Mason Yahr, Naoya Kanai, Steven Rowland, Aubrey Clayton, Jeannine Sue, Omri Har Shemesh, Scott Anthony Robson, Robert Yolken, Or Duek, Pavel Dusek, Paul Cox, Trey Causey, Andreas Kröpelin, Raphaël R, Nicolas Rode, Gabriel Stechschulte, Arkady, Kurt TeKolste, Gergely Juhasz, Marcus Nölke, Maggi Mackintosh, Grant Pezzolesi, Avram Aelony, Joshua Meehl, Javier Sabio, Kristian Higgins, Alex Jones, Gregorio Aguilar and Matt Rosinski.

Visit https://www.patreon.com/learnbayesstats to unlock exclusive Bayesian swag 😉

Links from the show:

Abstract

by Christoph Bamberg

In episode 88 with Philipp Hennig, Chair for the Methods of Machine Learning at the Eberhard Karls University Tübingen, we learn about a new, technical area for the Bayesian way of thinking: probabilistic numerics.

Philipp gives us a conceptual introduction to Machine Learning as “refining a model through data” and explains the challenges Machine Learning faces due to the intractable nature of the data and the computations involved.

The Bayesian approach, emphasising uncertainty over estimates and parameters, naturally lends itself to handling these issues.

In his research group, Philipp tries to find more general implementations of classically used algorithms, while maintaining computational efficiency. They successfully achieve this goal by bringing the Bayesian approach to inference.

Philipp explains probabilistic numerics as “redescribing everything a computer does as Bayesian inference” and how this approach is suitable for advancing Machine Learning.

We expand on how to handle uncertainty in machine learning, and Philipp details his team's approach to handling this issue.

We also collect many resources for those interested in probabilistic numerics and finally talk about the future of this field.

Transcript

This is an automatic transcript and may therefore contain errors. Please get in touch if you’re willing to correct them.

Alex (00:01.919)

Philipp Hennig, welcome to Learning Bayesian Statistics.

Philipp (00:07.33)

Thanks for the invitation. Nice to be here.

Alex (00:09.627)

Yeah, thanks a lot for taking the time. I'm really psyched about this episode. And thanks a lot to my patrons who recommended you for an episode. My memory is extremely bad, so right now I cannot remember who sent me the message, but I'm sure you'll recognize yourself if you're listening to the episode. Thank you so much for your recommendations, folks, and please keep them coming.

in the Slack, that's always awesome. So, Philipp, let's... I have a lot of things to talk about with you. I think it's gonna be super interesting and also a bit more epistemological than a classic technical episode, so I'm really psyched about this blend of technique and epistemology. But, as usual, let's start with your origin story. So...

Yeah, like basically how did you come to the world of statistics and probabilistic modeling, and how sinuous of a path was that?

Philipp (01:17.098)

It was certainly not a straight path. So I originally actually studied physics in Germany and in the UK. And during that time, I didn't really get much exposure to statistics at all. Most of the statistics was maybe just, maybe at most multiplying Gaussian distributions on plotting paper. But in doing my master's thesis, I worked on

Alex (01:22.309)

Uh-huh.

Alex (01:36.692)

Mm-hmm.

Philipp (01:45.258)

actually on a simulation for an electron microscope, where I had to build a simulation that produced lots of random numbers. And I got interested in how to deal with them properly, because what I did seemed very wasteful. And so I stumbled over a textbook by David MacKay, which gave me a certainly very idiosyncratic and unique introduction into very Bayesian thinking. And I was so excited about it. So.

Alex (01:50.397)

Uh-huh.

Alex (01:57.543)

Mm-hmm.

Alex (02:03.263)

Mm-hmm.

Philipp (02:13.362)

and thrilled by the ideas that I contacted David and asked whether he needed any new PhD students. And so I ended up in his group and got a proper brainwashing and left it after a few years as a hardcore Bayesian. But since then, I think sort of myself and also in terms of my publications and my research work, I found myself associated with the machine learning community, maybe more than the stats community.

Alex (02:20.382)

Mm-hmm.

Alex (02:25.055)

Mm-hmm.

Alex (02:43.07)

Mm-hmm.

Philipp (02:45.53)

which in Cambridge felt like two very related but different things to work on, because this is a university that has both a big tradition in stats and also in machine learning. And I've been in this community since, well, 2007 or so, so I got to share some of the wild ride.

Alex (03:08.571)

Hmm. Yeah, yeah, definitely. It seems like you're, you're doing a lot of research and development in the, in the machine learning environment. Uh, so we're definitely going to talk about that in, and that's interesting because you, so you got introduced to Bayesian methods pretty fast, basically in your, in your, um, undergraduate studies, if I understood correctly. So.

Something I'm wondering is why did they stick with you?

Philipp (03:42.274)

So I think maybe one of the, maybe an advantage that I had in hindsight is that I never really had a formal education in statistics, and in particular not a lot in classic statistics. And I started my PhD not actually knowing what a p-value actually is, to be honest. And so maybe I got exposed to Bayesian thinking basically from the start. And it seemed often in hindsight, it seemed to me...

Alex (04:00.639)

Good for you.

Alex (04:09.055)

Mm-hmm.

Philipp (04:12.486)

Well, I had to learn about frequentist concepts of statistics afterwards. And then naturally, I was maybe naturally critical of them because of the education that I'd gone through. But actually after my PhD, I moved to do a postdoc in the group of Bernhard Schölkopf, here in Tübingen, who was, well, at least back at the time was a learning theorist. Now he's maybe more of a causality researcher. And there I got kind of sort of...

Alex (04:23.258)

Mm-hmm.

Alex (04:35.036)

Mm-hmm.

Philipp (04:40.702)

exposed to the other perspective, to learning theory, statistical learning theory. Maybe that's a better word than frequentist statistics. And that helped me, or it made me reflect a lot on how to best think about what I actually wanted to do, which was maybe also not the typical task of a statistician. So I was always more interested in the computational side of these things, how to do.

Alex (04:49.46)

Mm-hmm.

Alex (05:06.695)

Mm.

Philipp (05:10.722)

how to do inference on a computer in particular with data.

Alex (05:15.471)

Yeah, it seems like you really are interested in this aspect of not mainly on the modeling side, but mainly how do you model, how do you write the algorithms, what are the algorithms about and stuff like that, which is very interesting to me. And we're going to talk a bit about that, but basically not mainly on using the machine.

but understanding what the machine is about, which seems to be what I understood from the work you do, because that's not me: I'm really on the side of modeling and using the machine more than thinking about what's inside the machine.

Philipp (05:50.475)

Yes.

Philipp (06:00.106)

Yes. So, the group I now lead here in Tübingen is called the methods of machine learning group, and I always interpreted this word methods as the algorithms that run inside of the machine. So when you're thinking of a learning machine or an inference machine, then you're right, of course, that if you look up sort of a textbook definition of what machine learning is, or maybe what computational statistics is, then it's...

A computer program that refines a model through data or that uses data to refine a model. But both the model and the data are things that come from outside. Right? The model is provided by the human programmer or designer and the data comes from somehow the real world in various forms, digital or otherwise, and it's somehow stored on disk. But the thing that actually happens in the learning machine when it learns is a computational task. It's the solution of what I call a numerical problem. So.

Alex (06:31.293)

Mm-hmm.

Alex (06:35.743)

Mm-hmm.

Alex (06:40.582)

Mm-hmm.

Philipp (06:56.982)

the solution of some principally intractable mathematical task, optimization, simulation, large scale linear algebra. So large scale that it's not fully tractable typically. And this is quite different from the tasks that computers were maybe originally invented for, which is to compute very precisely products and sums of numbers and use this sort of facility that they have to

Alex (07:03.443)

Mm-hmm.

Philipp (07:25.798)

solve computable tasks to very specific problems that have a concrete answer that can be found in a finite amount of time. In contemporary computational physics, that's not the case anymore. So we use algorithms that can principally only estimate the answer to their task. And they do that using the finite amount of computational power that they have available. So what

Alex (07:34.759)

Hmm.

Philipp (07:52.726)

we do in the end with these algorithms, or what the algorithms do is that they estimate something that you can't actually know, that you can't observe directly, using something that they can observe or compute directly. And that sounds a lot like statistics, right? Inferring a latent variable from observations. It's just that the observations aren't data that comes from the disk, they are data that comes from the chip that are produced by the computer itself.

Alex (08:19.749)

Yeah.

Philipp (08:21.218)

But other than that, it's the very same setting. And I've always found it intriguing to think about computation from this perspective, because it seems like that should make it, thinking about it this way should make it easier to think about what happens when we apply such computations to data that comes from the real world.

Alex (08:42.263)

Yeah, thanks a lot. That makes it, I think, pretty clear to understand basically the kind of work that you're doing nowadays and also the topics that you're interested in. But that was a bit general, so now maybe we can dive in. So are you interested in how these algorithms work in general, or is there a specific part?

of the algorithms that you focus about in your group or any algorithms in particular, maybe.

Philipp (09:19.202)

So what we do is what we describe as probabilistic numerics. So it's the application of Bayesian ideas to computation. And to just say it out loud once, of course.

Philipp (09:36.678)

What got me interested in this is the experience that when... Maybe I should take a short step back in history. So machine learning is still a very young field. And in its home in computer science, it's still very new. So when the people working in this field encountered that they had to solve numerical problems to make their methods work, everything from...

the classic statistical tools like least-squares estimation, logistic regression, all the way to contemporary deep learning, they invariably discovered that there are already algorithms out there in some toolboxes. Maybe early on, they came in some Fortran packages, and then they were maybe available in Matlab or in R. And now these days, people have Python libraries, SciPy, and so on. These methods already existed. They somehow were just sort of.

Alex (10:14.623)

Mm-hmm.

Philipp (10:32.846)

primordial, they've been built by someone else at some other time for seemingly the same or related task. So there are methods for solving least squares problems in linear algebra, there are methods for solving differential equations or initial value problems, boundary value problems, to be more precise. There are methods for solving optimization problems. And that makes it very tempting to just use these tools because it saves a lot of time, obviously, but you can just start working. But then when you start using these tools,

Alex (10:34.479)

Yeah.

Alex (10:49.811)

Mm-hmm.

Philipp (11:02.114)

these algorithms, you sort of find over time that they don't exactly solve the problem that you're really trying to solve. Because what we're doing with computers has really changed over the last, I don't know, maybe two decades or so. From a setting where computers used to be used to solve very specific tasks. For example, in physics, you might have the Schrodinger equation and you need it.

Alex (11:07.956)

Mm-hmm.

Philipp (11:31.672)

You can write down a boundary value problem and you're trying to solve, I don't know,

find a numerical solution to the helium atom setup, or you have Maxwell's equations, you want to find a simulation for electromagnetic field in some non-trivial geometry. So you know exactly what you're trying to do. But in the last 20 odd years or so, computational statistics and machine learning and AI have made data a central part of the computation. So we now...

Alex (12:06.111)

Mm-hmm.

Philipp (12:07.318)

have grown used to on the one hand, writing down computations that fundamentally are not fully specified. So we're estimating something through the computation that doesn't have a precise, unique answer. It's an ill-posed problem, maybe. And on the other hand, so this fundamentally means we need computations that need to be able to deal with the fact that there is no unique answer to their problem. But at the same time, we also have on the other end, sometimes so much data.

We talk about big data all the time that we are maybe not even able to process it all. And then we decide to only load parts of the data or to process them in some kind of interesting iterative fashion through batching and data loading. And that introduces imprecision in the computation. So people are now used to computing stochastic gradients for their optimization problems, for example. So stochastic, in fact, that the stochasticity dominates over the signal.

Alex (13:00.187)

Yeah.

Philipp (13:07.074)

So the standard deviation of the signal is larger than the mean. And that has meant that computation has become fundamentally very imprecise, very uncertain. And the numerical algorithms that we inherit from our forebears are not designed to deal with this setting. They are not stable to this kind of noise. For example, the very beautiful optimization algorithms that were invented by the optimization community, by operations researchers in the 70s, like

Alex (13:29.788)

Hmph.

Philipp (13:37.022)

the BFGS method, for example, they are not particularly stable to this kind of noise. And that has meant that people have sort of begrudgingly realized that you can't necessarily train a big neural network with BFGS. So here we are again with SGD and we are paying a price for it. So people are now having to spend a lot of time even at the most, even at the richest companies in the world, trying to tune single algorithmic parameters that were almost forgotten not so long ago.

Alex (13:52.584)

Mm-hmm.

Philipp (14:06.354)

like learning rates and step sizes.
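To make the "noise dominates the signal" point concrete, here is a tiny synthetic sketch (not from the episode; the data, batch size, and least-squares model are all made up): at a minimum of the full-batch loss, the full-batch gradient is essentially zero, while individual mini-batch gradients are still large, which is exactly the regime in which methods like BFGS become unstable and step sizes start to matter again.

```python
# Hedged illustration (synthetic data, arbitrary sizes): mini-batch gradient
# noise versus the full-batch gradient "signal" on a least-squares problem.
import numpy as np

rng = np.random.default_rng(0)
n, d, batch = 100_000, 10, 32

X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 2.0 * rng.normal(size=n)    # noisy targets
w = np.linalg.lstsq(X, y, rcond=None)[0]                 # iterate at the full-batch optimum

def grad(idx):
    """Gradient of 0.5 * mean squared error restricted to the rows in idx."""
    return X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)

full = grad(np.arange(n))                                # full-batch gradient
minis = np.stack([grad(rng.choice(n, batch, replace=False)) for _ in range(500)])

print("signal: norm of the full-batch gradient   =", np.linalg.norm(full))
print("noise : typical norm of a mini-batch grad =", np.linalg.norm(minis, axis=1).mean())
```

The second number comes out orders of magnitude larger than the first, which is the "standard deviation larger than the mean" situation described above.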

Philipp (14:13.006)

I-

Early on thought that this was a weird situation and it felt like something ripe for a new way of approaching how we build algorithms. And for me, the answer over the years that I've come up with, or not just me, but several people I've worked with and sometimes they also just worked out there without me, has been to realize that, and sometimes also long before us, to make this realization that I already described that you can actually think about the computation itself as a form of Bayesian inference.

Alex (14:25.061)

Yeah.

Philipp (14:46.146)

So, to make this mathematically maybe a bit more precise: one of the earliest things we realized, and also that was a phase in the research early on, around like 2012, 2013, was that we took a look at existing numerical algorithms. Maybe like the textbook cases, the most prominent algorithms that people who take a numerics class learn about. Things like...

Alex (15:06.371)

Mm-hmm. Yeah, yeah.

Philipp (15:14.026)

It was linear solvers like the Cholesky decomposition and conjugate gradients. Things like classic ODE solvers, solvers for ordinary differential equations, like the Runge-Kutta methods. And things like classic optimization methods, like BFGS, which I just mentioned, or nonlinear conjugate gradients, or there's a whole family of these quasi-Newton methods called DFP and SR1 and Barzilai-Borwein and so on. And we realized that all of these methods are also in the textbooks motivated

Alex (15:18.045)

No.

Philipp (15:43.434)

as minimizers of a regularized empirical risk. So they typically minimize an L2 risk, regularized by another L2 term. So this is a story that statisticians know all too well, and we know that there is a probabilistic interpretation of this framework that involves treating this regularized empirical risk as a negative log posterior, described by summing a negative log prior and a negative log likelihood, which are both...

Alex (15:49.319)

Mm-hmm.

Alex (15:54.28)

Mm-hmm.

Philipp (16:12.834)

Gaussian. So the prior and the likelihood are both Gaussian. So it turns out that one can think about these methods as computing a MAP estimate. Actually, pretty much all of the ones that I just mentioned. And we wrote like a whole series of papers saying, you know, you can think about BFGS as a Gaussian regression algorithm on the Hessian of the loss function, actually on the inverse Hessian. You can think of Runge-Kutta methods as a particular choice of Gaussian inference on the solution of the differential equation and so on and so on.

But it turns out that these priors and likelihoods that are sort of hidden in these algorithms, they tend to be quite special. And it has to do with the fact that they are computational methods. So the priors tend to be very general, very broad. That makes sense because you're trying to build a tool that works on a large class of problems. Someone who invents a numerical algorithm wants it to work for pretty much everyone. So you have to use a broad prior, otherwise it wouldn't work. Well, it would work on some problems.

On the other hand, the likelihoods tend to be very precise. They are actually often Dirac measures, which encodes the assumption that the computer just computes the thing it's supposed to compute. So you tell it to compute a number and it just computes that number. So the right thing to do is just to condition on that number having that value. So it's a precise conditioning rather than an actual likelihood that is a probability measure, a non-trivial one. And that sort of

Alex (17:29.799)

Yeah.
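For readers who want the correspondence being described here written out, this is the textbook identity for a generic linear-Gaussian model (a toy version; the specific priors and likelihoods hidden inside BFGS or Runge-Kutta are the ones worked out in the papers Philipp mentions): minimizing an L2 risk with an L2 regularizer is the same as maximizing a Gaussian posterior, so the minimizer is the MAP estimate, which in this conjugate case is also the posterior mean.

```latex
% Gaussian likelihood  y | x ~ N(Ax, sigma^2 I)  and Gaussian prior  x ~ N(x_0, lambda^2 I)
\underbrace{\tfrac{1}{2\sigma^{2}}\lVert y - Ax\rVert_2^{2}}_{-\log p(y \mid x)}
\;+\;
\underbrace{\tfrac{1}{2\lambda^{2}}\lVert x - x_0\rVert_2^{2}}_{-\log p(x)}
\;=\; -\log p(x \mid y) + \text{const},
\qquad
\hat{x}_{\mathrm{MAP}}
  = x_0 + \Bigl(\tfrac{1}{\sigma^{2}} A^{\top}A + \tfrac{1}{\lambda^{2}} I\Bigr)^{-1}
          \tfrac{1}{\sigma^{2}} A^{\top}\bigl(y - A x_0\bigr).
```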

Philipp (17:38.998)

So first of all, let's understand how these methods work. Well, it turns out there are particular choices of Gaussian priors and likelihoods, and there's often very smart choices in there, you know, how to do things so that they're actually computationally efficient and fast and can be realized. There's a lot of, a lot of thought has gone into these methods over the decades to make them work really well and also to implement them very well. But now that we have them, we can take them as starting points and say, well, contemporary AI, machine learning, computational statistics is a bit different.

Alex (17:55.261)

Mm-hmm.

Philipp (18:08.49)

because we have finite data in our computation, so the computations are imprecise. If we mini-batch the data during the computation, we get a stochastic gradient, which is associated with a non-trivial likelihood. Maybe we can include that in our formulation. If we do simulations, maybe the simulation is based on imperfect knowledge of the differential equation that drives the simulation. How can we include that in our...

prior or likelihood somewhere in this computation. So this gave us a phase of trying to build new types of algorithms that are based on generalizing the existing numerical algorithms. Always with the goal, at least for me, I always felt it was important to try to do so without raising computational complexity too much. But this is something that I realized early on, that if we try to just

sort of write down what we would ideally like the computer to do to deal with all sorts of uncertainty and imprecisions, we usually end up with algorithms that are way more expensive. They might not even be tractable, but even if they are, they tend to be more expensive. And people tend not to like that, right? Because they are used to the methods that they have that do something. So we're competing, it's like a market we're actually competing in. So we have to try and build algorithms that are a little bit more powerful, that add more interesting functionality without being more expensive. And so we often stay quite close to the classic methods. But add...

Alex (19:14.496)

Uh huh.

Yeah, yeah.

Alex (19:22.376)

Mm-hmm.

Alex (19:31.827)

Mm-hmm.

Philipp (19:35.394)

that carefully add very specific kind of functionality to them, to make them fit better to the kind of settings we have in machine learning today.

Alex (19:46.271)

Hmm. Okay. Yeah, that's fascinating. Who knew that, yeah, like diving into the machine itself would be kind of like another smaller world than the one I live in myself when I'm modeling. So actually, you already mentioned probabilistic numerics, which is one of the building blocks of your work. And it's the first time we talk about that on the podcast. So can you

define the concept of probabilistic numerics for listeners?

Philipp (20:20.142)

So for me, the picture is that a probabilistic numerical method is an algorithm that fundamentally describes the solution of a numerical task as computing a posterior. That involves the usual steps of Bayesian modeling. So we assign, we build a generative model, a prior and a likelihood for the quantity we're trying to compute and its relationship to the quantities that we can compute. For example, for...

Alex (20:36.434)

Mm-hmm.

Alex (20:47.356)

Mm-hmm.

Philipp (20:49.626)

the solution of an initial value problem involving an ordinary differential equation, we would write down a prior distribution for what the solution of the differential equation might look like. That's a curve we might describe with a Gauss-Markov process, so a Markov chain with a linear time varying system. And then we describe an observation model that says we can condition at various points in time on

Alex (21:01.555)

Mm-hmm.

Philipp (21:17.898)

the fact that the differential equation holds. We call this an information operator, so it's a special kind of likelihood that conditions on the fact that the difference between the time derivative of this curve and the evaluation of some non-linear function at that curve is zero. That gives rise to a posterior distribution, which in this case might be a Gaussian process posterior that can be computed very efficiently using a Kalman filter or actually an approximate

Alex (21:20.511)

Mm-hmm.

Philipp (21:45.014)

variant of it that can deal with non-linearity, like an extended Kalman filter, to produce a posterior distribution over the output with the goal of that posterior distribution having good properties. Just like in numerical analysis, it's typical that one then analyzes the method. We have to do a similar task. And now we have two objects to talk about. We have the point estimate that the algorithm returns, the posterior mean of the Gaussian process. That's the analog to what a classic numerical method returns, an estimate.

Alex (22:14.128)

Mm-hmm.

Philipp (22:14.726)

We would like that estimate to have good properties. For example, it should converge quickly towards the true solution if we invest more computational resources. For differential equation solvers, this typically means a high polynomial rate of convergence. But we have this other thing as well, which, for a Gaussian process, is the posterior covariance. And we would like to be able to interpret this object as a notion of uncertainty that can somehow be analytically

Alex (22:23.004)

Yeah.

Alex (22:39.231)

Mm-hmm.

Philipp (22:44.438)

described or shown to have good properties. So of course, what we can't hope for is that this posterior uncertainty, the standard deviation is exactly the true error that the method makes. Because if that were the case, we could just subtract it from the posterior mean and then we would have the perfect solution. If that were possible, we wouldn't need to use a computational method in the first place. So instead, we would just hope for the uncertainty to be a meaningful characterization of the actual error. So that's exactly the setting you

Alex (23:05.423)

Yeah, that'd be cool.

Philipp (23:14.602)

know from statistics that you'd like to say something about the calibration of the uncertainty that arises from Bayesian inference. And in classic, let's call it epistemic, no, let's call it empirical statistics or physical statistics of dealing with the real world. At that point, it becomes very difficult because to do an analysis, you tend to have to assume somehow that the model is in some way sense correct. For example, if you do Gaussian process regression and you want to say something about

Alex (23:40.069)

Mm-hmm.

Philipp (23:42.694)

calibration of the posterior, you need to assume that the true function is either a sample from the correct underlying Gaussian process, or maybe it's an element of the reproducing kernel Hilbert space or something like this. When the data you're dealing with comes from the real world, it's very difficult to make these assumptions. But in computation, we have an advantage in that we actually know what we're doing. We have a computational task that we have written down in a formal language. In a

programming language and say this is the task you're trying to solve. So we can analyze it and actually check whether the algorithm has a good convergence property. So for example, we can show that if the underlying function that drives the differential equation is of a sufficient smoothness, then the uncertainty estimate that this algorithm that I just briefly described returns

is some kind of worst case bound on the actual numerical error, maybe up to a constant. And then that constant can be inferred with an empirical Bayesian toolkit. So long story short, probabilistic numerics, on the conceptual, philosophical level, is about redescribing everything that a computer does as Bayesian inference.
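To give a concrete, heavily compressed picture of the construction described over the last few turns (a Gauss-Markov prior on the ODE solution, conditioned through an information operator with an extended Kalman filter), here is a minimal sketch of a probabilistic ODE solver in plain NumPy. It is not ProbNum's implementation; it runs only the forward filter (no smoother, no calibration of the prior scale), and the logistic test problem, step size, and prior order are arbitrary choices for illustration.

```python
import numpy as np

def f(y):
    """Illustrative scalar vector field: logistic growth, y' = y (1 - y)."""
    return y * (1.0 - y)

def probabilistic_solve(y0, t0, tmax, h, sigma=1.0):
    # Prior: twice-integrated Wiener process; state x = (y, y', y'').
    A = np.array([[1.0, h, h**2 / 2],
                  [0.0, 1.0, h],
                  [0.0, 0.0, 1.0]])                       # transition matrix
    Q = sigma**2 * np.array([[h**5 / 20, h**4 / 8, h**3 / 6],
                             [h**4 / 8,  h**3 / 3, h**2 / 2],
                             [h**3 / 6,  h**2 / 2, h]])   # process noise
    H = np.array([0.0, 1.0, 0.0])                         # picks y' out of the state

    m = np.array([y0, f(y0), 0.0])                        # initial mean
    P = np.diag([0.0, 0.0, 1.0])                          # only y'' is uncertain at t0

    ts, means, stds = [t0], [m[0]], [0.0]
    t = t0
    while t < tmax - 1e-12:
        # Predict one step ahead under the Gauss-Markov prior.
        m_pred = A @ m
        P_pred = A @ P @ A.T + Q
        # Condition on the ODE holding here:  y' - f(y) = 0,  with f evaluated
        # at the predicted mean (the EK0 linearization) and a Dirac likelihood.
        v = f(m_pred[0]) - H @ m_pred                     # residual of the ODE
        S = H @ P_pred @ H                                # innovation variance
        K = P_pred @ H / S                                # Kalman gain
        m = m_pred + K * v
        P = P_pred - np.outer(K, K) * S
        t += h
        ts.append(t); means.append(m[0]); stds.append(np.sqrt(P[0, 0]))
    return np.array(ts), np.array(means), np.array(stds)

ts, mean, std = probabilistic_solve(y0=0.1, t0=0.0, tmax=8.0, h=0.1)
truth = 0.1 * np.exp(ts) / (1.0 - 0.1 + 0.1 * np.exp(ts))   # analytic logistic solution
print("max |posterior mean - truth| :", np.abs(mean - truth).max())
print("max posterior std            :", std.max())
```

The point estimate (the filtering mean) plays the role of the classic solver's output, while the posterior standard deviation is the extra object described above, whose calibration against the true error is what the analysis then has to establish.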

Alex (24:43.688)

Mm-hmm.

Alex (24:47.728)

Mm-hmm.

Alex (25:00.803)

Mm-hmm.

Philipp (25:01.726)

And many of the things that computers do are so simple that doing this makes little sense, because if the computer can just run for a very short amount of time and return the exact answer, then there's no need for uncertainty. But in AI and machine learning, many computational tasks are so complicated and so fraught with sources of uncertainty, because both the data, which is central to the computation, and the computation itself are finite, and we're dealing with objects that are intrinsically complicated,

that it's very important to keep track of the imprecision of the entire process. So the uncertainty that arises both from the finite amount of data and its relationship to the quantity we're trying to infer and the finite amount of computation that we have invested in trying to find the right answer. And what probabilistic numerical methods then fundamentally bring to the table is the ability to describe both of these sources of uncertainty.

Alex (25:31.874)

Mmm.

Philipp (25:59.438)

finiteness of the data and the finiteness of the computation in the same mathematical language, that of probability measures and posteriors being refined.

Alex (26:07.659)

Okay, okay, so it's because basically not only the data are uncertain, but also the computation becomes uncertain because they take so much time. And like they have a lot of intricacies that then you need to apply the concepts of uncertainties and probability. Not only on top of the data, but on top of the algorithms themselves. Okay, super interesting. And I'm guessing that it's in that way that probabilistic numerics.

differs from traditional numerical methods because they can take into account that uncertainty also in the way computation is done.

Philipp (26:48.342)

Yes. So maybe, now that I've just given a bit of a theoretical mathematical formulation, I can also give a bit of a social formulation. So traditionally, and this is for good historical reasons, the algorithms we use in stats and machine learning and the models we use have been built by different people, and they use different mathematical languages. And so

Alex (26:56.112)

Yeah.

Philipp (27:14.53)

The algorithms come from, let's call them for the sake of argument, numerical analysts, numerical mathematicians, although that's not quite true. So historically they might have come from people who saw themselves as, I don't know, physicists or economists or whatever. And the models come from, you know, a statistician, a computer scientist, a machine learning engineer whatsoever. And what I see as a central part of the idea of probabilistic numerics is to come up with a language.

that describes both of these processes in the same mathematical forms as manipulation of probability measures. So that actually needs convincing on both sides. Not everyone in machine learning always thinks in terms of probability measures and people just like point estimates. And similarly, it takes some convincing for the numerical analysts to say there's this language from statistics and Bayesian statistics that might be really useful to describe what it means to solve a partial differential equation.

That's even harder, of course, to convince people in this way. But I think it's very, well, so as always, of course, science doesn't advance by convincing old people to do things differently, but by convincing young people to just think about things that they learned for the first time. So I think there's a lot to gain practically from teaching people to think about the numerical tasks inside of their inference engine in the same language that they do about the role of the data in their model.

Because it makes everyone who can build a probabilistic or a Bayesian model pretty much a numerical analyst or an algorithm designer. So if you know how to build a Gaussian process regression algorithm, then you have principally the right tools to also build a linear algebra routine or a solver for a differential equation, ordinary or partial, or in fact even maybe an optimization method, although optimization is even more tricky in some sense.
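As a small, hedged illustration of that last claim, here is the "solution-based" Gaussian view of solving A x = b: place a Gaussian prior on the unknown solution and condition on a handful of exact projections of the system, which is the Dirac-likelihood setting described earlier for classic methods. This is a toy sketch, not ProbNum's probabilistic linear solver, and random search directions are an arbitrary choice (classic iterative solvers pick them far more cleverly).

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8
L = rng.normal(size=(d, d))
A = L @ L.T + d * np.eye(d)          # a symmetric positive definite system
b = rng.normal(size=d)

x0 = np.zeros(d)                     # prior mean for the unknown solution x
Sig0 = np.eye(d)                     # prior covariance for x

def condition_on_projections(S):
    """Gaussian posterior over x after observing S^T A x = S^T b exactly."""
    M = S.T @ A                      # each row is one noise-free linear observation
    c = S.T @ b
    G = M @ Sig0 @ M.T               # Gram matrix of the observations
    gain = Sig0 @ M.T @ np.linalg.inv(G)
    mean = x0 + gain @ (c - M @ x0)
    cov = Sig0 - gain @ M @ Sig0
    return mean, cov

x_exact = np.linalg.solve(A, b)
for k in (2, 4, 8):                  # more projections -> less remaining uncertainty
    S = rng.normal(size=(d, k))      # k random "search directions"
    mean, cov = condition_on_projections(S)
    print(f"k={k}: error of posterior mean = {np.linalg.norm(mean - x_exact):.2e}, "
          f"trace of posterior covariance = {np.trace(cov):.2e}")
```

With d independent directions the posterior collapses onto the exact solution; with fewer, the remaining covariance indicates which parts of the solution space the computation has not yet probed.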

Alex (28:49.151)

Mm-hmm.

Alex (29:13.651)

Okay, yeah, I see. So basically, yeah, the core principle here is integrating uncertainty, which is extremely important. How do you do that concretely? How do you incorporate uncertainty into the decision-making process of these algorithms?

Philipp (29:35.502)

So first of all, at least in the work of my group, the design paradigm that we use is that we phrase pretty much everything in terms of Gaussian distributions. That's not fundamentally the only way to do things, of course. So there are other groups who also, I think, count themselves as part of the probabilistic numerics community and see things differently. So shout out to some of my colleagues, in particular in the UK,

Alex (29:46.087)

Mm-hmm.

Philipp (30:04.066)

who are maybe more interested in trying to calibrate the uncertainty precisely to really nail down where numerical uncertainty arises. But usually when we do that, when we take that road, we tend to pay more in terms of computational cost. So for me, the design principle, and that's mostly because I like working this way, has been to describe everything in terms of Gaussian distributions. Why? Because Gaussian probability distributions map the paradigm of Bayesian inference onto linear algebra.

And linear algebra, manipulating vectors and matrices, multiplying and adding floating point numbers, is something that computers can fundamentally do very well. They can also do it in parallel and using accelerators like GPUs. And so therefore this is a good metaphor to operate in. So the nice thing about Gaussian distributions is that they very cleanly separate the description of the problem into exactly two parts:

a point estimate and an uncertainty. You asked what we do with the uncertainty; it's clear what we do with the point estimate, it's just the thing we believe, it's the best guess. So what do we do with the uncertainty? First of all, how does it arise? Well, it arises from the basic algebra of Gaussian probability distributions. So when you multiply a Gaussian prior with a Gaussian likelihood, you get a Gaussian posterior with a structured posterior covariance matrix. And...

Alex (31:11.996)

Mm-hmm.

Philipp (31:29.85)

Depending on how we write down the likelihood to encode certain kinds of information about the problem, we encounter the structure of the problem in this posterior covariance. If the likelihood isn't actually Gaussian, which it often isn't because numerical problems usually are nonlinear, then we just linearize. And that sounds like a silly, simple thing to do, but it's actually extremely powerful, because in 2023, we have access to powerful linearization libraries called automatic differentiation.

And so they basically can make anything a linear function if you really wanted to. So that gives rise to a Gaussian posterior. And now what do we do with this uncertainty? So maybe the very first thing, the sort of textbook thing to do is, of course, to just visualize it, to just see what kind of remaining numerical computational uncertainty is in your task. And it's sort of the thing that people also do when they first encounter Bayesian methods and statistics, where you just look at the posterior. So

Anyone who's ever done Gaussian process regression likes these plots with, you know, a sausage of uncertainty around the posterior mean and some samples from it that look beautiful. So we can do that as well now with the solution of, you know, ordinary or partial differential equations or other types of problems. And that's nice. But actually, just like in statistics, you usually then quickly find that just looking at the uncertainty is not quite enough, that you somehow have to make use of it. And we start to think about how to make use of uncertainty in computation.
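A hedged sketch of the recipe just described: a Gaussian prior, a nonlinear observation, a Jacobian obtained by automatic differentiation, and then one ordinary linear-Gaussian conditioning step. The observation function and every number below are invented purely for illustration; nothing here is specific to any one numerical task.

```python
import jax
import jax.numpy as jnp

def g(x):
    """Some nonlinear observation of the latent quantity x (made up)."""
    return jnp.array([jnp.sin(x[0]) + x[1] ** 2])

m0 = jnp.array([0.3, -0.2])          # Gaussian prior mean on x
P0 = jnp.eye(2)                      # Gaussian prior covariance on x
R = jnp.array([[0.05]])              # observation noise covariance
z = jnp.array([0.9])                 # the observed value

# Linearize g at the prior mean with autodiff: g(x) ~ g(m0) + J (x - m0).
J = jax.jacfwd(g)(m0)                # 1 x 2 Jacobian

# Standard linear-Gaussian conditioning with the linearized model.
S = J @ P0 @ J.T + R                 # innovation covariance
K = P0 @ J.T @ jnp.linalg.inv(S)     # gain
m1 = m0 + K @ (z - g(m0))            # posterior mean
P1 = P0 - K @ S @ K.T                # posterior covariance

print("posterior mean:", m1)
print("posterior std :", jnp.sqrt(jnp.diag(P1)))
```

This single conditioning step is the building block that the probabilistic solvers discussed in this episode apply repeatedly, with task-specific priors and observation operators.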

Alex (32:26.793)

Mm-hmm.

Alex (32:30.641)

Yeah.

Philipp (32:56.086)

And there's multiple really interesting uses for uncertainty that also provide a main motivation for why probabilistic thinking about computation is interesting. A first one is that uncertainty can act as the guiding lodestar for computation. One interesting aspect of numerical algorithms is that they actually take active decisions. If you think about what an optimization method is, it's a method that keeps deciding

to do certain things, right? It gets an evaluation, it evaluates a gradient, then it decides to take a step somewhere in the parameter space and evaluate a new gradient, and then keep doing that. That's a form of, you know, in machine learning, we would call this exploration or exploitation, depending on what exactly we're trying to do. It's an active process. It's not something that is predetermined before the algorithm starts. It reacts to the data it got to see. So...

Alex (33:48.207)

Yeah.

Philipp (33:50.73)

Numerical algorithms really are active agents. So they need to decide what step to take. And it turns out that often, even for classic methods, you can explain the way that these algorithms work from a perspective of reducing uncertainty, from an information theoretic perspective, of reducing entropy of the posterior distribution, or of maximizing expected drop in uncertainty from the, or sorry, maximizing drop in uncertainty, or maximizing the expected reduction of a residual from the next step.

Alex (34:09.147)

Mm-hmm.

Alex (34:19.336)

Mm-hmm.

Philipp (34:20.654)

So that's a role for uncertainty: deciding where to move next. But there is sort of a more high-level use for it as well. Outside of the algorithm itself, if you have multiple algorithms that interact with each other, that are all quantifying their uncertainty, that maybe also interact with a data source that is also finite and provides a finite amount of information, then uncertainty can guide the flow of the computation itself.

So for example, the simplest thing to do is just to decide when to stop. So one of the common problems that people now have in machine learning is that you don't even know when you're done training with your deep neural network, because of stochasticity, you'll never actually get a zero gradient, right? At some point you just get some kind of white noise of gradients where you just diffuse around a minimum. You'd like to detect when that happens and you can do that actually with a statistical test or.

Alex (35:00.74)

Yeah.

Philipp (35:17.114)

you could decide to go back to a previous computation and say, uh, this, this computation was maybe not as precise as it needed to be to actually allow me to do the subsequent computational step. And maybe I have to refine it. Or the other way around, you could decide to stop a computation early, because you think that what you currently have, the remaining uncertainty is sufficient in some sense. I'm expecting to not get any more information out of the data source, so I don't need to continue my computation.

These are all things that classic numerical methods are not very good at. So, if you've used a numerical algorithm from a library like SciPy, then people are used to these algorithms having optional parameters that define their tolerances. So that, you know, you can tell a solver for a differential equation that you have a relative tolerance of 10 to the minus three and an absolute tolerance of 10 to the minus six or so. And so these are usually default settings. Often they are very precise.

So people run Runge-Kutta methods of order four or five to 10 to the minus 13 precision. But that often means that these algorithms take more computational resources than they actually need, because they are operating on problems that are fundamentally ill-specified and give rise to a lot of posterior uncertainty. And then it might not be necessary to run them to this high precision. So if these methods quantify their own uncertainty and if they know

Alex (36:37.032)

Mmm.

Philipp (36:41.846)

what kind of uncertainty they were given to work with on the task that they are trying to solve to begin with, they can calibrate themselves to only expend as much resources as necessary to basically reach a point where their own computational error does not contribute significantly to the overall error anymore, and then just stop the computation. So this is often a very subtle kind of thing. Maybe I can give you a practical example, one sort of thing that people have started to think of.

Alex (37:09.776)

Yeah.

Philipp (37:12.138)

I think one setting that should be easy to understand for anyone who's lived for longer than five years on this planet is pandemics. So we are all used now to, I mean, we've all begun to forget about them, but we used to have this phase where we all looked at these infection curves a lot. So, you know, you see that case counts go up and down and there was a first and a second and a third wave and people were, you know, the evening news were filled with, at least in Germany, there was a phase when we were often talking about it,

Alex (37:12.22)

Yeah, exactly.

Alex (37:21.299)

Mm-hmm.

Philipp (37:41.186)

waves returning, do we have to have another lockdown or not. So these curves that were shown to the entire public at that point, they are basically time series of data. So you can imagine a one-dimensional curve going up and down with a case count. It's a curve that is positive valued, it's lower bounded by zero, and it's maybe upper bounded by the total size of the population. So if your question is, how does this curve continue on the right? You can think of this in different ways.

You can think of it as a statistical estimation problem. There's just a time series of points you'd like to extrapolate. How do you do that? Well, with any of your favorite statistical tools, logistic regression, a neural network, Gaussian process regression, I don't know, whichever one you choose, all of these will give, at least in the naive textbook setting, they will give pretty bad extrapolation because they don't know anything about the process that you're trying to model. You can tell them that it's a smooth curve or that it's a curve that returns to zero or one that can grow exponentially. But...

None of this is particularly useful. You just get really silly extrapolations. But an advantage of these methods is that they make use of the data. They produce a line that goes through the data, at least if you make them sufficiently flexible. You can do logistic regression and produce a curve that goes through all of the data points, or a generalized linear model maybe. So the other approach is to say, well, I know something about the underlying mechanism. I know something about the causal structure of this data.

And I can describe that causal structure in terms of a differential equation. For example, to use a simple textbook example, an SIR model. So a model that separates the population into three groups, susceptible, infectious, recovered. And the curve that we have on screen maybe is an infection curve. So it's the I in the S I and R. And as a differential equation that describes that people, when they interact with an infected person, with a certain probability, move

into the infected group and then once they're in that group with a certain probability, they move into the recovered group. So that's a nice mechanistic model of the world and you can solve it with a numerical method with a solver for an ordinary differential equation. That's a kind of code that was written by a mathematician, well invented maybe a hundred years ago, implemented in the 80s or so, maybe in Fortran and it's a piece of code that just

Philipp (40:04.702)

and where to start it and for how long to run it. And you call it and it runs wonderfully, very efficiently. It produces a curve. But that curve, notice that I didn't say anything about data. This curve has nothing to do with the data that we just looked at. It's just a curve. So why should we trust it at all? And the reason it doesn't, it also obviously normally doesn't go through the data. Why does it not? Well, because the differential equation tends to miss certain aspects of reality. Like for example, the fact that people don't always meet with the same frequency, but

Alex (40:07.625)

Mm-hmm.

Philipp (40:32.29)

their contact rate goes up and down, depending on whether we have a lockdown or not and how people behave. So you want that parameter to be part of the differential equation, but you don't know what it is, right? It's a latent quantity. So you'd like to do inference on it. So how do we do inference? So now we have two tools, right? On the one hand, we have something that is very data centric, generalized linear model, classic statistical tool, but for which we can't directly say, oh, this thing, this curve is actually the solution of a differential equation. And on the other hand, we have a numerical tool that is very

Alex (40:40.06)

Yeah.

Philipp (41:01.566)

equation-centric, that you give a differential equation to and then it can solve. But what we need is something that can do both, that can deal with the data and the fact that we have mechanistic knowledge, where both of them are finitely precise. We have finite data and we don't know everything about the differential equation in an algebraic sense. There are some terms in it that we don't know. So people in the computational statistics community and the machine learning community of course have ways of dealing with this. This is the typical thing that happens when a practitioner encounters such a problem.

Alex (41:10.961)

Yeah.

Alex (41:19.656)

Yeah, for sure.

Philipp (41:31.41)

You just basically wrap a lot of duct tape around the whole thing. So you take an ODE solver and an Autodiff framework, and you just initialize, you write the contact rate across time as some parameterized model, I don't know, some simple neural network, and then initialize it somehow, make it predict forward through time using the ODE solver. So you call this piece of code that was somehow given to you from a biomathematician. And it produces a curve that doesn't look like the data. So you use Autodiff to compute.

Alex (41:52.454)

Mm-hmm.

Philipp (42:00.654)

residual between the two curves and then do gradient descent. So this is the typical kind of stack of algorithms that sit on top of each other in all of our contemporary machine learning solutions. And then they only work with a lot of babysitting, right? Someone has to sit around them and make sure that, you know, the gradient descent has the right learning rate and that the initialization of this neural network, which is then sometimes called a neural ordinary differential equation, a NODE, is correctly initialized.

And also it's quite wasteful with computation, because to make this thing work, you have to repeatedly call this ODE solver over and over and over again to compute new gradients to follow the gradient. Now, if I tell you that in this ODE solver, what it actually does is it steps forward through time and evaluates a likelihood, a Gaussian likelihood or an approximate Gaussian likelihood that tells it that locally at some points in time this differential equation holds. And then your outside loop

Alex (42:36.464)

Yeah.

Philipp (42:58.922)

steps forward through the data set and locally at each point in time evaluates a likelihood that this curve i is at this point relative to what you've observed, it seems really wasteful to do these two things separately. Doesn't it make more sense to think about the whole process as one path through the data from the left to the right and at various points in continuous time we condition on a, the knowledge that the differential equation holds and b, the knowledge that the curve has to go through this point because we've evaluated it with noise.

Alex (43:11.377)

Yeah, for sure.

Alex (43:16.031)

Mm-hmm.

Philipp (43:29.01)

with a likelihood. And since both of them are approximately, up to linearization, Gaussian likelihoods, there is even an algorithm to do so. It's called a Kalman filter. We put a standard prior on everything we don't know, and we just condition on the things we know. We know that the differential equation holds. The differential equation is an algebraic relationship between all the quantities involved in the model. And we know that the curve i has a particular value.

Alex (43:46.591)

Mm-hmm.

Philipp (43:56.614)

i, the infection count, is one part of the state space that we're trying to simulate. So when we condition on these observations, we might be able to use the information we get from this data to learn about the parts of the model that we don't know, which is the contact rate, how often people actually meet with each other. And then end up with a prediction from this entire algorithm that, first of all, can run much faster,

because it doesn't require an outer for-loop. It just goes forward through the data once, and then it's essentially done. Or if you want to have a consistent output, you also have to run a smoother backward through time once. Fine, that's pretty much for free. And then it's done. And secondly, it captures uncertainty from both sources. It captures uncertainty from the fact that we've only evaluated the ODE at a bunch of points, that there is a part of the ODE we don't know, and that we have finite data. And it turns out this actually works. It's actually an algorithm that one can write.

Alex (44:50.239)

Mm-hmm.

Philipp (44:53.782)

And we had a NeurIPS paper about it in 2021, I think with Jonathan Schmidt and Nico Krämer, to show that one can actually build simulation methods in this way. So when we do this process, it blurs the lines between a numerical algorithm and a statistical algorithm. It's sort of simultaneously both things. It's an algorithm that infers from the empirical data,

Alex (45:17.959)

Mm-hmm.

Philipp (45:22.722)

but it also solves the differential equation. It's just that it treats both of them as imprecise. We are used to treating the data as imprecise. That's typical for statistics, of course, through a likelihood. But we tend not to think of the solution of a numerical task as something imprecise, just because the algorithm we tend to use exerts a lot of computational resources to produce a very precise answer. But maybe it doesn't have to. Maybe it's fine to produce an answer that is just about precise enough that

Alex (45:31.922)

Yeah.

Alex (45:37.755)

Mm-hmm.

Philipp (45:50.43)

its precision is dominated by the lack of precision in the data that we're running on.
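To make the contrast concrete, here is a compressed sketch of the "duct tape" baseline described above: a classical ODE solver called over and over inside an optimization loop to fit an unknown contact rate in an SIR model to noisy case counts. The data are synthetic and every number is made up; it uses standard SciPy calls, not any probabilistic-numerics package.

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import minimize

def sir_rhs(t, x, beta, gamma):
    """Classic SIR right-hand side; x = (S, I, R) as fractions of the population."""
    S, I, R = x
    return [-beta * S * I, beta * S * I - gamma * I, gamma * I]

gamma, beta_true = 0.2, 0.6
state0 = [0.99, 0.01, 0.0]
t_obs = np.linspace(0.0, 60.0, 40)

# Synthetic "case count" data: the I-curve of the true model plus observation noise.
rng = np.random.default_rng(2)
truth = solve_ivp(sir_rhs, (0.0, 60.0), state0, t_eval=t_obs, args=(beta_true, gamma)).y[1]
data = truth + 0.01 * rng.normal(size=truth.shape)

def loss(params):
    """Squared residual between the simulated and the observed infection curve."""
    beta = params[0]
    sol = solve_ivp(sir_rhs, (0.0, 60.0), state0, t_eval=t_obs, args=(beta, gamma))
    return np.sum((sol.y[1] - data) ** 2)

fit = minimize(loss, x0=[0.3], method="Nelder-Mead")   # every evaluation re-runs the ODE solver
print("true beta     :", beta_true)
print("estimated beta:", fit.x[0])
```

Every call to `loss` integrates the differential equation from scratch, which is exactly the repeated, wasteful pattern described above; the filtering formulation instead makes one forward pass (plus one backward smoothing pass) that conditions on the ODE and the noisy observations jointly, and returns uncertainty from both sources.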

Alex (45:57.831)

I see, yeah, yeah. I mean, the cool thing is that, well, if you're already using MCMC algorithms, you do have uncertainties around the estimates of those algorithms. So I'm guessing that for most of the listeners of this podcast, what you're talking about is not that unfamiliar and is a welcome broadening, basically, of that,

of that use on a lot of other algorithms. So that's super cool. Something I'm wondering is, what's the state of these algorithms you're talking about? Like, if people are interested in trying them out, can they already do that? Do you have any open source software packages out there in the world that people can try out?

Philipp (46:53.326)

So this is a really interesting question, in that I've now spoken over the past few sentences about many different numerical problems: linear algebra, optimization, simulation, and so on. And for us, at least for us as a research group, my research group here in Tübingen, but also us as a community of probabilistic numerics across the world, we sort of went through different phases. The very early one was this big philosophical observation, you know, computation is somehow the same as inference. Interesting.

Then there was a phase of, oh, how is this connected to existing numerical algorithms? Like, well, OK, it turns out classic methods can be interpreted in this way, and they're some kind of corner case. And the third step was, well, OK, so what would a better algorithm actually look like? What is the killer application for this idea? And so we came up with some of these algorithms that I described to you just now. There's a few, there's many other examples, but I picked out a few. And then there's the next phase. It's actually not the next, I think we're in the middle of it, which is to build

software solutions that actually provide these algorithms and make them available. And we've been doing this for a few years now. So yes, people can try them out. Um, but there's a reason why I told this whole story about them, which I'm going to go through in a moment. So if you want to have, if you're listening to this podcast and you'd like to have a look at what these algorithms look like as a, let's say, a reference implementation of what a numerical method, a probabilistic numerical method, actually is, then I recommend that you have a look at probnum.org, one word, no hyphen.

probnum.org, or probabilisticnumerics.org. This is a Python package that several PhD students from my group and also from other groups wrote together as an open source package. And it contains reference implementations for linear algebra, Bayesian quadrature, which is the probabilistic version of numerical integration, if you like, not solving differential equations, but solving normal integrals, and for the solution of differential equations, in particular, ordinary differential equations.

Alex (48:29.935)

Nice.

Alex (48:42.216)

Mm-hmm.

Philipp (48:48.694)

And I think that's actually it at the moment in there. There are more libraries. Well, there's some kind of low-level libraries for Kalman filtering, for example, and Gaussian process regression and so on. And so what I should say is that this code is designed deliberately with didactics in mind; it's relatively flexible and general. You can try it out and change things in a multitude of ways to create new algorithms. What it's not designed for

is extreme numerical efficiency. And the interesting challenge in coming up with new numerical algorithms is that we're usually up against algorithms that were built maybe decades ago by people who knew exactly the algorithm they were trying to implement and were just hunting for computational efficiency. So if you run the ODE solvers that are in that package, you should not expect them to run as fast as your, you know, SciPy

Alex (49:20.543)

Mm-hmm.

Philipp (49:47.634)

ODE dopri5 solver, which essentially calls into a Fortran library, because they're not designed to be fast. They're designed to be very flexible and to allow people to change the model in any way they want. Now, there are also implementations that are actually much faster. So if people want to check on GitHub, Nico Krämer, whose GitHub handle is pnkraemer, has a Python implementation in JAX of

probabilistic numerical differential equation solvers, which he calls probdiffeq, which are actually very fast. They are faster than the SciPy implementations of ODE solvers typically, and nearly as fast as the fastest JAX implementations of classic ODE solvers. And Nathanael Bosch, who is currently also finishing a PhD in my research group, has an implementation of these methods in Julia, which he calls ProbNumDiffEq.jl,

Alex (50:44.319)

Mm-hmm.

Philipp (50:46.454)

which are also very fast. They are much faster than the Python implementations of these solvers, not as fast as the fastest Julia implementations of classic ODE solvers, but they are very fast. So people out there who are willing to use SciPy implementations of ODE solvers should be happy with those implementations, because they are much faster than what they are currently using. And this is just the situation for ODE solvers. And this is like a snapshot of one part of the problem.

If you talk about linear algebra methods, the race is even tighter, because of course linear algebra algorithms are extremely optimized. There's this entire BLAS ecosystem of algorithms optimized for particular linear algebra tasks. And there we are only beginning to make inroads. Jonathan Wenger, who has just finished his PhD in my group and is now moving to Columbia University, has contributed a lot to this. And he also wrote initial implementations that you can find in ProbNum.
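
As a rough illustration of what a probabilistic linear solver does, here is a small NumPy sketch of one simple variant of the idea: place a Gaussian prior on the solution of A x = b and condition on a handful of projections of the form s^T A x = s^T b, so the posterior covariance reports how much of the solution has not yet been resolved. This is only a sketch with random search directions, not the ProbNum or GPyTorch machinery just mentioned, which chooses its actions far more cleverly.

```python
# A small sketch of a "solution-based" probabilistic linear solver in NumPy:
# put a Gaussian prior on the solution x of A x = b and condition on a few
# projections s^T A x = s^T b. Illustrative only -- not the ProbNum or
# GPyTorch implementations, and real solvers pick their actions adaptively.
import numpy as np

def prob_linsolve(A, b, num_actions, rng):
    n = len(b)
    m0, Sigma0 = np.zeros(n), np.eye(n)          # prior: x ~ N(m0, Sigma0)
    S = rng.standard_normal((n, num_actions))    # random search directions s_1..s_k
    V = A.T @ S                                  # observations are y = V^T x = S^T b
    G = V.T @ Sigma0 @ V                         # Gram matrix of the observations
    resid = S.T @ b - V.T @ m0
    mean = m0 + Sigma0 @ V @ np.linalg.solve(G, resid)
    cov = Sigma0 - Sigma0 @ V @ np.linalg.solve(G, V.T @ Sigma0)
    return mean, cov

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n = 50
    M = rng.standard_normal((n, n))
    A = M @ M.T + n * np.eye(n)                  # symmetric positive definite system
    x_true = rng.standard_normal(n)
    b = A @ x_true
    for k in (5, 25, 50):                        # more actions => tighter posterior
        mean, cov = prob_linsolve(A, b, k, rng)
        err = np.linalg.norm(mean - x_true)
        spread = np.sqrt(max(np.trace(cov), 0.0))
        print(f"{k:2d} actions: error {err:8.2e}, posterior spread {spread:8.2e}")
```

With all n actions the posterior mean coincides with the exact solution (up to floating point), and with fewer actions the remaining posterior spread quantifies what the solver has not yet computed.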

Those probabilistic linear algebra routines are now increasingly becoming part of toolboxes like GPyTorch and so on, where in some cases they have actually become the default mode for certain types of inference. So yeah, we're beginning to reach the point where you see these algorithms in the wild and in real-world implementations. There is a sort of design philosophy trade-off

between building algorithms that are very flexible and can be used in research to try out lots of different things, which then often necessarily means that they are not very fast, and building super-optimized implementations that can actually compete in runtime with existing methods, but are then typically very limited in what specifically they can do. They might still add functionality on top of what the classic method can do, but maybe only in a specific way, or are only possible to use in a particular, specific setting.

Alex (52:13.629)

Yeah.

Philipp (52:36.722)

And sometimes it's not even a good idea for us to do that, because our goal is still research and trying to find new functionality, and that puts a limit on how much we want to optimize our code base. So, long story short: if you're interested, have a look at either my website or the ones that I just mentioned, and you will find lots of links to different software packages. You can also have a look at probabilisticnumerics.org, which is our community web page,

Alex (52:49.662)

Yep.

Philipp (53:05.442)

which has a subpage on software where people (not just us, but other people too) list their software packages.

Alex (53:12.591)

Yes. So yeah, thanks so much. That's super useful. Love that. And also, first, I love how, you know, this is an intersection of, sure, kind of fundamental research, but with a very applied side, where you have all these software packages. And of course, I put all the links in the show notes while you were talking, Philipp. So, for people interested,

you will have in the show notes the link to the ProbNum Python package, the version in JAX that is more efficient, probdiffeq, and also the version in Julia, ProbNumDiffEq.jl. You have everything in there. And you also have the page with all the probabilistic numerics research that Philipp just mentioned, probabilisticnumerics.org. So,

yeah, if you folks are interested in that, just go ahead and check that out, and maybe send GitHub issues or, even better, GitHub pull requests to those people, because I'm sure they will appreciate it. On my end, I already shared all that awesome work with my fellow nerd colleagues at PyMC Labs, and I'm pretty sure they will be happy to check that out.

Especially Adrian Seyboldt, someone who has worked and still works a lot on ODEs; he himself has worked and is still working on a package to do ODEs. So I'm pretty sure he'll appreciate your efforts. And I will link to at least two episodes of this podcast where...

we talked about ODEs. One is from the very beginning of the podcast, with Dimitri, whose last name I'm forgetting right now. But it was with Dimitri, and I will put that into the show notes. We went into ODEs there; he's the one who developed the ODE subpackage of PyMC. And you will see how difficult

Alex (55:33.831)

ODEs are and how difficult they are to actually solve computationally. And also the episode with Adrian Seyboldt, as just mentioned, because Adrian is doing so many things, and one of them is working on ODEs. So I will link to that episode and also to his package in the show notes, which works really well with PyMC, because Adrian is also a PyMC developer. So most of his projects

marry really well with PyMC, usually. Damn, that's super cool, Philipp. Thanks a lot for this. I'm really excited to take a look at all these new things. So, we're getting short on time and I want to make sure we have time for some of my other questions. So let me think. Yes, something I wanted to ask you about is basically the frontiers

of your work and of your field in probabilistic numerics right now. What are the main current challenges faced by researchers in your field?

Philipp (56:43.674)

So I'll try and keep it short. There are a lot of things to do. And maybe the most important thing I should say is that we are very much inviting anyone who would like to contribute to this field to join. And by joining, I mean: just start writing papers about it. And of course also contact us, and contact the various people whose names you can find on these publications and on these websites that we just spoke about. For me personally, if you ask me what I'm most interested in working on (though that doesn't have to be what everyone should work on), it is, on the one hand,

advancing simulation methods to much more challenging problems. So in my group, more and more people are now working on complicated, nonlinear partial differential equations and how to include sources of information about them from different directions, some form of probabilistic data assimilation in simulation. And the other main topic that a significant part of my group and I also work on is the question of deep learning, the algorithmic question of deep learning. I think

I don't have to tell anyone that it's an important modeling domain. Much of AI now runs on deep learning, but the algorithmic side of deep learning is actually way behind the development of the models. The people who train even the largest, most complicated deep neural networks at the moment on the planet have very little understanding about how and why the training process proceeds in the way it does, whether it's wasteful or useful.

how to best tune the parameters of the algorithm in particular. It's not for lack of wanting to know; it's because we just don't really have a good mathematical tool set and algorithmic tool set for deep learning yet, at least not compared to how powerful the modeling languages are these days. So I believe there have been really important, exciting developments more recently, with theoretical tools like neural tangent kernels and the resurgence of Laplace approximations.

We now have probabilistic handles on deep learning models. And with the ideas coming from the software engineering direction, array-centric programming and differentiable programming, combined in languages like JAX and Dex and the things that come after them, we have really interesting new directions to build much more manageable, much more controllable, useful deep learning architectures. And those two are what I'm most excited about:

Alex (59:05.726)

Mm-hmm.

Philipp (59:07.874)

partial differential equations and deep learning.

Alex (59:11.863)

Hmm. Yeah, that sounds like a fun thing to do, for sure. And basically, about the book that you co-authored last year at Cambridge University Press (the book is about probabilistic numerics), I'm wondering how you see the significance of this first textbook

on probabilistic numerics in this context, and basically how it came to be, who it is for, and so on.

Alex (59:54.643)

Did I lose you? Oh, here.

Philipp (59:55.906)

Yeah, I'm back. I think we very briefly cut out, but I think I know what you asked. The textbook is a snapshot, of course, of our research. But at the same time, it's also the very first time that we got to write the whole story in one place. So I felt that it was an opportunity for us to, for once, explain to people who are coming into the field all of the opportunities and ideas, and also the challenges, in one place,

Alex (01:00:00.019)

Okay

Alex (01:00:13.831)

Mm-hmm.

Philipp (01:00:25.118)

and develop one joint view on the entire field. The textbook, by the way, is of course available for free as a PDF online. You're also very welcome to order it from Cambridge University Press or your favorite online bookstore or real-world bookstore. I'm sure that in a few years' time, our knowledge about these methods will have evolved to a point where we might need to

reconsider some of the presentation in this book. But at a point where the majority of people come from a classic numerical perspective on algorithms, it was very important to have one opportunity to write down, once, the path from the classic methods to how to think about them from a probabilistic viewpoint. So maybe if everyone now goes and reads that book, in a few years' time we can read another one where we don't even have to explain how a classic numerical method works and can instead just start directly from the Bayesian perspective.

Wouldn't that be nice for all of Bayesian statistics, if we didn't have to make the connection to classic statistics and could just say: here's the clean probabilistic formulation, the measure-theoretic formulation of the world?

Alex (01:01:30.527)

I see. Yeah, yeah, yeah. So for sure, again, this is in the show notes, folks. So if you're interested, definitely check out Philipp's book and get started on these new probabilistic numerics algorithms. Yeah, that's so exciting. I love that. And actually, kind of to start

winding down the episode, I'd like to open that up a bit more, and get a bit more philosophical if you want. Something I find interesting is that your research group kind of conceptualizes these algorithms as intelligent agents themselves, which to me challenges a bit the traditional view of numerical computing, right, where it's more of a binary

and kind of, yeah, dumb agent. But that means that maybe this leads us to rethink the concepts of rationality and decision-making for these very agents. So can you, yeah, delve a bit into that to start closing up the show?

Philipp (01:02:52.558)

Okay, so I've already mentioned that, yeah, clearly numerical algorithms are active in the sense that they change their behavior in response to the numbers that they compute. This is true for optimization methods, for simulation methods, and for pretty much any nonlinear numerical algorithm, including even conjugate gradients for linear algebra. So they clearly are active in some sense. Now, you're already raising the right point that these things have limited computational capacity, because

these are the algorithms that run inside of other algorithms, inside of learning machines. So they can't be arbitrarily powerful, or, sorry, their resource requirements can't be arbitrarily high. That puts a bound on what kind of decisions they can actually make. Usually it means we have to describe the posterior in terms of Gaussian distributions and use covariances to guide their decisions. If people want to read the book, there's actually a long discussion in the linear algebra chapter

about all the complicated constraints that this puts on us. For example, it means that certain kinds of prior information can be used, like the fact that a matrix is symmetric, but other kinds of prior information are not easy to use, like knowledge that the matrix is positive definite, because that's a nonlinear kind of constraint on the problem domain. There is an entire field called information-based complexity (I'm not really involved in this community, but I know that it has been studied by mathematicians for quite some time),

Alex (01:04:04.137)

Mm-hmm.

Philipp (01:04:17.454)

a field that tries to understand what kinds of bounds on rationality can be used to improve computational performance. And there are some really interesting, but deep and subtle, mathematical results on how information can be used, how knowledge can be used, to improve computation, and on how adaptive methods can actually be to the numbers that they get. But this is clearly a domain that is not yet completely covered, and where there's still a lot to be gained and understood. In particular,

in the applications of optimization for deep learning.
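
Philipp's point above about which priors are usable has a compact illustration: symmetry of a matrix is a set of linear constraints, so a Gaussian prior over the entries can be conditioned on it exactly, whereas positive definiteness is a nonlinear (inequality) constraint with no analogous exact update. Here is a toy NumPy sketch of the linear case; it is our own illustration, not code from the book.

```python
# Toy illustration: symmetry is a *linear* constraint, so a Gaussian prior over
# the entries of a matrix can be conditioned on it exactly. Positive
# definiteness is a nonlinear (inequality) constraint, and there is no
# analogous exact Gaussian update. Not code from the book -- just a sketch.
import numpy as np

def sample_symmetric_from_gaussian_prior(n, rng):
    d = n * n
    # Constraint rows encode A_ij - A_ji = 0 for all i < j, i.e. C vec(A) = 0.
    rows = []
    for i in range(n):
        for j in range(i + 1, n):
            c = np.zeros(d)
            c[i * n + j], c[j * n + i] = 1.0, -1.0
            rows.append(c)
    C = np.stack(rows)
    a = rng.standard_normal(d)                     # draw from the prior vec(A) ~ N(0, I)
    # Condition on the noiseless linear observation C vec(A) = 0 (Matheron's rule):
    a_post = a - C.T @ np.linalg.solve(C @ C.T, C @ a)
    return a_post.reshape(n, n)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    A = sample_symmetric_from_gaussian_prior(4, rng)
    print(np.allclose(A, A.T))                     # True: the posterior enforces symmetry
```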

Alex (01:04:50.444)

Yeah, yeah, well, yeah, I find that super interesting, fascinating and also, yeah, kind of forces you to rethink some of your long held assumptions. So I really love that. Yeah.

Philipp (01:05:06.254)

Maybe to add one sentence: it's also not just the decision for the algorithm about where to move in parameter space, but also which data to load. So in contemporary AI and machine learning, the data loader is increasingly a central part of the algorithmic solution. You need to decide which numbers to actually load from disk, and many training algorithms are I/O-bound by how quickly, and which kind of, data they load from disk. So if an algorithm knows what information it needs, it can actively decide to look at certain parts of the data set.
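
A caricature of that last point, an algorithm deciding which data to look at, might look like the following sketch: score a pool of examples with the current model and only "load" the most informative ones for the next update. This is purely illustrative; the scoring rule, the linear model, and the sizes are all made up for this example, and it is not an actual training-pipeline data loader.

```python
# A caricature of a "deciding which data to load" training loop in NumPy:
# score a candidate pool with the current model and only use the most
# informative examples in the next step. Purely illustrative.
import numpy as np

def select_batch(w, X_pool, y_pool, batch_size):
    # Squared residual as a cheap stand-in for "how informative is this point";
    # a real system would use predictive uncertainty or per-example losses,
    # and would not rescore the full pool on every step.
    scores = (X_pool @ w - y_pool) ** 2
    return np.argsort(scores)[-batch_size:]        # indices the loader should fetch

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d = 10_000, 5
    X = rng.standard_normal((n, d))
    w_true = rng.standard_normal(d)
    y = X @ w_true + 0.1 * rng.standard_normal(n)
    w = np.zeros(d)
    for _ in range(200):
        idx = select_batch(w, X, y, batch_size=64)  # "actively" chosen minibatch
        grad = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
        w -= 0.05 * grad
    print(f"parameter error after active training: {np.linalg.norm(w - w_true):.3f}")
```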

Alex (01:05:12.508)

Mm-hmm.

Alex (01:05:27.423)

Mm-hmm.

Yeah.

Philipp (01:05:36.458)

And we're only beginning to answer this question now as a field.

Alex (01:05:40.555)

Okay, I see. I see. Yeah. Oh, actually, that's a good segue to one of my last questions, which is basically: looking ahead, what are the most exciting advancements in probabilistic numerics to you? And in particular, what would you like to see, and what would you like to not see?

Philipp (01:05:58.114)

So in terms of this little story arc that I opened up before, on what we've done in the past and what we do now, I think the high-level point for where I personally think we are now as a field is that, on the one hand, we need to drastically professionalize the kind of software stack that we're building. We need to start competing with the existing numerical toolboxes out there. And we're beginning to do that. And the other main step, sort of

the next technology readiness level to reach, is to work more concretely on real, hard applications. These are typically scientific applications of statistics and machine learning and AI, because these tend to come with combinations of different forms of information and hard computational problems to combine. So that's the high-level view. Then there is a very hard-nosed question of software engineering. One of the skill sets that

we don't have enough of in our community is people who really know how to build good, high-quality, robust software and make it fast. So if there are people out there listening who think of themselves as very good algorithmic coders, people at the lower end of the stack, not the application domain at the top but the engine room of machine learning at the bottom: I hope that you're going to have a look at the book and the videos and the papers and try to see how you can contribute. This is something that I think is a real challenge for

this community, because we're up against decades of optimized code, and breaking those paradigms is something that takes time, just like changing programming paradigms in other parts of computer science takes time.

Alex (01:07:40.431)

Yeah, for sure. For sure. Yeah, so if people hear your call and want to get in touch with Philipp, all the info is in the show notes. And well, you came to the right podcast, Philipp, because we do have a good part of the audience who fits the characteristics you just mentioned. So feel free to...

Philipp (01:08:06.827)

I'd be very happy to see that.

Alex (01:08:08.691)

Feel free to get in touch with Philipp, or with me and then I'll put you in touch with Philipp. Before we close up the show, is there any topic I didn't ask you about that you'd like to mention?

Philipp (01:08:21.997)

So you already kindly gave me the opportunity to plug the book, so people should have a look. No, I think actually we've probably covered pretty much everything that I was keen to talk about as well. And you might have noticed that I was keen to talk about it, but yeah. No, I think I'm happy.

Alex (01:08:40.503)

Okay, that's super cool. I mean, I had a lot of fun, and I think the show notes... Yes.

Philipp (01:08:44.686)

Ah, maybe I can mention one thing. Ah, no, I can't, sorry, I can't remember. But very quickly: there are people out there who, if you're listening to a podcast, maybe don't want to read a whole big textbook. One thing you can do is go to our YouTube channel, called Tübingen Machine Learning. You can just type it into the YouTube search bar and you'll find a collection of videos from myself, but also from many of my colleagues here in Tübingen, teaching all sorts of things related to machine learning and computer science. But there is a whole playlist

of a course called Numerics of Machine Learning that my research group, my PhD students and I, taught last term, which gives a nice overview of the state of the art of these algorithms in video form. There are also collected videos from a recent probabilistic numerics spring school, where you can hear lots of other colleagues from elsewhere in the world who have very different views on this field and what it's supposed to mean, and watch their talks and presentations and keynotes,

to maybe get a more rounded picture of what this field is about, and not just hear my opinions.

Alex (01:09:49.371)

Yeah, and I second that: I used the YouTube channel to prepare for the episode, so I definitely recommend it. Of course, it's already in the show notes, and Philipp already added the introductory course to probabilistic numerics that he mentioned. So, yeah, definitely check out the show notes for these episodes. They are extremely thorough, and I think it reflects

the quality of this episode. So thanks a lot, Philipp, for taking the time. But as usual, before letting you go, I have to ask you the last two questions I ask every guest at the end of the show. So, if you had unlimited time and resources, which problem would you try to solve?

Philipp (01:10:35.246)

So, if I were the head of, I don't know, a massive industrial research lab (and what I'm going to say is maybe among the good reasons why I'm not in this kind of position), I would try to get a group of gifted people together who combine knowledge in Bayesian thinking, algorithmic knowledge, and software engineering, and have them try to come up with a clean new information-centric programming paradigm that combines the ideas from...

Alex (01:10:46.055)

Heh!

Philipp (01:11:01.89)

probabilistic numerics of course, but also probabilistic programming, array-centric programming, automatic differentiation, differentiable programming, to basically build a new way of building code in which all variables can be random variables, but information can also be provided in the form of observational data, of empirical data, algebraic relationships, symmetries, all sorts of information that we actually have available about the world. And then...

automatically discretize those that aren't discretized yet into forms of information operators to allow a new form of inference and learning on computers that can efficiently deal with all sorts of information, but also all sorts of hardware, including stochastic and quantum hardware in the future. I think that this is a direction that really could combine many of the cool things that have happened in computer science in the past few years and in statistics to

really build something new, but doing that should be the work of some big organization, so it's not the sort of thing that I can just do with a dozen PhD students.

Alex (01:12:05.628)

Yeah, that sounds like a fun endeavor. And second question, if you could have dinner with any great scientific mind, dead, alive or fictional, who would it be?

Philipp (01:12:18.478)

So this may be a bit of a personal answer, but I would love to have another, final dinner with my PhD advisor, David MacKay. He left us way too early, not so long after I finished my PhD, actually. And David always had idiosyncratic views on the world, and on our field in particular. And I would just love to hear what he would have to say in 2023 about the current state of AI

Alex (01:12:26.631)

Mm-hmm.

Philipp (01:12:46.582)

and machine learning, and everything else in the world as well, including renewable energy and climate change. David was always quick to call out salesmanship and buzzwords and marketing and profit-driven research in general, and I'm sure he would have a lot to criticize about how our field has developed recently. I'm also sure he would have some cool,

deep, immediate Bayesian insight into how large language models work. He'd probably have some beautiful hierarchical Dirichlet process idea for how to write a probabilistic form of a transformer. And I have some mathematical questions of my own that I would love to ask David, because he might have answers to them. It's just a shame that I can't ask him anymore. I think everyone who's ever interacted with David kind of feels that they have some questions left that they didn't get to ask.

Alex (01:13:42.211)

Yeah, for sure. That sounds like a very interesting dinner. Well, I'm not going to take more of your time. I mean, I would have so many other questions, but you've already been very generous with your time, Philipp. So let's call it a show. I'm extremely happy with this episode; I think we managed to blend a bit of the practical, the conceptual, and the technical.

That's amazing, and that was a very original topic. So thanks a lot, Philipp. As usual, I put resources and a link to your website in the show notes for those who want to dig deeper. Thanks again, Philipp, for taking the time.

Philipp (01:14:26.722)

Thank you very much, Alex, for having me and for allowing me to rant so long.

Alex (01:14:32.644)

That's what a podcast is for. Well, good luck on all these endeavors, Philipp, and see you very soon on the show.
