Modeling Webinar | Fast & Efficient Gaussian Processes, with Juan Orduz
My Intuitive Bayes Online Courses: https://www.intuitivebayes.com/
1:1 Mentorship with me: https://topmate.io/alex_andorra
This is part one of our series on HSGP, focusing on the mathematical foundations of the method. The two other parts will focus on Practical Tips & Tricks, and finally a Full Bayesian Workflow with examples. Subscribe to the channel to know when they are released!
📚 In this webinar, dive into the cutting-edge world of data analysis and modeling as we demystify the revolutionary Hilbert Space Gaussian Process (HSGP) approximation.
This technique is your key to leveraging Gaussian processes at an unprecedented scale, transforming complex data into actionable insights.
📈 Takeaways
🔍 What You Will Learn:
- Foundational Concepts Simplified: Begin your journey with an accessible introduction to Gaussian processes and the principles of spectral analysis methods. We break down complex ideas into understandable concepts, ensuring you grasp the fundamentals of HSGP approximation.
- Real-World Application: Through the lens of the seminal Birthdays dataset, witness the power of HSGP in action. We'll guide you step-by-step, demonstrating how this method can be applied to tangible data challenges, making the abstract concrete.
- Innovative Techniques Explained: At the heart of HSGP lies the Laplacian's spectral decomposition—its eigenvalues and eigenvectors. Learn how a truncated sum of eigenvectors, combined in a linear fashion, can approximate the kernel of a Gaussian process with high efficiency. Discover how the selection of coefficients, influenced by the spectral density of the Gaussian process kernel, unlocks faster computations without sacrificing precision.
- Speed and Efficiency: Uncover the game-changing observation that the eigenvectors used in HSGP approximation remain constant across different hyperparameters of the Gaussian process kernel. This insight is crucial for accelerating computational processes, enabling you to handle larger datasets more efficiently than ever before.
🌟 Why Attend?
Don't miss this unique opportunity to enhance your machine learning and modeling expertise. Gain a comprehensive understanding of Gaussian Processes through the innovative lens of HSGP approximation.
Whether you're a data science enthusiast, a seasoned analyst, or a professional looking to refine your skill set, this webinar will equip you with the knowledge and tools to elevate your analytical capabilities.
🎙️ Our guest speaker, Juan Orduz, is a mathematician (Ph.D., Humboldt Universität Berlin) and data scientist. He is interested in interdisciplinary applications of mathematical methods, particularly time series analysis, Bayesian methods, and causal inference.
🎁 If you're a Patron of the Learning Bayesian Statistics podcast, you can submit questions in advance, enjoy early access to all webinar recordings, and get at least a 50% discount on future webinars (https://www.patreon.com/learnbayesstats ).
References:
Slides: https://juanitorduz.github.io/html/hsgp_intro.html#/title-slide
Notebook: https://github.com/juanitorduz/website_projects/blob/master/Python/hsgp_intro.ipynb
Previous webinars: https://www.youtube.com/playlist?list=PL7RjIaSLWh5lDvhGf6qs_im0fRzOeFN5_
HSGP Reference & First Steps: https://www.pymc.io/projects/examples/en/latest/gaussian_processes/HSGP-Basic.html#gaussian-processes-hsgp-reference-first-steps
GPs with Aki Vehtari: https://learnbayesstats.com/episode/model-assessment-non-parametric-models-aki-vehtari/
Automated GPs: https://learnbayesstats.com/episode/104-automated-gaussian-processes-sequential-monte-carlo-feras-saad/
Juan's website: https://juanitorduz.github.io/
Alex on Twitter: https://twitter.com/alex_andorra
Alex on LinkedIn: https://www.linkedin.com/in/alex-andorra/
Chapters:
00:00 Overview
12:31 Understanding GPs and Kernels
38:18 Understanding the Covariance Function
01:19:46 Implementing HSGPs in PyMC
01:45:38 HSGPs v Splines
Hi everyone and welcome to a new modeling webinar.
This time we are with Juan Orduz. Thanks a lot to all the people who are here live.
It's absolutely awesome.
And well, I am very happy to have you all here and very excited also about the topics.
I have been...
learning about GPs for a few years now, HSGPs only for a few months, of course, because it's a very new method.
And so that's why we are here today.
In the meantime, while I'm talking, I know some of you already know the show and love it.
So thank you so much.
I can see some of you who are already patrons, actually.
I think Bart is here.
So thank you so much for your support.
You guys, especially the patrons, make this show possible.
So you have a bunch of QR codes here on the slide.
If you wanna follow the show, support the show, or check out the online courses, stuff like that.
And in the meantime, please let me know in the chat if you have any technical difficulties, and mainly any questions during the presentation, because Juan will be sharing his screen and he'll be talking, so he won't be able to look at the chat, but I will monitor the chat and the questions.
So feel free to write any questions in there.
That's why we're doing that, so that you also can understand and help us understand better by asking great questions.
And so I'm...
asking for your attention for this hour, hour and a half we're going to spend together.
Close your email tabs, your Instagram tabs, stuff like that.
They will still be there at the end of the webinar, and you're here live.
So I think the best way to benefit from that hour together is to actually be here.
And again, thanks a lot for...
being the warriors here live.
And most of all, thank you so much, Juan Orduz, for being here and for going through the HSGP approximation with us.
I'll shut up now and I'll give you the floor.
You can start sharing your screen and take us on our HSGP adventure.
Thank you, Alex, for the invitation.
I'm very happy to be here and to share my path of learning about this topic.
So I'm definitely not an expert, but I tried to dig into the papers and the code because this method has proven to be super helpful in applications.
And I want to use this session to actually deep dive into some material which is not that accessible; it's focused on people who want to go into specific details, but I do think that having some exposure to these topics will actually help you understand the modeling, of course.
So I'm going to try to share my screen now.
Let me see if this works.
Can you see something?
Yes.
Alright, so I won't be able to see the questions, but feel free to interrupt me at any time, Alex, if you think there's an interesting question or some comment that you might
want to add.
I think this is the setting for doing this, because at the very end, this is not just me trying to give you a lecture about Hilbert space, but really about trying to learn
together about these topics.
So, yeah, so this is all about a Gaussian process and the relatively new Hilbert space approximation.
And I'm gonna be totally honest with you.
I don't expect anyone to really follow the whole details.
And it's not because I'm a bad speaker, but really because kind of there are a lot of kind of areas coming together.
For me, every time I open the paper and read the code, I learn something new.
So even if you take home the idea and kind of the main components and motivation, that will already be a success.
So, and if you, again, if there are certain things that you don't understand, that's totally fine.
I think that's part of the game.
I'll share the slides with you.
And moreover, I'm going to share a long notebook, which is fully reproducible, where you can play around with it, try to hack these things, try to break them, and see how they work under different circumstances.
Alright, so let's get through it.
So, yeah, the outline of the presentation is that I'm going to briefly talk about some applications.
I'm not going to assume everyone is aware of what a Gaussian process is, but I don't want to dwell on the basic case; I want to go into the approximation method. But I still need to talk about Gaussian processes.
And this is a huge topic.
So I don't intend to give a very exhaustive treatment of the topic, but just to set up notation and understand why the problem we're trying to solve is indeed a problem.
We're going to focus on section 3, where we do a deep dive on the approximation part.
And I think this is where I want to spend most of the time.
So what's following in the next steps is actually not that relevant. If we can cover it, maybe just talk about a relatively famous example, the Birthdays dataset, that's fine.
If we don't get there, we are preparing a lot of material with the PyMC folks to cover this and other types of topics.
And I'm going to share also some references for you all to get started and see where to start looking for a specific topic.
So that's the plan for today.
Yeah, let's get started.
So about motivation application, I think this is important because no matter how cool something can be, if it's useless in the real world, then it's gonna be hard to sell.
And the first time I actually saw Gaussian processes work,
in a non-trivial example was actually in Alex's nice post about how to model the popularity of French presidents.
So yeah, here you have the link to the blog post.
And this was the first time I was exposed to a non-trivial model in PyMC, let's say, which was actually useful. So it was really interesting.
And I remember when I read this post, I was not understanding what was happening, kind of the details, but the gist was clear.
Because in most of the cases, like for the Gaussian processor, you have the typical toy example that we will use today.
But in this example, you see how you can use Gaussian processes to actually model latent variables, which are not what you're directly seeing in the plot.
And that's extremely powerful because they're very flexible.
They're also easy to interpret and you can pass kind of important structures through them via the kernel.
So definitely check it out.
I think it's super interesting.
And yeah, let's, let's see.
You want to add something, Alex?
Only that I am blushing.
Yeah.
But yeah, I mean, I agree with you, and that's why I really love GPs.
And the very cool thing also is that you can include them in your models.
So again, that's what I'm doing here in the blog post.
So using linear regression and adding a GP term on top of that.
And then you can also do counterfactuals, which is the plot you are seeing here. So here, that would be: what would happen if unemployment dropped by X points, increased by X points, or stayed the same?
And so you can do that kind of stuff.
Extremely, yeah, extremely powerful.
And that's also why, why I really love GPs.
Yeah, cool.
So please check it out.
The next example that I actually use quite a lot is to use time-variant coefficients in linear regression.
So I have a couple of kind of blog posts and notebooks where I take this classical kind of bikes data set on which you try to model the number of bikes rented in a specific city as
a function of a lot of variables.
One of them is temperature.
And you can run a kind of naive or simple linear regression model where you estimate the effect of temperature: the beta coefficient will indicate how bike rentals increase with the temperature, and you get something reasonable. But if you actually try some other, more complex models, which start to be more flexible, you see that at some point, if the temperature gets really high, then people stop renting bikes, which is kind of intuitive.
So one way of getting these relationships is using a random forest or an XGBoost model and then looking into the partial dependence plots, so that you can see what the effect of the predictors on the target variable is.
Nevertheless, these tree-based models are not that transparent.
And one thing that I played around with, and I think it's quite nice, is that if you allow your beta coefficients to vary over time, then you can capture this type of behavior, where you actually see that the relationship between temperature and bike rentals is not constant. And it's very similar to what we get with the tree-based model, but I would prefer this model just because it's more interpretable.
And this thing about time-variant coefficients is actually something that we're working on quite a lot in the PyMC-Marketing package, because this is one of the key ingredients in modeling how marketing activities affect acquisition, sales, or conversions as a function of time.
So in the simple model, so to say, you have media spend, let's say how much you spend in Facebook and TikTok to acquire customers through an app.
And you can estimate the cost per acquisition globally, so to say, in spite of the fact that you have a time series. It's in some sense the beta coefficient, but in real applications, you know that this is actually varying over time.
Yeah, there's work on a blog post from the PyMC Labs folks about how they pull this off at HelloFresh.
And this is something that we're actually actively looking to integrate into the PyMC marketing package.
So these three examples show that Gaussian processes can work in different types of applications, either as latent processes or as time-variant coefficients.
There are many more.
So I hope that gives you some really concrete applications that we are actually working on quite a lot these days at PyMC Labs.
So let me go now to Gaussian processes.
And again, this is a huge topic, so I want to just get to the most important ingredients, and I'm going to try to hide as much of the math machinery as possible, in spite of the fact that we still need some of it.
So I have generated this kind of classical data set where you have a kind of a non-linear relationship between an x variable, which is your covariate, and a target variable.
Here I took a combination of sine functions just because I think they're fine, but of course you don't know this, and this could be, for example, the true variation of your media spend efficiency over time.
And in blue I have the training data, so that's where I'm gonna fit the model, and the orange points are actually the test set.
And as you see intentionally I've selected the test set to be outside of the range of the training set to see how good I can extrapolate with the simple model that I'm gonna have
here.
So for now, this is just kind of an inference problem and try to see whether we can fit in sample and how to extrapolate outside of the boundary.
So one of the key components, and I think it's kind of the main intuition behind Gaussian processes, is that this is a model that requires certain input from the modeler.
So this is not something that you just kind of close your eyes, press a button and hope for the best, but you actually really need to think about kind of what you want to encode.
I think the most important thing is to understand how do you want to measure smoothness or distance between points.
And this is what a kernel is doing.
So a kernel in some sense is actually controlling how wiggly the fit of the Gaussian process can be, given your points.
So usually a kernel consists of a function that takes two points, let's say x and x prime, and just computes a scalar, which is always positive.
And in this case, for example, the one that we're going to use, and just because it's very popular, is the square exponential kernel, which actually depends on two parameters.
You can actually take the amplitude out, but I'm going to put it in any case.
So you have the amplitude A to say how big this thing is going to fluctuate.
And then you have the length scale L, which is kind of encoding, let's say, how wiggly the GP will be.
So if L is very big compared to the distance between x and x prime, then changing x prime a bit is not going to make much difference, so you expect the fit to be less flexible. On the other hand, if L is very small compared to that distance, then every small change in x prime is going to affect this function quite a lot.
So that's kind of the intuition.
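To make this concrete, here is a minimal NumPy sketch of the squared exponential kernel being described; the function name and parameter values are just illustrative, not the ones from the slides.

```python
import numpy as np

def se_kernel(x, x_prime, amplitude=1.0, length_scale=0.1):
    """Squared exponential kernel: k(x, x') = a^2 * exp(-(x - x')^2 / (2 * l^2))."""
    return amplitude**2 * np.exp(-((x - x_prime) ** 2) / (2 * length_scale**2))

# A large length scale barely reacts to a small change in x'; a small one reacts a lot.
print(se_kernel(0.5, 0.55, length_scale=1.0))   # close to a^2
print(se_kernel(0.5, 0.55, length_scale=0.01))  # essentially zero
```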
Every fit that you get using this kernel is going to be very smooth.
So in this case, I have this black line, which is the underlying process, and it's quite smooth.
So I know this is going to work.
But in many cases, actually, the smoothness can actually be a problem.
And having something less smooth can actually make the model fit better.
But again, the kernel is just a function of two points.
And there are various types of kernels.
You have the square exponential, you have other types of kernels which are less smooth, you have linear kernels which are in some sense just linear regressions, and you have
periodic kernels to model periodic signals.
So the exact math formula of this kernel is not that relevant. What is important is that it actually depends on the difference between these two points. And this is one characteristic of what we call stationary kernels: they just depend on the difference between these two points.
OK, so yeah, what you can do is compute this kernel matrix or Gram matrix.
And what you can do is for your training set, you can compute all of this kernel function for all of these pairs.
And you can plot this.
This is, of course, going to be a very big matrix if your data set is big.
And what you see in this case is that most of the mass is concentrated in the diagonal.
And this is kind of not a surprise, because if you see the formula above, you see that if the two points are equal, then the kernel is equal to a squared.
But if the distance between these two points is actually big, then this is going to vanish quite rapidly to zero.
And this is what this is showing.
And this matrix is actually very important for the rest of the talk.
But keep in mind that this matrix is as big as the training data set.
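And building that Gram matrix is just evaluating the kernel on all pairs of training points; a small sketch, reusing the hypothetical se_kernel function from above and 80 equally spaced points as a stand-in for the real training data:

```python
x_train = np.linspace(0, 1, 80)  # 80 training points between 0 and 1, as in the example

# Evaluate the kernel on all pairs via broadcasting: K has shape (80, 80)
K = se_kernel(x_train[:, None], x_train[None, :], amplitude=1.0, length_scale=0.1)

# Most of the mass sits near the diagonal, where x == x' and k(x, x) = a^2
print(K.shape, K[0, 0], K[0, -1])
```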
So what the GP model is trying to do is to use this K matrix, this kernel matrix, so to say, as a way of setting a prior on your points.
So the notation that we're going to have here is that you have x and y, which are training data, so the blue points in the initial plot.
and the x star would be the test set.
Of course, you don't know y star because this is what you're trying to predict.
But what you could also be interested is in f star.
And f is kind of the latent Gaussian process underlying that.
So it's just kind of the main signal of the process.
And of course, you can then add a noise.
And how you model this in a Bayesian setting is that you set up a prior.
And when concatenating these two values, of course, you don't know f star, but that's fine.
You can still kind of consider this matrix.
And this K here is actually a large block matrix, because the dimension of this concatenated vector should be n plus m, the number of training points plus the number of test points, and it is decomposed as follows.
So X is my training set, so I take the Gram matrix here, and for the upper left corner, I add some noise, because this is also what I want to infer from the training data. The off-diagonal blocks are just the kernel matrix evaluated on test and train, and the other diagonal block is the kernel matrix evaluated on the test set.
So that's kind of the prior.
of the Gaussian process.
This is without seeing the data in some sense.
And the key thing is that you can actually estimate analytically the fit on the test set.
And mathematically, it's just saying that given that model, you can condition on x, y, and x star, which means just the training set and the x component of the test set, to estimate what the distribution of your latent Gaussian process would be.
This is just a math formula that exists, and it turns out that this is again a multivariate normal distribution with a given mean and a given covariance.
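For reference, these are the standard Gaussian process regression formulas being described here; this is textbook material rather than anything specific to the talk.

```latex
\begin{pmatrix} \mathbf{y} \\ \mathbf{f}_* \end{pmatrix}
\sim \mathcal{N}\!\left(\mathbf{0},\;
\begin{pmatrix}
K(X, X) + \sigma^2 I & K(X, X_*) \\
K(X_*, X) & K(X_*, X_*)
\end{pmatrix}\right)

\mathbf{f}_* \mid X, \mathbf{y}, X_* \sim \mathcal{N}(\boldsymbol{\mu}_*, \Sigma_*), \qquad
\boldsymbol{\mu}_* = K(X_*, X)\,\big[K(X, X) + \sigma^2 I\big]^{-1} \mathbf{y}, \qquad
\Sigma_* = K(X_*, X_*) - K(X_*, X)\,\big[K(X, X) + \sigma^2 I\big]^{-1} K(X, X_*)
```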
So you can do this with pen and paper and you don't need a laptop, but kind of, of course, this is very hard to do manually.
So you use a laptop.
But in any case, either if you do it by hand or with a laptop, the key component about these formulas is that at some point, you need to take the inverse of this kernel matrix,
so both for the mean and for the covariance.
And if you have a big data set, then this is going to be problematic.
And the reason is because
the complexity of this operation is of the order of n cubed.
Of course, you can do other tricks, like using the Cholesky decomposition, but there's no way out of the complexity of this step. This is actually what makes Gaussian processes infeasible even these days, if you have a very large dataset. I was trying to do this with probabilistic languages like Pyro or PyMC; it takes a while and it doesn't scale quite well for larger problems.
This is the problem we're trying to solve.
We still want to have the kernels, we still want to have the flexibility, but we want to avoid having to compute this inverse.
I hope this gives you an idea about what a Gaussian process is.
The math is not very important.
What is important is that if you want to solve this, you need to take the inverse of a matrix that depends on the size of your training data.
And this is very expensive.
So there have been many kind of approaches about how to do approximation.
So what I'm going to show you is definitely not the first one, but it's one modern one, which works quite well.
And actually mathematically it's very nice and flexible.
So how do we do this in PyMC?
Of course, one thing that I've hidden under the rug, so to say, is that you have hyperparameters.
So the kernel depends on this A and L parameter.
So you need to set up some priors.
And actually, setting priors on GPs is a huge field by itself.
I have a reference: Dan Simpson has written many blog posts about it. So it's definitely not trivial, but in PyMC, something that we often do is use this find_constrained_prior function to just try to help us in this process.
So, in this example, again, the code is not very important.
I'm going to show it to you.
You can assume that the, let's say the length parameter has an inverse gamma distribution.
This is just very convenient because it's positive and kind of fades out close to zero very rapidly.
And yeah, given the data points, I'm pretty sure that the length scale would be between 0.01, so something small, and probably not bigger than 0.04, because that's already quite big, thinking about the domain.
And you can have a similar line of thought about the noise. I'm trying to be more flexible there, because I'm expecting this noise parameter to be much wider than the length scale. Then you can run a certain optimization behind it, so that you get the parameters of the inverse gamma distributions that fit your intuition based on these constraints.
And there's a nice library called PreliZ, which can help you plot this super fast.
So yeah, these are the two kind of functions that you'll get.
So these are the priors that I'm going to impose for the length scale and for the amplitude.
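A rough sketch of what that elicitation step can look like in PyMC; the bounds are the illustrative values mentioned here, the mass and the initial guess are assumptions, and the returned dictionary contains the inverse gamma parameters you can then plug into the model (PreliZ can then be used to plot the resulting density).

```python
import pymc as pm

# Ask for an InverseGamma that puts most of its mass (95% here) between the chosen bounds.
# Bounds are the illustrative values from the talk; init_guess is just a starting point.
ls_params = pm.find_constrained_prior(
    pm.InverseGamma,
    lower=0.01,
    upper=0.04,
    mass=0.95,
    init_guess={"alpha": 2.0, "beta": 0.1},
)
print(ls_params)  # a dict like {"alpha": ..., "beta": ...}
```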
Juan, so two things.
One, some people can see your screen, some cannot.
So can you try and share your screen again?
Some have the hypothesis that that's because they joined late.
And I'm starting to think it's possible.
So first thing.
And then, when you've shared your screen again, can you quickly go back to slide 8, where you're showing the simulated data? And the question on that is: what is the underlying kernel of that simulated data, where you're using the sine functions and so on?
Yeah, so let me go to, can you see?
again.
So if some of you still cannot see the screen, it's definitely super weird.
Please tell us if you can, if you cannot, sorry.
And in the meantime, I shared the link to the slides so you should be able to follow along.
I'll do my best to let you know where we are in the presentation.
And now that's slide eight, and that's Huey's question.
Basically, what is the corresponding kernel that you're using here?
Yeah, so there are two ways of presenting this.
So one thing that you could do is actually build the Gaussian process yourself and actually select a kernel and generate data from it, and then try to recover the
parameters.
So this is one way of doing it, but in this case, I'm not doing that.
I'm just giving some data.
So again, there's no kernel that fits it.
Actually, I could have used a Matérn kernel, which is less smooth.
I could have used any other kernel.
At some point, it goes back into the kind of Bayesian workflow, where you see the data, try something, get some diagnostics, and iterate.
So you can actually use other kernels with different parameterization, and you probably get better or worse fit.
So in this example, I'm choosing the squared exponential just because it's simple and it's famous, but there's no a priori reason for choosing that.
So I could have used other kernels, probably not a periodic one because I'm assuming I don't know the formula, maybe with this little snapshot of the data that's not enough, but
I could have tried that.
Another important thing that I forgot to mention is that you can combine kernels, meaning that you can sum kernels and take products.
So I could have used two squared exponentials and added them together with different amplitude and length-scale parameters.
So there's no unique way of doing it and I'm not doing a parameter recovery.
That's another possibility you could explore when trying to study these processes.
So let me continue and again, feel free to interrupt.
Yeah.
yeah, no, they're saying thanks a lot.
And so the underlying code is in the notebook that we will share.
And I mean, that I will also share right now in the chat.
So that code for how Juan's simulated the data and so on, that should make it clear.
The notebook is very extensive.
I want to go to the main points and ideally, you can combine what you learned from today to the code that is published online.
So going back about how you do this in PyMC, then it's actually pretty simple.
This is one of the things that I love the most about PyMC is that it's very intuitive.
So I'm going to go through it relatively fast, just because I'm assuming that you are familiar with it.
But if not, you can always see the code.
I want to just show you that the model can be written in 15 lines of code.
So again, you can use data containers, where I put the training data; this is not really necessary.
And then you define the priors for the kernel amplitude and the kernel length scale.
And here you can pass the parameters that you obtained in the previous steps, where you did this optimization with find_constrained_prior.
And yeah, you can also put a prior on the noise.
For the mean, I'm assuming that it has mean zero, so not very relevant at the moment.
And for the kernel, I can just call the GP module in PyMC, take the squared exponential, pass the kernel length scale, and multiply by the amplitude.
Then what I need to do is to initiate this latent GP object, which takes the mean and a covariance function.
Again, thinking about it conceptually, these are actually the two ingredients in the math formulation of a Gaussian process.
Then you just set this prior, which is just conditioning on the training data.
This F is going to model the mean of my signal, and I have the noise, and I'm going to assume this as a normal likelihood.
You see, you can easily get more complex models, adding other type of kernels, adding them together, multiplying them.
It's actually relatively simple.
Of course, you don't want to start with the most complex model, but actually build iteratively.
This is how it looks: you have the kernel amplitude, the length scale, and the noise. Those are kind of the main entry points, and of course you have the data and the likelihood.
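For those following along without the slides, here is a condensed sketch of the kind of PyMC model being described; the prior parameters are placeholders rather than the values from the notebook, and x_train, y_train are assumed to be the training arrays from before.

```python
import pymc as pm

with pm.Model() as latent_gp_model:
    x_data = pm.Data("x_data", x_train[:, None])  # optional data container

    # Priors for the kernel hyperparameters and the noise (placeholder parameters)
    amplitude = pm.InverseGamma("amplitude", alpha=2.0, beta=1.0)
    length_scale = pm.InverseGamma("length_scale", alpha=2.0, beta=0.1)
    noise = pm.InverseGamma("noise", alpha=2.0, beta=1.0)

    # Zero mean function and squared exponential covariance
    mean_func = pm.gp.mean.Zero()
    cov_func = amplitude**2 * pm.gp.cov.ExpQuad(input_dim=1, ls=length_scale)

    # Latent GP: the mean of the signal, with a normal likelihood on top
    gp = pm.gp.Latent(mean_func=mean_func, cov_func=cov_func)
    f = gp.prior("f", X=x_data)
    pm.Normal("likelihood", mu=f, sigma=noise, observed=y_train)
```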
So before fitting the model, what you usually do is use some prior predictive checks, so just sampling kind of functions from these priors, and this is what we see in this plot.
And yeah, it looks reasonable, at least they look in the same range.
Some are smoother, some are more wiggly.
So that's fine.
We are not constraining our space a lot, but we're also making sure that we're not getting kind of nonsense curves before seeing the data.
Yeah.
Yeah, and so just to make sure it's clear to people, the blue dots and the black line, the model hasn't seen that yet.
That's the data.
So the model just has the structure and any covariate that you would have passed to your model, but it hasn't seen any observations.
So that's why you see these wiggly lines.
That mainly comes from the amplitude.
And then the length scale is having an impact on the X axis.
You can think of the length scale as, like, as Juan explained, kind of like the correlation of the points through time.
So on the X axis and the amplitude is what is going to make the lines wigglier on the Y axis.
And so here you can see we're expecting a lot of different behavior.
That really depends on your use case and your data here.
So again, as Juan said, you have to be very careful and very thoughtful about how to choose the priors.
And that's why also, since it's the prior, that's why it's like expecting no trend, because we're saying there is a mean of zero.
So the GP is not expecting any positive or negative trend.
So that's just that.
I don't think there is any questions on that.
So I think you can go ahead, Juan.
Okay, so we just run the machinery and we get some parameters.
I mean, I didn't do any fine tuning or anything like seed picking.
We do see a little divergence, but I don't care about that. I just want to get the gist of what we do. So overall the chains converged quite nicely.
And now you want to generate predictions on the test set.
And if you remember from the previous slides, this is the huge formula, which was just saying how to condition on the training set on the kind of domain of the test set.
And there's a very convenient conditional method.
So if you see what I'm doing now at the prediction step, when you usually do out-of-sample predictions, there's just one line of code that you need to add, which is just calling the conditional on the test set.
And then you can generate out-of-sample predictions.
And you just pass this new kind of mean evaluated on the test set, which I called f star, just to mimic the math notation.
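That out-of-sample step is roughly one extra call to gp.conditional; a sketch, assuming the model sketched above, an idata object returned by pm.sample, and a hypothetical x_test array:

```python
with latent_gp_model:
    # Condition the latent GP on the training data and evaluate it at the test inputs
    f_star = gp.conditional("f_star", Xnew=x_test[:, None])
    # Draw posterior samples of f_star from the fitted model
    preds = pm.sample_posterior_predictive(idata, var_names=["f_star"])
```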
And this is what you get.
So you see that within the range of the training data, which is between 0 and 1, the fit is quite good.
And you see two colors.
You see orange, which is kind of the whole posterior, and pink, which is f, the latent Gaussian process; on the training set they are actually different. So they fit the data quite well.
But notice how kind of the uncertainty grows actually as soon as you leave, so to say the domain of the, or the region of the training set.
So this is actually very common when you're using these type of kernels.
So extrapolating with GPs is something that has to be done carefully. It's possible, but you have to be careful about it.
And in the notebook, I do some experiments to see, let's say if I move one length scale in these two directions, how the fit would go.
But this is usually the typical type of plots that you get.
The sampling takes maybe, I don't know, four minutes or something; it could be less, it depends on your machine, but it just works.
But I have just 80 data points on the training set and 100 on the test set.
In real applications, you could have hundreds or even a thousand.
You can already see with this example that this is not going to be super fast.
Yeah, and you can see that the out-of-sample predictions, they revert very fast to the mean.
On the right side, it's already very close to the mean, which is 0.
So you can see the GP reverts somewhat fast to that.
It looks less fast, slower, on the left, but just because the GP can see a downward trend.
So then it has that downward trend, and then it picks up.
Also, it makes sense because the data is actually picking up again.
But yeah, you can see the GPs revert pretty fast to the mean.
I mean, at least these kind of GPs.
Yeah, of course, if you had chosen different kernels, this would probably have changed as well.
All right, a question here from Kanz, who is asking: do all kernel functions output zero or positive values?
Yes, so when you define kernels, notice that one important restriction is that the kernels define these matrices and then this is going to be passed as a covariance matrix.
And in order to have a valid covariance matrix in a multinormal distribution, it has to be positive definite.
which means it needs to have positive eigenvalues.
I don't know if that question was prepared, but I'm going to talk about that in a moment, about eigenvalues and eigenvectors.
But I think this is the important thing, that you need to make sure that this happens.
This is the constraint.
I'm going to talk about this in a little bit more detail.
So hopefully I'll come back to that later.
So about the Hilbert space GP deep dive: what I'm going to do, and this is a tip from someone who gave a talk about how to give talks, is that I'm going to tell you about the method and what we're going to do, then I'm going to do it, and then I'm going to do a recap. So, in spite of the risk of sounding repetitive, I really want to sketch the main components so that we don't get lost in the details.
So what do we want to do?
And this is kind of the executive summary of the method.
So the problem is that we have this Gram matrix, which is huge, or potentially could be very big if the training data is very big.
So what we want to do is to approximate this matrix with a matrix which is small, and this is what we will call K tilde. That's the whole point, because if we have a good approximation of that matrix, then we can invert the smaller one instead, and we win.
The problem is how we find this.
This might sound a bit weird, but this is what we want to get in the next slides, but it's going to at least explain the name of the method.
One key observation that the authors of this paper made is that we can interpret the covariance function as the kernel of a certain type of differential operator.
Why would you care about this?
It could be a little bit weird, but the thing is differential operators is an area of mathematics which has been studied quite a lot.
And if you put it in the context of what are called Hilbert spaces, which are nothing more than function spaces where you have an inner product, like the dot product in Euclidean space, then you have a very rich theory.
So with this point in mind, there are a lot of results in the theory of differential operators which allow you to actually create this reduced-rank approximation of the covariance function. There are probably many ways of doing this type of approximation, but the key component is that the basis functions of this approximation are independent of the parameters of the kernel.
So let me say this once again.
The key feature about this approximation is that you are going to use certain basis functions, think about splines, but these basis functions won't depend on, for example,
the amplitude and the length scale, which means this actually simplifies the computations quite a lot.
Even if the key idea of going into differential operators seems a little bit too exotic, it actually pays off quite well, because at the very end, what you're doing is reducing this problem to running a linear regression.
When you run Hilbert space GPs, they run extremely fast, and they can give very good results with a very decent number of approximation terms, where you do the cut.
So if this didn't make any sense, it's fine.
We'll try to go through it and then we are going to come back.
But this is what we want to understand in this session.
I want to focus more on the concepts because again, I think this webinar is precisely a perfect space for doing that.
Then, I'm preparing some examples, Alex is presenting some examples, and Bill Engels is preparing a lot of documentation about the code.
Hopefully, all of the topics we're going to touch today will help you understand how this works.
These are the steps now that we're going to try to do.
We don't want to be super detailed, but we're trying to follow the storyline to make this executive summary feasible.
First, spectral theory: I'm going to do a little detour about eigenvalues and eigenvectors.
If you do know them, I still hope you can learn something new because I'm going to talk about interesting topics in this area.
Then the idea is now that once you have this tooling of eigenvectors and eigenvalues, then kind of this is going to be the main structure.
So again, we're starting from a kernel, from a Gaussian process, so we need to start from that.
And I'm going to describe how certain types of kernels, for example the stationary kernels, admit what is called a spectral density.
So it's kind of in some sense an integral representation of the kernel function.
And then we can actually assume that this spectral function has a polynomial expansion.
Think about it as a Taylor series.
But we don't know the kind of the coefficients.
but you still can believe that this is feasible if you assume, for example, certain analytical properties on the spectral density.
So you have an approximation or let's say a power expansion, but you don't know anything about it.
But since you can expand this not just in any polynomials, but in powers of omega squared, you can actually link this with the Fourier transform of the Laplacian.
So this is the link.
This is why it's important.
This is why the Laplacian is the key operator and not other one.
This is also another question that often pops up.
So we're going to talk about the Laplacian and the Fourier transform.
And now to make things more concrete, we are going to talk about boundary conditions for the Laplacian operator.
And by imposing boundary conditions on certain domain,
we are going to have a very rich theory on eigenvalues and eigenvectors of the Laplacian.
And in the context of the Dirichlet-Laplacian or the Laplacian with boundary conditions, there is an associated kind of expansion in terms of eigenvalues.
And the final punchline is that we will identify the coefficients of the expansion in step two with the coefficients of the expansion of the Laplacian.
Again, if this sounds like very strange, it's fine, this is what we want to do.
What is important for me is not go into the specific details, but really go into linking the story and why at least it could be feasible to do this.
So I have thought about many ways of communicating this.
I think this diagram is a good summary of the path that we're going to get.
So we're going to talk about eigenvalues and eigenvectors.
This is going to be our starting point.
Then once we have our Gaussian process, we have a kernel.
Then we'll associate a spectral density, and we will have a polynomial expansion.
On the other hand, on an unrelated topic, we have the Laplacian.
and we can impose certain boundary conditions to get certain functional calculus or a way of taking Taylor series of the Laplacian.
And we're going to link this with the Fourier transform, and then we will have two series.
We're going to identify the coefficients, and we will read the formulas for the approximation.
So in spite of the fact that these things look quite far apart, I'm claiming that the Fourier transform is the gluing step between these two worlds, which apparently had nothing to do with each other.
All right, so let's start from the beginning and recalling what's the definition of an eigenvalue and eigenvector.
So from linear algebra, if you have a matrix A, which we assume is a square matrix, you're going to say that a vector v and a scalar lambda form an eigenvector and eigenvalue pair if this holds: if you multiply the matrix with the vector, this just gives you a scalar multiple of this vector v.
And of course, we don't want lambda to be zero because otherwise it will be trivial.
Everything will be possible.
So we are restricting ourselves.
Sorry, it shouldn't be lambda, it should be V.
So there's the typo here.
V cannot be zero, lambda can be zero.
That's not a problem.
It should be V.
I'm going to fix that.
We're going to say, yeah, the spectrum of an operator or a matrix is just the set of the eigenvalues.
That's just a fancy word of saying that.
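Written out, the definition being discussed is the following (with the typo on the slide corrected, so it is the vector that cannot be zero):

```latex
A v = \lambda v, \qquad v \neq 0, \qquad
\operatorname{spec}(A) = \{\lambda \;:\; \lambda \text{ is an eigenvalue of } A\}
```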
yeah.
So to make it very clear for people, v is an eigenvector; it cannot be the zero vector. And lambda is the eigenvalue; it can be any value on the real line.
And the spectrum of a matrix is the set of its eigenvalues.
Okay.
There's no restriction a priori.
Yeah, here, so here, though, you're saying that lambda is unique in the sentence.
And then afterwards, you say the spectrum is the set of eigenvalues.
So where would these new eigenvalues come from?
Or is lambda already a vector to begin with?
No, sorry.
So lambda is a scalar.
It could be like three and it's not unique.
So
Okay, so, but then why are you talking about the spectrum of the matrix?
Where would the spectrum come from, basically?
That's my question.
Okay, so let's say you have the... I'm gonna give you an example to make this more tangible. Here it is, because I like examples.
So let's consider the matrix A, which is just this two by two matrix, a square matrix. There's an algorithmic way of computing the eigenvalues and the eigenvectors, and you can show that the values lambda one equal to three and lambda two equal to minus one are eigenvalues, with the corresponding eigenvectors (1, 1) and (-1, 1).
The square root is not very important.
So what you can actually test is that if you multiply the matrix A with the vector v1, you will get three times the vector v1.
And if you multiply this matrix with lambda two, sorry, with the vector v2.
you're going to get the vector v2 times minus 1.
So I'll leave that as an exercise.
And I just.
you have several eigenvalues because there isn't a unique solution to that equation, basically.
In this case, as I have a symmetric matrix, I can always find the eigenvalues, and they're always going to be real. In general, they could be complex. In this case, I have two of them, but I could have just one. In this specific example, the spectrum is going to be three and minus one.
I guess,
To be fair, it's also not just the eigenvalues, but the pairs.
But for example, if you take the identity matrix, no matter how big it is, it's always going to take the eigenvalue 1.
But by the spectrum, I also mean we tie it with the eigenvectors, not just the values themselves.
But so, yeah.
the spectrum can be between 0 and the number of dimensions of the matrix.
That's what you're saying.
Yeah.
Hm.
I just add this funky coefficient to show the following properties that if I add these, then the inner product of these two vectors is zero.
If I take the norm of them, they're going to be one.
This is going to be very convenient.
But again, I can take v1 to be a hundred times this vector, and it will still be an eigenvector.
That's not going to be a problem because the factor will cancel out anyway.
But I'm going to choose these two specific ones because they're very convenient.
So one of my favorite theorems, because I actually studied this quite a lot, is what is called the spectral theorem.
It states that if you have a symmetric matrix, like the one that you had before (symmetric meaning it's symmetric with respect to the diagonal), then you can find an orthonormal basis of eigenvectors in such a way that this matrix becomes diagonal.
So what this means is in my previous example, if I take this kind of change of basis matrix Q to be the eigenvectors normalized, so let's say with this
funky one over square root of two, and I compute Q transpose AQ, I will get a diagonal matrix on which the diagonal elements are actually the eigenvalues.
Which means whenever I have a symmetric matrix, I can reduce this to a very simple operator, which is just diagonal, just by selecting the eigenvectors as a basis.
So this doesn't seem to be very sexy or very kind of important, but it has a lot of good consequences.
And another exercise for the audience is to see that this actually cannot be done if your matrix is not symmetric.
So there's kind of some, something to prove here.
Let's do it via code.
If you don't trust me, but you can run this yourself.
We're going to use JAX here, but you could just as well use NumPy.
We can have our matrix A, which is exactly the one from the example.
We can compute the eigenvalues with this function in the linear algebra submodule.
What you see is that the eigenvalues are 3 and minus 1.
And JAX actually allows this to be complex, but the complex component is actually zero.
So it doesn't matter.
And the eigenvectors in this case, it will give you this normalized.
So it's up to this constant is 1, 1 and minus 1, which is what we had before.
And if you set this Q matrix as being the eigenvectors, then you can see that if you compute Q transpose A Q, you will get a diagonal matrix.
Of course, the off diagonal entries here are just essentially zero.
You can prove it yourself, and try to see that if you change your initial matrix for a non-symmetric matrix, this is actually not going to work.
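A small reconstruction of what that JAX snippet can look like, assuming the two by two matrix of the example is [[1, 2], [2, 1]], which indeed has eigenvalues 3 and -1 with eigenvectors (1, 1) and (-1, 1); note that jnp.linalg.eig runs on CPU, and for a symmetric matrix jnp.linalg.eigh would be the more usual choice.

```python
import jax.numpy as jnp

A = jnp.array([[1.0, 2.0], [2.0, 1.0]])

# eig returns complex results in general; for this symmetric matrix the imaginary parts are zero
eigenvalues, eigenvectors = jnp.linalg.eig(A)
print(eigenvalues)  # approximately [ 3.+0.j, -1.+0.j ]

# Use the normalized eigenvectors as a change-of-basis matrix
Q = jnp.real(eigenvectors)
print(Q.T @ A @ Q)  # diagonal matrix with the eigenvalues on the diagonal
```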
So actually, why do we care about the spectral theorem?
And the theorem by itself seems to be very random.
But the answer is what we are going to call the functional calculus.
But before going that, we need to talk about something which is just projections.
If you think about the change of basis, another way of seeing it is that we can write the matrix A as the following operator, which acts on vectors.
You could apply this to a vector, let's say w, and you can take the inner product with one eigenvector that will give you a scalar.
And then you can multiply this by the eigenvector and the eigenvalue.
And by doing this procedure, actually, you will always get the action of A.
So it's a very convenient way of decomposing a matrix.
And you can actually see that this little operator, v j v j transpose, is actually a projection, in the sense that if you take the square of it, you get the same operator back, and it's self-adjoint.
This is not that important, but it's actually a projection in the definition of it.
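In symbols, the decomposition being described is, for a symmetric A with orthonormal eigenvectors v_j and eigenvalues lambda_j:

```latex
A w \;=\; \sum_{j} \lambda_j \,\langle v_j, w\rangle\, v_j
\;=\; \Big(\sum_{j} \lambda_j\, v_j v_j^{\top}\Big) w,
\qquad
P_j := v_j v_j^{\top}, \quad P_j^2 = P_j, \quad P_j^{\top} = P_j
```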
This is just very convenient because of the following thing, but we can actually show that this actually works.
So I can define the projection operator as in Jax, just taking the dot product and multiplying with any vector.
And I can just use this kind of explicitly write the formula that I had before.
This formula, I can write it in Jax, so to say.
And I can show that if I apply this matrix projection function to the first eigenvector,
I will get the eigenvector multiplied by the eigenvalue.
That's why I'm taking the quotient.
I can do this for the second eigenvector, and it's also going to work.
And I can do a random vector.
I chose the vector 1 to 2.
And if I take the effect of A on V, then this is exactly the same as applying this projection sum.
So it works, and it's one of the results of the spectral theorem.
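A minimal sketch of that check, reusing the hypothetical A, eigenvalues, and Q from the JAX snippet above:

```python
def projection(v):
    # Rank-one projection v v^T onto the span of a unit-norm vector v
    return jnp.outer(v, v)

def apply_via_spectrum(w):
    # Reconstruct A w as sum_j lambda_j * (v_j v_j^T) w
    return sum(
        jnp.real(lam) * projection(Q[:, j]) @ w for j, lam in enumerate(eigenvalues)
    )

w = jnp.array([1.0, 2.0])
print(A @ w)                  # direct action of A
print(apply_via_spectrum(w))  # same result via the spectral decomposition
```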
Kind of the question is still, why do we care about this?
Why is it so important to be able to write a matrix in such a, let's say, cumbersome way?
Well, it's about the functional calculus.
Because once you have this representation, then you can take kind of functions of the operator by applying any function.
Which means, think about any function that you want.
Let's say, taking the square of a matrix.
you know that this is not a simple operation.
You need to do this row times column, row times column, and do this by hand.
This is very tiring.
But once you have this decomposition, then applying a function to your matrix is actually as simple as just applying the function to the eigenvalues and keeping the projection.
This is very useful because you can get very weird and complex functions.
The only thing you need to do, I mean, you don't need to mess up with matrices and nothing in that area.
We just need to take the function on the eigenvalue, which is a number, and it's much easier to do, and just multiply it by the projection.
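So the functional calculus being described amounts to applying the function only to the eigenvalues:

```latex
f(A) \;=\; \sum_{j} f(\lambda_j)\, v_j v_j^{\top}
```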
So as a little example, if you take f of z to be the identity, then you recover the operator A. And if you take f of z equal to 1, you actually get the identity matrix; showing this is left as an exercise.
So let's do something fun.
So one thing that you could do for a matrix is to define the exponential as a matrix, and this is this infinite sum of this power of the matrix divided by this factorial factor,
which is not that important.
But think that this operation is on matrix.
So if you take a to the power of the 30 or 20,
This is multiplying a matrix 30 times.
This can be very kind of expensive to do, right?
Because you need to do this row times column multiplication many times.
But SciPy has a function to do that, and you can still do it. But think about this carefully.
But actually, with the functional calculus, you can see that what you actually need to do is, in essence, take this change of basis.
And if you put it in the formula, you actually realize that you only need to take the kind of exponential of the eigenvalues.
This is what I'm doing here.
And then taking this change of basis again.
And this is exactly the same result.
But notice the difference that in the SciPy function, we're taking this exponential of a matrix, whereas with the functional calculus or the spectral decomposition, we actually
just need to evaluate this on the eigenvalues and do a final multiplication at the end.
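A sketch of that comparison, again reusing the hypothetical A, eigenvalues, and Q from the earlier snippets:

```python
import numpy as np
from scipy.linalg import expm

A_np = np.asarray(A)
lam = np.real(np.asarray(eigenvalues))
Q_np = np.asarray(Q)

# Matrix exponential computed directly on the matrix
direct = expm(A_np)

# Matrix exponential via the functional calculus: exponentiate only the eigenvalues
via_spectrum = Q_np @ np.diag(np.exp(lam)) @ Q_np.T

print(np.allclose(direct, via_spectrum))  # True, up to numerical precision
```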
So I hope you can now value the fact that this is actually simplifying a lot of things.
This is a very nice result in linear algebra, which by itself is very useful, and it is going to be a key component in the Hilbert space approximation.
Bear with me.
Now about the approximation itself.
We're going to start from a kernel, and we're going to assume that kernel is stationary. You can relax these conditions, but for now let's assume this is the case. There's a theorem that says that if you have such a kernel, and it's stationary, which means it just depends on the difference between the two points, it can always be written as an integral. And actually, this integral looks very similar to a Fourier transform, but that's a comment that will come up later.
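In formulas, and up to the convention used for the constants, the representation being described (essentially Bochner's theorem for stationary kernels) is:

```latex
k(x, x') \;=\; k(x - x') \;=\; \int_{\mathbb{R}^d} e^{\, i\, \omega^{\top} (x - x')}\, S(\omega)\, d\omega
```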
If this integral has this factor S of omega (this is what we call a measure; it's not very important), so if this integral can be written as an exponential times a function, then this function is known as the spectral density. For now, it's just a trick that is not useful by itself, but it's going to be useful once we identify this with the Laplace operator.
There are formulas to compute this.
So for example, for this squared exponential kernel, you have explicit formulas like this.
So it of course depends on the hyperparameters, let's say the amplitude and the length scale, and this is another exponential.
Of course, notice that this depends on the dimension of the domain.
So for d equals one, which is the example that we want to treat, this is how the formula looks.
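For the squared exponential kernel, the spectral density has a closed form; in the convention used in the HSGP papers it reads as follows, with omega squared meaning the squared norm of omega:

```latex
S(\omega) \;=\; a^2\, (2\pi \ell^2)^{d/2} \exp\!\left(-\frac{\ell^2 \omega^2}{2}\right),
\qquad
S(\omega) \;\overset{d=1}{=}\; a^2\, \ell\, \sqrt{2\pi}\, \exp\!\left(-\frac{\ell^2 \omega^2}{2}\right)
```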
So it's very explicit, actually, and you can compute this for all the kernels that we usually work with.
And notice that in this variable omega, this exponential actual, the spectral density, it's again an exponential decaying as minus omega squared.
And again, coming back to the comment about the Fourier transform, for the ones that you know, if you take a Fourier transform of a Gaussian, you get a Gaussian as well.
And this is exactly what's happening, but we do care about the factors.
So we actually need to compute all of these two pi and constants out there, but it's doable.
It's not hard to do.
And in the notebook, I give you plots of the spectral density of the squared exponential kernel when you change the length scale.
One of the things that you notice from the formula, but also from the plot is that these things decay to zero very rapidly.
This is actually going to be the weights in the final approximation formula in the context of this linear regression that we'll get there.
For now, it's just interesting object, but not very interesting for the moment.
If you further assume that these kernels are isotropic, which mean that they don't only depend on the distance, but actually just on the norm.
which again is the case for the square exponential, then the associated spectral density will also be isotropic, which means it will also depend just on the norm of w or omega.
We know that the spectral density is going to just depend on the norm of omega.
With a simple transformation,
depends actually on the square of it.
This is not very restrictive.
You can always do that for a solvable function psi.
And actually, you can expand this psi as powers of this omega.
And this is not really restrictive, because you can think about a Taylor expansion.
The only thing is that you don't know what these a0 and a1 are, and of a power.
You can compute them, but they can get very complex.
So for now.
we are hoping that through this approximation, we can read or interpret these coefficients in a way that are computable, so to say.
But at the moment we're blocked, there's nothing we can do, we don't know much about it, and we're kind of blocked here, but this is where the other side of the story will start,
and we will merge this at the very end.
So for now,
Think about that the kernel has a spectral density and the spectral density has certain approximation formula.
On the other side of the world, you have the Laplace operator.
And coming back, if you have ever dealt with PDEs or differential equations, the Laplacian is nothing else than taking the sum of the second-order partial derivatives.
The Laplacian appears in many contexts in physical models, like the wave equation, the heat equation.
It's a very well-studied operator and has many interesting properties.
One of them is that it's positive definite, which means that as we'll see later that the spectrum is always positive.
It's an elliptic operator.
I'm not going to talk about that, but this is actually the key property of the Laplacian. If you take this out, everything we're going to talk about here is going to break.
And it's also, it's self-adjoint, which in the context of matrices, is saying that it's symmetric.
So these are all very good properties of the Laplacian and it's not like a random operator.
Kind of having all of these properties are kind of key things that we want from a good operator if we want to do a spectral decomposition.
And kind of, if you read the paper, they go through the formulas very fast.
but kind of it's not clear why did we think about the laplacian, right?
You have this spectral density and so what.
But one of the things that is important is that people know a lot about the spectral decomposition of the Laplacian. So if you can ever bring the Laplacian to the game, please do so. So how to bring the Laplacian to the game?
Well, one of the key features of the Fourier transform is that if you take the Fourier transform of a derivative, it turns it into multiplication by a polynomial. So if you take the Fourier transform of a second-order operator like the Laplacian, then you're going to get a second-order polynomial, which is what we want to actually map to the spectral density.
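Concretely, the property being used is that differentiation becomes multiplication by a polynomial under the Fourier transform; for the Laplacian:

```latex
\Delta f = \sum_{i=1}^{d} \frac{\partial^2 f}{\partial x_i^2},
\qquad
\mathcal{F}\big[-\Delta f\big](\omega) = \|\omega\|^{2}\, \mathcal{F}[f](\omega)
```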
This is where things become a little bit interesting, because in this context, if you think about the covariance operator as being this kernel operation, in which you have the covariance function here and this argument is left free, you can actually think about taking the Fourier transform of the approximation formula that we had before.
Let me go back.
So you have this formula here, and these things depends on polynomials of order two.
If you take the Fourier transform, you will get an operator, which is actually the sum of powers of the Laplacian.
This is what it's called a pseudo-differential operator because this is actually an infinite sum.
What we're trying to do in this step is saying that the kernel operator associated with the kernel function of a Gaussian process can be seen as an infinite sum of powers of the Laplacian.
And this is what we call a pseudo differential operator.
And I think this is kind of the key gluing piece between these two worlds.
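Schematically, and written in the same spirit as the expansion above, the claim is (a sketch following the Solin & Särkkä formulation):

```latex
% The covariance (kernel) operator K acts on a function f as
% (K f)(x) = \int k(x, x') f(x') \, dx'.
% Matching the polynomial expansion of the spectral density term by term,
% it can formally be written as a series of powers of the Laplacian:
\mathcal{K} = a_0 + a_1 (-\Delta) + a_2 (-\Delta)^{2} + \dots
% an infinite sum: a pseudo-differential operator.
```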
And if you don't follow the detail, that's totally fine.
But just to know that the Fourier transform is kind of the key player here.
So when you study the Laplacian and differential operators, you actually need to impose certain boundary conditions, mainly motivated by physical phenomena, but also, as I
said, the Laplacian is self-adjoint only under certain boundary conditions.
And remember that, in the context of matrices, self-adjoint corresponds to symmetric, which means that if we impose certain
boundary conditions on the Laplacian, it becomes symmetric in this sense, and the full spectral theory that we have for matrices can actually be translated to the
Laplacian.
But why do we want to do this?
Well, notice that here we're taking powers of the Laplacian.
We know there's a better way of taking powers of an operator, and this is through the functional calculus of the spectral theorem.
We can impose boundary conditions; a very common choice is the Dirichlet boundary conditions, which just say that we restrict the domain of the operator to
functions that are zero on the boundary.
They vanish on the boundary.
If you impose this, then this operator becomes
self-adjoint, which means it's symmetric with respect to the inner product on functions, which is the integral.
So where does the Hilbert space come from?
Well, because if you think about the function space defined by this boundary condition, then you can see that you can shift the Laplacian from one side of the inner product to the other, just as
you do with a symmetric matrix in the usual inner product formula.
The Hilbert space is actually the space of square-integrable functions that vanish on the boundary.
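Spelled out, that "shifting" is just integration by parts, which works because the Dirichlet boundary terms vanish (a sketch):

```latex
% For f, g vanishing on the boundary of the domain Omega:
\langle -\Delta f,\, g \rangle
  = \int_{\Omega} (-\Delta f)\, g \, dx
  = \int_{\Omega} \nabla f \cdot \nabla g \, dx
  = \langle f,\, -\Delta g \rangle,
% so -\Delta is symmetric (self-adjoint) on square-integrable functions
% satisfying the Dirichlet boundary condition.
```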
And one thing which is not for free, but works: most of the results that I showed you at the very beginning for matrices can actually be translated
to differential operators, and even to pseudo-differential operators, which means there's a version of the spectral theorem
that states that for such an operator you can find a basis of eigenfunctions, with corresponding eigenvalues, on which the operator is diagonal.
In this case, there will be an infinite number of eigenvalues, and they will grow up to infinity.
And the eigenfunctions associated with them will satisfy, or we can make them satisfy, a condition which is very similar to the one I showed in
the case of a matrix,
which is that they form an orthonormal basis.
This is always guaranteed for such operators.
But what is nice about the Laplacian is that we can actually compute this by hand because computing the eigenvalues of the Laplacian in one dimension is actually quite easy.
In one dimension, you need to set up a domain with a boundary.
You have a box between minus L and L, and you are trying to find
a function whose second derivative is minus lambda times the function, and which vanishes at the boundary.
And it's not hard to see that the sine function actually does the job.
You just need these normalization factors to make the orthonormality condition come through, and these other factors to make sure that the functions vanish on the boundary.
And actually you don't have just one function, but a family of them, indexed by integers.
So this j could be 1, 2, 3, 4, 5, up to infinity.
And all of these functions will satisfy the boundary conditions and will satisfy the eigenvalue equation for the Laplacian.
And the eigenvalues also, you can compute them.
And as you see, they grow as j squared, which means that they grow to infinity.
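For reference, these are the standard Dirichlet eigenpairs on the box [-L, L], the same formulas used by the HSGP approximation:

```latex
% Solve  -\phi''(x) = \lambda\,\phi(x)  with  \phi(\pm L) = 0:
\phi_j(x) = \sqrt{\tfrac{1}{L}}\;
            \sin\!\left(\frac{\pi j\,(x + L)}{2L}\right),
\qquad
\lambda_j = \left(\frac{\pi j}{2L}\right)^{\!2},
\qquad j = 1, 2, 3, \dots
% The \phi_j form an orthonormal family on [-L, L], and \lambda_j grows like j^2.
```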
So, if you have ever studied the HSGP approximation, you can now get a little bit excited, because you see that the basis functions are precisely these.
But now what we're trying to see is to understand where are they coming from.
And this is kind of where the magic comes in because I can use the functional calculus defined by the spectral theory of the Laplacian to compute all of these powers.
What I'm saying here, as in the matrix case, is that I can compute the Laplacian of a function just by taking this sum over the eigenvalues and these
little spectral projections, and applying this within the integral.
This is actually the same thing I had for matrices.
If I take powers of these things,
I know that I don't need to compute iterated integrals,
I don't need to chain integrals together, which would be a nightmare; I can just take powers of this red part.
That's why the functional calculus comes so handy because applying functions to operators is just applying functions to the eigenvalues.
And the last step would be: okay, you have the kernel operator, which can be written as an infinite sum of powers of the Laplacian, but you
do know how to take powers using the spectral decomposition and the functional calculus, and it's nothing else than taking powers of the eigenvalues.
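In symbols, that functional calculus step is roughly this (a sketch; the phi_j and lambda_j are the eigenpairs above, and the angle brackets denote the integral inner product):

```latex
% Spectral decomposition of the (negative) Laplacian with Dirichlet conditions:
-\Delta f = \sum_{j} \lambda_j \,\langle \phi_j, f \rangle\, \phi_j
\quad\Longrightarrow\quad
(-\Delta)^{p} f = \sum_{j} \lambda_j^{\,p}\,\langle \phi_j, f \rangle\, \phi_j.
% More generally g(-\Delta) f = \sum_j g(\lambda_j) \langle \phi_j, f \rangle \phi_j:
% applying a function to the operator means applying it to the eigenvalues.
```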
And with these two formulas,
if you identify what is in the square bracket, you see that the kernel, which is what you are interested in at the end of the day, can be written in terms of this sum
a0 plus a1 lambda_j plus higher powers of the eigenvalues.
But if you think about it, this is exactly the polynomial expansion of the spectral density.
Since we were expanding in omega squared, what you actually need to do in practice is evaluate the spectral density at the square root of the eigenvalues.
I know that is a lot to digest, but I really hope you get the main idea: it's really about identifying these two worlds through this decomposition.
In terms of a practical formula, what we have arrived at is
the fact that you can approximate the kernel, that is the Gram matrix, as a finite sum. Of course it's an approximation, but, by the way, it will converge to the true
kernel if we take enough terms.
And the only thing you need to do to compute kernels, which is very important for your Gaussian process formula, is to take a finite sum, take the spectral density
associated with the kernel, and
evaluate it at the eigenvalues of the Laplacian, which you know how to compute because all of these formulas are known.
So no matter how big the dimension, people have worked this out, and you just need to multiply this by the eigenfunctions, which again you know how to compute.
But what is important about this is that these eigenfunctions,
which in the one-dimensional case are the sine functions, do not depend on either the amplitude or the length scale of the kernel.
They don't see the Gaussian process at all.
The only information from the Gaussian process is passed through these weights.
This is what actually makes the computation feasible.
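Put together, the practical approximation being described is roughly this (a sketch; m is the number of basis functions kept):

```latex
% Kernel approximation:
k(x, x') \;\approx\; \sum_{j=1}^{m} S\!\left(\sqrt{\lambda_j}\right)\,
                      \phi_j(x)\,\phi_j(x').
% Equivalent weight-space (linear regression) form of the GP:
f(x) \;\approx\; \sum_{j=1}^{m}
      \left(S\!\left(\sqrt{\lambda_j}\right)\right)^{1/2}
      \phi_j(x)\,\beta_j,
\qquad \beta_j \sim \mathcal{N}(0, 1).
% All kernel hyperparameters enter only through S; the \phi_j are fixed.
```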
I know I'm getting a little ahead and close to
the end, but from the computational point of view: if you can compute this Gram matrix as this finite sum and pass it to the
multivariate normal model, another nice thing is that in this decomposition your operator is diagonal, and diagonalization
is what makes everything cheap:
you don't actually need to think about the multivariate normal distribution, but just about its components.
So in essence, you can estimate your function, your Gaussian process, as a linear regression, where you have all of the hyperparameters in the spectral density, and there's
nothing to do here beyond estimating the parameters, because for the spectral density you know the formulas.
You have the eigenfunctions of the Laplacian, which are easy to compute,
and you multiply this by normally distributed coefficients.
This is, for those familiar with it, the non-centered parametrization of a linear regression.
So these are the three components, and notice that they're coupled in a very, very nice way.
And the computational cost of this is way smaller.
It's of the order of nm plus m, where n is the number of training data points and m the number of basis functions that you use.
For the one-dimensional case, this is actually something that I do by hand in the notebook.
In the notebook, I use JAX to do everything by hand and then compare it with the PyMC implementation.
That can give you a very concrete feeling for how easy this is once you know the formulas.
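If you want a feel for that check without opening the notebook, here is a minimal sketch in plain NumPy (not the notebook's JAX code; it assumes the standard 1D squared-exponential spectral density and the Dirichlet eigenpairs above, with toy data):

```python
import numpy as np

# Toy 1D inputs, centered so the box [-L, L] contains them comfortably
x = np.linspace(-1.0, 1.0, 200)
ell, eta = 0.3, 1.0          # length scale and amplitude of the SE kernel
L, m = 2.0, 50               # box half-width and number of basis functions

# Dirichlet eigenpairs of -d^2/dx^2 on [-L, L]
j = np.arange(1, m + 1)
sqrt_lam = np.pi * j / (2 * L)                              # sqrt(eigenvalues)
phi = np.sqrt(1 / L) * np.sin(sqrt_lam * (x[:, None] + L))  # (n, m) basis matrix

# Spectral density of the 1D squared-exponential kernel
S = eta**2 * np.sqrt(2 * np.pi) * ell * np.exp(-0.5 * (ell * sqrt_lam) ** 2)

# HSGP approximation of the Gram matrix vs. the exact kernel
K_approx = phi @ np.diag(S) @ phi.T
K_exact = eta**2 * np.exp(-0.5 * ((x[:, None] - x[None, :]) / ell) ** 2)
print("max abs error:", np.max(np.abs(K_exact - K_approx)))
```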
Luckily, the PyMC developers have done a lot of work, including these
spectral density formulas and the eigenvalues of the Laplacian, so that we don't need to implement them ourselves but can just use them.
So, to recap, what we did is we started with the kernel.
For a certain type of kernel, which has to be isotropic, meaning it depends just on the norm of the distance between the points,
we can associate a spectral density, which has a polynomial expansion.
To read the coefficients of this expansion, we had to start from a different part of the world, which was the Laplacian.
Once you impose boundary conditions, you have a very rich spectral theory of the Laplacian.
In particular, you have the functional calculus, which tells you how to easily compute any type of function of the Laplacian, in particular, powers of the Laplacian.
The key thing is that,
since the Fourier transform of the Laplacian is a polynomial, you can match the polynomial from the kernel side to the polynomials on, let's say, the
right-hand side of this diagram, and glue them together.
And then, through the functional calculus, you don't need to compute powers of integral operators; you just need to compute powers of the eigenvalues of the
Laplacian.
Eigenvalues and eigenfunctions of the Laplacian have been studied for ages, which means this has been known in the books since we were all kids, so there's nothing
new there.
But the approximation formula really speeds up the computation, because we end up with a linear regression, up to some factors coming from the spectral density.
In the notebook, I also show how to do this by hand.
So I give you a plot of what these Laplacian
eigenfunctions look like, which, not very surprisingly, are sines and cosines.
And I do kind of some exercise of running kind of samples of this approximation formula.
This is in some sense a prior predictive distribution.
And there will be a lot of material coming up in the upcoming weeks and months about how to do this in PyMC.
But
if you know how to code a Gaussian process in the usual case, you just need to use the HSGP class, and this will take care of a lot of boilerplate for you.
So you don't need to compute anything.
But what is important here is that now you understand it, because this HSGP class has certain parameters, such as m,
which, as we have seen in the presentation, denotes how many basis functions to use.
And this L is the size of the box.
Usually we will center the training data so that it sits symmetrically within the box, but now you know where these things are coming from.
And you can still pass any supported kernel, and what PyMC is going to do inside
the code is look up the associated spectral density and evaluate it at the corresponding eigenvalues of the Laplacian, for which we know the formula.
So this can easily be written as PyTensor formulas.
And again, we can have the conditional method and do exactly the same thing to get kind of this out of sample predictions.
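For orientation, here is a hedged sketch of what that looks like with the PyMC HSGP class (the data, priors, and the m and L values are placeholders, not the webinar's notebook; details may vary with your PyMC version):

```python
import numpy as np
import pymc as pm

# Hypothetical 1D training data (placeholders)
X = np.linspace(0, 10, 500)[:, None]
y = np.sin(X).flatten() + 0.3 * np.random.default_rng(0).normal(size=500)

with pm.Model() as model:
    ell = pm.InverseGamma("ell", alpha=5, beta=5)   # length scale
    eta = pm.HalfNormal("eta", sigma=1)             # amplitude
    cov = eta**2 * pm.gp.cov.Matern52(1, ls=ell)

    # m basis functions per dimension, L = half-width of the approximation box
    gp = pm.gp.HSGP(m=[30], L=[12.0], cov_func=cov)
    f = gp.prior("f", X=X)

    sigma = pm.HalfNormal("sigma", sigma=1)
    pm.Normal("y_obs", mu=f, sigma=sigma, observed=y)
    idata = pm.sample()

# Out-of-sample predictions via the conditional, as with a regular GP
X_new = np.linspace(10, 12, 50)[:, None]
with model:
    f_new = gp.conditional("f_new", Xnew=X_new)
    preds = pm.sample_posterior_predictive(idata, var_names=["f_new"])
```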
And here we see that something like 20 basis functions
is already enough, and this samples super fast.
I just want to mention that if you really want to go into more detail about how to use this and hack it around, there's this prior_linearized method, which acts a little
bit differently, because it will actually give you
the spectral density and the basis functions.
So here you have the square root of the spectral density, and here you have the eigenfunctions of the Laplacian, and then you have much more freedom to do, for example, the
non-centered parametrization and all the funky stuff you want to do.
So this is already quite nice and it's nicely documented on the doc strings.
But again, even if you want to go very granular, you don't need to write the spectral density
functions yourself every time you do this.
This is going to be done by PyMC.
As you see, what is happening under the hood is that you have a linear regression in which the beta coefficients, which are normally distributed, multiply
the spectral density evaluated at the eigenvalues, times the basis functions.
Then you can generate out-of-sample predictions.
In this case, it's a little bit easier because you don't even need to use the conditional: since you're actually using a linear regression,
you can use the set_data method to just replace the data in the graph and draw posterior samples.
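Here is a rough sketch of that more granular prior_linearized path (argument names and centering behavior can vary between PyMC versions, and the data and priors are again placeholders):

```python
import numpy as np
import pymc as pm

X = np.linspace(0, 10, 500)[:, None]
y = np.sin(X).flatten() + 0.3 * np.random.default_rng(1).normal(size=500)
m = 30

with pm.Model() as model:
    X_data = pm.Data("X_data", X)

    ell = pm.InverseGamma("ell", alpha=5, beta=5)
    eta = pm.HalfNormal("eta", sigma=1)
    cov = eta**2 * pm.gp.cov.Matern52(1, ls=ell)
    gp = pm.gp.HSGP(m=[m], L=[8.0], cov_func=cov)

    # phi: fixed Laplacian eigenfunctions evaluated at the (centered) inputs;
    # sqrt_psd: square root of the spectral density at the sqrt-eigenvalues.
    phi, sqrt_psd = gp.prior_linearized(X_data - X.mean(axis=0))

    # Non-centered parametrization: the GP is just a linear regression
    beta = pm.Normal("beta", mu=0, sigma=1, size=m)
    f = pm.Deterministic("f", phi @ (beta * sqrt_psd))

    sigma = pm.HalfNormal("sigma", sigma=1)
    pm.Normal("y_obs", mu=f, sigma=sigma, observed=y, shape=f.shape)
    idata = pm.sample()

# Out-of-sample: no conditional needed, just swap the data and resample f
with model:
    X_new = np.linspace(10, 12, 50)[:, None]
    pm.set_data({"X_data": X_new})
    preds = pm.sample_posterior_predictive(idata, var_names=["f"])
```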
Alex, I think we run out of time.
The rest of the presentation, which is just to get a glimpse of the birthdays dataset, probably doesn't make sense to go through now.
But I'll share these slides with you and we have these notebooks on the PyMC examples website.
And at the end of the presentation, yeah, I'll have some references on where to look for more information about these topics.
And I'll stop now.
Well, yes, so first, thanks, Juan.
It's an impressive feat you just pulled off.
Well done.
So I think it depends on your level of energy.
We can definitely go through at least some parts of the birthday example,
because we record it anyways, so if people need to drop, they'll drop, but then they can watch the recording afterwards, and it's going to be on the YouTube channel.
So that's the first thing.
That depends on your time and your energy level.
Then, let's see if there are any questions.
Folks, do you have any questions on that last part, especially on the math stuff?
I don't see any questions, but you have the slides and the notebook.
Any questions on the code that you just saw, let us know.
And well, in the meantime, Juan, up to you.
You can present us the birthday example, at least from a high level, if you're up to it.
Yeah, yeah, absolutely.
So I think I still won't be able to go into details, and this will also be presented in the PyMC examples gallery.
But before going into this again, if you felt like you were lost, if you felt like this thing is out of your reach, please, I totally get it.
Even for me, today I was reading the paper and I was having doubts about certain details. But what I really wanted to do is to make sure that
people have some material to look into, and that at the very end it's linear algebra, so to say. Of course, when we talk about differential operators it becomes a little bit
more funky, but in essence the message, or the strategy, should be something the modeler ideally has some intuition about, because if
you need to tweak the L parameter or the m, and if something is not working,
and you go into the code to see whether, for example, you computed these eigenfunctions the wrong way, then knowing these things is always useful.
But of course, you don't need to know all of this theory to apply it to your specific problems.
So I just wanted to make that note: knowing all of this is not mandatory to really use this code.
All right.
so yeah, thanks a lot.
That's a very good point, Juan.
So yes, to build on that, and I'll expand a bit on that at the very end of the webinar.
But as Juan already hinted at, this is just the first part of the educational content we're planning on HSGP in the coming months.
Months, because
we could do that in days or weeks, but this is open source.
We're not paid for that stuff, so it takes longer than usual.
But we're getting through that.
As Juan was saying, I'm working with Bill Engels on a tutorial notebook that's more about the nitty-gritty of how you choose priors for the HSGP method that
Juan has just demonstrated.
I'm also working on another notebook, even more applied, where I'll go through the Bayesian workflow, building a bunch of different GPs on the same dataset and debugging
all the different GPs I built, because they failed until the last one.
But that's normal.
That's the life of a modeler.
And so that would be like at least two other webinars.
Bill would be here for the second one, and then that'd be just me.
So bear with us.
This is just the first part, but that's a foundational one, because Juan walked you through all the foundations, basically, of why that approximation is working and what it
is.
And then now you'll understand completely what you're going to see in the code that Juan is going to show right now.
So that's the first thing.
And then, second thing, a very good question from Pau,
who is wondering if this works for periodic kernels too, since they are not isotropic.
So perfect question in segue, Juan.
I'll leave that to you.
Yeah, so let's say that, by itself, the method that I showed here won't work. But nevertheless, the authors, like Aki Vehtari and co-authors, actually found a way of
doing it.
So it's not the same technique, but it can be done.
And actually, after the HSGP, the Hilbert space GP class
that Bill implemented, there was another pull request which actually did this.
So in PyMC, you can actually do it.
And it's just a combination of sines and cosines, which, if you think about it, is not really surprising, because
you can always use Fourier terms to approximate periodic signals.
So yeah, the short answer is no, not straight away, but you can do it, and it has already been implemented in PyMC.
So, because I know the purpose of this talk was more on the foundations,
and I know that Alex and Bill are working on fantastic material on the applications,
I just wanted to show a typical example, and especially map this to Aki's blog post where he uses the Bayesian workflow to build this model.
And also to show that these Gaussian processes don't just work on toy examples; they can be applied to real data.
So a typical example, and there's code available and there will be more content about it, is this birthdays dataset, where we have a time series of observations and we are
trying to model the relative births in the United States between 1969 and 1988.
So, what do you see from this plot?
Again, as a modeler, you do this in an iterative way, so you won't immediately go to the most complex model, but at least you want to get some structure.
At least how I see this, for example, when I look at this type of data, is that there's a long-term behavior or a long-term trend, which is just this slow development.
It goes down around the mid-1970s
and then keeps going up.
I also noticed that the variance of the signal is going up, which then hints that probably if we take a log transform to normalize this variance, it could help us model the signal
better.
We also see that there's a yearly seasonality, which you can see through this cloud of points becoming dark and light.
And probably there's other type of seasonalities within the data.
So, these are the things that we would like to extract.
And if we dive into a single year, you actually see that there's a within-year seasonality, and there are specific drops at, for example, the end of the year.
That part is seasonal, but there are also days like Memorial Day, Labor Day, and Thanksgiving on which,
for whatever reason, the birth rate is just not following the trend, probably because hospitals may be closed, I don't know.
You can hypothesize about it, but I guess the point is that, in the data, they look like an anomaly.
So kind of modeling this data point separately is going to definitely help the model.
And we also have a day-of-the-week seasonality: on average, the number of births decreases during the weekend.
That's probably also about hospital personnel and so on.
And other things you can see from the EDA are that, for example, this day-of-the-week pattern changes over
the years.
So if you want to model the day-of-week effect, say as a dummy variable, then you need to add something else which allows this effect to vary over time.
This is one of the things you will do in the iterative process of model development, but also by looking into the exploratory data analysis.
So I'm not saying this is the first thing you look into, but looking at the data can hint at which type of kernels or processes to put into
the model.
So really think about the EDA part and the patterns you want to capture.
So...
Within this long blog post by Aki, and also something that I tried to replicate in PyMC, we build a model, which is one of the models that Aki did. It's fairly complex
in the sense that it has many components, but it's quite reasonable if you think about the EDA.
We actually have a lot of data points,
so if you try to do this with the usual GP class,
it's going to take days.
So we have a first GP which is going to model the long-term trend, and for this we use a squared exponential kernel; this is the latent process that we learn
from the data.
We have the periodic kernel, and here's an asterisk because, back to the question, in spite of the fact that it doesn't have a spectral density, you can still work out an
approximation which can be fitted within this framework.
So this is for the yearly seasonality.
For the day-of-week effect we use a ZeroSumNormal.
I mean, this is something that, again, depends on the modeler and the approach.
I could have used a dummy variable for the day of the week, but this is just to hint at all the very cool distributions that PyMC has, where we care more about the
relative differences between the day-of-the-week effects.
But as we saw from the EDA, this actually changes a bit over time.
So what we did was take the ZeroSumNormal and multiply it by a Gaussian process coefficient, say an amplitude, so that we have this time variation, and this green plot
actually shows this Gaussian process which is multiplying the ZeroSumNormal.
So in spite of the fact that it's a little bit more complex, the model actually fits
quite well, and again, this is really going to be tied to what Alex is going to do about the Bayesian workflow.
But just to also hint you that you can use these kernels and processes as Lego pieces, and you as a modeler have the complete flexibility to do this in PyMC.
Another thing we did, or that I can suggest doing, is to model all of these special days,
including the holidays, and use a Student-t distribution to make the model sample better.
That's a little bit beyond what I want to talk about here.
But what I wanted to add with this part is that you can have Gaussian processes, and you can also have other components.
You can still have
other things: you can use BART, you can use whatever you want, because at the very end, if you use the prior_linearized method of the Hilbert space GPs, you are
essentially working with a linear regression, which makes this API quite flexible.
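To make the "Lego pieces" idea concrete, here is a toy sketch of how such components could be combined in PyMC (this is not Aki's model nor the example notebook; the data, priors, and parameter values are purely illustrative assumptions):

```python
import numpy as np
import pymc as pm

# Hypothetical daily data: t = time in years (centered), dow = day-of-week index
# 0..6, y = placeholder outcome (think log relative births). All illustrative.
n = 2000
t = np.linspace(-5, 5, n)[:, None]
dow = (np.arange(n) % 7).astype(int)
y = np.random.default_rng(2).normal(size=n)

with pm.Model() as model:
    # 1) Long-term trend: HSGP with a squared-exponential kernel
    ell_trend = pm.InverseGamma("ell_trend", alpha=5, beta=10)
    eta_trend = pm.HalfNormal("eta_trend", sigma=1)
    cov_trend = eta_trend**2 * pm.gp.cov.ExpQuad(1, ls=ell_trend)
    gp_trend = pm.gp.HSGP(m=[20], c=1.5, cov_func=cov_trend)
    f_trend = gp_trend.prior("f_trend", X=t)

    # 2) Day-of-week effect: ZeroSumNormal (relative differences between days),
    #    scaled by a slowly varying HSGP amplitude so the effect changes over time
    dow_effect = pm.ZeroSumNormal("dow_effect", sigma=1, shape=7)
    ell_amp = pm.InverseGamma("ell_amp", alpha=5, beta=10)
    gp_amp = pm.gp.HSGP(m=[10], c=1.5, cov_func=pm.gp.cov.ExpQuad(1, ls=ell_amp))
    log_amp = gp_amp.prior("log_amp", X=t)

    mu = f_trend + pm.math.exp(log_amp) * dow_effect[dow]

    # 3) Robust likelihood, e.g. to absorb holiday anomalies
    sigma = pm.HalfNormal("sigma", sigma=1)
    nu = pm.Gamma("nu", alpha=2, beta=0.1)
    pm.StudentT("y_obs", nu=nu, mu=mu, sigma=sigma, observed=y)
```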
Check out the PyMC examples: we will have a simpler version of this, which is more for introductory purposes.
I think Bill is going to
build a complete model to show that this can be done in PyMC with the existing implementation.
And actually, fitting this model, which is fairly complex, on around, I think, 7,000 data points runs in about 15 minutes on my local machine, which is quite reasonable
compared to the number of days you would otherwise need to wait.
I actually think it's unfeasible to do this.
with a normal Gaussian process implementation, again, because of the number of data points that you have.
Yeah, and I'll share this with you.
Yeah, it's amazing.
It actually makes a lot of these applications feasible.
Yeah, and the last thing I wanted to mention is about the references.
Yeah.
it actually makes a lot of GPs actually usable.
It's like, this approximation is a really big deal.
Maybe we should mention some limitations of the approximation.
They're related to the m and c parameters that we mentioned.
We'll go through that in the second and third
webinars in the series, not in detail here, but that's something to keep in mind.
And also, HSGP falters after three dimensions.
Basically, you can use HSGP with one dimension.
It works extremely well.
Two dimensions, very well too.
At three dimensions it starts to falter and basically doesn't work anymore.
So that would be, for instance, spatio-temporal data.
That probably won't work.
Spatial data will work.
Temporal data will work.
Something to add, Juan?
Before you go through the references... oh yeah, sorry, go ahead.
I just wanted to add a comment, because I do know about this limitation, but I haven't gone into the details of why it fails.
One thing to notice here, and this is just me thinking out loud, so please don't take me too seriously, is that knowing what we explained here might help us hack these
higher dimensions.
And why do I say this?
Because you have freedom in the boundary conditions,
which means that if your specific eigenfunctions are not working well in higher dimensions, then you could potentially choose other boundary conditions which make the Laplacian
self-adjoint with a discrete spectrum, because these are not the unique ones, and maybe make it work.
Another thing, for the geeks around here, is that I've been thinking about this:
okay, the Laplacian is the natural differential operator, because it gives you this discrete spectrum.
But when you think about the Fourier transform, you often care about the highest-order term of the operator.
And I wonder, for example, if we can take perturbations of the Laplacian to make this work, because again the highest-order term will be quadratic.
So we might
do the same, and if you can compute the eigenfunctions, then you can give it a try.
Another hypothesis that I have, which could be interesting, is that at the end of the day you need this box, this minus L to L.
But I think in some cases you might, and I don't know, this is me thinking out loud, try to get away from this, because at the very end, as I said, what you need is an operator with a
discrete spectrum.
If we have physicists in the room, you know that the quantum harmonic oscillator is a system defined on the whole real
line.
But with the potential of the harmonic oscillator, you actually get a discrete spectrum, and physicists have already computed the eigenfunctions of such an operator,
which involve the Hermite polynomials.
So what I'm trying to say is that, in the cases where these things might fail, one thing that I want to do is understand why they fail, and whether there's a way we could
modify the setup so that we can make it work.
Because we have chosen certain things that work, but I don't think they are unique, which means we might be missing some important possibilities if we, for example, change the
operator or the domain.
I just wanted to end with that: why it could be relevant to know about these things when things fail.
Yeah, fantastic.
Thanks a lot.
And so definitely one of the bottlenecks to answer your question is the number of basis vectors, because they are multiplied through the dimensions.
They are not added.
They are multiplied.
So if you have only 25 basis vectors per dimension, and you have two dimensions, that's already 625.
That's a lot.
So it's 25 squared.
If you have three dimensions, it's 25 cubed.
And that basically becomes the bottleneck.
Not only mathematically does that become a lot of basis functions, but you also then need a lot of memory on the computer
to hold everything and do all the computations at the same time.
I experimented with that.
That broke my kernel very fast.
That's also one of the bottlenecks when you add dimensions.
What I'm thinking is that this is still because you are using the Laplacian in Cartesian coordinates and multiplying across dimensions, but if you have, or expect, a certain
symmetry, you can write this, for example, in polar coordinates, and maybe it's enough to take only a certain angle or the radius; you can compute the eigenfunctions
in the radial coordinate, and that might be enough if you have a certain symmetry.
So if your problem has a certain symmetry,
you might reduce the cost, because the coordinates are irrelevant for the operator.
The operator can be written in polar coordinates or in any other coordinates.
So yeah, this is just me
trying to think about whether we can make it work.
I need to read much more about it, but I trust you that doing this with the usual combination across dimensions is going to explode your computer.
Yeah, it will.
But you're going to try.
If you want to see what your computer is doing, people, if you haven't seen a dying kernel, you can do that.
It's no big deal.
Maybe one, I have one question here from Tyler, which is very interesting.
And then you can go through the references quickly, and then we'll be done.
After a two-hour webinar, well done everybody, especially Juan.
So Tyler is asking: he's curious to hear what you think about comparing and contrasting HSGPs against B-splines, and when you might elect to use an
HSGP over a spline model.
And he's asking that because he's worked on this recently, trying to model a simple spline function, for example.
And he got nearly identical results.
So he'd be curious if you know about contexts that HSGP maps better to.
Yeah, I've actually thought about this question quite a lot, because at the end of the day, if you forget all about the math, what you have is sine
functions at different frequencies times a weight.
So you could essentially say, okay, you have certain basis functions and you're trying to learn the coefficients, which resembles splines quite a lot.
But nevertheless, the spectral density formula still captures a lot of information.
about how you want to model the data.
Which means that in many cases the squared exponential is actually too smooth for the data, and having this other Matérn kernel, I don't know if I pronounce it correctly, where
you control the degree of smoothness, actually makes your model fit better.
So you're passing intuition
through the kernel.
So on the one hand, you can have the smoothness.
On the other hand, you can have kind of the periodicity.
So you can still have these periodic kernels within these frameworks.
And you can have other type of kernels, for example, linear trends, all within this framework.
So at least in my experience, I like to think about kernels in terms of distances.
For the birthday dataset, for instance, I was thinking about these components, which type of kernels I would use, how the length scale could fit, the prior predictive, and the
prior selection, thinking about the meaning of the kernel. Whereas with splines, and I haven't worked with splines a lot, it feels like you just have basis functions
and you constrain them through a smoothing parameter.
At least when I'm trying to model something, I lose this control over the meaning of the kernels.
At the end of the day, they could give good results.
I'm not saying they won't, but at least for me, when I need to model something, my intuition is better when I work with kernels than when I work with splines.
So long story short, I don't think one is better than the other.
It's a matter of how you can pass the information that you know to the model.
I hope that makes sense.
Yeah, yeah, yeah.
And I have the same experience, actually.
And also, what I really love about GPs and kernels is that usually you can interpret them.
Like, you can interpret the parameters.
Usually, the amplitude has a meaning.
The length scale has a meaning.
And that's really cool, because you can relate that a lot to the domain experts.
And also, when you build the model,
That makes total sense.
And then afterwards, the interpretation of the results.
So I find that helps really.
It's not just like a weird hyperparameter that you set, and you don't really understand what it means.
Sorry, I love that.
That being said, I have less experience with splines, too.
But yeah, I would say I would have given an answer that's similar to yours, Juan.
And yeah, just one last comment about the Matérn family of kernels that you were talking about.
Yeah, that's the same.
I usually recommend starting with the Matérn 5/2 kernel, which is kind of in between the exponentiated quadratic and the Matérn 3/2.
So you have the Matérn 1/2, which is the wiggliest.
Then you have the Matérn 3/2, which is a bit smoother.
And then you have the Matérn 5/2, which is the smoothest of the three
Matérn kernels, but it's still wigglier than the exponentiated quadratic, and usually it behaves better, as you were saying, Juan, also because it doesn't make the assumption
that the functions are infinitely differentiable.
So it's less smooth.
In practice, we usually recommend working with that one, at least to start.
Yeah, I think that's all, Juan.
So yeah, now you can go through the references.
I just want to point a couple of them.
So you have the classic Gaussian process for machine learning book.
This is available online.
I actually bought it because I think it's a fantastic book and it's a great reference to have.
Then, in the PyMC examples, there's the mean and covariance functions notebook,
which I think also gives a lot of intuition.
I have a couple of notebooks about how
to go from linear regression to Gaussian process that might help the audience.
I have seen other cool resources which are not really PyMC related, but also insightful.
There's one by Michael Betancourt about Gaussian processes,
really a lot about the details of classical GPs, and Dan Simpson has a lot of posts about prior selection and GPs.
And as I mentioned,
doing prior selection for Gaussian processes is really a big topic.
For the Hilbert space GP approximation, I mentioned the first paper, the classical paper, which is the one about Hilbert space methods for reduced-rank Gaussian process
regression.
There's a follow-up paper by Aki and collaborators, where they focus on the computation.
They do a review of the method, but also give some hints
on how to pick this m, c, and L for specific cases.
And they developed this kind of expansion for periodic kernels.
And we have the content by Bill and Alex.
We had a PyMCon web series about this topic.
And there's a work in progress, which is a reference guide for Hilbert space GPs.
I've seen the PR, and it's honestly fantastic, because it covers
the basics and then goes into hierarchical GPs.
And this is, as I mentioned, something that the folks from PyMC Labs have used to model marketing efficiency.
These examples, I think, really open up a lot of opportunities to use this in practice.
And I'm just very excited about this content being publicly available;
hopefully it complements what we did today, because it really goes through the implementation details and how to make this actionable within PyMC.
And I think that's pretty much it.
And that's exactly one of the tutorials I was talking about.
It's going to be the second one in this series, where Bill Engels is going to come and do this kind of presentation.
There, it will be live coding, going through the tutorial, and answering your questions, folks.
And yeah, it's still not ready, right?
We're working on the PR as we speak.
We've spotted a few bugs in the implementation,
and also a few things we can improve in the user API.
So we're going through these changes when free time allows; I hope to be able to push that somewhat soon and have it your way very soon, folks.
Yeah, yeah, yeah.
It's definitely, it's definitely worth it.
And yeah, about the birthdays dataset, we have a couple of examples.
I should have mentioned that there's a NumPyro implementation on the NumPyro website, there's Aki's example, and there's this little implementation of mine that we're trying to improve to
bring it up to the standards of the PyMC examples library.
And last but not least, if you actually feel like learning about spectral theory on Hilbert spaces in general, I recommend this book.
You can actually click the link;
I don't think it's freely available, though.
And just to finish, thank you very much, Alex, for giving the opportunity for me to present this.
It was really fun and challenging to prepare this topic.
So thank you for the opportunity.
And if you have any questions, feel free to reach out to me, because I'm still on the learning path.
I consider myself still kind of a beginner.
So if I missed something, or if I
was confusing at certain steps that you want to discuss,
Yeah, drop me a line.
I'll be happy to talk about it.
Awesome.
Yeah, fantastic.
Thanks a lot, Juan.
Thanks a lot for all the work you did on that presentation.
And the notebook is also fantastic.
Definitely take a look at it, folks.
Definitely recommend it.
It's absolutely awesome.
And that will be in the show notes for this webinar's recording.
So I'll get to that in a bit.
Just a second: first, I put in the chat links to three Learning Bayesian Statistics episodes about Gaussian processes that I think will be very interesting for you folks if you want to
dig into these topics.
The first one is, well, actually not about GPs.
It's about the Bayesian workflow, and that's with Michael Betancourt.
So that's the author of some of the
very good blog posts about Gaussian processes that Juan put in the references.
I think this episode is really a fantastic one.
It was the sixth, number six, and now we are at 104.
So you will hear me being very, very embarrassing and absolutely ignorant.
So that's great.
You can see that as Juan, I am learning, definitely on the learning path.
So that one is definitely one of the listener's favorites.
And then you have one with Aki Vehtari, episode 29, I think, where we talk about GPs.
Aki is definitely one of the big names of the GP literature,
so definitely give it a listen if you want some pointers.
And finally, this week's episode, released yesterday, number 104,
where we talked about automated Gaussian processes. That's a very interesting use case where, instead of declaring the kernel, you try to learn
the kernel from the data.
And that's really super cool.
So hopefully that's partly the future of GP models, because then you can give the data to the computer,
it gives you
the kernel structure to use, and then you can write up your GP model.
That's pretty cool.
Kind of an assisted Bayesian workflow.
A software-assisted Bayesian workflow, let's say.
Okay, and now for the parting words.
So again, here you have a bunch of QR codes to follow the show and so on.
I'm not gonna go through all of them.
We've been there for a long time, but you have that here for reference.
You have a lot of people thanking you, Juan, in the chat.
Thank you so much, everyone, for joining, for staying that long.
You are definitely statistical warriors.
If you didn't know, you are officially nerds now, right?
So you have to call yourself that.
If you stay two hours on a GP webinar, you are an official Bayesian nerd.
That means you have to listen to Learning Bayesian Statistics episode 6, by the way, otherwise you are disqualified.
And finally, I forgot what I wanted to say, so we'll stop here.
But yeah, please get in touch if you have ideas for future webinars.
Tell me what you liked from today, what you'd like to see a bit changed.
And I'll see you.
very soon for a coming webinar.
I think it's going to be the one with Bill Engels, where we go through, let's say, the nitty-gritty of using HSGP, but from a very practical perspective. Today you had a very
foundational, mathematical one.
And then we'll take that and go through: OK, how do you do that in practice?
What are the pitfalls?
What are the good practices?
and how to use that in several models.
And then the last part in the series will be my GP workflow.
That's all, folks.
Thank you so much.
Thank you so much, Juan, for taking so much time.
I know it's late for you, so we're going to let you go now.
And well done, because that was the longest webinar.
So you now have the record, Juan, and we'll see if people can beat you.
All right, yeah, thank you very much.
I enjoyed it.