#134 Bayesian Econometrics, State Space Models & Dynamic Regression, with David Kohns
• Join this channel to get access to perks:
https://www.patreon.com/c/learnbayesstats
• Proudly sponsored by PyMC Labs. Get in touch at https://www.pymc-labs.com/!
• Intro to Bayes Course (first 2 lessons free): https://topmate.io/alex_andorra/503302
• Advanced Regression Course (first 2 lessons free): https://topmate.io/alex_andorra/1011122
Our theme music is « Good Bayesian », by Baba Brinkman (feat MC Lars and Mega Ran). Check out his awesome work at https://bababrinkman.com/ !
Chapters:
10:09 Understanding State Space Models
14:53 Predictively Consistent Priors
20:02 Dynamic Regression and AR Models
25:08 Inflation Forecasting
50:49 Understanding Time Series Data and Economic Analysis
57:04 Exploring Dynamic Regression Models
01:05:52 The Role of Priors
01:15:36 Future Trends in Probabilistic Programming
01:20:05 Innovations in Bayesian Model Selection
Thank you to my Patrons for making this episode possible!
Yusuke Saito, Avi Bryant, Ero Carrera, Giuliano Cruz, Tim Gasser, James Wade, Tradd Salvo, William Benton, James Ahloy, Robin Taylor, Chad Scherrer, Zwelithini Tunyiswa, Bertrand Wilden, James Thompson, Stephen Oates, Gian Luca Di Tanna, Jack Wells, Matthew Maldonado, Ian Costley, Ally Salim, Larry Gill, Ian Moran, Paul Oreto, Colin Caprani, Colin Carroll, Nathaniel Burbank, Michael Osthege, Rémi Louf, Clive Edelsten, Henri Wallen, Hugo Botha, Vinh Nguyen, Marcin Elantkowski, Adam C. Smith, Will Kurt, Andrew Moskowitz, Hector Munoz, Marco Gorelli, Simon Kessell, Bradley Rode, Patrick Kelley, Rick Anderson, Casper de Bruin, Philippe Labonde, Michael Hankin, Cameron Smith, Tomáš Frýda, Ryan Wesslen, Andreas Netti, Riley King, Yoshiyuki Hamajima, Sven De Maeyer, Michael DeCrescenzo, Fergal M, Mason Yahr, Naoya Kanai, Steven Rowland, Aubrey Clayton, Jeannine Sue, Omri Har Shemesh, Scott Anthony Robson, Robert Yolken, Or Duek, Pavel Dusek, Paul Cox, Andreas Kröpelin, Raphaël R, Nicolas Rode, Gabriel Stechschulte, Arkady, Kurt TeKolste, Gergely Juhasz, Marcus Nölke, Maggi Mackintosh, Grant Pezzolesi, Avram Aelony, Joshua Meehl, Javier Sabio, Kristian Higgins, Alex Jones, Gregorio Aguilar, Matt Rosinski, Bart Trudeau, Luis Fonseca, Dante Gates, Matt Niccolls, Maksim Kuznecov, Michael Thomas, Luke Gorrie, Cory Kiser, Julio, Edvin Saveljev, Frederick Ayala, Jeffrey Powell, Gal Kampel, Adan Romero, Will Geary, Blake Walters, Jonathan Morgan, Francesco Madrisotti, Ivy Huang, Gary Clarke, Robert Flannery, Rasmus Hindström, Stefan, Corey Abshire, Mike Loncaric, David McCormick, Ronald Legere, Sergio Dolia, Michael Cao, Yiğit Aşık and Suyog Chandramouli.
Links from the show:
David's website: https://davkoh.github.io/
David on LinkedIn: https://www.linkedin.com/in/david-kohns-03984013b/
David on GitHub: https://github.com/davkoh
David on Google Scholar: https://scholar.google.com/citations?user=9gKE8e4AAAAJ&hl=en
Dynamic Regression Case Study: https://davkoh.github.io/case-studies/01_dyn_reg/dyn_reg_casestudy5.html
ARR2 Paper GitHub repository: https://github.com/n-kall/arr2/tree/main
ARR2 StanCon talk: https://www.youtube.com/watch?v=8XBe2jrOKvw&list=PLCrWEzJgSUqzNzh6mjWsWUu-lSK59VXP6&index=29
ARR2 Prior in PyMC: https://www.austinrochford.com/posts/r2-priors-pymc.html
Nutpie’s Normalizing Flows adaptation: https://pymc-devs.github.io/nutpie/nf-adapt.html
Today, I'm excited to be joined by David Kohns, a postdoctoral researcher in the Bayesian workflow group under Professor Aki Vehtari at Aalto University.
With a background in econometrics and Bayesian time series modeling, David's work focuses on using state-space models and principled prior elicitation to improve model reliability and decision-making.
In this episode, David demos live how to use the ARR2 prior, a flexible and predictive prior definition for Bayesian autoregressions.
We show how to use this prior to write your own Bayesian time series models: ARMA, autoregressive distributed lag (ARDL), and vector autoregressive (VAR) models.
David also talks about the different ways one can generate samples from the prior to mimic different expected time series behaviors, and looks into what the prior implies on many other spaces than the natural parameter space of the AR coefficients.
So you will see this episode is packed with technical advice and recommendations and we even live demo the code for you so you might wanna tune in on the YouTube channel for this
episode.
And if you like this new format, kind of a hybrid between a classic interview and a modeling webinar, well, let me know.
and let me know which topics and guests you would like to have for this new format.
This is Learning Bayesian Statistics, episode 134, recorded April 24, 2025.
Welcome to Learning Bayesian Statistics, a podcast about Bayesian inference, the methods, the projects, and the people who make it possible.
I'm your host, Alex Andorra.
You can follow me on Twitter at alex_andorra, like the country.
For any info about the show, learnbayesstats.com is la place to be.
Show notes, becoming a corporate sponsor, unlocking Bayesian merch, supporting the show on Patreon: everything is in there.
That's learnbayesstats.com.
If you're interested in one-on-one mentorship, online courses, or statistical consulting, feel free to reach out and book a call at topmate.io/alex_andorra.
See you around, folks, and best Bayesian wishes to you all.
And if today's discussion sparked ideas for your business, well, our team at PyMC Labs can help bring them to life.
Check us out at pymc-labs.com.
David Kohns, welcome to Learning Bayesian Statistics.
Thank you very much, pleasure to be here.
Yeah, that's great.
I'm delighted to have you on.
I feel I could do a live show at Aalto University and just interview everybody one after the other, and then have one year of content, and then just go to an island and sip Mai Tais and earn a passive income thanks to you guys.
I'm sure we can organize that.
Yeah, full disclosure: I would not be able to live off what the podcast earns; that would not work at all. It's not a good business model. Don't do that, people. But if you can have fun, then yeah, do it.
Now, it's great to have you on. I don't know if that's how you pronounce his name, but I'm also gonna have Timo on the show in a few weeks.
Osvaldo has been here.
Aki, obviously. And I think Noah should come on the show one of these days to talk about everything he's doing.
So yeah, I need to contact him.
So anyways, today it's you, David.
Thank you so much for taking the time. I've been reading your work for a few weeks now, because you're doing a lot of very interesting things about autoregressive models, state-space models, how to choose priors, and so on.
So that's really cool.
We're going to talk about that in a few minutes, and you're going to do a few demos live.
um So if you happen to be in the chat because you're an LBS patron, um please don't be shy and introduce yourself in the chat, and then you can ask questions to David.
But before that, David, as usual, let's start with your origin story.
Can you tell us what you're doing nowadays and how you ended up working on this?
Yeah, thanks.
So pretty much my whole educational background is in econ.
So I did my bachelor's and master's, and later on my PhD, also in econ, but always with a flavor of econometrics.
So I was interested early on, already in my undergraduate studies, in statistical relationships.
Particularly back then, I was more interested in things like the relationship between debt relief allocation and, later on, the country's development.
So that involved a lot of what we call, in econ, panel data methods, which are really spatial types of models.
Then during my graduate studies, I was then more interested in time series models.
I really just loved kind of the simplicity and the mathematics of working through some discrete time series models.
And um that is, of course, widely applicable to many things in econ.
At some point, well, I got really interested in thinking about how you can apply higher-dimensional time series models to problems where you have maybe lots of data.
So especially like in finance and macroeconomics, you have a lot of situations where you have very short time series, but potentially a lot of explanatory factors.
And so then classical methods tend to be fairly weak in terms of power, but also then in terms of regularizing the variance sufficiently of the model to get useful predictions.
So then I really delved into Bayesian econometrics with, I suppose you could call him that, my first mentor, Gary Koop at Strathclyde University in Scotland.
So I did my graduate study in Edinburgh, and he, Gary Koop, was at Strathclyde, and I had the great honor of doing this Bayesian econometrics course with him.
And it was probably the best six weeks of my academic life at that point.
I just really loved his stuff.
He has a great website, by the way, with a lot of resources if people are interested in Bayesian time series econometrics, some panel data stuff as well, a lot of multivariate stuff in fact, so a lot of vector autoregressions, but maybe we can talk about that later too.
Yeah, basically, starting with the background that Gary gave, I delved further and further into Bayesian time series econometrics.
And that's pretty much the tradition I'm still in.
After that, I did my PhD, also in Scotland.
And there, my sole focus was then on Bayesian methods for time series modeling, and then also some modeling in the direction of quantile regression as well.
Okay, interesting.
Yeah, I didn't know you were that econ-heavy.
That's interesting.
That's a bit like, yeah, Jesse Grabowski has a similar background.
So I want to refer people to episode 124, where Jesse talked about state space models.
All of that is one of his specialties.
We'll talk about that a bit today again, but for more background information on that, listeners, you can refer to that episode as a prerequisite, let's say, for this episode with David.
And yeah, definitely, if you have a link to Gary Koop's material, feel free to add that to the show notes, because I think it's going to be very interesting to people, at least to me.
I love time series and vector autoregression stuff and so on.
And Jesse and I are working a lot to make the PyMC state space module better and more useful to people.
Yeah, if we can make all that easier to use, that's going to be super helpful.
Yeah, awesome.
Feel free to add that to the show notes.
And thanks for this very quick introduction.
That's perfect.
That's a great segue to just start and dive in, basically, because you have a case study for us today, and you're going to share your screen.
Maybe we can start with a quick theory of state space models, mainly geared towards what you're going to share with us today.
You can take it over, David; feel free to share your screen already, or a bit later.
So perhaps before I go into the state space specifics, maybe I can first comment on what we're still working on today.
And then I think that will give you some background, at least, on why we're interested in still thinking about state spaces.
Because part of the reason why I entered the research realm at Aalto was that Aki was working a lot on these kinds of Bayesian workflow problems.
So how to build models in various circumstances, how to robustly draw inference.
And one thing that was, I think, direly missing from also the research I was doing at the beginning of my PhD was how to safely build out these time series models.
Like, how do you set priors on things that you can interpret?
That then oftentimes allows you to add more complexity to the model without sacrificing predictions, or at least statistics that involve predictions.
And so one thing that we're working on at the moment, in a very focused sense, is this idea of predictively consistent priors, meaning that you start out with some notion of understanding about a statistic on the predictive space; that might be something like the R squared statistic.
This measures the amount of variance fit.
So this is a statistic between zero and one.
Often it's bounded in that space, for many models at least, and it measures the variance that the predictor term of your model is fitting.
So let's say the location component of a normal linear regression over the total variance of the data.
So the higher the r2 is, the better.
So how much variance can I fit as a fraction between 0 and 1?
And that kind of idea has been developed also in the Bayesian sense, where Aki and Andrew Gelman have worked out the methodology behind this Bayesian R squared, so a Bayesian interpretation, which really is just the posterior predictive variance of the predictor term of your model over the entire predictive variance, including also the error term and so on and so forth.
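For listeners who want to see this concretely, here is a minimal numpy sketch of the Bayesian R-squared quantity David describes: the variance of the predictor term over the total predictive variance, computed per posterior draw. The "posterior draws" below are made up for illustration, not taken from any fitted model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy posterior draws for a normal linear regression y = X @ beta + eps.
n, k, draws = 200, 3, 1000
X = rng.normal(size=(n, k))
beta_draws = rng.normal(loc=[1.0, -0.5, 0.25], scale=0.1, size=(draws, k))
sigma_draws = np.abs(rng.normal(loc=1.0, scale=0.05, size=draws))

# Bayesian R^2 per draw: var(predictor term) / (var(predictor term) + residual variance).
mu = beta_draws @ X.T                                   # (draws, n) predictor term
r2 = mu.var(axis=1) / (mu.var(axis=1) + sigma_draws**2)

print(r2.mean())  # a full distribution over R^2, summarized here by its mean
```

Note that the result is a distribution over R-squared, one value per posterior draw, rather than a single point estimate.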
And what we recognize is that this statistic is well understood in many domains.
So in econ, in the biomedical sphere, in a lot of social sciences, people usually have a model where they understand this notion of R2, this notion of variance fit.
So that goes even beyond just the classical normal linear regression case to general GLMs: there are certain definitions of R-squared that exist and that people are able to interpret.
And what we are doing a lot in our group at the moment, at least I'm working on it a lot, with Noah also, is looking into how you can set a prior on the R-squared and, from that point of view, derive or define the priors on the rest of the model.
So you start from a notion of understanding of R squared and perhaps some prior about this.
And given this, how can you find priors of all the other components in the model?
Yeah.
Yeah, yeah.
I really love that.
That's very interpretable.
And that's also really how you would define models most of the time you think about them.
Because anybody who's worked with a model with an AR component in there, and has done prior predictive checks, knows that these checks become crazy in magnitude if you have just an AR(2).
An AR(1) is fine with somewhat normal priors, but if you have an AR(2) component, and I encourage you, if you use Stan or PyMC, go to the Stan or PyMC website, copy-paste the code for an AR model, and then sample prior predictive samples from there with an AR(2). You'll see that if you use a Normal(0, 1) prior on the coefficients, the magnitude of the prior samples just becomes super huge with the time steps.
And that's the big problem, and one of the problems that you're trying to address with the ARR2 prior. The way you do it, I really love it, because it's also very interpretable and intuitive.
Yeah.
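Alex's point is easy to verify numerically. Here is a minimal sketch (not the Stan/PyMC template code he mentions) that draws AR(2) coefficients from Normal(0, 1) priors and simulates prior predictive paths; a large share of the draws fall outside the stationarity region, and those paths explode in magnitude:

```python
import numpy as np

rng = np.random.default_rng(42)

def ar2_prior_draw(T=100):
    """One prior predictive path of an AR(2) with Normal(0, 1) coefficient priors."""
    phi1, phi2 = rng.normal(size=2)
    y = np.zeros(T)
    for t in range(2, T):
        y[t] = phi1 * y[t - 1] + phi2 * y[t - 2] + rng.normal()
    return phi1, phi2, y

def is_stationary(phi1, phi2):
    # AR(2) stationarity triangle: phi2 > -1, phi1 + phi2 < 1, phi2 - phi1 < 1.
    return (phi2 > -1) and (phi1 + phi2 < 1) and (phi2 - phi1 < 1)

draws = [ar2_prior_draw() for _ in range(500)]
frac_stationary = np.mean([is_stationary(p1, p2) for p1, p2, _ in draws])
max_magnitude = max(np.abs(y).max() for _, _, y in draws)

print(f"stationary draws: {frac_stationary:.0%}, largest |y_t|: {max_magnitude:.2e}")
```

Roughly half the draws are non-stationary under these priors, which is exactly the prior predictive blow-up Alex describes.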
And then again, it's predictably consistent.
you have a notion of R squared.
And if you generate from your model, so if you don't condition your data, you just do this push forward distribution where you sample from your prior, plug it into your model,
generate predictions, then the prior predict of R squared will align with your prior expectations.
your prior knowledge of uncertainty of r squared, say, the shape of the distribution.
And that's exactly what we're doing in this line of research for time series models, in particular, stationary time series models.
Yeah.
Yeah, yeah, Yeah.
So thanks a lot for this background.
I think that's indeed very, very important.
And so now, do you want to dive a bit more into the state space models in the case study you have for us today?
Yeah.
Let's do it.
Awesome.
Let's go.
So for the people attending live, you'll be able to see David's screen.
If you are watching this episode on YouTube, you'll also see David's screen in the video.
Otherwise, if you are listening to the episode, well, for that part of the episode I encourage you to go on YouTube and check it out as soon as you can, because that's probably gonna be a bit easier to follow.
Alright, I'll just share my entire screen; I think that will be easiest.
Yes, we are on.
So this is the dynamic regression case study um that you see on the screen.
You'll have that, listeners, in the show notes of this episode.
So the link is in there.
It's on David's website.
And now David, you can take it away.
All right.
So.
Yeah, I think we covered some of the basics already with the R squared stuff.
You can define this for AR type regressions and MA and ARMA type models.
There are some special mathematical things you have to take into account: this time series structure implies a conditional variance, which you have to include in your prior definition.
uh But here we're looking at something that's even one step further.
So we go to a model that has as the target y_t, which is a scalar; we relate it to a set of covariates, so those are the x's, which here are of dimension k times one per time point t; and we have this unknown regression vector beta_t.
So, so far so good: this is basically almost the same as your normal linear regression case, but indexed by time.
The special thing about this model is that uh the coefficients themselves, the betas, they evolve according to a latent state process.
So this is the second uh row in equation one.
uh This says that the coefficients uh vary across time according to an AR1 process.
And this allows for the fact that the relationship
between your covariates and your targets may change over time.
So a famous example in econ is that the response of interest rates, which the central bank might set to do economic policy, to inflation, which is one of the main drivers of policy, changes over time, because maybe the targets of this relationship shift, or there are some extra things happening, like COVID, for example, which somehow distort this relationship for a little while.
And uh this time-varying process of how these coefficients evolve is then regulated, if you will, by an AR1 process.
And in fact, those people who know time series will notice that, since this beta vector is k times 1, you essentially have a vector autoregression here in the second line.
But what we do in the paper for the simplicity of the math, and also what I do in this case study, is to assume that this coefficient matrix, which is called the state transition matrix, phi, which is k by k, is diagonal.
So it's only non-zero across the diagonal component.
What this means is that each individual coefficient is only related to its own past value, not to the other coefficients.
Right, okay, okay, yeah.
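As a concrete sketch of the model in equation (1), here is a small simulation with a diagonal state transition matrix, so each coefficient follows its own AR(1). All the numbers (phi, state and observation noise scales) are illustrative, not taken from the case study:

```python
import numpy as np

rng = np.random.default_rng(7)
T, k = 300, 2

# Diagonal state transition: each coefficient only depends on its own past.
phi = np.array([0.95, 0.8])            # diag(Phi), one AR(1) per coefficient
state_sd = np.array([0.1, 0.05])       # sqrt of diag(Sigma_beta)
sigma = 0.5                            # observation noise sd

X = rng.normal(size=(T, k))
beta = np.zeros((T, k))
y = np.zeros(T)
for t in range(1, T):
    beta[t] = phi * beta[t - 1] + state_sd * rng.normal(size=k)   # state equation
    y[t] = X[t] @ beta[t] + sigma * rng.normal()                  # observation equation

print(beta.shape, y.shape)  # (300, 2) (300,)
```

Note how the observation y_t stays scalar while the k coefficient paths evolve as independent latent AR(1) processes.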
And if that were not the case, would we be in the presence of a VAR model then?
Vector autoregressive?
So it's still a VAR model, it's just assuming that um all the other...
coefficients are unrelated except for maybe this error term here.
This kind of gives you some further way how to impose non-zero correlation between the coefficients.
Right, yeah, yeah, yeah.
Okay.
So here, but here we assume that the different...
So we have k time series here that are modeled at the same time, right?
Right, so the target is still a scalar, but then the covariates are then k times 1, right?
And then the process for the covariate coefficients, that is then a vector autoregression.
Right, yeah.
So the beta t's are modeled with a vector autoregression, but here we impose the correlation between the betas, the k betas,
to be zero when it's not on the diagonal.
Well, it's implied by the structure of the state transition matrix.
Yeah.
So that means the latent process is dependent only on the previous version of the covariate, the previous value of the covariate.
Yeah, coefficients, exactly.
Yeah, and the k covariates don't interact, basically.
No, exactly.
This keeps the math nice and contained.
yeah.
And so that means we have k latent states here, and we have k latent states because we have k covariates.
Yes.
Correct.
So you can expand this in several ways.
You can also, and this is what we do in the ARR2 paper as well, allow for time-varying intercepts.
So you would have like another, let's say tau coefficient here, and then that could also have its own AR process.
This would then be closer to what you mentioned before, Alex, these structural time series models, where you have multiple state processes modeled simultaneously.
Right, yeah.
So I think I mentioned that off the record.
I'm going to say it again on the record.
Yeah, basically, here the idea would be to have each latent state, so each of the k latent states, being modeled with not only an autoregressive process as we have here, but maybe you have a local linear trend, and then you add to that an AR process to pick up the noise.
Because the issue of just having the AR process, when you are interested in out-of-sample predictions, is that the out-of-sample predictions of the AR are usually not very interesting, because they pick up the noise.
And so that's not really what you're interested in when you do out-of-sample predictions.
So here, if you have a structural time-series decomposition, you could be able to decompose basically the signal and the noise between these different processes.
And so here, yeah, each of your k states would be modeled like that, with one structural time series, but you would still have an emission.
So like the y_t's we see here; the data, in this literature, are usually called emissions.
um And so your emission would still be 1D, right?
It would still be a scalar emission.
That's correct.
Okay, cool.
You can extend that as well, so you can make y also multivariate.
That's a different beast, maybe we can talk about that later.
Yeah, yeah.
These beasts start to be very big models where you have covariation everywhere at the...
What is the...
I always forget the name of the second equation, so you have the latent state equations and the emission equation.
Yeah, like the process equation and the emission equation.
Is that the right term?
Well, every literature has their own definition.
I've heard that as well, emission.
I've never actually used it, to be honest.
So in econ, we call the y equation this one, the observation equation, and we call this the state equation for the betas.
Right, yeah.
Yeah, so I've seen emission and observation equation used interchangeably, and then the latent equation.
Yeah, good, so that people have the nomenclature right and clear.
Yeah.
OK, cool.
So that's all clear, hopefully.
So let's continue with the case study.
Right.
And so one of the big problems here is this unknown in the state equation: the betas are explained by the past betas plus another error term that's k-dimensional.
This has a covariance, which we call big Sigma subscript beta.
And these, in this case, I'm also just giving a diagonal structure, for the simplicity of everything.
They determine how wiggly the states are, because they inject noise into the state process. And the larger these variance terms are, so those are the diagonals of the error covariance of the state, the more variable the state process is.
There's a huge literature on how to set priors for these, because if you let them be fairly wide, then what you'll find is a horribly overfitting state space model, because you're essentially fitting all the noise in your data by making the state process as wiggly as possible.
Alex, I think you're on mute.
Right, sorry.
uh Yeah, yeah, that makes sense.
So basically, if you have too wide of a prior on the sigma from the latent state equation... And I think I've also seen this matrix in the literature, because it's often written in matrix form.
I hate the names of these matrices because they don't mean anything.
I think it's like F and Q and H and R.
Who invented these names?
It's terrible.
They tried to make it as inaccessible as possible for newcomers.
It's completely stupid.
Anyways, yeah, so you have these matrices on the location of the normals, so F and H; usually in the literature they are also called the weights of the processes. And on the noise side, I think they are also called drifts; there are a lot of different names for that.
So that's why I'm getting that out of the way for people right now, but basically, here we're talking about the noise of the latent state equation.
So this is the Sigma beta in your case study.
People would probably also see that in the literature as the matrix Q.
And so what you're saying is that if the priors on this matrix are too big, then basically your AR process will explain all the noise in your data.
And your observational noise, so the sigma on the emission equation...
The observation equation, exactly.
...which is the sigma in your case study, and which people will also see as, I think, R, the matrix R in the literature.
Everybody knows R stands for noise.
So yeah, then that means this matrix will be really small.
And if you just take that for granted, you would just interpret that as: oh, there is not a lot of noise in my observational process.
Yeah, correct.
So this is a scalar, just to be clear.
Yeah, exactly.
You know, typically what the previous literature does is it says: let's put an inverse gamma, that's what I call IG here, an inverse gamma prior on the state innovation variances, the diagonal of the state variance-covariance matrix.
Let's put an inverse gamma on this and be fairly "uninformative", in quotation marks: something like an inverse gamma(0.1, 0.1).
And I think the listeners of your podcast will probably immediately know: oh, this is a bad choice, because you have a very long tail along the positive reals.
And if your likelihood information that identifies the variation of the states is not very strong, then the prior will dominate and you'll end up with a huge variance on your states, and therefore overfitting.
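David's warning about the "uninformative" inverse gamma(0.1, 0.1) is easy to see by simulation; its upper quantiles are enormous, which is exactly what pulls the state variances toward overfitting when the likelihood is weak. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(3)

# InverseGamma(0.1, 0.1): if G ~ Gamma(shape=0.1, rate=0.1), then 1/G is IG(0.1, 0.1).
g = rng.gamma(shape=0.1, scale=1.0 / 0.1, size=100_000)
ig = 1.0 / g

# The tail is so heavy that the 95th percentile is astronomically large
# (and the mean does not even exist for shape < 1).
print(np.median(ig), np.quantile(ig, 0.95), np.quantile(ig, 0.99))
```

Putting this on a state innovation variance means the prior routinely proposes state processes that are wiggly enough to fit any noise.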
And I can recommend this paper in particular.
I'm hovering over it uh in the case study.
It's by Sylvia Frühwirth-Schnatter, a great econometrician and statistician, and her colleague Helga, who rewrite the state space process into its non-centered form, which allows you to put normal priors on the state standard deviations, which then feature in the observation equation.
This might sound a bit esoteric without seeing the math, but they go into much more detail as to why setting an inverse gamma prior on the state variances is a bad idea.
And we take this idea one step further, in that we say: okay, how about we start, in fact, from an R squared prior over the entire process.
So something that explains the variation of this guy, so the state and covariate contribution, over the variance of the entire data, because this is something that we can often interpret; say, we know our model explains, let's say, 60% of the variation in our target.
And then, from a prior on this, what is the implied prior on the state variances?
And just to be clear what too-wide priors will entail: the variance of this term, the predictor term in the observation equation, so x times beta, will dwarf the variance of the observation noise; in this case, because it's just a normal model, the total is this variance plus the variance of the observation model.
Yeah, yeah. So that's related to what we just talked about, when the variance becomes too wide.
Exactly.
Oftentimes you'll find that those overfitting models will in fact result in an R squared that is very close to one, basically saying that you're able to explain all of the variation of the data, and this is often highly unrealistic.
And particularly if you think about this with time series models. Let's just briefly go back to the AR, so the simple autoregressive model case: if you add more lags, so more information about the past, you wouldn't think that you can predict the future arbitrarily better, right?
Oftentimes, only the first couple of lags, or whatever the time series structure is, are good for prediction.
And then if you increase the number of lags more and more, you wouldn't think that you're going to explain more and more variance of future data, right?
So in that sense, uh setting a reasonable prior on the R squared is actually a good thing also with time series.
And this is kind of preempting some screams the audience may have; particularly those who are more trained in classical time series econometrics will tell you: R squared is not a good thing to look at for time series.
And I agree, when the model and data are non-stationary, because then the variance goes to infinity and this R-squared metric is not well defined.
But in the case where you have stationary time series, the variance will be strictly uh below infinity and therefore this R-squared metric again makes sense to use.
Yeah.
Okay.
But...
Sounds a bit like, yeah...
You could be R-squared hacking with that, basically.
Yeah, exactly.
I mean, that's what people are afraid of with this R-squared thing, right?
Because they understand from their classical training that if you just include more and more covariates, then by definition, R-squared is monotonically increasing with the number of covariates you include.
However, in the probabilistic sense, you also have a uh probability distribution, posterior probability distribution over your R-squared.
And here, you can regularize with the prior
away from this tendency.
Yeah, that makes sense.
And yeah, so if we think along the lines of what the R squared metric looks like, if we go through the math that we present in the paper, then we get this ugly-looking fraction.
And this is basically telling you that the R squared is a function, let me zoom in a bit, of the state variances, the state AR coefficients phi, and the observation noise.
And what we've done to arrive here is that we integrated out the data, so the x's and the y's, but also the state realizations themselves.
So you'll recognize that the betas don't appear here, but only the variance of the betas.
Yeah.
And the nice thing about this expression really is it's pretty much the total variance of your predictor term.
So that's Xt times beta t over the variance of your predictor term plus one.
And so if we wanted to set a certain prior, let's say a beta prior, on this R squared metric, then we can figure out by change of variables what the implied prior is on the state variances, on this kind of total variance term here.
Yeah.
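The change of variables David mentions can be sketched directly. With a Beta(1, 2) prior on R-squared (an illustrative choice, not the paper's exact specification) and the mapping tau^2 = sigma^2 R^2 / (1 - R^2), the implied distribution of the total variance term follows in closed form and matches Monte Carlo draws:

```python
import numpy as np

rng = np.random.default_rng(5)
sigma2 = 1.0

# R^2 ~ Beta(1, 2): pdf 2(1 - x), cdf 1 - (1 - x)^2 on (0, 1).
r2 = rng.beta(1.0, 2.0, size=200_000)

# Change of variables: tau^2 = sigma^2 * R^2 / (1 - R^2), so
# R^2 = tau^2 / (tau^2 + sigma^2) and P(tau^2 <= t) = F_Beta(t / (t + sigma^2)).
tau2 = sigma2 * r2 / (1.0 - r2)

t = 1.0
cdf_analytic = 1.0 - (1.0 - t / (t + sigma2)) ** 2   # F_Beta(0.5) = 0.75
cdf_empirical = (tau2 <= t).mean()
print(cdf_analytic, cdf_empirical)
```

The same change-of-variables logic, with the extra factors from the full R-squared expression, is what gives the implied prior on the state variances in the paper.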
And that's very cool, because then you can just basically define that prior on R squared in your model, right?
And then, I guess, just use that in the priors for the betas in the model directly.
Correct.
Yeah.
And so then, I guess, I think you give some recommendations in the paper, if I remember correctly: set the prior on R squared, then basically do prior predictive checks to see that it makes sense in your case, then go from there and fit your model.
Yeah, exactly.
In the paper, we like to recommend this Beta(1/3, 3) prior, parameterized here in terms of the location and scale of the beta distribution.
I think in PyMC you also have that coded up?
Yeah, the beta proportion.
Yeah, exactly.
So that would be familiar in that case, because it has a lot of the mass towards an R squared below 0.5.
It has a very gentle slope.
So if the likelihood is pulling you in one direction, you're most likely not going to overwhelm the likelihood too much with an aggressive slope on the R-squared space.
You're sort of weakly, let's say, regularizing toward lower R-squared values, and therefore less likely to have overfitting.
Yeah, yeah, No, that's very cool.
And so you're going to show a bit now the implementation on different data sets. But also, for people using PyMC, Austin Rochford basically coded up that prior in PyMC on his blog. I linked to the blog post in the show notes. That's a very, very good blog post, so I definitely encourage you to check it out. His blog post, though, is limited to one part of your paper; you do more than that in the paper, and we'll get to that in a minute, David. But yeah, that's a good introduction.
I think in his blog post, Austin mainly codes up the prior, then generates data based on three different data generating processes, and then checks that we can recover the parameters of the three different processes with the R-squared prior.
You do that, but you also do more than that.
And that's what we're also going to talk about today.
And so just briefly walking through the machinery, then we are able to set the uh prior process, how it would look like in the Stan program.
So we have then this variance term here in particular, which comes then from this whole R-squared machinery.
And if you look very closely at these two equations, so one is this R-squared definition and the other is the prior variance on the latent states, you'll see that there are two factors which, if they are included, allow you to get rid of most of this very unwieldy-looking stuff in the R-squared definition, and then allow you to isolate only the variance terms.
But yeah, there's more about this in the paper. I would encourage those who want to dig more into this to have a look at that.
But the only other important thing I want to mention at this point is that you have another part of this R-squared prior that allows you to decompose the variance. And this you can think of as determining the importance of the individual model components.
The states, right.
Exactly.
And what they do mechanistically is that they allocate the variance.
Right.
Yeah.
So basically which part, which states contributes more variance than another state.
Exactly.
Because basically, I think ultimately you can't really determine the exact value of the variance that's contributed by each state. You can't just...
You can't do it in absolute terms, only in relative terms.
You can get the proportion of the variance coming from that state, but you cannot really say it from an absolute perspective, I guess.
As in marginal?
Yeah, that would be hard.
You can make statements about the entire variance of all the states, and you can make a statement about relative variance, in a way.
And that's what this decomposition lets you do.
Yeah, exactly.
Yeah, because I think if you want the absolute decomposition, that's just undetermined. Because an infinity of different decompositions of the variances of the different states will give you the same total variance. So yeah, I think just the proportions are going to be identifiable, which is what this is doing. That's why you're putting a Dirichlet prior on this psi term.
Yeah, Dirichlet makes sense here because you're trying to find weights that allow you to decompose this variance, and the Dirichlet is just a natural prior on a simplex, really.
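Mechanically, the decomposition looks something like this (a hypothetical sketch of the idea, not the paper's code: a global R squared draw fixes the total predictor variance, and Dirichlet weights split it across states, so only the proportions are identified):

```python
import numpy as np

rng = np.random.default_rng(1)

K = 4                                # number of states
sigma2_obs = 1.0
r2 = rng.beta(1.0, 2.0)              # global R^2 draw; Beta(mean 1/3, precision 3)
tau2 = sigma2_obs * r2 / (1.0 - r2)  # total predictor variance implied by R^2
psi = rng.dirichlet(np.ones(K))      # simplex weights, one per state
state_var = tau2 * psi               # per-state variance allocation

# However the weights fall, the parts always add back to the total:
print(np.isclose(state_var.sum(), tau2))  # True
```

Any rescaling of the weights that kept the total fixed would give the same likelihood, which is exactly the identifiability point being made: the Dirichlet pins down proportions, not absolute levels.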
But yeah, I mean, you mentioned something about identifiability.
We're not taking hard stances on this.
It can be a problem, I think, in state spaces more generally.
Like how can you identify where the variance comes from?
But in general, I think putting a prior on the weights to decompose the variance makes sense.
You want the data somehow to inform also that, I think.
Yes.
No, for sure, for sure.
Yeah.
I mean, identifiability in general is hard for time series models. It's hard also if you use GPs on time series models. It's just that time series data is hard, and you don't have a lot of data, in a way.
You need a lot of covariates.
If you can have external covariates, that helps a lot.
But whether you're using state space models or GPs... I think GPs are even harder because they're semi-parametric, whereas with state spaces, you have more structure by definition.
But yeah, identifiability is always a big issue here, and the more informative data and priors you can have, the better.
And I'll mention that Arno Solin at Aalto has investigated the link between GPs and state spaces, and there's a very close computational link. You can find the posterior of a GP as if it were a state space.
Right, yeah.
And you can also think of the state space as being somewhat of a discrete approximation to the continuous GP. Because you also have, let's say, the variance of the GP, the latent function, in this case the state itself, competing with the variance of the observation model itself.
Yes.
Yeah.
Yeah.
And that definitely happens with GPs, right?
Yeah.
They are so flexible that they can pick up the noise too. So you have to be very careful with the priors.
And then also, if you add categorical predictors to that, it's very hard, because categorical predictors are not really predictors anyway, you know. I find they don't really add a lot of information; they just break down the model into different subsets. But if you can have continuous predictors informing the different subsets, that definitely helps. Because otherwise, yeah, the GP can fit anything, so it can definitely fit the noise in your different subsets anyway.
But yeah, I'm not surprised that state spaces generalize to GPs. It seems like everything is a GP in the end.
I'm pretty sure that's what Black Holes are.
They are just cities inside.
So yeah, let's continue. I think you can now go to the application part of that, right?
And you have an inflation forecasting example for us.
Exactly.
And there are some other priors in the literature. So those who are familiar with econ might recognize this Minnesota prior. Those who follow the shrinkage literature more generally will know the regularized horseshoe prior. So we're incorporating these here too, as part of the comparison.
Mm hmm.
Yes.
And then inflation forecasting.
So here is some very crude code. I'm not particularly happy about it, but I think it does the job. It loads data from one of the Feds in the US with the fredr R package. And I just have a lot of functions here, so let me just skip that.
The data are directly loaded from the St. Louis Fed website. I've changed it now locally to first download it and then load it, because I find that downloading from links is not the safest thing to do. So in this particular instance where I'm showing this, I'm just first downloading the data and then loading it.
And we have a set of 20 covariates.
So 20 covariates and then therefore also 20 states.
because those are the regression coefficients which we allow to vary over time.
OK, yeah.
Yeah, I think that's interesting.
Yeah, especially that plot here where you showed the data.
you have the outcome variable is inflation, right?
Correct.
And then you have 20 other time series.
And each of these time series are a covariate, right?
Yeah, each of these time series is just the covariate values, the x's in the state space equation I was showing above.
It's different data. It relates to financial market information; you have some subcomponents of inflation in here as well. Industrial production, for those who are in econ and macro, they'll know that industrial production is super important for explaining macro data movements.
So that's included here.
So if we look at just one of them, for instance industrial production, that would be, for each year... I think it's monthly data, right? For each month, what is the value of industrial production, whatever the value and the scale of it is. And so if you're looking at the screen, these are lines, but basically these are just points, and each point gives you that value.
And so for the corresponding value of industrial production, you have a corresponding value at that same time point of inflation, which is the outcome, the y of the observation equation.
And then the XTs are all the other variables.
So 20 of them in total, which is k in the equations we saw before. So here, k equals 20.
And so the matrix we talked about before, the phi matrix, that is also called the R matrix... the H matrix, sorry, in the literature, which is the matrix of the latent state process, could be a full-rank matrix, but here it's only a diagonal matrix. And the parameters in that matrix are called the betas.
And the betas are indexed by k and t.
Is that all correct?
Can we continue?
I think that was pretty much, yeah.
Okay, cool.
Yeah, so, you know, I think it's always a good point in the workflow to plot your data, just to know like, oh, is there maybe an outlier somewhere that looks fishy?
You know, you see, for example, here all the time series have usually like this S shape around COVID, this is 2020.
So you know that there's a lot of funky things expected around this time.
Yeah, and I'm guessing you're scaling all the variables, standardizing all the variables before feeding that to the model so that it's all in the same scale?
So here I'm following the recommendations of the St. Louis Fed. They have a set of very good researchers who look at what is the best transformation for the data such that they are stationary. Or, you know, weakly stationary at least. And generally, if you do econ analysis for macro time series data, I would recommend just following the recommendations of the statistical agencies. In this case, the St. Louis Fed.
Interesting.
um They are kind of the data authority on much of the uh US stuff.
Interesting.
Yeah.
Nice.
So here, what do you do? Like, what is the recommendation? Are you doing any pretreatment, or do you just follow their...
So they have codes. The codes mean different things: let's say a one-time difference, maybe a growth rate calculation, maybe you leave it entirely untransformed.
So it depends on the time series.
I...
Okay.
Okay.
Exactly.
Interesting.
Yeah.
Because these time series have very different scales.
So...
Exactly.
Yeah.
So yeah, that's definitely something I would be concerned with, especially when you give that to HMC.
And so the R-square prior, in fact, can be made robust to the scale of the data by including the variance of uh your covariate information in the prior itself.
So you can scale it properly.
Which is not done here, right?
I've not seen that in the equation.
Let me see.
I think for simplicity I assumed that the covariates all have variance 1, but in the model it would be then another fraction here divided by the variance of x.
But there's more on that in the paper. That's where we cover it in more detail.
Okay, so now we know what the data are looking like.
So just to motivate where time variation in the coefficients comes from, what I've done here is that, every 100 months starting from 1980, I do a simple univariate linear regression: just our target, inflation, against, let's say, industrial production. I save the coefficient value and roll until the end of my data availability. And what we would find, if there's indeed variation in the coefficients, is that we also find variation here.
And so that means basically you run univariate regression for each time point individually?
Exactly.
Like in a for loop?
Exactly.
That's exactly what this guy's doing.
So the model doesn't know anything about, like, time correlation. It's like you just run 1990 against 1990, then 91 against 91, and so on.
Yeah, exactly. I mean, it's not a statistical guarantee. However, if you would find that those lines are all just, like, straight, you know, then you probably are not going to find much variation, even if you do all the bells and whistles that we offer.
Yeah, yeah. No, I mean, that's a good check, right? Because the models we're talking about here are not trivial. Each time you need to do state spaces or Gaussian processes, it's not trivial.
So if you don't have time variation in your data, that's better.
Honestly, your life is going to be easier.
But yeah, that's interesting. I didn't know about that method. It's a good heuristic. You can just take a subset of your data, maybe a random sample, or just take every, I don't know, five or six months, and then you run a regression in a for loop. I mean, not even a for loop; you could just vectorize that with standard PyMC. But it's just an independent univariate regression, plain and simple. You could do that in brms or Bambi even.
Exactly.
If the regression coefficients come up very, very close to each other, then that probably means you don't have that much time variation. Here that's not the case. Here we can see the lines are very, very wiggly.
Exactly.
It's also part of the workflow.
I think you should always start simple.
Always start with a simple model.
See if you can find interesting relationships.
You can even start just by doing a ggplot and putting an lm through the data points, and just see: are there any kind of interesting dynamics you could pick up?
And this is doing this essentially 20 times.
Yeah, yeah, yeah.
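That rolling check can be sketched in a few lines (hypothetical Python, not the case study's R code): fit a univariate OLS slope in each window and see whether the estimates drift.

```python
import numpy as np

def rolling_betas(y, x, window=100, step=100):
    """Rolling univariate OLS slope of y on x over non-overlapping
    windows -- a crude eyeball check for time variation."""
    betas = []
    for start in range(0, len(y) - window + 1, step):
        ys, xs = y[start:start + window], x[start:start + window]
        xc = xs - xs.mean()
        betas.append(xc @ (ys - ys.mean()) / (xc @ xc))
    return np.array(betas)

# Simulated example: the true slope drifts from 0 to 1 over time,
# so the window-by-window estimates should drift upward too.
rng = np.random.default_rng(2)
T = 600
x = rng.normal(size=T)
beta_t = np.linspace(0.0, 1.0, T)
y = beta_t * x + 0.1 * rng.normal(size=T)
print(rolling_betas(y, x).round(2))  # roughly increasing estimates
```

If the slopes came out flat across windows, that would be the "straight lines" case where the fancier time-varying machinery is probably unnecessary.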
For that plot, it would probably be useful if you shared the y-axis between the plots, because they have different scales, and that would also show which covariates seem to be more variable in time than others.
Yeah, that's a good point.
I think for some you can already see that there's some significant variation, like industrial production. If you look at the scale of the data before, it was between, I think, 0 and 15, and the coefficient goes between 0.2 and minus 0.4. So there's some variation to be expected.
Nice.
And now, the Stan models.
Yeah. So I've hidden them below, so you can look at them, I think, when you have the time.
There also, we have a repo for the paper as well.
There, we have written a Snakemake pipeline. So it's maybe a little bit obscure if you haven't gone through Snakemake pipelines before.
But the Stan code is also there.
And here, I reproduced it for the dynamic regression, so the state space model that we're looking at here.
And that's at the end of this.
Yeah, and I've put that link in the show notes, of course.
All right.
And so here we're just setting up the models and sampling from them.
I've coded up an indicator there that indicates whether it's a prior predictive or a full posterior analysis. So, basically, including the likelihood contribution in the model block or not.
And we were talking in the beginning about what happens if you have fairly unrestricted priors on the variances of your coefficients in your model.
And this is exactly what's happened, what I'm showing here.
So the Minnesota and RHS, those are two popular shrinkage priors for time series.
If you sample from the priors, plug them into the observation equation, generate what would be the prior predictive y's, and then calculate the R-squared statistic, you'll find that these two models say, a priori, we expect to fit 100% of the data variation. Whereas if we look at our model, the AR2, here we have full control over what this distribution looks like.
And this is approximately this beta 1 3rd distribution.
It's also a nice way to check that your coding is correct. If this had an entirely different form, like a peak around 0.5, then you would also know: okay, something's wrong with my code.
Yeah, Yeah, for sure.
And that's what we were talking about at the beginning. That's what I really like about setting priors that way. Also, that's how you can see that setting your priors with priors other than the R squared is really weird. It's like, before seeing any data, you're telling the model to expect to be able to explain all the variance in the data with your latent state equation. So it's like saying, oh yeah, there is no noise in the data at all. It's possible, but I would bet it is very, very improbable.
Well, there can still be noise in the data, but it is dwarfed by the variance of your latent state process.
Yeah, it's gonna pick up everything, and I don't think it's a good modeling choice. I think the R squared process here in the prior distribution makes much more sense.
Yeah, yeah.
And if you know inflation, or if you know the inflation data in the US, you also know that it's a really hard time series to predict. So there's a whole literature about how hard it is to predict inflation. And we would in fact expect a lower R squared, something between 0 and 0.5. Like, if you're a specialist in econ, you would say, okay, one is not possible.
All right.
And so here, I'm still just generating from the prior. Just to motivate what the variance of the coefficients for each of these time series looks like, I just sample from the state process. This is how it looks for the AR2. You know, it's not informed by the data, so they all look approximately the same, distributed around zero. And this is how it looks for the Minnesota and the RHS.
And you're still plugging in the X information. You're not informing the prior with the likelihood at this point, but you see that the variance of the data also heavily influences the prior predictives here.
And so these are the betas, right?
Yes.
Yeah.
Yeah.
And you can definitely see, if you see the screen here, people, or if you're following along with the blog post: these are weird. These are weird prior checks.
These are weird prior checks.
Yeah, you wouldn't expect your coefficient value for a certain time series to range between minus 50 and 50 if the range of your target was, let's say, between 0 and 100.
Yeah, and also because that implies that other time series have zero contribution, and you cannot really control which ones have zeros, either. So it's quite bad.
Basically, the problem with that is also that it puts a lot of onus on the data to be very informative. And that might not be the case, especially with time series data, where all these models have a lot of parameters. So that's already a big responsibility for the model. And then if you put even less prior information in there, that means you need to squeeze even more information from the data, where the data in time series is already not necessarily the most informative. So that's piling up the complexities.
Yes, correct.
And good point, by the way. I think also for listeners: you can have very fancy priors and everything, but if your likelihood is very, very strong, really informative about the value of the parameters, it oftentimes doesn't matter so much what you're doing with the prior. So it can happen that the data information fully overwhelms the prior.
As you just said, you have k states and t time points. That means you have k times t parameters you're estimating. That's a lot.
At least.
And that's just the betas. But then if you start sharing information and so on, you add parameters to be able to do that partial pooling and so on.
And also, each time, that means you subset the posterior space, if you want, in a way. And so that means that each part of this subspace is only informed by one slice of the data. So it's not like you're taking the full time series and then sharing everything. No, you have this time point and this state, and you might end up having just one data point to inform that parameter in the end. If at all.
Yeah, if at all.
So priors matter in this setting, especially for a time series model. That's basically my point.
Because I've also discovered that with experience, right? And that's why, if you don't see any time variation, it's way better. Because then, if you ignore time, you can basically pool your data and aggregate it, and that increases your sample size, which increases the information that you have in the likelihood and decreases the importance of the priors.
Yeah, yeah, exactly.
And of course, you know, the nice thing about this R-squared stuff is that you're a priori saying that those states have to fight each other for the same variance. Like, we've upper-bounded the variance, so they have to fight each other for explaining the data, loosely speaking. So if one state is important, meaning significantly away from 0 in some sense, then another state has to give; it then has to have less variation.
And that comes from the Dirichlet prior.
Yes, correct.
Yeah, and this manifests.
So now we have the posterior distributions on the R squared. We can see that we have posterior shrinkage, so that's really good. They all go in the same direction, but it's just that the Minnesota and horseshoe priors were so biased towards one, their prior probability mass was so concentrated towards an R squared of one, that it is super hard for them to get away from it too much. Whereas the R squared prior is much more aggressive in saying that the latent states are not picking up too much of the noise.
Yeah, correct.
And maybe... I don't know if this is a good value of R squared. I'm not making a statement about this, but there's a big difference, and that's what's important. And we can verify whether this is good or not later with predictions.
Yes.
And so basically, what the R squared AR2 model here is saying is that the covariates, the latent states, inform much less of the variation in the data than what you would conclude if you were using the Minnesota or RHS priors.
Absolutely correct.
Yeah, nothing to add.
Very good.
Thanks.
Awesome.
Let's go on.
And you can also think, in time series, about some notion of R squared over time. This literally takes just the contribution of the states and covariates in terms of the variability per time point, and relates it to the total variability per time point. So this is: how much of the variance of the data can you explain at each individual time point?
And what those posterior series are saying here is that the Minnesota and RHS, which tended to have a larger marginal R-squared total over all time points, also show much more variability over time in R-squared.
Okay, yeah, that's interesting.
And that formula, I guess you implemented it in R, in the package somewhere?
Yeah, well, I've just coded it myself here.
I made a function that's up below, and I just call it.
um So this is the extract R2 function.
it's very easy.
You really just take a sample from the sample from your posterior, those beta.
you multiply it by the inner product of the...
oh, it was mistake with the transposes, by the way.
I'll fix that.
You multiply the inner product of the covariate vector per time point and relate this to that um quantity again and the observation noise.
This is a way how you can think about R squared over time.
Yeah, it's definitely something that needs to be there if you're using a package for that. Like, let's say, in the PyMC state space module; that's a function we'd like to have, basically.
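One plausible reading of that per-time-point quantity, sketched in Python (an assumption-laden illustration, not the paper's exact extract R2 function): the variance of x_t' beta_t across posterior draws, relative to that variance plus the observation noise.

```python
import numpy as np

def r2_over_time(beta_draws, X, sigma2_obs):
    """Per-time-point R^2 sketch.

    beta_draws: (draws, T, K) posterior draws of the states
    X:          (T, K) covariates
    sigma2_obs: scalar observation noise variance
    """
    fit = np.einsum('dtk,tk->dt', beta_draws, X)  # x_t' beta_t per draw
    signal_var = fit.var(axis=0)                  # variance across draws, per t
    return signal_var / (signal_var + sigma2_obs)

# Tiny synthetic check: one R^2 value per time point, each in [0, 1)
rng = np.random.default_rng(4)
beta_draws = rng.normal(size=(1000, 50, 3))
X = rng.normal(size=(50, 3))
r2_t = r2_over_time(beta_draws, X, 1.0)
print(r2_t.shape)  # (50,)
```

Plotting `r2_t` over time is exactly the kind of series compared across the Minnesota, RHS, and AR2 priors above.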
And uh same story, more variation with Minnesota RHS compared to AR2.
And then here are now the posteriors of the beta vector over time. So we have drawn our MCMC samples, we take the average over the MCMC samples, and then just look at the time series of the beta states. And so we see some variation with the AR2 being picked up. There's a lot of variation for all the time series, in a way, on a very similar scale, so nothing is fully dominating the variance.
So these are the betas.
These are the weights of the latent states.
Exactly.
And for those who are following along, TVP in those graphs refers to time varying parameters.
In econ, for these state space models where you have a state for the coefficients of a linear regression, we refer to those as time varying parameters, for whatever reason. I understand the reason, but it's a little bit too general.
And this is what happens with the Minnesota and RHS: the same picture, basically. A lot of the series are getting shrunk to zero, and then a couple of time series have a lot of variation, just to show you how this looks compared to the AR2.
Actually, a quick question that's a bit more theoretical, and I don't know if you'll be able to answer it, but what I'm wondering, maybe what I'm a bit confused by here, is: is that a state space model with discrete or continuous latent states here?
Discrete.
Yeah, OK.
Yeah, discrete.
But they are not mutually exclusive.
No, I mean, discrete is a subset of the continuous time series.
Right.
But it's not, so it's not an HMM.
It's not a hidden Markov model.
Is it?
Well, it depends how you define a hidden Markov model, in a way. So if you say that the hidden, or Markovian, process here is this discrete state-space transition, then it would be. But it's not in the sense of what you sometimes see, where you say, okay, we have five discrete states for the coefficients, and we draw inference on the location and magnitude of where the states are.
Right.
Yeah, for me, an HMM is more like that, where it's like we have discrete states, but you're switching from one state to the other.
It's like... let's say you have five states. At some part in the time series, the regime you're at, which dictates your emissions, depends on, well, an AR process, for instance, that belongs to state one. And then at some point, the regime switches to state two, and then it switches back to one, or goes to three or five, et cetera.
That's more like it; that's why I was saying mutually exclusive. Whereas here, the states are not mutually exclusive, literally, in the sense that the parameters can all be active at the same time. Like, you can have beta one positive for industrial production and also beta one positive or negative for AAA FFME here, which I don't know what that means, right? All the states can be active at the same time. And then the combination of them gives you the emissions, which in my mind is not really a hidden Markov model, but more like kind of a discretized linear Gaussian state space.
Yes.
And, well, okay.
I mean, the hidden Markov model can also be discrete, right? But what's not the case here is that, let's say you have 10 time points and you have 10 beta states, then it cannot be beta 1, beta 2, beta 3, beta 1. You're not repeating the same state along the time series. Every new time point implies a new state. They can be related, but there's no transition matrix which gives you the probability of going back to beta 1 after time point 1 has passed.
Yes.
Yeah, yeah.
Yeah.
Yeah.
So that's why it's really different in my mind. That looks much more like a linear Gaussian state space model to me, whereas the hidden Markov model is more something like categorical, not necessarily categorical emissions, but categorical states, at least.
But they're in some way related. I think the Hamilton time series book has some nice description of the relationship between these models. I read it at the beginning of my PhD. Don't quiz me on the details, but it's a cool read if you want to learn more about that stuff too.
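The distinction they are circling can be simulated directly (a hedged sketch with made-up numbers): a dynamic-regression state takes a fresh continuous value at every time point via an AR(1) transition, while an HMM revisits a small set of discrete regimes via a transition matrix.

```python
import numpy as np

rng = np.random.default_rng(5)
T = 200

# Dynamic regression: continuous state, a new beta_t every period,
# neighbours tied together by an AR(1) transition.
phi, sigma = 0.95, 0.1
beta = np.zeros(T)
for t in range(1, T):
    beta[t] = phi * beta[t - 1] + sigma * rng.normal()

# HMM flavour: three discrete regimes and a transition matrix; the
# chain can return to a previously visited state.
levels = np.array([-1.0, 0.0, 1.0])
P = np.array([[0.9, 0.1, 0.0],
              [0.1, 0.8, 0.1],
              [0.0, 0.1, 0.9]])
z = np.zeros(T, dtype=int)
for t in range(1, T):
    z[t] = rng.choice(3, p=P[z[t - 1]])

# ~T distinct values for the continuous state vs at most 3 for the HMM
print(len(np.unique(beta)), len(np.unique(levels[z])))
```

In the first case inference targets the whole continuous path; in the second, it targets the regime levels and the transition matrix, which is the "no transition matrix here" point made above.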
Yeah, I'm sure I'm confusing some people here, but I'm confused myself on that. Like, I'm still trying to really understand the difference. I know it's a nuanced difference, and maybe it doesn't really matter. But yeah, it's just for me to understand what the actual differences are.
I mean, just to recap: we're not drawing inference on a transition matrix which tells you the probability of going between states. It's just that you start at a state and you end at a state, and what happens in between can be fairly unrestricted.
Yeah.
It's more like, here, each of the k states is one-dimensional. So it's like tracking the position of a particle, for instance. That's what each state is doing: we're tracking the position of the inflation particle along the dimension of industrial production, for instance.
Yeah, correct.
And yeah, pretty much, that's what's going on here. Just to recap, there's a lot more variation in some states than in others, compared to the AR2, which has a more constant, almost, variance across all states.
And you might ask, well, which one is better for prediction? And it turns out that the AR2 is significantly better in terms of the ELPD diff.
Yeah, which is great.
I guess you were happy to see that.
Yeah, exactly.
Awesome.
Yeah.
Maybe one last question related to that. So, I linked to Austin's blog post. Can you tell us basically what the difference is between what Austin is implementing in the blog post and what you're doing in the paper? Because Austin is just doing one part; that blog post is just implementing one part of what you're doing in the paper. So can you make sure it is clear to people what the difference is?
Yeah, of course.
Thank you.
So the main difference is that Austin is looking only at a subset of the time series models that we define this R2 prior over. So in the paper, we have AR models, MA models, ARMAs. We have AR plus X, so independent covariates included with the AR regression. And we have some simple state space models. And what Austin did was take a subset of only the AR simulations, and look at the recovery of the true parameter values that he sets, according to what we do in the paper, with the R2 prior set over the AR coefficients. So there, there are no unknown states; it's all just y as the target, and then on the right-hand side of the equation you have lags of your target.
Yeah, because then it's just that the likelihood of y is an AR, that's all. The model is an AR, and the likelihood is conditionally normal.
Whereas something that is more practical is what we were talking about at the beginning, where you would have y as a normal emission, as you have in the case study, but then the observation equation could depend on each state being a structurally decomposed time series with an AR process. So, local linear trend plus AR, and you would use the R squared prior on the AR coefficients.
Yes.
Well, I mean, in the state space models, actually, the R squared prior is not set on the state-space AR coefficients, but on the state variances.
Because that is the main determinant for the variability.
okay.
And how did you call the covariance here in your case study?
The sigmas.
So in the literature I know, it's the R matrix, the variance of the state equation. And here you call that the sigma.
Capital sigma underscore beta.
Sigma betas.
Yes, exactly.
Cool.
Awesome.
Great.
So thank you so much, David, for that in-depth case study. Damn, that was good.
And I think that was a first on the show.
So thank you so much for doing that.
You listeners, let me know what you thought about that. I really like that kind of hybrid format content. I think it's more hands-on, and I think it's very practical. That means you guys have to check out the YouTube channel maybe a bit more, but...
But I'm fine with that.
So yeah, that was at least super cool to do.
So thank you so much for that, David.
I think you can stop sharing your screen now.
And I've already taken a lot of your time, so I still have a lot of questions for you, but I'm going to start wrapping this up, because it's getting late for you.
But maybe what I'm curious about is maybe for...
you know, your future work.
uh Like, what do you see as the most exciting trends or advancements in your field?
And also where, where do you see the future of probabilistic programming heading?
Of course, you mostly work on Stan.
But you also work on some Python now, thanks to Osvaldo being there with you, you know, spreading the dark energy of the Python world.
Thanks, Osvaldo.
Yeah, so basically, I'm curious to know where your head is at here, where your future projects are.
Yeah, I think there's a lot that excites me about our research agenda at Aalto, but also others.
What excites me in our group, and with the people we work with more generally, is that we're still very actively thinking about how we can set priors on things we have expert knowledge about:
summary statistics, something about the predictive space, and what these priors then imply for all of the coefficients we have in the model, where we typically just go ahead and set normal(0, 1) priors, you know.
That is still under active development.
So we have, let's say, the simple time series stuff covered to some degree, but there's so much more to be done in time series, even with multivariate models.
There are ways to define this R-squared setup also for multivariate time series.
I think that's really cool and has a lot of policy applications as well.
Because, you know, central banks and so on who do the econ policy for a country, they often know that, well, everything is related to each other.
If you're modeling inflation, you're also going to model GDP and so on and so forth.
And, you know, doing this jointly is really the way to go in the end.
And these priors, I think, can also be
very good for those kinds of questions.
No, for sure.
In the end, everything is a vector autoregressive model.
I know I'm preaching to the choir, but I would tend to agree, at least approximately.
Yeah, yeah.
I mean, yeah.
Basically, the limitation is often
the computational bottleneck, right?
But honestly, almost all the time you would want uh vector autoregressive processes on the observation equation and on the latent state equations.
Most of the time you have correlations everywhere and you want to estimate that.
The problem is that we often don't do that because it's just impossible to fit.
But ideally, we would be able to do that.
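The VAR setup being described can be sketched in a few lines; this is a toy simulation with made-up coefficients, not anything from the episode's models:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-variable VAR(1), y_t = A @ y_{t-1} + eps_t, with correlated
# shocks; think inflation and GDP growth feeding into each other.
A = np.array([[0.5, 0.2],
              [0.1, 0.6]])       # made-up, stationary coefficients
Sigma = np.array([[1.0, 0.3],
                  [0.3, 1.0]])   # correlated shock covariance
L = np.linalg.cholesky(Sigma)

T = 4000
y = np.zeros((T, 2))
for t in range(1, T):
    y[t] = A @ y[t - 1] + L @ rng.normal(size=2)

# Multivariate least squares recovers A (flat-prior posterior mean):
# the rows satisfy y_t = y_{t-1} @ A.T, so solve for A.T and transpose.
B, *_ = np.linalg.lstsq(y[:-1], y[1:], rcond=None)
A_hat = B.T
print(np.round(A_hat, 2))
```

The off-diagonal entries of A and Sigma are the cross-correlations "everywhere" that make the joint model attractive, and also what makes the parameter count, and hence the computation, blow up as more series are added.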
Yeah, exactly.
And, you know, there's a lot to be done there still.
And we're also still looking a lot into workflow: the prior is one thing, but a whole other aspect is model selection.
So we're also very excited about a project where we're investigating the question of when selection is necessary, if you have different priors,
to fulfill your goal in terms of prediction, in the first scenario.
But even for causal analysis, this is an important question.
How do you set the priors, and do you need selection, to produce reasonable predictions for the treated versus the non-treated,
when you have lots of covariates or other structure in your model?
So we're working on that also.
I think, you know,
fun results are going to come out of that.
What do you mean by selection here?
Selection processes, selection bias, or is that different?
More like variable selection, or component selection.
So there's some stuff like projection predictive inference, which does selection based on whether you can find a surrogate model that gets as close as possible to a full model, like a
Gaussian process that is hard to compute.
And, you know, statistical folklore tells you that if things get too hard, as in you have too many components, you should do selection,
because then you implicitly decrease the variance of predictions, since you're focusing only on a couple of things in the model.
Well, what we're kind of saying is that that's not necessarily true if you have good priors, and understanding when that statement is in fact true and when it is not so true is an interesting question.
Because, let's say, in those causal analyses where you have randomized controlled trials, where a drug is being administered to one population randomly or not,
does it then make sense to, let's say, use an R-squared prior, which implicitly says the treatment effect is correlated with other parameters you're estimating?
And is that a good choice, you know?
What we're saying is: it depends.
And we go into detail about when R-squared priors and priors like that are good, when they're bad, and when selection is needed and when not.
Nice.
Yeah.
Yeah.
Super interesting.
Let me know when you have something out on that.
I'll be very interested to read about it, and maybe talk to you again about it, because that sounds very important and interesting.
So, yeah.
Yeah.
I'll be very curious about that.
Maybe one thing about other people's work.
I was very selfish talking about our work, but I think there's some really cool stuff I'm excited about that comes out of groups around, like, Paul Bürkner and so on, which are
also picking up work on normalizing flows and amortized Bayesian inference.
I think that stuff is going to be really good going forward, because you can simplify computations.
You can reuse models for huge estimation tasks.
I think this will make the general
Bayesian computational workflow much easier in the future.
So I think using this, maybe integrating it with the knowledge we're working on about how to model and then how to do computation; those things are interdependent, I
think, for the future.
I'll be back to see what comes out of that.
Yeah, completely agree with that.
And I'll refer listeners to episode 107 with Marvin Schmitt about amortized Bayesian inference.
That was super interesting. I haven't been able to use it in production yet, but I'm really looking forward to being able to do that, and to have an excuse and a use case for it,
because this looks really cool, and I completely agree with you that it has a lot of potential, for that and everything Marvin and the BayesFlow team and Paul
Bürkner are doing on that front.
Even anything Paul is doing is just always super brilliant and interesting.
And what I love is that it's very practical.
It's not research that's like, okay, that's cool, but
I can't even do that because the math is too complicated and it's not implemented anywhere.
His research, and your group's research at Aalto, is what I really like.
It's always geared towards practical application, and not just, yeah, that's cool math, but
nobody knows how to implement it.
So that's really cool.
And well done on that.
I think it's amazing.
And talking about normalizing flows, I'll also add to the show notes
nutpie, from Adrian Seyboldt. He was also on the podcast, and I will link to his episode with me, where he came and talked about ZeroSumNormal and nutpie, which is an
implementation of HMC, but in Rust, so it's much faster. You can use it with PyMC and Stan models, and now
he did something very cool in nutpie: you can use normalizing flows to adapt HMC.
So basically what this will do is first run a normalizing flow and train a neural network with that.
And then once it has learned the way to basically turn the posterior space into a standard normal, it will use that to
initialize HMC and run HMC on your model.
And so, of course, you don't want to do that on a simple linear regression, right?
It's overkill, because it's going to take at least 10 minutes to fit, since you have to train a neural network first to learn the transformation of the posterior space that
would make it standard normal.
But if you have very complex models with
a very complex posterior space, things like Neal's funnels, banana shapes, and so on, where it's very hard to find a reparametrization that's efficient, then trying the
normalizing flow adaptation of nutpie could be very interesting to you.
And literally, if that works in your case, it can make your MCMC sampling
much faster and also much more efficient.
So that means much bigger effective sample size.
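To give a feel for what that learned transformation buys you, here is a toy version using Neal's funnel, with the known analytic reparametrization standing in for the learned normalizing flow; nutpie's actual adaptation is far more general than this sketch:

```python
import numpy as np

rng = np.random.default_rng(1)

# Neal's funnel: v ~ N(0, 3), x | v ~ N(0, exp(v / 2)).
# Its narrow neck is notoriously hard for plain HMC.
v = rng.normal(0.0, 3.0, size=10_000)
x = rng.normal(0.0, np.exp(v / 2))

# The "flow" here is just the known analytic transform; a real
# normalizing flow would learn an equivalent map from draws.
z1 = v / 3.0
z2 = x / np.exp(v / 2)

# In the transformed space both coordinates are standard normal and
# independent: exactly the geometry HMC samples efficiently.
print(round(z1.std(), 2), round(z2.std(), 2))
```

The adaptation in nutpie automates finding such a map when no analytic reparametrization is known, which is why it shines on funnels and banana shapes.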
So I will definitely put that in the show notes, because I think it's something people need to know about. And, well, try it out, so that Adrian can know this is
working out there in the world.
And I know he loves that.
Awesome. Well, David, that's cool.
Anything you want to add that maybe I didn't ask you or mention, before I ask the last two questions?
I don't know, I think we've covered a lot of ground.
I think there's a lot of cool stuff here; it's probably impossible to cover it all. I do want to make an honorable mention of all the work
that goes into prior elicitation.
I know that you're also interested in that, Alex, but there's also work coming out of Helsinki and Aalto which is looking into how we can go from knowledge about effects of
covariates to priors.
And we have tools that can work very well for simple cases, but what if you have correlated effects?
Like, let's say, I don't know, age and income predicting, I don't know, school outcomes or whatever, right?
Those things are often highly correlated, and then you go from, like, a conditional expectation on predictions to the prior.
So let's say you have this age and this income: how does that relate to education outcomes?
And specifying the prior in that way, I think is super interesting.
And there's a lot of cool stuff also being developed that helps specify these priors with artificial intelligence: AI trying to go from a very prose-based, conversational way
of talking about what we want to put a prior on, to then actually implementing it in things like Stan and PyMC and so on.
I think that's
a lot of the future that's awaiting people who are maybe not so interested in learning Stan in detail, but still want to do cool Bayesian inference.
And these kinds of things, I think, will make it accessible to a much wider audience than it is right now.
Yeah.
Yeah.
I mean, definitely.
I mean, even for us, you know, who are, like, power users of the software, that would make my modeling workflow way faster,
because most of the time that's a much more interpretable and intuitive way of defining the priors than trying to understand what the ARR2 prior on the AR process of
my structural time series model is going to mean.
The only way I can understand what this means right now is a cumbersome iterative process of changing one knob at a time and seeing how that impacts
the prior predictive checks, and maybe an interesting metric, like the prior R-squared or something like that.
That's the only thing that's really reliable right now.
And it feels like it can be automated, for sure.
Because it's a lot of cumbersome back and forth, basically; probably something an AI assistant could make faster.
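That knob-turning loop can be sketched like this; the model, the scales, and the plain prior-R-squared summary are all illustrative assumptions, not the case study's actual setup:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical setup: n observations, p standardized covariates.
n, p = 200, 10
X = rng.normal(size=(n, p))

def prior_r2_draws(coef_scale, n_draws=500):
    """Draw coefficients from N(0, coef_scale^2) and return the
    implied prior distribution of R^2 (unit residual variance)."""
    r2 = np.empty(n_draws)
    for i in range(n_draws):
        beta = rng.normal(0.0, coef_scale, size=p)
        signal = X @ beta
        y = signal + rng.normal(size=n)
        r2[i] = signal.var() / y.var()
    return r2

# Turning one knob (the coefficient prior scale) and checking what it
# implies for the prior R^2: the manual loop described above, automated.
print(round(prior_r2_draws(0.1).mean(), 2))  # tight priors: small R^2
print(round(prior_r2_draws(1.0).mean(), 2))  # wide priors: R^2 near 1
```

Each call is one pass of the "change a knob, check the prior predictive" cycle; R-squared-based priors invert this, letting you state the desired R-squared distribution directly.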
Yeah, but still it's kind of nice that you still have to get your hands dirty in a way.
not everything is too automated because it does let you learn a lot.
But the problem still remains that not everyone has the time, inclination or interest in getting their hands that dirty.
Yeah, yeah, no. And also, everything has a trade-off, right?
So the time you spend on that is not time you're spending thinking about expanding your model,
making it more expressive, and so on.
So yeah, if we can make that easier, that would definitely be amazing and high-impact.
Awesome.
So I need to let you go, David.
We've already been recording for, like, an hour and a half.
So I don't want to take too much of your time.
You'll come back on the show for the future work you have, for sure.
But before you go, let me ask you the last two questions I ask every guest at the end of the show.
So if you had unlimited time and resources, which problem would you try to solve?
This is a really weighty question, and I feel like there have been such good answers in the past, so it's really hard, I find, to add to any of that.
But, you know, let's say that with infinite resources and everything, I've done all the things that we should do for humanity.
All right, so we've been the good guys already, I think. What I would do is go back to one of those core econ things that are important, namely:
how do you set policy such that you maximize the utility of a nation, or maybe all nations?
You know, one particular question in econ is: how can you achieve the most amount of good for all people?
And this is a really difficult question, because there are always so many trade-offs in policymaking.
You do one thing, you improve life for one group, and you decrease the
benefit for another group.
And I think if I had infinite resources, I would try to find the optimal policy rule that would satisfy the condition of the best amount of welfare, whatever that definition is; by
the way, I guess that needs to be conditioned on a philosophy, across all time periods.
And then basically have a fairly automated rule
that kind of keeps running whenever any economic actor takes any decision. And what would happen is that you would basically have, like, a steady-state process for
the entire nation's economy, without any significant variation.
So, like, policymaking would always be such that we would all have kind of the best economic life possible, within the confines of the chosen philosophy and the constraints of
resources.
That's fine.
Yeah, I love that.
Very nerdy answer.
And I really appreciate that.
Thank you.
I appreciate the effort.
I love that.
And I definitely resonate with that.
Although I would argue we're very far from that.
So you would need to do a lot of work.
Good thing you have unlimited time.
And second question: if you could have dinner with any great scientific mind, dead, alive, or fictional, who would it be?
So, again, that is too much of a weighty question, so I'm just going to sidestep it a little.
I think there are too many cool people I would like to talk to, but someone who is alive and who I would really like to have dinner with is Chris Sims.
He's a Nobel laureate in econ.
He was in fact one of the initial researchers on vector autoregressions, Alex.
So if you're looking into vector autoregression stuff, then Chris Sims is, like, one of those OG researchers, in a way.
And.
He won the Nobel Prize for related work, more policy-oriented stuff, but he's done a lot of really interesting time series econometrics.
And I would love to just have a conversation with him over dinner, where we talk about how we can integrate, let's say, the work on the R-squared stuff and, you know, safe Bayesian
model building with his time series knowledge.
I think that would be such a cool, such a cool thing to do.
And in fact, he
gave a lecture recently, in the past two or three years, where he was suggesting that people should look at econ problems with multiple lenses.
This goes a little bit into this kind of multiverse idea of statistical modeling, and acknowledging that there's a workflow you have to work through.
There's not always one solution for every statistical problem in econ, which is kind of the dogma, you know?
I think
Working with him on that would be such a cool thing to do.
Yeah, definitely.
ah And I've never had a Nobel Prize laureate on the show.
I've had a sir, but I've never had a Nobel Prize laureate.
yeah, if anybody knows...
Chris, right?
Yes, I'm sure.
Then let me know.
Put me in contact.
I'll definitely try and get him on the show for sure.
Amazing.
Well...
David, thank you so much.
um That was awesome.
Really had a blast.
Learned a lot, but I'm not surprised by that.
I had a good prior on that.
yeah, thank you so much for taking the time.
Please let me know, listeners, how you find this new hybrid format.
I really like it so far, so unless you tell me you really hate it, and most of you tell me that, I think I'll keep going with it whenever I can.
So as usual, I put a lot of things in the show notes for those who want a deeper dive, David: your socials, your work, and so on.
Thanks again for taking the time and being on this show.
thank you.
This has been another episode of Learning Bayesian Statistics.
Be sure to rate, review, and follow the show on your favorite podcatcher, and visit learnbayesstats.com for more resources about today's topics, as well as access to more
episodes to help you reach a true Bayesian state of mind.
That's learnbayesstats.com.
Our theme music is Good Bayesian by Baba Brinkman, featuring MC Lars and Mega Ran.
Check out his awesome work at BabaBrinkman.com.
I'm your host, Alex Andorra.
You can follow me on Twitter at alex_andorra, like the country.
You can support the show and unlock exclusive benefits by visiting patreon.com/learnbayesstats.
Thank you so much for listening and for your support.
You're truly a good Bayesian.
Be sure you have to be a good Bayesian
Change calculations after taking fresh data
Those predictions that your brain is making
Let's get them on a solid foundation