#154 Bayesian Causal Inference at Scale, with Thomas Pinder
In this episode, we sit down with Thomas Pinder, a Senior Scientist at Netflix (formerly at Amazon and Uber), to discuss the intersection of Bayesian machine learning, causal inference, and high-performance computing. Thomas is a leading voice in the JAX ecosystem and the creator of GPJax, a Gaussian process library.
The core philosophy behind GPJax is that code should be as close as possible to the math we write on paper. Built on JAX, the library provides a flexible, low-level interface that lets researchers extend models, for example with sparse variational inference or non-Gaussian likelihoods such as Poisson regression, without sacrificing computational performance.
Thomas shares how he applies Bayesian methods to solve complex industrial problems, particularly in experimentation.
A recurring theme in the episode is that picking uninformative priors can often lead to models that encode physically meaningless assumptions. Thomas advocates for Prior Predictive Checks -- simulating data from the model before seeing any real observations -- as a critical diagnostic tool.
With the maturation of tools like NumPyro and the GPJax ecosystem, there is real optimism that scalable Bayesian methods are becoming the default for robust causal analysis in production environments.
Check out the full episode above and show notes for deeper dives into SDiD, GPJax examples, and the future of Bayesian ML!
You can also interact with the episode on NotebookLM! Ask Gemini questions, generate flashcards, infographics, and more.
Hope you enjoyed, and see you in two weeks, my dear Bayesians!
---
11:40 What is GPJax and how does it simplify Gaussian Process modeling?
15:48 How are Bayesian methods used for experimentation and causal inference in industry?
18:40 How do you implement Bayesian Synthetic Control?
32:17 What is Bayesian Synthetic Difference-in-Differences?
39:44 What are the research applications and supported methods for the GPJax library?
45:47 What are the primary software and computational bottlenecks when scaling Gaussian Processes?
49:02 What are the real-world industrial applications of Gaussian Process models?
54:36 How is Bayesian modeling applied to soccer and sports analytics?
58:43 What is the future development roadmap for the GPJax ecosystem?
01:05:37 What is Impulso and how does it integrate into a Bayesian modeling workflow?
01:13:42 How do you balance Bayesian computational overhead with industrial latency requirements?
01:20:26 Why is there optimism that scalable Bayesian methods for causal inference are now within reach?
Thank you to my Patrons for making this episode possible!
Today, we are bridging the gap between cutting-edge Gaussian process research and the high-stakes world of industrial causal inference.
My guest is Thomas Pinder, a senior data scientist in Netflix's Studios team building causal models to measure risk and vulnerability.
Prior to this, Thomas was a senior applied scientist at Amazon working on Bayesian synthetic control methods and marketing mix modeling.
He also spent time as a senior scientist in the Maps team of Uber, where he built attribution models for Uber's core map technology.
And honestly, Thomas has great taste.
Because yeah, he loves Gaussian processes.
He loves them so much that he created the open source package GPJax.
In this conversation, we dive deep into why Bayesian methods are essential for causal inference and decision making.
We explore the why and the how of GPJax, the elegance of Bayesian synthetic control, and a fascinating new approach to synthetic difference-in-differences.
We also get a sneak peek at Thomas's new project, Impulso, which aims to bring structural VAR models to the Python ecosystem.
So if you like time series, you'll want to stick around.
This is Learning Bayesian Statistics, episode 154, recorded March 17, 2026.
Welcome to Learning Bayesian Statistics, a podcast about Bayesian inference, the methods, the projects, and the people who make it possible.
I'm your host, Alex Andorra.
You can follow me on Twitter at alex_andorra, like the country.
For any info about the show, learnbayesstats.com is Laplace to be.
Show notes, becoming a corporate sponsor, unlocking Bayesian merch, supporting the show on Patreon, everything is in there.
That's learnbayesstats.com.
If you're interested in one-on-one mentorship, online courses, or statistical consulting, feel free to reach out and book a call at topmate.io/alex_andorra.
See you around, folks, and best Bayesian wishes to you all.
Hello, my dear Bayesians!
I just wanted to share that the LearnBayesStats website now has a blog section.
So not only will you find articles about recent episodes we've done, or just some insights we have about the episodes themselves, but also, from time to time, standalone blog posts.
For that first blog post, well, I actually wrote one about the coding agent skill that I wrote a few days ago, which is a Bayesian workflow coding agent skill.
And in the blog post, I go through how I did it and how it only took mass producing the same mistakes 63 times to get there.
So if you want the full breakdown of what surprised me about how well LLMs know Bayes (or not), and what they were consistently great at and also consistently not great at, well, you should check out the blog post.
In particular, that agent skill enforces the full Bayesian workflow, prior predictive checks, diagnostics, calibration, reporting, every single time, using PyMC and the all-new ArviZ 1.0, spearheaded by, among others, the brilliant Osvaldo Martín and Oriol Abril-Pla, two friends of the show, of course.
This is an open-source skill, so feel free to chime in with ideas.
It works with Claude Code, Cursor, Gemini CLI, Codex, Gimicode, and any agent supporting the agent skills spec.
And well, check out the blog post, let me know what you think.
I hope you enjoy it.
And of course, the link is in the show notes.
On that note, folks, let's continue with today's episode.
Thomas Pinder, welcome to Learning Bayesian Statistics.
Thank you. Thank you so much for having me on.
No, thanks for taking the time.
It's late over there where you are.
And so I definitely appreciate you being a night owl for the show and for the listeners.
I'm sure they will appreciate it.
I was super excited to come across your profile on LinkedIn, actually, when you posted about the GPJax integration into NumPyro.
So thanks for doing that and making it public, because I don't know when I would have come across your profile otherwise.
So that's perfect.
But definitely, when I saw what you were working on, I got super excited, because I was like, damn, we're working on very similar topics.
I was like, great, I need to talk to this guy.
We're going to have a lot in common.
I'm very probably going to learn a lot from you.
So let's do that.
And you have an interesting background for sure, because you transitioned from a PhD in statistics at Lancaster in the UK, which is where you're from.
And you've had senior scientist roles at mostly, if I understood correctly, only US companies, right?
Big tech US companies.
So Amazon, Uber, Netflix, but you're still based in Europe.
This is interesting in itself.
so yeah, I'm just curious about your origin story and how did your focus shift toward experimentation and causal inference in such large scale industrial settings?
Yeah.
Yeah.
So as you say, I did my PhD in statistics.
um I pretty heavily focused on Gaussian processes during my PhD.
I found these to be fascinating.
And the type of thing where for each new thing you learn, you stumble across four or five things which you know nothing about.
So it was quite easy to fill a whole PhD learning about these things.
I think I sort of got into Bayes and Gaussian processes during my master's.
actually um this was around the time when neural networks were getting pretty popular.
We'd just had Attention Is All You Need come out.
Ah yeah, I got very interested in the idea of adversarial detection in computer vision, and simultaneously read a wonderful paper by Yarin Gal and Zoubin Ghahramani on Monte Carlo dropout.
And I was fascinated as to whether we could combine these two things together and we could detect adversaries using Bayesian deep learning.
And that was really my sort of pathway into the world of Bayes.
um It was a wonderful introduction and I think it was really the thing which made me appreciate Bayes, like this idea of using the uncertainty to make decisions.
uh I did work for a few UK based companies.
I did spend a little bit of time working for some startups in Lancaster on NLP, and also down in Oxford, working on sort of cost-aware Bayesian optimization.
But as you say, I have spent most of my career working in sort of large tech US companies.
My first time was actually, I was fortunate enough to work in Amazon supply chain team.
during my PhD, where I got to intern with someone called James Hensman.
It was a super fun six months.
We were basically using Gaussian processes to sort of accelerate Horvitz-Thompson estimators.
So sort of approximating sums using Gaussian processes and control variates.
And that for me was really what made me appreciate that actually you can do really fun, complicated, impactful machine learning and statistics in industry.
Until then, I think I had had this impression that you had to stay in a university to do this type of complex type of problem solving.
But this was really exciting to me.
um However, I finished my PhD and I had spent the best part of four years now pretty focused on Gaussian processes.
And I just fancied a change.
And there was an opportunity in Amazon's prime and marketing team to go and work on some experimentation projects.
um So I took that opportunity and I spent almost three years actually at Amazon and working on experimentation and quasi observational models, mainly working on synthetic
controls.
I was lucky enough to get to spend most of my time at Amazon working with Alberto Abadie, who came up with the idea of synthetic controls.
So that was really educational and informative for me.
And um I had a great time just sort of fusing together Bayesian methods with some of these more quasi-observational uh methods.
So that's sort of what landed me in the world I sit in today.
Wow.
Okay.
Yeah.
This is really, really a fascinating background and context.
ah
and those sound like a lot of really cool and interesting jobs.
Obviously, listeners will know my passion for Gaussian processes, so of course I resonate with that.
I already have a few shows about Gaussian processes that I'll put into the related shows on the website.
And by the way, folks, we have a completely revamped website now, where each episode comes with an accompanying blog post that I write, and we have all these key takeaways and related notes.
That's been a huge effort from the team here at LearnBayesStats.
I've only been, you know, leading that from the background, and now it looks really good.
So you'll see like when you go there, you'll see all the related shows that I'll put in there.
So we'll have Gaussian processes, and probably some of the episodes we've done about state-space models, because I know, Thomas, you also work a lot on these kinds of models. So I'll have that in there if you want it as background for the conversation we're going to have today, because we're not going to dive too much into these technical details.
But yeah, damn, that sounds like super fun work, Thomas, and I definitely understand where you're coming from. I'm really happy also, as you were saying, that now these kinds of jobs are not only in academia, but also in these kinds of industry settings, where you actually get to have an impact on how things are done in a massive-scale company.
I think it's really amazing that we get to live and actually contribute at such a time.
So yeah.
You also lead something that's called GPJax, which is an open source package.
And for those unfamiliar, what is the overarching, big-picture goal of this library, and why was JAX the necessary foundation for it?
Also, just some info about my audience: you can assume that most people will know what a Gaussian process is and what JAX is.
Um, so feel free to build on that and answer that question.
Yeah.
I might start with the second part because Jax was not a necessity for GP Jax.
It actually came about because during my PhD, I actually used to work on a Julia library for Gaussian processes, Gaussian processes dot JL.
And that was really where I learned coding actually was in Julia.
And then through my PhD supervisor, I contributed to this library and
I really liked this functional way of programming, but then I kind of realized that to get your code out there and used by people, Python is for better or worse, the dominant
language in our world.
ah And up until Jax, I think PyTorch and TensorFlow both encouraged this kind of object oriented way of programming.
So when Jax came along, I was naturally very attracted to the framework it provided.
And honestly, GPJax started because I wanted an excuse to play around with JAX during my PhD.
A nice thing about being in a university is you're allowed a large amount of sort of academic freedom.
And I took that as an opportunity to just take some time to learn about how Jax worked.
And at the time I knew about GPs and I thought, there's no library in Jax for GPs.
So why don't we make one?
So it really did just start out as a piece of research code.
And I think today it's a little bit more than a piece of research code.
It's a little bit more reliable, I think.
But I think the overarching principle of GPJAX is to essentially provide the sort of verbs and nouns for people wanting to fit GPs in maybe a non-standard way.
So I think there are, I think GPyTorch and SKLearn have Gaussian process implementations, which allow you to sort of fit a GP out of the box and get the estimates out at the end in a
reliable way.
GPJAX, I wasn't trying to recreate that framework in JAX.
What I really wanted was something with very few guardrails, and the ability for researchers to really hack around in a GP and to maybe do some more unorthodox things, which some of these larger frameworks would prevent you from doing.
And allow people to basically have all of the components of a Gaussian process, so they can choose how they plan on composing those components with one another to maybe develop novel GP methodologies.
So that's really, think it's still true to that principle today.
I think it still sits in that space.
um But that's really where GPJax came from actually.
Okay.
Yes.
I can definitely resonate with the idea of just being curious about a method or framework and just trying to find a use case for learning it.
Well done on doing that.
I think also the package in itself is really helpful and useful for researchers.
And I think you did a great job on that.
So folks, if you want to check that out, it's in the show notes.
Definitely give GPJax a try if that's something you're curious about and need right now.
I will also put the link to GPyTorch that you were mentioning, for sure.
Something you also work on frequently is bridging Bayesian methods and causal inference, which is something I work a lot on too. So I was very interested in talking to you, especially about that.
I think Bayesian methods lend themselves very naturally to the world of causal modeling.
I think when you work on these large causal questions inside of industry, it's very rarely a binary outcome.
It's very rarely the case where you run an experiment, be it a randomized controlled trial or some type of observational method.
And you get a very clear picture at the end, which says you should definitely do this or you should definitely not do that.
I think these type of business questions are often layered in nuance and complexity.
And I think the role of us as scientists, when we're trying to build models to address these types of questions is to really provide our stakeholders and partners in business
with the context, which allows them to make what they believe to be the right decision.
And that should be guided by their own expertise and what the data is saying to them.
And I think Bayesian methods really give you a very natural way of reasoning about the uncertainty.
Like we no longer have to worry about just getting a treatment effect.
We can get a treatment effect and we can get a credible interval around that treatment effect and we can work with the full posterior distribution if we like.
And we can then start to talk about the risk of making a false positive to our stakeholders.
And in my experience, when you start to frame the outcome of these causal questions,
through the lens of probabilities and sort of the likelihood of something being positive or the likelihood of the effect being greater than some value versus compressing things
down to a p-value being significant or not.
I think you provide your colleagues and your stakeholders with far more context and information that should allow them to ah make the best informed decision that they can off
the back of the data you've modeled.
Yeah.
I mean, I think you're preaching to the choir here, and I'm pretty sure everybody in the audience will appreciate what you're saying.
Something I'm also obviously very curious about in your work is that you've done some work on, or at least you're very curious about, Bayesian synthetic control.
So here we're talking about quasi-experimental methods.
We haven't talked about that too much on the show, actually, weirdly, because I think it's scattered across a lot of episodes. But we have at least that episode with Ben Vincent, the author of CausalPy, so I will link to that episode and of course put CausalPy in the show notes.
But can you tell listeners what Bayesian synthetic control is about and how it works?
Yeah.
So I think Ben has some excellent stuff on this in CausalPy.
Like I would definitely recommend people go and look at that.
um It's truly excellent.
um Synthetic control though, I guess to me, when I first saw synthetic control, I think in its original formulation, it's essentially posed as an optimization problem where you have
a collection of control units and then one or many, let's just say one treated unit for now.
And your goal is to basically
form some linear combination of the control units that they best match the treated unit before the intervention was applied.
And in the original paper, m Alberto does this by sort of optimizing the weights by constraining the weights to live on some probability simplex.
And it forms it as a constrained optimization problem.
However, I think when I first sort of saw this, I kind of just thought of it as linear regression, in a way, where my response variable is my treated unit.
And my design matrix is just a collection of control units observed over time.
And my Bayesian mind said, well, if I'm performing optimization on a simplex, that's somewhat analogous to just putting a Dirichlet prior down on the coefficients of my linear
model.
And so I...
sort of implemented that using NumPyro and PyMC.
These models are exceptionally easy and fast to fit nowadays.
And I think what you get out of this is I think it's a little bit more than just taking a frequentist method and saying, let's make it Bayesian because we can.
I think by sort of recasting synthetic control from this constrained optimization problem to this sort of Bayesian regression problem.
You give yourself a series of tools which actually allow you to perform quite a rich mode of inference, optimizing the weights on the simplex is somewhat fragile.
Whereas when you set your Dirichlet prior, you can be selective in how you set that concentration parameter on the Dirichlet distribution.
And you can then start to say whether you think a few units would be very informative of the counterfactual.
And then I'm going to set a very low value for my concentration parameter.
Or you might believe actually many units are explanatory of the treated unit, and you can put that concentration parameter a little bit higher. Or, as I frequently did, you can put something like a gamma hyperprior down on that concentration parameter and really just let the data drive this type of balancing.
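For listeners who want to see what this looks like in code, here is a minimal sketch of the Dirichlet-weighted Bayesian synthetic control Thomas describes, written in NumPyro. The variable names, priors, and the gamma hyperprior value are illustrative assumptions on my part, not code from Thomas's own implementation.

```python
import jax.numpy as jnp
import numpyro
import numpyro.distributions as dist
from numpyro.infer import MCMC, NUTS
from jax import random

def bayesian_synthetic_control(X_pre, y_pre):
    """X_pre: (T_pre, J) control units; y_pre: (T_pre,) treated unit, pre-intervention only."""
    J = X_pre.shape[1]
    # Gamma hyperprior lets the data choose between sparse and uniform weights.
    concentration = numpyro.sample("concentration", dist.Gamma(1.0, 1.0))
    # Simplex-constrained weights: the Bayesian analogue of optimizing on the simplex.
    w = numpyro.sample("w", dist.Dirichlet(concentration * jnp.ones(J)))
    sigma = numpyro.sample("sigma", dist.HalfNormal(1.0))
    mu = X_pre @ w
    numpyro.sample("y", dist.Normal(mu, sigma), obs=y_pre)

# Fit on the pre-intervention window only:
# mcmc = MCMC(NUTS(bayesian_synthetic_control), num_warmup=1000, num_samples=1000)
# mcmc.run(random.PRNGKey(0), X_pre, y_pre)
# The posterior counterfactual for the post period is X_post @ w for each posterior draw,
# and the treatment effect is y_post minus that counterfactual, draw by draw.
```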
And I will sort of return to this original question, this original point I made of like performing inference and giving stakeholders
context around the confidence and the sort of what happened on the way to getting to your treatment effect.
When you perform a sort of ordinary synthetic control and you do this optimization on the simplex, when you want to get then some uncertainty around your treatment effect in the
sort of original synthetic control method, the idea was to do a permutation test.
So where you would take each of your control units, sort of mock, pretend that it was a treated unit and then try and estimate the treatment effect and cycle through your pool of
control units and then sort of measure the sort of average distance between your treatment effect and then all of these and you would get the uncertainty.
That's okay, but all you end up with is a boundary, like a permutation test interval around your treatment effect.
You don't know what happened between the intervention and maybe the weeks which led up until the end of your treatment effect.
Whereas when you have a linear model, a Bayesian linear model, you get this full trajectory from treatment point until final time point.
And you can see how the treatment evolved over time.
And you also have fewer concerns when you have very few units, like this permutation test becomes a little bit unstable.
However, when you have the sort of Dirichlet prior approach, I think you end up with much more robust inference.
And I think in practice, this is very meaningful because
When your estimates are more reliable, you've saved yourself... have you ever been in that awful position where you've given some guidance to a stakeholder about how to interpret some causal effect, and then it turns out that actually your model was very, very fragile and you need to maybe change that narrative?
I think the Bayesian approach gives you a very robust ah mode of inference.
Yeah.
Yeah, for sure.
And I think that was a great explanation of what synthetic control is. But basically, yeah, the idea usually is to try and find some combination of units that represents the treatment unit that you have, basically trying to make another treatment unit, just one that wasn't treated, so that you have the counterfactual as well: if that unit had not been treated, it would have looked like that, whereas we have observed what it looks like when it is treated, so the difference is the causal effect.
If you want an example, it could be something like, let's say, some multinational company makes a change to the way they operate in Argentina because of a law or something like that, and only in Argentina, but they don't change anything in the rest of South America.
Then you could use synthetic control here to try and create a synthetic Argentina based on a combination of weights from the other countries of South America that, when combined, gives you a sort of pseudo-Argentina, and then the difference is your causal effect.
So here the difficulty in this method is actually in computing the weights and making sure that actually your synthetic unit
is actually a good proxy of what the treatment unit would have been without the treatment.
Is that correct?
It's correct as far as I understand it.
Yeah.
think that's a great explanation of what it is.
Me too.
Me too.
It's always weird to, it's always a bit weird to explain these things because there is a lot of like, this is a unit.
What would it have been like if it wasn't treated, when it's actually treated? So you have to be extremely careful with the definitions and the words you're using.
I agree.
I think that counterfactual unit is actually like a super rich object to have in a causal model, right?
It's definitely the way you, the path you get to your treatment effect by comparing sort of what happened versus what would have happened.
But that counterfactual unit actually gives you so much information about like scenarios that could have happened.
And if your control units were a different set of control units, what would that mean for your counterfactual like...
Having this Bayesian synthetic control giving you this full counterfactual distribution gives you a super rich object for which you can start to reason about different scenarios.
I think your explanation is great and that counterfactual is really the key part, I think.
Yeah, completely.
What's also fascinating to me in, I think all the quasi-experiments methods I know and have used is that
Actually computing the causal effect is super simple.
It's always a difference between the counterfactual unit and the treated unit.
What's hard is actually getting to an estimate of the counterfactual unit.
Getting there is hard, but then once you get there, it's super easy.
You just compute the difference.
And if you're in the Bayesian framework, you actually have a full posterior of differences, and you can add all the bells and whistles that you were talking about.
Yeah, it's really like an iceberg-type situation: the tip of the iceberg, the treatment effect, is really easy to get to; everything beneath, all the assumptions you've had to make and the modeling choices you then had to translate, that's where the work goes in building these models.
Yeah, yeah, exactly.
And I like that.
I mean, in a way I like that because it's a bit like Bayes, you know: in Bayes you have one estimator, the posterior distribution, and you're done.
You don't need to know about all these tests and so on.
It's just like you have one estimator.
And here I like that also because, well, you have one estimator.
It's just the difference.
That's your causal effect.
That's what you care about.
And you need to get there, but it's always the same thing.
And I think it's great.
So actually, there is a thread I could pick up here: you talked about permutation tests and the frequentist way of doing that.
I think there is also a big limitation there when you talk about p-values, which is that you're limited by the number of units and you cannot go below that for your significance threshold.
And that's a big limitation: if you have 20 units, you literally, mathematically, cannot have a p-value lower than 1 in 20, so 5%.
And that can be a problem.
But that's just, yeah, more of a nitpick, or maybe something for a debate in a more statistical conversation.
And I have so many other topics that I want to talk about with you.
So let's go on.
Before that though, and before we talk about synthetic difference-in-differences, can you just remind me quickly why using a Dirichlet distribution for the weights is actually more robust?
Because I think I missed that and uh I'd love you to repeat that.
Yeah, maybe let's take a step back and sort of like...
reframe it in the original synthetic control idea of optimizing on the simplex and performing optimization on the simplex and specifying the Dirichlet distribution.
They're both essentially supplying the same geometric restriction on the coefficients of your units.
And you're doing this because if you just treat it as an ordinary least squares type problem, all of your control units are really highly correlated both with each other and
with the treatment unit in the pre-treatment window.
So you have this multicollinearity problem and you really need to regularize these units.
And what happens in practice when you are optimizing on the Dirichlet distribution, sorry, when you're optimizing on the Simplex is you end up with very sparse weights.
Like you end up with most of the control units being assigned zero mass, and then a few control units basically carrying the weight of that counterfactual distribution.
And that's quite a heavy task and it's pretty...
It's a pretty sort of brittle estimator.
um And for example, if you maybe introduce an additional unit into your control set, it'll completely shift the allocation of mass across your unit and that can then dramatically
change your counterfactual distribution.
And it's not to say that the Dirichlet distribution is necessarily better.
But I think it's a little bit more configurable and you can be a little bit more flexible in how you want that weight to be allocated.
As I sort of said earlier, like that concentration parameter, as it gets closer to zero, will start to sort of give you much sparser units.
Whereas when it gets larger, you'll then start to allocate the mass uniformly.
And as a practitioner, you're totally free.
If you have no idea whether you need very few units or you need quite a lot of uniform mass spread across all of your units.
You're totally free to just put a prior down on that value and sort of allow the sampling routine that you pick to sort of guide you to the right value.
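As a quick illustration of the sparsity behaviour Thomas describes, here is a tiny sketch; the concentration values are arbitrary choices, just for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_units = 20
# Small concentration: most mass piles onto a handful of control units.
sparse_weights = rng.dirichlet(np.full(n_units, 0.1))
# Large concentration: mass is spread roughly uniformly across all units.
uniform_weights = rng.dirichlet(np.full(n_units, 10.0))
print(np.sort(sparse_weights)[-3:].sum())   # top 3 units carry most of the weight
print(np.sort(uniform_weights)[-3:].sum())  # top 3 units carry only a modest share
```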
But there's also another sort of framing here.
Like I've worked on some problems in the past where actually you're sort of running like sort of experiments where putting units into either treatment or control is somewhat
costly and you maybe want the smallest design possible.
Like in these cases,
You really almost want to specify that by definition, I want the smallest number of control units, which will allow me to have a well explained counterfactual.
It's quite hard for me to reason about that when I'm optimizing on the simplex.
Whereas when I'm putting that into my prior distribution and I can purposefully set that concentration parameter to be quite small.
I don't think it's necessarily better.
I just think it enables richer inference, and it allows you as the practitioner to have more hooks into the model to impart the information which you think is helpful in getting to your final causal estimate.
And it's a classic case of there being no one right way to do this, but having more options in these types of cases is often better, as is having less of a black-box-type model.
Sure.
So I encourage people to read your blog post, which I'll put in the show notes, about this topic.
Now let's turn briefly to synthetic difference-in-differences.
So I think diff in diff...
Most people will uh be somewhat familiar with the concept.
But can you tell us a bit more about that, what it is, when to use it and how?
And I know you've done a bit of work on that lately, uh mostly as something you're curious about.
It's also great to pick your brain on how you're thinking about that right now, while your thinking is still evolving. Not at all prescribing things here, but actually telling us how you're thinking about that topic and how it influences your work.
Um, and of course I will put the blog post you wrote about that in the show notes also.
Yeah.
So let's just.
Just thinking about diff and diff, I sort of gave an explanation of how one can think about synthetic control as a form of linear regression.
The same statement is somewhat true of diff and diff.
um In difference and differences, you have a treatment and control unit and you have this pre versus post intervention window.
And you sort of then end up with two indicator variables: was the unit in control or treatment, and was the observation in the pre- or the post-intervention window. And then the coefficient on the interaction between these two variables will end up being your treatment effect.
so diff and diff and synthetic control can both be framed in the light of linear regression.
And there are some really nice papers by Guido Imbens, and there might even be a talk now on YouTube from Guido Imbens, sort of framing pretty much all of this line of literature as just different variants on linear regression.
And it's a really great talk.
um
So to me, if we can frame diff and diff through the lens of linear regression, then much like with synthetic control, I have the option to cast this in the light of Bayesian
regression, where now my uh prior distribution on that interaction term um is the sort of prior on my treatment effect.
And that's kind of nice.
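As a minimal sketch of that Bayesian diff-in-diff regression in NumPyro (the prior choices here are illustrative assumptions, not a recommendation):

```python
import numpyro
import numpyro.distributions as dist

def bayesian_did(treated, post, y=None):
    """treated, post: 0/1 indicator arrays per observation; y: outcomes."""
    intercept = numpyro.sample("intercept", dist.Normal(0.0, 10.0))
    beta_treated = numpyro.sample("beta_treated", dist.Normal(0.0, 10.0))
    beta_post = numpyro.sample("beta_post", dist.Normal(0.0, 10.0))
    # The coefficient on the interaction is the treatment effect;
    # its prior is the prior on the treatment effect Thomas mentions.
    tau = numpyro.sample("tau", dist.Normal(0.0, 5.0))
    sigma = numpyro.sample("sigma", dist.HalfNormal(1.0))
    mu = intercept + beta_treated * treated + beta_post * post + tau * treated * post
    numpyro.sample("obs", dist.Normal(mu, sigma), obs=y)
```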
Synthetic diff-in-diff was a paper from maybe five or six years ago now; I think it was 2019 actually.
um But it essentially said, well what if we can combine together diff and diff with synthetic control?
Meaning synthetic control gives me a way of assigning uh a collection of control units to explain the counterfactual.
And diff-in-diff sort of allows me to weight my pre- and post-intervention windows and have sort of weights on that. So in synthetic difference-in-differences, I could have weights on time and on the control units.
And it goes about estimating them in sort of like a three-step process where I first would learn the weights of the control units using all of the pre-intervention data.
And then I would go about learning the time weights using only the control units.
And then as you put it earlier, then we just get the treatment effect out at the end by just doing a difference in between two terms.
However, the challenge in trying to cast this into the light of Bayesian inference is that I now have this information restriction, where I have all of my data, pre and post, control and treatment.
and treatment.
um However, I have sort of two quantities I need to estimate the unit weights and the time weights, but the unit weights can only take data from uh pre-intervention and the time
weights can only take data from the controls.
In a Bayesian framework, normally like your likelihood would just factorize out over the data.
um However, I can't do that here.
Otherwise I'll have information about the wrong quantity flowing into the wrong component of my model.
I actually just stumbled across this idea of a cut posterior, or modular posterior, just in passing.
And this essentially says that, like in networks, you may have this problem quite commonly, where you want information to flow to different components of your posterior and then sort of compose together these modules into one full posterior distribution.
And I saw that and thought, maybe we can apply that to synthetic DiD, and it turns out you can, actually, relatively simply.
I have some very short ideas on my blog about how you could do this. It's very recent and could very well be wrong, so I would encourage people to read it with a skeptical eye.
However, it turns out that when you frame Bayesian synthetic diff-in-diff in this cut-posterior-type way, you end up with an estimate which is very comparable to the estimate reported in the synthetic diff-in-diff paper, which essentially said that synthetic controls on their own overestimate the magnitude of the treatment effect in the original problem, the California smoking example.
So this was a sort of the data set used in the original synthetic control paper, which measured the sort of effect of cigarette ban in California, where California was the
treated unit and then all the other states became the control units.
And I think the effect in that paper was around minus 26 packs of cigarettes per capita.
And actually, synthetic diff-in-diff estimates this result to be closer to minus 15 packs per capita.
The Bayesian synthetic diff-in-diff recovers almost that exact same estimate.
It gets around minus 15.6.
So a 0.6 difference.
But you get the full posterior out, as we said multiple times, this sort of gives you that richer form of inference.
Yeah, I'm still sort of pushing the idea around the sort of open questions around sort of like, how do you represent the uncertainty on that treatment unit itself, because you have
sort of time now factoring in and the sort of different ways you can parameterize the model such that it becomes closer to synthetic control and closer to diff and diff or in
certain cases equal to those models.
And I'm still trying to sort of mull over what that means.
um But it's quite a...
It's quite interesting and it's something I'm quite excited to sort of continue developing and seeing how that idea evolves.
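To make the information-flow restriction concrete, here is a rough sketch of the two-stage, cut-style estimation Thomas outlines. This is my own loose illustration of the idea, not his implementation, and the details (priors, intercepts, and how draws are combined) are assumptions; read it with the same skeptical eye he recommends for the blog post.

```python
import jax.numpy as jnp
import numpyro
import numpyro.distributions as dist

def unit_weight_model(X_pre, y_pre):
    # Stage 1: unit weights only ever see pre-intervention data (controls and treated).
    J = X_pre.shape[1]
    w = numpyro.sample("w", dist.Dirichlet(jnp.ones(J)))
    sigma = numpyro.sample("sigma", dist.HalfNormal(1.0))
    numpyro.sample("y", dist.Normal(X_pre @ w, sigma), obs=y_pre)

def time_weight_model(X_pre, x_post_mean):
    # Stage 2: time weights only ever see the control units, weighting pre periods
    # so that they predict each control unit's average post-intervention outcome.
    T_pre = X_pre.shape[0]
    lam = numpyro.sample("lam", dist.Dirichlet(jnp.ones(T_pre)))
    sigma = numpyro.sample("sigma", dist.HalfNormal(1.0))
    numpyro.sample("x_post", dist.Normal(X_pre.T @ lam, sigma), obs=x_post_mean)

# Run the two MCMCs separately; because neither stage conditions on the treated unit's
# post-intervention data, information cannot leak into the wrong component (the "cut").
# For each pair of posterior draws (w, lam), an SDID-style effect is then, roughly,
#   tau = (y_treated_post_mean - lam @ y_treated_pre)
#         - w @ (x_post_mean - X_pre.T @ lam)
```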
Yeah.
This is a very, very intriguing and, I think, powerful idea of merging synthetic control and diff-in-diff.
Honestly, I had not heard about that before preparing the show.
So thanks for putting that in front of me. I will definitely look into that and, yeah, study it as you're doing right now, and see how we can actually use that in a professional setting.
And actually, can you add the YouTube video of the talk you mentioned to the show notes?
Because I think it's going to be very helpful, especially if it's exposing this idea of synthetic diff-in-diff.
Yeah.
Yeah.
Let me try and find the talk.
um Yeah.
can do that.
I'll add it to the show notes.
Yeah.
Yeah, exactly.
Awesome.
And so that was the quasi-experimental causal inference part of the episode for today.
We can get back to that at some point if you want, but I also want to ask you about GPJax, because, well, that's another common passion of ours.
So you already gave us the elevator pitch for it at the beginning. More precisely, when would you recommend listeners use it, and how?
I lost the last part of my question.
And you said, when would you recommend listeners to use it?
And then I lost you.
And how to use it?
Like usually what's workflow when you're using GP checks?
Gotcha.
Yeah.
How?
So GPJAX is definitely not the Gaussian process you should use all the time.
Like GPJAX will let you do some things which are a bad idea.
And other frameworks will have guardrails in place, which will just stop you doing that.
I think if you're, for example, if you want to put a Gaussian process into our production system.
And it's a very high stakes type scenario.
think maybe using GPyTorch is perhaps a, maybe a safer choice.
I think it simply has more guardrails put in place.
Equally, I think if you really don't care too much about the model's construction and you just want to fit a Gaussian process, think something like SK, and you don't have very big
data, then I think something like SKLearn will allow you to fit a, a correct Gaussian process in far fewer lines than what
we would require you to use in GPJAX.
So I think those are the two cases where you maybe wouldn't want to use GPJAX.
However, I think if you're a researcher and you're wanting to fit maybe unusual Gaussian processes, so one example would be if you have a different type of kernel, which allows
your covariance matrix to have some type of unique structure which you wish to exploit when doing the matrix inverse, a common bottleneck in all Gaussian processes.
GPJax gives you super easy ways to hook into that.
Or if you have a new variational approximation which you want to test out to accelerate a variational-inference-type workflow, then again, GPJax makes that really easy for you to do.
So I think in these types of cases, using GPJax is a great idea.
I also think if you really want to fit a model and then retrospectively decompose it, dig into it, take it apart, I think, uh, and or compose it with other modules, like a linear
component, like a linear model or something else.
Then I think GPJax enables that very nicely.
Like, as you mentioned in the introduction, we recently integrated it very tightly with NumPyro, which allows you to build much bigger Bayesian models with Gaussian process components therein. And I think in these types of cases, GPJax is somewhat unique in the sense that it really does allow you to build quite complicated models with relatively little code.
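To illustrate the kind of composite model Thomas means, here is a small sketch of a GP component sitting alongside a linear component inside a NumPyro model. Note that this builds the GP by hand rather than through the GPJax-NumPyro integration itself (check the GPJax documentation for that API); the kernel, priors, and shapes are illustrative assumptions.

```python
import jax.numpy as jnp
import numpyro
import numpyro.distributions as dist

def rbf_kernel(x1, x2, variance, lengthscale):
    # Squared-exponential kernel over one-dimensional inputs.
    d2 = (x1[:, None] - x2[None, :]) ** 2
    return variance * jnp.exp(-0.5 * d2 / lengthscale**2)

def gp_plus_linear(t, covariates, y=None):
    """t: (n,) time points; covariates: (n, k) design matrix; y: (n,) outcomes."""
    # Linear component on ordinary covariates.
    beta = numpyro.sample("beta", dist.Normal(0.0, 1.0).expand([covariates.shape[1]]))
    # Latent GP component over time.
    variance = numpyro.sample("variance", dist.HalfNormal(1.0))
    lengthscale = numpyro.sample("lengthscale", dist.HalfNormal(1.0))
    K = rbf_kernel(t, t, variance, lengthscale) + 1e-6 * jnp.eye(t.shape[0])
    f = numpyro.sample("f", dist.MultivariateNormal(jnp.zeros(t.shape[0]), covariance_matrix=K))
    sigma = numpyro.sample("sigma", dist.HalfNormal(1.0))
    numpyro.sample("obs", dist.Normal(covariates @ beta + f, sigma), obs=y)
```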
As to exactly how I use GPJAX, I actually use it less these days.
I find myself using Gaussian processes less and less these days and find myself working on GPJAX as a hobby.
um
I tend to use GPJax in a Marimo notebook, and most of the time these days I find myself fitting Gaussian processes for case studies and for trying to understand some data.
I was never the best methodological or theoretical researcher during my PhD.
That was never my strong suit.
I think where I was able to be most useful was in building tools like GPJax and working on causal problems, which are much more closely coupled with applications.
um So I don't find myself sort of developing too much novel methodology.
I prefer to let other people develop clever methodology.
And then I would work out a way to provide a nice abstraction for that within GPJax.
That allows people to use these nice methodologies within the GPJax framework.
Okay.
Yeah.
Yeah.
That makes sense, and everything you've talked about here also resonates with my experience of Gaussian processes, especially the fact that they're very modular, that you can combine them with other kinds of models: you can have a Gaussian process component, or even several Gaussian process components, in a model that also has linear regression components. That's one of the big powers of Gaussian processes, and that's also one of the big reasons I love them so much.
um One thing I'm curious about though is that, you know, when scaling Gaussian processes, what were the primary software bottlenecks that you encountered in the Python ecosystem
that led you to favor the JAX stack we talked about at the beginning? It was more of a curiosity at first, but also now you have the NumPyro stack integrated into GPJax.
Yeah.
So I think, although I opted to use JAX at the start, I had actually before that implemented a very, very small Gaussian process library using NumPyro. And really, JAX solves so many problems, because you don't have to worry about calculating the gradients.
In the Julia package, we had to calculate a lot of the gradients by hand, which means things are super fast.
But for each new kernel you want to introduce, you have to calculate the gradient with respect to each of the parameters of that kernel.
And it's very difficult.
So I think software like JAX really unlocks the ability to build out new code very quickly, because I can take a derivative with respect to any leaf of my pytree, and that's a super powerful paradigm: through the chain rule, I can essentially differentiate my whole loss function in each of the parameters and get a gradient out at the end.
And within JAX also, and now PyTorch and TensorFlow, I can just compile my whole graph, which again, in something like a Gaussian process, where computation can be quite heavy, really matters. JAX really solves that problem of making a highly computationally challenging problem slightly less challenging through the ability to JIT-compile your code.
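Here is a tiny sketch of the point Thomas is making: JAX will differentiate and JIT-compile a GP marginal log-likelihood with respect to every leaf of a parameter pytree for free. The kernel, data, and parameter values are illustrative.

```python
import jax
import jax.numpy as jnp

def neg_marginal_log_lik(params, x, y):
    # Squared-exponential kernel built from the parameter pytree.
    d2 = (x[:, None] - x[None, :]) ** 2
    K = params["variance"] * jnp.exp(-0.5 * d2 / params["lengthscale"] ** 2)
    K = K + params["noise"] * jnp.eye(x.shape[0])
    # Negative log marginal likelihood via the Cholesky factor.
    L = jnp.linalg.cholesky(K)
    alpha = jax.scipy.linalg.cho_solve((L, True), y)
    return 0.5 * y @ alpha + jnp.sum(jnp.log(jnp.diag(L))) + 0.5 * x.shape[0] * jnp.log(2 * jnp.pi)

# Gradients with respect to every leaf of the parameter pytree, JIT-compiled.
loss_and_grad = jax.jit(jax.value_and_grad(neg_marginal_log_lik))

x = jnp.linspace(0.0, 1.0, 50)
y = jnp.sin(6.0 * x)
params = {"variance": 1.0, "lengthscale": 0.2, "noise": 0.1}
value, grads = loss_and_grad(params, x, y)  # grads has the same pytree structure as params
```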
On the NumPyro side, NumPyro really solves the problem of, as you said, like integrating a Gaussian process into a wider framework because NumPyro is a lot more general than GPJax.
NumPyro simply provides a way for me to build any type of probabilistic model.
There's nothing to say we couldn't support that in GPJAX, but we're not a probabilistic programming library.
We're a Gaussian process library.
And this has really been the paradigm I have had from the start where GPJAX will fit Gaussian processes.
And then when there are good libraries for us to offload other functionality into, we'll always take that.
So for example, we don't implement any optimizers inside GPJAX because there's an excellent library called Optax that can achieve optimization.
So we write software within GPJAX, which allows us to hook into Optax and allow Optax to maintain an excellent optimization library.
That statement is true for NumPyro and the sampling.
There would be nothing stopping us integrating a Hamiltonian Monte Carlo sampler into GPJAX.
In fact, we did that using the Julia library um before.
However,
You then have to start maintaining a Monte Carlo or an MCMC sampler as well as your Gaussian process code.
And that becomes challenging.
And to be totally honest, like I can write a good piece of code to implement the Gaussian process, but I am not necessarily very well skilled in writing the best sampling library.
Whereas NumPyro does provide a really efficient sampling library.
So I think it's really about identifying dependencies and sort of codependent libraries where you can offload challenging functionality, and GPJax has always been designed to allow us to connect with other libraries as easily as possible and share the strengths of several libraries therein.
Yeah, yeah, that makes a ton of sense.
Much better to stand on the shoulders of giants than re-implementing everything on your own, you know, and to play to your strengths, as you were saying: you're a very good tool builder, so basically you have that ability to use different tools, blend them, and make life easier for people who need those kinds of models.
I think this is amazing, and that's also really the spirit of this show, because again, I resonate a lot with your profile and experience.
So yeah, for sure, I completely understand where you're coming from.
em Something you've somewhat...
talked about already a bit, but I'm curious to hear if you have more to say.
What are practical ways you use GPs, and maybe even GPJax, in your work in relation to causal inference?
Yeah.
So I use Gaussian processes typically if I...
I need to fit some type of regression model and I just kind of want something that's going to work and I don't want to have to think too much about it.
And I really care about the uncertainty.
I'll often reach for a Gaussian process.
There's an element of personal bias.
It's a model I simply know how to fit relatively well.
And when linear regression doesn't solve a problem, I know there are other like tree-based methods and there are several other different approaches I could reach for, but Gaussian
processes are simply the tool I have.
So they are sort of my default regression model.
We've used them quite a lot.
I had an unsuccessful attempt at using Gaussian processes for Bayesian synthetic diff-in-diff whilst at Amazon.
So there I had this idea of basically saying in synthetic control, your weight belonged to this Dirichlet distribution.
And I want to essentially correlate my weights in time.
So I kind of tried to place Gaussian process priors down on the concentration parameter and evolve that parameter through time. Or I also tried evolving the weights themselves through time, but the inference just became a bit of a nightmare; it was super unstable. Our R-hats were all over the place.
um The model was really not a great model and I never was able to get it to work very well.
But although the model was unsuccessful, I think.
it does speak to where Gaussian processes can be useful in sort of causal inference workflows and just generally in industrial workflows when you have this sort of latent
property which you wish to propagate over time or over some spatial domain.
I think it can be really useful.
I worked a little bit on some spatial data before where we essentially needed to sort of model a spatial residual between two satellites.
In this type of instance like
The satellites belonged, uh the satellites were measured on different resolutions.
Like one of them was on nine by nine kilometer grid cells and one of them was on 500 by 500 meter grid cells.
You kind of need to propagate this residual and spatially smooth it in these types of instances.
Like a Gaussian process is a really natural type of model, think.
So I think it's like anything.
I don't think they are the silver bullet.
I don't think a Gaussian process is the model to be used every single time.
But I think in these cases, where you really care about propagating uncertainty around a model and you really want to carry information through a model and never compress information down, Gaussian processes, in my experience, often allow you to achieve that relatively easily.
Same experience here. For these kinds of models, where you need nonlinear relationships, you don't really know what they look like mathematically, you don't really have the time and expertise to derive that, and you want a method that's going to work, GPs are going to be a great bet.
As we were saying, for spatial data for sure; me, I've used them a lot for time series data, and I recommend them a lot for that too.
So yeah, they are extremely versatile and now pretty easy to fit, honestly, and to sample, with the stacks we have.
Unless you have a huge amount of data, but even then, it's so much easier to sample GPs than it used to be.
I even think actually nowadays, even unless your dataset is truly huge and high dimensional, I actually think high dimensional hurts you more nowadays than high
number of observations in a Gaussian process.
We think with a lot of the stochastic sort of inducing point type ideas and Hilbert space approximations that we have for GPs, I think we kind of moved away a little bit now from
this idea that GPs don't scale.
I think they do scale in the number of observations.
I think the challenge is still scaling in a high number of covariates.
um
And this is still where I would be inclined to reach for some type of tree-based model.
However, yeah, they are pretty easy to fit nowadays.
I agree.
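For the curious, here is a compact sketch of the one-dimensional Hilbert-space (reduced-rank) GP approximation Thomas alludes to, following the Solin and Särkkä construction, where the GP becomes a finite linear model whose cost is linear in the number of observations. The specific values (domain half-width, number of basis functions, kernel parameters) are illustrative.

```python
import jax.numpy as jnp
import jax.random as jr

def hsgp_features(x, num_basis=30, L=2.0):
    # Laplacian eigenfunctions on [-L, L]; sqrt_eigvals[j] = pi * (j+1) / (2L).
    j = jnp.arange(1, num_basis + 1)
    sqrt_eigvals = jnp.pi * j / (2.0 * L)
    phi = jnp.sqrt(1.0 / L) * jnp.sin(sqrt_eigvals[None, :] * (x[:, None] + L))
    return phi, sqrt_eigvals

def rbf_spectral_density(omega, variance, lengthscale):
    # Spectral density of the squared-exponential kernel in one dimension.
    return variance * jnp.sqrt(2.0 * jnp.pi) * lengthscale * jnp.exp(-0.5 * (omega * lengthscale) ** 2)

# f(x) is approximated as phi(x) @ (sqrt(S(sqrt(lambda_j))) * beta), with beta ~ N(0, I).
x = jnp.linspace(-1.0, 1.0, 200)
phi, sqrt_eigvals = hsgp_features(x)
weights = jnp.sqrt(rbf_spectral_density(sqrt_eigvals, variance=1.0, lengthscale=0.3))
beta = jr.normal(jr.PRNGKey(0), (30,))
f_draw = phi @ (weights * beta)  # one draw from the approximate GP prior
```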
Yeah, no, exactly.
I mean, if people are actually interested in seeing what you can get from Gaussian processes, I will put in the show notes the project I still have ongoing with Max Goebel, who was on the show; I'll also link to his episode in the related episodes on your episode's show notes, Thomas.
But yeah, for that project, that's called the soccer factor model.
uh We used a bunch of hierarchical GPs on different timescale for soccer strikers to try and predict the number of goals they would score in their next game.
So yeah, like if you go to this dashboard, you'll see the depth of analysis you can get from these kinds of models.
this is, yeah, this is always to me, this is thanks to such powerful models.
so yeah, like this is an open source project.
So that's great because we can show everything.
So you can go in there.
There's a paper, there is the code that's available, the data, everything in there.
It's great if you want to learn GPs.
Yeah, I'm not aware of this work; that sounds super interesting.
um We actually did something I think not too different.
um We had a higher, we sort of developed a new sort of hierarchical sparse Gaussian process.
We weren't applying it to anything as fun as football players and strikers goal rates, but we were trying to sort of model different climate models and sort of in the climate
science world, there's these projections of sort of surface level temperature out to sort of 2100.
And different models, of course, produce different predictions and under different constraints, you end up with different predictions.
And we were actually trying to do something pretty similar to what you described where we were putting hierarchical Gaussian process down on this to kind of understand the latent
underlying sort of trajectory, which underpinned all of these models in a way to kind of ensemble them, but also allow each model to vary from one another.
Yeah, maybe we should compare approaches.
Maybe there's something interesting happening at the intersection.
Yeah, for sure.
Happy to share notes.
uh But yeah, that sounds a lot like what we've done.
Also, I think in your case it's the same as in our case: it was very clear that the hierarchical structure of the data was helpful, because we have players, but they are part of teams, and even players are just part of a broader population.
So that was something that was really important to us to have that kind of higher level population GP and that would trickle down towards each player who would have his own uh GP
that would vary from the population level one but would still have the information from the population.
uh What is great with that is that then you can
make a better prediction for out-of-sample players, you know, like young players, or players who come from another league and are completely new, let's say, to the Premier League; then you can still make a decent prediction.
Whereas before you were more or less flying blind.
So this is very important.
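For readers who want a feel for this structure, here is a stripped-down sketch of a hierarchical GP of the kind Alex describes: one population-level GP over time plus per-player deviation GPs, with a Poisson likelihood on goals. This is an illustrative toy, not the actual soccer factor model; the kernel, fixed hyperparameters, priors, and shapes are assumptions.

```python
import jax.numpy as jnp
import numpyro
import numpyro.distributions as dist

def rbf(x, variance, lengthscale, jitter=1e-5):
    d2 = (x[:, None] - x[None, :]) ** 2
    return variance * jnp.exp(-0.5 * d2 / lengthscale**2) + jitter * jnp.eye(x.shape[0])

def hierarchical_gp_goals(t, n_players, goals=None):
    """t: (T,) shared time grid; goals: (n_players, T) goal counts."""
    # Population-level GP: the shared trajectory all players partially pool toward.
    # Kernel hyperparameters are fixed here for simplicity; in practice you would put priors on them.
    K_pop = rbf(t, variance=1.0, lengthscale=2.0)
    f_pop = numpyro.sample("f_pop", dist.MultivariateNormal(jnp.zeros(t.shape[0]), covariance_matrix=K_pop))
    # Player-level deviation GPs, shrunk toward zero (and hence toward the population curve).
    dev_scale = numpyro.sample("dev_scale", dist.HalfNormal(0.5))
    K_dev = rbf(t, variance=1.0, lengthscale=1.0)
    with numpyro.plate("players", n_players):
        f_dev = numpyro.sample("f_dev", dist.MultivariateNormal(jnp.zeros(t.shape[0]), covariance_matrix=K_dev))
        # Each player's log goal rate = population curve + their own (scaled) deviation.
        log_rate = f_pop[None, :] + dev_scale * f_dev
        numpyro.sample("goals", dist.Poisson(jnp.exp(log_rate)).to_event(1), obs=goals)
```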
Yeah.
Do you also have incomplete data?
Because this is something we had to grapple with: not every satellite provided reliable predictions at every single point in the world.
Some of the models were better at different points in the world.
Do you have the same thing with your soccer players where I guess not every player will play every game in a season, so you end up with kind of incomplete data?
Is that also Yeah, exactly.
ah So Max knows the data set by heart because he's the one who literally made it from nothing.
But I am very certain that, yes, we have players like that, where they get injured, or you have a lot of replacement-level players, or even average players, who just come off the bench.
They don't start all the time, and they don't enter the game all the time.
And that's where having the hierarchical structure also helps a lot, because, well, it's a combination of having predictors that give you information about who the player is, and also having the hierarchical structure, which then fills in those missing data points.
Yeah.
Yeah.
Super interesting.
Yeah.
This is a very fun project that we've actually been doing for almost two years now, but it just keeps on giving, you know; it's such a big project and also a fun playground that you can always find something new to keep pushing the boundary on.
So yeah, it's one of these fun projects.
So, to close up on GPJax, and then I want to touch on another project of yours, because you do a lot of things, Thomas.
But I'm just curious about the roadmap for the coming months for GPJax.
Do you have any priorities already, things you want to touch on?
Yeah.
So, yeah, it's possible that I follow the literature less closely these days, now that I'm not in an academic setting, but my observation is that the literature around GPs has quietened a little bit from where it was several years ago.
when I look forward, at least for the remainder of this year, I'm not imagining implementing any new methodologies into GPJAX.
We have a pretty wide range of kernels, at least the major kernels, and our goal is not to implement every single kernel.
Our goal is to provide the infrastructure which would allow someone to implement their own kernel very quickly.
We have that and we have a couple of different variational approximations, which one could use depending on their use case.
And we also have a few different inducing point schemes and multi-output Gaussian processes implemented.
So I think we have a pretty complete set of functionality within GPJax.
So I think when I look forward to the remainder of this year, I think I want to do two things with GPJAX.
I would quite like to close that gap between GPJax and something like SKLearn or GPyTorch.
Essentially providing a higher-level abstraction of GPJax, which would allow someone to fit a Gaussian process in a few lines of code, versus the sort of 20 lines of code we make you write today. Using something like a YAML file or just a JSON config, and then building your model and fitting it.
Because whilst the original use case of GPJAX was for researchers, I acknowledge that many people, and myself included, sometimes just want to fit a GP.
I don't want to define all of the boilerplate code every time.
And also when I fit a GP, I want things like logging to be implemented under the hood.
I want things...
like to connect to AWS or some cloud-based system.
Like all of these things can be handled behind the scene very sort of easily.
And I think I would like to provide this abstraction within GPJax, maybe in a sort of secondary library. I'm not entirely sure at this point, but essentially giving people a more production-ready form of using GPJax.
The other thing is I would really like to improve the quality of our documentation.
I think we've done a pretty good job with our documentation; we've made a conscious effort to always explain the underlying maths as well as what the code is doing.
However, most of our examples use simulated or synthetic data, and I think we could make them far more interesting by just using real data.
We had a collaborator, well, two collaborators actually, I should say, a few years ago now, who contributed a really, really cool notebook to GPJax on modeling vectors.
So rather than predicting a scalar value, you're actually predicting a vector.
And they used this to model ocean currents with GPJax, modeling the spatial vector field.
You can do this using a certain type of kernel, and I think this is a really cool notebook: it uses real data and it's a real problem which people actually work on.
So I would like to migrate some of the GPJax examples away from using simulated data and to bring in some new data sets which are interesting and may just catch the interest of people browsing the documentation, so they might look at one and think, oh, I have a very similar problem to that.
Versus today, where the examples sort of build a data set by composing a sinusoidal function and a bit of white noise together.
Going a little bit beyond that.
And to be honest, where I am now, those are the bits of documentation which I enjoy writing as well.
Claude Code and Cursor are so good now, they can implement most of the backend-type code far quicker than I ever could, with fewer bugs.
But I don't think they are very good yet at writing documentation and writing case studies.
I think there's still a handcrafted narrative which is better than what an LLM can generate for us, and I still find enjoyment in writing these types of things.
So I think those are the two sort of angles in which I would like to steer GPJax this year.
Yeah, makes sense.
And I kind of have the same experience, where I feel I can delegate a lot of the coding to the agent and basically supervise its work.
Basically doing what I've done a lot on the PyMC side, where you review pull requests: here you have your agent writing the code and so on, then pinging you for a pull request review, and that goes much faster.
And then I get to go to the most enjoyable part, which is: okay, how do I actually use that code?
How do I show people how to use that?
How do I basically teach people where and when and why this is useful, and write this kind of pedagogical content?
I think we as humans still have a much more interesting viewpoint there, because we're more aware of the difficulties we had when we learned this.
This is actually something we enjoy more, and it's mostly what's really useful to people, because most users are introduced to a software package not through the code on GitHub, but through the example notebooks on the documentation website.
So yeah, it makes sense to invest a lot of time in there.
And actually, this notebook, if you have it already ready to share with people, please feel free to add it to the show notes.
I think this is going to be very helpful.
Sure.
Yeah, will do.
Awesome.
So now I'm going to start to close us up, because it's getting too late for you, Thomas.
But I do want to talk about Impulso.
That's one of your new projects.
And again, the resemblance is uncanny, because this is one of our common passions too.
This is a package to do vector autoregression in Python, in PyMC even.
So of course I looked into it before the show, but can you give us the elevator pitch: what is this about, why did you even do that, and what is the state of this project?
Yeah.
So Impulso is a play on the impulse response function, which is an incredibly useful property you get out of structural VAR models.
So a VAR model is essentially a way of modeling a time series with a vector of outputs, acknowledging that there may be correlation between those outputs.
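To fix ideas, here is the standard reduced-form VAR(1) and its impulse response function; structural VARs add identifying restrictions (for example a Cholesky ordering of the error covariance) so the shocks can be interpreted causally.

```latex
% Reduced-form VAR(1) with k outputs: each series depends on the lagged values of all series.
y_t = c + A\, y_{t-1} + \varepsilon_t, \qquad
\varepsilon_t \sim \mathcal{N}(0, \Sigma), \quad y_t \in \mathbb{R}^{k}
% Impulse response at horizon h for a stable VAR(1):
% the effect of a shock today on the outputs h steps ahead.
\mathrm{IRF}(h) = \frac{\partial\, y_{t+h}}{\partial\, \varepsilon_t^{\top}} = A^{h}
```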
Structural VARs are something which, to be totally honest, I'm not an expert in.
I worked on them with a colleague whilst at Amazon and found them fascinating, the use cases of them.
And since then I have just taken an active interest in reading about them and trying to learn more about them.
I feel like I have a moderate understanding of the principles, and really it was a case of trying to use them, but there wasn't really a reliable, good implementation of them that I could find in Python which would allow me to fit them in a Bayesian manner.
I think there are some excellent R packages, but I'm not the person who can write good R code for you.
And so I thought, let's try and implement something in Python to solve this problem.
It also came about because of what I just touched on with the use of LLMs.
So I use LLMs, and Claude Code namely, within GPJax nowadays.
And it's pretty good, but I have a huge bias in the sense that I wrote or reviewed every single line of GPJax, so I know when the LLM is doing something weird or is in the wrong area.
I was really curious how we could build software libraries from the ground up using LLMs.
So I thought Impulso was a great opportunity to try that out, and to document my learnings as I go about how we can use LLMs to build software libraries, because we will always need new software libraries.
I think it's super important that we continue to fill gaps in the ecosystem with new libraries, and so I was curious how we could leverage LLMs to achieve that most effectively.
So actually, I guess in summary, Impulso is an entirely selfish project where I've used it to learn more about VAR models and structural VAR models, and also about how we can use LLMs to build code libraries.
Okay.
That makes a ton of sense to me.
Basically, you were curious about these models and you were looking for a toy project.
And again, this is what I do all the time; I'm really, really impressed by the resemblance here.
And yeah, I have to say, I looked around the project and also sent it to one of my friends, Jesse Grabowski, who is actually the person who introduced me the most to state space models.
So I owe all the math I know about state space models to Jesse, who is always extremely patient with me and my misunderstandings of matrix algebra.
Jesse was on the show in episode 124 to talk exactly about state space models, especially his brainchild, which is the sub-package in PyMC Extras to do state space models with PyMC.
So you can do VARs with that, vector autoregressions, although we do it a bit differently in PyMC Extras, because we do it from the state space equation point of view, with Kalman filters.
Whereas, if I understood correctly, in Impulso you do that from the linear regression equation point of view.
A good thing about that is that your way of doing it is going to be much faster to sample, because it's a linear regression.
To give listeners an idea, that's going to be a bit more like something like Prophet, if you're used to those kinds of packages, whereas the Kalman filter approach is what we implemented in PyMC Extras, which is going to take more time to sample, but you can do all the other state space model stuff.
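Here is a minimal sketch of that "linear regression view" of a Bayesian VAR(1) in NumPyro; it is not the Impulso API, just a generic illustration of why the formulation samples quickly (the error covariance is simplified to be diagonal, whereas a full VAR would typically use a multivariate normal with an LKJ prior).

```python
import numpy as np
import jax.numpy as jnp
from jax import random
import numpyro
import numpyro.distributions as dist
from numpyro.infer import MCMC, NUTS

def var1_model(y):
    # y: (T, k) array of observations; VAR(1) written as a multivariate regression
    T, k = y.shape
    c = numpyro.sample("c", dist.Normal(0.0, 1.0).expand([k]))
    A = numpyro.sample("A", dist.Normal(0.0, 0.5).expand([k, k]))
    sigma = numpyro.sample("sigma", dist.HalfNormal(1.0).expand([k]))  # diagonal noise for simplicity
    mu = c + y[:-1] @ A.T  # one-step-ahead mean for t = 1, ..., T-1
    numpyro.sample("obs", dist.Normal(mu, sigma), obs=y[1:])

# Toy data: a 2-dimensional random walk
y = np.cumsum(np.random.default_rng(0).normal(size=(200, 2)), axis=0)
mcmc = MCMC(NUTS(var1_model), num_warmup=500, num_samples=500)
mcmc.run(random.PRNGKey(0), y=jnp.asarray(y))
mcmc.print_summary()
```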
So, for people curious about these models, I really encourage you to take a look at Impulso.
I think the documentation is really well done.
And you're still developing the project, so of course it's still evolving.
But if you want to give Thomas a hand, feel free to do so: open issues, open pull requests.
As an open source developer, I always appreciate those, so I'm sure Thomas will too.
And you can also check out PyMC Extras; I'll put that in the show notes, along with the related episode, the one with Jesse.
And finally, Jesse and I taught a tutorial last September at PyData Berlin introducing the PyMC Extras state space sub-module.
So if you want an introduction to that, I encourage you to check out the show notes; I'll put that in there.
It's called the Beginner's Guide to State Space Modeling.
On that note, Thomas, anything else you want to add about Impulso, or did we summarize it well?
No, I think you gave an excellent summary.
I would maybe add that, obviously, there is the backend difference you mentioned between Impulso and PyMC Extras.
I think the PyMC Extras state space library also definitely tries to give you more flexibility, much like I described with GPJax: it allows you to compose things together and gives you a lot of control.
That's actually not the problem I'm trying to solve in Impulso.
I'm sort of building software here maybe for economists who just want to fit these models and get access to the inference from the model, and worry less about the mechanics of the model itself.
It's more of a high-level abstraction.
So yeah, they might do different things on the backend, but I think they also maybe will end up trying to do different things on the front end as well.
Right, yeah.
They probably optimize for a different kind of user population.
I think the state space sub-module is made to be self-contained in a way, if you're using PyMC, but it still lives inside a full probabilistic programming language environment, so you can go and combine elements together and have structural time series and things like that.
I think Impulso, from what I saw, is more opinionated, and I think that's what you were going for, because that way users have to make fewer choices.
Often some kinds of users will prefer that, and others won't.
So yeah, it's great to have this diversity in there.
Do you still have, let's say, five or ten more minutes so that I can ask you two other questions before the last two questions, or should I cut to the last two questions already?
No, I can hang around.
Okay, thanks.
Thanks for taking the time.
I was actually curious, since you've worked a lot in these industrial settings, something I really want to ask you is: how do you balance the computational overhead of Bayesian sampling with the latency requirements of production-level decision making?
So, to be honest, I've never really worked too close to the edge in terms of having to give model inferences in the sort of milliseconds.
Often for me, the limiting factor has not been the inference speed of a model, but more the decision-making speed of an individual human stakeholder and their ability to digest information.
So oftentimes we would run a campaign, we would get the results, and then we would have a few days to analyze them.
So there the models are never taking more than a few hours to fit, and really then you have a couple of days to digest the outcomes and work out how to report them.
So rarely have I been in that case, but even within that there are SLAs, or service level agreements, that your model should respect when it sits in a production system.
And I think actually Bayesian models nowadays, with PyMC and NumPyro and the different samplers we have, yes, they're often slower than their frequentist counterparts, but honestly they're not so slow nowadays.
We're not always talking about several hours or days to fit.
If your model is correctly framed and the priors are reasonable and the likelihood is a good choice, they'll often fit quite quickly.
And when they're fitting very slowly, sometimes that's actually an artifact of your model: you've got an ill-posed model and it's simply slow to sample, because the sampling routine itself becomes difficult, maybe because you've specified a bad likelihood.
However, they are still slower, and in those cases, like in the past when I've had really tight service level agreements and we've had to give inferences as quickly as possible, I've often ended up using things like conformal prediction.
So taking frequentist models and then trying to get some type of prediction interval on top using conformal prediction approaches, which are super fast; often in the simplest form, they're just a case of taking some predefined quantiles and applying them to your predictions.
So those can be an option when you're really pinched for time.
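As a concrete illustration of that "predefined quantiles" idea, here is a minimal split conformal prediction sketch on synthetic data: fit any point predictor, compute absolute residuals on a held-out calibration set, and add or subtract their adjusted quantile to get marginal prediction intervals.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=500)

# Fit on one half, calibrate the conformal quantile on the other half
X_fit, X_cal, y_fit, y_cal = train_test_split(X, y, test_size=0.5, random_state=0)
model = LinearRegression().fit(X_fit, y_fit)

alpha = 0.1                                    # target 90% coverage
scores = np.abs(y_cal - model.predict(X_cal))  # absolute residuals as conformity scores
n = len(scores)
q = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")

X_new = rng.normal(size=(5, 3))
pred = model.predict(X_new)
lower, upper = pred - q, pred + q              # marginal 90% prediction intervals
```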
I would also say, though, that sometimes working backwards from the constraints you have is a sensible approach.
When I used to work in the supply chain team at Amazon, we used to have to deliver predictions within 12 hours.
We actually framed that as: what can we do within 12 hours to guarantee the fastest possible model?
That means you maybe can't evaluate every single data point in your training set.
So actually building secondary models which select the most informative data points, kind of like in active learning, feeding those into your model, and fitting your model on those points, then you can get under 12 hours.
So sometimes there are just constraints you have to work around.
In this particular model we were fitting in the supply chain, we actually really needed a posterior distribution; there was no way around it.
So there it was a case of working backwards from that and working out the best you can do within the time you have.
I guess what I'm saying is there's no universal answer, but Bayesian methods are not as slow as they once were, and often there are ways to accelerate them when you really need to.
Yeah, I couldn't have said it better.
The thing is, I wanted you to say it, because otherwise I sound like a preacher, and that's less objective.
But yeah, it's also been my experience: I very rarely am convinced that you cannot use a Bayesian model if you want one.
I've actually never really seen that in all the different models I've shipped in industry.
It's always been, well, we can make that model better.
And as you were saying, most of the time there is actually an overparameterization, or the priors are too wide or too narrow.
Or we can use the latest Nutpie sampler, and we can use an approximation of the GP here: let's use an HSGP instead of a plain vanilla GP.
All that stuff, and then you end up sampling huge hierarchical Gaussian processes with a lot of data points on your laptop in like 15 minutes.
Well, you know, that's very, very fine, especially if you don't need to run that model every day.
And even if you did, that's 15 minutes, so it's also super fast.
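A minimal sketch of that workflow in PyMC, assuming a recent PyMC version with the HSGP approximation and the nutpie sampler installed (argument names follow the PyMC docs but may shift between releases):

```python
import numpy as np
import pymc as pm

# Toy 1-D data
X = np.linspace(0, 10, 500)[:, None]
y = np.sin(X[:, 0]) + np.random.default_rng(0).normal(scale=0.2, size=500)

with pm.Model():
    ell = pm.InverseGamma("ell", alpha=3, beta=2)
    eta = pm.HalfNormal("eta", sigma=1)
    cov = eta**2 * pm.gp.cov.ExpQuad(1, ls=ell)
    # Hilbert-space GP approximation: m basis functions, boundary factor c
    gp = pm.gp.HSGP(m=[100], c=1.5, cov_func=cov)
    f = gp.prior("f", X=X)
    sigma = pm.HalfNormal("sigma", sigma=1)
    pm.Normal("y_obs", mu=f, sigma=sigma, observed=y)
    idata = pm.sample(nuts_sampler="nutpie")  # requires the nutpie package
```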
Yes, I agree.
I totally agree.
So yeah, this is already such a great point, right?
Because I remember when I was starting this podcast in 2019, that was a huge point of contention.
Not necessarily from academics and practitioners, people who already knew Bayesian stats, but from people who could still have this preconceived notion that it's most of the time the best theoretically, but impractical to apply, you know, let's not even try.
And now I feel like this battle has pretty much been won, at least when it comes to education, most of the time.
I agree.
I think with the scaling of GPUs and the quality of CPUs available to us... data sets might get bigger, and GPs in their vanilla form are a particularly bad case with their cubic scaling in the number of data points, but often throwing more compute at a problem can solve a lot of your problems very quickly, if you have the resources to do so.
When it comes to that, because, well, a GP, this is not such a big model.
Yes, that's a good point.
That's an easy one, but yeah.
And actually, you know, given your experience and your work on these scalable Bayesian tools, I'm wondering how optimistic you are that these methods will soon become the default choice for causal inference in the industry.
Yeah, that's a good question.
I'm not sure they will ever be the default choice.
And maybe that's okay.
It's my belief that when we report treatment effects, we should be talking about the uncertainty that comes through a posterior distribution.
But there are other fine ways of doing this.
I actually often don't think they lead to the wrong outcome; I think they're often just more work to get to that same outcome.
I think there are pretty popular frameworks which can achieve pretty good results.
I think the double ML framework in certain instances can be quite an effective tool, and it's often used as the default choice when maybe we could look to other models to achieve what double ML can do for us.
But I think there'll always be other tools.
As we evolve as an industry to keep using causal inference to provide evidence around high-value or high-risk decision-making, because that's still not the norm, I don't think; I think we are still working towards a world where everyone is doing that.
But as we continue to do that, I think the need for Bayesian methods becomes more pronounced, not because we as Bayesian practitioners believe them to be the right model to use, but because the guidance we're getting from our stakeholders means the only model that is really reasonable to use is a Bayesian model.
For example, when I've worked on some geo-testing and video marketing projects in the past, our stakeholders really wanted to know things like: what's the risk of us making the wrong decision here?
And can you put dollar values on that?
That's really quite easy to do when you have access to the posterior distribution.
But if you're trying to do that through the lens of p-values or confidence intervals, that's actually quite challenging.
And so I think Bayesian methods may end up being more commonly used because we end up being more risk-minded as we, as an industry, evolve towards using these models more for our decision making.
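As a tiny illustration of why that is easy with a posterior, here is a hedged sketch with made-up numbers: once you have posterior draws of the incremental dollar impact of a decision, the risk of being wrong and its expected cost are one-liners.

```python
import numpy as np

# Hypothetical posterior draws of a campaign's incremental revenue, in dollars
# (in practice these would come from your fitted Bayesian causal model).
rng = np.random.default_rng(1)
lift_draws = rng.normal(loc=120_000, scale=80_000, size=4_000)

prob_negative = (lift_draws < 0).mean()              # risk that the campaign loses money
expected_loss = -lift_draws[lift_draws < 0].mean()   # average dollar loss when it does
print(f"P(negative lift) = {prob_negative:.1%}, expected loss if negative = ${expected_loss:,.0f}")
```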
Yeah.
I agree, and I think that was kind of a provocative question on my end. I completely agree with you: the assumption of the question was that Bayesian stats would somehow need to become the default.
I don't think they do, but I do think they need to be more readily accepted out of the box, as something in the toolbox.
Basically, just as you're using scikit-learn or PyTorch or any other framework that looks classic to any member of the team, then when it's needed, you use a Bayesian causal inference method because that's what solves the problem best.
And you don't have to overly justify why; you just justify it as you would justify using a neural network or an LLM for some other project.
And I think that's definitely what we're building towards, which is: yeah, here it definitely makes sense, let's just use that and we're good, and then we get all the bells and whistles.
I'm also very sure that when people start to see that you don't need to know what a p-value is, and that you can actually interpret the intervals the way you want to interpret them, this will become one of the beloved methods of statistical folks for sure.
Yes, I totally agree on that point.
I also think, just picking up on something you just said, that there's an element of needing to own our own shortcomings here.
I think scikit-learn and other tools have done an amazing job of making frequentist-type methods super accessible, and to be completely honest, we as the Bayesian community have not done such a good job.
A lot of our packages and a lot of our frameworks are very much geared towards researchers or people wishing to fit new Bayesian models.
And I think there are libraries like PyMC and NumPyro, and sort of what I'm trying to do with Impulso, that try to bridge that gap.
But we really haven't designed an ecosystem which makes it particularly friendly for people who don't already understand a lot about Bayesian methods to just go and fit a Bayesian model off the shelf.
And so I think there's an element of: we also need to continue to build the right tools, which will allow Bayesian methods to become a more default option within a practitioner's toolbox.
Yeah, 100% agree.
Awesome.
Well, Thomas, I could continue this conversation, but you're going to have to go to bed at some point.
I'm going to let you get some sleep; I'm that magnanimous.
But before letting you go, I'm going to ask you the last two questions I ask every guest at the end of the show.
First one, if you had unlimited time and resources, which problem would you try to solve?
So yeah, I mean, there are huge problems in the world, and most of them I'm grossly incapable of solving.
But a big problem which I would love to try and solve, if I had unlimited time and resources, would be the supply chain issue that we have in the world.
The brittleness and fragility of supply chains feels like a problem which I could imagine working on.
But it's incredibly complex getting data around where the weak points in supply chains sit, and trying to model them using Bayesian methods would be incredibly meaningful and impactful, and also incredibly costly, both in terms of time, money and resources.
But if you're telling me I have unlimited time and resources, I think that would be the problem I would spend a lot of time thinking about.
How do we build more resilient supply chains, and how do we understand the risk of supply chain problems?
Yeah, that makes sense for sure.
And then the second question: if you could have dinner with any great scientific mind, dead, alive or fictional, who would it be?
It would be amazing to have dinner with Radford Neal, actually.
I always find Radford Neal's papers incredibly conversationally written whilst also being highly academically stimulating and informative.
I could imagine he'd be an incredibly interesting person to have dinner with.
Really, for me, when I was first getting into the world of Bayesian statistics, some of his papers connecting Gaussian processes to infinitely wide neural networks, and on Hamiltonian Monte Carlo, and a lot of these types of papers, honestly were so accessible whilst also being so dense in information that I would keep reading them and coming back to them.
It'd be amazing to sit down with Radford Neal and try and learn some of the things which he has in his head.
Yeah, I love that.
And I think you're the first one to give that answer on the show.
Not that that's the goal; the goal is to sample the distribution, but you're definitely in the tails.
Well, Thomas, let's call it a show.
That was an absolute pleasure to have you on; thank you for staying up for us, I'm sure everybody appreciates it.
And listeners, if you want to give Thomas a token of appreciation for his work, definitely drop him a word wherever you go, LinkedIn or GitHub, if you're using any of his projects.
Or if you have ideas on how to contribute, or just want to get started in open source and think that one of Thomas's projects is something you want to contribute to, then just chime in on the GitHub repo and I'm sure he'll be very happy.
On that note, Thomas, thank you so much for taking the time and being on this show.
Thank you ever so much for having me.
It's been a really fun conversation.
I couldn't have thought of a better way to spend my evening.
Well, I am very glad to hear that.
This has been another episode of Learning Bayesian Statistics.
Be sure to rate, review, and follow the show on your favorite podcatcher, and visit learnbayesstats.com for more resources about today's topics, as well as access to more episodes to help you reach true Bayesian state of mind.
That's learnbayesstats.com.
Our theme music is Good Bayesian by Baba Brinkman, featuring MC Lars and Mega Ran.
Check out his awesome work at bababrinkman.com.
I'm your host, Alex Andorra.
You can follow me on Twitter at alex_andorra, like the country.
You can support the show and unlock exclusive benefits by visiting patreon.com/learnbayesstats.
Thank you so much for listening and for your support.
You're truly a good Bayesian.
Change your predictions after taking information in, and if you're thinking I'll be less than amazing, let's adjust those expectations.
Let me show you how to be a good Bayesian.
Change calculations after taking fresh data in.
Those predictions that your brain is making? Let's get them on a solid foundation.
GPJax was developed to provide a high-performance, flexible framework for Gaussian processes (GPs) within the JAX ecosystem. It allows researchers to move beyond black-box implementations and easily experiment with custom kernels and model structures while leveraging JAX’s automatic differentiation and GPU acceleration.
Gaussian processes are highly effective at modeling complex, nonlinear relationships in data. Unlike many machine learning methods that only provide a point estimate, GPs offer built-in uncertainty quantification, which is essential for understanding the reliability of predictions in research and industry.
The integration allows users to treat GPJax models as components within a larger NumPyro probabilistic program. This combination enables the use of advanced sampling techniques like NUTS (No-U-Turn Sampler), making it easier to build and fit complex hierarchical models that include Gaussian processes.
High-dimensional data significantly complicates GP modeling due to the curse of dimensionality and the cubic scaling of computational costs. In high dimensions, defining meaningful distance metrics for kernels becomes harder, often requiring specialized techniques like sparse GPs or dimensionality reduction to remain tractable.
Bayesian synthetic control offers a richer mode of inference by providing full posterior distributions for treatment effects. Instead of a single point estimate, it allows practitioners to quantify the probability and magnitude of an intervention's impact, leading to more nuanced and robust decision-making.
In causal inference, Bayesian methods provide the necessary context for decision-making. By incorporating prior knowledge and propagating uncertainty throughout the entire model, they ensure that the final causal estimates reflect the true level of evidence available, preventing overconfident business decisions.
Synthetic Difference-in-Differences is a hybrid causal inference method that combines the strengths of Synthetic Control and traditional Difference-in-Differences. It uses unit weights to balance control groups against treated units and time weights to balance pre- and post-treatment periods, resulting in more stable and accurate estimates.
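For reference, the point-estimate form of the Synthetic Difference-in-Differences estimator (Arkhangelsky et al., 2021) is a weighted two-way fixed-effects regression; a Bayesian treatment, as discussed in the episode, places priors on these quantities so that uncertainty in the weights propagates into the treatment effect.

```latex
\left(\hat{\tau}^{\text{sdid}}, \hat{\mu}, \hat{\alpha}, \hat{\beta}\right)
= \arg\min_{\tau,\,\mu,\,\alpha,\,\beta}
\sum_{i=1}^{N} \sum_{t=1}^{T}
\left( Y_{it} - \mu - \alpha_i - \beta_t - W_{it}\,\tau \right)^{2}
\,\hat{\omega}^{\text{sdid}}_{i}\,\hat{\lambda}^{\text{sdid}}_{t}
```

Here W_it indicates treatment status, the omega terms are unit weights chosen so that the weighted controls track the treated units before treatment, and the lambda terms are time weights matching pre-treatment periods to the post-treatment period.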
Uncertainty propagation ensures that the noise and error from every stage of a model – from the initial data balancing to the final regression – is carried through to the end result. Without this, the final estimate of a treatment effect may appear more certain than the data actually warrants.
Hierarchical models allow for partial pooling, where information is shared across different groups in a dataset. This enhances predictions for new units or those with missing observations by allowing the model to borrow strength from the broader group-level distribution.
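A minimal sketch of partial pooling in NumPyro (the names are hypothetical and this is not the soccer model discussed in the episode): each player's skill is drawn from a shared group-level distribution, so sparsely observed players are shrunk towards the population.

```python
import numpyro
import numpyro.distributions as dist

def player_model(player_idx, n_players, y=None):
    # Group-level (population) parameters
    mu = numpyro.sample("mu", dist.Normal(0.0, 1.0))
    tau = numpyro.sample("tau", dist.HalfNormal(1.0))
    # One skill per player, partially pooled towards the population mean
    with numpyro.plate("players", n_players):
        skill = numpyro.sample("skill", dist.Normal(mu, tau))
    sigma = numpyro.sample("sigma", dist.HalfNormal(1.0))
    # Each observation is indexed by the player who produced it
    numpyro.sample("obs", dist.Normal(skill[player_idx], sigma), obs=y)
```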
Impulso was created to address the lack of specialized tools for structural Vector Autoregression (SVAR) models in Python. It provides a dedicated framework for analyzing time-series dynamics, supply chain resilience, and the impact of economic shocks within a modern data science stack.

Related episodes:
#144 Why is Bayesian Deep Learning so Powerful, with Maurizio Filippone
Modeling Webinar | Fast & Efficient Gaussian Processes, with Juan Orduz
#104 Automated Gaussian Processes & Sequential Monte Carlo, with Feras Saad
#129 Bayesian Deep Learning & AI for Science with Vincent Fortuin
#124 State Space Models & Structural Time Series, with Jesse Grabowski
#134 Bayesian Econometrics, State Space Models & Dynamic Regression, with David Kohns
#87 Unlocking the Power of Bayesian Causal Inference, with Ben Vincent
#91 Exploring European Football Analytics, with Max Göbel