Proudly sponsored by PyMC Labs, the Bayesian Consultancy. Book a call, or get in touch!
Our theme music is « Good Bayesian », by Baba Brinkman (feat MC Lars and Mega Ran). Check out his awesome work!
Visit our Patreon page to unlock exclusive Bayesian swag 😉
Takeaways
- Convincing non-stats stakeholders in sports analytics can be challenging, but building trust and confirming their prior beliefs can help in gaining acceptance.
- Combining subjective beliefs with objective data in Bayesian analysis leads to more accurate forecasts.
- The availability of massive data sets has revolutionized sports analytics, allowing for more complex and accurate models.
- Sports analytics models should consider factors like rest, travel, and altitude to capture the full picture of team performance.
- The impact of budget on team performance in American sports and the use of plus-minus models in basketball and American football are important considerations in sports analytics.
- The future of sports analytics lies in making analysis more accessible and digestible for everyday fans.
- There is a need for more focus on estimating distributions and variance around estimates in sports analytics.
- AI tools can empower analysts to do their own analysis and make better decisions, but it’s important to ensure they understand the assumptions and structure of the data.
- Measuring the value of certain positions, such as midfielders in soccer, is a challenging problem in sports analytics.
- Game theory plays a significant role in sports strategies, and optimal strategies can change over time as the game evolves.
Chapters
00:00 Introduction and Overview
09:27 The Power of Bayesian Analysis in Sports Modeling
16:28 The Revolution of Massive Data Sets in Sports Analytics
31:03 The Impact of Budget in Sports Analytics
39:35 Introduction to Sports Analytics
52:22 Plus-Minus Models in American Football
01:04:11 The Future of Sports Analytics
Thank you to my Patrons for making this episode possible!
Yusuke Saito, Avi Bryant, Ero Carrera, Giuliano Cruz, Tim Gasser, James Wade, Tradd Salvo, William Benton, James Ahloy, Robin Taylor, Chad Scherrer, Zwelithini Tunyiswa, Bertrand Wilden, James Thompson, Stephen Oates, Gian Luca Di Tanna, Jack Wells, Matthew Maldonado, Ian Costley, Ally Salim, Larry Gill, Ian Moran, Paul Oreto, Colin Caprani, Colin Carroll, Nathaniel Burbank, Michael Osthege, Rémi Louf, Clive Edelsten, Henri Wallen, Hugo Botha, Vinh Nguyen, Marcin Elantkowski, Adam C. Smith, Will Kurt, Andrew Moskowitz, Hector Munoz, Marco Gorelli, Simon Kessell, Bradley Rode, Patrick Kelley, Rick Anderson, Casper de Bruin, Philippe Labonde, Michael Hankin, Cameron Smith, Tomáš Frýda, Ryan Wesslen, Andreas Netti, Riley King, Yoshiyuki Hamajima, Sven De Maeyer, Michael DeCrescenzo, Fergal M, Mason Yahr, Naoya Kanai, Steven Rowland, Aubrey Clayton, Jeannine Sue, Omri Har Shemesh, Scott Anthony Robson, Robert Yolken, Or Duek, Pavel Dusek, Paul Cox, Andreas Kröpelin, Raphaël R, Nicolas Rode, Gabriel Stechschulte, Arkady, Kurt TeKolste, Gergely Juhasz, Marcus Nölke, Maggi Mackintosh, Grant Pezzolesi, Avram Aelony, Joshua Meehl, Javier Sabio, Kristian Higgins, Alex Jones, Gregorio Aguilar, Matt Rosinski, Bart Trudeau, Luis Fonseca, Dante Gates, Matt Niccolls, Maksim Kuznecov, Michael Thomas, Luke Gorrie, Cory Kiser, Julio, Edvin Saveljev, Frederick Ayala, Jeffrey Powell, Gal Kampel, Adan Romero, Will Geary, Blake Walters, Jonathan Morgan and Francesco Madrisotti.
Links from the show:
- LBS Sports Analytics playlist: https://www.youtube.com/playlist?list=PL7RjIaSLWh5kDiPVMUSyhvFaXL3NoXOe4
- Paul’s website: https://sabinanalytics.com/
- Paul on GitHub: https://github.com/sabinanalytics
- Paul on LinkedIn: https://www.linkedin.com/in/rpaulsabin/
- Paul on Twitter: https://twitter.com/SabinAnalytics
- Paul on Google Scholar: https://scholar.google.com/citations?user=wAezxZ4AAAAJ&hl=en
- Soccer Power Ratings & Projections: https://sabinanalytics.com/ratings/soccer/
- Estimating player value in American football using plus–minus models: https://www.degruyter.com/document/doi/10.1515/jqas-2020-0033/html
- World Football R Package: https://github.com/JaseZiv/worldfootballR
Transcript
This is an automatic transcript and may therefore contain errors. Please get in touch if you’re willing to correct them.
Folks, you may know it by now: I am a huge sports fan. So needless to say, this episode was like being in a candy store for me. Well, more appropriately, in a chocolate store. Paul Sabin is so knowledgeable that this conversation was an absolute blast for me.
In it, Paul discusses his experience with non-stats stakeholders in sports analytics and the challenges of convincing them to adopt evidence-based decisions. He also explains his soccer power ratings and projections model, which uses a Bayesian approach and expected goals, as well as the importance of understanding player value in difficult-to-measure positions and the need for more accessible and digestible sports analytics for fans. We also touch on the impact of budget on team performance in American sports and the use of plus-minus models in basketball and American football.
Paul is a senior fellow at the Wharton Sports Analytics and Business Initiative and a lecturer in the Department of Statistics and Data Science at the Wharton School of the University of Pennsylvania. He has spent his entire career as a sports analytics professional, teaching and leading sports analytics research projects.
This is Learning Bayesian Statistics, episode 108, recorded April 11, 2024.
Welcome to Learning Bayesian Statistics, a podcast about Bayesian inference, the methods, the projects, and the people who make it possible. I'm your host, Alex Andorra. You can follow me on Twitter at alex_andorra, like the country. For any info about the show, LearnBayesStats.com is Laplace to be. Show notes, becoming a corporate sponsor, unlocking Bayesian merch, supporting the show on Patreon: everything is in there. That's LearnBayesStats.com.
If you're interested in one-on-one mentorship, online courses, or statistical consulting, feel free to reach out and book a call at topmate.io/alex_andorra. See you around, folks, and best Bayesian wishes to you all.
Welcome to Learning Bayesian Statistics.
... a full conversation in French, as we just had before recording. Well done.
It used to be, though. Go back two to three hundred years. Maybe you just don't go to Africa enough. That's where French is spoken a lot now, too.
Exactly. But other than that, you can see French used to be a very international language, because in my travels, almost all the time, people tell me, "Yeah, I studied French in high school." And the only thing they can say is just a few words. Which is normal if you don't use it, right? But you can see it because French is still, or was still, taught in high school, and now less and less. So well done, Paul, for that. I know, I don't think French is an easy language to learn. What has been your experience? I'm actually very curious.
You know, it's hard to say. This is a statistics or data science podcast, so I guess I can't really compare it to anything else: it's the only other language I've learned besides my native English. So it's a sample size of one for me. I took it in high school as well, and I hated it. Coming from America, seventh grade is when I had to choose whether I was taking French or Spanish. I'm the youngest of four kids in my family, and my older siblings told me that the Spanish teacher was really mean. That's originally why I took French. I took it for the required two to three years, and then I was done.
In high school, I had this teacher from Belgium, and I still remember her name, Madame Vendon Plus, and I couldn't stand her. But come to find out, looking back in life, she was actually a really nice person. She was just Belgian. It's the cultural thing: Americans think they're the best, and French-speaking people in Europe also think they're the best, because they ruled the world in the 1700s and 1800s, and America feels like it has ruled the world for the last 100 years. So when you get into a room together and you both think your cultures are superior, that doesn't go well.
But actually, after that, I didn't speak French at all. Then I did church service for my church for two years and I lived in Quebec. Not actually in the city of Montreal; I lived in a lot of rural small towns. So I studied French really hard. I had to learn the very strong Quebecois accent. And then when I went back to school is when I really honed my French. I was very conversational and could speak very fluently in Quebec, but I had to learn the grammar a little bit more in depth. So I studied French at university as well.
So it's about immersing yourself and actually learning the language, because when I learned it in school, it never made sense to me. But when I studied it on my own, and I studied conjugation and all these things, it became kind of like a math problem. When I would build a sentence in my head, I'd always be like: I need a subject, I need to conjugate the verb, and then I need to add an adverb or an adjective after it. And it made sense in my head, but that's not how I was taught in school. In school I had to memorize all these words, like everything in the kitchen. How do you say dishwasher? How do you say refrigerator? How do you say fork? How do you say spoon? I couldn't learn like that. But by living there and thinking about French as a math equation, it made sense in my head and I was able to pick it up. Sure, I made tons of mistakes and embarrassed myself, but it wasn't too bad. And that's how you learn.
Yeah. So from that answer, I'm guessing people already know why I invited you on the podcast. A very nerdy answer about languages, that's perfect. Thanks a lot. And yeah, I completely relate, actually. I learned English and German in high school, and it was kind of the same. I always hated formal language learning. In the end I learned these languages, and Spanish and Italian were the same, just by going to the country, basically. And as you were saying, I think what it also adds is that you've got skin in the game. You're in the country, you're having a conversation with someone. If you're not able to talk, you look extremely stupid. So it's a very good incentive for the brain to step up and learn. And that's really awesome.
And then when you are in a situation where you don't know what to say, you remember that. And when you later learn, "This is what I should have said," it sticks with you because it has an emotional attachment to it.
Yeah.
Yeah.
No, exactly. And that's going to be a good segue to my first question to you, but I think it's also one of the situations in life where you can really feel and see your brain learning. That's why I also really love learning new languages and going to countries to do that. You arrive in the country, you don't know how to say anything, and in just a few weeks your brain starts picking up stuff, and you can really feel your brain doing the amazing work that it's been conditioned to do by years of evolution. To me, it's just absolutely incredible that the brain is able to do that. Even when you're in your thirties and beyond, you can do that. I find that absolutely incredible.
And that's kind of like a Bayesian neural network, you know. So I mean, see that segue? I should definitely have a podcast.
So actually, talking about Bayes: I invited you on the podcast because you do absolutely awesome work on sports modeling. People know that I'm a big fan of a lot of sports. I love modeling sports and so on, so I'm super happy to have you here. And I have a list of questions that is embarrassingly long. But maybe can you tell us if you are actually using some Bayesian methods yourself, if you're familiar with those or not? And in general, what does that look like in your work?
Yeah. So, just a quick background about myself. I've worked in what we call sports analytics for almost 10 years now. I was actually getting my PhD in statistics when there was this job opportunity at ESPN, which is a sports broadcasting television channel in the US and a few other countries. I got the job offer to work on their sports analytics team, where essentially what the team does is make forecasts so that they can show on TV, on the bottom line, who's expected to win. Or we would run simulations on who's likely to win the championship, all throughout the season. And so you can tell stories with that, saying: at the beginning of the season, no one thought this team was going to be any good, but just look how they got better. Or the opposite: they were supposed to be really good and everything just went wrong.
And so in my field, in sports modeling, I would think you actually can't do it without being Bayesian. So when I would interview people, I'd always focus on that, because people coming out of school sometimes don't learn Bayesian methods very well. The reason is that in sports, sample sizes are very small and you have to make forecasts with very limited data. And the great thing about Bayesian statistics is that you actually have more data. You just haven't observed it. You have expertise, or you have opinions, and those opinions actually matter. Maybe we'll get into this, but because of my field, I'm actually a very strong advocate of subjective Bayesian analysis. It's okay to insert some information into your models, and it usually makes them better.
Yeah. Well, awesome. I couldn't have dreamt of a better answer, and full disclosure, I didn't know Paul was going to answer that, because I haven't seen that on your website or elsewhere. So while preparing the episode, I didn't know if you were already using Bayesian methods or not. But I'm definitely happy to hear that. And just so people know, that was not a conspiracy. I didn't know anything that Paul was going to say.
OK, so that's awesome. I'm an open source developer, so I'm always very curious about the stack you're using. What are you actually using when you're doing Bayesian analysis of a sports model?
In my career, I have almost always used R and Stan. So if I'm doing Bayesian analysis, I write a lot of Stan code. It's gotten easier with ChatGPT. It doesn't do it all the way, right? But if it's like, "Hey, I want to build this kind of model," it'll at least give me a good framework. And then I can adjust it and edit it as I want from there.
Yeah. And for sure, you cannot go wrong with R and Stan. So yeah, definitely. And we had one of the creators of Stan, Andrew Gelman, back on the podcast a few weeks ago. It was not released yet, but through time travel, it's going to have been released when your episode is out. So folks, you can go back to...
Right, because I am definitely a lesser draw than Andrew Gelman is, but that's great.
No, yeah, so if people are curious about what Andrew has been up to lately, it's the third time he's been on the show, and he just released a new book, Active Statistics, that I definitely recommend. It's really fun to read. It's about how to teach statistics with stories, which actually relates to something you just said, Paul: that a cool and fun way to relate statistics to non-stats people is to be able to tell stories about a team's probability of winning or any forecast like that. So that's definitely interesting to hear you talk about that.
And actually I'm curious, because I've been following the field of sports analytics for a few years and I've personally seen it mature and evolve quite a lot when it comes to technology and data availability. So I'm curious what an expert like you thinks about that evolution of technology and data availability, and how it changed the landscape of sports analytics.
Yeah, I mean, it's exploded in the last 10 to 15 years. If people are familiar with the book and movie Moneyball, the book is about 20 years old now and the movie is about 12 or 13 years old. Back then, baseball was the sport that sort of took off in sports analytics, for a couple of reasons. One, the game is very discrete: there are start and stopping points, so you can measure discrete events very well in baseball. Two, it's the only sport that actually had a really long-running data set. They've been keeping statistics in baseball for so long that you can go back to the 1800s and find out how people were playing baseball in 1895. No other sport has that. So that's probably the reason why baseball took off.
But since then, for a while after that, every sport had what we call play-by-play data, which is like, "This is what happened." Soccer had a version that was called event data. People would watch a game, and every time someone touched the ball or made a pass, they would mark that the ball was touched here on the field and passed to there, or that the player dribbled from here to there. So they were kind of discretizing soccer to put it in a similar format. But then about 10 years ago, we started getting player tracking data, which is the location of everybody and the ball or the puck on the field, depending on the sport, 10 to 25 times per second. And that has drastically changed the methodologies and tools that are used.
So Bayesian analysis was great for this play-by-play data, or even game-by-game data, and for measuring how players or teams performed. And now we've started getting such huge data sets that more of the computer science world, neural networks and things like that, has become much more prevalent in sports analysis, just because the data sets are so massive. Not that statistics doesn't play a role. It still does. And I think people sometimes overly rely on these black box methods. They don't think about the implications or the biases in the data, which are still important.
But we have these huge amounts of data now. It's just exploded: if you want all the data for a season in the NFL, it's over one terabyte of the locations of everybody on the field, on every play, 25 times a second. It's just massive. So it's really changed the way people do things. We went from really simple questions to huge, big questions. And the funny thing is, with the data being so large, I actually think people are now going back to answering simpler questions. We're not trying to measure everything all at once. Let's try to measure very specific things that we weren't able to measure before.
Hmm. Yeah, that is definitely interesting. So first, is that massive availability of data the case across the whole sports industry? Or is it more the case for the most historical sports, as you were saying, maybe baseball? I know the data sets are more massive there. Are the data sets in other sports, like soccer, less prevalent and less massive, or is it a uniform trend? That's the first question. And the second question is: where does that data live? Is it mostly open source, or is it still quite closed source?
Yeah. So baseball is usually at the cutting edge of everything because they had a head start. Basketball, and then American football, international soccer, and hockey kind of trail behind. But the data sets in all those sports are massive now. The NHL just got their player and puck tracking data a couple of years ago. Baseball and basketball have now moved beyond just knowing where players are on the field. They actually have what's called pose data, so they know where the different joints, the arms, and the legs are for every player on the field or on the court. So that data is massive. It's massive everywhere. There are also companies that are trying to collect new data from video, using computer vision algorithms to do that.
But to answer your second question: largely, this is not open source data. The old-school data, the play-by-play data, is open source. You can find that for pretty much every sport via an open source mechanism now. But these huge data sets tracking the players 10 to 25 times per second are usually all closed source. There are a few releases of that here and there. The NFL does a competition where they release some of that data each year, a very small set, and a few other leagues have done something similar as well. So that kind of gives you a taste. If you have money, there are companies that try to create that data themselves and they'll sell it to you, but that's usually pretty expensive for an individual person to buy.
I see. Okay. Yeah, interesting, definitely. Because data is kind of the oil of our industry, right? So it's definitely interesting to know the state of the oil supply, in a way.
Maybe for people who are less versed in sports modeling, can you give us an example of how analytical insights have directly influenced team strategy or player selection in one of your consulting roles?
Yeah. So I'll just talk broadly at first. Sometimes it's just the most basic things, right? In basketball, people shoot three-pointers more because all they did was figure out that the expected value was larger for a three-point shot than it was for most two-point shots. Not layups and dunks, right? Those are very high percentages, so the expected value of a high percentage times two is pretty good. But even if the percentage drops off a lot, when you multiply it by three to get the expected value of a three-point shot, it's also pretty good. So basketball has changed drastically because of that.
And in my roles, I think in a lot of sports there have just been a lot of open questions. People kind of move one way. I think sports analysis does a really good job of tackling very easy problems first. But then I think there's actually a tendency for the analysts themselves to be overconfident in their analysis, and they're not factoring in all of the sources of variation that might be there.
And something I'm also very curious about is your experience with non-stats stakeholders: coaches, scouts, players. How do they typically respond to the analytics and the insights you provide? And are there differences in reception across sports, or maybe across roles?
Yeah. So it really does vary. As in all things, there's variance. Typically, younger coaches or scouts are a little bit more receptive than people who have been doing something for a long time. And I think that's just human nature. You're used to doing things a certain way. To stereotype, you don't like some young person coming in and telling you how to do your job. So you have to be really careful about that. And the funny thing is, everything that I have learned or believe in, in terms of making data-driven decisions and not overreacting to small sample sizes, goes out the window when I'm trying to convince a stakeholder of something.
So for example, say I have a model, I want them to use it, and I think it's going to help them. Of course, I've done the analysis to say how it would improve our efficiency over the long run, or how making a decision this way would be a better process, et cetera. I've done that analysis, and I've done it over a larger sample size. But what they want is confirmation bias, right? They love confirming their beliefs. So in order to get them to agree with what you're saying, confirming their beliefs works so much better than saying, "Out of the thousand players that I did this on, you were only correct 60% of the time, but my model would have been correct 70%." They don't want to hear that. You essentially say, "Well, you love this player? So does my model." I find the one guy, even if it's literally only one person, and they're like, "Yeah, if your model can see that, then it must be doing something right." And then they start to trust you a little bit. Over time you give them little pieces, little crumbs of a cookie, that help them gain confidence. And that's when you share with them, "Okay, but it's also suggesting this, which is different from what you've been doing in the past." Right?
So you don't ever start with, "Trust me, because you might be wrong, because you're a human." I mean, humans always make mistakes, but we usually don't think we make as many mistakes as we do. What I've found over time is that it works if you get people to trust you by confirming their previously held beliefs, right? That's another Bayesian concept. If you can confirm their prior beliefs, they're going to accept your future recommendations, or future things that the model might suggest, more than if you start with the differences upfront.
So that's a little bit of human bias that you just learn over time. Some things are really hard for people to accept. But if you get people to trust you and you build that relationship, and there are a lot of human elements here, and they come to trust your work because it confirms their prior beliefs, then they'll open up to being a little bit more open-minded about other things as well. Because then it's like, "Okay, well, I know you're not an idiot. You can speak my language some, so now I might be more open to learning a little bit of your language." And that's just a human relationship thing that you always have to work on.
Yeah, that is very interesting. I'm always very interested to hear about that because I also face clients daily and have to explain models to them. And as you were saying, reactions to the model definitely vary a lot. But that nugget of wisdom, of maybe indulging the confirmation bias at the beginning and then slowly moving toward speaking a bit more of the truth, is very interesting. I had not thought of that, but I can definitely see that being a valid strategy when you are in front of someone who doesn't really understand the value of the modeling, I would say. Whereas when I encounter clients who are already convinced of what models can do for them, they are usually looking for things that contradict what they already think. And that's when they find the model interesting. So I find that really, really cool to see.
The contradictions are really where there's value, right? But there's no value in a model if no one uses it. Even if the model is really good, if no one uses it, it has zero value. And if they use it, the contradictions are valuable if they're right, correct?
I've spent my career doing lots of different sports, and this applies to every sport. In basketball, we can call it the LeBron test, and in soccer, we'll call it the Messi test. It's essentially this: if you build a model that's trying to evaluate players and Messi is not one of the top players in your model, then you're not going to share it with anybody, because no one's going to believe you. That's the first thing everyone checks: "Okay, well, is Messi up top?" If Messi is near the top, then at least people will listen to you a little bit longer. But they're not going to listen to you at all if you're like, "Yeah, Messi is an okay player." They'll say, "I don't care what your model says. That's wrong." That's what people believe. So it's a little bit of, "No, no, no, I'm taking a different approach than you do, but my approach also thinks that Messi is the best." And then they're like, "Okay, yeah, we agree. He is really good."
Yeah, it's like a sniff test, right? And in a way, that's a strong prior. It's saying, "I have a very strong prior that Messi is really good. To convince me otherwise, you're going to need really, really good data." It's like, well, the earth is very probably somewhat round. It's going to be very hard for you to move that prior by telling me it's not, in a way.
Yeah.
499
And in sports, people have really strong
priors, right?
500
So, you know, those sniff tests do really
matter.
501
And as a modeler, even for myself, like,
I'm a human.
502
So like, I do the same thing.
503
If I'm building a model, I always want to
see the results.
504
And it's like, I don't look at the median,
like I do, but I don't look at who the
505
median result is in my model half the
time.
506
I usually look at the best and I look at
the worst.
507
And if I don't understand it, then I'm
like, maybe my model is doing something
508
wrong.
509
And I'm all like, gonna, I'm going to dive
in a little bit more.
510
If it like confirms my prior held beliefs,
I'm like, it's probably correct.
511
Right.
512
And even as a modeler, right, you have to
be careful of that.
513
But at the same time in sports, you know,
it's like I said, subjective analysis can
514
be helpful.
It's because there's wisdom in people's subjective views. Coaches have been playing a game, or coaching a game, for 20 or 30 years. To think that they don't have something to offer a model is kind of crazy, in my opinion. They might have biases, and of course they do, but the information that they can provide is useful.
521
Yeah, definitely.
522
And that's where we go back to what we
were talking about at the beginning in the
523
value of Bayesian inference in that
context.
Because if you can leverage that deep and hard-earned knowledge from the coaches, from the scouts, and add that to your model, it's like getting the best of both worlds. And that can make your analysis extremely powerful and useful, as you were saying.
528
Yeah.
529
And people have done studies like this,
I've done studies like this.
530
If you build a model just on the data and
ignore the human element, right?
531
Or if you build a model just on human and
scouting analysis and ignore the other
532
data.
533
Right.
534
Neither one of those is going to do as
well as when you combine both.
535
And that's really, that's what, you know,
that's Bayesian analysis is you're
536
combining subjective belief with objective
data and then making forecasts based on
537
them.
And we know that if you have priors that are not really, really bad, a subjective Bayesian forecast is going to have smaller error than a data-only forecast, what we call a maximum likelihood forecast in stats terms, right? Or than just the human one, the no-data, you know, feelings forecast as well, right? So the combination of the two always does better.
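A minimal simulation sketch of that claim, under assumptions that are not from the episode: the scout's opinion and the data-only estimate are both unbiased with equal, known noise, and the two are combined with a normal-normal update that treats the scout's opinion as the prior mean. All numbers are made up for illustration.

```python
import numpy as np

def posterior_mean(prior_mean, prior_sd, obs, obs_sd):
    """Normal-normal conjugate update: precision-weighted average of prior and data."""
    weight_on_data = prior_sd**2 / (prior_sd**2 + obs_sd**2)
    return weight_on_data * obs + (1.0 - weight_on_data) * prior_mean

rng = np.random.default_rng(0)
n = 100_000
truth = rng.normal(0.0, 3.0, size=n)            # hypothetical true player abilities
scout = truth + rng.normal(0.0, 1.0, size=n)    # subjective opinion, noisy but not terrible
data = truth + rng.normal(0.0, 1.0, size=n)     # data-only (maximum likelihood) estimate

# Treat the scout's opinion as the prior mean and update it with the data.
combined = posterior_mean(scout, 1.0, data, 1.0)

def rmse(estimate):
    return float(np.sqrt(np.mean((estimate - truth) ** 2)))

rmse_scout, rmse_data, rmse_combined = rmse(scout), rmse(data), rmse(combined)
print(rmse_scout, rmse_data, rmse_combined)
```

In this setup the combined forecast has lower error than either source alone. The gain shrinks, and can reverse, if the prior is badly biased, which is the "not really, really bad" caveat.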
546
Yeah.
547
Yeah.
548
Yeah.
549
Preaching, preaching to the choir here for
sure.
550
And actually, I think that's a good time
now in the episode to get a bit more
551
nerdy, if we can, because I've seen you,
so you've obviously worked extensively
552
with.
553
soccer analytics and you have an
interesting soccer power ratings and
554
projections on your website that I'm gonna
link to in the show notes but can you tell
555
us about it and what makes these
projections unique in your perspective in
556
evaluating team and player performance and
don't be afraid to dig into the nerdy
557
details because...
558
My audience definitely liked that.
559
Yes.
560
Sure.
561
I'll dig in.
562
So what's on my website is...
563
Sorry if you can hear my dog there.
564
What's on my website is perhaps the most
simple power ratings forecast that I've
565
ever done.
566
So I say that, not that it's like stupid
or anything.
So when I was at ESPN, I built power ratings in American football, both professional and collegiate, and basketball, professional and collegiate, and hockey. I mean, almost every sport, right?
So for what's on my website, I'll explain the model very simply: it's a Bayesian model where you have an effect for each team, right? And the response variable is the expected goals for each team. Usually, when we do power ratings for a team, there are two things we're trying to estimate: their offensive ability and their defensive ability. And then you assume, essentially, that their overall team ability, if it's a linear model, is the combination of their offensive and defensive abilities.
Okay, so essentially, in each match, you have two rows of data: the expected goals for the one team, and then the expected goals for the other. And the reason we use expected goals, although I actually have a lot of issues with expected goals, is that they are a better indicator of how well the team performed on offense than just the raw number of goals. I don't need to go into details, right? It's essentially an expected value as opposed to an observation from a Poisson distribution. Soccer scores roughly reflect a Poisson, or pretty close to a Poisson distribution, right? The expected goals is that expectation.
And so essentially I have a hierarchical Bayesian model where I actually do a few things. I assume the expected goals is the mean of a Poisson distribution, and the observed goals is the actual outcome of that Poisson distribution. And then I fit a linear model, essentially, where I look at: okay, team A was on offense, team B was the opponent, and this was team A's expected goals. So I'm essentially fitting a regression model, right? A Bayesian regression model where I have individual team effects. I have a prior on each team: each team's offense and each team's defense.
And that prior, I don't have to get too crazy. I just use a normal distribution. Sometimes, when I code in Stan, I actually like using a distribution with slightly thicker tails. But for this model, I was just trying to go simple: a normal distribution prior with a mean for, essentially, each team's expected goals per game, on offense versus defense. And for the defensive value, I usually do the subtraction, so it's the offensive team minus the defensive team. That way, the defensive team's value is higher if they're a good defense. So essentially, if team A's expected goals in a game against an average opponent is like 1.5, and the defense's average expected goals allowed in a game was 1.4, then you would say the difference is like 0.1, okay?
617
I also include effects for being at home
in this model.
618
I think, actually, I think that's all I
do.
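The structure described above (two rows per match, offense minus defense, plus a home effect, with normal priors on team effects) can be sketched as follows. This is a simplified illustration, not the guest's actual model: it skips the Poisson layer, uses Gaussian noise on simulated expected goals, and computes a MAP estimate in closed form instead of sampling in Stan. Team counts, effect sizes, and prior scales are all assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n_teams, n_matches = 6, 400
true_off = rng.normal(0.0, 0.3, n_teams)   # hypothetical offensive effects
true_def = rng.normal(0.0, 0.3, n_teams)   # hypothetical defensive effects (higher = better)
baseline, home_adv, noise_sd = 1.4, 0.25, 0.3

rows, xg = [], []
for _ in range(n_matches):
    home, away = rng.choice(n_teams, size=2, replace=False)
    # Two rows per match: one for each team's expected goals.
    for att, dfn, is_home in ((home, away, 1.0), (away, home, 0.0)):
        x = np.zeros(2 * n_teams + 2)
        x[att] = 1.0               # attacking team's offense effect
        x[n_teams + dfn] = -1.0    # opponent's defense effect, subtracted
        x[-2] = is_home            # home indicator
        x[-1] = 1.0                # intercept (league-average xG)
        mean_xg = baseline + true_off[att] - true_def[dfn] + home_adv * is_home
        rows.append(x)
        xg.append(mean_xg + rng.normal(0.0, noise_sd))
X, y = np.array(rows), np.array(xg)

# MAP estimate under independent Normal(0, prior_sd) priors on team effects
# and flat priors on the home effect and intercept.
prior_sd = 0.5
penalty = np.full(X.shape[1], (noise_sd / prior_sd) ** 2)
penalty[-2:] = 0.0
coef = np.linalg.solve(X.T @ X + np.diag(penalty), X.T @ y)

est_off = coef[:n_teams]
est_def = coef[n_teams:2 * n_teams]
est_home = coef[-2]
print(est_home, np.corrcoef(est_off, true_off)[0, 1])
```

The normal priors are what make the offense and defense effects identifiable here: without them, a constant could be moved freely between the team effects and the intercept.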
619
But in other models I've done, you can
look at things such as how much rest
620
they've had since their last match.
621
You can look at the difference between
each team's rest.
And those are not linear effects, right? You have to do some sort of nonlinear effect for that, because the difference between two days of rest and one day of rest is very different from seven days versus eight days of rest, right? Seven and eight days of rest are pretty much the same thing, but two and one are very different. There's a much bigger effect for having two days of rest than just one day of rest.
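One way to encode that kind of diminishing return is a saturating curve. The functional form and the parameter values below are assumptions chosen for illustration; the episode does not say which nonlinear form was used.

```python
import math

def rest_effect(days, max_effect=0.3, rate=0.8):
    """Hypothetical saturating benefit of rest: rises quickly, then flattens out."""
    return max_effect * (1.0 - math.exp(-rate * days))

gain_1_to_2 = rest_effect(2) - rest_effect(1)
gain_7_to_8 = rest_effect(8) - rest_effect(7)
print(gain_1_to_2, gain_7_to_8)
```

With these made-up parameters, going from one to two days of rest is worth far more than going from seven to eight, which matches the shape described above. In a fitted model, `max_effect` and `rate` would be parameters with their own priors, or the curve could be replaced by a spline.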
630
And so you can do things like that, or how
far away they had to travel, those sorts
631
of things.
632
Now in European soccer, that's not a huge
deal, because especially in the
633
competitions within each country, no team
is traveling that far.
634
But in American sports, it is a pretty big
deal.
635
Like, you know, you, you have to fly five,
six hours across the country on short
636
notice.
637
Like that can, that can really affect
performance.
And another thing, like I said, I don't have this in the soccer model, but if anyone's interested in modeling sports outcomes, something people typically tend to overlook is elevation. I've always been a big proponent of it. There are certain sports where certain teams play at higher altitudes, and if you're not used to playing at higher altitudes, it's actually a very noticeable effect in a model: you're going to have a lower offensive output, and you'll actually allow more points on the other end due to fatigue. In the United States, it's the teams that are playing in Colorado and in Utah. But in Europe, it could be the teams that have to go to Switzerland, or to some of these Alpine regions that are higher up in altitude. In Mexico, if you have to go to Mexico City, it's extremely high. Or Colombia, right? I mean, depending on what you're doing, these are very high-altitude places that have been shown to have a measurable impact on an opponent's performance.
653
Yeah, that's very fun.
654
My God, I love those kind of models.
655
That's so much fun.
And I would also guess, I mean, at least my prior would be, that there is a reverse mechanism for teams who are used to playing at altitude. Do they get a boost in performance when they play closer to sea level? Because they could have adaptations that make them better when they go down to sea level.
661
Yeah.
662
I mean, I think there's certainly science
behind that.
663
I found that is a lot harder to show in a
model than the reverse.
664
Not that it might not be there, but I
think the effect size, if it is there, is
665
definitely smaller than the reverse.
666
Yeah.
667
That's what...
668
That's what I would expect to like.
I think the effect is there, mainly because, well, I've seen it. It seems to be pretty well cited in the scientific literature, but that doesn't mean the effect is big.
672
So yeah.
673
Yeah.
674
I mean, I'm a runner and I know that all
of the distance runners that are training
675
for marathons that are elites and
professionals, they all train at higher
676
altitudes, right?
677
For the...
678
six weeks leading up to a competition and
then they travel to the competition at a
679
lower altitude.
680
And, you know, they think they have an
oxygen performance boost due to that.
681
Yeah.
682
Yeah.
683
Kind of like legal oxygen doping, legal
blood doping.
684
Yeah.
685
Yeah, exactly.
686
Yeah.
687
Yeah.
688
I mean, I think it seems to be pretty much
proven.
I would say maybe it has more of an impact on individual sports like marathon running, because even if you're gaining just a few tenths of a second, it can help you have a better time in the end. At this level, just having the smallest increase in performance could be the difference between first and second place. But maybe it's harder to see such a small effect in a collective sport, a collective game, because, well, maybe it's just not additive. Maybe the effects actually cancel out, so in the end you don't really see a big effect.
700
But that would be...
701
Yeah.
702
I'd love to do an experiment on that.
703
Like an RCT.
704
That would be so much fun.
705
Yeah.
706
Well, good luck trying to do experiments
in sports.
707
It's hard.
708
Yeah, I know.
709
I know.
710
But that...
711
I mean, if the multiverse exists...
712
Then there is a universe where we can do
that kind of experiments.
713
And my god, these scientists must have so
much fun.
And yeah, so thanks a lot, first, for detailing the model that clearly and in so much detail.
716
That's super cool.
717
So the results of the model are in a cool
dashboard on your website.
718
Do you have the model and data available
freely, maybe on your GitHub, that we can
719
put in the show notes?
720
Yeah, I'm not sure.
I don't know if the model is in my public GitHub. It's on GitHub; I don't know if it's private or not, but I can let you know.
724
You know, I use actually open source data
for that.
725
So I, I, let me double check.
726
I can actually double check and get back
to you after the show on if, yeah, if I
727
could have it in my public GitHub or not.
728
So, yeah.
729
Yeah.
But essentially, there's a package called worldfootballR, and it uses data from there to build the model. Some of that data is scraped from, like, Transfermarkt. I didn't really talk about how I set prior means for each of the teams, but a very simple hierarchical model is essentially just to take the expenditures of the club and use that as a prior mean for how good the club will be going into the season. And unlike some other sports, in soccer, world football, how much a club spends is very highly correlated with how successful they are, which makes sense, but it's not necessarily true in, like, baseball.
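A rough sketch of what "expenditure as a prior mean" could look like. The club names, spending figures, log transform, and slope are all hypothetical; the episode only says that spending informs the prior mean, not how it is scaled.

```python
import numpy as np

# Hypothetical club expenditures in arbitrary currency units, for illustration only.
spend = {"Club A": 900.0, "Club B": 300.0, "Club C": 100.0, "Club D": 30.0}

def spending_prior_means(spend_by_club, slope=0.4):
    """Map standardized log-spending to a prior mean on a team-strength scale."""
    clubs = list(spend_by_club)
    log_spend = np.log([spend_by_club[c] for c in clubs])
    z = (log_spend - log_spend.mean()) / log_spend.std()
    return {club: float(slope * z_i) for club, z_i in zip(clubs, z)}

prior_means = spending_prior_means(spend)
for club, mean in prior_means.items():
    print(club, round(mean, 2))
```

In a hierarchical version, `slope` would itself be estimated from past seasons instead of being fixed by hand, and each team's effect would get a normal prior centered on its value here.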
741
So, do you see these effects of budget?
742
So, yeah, first, before I go on a follow
up question, yeah, for sure.
743
Get back to me after the show.
744
And if that's possible, we'll put that in
the show notes because I'm sure.
745
A lot of listeners will be interested in
checking that out.
746
I personally will be very interested in
checking that out, definitely.
747
So that'd be awesome.
And second, that effect of budget that you see on the performance of a team, and I guess in football, performance means the expected number of games won: do you see that much of an effect also in a closed league system like the MLS?
754
Or is that so because my prior would be
the effect of budget would be even
755
stronger in open leagues like we have in
Europe because it's like there is no
756
compensation mechanism, right?
757
Clubs can go down and usually in Europe
the strongest clubs are the historical
758
clubs.
Or the new clubs are just the ones that were lucky to be bought by very, very wealthy shareholders.
761
And like, there is not a lot of switching
of the hierarchy and changing of the
762
hierarchy, mainly because of budget, as
you were saying.
763
But I would think that maybe the effect of
budget is less strong in a closed league
764
like the MLS.
765
Is that true?
766
Is that something you see or is it
something that's still in the air?
767
Yes.
768
So I haven't looked specifically at the
MLS, but in general in American sports,
769
which all have closed leagues, the budget,
well, for various reasons, the budget
770
effects are not super strong.
So, you know, in American baseball, there is no spending limit. But in some American sports, like the NFL in football, there's a salary cap, meaning you can't spend more than a certain amount. So there, there is no relationship between overall spending and winning, because everyone has to spend a minimum and there's a maximum. In baseball, there is no limit. There's a tax: if you spend too much money, they do tax you. But there's still not a huge correlation.
780
And then in MLS, like I said, I'm not
entirely sure.
781
Most of the clubs, they are constrained
about how much they can spend.
782
And so there isn't as much variance also
in spending.
So, you know, with Messi going to Inter Miami, it wasn't that Inter Miami could pay him a lot of money. There's a couple of exemptions that an MLS club can use to pay an international player. That originally started when David Beckham went to Los Angeles, and they kind of made that rule essentially just so they could afford paying him what he was used to, or close to what he was used to, being paid in Europe. And in the MLS that's still kind of the case: you have one or two players you're allowed to have on these exemptions. The way Messi was able to make it work is he's getting paid by Apple, because Apple is broadcasting the MLS games. So they're paying him, essentially, to play in the MLS, because they're hoping: more people are going to watch our broadcasts and pay us, and so we're going to give you a percentage of that. And that's where a lot of his earnings are actually coming from: a deal with Apple, versus the actual MLS club in Miami, which can only pay him so much.
So my guess, my prior, and I haven't looked specifically at the MLS with this, is yes: there isn't a huge relationship in the MLS between winning and spending, just because there's not much variance in spending. In order to see those correlations, you have to have a large enough variance in the spending to notice the relationship, right?
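The point about variance can be checked with a small simulation of range restriction. Everything here is simulated under assumed values: the same underlying spending effect is present in both cases, and only the spread of budgets differs.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20_000
spend = rng.normal(0.0, 1.0, n)                  # standardized spending, wide spread
wins = 0.5 * spend + rng.normal(0.0, 1.0, n)     # same true effect of spending everywhere

full_corr = np.corrcoef(spend, wins)[0, 1]

# Keep only clubs whose budgets are nearly identical, as in a tightly capped league.
capped = np.abs(spend) < 0.3
capped_corr = np.corrcoef(spend[capped], wins[capped])[0, 1]
print(full_corr, capped_corr)
```

The effect of spending on wins is identical in both subsets, yet the correlation is much weaker when budgets barely differ. A weak correlation in a capped league therefore says little about whether money would matter if clubs were free to spend.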
806
So.
807
Yeah, definitely interesting.
808
I mean, I love also looking at these, you
know, the...
809
how the structure of a league impacts the
show and the wins is extremely
810
interesting.
811
That can seem very nerdy and I think
that's my political science training that
812
kicks back here, but really how you
structure the game also makes the game
813
what it is and the results and the show
you're going to get.
814
I find that extremely interesting to see
how the American games, the US games are
815
structured.
816
Because ironically, it's a system where
there is much more social transfers, if
817
you want, like we have in Europe for
social security and health and education.
818
American sports are socialist, and
European sports are capitalist.
819
But typically, we consider Americans to be
more capitalist and the Europeans to be
820
more socialist.
821
So it's an interesting inversion.
822
Yeah.
823
No, definitely.
824
And I mean, I think...
825
Honestly, that's going to be interesting
in the coming years to see what's
826
happening on the European side because
there are more and more debates about
827
whether we should have a closed European
wide league, which would basically be an
828
extension of the current Champions League.
And honestly, I think it's going to take that road, because more and more championships, at least all the championships with the exception of the Premier League, I would say, get more and more concentrated on just a few clubs. And just from time to time, you have one club that bumps up to the top, like Leverkusen this year in Germany, or Monaco in France a few years ago, or Montpellier. But those are really exceptions. And in the end, you almost always get the same clubs that win all the time.
837
And so the idea of open leagues is not
really true for the top of the leagues.
838
It's definitely true for the bottom, but
the big clubs never go down.
839
And...
840
And so I think at some point, this
illusion of the open leagues is going to
841
disappear and probably we'll get a
European wide championship where like
842
basically the leagues are going to get a
bit more even because I think it's better
843
for the show and that's going to make more
money.
844
And in the end, I think that's what the
question is also.
845
Yeah, you might be right, but I hope, I
hope not.
846
I really, as an American, always have
dreamed of Americans doing relegation and
847
promotion just because...
You know, in America, we have this problem that we call tanking, right? Because we have the socialist draft system, the worst teams are incentivized to lose, because they know they're not going to win. So they want to get the best possible players in the draft the next season, and so they're incentivized, you know, to lose a little bit more.
853
And so that really does kind of, you know,
the promotion relegation is nice because
854
it solves that, you know, if you keep
losing, you lose a lot of money because
855
you get sent down.
856
so everyone's motivated even at the bottom
of each league to keep winning games,
857
right?
858
As much as possible.
859
Otherwise they lose a lot of money.
And in American leagues with the closed system, it's like, well, hey, you know, we talk about it being cyclical. And one thing that sports analytics has done is essentially say: in an American sport, it's really hard to go from being an average team to a really good team. And the reason is the draft system.
867
So in the draft system, people are always
overconfident in how good the players are,
868
but there's really thick right tails of
how good a player can be.
869
So when you get a new player who's young
and you can draft them at the top of the
870
draft, they might not pan out, but they
also have a really thick right tail,
871
meaning that if they do pan out, you could
go from being one of the worst teams to
872
one of the best teams really quickly.
873
And so,
You know, it's this other analysis of, well, if you don't ever have an opportunity to draft someone in a position where there's that right tail, where, you know, once every five years you get a player who transcends everyone else that comes in, then you can't move up from average to really good. But you can go from being bad to really good. So often the smarter teams, if they're really good, they stay really good.
880
But once they start noticing the players
are getting older, they just trade
881
everybody away.
They get rid of all their best players, and they just stink for a year or two, and hopefully they get a lot of good draft picks. Essentially, they try to trade their players away and get more draft picks, and then it becomes a sample size problem. They say, well, if we have more draft picks, our probability of getting someone on the right tail goes up. So all we're going to do is increase our odds of getting that right-tail player. And if we get that player, then we'll be good again.
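The sample size argument is simple arithmetic. As a sketch, assume each pick independently has the same small chance of being a right-tail player; the 5% figure is made up, and real picks differ by draft position.

```python
def prob_at_least_one_star(n_picks, p_star):
    """Chance that at least one of n independent picks is a right-tail player."""
    return 1.0 - (1.0 - p_star) ** n_picks

one_pick = prob_at_least_one_star(1, 0.05)
five_picks = prob_at_least_one_star(5, 0.05)
print(one_pick, five_picks)
```

Under these assumptions, stockpiling five picks raises the chance from 5% to roughly 23%, which is the lottery-ticket logic behind trading veterans for picks.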
893
Yeah.
894
Yeah.
895
It's like.
896
buying a lot of lottery tickets.
897
Yeah, that's what they're doing.
898
Yeah, now that's fascinating.
899
Yeah, I wasn't aware of these effects.
900
That's super interesting.
901
Because basically, what you're saying is
there is an incentive to be extreme,
902
basically.
903
Either you want to be among the top ones
or you want to be among the worst ones.
904
But being in the middle is the worst,
actually.
905
It is the worst.
906
Yeah.
907
Yeah.
908
That is extremely interesting.
909
And that's...
910
Yeah, I mean, I actually don't know which
system I prefer.
911
Honestly, I'm just saying I think Europe
is getting, is going there because we have
912
more and more basically concentration of
the wealth at the very top of the leagues
913
and that's going to make the national
leagues less and less interesting
914
basically.
915
But I don't know either if I prefer the
European wide championship.
916
Well, I think I would prefer European wide
championship.
917
for sure, but I think it would be great to
have it still open.
918
So where you could have, you know, like
basically countries would become regions
919
and then you get from like, if you, if
you're in the best in France, basically in
920
one year, then you get to the highest
level, which is the European one.
921
And then if you're among the worst, you
get down to your country the next year.
I think that would be very fun, especially now that players can be traded very easily within continental Europe, because it's basically the same country legally. It also makes sense for the teams: you know, PSG versus Barcelona is a much tighter matchup than PSG versus literally any team in France.
927
So yeah, that's going to be very
interesting.
But at the same time, yeah, I love hearing about the wrong incentives of the closed system.
930
So thanks a lot for that.
931
That's food for thought.
And that's, again, very close to elections, actually: how you count the votes impacts the winner. And so here, in sports too, how you structure your game has an impact on the winners. I think that's extremely important to keep in mind, because in the end, the organizations, so the MLS in the US or UEFA in Europe, actually have huge power over the game.
940
Well, thanks for that political science
parenthesis.
941
I wasn't expecting that, but that's
definitely super interesting.
942
To get back to the modeling because time
is running by and I definitely want to ask
943
you about the plus minus models because
you're using that also to...
944
estimate player value in American
football.
945
So I'm curious about that.
946
What is that kind of model?
947
Is that mainly for American football that
you're using that also for other sports?
948
Or if it's only for American football, why
is that particularly tailored to that
949
sport?
950
Yeah.
So plus-minus models actually originated in basketball, and they work the best in basketball. They're not perfect. The concept in basketball is that you have 10 players on the court at each moment, and they substitute in and out. But while those 10 players are on the court, you know how many points are scored for each team, right? So you have five players on the offensive side and five players on the defensive side. It's essentially just a big linear model, and you want to adjust for how long they're on the court, or how many possessions they were on the court for.
So you can say, okay, these 10 players were on the court for two and a half minutes, and in those two and a half minutes, this team scored six points and the other team scored four points. Essentially, what you're doing then is a plus-minus model. Sometimes you might see a statistic after the game, like the total difference in net points for the team when a player was on the court versus when they were not. Well, that's not too useful, because there's a lot of correlation, right? You're playing with someone else a lot. So what we call an adjusted plus-minus model is a linear model that tries to fit those player effects, where, you know, you get a one when you're on the court on offense and a negative one when you're on defense. And we look at your team's efficiency, right? Your points divided by some denominator, whether it's minutes or possessions.
978
Okay.
And that's sort of the basketball thing. Over time, they realized, okay, well, there's so much correlation between who is playing together that we need to adjust for that. So they used ridge regression, and that would divvy up the credit a little bit better. Ridge regression is very good when there's a lot of multicollinearity, or correlation between two effects, right? And on a basketball team, you have teammates who play a lot together, and they don't play with other people a lot. But ridge regression has done a decently good job in basketball, over a big sample, of estimating how effective players are.
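A toy version of a ridge-regularized adjusted plus-minus fit, to make the design matrix concrete. This is a sketch on simulated stints with assumed numbers: two six-player rosters, +1 for the five on court for one team, -1 for the five on court for the other, and the net point margin of the stint as the response. Real implementations weight stints by possessions and tune the penalty; none of that is shown here.

```python
import numpy as np

rng = np.random.default_rng(3)
n_players, n_stints = 12, 2000          # players 0-5 on team A, 6-11 on team B
true_value = rng.normal(0.0, 2.0, n_players)   # hypothetical true player impacts

X = np.zeros((n_stints, n_players))
for i in range(n_stints):
    X[i, rng.choice(6, size=5, replace=False)] = 1.0        # team A's five on court
    X[i, 6 + rng.choice(6, size=5, replace=False)] = -1.0   # team B's five on court
margin = X @ true_value + rng.normal(0.0, 10.0, n_stints)   # noisy net points per stint

def ridge(X, y, lam):
    """Closed-form ridge estimate: (X'X + lam * I)^-1 X'y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

apm = ridge(X, margin, lam=50.0)
print(np.linalg.matrix_rank(X), np.corrcoef(apm, true_value)[0, 1])
```

The design matrix is rank-deficient (every row sums to zero), so ordinary least squares has no unique solution. The ridge penalty resolves that and stabilizes the split of credit among players who share the court.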
991
And if you look at these things, you'll
see, we talked about the sniff test.
992
In 2012, LeBron is the number one player.
993
And he's the number one player for a lot
of the years, not so much anymore because
994
he's older, et cetera.
995
Right.
996
But that's sort of those sniff tests that
we get.
Well, some people in basketball, and I'm a proponent of this, and, you know, this is a Bayesian podcast, point out that ridge regression, for those unfamiliar, is a frequentist way to write a very specific Bayesian model, where you have a normal prior on each player with a mean of zero. Okay? That's ridge regression. So if we think about adjusted plus-minus models from that perspective, what happens when you have a normal prior with mean zero is that when you have players that play less, we shrink more towards the prior mean. It's only when we have more data for players that we can deviate from that prior mean.
Well, one thing we know about sports is that if you're not playing as much, that actually is pretty useful information. And what does that tell us? You're not very good. Because if you're good, you're going to play more, and if you're bad, you play less. So other people have come around in the last 10, 15 years and said, okay, well, instead of a ridge regression model for basketball, we should do a Bayesian regression model. And instead of having a mean of zero for a player, we should have a mean of something else.
So there's a few different versions that people have done. A very simple version is to say everybody has a prior mean of, you know, what we call a replacement player. Okay? Someone that doesn't play very much. If you're really good and you play a lot, it doesn't matter too much what the prior mean is, because the data is going to overwhelm the prior. But if you don't play very much, we're going to stick with that sort of negative prior mean, because it means you're below average.
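For a single player, the shrinkage toward a replacement-level prior mean can be sketched with the standard normal-normal formula. The prior mean of -2, the prior spread, and the per-possession noise are assumed values for illustration only.

```python
def shrunk_estimate(raw, n_obs, prior_mean, prior_sd, noise_sd):
    """Posterior mean under a normal prior and normal noise: precision-weighted average."""
    precision_data = n_obs / noise_sd**2
    precision_prior = 1.0 / prior_sd**2
    return (precision_data * raw + precision_prior * prior_mean) / (
        precision_data + precision_prior
    )

# Two hypothetical players with the same raw plus-minus of +5.0.
bench = shrunk_estimate(raw=5.0, n_obs=20, prior_mean=-2.0, prior_sd=2.0, noise_sd=12.0)
star = shrunk_estimate(raw=5.0, n_obs=5000, prior_mean=-2.0, prior_sd=2.0, noise_sd=12.0)
print(bench, star)
```

With little playing time, the estimate stays close to the replacement-level prior. With a lot of playing time, the data overwhelms the prior and the estimate sits near the raw value.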
And so that's one thing you can do. A more sophisticated thing people will sometimes do is build a hierarchical model where you have, essentially, a prior mean that is based on other statistics that we observe: how many points you score, or how many assists you have. That's called a box score prior mean, or a box score plus-minus. So that's what plus-minus models are, and that's sort of the basketball approach.
Now, basketball is really nice, because you have lots of games in the NBA, you play every team at least twice, you substitute a lot, and there's lots of scoring. My work in American football tried to address a lot of these issues. In American football, you don't play every team, and you don't substitute very much. If you do play, you only play with certain people, like, all the time. And then there's not a lot of scoring compared to basketball. There's some scoring, but American football point scoring is unique, right? You get six or seven points for a touchdown, you get three points for a field goal, and then, on rarer occasions, you get these two-point safeties. So there's roughly maybe 10 scoring events in an American football game, versus basketball, where each score is about two to three points, so there's, you know, 80 to 120 scoring events in a game. Right? So these models work a lot better in basketball.
My work in American football has been about: how do we take the basketball model and make some modifications so we can do a football model? One of the things that is tricky in football is that certain positions never get substituted out. On offense, the quarterback plays every single play, unless they're hurt, or they stink and get benched. Well, the quarterback also always plays with the same offensive line, as long as they're healthy and don't get substituted out. So how does a model separate credit when the same players are on the field all the time?
Speaker:
And so my work in that was sort of to use
Bayesian statistics and take the
Speaker:
Bayesian regression model where we had
a prior mean, I used some information to
Speaker:
inform the prior mean for each player, but
I also did this unique thing where I
Speaker:
shrink.
Speaker:
So the prior variance is a product:
there's one prior variance for
Speaker:
all players and then it's multiplied by
another parameter, which is unique for the
Speaker:
position that they play.
Speaker:
And so quarterbacks have a different
shrinkage parameter, essentially, or prior
Speaker:
variance, than
Speaker:
a different position.
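[Editor's sketch] The position-specific shrinkage described above can be illustrated with a small Gaussian regression. This is only a minimal sketch, not the Stan model from the paper: it assumes a known noise variance, fixed (rather than estimated) position multipliers, and toy data, and the names `posterior_mean`, `tau2`, and `lam` are hypothetical.

```python
import numpy as np

def posterior_mean(X, y, prior_mean, position, tau2, lam, noise_var=1.0):
    """Posterior mean of player effects in a Gaussian linear model.

    Assumed prior for player j: Normal(prior_mean[j], tau2 * lam[position[j]]),
    i.e. one shared variance tau2 times a multiplier for the player's position.
    """
    prior_var = tau2 * lam[position]            # one prior variance per player
    prior_prec = np.diag(1.0 / prior_var)       # prior precision matrix
    A = X.T @ X / noise_var + prior_prec
    b = X.T @ y / noise_var + prior_prec @ prior_mean
    return np.linalg.solve(A, b)

# Toy data: a quarterback (position 0) and a lineman (position 1) who are
# on the field together for all 10 plays, each play worth +1.
X = np.ones((10, 2))
y = np.ones(10)
position = np.array([0, 1])
lam = np.array([4.0, 0.25])   # hypothetical multipliers: the QB is shrunk less
effects = posterior_mean(X, y, np.zeros(2), position, tau2=1.0, lam=lam)
print(effects)
```

Because the two players always appear together, the data alone cannot separate them. In this toy setup the credit is split in proportion to the prior variances, which is why the choice of shrinkage per position matters.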
Speaker:
And then instead of just looking at
scoring plays in football, we have what we
Speaker:
call expected points added.
Speaker:
So at each play, we look at on average,
how many points are you going to score if
Speaker:
you have the ball in this position?
Speaker:
And I look at the difference between two
plays, right?
Speaker:
And that tells you essentially how much
value you got in the result of the play.
Speaker:
So instead of using every scoring play, I
just use every single play in football.
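[Editor's sketch] Expected points added, as described above, is the change in expected points between the state before a play and the state after it. The sketch below assumes a made-up linear expected-points curve that depends only on yards to the goal line; real models also condition on down, distance, time, and so on, and the function names are hypothetical.

```python
def expected_points(yards_to_goal):
    """Hypothetical toy curve: closer to the goal line means more expected points."""
    return 6.0 - 0.07 * yards_to_goal

def expected_points_added(yards_before, yards_after):
    """Value of one play: expected points after minus expected points before."""
    return expected_points(yards_after) - expected_points(yards_before)

# Three consecutive plays as (yards to goal before, yards to goal after):
# a 10-yard gain, a 2-yard loss, and a 32-yard gain.
plays = [(80, 70), (70, 72), (72, 40)]
epa = [expected_points_added(before, after) for before, after in plays]
print(epa)
```

Every play gets a value this way, including non-scoring ones, which gives far more observations per game than the roughly 10 scoring events.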
Speaker:
And I do this unique shrinkage
Speaker:
dependent on position, and
it's a huge model.
Speaker:
So I did this in college football, which
has way too many parameters because
Speaker:
there's like 16,000 kids.
Speaker:
But even in the NFL, I've done this and
you get interesting results.
Speaker:
Sometimes they match up with what you
think, sometimes they don't.
Speaker:
But the interesting thing is you can
actually estimate how much you should
Speaker:
shrink each position.
Speaker:
And so actually the model is nice because
it essentially tells you how much of the
Speaker:
variance in the outcome of the play
Speaker:
is dependent on how good players are
across different positions.
Speaker:
So in football, we all know that
quarterbacks are the most impactful
Speaker:
position in the game.
Speaker:
And I did give somewhat subjective priors,
but I still left a lot of
Speaker:
uncertainty around them, and the model could very well
see and estimate that quarterbacks
Speaker:
are in fact the most important position,
because you shrink them the least: they have
Speaker:
the largest variance.
Speaker:
So.
Speaker:
You could look at that.
Speaker:
If you look at the most impactful players
in football, it should be a quarterback.
Speaker:
But by the same measure, the worst players
in football are also quarterbacks, because
Speaker:
you can only hurt your team really a lot
if you're a quarterback.
Speaker:
Compared to other positions, I mean,
in every position you can
Speaker:
hurt your team, but no one can hurt a team
as much as a bad quarterback hurts their
Speaker:
team.
Speaker:
Just like a good quarterback can help
their team more than anyone else.
Speaker:
So that's sort of like a kind of rough
overview of, of my plus minus modeling in
Speaker:
football.
Speaker:
I think I do have, when I wrote the paper,
I have a version of that written in Stan.
Speaker:
The data set itself was not public, but I
did have a version of the Stan model
Speaker:
written and uploaded on my GitHub that you
can look at.
Speaker:
It's pretty massive.
Speaker:
In recent years, I've tried to expand it
and to do a state space model type
Speaker:
version.
Speaker:
So I have effects for each player for each
season over time.
Speaker:
Yeah, that was exactly what I meant.
Speaker:
Computationally, that gets a little bit
trickier.
Speaker:
And my dataset, actually, I was able to
scrape some data for that.
Speaker:
And then actually, I can't anymore.
Speaker:
The NFL just stopped releasing that.
Speaker:
So that work is on hold for now.
Speaker:
But I probably need to find a graduate
student that can help me finish it.
Speaker:
Yeah, definitely we should put that in the
show notes.
Speaker:
That's super interesting.
Speaker:
Your paper in the...
Speaker:
and the link to the GitHub repo.
Speaker:
That's for sure.
Speaker:
And that makes me think of a recent episode I
did, and also a recent interest of mine: I
Speaker:
started contributing to a package
called BayesFlow, and that's precisely what
Speaker:
could be useful in your case here,
because your model structure doesn't
Speaker:
change.
Speaker:
If I understand correctly, because well,
once you have the model structure, it's
Speaker:
kind of like a physics model.
Speaker:
It's not going to change when you have new
data, but the data sets do change.
Speaker:
So you have new data sets coming in.
Speaker:
And so that's where using this
kind of inference, called amortized
Speaker:
Bayesian inference, could be extremely
useful, because basically the
Speaker:
computational bottleneck would
just happen once.
Speaker:
That would be when you train the deep
neural network
Speaker:
to learn the posterior structure and
parameters.
Speaker:
So instead of MCMC, you're using the deep
neural network to learn the posterior.
Speaker:
But then once you have trained the deep
neural network, then doing
Speaker:
posterior inference is trivial.
Speaker:
And so for that kind of model, where you
have a lot of data, but the model is the
Speaker:
same.
Speaker:
That's a very good use case for amortized
Bayesian inference.
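[Editor's sketch] The "pay once, then infer cheaply" idea can be shown on a toy problem. This is not BayesFlow's API and uses no neural network: it assumes a conjugate normal model (standard normal prior on the mean, unit-variance observations) and swaps the network for a linear map fit by least squares, so the learned estimator can be checked against the exact posterior mean. All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_sims = 10, 20_000

# Training phase (the one-off cost): simulate parameters from the prior,
# then a data set for each parameter draw.
theta = rng.normal(0.0, 1.0, size=n_sims)
data = rng.normal(theta[:, None], 1.0, size=(n_sims, n_obs))
summary = data.mean(axis=1)                 # summary statistic per data set

# Stand-in for the network: a linear map from the summary to the parameter.
design = np.column_stack([np.ones(n_sims), summary])
coef, *_ = np.linalg.lstsq(design, theta, rcond=None)

def amortized_posterior_mean(new_data):
    """Cheap inference for any new data set: no refitting, no MCMC."""
    return coef[0] + coef[1] * np.mean(new_data)

# For this conjugate model the exact posterior mean is slope * sample mean.
exact_slope = n_obs / (n_obs + 1.0)
print(coef[1], exact_slope)
```

In this toy case the fitted slope should land close to the exact value. Real amortized methods learn the whole posterior distribution rather than just its mean, but the division of labor is the same: simulation and training happen once, and each new season's data only needs a forward pass.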
Speaker:
So that could be something very
interesting here.
Speaker:
Yeah.
Speaker:
Yeah.
Speaker:
Yeah.
Speaker:
Yeah.
Speaker:
Happy to tell you more about that
afterwards if you're interested.
Speaker:
But yeah, I've started digging into that,
and that's super fun for sure.
Speaker:
So yeah, and I think this is a cool use
case.
Speaker:
Awesome.
Speaker:
Well, I still have a few questions, but
can I?
Speaker:
We are getting short on time, so can I
keep you a bit longer?
Speaker:
Yeah, just a few more minutes.
Speaker:
Sure.
Speaker:
Yeah.
Speaker:
Okay.
Speaker:
Awesome.
Speaker:
Yeah.
Speaker:
So actually, I'd like to pick your brain
now, talking a bit more about the
Speaker:
future.
Speaker:
I'm curious.
Speaker:
So let me fuse two questions.
Speaker:
So first, I'm curious, where do you
see the field of sports analytics heading
Speaker:
in the next
Speaker:
five to 10 years?
Speaker:
And also, sub-question: are there
specific sports where you see significant
Speaker:
potential for growth in analytics?
Speaker:
Yeah, those are, those are good questions.
Speaker:
I think they go kind of hand in hand.
Speaker:
You know, I think it's hard.
If I could predict the future, right,
Speaker:
I would probably have a different job.
Speaker:
I'd probably be retired.
Speaker:
But.
Speaker:
You know, I think a lot of the future is
going to be, you know,
Speaker:
sports like soccer, American football, and
hockey catching up.
Speaker:
And I think a lot of the growth is
actually going to be making sports
Speaker:
analytics more digestible for just
everyday people.
Speaker:
So the fans, right.
Speaker:
And that's happened over time, right?
Speaker:
If you watched a broadcast of a soccer
game
Speaker:
20 years ago, no one talked about expected
goals.
Speaker:
Now, most broadcasts will show it.
Speaker:
They might not always talk about it.
Speaker:
They'll show it.
Speaker:
Like I said, expected goals, it's better
than just showing the score, but there's a
Speaker:
lot left to be done.
Speaker:
I think a lot of sports analytics to date
has been very
Speaker:
much focused on expected values.
Speaker:
And not enough has been focused on
distributions and variance around
Speaker:
estimates.
Speaker:
And so I think that's one place it's going
to have to end up going.
Speaker:
And part of the reason is, right, we
talk about neural networks.
Speaker:
Neural networks are very good at expected
values, with really large data sets.
Speaker:
It's a lot harder, right?
Speaker:
Modeling variance is a lot harder in
anything than modeling expectations.
Speaker:
So I think catching up on some of those
things.
Speaker:
And I think also, like I said, taking a
step back and I think, you know, there's
Speaker:
been a lot of good work that has been
done, but I think we're going to find a
Speaker:
few things where,
Speaker:
hey, maybe we were a little bit
overconfident, right?
Speaker:
And with everything in sports, it's always
about game theory.
Speaker:
So even if something is optimal today,
that strategy is not always going to be
Speaker:
optimal in the future.
Speaker:
And so, you know, in basketball
for a sec, we talked about three-pointers.
Speaker:
Of course, three pointers are really good
right now because they have higher
Speaker:
expected value, but you know, defensively
players are learning to play against three
Speaker:
pointers better than they used to.
Speaker:
Or in American football, the numbers have
said you should pass the ball more.
Speaker:
Well, now the defenses are learning how to
defend it better.
Speaker:
And so running is going to be more
important than it used to be.
Speaker:
Right.
Speaker:
And so these things are always going to
change.
Speaker:
And so in five to 10 years, I don't know
exactly what it's going to be, but I think
Speaker:
in some ways, you know, you might find
some analytics person in 10 years giving
Speaker:
exact opposite advice of what we're seeing
now, just because the game has evolved.
Speaker:
The game has changed.
Speaker:
And so now you should do something else,
right?
Speaker:
To get an edge.
Speaker:
And so I think the growth is twofold.
Speaker:
We're always staying on the cutting edge
of like, what's next.
Speaker:
Sometimes that's going back to where you
were.
Speaker:
And, like I said, making the numbers more
digestible for the everyday consumer.
Speaker:
You know, it's one thing that you and I
can talk about models.
Speaker:
I had to do this at ESPN all the time.
Speaker:
I can't talk about prior distributions on
TV.
Speaker:
Right?
Speaker:
So how do we explain these things?
Speaker:
Right?
Speaker:
And I think what's really going to be key
is over time, this has happened already,
Speaker:
but it's going to keep on happening that
the analysts themselves are going to be
Speaker:
much more data literate than they have
been in the past.
Speaker:
Not just because they have more people
working with them or they're younger.
Speaker:
Also, the analysts of the future are going
to be able to use AI to do their own
Speaker:
analysis.
Speaker:
And that could be scary because they might
make some bad assumptions.
Speaker:
But they're also going to be more data
savvy, and they could load up a data set
Speaker:
and use an AI tool
Speaker:
and, even if they can't code, get
insights that, you know, I used to have to
Speaker:
write some code to get, and now they
can just do it themselves.
Speaker:
Right.
Speaker:
And so that's, I think somewhere else that
teams and coaches are going to be able to
Speaker:
do more analysis on their own.
Speaker:
And it's not that the data people aren't,
aren't needed.
Speaker:
In fact, they're going to be needed even
more to make sure that the coach isn't
Speaker:
missing an assumption, right,
Speaker:
that he needs to be thinking about, about the
structure of the data.
Speaker:
Because he might just say, great,
Speaker:
now I can run a regression.
Speaker:
I don't even know.
Speaker:
I don't even need to know how to code it.
Speaker:
Right.
Speaker:
That's great.
Speaker:
But are you thinking about this?
Speaker:
Right.
Speaker:
And so there's going to be a lot of
education about using some of these tools
Speaker:
better, but everyone's going to
have access to it.
Speaker:
Right.
Speaker:
It's going to be so much more accessible
in the future than it has been in the
Speaker:
past.
Speaker:
Yeah.
Speaker:
Yeah.
Speaker:
Yeah.
Speaker:
Yeah, for sure.
Speaker:
Completely, completely agree with that.
Speaker:
And that's also something I'm very
passionate about.
Speaker:
That's also what this show
Speaker:
is here for, right?
Speaker:
It's to make the bridge between the
modelers and the non-stats people
Speaker:
easier, in a way.
Speaker:
And that's something I really love doing
also in my job, basically being that
Speaker:
bridge between the really nitty gritty
details of the model.
Speaker:
And then, OK, now that we have the model,
how do we explain to the people who are
Speaker:
actually going to consume the model
results what the model can do, what it
Speaker:
cannot do, and how we can
Speaker:
make decisions based on that, that
hopefully are going to be better decisions
Speaker:
than we used to make.
Speaker:
And also, how do we update our decisions?
Speaker:
Because, well, the game changes, as you
said so well.
Speaker:
So yeah, for sure, all that stuff is
absolutely crucial.
Speaker:
And I like using the metaphor of the
engine and the car, right?
Speaker:
It's like building the model is the engine
of the car.
Speaker:
So surely, you want the best engine
possible, but you also need a very cool
Speaker:
car, because otherwise, nobody's going to
want your engine.
Speaker:
And so...
Speaker:
like then building all the communication
around the model, the visualizations,
Speaker:
things like that, extremely important
because then in the end, as you were
Speaker:
saying at the beginning of the show, if
the model isn't used, well, that's not a
Speaker:
very good investment.
Speaker:
Yeah.
Speaker:
So literally, I would have a
lot more questions, they are on my list,
Speaker:
but we are going to call it a show, Paul,
because I don't want to keep you
Speaker:
three hours. You've already been very
generous with your time.
Speaker:
You can come back to the show anytime if
you want to, if you have a cool new
Speaker:
project you want to talk about for sure.
Speaker:
Yeah, maybe we can record the French
version of the podcast sometime, you know.
Speaker:
Yeah, yeah.
Speaker:
I'll definitely be down for that.
Speaker:
You know, someone who will be very happy
is my mother.
Speaker:
She's always asking me, so when are you
going to do the French version of your
Speaker:
courses and your podcast and so on?
Speaker:
I'm like, that's not going to happen, mom.
Speaker:
Maybe that's what moms are for though.
Speaker:
Exactly.
Speaker:
Before letting you go, Paul, I'm going to
ask you the last two questions
Speaker:
I ask every guest at the end of the show,
because it's a Bayesian show, so what
Speaker:
counts is not the individual point
estimate, but the distribution of the
Speaker:
responses.
Speaker:
First question, if you had unlimited time
and resources, which problem
Speaker:
would you try to solve?
Speaker:
Good.
Speaker:
That's a good question.
Speaker:
You sent me this ahead of time and I spent
a couple seconds and I was like, man, I
Speaker:
don't know.
Speaker:
But I, it's tough.
Speaker:
There's so many questions in sports.
Speaker:
Yeah.
Speaker:
I know.
Speaker:
I, I mean, my, one of my passions is
American football and I just keep going
Speaker:
back.
Speaker:
So I could tell, I love American football
and I love soccer, international football.
Speaker:
Right.
Speaker:
And in both of those games,
Speaker:
there's certain positions that are just
really hard to understand how valuable
Speaker:
they are.
Speaker:
And so in soccer, it's like the midfield.
Speaker:
We know you need a good midfielder,
but how do you measure that?
Speaker:
That's a really hard problem.
Speaker:
And in football, in American
football,
Speaker:
there's a lot of positions like that as
well.
Speaker:
So I'd probably go somewhere along those lines.
Speaker:
Like I want to, I want to discover and
measure the value of these really hard to
Speaker:
measure traits in these two
sports.
Speaker:
Yeah.
Speaker:
Yeah, I definitely understand.
Speaker:
The battle for the middle is extremely
important always in soccer.
Speaker:
And if you look at all the teams which win
the Champions League, so the Holy Grail,
Speaker:
like the Super Bowl of the soccer world,
almost all the time they have an amazing
Speaker:
and impressive pair or three players as
midfielders.
Speaker:
And that's like a sine qua non.
Speaker:
But...
Speaker:
As you were saying, it's extremely hard to
come up with a metric that's going to not
Speaker:
only explain why the midfielders are good,
but also help you consistently choose
Speaker:
midfielders that will increase your
probability of winning the Champions
Speaker:
League.
Speaker:
And I'm saying that as a very frustrated
Paris fan, because it's been years, since
Speaker:
Thiago Motta basically retired, that we've been
looking for a number six,
Speaker:
so the midfielder just in front of
the defense, and we're still looking for
Speaker:
him.
Speaker:
Yeah.
Speaker:
So please, Paul, let me know when you're
done with that.
Speaker:
Yeah.
Speaker:
Well, unfortunately, there's several
really good French midfielders.
Speaker:
They just don't play for PSG.
Speaker:
I know.
Speaker:
I know.
Speaker:
Not a lot of French players stay in
France.
Speaker:
That's why I'm telling you, we need a
Europe-wide league.
Speaker:
Many more players would stay in France and
play for PSG, I guess.
Speaker:
And second question, if you could have
dinner with any great scientific mind,
Speaker:
dead, alive or fictional, who would it be?
Speaker:
Fictional?
Speaker:
I haven't really thought about fictional
scientific minds.
Speaker:
That is a good question.
Speaker:
Geez.
Speaker:
Man.
Speaker:
Well, I mean, I thought you were going to
answer very fast.
Speaker:
Actually, that one, I thought you were
going to answer Bill James like super
Speaker:
fast.
Speaker:
Bill James.
Speaker:
Yeah.
Speaker:
Well, I've met Bill James.
Speaker:
So, okay.
Speaker:
So I haven't had dinner with him, but I have met
him.
Speaker:
I'll go a little, how liberal are you with
the word scientific mind here?
Speaker:
Yeah.
Speaker:
So I think scientific mind, I think
Galileo, I think Newton, I think Einstein,
Speaker:
right?
Speaker:
Like,
Speaker:
You know, those are all, but I'm sure from
the sports world, from the sports world,
Speaker:
there is a former football player that
very few people have ever heard of and his
Speaker:
name is Virgil Carter.
Speaker:
And the reason why I love him, he played
in the seventies, is that he wrote a paper
Speaker:
about expected points in football while he
was playing in the NFL.
Speaker:
And it was sort of the first sports
analytics
Speaker:
ever done in American football and he was
a player in American football at the same
Speaker:
time.
Speaker:
So very, not very well known.
Speaker:
He's still alive.
Speaker:
I don't know him at all, but he would be a
really cool person.
Speaker:
If I go, like, classical
scientific minds, I would
Speaker:
probably, maybe, say Gauss. Like, hey, this
distribution that has your name is like
Speaker:
used everywhere and it's very useful.
Speaker:
So I probably, I would stick with him.
Speaker:
Normal distributions,
Speaker:
Gaussian distributions, like, rule the
world nowadays.
Speaker:
So I'd probably stick with that if I were
to go traditional scientific mind.
Speaker:
Yeah.
Speaker:
Yeah.
Speaker:
No, good choices.
Speaker:
Good choices.
Speaker:
I am amazed about that Virgil Carter
story.
Speaker:
That's so amazing.
Speaker:
Yeah.
Speaker:
So if anybody knows Virgil Carter, please
contact us and we'll try to get that
Speaker:
dinner for Paul.
Speaker:
If you do that, I'll definitely be here to
grab the dinner and have a conversation
Speaker:
with Virgil because like having someone
like that on the show would be absolutely
Speaker:
amazing.
Speaker:
I love that story.
Speaker:
That's so amazing.
Speaker:
It's like, you know, the myth of the
philosopher king.
Speaker:
Well, here is like the myth of the
scientist player.
Speaker:
It's just like, I love that.
Speaker:
Yeah.
Speaker:
That's fantastic.
Speaker:
Damn.
Speaker:
Thanks a lot, Paul.
Speaker:
Let's call it a show.
Speaker:
Thanks for having me.
Speaker:
Yeah, that was amazing.
Speaker:
As usual, we'll put resources and a link
to your website in the show notes for
Speaker:
those who want to dig deeper.
Speaker:
Thanks again, Paul, for taking the time
and being on this show.
Speaker:
Thanks once again, I really enjoyed it.
Speaker:
This has been another episode of Learning
Bayesian Statistics.
Speaker:
Be sure to rate, review, and follow the
show on your favorite podcatcher, and
Speaker:
visit learnbayesstats.com for more
resources about today's topics, as well as
Speaker:
access to more episodes to help you reach
a true Bayesian state of mind.
Speaker:
That's learnbayesstats.com.
Speaker:
Our theme music is Good Bayesian by Baba
Brinkman, feat MC Lars and Mega Ran.
Speaker:
Check out his awesome work at bababrinkman.com.
Speaker:
I'm your host.
Speaker:
Alex Andorra.
Speaker:
You can follow me on Twitter at alex_andorra,
like the country.
Speaker:
You can support the show and unlock
exclusive benefits by visiting
Speaker:
patreon.com/learnbayesstats.
Speaker:
Thank you so much for listening and for
your support.
Speaker:
You're truly a good Bayesian, change your
predictions after taking information in,
Speaker:
and if you're thinking I'll be less than
amazing,
Speaker:
Let's adjust those expectations.
Speaker:
Let me show you how to be a good Bayesian
Change calculations after taking fresh
Speaker:
data in Those predictions that your brain
is making Let's get them on a solid
Speaker:
foundation