Bayesian Occupancy Models in Ecology -- Matthijs Hollanders

‌

Listen on your favorite platform:

In this episode, Alex Andorra sits down with Matthijs Hollanders, a postdoc at the University of Newcastle in Australia and creator of occARU -- an R package for Bayesian occupancy modeling of camera trap and acoustic monitoring data. Matthijs's path here runs through the Netherlands, a deep love of wildlife, and a growing frustration with statistical tools that throw away perfectly good data before the modeling even starts. The episode covers what occupancy models actually are and why they matter, how automated recording units break the assumptions baked into classical approaches, and why modeling raw counts directly -- rather than collapsing everything to a binary "detected or not" -- unlocks a richer family of questions about detection rates, species co-occurrence, and how both vary across space and time, all handled through hierarchical Gaussian processes, global-local shrinkage priors, and some of the newer Stan features around zero-sum and simplex constraints.

The reason this episode is worth your time, even if you don't work in ecology, is that occARU is a really clean case study in something I think about a lot: the cost of decisions you make before you start modeling. Most occupancy software collapses raw counts to ones and zeros at the data-prep stage. Matthijs's argument -- and the design principle behind the package -- is that you shouldn't, and that the consequences of doing so quietly compound into worse science.

What Occupancy Models Actually Do

An occupancy model exists because of one stubborn fact. When you survey a site for a species and don't see it, you don't know whether the species wasn't there or whether it was there and you missed it. With rare or cryptic species, that distinction is the whole game.

The fix is the repeated-measures trick. Visit every site multiple times. If you find the species sometimes but not always, you can estimate a detection probability -- how often you'd expect to see it given it's actually present. That detection probability then lets you back out true occupancy, separately from the observation process. Matthijs offered a framing I hadn't seen before: it's a zero-inflation model where the zero-inflation happens at the level of the site, not the observation. Once you see it that way, the whole apparatus collapses back into a stack of GLMs, and the conceptual complexity drops dramatically.

Why Throwing Away Counts Is a Bad Idea

The wedge issue for occARU is what you do with raw count data from automated recording units. A camera trap records a kangaroo. Twenty kangaroos. Two hundred kangaroos over two months. Traditional software bins this into "detected today, yes or no" before anything else happens. Matthijs's argument is blunt: that's throwing away information by construction, and you should at minimum have a justification for doing it.

The catch is that raw counts have their own pathologies. A group of 20 kangaroos standing in front of the camera for two hours generates a huge spike of detections that isn't really new information -- it's the same animals, repeatedly. occARU handles this with a thinning window: choose an interval (30 minutes is a typical choice) and retain the maximum count observed within each window. The two-hour kangaroo episode collapses to four detections, each carrying the genuine peak group size in that interval. The model can still see "there were a lot of kangaroos that morning" without treating each frame as independent.

This makes occARU a count-based occupancy model that still produces traditional occupancy outputs (does the species occur here, yes/no), but layers on something traditional models don't have: detection rates that vary in space, time, and across species. As Matthijs puts it, you only gain stuff -- nothing is lost.

Multi-Species Gaussian Processes Without the Headache

The packaging of Gaussian processes into occARU is the part that made me sit up. GPs are notoriously a barrier-to-entry technology. They're powerful, they're flexible, and they're a pain to actually use in practice -- which is why most ecological packages don't include them, and the ones that do tend to require specialist knowledge to drive.

occARU adds spatial and temporal GPs to handle the obvious problem: sites closer together in space should be more similar; surveys closer together in time should be more similar; pretending otherwise is pseudoreplication. The clever part is the multi-species extension. occARU structures the GP as a matrix normal -- a covariance matrix over sites in one direction, and a covariance matrix over species in the other. Sample from that and you get random effects per species per site that are spatially structured AND that let you directly estimate which species co-occur. All of it from a single Cholesky decomposition.

For datasets up to a few hundred sites, occARU uses exact GPs. For larger regional structure, it fits separate smaller GPs per grid sharing hyperparameters -- a thousand cameras across ten grids becomes ten manageable problems rather than one expensive thousand-by-thousand covariance matrix. Hilbert-space GP approximations for genuinely large single-grid datasets are on the roadmap, and Matthijs and I spent some time digging into PyMC's HSGP hyperparameter tools and how the same logic could be transplanted into a specialized package like occARU.

One Prior to Govern Them All

The other piece of the package that's worth dwelling on is its treatment of priors. A complex ecological model can easily have dozens of variance components -- coefficients on predictors, random site and survey effects, species-level deviations, hierarchical category effects. Put independent priors on all of them and the prior predictive distribution explodes. Realistically, no scientist actually thinks all of those components are simultaneously enormous.

occARU borrows the R2D2-style approach: put a single Student-t prior on the total variance of the linear predictor, then partition that variance across components with a simplex. The sparsity of the simplex is itself a hyperparameter \-- so the model decides whether one component dominates or variance spreads evenly across the budget. The same hyperpriors then work whether you fit one species or a hundred, zero predictors or fifty.

Combined with the zero-sum-normal constraint -- which we've covered with Adrian Seyboldt on nutpie, Sean Pinkney in episode 133, and the AR² prior episode with David Kohns -- this lets occARU handle categorical predictors without reference-category encoding, keeping intercepts cleanly interpretable as averages across categories. It leans heavily on recent Stan additions (the simplex_constrain function that takes n-1 free parameters being the most important), and the back-and-forth between Stan's evolving language and the package's design choices was one of my favorite threads in the conversation.

The Pragmatic Stuff: Pareto-k, Parallelization, and What Breaks

Two of my favorite exchanges in this episode are about things that don't work the way the documentation suggests.

The first is within-chain parallelization. The pitch is appealing -- if you have 40 cores and four chains, why not parallelize each chain across ten cores? In practice, Matthijs has found that the overhead from copying data structures across cores usually erases the gain. It helps only when the per-site likelihood is genuinely expensive. For cheap likelihoods, you spend more time copying than computing. Stan's own documentation acknowledges there's no clean rule. Trial and error remains the honest answer.

The second is Monte Carlo integration for high Pareto-k values, a fix Matthijs got via Aki Vehtari on the Stan Discourse. Models with observation-level random effects routinely produce high Pareto-k diagnostics in leave-one-out cross-validation, making model comparison unreliable. The trick: in generated quantities, draw the random effect many times per MCMC iteration, compute the likelihood at each draw, and average. Since generated quantities is cheap, the runtime cost is small.

In Matthijs's experience, this drops the percentage of high-k observations from around 60-70% to around 10%, and the remaining 10% usually reflect genuinely poorly-fitting observations rather than LOO failures. For more on related diagnostics and the penalized-complexity priors that Matthijs also uses on overdispersion parameters, the INLA episode with Janet van Niekerk and Håvard Rue is worth a listen.

As for when occARU breaks down: the bottleneck is Stan's fitting speed on large datasets. Years of daily data across many sites will be slow. The workaround is straightforward -- bin to weekly or monthly intervals. For occupancy estimation specifically, the coarseness barely matters; you only lose some detection-rate resolution.

Looking Ahead

What's next for occARU is the multi-season version, where occupancy itself evolves as a Markovian process across time -- bringing the package fully into the hidden-Markov family Matthijs has worked with throughout his career. HSGP approximations for big single-grid datasets are on the list. And whether the package eventually crosses over into the broader ecological-modeling community will depend less on the math than on how it actually feels to use on a real research problem.

The bigger thing I took from the conversation is that specialized Bayesian packages live or die on the choices they make for users. Generic frameworks like PyMC, Stan, and BRMS exist for a reason -- but they can't make domain-specific decisions about priors, parameterizations, and defaults. A well-designed specialized package can. occARU is a good example of what that looks like done right: opinionated where it should be (count modeling, GPs by default), flexible where the science demands it (any number of species, predictors, or random effects), and built on top of a generic foundation rather than reinventing it.

Check out the full episode above, and the show notes for links to occARU and the related episodes we mentioned -- including Aaron MacNeil on Bayesian shark trade modeling.

You can also interact with the episode on NotebookLM! Ask questions, generate flashcards, and more.

Hope you enjoyed it, and see you in two weeks, my dear Bayesians!

Chapters

00:12:14 What is an occupancy model and what problem does it solve?

00:16:16 What are Automated Recording Units and why do they need different models?

00:18:45 What is the occARU R package and why does it exist?

00:23:55 Why does occARU model counts directly rather than binary detection?

00:26:38 What does multi-species hierarchical modeling with Gaussian processes look like?

00:32:22 How does occARU implement Gaussian processes efficiently?

00:41:01 Why are Gaussian processes such a powerful but tricky modeling tool?

00:44:11 What is variance decomposition with global-local shrinkage priors?

00:49:02 How does occARU leverage recent Stan features for zero-sum constraints?

00:57:37 When does within-chain parallelization actually help?

01:01:30 How does Monte Carlo integration reduce high Pareto-k values?

01:15:27 When does occARU underperform and what's on the roadmap?

Links from the Show

What Occupancy Models Actually Do
Why Throwing Away Counts Is a Bad Idea
Multi-Species Gaussian Processes Without the Headache
One Prior to Govern Them All
The Pragmatic Stuff: Pareto-k, Parallelization, and What Breaks
Looking Ahead
Chapters
Links from the Show

My guest today is Matis Hollanders, a postdoc at the University of Newcastle in Australia.

Yes, there is a Newcastle in Australia, where he spends his days analysing camera traps and song meter datasets and building Bayesian models that take that messy ecological data

seriously.

The reason I wanted Matis on the show is his R package, Ocaru.

It does something I find.

Conceptually clean and refreshingly honest.

Instead of collapsing camera trap data down to did we see the species today, yes or no, it models the counts directly.

That choice unlocks a whole family of richer questions about detection rates, multi-species co-occurrence, and how all that varies in space and time, and the package

handles it all with, you guessed it, hierarchical Gaussian processes.

Global, local, shrinkage priors, and some of the newest stand features around zerosum and simplex constraints.

So as you know now, all of these are a bunch of show favorites.

So if you have ever wondered how Bayesian thinking actually helps ecologists get more out of their data, this is the episode for you.

This is Learning Bayesian Statistics, episode 159, recorded May 13, 2019.

Let me show you how to be a good base and change your predictions after taking information.

And if you think it now be less than amazing, let's adjust those expectations.

What's a Bayesian is someone who cares about evidence.

Welcome to Learning Basion Statistics, a podcast about Bayan inference, the methods, the projects, and the people who make it possible.

I'm your host, Alexandora.

You can follow me on Twitter at alex underscore andora, like the country, for any info about the show.

Learnbase stats.com is Lab Plus2B.

Show notes, becoming a corporate sponsor, unlocking Vision Merch, supporting the show on Patreon.

Everything is in there.

That's learnbasetats.com.

If you're interested in one-on-one mentorship, online courses, or statistical consulting, feel free to reach out and book a call at topmate.io/slash alex underscore andora.

See you around.

Folks and best Asian wishes to you all.

Mateis Hollanders.

Welcome to Learning Basion Statistics.

Thank you.

Yeah, T, how was my Dutch pronunciation?

It was perfect.

You nailed it.

Well, we'll see if at some point I learn I learn Dutch.

I think I will not until like unless I have to go leave there at some point now to be honest.

It's hard to justify learning another language, especially if you have no reason to speak it on a regular basis.

Yeah, yeah, exactly.

Um especially since Dutch people actually speak so like such a good English, you know.

So th all the Dutch people I've met in my life speak perfect English.

So I'm like, well, you know No incentives.

It's pretty well the last country you'd actually need to learn the language to be able to communicate with the people there.

So Yeah, yeah.

Yeah, yeah.

I have uh I have a friend who lives there right now.

He's German, so actually it's very close.

Like German and Dutch are super close, but he still hasn't learned Dutch because everybody speaks English.

Well to to be fair, I I don't speak a word of German.

I I don't reckon the two languages are all that close together, but uh maybe it just shows some of my skills of learning languages.

I mean you live in Australia.

So you know.

Yes.

You're not l you're not living in Germany, so the like the the comparison here is not you know, it's not a a perfectly

uh control trial.

but actually we were talking about that just before recording, so you're the first you're the first person I meet in I in in I'm the first person you meet who has whose last name

is also the name of a country because your name means from from Holland, basically from what you told me, right?

That's right, that's right.

Yeah, I think it's it's fairly unusual.

Okay, well, I think it fits well because we've got the World Cup going on.

Um and so so perfect.

Today today game is uh is Andorra Andora Netherlands.

So you know.

That's perfect.

man And actually let's uh let's dig into into your background um because you do you do interesting stuff.

So maybe tell us what you're doing nowadays and how you ended up working on wildlife data analysis.

Right.

So uh I'm currently doing a postdoc at the University of Newcastle in Australia where I'm looking at uh analyzing basically camera trap data sets.

before that I was working on songmeter data sets, which uh it's microphones hanging up in the bush that record information continuously, basically.

And there's a lot of overlap between those two data types.

So that's what I've been focusing on the last couple of years, I'd say.

Um before that I was doing my PhD and um I was doing mark recapture analysis, which is basically just another type of hidden Markov model that I'm doing right now.

So I think a lot of a lot of the modeling that I've done has kind of been uh very similar but answering different questions.

Okay, yes, so I already see uh some some similarities with other episodes we've recorded.

The one with uh Aaron McNeil obviously comes into mind.

Uh where um actually also worked on on that model with him.

He's leading a fantastic team and they are they are doing modeling to infer the trade of shark meat around the world.

And and so they are using a bunch of very, very cool Bayesian uh methods.

to to do that.

It's an extremely complex model, one of one of the most complex I've I've ever worked on.

But that that has been super fun to work with them on that.

And that was episode 127.

I'll put that in the in the related episodes for this one.

Uh and also the one with uh Vianelios Paraja.

She is doing a lot of hidden Markov models.

So this one will we'll hit close to home

For you, Matt, uh and and so people who wanna wanna hear more about these kind of models, I will put that in the in the show notes.

and that was episode fourteen with Yanis.

So this is uh this is one of the one of the oldest ones, DM.

Uh so lots of um hidden Markov models, things like that.

So fun fact for listeners, I actually so be s

I I saw in your email signature that you were at the University of Newcastle.

and as an ignorant person who just relies on their priors, I thought it was Newcastle, England.

And so and so that that was um that was interesting when you told me you were in Australia.

I was like, wait, there is a new castle in Australia.

I have to beg you to change the time, otherwise I would have been here uh in the middle of the night.

Yeah, yeah, the original time was one AM for you.

So, you know.

Um but actually it's better.

Now that I'm in California, it's actually much easier to record with uh Australians than it is to record with uh with people in Europe.

So I I thank you for that, because otherwise now I have to wake up very early to record the episodes.

Uh so it's good so that's good.

And actually how did you end up in in Australia?

is that related to um to the the kind of models and data and and animals you have access to or that was just kind of random?

It was definitely related to the animals mostly.

Uh I've been interested in wildlife my entire life and I'm actually from the Netherlands, but I spent some time in the US which really opened my eyes to uh what it's like to live

in a place with lots of wildlife.

And I always knew I didn't want to stay in the Netherlands and I'd shortlisted Australia as a place I'd like to live.

And during my masters, I just looked for PhD opportunities uh in Australia just to get me over here because I just wanted to live here to to live here.

The the specific choice of PhD was a bit of an afterthought.

Um but yeah, that's that's how I ended up here and um yeah, so

My work didn't really bring me here, but during my PhD I became very interested in the work I'm doing now.

Yeah, yeah, yeah.

Okay.

So it's like really you um you really fell in love with Australia in a way.

That's right, yeah.

And it was I think to be quite honest, it wasn't the best to do my PhD for those reasons.

I don't think I did a PhD for the right reasons and I found it quite stressful at the time.

Um but

Here we are, I landed on my feet, so uh can't complain.

Yeah, yeah, yeah.

It sounds like it.

and yeah, I still have to to visit Australia.

I visited um New Zealand a few a couple years ago, actually with Aaron, because uh we were working on the model over there, and that was absolutely incredible.

And I've been meaning to come back since then and also come through uh through Austral Australia at that time.

Time, so yeah, definitely want to do that at some point.

Well, maybe before you come to Australia, I can convince you to come to the Yucatan Peninsula in Mexico for the International Statistical Ecology Conference in January 2027.

Oh.

yeah, you definitely can.

That sounds very fun.

and actually much closer to home.

So you know, for sure.

Let me uh like um yeah, send that my way and and I'll see

How I can make that happen.

That'd be fun to to record some uh some episodes over there actually.

It would be It's a fantastic conference.

It's every two years.

Uh four years ago I think it was in South Africa, so I could really justify going.

Uh the time after that it wasn't Wales, so I couldn't really.

But now that it's in Mexico again, uh where we have neotropical rattlesnakes and lots of margaritas, I think it'll be it'll be easy to go.

Yeah.

Yeah, yeah.

That sounds that sounds perfect.

Yeah.

Yes, and

Send the links my way for sure, please.

I will.

And and what's about Bayes, actually, because you do a lot of Bayesian models, but what drew you to Bayesian stats specifically?

Well, that's that's also easy to answer.

Uh it kind of starts with being at uni.

Um during my master's and bachelors, I think I took three or four uh classical statistics classes and I I never I never understood what was going on.

I always found it incredibly

not intuitive.

And therefore when I started doing my PhD and I knew I was going to have to analyze these uh kind of complex mark recapture data sets, I needed to have a proper analysis.

So I was like, okay, I need to actually sit down and learn this stuff.

And it was the introduction to Bayesian statistics in Win Bugs uh by Mark Carey, one of the like introductory Jags, bugs, books for doing analysis.

And

It's when I feel like I need to do something, I can either rush myself and get frustrated or I sit down and just go as slow as I need to go until I get it.

And that's what I did with this book.

I went chapter by chapter, like hand copied all the code.

It never ran, always took it twenty times to rewrite it, even though it was only like ten lines of code, uh, because I literally had no idea what I was doing.

But as I went, I became very interested in it and um

Yeah, that that was definitely the the beginning where it all started probably six years ago now.

And since then it's completely changed the way I think about everything because because I've always written stuff in jags or bugs, like you have to write the generative model

out, which I'm sure I don't need to tell you how uh how good that is for forming an intuition around what you're doing or a getting an actual mechanistic understanding of

what a model is doing.

Um so that's definitely where it started.

I have I have to credit Mark Carey really for um all of those books that that they wrote for ecological modeling.

Wow.

Yeah.

Yeah that uh that makes sense and and yeah, I these models are very complicated.

So starting with those, this is uh yeah, that's hardcore.

Well done because uh this is clearly not the easiest.

It's

They are both complicated and not complicated in the sense that uh they're they're compli they're only complicated or not only, but one of the main reasons they're complicated is

because the data is messy, because we're not perfectly observing animals, because we're not surveying at set times all the time.

So we have all this like almost like infrastructure around the model that is complicated that you need to account for.

But

It's also not complicated in the way that a lot of these complicated models can be thought of as just some flavor of GLM still.

Um, which I'm sure we'll get into it in a bit.

Um but yeah, very I I'd say most of the complexity comes just from how messy the data is that you need to accommodate.

But under the hood, you're kind of still doing

classical things like GLMs and linear models on different model components.

Yeah.

Um and they are they're super fun also to to work on, to be honest.

and and that's also because they are not that easy.

Um that you always have something to learn.

There is always a complexity or nuance you didn't know about.

And so there is always something you can you can add on top of it and that that really makes a difference.

And actually if we so let's set the the stage for the for the listeners who don't work in ecology.

Um again I recommend listening to the two Irons and VNA's episodes because we go into the into the the details of capture recapture models, especially with VNA.

Um but you focus a lot and at at at least today we'll focus a lot on occupancy models.

Yeah, set the stage for the listeners about what that is and what problem that's actually solving.

So occupancy models are are used to accommodate for the fact that we don't always detect a species when we go and survey for it.

So a lot of animals are rare, so you might only see it once every few times that you go and visit a site.

Instead of just modeling the naive detections, like let's say we have a hundred sites and we go and visit them once and we find the species at fifty percent of sites, we would

never think that the species only occurs at fifty percent of sites.

We only detected them at fifty percent of sites.

So what occupancy models do is they add this repeated measures component where you survey every site multiple times, and that lets you estimate a detection probability.

Where, for instance, a common, highly detectable species, you see it almost every time you go and survey at a given site.

So if you're never getting it at a site, it's probably not there.

Um whereas if there's a rare species where the sites where you did get it, you only get it one out of every three visits, every five visits, every ten visits.

So that's a rare or hard to detect species.

And therefore some of the sites where you um

never detected the species, they might be there, you just didn't pick them up because basically what it's telling you is that if the species is there, you're still unlikely to

see it.

Hmm.

Yeah.

Yeah, that's a very good uh I think that's a very good way of explaining it.

And that makes it very intuitive because, yeah, like everybody can understand that some species, some animals are harder than than others to spot.

And so basically uh

A different way that I like to think about it now is that an occupancy model has a zero inflation component which happens at the level of the site.

So the feature of occupancy models is that every site is multi is surveyed multiple times.

And what we're estimating, the first part of this model is is this site occupied yes or no?

Which is a latent variable because we we know it's occupied when we found it, but we don't know um it's not occupied.

when we didn't find the species.

So we're estimating this zero inflation parameter at the level of the sites, but conditioned on that uh occupancy state, we have this detection history, which lets us

estimate the detection probabilities or detection rates at those sites.

And that's kind of what I meant before about ecological models are complicated, but they're also not, if you can sort of wrap your head around that,

the zero inflation is not happening at the level of the observation as it does in a typical uh zero inflated model, but it happens at the level of the site, then you're kinda

back at just this, you know, model that's in our our toolkit.

Yeah.

Yeah, yeah, yeah.

and added to that, you also work out an automated recording units.

So what which is ARU, um

And so what's what's that?

I surely don't know anything about that.

What what is special about that kind of data?

And why do traditional occupancy models struggle with it if they do?

So automated recording units are things like camera traps and uh microphones that are hanging up in the forest basically to detect birds or vocalizing mammals, and it's in the

name.

They're just these

Um, devices that are essentially monitoring continuously over a deployment period.

So a camera battery or an SD card might last like two weeks, but d during those two weeks it's running the whole time.

But there's ways to extend it as well.

Like you could have battery packs, or you could only have the camera be on for one hour every day, could be on every other day.

Uh so the key feature of it is that it's this thing that's actually monitoring.

continuously in time versus the traditional types of surveys that we would do where a human would go and visit a site for half an hour, six hours, two days, and look for the

species and then record in our notebook if we detected the species or how many we detected.

The the key feature is that the surveys are not discrete really, they're occurring in continuous time.

which is different from how the traditional um survey methods work.

And I wouldn't say that occupancy models struggle with the kind of data.

It's just that the models were kind of designed for discrete surveys as opposed to stuff that's going on in continuous time like these kind of d devices.

Mm-hmm.

Okay.

Yeah.

Thanks.

That's very clear.

And that actually

illuminate some of the modeling choices we're gonna we're gonna talk about next because um you created an R package to just that kind of models that you called Ocaru.

Um so occupancy and ARU and actually isn't Okaru something special in your field?

Is is that even an animal?

I don't think so.

The way the way I thought of pronouncing it was Ok ARU, but maybe this is a good a good signal that the name isn't chosen that well.

I mean

I like okaroo.

It it it sounds uh it sounds like an animal, you know.

Look at that okaroo.

Yeah.

Yeah, you're you're welcome.

You're welcome to call it you're welcome to call it that.

So actually the most important part is what is it about?

You know, um why did you create that package?

Um and what's the elevator pitch for it?

Well elevator pitch might be hard because I feel like I always end up talking a long time when someone asks me this question, but I'll I'll try and keep it brief.

There's kind of two motivations here.

The first one is that in a traditional occupancy model, we normally record whether or not we detected the species, yes or no, during the survey.

So we collapse our data to a detection or non-detection event, aka 01 or 0.

Um and a lot of analyses that are done even with ARUs, so camera traps, people will end up, for instance, binning

their survey periods to one day or one week of camera trapping and then collapse any detections of the target species to a one or a zero, as opposed to modeling the counts

directly.

And my view is that this kind of happened because, well, when occupancy models were proposed, the what they called an observation model, so the detection side of things, was

a Bernoulli or logistic regression.

Where you model the detection or non-detection during a given survey.

In my view, this collapses the actual raw data, which is a count, to a one and a zero, or a one or a zero, before doing any modeling at all.

And you are throwing away information by definition when you collapse a count to a one or a zero.

And the my main motivation here was to be like,

We shouldn't do that.

Or if you're gonna do it, you need a justification as opposed to doing it by default.

So there are issues, of course, like autocorrelation between detections.

Like you might have a group of 20 kangaroos that's standing in front of a camera for two hours, which will really ramp up your detections.

So I put some features into the package which lets you thin detections.

So for instance, you could have a 30 minute thinning window where it

retains the image with the highest count during that window.

So going back to the kangaroos, let's say they are there for two hours, that would be reduced to four detections with the maximum number of kangaroos in an image during each 30

minute window.

So it might be like the group in the camera might range from like seven to seventeen individuals.

So it would retain the highest count because that's the highest that we know was there.

at any given time during that window.

But regardless of the thinning, the idea is to just not just go to ones and zeros, but model how many did we see because all else being equal, um, if you're seeing more

kangaroos in one month than the next, there were probably more kangaroos in the system in that time.

And I'm treading carefully here because this is a highly contested issue.

in ecological modeling where the holy grail of ecological modeling is estimating abundance, for which the gold standard is mark recapture analysis, where you have actually

tagged individuals, recapture them over time, and you're essentially modeling these time series of individuals surviving in the wild.

But the key, the key part of those models is that we have individual identification.

Which means we can track them over time.

There are methods that exist where you use the counts directly to estimate an abundance, but those models are quite sensitive to model assumptions that are very hard to meet in

practice.

And one of them is, for instance, that you need to have independent count or counts of independent individuals.

If you're double counting individuals, that throws out the abundance estimates.

And that gets very hard to do with

um animals that are hard to individually identify, like kangaroos or, you know, plug in any any mmal that doesn't have clear patterns like stripes or spots or whatever.

And it gets very hard to do.

But nevertheless, I still don't think that that justifies collapsing counts to just ones and zeros, because how you choose to interpret the information that's in those counts is

up to you.

But you might as well model them before just collapsing it to a one and a zero before you even start your analysis.

Yeah.

And so actually that was one of my following up questions was about that choice that you made in in Okaru, which is to model the counts directly, uh, rather than treating the

detection as a nuisance parameter, which is exactly what you uh which you just talked about.

Yeah, I can see why why that's definitely a meaningful shift conceptually in how that'd be very useful.

And yeah, basically instead of throwing away that information, you you do want to have it as much as possible in the model.

And then if you actually want to throw it away afterwards, well maybe, but but you don't have to and Exactly.

You you can just you can just fit the count model and then be like, I'm not interested in in in um

interpreting this at all.

And that's another thing actually, where in occupancy models, usually the focus is on the occupancy of any given site, which is just to say, does this species occur here at this

site?

Yes or no.

That's all occupancy is.

It's just a one or a zero for if the model thinks that the species was there.

My another argument that I'm putting forth is that when you have these continuous deployments of camera traps,

simply focusing on this binary presence absence of species is almost kind of boring.

Like we have much richer data, right?

Like we potentially have these cameras running for months at a time, uh potentially repeated over many seasons.

And to just basically have your target of inference be if the speci or at which site the species occurs, um, that still comes out of the package.

Because that's just what occupancy is.

But I think it's much more interesting to focus on variation in the detection rates um over time, especially when you get longer deployments.

Like I've had some data sets that are many years, and you get to see these crazy fluctuations in underlying detection rates of all your species.

And it just gives you a lot of insight about what's in your data that's beyond just occupancy.

What I'm saying is this package still gives you occupancy like any traditional package might do, but it just also lets you focus on those detection rates.

So you don't lose anything.

You only gain stuff.

And it all comes from the fact of just basically not throwing away information in your data before you even begin modeling.

Yes.

Completely agree.

Uh and you have also some very cool stuff in there where you have the options for

Hierarchical special spatial and temporal Gaussian processes to help for multi species modeling.

Obviously I I was very um very charmed by that option because I'm a I'm a fan of GPs, so can you can you unpack that for us?

Can you tell us what that's doing and what scientific questions it lets ecologists ask that they could not before?

Yeah, sure.

I'll I'll start with just the justification for having these spatial and survey effects in there to begin with.

And just from first principles, when we essentially have repeated measures at two levels, at the level of the site, because every site is surveyed over potentially many days, many

weeks, many months.

And also at the survey, because usually we have a grid of cameras, so every survey period, let's say every day or every week that these camera traps are active, we're actually

getting measurements from multiple camera traps.

So we have repeated measures at the level of the site and survey.

And there's plenty of literature that talks about what or how you should handle that.

But the the key thing is that you uh essentially have pseudo replication in your data if you don't account for that somehow.

So the way I account for that is a random site effect and a random survey effect.

And then it's just a question of how do we actually model these random effects.

So typically you might have a normal distribution ah where all of these random effects are coming from, where you're estimating some standard deviation.

That's just your run-of-the-mill random effect as implemented in any R package.

With our site and survey effects.

Those sites and surveys aren't necessarily independent because you would think that sites that are closer together are going to be more similar, and you'd also expect surveys that

are close together in time would also be more similar to each other.

And once you get into this spatial and temporal sort of correlation structure, you're basically knocking on the door of a Gaussian process where you're constructing a

covariance matrix over all the sites and over all the surveys.

The way I try and explain this as an elevator pitch is it's just a random effect that looks at the spatial and temporal locations of where we're collecting our data.

Where the multi-species comes in is the package is designed to not just look at w data from one species, but to look at data from all the species that are recorded.

And the motivation here is when you look at your raw data.

Of a camera trap or a song meter, essentially what you have is a timestamp and a species label of what we found.

And that means if we have multiple species, we probably should have a random effect for every species and every site because not every site's going to be good for each species.

um And then you get into the multi-species Gaussian process thing where you still have the same Gaussian process kernel.

But now you have a realization from this for each species.

And there's actually a way to do this to make it quite computationally efficient that you only have to do one Kaleski decomposition, even though you're getting these draws for all

of these species.

And it ends up being matrix normal, which I hadn't come across matrix normal until about two years ago.

But it's basically a

matrix of values where you have a row and a column covariance.

So if you have a row covariance that governs the correlation between your sites, you have a column covariance that governs the correlation between your species.

So what you get is not just these spatially structured random site effects and random survey effects, but you can also directly estimate how species are correlated in these

effects.

So for instance you can

estimate whether two species are likely to co occur in space and time based on this multi species process.

Um now I imagine that's quite a lot, but that's that that's what it gets you that I don't think I've seen implemented in other packages.

And if I had to if I had to say how the package like clearly differed from other things that are available, it would probably be

how the multi species hierarchical nature of the model is handled.

Mm-hmm.

Yeah.

And that definitely makes sense.

I think it's a it's a great setup for a g for a a GP.

So that that makes sense that you have that in there.

And and also then the hierarchical nature of your data also marries really well.

So um that

that's super fun to see that in there.

How how did you end up implementing that?

Like are you using vanilla GPs or using approximations?

How is the how are you handling the um the speed and efficiency side of the GP sampling?

Yeah, there there's a couple of ways.

At the moment the GPs are still exact, and that's simply because

So far, most of the data sets that I've come across haven't been crazy in the number of sites that they have.

So often you'll have like hundred or a few hundred, which I feel still works well for an exact GP.

Um also I have built in a feature where often we'll have different grids of cameras in different locations.

So essentially independent grids.

And

When I have independent grids, it does a separate smaller GP for each region, which just assumes that the regions are independent.

They still share the same hyperparameters, but you don't need to construct a full covariance matrix.

You do it by region.

So if you have a thousand cameras in your data set, that would be an expensive GP to compute.

But if those thousand cameras are split between ten different grids,

then you've got 10 smaller ones, which is actually way easier to manage.

And it is definitely something I want to add is some form of uh approximation.

And someone recently told me about GP tools, which is a a I think it's some sort of package that's implemented in Stan that lets you um base I think I want to say it's four

years transform GPs.

but some sort of approximation for the inevitable case that you do have a huge data set where you actually need a massive covariance matrix.

Mm-hmm.

Yeah.

Yeah, I don't know GP tools per se, but maybe they are using HSGP and Ratherhood most of the time.

Um if you're in a maximum 3D case for a GP, that's gonna be that's gonna be extremely helpful.

and so for

spatial temporal data should work because that's 3D.

Um and that's yeah, same flavor as if we transfer.

And that really is a really a game changer for big data set because it just becomes almost linear in in compute time with with your data data size.

So that's Yeah.

I I do have code laying around because I've implemented them in the past for the HSGP stuff.

So it's it's

Honestly it's probably something that will be in the package in the next month or so.

Mm-hmm.

Yeah.

I mean, that's really uh that's really helpful.

I mean if most of your data set are not too big, that's awesome.

You know, it means you can't use exact GPs.

Uh this is rare, but this is this is great to hear that is that it's possible.

It's not that rare in camera trapping because just most most of the time people are bottlenecked by by resources and uh I think it would be not uncommon to have a data set of

thousands of cameras.

I do think it would be unusual for all of them to be part of the same grid as opposed to uh split into multiple smaller grids where you can actually just do them separately within

each region.

But even there the approximations would uh end up being faster, I'm sure.

So uh worth implementing regardless.

I mean Yeah, I mean it depends on the approximation because you have some approximations that actually work less well than a vanilla GP if the dataset's not big enough.

So, you know, kind of like uh central limit theorem kind of stuff where it's if you're not in the regime where the approximation uh accuracy kicks in, then you actually want the

vanilla GP.

Actually, if you can use the vanilla GP, just do that.

It's like you you have the mathematical guarantee and and that's great.

In my case, that's been super rare.

Like most of the time I cannot use vanilla GPs and I've had to use approximations and thankfully there are some now.

And also the compute we can use is just so much better now.

Um but yeah, like it I mean it's good to have the different options, I guess.

I have wondered about that with a spatial GP where you have two inputs, a well basically a latitude and a longitude.

And yeah, the way I understood the HS GPs, you have a basis function for each dimension, or a number of basis functions for each dimension.

So if you set it to the default of like twenty, it ends up being twenty times twenty, I think, because you need the combinatorics.

So you end up having to estimate a lot of coefficients, which

When you have a a smaller field, you could I could see how that would be uh not ideal.

Yes.

Yes, exactly.

Uh actually the the choice of hyperparameters for HSGPs is extremely important.

And and it's gonna impact a lot the accuracy of the model and of the predictions.

So this is

It's always trade-offs, right?

But basically the approximation is valid for a given universe, if you will, and the hyperparameters partially partially determine that universe.

And so you definitely wanna pay a lot of attention into how these hyperparameters are chosen.

Uh in PyMC we have that function that

B Langles and I developed a few, I think, last year or a year and a half ago, something like that, where you pass a bunch of constraints that you want on your covariance.

And and then PyMC will do some optimization under the hood to find the optimized hyperparameters for you for the model.

It's basically the same thing.

as what you would do by hand, but it's just automated and it's a bit more user friendly.

A bit like the defined constraint prior function that I worked on that I merged like a now a few years ago in PyMC where you could just tell Pyme C uh hey I'm looking for a gamma

function with ninety-five percent of the mass between this and that.

What are the good parameters to do that?

Now it's in prelise and it's so much better and Oswald Martin

And and colleagues made that a awesome package that's so much better than what I could eat at the time.

Um but that's kind of the same idea.

Yep.

That's good.

I should probably look at that because I wanted to add a some functionality to automatically determine the length scales based on the input uh from uh of the GPs because

you can set the priors well that you can kind of set the priors.

I in the model I've

basically decided on the prior distribution for each parameter, but you can set the hyperparameters yourself.

and so it has some sensible defaults, but it would be nice to input, for instance, the distance matrix for your sites and then determine a suitable length scale based on the

input.

Yeah.

Especially since

I mean in your case it's a specialized package, so that's that's actually very good because most of the people are gonna use the package for the same kind of models.

The difficulty we have in Prime C is that it's a generic package, so tons of people use it for very different use cases, and so we cannot make a lot of very informed default like

that.

So the the function we have to find the hyperparameters of the the HSGP, you need to provide a length scale, you need to provide a covariance function.

The length scale and uh like a couple other parameters that I'm forgetting.

but you could go one level of abstraction because most of your users are gonna use it for occupancy ARU models.

And so you can have very very good informed defaults on that already on the length scales.

Uh so yeah, exactly.

You can just go on the Pime C source code.

Find that function, I think it's called find HSZP hyperparameters or something like that.

and yeah, like start from here and and add a level of n of abstraction on that.

That'd be that'd be very good from a user friendly perspective.

Yeah.

Um yeah, but that that's super exciting.

Uh I I think it's awesome that you're making access to GP like that much easier.

Because as I always say on the show, GPs are amazing.

Um they are in itself, you know, even though they are a beautiful mathematical object, uh they are in itself an approximation.

Like no no real process in nature is really a GP.

Um you're using a GP because you don't know the actual mathematical process underlying the the phenomenon, otherwise you would just use the mathematical formula mathematical formula

for it.

but they are super powerful because they are so flexible.

And and I mean there are there are neural networks, right?

And we all know how good neural networks are.

And so GPs are just uh infinitely uh G um neural networks with an infinite number of hidden layers and and that's what neural networks converge to and and they are just

amazing because of that.

But they are a bit complicated to understand from a theoretical standpoint.

And there is a high barrier to entry.

And and then also the barrier to entry tends to be high for practitioners.

Because you're like it's you don't have a lot of packages that make access to it very easy and also make them available in packages that people are already using for other kind of

models.

So often they are specialized packages just for GPs.

Maybe for one kind of GP, you know, the which is like if you don't know the research, you're like, whatever needs that GP?

You know.

Um so yeah, I think making GPs more accessible, especially when it makes a ton of sense here, that's that that can really be a game changer for practitioners.

Yeah, and uh initially I didn't even have the option that you could turn them off.

Because the package started off a bit more opinionated, basically, about the fact that we need to accommodate for these repeated measures through random effects.

And as complicated as GPs are, the way I explain them to people just getting used to them is it's essentially a a random effect where the realizations aren't independent

necessarily, but they're dependent on their neighbors.

So you're just adding that proximity component of

surveys that are closer together in time and sites that are closer together in space, uh maybe more similar to each other than sites that are farther away, which I think is quite

easy to wrap your head around that that is useful and relevant.

And yeah, I think at at this point you can turn them off, but by default they're turned on.

just just out of principle basically that to a to account for these repeated measures that we w we should have them in.

Yeah.

Bold choice to have that by default.

I love it.

and so something also another thing you have that's that's also very very useful are global local shrinkage priors for variance decomposition of the occupancy and detection uh

linear predictors.

So can you tell us what this is about?

What is variance decomposition in this context?

And why is shrinkage an interesting tool here?

So the the intuition around it is that we have a linear predictor with a certain amount of variance, and different model components are in a way competing for that variance or at

least contribute to that variance.

And you might have a bunch of predictors.

You could have two predictors, you could have 200 predictors.

and then you have random effects and then you have species level deviations because every species has its own response to the predictor and species level random effects.

So we potentially have a large number of variance components.

And it's actually traditionally you would just put independent priors on all of the coefficients on

the species well the species coefficients might be modeled hierarchically with a shared uh scale.

Um, but the coefficients would probably be independent.

the variance the variance components of the random effects would be independent.

And they've pretty well shown that the prior predictive distribution of these models, as you add more and more predictors and more and more random effects, the prior predictive

distribution gets a bit silly.

Where you have a huge amount of variance that you are expecting a priori to be captured by your model components.

And what the global locate shrinkage priors does is it kind of turns it on its head by saying, I'm going to put one prior on the overall variance of my linear predictor.

And the prior distribution for that variance, I've chosen a student T, just because it has a nice thick tail that if you have.

A lot of variance in your linear predictor that can easily be accommodated.

The key intuition is that you now partition this global variance to your different model components via a simplex.

So at this point, you have a global um you have a global variance and a simplex that you're estimating from your data.

And it basically neatly partitions your available, let's call it, variance budget.

Across an arbitrary number of model components.

So, like I said, you could have literally hundreds of predictors in there, and it should be fine for the model because your total variance doesn't really change.

It just gets partitioned between these different model components.

And the simplex has an underlying parameter, or it could be multiple parameters, but in this package it just has one.

And that parameter is determines the sparsity of the simplex.

And the sparsity of the simplex basically determines how evenly our simplex is distributed.

So when you have a sparse simplex, there's a few components, a few model components that are explaining most of the variance.

So let's say you have 20 predictors in there, none of them are important except for one.

You expect the simplex to be very concentrated around that single model component.

uh which means the simplex is sparse because most of the simplex values are zero except that one which is very high.

Uh and if you have an even simplex, you expect a more even distribution.

So this predictor has like 10% and that predictor has 7% and this one maybe 15% of the simplex.

That's actually a hyperparameter that's included in the model as well, uh which basically means that the there's a prior over the sparsity of the simplex.

Which means that a huge amount of model configurations essentially are possible using the exact same hyper priors.

So theoretically, uh this model should basically run whether you have one species or a hundred, uh zero predictors or fifty, uh random effects everywhere or none at all.

That should all just be handled by having these two priors on

the overall variance and the simplex decomposition.

Okay.

Yeah, I love it.

This is this is super interesting.

Uh in so to do that, what are you what are you using under the hood from the for I'm I'm guessing it's from Stan.

Yeah, so actually I'll just backtrack and give credit where it's due.

Um the this basically comes from the R2 D two literature where they

decompose R squared because you can the you can get the explained variance as a deterministic transformation of your R squared.

And so what they do is they set a prior on R squared, deterministically transform that to your explained variance, and then simplex decompose that.

I skip the R squared because the R squared gets a bit hard to interpret when you have count variables and also logistic

So the occupancy side remains a Bernoulli, so a one or a zero, and the detection rate is a count, so it's Poisson or negative binomial.

R squared doesn't work as nicely intuitively as it does for nice Gaussian models.

so I skip the R2 stuff, I go straight to the variance prior, and under the hood, I just produce a an R a simplex of arbitrary length.

And I have to give thanks to the stand developers because they have recently put in functions that lets you simplex transform just a free vector with one less element than

your desired simplex, which means that under the hood in this model, it looks at all the variance components.

So how many species, um, how many predictors, how many random effects.

Based on that, you know how long your simplex has to be.

the simplex is then produced under the hood.

And yeah, what I use in Stan is a a variance parameter with a student T prior and a simplex with either a Dirichlet or a logistic normal prior for the simplex.

Mm-hmm.

Yeah.

So you're removing one degree of freedom because well, you have to.

Yeah.

Sounds exactly sounds a lot like the the zero sumnormal uh distribution that that we worked on on.

on the Pinty sample with Adrian C Bolt.

It is, and there's a lot of overlap between zero sum normal and the simplex.

And um both of those constraints uh are getting a lot of use in the package um because the package accommodates for instance categorical predictors and instead of setting a

reference category I just model those as zero sum normal based on how many categories there are.

But every categorical predictor can have an arbitrary number of categories.

So you might have habitat type like forest or swamp or desert as a categorical predictors with three levels, but you could also have one with six or seven levels.

Um so based on that, it just produces um a zero sum vector of the appropriate length for every categorical predictor.

Um so actually a lot of the stuff that's in this package now has been

has implemented really recent additions to the stand language.

So the timing's been really good.

Yeah, I'm guessing that that made your life much easier.

Because before that was that was much harder to do.

Yeah, and it and it was just really messy.

There's the there's still there's still that you have to have a bunch of unconstrained parameters in your parameters block and then use the transform parameters to actually

constrain them.

one downside is still that your final fitted object is gonna be bigger because it has a bunch of these unconstrained parameters that you need to produce the transform parameters.

But um I mean that's a small price to pay because it just makes it just makes modeling these types of things so flexible.

Yeah.

Yeah, yeah.

No, exactly.

That that that's really that's really being a game changer to have the zero summon constraint in Pine C and in stance or

Definitely encourage people to to use that more.

It's great that you're using that under the hood because again, these are complex mathematical objects, in the sense that they're not very intuitive at the beginning.

And so being able to just have that under the hood for the users, they don't even have to care that it's actually using that.

It's just like just use that because it's it's better.

Like it's gonna make your model better.

Yeah, that's the only thing you need to know.

And actually you don't even need to know about it.

It's just like

That's that's that's what you need to use.

Just just do what the doctor said.

And and that's gonna make their model better.

Uh so we're going fast here, and for people who want to have uh background on that, there are at least three episodes they can listen to.

Uh seventy four with Adrian Zabold, who is the the father of the Zero Subnomal.

Episode 133 with Sean Pinkney, who is one of the persons who developed one of the people who developed Zero Subnormal Constraint on the stand side.

So if you guys want the the background on that and all the cool thing you can do with that, um definitely listen to that one.

Adrian also make a a guest star appearance in that one.

And finally, episode 134 with David Cohns, because in this one

David goes through and actually demos live the ARR squared prior, which is the R two D two prior, but for time series, for especially auto regressive time series, and it goes into a

lot of the concepts that you just uh you just uh mentioned, Matt.

So so if people are interested in that, definitely check that one out too.

Just um a another nice thing with the zero sum stuff.

Is that in this model, for instance, I have species level random intercepts that sort of capture the overall detection rate of a given species in your data set.

But then if you have, let's say, a categorical predictor and you model that hierarchically, you have an overall habitat effect to say this habitat has this influence

on just your species and a species level offset realization of that.

So there you get into actually the zero sum matrix where it has to sum to zero over the categories, but also across your species, so that your overall effect keeps that nice,

well, that zero sum interpretation.

And it means that the intercepts in these models are actually the the averages over the species over all of your categorical predictors.

Uh whereas if you don't zero sum anything.

you know, you get those wider uncertainty you actually have to go and derive those means afterwards in generated quantities.

And gets really hard to keep track of.

So yeah, the zero sums have just been amazing because of well, identifiability of your parameters in the strict sense.

but also sorry, I just actually blanked on what I was saying.

It's that's normal.

These are complicated objects, as I was saying.

Um but

Yeah, basically you are you're talking about the fact that i it can get intricate because these are like Russian dolls where you have basically you might need zero subnomals all

the way down because if you're dealing with categories and on top of that simplexes, then you have a a bunch of sum to zero constraints that you need to avoid over parameterization

in your model.

And so basically what you're doing is getting the headache out for the users.

Which I'm sure everybody will appreciate.

Um and actually another another concept I wanted to touch on is within chain parallelisation, because that's something um I think a lot of listeners have heard of on

the show, but never necessarily actually used because it's quite new.

So in your experience, when does it actually help and when is it not worth the complexity?

Yeah, that's a that's a a funny one.

It's it's not really intuitive, uh, I find.

So just as a background, with within chain parallelization, uh, let's say you still want to run four chains.

Normally you just run one chain per core.

But if you have more than four cores available on your computer, you can actually split the work of the individual chains across different cores.

So for instance, if you had forty cores and wanted to run

four chains.

You could run the work of each of the chains across ten different cores.

Um there's a lot of overhead associated with within chain parallelization, as I understand it.

A lot of copies to go across the different cores.

And I've actually found that it often does not speed up the model.

I think normally you cut the likelihood

computations over your sampling units.

And ultimately in these types of models or in occupancy models, the sampling unit is the site.

So you compute all these likelihoods for all of the species at each different site.

And what I have found is the more uh expensive the per site likelihood computation gets, the more benefit you get from it.

I have not really figured it out.

I end up testing it out on the different uh models that I have.

Within chain parallelization is optimal.

You can just turn it you can just turn it off because a lot of the time it won't speed up.

Um Yeah, I'm also not sure how m what to exactly parallelize because you could just parallelize the likelihood function and do everything else inside, for instance, the model

block or transform parameters.

But I've actually put

all of this stuff associated with the sites inside of the parallelization function, which I think means that there's ends up being a lot of copying.

So a lot of data structures have to get copied.

There's like 30 arguments to the uh parallelization function.

So yeah, maybe at some point I will try and figure out if there's a a better way to do it.

But ultimately it does come down to trial and error really and how many cores you have.

I'm not quite sure.

I think it is a from from what I've read on the stand page, it's actually it's there is no um cut and dry sort of outcome with parallelization.

Mm-hmm.

Yeah, it's also my experience, but this is not really my area of expertise.

I'm sure Adrian, for instance, if he were on the show, uh and he will be soon again on the show by the way.

Um maybe he'll have he'll have uh things to add.

But another thing, another concept you're using also in your package is Monte Carlo integration to deal with high Pertok values when you have observation level random

effects.

So listeners may have seen that in their models when they are using RVs, they may have a warning that there is a high Pertok value in there that makes the

The comparison with Livono cross validation not trustworthy.

Uh maybe they don't even know what that means.

And so this is related to that.

So can you can you walk us through what that problem looks like in practice?

And how the Monte Carlo integration fix works?

Yeah.

I'm I'm actually with your hypothetical users because I do not have a good understanding of what Leave One Out Cross Validation is actually doing and don't have an intuition

around the hyper eat okay's.

Uh I do understand when they come up.

And in an occupancy model, like I said before, the lowest level where the likelihood factorizes, which is the level where you want to compute the log likelihood for leave one

out cross validation, is the site.

And as I mentioned before, uh, we have these site level random effects because well, we have the repeated measures at this unit as well.

So I feel like they should be there, but it means that when you have a uh random effect or a parameter associated with the level of the observation, you end up getting these high

Parito K values.

And it's something that I've been uh encountering or struggling with.

For a lot of ecological models, because well, we tend to have random effects at the level of the site when you want to capture uh variation between the sites.

And I got the tip from Aki Uh Vetari or Vetari, I don't know how to pronounce his last name, uh, on the Stan discourse, and he said to use Monte Carlo integration, which is

essentially in-generated quantities.

For each post or for each MCMC iteration, within each draw, you would do a bunch of draws, or you would compute the likelihood many times for what is essentially a posterior

predictive random effect.

So let's say you have 50 Monte Carlo draws.

Within each MCMC draw, you do you compute the likelihood 50 times with a different value for that random effect.

And then you average it out within each draw.

Um, and this is still fast because it's happening in generated quantities.

So uh it's well, stuff that happens in that block is fast.

So it doesn't add that much to the runtime.

Uh, but I found it very much reduces the paridocays.

So uh I've gone usually I'll get somewhere of I don't know, sixty or seventy percent of my observations have high Parido K values without the Monte Carlo integration.

And it might go down to ten percent.

And those probably genuinely reflect uh poorly fitting observations or poorly fitting sites.

so it's just one I don't know if it's the most principled way.

I think it's kind of a brute force way to go about it, but I'll take anything to yeah, reduce those high paridocays uh and make my model comparison a bit more trustworthy.

Because yeah, the the key thing is when you have observation level parameters and you get those high peridocays, um it's not necessarily that your model is bad.

It's just that uh leave without cross validation isn't working well.

Yeah, yeah, exactly.

Uh and in in and in I mean a lot of time, especially with count models, you will see these warnings and that's because you have some highly influential observations.

And so that means that if you re if you remove that observation, then the landscape completely changes because that observation is is very highly influential and it

determines a lot of what the model is saying about the phenomenon.

But if you remove it and if you have a very low sample size, which can happen with count data because it's it's discrete, if you just have, you know, like one count of that data,

i as I understood, then that's a highly

influential observation and if you remove it, that means leave one out cross validation is gonna change a lot depending on which observation you remove and that can change that can

change the results a lot and that's not what you want because that means the the number you see is not necessarily very reliable and reproducible.

That's also yeah.

Have you found have you found this in your experience to be uh common with count models more so than c continuous data?

Yes, but that's ac anecdotal evidence.

Like I'm sure Aki would be would be a much better interlocutor than myself to tell you the the actual scientific truths about that.

But my experience since I work also a lot with count models is that often part okay is is problematic for for count models.

I think I remember also for a negative binomial?

Because I feel like the Poisson is just like completely restrictive and unrealistic.

Especially for uh for ecological data.

Uh have you found that high free D's tend to persist with negative binomial as well?

Honestly that I don't know.

Like I don't remember.

I would need I didn't pay enough attention, I think.

what I know is that yeah, all I almost know and I almost never can use a pure Poisson distribution.

Like it's always almost always too restrictive because of the mean equals variance constraint.

It's just like the the likelihood's just too constrained and not variable enough and and then I have to use a gamma Poisson or I actually uh I encountered it recently in a wild

data set where the Poisson was preferred to a negative binomial and it was an incredi it was an incredible feeling.

Yeah Yeah, that never happens.

Yeah, yeah.

No.

Yeah.

I don't expect it will happen again.

No, yeah, yeah.

Almost always I mean I always try the Poisson because like it's the scientific process and you know, I have to justify why I'm using the the gamma poisson.

But like almost always I I can't even remember a case, but that doesn't mean never happened, where I had to I could use a Poisson.

Like it it doesn't pop into my mind.

That's where I quite like the I don't know if you've come across the PC the penalized complexity priors where

If you consider the Poisson to just be a special version of the negative binomial, you can set a prior that essentially penalizes how far it differs from a Poisson model.

So you can kind of have your cake and eat it too.

You can if there's no evidence in your data to suggest that over dispersion is present, you can just fit a negative binomial model that basically looks like Poisson.

Uh but if there is evidence, then there's plenty of space in the prior for that to occur.

and that's actually the default prior that's in the negative binomial over dispersion parameter.

Um because the the R package the the R package allows you to uh have over dispersed data.

So you instead of a plus on you fit a negative binomial.

And I think it was Aki that um did some work to look at which values for

a an inverse gamma on the over dispersion parameters most closely mimicked the PC prior uh on it and that's the prior that's now default in BRMS for negative binomial uh models.

So it's an inverse gamma uh 0.4 0.3.

So that's what's uh that's what implemented by default.

Wow let's recall that it's the default.

yeah so penalized complexity priors

I love them.

I use them all the time for GPs, actually.

especially for amplitudes and length scales.

Uh a bit more for amplitudes because most of the time when I'm using GPs, I'm using HSGP and the length scale is actually fine with HSGP because if you're using SGP, most of the

time you have a lot of data.

And so you have a lot of data which can inform the length scale.

And so you're fine.

Um the amplitude is a bit more complicated.

And so I always use a penalized complexity prior on the amplitude, which is awesome because it's very intuitive to define.

and the on the length scale, I've tried it, but it's actually harmful.

It's much better to have like a more diffuse prior and then have the data move the the prior on the length scale if you have a lot of it.

And so yeah, like and and then the other PC Priors we need to add them in PyMC.

I know Bill Engels has been doing a lot of work on that and he's actually being the one opening my eyes to the awesomeness of PC Priors.

That's thanks to him that I use them in GPs.

And we have a project to add them into Pine C but we never we never get to it.

If anybody wants to work on that with us and actually get that across the finish line, please contact me.

Um but yeah, we need we need to attack Quency because it's extremely helpful.

And and I know Bill has been working on that.

He's been working on some GP stuff also.

So he'll come on the show very soon.

I've been working on on having him on the show, so that will happen very soon.

Um But in the meantime

and and why we're working in in adding PC priors to Prime C out of the box.

You can already do that.

But uh you can already listen to episode one thirty-six uh with uh Yannet van Niekirk and Havard Rue.

They work on INLA, so Nested Laplace approximation.

And uh in the episode we talk exactly about uh PC priors.

And so if you just wanna

hear about that uh I think it's in the in the chapters you can just jump into that.

But it's a it's a fun episode to listen to.

And yeah, they are they already have a bunch of PC priors coded into the in lab package that's both for R and for Python.

So yeah for for people interested in that feel free to to look at that.

I'm sure you'll you'll find some very interested interesting nuggets of uh codes and concepts in there.

I think the PC priors are actually a bit orthogonal to the variance decomposition priors because the scale or the amplitude of the GPs is just one of the variance partitions

that's in s that lives on this simplex.

So I don't even know how I don't think it would be possible to even put a PC prior on there because that variance partition is just absorbed

into absorb isn't the right word there is just one of the components that lives on that simplex that decomposes the total variance.

Yeah.

The that I mean I've never had to work on that case, but I agree with your with your prior that uh yeah it's probably mutually exclusive and I think so yeah it's a trade off yeah

it's a trade off you have to take depending on what you're mostly interested in in in that case.

I do enjoy or I enjoyed seeing that when I uh wrote the function that lets you set the hyperparameters for your priors, that you have a potentially very complicated model with

predictors on occupancy, predictors on detection, random effects, et cetera.

But the actual list of parameters that you need to set priors on is quite short.

Whereas if you didn't have these decomposition priors, you'd potentially have to set priors for every coefficient, for instance.

or the variance of every random coefficient slope type thing.

and if there's one thing that this package has sort of writing this package has sort of taught me, it's just the the choices you have to make when you write a package that's sort

of general purpose and lets you fit arbitrary number of species, arbitrary number of seasons and predictors and stuff like that, like the the sort of

Classical approaches of just having a well a regression model with maybe a couple of coefficients is really such a narrow version of what's possible with probabilistic

programming.

Um and a more complex model really forces you to think about how to set principal priors that are essentially generalizable to a huge number of model configurations.

Yeah.

Yeah, exactly.

Definitely.

It's it's incredible to see how models are actually incredibly customized, even though you think you're working on a on a classic uh classic simple model, right?

and so I'm gonna start to have playing us out here, Matt, because I don't wanna take uh you know your whole lunch time.

But that's good.

We've we've been yeah, I'm very happy because we've gone through a lot of the technical topics I wanted to touch on with you, but

A question I always love to ask people who work on packages is when does Ocaru break down or underperform?

You know, what kinds of datasets or questions is it not the right tool for?

And what would you tell someone before they reach for it?

I think there's a I wouldn't say there's a huge learning curve.

It's a complicated model that I've tried to make as accessible as possible with just a couple of functions.

there is a big part is the chosen survey interval, which is basically how do you bin your data?

Do you group it by day or week or month type thing?

So if you have years and years of data and you your observation level is how many count how many individuals or detections you got of each species per day, that model will be

incredibly slow to fit.

Whereas if you have longer, like multi year data, you might group by a week or two weeks.

So that's a bit of a um where big data can still be accommodated by choosing that survey interval.

But overall, if you have lots of data, lots of sites, this will probably end up being quite slow.

And I I work on a MacBook and I've got a Apple Silicon and it's not super slow or anything, but I've seen people try and use it on a

not as fast a computer and it's already quite slow.

I mean, these people still have a lot of these data sets will have tens of thousands of observations even after a thinning interval.

Um so that's ultimately still always going to be a lot of data, especially when you have these random effects.

So yeah, I think the thing is I don't know what would otherwise suffice and be a lot faster.

because even if you do collapse it down to ones and zeros, like you still end up having

a lot of sites and a lot of surveys, you just have a different observation model and maybe not as many random effects.

Um so yeah, to I'd say the biggest bottleneck will be really large data sets uh and also potentially not um reducing the resolution enough of the data.

So if you if your model runs really slowly, you might even group it by month.

And just have how many counts per month did we get?

And you know, that it l it actually doesn't make any difference at all for getting occupancy.

So uh if you're only interested in occupancy, you can just group you can have huge grouping periods and it still gets you occupancy.

So um yeah.

Probably speed, I would have to say, given that it's um stand and potentially it can be quite complicated with with many sites and slow to fit, I would say.

Mm-hmm.

Yeah.

Yeah, that makes sense.

yeah, grouping grouping always your friend.

I had the same could have the same issue when uh when I was working on electoral forecasting models.

So I definitely understand what you're saying.

Yeah.

And um actually for for an ecologist or an applied researcher who've been listening to us right now and who wants to start using Ocaroo on their own, where do you point them?

And and w also what's next for you for the the roadmap on the for the package?

So the the you can download the package on uh GitHub.

and the package I don't think will be on CRAN because it uses command stan R in the background, which has the latest version of Stan.

So for instance, all this zero sum and simplex constraining uh wouldn't be available on Rstan, but command stand r isn't on

Crayon and there's no plans to put it on Crayon.

Um roadmap for the package is I'm nearly finished with a the multi-season version.

So that's where you start to get into like dynamic how occupancy can change from one season to the next as a Markovian process.

So that's where you get it really in the hidden Markov side of things.

Um once I have that done, I

have a lot of work to do myself using this package because uh for my my postdoc, I mean my main motivation for developing it was that I didn't feel that the current methods were

able to fully exploit the information that's actually present in these data.

And when people want to know like, hey, how good is our camera trap data, I can't really answer that if I have methods that only touch on a fragment of the information that's

actually contained in them.

So I hope to be ironing out a lot of kinks as they come up.

Like it's unbelievably easy to introduce bugs into code.

I've come to discover making this R package.

Um yeah, I'd like to get smooth sailing and hopefully get people using it and tell them what they need, what they're looking for, and obviously bugs that they're finding.

Uh and I'd just like to keep improving it to expand on it and making it more um more customizable as well.

Yeah.

Sounds uh sounds like a great roadmap.

very very exciting and uh and hope you'll you'll manage to to do that in the in the coming month.

You seem you seem to be uh on a good cruising speed, so so really rooting for you.

and well let's call it a show, Matt.

I think it's it's a good stopping time.

We've been we've been covering a ton of um of topics.

I'm very happy about that.

Before letting you go, of course, uh you listen to the show, so you know I have to ask you the last two questions.

Ask every guest at the end of the show.

If you had unlimited time and resources, which problem would you try to solve?

Um, yeah, so firstly, thanks for having me.

And uh yeah, it was uh it was great.

The problem I'd like to solve is uh habitat loss and just the destruction of the natural world around us is

100% the thing that I would like to that I would like to solve the most.

Uh I don't know if there is a clear solution.

Uh I'm fairly cynical or pessimistic about it.

But yeah, if I if I had infinite time and resources, that would be uh what I would like to work on.

Sounds perfect.

Uh sounds yeah, like a like a very, very important problem indeed.

And second question.

If you could have dinner with any great scientific mind, dead, alive, or fictional, who would it be?

Yeah, I had to think about this because I knew I knew the question was coming.

And I think it would have to be Jeffrey West, who's a physicist that wrote a book called Scale, uh, which is probably my favorite book of all time and changed the way I look at

the world.

So I think uh having dinner with him would be awesome and I actually think it's worth uh worth people listening or reading his book.

It's

It's a fascinating fascinating story.

Okay.

Yeah.

Um I I didn't know about him, but uh that sounds that sounds interesting.

So thanks for uh educating me, Matt.

And uh well on that note let's call it a show.

Thank you so much again for coming on the show, Matt.

Everybody there will be a lot of uh of show notes for this one, so make sure

To check them out if you wanna dig deeper.

And Matt, thank you again for taking the time and being on this show.

Thank you.

This has been another episode of Learning Bayesian Statistics.

Be sure to rate, review, and follow the show on your favorite podcatcher and visit learnbasetats.com for more resources about today's topics, as well as access to more

episodes to help you reach true Bayesian state of mind.

That's LearnBaseTats.com.

Our theme music is GoodBayan by Beba Brinkman.

Fit MC Las and Mega Rand.

Check out his awesome work at bebabrinkman.com.

I'm your host.

Alexandora.

You can follow me on Twitter at Alex underscore Andora like the country.

You can support the show and unlock exclusive benefits by visiting patreon.com slash learnbase tax.

Thank you so much for listening and for your support.

You're truly a good baby and change your predictions after taking information.

And if you're thinking I'll be less than amazing, let's adjust those expectations.

Let me show you how to be a good daisy.

Change calculations after taking fresh data.

Those predictions that your brain is making.

Let's get them on a solid foundation.

Key Takeaways

An occupancy model accounts for the fact that you don't always detect a species when surveying for it, especially when the species is rare. A naive count of where you found it underestimates true occupancy. The model adds a repeated-measures component: you visit each site multiple times, and from the pattern of detections vs. non-detections it estimates a detection probability. Matthijs framed it as a zero-inflation structure where the zero-inflation happens at the site level rather than the observation level -- which keeps the model conceptually simple, just a standard GLM with a Bernoulli “is the species here at all?” stacked on top of a detection-rate process.

ARUs are camera traps and acoustic monitors that record continuously over deployment periods of days, weeks, or months. The data they produce isn't a sequence of discrete human-led surveys; it's a continuous-time observation stream. Traditional occupancy models were designed for the discrete case -- a human visits a site, records yes or no, goes home. With ARUs, the question becomes how to bin or threshold the continuous data without losing the richer signal it actually contains.

Most occupancy software collapses each survey to a 1 or a 0 before modeling anything. Matthijs's argument is that this throws away information by construction -- if you saw 17 kangaroos in one window and 3 in the next, that variation matters and shouldn't be discarded a priori. Modeling counts directly gives you everything an occupancy model would, plus rich detection-rate dynamics over time and across species. You can still ignore the count side at interpretation time -- but the data deserves to be modeled before it's thresholded.

With a thinning window. If 20 kangaroos stand in front of a camera for two hours, the raw counts inflate detections in a way that isn't really new information -- it's the same animals, repeatedly. occARU lets you set a thinning interval (30 minutes is a typical choice) and retain the maximum count observed within each window. The two-hour kangaroo episode collapses to four detections, each carrying the genuine peak group size in that interval. This preserves the count signal while avoiding pseudoreplication from a single behavioral event.

Repeated measures at sites and surveys imply you should have random effects at both levels. Sites that are spatially close, and surveys close in time, should resemble each other -- that's a Gaussian process. occARU stacks species on top: the GP is structured as a matrix normal with a row covariance over sites and a column covariance over species. From the column covariance you can directly read off which species tend to co-occur in space and time. And computationally, you only need one Cholesky decomposition across all species rather than one per species.

In Matthijs's experience, most camera-trap datasets have hundreds (not thousands) of sites, so exact GPs work fine. When data comes in independent grids (multiple regions, each with its own cameras), occARU fits a smaller GP per region sharing hyperparameters -- so a thousand cameras across ten grids is ten manageable GPs rather than one expensive thousand-by-thousand covariance matrix. For genuinely large single-grid datasets, Hilbert-space GP approximations are the natural next step, and they're on the occARU roadmap.

A complex ecological model has many variance components -- coefficients on predictors, random effects, species-level deviations. If you put independent priors on each, the prior predictive distribution blows up as you add components. The shrinkage prior flips that: you put one Student-t prior on the total linear-predictor variance, then partition that variance across components with a simplex. The sparsity of the simplex is itself a parameter, so the model learns whether one component dominates or variance spreads evenly. The same hyperpriors then handle one species or a hundred, zero predictors or fifty.

Instead of picking a reference category (which breaks symmetry and complicates interpretation), occARU models categorical effects as zero-sum-normal vectors. Each category gets its own coefficient, but they're constrained to sum to zero -- so the intercept retains its clean average-across-categories interpretation, and credible intervals stay tight. Combined with the simplex decomposition, you can stack species-level offsets on top, getting zero-sum matrices that sum to zero across both categories and species. Recent Stan additions (simplex_constrain with n-1 free parameters) make this tractable to write.

Less often than people assume. There's significant overhead in copying data structures across cores, and Matthijs has frequently found that turning it on doesn't speed up the model -- and sometimes slows it down. The main signal that it'll help is when the per-site likelihood is genuinely expensive; for cheap per-site computations, the overhead dominates. The Stan documentation itself doesn't offer cut-and-dry guidance -- trial and error on your specific model is the honest answer.

When you have random effects at the observation level, LOO cross-validation tends to produce high Pareto-k values everywhere, which makes model comparison untrustworthy. Aki Vehtari's tip via the Stan Discourse is to do Monte Carlo integration in generated quantities: for each MCMC draw, sample the random effect many times, compute the likelihood at each sample, and average. Since this happens in generated quantities it's cheap. In Matthijs's experience, this drops the percentage of observations with high Pareto-k from around 60-70% to around 10%, and the remaining 10% usually reflect genuinely poorly-fitting observations rather than LOO failures.

When your dataset is large and your survey interval is fine-grained. The bottleneck is Stan's fitting speed -- years of daily count data across many sites will fit slowly. The workaround is to bin coarser (weekly or monthly), which doesn't hurt occupancy estimation at all and only loses some detection-rate resolution. If you're only interested in occupancy, big grouping windows are fine.

Related Episodes

#127 Saving Sharks... with Python, Causal Inference and Aaron MacNeil

Listen →

#14 Hidden Markov Models & Statistical Ecology, with Vianey Leos-Barajas

Listen →

#134 Bayesian Econometrics, State Space Models & Dynamic Regression, with David Kohns

Listen →

#133 Making Models More Efficient & Flexible, with Sean Pinkney & Adrian Seyboldt

Listen →

#74 Optimizing NUTS and Developing the ZeroSumNormal Distribution, with Adrian Seyboldt

Listen →

#136 Bayesian Inference at Scale: Unveiling INLA, with Haavard Rue & Janet van Niekerk

Listen →

#20 Regression and Other Stories, with Andrew Gelman, Jennifer Hill & Aki Vehtari

Listen →

#29 Model Assessment, Non-Parametric Models, And Much More, with Aki Vehtari

Listen →

Support & Resources

→ Support the show on Patreon
→ Intro to Bayes Course (first 2 lessons free)
→ Advanced Regression Course (first 2 lessons free)
Theme music: “Good Bayesian” by Baba Brinkman (feat MC Lars and Mega Ran). bababrinkman.com