Teach your AI coding agent to think like a Bayesian. Discover the bayesian-workflow Agent Skill that prevents subtle statistical errors in PyMC models.
And it only took mass-producing the same mistakes 63 times to get there.
If you've ever asked a coding agent to "build me a Bayesian model," you know the feeling. It writes something that looks right — pm.Normal here, pm.sample() there — and you think, great, done in 30 seconds! Then you look closer and realize it's happily interpreting posteriors from a model with 47 divergences. Or it's describing your Bayesian results using p-values and "statistical significance." Or two models are practically indistinguishable and the agent just shrugs and says "consider your goals" instead of recommending the simpler one.
The code runs. The plots render. But the analysis is subtly, quietly wrong — and that's the dangerous part. Not syntax errors. Not crashes. The kind of mistakes that make it into reports, into decisions, into production.
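That divergence example is the easiest one to guard against mechanically: PyMC records divergent transitions in the returned InferenceData under `sample_stats["diverging"]`. Here's a small helper that turns the silent red flag into a hard stop (the function name and threshold are mine, a sketch rather than anything from the skill itself):

```python
import numpy as np

def check_divergences(idata, max_allowed=0):
    """Count divergent transitions in a PyMC InferenceData and refuse
    to proceed if there are too many to trust the posterior."""
    n_div = int(np.asarray(idata.sample_stats["diverging"]).sum())
    if n_div > max_allowed:
        raise RuntimeError(
            f"{n_div} divergences: fix the model before interpreting posteriors"
        )
    return n_div
```

Calling something like this right after `pm.sample()` is exactly the kind of step an agent "knows about" but skips unless a workflow forces it.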
I got tired of catching these things after the fact. So I built an Agent Skill that catches them at the source.
If you haven't heard of them yet, Agent Skills are a simple open format for giving coding agents — Claude Code, Cursor, Gemini CLI, Kimi Code, and many others — new capabilities. Think of them as instruction manuals that your agent reads before tackling a task. Instead of relying on whatever the model remembers from training data (which, let's be honest, is a mix of great tutorials and terrible Stack Overflow answers), the skill enforces a specific workflow with guardrails.
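Concretely, per the Agent Skills spec, a skill is just a folder containing a `SKILL.md` file: YAML frontmatter that tells the agent when to load the skill, followed by the instructions themselves. A minimal sketch of the shape (the frontmatter fields follow the spec; the body text here is my invention, not the actual bayesian-workflow file):

```markdown
---
name: bayesian-workflow
description: Enforce a principled Bayesian workflow (priors, diagnostics,
  model checking) whenever the user asks for a PyMC model.
---

# Bayesian workflow

Before interpreting any model:
1. Run a prior predictive check and show the result.
2. Never report posteriors while divergences are present.
...
```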
In our case: the Bayesian workflow. The one you should follow every time you fit a model, but whose steps you probably skip half of because, well, life is short and your deadline was yesterday.
Here's the thing about modern LLMs: they know a lot about Bayesian statistics. They can write PyMC code, explain what a posterior is, and even tell you about non-centered parameterizations if you ask nicely. But knowing and doing are different things.
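Take that non-centered parameterization as an example of the knowing/doing gap. Instead of sampling a group-level parameter directly from `Normal(mu, tau)`, you sample a standard normal and rescale it: the distribution is identical, but the geometry is far friendlier to HMC in hierarchical models. A NumPy sketch of the equivalence (in PyMC this becomes `theta = mu + tau * theta_raw` with `theta_raw = pm.Normal("theta_raw", 0, 1)`):

```python
import numpy as np

rng = np.random.default_rng(42)
n, mu, tau = 500_000, 1.0, 2.0

# Centered: sample theta directly from N(mu, tau)
theta_centered = rng.normal(mu, tau, size=n)

# Non-centered: sample z ~ N(0, 1), then shift and scale.
# Same distribution, but the sampler now explores an
# uncorrelated standard normal instead of a funnel.
z = rng.normal(0.0, 1.0, size=n)
theta_noncentered = mu + tau * z

# Both match N(1, 2) up to Monte Carlo error
print(theta_centered.mean(), theta_noncentered.mean())
```

An LLM can explain this transform on request; the point of the skill is that it applies it unprompted when the model geometry calls for it.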
When I ran the same modeling tasks with and without the skill — 6 diverse scenarios, from logistic regression to hierarchical models to model comparison — the differences were revealing. Not in the "model doesn't run" sense. In the "model runs fine but the analysis is subtly incomplete" sense. The dangerous kind.
Without the skill, the agent consistently:

- interpreted posteriors without checking divergences or other sampler diagnostics,
- slipped into frequentist language ("p-values", "statistical significance") when reporting Bayesian results,
- hedged on model comparison, telling the user to "consider your goals" instead of recommending the simpler model when candidates were practically indistinguishable.
These aren't catastrophic failures. They're the kind of subtle issues that make the difference between an analysis you can trust and one that looks trustworthy but isn't. And that's exactly the kind of thing a skill can fix, because the skill doesn't forget. It's the same 9-step workflow every single time.
I ran a proper evaluation: 6 test scenarios, each graded against a specific set of assertions — things like "uses Bernoulli likelihood with logit link" or "runs LOO-PIT calibration checks". Here's how it went:
| Scenario | With skill | Without skill |
|---|---|---|
| Logistic regression (churn) | 100% | 90% |
| Hierarchical schools | 100% | 89% |
| Troubleshooting divergences | 100% | 88% |
| Count data (overdispersion) | 100% | 100% |
| Model comparison | 100% | 88% |
| Prior elicitation + report | 100% | 89% |
| Overall | 100% | 90.5% |
100% vs 90.5% overall. That 9.5-percentage-point gap might not look dramatic — until you realize these are the easy-to-miss best practices. The ones where a human reviewer would say "technically it works, but..." The skill closes that gap completely.
The cost? About 29% more time and 87% more tokens per task. For a beginner who'd spend an hour Googling "what to do about PyMC divergences," that's a bargain. For an expert who just wants guardrails, it's peace of mind.
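The grading harness itself isn't part of this post, but to make "graded against a specific set of assertions" concrete, here is a hypothetical sketch: score each scenario as the fraction of required code snippets present in the generated script. Both the function and the rubric strings are illustrative, not the real evaluation code:

```python
def grade(script: str, assertions: list[str]) -> float:
    """Score a generated analysis script as the fraction of
    required code snippets it contains (hypothetical grader)."""
    hits = sum(1 for snippet in assertions if snippet in script)
    return hits / len(assertions)

# Illustrative rubric for the logistic-regression scenario
rubric = ["pm.Bernoulli", "sample_prior_predictive", "az.plot_loo_pit"]
script = "with pm.Model():\n    y = pm.Bernoulli(...)\n"
print(grade(script, rubric))  # 1 of 3 snippets present, so 1/3
```

In practice a harness might also check plots or use an LLM judge for the qualitative assertions; string matching is just the simplest version of the idea.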
The bayesian-workflow skill encodes a 9-step sequence that every Bayesian analysis should follow.
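To give a flavor of what such a sequence covers, here is the canonical Bayesian workflow (after Gelman et al.'s workflow paper) with the typical PyMC/ArviZ call for each step. This is my paraphrase for illustration, not necessarily the skill's exact nine steps:

```python
# Canonical Bayesian workflow steps and typical PyMC/ArviZ calls.
# An illustrative paraphrase; the skill's exact steps may differ.
WORKFLOW = [
    ("Understand the data and the estimand", "exploratory analysis"),
    ("Choose and justify priors", "prior elicitation"),
    ("Prior predictive check", "pm.sample_prior_predictive"),
    ("Fit the model", "pm.sample"),
    ("Convergence diagnostics (R-hat, ESS, divergences)", "az.summary"),
    ("Posterior predictive check", "pm.sample_posterior_predictive"),
    ("Calibration check", "az.plot_loo_pit"),
    ("Model comparison", "az.compare"),
    ("Report with uncertainty intervals, not p-values", "az.plot_posterior"),
]
assert len(WORKFLOW) == 9
```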
It's opinionated on purpose. There are good reasons for each of these choices, and I'd rather have an agent that follows a strict, correct workflow than one that freestyles its way through an analysis.
One thing I care about: this isn't locked to one tool. The skill follows the Agent Skills spec, which means it works with Claude Code, Kimi Code, Cursor, Gemini CLI, and any other agent that supports the standard. Install it in 10 seconds:
```bash
git clone https://github.com/Learning-Bayesian-Statistics/baygent-skills.git /tmp/baygent-skills

# Claude Code
mkdir -p ~/.claude/skills
cp -r /tmp/baygent-skills/bayesian-workflow ~/.claude/skills/

# Agents that read the shared skills directory
cp -r /tmp/baygent-skills/bayesian-workflow/ ~/.config/agents/skills/
```
This is the first skill in a collection I'm calling baygent-skills — a set of skills that let you call your agent Bayes. Thomas Bayes. More specialized skills are coming, covering different domains and modeling paradigms.
If you try the skill and something doesn't work, or you think a guardrail is missing, open an issue. PRs are even more welcome! This is already happening — Osvaldo Martín (ArviZ maintainer) suggested integrating the new `arviz_stats.diagnose()` function, which consolidates five diagnostic checks into a single call. That's now in the skill. The whole point is to encode the community's hard-won knowledge into something agents can actually use — and that takes more than one person to do.
Now if you'll excuse me, I have some priors to elicit.
PyMCheers & Keep On Sampling!