A cute cartoon robot sitting at a desk with a glowing tablet, studying a normal distribution and Bayes' rule with earnest curiosity

I Taught My Coding Agent to Think Like a Bayesian


And it only took mass-producing the same mistakes 63 times to get there.

If you've ever asked a coding agent to "build me a Bayesian model," you know the feeling. It writes something that looks right — pm.Normal here, pm.sample() there — and you think, great, done in 30 seconds! Then you look closer and realize it's happily interpreting posteriors from a model with 47 divergences. Or it's describing your Bayesian results using p-values and "statistical significance." Or two models are practically indistinguishable and the agent just shrugs and says "consider your goals" instead of recommending the simpler one.

The code runs. The plots render. But the analysis is subtly, quietly wrong — and that's the dangerous part. Not syntax errors. Not crashes. The kind of mistakes that make it into reports, into decisions, into production.

I got tired of catching these things after the fact. So I built an Agent Skill that catches them at the source.

Wait, what's an Agent Skill?

If you haven't heard of them yet, Agent Skills are a simple open format for giving coding agents — Claude Code, Cursor, Gemini CLI, Kimi Code, and many others — new capabilities. Think of them as instruction manuals that your agent reads before tackling a task. Instead of relying on whatever the model remembers from training data (which, let's be honest, is a mix of great tutorials and terrible Stack Overflow answers), the skill enforces a specific workflow with guardrails.

In our case: the Bayesian workflow. The one you should follow every time you fit a model, but whose steps you probably skip half of because, well, life is short and your deadline was yesterday.

The workflow agents skip (and why it matters)

Here's the thing about modern LLMs: they know a lot about Bayesian statistics. They can write PyMC code, explain what a posterior is, and even tell you about non-centered parameterizations if you ask nicely. But knowing and doing are different things.

When I ran the same modeling tasks with and without the skill — 6 diverse scenarios, from logistic regression to hierarchical models to model comparison — the differences were revealing. Not in the "model doesn't run" sense. In the "model runs fine but the analysis is subtly incomplete" sense. The dangerous kind.

Without the skill, the agent consistently:

  • Skipped coords and dims in the PyMC model definition. This seems cosmetic until you need to debug a shape mismatch or use ArviZ's labeled indexing. Then it's not cosmetic at all.
  • Used frequentist language to describe Bayesian results. I'm talking p-values and "statistically significant" in a posterior summary. If you're going to do Bayesian inference, at least talk Bayesian — credible intervals, posterior probabilities, the works.
  • Forgot to warn when results are unreliable. The troubleshooting scenario had 47 divergences, R-hat of 1.03, and a stuck chain. Without the skill, the agent dove into interpreting the posteriors anyway. With the skill, it led with "these results are unreliable and must not be interpreted" — which, yeah, is kind of important.
  • Didn't recommend the simpler model when two models were indistinguishable. In the model comparison task, the ELPD difference was tiny relative to its standard error. Without the skill, the agent shrugged and said "consider your goals." With the skill, it said "prefer the simpler model" — which is the right default when the data can't tell them apart.
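To make the second point concrete, reporting a posterior in Bayesian terms takes only a few lines. Here's a sketch with simulated draws standing in for a real posterior (the effect sizes are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# Stand-in for posterior draws of a treatment effect (hypothetical numbers,
# not from any of the eval scenarios)
effect_draws = rng.normal(loc=0.3, scale=0.15, size=4000)

# Bayesian summaries: a credible interval and a posterior probability,
# not a p-value or a "statistically significant" verdict
lo, hi = np.percentile(effect_draws, [3, 97])  # 94% credible interval
p_positive = (effect_draws > 0).mean()

print(f"94% credible interval: [{lo:.2f}, {hi:.2f}]")
print(f"P(effect > 0) = {p_positive:.2f}")
```

Same numbers, but now the summary answers the question you actually asked: how probable is a positive effect, and how large is it plausibly?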

These aren't catastrophic failures. They're the kind of subtle issues that make the difference between an analysis you can trust and one that looks trustworthy but isn't. And that's exactly the kind of thing a skill can fix, because the skill doesn't forget. It's the same 9-step workflow every single time.
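The model-comparison default, in particular, fits in a tiny decision rule. The numbers below are illustrative, not from the eval:

```python
# Default decision rule for model comparison (illustrative numbers):
# if the ELPD difference is small relative to its standard error,
# the data can't distinguish the models, so prefer the simpler one.
elpd_diff = 1.8   # hypothetical ELPD difference (complex minus simple)
diff_se = 4.2     # hypothetical standard error of that difference

if abs(elpd_diff) < 2 * diff_se:
    print("Models indistinguishable: prefer the simpler model")
else:
    print("Meaningful difference: prefer the higher-ELPD model")
```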

The numbers

I ran a proper evaluation: 6 test scenarios, each graded against a specific set of assertions — things like "uses Bernoulli likelihood with logit link" or "runs LOO-PIT calibration checks". Here's how it went:

| Scenario | With skill | Without skill |
|---|---|---|
| Logistic regression (churn) | 100% | 90% |
| Hierarchical schools | 100% | 89% |
| Troubleshooting divergences | 100% | 88% |
| Count data (overdispersion) | 100% | 100% |
| Model comparison | 100% | 88% |
| Prior elicitation + report | 100% | 89% |
| Overall | 100% | 90.5% |

100% vs 90.5% overall. That 9.5-percentage-point gap might not look dramatic — until you realize these are the easy-to-miss best practices. The ones where a human reviewer would say "technically it works, but..." The skill closes that gap completely.

The cost? About 29% more time and 87% more tokens per task. For a beginner who'd spend an hour Googling "what to do about PyMC divergences," that's a bargain. For an expert who just wants guardrails, it's peace of mind.

What the skill actually enforces

The bayesian-workflow skill encodes a 9-step sequence that every Bayesian analysis should follow:

  1. Formulate the generative story
  2. Specify priors — with documented justifications, not just Normal(0, 1) and vibes
  3. Implement in PyMC — coords, dims, nutpie sampler, the works
  4. Prior predictive checks — before you fit anything, make sure your priors aren't insane
  5. Inference — let nutpie pick the number of chains, use a descriptive seed for reproducibility
  6. Convergence diagnostics — one call to `arviz_stats.diagnose()` checks R-hat, ESS, divergences, tree depth, and E-BFMI. R-hat < 1.01 (not 1.1, that threshold is outdated)
  7. Model criticism — posterior predictive checks and LOO-PIT calibration. Not just "the trace plots look fine"
  8. Model comparison — LOO-CV, ELPD with standard errors, Pareto k diagnostics, stacking weights
  9. Reporting — a companion markdown that interprets what the numbers mean, adapted for the audience, and that you can customize yourself
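Step 4 is the one people skip most, and it doesn't even require fitting anything. A prior predictive check can be sketched with plain NumPy — here for a hypothetical logistic-regression intercept, asking what base rates each prior implies before any data arrives:

```python
import numpy as np

rng = np.random.default_rng(2024)

def implied_extreme_rate(prior_sd, n=10_000):
    """Fraction of prior draws implying a base rate below 1% or above 99%."""
    intercept = rng.normal(0.0, prior_sd, size=n)
    base_rate = 1 / (1 + np.exp(-intercept))  # inverse logit
    return ((base_rate < 0.01) | (base_rate > 0.99)).mean()

# A "vague" Normal(0, 10) prior on a logit-scale intercept mostly implies
# absurd base rates; Normal(0, 1.5) keeps them in plausible territory.
for sd in (10.0, 1.5):
    print(f"Normal(0, {sd}): {implied_extreme_rate(sd):.0%} of prior draws "
          f"imply base rates outside [1%, 99%]")
```

Roughly two thirds of draws from the "vague" prior land outside [1%, 99%] — the kind of surprise you want to discover before sampling, not after.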

It's opinionated on purpose. There are good reasons for each of these choices, and I'd rather have an agent that follows a strict, correct workflow than one that freestyles its way through an analysis.

Built for the open ecosystem

One thing I care about: this isn't locked to one tool. The skill follows the Agent Skills spec, which means it works with Claude Code, Kimi Code, Cursor, Gemini CLI, and any other agent that supports the standard. Install it in 10 seconds:

```bash
# Claude Code
git clone https://github.com/Learning-Bayesian-Statistics/baygent-skills.git /tmp/baygent-skills
mkdir -p ~/.claude/skills
cp -r /tmp/baygent-skills/bayesian-workflow ~/.claude/skills/

# Other agents (Kimi Code, Cursor, etc.) — same idea, different skills directory
cp -r /tmp/baygent-skills/bayesian-workflow/ ~/.config/agents/skills/
```

What's next

This is the first skill in a collection I'm calling baygent-skills — a set of skills that turn your agent into Bayes. Thomas Bayes. More specialized skills are coming, covering different domains and modeling paradigms.

If you try the skill and something doesn't work, or you think a guardrail is missing, open an issue. PRs are even more welcome! This is already happening — Osvaldo Martín (ArviZ maintainer) suggested integrating the new `arviz_stats.diagnose()` function, which consolidates five diagnostic checks into a single call. That's now in the skill. The whole point is to encode the community's hard-won knowledge into something agents can actually use — and that takes more than one person to do.

Now if you'll excuse me, I have some priors to elicit.

PyMCheers & Keep On Sampling!