A Researcher Stress-Tested My Bayesian Agent Skill. Here's What Broke
Alex Andorra


A researcher stress-tested my Bayesian Agent Skill on a real hierarchical model. It caught the workflow violations but missed a subtle prediction bug, had no sparsity priors, and lacked classification evaluation. Here's what I shipped to fix it -- in the same week.

Last week I wrote about building an Agent Skill that teaches coding agents the Bayesian workflow. The pitch was simple: agents know Bayesian stats but skip half the steps. The skill enforces the full 9-step workflow -- prior predictive checks, convergence diagnostics, calibration, reporting -- every time. It scored 100% on my eval suite.

Then someone actually used it on a real model. And the eval suite suddenly felt a lot less complete.

The experiment

One of the many good things about working with someone as brilliant as Max Göbel is that he'll do smart things without even telling me: he decided to put the skill through a proper test.

Max and I are co-authors on the Soccer Factor Model, so, naturally, he took SFMMO -- our ordered logistic hierarchical model for predicting match outcomes across 5 European football leagues (~10,000 matches) -- and ran the same prompt through two Claude Code sessions:

  • Without the skill: "Look at this notebook, discuss the model, and revamp the architecture".
  • With the skill: Same prompt, plus "read this bayesian-workflow skill first".

Then he had a third agent (one that hadn't seen either implementation) compare the two outputs blind.

This is the kind of test I couldn't have designed myself. SFMMO is a real research model with real complexity: ordinal outcomes (away win < draw < home win), league-varying intercepts, 30+ candidate features, cross-sectional standardization, and out-of-sample evaluation across 4 held-out seasons. It hits corners that my synthetic eval scenarios never touched.

What the skill nailed

The comparison was clear on the fundamentals. The skilled agent:

  • Ran prior predictive checks before fitting. The unskilled agent skipped them entirely. This is the single most common Bayesian workflow violation, and it's exactly what the skill exists to prevent.
  • Used non-centered parameterization for the hierarchical league intercepts. The unskilled agent used centered parameterization -- a well-known source of the "funnel of doom" that makes NUTS struggle with small numbers of groups. With only 5 leagues, non-centered is textbook correct.
  • Applied the strict R-hat threshold (1.01), not the outdated 1.05. The unskilled agent used 1.05, which can mask real convergence issues.
  • Used a descriptive seed (sum(map(ord, "sfmmo-v2-ordered-hierarchical"))) instead of 42. Small thing, but it prevents accidental seed reuse across projects and makes results more traceable.
  • Saved InferenceData immediately after sampling, before any post-processing. The unskilled agent only saved at the very end -- one crash away from losing an expensive MCMC run.
  • Built a properly identified model. This one's subtle. The unskilled agent separated league intercepts from home advantage into two terms (alpha_league + delta_home). Sounds reasonable -- until you realize that every observation in the dataset is from the home team's perspective. The data never observes a match without home advantage active, so alpha_league and delta_home are not separately identifiable. Their individual posteriors reflect prior assumptions, not data signal. The skilled agent avoided this by using a single league-level shift -- honest about what the data can actually tell you.
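The non-centered trick is worth spelling out. Both parameterizations describe the same distribution for the league intercepts; what changes is the geometry the sampler explores, because the standardized offsets are a priori independent of the group-level scale. A minimal NumPy sketch of the equivalence (the numbers are made up, not SFMMO's):

```python
import numpy as np

rng = np.random.default_rng(123)
mu, sigma = 0.2, 0.5  # hypothetical league-level mean and spread
n = 200_000

# Centered: draw the league intercepts directly from N(mu, sigma).
centered = rng.normal(mu, sigma, size=n)

# Non-centered: draw standardized offsets z ~ N(0, 1), then shift and scale.
# NUTS then explores (z, sigma) instead of (theta, sigma), which removes
# the funnel that appears when sigma gets small with only a few groups.
z = rng.normal(0.0, 1.0, size=n)
non_centered = mu + sigma * z

# Same distribution, different geometry:
print(centered.mean(), non_centered.mean())  # both ~0.2
print(centered.std(), non_centered.std())    # both ~0.5
```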

The workflow discipline held. The skill does what it promises.

What broke

But the unskilled agent did three things genuinely better -- things the skill didn't even mention.

1. The median bug

The skilled agent computed out-of-sample predictions by taking the median of posterior predictive probabilities across samples. This is wrong. The proper Bayesian predictive distribution is the mean:

$$\hat{P}(Y=k \mid x) = \frac{1}{S} \sum_{s=1}^{S} P(Y=k \mid x, \theta^{(s)})$$

The median doesn't correspond to the posterior predictive distribution. It can violate probability coherence (probabilities may not sum to 1 across categories), and it biases calibration due to Jensen's inequality. The unskilled agent used np.mean -- the correct choice.

The skill said nothing about this. It's a subtle statistical error that passes all convergence diagnostics and produces results that look reasonable. The code runs. The probabilities are close to summing to 1. But the analysis is quietly, systematically wrong -- exactly the kind of problem the skill was supposed to prevent.
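The incoherence is easy to demonstrate. With made-up posterior predictive draws (not SFMMO's), the per-category means always sum to 1 because every individual draw does, while the per-category medians generally don't:

```python
import numpy as np

rng = np.random.default_rng(0)

# Fake posterior predictive draws: S samples of a 3-category probability
# vector (away win, draw, home win). Each individual draw sums to 1.
S = 4_000
draws = rng.dirichlet(alpha=[2.0, 1.0, 3.0], size=S)

mean_probs = draws.mean(axis=0)          # the posterior predictive
median_probs = np.median(draws, axis=0)  # not a probability distribution

print(mean_probs.sum())    # 1.0 (up to float error)
print(median_probs.sum())  # != 1 -- coherence is lost
```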

2. No sparsity priors

The skilled agent used Normal(0, 1) priors on all regression coefficients. For 7 features, that's fine. But the original model had 30+ candidate features, and the unskilled agent reached for a much better tool: the regularized horseshoe prior (Piironen & Vehtari, 2017). This adaptively shrinks irrelevant features toward zero while preserving signal from important ones -- no manual feature selection required.

The skill's priors guide covered Normal, Gamma, Exponential, Beta, LKJ -- the basics. But it said nothing about sparsity priors for high-dimensional settings. That's a real gap for anyone working with more than a handful of features.

3. Richer evaluation

The skilled agent produced standard diagnostics: trace plots, LOO-CV, PPC, coverage. The unskilled agent went further:

  • Per-class calibration plots: Is the model well-calibrated for home wins, draws, and away wins separately? The overall calibration can look fine while specific outcomes are poorly predicted.
  • Per-season out-of-sample breakdown: Does the model degrade over time? A model that works on 2020 data but fails on 2023 might be overfitting to historical patterns.
  • Feature importance via P(|β| > 0.05): With a horseshoe prior, you can compute the probability that each feature has a practically meaningful effect. This is the natural Bayesian answer to "which features matter?"
  • Expected Calibration Error (ECE): A single number summarizing calibration quality across all predicted probabilities.

None of these were in the skill. They should have been.
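To make one of these concrete: ECE is only a few lines of NumPy. This is a generic sketch (binning by confidence and the bin count are common conventions, not something prescribed by the comparison report):

```python
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE: bin predictions by confidence (max predicted probability),
    then average the gap between confidence and accuracy per bin,
    weighted by bin size."""
    probs = np.asarray(probs)
    labels = np.asarray(labels)
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - conf[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece

# Perfectly confident and always right -> ECE of 0.
probs = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
print(expected_calibration_error(probs, np.array([0, 1])))  # 0.0
```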

What I shipped

Within a week, I pushed v1.1 of the bayesian-workflow skill. Here's what changed:

Three new critical rules:

  • Always use the posterior mean (not median) for predictive probabilities
  • Use pm.set_data() + pm.sample_posterior_predictive() for out-of-sample predictions -- don't manually extract and recompute
  • Check model identifiability before interpreting individual components

Sparsity priors guide with full code templates for:

  • Regularized horseshoe (Finnish horseshoe) with the double-funnel geometry warning
  • R2-D2 prior for when you have domain knowledge about R²
  • Decision table: when to use sparsity vs. simple Normal priors

Classification and ordinal evaluation:

  • ECE and Ranked Probability Score helper functions
  • Per-class calibration plots
  • Temporal out-of-sample splits
  • Feature importance for shrinkage models
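For an ordered outcome like away win < draw < home win, the Ranked Probability Score is the natural choice: it compares cumulative distributions, so predicting a draw when the home team wins is penalized less than predicting an away win. A minimal sketch (the helper name is mine, not necessarily the skill's):

```python
import numpy as np

def ranked_probability_score(probs, labels):
    """Mean RPS: squared distance between cumulative predicted
    probabilities and the cumulative one-hot outcome."""
    probs = np.asarray(probs)
    onehot = np.eye(probs.shape[1])[np.asarray(labels)]
    diffs = np.cumsum(probs, axis=1) - np.cumsum(onehot, axis=1)
    return float(np.mean(np.sum(diffs**2, axis=1)))

confident_away = [[1.0, 0.0, 0.0]]  # all mass on "away win"
print(ranked_probability_score(confident_away, [1]))  # off by one class: 1.0
print(ranked_probability_score(confident_away, [2]))  # off by two classes: 2.0
```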

Identifiability checks:

  • Common traps (always-active offsets, overparameterized intercepts, group-level collinearity)
  • Detection via az.plot_pair()
  • Remediation strategies
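Here's what the always-active-offset trap looks like numerically. This toy simulation (made-up numbers, not the SFMMO posterior) mimics a posterior where the data pin down only the sum alpha + delta:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 50_000

# The sum is tightly identified by the data...
total = rng.normal(1.0, 0.05, size=n)
# ...but how it splits between the two terms is prior-driven.
split = rng.normal(0.0, 1.0, size=n)
alpha, delta = split, total - split

# Individually the components look wildly uncertain...
print(alpha.std(), delta.std())  # ~1.0 each, vs 0.05 for the sum
# ...and almost perfectly anti-correlated: the diagonal ridge that
# az.plot_pair(idata, var_names=["alpha", "delta"]) would reveal.
print(np.corrcoef(alpha, delta)[0, 1])  # ~ -1
```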

Plus new model families in the quick-reference table (zero-inflated, hurdle, ordinal, censored, truncated), Pathfinder initialization for slow warmups, and a couple of gotcha fixes.

Lessons

1. Synthetic evals have blind spots

My 6-scenario eval suite caught the obvious workflow violations: missing prior predictive checks, wrong R-hat thresholds, frequentist language in Bayesian reports. It did not catch the median-vs-mean bug, the lack of sparsity priors, or the identifiability trap. These only surfaced when a domain expert applied the skill to a real, complex model.

Evals are necessary. They're not sufficient.

2. The unskilled agent isn't always worse

Max's third agent concluded that the skilled agent produced the "more reliable implementation overall." But the unskilled agent contributed two genuinely important ideas (horseshoe prior, proper posterior averaging) that the skilled agent missed.

In other words, the skill makes the floor higher, but it doesn't always raise the ceiling.

3. The best feedback comes from real use

I could have iterated on the eval suite for weeks. Instead, one skilled researcher running one real model in one afternoon revealed more actionable gaps than three rounds of synthetic benchmarking. If you're building agent tools, get them into real hands as early as possible.

4. Agent Skills are living documents

v1.0 was good enough to ship. v1.1 is materially better because someone actually used it and told me what was missing. I expect v1.2 will come from the next researcher who tries it on a model I haven't even thought of.

Unlike your steak, the skill isn't done. It's never done. That's the point.

Updating to v1.1

Easy as ABC: just ask your agent to "Update the bayesian-workflow skill" and you should be good to go!

The next time your agent loads the skill, it'll have the new sparsity priors guide, classification evaluation metrics, identifiability checks, and all the other v1.1 additions.

The bayesian-workflow skill is open source and works with Claude Code, Cursor, Gemini CLI, and any agent supporting the Agent Skills spec. If you try it and something doesn't work, open an issue. If the comparison report above is any indication, that issue will make v1.2 better for everyone, and your contribution will _actually_ make a difference!

On that note, PyMCheers, my dear Bayesians!

Alexandre Andorra