A researcher stress-tested my Bayesian Agent Skill on a real hierarchical model. It caught the workflow violations but missed a subtle prediction bug, had no sparsity priors, and lacked classification evaluation. Here's what I shipped to fix it -- in the same week.
Last week I wrote about building an Agent Skill that teaches coding agents the Bayesian workflow. The pitch was simple: agents know Bayesian stats but skip half the steps. The skill enforces the full 9-step workflow -- prior predictive checks, convergence diagnostics, calibration, reporting -- every time. It scored 100% on my eval suite.
Then someone actually used it on a real model. And the eval suite suddenly felt a lot less complete.
One of the many good things about working with someone as brilliant as Max Göbel is that he'll do smart things without even telling me: he decided to put the skill through a proper test.
Max and I are co-authors on the Soccer Factor Model, so, naturally, he took SFMMO -- our ordered logistic hierarchical model for predicting match outcomes across 5 European football leagues (~10,000 matches) -- and ran the same prompt through two Claude Code sessions: one with the bayesian-workflow skill loaded, one without.
Then he had a third agent (one that hadn't seen either implementation) compare the two outputs blind.
This is the kind of test I couldn't have designed myself. SFMMO is a real research model with real complexity: ordinal outcomes (away win < draw < home win), league-varying intercepts, 30+ candidate features, cross-sectional standardization, and out-of-sample evaluation across 4 held-out seasons. It hits corners that my synthetic eval scenarios never touched.
The comparison was clear on the fundamentals: the skilled agent ran every step of the workflow -- prior predictive checks, convergence diagnostics, calibration, reporting.
The workflow discipline held. The skill does what it promises.
But the unskilled agent did two things genuinely better -- things the skill didn't even mention.
The skilled agent computed out-of-sample predictions by taking the median of posterior predictive probabilities across samples. This is wrong. The proper Bayesian predictive distribution is the mean:
$$\hat{P}(Y=k \mid x) = \frac{1}{S} \sum_{s=1}^{S} P(Y=k \mid x, \theta^{(s)})$$
The median doesn't correspond to the posterior predictive distribution. It can violate probability coherence (the per-category medians may not sum to 1 across categories), and it biases calibration due to Jensen's inequality. The unskilled agent used `np.mean` -- the correct choice.
The skill said nothing about this. It's a subtle statistical error that passes all convergence diagnostics and produces results that look reasonable. The code runs. The probabilities are close to summing to 1. But the analysis is quietly, systematically wrong -- exactly the kind of problem the skill was supposed to prevent.
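To make the failure mode concrete, here is a minimal numpy sketch (the array shapes and random probabilities are illustrative, not SFMMO's actual predictions): averaging per-draw probabilities always yields valid distributions, while per-category medians generally don't sum to 1.

```python
import numpy as np

rng = np.random.default_rng(0)
S, N, K = 4000, 8, 3  # posterior draws, matches, outcome categories

# Fake posterior predictive probabilities: each draw s assigns every
# match a valid distribution over (away win, draw, home win).
logits = rng.normal(size=(S, N, K))
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)

# Correct: average over draws. Each row is a convex combination of
# valid distributions, so it still sums to exactly 1.
p_mean = probs.mean(axis=0)

# Wrong: medians are taken independently per category, so the
# resulting rows need not sum to 1.
p_median = np.median(probs, axis=0)

print(p_mean.sum(axis=-1))    # all ~1.0
print(p_median.sum(axis=-1))  # noticeably off 1.0
```

Both versions run without error and produce plausible-looking numbers, which is exactly why this slips past diagnostics.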
The skilled agent used Normal(0, 1) priors on all regression coefficients. For 7 features, that's fine. But the original model had 30+ candidate features, and the unskilled agent reached for a much better tool: the regularized horseshoe prior (Piironen & Vehtari, 2017). This adaptively shrinks irrelevant features toward zero while preserving signal from important ones -- no manual feature selection required.
The skill's priors guide covered Normal, Gamma, Exponential, Beta, LKJ -- the basics. But it said nothing about sparsity priors for high-dimensional settings. That's a real gap for anyone working with more than a handful of features.
The skilled agent produced standard diagnostics: trace plots, LOO-CV, posterior predictive checks, coverage. The unskilled agent went further, adding classification-specific evaluation and identifiability checks.
None of these were in the skill. They should have been.
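For ordered outcomes like away win < draw < home win, one natural classification metric is the ranked probability score, which compares the cumulative forecast against the cumulative observed outcome (an illustrative sketch of mine, not code from either agent's output):

```python
import numpy as np

def ranked_probability_score(probs, outcome):
    """RPS for one forecast: squared distance between the cumulative
    forecast and the cumulative (degenerate) observed distribution,
    normalized by K - 1. Lower is better; 0 is a perfect forecast."""
    probs = np.asarray(probs, dtype=float)
    observed = np.zeros_like(probs)
    observed[outcome] = 1.0
    diffs = np.cumsum(probs) - np.cumsum(observed)
    return np.sum(diffs**2) / (len(probs) - 1)

# Forecast over (away win, draw, home win); the home side actually won.
print(round(ranked_probability_score([0.2, 0.3, 0.5], outcome=2), 3))  # 0.145
```

Unlike plain log loss, RPS rewards a forecast for being close on the ordinal scale: predicting a draw when the home side wins costs less than predicting an away win.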
Within a week, I pushed v1.1 of the bayesian-workflow skill. Here's what changed:
- Three new critical rules
- A sparsity priors guide with full code templates
- Classification and ordinal evaluation metrics
- Identifiability checks
Plus new model families in the quick-reference table (zero-inflated, hurdle, ordinal, censored, truncated), Pathfinder initialization for slow warmups, and a couple of gotcha fixes.
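To show the flavor of an identifiability check (toy numbers, not SFMMO's posterior): when two parameters are only jointly identified -- say, a global intercept plus a free ordinal cutpoint shift -- their posterior draws are nearly perfectly correlated, and a simple correlation scan flags it.

```python
import numpy as np

rng = np.random.default_rng(2)
S = 2000

# Toy posterior where the data only pin down the SUM mu + delta:
total = rng.normal(1.0, 0.1, size=S)  # the sum is well identified
mu = rng.normal(0.0, 1.0, size=S)     # individually free to wander
delta = total - mu                    # compensates for mu exactly

corr = np.corrcoef(mu, delta)[0, 1]
print(corr)  # close to -1: only the sum is identified
```

A correlation near ±1 between two posterior marginals is a red flag worth checking before trusting any individual parameter estimate.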
My 6-scenario eval suite caught the obvious workflow violations: missing prior predictive checks, wrong R-hat thresholds, frequentist language in Bayesian reports. It did not catch the median-vs-mean bug, the lack of sparsity priors, or the identifiability trap. These only surfaced when a domain expert applied the skill to a real, complex model.
Evals are necessary. They're not sufficient.
Max's third agent concluded that the skilled agent produced the "more reliable implementation overall." But the unskilled agent contributed two genuinely important ideas (horseshoe prior, proper posterior averaging) that the skilled agent missed.
In other words, the skill makes the floor higher, but it doesn't always raise the ceiling.
I could have iterated on the eval suite for weeks. Instead, one skilled researcher running one real model in one afternoon revealed more actionable gaps than three rounds of synthetic benchmarking. If you're building agent tools, get them into real hands as early as possible.
v1.0 was good enough to ship. v1.1 is materially better because someone actually used it and told me what was missing. I expect v1.2 will come from the next researcher who tries it on a model I haven't even thought of.
Unlike your steak, the skill isn't done. It's never done. That's the point.
Easy as ABC: just ask your agent to "Update the bayesian-workflow skill" and you should be good to go!
The next time your agent loads the skill, it'll have the new sparsity priors guide, classification evaluation metrics, identifiability checks, and all the other v1.1 additions.
The bayesian-workflow skill is open source and works with Claude Code, Cursor, Gemini CLI, and any agent supporting the Agent Skills spec. If you try it and something doesn't work, open an issue. If the comparison report above is any indication, that issue will make v1.2 better for everyone, and your contribution will _actually_ make a difference!
On that note, PyMCheers, my dear Bayesians!