Coding agents confidently say "X causes Y" without drawing a DAG, checking assumptions, or running refutation tests. I built an Agent Skill that won't let them. 100% eval pass rate vs 68% without -- the gap is in the reasoning, not the code.
And that's exactly why you should trust it.
Have you ever asked a coding agent to estimate a causal effect and watched what happens? I mean, why wouldn't you -- this sounds like a great plan!
The agent will load your data, run a regression, and confidently declare: "Exercise causes a 1.5 mmHg reduction in blood pressure per additional hour per week." Clean code. Nice coefficient table. Publication-ready formatting.
One problem though: it never drew a DAG. It never asked you which variables are confounders and which are mediators. It never checked whether the effect survives a sensitivity analysis. It never questioned whether "causes" is even the right word for an observational study where nobody was randomized.
Sure, the code runs and the estimate looks reasonable, but the causal claim is indefensible — and that's the dangerous part. Because this isn't a coding error that crashes at runtime. It's a reasoning error that makes it into your report, your policy memo, your medical board presentation.
Since I love being prescriptive and, above all, pointing out other people's mistakes, I built an Agent Skill that won't let that happen.
The causal-inference skill is part of the baygent-skills collection. It enforces an 8-step workflow where the agent must think before it codes:
Steps 1 through 4 are the thinking phase. No code. No estimation. Just assumptions, made explicit, confirmed by the human who has the domain knowledge.
The agent can't skip ahead. It can't jump to code because it's excited about the data. It can't claim causality because refutation hasn't happened yet. It must earn the right to say "causes" — by first doing the work that makes "causes" defensible.
I ran the same prompt through two agents — one with the skill, one without:
"I want to estimate the causal effect of exercise on blood pressure. I have observational data on 3,000 adults with exercise hours, blood pressure, age, BMI, smoking, stress, diet quality, and income. No randomization. Help me build a causal analysis."
The without-skill agent drew a DAG (good), identified confounders (good), and correctly flagged BMI as a mediator rather than a confounder (impressive, actually). Then it ran three estimators: OLS regression, inverse propensity weighting, and g-computation — all frequentist. It reported a 95% confidence interval. It discussed limitations in a caveats section.
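If you haven't seen IPW and g-computation side by side, here's a minimal sketch of what they compute, on fully invented simulated data (the variable names, true effect of -5 mmHg, and binary treatment are my simplifications for illustration -- this is not the eval's dataset or the agent's code):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 3_000

# Hypothetical simulated data: age confounds both exercise and blood
# pressure, and the true effect of exercise is -5 mmHg.
age = rng.normal(50.0, 10.0, n)
p_true = 1.0 / (1.0 + np.exp(-(age - 50.0) / 10.0))  # older adults exercise more here
exercise = rng.binomial(1, p_true)
bp = 130.0 + 0.3 * age - 5.0 * exercise + rng.normal(0.0, 5.0, n)

# Naive difference in means: biased, because age is a confounder
naive = bp[exercise == 1].mean() - bp[exercise == 0].mean()

# Inverse propensity weighting (normalized / Hajek form; we reuse the true
# propensities for brevity -- in practice you would estimate them)
w1 = exercise / p_true
w0 = (1 - exercise) / (1 - p_true)
ipw = (w1 @ bp) / w1.sum() - (w0 @ bp) / w0.sum()

# g-computation: fit an outcome model, then contrast average predictions
# under "everyone exercises" vs "nobody exercises"
X = np.column_stack([np.ones(n), exercise, age])
beta, *_ = np.linalg.lstsq(X, bp, rcond=None)
X1, X0 = X.copy(), X.copy()
X1[:, 1], X0[:, 1] = 1.0, 0.0
gcomp = (X1 @ beta).mean() - (X0 @ beta).mean()

print(f"naive: {naive:+.2f}  IPW: {ipw:+.2f}  g-computation: {gcomp:+.2f}")
```

On this seeded simulation, the naive difference lands well away from -5 while both adjusted estimators recover it -- which is exactly why the agent ran more than one.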
Solid work. A B+ from a causal inference class.
But it never asked me whether its DAG was correct. It never asked whether I was comfortable defending the "no unmeasured confounders" assumption. It never ran a sensitivity analysis for how strong an unobserved confounder would need to be to explain away the effect. And it said "the effect of exercise" — not "the estimated effect, assuming our DAG is correct and no unmeasured confounders exist".
The agent with the skill started by proposing a precise estimand: "the ATE of exercise hours on systolic blood pressure, in mmHg per hour per week." Then it asked me to confirm.
It drew a DAG with 8 nodes and 13 edges, justifying each one. It explicitly listed the non-edges — the assumptions where it's asserting no direct causal effect — and asked me whether each one was defensible. It flagged diet_quality as ambiguous: "Is this a confounder or a mediator? If exercise changes your diet, which changes your blood pressure, adjusting for diet would block part of the causal pathway."
It identified the effect via the backdoor criterion using DoWhy, stated the adjustment set, and told me: "This identification rests entirely on the assumption of no unobserved confounders. Are you comfortable defending this to your medical board?"
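If "backdoor criterion" sounds abstract, here's a from-scratch sketch on a toy DAG -- the edge list is my own guess at a plausible graph for these variables, not the skill's actual 8-node DAG, and in the real workflow DoWhy performs this check:

```python
from itertools import chain

# A toy DAG inspired by the post's variables (edges invented for illustration)
edges = [
    ("age", "exercise"), ("age", "bp"),
    ("stress", "exercise"), ("stress", "bp"),
    ("income", "exercise"), ("income", "bp"),
    ("exercise", "diet"), ("diet", "bp"),  # diet is a mediator
    ("exercise", "bp"),
]
nodes = set(chain.from_iterable(edges))
children = {n: {b for a, b in edges if a == n} for n in nodes}

def descendants(node):
    seen, stack = set(), [node]
    while stack:
        for c in children[stack.pop()]:
            if c not in seen:
                seen.add(c)
                stack.append(c)
    return seen

def undirected_paths(src, dst, path=None):
    """All simple paths from src to dst, ignoring edge direction."""
    path = path or [src]
    if src == dst:
        yield path
        return
    neighbours = children[src] | {a for a, b in edges if b == src}
    for nxt in neighbours - set(path):
        yield from undirected_paths(nxt, dst, path + [nxt])

def blocked(path, Z):
    """Is this path d-separated by the conditioning set Z?"""
    for i in range(1, len(path) - 1):
        prev, v, nxt = path[i - 1], path[i], path[i + 1]
        if v in children[prev] and v in children[nxt]:  # collider at v
            if v not in Z and not (descendants(v) & Z):
                return True  # an unconditioned collider blocks the path
        elif v in Z:
            return True      # a conditioned chain/fork blocks the path
    return False

def satisfies_backdoor(T, Y, Z):
    if descendants(T) & Z:
        return False  # Z must not contain descendants of the treatment
    backdoor = [p for p in undirected_paths(T, Y) if p[1] not in children[T]]
    return all(blocked(p, Z) for p in backdoor)

print(satisfies_backdoor("exercise", "bp", {"age", "stress", "income"}))  # True
print(satisfies_backdoor("exercise", "bp", {"age", "diet"}))              # False
```

The second call fails because diet is a descendant of the treatment -- the same mediator trap the skill flagged in its questions to me.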
Then — and only then — it wrote a PyMC model: prior predictive check, nutpie sampler, full diagnostics, using my Bayesian workflow skill.
Then, it ran three DoWhy refutation tests (random common cause, placebo treatment, data subset — all passed) and a sensitivity analysis for unobserved confounding. The tipping point was at effect strength ~1.0, which is ~2.8x stronger than the strongest measured confounder. The agent rated this "marginal" and flagged that antihypertensive medications could plausibly be a confounder that strong.
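The placebo-treatment refutation is the easiest of these to picture. Here's a minimal hand-rolled sketch of the idea on invented simulated data (DoWhy's actual refuter is more general; the numbers below are mine, not the eval's):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3_000

# Hypothetical simulated data: age confounds exercise and blood pressure,
# with a true effect of -5 mmHg.
age = rng.normal(50.0, 10.0, n)
exercise = rng.binomial(1, 1.0 / (1.0 + np.exp(-(age - 50.0) / 10.0)))
bp = 130.0 + 0.3 * age - 5.0 * exercise + rng.normal(0.0, 5.0, n)

def adjusted_effect(treat, outcome, covariate):
    """OLS coefficient on the treatment, adjusting for the covariate."""
    X = np.column_stack([np.ones(len(treat)), treat, covariate])
    beta, *_ = np.linalg.lstsq(X, outcome, rcond=None)
    return beta[1]

real = adjusted_effect(exercise, bp, age)

# Placebo-treatment refutation: swap in a shuffled copy of the treatment.
# If the pipeline is honest, the "effect" should collapse toward zero.
placebo = adjusted_effect(rng.permutation(exercise), bp, age)

print(f"estimated effect: {real:+.2f}  placebo effect: {placebo:+.2f}")
```

A pipeline that still finds a sizeable "effect" for a shuffled treatment is telling you its effect was never causal to begin with.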
The report adapted for the dual audience: clinicians got plain language ("roughly 9 in 10 chance the effect is negative, which means exercise likely reduces blood pressure"); statisticians got the full DAG, identification result, and posterior plots.
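That "9 in 10" figure is nothing exotic -- it's just the share of posterior draws below zero. A tiny sketch, with a mean and spread I invented to roughly reproduce the figure (the real skill computes it from the PyMC posterior):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical posterior draws for the exercise effect (mmHg per weekly
# hour); the -1.5 mean and 1.1 sd are invented for illustration.
posterior_effect = rng.normal(-1.5, 1.1, 4_000)

# Probability that exercise reduces blood pressure = fraction of draws < 0
p_negative = (posterior_effect < 0).mean()
print(f"P(effect < 0) = {p_negative:.2f}")
```

This is the kind of quantity a clinician can act on directly, which is why the report leads with it instead of a p-value.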
And the conclusion used suggestive causal language: "Evidence is suggestive of a causal effect, but the marginal sensitivity analysis means we cannot make definitive causal claims."
The agent refused to say "causes" — because the evidence didn't fully earn it.
I ran 6 test scenarios covering the major causal inference designs. Here are the eval pass rates:
100% vs 68%. A 32-point gap.
The without-skill agent isn't bad — it consistently identifies the right design (DiD, RDD, SC, mediation). It knows causal inference, but it makes three systematic mistakes:
These aren't edge cases. They're the gap between analysis-you-can-defend and analysis-that-looks-defensible.
The scenario I'm most proud of: DiD with a parallel trends violation stated in the prompt.
The user says: "The treatment cities were already on a downward trend before the intervention, probably because they were richer cities with better healthcare access. Can you still estimate the causal effect?"
The without-skill agent handles this well — 90% pass rate. It recognizes the violation, runs a naive DiD as illustration, then proposes corrections (trend-adjusted DiD, synthetic control, Rambachan-Roth bounds). Genuinely good work.
The skilled agent does all of that plus: it draws a DAG showing wealth as an unobserved confounder affecting both treatment adoption and outcome trajectory. It asks the user about confounders. It downgrades causal language explicitly: "associated with, not caused, because a critical refutation test failed". And it quantifies the bias: the naive DiD overestimates by 87% compared to the trend-adjusted estimate.
The difference isn't that the without-skill agent gets it wrong. It's that the skilled agent documents exactly what could go wrong and adjusts its claims accordingly. That's the difference between an analysis you can defend in a seminar room and one you can defend in a courtroom.
The skill supports the full landscape of Bayesian causal inference:
Quasi-experimental designs (via CausalPy, from the brilliant Ben Vincent):
Structural causal models (via PyMC):
Identification (via DoWhy):
And it composes with the bayesian-workflow skill for all the PyMC mechanics — priors, sampling, diagnostics, calibration. No duplication.
Just ask your agent to "Install the causal-inference skill: https://github.com/Learning-Bayesian-Statistics/baygent-skills/tree/main/causal-inference" and you should be good to go! Bonus: the skill will auto-install bayesian-workflow if it's not already present ;)
As always, it works with Claude Code, Cursor, Gemini CLI, Kimi Code and any agent supporting the Agent Skills spec.
Allow me to get philosophical for a moment -- I am French, after all... There's a deeper principle here: most AI tools are designed to do more, faster. The causal-inference skill is designed to do less, carefully. In this, I was really inspired by the work of Cal Newport, especially around slow productivity.
I wanted to force the agent to slow down. To draw the DAG before touching data. To state assumptions before estimating effects. To run robustness checks before claiming causality. To ask the famous human-in-the-loop before assuming domain knowledge. All stuff I teach my own students!
This isn't a limitation. It's the whole point.
Causal inference is the one domain where confidence without rigor is actively dangerous. A wrong soccer match prediction is annoying. A wrong causal claim about a drug, a policy, or an intervention can hurt people. The skill's job isn't to make causal inference faster — it's to make it harder to do wrong.
An agent that refuses to say "causes" until it's earned the right? That's not a bug. That's the feature.
The causal-inference skill is open source and part of the baygent-skills collection. If you try it and something doesn't work, open an issue. If the bayesian-workflow experience is any indication, your feedback will make v1.1 better for everyone 😊
On that note, PyMCheers, my dear Bayesians!