I Built an Agent That Refuses to Say "Causes"
Alex Andorra


Coding agents confidently say "X causes Y" without drawing a DAG, checking assumptions, or running refutation tests. I built an Agent Skill that won't let them. 100% eval pass rate vs 68% without -- the gap is in the reasoning, not the code.

And that's exactly why you should trust it.

Have you ever asked a coding agent to estimate a causal effect and watched what happens? I mean, why wouldn't you -- it sounds like a great plan!

The agent will load your data, run a regression, and confidently declare: "Exercise causes a 1.5 mmHg reduction in blood pressure per additional hour per week." Clean code. Nice coefficient table. Publication-ready formatting.

One problem though: it never drew a DAG. It never asked you which variables are confounders and which are mediators. It never checked whether the effect survives a sensitivity analysis. It never questioned whether "causes" is even the right word for an observational study where nobody was randomized.

Sure, the code runs and the estimate looks reasonable, but the causal claim is indefensible — and that's the dangerous part. Because this isn't a coding error that crashes at runtime. It's a reasoning error that makes it into your report, your policy memo, your medical board presentation.

Since I love being prescriptive and, above all, pointing out other people's mistakes, I built an Agent Skill that won't let that happen.

The skill that says "not yet"

The causal-inference skill is part of the baygent-skills collection. It enforces an 8-step workflow where the agent must think before it codes:

  1. Formulate the causal question — What specific effect are we estimating?
  2. Draw the DAG — Propose a causal graph. Ask the user to confirm.
  3. Identify — Determine how the effect is identified. Ask the user to confirm the untestable assumptions.
  4. Choose the design — DiD, synthetic control, RDD, ITS, IV, structural model? Ask the user to confirm.
  5. Estimate — Now, and only now, write model code.
  6. Refute — Mandatory. Run design-specific robustness checks.
  7. Interpret — Effect size with decision-relevant credible intervals.
  8. Report — Assumptions before results. Limitations mandatory, not optional.
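The gating logic behind those steps can be sketched in a few lines. This is a hypothetical illustration of the idea, not the skill's actual implementation: all the names and structure below are mine.

```python
# Hypothetical sketch of the skill's gating: each "thinking" step must be
# confirmed by the human before any estimation code is allowed to run.
# Step names and class structure are illustrative, not the skill's code.

THINKING_STEPS = ["question", "dag", "identification", "design"]


class CausalWorkflow:
    def __init__(self):
        self.confirmed = set()

    def confirm(self, step):
        # The user, not the agent, signs off on each checkpoint.
        if step not in THINKING_STEPS:
            raise ValueError(f"unknown checkpoint: {step}")
        self.confirmed.add(step)

    def can_estimate(self):
        # No model code until steps 1-4 are all confirmed.
        return all(s in self.confirmed for s in THINKING_STEPS)


wf = CausalWorkflow()
wf.confirm("question")
wf.confirm("dag")
print(wf.can_estimate())  # False: identification and design not confirmed yet
wf.confirm("identification")
wf.confirm("design")
print(wf.can_estimate())  # True: the agent may now write model code
```

The point of the sketch is the shape of the control flow: estimation is unreachable until every assumption-bearing step has a human sign-off.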

Steps 1 through 4 are the thinking phase. No code. No estimation. Just assumptions, made explicit, confirmed by the human who has the domain knowledge.

The agent can't skip ahead. It can't jump to code because it's excited about the data. It can't claim causality because refutation hasn't happened yet. It must earn the right to say "causes" — by first doing the work that makes "causes" defensible.

The spectacle: same question, two agents

I ran the same prompt through two agents — one with the skill, one without:

"I want to estimate the causal effect of exercise on blood pressure. I have observational data on 3,000 adults with exercise hours, blood pressure, age, BMI, smoking, stress, diet quality, and income. No randomization. Help me build a causal analysis."

Without the skill

The agent drew a DAG (good), identified confounders (good), and correctly flagged BMI as a mediator rather than a confounder (impressive, actually). Then it ran three estimators: OLS regression, inverse propensity weighting, and g-computation — all frequentist. It reported a 95% confidence interval. It discussed limitations in a caveats section.
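For readers who haven't met g-computation: it standardizes stratum-specific outcome means over the confounder distribution. Here is a tiny self-contained sketch on made-up binary data (one confounder, numbers invented for illustration):

```python
# Toy g-computation (standardization) sketch on made-up data: adjust for a
# single binary confounder L by averaging E[Y | A=a, L=l] over P(L=l).
# All numbers are illustrative, not from the post's dataset.

# rows: (treatment A, confounder L, outcome Y)
data = [
    (1, 0, 120), (1, 0, 118), (1, 1, 130), (1, 1, 128),
    (0, 0, 124), (0, 0, 126), (0, 1, 136), (0, 1, 134),
]


def mean_y(a, l):
    ys = [y for (ai, li, y) in data if ai == a and li == l]
    return sum(ys) / len(ys)


def g_formula(a):
    # E[Y^a] = sum over l of E[Y | A=a, L=l] * P(L=l)
    n = len(data)
    return sum(
        mean_y(a, l) * sum(1 for (_, li, _) in data if li == l) / n
        for l in (0, 1)
    )


ate = g_formula(1) - g_formula(0)
print(ate)  # -6.0 under this toy data
```

Same causal estimand as the regression, but the adjustment is made explicit rather than buried in a coefficient.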

Solid work. A B+ from a causal inference class.

But it never asked me whether its DAG was correct. It never asked whether I was comfortable defending the "no unmeasured confounders" assumption. It never ran a sensitivity analysis for how strong an unobserved confounder would need to be to explain away the effect. And it said "the effect of exercise" — not "the estimated effect, assuming our DAG is correct and no unmeasured confounders exist".

With the skill

The agent started by proposing a precise estimand: "the ATE of exercise hours on systolic blood pressure, in mmHg per hour per week." Then it asked me to confirm.

It drew a DAG with 8 nodes and 13 edges, justifying each one. It explicitly listed the non-edges — the assumptions where it's asserting no direct causal effect — and asked me whether each one was defensible. It flagged diet_quality as ambiguous: "Is this a confounder or a mediator? If exercise changes your diet, which changes your blood pressure, adjusting for diet would block part of the causal pathway."
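Listing non-edges is mechanical once the DAG is written down: every pair of nodes with no edge in either direction is a claimed absence of direct causation. The graph below is my own illustrative guess at an 8-node, 13-edge blood-pressure DAG, not the one the skill actually drew:

```python
# Sketch: enumerating the "non-edges" of a proposed DAG. Every absent edge
# is an assumption of no direct causal effect that the user should defend.
# This graph is an illustrative guess, not the skill's actual DAG.

from itertools import combinations

edges = {
    ("age", "exercise"), ("age", "bp"), ("age", "bmi"),
    ("income", "exercise"), ("income", "diet"), ("income", "stress"),
    ("stress", "exercise"), ("stress", "bp"),
    ("smoking", "bp"),
    ("diet", "bmi"), ("diet", "bp"),
    ("exercise", "bmi"),   # BMI as a mediator of exercise, not a confounder
    ("bmi", "bp"),
}
nodes = sorted({n for e in edges for n in e})

non_edges = [
    (a, b) for a, b in combinations(nodes, 2)
    if (a, b) not in edges and (b, a) not in edges
]
for a, b in non_edges:
    print(f"Assumed: no direct effect between {a} and {b} -- defensible?")
```

With 8 nodes there are 28 possible pairs, so 13 edges leave 15 non-edges: 15 separate "no direct effect" assumptions the human gets asked about.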

It identified the effect via the backdoor criterion using DoWhy, stated the adjustment set, and told me: "This identification rests entirely on the assumption of no unobserved confounders. Are you comfortable defending this to your medical board?"

Then — and only then — it wrote a PyMC model: prior predictive check, nutpie sampler, full diagnostics, using my Bayesian workflow skill.

Then, it ran three DoWhy refutation tests (random common cause, placebo treatment, data subset — all passed) and a sensitivity analysis for unobserved confounding. The tipping point was at effect strength ~1.0, which is ~2.8x stronger than the strongest measured confounder. The agent rated this "marginal" and flagged that antihypertensive medications could plausibly be a confounder that strong.
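The placebo-treatment idea is simple enough to show from scratch. This is a minimal stdlib-only sketch in the spirit of DoWhy's placebo refuter, on synthetic data I made up for illustration: replace the real treatment with a shuffled one and check that the estimated "effect" collapses toward zero.

```python
# Minimal placebo-treatment refutation sketch, in the spirit of DoWhy's
# placebo refuter: shuffle the treatment labels and re-estimate. If the
# "effect" survives shuffling, the pipeline is finding structure that
# cannot be causal. Data here is synthetic and illustrative.

import random

random.seed(42)
n = 2000
treatment = [random.random() < 0.5 for _ in range(n)]
# outcome depends on treatment (true effect = -2) plus noise
outcome = [(-2.0 if t else 0.0) + random.gauss(0, 1) for t in treatment]


def diff_in_means(t, y):
    y1 = [yi for ti, yi in zip(t, y) if ti]
    y0 = [yi for ti, yi in zip(t, y) if not ti]
    return sum(y1) / len(y1) - sum(y0) / len(y0)


real_effect = diff_in_means(treatment, outcome)

placebo = treatment[:]
random.shuffle(placebo)  # breaks any real treatment-outcome link
placebo_effect = diff_in_means(placebo, outcome)

print(round(real_effect, 2))     # close to the true effect of -2
print(round(placebo_effect, 2))  # close to 0: the refutation passes
```

A passing refutation doesn't prove the effect is causal; a failing one is strong evidence that something in the pipeline is broken.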

The report adapted for the dual audience: clinicians got plain language ("roughly 9 in 10 chance the effect is negative, which means exercise likely reduces blood pressure"); statisticians got the full DAG, identification result, and posterior plots.

And the conclusion used suggestive causal language: "Evidence is suggestive of a causal effect, but the marginal sensitivity analysis means we cannot make definitive causal claims."

The agent refused to say "causes" — because the evidence didn't fully earn it.

The numbers

I ran 6 test scenarios covering the major causal inference designs. Here are the eval pass rates:

Scenario                              With skill   Without skill
DiD: policy evaluation                100%         50%
DiD: parallel trends violation        100%         90%
RDD: scholarship threshold            100%         70%
Synthetic control: poor donor pool    100%         80%
Structural mediation                  100%         60%
Observational: confounders + DAG      100%         64%
Overall                               100%         68%

100% vs 68%. A 32-point gap.

The without-skill agent isn't bad — it consistently identifies the right design (DiD, RDD, SC, mediation). It knows causal inference, but it makes three systematic mistakes:

  • Never asks the user. Causal inference requires domain knowledge that isn't in the data. The without-skill agent never pauses to ask whether its assumptions are correct. The skilled agent has 4 mandatory checkpoints where it stops and asks.
  • No systematic refutation. The without-skill agent sometimes runs a robustness check, sometimes doesn't. The skilled agent always runs design-specific refutation tests — parallel trends for DiD, McCrary density for RDD, leave-one-out donors for synthetic control — plus general sensitivity analysis. Every time.
  • Overconfident causal language. The without-skill agent says "causes" when "is associated with" would be more honest. The skilled agent calibrates its language to the evidence: causal when refutation passes, suggestive when it's marginal, associational when it fails.
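The third fix is essentially a lookup from evidence to language. Here is a hypothetical sketch of that mapping; the labels and thresholds are my illustration, not the skill's actual rules:

```python
# Hypothetical sketch: calibrate causal language to the evidence instead
# of defaulting to "causes". Labels and categories are illustrative.

def causal_language(refutations_passed: bool, sensitivity: str) -> str:
    """Map refutation results to the strongest defensible claim.

    sensitivity: "robust", "marginal", or "fragile", summarizing the
    unobserved-confounding sensitivity analysis.
    """
    if not refutations_passed:
        return "is associated with"
    if sensitivity == "robust":
        return "causes"
    if sensitivity == "marginal":
        return "is suggestive of a causal effect on"
    return "is associated with"


# The exercise example from the post: refutations passed, but the
# sensitivity analysis was only marginal.
print(f"Exercise {causal_language(True, 'marginal')} blood pressure.")
```

The value isn't in the four lines of logic, it's in making the claim strength an explicit function of the checks, instead of a vibe.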

These aren't edge cases. They're the gap between analysis-you-can-defend and analysis-that-looks-defensible.

The scenario I'm most proud of: DiD with a parallel trends violation stated in the prompt.

The user says: "The treatment cities were already on a downward trend before the intervention, probably because they were richer cities with better healthcare access. Can you still estimate the causal effect?"

The without-skill agent handles this well — 90% pass rate. It recognizes the violation, runs a naive DiD as illustration, then proposes corrections (trend-adjusted DiD, synthetic control, Rambachan-Roth bounds). Genuinely good work.

The skilled agent does all of that plus: it draws a DAG showing wealth as an unobserved confounder affecting both treatment adoption and outcome trajectory. It asks the user about confounders. It downgrades causal language explicitly: "associated with, not caused, because a critical refutation test failed". And it quantifies the bias: the naive DiD overestimates by 87% compared to the trend-adjusted estimate.
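The trend adjustment is just arithmetic once you have a pre-period to measure the differential trend. The numbers below are made up to show the mechanism (they are not the eval scenario's data, and the resulting bias doesn't match the 87% figure above):

```python
# Toy arithmetic: how a pre-existing trend inflates a naive DiD, and what
# a trend-adjusted estimate removes. All numbers are made up.

# group means by period, with one extra pre-period to measure the trend
treated = {"pre2": 50.0, "pre1": 47.0, "post": 42.0}
control = {"pre2": 50.0, "pre1": 49.0, "post": 47.0}

# Naive DiD: (treated change) - (control change) over the treatment window
naive = (treated["post"] - treated["pre1"]) - (control["post"] - control["pre1"])

# Differential pre-trend: how much faster treated was already falling
pre_trend = (treated["pre1"] - treated["pre2"]) - (control["pre1"] - control["pre2"])

# Trend-adjusted DiD: subtract the decline that would have continued anyway
adjusted = naive - pre_trend

print(naive)     # -3.0: the naive estimate
print(adjusted)  # -1.0: after removing the -2.0 differential pre-trend
```

In this toy setup the naive estimate triples the true effect, which is exactly the kind of gap the parallel-trends refutation is there to catch.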

The difference isn't that the without-skill agent gets it wrong. It's that the skilled agent documents exactly what could go wrong and adjusts its claims accordingly. That's the difference between an analysis you can defend in a seminar room and one you can defend in a courtroom.

What the skill covers

The skill supports the full landscape of Bayesian causal inference:

Quasi-experimental designs (via CausalPy, from the brilliant Ben Vincent):

  • Difference-in-Differences (including staggered)
  • Synthetic Control
  • Interrupted Time Series (including piecewise)
  • Regression Discontinuity (including regression kink)
  • Instrumental Variables
  • Inverse Propensity Score Weighting

Structural causal models (via PyMC):

  • pm.do() for interventions
  • pm.observe() for conditioning
  • Counterfactual queries
  • Mediation analysis (NDE/NIE decomposition)
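In the linear case, the NDE/NIE decomposition can be verified by hand, which makes it a good mental model for what the PyMC version computes with pm.do(). The structural equations and coefficients below are illustrative, not fitted values:

```python
# Sketch of an NDE/NIE decomposition in a linear structural causal model
# (exercise -> BMI -> blood pressure, plus a direct path). In the linear
# case the decomposition is exact. Coefficients are illustrative.

a = -0.3   # exercise -> bmi (first leg of the mediated path)
b = 2.0    # bmi -> blood pressure (second leg of the mediated path)
c = -0.9   # exercise -> blood pressure (direct path)


def bmi(x):
    return a * x


def bp(x, m):
    return b * m + c * x


def total_effect(x0, x1):
    return bp(x1, bmi(x1)) - bp(x0, bmi(x0))


# Natural direct effect: change exercise, hold BMI at its baseline value
nde = bp(1, bmi(0)) - bp(0, bmi(0))
# Natural indirect effect: hold exercise, let BMI respond to the change
nie = bp(1, bmi(1)) - bp(1, bmi(0))

print(nde)                 # -0.9: the direct path c
print(nie)                 # -0.6: the mediated path a * b
print(total_effect(0, 1))  # -1.5, and NDE + NIE equals the total effect
```

This is also why adjusting for a mediator is dangerous: conditioning on BMI blocks the a * b path and silently reports only the direct effect.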

Identification (via DoWhy):

  • DAG specification and backdoor/frontdoor criteria
  • Formal adjustment set computation
  • Refutation tests (random common cause, placebo, data subset, sensitivity to unobserved confounding)

And it composes with the bayesian-workflow skill for all the PyMC mechanics — priors, sampling, diagnostics, calibration. No duplication.

Install

Just ask your agent to "Install the causal-inference skill: https://github.com/Learning-Bayesian-Statistics/baygent-skills/tree/main/causal-inference" and you should be good to go! Bonus: the skill will auto-install bayesian-workflow if it's not already present ;)

As always, it works with Claude Code, Cursor, Gemini CLI, Kimi Code and any agent supporting the Agent Skills spec.

The philosophy

Allow me to get philosophical for a moment -- I am French, after all. There's a deeper principle here: most AI tools are designed to do more, faster. The causal-inference skill is designed to do less, carefully. Here I was really inspired by the work of Cal Newport, especially around slow productivity.

I wanted to force the agent to slow down. To draw the DAG before touching data. To state assumptions before estimating effects. To run robustness checks before claiming causality. To ask the famous human-in-the-loop before assuming domain knowledge. All stuff I teach my own students!

This isn't a limitation. It's the whole point.

Causal inference is the one domain where confidence without rigor is actively dangerous. A wrong soccer match prediction is annoying. A wrong causal claim about a drug, a policy, or an intervention can hurt people. The skill's job isn't to make causal inference faster — it's to make it harder to do wrong.

An agent that refuses to say "causes" until it's earned the right? That's not a bug. That's the feature.

The causal-inference skill is open source and part of the baygent-skills collection. If you try it and something doesn't work, open an issue. If the bayesian-workflow experience is any indication, your feedback will make v1.1 better for everyone 😊

On that note, PyMCheers, my dear Bayesians!

Alexandre Andorra