Classic web A/B testing assumes a deterministic system: show variant A or B, measure conversions, declare a winner. Answer engine optimization (AEO) breaks that assumption because AI engines are non-deterministic and you rarely control the surface where the result appears. This article covers how to experiment rigorously anyway.
Why AEO experimentation is different
In conventional conversion rate optimization (CRO), you split traffic and the platform serves exactly what you tell it. With answer engines, the same prompt can return different responses on repeat runs, and you can’t split the model’s “traffic.” The implications are explored in why queries return different results.
This means you cannot run a true randomized A/B test against a live model. Instead, you run before-and-after experiments and cohort comparisons, then control for noise with sample size and repetition.
Designing a valid AEO experiment
A defensible experiment has five parts:
- Hypothesis. State it concretely: “Adding an FAQ section with direct answers will increase our citation rate on how-to prompts.”
- A single isolated variable. Change one thing — the FAQ block — not the FAQ block and a new title and fresh schema. If you change three things, you learn nothing about which one worked.
- A defined prompt set. Lock a fixed list of 30-100 representative prompts before you start. Building this set is covered in prompt monitoring strategy.
- A baseline window. Measure the prompt set repeatedly for one to two weeks before the change.
- A metric. Citation rate, share of voice, or sentiment — defined up front so you don’t move the goalposts later.
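The five parts above can be captured in a small record that is checked before the baseline starts. This is a minimal sketch rather than a prescribed format: the `ExperimentPlan` class, its field names, and the `validate` checks are hypothetical, and the thresholds simply mirror the 30-100 prompt range and the one-week minimum baseline mentioned above. The dates and prompts are placeholders.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ExperimentPlan:
    """Hypothetical record of an AEO experiment, locked before measurement."""
    hypothesis: str
    variable: str              # the single change being tested
    prompts: tuple[str, ...]   # fixed prompt set
    metric: str                # e.g. "citation_rate", defined up front
    baseline_start: date
    change_shipped: date

    def validate(self) -> list[str]:
        """Return a list of design problems; an empty list means none found."""
        problems = []
        if not 30 <= len(self.prompts) <= 100:
            problems.append("prompt set should hold 30-100 prompts")
        if len(set(self.prompts)) != len(self.prompts):
            problems.append("prompt set contains duplicates")
        if (self.change_shipped - self.baseline_start).days < 7:
            problems.append("baseline window is shorter than one week")
        return problems


plan = ExperimentPlan(
    hypothesis="An FAQ section with direct answers raises citation rate on how-to prompts",
    variable="FAQ block",
    prompts=tuple(f"how to do task {i}" for i in range(40)),  # placeholder prompts
    metric="citation_rate",
    baseline_start=date(2025, 1, 6),    # illustrative dates
    change_shipped=date(2025, 1, 20),
)
print(plan.validate())  # an empty list means the plan passes these checks
```

Freezing the dataclass is a deliberate choice here: once the baseline starts, the prompt set and metric should not be editable in place.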
Controlling for non-determinism
Because a single prompt run is noisy, never judge a result from one query. Instead:
- Repeat each prompt multiple times per measurement window and average the results.
- Use a fixed prompt set so day-to-day comparisons are apples to apples.
- Track across engines. A change that helps in one engine may not transfer; use multi-engine monitoring to see the full picture.
- Watch the trend, not the day. Look for a sustained shift in the average, not a single spike.
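As a rough illustration of the repeat-and-average approach, the sketch below computes a citation rate from repeated runs of a fixed prompt set and smooths daily values with a rolling mean. The data layout (a dict mapping each prompt to a list of True/False citation outcomes) and the function names are assumptions for illustration; how outcomes are collected from each engine is out of scope here.

```python
def citation_rate(runs: dict[str, list[bool]]) -> float:
    """Mean per-prompt citation frequency for one measurement window."""
    per_prompt = [sum(hits) / len(hits) for hits in runs.values() if hits]
    if not per_prompt:
        raise ValueError("no prompt runs recorded")
    return sum(per_prompt) / len(per_prompt)


def rolling_mean(daily: list[float], window: int = 7) -> list[float]:
    """Rolling average of daily values; shorter than the input by window - 1."""
    return [
        sum(daily[i : i + window]) / window
        for i in range(len(daily) - window + 1)
    ]


# Made-up outcomes: each prompt was run four times in the window.
window_runs = {
    "how to descale a kettle": [True, False, True, True],
    "how to reset a router": [False, False, False, False],
}
print(citation_rate(window_runs))                 # 0.375
print(rolling_mean([0.25, 0.5, 0.75], window=2))  # [0.375, 0.625]
```

Averaging within each prompt first keeps a prompt that happened to be run more often from dominating the overall rate.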
The core question is always the same one asked in lift attribution: did the metric move because of your change, or would it have moved anyway?
Isolating variables in practice
The most common experimentation mistake is confounding. Suppose you rewrite a page and citation rate rises. Was it the rewrite — or did a competitor’s page go down, or did the engine refresh its index that week?
To reduce confounding:
- Use a control group. Keep a set of comparable pages unchanged and track them alongside the test pages. If both move together, the cause is most likely external rather than your change.
- Stagger rollouts. Change one page cluster, wait, measure, then change the next. Simultaneous changes are impossible to untangle.
- Document the timeline. Record exactly when each change shipped so you can align it with metric movement.
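One way to put a number on the control-group comparison is a difference-in-differences estimate: the change on the test pages minus the change on the control pages over the same period. This method is not prescribed above; it is offered as a sketch, and the function name and sample values are made up. It assumes the test and control pages would have moved in parallel without the change.

```python
from statistics import mean


def diff_in_diff(
    test_before: list[float],
    test_after: list[float],
    control_before: list[float],
    control_after: list[float],
) -> float:
    """Change on test pages minus change on control pages."""
    test_change = mean(test_after) - mean(test_before)
    control_change = mean(control_after) - mean(control_before)
    return test_change - control_change


# Made-up daily citation rates for each group and period.
lift = diff_in_diff(
    test_before=[0.25, 0.25],
    test_after=[0.5, 0.5],
    control_before=[0.25, 0.25],
    control_after=[0.375, 0.375],
)
print(lift)  # 0.125: half of the raw 0.25 rise also appeared in the control
```

A result near zero suggests the movement on the test pages was shared with the control group and so was probably external.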
Common experiment types
These are reliable, repeatable AEO experiments:
- Content structure tests — adding direct-answer intros, FAQ blocks, or summaries, following content optimization for AI.
- Structured data tests — adding or refining schema and watching whether eligibility for rich extraction improves.
- Freshness tests — updating publish dates and content, measuring whether re-crawl improves inclusion.
- Format tests — comparing prose, tables, and lists to see which an engine extracts most readily.
How long to run and when to call it
Give an experiment at least two to four weeks after the change ships, because engines re-crawl and re-weight gradually. End the experiment when the post-change average has clearly separated from the baseline band (the range the metric occupied during the baseline window) for a sustained period, not on the first promising day.
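The stopping rule above can be sketched as follows. The text does not define the baseline band numerically, so this example assumes one: the baseline mean plus or minus two sample standard deviations, with "sustained" taken to mean seven consecutive daily averages above the band. Both thresholds, and the function names, are illustrative choices to tune rather than recommendations.

```python
from statistics import mean, stdev


def baseline_band(baseline: list[float], k: float = 2.0) -> tuple[float, float]:
    """Assumed band: baseline mean +/- k sample standard deviations.

    Needs at least two baseline values, since stdev is undefined for one.
    """
    centre, spread = mean(baseline), stdev(baseline)
    return centre - k * spread, centre + k * spread


def sustained_separation(
    baseline: list[float],
    post: list[float],
    k: float = 2.0,
    min_days: int = 7,
) -> bool:
    """True if the last min_days post-change values all sit above the band.

    Only upward separation is checked, which suits metrics such as
    citation rate where higher is better.
    """
    _, upper = baseline_band(baseline, k)
    recent = post[-min_days:]
    return len(recent) >= min_days and all(value > upper for value in recent)


# Made-up daily citation rates.
baseline = [0.20, 0.22, 0.18, 0.21, 0.19, 0.20, 0.20]
steady_gain = [0.21, 0.21, 0.21] + [0.30] * 7
single_spike = [0.20] * 6 + [0.40]

print(sustained_separation(baseline, steady_gain))   # True
print(sustained_separation(baseline, single_spike))  # False
```

Requiring every recent value to clear the band is what keeps a single promising day from ending the experiment early.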
Finally, tie wins back to value. A higher citation rate only matters if it drives outcomes; connect experiments to measuring GEO/AEO ROI so you invest in changes that compound.
Frequently Asked Questions
Can I run a true A/B test on an AI engine?
Not in the classic sense. You cannot split a model’s traffic or force it to serve variant A to half of users. The practical alternative is before-and-after testing against a fixed prompt set, combined with a control group of unchanged pages to catch external effects.
How many prompts do I need for a reliable experiment?
A fixed set of roughly 30-100 representative prompts, each measured repeatedly, is a reasonable starting point. Smaller sets are too noisy to separate signal from non-determinism, while very large sets are harder to maintain consistently across an experiment.
How do I avoid being fooled by random fluctuation?
Repeat each prompt multiple times, average the results, and judge changes by sustained shifts in the trend rather than single days. A control group of comparable, unchanged pages is the strongest safeguard: if the control moves with the test, the cause is most likely external, not your change.
How long should an AEO experiment run?
Plan for at least two to four weeks after your change ships, plus a one-to-two-week baseline beforehand. AI engines update their understanding gradually as they re-crawl and re-weight sources, so shorter windows risk measuring crawl timing rather than the effect of your change.