Do Statistics Improve AI Citations? The GEO Data, Tested

Plenty of GEO advice rests on one tidy promise: add statistics, get more AI citations. It gets repeated like settled science, yet almost all of it traces back to a single 2023 paper that very few people quoting it have actually read.

So I read it, and checked the numbers against the source. Here is what the research really measured, where the popular version gets it wrong, and how to test the idea on your own content before you trust it.

Key Takeaways

Most “statistics boost AI citations” numbers trace back to a single 2023 research paper, often misquoted.
“15 statistics = 50% more citations” appears nowhere in that paper; it is a number bolted on later.
The headline +41% lift actually belongs to adding quotations, not statistics; statistics came in around +31% and was never the top method.
The single biggest jump, +115%, came from adding citations to pages that started low in the rankings (rank 5).
Every top method actually reduced visibility for pages already ranked first, so the effect depends heavily on where you start.
The paper measured visibility in a fixed benchmark, not live citations in today’s ChatGPT or AI Overviews. Any GEO stat quoted without a sample size and a date is closer to decoration than evidence.

Where the GEO stat canon actually comes from

You have probably seen the rule stated like law: “Add 15 statistics per article and get 50% more AI citations.” Or that “44% of citations come from the first 30% of the page,” or that “statistics boost AI visibility 30 to 40%.” Almost all of these numbers come from one source: the paper GEO: Generative Engine Optimization by Aggarwal and colleagues (KDD ’24), with authors from Princeton, the Allen Institute for AI, Georgia Tech, and IIT Delhi.

They built a benchmark called GEO-bench, around 10,000 queries across nine sources, and tested different content changes against generative engines using two visibility metrics: a position-adjusted word-count measure and a more subjective “impression” measure. The three content changes that performed best looked like this (relative lift over a no-optimization baseline):

Method (vs no optimization)	Word-count visibility metric	Subjective impression metric
Quotation Addition	~+41%	~+28%
Statistics Addition	~+31%	~+23%
Cite Sources	~+28% (up to +115% for rank-5 pages)	~+13%

Two things jump out the moment you read the table rather than the slogan. First, the paper measured two metrics, not one, and the numbers differ across them, so a single “+41%” with no context is already half a quote. Second, statistics were never the top method: the headline +41% belongs to adding quotations, and the single biggest jump, +115%, came from adding citations to pages that started low in the search rankings (rank 5).

So “statistics boost visibility 30 to 40%” is at least in the right neighbourhood for the word-count metric, where statistics landed around +31%. But “add 15 statistics per article and get 50% more citations” appears nowhere in the paper; the study never prescribed a number of statistics, and it never produced a clean 50% figure for them.

Somewhere between the original PDF and the carousel slide, a benchmark lift on two metrics turned into a precise recipe with a number attached, and the credit drifted onto statistics when the data points elsewhere. That is the part I would push back on, because the number is doing a lot of work the research never supported.

What the paper does and does not let you conclude

The Princeton work is genuinely useful, and I am not here to trash it; it is a serious study and a good starting point. But three caveats tend to get dropped whenever it is quoted, and each one changes how much weight you should put on the “add stats” advice.

It measured “visibility” in a benchmark, not citations in live ChatGPT or AI Overviews. GEO-bench is a controlled harness over specific engines at a specific point in time. Generative surfaces have changed a lot since then, so the same relative effects may not carry over cleanly to today’s production systems.
The standout result was citing sources on low-ranked pages, not adding statistics. Cite Sources lifted pages sitting at rank 5 by +115%, while statistics was never the top method on either metric. The popular “add stats” advice fixated on the catchy line and skipped the lever that actually moved the needle.
The effects were uneven across positions, and could even go negative. Every one of the top methods actually reduced visibility for pages already ranked first (Cite Sources by about -30%), and only paid off further down. The lift depends on where you start; it is not a number you are guaranteed.
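To make the arithmetic behind those percentages concrete, here is a minimal sketch of how a relative lift is computed. The function name and the visibility shares are hypothetical, picked only so the output lands on the same percentages quoted above; they are not values taken from the paper.

```python
def relative_lift(baseline: float, optimized: float) -> float:
    """Percentage change in visibility relative to the unoptimized baseline."""
    if baseline <= 0:
        raise ValueError("baseline visibility must be positive")
    return (optimized - baseline) / baseline * 100.0


# Hypothetical visibility shares (fraction of the answer attributed to a page).
# These are illustrative numbers, NOT data from the GEO paper.
rank5_before, rank5_after = 0.04, 0.086   # low-ranked page
rank1_before, rank1_after = 0.40, 0.28    # page already ranked first

print(f"rank-5 lift: {relative_lift(rank5_before, rank5_after):+.0f}%")
print(f"rank-1 lift: {relative_lift(rank1_before, rank1_after):+.0f}%")
print(f"rank-5 absolute change: {rank5_after - rank5_before:+.3f}")
print(f"rank-1 absolute change: {rank1_after - rank1_before:+.3f}")
```

In this made-up case the large relative gain sits on a small base, so the absolute change for the rank-5 page is smaller than the loss for the rank-1 page. That is one more reason a bare percentage tells you little without the starting point.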

The confounder nobody mentions

Here is the reasoning a working analyst applies that a quick checklist tends to skip. Pages that are dense with statistics, quotations, and citations are not a random sample of the web; they tend to be the work of competent publishers, the same people who already build authority, earn links, and write the kind of content retrieval systems want to surface anyway.

So when you observe that “cited pages have more stats,” you are partly observing that good sites simply do good-site things. Correlation and causation get tangled together, and a single benchmark cannot fully separate them. That does not mean statistics do nothing; it means we should be careful about how much of the lift we credit to the statistics themselves.
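A toy simulation can show how this plays out. In the sketch below, a hidden "quality" score drives both whether a page carries statistics and whether it gets cited, and statistics have no effect of their own. Every probability here is an assumption invented for illustration, not an estimate from any study.

```python
import random


def simulate(n_pages: int = 200_000, seed: int = 0):
    """Generate (quality, has_stats, cited) rows where citations depend on quality only."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_pages):
        quality = rng.random()                            # hidden publisher quality, 0..1
        has_stats = rng.random() < 0.2 + 0.6 * quality    # better publishers add stats more often
        cited = rng.random() < 0.05 + 0.30 * quality      # statistics play no causal role here
        rows.append((quality, has_stats, cited))
    return rows


def citation_rate(rows) -> float:
    rows = list(rows)
    return sum(cited for _, _, cited in rows) / len(rows)


rows = simulate()

# Naive comparison: pages with stats vs pages without.
gap = citation_rate(r for r in rows if r[1]) - citation_rate(r for r in rows if not r[1])

# Same comparison restricted to pages of similar quality.
band = [r for r in rows if 0.45 <= r[0] < 0.55]
band_gap = citation_rate(r for r in band if r[1]) - citation_rate(r for r in band if not r[1])

print(f"naive gap in citation rate:        {gap:+.3f}")
print(f"gap among similar-quality pages:   {band_gap:+.3f}")
```

The naive comparison shows pages with statistics being cited more often, even though the simulation gives statistics no causal role. Once the comparison is limited to pages of similar quality, most of that gap should disappear. A real dataset is messier, but the mechanism is the same.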

This is the same trap that swallowed the schema-markup hype, where a “3x more likely to be cited” correlation got sold as causation, something I unpacked in my breakdown of the 1,885-page schema study. The pattern keeps repeating because correlation dressed as causation is easy to sell: it turns a messy reality into a tidy checklist.

Why testing LLM citations is harder than it looks

Even if you want to test these claims yourself, the ground under you keeps moving, so it helps to know the three things that quietly ruin most casual tests.

Non-determinism. Ask the same model the same question twice and you can get different citations. A single before-and-after screenshot proves almost nothing on its own.
Model and version drift. Providers update models quietly, so a lift you “measure” this week might be the model changing rather than your content.
Prompt sensitivity. Rephrasing the query reshuffles the citations, which means your result is partly a function of how you happened to ask.

Because of all this, any GEO stat quoted without a sample size and a date is closer to decoration than evidence. The honest version of every one of these claims really ends with “in this study, on these engines, at that time.”
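One way to see why sample size matters is to put an interval around an observed citation rate. The sketch below uses the Wilson score interval, a standard method for proportions; the function name and the example counts are my own illustrative choices.

```python
import math


def wilson_interval(cited: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a citation rate of cited/runs."""
    if runs <= 0 or not 0 <= cited <= runs:
        raise ValueError("need runs > 0 and 0 <= cited <= runs")
    p = cited / runs
    denom = 1 + z * z / runs
    centre = (p + z * z / (2 * runs)) / denom
    half = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical counts: one screenshot, ten runs, one hundred runs.
for cited, runs in [(1, 1), (6, 10), (60, 100)]:
    low, high = wilson_interval(cited, runs)
    print(f"{cited}/{runs} cited -> plausible rate between {low:.2f} and {high:.2f}")
```

A single run in which your page was cited is consistent with a true citation rate anywhere from roughly one in five to always, which is why one screenshot settles so little. The interval narrows only as the number of runs grows.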

How to run a defensible self-test

If you want a result you can actually trust on your own niche, here is the approach I would use. It is more work than a screenshot, but it is the difference between an opinion and a measurement.

Pick a set of comparable pages, matched on topic, authority, and current citation rate, then split them into a test group and a control.
On the test group, apply one change at a time: add well-sourced statistics, or add source citations, but not both, so you keep one variable per experiment.
Sample citations repeatedly, for example 10 runs per query across the engines you care about, and record the rate rather than a single result, to beat the non-determinism.
Re-measure over 8 to 12 weeks and compare the test group’s change against the control’s, so any model drift hits both groups equally.
Only then decide whether the lever actually moved your niche, on your engines, today.
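The comparison in the last two steps is a difference-in-differences: the test group's change minus the control group's change. A minimal sketch follows, assuming you have already recorded a citation rate per page before and after the change. All names and numbers are hypothetical.

```python
from statistics import mean


def diff_in_diff(test_before, test_after, control_before, control_after) -> float:
    """Change in the test group's mean citation rate minus the control group's change."""
    test_change = mean(test_after) - mean(test_before)
    control_change = mean(control_after) - mean(control_before)
    return test_change - control_change


# Hypothetical per-page citation rates (share of sampled runs that cited the page).
test_before = [0.2, 0.3, 0.1]
test_after = [0.4, 0.4, 0.3]
control_before = [0.2, 0.2, 0.2]
control_after = [0.3, 0.3, 0.3]

effect = diff_in_diff(test_before, test_after, control_before, control_after)
print(f"estimated effect of the change: {effect:+.3f}")
```

In this made-up data the control group also improved, which is what model drift can look like, so only the part of the test group's gain that exceeds the control's is credited to the content change. With only a handful of pages the estimate is still noisy, so treat it as a direction rather than a precise number.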

So, do statistics improve AI citations?

In my opinion, statistics, quotes, and citations do seem to help, and for a sensible reason: they make content more specific, more verifiable, and more quotable, which is exactly what retrieval systems tend to reward. So they are worth adding when they genuinely belong in the piece.

What I would not do is treat “add 15 stats for 50% more citations” as a rule. That number is bolted onto the research rather than found in it, and the paper’s own data points elsewhere: quotations led the headline metrics, the single biggest jump came from citing sources on pages that started low in the rankings, and statistics was never the top method. Write well-sourced, specific, original content because it is better content, not because a misquoted benchmark promised you a percentage.

And when someone hands you a GEO stat, ask for the study, the sample size, and the date. The good claims survive that question; most of the canon does not. If you want to see how the same “data-driven” framing props up other SEO myths, my teardown of the fake-ranking-drop patent narrative is a useful companion read.

Update Logs

14 Jun 2026

Rewrote the article in a clearer, measured voice and tightened the argument.
Corrected the findings table against the source PDF: the +41% lift belongs to Quotation Addition (word-count metric), Statistics Addition is about +31%, and the +115% is Cite Sources for rank-5 pages.
Added the finding that the top methods reduced visibility for pages already ranked first, and adjusted the Key Takeaways and conclusion to match.

13 Jun 2026

Scheduled for publishing.
