Introducing the Marketing Factuality Benchmark

When you ask an AI tool how your ads are doing, you act on the answer. You shift budget, you kill a campaign, you tell your client the quarter is on track. So here's an uncomfortable question: how do you know the answer is right?

We kept running into the same thing. Ask most AI assistants a plain question about your Meta campaigns — “which one had the best CPC?” — and you get a confident, well-formatted paragraph. Clean prose, a tidy number, not a hint of doubt. And every so often, quietly, it's wrong. Wrong metric. Stale window. A figure that was never in your account at all. The confidence never wavers; only the facts do.

That's the problem we built Outfox to solve. And today we're publishing the receipts.

Introducing MktF1, the Marketing Factuality Benchmark

MktF1 measures one thing, precisely: when you ask an AI system about your real ad performance, how often does it tell you the truth? Not how fluent it sounds, not how fast it answers — whether the number is correct.

As far as we can tell, it's the first benchmark of its kind. There are benchmarks for reasoning, for coding, for medical exams. There wasn't one for the deceptively simple task of reading a marketer's ad account and reporting back without making things up. So we built it — and we're holding ourselves to it in public.

Why we did it the hard way

We could have published a benchmark we were guaranteed to win. Everyone does. Instead, we ran the version we'd trust if a competitor had built it. A real ad account. Ground truth pulled straight from Meta's own API. The questions a marketing team actually asks on a normal Tuesday. And our system measured right next to the obvious alternatives.

And what you see here is only the public slice. Behind this snapshot are many more tests, run the same way — more questions, more accounts, more answers compared head to head. We're publishing the version that's simplest to understand and that you can run yourself.

If we're going to ask you to act on Foxley's answers, the bar isn't “trust us.” It's “here's the test, here's the data, run it yourself.”

What “factual” actually means

Getting an ad answer right is more than returning a number. A good answer has four parts — miss any one and you've quietly misled someone:

The right metric. Link CPC for a traffic campaign, not all-clicks CPC. Those are different numbers, and only one answers the question you meant.
The right number. It matches the source — the exact number Meta reports, not a close-enough guess.
The right scope. Your campaigns, in the window you asked about — not a lifetime total, not last year's flight dragged in for company.
Nothing invented. No fabricated figures, no manufactured problems, no “here's a metric” for a metric that doesn't exist.
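To make the four parts concrete, here is a minimal sketch of how a single answer could be checked against them. It is an illustration under assumptions, not the MktF1 scoring code: the `Truth` and `Answer` types, the metric names, the dates, and the tolerance are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Truth:
    metric: str   # metric the question is really about, e.g. "link_cpc"
    value: float  # figure taken from the source of truth
    since: str    # window start (ISO date)
    until: str    # window end (ISO date)


@dataclass(frozen=True)
class Answer:
    metric: str   # metric the system actually reported
    value: float
    since: str
    until: str


def check_answer(answer: Answer, truth: Truth, known_metrics: set[str],
                 tolerance: float = 0.005) -> dict[str, bool]:
    """Return one pass/fail flag for each of the four parts."""
    return {
        "right_metric": answer.metric == truth.metric,
        "right_number": abs(answer.value - truth.value) <= tolerance,
        "right_scope": (answer.since, answer.until) == (truth.since, truth.until),
        # Hypothetical rule: a metric the account does not have counts as invented.
        "nothing_invented": answer.metric in known_metrics,
    }


# Made-up example: the system answered with all-clicks CPC instead of link CPC.
truth = Truth("link_cpc", 1.00, "2025-01-01", "2025-01-31")
answer = Answer("cpc_all", 0.50, "2025-01-01", "2025-01-31")
result = check_answer(answer, truth, known_metrics={"link_cpc", "cpc_all", "spend"})
print(result)
```

In this made-up case the scope is right and nothing is invented, but the metric and the number both fail, which is the "quietly misled" outcome described above.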

How we tested it

We started with a real, fast-growing ad account — a month of live delivery across seven campaigns, no fake data. Before running any system, we pulled the true figures from the Meta Marketing API for that exact window. That's our ground truth.
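For readers who want to pull their own ground truth, the sketch below builds the query parameters for a campaign-level request to the Marketing API's Ads Insights endpoint. It is an assumption about how such a pull could look, not a record of the exact call we made: the field selection and the dates are illustrative, and the function only builds parameters without sending anything.

```python
import json


def build_insights_params(since: str, until: str) -> dict[str, str]:
    """Query parameters for a campaign-level Ads Insights request.

    Intended for GET https://graph.facebook.com/<version>/act_<ad_account_id>/insights
    (the version and account ID are placeholders; an access token is also required).
    """
    fields = [
        "campaign_name",
        "spend",
        "clicks",                      # all clicks
        "cpc",                         # cost per click, all clicks
        "ctr",                         # click-through rate, all clicks
        "inline_link_clicks",          # link clicks only
        "cost_per_inline_link_click",  # link CPC
        "inline_link_click_ctr",       # link CTR
    ]
    return {
        "level": "campaign",
        "time_range": json.dumps({"since": since, "until": until}),
        "fields": ",".join(fields),
    }


# Hypothetical one-month window, for illustration only.
params = build_insights_params("2025-01-01", "2025-01-31")
print(params["fields"])
```

Requesting both the all-clicks and the link-click fields in one call keeps the two families of numbers side by side, which makes later metric mix-ups easier to spot.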

Then we asked twelve questions, ranging from “what was my total spend?” to “how did my campaigns do?” And we asked each one two ways:

once naming the exact metric (“what was the link CTR?”), and
once the way a marketer actually phrases it (“which campaign had the best CPC?”).

The gap between those two is the kicker. There are multiple ways to measure clicks, video views, attribution. Naming the exact, disambiguated metric lets a system fetch a field. Phrasing it like a human forces it to choose and correct the metric — and that's where the wheels tend to come off. We scored every answer against the ground truth, 1 to 5, on four things: accuracy, completeness, actionability, and trust.
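One way to turn 1-to-5 rubric scores into a single percentage per phrasing is sketched below. The aggregation rule (points earned divided by points possible) is an assumption made for illustration, and the scores in the example are invented; only the four dimension names and the 1-to-5 scale come from the description above.

```python
DIMENSIONS = ("accuracy", "completeness", "actionability", "trust")
MAX_SCORE = 5


def factuality_percent(scored_answers: list[dict[str, int]]) -> float:
    """Points earned across all answers and dimensions, as a percentage of the maximum."""
    earned = sum(answer[dim] for answer in scored_answers for dim in DIMENSIONS)
    possible = MAX_SCORE * len(DIMENSIONS) * len(scored_answers)
    return round(100 * earned / possible, 1)


# Invented rubric scores for two questions, each asked both ways.
named = [
    {"accuracy": 5, "completeness": 5, "actionability": 5, "trust": 5},
    {"accuracy": 5, "completeness": 4, "actionability": 5, "trust": 5},
]
natural = [
    {"accuracy": 5, "completeness": 5, "actionability": 5, "trust": 5},
    {"accuracy": 3, "completeness": 4, "actionability": 4, "trust": 4},
]

named_pct = factuality_percent(named)        # 39 of 40 points
natural_pct = factuality_percent(natural)    # 35 of 40 points
gap = round(named_pct - natural_pct, 1)      # drop when the metric is not named
print(named_pct, natural_pct, gap)
```

Under this assumed rule, a lowest-possible score of 1 on every dimension still yields 20%, so percentages should be read relative to that floor.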

The results

We compared three ways to get the answer, all reading the same Meta data. First, Foxley — our harness. Second, Claude (Opus 4.8) on Meta's official Ads MCP: the same frontier model, on Meta's own thinner tool layer. Third, Meta AI (Meta's built-in assistant). We scored each against Meta's own API, across all twelve questions:

Factuality score by surface on the MktF1 Marketing Factuality Benchmark
Surface	Factuality score
Foxley (Outfox)	99.4%
Claude with Meta's MCP	91%
Meta AI	60%

Foxley got every number right across all twelve questions, with no fabrications — whether you use the in-app chat or connect through the MCP suite. Same tools, same result.

Where the gap comes from

All three read the same data. The tool layer between the model and the numbers decides the answer.

No metric guidance. Meta's official Ads MCP exposes no link-click cost field, so the same model grabbed all-clicks numbers for “best CPC” and crowned the wrong winner — the right number for the wrong metric.
No guardrails. Meta AI invented metric values — including campaign costs off by more than 10× — and held them steady across follow-up questions. It flagged healthy campaigns as problems and pulled in an ad from years outside the window.

These are exactly the errors a harness exists to catch. Foxley chooses and corrects the metric for the question. Ask about a traffic campaign and it reports link CPC, and says so, instead of grabbing whatever field you happened to name.
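The all-clicks versus link-click mix-up is easy to reproduce with arithmetic. The sketch below uses two invented campaigns (the names, spend, and click counts are made up) to show how the "best CPC" winner flips depending on which click count goes in the denominator.

```python
campaigns = [
    # Made-up figures for illustration only.
    {"name": "A", "spend": 100.0, "all_clicks": 500, "link_clicks": 50},
    {"name": "B", "spend": 100.0, "all_clicks": 200, "link_clicks": 100},
]


def best_by_cpc(rows: list[dict], clicks_field: str) -> str:
    """Name of the campaign with the lowest spend per click for the given click type."""
    eligible = [r for r in rows if r[clicks_field] > 0]  # avoid dividing by zero
    return min(eligible, key=lambda r: r["spend"] / r[clicks_field])["name"]


print(best_by_cpc(campaigns, "all_clicks"))   # A wins at $0.20 per click of any kind
print(best_by_cpc(campaigns, "link_clicks"))  # B wins at $1.00 per link click
```

Both results are arithmetically correct. Only the second answers the question a marketer running a traffic campaign is actually asking.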

The part we're proudest of

Here's the number that actually matters. When we stopped naming the metric and asked the way a marketer would, Foxley stayed flat — 99.2% and 99.6% across the two phrasings. The others dropped: Claude on Meta's MCP by about 18 points, Meta AI by about 11.

A tool that's accurate only when you already know which metric to name isn't solving your problem. It's grading your homework.

Robustness — staying right when the question is vague — is the point of a harness. It turns a tool you'd have to babysit into one you can hand the work to.

We re-ran it on a newer model, too

Benchmarks rot. Models change, surfaces change, and we have to stay on top of both, so we will keep re-running this test and publishing updates. We have already replicated it with Fable 5, Anthropic's newest flagship: Foxley (powered by Fable) scored 100% vs. Claude (Fable) with Meta's official Ads MCP at 94%.

The numbers barely moved on a much more capable model. So we read the small shifts from the Opus run as run-to-run variance, not a model effect — the tool harness, not the model, drives factuality.

Run it on your own account

We benchmarked against one account, one time window, twelve questions. Want the full question set? Email us at hello@outfox.ai with “MktF1” in the subject line and we'll send it over. MktF1 measures factual accuracy only. Creative, strategy, and growth execution get their own benchmarks, which we'll share as they mature. And you definitely don't have to take our word for it: ask Foxley a set of questions about your own campaign metrics, then check each answer against Ads Manager.

The reason we started on this journey: many late evenings spent trying to do right by our team, reading numbers off a screen and betting real budget on them.

You deserve a number that's actually true.

That's what Foxley is for, and it's what we measured here. See how Outfox Analytics works, or try it on your own account.