How Often Do AI Engines Agree? A Cross-Provider Consistency Study
When someone asks ChatGPT "what's the best CRM?", does Claude give the same answer? What about Perplexity, Gemini, and Grok? We ran thousands of identical prompts across five major AI engines to find out. The results have significant implications for how you approach AI search optimization.
Methodology
We queried five AI providers — ChatGPT, Claude, Perplexity, Gemini, and Grok — with identical prompts across 12 categories (SaaS tools, e-commerce platforms, financial services, healthcare, legal, education, travel, food delivery, fitness, cybersecurity, HR tech, and marketing). Each prompt was run 3 times per provider to account for response variability, totaling over 4,500 individual analyses.
For each response, the Sellm API extracted the ordered list of brand mentions, giving us a structured dataset to compare across providers.
Key Finding #1: Cross-Provider Agreement Is Low
On average, only 34% of brands mentioned by one provider also appeared in another provider's response to the same prompt. In other words, if ChatGPT recommends Brand X, there's roughly a 1-in-3 chance that Claude also recommends it.
| Provider Pair | Agreement Rate |
|---|---|
| ChatGPT ↔ Perplexity | 42% |
| ChatGPT ↔ Claude | 38% |
| ChatGPT ↔ Gemini | 35% |
| Perplexity ↔ Claude | 33% |
| ChatGPT ↔ Grok | 29% |
| Claude ↔ Grok | 24% |
ChatGPT and Perplexity showed the highest agreement (42%), likely because both use web search to ground their responses. Claude and Grok had the lowest agreement (24%), reflecting fundamentally different training data and retrieval approaches.
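The pairwise agreement figures above can be computed as a Jaccard overlap: shared brands divided by all distinct brands across the two providers. A minimal sketch (the brand names are illustrative, not from the study data):

```python
def agreement_rate(brands_a, brands_b):
    """Jaccard overlap between two providers' brand mentions, as a percentage."""
    a = {x.lower() for x in brands_a}  # case-insensitive comparison
    b = {x.lower() for x in brands_b}
    union = a | b
    return len(a & b) / len(union) * 100 if union else 0.0

# 2 shared brands out of 4 distinct ones -> 50% agreement
print(agreement_rate(["HubSpot", "Salesforce", "Zoho"],
                     ["Salesforce", "HubSpot", "Pipedrive"]))  # → 50.0
```

Because the metric is symmetric and normalized by the union, a provider that lists many extra brands is penalized just as much as one that omits shared ones.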
Key Finding #2: Category Matters Enormously
Agreement rates varied dramatically by category:
| Category | Avg Agreement | Likely Reason |
|---|---|---|
| Developer tools | 52% | Clear market leaders, objective metrics |
| Enterprise SaaS | 47% | Well-established category, extensive reviews |
| Cybersecurity | 44% | Technical evaluations, analyst reports |
| Marketing tools | 35% | Fragmented market, many viable options |
| E-commerce platforms | 31% | Use-case dependent, regional variation |
| Food delivery | 23% | Highly local, personal preference driven |
| Fitness apps | 19% | Subjective, lifestyle-dependent |
The pattern is clear: categories with objective, widely documented leaders show higher agreement. Subjective or local categories show much lower agreement.
Key Finding #3: Some Brands Dominate Across All Providers
A small set of "universal brands" appeared consistently across all 5 providers. These brands typically shared three traits:
- Strong entity signals — Wikipedia pages, consistent naming, rich schema markup
- Extensive third-party coverage — reviews on G2, Capterra, TrustRadius, and industry publications
- Answer-style content — FAQ pages, comparison guides, "vs" articles on their own sites
Meanwhile, brands that appeared on only one provider typically relied heavily on SEO-optimized content without broader authority signals.
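One of the entity signals above, schema markup, can be emitted as JSON-LD. A hedged sketch of an Organization entry; the brand, URLs, and profile links are hypothetical placeholders, not a prescribed format:

```python
import json

# Hypothetical Organization schema for a SaaS brand; all values are illustrative.
org_schema = {
    "@context": "https://schema.org",
    "@type": "Organization",
    "name": "ExampleCRM",  # should match the naming used consistently across the web
    "url": "https://www.example.com",
    "sameAs": [  # ties the entity to third-party profiles AI engines can corroborate
        "https://en.wikipedia.org/wiki/ExampleCRM",
        "https://www.g2.com/products/examplecrm",
    ],
}
print(json.dumps(org_schema, indent=2))
```

The `sameAs` links are the part that maps most directly to the "extensive third-party coverage" trait: they connect the on-site entity to the review sites and reference pages engines already trust.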
Key Finding #4: Same Provider, Different Answers
Even within a single provider, responses varied between runs. On average, a brand mentioned in one replicate of a ChatGPT response appeared in only 68% of the other replicates for the same prompt. This is why Sellm supports configurable replicates — a single query is never statistically reliable.
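One way to quantify within-provider stability is to ask, for every ordered pair of replicates, what share of brands in one run reappears in the other, then average. Whether this matches the study's exact definition of the 68% figure is an assumption; the sketch below just illustrates the idea:

```python
from itertools import combinations

def replicate_consistency(replicates):
    """Average share of brands from one run that reappear in another run.

    `replicates` is a list of brand lists, one per run of the same prompt."""
    sets = [{b.lower() for b in r} for r in replicates]
    rates = []
    for a, b in combinations(sets, 2):
        if a:
            rates.append(len(a & b) / len(a))  # a's brands found in b
        if b:
            rates.append(len(b & a) / len(b))  # b's brands found in a
    return sum(rates) / len(rates) * 100 if rates else 0.0

# Three runs where one brand swaps in and out -> roughly two-thirds consistency
print(replicate_consistency([["X", "Y"], ["X", "Z"], ["X", "Y"]]))
```

Unlike the cross-provider Jaccard metric, this measure is directional before averaging, so a short run nested inside a longer one still scores high from the short run's side.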
Implications for Your GEO Strategy
1. You Must Monitor Multiple Providers
If you only track ChatGPT, you're blind to 58–76% of what the other engines recommend (the inverse of the pairwise agreement rates above). Each provider has its own biases, data sources, and recommendation patterns.
2. Category Strategy Matters
In high-agreement categories (dev tools, enterprise SaaS), winning on one provider likely means winning on others. In low-agreement categories (lifestyle, local services), you need provider-specific optimization.
3. Replicates Are Essential
A single query tells you almost nothing. Run at least 3 replicates per prompt to get meaningful coverage and position data.
4. Universal Authority Signals Win
Brands that appear across all providers invest in entity signals and third-party coverage, not just SEO. Focus on being mentioned on authoritative sites, not just ranking on Google.
Replicate This Study with the Sellm API
You can run your own cross-provider consistency analysis using the Sellm API:
```python
import requests
import time
from collections import defaultdict

API_KEY = "your_sellm_api_key"
BASE_URL = "https://sellm.io/api/v1"
PROVIDERS = ["chatgpt", "claude", "perplexity", "gemini", "grok"]

prompts = [
    "best project management tool for startups",
    "top CRM software 2026",
    "best email marketing platform",
]

def run_analysis(prompt):
    """Start an async analysis and poll until it completes."""
    resp = requests.post(
        f"{BASE_URL}/async-analysis",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"prompt": prompt, "providers": PROVIDERS, "country": "US", "replicates": 3},
    )
    analysis_id = resp.json()["data"]["analysisId"]
    while True:
        data = requests.get(
            f"{BASE_URL}/async-analysis/{analysis_id}",
            headers={"Authorization": f"Bearer {API_KEY}"},
        ).json()["data"]
        if data["status"] == "failed":
            raise RuntimeError(f"Analysis {analysis_id} failed")
        if data["status"] == "succeeded":
            return data
        time.sleep(8)

# Analyze cross-provider agreement
for prompt in prompts:
    data = run_analysis(prompt)
    print(f"\n{prompt}")

    # Collect the set of brands each provider mentioned (pooled across replicates)
    provider_brands = defaultdict(set)
    for result in data["results"]:
        provider_brands[result["provider"]].update(
            b.lower() for b in result["brandsMentioned"]
        )

    # Pairwise agreement = shared brands / all distinct brands (Jaccard overlap)
    providers = list(provider_brands.keys())
    for i, p1 in enumerate(providers):
        for p2 in providers[i + 1:]:
            overlap = provider_brands[p1] & provider_brands[p2]
            total = provider_brands[p1] | provider_brands[p2]
            agreement = len(overlap) / len(total) * 100 if total else 0
            print(f"  {p1} ↔ {p2}: {agreement:.0f}% agreement")
```
Pricing
Each prompt analysis costs less than one cent. Running 50 prompts across 5 providers with 3 replicates consumes 750 credits, so even large cross-provider studies scale affordably at under $0.01 per prompt.
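The credit arithmetic above works out as follows, assuming one credit per provider replicate, which is what the 750-credit figure implies:

```python
# Back-of-envelope credit count for the study setup described above.
prompts, providers, replicates = 50, 5, 3
credits = prompts * providers * replicates  # one credit per provider replicate (assumed)
print(credits)  # → 750
```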
Frequently Asked Questions
Why do AI engines disagree so much?
Each engine uses different training data, different retrieval methods (some use web search, others rely on parametric knowledge), and different ranking algorithms. Perplexity heavily weights recent web sources, while Claude relies more on training data. These fundamental differences lead to different recommendations.
Which AI engine should I prioritize?
It depends on your audience. ChatGPT has the largest user base, but Perplexity is growing fastest among researchers and professionals. Use the providerBreakdown data from the Sellm API to see where your brand performs best and worst.
Does agreement change over time?
Yes. As AI models update and web indexes refresh, agreement rates shift. Monthly monitoring helps you spot these changes and adjust your strategy.
Is higher agreement always better for my brand?
Not necessarily. If all providers agree on recommending your competitor, high agreement works against you. Low agreement means more "slots" are open for your brand to claim on individual providers.