How Often Do AI Engines Agree? A Cross-Provider Consistency Study

When someone asks ChatGPT "what's the best CRM?", does Claude give the same answer? What about Perplexity, Gemini, and Grok? We ran thousands of identical prompts across five major AI engines to find out. The results have significant implications for how you approach AI search optimization.

Methodology

We queried five AI providers — ChatGPT, Claude, Perplexity, Gemini, and Grok — with identical prompts across 12 categories (SaaS tools, e-commerce platforms, financial services, healthcare, legal, education, travel, food delivery, fitness, cybersecurity, HR tech, and marketing). Each prompt was run 3 times per provider to account for response variability, totaling over 4,500 individual analyses.

For each response, the Sellm API extracted the ordered list of brand mentions, giving us a structured dataset to compare across providers.

Key Finding #1: Cross-Provider Agreement Is Low

On average, only 34% of brands mentioned by one provider also appeared in another provider's response to the same prompt. In other words, if ChatGPT recommends Brand X, there's roughly a 1-in-3 chance that Claude also recommends it.
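Concretely, pairwise agreement can be read as set overlap between the brand lists two providers return (the analysis script at the end of this post uses the same union-based calculation). A minimal sketch, with illustrative brand names:

```python
def agreement(brands_a, brands_b):
    """Jaccard-style overlap: shared brands / all brands either provider named."""
    a = {x.lower() for x in brands_a}
    b = {x.lower() for x in brands_b}
    union = a | b
    return len(a & b) / len(union) * 100 if union else 0.0

chatgpt = ["HubSpot", "Salesforce", "Pipedrive"]
claude = ["Salesforce", "Zoho", "HubSpot", "Monday"]
print(f"{agreement(chatgpt, claude):.0f}%")  # 2 shared of 5 total → 40%
```

Lowercasing before comparison matters: providers are inconsistent about capitalization ("HubSpot" vs "Hubspot"), and a naive string match would undercount agreement.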

| Provider Pair | Agreement Rate |
| --- | --- |
| ChatGPT ↔ Perplexity | 42% |
| ChatGPT ↔ Claude | 38% |
| ChatGPT ↔ Gemini | 35% |
| Perplexity ↔ Claude | 33% |
| ChatGPT ↔ Grok | 29% |
| Claude ↔ Grok | 24% |

ChatGPT and Perplexity showed the highest agreement (42%), likely because both use web search to ground their responses. Claude and Grok had the lowest agreement (24%), reflecting fundamentally different training data and retrieval approaches.

Key Finding #2: Category Matters Enormously

Agreement rates varied dramatically by category:

| Category | Avg. Agreement | Likely Reason |
| --- | --- | --- |
| Developer tools | 52% | Clear market leaders, objective metrics |
| Enterprise SaaS | 47% | Well-established category, extensive reviews |
| Cybersecurity | 44% | Technical evaluations, analyst reports |
| Marketing tools | 35% | Fragmented market, many viable options |
| E-commerce platforms | 31% | Use-case dependent, regional variation |
| Food delivery | 23% | Highly local, personal preference driven |
| Fitness apps | 19% | Subjective, lifestyle-dependent |

The pattern is clear: categories with objective, widely-documented leaders show higher agreement. Subjective or local categories show much lower agreement.

Key Finding #3: Some Brands Dominate Across All Providers

A small set of "universal brands" appeared consistently across all five providers. These brands typically shared three traits:

  1. Strong entity signals — Wikipedia pages, consistent naming, rich schema markup
  2. Extensive third-party coverage — reviews on G2, Capterra, TrustRadius, and industry publications
  3. Answer-style content — FAQ pages, comparison guides, "vs" articles on their own sites

Meanwhile, brands that appeared on only one provider typically relied heavily on SEO-optimized content without broader authority signals.
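Given per-provider brand sets like those the Sellm API returns, both groups fall out of simple set operations: universal brands are the intersection across all providers, single-provider brands appear in exactly one set. A minimal sketch with illustrative data:

```python
from functools import reduce

# Illustrative per-provider brand sets for one prompt
provider_brands = {
    "chatgpt":    {"salesforce", "hubspot", "pipedrive"},
    "claude":     {"salesforce", "hubspot", "zoho"},
    "perplexity": {"salesforce", "hubspot", "monday"},
    "gemini":     {"salesforce", "hubspot", "freshsales"},
    "grok":       {"salesforce", "close"},
}

# Universal brands: present in every provider's response
universal = reduce(set.intersection, provider_brands.values())

# Single-provider brands: mentioned by exactly one engine
all_brands = set().union(*provider_brands.values())
exclusive = {b for b in all_brands
             if sum(b in s for s in provider_brands.values()) == 1}

print(sorted(universal))   # ['salesforce']
print(sorted(exclusive))   # ['close', 'freshsales', 'monday', 'pipedrive', 'zoho']
```

Note how "hubspot" misses universal status by a single provider (Grok); tracking which engine is the holdout tells you where to focus.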

Key Finding #4: Same Provider, Different Answers

Even within a single provider, responses varied between runs. On average, a brand mentioned in one replicate of a ChatGPT response appeared in only 68% of other replicates for the same prompt. This is why Sellm supports configurable replicates — a single query is never statistically reliable.
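Within-provider consistency can be measured the same way across replicates. One plausible way to compute a figure like that 68% (the study's exact formula isn't shown here; replicate data below is illustrative):

```python
# Brand sets from three replicates of the same prompt on one provider
replicates = [
    {"asana", "trello", "notion", "clickup"},
    {"asana", "trello", "monday"},
    {"asana", "notion", "monday", "trello"},
]

# For each brand in each replicate, what fraction of the OTHER
# replicates also mention it? Average over all (brand, replicate) pairs.
scores = []
for i, rep in enumerate(replicates):
    others = [r for j, r in enumerate(replicates) if j != i]
    for brand in rep:
        scores.append(sum(brand in o for o in others) / len(others))

consistency = sum(scores) / len(scores) * 100
print(f"{consistency:.0f}%")  # → 73%
```

Brands like "asana" and "trello" that survive every replicate pull the score up; one-off mentions like "clickup" pull it down, which is exactly the noise replicates are meant to average out.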

Implications for Your GEO Strategy

1. You Must Monitor Multiple Providers

If you only track ChatGPT, you're blind to 58–76% of the AI search landscape. Each provider has its own biases, data sources, and recommendation patterns.

2. Category Strategy Matters

In high-agreement categories (dev tools, enterprise SaaS), winning on one provider likely means winning on others. In low-agreement categories (lifestyle, local services), you need provider-specific optimization.

3. Replicates Are Essential

A single query tells you almost nothing. Run at least 3 replicates per prompt to get meaningful coverage and position data.

4. Universal Authority Signals Win

Brands that appear across all providers invest in entity signals and third-party coverage, not just SEO. Focus on being mentioned on authoritative sites, not just ranking on Google.

Replicate This Study with the Sellm API

You can run your own cross-provider consistency analysis using the Sellm API:

import requests
import time
from collections import defaultdict

API_KEY = "your_sellm_api_key"
BASE_URL = "https://sellm.io/api/v1"
PROVIDERS = ["chatgpt", "claude", "perplexity", "gemini", "grok"]

prompts = [
    "best project management tool for startups",
    "top CRM software 2026",
    "best email marketing platform",
]

def run_analysis(prompt):
    """Start an async analysis, then poll until it completes."""
    resp = requests.post(
        f"{BASE_URL}/async-analysis",
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={"prompt": prompt, "providers": PROVIDERS, "country": "US", "replicates": 3}
    )
    resp.raise_for_status()
    analysis_id = resp.json()["data"]["analysisId"]
    while True:
        data = requests.get(
            f"{BASE_URL}/async-analysis/{analysis_id}",
            headers={"Authorization": f"Bearer {API_KEY}"}
        ).json()["data"]
        if data["status"] == "succeeded":
            return data
        if data["status"] == "failed":
            raise RuntimeError(f"Analysis {analysis_id} failed")
        time.sleep(8)  # polling interval; adjust to taste

# Analyze cross-provider agreement for each prompt
for prompt in prompts:
    print(f"\nPrompt: {prompt}")
    data = run_analysis(prompt)
    provider_brands = defaultdict(set)
    for result in data["results"]:
        provider_brands[result["provider"]].update(
            b.lower() for b in result["brandsMentioned"]
        )

    # Calculate pairwise agreement
    providers = list(provider_brands.keys())
    for i, p1 in enumerate(providers):
        for p2 in providers[i+1:]:
            overlap = provider_brands[p1] & provider_brands[p2]
            total = provider_brands[p1] | provider_brands[p2]
            agreement = len(overlap) / len(total) * 100 if total else 0
            print(f"  {p1} ↔ {p2}: {agreement:.0f}% agreement")

Pricing

Each individual analysis costs less than one cent. Running 50 prompts across 5 providers with 3 replicates consumes 750 credits (50 × 5 × 3), so even a full cross-provider study stays inexpensive at under $0.01 per analysis.
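The credit math above is a straight product, assuming one credit per provider-replicate run (the helper name here is illustrative):

```python
def study_credits(prompts: int, providers: int, replicates: int) -> int:
    """Total credits consumed, assuming one credit per provider-replicate run."""
    return prompts * providers * replicates

credits = study_credits(prompts=50, providers=5, replicates=3)
print(credits)  # 750
```

This makes it easy to budget before launching: halving replicates halves cost, but at the expense of the within-provider consistency measurement discussed above.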

Frequently Asked Questions

Why do AI engines disagree so much?

Each engine uses different training data, different retrieval methods (some use web search, others rely on parametric knowledge), and different ranking algorithms. Perplexity heavily weights recent web sources, while Claude relies more on training data. These fundamental differences lead to different recommendations.

Which AI engine should I prioritize?

It depends on your audience. ChatGPT has the largest user base, but Perplexity is growing fastest among researchers and professionals. Use the providerBreakdown data from the Sellm API to see where your brand performs best and worst.

Does agreement change over time?

Yes. As AI models update and web indexes refresh, agreement rates shift. Monthly monitoring helps you spot these changes and adjust your strategy.

Is higher agreement always better for my brand?

Not necessarily. If all providers agree on recommending your competitor, high agreement works against you. Low agreement means more "slots" are open for your brand to claim on individual providers.