Claude vs GPT-4o vs Gemini for Data Analysis: I Tested All 3 on Real Datasets

I've been using AI for data analysis pretty much every day for the past year. And I keep hearing the same question from colleagues: "Which one should I actually use?" So I decided to stop guessing and run a proper head-to-head comparison.

I took three real-world tasks — the kind of stuff data analysts actually deal with — and threw them at Claude (3.5 Sonnet), GPT-4o, and Gemini 1.5 Pro. No cherry-picking prompts. No retrying until I got a good answer. Just one shot per task, same prompt, same data.

Here's what happened.

Why I Ran This Test

Let me be honest: I'm not sponsored by any of these companies. I pay for all three subscriptions out of my own pocket ($20/month each, which adds up fast). The reason I ran this test is purely selfish — I wanted to know which subscription I could drop.

Most "AI comparison" articles I've read are either surface-level or clearly biased toward whichever model the author prefers. They'll test something trivial like "write me a haiku" and then draw sweeping conclusions about enterprise readiness. That's not helpful.

I wanted to test things that actually matter for data work: handling messy real-world data, generating correct SQL across multiple tables, and extracting insights from long documents. These are the tasks I do every single week, and I suspect most data professionals do too.

The Setup

Before diving into results, here's how I structured the tests:

  • Same prompt for all three models — I wrote each prompt once and copy-pasted it exactly
  • No system prompts or custom instructions — clean slate for everyone
  • Default settings — no temperature adjustments, no top-p tweaking
  • Single attempt — whatever came back first is what I scored
  • Tested in April 2026 — models update constantly, so this is a snapshot in time

I scored each model on a 1-10 scale across multiple criteria per task, then averaged them. Not perfectly scientific, but way more rigorous than vibes.

Test 1: 50,000-Row CSV Analysis

The Dataset

I used a real e-commerce transactions dataset with 50,247 rows and 23 columns. It included order IDs, timestamps, product categories, customer demographics, payment methods, shipping details, return status, and revenue figures. The data was messy on purpose — there were 1,847 missing values scattered across different columns, some duplicate entries, date format inconsistencies, and a few obvious outliers (like a $99,999 order that was clearly a data entry error).
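Checks like the ones I deliberately baked into this dataset are easy to reproduce yourself in pandas. Here's a minimal sketch using a tiny invented stand-in frame (the real file and column names would differ):

```python
import pandas as pd
import numpy as np

# Tiny stand-in frame; in practice: df = pd.read_csv("transactions.csv")
df = pd.DataFrame({
    "order_id": [1, 2, 2, 3, 4, 5, 6],
    "revenue":  [100.0, 110.0, 110.0, 120.0, np.nan, 130.0, 99999.0],
    "category": ["electronics", "home", "home", "clothing", None, "home", "toys"],
})

# 1. Missing values per column
missing = df.isna().sum()

# 2. Exact duplicate rows
dup_count = int(df.duplicated().sum())

# 3. Outliers on revenue via the 1.5 * IQR rule
q1, q3 = df["revenue"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["revenue"] < q1 - 1.5 * iqr) | (df["revenue"] > q3 + 1.5 * iqr)]

print(missing.to_dict())              # per-column missing counts
print(dup_count)                      # number of fully duplicated rows
print(outliers["order_id"].tolist())  # orders flagged as revenue outliers
```

Running checks like these first is also a useful sanity baseline for grading what the models claim to find.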

My prompt was straightforward: "Analyze this CSV file. Identify the top trends, flag data quality issues, and provide 5 actionable business recommendations with supporting evidence from the data."

Claude's Response

Claude immediately identified the data quality issues — all 1,847 missing values, the duplicates, and the outlier. What impressed me was the depth of analysis. It didn't just say "electronics is your top category." It broke down the revenue by category, cross-referenced it with return rates, and pointed out that while electronics had the highest gross revenue, the net revenue picture looked different because of a 23% return rate.

The five recommendations were specific and tied to data points. For example: "Consider reducing free shipping threshold from $75 to $50 for the home goods category — orders between $50-$75 in this category show a 34% cart abandonment rate, and the average shipping cost of $8.20 would be offset by the 12% increase in conversion." That's the kind of analysis that actually helps decision-makers.

Claude also voluntarily created a correlation matrix and identified a seasonal pattern I hadn't noticed — Q3 returns were 40% higher than other quarters, concentrated in the clothing category, likely due to back-to-school purchases with high size-mismatch returns.

GPT-4o's Response

GPT-4o took a more structured approach. It organized its analysis into clear sections with headers and even generated Python code for each analysis step. The code was clean, well-commented, and actually runnable — I tested it. It used pandas profiling concepts and created visualization code using matplotlib and seaborn.

The data quality identification was good but not as thorough. It caught the missing values and the outlier but missed the duplicate entries (there were 23 of them). The business recommendations were solid but more generic — things like "focus marketing spend on high-performing categories" without the specific threshold analysis Claude provided.

Where GPT-4o really shone was in the code output. If I needed to build an automated pipeline based on this analysis, GPT-4o gave me a massive head start. The code was modular, had error handling, and even included docstrings.

Gemini's Response

Gemini 1.5 Pro processed the file quickly and provided a comprehensive overview. It handled the large file size without complaints (which wasn't always the case with earlier versions). The analysis covered the basics well — top categories, revenue trends, customer segments.

But the depth wasn't there compared to the other two. Recommendations were surface-level: "Improve customer retention" and "Optimize product mix." These aren't wrong, but they're not actionable either. A business leader reading those would immediately ask "How?" and the answer wasn't in Gemini's output.

Gemini did do something interesting though — it automatically created a summary table comparing month-over-month growth rates that was easy to scan. And it was the only model that flagged a potential currency inconsistency in 12 rows where the values suggested they might be in a different currency.

Test 1 Scores

| Criteria | Claude | GPT-4o | Gemini |
|---|---|---|---|
| Data Quality Detection | 9 | 7 | 7 |
| Analytical Depth | 10 | 7 | 5 |
| Actionable Recommendations | 9 | 7 | 5 |
| Code Quality | 7 | 10 | 6 |
| Presentation/Readability | 8 | 9 | 8 |
| Average | 8.6 | 8.0 | 6.2 |

Test 2: SQL Query Generation Across 3 Related Tables

The Schema

I gave each model a schema with three tables: customers (customer_id, name, email, signup_date, plan_type, region), orders (order_id, customer_id, product_id, order_date, quantity, unit_price, discount_pct, status), and products (product_id, name, category, subcategory, cost_price, list_price, supplier_id, is_active).

Then I asked five increasingly complex questions:

  1. Show me the top 10 customers by total spend in the last 90 days, including their plan type and number of orders
  2. Calculate the month-over-month revenue growth rate for each product category, but only for categories with at least 100 orders per month
  3. Find customers who've downgraded their plan (went from premium to basic) AND whose order frequency dropped by more than 50% compared to their first 3 months
  4. Generate a cohort analysis showing retention rates by signup month, where "retained" means at least one order in each subsequent month
  5. Identify products where the discount percentage is eating more than 30% of the margin, grouped by supplier, with a running total
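For reference, question 1 is the most straightforward of the five. A minimal sketch of what a correct answer looks like, run here against SQLite with a cut-down version of the schema and invented data (and a fixed "as of" date so the example is deterministic; production code would use CURRENT_DATE):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE customers (customer_id INTEGER, name TEXT, plan_type TEXT);
CREATE TABLE orders (
    order_id INTEGER, customer_id INTEGER, order_date TEXT,
    quantity INTEGER, unit_price REAL, discount_pct REAL, status TEXT
);
INSERT INTO customers VALUES
    (1, 'Ada', 'premium'), (2, 'Bob', 'basic'), (3, 'Cy', 'premium');
INSERT INTO orders VALUES
    (10, 1, '2026-03-01', 2, 50.0, 0,  'completed'),
    (11, 1, '2026-03-15', 1, 30.0, 10, 'completed'),
    (12, 2, '2026-03-20', 1, 200.0, 0, 'completed'),
    (13, 3, '2025-01-01', 5, 500.0, 0, 'completed');  -- outside the 90-day window
""")

rows = cur.execute("""
    SELECT c.name,
           c.plan_type,
           COUNT(o.order_id) AS num_orders,
           ROUND(SUM(o.quantity * o.unit_price
                     * (1 - o.discount_pct / 100.0)), 2) AS total_spend
    FROM customers c
    JOIN orders o ON o.customer_id = c.customer_id
    WHERE o.order_date >= date('2026-04-01', '-90 days')
      AND o.status = 'completed'
    GROUP BY c.customer_id, c.name, c.plan_type
    ORDER BY total_spend DESC
    LIMIT 10
""").fetchall()
print(rows)
```

The interesting grading details are whether the model nets out `discount_pct`, filters on `status`, and anchors the 90-day window correctly.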

Claude's SQL

Claude produced correct SQL for all five queries on the first try. The queries were well-structured, using CTEs (Common Table Expressions) that made the logic easy to follow. For the cohort analysis query, which is notoriously tricky, Claude used a clean approach with CROSS JOIN to generate the full cohort grid and LEFT JOIN to fill in actual retention numbers. This meant the output correctly showed zeros for months where no customers were retained, rather than just omitting those rows.
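The CROSS JOIN pattern is worth seeing concretely. This is my own stripped-down sketch of the idea in SQLite (not Claude's actual output), using invented data and the simpler "active in month" retention definition: the grid guarantees a row for every cohort-month pair, and the LEFT JOIN fills in counts, so empty months show as 0 instead of disappearing.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE customers (customer_id INTEGER, signup_date TEXT);
CREATE TABLE orders (customer_id INTEGER, order_date TEXT);
INSERT INTO customers VALUES (1, '2026-01-05'), (2, '2026-01-20'), (3, '2026-02-02');
INSERT INTO orders VALUES (1, '2026-02-10'), (1, '2026-03-03'), (2, '2026-02-15');
""")

rows = cur.execute("""
    WITH cohorts AS (
        SELECT strftime('%Y-%m', signup_date) AS cohort, customer_id
        FROM customers
    ),
    months AS (SELECT '2026-02' AS m UNION ALL SELECT '2026-03'),
    -- CROSS JOIN builds the full cohort-by-month grid, so months with
    -- zero retained customers still appear as rows instead of vanishing.
    grid AS (
        SELECT DISTINCT c.cohort, m.m
        FROM cohorts c CROSS JOIN months m
        WHERE m.m > c.cohort
    )
    SELECT g.cohort, g.m,
           COUNT(DISTINCT o.customer_id) AS retained
    FROM grid g
    LEFT JOIN cohorts c ON c.cohort = g.cohort
    LEFT JOIN orders o ON o.customer_id = c.customer_id
                      AND strftime('%Y-%m', o.order_date) = g.m
    GROUP BY g.cohort, g.m
    ORDER BY g.cohort, g.m
""").fetchall()
print(rows)
```

Note the February 2026 cohort correctly shows `retained = 0` for March, which is exactly the behavior a bare INNER JOIN would lose.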

Claude also added comments explaining the business logic behind each step, which is something I always appreciate. It noted edge cases too — for example, it pointed out that the discount-margin query assumes discount_pct applies to list_price, not cost_price, and asked if that was correct.

The only nit: one query used a window function syntax that's PostgreSQL-specific and wouldn't work in MySQL without modification. Claude didn't specify which dialect it was targeting.

GPT-4o's SQL

GPT-4o also got all five queries correct, and the code quality was arguably the best of the three. Each query came with the SQL dialect specified (it defaulted to PostgreSQL but offered MySQL alternatives), performance notes about which indexes would help, and estimated execution plans for large tables.

The cohort analysis query was elegant — it used a slightly different approach with date_trunc and generate_series that was more concise than Claude's version while being equally correct. GPT-4o also provided the query results in a formatted table showing what the output would look like, which was helpful for validation.

GPT-4o went above and beyond by suggesting a materialized view for the revenue growth query: "If you're running this regularly, consider creating a materialized view that refreshes daily. Here's the DDL..." That's the kind of production-ready thinking that distinguishes good from great.

Gemini's SQL

Gemini got queries 1, 2, and 5 correct. Query 3 had a logical error — it compared plan changes by looking at the current plan_type field but didn't account for the fact that the schema only stores the current plan, not the history. It assumed there was a plan_history table that didn't exist. When I pointed this out, it corrected itself, but remember — this was a single-attempt test.

Query 4 (cohort analysis) was functionally correct but had a performance issue: it used correlated subqueries instead of joins, which on a large dataset would be significantly slower. The logic was right, but I wouldn't want to run it on a table with millions of rows.
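To make the correlated-subquery issue concrete, here's a toy illustration (my own example, not Gemini's output): both queries return the same result, but the first re-executes the inner SELECT once per outer row, while the GROUP BY version aggregates in a single pass.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE orders (customer_id INTEGER, amount REAL);
INSERT INTO orders VALUES (1, 10.0), (1, 20.0), (2, 5.0);
""")

# Correlated subquery: the inner SELECT runs once per outer row.
slow = cur.execute("""
    SELECT DISTINCT customer_id,
           (SELECT SUM(amount) FROM orders o2
            WHERE o2.customer_id = o1.customer_id) AS total
    FROM orders o1
""").fetchall()

# Equivalent GROUP BY aggregation: one pass over the table.
fast = cur.execute("""
    SELECT customer_id, SUM(amount) AS total
    FROM orders
    GROUP BY customer_id
""").fetchall()

print(sorted(slow) == sorted(fast))  # True: same result, very different cost at scale
```

On three rows the difference is invisible; on millions of rows it's the difference between seconds and hours.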

Gemini did excel at explaining the business context of each query. Its explanations of what cohort analysis means and why margin analysis matters were the clearest of the three, making it a good learning tool.

Test 2 Scores

| Criteria | Claude | GPT-4o | Gemini |
|---|---|---|---|
| Query Correctness | 10 | 10 | 7 |
| Code Structure/Readability | 9 | 10 | 7 |
| Performance Awareness | 7 | 9 | 5 |
| Edge Case Handling | 9 | 8 | 5 |
| Documentation/Explanation | 8 | 8 | 9 |
| Average | 8.6 | 9.0 | 6.6 |

Test 3: 47-Page PDF Summarization

The Document

I used a real (anonymized) quarterly business review document — 47 pages with financial tables, strategic initiatives, risk assessments, departmental KPIs, and a board presentation deck embedded at the end. The document was roughly 28,000 words with 14 tables and 8 charts described in text.

My prompt: "Summarize this document in a way that a C-level executive could read in 5 minutes and walk into a board meeting fully prepared. Highlight the 3 biggest risks, the 2 most promising opportunities, and any numbers that look off compared to the narrative."

Claude's Summary

Claude produced a tight, well-organized summary that was about 800 words — genuinely readable in 5 minutes. The three risks it identified were spot-on: supply chain concentration (72% from a single region), the customer acquisition cost trend (up 34% QoQ while LTV was flat), and a compliance deadline that the narrative mentioned casually on page 38 but was actually a major regulatory risk.

The opportunities section was good too, highlighting a market expansion play and a partnership deal with favorable terms. But where Claude really earned its score was the "numbers that look off" section. It caught that the revenue figure on page 12 ($14.2M) didn't match the sum of the regional breakdowns on page 23 ($13.8M), and flagged that the headcount numbers in the HR section implied an 18% turnover rate that contradicted the "strong retention" narrative on page 7.

That kind of cross-referencing across a long document is exactly what I need an AI to do. I'd been reading this document for an hour and missed both of those discrepancies.

GPT-4o's Summary

GPT-4o's summary was well-structured and professional. It used bullet points effectively and organized information by department. The risk identification was solid — it caught the supply chain issue and the CAC problem but missed the compliance deadline on page 38.

The summary was longer than Claude's (about 1,200 words), which cut into the "5-minute read" requirement. It included more detail about departmental performance, which is useful but wasn't what I asked for. GPT-4o seems to default to comprehensiveness rather than conciseness when given long documents.

On the "numbers that look off" request, GPT-4o caught the revenue discrepancy but not the turnover rate contradiction. It did catch something the others missed though — it noted that the projected Q3 growth rate assumed a seasonality pattern from 2024, but the 2025 data showed the seasonal pattern had shifted by about 6 weeks, making the Q3 projection potentially optimistic by 8-12%.

Gemini's Summary

This is where Gemini's large context window paid off. It processed all 47 pages without any chunking or summarization artifacts. The summary was comprehensive and accurate, covering every major section of the document. It handled the financial tables particularly well, extracting key metrics and presenting them in a clean format.

Gemini identified the supply chain risk and a market competition risk that the other two didn't emphasize. However, it treated the document more like a chapter-by-chapter summary rather than a strategic briefing. A C-suite exec would get all the information but would have to do the "so what?" synthesis themselves.

The numbers audit was the weakest of the three. Gemini confirmed the numbers in the document without cross-referencing between sections. It essentially said "the financial figures are consistent with the narrative" — which they weren't, as Claude demonstrated.

But I want to give Gemini credit for something: its handling of the embedded charts. Even though it couldn't see the actual images, it referenced the text descriptions of charts and correctly noted that two charts described contradictory trends (one showed increasing market share while another showed declining relative competitive position). That's a subtle catch.

Test 3 Scores

| Criteria | Claude | GPT-4o | Gemini |
|---|---|---|---|
| Executive Readiness | 10 | 7 | 6 |
| Risk Identification | 9 | 8 | 7 |
| Opportunity Spotting | 8 | 8 | 7 |
| Numerical Cross-Referencing | 10 | 8 | 4 |
| Conciseness vs Completeness | 9 | 6 | 8 |
| Average | 9.2 | 7.4 | 6.4 |

Overall Results

| Test | Claude | GPT-4o | Gemini |
|---|---|---|---|
| CSV Analysis (50K rows) | 8.6 | 8.0 | 6.2 |
| SQL Generation (3 tables) | 8.6 | 9.0 | 6.6 |
| PDF Summarization (47 pages) | 9.2 | 7.4 | 6.4 |
| Overall Average | 8.8 | 8.1 | 6.4 |

My Honest Take: When to Use Which

Use Claude when: You need deep analytical thinking, cross-referencing across large documents, or business-ready insights that go beyond surface-level observations. Claude consistently provided the most nuanced analysis and caught details that the others missed. If you're a data analyst presenting to stakeholders, Claude gives you the "so what?" that turns data into decisions.

Use GPT-4o when: You need production-quality code output, well-documented SQL, or a pipeline-ready analysis. GPT-4o's code was consistently the cleanest and most production-ready. If you're building something, not just analyzing something, GPT-4o is your best bet. The performance optimization suggestions were a nice bonus too.

Use Gemini when: You're working with massive documents or need to process a lot of context at once. Gemini's large context window is genuinely useful for very long documents, and it handled the full 47 pages without breaking a sweat. It's also the best at explanations and teaching, which makes it valuable for learning new concepts.

What About Cost?

All three offer $20/month consumer plans. For API usage, it gets more nuanced:

| Model | Input (per 1M tokens) | Output (per 1M tokens) | Est. Cost for This Test |
|---|---|---|---|
| Claude 3.5 Sonnet | $3.00 | $15.00 | $0.47 |
| GPT-4o | $2.50 | $10.00 | $0.38 |
| Gemini 1.5 Pro | $1.25 | $5.00 | $0.21 |
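If you want to estimate your own API costs, the arithmetic is simple. The token counts below are hypothetical (I didn't log the exact counts for these tests), plugged into Claude 3.5 Sonnet's listed prices:

```python
def api_cost(input_tokens: int, output_tokens: int,
             in_price_per_m: float, out_price_per_m: float) -> float:
    """Dollar cost for one request, given per-million-token prices."""
    return (input_tokens * in_price_per_m
            + output_tokens * out_price_per_m) / 1_000_000

# Hypothetical example: 100K input tokens, 10K output, at $3 in / $15 out.
cost = api_cost(100_000, 10_000, 3.00, 15.00)
print(round(cost, 2))  # 0.45
```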

Gemini wins on price by a significant margin. If cost is your primary concern and the depth difference doesn't matter for your use case, it's hard to argue against Gemini's value proposition.

Limitations of This Test

I want to be upfront about what this test doesn't tell you:

  • These models update constantly. What's true today might not be true in three months. I'll try to re-run this comparison quarterly.
  • Single-attempt bias. AI models can give different outputs on the same prompt. Running each test 10 times and averaging would be more rigorous but wasn't practical.
  • My scoring is subjective. Another analyst might weigh code quality higher than analytical depth and reach different conclusions.
  • I didn't test multimodal capabilities. If your data analysis involves images, charts, or video, that's a different comparison entirely.
  • Context window limits matter more for some workflows. If you routinely process 100+ page documents, Gemini's advantage becomes much more significant.

What I'm Actually Keeping

After running this test, I'm keeping all three subscriptions — but I'm using them differently than before. Claude is my go-to for analysis and document review. GPT-4o handles my coding tasks and pipeline building. Gemini comes out when I need to process something massive or when I'm learning a new domain and need clear explanations.

Is $60/month a lot? Yes. But considering I'm replacing what used to be hours of manual work every week, it's probably the best ROI of any tool subscription I have.

If I had to keep only one, it'd be Claude for data analysis work. The analytical depth and cross-referencing ability saved me from presenting incorrect numbers to a client — once. That alone paid for a year of subscription.

But honestly, the gap between Claude and GPT-4o is narrow enough that your mileage may vary. Try all three on YOUR specific tasks before committing. What works for my workflow might not match yours.

What's Next

I'm planning to run similar tests on visualization generation (can these models create good charts directly?), real-time data analysis (streaming data scenarios), and multi-language data processing. If you've got specific scenarios you'd like me to test, drop a comment or reach out.

The AI data analysis landscape is moving incredibly fast. The model that's best today might not be best next quarter. But right now, for the work I do, this is where things stand.
