Close Menu
    • Home
    • Events
      • Upcoming Events
      • Videos
        • Machine Can Think Summit 2026
        • Step Dubai Conference 2026
    • Technology & Innovation
    • Business & Marketing
    • Trends & Insights
    • Industry Applications
    • Tutorials & Guides
    What's Hot
    Technology & Innovation

    SAS Puts AI Governance at the Core of Its Agent Strategy

    By Art RyanApril 29, 20260

    As it moves deeper into the era of agentic AI, SAS is making governance a…

    Big Tech AI Spending 2026: Investment Trends Revealed

    April 29, 2026

    Amazon AI Hiring Software Enhances Recruitment Efficiency

    April 29, 2026

    Oracle & CoreWeave Shares Fall on OpenAI Growth Miss

    April 29, 2026
    Facebook X (Twitter) Instagram
    Facebook X (Twitter) Instagram
    Breaking AI News
    Wednesday, April 29
    • Home
    • Events
      • Upcoming Events
      • Videos
        • Machine Can Think Summit 2026
        • Step Dubai Conference 2026
    • Technology & Innovation

      SAS Puts AI Governance at the Core of Its Agent Strategy

      April 29, 2026

      Amazon AI Hiring Software Enhances Recruitment Efficiency

      April 29, 2026

      AI Drug Development Johnson & Johnson Impact on Healthcare

      April 28, 2026

      Qualcomm OpenAI AI Smartphone Processors Partnership News

      April 28, 2026

      Google AI Campus South Korea and Its Development Plans

      April 28, 2026
    • Business & Marketing

      Big Tech AI Spending 2026: Investment Trends Revealed

      April 29, 2026

      Oracle & CoreWeave Shares Fall on OpenAI Growth Miss

      April 29, 2026

      Authentic Brands Group Could Hit $50 Billion in Retail Sales by 2026, CEO Says

      April 29, 2026

      UK AI Startup Ineffable Secures $1.1B in Europe’s Largest Seed Round

      April 28, 2026

      Meta Manus AI Acquisition Blocked Over Strategic Concerns

      April 28, 2026
    • Trends & Insights

      SAS Puts AI Governance at the Core of Its Agent Strategy

      April 29, 2026

      Big Tech AI Spending 2026: Investment Trends Revealed

      April 29, 2026

      Oracle & CoreWeave Shares Fall on OpenAI Growth Miss

      April 29, 2026

      Google AI Campus South Korea and Its Development Plans

      April 28, 2026

      Meta Manus AI Acquisition Blocked Over Strategic Concerns

      April 28, 2026
    • Industry Applications

      Amazon AI Hiring Software Enhances Recruitment Efficiency

      April 29, 2026

      AI Drug Development Johnson & Johnson Impact on Healthcare

      April 28, 2026

      Accenture Copilot Rollout Enhances Employee Productivity

      April 28, 2026

      HomeLight AI Real Estate Closings Transforming the Market

      April 27, 2026

      UiPath & Databricks Partner to Transform Enterprise Operations through Automation and Data Intelligence

      April 27, 2026
    • Tutorials & Guides

      How AI Is Revolutionizing the Future of Travel 2026 with Wellness and Sustainability

      April 19, 2026

      University of Wollongong in Dubai AI initiative boosts future-ready education

      March 31, 2026

      Microsoft AI upgrades Copilot Cowork unveiled for early access users

      March 31, 2026

      Starcloud $11 billion valuation signals AI space race surge

      March 31, 2026

      Flexible AI Factories Power the Future of Energy Grids

      March 30, 2026
    Breaking AI News
    Home » The rise of AI ‘reasoning’ models is making benchmarking more expensive
    Technology & Innovation

    The rise of AI ‘reasoning’ models is making benchmarking more expensive

    Art RyanBy Art RyanApril 11, 2025No Comments4 Mins Read
    Facebook Twitter Pinterest LinkedIn Tumblr Email
    Share
    Facebook Twitter LinkedIn Pinterest Email

    AI labs like OpenAI claim that their so-called “reasoning” AI models, which can “think” through problems step by step, are more capable than their non-reasoning counterparts in specific domains, such as physics. But while this generally appears to be the case, reasoning models are also much more expensive to benchmark, making it difficult to independently verify these claims.

    According to data from Artificial Analysis, a third-party AI testing outfit, it costs $2,767.05 to evaluate OpenAI’s o1 reasoning model across a suite of seven popular AI benchmarks: MMLU-Pro, GPQA Diamond, Humanity’s Last Exam, LiveCodeBench, SciCode, AIME 2024, and MATH-500.

    Benchmarking Anthropic’s recent Claude 3.7 Sonnet, a “hybrid” reasoning model, on the same set of tests cost $1,485.35, while testing OpenAI’s o3-mini-high cost $344.59, per Artificial Analysis.

    Some reasoning models are cheaper to benchmark than others. Artificial Analysis spent $141.22 evaluating OpenAI’s o1-mini, for example. But on average, they tend to be pricey. All told, Artificial Analysis has spent roughly $5,200 evaluating around a dozen reasoning models, close to twice the amount the firm spent analyzing over 80 non-reasoning models ($2,400).

    OpenAI’s non-reasoning GPT-4o model, released in May 2024, cost Artificial Analysis just $108.85 to evaluate, while Claude 3.6 Sonnet — Claude 3.7 Sonnet’s non-reasoning predecessor — cost $81.41.

    Artificial Analysis co-founder George Cameron told TechCrunch that the organization plans to increase its benchmarking spend as more AI labs develop reasoning models.

    “At Artificial Analysis, we run hundreds of evaluations monthly and devote a significant budget to these,” Cameron said. “We are planning for this spend to increase as models are more frequently released.”

    Artificial Analysis isn’t the only outfit of its kind that’s dealing with rising AI benchmarking costs.

    Ross Taylor, the CEO of AI startup General Reasoning, said he recently spent $580 evaluating Claude 3.7 Sonnet on around 3,700 unique prompts. Taylor estimates a single run-through of MMLU Pro, a question set designed to benchmark a model’s language comprehension skills, would have cost more than $1,800.

    “We’re moving to a world where a lab reports x% on a benchmark where they spend y amount of compute, but where resources for academics are << y,” said Taylor in a recent post on X. “[N]o one is going to be able to reproduce the results.”

    Why are reasoning models so expensive to test? Mainly because they generate a lot of tokens. Tokens represent bits of raw text, such as the word “fantastic” split into the syllables “fan,” “tas,” and “tic.” According to Artificial Analysis, OpenAI’s o1 generated over 44 million tokens during the firm’s benchmarking tests, around eight times the amount GPT-4o generated.

    The vast majority of AI companies charge for model usage by the token, so you can see how this cost can add up.

    Modern benchmarks also tend to elicit a lot of tokens from models because they contain questions involving complex, multi-step tasks, according to Jean-Stanislas Denain, a senior researcher at Epoch AI, which develops its own model benchmarks.

    “[Today’s] benchmarks are more complex [even though] the number of questions per benchmark has overall decreased,” Denain told TechCrunch. “They often attempt to evaluate models’ ability to do real-world tasks, such as write and execute code, browse the internet, and use computers.”

    Denain added that the most expensive models have gotten more expensive per token over time. For example, Anthropic’s Claude 3 Opus was the priciest model when it was released in May 2024, costing $75 per million output tokens. OpenAI’s GPT-4.5 and o1-pro, both of which launched earlier this year, cost $150 per million output tokens and $600 per million output tokens, respectively.

    “[S]ince models have gotten better over time, it’s still true that the cost to reach a given level of performance has greatly decreased over time,” Denain said. “But if you want to evaluate the best largest models at any point in time, you’re still paying more.”

    Many AI labs, including OpenAI, give benchmarking organizations free or subsidized access to their models for testing purposes. But this colors the results, some experts say — even if there’s no evidence of manipulation, the mere suggestion of an AI lab’s involvement threatens to harm the integrity of the evaluation scoring.

    “From [a] scientific point of view, if you publish a result that no one can replicate with the same model, is it even science anymore?” wrote Taylor in a follow-up post on X. “(Was it ever science, lol)”.

    Source: https://techcrunch.com/

    Share. Facebook Twitter Pinterest LinkedIn Tumblr Email
    Art Ryan

    Related Posts

    SAS Puts AI Governance at the Core of Its Agent Strategy

    April 29, 2026

    Amazon AI Hiring Software Enhances Recruitment Efficiency

    April 29, 2026

    AI Drug Development Johnson & Johnson Impact on Healthcare

    April 28, 2026

    Comments are closed.

    Latest News

    SAS Puts AI Governance at the Core of Its Agent Strategy

    April 29, 2026

    Big Tech AI Spending 2026: Investment Trends Revealed

    April 29, 2026

    Amazon AI Hiring Software Enhances Recruitment Efficiency

    April 29, 2026

    Oracle & CoreWeave Shares Fall on OpenAI Growth Miss

    April 29, 2026
    Facebook X (Twitter) Pinterest Vimeo WhatsApp TikTok Instagram LinkedIn YouTube Spotify Reddit Snapchat Threads

    AI University

    • Global Universities
    • Universities in Africa
    • Universities in Asia
    • Universities in Europe
    • Universities in Latin America
    • Universities in Middle East
    • Universities in North America
    • Universities in Oceania

    AI Tools & Apps Directory

    • AI Productivity Tools
    • AI Coding Tools
    • AI Voice Tools
    • AI Video Tools
    • AI Image Generators
    • AI Writing Tools

    Info

    • Home
    • About Us
    • AI Organizations & Associations
    • Contact Us

    Subscribe to Updates

    Get the latest creative news from FooBar about art, design and business.

    © 2026 Breaking AI News.
    • Privacy Policy

    Type above and press Enter to search. Press Esc to cancel.