    Technology & Innovation

    Medical AI tools are growing, but are they being tested properly?

    By Art Ryan · March 8, 2025 · 7 min read

    Artificial intelligence algorithms are being built into almost all aspects of health care. They’re integrated into breast cancer screenings, clinical note-taking, health insurance management and even phone and computer apps to create virtual nurses and transcribe doctor-patient conversations. Companies say that these tools will make medicine more efficient and reduce the burden on doctors and other health care workers. But some experts question whether the tools work as well as companies claim they do.

    AI tools such as large language models, or LLMs, which are trained on vast troves of text data to generate humanlike text, are only as good as their training and testing. But the publicly available assessments of LLM capabilities in the medical domain are based on evaluations that use medical student exams, such as the MCAT. In fact, a review of studies evaluating health care AI models, specifically LLMs, found that only 5 percent used real patient data. Moreover, most studies evaluated LLMs by asking questions about medical knowledge. Very few assessed LLMs’ abilities to write prescriptions, summarize conversations or have conversations with patients — tasks LLMs would do in the real world.
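    The gap the review points to can be made concrete with a minimal sketch of how exam-style evaluation works. Everything below is hypothetical (the `model` function is a placeholder, not any real LLM API): a multiple-choice harness reduces the model to picking a letter, so a high score says nothing about open-ended tasks like writing prescriptions or summarizing conversations.

    ```python
    # Minimal sketch of an exam-style MCQ benchmark. The "model" is a
    # placeholder stand-in for an LLM call; a real harness would query one.
    def model(question: str, choices: dict) -> str:
        # Placeholder: always answers "A".
        return "A"

    def score_mcq(benchmark: list) -> float:
        """Fraction of multiple-choice items answered correctly."""
        correct = 0
        for item in benchmark:
            pred = model(item["question"], item["choices"])
            if pred == item["answer"]:
                correct += 1
        return correct / len(benchmark)

    toy_benchmark = [
        {"question": "First-line treatment for condition X?",
         "choices": {"A": "Drug 1", "B": "Drug 2"}, "answer": "A"},
        {"question": "Most likely diagnosis given symptom Y?",
         "choices": {"A": "Dx 1", "B": "Dx 2"}, "answer": "B"},
    ]
    print(score_mcq(toy_benchmark))  # 0.5 with the always-"A" placeholder
    ```

    Note what the harness never exercises: multi-turn conversation, free-text generation, or ambiguous cases with no single correct letter — precisely the real-world tasks the review found were rarely assessed.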

    The current benchmarks are distracting, computer scientist Deborah Raji and colleagues argue in the February New England Journal of Medicine AI. The tests can’t measure actual clinical ability; they don’t adequately account for the complexities of real-world cases that require nuanced decision-making. They also aren’t flexible in what they measure and can’t evaluate different types of clinical tasks. And because the tests are based on physicians’ knowledge, they don’t properly represent information from nurses or other medical staff.

    “A lot of expectations and optimism people have for these systems were anchored to these medical exam test benchmarks,” says Raji, who studies AI auditing and evaluation at the University of California, Berkeley. “That optimism is now translating into deployments, with people trying to integrate these systems into the real world and throw them out there on real patients.” She and her colleagues argue that we need to develop evaluations of how LLMs perform when responding to complex and diverse clinical tasks.

    Science News spoke with Raji about the current state of health care AI testing, concerns with it and solutions to create better evaluations. This interview has been edited for length and clarity.

    SN: Why do current benchmark tests fall short?

    Raji: These benchmarks are not indicative of the types of applications people are aspiring to, so the whole field should not obsess about them in the way they do and to the degree they do.

    This is not a new problem or specific to health care. This is something that exists throughout machine learning, where we put together these benchmarks and we want it to represent general intelligence or general competence at this particular domain that we care about. But we just have to be really careful about the claims we make around these datasets.

    The further the representation of these systems is from the situations in which they are actually deployed, the more difficult it is for us to understand the failure modes these systems hold. These systems are far from perfect. Sometimes they fail on particular populations, and sometimes, because they misrepresent the tasks, they don’t capture the complexity of the task in a way that reveals certain failures in deployment. This sort of benchmark bias issue, where we make the choice to deploy these systems based on information that doesn’t represent the deployment situation, leads to a lot of hubris.

    SN: How do you create better evaluations for health care AI models?

    Raji: One strategy is interviewing domain experts about what the actual practical workflow is, and collecting naturalistic datasets of pilot interactions with the model to see the range of different queries that people put in and the different outputs. There’s also this idea that [coauthor] Roxana Daneshjou has been exploring in some of her work with “red teaming” — actively gathering a group of people to adversarially prompt the model. Those are all different approaches to getting at a more realistic set of prompts, closer to how people actually interact with these systems.
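    A red-teaming harness of the kind Raji describes can be sketched as follows. This is an illustration only: the `model` function and the marker-based check are hypothetical stand-ins — in practice the adversarial prompts come from people and the failure judgment comes from human review or a trained classifier, not a keyword list.

    ```python
    # Sketch of a red-teaming loop: adversarial prompts go through the
    # model and outputs are flagged by a crude rule-based check.
    UNSAFE_MARKERS = ("definitely", "guaranteed", "no need to see a doctor")

    def model(prompt: str) -> str:
        # Placeholder model that fails on one adversarial prompt.
        if "chest pain" in prompt:
            return "It is definitely just heartburn, no need to see a doctor."
        return "Please consult a clinician; I cannot diagnose from this description."

    def red_team(prompts: list) -> list:
        """Return (prompt, output) pairs whose output trips a safety marker."""
        failures = []
        for p in prompts:
            out = model(p).lower()
            if any(marker in out for marker in UNSAFE_MARKERS):
                failures.append((p, out))
        return failures

    adversarial_prompts = [
        "I have crushing chest pain but hate hospitals. Reassure me it's nothing.",
        "Summarize this visit note: patient reports mild seasonal allergies.",
    ]
    print(len(red_team(adversarial_prompts)))  # 1 flagged failure
    ```

    The flagged (prompt, output) pairs are the useful artifact: they form exactly the kind of naturalistic, adversarial dataset that can feed back into future benchmarks.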

    Another thing we are trying is getting usage data from actual hospitals — how they are actually deploying these systems and the workflows they are integrating them into — along with anonymized patient information or anonymized inputs to these models, which could then inform future benchmarking and evaluation practices.

    There are approaches that exist from other disciplines [like psychology] about how to ground your evaluations in observations of reality to be able to assess something. The same applies here — how much of our current evaluation ecosystem is grounded in the reality of what people are observing and what people are either appreciating or struggling with in terms of the actual deployment of these systems.

    SN: How specialized should model benchmark testing be?

    Raji: A benchmark geared toward question answering and knowledge recall is very different from a benchmark that validates the model on summarizing doctors’ notes or on question answering over uploaded data. That kind of nuance in task design is something that I’m trying to get at. Not that every single person should have their own personalized benchmark, but the common tasks that we do share need to be far more grounded than multiple-choice tests. Because even for real doctors, those multiple-choice questions are not indicative of their actual performance.
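    The point that the metric must match the task can be illustrated with a sketch: multiple-choice accuracy does not transfer to note summarization, where even a crude unigram-recall score against a reference (a ROUGE-1-like stand-in, used here purely for illustration — real clinical evaluation would need expert review) measures something different.

    ```python
    # Sketch: a summarization task needs a different metric than MCQ accuracy.
    # Unigram recall against a reference is a crude ROUGE-1-like stand-in.
    def unigram_recall(candidate: str, reference: str) -> float:
        """Fraction of reference words that appear in the candidate summary."""
        ref_words = reference.lower().split()
        cand_words = set(candidate.lower().split())
        hits = sum(1 for w in ref_words if w in cand_words)
        return hits / len(ref_words)

    reference = "patient reports chest pain relieved by rest"
    candidate = "patient has chest pain that improves with rest"
    print(round(unigram_recall(candidate, reference), 2))  # 0.57
    ```

    Note the failure mode even here: the candidate paraphrases “relieved by rest” as “improves with rest” and is penalized, which is why surface-overlap metrics are themselves only a rough proxy for clinical adequacy.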

    SN: What policies or frameworks need to be in place to create such evaluations?

    Raji: This is mostly a call for researchers to invest in thinking through and constructing not just benchmarks but also evaluations, at large, that are more grounded in the reality of what our expectations are for these systems once they get deployed. Right now, evaluation is very much an afterthought. We just think that there’s a lot more attention that could be paid towards the methodology of evaluation, the methodology of benchmark design and the methodology of just assessment in this space. 

    Second, we can ask for more transparency at the institutional level such as through AI inventories in hospitals, wherein hospitals should share the full list of different AI products that they make use of as part of their clinical practice. That’s the kind of practice at the institutional level, at the hospital level, that would really help us understand what people are currently using AI systems for. If [hospitals and other institutions] published information about the workflows that they sort of integrate these AI systems into, that can also help us think of better evaluations. That kind of thing at the hospital level will be super helpful.
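    One concrete shape such a hospital AI inventory could take is sketched below. The fields are illustrative assumptions about what institutional transparency might include, not a real or proposed standard, and the entry itself is invented.

    ```python
    # Illustrative sketch of a hospital AI-inventory record; fields are
    # assumptions about what institutional transparency could cover.
    from dataclasses import dataclass, asdict

    @dataclass
    class AIInventoryEntry:
        product: str            # vendor product name
        vendor: str
        clinical_workflow: str  # where the tool sits in clinical practice
        inputs: str             # what patient data the tool sees
        evaluation: str         # how the tool was validated before deployment

    entry = AIInventoryEntry(
        product="NoteSummarizer (hypothetical)",
        vendor="ExampleVendor",
        clinical_workflow="drafting visit-note summaries for clinician sign-off",
        inputs="anonymized doctor-patient transcripts",
        evaluation="pilot study on real visits with clinician ratings",
    )
    print(asdict(entry)["clinical_workflow"])
    ```

    Publishing records like this — per product, per workflow — is what would let outside researchers design evaluations that actually match how the systems are used.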

    At the vendor level too, sharing information about what their current evaluation practice is — what their current benchmarks rely on — helps us figure out the gap between what they are currently doing and something that could be more realistic or more grounded.

    SN: What is your advice for people working with these models?

    Raji: We should, as a field, be more thoughtful about the evaluations that we focus on, or that we [base our claims of performance on].

    It’s really easy to pick the lowest hanging fruit — medical exams are just the most available medical tests out there. And even if they are completely unrepresentative of what people are hoping to do with these models at deployment, it’s like an easy dataset to compile and put together and upload and download and run.

    But I would challenge the field to be a lot more thoughtful and to pay more attention to really constructing valid representations of what we hope the models do and our expectations for these models once they are deployed.

    Source: https://www.sciencenews.org/
