Can AI learn to "double-check" its work when reading tricky medical reports in other languages?

Agent-Based Uncertainty Awareness Improves Automated Radiology Report Labeling with an Open-Source Large Language Model

Published 2/5/2025 by Hadas Ben Atya, Naama Gavrielov, Zvi Natan Badash, Gili Focht, Ruth Cytter Kuint, Talar Hagopian, Dan Turner and 1 more...


Overview

  • Research on extracting data from Hebrew radiology reports using large language models (LLMs)
  • Developed uncertainty-aware approach with agent-based decision making
  • Analyzed 9,683 reports from Crohn's disease patients across three medical centers
  • Used Llama 3.1 model with Bayesian Prompt Ensembles
  • Achieved improved accuracy after filtering uncertain predictions
  • Demonstrated better reliability for medical applications

Plain English Explanation

Getting useful information from medical reports written in languages like Hebrew is hard for AI systems. This research team developed a smarter way to do it by making the AI express how confident it is in its answers.

Think of it like having multiple doctors look at the same medical report. Each doctor might interpret things slightly differently. By combining their views and noting where they agree or disagree, you get a more reliable final opinion.

The agent-based uncertainty awareness system works like a chief doctor who looks at all these opinions and decides how confident they can be in the final diagnosis. When the system isn't sure, it says so instead of making risky guesses.

Key Findings

Accuracy improved with the new approach, and the system performed best when it could flag difficult cases as uncertain instead of forcing an answer:

  • F1 score improved from 0.3967 to 0.4787 after filtering uncertain predictions
  • Achieved 64.37% recall on initial testing
  • System could clearly separate correct predictions from incorrect ones
  • The agent-based approach showed better-calibrated uncertainty estimates
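
To make the first finding concrete, here is a minimal sketch of how dropping low-confidence predictions can change the F1 score. The labels and confidence values are made-up toy data, not the paper's, and the function names and the 0.6 threshold are assumptions chosen for illustration.

```python
def f1_score(y_true, y_pred):
    """F1 for binary labels (1 = finding present, 0 = absent)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def filter_confident(y_true, y_pred, confidence, threshold):
    """Keep only predictions whose confidence is at or above the threshold."""
    kept = [(t, p) for t, p, c in zip(y_true, y_pred, confidence) if c >= threshold]
    return [t for t, _ in kept], [p for _, p in kept]


# Toy data (hypothetical): six reports with true labels, predictions, confidences.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0, 1]
confidence = [0.9, 0.4, 0.3, 0.8, 0.95, 0.5]

f1_all = f1_score(y_true, y_pred)
kept_true, kept_pred = filter_confident(y_true, y_pred, confidence, threshold=0.6)
f1_kept = f1_score(kept_true, kept_pred)
coverage = len(kept_true) / len(y_true)  # fraction of reports still labeled
```

Filtering trades coverage for accuracy: the score is computed only on the reports that remain, so the filtered-out cases still need another route, such as human review.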

Technical Explanation

The team used Llama 3.1 (8B-parameter version) as their base model. They wrote six differently worded prompts for the same question and used Bayesian Prompt Ensembles to estimate uncertainty.
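
The core idea of a prompt ensemble can be sketched as a weighted average of the answers from each prompt, with the entropy of the combined probability serving as an uncertainty score. This is a simplified illustration, not the paper's implementation: the probabilities below are invented, the function names are hypothetical, and the weights are assumed to be given (as we understand it, Bayesian Prompt Ensembles learn them from a small labeled validation set, a step this sketch omits).

```python
import math


def ensemble_probability(prompt_probs, weights=None):
    """Weighted average of per-prompt probabilities that a finding is present."""
    if not prompt_probs:
        raise ValueError("need at least one prompt probability")
    if weights is None:
        # Assumption: fall back to uniform weights when none are supplied.
        weights = [1.0] * len(prompt_probs)
    if len(weights) != len(prompt_probs):
        raise ValueError("weights and prompt_probs must have the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * p for w, p in zip(weights, prompt_probs)) / total


def binary_entropy(p):
    """Entropy in bits of a yes/no prediction: 0 = certain, 1 = maximally unsure."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


# Hypothetical outputs of six prompts asking the same question about one report.
six_prompt_probs = [0.9, 0.8, 0.85, 0.4, 0.95, 0.7]
p_positive = ensemble_probability(six_prompt_probs)
uncertainty = binary_entropy(p_positive)
```

A report whose prompts disagree ends up with a combined probability closer to 0.5 and therefore a higher entropy, which is the signal used to flag it as uncertain.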

The pipeline processed the reports in multiple stages:

  1. Initial processing of 9,683 Hebrew radiology reports
  2. Manual annotation of 512 reports for validation
  3. Automatic annotation of remaining reports using HSMP-BERT
  4. Implementation of agent-based decision model with five confidence levels
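
The final stage, a decision step that assigns one of five confidence levels, could look something like the sketch below. The summary does not say how the paper's agent makes this decision, so everything here is an assumption for illustration: the level names, the cut points, and the use of simple vote agreement across prompts.

```python
from enum import IntEnum


class Confidence(IntEnum):
    """Five hypothetical confidence levels, from least to most certain."""
    VERY_LOW = 1
    LOW = 2
    MEDIUM = 3
    HIGH = 4
    VERY_HIGH = 5


def decide(prompt_votes):
    """Turn yes/no votes (1/0) from several prompts into a label and a confidence."""
    if not prompt_votes:
        raise ValueError("need at least one vote")
    positive_fraction = sum(prompt_votes) / len(prompt_votes)
    label = 1 if positive_fraction >= 0.5 else 0
    # Agreement is the share of prompts siding with the majority, in [0.5, 1.0].
    agreement = max(positive_fraction, 1 - positive_fraction)
    # Cut points are arbitrary illustrative values, not taken from the paper.
    if agreement >= 0.95:
        level = Confidence.VERY_HIGH
    elif agreement >= 0.8:
        level = Confidence.HIGH
    elif agreement >= 0.7:
        level = Confidence.MEDIUM
    elif agreement >= 0.6:
        level = Confidence.LOW
    else:
        level = Confidence.VERY_LOW
    return label, level


# Example: five of six prompts say the finding is present.
label, level = decide([1, 1, 1, 1, 1, 0])
needs_review = level <= Confidence.LOW  # route low-confidence cases to a human
```

Discrete levels make the downstream rule easy to state, for example "accept HIGH and above, send the rest for review", at the cost of discarding some of the detail in the underlying probabilities.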

Critical Analysis

Several limitations deserve consideration:

  • Limited to Hebrew language reports only
  • Focused specifically on Crohn's disease cases
  • Relatively small manually annotated dataset
  • Automated structured generation may need further validation

The research could benefit from testing across more languages and medical conditions. Additionally, the impact of different prompt designs needs more investigation.

Conclusion

This research represents a significant step forward in making AI systems more trustworthy for medical applications. The agentic LLM workflow, which combines multiple perspectives with uncertainty awareness, shows promise for improving medical data extraction.

The ability to know when to express uncertainty rather than make potentially dangerous mistakes is crucial for healthcare applications. This work provides a foundation for developing more reliable AI systems in medicine.

Original Paper

View on arXiv
