Radiologists assisted by an AI screen for breast cancer more successfully than they do when they work alone, according to new research. That same AI also produces more accurate results in the hands of a radiologist than it does when operating solo.

The large-scale study, published this month in The Lancet Digital Health, is the first to directly compare an AI’s performance in breast cancer screening according to whether it’s used alone or to assist a human expert. The hope is that such AI systems could save lives by detecting cancers doctors miss, free up radiologists to see more patients, and ease the burden in places where there is a dire lack of specialists.

The software being tested comes from Vara, a startup based in Germany that also led the study. The company’s AI is already used in over a fourth of Germany’s breast cancer screening centers and was introduced earlier this year to a hospital in Mexico and another in Greece.

The Vara team, with help from radiologists at the Essen University Hospital in Germany and the Memorial Sloan Kettering Cancer Center in New York, tested two approaches. In the first, the AI works alone to analyze mammograms. In the other, the AI automatically distinguishes between scans it thinks look normal and those that raise a concern. It refers the latter to a radiologist, who would review them before seeing the AI’s assessment. Then the AI would issue a warning if it detected cancer when the doctor did not.

“In the proposed AI-driven process nearly three-quarters of the screening studies didn’t need to be reviewed by a radiologist, while improving accuracy overall.”

Charles Langlotz

To train the neural network, Vara fed the AI data from over 367,000 mammograms—including radiologists’ notes, original assessments, and information on whether the patient ultimately had cancer—to learn how to place these scans into one of three buckets: “confident normal,” “not confident” (in which no prediction is given), and “confident cancer.” The conclusions from both approaches were then compared with the decisions real radiologists originally made on 82,851 mammograms sourced from screening centers that didn’t contribute scans used to train the AI.

Related work from others:  UC Berkeley - Goal Representations for Instruction Following

The second approach—doctor and AI working together—was 3.6% better at detecting breast cancer than a doctor working alone, and raised fewer false alarms. It accomplished this while automatically setting aside scans it classified as confidently normal, which amounted to 63% of all mammograms. This intense streamlining could slash radiologists’ workloads.

After breast cancer screenings, patients with a normal scan are sent on their way, while an abnormal or unclear scan triggers follow-up testing. But radiologists examining mammograms miss 1 in 8 cancers. Fatigue, overwork, and even the time of day all affect how well radiologists can identify tumors as they view thousands of scans. Signs that are visually subtle are also generally less likely to set off alarms, and dense breast tissue—found mostly in younger patients—makes signs of cancer harder to see.

Radiologists using the AI in the real world are required by German law to look at every mammogram, at least glancing at those the AI calls fine. The AI still lends them a hand by pre-filling reports on scans labeled normal, though the radiologist can always reject the AI’s call. 

Thilo Töllner, a radiologist who heads a German breast cancer screening center, has used the program for two years. He’s sometimes disagreed when the AI classified scans as confident normal and manually filled out reports to reflect a different conclusion, but he says “normals are almost always normal.” Mostly, “you just have to press enter.” 

Mammograms the AI has labeled as ambiguous or “confident cancer” are referred to a radiologist—but only after the doctor has offered an initial, independent assessment.

Related work from others:  Latest from MIT Tech Review - This robot can tidy a room without any help

Radiologists classify mammograms on a 0 to 6 scale known as BI-RADS, where lower is better. A score of 3 indicates that something is probably benign, but worth checking up on. If Vara has assigned a BI-RADS score of 3 or higher to a mammogram the radiologist labels normal, a warning appears.

AI generally excels at image classification. So why did Vara’s AI on its own underperform a lone doctor? Part of the problem is that a mammogram alone can’t determine whether someone has cancer—that requires removing and testing the abnormal-looking tissue. Instead, the AI examines mammograms for hints.

Christian Leibig, lead author on the study and director of machine learning at Vara, says that mammograms of healthy and cancerous breasts can look very similar, and both types of scans can present a wide range of visual results. This complicates AI training. So does the low prevalence of cancer in breast screenings (according to Leibig, “in Germany, it’s roughly six in 1,000”). Because AIs trained to catch cancer are mostly trained on healthy breast scans, they can be prone to false positives.

The study tested the AI only on past mammogram decisions and assumed that radiologists would agree with the AI each time it issued a decision of “confident normal” or “confident cancer.” When the AI was unsure, the study defaulted to the original radiologist’s reading. That means it couldn’t test how using AI affects radiologists’ decisions—and whether any such changes may create new risks. Töllner admits he spends less time scrutinizing scans Vara labels normal than those it deems suspicious. “You get quicker with the normals because you get confident with the system,” he says.

Related work from others:  Latest from MIT Tech Review - LLMs become more covertly racist with human intervention

Curtis Langlotz, director of Stanford’s Center for Artificial Intelligence in Medicine and Imaging, is impressed, but he says the next step would be to confirm how well the AI performs over a long period of time in actual clinics with real patients.

So far, attempts to fully replace radiologists with AI have failed. A 2021 review found that in 34 of 36 studies, the AI did worse than a single radiologist at screening for breast cancer from mammograms. All 36 were less accurate than the consensus of two radiologists, which some countries require.

“We often say that AI will not replace radiologists,” Langlotz says. “This study doesn’t change that, but in the proposed AI-driven process nearly three-quarters of the screening studies didn’t need to be reviewed by a radiologist, while improving accuracy overall.” That, he says, is “groundbreaking.”

Langlotz adds that this approach could ease the shortage of radiologists, especially in countries such as Malawi, where there is one radiologist per 8.8 million people, or India, a country of 1.4 billion served by one radiologist for every 100,000 people. Even the US, which proportionally has 10 times as many radiologists as India, is projected to be short 17,000 radiologists by 2033.

Töllner is optimistic that more radiologists using AI will mean earlier breast cancer detection, which could improve survival rates. He also hopes Vara will help quash the high number of false positives—patients recalled for further testing who are actually fine.

Similar Posts