Performance of a Breast Cancer Detection AI Algorithm Using the Personal Performance in Mammographic Screening Scheme

Yan Chen, Adnan G. Taib, Iain T. Darker, et al.

Radiology, 2023

Abstract
Background: The Personal Performance in Mammographic Screening (PERFORMS) scheme is used to assess reader performance. Whether this scheme can assess the performance of artificial intelligence (AI) algorithms is unknown.

Purpose: To compare the performance of human readers and a commercially available AI algorithm interpreting PERFORMS test sets.

Materials and Methods: In this retrospective study, two PERFORMS test sets, each consisting of 60 challenging cases, were evaluated by human readers between May 2018 and March 2021 and by an AI algorithm in 2022. The AI considered each breast separately, assigning a suspicion of malignancy score to the features detected. Performance was assessed using the highest score per breast. Performance metrics, including sensitivity, specificity, and area under the receiver operating characteristic curve (AUC), were calculated for the AI and the human readers. The study was powered to detect a medium-sized effect (odds ratio, 3.5 or 0.29) for sensitivity.
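The breast-level scoring described above can be sketched in a few lines: each breast receives the maximum suspicion score among its detected features, and sensitivity and specificity follow from a chosen recall threshold. This is a minimal illustrative sketch, not the study's implementation; all function names, scores, and labels below are hypothetical.

```python
def breast_score(feature_scores):
    """Highest suspicion-of-malignancy score among detected features (0.0 if none)."""
    return max(feature_scores, default=0.0)

def sensitivity_specificity(scores, labels, threshold):
    """labels: 1 = malignant, 0 = normal/benign; a breast is recalled if score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    fn = sum(1 for s, y in zip(scores, labels) if y == 1 and s < threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

# Toy example: four breasts, each with a list of per-feature suspicion scores.
breasts = [[0.2, 0.9], [0.1], [0.7, 0.4], []]
labels = [1, 0, 1, 0]
scores = [breast_score(fs) for fs in breasts]
sens, spec = sensitivity_specificity(scores, labels, threshold=0.5)
print(sens, spec)  # 1.0 1.0
```

Varying the threshold trades sensitivity against specificity, which is how the study matched the AI's operating point to mean human reader performance.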

Results: A total of 552 human readers interpreted both PERFORMS test sets, consisting of 161 normal breasts, 70 malignant breasts, and nine benign breasts. No difference was observed at the breast level between the AUC for AI and the AUC for human readers (0.93 and 0.88, respectively; P = .15). When using the developer's suggested recall score threshold, no difference was observed for AI versus human reader sensitivity (84% and 90%, respectively; P = .34), but the specificity of AI was higher (89%) than that of the human readers (76%; P = .003). However, it was not possible to demonstrate equivalence due to the size of the test sets. When using recall thresholds to match mean human reader performance (90% sensitivity, 76% specificity), AI showed no differences in performance, with a sensitivity of 91% (P = .73) and a specificity of 77% (P = .85).

Conclusion: Diagnostic performance of AI was comparable with that of the average human reader when evaluating cases from two enriched test sets from the PERFORMS scheme.
