Agreement Across 10 Artificial Intelligence Models in Assessing HER2 in Breast Cancer Whole Slide Images: Findings from the Friends of Cancer Research Digital PATH Project
Brittany McKelvey, Pedro A. Torres-Saavedra, Jessica Li, Glenn Broeckx, Frederik Deman, Siraj Ali, Hillary Andrews, Salim Arslan, Santhosh Balasubramanian, J. Carl Barrett, Peter Caie, Ming Chen, Daniel Cohen, Tathagata Dasgupta, Brandon Gallas, George Green, Mark Gustavson, Sarah Hersey, Ana Hidalgo-Sastre, Shahanawaz Jiwani, Wonkyung Jung, Kimary Kulig, Vladimir Kushnarev, Xiaoxian Li, Meredith Lodge, Joan Mancuso, Mike Montalto, Satabhisa Mukhopadhyay, Matthew Oberley, Pahini Pandya, Oscar Puig, Edward Richardson, Alexander Sarachakov, Or Shaked, Mark Stewart, Lisa M. McShane, Roberto Salgado, Jeff Allen
San Antonio Breast Cancer Symposium, 2024
Recent successes of HER2 antibody-drug conjugates (ADCs) have expanded patient eligibility for HER2-targeted therapy; accurate and consistent identification of patients who may benefit from ADCs is therefore more critical than ever. Previous studies of agreement between pathologists highlight areas of discordance, but little is known about the reproducibility of assessments by emerging artificial intelligence (AI) models, particularly at low levels of HER2 expression. These models have the potential to deliver more quantitative and reproducible HER2 assessments than visual scoring by pathologists, but large-scale comparative evaluations of their variability are lacking. Friends of Cancer Research created a research partnership to describe and evaluate agreement in HER2 biomarker assessment across independently developed AI models.

Both H&E and HER2 IHC whole-slide images (WSIs; N=1,124) from 733 patients diagnosed with breast cancer in 2021 were obtained from a single laboratory (ZAS Hospital, Antwerp, Belgium). Available pathology and specimen data include three pathologists’ HER2 readings and details on slide processing and digitization. Ten AI models assessed HER2 status on all cases; blinded, independent analyses were performed by statisticians from the National Cancer Institute. Of the 10 models, seven used HER2 IHC WSIs, two used H&E WSIs, and one used both stains as inputs to determine HER2 score and/or status. The primary analysis focused on the seven models (six using IHC, one using both IHC and H&E) that provided HER2 scores in the ASCO/CAP 2018 categories (0, 1+, 2+, 3+).

Absent a defined reference standard, agreement was evaluated for all possible pairings of models across all samples, yielding a median (interquartile range, IQR) pairwise overall percent agreement (OPA) of 65.1% (60.3-69.1%) and unweighted Cohen’s kappa of 0.51 (0.45-0.55). When HER2 scores were dichotomized as 3+ vs. not 3+, the median (IQR) pairwise agreement measures were: OPA 97.3% (95.9-97.9%), average positive agreement (APA) 87.3% (84.1-90.9%), average negative agreement (ANA) 98.5% (97.7-98.8%), and kappa 0.86 (0.82-0.90). In contrast, when scores were dichotomized as 0 vs. not 0, the median (IQR) measures were: OPA 85.6% (82.4-88.0%), APA 91.3% (87.4-92.6%), ANA 65.2% (59.9-69.7%), and kappa 0.57 (0.51-0.61). Ongoing analyses aim to assess the association of between-model agreement with patient, specimen, and model characteristics, as well as agreement between models and pathologist readings.

These findings highlight variability in HER2 biomarker scoring across models, with the least variability and highest agreement in reporting 3+ cases and larger inter-model variation in evaluating HER2-low tumors, similar to agreement measures between pathologists observed in published studies. Further work is needed to understand the variability in ascribing lower HER2 scores and to evaluate performance in the context of clinical application, especially given the evolving treatment landscape and clinical implications of HER2 scores. This ongoing research partnership will enable a greater understanding of variability in AI models and support best practices for measuring and reporting AI-driven biomarker assessments in drug development and clinical practice. The dataset also has potential value for creating reference sets for future model development.
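The pairwise agreement measures used above can be illustrated with a minimal Python sketch. The model names and HER2 scores below are toy data for illustration only, not values from the study; OPA, unweighted Cohen's kappa, and APA/ANA (for the 3+ vs. not 3+ dichotomization) are computed for every model pairing:

```python
from itertools import combinations

def opa(x, y):
    """Overall percent agreement: fraction of samples with identical scores."""
    return sum(a == b for a, b in zip(x, y)) / len(x)

def cohens_kappa(x, y):
    """Unweighted Cohen's kappa: observed vs. chance-expected agreement."""
    n = len(x)
    cats = sorted(set(x) | set(y))
    po = sum(a == b for a, b in zip(x, y)) / n
    pe = sum((x.count(c) / n) * (y.count(c) / n) for c in cats)
    return (po - pe) / (1 - pe)

def apa_ana(x, y, positive):
    """Average positive/negative agreement for a binary dichotomization."""
    a = sum(u == positive and v == positive for u, v in zip(x, y))  # both positive
    d = sum(u != positive and v != positive for u, v in zip(x, y))  # both negative
    bc = len(x) - a - d                                             # discordant pairs
    apa = 2 * a / (2 * a + bc) if (2 * a + bc) else float("nan")
    ana = 2 * d / (2 * d + bc) if (2 * d + bc) else float("nan")
    return apa, ana

# Hypothetical scores from three illustrative models (ASCO/CAP categories 0-3+)
models = {
    "model_A": ["0", "1+", "2+", "3+", "1+", "0", "3+", "2+"],
    "model_B": ["0", "1+", "1+", "3+", "2+", "0", "3+", "2+"],
    "model_C": ["1+", "1+", "2+", "3+", "1+", "0", "2+", "2+"],
}

for (na, xa), (nb, xb) in combinations(models.items(), 2):
    bin_a = ["3+" if s == "3+" else "not 3+" for s in xa]
    bin_b = ["3+" if s == "3+" else "not 3+" for s in xb]
    apa, ana = apa_ana(bin_a, bin_b, positive="3+")
    print(f"{na} vs {nb}: OPA={opa(xa, xb):.2f}, "
          f"kappa={cohens_kappa(xa, xb):.2f}, APA(3+)={apa:.2f}, ANA={ana:.2f}")
```

In a study-scale analysis, the 21 pairings of seven models would each yield one set of measures, and the medians and IQRs reported above summarize that distribution.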