Pathologist-Artificial Intelligence (AI) Concordance in HER2 Interpretation for Advanced Biliary Tract Cancer: Intra-observer, Inter-observer, and Human-AI Variability
Hyunchul Kim, Jinhyung Heo, Soo Ick Cho, Beodeul Kang, Jung Sun Kim, Chan Kim, Chang Il Kwon, Min Je Sung, Seok-Pyo Shin, Seok Jeong Yang, Incheon Kang, Sung Hwan Lee, Chansik An, Seungeun Lee, Jin Woo Oh, Hee Yeon Kay, Jiwon Shin, Taebum Lee, Sanghoon Song, Sukjun Kim, Heon Song, Sergio Pereira, Gwangil Kim, Hong Jae Chon
Laboratory Investigation, 2025
Abstract
Biliary tract cancer (BTC) is a rare, aggressive malignancy with a poor prognosis. HER2 is overexpressed in a subset of BTC patients, and recent advances in HER2-targeted agents are expanding therapeutic options. However, accurately assessing HER2 expression in BTC is challenging, especially at low expression levels. This study compared pathologists' HER2 interpretations in advanced BTC using light microscopy (LM) and digital pathology (DP), and evaluated the potential of artificial intelligence (AI)-powered pathology to improve HER2-scoring consistency.
A total of 309 HER2 immunohistochemistry slides were obtained from advanced BTC patients who received systemic therapy at CHA Bundang Medical Center between 2019 and 2022. Three pathologists independently evaluated HER2 expression twice, once using LM and once using DP, with a wash-out period between evaluations. An AI-powered whole slide image analyzer evaluated HER2 expression. The ground truth was selected based on consensus among pathologists.
Pathologists showed complete agreement on HER2 results in 62.1% of LM evaluations and 63.4% of DP evaluations. Weighted kappa values ranged from 0.979 to 0.984 for intra-observer agreement and, for inter-observer agreement, from 0.819 to 0.863 with LM and from 0.820 to 0.876 with DP. The overall concordance rate of HER2 categories between the AI and the ground truth was 83.5%. Clinical factors such as HER2 expression level and specimen type significantly influenced intra- and inter-observer variability, whereas histologic grade was a key factor affecting the AI model’s performance.
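The agreement measures reported above (complete agreement rate, weighted kappa, and overall concordance) can be illustrated with a minimal sketch. It rests on several assumptions that the abstract does not state: HER2 categories are taken to be the ordinal immunohistochemistry scores 0, 1+, 2+, and 3+; quadratic weights are used by default (the weighting scheme is not specified in the abstract); and all function names and the toy readings are hypothetical, not the study's code or data.

```python
# Illustrative sketch only: hypothetical helper names and made-up toy readings.
# Assumed ordinal HER2 IHC categories; the abstract does not list them.
HER2_CATEGORIES = ["0", "1+", "2+", "3+"]


def weighted_kappa(rater_a, rater_b, categories=HER2_CATEGORIES, weights="quadratic"):
    """Weighted Cohen's kappa for two raters scoring the same slides."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must score the same non-empty set of slides")
    k = len(categories)
    index = {c: i for i, c in enumerate(categories)}
    n = len(rater_a)

    # Observed joint proportions: obs[i][j] = share of slides scored i by A and j by B.
    obs = [[0.0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        obs[index[a]][index[b]] += 1.0 / n
    row = [sum(obs[i]) for i in range(k)]
    col = [sum(obs[i][j] for i in range(k)) for j in range(k)]

    def disagreement(i, j):
        # Disagreement weight grows with the distance between ordinal categories.
        d = abs(i - j) / (k - 1)
        return d if weights == "linear" else d * d

    observed = sum(disagreement(i, j) * obs[i][j] for i in range(k) for j in range(k))
    expected = sum(disagreement(i, j) * row[i] * col[j] for i in range(k) for j in range(k))
    if expected == 0:
        raise ValueError("kappa is undefined when both raters use a single category")
    return 1.0 - observed / expected


def concordance_rate(predicted, reference):
    """Fraction of slides where the predicted category equals the reference."""
    return sum(p == r for p, r in zip(predicted, reference)) / len(reference)


def complete_agreement_rate(*readers):
    """Fraction of slides on which every reader assigned the same category."""
    slides = list(zip(*readers))
    return sum(len(set(scores)) == 1 for scores in slides) / len(slides)


# Toy readings for six slides (made up for illustration).
reader_1 = ["0", "1+", "2+", "3+", "1+", "0"]
reader_2 = ["0", "1+", "1+", "3+", "2+", "0"]
reader_3 = ["0", "1+", "2+", "3+", "1+", "1+"]

print("complete agreement:", complete_agreement_rate(reader_1, reader_2, reader_3))
print("weighted kappa (1 vs 2):", round(weighted_kappa(reader_1, reader_2), 3))
print("concordance (2 vs 1):", round(concordance_rate(reader_2, reader_1), 3))
```

Weighted kappa penalizes a 0 versus 3+ disagreement more heavily than a 1+ versus 2+ disagreement, which is why it suits ordinal scores better than the unweighted statistic. Pairwise inter-observer values would be obtained by applying `weighted_kappa` to each pair of readers.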
In this study, we quantified intra- and inter-observer variability in pathologists' HER2 evaluation of BTC and demonstrated that AI-powered HER2 scoring showed high concordance with pathologists’ evaluations. Because the clinical factors associated with discrepancies differed between the AI model and the pathologists, the AI model is expected to help pathologists make more objective HER2 assessments in BTC in the future.