Publications

ABSTRACT

Background
We examined the potential change in cancer detection when using artificial intelligence (AI) cancer-detection software to triage a proportion of screening examinations into a no-radiologist work stream and then, after regular radiologist assessment of the remainder, to triage a further proportion into an enhanced assessment work stream. The purpose of enhanced assessment was to simulate the selection of women for more sensitive screening, promoting early detection of cancers that would otherwise be diagnosed as interval cancers or as next-round screen-detected cancers. The aim of the study was to examine how AI could reduce radiologist workload and increase cancer detection.

Methods
In this retrospective simulation study, all women diagnosed with breast cancer who attended two consecutive screening rounds were included. Healthy women were randomly sampled from the same cohort; their observations were given elevated weight to mimic a frequency of 0·7% incident cancer per screening interval. Based on the prediction score from a commercially available AI cancer detector, various cutoff points for the decision to channel women to the two new work streams were examined in terms of missed and additionally detected cancers.
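
As a rough illustration of this design, the sketch below routes the lowest-scoring examinations to the no-radiologist stream and the highest-scoring ones to enhanced assessment, using weighted quantiles so the up-weighted healthy controls are handled consistently. The `ai_score`, `weight`, and `outcome` columns are hypothetical names, not the study's actual data dictionary.

```python
import numpy as np
import pandas as pd

def weighted_quantile(scores, weights, q):
    """Quantile of `scores` under observation weights (controls are up-weighted)."""
    order = np.argsort(scores)
    s, w = np.asarray(scores, float)[order], np.asarray(weights, float)[order]
    cum = np.cumsum(w) / w.sum()
    return s[np.searchsorted(cum, q)]

def simulate_triage(df: pd.DataFrame, no_rad_frac=0.60, enhanced_frac=0.01):
    lo = weighted_quantile(df.ai_score, df.weight, no_rad_frac)
    hi = weighted_quantile(df.ai_score, df.weight, 1.0 - enhanced_frac)
    no_rad = df.ai_score <= lo       # lowest scores: no radiologist assessment
    enhanced = df.ai_score >= hi     # highest scores: enhanced assessment
    # Cancers carry unit weight, so plain counts suffice for the cancer tallies.
    missed = ((df.outcome == "screen_detected") & no_rad).sum()
    gained = (df.outcome.isin(["interval", "next_round"]) & enhanced).sum()
    return missed, gained
```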

Findings
7364 women were included in the study sample: 547 were diagnosed with breast cancer and 6817 were healthy controls. When including 60%, 70%, or 80% of women with the lowest AI scores in the no-radiologist stream, the proportion of screen-detected cancers that would have been missed was 0%, 0·3% (95% CI 0·0–4·3), or 2·6% (1·1–5·4), respectively. When including 1% or 5% of women with the highest AI scores in the enhanced assessment stream, the potential additional cancer detection was 24 (12%) or 53 (27%) of 200 subsequent interval cancers, respectively, and 48 (14%) or 121 (35%) of 347 next-round screen-detected cancers, respectively.

Interpretation
Using a commercial AI cancer detector to triage mammograms into no-radiologist assessment and enhanced assessment could potentially reduce radiologist workload by more than half and pre-emptively detect a substantial proportion of cancers otherwise diagnosed later.

AUTHORS
Mattie Salim, MD1,2; Erik Wåhlin, MSc3; Karin Dembrower, MD4,5; Edward Azavedo, MD, PhD1,6; Theodoros Foukakis, MD, PhD1,2; Yue Liu, MSc7; Kevin Smith, MSc, PhD8; Martin Eklund, MSc, PhD9; Fredrik Strand, MD, PhD1,10
1Department of Oncology-Pathology, Karolinska Institute, Stockholm, Sweden, 2Department of Radiology, Karolinska University Hospital, Stockholm, Sweden, 3Department of Medical Radiation Physics and Nuclear Medicine, Karolinska University Hospital, Stockholm, Sweden, 4Department of Physiology and Pharmacology, Karolinska Institute, Stockholm, Sweden, 5Department of Radiology, Capio Sankt Görans Hospital, Stockholm, Sweden, 6Department of Molecular Medicine and Surgery, Karolinska Institute, Stockholm, Sweden, 7Division of Computational Science and Technology, KTH Royal Institute of Technology, Science for Life Laboratory, Solna, Sweden, 8KTH Royal Institute of Technology, Science for Life Laboratory, Solna, Sweden, 9Department of Medical Epidemiology and Biostatistics, Karolinska Institute, Stockholm, Sweden, 10Breast Radiology, Karolinska University Hospital, Stockholm, Sweden
URL

https://www.thelancet.com/journals/landig/article/PIIS2589-7500(20)30185-0/fulltext

ABSTRACT

Importance
A computer algorithm that performs at or above the level of radiologists in mammography screening assessment could improve the effectiveness of breast cancer screening.

Objective
To perform an external evaluation of 3 commercially available artificial intelligence (AI) computer-aided detection algorithms as independent mammography readers and to assess the screening performance when combined with radiologists.

Design, Setting, and Participants
This retrospective case-control study was based on a double-reader population-based mammography screening cohort of women screened at an academic hospital in Stockholm, Sweden, from 2008 to 2015. The study included 8805 women aged 40 to 74 years who underwent mammography screening and who did not have implants or prior breast cancer. The study sample included 739 women who were diagnosed as having breast cancer (positive) and a random sample of 8066 healthy controls (negative for breast cancer).

Main Outcomes and Measures
Positive follow-up findings were determined by pathology-verified diagnosis at screening or within 12 months thereafter. Negative follow-up findings were determined by a 2-year cancer-free follow-up. Three AI computer-aided detection algorithms (AI-1, AI-2, and AI-3), sourced from different vendors, yielded a continuous score for the suspicion of cancer in each mammography examination. For a decision of normal or abnormal, the cut point was defined by the mean specificity of the first-reader radiologists (96.6%).
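
A minimal sketch of that calibration step, assuming `control_scores` holds the AI scores of the examinations negative for breast cancer: the cut point is the score below which 96.6% of controls fall, so labelling anything at or above it abnormal reproduces the first readers' mean specificity.

```python
import numpy as np

def cut_point_at_specificity(control_scores, specificity=0.966):
    # The specificity-quantile of the healthy-control scores: calling
    # scores >= this value abnormal misclassifies ~3.4% of controls.
    return float(np.quantile(control_scores, specificity))

# Usage (toy scores): thr = cut_point_at_specificity(np.random.rand(1000))
```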

Results
The median age of study participants was 60 years (interquartile range, 50-66 years) for the 739 women who received a diagnosis of breast cancer and 54 years (interquartile range, 47-63 years) for the 8066 healthy controls. The cases positive for cancer comprised 618 (84%) screen detected and 121 (16%) clinically detected within 12 months of the screening examination. The area under the receiver operating characteristic curve for cancer detection was 0.956 (95% CI, 0.948-0.965) for AI-1, 0.922 (95% CI, 0.910-0.934) for AI-2, and 0.920 (95% CI, 0.909-0.931) for AI-3. At the specificity of the radiologists, the sensitivities were 81.9% for AI-1, 67.0% for AI-2, 67.4% for AI-3, 77.4% for the first-reader radiologists, and 80.1% for the second-reader radiologists. Combining AI-1 with the first-reader radiologists achieved 88.6% sensitivity at 93.0% specificity (abnormal defined by either of the 2 making an abnormal assessment). No other examined combination of AI algorithms and radiologists surpassed this sensitivity level.
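
The combination reported above is a simple OR rule. A sketch, assuming boolean arrays for the AI call, the reader call, and the cancer label:

```python
import numpy as np

def or_combination(ai_abnormal, reader_abnormal, cancer):
    combined = ai_abnormal | reader_abnormal   # abnormal if either flags it
    sensitivity = combined[cancer].mean()      # fraction of cancers flagged
    specificity = (~combined[~cancer]).mean()  # fraction of healthy not flagged
    return sensitivity, specificity
```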

Conclusions and Relevance
To our knowledge, this study is the first independent evaluation of several AI computer-aided detection algorithms for screening mammography. The results of this study indicated that a commercially available AI computer-aided detection algorithm can assess screening mammograms with a sufficient diagnostic performance to be further evaluated as an independent reader in prospective clinical trials. Combining the first readers with the best algorithm identified more cases positive for cancer than combining the first readers with second readers.

AUTHORS
Mattie Salim, MD1,2; Erik Wåhlin, MSc3; Karin Dembrower, MD4,5; Edward Azavedo, MD, PhD1,6; Theodoros Foukakis, MD, PhD1,2; Yue Liu, MSc7; Kevin Smith, MSc, PhD8; Martin Eklund, MSc, PhD9; Fredrik Strand, MD, PhD1,10
1Department of Oncology-Pathology, Karolinska Institute, Stockholm, Sweden, 2Department of Radiology, Karolinska University Hospital, Stockholm, Sweden, 3Department of Medical Radiation Physics and Nuclear Medicine, Karolinska University Hospital, Stockholm, Sweden, 4Department of Physiology and Pharmacology, Karolinska Institute, Stockholm, Sweden, 5Department of Radiology, Capio Sankt Görans Hospital, Stockholm, Sweden, 6Department of Molecular Medicine and Surgery, Karolinska Institute, Stockholm, Sweden, 7Division of Computational Science and Technology, KTH Royal Institute of Technology, Science for Life Laboratory, Solna, Sweden, 8KTH Royal Institute of Technology, Science for Life Laboratory, Solna, Sweden, 9Department of Medical Epidemiology and Biostatistics, Karolinska Institute, Stockholm, Sweden, 10Breast Radiology, Karolinska University Hospital, Stockholm, Sweden
URL

https://jamanetwork.com/journals/jamaoncology/article-abstract/2769894

ABSTRACT

Importance
Mammography screening currently relies on subjective human interpretation. Artificial intelligence (AI) advances could be used to increase mammography screening accuracy by reducing missed cancers and false positives.

Objective
To evaluate whether AI can overcome human mammography interpretation limitations with a rigorous, unbiased evaluation of machine learning algorithms.

Design, Setting, and Participants
In this diagnostic accuracy study conducted between September 2016 and November 2017, an international crowdsourced challenge was hosted to foster AI algorithm development focused on interpreting screening mammography. More than 1100 participants, comprising 126 teams from 44 countries, took part. Analysis began November 18, 2016.

Main Outcomes and Measures
Algorithms used images alone (challenge 1) or combined images, previous examinations (if available), and clinical and demographic risk factor data (challenge 2), and output a score that translated to a yes/no prediction of cancer within 12 months. Algorithm accuracy for breast cancer detection was evaluated using the area under the curve, and algorithm specificity was compared with radiologists' specificity, with radiologists' sensitivity set at 85.9% (United States) and 83.9% (Sweden). An ensemble method aggregating top-performing AI algorithms and radiologists' recall assessments was developed and evaluated.
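
A sketch of that evaluation under two explicit assumptions: the ensemble is approximated here as an average of z-scored algorithm outputs plus the radiologist's binary recall (the challenge's actual aggregation may differ), and specificity is read off at the threshold that matches the radiologists' sensitivity.

```python
import numpy as np

def ensemble_score(alg_scores, recall):
    # alg_scores: (n_exams, n_algorithms); recall: 0/1 radiologist recall.
    z = (alg_scores - alg_scores.mean(0)) / alg_scores.std(0)
    return z.mean(1) + recall                  # recall acts as one extra vote

def specificity_at_sensitivity(scores, cancer, target_sens=0.859):
    # Threshold at which `target_sens` of cancers score at or above it.
    thr = np.quantile(scores[cancer], 1.0 - target_sens)
    return (scores[~cancer] < thr).mean()
```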

Results
Overall, 144 231 screening mammograms from 85 580 US women (952 cancer positive ≤12 months from screening) were used for algorithm training and validation. A second independent validation cohort included 166 578 examinations from 68 008 Swedish women (780 cancer positive). The top-performing algorithm achieved an area under the curve of 0.858 (United States) and 0.903 (Sweden) and 66.2% (United States) and 81.2% (Sweden) specificity at the radiologists’ sensitivity, lower than community-practice radiologists’ specificity of 90.5% (United States) and 98.5% (Sweden). Combining top-performing algorithms and US radiologist assessments resulted in a higher area under the curve of 0.942 and achieved a significantly improved specificity (92.0%) at the same sensitivity.

Conclusions and Relevance
While no single AI algorithm outperformed radiologists, an ensemble of AI algorithms combined with radiologist assessment in a single-reader screening environment improved overall accuracy. This study underscores the potential of using machine learning methods for enhancing mammography screening interpretation.

AUTHORS
Thomas Schaffter, PhD; Diana S. M. Buist, PhD, MPH; Christoph I. Lee, MD, MS; Yaroslav Nikulin, MS; Dezső Ribli, MSc; Yuanfang Guan, PhD; William Lotter, PhD; Zequn Jie, PhD; Hao Du, BEng; Sijia Wang, MSc; Jiashi Feng, PhD; Mengling Feng, PhD; Hyo-Eun Kim, PhD; Francisco Albiol, PhD; Alberto Albiol, PhD; Stephen Morrell, B Bus Sc, MiF, M Res; Zbigniew Wojna, MSI; Mehmet Eren Ahsen, PhD; Umar Asif, PhD; Antonio Jimeno Yepes, PhD; Shivanthan Yohanandan, PhD; Simona Rabinovici-Cohen, MSc; Darvin Yi, MSc; Bruce Hoff, PhD; Thomas Yu, BS; Elias Chaibub Neto, PhD; Daniel L. Rubin, MD, MS; Peter Lindholm, MD, PhD; Laurie R. Margolies, MD; Russell Bailey McBride, PhD, MPH; Joseph H. Rothstein, MSc; Weiva Sieh, MD, PhD; Rami Ben-Ari, PhD; Stefan Harrer, PhD; Andrew Trister, MD, PhD; Stephen Friend, MD, PhD; Thea Norman, PhD; Berkman Sahiner, PhD; Fredrik Strand, MD, PhD; Justin Guinney, PhD; Gustavo Stolovitzky, PhD; and the DM DREAM Consortium
URL

https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2761795?resultClick=1

ABSTRACT

Background
Mammography is the current standard for breast cancer screening. This study aimed to develop an artificial intelligence (AI) algorithm for diagnosis of breast cancer in mammography, and explore whether it could benefit radiologists by improving accuracy of diagnosis.

Methods
In this retrospective study, an AI algorithm was developed and validated with 170 230 mammography examinations collected from five institutions in South Korea, the USA, and the UK, including 36 468 cancer positive confirmed by biopsy, 59 544 benign confirmed by biopsy (8827 mammograms) or follow-up imaging (50 717 mammograms), and 74 218 normal. For the multicentre, observer-blinded reader study, 320 mammograms (160 cancer positive, 64 benign, 96 normal) were independently obtained from two institutions. Fourteen radiologists participated as readers and assessed each mammogram in terms of likelihood of malignancy (LOM), location of malignancy, and necessity to recall the patient, first without and then with the assistance of the AI algorithm. The performance of AI and the radiologists was evaluated in terms of LOM-based area under the receiver operating characteristic curve (AUROC) and recall-based sensitivity and specificity.
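
The abstract reports AUROCs with 95% CIs; one standard way to obtain both, assuming numpy arrays of biopsy-verified labels `y` and LOM scores `lom` (the authors' exact CI method is not stated here), is a percentile bootstrap:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def auroc_with_ci(y, lom, n_boot=2000, seed=0):
    rng = np.random.default_rng(seed)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), len(y))   # resample with replacement
        if y[idx].min() == y[idx].max():
            continue                            # single-class resample; skip
        aucs.append(roc_auc_score(y[idx], lom[idx]))
    return roc_auc_score(y, lom), np.percentile(aucs, [2.5, 97.5])
```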

Findings
The standalone AI performance was an AUROC of 0·959 (95% CI 0·952–0·966) overall: 0·970 (0·963–0·978) in the South Korea dataset, 0·953 (0·938–0·968) in the USA dataset, and 0·938 (0·918–0·958) in the UK dataset. In the reader study, the performance of AI was 0·940 (0·915–0·965), significantly higher than that of the radiologists without AI assistance (0·810, 95% CI 0·770–0·850; p<0·0001). With AI assistance, the radiologists' performance improved to 0·881 (0·850–0·911; p<0·0001). AI was more sensitive than the radiologists in detecting cancers with mass (53 [90%] vs 46 [78%] of 59 cancers detected; p=0·044) or with distortion or asymmetry (18 [90%] vs ten [50%] of 20 cancers detected; p=0·023), and it was better at detecting T1 cancers (73 [91%] vs 59 [74%] of 80; p=0·0039) and node-negative cancers (104 [87%] vs 88 [74%] of 119; p=0·0025).

Interpretation
The AI algorithm developed with large-scale mammography data showed better diagnostic performance in breast cancer detection compared with radiologists. The significant improvement in radiologists' performance when aided by AI supports application of AI to mammograms as a diagnostic support tool.

AUTHORS
Hyo-Eun Kim, PhD; Hak Hee Kim, MD; Boo-Kyung Han, MD; Ki Hwan Kim, MD; Kyunghwa Han, PhD; Hyeonseob Nam, MS; Eun Hye Lee, MD; Eun-Kyung Kim, MD
URL

https://www.thelancet.com/journals/landig/article/PIIS2589-7500(20)30003-0/fulltext

ABSTRACT

Objectives
To perform test-retest reproducibility analyses for a deep learning–based automatic detection algorithm (DLAD) using two stationary chest radiographs (CRs) acquired at short-term intervals, to analyze factors influencing test-retest variation, and to investigate the robustness of DLAD to simulated post-processing and positional changes.

Methods
This retrospective study included patients with pulmonary nodules resected in 2017. Preoperative CRs without interval changes were used. Test-retest reproducibility was analyzed in terms of median differences of abnormality scores, intraclass correlation coefficients (ICC), and 95% limits of agreement (LoA). Factors associated with test-retest variation were investigated using univariable and multivariable analyses. Shifts in classification between the two CRs were analyzed using pre-determined cutoffs. Radiograph post-processing (blurring and sharpening) and positional changes (translations in x- and y-axes, rotation, and shearing) were simulated and agreement of abnormality scores between the original and simulated CRs was investigated.
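
A sketch of the agreement statistics named above, for the paired abnormality scores of each patient's two radiographs. The ICC here is the two-way, absolute-agreement, single-measurement form (ICC(2,1)); the paper may have used a different variant. The limits of agreement follow Bland and Altman.

```python
import numpy as np

def icc_2_1(x1, x2):
    """ICC(2,1) for paired scores from the two radiographs of each patient."""
    m = np.stack([x1, x2], axis=1)           # n subjects x 2 sessions
    n, k = m.shape
    ms_rows = k * m.mean(1).var(ddof=1)      # between-subject mean square
    ms_cols = n * m.mean(0).var(ddof=1)      # between-session mean square
    resid = m - m.mean(1, keepdims=True) - m.mean(0) + m.mean()
    ms_err = (resid ** 2).sum() / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)

def limits_of_agreement(x1, x2):
    """Bland-Altman 95% limits of agreement for the score differences."""
    d = np.asarray(x1, float) - np.asarray(x2, float)
    return d.mean() - 1.96 * d.std(ddof=1), d.mean() + 1.96 * d.std(ddof=1)
```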

Results
Our study analyzed 169 patients (median age, 65 years; 91 men). The median difference of abnormality scores was 1–2% and ICC ranged from 0.83 to 0.90. The 95% LoA was approximately ± 30%. Test-retest variation was negatively associated with solid portion size (β, − 0.50; p = 0.008) and good nodule conspicuity (β, − 0.94; p < 0.001). A small fraction (15/169) showed discordant classifications when the high-specificity cutoff (46%) was applied to the model outputs (p = 0.04). DLAD was robust to the simulated positional change (ICC, 0.984, 0.996), but relatively less robust to post-processing (ICC, 0.872, 0.968).
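
The simulated perturbations can be reproduced along these lines with scipy; `model_score` is a hypothetical callable returning the DLAD abnormality score for one radiograph array (shearing, also tested in the study, is omitted for brevity):

```python
import numpy as np
from scipy import ndimage

def perturbations(img):
    blur = ndimage.gaussian_filter(img, sigma=1.5)        # post-processing: blur
    sharpen = img + 0.7 * (img - blur)                    # post-processing: unsharp mask
    shift = ndimage.shift(img, (5, -5), mode="nearest")   # x/y translation
    rotate = ndimage.rotate(img, 2.0, reshape=False, mode="nearest")
    return {"blur": blur, "sharpen": sharpen, "shift": shift, "rotate": rotate}

def robustness(img, model_score):
    """Change in abnormality score under each simulated perturbation."""
    base = model_score(img)
    return {name: model_score(p) - base for name, p in perturbations(img).items()}
```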

Conclusions
DLAD was robust to the test-retest variation. However, inconspicuous nodules may cause fluctuations of the model output and subsequent misclassifications.

Key Points
• In general, the deep learning–based automatic detection algorithm was robust to test-retest variation of chest radiographs.

• The test-retest variation was negatively associated with solid portion size and good nodule conspicuity.

• The high-specificity cutoff (46%) resulted in discordant classifications in 8.9% (15/169; p = 0.04) of the test-retest radiographs.

AUTHORS
Hyungjin Kim, Chang Min Park & Jin Mo Goo
URL

https://link.springer.com/article/10.1007%2Fs00330-019-06589-8

ABSTRACT

Background
The performance of a deep learning (DL) algorithm should be validated in actual clinical situations, before its clinical implementation.

Purpose
To evaluate the performance of a DL algorithm for identifying chest radiographs with clinically relevant abnormalities in the emergency department (ED) setting.

Materials and Methods
This single-center retrospective study included consecutive patients who visited the ED and underwent initial chest radiography between January 1 and March 31, 2017. Chest radiographs were analyzed with a commercially available DL algorithm. The performance of the algorithm was evaluated by determining the area under the receiver operating characteristic curve (AUC), sensitivity, and specificity at predefined operating cutoffs (high-sensitivity and high-specificity cutoffs). The sensitivities and specificities of the algorithm were compared with those of the on-call radiology residents who had interpreted the chest radiographs in actual practice, using McNemar tests. For radiographs with discordant findings between the algorithm and the resident, the residents reinterpreted the images using the algorithm's output.
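
A sketch of the paired sensitivity comparison, assuming boolean arrays (over the abnormal radiographs only) marking whether the algorithm and the resident each made the correct call; the discordant cells drive the McNemar statistic:

```python
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def mcnemar_sensitivity(alg_correct, resident_correct):
    # 2x2 table of paired correct/incorrect calls on abnormal radiographs.
    table = np.array([
        [np.sum(alg_correct & resident_correct), np.sum(alg_correct & ~resident_correct)],
        [np.sum(~alg_correct & resident_correct), np.sum(~alg_correct & ~resident_correct)],
    ])
    return mcnemar(table, exact=True).pvalue
```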

Results
A total of 1135 patients (mean age, 53 years ± 18; 582 men) were evaluated. In the identification of abnormal chest radiographs, the algorithm showed an AUC of 0.95 (95% confidence interval [CI]: 0.93, 0.96), a sensitivity of 88.7% (227 of 256 radiographs; 95% CI: 84.1%, 92.3%), and a specificity of 69.6% (612 of 879 radiographs; 95% CI: 66.5%, 72.7%) at the high-sensitivity cutoff and a sensitivity of 81.6% (209 of 256 radiographs; 95% CI: 76.3%, 86.2%) and specificity of 90.3% (794 of 879 radiographs; 95% CI: 88.2%, 92.2%) at the high-specificity cutoff. Radiology residents showed lower sensitivity (65.6% [168 of 256 radiographs; 95% CI: 59.5%, 71.4%], P < .001) and higher specificity (98.1% [862 of 879 radiographs; 95% CI: 96.9%, 98.9%], P < .001) compared with the algorithm. After reinterpretation of chest radiographs with use of the algorithm’s outputs, the sensitivity of the residents improved (73.4% [188 of 256 radiographs; 95% CI: 68.0%, 78.8%], P = .003), whereas specificity was reduced (94.3% [829 of 879 radiographs; 95% CI: 92.8%, 95.8%], P < .001).

Conclusion
A deep learning algorithm applied to emergency department chest radiographs showed good diagnostic performance in identifying clinically relevant abnormalities and helped improve the sensitivity of radiology residents' evaluations.

AUTHORS
Eui Jin Hwang, Ju Gang Nam, Woo Hyeon Lim, Sae Jin Park, Yun Soo Jeong, Ji Hee Kang, Eun Kyoung Hong, Taek Min Kim, Jin Mo Goo, Sunggyun Park, Ki Hwan Kim, Chang Min Park
From the Department of Radiology, Seoul National University College of Medicine, 101 Daehak-ro, Jongno-gu, Seoul 03080, Korea (E.J.H., J.G.N., W.H.L., S.J.P., Y.S.J., J.H.K., E.K.H., T.M.K., J.M.G., C.M.P.); and Lunit, Seoul, Korea (S.P., K.H.K.).
URL

https://pubs.rsna.org/doi/10.1148/radiol.2019191225

ABSTRACT

Importance
Interpretation of chest radiographs is a challenging task prone to errors, requiring expert readers. An automated system that can accurately classify chest radiographs may help streamline the clinical workflow.

Objective
To develop a deep learning–based algorithm that can classify normal and abnormal results from chest radiographs with major thoracic diseases, including pulmonary malignant neoplasm, active tuberculosis, pneumonia, and pneumothorax, and to validate the algorithm's performance using independent data sets.

Design, Setting, and Participants
This diagnostic study developed a deep learning–based algorithm using single-center data collected between November 1, 2016, and January 31, 2017. The algorithm was externally validated with multicenter data collected between May 1 and July 31, 2018. A total of 54 221 chest radiographs with normal findings from 47 917 individuals (21 556 men and 26 361 women; mean [SD] age, 51 [16] years) and 35 613 chest radiographs with abnormal findings from 14 102 individuals (8373 men and 5729 women; mean [SD] age, 62 [15] years) were used to develop the algorithm. A total of 486 chest radiographs with normal results and 529 with abnormal results (1 from each participant; 628 men and 387 women; mean [SD] age, 53 [18] years) from 5 institutions were used for external validation. Fifteen physicians, including nonradiology physicians, board-certified radiologists, and thoracic radiologists, participated in observer performance testing. Data were analyzed in August 2018.

Main Outcomes and Measures
Image-wise classification performance measured by the area under the receiver operating characteristic curve; lesion-wise localization performance measured by the area under the alternative free-response receiver operating characteristic curve.

Results
The algorithm demonstrated a median (range) area under the curve of 0.979 (0.973-1.000) for image-wise classification and 0.972 (0.923-0.985) for lesion-wise localization, and it demonstrated significantly higher performance than all 3 physician groups in both image-wise classification (0.983 vs 0.814-0.932; all P < .005) and lesion-wise localization (0.985 vs 0.781-0.907; all P < .001). Significant improvements in both image-wise classification (0.814-0.932 to 0.904-0.958; all P < .005) and lesion-wise localization (0.781-0.907 to 0.873-0.938; all P < .001) were observed in all 3 physician groups with assistance of the algorithm.

Conclusions and Relevance
The algorithm consistently outperformed physicians, including thoracic radiologists, in the discrimination of chest radiographs with major thoracic diseases, demonstrating its potential to improve the quality and efficiency of clinical practice.
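
As a simplified stand-in for the lesion-wise analysis (the study uses an alternative free-response ROC analysis, which is more involved), a lesion can be counted as localized when the peak of the algorithm's probability map falls inside the annotated lesion mask:

```python
import numpy as np

def lesion_hit(prob_map: np.ndarray, lesion_mask: np.ndarray) -> bool:
    """True if the highest-probability pixel lies within the annotated lesion."""
    peak = np.unravel_index(np.argmax(prob_map), prob_map.shape)
    return bool(lesion_mask[peak])
```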

AUTHORS
Eui Jin Hwang1, Sunggyun Park2, Kwang-Nam Jin3, Jung Im Kim4, So Young Choi5, Jong Hyuk Lee6, Jin Mo Goo1, Brian Jaehong Aum2, Jae-Joon Yim7, Julien G. Cohen8, Gilbert R. Ferretti8 and Chang Min Park1
1Seoul National University Hospital and College of Medicine, 2Lunit Inc., 3Seoul National University Boramae Medical Center, 4Kyung Hee University College of Medicine, 5Eulji University Medical Center, 6Armed Forces Seoul Hospital, 7Seoul National University College of Medicine, 8Centre Hospitalier Universitaire de Grenoble
URL

https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2728630

ABSTRACT

Background
Detection of active pulmonary tuberculosis (TB) on chest radiographs (CRs) is critical for the diagnosis and screening of TB. An automated system may help streamline the TB screening process and improve diagnostic performance.

Methods
We developed a deep-learning-based automatic detection (DLAD) algorithm using 54,221 normal CRs and 6,768 CRs with active pulmonary TB, which were labeled and annotated by 13 board-certified radiologists. The performance of DLAD was validated using six external multi-center, multi-national datasets. To compare the performance of DLAD with that of physicians, an observer performance test was conducted by 15 physicians, including non-radiology physicians, board-certified radiologists, and thoracic radiologists. Image-wise classification and lesion-wise localization performances were measured using the area under the receiver operating characteristic (ROC) curve and the area under the alternative free-response ROC curve, respectively. Sensitivities and specificities of DLAD were calculated at two cutoffs, high sensitivity (98%) and high specificity (98%), obtained through in-house validation.

Results
DLAD demonstrated classification performances of 0.977–1.000 and localization performances of 0.973–1.000. Sensitivities and specificities for classification were 94.3–100% and 91.1–100% using the high-sensitivity cutoff, and 84.1–99.0% and 99.1–100% using the high-specificity cutoff. DLAD showed significantly higher performance in both classification (0.993 vs. 0.746–0.971) and localization (0.993 vs. 0.664–0.925) compared with all groups of physicians.

Conclusions
Our DLAD demonstrated excellent and consistent performance in the detection of active pulmonary TB on CRs, outperforming physicians, including thoracic radiologists.
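
A sketch of how two such operating points can be derived from a labeled in-house validation set, with hypothetical `scores` and binary `labels`: one cutoff keeps sensitivity at 98%, the other keeps specificity at 98%.

```python
import numpy as np

def dual_cutoffs(scores, labels, target=0.98):
    pos, neg = scores[labels == 1], scores[labels == 0]
    high_sens_cut = np.quantile(pos, 1.0 - target)  # 98% of TB cases score at or above
    high_spec_cut = np.quantile(neg, target)        # 98% of normals score below
    return high_sens_cut, high_spec_cut
```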

AUTHORS
Eui Jin Hwang1, Sunggyun Park2, Kwang-Nam Jin3, So Young Choi4, Jong Hyuk Lee5, Jin Mo Goo1, Brian Jaehong Aum2, Jae-Joon Yim6 and Chang Min Park1
1Seoul National University Hospital and College of Medicine, 2Lunit Inc., 3Seoul National University Boramae Medical Center, 4Eulji University Medical Center, 5Armed Forces Seoul Hospital, 6Seoul National University College of Medicine
URL

https://academic.oup.com/cid/advance-article/doi/10.1093/cid/ciy967/5174137

ABSTRACT

Purpose
To develop and validate a deep learning–based automatic detection algorithm (DLAD) for malignant pulmonary nodules on chest radiographs and to compare its performance with that of physicians, including thoracic radiologists.

Materials and Methods
For this retrospective study, DLAD was developed as a convolutional neural network by using 43 292 chest radiographs (normal radiograph–to–nodule radiograph ratio, 34 067:9225) in 34 676 patients (healthy-to-nodule ratio, 30 784:3892; 19 230 men [mean age, 52.8 years; age range, 18–99 years]; 15 446 women [mean age, 52.3 years; age range, 18–98 years]) obtained between 2010 and 2015, which were labeled and partially annotated by 13 board-certified radiologists. Radiograph classification and nodule detection performances of DLAD were validated by using one internal and four external data sets from three South Korean hospitals and one U.S. hospital, evaluated with the area under the receiver operating characteristic curve (AUROC) and the jackknife alternative free-response receiver operating characteristic (JAFROC) figure of merit (FOM), respectively. An observer performance test involving 18 physicians, including nine board-certified radiologists, was conducted by using one of the four external validation data sets. Performances of DLAD, physicians, and physicians assisted with DLAD were evaluated and compared.

Results
Across the internal and four external validation data sets, radiograph classification and nodule detection performances of DLAD were 0.92–0.99 (AUROC) and 0.831–0.924 (JAFROC FOM), respectively. In the observer performance test, DLAD showed a higher AUROC and JAFROC FOM than 17 of 18 and 15 of 18 physicians, respectively (P < .05), and all physicians showed improved nodule detection performance with DLAD (mean JAFROC FOM improvement, 0.043; range, 0.006–0.190; P < .05).

Conclusion
This deep learning–based automatic detection algorithm outperformed physicians in radiograph classification and nodule detection of malignant pulmonary nodules on chest radiographs, and it enhanced physicians' performance when used as a second reader.

AUTHORS
Sunggyun Park1, Ju Gang Nam2, Eui Jin Hwang2, Jong Hyuk Lee3, Kwang-Nam Jin4, Kun Young Lim5, Thienkai Huy Vu6, Jae Ho Sohn6, Sangheum Hwang1, Jin Mo Goo2 and Chang Min Park2
1Lunit Inc., 2Seoul National University Hospital and College of Medicine, 3Armed Forces Seoul Hospital, 4Seoul National University Boramae Medical Center, 5National Cancer Center, 6University of California, San Francisco
URL

https://pubs.rsna.org/doi/10.1148/radiol.2018180237

ABSTRACT

We assessed the feasibility of a data-driven imaging biomarker based on weakly supervised learning (DIB; an imaging biomarker derived from large-scale medical image data with deep learning technology) in mammography (DIB-MG). A total of 29,107 digital mammograms from five institutions (4,339 cancer cases and 24,768 normal cases) were included. After matching patients' age, breast density, and equipment, 1,238 cases each were chosen as the validation and test sets, and the remainder were used for training. The core algorithm of DIB-MG is a deep convolutional neural network, a deep learning algorithm specialized for images. Each sample (case) is an exam composed of four view images (RCC, RMLO, LCC, and LMLO). For each case in the training set, the cancer probability inferred by DIB-MG is compared with the per-case ground-truth label, and the model parameters of DIB-MG are then updated based on the error between the prediction and the ground truth. At the operating point (threshold) of 0.5, sensitivity was 75.6% and 76.1% when specificity was 90.2% and 88.5%, and AUC was 0.903 and 0.906 for the validation and test sets, respectively. This research showed the potential of DIB-MG as a screening tool for breast cancer.
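
The per-case training step described here corresponds to a standard supervised update. A minimal PyTorch sketch, with a toy backbone standing in for the unpublished DIB-MG architecture and random tensors in place of real four-view exams:

```python
import torch
import torch.nn as nn

class FourViewNet(nn.Module):
    """Toy stand-in: pools the four views (RCC, RMLO, LCC, LMLO) into one logit."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(                 # tiny per-view CNN
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(16, 1)

    def forward(self, views):                          # views: (batch, 4, 1, H, W)
        feats = torch.stack([self.backbone(views[:, i]) for i in range(4)], dim=1)
        return self.head(feats.mean(1)).squeeze(1)     # per-case cancer logit

model = FourViewNet()
opt = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

views = torch.randn(2, 4, 1, 128, 128)                 # toy 4-view exams
label = torch.tensor([1.0, 0.0])                       # per-case ground truth
loss = loss_fn(model(views), label)                    # prediction vs ground truth
opt.zero_grad(); loss.backward(); opt.step()           # parameter update on the error
```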

AUTHORS
Eun-Kyung Kim1, Hyo-Eun Kim2, Kyunghwa Han1, Bong Joo Kang3, Yu-Mee Sohn4, Ok Hee Woo5 and Chan Wha Lee6
1Severance Hospital, Yonsei University, 2Lunit Inc., 3Seoul St. Mary’s Hospital, Catholic University, 4Kyung Hee University Hospital, 5Korea University Guro Hospital, 6National Cancer Center Hospital
URL

https://www.nature.com/articles/s41598-018-21215-1