Publications

ABSTRACT

Recent medical applications of deep-learning (DL) algorithms have demonstrated their clinical efficacy in improving the speed and accuracy of image interpretation. If a DL algorithm can diagnose coronavirus disease 2019 (COVID-19) pneumonia on chest radiography (CR) as well as physicians can, automatic CR interpretation with DL algorithms could significantly reduce the burden on clinicians and radiologists during sudden surges of suspected COVID-19 patients. The aim of this study was to evaluate the efficacy of a DL algorithm for detecting COVID-19 pneumonia on CR compared with formal radiology reports. This retrospective study included adult patients who were diagnosed as COVID-19 positive by reverse transcription polymerase chain reaction among all patients admitted to five emergency departments and one community treatment center in Korea from February 18, 2020 to May 1, 2020. The CR images were evaluated with a publicly available DL algorithm. For the reference standard, CR images of patients without chest computed tomography (CT) scans were classified as positive for COVID-19 pneumonia when a radiologist identified ground-glass opacity, consolidation, or other infiltration on retrospective review; patients with evidence of pneumonia on chest CT scans were also classified as positive for COVID-19 pneumonia. The overall sensitivity and specificity of the DL algorithm for detecting COVID-19 pneumonia on CR were 95.6% and 88.7%, respectively, and its area under the curve was 0.921. The DL algorithm demonstrated satisfactory diagnostic performance, comparable with that of formal radiology reports, in the CR-based diagnosis of COVID-19 pneumonia. The DL algorithm may offer fast and reliable examinations that facilitate patient screening and isolation decisions, reducing the medical staff workload during COVID-19 pandemic surges.
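
A minimal Python sketch of how operating-point metrics like those above relate to a detector's continuous output; the scores, labels, and 0.5 cutoff are illustrative placeholders, not data or thresholds from the study.

import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical DL abnormality scores and reference-standard labels
# (1 = COVID-19 pneumonia per the CR/CT reference described above).
y_true = np.array([1, 1, 0, 0, 1, 0, 0, 1])
scores = np.array([0.97, 0.88, 0.12, 0.40, 0.91, 0.05, 0.33, 0.76])
cutoff = 0.5  # assumed operating point, not the study's threshold

y_pred = scores >= cutoff
tp = np.sum(y_pred & (y_true == 1))
tn = np.sum(~y_pred & (y_true == 0))
fp = np.sum(y_pred & (y_true == 0))
fn = np.sum(~y_pred & (y_true == 1))

sensitivity = tp / (tp + fn)          # study reported 95.6%
specificity = tn / (tn + fp)          # study reported 88.7%
auc = roc_auc_score(y_true, scores)   # threshold-free; study reported 0.921
print(f"sens={sensitivity:.3f} spec={specificity:.3f} auc={auc:.3f}")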

AUTHORS
Se Bum Jang, Suk Hee Lee, Dong Eun Lee, Sin-Youl Park, Jong Kun Kim, Jae Wan Cho, Jaekyung Cho, Ki Beom Kim, Byunggeon Park, Jongmin Park, Jae-Kwang Lim
URL

https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0242759

ABSTRACT

Importance The improvement of pulmonary nodule detection, which is a challenging task when using chest radiographs, may help to elevate the role of chest radiographs for the diagnosis of lung cancer.

Objective To assess the performance of a deep learning–based nodule detection algorithm for the detection of lung cancer on chest radiographs from participants in the National Lung Screening Trial (NLST).

Design, Setting, and Participants This diagnostic study used data from participants in the NLST to assess the performance of a deep learning–based artificial intelligence (AI) algorithm for the detection of pulmonary nodules and lung cancer on chest radiographs using separate training (in-house) and validation (NLST) data sets. Baseline (T0) posteroanterior chest radiographs from 5485 participants (full T0 data set) were used to assess lung cancer detection performance, and a subset of 577 of these images (nodule data set) was used to assess nodule detection performance. Participants aged 55 to 74 years who currently or formerly (ie, quit within the past 15 years) smoked cigarettes for 30 pack-years or more were enrolled in the NLST at 23 US centers between August 2002 and April 2004. Information on lung cancer diagnoses was collected through December 31, 2009. Analyses were performed between August 20, 2019, and February 14, 2020.

Exposures Abnormality scores produced by the AI algorithm.

Main Outcomes and Measures The performance of an AI algorithm for the detection of lung nodules and lung cancer on radiographs, with lung cancer incidence and mortality as primary end points.

Results A total of 5485 participants (mean [SD] age, 61.7 [5.0] years; 3030 men [55.2%]) were included, with a median follow-up duration of 6.5 years (interquartile range, 6.1-6.9 years). For the nodule data set, the sensitivity and specificity of the AI algorithm for the detection of pulmonary nodules were 86.2% (95% CI, 77.8%-94.6%) and 85.0% (95% CI, 81.9%-88.1%), respectively. For the detection of all cancers, the sensitivity was 75.0% (95% CI, 62.8%-87.2%), the specificity was 83.3% (95% CI, 82.3%-84.3%), the positive predictive value was 3.8% (95% CI, 2.6%-5.0%), and the negative predictive value was 99.8% (95% CI, 99.6%-99.9%). For the detection of malignant pulmonary nodules in all images of the full T0 data set, the sensitivity was 94.1% (95% CI, 86.2%-100.0%), the specificity was 83.3% (95% CI, 82.3%-84.3%), the positive predictive value was 3.4% (95% CI, 2.2%-4.5%), and the negative predictive value was 100.0% (95% CI, 99.9%-100.0%). In digital radiographs of the nodule data set, the AI algorithm had higher sensitivity (96.0% [95% CI, 88.3%-100.0%] vs 88.0% [95% CI, 75.3%-100.0%]; P = .32) and higher specificity (93.2% [95% CI, 89.9%-96.5%] vs 82.8% [95% CI, 77.8%-87.8%]; P = .001) for nodule detection compared with the NLST radiologists. For malignant pulmonary nodule detection on digital radiographs of the full T0 data set, the sensitivity of the AI algorithm was higher (100.0% [95% CI, 100.0%-100.0%] vs 94.1% [95% CI, 82.9%-100.0%]; P = .32) compared with the NLST radiologists, and the specificity (90.9% [95% CI, 89.6%-92.1%] vs 91.0% [95% CI, 89.7%-92.2%]; P = .91), positive predictive value (8.2% [95% CI, 4.4%-11.9%] vs 7.8% [95% CI, 4.1%-11.5%]; P = .65), and negative predictive value (100.0% [95% CI, 100.0%-100.0%] vs 99.9% [95% CI, 99.8%-100.0%]; P = .32) were similar to those of NLST radiologists.
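
The confidence intervals above are intervals for binomial proportions. As a worked check, a normal-approximation (Wald) interval reproduces the nodule-detection sensitivity CI; the positive count n = 65 is inferred from the interval width and is an assumption, not a number stated in the abstract.

import math

p = 0.862   # reported nodule-detection sensitivity
n = 65      # assumed number of nodule-positive radiographs (inferred)
z = 1.96    # two-sided 95% normal quantile

half_width = z * math.sqrt(p * (1 - p) / n)
print(f"{p:.1%} (95% CI, {p - half_width:.1%}-{p + half_width:.1%})")
# prints roughly 86.2% (95% CI, 77.8%-94.6%), matching the reported interval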

Conclusions and Relevance In this study, the AI algorithm performed better than NLST radiologists for the detection of pulmonary nodules on digital radiographs. When used as a second reader, the AI algorithm may help to detect lung cancer.

AUTHORS
Hyunsuk Yoo, MD; Ki Hwan Kim, MD, PhD; Ramandeep Singh, MD; et al
URL

https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2770952

ABSTRACT

Background
We examined the potential change in cancer detection when using artificial intelligence (AI) cancer-detection software to triage certain screening examinations into a no-radiologist work stream and, after regular radiologist assessment of the remainder, to triage certain examinations into an enhanced-assessment work stream. The purpose of enhanced assessment was to simulate the selection of women for more sensitive screening, promoting early detection of cancers that would otherwise be diagnosed as interval cancers or as next-round screen-detected cancers. The aim of the study was to examine how AI could reduce radiologist workload and increase cancer detection.

Methods
In this retrospective simulation study, all women diagnosed with breast cancer who attended two consecutive screening rounds were included. Healthy women were randomly sampled from the same cohort; their observations were given elevated weight to mimic a frequency of 0·7% incident cancer per screening interval. Based on the prediction score from a commercially available AI cancer detector, various cutoff points for the decision to channel women to the two new work streams were examined in terms of missed and additionally detected cancer.
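
A sketch of the cutoff simulation described above: exams are ranked by AI score, the lowest-scoring fraction is routed to the no-radiologist stream and the highest-scoring fraction to enhanced assessment. The scores and cancer labels below are synthetic and uninformative, so the printed counts illustrate only the bookkeeping of the simulation, not the detector's performance.

import numpy as np

rng = np.random.default_rng(0)
scores = rng.random(10_000)              # AI cancer-suspicion scores (synthetic)
is_cancer = rng.random(10_000) < 0.007   # ~0.7% incident cancer per interval

for frac in (0.60, 0.70, 0.80):          # no-radiologist stream
    cut = np.quantile(scores, frac)
    missed = int(np.sum(is_cancer & (scores <= cut)))
    print(f"lowest {frac:.0%} auto-cleared -> {missed} cancers missed")

for frac in (0.01, 0.05):                # enhanced-assessment stream
    cut = np.quantile(scores, 1 - frac)
    flagged = int(np.sum(is_cancer & (scores >= cut)))
    print(f"top {frac:.0%} enhanced -> {flagged} cancers flagged early")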

Findings
7364 women were included in the study sample: 547 were diagnosed with breast cancer and 6817 were healthy controls. When including 60%, 70%, or 80% of women with the lowest AI scores in the no radiologist stream, the proportion of screen-detected cancers that would have been missed were 0, 0·3% (95% CI 0·0–4·3), or 2·6% (1·1–5·4), respectively. When including 1% or 5% of women with the highest AI scores in the enhanced assessment stream, the potential additional cancer detection was 24 (12%) or 53 (27%) of 200 subsequent interval cancers, respectively, and 48 (14%) or 121 (35%) of 347 next-round screen-detected cancers, respectively.

Interpretation
Using a commercial AI cancer detector to triage mammograms into no radiologist assessment and enhanced assessment could potentially reduce radiologist workload by more than half, and pre-emptively detect a substantial proportion of cancers otherwise diagnosed later.

AUTHORS
Mattie Salim, MD1,2; Erik Wåhlin, MSc3; Karin Dembrower, MD4,5; Edward Azavedo, MD, PhD1,6; Theodoros Foukakis, MD, PhD1,2; Yue Liu, MSc7; Kevin Smith, MSc, PhD8; Martin Eklund, MSc, PhD9; Fredrik Strand, MD, PhD1,10
1Department of Oncology-Pathology, Karolinska Institute, Stockholm, Sweden, 2Department of Radiology, Karolinska University Hospital, Stockholm, Sweden, 3Department of Medical Radiation Physics and Nuclear Medicine, Karolinska University Hospital, Stockholm, Sweden, 4Department of Physiology and Pharmacology, Karolinska Institute, Stockholm, Sweden, 5Department of Radiology, Capio Sankt Görans Hospital, Stockholm, Sweden, 6Department of Molecular Medicine and Surgery, Karolinska Institute, Stockholm, Sweden, 7Division of Computational Science and Technology, KTH Royal Institute of Technology, Science for Life Laboratory, Solna, Sweden, 8KTH Royal Institute of Technology, Science for Life Laboratory, Solna, Sweden, 9Department of Medical Epidemiology and Biostatistics, Karolinska Institute, Stockholm, Sweden, 10Breast Radiology, Karolinska University Hospital, Stockholm, Sweden
URL

https://www.thelancet.com/journals/landig/article/PIIS2589-7500(20)30185-0/fulltext

ABSTRACT

Importance
A computer algorithm that performs at or above the level of radiologists in mammography screening assessment could improve the effectiveness of breast cancer screening.

Objective
To perform an external evaluation of 3 commercially available artificial intelligence (AI) computer-aided detection algorithms as independent mammography readers and to assess the screening performance when combined with radiologists.

Design, Setting, and Participants
This retrospective case-control study was based on a double-reader population-based mammography screening cohort of women screened at an academic hospital in Stockholm, Sweden, from 2008 to 2015. The study included 8805 women aged 40 to 74 years who underwent mammography screening and who did not have implants or prior breast cancer. The study sample included 739 women who were diagnosed as having breast cancer (positive) and a random sample of 8066 healthy controls (negative for breast cancer).

Main Outcomes and Measures
Positive follow-up findings were determined by pathology-verified diagnosis at screening or within 12 months thereafter. Negative follow-up findings were determined by a 2-year cancer-free follow-up. Three AI computer-aided detection algorithms (AI-1, AI-2, and AI-3), sourced from different vendors, yielded a continuous score for the suspicion of cancer in each mammography examination. For a decision of normal or abnormal, the cut point was defined by the mean specificity of the first-reader radiologists (96.6%).
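
A sketch of the cut-point definition above: the threshold on an AI's continuous score is chosen so that its specificity matches the first readers' mean specificity (96.6%). The toy scores and labels are assumptions for illustration only.

import numpy as np

def cutpoint_at_specificity(scores, y_true, target_spec):
    """Smallest threshold whose specificity on the negatives meets the target."""
    neg = np.sort(scores[y_true == 0])
    idx = int(np.ceil(target_spec * len(neg)))
    return neg[min(idx, len(neg) - 1)]

rng = np.random.default_rng(1)
y = rng.integers(0, 2, 5000)                         # 1 = breast cancer (toy)
s = np.clip(rng.normal(0.3 + 0.4 * y, 0.2), 0, 1)    # informative toy scores

t = cutpoint_at_specificity(s, y, 0.966)
pred = s >= t
print(f"threshold={t:.3f}",
      f"sensitivity={pred[y == 1].mean():.3f}",
      f"specificity={(~pred)[y == 0].mean():.3f}")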

Results
The median age of study participants was 60 years (interquartile range, 50-66 years) for 739 women who received a diagnosis of breast cancer and 54 years (interquartile range, 47-63 years) for 8066 healthy controls. The cases positive for cancer comprised 618 (84%) screen detected and 121 (16%) clinically detected within 12 months of the screening examination. The area under the receiver operating curve for cancer detection was 0.956 (95% CI, 0.948-0.965) for AI-1, 0.922 (95% CI, 0.910-0.934) for AI-2, and 0.920 (95% CI, 0.909-0.931) for AI-3. At the specificity of the radiologists, the sensitivities were 81.9% for AI-1, 67.0% for AI-2, 67.4% for AI-3, 77.4% for first-reader radiologist, and 80.1% for second-reader radiologist. Combining AI-1 with first-reader radiologists achieved 88.6% sensitivity at 93.0% specificity (abnormal defined by either of the 2 making an abnormal assessment). No other examined combination of AI algorithms and radiologists surpassed this sensitivity level.

Conclusions and Relevance
To our knowledge, this study is the first independent evaluation of several AI computer-aided detection algorithms for screening mammography. The results of this study indicated that a commercially available AI computer-aided detection algorithm can assess screening mammograms with a sufficient diagnostic performance to be further evaluated as an independent reader in prospective clinical trials. Combining the first readers with the best algorithm identified more cases positive for cancer than combining the first readers with second readers.

AUTHORS
Mattie Salim, MD1,2; Erik Wåhlin, MSc3; Karin Dembrower, MD4,5; Edward Azavedo, MD, PhD1,6; Theodoros Foukakis, MD, PhD1,2; Yue Liu, MSc7; Kevin Smith, MSc, PhD8; Martin Eklund, MSc, PhD9; Fredrik Strand, MD, PhD1,10
1Department of Oncology-Pathology, Karolinska Institute, Stockholm, Sweden, 2Department of Radiology, Karolinska University Hospital, Stockholm, Sweden, 3Department of Medical Radiation Physics and Nuclear Medicine, Karolinska University Hospital, Stockholm, Sweden, 4Department of Physiology and Pharmacology, Karolinska Institute, Stockholm, Sweden, 5Department of Radiology, Capio Sankt Görans Hospital, Stockholm, Sweden, 6Department of Molecular Medicine and Surgery, Karolinska Institute, Stockholm, Sweden, 7Division of Computational Science and Technology, KTH Royal Institute of Technology, Science for Life Laboratory, Solna, Sweden, 8KTH Royal Institute of Technology, Science for Life Laboratory, Solna, Sweden, 9Department of Medical Epidemiology and Biostatistics, Karolinska Institute, Stockholm, Sweden, 10Breast Radiology, Karolinska University Hospital, Stockholm, Sweden
URL

https://jamanetwork.com/journals/jamaoncology/article-abstract/2769894

ABSTRACT

Objectives
The performance of deep learning–based automated detection (DLAD) algorithms in systematic screening for active pulmonary tuberculosis is unknown. We aimed to validate a DLAD algorithm for the detection of active pulmonary tuberculosis and of any radiologically identifiable relevant abnormality on chest radiographs (CRs) in this setting.

Methods
We performed out-of-sample testing of a pre-trained DLAD algorithm, using CRs from 19,686 asymptomatic individuals (mean age, 21.3 ± 1.9 years) obtained as part of systematic screening for tuberculosis between January 2013 and July 2018. Areas under the receiver operating characteristic curve (AUC) for the diagnosis of tuberculosis and of any relevant abnormality were measured. Accuracy measures, including sensitivities, specificities, positive predictive values (PPVs), and negative predictive values (NPVs), were calculated at pre-defined operating thresholds (high-sensitivity threshold, 0.16; high-specificity threshold, 0.46).
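
The predictive values reported below follow from Bayes' rule at this very low prevalence (4 individuals with tuberculosis among 19,686 screened); the short arithmetic check below approximately reproduces the reported PPVs, with small differences because the study counts radiographs rather than individuals.

prevalence = 4 / 19_686          # active TB among screened individuals
sens = 1.0                       # all TB radiographs were flagged
for spec in (0.959, 0.997):      # high-sensitivity / high-specificity thresholds
    ppv = (sens * prevalence) / (sens * prevalence + (1 - spec) * (1 - prevalence))
    npv = (spec * (1 - prevalence)) / (spec * (1 - prevalence) + (1 - sens) * prevalence)
    print(f"spec={spec}: PPV={ppv:.3f}, NPV={npv:.3f}")
# prints PPV of ~0.005 and ~0.063 (reported: 0.006 and 0.068), NPV = 1.000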

Results
All five CRs from four individuals with active pulmonary tuberculosis were correctly classified as having abnormal findings by DLAD with specificities of 0.959 and 0.997, PPVs of 0.006 and 0.068, and NPVs of both 1.000 at high sensitivity and high specificity thresholds, respectively. With high specificity thresholds, DLAD showed comparable diagnostic measures with the pooled radiologists (p values > 0.05). For the radiologically identifiable relevant abnormality (n = 28), DLAD showed an AUC value of 0.967 (95% confidence interval, 0.938–0.996) with sensitivities of 0.821 and 0.679, specificities of 0.960 and 0.997, PPVs of 0.028 and 0.257, and NPVs of both 0.999 at high sensitivity and high specificity thresholds, respectively.

Conclusions
In systematic screening for tuberculosis in a low-prevalence setting, the DLAD algorithm demonstrated excellent diagnostic performance in the detection of active pulmonary tuberculosis, comparable with that of the radiologists.

AUTHORS
Jong Hyuk Lee, Sunggyun Park, Eui Jin Hwang, Jin Mo Goo, Woo Young Lee, Sangho Lee, Hyungjin Kim, Jason R. Andrews & Chang Min Park
URL

https://link.springer.com/article/10.1007/s00330-020-07219-4

ABSTRACT

Abstract
A deep learning algorithm detected lung cancer nodules on chest radiographs with a performance comparable to that of radiologists, which will be helpful for radiologists in healthy populations with a low prevalence of lung cancer.

Background
The performance of a deep learning algorithm for lung cancer detection on chest radiographs in a health screening population is unknown.

Purpose
To validate a commercially available deep learning algorithm for lung cancer detection on chest radiographs in a health screening population.

Materials and Methods
Out-of-sample testing of a deep learning algorithm was retrospectively performed using chest radiographs from individuals undergoing a comprehensive medical check-up between July 2008 and December 2008 (validation test). To evaluate the algorithm performance for visible lung cancer detection, the area under the receiver operating characteristic curve (AUC) and diagnostic measures, including sensitivity and false-positive rate (FPR), were calculated. The algorithm performance was compared with that of radiologists using the McNemar test and the Moskowitz method. Additionally, the deep learning algorithm was applied to a screening cohort undergoing chest radiography between January 2008 and December 2012, and its performances were calculated.
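
A sketch of the McNemar comparison used above: on paired reads of the same radiographs, only the discordant pairs (cases one reader got right and the other missed) carry information. The 2x2 table below is illustrative, not study data.

from statsmodels.stats.contingency_tables import mcnemar

# Rows: algorithm correct / wrong; columns: radiologist correct / wrong,
# over the same set of radiographs with visible lung cancer (made-up counts).
table = [[5, 4],
         [1, 0]]
result = mcnemar(table, exact=True)  # exact binomial test on the discordant cells
print(f"statistic={result.statistic}, p={result.pvalue:.3f}")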

Results
In a validation test comprising 10 285 radiographs from 10 202 individuals (mean age, 54 years ± 11 [standard deviation]; 5857 men) with 10 radiographs of visible lung cancers, the algorithm’s AUC was 0.99 (95% confidence interval: 0.97, 1), and it showed comparable sensitivity (90% [nine of 10 radiographs]) to that of the radiologists (60% [six of 10 radiographs]; P = .25) with a higher FPR (3.1% [319 of 10 275 radiographs] vs 0.3% [26 of 10 275 radiographs]; P < .001). In the screening cohort of 100 525 chest radiographs from 50 070 individuals (mean age, 53 years ± 11; 28 090 men) with 47 radiographs of visible lung cancers, the algorithm’s AUC was 0.97 (95% confidence interval: 0.95, 0.99), and its sensitivity and FPR were 83% (39 of 47 radiographs) and 3% (2999 of 100 478 radiographs), respectively.

Conclusion
A deep learning algorithm detected lung cancers on chest radiographs with a performance comparable to that of radiologists, which will be helpful for radiologists in healthy populations with a low prevalence of lung cancer.

AUTHORS
Jong Hyuk Lee*, Hye Young Sun*, Sunggyun Park, Hyungjin Kim, Eui Jin Hwang, Jin Mo Goo, Chang Min Park
*J.H.L. and H.Y.S. contributed equally to this work.
URL

https://pubs.rsna.org/doi/10.1148/radiol.2020201240

ABSTRACT

Objective
To describe the experience of implementing a deep learning-based computer-aided detection (CAD) system for the interpretation of chest X-ray radiographs (CXR) of suspected coronavirus disease (COVID-19) patients and investigate the diagnostic performance of CXR interpretation with CAD assistance.

Materials and Methods
In this single-center retrospective study, initial CXR of patients with suspected or confirmed COVID-19 were investigated. A commercialized deep learning-based CAD system that can identify various abnormalities on CXR was implemented for the interpretation of CXR in daily practice. The diagnostic performance of radiologists with CAD assistance were evaluated based on two different reference standards: 1) real-time reverse transcriptase-polymerase chain reaction (rRT-PCR) results for COVID-19 and 2) pulmonary abnormality suggesting pneumonia on chest CT. The turnaround times (TATs) of radiology reports for CXR and rRT-PCR results were also evaluated.

Results
Among 332 patients (male:female, 173:159; mean age, 57 years) with available rRT-PCR results, 16 patients (4.8%) were diagnosed with COVID-19. Using CXR, radiologists with CAD assistance identified rRT-PCR positive COVID-19 patients with sensitivity and specificity of 68.8% and 66.7%, respectively. Among 119 patients (male:female, 75:44; mean age, 69 years) with available chest CTs, radiologists assisted by CAD reported pneumonia on CXR with a sensitivity of 81.5% and a specificity of 72.3%. The TATs of CXR reports were significantly shorter than those of rRT-PCR results (median 51 vs. 507 minutes; p < 0.001).

Conclusion
Radiologists with CAD assistance could identify patients with rRT-PCR-positive COVID-19 or pneumonia on CXR with reasonably acceptable performance. In patients with suspected COVID-19, CXR reports had much shorter TATs than rRT-PCR results.

AUTHORS
Eui Jin Hwang, MD, PhD, Hyungjin Kim, MD, PhD, Soon Ho Yoon, MD, PhD, Jin Mo Goo, MD, PhD and Chang Min Park, MD, PhD
URL

https://kjronline.org/DOIx.php?id=10.3348/kjr.2020.0536

ABSTRACT

Abstract
Chest radiograph interpretation, assisted by a deep learning–based automatic detection algorithm, can reduce the number of overlooked lung cancers without increasing the frequency of chest CT follow-up.

Background
It is uncertain whether a deep learning–based automatic detection algorithm (DLAD) for identifying malignant nodules on chest radiographs will help diagnose lung cancers.

Purpose
To evaluate the efficacy of using a DLAD in observer performance for the detection of lung cancers on chest radiographs.

Materials and Methods
Among patients diagnosed with lung cancers between January 2010 and December 2014, 117 patients (median age, 69 years; interquartile range [IQR], 64–74 years; 57 women) were retrospectively identified in whom lung cancers were visible on previous chest radiographs. For the healthy control group, 234 patients (median age, 58 years; IQR, 48–68 years; 123 women) with normal chest radiographs were randomly selected. Nine observers reviewed each chest radiograph, with and without a DLAD. They detected potential lung cancers and determined whether they would recommend chest CT for follow-up. Observer performance was compared with use of the area under the alternative free-response receiver operating characteristic curve (AUC), sensitivity, and rates of chest CT recommendation.

Results
In total, 105 of the 117 patients had lung cancers that were overlooked on their original radiographs. The average AUC for all observers significantly rose from 0.67 (95% confidence interval [CI]: 0.62, 0.72) without a DLAD to 0.76 (95% CI: 0.71, 0.81) with a DLAD (P < .001). With a DLAD, observers detected more overlooked lung cancers (average sensitivity, 53% [56 of 105 patients] with a DLAD vs 40% [42 of 105 patients] without a DLAD) (P < .001) and recommended chest CT for more patients (62% [66 of 105 patients] with a DLAD vs 47% [49 of 105 patients] without a DLAD) (P < .001). In the healthy control group, no difference existed in the rate of chest CT recommendation (10% [23 of 234 patients] without a DLAD and 8% [20 of 234 patients] with a DLAD) (P = .13).

Conclusion
Using a deep learning–based automatic detection algorithm may help observers reduce the number of overlooked lung cancers on chest radiographs, without a proportional increase in the number of follow-up chest CT examinations.

AUTHORS
Sowon Jang, Hwayoung Song, Yoon Joo Shin, Junghoon Kim, Jihang Kim, Kyung Won Lee, Sung Soo Lee, Woojoo Lee, Seungjae Lee, Kyung Hee Lee
URL

https://pubs.rsna.org/doi/10.1148/radiol.2020200165

ABSTRACT

Objectives
To evaluate the calibration of a deep learning (DL) model in a diagnostic cohort and to improve the model's calibration through recalibration procedures.

Methods
Chest radiographs (CRs) from 1135 consecutive patients (M:F = 582:553; mean age, 52.6 years) who visited our emergency department were included. A commercialized DL model was utilized to identify abnormal CRs, with a continuous probability score for each CR. After evaluation of the model calibration, eight different methods were used to recalibrate the original model based on the probability score. The original model outputs were recalibrated using 681 randomly sampled CRs and validated using the remaining 454 CRs. The Brier score for overall performance, average and maximum calibration error, absolute Spiegelhalter’s Z for calibration, and area under the receiver operating characteristic curve (AUROC) for discrimination were evaluated in 1000-times repeated, randomly split datasets.
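
The abstract does not name the eight recalibration methods. As one generic example, the sketch below applies logistic (Platt-type) recalibration to the model's logits and scores calibration with the Brier score, mirroring the study's 681/454 split; all data here are synthetic.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import brier_score_loss

rng = np.random.default_rng(2)
y = rng.integers(0, 2, 1135)                      # 1 = abnormal CR (synthetic)
p_raw = np.clip(0.25 + 0.55 * y + rng.normal(0, 0.15, 1135), 1e-4, 1 - 1e-4)

train, test = np.arange(681), np.arange(681, 1135)  # 681 to fit, 454 to validate
logit = np.log(p_raw / (1 - p_raw)).reshape(-1, 1)
recal = LogisticRegression().fit(logit[train], y[train])
p_recal = recal.predict_proba(logit[test])[:, 1]

print("Brier before:", round(brier_score_loss(y[test], p_raw[test]), 4))
print("Brier after: ", round(brier_score_loss(y[test], p_recal), 4))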

Results
The original model tended to overestimate the likelihood for the presence of abnormalities, exhibiting average and maximum calibration error of 0.069 and 0.179, respectively; an absolute Spiegelhalter’s Z value of 2.349; and an AUROC of 0.949. After recalibration, significant improvements in the average (range, 0.015–0.036) and maximum (range, 0.057–0.172) calibration errors were observed in eight and five methods, respectively. Significant improvement in absolute Spiegelhalter’s Z (range, 0.809–4.439) was observed in only one method (the recalibration constant). Discriminations were preserved in six methods (AUROC, 0.909–0.949).

Conclusion
The calibration of a DL algorithm can be improved through simple recalibration procedures. Improved calibration may enhance the interpretability and credibility of the model for users.

AUTHORS
Eui Jin Hwang, Hyungjin Kim, Jong Hyuk Lee, Jin Mo Goo & Chang Min Park
URL

https://link.springer.com/article/10.1007/s00330-020-07062-7

ABSTRACT

Early identification of pneumonia is essential in patients with acute febrile respiratory illness (FRI). We evaluated the performance and added value of a commercial deep learning (DL) algorithm in detecting pneumonia on chest radiographs (CRs) of patients visiting the emergency department (ED) with acute FRI. This single-centre, retrospective study included 377 consecutive patients who visited the ED between August 2018 and January 2019, and the resulting 387 CRs. The performance of the DL algorithm in the detection of pneumonia on CRs was evaluated based on the area under the receiver operating characteristic (AUROC) curve, sensitivity, specificity, negative predictive value (NPV), and positive predictive value (PPV). In an observer performance test, three ED physicians independently reviewed the CRs to detect pneumonia, and re-evaluated them with the algorithm's assistance eight weeks later. AUROC, sensitivity, and specificity were compared between the DL algorithm and the physicians alone, and between the physicians alone and the physicians aided by the algorithm. Among 377 patients, 83 (22.0%) had pneumonia. AUROC, sensitivity, specificity, PPV, and NPV of the algorithm for detection of pneumonia on CRs were 0.861, 58.3%, 94.4%, 74.2%, and 89.1%, respectively. For the detection of 'visible pneumonia on CR' (60 CRs from 59 patients), AUROC, sensitivity, specificity, PPV, and NPV were 0.940, 81.7%, 94.4%, 74.2%, and 96.3%, respectively. In the observer performance test, the algorithm performed better than the physicians for pneumonia (AUROC, 0.861 vs. 0.788, p = 0.017; specificity, 94.4% vs. 88.7%, p < 0.0001) and visible pneumonia (AUROC, 0.940 vs. 0.871, p = 0.007; sensitivity, 81.7% vs. 73.9%, p = 0.034; specificity, 94.4% vs. 88.7%, p < 0.0001). Detection of pneumonia (sensitivity, 82.2% vs. 53.2%, p = 0.008; specificity, 98.1% vs. 88.7%; p < 0.0001) and 'visible pneumonia' (sensitivity, 82.2% vs. 73.9%, p = 0.014; specificity, 98.1% vs. 88.7%, p < 0.0001) significantly improved when the algorithm was used by the physicians. Mean reading time for the physicians decreased from 165 to 101 min with the assistance of the algorithm. Thus, the DL algorithm showed better diagnostic performance for pneumonia, particularly visible pneumonia on CR, and improved the diagnoses made by ED physicians in patients with acute FRI.

AUTHORS
Jae Hyun Kim1, Jin Young Kim1, Gun Ha Kim1, Donghoon Kang2, In Jung Kim2, Jeongkuk Seo2, Jason R. Andrews3 and Chang Min Park4
1Department of Radiology, Armed Forces Goyang Hospital, 215, Hyeeum-ro, Deogyang-gu, Goyang-si, Gyeonggi-do 10271, Korea, 2Department of Internal Medicine, Armed Forces Goyang Hospital, 215, Hyeeum-ro, Deogyang-gu, Goyang-si, Gyeonggi-do 10271, Korea, 3Division of Infectious Diseases and Geographic Medicine, Stanford University School of Medicine, 291 Campus Drive, Stanford, CA 94305, USA, 4Department of Radiology and Institute of Radiation Medicine, Seoul National University College of Medicine, 101 Daehak-ro, Jongno-gu, Seoul 03080, Korea
URL

https://www.mdpi.com/2077-0383/9/6/1981

ABSTRACT

Objectives
Pneumothorax is the most common and potentially life-threatening complication arising from percutaneous lung biopsy. We evaluated the performance of a deep learning algorithm for detection of post-biopsy pneumothorax in chest radiographs (CRs), in consecutive cohorts reflecting the actual clinical situation.

Methods
We retrospectively included post-biopsy CRs of 1757 consecutive patients (1055 men, 702 women; mean age of 65.1 years) undergoing percutaneous lung biopsies from three institutions. A commercially available deep learning algorithm analyzed each CR to identify pneumothorax. We compared the performance of the algorithm with that of radiology reports made in the actual clinical practice. We also conducted a reader study, in which the performance of the algorithm was compared with those of four radiologists. Performances of the algorithm and radiologists were evaluated by area under receiver operating characteristic curves (AUROCs), sensitivity, and specificity, with reference standards defined by thoracic radiologists.

Results
Pneumothorax occurred in 17.5% (308/1757) of cases, out of which 16.6% (51/308) required catheter drainage. The AUROC, sensitivity, and specificity of the algorithm were 0.937, 70.5%, and 97.7%, respectively, for identification of pneumothorax. The algorithm exhibited higher sensitivity (70.2% vs. 55.5%, p < 0.001) and lower specificity (97.7% vs. 99.8%, p < 0.001), compared with those of radiology reports. In the reader study, the algorithm exhibited lower sensitivity (77.3% vs. 81.8–97.7%) and higher specificity (97.6% vs. 81.7–96.0%) than the radiologists.

Conclusion
The deep learning algorithm appropriately identified pneumothorax in post-biopsy CRs in consecutive diagnostic cohorts. It may assist in accurate and timely diagnosis of post-biopsy pneumothorax in clinical practice.

AUTHORS
Eui Jin Hwang, Jung Hee Hong, Kyung Hee Lee, Jung Im Kim, Ju Gang Nam, Da Som Kim, Hyewon Choi, Seung Jin Yoo, Jin Mo Goo & Chang Min Park
URL

https://link.springer.com/article/10.1007%2Fs00330-020-06771-3

ABSTRACT

Importance
Mammography screening currently relies on subjective human interpretation. Artificial intelligence (AI) advances could be used to increase mammography screening accuracy by reducing missed cancers and false positives.

Objective
To evaluate whether AI can overcome human mammography interpretation limitations with a rigorous, unbiased evaluation of machine learning algorithms.

Design, Setting, and Participants
In this diagnostic accuracy study conducted between September 2016 and November 2017, an international, crowdsourced challenge was hosted to foster AI algorithm development focused on interpreting screening mammography. More than 1100 participants, comprising 126 teams from 44 countries, took part. Analysis began November 18, 2016.

Main Outcomes and Measures
Algorithms used images alone (challenge 1) or combined images, previous examinations (if available), and clinical and demographic risk factor data (challenge 2) and output a score that translated to cancer yes/no within 12 months. Algorithm accuracy for breast cancer detection was evaluated using area under the curve and algorithm specificity compared with radiologists’ specificity with radiologists’ sensitivity set at 85.9% (United States) and 83.9% (Sweden). An ensemble method aggregating top-performing AI algorithms and radiologists’ recall assessment was developed and evaluated.
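
The abstract does not disclose the ensemble's exact form. One simple aggregation consistent with the description is a weighted combination of the AI score and the radiologist's binary recall assessment; the weights and inputs below are illustrative assumptions, not the challenge's method.

import numpy as np

def ensemble_score(ai_scores, radiologist_recall, w_ai=0.7, w_rad=0.3):
    """Convex combination of a continuous AI score (0-1) and a binary recall."""
    return w_ai * np.asarray(ai_scores) + w_rad * np.asarray(radiologist_recall)

ai = np.array([0.91, 0.15, 0.42, 0.08])   # per-exam AI scores (made up)
rad = np.array([1, 0, 1, 0])              # radiologist recall decisions (made up)
print(ensemble_score(ai, rad))            # threshold this for the final recall call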

Results
Overall, 144 231 screening mammograms from 85 580 US women (952 cancer positive ≤12 months from screening) were used for algorithm training and validation. A second independent validation cohort included 166 578 examinations from 68 008 Swedish women (780 cancer positive). The top-performing algorithm achieved an area under the curve of 0.858 (United States) and 0.903 (Sweden) and 66.2% (United States) and 81.2% (Sweden) specificity at the radiologists’ sensitivity, lower than community-practice radiologists’ specificity of 90.5% (United States) and 98.5% (Sweden). Combining top-performing algorithms and US radiologist assessments resulted in a higher area under the curve of 0.942 and achieved a significantly improved specificity (92.0%) at the same sensitivity.

Conclusions and Relevance
While no single AI algorithm outperformed radiologists, an ensemble of AI algorithms combined with radiologist assessment in a single-reader screening environment improved overall accuracy. This study underscores the potential of using machine learning methods for enhancing mammography screening interpretation.

AUTHORS
Thomas Schaffter, PhD; Diana S. M. Buist, PhD, MPH; Christoph I. Lee, MD, MS; Yaroslav Nikulin, MS; Dezső Ribli, MSc; Yuanfang Guan, PhD; William Lotter, PhD; Zequn Jie, PhD; Hao Du, BEng; Sijia Wang, MSc; Jiashi Feng, PhD; Mengling Feng, PhD; Hyo-Eun Kim, PhD; Francisco Albiol, PhD; Alberto Albiol, PhD; Stephen Morrell, B Bus Sc, MiF, M Res; Zbigniew Wojna, MSI; Mehmet Eren Ahsen, PhD; Umar Asif, PhD; Antonio Jimeno Yepes, PhD; Shivanthan Yohanandan, PhD; Simona Rabinovici-Cohen, MSc; Darvin Yi, MSc; Bruce Hoff, PhD; Thomas Yu, BS; Elias Chaibub Neto, PhD; Daniel L. Rubin, MD, MS; Peter Lindholm, MD, PhD; Laurie R. Margolies, MD; Russell Bailey McBride, PhD, MPH; Joseph H. Rothstein, MSc; Weiva Sieh, MD, PhD; Rami Ben-Ari, PhD; Stefan Harrer, PhD; Andrew Trister, MD, PhD; Stephen Friend, MD, PhD; Thea Norman, PhD; Berkman Sahiner, PhD; Fredrik Strand, MD, PhD; Justin Guinney, PhD; Gustavo Stolovitzky, PhD; and the DM DREAM Consortium
URL

https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2761795?resultClick=1

ABSTRACT

Background
Mammography is the current standard for breast cancer screening. This study aimed to develop an artificial intelligence (AI) algorithm for diagnosis of breast cancer in mammography, and explore whether it could benefit radiologists by improving accuracy of diagnosis.

Methods
In this retrospective study, an AI algorithm was developed and validated with 170 230 mammography examinations collected from five institutions in South Korea, the USA, and the UK, including 36 468 cancer positive confirmed by biopsy, 59 544 benign confirmed by biopsy (8827 mammograms) or follow-up imaging (50 717 mammograms), and 74 218 normal. For the multicentre, observer-blinded, reader study, 320 mammograms (160 cancer positive, 64 benign, 96 normal) were independently obtained from two institutions. 14 radiologists participated as readers and assessed each mammogram in terms of likelihood of malignancy (LOM), location of malignancy, and necessity to recall the patient, first without and then with assistance of the AI algorithm. The performance of AI and radiologists was evaluated in terms of LOM-based area under the receiver operating characteristic curve (AUROC) and recall-based sensitivity and specificity.

Findings
The AI standalone performance was AUROC 0·959 (95% CI 0·952–0·966) overall, and 0·970 (0·963–0·978) in the South Korea dataset, 0·953 (0·938–0·968) in the USA dataset, and 0·938 (0·918–0·958) in the UK dataset. In the reader study, the performance level of AI was 0·940 (0·915–0·965), significantly higher than that of the radiologists without AI assistance (0·810, 95% CI 0·770–0·850; p<0·0001). With the assistance of AI, radiologists' performance was improved to 0·881 (0·850–0·911; p<0·0001). AI was more sensitive to detect cancers with mass (53 [90%] vs 46 [78%] of 59 cancers detected; p=0·044) or distortion or asymmetry (18 [90%] vs ten [50%] of 20 cancers detected; p=0·023) than radiologists. AI was better in detection of T1 cancers (73 [91%] vs 59 [74%] of 80; p=0·0039) or node-negative cancers (104 [87%] vs 88 [74%] of 119; p=0·0025) than radiologists.

Interpretation
The AI algorithm developed with large-scale mammography data showed better diagnostic performance in breast cancer detection compared with radiologists. The significant improvement in radiologists' performance when aided by AI supports application of AI to mammograms as a diagnostic support tool.

AUTHORS
Hyo-Eun Kim, PhD; Hak Hee Kim, MD; Boo-Kyung Han, MD; Ki Hwan Kim, MD; Kyunghwa Han, PhD; Hyeonseob Nam, MS; Eun Hye Lee, MD; Eun-Kyung Kim, MD
URL

https://www.thelancet.com/journals/landig/article/PIIS2589-7500(20)30003-0/fulltext

ABSTRACT

Objectives
To perform test-retest reproducibility analyses for deep learning–based automatic detection algorithm (DLAD) using two stationary chest radiographs (CRs) with short-term intervals, to analyze influential factors on test-retest variations, and to investigate the robustness of DLAD to simulated post-processing and positional changes.

Methods
This retrospective study included patients with pulmonary nodules resected in 2017. Preoperative CRs without interval changes were used. Test-retest reproducibility was analyzed in terms of median differences of abnormality scores, intraclass correlation coefficients (ICC), and 95% limits of agreement (LoA). Factors associated with test-retest variation were investigated using univariable and multivariable analyses. Shifts in classification between the two CRs were analyzed using pre-determined cutoffs. Radiograph post-processing (blurring and sharpening) and positional changes (translations in x- and y-axes, rotation, and shearing) were simulated and agreement of abnormality scores between the original and simulated CRs was investigated.
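
A sketch of the agreement statistics named above: Bland-Altman 95% limits of agreement on the score differences and an intraclass correlation. The abstract does not state which ICC variant was used; a one-way ICC(1,1) is assumed here, and the test-retest scores are synthetic.

import numpy as np

def icc_1_1(x1, x2):
    """One-way random-effects ICC for two repeated measurements per subject."""
    data = np.stack([x1, x2], axis=1)
    n, k = data.shape
    subj_means = data.mean(axis=1)
    msb = k * np.sum((subj_means - data.mean()) ** 2) / (n - 1)
    msw = np.sum((data - subj_means[:, None]) ** 2) / (n * (k - 1))
    return (msb - msw) / (msb + (k - 1) * msw)

rng = np.random.default_rng(3)
true = rng.uniform(0, 100, 169)                      # latent abnormality per patient
s1 = np.clip(true + rng.normal(0, 8, 169), 0, 100)   # first radiograph's score
s2 = np.clip(true + rng.normal(0, 8, 169), 0, 100)   # retest radiograph's score

diff = s1 - s2
loa = 1.96 * diff.std(ddof=1)
print(f"ICC={icc_1_1(s1, s2):.3f}, 95% LoA={diff.mean():.1f} +/- {loa:.1f}")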

Results
Our study analyzed 169 patients (median age, 65 years; 91 men). The median difference of abnormality scores was 1–2% and ICC ranged from 0.83 to 0.90. The 95% LoA was approximately ± 30%. Test-retest variation was negatively associated with solid portion size (β, − 0.50; p = 0.008) and good nodule conspicuity (β, − 0.94; p < 0.001). A small fraction (15/169) showed discordant classifications when the high-specificity cutoff (46%) was applied to the model outputs (p = 0.04). DLAD was robust to the simulated positional change (ICC, 0.984, 0.996), but relatively less robust to post-processing (ICC, 0.872, 0.968).

Conclusions
DLAD was robust to the test-retest variation. However, inconspicuous nodules may cause fluctuations of the model output and subsequent misclassifications.

Key Points
• The deep learning–based automatic detection algorithm was robust to the test-retest variation of the chest radiographs in general.

• The test-retest variation was negatively associated with solid portion size and good nodule conspicuity.

• High-specificity cutoff (46%) resulted in discordant classifications of 8.9% (15/169; p = 0.04) between the test-retest radiographs.

AUTHORS
Hyungjin Kim, Chang Min Park & Jin Mo Goo
URL

https://link.springer.com/article/10.1007%2Fs00330-019-06589-8

ABSTRACT

Background
The performance of a deep learning (DL) algorithm should be validated in actual clinical situations, before its clinical implementation.

Purpose
To evaluate the performance of a DL algorithm for identifying chest radiographs with clinically relevant abnormalities in the emergency department (ED) setting.

Materials and Methods
This single-center retrospective study included consecutive patients who visited the ED and underwent initial chest radiography between January 1 and March 31, 2017. Chest radiographs were analyzed with a commercially available DL algorithm. The performance of the algorithm was evaluated by determining the area under the receiver operating characteristic curve (AUC), sensitivity, and specificity at predefined operating cutoffs (high-sensitivity and high-specificity cutoffs). The sensitivities and specificities of the algorithm were compared with those of the on-call radiology residents who interpreted the chest radiographs in the actual practice by using McNemar tests. If there were discordant findings between the algorithm and resident, the residents reinterpreted the chest radiographs by using the algorithm’s output.

Results
A total of 1135 patients (mean age, 53 years ± 18; 582 men) were evaluated. In the identification of abnormal chest radiographs, the algorithm showed an AUC of 0.95 (95% confidence interval [CI]: 0.93, 0.96), a sensitivity of 88.7% (227 of 256 radiographs; 95% CI: 84.1%, 92.3%), and a specificity of 69.6% (612 of 879 radiographs; 95% CI: 66.5%, 72.7%) at the high-sensitivity cutoff and a sensitivity of 81.6% (209 of 256 radiographs; 95% CI: 76.3%, 86.2%) and specificity of 90.3% (794 of 879 radiographs; 95% CI: 88.2%, 92.2%) at the high-specificity cutoff. Radiology residents showed lower sensitivity (65.6% [168 of 256 radiographs; 95% CI: 59.5%, 71.4%], P < .001) and higher specificity (98.1% [862 of 879 radiographs; 95% CI: 96.9%, 98.9%], P < .001) compared with the algorithm. After reinterpretation of chest radiographs with use of the algorithm’s outputs, the sensitivity of the residents improved (73.4% [188 of 256 radiographs; 95% CI: 68.0%, 78.8%], P = .003), whereas specificity was reduced (94.3% [829 of 879 radiographs; 95% CI: 92.8%, 95.8%], P < .001).

Conclusion
A deep learning algorithm used with emergency department chest radiographs showed diagnostic performance for identifying clinically relevant abnormalities and helped improve the sensitivity of radiology residents’ evaluation.

AUTHORS
Eui Jin Hwang, Ju Gang Nam, Woo Hyeon Lim, Sae Jin Park, Yun Soo Jeong, Ji Hee Kang, Eun Kyoung Hong, Taek Min Kim, Jin Mo Goo, Sunggyun Park, Ki Hwan Kim, Chang Min Park
From the Department of Radiology, Seoul National University College of Medicine, 101 Daehak-ro, Jongno-gu, Seoul 03080, Korea (E.J.H., J.G.N., W.H.L., S.J.P., Y.S.J., J.H.K., E.K.H., T.M.K., J.M.G., C.M.P.); and Lunit, Seoul, Korea (S.P., K.H.K.).
URL

https://pubs.rsna.org/doi/10.1148/radiol.2019191225

ABSTRACT

Importance Interpretation of chest radiographs is a challenging task prone to errors, requiring expert readers. An automated system that can accurately classify chest radiographs may help streamline the clinical workflow.

Objective To develop a deep learning–based algorithm that can classify normal and abnormal results from chest radiographs with major thoracic diseases, including pulmonary malignant neoplasm, active tuberculosis, pneumonia, and pneumothorax, and to validate the algorithm's performance using independent data sets.

Design, Setting, and Participants This diagnostic study developed a deep learning–based algorithm using single-center data collected between November 1, 2016, and January 31, 2017. The algorithm was externally validated with multicenter data collected between May 1 and July 31, 2018. A total of 54 221 chest radiographs with normal findings from 47 917 individuals (21 556 men and 26 361 women; mean [SD] age, 51 [16] years) and 35 613 chest radiographs with abnormal findings from 14 102 individuals (8373 men and 5729 women; mean [SD] age, 62 [15] years) were used to develop the algorithm. A total of 486 chest radiographs with normal results and 529 with abnormal results (1 from each participant; 628 men and 387 women; mean [SD] age, 53 [18] years) from 5 institutions were used for external validation. Fifteen physicians, including nonradiology physicians, board-certified radiologists, and thoracic radiologists, participated in observer performance testing. Data were analyzed in August 2018.

Main Outcomes and Measures Image-wise classification performance measured by the area under the receiver operating characteristic curve; lesion-wise localization performance measured by the area under the alternative free-response receiver operating characteristic curve.

Results The algorithm demonstrated a median (range) area under the curve of 0.979 (0.973-1.000) for image-wise classification and 0.972 (0.923-0.985) for lesion-wise localization; the algorithm demonstrated significantly higher performance than all 3 physician groups in both image-wise classification (0.983 vs 0.814-0.932; all P < .005) and lesion-wise localization (0.985 vs 0.781-0.907; all P < .001). Significant improvements in both image-wise classification (0.814-0.932 to 0.904-0.958; all P < .005) and lesion-wise localization (0.781-0.907 to 0.873-0.938; all P < .001) were observed in all 3 physician groups with assistance of the algorithm.

Conclusions and Relevance The algorithm consistently outperformed physicians, including thoracic radiologists, in the discrimination of chest radiographs with major thoracic diseases, demonstrating its potential to improve the quality and efficiency of clinical practice.

AUTHORS
Eui Jin Hwang1, Sunggyun Park2, Kwang-Nam Jin3, Jung Im Kim4, So Young Choi5, Jong Hyuk Lee6, Jin Mo Goo1, Brian Jaehong Aum2, Jae-Joon Yim7, Julien G. Cohen8, Gilbert R. Ferretti8 and Chang Min Park1
1Seoul National University Hospital and College of Medicine, 2Lunit Inc., 3Seoul National University Boramae Medical Center, 4Kyung Hee University College of Medicine, 5Eulji University Medical Center, 6Armed Forces Seoul Hospital, 7Seoul National University College of Medicine, 8Centre Hospitalier Universitaire de Grenoble
URL

https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2728630

ABSTRACT

Detection of active pulmonary tuberculosis (TB) on chest radiographs (CR) is critical for the diagnosis and screening of TB. An automated system may help streamline the TB screening process and improve diagnostic performance. We developed a deep-learning-based automatic detection (DLAD) algorithm, using 54,221 normal CRs and 6,768 CRs with active pulmonary TB, which were labeled and annotated by 13 board-certified radiologists. The performance of DLAD was validated using six external multi-center, multi-national datasets. To compare the performances of DLAD with physicians, an observer performance test was conducted by 15 physicians including non-radiology physicians, board-certified radiologists, and thoracic radiologists. Image-wise classification and lesion-wise localization performances were measured using area under the receiver operating characteristic (ROC) curves, and area under the alternative free-response ROC curves, respectively. Sensitivities and specificities of DLAD were calculated using two cutoffs [high sensitivity (98%) and high specificity (98%)] obtained through in-house validation. DLAD demonstrated classification performances of 0.977–1.000 and localization performance of 0.973–1.000. Sensitivities and specificities for classification were 94.3–100% and 91.1–100% using the high sensitivity cutoff and 84.1–99.0% and 99.1–100% using the high specificity cutoff. DLAD showed significantly higher performance in both classification (0.993 vs. 0.746–0.971) and localization (0.993 vs. 0.664–0.925) compared to all groups of physicians. Our DLAD demonstrated excellent and consistent performance in the detection of active pulmonary TB on CR, outperforming physicians including thoracic radiologists.

AUTHORS
Eui Jin Hwang1, Sunggyun Park2, Kwang-Nam Jin3, So Young Choi4, Jong Hyuk Lee5, Jin Mo Goo1, Brian Jaehong Aum2, Jae-Joon Yim6 and Chang Min Park1
1Seoul National University Hospital and College of Medicine, 2Lunit Inc., 3Seoul National University Boramae Medical Center, 4Eulji University Medical Center, 5Armed Forces Seoul Hospital, 6Seoul National University College of Medicine
URL

https://academic.oup.com/cid/advance-article/doi/10.1093/cid/ciy967/5174137

ABSTRACT

Purpose
To develop and validate a deep learning–based automatic detection algorithm (DLAD) for malignant pulmonary nodules on chest radiographs and to compare its performance with that of physicians, including thoracic radiologists.

Materials and Methods
For this retrospective study, DLAD was developed by using 43 292 chest radiographs (normal radiograph–to–nodule radiograph ratio, 34 067:9225) in 34 676 patients (healthy-to-nodule ratio, 30 784:3892; 19 230 men [mean age, 52.8 years; age range, 18–99 years]; 15 446 women [mean age, 52.3 years; age range, 18–98 years]) obtained between 2010 and 2015, which were labeled and partially annotated by 13 board-certified radiologists, in a convolutional neural network. Radiograph classification and nodule detection performances of DLAD were validated by using one internal and four external data sets from three South Korean hospitals and one U.S. hospital, evaluated with the area under the receiver operating characteristic curve (AUROC) and the jackknife alternative free-response receiver operating characteristic (JAFROC) figure of merit (FOM), respectively. An observer performance test involving 18 physicians, including nine board-certified radiologists, was conducted by using one of the four external validation data sets; the performances of DLAD, physicians, and physicians assisted with DLAD were evaluated and compared.

Results
Across the one internal and four external validation data sets, the radiograph classification and nodule detection performances of DLAD ranged from 0.92 to 0.99 (AUROC) and from 0.831 to 0.924 (JAFROC FOM), respectively. In the observer performance test, DLAD showed a higher AUROC than 17 of 18 physicians and a higher JAFROC FOM than 15 of 18 physicians (P < .05), and all physicians showed improved nodule detection performance with DLAD (mean JAFROC FOM improvement, 0.043; range, 0.006–0.190; P < .05).

Conclusion
This deep learning–based automatic detection algorithm outperformed physicians in radiograph classification and nodule detection of malignant pulmonary nodules on chest radiographs, and it enhanced physicians' performance when used as a second reader.

AUTHORS
Sunggyun Park1, Ju Gang Nam2, Eui Jin Hwang2, Jong Hyuk Lee3, Kwang-Nam Jin4, Kun Young Lim5, Thienkai Huy Vu6, Jae Ho Sohn6, Sangheum Hwang1, Jin Mo Goo2 and Chang Min Park2
1Lunit Inc., 2Seoul National University Hospital and College of Medicine, 3Armed Forces Seoul Hospital, 4Seoul National University Boramae Medical Center, 5National Cancer Center, 6University of California, San Francisco
URL

https://pubs.rsna.org/doi/10.1148/radiol.2018180237

ABSTRACT

We assessed the feasibility of a data-driven imaging biomarker based on weakly supervised learning (DIB; an imaging biomarker derived from large-scale medical image data with deep learning technology) in mammography (DIB-MG). A total of 29,107 digital mammograms from five institutions (4,339 cancer cases and 24,768 normal cases) were included. After matching patients' age, breast density, and equipment, 1,238 and 1,238 cases were chosen as validation and test sets, respectively, and the remainder were used for training. The core algorithm of DIB-MG is a deep convolutional neural network, a deep learning algorithm specialized for images. Each sample (case) is an exam composed of 4-view images (RCC, RMLO, LCC, and LMLO). For each case in the training set, the cancer probability inferred from DIB-MG is compared with the per-case ground-truth label, and the model parameters in DIB-MG are then updated based on the error between the prediction and the ground truth. At the operating point (threshold) of 0.5, sensitivity was 75.6% and 76.1% when specificity was 90.2% and 88.5%, and AUC was 0.903 and 0.906 for the validation and test sets, respectively. This research showed the potential of DIB-MG as a screening tool for breast cancer.
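
A minimal PyTorch sketch of the weak supervision described above: a shared CNN scores each of the four views, and the per-view outputs are pooled into one per-case cancer probability trained against the case-level label. The architecture and pooling choice are assumptions for illustration; the actual DIB-MG network is not specified in the abstract.

import torch
import torch.nn as nn

class FourViewNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(   # shared across RCC, RMLO, LCC, LMLO
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, 1),
        )

    def forward(self, views):           # views: (batch, 4, 1, H, W)
        b = views.shape[0]
        logits = self.encoder(views.flatten(0, 1)).view(b, 4)
        return torch.sigmoid(logits.max(dim=1).values)  # per-case probability

model = FourViewNet()
cases = torch.randn(2, 4, 1, 128, 128)  # 2 toy cases, 4 views each
probs = model(cases)
# Weak supervision: the loss compares the case-level prediction with the
# per-case ground-truth label; no lesion-level annotation is used.
loss = nn.functional.binary_cross_entropy(probs, torch.tensor([1.0, 0.0]))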

AUTHORS
Eun-Kyung Kim1, Hyo-Eun Kim2, Kyunghwa Han1, Bong Joo Kang3, Yu-Mee Sohn4, Ok Hee Woo5 and Chan Wha Lee6
1Severance Hospital, Yonsei University, 2Lunit Inc., 3Seoul St. Mary’s Hospital, Catholic University, 4Kyung Hee University Hospital, 5Korea University Guro Hospital, 6National Cancer Center Hospital
URL

https://www.nature.com/articles/s41598-018-21215-1