Magnetic resonance imaging (MRI), used in concert with computer-aided methods, can detect, diagnose, and characterize invasive breast cancers. To this end, fully automated segmentation of breast tissues is important for quantitative breast imaging analysis, and for use in spatially resolved biophysical models of breast cancer. To ensure the accuracy of segmentations, tissue label maps from such models must be validated by domain experts such as radiologists.
We developed an ensembled suite of convolutional neural networks (CNNs; core components of SimBioSys TumorSight) that segment tumor and other tissues in and around the breast (chest, adipose, gland, vasculature, skin). We sought to validate model results against the expertise of two breast-specialized radiologists. A ground truth dataset was created based on the radiologists’ assessments of tumor longest dimensions (LD), tumor segmentation, multi-tissue segmentation, and background parenchymal enhancement (BPE). This allowed us to quantify the agreement between the CNNs and the radiologists, as well as the observed variability between radiologists. The inter-radiologist variability can be used as a lower bound for the expected variability between CNN results and radiologists’ assessments, providing a benchmark to evaluate current and future models.
CNN-generated tumor and multi-tissue segmentations were created for 100 early-stage breast cancer cases based on dynamic contrast enhanced (DCE) MRI from the University of Alabama Birmingham (UAB) Hospital (e.g., Figure 1). Each case underwent an internal review and, if necessary, the tumor segmentation was manually edited (“manual segmentation”) to more accurately reflect the underlying tumor characteristics. These cases were then independently assessed by two board-certified radiologists (Reviewer 1 and Reviewer 2) for the following: LD (measurements of primary tumor and total extent of disease), tumor segmentation (approve/reject), multi-tissue segmentation (approve/reject), and BPE category per BI-RADS [1]. The reviewers were required to make their LD measurements prior to viewing the tumor and multi-tissue segmentations to minimize bias. CNN-generated segmentations without any manual edits were used for comparison (“CNN segmentations”). A workflow summary is shown in Figure 2.
Percent approval by reviewer was calculated, and lower bounds (LB) of a one-sided 95% exact confidence interval (CI) are given for both tumor and multi-tissue segmentations. Inter-rater reliability (IRR) was measured between the two reviewers using Gwet’s AC1 [2] for tumor and multi-tissue segmentation calls, as well as BPE. Reviewers’ LD measurements were used to create a 3D bounding box around the tumor. Intra-class correlation [3] (ICC) was used to assess the reliability of these measurements between the reviewers, the CNN segmentations, and the manual segmentations.
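For readers unfamiliar with Gwet’s AC1, the following is a minimal sketch of the two-rater form of the statistic: observed agreement corrected by a chance-agreement term equal to the sum of π(1 − π) over categories, divided by the number of categories minus one, where π is the mean proportion of ratings in each category. The function name `gwet_ac1` and the toy approve/reject calls are illustrative assumptions, not the study’s code or data.

```python
from collections import Counter


def gwet_ac1(ratings_a, ratings_b, categories):
    """Gwet's AC1 for two raters who each rated the same items.

    `categories` lists every possible label (hypothetical helper signature;
    pass it explicitly so unused labels still count toward the category total).
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must rate the same non-empty item set")
    if len(categories) < 2:
        raise ValueError("at least two categories are required")

    n = len(ratings_a)
    q = len(categories)

    # Observed agreement: fraction of items given the same label by both raters.
    p_a = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Chance agreement: sum of pi_k * (1 - pi_k) over categories, divided by (q - 1),
    # where pi_k is the average proportion of ratings falling in category k.
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    p_e = sum(
        pi * (1 - pi)
        for pi in ((counts_a[c] + counts_b[c]) / (2 * n) for c in categories)
    ) / (q - 1)

    return (p_a - p_e) / (1 - p_e)


# Toy example only: ten hypothetical approve/reject calls per reviewer.
toy_r1 = ["approve"] * 3 + ["reject"] + ["approve"] * 2 + ["reject"] + ["approve"] * 3
toy_r2 = ["approve"] * 6 + ["reject", "approve", "reject", "approve"]
print(round(gwet_ac1(toy_r1, toy_r2, ["approve", "reject"]), 3))
```

Unlike Cohen’s kappa, the chance-agreement term here stays small when one category dominates, which suits approve/reject data with high approval rates.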
Reviewer 1 approved 67 of 91 tumor segmentations (73.6%, LB 95% CI=65.0%) and 87 of 90 multi-tissue segmentations (96.7%, LB 95% CI=91.6%). Reviewer 2 approved 84 of 96 tumor segmentations (87.5%, LB 95% CI=80.5%) and 88 of 93 multi-tissue segmentations (94.6%, LB 95% CI=89.0%). Overall, 87 of the 97 reviewed tumor segmentations (89.7%) were approved by at least one reviewer, along with 80 of 86 multi-tissue segmentations (93.0%). Reasons for incomplete reviews included issues loading or visualizing the images and issues locating the lesion. A Gwet’s AC1 of 0.70, indicating substantial reliability, was observed between reviewers for tumor segmentation calls, and a Gwet’s AC1 of 0.93, indicating strong reliability, was observed for multi-tissue segmentation calls. High concordance was observed on BPE categorization (71.0% agreement, Gwet’s AC1=0.62). Cohort demographics and clinical characteristics are found in Table 1.
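The one-sided exact lower bounds quoted above can be approximated with a short sketch, assuming "exact" refers to the Clopper-Pearson (binomial tail inversion) method; the abstract does not name the method, so this is an assumption. The helper names `upper_tail` and `exact_lower_bound` are illustrative, and only the approval counts are taken from the text.

```python
from math import comb


def upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def exact_lower_bound(k, n, alpha=0.05, tol=1e-10):
    """One-sided Clopper-Pearson lower bound for a proportion k/n.

    Finds the smallest p at which observing k or more successes is no longer
    'surprising' at level alpha, i.e. solves P(X >= k | p) = alpha by bisection.
    """
    if k == 0:
        return 0.0
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if upper_tail(k, n, mid) < alpha:
            lo = mid  # tail probability too small: the bound lies above mid
        else:
            hi = mid
    return (lo + hi) / 2


# Approval counts as reported in the text (approved, reviewed).
reported = {
    "Reviewer 1, tumor": (67, 91),
    "Reviewer 1, multi-tissue": (87, 90),
    "Reviewer 2, tumor": (84, 96),
    "Reviewer 2, multi-tissue": (88, 93),
}
for label, (k, n) in reported.items():
    print(f"{label}: {k / n:.1%} approved, lower bound {exact_lower_bound(k, n):.1%}")
```

Under this assumption the computed bounds should land close to the reported values; a different interval method (for example Wilson or normal approximation) would give slightly higher bounds.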
The ICC for the log of the bounding box volume around the primary tumor was 0.82 (95% CI: [0.75, 0.87]) between the two reviewers and the CNN segmentations, indicating good reliability. The ICC was 0.92 (95% CI: [0.89, 0.94]) between the two reviewers and the manual segmentations, indicating excellent reliability. Strong Pearson correlations (r range = 0.77 to 0.90) between maximum LD measurements were observed for all six comparisons (Figure 3). Summary statistics for the maximum LD measurements and bounding box volumes are found in Table 2.
We observed a very high level of acceptance overall for both tumor and multi-tissue segmentations. In addition, we observed high levels of agreement between Reviewers 1 and 2 on tumor segmentation acceptance and multi-tissue segmentation acceptance. These findings strongly support the validity of our CNN for multi-tissue segmentation purposes and facilitate the use of the accepted segmentations in downstream segmentation model fine-tuning.
We also found a high level of concordance between Reviewers 1 and 2 regarding measurement of the main mass LD and the total extent of disease LD. Additionally, strong correlation was observed between the CNN segmentations and the reviewers, as well as between the manual segmentations and the reviewers (Figure 3). Importantly, these concordance measurements help contextualize CNN model output by establishing the ceiling of performance that our model may reach in the future. A high level of concordance was also observed for BPE categorization (71.0% agreement); this result will be used as an internal benchmark in the development of future BPE-detection models. These results are consistent with recent studies.
Input and evaluation from domain experts (e.g., radiologists) are invaluable in validating the results of CNN-generated segmentations and creating benchmarks for future work. By completing this exercise, we have assessed the acceptability of our segmentations, quantified reliability between CNN, internal reviewers, and external radiologists, and generated a ground truth dataset that can be used for future validation and research/development efforts.