ANATOMICAL PATHOLOGY | Volume 55, Issue 3, P342-349, April 2023


Artificial intelligence for basal cell carcinoma: diagnosis and distinction from histological mimics

Published: January 12, 2023 | DOI: https://doi.org/10.1016/j.pathol.2022.10.004

      Summary

We trained an artificial intelligence (AI) algorithm to identify basal cell carcinoma (BCC) and to distinguish BCC from histological mimics. A total of 1061 glass slides were collected: 616 containing BCC and 445 without BCC. BCC slides were collected prospectively, reflecting the range of specimen types and morphological variety encountered in routine pathology practice. Benign and malignant histological mimics of BCC were selected prospectively and retrospectively, including cases considered diagnostically challenging for pathologists. Glass slides were digitally scanned to create whole slide images (WSIs), which were divided into patches each representing a tissue area of 65,536 μm². Pathologists annotated the data, yielding 87,205 patches labelled BCC present and 1,688,697 patches labelled BCC absent. The COMPASS (COntext-aware Multi-scale tool for Pathologists Assessing SlideS) model, based on convolutional neural networks, was trained to provide a probability of BCC being present at the patch level and the slide level. The test set comprised 246 slides, 147 of which contained BCC. The COMPASS AI model demonstrated high accuracy, classifying WSIs as containing BCC with a sensitivity of 98.0% and a specificity of 97.0%, representing 240 WSIs classified correctly, three false positives, and three false negatives. Using BCC as a proof of concept, we demonstrate how AI can account for morphological variation within an entity and accurately distinguish that entity from histologically similar ones. Our study highlights the potential for AI in routine pathology practice.


      Introduction

Artificial intelligence (AI) has the potential to improve diagnostic accuracy in surgical pathology.[1-3]
In the field of dermatopathology, previous research has yielded promising results for AI classification of common skin tumours.[4-10]
However, obstacles remain regarding the inclusion of AI in routine pathology practice.[11,12]
Diagnostically, AI needs to retain accuracy across the breadth of morphological variation that may be encountered within a single entity. Furthermore, AI needs to accurately distinguish an entity from its histological mimics.[11]
The AI-human interface also warrants careful consideration: AI results should be presented in a manner that allows human pathologists to reconcile their visual impression with the mathematical prediction of a computer algorithm.[13] We consider these issues in regard to cutaneous basal cell carcinoma (BCC), the most common human malignancy.[14]
Previous studies mainly assessed AI accuracy in distinguishing typical BCC cases from normal skin[5,15,16] or from clinical mimics (seborrhoeic keratosis and intradermal naevus).[7,10]
The objective of this study was to assess the accuracy of AI in distinguishing BCC from various histological mimics.[14]
The type of AI used in this study is deep neural networks, a subset of machine learning in which algorithms learn patterns in a manner loosely analogous to decision making in the human brain; machine learning is itself a subset of AI in which computer algorithms have the ability to learn. Potential BCC histological mimics encompass a broad range of non-neoplastic and neoplastic entities, the former including basaloid hyperplasia overlying dermatofibroma and (in some instances) native adnexal epithelium. Benign adnexal tumours, such as desmoplastic trichoepithelioma and trichoblastoma, may be challenging for pathologists to distinguish from BCC, particularly on partial biopsy. Malignant differentials mainly comprise other types of carcinoma, such as Merkel cell carcinoma (MCC), which has a very different prognosis and management from BCC.[14] In diagnostically challenging cases, AI may play a valuable role alongside traditional ancillary tests such as immunohistochemistry.
Emulating daily dermatopathology practice with prospective image collection is beneficial for developing a robust AI system.[17,18] Although usually considered a straightforward task for human pathologists, the diagnosis of BCC may be more difficult in slides with low tumour volume, architectural distortion (for example, in curettage specimens), obscuring inflammation, or uncommon BCC subtypes. An AI algorithm capable of accounting for these variables could be a useful tool for human pathologists. To create a dataset reflecting the broad variability of histological findings encountered in daily practice, we used prospectively collected BCC cases to develop an AI algorithm for BCC diagnosis.

      Materials and methods

      Ethics approval

The Sullivan Nicolaides Pathology ethics committee approved the use of histological images in this study. The study was determined to be of low risk and, given that its outcomes did not affect present or future diagnoses, patient consent was waived. We complied with the Declaration of Helsinki.

      Slide collection: BCC present

Glass slides containing BCC were prospectively collected during routine practice by one of the authors working at a single institution. To account for potential short-term variations in haematoxylin and eosin (H&E) staining quality, discrete periods of slide collection were alternated with periods of non-collection. Slide collection occurred in 2020 and 2021. During periods of slide collection, each slide deemed to contain BCC was collected regardless of tumour volume, BCC subtype, or tissue preservation. Where multiple glass slides were produced from a single tissue block, for example in biopsies or where the tissue was incompletely represented on the initial H&E section, only the first H&E glass slide containing BCC was included; in one case, representing a challenging BCC diagnosis in a punch biopsy, three glass slides were included. Slide-level ground truth was established as part of routine practice, where necessary using additional H&E sections, immunohistochemistry, and second opinions from colleagues.

      Slide collection: BCC absent

BCC absent slides were collected by four pathologists from three sources. The first source comprised slides identified during routine practice as containing native adnexal structures or neoplasms that could be considered histological mimics of BCC. The second source comprised slides collected from cases seen during second opinion consultation; consultation cases were either from within the department or from regional Sullivan Nicolaides Pathology (SNP) laboratories. Finally, slides were retrospectively collected from the SNP archives, representing the results of a search for less common entities that may mimic BCC histologically, reported in 2020–2021. Two pathologists (BOB, DR) selected representative slides from these cases.

      Slide scanning and image processing

Slides were scanned with a Philips IntelliSite Ultra Fast Scanner (Olympus Plan Apo 40× objective, NA 0.75, 0.25 μm per pixel), and images were exported in the iSyntax format, a proprietary format owned by Philips. A slide may contain more than one cross-section of tissue from the same tissue block; where possible, Philips software segmented these cross-sections into individual regions to remove white space. Using Python code, the regions from each iSyntax file were extracted into TIFF images.
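The authors' export code is not provided; the following is a minimal sketch of this step. Because the iSyntax SDK is proprietary, `read_isyntax_regions` is a hypothetical helper that yields each tissue region as a NumPy RGB array, with pyvips used to write the TIFFs.

```python
# Sketch only: `read_isyntax_regions` is a hypothetical reader for the
# proprietary iSyntax format; pyvips writes each region as a TIFF.
import numpy as np
import pyvips

def export_regions(isyntax_path: str, out_prefix: str) -> None:
    for i, region in enumerate(read_isyntax_regions(isyntax_path)):  # hypothetical
        h, w, bands = region.shape  # uint8 RGB array
        img = pyvips.Image.new_from_memory(region.tobytes(), w, h, bands, "uchar")
        img.write_to_file(f"{out_prefix}_region{i}.tiff")
```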

      40× patches

Each region was segmented into 1024×1024 pixel patches with zero overlap. The resolution of each patch was equivalent to a 40× microscope objective, and each patch represented a tissue area of 65,536 μm² (0.25 μm per pixel).

      20× patches

The image data were then shrunk (using pyvips) to 50% along each axis, and each region was segmented into 512×512 pixel patches with zero overlap. Each patch resolution was equivalent to a 20× microscope objective, also representing a tissue area of 65,536 μm² (0.5 μm per pixel).
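A minimal pyvips sketch of the 40× and 20× tiling described above, assuming regions were saved as TIFFs (the authors' exact code is not given):

```python
import pyvips

def tile(image: pyvips.Image, size: int):
    """Yield (x, y, patch) for zero-overlap size x size tiles."""
    for y in range(0, image.height - size + 1, size):
        for x in range(0, image.width - size + 1, size):
            yield x, y, image.crop(x, y, size, size)

region = pyvips.Image.new_from_file("specimen_region0.tiff")
patches_40x = list(tile(region, 1024))     # 0.25 um/px; 65,536 um^2 per patch
region_20x = region.resize(0.5)            # shrink 50% along each axis
patches_20x = list(tile(region_20x, 512))  # 0.5 um/px; same tissue area
```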

      10× patches

To examine the effect of larger tissue areas on the ability of AI to distinguish BCC from mimics, we combined four 512×512 pixel 20× patches into a single patch. These patches were created using a sliding window algorithm, such that each 512×512 pixel patch could be present in up to four 1024×1024 pixel patches (once for each quadrant). The combined tile's label is the superset of its sub-tile labels: if any sub-tile was labelled BCC present, the combined tile was also labelled BCC present.
These larger-area patches were then shrunk (using pyvips) by 50% along each axis to 512×512 pixels, equivalent to a 10× microscope objective. Each 10× patch covers a tissue area four times that of a 20× patch (262,144 μm² at 1 μm per pixel).
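A sketch of this construction under the same assumptions as above; the `labels` mapping from 20× patch coordinates to their pathologist labels is a hypothetical input:

```python
import pyvips

def make_10x_patches(region_20x: pyvips.Image, labels: dict):
    """Slide a 1024x1024 window over the 20x image in 512 px steps,
    shrink each window to 512x512 (10x), and take the superset label."""
    step, win = 512, 1024
    for y in range(0, region_20x.height - win + 1, step):
        for x in range(0, region_20x.width - win + 1, step):
            tile = region_20x.crop(x, y, win, win).resize(0.5)
            quadrants = [(x, y), (x + step, y), (x, y + step), (x + step, y + step)]
            label = ("BCC present"
                     if any(labels.get(q) == "BCC present" for q in quadrants)
                     else "BCC absent")
            yield tile, label
```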

      Slide labelling

      The first pass of data labelling used OpenCV grayscale variance to separate the background patches (variance ≤40). Patches with higher variance (mostly tissue) were labelled as BCC absent by default for the pathologists to review. These default labels were then curated to correct for obvious mislabelling, such as low variance fat cells or high variance non-tissue artifacts.
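A minimal sketch of this first-pass filter; the exact OpenCV pipeline is not given, and only the ≤40 variance threshold follows the text:

```python
import cv2
import numpy as np

def is_background(patch_bgr: np.ndarray, threshold: float = 40.0) -> bool:
    """Background if the grayscale intensity variance is <= threshold."""
    gray = cv2.cvtColor(patch_bgr, cv2.COLOR_BGR2GRAY)
    return float(gray.var()) <= threshold
```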
The dataset was labelled at three hierarchical levels: slide, region, and patch.
      The slide-level labels were taken from routine reporting.
For the region labels: because the slide label per whole slide image (WSI) indicated only that a diagnosis was seen somewhere on the slide, it was possible that some regions on the same slide did not contain the diagnosis. Therefore, a region label was formed from its own patch-level labels: if any patch was labelled BCC present, that region was labelled BCC present; otherwise, it was labelled BCC absent.
      For patch-level labelling, the patches were labelled individually by at least one pathologist. Supplementary Fig. 1 (Appendix A) shows the web interface used for labelling the patches. Each patch was either left with the default label of BCC absent or reassigned with one of the following three labels: BCC present, unsure BCC, or unsure non-BCC. BCC present was defined as a patch that included at least one nucleus or part of the nucleus from a cell of BCC. The patch was classified as BCC absent when it did not contain BCC or when the confines of the patch only included cytoplasm from a BCC cell, stromal changes adjacent to (or within) BCC, or necrosis. Unsure BCC was assigned when the confines of the square patch were equivocal, although favoured by the pathologist to include BCC (as per the definition above). Similarly, unsure non-BCC was assigned to equivocal patches favoured not to contain BCC. For each WSI containing a BCC histological mimic, the patches containing the mimic entity and patches containing background tissue were both labelled BCC absent.
      From 1061 WSIs, the number of labelled patches in the 20× data was as follows: BCC present 87,205; unsure BCC 1,051; unsure non-BCC 195; BCC absent 1,688,697. Patches that remained with one of the unsure labels had no definitive ground truth; therefore, they were excluded. No WSI was excluded. BCC tumour volume per WSI, assessed as the number of patches labelled BCC present, ranged from 1 to 1,799 (mean, 141.57; standard deviation, 187.20). This volume relates only to slides containing BCC.

      Methodology

      Prior to training, all patches with a Shannon entropy of less than 3.5 were discarded. These discarded patches mostly comprised empty spaces with only minimal tissue at the specimen edges.
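A sketch of the entropy filter using scikit-image; this is a plausible equivalent, as the authors' implementation is not specified:

```python
import numpy as np
from skimage.color import rgb2gray
from skimage.measure import shannon_entropy

def keep_patch(patch_rgb: np.ndarray, min_entropy: float = 3.5) -> bool:
    """Discard near-empty patches: keep only those with Shannon
    entropy >= 3.5, per the threshold stated in the text."""
    gray = (rgb2gray(patch_rgb) * 255).astype(np.uint8)
    return shannon_entropy(gray) >= min_entropy
```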
Fig. 1 presents an overview of our method. In the first stage, we implemented a patch-level model to train an effective feature extractor for patches. We used a recent convolutional neural network architecture, EfficientNetV2-B3 (12M parameters), proposed by Tan et al.[19] EfficientNetV2 uses training-aware neural architecture search and scaling, and outperforms previous models in both training speed and parameter efficiency. In the second stage, learned features within the same region were adaptively integrated by a multiple instance learning network[20] to predict a region-level label. For the feature integration, we investigated single-scale (10×/20×) models and a multi-scale model (COMPASS). Please refer to the Supplementary Methods (Appendix A) for further details on our model.
Fig. 1 The overview of our method. In the first stage, patch feature extractors were trained for one or two scales (10× and 20×). In the second stage, the patch information was adaptively integrated by a multiple instance learning network to predict a region-level label. The integration can be performed using patches from a single scale or multiple scales. The components with dashed borders relate to the multi-scale integration. The multi-scale feature was obtained by concatenating each high magnification (20×) patch feature with the most important context information, which was learned from 10× patch features by an attention network.
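The following PyTorch sketch illustrates the two-stage pattern (not the authors' code): an EfficientNetV2-B3 backbone, here obtained via the timm library, extracts per-patch features, and an attention-based multiple instance learning head in the style of Ilse et al. pools them into a region-level prediction.

```python
import timm
import torch
import torch.nn as nn

class AttentionMIL(nn.Module):
    """Attention pooling over patch features, after Ilse et al."""
    def __init__(self, feat_dim: int, attn_dim: int = 256):
        super().__init__()
        self.attention = nn.Sequential(
            nn.Linear(feat_dim, attn_dim), nn.Tanh(), nn.Linear(attn_dim, 1))
        self.classifier = nn.Linear(feat_dim, 1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.attention(feats), dim=0)  # (n_patches, 1)
        region_feat = (weights * feats).sum(dim=0)             # weighted mean
        return self.classifier(region_feat)                    # region logit

# Stage 1: patch feature extractor (num_classes=0 returns pooled features).
backbone = timm.create_model("tf_efficientnetv2_b3", pretrained=False, num_classes=0)
mil_head = AttentionMIL(feat_dim=backbone.num_features)

patches = torch.randn(32, 3, 512, 512)  # one region's 20x patches (dummy data)
with torch.no_grad():
    feats = backbone(patches)           # (32, feat_dim)
logit = mil_head(feats)                 # Stage 2: region-level prediction
```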

      Data set details

The data were randomly split into training, validation, and test sets comprising approximately 70%, 10%, and 20% of the data, respectively. The data were split at the patient level; thus, no patient overlap existed among the training, validation, and test sets. Specifically, as the data were imbalanced among different subtypes and one patient could have multiple labels across different slides, we used a multi-label stratification method[21] to handle the data split. This resulted in 722 slides in the training set, 93 in the validation set, and 246 in the test set.
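One way to reproduce such a split is with the iterative-stratification package, which implements the cited multi-label stratification algorithm; the patient label matrix below is dummy data for illustration.

```python
import numpy as np
from iterstrat.ml_stratifiers import MultilabelStratifiedShuffleSplit

rng = np.random.default_rng(0)
patient_ids = np.arange(693)            # one row per patient
Y = rng.integers(0, 2, size=(693, 46))  # dummy multi-label indicators

# Hold out ~20% of patients for the test set, stratified over labels.
msss = MultilabelStratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
trainval_idx, test_idx = next(msss.split(patient_ids, Y))
# Splitting trainval again with test_size=1/8 yields ~70/10/20 overall;
# slides inherit their patient's assignment, so no patient overlap occurs.
```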

      Results

      Specimen types

      A total of 1061 slides were collected, representing 780 specimens from 693 patients. This set comprised 616 slides (465 specimens and 396 patients) containing BCC (labelled BCC) and 445 slides without BCC (315 specimens and 297 patients, labelled non-BCC) encompassing 46 diagnostic labels, as shown in Supplementary Table 1 (Appendix A). Slides containing BCC and an additional non-BCC neoplasm were labelled BCC. Solar keratosis and intradermal naevus were the two most common dual neoplasms with BCC (11 and 7 cases, respectively). None of the misclassified slides in the test set represented dual neoplasms. The 465 BCC specimens included 233 excisions, 78 shaves, 55 curettes, and 99 punches. BCC subtyping was performed at the specimen level. Superficial BCC was present in 245 specimens, nodular/solid-type BCC in 351, and aggressive-type BCC in 171. Aggressive-type BCC included infiltrating/sclerosing, micronodular, and basosquamous variants. Additional demographic information and slide details are presented in Supplementary Table 2 (Appendix A).

      Comparison of different AI models

Of the 246 WSIs used in the test set, 147 slides were labelled as containing BCC by pathologists. The numbers of labelled patches were as follows: BCC present, 26,983; BCC absent, 420,306. To distinguish the two sources of labels, BCC present and BCC absent denote labels determined by pathologists, while BCC predicted and BCC not predicted denote labels assigned by the AI models (confidence threshold set to 0.5).
      Multiple 2-class (BCC predicted versus BCC not predicted) AI analyses were conducted at the patch, region and slide levels. To select the most appropriate magnification for the AI model, we investigated the patch-level performance of the 40×, 20×, and 10× data. Table 1 shows that the 40× data obtained a performance very similar to that of the 20× data. The performance of the 10× data was 2.3 percentage points lower than that of the 20× data in terms of sensitivity. As the 40× data was very time consuming to train and provided no accuracy improvement, the 20× data was chosen as the highest magnification for the implementation of our method.
Table 1. Results of different data types with different datasets

Data type      Dataset   Accuracy  Sensitivity  Specificity  Macro-F1 score
Patch-level    10×       97.4%     88.0%        98.4%        92.6%
               20×       98.2%     90.3%        98.8%        93.3%
               40×       98.2%     90.9%        98.7%        93.2%
               COMPASS   98.3%     90.2%        99.0%        93.7%
Region-level   10×       95.0%     99.0%        89.6%        94.8%
               20×       97.6%     97.9%        97.2%        97.5%
               COMPASS   97.8%     97.2%        98.6%        97.8%
Slide-level    10×       93.9%     100.0%       84.8%        93.5%
               20×       97.2%     99.3%        93.9%        97.0%
               COMPASS   97.6%     98.0%        97.0%        97.5%
The slide-level results were computed as BCC predicted if any region of the slide was classified as BCC predicted. Overall, the 10× data performed the worst, with the exception of 100% sensitivity at the slide level. However, the corresponding specificity was 84.8%, indicating that the model trained with the 10× data was biased towards the BCC predicted classification.
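A sketch of the slide-level decision rule described above, together with the accompanying metrics (the rule is stated in the text; the metric code is standard):

```python
def slide_prediction(region_probs: list[float], threshold: float = 0.5) -> bool:
    """A slide is 'BCC predicted' if any region crosses the threshold."""
    return any(p >= threshold for p in region_probs)

def sensitivity_specificity(y_true: list[bool], y_pred: list[bool]):
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    tn = sum((not t) and (not p) for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    return tp / (tp + fn), tn / (tn + fp)
```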
      The model trained with multi-scale data [COMPASS model (COntext-aware Multi-scale tool for Pathologists Assessing SlideS)] performed the best, with more balanced predictions in terms of sensitivity and specificity.

      Ablation study

Table 2 examines the different components of our method and demonstrates the performance contribution of each. The following results were obtained using COMPASS on the test set.
Table 2. Ablation study on different components of COMPASS for slide-level prediction

Slide model  Focal loss  Centre loss  Sensitivity  Specificity  Accuracy
–            –           –            100.0%       11.1%        64.2%
✓            ✓           –            98.0%        94.9%        96.7%
✓            ✓           ✓            98.0%        97.0%        97.6%

The first row shows the results generated without the slide model: a slide was predicted as BCC predicted if at least one of its patches was classified as BCC predicted by the patch model. The results in the second and third rows were obtained without and with centre loss, respectively, for the training in the second stage.
The first row in Table 2 was obtained by setting the slide as BCC predicted if at least one of the patches was classified as BCC predicted by the patch model in the first stage. We found that generating the slide-level prediction using only the patch model resulted in very low performance, with an accuracy of 64.2%. This was because the patch-level model was trained without considering the surrounding information and was prone to over-predicting BCC in the patches: almost 90% of the slides were predicted to have BCC patches. This demonstrated the necessity of a second model that learned to predict the region-level class by investigating the global region content and the local surrounding information of the patches. In addition, we analysed the loss functions used by our model: focal loss[22] and centre loss.[23] We showed that with centre loss, the performance of the slide model improved, particularly in terms of specificity, which increased from 94.9% to 97.0%.
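For illustration, a standard PyTorch implementation of focal loss for the binary BCC task; the hyperparameters shown are common defaults, not values reported by the authors, and centre loss (which additionally pulls features towards per-class centres) is omitted for brevity:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """Binary focal loss (Lin et al.): down-weights easy examples.
    `targets` is a float tensor of 0s and 1s matching `logits`."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = torch.exp(-bce)                              # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1.0 - p_t) ** gamma * bce).mean()
```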
At 10×, no WSIs were incorrectly classified as BCC not predicted (false negatives), while 15 WSIs were incorrectly classified as BCC predicted (false positives). These false positive WSIs represented the following diagnoses: one mantleoma; two squamoid eccrine ductal carcinomas; three cylindromas; one MCC; one trichoblastoma; one squamous cell carcinoma plus intraepidermal carcinoma; one desmoplastic trichoepithelioma; three cases of basaloid induction overlying dermatofibroma; two sebaceous carcinomas.
At 20×, one BCC WSI was incorrectly classified as BCC not predicted, while six WSIs were incorrectly classified as BCC predicted. These false positive WSIs represented the following diagnoses: two squamoid eccrine ductal carcinomas; three cylindromas; one MCC.
      The multi-scale COMPASS model (combining the 10× and 20× data) provided the smallest number of incorrect classifications: three WSIs were incorrectly classified as BCC predicted, and three WSIs were incorrectly classified as BCC not predicted (discussed in the false positive and false negative results section below).

      COMPASS model results

      The COMPASS AI results for each of the 246 WSIs are presented in three forms, as demonstrated in Fig. 2.
      • 1.
        A slide-level classification of BCC predicted or BCC not predicted, as previously discussed.
      • 2.
A sigmoid function was used to generate prediction probabilities ranging from 0 to 1, with the probabilities of the two classes summing to 1. For example, the confidence score in Case 1 was 0.94 (i.e., 94% confidence in the classification of BCC predicted, 6% confidence in the classification of BCC not predicted). In our test set, the BCC predicted probabilities ranged from 50.6% to 98.2%, with a mean of 88.6% and a standard deviation of 9.5%.
      • 3.
Graphic overlay (GO) image. The GO indicates which individual patches on the H&E image were classified by AI as BCC predicted or BCC not predicted, with colour coding indicating the confidence of each patch (a rendering sketch follows Fig. 2).
Fig. 2 Case 1: neck, excision. Percentage scores in the legend relate to confidence for basal cell carcinoma (BCC) predicted. (A) H&E, slide 1. Pathologist diagnosis: basal cell carcinoma, mixed nodular, infiltrating and micronodular subtypes. (B) Graphic overlay (GO), slide 1, indicating a predominance of high-confidence BCC predicted patches. COMPASS result: BCC predicted with confidence 94%. (C) H&E, slide 2. In this area of tumour, BCC nests are more widely separated, while the edge of the tumour is ill-defined. Pathologist diagnosis: basal cell carcinoma, mixed nodular, infiltrating and micronodular subtypes. (D) GO, slide 2. Distribution of the AI BCC predicted patches, which correlates with the infiltrating growth pattern observed on H&E. COMPASS result: BCC predicted with confidence 79%.
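A hedged sketch of how such an overlay could be rendered (not the authors' code): each patch is coloured by its BCC confidence and alpha-blended over an H&E thumbnail.

```python
import matplotlib.pyplot as plt
import numpy as np

def render_overlay(he_thumb: np.ndarray, conf: np.ndarray, patch_px: int = 16):
    """he_thumb: HxWx3 H&E thumbnail; conf: per-patch confidence grid
    (rows x cols) aligned so each patch spans patch_px thumbnail pixels."""
    heat = np.kron(conf, np.ones((patch_px, patch_px)))  # upsample the grid
    plt.imshow(he_thumb)
    plt.imshow(heat, cmap="coolwarm", vmin=0.0, vmax=1.0, alpha=0.4)
    plt.colorbar(label="BCC predicted confidence")
    plt.axis("off")
    plt.show()
```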
      COMPASS correctly classified 144 of 147 BCC present WSIs as BCC predicted, and 96 of 99 BCC absent WSIs as BCC not predicted. Fig. 3 shows the correct classification of three BCC absent cases and one BCC present case.
      Fig. 3
      Fig. 3Percentage scores in the legend relate to confidence for basal cell carcinoma (BCC) predicted. (A) Case 2: cheek, punch biopsy. Pathologist diagnosis: favour desmoplastic trichoepithelioma. Subsequent excision reported as residual desmoplastic trichoepithelioma. COMPASS result: BCC not predicted with confidence 96%. (B) Case 3: nose, punch biopsy. Pathologist diagnosis: benign appendageal lesion, favour mantleoma. COMPASS result: BCC not predicted with confidence 88%. (C) Case 4: shoulder, punch excision. Pathologist diagnosis: dermatofibroma, with overlying basaloid hyperplasia. COMPASS result: BCC not predicted with confidence 98%. (D,E) Case 5: back, shave. Pathologist diagnosis: dysplastic junctional naevus; single nest at peripheral edge of the shave suspicious for superficial BCC. COMPASS result: BCC predicted with confidence 85%. (D) Single basaloid nest at peripheral edge of shave. (E) AI BCC predicted patches correspond to the single basaloid nest.
COMPASS incorrectly classified three non-BCC WSIs as BCC predicted, representing false positive results. These included two WSIs from a single specimen labelled by the pathologist as cylindroma; the third false positive WSI was labelled by a pathologist as MCC. Fig. 4 shows GO images from both false positive cases (Cases 6 and 8). In all three WSIs, the BCC predicted patches represented a minority of the overall tumour and were predominantly located at its periphery.
Fig. 4 Percentage scores in the legend relate to confidence for BCC predicted. (A) Case 6: scalp, excision. Pathologist diagnosis: cylindroma. COMPASS result: BCC predicted with confidence 51%. (B) Case 7: shoulder, excision. Pathologist diagnosis: BCC. COMPASS result: BCC not predicted with confidence 72%. (C) Case 8: forehead, excision. Pathologist diagnosis: Merkel cell carcinoma. COMPASS result: BCC predicted with confidence 54%. (D–F) Case 9: deltoid. (D) Punch biopsy. Pathologist diagnosis: carcinoma, MCC versus BCC. COMPASS result: BCC predicted with confidence 73%. (E,F) Subsequent excision specimen. Pathologist diagnosis: BCC. COMPASS result: BCC not predicted with confidence 91% and 66% (two slides). (E) Area showing morphology typical for BCC. (F) Area showing morphology that could prompt consideration of MCC as a differential.
COMPASS incorrectly classified three BCC WSIs as BCC not predicted, representing false negative results. The WSIs (Cases 7 and 9) are shown in Fig. 4. Case 7, diagnosed by a pathologist as BCC, could be considered a straightforward diagnosis. Case 9 represented a more difficult diagnosis: the initial punch biopsy was reported by the pathologist with a differential diagnosis of MCC or BCC. This punch biopsy was classified by COMPASS as BCC predicted, with a confidence score of 73%. However, the excision specimen was classified by COMPASS as BCC not predicted, with confidence scores of 91% and 66% for its two WSIs. Immunohistochemistry, performed on the excision case as part of routine work-up, supported a BCC diagnosis, and two other dermatopathologists concurred with the diagnosis of BCC.

      Discussion

      COMPASS performance

We aimed to develop an AI model to assess whether or not a WSI contained BCC. The 616 BCC slides in the dataset were collected prospectively, without subjective filtering of cases. The 445 non-BCC slides were selected prospectively and retrospectively, encompassing a broad range of benign and malignant histological mimics of BCC. In a test set of 246 WSIs, our COMPASS AI model accurately classified WSIs as BCC predicted or BCC not predicted, with a sensitivity of 98.0% and a specificity of 97.0% (240 WSIs correctly classified, with three false positive and three false negative results).

      Potential applications of COMPASS in routine practice

      The COMPASS AI model was accurate across a variety of specimen types (i.e., excision, curettage, shave and punch) and BCC subtypes, suggesting its feasibility for use in daily practice. An alternative view could be that BCC diagnosis is a task performed rapidly and accurately by human pathologists without the need for AI. However, certain cases in our test set results highlighted specific ways in which AI for BCC may be beneficial to pathologists.
      Case 1 (Fig. 2C,D) included a component of aggressive-type BCC, in which the tumour nests exhibited an infiltrating growth pattern separated by areas of uninvolved dermis. The extent of tumour invasion was accurately highlighted by AI, despite individual foci of tumour being relatively subtle. In BCC cases demonstrating this type of growth pattern, AI could assist with tumour mapping and accurate margin assessment.
      Case 5 (Fig. 3D,E) included a single basaloid nest suspicious for superficial BCC. Despite the low volume of lesional tissue, COMPASS correctly classified this slide as BCC predicted. Additionally, the GO image indicated that the single nest suspicious to the pathologist corresponded to the same nest classified as BCC by AI. This reflects the value of providing a graphic representation of AI results, allowing the pathologist to correlate with their visual impression.
Cases 2–4 (Fig. 3) reflect AI accuracy in distinguishing BCC from potential mimics (the diagnoses in these cases were desmoplastic trichoepithelioma, mantleoma, and basaloid hyperplasia overlying dermatofibroma). In such cases, AI may function as a useful 'second opinion' for the pathologist, either supporting a favoured diagnosis or prompting consideration of alternative possibilities. The deliberately limited labelling of data in this study highlights the potential of AI: patches were labelled by pathologists only as BCC present or BCC absent. The AI model was not trained to recognise skin anatomic levels (such as epidermis, dermis, or subcutis), native adnexal structures, or features potentially associated with malignancy (such as mitoses, necrosis, or atypia). Additionally, no demographic information was available to the AI. Nevertheless, the COMPASS model could accurately distinguish BCC from histological mimics.

      False positive and false negative results

COMPASS incorrectly labelled three WSIs (two cylindroma and one MCC) as BCC predicted. However, Fig. 4A highlights the value of an AI result that includes a confidence score and graphic overlay: although this WSI was incorrectly classified as BCC predicted, the confidence score was low (51%), and the positive patches reflected only a minority of the tumour, limited to the periphery. It would appear that the AI had difficulty in classifying this lesion; this information is valuable to the pathologist when integrating all available information to reach a final diagnosis. In the overall results, COMPASS labelled four of six cylindroma WSIs correctly as BCC not predicted, and nine of ten MCC WSIs correctly as BCC not predicted. A limitation of this study is the small number of each non-BCC histological mimic; the numbers are insufficient to determine whether COMPASS has a specific weakness in distinguishing cylindroma and MCC from BCC, which could be the subject of further research.
In the three false positive WSIs, the false positive patches tended to be located at the tumour periphery. The reason for this is unclear. Histological clues at tumour edges, such as the classical findings of peripheral palisading and tumour-stroma retraction in BCC, are considered important by pathologists.[14] In our study, we could not ascertain whether the AI over-interprets features within the tumour or the adjacent stroma. Further iterations of the COMPASS AI model will seek to address this issue, possibly through more detailed pathologist labelling of the tumour-stroma interface.
      Of the three false negative AI results, one was from a BCC excision in which the diagnosis was straightforward for the pathologist (Fig. 4B). The other two slides were from a single BCC excision in which the diagnosis was more challenging, and the initial punch biopsy had been reported with a differential diagnosis of MCC versus BCC. Interestingly, COMPASS classified the punch biopsy WSI correctly as BCC predicted (Fig. 4D). COMPASS incorrectly classified the excision specimen as BCC not predicted, even though it included areas morphologically typical for BCC (Fig. 4E,F).
We can only speculate as to the cause of this error, which reflects a common criticism of AI: that it functions as a 'black box'.[10,24]
A possible avenue for future research would be to train a model on a dataset in which individual BCC mimics are labelled. The AI would then provide a result for each entity that the pathologist may consider in the differential diagnosis; for example, in Case 9, the AI result would include BCC predicted/not predicted and MCC predicted/not predicted, with each of these individual predictions carrying a confidence score. This additional information may assist in reconciling discrepancies between AI results and the pathologist's preferred diagnosis.

      Patch labelling and AI model selection

AI performance depends on the quality of the data, which in tissue pathology is achieved through accurate slide labelling. However, this is a laborious task. In our study, we chose to label square patches corresponding to a tissue area of 65,536 μm²,[2] which is approximately 3.6 times smaller than the typical 40× microscopic optical field of view. Although the labelling tool allowed for the selection of multiple patches simultaneously, some cases were time consuming to label accurately; for example, broadly infiltrating BCCs with ill-defined margins and curettage specimens comprising multiple, separated tissue fragments. A subject for further research would be to determine the optimal patch size to achieve a balance between labelling time and AI accuracy; this optimal patch size may differ depending on the features or diagnosis of interest. Alternatively, other studies have utilised a different approach to labelling, in which pathologists identify the target feature/diagnosis with a freehand outline.[10,15,16]
Finally, partially trained AI algorithms can assist in preliminary labelling, thus reducing the time investment of pathologists.
Among the various AI models assessed in this study, the best performance was achieved using a multi-scale approach: COMPASS combined data from 10× and 20× magnifications. The rationale for developing a multi-scale tool was to reflect the visual process of a pathologist, who collates information from different microscopic magnifications. Given the relatively high accuracy of all the models assessed in this study, the superiority of the multi-scale COMPASS model was only slight. However, this multi-scale approach appears promising and may particularly suit AI models for the assessment of more challenging skin tumours, such as melanocytic lesions.

      Conclusion

In our study, AI demonstrated a high level of accuracy in identifying BCC and distinguishing BCC from histological mimics. The highest accuracy was achieved using the multi-scale COMPASS model, which integrated data from 10× and 20× magnifications. Including a confidence score and graphic representation of the AI result allowed the pathologist to integrate the AI prediction with their visual impression. With further improvements, this COMPASS AI model may be a useful tool to enhance pathologists' efficiency and accuracy.

      Data availability statement

The dataset generated and/or analysed during the current study is not publicly available due to its size (15 TB).

      Acknowledgements

      The authors acknowledge the support of Stephen W. Hampton and Benjamin A. Cook from the Histopathology Department, Sullivan Nicolaides Pathology (SNP), for their input and analysis of patient demographics and specimen types, and Dr Lauren Furnas (SNP) for her valuable assistance in labelling BCC cases. The authors also acknowledge SNP for support of this project and the use of histological images from cases reported at SNP.

      Conflicts of interest and sources of funding

      The authors state that there are no conflicts of interest to disclose.

      Appendix A. Supplementary data

      The following is the Supplementary data to this article.

      References

1. Acs B, Rantalainen M, Hartman J. Artificial intelligence as the next step towards precision pathology. J Intern Med 2020; 288: 62-81.
2. Harmon SA, Patel PG, Sanford TH, et al. High throughput assessment of biomarkers in tissue microarrays using artificial intelligence: PTEN loss as a proof-of-principle in multi-center prostate cancer cohorts. Mod Pathol 2021; 34: 478-489.
3. Salto-Tellez M, Maxwell P, Hamilton P. Artificial intelligence - the third revolution in pathology. Histopathology 2019; 74: 372-376.
4. Wells A, Patel S, Lee JB, et al. Artificial intelligence in dermatopathology: diagnosis, education, and research. J Cutan Pathol 2021; 48: 1061-1068.
5. Kimeswenger S, Tschandl P, Noack P, et al. Artificial neural networks and pathologists recognize basal cell carcinomas based on different histological patterns. Mod Pathol 2021; 34: 895-903.
6. Cruz-Roa AA, Arevalo Ovalle JE, Madabhushi A, et al. A deep learning architecture for image representation, visual interpretability and automated basal-cell carcinoma cancer detection. Med Image Comput Comput Assist Interv 2013; 16: 403-410.
7. Olsen T, Jackson B, Feeser T, et al. Diagnostic performance of deep learning algorithms applied to three common diagnoses in dermatopathology. J Pathol Inform 2018; 9: 32.
8. Hart S, Flotte W, Norgan A, et al. Classification of melanocytic lesions in selected and whole-slide images via convolutional neural networks. J Pathol Inform 2019; 10: 5.
9. Hekler A, Utikal JS, Enk AH, et al. Pathologist-level classification of histopathological melanoma images with deep neural networks. Eur J Cancer 2019; 115: 79-83.
10. Thomas SM, Lefevre JG, Baxter G, et al. Interpretable deep learning systems for multi-class segmentation and classification of non-melanoma skin cancer. Med Image Anal 2021; 68: 101915.
11. Goyal M, Knackstedt T, Yan S, et al. Artificial intelligence-based image classification methods for diagnosis of skin cancer: challenges and opportunities. Comput Biol Med 2020; 127: 104065.
12. Tizhoosh HR, Pantanowitz L. Artificial intelligence and digital pathology: challenges and opportunities. J Pathol Inform 2018; 9: 38.
13. Rudin C. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nat Mach Intell 2019; 1: 206-215.
14. Stanoszek LM, Wang GY, Harms PW. Histologic mimics of basal cell carcinoma. Arch Pathol Lab Med 2017; 141: 1490-1502.
15. Le’Clerc Arrastia J, Heilenkötter N, Otero Baguer D, et al. Deeply Supervised UNet for semantic segmentation to assist dermatopathological assessment of basal cell carcinoma. J Imaging 2021; 7: 71.
16. Jiang YQ, Xiong JH, Li HY, et al. Recognizing basal cell carcinoma on smartphone-captured digital histopathology images with a deep neural network. Br J Dermatol 2020; 182: 754-762.
17. Campanella G, Hanna MG, Geneslaw L, et al. Clinical-grade computational pathology using weakly supervised deep learning on whole slide images. Nat Med 2019; 25: 1301-1309.
18. Ianni JD, Soans RE, Sankarapandian S, et al. Tailored for real-world: a whole slide image classification system validated on uncurated multi-site data emulating the prospective pathology workload. Sci Rep 2020; 10: 1-2.
19. Tan M, Le QV. EfficientNetV2: smaller models and faster training. ICML 2021; 139: 10096-10106.
20. Ilse M, Tomczak J, Welling M. Attention-based deep multiple instance learning. ICML 2018; 80: 2127-2136.
21. Sechidis K, Tsoumakas G, Vlahavas I. On the stratification of multi-label data. ECML PKDD 2011; 6913: 145-158.
22. Lin TY, Goyal P, Girshick R, et al. Focal loss for dense object detection. IEEE Trans Pattern Anal Mach Intell 2020; 42: 318-327.
23. Wen Y, Zhang K, Li Z, et al. A discriminative feature learning approach for deep face recognition. ECCV 2016; 9911: 499-515.
24. Wada M, Ge Z, Gilmore SJ, et al. Use of artificial intelligence in skin cancer diagnosis and management. Med J Aust 2020; 213: 256-259.