Dermatology: Practical and Conceptual | Original Article | Dermatol Pract Concept. 2022;12(3):e2022126

Comparison of Convolutional Neural Network Architectures for Robustness Against Common Artefacts in Dermatoscopic Images

Florian Katsch1, Christoph Rinner1, Philipp Tschandl2

1 Center for Medical Statistics, Informatics and Intelligent Systems, Medical University of Vienna, Vienna, Austria
2 Department of Dermatology, Medical University of Vienna, Vienna, Austria

Key words: image classification, object detection, instance segmentation, artefacts, dermatoscopy

Citation: Katsch F, Rinner C, Tschandl P. Comparison of convolutional neural network architectures for robustness against common artefacts in dermatoscopic images. Dermatol Pract Concept. 2022;12(3):e2022126. DOI: https://doi.org/10.5826/dpc.1203a126

Accepted: December 7, 2021; Published: July 2022

Copyright: ©2022 Katsch et al. This is an open-access article distributed under the terms of the Creative Commons Attribution-NonCommercial License (BY-NC-4.0), https://creativecommons.org/licenses/by-nc/4.0/, which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original authors and source are credited.

Funding: None.

Competing interests: PT reports consulting fees from Silverchair, honoraria from FotoFinder, Novartis and Lilly, and grants from MetaOptima Technology and Lilly, outside the submitted work. All other authors declare no conflict of interest.

Authorship: All authors have contributed significantly to this publication.

Corresponding author: Philipp Tschandl, PhD, Department of Dermatology, Medical University of Vienna, Währinger Gürtel 18-20, 1090 Vienna, Austria. E-mail: philipp.tschandl@meduniwien.ac.at

ABSTRACT

Introduction: Classification of dermatoscopic images via neural networks shows comparable performance to clinicians in experimental conditions but can be affected by artefacts like skin markings or rulers.
It is unknown whether specialized neural networks are more robust to artefacts.

Objectives: Analyze the robustness of 3 neural network architectures, namely ResNet-34, Faster R-CNN and Mask R-CNN.

Methods: We identified common artefacts in the HAM10000, PH2 and 7-point criteria evaluation datasets, and established a template-based method to superimpose artefacts on dermatoscopic images. The HAM10000 dataset with and without superimposed artefacts was used to train the networks, followed by analyzing their robustness against artefacts in test images. Performance was assessed via area under the precision-recall curve and classification results.

Results: ResNet-34 and Faster R-CNN models trained on regular images perform worse than Mask R-CNN on images with superimposed artefacts. Artefacts added to all tested images led to a decrease in area under the precision-recall curve of 0.030 for ResNet-34 and 0.045 for Faster R-CNN, in comparison to only 0.011 for Mask R-CNN. However, changes in model performance only became significant with 40% or more of the images having superimposed artefacts. A loss in performance occurred when the training was biased by selectively superimposing artefacts on images belonging to a certain class.

Conclusions: As Mask R-CNN showed the least decrease in performance when confronted with artefacts, instance segmentation architectures may be helpful to counter the effects of artefacts, warranting further research on related architectures. Our artefact insertion mechanism could be useful for future research.

Introduction

Epidemiological studies show an increasing trend in the incidence rates of melanoma and non-melanoma skin cancer worldwide over the last 30 years [1].
According to the American Joint Committee on Cancer melanoma staging system, stage I malignant skin alterations, with a five-year survival rate of more than 90%, contrast with a survival rate of less than 15% for stage IV patients. This indicates a clear need for early, reliable and consistent diagnosis and treatment [2]. The desire for automatic lesion analysis is further intensified by the strong dependence of diagnostic quality on the examiner's experience in dermoscopy, as well as a high degree of interobserver and intraobserver variability of diagnoses [3,4]. Methods of automatic skin lesion analysis have been the focus of research for decades, and have gained interest in recent years [5,6]. These methods are intended to support teledermatologic settings, improve management decisions or aid in difficult clinical scenarios, but often suffer, among other things, from the presence of artefacts in dermatoscopic images [7-12].

A common neural network used for classification is ResNet; two other well-known neural network architectures in computer vision are Faster R-CNN and Mask R-CNN (Figure 1) [13,14]. The former performs "object detection", a process where one or multiple objects in an image can be detected and located with a rectangular "bounding box". The latter performs "instance segmentation", where one or more objects in an image can be found and their respective area (ie, pixels) in the image outlined ("segmented"); it can be regarded as a CNN-based multi-instance generalisation of computer-vision-based techniques of lesion segmentation [15,16]. Object detection has been used in the field of automated skin cancer detection on clinical images [17], but successful training of instance segmentation neural networks in dermatoscopy has not yet been reported, most probably because of missing ground-truth data.
Objectives

Our hypothesis is that, in contrast to ResNet, the other network architectures intrinsically have to "concentrate" on regions of the classified object in an image and hence may offer robustness against artefacts surrounding the lesions. Robustness in this case describes the consistency of the obtained diagnoses under the influence of artefacts in the input image data. These networks could potentially be used as off-the-shelf methods with little customization effort needed, and could enable us to focus less on tedious image pre-processing such as removal of bubbles or hairs [18].

Methods

Image datasets

The primary source of dermatoscopic images was the HAM10000 dataset [19]. This dataset also includes publicly available lesion segmentation masks for every image, as described previously, which are necessary for training the Faster R-CNN and Mask R-CNN architectures [8]. It contains 10,015 images, each with 600x450 pixels and three 8-bit color channels. Each image is assigned one of seven diagnostic classes: actinic keratosis / intraepithelial carcinoma (akiec), basal cell carcinoma (bcc), benign keratotic lesion (bkl), dermatofibroma (df), nevus (nv), melanoma (mel), or vascular lesion (vasc). Also, the PH2 and the 7-point criteria evaluation dataset were reviewed, and several images were utilized to extract artefacts from [20,21].

Figure 1. A visual representation of the outputs of the three approaches. Image classification (ie, ResNet-34) classifies the image as a whole, object detection (ie, Faster R-CNN) finds objects and their approximate position in the image, and instance segmentation (ie, Mask R-CNN) finds objects and their exact spatial delimitation.
Images from those datasets were not used for other purposes within this study. We used the ISIC2018 test-set as our test-set to keep variation as low as possible, as it originates from the same source as the HAM10000 dataset and includes the same classes.

Artefact generation

As with every real-world picture, dermatoscopic images can contain content considered as "artefacts". Examples are hairs, dark corners, vignettes, medical devices, different sorts of rulers, ink markings in different shapes, styles and colors, air bubbles, or reflections. This work focuses on three of them: "bubbles" that originate from air trapped in the liquid between skin and the dermatoscope, "rulers" used to show the spatial dimension of a lesion, and ink "markings" on the patient's skin used to highlight the lesion for excision or review.

In order to generate artefact-modified cases, we selected 60 images from the HAM10000, PH2 and 7-point criteria datasets which contain either a bubble, a ruler or a marking artefact. From those images we extracted the artefacts by manually repairing the image areas with Adobe® Photoshop's® (version CC 2018 (19.1.9), Adobe Inc.) content-aware image repair mechanism and using the difference, per RGB channel, to the untouched image as a template (Figure 2). The insertion of those templates was done in a way that the position of artefacts varies according to observed patterns, using the provided segmentation mask of the target image. In Figure 3, a dermatoscopic image with automatically superimposed artefacts is shown. The source code will be made available upon publication of this work at https://github.com/thisismexp/artefact_insertion.

Using the artefact insertion mechanism, several dataset mutations of the original HAM10000 dataset were created, where artefacts were superimposed on either none or all of the images, or on every image belonging to a certain diagnosis.
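The extraction and insertion steps described above can be sketched in a few lines of NumPy. This is our own minimal re-implementation for illustration, not the authors' released code: the function names are ours, and the repositioning of templates according to the lesion's segmentation mask is omitted.

```python
# Sketch of the template-based artefact mechanism: the template is the
# per-RGB-channel signed difference between the untouched image and its
# content-aware-repaired version, and insertion simply adds it back onto
# a clean target image.
import numpy as np

def extract_template(original: np.ndarray, repaired: np.ndarray) -> np.ndarray:
    """Signed per-channel difference; non-zero only where the artefact was."""
    return original.astype(np.int16) - repaired.astype(np.int16)

def superimpose(image: np.ndarray, template: np.ndarray) -> np.ndarray:
    """Add an artefact template onto a clean image, clipped to 8-bit range."""
    out = image.astype(np.int16) + template
    return np.clip(out, 0, 255).astype(np.uint8)
```

Because the template stores signed differences, both bright artefacts (eg, reflections in bubbles) and dark ones (eg, ink markings) survive the round trip; the paper additionally varies where on the target image the template lands, guided by the segmentation mask.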
The test portion of the HAM10000 dataset, corresponding to the ISIC2018 challenge Task 3 test-set with 1,511 images, was altered in the same way. Additionally, artefacts were inserted into a certain percentage of images, in 20% step increments.

Figure 2. Workflow for extracting artefact templates. Manually selected original images (Input) were repaired manually (1), and corresponding image areas extracted (2). The channel-wise RGB difference (3) was stored as a template (Output) for the corresponding artefact type.

Neural Network Training

As representatives for image classification, object detection and instance segmentation, we trained a ResNet-34, a Faster R-CNN (with ResNet-34 backbone) and a Mask R-CNN (also with a ResNet-34 backbone) model as provided by the Torchvision package of the open-source machine learning framework PyTorch [22]. All models were trained on all of the 9 generated datasets in a 5-fold cross-validation fashion. Transfer learning and data augmentation, including random crops, resizing, rotations, mirroring operations as well as color jitter operations, were used.

Statistics

To evaluate diagnostic accuracy, all trained network models were tested against the 13 test datasets, and performance was reported in terms of area under the precision-recall curve (PR-AUC), precision, recall, false positive (FPR) and false negative rates (FNR), and differences thereof (calculated using scikit-learn version 0.24.1) [23]. To visualize spatial activations, Gradient-weighted Class Activation Mapping (Grad-CAM) visualizations were used. A two-sided p-value of 0.05 was regarded as statistically significant, and all calculations were performed using statsmodels version 0.12.2 [24].
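For a multiclass problem like the seven HAM10000 diagnoses, PR-AUC is typically computed one-vs-rest per class and then averaged. The sketch below shows one such computation with scikit-learn; the macro-averaging scheme and function name are our assumptions, as the paper only states that PR-AUC was calculated with scikit-learn.

```python
# Sketch: macro-averaged one-vs-rest PR-AUC with scikit-learn.
import numpy as np
from sklearn.metrics import auc, precision_recall_curve

def multiclass_pr_auc(y_true: np.ndarray, scores: np.ndarray) -> float:
    """y_true: (n,) integer class labels; scores: (n, n_classes) probabilities.
    Returns the unweighted mean of the per-class precision-recall AUCs."""
    aucs = []
    for c in range(scores.shape[1]):
        precision, recall, _ = precision_recall_curve(y_true == c, scores[:, c])
        aucs.append(auc(recall, precision))
    return float(np.mean(aucs))
```

Unlike ROC-AUC, this metric stays informative under the strong class imbalance of HAM10000 (nevi dominate), which is presumably why it was chosen as the primary performance measure.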
Results

Baseline performance in terms of PR-AUC of our models, trained and tested with no additional artefacts, was 0.8 for ResNet-34 and 0.72 for both Faster R-CNN and Mask R-CNN. Introduction of artefacts in only the test dataset led to a reduction in performance for all three architectures (Figure 4), increasing with the proportion of artefacts present in the test dataset, and more severe for the ResNet-34 and Faster R-CNN models. With a maximum relative reduction of 0.05 PR-AUC, the Faster R-CNN model was affected the most, ResNet-34 (-0.03) the second most, and Mask R-CNN was the most robust (-0.01). For ResNet-34 and Faster R-CNN, changes in predictive performance compared to baseline were significant at and above 40% of introduced artefacts in the test set (P < 0.01; tested using the McNemar test with Edwards correction on binarized predictions). For Mask R-CNN we did not detect a significant difference in predictions in any of the used test sets (all P values > 0.17).

Introducing artefacts in the training data led to biased results in all three examined architectures. Artefacts introduced into all images of the melanocytic nevi class during training decreased recall values on average by 0.218 for ResNet-34, 0.129 for Faster R-CNN and 0.155 for Mask R-CNN in comparison to the respective unbiased models. The reduction in recall values indicates that those models are indeed biased by artefacts for specific classes. The larger the proportion of biased samples in the dataset, the more apparent this effect was.

Figure 3. Example of automatically superimposed artefacts on a dermatoscopic image. (A) In the top left the original image without artefact is shown. The other 3 images show the lesion with the superimposed artefacts bubbles (B), ink markings (C) and a ruler (D).
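The significance test named above compares the same model's binarized predictions (correct vs. incorrect) on paired clean and artefact-modified test images. A sketch with statsmodels follows; the function name and the construction of the 2x2 contingency table from correctness vectors are our assumptions, while `mcnemar(..., exact=False, correction=True)` is the statsmodels call that applies the chi-square test with the Edwards continuity correction.

```python
# Sketch: McNemar's test with continuity (Edwards) correction on paired,
# binarized predictions from the clean vs. artefact-modified test set.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def mcnemar_on_binarized(correct_clean: np.ndarray,
                         correct_artefact: np.ndarray) -> float:
    """Boolean arrays: True where the model classified an image correctly
    on the clean / artefact-modified test set. Returns the p-value."""
    table = np.array([
        [np.sum(correct_clean & correct_artefact),
         np.sum(correct_clean & ~correct_artefact)],
        [np.sum(~correct_clean & correct_artefact),
         np.sum(~correct_clean & ~correct_artefact)],
    ])
    result = mcnemar(table, exact=False, correction=True)
    return float(result.pvalue)
```

Only the discordant cells (correct on one set, wrong on the other) drive the statistic, which is exactly what makes the test suitable for detecting artefact-induced prediction flips on paired images.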
Considering the FPR and FNR for specific classes, a selective bias towards classes that were corrupted by artefacts during training could be observed for all three architectures. The increase in FPR for the class with inserted artefacts during training, and a simultaneous increase in FNR for all others, in fact showed a shift in classifications towards the biased class. This effect could not be observed if artefacts were inserted into none or all of the images.

When inspecting heat map representations of the Grad-CAM, we observed that training with artefacts shifted the attention of the object detection and instance segmentation networks away from the artefact itself towards areas of the lesion (Figure 5).

Figure 5. Grad-CAM for the used network architectures. The first column shows the input image for the corresponding row, in its original form (top) and with bubble artefacts inserted (bottom). Grad-CAM heatmaps show the ResNet-34 increases attention towards the bubble area after training with artefacts (N), whereas the Faster R-CNN network loses its initial attention towards the artefact (G) afterwards (O). The Mask R-CNN architecture seems to ignore the artefact throughout (H and P). Black boxes denote positions of inserted bubble artefacts.

Figure 4. Neural networks show different robustness to inserted artefacts on the test set. The area under the precision-recall curve (PR-AUC) achieved by training without additional artefacts in the train and test set was used as the baseline (0%). With increasing proportion of inserted artefacts, PR-AUC decreases for ResNet-34 (blue) and Faster R-CNN (green), but almost not for Mask R-CNN (purple).
Shaded areas denote 95% confidence intervals.

These mappings indicate an increase in robustness against these very artefacts for the Faster and Mask R-CNN models when trained with inserted artefacts in the dataset.

Conclusions

We compared representatives of three neural network architectures for classifying lesions in dermatoscopic images in regard to their robustness against artefacts. Although, as a limitation, the baseline performance of the examined models was not the same, we found differences in their vulnerability to performance changes under the influence of artefacts, with Mask R-CNN tending to be the most robust. The influence of artefacts in test images on classification results can be reduced for all three architectures by augmenting training data with artificially superimposed artefacts. This is in line with findings by Maron et al, who reduced - but did not eliminate - the brittleness of their system through data augmentation [25]. We anticipate that the automated superimposition of artefacts presented here, as a further evolution of data augmentation, together with the integration of more diverse variants, will further enhance the robustness of automated classifiers and decision support systems [26,27]. The initial data, in our view, warrants more in-depth follow-up research on this topic, to understand which approaches are the most effective and efficient.

However, this work failed to find evidence for clinically relevant robustness of instance segmentation against artefacts, for several reasons. On the one hand, we used a shallow backbone network architecture for our experiments, even though current research and commercial products commonly use deeper models, and an increase in robustness against image distortions with increased backbone capacity has been demonstrated by others [28]. We also used a new template-based approach to superimpose artefacts on images.
This approach leaves room for improvement with regard to the number of images the artefacts are extracted from, and with regard to a detailed analysis of how different artefact types affect classification performance. Alternatively, lesions with existing artefacts could be used after manual or automated annotation.

References

1. Apalla Z, Lallas A, Sotiriou E, Lazaridou E, Ioannides D. Epidemiological trends in skin cancer. Dermatol Pract Concept. 2017;7(2):1-6. DOI: 10.5826/dpc.0702a01. PMID: 28515985. PMCID: PMC5424654.
2. Balch CM, Soong S-J, Atkins MB, et al. An evidence-based staging system for cutaneous melanoma. CA Cancer J Clin. 2004;54(3):131-149; quiz 182-184. DOI: 10.3322/canjclin.54.3.131. PMID: 15195788.
3. Kittler H, Pehamberger H, Wolff K, Binder M. Diagnostic accuracy of dermoscopy. Lancet Oncol. 2002;3(3):159-165. DOI: 10.1016/s1470-2045(02)00679-4. PMID: 11902502.
4. Korotkov K, Garcia R. Computerized analysis of pigmented skin lesions: a review. Artif Intell Med. 2012;56(2):69-90. DOI: 10.1016/j.artmed.2012.08.002. PMID: 23063256.
5. Rubegni P, Burroni M, Cevenini G, et al. Digital dermoscopy analysis and artificial neural network for the differentiation of clinically atypical pigmented skin lesions: a retrospective study. J Invest Dermatol. 2002;119(2):471-474. DOI: 10.1046/j.1523-1747.2002.01835.x. PMID: 12190872.
6. Esteva A, Kuprel B, Novoa RA, et al. Dermatologist-level classification of skin cancer with deep neural networks. Nature. 2017;542(7639):115-118. DOI: 10.1038/nature21056. PMID: 28117445. PMCID: PMC8382232.
7. Muñoz-López C, Ramírez-Cornejo C, Marchetti MA, et al. Performance of a deep neural network in teledermatology: a single-centre prospective diagnostic study. J Eur Acad Dermatol Venereol. 2021;35(2):546-553. DOI: 10.1111/jdv.16979. PMID: 33037709. PMCID: PMC8274350.
8. Tschandl P, Rinner C, Apalla Z, et al. Human-computer collaboration for skin cancer recognition. Nat Med. 2020;26(8):1229-1234.
DOI: 10.1038/s41591-020-0942-0. PMID: 32572267.
9. Fink C, Blum A, Buhl T, et al. Diagnostic performance of a deep learning convolutional neural network in the differentiation of combined naevi and melanomas. J Eur Acad Dermatol Venereol. 2020;34(6):1355-1361. DOI: 10.1111/jdv.16165. PMID: 31856342.
10. Okuboyejo DA, Olugbara OO. A review of prevalent methods for automatic skin lesion diagnosis. Open Dermatol J. 2018;12(1):14-53. DOI: 10.2174/187437220181201014.
11. Winkler JK, Fink C, Toberer F, et al. Association Between Surgical Skin Markings in Dermoscopic Images and Diagnostic Performance of a Deep Learning Convolutional Neural Network for Melanoma Recognition. JAMA Dermatol. 2019;155(10):1135-1141. DOI: 10.1001/jamadermatol.2019.1735. PMID: 31411641. PMCID: PMC6694463.
12. Navarrete-Dechent C, Dusza SW, Liopyris K, Marghoob AA, Halpern AC, Marchetti MA. Automated Dermatological Diagnosis: Hype or Reality? J Invest Dermatol. 2018;138(10):2277-2279. DOI: 10.1016/j.jid.2018.04.040. PMID: 29864435. PMCID: PMC7701995.
13. Ren S, He K, Girshick R, Sun J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans Pattern Anal Mach Intell. 2017;39(6):1137-1149. DOI: 10.1109/TPAMI.2016.2577031. PMID: 27295650.
14. He K, Gkioxari G, Dollar P, Girshick R. Mask R-CNN. 2017 IEEE International Conference on Computer Vision (ICCV). 2017:2980-2988. DOI: 10.1109/ICCV.2017.322.
15. Kaur R, LeAnder R, Mishra NK, et al. Thresholding methods for lesion segmentation of basal cell carcinoma in dermoscopy images. Skin Res Technol. 2017;23(3):416-428. DOI: 10.1111/srt.12352. PMID: 27892649.
16. Mishra NK, Kaur R, Kasmi R, et al. Automatic lesion border selection in dermoscopy images using morphology and color features. Skin Res Technol. 2019;25(4):544-552. DOI: 10.1111/srt.12685. PMID: 30868667. PMCID: PMC7173402.
17. Han SS, Moon IJ, Lim W, et al.
Keratinocytic Skin Cancer Detection on the Face Using Region-Based Convolutional Neural Network. JAMA Dermatol. 2020;156(1):29-37. DOI: 10.1001/jamadermatol.2019.3807. PMID: 31799995. PMCID: PMC6902187.
18. Lee T, Ng V, Gallagher R, Coldman A, McLean D. Dullrazor: A software approach to hair removal from images. Comput Biol Med. 1997;27(6):533-543. DOI: 10.1016/s0010-4825(97)00020-6. PMID: 9437554.
19. Tschandl P, Rosendahl C, Kittler H. The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Sci Data. 2018;5:180161. DOI: 10.1038/sdata.2018.161. PMID: 30106392. PMCID: PMC6091241.
20. Mendonça T, Ferreira PM, Marques JS, Marcal ARS, Rozeira J. PH2 - A dermoscopic image database for research and benchmarking. In: 2013 35th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC); 2013:5437-5440. DOI: 10.1109/EMBC.2013.6610779. PMID: 24110966.
21. Kawahara J, Daneshvar S, Argenziano G, Hamarneh G. Seven-Point Checklist and Skin Lesion Classification Using Multitask Multimodal Neural Nets. IEEE J Biomed Health Inform. 2019;23(2):538-546. DOI: 10.1109/JBHI.2018.2824327. PMID: 29993994.
22. Paszke A, Gross S, Massa F, et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In: Wallach H, Larochelle H, Beygelzimer A, d'Alché-Buc F, Fox E, Garnett R, eds. Advances in Neural Information Processing Systems 32. Curran Associates, Inc.; 2019:8026-8037. Available from: https://papers.nips.cc/paper/2019/hash/bdbca288fee7f92f2bfa9f7012727740-Abstract.html
23. Pedregosa F, Varoquaux G, Gramfort A, et al. Scikit-learn: Machine learning in Python. J Mach Learn Res. 2011;12:2825-2830. Available from: https://www.jmlr.org/papers/volume12/pedregosa11a/pedregosa11a.pdf
24. Seabold S, Perktold J.
Statsmodels: Econometric and statistical modeling with python. In: Proceedings of the 9th Python in Science Conference. Vol 57. Austin, TX; 2010:61. Available from: https://conference.scipy.org/proceedings/scipy2010/pdfs/seabold.pdf
25. Maron RC, Haggenmüller S, von Kalle C, et al. Robustness of convolutional neural networks in recognition of pigmented skin lesions. Eur J Cancer. 2021;145:81-91. DOI: 10.1016/j.ejca.2020.11.020. PMID: 33423009.
26. Aggarwal SLP. Data augmentation in dermatology image recognition using machine learning. Skin Res Technol. 2019;25(6):815-820. DOI: 10.1111/srt.12726. PMID: 31140653.
27. Winkler JK, Sies K, Fink C, et al. Association between different scale bars in dermoscopic images and diagnostic performance of a market-approved deep learning convolutional neural network for melanoma recognition. Eur J Cancer. 2021;145:146-154. DOI: 10.1016/j.ejca.2020.12.010. PMID: 33465706.
28. Michaelis C, Mitzkus B, Geirhos R, et al. Benchmarking Robustness in Object Detection: Autonomous Driving when Winter is Coming. arXiv [cs.CV]. Published online July 17, 2019. Available from: http://arxiv.org/abs/1907.07484