key: cord-027732-8i8bwlh8 authors: boudaya, amal; bouaziz, bassem; chaabene, siwar; chaari, lotfi; ammar, achraf; hökelmann, anita title: eeg-based hypo-vigilance detection using convolutional neural network date: 2020-05-31 journal: the impact of digital technologies on public health in developed and developing countries doi: 10.1007/978-3-030-51517-1_6 sha: doc_id: 27732 cord_uid: 8i8bwlh8 hypo-vigilance detection is becoming an important and active research area in the biomedical signal processing field. for this purpose, electroencephalogram (eeg) is one of the most common modalities in drowsiness and awakeness detection. in this context, we propose a new eeg classification method for detecting fatigue state. our method makes use of a convolutional neural network (cnn) architecture. we define an experimental protocol using the emotiv epoc+ headset. after that, we evaluate our proposed method on a recorded and annotated dataset. the reported results demonstrate high detection accuracy (93%) and indicate that the proposed method is an efficient alternative for hypo-vigilance detection as compared with other methods. hypo-vigilance has been one of the major causes of accidents in many areas such as driving [1] , aviation [2] and the military sector [3] . hence, the drowsiness problem has gained great interest from researchers. it is a particularly timely problem during the current covid-19 pandemic [4] , in which medical staff are generally overworked. in fact, the drowsy condition is expressed predominantly by the emergence of various behavioral signs such as slower reactions, reduced reflexes, yawning, heaviness of the eyelids and/or difficulty keeping the head in the frontal position relative to the field of vision. 
many studies [5] [6] [7] [8] have been proposed to detect hypo-vigilance based on biomedical signals such as electroencephalogram (eeg), electrocardiogram (ecg), electromyogram (emg), and electrooculogram (eog). given its high temporal resolution, portability and reasonable cost, the present work focuses on hypo-vigilance detection by analyzing the eeg signals of various brain functionalities using fourteen electrodes placed on the participant's scalp. on the other hand, deep learning networks offer great potential for biomedical signal analysis through the simplification of raw input signals (i.e., through various steps including feature extraction, denoising and feature selection) and the improvement of the classification results. various deep learning architectures [9] exist, such as convolutional neural network (cnn), recurrent cnn (r-cnn), auto-encoder (ae) and deep belief network (dbn), as well as long short-term memory (lstm) and gated recurrent units (gru). as noted in [10] , the cnn architecture is the most widely used for biomedical signal analysis and provides high classification accuracy. previous related work [11] proposes a hypo-vigilance detection method that applies a cnn to facial features. this method showed a classification accuracy of 92.33%. likewise, [12] introduces an adaptive conditional representation learning system for driver drowsiness detection based on a 3d-cnn. the proposed system consists of four steps (spatio-temporal representation, data preprocessing, features combination and somnolence detection). the experimental results show a detection accuracy equal to 92.04%. in this paper, we propose a cnn hypo-vigilance detection method using eeg data in order to classify drowsiness and awakeness states. 
accordingly, the proposed approach, including the equipment used, is presented in sect. 2. section 3 describes the experimental results and the evaluation of the employed method. finally, a conclusion and future work are drawn in sect. 4. as shown in fig. 1 , the proposed approach consists of two primary procedures: data acquisition and data analysis. the following subsections provide a detailed explanation of each procedure. the eeg data acquisition procedure is made up of two main steps, which are data collection and data preprocessing. to collect the raw eeg data from participants, we use an emotiv epoc+ headset as shown in fig. 2 [a]. this headset is a non-invasive brain computer interface (bci) tool designed for human brain and contextual research [13] . the emotiv epoc+ helmet contains fourteen active electrodes with two reference electrodes (drl and cms), as shown in fig. 2 [b]. the electrodes are placed around the participant's head over the following zones: frontal and anterior parietal (af3, af4, f3, f4, f7, f8, fc5, fc6), temporal (t7, t8) and occipital-parietal (o1, o2, p7, p8). the preprocessing consists of three steps: data preparation, data annotation and data augmentation. during data acquisition, our raw eeg signals may be influenced by various sources of artifacts and noise such as endogenous electrical properties, specific fabrics physical structure, dipolar size variation, muscle shifts and blinks. hence, data preparation is a preliminary step that denoises the raw signals. we suggest using an infinite impulse response (iir) filter, which handles impulsive signals in both the time and frequency domains. other sophisticated denoising approaches could be considered at the expense of higher computational complexity [14, 15] . 
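the iir denoising step above can be illustrated with a minimal sketch. the paper does not give the filter order, cut-off or coefficients, so the first-order low-pass form and the `alpha` value below are assumptions chosen only for illustration, and the noisy trace is synthetic.

```python
def iir_lowpass(samples, alpha=0.2):
    """First-order IIR low-pass: y[n] = alpha * x[n] + (1 - alpha) * y[n-1].

    The output depends on the previous output, which is what makes the
    filter "infinite impulse response". `alpha` is a hypothetical smoothing
    coefficient, not a value taken from the paper.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    filtered = []
    # Start from the first sample so the output does not jump at n = 0.
    previous = samples[0] if samples else 0.0
    for x in samples:
        previous = alpha * x + (1.0 - alpha) * previous
        filtered.append(previous)
    return filtered


# Synthetic trace: a constant 10 uV level with alternating +/-5 uV noise.
noisy = [10.0 + (5.0 if i % 2 == 0 else -5.0) for i in range(200)]
smoothed = iir_lowpass(noisy, alpha=0.1)
```

a band-pass design matched to the bands of interest would be the more realistic choice for eeg; the single-pole filter is used here only because it shows the recursive structure in a few lines.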
to evaluate each individual's state of exhaustion, we concentrate on the brain areas that are responsible for hypo-vigilance detection. in this regard, different brain waves are targeted, such as [16] :
• delta waves refer to consciousness, sleep or deep sleep states. these waves are found in the temporal and occipital regions with low frequency (less than 4 hz) and high amplitude.
• theta waves denote the relaxation and hypnosis states, with a frequency range between 4 and 8 hz. theta waves are extracted from the temporal zone and are produced during the first phase of slow sleep or in a deep relaxation state.
• alpha waves refer to waking but relaxed states. these waves are captured in the posterior part, precisely the occipital region, with a frequency interval between 8 and 12 hz and a low amplitude interval between 20 and 60 µv.
• beta waves relate to alertness states. these waves are captured from the temporal and occipital lobes of the brain. they are characterized by a high frequency interval of 12 to 30 hz with a low amplitude interval of 10 to 30 µv.
• gamma waves refer to hypervigilance states, with a frequency interval between 30 and 80 hz.
in the data annotation step, we only use the o1 and o2 electrodes of the occipital zone, which are responsible for the drowsiness sensation. as an annotation example, fig. 3 indicates the amplitudes of the alpha and theta signals from the o1 and o2 electrodes reported for a participant in three periods of the day. the relaxation state is indicated by alpha waves, which have a frequency interval between 8 and 12 hz and an amplitude interval between 20 and 60 µv. the somnolence state is indicated by theta waves, which have a frequency interval between 4 and 8 hz and an amplitude interval between 50 and 75 µv. in order to reduce overfitting and increase testing accuracy, we use the data augmentation technique [17] , which consists of increasing the training set by label-retaining data transformations. 
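a label-retaining transformation can be sketched as follows. the paper does not state which transformations were applied, so additive gaussian noise is used here purely as an illustrative assumption; the function name, `sigma`, `copies` and the toy windows are all hypothetical.

```python
import random


def augment_with_noise(signals, labels, copies=1, sigma=0.5, seed=0):
    """Return the original windows plus noisy copies that keep their labels.

    Additive Gaussian noise is an assumed transformation for illustration;
    it is not claimed to be the one used in the paper.
    """
    rng = random.Random(seed)
    out_signals = [list(s) for s in signals]
    out_labels = list(labels)
    for signal, label in zip(signals, labels):
        for _ in range(copies):
            out_signals.append([x + rng.gauss(0.0, sigma) for x in signal])
            out_labels.append(label)  # the class label is retained unchanged
    return out_signals, out_labels


# Two toy windows labelled 0 (awake) and 1 (drowsy).
windows = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
states = [0, 1]
aug_windows, aug_states = augment_with_noise(windows, states, copies=2)
```

with `copies=2` the training set triples in size while every new window inherits the label of the window it was derived from.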
the purpose of this procedure is to extend the data, enlarging the vectors from (5850, 2) to (59053, 2), where 5850 (resp. 59053) represents the vector size and 2 represents the number of classes. the diagram of the simple cnn used in our eeg drowsiness detection approach is represented in fig. 4 . the proposed simple cnn model is composed of the following six main layers:
-the convolutional layers apply the filters and extract the features of the input signals.
-the sample-based discretization max-pooling-1d blocks sub-sample each input layer, reducing its dimensionality and the number of parameters to learn, thereby reducing calculation costs.
our protocol involves eight volunteers (four women and four men) aged between twenty-six and fifty-eight, with normal mental health. for each participant, we make three recordings of sixteen minutes spread over three periods of the day (morning, afternoon and evening). to fully understand the condition of the participants, we split the signal into windows to accurately identify these different states. to implement the proposed simple cnn architecture for eeg signal classification, we use the keras deep learning library. the filters, kernel size, padding, kernel initializer and activation parameters of the four convolutional layers share the same values: 512, 32, same, normal and relu, respectively. the parameter values of the remaining layers are detailed in the following:
-the dropout layer value equal to 0.2 (resp. 0.5) is used to inactivate 20% (resp. 50%) of neurons in order to prevent overfitting.
-the max-pooling 1d layer is used with a filter size of 128.
-the multi-dimensional output is flattened using a 1d flatten layer.
-for better classification results, two dropout layers are used. the first hidden layer has 128 neurons. since this is a binary classification problem, the second layer has 1 neuron. 
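the two core operations named above, a 1d convolution with same padding followed by relu and max-pooling, can be sketched in plain python. this is a toy illustration rather than the keras model: the three-tap kernel and the pool size of 2 are assumptions chosen so the arithmetic is easy to follow, whereas the paper's layers use 512 learned filters of size 32 and a pool size of 128.

```python
def conv1d_same(signal, kernel):
    """1D cross-correlation with zero padding so output length == input length."""
    k = len(kernel)
    pad_left = (k - 1) // 2
    pad_right = k - 1 - pad_left
    padded = [0.0] * pad_left + list(signal) + [0.0] * pad_right
    return [
        sum(padded[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal))
    ]


def relu(values):
    """Rectified linear unit: negative activations are set to zero."""
    return [max(0.0, v) for v in values]


def max_pool1d(values, pool_size):
    """Non-overlapping max-pooling; a trailing partial window is dropped."""
    return [
        max(values[i:i + pool_size])
        for i in range(0, len(values) - pool_size + 1, pool_size)
    ]


signal = [0.0, 1.0, 3.0, 2.0, -1.0, -2.0, 0.0, 4.0]
kernel = [-1.0, 0.0, 1.0]  # hypothetical fixed filter, not a learned weight
features = max_pool1d(relu(conv1d_same(signal, kernel)), pool_size=2)
```

pooling halves the length of the feature map here, which is the dimensionality reduction that the max-pooling blocks provide in the full model.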
the choice of the optimization algorithm makes the difference between good results in minutes, hours or even days. there are various optimizers, such as adam [18] , sgd [19] and rmsprop [20] . in our model, we use the sgd optimizer, which is more popular [21] . this optimizer is simple and effective for finding optimal values in a neural network. table 1 presents the hyperparameter choices of our model. to select the configuration giving the best accuracy for the proposed method, we compare the results obtained with different numbers of electrodes. in [22, 23] , the authors found that the prefrontal and occipital cortex are the most important channels for diagnosing the hypo-vigilance state. in this regard, we choose the following recorded data:
-data recorded by 2 electrodes (o1 and o2) from the occipital area.
-data recorded by 4 electrodes (t7, t8, o1 and o2) from the temporal and occipital areas.
-data recorded by 7 electrodes (af3, f7, f3, t7, o2, p8, f8) from the prefrontal and occipital areas.
-data recorded by 14 electrodes.
we split our data into 70% for training and 30% for testing. table 2 presents the reported testing and training accuracy with two, four, seven and fourteen electrodes, respectively. after convergence, the optimum number of epochs for all electrode configurations is 80. the best results are given by the recording of 2 electrodes from the occipital area. the curves of the testing and training results for data recorded by the o1 and o2 electrodes are represented in fig. 5 . according to the results shown in fig. 5 , the test accuracy increases after a certain number of epochs and the test loss decreases. to test our system's efficiency, we measured the precision, recall and f1-score. table 3 shows these different measures in our experimental configuration. 
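for reference, the three measures can be computed from binary predictions as in the short sketch below. the label vectors are made-up toy values, not the paper's data, and treating class 1 as the drowsy (positive) class is an assumption.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return (precision, recall, f1) for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


# Toy labels: 1 = drowsy (assumed positive class), 0 = awake.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
precision, recall, f1 = binary_metrics(y_true, y_pred)
```

in this toy case there are 3 true positives, 1 false positive and 1 false negative, so all three measures equal 0.75.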
for comparison purposes, we compare the proposed method with a recent drowsiness detection methodology [24] in which the authors propose driver hypovigilance detection using the emotiv epoc+ helmet. the common spatial pattern (csp) algorithm is used to optimize the accuracy of an extreme learning machine (elm). the reported values in table 4 indicate that our method gives the higher classification accuracy. the present work proposes a cnn based approach for hypo-vigilance detection. in order to create an eeg dataset, we recorded raw eeg data using the epoc+ headset. the suggested system achieves an average classification accuracy of 93.94% when tested on a real dataset of eight participants. in future work, we will focus on improving classification accuracy with larger datasets. additionally, fusion with other biomedical signals should also be considered to improve the classification accuracy. open access this chapter is licensed under the terms of the creative commons attribution 4.0 international license (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the creative commons license and indicate if changes were made. the images or other third party material in this chapter are included in the chapter's creative commons license, unless indicated otherwise in a credit line to the material. if material is not included in the chapter's creative commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. 
noise robustness analysis of performance for eeg-based driver fatigue detection using different entropy feature sets
fatigue detection in commercial flight operations: results using physiological measures
simulated sustained flight operations and performance, part 1: effects of fatigue
covid-19 pandemic by the "real-time" monitoring: the tunisian case and lessons for global epidemics in the context of 3pm strategies
electromyogram signal based hypovigilance detection
real-time ecg-based detection of fatigue driving using sample entropy
exploring neuro-physiological correlates of drivers' mental fatigue caused by sleep deprivation using simultaneous eeg, ecg, and fnirs data
muscle fatigue detections during arm movement using emg signal
a state-of-the-art survey on deep learning theory and architectures
1-d convolutional neural networks for signal processing applications
drowsy driver detection using representation learning
driver drowsiness detection using condition-adaptive representation learning framework
analysis of performance metrics using emotiv epoc+
hybrid sparse regularization for magnetic resonance spectroscopy
sparse signal recovery using a bernouilli generalized gaussian prior
analysis of the meditation brainwave from consumer eeg device
a novel deep learning approach with data augmentation to classify motor imagery signals
deep learning for eeg data analytics: a survey
two classes classification using different optimizers in convolutional neural network
automatic microemboli characterization using convolutional neural networks and radio frequency signals
optimization of deep learning using various optimizers, loss functions and dropout
real time fatigue-driver detection from electroencephalography using emotiv epoc+
drowsiness analysis using common spatial pattern and extreme learning machine based on electroencephalogram signal
key: cord-175846-aguwenwo authors: chatsiou, kakia title: text classification of manifestos and covid-19 press briefings using 
bert and convolutional neural networks date: 2020-10-20 journal: nan doi: nan sha: doc_id: 175846 cord_uid: aguwenwo we build a sentence-level political discourse classifier using existing human expert annotated corpora of political manifestos from the manifestos project (volkens et al., 2020a) and applying them to a corpus of covid-19 press briefings (chatsiou, 2020). we use manually annotated political manifestos as training data to train a local topic convolutional neural network (cnn) classifier; then apply it to the covid-19 press briefings corpus to automatically classify sentences in the test corpus. we report on a series of experiments with cnn trained on top of pre-trained embeddings for sentence-level classification tasks. we show that cnn combined with transformers like bert outperforms cnn combined with other embeddings (word2vec, glove, elmo) and that it is possible to use a pre-trained classifier to conduct automatic classification on different political texts without additional training. a substantial share of citizen involvement in politics arises through written discourse, especially in the digital space. through advanced, novel communication strategies, the public can play their part in constructing a political agenda, which has led politicians to increasingly use social media and other types of digital broadcasting to communicate (compared to mainstream press and traditional print media). this is especially pertinent with crisis communication discourse, and the recent covid-19 pandemic has created a great opportunity to study how similar topics get communicated in different countries and the narrative choices made by government and public health officials at different levels of governance (international, national, regional). 
to aid fellow scholars with the systematic study of such a large and dynamic set of unstructured data, we set out to employ a text categorization classifier trained on similar domains (like existing manually annotated sentences from political manifestos) and use it to classify press briefings about the pandemic in a more effective and scalable way. the main attraction behind using manually coded political manifestos (volkens et al., 2020a) as training data is that the political science expert community have been manually collecting and annotating in a systematic way political parties' manifestos for years (since the 1960s) around the world in order to apply content analysis methods and to advance political science. they have subsequently been used as training data in semi-supervised domain-specific classification tasks with good results (zirn et in this paper, we build variations of a cnn sentence-level political discourse classifier using existing annotated corpora of political manifestos from the manifestos project (volkens et al., 2020a) . we test different cnn and word embedding architectures on the already annotated (english language) sentences of the manifestos project corpus. we then apply them to a corpus of covid-19 press briefings (chatsiou, 2020) , a subset of which was manually annotated by political scholars for the purposes of this work. the article is organised as follows: we first offer a brief overview of previous related work on the use of human expert annotated political manifestos for discourse classification. we then describe our framework including the training data used, data pre-processing performed and used architecture. we report on a series of experiments with cnn trained on top of pre-trained word vectors for sentence-level classification tasks. we conclude with evaluation of the bert+cnn architecture against other combinations (word2vec+cnn, glove+cnn, elmo+cnn) for both corpora. 
experimental results show that a cnn classifier combined with transformers like bert outperforms cnn combined with other embeddings (word2vec, glove, elmo). the use of nlp methods to analyse political texts is a well-established field within political science and computational social science more generally (lazer et al., 2009; grimmer and stewart, 2013; benoit, laver, and mikhaylov, 2009) . researchers have used nlp methods to accomplish various classification tasks, such as political positioning on a left to right continuum (slapin and proksch, 2008; glavas, nanni, and ponzetto, 2017) , identification of political ideology differences from text. glavas, nanni, and ponzetto (2017) propose an approach for cross-lingual topical coding of sentences from electoral manifestos, using as training data manually coded manifestos with a total of 77500 sentences in four languages (english, french, german and italian), together with cnns with word embeddings and an induced joint multilingual embedding space. they report achieving better results than monolingual classifiers in english, french and italian but worse results with their multilingual classifier than a monolingual classifier in german. more recently, bilbao-jayo and almeida (2018a) build a sentence classifier using multi-scale convolutional neural networks trained on sentences in seven different languages extracted from annotated parties' election manifestos. they use the full range of the domains defined by the manifestos project and they show that enhancing the multi-scale convolutional neural networks with context data improves their classification. for a detailed discussion of different deep learning-based models for text classification and their technical contributions, similarities, and strengths, see chatsiou and mikhaylov (2020) and minaee et al. (2020). 
using annotated political manifestos as the training dataset for classifying other types of political texts is gaining traction in the literature, especially with the boost in performance of deep learning methods for text. nanni et al. (2016) used expert annotated political manifestos in english and speeches to train a local supervised topic classifier (svm with a bag of words approach) that combines lexical with semantic textual similarity features at a sentence level. a sub-part of the training set was annotated manually by human experts, and the rest was labelled automatically with the global optimisation step performed via a markov logic network presented in zirn et al. (2016) . the advantage of such a domain transfer approach is that no manual topic annotation on the rest of the corpus is needed. they then classify the speeches from the 2008, 2012 and 2016 us presidential campaigns into the 7 domains defined by the manifestos project, without the need for additional topic annotation. bilbao-jayo and almeida (2018b) used annotated political manifestos in spanish and the regional manifestos project taxonomy (alonso, gomez, and cabeza, 2013) to train a neural network sentence-level classifier (cnn) with word2vec word embeddings, also taking into account the context of the phrase (such as what was previously said and the political affiliation of the transmitter). they used this to analyse social media (twitter) data of the main spanish political parties during the 2015 and 2016 spanish general elections without the need for additional manual coding of the twitter data. this paper builds on this area of research, presenting a comparison of a cnn classifier trained on the manifestos project annotations for english, comparing context-free (word2vec, glove) with context-sensitive (elmo, bert) word embeddings. we then apply this to a corpus of daily press-briefings on the covid-19 status by government and public health authorities. 
the main attraction behind using manually coded political manifestos (volkens et al., 2020a) as training data is that the political science community has been systematically collecting and manually annotating political parties' manifestos for decades, in a combined effort to create a resource for systematic content analysis and to advance political science. the corpus is based on the work of the manifesto research group (mrg) and the comparative manifestos (cmp) projects (budge et al., 2001) . classification annotations are described in the manifesto coding handbook, which has evolved over the years and provides information and instructions to the human annotators on how political parties' manifestos should be coded (latest version in volkens et al. (2020b) ). the handbook also includes a specific set of policy areas or 'domains' (7) and subareas or 'subdomains' (56) which are available to annotators to use (see figure 1) . for our training corpus, we use a subset of the corpus containing 115 english manifestos with 86,500 annotated sentences. table 1 shows the distribution of domain codes in the dataset. [table 1, surviving entries only: 11.20%; domain 7 (social groups) 9.99%] the coronavirus (covid-19) press briefings corpus is a collection of daily briefings on the covid-19 status and policies from the uk and the world health organisation. the corpus is still in development, but we have selected example sentences from the uk and who, which were the ones available. during the peak of the pandemic, most countries around the world informed their citizens of the status of the pandemic (usually involving an update on the number of infection cases and number of deaths) and of other policy-oriented decisions about dealing with the health crisis, such as advice about what to do to reduce the spread of the epidemic. 
at the moment the dataset includes briefings covering announcements between march 2020 and august 2020 from the uk (england, scotland, wales, northern ireland) and the world health organisation (who) as follows: • , 2014) ). word2vec uses a shallow neural network model to learn word associations from a large corpus of text. once trained, such a model can detect synonymous words or suggest additional words for a partial sentence. glove is an unsupervised learning model for obtaining vector representations for words. this is achieved by mapping words into a meaningful space where the distance between words is related to semantic similarity. training is performed on aggregated global word-word co-occurrence statistics from a corpus, and the resulting representations showcase interesting linear substructures of the word vector space. we also obtained more context-sensitive word embeddings, namely elmo (peters et al., 2018) and bert (devlin et al., 2019) . elmo is a deep contextualized word representation that models both (1) complex characteristics of word use (e.g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i.e., to model polysemy). these word vectors are learned functions of the internal states of a deep bidirectional language model (bilm), which is pre-trained on a large text corpus. they can be easily added to existing models and significantly improve the state of the art across a broad range of challenging nlp problems, including question answering, textual entailment and sentiment analysis. bert is a deeply bidirectional, unsupervised language representation, pre-trained using only a plain text corpus. it includes a variant that uses the english wikipedia with 2.5 billion words. 
unlike previous context-free models, which generate a single word embedding representation for each word in the vocabulary, bert takes into account the context for each occurrence of a given word, providing a contextualised embedding that is different for each sentence. since kim (2014)'s paper outlining the idea of using cnns (traditionally used for recognising visual patterns in images) for text classification, cnns have achieved very good performance in several text classification tasks (poria, cambria, and gelbukh, 2015; bilbao-jayo and almeida, 2018b). cnns involve convolutional operations of moving frames or windows (filter sizes) which analyse and reduce different overlapping regions in a matrix, to extract different features. the ability to also bootstrap word embeddings in this type of neural network makes it an excellent candidate for extracting knowledge and classifying non-annotated texts. we therefore set up 4 variations of the cnn classifier, m1, m2, m3 and m4, as follows: 1. word vectors of the training dataset sentences are created using one of the following word embeddings: word2vec (m1), glove (m2), elmo (m3) and bert (m4). sentences are fed as sequences of words, then mapped to indexes, then to a sequence of word vectors. we have chosen 300 as the word vector size d and 60 x d for the space where the convolution operations can be performed. vectors are fed to the neural network (cnn). we then perform convolution operations with 100 filters and three different filter sizes (2 x d, 3 x d, and 4 x d). we reduce the dimensionality of the feature maps generated by each group of filters using 1-max-pooling, and the pooled features are subsequently concatenated (boureau, ponce, and lecun, 2010). a dropout rate of 0.5 is applied (srivastava et al., 2014) as regularisation to prevent overfitting. the layer with softmax computes the probability distribution over the labels. 
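the forward pass described above (filters spanning 2, 3 and 4 consecutive word vectors, 1-max-pooling, concatenation and a softmax over class scores) can be sketched in plain python. this is a toy illustration under stated assumptions: the sentence has 4 words with d = 2 instead of 60 x 300, there is one hand-written filter per size instead of 100 learned ones, the dense weights are hypothetical, and dropout is omitted because it is inactive at prediction time.

```python
import math


def conv_over_words(sentence, filt):
    """Slide a filter covering len(filt) consecutive word vectors; apply ReLU."""
    height = len(filt)
    feature_map = []
    for start in range(len(sentence) - height + 1):
        total = 0.0
        for row in range(height):
            total += sum(w * x for w, x in zip(filt[row], sentence[start + row]))
        feature_map.append(max(0.0, total))
    return feature_map


def sentence_features(sentence, filters):
    """1-max-pool each feature map and concatenate the pooled values."""
    return [max(conv_over_words(sentence, f)) for f in filters]


def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy sentence: 4 word vectors of size d = 2 (hypothetical values).
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
filters = [
    [[1.0, 0.0], [0.0, 1.0]],              # filter size 2 x d
    [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]],  # filter size 3 x d
    [[0.5, 0.5]] * 4,                      # filter size 4 x d
]
pooled = sentence_features(sentence, filters)

# Hypothetical dense weights mapping the 3 pooled features to 2 classes.
class_weights = [[1.0, 0.0, 0.0], [0.0, 0.5, 0.0]]
scores = [sum(w * f for w, f in zip(row, pooled)) for row in class_weights]
probs = softmax(scores)
```

because each filter contributes a single pooled value regardless of sentence length, the concatenated feature vector has a fixed size, which is what lets sentences of different lengths share one classifier.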
we perform optimization using the adam optimiser with the parameters of the original manuscript (kingma and ba, 2017). note that this is a sentence-level topic classifier, basing its predictions only on the information local to the sentence. in order to evaluate the different architectures, we divided our training dataset into 2 subsets: training and validation sets (85%) and test set (15%). as is typical, we used a validation set (or development test set) separate from the test set, so that our model(s) do not overfit to the test data and the per-domain evaluation is robust. we performed 4 experiments, one for each combination of cnn and word embeddings: • m1: cnn with word2vec • m2: cnn with glove • m3: cnn with elmo • m4: cnn with bert. as shown in table 2 , the performance of the classifier improves when more context-sensitive word embeddings are used. using bert with cnn (m4) seems to provide a substantial increase in accuracy and f1, and using elmo also performs very well. we also tested the performance of the same pre-trained models on the covid-19 corpus. we asked two political science scholars to annotate a subset of 20 press briefings (4 of each set), using the 7 domains of the manifestos project. this resulted in a dataset of 1740 manually annotated sentences, with domain distribution as in table 3 . note that the pre-trained models have been trained using the annotated manifestos from the manifestos project, without any additional training on the press briefings corpus sentences. as shown in table 4 , the performance of the classifier also improves on the covid-19 press briefings corpus when more context-sensitive word embeddings are used. 
using bert with cnn (m4) seems to provide a substantial increase in accuracy and f1, and using elmo also performs very well. as expected, there is some loss of accuracy, as we are porting the classifier to a slightly different domain of political text (from manifestos to press briefings). in this paper, we built a sentence-level political discourse classifier using existing human expert annotated corpora of english political manifestos from the manifestos project (volkens et al., 2020a) . we tested the accuracy and performance of a neural network classifier (cnn) using different word embeddings as part of the word to vector mapping, and we showed that sentence-level cnn classifiers combined with transformers like bert outperform models with other embeddings (word2vec, glove, elmo). we then applied the same pre-trained models to a different set of texts, the covid-19 press briefings corpus. we observe similar patterns in the accuracy and f1 scores, and additionally show that it is possible to use a pre-trained classifier to conduct automatic classification on different political texts without additional training. in the future, we aim to conduct similar experiments also considering the 'subdomain' categories of the manifesto corpus annotations. we also look forward to re-running these experiments for other languages in the manifestos project, testing the language-agnostic advantage of word embeddings to see whether we obtain different results. this paper follows the aaai publications ethics and malpractice statement and the aaai code of professional conduct. we use publicly available text data to ensure transparency and reproducibility of the research. additionally, all code will be available as open source code (on github.com) at the end of the submission and reviewing process. 
the paper suggests ways to automatically extract topic information from political discourse texts, employing deep learning methods, which are usually associated with artificial intelligence and the ethical considerations around it. we do not envisage any ethical, social and legal considerations arising from the work outlined in this study, such as the impact of ai on humans, on economic growth, or on inequality, amplifying bias or undermining political stability, or other issues described in recent reports on ethics in ai (see for example bird et al., 2020).
table 1: domain codes' distribution in the english subset of the manifestos corpus used for training the cnn classifier.
table 2: domain results of all models using political manifestos.
table 3: manifest project domain codes' distribution in the manually annotated subset of the covid-19 corpus.
table 4: domain results of all models using covid
probabilistic latent semantic indexing
mapping policy preferences: estimates for parties, electors, and governments
latent dirichlet allocation
a deeply bidirectional, unsupervised language representation, pre-trained using only a plain text corpus
automated classification of congressional legislation
a scaling model for estimating time-series party positions from texts
treating words as data with error: uncertainty in text statements of policy positions
life in the network: the coming age of computational social science
use of force and civil-military relations in russia: an automated content analysis
a theoretical analysis of feature pooling in visual recognition
affective news: the automated coding of sentiment in political texts
measuring centre-periphery preferences: the regional manifestos project
text as data: the promise and pitfalls of automatic content analysis methods for political texts
efficient estimation of word representations in vector space
measuring ideological proportions in political speeches
convolutional neural networks for sentence classification
glove: global vectors for word representation
dropout: a simple way to prevent neural networks from overfitting
deep convolutional neural network textual features and multiple kernel learning for utterance-level multimodal sentiment analysis
crowd-sourced text analysis: reproducible and agile production of political data
entities as topic labels: combining entity linking and labeled lda to improve topic interpretability and evaluability
agreement and disagreement: comparison of points of view in the political domain
topfish: topic-based analysis of political position in us electoral campaigns
classifying topics and detecting topic shifts in political manifestos
understanding state preferences with text as data: introducing the un general debate corpus
cross-lingual classification of topics in political texts
adam: a method for stochastic optimization
building entity-centric event collections
automatic political discourse analysis with multi-scale convolutional neural networks and contextual data
political discourse classification in social networks using context sensitive convolutional neural networks
deep contextualized word representations
bert: pre-training of deep bidirectional transformers for language understanding
topic models meet discourse analysis: a quantitative tool for a qualitative approach
structural topic modeling for social scientists: a brief case study with social movement studies literature
the ethics of artificial intelligence: issues and initiatives. study, european parliament's panel for the future of science and technology, pe 634.452. lu: publications office
covid-19 press briefings corpus. type: dataset
deep learning for political science
deep learning based text classification: a comprehensive review
manifesto project dataset. version number: 2020a. type: dataset. 2020
the manifesto data collection. manifesto project (mrg/cmp/marpor). version
the author would like to acknowledge the support of the business and local government data research centre (es/s007156/1) funded by the economic and social research council (esrc) whilst undertaking this work.
key: cord-266055-ki4gkoc8 authors: kikkisetti, s.; zhu, j.; shen, b.; li, h.; duong, t. title: deep-learning convolutional neural networks with transfer learning accurately classify covid19 lung infection on portable chest radiographs date: 2020-09-02 journal: nan doi: 10.1101/2020.09.02.20186759 sha: doc_id: 266055 cord_uid: ki4gkoc8
portable chest x-ray (pcxr) has become an indispensable tool in the management of coronavirus disease 2019 (covid-19) lung infection. this study employed deep-learning convolutional neural networks to classify covid-19 lung infections on pcxr from normal and related lung infections to potentially enable more timely and accurate diagnosis. this retrospective study employed a deep-learning convolutional neural network (cnn) with transfer learning to classify pcxrs of covid-19 pneumonia (n=455) from those of normal (n=532), bacterial pneumonia (n=492), and non-covid viral pneumonia (n=552) patients. the data were split into 75% training and 25% testing. five-fold cross-validation was used. performance was evaluated using receiver-operating curve analysis. comparison was made between cnn operated on the whole pcxr and on segmented lungs. cnn accurately classified covid-19 pcxr from those of normal, bacterial pneumonia, and non-covid-19 viral pneumonia patients in a multiclass model. the overall sensitivity, specificity, accuracy, and auc were 0.79, 0.93, 0.79, and 0.85, respectively (whole pcxr), and 0.91, 0.93, 0.88, and 0.89 (cxr of segmented lung).
the performance was generally better using segmented lungs. heatmaps showed that cnn accurately localized areas of hazy appearance, ground glass opacity and/or consolidation on the pcxr. deep-learning convolutional neural network with transfer learning accurately classifies covid-19 on portable chest x-ray against normal, bacterial pneumonia or non-covid viral pneumonia. this approach has the potential to help radiologists and frontline physicians by providing more timely and accurate diagnosis.
coronavirus disease 2019 (covid-19) is a highly infectious disease that causes severe respiratory illness (1, 2). it was first reported in wuhan, china in december 2019 (3) and was declared a pandemic on mar 11, 2020 (4). the first confirmed case of coronavirus disease 2019 in the united states was reported from washington state on january 31, 2020 (5). soon after, washington, california and new york reported outbreaks. covid-19 has already infected 10 million and killed more than 0.5 million people, and the united states has become the worst-affected country, with more than 2.4 million diagnosed cases and at least 122,796 deaths (https://coronavirus.jhu.edu, accessed jun 28, 2020). there have been recent spikes of covid-19 infection cases across many states and around the world, and there will likely be second waves and recurrence.
a definitive test of covid-19 infection is the reverse transcription polymerase chain reaction (rt-pcr) of a nasopharyngeal or oropharyngeal swab specimen (6, 7). although rt-pcr has high specificity, it has low sensitivity, a high false negative rate, and a long turn-around time (6, 7) (currently ~4 days, although it is improving and other tests are becoming available (8)).
by contrast, portable chest x-ray (pcxr) is convenient to perform, has a fast turnaround, and is well suited for imaging contagious patients and for longitudinal monitoring of critically ill patients in the intensive care units because the equipment can be readily disinfected, preventing cross-infection. pcxr of covid-19 infection has certain unique characteristics, such as predominance of bilateral, peripheral, and lower lobe involvement, with ground-glass opacities with or without airspace consolidations as the disease progresses. these characteristics generally differ from those of other lung pathologies, such as bacterial pneumonia or other viral (non-covid-19) lung infection. based on cxr and laboratory findings, clinicians might start patients on empirical treatment before the rt-pcr results become available, or even if the rt-pcr comes back negative, given the high false negative rate of rt-pcr. early treatment in covid-19 patients is associated with better clinical outcomes.
similarly, computed tomography (ct), which offers relatively more detailed features (such as subtle ground-glass opacity (9, 10)), has also been used in the context of covid-19. however, the ct suite and equipment are more challenging to disinfect, and thus ct is much less suitable for examining patients suspected of or confirmed with contagious diseases in general and covid-19 in particular. longitudinal ct monitoring of critically ill patients in the intensive care units is also challenging. in short, pcxr has become an indispensable imaging tool in the management of covid-19 infection, is often one of the first examinations a patient suspected of covid-19 infection receives in the emergency room, and is ideally suited for longitudinal monitoring of critically ill patients in the intensive care units.
the usage of pcxr under the covid-19 pandemic circumstances is unusual in many aspects.
for instance, pcxr is preferred as it can be used at the bedside without moving the patients, but the imaging quality is not as good as that of conventional cxr (11). in addition, covid-19 patients may not be able to take full inspirations during the examination, obscuring possible pathology, especially in the lower lung fields. many sicker patients may be positioned on their side, which compromises imaging quality. thus, pcxr data under the covid-19 pandemic circumstances are suboptimal and may be more challenging to interpret. moreover, pcxr is increasingly read by non-chest radiologists in some hospitals due to increasing demands, resulting in reduced accuracy and efficiency. pcxr images contain important clinical features that could be easily missed by the naked eye. computer-aided methods can improve the efficiency and accuracy of pcxr interpretations, which in turn provides more timely and relevant information to frontline physicians.
medrxiv preprint doi: https://doi.org/10.1101/2020.09.02.20186759; this version posted september 2, 2020. the copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted medrxiv a license to display the preprint in perpetuity. all rights reserved. no reuse allowed without permission.
deep-learning artificial intelligence (ai) has become increasingly popular for analyzing diagnostic images (12, 13). ai has the potential to facilitate disease diagnosis, staging of disease severity and longitudinal monitoring of disease progression. one common machine-learning algorithm is the convolutional neural network (cnn) (14, 15), which takes an input image, learns important features in the image such as size or intensity, and saves these parameters as weights and biases to differentiate types of images (16, 17). the cnn architecture is ideally suited for analyzing images. moreover, the majority of machine learning algorithms to date are trained to solve specific tasks, working in isolation.
models have to be rebuilt from scratch if the feature-space distribution changes. transfer learning overcomes the isolated learning paradigm by utilizing knowledge acquired for one task to solve related ones. transfer learning in ai is particularly important for small sample size data because the pre-trained weights enable more efficient training and improved performance (18, 19).
many artificial intelligence (ai) algorithms based on deep-learning convolutional neural networks have been deployed for pcxr applications (20) (21) (22) (23) (24), and these algorithms can be readily repurposed for covid-19 pandemic circumstances. while there are already many papers describing prevalence and radiographic features on pcxr of covid-19 lung infection (see reviews (25, 26)), there are only a few peer-reviewed ai papers (27-32) and non-peer-reviewed papers (33-36) that classify cxrs of covid-19 patients from cxrs of normal subjects or patients with related lung infections. the full potential of ai applications of pcxr under covid-19 pandemic circumstances is not yet fully realized.
the goal of this pilot study is to employ deep-learning convolutional neural networks to classify normal, bacterial infection, and non-covid-19 viral infection (such as influenza) against covid-19 infection on pcxr. the performance was evaluated using receiver-operating curve (roc) analysis. heatmaps were also generated to visualize and assess the performance of the ai algorithm.
we recognized that this dataset was a public, community-driven dataset and that there are potential selection biases.
a radiologist (bs) evaluated all images for quality and relevance, and each case was covid-19 positive based on available data. thus, this dataset is useful and valid for the purpose of algorithm development. the other datasets were taken from the established kaggle chest x-ray image (pneumonia) dataset (https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia). although the kaggle database has a large sample size, we randomly selected a sample size comparable to that of covid-19 to avoid asymmetric sample size bias that could skew sensitivity and specificity. the sample sizes chosen for bacterial pneumonia, non-covid-19 viral pneumonia, and normal pcxr were 492, 552 and 532 patients, respectively. similarly, a chest radiologist evaluated all images for quality.
cnn: the cnn architecture was based on vgg16, a convolutional neural network (37). the vgg16 model was used because it was pretrained on the imagenet database and properly employs transfer learning, which makes the training process efficient. the data were first normalized by transforming all files into rgb images and resizing them to 224x224 pixels to make them compatible with the vgg16 framework. next, the image labels were one-hot-encoded and the data were split into 75% training and 25% testing. a batch size of 32 was used to limit computational expense, and the model was trained for 50 epochs. several optimizers were tested; however, the adam optimizer gave the lowest validation loss. the learning rate was lowered from the recommended 0.01 to 0.001 to prevent overshooting the global minimum loss.
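as a rough illustration of the bookkeeping in the steps above, the following sketch one-hot-encodes the four class labels, evaluates a categorical cross-entropy loss on them, and derives split and batch counts from the reported sample sizes. it is a pure-python sketch under stated assumptions, not the authors' pipeline: the label order in `CLASSES`, the helper names, and the integer rounding of the 25% test share are all hypothetical.

```python
import math

# Hypothetical label order; the paper does not state how classes were indexed.
CLASSES = ["covid-19", "normal", "bacterial", "viral"]


def one_hot(label, classes=CLASSES):
    """Return a 0/1 vector with a single 1 at the index of label."""
    vec = [0] * len(classes)
    vec[classes.index(label)] = 1
    return vec


def categorical_cross_entropy(one_hot_label, probs):
    """Negative log-probability assigned to the true class."""
    return -sum(math.log(p) for t, p in zip(one_hot_label, probs) if t)


# Sample sizes reported in the text.
counts = {"covid-19": 455, "normal": 532, "bacterial": 492, "viral": 552}
total = sum(counts.values())                 # 2031 images
n_test = total * 25 // 100                   # 25% held out (integer rounding is an assumption)
n_train = total - n_test
batches_per_epoch = math.ceil(n_train / 32)  # batch size of 32

label = one_hot("covid-19")
loss_unsure = categorical_cross_entropy(label, [0.70, 0.10, 0.10, 0.10])
loss_confident = categorical_cross_entropy(label, [0.97, 0.01, 0.01, 0.01])
print(label, n_train, n_test, batches_per_epoch)  # [1, 0, 0, 0] 1524 507 48
print(loss_unsure > loss_confident)               # True
```

the last line shows the property that motivates this loss: it shrinks as the probability assigned to the true class approaches 1.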
categorical cross entropy was used as the loss function since the loss value decreases as the predicted probability converges to the actual label. the vgg16 architecture was chosen for its computational efficiency and ease of implementation, and for its immediate translation potential.
figure 2 shows examples of pcxr from a normal subject and from patients with different lung infections. covid-19 is often characterized by ground-glass opacities with or without nodular consolidation, with predominance of bilateral, peripheral and lower lobe involvement. non-covid-19 viral pneumonia is often characterized by diffuse interstitial opacities, usually bilaterally. bacterial pneumonia is often characterized by confluent areas of focal airspace consolidation.
[...] table 1. the precision, recall and f1 scores for the whole pcxr are shown in table 2. the overall precision, recall and f1 scores showed good to excellent performance. for cnn with transfer learning performed on the whole pcxr, the overall sensitivity, specificity, accuracy, and auc were 0.79, 0.93, 0.79, and 0.84, respectively. for cnn performed on segmented lungs, the overall sensitivity, specificity, accuracy, and auc were 0.91, 0.93, 0.88, and 0.89, respectively. the performance was generally better using segmented lungs.
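for readers less familiar with how sensitivity and specificity are obtained in a multiclass setting, the sketch below computes them for one class against all others from a confusion matrix. the matrix values are made-up toy numbers for illustration only and are not results from this study; the function name `one_vs_rest_metrics` is likewise hypothetical, and the paper does not state how its overall figures were averaged across classes.

```python
def one_vs_rest_metrics(confusion, k):
    """Sensitivity and specificity of class k against all other classes.

    confusion[i][j] is the number of cases of true class i predicted as class j.
    """
    n = len(confusion)
    tp = confusion[k][k]
    fn = sum(confusion[k]) - tp                          # class k cases that were missed
    fp = sum(confusion[i][k] for i in range(n)) - tp     # other cases labelled as k
    tn = sum(sum(row) for row in confusion) - tp - fn - fp
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sensitivity, specificity


# Toy 3-class confusion matrix (illustrative values, not data from the paper).
toy = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 1, 9],
]

sens, spec = one_vs_rest_metrics(toy, 0)
accuracy = sum(toy[i][i] for i in range(3)) / sum(sum(row) for row in toy)
print(round(sens, 3), round(spec, 3), round(accuracy, 3))  # 0.8 0.9 0.8
```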
to visualize the spatial locations on the images that the cnn was paying attention to for classification, heatmaps of the covid-19 versus normal pcxr are shown in [...]. for cnn performed on the whole pcxr, the majority of the hot spots were reasonably localized to regions of ground glass opacities and/or consolidations, but some hot spots were located outside the lungs. for cnn performed on segmented lungs, the majority of the hot spots were reasonably localized to regions of ground glass opacities and/or consolidations, mostly as expected.
this study developed and applied a deep-learning cnn algorithm with transfer learning to classify covid-19 cxr from normal, bacterial pneumonia, and non-covid viral pneumonia cxr in a multiclass model. heatmaps showed reasonable localization of abnormalities in the lungs. the overall sensitivity, specificity, accuracy, and auc were 0.91, 0.93, 0.88, and 0.89, respectively (segmented lungs).
there are a few ai studies to date using machine learning methods to classify cxrs of covid-19, normal and related lung infections. by the time this paper is reviewed many more [...] no-findings (n=500) vs. pneumonia (n=500) as well as a binary classification for covid vs. no-findings, which achieved 87.02% and 98.08% accuracies, respectively (31). pereira et al. [...] pneumonia vs no-finding using resampling algorithms, texture descriptors, and cnn.
this model achieved an f1-score of 0.65 for the multiclass approach and an f1-score of 0.89 for the hierarchical classification (32). auc and accuracy were not reported. a few non-peer-reviewed pre-prints using ai to classify covid-19 cxrs have also been reported (33-36). our study had one of the larger cohorts, balanced sample sizes, and a multi-class model. our approach is also among the simplest ai models with a comparable performance index, which likely facilitates immediate clinical translation. together, these studies indicate that ai has the potential to assist frontline physicians in distinguishing covid-19 infection based on cxrs.
heatmaps are informative tools to visualize the regions that the cnn algorithm pays attention to for detection. this is particularly important given that ai operates on a high-dimensional space. such heatmaps enable reality checks and make ai interpretable with respect to clinical findings. our algorithm showed that the majority of the hotspots were highly localized to abnormalities within the lungs, i.e., ground glass opacity and/or consolidation, albeit imperfectly. the majority of the above-mentioned machine learning studies to classify covid-19 cxrs did not provide heatmaps. we also noted that cnn on the whole pcxr image resulted in some hot spots located outside the lungs. cnn of segmented lungs solved this problem. another advantage of using segmented lungs is reduced computational cost during training. transfer learning also reduced computational cost, making this algorithm practical. the performance is generally better using segmented lungs.
most covid-19 positive patients showed significant abnormalities on pcxr (39).
some early studies have even suggested that pcxr could be used as a primary tool for covid-19 screening in epidemic areas (39, 40), which could complement swab testing, which still has a long turnaround time and a non-significant false positive rate. in some cases, imaging revealed chest abnormalities even before swab tests confirmed infection (41, 42). in addition, pcxr can detect superimposed bacterial pneumonia, which necessitates urgent antibiotic treatment. pcxr can also suggest acute respiratory distress syndrome, which is associated with severe negative outcomes and necessitates immediate treatment. given the anticipated widespread shortage of intensive care units and mechanical ventilators in many hospitals, pcxr also has the potential to play a critical role in decision-making, especially with regard to which patients to admit to the icu or put on mechanical ventilation, or when to safely extubate. a timely implementation of ai methods could help to realize the full potential of pcxr in this covid-19 pandemic.
this pilot proof-of-principle study has several limitations. this is a retrospective study with a small sample size, and the data sets used for training had limited alternative diagnoses. although the kaggle database has a large sample size for non-covid-19 cxr, we chose the sample sizes to be comparable to that of covid-19 to avoid asymmetric sample sizes that could skew sensitivity and specificity. future studies will need to increase the covid-19 sample size and include additional lung pathologies. the spatiotemporal characteristics on pcxr of covid-19 infection and their relation to clinical outcomes are unknown. future endeavors could include developing ai algorithms to stage severity and to predict progression, treatment response, recurrence, and survival, to inform and advise risk management and resource allocation associated with the covid-19 pandemic.
in conclusion, deep learning convolutional neural networks with transfer learning accurately classify covid-19 pcxr from pcxr of normal, bacterial pneumonia, and non-covid viral pneumonia patients in a multiclass model. this approach has the potential to help radiologists and frontline physicians by providing efficient and accurate diagnosis.
table 2 shows the precision and recall rate and f1 score (whole cxr).
the continuing 2019-ncov epidemic threat of novel coronaviruses to global health - the latest 2019 novel coronavirus outbreak in wuhan, china
outbreak of pneumonia of unknown etiology in wuhan, china: the mystery and the miracle
early transmission dynamics in wuhan, china, of novel coronavirus-infected pneumonia
first case of 2019 novel coronavirus in the united states
the laboratory diagnosis of covid-19 infection: current issues and challenges
detection of sars-cov-2 in different types of clinical
(10) imaging and clinical features of patients with 2019 novel coronavirus sars-cov-2
portable versus fixed x-ray equipment: a review of the clinical effectiveness, cost-effectiveness, and guidelines.
ottawa (on)
deep learning
using deep convolutional neural networks to identify and classify tumor-associated stroma in diagnostic breast biopsies
imagenet classification with deep convolutional neural networks
improving neural networks by preventing co-adaptation of feature detectors
very deep convolutional networks for large-scale image recognition
deep machine learning-a new frontier in artificial
(19) transfer learning with deep convolutional neural network for liver steatosis assessment in ultrasound images
artificial intelligence and machine learning in respiratory medicine
a systematic review of the diagnostic accuracy of artificial intelligence-based computer programs to analyze chest x-rays for pulmonary tuberculosis
attention-guided convolutional neural network for detecting pneumonia on chest x-rays
deep learning algorithms with demographic information help to detect tuberculosis in chest radiographs in annual workers' health examination data
explainable covid-19 predictions based on chest x-ray images
an automated machine learning model to assist in the diagnosis of covid-19 infection in chest x-ray images
automatic detection of coronavirus disease (covid-19) using x-ray images and deep convolutional neural networks
very deep convolutional networks for large-scale image recognition
learning deep features for discriminative localization
association of inpatient use of angiotensin converting enzyme inhibitors and angiotensin ii receptor blockers with mortality among patients with hypertension hospitalized with covid-19
correlation of chest ct and rt-pcr testing in coronavirus
key: cord-024491-f16d1zov authors: qiu, xi; liang, shen; zhang, yanchun title: simultaneous ecg heartbeat segmentation and classification with feature fusion and long term context dependencies date: 2020-04-17 journal: advances in knowledge discovery and data mining doi: 10.1007/978-3-030-47436-2_28 sha: doc_id: 24491 cord_uid: f16d1zov
arrhythmia detection by classifying ecg heartbeats
is an important research topic for healthcare. recently, deep learning models have been increasingly applied to ecg classification. among them, most methods work in three steps: preprocessing, heartbeat segmentation and beat-wise classification. however, this methodology has two drawbacks. first, explicit heartbeat segmentation can undermine model simplicity and compactness. second, beat-wise classification risks losing inter-heartbeat context information that can be useful to achieving high classification performance. addressing these drawbacks, we propose a novel deep learning model that can simultaneously conduct heartbeat segmentation and classification. compared to existing methods, our model is more compact as it does not require explicit heartbeat segmentation. moreover, our model is more context-aware, for it takes into account the relationship between heartbeats. to achieve simultaneous segmentation and classification, we present a faster r-cnn based model that has been customized to handle ecg data. to characterize inter-heartbeat context information, we exploit inverted residual blocks and a novel feature fusion subroutine that combines average pooling with max-pooling. extensive experiments on the well-known mit-bih database indicate that our method can achieve competitive results for ecg segmentation and classification. arrhythmia occurs when the heart rhythms are irregular, which can lead to serious organ damage. arrhythmias can be caused by high blood pressure, heart diseases, etc [1] . electrocardiogram (ecg) is one of the most popular tools for arrhythmia diagnosis. to manually handle long ecg recordings with thousands of heartbeats, clinicians have to determine the class of each heartbeat to detect arrhythmias, which is highly costly. therefore, great efforts have been made to create computer-aided diagnosis tools that can detect irregular heartbeats automatically. 
in recent years, deep learning models have been gradually applied to ecg classification. among them, most methods work in three steps: preprocessing, heartbeat segmentation and beat-wise classification (see sect. 2). the preprocessing step removes various kinds of noise from raw signals, the heartbeat segmentation step identifies individual heartbeats, and the beat-wise classification step classifies each heartbeat. this methodology has the following drawbacks: first, explicit heartbeat segmentation can undermine model simplicity and compactness. traditional heartbeat segmentation methods explicitly extract ecg features for qrs detection. since deep learning methods can produce feature maps from raw data, heartbeat segmentation can be simultaneously conducted with classification with a single neural network. second, beat-wise classification uses isolated heartbeats, which risks losing inter-heartbeat context information that can be useful to boosting classification performance. addressing these drawbacks, we propose a novel deep learning model that can simultaneously conduct heartbeat segmentation and classification. compared to existing methods, our model is more compact as it does not require explicit heartbeat segmentation. the difference between our model and existing deep learning models is shown in fig. 1 . as is shown, our model takes in a 1-d ecg sequence and outputs both the segmented heartbeats and their corresponding labels. besides, our model is more context-aware, for it takes into account the relationship between heartbeats. to achieve simultaneous segmentation and classification, we present a faster r-cnn [2] based model that has been customized to handle ecg sequences. to capture inter-heartbeat context information, we exploit inverted residual blocks [3] to produce multi-scale feature maps, which are then fused by a novel feature fusion mechanism to learn inter-heartbeat context information. 
moreover, semantic and morphological information is extracted from the fused features to improve performance. our main contributions are as follows:
- we propose a novel deep learning model for simultaneous heartbeat segmentation and classification.
- we present a novel faster r-cnn based model that has been customized to handle ecg data.
- we use inverted residual blocks and a novel feature fusion subroutine to exploit long term inter-heartbeat dependencies for context awareness.
- we conduct extensive experiments on the well-known mit-bih database [4, 5] to demonstrate the effectiveness of our model.
the rest of this paper is organized as follows. section 2 reviews the related work. section 3 presents our model. section 4 reports the experimental results. section 5 concludes this paper.
traditional arrhythmia detection methods extract handcrafted features from ecg data, such as r-r intervals [6, 7], ecg morphology [6], frequency [8], etc. classifiers such as linear discriminant analysis models [6], support vector machines [7] and random forests [9] are then built upon these features.
in recent years, many researchers have turned to deep neural networks for heartbeat classification. the majority of deep learning models take raw signals as their input, omitting explicit feature extraction and selection steps. in [10], kiranyaz et al. proposed a patient-specific 1-d cnn for ecg classification. in [11], yildirim et al. designed a deep lstm network with wavelet-based layers for heartbeat classification. some methods [12, 13] combine lstm with cnn. these aforementioned deep learning methods work in three steps: preprocessing, heartbeat segmentation and beat-wise classification. they do not explicitly utilize context information among heartbeats. by contrast, mousavi et al.
[14] proposed a sequence-to-sequence lstm based model which maps a sequence of heartbeats in time order to a sequence of labels, where context dependencies are captured by the cell state of the network. hannun et al. [15] proposed a 33-layer neural network for arrhythmia detection which maps ecg data to a sequence of labels. however, the precise regions of arrhythmias cannot be obtained. oh et al. [16] used a modified u-net to identify regions of heartbeats and the background from raw signals, yet this method needs extra steps to detect arrhythmias from the generated annotation. several studies apply faster r-cnn to ecg analysis. for example, ji et al. [17] proposed a heartbeat classification framework based on faster r-cnn. 1-d heartbeats extracted from original signals are converted to images as the input of the model. sophisticated preprocessing is required before classification. he et al. [18] and yu et al. [19] use faster r-cnn to perform heartbeat segmentation and qrs complex detection. in our method, we present a modified faster r-cnn for arrhythmia detection which works in only two steps: preprocessing, and simultaneous heartbeat segmentation and classification.
the architecture of our model is shown in fig. 2, which takes a 1-d ecg sequence as its input and conducts heartbeat segmentation and classification simultaneously. to achieve this, our model consists of 6 modules: a backbone network, a region proposal network (rpn), a region classification network (rcn), a filter block, a downsampling block and a region pooling block. the backbone network produces multi-scale feature maps from the ecg signal. the modules in the upper part of fig. 2 perform heartbeat segmentation, while the ones in the lower part perform heartbeat classification. we now elaborate on the details of our method. the preprocessing step removes noise from raw signals.
here we employ a third-order butterworth bandpass filter with a frequency range of 0.27 hz to 45 hz because this range contains the main components of ecg signals [20].
the backbone network generates multi-scale semantic and morphological feature maps from raw ecg signals. for efficiency, we choose the inverted residual block [3] as the building block. we customize it for ecg data in the following manner: (1) we increase the kernel size from 3 to 5 to enlarge the receptive field. (2) the activation function is replaced by elu for less information loss. (3) the residual connection is added to the building block in the stride-2 condition, so that there are two branches in the stride-2 condition (fig. 3). (4) the stride-2 convolution is replaced by max-pooling to make the model more lightweight. there are 6 layers in our backbone. each layer is composed of several building blocks (fig. 3). besides, each layer downsamples the feature map by a factor of two. different from most deep learning methods, which compute feature maps for a single heartbeat, our backbone model takes a long ecg sequence as its input. the produced feature maps encode not only morphological and semantic information of individual heartbeats, but also context information amongst multiple heartbeats. the bottom layers of the backbone generate feature maps of strong morphological information while the top layers produce feature maps of strong semantic information [21]. moreover, the receptive field increases from bottom layers to top layers, thus the feature maps encode inter-heartbeat context dependencies over varying time spans.
we fuse multi-scale feature maps to utilize both morphological and semantic information of the heartbeats in segmentation and classification. besides, the fused feature maps can provide more context information. all feature maps except those in the first two layers are used for efficiency.
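as a rough illustration of the backbone geometry described above (6 layers, each halving the feature-map length, with the first two layers left out of the fusion), the following sketch lists the cumulative strides and the resulting feature-map lengths. the function names are hypothetical, the 3600-point input is the sequence length used later in the experiments, and rounding down at each halving is an assumption, since the text does not say how odd lengths are handled.

```python
def backbone_strides(num_layers=6):
    """cumulative stride after each layer when every layer halves the length."""
    return [2 ** (i + 1) for i in range(num_layers)]


def feature_map_lengths(input_length, num_layers=6):
    """feature-map length after each layer, assuming floor division (an assumption)."""
    lengths = []
    length = input_length
    for _ in range(num_layers):
        length //= 2
        lengths.append(length)
    return lengths


if __name__ == "__main__":
    strides = backbone_strides()
    lengths = feature_map_lengths(3600)
    # the first two layers are skipped when fusing, which leaves strides 8, 16, 32, 64
    for stride, length in list(zip(strides, lengths))[2:]:
        print(f"stride {stride:2d}: {length} positions")
```

under these assumptions the four fused layers have strides 8, 16, 32 and 64, which matches the strides quoted for the region pooling block.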
in the segmentation task, the feature maps need to be normalized to have equal dimensions by the downsampling block before fusion. feature maps are downsampled to a fixed length by a novel mechanism shown in eq. 1, which is a trade-off between performance and complexity. contrary to convolution based downsampling methods [21], our downsampling block is parameter-free. average pooling is exploited for less information loss during downsampling, while max-pooling highlights the discriminative features in the feature maps.
the rpn fuses the feature maps from the downsampling block and performs heartbeat segmentation. here we directly segment the heartbeats without qrs detection. as is shown in fig. 4 (a), rpn has two branches performing regression (top) and classification (bottom). the classification branch produces a binary label for each region indicating whether it contains a single heartbeat. the regression branch produces the endpoints of each region which encloses a heartbeat. intuitively, the regression task is far more difficult than binary classification for rpn, thus we use multi-size convolutional layers (3, 5, 7 in this paper) to further extract features before regression. following the practice of faster r-cnn [2], at each position of a feature map, we pre-define three reference regions. these regions have different sizes (128, 192, 256 for ecg heartbeats). to predict a region, we use the center of one of the three regions as the reference point and report its offset to this reference. however, the regions obtained by rpn can overlap with nearby regions, undermining the efficiency of the model. in response, we use non-maximum suppression (nms) to filter these regions in the filter block. nms selects the region with maximum confidence in each iteration and then computes the overlaps between each remaining region and the selected one.
the regions whose overlap exceeds a pre-set threshold (30% in this paper) are discarded, as are the regions containing no heartbeats (confidence below 0.5).
in the heartbeat classification task, the region pooling block generates heartbeat feature maps for the predicted heartbeat regions [2]. feature maps in the last four layers (with strides 8, 16, 32, 64) are reused to extract heartbeat features. because these feature maps have different sizes, the predicted regions are mapped as: region = (start/stride, end/stride). moreover, each region is divided into fixed-size sub-regions. heartbeat feature maps are produced by average pooling on each sub-region. to keep sufficient morphological information, heartbeat feature maps in the bottom layers have larger sizes (8, 4, 2, 1 for strides 8, 16, 32, 64). heartbeat feature maps are then fed into the region classification network (rcn, fig. 4 (b)) to classify heartbeats inside each region. rcn performs heartbeat classification by fusing the feature maps from the region pooling block. note that we do not fine-tune the regions as in faster r-cnn because it trades efficiency for only minor improvements in accuracy.
following common practices in the detection task [2, 21], our backbone network is initialized with a pre-trained network. we extract heartbeats from the experimental database (to be discussed later) and pre-train the backbone network with extra layers on these heartbeats. then, the last few layers are removed while the remaining ones are used as the backbone network. we coarsely annotate the groundtruth heartbeat region for each heartbeat, which ranges from 0.25 s-0.83 s around the r peak so as to cover most of the heartbeat. since our model can capture inter-heartbeat context information, finer annotation is not necessary. the offsets of reference regions to the groundtruth ones [2] are used to train the rpn regression branch.
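the filter block and the region pooling block described above can be sketched in plain python as follows. this is a minimal single-channel sketch under stated assumptions, not the authors' implementation: all function names are hypothetical, overlap is computed as the jaccard index (intersection over union) of two 1-d intervals, and sub-region boundaries are rounded outwards to whole positions.

```python
import math


def overlap_1d(a, b):
    """jaccard index (intersection over union) of two 1-d intervals (start, end)."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def nms_1d(regions, scores, overlap_thresh=0.3, score_thresh=0.5):
    """return indices of kept regions, highest confidence first."""
    # regions with confidence below score_thresh are treated as containing no heartbeat
    order = sorted(
        (i for i, s in enumerate(scores) if s >= score_thresh),
        key=lambda i: scores[i],
        reverse=True,
    )
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # discard every remaining region that overlaps the selected one too much
        order = [i for i in order
                 if overlap_1d(regions[best], regions[i]) <= overlap_thresh]
    return keep


def map_region(start, end, stride):
    """map a region from signal coordinates to feature-map coordinates."""
    return start / stride, end / stride


def region_average_pool(feature, start, end, bins):
    """average-pool feature[start:end] into `bins` sub-regions (single channel)."""
    width = (end - start) / bins
    pooled = []
    for b in range(bins):
        lo = int(math.floor(start + b * width))
        hi = int(math.ceil(start + (b + 1) * width))
        lo = max(0, min(lo, len(feature) - 1))
        hi = max(lo + 1, min(hi, len(feature)))
        window = feature[lo:hi]
        pooled.append(sum(window) / len(window))
    return pooled
```

for example, a region spanning samples 128 to 256 maps to positions 16 to 32 on the stride-8 feature map, and would then be pooled into 8 values according to the sizes quoted above.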
to train the classification branch, a positive label is assigned to a predicted region when the following criteria are met: (1) its overlap with a groundtruth heartbeat region is over 0.7. (2) it has the highest overlap with a groundtruth heartbeat region. in rcn training, a predicted region is assigned the label of the heartbeat inside it. we use jaccard distance as the metric for overlap computation. similar to [2], our entire training process has two steps: 1) train the rpn with regression loss and binary classification loss. 2) train the rcn with multi-classification loss. for better performance, we choose smooth l1 loss (eq. 2) to train the regression branch and focal loss (eq. 3) to train rcn and rpn for classification, where p_t is the estimated probability for the class t. we set γ to 2, and α_t to [0.25, 1] for binary classification and [1, 0.5, 1, 0.5, 1] for multi-classification.
we implemented our model using pytorch. our source code is available for reproducibility. the experiments were run on a linux server with an intel xeon w-2145 cpu @3.7 ghz, 64 gb memory and a single nvidia 1080ti gpu. adadelta was used as the optimizer with weight decay 1e-5. the learning rate was set to 0.15 for training, decaying every 10 epochs exponentially as follows: lr = 0.3 * 0.9^(epoch/50). the batch size was set to 240 for both training and testing.
we used data from the well-known mit-bih database [4, 5], which contains 48 half-hour two-lead ecg recordings. we used the mlii lead in the experiments and excluded 4 recordings with paced beats, following the ansi/aami ec57 standard [22]. due to limited computational resources, we divided each recording into a series of long sequences with 3600 data points. the first and last 10 s of each recording were discarded. note that our model can process much longer ecg recordings with abundant computational resources.
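the two training losses named above can be sketched as follows. eq. 2 and eq. 3 are not reproduced in this text, so the sketch assumes the commonly used definitions of smooth l1 loss and focal loss; the function names, the `beta` parameter and the mapping of α_t entries to class indices are assumptions for illustration only.

```python
import math


def smooth_l1(x, beta=1.0):
    """smooth l1 loss of a single regression error x (standard form, beta assumed 1)."""
    ax = abs(x)
    if ax < beta:
        return 0.5 * ax * ax / beta  # quadratic near zero
    return ax - 0.5 * beta           # linear for large errors


def focal_loss(probs, target, alpha, gamma=2.0):
    """focal loss -alpha_t * (1 - p_t)^gamma * log(p_t) for one sample.

    probs  : predicted class probabilities (already normalised)
    target : index of the true class t
    alpha  : per-class weights alpha_t (index-to-class mapping is an assumption)
    """
    p_t = min(max(probs[target], 1e-12), 1.0)  # clamp to avoid log(0)
    return -alpha[target] * (1.0 - p_t) ** gamma * math.log(p_t)


if __name__ == "__main__":
    # a confident, correct prediction contributes far less than plain cross-entropy
    print(focal_loss([0.1, 0.9], target=1, alpha=[0.25, 1.0]))
    print(-math.log(0.9))
```

with γ = 2, well-classified samples are down-weighted by the factor (1 - p_t)^2, which is why focal loss is a common choice for imbalanced data such as heartbeat classes.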
we ran the experiments 5 times, randomly dividing the dataset into training, validation and test sets for each run. the training set contained 70% of all data. the validation and test sets included 10% and 20% of all data, respectively. the heartbeat labels were mapped into 5 groups by the ansi/aami standard, namely n, s, v, f, q (see table 1). we did not take the q class into consideration because of its scarcity. we applied the following metrics for evaluation: positive predictive value (ppv), sensitivity (sen), specificity (spe) and accuracy (acc).
to evaluate heartbeat segmentation performance, we define a true positive (tp) as: 1) a predicted region contains only one heartbeat and 2) its non-overlapping area with the groundtruth is less than 150 ms. we define a false positive (fp) as: 1) a predicted region encloses more than two heartbeats or 2) its non-overlapping area with the groundtruth exceeds 150 ms. we define a false negative (fn) as: a groundtruth heartbeat is not enclosed by any predicted region. for baselines, we used two qrs detection based heartbeat segmentation methods: pan-tompkins [23] and wavedet [24]. the results are shown in table 2. as is shown, our method is highly competitive against the baselines. it is worth noting that unlike the baselines, our model does not apply qrs detection for segmentation, thus there may be inconsistencies in the definitions of tp, fp and fn between our model and the baselines. however, it is safe to say that our model performs well enough to be applied in real-world scenarios.
we now evaluate the heartbeat classification performance. the baselines come from [13, 14, 25, 26]. here we applied smote [27] for data augmentation as it was also used in our baselines. figure 5 shows the results. our model achieves an accuracy of 99.6%, a sensitivity of 99.75% and a specificity of 99.6%. these results are similar to those obtained by [14], which used an lstm-based sequence-to-sequence model to learn context information.
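the four evaluation metrics listed above follow directly from the confusion counts. the sketch below uses their standard definitions; the function name and the example counts are hypothetical and are not results from the paper. note that the segmentation task defines only tp, fp and fn, so only ppv and sensitivity apply there.

```python
def _ratio(numerator, denominator):
    """return numerator / denominator, or nan when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def evaluation_metrics(tp, fp, tn, fn):
    """ppv, sensitivity, specificity and accuracy from confusion counts."""
    return {
        "ppv": _ratio(tp, tp + fp),                   # positive predictive value
        "sen": _ratio(tp, tp + fn),                   # sensitivity (recall)
        "spe": _ratio(tn, tn + fp),                   # specificity
        "acc": _ratio(tp + tn, tp + fp + tn + fn),    # accuracy
    }


if __name__ == "__main__":
    # made-up counts, purely for illustration
    for name, value in evaluation_metrics(tp=90, fp=10, tn=80, fn=20).items():
        print(f"{name}: {value:.4f}")
```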
the difference between our work and [14] is that we learn context information from raw signals while [14] did so using a sequence of individual heartbeats. besides, the lstm-based model has lower efficiency. compared to other baselines which perform classification on individual heartbeats, our model has a simpler model structure but achieves similar or better results on some metrics, highlighting the power of context-awareness.
we now investigate the impact of the key design features of our model. for better evaluation, we did not use smote [27] here. to better capture long-term dependencies, we have enlarged the receptive field to retain context information. also, our model captures inter-heartbeat dependencies by learning multi-scale feature maps with rpn and rcn. to demonstrate the effectiveness of these design choices, we conducted the following ablation tests: 1) setting the kernel size to 3 for all convolution filters, 2) using the feature maps in the top layers only, 3) using the feature maps in the bottom layers only, 4) equalizing the output sizes of multi-scale feature maps in the region pooling block. figure 6 and table 3 present the results in the segmentation and classification tasks. as is shown, while the effectiveness of these design features in the segmentation task is limited, they are indeed beneficial to the classification task.
figure 7 and table 4 show the results. by modifying the backbone, our model can learn stronger features to improve performance. our downsampling method also outperforms using only max-pooling or average pooling.
our model has more parameters in the regression branch of rpn based on the intuition that regression is more difficult than binary classification in rpn. to evaluate this design, we conducted the following ablation tests: 1) enlarging the classification branch. 2) simplifying the regression branch. table 5 presents the results.
as is shown, simplifying the regression branch has a negative impact on performance, while enlarging the classification branch brings about no improvement.
in this paper, we have proposed a novel deep learning model that can simultaneously conduct heartbeat segmentation and classification. compared to existing methods, our model is more compact as it does not require explicit heartbeat segmentation. moreover, our model is context-aware by using feature fusion and long-term context dependencies. in the future, we plan to extend our model to multi-lead ecg analysis tasks.
[1] cardiac arrhythmia: mechanisms, diagnosis, and management
[2] faster r-cnn: towards real-time object detection with region proposal networks
[3] proceedings of the ieee conference on computer vision and pattern recognition
[4] the impact of the mit-bih arrhythmia database
[5] physiobank, physiotoolkit, and physionet: components of a new research resource for complex physiologic signals
[6] automatic classification of heartbeats using ecg morphology and heartbeat interval features
[7] classification of electrocardiogram signals with support vector machines and particle swarm optimization
[8] ecg beat classification using pca, lda, ica and discrete wavelet transform
[9] medical decision support system for diagnosis of heart arrhythmia using dwt and random forests classifier
[10] real-time patient-specific ecg classification by 1-d convolutional neural networks
[11] a novel wavelet sequence based on deep bidirectional lstm network model for ecg signal classification
[12] cardiac arrhythmia detection from ecg combining convolutional and long short-term memory networks
[13] a lstm and cnn based assemble neural network framework for arrhythmias classification
[14] inter- and intra-patient ecg heartbeat classification for arrhythmia detection: a sequence to sequence deep learning approach
[15] cardiologist-level arrhythmia detection and classification in ambulatory electrocardiograms using a deep neural network
[16] automated beat-wise arrhythmia diagnosis using modified u-net on extended electrocardiographic recordings with heterogeneous arrhythmia types
[17] electrocardiogram classification based on faster regions with convolutional neural network
[18] a deep learning method for heartbeat detection in ecg image
[19] qrs detection and measurement method of ecg paper based on convolutional neural networks
[20] frequency content and characteristics of ventricular conduction
[21] feature pyramid networks for object detection
[22] ansi/aami ec57: 2012-testing and reporting performance results of cardiac rhythm and st segment measurement algorithms
[23] a real-time qrs detection algorithm
[24] a wavelet-based ecg delineator: evaluation on standard databases
[25] arrhythmias classification by integrating stacked bidirectional lstm and two-dimensional cnn
[26] a selective ensemble learning framework for ecg-based heartbeat classification with imbalanced data
[27] smote: synthetic minority over-sampling technique
acknowledgement. this work is funded by nsfc grant 61672161 and dongguan innovative research team program 2018607201008. we sincerely thank prof chun liang and dr zhiqing he from department of cardiology, shanghai changzheng hospital for their valuable advice.
key: cord-002901-u4ybz8ds authors: yu, chanki; yang, sejung; kim, wonoh; jung, jinwoong; chung, kee-yang; lee, sang wook; oh, byungho title: acral melanoma detection using a convolutional neural network for dermoscopy images date: 2018-03-07 journal: plos one doi: 10.1371/journal.pone.0193321 sha: doc_id: 2901 cord_uid: u4ybz8ds
background/purpose: acral melanoma is the most common type of melanoma in asians, and usually results in a poor prognosis due to late diagnosis. we applied a convolutional neural network to dermoscopy images of acral melanoma and benign nevi on the hands and feet and evaluated its usefulness for the early diagnosis of these conditions.
methods: a total of 724 dermoscopy images comprising acral melanoma (350 images from 81 patients) and benign nevi (374 images from 194 patients), and confirmed by histopathological examination, were analyzed in this study. to perform the 2-fold cross validation, we split them into two mutually exclusive subsets: half of the total image dataset was selected for training and the rest for testing, and we calculated the accuracy of diagnosis comparing it with the dermatologist’s and non-expert’s evaluation. results: the accuracy (percentage of true positive and true negative from all images) of the convolutional neural network was 83.51% and 80.23%, which was higher than the non-expert’s evaluation (67.84%, 62.71%) and close to that of the expert (81.08%, 81.64%). moreover, the convolutional neural network showed area-under-the-curve values of 0.8 and 0.84 and youden’s index values of 0.6795 and 0.6073, which were similar to those of the expert. conclusion: although further data analysis is necessary to improve their accuracy, convolutional neural networks would be helpful to detect acral melanoma from dermoscopy images of the hands and feet.
in asians, melanoma is rare, compared to its prevalence in caucasians, and usually occurs in acral areas such as the hands and feet. it can be misrecognized as benign nevi (bn), is occasionally hidden by calluses, and eventually results in late diagnosis at an advanced stage, with a poor prognosis [1] [2] [3]. since effective anti-cancer agents for treating melanoma have not yet been developed, early detection and wide excision of the skin lesion are more crucial to the cure for melanoma. recently, to aid the early diagnosis of melanoma and the reduction of unnecessary skin biopsy, dermoscopy has been widely used [4, 5].
moreover, because it is difficult for nonexperts to use [6] , artificial intelligence and deep-learning models have been applied to help physicians who are untrained to handle a digital dermoscope [7] ; its use is expected to increase in the field of teledermatology. a convolutional neural network (cnn) is one of the representative models among the various deep-learning models. it has already shown potential for general and highly variable tasks across many fine-grained object categories [8] [9] [10] [11] [12] and has been shown to exceed human performance in object recognition [9] . recently, it was applied to detect skin cancers in images, including from dermoscopy, and successfully demonstrated artificial intelligence capable of classifying skin cancer with a competence level comparable to that of dermatologists [13] . for the success of cnn models, a large amount of training data labeled with class types to produce a rich feature hierarchy is necessary, and therefore, its usefulness in the diagnosis of rare diseases with insufficient data has not been fully established. in this study, we applied an end-to-end cnn framework to detect a rare disease in asians, acral melanoma (am), from the dermoscopy images of pigmentation on the hands and feet. to overcome the insufficiency of the datasets, we adopted a transfer learning technique to leverage learned features from a cnn model pre-trained on a large-scale natural image dataset [14] . moreover, we also applied a half-training and half-trial method to validate its clinical usefulness for the early diagnosis of patients compared with the dermatologist's and non-expert's evaluation. a total of 724 dermoscopy images were collected from january 2013 to march 2014 at the severance hospital in the yonsei university health system, seoul, korea, and from march 2015 to april 2016 at the dongsan hospital in the keimyung university health system, daegu, korea. 
among them, 350 dermoscopy images were from 81 patients with am and 374 images were from 194 patients with bn of the acral area (fig 1). a total of 632 dermoscopy images were captured by the dermlite cam (3gen inc., usa), and 92 images were captured by the dermlite hybrid ii (3gen inc., usa), connected to a digital camera (nikon coolpix p6000, japan). all diagnoses were histopathologically confirmed and multiple images were captured in cases of large lesions. we provide a strobe checklist for the study of diagnostic efficacy as supporting information (s1 table). dermoscopy images of bn were divided into nine types, and am images into three types according to the reference [15], by two dermatologists. this study protocol was approved by the institutional review board of yonsei university, severance hospital and keimyung university, dongsan hospital and was conducted according to the declaration of helsinki principles. patient records/information was anonymized and deidentified prior to analysis.
we describe the cnn architecture we adopted in section 2.1 and present the training and inference methods for detecting melanoma in section 2.2.
2.1 convolutional neural network. cnns are composed of several convolutional layers, each involving linear and nonlinear operators, as well as fully connected layers. the architecture for the state-of-the-art cnn has many parameters; for example, the vgg-16 model has 138 million parameters, where the parameters are learned from the imagenet dataset containing 1.2 million general object images of 1,000 different object categories for training [16]. deep neural networks are difficult to train using small datasets (i.e., a few hundred images). to circumvent this problem, we used the fine-tuning technique, which is one of the regularization techniques.
we fine-tuned a modified vgg model with 16 layers (13 convolutional and three fully connected layers), which uses convolution filters of the same size (i.e., 3 × 3) for all convolution layers, as seen in table 1. our network configuration is depicted in fig 2 and table 1. each layer and feature map in the cnn is represented by a three-dimensional array (in table 1, "conv" represents a convolutional layer and "fc" represents a fully connected layer). the input with a fixed size, 224 × 224, was passed through a stack of convolutional layers, each followed by a rectified linear unit (relu) activation function, and max-pooling was performed over a 2 × 2 pixel window with a stride of 2. a series of convolutional layers (conv1, conv2, conv3, conv4, and conv5) were followed by three fully connected layers: the first 2 fully connected layers (fc6 and fc7) had 4,096 channels each, each followed by a relu activation function, while the last fully connected layer (fc8) had 2 channels since our problem was a two-way classification problem (melanoma and non-melanoma class). it should be noted that the number of channels of the last fully connected layer was the same as the number of classes. hence, we replaced the original fully connected layer (fc8: fc with 1000 channels) with a fully connected layer with two channels. the last layer had the soft-max activation function and predicted whether the input patch was a melanoma or non-melanoma lesion. moreover, the vgg-16 model pre-trained on the imagenet database was used to perform transfer learning, and the weights of the last convolutional layers (the last two layers of conv5) and the three fully connected layers (fc6, fc7, and fc8) were initialized using xavier weight initialization [17]. in order to perform fine-tuning, we froze the weights of conv1, conv2, conv3, conv4, and the first layer of conv5 pre-trained on imagenet, and trained the initialized weights on our dermoscopy image dataset.
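as a back-of-the-envelope check of the architecture described above, the sketch below counts the trainable parameters of a vgg-16 style network. it assumes the standard vgg-16 channel configuration (the exact configuration is in table 1, which is not reproduced in this text), 3 × 3 kernels with biases, and a 2 × 2 max-pool after each of the five conv groups; the function and constant names are hypothetical.

```python
# output channels of the 13 convolutional layers in standard vgg-16 (assumed)
CONV_CHANNELS = [64, 64, 128, 128, 256, 256, 256, 512, 512, 512, 512, 512, 512]
# indices of the conv layers that are followed by a 2x2 max-pool with stride 2
POOL_AFTER = {1, 3, 6, 9, 12}


def vgg16_param_count(num_classes, in_channels=3, input_size=224, kernel=3):
    """count weights and biases of the conv and fully connected layers."""
    total = 0
    channels = in_channels
    size = input_size
    for i, out_channels in enumerate(CONV_CHANNELS):
        total += kernel * kernel * channels * out_channels + out_channels
        channels = out_channels
        if i in POOL_AFTER:
            size //= 2  # each pooling step halves the spatial resolution
    flat = channels * size * size  # 512 * 7 * 7 for a 224 x 224 input
    for fan_in, fan_out in [(flat, 4096), (4096, 4096), (4096, num_classes)]:
        total += fan_in * fan_out + fan_out
    return total


if __name__ == "__main__":
    print(vgg16_param_count(1000))  # original 1000-way fc8
    print(vgg16_param_count(2))     # fc8 replaced by a two-channel layer
```

under these assumptions the original network has about 138 million parameters, consistent with the figure quoted earlier, and replacing fc8 with a two-channel layer removes roughly 4 million of them.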
the above procedure is performed to prevent the large gradient caused by randomly initialized weights from ruining the pre-trained weights. after several training epochs, we trained all the weights of our network without freezing any layer.
our dataset consisted of 724 images and associated labels, which were split into two mutually exclusive subsets (group a and b); half of the total image dataset was selected for training and the rest for testing. the scale and location of a skin lesion in a captured image changed according to the capture conditions. to resolve this issue, we adopted a sliding window strategy and used cropped patches instead of the full image at training and inference time. at inference time, we extracted about 12 image patches from each test image on a regularly spaced grid with a partial overlap between neighboring patches and then each patch was rescaled to the size of 224 × 224 pixels, as seen in fig 3. in addition, to increase the robustness of our cnn model to geometric transformations, the training dataset was artificially augmented at training. additional augmented data were formed by rotating and flipping images from the original training set. we generated 216 image patches from a single image using rotations by 0˚, 45˚, 90˚, and 135˚, as well as left-right and top-bottom reflections. in addition, the patches that did not contain any melanoma lesions among the melanoma training images were manually removed and the patches that did not contain any skin lesions among the non-melanoma training images were assigned to the non-melanoma class at training time. we randomly selected 30% of the training dataset as a validation set and the rest as a training set at the onset of training. the validation data were used to prevent the overfitting of the training data and to provide guidance on when to stop training the network.
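the regular-grid patch extraction described above can be sketched as follows. the text only states that about 12 overlapping patches are taken per test image, so the 4 × 3 grid, the 25% overlap, the image size in the example and the function name are all assumptions chosen for illustration; rescaling each crop to 224 × 224 pixels is left to an image library.

```python
def grid_patches(width, height, cols=4, rows=3, overlap=0.25):
    """return (left, top, right, bottom) crop boxes on a regular overlapping grid.

    the patch size is chosen so that `cols` x `rows` patches, each sharing the
    fraction `overlap` of its extent with its neighbour, exactly cover the image.
    """
    patch_w = width / (cols - (cols - 1) * overlap)
    patch_h = height / (rows - (rows - 1) * overlap)
    step_x = patch_w * (1 - overlap)
    step_y = patch_h * (1 - overlap)
    boxes = []
    for r in range(rows):
        for c in range(cols):
            left = round(c * step_x)
            top = round(r * step_y)
            right = min(width, round(c * step_x + patch_w))
            bottom = min(height, round(r * step_y + patch_h))
            boxes.append((left, top, right, bottom))
    return boxes


if __name__ == "__main__":
    # hypothetical 1300 x 1000 pixel dermoscopy image
    for box in grid_patches(1300, 1000):
        print(box)
```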
the training of our cnn was stopped when the validation error on the validation set stopped decreasing. we trained the network using an adaptive stochastic sub-gradient method where the batch size is set to 50, and the momentum parameter, learning rate, and weight decay are set to 0.9, 0.0001, and 0.0005, respectively. some of the filters learned from our melanoma dataset may be seen in fig 4. fig 4(a) shows 64 learned filters at the 1st convolutional layer, where each represents a learned filter with a 3 × 3 kernel size. the input of the first layer is an rgb image of 224 × 224 pixels, and it is convolved with 64 learned filters with a 3 × 3 kernel size as shown in fig 3(a), and 64 feature maps of size 224 × 224 are generated. in addition, the output feature maps are used as the input of the next layer. fig 4(b)-4(m) show 100 filters among the learned filters from the 2nd to the 13th convolution layer, respectively, where each represents a learned filter with a 3 × 3 kernel size. at the time of inference, we interpreted 12 image patches per test image, and when one or more patches were predicted as containing melanoma, the corresponding test image was interpreted as containing melanoma. each input of the network was an rgb image from which the average image, calculated over the entire training image dataset, was subtracted. we implemented our method using matconvnet, a matlab-based cnn framework for computer vision applications [18]. moreover, we fine-tuned a vgg model with 16 layers downloaded from http://www.vlfeat.org/matconvnet/pretrained.
to assess the clinical usefulness of the cnn, we compared its diagnostic rate with those of two dermatologists who had five or more years of clinical experience in dermoscopy (expert group) and two non-trained general physicians (non-expert group). all images on the computer screen were evaluated simultaneously. if the two physicians disagreed, they reached a conclusion by consensus.
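the image-level decision rule described above (an image is interpreted as melanoma when one or more of its patches is predicted as melanoma) can be sketched as follows. the function name is hypothetical, and the 0.5 cut-off on the soft-max melanoma probability is an assumption; for a two-class soft-max it corresponds to picking the more probable class.

```python
def classify_image(patch_melanoma_probs, threshold=0.5):
    """return True when at least one patch is predicted as melanoma.

    patch_melanoma_probs : soft-max melanoma probability of each patch
    threshold            : assumed cut-off for calling a patch melanoma
    """
    return any(p >= threshold for p in patch_melanoma_probs)


if __name__ == "__main__":
    # 12 patches, only one of which looks suspicious
    print(classify_image([0.1] * 11 + [0.9]))
    # 12 patches, none suspicious
    print(classify_image([0.1] * 12))
```

this "any patch" rule favours sensitivity over specificity, since a single false-positive patch is enough to flag the whole image.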
since 724 images were randomly and equally
the agreement between the pathologic result and each rater's diagnosis was measured using the calculation of cohen's kappa coefficient, κ = (po − pe) / (1 − pe) (po = accuracy, pe = hypothetical probability of a chance agreement). all statistical analyses were performed with medcalc software version 17.9.
among 724 dermoscopy images, 71 images were from the hands and fingers, and the others were from the feet and toes. a total of 350 am images included homogenous diffuse irregular pigmented, parallel ridge, and multicomponent patterns, while 374 bn images included parallel furrow, fibrillar, lattice-like, reticular, globular, and homogenous patterns (s1 table). in the group a results obtained by the training of group b images, cnn showed 92.57% sensitivity and 75.39% specificity, which were similar to those of the expert (94.88% and 68.72%, respectively). however, the non-expert showed lower sensitivity (41.71%) and relatively higher specificity (91.28%, table 2). for diagnostic accuracy, both the cnn and expert group showed similar scores (83.51% and 81.08%, respectively), which were higher than that of the non-expert (67.84%, fig 5). in the result of group b by the training of group a images, cnn also showed a higher diagnostic accuracy (80.23%) than that of the non-expert (62.71%) but was similar to that of the expert (81.64%). for validating diagnostic reliability, both the cnn and expert showed an auc above 0.8 in group a and b (fig 5). however, the non-expert
regarding the concordance rate between the cnn and expert group, 73 cases (73/362, 20.17%) in group a (am: 14 cases, bn: 59 cases) were discordant. of these, 41 cases (56.16%) of the cnn and 32 cases (43.84%) of the expert were identical with the pathologic results. however, in the concordant cases between them, 29 cases (29/362, 8.01%) differed from the pathology reports.
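the agreement and accuracy statistics used above can be sketched as follows, using the standard definitions of cohen's kappa (from po and pe) and youden's index (sensitivity + specificity - 1). the function names and the 2 × 2 table in the example are hypothetical; only the sensitivity and specificity in the last line are taken from the group a results.

```python
def cohen_kappa(table):
    """cohen's kappa from a square agreement table.

    table[i][j] = number of cases rated class i by rater a and class j by rater b.
    """
    n = sum(sum(row) for row in table)
    k = len(table)
    # observed agreement po: fraction of cases on the diagonal
    po = sum(table[i][i] for i in range(k)) / n
    # chance agreement pe: sum over classes of (row marginal * column marginal)
    pe = sum(
        (sum(table[i]) / n) * (sum(table[r][i] for r in range(k)) / n)
        for i in range(k)
    )
    return (po - pe) / (1 - pe)


def youden_index(sensitivity, specificity):
    """youden's index j = sensitivity + specificity - 1."""
    return sensitivity + specificity - 1


if __name__ == "__main__":
    # made-up agreement table, purely for illustration
    print(cohen_kappa([[20, 5], [10, 15]]))
    # cnn on group a: 92.57% sensitivity and 75.39% specificity
    print(youden_index(0.9257, 0.7539))
```

with the group a values the index comes out at about 0.68, close to the 0.6795 reported in the abstract; the small difference presumably comes from rounding of the published percentages.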
in group b, 57 cases (am: 12 cases, bn: 45 cases) showed discordance between the cnn and expert, and 26 cases (45.61%) of the cnn and 31 cases (54.39%) of the expert were identical with the pathologic results. among the concordant cases in group b, 39 cases (39/362, 10.77%) differed from the pathology results. cohen's kappa between cnn and expert, cnn and non-expert, expert and non-expert is shown in table 3. to verify the performance of the cnn architecture for the discrimination of acral melanoma, we also evaluated inception-v3, the deep learning architecture used in [13], the state-of-the-art publication for the classification of skin cancer. in [13], a single image was used for learning. meanwhile, we applied multiple images for learning. thus, we compared inception-v3 with a single image and inception-v3 with multiple images to cnn with multiple images. the results are shown in table 4.
although non-invasive and automated diagnostic techniques have been introduced for the early detection of melanoma, they are still not easy to apply in the acral type [7, 19]. this may be due to the overall low occurrence rate of melanoma in asians, depending on the ethnic differences, which means a longer time is needed to provide a sufficient dataset to improve diagnostic accuracy. to overcome the problem of an insufficient dataset, we adopted a 2-fold cross validation method for the training and test groups. in addition, capturing images at different places for one lesion helps to construct a robust cnn. similarly, data augmentation generating virtual images using rotation, translation, and different angle positioning from one image also helps to build a robust cnn. these procedures are necessary to construct an automated diagnosis system from small datasets due to the low occurrence rate of acral melanoma. for the effective screening of melanoma, higher sensitivity is required. thus, if there is a small compartment corresponding to the melanoma in one image, our system considers it as melanoma.
also, our system recognizes one image as one patient. from the results, the accuracy of the cnn was above 80%, which was similar in both groups and was close to that of the expert. the cnn and expert also showed auc values above 0.8, indicating good discrimination. generally, higher auc values are considered to demonstrate better discriminatory abilities as follows: excellent discrimination, auc ≥ 0.90; good discrimination, 0.80 ≤ auc < 0.90; fair discrimination, 0.70 ≤ auc < 0.80; and poor discrimination, auc < 0.70 [20]. since the auc of the non-expert was lower than 0.7, cnn can be a useful tool for the early detection of am by physicians who are not familiar with dermoscopic images. moreover, additional datasets of am images can improve the diagnostic accuracy of cnn [21], making it a more reliable tool for evaluating the need for skin biopsy of hand and foot pigmentation. several auto-classification methods that are independent of the size of the training data have used dermatologists' checklists, such as the abcd rule and the 7-point scale [22] [23] [24] [25]. these methods used particular features such as color, shape, size, the boundary of the skin lesion, and statistical features of wavelength, and showed 91.26% accuracy and an auc value of 0.937 [25]. however, they cannot be directly applied to acral melanoma due to its different morphologic features, such as ridge or furrow patterns. although a new dermoscopic algorithm reflecting these characteristics, braaff [26], has been proposed for diagnosing acral melanoma, it has not yet been applied to automated diagnosis. in addition, although there is a state-of-the-art automated classification method for acral melanoma, it cannot be generalized and only works well for a particular pattern of acral melanoma, the ridge-and-furrow pattern [27]. automated diagnosis methods using particular features are able to reflect experts' perception, and they run fast. 
however, it is not easy to catch experts' perception, although we are trying to reach the goal with significant features. on the other hand, deep learning does not require specific features as inputs. it automatically finds the most correlated features with expert's perception by learning. thus the accuracy is higher than feature-based methods. however, a large database is critical for the successful completion of deep learning. recently, the melanoma classification performance of cnn using 1,010 dermoscopy images was reported as having an auc of 0.94 [13] , which was higher than noted in our results (0.84, 0.8). our inferior results may be due to the characteristics of am; it occurs on the pressure area, thick skin, callus, etc., which can hinder and transform the classic pigmented lesion into an atypical case. because of this, experts in our experiment also showed an auc of 0.82. therefore, if the datasets are analyzed separately considering these anatomic characters, cnn may perform a more precise discrimination. furthermore, if combined with images from noninvasive devices for melanoma diagnosis, which may overcome the problems presented by a thick skin, the accuracy of cnn can be markedly improved. several non-invasive devices such as confocal and photon microscopy are being introduced to provide convenient ways to diagnose melanoma early [28] . however, they require much effort and time for a physician to gain expertise. an automated diagnostic system using a cnn, even with a small dataset, may alleviate the difficulty of learning how to use these newly developed devices. in conclusion, a half-training and half-trial method were useful for creating a comparatively accurate deep-learning model from a relatively small dataset. although further data analysis is necessary to improve its accuracy, cnn would be helpful for the early detection of am, which is usually associated with delayed diagnosis and poor prognosis. 
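the half-training and half-trial scheme described above can be sketched as a 2-fold split in which each half is used once for training and once for testing. this is a minimal python sketch, not the authors' implementation; the function names and the fixed seed are assumptions made for illustration.

```python
import random


def two_fold_split(items, seed=0):
    """Shuffle the items and cut them into two halves (group a, group b)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def two_fold_rounds(items, seed=0):
    """Yield (train, test) pairs: train on b and test on a, then the reverse."""
    group_a, group_b = two_fold_split(items, seed)
    yield group_b, group_a
    yield group_a, group_b


# 724 image indices give two groups of 362, matching the group sizes in the text
rounds = list(two_fold_rounds(range(724)))
```

each image is used for testing exactly once, so every reported test result comes from a model that never saw that image during training.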
supporting information s1 conceptualization: sejung yang, byungho oh. treatment and outcomes of melanoma in acral location in korean patients epub 2010/05/26 plantar malignant melanoma-a challenge for early recognition improvement in survival rate of patients with acral melanoma observed in the past 22 years in sendai biopsy of the pigmented lesion-when and how dermoscopy of pigmented skin lesions: results of a consensus meeting via the internet diagnostic accuracy of dermoscopy systematic review of dermoscopy and digital dermoscopy/ artificial intelligence for the diagnosis of melanoma deep learning imagenet large scale visual recognition challenge imagenet classification with deep convolutional neural networks. advances in neural information processing systems rethinking the inception architecture for computer vision deep residual learning for image recognition dermatologist-level classification of skin cancer with deep neural networks convolutional neural networks for medical image analysis: full training or fine tuning? an atlas of dermoscopy very deep convolutional networks for large-scale image recognition understanding the difficulty of training deep feedforward neural networks matconvnet: convolutional neural networks for matlab. proceedings of the 23rd acm international conference on multimedia non-invasive tools for the diagnosis of cutaneous melanoma can routine laboratory tests discriminate between severe acute respiratory syndrome and other causes of community-acquired pneumonia? revisiting unreasonable effectiveness of data in deep learning era computer image analysis in the diagnosis of melanoma reliability of computer image analysis of pigmented skin lesions of australian adolescents combination of features from skin pattern and abcd analysis for lesion classification computer-aided diagnosis of melanoma using border and waveletbased texture analysis the braaff checklist: a new dermoscopic algorithm for diagnosing acral melanoma. 
the british journal of dermatology ridge and furrow pattern classification for acral lentiginous melanoma using dermoscopic images innovations and developments in dermatologic non-invasive optical imaging and potential clinical applications key: cord-028792-6a4jfz94 authors: basly, hend; ouarda, wael; sayadi, fatma ezahra; ouni, bouraoui; alimi, adel m. title: cnn-svm learning approach based human activity recognition date: 2020-06-05 journal: image and signal processing doi: 10.1007/978-3-030-51935-3_29 sha: doc_id: 28792 cord_uid: 6a4jfz94 although it has been encountered for a long time, the human activity recognition remains a big challenge to tackle. recently, several deep learning approaches have been proposed to enhance the recognition performance with different areas of application. in this paper, we aim to combine a recent deep learning-based method and a traditional classifier based hand-crafted feature extractors in order to replace the artisanal feature extraction method with a new one. to this end, we used a deep convolutional neural network that offers the possibility of having more powerful extracted features from sequence video frames. the resulting feature vector is then fed as an input to the support vector machine (svm) classifier to assign each instance to the corresponding label and bythere, recognize the performed activity. the proposed architecture was trained and evaluated on msr daily activity 3d dataset. compared to state of art methods, our proposed technique proves that it has performed better. human activity recognition remains a very important research field of numerous computer science organizations because of its potency to provide adapted support for various applications such as human-computer interaction, ehealth applications and surveillance. nowadays, according to the method of feature extraction, the recognition of the human activity system can be classified as a classical or a deep model. 
a classical model is based on hand-crafted feature descriptors, which can be categorized into three types: local features, global features, or a combination of the two. global features treat the image as a whole in order to describe the motion of the entire human body, whereas local features are extracted from a set of spatio-temporal interest points (stips) to describe the image patches of a human action. although global methods are able to represent more visual information by maintaining the spatio-temporal structure of the actions occurring in the video, they are very sensitive to background variations and partial occlusions. local features consider the image as a set of small regions, which is computationally expensive in practice. on the other hand, deep models using deep neural networks are a promising alternative in image analysis applications. the convolutional neural network (cnn) is considered one of the most successful deep models for image classification tasks. traditionally, to deal with such a recognition problem, researchers had to precede their human activity recognition algorithms with a preprocessing step that extracts a set of features using different types of descriptors such as hog3d [1], extended surf [2] and space time interest points (stips) [3], before feeding them to a specific classification algorithm such as hmm, svm or random forest [4] [5] [6]. these approaches have been reported to be not very robust, owing to their poor performance and their requirements in time and memory space. recently, deep learning architectures have been employed to replace the engineered feature extraction phase with automatic processing, in which deep neural networks are applied directly to the raw data, without human intervention, to extract deep features. 
since training a new cnn from scratch requires a huge amount of data and expensive computational resources, we used the concept of transfer learning and fine-tuned the parameters of a pretrained model. the initial cnn model was trained on a subset of the ilsvrc-2015 of the large scale imagenet [7] dataset. consequently, we decreased the training time and avoided overfitting by ensuring a suitable weight initialization given the quite small dataset used. in this study, we propose an advanced human activity recognition method from video sequences using a cnn pretrained on the large scale imagenet dataset. the pretrained cnn extracts feature vectors that characterize frames from the raw data. the resulting deep sparse feature vectors are fed as input to a multi-class support vector machine algorithm to be classified. since deep neural networks are difficult to train, the residual learning approach on which the resnet model is based was proposed to facilitate the training phase. the main contribution of the present work is a learning approach for human activity recognition, based on cnn and svm, that is able to classify activities in one shot. the proposed framework is trained and tested on a publicly available dataset, i.e., the msrdailyactivity 3d dataset [8]. the obtained results show that the proposed method outperforms the state-of-the-art methods. the rest of this paper is organized as follows: sect. 2 highlights some related works; in sect. 3, we describe our proposed approach; we present the experimental evaluation in sect. 4; finally, in sect. 5, we conclude the paper. for the human activity recognition challenge, an activity has to be represented by a set of features. to represent complex activities, the authors in [9] combined the histogram of oriented gradient (hog), the motion history image (mhi) and the foreground image (fi). 
the hog feature represents the magnitude and the direction of corners and edges, mhi feature is extracted to characterize motion direction and the fi is obtained by background subtraction. finally, all the resulting features have been merged to be fed as input to a simulated annealing multiple instance learning support vector machine (smile-svm) classifier for human activity recognition. the work of [10] extracted a motion space-time feature descriptor characterizing the video frames by combining the histogram of silhouette and the optical flow values. the first feature is obtained by background subtraction and the second is calculated using the algorithm of lucas-kanade [11] inside a normalized bounding box. a multi class svm classifier has been used to classify the activities. this system was set up to face the restraints of long training time and high dimension of the feature vector. [12] investigates a two distinct stream convnets architecture that includes spatial and temporal networks. in the spatial stream, the action recognition is performed from rgb video frames, whereas in the temporal stream, the recognition of action was made from motion information obtained by stacking dense optical flow between consecutive frames. both streams are employed as convnets and are finally combined by late fusion. two fusion methods have been considered; a fusion by averaging and a fusion by multi-class linear svm on softmax scores. the purpose in [13] is to classify the human actions from videos into different classes. the process is performed by extracting interest points from each video, segmenting images and constructing motion history images. after selecting discriminating features and representing images by visual words, a histogram of visual words is elaborated based on features extracted from the motion history images. finally, the extracted features vectors are used to train a support vector machine for action classification. 
[14] proposed a system to recognize abnormal behavior and send an alert to the appropriate user on his android mobile phone. features are extracted using the scale invariant feature transform (sift) descriptor for each video after dividing it into a number of frames. the extracted features are then used as input to two different types of classifiers, i.e., the k nearest neighbor (knn) and the support vector machine (svm), to classify the actions. as recent works [12, 24, 27] have shown, deep hierarchical visual feature extractors currently outperform traditional hand-crafted descriptors, and are more generalizable and accurate when dealing with high levels of inherent noise. to describe the activities in a frame-wise way, we chose to use the cnn approach based on rgb data because of its widespread application in different areas. cnns are also advantageous in that they reduce the number of parameters and connections used in the artificial neural model, which facilitates the training phase. at this step, the question is how to represent the human actions in each frame extracted from the video. to extract the most pertinent and significant features from the raw rgb video frames, we employed a deep cnn architecture with parameters pre-trained on imagenet. the original cnn was trained on the 1.2m high-resolution images of the ilsvrc2015 classification training subset of the imagenet dataset. in the proposed method, we used a deep cnn network architecture to generate, for each input frame, a probability vector which represents the probability of the presence of the different objects in that frame. a resnet model is used with parameters pre-trained on the imagenet database and applied to extract sparse and pertinent residual representations of features from the video frames of each video sequence. 
the architecture is composed of several resnet blocks, each three layers deep, and comprises five composite convolutional layers with small kernels of size 7 × 7, 1 × 1 and 3 × 3. the network takes an input of size 224 × 224, which is reduced five times in the network by a stride of 2. an average pooling operation is applied to the final feature map of the network, followed by the fully connected layer. the vector resulting from the last pooling layer is taken as the feature representation generated by the reused pretrained model in a feedforward pass. after each convolution, batch normalization and a relu are applied. the residual units are represented as: $x_{l+1} = f\left(x_l + \mathcal{F}(x_l, W_l)\right)$ (1) where $x_l$ and $x_{l+1}$ correspond to the input and the output of the $l$-th layer, $\mathcal{F}$ denotes a nonlinear residual mapping characterized by the convolutional filter weights $W_l$, and $f$ corresponds to the relu function. the main advantage of using residual units in such networks is that their skip connections, or "shortcuts", allow the direct propagation of signals across all the layers of the network. this design is very advantageous mainly during the backpropagation phase; in fact, gradients are propagated directly from the loss layer to all the preceding layers while skipping some intermediate layers which could otherwise cause the gradient signal to deteriorate or vanish. this strategy helps the network to benefit from the accuracy gained from deeper architectures. since training a new deep cnn model from scratch requires large amounts of data and substantial computational resources, we implemented a transfer learning procedure to fine-tune the parameters of a pre-trained model. we adopted a cnn model that was pretrained on a subset of the large-scale imagenet image classification dataset. 
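the residual unit described above, in which the input is added back to the output of a residual branch before the relu, can be sketched in a few lines. this is a minimal pure-python illustration on flat vectors, not the resnet used in the paper; the `residual_fn` argument is a hypothetical stand-in for the convolutional branch.

```python
def relu(values):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in values]


def residual_unit(x, residual_fn):
    """Return relu(x + F(x)), where F is the caller-supplied residual branch."""
    fx = residual_fn(x)
    if len(fx) != len(x):
        raise ValueError("residual branch must preserve the input size")
    # the shortcut adds the unchanged input to the branch output
    return relu([a + b for a, b in zip(x, fx)])


# with a zero residual branch the unit reduces to relu of its input,
# which is why the signal can pass through the shortcut unchanged
passthrough = residual_unit([1.0, -2.0, 3.0], lambda v: [0.0] * len(v))
```

the shortcut term is what lets the signal, and during backpropagation the gradient, bypass the branch.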
proceeding in this way, we reduce the time required for training and avoid overfitting to our dataset by ensuring a good initialization of the weights, given the quite small available dataset. in addition, the dataset was artificially augmented using three techniques: first, a random left-right reflection of frames; second, a random horizontal translation, which moves frames along the horizontal direction; and finally, a random vertical translation, which moves frames along the vertical direction. the last layer of the adopted cnn model is a classification layer; in the present study, we removed this layer and used the output of the preceding layer as frame features for the classification step. in place of the removed layer, the svm classifier is employed to predict the human activity label. figure 1 summarizes the architecture of the proposed action recognition model. svm is a machine learning classifier that gives good results in comparison with other types of classifier. we decided to use it in this study because of its effectiveness when dealing with quite small datasets and its performance in high dimensional spaces [15, 16, 25-29]. the principal idea behind svm is to apply a supervised learning algorithm that finds the optimal hyperplane separating the feature space. during training, the svm generates hyperplanes in a high dimensional space to separate the training dataset into different classes. if the training data are not linearly separable, a kernel function is used to map the data to a new vector space. svm also performs well with large scale training datasets and yields accurate and effective results. consider a training dataset $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, where $x_i \in \mathbb{R}^n$ and the class memberships $y_i \in \{\pm 1\}$ represent the label corresponding to each action in the defined dataset. 
to determine a decision function for a linear classification, the separating hyperplane is represented by: $f(x) = w \cdot x + b$ (2) where $w$ is the weight vector normal to the hyperplane and $b$ is the bias. a generic hyperplane is defined by the condition: $w \cdot x + b = 0$ (3) when delimited by margins, the set of hyperplanes can be written as: $y_i (w \cdot x_i + b) \geq 1, \; i = 1, \ldots, n$ (4) to formulate the optimal hyperplane that separates the data, we should minimize: $\frac{1}{2} \lVert w \rVert^2$ (5) subject to the constraints of eq. (4). multi-class svm. even though svms were initially developed for binary classification, they can be extended to multiclass classification problems. the main strategy consists in separating the multiclass problem into many binary problems and combining the outputs of all the binary sub-classifiers to provide the final class prediction of a sample. fundamentally, there are two main methods for multiclass svm. the first is called "one-against-one" [17]; it consists in constructing one classifier per pair of classes and combining the binary classifiers into a multi-class classifier by selecting the most voted class. thus, n(n − 1)/2 binary svm classifiers are needed, each of them trained on the samples of the two corresponding classes. the second method is called "one-against-all" [27], and it considers all the classes of data in one optimization problem. in fact, for each classifier, the considered class is fitted against all the other classes, so n classes require n svm classifiers. when using the latter technique, the training process takes a long time. the "msrdailyactivity 3d" dataset [8] is an rgb sequence dataset that contains sixteen daily human activities. the database was captured by a kinect camera around various objects, and the subjects are located at different distances from the camera. activities are performed by ten different subjects, and most of them are categorized as "human object interactions". each activity was performed twice by each person in two different positions, i.e., "standing" and "sitting". 
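for the sixteen activity classes of this dataset, the one-against-one strategy described above requires n(n − 1)/2 = 120 binary classifiers, against 16 for one-against-all. the following minimal python sketch enumerates the class pairs and applies the majority vote; the function names are assumptions made for illustration, and the binary svms themselves are left out.

```python
from collections import Counter
from itertools import combinations


def one_vs_one_pairs(classes):
    """List every unordered pair of classes; one binary classifier per pair."""
    return list(combinations(classes, 2))


def majority_vote(pair_winners):
    """Return the class that won the most pairwise decisions."""
    return Counter(pair_winners).most_common(1)[0][0]


pairs = one_vs_one_pairs(range(16))  # sixteen activity classes
n_one_vs_one = len(pairs)            # n(n - 1)/2 pairwise classifiers
n_one_vs_all = 16                    # one classifier per class

# hypothetical winners reported by three pairwise classifiers for one sample
predicted = majority_vote(["sit down", "stand up", "sit down"])
```

the pairwise scheme trains more classifiers, but each one only sees the samples of its two classes.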
the deep cnn model was trained using matlab 2018. our cnn-based approach was run on a machine equipped with an nvidia geforce 960m gpu, 64 gb of memory and an intel core i7-6700 hq (2.60 ghz) processor. our dataset was artificially augmented, which helps to avoid overfitting to the dataset. each video from our dataset was split into frames, which serve as input to the pre-trained cnn model. in the training stage, the selected 224 × 224 frame is randomly reflected; it then undergoes a random horizontal and vertical translation. these operations are applied in such a way that the training dataset is augmented at each iteration. the 2048-dimensional vectors resulting from the last pooling layer of the resnet model were extracted for the training and testing subsets and used as training and test data for the multi-class svm classifier. the training process is performed using mini-batch stochastic gradient descent with a momentum of 0.9 to learn the network weights. at each iteration, a mini-batch of 50 samples is constructed by sampling 50 training videos, from each of which a single frame is selected randomly. during our experimentation, the learning rate is initially set to 1e-4 and the network is trained for 6 epochs. we also tried increasing the number of epochs, but this always led to overfitting. for our multi-class svm classifier, we chose to employ the linear kernel function. a kernel function serves to project the original linear or nonlinear dataset into a higher dimensional space in order to make it linearly separable and to give a better performance for the svm; the linear kernel is the simplest case and operates directly in the original feature space. the linear kernel is a simple kernel function based on the penalty parameter c. during experimentation, we evaluated our method on the dataset described above: 70% of the data are used for the training stage and 30% for testing. first, each frame is resized to 224 × 224 resolution. 
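the augmentation step described above (a random reflection followed by random horizontal and vertical translations) can be sketched as follows. the authors report using matlab; this pure-python version on nested lists is only an illustrative stand-in, and the shift range, the zero fill value and the 0.5 reflection probability are assumptions.

```python
import random


def reflect_left_right(frame):
    """Mirror each row of the frame."""
    return [row[::-1] for row in frame]


def translate(frame, dx, dy, fill=0):
    """Shift by dx columns (right if positive) and dy rows (down if positive)."""
    height, width = len(frame), len(frame[0])
    out = [[fill] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            nr, nc = r + dy, c + dx
            if 0 <= nr < height and 0 <= nc < width:
                out[nr][nc] = frame[r][c]
    return out


def augment(frame, rng, max_shift=1):
    """Apply a random reflection, then a random translation."""
    if rng.random() < 0.5:
        frame = reflect_left_right(frame)
    dx = rng.randint(-max_shift, max_shift)
    dy = rng.randint(-max_shift, max_shift)
    return translate(frame, dx, dy)


frame = [[1, 2], [3, 4]]
augmented = augment(frame, random.Random(0))
```

because the transform is drawn again at every call, the same frame yields different training inputs across iterations.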
we computed the confusion matrix of our proposed system in order to show the correspondence between the predicted labels along the x-axis and the true labels along the y-axis and to represent the recognition performance for each action class in the msrdailyactivity 3d dataset. generally, a confusion matrix involves four groupings: tp (true positive) refers to the instances that are correctly identified as positive, fp (false positive) refers to the negative examples incorrectly identified as positive, tn (true negative) refers to the negative instances that are correctly predicted as negative, and fn (false negative) represents the positive instances incorrectly predicted as negative. we also evaluate different performance metrics of our proposed approach by calculating the precision, recall and f-measure values as shown in table 1. figure 2 shows that the most confusion is between the sit down and stand up labels. this misclassification can be explained by the similarity of a few steps in carrying out the two actions, both of which contain a person in a half-sitting position. in fact, the middle frames of the two classes sit down and stand up, which present a person in a half-sitting position, cause the confusion because they are repeated in both cases. in contrast, more than half of the classes have been correctly classified at 100%. table 2 shows that our approach has achieved a good recognition performance and outperforms other state-of-the-art methods on the msrdailyactivity 3d dataset. the achieved performance supports the generalization ability of our learned representations across domains. the work of [18] obtained poor results on this dataset although it was based on the combination of two deep neural network models, the cnn and the lstm; however, its cnn model for feature extraction is not based on the transfer learning concept. 
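the precision, recall and f-measure mentioned above follow directly from the tp, fp and fn counts of a confusion matrix whose rows are true labels and whose columns are predicted labels. the following is a minimal python sketch with a hypothetical two-class matrix; it does not reproduce the values of table 1.

```python
def per_class_metrics(confusion, k):
    """Precision, recall and f-measure of class k.

    confusion[i][j] is the number of samples of true class i predicted as j.
    """
    tp = confusion[k][k]
    fn = sum(confusion[k]) - tp                 # class k predicted as another class
    fp = sum(row[k] for row in confusion) - tp  # another class predicted as k
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# hypothetical counts: rows are true labels, columns are predicted labels
confusion = [[8, 2],
             [1, 9]]
precision, recall, f_measure = per_class_metrics(confusion, 0)
```

for class 0 of this toy matrix, tp = 8, fn = 2 and fp = 1, giving a precision of 8/9, a recall of 0.8 and an f-measure of 16/19.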
based on all these observations, we can deduce that pretraining a model on a large-scale dataset and fine-tuning its parameters on a small one is very efficient for obtaining a good performance rate. we have also combined the same pretrained resnet model used to extract features with a multi-layer perceptron (mlp) classifier and with a long short term memory (lstm) network. the obtained results show that using a multi-class svm classifier gives the best result. in order to investigate the effect of the choice of the svm kernel, we have performed a classification using the radial basis function (rbf) kernel. the rbf kernel brought no improvement, which we attribute to the relevance of the feature representation obtained from the convolutional neural network. in this study, we presented a support vector machine approach for the human activity recognition task. we proposed to use a pre-trained cnn based on the resnet model in order to extract spatial and temporal features from consecutive video frames. our proposed architecture was trained and tested on the msrdailyactivity 3d dataset and it achieved a good recognition performance. in future work, we propose to use a combination of a genetic algorithm with support vector machines in order to optimize the weights of the cnn model and thereby improve the performance automatically. likewise, we would like to extend the proposed model to larger-scale datasets such as ntu rgb+d, because the dataset used here is small and the pretrained cnn model can be more effective when applied to a bigger one. 
a spatio-temporal descriptor based on 3d-gradients an efficient dense and scale-invariant spatio-temporal interest point detector behavior recognition via sparse spatio-temporal features multi-sensor fusion for human daily activity recognition in robot-assisted living recognizing human activities from smartphone sensor signals unintrusive eating recognition using google glass imagenet large scale visual recognition challenge mining actionlet ensemble for action recognition with depth cameras action detection in complex scenes with spatial and temporal ambiguities faster human activity recognition with svm an iterative image registration technique with an application to stereo vision two-stream convolutional networks for action recognition in videos human action recognition: a construction of codebook by discriminative features selection approach human activity recognition on real time and offline dataset support-vector networks improving accuracy of intrusion detection model using pca and optimized svm support vector machines: a recent method for classification in chemometrics convolutional neural networks and long short-term memory for skeleton-based human activity and hand gesture recognition spatio-temporal rgbd cuboids feature for human activity recognition spatio-temporal depth cuboid similarity feature for activity recognition using depth camera learning actionlet ensemble for 3d human action recognition multimodal multipart learning for action recognition in depth videos action recognition from depth maps using deep convolutional neural networks deep neural network features for horses identity recognition using multiview horses' face pattern neural approach for context scene image classification based on geometric, texture and color information relidss: novel lie detection system from speech signal deepcolorfasd: face anti spoofing solution using a multi channeled color spaces cnn human gait identity recognition system based on gait pal and pal entropy 
(gppe) and distances features fusion towards human behavior recognition based on spatio temporal features and support vector machines key: cord-286887-s8lvimt3 authors: nour, majid; cömert, zafer; polat, kemal title: a novel medical diagnosis model for covid-19 infection detection based on deep features and bayesian optimization date: 2020-07-28 journal: appl soft comput doi: 10.1016/j.asoc.2020.106580 sha: doc_id: 286887 cord_uid: s8lvimt3 a pneumonia of unknown causes, which was detected in wuhan, china, and spread rapidly throughout the world, was declared as coronavirus disease 2019 (covid-19). thousands of people have lost their lives to this disease. its negative effects on public health are ongoing. in this study, an intelligence computer-aided model that can automatically detect positive covid-19 cases is proposed to support daily clinical applications. the proposed model is based on the convolution neural network (cnn) architecture and can automatically reveal discriminative features on chest x-ray images through its convolution with rich filter families, abstraction, and weight-sharing characteristics. contrary to the generally used transfer learning approach, the proposed deep cnn model was trained from scratch. instead of the pre-trained cnns, a novel serial network consisting of five convolution layers was designed. this cnn model was utilized as a deep feature extractor. the extracted deep discriminative features were used to feed the machine learning algorithms, which were k-nearest neighbor, support vector machine (svm), and decision tree. the hyperparameters of the machine learning models were optimized using the bayesian optimization algorithm. the experiments were conducted on a public covid-19 radiology database. the database was divided into two parts as training and test sets with 70% and 30% rates, respectively. 
as a result, the best results were obtained by the svm classifier with an accuracy of 98.97%, a sensitivity of 89.39%, a specificity of 99.75%, and an f-score of 96.72%. consequently, a cheap, fast, and reliable intelligence tool has been provided for covid-19 infection detection. the developed model can be used to assist field specialists, physicians, and radiologists in the decision-making process. thanks to the proposed tool, the misdiagnosis rates can be reduced, and the proposed model can be used as a retrospective evaluation tool to validate positive covid-19 infection cases. covid-19, a new type of coronavirus, has created a critical and chaotic situation, causing a large number of deaths and negatively affecting people's lives worldwide. it first appeared in wuhan, china, in december 2019. it has spread to approximately 200 countries worldwide. in many countries, rulers and governments have taken new measures and created new lifestyles to combat covid-19. today's science and technology have made an extremely valuable contribution to the implementation of these new state policies in this unknown and unpredictable process. as an example of technological developments, robots and drones have been used to transport food and medicines to hospitals [1, 2]. while many researchers in the medical field develop vaccines to prevent the virus, many medicines and medical practices are being developed to heal infected patients and prevent them from passing it on to others [3]. on the other hand, artificial intelligence and computer scientists have proposed and implemented real-life hybrid systems based on x-ray images and computed tomography (ct) to detect covid-19. these artificial intelligence (ai) applications have been successfully applied in many areas [4]. the studies carried out in the literature are also given in the form of a table to provide a more detailed description. 
some studies and diagnostic methods regarding covid-19 in the literature are briefly summarized below. in y. pathak et al. study [5] , they used chest computed tomography (ct) images and deep transfer learning (dtl) method to detect covid-19 and obtained a high diagnostic accuracy. mesut toğaçar et al. proposed a novel hybrid method called the fuzzy color technique + deep learning models (mobilenetv2, squeezenet) with a social mimic optimization method to classify the covid-19 cases and achieved high success rate in their work [6] . in the ali abbasian ardakani et al. work [7] , they used the deep learning models including alexnet, vgg-16, vgg-19, squeezenet, googlenet, mobilenet-v2, resnet-18, resnet-50, resnet-101, and xception to diagnose the covid-19 and compared them with each other with respect to the obtained classification accuracy. ferhat ucar et al. proposed a novel method called deep bayes-squeezenet based covidiagnosis-net to classify the covid-19 cases as the covid-19 or normal (healthy) [8] . as for other work of tulin ozturk et al. [9] , they suggested a new method called the darkcovidnet model for diagnosing the covid-19 cases. table 1 presents the conducted works regarding covid-19 detection and diagnosis in the literature. contributions of the proposed model can be listed as follows: (1) cnns with rich filter family, convolution, abstraction, and weight sharing have ensured an effective deep feature extraction engine. (2) the deep features extracted from deep layers of cnns have been applied as the input to machine learning models to further improve covid-19 infection detection. (3) as a result, a cheap, fast, and reliable intelligence tool has been provided for covid-19 infection detection. (4) the developed model can be used to assist the field specialists, physicians, and radiologists in the decision-making process. 
(5) thanks to this study, the misdiagnosis rates can be reduced, and the proposed model can be used as a retrospective evaluation tool. the rest of this study is organized as follows: the dataset and the related methods are presented in section 2. the results are reported in section 3. a discussion is presented in section 4, and lastly, concluding remarks are given in section 5. not only the structures of the samples in a database but also the distribution of the recordings among the classes have a great impact on the model to be developed. the morphological features, color, shape, and texture-based features directly affect the achievements of the intelligence computer-aided models [16] . besides, it is important to ensure an equal number of samples, which cover all situations or cases for each class to produce a consistent and robust model. recently, many studies have pointed out that chest ct images can be a vital evaluation means for diagnosing covid-19 infection [6] [7] [8] [9] . several specific patterns, including bilateral, peripheral and basal predominant ground-glass opacity (ggo), multifocal patchy consolidation, crazy-paving pattern with a peripheral distribution observed on chest ct images have been adopted as the findings of covid-19 infection [17] [18] [19] . a subsample of the recordings belonging to covid-19, normal and viral pneumonia classes is shown in fig. 1 . an open-access database that covers the posterior-to-anterior chest x-ray images was used in this study [20] . in fact, the covid-19 radiology database was generated by collecting the samples from four different resources. in other words, the samples collected from the italian society of medical and interventional radiology (sirm) covid-19 database [21] , novel corona virus 2019 dataset [22] , covid-19 positive chest x-ray images from different articles and lastly chest x-ray [23] pneumonia images were combined. 
in total, 2905 images in three classes are presented in this database, as shown in table 2 . cnns are architectures consisting of a large number of sequenced layers. layers that perform different functions are used in these architectures to reveal the distinctive features of the data applied as input [24] . in general, the tasks of these layers can be summarized as follows: (1) convolution layer: this layer is the main building block of cnn architectures, and it is used to reveal the discriminative features of the input data. this layer applies a family of filters to the data so as to reveal low- and high-level features in the data [25] . after the convolution process, the size of the input data changes. these changes vary depending on the stride and padding. the outputs of the convolution layers are called activation maps and are defined as follows:

$$x_j^{l} = f\left(\sum_{i \in M_j} x_i^{l-1} * k_{ij}^{l} + b_j^{l}\right) \quad (1)$$

the convolution process is defined as in eq. (1). herein, the activation maps of the previous layer are shown with $x_i^{l-1}$, the learnable kernels are $k_{ij}^{l}$, and the bias term is $b_j^{l}$. $M_j$ matches the input map selection. (2) non-linearity layer: the convolution layer is ordinarily followed by the non-linearity layer. this layer gives the system a non-linearity feature and is also called the activation layer. without it, the neural network would act as a single perceptron, because its outputs could be calculated using linear combinations; activation functions are therefore used [26] . to this aim, the most commonly used activation function is the rectifier (relu), and it is defined as follows:

$$f(x) = \max(0, x) \quad (2)$$

(3) pooling (down-sampling) layer: this layer is often added between consecutive convolutional layers to reduce the number of computational nodes. average pooling, maximum pooling, and l2-norm pooling are used frequently. (4) flattening layer: this layer collects the data in a single vector and prepares the data for the neural network.
(5) fully-connected layers: these layers transfer the activations obtained by passing the data through the network to the next unit. fully connected layers are located at the end of the architecture to ensure the connections between all activations and computational nodes in these layers [27, 46, 47, 48] . these layers are exploited when the cnns are used as feature extractors. table 3 . offline or online data augmentation techniques can be used to realize a more efficient training for the computational models [24] . however, it is essential to be aware that the data augmentation techniques should not be used on the test set because of the overfitting problem. in the experiment, the whole data set was divided into two parts as the training and test sets with 70% and 30% rates, respectively. the distribution of the samples over the classes is imbalanced. to overcome this issue, the data augmentation approach has been used. to this aim, we focused only on the covid-19 class, since the number of samples in this class was lower compared to the other classes, as shown in table 3 . the overall block diagram of the proposed model is given in fig. 4 . the whole dataset is divided into two sets as training and test sets with 70% and 30% rates, respectively. only the number of samples in the covid-19 class is increased by using the offline data augmentation approach, and then the proposed cnn model is trained and tested. then, the deep features extracted from the proposed cnn model are considered. a combination of deep feature extraction and machine learning techniques is utilized to achieve a consistent and robust model for covid-19 infection diagnosis. three different classification algorithms have been used for covid-19 infection detection in this study. these classification algorithms are different in structure and have high performance.
each classifier algorithm was trained and tested using the 70-30% training and testing data partition. the classifier algorithms used are explained in the following subsections. support vector machines (svm) is a supervised machine learning algorithm that can be [...] (3) where [...] is the distance between data points of [...] and [...]. for more information about the multi-class-svm classifier, the readers can refer to [28] [29] [30] . the decision tree classifier is mostly used to solve simple classification problems. it applies the correct way to solve the classification problem. the decision tree classifier has a structure consisting of roots, leaves, and branches descending from top to bottom. the most used decision tree classification algorithms are id3, c4.5, and c5. in our applications, we have used the c4.5 decision tree classifier. for more information about the decision tree classifier, the readers can refer to [31-33, 44, 45, 49 ]. the k-nn (k-nearest neighbor) algorithm is one of the simplest and most widely used classification algorithms. k-nn is a non-parametric, lazy learning algorithm. unlike eager learning, lazy learning does not have a training phase. it does not learn the training data; instead, it "memorizes" the training data set. when we want to make an estimate, it looks for the nearest neighbors in the whole dataset [34] . when the algorithm runs, a k value is determined. this value is the number of neighbors to be examined. when a new sample arrives, the distances between the new sample and the stored samples are calculated, and the k nearest elements are taken. the euclidean function is generally used in distance calculation. as an alternative to the euclidean function, city block, minkowski, and chebychev functions can also be used [35] . after the distances are calculated, they are sorted, and the incoming sample is assigned to the appropriate class.
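the k-nn procedure described above (compute distances, sort, take the k nearest, vote) can be sketched as follows. this is a plain-python illustration, not the matlab classifier used in the study; the function names and the toy feature vectors are hypothetical, and majority voting is assumed as the rule for assigning the class.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def city_block(a, b):
    """City block (Manhattan) distance, one of the alternatives named in the text."""
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_predict(train_x, train_y, query, k=3, distance=euclidean):
    """Classify `query` by majority vote among its k nearest stored samples."""
    if not 1 <= k <= len(train_x):
        raise ValueError("k must be between 1 and the number of stored samples")
    # "lazy" learning: nothing is fitted; the stored samples are ranked at query time
    ranked = sorted(zip(train_x, train_y), key=lambda pair: distance(pair[0], query))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# toy, made-up 2-d "deep features" for illustration only
features = [(0, 0), (0, 1), (1, 0), (5, 5), (6, 5), (5, 6)]
classes = ["normal"] * 3 + ["covid"] * 3
prediction = knn_predict(features, classes, (0.2, 0.1), k=3)
```

the value of k and the distance function are exactly the hyperparameters that a search procedure such as bayesian optimization would tune; an odd k avoids tied votes in the two-class case.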
the parameters in the k-nn classifier have been optimized by using the bayesian optimization method in our study. to evaluate the proposed model, we have used the confusion matrix and some commonly used performance metrics, namely accuracy (acc), sensitivity (se), specificity (sp), and f-score. the experiments were carried out on a workstation with an intel ® xeon ® gold 6132 cpu @2.60 ghz and an nvidia quadro p6000 gpu. the simulation environment was matlab. as for prediction, the confusion matrix is given in fig. 6 (a) . as mentioned before, the test set was separated and frozen at the start of the experiment. the number of samples belonging to the covid-19 class in the test set was 66, and 59 of these samples were identified correctly by the proposed cnn model. the classification rates for normal and viral pneumonia cases were rather satisfactory. the final validation accuracy and final validation loss were 97.25% and 0.2032, respectively. the se, sp, and f-score were 94.61%, 98.29%, and 95.75%, respectively. the roc curves of the proposed cnn model are also presented in fig. 6 (b) . the aucs were 0.9942, 0.9956, and 0.9955 for covid-19, normal, and viral pneumonia cases, respectively. as a result, an efficient cnn model was ensured for the diagnosis of covid-19 infection. in the second step of the experiment, we focused on the activation maps in the proposed cnn architecture. these activation maps at different levels keep the discriminative features of the input data and are finally collected in the fully connected layers. the activations may help us to understand what the model has learned. a visual representation of the activation maps is given in fig. 7 . [...] and set to 6 for the fc1 deep feature set, as shown in fig. 9 (e). the acc, se, sp, and f-score were 93.35%, 90.55%, 96.29%, and 90.06%, respectively. in addition, the best estimated feasible point considering the dt algorithm was 675 for the fc2 deep feature set, as shown in fig. 9 (f). the acc was 96.10%, se was 93.81%, sp was 97.70%, and f-score was 94.56%.
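as a worked check, the counts reported above for the covid-19 class of the cnn model (66 test samples, 59 identified correctly) give the per-class sensitivity directly, since the remaining 7 samples are false negatives. this only restates the arithmetic implied by those two counts; the symbol on the left-hand side is introduced here for illustration.

```latex
\mathrm{Se}_{\text{covid-19}} = \frac{TP}{TP + FN} = \frac{59}{59 + 7} = \frac{59}{66} \approx 0.8939
```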
all scores of the classifiers are reported in table 4 , considering the two different deep feature sets. the svm classifier was superior to the k-nn and dt machine learning algorithms. it was seen that the svm model ensured an improvement in the automated covid-19 infection detection task. in contrast, it was observed that the classification performance slightly decreased when the classification task was performed by k-nn and dt. in this section, we evaluate the strengths as well as the limitations of the proposed model by taking into account the state-of-the-art models. however, it is important to be aware that a one-to-one comparison is not feasible due to differences in datasets, methods, and simulation environments. these datasets were collected from different resources, as inferred from table 5 . recently, the scientific community has focused on chest x-ray images in order to contribute to the clinical evaluation of covid-19 cases, which have increased day by day. many computational models based on cnn architecture have been proposed. the greatest advantage of these models is that they provide an end-to-end learning scheme that removes the need for a handcrafted feature engine. to this aim, the transfer learning approach has been generally adopted to train the cnns. some of the computational studies have focused on the deep features provided by the pre-trained models [39] . in this respect, our study offers a novel cnn model that was trained from scratch rather than with a transfer learning approach. also, instead of using pre-trained cnns, the fully-connected layers in the proposed architecture were considered, examined, and used for the covid-19 infection detection task. our study contains innovative components in this respect. besides, the proposed model works according to the end-to-end learning principle, and a handcrafted feature extraction engine is not required. it can be argued that the database is not large enough.
however, we think that there is nothing to worry about on this issue, because the performance of cnn networks increases with the number of samples used in the training process; in such a case, it is only necessary to consider the calculation time and hardware resources. another important issue is that when the positive covid-19 cases are detected using x-ray images, the infection may have already significantly advanced. in other words, x-ray images may be a very significant means to confirm positive covid-19 cases, but may not be clinically relevant for early diagnosis. general public health, the global economy, and our routine life continue with new norms under the effect of covid-19. the number of people affected by this infection is still increasing significantly. in this study, an automated covid-19 diagnostic system has been proposed to contribute to clinical trials. the proposed model is based on the cnn architecture, and it is trained from scratch, as opposed to the transfer learning approach. thanks to its convolution with rich filter families, abstraction, and weight sharing features, it automatically provides highly efficient, deep, distinctive features. thus, handcrafted feature extraction is not performed. as a result, positive covid-19 cases can be detected easily and with high sensitivity via the proposed tool using chest x-ray images. as a result of this study, a cheap, fast, and reliable diagnostic tool was obtained. the model provided an accuracy of 98.97%, a sensitivity of 89.39%, a specificity of 99.75%, and an f-score of 95.75%. when it is evaluated clinically, the developed model can support the decision-making processes of field specialists, physicians, and radiologists. with this model, the misdiagnosis rate can be reduced, and positive covid-19 cases can be detected quickly without having to wait for days.
estimation of covid-19 prevalence in italy epidemiological and clinical characteristics of 99 cases of 2019 novel coronavirus pneumonia in wuhan, china: a descriptive study death and contagious infectious diseases: impact of the covid-19 virus on stock market returns ai-assisted ct imaging analysis for covid-19 screening: building and deploying a medical ai system in four weeks deep transfer learning based classification model for covid-19 disease covid-19 detection using deep learning models to exploit social mimic optimization and structured chest x-ray images using fuzzy color and stacking approaches application of deep learning technique to manage covid-19 in routine clinical practice using ct images: results of 10 convolutional neural networks covidiagnosis-net: deep bayes-squeezenet based diagnosis of the coronavirus disease 2019 (covid-19) from x-ray images automated detection of covid-19 cases using deep neural networks with x-ray predicting the growth and trend of covid-19 pandemic using machine learning and cloud computing an automated residual exemplar local binary pattern and iterative relieff based corona detection method using lung x-ray image diagnosis of coronavirus disease 2019 (covid-19) with structured latent multi-view representation learning a weakly-supervised framework for covid-19 classification and lesion localization from chest ct deep learning covid-19 features on cxr using limited training data sets data augmentation using auxiliary classifier gan for application of breast cancer diagnosis based on a combination of convolutional neural networks, ridge regression and linear discriminant analysis using invasive breast cancer images processed with autoencoders essentials for radiologists on covid-19: an update-radiology scientific expert panel clinical characteristics and diagnostic challenges of pediatric covid-19: a systematic review and meta-analysis co-rads -a categorical ct assessment scheme for patients with suspected covid-19: 
definition and evaluation covid-19 radiology database. can ai help screen viral radiology is of m and i. italian society of medical and interventional radiology n covid-19 image data collection identifying medical diagnoses and treatable diseases by image-based deep learning convolutional neural network approach for automatic tympanic membrane detection and classification brainmrnet: brain tumor detection using magnetic resonance images with a novel convolutional neural network model waste classification using autoencoder network with integrated feature selection method in convolutional neural network models computer-aided diagnosis system combining fcn and bi-lstm model for efficient breast cancer detection from histopathological images support-vector networks a comprehensive survey on support vector machine in data mining tasks: applications & challenges support vector machines for classification bt -efficient learning machines: theories, concepts, and applications for engineers and system designers a survey of decision tree classifier methodology decision trees instance-based learning algorithms nearest neighbours without k: a classification formalism based on probability an improved k-nearest neighbor classification using genetic algorithm application of deep learning for fast detection of covid-19 in x-rays using ncovnet covid-19: automatic detection from x-ray images utilizing transfer learning with convolutional neural networks covid-net: a tailored deep convolutional neural network design for detection of covid-19 cases from chest x-ray images detection of coronavirus disease (covid-19) based on deep features covidx-net: a framework of deep learning classifiers to diagnose covid-19 in x-ray images automatic detection of coronavirus disease (covid-19) using x-ray images and deep convolutional neural networks deep learning enables accurate diagnosis of novel coronavirus (covid-19) with ct images a deep learning algorithm using ct images to screen for corona 
virus disease (covid-19) binary particle swarm optimization (bpso) based channel selection in the eeg signals and its application to speller systems deep learning applications for hyperspectral imaging: a systematic review hybrid computerized method for environmental sound classification time-frequency representation and convolutional neural network based emotion recognition robust approach based on convolutional neural networks for identification of focal eeg signals surface emg signals and deep transfer learningbased physical action classification key: cord-354819-gkbfbh00 authors: islam, md. zabirul; islam, md. milon; asraf, amanullah title: a combined deep cnn-lstm network for the detection of novel coronavirus (covid-19) using x-ray images date: 2020-08-15 journal: inform med unlocked doi: 10.1016/j.imu.2020.100412 sha: doc_id: 354819 cord_uid: gkbfbh00 nowadays, automatic disease detection has become a crucial issue in medical science due to rapid population growth. an automatic disease detection framework assists doctors in the diagnosis of disease and provides exact, consistent, and fast results and reduces the death rate. coronavirus (covid-19) has become one of the most severe and acute diseases in recent times and has spread globally. therefore, an automated detection system, as the fastest diagnostic option, should be implemented to impede covid-19 from spreading. this paper aims to introduce a deep learning technique based on the combination of a convolutional neural network (cnn) and long short-term memory (lstm) to diagnose covid-19 automatically from x-ray images. in this system, cnn is used for deep feature extraction and lstm is used for detection using the extracted feature. a collection of 4575 x-ray images, including 1525 images of covid-19, were used as a dataset in this system. the experimental results show that our proposed system achieved an accuracy of 99.4%, auc of 99.9%, specificity of 99.2%, sensitivity of 99.3%, and f1-score of 98.9%. 
the system achieved the desired results on the currently available dataset, which can be further improved when more covid-19 images become available. the proposed system can help doctors to diagnose and treat covid-19 patients easily. covid-19 can range from a cold to fever, shortness of breath, and acute respiratory syndrome [3] . in comparison to sars, the kidneys and liver are affected by coronavirus as well as the respiratory system [4] . coronavirus detection at an early stage plays a vital role in controlling covid-19 due to its high transmissibility. the diagnosis of coronavirus by gene sequencing for respiratory or blood samples should be confirmed as the main pointer for reverse transcription-polymerase chain reaction (rt-pcr), according to the guidelines of the chinese government [5] . the process of rt-pcr takes 4-6 hours to get results, which is a long time compared to covid-19's rapid spread rate. rt-pcr test kits are in huge shortage, in addition to being inefficient [6] . as a result, many infected patients cannot be detected in time and tend to unknowingly infect others. with the detection of this disease at an early stage, the prevalence of covid-19 disease will decrease [7] . to alleviate the inefficiency and scarcity of current covid-19 tests, a lot of effort has been made to look for alternative test methods. another visualization method is to diagnose covid-19 infections using radiological images such as x-rays or computed tomography (ct). earlier works have shown that anomalies can be found in chest ct scans of covid-19 patients in the shape of ground-glass opacities [8] . the researchers have claimed that a system based on chest ct scans can be an important method for diagnosing and quantifying covid-19 cases [9] . many researchers have demonstrated various approaches to detect covid-19 utilizing x-ray images.
recently, computer vision [10] , machine learning [11, 12, 13] , and deep learning [14, 15] have been used to automatically diagnose several diverse ailments in the human body, which ensures smart healthcare [16, 17] . the deep learning method is used as a feature extractor that enhances classification accuracies [18] . the detection of tumor regions in the lungs, x-ray bone suppression, diabetic retinopathy, prostate segmentation, skin lesions, and the presence of the myocardium in coronary ct scans are examples of the contributions [19, 20] of deep learning. therefore, this paper aims to propose a deep learning based system that combines the cnn and lstm networks to automatically detect covid-19 from x-ray images. in the proposed system, cnn is used for feature extraction and lstm is used to classify covid-19 based on those features. the lstm network has an internal memory that is capable of learning from imperative experiences with long-term states. in fully connected networks, adjacent layers are fully connected, whereas the nodes within a layer are not connected to each other and process only one input. in the case of lstm, the nodes form a directed graph along a temporal sequence, so the input is treated as a sequence with a specific order [21] . hence, the combination of the 2-d cnn and lstm layout improves classification greatly. the dataset used for this paper was collected from multiple sources, and preprocessing was performed to reduce the noise. in the following, the contributions of this research are summarized. a) developing a combined deep cnn-lstm network to automatically assist the early diagnosis of patients with covid-19 efficiently. b) detecting covid-19 from chest x-rays using a dataset comprising 4575 images.
c) a detailed experimental analysis is provided in terms of accuracy, sensitivity, specificity, f1-score, a confusion matrix, and auc using receiver operating characteristic (roc) to measure the performance of the proposed system. the paper is organized as follows: a review of recent scholarly works related to this study is described in section 2. a description of the proposed system, including dataset collection and preparation, is presented in section 3. the experimental results and comparative performance of the proposed deep learning system are provided in section 4. the discussion is given in section 5. section 6 concludes the paper. to address the covid-19 epidemic, researchers have developed deep learning techniques to diagnose covid-19 based on clinical images, ct scans, and x-rays of the chest. this review describes the recently developed systems that have applied deep learning techniques in the field of covid-19 detection. rahimzadeh et al. [22] developed a concatenated cnn based on xception and resnet50v2 models that classified covid-19 cases using chest x-rays. fig. 1 illustrates the overall system for the detection of covid-19, which consists of several phases. raw x-ray images were first passed through the preprocessing pipeline. data resizing, shuffling, and normalization were done in the preprocessing pipeline. the preprocessed data set was then partitioned into a training set and a testing set, and we trained the cnn and cnn-lstm architectures using the training data. after each epoch, the training accuracy and loss were determined. at the same time, using 5-fold cross-validation, validation accuracy and loss were also obtained. the performance of the proposed system was measured with the following evaluation metrics: confusion matrix, accuracy, auc using roc, specificity, sensitivity, and f1-score.
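the shuffling, normalization, and 5-fold cross-validation steps mentioned above can be sketched in plain python. this is an illustrative sketch only, not the authors' keras pipeline: `normalize` and `k_fold_indices` are hypothetical helper names, and scaling pixel values by 255 is an assumed normalization, since the text does not state which one was used.

```python
import random


def normalize(pixels, max_value=255.0):
    """Scale raw pixel intensities to [0, 1] (assumed normalization scheme)."""
    return [p / max_value for p in pixels]


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) pairs for k-fold cross-validation.

    Indices are shuffled once, dealt into k folds, and each fold serves
    as the validation set exactly once.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        val_idx = folds[i]
        train_idx = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train_idx, val_idx


# example: 10 samples, 5 folds, so every validation fold holds 2 samples
splits = list(k_fold_indices(10, k=5))
```

each sample appears in exactly one validation fold, so the validation accuracy and loss averaged over the folds use every training sample once.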
as the emergence of covid-19 is very recent, none of the large repositories contain any covid-19 labeled data, thereby requiring us to rely on different sources of images of normal, pneumonia, and covid-19 cases. first, 613 x-ray images of covid-19 cases were collected from the following websites: github [35, 36] , radiopaedia [37], the cancer imaging archive (tcia) [38] , and the italian society of radiology (sirm) [39] . then, instead of augmenting the data independently, a dataset containing 912 already augmented images was collected from mendeley [40] . finally, 1525 images of pneumonia cases and 1525 x-ray images of normal cases were collected from the kaggle repository [41] and nih dataset [42] . the main objective of the dataset selection was to make it available to the public so that it is accessible and extensible to researchers. the use of this dataset in further studies may also enable more efficient diagnoses of covid-19 patients. we resized the images to a resolution of 224 × 224 pixels. the partition of the x-ray images into each set is given in table 1 . the visualization of x-ray images of each class is shown in fig. 2 . the proposed architecture was developed with a combination of a convolutional neural network (cnn) and a long short-term memory (lstm) network, which are briefly described as follows. a cnn is a particular type of multilayer perceptron, but a simple neural network cannot learn complex features, unlike a deep learning architecture. cnns have shown excellent performance in many applications [43, 44] , such as image classification, object detection, and medical image analysis. the main idea behind a cnn is that it can obtain local features from high layer inputs and transfer them to lower layers for more complex features. a cnn comprises convolutional, pooling, and fully connected (fc) layers. a typical cnn architecture along with these layers is depicted in fig.
3 . the convolutional layer encompasses a set of kernels [45] for determining a tensor of feature maps. these kernels convolve an entire input using "stride(s)" so that the dimensions of an output volume become integers [46] . the dimensions of an input volume decrease after the convolutional layer is used to execute the striding process. therefore, zero padding [47] is required to pad an input volume with zeros and maintain the dimensions of an input volume with low-level features. the operation of the convolutional layer is given as:

$$F(i, j) = (I * K)(i, j) = \sum_{m} \sum_{n} I(i + m, j + n)\, K(m, n) \quad (1)$$

where $I$ refers to the input matrix, $K$ denotes a 2d filter of size $m \times n$, and $F$ represents the output of a 2d feature map. the operation of the convolutional layer is denoted by $I * K$. to increase nonlinearity in feature maps, the rectified linear unit (relu) layer is used [48] . relu computes the activation by thresholding the input at zero. it is mathematically expressed as follows:

$$f(x) = \max(0, x) \quad (2)$$

the pooling layer [49] performs a downsampling of a given input dimension to reduce the number of parameters. max pooling is the most common method, which produces the maximum value in an input region. the fc layer [50] is used as a classifier that makes a decision on the basis of features obtained from the convolutional and pooling layers. long short-term memory is an improvement of recurrent neural networks (rnns). lstm uses memory blocks instead of conventional rnn units to solve the vanishing and exploding gradient problem [51] . it also adds a cell state to save long-term states, which is its main difference from rnns. an lstm network can remember and connect previous information to data obtained in the present [52] . an lstm comprises three gates, namely an input gate, a "forget" gate, and an output gate, where $x_t$ refers to the current input; $C_t$ and $C_{t-1}$ denote the new and previous cell states, respectively; and $h_t$ and $h_{t-1}$ are the current and previous outputs, respectively.
the principle of the input gate of lstm is shown in the following forms:

$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \quad (3)$$

$$\tilde{C}_t = \tanh(W_i \cdot [h_{t-1}, x_t] + b_i) \quad (4)$$

$$C_t = f_t C_{t-1} + i_t \tilde{C}_t \quad (5)$$

where $i_t$ refers to a sigmoid output, and $\tilde{C}_t$ refers to a tanh output. here, $W_i$ denotes weight matrices, and $b_i$ represents the input gate bias of lstm. the lstm's forget gate then allows the selective passage of information using a sigmoid layer and a dot product. the decision about whether to forget related information from a previous cell with a certain probability is executed using (6), in which $W_f$ refers to the weight matrix, $b_f$ is the offset, and $\sigma$ is the sigmoid function.

$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \quad (6)$$

the lstm's output gate determines the states that are required for continuation by the $h_{t-1}$ and $x_t$ inputs following (7) and (8) . the final output is obtained and multiplied by the state decision vectors that pass new information, $C_t$, through the tanh layer.

$$O_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \quad (7)$$

$$h_t = O_t \tanh(C_t) \quad (8)$$

where $W_o$ and $b_o$ are the output gate's weight matrices and lstm bias, respectively. in this study, a combined method was developed to automatically detect covid-19 cases using three types of x-ray images. the structure of this architecture was designed by combining cnn and lstm networks, where the cnn is used to extract complex features from images, and lstm is used as a classifier. fig. 5 illustrates the proposed hybrid network for covid-19 detection. the network has 20 layers: 12 convolutional layers, five pooling layers, one fc layer, one lstm layer, and one output layer with the softmax function. each convolution block combines two or three 2d convolutional layers and one pooling layer, followed by a dropout layer with a 25% dropout rate. the convolutional layers use 3 × 3 kernels for feature extraction and are activated by the relu function. the max-pooling layers use 2 × 2 kernels to reduce the dimensions of an input image. in the last part of the architecture, the feature map is transferred to the lstm layer to extract time information.
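the gate descriptions above can be traced through a single lstm time step. the following is a minimal plain-python sketch, not the keras layer used in the study: `lstm_step`, `affine`, and the parameter dictionary keys are hypothetical names, and the candidate state is given its own weights `W_c` and `b_c`, as in the standard lstm formulation, which is an assumption of this sketch.

```python
import math


def sigmoid(z):
    """Logistic function used by the input, forget, and output gates."""
    return 1.0 / (1.0 + math.exp(-z))


def affine(weights, bias, vector):
    """Compute W . v + b for a weight matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM step: returns the new output h_t and the new cell state c_t."""
    concat = h_prev + x_t  # the concatenated vector [h_{t-1}, x_t]
    i_t = [sigmoid(z) for z in affine(params["W_i"], params["b_i"], concat)]  # input gate
    f_t = [sigmoid(z) for z in affine(params["W_f"], params["b_f"], concat)]  # forget gate
    o_t = [sigmoid(z) for z in affine(params["W_o"], params["b_o"], concat)]  # output gate
    # candidate state; W_c and b_c are assumed separate weights (standard formulation)
    c_hat = [math.tanh(z) for z in affine(params["W_c"], params["b_c"], concat)]
    # the forget gate scales the old cell state, the input gate scales the candidate
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, c_hat)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t


# toy parameters: hidden size 1, input size 1, all weights and biases zero
zero_params = {name: [[0.0, 0.0]] for name in ("W_i", "W_f", "W_o", "W_c")}
zero_params.update({name: [0.0] for name in ("b_i", "b_f", "b_o", "b_c")})
h_new, c_new = lstm_step([3.0], [0.0], [2.0], zero_params)
```

with all-zero parameters every gate outputs 0.5 and the candidate is 0, so the step simply halves the previous cell state; this makes the role of the forget gate easy to see in isolation.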
after the convolutional block, the output shape is found to be (none, 7, 7, 512). using the reshape method, the input size of the lstm layer has become (49, 512) . the summary of the proposed architecture is shown in table 2 . after analyzing the time characteristics, the architecture sorts the x-ray images through a fully connected layer to predict whether they belong under any of the three categories (covid-19, pneumonia, and normal). the following metrics are used to measure the performance of the proposed system: tp denotes the correctly predicted covid-19 cases, fp denotes the normal or pneumonia cases that are misclassified as covid-19 by the proposed system, tn denotes the normal or pneumonia cases that are correctly classified, and fn denotes the covid-19 cases that are misclassified as normal or pneumonia cases. in the experiment, the dataset was split into 80% and 20% for training and testing, respectively. the results were obtained using 5-fold cross-validation technique. the proposed network consists of 12 convolutional layers, as described in table 2 , the learning rate is 0.0001, and the maximum epoch number is 125, as determined experimentally. the cnn and cnn-lstm networks were implemented using python and the keras package with tensorflow2 on an intel(r) core(tm) i7-2.2 ghz processor. in addition, the experiments were executed using the graphical processing unit (gpu) nvidia gtx 1050 ti with 4 gb and 16 gb ram, respectively. the overall accuracy, specificity, sensitivity, and f1-score for each case of cnn architecture are summarized in table 3 and visually shown in fig. 9 . the cnn network achieved 98.2% specificity, 99.0% sensitivity, and 97.7% f1-score for the covid-19 cases. for the pneumonia classification, it recorded 99.7% specificity, 96.4% sensitivity, and 97.8% f1-score. in the normal cases, it obtained 99.8% specificity, 100% sensitivity, and 99.8% f1-score. 
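the per-class figures above are derived from tp, fp, tn and fn counts taken one class against the rest. a minimal sketch of that computation is given below, assuming the usual definitions of accuracy, sensitivity, specificity and f1-score; the 3 × 3 confusion matrix is hypothetical and is not the one obtained in the experiments.

```python
def one_vs_rest_metrics(confusion, positive):
    """confusion[i][j] = number of samples of true class i predicted as class j.

    Assumes the positive class and the remaining classes both occur in the data.
    """
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    tp = confusion[positive][positive]
    fn = sum(confusion[positive]) - tp
    fp = sum(confusion[r][positive] for r in range(n)) - tp
    tn = total - tp - fn - fp
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "f1": 2 * tp / (2 * tp + fp + fn),
    }

# Hypothetical counts; rows/columns are covid-19, pneumonia, normal.
confusion = [
    [48, 2, 0],
    [1, 47, 2],
    [0, 1, 49],
]
covid = one_vs_rest_metrics(confusion, positive=0)
for name, value in covid.items():
    print(f"{name}: {value:.4f}")
```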
while the highest specificity, sensitivity, and f1score were obtained in the normal cases, the lower values of sensitivity were found in the pneumonia cases. furthermore, the roc curves are added between the true positive rate (tpr) and the false positive rate (fpr) to compare the overall performance in fig. 11 finally, gradient-weighted class activation mapping (grad-cam) refers to the heat map used for the visual explanation of our experiment using the gradients of a target concept. a coarse localization map highlights the important regions in the image for prediction after passing into the final layer. in fig.12 , the heat map of the classified test examples is shown for covid-19, pneumonia, and normal cases both for cnn and cnn-lstm architecture. by analyzing the results, it is demonstrated that a combination of cnn and lstm has significant effects on the detection of covid-19 based on the automatic extraction of features from x-ray images. the proposed system could distinguish covid-19 from pneumonia and normal cases with high accuracy. a comparison between existing systems and our proposed system, in terms of accuracy and computational time, is shown in table 5 . from table 5 , it is found that some of the proposed systems [53] , [54] , [28] , [31] , [22] , [31] , [55] , and [56] obtained a slightly lower accuracy in range of 80.6% to 92.3%. the moderately highest accuracy of 93.5%, 95.2%, 95.4%, 98.3% and 98.3% are found in [26] , [23] , [29] , [57] , and [25] respectively. the system developed in [58] obtained an overall accuracy of 98.08% considering the multi-classes. moreover, a comparison between existing systems in terms of computational time depicted that the system developed in [23] our proposed cnn-lstm architecture provides good performance and it is also faster than the cnn approach, taking 18372.0s and 113.0s for training and testing which is proportionally faster than other studies. 
overall, the result of our proposed system is superior compared to other existing systems. as covid-19 cases are increasing daily, many countries are facing resource shortages. hence, it is necessary to identify every single positive case during this health emergency. we introduced a deep cnn-lstm network for the detection of novel covid-19 from x-ray images. here, cnn is used as a feature extractor and the lstm network as a classifier for the detection of coronavirus. the performance of the proposed system is improved by combining extracted features with lstm that differentiate covid-19 cases from others. the developed system obtained an accuracy of 99.4%, auc of 99.9%, specificity of 99.2%, sensitivity of 99.3%, and f1-score of 98.9%. the proposed cnn-lstm and competitive cnn architecture are applied both on the same dataset. our extensive experimental results revealed that our proposed architecture outperforms a competitive cnn network. in this global covid-19 pandemic, we hope that the proposed system would be able to develop a tool for covid-19 patients and reduce the workload of the medical diagnosis for covid-19. the proposed system has some limitations. firstly, the sample size is relatively small that needs to be increased to test the generalizability of the developed system. this would be overcomed if more images are found in the coming days. secondly, it only focuses on the posterior-anterior (pa) view of x-rays, hence it cannot differentiate other views of x-rays such as anterior-posterior (ap), lateral, etc. thirdly, covid-19 images comprising multiple disease symptoms cannot be efficiently classified. finally, the performance of our proposed system was not compared with radiologists. hence, a comparison of the proposed system with radiologists would be part of a future study. the authors declare no conflicts of interest. 
clinical features of patients infected with 2019 novel coronavirus in coronavirus disease correlation of chest ct and rt-pcr testing in coronavirus disease 2019 (covid-19) in china: a report of 1014 cases the incubation period of coronavirus disease 2019 (covid-19) from publicly reported confirmed cases: estimation and application predictive data mining models for novel coronavirus (covid-19) infected patients recovery covid-19): role of chest ct in diagnosis and management sensitivity of chest ct for covid-19: comparison to rt-pcr a survey on computer vision for assistive medical diagnosis from faces prediction of breast cancer using support vector machine and k-nearest neighbors performance evaluation of random forests and artificial neural networks for the classification of liver disorder mathematical model development to detect breast cancer using multigene genetic programming diabetes prediction: a deep learning approach coronary artery heart disease prediction: a comparative study of computational intelligence techniques developing iot based smart health monitoring systems: a review, rev. 
d'intelligence artif development of smart healthcare monitoring system in iot environment feature extraction for image recognition and computer vision optimal deep learning model for classification of lung cancer on ct images bone suppression of chest radiographs with cascaded convolutional networks in wavelet domain deep back propagation-long shortterm memory network based upper-limb semg signal classification for automated rehabilitation a new modified deep convolutional neural network for detecting covid-19 from x-ray images covid-2019 detection using x-ray images and artificial intelligence hybrid systems within the lack of chest covid-19 xray dataset: a novel detection model based on gan and deep transfer learning covidiagnosis-net: deep bayes-squeezenet based diagnosis of the coronavirus disease 2019 (covid-19) from x-ray images covid-19: automatic detection from x-ray images utilizing transfer learning with convolutional neural networks machine learning approach for confirmation of covid-19 cases: positive, negative, death and release coronet: a deep neural network for detection and diagnosis of covid-19 from chest x-ray images detection of coronavirus disease ( covid-19 ) based on deep features x-ray image based covid-19 detection using pre-trained deep learning models covidx-net: a framework of deep learning classifiers to diagnose covid-19 in x-ray images transfer learning based ensemble support vector machine model for automated covid-19 detection using lung computerized tomography scan data deep transfer learning -based automated detection of covid-19 from lung ct scan slices composite monte carlo decision making under high uncertainty of novel coronavirus epidemic using hybridized deep learning and fuzzy rule induction covid-19 image data collection covid-19 chest x-ray welcome to the cancer imaging archive -the cancer imaging archive (tcia) covid-19 database | sirm mendeley data -augmented covid-19 x-ray images dataset kaggle chest x-ray repository 
object detection using convolutional neural networks convolutional neural networks for medical image analysis: full training or fine tuning? combining deep and handcrafted image features for mri brain scan classification recent advances in convolutional neural networks a novel method for classifying liver and brain tumors using convolutional neural networks, discrete wavelet transform and long short-term memory networks classification of covid-19 patients from chest ct images using multi-objective differential evolution-based convolutional neural networks an overview of deep learning in medical imaging focusing on mri deeplearning convolutional neural networks accurately classify genetic mutations in gliomas the vanishing gradient problem during learning recurrent neural nets and problem solutions a gentle tutorial of recurrent neural network with error backpropagation within the lack of chest covid-19 xray dataset: a novel detection model based on gan and deep transfer learning, symmetry 2020 on-device covid-19 screening using snapshots of chest x-ray covid-net: a tailored deep convolutional neural network design for detection of covid-19 cases from chest x-ray images using x-ray images and deep learning for automated detection of coronavirus disease can ai help in screening viral and covid-19 pneumonia automated detection of covid-19 cases using deep neural networks with x-ray images covid-resnet: a deep learning framework for screening of covid19 from radiographs key: cord-135296-qv7pacau authors: polsinelli, matteo; cinque, luigi; placidi, giuseppe title: a light cnn for detecting covid-19 from ct scans of the chest date: 2020-04-24 journal: nan doi: nan sha: doc_id: 135296 cord_uid: qv7pacau ovid-19 is a world-wide disease that has been declared as a pandemic by the world health organization. computer tomography (ct) imaging of the chest seems to be a valid diagnosis tool to detect covid-19 promptly and to control the spread of the disease. 
deep learning has been extensively used in medical imaging and convolutional neural networks (cnns) have been also used for classification of ct images. we propose a light cnn design based on the model of the squeezenet, for the efficient discrimination of covid-19 ct images from other ct images (community-acquired pneumonia and/or healthy images). on the tested datasets, the proposed modified squeezenet cnn achieved 83.00% of accuracy, 85.00% of sensitivity, 81.00% of specificity, 81.73% of precision and 0.8333 of f1score in a very efficient way (7.81 seconds on a medium-end laptop without gpu acceleration). besides performance, the average classification time is very competitive with respect to more complex cnn designs, thus allowing its usability also on medium power computers. in the near future we aim at improving the performances of the method along two directions: 1) by increasing the training dataset (as soon as other ct images will be available); 2) by introducing an efficient pre-processing strategy.

figure 1: images extracted from dataset [3]. a covid-19 image (a) and a not covid-19 image also containing inflammations (b).

coronavirus (covid-19) is a world-wide disease that has been declared as a pandemic by the world health organization on 11th march 2020. to date, more than two million people have been infected and more than 160 thousand died. a quick diagnosis is fundamental to control the spread of the disease and increases the effectiveness of medical treatment and, consequently, the chances of survival without the necessity of intensive and sub-intensive care. this is a crucial point because hospitals have limited availability of equipment for intensive care. viral nucleic acid detection using real-time polymerase chain reaction (rt-pcr) is the accepted standard diagnostic method. however, many countries are unable to provide sufficient rt-pcr testing because the disease is very contagious. so, only people with evident symptoms are tested.
moreover, it takes several hours to furnish a result. therefore, faster and reliable screening techniques that could be further confirmed by the pcr test (or replace it) are required. computer tomography (ct) imaging seems to be a valid alternative to detect covid-19 [1] with a higher sensitivity [2] (up to 98% compared with 71% of rt-pcr). ct is likely to become increasingly important for the diagnosis and management of covid-19 pneumonia, considering the continuous increments in global cases. early research shows a pathological pathway that might be amenable to early ct detection, particularly if the patient is scanned 2 or more days after developing symptoms [1] . nevertheless, the main bottleneck that radiologists experience in analysing radiography images is the visual scanning of small details. moreover, a large number of ct images have to be evaluated in a very short time thus increasing the probability of misclassifications. this justifies the use of intelligent approaches that can automatically classify ct images of the chest. deep learning methods have been extensively used in medical imaging. in particular, convolutional neural networks (cnns) have been used both for classification and segmentation problems, also of ct images [4] . though cnns have demonstrated promising performance in this kind of applications, they require a lot of data to be correctly trained. in fact, ct images of the lungs can be easily misclassified, especially when both contain damages due to pneumonia, referred due to different causes ( figure 1 ). until now, there are limited datasets for covid-19 and those available contain a limited number of ct images. for this reason, during the training phase it is necessary to avoid/reduce overfitting (that means the cnn is not learning the discriminant features of covid-19 ct scans but only memorizing it). another critical point is that cnn inference requires a lot of computational power. 
in fact, cnns are usually executed on particularly expensive gpus equipped with specific hardware acceleration systems. however, expensive gpus are still the exception rather than the norm in normal computing clusters, which are usually cpu based [5]. moreover, this type of machine may not be available in hospitals, especially in emergency situations and/or in developing countries. in the present work, we aim at obtaining acceptable performances for an automatic method in recognizing covid-19 ct images of lungs while, at the same time, dealing with reduced datasets for training and validation and reducing the computational overhead imposed by more complex automatic systems. for this reason, in this work we started from the model of the squeezenet cnn, because it is able to reach the same accuracy as modern cnns but with fewer parameters [6]. moreover, in a recent benchmark [7], squeezenet has achieved the best accuracy density (accuracy divided by number of parameters) and the best inference time. to date, some works on covid-19 detection by ct images are being published [8, 9, 10]. all these works use heavy cnns (respectively resnet, inception and resnet) adapted to improve accuracy. in this work we developed, trained and tested a light cnn (based on the squeezenet) to discriminate between covid-19 and community-acquired pneumonia and/or healthy ct images. the hyper-parameters have been optimized with a bayesian method on two datasets [3, 11]. in addition, class activation mapping (cam) [12] has been used to understand which parts of the image are relevant for the cnn to classify it and to check that no over-fitting occurs.
the paper is structured as follows: in the next section (materials and methods) the datasets organization, the used processing equipment and the proposed methodology are presented; section 3 contains results and discussion, including a comparison with recent works on the same topic; finally section 4 concludes the paper and proposes future improvements.

the datasets used herein are the zhao et al. dataset [3] and the italian dataset [11]. the zhao et al. dataset [3] is composed of 360 ct scans of covid-19 subjects and 397 ct scans of other kinds of illnesses and/or healthy subjects. the italian dataset is composed of 100 ct scans of covid-19. these datasets are continuously updated and the number of their images is growing. in this work we used two different arrangements of the datasets, one in which data from both datasets are used separately and the other containing data mixed from both datasets. the first arrangement contains two different test datasets (test-1 and test-2). in fact, the zhao dataset is used alone and divided in train, validation and test-1. the italian dataset is integrated into a second test dataset, test-2 (table 1), while the zhao dataset is always used in train, validation and test-2 (in test-2, the not covid-19 images of the zhao dataset are the same as in test-1). the first arrangement is used to check if, even with a small training dataset, it is possible to train a cnn capable of working well also on a completely different and new dataset (the italian one). in the second arrangement, both datasets are mixed as indicated in table 2. in this arrangement the number of images from the italian dataset used to train, validate and test-1 are 60, 20 and 20, respectively. the second arrangement represents a more realistic case in which both datasets are mixed to enlarge the training dataset as much as possible (at the expense of a test-2 which, in this case, is absent).
in both arrangements, the training dataset has been augmented with the following transformations: a rotation (with a random angle between 0 and 90 degrees), a scale (with a random value between 1.1 and 1.3) and addition of gaussian noise to the original image.

for the numerical evaluation of the proposed cnns we used two hardware systems: 1) a high level computer with cpu intel core i7-67100, ram 32 gb and gpu nvidia geforce gtx 1080 8 gb dedicated memory; 2) a low level laptop with cpu intel core i5 processor, ram 8 gb and no dedicated gpu. the first is used for hyperparameters optimization and to train, validate and test the cnns; the second is used just for test in order to demonstrate the computational efficiency of the proposed solution. in both cases we used the development environment matlab 2020a. matlab integrates powerful toolboxes for the design of neural networks. moreover, with matlab it is possible to export the cnns in an open source format called "onnx", useful to share the cnns with the research community. when the high level computer is used, the gpu acceleration is enabled in the matlab environment, based on the nvidia cuda cores provided by the gpu, which allow parallel computing. in this way we speed up the prototyping of the cnns. when final tests are performed on the low level hardware, no gpu acceleration is used.

the squeezenet is capable of achieving the same level of accuracy as other, more complex cnn designs which have a huge number of layers and parameters [6]. for example, squeezenet can achieve the same accuracy as alex-net [13] on the imagenet dataset [14] with 50x fewer parameters and a model size of less than 0.5mb [6]. the squeezenet is composed of blocks called "fire module". as shown in figure 2.a, each block is composed of a squeeze convolution layer (which has 1x1 filters) feeding an expanding section of two convolution layers with 1x1 and 3x3 filters, respectively. each convolution layer is followed by a relu layer.
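the augmentation described earlier in this section (random rotation between 0 and 90 degrees, random scale between 1.1 and 1.3, additive gaussian noise) can be sketched in plain numpy. this is only an illustration and not the matlab pipeline used in the work: applying the three transformations in a single pass, the nearest-neighbour resampling, the function name `augment` and the noise standard deviation are all assumptions.

```python
import numpy as np

def augment(image, rng, noise_std=0.01):
    """Return a randomly rotated, scaled and noise-corrupted copy of a 2D image."""
    h, w = image.shape
    angle = np.deg2rad(rng.uniform(0.0, 90.0))  # rotation range from the text
    scale = rng.uniform(1.1, 1.3)               # scale range from the text
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    ys, xs = np.mgrid[0:h, 0:w]
    # Inverse mapping: for every output pixel, locate its source pixel
    # by undoing the scale and the rotation about the image centre.
    dy, dx = (ys - cy) / scale, (xs - cx) / scale
    src_x = cx + dx * np.cos(angle) + dy * np.sin(angle)
    src_y = cy - dx * np.sin(angle) + dy * np.cos(angle)
    ix = np.rint(src_x).astype(int)
    iy = np.rint(src_y).astype(int)
    inside = (ix >= 0) & (ix < w) & (iy >= 0) & (iy < h)
    warped = np.zeros((h, w), dtype=float)      # pixels with no source stay at zero
    warped[inside] = image[iy[inside], ix[inside]]
    # noise_std is a hypothetical value; the text does not report it.
    return warped + rng.normal(0.0, noise_std, size=warped.shape)

rng = np.random.default_rng(42)
ct_slice = rng.random((64, 64))  # stand-in for a normalized ct image
augmented = augment(ct_slice, rng)
print(augmented.shape)
```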
the relu layers output of the expanding section are concatenated with a concatenation layer. to improve the training convergence and to reduce overfitting we added a batch normalization layer between the squeeze convolution layer and the relu layer ( figure 2 .b). each batch normalization layer adds 30% of computation overhead and for this reason we chose to add them only before the expanding section in order to make it more effective while, at the same time, limiting their number. moreover, we replaced all the relu layers with elu layers because, from literature [15] , elus networks without batch normalization significantly outperform relu networks with batch normalization. the squeezenet has 8 fire modules in cascade configuration. anyway, two more complex architectures exist: one with simple and another with complex bypass. the simple bypass configuration consists in 4 skip connections added between fire module 2 and fire module 3, fire module 4 and fire module 5, fire module 6 and fire module 7 and, finally, between fire module 8 and fire module 9. the complex bypass added 4 more skip connections (between the same fire modules) with a convolutional layer of filter size 1x1. from the original paper [6] it seems that the better accuracy is achieved by the simpler bypass configuration. for this reason, in this work we test both squeezenet without any bypass (to have the most efficient model) and with simple bypass (to have the most accurate model), while complex bypass configuration is not considered. besides, we propose also a further modify cnn (figure 3 ) based on the squeezenet without any bypass. moreover, we added a transpose convolutional layer to the last custom fire module that expands the feature maps 4 times along width and height dimensions. these feature maps are concatenated in depth with the feature maps from the second custom fire module through a skip connection. 
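to see why a fire module is cheap, one can count its learnable parameters and compare them with a plain 3x3 convolution producing the same number of output channels. the sketch below does this arithmetic; the channel sizes (96 input channels, 16 squeeze filters, 64 + 64 expand filters) are illustrative assumptions and are not taken from the text, and the batch normalization term counts only a learnable scale and offset per squeeze channel.

```python
def conv_params(c_in, c_out, k):
    """Weights plus biases of a k x k convolution layer."""
    return k * k * c_in * c_out + c_out

def fire_params(c_in, squeeze, expand1x1, expand3x3, batch_norm=False):
    """Parameters of a fire module: 1x1 squeeze, then parallel 1x1 and 3x3 expand."""
    total = conv_params(c_in, squeeze, 1)
    total += conv_params(squeeze, expand1x1, 1)
    total += conv_params(squeeze, expand3x3, 3)
    if batch_norm:
        total += 2 * squeeze  # scale and offset after the squeeze layer
    return total

# Illustrative channel sizes (assumed, not reported in the text).
fire = fire_params(96, 16, 64, 64)
fire_bn = fire_params(96, 16, 64, 64, batch_norm=True)
plain = conv_params(96, 64 + 64, 3)
print(fire, fire_bn, plain, round(plain / fire, 2))
```

under these assumptions the squeeze layer is what makes the module small: the 3x3 filters only see 16 channels instead of 96, and the batch normalization layer adds a negligible number of parameters.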
a weighted sum is performed between them with a convolution layer with 128 filters of size 1x1. finally all the feature maps are concatenated in depth and averaged with a global average pool layer. this design allows combining spatial information (early layers) and feature information (last layers) to improve the accuracy.

since we are using a light cnn to classify, the optimization of the training phase is crucial to achieve good results with a limited number of parameters. the training phase of a cnn is highly dependent on the setting of hyperparameters. hyperparameters are different from model weights. the former are set before the training phase, whereas the latter are optimised during the training phase. setting of hyperparameters is not trivial and different strategies can be adopted. a first way is to select hyperparameters manually, though it would be preferable to avoid it because the number of different configurations is huge. for the same reason, approaches like grid search, which do not use past evaluations, are inefficient: a lot of time has to be spent evaluating bad hyperparameter configurations. instead, bayesian approaches, by using past evaluation results to build a surrogate probabilistic model mapping hyperparameters to a probability of a score on the objective function, seem to work better. in this work we used bayesian optimization for the following hyper-parameters:
1. initial learning rate: the rate used for updating weights during the training time;
2. momentum: this parameter influences the weights update taking into consideration the update value of the previous iteration;
3. l2-regularization: a regularization term for the weights added to the loss function in order to reduce over-fitting.

2. squeezenet with simple bypass but without transfer learning;
3. squeezenet with simple bypass and transfer learning;

regarding the arrangement 1, the results of the experiments are reported in table 3.
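the three optimized hyper-parameters listed in this section all enter the weight update of stochastic gradient descent with momentum. a minimal sketch of one such update on a scalar weight follows; it assumes the common formulation in which the l2 term adds `l2 * weight` to the gradient, and the toy loss `(w - 3)^2` and all numeric values are hypothetical.

```python
def sgdm_step(weight, velocity, gradient, learning_rate, momentum, l2):
    """One SGD-with-momentum update with L2 regularization."""
    regularized = gradient + l2 * weight  # L2 term pulls the weight towards zero
    velocity = momentum * velocity - learning_rate * regularized
    return weight + velocity, velocity

def minimize(l2, steps=300, learning_rate=0.1, momentum=0.9):
    """Minimize the toy loss (w - 3)^2 starting from w = 0."""
    weight, velocity = 0.0, 0.0
    for _ in range(steps):
        gradient = 2.0 * (weight - 3.0)
        weight, velocity = sgdm_step(weight, velocity, gradient,
                                     learning_rate, momentum, l2)
    return weight

w_plain = minimize(l2=0.0)  # converges to the unregularized minimum
w_reg = minimize(l2=0.1)    # converges to a slightly smaller weight
print(round(w_plain, 4), round(w_reg, 4))
```

the regularized run settles below the unregularized one, which is the shrinking effect that is meant to reduce over-fitting.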
for a better visualization of the results, we report just the best accuracy calculated with respect to all the attempts, the accuracy estimated by the objective function at the end of all attempts and the values of the hyperparameters. the best accuracy value is achieved with the experiment #4. both observed and estimated accuracy are the highest among all the experiments. with respect to the original paper of the squeezenet [6], there seems to be no relevant difference between the model without bypass and with bypass. it is also interesting to note that using transfer learning (experiment #3) from the original weights of the squeezenet does not have a relevant effect. regarding the dataset arrangement 2, the results of the experiments are shown in table 4. the experiment #4 is still the best one, though experiment #1 is closer in terms of observed accuracy. however, we did not expect such a difference between the learning rate of experiment #4 of table 3 and table 4. moreover, the l2-regularization also changed a lot. this suggests that the cnn trained/validated on the dataset arrangement 1 (that we call cnn-1) has a different behavior with respect to the cnn trained/validated on dataset arrangement 2 (that we call cnn-2). however, the results shown in table 3 and table 4 suggest that the proposed cnn achieves better results when compared to different configurations of the original squeezenet.

both cnn-1 and cnn-2 have been trained for 20 more epochs, with a learning rate drop of 0.8 every 5 epochs. after that, both cnns have been evaluated with the respective test-1 dataset with the following benchmark metrics: accuracy (measures the correct predictions of the cnn), sensitivity (measures the positives that are correctly identified), specificity (measures the negatives that are correctly identified), precision (measures the proportion of positive identifications that are actually correct) and f1score (measures the balance between precision and recall).
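the training schedule mentioned above (20 epochs, learning rate drop of 0.8 every 5 epochs) can be written as a piecewise-constant function. the sketch assumes that "drop of 0.8" means multiplying the current rate by 0.8, and the initial learning rate of 0.01 is a hypothetical placeholder, since the optimized values are only reported in the tables.

```python
def learning_rate_at(epoch, initial_lr, drop_factor=0.8, drop_period=5):
    """Learning rate for a 0-indexed epoch under a step-decay schedule."""
    return initial_lr * drop_factor ** (epoch // drop_period)

initial_lr = 0.01  # hypothetical value, not taken from the tables
schedule = [learning_rate_at(epoch, initial_lr) for epoch in range(20)]
for epoch in (0, 5, 10, 15):
    print(epoch, round(schedule[epoch], 6))
```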
the results, shown in table 5 , confirm the hypothesis of the previous section: cnn-1 and cnn-2 have a different behavior. this is clearly understandable by taking into account the sensitivity and specificity values. the cnn-1 has higher specificity (0.85 against 0.81) and that means that is capable to better recognize not covid-19 images. the cnn-2 has higher sensitivity (0.8500 against 0.7900) and that means that is capable to better recognize covid-19 images. regarding the application of cnn-1 on test-2, the results are frustrating. the accuracy reaches just 0.5024 because the cnn is capable only to recognize well not covid-19 images (precision is 0.80) but has very poor performance on covid-19 images (sensitivity = 0.1900). as affirmed before, the analyses of test-2 is very hard if we do not use a larger dataset of images. in order to deeply understand the behaviour of cnn-1 and cnn-2 we used cam [12] , that gives a visual explanations of the predictions of convolutional neural networks. this is useful to figure out what each cnn has learned and which part of the input of the network is responsible for the classification. it can be useful to identify biases in the training set and to increase model accuracy. with cam it is also possible to understand if the cnns are overfitting. in fact, if the network has high accuracy on the training set, but low accuracy on the test set, cam helps to verify if the cnn is basing its predictions on the relevant features of the images or on the background. to this aim, we expect that the activations maps are focused on the lungs and especially on those parts affected by covid-19 (lighter regions with respect to healthy, darker, zones of the lungs). figure 4 shows 3 examples of cams for each cnns and, to allow comparisons, we refer them to the same 3 ct images (covid-19 diagnosed both from radiologists and cnns) extracted from the training dataset. 
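the idea behind cam can be sketched in a few lines: the feature maps of the last convolutional stage are summed with the weights that connect their globally averaged values to one output class, and the result is rescaled to [0, 1] for display as a heat map. the snippet below is a minimal illustration under that assumption and not the code used to produce the figures; the array shapes and the toy values are hypothetical.

```python
import numpy as np

def class_activation_map(feature_maps, class_weights):
    """feature_maps: (K, H, W) last convolutional output;
    class_weights: (K,) weights linking the pooled features to one class."""
    cam = np.tensordot(class_weights, feature_maps, axes=1)  # weighted sum -> (H, W)
    cam = cam - cam.min()
    if cam.max() > 0:
        cam = cam / cam.max()  # rescale to [0, 1] for visualization
    return cam

# Toy example: two 3 x 3 feature maps, one supporting and one opposing the class.
feature_maps = np.zeros((2, 3, 3))
feature_maps[0, 1, 1] = 1.0
feature_maps[1, 0, 0] = 1.0
class_weights = np.array([2.0, -1.0])
cam = class_activation_map(feature_maps, class_weights)
print(np.unravel_index(np.argmax(cam), cam.shape))
```

in practice the map is upsampled to the size of the input image and overlaid on it, so that the brightest regions can be compared with the lung areas.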
for cnn-1 (figures 4.a, 4.b and 4.c), the activations are not localized inside the lungs. in figure 4.b the activations are only slightly better than in figures 4.a and 4.c, because the red area is partially focused on the ill part of the right lung. the situation improves in the cams of cnn-2 (figures 4.d, 4.e, 4.f) because the activations are more localized on the ill parts of the lungs (this situation is perfectly represented in figure 4.f). figure 5 shows 3 examples of cams for each cnn (as figure 4) but with 3 ct images of lungs not affected by covid-19 and correctly classified by both cnns. cnn-1 focuses on small isolated zones (figures 5.a, 5.b and 5.c): even if these zones are inside the lungs, it seems unreasonable to obtain a correct classification with so little information (and without having checked the rest of the lungs). instead, in cnn-2, the activations seem to take into consideration the whole region occupied by the lungs, as demonstrated in figures 5.d, 5.e and 5.f, which is the necessary step to correctly classify a lung ct image. in conclusion, cnn-2 has a better behaviour with respect to cnn-1. since cnn-1 and cnn-2 have the same model design but different training datasets, we argue that the training dataset is responsible for their different behaviour. in fact, the dataset arrangement-2 contains more training images (taken from the italian dataset) and cnn-2 seems to benefit from it. so, figure 4 and figure 5 suggest that the cnn model, even with a limited number of parameters, is capable of learning the discriminant features of this kind of images. therefore, the increment of the training dataset should increase also the performance of the cnn.

we compare the results of our work (in particular the cnn-2) with [8, 10, 9]. since methods and datasets (training and test) differ, a correct quantitative comparison is arduous, but we can still get an idea of the respective results, summarized in table 6.
the methods in [10, 9] achieve better results than the method we propose. with respect to [8], our method achieves better results, especially regarding sensitivity which, in our method, is 28% higher: this suggests a better classification regarding covid-19 images. the average time required by our cnn to classify a single ct image is 1.25 seconds on our high-end workstation. for comparison, the method in [9] requires 4.51 seconds on a similar high-end workstation (intel xeon processor e5-1620, gpu ram 16gb, gpu nvidia quadro m4000 8gb). on our medium-end laptop the cnn requires an average time of 7.81 seconds to classify a single image. this means that the method proposed herein can be used at scale on medium-end computers: a dataset of about 4300 images, roughly corresponding to 3300 patients [9], could be classified in about 9.32 hours. the improvement in efficiency of the proposed method with respect to the previously compared ones is demonstrated in table 7, where the sensitivity value (the only parameter reported by all the compared methods) is rated with respect to the number of parameters used to reach it: the resulting ratio confirms that the proposed method greatly outperforms the others in efficiency.

in this study, we proposed a cnn design (starting from the model of the squeezenet cnn) to discriminate between covid-19 and other ct images (comprising both community-acquired pneumonia and healthy images). on both dataset arrangements, the proposed cnn outperforms the original squeezenet. in particular, on the test dataset the proposed cnn (cnn-2) achieved 83.00% of accuracy, 85.00% of sensitivity, 81.00% of specificity, 81.73% of precision and 0.8333 of f1score. moreover, the proposed cnn is more efficient with respect to other, more complex cnn designs. in fact, the average classification time is low both on a high-end computer (1.25 seconds for a single ct image) and on a medium-end laptop (7.81 seconds for a single ct image).
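the throughput and f1score figures quoted above can be checked with a few lines of arithmetic. the sketch below uses only numbers reported in the text (7.81 seconds per image, about 4300 images, precision 81.73%, sensitivity 85.00%); the variable names are arbitrary.

```python
seconds_per_image = 7.81  # medium-end laptop, from the text
images = 4300             # dataset size quoted from [9]

hours_for_dataset = images * seconds_per_image / 3600.0
images_per_day = 24 * 3600 / seconds_per_image

precision, sensitivity = 0.8173, 0.85
f1 = 2 * precision * sensitivity / (precision + sensitivity)

print(f"hours for {images} images: {hours_for_dataset:.2f}")
print(f"images per day: {images_per_day:.0f}")
print(f"f1score: {f1:.4f}")
```

the result is roughly 9.3 hours for the whole dataset and on the order of eleven thousand images per day on the laptop, and the f1score obtained from precision and sensitivity agrees with the reported 0.8333.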
this demonstrates that the proposed cnn is capable of analyzing thousands of images per day even with limited hardware resources. the next major improvement that we want to achieve is to raise the accuracy, sensitivity, specificity, precision and f1-score. to do that, since the cnn model seems to be robust as shown by the cam tests, we aim at enlarging the training dataset as soon as new ct images become available. moreover, when we compared our method with those presented in [10, 9] and in [8], we noticed that the last method, like ours, does not use pre-processing, unlike the first two. a possible explanation for the better results of methods [10, 9] with respect to ours could be the use of pre-processing. as future work, we aim to study efficient pre-processing strategies that could improve accuracy while limiting computational overhead, in order to preserve efficiency. the role of ct in case ascertainment and management of covid-19 pneumonia in the uk: insights from high-incidence regions; sensitivity of chest ct for covid-19: comparison to rt-pcr; efficient multiple organ localization in ct image using 3d region proposal network; improving the speed of neural networks on cpus; squeezenet: alexnet-level accuracy with 50x fewer parameters and <0.5 mb model size; benchmark analysis of representative deep neural network architectures; a deep learning algorithm using ct images to screen for corona virus disease; artificial intelligence distinguishes covid-19 from community acquired pneumonia on chest ct; deep learning system to screen coronavirus disease 2019 pneumonia; sirm dataset of covid-19 chest ct scan; learning deep features for discriminative localization; imagenet classification with deep convolutional neural networks; imagenet: a large-scale hierarchical image database; fast and accurate deep network learning by exponential linear units (elus). key: cord-296359-pt86juvr authors: polsinelli, matteo; cinque, luigi; placidi, giuseppe 
title: a light cnn for detecting covid-19 from ct scans of the chest date: 2020-10-03 journal: pattern recognit lett doi: 10.1016/j.patrec.2020.10.001 sha: doc_id: 296359 cord_uid: pt86juvr computer tomography (ct) imaging of the chest is a valid diagnosis tool to detect covid-19 promptly and to control the spread of the disease. in this work we propose a light convolutional neural network (cnn) design, based on the model of the squeezenet, for the efficient discrimination of covid-19 ct images with respect to other community-acquired pneumonia and/or healthy ct images. the architecture achieves an accuracy of 85.03%, with an improvement over the original squeezenet of about 3.2% in the first dataset arrangement and of about 2.1% in the second dataset arrangement. the obtained gain, though small, can be really important in medical diagnosis and, in particular, in the covid-19 scenario. also, the average classification time on a high-end workstation, 1.25 seconds, is very competitive with respect to that of more complex cnn designs, 13.41 seconds, which require pre-processing. the proposed cnn can be executed on a medium-end laptop without gpu acceleration in 7.81 seconds: this is impossible for methods requiring gpu acceleration. the performance of the method can be further improved with efficient pre-processing strategies for which gpu acceleration is not necessary. coronavirus disease (covid-19) is a world-wide disease that was declared a pandemic by the world health organization on 11th march 2020. to date, covid-19 counts more than 10 million confirmed cases, of which more than 500 thousand deaths around the world (mortality rate of 5.3%) and more than 5 million recovered people. a quick diagnosis is fundamental to control the spread of the disease and increases the effectiveness of medical treatment and, consequently, the chances of survival without the need for intensive and subintensive care. 
this is a crucial point because hospitals have limited availability of equipment for intensive care. viral nucleic acid detection using real-time polymerase chain reaction (rt-pcr) is the accepted standard diagnostic method. however, many countries are unable to provide sufficient rt-pcr testing because the disease is very contagious. so, only people with evident symptoms are tested. moreover, it takes several hours to furnish a result. therefore, fast and reliable screening techniques that could be further confirmed by the pcr test (or replace it) are required. computer tomography (ct) imaging is a valid alternative to detect covid-19 [2], with a higher sensitivity [5] (up to 98% compared with 71% of rt-pcr). ct is likely to become increasingly important for the diagnosis and management of covid-19 pneumonia, considering the continuous increase in global cases. early research shows a pathological pathway that might be amenable to early ct detection, particularly if the patient is scanned 2 or more days after developing symptoms [2]. nevertheless, the main bottleneck that radiologists experience in analysing radiography images is the visual scanning of small details. moreover, a large number of ct images have to be evaluated in a very short time, thus increasing the probability of misclassifications. this justifies the use of intelligent approaches that can automatically classify ct images of the chest. deep learning methods have been extensively used in medical imaging. in particular, convolutional neural networks (cnns) have been used both for classification and segmentation problems, also on ct images [16]. however, ct images of lungs with and without covid-19 can be easily misclassified, especially when damage from pneumonia due to different causes is present at the same time. 
in fact, the main chest ct findings are pure ground glass opacities (ggo) [6], but other lesions can also be present, like consolidations with or without vascular enlargement, interlobular septal thickening, and air bronchogram [11]. as an example, two ct scans of covid-19 and not covid-19 are reported in figure 1.a and figure 1.b, respectively. until now, there are limited datasets for covid-19 and those available contain a limited number of ct images. for this reason, during the training phase it is necessary to avoid/reduce overfitting (which means that the cnn is not learning the discriminant features of covid-19 ct scans but only memorizing them). another critical point is that cnn inference requires a lot of computational power. in fact, cnns are usually executed on particularly expensive gpus equipped with specific hardware acceleration systems. however, expensive gpus are still the exception rather than the norm in common computing clusters, which are usually cpu based [13]. moreover, this type of machine may not be available in hospitals, especially in emergency situations and/or in developing countries. at the moment, of the top 12 countries with the most confirmed cases [12] (table 1), 7 are developing countries, though the covid-19 emergency is also strongly stressing the health systems of advanced countries. in this work, we present an automatic method for recognizing covid-19 and not covid-19 ct images of lungs. its accuracy is comparable with that of complex cnns supported by massive pre-processing strategies, while it maintains a light architecture and a high efficiency that make it executable on low/middle-range computers. we started from the model of the squeezenet cnn to discriminate between covid-19 and community-acquired pneumonia and/or healthy ct images. in fact, squeezenet is capable of reaching the same accuracy as modern cnns but with fewer parameters [7]. 
moreover, in a recent benchmark [1], squeezenet achieved the best accuracy density (accuracy divided by number of parameters) and the best inference time. the hyperparameters have been optimized with a bayesian method on two datasets [17, 8]. in addition, class activation mapping (cam) [18] has been used to understand which parts of the image are relevant for the cnn to classify it and to check that no overfitting occurs. the paper is structured as follows: in the next section (materials and methods) the organization of the datasets, the processing equipment used and the proposed methodology are presented; section 3 contains results and discussion, including a comparison with recent works on the same topic; finally, section 4 concludes the paper and proposes future improvements. the datasets used are the zhao et al. dataset [17] and the italian dataset [8]. both datasets used in this study comply with the helsinki declaration and guidelines, and we also operated in accordance with them. the zhao et al. dataset is composed of 360 ct scans of covid-19 subjects and 397 ct scans of other kinds of illnesses and/or healthy subjects. the italian dataset is composed of 100 ct scans of covid-19. these datasets are continuously updated and their number of images keeps growing. in this work we used two different arrangements of the datasets, one in which data from the two datasets are used separately and the other containing data mixed from both datasets. the first arrangement contains two different test datasets (test-1 and test-2). in fact, the zhao dataset is used alone and divided into train, validation and test-1. the italian dataset is integrated into a second test dataset, test-2 (table 2), while the zhao dataset is always used in train, validation and test-2 (in test-2, the not covid-19 images of the zhao dataset are the same as in test-1). 
the first arrangement is used to check whether, even with a small training dataset, it is possible to train a cnn capable of working well also on a completely different and new dataset (the italian one). in the second arrangement, both datasets are mixed as indicated in table 3. in this arrangement the numbers of images from the italian dataset used for train, validation and test-1 are 60, 20 and 20, respectively. the second arrangement represents a more realistic case in which both datasets are mixed to enlarge the training dataset as much as possible (at the expense of test-2 which, in this case, is absent). in both arrangements, the training dataset has been augmented with the following transformations: a rotation (with a random angle between 0 and 90 degrees), a scaling (with a random factor between 1.1 and 1.3) and the addition of gaussian noise to the original image. for the numerical evaluation of the proposed cnns we used two hardware systems: 1) a high-level computer with cpu intel core i7-67100, ram 32 gb and gpu nvidia geforce gtx 1080 with 8 gb of dedicated memory; 2) a low-level laptop with cpu intel core i5 processor, ram 8 gb and no dedicated gpu. the first is used for hyperparameter optimization and to train, validate and test the cnns; the second is used just for testing, in order to demonstrate the computational efficiency of the proposed solution. in both cases we used the development environment matlab 2020a. matlab integrates powerful toolboxes for the design of neural networks. moreover, with matlab it is possible to export the cnns in an open source format called onnx, useful for sharing the cnns with the research community. when the high-level computer is used, gpu acceleration is enabled in the matlab environment, based on the nvidia cuda technology provided by the gpu, which allows parallel computing. in this way we speed up the prototyping of the cnns. when final tests are performed on the low-level hardware, no gpu acceleration is used. 
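the augmentation described above (random rotation between 0 and 90 degrees, random scaling between 1.1 and 1.3, additive gaussian noise) can be sketched as follows. this is only a minimal numpy illustration and not the authors' matlab pipeline: the function names, the nearest-neighbour resampling, the crop to the original size after scaling, the noise standard deviation `sigma` and the choice of producing one image per transformation are all assumptions, since the text does not specify them.

```python
import numpy as np


def affine_resample(img, angle_deg, scale):
    """Rotate by angle_deg and zoom by scale about the image centre.

    Nearest-neighbour inverse mapping; output pixels whose source falls
    outside the image are set to 0. Output shape equals input shape.
    """
    h, w = img.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    theta = np.deg2rad(angle_deg)
    cos_t, sin_t = np.cos(theta), np.sin(theta)
    yy, xx = np.mgrid[0:h, 0:w]
    # Inverse map each output pixel: undo the zoom, then undo the rotation.
    y0, x0 = (yy - cy) / scale, (xx - cx) / scale
    src_y = cos_t * y0 + sin_t * x0 + cy
    src_x = -sin_t * y0 + cos_t * x0 + cx
    iy = np.rint(src_y).astype(int)
    ix = np.rint(src_x).astype(int)
    inside = (iy >= 0) & (iy < h) & (ix >= 0) & (ix < w)
    out = np.zeros_like(img)
    out[inside] = img[iy[inside], ix[inside]]
    return out


def augment(img, rng, sigma=0.01):
    """Return rotated, scaled and noisy variants of one image in [0, 1].

    sigma is a hypothetical noise level; the paper does not report one.
    """
    rotated = affine_resample(img, rng.uniform(0.0, 90.0), 1.0)
    scaled = affine_resample(img, 0.0, rng.uniform(1.1, 1.3))
    noisy = np.clip(img + rng.normal(0.0, sigma, img.shape), 0.0, 1.0)
    return rotated, scaled, noisy


rng = np.random.default_rng(0)
ct_slice = rng.random((64, 64))          # stand-in for a normalised ct image
rotated, scaled, noisy = augment(ct_slice, rng)
```

applying each transformation separately keeps the three effects easy to inspect; composing them into a single transformation would be an equally plausible reading of the text.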
the squeezenet is capable of achieving the same level of accuracy as other, more complex, cnn designs which have a huge number of layers and parameters [7]. for example, squeezenet can achieve the same accuracy as alexnet [9] on the imagenet dataset [4] with 50x fewer parameters and a model size of less than 0.5mb [7]. the squeezenet is composed of blocks called "fire modules". as shown in figure 2.a, each block is composed of a squeeze convolution layer (which has 1x1 filters) feeding an expanding section of two convolution layers with 1x1 and 3x3 filters, respectively. each convolution layer is followed by a relu layer. the outputs of the relu layers of the expanding section are joined by a concatenation layer. to improve the training convergence and to reduce overfitting we added a batch normalization layer between the squeeze convolution layer and the relu layer (figure 2.b). each batch normalization layer adds 30% of computation overhead, and for this reason we chose to add them only before the expanding section, in order to make them more effective while, at the same time, limiting their number. moreover, we replaced all the relu layers with elu layers because, according to the literature [3], elu networks without batch normalization significantly outperform relu networks with batch normalization. the squeezenet has 8 fire modules in cascade configuration (numbered from 2 to 9 in [7], since the first layer is a plain convolution). two more complex architectures also exist: one with simple and another with complex bypass. the simple bypass configuration consists of 4 skip connections added between fire module 2 and fire module 3, fire module 4 and fire module 5, fire module 6 and fire module 7 and, finally, between fire module 8 and fire module 9. the complex bypass adds 4 more skip connections (between the same fire modules) with a convolutional layer of filter size 1x1. according to the original paper [7], the simple bypass configuration achieves the better accuracy. 
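the modified fire module described above (squeeze 1x1 convolution, batch normalization, elu, then parallel 1x1 and 3x3 expand convolutions whose elu outputs are concatenated in depth) can be sketched in plain numpy. this is an illustrative forward pass only, under several assumptions that are not in the paper: channel-first tensors, no biases, stride 1 with zero padding, hypothetical channel counts, and a simplified normalization that uses the statistics of the single input instead of learned batch statistics.

```python
import numpy as np


def conv2d(x, w):
    """'Same' convolution layer (cross-correlation, stride 1, no bias).

    x: (C_in, H, W); w: (C_out, C_in, k, k) with odd k.
    """
    c_out, _, k, _ = w.shape
    p = k // 2
    xp = np.pad(x, ((0, 0), (p, p), (p, p)))
    h, wd = x.shape[1:]
    out = np.zeros((c_out, h, wd))
    for dy in range(k):
        for dx in range(k):
            window = xp[:, dy:dy + h, dx:dx + wd]
            out += np.einsum('oc,chw->ohw', w[:, :, dy, dx], window)
    return out


def batch_norm(x, eps=1e-5):
    """Simplified per-channel normalisation (unit scale, zero shift)."""
    mean = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def elu(x, alpha=1.0):
    """Exponential linear unit: x for x > 0, alpha * (exp(x) - 1) otherwise."""
    return np.where(x > 0, x, alpha * (np.exp(np.minimum(x, 0.0)) - 1.0))


def fire_module(x, w_squeeze, w_expand1, w_expand3):
    squeezed = elu(batch_norm(conv2d(x, w_squeeze)))    # squeeze 1x1 -> bn -> elu
    expand1 = elu(conv2d(squeezed, w_expand1))          # expand 1x1 -> elu
    expand3 = elu(conv2d(squeezed, w_expand3))          # expand 3x3 -> elu
    return np.concatenate([expand1, expand3], axis=0)   # depth concatenation


rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8, 8))                 # hypothetical input feature maps
out = fire_module(
    x,
    rng.normal(size=(4, 16, 1, 1)),             # squeeze: 16 -> 4 channels
    rng.normal(size=(8, 4, 1, 1)),              # expand 1x1: 4 -> 8 channels
    rng.normal(size=(8, 4, 3, 3)),              # expand 3x3: 4 -> 8 channels
)
```

the squeeze layer is what keeps the parameter count low: the 3x3 filters only see the 4 squeezed channels instead of the 16 input channels.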
because the simple bypass is the more accurate of the two bypass variants [7], in this work we test both squeezenet without any bypass (to have the most efficient model) and with simple bypass (to have the most accurate model), while the complex bypass configuration is not considered. besides, we also propose a further modified cnn (figure 3) based on the squeezenet without any bypass. in it, we added a transposed convolutional layer to the last custom fire module that expands the feature maps 4 times along the width and height dimensions. these feature maps are concatenated in depth with the feature maps from the second custom fire module through a skip connection. a weighted sum is performed between them by a convolution layer with 128 filters of size 1x1. finally, all the feature maps are concatenated in depth and averaged with a global average pooling layer. this design allows spatial information (early layers) and feature information (last layers) to be combined to improve the accuracy. since we are using a light cnn to classify, the optimization of the training phase is crucial to achieve good results with a limited number of parameters. the training phase of a cnn is strongly influenced by the setting of the hyperparameters. hyperparameters are different from model weights: the former are set before the training phase, whereas the latter are optimised during the training phase. setting the hyperparameters is not trivial and different strategies can be adopted. a first way is to select hyperparameters manually, though it is preferable to avoid this because the number of different configurations is huge. approaches like grid search suffer from the same problem and, in addition, do not use past evaluations: a lot of time is spent evaluating bad hyperparameter configurations. instead, bayesian approaches, which use past evaluation results to build a surrogate probabilistic model mapping hyperparameters to a probability of a score on the objective function, seem to work better. 
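the idea of reusing past evaluations through a surrogate model can be illustrated with a small one-dimensional example. the sketch below is a generic gaussian-process surrogate with an expected-improvement acquisition, written from scratch; it is not the matlab optimizer used by the authors, and the kernel, its length scale, the candidate grid and the toy objective (a stand-in for "validation accuracy as a function of one rescaled hyperparameter") are all assumptions made for illustration.

```python
import math

import numpy as np


def rbf_kernel(a, b, length=0.15):
    """Squared-exponential kernel between two sets of 1-D points."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length) ** 2)


def gp_posterior(x_obs, y_obs, x_query, jitter=1e-6):
    """Posterior mean and standard deviation of the surrogate model."""
    y_mean = y_obs.mean()
    k_oo = rbf_kernel(x_obs, x_obs) + jitter * np.eye(len(x_obs))
    k_oq = rbf_kernel(x_obs, x_query)
    solved = np.linalg.solve(k_oo, k_oq)        # K^-1 k(x_obs, x_query)
    mean = y_mean + solved.T @ (y_obs - y_mean)
    var = 1.0 - np.sum(k_oq * solved, axis=0)
    return mean, np.sqrt(np.maximum(var, 1e-12))


def expected_improvement(mean, std, best):
    """Expected improvement over the best score seen so far (maximisation)."""
    z = (mean - best) / std
    cdf = 0.5 * (1.0 + np.vectorize(math.erf)(z / math.sqrt(2.0)))
    pdf = np.exp(-0.5 * z ** 2) / math.sqrt(2.0 * math.pi)
    return (mean - best) * cdf + std * pdf


def bayes_opt(objective, n_init=3, n_iter=12, seed=0):
    """Sequentially pick the candidate with the highest expected improvement."""
    rng = np.random.default_rng(seed)
    grid = np.linspace(0.0, 1.0, 201)           # candidate hyperparameter values
    x_obs = rng.uniform(0.0, 1.0, n_init)       # random initial attempts
    y_obs = np.array([objective(x) for x in x_obs])
    for _ in range(n_iter):
        mean, std = gp_posterior(x_obs, y_obs, grid)
        x_next = grid[np.argmax(expected_improvement(mean, std, y_obs.max()))]
        x_obs = np.append(x_obs, x_next)
        y_obs = np.append(y_obs, objective(x_next))
    return x_obs, y_obs


def toy_accuracy(x):
    """Hypothetical validation accuracy with a single peak at x = 0.3."""
    return math.exp(-((x - 0.3) ** 2) / 0.02)


tried, scores = bayes_opt(toy_accuracy)
best_setting = tried[np.argmax(scores)]
```

unlike grid search, each new attempt here depends on all previous scores: the surrogate is refitted after every evaluation and the next point is chosen where improvement looks most likely.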
in this work we used bayesian optimization for the following hyperparameters: 1. initial learning rate: the rate used for updating the weights during training; 2. momentum: this parameter influences the weight update by taking into consideration the update value of the previous iteration; 3. l2-regularization: a regularization term on the weights added to the loss function in order to reduce overfitting. for each dataset arrangement we organized 4 experiments in which we tested different cnn models, transfer learning and the effectiveness of data augmentation. for each experiment, 30 different attempts (with the bayesian method) have been made with different sets of hyperparameters (initial learning rate, momentum, l2-regularization). for each attempt, the cnn model has been trained for 20 epochs and evaluated by the accuracy calculated on the validation dataset. the experiments, all performed on the augmented dataset, were: 1. squeezenet without bypass and transfer learning; 2. squeezenet with simple bypass but without transfer learning; 3. squeezenet with simple bypass and transfer learning; 4. the proposed cnn. regarding arrangement 1, the results of the experiments are reported in table 4. for a better visualization of the results, we report just the best accuracy calculated with respect to all the attempts, the accuracy estimated by the objective function at the table 5. experiment #4 is still the best one, though experiment #1 is closer in terms of observed accuracy. by comparing the hyperparameters of experiment #4 in table 4 and table 5, a relevant difference in learning rate and l2-regularization is evident. regarding dataset arrangement 1, table 4 shows that a decrease of the learning rate corresponds to an increase of momentum and vice versa; the same occurs between the learning rate and l2-regularization; momentum and l2-regularization have the same behaviour. 
regarding dataset arrangement 2, table 5 shows that learning rate, l2-regularization and momentum have a concordant trend. this hypothesis is confirmed in all the experiments. the different behaviour of the hyperparameters in table 4 and table 5 suggests that the cnn trained/validated on dataset arrangement 1 (which we call cnn-1) is different from the cnn trained/validated on dataset arrangement 2 (which we call cnn-2), as also confirmed by the evaluation of cam, presented and discussed in the next subsection. the results shown in table 4 and table 5 confirm that the proposed cnns (experiment #4) perform better than the original squeezenet configurations. in particular, the cnn-1 design outperforms the 3 original squeezenet models in terms of accuracy by 1.6%, 4.0% and 4.0% (3.2% on average), respectively, and cnn-2 by 0.7%, 2.2% and 3.1% (2.1% on average), respectively. two considerations are necessary: 1) the proposed architecture always outperforms the original ones; 2) an accuracy gain, though small, can be really important in medical diagnosis. the calculated hyperparameters have been used to train (20 epochs, learning rate drop of 0.8 every 5 epochs) both cnn-1 and cnn-2 with a 10-fold cross-validation strategy on the respective dataset arrangements (table 6 and table 7, respectively). each cnn is evaluated with the following benchmark metrics: accuracy, sensitivity, specificity, precision and f1-score. the average 10-fold cross-validation metrics, summarized in table 8, confirm that cnn-1 and cnn-2 behave differently. regarding the application of cnn-1 on test-2, the results are insufficient. in fact, the accuracy reaches just 50.24% because the cnn is capable only of recognizing well not covid-19 images (precision is 80.00%) but has very low performance on covid-19 images (sensitivity = 19.00%). as stated before, the analysis of test-2 is very hard without a larger dataset of images. 
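for reference, the five benchmark metrics listed above follow directly from the entries of a binary confusion matrix, with covid-19 taken as the positive class. the sketch below uses the standard definitions; the function name and the example counts are hypothetical and are not taken from the paper's tables.

```python
def binary_metrics(tp, fn, tn, fp):
    """Standard binary classification metrics from confusion-matrix counts.

    tp / fn: positive (covid-19) images classified correctly / incorrectly.
    tn / fp: negative (not covid-19) images classified correctly / incorrectly.
    """
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    sensitivity = tp / (tp + fn)            # recall on the positive class
    specificity = tn / (tn + fp)            # recall on the negative class
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }


# Hypothetical counts for a test set of 100 positive and 100 negative images.
metrics = binary_metrics(tp=90, fn=10, tn=80, fp=20)
```

with these illustrative counts, sensitivity (0.90) and specificity (0.80) differ noticeably even though accuracy is 0.85, which is why accuracy alone is not sufficient when the cost of a missed covid-19 case is high.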
in order to deeply understand the behaviour of cnn-1 and cnn-2 we used cam, which gives visual explanations of the predictions of convolutional neural networks. this is useful to figure out what each cnn has learned and which part of the input of the network is responsible for the classification. it can be useful to identify biases in the training set and to increase model accuracy. with cam it is also possible to verify whether a cnn is overfitting and, in particular, whether its predictions are based on relevant image features or on the background. to this aim, we expect the activation maps to be focused on the lungs and especially on those parts affected by covid-19 (lighter regions with respect to the healthy, darker, zones of the lungs). figure 4 shows 3 examples of cams for each cnn and, to allow comparisons, we refer them to the same 3 ct images (diagnosed as covid-19 both by radiologists and by the cnns) extracted from the training dataset. by a visual comparison, for cnn-1 (figures 4.a, 4.b and 4.c), the activations are not well localized inside the lungs, though in figure 4.b the activations are better focused on the lungs than in figures 4.a and 4.c. regarding the cams of cnn-2 (figures 4.d, 4.e and 4.f), there is an improvement because the activations are more localized on the ill parts of the lungs (this is best represented in figure 4.f). figure 5 shows 3 examples of cams for each cnn (as in figure 4), but with 3 ct images of lungs not affected by covid-19 and correctly classified by both cnns. cnn-1 focuses on small isolated zones (figures 5.a, 5.b and 5.c): even if these zones are inside the lungs, it is unreasonable to obtain a correct classification from so little information (and without having checked the rest of the lungs). instead, in cnn-2 the activations take into consideration the whole region occupied by the lungs, as demonstrated in figures 5.d, 5.e and 5.f. 
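the maps discussed above can be computed, in the original cam formulation [18], as a weighted sum of the last convolutional feature maps, where the weights are those connecting the global average pooling output to the class of interest. the sketch below assumes a network that ends with global average pooling followed by a linear classifier; the array shapes, the random stand-in data and the function names are hypothetical, and the final upsampling of the map to the ct image size is omitted.

```python
import numpy as np


def raw_cam(features, class_weights, class_idx):
    """Weighted sum of feature maps for one class.

    features: (C, H, W) output of the last convolutional layer.
    class_weights: (n_classes, C) weights of the linear layer after pooling.
    """
    return np.einsum('c,chw->hw', class_weights[class_idx], features)


def normalise(cam):
    """Rescale a map to [0, 1] so that it can be shown as a heat map."""
    shifted = cam - cam.min()
    peak = shifted.max()
    return shifted / peak if peak > 0 else shifted


rng = np.random.default_rng(0)
features = rng.random((8, 7, 7))            # stand-in for real feature maps
class_weights = rng.normal(size=(2, 8))     # two classes: not covid-19, covid-19
cam = raw_cam(features, class_weights, class_idx=1)
heat_map = normalise(cam)
```

because pooling and the weighted sum are both linear, the spatial mean of the raw map equals the class score before the softmax, which is what justifies reading the map as the contribution of each image region to the decision.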
in conclusion, it is evident that cnn-2 behaves better than cnn-1. since cnn-1 and cnn-2 have the same model design but different training datasets, we argue that the training dataset is responsible for their different behaviour. in fact, dataset arrangement 2 contains more training images (taken from the italian dataset) and cnn-2 seems to benefit from them. figure 4 and figure 5 show that the cnn model, even with a limited number of parameters, is capable of learning the discriminant features of this kind of images. therefore, enlarging the training dataset should also increase the performance of the cnn. we compare the results of cnn-2 with [10, 14, 15]. since the methods and the datasets (training and test) differ, a rigorous quantitative comparison is arduous; nevertheless, table 9 summarizes the respective results and gives an idea of how they compare. the method in [10] achieves better results than cnn-2. with respect to [14] and [15], our method achieves better results, especially regarding sensitivity. the average time required by cnn-2 to classify a single ct image is 1.25 seconds on the previously defined high-end workstation. as a comparison, the method in [10] requires 4.51 seconds on a similar high-end workstation (intel xeon processor e5-1620, gpu ram 16gb, gpu nvidia quadro m4000 8gb) when just classification is considered. however, when the time necessary for pre-processing is considered, the method in [10] requires 13.41 seconds on the same workstation, thus being more than 10 times slower than cnn-2. the computation time increases dramatically for [10] when pre-processing is considered because it includes lung segmentation through a supplementary cnn (a u-net), voxel intensity clipping/normalization and, finally, the application of maximum intensity projection. this also makes the method in [10] impractical for medium-end machines without gpu acceleration. 
on the contrary, the average classification time for cnn-2 was 7.81 seconds on a middle-class computer. this means that the method proposed here can be used massively on medium-end computers: a dataset of about 4300 images, roughly corresponding to 3300 patients [10], could be classified in about 9.32 hours. the improvement in efficiency of the proposed method with respect to the compared methods is demonstrated in table 9, where the sensitivity value (the only parameter reported by all the compared methods) is related to the number of parameters used to reach it: the resulting ratio confirms that the proposed method greatly outperforms the others in efficiency. in this study, we proposed a cnn design (starting from the model of the squeezenet cnn) to discriminate between covid-19 and other ct images (composed both of community-acquired pneumonia and healthy images). on both dataset arrangements, the proposed cnn design outperforms the original squeezenet. in particular, cnn-2 achieved 85.03% accuracy, 87.55% sensitivity, 81.95% specificity, 85.01% precision and 86.20% f1-score. moreover, cnn-2 is more efficient than other, more complex, cnn designs. in fact, the average classification time is low both on a high-end computer (1.25 seconds for a single ct image) and on a medium-end laptop (7.81 seconds for a single ct image). this demonstrates that the proposed cnn is capable of analyzing thousands of images per day even with limited hardware resources. the next step is to further increase the performance of cnn-2 through specific pre-processing strategies. in fact, high-performing cnn designs [15, 10] mostly use pre-processing with gpu acceleration. our ambitious future goal is to obtain specific and efficient pre-processing strategies for middle-class computers without gpu acceleration. 
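the timing claims made in the comparison and in the conclusions can be checked with a few lines of arithmetic. the sketch below only reuses the per-image times quoted in the text (1.25 s, 7.81 s and 13.41 s) and the dataset size of about 4300 images; the function names are hypothetical.

```python
def batch_hours(n_images, seconds_per_image):
    """Hours needed to classify n_images sequentially."""
    return n_images * seconds_per_image / 3600.0


def images_per_day(seconds_per_image):
    """Number of images that can be classified in 24 hours."""
    return int(24 * 3600 / seconds_per_image)


laptop_hours = batch_hours(4300, 7.81)          # about 9.33 hours
laptop_daily = images_per_day(7.81)             # about eleven thousand images
workstation_daily = images_per_day(1.25)        # 69120 images
slowdown = 13.41 / 1.25                         # method [10] with pre-processing vs cnn-2
```

the result for the laptop (about 9.33 hours) agrees, up to rounding, with the figure of about 9.32 hours given in the text, and the ratio of about 10.7 is consistent with the statement that the method in [10], pre-processing included, is more than 10 times slower than cnn-2.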
benchmark analysis of representative deep neural network architectures; the role of ct in case ascertainment and management of covid-19 pneumonia in the uk: insights from high-incidence regions; fast and accurate deep network learning by exponential linear units (elus); imagenet: a large-scale hierarchical image database; sensitivity of chest ct for covid-19: comparison to rt-pcr; early ct features and temporal lung changes in covid-19 pneumonia in wuhan, china; squeezenet: alexnet-level accuracy with 50x fewer parameters and <0.5 mb model size; sirm dataset of covid-19 chest ct scan; imagenet classification with deep convolutional neural networks, in: advances in neural information processing systems; artificial intelligence distinguishes covid-19 from community acquired pneumonia on chest ct; coronavirus disease 2019 (covid-19): role of chest ct in diagnosis and management; world health organization web site; improving the speed of neural networks on cpus; a deep learning algorithm using ct images to screen for corona virus disease; deep learning system to screen coronavirus disease 2019 pneumonia; efficient multiple organ localization in ct image using 3d region proposal network; learning deep features for discriminative localization. the authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper. key: cord-168974-w80gndka authors: ozkaya, umut; ozturk, saban; barstugan, mucahid title: coronavirus (covid-19) classification using deep features fusion and ranking technique date: 2020-04-07 journal: nan doi: nan sha: doc_id: 168974 cord_uid: w80gndka coronavirus (covid-19) emerged towards the end of 2019. the world health organization (who) identified it as a global epidemic. 
there is a consensus that using computerized tomography (ct) techniques for early diagnosis of the pandemic disease gives both fast and accurate results. expert radiologists have stated that covid-19 displays different behaviours in ct images. in this study, a novel method based on fusing and ranking deep features is proposed to detect covid-19 in its early phase. 16x16 (subset-1) and 32x32 (subset-2) patches were obtained from 150 ct images to generate sub-datasets. within the scope of the proposed method, 3000 patch images were labelled as covid-19 or no finding for use in the training and testing phases. feature fusion and ranking were applied in order to increase the performance of the proposed method. then, the processed data were classified with a support vector machine (svm). compared with other pre-trained convolutional neural network (cnn) models used in transfer learning, the proposed method shows high performance on subset-2, with 98.27% accuracy, 98.93% sensitivity, 97.60% specificity, 97.63% precision, 98.28% f1-score and 96.54% matthews correlation coefficient (mcc). it is essential to apply the necessary quarantine conditions and to discover treatment methods in order to prevent the rapid spread of corona virus disease (covid-19). it has become a global epidemic similar to other pandemic diseases and has caused patient deaths in china according to world health organization (who) data [1] [2] [3]. early application of treatment procedures for individuals with covid-19 infection increases the patient's chances of survival. fever, cough and shortness of breath are the most important symptoms for the diagnosis of covid-19 in infected individuals. at the same time, infected individuals may show none of these symptoms and still be carriers. pathological tests performed in laboratories take more time. also, the margin of error can be high. 
a fast and accurate diagnosis is necessary for an effective struggle against covid-19. for this reason, experts have started to use radiological imaging methods. these procedures are performed with computed tomography (ct) or x-ray imaging techniques. covid-19 cases have similar features in ct images in the early and late stages: they show a circular and inward diffusion from within the image [4]. therefore, radiological imaging provides early detection of suspicious cases with an accuracy of 90%. among the studies in the literature, shan et al. proposed a neural network model called vb-net in order to segment the covid-19 regions in ct images. the proposed method was tested on 300 new cases. a recommendation system was used to make it easier for radiologists to mark infected areas within ct images [5]. xu et al. analyzed ct images to distinguish healthy, covid-19 and other viral cases. the dataset used included 219 covid-19, 224 viral disease and 175 healthy images. they achieved 87.6% overall classification accuracy with their deep learning method [6]. apostolopoulos et al. proposed a transfer learning method to classify covid-19 and normal cases. they obtained 96.78% accuracy, 98.66% sensitivity and 96.46% specificity [7]. shuai et al. were able to successfully diagnose covid-19 using deep learning models that could extract graphical features from ct images [8]. 150 ct images were used in this study to classify covid-19 cases. two different datasets were generated from the 150 ct images. these datasets include 16×16 and 32×32 patch images. each dataset contains 3000 images labeled as covid-19 or no findings. deep features were obtained with pre-trained convolutional neural network (cnn) models. these deep features were fused and ranked to train a support vector machine (svm). the proposed method can be used for early diagnosis of covid-19 cases. this study consists of 5 sections. 
the properties of the obtained patch images are visualized in section 2. in section 3, the basics of deep learning methods and of the feature fusion and ranking techniques are described. comparative classification performances are given in section 4. a discussion and the conclusions are given in section 5. 53 infected ct images were accessed from the societa italiana di radiologia medica e interventistica to generate the datasets [9]. patch images were obtained from infected and non-infected regions of the ct images. the properties of the two different patch sets are given in table 1. ct images were obtained. the process of obtaining patches is given in figure 1. in 2006, geoffrey hinton showed that deep neural networks can be effectively trained by the greedy layer-wise pre-training method [10]. other research groups used the same strategy to train many other deep networks. the term "deep learning" was popularized to draw attention to the theoretical importance of depth for designing better-performing neural networks. deep learning, which has become quite popular recently, has been used in many areas: e-mail filtering, search engine matching, smartphones, social media and e-commerce are among them. academic studies have pioneered its use in these areas. deep learning is also used for face recognition, object recognition, object detection, text classification and speech recognition. deep learning is a type of artificial neural network with multiple layers. increasing the number of layers generally allows greater accuracy to be achieved. while deep convolutional networks are successfully used in image, video, speech and sound processing, recurrent neural networks are used on sequential data such as text and speech. 
deep learning, which came into wide use around 2010, applies many layers of computation to large data sets and can itself learn parameters that otherwise have to be defined by hand in machine learning. deep artificial neural networks are algorithms inspired by the functions of the brain. in machine learning, a deep belief network (dbn) is a generative graphical model or, alternatively, a class of deep neural network consisting of multiple layers of hidden nodes. when trained on a set of examples without supervision, a dbn can learn to probabilistically reconstruct its inputs. the layers then act as feature detectors. after this learning phase, a dbn can be further trained with supervision to perform classification. dbns can be seen as a composition of simple, unsupervised networks, such as restricted boltzmann machines (rbms) or autoencoders, in which the hidden layer of each sub-network serves as the visible layer of the next. convolution is a mathematical operation and a special type of linear operation. convolutional neural networks (cnn) are neural networks with at least one convolution layer. however, the convolution operation used in deep learning differs from the convolution of conventional or engineering mathematics. a convolutional neural network has layers such as convolution, relu, pooling, normalization, fully connected and softmax layers; classification takes place in the fully connected and softmax layers. in general, convolution is an operation on two real-valued functions, and it can be described with the following example. suppose the location of a space shuttle is monitored with a laser sensor. the laser sensor produces a single output x(t), which is the position of the space shuttle at time t.
both x and t are real-valued, that is, a different reading can be obtained at any instant t. suppose also that the sensor is somewhat noisy. to obtain a less noisy estimate, the designer can average several measurements together. more recent measurements are more relevant, so the average should be weighted to give more weight to recent measurements. this can be done with a weighting function w(a), where a is the age of a measurement. if this weighted average is applied at every moment, a new function s is obtained that estimates the position more accurately:
$$s(t) = \int x(a)\, w(t - a)\, da \qquad (1)$$
this operation is a convolution and is denoted with a star:
$$s(t) = (x * w)(t) \qquad (2)$$
in cnn terminology, the first argument of eq. 2 (the function x) is called the input and the second argument (the function w) is called the kernel. the output is called the feature map. in the above example the measurement is made continuously, which is not realistic. time is discretized when working on a computer; for a realistic setting, assume one measurement is taken per second, so that the time index t takes only integer values and x and w are defined only on integers:
$$s(t) = (x * w)(t) = \sum_{a=-\infty}^{\infty} x(a)\, w(t - a)$$
in machine learning applications, the input is usually a multidimensional array of data and the kernel is a multidimensional array of parameters. convolution is often applied over several axes at once, so if the input is a two-dimensional image $I$, the kernel $K$ is also a two-dimensional matrix:
$$S(i, j) = (I * K)(i, j) = \sum_{m} \sum_{n} I(m, n)\, K(i - m, j - n)$$
the above equation flips the kernel relative to the input, which gives convolution its commutative property [11] . this property is not very important for machine learning libraries. instead, many libraries apply the kernel without flipping it, an operation called cross-correlation that is closely related to convolution; because it resembles convolution, the resulting model is still called a convolutional neural network:
$$S(i, j) = \sum_{m} \sum_{n} I(i + m, j + n)\, K(m, n)$$
discrete convolution can also be viewed as a matrix product.
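the difference between discrete convolution and cross-correlation can be made concrete with a small one-dimensional python sketch. the function names and the example values are illustrative assumptions and do not come from the study; only the standard library is used.

```python
def convolve1d(x, w):
    """Full discrete convolution: s[t] = sum over a of x[a] * w[t - a]."""
    n, k = len(x), len(w)
    out = [0.0] * (n + k - 1)
    for t in range(n + k - 1):
        for a in range(n):
            # w[t - a] only exists when the index falls inside the kernel
            if 0 <= t - a < k:
                out[t] += x[a] * w[t - a]
    return out


def cross_correlate1d(x, w):
    """'Valid' cross-correlation: the kernel slides over x without being flipped."""
    n, k = len(x), len(w)
    return [sum(x[t + a] * w[a] for a in range(k)) for t in range(n - k + 1)]


signal = [1, 2, 3, 4]
kernel = [1, 0, -1]

flipped = convolve1d(signal, kernel[::-1])   # convolution with the reversed kernel
correlated = cross_correlate1d(signal, kernel)
```

in this sketch, the fully overlapping part of `flipped` (indices 2 and 3) equals `correlated`, which illustrates that the two operations differ only by a flip of the kernel.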
typical convolutional neural networks make use of further specializations to deal efficiently with large inputs. figure 2 shows how the operation proceeds in a convolutional neural network. convolution brings three important ideas that help improve a machine learning system: sparse interactions, parameter sharing, and equivariant representations. furthermore, convolution can work with inputs of variable size. traditional neural network layers use multiplication by a matrix of parameters, with a separate parameter describing the interaction between each input unit and each output unit. this means that every output unit interacts with every input unit. cnns, however, typically have sparse interactions (also called sparse connectivity or sparse weights). this is achieved by making the kernel smaller than the input. since the number of pixels decreases after each convolution, features at the edges that should not be lost are preserved by adding zeros at the ends of the rows and columns. this process is called padding. for example, an input image may consist of thousands or millions of pixels, yet small, meaningful features such as edges can be detected with kernels that occupy only tens or hundreds of pixels. this means that fewer parameters need to be stored, which both reduces the memory requirements of the cnn model and improves its statistical efficiency. it also means that computing the output requires fewer operations. these efficiency gains are usually quite large. parameter sharing refers to using the same parameter for more than one function in a model. in a traditional neural network, each element of the weight matrix is used exactly once when computing the output of a layer: it is multiplied by one element of the input and never revisited. as a synonym for parameter sharing, one can say that a network has tied weights, because the value of the weight applied to one input is tied to the value of a weight applied elsewhere.
in a cnn, each member of the kernel is used at every position of the input. the parameter sharing used by the convolution operation means that, rather than learning a separate set of parameters for every location, only one set is learned. the images are three-dimensional, of size h × w × d, and the kernel size is k × k; the number of pixels of the convolution output is calculated from these sizes. normalization roughly means rescaling the data. the size of the data values matters in artificial neural networks: as the data grow, the memory they occupy increases, which reduces both the efficiency and the working speed of the network. compressing all values of the dataset into the range 0-1 makes the operations easier. standardization, in contrast, subtracts the mean of the data and rescales the features so that they follow a standard normal distribution. the standard score of each sample x is computed as $z = (x - \mu) / \sigma$, where μ and σ are the mean and the standard deviation, respectively. the standardized features are centered at 0 with a standard deviation of 1, which is important for training many machine learning algorithms. a pooling function replaces the output of the network at a specific location with a summary statistic of nearby outputs. for example, max-pooling outputs the largest value within a rectangular neighborhood. other popular pooling functions are mean and minimum pooling. when the number of parameters in the next layer depends on the size of the input image or feature map, any reduction in input size also increases statistical efficiency and reduces the memory required for storing parameters. the number of pixels of the pooling output is likewise calculated from the input size and the pooling window size. the rectified linear unit (relu) is a type of activation function that has recently become popular. it computes the function f(x) = max(0, x). in other words, the activation is thresholded at zero.
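as a minimal sketch of the layer operations described above, the following python functions compute the output size of a convolution or pooling layer, the relu activation and non-overlapping max pooling. the padding and stride parameters and all function names are illustrative assumptions that do not appear in the original text; the size rule used is the common one, floor((n + 2p - k) / s) + 1 per spatial dimension.

```python
def output_size(n, k, padding=0, stride=1):
    """Spatial output size of a convolution or pooling window of size k
    applied to an input of size n (assumed rule: floor((n + 2p - k) / s) + 1)."""
    return (n + 2 * padding - k) // stride + 1


def relu(values):
    """Rectified linear unit, f(x) = max(0, x), applied element-wise."""
    return [max(0.0, v) for v in values]


def max_pool2d(image, size=2):
    """Non-overlapping max pooling over size x size windows of a 2-D list."""
    rows, cols = len(image), len(image[0])
    return [
        [
            max(image[r + i][c + j] for i in range(size) for j in range(size))
            for c in range(0, cols - size + 1, size)
        ]
        for r in range(0, rows - size + 1, size)
    ]


# Hypothetical 32 x 32 patch, 3 x 3 kernel, followed by 2 x 2 pooling.
conv_side = output_size(32, 3)                    # no padding, stride 1
pooled_side = output_size(conv_side, 2, stride=2)
```

with a padding of 1, a 3 × 3 kernel keeps the 32 × 32 size unchanged, which is the situation the padding paragraph describes.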
the use of relu has a number of pros and cons. relu has been found to significantly accelerate the convergence of stochastic gradient descent compared with the sigmoid/tanh functions; this is attributed to its linear, non-saturating form. whereas tanh/sigmoid neurons involve costly operations, relu can be implemented by simply thresholding an activation matrix at zero. on the other hand, relu units can be fragile during the training phase. for example, a large gradient flowing through a neuron with a relu activation function can cause the weights to be updated in such a way that the neuron is never activated again at any data point. if this happens, the gradient flowing through the unit will be zero from that point on. that is, relu units can die irreversibly during training. for example, if the learning rate is too high, as much as 40% of the network may be dead. this occurs less often when the learning rate is set appropriately. in fully connected layers, dropping nodes below a certain threshold has been observed to improve performance; in other words, forgetting weak information improves learning. some properties of the dropout value are as follows. the dropout value is generally 0.5, although other values are also common; it varies with the problem and the data set. random elimination can also be used for dropout. when used as a threshold, the dropout value is defined in the range [0, 1]. it is not necessary to use the same dropout value in all layers; different dropout values can be used. the softmax function is a kind of classifier. logistic regression is a binary classifier, and the softmax function is its multi-class generalization. the term $1 / \sum_j e^{f_j}$ normalizes the distribution, that is, the values sum to 1. the function therefore gives the probability of each class to which the input may belong.
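the normalization performed by the softmax function can be illustrated with a short python sketch. subtracting the maximum score before exponentiation is a common numerical-stability measure and is an assumption of this sketch, not part of the description above; the function name and the example scores are illustrative.

```python
import math


def softmax(scores):
    """Convert raw class scores f_j into probabilities that sum to 1."""
    shift = max(scores)                       # assumed stability trick; result is unchanged
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)                         # the normalization term, sum_j e^{f_j}
    return [e / total for e in exps]


# Hypothetical scores for two classes (covid-19, no findings).
probabilities = softmax([2.0, 0.5])
```

the larger score receives the larger probability, and the two values sum to 1 regardless of the scale of the scores.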
given a test input x, the activation function is asked to estimate the probability p(y = j | x) for each value j = 1,…,k, that is, the probability that the class label takes each of the k possible values. the activation function thus outputs a k-dimensional vector containing the estimated probabilities. an error value must be calculated for learning to occur, and for the softmax function this error is given by the softmax loss function. in the softmax classifier, the function mapping $f(x_i; W) = W x_i$ remains unchanged, but the scores are now interpreted as unnormalized log probabilities of each class, and the cross-entropy loss takes the form $L_i = -\log\left( e^{f_{y_i}} / \sum_j e^{f_j} \right)$. vgg-16, googlenet and resnet-50 models were used for feature extraction. the feature vectors obtained with these models were fused to obtain higher-dimensional fusion features. in this way, the effect of insufficient features obtained from a single cnn is minimized. however, the fused features contain a certain level of correlation and redundant information, which increases computation time and complexity. it is therefore necessary to rank the features. the t-test technique was used for feature ranking. it calculates the difference between two features and determines statistically whether they differ [12] . the ranking thus takes into account the frequency of the same features in the feature vector and the frequency of finding the average feature. after feature fusion and ranking, a binary svm classifier was trained for classification. svm uses kernel functions to map the features into a space where they can be separated better [13] . a linear kernel function was used in the svm. the svm classifier was trained to minimize the squared hinge loss, which is given in eq. 10. here, $x_n$ represents the fused and ranked feature vector.
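the fusion, ranking and loss steps described above can be sketched in plain python. this is an illustration under stated assumptions, not the implementation used in the study: fusion is taken to be simple concatenation, the ranking score is assumed to be the absolute welch two-sample t statistic between the two classes, and the loss is assumed to have the usual l2-regularized squared hinge form with labels +1 and -1. all function names and example values are hypothetical.

```python
import math


def fuse(*feature_vectors):
    """Concatenate the feature vectors produced by several networks (assumed fusion)."""
    fused = []
    for vector in feature_vectors:
        fused.extend(vector)
    return fused


def t_score(a, b):
    """Absolute Welch t statistic between two groups (each needs >= 2 values)."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    var_a = sum((v - mean_a) ** 2 for v in a) / (len(a) - 1)
    var_b = sum((v - mean_b) ** 2 for v in b) / (len(b) - 1)
    denom = math.sqrt(var_a / len(a) + var_b / len(b))
    return abs(mean_a - mean_b) / denom if denom > 0 else 0.0


def rank_features(samples, labels):
    """Return feature indices ordered from most to least discriminative."""
    scores = []
    for j in range(len(samples[0])):
        pos = [s[j] for s, y in zip(samples, labels) if y == 1]
        neg = [s[j] for s, y in zip(samples, labels) if y == -1]
        scores.append(t_score(pos, neg))
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)


def squared_hinge_loss(w, b, samples, labels, c=1.0):
    """0.5 * ||w||^2 + C * sum_n max(0, 1 - y_n (w . x_n + b))^2 (assumed form)."""
    regularizer = 0.5 * sum(wi * wi for wi in w)
    penalty = 0.0
    for x, y in zip(samples, labels):
        margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
        penalty += max(0.0, 1.0 - margin) ** 2
    return regularizer + c * penalty
```

a larger c makes each margin violation more expensive relative to the regularizer, so the classifier trades a wider margin for fewer training errors.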
the penalty for misclassification is determined by the hyperparameter c in the loss function. in the proposed method, the pre-trained cnns were trained on subset-1 and subset-2 separately. vgg-16, googlenet and resnet-50 were used as the pre-trained networks. during the test phase, patch images were given as input to the trained cnns. the feature vectors obtained from these networks (1000 × 1 × 3) form a new feature set through the fusion process, in which the correlation values between features were taken into consideration. the obtained features were ranked with the t-test method; in this ranking process, features close to each other were eliminated according to feature frequency. in the last stage, the fused and ranked deep features were evaluated with the svm classifier. the proposed method is visualized in figure 4 . subset-1 contains 6000 ct patches of 16 × 16 pixels, distributed equally between the classes. 75% of these images were used for training and 25% for testing. table 2 compares the classification performance of the pre-trained cnns and of the proposed method. subset-2 includes 3000 covid-19 and 3000 no-finding ct patches of 32 × 32 pixels. comparative classification results for subset-2 are given in table 3 . on subset-1, the proposed method showed the best performance with 95.60% accuracy. the proposed method also achieved the highest f1-score and mcc, 98.28% and 96.54% respectively. in addition to table 2 and table 3 , the confusion matrices obtained with the proposed method on subset-1 and subset-2 are shown in figure 5 and figure 6 . figure 5 shows the confusion matrix of the proposed method on subset-1. evaluated per class, the covid-19 class was classified with an accuracy of 97.9%. performance on the no findings class was lower, with an accuracy of 93.3%.
in the analysis of the positive class, a classification accuracy of 93.6% was obtained. for the negative class this rate is higher, at 97.8%. subset-2 was then used to train and test the proposed method. figure 6 shows the confusion matrix obtained for the test data. in the per-class analysis, an accuracy of 97.6% was obtained for the covid-19 class. performance on the no findings class improved compared with subset-1, reaching an accuracy of 98.9%. in the positive and negative class evaluation, classification accuracies of 98.9% and 97.6% were obtained, respectively. the first case of covid-19 was found in the wuhan region of china. covid-19 is an epidemic disease that threatens the world's health systems and economy. the covid-19 virus behaves similarly to other pandemic viruses, which makes it difficult to detect covid-19 cases quickly. therefore, covid-19 is a candidate for a global epidemic. radiological imaging techniques are used for a more accurate diagnosis of covid-19, and ct imaging makes it possible to obtain more detailed information about the disease. when ct images are examined, shadows stand out in the regions where covid-19 is located, and a spread from the outer to the inner parts is observed. images obtained with different ct devices were used in the study. the different characteristics of the ct devices produced different grey levels in the images, which complicates their analysis. in the study, deep features were obtained using pre-trained cnns and were then fused and ranked. the data set was generated by taking random patches from the ct images.
clinical features of patients infected with 2019 novel coronavirus in wuhan added value of computer-aided ct image features for early lung cancer diagnosis with small pulmonary nodules: a matched case-control study dermatologist-level classification of skin cancer with deep neural networks mining x-ray images of sars patients lung infection quantification of covid-19 in deep learning system to screen coronavirus disease covid-19: automatic detection from x-ray images utilizing transfer learning with convolutional neural networks a deep learning algorithm using ct images to screen for corona virus disease (covid-19) improving neural networks by preventing co-adaptation of feature detectors non-native children speech recognition through transfer learning a modified t-test feature selection method and its application on the hapmap genotype data statistical learning theory: a tutorial evaluation of the confusion matrix method in the validation of an automated system for measuring feeding behaviour of cattle key: cord-317643-pk8cabxj authors: masud, mehedi; eldin rashed, amr e.; hossain, m. shamim title: convolutional neural network-based models for diagnosis of breast cancer date: 2020-10-09 journal: neural comput appl doi: 10.1007/s00521-020-05394-5 sha: doc_id: 317643 cord_uid: pk8cabxj breast cancer is the most prevailing cancer in the world and each year affecting millions of women. it is also the cause of largest number of deaths in women dying in cancers. during the last few years, researchers are proposing different convolutional neural network models in order to facilitate diagnostic process of breast cancer. convolutional neural networks are showing promising results to classify cancers using image datasets. there is still a lack of standard models which can claim the best model because of unavailability of large datasets that can be used for models’ training and validation. 
hence, researchers are now focusing on the transfer learning approach, using pre-trained models trained on millions of different images as feature extractors. with this motivation, this paper considers eight different fine-tuned pre-trained models to observe how they classify breast cancers in ultrasound images. we also propose a shallow custom convolutional neural network that outperforms the pre-trained models with respect to different performance metrics. the proposed model shows 100% accuracy and achieves a 1.0 auc score, whereas the best pre-trained model shows 92% accuracy and a 0.972 auc score. to avoid bias, the model is trained using fivefold cross-validation. moreover, the model trains faster than the pre-trained models and requires a small number of trainable parameters. the grad-cam heat map visualization also shows how well the proposed model extracts important features to classify breast cancers. breast cancer affects millions of women worldwide every year and is the leading cause of cancer deaths among women [1] . survival rates vary widely between countries: in north america the rate is greater than 80%, in sweden and japan it is around 60%, while in low-income countries it is below 40% [1] . the main reasons for the low survival rate in low-income countries are the lack of early detection programs and the shortage of diagnosis and healthcare facilities. it is therefore vital to detect breast cancer at an early stage to minimize mortality. mammography and ultrasound images are the common tools used to identify cancers, and they require expert radiologists. the manual process may produce high numbers of false positives and false negatives. therefore, computer aided diagnosis (cad) systems are now widely used to assist radiologists in decision making when identifying cancers.
cad systems can reduce the effort of radiologists and minimize the number of false positives and false negatives in diagnosis. machine learning techniques are used in traditional computer aided systems for disease diagnosis and patient monitoring [2] . however, traditional machine learning techniques involve a hand-crafted feature extraction step, which is sometimes very difficult and requires domain knowledge and an expert radiologist. in contrast, deep learning (dl) models learn adaptively and can extract features from the input dataset with respect to the target output [3, 4] . dl methods greatly reduce the exhaustive process of data engineering and feature extraction while enabling the reuse of methods. numerous studies [5] have examined breast cancer images from various perspectives. machine learning (ml), convolutional neural networks (cnns), and deep learning methods are now widely used to classify breast cancers from breast images. cnn models have been used effectively in a wide range of computer vision fields for years [6, 7] . in the last few years, numerous studies have applied cnn-based deep learning architectures to disease diagnosis. a cnn-based image recognition and classification model was perhaps first applied in the imagenet competition [8] . since then, cnn-based models have been used in various applications, for example image segmentation in medical image processing, feature extraction from images, finding regions of interest, object detection and natural language processing. a cnn has a very large number of trainable parameters at its various layers, which are used to extract important features at various levels of abstraction [9] . however, a cnn model needs a huge dataset to train, and in the medical field such a dataset may not always be available.
moreover, a cnn model requires high-speed computing resources to train and to tune its hyperparameters. to overcome the lack of data, transfer learning techniques are now widely applied in the classification of medical images. with transfer learning, a model can use knowledge from pre-trained models (e.g., vgg16 [10] , alexnet [11] , densenet [12] , etc.) that have been trained on a huge dataset to classify images. this lessens the amount of data required for the problem at hand. the pre-trained models are often used as feature extractors for images, from the abstract level to more detailed levels. transfer learning with pre-trained models has shown promising results in different medical diagnosis tasks, such as chest x-ray image analysis for identifying pneumonia and covid-19 patients [13] , retina image analysis for blindness classification, and mri image analysis for brain tumor classification. deep learning models based on cnns are widely used to classify breast cancers. we now discuss some of the promising cnn-based studies. the authors in [14] proposed a deep learning framework that learns features automatically from mammography images in order to identify cancer. the framework was tested on the bcdr-fm dataset. although they showed improved results, they did not compare against pre-trained models. the authors in [15] used alexnet as a feature extractor for mass diagnosis in mammography images, with a support vector machine (svm) as the classifier applied to the features generated by alexnet. the proposed model performed better than the analytical feature extraction method. in our approach, we consider eight different pre-trained models and show their performance on ultrasound images. the authors in [16] used a transfer learning approach with the googlenet [17] and alexnet pre-trained models and some preprocessing techniques.
the model is applied to mammogram images in which the cancers are already segmented. the authors claim that the model performs better than methods involving humans. the authors in [18] proposed a convolutional neural network based on the inception-v3 pre-trained model to classify breast cancer from breast ultrasound images. the model supports the extraction of multiview features. it was trained on only 316 images and achieved 0.9468 auc, 0.886 sensitivity, and 0.876 specificity. the authors in [19] developed an ensemble cnn model based on the vgg19 and resnet152 pre-trained models with fine-tuning. they used a dataset managed by jabts containing 1536 breast masses, of which 897 are malignant and 639 benign. the model achieves 0.951 auc, 90.9% sensitivity, and 87.0% specificity. the authors in [20] developed another ensemble-based computer aided diagnosis (cad) system combining the vggnet, resnet, and densenet pre-trained models. they used a private database of 1687 images, comprising 953 benign and 734 malignant cases. the model achieved 91.0% accuracy and a 0.9697 auc score. the model was also tested on the busi dataset, where it achieved 94.62% accuracy and a 0.9711 auc score. the authors in [21] implemented two different approaches, (1) a cnn and (2) transfer learning, to classify breast cancer on a combination of two datasets, one containing 780 images and the other 163 images. the model performed better when traditional and generative adversarial network augmentation techniques were combined. in the transfer learning approach, the authors compared the performance of four pre-trained models, namely vgg16, inception [22] , resnet, and nasnet [23] . on the combined dataset, nasnet achieved the highest accuracy, 99%.
the authors in [24] compared three cnn-based transfer learning models, resnet50, xception, and inceptionv3, and proposed a base model consisting of three convolutional layers to classify breast cancers from a breast ultrasound image dataset. the dataset comprises 2058 images, including 1370 benign and 688 malignant cases. according to their analysis, inceptionv3 showed the best accuracy, 85.13%, with an auc score of 0.91. the authors in [25] analyzed four pre-trained models, vgg16, vgg19, inceptionv3, and resnet50, on a dataset of 5000 breast images comprising 2500 benign and 2500 malignant cases. the inceptionv3 model achieved the highest auc, 0.905. the authors in [26] proposed a cnn model for breast cancer classification that considers local and frequency domain information in histopathological images. the objective is to exploit the important information carried by the local and frequency domains, which sometimes improves the accuracy of the model. the proposed model was applied to the breakhis dataset and obtained 94.94% accuracy. the authors in [27] proposed a novel deep neural network consisting of a clustering method and a cnn model for breast cancer classification using histopathological images. the model is based on a cnn, a long short-term memory (lstm) network, and a mixture of the cnn and lstm models, with both softmax and svm applied at the classifier layer. the model achieved 91% accuracy. from the above discussion, it is evident that researchers are still searching for a better model to classify breast cancers. to overcome the scarcity of datasets, this research combines two publicly available ultrasound image datasets. eight different pre-trained models are then fine-tuned and applied to the combined dataset to observe their breast cancer classification performance. however, the pre-trained models did not show the expected outcome.
therefore, we also develop a shallow cnn-based model. the model outperforms all the fine-tuned pre-trained models on all the performance metrics and is also faster to train. we also employed different evaluation techniques to demonstrate the better outcome of the proposed model. the details of the methods, evaluation results and discussion are presented in sect. 3 . the paper is organized as follows: sect. 2 discusses the materials and methods used for breast cancer classification. section 3 proposes the custom cnn model. section 4 discusses the evaluation results of the pre-trained models and the proposed custom model. finally, the paper concludes in sect. 5. in this research, we consider two publicly available breast ultrasound image datasets [28, 29] . the two datasets are combined mainly for two reasons: (1) to increase the size of the training dataset in order to avoid overfitting and bias, and (2) to cover three classes (benign, malignant and normal). combining the datasets also improves the reliability of the model. the dataset in [28] contains 250 images in two categories: malignant and benign cases. the images differ in size: the minimum and maximum image sizes are 57 × 75 and 61 × 199 pixels, respectively, and the dataset includes both gray and rgb color images. therefore, all the images are converted to gray to fit into the model. the dataset in [29] contains 780 images in three categories: malignant, benign, and normal cases. the average image size is 500 × 500 pixels. the breast ultrasound images were collected from 600 women in 2018, and the age of the women ranges between 25 and 75 years. table 1 shows the class distribution of the images in the two datasets. figure 1 shows examples of ultrasound images of different cases in the two datasets.
data normalization is an important pre-processing phase before feeding data into a model for training. with pre-processing, the data features become more easily interpretable by the model; without correct pre-processing, training becomes slow and unstable. standardization and normalization are the techniques generally used to scale data. normalization rescales the data values to between 0 and 1. since the datasets considered in this research contain both gray and color images, the pixel values lie between 0 and 255. we use the zero-centering approach, which shifts the distribution of the data values so that its mean becomes zero. assume a dataset d that consists of n samples and m features, so that d[:, i] denotes the ith feature and d[j, :] denotes sample j. zero-centering is then defined as $d'[j, i] = d[j, i] - \frac{1}{n} \sum_{k=1}^{n} d[k, i]$. in this research, we employed k-fold (k = 5) cross-validation on the dataset to overcome the overfitting problem during model training. in the k-fold cross-validation method, the data are divided into k folds of the same size; each fold is used once to validate the model, while the remaining k-1 folds are used for training. this ensures that the model produces a reliable accuracy estimate. cross-validation is a widely used mechanism for resampling data to evaluate machine learning models when the dataset is small. it is mainly used to estimate the skill of a machine learning model on data that the model has not seen previously. the result obtained with cross-validation is normally a less biased or less optimistic estimate of the model's skill than the one given by a single train/test split. table 2 shows how the fivefold cross-validation generates five different datasets of ultrasound images from the two datasets.
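the zero-centering and fold-splitting steps described above can be sketched in a few lines of python. the function names are illustrative assumptions, and the sketch splits consecutive indices without shuffling, whereas a real experiment would normally shuffle (and possibly stratify) the samples first.

```python
def zero_center(data):
    """Subtract the per-feature mean so that every feature column has mean zero.

    data is a list of n samples, each a list of m feature values.
    """
    n, m = len(data), len(data[0])
    means = [sum(row[i] for row in data) / n for i in range(m)]
    return [[row[i] - means[i] for i in range(m)] for row in data]


def k_fold_indices(n_samples, k=5):
    """Return k (train_indices, validation_indices) pairs.

    Each sample index is used for validation exactly once; fold sizes
    differ by at most one when n_samples is not divisible by k.
    """
    sizes = [n_samples // k + (1 if f < n_samples % k else 0) for f in range(k)]
    folds, start = [], 0
    for size in sizes:
        validation = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train, validation))
        start += size
    return folds


# Hypothetical pixel values in the 0-255 range, two features per sample.
centered = zero_center([[0.0, 255.0], [100.0, 155.0], [200.0, 190.0]])
folds = k_fold_indices(10, k=5)
```

in each of the five pairs returned for ten samples, eight indices are used for training and the remaining two for validation.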
during the last few years, transfer learning algorithms have been widely used in machine learning research; they concentrate on preserving the knowledge acquired while solving one problem and applying it to a different but related problem. for example, an algorithm trained to recognize dogs can be applied to recognize horses. the authors in [30] formally define transfer learning in terms of domain and task as follows: let an arbitrary domain be d = {x, p(x)}, where x denotes a feature vector $\{x_1, x_2, \ldots, x_n\}$ and p(x) denotes the probability distribution over x. one reason for using transfer learning is that only a small dataset is available to train a custom model while the goal is still an accurate model. a custom model employing transfer learning applies the knowledge of pre-trained models that have been trained on a huge dataset for a long time. there are mainly two approaches to applying transfer learning: (i) developing a model and (ii) using pre-trained models. the pre-trained model approach is widely used in deep learning. considering the importance of pre-trained models as feature extractors, this research implements eight pre-trained models using the weights of their convolutional layers. these weights act as feature extractors for classifying breast cancers in the ultrasound images. table 3 shows the pre-trained models considered in this research. all the models are built on convolutional neural networks and were trained on the imagenet database [31] , which consists of a million images. the models can classify 1000 objects (mouse, keyboard, pencil, and many animals) in different images. all the models have therefore learned rich feature representations from a large number of images. from table 3 , we see that different models use different input sizes.
therefore, the images in the dataset are transformed accordingly before being fed into the models. in the fine-tuning process of the pre-trained models, the final layer is substituted with a classifier that can classify three objects, since the dataset consists of images of three classes (normal, malignant, and benign). hence the models are fine-tuned at the top layers: the last three layers of each model are substituted with (i) a fully connected layer, (ii) a softmax activation layer, and (iii) a custom classifier. we considered three different optimizers to train the models and to determine which model produces the best results. a brief description of the optimizers is given below. stochastic gradient descent with momentum (sgdm) is the fundamental optimizer used for the convergence of neural networks, i.e., for moving in the direction of the optimum of the cost function. the following equation is used to update the neural network parameters using the gradient ∇. here l: initial learning rate, v_t: exponential average of squares of gradients, and g_t: gradient at time t along w_j. the adam optimizer combines the heuristics of momentum and rmsprop. the equation is given below. here l: initial learning rate, v_t: exponential average of gradients along w_j, g_t: gradient at time t along w_j, s_t: exponential average of squares of gradients along w_j, and β_1, β_2 are hyperparameters. the fine-tuned pre-trained models use the softmax activation function to generate probabilities in the range 0 to 1 for the class outcomes of the input images. using a softmax activation function at the end of a cnn model to convert its output scores into a normalized probability distribution is a well-known practice.
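the three optimizers described above can be sketched for a single scalar weight as follows. this is only a sketch using the textbook forms of sgdm, rmsprop, and adam (including adam's usual bias correction), which may differ in detail from the variants used in the study. the function names, the default constants, and the toy objective f(w) = w² are assumptions made for illustration.

```python
import math


def sgdm_step(w, g, state, lr=1e-4, momentum=0.9):
    """SGD with momentum: a velocity term accumulates past gradients."""
    v = momentum * state.get("v", 0.0) - lr * g
    state["v"] = v
    return w + v


def rmsprop_step(w, g, state, lr=1e-4, rho=0.9, eps=1e-8):
    """RMSProp: scale the step by a running average of squared gradients."""
    s = rho * state.get("s", 0.0) + (1.0 - rho) * g * g
    state["s"] = s
    return w - lr * g / (math.sqrt(s) + eps)


def adam_step(w, g, state, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam: running averages of gradients (v) and of squared gradients (s)."""
    t = state.get("t", 0) + 1
    v = beta1 * state.get("v", 0.0) + (1.0 - beta1) * g
    s = beta2 * state.get("s", 0.0) + (1.0 - beta2) * g * g
    state.update(t=t, v=v, s=s)
    v_hat = v / (1.0 - beta1 ** t)   # bias-corrected first moment
    s_hat = s / (1.0 - beta2 ** t)   # bias-corrected second moment
    return w - lr * v_hat / (math.sqrt(s_hat) + eps)


def minimize(step, w0=5.0, steps=2000, **kwargs):
    """Minimize the toy objective f(w) = w**2 (gradient 2w) with one rule."""
    w, state = w0, {}
    for _ in range(steps):
        w = step(w, 2.0 * w, state, **kwargs)
    return w


# Hypothetical comparison on the toy objective with a larger learning rate.
final = {
    "sgdm": minimize(sgdm_step, lr=0.05),
    "rmsprop": minimize(rmsprop_step, lr=0.05),
    "adam": minimize(adam_step, lr=0.05),
}
```

all three rules end close to the minimum at w = 0 on this toy problem; on a real network the same update is applied to every weight using its own gradient.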
the softmax function is defined by the following equation:

$$\mathrm{softmax}(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{k} e^{z_j}}$$

where z is the input vector, z_i are the elements of z, e^{z_i} is the exponential function applied to z_i, and $\sum_{j=1}^{k} e^{z_j}$ is the normalization term.

3 proposed custom model

the model applies batch normalization with 20 channels. it also consists of one max pooling layer and one fully connected layer. dropout regularization is added after the fully connected layer. finally, the softmax activation function is applied, since the model needs to classify three classes. an initial learning rate of 1.0000e-04 and a mini-batch size of 8 are used during training. the model is trained with the same three optimizers as the pre-trained models. the model is trained and validated using the configuration of table 4. the performance of the fine-tuned pre-trained models is evaluated with various standard performance metrics. the metrics are accuracy (acc), area under curve (auc), precision, recall, sensitivity, specificity, and f1-score. a confusion matrix is also generated for each model to observe the scores of true positive (tp), true negative (tn), false positive (fp), and false negative (fn) for the normal, malignant, and benign cases. the tp (e.g., malignant) score represents how often the model correctly classifies real malignant cases as malignant. the fp (e.g., malignant) score represents how often the model wrongly classifies benign cases as malignant. similarly, the tn (e.g., benign) score represents how often the model correctly classifies benign cases as benign, and the fn (e.g., benign) score represents how often the model wrongly classifies malignant cases as benign. another important metric is precision, which is the proportion of the cases a model assigns to a class (malignant, benign, or normal) that truly belong to that class. meanwhile, sensitivity, or recall, is the proportion of the actual cases of a class (e.g., malignant) that the model correctly classifies as that class.
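the confusion-matrix quantities described above can be turned into per-class scores as in the following sketch. it uses the standard one-vs-rest definitions of precision, recall, specificity, and f1-score; the function names and the toy confusion matrix are hypothetical and do not reproduce any result from the study.

```python
def per_class_metrics(cm, labels):
    """Compute one-vs-rest metrics for every class.

    cm[i][j] is the number of samples of true class i predicted as class j.
    """
    total = sum(sum(row) for row in cm)
    scores = {}
    for k, name in enumerate(labels):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                    # class k predicted as another class
        fp = sum(row[k] for row in cm) - tp     # another class predicted as class k
        tn = total - tp - fn - fp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        specificity = tn / (tn + fp) if tn + fp else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        scores[name] = {
            "precision": precision,
            "recall": recall,
            "specificity": specificity,
            "f1": f1,
        }
    return scores


def accuracy(cm):
    """Fraction of all samples that lie on the diagonal of the matrix."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total


# Hypothetical 3-class confusion matrix (rows: true class, columns: predicted).
labels = ["normal", "malignant", "benign"]
toy_cm = [
    [8, 1, 1],
    [0, 9, 1],
    [2, 0, 8],
]
toy_scores = per_class_metrics(toy_cm, labels)
toy_accuracy = accuracy(toy_cm)
```

for the toy matrix, the malignant class has 9 true positives, 1 false positive, 1 false negative, and 19 true negatives, which gives a precision and recall of 0.9 and a specificity of 0.95.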
specificity is the proportion of the cases that do not belong to a class (e.g., the cases that are not benign) that the model correctly identifies as not belonging to it. the f1-score gives a single score from precision and recall by evaluating their harmonic mean. the formulas of the different metrics are shown below.

$$\mathrm{accuracy} = \frac{tp + tn}{tp + tn + fp + fn}, \quad \mathrm{precision} = \frac{tp}{tp + fp}, \quad \mathrm{recall} = \frac{tp}{tp + fn}$$

$$\mathrm{specificity} = \frac{tn}{tn + fp}, \quad \mathrm{f1} = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$$

fig. 2: the architecture of the custom model

the performance scores of the fine-tuned pre-trained models and of the custom model are reported in the tables. table 6 summarizes the models' performance with the best scores for the different evaluation metrics and compares the results with the proposed cnn model. figure 3 shows the confusion matrices generated from the different pre-trained models as well as the proposed custom model. the figure only shows the confusion matrices of the best pre-trained models, as mentioned in table 6. from the confusion matrix of the custom model, we observe that the score is high for all the breast cancer classes. for example, the model correctly classifies 100% of the benign, 100% of the malignant, and 100% of the normal cases using the adam optimizer. the results also outperform the results of the pre-trained models. table 7 shows the classification results of the models. table 8 shows the performance comparison between the custom and the pre-trained models. the custom model outperforms all the pre-trained models with respect to accuracy, prediction time, and number of parameters. the custom model also trains faster than all the fine-tuned pre-trained models. the reason is that the custom model has only one fully connected layer. in addition, the custom model requires a very small number of trainable parameters compared to the other models. all the models are trained on a gpu (nvidia® geforce gtx 1660 ti with max-q design and 6 gb ram) with a mini-batch size of 8. figure 4 shows the execution time and the accuracy score of each model. to measure the time accurately, we ran the code four times. the area of each marker in the fig.
4 indicates the number of parameters in the network. the prediction time of each model is calculated relative to the fastest network. from the plot, it is evident that the custom model is fast in training and produces higher accuracy than the pre-trained models. figure 5 shows the accuracy and loss values when the custom model is trained and validated. from the graph in fig. 5, it is evident that the custom model achieves the very high accuracy reported in table 8. the custom model's performance is also evaluated by generating heat-map visualizations with the grad-cam tool [32] to see how the model identifies the region of interest and how well it distinguishes the cancer classes. grad-cam is used to judge whether a model identifies the key areas in the images for prediction: it visualizes, as a heat map for a class label, the portion of an image on which the model focuses for prediction. figure 6 shows sample grad-cam outputs for the benign and malignant classes together with the prediction probability. from the output, we observe that the model focuses on the key areas of the images to classify cancers. this study implemented eight pre-trained cnn models with fine-tuning, leveraging transfer learning, to observe the classification performance for breast cancer from ultrasound images. the images are combined from two different datasets. we evaluated the fine-tuned pre-trained models using the adam, rmsprop, and sgdm optimizers. the highest accuracy, 92.4%, is achieved by resnet50 with the adam optimizer, and the highest auc score, 0.97, is achieved by vgg16. we also proposed a shallow custom model, since the pre-trained models did not show the expected results and all of them have many convolutional layers and need a long training time. the proposed custom model uses only one convolutional layer as the feature extractor. the custom model achieved 100% accuracy and an auc value of 1.0.
with respect to training time, the custom model is faster than any other model and needs a small number of trainable parameters. the future plan is to validate the model with other datasets that include new ultrasound images.
html#:*:text= breast%20cancer%20survival%20rates%20vary,et%20al.%2c% 202008. accessed
cloud-supported cyber-physical localization framework for patients monitoring
applying deep learning for epilepsy seizure detection and brain mapping visualization
hybrid deep-learning-based anomaly detection scheme for suspicious flow detection in sdn: a social multimedia perspective
cervical cancer classification using convolutional neural networks and extreme learning machines
automatic fruit classification using deep learning for industrial applications
emotion recognition using secure edge and cloud computing
imagenet large scale visual recognition challenge
deep relative attributes
very deep convolutional networks for large-scale image recognition. arxiv preprint
imagenet classification with deep convolutional neural networks
explainable ai and mass surveillance system-based healthcare framework to combat covid-i9 like pandemics
representation learning for mammography mass lesion classification with convolutional neural networks
digital mammographic tumor classification using transfer learning from deep convolutional neural networks
improving eeg-based emotion classification using conditional transfer learning
going deeper with convolutions
breast cancer classification in automated breast ultrasound using multiview convolutional neural network with transfer learning
computer-aided diagnosis system for breast ultrasound images using deep learning
computer-aided diagnosis of breast ultrasound images using ensemble learning from convolutional neural networks
deep learning approaches for data augmentation and classification of breast masses using ultrasound images
rethinking the inception architecture for computer vision
learning transferable architectures for scalable image recognition
comparison of transferred deep neural networks in ultrasonic breast masses discrimination
diagnostic efficiency of the breast ultrasound computer-aided prediction model based on convolutional neural network in breast cancer
histopathological breast-image classification using local and frequency domains by convolutional neural network
histopathological breast cancer image classification by deep neural network techniques guided by local clustering
dataset of breast ultrasound images. data brief
intro to optimization in deep learning: momentum
visual explanations from deep networks via gradient-based localization
publisher's note: springer nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
funding: not applicable. availability of data and material: datasets are collected from public repositories [28, 29]. conflicts of interest: not applicable. code availability: not applicable.
key: cord-127759-wpqdtdjs authors: qi, xiao; brown, lloyd; foran, david j.; hacihaliloglu, ilker title: chest x-ray image phase features for improved diagnosis of covid-19 using convolutional neural network date: 2020-11-06 journal: nan doi: nan sha: doc_id: 127759 cord_uid: wpqdtdjs recently, the outbreak of the novel coronavirus disease 2019 (covid-19) pandemic has seriously endangered human health and life. due to the limited availability of test kits, the need for auxiliary diagnostic approaches has increased. recent research has shown that radiography of covid-19 patients, such as ct and x-ray, contains salient information about the covid-19 virus and could be used as an alternative diagnosis method. owing to its faster imaging time, wide availability, low cost, and portability, chest x-ray (cxr) has gained much attention and is very promising.
computational methods with high accuracy and robustness are required for rapid triaging of patients and aiding radiologist in the interpretation of the collected data. in this study, we design a novel multi-feature convolutional neural network (cnn) architecture for multi-class improved classification of covid-19 from cxr images. cxr images are enhanced using a local phase-based image enhancement method. the enhanced images, together with the original cxr data, are used as an input to our proposed cnn architecture. using ablation studies, we show the effectiveness of the enhanced images in improving the diagnostic accuracy. we provide quantitative evaluation on two datasets and qualitative results for visual inspection. quantitative evaluation is performed on data consisting of 8,851 normal (healthy), 6,045 pneumonia, and 3,323 covid-19 cxr scans. in dataset-1, our model achieves 95.57% average accuracy for a three classes classification, 99% precision, recall, and f1-scores for covid-19 cases. for dataset-2, we have obtained 94.44% average accuracy, and 95% precision, recall, and f1-scores for detection of covid-19. conclusions: our proposed multi-feature guided cnn achieves improved results compared to single-feature cnn proving the importance of the local phase-based cxr image enhancement. coronavirus disease 2019 (covid19) is an infectious disease caused by severe acute respiratory syndrome coronavirus 2 (sars-cov-2), a newly discovered coronavirus [1, 2] . in march 2020, the world health organization (who) declared the covid-19 outbreak a pandemic. up to now, more than 9.23 million cases have been reported across 188 countries and territories, resulting in more than 476,000 deaths [3] . early and accurate screening of infected population and isolation from public is an effective way to prevent and halt spreading of virus. currently, the gold standard method used for diagnosing covid-19 is real-time reverse transcription polymerase chain reaction (rt-pcr) [4] . 
the disadvantages of rt-pcr include its complexity and problems associated with its sensitivity, reproducibility, and specificity [5]. moreover, the limited availability of test kits makes it challenging to provide sufficient diagnosis for every suspected patient in hyper-endemic regions or countries. therefore, a faster, reliable, and automatic screening technique is urgently required. in clinical practice, easily accessible imaging, such as chest x-ray (cxr), provides important assistance to clinicians in decision making. compared to computed tomography (ct), the main advantages of cxr are that it enables fast screening of patients, is portable, and is easy to set up (it can be set up in isolation rooms). however, the sensitivity and specificity (radiographic assessment accuracy) of cxr for diagnosing covid-19 are low compared to ct. this is especially problematic for identifying early-stage covid-19 patients with mild symptoms, and it causes larger intra- and inter-observer variability when radiologists read the collected data, since qualitative indicators can be subtle. therefore, there is increased demand for computer-aided diagnostic methods to aid the radiologist during decision making for improved management of covid-19 disease. in view of these advantages and motivated by the need for accurate and automatic interpretation of cxr images, a number of studies based on deep convolutional neural networks (cnns) have shown quite promising results. ozturk et al. [6] proposed a cnn architecture, termed darkcovidnet, and achieved 87.02% three-class classification accuracy. the method was evaluated on 127 covid-19, 500 healthy, and 500 pneumonia cxr scans. the covid-19 data was obtained from 125 patients. wang et al. [7] built a public dataset named covidx, which comprises a total of 13975 cxr images from 13870 patient cases, and developed covid-net, a deep learning model. their dataset had 358 covid-19 images obtained from 266 patients.
their model achieved 93.3% overall accuracy in classifying normal, pneumonia, and covid-19 scans. in [8], a resnet-50 architecture was utilized to achieve a 96.23% overall accuracy in classifying four classes, where pneumonia was split into bacterial pneumonia and viral pneumonia. however, only eight covid-19 cxr images were used for testing. in [9], 76.37% overall accuracy was reported on a dataset including 1583 normal, 4290 pneumonia, and 76 covid-19 scans. the covid-19 data was collected from 45 patients. in order to improve the performance of the proposed method, data augmentation was performed on the covid-19 dataset, bringing the total number of covid-19 scans to 1,536. with data augmentation, the overall accuracy improved to 97.2%. in [10], contrast limited adaptive histogram equalization (clahe) was used to enhance the cxr data. the authors proposed a depth-wise separable convolutional neural network (dscnn) architecture. evaluation was performed on 668 normal, 619 pneumonia, and 536 covid-19 cxr scans. the average reported multi-class accuracy was 96.43%. the number of patients for the covid-19 dataset was not available. in [11], a stacked cnn architecture achieved an average accuracy of 92.74%. the evaluation dataset had 270 covid-19 scans from 170 patients, 1139 normal scans from 1015 patients, and 1355 pneumonia scans from 583 patients. in [12], the reported multi-class average classification accuracy was 94.2%. the evaluation dataset included 5000 normal, 4600 pneumonia, and 738 covid-19 cxr scans. the data was collected from various sources and patient information was not specified. in [13], transfer learning was investigated for training the cnn architecture. the evaluation dataset included 224 covid-19, 504 normal, and 700 pneumonia images. an average accuracy of 93.48% was reported for three-class classification. the average accuracy increased to 94.72% if viral pneumonia was included in the evaluation.
in [14], the performance of three different, previously proposed cnn architectures was evaluated for multi-class classification. with 2,265 covid-19 images, the study used the largest covid-19 dataset reported so far. the average area under the curve (auc) for classification of covid-19 from regular pneumonia was 0.73 [14]. although numerous studies have shown the capability of cnns in effective identification of covid-19 from cxr images, none of these studies investigated local phase cxr image features as multi-feature input to a cnn architecture for improved diagnosis of covid-19 disease. furthermore, except for [14, 7], most of the previous work was evaluated on a limited number of covid-19 cxr scans. in this work, we show how image enhancement based on local phase cxr features improves the accuracy of cnn architectures for covid-19 diagnosis. specifically, we extract three different cxr local phase image features, which are combined as a multi-feature image. we design a new cnn architecture for processing multi-feature cxr data. we evaluate our proposed methods on large-scale cxr images obtained from healthy subjects as well as subjects diagnosed with community-acquired pneumonia and covid-19. quantitative results show the usefulness of local phase image features for improved diagnosis of covid-19 disease from cxr scans. our proposed method is designed for processing cxr images and consists of two main stages, as illustrated in figure 1: (1) we enhance the cxr images (cxr(x, y)) using a local phase-based image processing method in order to obtain a multi-feature cxr image (mf(x, y)), and (2) we classify cxr(x, y) by designing a deep learning approach in which the multi-feature cxr images (mf(x, y)), together with the original cxr data (cxr(x, y)), are used to improve the classification performance. next, we describe how these two major processes are achieved. in order to enhance the collected cxr images, denoted as cxr(x, y), we use local phase-based image analysis [15].
three different cxr(x, y) image phase features are extracted: (1) the local weighted mean phase angle (lwpa(x, y)), (2) the lwpa(x, y)-weighted local phase energy (lpe(x, y)), and (3) the enhanced local energy attenuation image (elea(x, y)). the lpe(x, y) and lwpa(x, y) image features are extracted using monogenic signal theory, where the monogenic signal image (cxr_m(x, y)) is obtained by combining the bandpass filtered cxr(x, y) image, denoted as cxr_b(x, y), with the riesz filtered components as:

here h_1 and h_2 represent the vector-valued odd filter (riesz filter) [16]. α-scale space derivative quadrature filters (assd) are used for band-pass filtering due to their superior edge detection [17]. the lwpa(x, y) image is calculated using:

we do not employ noise compensation during the calculation of the lwpa(x, y) image in order to preserve the important structural details of cxr(x, y). the lpe(x, y) image is obtained by averaging the phase sum of the response vectors over many scales using:

in the above equation, sc represents the number of scales. the lpe(x, y) image extracts the underlying tissue characteristics by accumulating the local energy of the image along several filter responses. the lpe(x, y) image is used to extract the third local phase image, elea(x, y). this is achieved by using the lpe(x, y) image feature as an input to an l1-norm-based contextual regularization method. the image model, denoted as the cxr image transmission map (cxr_a(x, y)), enhances the visibility of lung tissue features inside a local region and assures that the mean intensity of the local region is less than the echogenicity of the lung tissue. the scattering and attenuation effects in the tissue are combined as:

here ρ is a constant value representative of echogenicity in the tissue.
in order to calculate elea(x, y), cxr_a(x, y) is estimated first by minimizing the following objective function [15]:

in the above equation, • represents element-wise multiplication, χ is an index set, and * is the convolution operator. d_j is calculated using a bank of high-order differential filters [18]. the filter bank enhances the cxr tissue features inside a local region while attenuating the image noise. w_j is a weighting matrix calculated using:

in the objective function, the first part measures the dependence of cxr_a(x, y) on lpe(x, y) and the second part models the contextual constraints of cxr_a(x, y) [15]. these two terms are balanced using a regularization parameter λ [15]. after cxr_a(x, y) is estimated, elea(x, y) is computed from it, where ε is a small constant used to avoid division by zero [15]. the combination of these three types of local phase images as a three-channel input creates a new multi-feature image, denoted as mf(x, y). qualitative results corresponding to the enhanced local phase images are displayed in figure 2. investigating figure 2, we can observe that the enhanced local phase images extract new lung features that are not visible in the original cxr(x, y) images. since local phase image processing is intensity invariant, the enhancement results will not be affected by intensity variations due to patient characteristics or x-ray machine acquisition settings. the multi-feature image mf(x, y) and the original cxr(x, y) image are used as input to our proposed deep learning architecture, which is explained in the next section. our proposed multi-feature cnn architecture consists of two identical convolutional network streams for processing the cxr(x, y) images and the corresponding mf(x, y) images, respectively. strategies for the optimal fusion of features from multi-modal images are an active area of research. generally, data is fused earlier when the image features are correlated, and later when they are less correlated [19].
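to make the idea of fusing two feature streams concrete, the following sketch applies four simple fusion operations to a pair of toy feature vectors, one per stream. it is only an illustration on plain python lists; the function names and feature values are hypothetical, and a real network would apply the same operations to feature maps inside the model (a learned convolution fusion is not shown because it needs trainable weights).

```python
def sum_fusion(a, b):
    """Element-wise sum of two equally sized feature vectors."""
    return [x + y for x, y in zip(a, b)]


def max_fusion(a, b):
    """Element-wise maximum of two equally sized feature vectors."""
    return [max(x, y) for x, y in zip(a, b)]


def average_fusion(a, b):
    """Element-wise mean of two equally sized feature vectors."""
    return [(x + y) / 2.0 for x, y in zip(a, b)]


def concat_fusion(a, b):
    """Concatenation: keeps both streams and doubles the feature length."""
    return list(a) + list(b)


# Hypothetical features from the original-image stream and the enhanced stream.
cxr_features = [1.0, 2.0]
mf_features = [3.0, 0.5]

fused = {
    "sum": sum_fusion(cxr_features, mf_features),
    "max": max_fusion(cxr_features, mf_features),
    "average": average_fusion(cxr_features, mf_features),
    "concat": concat_fusion(cxr_features, mf_features),
}
```

concatenation is the only one of these operations that preserves every value from both streams, at the cost of a larger input to the layers that follow.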
depending on the dataset, different fusion strategies can outperform one another [20]. in [21], our group investigated early-, mid-, and late-level fusion operations in the context of bone segmentation from ultrasound data; the late-fusion operation outperformed the other fusion operations. in [22], a late-fusion network for segmenting brain tumors from mri data also outperformed other fusion operations. in this work, we design mid-fusion and late-fusion architectures (fig. 3). as part of this work, we have also investigated several fusion operations: sum fusion, max fusion, averaging fusion, concatenation fusion, and convolution fusion. based on the performance of the fusion operations and fusion architectures in a preliminary experiment, we use the concatenation fusion operation for both of our architectures. we use the following network architectures as the encoder network: pretrained alexnet [23], resnet50 [24], sononet64 [25], xnet(xception) [26], inceptionv4(inception-resnet-v2) [27], and efficientnetb4 [28]. pretrained alexnet [23] and resnet50 [24] have been incorporated into various medical image analysis tasks [29]. sononet64 achieved excellent performance in both classification and localization tasks [25]. xnet(xception) [26], inceptionv4(inception-resnet-v2) [27], and efficientnetb4 [28] were chosen due to their outstanding performance on recent medical data classification tasks as well as on the classification of covid-19 from chest ct data [30, 31]. we use the following datasets to evaluate the performance of the proposed fusion network models: bimcv [32], covidx [7], and covid-cxnet [12]. covid-19 cxr scans from the bimcv [32] and covidx [7] datasets were combined to generate the 'evaluation dataset' (table 1). for the normal and pneumonia datasets, we randomly selected a subset of 2567 images (from 2567 subjects) from the evaluation dataset (table 1).
in total, 2567 images from each class (normal, pneumonia, covid-19) were used during 5-fold cross validation. table 2 shows the data split for the covid-19 data only. a similar split was also performed for the normal and pneumonia datasets. in order to provide additional testing for our proposed networks, we designed a new test dataset, which we call 'test dataset-2' (table 3). the images from normal and pneumonia cases that were not included in the 'evaluation dataset' are part of 'test dataset-2'. furthermore, we included all the covid-19 scans from covid-cxnet [12]. in order to show the improvements achieved using our proposed multi-feature cnn architecture, we also trained the same cnn architectures using only mf(x, y) or cxr(x, y) images. we refer to these architectures as mono-feature cnns. quantitative performance was evaluated by calculating average accuracy, precision, recall, and f1-scores for each class [9, 7]. the experiments were implemented in python using the pytorch framework. all models were trained using the stochastic gradient descent (sgd) optimizer, the cross-entropy loss function, a learning rate of 0.001 for the first epoch, and a learning rate decay of 0.1 every 15 epochs, with mini-batches of size 16. for local phase image enhancement, we used sc = 2, and the rest of the assd filter parameters were kept the same as reported in [15]. for calculating elea(x, y) images, we used λ = 2, ε = 0.0001, η = 0.85, and ρ, the constant related to tissue echogenicity, was chosen as the mean intensity value of lpe(x, y). these values were determined empirically and kept constant during qualitative and quantitative analysis. qualitative analysis: gradient-weighted class activation mapping (grad-cam) [33] visualizations of normal, pneumonia, and covid-19 data are presented as qualitative results in figure 4.

fig. 4: grad-cam images [33] obtained by the late-fusion resnet50 architecture.
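the training schedule described above (a learning rate of 0.001, a decay of 0.1 every 15 epochs, and mini-batches of size 16) can be sketched in plain python as follows. the sketch assumes that "decay of 0.1" means multiplying the learning rate by 0.1 at every 15th epoch; the function names are hypothetical. in pytorch, a schedule of this kind is commonly expressed with `torch.optim.lr_scheduler.StepLR(optimizer, step_size=15, gamma=0.1)`, although the text does not state which scheduler was used.

```python
def step_decay_lr(epoch, base_lr=0.001, gamma=0.1, step_size=15):
    """Learning rate for a 0-based epoch under step decay.

    Assumption: the rate is multiplied by `gamma` once every `step_size` epochs.
    """
    return base_lr * gamma ** (epoch // step_size)


def iter_minibatches(n_samples, batch_size=16):
    """Yield index ranges that cover all samples in consecutive mini-batches."""
    for start in range(0, n_samples, batch_size):
        yield range(start, min(start + batch_size, n_samples))


# Learning rates at the start of training and after each decay step.
schedule = [step_decay_lr(epoch) for epoch in (0, 14, 15, 30)]

# Mini-batches for a hypothetical training set of 40 images.
batches = [list(batch) for batch in iter_minibatches(40)]
```

with 40 samples and a batch size of 16, the last mini-batch is smaller than the others because the sample count is not a multiple of the batch size.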
investigating figure 4, we can see the discriminative regions of interest localized in the normal, pneumonia, and covid-19 data. quantitative analysis of evaluation dataset: table 4 shows the average accuracy of the 5-fold cross validation on the 'evaluation dataset' for the mono-feature cnn architectures as well as the proposed multi-feature cnn architectures. a box and whisker plot is presented in figure 5. in most of the investigated network designs, mf(x, y)-based mono-feature cnn architectures outperform cxr(x, y)-based mono-feature cnn architectures. the best average accuracy is obtained when using our proposed multi-feature resnet50 [24] architecture. all multi-feature cnns with mid- and late-fusion operations, compared with mono-feature cnns with original cxr(x, y) images as input, achieved a statistically significant difference in classification accuracy (p<0.05 using a paired t-test at the 5% significance level). except for sononet64 [25], xnet(xception) [26], and inceptionv4(inception-resnet-v2) [27], all multi-feature cnns with mid-fusion operation, compared with mono-feature cnns with mf(x, y) images as input, show a statistically significant difference in classification accuracy (p<0.05 using a paired t-test at the 5% significance level). we did not find any statistically significant difference in the average accuracy results between the middle-level and late-fusion networks (p>0.05 using a paired t-test at the 5% significance level). figure 6 presents confusion matrix results together with average precision, recall, and f1-scores for all multi-feature late-fusion cnn architectures. one important observation from the presented results is that almost all the investigated multi-feature networks achieved very high precision, recall, and f1-scores for covid-19 data, indicating that very few cases from the other classes were misclassified as covid-19.
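the paired t-test used above to compare two models over the same five folds can be sketched as follows. the per-fold accuracies are hypothetical values chosen for illustration and are not results from the study; the critical value 2.776 is the two-sided 5% threshold of student's t distribution with 4 degrees of freedom, which applies when five paired folds are compared.

```python
import math

# Two-sided 5% critical value of Student's t with 4 degrees of freedom.
T_CRIT_DF4 = 2.776


def paired_t_statistic(a, b):
    """t statistic of the paired differences b[i] - a[i].

    Assumes the differences are not all identical (non-zero variance).
    """
    diffs = [y - x for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)


# Hypothetical per-fold accuracies of a mono-feature and a multi-feature model.
mono_acc = [0.90, 0.91, 0.92, 0.93, 0.94]
multi_acc = [0.95, 0.95, 0.96, 0.96, 0.98]

t_stat = paired_t_statistic(mono_acc, multi_acc)
significant = abs(t_stat) > T_CRIT_DF4  # equivalent to p < 0.05 for df = 4
```

pairing the folds matters because both models are evaluated on the same splits, so the test looks at the per-fold differences rather than at the two sets of accuracies independently.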
quantitative analysis of test dataset-2: multi-feature resnet50 provides the highest overall accuracy, as shown in table 5, which is consistent with the quantitative result achieved with the 'evaluation dataset'. figure 7 shows a box and whisker plot for each network. all multi-feature cnns with late-fusion operation, compared with mono-feature cnns with original cxr(x, y) images as input, achieved a statistically significant difference in classification accuracy (p<0.05 using a paired t-test at the 5% significance level). except for xnet(xception) [26], all the multi-feature cnns with mid-fusion operation, compared with mono-feature cnns with original cxr(x, y) images as input, achieved a statistically significant difference in classification accuracy (p<0.05 using a paired t-test at the 5% significance level). except for xnet(xception) [26], all multi-feature cnns with mid-fusion operation, compared with mono-feature cnns with mf(x, y) images as input, show a statistically significant difference in classification accuracy (p<0.05 using a paired t-test at the 5% significance level). similar to the 'evaluation dataset' results, there was no statistically significant difference in the average accuracy results between the middle-level and late-fusion networks (p>0.05 using a paired t-test at the 5% significance level), except for the resnet50 [24] and xnet(xception) [26] architectures. confusion matrix results, together with average precision, recall, and f1-score values, for all multi-feature late-fusion cnn architectures evaluated are presented in figure 8. similar to the results presented for the 'evaluation dataset', high precision, recall, and f1-score values are obtained for the covid-19 data.

fig. 6: confusion matrix, and average precision, recall and f1-scores obtained from 5-fold cross validation on 'evaluation data' using all multi-feature network models.
the development of new computer-aided diagnostic methods for robust and accurate diagnosis of covid-19 disease from cxr scans is important for improved management of this pandemic. in order to provide a solution to this need, in this work we present a multi-feature deep learning model for classification of cxr images into three classes: covid-19, pneumonia, and normal healthy subjects. our work was motivated by the need for an enhanced representation of cxr images to achieve improved diagnostic accuracy. to this end, we proposed a local phase-based cxr image enhancement method. we have shown that by using the enhanced cxr data, denoted as mf(x, y), in conjunction with the original cxr data, the diagnostic accuracy of cnn architectures can be improved. our proposed multi-feature cnn architectures were trained on a dataset that is large in terms of the number of covid-19 cxr scans and have achieved improved classification accuracy across all classes. one very encouraging result is that the proposed models show high precision, recall, and f1-scores on the covid-19 class for both testing datasets. in addition, except for alexnet [23], all multi-feature cnns with late-fusion operation have fewer parameters than the corresponding multi-feature cnns with middle-fusion operation (figure 9). since the image classifier of alexnet [23] consists of three fully connected (fc) layers, which store the majority of the parameters, alexnet [23] with late-fusion operation has almost double the number of parameters compared with middle-fusion operation. the rest of the networks have only one or no fc layer in their image classifiers. finally, compared to previously reported results, our work achieves the highest three-class classification accuracy on a significantly larger covid-19 dataset (table 6). this will ensure few false positive cases for the covid-19 detected from cxr images and will help alleviate the burden on the healthcare system by reducing the number of ct scans performed.
while the obtained results are very promising, more evaluation studies are required, specifically for diagnosing early-stage covid-19 from cxr images. our future work will involve the collection of cxr scans from early-stage or asymptomatic covid-19 patients. we will also investigate the design of a cxr-based patient triaging system.
fig. 9: model size vs. overall accuracy.
haghanifar et al. [12]: unet+densenet.
a review of coronavirus disease-2019 (covid-19) coronavirus disease 2019 an interactive web-based dashboard to track covid-19 in real time detection of sars-cov-2 in different types of clinical specimens development of reverse transcription (rt)-pcr and real-time rt-pcr assays for rapid detection and quantification of viable yeasts and molds contaminating yogurts and pasteurized food products automated detection of covid-19 cases using deep neural networks with x-ray images covid-net: a tailored deep convolutional neural network design for detection of covid-19 cases from chest x-ray images covid-resnet: a deep learning framework for screening of covid19 from radiographs covidiagnosis-net: deep bayes-squeezenet based diagnostic of the coronavirus disease 2019 (covid-19) from x-ray images covidlite: a depth-wise separable deep neural network with white balance and clahe for detection of covid-19 stacked convolutional neural network for diagnosis of covid-19 disease from x-ray images covid-cxnet: detecting covid-19 in frontal chest x-ray images using deep learning covid-19: automatic detection from x-ray images utilizing transfer learning with convolutional neural networks umls-chestnet: a deep convolutional neural network for radiological findings, differential diagnoses and localizations of covid-19 in chest x-rays localization of bone surfaces from ultrasound data using local phase information and signal transmission maps the monogenic signal α scale spaces filters for phase based edge detection in ultrasound images efficient image 
dehazing with boundary constraint and contextual regularization multimodal deep learning. in: icml a review: deep learning for medical image segmentation using multi-modality fusion automatic segmentation of bone surfaces from ultrasound using a filter-layer-guided cnn multi modal convolutional neural networks for brain tumor segmentation imagenet classification with deep convolutional neural networks deep residual learning for image recognition sononet: real-time detection and localisation of fetal standard scan planes in freehand ultrasound xception: deep learning with depthwise separable convolutions inception-v4, inception-resnet and the impact of residual connections on learning efficientnet: rethinking model scaling for convolutional neural networks a survey on deep learning in medical image analysis identifying melanoma images using efficientnet ensemble: winning solution to the siim-isic melanoma classification challenge automatic detection of coronavirus disease (covid-19) in x-ray and ct images: a machine learningbased approach bimcv covid-19+: a large annotated dataset of rx and ct images from covid-19 patients grad-cam: visual explanations from deep networks via gradient-based localization acknowledgements the authors are thankful to all the research groups, and national agencies worldwide who provided the open source x-ray images. funding: nothing to declare. conflict of interest the authors declare that they have no conflict of interest. 
key: cord-249065-6yt3uqyy authors: kassani, sara hosseinzadeh; kassasni, peyman hosseinzadeh; wesolowski, michal j.; schneider, kevin a.; deters, ralph title: automatic detection of coronavirus disease (covid-19) in x-ray and ct images: a machine learning-based approach date: 2020-04-22 journal: nan doi: nan sha: doc_id: 249065 cord_uid: 6yt3uqyy the newly identified coronavirus pneumonia, subsequently termed covid-19, is highly transmittable and pathogenic with no clinically approved antiviral drug or vaccine available for treatment. the most common symptoms of covid-19 are dry cough, sore throat, and fever. symptoms can progress to a severe form of pneumonia with critical complications, including septic shock, pulmonary edema, acute respiratory distress syndrome and multi-organ failure. while medical imaging is not currently recommended in canada for primary diagnosis of covid-19, computer-aided diagnosis systems could assist in the early detection of covid-19 abnormalities and help to monitor the progression of the disease, potentially reduce mortality rates. in this study, we compare popular deep learning-based feature extraction frameworks for automatic covid-19 classification. to obtain the most accurate feature, which is an essential component of learning, mobilenet, densenet, xception, resnet, inceptionv3, inceptionresnetv2, vggnet, nasnet were chosen amongst a pool of deep convolutional neural networks. the extracted features were then fed into several machine learning classifiers to classify subjects as either a case of covid-19 or a control. this approach avoided task-specific data pre-processing methods to support a better generalization ability for unseen data. the performance of the proposed method was validated on a publicly available covid-19 dataset of chest x-ray and ct images. the densenet121 feature extractor with bagging tree classifier achieved the best performance with 99% classification accuracy. 
the second-best learner was a hybrid of a resnet50 feature extractor trained by lightgbm with an accuracy of 98%. a series of pneumonia cases of unknown etiology occurred in december 2019, in wuhan, hubei province, china. on december 31, 2019, 27 unexplained cases of pneumonia were identified and found to be associated with so-called "wet markets", which sell fresh meat and seafood from a variety of animals including bats and pangolins. the pneumonia was found to be caused by a virus identified as "severe acute respiratory syndrome coronavirus 2" (sars-cov-2), with the associated disease subsequently termed coronavirus disease 2019 (covid-19).
figure 1: the illustration of covid-19, created at the centers for disease control and prevention (cdc) [10]. the protein particles e, s, and m are located on the outer surface of the virus particle. the spherical viral particles, colorized blue, contain cross-sections through the viral genome, seen as black dots [11].
processing techniques and deep learning algorithms could assist physicians as diagnostic aids for covid-19 and help provide a better understanding of the progression of the disease. hemdan et al. [13] developed a deep learning framework, covidx-net, to diagnose covid-19 in x-ray images. the authors provide a comparative study of different deep learning architectures including vgg19, densenet201, resnetv2, inceptionv3, inceptionresnetv2, xception and mobilenetv2. the public dataset of x-ray images was provided by dr. joseph cohen [14] and dr. adrian rosebrock [15]. the provided dataset included 50 x-ray images, divided into two classes: 25 normal cases and 25 positive covid-19 images. hemdan's results demonstrated that the vgg19 and densenet201 models achieved the best performance scores among their counterparts, with 90.00% accuracy. barstugan et al. [16] proposed a machine learning approach for covid-19 classification from ct images. 
patches of different sizes (16×16, 32×32, 48×48 and 64×64) were extracted from 150 ct images. different hand-crafted features such as grey level co-occurrence matrix (glcm), local directional pattern (ldp), grey level run length matrix (glrlm), grey-level size zone matrix (glszm), and discrete wavelet transform (dwt) algorithms were employed. the extracted features were fed into a support vector machine (svm) [17] classifier with 2-fold, 5-fold and 10-fold cross-validation. the best accuracy of 98.77% was obtained by the glszm feature extractor with 10-fold cross-validation. wang and wong [18] designed a tailored deep learning-based framework, covid-net, for covid-19 detection from chest x-ray images. the covid-net architecture was constructed from a combination of 1×1 convolutions, depth-wise convolutions and residual modules to enable the design of a deeper architecture and avoid the vanishing gradient problem. the provided dataset consisted of a combination of the covid chest x-ray dataset provided by dr. joseph cohen [14] and the kaggle chest x-ray images dataset [19] for a multi-class classification of normal, bacterial infection, viral infection (non-covid) and covid-19 infection. the accuracy obtained in this study was 83.5%. in a study conducted by maghdid et al. [20], a deep learning-based method and a transfer learning strategy were used for automatic diagnosis of covid-19 pneumonia. the proposed architecture is a combination of a simple convolutional neural network (cnn) architecture (one convolutional layer with 16 filters followed by batch normalization, a rectified linear unit (relu) and two fully-connected layers) and a modified alexnet [21] architecture with the feasibility of transfer learning. the proposed modified architecture achieved an accuracy of 94.00%. ghoshal and tucker [22] investigated the diagnostic uncertainty and interpretability of deep learning-based methods for covid-19 detection in x-ray images. 
dropweights-based bayesian convolutional neural networks (bcnn) were used to estimate uncertainty in deep learning solutions and provide a level of confidence in a computer-based diagnosis for a trusted clinician setting. to measure the relationship between accuracy and uncertainty, 70 posteroanterior (pa) lung x-ray images of covid-19 positive patients from the public dataset provided by dr. joseph cohen [14] were selected and balanced with kaggle's chest x-ray images dataset [19]. to prepare the dataset, all images were resized to 512×512 pixels. a transfer learning strategy and real-time data augmentation strategies were employed to overcome the limited size of the dataset. the proposed bayesian inference approach obtained a detection accuracy of 92.86% on x-ray images using the vgg16 deep learning model. hall et al. [23] used a vgg16 architecture and a transfer learning strategy with 10-fold cross-validation, trained on the dataset from dr. joseph cohen [14]. all images were rescaled to 224×224 pixels and a data augmentation strategy was employed to increase the size of the dataset. the proposed approach achieved an overall accuracy of 96.1% and an overall area under the curve (auc) of 99.70% on the provided dataset. farooq and hafeez [24] proposed a fine-tuned and pre-trained resnet-50 architecture, covid-resnet, for covid-19 pneumonia screening. to improve the generalization of the training model, different data augmentation methods, including vertical flip and random rotation (with an angle of 15 degrees), along with model regularization were used. the proposed method achieved an accuracy of 96.23% on a multi-class classification dataset of normal, bacterial infection, viral infection (non-covid-19) and covid-19 infection. 
the main motivation of this study is to present a generic feature extraction method using convolutional neural networks that does not require handcrafted or very complex features from input data while being easily applied to different modalities such as x-ray and ct images. another primary goal is to reduce the generalization error while achieving a more accurate diagnosis. the contributions are summarized as follows: • deep convolutional feature representation [25, 26, 27] is used to extract highly representative features using state-of-the-art deep cnn descriptors. the employed approach is able to discriminate between covid-19 and healthy subjects from chest x-ray and ct images and hence produce higher accuracy in comparison to other works presented in the literature. to the best of our knowledge, this research is the first comprehensive study of the application of machine learning (ml) algorithms (15 deep cnn visual feature extractor and 6 ml classifier) for automatic diagnoses of covid-19 from x-ray and ct images. • to overcome the issue of over-fitting in deep learning due to the limited number of training images, a transfer-learning strategy is adopted as the training of very deep cnn models from scratch requires a large number of training data. • no data augmentation or extensive pre-processing methods are applied to the dataset in order to increase the generalization ability and also reduce bias toward the model performance. • the proposed approach reduces the detection time dramatically while achieving satisfactory accuracy, which is a superior advantage for developing real or near real-time inferences on clinical applications. • with extensive experiments, we show that the combination of a deep cnn with bagging trees classifier achieves very good classification performance applied on covid-19 data despite the limited number of image samples. 
• finally, we developed an end-to-end web-based detection system to simulate a virtual clinical pipeline and facilitate the screening of suspicious cases.
the rest of this paper is organized as follows. the proposed methodology for automatically classifying covid-19 and healthy cases is explained in section 2. the dataset description, experimental settings and performance metrics are given in section 3. a brief discussion and results analysis are provided in section 4, and finally, the conclusion is presented in section 5. few studies have been published on the application of deep cnn feature descriptors to x-ray and ct images. each of the cnn architectures is constructed from different modules and convolution layers that aid in extracting fundamental and prominent features from a given input image. briefly, in the first step, we collected available public chest x-ray and ct images. in the next step, we pre-processed the provided dataset using standard image normalization techniques to improve the quality of the visual information in the input data. once the input images were prepared, we fed them into the feature extraction phase, using state-of-the-art cnn descriptors to extract deep features from each input image. for the training phase, the generated features were then fed into machine learning classifiers such as decision tree (dt) [28], random forest (rf) [29], xgboost [30], adaboost [31], bagging classifier [32] and lightgbm [33]. finally, the performance of the proposed approach was evaluated on test images. the concept of transfer learning has been introduced for solving deep learning problems arising from insufficient labeled data, or when the cnn model is too deep and complex. aiming to tackle these challenges, studies in a variety of computer vision tasks have demonstrated the advantages of transfer learning strategies from an auxiliary domain in improving the detection rate and performance of a classifier [34] [35] [36]. 
in a transfer learning strategy, we transfer the weights already learned on a cross-domain dataset into the current deep learning task instead of training a model from scratch. with the transfer learning strategy, the deep cnn can obtain general features from the source dataset that cannot be learned due to the limited size of the dataset in the current task. transfer learning strategies have various advantages, such as avoiding the overfitting issue when the number of training samples is limited, reducing the computational resources, and also speeding up the convergence of the network [37] [38]. effective feature extraction is one of the most important steps toward learning rich and informative representations from raw input data to provide accurate and robust results. the small or imbalanced size of the training samples poses a significant challenge for the training of a deep cnn where data dimensionality is much larger than the number of samples leading to over-fitting. although various strategies, e.g. data augmentation [39] , transfer learning [40] and fine-tuning [41] , may reduce the problem of insufficient or imbalance training data, the detection rate of the cnn model may degrade due to the over-fitting issue. since the overall performance obtained by a fine-tuning method in the initial experiments for this study was not significant, we employed a different approach inspired by [25] [26] [27] known as deep convolutional feature representation. in this method, we used pre-trained well-established cnn models as a visual feature extractor to encode the input images into a feature vector of sparse descriptors of low dimensionality. then the computed encoded feature vectors produced by cnn architectures are fed into different classifiers, i.e. machine learning algorithms, to yield the final prediction. this lower dimension vector significantly reduces the risk of over-fitting and also the training time. 
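the second stage of this pipeline (a classical learner trained on fixed feature vectors) can be sketched in plain python. this is a simplified illustration under stated assumptions: the short hand-written vectors stand in for deep cnn features, the base learner is a one-split decision stump rather than a full tree, and all function names are hypothetical. it is not the implementation used in the study.

```python
import random
from collections import Counter

def fit_stump(X, y):
    """Pick the (feature, threshold, polarity) with the fewest training errors."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for pol in (0, 1):
                # pol is the label predicted when the feature exceeds thr
                errors = sum((pol if row[j] > thr else 1 - pol) != label
                             for row, label in zip(X, y))
                if best is None or errors < best[0]:
                    best = (errors, j, thr, pol)
    return best[1:]

def predict_stump(stump, row):
    j, thr, pol = stump
    return pol if row[j] > thr else 1 - pol

def fit_bagging(X, y, n_estimators=25, seed=0):
    """Train each stump on a bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    n = len(X)
    ensemble = []
    for _ in range(n_estimators):
        idx = [rng.randrange(n) for _ in range(n)]
        ensemble.append(fit_stump([X[i] for i in idx], [y[i] for i in idx]))
    return ensemble

def predict_bagging(ensemble, row):
    """Majority vote over all stumps in the ensemble."""
    votes = Counter(predict_stump(s, row) for s in ensemble)
    return votes.most_common(1)[0][0]

# Stand-in "feature vectors" (assumption): label 1 = positive, 0 = control.
features = [[0.10, 0.7], [0.20, 0.3], [0.15, 0.9], [0.30, 0.5],
            [0.80, 0.6], [0.90, 0.2], [0.85, 0.8], [0.70, 0.4]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

ensemble = fit_bagging(features, labels)
pred_control = predict_bagging(ensemble, [0.05, 0.5])
pred_positive = predict_bagging(ensemble, [0.95, 0.5])
```

because the extractor is frozen, only the lightweight ensemble is trained, which is the reason the text gives for the reduced over-fitting risk and training time.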
different robust cnn architectures such as mobilenet, densenet, xception, inceptionv3, inceptionresnetv2, resnet, vggnet and nasnet were selected for feature extraction because of the advantage of transfer learning for limited datasets and their satisfying performance in different computer vision tasks [42, 43, 44, 45]. figure 3 illustrates the visual features extracted by the vggnet architecture from an x-ray image of a covid-19 positive patient. in order to evaluate the performance of our feature extraction and classification approach, we used the public dataset of x-ray images provided by dr. joseph cohen, available from a github repository [14]. we used the available 117 chest x-ray images and 20 ct images (137 images in total) of covid-19 positive cases. we also included 117 x-ray images of healthy cases from the kaggle chest x-ray images (pneumonia) dataset available at [19] and 20 ct images of healthy cases from the kaggle rsna pneumonia detection dataset available at [46] to balance the dataset with both positive and normal cases. figure 4 shows examples of confirmed covid-19 images extracted from the provided dataset. the x-ray images of confirmed covid-19 infection demonstrate different shapes of "pure ground glass", also known as hazy lung opacity, with irregular linear opacity depending on the disease progress [12]. the images within the dataset were collected from multiple imaging clinics with different equipment and image acquisition parameters; therefore, considerable variations exist in the images' intensity. the proposed method in this study avoids extensive pre-processing steps to improve the generalization ability of the cnn architecture. this helps to make the model more robust to noise, artifacts and variations in input images during the feature extraction phase. hence, we only employed two standard pre-processing steps in training deep learning models to optimize the training process. 
• resizing: the images in this dataset vary in resolution and dimension, ranging from 365×465 to 1125×859 pixels; therefore, we re-scaled all images from their original size to 600×450 pixels to obtain a consistent dimension for all input images. the input images were also separately resized to 331×331 pixels and 224×224 pixels as required for the nasnetlarge and nasnetmobile architectures, respectively.
• image normalization: for image normalization, first, we re-scaled the intensity values of the pixels using imagenet mean subtraction as a pre-processing step. the imagenet mean is a pre-computed constant derived from the imagenet database [21]. another essential pre-processing step is intensity normalization. to accomplish this, we rescaled the intensity values of all images from [0, 255] to the range [0, 1] by min-max normalization, which is computed as:
$$x_{norm} = \frac{x - x_{min}}{x_{max} - x_{min}} \quad (1)$$
where x is the pixel intensity, and x_{min} and x_{max} are the minimum and maximum intensity values of the input image in equation 1. this operation helps to speed up the convergence of the model by removing the bias from the features and achieving a consistent intensity range across the dataset.
to measure the prediction performance of the methods in this study, we utilized common evaluation metrics such as recall, precision, accuracy and f1-score. in equations (2)-(5), true positive (tp) is the number of positive instances that are correctly predicted; false negative (fn) is the number of positive instances that are incorrectly predicted. true negative (tn) is the number of negative instances that are correctly predicted, while false positive (fp) is the number of negative instances that are incorrectly predicted. given tp, tn, fp and fn, all evaluation metrics were calculated as follows: recall, or sensitivity, is the proportion of covid-19 cases that are correctly classified. 
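the min-max rescaling of equation 1 can be illustrated with a short sketch. it assumes the image is given as a flat list of intensities; the function name `min_max_normalize` and the handling of a constant image are illustrative choices, not details from the study.

```python
def min_max_normalize(pixels):
    """Rescale a flat list of pixel intensities to the range [0, 1]."""
    lo, hi = min(pixels), max(pixels)
    if hi == lo:
        # A constant image has no contrast; map every pixel to 0.0
        # to avoid dividing by zero (an illustrative convention).
        return [0.0 for _ in pixels]
    return [(p - lo) / (hi - lo) for p in pixels]

# Four example intensities spanning the full 8-bit range.
example = min_max_normalize([0, 51, 102, 255])
```

here the minimum maps to 0 and the maximum to 1, with intermediate values scaled linearly between them.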
recall is critical, especially in the medical field, and is given by:
$$recall = \frac{tp}{tp + fn} \quad (2)$$
precision, or positive predictive value, is defined as the proportion of cases classified as positive that are truly positive, and is given as:
$$precision = \frac{tp}{tp + fp} \quad (3)$$
accuracy is the number of correctly classified cases divided by the total number of test images, and is defined as:
$$accuracy = \frac{tp + tn}{tp + tn + fp + fn} \quad (4)$$
f1-score, also known as f-measure, is defined as the harmonic mean of precision and recall, combining both into a single value. f-measure is expressed as:
$$f1 = \frac{2 \times precision \times recall}{precision + recall} \quad (5)$$
diagnostic imaging modalities, such as chest radiography and ct, are playing an important role in confirming the primary diagnosis from the polymerase chain reaction (pcr) test for covid-19. medical imaging is also playing a critical role in monitoring the progression of the disease and in patient care. extracting features from radiology modalities is an essential step in training machine learning models, since the model performance directly depends on the quality of the extracted features. motivated by the success of deep learning models in computer vision, the focus of this research is to provide an extensive study on the classification of covid-19 pneumonia in chest x-ray and ct imaging using features extracted by state-of-the-art deep cnn architectures and trained on machine learning algorithms. the 10-fold cross-validation technique was adopted to evaluate the average generalization performance of the classifiers in each experiment. for all cnns, the network weights were initialized from the weights trained on imagenet. the windows-based computer system used for this work had an intel(r) core(tm) i7-8700k 3.7 ghz processor with 32 gb ram. the training and testing process of the proposed architecture for this experiment was implemented in python using the keras package with tensorflow as the deep learning backend and run on an nvidia geforce gtx 1080 ti gpu with 11 gb ram. 
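the metrics of equations (2)-(5) and the 10-fold protocol can be sketched together. this is a minimal illustration: the counts passed to `classification_metrics` are made up, the fold splitter is neither shuffled nor stratified (the source does not say how folds were drawn), and both function names are hypothetical. the total of 274 images is the 137 covid-19 plus 137 healthy images described earlier.

```python
def classification_metrics(tp, tn, fp, fn):
    """Recall, precision, accuracy and F1 from confusion-matrix counts."""
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"recall": recall, "precision": precision,
            "accuracy": accuracy, "f1": f1}

def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) for k contiguous, near-equal folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0)
             for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

# Made-up counts for one test fold (assumption, not results from the paper).
m = classification_metrics(tp=9, tn=8, fp=2, fn=1)

# 137 COVID-19 images + 137 healthy images = 274 samples in total.
folds = list(k_fold_indices(274, k=10))
fold_sizes = [len(test) for _, test in folds]
```

in practice the samples would be shuffled (and ideally stratified by class) before splitting, and the per-fold metrics averaged to obtain the reported mean and standard deviation.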
table 1 and figure 5 summarize the accuracy of six machine learning algorithms, namely dt, rf, xgboost, adaboost, bagging classifier and lightgbm, on the features extracted by deep cnns. each entry in table 1 is in the format (µ ± σ), where µ is the average classification accuracy and σ is the standard deviation. in table 1, the top result was obtained by the bagging classifier, with a maximum of 99.00% ± 0.09 accuracy on features extracted by the densenet121 architecture (with a feature extraction time of 9.306 seconds and a training time of 30.748 seconds in table 5), which is the highest result reported in the literature for covid-19 classification on this dataset. table 1 also shows that the second-best result was obtained by the resnet50 feature extractor and lightgbm classifier (with a feature extraction time of 0.960 seconds and a training time of 10.206 seconds in table 5), with an overall accuracy of 98.00% ± 0.09. comparing the first and second winners among all combinations, the classification accuracy of densenet121 with bagging is slightly better (1%) than that of resnet50 with lightgbm, while the training time of the second winner is tempting: about three times shorter than that of the first winner (10.206 versus 30.748 seconds). although bagging is a slow learner, it has the lowest standard deviation and hence is more stable than the other learners. the results also demonstrate that the detection rate is worst on the features extracted by resnet101v2 and trained by the adaboost classifier, with 76.00% ± 0.32 accuracy. figure 5 and figure 6 show box-plot distributions of the deep cnn feature extractors and classification accuracy from the 10-fold cross-validation. circles in figure 5 represent outliers. in tables 2, 3 and 4, the best-performing trained visual feature extractors were densenet121, mobilenet and inceptionv3 rather than the counterpart architectures for covid-19 image classification.
table 4: comparison of classification f1-score metric of different machine learning models. the bold value indicates the best result; the underlined value represents the second-best result of the respective category.
although the approach presented here shows satisfying performance, it also has limitations in classifying more challenging instances with vague, low-contrast boundaries and the presence of artifacts. some examples of these cases are illustrated in figure 7. finally, a comparison of the feature extraction time using deep cnn models and the training time with ml algorithms is shown in table 5. after training a model, the pre-trained weights and models can be used as a predictive engine for cad systems to allow automatic classification of new data. a web-based application was implemented using standard web development tools and techniques such as python, javascript, html, and the flask web framework. figure 9 shows the output of our web-based application for covid-19 pneumonia detection. this web application could help doctors benefit from our proposed method by providing an online tool that only requires uploading an x-ray or ct image. the application then provides the physician with a simple covid-19 positive or covid-19 negative observation. it should be noted that this application has yet to be clinically validated, is not yet approved for diagnostic use and would simply serve as a diagnostic aid for the medical imaging specialist. the proposed method is generic, as it does not need handcrafted features, and can be easily adapted, requiring minimal pre-processing. the provided dataset was collected across multiple sources with different shapes, textures and morphological characteristics. the transfer learning strategy successfully transferred knowledge from the source to the target domain despite the limited size of the provided dataset. in our experiments, we observed no overfitting that adversely affected the classification accuracy. however, our study has some limitations. 
the training data samples are limited. extending the dataset with additional data sources could provide a better understanding of the proposed approach. also, employing pre-trained networks as feature extractors requires rescaling the input images to a certain dimension, which may discard valuable information. although the proposed methodology achieved satisfying performance with an accuracy of 99.00%, the diagnostic performance of the deep learning visual feature extractor and machine learning classifier should be evaluated in real clinical study trials. the ongoing pandemic of covid-19 has been declared a global health emergency due to the relatively high infection rate of the disease. as of the time of this writing, there is no clinically approved therapeutic drug or vaccine available to treat covid-19. early detection of covid-19 is important for interrupting the human-to-human transmission of covid-19 and for patient care. currently, the isolation and quarantine of suspicious patients is the most effective way to prevent the spread of covid-19. diagnostic modalities such as chest x-ray and ct are playing an important role in monitoring the progression and severity of the disease in covid-19 positive patients. this paper presents a feature extractor-based deep learning and machine learning classifier approach for computer-aided diagnosis of covid-19 pneumonia. several ml algorithms were trained on the features extracted by well-established cnn architectures to find the best combination of features and learners. considering the high visual complexity of image data, proper deep feature extraction is considered a critical step in developing deep cnn models. the experimental results on the available chest x-ray and ct dataset demonstrate that the features extracted by the densenet121 architecture and trained by a bagging tree classifier generate a very accurate prediction, with 99.00% classification accuracy. 
covid-19 infection: origin, transmission, and characteristics of human coronaviruses thrombocytopenia is associated with severe coronavirus disease 2019 (covid-19) infections: a meta-analysis probable pangolin origin of sars-cov-2 associated with the covid-19 outbreak the impact of the covid-19 epidemic on the utilization of emergency dental services coronavirus disease (covid-19): a primer for emergency physicians the epidemiology and pathogenesis of coronavirus disease (covid-19) outbreak clinical and ct imaging features of the covid-19 pneumonia: focus on pregnant women and children covid-19) pandemic transmission potential and severity of covid-19 in south korea coronavirus infections -transmission electron microscopic image temporal changes of ct findings in 90 patients with covid-19 pneumonia: a longitudinal study covidx-net: a framework of deep learning classifiers to diagnose covid-19 in x-ray images covid-19 image data collection detecting covid-19 in x-ray images with keras, tensorflow, and deep learning coronavirus (covid-19) classification using ct images by machine learning methods an introduction to support vector machines and other kernel-based learning methods covid-net: a tailored deep convolutional neural network design for detection of covid-19 cases from chest radiography images kaggle's chest x-ray images (pneumonia) dataset diagnosing covid-19 pneumonia from x-ray and ct images using deep learning and transfer learning algorithms imagenet classification with deep convolutional neural networks estimating uncertainty and interpretability in deep learning for coronavirus (covid-19) detection finding covid-19 from chest x-rays using deep learning on a small dataset covid-resnet: a deep learning framework for screening of covid19 from radiographs a theoretical analysis of feature pooling in visual recognition deep convolutional neural networks for breast cancer histology image analysis deep learning for visual understanding: a review induction of 
decision trees random forests proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining -kdd '16 a desicion-theoretic generalization of on-line learning and an application to boosting bagging predictors lightgbm: a highly efficient gradient boosting decision tree breast cancer diagnosis with transfer learning and global pooling a novel deep learning based framework for the detection and classification of breast cancer using transfer learning breast cancer histology images classification: training from scratch or transfer learning? pathological brain detection based on alexnet and transfer learning classification of histopathological biopsy images using ensemble of deep learning networks automatic diagnosis of fungal keratitis using data augmentation and image fusion with deep convolutional neural network a novel scene classification model combining resnet based transfer learning and data augmentation with a filter decision fusionbased fetal ultrasound image plane classification using convolutional neural networks automated identification and grading system of diabetic retinopathy using deep neural networks deep learning iot system for online stroke detection in skull computed tomography images detection of tumors on brain mri images using the hybrid convolutional neural network architecture mapgi: accurate identification of anatomical landmarks and diseased tissue in gastrointestinal tract using deep learning rsna pneumonia detection challenge key: cord-325235-uupiv7wh authors: makris, a.; kontopoulos, i.; tserpes, k. title: covid-19 detection from chest x-ray images using deep learning and convolutional neural networks date: 2020-05-24 journal: nan doi: 10.1101/2020.05.22.20110817 sha: doc_id: 325235 cord_uid: uupiv7wh the covid-19 pandemic in 2020 has highlighted the need to pull all available resources towards the mitigation of the devastating effects of such "black swan" events. 
towards that end, we investigated the option to employ technology in order to assist the diagnosis of patients infected by the virus. as such, several state-of-the-art pre-trained convolutional neural networks were evaluated for their ability to detect infected patients from chest x-ray images. a dataset was created as a mix of publicly available x-ray images from patients with confirmed covid-19 disease, common bacterial pneumonia and healthy individuals. to mitigate the small number of samples, we employed transfer learning, which transfers knowledge extracted by pre-trained models to the model to be trained. the experimental results demonstrate that the classification performance can reach an accuracy of 95% for the best two models. the year 2020 has been marked by the pandemic disease caused by a type of the corona virus family (cov), called covid-19 or sars-cov-2, which has led to over four million infections and more than 290,000 deaths worldwide. covid-19 is a severe acute respiratory syndrome (sars) that was first identified in wuhan, china in december 2019 and has rapidly spread globally in a few months, making it a highly contagious virus. the virus is characterized by symptoms that mostly relate to the respiratory system and include shortness of breath, loss of smell and taste, cough and fever, a range of symptoms that is shared among other types of viruses such as the common cold. in this research work the effectiveness of several state-of-the-art pre-trained convolutional neural networks was evaluated regarding the automatic detection of covid-19 disease from chest x-ray images. a collection of 336 x-ray scans in total from patients with covid-19 disease, bacterial pneumonia and normal incidents is processed and utilized to train and test the cnns. due to the limited available data related to covid-19, the transfer learning strategy is employed.
the main difference between our work and the previous studies is that this study incorporates a large number of cnn architectures in an attempt to not only distinguish x-rays between covid-19 patients and people without the disease, but to also discriminate pneumonia patients from patients with the corona virus, acting as a classifier of respiratory diseases. the rest of the paper is structured as follows. section 2 presents a two-fold literature review: i) the usage of deep learning for image classification and ii) the usage of deep learning for the detection of covid-19. section 3 describes the methodology employed towards the identification of the corona virus through x-ray scans, while section 4 presents the research findings and the experimental results. finally, section 5 summarizes the merits of our work and presents roadmaps for future research. numerous studies have used convolutional neural networks (cnns) for the problem of image classification in the literature, most of which create different architectures for the neural networks. deep convolutional neural networks are one of the most powerful deep learning architectures and have been widely applied in a broad range of machine learning tasks. according to [8], cnns can be employed in four different manners: training the weights from scratch in the presence of a very large available dataset, fine-tuning the weights of an existing pre-trained cnn with smaller datasets, unsupervised pre-training for weights initialization before putting inputs into cnn models, and using a pre-trained cnn as a feature extractor. the first cnn to create a standard "architectural template" was the lenet-5 [6], which uses two convolutional layers and three fully-connected ones. ever since, more architectures followed that used the same idea of adding more convolutions and pooling layers, ending with one or more fully-connected ones.
following in the footsteps of the previous cnn, alexnet [9] added three more convolutional layers, making it the deepest neural network of its time. moreover, alexnet was the first cnn architecture that implemented rectified linear units (relus) as an activation function. before making more variations on the architectures, researchers continued using more layers and creating deeper networks and, as a result, vgg-16 [10] emerged. vgg-16 used 13 convolutional layers and 3 fully connected ones, keeping the relus from alexnet as an activation function. vgg-19, a successor of the previous network, simply added more layers. in the years that followed, researchers, apart from making the networks deeper, added more complexity by introducing several techniques inside the layers of the networks. besides using 22 layers in total, inception-v1 [11] also follows a "network inside a network" approach by using "inception" modules. the main concept of these modules was to use parallel towers of convolutions with different filters, each filter capturing different features, and then cluster these features together. the idea was motivated by arora et al. [12], who suggested an architecture that analyzes the correlation statistics of the last layer and clusters them into groups of high-correlation units. sharing a similar architecture, inception-v3 [13], a successor of the previous network, was among the first networks to apply batch normalization to the layers. inception-v4 [14], the latest successor of the two previous networks, added more inception modules and made some modifications to improve the training speed. the same authors introduced a new family of architectures, called inception-resnet-v2 [15], in which they converted the inception modules to residual inception blocks, created a new type of inception modules and added more of these to the network, making it even deeper.
resnet-50 [7] was also among the first networks to use batch normalization. moreover, the resnet family introduced even deeper architectures (up to 152 layers) and used skip connections, or residuals. xception [16] replaced the inception modules with depthwise separable convolutions. this means that it performed a 1 × 1 convolution across the channels, and then a separate 3 × 3 convolution on each output channel. similarly to xception, mobilenetv2 [17] uses depthwise separable convolutions, which reduce the complexity and size of the network. furthermore, a module with inverted residual structure is introduced and non-linearities in narrow layers are removed. the characteristics of this network introduced a state-of-the-art image classifier suitable for mobile devices. finally, the "battle" for a better network architecture continued and resulted in several other cnns, each one introducing a different modification, such as densenet [18], nasnet [19] and resnet152v2 [20]. various research studies already exist for covid-19 detection. for the most part, deep learning techniques are employed on chest radiography images with a view to detecting infected patients, and the results have been shown to be quite promising in terms of accuracy. in [21] a deep convolutional neural network able to predict the coronavirus disease from chest x-ray (cxr) images is presented. the proposed cnn is based on pre-trained transfer models (resnet50, inceptionv3 and inception-resnetv2), in order to obtain high prediction accuracy from a small sample of x-ray images. the images are classified into two classes, normal and covid-19. furthermore, to overcome the insufficient data and training time, a transfer learning technique is applied by employing the imagenet dataset. the results showed the superiority of the resnet50 model in terms of accuracy in both the training and testing stages.
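as a rough illustration of why the depthwise separable convolutions mentioned for xception and mobilenetv2 reduce network size, the following sketch compares weight counts for a single layer. the function names and the channel sizes (128 input, 256 output channels) are hypothetical choices made for this example, not values taken from either network, and biases are ignored.

```python
# weight counts for one convolutional layer (biases ignored).
# illustrative only: the channel sizes used below are assumptions.

def standard_conv_params(k: int, c_in: int, c_out: int) -> int:
    """a k x k convolution mapping c_in channels to c_out channels."""
    return k * k * c_in * c_out


def depthwise_separable_params(k: int, c_in: int, c_out: int) -> int:
    """one k x k filter per input channel (depthwise) plus a 1 x 1 pointwise convolution."""
    depthwise = k * k * c_in
    pointwise = c_in * c_out
    return depthwise + pointwise


standard = standard_conv_params(3, 128, 256)
separable = depthwise_separable_params(3, 128, 256)
reduction = standard / separable  # how many times fewer weights the separable layer needs
```

under these assumed sizes the separable layer needs 33,920 weights instead of 294,912, i.e. roughly 8.7 times fewer.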
abbas et al. [22] presented a novel cnn architecture based on transfer learning and class decomposition in order to improve the performance of pre-trained models on the classification of x-ray images. the proposed architecture is called detrac and consists of three phases. in the first phase an imagenet pre-trained cnn is employed for local feature extraction. in the second phase a stochastic gradient descent optimisation method is applied for training, and finally the class-composition layer is adapted for the final classification of the images using error-correction criteria applied to a softmax layer. the resnet18 pre-trained imagenet network is used and the results showed an accuracy of 95.12% on cxr images. zhang et al. [23] presented a new deep anomaly detection model for fast, reliable screening of covid-19 based on cxr images. the proposed model consists of three components, namely a backbone network, a classification head and an anomaly detection head. the backbone network extracts the high-level features of images, which are then used as input into the classification and anomaly detection heads. the classification head is used for image classification and consists of a new classification convolutional layer which contains a hidden layer of 100 neurons, a one-neuron output layer, and the "sigmoid" activation function. the anomaly detection head has the same architecture as the classification head but generates scalar anomaly scores, which in turn are used to detect anomalous images (covid-19 cases). the proposed model managed to reduce the false positive rate. more specifically, the results demonstrated a sensitivity and specificity of 96.00% and 70.65% respectively. (medrxiv preprint notice: doi https://doi.org/10.1101/2020.05.22.20110817; this version posted may 24, 2020. the copyright holder for this preprint, which was not certified by peer review, is the author/funder, who has granted medrxiv a license to display the preprint in perpetuity. it is made available under a cc-by-nd 4.0 international license.)
in [24] a deep convolutional neural network called covid-net is presented which is able to detect covid-19 cases from cxr images. the network design consists of two stages, a human-machine collaborative design strategy and a machine-driven design exploration stage, and the architecture utilizes a lightweight residual projection-expansion-projection-extension (pepx) design pattern. furthermore, an explainability-driven audit is performed for decisions validation. the results showed a high sensitivity (87.1%) and a precision of 96.4% for covid-19 cases. another work [4] presents a cnn framework for distinguishing covid-19 from other pneumonia cases. the framework is called covid-resnet and utilizes a three-step technique to fine-tune a pre-trained resnet-50 architecture in order to improve performance and reduce training time. progressive resizing of input images (128x128x3 in stage 1, 224x224x3 in stage 2, 229x229x3 in stage 3) and fine-tuning of the network at each stage achieves better generalization and an increased overall performance (96.23% accuracy). hemdan et al. [25] presented a framework consisting of seven deep learning image classifiers, called covidx-net, with a view to classifying covid-19 disease from cxr images. as the results showed, the best performance was achieved by the vgg19 and densenet201 classifiers, with an accuracy of 90%. in [26] the authors investigated how monte-carlo dropweights bayesian convolutional neural networks can estimate uncertainty in deep learning in order to improve the performance of human-machine decisions. a bayesian deep learning classifier was trained using transfer learning on a pre-trained resnet50v2 model using covid-19 x-ray images to estimate model uncertainty.
the results demonstrated a strong correlation between estimated uncertainty in prediction and classification accuracy, thus enabling the identification of false predictions. finally, apostolopoulos et al. [1] evaluated the performance of five pre-trained cnn networks regarding the detection of covid-19 from cxr. the results showed that vgg19 and mobilenetv2 achieved the highest accuracy, 93.48% and 92.85% respectively. the dataset used in this study contains chest x-ray images from patients with confirmed covid-19 disease, common bacterial pneumonia and normal incidents (no infections) and is a combination of two different publicly available datasets. more specifically, covid-19 cases have been obtained from dr. joseph cohen's github repository [27] and consist of 112 posterior-anterior (pa) x-ray images of lungs. in general, this repository contains chest x-ray / ct images of patients with acute respiratory distress syndrome (ards), covid-19, middle east respiratory syndrome (mers), pneumonia and severe acute respiratory syndrome (sars). in addition, 112 normal and 112 pneumonia (bacterial) chest x-ray images were selected from kaggle's repository (footnote 2). in summary, the dataset used for this work is evenly distributed regarding the number of cases, consists of 3 classes (covid, pneumonia and normal) and is publicly available (footnote 3). there are some limitations that are worth mentioning. firstly, the number of confirmed covid-19 samples that exist is very small compared to pneumonia or normal cases. at this time, there is not a larger and reliable sample available. the same number of samples was selected for each class for the sake of uniformity. furthermore, to the best of our knowledge the pneumonia samples are older recorded samples and do not represent pneumonia images from patients with suspected coronavirus symptoms, while the clinical conditions are missing. finally, the normal class represents individuals that are not classified as covid-19 or pneumonia cases.
we do not imply that a "normal" patient based on the cxr image does not have any emerging disease. data augmentation is a commonly used process in deep learning which increases the number of the available samples. in this work, due to the lack of a larger number of available samples, data augmentation with multiple pre-processing techniques was performed, leveraging keras imagedatagenerator during training. the transformations that were employed include random rotation of the images (maximum rotation angle was 30 degrees), horizontal flips, shearing, zooming, cropping and small random noise perturbation. data augmentation improves the generalization and enhances the learning capability of the model. furthermore, it is another efficient way to prevent model overfitting by increasing the amount of training data using information only in training [28]. (footnote 2: chest x-ray images (pneumonia), https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia; footnote 3: https://github.com/antonismakris/covid19-xray-dataset.) the performance metrics adopted are: $accuracy = \frac{tp + tn}{tp + tn + fp + fn}$ (1), $precision = \frac{tp}{tp + fp}$ (2), $recall = \frac{tp}{tp + fn}$ (3), $f1\text{-}score = 2 \cdot \frac{precision \cdot recall}{precision + recall}$ (4) and $specificity = \frac{tn}{tn + fp}$ (5), where tp, tn, fp, fn refer to the true positive, true negative, false positive and false negative samples for each class (covid, pneumonia, normal). then, the macro-average results were computed and used to present the classification performance achieved by the networks. accuracy is a commonly used classification metric and indicates how well a classification algorithm can discriminate the classes in the test set. as shown in eq 1, the accuracy can be defined as the proportion of the predicted correct labels to the total number (predicted and actual) of labels.
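the per-class counts tp, tn, fp and fn and the metrics built on them can be sketched in plain python as follows. the formulas are the standard one-vs-rest definitions; the function names and the confusion matrix are made up for illustration and do not reproduce table 1 or figure 4.

```python
from typing import Dict, List


def per_class_metrics(cm: List[List[int]], k: int) -> Dict[str, float]:
    """one-vs-rest metrics for class k; cm[i][j] counts true class i predicted as class j."""
    total = sum(sum(row) for row in cm)
    tp = cm[k][k]
    fn = sum(cm[k]) - tp                  # class-k samples predicted as another class
    fp = sum(row[k] for row in cm) - tp   # other samples predicted as class k
    tn = total - tp - fn - fp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "specificity": specificity}


def overall_accuracy(cm: List[List[int]]) -> float:
    """share of all samples that lie on the diagonal of the confusion matrix."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / sum(sum(row) for row in cm)


def macro_average(cm: List[List[int]], metric: str) -> float:
    """unweighted mean of one metric over all classes."""
    return sum(per_class_metrics(cm, k)[metric] for k in range(len(cm))) / len(cm)


# hypothetical 3-class confusion matrix (rows and columns: covid, pneumonia, normal)
cm = [[20, 1, 1],
      [0, 22, 0],
      [1, 0, 22]]
covid = per_class_metrics(cm, 0)
accuracy = overall_accuracy(cm)
macro_recall = macro_average(cm, "recall")
```

the macro-average weights every class equally, which is a reasonable choice here because the three classes contain the same number of images.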
in this study, accuracy refers to the overall accuracy of the model in distinguishing the three classes (covid, pneumonia, normal). precision (eq 2) is the proportion of correctly predicted labels to the total number of predicted labels, while recall (eq 3) is the proportion of correctly predicted labels to the total number of actual labels. recall is often referred to as sensitivity (also called true positive rate). furthermore, f1-score (eq 4) refers to the harmonic mean of precision and recall, while specificity (also called true negative rate) measures the proportion of actual negatives that are correctly identified as such (eq 5). deep learning models require a large amount of data in order to perform accurate feature extraction and classification. regarding medical data analysis, especially if the disease is at an early stage such as in covid-19, one major drawback is that the data available for analysis are relatively limited. in order to overcome this limitation, transfer learning was adopted. transfer learning enables training with fewer samples, as the knowledge retained by a pre-trained model is transferred to the model to be trained. a pre-trained model is a network that was previously trained on a large dataset, typically on a large-scale image-classification task. the intuition behind transfer learning for image classification is that if a model is trained on a general large dataset, this model will effectively serve in turn as a generic model. the learned features can be used to solve a different but related task involving new data, which usually are too few to train a cnn from scratch [29]. thus the need to train a large model from scratch on a large dataset is eliminated.
in general, there are two types of transfer learning in the context of deep learning: a) feature extraction [30] and b) fine-tuning [31, 32]. in feature extraction a new classifier is trained from scratch on top of the pre-trained model. the representations learned by the pre-trained model, which is treated as an arbitrary feature extractor, are employed in order to extract meaningful features from new samples. the base convolutional network already contains generically useful features for classification, thus there is no need to retrain the entire model. on the other hand, for increased performance, in fine-tuning the weights of the top layers of the pre-trained model are "fine-tuned" along with the newly-added classifier layers. thus, the weights are tuned from generic feature maps to features associated specifically with the provided dataset. the aim of fine-tuning is to adapt specialized features to a given task rather than overwrite the generic learning. fine-tuned learning experiments are much faster and more accurate compared to models trained from scratch [33]. in this work, the cnn models were fine-tuned to identify and classify the different classes (covid, pneumonia, normal). the weights used by all cnns are pre-trained on the imagenet dataset [34]. imagenet is an image database which contains about 14 million images belonging to more than 20,000 categories, created for image recognition competitions. figure 1 illustrates an example of the fine-tuning process on the vgg16 network architecture. the network is instantiated with weights pre-trained on imagenet. at the top of the figure the layers of the vgg16 network are shown. as stated in section 2.1, vgg16 contains 13 convolutional (conv) and 3 fully-connected (fc) layers. the final set of layers, which contains the fc layers along with the softmax activation function, is called the "head".
afterwards, the fc layers are excluded and the final pool layer is treated as a feature extractor, as depicted in the middle of the figure. finally, a new fc head layer is randomly initialized and placed on top of the original architecture (bottom of the figure). it is worth mentioning that the body of the network, i.e. the conv layers, has been "repressed" (frozen) such that only the fc head layer is trained. the reason for this behaviour is that the conv layers have already learned discriminative filters, while the fc head layer is randomly initialized from scratch and random values are able to destroy the learned features. figure 1: fine-tuning on the vgg16 network architecture (weights transferred from the model pre-trained on the imagenet dataset). in this research work the effectiveness of several state-of-the-art pre-trained convolutional neural networks was evaluated regarding the detection of covid-19 disease from chest x-ray images. more specifically, a pool of existing deep learning classifiers was employed, namely vgg16, vgg19, mobilenet v2, inception v3, xception, inceptionresnet v2, densenet201, resnet152 v2 and nasnetlarge. to train the proposed deep transfer learning models, the python programming language was used, including the keras package with a tensorflow backend. keras is a simple to use neural network library built on top of theano or tensorflow [35].
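to make the effect of freezing the convolutional body in the vgg16 fine-tuning example concrete, the sketch below counts parameters for the standard vgg16 layout (13 convolutional layers with 3 × 3 kernels and 224 × 224 × 3 inputs). the head sizes used here (a 64-unit layer followed by a 3-class output) are an assumption for illustration only; the source does not state the dimensions of the new fc head, and the function names are hypothetical.

```python
# vgg16 convolutional body: integers are output channels, "M" marks a max-pooling layer.
VGG16_BODY = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
              512, 512, 512, "M", 512, 512, 512, "M"]


def conv_body_params(config, in_channels=3, kernel=3):
    """return (number of conv layers, total weights and biases) for the body."""
    n_layers, n_params = 0, 0
    for item in config:
        if item == "M":
            continue  # pooling layers have no trainable parameters
        n_params += kernel * kernel * in_channels * item + item  # weights + biases
        in_channels = item
        n_layers += 1
    return n_layers, n_params


def dense_params(n_in, n_out):
    """weights and biases of one fully-connected layer."""
    return n_in * n_out + n_out


conv_layers, frozen = conv_body_params(VGG16_BODY)

# hypothetical new head: 7 x 7 x 512 pooled features -> 64 units -> 3 classes
trainable = dense_params(7 * 7 * 512, 64) + dense_params(64, 3)
```

with these assumed head sizes only about 1.6 million of roughly 16.3 million parameters are updated during training, while the 14.7 million parameters of the frozen body keep their imagenet values.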
keras provides most of the building blocks needed to build reasonably sophisticated deep learning models. this framework was used along with the set of weights learned on imagenet. the underlying computing infrastructure that has been used for the execution of the cnns has been a commodity machine with the following configuration: ubuntu 18.04 lts 64-bit; intel core i7-8550u cpu @ 1.80ghz × 8; and 16 gib ram. all the examined cnns share some common hyper-parameters. specifically, all images were scaled to a fixed size of 224 × 224 pixels. the dataset used was randomly split into 80% and 20% for training and testing respectively, and the training was conducted for 35 epochs to avoid overfitting for all pre-trained models, with a learning rate of 1e-3 and a batch size of 8. cnns were compiled utilizing the optimization method called adam [36] and all the convolutional layers are activated by the rectified linear unit (relu) [37]. furthermore, a dropout layer [38] with a rate of 0.5 is applied, which means that 50% of neurons will be randomly set to zero at each training step, thus avoiding overfitting on the training dataset. dropout is a form of regularization that forces the weights in the network to receive only small values, making the distribution of weight values more regular. as a result this technique can reduce overfitting on small training examples [39]. since the problem consists of 3 classes, the "categorical_crossentropy" is employed as loss function as shown in eq 6, $loss = -\frac{1}{n} \sum_{i=1}^{n} \log p_{model}[y_i \in c_{y_i}]$ (6), where n is the number of observations and $p_{model}[y_i \in c_{y_i}]$ is the probability predicted by the model for the i-th observation to belong to the c-th category. "categorical crossentropy" compares the distribution of the predictions with the true distribution. the true class is represented as a one-hot encoded vector, and the closer the model's outputs are to that vector, the lower the loss. in this section the classification performance for each cnn is presented.
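the loss function described in the training setup, i.e. the mean negative log-probability assigned to the true class, can be sketched without any framework. this is an illustrative re-implementation and not the keras code used by the authors; the function name, the clipping constant eps and the example probabilities are assumptions.

```python
import math
from typing import List


def categorical_crossentropy(y_true: List[List[int]],
                             y_pred: List[List[float]],
                             eps: float = 1e-7) -> float:
    """mean of -log(p) over samples, where p is the probability given to the true class.

    y_true holds one-hot vectors, y_pred holds predicted class probabilities.
    eps guards against log(0); its value is an arbitrary choice for this sketch.
    """
    total = 0.0
    for truth, probs in zip(y_true, y_pred):
        p_true = sum(t * p for t, p in zip(truth, probs))  # picks the true-class probability
        total += -math.log(max(p_true, eps))
    return total / len(y_true)


# one covid sample (classes ordered covid, pneumonia, normal), two hypothetical predictions
confident = categorical_crossentropy([[1, 0, 0]], [[0.9, 0.05, 0.05]])
uniform = categorical_crossentropy([[1, 0, 0]], [[1 / 3, 1 / 3, 1 / 3]])
```

a prediction close to the one-hot vector gives a loss near zero, whereas a uniform guess over three classes gives ln 3, which makes ln 3 a useful reference value when reading loss curves for a 3-class problem.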
in order to evaluate the results, the following metrics were adopted for each class (covid, pneumonia, normal): precision, recall (sensitivity), f1-score, specificity and the overall accuracy of the model, as illustrated in table 1. the results suggest that vgg16 and vgg19 achieve the best classification accuracy of 95%. the nasnetlarge model showed a moderate accuracy of 81%. the other models did not surpass 80% accuracy, with mobilenetv2 and densenet201 presenting the lowest results with 40% and 38% accuracy respectively. furthermore, the confusion matrices of the best two models (vgg16, vgg19), the moderate model (nasnetlarge) and the worst models (mobilenetv2 and densenet201) are presented in figure 4. a sensitivity of 96% and 92% for the covid class can be observed for the vgg16 and vgg19 models respectively. this is critical as the model should be able to detect all positive covid-19 cases to reduce the virus spread to the community. in other words, confirmed positive covid-19 patients would be accurately identified as "covid-19 positive" 96% and 92% of the time by employing the vgg16 and vgg19 models respectively. furthermore, the aforementioned models show a high precision value of 96% and 100% for the covid class respectively. this implies that for vgg19 no samples from other classes were incorrectly classified as covid, while for vgg16 only one covid case was incorrectly classified as pneumonia, as shown in figures 4(b) and 4(a). another important aspect of the results is the high values associated with specificity. specifically, the specificity for the covid class is 98% and 100% for vgg16 and vgg19 respectively. this practically means that confirmed covid-19 negative patients would be accurately identified as "covid-19 negative" 98% and 100% of the time using the vgg16 and vgg19 models respectively. a similar trend can be observed in terms of f1-score. also, one of the very encouraging results is the ability of these models to achieve high sensitivity and precision on the normal class.
this ensures that the fps are minimized not only for the covid but also for the pneumonia class and can potentially help alleviate the burden on the healthcare system. regarding nasnetlarge, a precision of 100% for the covid class is observed, which means that there were no normal or pneumonia samples misclassified as covid. furthermore, the model would accurately identify "covid-19 negative" cases 100% of the time, but it presents a low sensitivity value. confirmed covid-19 cases would be identified only about half the time. additionally, the model presents a moderate value of 73% for f1-score in the covid class. indeed, this low value is justified by a large number of fns. figure 4(c) depicts that the covid class has 11 cases in total misclassified as normal or pneumonia. as expected, this is not acceptable when dealing with such a contagious virus. although mobilenetv2 and densenet201 present the worst results in terms of accuracy, they outperform vgg16 in terms of specificity and precision (100% for both metrics) for the covid class. however, as the results suggest, sensitivity is one of the most important metrics for this particular disease. the extremely low value of 12% observed for both models can have devastating effects regarding virus spread. only 12% of confirmed covid-19 cases would be correctly identified. furthermore, the low value of 21% concerning f1-score in the covid class implies many fns. indeed, both models present 23 fns, as illustrated in figures 4(d) and 4(e).
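the figures quoted for mobilenetv2 and densenet201 can be cross-checked with a few lines of arithmetic. with 23 false negatives, tp = 3 is the only whole number that yields a sensitivity rounding to 12%, which would imply 26 covid images in the test set; that test-set size is an inference made for this sketch and is not stated in the text.

```python
# reported for the covid class: 23 false negatives, 100% precision, 12% sensitivity.
fn = 23
tp = 3   # inferred: the only integer with round(100 * tp / (tp + 23)) == 12
fp = 0   # follows from the reported precision of 100%

recall = tp / (tp + fn)                              # sensitivity
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two

recall_pct = round(100 * recall)
f1_pct = round(100 * f1)
```

under this assumption the f1-score evaluates to 6/29, i.e. about 21%, which agrees with the reported value and shows how a perfect precision cannot compensate for a very low sensitivity.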
a real-life interpretation of a false negative instance is the erroneous assumption that the patient is "covid-19 negative", with what this entails in relation to the spread of the virus and public health. furthermore, we visualized the loss and the accuracy of the same cnns during their training in figure 2. specifically, figures 2(a), 2(b), 2(c), 2(d) and 2(e) demonstrate the training/validation loss/accuracy of vgg16, vgg19, nasnetlarge, mobilenetv2 and densenet201, respectively. the two best models (figures 2(a) and 2(b)) demonstrate a smooth training process during which the loss gradually decreases and the accuracy increases. moreover, it can be observed that the training and validation accuracies do not deviate much from one another in most cases, a phenomenon that can also be observed for the training and validation loss, indicating that the models do not overfit. on the other hand, the rest of the models not only present a low accuracy, but their validation loss is either increasing or fluctuating. in the case of nasnetlarge (figure 2(c)), which presents a relatively high accuracy (in the range of 75%), the fluctuating validation loss means that the model most probably overfits. another interesting fact is that the models with the fewest layers (vgg16 and vgg19) achieve a better classification performance. this can be explained by the fact that neural networks with more hidden layers require more training data, thus an even larger number of x-ray samples needs to be provided to these networks. as table 1 shows, all the cnn models present quite high precision and specificity values in the covid class, even though the overall accuracy is low for every model except vgg16 and vgg19. confusion matrices confirm this, as the false positives are zero (only 1 fp in vgg16), as illustrated in figure 4. however, the sensitivity is extremely low in some models. for example, inceptionv3 and densenet201 present the lowest values of sensitivity with 4% and 12% respectively.
this practically means that the models are not able to detect the confirmed covid-19 cases, which is likely to cause disastrous results. furthermore, one important observation is that sensitivity presents high values for the pneumonia class, except in the inceptionresnetv2 model. this ensures that patients with common bacterial pneumonia will not be misclassified as covid. figure 3 depicts the execution time (in seconds) of each cnn. the largest execution times are presented for the most accurate models. specifically, nasnetlarge exhibits the highest execution time followed by vgg19 and vgg16. this can be explained by the fact that these models consist of the largest number of parameters. mobilenetv2 presents the lowest execution time and is included along with densenet201 in the models with the worst overall accuracy. nevertheless, inceptionv3, xception and inceptionresnetv2 present smaller execution times even though their accuracy is much better than that of densenet201. in this work, a study was conducted and presented for the detection of patients positive for covid-19, a pandemic that infected a large part of the human population in the first half of the year 2020. specifically, the study presented and employed 9 well-known convolutional neural networks (cnns) for the classification of x-ray images originating from patients with covid-19, pneumonia and healthy individuals. research findings indicated that cnns have the potential to detect respiratory diseases with high accuracy, although a large amount of sample images needs to be collected. specifically, vgg16 and vgg19 achieve an overall accuracy of 95%.
the high values associated with sensitivity, specificity and precision of the covid class imply the ability of these models to detect positive and/or negative covid-19 cases accurately, thus reducing as much as possible the virus spread to the community. as the results show, determining the most effective model for this classification task involves several performance metrics. furthermore, one of the very encouraging results is the ability of the aforementioned cnns to achieve high sensitivity and precision on the normal class, thus ensuring the minimization of false positives regarding infection classes, which can potentially help alleviate the burden on the healthcare system. finally, we would like to emphasize that these methods should not be used directly without clinical diagnosis. for future work, we intend to train the cnns on more data and to evaluate more architectures for the case of covid-19 detection.
[1] covid-19: automatic detection from x-ray images utilizing transfer learning with convolutional neural networks [2] detrac: transfer learning of class decomposed medical images in convolutional neural networks [3] deep learning-based detection for covid-19 from chest ct using weak label [4] covid-resnet: a deep learning framework for screening of covid19 from radiographs [5] deep convolutional neural networks for image classification: a comprehensive review [6] gradient-based learning applied to document recognition [7] deep residual learning for image recognition [8] utilizing pretrained deep learning models for automated pulmonary tuberculosis detection using chest radiography [9] imagenet classification with deep convolutional neural networks [10] very deep convolutional networks for large-scale image recognition [11] going deeper with convolutions [12] provable bounds for learning some deep representations [13] rethinking the inception architecture for computer vision [14] inception-v4, inception-resnet and the impact of residual connections on learning [15] inception-v4, inception-resnet and the impact of residual connections on learning [16] xception: deep learning with depthwise separable convolutions [17] 2018 ieee/cvf conference on computer vision and pattern recognition [18] densely connected convolutional networks [19] learning transferable architectures for scalable image recognition [20] deep cnns for microscopic image classification by exploiting transfer learning and feature concatenation [21] automatic detection of coronavirus disease (covid-19) using x-ray images and deep convolutional neural networks [22] classification of covid-19 in chest x-ray images using detrac deep convolutional neural network [23] covid-19 screening on chest x-ray images using deep learning based anomaly detection [24] covid-net: a tailored deep convolutional neural network design for detection of covid-19 cases from chest radiography images [25] covidx-net: a framework of deep learning classifiers to diagnose covid-19 in x-ray images [26] estimating uncertainty and interpretability in deep learning for coronavirus (covid-19) detection [27] covid-19 image data collection [28] the effectiveness of data augmentation in image classification using deep learning [29] a survey on transfer learning [30] deep feature extraction and classification of hyperspectral images based on convolutional neural networks [31] deep learning of representations for unsupervised and transfer learning [32] decaf: a deep convolutional activation feature for generic visual recognition [33] using deep learning for image-based plant disease detection [34] imagenet: a large-scale hierarchical image database [35] deep learning with keras [36] adam: a method for stochastic optimization [37] rectified linear units improve restricted boltzmann machines [38] improving neural networks by preventing co-adaptation of feature detectors [39] the problem of overfitting
key: cord-202184-hh7hugqi authors: wang, jun; liu, qianying; xie, haotian; yang, zhaogang; zhou, hefeng title: boosted efficientnet: detection of lymph node metastases in breast cancer using convolutional neural network date: 2020-10-10 journal: nan doi: nan sha: doc_id: 202184 cord_uid: hh7hugqi in recent years, advances in the development of whole-slide images have laid a foundation for the utilization of digital images in pathology. with the assistance of computer image analysis that automatically identifies tissue or cell types, they have greatly improved the histopathologic interpretation and diagnosis accuracy. in this paper, the convolutional neural network (cnn) has been adapted to predict and classify lymph node metastasis in breast cancer. unlike traditional image cropping methods that are only suitable for large resolution images, we propose a novel data augmentation method named random center cropping (rcc) to facilitate small resolution images.
rcc enriches the datasets while retaining the image resolution and the center area of images. in addition, we reduce the downsampling scale of the network to better accommodate small resolution images. moreover, attention and feature fusion (ff) mechanisms are employed to improve the semantic information of images. experiments demonstrate that our methods boost the performance of basic cnn architectures, and the best-performing method achieves an accuracy of 97.96% and an auc of 99.68% on rpcam datasets. even though excellent progress has been made in understanding cancers and advancing diagnostic and therapeutic methods, breast cancer (bc) is the most common malignant cancer diagnosed worldwide and the second leading cause of cancer-associated death in women [1] [2] [3] . metastatic breast cancers (mbcs), the leading cause of breast cancer death due to their incurable nature, start with the local invasion of surrounding tissues, spread into the lymphatic and blood vessels, and end in distant organs 4 . it is estimated that 10% to 50% of patients develop metastases despite being diagnosed with regular bc at the beginning 5 . besides, the rate and site of metastasis are heterogeneous and depend on the primary tumor subtype 6 . thus, accurate diagnosis, prognosis, and treatment for mbcs remain challenging. for bc diagnosis, one of the essential tasks is staging, which includes the recognition of axillary lymph node (aln) metastases; these are detectable in most node-positive patients using sentinel lymph node (sln) biopsies 7, 8 . evaluating microscopy images from slns is the conventional technique to assess alns. however, it requires on-site pathologists to investigate samples, which is time-consuming, laborious, and less reliable due to a certain degree of subjectivity, particularly in cases that contain small lesions or in which the lymph nodes are negative for cancer 9 .
consequently, digital pathology methods that assist microscopic diagnosis have evolved significantly during the last decade 10, 11 . advanced scanning technology, cost reduction, quality of spatial images, and magnification have made full digitalization feasible for evaluating histopathologic tissues 12 . digital pathology has multiple advantages, including remote consultation and sample analysis, thus improving the availability of samples and removing the need for on-site experts. still, it requires manual inspection, so the inconsistent diagnostic decisions made by individual pathologists, which affect the accuracy of diagnosis, remain unresolved. in addition, hospitals are short of professional equipment and pathologists to support digital pathology. it is reported that presumptive treatment may be widespread in developing countries due to the lack of well-trained pathologists and professional equipment 13 . moreover, the majority of the population can barely get access to pathology and laboratory medicine (palm) services. taking cancer and cardiovascular disease as examples, only a few, unevenly distributed communities have access to palm services [14] [15] [16] . to better facilitate digital pathology, reduce the cost of hospitals, and alleviate the problems mentioned before, various analysis methods have been proposed (e.g., deep learning, machine learning, and some specific software) to enhance the accuracy and sensitivity of metastatic cancer detection [17] [18] [19] [20] [21] . the convolutional neural network (cnn) is the most successful deep learning method in the computer vision field due to its robust feature extraction ability. it has been widely used for diseases diagnosed with microscopy (e.g., alzheimer's disease) [22] [23] [24] [25] .
a cnn automatically learns image features from multiple dimensions on a large image dataset. these features are used to identify or classify structures, which makes cnns applicable in multiple automated image-recognition biomedical areas 26, 27 . cnn-based cancer detection has proved to be a convenient method to distinguish tumours from other cells or tissues and has demonstrated satisfactory results [28] [29] [30] [31] . efficientnet is one of the most potent cnn architectures. it utilizes the compound scaling method to enlarge the network depth, width, and resolution, obtaining state-of-the-art performance on various benchmark datasets while requiring fewer computation resources than other models 32 . hence, efficientnet may show significant potential for medical image classification, although there is a big difference between medical images and traditional images. however, few studies have explored the performance of efficientnet on medical images, which motivates us to conduct this research. in this work, we propose three strategies to improve the capability of efficientnet: developing a cropping method called random center cropping (rcc) to retain significant features in the center area of images, reducing the downsampling scale of efficientnet to accommodate the small resolution images of rpcam datasets, and integrating attention and feature fusion mechanisms with efficientnet to obtain features containing rich semantic information.
this work has three main contributions: (1) to the best of our knowledge, this is the first study to explore the power of efficientnet on mbc classification, and elaborate experiments are conducted to compare the performance of efficientnet with other state-of-the-art cnn models, which might offer inspiration for researchers who are interested in image-based diagnosis using deep learning (dl); (2) we propose a novel data augmentation method, rcc, to facilitate the data enrichment of small resolution datasets; (3) all four of our technological improvements boost the performance of the original efficientnet. the best accuracy and auc reach 97.96% and 99.68%, respectively, confirming the applicability of cnn-based methods for bc diagnosis. digital pathology has been widely employed for early cancer detection, classification, and monitoring of treatment response since it can be deployed readily and alleviates the uneven distribution of medical experts to a certain extent while saving their valuable time 33 . the manual process of recognizing mbcs requires high professionalism and many auxiliary materials (e.g., bone scanning, liver ultrasonography, and chest radiography), and it is time-consuming 34 . in addition, judgments may be affected by factors such as fatigue. due to the unexpectedly low accuracy of manual mbc detection, the arbitration of conflicting double reading opinions was put forward in 2005 35 . the challenge of improving diagnostic accuracy remains to this day. therefore, computer-aided diagnosis (cad) systems were adopted to assist pathologists in interpreting medical images and to mitigate the problems mentioned before 36 . traditional machine learning (ml) methods played a crucial role in early cad-based mbc classification. in 1993, wu et al.
used three-layer, feed-forward artificial neural networks to diagnose bc on mammograms, and obtained a roc value over 95%, which outperformed the average capacity of attending and resident radiologists alone 37 . in 1996, quinlan applied the decision tree method in bc classification and demonstrated a 94.74% classification accuracy using the c4.5 decision tree with 10-fold cross-validation 38 . besides, hamilton et al. 39 showed a 96% accuracy via the riac method, while ster and dobnikar 40 obtained a 96.8% accuracy via the linear discriminant analysis method. furthermore, abonyi and szeifert 41 adopted the supervised fuzzy clustering (sfc) technique and achieved a 95.57% accuracy. however, the training and testing datasets in these works are small, leading to low generalization ability. with the rapid development of computer vision technology, computer hardware, and big data technology, image recognition based on dl has matured. since alexnet 42 won the 2012 imagenet competition, an increasing number of convnets have been proposed (e.g., vgg 43 , inception 44 , resnet 45 , densenet 46 ), leading to a significant advance in computer vision tasks, including image classification and object detection. deep convolutional neural network (dcnn) models can automatically learn image features, classify images in various fields, and possess higher generalization ability than traditional ml methods. they can distinguish different types of cells, which allows other lesions to be diagnosed. this technology has also achieved remarkable advances in medical fields 47 . in past decades, many articles have been published on applying cnn methods to cancer detection and diagnosis. for instance, albayrak et al. 48 developed a cnn-based feature extraction algorithm to detect mitosis in bc histopathological images. in this algorithm, the cnn model was used to extract features to train a support vector machine (svm) for mitosis detection.
also, dl technology has proved to be useful in lung cancer detection on various image modalities. dcnns were adopted to predict patients' survival time directly from lung cancer pathological images 49 . moreover, other groups utilized dl methods for medical image classification and achieved good results [50] [51] [52] [53] . for bc detection and diagnosis, agarwal et al. 54 released a cnn method for automated mass detection in digital mammograms, which used transfer learning with three pre-trained models (vgg16, resnet50, and inceptionv3). in 2018, ribli et al. proposed a faster r-cnn model-based method for the detection and classification of bc masses 55 . the evaluation of their model on the inbreast dataset showed an auc over 95%. besides, shayma'a et al. used alexnet and googlenet to test bc masses on the national cancer institute (nci) and mias databases 47 . alexnet achieved an accuracy of 97.89% with an auc of 98.32% on the nci database, and an accuracy of 98.53% with an auc of 98.95% on the mias database. in comparison, googlenet achieved a 91.58% accuracy with 96.50% auc and an 88.24% accuracy with 94.65% auc. also, alantari et al. presented a dl method including detection, segmentation, and classification of bc masses from digital x-ray mammograms 56 . they utilized the cnn architecture yolo and obtained 95.64% accuracy and an auc of 94.78% 57 59 . the accuracy of the method was competitive with that of pathologists, with an auc of 97.00%. tan et al. proposed efficientnet, the state-of-the-art dcnn, which maintains competitive performance while requiring remarkably fewer computation resources in image recognition 32 . they presented a systematic study to balance the network depth, width, and resolution. efficientnet has been applied with great success to many benchmark datasets. academics have also explored the capability of efficientnet in medical imaging classification. gonçalo marques et al.
utilized efficientnet to support the diagnosis of covid-19 and demonstrated a 99.62% accuracy 60 . this work also utilizes efficientnet as the backbone, which is similar to some aforementioned works, but we focus on the mbc task. in addition, quite different from past works that usually use bc mass datasets with large resolution, our work detects lymph node metastases in breast cancer, and the dataset resolution is small. to the best of our knowledge, no research has explored the performance of efficientnet in the detection of lymph node metastases in breast cancer. therefore, this work aims to examine and improve the capacity of efficientnet in bc detection. the performance of dl models is highly dependent on the scale and quality of training datasets. a large dataset allows researchers to train deeper networks and improves the generalization ability of models, thus enhancing the performance of dl methods. however, establishing large datasets is time-consuming and not cost-effective. to cope with this problem, data augmentation has been proposed to enrich the dataset without introducing new data. cropping is one of the most commonly used data augmentation methods in computer vision tasks and is adopted in our work. however, as mentioned in 3.1, the features used for distinguishing metastases are mainly concentrated in the central area (32*32) of an image, so traditional cropping methods (random cropping and center cropping) may leave this essential area incomplete or lose it entirely. therefore, we propose a cropping method named random center cropping (rcc) to ensure the integrity of the central 32*32 area while selecting peripheral pixels randomly, allowing dataset enrichment. apart from retaining the significant center area, rcc keeps more pixels, which benefits small resolution images and enables deeper network architectures. this section describes our methods to improve the performance of efficientnet on rpcam datasets.
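the random center cropping (rcc) idea described above can be illustrated with a minimal sketch. this is not the authors' implementation: the function names, the nested-list image representation and the crop size of 64 are assumptions made for illustration only; the source only fixes the 96*96 input and the protected central 32*32 area.

```python
import random


def random_center_crop_box(img_size=96, center_size=32, crop_size=64, rng=random):
    """Pick a random square crop that always contains the central square.

    crop_size is a hypothetical value; the text only specifies the
    96*96 input and the 32*32 central area that must stay intact.
    Returns (top, left, bottom, right) with exclusive bottom/right.
    """
    if not center_size <= crop_size <= img_size:
        raise ValueError("need center_size <= crop_size <= img_size")
    c0 = (img_size - center_size) // 2   # first row/col of the central area
    c1 = c0 + center_size                # one past its last row/col
    lo = max(0, c1 - crop_size)          # smallest offset still covering c1
    hi = min(c0, img_size - crop_size)   # largest offset still covering c0
    top = rng.randint(lo, hi)            # randint is inclusive on both ends
    left = rng.randint(lo, hi)
    return top, left, top + crop_size, left + crop_size


def random_center_crop(image, center_size=32, crop_size=64, rng=random):
    """Crop a square image given as a list of rows (each row a list of pixels)."""
    top, left, bottom, right = random_center_crop_box(
        len(image), center_size, crop_size, rng
    )
    return [row[left:right] for row in image[top:bottom]]
```

with these assumed sizes, the offset of the crop window ranges over 0..32 in each direction, so the peripheral content varies between draws while the central 32*32 block is present in every crop.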
we reduce the downsampling scale to maintain an appropriate level of semantic information in the features. besides, feature fusion (ff) and attention mechanisms are embedded in this work, which enhance the feature representation ability and increase the response of vital features. there are eight types of efficientnet, from efficientnet-b0 to efficientnet-b7, with an increasing network scale. efficientnet-b3 is selected as our backbone network because it outperforms the other architectures in our experiments on rpcam datasets. the architecture of boosted efficientnet-b3 is shown in figure 3 . the main building block is mbconv 64 . components in red dashed rectangles are different from the original efficientnet-b3. images are first sent to some blocks containing multiple convolutional layers to extract image features. then, these features are weighted by the attention mechanism to improve the response of features contributing to classification. next, the feature fusion mechanism is utilized, enabling features to retain some low-level information. finally, images are classified according to those fused features. figure 3 . the architecture of boosted-efficientnet-b3. efficientnet first extracts image features by its convolutional layers. the attention mechanism is then utilized to reweight features, increasing the activation of significant parts. next, we perform ff on the outputs of several convolutional layers. after that, images are classified based on those fused features. details of these methods are described in the following sections. although efficientnet has demonstrated competitive performance in many tasks, we observe that there is a disparity in image resolution between the designed model inputs and rpcam datasets. most models set their input resolution to 224*224 or larger, which maintains the balance between performance and time complexity. the depth of the network is also designed to suit the input size.
this setting performs well in most well-known baseline image datasets (e.g., imagenet 65 , pascal voc 66 ) as their resolutions usually are more than 1000*1000. however, the resolution of rpcam datasets is 96*96, which is much smaller than the designed model input of 300*300. after the efficientnet processing, the final feature map is 32 times smaller than the input in each dimension (96/32 = 3, i.e., from 96*96 to 3*3). this feature map is likely to be too abstract and thus to lose low-level features, which may degrade the performance of efficientnet. to mitigate this problem, we adjust the downsampling multiple in efficientnet. our idea is implemented by modifying the stride of the convolution kernel of efficientnet, even though the receptive field of the convolution kernels might be reduced. however, the influence of this reduction could be slight since the resolution of the inputs is small. to select the best-performing downsampling scale, multiple and elaborate experiments are conducted on the downsampling scale {2, 4, 6, 8, 10}, and a scale of 16 outperforms the other settings. the size of the feature map at the best-performing downsampling scale (16) is 6*6 (96/16 = 6), which is twice the side length obtained with the original downsampling multiple (32). the change of the downsampling scale from 32 to 16 is implemented by modifying the stride of the first convolution layer from two to one, as shown in the red dashed rectangles on the left half of figure 3. when seeing a picture, the human visual system selectively focuses on a specific part of the picture while ignoring other visible information due to limited visual information processing resources. for example, although the sky covers most of the figure, people are able to readily capture the aeroplane in the image (figure 4) 67 . to simulate this process in artificial neural networks, the attention mechanism was proposed and has achieved great success in many tasks such as image captioning 68, 69 , image classification 70 , and object detection 71, 72 .
the attention technique can be simply interpreted as a means of increasing the response of the most informative parts and suppressing the activation of others. for instance (figure 5) , it can be seen that the response of the background is large as most parts of the image are background. however, this information is usually useless for classification, so its response should be suppressed. on the other hand, cancerous tissue is more informative and deserves higher activation, so its response is enhanced after being processed by the attention mechanism. as we stated before, the most informative features are in the center area of images on rpcam datasets, making attention more critical for this work. hence, this project also adopts the attention mechanism, implemented by the squeeze-and-excitation block proposed by hu et al. 73 briefly, the essential components are the squeeze and the excitation. suppose feature maps $U$ have $C$ channels and the size of the feature in each channel is $H \times W$. for the squeeze operation, global average pooling is applied to $U$, enabling features to gain a global receptive field. after the squeeze operation, the size of the feature maps changes from $H \times W \times C$ to $1 \times 1 \times C$. results are denoted as $z$. more precisely, this change is given by
$$z_c = F_{sq}(u_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j), \quad (1)$$
where $u_c$ denotes the $c$-th channel of $U$, and $F_{sq}$ is the squeeze function. following the squeeze operation, the excitation operation learns the weight (scalar) of each channel, which is simply implemented by a gating mechanism. specifically, two fully connected layers are employed to learn the weights of features, and the sigmoid and relu activation functions are applied to add non-linearity. besides adding non-linearity, the sigmoid function also ensures that the weight falls in the range of [0, 1]. the calculation process of the scalar (weight) is shown in equation (2).
$$s = F_{ex}(z, W) = \sigma(g(z, W)) = \sigma(W_2 \, \delta(W_1 z)), \quad (2)$$
where $s$ is the result of the excitation operation, $F_{ex}$ is the excitation function, and $g$ refers to the gating function. $\sigma$ and $\delta$ denote the sigmoid and relu functions, respectively. $W_1$ and $W_2$ are learnable parameters of the two fully connected layers. the final output is calculated by multiplying the scalar $s$ with the original feature maps $U$. in our work, the attention mechanism is combined with the feature fusion technique, as shown in figure 6 . high-level features generated by deeper convolutional layers contain rich semantic information, but they usually lose details such as positions and colors that are helpful in the classification. conversely, low-level features include more detailed information but introduce non-specific noise. ff is a technique that combines low-level and high-level features and has been adopted in many image recognition tasks for performance improvement 74 . detail information is more consequential in our work since complex textures and contours exist in the rpcam images despite their small resolution. accordingly, we adopt the ff technique to boost classification accuracy. four steps are involved in the ff technique ( figure 6 ): (1) during the forward process, we save the outputs (features) of the convolutional layers in the 4th, 7th, 17th and 25th blocks. (2) after the last convolutional layer extracts features, the attention mechanism is applied to the features recorded in step 1 to weight the essential information. (3) low-level and high-level features are combined using the outputs of step 2 after the attention mechanism. (4) these fused features are then sent to the following layers to conduct classification. this section first introduces the evaluation metrics used for verifying the performance of our methods. implementation details are then described. next, we present the capacity of boosted efficientnet and comparisons with other state-of-the-art models. after that, the influence of each method is investigated via ablation studies. in all, elaborate experiments are conducted to explore the effectiveness of the boosted efficientnet.
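the squeeze-and-excitation reweighting described in the attention subsection can be sketched in plain python as follows. this is an illustrative sketch, not the authors' pytorch code: the nested-list layout (channels, rows, columns), the function names and the omission of bias terms are assumptions, and the weight matrices stand in for the two fully connected layers.

```python
import math


def squeeze(u):
    """Global average pooling: one scalar per channel of u (C x H x W)."""
    return [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in u]


def matvec(w, x):
    """Multiply matrix w (list of rows) by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def excite(z, w1, w2):
    """Gating: sigmoid(w2 @ relu(w1 @ z)); biases omitted for brevity."""
    hidden = [max(0.0, v) for v in matvec(w1, z)]
    return [1.0 / (1.0 + math.exp(-v)) for v in matvec(w2, hidden)]


def se_block(u, w1, w2):
    """Scale every channel of u by its learned weight in [0, 1]."""
    s = excite(squeeze(u), w1, w2)
    return [[[s_c * v for v in row] for row in ch] for s_c, ch in zip(s, u)]
```

here w1 has shape (hidden, C) and w2 has shape (C, hidden); the hidden width (the reduction ratio) is not stated in the text, so it is left to the caller.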
we evaluate our method on the rectified patchcamelyon (rpcam) dataset. since the testing set is not provided, we split the original training set into a training set and a validation set and utilize the validation set to verify the performance of models. in detail, the capacities of models are evaluated by five indicators: area under the curve (auc), accuracy (acc), sensitivity (sen), specificity (spe), and f-measure 75 . our method is built on the efficientnet-b3 model and implemented in python with the pytorch deep learning framework 76 . four gtx 2080ti gpus are employed to accelerate the training. all models are trained for 30 epochs. the gradient optimizer is adam. before being fed into the network, images are normalized by the mean and standard deviation of their rgb channels. in addition to rcc, we also employ random horizontal and vertical flipping at training time to enrich the datasets. during the training, the initial learning rate is 0.003 and is decayed by a factor of 10 at the 15th and 23rd epochs. the batch size is set to 256. the parameter counts of boosted efficientnet and the other compared models are kept as close as possible to enhance the credibility of the comparison experiment. in detail, the parameter sizes of the three models increase in the order improved efficientnet, densenet121, resnet50. experiments are conducted on the basic efficientnet and boosted-efficientnet to evaluate the effectiveness of our methods. moreover, we compare boosted efficientnet with two other state-of-the-art cnn models, resnet50 and densenet121 43 , to further demonstrate its superiority. the results are shown in table 1 and figure 7 . it can be seen that basic efficientnet outperforms boosted-efficientnet-b3 on the training set in both acc and auc, while a different pattern can be seen on the testing set.
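for reference, the threshold-based indicators listed in the evaluation setup (acc, sen, spe and f-measure) can be computed from confusion-matrix counts as sketched below. the source does not reproduce its exact definitions in this section, so the standard ones are assumed here, with the f-measure taken to be f1 and the metastasis class treated as positive; the function name is hypothetical. auc is omitted because it needs the ranked prediction scores rather than counts.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification indicators from confusion-matrix counts.

    Assumes every denominator is non-zero, i.e. both classes occur and
    at least one positive prediction or positive label exists.
    """
    return {
        "acc": (tp + tn) / (tp + fp + tn + fn),   # overall accuracy
        "sen": tp / (tp + fn),                    # sensitivity (recall)
        "spe": tn / (tn + fp),                    # specificity
        "f_measure": 2 * tp / (2 * tp + fp + fn), # F1 score
    }
```

for example, 90 true positives, 5 false positives, 95 true negatives and 10 false negatives give a sensitivity of 0.9 and a specificity of 0.95.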
the main reason for this different trend is that the basic efficientnet overfits the training set, whereas boosted-efficientnet-b3 mitigates overfitting since rcc enables the algorithm to crop images randomly, thus improving the diversity of training images. although enhancing the performance of a well-performing model is of great difficulty, compared with basic efficientnet-b3, boosted-efficientnet-b3 significantly improves the acc from 97.01% to 97.96% and modestly boosts the auc from 99.24% to 99.68%. besides, an increase of more than 1% can be seen in the sen, spe, and f-measure. the same patterns observed between basic efficientnet and boosted efficientnet-b3 can be found when comparing efficientnet-b3 to other cnn architectures. notably, resnet50 and densenet121 suffer significantly from overfitting. efficientnet-b3 obtains better performance than resnet50 and densenet121 for all indicators on testing datasets while using fewer parameters and computation resources (figure 7) . all these results confirm the capability of our methods, and we believe these methods can boost other state-of-the-art backbone networks. therefore, we intend to extend the application scope of these methods in the future. ablation studies are conducted to illustrate the effectiveness and coupling degree of the four methods, as shown in section 4.3. in this part, we conduct ablation experiments to illustrate the capacity of our methods, including random center cropping (rcc), reducing the downsampling scale (rds), feature fusion (ff), and attention. auc and acc are utilized as the primary evaluation metrics. from the first two rows of table 2 , we can observe that rcc significantly improves the performance of the algorithm: the auc is increased from 99.24% to 99.54%, and the acc is increased from 97.01% to 97.57%, because rcc enhances the diversity of training images and mitigates the overfitting problem.
as the first and third rows of table 2 show, rds brings modest improvements in acc and auc (0.35% and 0.19%, respectively) because of the larger feature map. the image resolution of the rpcam dataset is much lower than the designed input of efficientnet-b3, resulting in smaller and more abstract features, thus degrading the performance. it is worth noting that the improvement from rds is enhanced when it is combined with rcc. feature fusion (ff) combines low-level and high-level features to boost the performance of models. as shown in table 2 , when only one mechanism is adopted, ff yields the largest auc increase and the second-highest acc increase among rcc, rds, and ff, indicating ff's adaptability and effectiveness in efficientnet. ff contributes a more remarkable improvement to the model after rcc and rds are utilized, since the acc reaches the highest value and the auc comes second among all methods. it should be emphasized that the attention mechanism needs to be combined with ff in our work. utilizing the attention mechanism to enhance the response of cancerous tissues and suppress the background can further boost the performance. from the 4th and 5th rows of table 2 , it can be seen that the attention mechanism improves the performance of the original architecture in both acc and auc, confirming its effectiveness. then, we analyze the last four rows. when the first three strategies are employed, adding attention increases the auc by 0.02%, but the acc remains at 97.96%. meanwhile, attention brings a significant performance improvement compared with models that only utilize rcc and ff, since the acc and auc are increased from 97.59% to 97.85% and from 99.58% to 99.68%, respectively. although the model using all methods demonstrates the same value of the auc as the model only utilizing rcc, rds, and ff, the model using all methods shows a 0.11% acc improvement.
a possible reason for the minor improvement between these two models is that rds enlarges the size of the final feature maps, thus maintaining some low-level information to some extent, which is similar to the effect of ff and the attention mechanism. the purpose of this project is to facilitate the development of digital diagnosis in mbcs and explore the applicability of a novel cnn architecture, efficientnet, to mbc. in this paper, we propose a boosted efficientnet cnn architecture to automatically diagnose the presence of cancer cells in the pathological tissue of breast cancers. we develop a data augmentation method, rcc, to retain the most informative parts of images and maintain the original image resolution. experiments demonstrate that this method significantly improves the performance of efficientnet-b3. in addition, we propose to reduce the downsampling scale of basic efficientnet by adjusting the architecture of efficientnet-b3 to better accommodate small resolution training images. moreover, two mechanisms are employed to enrich the semantic information of features. as shown in the ablation studies, both of these methods boost the basic efficientnet-b3, and more remarkable improvements can be obtained by combining some of them. boosted-efficientnet-b3 is also compared with two other state-of-the-art cnn architectures, resnet50 and densenet121, and shows superior performance. we believe that our methods can be utilized in other models and lead to improved performance in the diagnosis of other diseases, and we will explore this in the future. in summary, our boosted efficientnet-b3 achieves an accuracy of 97.96% and an auc value of 99.68%, and hence may provide a reliable, efficient, and economical alternative for medical institutions in relevant areas. all data generated or analyzed during this study are included in this published article and its supplementary information files. the authors declare that they have no competing interests.
key: cord-258170-kyztc1jp authors: shorfuzzaman, mohammad; hossain, m. shamim; alhamid, mohammed f.
title: towards the sustainable development of smart cities through mass video surveillance: a response to the covid-19 pandemic date: 2020-11-05 journal: sustain cities soc doi: 10.1016/j.scs.2020.102582 sha: doc_id: 258170 cord_uid: kyztc1jp sustainable smart city initiatives around the world have recently had a great impact on the lives of citizens and brought significant changes to society. more precisely, data-driven smart applications that efficiently manage scarce resources are offering a futuristic vision of smart, efficient, and secure city operations. however, the ongoing covid-19 pandemic has revealed the limitations of existing smart city deployments; hence, the development of systems and architectures capable of providing fast and effective mechanisms to limit further spread of the virus has become paramount. an active surveillance system capable of monitoring and enforcing social distancing between people can effectively slow the spread of this deadly virus. in this paper, we propose a data-driven deep learning-based framework for the sustainable development of a smart city, offering a timely response to combat the covid-19 pandemic through mass video surveillance. to implement social distancing monitoring, we used three deep learning-based real-time object detection models for the detection of people in videos captured with a monocular camera. we validated the performance of our system using a real-world video surveillance dataset for effective deployment. due to the coronavirus disease 2019 (covid-19), the world is undergoing a situation unprecedented in recent human history, with massive economic losses and a global health crisis. the virus, initially identified in december 2019 in the city of wuhan, china, has rapidly spread throughout the world, resulting in the ongoing pandemic. since the initial outbreak, the disease has affected over two hundred countries and territories across the globe, with more than 20 million cases reported (covid-19, 2020).
the outbreak was declared a public health emergency of international concern (pheic) by the world health organization (who) (who, 2020) on january 30, 2020. the virus is very contagious and is primarily transmitted between people through close contact. a variety of common symptoms are found in those infected, such as cough, fever, shortness of breath, fatigue, loss of smell, and pneumonia. the complications of the disease include pneumonia, acute respiratory distress syndrome, and other infections. precise and timely diagnosis is being hampered by the lack of treatment, scarcity of resources, and harsh conditions of the laboratory environment. this has increased the challenge of curbing the spread of the virus. furthermore, the absence of an approved therapy to cure covid-19 infections has motivated the pressing need for prevention and mitigation solutions to reduce the spread of the virus. social distancing protocols, including country-wide lockdowns, travel bans, and limiting access to essential businesses, are gradually curbing the spread. in fact, social distancing has already proven to be an effective non-pharmaceutical measure for stopping the transmission of this infectious disease (ferguson et al., 2006; fraser et al., 2004). social distancing refers to an approach to minimizing disease spread by maintaining a safe physical distance between people, avoiding crowds, and reducing physical contact. according to who norms (hensley, 2020), proper social distancing requires people to maintain a distance of at least 6 ft from other individuals. because it is highly likely that an infected individual may transmit the virus to a healthy person, social distancing can significantly reduce the number of fatalities caused by the virus, as well as reduce economic loss. fig. 1 illustrates the impact of social distancing on the daily number of cases (irfan, 2020). it can be observed from fig.
1(a) that social distancing can significantly reduce the peak number of cases of infection, and essentially delay the occurrence of the peak if it is implemented at an early stage of the pandemic. this would reduce the burden on health care facilities and allow more time for adopting countermeasures. also, as shown in fig. 1(b), social distancing can reduce the total number of cases, and the sooner the measure is taken, the higher the positive impact will be. lately, several countries throughout the world, including the netherlands (amsterdam, 2020), usa (smart america, 2020), and south korea (silva, khan, & han, 2018), are taking the initiative to deploy "sustainable smart cities" (bibri & krogstie, 2017). for example, the latest smart city 3.0 (amsterdam, 2020) initiative by the city of amsterdam encourages the effective participation of citizens, government, and private organizations in building smart city solutions. the plan includes the development of infrastructures and technologies in the areas of smart energy and water systems, the intelligent transport system (its), and so on. however, as part of the effective preparation for the current and future pandemics, it is expected that the sustainable development of smart cities will provide situational intelligence and an automated targeted response to ensure the safety of global public health and to minimize massive economic losses. in this context, smart cities will host data-driven services along with other iot devices, such as ip surveillance and thermal cameras, sensors, and actuators, to deliver community-wide social distancing estimates and the early detection of potential pandemics. fig.
2 illustrates a sustainable smart city scenario where social distancing is monitored in real time to offer a variety of services, such as detecting and monitoring the distance between any two individuals, detecting crowds and gatherings in public areas, monitoring physical contacts between people such as handshaking and hugging, detecting and monitoring individuals with disease symptoms such as cough and high body temperature, and monitoring any violation of quarantine by infected people. in this paper, we propose a data-driven deep learning framework for the development of a sustainable smart city, offering a timely response to combat the covid-19 pandemic through mass video surveillance. upon the detection of a violation, an audio-visual, non-intrusive alert is generated to warn the crowd without revealing the identities of the individuals who have violated the social distancing measure. in particular, we make the following contributions: (a) a deep learning-based framework is presented for monitoring social distancing in the context of sustainable smart cities in an effort to curb the spread of covid-19 or similar infectious diseases; (b) the proposed system leverages state-of-the-art, deep learning-based real-time object detection models for the detection of people in videos, captured with a monocular camera, to implement social distancing monitoring use cases; (c) a perspective transformation is presented, where the captured video is transformed from a perspective view to a bird's eye (top-down) view to identify the region of interest (roi) in which social distancing will be monitored; (d) a detailed performance evaluation is provided to show the effectiveness of the proposed system on a video surveillance dataset. the rest of the paper is organized as follows. the background and related work are presented in section 2. sections 3 and 4 present the proposed system, dataset, and experiments with the performance results.
finally, section 5 concludes the paper with suggestions for future work. object detection is one of the most challenging problems in the computer vision domain, and lately there has been substantial improvement in this field with the advancements in deep learning (yang et al., 2016). in this study, we use three state-of-the-art object detection architectures that are pre-trained and optimized on large image datasets, such as pascal-voc (everingham et al., 2010) and ms-coco (lin et al., 2014), to detect pedestrians for monitoring social distancing in mass video surveillance footage through vision-based social media event analysis (yang et al., 2015; qian et al., 2015). we present a two-stage detector called faster r-cnn (faster region with convolutional neural networks) (ren, he, girshick, & sun, 2015) and two one-stage detectors called ssd (single shot multibox detector) (liu et al., 2015) and yolo (you only look once) (farhadi & redmon, 2018). faster r-cnn (ren, he, girshick, & sun, 2015) was built incrementally from two of its predecessor architectures, called r-cnn (girshick, 2014) and fast r-cnn (girshick, 2015), where rois are generated using a technique called selective search (ss) (google, 2020). because ss does not involve any deep learning techniques, the authors of faster r-cnn proposed the region proposal network (rpn), which uses cnn models such as resnet 101 (he et al., 2016), vgg-16 (simonyan & zisserman, 2015), and inception v2 (szegedy et al., 2016) to generate region proposals. this makes faster r-cnn at least ten times faster than fast r-cnn. fig. 3 shows a schematic diagram of the faster r-cnn architecture, where the rpn accepts an image as input and outputs an roi. each roi consists of a bounding box and an objectness probability. to generate those numbers, a cnn is used to extract a feature volume. after post-processing, the final output is a list of rois.
in the second stage, faster r-cnn performs classification in which it accepts two inputs, namely the list of rois from the previous step (the rpn) and a feature volume computed from the input image, and outputs the final bounding boxes. in ssd, the detection process consists of two steps, namely feature map extraction and object detection through convolutional filtering, and the architecture is built from three separate components. the first part represents the base pre-trained network (such as mobilenet) (howard et al., 2017), which is used for feature extraction. the second part consists of a series of convolutional filters representing multi-scale feature layers. finally, an nms unit represents the last layer, where unwanted overlapping bounding boxes are removed to produce only one box per object. a schematic architecture of ssd is shown in fig. 4. another single-stage object detector is yolo (shown in fig. 5) (farhadi & redmon, 2018), which is often considered a competitor of ssd. we used yolov3 in this study. it is one of the fastest object detection algorithms available in the literature, and can run at more than 170 fps on a modern gpu. however, it is outperformed by faster r-cnn in terms of accuracy. moreover, due to the way it detects objects, yolo struggles with smaller objects. nevertheless, the architecture is constantly evolving from its earlier variants (redmon & farhadi, 2017), and its challenges are being worked on. the core idea of yolo is that it reframes object detection as a single regression problem. the model is split into two parts, namely inference and training. inference refers to the process of taking an input image and computing results, while training represents the process of learning the weights of the model. like most other image detection models, yolo is based on a backbone model that extracts meaningful features from the image to be used in the final layers.
while any architecture can be chosen as a feature extractor, the yolo study employs a custom architecture called darknet-53. the performance of the final model depends heavily on the choice of feature extractor architecture. since the onset of the covid-19 pandemic, many countries around the world have taken the initiative to develop solutions for combatting the outbreak based on emerging technology. many law enforcement departments are making use of drones and video surveillance cameras to detect and monitor crowded areas and adopt disciplinary actions that alert the crowd (robakowska et al., 2017). a recent study (nguyen et al., 2020) investigated how social distancing can be enforced through various scenarios, and by using technologies such as ai and iot. the authors described the basic concept of social distancing and various models that use existing technologies to control the spread of the virus. agarwal et al. discussed state-of-the-art disruptive technologies to fight the covid-19 pandemic. they introduced the notion of disruptive technologies and classified their scope in terms of human-centric or smart-space categories. furthermore, the authors provided a swot analysis of the identified techniques. khandelwal et al. (2020) proposed a computer vision-based system to monitor the activities of a workforce to ensure their safety using cctv feeds. as part of the system, they built tools to effectively monitor social distancing and to detect face masks. another recent work presented by punn et al. (2020) proposed a social distancing monitoring approach using yolov3 and deep sort to detect pedestrians and calculate a social distancing violation index. the study was limited by the lack of statistical analysis and direction for deployment. cristani et al. (2020) also proposed a special social distancing monitoring approach in which they formulated the monitoring problem as a visual social distancing (vsd) problem.
they discussed the impact of the subjects' social context on the computation of distances, and they raised privacy concerns. hossain et al. (2020) presented a health care framework based on a 5g network to develop a mass video surveillance system for monitoring body temperature, face masks, and social distancing. sun and zhai (2020) introduced and developed two critical indices called social distance probability and ventilation effectiveness for the prediction of covid-19 infection probability. using these indices, the authors demonstrated the impact of social distancing and ventilation on the risk of respiratory illness infection. rahman et al. (2020) presented a data-driven approach to building a dynamic clustering framework to alleviate the adverse economic impact of covid-19. they developed a clustering algorithm to simulate various scenarios, and thus to identify the strengths and weaknesses of the algorithm. kolhar, al-turjman, alameen, and abualhaj (2020) proposed a social distancing monitoring scheme based on a mobile robot with commodity sensors that could navigate through a crowd without collision to estimate the distance between all detected people. the robot was also equipped with thermal cameras to remotely transmit thermal images to security personnel who monitored individuals with a higher-than-normal temperature. fan et al. (2020) presented a similar approach to social distancing monitoring with an autonomous surveillance quadruped robot that could promote social distancing in complex urban environments. the existing systems in the literature that leverage various measurements for social distancing monitoring are interesting; however, recording and storing surveillance data and generating intrusive alerts may not be acceptable for many individuals.
hence, the current implementation of the proposed system detects pedestrians in an roi using a fixed monocular camera and estimates the distance between pedestrians in real time without recording data. our system generates an audio-visual, non-intrusive, dismissible alert to caution the crowd when it detects any social distancing violation. moreover, a perspective transformation is presented, where the captured video is transformed from a perspective view to a bird's eye (top-down) view to determine the roi in which social distancing will be monitored. the recent advancement in deep learning technology has brought significant improvement to the development of techniques for a broad range of challenges and tasks involved in medical diagnosis, epilepsy seizure detection (hossain et al., 2019), speech recognition (amodei et al., 2016), machine translation (vaswani et al., 2018), and so on. the majority of these tasks are focused on classification, segmentation, detection, recognition, and the tracking of objects (brunetti et al., 2018; punn & agarwal, 2019). to this end, the state-of-the-art cnn-based architectures pre-trained and optimized on large image datasets such as pascal-voc (everingham et al., 2010) and ms-coco (lin et al., 2014) have shown substantial performance improvement for object detection. motivated by this, we present in this study a deep learning-based video surveillance framework using state-of-the-art object detection and tracking models to monitor physical distancing in crowded areas in an attempt to combat the covid-19 pandemic. for the sake of simplicity, the current implementation of the proposed system detects pedestrians in an roi using a fixed monocular camera and estimates the distance between pedestrians in real time without recording data. recording and storing surveillance data and generating intrusive alerts may not be acceptable to many individuals.
our system generates an audio-visual, non-intrusive, dismissible alert to signal the crowd upon detecting any social distancing violation. a general overview of our system is presented in fig. 6, and a detailed description starts below. fig. 6. illustration of the proposed system. real-time video data from an ip surveillance camera is directly fed into the system for social distancing monitoring. an audio-visual, non-intrusive, dismissible alert is generated for any violation. the incoming video may be fed to the system from any perspective view, and hence we first needed to transform the video from a perspective view to a bird's eye (top-down) view. to achieve this, we selected four points in the perspective view that formed the roi where social distancing would be monitored. subsequently, we aligned these four points to the four corners of a rectangle in the bird's eye view. fig. 7 illustrates an intuitive representation of perspective transformation reproduced from the study by luo et al. (2010). after the transformation, the concerned points constitute parallel lines if they are observed from the top (hence the bird's eye view). this bird's eye view is characterized by a uniform distribution of points in both horizontal and vertical directions, even though the scale is different in each direction. we also measured the scaling factor of the bird's eye view during this calibration process, by which we determined how many pixels should correspond to 6 ft in real-world coordinates. thus, we can obtain a transformation that can be applied to the entire image in perspective view. in the second step, we detect pedestrians in the transformed image view with the selected object detection models (faster r-cnn, yolo, ssd) trained on real-world datasets. subsequently, a bounding box with four corners is drawn for each detected pedestrian.
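the perspective-to-bird's-eye mapping described above can be illustrated with a small sketch. this is not the authors' implementation: the paper states that opencv's perspective transform routine is used, whereas the sketch below solves for the 3 × 3 homography directly in pure python so that it stays self-contained. the function names (`solve_linear`, `homography_from_points`, `warp_point`) and the corner ordering (top-left, top-right, bottom-right, bottom-left) are assumptions made for illustration.

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [[float(v) for v in row] + [float(b[i])] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate point configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = m[i][n] - sum(m[i][c] * x[c] for c in range(i + 1, n))
        x[i] = s / m[i][i]
    return x


def homography_from_points(src, dst):
    """Return the 3x3 homography mapping four src points onto four dst points."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        # u = (h0*x + h1*y + h2) / (h6*x + h7*y + 1), and likewise for v
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def warp_point(h, point):
    """Map one image point into the bird's eye view."""
    x, y = point
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)
```

with opencv available, `cv2.getPerspectiveTransform` and `cv2.warpPerspective` provide the same mapping for whole frames; the hand-written solver above is only meant to make the underlying computation visible.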
we use non-max suppression (nms) to remove unwanted bounding boxes to ensure that our detector detects a pedestrian only once. the last step is to calculate the distance between each pair of pedestrians to detect any potential violation of the social distancing norm. to do this, we make use of the bounding box for each pedestrian in the image. to localize the detected pedestrian in the image, we take the bottom center point of the bounding box and apply a perspective transformation on it, resulting in a bird's eye view of the position of the detected pedestrian. after calculating the distance between every pair of pedestrians in the bird's eye view, we identify the pedestrians whose distance is below the minimum acceptable threshold and highlight them with red bounding boxes, and at the same time generate a non-intrusive audio-visual alert to warn the crowd. based on the calculated distance, other pedestrians are marked as safe or at low risk with green and yellow, respectively. the complete algorithmic flow of the detection process is shown in fig. 8. fig. 8. algorithmic flow of the proposed system. opencv's perspective transform routine is used for bird's eye view transformation. to demonstrate the effectiveness of our video surveillance framework while monitoring social distancing in crowded areas, we extensively evaluated the proposed framework using all three object detection models (faster r-cnn, yolo, and ssd) with the publicly available oxford town center dataset (benfold & reid, 2011). this is a video dataset that was released by oxford university as part of the visual surveillance project. it contains video data from a fps. the video was downsampled to a standardized resolution of 1280 × 720 before it was fed to the object detection models. the dataset also contains the ground truth bounding boxes for the pedestrians in all the frames in the entire video.
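the pairwise distance check and colour coding described earlier in this section can be sketched as follows. this is a minimal illustration under stated assumptions: pedestrian positions are taken to be already mapped to the bird's eye view and scaled to feet, the function names are hypothetical, and the 9 ft boundary between "low risk" (yellow) and "safe" (green) is an illustrative value, since the paper only fixes the 6 ft violation threshold.

```python
from itertools import combinations
from math import hypot


def bottom_center(box):
    """Reference point of a pedestrian; box = (x_min, y_min, x_max, y_max)."""
    x_min, _, x_max, y_max = box
    return ((x_min + x_max) / 2.0, float(y_max))


def classify_risk(positions_ft, min_distance_ft=6.0, low_risk_ft=9.0):
    """Label each pedestrian 'red', 'yellow' or 'green'.

    positions_ft: (x, y) positions in the bird's eye view, in feet.
    low_risk_ft is an assumed value, not taken from the paper.
    """
    rank = {"green": 0, "yellow": 1, "red": 2}
    labels = ["green"] * len(positions_ft)
    for i, j in combinations(range(len(positions_ft)), 2):
        d = hypot(positions_ft[i][0] - positions_ft[j][0],
                  positions_ft[i][1] - positions_ft[j][1])
        if d < min_distance_ft:
            level = "red"      # social distancing violation
        elif d < low_risk_ft:
            level = "yellow"   # close, but not violating
        else:
            level = "green"
        for k in (i, j):
            # keep the most severe label seen so far for each pedestrian
            if rank[level] > rank[labels[k]]:
                labels[k] = level
    return labels
```

each pedestrian keeps the most severe label over all of its pairings, so a single close neighbour is enough to mark a person red.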
we evaluated the object detection models for person detection in the test video using the predicted bounding boxes and the coordinates from the ground truth boxes. the implementation started with obtaining the perspective transformation (top-down view) of the video. we used a mouse click event to select the roi, where we chose four points to designate the area in the first frame to monitor the social distancing. this was a one-time process, and the selected roi was reused for all the frames in the video. next, three points were chosen to define a 6 ft (approximately 180 cm) distance in both the vertical and horizontal directions, forming lines parallel to the roi. from these three points, a scaling factor was calculated for use in the top-down (bird's eye) view in both directions to determine how many pixels corresponded to 6 ft in real-world coordinates. in the second step, we applied object detection models to detect pedestrians and draw a bounding box around each of them. as mentioned, we applied nms and other rule-based heuristics as part of the minimal post-processing on the output bounding boxes to reduce the possibility of over-fitting. after the pedestrians were located, their positions were transformed into real-world coordinates through bird's eye view transformation. the pre-trained object detection models optimized on ms coco (lin et al., 2014) and pascal voc (everingham et al., 2010) datasets were implemented using pytorch and tensorflow. more specifically, the pytorch-based detectron2 api and the tensorflow object detection api were used. we conducted experiments in the google colab notebook environment, which provides free gpu access. it currently offers an nvidia tesla p100 gpu with 16 gb ram, and is equipped with pre-installed python 3.x packages, pytorch, and the keras api with a tensorflow backend.
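the calibration step described above, in which three points mark a 6 ft distance in the horizontal and vertical directions, can be sketched as follows. this is only an illustration: the function names, the argument layout (one shared origin plus one point per direction), and the example coordinates are assumptions, and the points are taken to be already expressed in bird's eye view pixels.

```python
from math import hypot


def calibrate_scale(origin, horizontal_pt, vertical_pt, reference_ft=6.0):
    """Return (pixels per foot along x, pixels per foot along y).

    origin, horizontal_pt and vertical_pt are bird's eye view pixel
    coordinates; each of the two segments starting at origin is assumed
    to span reference_ft in the real world.
    """
    px_x = hypot(horizontal_pt[0] - origin[0], horizontal_pt[1] - origin[1])
    px_y = hypot(vertical_pt[0] - origin[0], vertical_pt[1] - origin[1])
    if px_x == 0 or px_y == 0:
        raise ValueError("calibration points must be distinct")
    return px_x / reference_ft, px_y / reference_ft


def pixels_to_feet(p, q, scale):
    """Real-world distance in feet between two bird's eye view points."""
    scale_x, scale_y = scale
    return hypot((p[0] - q[0]) / scale_x, (p[1] - q[1]) / scale_y)


# hypothetical calibration clicks: 60 px horizontally and 80 px vertically per 6 ft
scale = calibrate_scale((100, 100), (160, 100), (100, 180))
```

keeping a separate factor per axis reflects the earlier remark that the bird's eye view has a different scale in each direction.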
in the third step, we conducted social distancing monitoring by calculating the distance between each pair of pedestrians by measuring from the bottom center point of each pedestrian's bounding box. the statistics related to the total number of violations and the level of risk for individuals were recorded over time. in subsequent sections, we present the metrics for evaluation and the experimental results with discussion. for various annotated datasets such as pascal voc (everingham et al., 2010) and ms coco (lin et al., 2014), and their relevant object detection challenges, the most widely used performance metric for estimating detection accuracy was the average precision (ap). in this study, we use similar metrics to demonstrate the performance of our social distancing framework. in particular, the object detection metrics provide an estimate of how well our model performs on a person detection task in mass surveillance areas. in this context, it is important to distinguish between correct and incorrect detections. a common way to do this is to use the intersection over union (iou) metric. iou, also referred to as the jaccard index, is used to measure the similarity between two sets (jaccard, 1901). in the context of object detection, it provides a measure of the similarity between the ground truth bounding box and the predicted bounding box as a measurement for the quality of the prediction. the value of iou varies from 0 to 1. the closer the bounding boxes, the higher the value of iou. specifically, the iou estimates the overlap of ground truth ($bbox_{gt}$) and predicted ($bbox_{p}$) bounding boxes over the area obtained by their union, as illustrated in the following equation and fig. 9: $$\mathrm{iou} = \frac{\mathrm{area}(bbox_{gt} \cap bbox_{p})}{\mathrm{area}(bbox_{gt} \cup bbox_{p})}$$ fig. 9. illustration of intersection over union (iou). after computing the iou for each detection, we compared it with a given threshold, $t_{th}$, to obtain a classification for the detection.
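the iou computation and the threshold test can be sketched as follows. this is a simplified illustration rather than the evaluation code used in the study: boxes are assumed to be `(x_min, y_min, x_max, y_max)` tuples, detections are matched greedily and one-to-one in order of decreasing confidence, the default threshold of 0.5 is an assumed value, and the function names are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_detections(predictions, ground_truth, threshold=0.5):
    """Classify detections as true or false positives.

    predictions: list of (confidence, box) pairs.
    ground_truth: list of boxes.
    Returns (flags, false_negatives), where flags is a list of
    (confidence, is_true_positive) sorted by decreasing confidence.
    """
    matched = set()
    flags = []
    for score, box in sorted(predictions, key=lambda p: p[0], reverse=True):
        best_iou, best_j = 0.0, None
        for j, gt_box in enumerate(ground_truth):
            if j in matched:
                continue  # each ground truth box can be claimed only once
            value = iou(box, gt_box)
            if value > best_iou:
                best_iou, best_j = value, j
        if best_j is not None and best_iou >= threshold:
            matched.add(best_j)
            flags.append((score, True))
        else:
            flags.append((score, False))
    return flags, len(ground_truth) - len(matched)
```

ground truth boxes that remain unmatched at the end are counted as false negatives.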
if the value of iou was above the threshold, the detection was considered as a positive (correct) prediction. on the contrary, if the value of iou was below the threshold, the detection was considered as a false (incorrect) prediction. more specifically, the predictions were categorized as true positive (tp), false positive (fp), and false negative (fn). intuitively, there are two cases that are deemed as fps. in one, the object is present but the iou is less than the threshold, and in the other case, the object is not present, but the model detects one. fn refers to the case where the object is present, but the model fails to detect it. based on these various prediction types, precision and recall values were calculated and served as the basis for creating precision × recall curves and computing mean ap (map). precision refers to the model's ability to detect relevant objects and was calculated as the percentage of correct detections over all positive detections. recall refers to the model's sensitivity and was calculated as the percentage of correct positive predictions over all ground truth objects. the precision × recall curve summarizes both precision and recall as a trade-off for various confidence values linked to the bounding boxes produced by the detection model. in practice, the curve appears to be very noisy due to the trade-off between precision and recall, and hence it is difficult to estimate the model performance by computing the area under the curve (auc). this is managed by smoothing out the curve before auc estimation by means of a numerical value called ap. there are two different techniques, called 11-point and all-point interpolation, used to achieve this. in fact, the computation method for ap was changed by the pascal voc challenge (everingham et al., 2010) from 2010 onward. at present, all data points are used for interpolation, rather than interpolating at only 11 points that are equally spaced.
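the precision × recall curve and the two interpolation schemes can be sketched as follows. this is a minimal illustration of the standard pascal voc style computation, not the evaluation code used in the study; the function names are hypothetical, and the input is assumed to be a list of true/false positive flags already sorted by decreasing detection confidence.

```python
def precision_recall_points(flags, num_ground_truth):
    """Cumulative precision and recall after each detection.

    flags: booleans (True = true positive), sorted by decreasing confidence.
    """
    tp = fp = 0
    precisions, recalls = [], []
    for is_tp in flags:
        if is_tp:
            tp += 1
        else:
            fp += 1
        precisions.append(tp / (tp + fp))
        recalls.append(tp / num_ground_truth)
    return precisions, recalls


def ap_11_point(precisions, recalls):
    """Mean of the interpolated precision at recall 0, 0.1, ..., 1."""
    total = 0.0
    for k in range(11):
        r = k / 10.0
        candidates = [p for p, rec in zip(precisions, recalls) if rec >= r]
        total += max(candidates) if candidates else 0.0
    return total / 11.0


def ap_all_point(precisions, recalls):
    """Area under the monotonically decreasing precision envelope."""
    mrec = [0.0] + list(recalls) + [1.0]
    mpre = [0.0] + list(precisions) + [0.0]
    for i in range(len(mpre) - 2, -1, -1):
        # interpolated precision: maximum precision at any higher recall
        mpre[i] = max(mpre[i], mpre[i + 1])
    return sum((mrec[i + 1] - mrec[i]) * mpre[i + 1]
               for i in range(len(mrec) - 1))
```

for three detections flagged true, false, true against two ground truth objects, the curve passes through the (recall, precision) points (0.5, 1), (0.5, 0.5), and (1, 2/3), and the two schemes give slightly different ap values because the 11-point variant samples the envelope only at fixed recall levels.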
for the sake of completeness, we adopted both interpolation techniques. the 11-point interpolation summarizes the precision × recall curve by taking an average of the maximum precision values across a set of 11 equally spaced recall values in the range of 0 to 1. more precisely, we interpolated the precision score for a certain recall value, $r$, by taking the maximum precision where the corresponding recall value, $\tilde{r}$, was greater than or equal to $r$. this can be formulated as follows: $$\mathrm{ap}_{11} = \frac{1}{11} \sum_{r \in \{0, 0.1, \ldots, 1\}} p_{\mathrm{interp}}(r),$$ where the interpolated precision is denoted as: $$p_{\mathrm{interp}}(r) = \max_{\tilde{r} : \tilde{r} \geq r} p(\tilde{r}).$$ in the all-point interpolation, we compute ap by interpolating the precision score at all recall values instead of using only 11 recall levels. this can be translated mathematically as follows: $$\mathrm{ap}_{\mathrm{all}} = \sum_{n} (r_{n+1} - r_{n}) \, p_{\mathrm{interp}}(r_{n+1}),$$ where the interpolated precision is denoted as: $$p_{\mathrm{interp}}(r_{n+1}) = \max_{\tilde{r} : \tilde{r} \geq r_{n+1}} p(\tilde{r}).$$ we used three different cnn-based object detection models, namely faster r-cnn, yolov3, and ssd, for experiments with social distancing monitoring. fig. 10 with red, yellow, and green color, respectively. in general, the faster r-cnn models appear to be overly sensitive and detected a plastic human display as a pedestrian, as shown in fig. 4(b). for sustainable development, maintaining safety and encouraging well-being at all ages is important. the current pandemic has devastated the sustainable development of society. this study is a step toward a better understanding of the dynamics of the covid-19 pandemic, and proposes a system that aims to achieve this through state-of-the-art deep learning-based object detection models that detect and track individuals in real time with the help of bounding boxes. upon the detection of a violation, an audio-visual, non-intrusive alert is generated to warn the crowd without revealing the identities of the individuals who have violated the social distancing measure.
an extensive performance evaluation was carried out using faster r-cnn, ssd, and yolo object detection models with a public video surveillance dataset, in which yolo proved to be the best performing model with a balanced map score and speed (fps). the absence of an effective vaccine and the lack of immunity against covid-19 have made social distancing a largely feasible and widely adopted approach to controlling the ongoing pandemic. maintaining social distancing has also been recommended by leading health organizations, such as the who and centers for disease control and prevention (cdc). to this end, our proposed deep learning-based video surveillance framework will play a significant role in combating the spread of covid-19 in a sustainable smart city context. at this stage, it is imperative to identify some of the potential impacts of our approach on the surrounding environments, such as increased anxiety and panic among the individuals who receive the repetitive alerts. in addition, some legitimate concerns regarding individual rights and privacy could be raised; these can be effectively handled by obtaining prior consent from individuals and concealing their identities.
unleashing the power of disruptive and emerging technologies amid covid 2019: a detailed review
privacy-aware energy-efficient framework using internet of medical things for covid-19
deep speech 2: end-to-end speech recognition in english and mandarin
amsterdam smart city 3
stable multi-target tracking in real-time surveillance video
covid-19. (2020).
dashboard, coronaboard smart sustainable cities of the future: an extensive interdisciplinary literature review computer vision and deep learning techniques for pedestrian detection and tracking: a survey the visual social distancing problem the pascal visual object classes (voc) challenge autonomous social distancing in urban environments using a quadruped robot strategies for mitigating an influenza pandemic factors that make an infectious disease outbreak controllable rich feature hierarchies for accurate object detection and semantic segmentation fast r-cnn open image dataset v6 deep residual learning for image recognition social distancing is out, physical distancing is in here is how to do it, global news-canada explainable ai and mass surveillance system-based healthcare framework to combat covid-i9 like pandemics mobilenets: efficient convolutional neural networks for mobile vision applications ai techniques for covid-19 the math behind why we need social distancing, starting right now etude comparative de la distribution florale dans une portion des alpes et des jura using computer vision to enhance safety of workforce in manufacturing in a post covid world a three layered decentralized iot biometric architecture for city lockdown during covid-19 outbreak microsoft coco: common objects in context. 
european conference on computer vision -eccv 2014 ssd: single shot multibox detector low-cost implementation of bird's-eye view system for camera-on-vehicle enabling and emerging technologies for social distancing: a comprehensive survey-rob monitoring covid-19 social distancing with person detection and tracking via fine-tuned yolov3 and deepsort techniques detection and brain mapping visualization inception u-net architecture for semantic segmentation to identify nuclei in microscopy cell images crowd analysis for congestion control early warning system on foot over bridge social event classification via boosted multimodal supervised latent dirichlet allocation data-driven dynamic clustering framework for mitigating the adverse economic impact of covid-19 lockdown practices yolo9000: better, faster, stronger faster r-cnn: towards real-time object detection with region proposal networks deepsocial: social distancing monitoring and infection risk assessment in covid-19 pandemic, medrxiv preprint the use of drones during mass events covid-robot: monitoring social distancing constraints in crowded scenarios towards sustainable smart cities: a review of trends, architectures, components, and open challenges in smart cities very deep convolutional networks for large-scale image recognition automatic visual concept learning for social event understanding deep relative attributes the efficacy of social distance and ventilation effectiveness in preventing covid-19 transmission rethinking the inception architecture for computer vision tensor2tensor for neural machine translation statement on the second meeting of the international health regulations (2005) emergency committee regarding the outbreak of novel coronavirus (2019-ncov). world health organization archived from the original on 31 the authors do not have conflicts of interest. 
the authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
key: cord-308219-97gor71p authors: elzeiny, sami; qaraqe, marwa title: stress classification using photoplethysmogram-based spatial and frequency domain images date: 2020-09-17 journal: sensors (basel) doi: 10.3390/s20185312 sha: doc_id: 308219 cord_uid: 97gor71p stress is subjective and is manifested differently from one person to another. thus, the performance of generic classification models that classify stress status is crude. building a person-specific model leads to a reliable classification, but it requires the collection of new data to train a new model for every individual and needs periodic upgrades because stress is dynamic. in this paper, a new binary classification (stressed and non-stressed) approach is proposed for a subject’s stress state in which the inter-beat intervals extracted from a photoplethysmogram (ppg) were transferred to spatial images and then to frequency domain images according to the number of consecutive. then, a convolutional neural network (cnn) was trained and validated to classify the person’s stress state. three types of classification models were built: person-specific models, generic classification models, and calibrated-generic classification models. the average classification accuracies achieved by person-specific models using spatial images and frequency domain images were 99.9%, 100%, and 99.8%, and 99.68%, 98.97%, and 96.4% for the training, validation, and test sets, respectively. by combining 20% of the samples collected from test subjects into the training data, the calibrated generic models’ accuracy was improved and outperformed the generic performance across both the spatial and frequency domain images.
average classification accuracies of 99.6%, 99.9%, and 88.1%, and 99.2%, 97.4%, and 87.6% were obtained for the training set, validation set, and test set, respectively, using the calibrated generic classification-based method for the series of inter-beat interval (ibi) spatial and frequency domain images. the main contribution of this study is the use of the frequency domain images that are generated from the spatial domain images of the ibi extracted from the ppg signal to classify the stress state of the individual by building person-specific models and calibrated generic models. stress is a mental, emotional, and physical reaction experienced when a person perceives demands that exceed their ability to cope. the two common forms of stress are acute stress and chronic stress. acute stress is a short-term form caused by recent past and near future demands, events, or pressures. money worries, losing a job, causing an accident, taking an exam, death of a close family member, serious injury, or attending an interview can cause acute stress disorder. however, it requires a relief technique to relax and recover, such as breathing exercises, getting outdoors, or muscle relaxation. in contrast, chronic stress is a long-term form resulting from repeated exposure to stressors over a prolonged period, and it can lead to more severe health problems if it is not handled adequately [1] [2] [3]. chronic stress weakens the body's immune system, leading to several mental and physical illnesses such as depression and cardiovascular diseases [4]. people experience in which training is conducted in the fourier domain. the results indicated that convolution in the fourier domain is faster and does not affect the accuracy of image classification [26]. faster and more accurate image classification was obtained by a fourier-based convolutional neural network (fcnn). quan liu et al.
designed cnn models to predict the depth of anesthesia (doa) indicator for patients from the eeg-based spectrum, and the model achieved 93.5% and can provide physicians with measures to prevent the influence of patient and anesthetic drug differences [27]. koln et al. developed several neural networks to classify images in the fourier domain to visualize patterns learned by the networks, and they found the important regions to classify particular objects [28]. frequency domain features are important for image classification as well as spatial features, especially when the spatial resolution increases [29]. lin et al. classified pixels in frequency domain infrared microscopic images into human breast cell and non-cell categories by k-means clustering [30]. however, perceived stress is very subjective and expressed differently among different people. the generic model can classify stress status for an unseen person, but the stress classification model needs personalization due to individual differences in stress responses and coping ability. moreover, a stressful situation for one individual may not be an issue for another one, and females will, in general, have a higher level of stress than males. likewise, there exist differences in stress vulnerability, reactivity, resilience, and responsiveness to threatening events. therefore, building a person-specific classification model is important [31] [32] [33]. martin et al. found that developing student-specific models yielded better results than general and cluster-specific classification models for perceived stress detection in students using smartphone data [34]. kizito et al. proposed a hybrid stress prediction method, which revealed an increase in generic model accuracy from 42.5% to 95.2% by adding 100 person-specific samples to the data used to train the generic model.
they tested their new approach on two different datasets and found that the calibrated stress detection model outperformed the generic one. jing et al. proposed a new classification model for the driver's stress level using ibi images derived from the ecg signal and a cnn. they compared the accuracy of this approach with the ann method using time-domain features (mean ibi, root mean squared difference of adjacent ibis (rmssd), and standard deviation of ibis (sdnn)). they found that the new approach was more accurate than the ann method, which has been frequently used in recent research [35]. in this study, a new stress classification approach is proposed to classify the individual stress state into stressed or non-stressed by converting spatial images of inter-beat intervals of a ppg signal to frequency domain images and using these images to train several cnn models. three types of stress classification models were used: person-specific models, generic models, and calibrated-generic models, taking into account intra-individual and inter-individual differences. the accuracy measurements of the proposed models (person-specific, calibrated generic model) showed the potential of using frequency domain images in stress detection. our binary classification approach can be applied to classify the state of the daily life stress of individuals into stressed or non-stressed using inter-beat interval (ibi) data. moreover, it can be used to monitor a person's psychological wellbeing in everyday life and trigger clinical intervention when the occurrence of acute stress states detected in a specific patient becomes too frequent. this could prompt the clinician to look for lifestyle-related issues at the origin of the stress. this paper is structured as follows. section 2 describes the dataset used in this research, and section 3 describes the proposed image-based stress detection model. in section 4, the results for the proposed models are discussed.
section 5 concludes and states the findings of this research. wearable stress and affect detection (wesad) is a publicly available data set that contains motion and physiological data recorded from chest- and wrist-worn devices and self-reports of 15 subjects in laboratory settings during three conditions (baseline, amusement, and stress) [36]. the wesad multimodal data was used for this study. the trier social stress test (tsst) was implemented for inducing psychological stress [37]. the tsst is a procedure that induces acute social stress in a laboratory environment. in the tsst, public speaking is followed directly by a mental math task in the same session; both are delivered in front of an interview panel, and both introduce novelty and uncontrollability [38]. in the baseline session, the subjects were given a neutral magazine to read for 20 min, and for amusement they watched a set of funny movies. in the stress condition, they were exposed to public speaking and mental arithmetic tasks. the participants delivered a five-minute speech in front of a panel and were then asked to count down from 2023 to zero in steps of 17. after any mistake in the course of the counting exercise, the count had to be repeated. for meditation, the subject performed a controlled breathing exercise. throughout the protocol, ppg, ecg, emg, eda, skin temperature, acceleration, and respiration signals were recorded using the respiban professional and the empatica e4. the respiban recorded ecg, emg, eda, temp, acc, and resp data sampled at 700 hz. the e4 recorded eda (4 hz), acc (32 hz), bvp (64 hz), and temp (4 hz). the data collection was conducted in a laboratory setting. in this study, the ibi sequence provided by the empatica e4 wristband was used. the ibi is computed by a proprietary algorithm provided by empatica that detects heartbeats in the bvp signal and calculates the lengths of the intervals between adjacent beats.
in the empatica e4, the bvp signal is obtained from a ppg sensor by a proprietary algorithm that combines the light signals detected during exposure to red and green light, at a 64 hz sampling rate. the ibi data file consists of two columns: a timestamp and the duration of the detected inter-beat interval. the incorrect peaks caused by noise in the bvp signal were removed from the file [39, 40]. the ibi data for the public speaking and mental math tasks were combined to form the stress class, in order to build a binary classification model that classifies the stress state of a person into two categories: stressed or non-stressed. the ibi is a significant cardiac measure, which is used to detect stress and provides an indication of the emotional state of the individual [17, 41]. in this paper, the entire time interval of the extracted ibi data from ppg signals was divided into intervals according to their distributions. the inter-beat interval distribution is encoded in n × m matrices. after that, spatial images were generated from the extracted matrices and converted to frequency domain images for stress classification models using deep convolutional neural networks. the output of the classification model is the stress state of the person (stressed or non-stressed), as shown in figure 1. an image can be represented as a 2d matrix, and each element in this matrix represents pixel intensity. the intensity distribution of the image is called the spatial domain. for a color image, the spatial domain can be described as a stack of three 2d matrices that contain the intensities of the rgb colors. first, the abnormal values outside the normal ibi range (0.6-1.2 s) were removed. then, descriptive statistics were calculated, such as the range, minimum, and maximum values, and the time interval of the inter-beat intervals was divided into 28 intervals according to the distribution of the inter-beat intervals, as discussed in [35].
second, an n × 1 column vector was created for each inter-beat interval, with 1 assigned to the element of the interval to which the inter-beat interval belongs and 0 to the remaining elements. third, an n × m matrix was formed by concatenating m consecutive column vectors, and the output matrix was converted to a 28 × 28 pixel spatial domain image using matlab. a sliding window of size 28 was moved only along the column axis, as shown in figure 2. figure 3 shows two different images for several subjects in both conditions (stressed and non-stressed). the value of pixel intensity is the primary information stored in the pixels, and the most significant feature used for image classification. the intensity of an image is the mean of all pixel values in the image. the average pixel intensity was calculated for non-stressed and stressed images in order to quantify the differences between the two classes using the generated images. table 1 shows the mean of all the pixel values in the entire image for several subjects in both conditions (stressed and non-stressed). table 2 displays the average intensity for the four segments of each image in the two conditions (stressed and non-stressed). the mean values of the stressed spatial images are higher than those of the non-stressed spatial images. a spatial image can be represented in the frequency domain using a transformation. in the output image, each point represents a particular frequency contained in the spatial domain image. in the frequency domain image, high- and low-frequency components correspond to edges and smooth regions, respectively; such an image transformation helps to reveal pixel information and detect whether or not repeating patterns exist. the fourier transform is utilized to decompose a spatial domain image into its cosine and sine components.
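as an illustration of the image construction steps described above (removing out-of-range values, assigning each ibi to one of 28 intervals, building one-hot column vectors, and sliding a 28-column window), a minimal python sketch follows. it is not the authors' matlab implementation: the equal-width intervals over 0.6-1.2 s, the window step of one column, and all function names are assumptions made for illustration.

```python
def ibi_to_one_hot_columns(ibis, n_bins=28, lo=0.6, hi=1.2):
    """Map each inter-beat interval (seconds) to a one-hot column of length n_bins."""
    width = (hi - lo) / n_bins
    columns = []
    for ibi in ibis:
        if not lo <= ibi <= hi:
            continue  # drop values outside the assumed normal range
        index = min(int((ibi - lo) / width), n_bins - 1)  # hi falls in the last bin
        column = [0] * n_bins
        column[index] = 1
        columns.append(column)
    return columns


def sliding_images(columns, m=28, step=1):
    """Stack m consecutive columns into an n_bins x m binary image per window."""
    images = []
    for start in range(0, len(columns) - m + 1, step):
        window = columns[start:start + m]
        # transpose: rows are ibi intervals, columns are consecutive beats
        image = [[window[c][r] for c in range(m)] for r in range(len(window[0]))]
        images.append(image)
    return images


# hypothetical input: 30 beats with a constant 0.91 s interval
demo_columns = ibi_to_one_hot_columns([0.91] * 30)
demo_images = sliding_images(demo_columns)
```

with a step of one column, a recording of k valid beats yields k − 27 overlapping images; a larger step would give fewer, less correlated images.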
for an image of size n × m pixels, the 2d discrete fourier transform (dft) is given by equation (1), in which the value of each point f(u, v) is calculated by summing the products of the spatial image with the corresponding base function:

$$F(u, v) = \sum_{x=0}^{N-1} \sum_{y=0}^{M-1} f(x, y) \, e^{-j 2\pi \left( \frac{ux}{N} + \frac{vy}{M} \right)} \quad (1)$$

where f(x, y) is the pixel value of the spatial image at position (x, y) and the exponential term is the base function corresponding to the point (u, v) in the frequency domain. in this research, the spatial images are converted to the frequency domain by applying the fast fourier transform (fft), based on algorithm 1, as shown in figure 4. classification in the fourier domain has been reported to outperform classification in the spatial domain [28, 42, 43]. moreover, image processing using frequency domain images can provide more features and reduce the computational time of the classification model. in addition, an image in the frequency domain offers another level of information that spatial domain images cannot provide. specifically, frequency domain images provide information about the rate at which the pixel values change in the spatial domain. the rate (frequency) of this change carries information that can be exploited to enhance classification models. the fft is a fast algorithm that is used to compute the dft. direct dft computation takes on the order of n^2 operations (computational complexity o(n^2)), whereas fft computation takes on the order of n log(n) operations (computational complexity o(n log(n))). table 3 shows the average pixel intensity for the frequency domain images for subjects in the two conditions: stressed and non-stressed. the mean values of the stressed ibi frequency domain images are lower than those of the non-stressed images. a cnn is an example of a deep learning neural network and can be used for computer vision tasks such as image classification: it processes the input image and outputs the class, or the probability that the image belongs to a class. a cnn has input, output, and hidden layers in which it extracts features from images while the network trains on a set of pictures.
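to make the 2d dft described above concrete, the following python sketch computes it directly from its definition, using the fact that the 2d transform can be applied row by row and then column by column. it is a naive illustration and not the paper's algorithm 1: it uses the slow o(n^2) 1d dft rather than the fft, and the log-magnitude rendering in `log_magnitude` is an assumption about how a spectrum might be turned into an image, since the source does not specify it.

```python
import cmath
import math


def dft_1d(values):
    """Naive O(n^2) discrete Fourier transform of a sequence."""
    n = len(values)
    return [
        sum(values[x] * cmath.exp(-2j * math.pi * u * x / n) for x in range(n))
        for u in range(n)
    ]


def dft_2d(image):
    """2D DFT via separability: transform every row, then every column."""
    rows = [dft_1d(row) for row in image]            # rows[x][v]
    cols = [dft_1d(list(col)) for col in zip(*rows)]  # cols[v][u]
    return [list(row) for row in zip(*cols)]          # result[u][v]


def log_magnitude(spectrum):
    """Log-scaled magnitude, one common way to render a spectrum as an image."""
    return [[math.log1p(abs(value)) for value in row] for row in spectrum]


# a constant 4 x 4 image has all of its energy in the zero-frequency term
flat_spectrum = dft_2d([[1, 1, 1, 1] for _ in range(4)])
```

in practice an fft routine from a numerical library would replace `dft_1d`; the result is the same up to rounding, only the cost changes from o(n^2) to o(n log(n)) per row or column.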
it applies several filters to the input image to build the feature maps and trains through forward and back-propagation for many epochs until it reaches a network with trained weights and features. to classify individual stress status into stressed or non-stressed, a 19-layer cnn model was built, as illustrated in figure 6. a cnn is a deep learning algorithm used for image classification and object detection. images pass through 2d convolution layers with kernels, pooling layers, and fully connected layers. the cnn extracts features from the input images while the network trains, and each layer increases the complexity of the learned features. like other artificial neural networks, a cnn or convnet has an input layer, several hidden layers (e.g., convolution layers), and an output layer. convolution is a linear operation that multiplies a set of 2d weight arrays, called the filter or kernel, with the input data array. the output of this multiplication is a 2d array called a feature map. the feature map values are passed through nonlinear functions, such as the rectified linear unit (relu). a cnn can train and learn abstract features for efficient object identification. with regularization such as dropout, it can limit overfitting; it overcomes some limitations of other machine learning algorithms and is very effective at reducing the number of parameters through dimensionality reduction without degrading the quality of models. cnns are used to solve complex problems in different domains such as image classification and object detection due to their better performance [44] [45] [46] [47] [48]. in our model, the input image with a size of 28 × 28 pixels passes through 8 convolution layers that produce 32, 64, 128, and 256 feature maps using kernels with a 3 × 3 receptive field. there are 4 max-pooling layers with size 2 × 2, one after every two convolution layers. max-pooling is used to reduce the size of the feature maps, and two dropout layers with a rate of 0.5 are used for regularization.
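the layer arithmetic implied by this architecture can be checked with a short python sketch that tracks the feature map size through the eight convolution and four max-pooling layers. the padding of one pixel ("same" padding), the stride of one, the use of the same number of filters for both convolutions in a block, and the function names are assumptions; the source states only the kernel size, the filter counts, and the pooling size.

```python
def conv_output_size(size, kernel=3, padding=1, stride=1):
    """Spatial size after a square convolution."""
    return (size + 2 * padding - kernel) // stride + 1


def pool_output_size(size, window=2):
    """Spatial size after non-overlapping max-pooling (floor division)."""
    return size // window


def trace_shapes(input_size=28, channels=(32, 64, 128, 256), padding=1):
    """Return (layer name, feature maps, height/width) after each conv and pool layer."""
    size, trace = input_size, []
    for block, n_maps in enumerate(channels, start=1):
        for i in (1, 2):  # two convolution layers per block
            size = conv_output_size(size, padding=padding)
            trace.append((f"conv{block}_{i}", n_maps, size))
        size = pool_output_size(size)
        trace.append((f"pool{block}", n_maps, size))
    return trace


shape_trace = trace_shapes()
```

under these assumptions the spatial size goes 28, 14, 7, 3, 1 across the four blocks, so the flattened output holds 256 × 1 × 1 = 256 values. without padding, the feature map would shrink to zero before the fourth block, which suggests that some form of padding is needed for eight 3 × 3 convolutions on a 28 × 28 input.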
the fully connected layers have depths of 256, 256, and 1. relu activation layers are used to increase nonlinearity in the network. the outputs of these networks were stressed and non-stressed. the following stress classification models were trained, tested, and evaluated using our cnn model architecture and both types of images (spatial and frequency domain):
1. person-specific models using spatial images: models were trained, validated, and tested on the spatial domain images of the same subject. the entire dataset was divided into 70%, 15%, and 15% for training, validation, and testing, respectively.
2. person-specific models using frequency domain images: models were trained, validated, and tested on the frequency domain images of the same subject. the entire dataset was divided into 70%, 15%, and 15% for training, validation, and testing, respectively.
3. generic models using spatial domain images: models were trained and validated on the spatial domain images of 12 subjects (n − 3), and their performance was tested on the three subjects that were left out, to evaluate the model's accuracy in classifying an unseen person's stress status.
4. generic models using frequency domain images: models were trained and validated on the frequency domain images of 12 subjects (n − 3), and their performance was tested on the frequency domain images of the three left-out subjects.
5. generic models using spatial domain images with calibration samples: 20% of the test dataset was incorporated in the training pool, and the models were tested on the remaining samples. this approach was implemented because the performance of the generic model is lower than that of the person-specific model.
for model training and accuracy measurement, the data of three subjects were used as the test dataset, and 20% of these data were combined with the other 12 subjects' data in the training dataset.
6. generic models using frequency domain images with calibration samples: 20% of the test dataset was incorporated in the training pool, and the models were tested on the remaining samples. three subjects' data were used as the test dataset, and 20% of their data were combined with the training dataset to train the model and measure its accuracy.
the above classification models were evaluated by measuring the accuracy on the training, validation, and test sets. moreover, other metrics were also measured: the sensitivity (the proportion of actual positives that the model classified as positive), the specificity (the proportion of actual negatives that the model classified as negative), and the precision (the proportion of samples classified as positive that were actually positive). the inputs were spatial images and frequency domain images, and the output was stressed or non-stressed. the performance of the classification models was measured by comparing the values of accuracy for the training, validation, and test sets, along with the test sensitivity (true positive rate), precision, and specificity (true negative rate). the equations for calculating these performance metrics are shown in equations (2)-(5), where tp, tn, fp, and fn denote the numbers of true positives, true negatives, false positives, and false negatives, with stressed as the positive class. the accuracy is the ratio of correct classifications to all classifications:

$$accuracy = \frac{TP + TN}{TP + TN + FP + FN} \quad (2)$$

sensitivity is defined as the capability of a test to correctly classify a person as stressed:

$$sensitivity = \frac{TP}{TP + FN} \quad (3)$$

the specificity is the capability of a test to correctly classify a person as non-stressed:

$$specificity = \frac{TN}{TN + FP} \quad (4)$$

precision measures how many of the samples classified as positive were actually positive:

$$precision = \frac{TP}{TP + FP} \quad (5)$$

the classification accuracy measurements for all models were satisfactory among the training, validation, and test datasets.
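the four performance metrics defined above can be computed from the entries of a binary confusion matrix as in the following python sketch. the function name and the example counts are hypothetical and are not results from the study; stressed is taken as the positive class.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity, specificity, and precision from confusion-matrix counts.

    tp/fn: stressed samples classified as stressed / non-stressed
    tn/fp: non-stressed samples classified as non-stressed / stressed
    """
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),   # true positive rate
        "specificity": tn / (tn + fp),   # true negative rate
        "precision": tp / (tp + fp),
    }


# hypothetical counts, chosen only to exercise the function
example_metrics = classification_metrics(tp=80, tn=90, fp=10, fn=20)
```

the sketch does not guard against empty classes; if a test set contained no stressed or no non-stressed samples, the corresponding denominator would be zero and the metric would be undefined.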
the person-specific models achieved high performance compared to the generic models. the average classification accuracy of the person-specific models using spatial images for the training, validation, and test datasets was 99.9%, 100%, and 99.8%, respectively. for the person-specific models using frequency domain images, the accuracy was 99.68%, 98.97%, and 96.4%. the performance of the generic models varied between the different subjects, and their accuracy was lower than that of the person-specific models. the average accuracy for the generic classification models using spatial images was 98.6% (train), 96.8% (valid), and 61% (test), and 98.9% (train), 97.6% (valid), and 62.6% (test) when using frequency domain images. moreover, for the person-specific models, the accuracy with frequency domain images was slightly lower than with spatial images, as shown in tables 4 and 5. the generic models cannot perfectly recognize the inter-subject differences in response to stress events. thus, adding some samples from the test data to the training data significantly increased the accuracy of the generic models, as shown in tables 6 and 7 when using spatial images and tables 8 and 9 when using frequency domain images. by adding these samples, the test accuracy of the generic models increased from 61% to 88.1% and from 62.6% to 87.6% when using the spatial and frequency domain images, respectively. a confusion matrix summarizes the performance of a classification model on test data for which the true values are known. the generic model had 179 non-stressed spatial images incorrectly classified as stressed, while it had 619 stressed images incorrectly classified as non-stressed, as shown in figure 7 (left).
however, the majority of the spatial images were classified correctly. by adding 20% of the test data to the training pool, the performance of the model increased: it had 20 non-stressed spatial images incorrectly classified as stressed and 37 stressed images incorrectly classified as non-stressed, as shown in figure 8 (left). from the confusion matrices in figures 7 and 8, where the data of subjects 8, 9, and 10 were in the test dataset and 20% of their samples were injected into the training dataset as calibration samples, the sensitivity increased to 96% and 80% and the specificity increased to 98% and 92% for spatial and frequency domain images, respectively. adding a few calibration samples allowed the model to learn more information about the unseen person and highlighted the effect of person-specific signals in classifying their stress state as either stressed or non-stressed. another finding is that the time of cnn training and validation using fourier domain images was lower than that of training and validation on spatial images (e.g., for the person-specific model of subject number 10, the cnn spent 143 s to train and validate 1019 frequency domain images in around 125 epochs, while using the same number of spatial images took around 214 s). moreover, to achieve higher accuracy when using spatial and frequency domain images, there is a need to use more epochs to train the generic models. in this study, 150 epochs were used for all generic models using both spatial and frequency domain images.
table 4. the accuracy measures for the person-specific models using spatial images.
subject   train   valid   test   three further test metrics (sensitivity, specificity, precision)
2         99.8    100     99     98     100    100
3         100     100     99     100    98     98
4         100     100     100    100    100    100
5         100     100     100    100    100    100
6         100     100     100    100    100    100
7         100     100     100    100    100    100
8         100     100     100    100    100    100
9         100     100     100    100    100    100
10        100     100     100    100    100    100
11        100     100     100    100    100    100
13        100     100     100    100    100    100
14        100     100     100    100    100    100
15        100     100     100    100    100    100
16        100     100     100    100    100    100
17        100     100     100    100    100    100
average   99.9    100     99.8   99.8   99.8   99.8

table 5. the accuracy measures for the person-specific models using frequency domain images.
2         100     100     97     100    96     96
3         100     100     99     100    98     98
4         100     100     100    100    100    100
5         95.7    84.6    74     100    65     65
6         100     100     93     85     100    100
7         100     100     99     100    99     99
8         100     100     100    100    100    100
9         100     100     99     100    99     99
10        100

table 6. the accuracy measures for the generic models using spatial domain images.
table 9. the accuracy measures for the generic models with 20% calibration samples using frequency domain images.

table 10 compares the results of this approach with other approaches conducted in the domain of stress detection. one of the main differences between this study and the other studies is the type of images that were utilized for training and validating the models. moreover, the accuracy of the calibrated model outperformed that of the generic model. compared to other approaches, the proposed method achieved high accuracy in person-specific models and comparable scores with the other generic models, taking into account the different types of data used in our study (ibi extracted from the ppg signal, and spatial and frequency domain images from the ibi). the results show the potential of using frequency domain images in stress detection.
[35]   ecg-ibi spatial   cnn   generic   92.8
[49]   face              cnn   generic   85.23
[22]   respiration       cnn   generic   84.59

in this study, a new approach was proposed to classify a person's stress state using a convolutional neural network with spatial and frequency domain images of inter-beat intervals extracted from the ppg signal.
the entire time interval of the extracted ibi data from ppg signals was divided into intervals according to the ibi distributions, and then the output matrix was converted to spatial images. these images were transformed into the frequency domain by using the fourier transform. frequency domain features are important for image classification as well as spatial features, especially when the spatial resolution increases. several types of binary classification models were developed: generic models, person-specific models, and calibrated generic models. the proposed models utilized the ibi files generated by empatica e4 devices found in the wesad dataset. the proposed models achieved satisfactory average accuracy. the person-specific models were able to classify stress status with high accuracy. although these models cannot be generalized, it is necessary and effective to personalize the model, as stress is subjective and each person has unique responses and a unique degree of vulnerability to stress. these models can be used in a health monitoring system to monitor the stress status of the patient and can be enriched by collecting new data and training the models again. an image can be represented as a 2d matrix where each element shows pixel intensity. this spatial image can be transformed into the frequency domain by using a fourier transform. image processing using frequency domain images can perform better than with spatial domain images, provide more features, and reduce the computation time. a cnn is an example of a deep learning neural network and can be used for computer vision tasks such as image classification: it processes the input image and outputs the class, or the probability that the image belongs to a class.
a cnn has input, output, and hidden layers in which it extracts features from images while the network trains on a set of pictures. it applies several filters to the input image to build the feature maps and trains through forward and back-propagation for many epochs until it reaches a network with trained weights and features. in this study, a novel approach to classify the stress state of a person by using both spatial and frequency domain ibi images and convolutional neural networks is proposed. the proposed models were tested using the ibi files generated by empatica e4 devices found in the wesad dataset. several classification models were built: person-specific, generic, and calibrated generic models. generic models performed more poorly than the person-specific models when trying to classify the stress state of unseen people, as shown in the test accuracy measures in tables 6 and 8. these generic models cannot generalize well, as stress is subjective, and some people are more reactive to stress and have different types of physical and physiological responses. a personalized model was derived by combining a few person-specific samples with the training data to improve the performance of these generic models. in this study, 20% of the subjects' data in the test dataset were combined with the training data, which showed a substantial improvement in the performance of the stress classification models, as shown by the accuracy measurements in tables 7 and 9. the calibration samples introduced the test subjects' identities and characteristics to the models. to check that our calibrated models were not suffering from overfitting, we validated these models by using 5-fold cross-validation, which gives a less biased estimate of model performance and tests how the different parts of the training set performed in the model.
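the calibration scheme described above, in which 20% of each test subject's samples are moved into the training pool and the model is tested on the remainder, can be sketched in python as follows. this is an illustrative sketch, not the authors' code: the function name, the dictionary layout, and the random selection of the calibration samples are assumptions, since the source does not say how the 20% were chosen.

```python
import random


def calibrated_split(samples_by_subject, test_subjects, calibration_fraction=0.2, seed=0):
    """Split samples into a training pool and a test set with calibration samples.

    samples_by_subject: dict mapping subject id -> list of samples
    test_subjects: ids of the held-out subjects
    """
    rng = random.Random(seed)
    train, test = [], []
    for subject, samples in samples_by_subject.items():
        if subject not in test_subjects:
            train.extend(samples)  # all data of the other subjects is used for training
            continue
        shuffled = list(samples)
        rng.shuffle(shuffled)
        n_calibration = int(len(shuffled) * calibration_fraction)
        train.extend(shuffled[:n_calibration])  # calibration samples join the training pool
        test.extend(shuffled[n_calibration:])   # the remaining samples stay held out
    return train, test


# hypothetical toy data: 5 subjects with 10 samples each, subjects 4 and 5 held out
toy_data = {s: [(s, i) for i in range(10)] for s in range(1, 6)}
toy_train, toy_test = calibrated_split(toy_data, test_subjects={4, 5})
```

the calibration samples are removed from the test set before evaluation, so no sample is used for both training and testing; only the identity of the subject is shared between the two sets.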
moreover, the results show that the average accuracy for the generic classification models using frequency domain images was slightly higher than that of the other models that used spatial images for training, validation, and testing. in addition, classifying the stress status using frequency domain images performed well, provided more features about the entire images, and reduced the computation time. the proposed models in this study were effective at classifying stress state and are applicable in a stress monitoring health system. our approach can be applied to monitor a person's psychological wellbeing and classify their state of daily life stress using inter-beat interval (ibi) data. in addition, it can trigger alerts that can be used to guide clinical interventions to prevent and treat symptoms of acute stress disorders when the occurrence of acute stress states detected in a specific patient becomes too frequent. moreover, stress detection models can be used in the military or police to detect when soldiers and police officers experience abnormally high levels of stress and to improve their performance in stressful environments. they can also be used in educational systems to identify which subjects may present issues for particular students. this will enable teachers to intervene, present the material in an alternative manner, and minimize stressful events in the classroom as much as they can to reduce stress or anxiety. one limitation of this study is that the proposed models classify the state of an individual into only two categories: stressed or non-stressed. the model aims at instantaneous detection of stress via classification of physiological data. the model does not consider the prediction of stress, as this is out of the scope of this work. for future work, new ppg data will be collected either from lab settings or real life using wrist-worn devices. 
these data will be used to train and test the proposed models to measure the accuracy and compare the results. the authors declare no conflict of interest. this article is an open access
article distributed under the terms and conditions of the creative commons attribution (cc by) license. key: cord-190424-466a35jf authors: lee, sang won; chiu, yueh-ting; brudnicki, philip; bischoff, audrey m.; jelinek, angus; wang, jenny zijun; bogdanowicz, danielle r.; laine, andrew f.; guo, jia; lu, helen h. title: darwin's neural network: ai-based strategies for rapid and scalable cell and coronavirus screening date: 2020-07-22 journal: nan doi: nan sha: doc_id: 190424 cord_uid: 466a35jf recent advances in the interdisciplinary scientific field of machine perception, computer vision, and biomedical engineering underpin a collection of machine learning algorithms with a remarkable ability to decipher the contents of microscope and nanoscope images. machine learning algorithms are transforming the interpretation and analysis of microscope and nanoscope imaging data through use in conjunction with biological imaging modalities. these advances are enabling researchers to carry out real-time experiments that were previously thought to be computationally impossible. here we adapt the theory of survival of the fittest in the field of computer vision and machine perception to introduce a new framework of multi-class instance segmentation deep learning, darwin's neural network (dnn), to carry out morphometric analysis and classification of covid19 and mers-cov collected in vivo and of multiple mammalian cell types in vitro. coronavirus disease-19 is an emerging acute respiratory infectious disease that has demonstrated highly pathogenic capabilities, spreading through populations globally primarily through droplet transmission. although the virus is expected to be of zoonotic origin in the seafood markets of wuhan, china, global human-to-human transmission has prompted the emergence of over 15 million covid-19 cases worldwide and over five hundred thousand deaths. 
covid-19 is the seventh member of the family of coronaviruses to widely cause infection in humans. [1] the clinical spectrum of coronavirus ranges from asymptomatic forms to conditions characterized by respiratory failure to septic shock. [2] the first known widespread infection caused by a coronavirus began in 2002, with the emergence of severe acute respiratory syndrome coronavirus (sars-cov) in china, that resulted in 8,098 infections and 774 deaths among 29 countries. [3] a second outbreak followed in 2012, beginning in saudi arabia, with the spread of the middle east respiratory syndrome coronavirus (mers-cov) that infected 2,458 people and resulted in 848 deaths in 27 countries. [4] covid-19 is the third coronavirus outbreak of the 21st century, and it is already more deadly than the previous outbreaks. however, the ability to combat this emerging infectious disease has been limited by the slow turnaround time in the development of new therapeutics, the inability to quickly diagnose patients, and the limited knowledge of the virus' pathogenesis. the pathophysiology and virulence mechanisms of coronaviruses have been shown to be mediated through the virion morphological structure and surface structural proteins. [5] coronaviruses have distinct morphological features that make them easily distinguishable under a high-powered microscope. typical coronavirus virions are spherical, 125 nm in diameter, and have club-shaped projections emerging from their surfaces. [6] mers-cov is comprised of four major surface proteins that aid in viral infiltration of cells: envelope protein (e), spike glycoprotein (s), nucleocapsid protein (n), and membrane protein (m). [7] each protein has an integral function in virus transmission within a host. for instance, the s protein that comprises the spikes on the surface mediates virus entrance via binding and fusion to host cells, as it contains a receptor domain that binds to the host cell receptors. 
[7] currently, the lack of extensive knowledge regarding the structure and morphology of the covid-19 virus limits the understanding of its pathogenesis and virulence. however, as the spikes are a unique characteristic of coronaviruses, including covid-19, common approaches in therapeutic development to neutralize their viral infections involve inhibition of surface protein capabilities, such as blocking the interaction between the s protein and its host receptor. [8] given the previous success with using animal models to study in vivo the ability of antibodies and other therapeutics to limit viral replication as well as the pathology of mers-cov, [9, 10] similar therapeutic approaches can be developed to analyze morphological effects on covid-19 in vivo in response to similar therapies. this paper intends to implement novel machine learning methods to analyze the in vivo morphology of covid-19, comparing it to the better-known mers-cov. evaluating the emerging viral strain, collected in vivo, under the microscope and comparing it to the existing mers-cov morphology will potentially allow for a better understanding of the virus' pathogenesis. we have correctly classified different types of coronaviruses using a deep learning multi-class instance segmentation network and analyzed their morphological properties. advances in the field of deep learning are allowing research previously thought impossible to be conducted. every year, new convolutional neural networks (cnns) yield higher accuracies for their tasks with higher gpu efficiency. typical tasks for cnns include object tracking, image classification, and semantic segmentation. 
object tracking follows an entity, such as tracing the migration of a cell; image classification is used to predict a label for an object, such as determining whether a cell is a type i or type ii macrophage; and semantic segmentation identifies parts of an image that correspond to distinct objects, such as identifying pixel locations of a nucleus in a mammalian cell. however, the optimal neural network for a certain task changes every year due to the invention of newer and more powerful cnns. in classification neural networks, for example, alexnet [11] is rarely used except for educational purposes after the publication of superior neural networks such as googlenet [12], followed by vgg-16 [13], inception-resnet-v2 [14], and nasnet-large. [15] as for object detection networks, r-cnn [16] was superseded by fast r-cnn [17], which was surpassed by faster r-cnn. [18] the yolo [19] network was surpassed by yolo2. [20] finally, for semantic segmentation neural networks, the original u-net [21] evolved to yield higher accuracy by adapting newer neural networks like vgg-16 into its architecture. thus, classification, object detection, and segmentation networks are continually adapting and succeeding their predecessors for different tasks. here we propose a new framework for multi-class instance segmentation that utilizes three independent cnns: an object detection network, a classification network, and a semantic segmentation network. as state-of-the-art networks emerge in each field, pre-existing cnns and the new cnn compete to yield the highest accuracies for the task, and only the cnns with superior accuracies survive to form a framework. one can also choose to cull the neural networks and replace them to yield a higher multi-class instance segmentation accuracy. using the combined cnns, the framework can automatically comb through the existing cnns and select the combination with superior reliability and accuracy. 
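the competition idea described above can be sketched as a per-stage selection of the highest-scoring candidate. this is only a minimal illustration, not the authors' implementation: the function `select_fittest` and the dictionary layout are hypothetical, and the example scores are the u-net jaccard and global accuracy values reported later in this paper.

```python
def select_fittest(scores):
    """For each stage, return the candidate network with the highest score.

    `scores` maps a stage name to a {network name: validation score} dict.
    Hypothetical helper; ties are resolved by insertion order, which is an
    assumption (the paper does not state a tie-breaking rule).
    """
    return {stage: max(cands, key=cands.get) for stage, cands in scores.items()}


# Segmentation-stage scores for the u-net backbones, as reported in the results.
segmentation_scores = {
    "cells": {
        "resnet18": 0.7942,
        "vgg16": 0.7984,
        "resnet50": 0.8324,
        "inception-resnet-v2": 0.8346,
    },
    "viruses": {
        "resnet18": 0.7681,
        "vgg16": 0.8083,
        "resnet50": 0.8245,
        "inception-resnet-v2": 0.787,
    },
}

print(select_fittest(segmentation_scores))
```

adding a newly published network only requires adding one more entry to the relevant dictionary and re-running the selection, which mirrors the culling and replacement described in the text.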
we call this conglomerate darwin's neural network, as the "fittest" cnns, i.e., those yielding the most accurate results, survive while the others are replaced over time. a graphic illustration of the dnn framework is shown in figure 1. this network can be implemented to compare morphometric parameters of mers-cov and covid19 virus particles using transmission electron micrographs (tem). specifically, we wanted to use these micrographs to investigate structural and morphological changes in these virus particles in vivo. advances in microscopy have enabled researchers to access spatial and temporal variations inherent in biological systems. progress in the field of optics has resulted in microscopes capable of imaging over a range of spatial scales, from single cells to entire organisms. in concurrence with these technological advances, there has been an overwhelming increase in demand in the biosciences for automatic and precise image analysis. here we also implement dnn to establish an automated method for cell morphometric analysis and cell-type classification utilizing only brightfield images taken on a benchtop microscope directly from cell-culture wells. despite the low resolution of the images obtained and significant well-to-well heterogeneity, our team was able to demonstrate precise morphometric analysis and high classification accuracy on novel data. there exist three major components in dnn: an object detection network, a classification network, and a semantic segmentation network. a graphic illustration of the dnn workflow is shown in figure 2. first, multiple object detection networks compete, and the winner finds only the locations of morphologically similar objects of interest and crops them out automatically to feed them to more gpu-intensive and accurate classification networks. at this stage, the classification task is not carried out between the morphologically similar objects; the heavy lifting is left for a more apt classification network. 
the objects' location information is saved for the reconstruction of images at a later stage of the dnn framework. then, multiple classification networks compete, and the winner takes in the cut-out images as inputs and carries out the classification task on morphologically very similar objects; for example, covid19 virus particles versus mers virus particles, or macrophage type i versus macrophage type ii. then, the cropped images and their classes are passed on to the segmentation network for semantic segmentation according to their class, for instance segmentation and accurate morphological analysis. these steps result in instance segmentation instead of semantic segmentation. this instance segmentation network can be used for tasks such as single-cell morphological analysis in clusters or colonies of cells and proves to be more accurate than any post-semantic-segmentation algorithm for singularizing objects in binary clusters. each segmentation result can be colored differently and labeled according to its class. the results are superimposed on top of the original input image for object detection to achieve a multi-class instance segmentation network framework. human thp-1 cells (atcc, tib-202) were commercially obtained from atcc. cell cultures were maintained in suspension in non-tissue culture treated flasks (nunc) with a surface area of 25 cm². the culture media was refreshed every 2 days and was comprised of roswell park memorial institute (rpmi) 1640 medium supplemented with 10% fetal bovine serum (atlanta biologics), 1% penicillin-streptomycin (sigma-aldrich) and 0.05 mM 2-mercaptoethanol (sigma-aldrich). cells were initially suspended at 200,000 cells/ml of media and passaged upon reaching a density of 1.0 million cells/ml. for this study, passage two cells were collected and resuspended at 100,000 cells/ml in fresh fully supplemented rpmi (f/s rpmi) media containing 100 nM phorbol 12-myristate 13-acetate (pma) to induce differentiation. 
one ml of this cell suspension was then added to each well of a 12-well plate (bd falcon) and cultured at 37 °c. after 72 hours, pma containing medium (pma + medium) was removed, adherent cells were rinsed with pbs, and the medium was replaced with f/s rpmi-1640 medium. cells were allowed to rest (for m0) or were subjected to polarization media for 72 hours before assessment. m1 polarization medium contained f/s rpmi + 20 ng/ml interferon-γ (ifn-γ, humanzyme) and 1 µg/ml lipopolysaccharide (lps, sigma-aldrich). m2 polarization medium contained f/s rpmi + 20 ng/ml interleukin-4 (il-4, humanzyme). after 72 hours of rest or polarization, cells were washed with pbs 1x and cultured in f/s rpmi media. all groups were prepared with a batch size n=10. after polarization, cells were imaged (n=10) using brightfield microscopy with a phase-contrast filter. images for machine learning analysis were captured at 32x, and each frame contained approximately 20 cells. raw 264.7 cells (atcc, tib-71) were commercially obtained from atcc. cell cultures were maintained in 100 mm non-tissue culture treated plates (fisher). the culture media was refreshed every two days and was comprised of dulbecco's modified eagle's medium (dmem) supplemented with 10% fetal bovine serum (atlanta biologics), and 1% penicillin-streptomycin (sigma-aldrich). cells were initially suspended at 200,000 cells per dish and passaged upon reaching 75% confluency. for this study, passage two cells were collected and seeded into 48 well plates at a density of 25,000 cells/cm². after 18 hours to allow cell attachment, cells were either allowed to rest or treated with polarization media for 48 hours. for the m0 phenotype, the cells were allowed to culture in f/s dmem media (as described above). for m1 polarization, f/s dmem containing 10 ng/ml lps (sigma-aldrich) and 20 ng/ml ifn-γ (humanzyme) was used. for m2 polarization, f/s dmem containing 20 ng/ml il-4 (humanzyme) was used. 
after 48 hours, cells were rinsed with pbs and cultured in f/s dmem just prior to imaging. cells were imaged (n=10) using brightfield microscopy with a phase-contrast filter. images for machine learning analysis were captured at 32x, and each frame contained approximately 20 cells. following published protocols [22], bovine dermal fibroblasts (df) were harvested from bovine skins. briefly, neonatal (1-7 days old) bovine skins were obtained from a local abattoir (n=2, tissues pooled; green village packing company). before harvest, skins were sterilized by soaking in soapy water for 40 min, followed by 70% ethanol for 20 min, after which the surrounding fur was removed with a sharp scalpel. skin fragments of approximately 1 cm² were collected aseptically in a sterile environment. the dermis was separated from the epidermis by gently scraping the dermis with a scalpel. the dermis was digested for 30 min at 37 °c with collagenase ii (1.2% w/w; worthington biochemical) in dulbecco's modified eagle's medium (dmem, cellgro-mediatech) supplemented with 10% fetal bovine serum (fbs), 2% penicillin/streptomycin (p/s), 0.2% amphotericin b (amp-b), and 0.2% gentamicin (g/s). the mixture was then filtered (30 μm, spectrum labs), and the isolated cells were collected by centrifugation and plated. after 14 days from the beginning of cell isolation, the cells were re-plated at a density of 5 × 10^5 cells/cm² on tissue culture plates. images for machine learning analysis were captured at 32x, and each frame contained approximately 20 cells. bone marrow derived macrophages (bmdm) were harvested from the murine tibia. 4-8 weeks old macrophage fas-induced apoptosis mice were purchased (n=2; the jackson laboratory). prior to harvest, mice were sterilized by soaking in soapy water for 40 min, followed by 70% ethanol for 20 min, after which the tibiofemoral joints were removed. the surrounding subcutaneous fascia and muscle were removed aseptically in a sterile environment. 
the tibial tuberosity and medial malleolus were removed from the tibia. bone marrow cells were flushed out by forcing roswell park memorial institute (rpmi; thermo fisher scientific) medium containing 5% fbs through the central bone marrow canal using a 10 ml syringe. the collected bone marrow tissue was then filtered (70 μm, spectrum labs), and the isolated cells were collected by centrifugation. the cells were mixed with 1 ml of ammonium chloride (ack) lysis solution and were promptly washed with 1 ml of rpmi media containing 5% fbs. the isolated cells were then collected by centrifugation and plated at a density of 1 × 10^7 cells/cm² on tissue culture plates (non-treated 100 petri dish). images for machine learning analysis were captured at 32x, and each frame contained approximately 20 cells. transmission electron micrographs (tem) of covid19 and mers-cov virus particles isolated from patients were obtained from the open database published by the national institute of allergy and infectious diseases' (niaid) rocky mountain laboratories (rml). three different types of cnns were considered for the dnn deep learning algorithm: object detection networks, classification networks, and semantic segmentation networks. for cnn i, yolo v2 and faster-r-cnn were used with resnet50 and inception-resnet-v2 backbones. these four cnns were transfer-learned and tasked to isolate individual cells in brightfield microscope images, as seen in figure 3(a). the networks were trained to isolate only the cells that showed the complete morphology. the networks were trained not to pick up overlapped cells, since missing part of the data can skew later morphometric analysis. another set of cnns was transfer-learned and tasked to isolate covid19 and mers-cov virus particles in transmission electron micrographs. again, the networks were trained to isolate only the viruses that showed the complete morphology. 
the output coordinates were modified to superimpose boxes onto the original image and crop each object detection result, as shown in figure 3(b). input images were rotated with mirrored corners to increase the size of the training set. a total of 217 tem micrographs of covid19 and mers-cov virus particles were used. 130 tem images were used for the training set, 65 images were used for the validation set, and 22 images were used for the test set (6:3:1). for the cells, 540 brightfield images were used for the training set, 270 images were used as a validation set, and 90 images were used for the test set. training was carried out until absolute minima were observed for the loss function. other parameters, such as kernel, stride, and max pooling sizes, were left unadjusted to retain the advantages of the original cnns and maximize the benefit of transfer learning. the network that produced the highest precision over recalls was chosen to be integrated into the dnn. the chosen cnn was used to crop individual cells and virus particles from the brightfield and tem images. the resulting cropped images were used to train convolutional neural network ii (cnn ii). the resulting images were further processed to greyscale images, and the histograms of the images were equalized to reduce bias. for cnn ii, densenet 201, inception-resnet-v2, inception v3, mobilenet v2, resnet18, resnet101, squeezenet, vgg19, xception, alexnet, and googlenet were used to compete with each other. the individually cropped cells and viruses from cnn i were manually divided into respective classes to create the training sets, as seen in figure 3(c). again, the training images were rotated and mirrored to increase the training set. a total of 1680 brightfield images of cells was used for cnn ii: 1008 training images, 504 validation images, and 168 test set images (6:3:1). a total of 360 images of virus particles was used: 216 training images, 108 validation images, and 36 test set images. the cnn that yielded the highest accuracy was integrated into the dnn to carry out the task. 
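the 6:3:1 partitioning quoted above can be checked with a short sketch. the helper `split_sizes` is hypothetical, and rounding down the training and validation sizes while giving the remainder to the test set is an assumption; it happens to reproduce every split reported in the text.

```python
def split_sizes(n, ratios=(6, 3, 1)):
    """Return (train, validation, test) sizes for n images split by `ratios`.

    Training and validation sizes are rounded down; the test set takes the
    remainder. This rounding rule is an assumption, not stated in the paper.
    """
    total = sum(ratios)
    train = n * ratios[0] // total
    val = n * ratios[1] // total
    test = n - train - val
    return train, val, test


for n in (217, 900, 1680, 360):
    print(n, split_sizes(n))
```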
to visualize the progress and focus of the cnn, activation maps were derived from the last rectified linear unit. activation maps were created for iteration 1, iteration 5, and iteration 700 for viruses. visualization maps of the completed cnn were created for cells and their corresponding classes. u-nets with resnet18, resnet50, vgg16, and inception-resnet-v2 backbones competed with each other for placement in cnn iii. corresponding masks were manually created for cells and viruses according to their class, as seen in figure 3(d). a total of 360 tem images and their corresponding 360 masks of virus particles were used for cnn iii, utilizing 216 images for the training set, 108 images for the validation set, and 36 images for the test set (6:3:1). a total of 1680 brightfield images and their corresponding 1680 masks of cells were used for cnn iii: 1008 training images, 504 validation images, and 168 test set images (6:3:1). the network with the highest global accuracy, as determined by the ratio of correctly classified pixels to the total number of pixels, regardless of class, was integrated into dnn. the resulting binary output images were passed down for further morphometric analysis. facebook's mask-r-cnn [23] with microsoft's resnet101 [24] backbone was used to compare instance segmentation results. the jaccard similarity coefficient was used to evaluate both mask-r-cnn and dnn. morphometric data of cells and viruses are derived from the binary inputs, which are the outputs of cnn iii. the following morphometric parameters are calculated: area, eccentricity, circularity, and solidity. binary image outputs of cnn iii were further processed using the regionprops function in matlab ® [25] to calculate morphometric parameters of the virus. the following formulas were used for morphometric analysis: circularity: (4π × area) / (convex perimeter)². solidity: the proportion of the pixels in the convex hull that are also in the object. 
[26] eccentricity: the eccentricity is the ratio of the distance between the foci of the ellipse and its major axis length. the value is between 0 and 1. [26] results are presented as the mean ± standard deviation. the tukey-kramer post hoc test was used for all morphometric pairwise comparisons, and statistical significance was attained at p < 0.05. the results for cnn i are shown in figure 4(a, b). precision was one over all recalls for yolov2 with resnet50 [24] and inception-resnet-v2 for both cells and viruses. faster-r-cnn also yielded identical precision over all recalls for resnet50 and inception-resnet-v2 backbones for both cells and viruses. since all architectures yielded perfect precision, faster-r-cnn with resnet50 backbone was chosen to be integrated into the place of cnn i in dnn for cells. for the viruses, yolo v2 with resnet50 backbone was chosen to be in place of cnn i in the dnn framework. the results for cnn ii are shown in figure 4(c, d). for cnn ii cell classification, alexnet yielded the lowest test set accuracy of 0.96. googlenet yielded the second-lowest accuracy of 0.985. the rest of the networks (densenet 201, inception-resnet-v2, inception v3, mobilenet v2, resnet18, resnet101, squeezenet, vgg19, and xception) yielded an accuracy of 1 for all test sets. for virus classification, all networks yielded an accuracy of 1 for the test sets. for visualization of the progression and focus of the cnn, activation maps were created for iteration 1, iteration 5, and iteration 700 for viruses, as seen in figure 5(c). visualization maps of the completed cnn were created for cells and their corresponding classes, as seen in figure 5(a, b). from the neural architectures that yielded an accuracy of 1, squeezenet was chosen to be integrated into the place of cnn ii in the dnn framework for viruses. for cells, inception-resnet-v2 was chosen to be integrated into the dnn framework. the results for cnn iii are shown in figure 4(e, f). 
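the morphometric definitions given in the methods (circularity, solidity, eccentricity) can be illustrated with a short sketch. the paper computes these values with the matlab regionprops function; the python functions below are hypothetical stand-ins that take already-measured quantities as inputs, and the 4π factor in the circularity follows the standard definition under which a perfect circle scores 1, which is an assumption about the formula intended in the text.

```python
import math


def circularity(area, perimeter):
    """4*pi*area / perimeter**2; equals 1 for a perfect circle."""
    return 4.0 * math.pi * area / perimeter ** 2


def solidity(area, convex_area):
    """Proportion of the convex hull pixels that are also in the object."""
    return area / convex_area


def eccentricity(major_axis_length, minor_axis_length):
    """Distance between the ellipse foci divided by the major axis length.

    For semi-axes a >= b the foci are 2*sqrt(a**2 - b**2) apart, so the
    ratio lies between 0 (circle) and 1 (line segment).
    """
    a = major_axis_length / 2.0
    b = minor_axis_length / 2.0
    return math.sqrt(a * a - b * b) / a


# A circle of radius 3: area = pi*r^2, perimeter = 2*pi*r.
r = 3.0
print(circularity(math.pi * r * r, 2.0 * math.pi * r))
print(eccentricity(10.0, 6.0))
print(solidity(50.0, 100.0))
```

in a real pipeline the area, perimeter, convex area, and axis lengths would be measured from each binary mask rather than supplied by hand.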
for cnn iii cell semantic segmentation, u-net with resnet18 backbone yielded the lowest jaccard similarity coefficient of 0.7942. u-net with vgg16 yielded the second-lowest jaccard similarity coefficient of 0.7984, and u-net with resnet50 yielded global accuracy of 0.8324. u-net with inception resnet v2 backbone yielded the highest global accuracy of 0.8346, as seen in figure 4(e); therefore, inception-resnet-v2 was integrated in the place of cnn iii for dnn for cells. for virus semantic segmentation, u-net with resnet18 backbone yielded the lowest jaccard similarity coefficient of 0.7681, as seen in figure 4(f). u-net with vgg16 yielded a jaccard similarity coefficient of 0.8083, and u-net with resnet50 yielded the highest jaccard similarity coefficient of 0.8245. u-net with inception resnet v2 backbone yielded a jaccard similarity coefficient of 0.787; therefore, resnet50 was chosen to be integrated into dnn for viruses. as a result of the competition between the networks, the dnn framework for cells consisted of faster-r-cnn with resnet50 backbone for cnn i, inception-resnet-v2 for cnn ii, and u-net with inception-resnet-v2 backbone for cnn iii. the dnn framework for viruses consisted of yolo v2 with resnet50 backbone for cnn i, squeezenet for cnn ii, and u-net with resnet50 backbone for cnn iii. for overall instance segmentation results, dnn produced both superior global accuracy and jaccard similarity coefficient for cells and viruses. for mask-r-cnn, the global accuracies were 0.9059 and 0.8871 for cells and viruses, respectively, as seen in figure 4(g). for dnn, the global accuracies were 0.9301 and 0.8964 for cells and viruses, respectively. for mask-r-cnn, the jaccard similarity coefficients were 0.5537 and 0.5038 for cells and viruses, respectively, as seen in figure 4(g). for dnn, the jaccard similarity coefficients were 0.8346 and 0.8083 for cells and viruses, respectively. all results of the cellular and viral morphometric analyses are shown in figure 6. 
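the two evaluation metrics used above, global accuracy and the jaccard similarity coefficient, can be sketched for binary masks as follows. this is an illustrative assumption-laden example: the masks are plain nested lists, the function names are hypothetical, and the toy masks are made up to show why global accuracy can stay high while the jaccard coefficient drops.

```python
def jaccard(pred, truth):
    """Intersection over union of the foreground pixels of two binary masks."""
    inter = union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Two empty masks agree perfectly; returning 1.0 here is a convention.
    return inter / union if union else 1.0


def global_accuracy(pred, truth):
    """Ratio of correctly classified pixels to all pixels, regardless of class."""
    correct = total = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            correct += 1 if bool(p) == bool(t) else 0
            total += 1
    return correct / total


# Toy masks: background dominates, so one wrong foreground pixel barely
# affects global accuracy but halves the jaccard coefficient.
truth = [[1, 0, 0], [0, 0, 0]]
pred = [[1, 1, 0], [0, 0, 0]]
print(global_accuracy(pred, truth), jaccard(pred, truth))
```

because background pixels count toward global accuracy but not toward the jaccard coefficient, the latter is the stricter measure of how well object outlines are recovered.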
all cells were plotted in a 3d graph according to their circularity, eccentricity, and solidity in figure 3(e). viruses were plotted in a 3d graph according to their circularity, eccentricity, and solidity in figure 3(e). for cells, ground truths for area, circularity, eccentricity, and solidity were calculated by hand and compared with dnn output in figure 6(a1-a4). for viruses, ground truths for area, circularity, eccentricity, and solidity were also calculated by hand and compared with dnn output in figure 6(b1-b4). statistical significances between the virus types, in terms of area and morphology, in the ground truth data are shown in figure 6(c1) (n=33, ^p<0.05 between groups). statistical significances between the virus types, in terms of area and morphology, in the dnn output data are shown in figure 6(c2) (n=33, ^p<0.05 between groups). statistical significances between thp1 m1 and thp1 m2, in terms of area and morphology, in the ground truth data are shown in figure 6(d1) (n=51 and n=60, respectively; ^p<0.05 between groups). statistical significances between thp1 m1 and thp1 m2, in terms of area and morphology, in the dnn output data are shown in figure 6(d2) (n=51 and n=60, respectively; ^p<0.05 between groups). statistical significances between the cell types, in terms of area and morphology, in the dnn output data are shown in table 1. here, we have demonstrated the ability to use multi-class instance segmentation to correctly analyze the morphological differences between multiple types of mammalian cells, as well as covid-19 and mers-cov. by comparing precisions over recalls, classification accuracies, and jaccard similarity coefficients, the dnn framework was able to produce a higher jaccard similarity coefficient than that of a mask-r-cnn framework with resnet101 backbone. this was achieved through dnn's decision-making algorithm, which tests out different networks and finds the best fit cnns for one's specific task. 
we have found that cnns with higher reported benchmarking accuracies [27] may not produce higher accuracies for certain biomedical engineering tasks. for example, both u-net with a vgg16 backbone and u-net with resnet50 backbone yielded higher jaccard similarity coefficients and global accuracies than u-net with an inception-resnet-v2 backbone, despite inception-resnet-v2 having a higher reported benchmarking accuracy than both resnet50 and vgg16. squeezenet, which has a lower benchmarking accuracy than googlenet, was found more apt for classifying mammalian cells and was thus chosen to be in place of cnn i in dnn. as observed in figure 6 (c1-c2) , the dnn analysis showed statistical significance in area and circularity of the covid19 in comparison to the mers virus particles, which aligned with findings in the ground truth data of the viruses. in figure 6(d1-d2) , the dnn analysis also showed statistical significance in area and solidity of the thp1 m1 cells in comparison to the thp1 m2 cells; however, circularity was not statistically significant between the cells according to the dnn analysis. in terms of instance segmentation abilities, dnn's object detection network ability to cut out overlapping cells appeared to help the semantic segmentation network do a superior job of cell and virus edge detection. this resulted in dnn's higher jaccard similarity coefficient compared to mask-r-cnn's. as better cnns are invented every day, the dnn can evolve to yield better accuracy over time by adding new state-of-the-art networks to the arena and culling older cnns from cnn i, cnn ii, or cnn iii. other cnns can take the place of cnn i, cnn ii, and cnn iii to compete and ultimately yield a better final dnn for a given biomedical engineering task. we have decided to identify the morphometric parameters that are considered important in viral pathogenesis: area, eccentricity, circularity, and solidity. 
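for readers unfamiliar with these morphometric parameters, the sketch below computes area, circularity, and solidity for an object outline given as a polygon. the formulas are commonly used definitions and are an assumption here, since the source does not state which ones it uses: circularity is taken as 4*pi*area / perimeter^2 and solidity as area divided by the area of the convex hull. eccentricity is omitted because it requires fitting an ellipse to the shape.

```python
import math


def polygon_area(pts):
    """Area of a simple polygon via the shoelace formula."""
    s = 0.0
    for (x1, y1), (x2, y2) in zip(pts, pts[1:] + pts[:1]):
        s += x1 * y2 - x2 * y1
    return abs(s) / 2.0


def perimeter(pts):
    """Sum of edge lengths around the closed outline."""
    return sum(math.hypot(x2 - x1, y2 - y1)
               for (x1, y1), (x2, y2) in zip(pts, pts[1:] + pts[:1]))


def convex_hull(pts):
    """Convex hull by Andrew's monotone chain, counter-clockwise order."""
    pts = sorted(set(pts))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def circularity(pts):
    """4*pi*A / P^2: equals 1 for a circle, smaller for other shapes."""
    return 4 * math.pi * polygon_area(pts) / perimeter(pts) ** 2


def solidity(pts):
    """Object area divided by convex hull area: 1 for convex shapes."""
    return polygon_area(pts) / polygon_area(convex_hull(pts))


# Hypothetical outlines: a unit square and a concave L-shaped object.
square = [(0, 0), (1, 0), (1, 1), (0, 1)]
l_shape = [(0, 0), (2, 0), (2, 1), (1, 1), (1, 2), (0, 2)]

print(circularity(square), solidity(square), solidity(l_shape))
```

the concave l-shaped outline has a solidity below 1 because its convex hull covers area that the object itself does not.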
these morphological parameters are important because differences in these aspects of virus morphology result in different pathogenic ability. the morphological parameters are also important for mammalian cells to study the effects of cell to cell interaction, virus-induced cell morphology, or stem cell morphology that indicates a certain type of differentiation. we have demonstrated that time and labor-consuming forms of cellular and viral morphometric analysis can be replaced by a sub-second, one-click operation using dnn. popular software tools, such as cellprofiler [28] and imagej [29], require manual parameter tuning, which often demands familiarity with the software and manual inputs from the user; this also has the potential to create user bias when analyzing morphometric data of cells. a robustly trained dnn, with large-scale datasets of cells, may offer a non-biased, sub-second solution. this may also eliminate the need for chemical assays or facs for cell analysis when used in conjunction with a benchtop microscope. chemical assays and facs are often time and labor intensive, and cell processing may change the morphometrics of the stained or sorted cells as well as result in the further production of chemical and biological waste; however, dnn, when used in conjunction with the microscope, eliminates the aforementioned downsides. in the future, the dnn framework can also be implemented to examine the morphological change of virus-infected cells. the dnn could also be used to examine the virus' response to therapeutic interventions, such as through examination of structural changes that may inhibit the virus' ability to infect the host. in terms of computational power and time, dnn's architecture of partitioning the cnns may also be advantageous compared to using one large instance segmentation cnn, because gpu usage is partitioned and each cnn is easier to optimize.
the dnn generally requires a less gpu exhaustive cnn for object detection and then employs a more gpu exhaustive cnn for classification; for example, classifying and detecting covid-19 virus particles from mers virus particles may require more exhaustive cnn as a backbone, but one can use relatively less resource exhaustive cnn backbone to only locate the virus particles with high accuracy from the background of tem images before feeding them into a more exhaustive classification network. by cropping the objects of interest and feeding it into the segmentation network, dnn was able to achieve a high score for the jaccard similarity coefficient for multi-class instance segmentation. in the future, we will seek to have dnn analysis encompass a wider variety of viruses and cell types to broaden the application and ease the implementation of the dnn framework in future research. for the next step, we will train the dnn using sars-cov-2 infected cells and mock cells observed in sputum sample smears of human subjects. this would enable us to rapidly diagnose sars-cov-2 infected patients using the dnn in conjunction with any benchtop microscopes. cells in sputum samples of sars infected patients showed cellular abnormalities, such as cytoplasmic foaminess, distinct vacuoles, multinucleation, and glass appearance of the nucleus. [30] sars-cov-2 infected cells also showed a dramatic increase in filopodial protrusions, which were significantly longer and more branched than in uninfected cells. [31] uninfected cells also exhibited filopodial protrusions, but their frequency and shape were dramatically different. the sars-cov-2 infected cells also revealed prominent m protein clusters, possibly making assembled viral particles, localized along the tips of actin-rich filopodia. [31] reorganization of the actin cytoskeleton is a common feature of many viral infections and is associated with different stages of the viral life cycle. 
[31, 32] we hypothesize that the cell morphology changes due to sars-cov-2 infection can be detected by the dnn. as a pre-trained dnn takes anywhere from under a second to less than a minute to run, depending on the user's computer hardware specifications, a dnn pre-trained on sars-cov-2 infected cells and mock cells from sputum sample smears of human subjects can be rapidly distributed around the world and used in conjunction with existing benchtop microscopes for rapid and scalable screening. furthermore, different dnns will be trained to classify sars-cov-2 infected cells and mock cells present in sputum samples according to patients' age group, sex, and ethnicity. this is to personalize the diagnostic method for higher accuracy in screening. furthermore, classification of cells infected by different types of coronaviruses and mock cells will be studied using dnn.
table s1. cells' morphometrics observed by the dnn. the table shows tukey hsd q statistic, hsd p-value, and tukey hsd inference between the cell types in terms of (a) area, (b) eccentricity, (c) circularity, and (d) solidity.
[1] a novel coronavirus from patients with pneumonia in china
[2] mers-cov: understanding the latest human coronavirus threat
[3] severe acute respiratory syndrome: clinical and laboratory manifestations
[4] the middle east respiratory syndrome (mers)
[5] nsp3 of coronaviruses: structures and functions of a large multi-domain protein
[6] coronaviruses: an overview of their replication and pathogenesis
[7] middle east respiratory syndrome: pathogenesis and therapeutic developments
[8] human neutralizing antibodies against mers coronavirus: implications for future immunotherapy
[9] treatment with interferon-α2b and ribavirin improves outcome in mers-cov-infected rhesus macaques
[10] rapid generation of a mouse model for middle east respiratory syndrome
[11] imagenet classification with deep convolutional neural networks
[12] going deeper with convolutions
[13] very deep convolutional networks for large-scale image recognition
[14] inception-v4, inception-resnet and the impact of residual connections on learning
[15] learning transferable architectures for scalable image recognition
[16] rich feature hierarchies for accurate object detection and semantic segmentation
[17] fast r-cnn
[18] faster r-cnn: towards real-time object detection with region proposal networks
[19] you only look once: unified, real-time object detection
[20] yolo9000: better, faster, stronger
[21] convolutional networks for biomedical image segmentation, international conference on medical image computing and computer-assisted intervention
[22] establishing primary adult fibroblast cultures from rodents
[23] mask r-cnn
[24] deep residual learning for image recognition
[25] the language of technical computing: computation, visualization, programming. installation guide for unix version 5
[26] adaptation of a simple microfluidic platform for high-dimensional quantitative morphological analysis of human mesenchymal stromal cells on polystyrene-based substrates
[27] benchmark analysis of representative deep neural network architectures
[28] cellprofiler: image analysis software for identifying and quantifying cell phenotypes
[29] nih image to imagej: 25 years of image analysis
[30] sputum cytology of patients with severe acute respiratory syndrome (sars)
[31] the global phosphorylation landscape of sars-cov-2 infection
[32] subversion of the actin cytoskeleton during viral infection
competing interests: h.l., s.l., y.c., j.g., a.l., are inventors on a pending provisional patent application submitted by the columbia university related to this work. the authors declare that they have no other competing interests.
data and materials availability: additional data related to this paper may be requested from the authors.
key: cord-330239-l8fp8cvz authors: oyelade, o. n.; ezugwu, a. e. title: deep learning model for improving the characterization of coronavirus on chest x-ray images using cnn date: 2020-11-03 journal: nan doi: 10.1101/2020.10.30.20222786 sha: doc_id: 330239 cord_uid: l8fp8cvz the novel coronavirus, also known as covid-19, is a pandemic that has weighed heavily on the socio-economic affairs of the world. although research into the production of a relevant vaccine is advancing, there is, however, a need for a computational solution to mediate the process of aiding quick detection of the disease. different computational solutions comprising natural language processing, knowledge engineering and deep learning have been adopted for this task. however, deep learning solutions have shown interesting performance compared to other methods. this paper therefore aims to advance the application of deep learning techniques to the problem of characterization and detection of novel coronavirus.
the approach adopted in this study proposes a convolutional neural network (cnn) model which is further enhanced using the technique of data augmentation. the motive for the enhancement of the cnn model through the latter technique is to investigate the possibility of further improving the performances of deep learning models in detection of coronavirus. the proposed model is then applied to the covid-19 x-ray dataset in this study which is the national institutes of health (nih) chest x-ray dataset obtained from kaggle for the purpose of promoting early detection and screening of coronavirus disease. results obtained showed that our approach achieved a performance of 100% accuracy, recall/precision of 0.85, f-measure of 0.9, and specificity of 1.0. the proposed cnn model and data augmentation solution may be adopted in pre-screening suspected cases of covid19 to provide support to the use of the well-known rt-pcr testing. the 2019 novel coronavirus disease presents an important and urgent threat to global health, and has equally exposed to an extent the fragility of the most highly placed health institutions and infrastructures across the globe [27, 28] . since the first status recorded case of covid-19 was identified in early december 2019 in wuhan, in the hubei province of the people's republic of china, the number of patients confirmed to have contracted the disease has exceeded 35,960,908 in 214 countries, and the number of people infected is probably much higher [29] . moreover, a record estimate of more than 1,052,310 people have died from the coronavirus covid-19 outbreak as of october 06, 2020 [29] . however, despite several public health responses aimed at containing the disease and delaying its spread, many countries have now been confronted with critical health care catastrophes ranging from limited hospital beds, shortage of medical equipment, and contamination of medical frontline workers [27] . 
in order to alleviate the burden placed on the already fragile healthcare system, while also providing the best possible medical care for patients, efficient diagnosis and treatment of the novel coronavirus disease are urgently needed. the study conducted by wynants et al. [27] revealed that the development of efficient prediction models that combine several variables or features to estimate the risk of people being infected by the disease or experiencing a poor outcome from the infection could assist medical staff in triaging patients when allocating limited healthcare resources. several advanced artificial intelligence and machine learning models, specifically deep learning algorithms, have been proposed as evidenced in many academic databases and journals in response to a call by the who to share relevant covid-19 research findings rapidly and openly to inform the public health response and help save people's lives. singh et al. [32, 34] developed a deep convolution neural network (cnn) that was applied in the automated diagnosis and analysis of covid-19 in infected patients. the authors' proposed model involved the tuning of [...]. although different artificial intelligence approaches also exist, like case-based reasoning (cbr) [25], which has been applied to the detection of this disease, cnn methods have been shown to be more effective and promising. several studies [4, 5, 6, 78, 26, 30] and reviews which have adapted cnn to the task of detection and classification of covid-19 have proven that the deep learning model is one of the most popular and effective approaches in the diagnosis of covid-19 from digitized images. this outstanding performance of cnn is due to its ability to learn features automatically from digital images, as has been applied by researchers to diagnoses of covid-19 based on clinical images, ct scans, and x-rays of the chest.
therefore, considering the advantages of the several automated deep learning solution approaches as mentioned above for curbing the spread of covid-19 through early detection, classification, isolation and treatment of affected persons, it would be worthwhile to further investigate the possibility of developing better and more efficient variants of deep machine learning techniques. moreover, as revealed in most of the literature implementation results, adapting computational solutions to effectively extract the relevant information inherent in x-ray imaging can help to automate the process of speeding up diagnoses of sars-cov-2 virus. in this paper, we propose the application of deep learning model in the category of convolutional neural network (cnn) techniques to automate the process of extracting important features and then classification or detection of covid-19 from digital images, and this may eventually be supportive in overcoming the issue of a shortage of trained physicians in remote communities [24] . the proposed model implementation is such that we first applied some selected image pre-processing techniques to reduce the noise on ct and chest x-rays digital images obtained from the covid-19 x-ray dataset. all the dataset used to validate the performance superiority of the new model is taken from the national institutes of health (nih) chest x-ray datasets. specifically, the contributions of this research are summarized as follows: i. design and development of a new cnn based deep learning framework for the automatic characterization and accurate diagnosis of covid-19 cases. ii. the proposed cnn model is aimed at detecting covid-19 cases using chest x-rays images from a combined covid-19 x-ray images extracted from the national institutes of health (nih) chest x-ray dataset. iii. experimentation was done using two optimization algorithms namely adam and stochastic gradient descent (sgd). iv. 
the implemented cnn based deep learning model was evaluated and its results compared with existing similar state-of-the-art results from literature using the following metrics: accuracy, sensitivity, specificity, f1-score, a confusion matrix, and auc using receiver operating characteristic (roc). the rest of the paper is structured as follows. in section 2, we explain the proposed deep learning framework for the characterization of coronavirus on chest x-ray images and datasets. the computational results and different experimentations are reported in section 3. the interpretation of the obtained experimental results from the proposed deep learning model is presented in section 4. finally, the concluding remarks and future research direction are given in section 5. in this section, an overview of the deep learning approach proposed in this study is presented. this overview is summarized through an architectural pipeline flow of the concepts comprising it. the datasets and the associated data/image preprocessing techniques adopted for this study are also detailed. the choice and category of datasets for application to any cnn model are very important and require the selection of an appropriate dataset. in this study, we decided to apply our cnn model to chest x-rays or ct images, which are outputs of radiological imaging that has been shown to yield better diagnosis of covid-19 [1]. two categories of datasets are employed for the characterization of the features and classification of the novel covid-19 disease. these databases are the covid-19 x-ray images [2] and the national institutes of health (nih) chest x-ray dataset [3]. the most frequently used imaging modality is the chest x-ray, owing to its cost-effectiveness, although it presents a more challenging clinical diagnosis task compared with chest ct imaging.
hence, our combined approach of chest x-rays/ct images and the use of publicly available datasets with large instances position our cnn model to achieve clinically relevant diagnoses. the covid-19 x-ray dataset consists of cases of covid-19, mers, sars, and ards which are all represented as chest x-ray or ct images database. the database is accompanied with several fields for each instance which provides further details on the image sample. these fields include number of days since the start of symptoms or hospitalization of patient (necessary for tracking multiple copies of image taken per patient); sex, age, findings or outcome of the diagnoses, patient survival status, the view of the image presented (pa, ap, or l for x-rays and axial or coronal for ct scans), modality (ct or x-ray), clinical notes, and other important information. we obtained 363 instances of images and their accompanied metadata from the covid-19 x-ray database. the second database combined with the covid-19 x-ray dataset in this study is the national institutes of health (nih) chest x-ray dataset. this database is comprised of 112,120 x-ray images which are of sizes 1024 x1024 with disease labels from 30,805 unique patients. the database provides samples of images with their diseases and the disease region bounding boxes. similar to the covid-19 x-ray dataset, this database also provides the following metadata about each instance: findings/diagnosis, type of disease diagnosed, age and gender of patient, the view of the image and other details. in the following figures, we have summarized the class distributions and sizes of the databases and also present a combined chart of the two databases. in figure 1 , we show the number of images in the covid-19 chest x-ray and nih chest x-rays databases which are 363 and 84823 respectively. figure 2 reveals that the covid-19 chest xray consists of ten (10) figure 3 . 
is the author/funder, who has granted medrxiv a license to display the preprint in perpetuity. (which was not certified by peer review) the copyright holder for this preprint this version posted november 3, 2020. in the experimentation phase, the combined images from the two datasets were split into training, validation and testing categories, yielding 63887 samples for training, 17034 samples for validation and 4265 samples for testing. this is illustrated in figure 4. a joint representation of class distribution of images/samples across the two databases for the purpose of training is shown in figure 5. the combination yielded twenty-four (24) classes with the following number of samples in each class: no-finding or disease free samples has 28222 images, infiltration has 8017 images, effusion has 5701 images, atelectasis has 5373 images, nodule has 2887 images, mass has 2558 images, pneumothorax has 2255 images, consolidation has 2012 images, pleural thickening has 1511 images, cardiomegaly has 1187 images, emphysema has 1171 images, edema has 1003 images, fibrosis has 968 images, pneumonia has 636 images, covid-19 has 203 images, hernia has 118 images, streptococcus has 17 images, ards has 15 images, pneumocystis has 15 images, sars has 11 images, e.coli has 4 images, chlamydophila has 2 images, legionella has 2 images, and klebsiella has 1 image. meanwhile, the splitting of the datasets into validation and testing sets is captured in figures 6 and 7. figure 4: a combined graph of the distribution of images used for training, testing and validation as drawn from the covid-19 and nih chest x-ray datasets. it is made available under a cc-by-nc-nd 4.0 international license.
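as a quick arithmetic check on the split described above, the sketch below derives the share of each subset from the reported sample counts. only the three counts and the two database sizes come from the text; the rounding to two decimals is a presentational choice.

```python
# Sample counts reported for the combined dataset.
splits = {"training": 63887, "validation": 17034, "testing": 4265}

total = sum(splits.values())

# Share of each subset, rounded to two decimals for readability.
fractions = {name: round(count / total, 2) for name, count in splits.items()}

# The total should equal the sum of the two database sizes quoted
# earlier (363 covid-19 images and 84823 nih chest x-ray images).
matches_sources = (total == 363 + 84823)

print(total, fractions, matches_sources)
```

the counts correspond to roughly a 75/20/5 split, and their total agrees with the combined size of the two databases.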
figure 5: distribution of training samples among classes of disease as drawn from the covid-19 chest x-ray and nih chest x-ray datasets. figure 6: distribution of validation samples among classes of disease as drawn from the covid-19 chest x-ray and nih chest x-ray datasets. figure 7: distribution of testing samples among classes of disease as drawn from the covid-19 chest x-ray and nih chest x-ray datasets.
considering the level of noise, distortion or anomalies that might be associated with some of the images accessed from the public databases, this study attempted to pre-process all the samples. this was achieved by applying some standard image preprocessing techniques to our images. the next section details this approach. to enhance the performance of deep learning models, several studies [12, 13, 14, 15] have encouraged the application of appropriate preprocessing techniques/algorithms to inputs/samples. basically, image processing is categorized into two: analogue image processing and digital image processing, the latter being the use of algorithms to process digital images. the preprocessing techniques are aimed at improving the features in the image through image enhancement and the suppression of unwanted distortions thereby yielding an improved image/input for the deep learning model. in this study, we applied the following preprocessing techniques to our samples after reading or loading images into the buffer/memory: • image resizing: due to the heterogeneity of the databases and as a result of variation in the sizes of images, we resized the images to 220x220.
such a resizing operation decreases the total number of pixels from 888x882 and 1024x1024 for the covid-19 x-ray and nih chest x-ray datasets respectively to 220x220 for both. • removal of noise (denoise): image denoising can be a challenging procedure because it requires estimating the original image by eliminating noise from a noisy image. for instance, one might be interested in removing any of the following noises from an image: poisson noise, salt and pepper noise, gaussian noise, and speckle noise. in this study, we attempted to remove noise from our image samples using the gaussian blur technique since study [16] showed that the technique is relevant in images with high noise. we used a gaussian filter by passing our images to the function cv2.gaussianblur with a kernel size of 5x5 and zero (0) for the standard deviation in both the x and y directions. • morphology (smoothing edges): as a preprocessing operation before applying segmentation to our images, we applied the morphology operation to our samples. this enables us to extract image components that are useful in the representation and description of region shape. this operation (morphological smoothing) was aimed at removing bright and dark artifacts of noise, and was achieved through an opening followed by a closing operation. the output of this phase yielded images whose edges are smoothened for easy detection. • segmentation: it is well known that image segmentation allows for the partitioning of an image into multiple image objects or segments appearing as different categories of pixels such that a similar category constitutes a segment. this study approached this technique with the aim of enhancing the process of detecting image objects which support feature extraction thereby obtaining meaningful results.
we achieved this through the application of the thresholding method, leaving out other methods such as edge detection based techniques, region based techniques, clustering based techniques, watershed based techniques, partial differential equation based and artificial neural network based techniques. using the thresholding method, we applied the thresh_binary_inv thresholding style of opencv with a maxval of 255, which is the value assigned to a pixel on one side of the threshold (for the inverse style, a pixel that does not exceed the threshold value). the computation of thresh_binary_inv is as shown in equation 1: $$\mathrm{dst}(x, y) = \begin{cases} 0 & \text{if } \mathrm{src}(x, y) > \mathrm{thresh} \\ \mathrm{maxval} & \text{otherwise} \end{cases} \qquad (1)$$ the thresholding call also returns retval, the threshold value that was actually used. we used otsu's method, which is widely reported to yield interesting results and is also suitable when there is distinguishable foreground and background [17]. this method is selected by adding the thresh_otsu flag to the thresholding style, in which case retval holds the threshold value calculated automatically from the image histogram. thus far, we have filtered our image samples with a 5x5 gaussian kernel to remove the noise, and then applied otsu thresholding. furthermore, we applied the dilate operation on the image to enlarge the foreground and thereby find the sure background area. also, to find the sure foreground area in the image, we applied the distancetransform operation to achieve a representation of a binary image in which the value of each pixel was replaced by its distance to the nearest background pixel. hence the threshold segmentation was applied to divide the image into regions of object and background. our thresholding segmentation was completed through the application of global thresholding, which uses an appropriate threshold value t kept constant for the whole image so that the output image is obtained from the original image as seen in equation 2 [18].
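the thresholding step described above can be sketched without any imaging library. the code below is a minimal pure-python illustration, not the study's implementation: otsu_threshold searches for the 8-bit threshold that maximises the between-class variance of the histogram, and threshold_binary_inv applies the inverse binary rule (pixels above the threshold become 0, all others become maxval). the function names and the tiny sample image are hypothetical, and the value returned by a library such as opencv for the same image is not guaranteed to be identical.

```python
def otsu_threshold(pixels):
    """Return the 8-bit threshold that maximises between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with value <= t
    sum0 = 0  # intensity sum of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, variance undefined
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def threshold_binary_inv(image, thresh, maxval=255):
    """Inverse binary rule: 0 where pixel > thresh, maxval elsewhere."""
    return [[0 if p > thresh else maxval for p in row] for row in image]


# Hypothetical 3x4 grayscale patch: dark left half, bright right half.
image = [[10, 12, 200, 210],
         [11, 13, 205, 198],
         [9, 10, 202, 207]]

flat = [p for row in image for p in row]
t = otsu_threshold(flat)
mask = threshold_binary_inv(image, t)
print(t, mask)
```

because the inverse style is used, the dark region of the patch ends up as the white (255) part of the mask and the bright region as the black (0) part.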
the resulting images from all the preprocessing techniques above are then passed as input into the cnn model described in subsection 3.3. the proposed cnn model is one part of the complete framework in figure 8, which represents the pipeline flow of techniques used in this study. the architectural pipeline shown in figure 8 first reads in the sample images from the buffer where the combined datasets from the two databases are stored. thereafter, the preprocessing techniques described in subsection 3.2 are applied sequentially on them. furthermore, the resized and improved image samples are split into training and validation sets based on the illustration shown in subsection 2.1. thereafter, the cnn model is applied to the input samples for training and validation. the trained model is then exposed to the testing set of images for prediction and then the result of the classification is output for performance evaluation. overfitting is the situation when a model learns the training data excellently but falls short of generalizing well when some other data is exposed to it. regularization techniques such as l2 and l1, dropout, data augmentation, and early stopping have been widely reported to enhance the performance of deep learning models [19, 20].
this study therefore experimented with some of the techniques to ensure an optimal performance of the proposed deep learning (cnn) model. hence, we do not just hope to improve performance but also to enable our model to generalize well. a model failing to generalize well will show the validation error increasing while the training error steadily decreases. in this study, we applied the most common regularization technique, l2, which is also referred to as "weight decay". we aimed at applying this weight regularization technique to reduce overfitting. l2 values range between 0 and 0.1, with examples such as 0.1, 0.001 and 0.0001, and are chosen on a logarithmic scale. we therefore hoped to reduce our model's training error [21, 22] by applying this technique. for instance, the inception v3 model experimented with a value of 0.00004 and we discovered that it was suboptimal and instead experimented with 0.00005 [23]. in addition to the use of l2, we also demonstrated the use of early stopping to stop our model from continuing training when it had attained its optimal performance. this regularization concept is another widely used technique in deep learning to stop training when generalization error increases. in addition to the use of the data augmentation technique, the proposed cnn model also used a dropout layer at a rate of 0.5. in this section, our cnn model is trained on the covid-19 chest x-ray and nih chest x-ray datasets described in subsection 2.1 and the performances of multiclass classification are evaluated. the environment for the experimentation and the outcome of the preprocessing techniques are also described in this section. all our experiments were carried out on google's colab environment with the following configurations as the need arose: 2-core intel(r) xeon(r) cpu @ 2.30ghz, 13gb memory and 33gb hard drive; and gpu tesla p100-pcie-16gb. the pre-processing techniques applied to our input images/samples were extensively discussed in subsection 2.2.
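the l2 weight-decay and early-stopping ideas discussed in this section can be sketched in a few lines. the code below is an illustration under stated assumptions, not the study's training code: the penalty is taken as lambda times the sum of squared weights (some frameworks include an extra factor of one half), the patience of 3 epochs is a hypothetical choice, and the validation losses are made-up numbers. only the weight-decay value 0.00005 comes from the text.

```python
def l2_penalty(weights, lam=0.00005):
    """L2 (weight decay) term added to the loss: lam * sum(w^2)."""
    return lam * sum(w * w for w in weights)


def early_stopping_epoch(val_losses, patience=3):
    """Return the epoch index at which training would be stopped.

    Training stops once the validation loss has failed to improve on
    its best value for `patience` consecutive epochs.
    """
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, waited = loss, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return len(val_losses) - 1  # never triggered: ran to the end


# Hypothetical weights and validation-loss curve.
penalty = l2_penalty([1.0, -2.0, 3.0])
val_losses = [0.90, 0.70, 0.60, 0.62, 0.65, 0.70, 0.80]
stop_at = early_stopping_epoch(val_losses)
print(penalty, stop_at)
```

in this made-up curve the validation loss is lowest at epoch index 2 and rises afterwards, so with a patience of 3 the sketch stops at epoch index 5, which mirrors the behaviour described above of halting once the generalization error starts to increase.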
therefore, we present the outcome of the application of those techniques on our datasets. the first operation applied was the resizing of images from the high resolution of 888x882 and 1024x1024 to a collective size of 220x220. this was necessary to allow the datasets sourced from different platforms to feed into our model effectively at a fixed size. in figures 10 and 11, we show the original image samples from the covid-19 and the nih chest x-ray datasets respectively, and the outcome of the resizing operation is shown in figure 12. figure 11: a sample of raw image labeled 'no finding' of size 1024x1024 from the nih chest x-ray dataset. figure 12: a resized sample from 888x882 to 220x220 in the covid-19 chest x-ray dataset.
one major pre-processing operation carried out on our input sets is the removal of noise as described in subsection 2.2. the approach taken in this study is to blur the image samples as a measure to clean and denoise them. hence, in figure 13, a pair of samples showing an image before and after denoising is presented. furthermore, to demonstrate the segmentation operation carried out on the image samples by this study, we have also presented output from such operation. in figures 14 (a-b), a pair of samples of images whose segments and background are extracted are presented. the pair of images in figure 14a shows the original image and the outcome of the segmented image, while that of figure 14b shows the original image and its extracted background. these operations allow for easier understanding of what is in the image and enable easier analysis of each part.
in addition, the segmentation operation on our medical x-ray images revealed objects which were unknown, but were segmented within the image, typically for further investigation. the use of bounding boxes is one of the most interesting operations that provide support for image annotation in deep learning models. this proves useful in object classification in images and even for further localization tasks. whereas image classification is aimed at assigning a class label to an image, object localization allows for creating bounding boxes around recognizable objects in the image. the task in which a model both classifies objects and obtains their positions in the image is referred to as object detection or object recognition. drawing bounding boxes can be achieved using deep learning models or other algorithms. for instance, to describe the location of some targeted diseases in our input images, we draw a bounding box as a rectangle that is determined by the x and y axis coordinates of its upper-left corner and the x and y axis coordinates of its lower-right corner. this operation allows for easily annotating our samples for convenient recognition by the cnn model.
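as a minimal sketch of the corner-based box description above, the following pure-python example computes the upper-left and lower-right corners of the tightest rectangle around the non-zero pixels of a binary mask. the function name `bounding_box` and the toy mask are illustrative assumptions and are not taken from the study, which does not state how its boxes were computed.

```python
from typing import List, Optional, Tuple

# (x_min, y_min, x_max, y_max): upper-left and lower-right corners
Box = Tuple[int, int, int, int]


def bounding_box(mask: List[List[int]]) -> Optional[Box]:
    """Return the tightest box around all non-zero pixels, or None if there are none.

    Hypothetical helper for illustration; rows are indexed by y, columns by x.
    """
    xs = [x for row in mask for x, value in enumerate(row) if value]
    ys = [y for y, row in enumerate(mask) if any(row)]
    if not xs:
        return None
    return (min(xs), min(ys), max(xs), max(ys))


# Toy 4x5 mask standing in for a segmented region of an x-ray image.
mask = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
print(bounding_box(mask))  # (1, 1, 3, 2)
```

storing only the two corners keeps the annotation compact, and the width and height can be recovered as x_max - x_min + 1 and y_max - y_min + 1.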
contours allow for the identification of the shapes of objects within an image, and are recognized as curves joining all the continuous points with similar color or intensity. this technique provides support for object detection and recognition. in this study, to extract the contours shown in the images below those with the bounding boxes, we carry out the following steps: we first threshold each image and then find all the contours in each image; for each contour, we draw a bounding rectangle in green; we then get the minimum-area rectangle, convert all floating-point coordinate values to integers, and draw it as a red tilted ('nghien') rectangle; furthermore, we get the minimum enclosing circle, convert all values to integers and draw the circle in blue; finally, we draw all contours on each image. the proposed cnn model receives grayscale images as its input, and experiments are performed with multiclass classifications. table 1 shows the detection classes for each classification and their distribution in both datasets. for each experiment carried out, we train the model for 50 epochs and 1310 steps. we experimented with the proposed cnn model on the datasets using some variation of hyperparameters. for instance, we investigated the performance of the model when the sgd and adam optimizers were applied and plotted the output of the model. furthermore, we examined the effect on the proposed model of two different values for the l2 (weight decay) regularization technique. the proposed cnn model was also combined with a classical data augmentation technique to determine possible improvement in its performance.
the first set of experiments used the adam optimization algorithm and a weight decay (l2) value of 0.0002. figure 17 captures the performance of the model in terms of the loss function, while figure 18 shows the trajectory of the accuracy for both the training and validation cases. we plotted the confusion matrix of this experiment after carrying out a prediction on the test datasets; it is shown in figure 19. the adam optimizer was configured as follows: learning rate=0.001, beta1=0.9, beta2=0.999 and epsilon=1e-8. in the second experiment, we used the sgd optimizer with the following configuration: learning rate=0.01, decay=1e-6, momentum=0.9 and nesterov=true. the value of 0.0005 was used for the l2 regularization technique. we found that although the accuracy remained close to that of the adam optimizer, there was a difference in the trajectory of the loss values. figures 20 and 21 capture the performance of the model on the training and validation datasets in terms of loss function and accuracy. we also exposed the trained model to the test dataset under the same configuration and plotted the confusion matrix, which is shown in figure 22. applying the confusion matrix operation to the output of the prediction yielded values across the classes (18) found in the test datasets; figure 22 presents them graphically. in subsection 2.3, we noted that one of the regularization techniques applied to our proposed cnn model is early stopping. we found it useful because it stopped training once it recognized that optimal performance had been attained. we show evidence of this in figure 23, where the training was terminated at the 24th epoch during the first experiment, in which data augmentation was not applied.
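the early stopping behaviour described above can be sketched as a patience rule on the validation loss. this is an illustrative assumption rather than the configuration used in the study: the text does not report the monitored quantity, the patience or the minimum improvement, so the function name `early_stopping_epoch`, the parameters `patience` and `min_delta`, and the loss values below are all hypothetical.

```python
from typing import Iterable, Optional


def early_stopping_epoch(val_losses: Iterable[float],
                         patience: int = 3,
                         min_delta: float = 0.0) -> Optional[int]:
    """Return the 1-based epoch at which training would stop, or None.

    Training stops once the validation loss has failed to improve on the
    best value seen so far by more than `min_delta` for `patience`
    consecutive epochs (hypothetical rule for illustration).
    """
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best - min_delta:
            best = loss      # new best validation loss: reset the counter
            waited = 0
        else:
            waited += 1      # no improvement this epoch
            if waited >= patience:
                return epoch
    return None              # the stopping rule never triggered


# Made-up validation losses: the loss stops improving after epoch 3.
losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.63, 0.50]
print(early_stopping_epoch(losses, patience=3))  # 6
```

with these values the rule halts at epoch 6 and never sees the later improvement at epoch 7, which is the usual trade-off when choosing a small patience.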
now that the experimentation on the proposed cnn model has been presented, we proceed to discuss the performance of the model compared with some state-of-the-art models for diagnosing covid-19 disease. in table 2, we list the performance of our model for the experiments carried out, in comparison with similar models adapted to the purpose of classifying covid-19 disease. the results in table 2 show that our system achieves 1.00, 0.85, 0.85, 0.90, 0.50 and 1.00 for specificity, recall, precision, f-score, auc and accuracy respectively in phase one of the first experiment. in phase one of the second experiment, our model achieved the same values: 1.00, 0.85, 0.85, 0.90, 0.50 and 1.00 for specificity, recall, precision, f-score, auc and accuracy respectively. the proposed model attains 85% for both precision and recall, which makes it useful for the proposed task by avoiding unnecessary false alarms. the f1 measure is relevant when selecting a model based on a balance between precision and recall; it is the harmonic mean of precision and recall and gives a better measure of the incorrectly classified cases than accuracy does. as a result, the value of 0.9 for our f1-score shows the performance of our model even when there are imbalanced classes, as is the case in most real-life classification problems. the confusion matrix is well known as an unambiguous way to present the prediction results of a classifier and works for both binary and multiclass classification.
when binary classification is desired, the confusion matrix is usually presented as a 2x2 table showing true positives, true negatives, false positives and false negatives. however, in this study, which aims at multiclass classification, the table has a size equal to the number of classes (24) squared. the confusion matrices computed for our proposed model using the adam and sgd optimizers, when predicting the different classes in the combined dataset, are shown in figures 19 and 22. generally, the higher the numbers on the diagonal of the confusion matrix, the better the accuracy of the model. evidently, our proposed model achieved a good performance in this respect. one of the most useful metrics is the classification_report, which combines several measures and prints a table with the results. in addition to presenting and comparing the performance of our model using metrics such as specificity, sensitivity, recall, precision, f-score and auc, we use table 3 to compare the performance of our model with others in terms of accuracy.
table 3 (one row survives): panwar et al [11], ncovnet, chest x-ray images, 97.00%
considering the performance of our model in table 3 compared with other similar studies, we conclude that this study outperforms state-of-the-art deep learning models aimed at detecting and classifying the novel coronavirus. this is readily seen from the performance of each study. we note that only ko et al [10] and panwar et al [11], who used the fast-track covid-19 classification network (fconet) and ncovnet respectively, have models whose performance competes with that of our model.
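to make the reading of a multiclass confusion matrix concrete, the sketch below derives per-class precision, recall and f1 from a small matrix, treating each class in turn as the positive class. it assumes rows are true classes and columns are predicted classes; the function name `per_class_metrics` and the 3-class counts are made up for illustration and are not results from this study.

```python
from typing import List, Tuple


def per_class_metrics(cm: List[List[int]]) -> List[Tuple[float, float, float]]:
    """Return (precision, recall, f1) for each class of a square confusion matrix.

    Assumes cm[t][p] counts samples of true class t predicted as class p.
    """
    n = len(cm)
    results = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                        # rest of row k
        fp = sum(cm[r][k] for r in range(n)) - tp   # rest of column k
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append((precision, recall, f1))
    return results


# Made-up 3-class confusion matrix (10 samples per true class).
cm = [
    [8, 2, 0],
    [1, 7, 2],
    [0, 1, 9],
]
accuracy = sum(cm[k][k] for k in range(len(cm))) / sum(map(sum, cm))
for k, (p, r, f) in enumerate(per_class_metrics(cm)):
    print(f"class {k}: precision={p:.3f} recall={r:.3f} f1={f:.3f}")
print(f"accuracy={accuracy:.3f}")
```

the diagonal entries drive every quantity here, which is why larger diagonal counts correspond to better accuracy.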
this study has, therefore, successfully advanced research in the areas of detection and classification of covid-19 using deep learning models. in this paper, a deep learning model based on cnn was designed and implemented for the purpose of detecting and classifying the presence of covid-19 in chest x-rays and ct images. we applied the proposed cnn model to two publicly available datasets, namely the covid-19 chest x-ray and the nih chest x-ray databases. in addition, we enhanced the performance of our model by applying some regularization techniques to it. furthermore, we investigated the performance of the proposed model by comparing the popular adam and sgd optimizers. the results obtained revealed that our model achieved 100% accuracy in classifying the novel coronavirus. with the exponential increase in covid-19 reported cases and treatments around the globe, a growing volume of covid-19 datasets is being created and archived daily. therefore, future studies could focus on advancing the architecture of the deep learning model presented in this paper and, most importantly, on investigating the robustness of this model on some large-scale datasets. furthermore, it will also be interesting to see the deployment of our trained cnn-based deep learning model in both web and android applications for clinical use.
sensitivity of chest ct for covid-19: comparison to rt-pcr
covid-19 image data collection: prospective predictions are the future
chestx-ray8: hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases
automated detection of covid-19 cases using deep neural networks with x-ray images
covid-net: a tailored deep convolutional neural network design for detection of covid-19 cases from chest radiography images
covid-19: automatic detection from x-ray images utilizing transfer learning with convolutional neural networks
coronet: a deep neural network for detection and diagnosis of covid-19 from chest x-ray images
unveiling covid-19 from chest x-ray with deep learning: a hurdles race with small data
deep learning-based decision-tree classifier for covid-19 diagnosis from chest x-ray imaging
covid-19 pneumonia diagnosis using a simple 2d deep learning framework with a single chest ct image: model development and validation
application of deep learning for fast detection of covid-19 in x-rays using ncovnet. chaos, solitons, and fractals
a survey on image data augmentation for deep learning
preprocessing for image classification by convolutional neural networks
image enhancement effect on the performance of convolutional neural networks
image processing, computer vision, and deep learning: new approaches to the analysis and physics interpretation of lhc events
investigation on the effect of a gaussian blur in image filtering and segmentation
overview of different thresholding methods in image processing. teqip sponsored 3rd national conference on etacc
various image segmentation techniques: a review
regularization for deep learning: a taxonomy
a comparison of regularization techniques in deep neural networks
imagenet classification with deep convolutional neural networks
very deep convolutional networks for large-scale image recognition
xception: deep learning with depthwise separable convolutions
covid faster r-cnn: a novel framework to diagnose novel coronavirus disease (covid-19) in x-ray images
a case-based reasoning framework for early detection and diagnosis of novel coronavirus
classification of covid-19 in chest x-ray images using detrac deep convolutional neural network
prediction models for diagnosis and prognosis of covid-19 infection: systematic review and critical appraisal
an interactive web-based dashboard to track covid-19 in real time. the lancet infectious diseases
a machine learning solution framework for combatting covid-19 in smart cities from multiple dimensions
applications of machine learning and artificial intelligence for covid-19 (sars-cov-2) pandemic: a review
deep convolutional neural networks based classification model for covid-19 infected patients using chest x-ray images
classification of the covid-19 infected patients using densenet201 based deep transfer learning
classification of covid-19 patients from chest ct images using multi-objective differential evolution-based convolutional neural networks
coronavirus (covid-19) classification using ct images by machine learning methods
automatic detection of covid-19 using x-ray images with deep convolutional neural networks and machine learning
classification of covid-19 from chest x-ray images using deep convolutional neural networks
weakly supervised deep learning for covid-19 infection detection and classification from ct images
fast deep learning computer-aided diagnosis against the novel covid-19 pandemic from digital chest x-ray images
the experimental results of multiclass
classification for four (4) experimental scenarios show that most scenarios have more than 99% accuracy. to evaluate the performance of the proposed model, we calculate accuracy, sensitivity, specificity, precision, recall, f1-score, cohen's kappa, roc auc, and the confusion matrix. in the following paragraphs, we briefly outline the metrics and their relevance to our classification of novel covid-19 disease. precision checks what proportion of the positive identifications made by a model were actually correct and is given by equation 3. recall, on the other hand, checks what proportion of the actual positive cases in our datasets the proposed cnn model was able to identify correctly. this is given by equation 4. evaluating the effectiveness of our cnn model requires that we examine its performance in terms of precision and recall, hence the need for the computation of these metrics. furthermore, we examined another metric known as the f1 score. this metric expresses the balance between the precision and the recall described above and helps us judge the performance of our model on both precision and recall together. we give the f1 score in equation 5. in this study, we also chose cohen's kappa, a metric that is under-utilized though effective for multiclass classification. this metric is robust in handling imbalanced class problems, as may be seen in our datasets. in a multiclass classification problem, this metric provides a wider view of the performance of a classification model compared to accuracy (equation 6) or precision/recall. the metric is given in equation 7. the receiver operating characteristic (roc) curve expresses the performance of our classification (cnn) model graphically and does this at all classification thresholds. it achieves this by plotting the true positive rate (tpr) against the false positive rate (fpr).
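since the equation for cohen's kappa is not reproduced in this text, the following sketch uses the standard definition, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement (the accuracy) and p_e is the agreement expected by chance from the row and column totals of the confusion matrix. the function name `cohens_kappa` and the 3-class counts are illustrative assumptions, not values from this study.

```python
from typing import List


def cohens_kappa(cm: List[List[int]]) -> float:
    """Cohen's kappa for a square confusion matrix (rows: true, columns: predicted)."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    # Observed agreement: fraction of samples on the diagonal.
    p_o = sum(cm[k][k] for k in range(n)) / total
    # Chance agreement: sum over classes of (row total * column total) / total^2.
    p_e = sum(
        sum(cm[k]) * sum(cm[r][k] for r in range(n)) for k in range(n)
    ) / (total * total)
    if p_e == 1.0:
        return float("nan")  # kappa is undefined when chance agreement is perfect
    return (p_o - p_e) / (1.0 - p_e)


# Made-up 3-class confusion matrix: accuracy is 0.8, chance agreement is 1/3.
cm = [
    [8, 2, 0],
    [1, 7, 2],
    [0, 1, 9],
]
kappa = cohens_kappa(cm)
print(round(kappa, 3))  # 0.7
```

here an accuracy of 0.8 corresponds to a kappa of 0.7 because one third of the agreement would be expected by chance alone, which illustrates why kappa is a stricter summary than accuracy.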
the metric gives a summary of the performance of a classifier over all possible thresholds. related to the roc is the area under the roc curve (auc), which measures the entire two-dimensional area underneath the roc curve from (0,0) to (1,1). this metric is effective at checking the soundness and quality of our model's prediction performance. finally, for the description of our metrics, we have the confusion matrix. whereas the accuracy of a model may seem appealing in some sense, it is limited by its inability to give detail about the performance of the classification model. the confusion matrix, on the other hand, provides this detail by presenting the prediction results in an unambiguous manner.

key: cord-275258-azpg5yrh authors: mead, dylan j.t.; lunagomez, simón; gatherer, derek title: visualization of protein sequence space with force-directed graphs, and their application to the choice of target-template pairs for homology modelling date: 2019-07-26 journal: j mol graph model doi: 10.1016/j.jmgm.2019.07.014 sha: doc_id: 275258 cord_uid: azpg5yrh
the protein sequence-structure gap results from the contrast between rapid, low-cost deep sequencing, and slow, expensive experimental structure determination techniques. comparative homology modelling may have the potential to close this gap by predicting protein structure in target sequences using existing experimentally solved structures as templates. this paper presents the first use of force-directed graphs for the visualization of sequence space in two dimensions, and applies them to the choice of suitable rna-dependent rna polymerase (rdrp) target-template pairs within human-infective rna virus genera. measures of centrality in protein sequence space for each genus were also derived and used to identify centroid nearest-neighbour sequences (cnns) potentially useful for production of homology models most representative of their genera.
homology modelling was then carried out for target-template pairs in different species, different genera and different families, and model quality assessed using several metrics. reconstructed ancestral rdrp sequences for individual genera were also used as templates for the production of ancestral rdrp homology models. high quality ancestral rdrp models were consistently produced, as were good quality models for target-template pairs in the same genus. homology modelling between genera in the same family produced mixed results and inter-family modelling was unreliable. we present a protocol for the production of optimal rdrp homology models for use in further experiments, e.g. docking to discover novel anti-viral compounds. (219 words) since high-throughput sequencing technologies entered mainstream use towards the end of the first decade of the 21st century, there has been an explosion in available protein sequences. by contrast, there has been no corresponding high-throughput revolution in structural biology. obtaining solved structures of proteins at adequate resolution remains a painstaking task. x-ray crystallography is still the gold standard for structure determination more than 60 years after its first use in determining myoglobin structure [1] . the result of this discrepancy between the rate of protein sequence determination and the rate of protein structure determination is the protein sequence-structure gap [2] . homology modelling is a rapid computational technique for prediction of a protein's structure from (a) the protein's sequence, and (b) a solved structure of a related protein, referred to as the target and the template, respectively. since structural similarity often exists even where sequence similarity is low [2, 3] , homology modelling has the potential to reduce massively the size of the protein sequence-structure gap, provided the models produced can be considered reliable enough for use in further research. 
the rna-dependent rna polymerase (rdrp) of rna viruses presents an opportunity to test and expand this approach. rdrps are the best conserved proteins throughout the rna viruses, being essential for their replication [4]. conservation is particularly high in structural regions that are involved in the replication process, for instance the indispensable rna-binding pocket [5]. rdrps are also of immense medical importance as the principal targets for antiviral drugs. evolution of resistance against anti-viral drugs is a major concern for the future, and the design of novel anti-viral compounds is a highly active research area. solved structures of rdrps are of great assistance to these efforts, as they enable the use of docking protocols against large libraries of pharmaceutical candidate compounds [e.g. refs. [6, 7]]. although some human-infective rna viruses have solved rdrp structures, there are still large areas within the virus taxonomy that lack any. this paper will first identify where the protein sequence-structure gap is at its widest in rdrps. because of the sequence-structure gap, it is impossible in many genera to perform docking protocols against solved structures of rdrp for discovery of novel anti-viral compounds. under these circumstances, replacement of real solved structures with homology models for docking experiments requires that the homology models used should be both high quality and also optimally representative of their respective genera. our second task is to present several similarity metrics in sequence space that assist in the identification of the virus species having the rdrp sequence that is most representative of its genus as a whole. we then present the first use of force-directed graphs to produce an intuitive visualization of sequence space, and select target rdrps without solved structures for homology modelling.
these are then used to perform homology modelling using template-target pairs within the same genus, between sister genera and between sister families, monitoring the quality of the models produced as the template becomes progressively more genetically distant to the target sequence being modelled. finally, we produce homology models for reconstructed common ancestral rdrp sequences. in the light of our results, we comment on the strengths and weakness of homology modelling to reduce the size of the protein sequence-structure gap for rdrps, and produce a flowchart of recommendations for docking experiments on rdrp proteins lacking a solved structure. we chose rdrps from human-infective viruses based on the list provided by woolhouse & brierley [8] . given the global medical importance of aids, we also included lentivirus reverse transcriptases (rts) for analysis. solved structures for these proteins, where available, were downloaded from the rcsb protein data bank (pdb) [9] . table 1 presents our criteria for selecting suitable homology modelling candidates. rdrp and rt amino acid sequences for all virus species satisfying the criteria of table 1 were downloaded from genbank [10] . alignment of sequence sets for each genus, was performed using mafft [11] . alignments were refined in mega [12] using muscle [13] where necessary, and the best substitution model determined. alignment of target sequences onto their solved structure templates for homology modelling was carried out using the molecular operating environment (moe v.2016.08, chemical computing group, montreal h3a 2r7, canada). we define sequence space as a theoretical multi-dimensional space within which protein sequences may be represented by points. for an alignment of n related proteins, the necessary dimensionality of this sequence space is n-1, with the hyperspatial co-ordinates in each dimension for any protein determined by its genetic distance to the n-1 other proteins. 
for n = 5, direct visualization of all dimensions of sequence space is impractical at best, since a 4-dimensional space must be simulated in three dimensions, and is effectively impossible for n ≥ 6. the following methods were used to reduce sequence space to two and three dimensions for ease of visualization. to simplify calculations, we allow an extra dimension defined by the distance from each sequence to itself. the value of the co-ordinate in that dimension is always zero and our sequence space has n dimensions rather than n-1. the pairwise distance matrix ($m_d$) for each genus, calculated from the sequence alignment in mega, consists of entries $m_d(i,j)$ giving the genetic distance between each pair of sequences i and j, where $\{i, j\} \in \{1, 2, \ldots, n\}$ and $i \neq j$, for a set of n sequences. in our data set n ranges (see supplementary table). the similarity matrix was then used as input for r package qgraph [14]. the "spring" layout option was chosen, which uses the fruchterman-reingold algorithm to produce a two-dimensional undirected graph in which edge thickness is proportional to absolute distance in n dimensions and node proximity in two dimensions is optimized for ease of viewing while attempting to ensure that those nodes closely related in the n-dimensional input are also close in the two-dimensional output [15]. 500 iterations were performed, or until convergence was achieved. for each alignment, the pairwise distance matrix ($m_d$) was used as input for the r function cmdscale, which uses multi-dimensional scaling to produce a three-dimensional graph from the n-dimensional input, with node proximity again reflecting relative similarity [16]. spotfire analyst (tibco spotfire analyst, v.7.12.0, 2018) was used to visualize the output of cmdscale. we define the centroid as a hypothetical protein sequence located at the centre point of the sequence space of an alignment.
the real sequence closest to the hypothetical centroid is termed the centroid nearest neighbour (cnn). we calculate the position of the cnn in three ways.

table 1 list of criteria used to select rna-dependent rna polymerases (rdrps) for homology modelling.
criterion | rationale
human-infective virus | importance to human health
ncbi refseq annotated genome | easy retrieval of high quality rdrp sequence
rdrp located at the 3' end of polyprotein or on its own segment | eliminates unconventional rdrps
at least one solved rdrp at a range of different taxonomic levels, e.g. in same species, same genus, same family, same order | to be used as the templates in homology modelling at different levels of genetic distance

2.4.1. shortest-path centroid nearest neighbour
for a sequence $i \in \{1, 2, \ldots, n\}$ in an alignment of n sequences, its total path length d(i) to the other n-1 sequences may be calculated from the distance matrix $m_d$ as follows:
$d(i) = \sum_{j=1}^{n} m_d(i,j)$ (2)
where the self-distance term $m_d(i,i)$ is zero. this term may be omitted to enforce a strict n-1 dimensions for n input sequences, but we leave it in to simplify subsequent calculations. we define i* as the index that minimizes d(i). the shortest-path cnn is therefore sequence i*. for alignments where clusters of closely related sequences exist, giving many values of $m_d(i,j)$ close to zero, this method will tend to place the cnn within a cluster. to overcome this problem, the arithmetic mean and median, respectively, were used to determine the mean cnn and the median cnn. the values of d (equation (2)) may be averaged to produce the mean total path distance $\bar{d}$:
$\bar{d} = \frac{1}{n} \sum_{i=1}^{n} d(i)$
where again n is the total number of sequences in the alignment. we now re-define i* as the index that minimizes $|d(i) - \bar{d}|$. in the event of equation (5) returning zero, the mean cnn and the true centroid are identical. as with all variables using means, the mean cnn is liable to skewing by outliers. we generate a vector d over $i \in \{1, 2, \ldots, n\}$, in which each entry d(i) represents the total path length for sequence i (equation (2)).
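the total path lengths and the first two cnn definitions can be sketched in a few lines of pure python. this is an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the mean cnn is taken to minimize the absolute difference between d(i) and the mean, ties are broken by the lowest index, and the 5x5 distance matrix is made up (points at 0, 1, 3, 4 and 10 on a line).

```python
from typing import List


def total_path_lengths(dist: List[List[float]]) -> List[float]:
    """d(i): sum of distances from sequence i to all sequences (self-distance is 0)."""
    return [sum(row) for row in dist]


def shortest_path_cnn(dist: List[List[float]]) -> int:
    """Index i* that minimizes the total path length d(i)."""
    d = total_path_lengths(dist)
    return min(range(len(d)), key=lambda i: d[i])


def mean_cnn(dist: List[List[float]]) -> int:
    """Index i* whose d(i) is closest to the mean total path length (assumed |d(i) - mean|)."""
    d = total_path_lengths(dist)
    mean_d = sum(d) / len(d)
    return min(range(len(d)), key=lambda i: abs(d[i] - mean_d))


# Made-up symmetric distance matrix for five sequences; the last one is an outlier.
dist = [
    [0, 1, 3, 4, 10],
    [1, 0, 2, 3, 9],
    [3, 2, 0, 1, 7],
    [4, 3, 1, 0, 6],
    [10, 9, 7, 6, 0],
]
print(total_path_lengths(dist))  # [18, 15, 13, 14, 32]
print(shortest_path_cnn(dist))   # 2
print(mean_cnn(dist))            # 0
```

in this toy example the outlier pulls the mean total path length up to 18.4, so the mean cnn (index 0) differs from the shortest-path cnn (index 2), which mirrors the sensitivity of the mean to outliers noted above.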
the values of vector d are then ranked in ascending order, $x_{s(1)}$ to $x_{s(n)}$, to produce vector $d_s$. the median cnn is the sequence with value d(i) situated in the middle of the array $d_s$, at d(m), where d(m) is either $d(m_{odd})$ or $d(m_{even})$ for alignments with odd or even numbers of sequences respectively. we now re-define i* as the index that minimizes $|d(i) - d(m)|$. again, in the event of equation (9) returning zero, the median cnn and the true centroid are identical. as with all variables using medians, the median cnn is liable to skewing by the presence in the alignment of multiple sequences with the same value of d(i).
the choice of solved structures as templates for homology modelling, and the choice of targets to be modelled, within each genus was governed by the following rules:
(1) for each genus, the solved structure that covered the highest proportion of the rdrp or rt sequence was chosen as the template for that genus.
(2) if more than one candidate template structure was found at this sequence length, the structure with the lowest resolution value in angstroms was selected. see table 2 for the templates satisfying these two criteria.
(3) within each genus, the sequence with the greatest genetic distance from the template was chosen as the target for homology modelling. see table 3 for the template-target pairs satisfying this criterion.
(4) criterion 3 was applied to find template-target pairs in different genera (see table 4) and different families (see table 5), thus testing the limits of homology modelling at high genetic distances.
homology modelling was carried out using the molecular operating environment (moe v.2016.08, chemical computing group, montreal h3a 2r7, canada). ten intermediate models were produced using the amber10:eht forcefield under medium refinement.
the model that scored best under the generalised born/volume integral (gb/vi) was selected to undergo further energy minimisation using protonate3d, which predicts the location of hydrogen atoms using the model's 3d coordinates [17, 18]. to assess the stereochemical quality of the homology models produced, ramachandran plots were derived in moe, and used to calculate the proportion of bad outlier φ-ψ angles in the model, after subtraction of the number of outlier φ-ψ angles in the template. generally, an outlier angle percentage below 0.05% indicates a very high quality model, and a percentage below 2% indicates a good quality model [19]. models were superposed with their templates in moe and the root-mean-square deviation (rmsd) value derived for the alpha carbons (cα) in the two structures. generally, an rmsd below 2 Å indicates a good quality model [20]. qualitative model energy analysis (qmean) was used to analyse models using both statistical and predictive methods [21]. the qmean z-score is an overall measure of the quality of the model when compared to similar models from a pdb reference set of x-ray crystallography-solved structures. a z-score of 0 would indicate a model of the same quality as a similar high quality x-ray crystallographic structure, while a z-score below −4.00 indicates a low quality model [22]. maximum likelihood (ml) trees [23] were produced for each genus in mega. the ml tree and the corresponding multiple sequence alignment were input into the ancestral reconstruction server, fastml [24]. the reconstructed sequence for the root of the tree, i.e. the putative common ancestor rdrp or rt sequence for the genus, was used as the target for homology modelling in moe, using the template chosen according to the rules in section 2.5. the reconstructed ancestral sequence was added to the alignment and the force-directed graph re-drawn. fig. 1b, showing the target-template pairs for homology modelling, may be compared with fig.
1c, showing the ancestor-template pairs. our first observation is that there are still large areas of the viral taxonomy where no solved rdrp structures exist. no suitable templates for homology modelling were found within the entire nidovirales order of rna viruses. this order contains several coronaviruses important to human health, including severe acute respiratory syndrome-related coronavirus (sars-cov) and middle east respiratory syndrome-related coronavirus (mers-cov) [25]. in the order mononegavirales, vesiculovirus was the only genus with a solved rdrp structure suitable for homology modelling. however, this order contains many medically important viruses such as zaire ebolavirus, hendra henipavirus, measles morbillivirus, and mumps rubulavirus [26]. in the order bunyavirales, phenuiviridae stands out as an important family lacking a solved rdrp, despite containing various human-infective arboviruses such as rift valley fever phlebovirus and sandfly fever naples phlebovirus [27] (table 1). fig. 1 shows two-dimensional force-directed graphs of similarity for each genus with more than four rdrp reference sequences (or rt sequences in the case of lentivirus). in principle, it would be possible to draw force-directed graphs for entire families and even orders. however, the input to qgraph is the similarity matrix calculated from the distance matrix, and the distance matrix is calculated in mega from an alignment. once taxonomic distances begin to extend beyond genera, alignment becomes progressively less reliable, with all the downstream statistics tending to degrade as a consequence. we therefore confine our construction of force-directed graphs to intra-genus comparisons. it is evident from fig. 1 that sequences are not necessarily evenly distributed in sequence space. clustering is noticeable in the genus flavivirus, with two sub-groups and an outlier sequence evident. mammarenavirus also shows division into two sub-groups.
by contrast, picobirnavirus has only five relatively equidistant reference sequences, thus producing a highly regular pentagram. similarly, rotavirus has eight reference sequences, with four at each end of a fairly regular cuboid. fig. 1a also shows how the various methods (equations (2)–(9)) for determining the cnn of sequence space for each genus are in poor agreement. only in rotavirus and picobirnavirus are mean and median cnns found in the same sequence. fig. 1a also shows that the best solved structure for the purposes of template choice in homology modelling is rarely close to the centre of sequence space. only in lentivirus is the optimal template also the mean cnn, and only in vesiculovirus is the optimal template a shortest-path cnn. fig. 1b shows the relations of the template-target pairs in sequence space, illustrating how intra-genus homology modelling template-target selection attempts to traverse the largest genetic distance available within the genus.
figs. 2 and 3 compare, for genera orthohantavirus and mammarenavirus respectively, the force-directed graphs of fig. 1 with the three-dimensional equivalent output of multidimensional scaling. fig. 2 shows a sequence clustering within orthohantavirus that is not readily apparent in the force-directed graph. the cnns are distributed among four clusters, as there is no sequence close to the geometrical centre of the three-dimensional space, where the notional centroid is located. the solved structure has 10 other sequences in its proximity in the three-dimensional space, roughly equivalent to the lower right quadrant of the two-dimensional force-directed graph. similarly, the shortest-path cnn and mean cnn are both located within another three-dimensional cluster also containing 11 sequences, which is roughly equivalent to the upper right quadrant of the two-dimensional force-directed graph. fig. 3 presents a similar picture for mammarenavirus. the force-directed graph for mammarenavirus has more obvious clustering than for orthohantavirus, showing a lower-left to top-right split. in the three-dimensional representation, these are equivalent, respectively, to the three clusters on the right and two clusters on the left. as with orthohantavirus, there is no cnn near the geometrical centre of the three-dimensional space, but the cnns are distributed around two clusters. three dimensional representations of all the genera in fig. 1 are available from the link in the raw data section.
table 2 solved structures of rdrps and reverse transcriptase (for hiv-1) selected as templates for homology modelling. all are derived by x-ray crystallography except 5a22, which is a cryo-electron microscopy structure. for protein coverage, cell markers indicate whether the template covers more than 90% of the sequence or less. for φ-ψ outliers and qmean z-score, cell markers indicate good or poor quality, determined by the following thresholds: φ-ψ = 2%, qmean z-score = −4.00.
table 3 homology modelling at intra-genus, inter-species level. templates are as given in table 2. targets are the rdrp (or reverse transcriptase for lentivirus) sequences from the reference genome accession numbers given. rmsd: root mean square deviation in angstroms between template and model when superposed in moe. cell markers indicate good or poor quality, determined by the following thresholds: φ-ψ < 2%; qmean z-score > −4.00; rmsd < 2 å. a further marker indicates good quality, but using a partial template (see table 1). *imjin thottimvirus was reclassified in 2018 by the international committee on taxonomy of viruses (ictv) in a new genus thottimvirus.
table 4 homology modelling at intra-family, inter-genus level. templates are as given in table 2. targets are the rdrp (or reverse transcriptase for spumavirus) sequences from the reference genome accession numbers given. rmsd: root mean square deviation in angstroms between template and model when superposed in moe. cell markers indicate good or poor quality, determined by the following thresholds: φ-ψ < 2%; qmean z-score > −4.00; rmsd < 2 å.
table 5 homology modelling at intra-order, inter-family level. templates are as given in table 2. targets are the rdrp (or reverse transcriptase for lentivirus) sequences from the reference genome accession numbers given. rmsd: root mean square deviation in angstroms between template and model when superposed in moe. cell markers indicate good or poor quality, determined by the following thresholds: φ-ψ < 2%; qmean z-score > −4.00; rmsd < 2 å.
fig. 1. force-directed graph visualisations of similarity of rdrps (or reverse transcriptase for lentivirus) within genera. the genetic distance matrix for each alignment was converted into a similarity matrix (equations (1) and (2)). the fruchterman-reingold algorithm (500 minimisation iterations) was implemented in r module qgraph to produce a force-directed graph. relative similarity is represented by node proximity, and absolute similarity is proportional to edge thickness. the solved structure and the three types of centroid nearest neighbour (cnn) sequences are highlighted. the species names corresponding to the numbered nodes are listed in the supplementary table. cardiovirus has fewer than four reference sequences and is omitted. a: location of solved structure and the three cnns in sequence space (equations (3)–(7)). some genera have two median cnns.
homology modelling was carried out as follows:
(1) intra-genus, inter-species (11 models, table 3)
(2) intra-family, inter-genus (5 models, table 4)
(3) intra-order, inter-family (7 models, table 5)
(4) intra-genus, on reconstructed common ancestor (12 models, table 6)
table 3 shows that homology modelling with template and target within the same genus produced good quality models in most cases, as judged by percentage of φ-ψ outliers and rmsd within the high quality range.
only the models for american bat vesiculovirus and tamana bat virus have percentages of φ-ψ outliers outside of the high quality range. qmean, however, is rather more critical of the output, with only the model for porcine picobirnavirus falling within the high quality range. the model for imjin thottimvirus scores eighth best on percentage of φ-ψ outliers and second best on rmsd, despite the re-classification (occurring after the completion of our experimental work) by the ictv of this virus, originally in genus orthohantavirus, into a new thottimvirus genus [28]. it should be noted that the models for imjin thottimvirus, burana orthonairovirus and brazilian mammarenavirus were based on very short template structures (see table 2). table 4 shows that homology modelling with template and target within the same family but different genera still produced good quality models in most cases, as judged by percentage of φ-ψ outliers and rmsd within the high quality range. only the models for lleida bat lyssavirus and macaque simian foamy virus have percentages of φ-ψ outliers outside of the high quality range. however, once again, qmean assesses all models as outside the high quality range. table 5 shows that homology modelling with template and target within the same order but in different families is a far more difficult proposition than at the lower taxonomic levels. the model for mammalian orthobornavirus 1 fails all three quality tests and only the model for rift valley fever phlebovirus manages to pass two out of three. table 6 shows that modelling the structure of the reconstructed sequence of the common ancestor of each genus produces models of the same standard as intra-genus modelling (compare tables 3 and 6). by contrast with almost all the other models, the qmean scores are within the high quality range, with only two exceptions, the common ancestors of genera rotavirus and vesiculovirus. fig.
1c shows the force-directed graphs with the locations of the ancestral sequences added. table 7 summarises the results of tables 3–6 inclusive. as the taxonomical distance increases, production of high quality homology models becomes more difficult. however, modelling the reconstructed ancestral sequence of each genus is typically productive of a better scoring model even than the real sequence targets chosen for intra-genus modelling. fig. 4 shows representative examples of homology models of high and low quality superimposed with their template solved structure along with their corresponding ramachandran plots and qmean quality scores. all homology models in tables 3–6 are available from the link in the raw data section.
the first objective of this study was to identify viral taxa which are comparatively lacking in solved structures for rna-dependent rna polymerase (rdrp). we observed that the entire order nidovirales, the families bornaviridae, filoviridae and paramyxoviridae within the order mononegavirales, and the family phenuiviridae within the order bunyavirales, fall into this category. additionally, within the genera orthohantavirus, orthonairovirus and mammarenavirus, all within the order bunyavirales, the solved structure available for rdrp covers less than 10% of the protein sequence. given the medical importance of many viruses within these taxa, and the number of anti-viral drugs that target rdrps, we suggest that they are prioritized for x-ray crystallography to close the "sequence-structure gap". our second objective was to assess how well homology modelling could provide models that might serve for computer-assisted drug discovery of novel anti-viral compounds. to assist in the visualization of sequence space, we produced the first application of force-directed graphs to protein sequences (fig. 1). we also applied multidimensional scaling for comparative purposes (figs. 2 and 3).
force-directed graphs enable the visualization of complex data in two dimensions. the three dimensional visualization produced from multidimensional scaling is visually richer, but this benefit can only be appreciated when a viewing application such as spotfire is available so that the three-dimensional image can be rotated. force-directed graphs convey much of the information in a single image which may be printed on a page or viewed on screen. this two-dimensional collapsing of sequence space also allows for easy simultaneous comparison of multiple datasets, in the present case multiple genera, which cannot readily be performed if separate three-dimensional viewers need to be open. the most common method of visualizing sequence space is the phylogenetic tree. for instance, starting from a distance matrix, agglomerative hierarchical clustering, such as the upgma method [29], can be performed to generate a tree. slightly more sophisticated methods, such as neighbour-joining [30], can generate trees where the branch lengths are proportional to genetic distance. force-directed graphs do not represent genetic distance as accurately as phylogenetic trees, since the distances between nodes, although optimized to reflect relatedness, are constrained by the fruchterman-reingold algorithm to the best representation in two dimensions. however, force-directed graphs again allow easier simultaneous comparison of several data sets than phylogenetic trees. fig. 1 would be impossible to create on a single page if trees were used instead of force-directed graphs. trees represent ancestral sequences as nodes on the tree, with only existing taxa as leaves. force-directed graphs, by contrast, allow ancestral sequences to be represented in the same way as existing ones. fig. 1c shows that ancestral sequences do not necessarily appear as outliers in force-directed graphs.
indeed, for genera flavivirus, hepacivirus, orthobunyavirus and orthohantavirus in particular, the insertion of the reconstructed ancestral sequence into the force-directed graph in fig. 1c does not overly distort its original shape in fig. 1a–b. the reason for this becomes apparent when one considers a phylogenetic tree represented in unrooted "star" format. the ancestral sequence is then at the centre of the star topology and it can be seen that the genetic distance from the root to any particular leaf sequence may often be less than for many pairwise leaf sequence combinations. we did not perform calculation of centroid nearest neighbours (cnns) for alignments incorporating reconstructed ancestral sequences, but we are tempted to speculate that many of the ancestral sequences would have been cnns, had they been included.
table 6 homology modelling the common ancestor for each genus. templates are as given in table 2. targets are the reconstructed ancestral rdrp (or reverse transcriptase for lentivirus) sequences. rmsd: root mean square deviation in angstroms between template and model when superposed in moe. cell markers indicate good or poor quality, determined by the following thresholds: φ-ψ < 2%; qmean z-score > −4.00; rmsd < 2 å.
table 7 mean model (or structure) quality. the top line shows the mean quality scores for the solved structures used. the other lines show the mean quality scores for the models produced at various levels of taxonomic distance between template and target. cell markers indicate good or poor quality, determined by the following thresholds: φ-ψ < 2%; qmean z-score > −4.00; rmsd < 2 å. numbers in brackets indicate the revised scores if the model for imjin thottimvirus is moved out of the intra-genus category and into the intra-family category in the light of its subsequent transfer into the new genus thottimvirus.
), and outliers ( cross, text).
the z-score graphics show model quality on a sliding scale from low quality to high quality. qmean4 shows the overall z-score, "all atom" shows the average z-score for all of the atoms in the model, "cbeta" the z-score for all cβ carbons, "solvation" is a measure of how accessible the residues are to solvents, and "torsion" is a measure of torsion angle for each residue compared to adjacent residues.
it is important to remember that homology models are theoretical constructions and caution must be exercised in treating them as input material for further experiments. among the various statistics for assessment of model quality, φ-ψ outlier percentage is a measure of the proportion of implausible dihedral angles in the model, and indicates where parts of the model backbone are likely to be incorrectly predicted. nevertheless, it is also important not to become too dependent on statistics such as φ-ψ outlier percentage, as "bad" angles do occasionally occur in solved structures. for instance, in the present study, the thresholds of <0.05% for a very high quality model, and <2% for a good quality model, given by lovell et al. [19] would suggest that six of the twelve template solved structures used here (table 2) would not have been assessed as "very high quality" had they been models rather than solved structures. indeed, the templates from indiana vesiculovirus and rotavirus a have more than 0.5% φ-ψ outliers, and also have poor quality scores for qmean. these two structures also have the poorest resolution of any of our templates, at > 3 å. the poor quality scoring may therefore simply be a consequence of uncertainties in positioning of atoms in these structures. one might reasonably posit that the use of template solved structures having such issues might influence the resulting models to contain the same outliers. however, the model for rotavirus i has a lower level of φ-ψ outliers than its rotavirus a template (table 3).
as might be expected, production of high quality models becomes more difficult as the genetic distance between target and template increases, as shown in tables 3–5. nevertheless, even at the level of template-target pairs in separate genera (table 4), the average performance is acceptable, as summarized in table 7. we therefore suggest that homology modelling may be used to produce rdrp models for research use even for genera where no solved structure exists, provided a template structure exists within the same family. here, we provide examples (table 4) of such successful inter-genus, intra-family models for genera coltivirus and parechovirus. our inter-genus models for lyssavirus and spumavirus are slightly less successful. moving to the next taxonomic level, models with template-target pairs in separate families (table 5) are generally less successful. one exception is our model for family phenuiviridae, which is better than some of the intra-family models. this is encouraging, since phenuiviridae is a family without any solved rdrp structure. homology models have been produced at much larger taxonomic distances than those dealt with here, for instance from bacteria to eukaryotes [31], so it should be stressed that we make no claim for the generality of our findings outside of the viral orders under consideration, or for proteins other than rdrp. multi-domain proteins in particular may produce higher quality models for some domains than others. one surprising result was the high quality of the models of reconstructed ancestral sequences (table 6, summarized in table 7). as previously discussed, this may be due to the fact that the ancestral sequence is, assuming a regular molecular clock, potentially equally related to all descendent members of its genus. in this paper, we calculated centroid nearest neighbours (cnns) as the central points in sequence space for each genus (fig. 1).
a reconstructed ancestral sequence may also be considered as a candidate central point. the value of central points is that they may serve as targets that could be used to make models representative of their genus as a whole. for instance, the shortest-path, mean and median cnns of genus orthohantavirus are sequences 16, 22 and 7 (see supplementary table for a list of sequences for each genus) , representing sin nombre orthohantavirus, rockport orthohantavirus and cao bang orthohantavirus respectively. the partial solved structure used as the template for modelling in the genus orthohantavirus in the present paper is from hantaan orthohantavirus (5ize, see table 2 ) and the target used, imjin thottimvirus (sequence 27 in orthohantavirus panel of fig. 1) , is now classified as belonging to a new genus thottimvirus (table 3) . the three cnns, sin nombre orthohantavirus, rockport orthohantavirus and cao bang orthohantavirus are 71%, 64% and 75% identical to 5ize respectively, whereas imjin thottimvirus is only 58% identical. the latter was of course chosen to test the effectiveness of intra-genus homology modelling over as wide a genetic distance as possible (see section 2.5). for the performance of subsequent experimental procedures on orthohantavirus rdrps, for instance docking to discover novel anti-viral compounds, a homology model corresponding to one of the three cnns mentioned above or to the reconstructed ancestor (table 6 ) would be the preferred target, along with the existing solved structure. where a solved rdrp structure exists in a genus, it should be used. however, if that solved structure is not a cnn, a homology model of a cnn or ancestral sequence should be produced for comparative purposes. where no solved rdrp structure exists in a genus, a structure from another genus in the same family may be used. 
on the basis of our investigations, we recommend a procedural flowchart for selection of an rdrp structure for further study, for instance docking to discover novel anti-viral compounds, in any rna virus genus of interest (fig. 5). where a solved structure exists within a genus, it is the obvious choice for further experiments. however, where that solved structure is far from any of the cnn sequences of the genus, as judged by the force-directed graph, a cnn may also be homology modelled for comparative purposes, using the existing solved structure as a template. any differential performance of the solved structure and the homology model in, for instance, a docking experiment, may give clues as to the generality of conclusions derived from the solved structure alone. a reconstructed ancestral rdrp may also be used as an alternative to, or in addition to, a cnn. the limits of homology modelling would appear, on the basis of the results presented here, to be at the intra-family, inter-genus level. template-target pairs in different viral families are unlikely to be of practical use, as the predicted quality of the resulting models is low. our models were produced using moe, and we have not performed comparisons using other modelling tools, such as swiss-model [31] or modeller [32]. we feel that it is unlikely that significant differences in output would be produced, but when the object of the exercise is drug-discovery, we recommend that the protocol in fig. 5 be implemented using several alternative modelling software packages. crystallographic structural genome projects are badly needed to close the sequence-structure gap. in the meantime, systematic attempts to fill the gaps via homology modelling may be useful. however, for many taxa, including all of the order nidovirales and much of mononegavirales, the paucity of solved structures to act as templates remains a serious obstacle.
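the selection procedure described above can be written down as a small decision function. this is a sketch that follows the prose only (fig. 5 itself is not reproduced here); the function name, its boolean arguments and the returned labels are hypothetical, and judging whether a solved structure is "near" a cnn is left to the user, as in the text.

```python
def choose_rdrp_structures(solved_in_genus: bool,
                           solved_is_near_cnn: bool,
                           solved_in_family: bool) -> list:
    """Return the structures suggested for further study (e.g. docking).

    The three flags are assumptions of this sketch:
    - solved_in_genus: a solved rdrp structure exists in the genus of interest
    - solved_is_near_cnn: that structure lies close to a cnn in the
      force-directed graph (a visual judgement in the text)
    - solved_in_family: a solved structure exists in another genus of the family
    """
    if solved_in_genus:
        picks = ["solved structure"]
        if not solved_is_near_cnn:
            # model a cnn or the reconstructed ancestor for comparison,
            # using the solved structure as the template
            picks.append("homology model of a cnn or reconstructed ancestor")
        return picks
    if solved_in_family:
        # intra-family, inter-genus modelling was found to be acceptable
        return ["homology model on a template from another genus in the family"]
    # templates from a different family are unlikely to be of practical use
    return []
```

an empty list corresponds to the situation described for nidovirales, where no suitable template is available and a solved structure is needed first.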
all code, inputs and outputs are available from: https://doi.org/ 10.17635/lancaster/researchdata/276. a three-dimensional model of the myoglobin molecule obtained by x-ray analysis protein modeling: what happened to the "protein structure gap the high throughput sequence annotation service (ht-sas) -the shortcut from sequence to true medline words the evolution and emergence of rna viruses crystal structure of the full-length japanese encephalitis virus ns5 reveals a conserved methyltransferase-polymerase interface molecular docking revealed the binding of nucleotide/ side inhibitors to zika viral polymerase solved structures using bioinformatics tools for the discovery of dengue rna-dependent rna polymerase inhibitors epidemiological characteristics of humaninfective rna viruses the rcsb protein data bank: integrative view of protein, gene and 3d structural information reference sequence (refseq) database at ncbi: current status, taxonomic expansion, and functional annotation mafft: iterative refinement and additional methods mega7: molecular evolutionary genetics analysis version 7.0 for bigger datasets muscle: multiple sequence alignment with high accuracy and high throughput network visualizations of of relationships in psychometric data graph drawing by force-directed placement some properties of classical multidimensional scaling protonate 3d: assignment of ionization states and hydrogen coordinates to macromolecular structures the generalized born/volume integral implicit solvent model: estimation of the free energy of hydration using london dispersion instead of atomic surface area structure validation by calpha geometry: phi,psi and cbeta deviation on the accuracy of homology modeling and sequence alignment methods applied to membrane proteins qmean: a comprehensive scoring function for model quality assessment toward the estimation of the absolute quality of individual protein structure models evolutionary trees from dna sequences: a maximum likelihood 
approach fastml: a web server for probabilistic reconstruction of ancestral sequences sars and mers: recent insights into emerging coronaviruses taxonomy of the order mononegavirales: second update emerging phleboviruses taxonomy of the order bunyavirales: second update construction of phylogenetic trees for proteins and nucleic acids: empirical evaluation of alternative matrix methods the neighbor-joining method: a new method for reconstructing phylogenetic trees swiss-model and the swiss-pdbviewer: an environment for comparative protein modeling modeller: generation and refinement of homology-based protein structure models supplementary data to this article can be found online at https://doi.org/10.1016/j.jmgm.2019.07.014. key: cord-102774-mtbo1tnq authors: sun, yuliang; fei, tai; li, xibo; warnecke, alexander; warsitz, ernst; pohl, nils title: real-time radar-based gesture detection and recognition built in an edge-computing platform date: 2020-05-20 journal: nan doi: 10.1109/jsen.2020.2994292 sha: doc_id: 102774 cord_uid: mtbo1tnq in this paper, a real-time signal processing frame-work based on a 60 ghz frequency-modulated continuous wave (fmcw) radar system to recognize gestures is proposed. in order to improve the robustness of the radar-based gesture recognition system, the proposed framework extracts a comprehensive hand profile, including range, doppler, azimuth and elevation, over multiple measurement-cycles and encodes them into a feature cube. rather than feeding the range-doppler spectrum sequence into a deep convolutional neural network (cnn) connected with recurrent neural networks, the proposed framework takes the aforementioned feature cube as input of a shallow cnn for gesture recognition to reduce the computational complexity. in addition, we develop a hand activity detection (had) algorithm to automatize the detection of gestures in real-time case. 
the proposed had can capture the time-stamp at which a gesture finishes and feed the hand profile of all the relevant measurement-cycles before this time-stamp into the cnn with low latency. since the proposed framework is able to detect and classify gestures at limited computational cost, it could be deployed in an edge-computing platform for real-time applications, whose performance is notably inferior to a state-of-the-art personal computer. the experimental results show that the proposed framework has the capability of classifying 12 gestures in real-time with a high f1-score.
a video is available on https://youtu.be/ir5nnzvzblk. this article will be published in a future issue of ieee sensors journal. doi: 10.1109/jsen.2020.2994292
radar sensors are being widely used in many long-range applications for the purpose of target surveillance, such as in aircraft, ships and vehicles [1], [2]. thanks to the continuous development of silicon techniques, various electric components can be integrated in a compact form at a low price [2], [3]. since radar sensors become more and more affordable to the general public, numerous emerging short-range radar applications, e.g., non-contact hand gesture recognition, are gaining tremendous importance in efforts to improve the quality of human life [4], [5]. hand gesture recognition enables users to interact with machines in a more natural and intuitive manner than conventional touchscreen-based and button-based human-machine-interfaces [6]. for example, google has integrated a 60 ghz radar into the smartphone pixel 4, which allows users to change songs without touching the screen [7]. what's more, viruses and bacteria surviving on surfaces for a long time could contaminate the interface and cause health problems. for instance, in 2020, tens of thousands of people have been infected with covid-19 by contacting such contaminated surfaces [8].
radar-based hand gesture recognition allows people to interact with the machine in a touch-less way, which may reduce the risk of being infected with a virus in a public environment. unlike optical gesture recognition techniques, radar sensors are insensitive to the ambient light conditions; the electromagnetic waves can penetrate dielectric materials, which makes it possible to embed them inside devices. in addition, because of privacy-preserving reasons, radar sensors are preferable to cameras in many circumstances [9]. furthermore, computer vision techniques applied to extract hand motion information in every frame are usually not power efficient, and are therefore not suitable for wearable and mobile devices [10]. motivated by the benefits of radar-based touch-less hand gesture recognition, numerous approaches were developed in recent years. the authors in [9], [11], [12] extracted physical features from micro-doppler signature [1] in the time-doppler-frequency (tdf) domain to classify different gestures. li et al. [13] extracted sparsity-based features from tdf spectrums for gesture recognition using a doppler radar. in addition to doppler information of hand gestures, the google soli project [10], [14] utilized the range-doppler (rd) spectrums for gesture recognition via a 60 ghz frequency-modulated continuous wave (fmcw) radar sensor. thanks to the wide available bandwidth (7 ghz), their systems could recognize fine hand motions. similarly, the authors in [15]–[17] also extracted hand motions based on rd spectrums via an fmcw radar. in [18], [19], apart from the range and doppler information of hand gestures, they also considered the incident angle information by using multiple receive antennas to enhance the classification accuracy of their gesture recognition system. however, none of the aforementioned techniques exploited all the characteristics of a gesture simultaneously, i.e., range, doppler, azimuth, elevation and temporal information.
for example, in [9]–[16], they could not differentiate two gestures which share similar range and doppler information. this restricts the design of gestures to be recognized. in order to classify different hand gestures, many research works employed artificial neural networks for this multiclass classification task. for example, the authors in [12], [18]–[20] considered the tdf spectrums or range profiles as images and directly fed them into a deep convolutional neural network (cnn). other research works [14], [15], [21], by contrast, considered the radar data over multiple measurement-cycles as a time-sequential signal, and utilized both cnns and recurrent neural networks (rnns) for gesture classification. the soli project [14] employed a 2-dimensional (2-d) cnn with a long short-term memory (lstm) to extract both the spatial and temporal features, while the latern [21], [22] replaced the 2-d cnn with a 3-d cnn [23] followed by several lstm layers. because the 3-d cnn could extract not only the spatial but also the short-term temporal information from the rd spectrum sequence, it results in a better classification accuracy than the 2-d cnn [24]. however, the proposed 2-d cnn, 3-d cnn and lstm for gesture classification require huge amounts of memory in the system, and are computationally inefficient. although choi et al.
[16] projected the range-doppler-measurement-cycles into range-time and doppler-time to reduce the input dimension of the lstm layer and achieved a good classification accuracy in real-time, the proposed algorithms were implemented on a personal computer with powerful computational capability. as a result, the aforementioned radar-based gesture recognition systems in [12], [14]–[16], [18]–[21] are not applicable for most commercial embedded systems such as wearable devices and smartphones, in which both memory and computational power are limited. in this paper, we present a real-time gesture recognition system using a 60 ghz fmcw radar in an edge-computing platform. the proposed system is expected to be applied in short-range applications (e.g., tablet, display, and smartphone) where the radar is assumed to be stationary to the user. the entire signal processing framework is depicted in fig. 1. after applying the 2-dimensional finite fourier transform to the raw data, we select a certain number of points from the resulting rd spectrum as an intermediate step rather than directly putting the entire spectrum into deep neural networks. additionally, thanks to the l-shaped receive antenna array, the angle of arrival (aoa) information of the hand, i.e., azimuth and elevation, can be calculated. for every measurement-cycle, we store this information in a feature matrix with reduced dimensions. by selecting a few points from the rd spectrum, we reduce the input dimension of the classifier and limit the computational cost. further, we present a hand activity detection (had) algorithm called the short-term average/long-term average (sta/lta)-based gesture detector. it employs the concept of sta/lta [25] to detect when a gesture comes to an end, i.e., the tail of a gesture. after detecting the tail of a gesture, we arrange the feature matrices belonging to the measurement-cycles, which are previous to this tail, into a feature cube.
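the sta/lta idea mentioned above can be illustrated with a generic sketch. it is not the detector developed later in the paper: the per-cycle energy input, the window lengths, the two thresholds and the rule "the ratio first rises above an on-threshold and later falls below an off-threshold" are all assumptions made here for illustration, borrowed from the common form of sta/lta triggering.

```python
def sta_lta_ratio(energy, n_sta, n_lta):
    """Ratio of short-term to long-term average of a per-cycle energy series.

    Both averages are taken over trailing windows ending at the current
    measurement-cycle; the ratio is 0.0 until a full long-term window exists.
    """
    ratios = []
    for n in range(len(energy)):
        if n + 1 < n_lta:
            ratios.append(0.0)
            continue
        sta = sum(energy[n + 1 - n_sta:n + 1]) / n_sta
        lta = sum(energy[n + 1 - n_lta:n + 1]) / n_lta
        ratios.append(sta / lta if lta > 0 else 0.0)
    return ratios


def find_gesture_tail(energy, n_sta=3, n_lta=10, on=1.5, off=0.5):
    """Return the index of the cycle at which a gesture is declared finished.

    Assumed rule: activity starts when the ratio exceeds `on`; the tail is
    the first later cycle at which the ratio drops below `off`.
    Returns None if no complete gesture is found.
    """
    active = False
    for n, r in enumerate(sta_lta_ratio(energy, n_sta, n_lta)):
        if not active and r > on:
            active = True
        elif active and r < off:
            return n
    return None
```

with a synthetic series of twelve quiet cycles, six cycles of ten-fold higher energy and twelve quiet cycles again, the ratio climbs shortly after the burst starts and collapses a couple of cycles after it ends, which is the behaviour a tail detector relies on.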
this feature cube constructs a compact and comprehensive gesture profile which includes the features of all the dominant point scatters of the hand. it is subsequently fed into a shallow cnn for classification. the main contributions are summarized as follows:
• the proposed signal processing framework is able to recognize more gestures (12 gestures) than those reported in other works in the literature. the framework can run in real-time on an edge-computing platform with limited memory and computational capability.
• we develop a multi-feature encoder to construct the gesture profile, including range, doppler, azimuth, elevation and temporal information, as a feature cube with reduced dimensions for the sake of data processing efficiency.
• we develop an had algorithm based on the concept of sta/lta to reliably detect the tail of a gesture.
• since the proposed multi-feature encoder has encoded all necessary information in a compact manner, it is possible to deploy a shallow cnn with a feature cube as its input to achieve a promising classification performance.
• the proposed framework is evaluated twofold: its performance is compared with the benchmark in the off-line scenario, and its recognition ability in the real-time case is assessed as well.
the remainder of this paper is organized as follows. section ii introduces the fmcw radar system. section iii describes the multi-feature encoder including the extraction of range, doppler and aoa information. in section iv, we introduce the had algorithm based on the concept of the sta/lta. in section v, we present the structure of the applied shallow cnn for gesture classification. in section vi, we describe the experimental scenario and the collected gesture dataset. in section vii, the performance is evaluated in both off-line and real-time cases. finally, conclusions are given in section viii.
our 60 ghz radar system adopts the linear chirp sequence frequency modulation [26] to design the waveform.
after mixing, filtering and sampling, the discrete beat signal consisting of $I_T$ point scatters of the hand in a single measurement-cycle from the $z$-th receive antenna can be approximated as [27] : where the range and doppler frequencies $f_{ri}$ and $f_{di}$ are given as: respectively, $r_i$ and $v_{ri}$ are the range and relative velocity of the $i$-th point scatter of the hand, $f_B$ is the available bandwidth, $T_c$ is the chirp duration, $\lambda$ is the wavelength at 60 ghz, $c$ is the speed of light, the complex amplitude $a_i^{(z)}$ contains the phase information, $I_s$ is the number of sampling points in each chirp, $I_c$ is the number of chirps in every measurement-cycle, and the sampling period is $T_s = T_c/I_s$. the 60 ghz radar system applied for gesture recognition is shown in fig. 2. the radar system has an l-shaped receive antenna array. to calculate the aoa in the azimuth and elevation directions, the spacing between two receive antennas in both directions is $d = \lambda/2$. a 2-d fft is applied to the discrete beat signal in (1) to extract the range and doppler information in every measurement-cycle [28]. the resulting complex-valued rd spectrum for the $z$-th receive antenna can be calculated as: where $w(u, v)$ is a 2-d window function, and $p$ and $q$ are the range and doppler frequency indexes. the range and relative velocity resolution can be deduced as: where the range and doppler frequency resolutions $\Delta f_r$ and $\Delta f_d$ are $1/T_c$ and $1/(I_c T_c)$, respectively. to improve the signal-to-noise ratio (snr), we sum the rd spectrums of the three receive antennas incoherently, which gives $\mathrm{RD}(p, q)$. to obtain the range, doppler and aoa information of the hand in every measurement-cycle, we select the $K$ points from $\mathrm{RD}(p, q)$ that have the largest magnitudes. the parameter $K$ is predefined, and its choice will be discussed in section vii-a.
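the processing chain described above (windowed 2-d fft per receive antenna, incoherent summation over the three antennas, selection of the strongest points) can be sketched as follows. this is a minimal numpy sketch under stated assumptions, not the c/c++ implementation used in the paper: the hann window, the summation of magnitudes (rather than powers) and the function names `rd_spectrum` and `top_k_points` are illustrative choices that the text does not specify.

```python
import numpy as np


def rd_spectrum(beat, window=None):
    """Windowed 2-D FFT per receive antenna plus incoherent sum.

    beat: complex array of shape (n_rx, I_s, I_c), i.e. samples per chirp
          by chirps per measurement-cycle for each receive antenna.
    Returns (spectra, rd): the complex spectra per antenna and the
    real-valued sum of their magnitudes.
    """
    n_rx, i_s, i_c = beat.shape
    if window is None:
        # Assumption: separable Hann window; the text only says "2-D window".
        window = np.outer(np.hanning(i_s), np.hanning(i_c))
    spectra = np.fft.fft2(beat * window, axes=(1, 2))
    # Assumption: "incoherent" is taken as summing magnitudes over antennas.
    rd = np.abs(spectra).sum(axis=0)
    return spectra, rd


def top_k_points(rd, k):
    """Return range bins p, Doppler bins q and magnitudes of the k largest
    entries of rd, sorted by decreasing magnitude."""
    flat = rd.ravel()
    idx = np.argpartition(flat, -k)[-k:]      # k largest, unordered
    idx = idx[np.argsort(flat[idx])[::-1]]    # order by decreasing magnitude
    p, q = np.unravel_index(idx, rd.shape)
    return p, q, rd[p, q]


# Synthetic check: one point scatterer at range bin 5, Doppler bin 3.
u = np.arange(32)[:, None]
v = np.arange(32)[None, :]
tone = np.exp(2j * np.pi * (5 * u / 32 + 3 * v / 32))
beat = np.stack([tone, tone, tone])           # three receive antennas
spectra, rd = rd_spectrum(beat)
p, q, mag = top_k_points(rd, 25)
print(p[0], q[0])                             # bin indices of the strongest point
```

the complex spectra are returned alongside the summed magnitudes because the phase of each antenna at the selected bins is still needed for the aoa estimation.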
then, we extract the range, doppler frequencies and the magnitudes of those $K$ points, which are denoted as $f_{rk}$, $f_{dk}$ and $a_k$, respectively, where $k = 1, \cdots, K$. the aoa can be calculated from the phase difference of the extracted points at the same positions of the complex-valued rd spectrums belonging to two receive antennas. the aoa in azimuth and elevation of the $k$-th point can be calculated as: respectively, where $\psi(\cdot)$ stands for the phase of a complex value, and $a_k^{(z)}$ is the complex amplitude $b^{(z)}_{f_{rk}, f_{dk}}$ from the $z$-th receive antenna. as a consequence, in every measurement-cycle, the $k$-th point in $\mathrm{RD}(p, q)$ has five attributes, i.e., range, doppler, azimuth, elevation and magnitude. as depicted in fig. 3, we encode the range, doppler, azimuth, elevation and magnitude of those $K$ points with the largest magnitudes in $\mathrm{RD}(p, q)$ along $I_L$ measurement-cycles into the feature cube $V$ with dimension $I_L \times K \times 5$. $V$ has five channels corresponding to the five attributes, and each element in $V$ at the $l$-th measurement-cycle can be described as: where $l = 1, \cdots, I_L$. similar to voice activity detection in automatic speech recognition systems, our gesture recognition system also needs to detect hand activity in advance, before forwarding the data to the classifier. this helps to design a power-efficient gesture recognition system, since the classifier is only activated when a gesture is detected rather than being kept active for every measurement-cycle. state-of-the-art event detection algorithms usually detect the start time-stamp of an event. for example, the authors in [25] used the sta/lta and power spectral density methods to detect when a micro-seismic event occurs. in the case of radar-based gesture recognition, we could also theoretically detect the start time-stamp of a gesture and consider that a gesture event occurs within the following $I_L$ measurement-cycles.
however, detecting the start time-stamp and forwarding the hand data in the following $I_L$ measurement-cycles to the classifier could cause a certain time delay, since the durations of the designed gestures usually differ. as illustrated in fig. 4(a), because the proposed multi-feature encoder requires $I_L$ measurement-cycles and the duration of a gesture is usually shorter than $I_L$, a delay occurs if we detect the start time-stamp of the gesture. therefore, as depicted in fig. 4(b), to reduce the time delay, our proposed had algorithm is designed to detect when a gesture finishes, i.e., the tail of a gesture, rather than the start time-stamp. we propose a sta/lta-based gesture detector to detect the tail of a gesture. the exponential moving average (ema) is used to detect the change of the magnitude signal at the $l$-th measurement-cycle, which is given as: where $\alpha \in [0, 1]$ is the predefined smoothing factor, and $x(l)$ is the range-weighted magnitude (rwm), which is defined as: where $a_{\max}$ represents the maximal magnitude among the $K$ points in $\mathrm{RD}(p, q)$ at the $l$-th measurement-cycle, $f_{r\max}$ denotes the range corresponding to $a_{\max}$, and the predefined coefficient $\beta$ denotes the compensation factor. the radar cross section (rcs) of a target is independent of the propagation path loss between the radar and the target. according to the radar equation [29], the measured magnitude of a target is a function of many arguments, such as the path loss and the rcs. as deduced in (10), we have built a coarse estimate of the rcs by multiplying the maximal range information with its measured magnitude to partially compensate the path loss. furthermore, we define $\mathrm{sta}(l)$ and $\mathrm{lta}(l)$ as the mean ema in the short and long windows at the $l$-th measurement-cycle: respectively, where $L_1$ and $L_2$ are the lengths of the short and long windows. the tail of a gesture is detected when the following conditions are fulfilled: where $\gamma_1$ and $\gamma_2$ are the predefined detection thresholds.
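the tail detector described above can be sketched in a few lines. this is a minimal sketch under stated assumptions and not the authors' implementation: the ema is written in its standard recursive form, the rwm is assumed to be the maximal magnitude scaled by the range raised to the power $\beta$, both windows are assumed to end at the current measurement-cycle, and each tail is reported only once. the function names and the parameter values in the example are hypothetical.

```python
def range_weighted_magnitude(a_max, r_max, beta):
    # Assumed form of the RWM: magnitude scaled by range**beta to
    # partially compensate the path loss.
    return a_max * r_max ** beta


def detect_gesture_tails(x, alpha, short_len, long_len, gamma1, gamma2):
    """Return the measurement-cycle indices at which a gesture tail is found.

    x: sequence of range-weighted magnitudes, one per measurement-cycle.
    Requires short_len <= long_len and gamma1 > 0.
    """
    ema, tails = [], []
    prev = x[0]                      # assumption: EMA starts at the first sample
    inside = False                   # True while the tail condition holds
    for l, x_l in enumerate(x):
        prev = alpha * x_l + (1.0 - alpha) * prev
        ema.append(prev)
        if l + 1 < long_len:
            continue                 # long window not filled yet
        sta = sum(ema[l - short_len + 1:l + 1]) / short_len
        lta = sum(ema[l - long_len + 1:l + 1]) / long_len
        # Motion present in the long window AND short-term level has dropped.
        hit = lta >= gamma1 and sta <= gamma2 * lta
        if hit and not inside:
            tails.append(l)          # report each tail only once
        inside = hit
    return tails


# Illustrative signal: silence, a 10-cycle gesture, silence again.
signal = [0.0] * 20 + [10.0] * 10 + [0.0] * 30
tails = detect_gesture_tails(signal, alpha=0.5, short_len=3, long_len=15,
                             gamma1=1.0, gamma2=0.2)
print(tails)
```

writing the ratio test as `sta <= gamma2 * lta` avoids a division by zero when no motion has been observed yet.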
fig. 5 illustrates that the tails of two gestures are detected via the proposed sta/lta gesture detector. according to (12), one condition for detecting the tail of a gesture is that the average of the rwm in the long window exceeds the threshold $\gamma_1$, which means that a hand motion appears in the long window. the other condition is that the ratio of the mean ema in the short window to that in the long window is lower than the threshold $\gamma_2$; in other words, it detects when the hand movement finishes. in practice, the parameters $\beta$, $\gamma_1$ and $\gamma_2$ in our had algorithm should be carefully chosen according to the application scenario. as discussed in section iii-d, the feature cube obtained by the multi-feature encoder has a dimension of $I_L \times K \times 5$. thus, we can simply use the cnn for classification without any reshaping operation. the structure of the cnn is shown in fig. 6. we employ four convolutional (conv) layers, each of which has a kernel size of 3 × 3 and 64 kernels. in addition, the depth of the kernels in the first conv layer is five, since the input feature cube has five channels (i.e., range, doppler, azimuth, elevation and magnitude), while that of the kernels in the following three conv layers is 64. we choose the rectified linear unit (relu) [30] as the activation function, since it mitigates the vanishing gradient problem and accelerates the convergence of training [31]. the last conv layer is followed by two fully-connected (fc) layers, each of which has 256 hidden units and is followed by a dropout layer to prevent the network from overfitting. a third fc layer with a softmax function is used as the output layer. the number of units in the third fc layer matches the number of classes in the dataset. the softmax function normalizes the output of the last fc layer to a probability distribution over the classes.
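to make the size of the convolutional part concrete, the following sketch counts the trainable parameters of the four conv layers described above and shows a numerically stable softmax. it is an illustration under stated assumptions: one bias per kernel is assumed, and the fc layers are not counted because the flattened size after the last conv layer depends on padding and pooling choices that the text does not state. the helper names are hypothetical.

```python
import math


def conv_params(kernel_h, kernel_w, in_channels, out_channels, bias=True):
    """Number of trainable parameters of one 2-D convolutional layer."""
    weights = kernel_h * kernel_w * in_channels * out_channels
    return weights + (out_channels if bias else 0)


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# First conv layer sees the five-channel feature cube, the next three see 64.
conv_layers = [conv_params(3, 3, 5, 64)] + [conv_params(3, 3, 64, 64)
                                            for _ in range(3)]
print(conv_layers, sum(conv_layers))

# Equal logits for 12 classes give a uniform distribution.
probs = softmax([0.0] * 12)
print(probs[0])
```

under these assumptions the four conv layers together hold 113728 parameters, with 2944 in the first layer and 36928 in each of the other three.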
through thorough network tuning (e.g., number of hidden layers, number of hidden units, depth number), we construct the cnn structure as shown in fig. 6. the designed network should (a) take the feature cube as input, (b) achieve a high classification accuracy, (c) consume few computational resources, and (d) be deployable on the edge-computing platform. in section vii, we will show that the designed network in fig. 6 fulfills these criteria. as illustrated in fig. 7, we used the 60 ghz fmcw radar in fig. 2 to recognize gestures. our radar system has a detection range of up to 0.9 m and an approx. 120° antenna beam width in both the azimuth and elevation directions. the parameter setting used in the waveform design is presented in table i, where the pulse repetition interval (pri) is 34 ms. the radar is connected to an edge-computing platform, i.e., the nvidia jetson nano, which is equipped with a quad-core arm a57 at 1.43 ghz as central processing unit (cpu), a 128-core maxwell as graphics processing unit (gpu) and 4 gb of memory. we have built our entire radar-based gesture recognition framework described in fig. 1 on the edge-computing platform in c/c++. the proposed multi-feature encoder and had have been implemented in a straightforward manner without any runtime optimization, while the implementation of the cnn is supported by tensorrt developed by nvidia. in addition, as depicted in fig we invited 20 human subjects including both genders with various heights and ages to perform these gestures. among the 20 subjects, the ages range from 20 to 35 years, and the heights from 160 cm to 200 cm. we divided the 20 subjects into two groups. in the first group, ten subjects were taught how to perform the gestures in a normative way. in the second group, in order to increase the diversity of the dataset, only an example of each gesture was demonstrated to the other ten subjects, and they performed the gestures using their own interpretations.
self-evidently, their gestures were no longer as normative as the ones performed by the ten taught subjects. furthermore, every subject repeated each gesture 30 times. therefore, the total number of realizations in our gesture dataset is (12 gestures)×(20 people)×(30 times), namely 7200. we also found that the gestures in our dataset take less than 1.2 s. thus, to ensure that the entire hand movement of a gesture is included in the observation time, we set $I_L$ to 40, which amounts to a duration of 1.36 s (40 measurement-cycles × 34 ms). in this section, the proposed approach is evaluated with a twofold objective: first, its performance is thoroughly compared with benchmarks in the literature through an off-line cross-validation, and secondly, its real-time capability is investigated with an on-line performance test. in section vii-a, we discuss how the parameter $K$ affects the classification accuracy. in section vii-b, we compare our proposed algorithm with the state-of-the-art radar-based gesture recognition algorithms in terms of classification accuracy and computational complexity based on leave-one-out cross-validation (loocv), which means that, in each fold, we use the gestures from one subject as the test set and the rest as the training set. in addition, section vii-c describes the real-time evaluation results of our system. the performances of taught and untaught subjects are evaluated separately. we randomly selected eight taught and eight untaught subjects as training sets, while the remaining two taught and two untaught subjects are the test sets. in the real-time performance evaluation, we performed the hardware-in-the-loop (hil) test and fed the raw data recorded by the radar from the four test subjects into our edge-computing platform.
a. determination of parameter $K$
as described in section iii, we extract the $K$ points with the largest magnitudes from $\mathrm{RD}(p, q)$ to represent the hand information in a single measurement-cycle. we define the average (avg.)
accuracy as the avg. classification accuracy across the 12 gestures based on loocv. in fig. 9, we let $K$ vary from 1 to 40 and compute the avg. accuracy in five trials. it can be seen that the mean avg. accuracy over the five trials keeps increasing and reaches approx. 95% when $K$ is 25. after that, increasing $K$ barely improves the classification accuracy. as a result, to keep the computational complexity of the system low while achieving a high classification accuracy, we set $K$ to 25. consequently, the feature cube $V$ in our proposed multi-feature encoder has a dimension of 40 × 25 × 5. in the off-line case, we assumed that each gesture is perfectly detected by the had algorithm and compared our proposed multi-feature encoder + cnn with the 2-d cnn + lstm [14], the 3-d cnn + lstm [21], the 3-d cnn + lstm (with aoa) and the shallow 3-d cnn + lstm (with aoa) in terms of the avg. classification accuracy and computational complexity based on loocv. in our proposed multi-feature encoder + cnn, the feature cube $V$, which has the dimension of 40 × 25 × 5, was fed into the cnn described in fig. 6. the input of the 2-d cnn + lstm [14] and the 3-d cnn + lstm [21] is the rd spectrum sequence over 40 measurement-cycles, which has the dimension of 40 × 32 × 32 × 1. since [21] did not include any aoa information in their system for gesture classification, the comparison might not be fair. thus, we added the aoa information according to (6) and (7) cnn but with reduced classification accuracy. to achieve a fair comparison, we optimized the structures and the hyperparameters as well as the training parameters of those models. the cnn in fig. 6 was trained for 15000 steps using back propagation [32] and the adam optimizer [33] with an initial learning rate of $1 \times 10^{-4}$, which was reduced to $10^{-5}$, $10^{-6}$ and $10^{-7}$ after 5000, 8000 and 11000 steps, respectively. the batch size is 128.
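the learning-rate schedule described above is piecewise constant and can be written down directly. the sketch below is illustrative only: the function name is hypothetical, and it assumes that each new rate takes effect at the stated step itself (step 5000, 8000 and 11000), which the text does not specify.

```python
def learning_rate(step):
    """Piecewise-constant learning rate for a training step in [0, 15000)."""
    boundaries = (5000, 8000, 11000)
    rates = (1e-4, 1e-5, 1e-6, 1e-7)
    # Count how many boundaries have been reached to pick the current rate.
    reached = sum(step >= b for b in boundaries)
    return rates[reached]


for step in (0, 4999, 5000, 8000, 11000, 14999):
    print(step, learning_rate(step))
```

the same schedule can be passed to most training frameworks as a list of boundaries and a list of values.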
1) classification accuracy and training loss curve: in table ii, we present the classification accuracy of each type of gesture based on the algorithms mentioned above. the avg. accuracies of the 2-d cnn + lstm [14] and the 3-d cnn + lstm [21] are only 78.50% and 79.76%, respectively. since no aoa information is utilized, the rotate cw and rotate ccw gestures can hardly be distinguished, and similarly the four swipe gestures can hardly be separated either. on the contrary, by considering the aoa information, the multi-feature encoder + cnn, the 3-d cnn + lstm (with aoa) and the shallow 3-d cnn + lstm (with aoa) are able to separate the two rotate gestures and the four swipe gestures. it should be mentioned that the avg. accuracy of our proposed multi-feature encoder + cnn is almost the same as that of the 3-d cnn + lstm (with aoa). however, it will be shown in the following section that our approach requires much less computational resources and memory than the other approaches. moreover, in fig. 10, we plot the training loss curves of the three neural network structures. it can be seen that the loss of the proposed cnn in fig. 6 converges fastest among the three structures and approaches zero at around the 2000-th training step. unlike the input of the 3-d cnn + lstm (with aoa) and the shallow 3-d cnn + lstm (with aoa), the feature cube contains sufficient gesture characteristics in spite of its compact form (40 × 25 × 5). as a result, the cnn in fig. 6 is easier to train than the other neural networks, and it achieves a high classification accuracy. 2) confusion matrix: in fig. 11, we plot two confusion matrices for the ten taught and ten untaught subjects based on our proposed multi-feature encoder + cnn. it can be observed that, for the normative gestures performed by the ten taught subjects, we reach approx. 98.47% avg. accuracy. although we could observe an approx. 5% degradation in avg. accuracy in fig.
11(b), where the gestures to be classified are performed by the ten untaught subjects, the avg. accuracy is still 93.11%. 3) computational complexity and memory: the structures of the 3-d cnn + lstm (with aoa), the shallow 3-d cnn + lstm (with aoa) and the proposed multi-feature encoder + cnn are presented in table iii. we evaluated their computational complexity and required memory in terms of giga floating-point operations (gflops) and model size. the gflops of the different models were calculated by the built-in function in tensorflow, and the model size was observed through tensorboard [34]. although the 3-d cnn + lstm (with aoa) offers almost the same classification accuracy as the proposed multi-feature encoder + cnn, it needs far more gflops (2.89 gflops vs. 0.26 gflops). its model size is also much larger than that of the proposed approach (109 mb vs. 4.18 mb). although we could reduce its gflops using a shallow network structure, such as the shallow 3-d cnn + lstm (with aoa) in table iii, this degrades the classification accuracy (94.36%), as can be seen in table ii. we also found that the cnn used in our approach has the smallest model size, since its input dimension is much smaller than that of the other approaches. on the contrary, the input of the 3-d cnn + lstm (with aoa) contains many zeros due to the sparsity of the rd spectrums. such large volumes usually need large numbers of coefficients in neural networks. in contrast, we describe the hand information in every measurement-cycle using only 25 points, and the input dimension of the cnn is only 40 × 25 × 5, which requires much less computation than the other approaches. as mentioned above, subjects are divided into taught and untaught groups, and each has ten subjects.
in each group, eight subjects are randomly selected as the training set, and the remaining two subjects constitute the test set, resulting in each group having 720 true gestures in the test set. in the hil context, we directly fed the recorded raw data from the four test subjects into the edge-computing platform. in the real-time case, the system should be robust enough to distinguish true gestures from random motions (rms). thus, we also included a certain amount of rms as negative samples during the training phase. the ratio of rms to true gestures is around 1:3. 1) precision, recall and $F_1$-score: to quantitatively analyze the real-time performance of our system, we introduce the precision, recall and $F_1$-score, which are calculated as: $\text{precision} = \frac{tp}{tp + fp}$, $\text{recall} = \frac{tp}{tp + fn}$, $F_1 = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$, where $tp$, $fp$ and $fn$ denote the number of true positive, false positive, and false negative estimates. for the two subjects in each test set, we have 60 realizations of each gesture, which means that $tp + fn = 60$. as presented in table iv, the avg. precision and recall over the 12 types of gestures using two taught subjects as the test set are 93.90% and 94.44%, respectively, while those using two untaught subjects as the test set are 91.20% and 86.11%. it should be mentioned that the off-line avg. accuracies in fig. 11, namely 98.47% and 93.11%, can also be regarded as the recall in the taught and untaught cases. compared with the recall in the off-line case, we observe an approx. 4% and 7% degradation in recall in the real-time case for the taught and untaught subjects, respectively. the reason is that, in the off-line performance evaluation, we assumed that each gesture is detected perfectly. in the real-time case, however, the recall is reduced because our had miss-detected some gestures or incorrectly triggered the classifier before the gesture was completely finished.
for example, due to the small movement of the hand, the had sometimes failed to detect the gesture "pinch index". similarly, the recall of the gesture "cross" is also impaired, since the gesture "cross" has a turning point, which leads to a short pause. in some cases where the subject performs the gesture "cross" with low velocity, the had would incorrectly consider the turning point as the end of "cross", resulting in a wrong classification. overall, in the taught and untaught cases, the $F_1$-score of our radar-based gesture recognition system reaches 94.17% and 88.58%, respectively. 2) detection matrix: we summarized the gesture detection results of our real-time system. since we did not aim to evaluate the classification performance here, we depict the detection results in table v considering all four test subjects. our system correctly detected 1388 true positive gestures and raised 25 false alarms among the total of 1864 test samples, in which there are 1440 true gestures and 424 true negative rms, respectively. furthermore, we define two different types of miss-detections (mds): an md from the had means that our had miss-detects a gesture, while an md from the classifier means that the had detects the gesture, but this gesture is incorrectly rejected by our classifier as an rm. the false alarm rate (far) and miss-detection rate (mdr) of our system are 5.90% and 3.61%, respectively. 3) runtime: as depicted in table vi, in the hil context, we also noted the avg. runtime of the multi-feature encoder, had and cnn based on all the 1838 classifications, which include 1388 true positives, 399 true negatives, 25 false alarms and 26 mds from the classifier. the multi-feature encoder includes the 2-d fft, the selection of 25 points, and the rd and aoa estimation. it should be mentioned that the multi-feature encoder and the had were executed on the cpu using unoptimized c/c++ code, while the cnn ran on the gpu based on tensorrt.
the multi-feature encoder and had took only approx. 7.12 ms and 0.38 ms without using any fft acceleration engine, while the cnn took only 25.84 ms on average. the overall runtime of our proposed radar-based gesture recognition system is only approx. 33 ms. we developed a real-time radar-based gesture recognition system built on an edge-computing platform. the proposed multi-feature encoder can effectively encode the gesture profile, i.e., range, doppler, azimuth, elevation and temporal information, as a feature cube, which is then fed into a shallow cnn for gesture classification. furthermore, to reduce the latency caused by the fixed number of required measurement-cycles in our system, we proposed the sta/lta-based gesture detector, which detects the tail of a gesture. in the off-line case, based on loocv, our proposed gesture recognition approach achieves 98.47% and 93.11% avg. accuracy using gestures from taught and untaught subjects, respectively. in addition, the trained shallow cnn has a small model size and requires few gflops. in the hil context, our approach achieves 94.17% and 88.58% $F_1$-scores based on two taught and two untaught subjects as test sets, respectively. finally, our system can be built on the edge-computing platform and requires only approx. 33 ms to recognize a gesture. thanks to the promising recognition performance and low computational complexity, our proposed radar-based gesture recognition system has the potential to be utilized for numerous applications, such as mobile and wearable devices. in future work, different gesture datasets with large diversity need to be constructed according to specific use cases. moreover, in some use cases where the radar is not stationary with respect to the user, the classification accuracy of the proposed system might decrease, and algorithms such as ego motion compensation could accordingly be considered.
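the real-time figures reported in section vii-c can be cross-checked with a few lines of arithmetic. the sketch below uses only numbers quoted in the text. two points are assumptions rather than statements from the paper: the $F_1$-scores are computed from the average precision and recall, and the number of mds from the had is derived as the total number of missed gestures minus the 26 mds from the classifier. the variable names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2.0 * precision * recall / (precision + recall)


# Detection counts quoted in the text (all four test subjects).
true_gestures = 1440
random_motions = 424
detected = 1388
false_alarms = 25
md_classifier = 26

missed = true_gestures - detected              # all miss-detections
md_had = missed - md_classifier                # derived, not quoted
true_negatives = random_motions - false_alarms
classifications = detected + true_negatives + false_alarms + md_classifier

far = false_alarms / random_motions            # false alarm rate
mdr = missed / true_gestures                   # miss-detection rate

# Average runtimes in ms: multi-feature encoder, HAD, CNN.
runtime_ms = 7.12 + 0.38 + 25.84

print(round(100 * far, 2), round(100 * mdr, 2))
print(round(f1_score(93.90, 94.44), 2), round(f1_score(91.20, 86.11), 2))
print(classifications, md_had, round(runtime_ms, 2))
```

under these assumptions the derived values agree with the reported far, mdr, $F_1$-scores and number of classifications, and the summed runtime stays below the 34 ms pulse repetition interval.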
[1] micro-doppler effect in radar: phenomenon, model, and simulation study
[2] millimeter-wave technology for automotive radar sensors in the 77 ghz frequency band
[3] an ultra-wideband 80 ghz fmcw radar system using a sige bipolar transceiver chip stabilized by a fractional-n pll synthesizer
[4] radar-based human-motion recognition with deep learning: promising applications for indoor monitoring
[5] radar signal processing for sensing in assisted living: the challenges associated with real-time implementation of emerging algorithms
[6] motion sensing using radar: gesture interaction and beyond
[7] google pixel 4 and 4 xl hands-on: this time, it's not about the camera
[8] persistence of coronaviruses on inanimate surfaces and its inactivation with biocidal agents
[9] gesture classification with handcrafted micro-doppler features using a fmcw radar
[10] soli: ubiquitous gesture sensing with millimeter wave radar
[11] hand gesture recognition based on radar micro-doppler signature envelopes
[12] hand gesture recognition using micro-doppler signatures with convolutional neural network
[13] sparsity-driven micro-doppler feature extraction for dynamic hand gesture recognition
[14] interacting with soli: exploring fine-grained dynamic gesture recognition in the radio-frequency spectrum
[15] ts-i3d based hand gesture recognition method with radar sensor
[16] short-range radar based real-time hand gesture recognition using lstm encoder
[17] short-range radar-based gesture recognition system using 3d cnn with triplet loss
[18] hand-gesture recognition using two-antenna doppler radar with deep convolutional neural networks
[19] automatic radar-based gesture detection and classification via a region-based deep convolutional neural network
[20] u-deephand: fmcw radar-based unsupervised hand gesture feature learning using deep convolutional auto-encoder network
[21] latern: dynamic continuous hand gesture recognition using fmcw radar sensor
[22] riddle: real-time interacting with hand description via millimeter-wave sensor
[23] 3d convolutional neural networks for human action recognition
[24] multimodal gesture recognition using 3-d convolution and convolutional lstm
[25] comparison of the sta/lta and power spectral density methods for microseismic event detection
[26] new chirp sequence radar waveform
[27] a high-resolution framework for range-doppler frequency estimation in automotive radar systems
[28] two-dimensional subspace-based model order selection methods for fmcw automotive radar systems
[29] radar handbook
[30] rectified linear units improve restricted boltzmann machines
[31] empirical evaluation of rectified activations in convolutional network
[32] backpropagation applied to handwritten zip code recognition
[33] adam: a method for stochastic optimization
[34] tensorflow: a system for large-scale machine learning
and the research institute for automotive electronics (e-lab) in collaboration with hella gmbh & co. kgaa, lippstadt, germany. his research interests are automotive radar signal processing, radar-based human motion recognition and machine learning collaboration with the signal processing group at tud, darmstadt, germany, where his research interest was the detection and classification of underwater mines in sonar imagery lippstadt, germany, where he is mainly responsible for the development of reliable signal processing algorithms for automotive radar systems xibo li received the b.sc.
degree in mechanical engineering from beijing institute of technology his current research interests include automotive radar signal processing, machine learning and sensor fusion as a research associate at the institute for power electronic and electrical drives (isea), he was involved in several projects related to ageing of lithium-ion batteries at the chair for electrochemical energy conversion and storage systems he joined the department of communications engineering of the university of paderborn in 2001 as a research staff member, where he was involved in several projects related to single- and multi-channel speech processing and automated speech recognition he is currently the head of the radar signal processing and signal validation department at hella gmbh & co. kgaa, lippstadt, germany. nils pohl (gsm'07-m'11-sm'14) received the dipl.-ing. and dr.-ing. degrees in electrical engineering from he has authored or coauthored more than 100 scientific papers and has issued several patents. his current research interests include ultra-wideband mm-wave radar, design, and optimization of mm-wave integrated sige circuits and system concepts with frequencies up to 300 ghz and above, as well as frequency synthesis and antennas. prof. pohl is a member of vde, itg, euma, and ursi. he was a corecipient of the
the authors would like to thank the editor and anonymous reviewers for giving us fruitful suggestions, which significantly improved the quality of this paper. many thanks to the students for helping us collect the gesture dataset in this interesting work.
key: cord-032684-muh5rwla authors: madichetty, sreenivasulu; m., sridevi title: a stacked convolutional neural network for detecting the resource tweets during a disaster date: 2020-09-25 journal: multimed tools appl doi: 10.1007/s11042-020-09873-8 sha: doc_id: 32684 cord_uid: muh5rwla social media platforms like twitter are one of the primary sources for sharing real-time information at the time of events such as disasters, political events, etc. detecting the resource tweets during a disaster is an essential task because tweets contain different types of information such as infrastructure damage, resources, opinions and sympathies of disaster events, etc. tweets related to the need and availability of resources (nar) are posted by humanitarian organizations and victims. hence, reliable methodologies are required for detecting the nar tweets during a disaster. the existing works do not focus well on nar tweet detection and also perform poorly. hence, this paper focuses on the detection of nar tweets during a disaster. existing works often use features and appropriate machine learning algorithms on several natural language processing (nlp) tasks. recently, convolutional neural networks (cnn) have been widely used in text classification problems. however, they require a large amount of manually labeled data, and no such large labeled dataset is available for nar tweets during a disaster. to overcome this problem, stacking of convolutional neural networks with traditional feature based classifiers is proposed for detecting the nar tweets. in our approach, several informative features such as aid, need, food, packets, earthquake, etc. are used in the classifier and the cnn. the learned features (the outputs of the cnn and of the classifier with informative features) are utilized in another classifier (meta-classifier) for the detection of nar tweets. classifiers such as svm, knn, decision tree, and naive bayes are used in the proposed model.
from the experiments, we found that using knn (base classifier) and svm (meta-classifier) in combination with the cnn in the proposed model outperforms the other algorithms. this paper uses the 2015 nepal and 2016 italy earthquake datasets for experimentation. the experimental results show that the proposed model achieves the best accuracy compared to baseline methods. micro-blogging [10, 14, 36, 40] sites like twitter, facebook, instagram, etc. are helpful for collecting situational information [13] during a disaster such as an earthquake, flood or disease outbreak [25]. during these events, only a minority of the posted tweets are relevant to specific classes such as infrastructure damage, resources [6, 33], service requests [24], etc.; spam tweets, communal tweets and emotional content are also posted [8, 16, 17, 19, 31, 38]. therefore, powerful methodologies are required for detecting specific-class tweets (such as need or availability of resources), so that relevant tweets can be detected automatically from a large set of tweets. the detection of specific-class tweets [1, 11, 21, 35] has received much attention in the last two years, and it is likely to become more important in social media in the next few years. specifically, detecting the two types of tweets that contain information related to the need and the availability of resources is a challenging task. during a disaster, victims post tweets stating where essential resources such as food, water, medical aid and shelter are needed. similarly, humanitarian organizations post tweets stating where specific resources such as medical resources, food and water packets are available in the affected area. examples of need and availability of resource tweets are shown in table 1. the first four tweets represent the need for resources such as mobile hospitals, password-free wi-fi, blood and ambulances. 
the next four tweets reflect the availability of resources, such as the italian army providing services to earthquake victims, and the availability of shelter tents, money and ambulances. detecting need and availability of resource tweets is therefore very beneficial for both humanitarian organizations and victims during a disaster. the main objective of this work is to assist victims and humanitarian organizations in the event of a disaster by designing a method for the automatic identification of need and availability of resource (nar) tweets from twitter. the problem of detecting nar tweets can be treated as a multi-class classification problem. the classes are (i) need of resource tweet, (ii) availability of resource tweet and (iii) neither of the two. only a few existing works [1, 3, 11] focus on extracting the need and availability of resource tweets during a disaster. among them, most used information-retrieval methodologies such as word2vec or a combination of word embeddings and character embeddings. specifically, the authors in [3] used both information-retrieval methodologies and classification methodologies (cnn with crisis word embeddings) to extract the need and availability of resource tweets during a disaster. the main drawback of cnn with crisis embeddings is that it does not work well when the number of training tweets is small; in the case of information-retrieval methodologies, keywords must be supplied manually to identify the need and availability of resource tweets. to overcome these issues, a novel method is proposed that uses the stacking mechanism [44] to identify nar tweets during a disaster. the stacking mechanism uses two levels of classifiers. the first level uses multiple classifiers, whose outputs are used as the input of the second level, while the second level uses only one classifier. 
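the two-level data flow described above can be illustrated with a minimal python sketch. everything in it is an assumption made for illustration: the two base models are hypothetical stand-ins (not the cnn and knn of the paper), the label ids (0: none, 1: availability, 2: need) follow the convention of algorithm 1, and the meta-classifier is a toy averaging rule rather than a trained svm.

```python
from typing import Callable, List, Sequence

Vector = List[float]
BaseModel = Callable[[List[str]], Vector]  # tokens -> class-probability vector


def keyword_model(tokens: List[str]) -> Vector:
    """Hypothetical stand-in for a feature-based base classifier."""
    if "need" in tokens:
        return [0.1, 0.1, 0.8]
    if "available" in tokens:
        return [0.1, 0.8, 0.1]
    return [0.8, 0.1, 0.1]


def uniform_model(tokens: List[str]) -> Vector:
    """Hypothetical stand-in for a second, uninformative base classifier."""
    return [1 / 3, 1 / 3, 1 / 3]


def meta_features(tokens: List[str], base_models: Sequence[BaseModel]) -> Vector:
    """Level 1: run every base model and concatenate the outputs."""
    features: Vector = []
    for model in base_models:
        features.extend(model(tokens))
    return features


def average_meta(features: Vector, n_classes: int = 3) -> int:
    """Toy level-2 rule: average the stacked vectors and take the argmax.
    A trained classifier would take this place in a real stacking model."""
    n_models = len(features) // n_classes
    avg = [
        sum(features[m * n_classes + c] for m in range(n_models)) / n_models
        for c in range(n_classes)
    ]
    return max(range(n_classes), key=avg.__getitem__)


def predict(tokens, base_models, meta) -> int:
    """Level 2: the single meta-classifier decides from the stacked features."""
    return meta(meta_features(tokens, base_models))
```

with two base models and three classes, `meta_features` returns a six-dimensional vector, which is the only input the second level sees.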
example resource tweet from table 1: search and rescue dogs 20 ambulances on the ground in #perugia following #earthquake volunteers from @crocerossa on the scene.
the stacking method does not produce improved results if the models used in it are stable. therefore, different models, namely a cnn and a knn classifier with domain-specific features, are used in this work. the cnn is used to capture the semantic similarity between words, even when the vocabulary differs in the testing phase. to overcome the problem of a small number of training tweets, new features are proposed and used in the knn classifier to detect nar tweets. the two models (the cnn and the knn classifier with the proposed features) thus play different roles in the detection of tweets. the outputs of these two models are given as input to the svm (second-level) classifier. the svm classifier is trained to learn the relationship between the outputs of the cnn and knn models. it gives the final prediction of whether a tweet expresses a resource need, a resource availability or neither. the efficacy of the final prediction depends on the classifiers used in level 1 and level 2. the reasons for selecting the knn and svm classifiers as first- and second-level classifiers are explained in sections 4.4.2 and 4.5.2. the main contributions are summarized as:
this paper is organized as follows. the second section examines the related work. the proposed approach for the detection of nar tweets during a disaster is described in the third section. experimental results and analysis are discussed in the fourth section. the last section concludes the paper.
many studies [22, 28, 32, 41] focused on the detection of tweets related to a disaster. 
preliminary work [41] focused mainly on extracting features such as uni-gram and bi-gram frequency, parts-of-speech (pos), objective or subjective, personal or impersonal, and formal or informal register from tweets, and used classifiers to classify the tweets by relevance. classifiers such as naive bayes and maximum entropy are used for the detection of situational tweets related to the disaster. the authors explained that their work depends on the vocabulary of a specific event. in [32], the authors developed an application for detecting earthquakes based on features such as context words, keyword position, content words and tweet length. it is applicable only to japanese tweets. to overcome the problem of domain dependence, the authors in [28] proposed a novel framework for classifying situational and non-situational information based on low-level lexical and syntactic features. after classification, the tweets are summarized based on the content words; the authors also concluded that the framework works across domains (it is domain independent). however, all these methods focus only on situational tweets related to a disaster and fail to address specific-class tweets. in recent years, more researchers have focused on the detection of user-defined classes of tweets during a disaster. several studies, for instance [2, 11, 21, 29], have addressed different specific classes. the authors in [21] suggest that a decision tree with context and content features gives the best recall and f1-measure among classifiers such as svm, adaboost and random forest. however, that work does not focus on nar tweets. in recent literature, the authors in [35] developed a method that extracts features by applying the maximum frequency of words in the tweets to detect resource tweets during a disaster. resources include both the availability and the need of resources. 
however, it does not focus exclusively on tweets about the availability and need of resources during a disaster. the authors in [9] designed the artificial intelligence for disaster response (aidr) system for classifying tweets into user-defined categories in order to detect tweets related to a disaster. in aidr, uni-gram and bi-gram features are used, and these features can be applied to detect any user-defined class during a disaster. in [2], the authors manually analyzed whatsapp messages for the requirement of medical, human and infrastructural resources during a disaster, taking the 2015 nepal earthquake dataset as a case study. however, they did not propose an automatic method for identifying the resources. in [11], the authors found that neural retrieval models integrating character-level and word-level embeddings with pattern recognition techniques perform better than state-of-the-art models. the authors applied information-retrieval techniques for detecting nar tweets. in [7], the authors used a novel vector training approach for clustering tweets about emergency situations and compared their method with bag-of-words (bow), word2vec-sum and doc2vec. they described how clustering of tweets can further help to identify the different aspects of a topic in emergency situations. however, they did not propose a method for identifying nar tweets during a disaster.
the problem can be defined as follows: given a set of n tweets $X = \{x_1, x_2, x_3, x_4, \ldots, x_n\}$, identify the tweets that belong to each of three classes: 1) need of the resource, 2) availability of the resource and 3) none of the above. this section describes the stacked convolutional neural network for identifying nar tweets during a crisis. an overview of the proposed stacked convolutional neural network is shown in fig. 1. 
the stacking mechanism [44] combines the predictions of diverse classifiers by learning the relationship between the models. different classifiers make different prediction errors on the data: some classifiers mispredict a sample that other classifiers predict correctly. stacking therefore increases the generalization ability of the model and reduces its misclassification rate, bias and variance. stacking-based classifiers give higher performance than the individual classifier models due to this generalization ability [42]. however, most resource detection systems focus on individual classifier models rather than on ensemble methods (combinations of diverse classifiers). in this work, a stacked convolutional neural network is proposed for detecting resource tweets from social media during a disaster. it consists of two classifier phases. in the first phase, the convolutional neural network and the knn classifier are used; they are referred to as base-level classifiers. the svm classifier is used as a meta-level classifier in the second phase. before the tweets are given as inputs to the base-level classifiers, the following pre-processing and extraction steps are performed:
- all tweets are changed to lower case letters to avoid multiple copies of the same words.
- the tweets are divided into words, referred to as tokens.
- the user mentions (@users), hash-tags (#) and urls are removed from the tweets.
- similarly, stop-words, numerals and unknown symbols are omitted from the tweets.
for each tweet, two types of feature representation are generated, using the following techniques. we used pre-trained crisis word embeddings to represent each word in a tweet as a 300-dimensional vector. the embeddings are based on 52 million crisis-related tweets collected during 19 crisis events, and the word2vec tool was used to train them. 
it uses the continuous bag-of-words (cbow) architecture with negative sampling to generate the word embeddings. the χ² statistic feature selection algorithm [45] is used to extract the top-most informative words from tweets because it has already been shown to be one of the most efficient feature selection algorithms for text categorization. the svm classifier is used with the χ² statistic feature selection algorithm because the authors in [20] concluded that svm with χ² statistic feature selection performs better than other traditional methods. the extracted domain-specific features are shown in table 2. the first, second and third columns are the serial number, the features and the information category, respectively. the above two methods provide two feature vector representations for each tweet, which are given as input to the base-level classifiers (cnn and knn). cnn is suitable for eliciting local and deep features from natural language. the authors of [12] have shown that cnn achieves better results in sentence classification. the authors in [34] proposed a convolutional-recursive deep model for 3d object classification that combines convolutional and recursive neural networks (rnn). the cnn layer discovers low-level translation-invariant features that are fed into multiple fixed-tree rnns to form higher-order features. in [27], the authors have shown that cnn outperforms many traditional methods in biomedical text classification.
embedding layer: this is the very first layer of the cnn. it takes a fixed number of words from the tweet as input and converts each word into its corresponding 300-dimensional crisis word vector. the resulting tweet representation is passed through a series of convolution and pooling operations to learn high-level feature representations. in the convolution layer, a new feature $f_j$ is generated by applying a convolution kernel $u \in \mathbb{R}^{gd}$ to a window of g words (the filter size), as shown in (1):
$f_j = f(u \cdot x_{j:j+g-1} + b)$ (1)
where $x_{j:j+g-1}$ is the concatenation of the input vectors $(x_j, x_{j+1}, \ldots, x_{j+g-1})$, b is a bias term and f is a non-linear activation function such as sigmoid or tanh. the filter is applied to every window of g words to obtain the feature map $f = [f_1, f_2, \ldots, f_{n-g+1}]$ with $f \in \mathbb{R}^{n-g+1}$, as shown in (2). different values of g (3, 4, 5) are used to capture different n-gram features from the tweet. this process is repeated 100 times (100 filters) to produce 100 feature maps that learn complementary features for the same filter size. maximum pooling is then applied to each feature map, as in (3), where $\mu_q(f_i)$ refers to the maximum pooling operation [4] applied to each window of q features in the feature map $f_i$. max-pooling reduces the output dimension while keeping the important features of each feature map. after the maximum pooling operation, different feature vectors are obtained from the convolution layers with filter sizes (3, 4, 5). these feature vectors are then concatenated into a single block. the dense layer with the softmax activation function is placed on top of the pooling layer to retain the features generated by the pooling layer, as shown in (4), where w is a weight matrix, $b_e$ is a bias vector and e is a non-linear activation function. the input of the dense layer may have variable length; the layer produces a fixed-size output z, which is given as input for classification. the output layer defines the probability distribution over classes using a softmax function. the probability of the output label t is given by (5), where $w_t$ denotes the weights associated with class t in the output layer.
we adopted the k-nearest neighbour (knn) classifier as a base-level classifier in the proposed model to supply a feature vector of the tweet to the meta-level (second-level) classifier. 
it acts as a first-level classifier because it gives better performance than other classifiers (decision tree, naive bayes); a detailed explanation is given in sections 4.4 and 4.5.2. it accepts domain-specific features such as aid, needs, etc. as the input feature vector of a tweet. the knn classifier scores the neighbors of a tweet among the training tweets and uses the class labels of the k most similar neighbors to predict the probability vector of the tweet. we use the euclidean distance $e(tw, tw_1)$ to measure the similarity between the tweets $tw$ and $tw_1$, as shown in (6):
$e(tw, tw_1) = \sqrt{\sum_{m=1}^{n} (tw_m - tw_{1,m})^2}$ (6)
where n is the dimension of the tweet vectors $tw$ and $tw_1$. the classes of these neighbors are weighted using the similarity of each neighbor to $tw$, as in (7), where knn(tw) indicates the set of k nearest neighbors of the tweet tw, $\delta(tw_j, c_i)$ represents the probability of $tw_j$ with respect to the class $c_i$, and there are three classes (need of resource, availability of resource and neither). finally, the classifier produces a three-dimensional probability vector for each tweet in the testing data. the results indicate that the knn classifier also plays a significant role in the proposed model for detecting nar tweets.
in this work, we have adopted the svm classifier [39], one of the traditional machine learning algorithms, in the proposed model. svm is used as the meta-level classifier because it gives better performance than other classifiers (decision tree, naive bayes); a detailed explanation is given in sections 4.4 and 4.5.2. it accepts the concatenation of the predicted outputs of the cnn and knn classifiers as input features, so the input vector is six-dimensional. we used the radial basis function (rbf) kernel in the svm classifier to transform the data into a higher-dimensional feature space. 
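the knn step described above (euclidean distance, similarity-weighted votes of the k nearest neighbours, three-dimensional probability vector) can be sketched in plain python as follows. the conversion from distance to similarity, 1 / (1 + distance), is an assumption of this sketch, since the exact weighting of the paper is not recoverable from the text; the function names and the toy feature vectors in the usage note are likewise illustrative only.

```python
import math
from typing import List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equally sized feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_proba(
    query: Sequence[float],
    train_vectors: Sequence[Sequence[float]],
    train_labels: Sequence[int],
    k: int = 3,
    n_classes: int = 3,
) -> List[float]:
    """Return a class-probability vector from the k nearest training tweets.

    Each neighbour votes for its own label with weight 1 / (1 + distance)
    (an assumed similarity measure); the votes are normalised to sum to 1.
    """
    neighbours = sorted(
        ((euclidean(query, vec), label)
         for vec, label in zip(train_vectors, train_labels)),
        key=lambda pair: pair[0],
    )[:k]
    scores = [0.0] * n_classes
    for distance, label in neighbours:
        scores[label] += 1.0 / (1.0 + distance)
    total = sum(scores)
    return [score / total for score in scores]
```

for example, `knn_proba([0, 0], [[0, 0], [0, 1], [5, 5], [6, 5]], [2, 2, 1, 0])` puts most of the probability mass on label 2. appending such a three-dimensional vector to the three-dimensional cnn output gives the six-dimensional input of the meta-level classifier.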
given a set of testing tweets, the base-level classifiers produce six-dimensional output vectors. these are sent as input features to the meta-level classifier (svm classifier). the output of the svm (second-level classifier) is used as the final tweet prediction. later, the learned model is used to detect nar tweets during a disaster. the main advantage of the proposed stacked convolutional neural network for detecting nar tweets during a disaster is that it works effectively even for small datasets, owing to the use of domain-specific features. thanks to the cnn model, it also remains effective when the words in the training and testing tweets differ. the proposed method is summarized in algorithm 1.
algorithm 1: summary of the proposed method (cnn and knn with proposed features).
labels:
1: tweet related to the availability of resources
0: tweet not related to the need or availability of resources
2: tweet related to the need of resources
steps:
1. the tweets are preprocessed by applying the following techniques.
- removal of stop-words, numerals and unknown symbols.
- changing to lower case letters.
in this section, we first introduce the datasets, the parameter details of the model and the metrics used for performance evaluation. subsequently, the experimental results are presented, including the preliminary experiments, the classifier selection experiments in the proposed model and the ablation experiments. furthermore, the proposed approach is compared with existing approaches. the data are collected from the nepal and italy earthquakes that occurred in 2015 and 2016, respectively. tweets are crawled through the twitter api from tweet ids obtained from the authors of [11]. of the total tweets, 80% are used for training and 20% for testing the proposed model. the details of the disaster datasets are given in table 3. 
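the pre-processing steps of algorithm 1 and the 80/20 split can be sketched with the python standard library only. several details are assumptions of this sketch rather than statements about the authors' code: the stop-word list is a tiny illustrative one, hash-tags and mentions are removed together with their words, and the data are shuffled with a fixed seed before splitting.

```python
import random
import re
from typing import List, Sequence, Tuple

# Tiny illustrative stop-word list; a real system would use a full list.
STOP_WORDS = {"a", "an", "the", "is", "are", "in", "on", "at", "of", "to", "for", "and"}


def preprocess(tweet: str) -> List[str]:
    """Lower-case, strip URLs / mentions / hash-tags, tokenise, drop stop-words."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URLs
    text = re.sub(r"[@#]\w+", " ", text)                # remove @mentions and #tags
    tokens = re.findall(r"[a-z]+", text)                # drops numerals and symbols
    return [tok for tok in tokens if tok not in STOP_WORDS]


def train_test_split(
    items: Sequence, test_fraction: float = 0.2, seed: int = 0
) -> Tuple[list, list]:
    """Shuffle (assumed) and split into training and testing portions."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]
```

for instance, `preprocess("Need 20 ambulances in #Perugia http://t.co/x @crocerossa!")` keeps only the tokens `need` and `ambulances`.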
the code is made available to the public (footnote 1). the cnn model is trained by minimizing the sparse cross-entropy of (5) using the adadelta [46] algorithm. the maximum number of epochs is set to 50. mini-batch sizes of 32, 64 and 128 are tried; a mini-batch size of 64 gives better results than the other batch sizes, as tabulated in table 6. filter sizes of 3, 4 and 5 are used. to avoid over-fitting, a dropout of 0.5 [37] and an early stopping criterion based on the loss on the validation data are used. all the experiments are performed using the scikit [23] package of the python language. table 4 gives the description of the various methods. the first, second and third columns indicate the serial number, the method name and the abbreviation, respectively. in the abbreviation, the methods before and after the '+' symbol are the base-level (first-level) classifiers, '+' indicates the concatenation of the predicted outputs of the base-level classifiers, and the '→' symbol indicates the flow of the predicted outputs of the base-level classifiers as input to the meta-classifier. the method after the '→' symbol is the meta-level (second-level) classifier. the performance of the proposed models is assessed with the standard measures accuracy, precision, recall and f1-score, calculated using eqs. 8 to 11, respectively:
$accuracy = \frac{tp + tn}{tp + tn + fp + fn}$ (8)
$precision = \frac{tp}{tp + fp}$ (9)
$recall = \frac{tp}{tp + fn}$ (10)
$f1 = \frac{2 \cdot precision \cdot recall}{precision + recall}$ (11)
where tp, tn, fp and fn denote the numbers of true positives, true negatives, false positives and false negatives, respectively. the accuracy of the cnn model for various batch sizes is reported in table 6; the batch size of 64 gives the best accuracy compared to the batch sizes of 32 and 128, so a cnn batch size of 64 is used in further experiments. this section explains the results of the preliminary experiments, the classifier selection experiments in the proposed model and the ablation experiments. initially, an experiment is performed with the svm classifier on the proposed domain-specific features for the identification of nar tweets, and the result is compared to the bow model in table 5. 
it highlights the impact of the proposed domain-specific features compared with the bow model. the features are beneficial for identifying tweets, especially for smaller datasets. later, various experiments are performed with the cnn model to determine the best batch size. batch sizes of 32, 64 and 128 are used. the accuracy of the cnn model for the different batch sizes is shown in table 6. the results show that the cnn model provides the best outcome for the batch size of 64 compared to the others (32 and 128). therefore, a batch size of 64 is used in the additional experiments. note that the values reported in all tables are averages over the need and availability of resource classes. the following four experiments are performed to choose the most appropriate base-level and meta-level classifiers for the proposed method.
1. in the first experiment, the outputs of the cnn and svm (base-level classifiers) are given as features to the meta-level classifier. the results obtained by varying the meta-level classifier (svm, knn, decision tree and naive bayes) are reported in table 7. knn gives the best performance among the classifiers for the nepal earthquake dataset, whereas svm gives the best performance for the italy earthquake dataset.
2. in the second experiment, the outputs of the cnn and the decision tree (base-level classifiers) are given as features to the meta-level classifier. the models obtained with the different meta-level classifiers are cds, cdk, cdnb and cdd, and the results are reported in table 8. among these models, cdk gives the best accuracy for both the nepal and the italy earthquake datasets. cdnb provides the same accuracy as cdk for the italy earthquake dataset.
3. 
in the third experiment, the outputs of the cnn and naive bayes classifiers (base-level classifiers) are given as features to the meta-level classifier. the models obtained by varying the meta-level classifier are cnbs, cnbk, cnbnb and cnbd, and the results are reported in table 9. cnbnb has the best accuracy among these models for both disaster datasets. cnbs gives the same accuracy as cnbnb for the italy earthquake dataset.
4. finally, in the fourth experiment, the outputs of the cnn and knn classifiers (base-level classifiers) are given as input to the meta-classifier. the models obtained by varying the meta-classifier are cks, ckk, cknb and ckd, and the results are tabulated in table 10. cks achieves the highest accuracy among these models for both disaster datasets.
after performing the four experiments, the models with the best f1-score in each experiment, namely cdk, cks / ckk, cnbs and csk, are selected for both disaster datasets. in the same way, the models with the highest precision on the nepal earthquake dataset, namely cknb, cdnb, cnbb / cnbd and csnb, are selected. similarly, the csnb, cds, cnbnb and cks models achieve the best precision for the italy earthquake dataset. in terms of execution time, cds is the fastest on average over both disaster datasets; however, it does not give the best results compared to the other models. finally, when all models are compared, the csk model achieves the best f1-score for the nepal earthquake dataset. in terms of accuracy, the csk model gives the best performance for the nepal earthquake dataset but not for the italy earthquake dataset. over all comparisons, cks performs better than the other models on both disaster datasets. therefore, cks is selected to identify nar tweets during a disaster. 
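the model abbreviations used in the four experiments appear to follow one pattern: the letter c for the cnn, then a code for the second base-level classifier, then a code for the meta-level classifier. the short python sketch below enumerates that grid and expands an abbreviation; the pattern is inferred from the names quoted in the text (for example cds, cnbnb, cks), and the helper names are illustrative only.

```python
# Classifier codes inferred from the abbreviations quoted in the text.
CODES = {"s": "svm", "d": "decision tree", "nb": "naive bayes", "k": "knn"}


def model_names() -> list:
    """All 'c' + base + meta combinations (4 base x 4 meta = 16 models)."""
    return ["c" + base + meta for base in CODES for meta in CODES]


def describe(name: str) -> str:
    """Expand an abbreviation such as 'cks' into 'cnn + knn -> svm'."""
    if not name.startswith("c"):
        raise ValueError(f"unknown model name: {name!r}")
    rest = name[1:]
    for base in CODES:
        meta = rest[len(base):]
        if rest.startswith(base) and meta in CODES:
            return f"cnn + {CODES[base]} -> {CODES[meta]}"
    raise ValueError(f"unknown model name: {name!r}")
```

under this reading, `describe("cks")` gives `cnn + knn -> svm`, the configuration selected above, and `describe("csk")` gives `cnn + svm -> knn`.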
various experiments are conducted to assess the effectiveness of the individual components of the proposed model (cks) on the two datasets (nepal and italy earthquakes). the proposed model is first evaluated as a whole, and the results for the two datasets are tabulated in table 11. the experiments are then repeated with the informative (domain-specific) features and the cnn excluded individually from the proposed model; these results are also reported in table 11. the informative features play a crucial role for the italy earthquake dataset: excluding them reduces the accuracy of the proposed model by almost 5.31%. for the nepal earthquake dataset, the accuracy is reduced by approximately 0.90%. removing the cnn model drastically reduces the performance, by almost 25% and 15% for the nepal and italy earthquake datasets, respectively. this indicates that the cnn plays a significant role on both disaster datasets. removing both the cnn and the svm classifier from the proposed model yields the same performance reduction as removing the cnn alone. this indicates that the svm classifier alone does not have much impact on the performance of the model. nevertheless, the proposed method (cks) provides better accuracy than any of its individual components in identifying nar tweets during a disaster. this is also confirmed by the statistical validation given in section 4.5.2. this section provides a brief explanation of the methods that are compared with the proposed model. it is divided into two subsections: 1. classification methodologies. 2. statistical validation of the classifier models. this section describes the comparison of the proposed model with the existing classification methodologies [9, 12, 30, 35]. in [9], the authors presented the aidr platform for the automatic classification of tweets into user-defined categories using uni-gram and bi-gram features. 
similarly, in this paper, the svm classifier with uni-gram and bi-gram features is used as a baseline in the experiments. in [35], the authors used features such as location, infrastructure damage, communication, etc. for identifying resources during a disaster, with an svm classifier for classification. the authors of [12] used cnn for sentence classification with tuned hyper-parameters; a cnn of this kind is also evaluated and compared with the proposed model. in [30], the authors used low-level lexical and syntactic features for identifying situational information during a disaster. the proposed cks model achieves the best accuracy compared to the existing methods on both the nepal and the italy earthquake datasets for identifying nar tweets, and the results are reported in table 12. the better accuracy of the proposed model is due to the use of informative features and traditional classifiers, which increase the diversity of the model. in general, stacking models give better accuracy than individual models when the models are diverse. it is also observed from table 12 that the improvement of the proposed method is much larger on the italy earthquake dataset than on the nepal earthquake dataset, because the italy dataset is small. in terms of execution time, the rudra model [30] is the fastest and the bow model [9] is the slowest of the compared models. however, neither gives the best result for detecting nar tweets during a disaster. in this section, we investigate the statistical significance of the differences between the classification models. the authors in [5] suggest the use of the mcnemar statistical test for deep learning models. 
therefore, we have used the mcnemar statistical test [5] to study the statistical significance of the classification methods. the contingency table of the mcnemar test is shown in table 13. here $n_{01}$ represents the number of tweets correctly detected by both model a and model b. $n_{02}$ represents the number of tweets correctly detected by model b and wrongly detected by model a. $n_{11}$ represents the number of tweets correctly detected by model a and wrongly detected by model b. $n_{12}$ represents the number of tweets wrongly detected by both model a and model b. the chi-squared (χ²) statistic of the test is computed from the two disagreement counts $n_{02}$ and $n_{11}$. the hypotheses are: 1. null hypothesis (n0): there is no significant difference between the performances of the classifier models. 2. alternate hypothesis (n1): there is a significant difference between the performances of the classifier models. n0 is accepted if the probability (p) value is greater than 0.05; n1 is accepted if the p value is less than 0.05. tables 14 and 15 show the results of the mcnemar statistical test on the performance of the various proposed methods and the comparison with the existing methods. in the tables, '↑↑' indicates strong evidence that the proposed method is statistically significantly better than the other method, with a probability value below 0.01 (p < 0.01); this corresponds to a confidence level of 99%. '↑' indicates weak evidence that the proposed method is statistically significantly better than the other method, with a probability value between 0.01 and 0.05 (0.01 < p < 0.05).
[...] > 0, [...] > 0 and compute b using eq. (20)
1. start with the iteration r = 0
2. initialize a as $a^{(0)} = b^{T} b$ and $w^{(0)} = 0$
3. do ... end while convergence is not achieved
4. output: $a^{(r+1)}$
this section describes the proposed denoising architecture based on the cnn model. the functional diagram of the proposed denoising algorithm is illustrated in fig. 1. 
the cnn is composed of two sections, namely encoding and decoding of features. feature encoding comprises multiple convolutional layers followed by pooling layers. the receptive field of the neurons can be enlarged with the help of the pooling layers. the convolutional layers in the encoding part employ kernels of size 3 × 3 with a stride of 1, followed by a non-linear relu activation. a pooling layer is placed after the convolutional layer to reduce the spatial resolution of the feature maps; pooling is applied over windows of size 2 × 2. in the encoding stage, the input is a 256 × 256 image, which is followed by convolutional layers with 64-channel feature maps. a max pool layer then downsamples the feature maps to 128 × 128. the subsequent convolutional layers generate 128-channel feature maps. a further max pool layer downsamples the feature maps to 64 × 64. the lowest convolutional layer generates 256-channel feature maps. encoding thus doubles the number of feature map channels and at the same time reduces their size at every stage while traversing down the u-net [21]. the decoding stage, also known as the expansive path, involves upsampling of the feature maps along with the deconvolution process. however, the reconstructed feature maps lose spatial information to a large extent, and the final reconstructed images may lose essential details. this problem is addressed by fusing (concatenating) the feature maps from the encoder with the corresponding feature maps of the upsampling layer to generate updated, upsampled feature maps. the decoding consists of convolutional layers with 3 × 3 kernels and a subsequent relu non-linearity, as in encoding. the 256-channel feature map obtained from the encoding is deconvolved to 128-channel feature maps. an upsampling layer is placed after every convolution stage to increase the size of the feature maps. 
at every stage, the encoder feature maps from the same level are fused with the output of the convolutional layer. the final layer generates the desired number of classes to produce the reconstructed output image. the architecture of the proposed cnn-based denoising model is trained by an end-to-end process. overfitting in the proposed cnn architecture has been reduced by employing a dropout layer before the output layer. a learning rate of 0.001 has been employed, the momentum parameters have been fixed to 0.9 and epsilon was initialized to 1e-08. the cnn has been trained for 10,000 iterations with a batch size of 8. the proposed cnn uses the adam solver as it combines the advantages of adagrad and rmsprop [36]. adam works with sparse and noisy gradients; moreover, its parameter updates are invariant to rescaling of the gradients and it provides good control over the step size [36].
fig. 1 proposed cnn-based denoising model
the input to the proposed algorithm is images normalized to the range [0, 1]. fully sampled, noiseless and normalized 256 × 256 images have been used to train the proposed cnn to obtain the target weights. simulated rician noise of level 10 is added to the same set of images in the k-space. sparsity is achieved by applying k-space undersampling, in which around 75% of the samples are removed. an inverse fourier transform converts the undersampled noisy k-space data to noisy and sparse mr images. these images are next given to the same cnn to generate observed weights, which are further refined with error minimization. target weights (obtained from the fully sampled noiseless images) are considered as reference weights. the cnn refines the weights by minimizing a loss function in which b_f and a_f represent the f-th pair of degraded and actual image patches, respectively, and the patches of the reconstructed image are a function of b_f and of a set of parameters.
it is to be noted that the set of parameters is different from the weights w. the quality of reconstructed images can be further improved by employing several loss functions while training the cnn. post-processing involves scaling the normalized images back to the original intensity levels and dimensions. figure 1 shows the schematic diagram of the proposed denoising model based on cnn and sparse k-space data.
this section demonstrates the performance of the proposed algorithm by considering various images and undersampling schemes that include random, pseudoradial and cartesian sampling. experimentation has been performed with in vivo scans of mr images to evaluate the algorithm for undersampling limit. experimental outcomes have been validated with various parameters, namely psnr [2], fsim [37], hfen [2] and qilv [38]. our algorithm achieves high visual quality with detailed information and patterns in the reconstructed mr image. the cnn in the proposed algorithm has been trained on a gpu; it effectively removes rician noise and improves the efficiency of the mr data acquisition process. a parallel gpu configuration using the caffe framework enhances the execution speed to a great extent. in the present work, every implementation has been carried out on an intel i5-4460 processor (3.20 ghz cpu) with 16 gb ram and 64-bit windows 10 os. matlab 2016b has been used to simulate the results in this work. the proposed cnn-based denoising algorithm has been compared with various state-of-the-art techniques, namely (1) dictionary learning magnetic resonance imaging (dlmri) [2] and (2) non-local means (nlm) and its variants, namely unbiased nlm (unlm), rician nlm (rnlm), enhanced nlm (enlm) and enhanced nlm filter with preprocessing (penlm) [5]. exclusive experiments were carried out on simulated k-space data of brain and in vivo k-space data of knee by increasing the undersampling limit from 4-fold to 20-fold.
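the degradation used to produce the noisy and sparse input images (noise added in k-space, undersampling, inverse fourier transform) can be sketched as follows. this is a minimal illustration and not the authors' matlab code: it assumes complex gaussian noise in k-space (so that the magnitude image is rician distributed), a uniformly random mask rather than the centre-weighted masks shown in the figures, and a noise level expressed as a standard deviation relative to the [0, 1] intensity range. the function name `degrade` and its parameters are hypothetical.

```python
import numpy as np


def degrade(image, keep_fraction=0.25, noise_sd=0.1, seed=0):
    """Simulate a noisy, undersampled acquisition of an image scaled to [0, 1].

    Assumptions for illustration only: complex Gaussian noise is added in
    k-space, the sampling mask is uniformly random, and the orthonormal FFT
    is used so that noise_sd has the same scale in k-space and image space.
    """
    rng = np.random.default_rng(seed)
    kspace = np.fft.fft2(image, norm="ortho")
    noise = rng.normal(0.0, noise_sd, image.shape) \
        + 1j * rng.normal(0.0, noise_sd, image.shape)
    mask = rng.random(image.shape) < keep_fraction  # True where a sample is kept
    # Zero-filled reconstruction: magnitude of the inverse transform.
    zero_filled = np.abs(np.fft.ifft2((kspace + noise) * mask, norm="ortho"))
    return zero_filled, mask


# Toy 256 x 256 phantom: a bright square on a dark background.
image = np.zeros((256, 256))
image[96:160, 96:160] = 1.0
zero_filled, mask = degrade(image)  # about 75% of the samples removed
print(zero_filled.shape, round(float(mask.mean()), 3))
```

the zero-filled magnitude image returned here plays the role of the noisy and sparse input to the network; a keep fraction of 0.25 corresponds to 4-fold undersampling and 0.05 to 20-fold.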
all the experiments are validated not only by visual analysis but also by quantitative data assessment. this section presents the experimental results with the proposed cnn-based algorithm by considering axial brain, t2-weighted sagittal l-spine, c-spine images [39] and t1-weighted knee image [40] . the performance of the cnn-based reconstruction has been compared with dlmri [2] and basic zero filled (zf) reconstruction with respect to the psnr and computation time. rician noise level of around 10 has been added to the above reference images in the k-space and subsequently undersampled in the same domain. dlmri iterations involve alternate sparse coding and dictionary learning stages. to implement dlmri, orthogonal matching pursuit (omp) [41] was employed for sparse coding and 5 iterations of k-svd [42] was used for learning the dictionary. rest of the dlmri parameters were chosen as in [39] . in fig. 2 , reconstruction with the proposed cnn-based method at 20-fold undersampling is presented and compared with the dlmri and zf reconstructions. figure 2a depicts the 20-fold random undersampling having most of the samples in the central k-space. figure 2b shows the 512 × 512 axial brain image and fig. 2c is the noise corrupted image. zf reconstruction in fig. 2d depicts prominent artifacts with 13.94 db psnr. dlmri and cnn reconstruction for axial brain is given in fig. 2e , f respectively. dlmri reconstruction in fig. 2e displays excessive smoothing and provides a psnr of 26.12 db. cnn reconstruction removes noise effectively and reconstructs sharper image with 30.20 db psnr, which is 4.08 db higher as compared to the dlmri. computation time with the proposed cnn was around 13 s whereas dlmri needed nearly 60 s to reconstruct the same 512 × 512 axial brain image. figure 3 shows the reconstruction of 512 × 512 l-spine image with 4-fold cartesian undersampling (fig. 3a) . reference l-spine image is given in the fig. 3b and the noisy l-spine image is shown in fig. 3c . 
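the psnr values quoted in this comparison follow the usual definition based on the mean squared error between the reference and the reconstruction. a minimal sketch is given below, assuming images scaled to [0, 1] so that the peak value is 1; the exact peak convention used for the reported figures is not stated in the text, and the function name `psnr` is only illustrative.

```python
import numpy as np


def psnr(reference, reconstruction, peak=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, peak]."""
    reference = np.asarray(reference, dtype=float)
    reconstruction = np.asarray(reconstruction, dtype=float)
    mse = np.mean((reference - reconstruction) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(peak ** 2 / mse)


# A constant error of 0.1 on a [0, 1] image gives an mse of 0.01, i.e. 20 dB.
reference = np.zeros((4, 4))
reconstruction = reference + 0.1
print(round(psnr(reference, reconstruction), 2))
```

because the scale is logarithmic, a gain of about 4 db, as reported for the cnn over dlmri on the axial brain image, corresponds to the mean squared error dropping to roughly 40% of its previous value.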
zf reconstruction in fig. 3d displays some blurring due to k-space undersampling, and its psnr was observed to be 27.9 db. the l-spine image reconstructed with the cnn method (fig. 3f) provided a psnr of 31.28 db, which was 2.03 db higher than the psnr with dlmri (29.25 db) in fig. 3e. computation time for the cnn was around 13 s whereas dlmri required around 70 s. table 1 compares the quantitative results obtained with the proposed algorithm, dlmri [2] and zf reconstruction for 4-fold random undersampling of the k-space (fig. 4a). experiments were carried out by considering the images of axial brain (fig. 2b), l-spine (fig. 3b), c-spine (fig. 4b) and knee (fig. 4c). quantitative results with the proposed cnn method exhibit an improvement of around 2 db in psnr and a considerable reduction in the computation time, compared to dlmri reconstruction. zf reconstruction is obtained by applying the inverse fourier transform to the noisy and sparse k-space measurements; hence, it is very fast but provides very low psnr.
this section illustrates the qualitative results obtained with our algorithm and also presents the quantitative comparison of the proposed cnn-based denoising with the variants of nlm [5]. mr sequences like t1, t2 and proton density (pd)-weighted provide excellent contrast and anatomical details; however, under the influence of rician noise the image quality deteriorates. rician noise introduces bias that leads to the blurring of edges and structural errors. images from the brainweb database [43] have been employed for simulations in this section. experiments have been carried out on t1-weighted (fig. 5a), t2-weighted (fig. 5d) and pd-weighted (fig. 5g) mr images, to evaluate the performance of the proposed cnn-based algorithm. all the parameters have been selected according to the requirement and a rician noise of around 9% has been added to the images in the k-space.
fig. 2 reconstructions with 20-fold random undersampling: a sampling mask, b reference axial brain image [39], c brain image with 10% rician noise, d zf reconstruction, e dlmri reconstruction, f cnn-based reconstruction
figure 5 presents the visual results with the proposed algorithm for 4-fold undersampling of the k-space by considering the above mr image sequences. our algorithm denoises and provides accurate reconstruction of the t1-weighted image with a psnr of 25.55 db in fig. 5c. the psnr of the reconstructed t2-weighted image in fig. 5f is 30.53 db and the image displays good denoising. a psnr of 31.29 db was observed with the reconstructed pd-weighted image (fig. 5i). reconstruction time for our algorithm was observed to be a few seconds. table 2 compares the numerical results of the proposed method with nlm and its variants. the proposed cnn-based method has been implemented with 4-fold and 20-fold undersampling of the k-space, unlike nlm, which considers the entire k-space data for reconstruction. quantitative outputs confirm that the cnn-based denoising could achieve better psnr for t2 and pd-weighted images compared to nlm, unlm and rnlm even with 20-fold undersampling. the standard deviation (sd) of the rician noise for nlm and the proposed method has also been displayed in table 2. the numerical values obtained with the reconstructions of fig. 5 are indicated in bold in table 2. the visual quality of the reconstructions shows that the proposed denoising algorithm has the capacity to effectively remove the rician noise as well as reconstruct edges precisely even with 4-fold undersampling.
dedicated experiments were carried out with the random and pseudoradial sampling schemes to test the proposed cnn-based denoising algorithm for the achievable undersampling limit. a complex rician noise of 10% was added in the k-space to the 256 × 256 ground truth image of brain (fig. 6a) and the 320 × 320 human knee image (fig. 6b) from the dataset of [44]. the random undersampling of fig.
4a was increased from 4-fold to 20-fold for the experimentation to record the psnr, fsim, qilv and runtime variations, as given in table 3. the proposed algorithm reconstructs mr images with precision even at undersampling rates as high as 20-fold, which is evident from the high psnr, fsim and qilv values. computation time to reconstruct the brain and knee images is between 3 and 6 s. psnr was observed to be consistent even at high undersampling rates and with various sampling patterns like pseudoradial. the minimum reconstruction times observed for the ground truth brain and in vivo knee images are marked in bold in table 3. the algorithm was also evaluated with various undersampling schemes by considering the reference images of fig. 6 corrupted with the same noise level as above. a structural undersampling was adopted with a pseudoradial sampling mask as demonstrated in fig. 7a, which contains 30% of the k-space samples. figure 7b displays the noise corrupted brain image obtained by adding 10% rician noise to the reference image of fig. 6a. the reconstructed brain image displayed in fig. 7c provided a psnr of 28.32 db; the feature similarity index (fsim) and the quality index based on local variance (qilv) were observed to be 0.966 and 0.9677, respectively. next, the in vivo knee mr image of fig. 6b was undersampled with a cartesian sampling mask (fig. 7d) obtained by acquiring 50% of the k-space points. the rician noise corrupted image, prior to undersampling, is shown in fig. 7e and the reconstructed knee image with the proposed cnn-based method is presented in fig. 7f. a psnr of 31.58 db has been observed with the reconstructed knee image. qilv and fsim were noted to be 0.7428 and 0.9548, respectively.
fig. 5 reconstructions with 4-fold random undersampling: a t1-weighted reference brain image [5], b t1 image with 9% rician noise, c reconstructed t1 image, d t2-weighted reference brain image [5], e t2 image with 9% rician noise, f reconstructed t2 image, g pd-weighted reference brain image [5], h pd image with 9% rician noise, i reconstructed pd image
table 2 numerical results for t1, t2 and pd-weighted images [5, 43]
fig. 6 reference images [44]: a ground truth brain, b in vivo knee, c in vivo axial brain
table 3 quantitative results with the proposed algorithm for brain image (fig. 6a) and in vivo knee mr image (fig. 6b)
fig. 7 visual representation of the reconstructed mr images using the proposed cnn-based denoising algorithm with 10% rician noise added to the reference images of fig. 6. a pseudoradial mask [44], b noisy brain image, c reconstructed brain image, d cartesian mask [44], e noisy knee image, f reconstructed knee image, g random mask [44], h noisy axial brain image, i denoised axial brain image
we consider a 230 × 180 axial image of brain shown in fig. 6c as the final image for the experimentation. this image was undersampled in k-space with the random sampling pattern of fig. 7g, which acquires 20% of the samples. a noise corrupted version of the axial brain image is given in fig. 7h. the proposed algorithm reconstructs the denoised image (fig. 7i) with a psnr of 27.47 db. qilv and fsim values were observed to be 0.9897 and 0.9644, respectively. experimental results with the proposed algorithm show high quality reconstructions even with 80% missing k-space data. psnr decreased by about 0.5 db as the random undersampling rate was increased from 4-fold to 20-fold for the brain image (table 3), and the maximum reconstruction time is 3.8 s. quantitative results in table 3 show only 0.2 db variation in psnr for the knee image, and the maximum reconstruction time is as low as 5.1 s.
the low computational time makes the proposed cnn-based algorithm suitable for online reconstruction.
in this manuscript, a novel cnn-based framework to denoise sparse mr images corrupted with rician noise has been presented. cnn training exploits patch-based processing to update and refine the dictionary of weights. training eliminates the rician noise and also estimates the missing k-space data to provide precise reconstructions even at high undersampling rates. the proposed cnn-based denoising method converges faster and extracts mr image patterns in the recovered image, thereby boosting the speed and efficiency of mr data acquisitions. our cnn-based method has been compared with dlmri and zf reconstruction in terms of psnr and computation time. the cnn-based approach is considerably faster than dlmri (around 13 s versus 60 to 70 s for the 512 × 512 images) and improves psnr over both dlmri and zf. the proposed cnn framework can be employed without estimating the noise level; furthermore, our algorithm preserves the local structures better than the traditional nlm, unlm and rnlm even with high undersampling. the computation time using our algorithm is very low, the minimum being 3.5 s. the proposed algorithm reconstructs with high visual quality even at high undersampling rates, which is evident from the psnr, fsim and qilv values. high undersampling rates provide the required compression of k-space data at the acquisition stage and can be employed for wireless transmission. however, various issues associated with wireless streaming of k-space data, like bandwidth requirement, power consumption and encoding, need to be addressed and remain the subject of future study. the cnn approach can be adopted for image segmentation to identify covid-19 patients efficiently to overcome the current crisis.
sparse mri: the application of compressed sensing for rapid mr imaging
mr image reconstruction from highly undersampled k-space data by dictionary learning
sparse geometric image representations with bandelets
improvement of the snr and resolution of susceptibility weighted venography by model-based multi-echo denoising
denoising 3d mr images by the enhanced non-local means filter for rician noise
an improved total variation regularized sense reconstruction for mri images
sparse reconstruction techniques in mri: methods, applications, and challenges to clinical adoption
compressed sensing trends in magnetic resonance imaging. engineering science and technology
reconstruction of signals drawn from a gaussian mixture via noisy compressive measurements
brain mr image restoration using an automatic trilateral filter with gpu-based acceleration
the brain mri image sparse representation based on the gradient information and the non-symmetry and anti-packing model
the rician distribution of noisy mri data
second order total generalized variation (tgv) for mri
low-rank modeling of local k-space neighborhoods (loraks) for constrained mri
motion robust reconstruction of multi-shot diffusion-weighted images without phase estimation through locally low-rank regularization
imagenet classification with deep convolutional neural networks
deep residual learning for image recognition
stochastic deep compressive sensing for the reconstruction of diffusion tensor cardiac mri
adversarial and perceptual refinement for compressed sensing mri reconstruction
deep learning with domain adaptation for accelerated projection-reconstruction mr
deep learning for undersampled mri reconstruction
framing u-net via deep convolutional framelets: application to sparse-view ct
unet++: a nested u-net architecture for medical image segmentation
unet++: redesigning skip connections to exploit multiscale features in image segmentation
compressed sensing dynamic mri reconstruction using multiscale 3d convolutional sparse coding with elastic net regularization
highly scalable image reconstruction using deep neural networks with bandpass filtering
mr image reconstruction using deep density priors. computer vision and pattern recognition
deep residual learning for compressed sensing mri
segmenting the left ventricle in cardiac mri: from handcrafted to deep region based descriptors
a deep learning approach to t1 mapping in quantitative mri. in 36th annual scientific meeting of the european society for magnetic resonance in medicine and biology
classification of covid-19 patients from chest ct images using multi-objective differential evolution-based convolutional neural networks
classification of covid-19 in chest x-ray images using detrac deep convolutional neural network
covidgan: data augmentation using auxiliary classifier gan for improved covid-19 detection
convolutional neural network for sparse reconstruction of mr images interposed with gaussian noise
rank minimization and applications in system theory
adam: a method for stochastic optimization
fsim: a feature similarity index for image quality assessment
image quality assessment based on local variance
convex optimization and greedy iterative algorithms for dictionary learning in the presence of rician noise
signal recovery from random measurements via orthogonal matching pursuit
k-svd: an algorithm for designing overcomplete dictionaries for sparse representation
partial discreteness: a novel prior for magnetic resonance image reconstruction
andhra university in 1994 and m.tech from jntuh, in 2003. she is pursuing her ph.d from jntua, ananthapuramu. she has been working at m.v.s.r.
engineering college since 2000, and is currently serving as associate professor in the department of electronics and communication engineering key: cord-131094-1zz8rd3h authors: parisi, l.; neagu, d.; ma, r.; campean, f. title: qrelu and m-qrelu: two novel quantum activation functions to aid medical diagnostics date: 2020-10-15 journal: nan doi: nan sha: doc_id: 131094 cord_uid: 1zz8rd3h the relu activation function (af) has been extensively applied in deep neural networks, in particular convolutional neural networks (cnn), for image classification despite its unresolved dying relu problem, which poses challenges to reliable applications. this issue has obvious important implications for critical applications, such as those in healthcare. recent approaches are just proposing variations of the activation function within the same unresolved dying relu challenge. this contribution reports a different research direction by investigating the development of an innovative quantum approach to the relu af that avoids the dying relu problem by disruptive design. the leaky relu was leveraged as a baseline on which the two quantum principles of entanglement and superposition were applied to derive the proposed quantum relu (qrelu) and the modified-qrelu (m-qrelu) activation functions. both qrelu and m-qrelu are implemented and made freely available in tensorflow and keras. this original approach is effective and validated extensively in case studies that facilitate the detection of covid-19 and parkinson disease (pd) from medical images. the two novel afs were evaluated in a two-layered cnn against nine relu-based afs on seven benchmark datasets, including images of spiral drawings taken via graphic tablets from patients with parkinson disease and healthy subjects, and point-of-care ultrasound images on the lungs of patients with covid-19, those with pneumonia and healthy controls. 
despite a higher computational cost, results indicated an overall higher classification accuracy, precision, recall and f1-score brought about by either quantum afs on five of the seven bench-mark datasets, thus demonstrating its potential to be the new benchmark or gold standard af in cnns and aid image classification tasks involved in critical applications, such as medical diagnoses of covid-19 and pd. sars-cov-2 is responsible for covid-19, the 'severe acute respiratory syndrome coronavirus 2' (cohen & normile, 2020) and the current global pandemic announced by the world health organization (who, mar 2020) . this virus leads to respiratory disease in humans (cui et al., 2019) , but it may take from 2 to 14 days for the initial symptoms, e.g., fever and cough, to become manifest after an infection (centers for disease control and prevention, 2020) . however, more severe symptoms can progress to viral pneumonia and typically require mechanical ventilation to assist patients with breathing (verity et al., 2020) . in some more severe cases, covid-19 can also lead to worsen symptoms and even death (zhou, et al., 2020) , as well as it may be an aetiology of pd itself (beauchamp et al., 2020) . thus, it is important to be able to detect neurodegenerative co-morbidities in vulnerable undiagnosed patients, such as pd, promptly and non-invasively too, for example via cnns that can recognise patterns from spiral drawings, and then applying non-ionising medical imaging techniques (bhaskar et al., 2020) , which are more appropriate for such patients, to facilitate a prompt diagnosis of covid-19 to improve clinical outcomes. 
whilst tremors can be detected from patterns in spiral drawings as indicators of early pd, ground-glass opacities, lung consolidation, bilateral patchy shadowing and relevant other lesionslike patterns can be detected as biomarkers to identify covid-19-related pneumonia from any other types, including both viral and bacterial pneumonia (shi et al., 2020) . improvements in the afs of cnns can help to improve generalisation in both these image classification tasks. different layers of a deep neural network represent various degrees of abstraction, thus capturing a varying extent of patterns from input images (zeiler & fergus, 2014) . afs provide the cnn with the non-linearity required to learn from non-linearly distributed data, even in presence of a reasonable amount of noise. an af defines the gradient of a layer, which depends on its domain and the range. afs are differentiable and can be either saturated or unsaturated. in table 1 the main activation functions commonly used in deep neural networks, including the convolutional neural network (cnn), with their equations and references, are summarised, and introduced below. saturated afs are continuous with their outputs threshold into finite boundaries, typically represented as s-shaped curves, also named 'sigmoidal' or 'squashing' afs, e.g., the logistic sigmoidal function with its output in the range of 0 and 1 (liew et al., 2016) . saturated afs are typically applied in shallow neural networks, e.g., in mlps. however, saturated afs lead to the vanishing gradient issue whilst training a network with back-propagation (cui, 2018) , i.e., results in gradients that are less than 1, which become smaller with multiple differentiations and ultimately become 0 or 'vanish'. thus, changes in the activated neurons do not lead to modifications of any weights during back-propagation. 
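the vanishing gradient effect described above can be illustrated numerically. the sketch below is only an illustration under simplifying assumptions (a chain of logistic sigmoid units with all weights set to 1, so the back-propagated gradient is just the product of the local derivatives); the function names are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def sigmoid_grad(x):
    """Derivative of the logistic sigmoid; its maximum is 0.25 at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)


def chained_gradient(pre_activations):
    """Product of sigmoid derivatives along a chain of layers (weights taken as 1)."""
    grads = sigmoid_grad(np.asarray(pre_activations, dtype=float))
    return float(np.prod(grads))


# Even in the best case (every pre-activation at 0) the gradient shrinks by a
# factor of 4 per layer, so ten layers scale it by 0.25 ** 10.
print(chained_gradient([0.0] * 10))
```

since each factor is at most 0.25, the product decays at least geometrically with depth, which is why the weight updates in early layers become negligible.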
moreover, the exploding gradient problem can occur, which has an opposite effect to vanishing gradients, wherein the error gradient in the weight is so high that it leads to instability whilst updating the weights during back-propagation. hyperbolic tangent or 'tanh' (see table 1) is a further saturated af, but it attempts to mitigate this issue by extending the range of the logistic function from -1 to 1, centred at 0. nevertheless, tanh still does not solve the vanishing gradient problem. unsaturated functions are not bounded in any output ranges and are centred at 0. the rectified linear unit (relu) (table 1) is the most widely applied unsaturated af in deep neural networks, e.g., in cnns, which provides faster convergence than logistic sigmoidal (lecun et al., 1998) and tanh afs, as well as improved generalisation (litjens et al., 2017). in fact, relu generally leads to more efficient updates of weights during the back-propagation training process (gao et al., 2020). the relu's gradient (or slope) is either one for positive inputs or zero for negative ones, thus mitigating the vanishing gradient issue. nevertheless, even when the weights are appropriately initialised to small random values via the he initialisation stage (glorot et al., 2011), large weight updates can leave the summed input to the relu activation function always negative (the 'dying relu' problem). this negative value yields a zero value at the output and the corresponding nodes do not have any influence on the neural network (abdelhafiz et al., 2019), which can lead to misclassification and hence to a lack of ability to detect a pathology involved in an image classification task accurately and reliably, such as for covid-19 or pd diagnostics. in an attempt to mitigate the 'dying relu' issue in cnns and deeper networks (alexnet, vgg 16, resnet, etc.), multiple variations of the relu af have been introduced, such as the leaky relu (lrelu), the parametric relu (prelu), the randomised relu (rrelu) and the concatenated relu (crelu), as summarised in table 1. maas et al. (2013) introduced leaky relu (lrelu) to provide a small non-zero gradient for negative inputs into a relu function, instead of 0. a constant variable α, with a default value of 0.01, was used to compute the output for negative inputs (table 1). another variant of relu, named 'exponential linear unit' (elu), is aimed at improving convergence (maas et al., 2013) (table 1), but it still does not solve the 'dying relu' issue either. klambauer et al. (2017) introduced a variant of elu called 'scaled exponential linear unit' (selu) (table 1), which is a self-normalising function whose outputs tend towards a normal distribution, converging to zero mean when passed through multiple layers, which makes it suitable for deep neural networks. although selu attempts to avoid both vanishing and exploding gradient problems, it does not mitigate the 'dying relu' issue. he et al. (2015) proposed the parametric rectified linear unit (prelu) in an attempt to provide a better performance than relu in large-scale image classification tasks, although the only difference from lrelu is that α is not a constant but is learned during training via back-propagation. because of this similarity, the prelu does not solve the 'dying relu' issue either, as it is intrinsically a slight variation of the lrelu af. similarly, the randomised leaky rectified linear unit (rrelu) is a randomised version of lrelu (pedamonti, 2018), whereby α is a random number sampled from a uniform distribution, thus still being susceptible to the 'dying relu' issue. shang et al.
(2016) proposed a further slight improvement to the relu named 'concatenated relu' (crelu), allowing for both a positive and negative input activation, by applying relu after copying the input activations and concatenating them. thus, crelu is computationally expensive and prone to the 'dying relu' problem, although it generally leads to competitive classification performance with respect to the gold standard relu and lrelu afs (shang et al., 2016).
table 1. the main activation functions commonly used in deep neural networks, including the convolutional neural network (cnn), with their equations and references. the relu and leaky relu are the most common and reliable ones in cnns.
logistic sigmoid: f(x) = 1 / (1 + e^(−x)); han & moraga (1995)
tanh: gold & rangarajan (1996)
arctan: f(x) = tan^(−1)(x); campbell et al. (1999)
softplus
despite the wide application of dl-based algorithms for image classification in healthcare, such as the cnn (lecun et al., 2015) described in 1.2, its classical af, although it mitigates the vanishing gradient issue typical of sigmoid afs, can still experience the 'dying relu' problem. as discussed in 1.2, none of the recently proposed afs, such as the lrelu, the prelu, elu and selu, has solved this issue yet, as they are still algorithmically similar in their relu-like implementations. this issue can lead to lack of generalisation for cnns, thus hindering their application in a clinical setting. as an example, in the last fully connected layer of the cnn in kollias et al. (2018), which had 1,500 neurons, only 30 neurons yielded non-zero values due to the 'dying relu' problem. even by coupling a recurrent neural network (rnn) with their cnn, thus having a cnn-rnn (kollias et al., 2018), and their last layer then being designed with 128 neurons, only about 20 of them led to non-zero values, whilst the remaining ones experienced the 'dying relu' issue, yielding negligible values.
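to make the relu-based afs discussed above and the notion of 'dead' units concrete, a small numpy sketch follows. it is an illustration only and not the authors' tensorflow or keras code: the leaky slope of 0.01 follows the default quoted in the text, whereas the elu scale of 1.0 and the helper `dead_fraction` are assumptions introduced here.

```python
import numpy as np


def relu(x):
    return np.maximum(0.0, x)


def leaky_relu(x, alpha=0.01):
    """Leaky relu: slope alpha (default 0.01, as quoted in the text) for x <= 0."""
    return np.where(x > 0, x, alpha * x)


def elu(x, alpha=1.0):
    """Exponential linear unit; alpha = 1.0 is an assumed default."""
    return np.where(x > 0, x, alpha * np.expm1(np.minimum(x, 0.0)))


def dead_fraction(pre_activations):
    """Fraction of units (columns) whose relu output is zero for every input (row)."""
    out = relu(np.asarray(pre_activations, dtype=float))
    return float(np.mean(np.all(out == 0.0, axis=0)))


# Two inputs, two units: the first unit only ever receives negative sums, so
# its relu output is always zero; the second unit stays active.
pre = np.array([[-1.0, 2.0],
                [-3.0, 0.5]])
print(dead_fraction(pre))
print(leaky_relu(np.array([-2.0, 3.0])))
```

applied to the example cited from kollias et al. (2018), 30 active neurons out of 1,500 corresponds to a dead fraction of 1 − 30/1500 = 0.98.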
these two examples confirm that classical approaches to relu failed to solve its associated 'dying relu' problem, thus warranting a different approach, which the authors suggest being of quantum nature, as illustrated in 1.4 and motivated in 1.5. quantum ml is a relatively new field that blends the computational advantages brought by quantum computing and advances in ml beyond classical computation (ciliberto, et al., 2018) . quantum ml has not only led to more effective algorithmic performance, but it has also enabled to find the global minimum in the solutions sought after in ml with a higher probability (ciliberto, et al., 2018) . the main principles of quantum computing are those inherited from quantum physics, such as superposition, entanglement, and interferences (barabasi et al., 2019) . according to the quantum principle of superposition, the fundamental quantum bit or qubit can have multiple states at any point in time, i.e., a qubit can have a value of either 0 or 1, such as classical bits, but, differently from and beyond classical bits, a qubit can also have both values 0 and 1 concurrently (barabasi et al., 2019) . a quantum gate is the unification of two quantum states for them to stay 'entangled' into an individual quantum state, wherein a change in one state would affect the other one and vice versa (jozsa & linden, 2003) . thus, a system of qubits, each of which holds multiple bits of information concurrently, behaves as one via the quantum property named 'entanglement', hence enabling massive parallelism too (cleve et al., 1998; solenov et al., 2018) . however, existing quantum approaches to implement afs in deep neural networks have only adopted the repeat-until-success (rus) technique to achieve pseudo non-linearity due to restrictions to linear and unitary operations in quantum mechanics (nielsen & chuang, 2002; cao et al., 2017) . 
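the notions of superposition and entanglement referred to above can be illustrated with a standard state-vector calculation. the sketch below is a textbook construction of a bell state with a hadamard gate followed by a cnot gate, simulated classically with numpy; it is offered only as background and is not the formulation of the qrelu or m-qrelu functions.

```python
import numpy as np

ZERO = np.array([1.0, 0.0])  # single-qubit basis state |0>
H = np.array([[1.0, 1.0],
              [1.0, -1.0]]) / np.sqrt(2.0)  # Hadamard gate
CNOT = np.array([[1.0, 0.0, 0.0, 0.0],      # control: first qubit
                 [0.0, 1.0, 0.0, 0.0],      # target: second qubit
                 [0.0, 0.0, 0.0, 1.0],
                 [0.0, 0.0, 1.0, 0.0]])


def bell_state():
    """Return the amplitudes of (|00> + |11>) / sqrt(2) in the order 00, 01, 10, 11."""
    plus = H @ ZERO                   # superposition: equal weight on |0> and |1>
    two_qubits = np.kron(plus, ZERO)  # first qubit in superposition, second in |0>
    return CNOT @ two_qubits          # entangles the two qubits


probabilities = bell_state() ** 2     # amplitudes are real here
print(probabilities)
```

the measurement probabilities are 0.5 for 00 and 0.5 for 11, with none for 01 or 10: each qubit alone is undetermined, yet the two outcomes are perfectly correlated, which is what 'entangled' means in this context.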
this rus approach to afs involves an individual state preparation routine and the generation of various superimposed and entangled linear combinations to propagate the routine of an af to all states at unison. thus, a deep neural network leveraging this quantum rus technique could theoretically approximate most nonlinear afs (macaluso et al., 2020) . nevertheless, the practical applications of this approach are very limited due to the input range of the neurons in such architectures being bounded between 0 and π/2 as a trade-off of their theoretically generic af formulation. hu (2018) led a similar theoretical research effort in proposing a sigmoid-based non-linear af, which is not periodic to enable a more efficient gradient descent whilst leveraging the principle of superposition in training neurons with multiple states concurrently. however, the classical form of the approach of hu (2018) is the traditional relu, thus still not solving the 'dying relu' problem either. konarac et al. (2020) leveraged a similar quantum-based sigmoid af in their quantum-inspired self-supervised network (qis-net) to provide high accuracy (99%) and sensitivity (96.1%) in magnetic resonance image segmentation, improving performance by about 1% with respect to classical approaches. differently from the related studies mentioned above, the two properties of entanglement and superposition could be pivotal in devising a quantum-based approach to relu in having both a positive solution and a negative one simultaneously, being able to avoid a negative solution by preferring the positive one, whereas traditional classical relu at times would fail by leading to negative solutions only, i.e., the 'dying relu' problem. moreover, this principle enables quantum systems to reduce computational cost with respect to classical approaches, since several optimisations in multiple states can be performed concurrently (schuld et al., 2014) . 
as described in sections 1.1 and 1.2, dl is highly suitable in classifying medical images due to its intrinsic feature extraction mechanisms. as illustrated in both 1.2 and 1.4, the importance of the af is evident in both classical and quantum dl, respectively. although numerous variants of relu functions have been proposed in classical dl models (as revised in section 1.2) they have not been widely adopted as relu and lrelu. these two afs typically ensure accurate and reliable classification and are readily available in python open source libraries, such as tensorflow and keras. nevertheless, both these afs and any recent afs (see section 1.2) have not solved the 'dying relu' problem yet. moreover, vanishing and exploding gradient issues have not been fully resolved either. elu and selu may at times provide faster convergence than relu and lrelu, but they are not as reliable as those and are computationally more expensive (pedamonti, 2018) . such unresolved issues lead to lack of generalisation that may hinder the diagnostic accuracy and reliability of an application leveraging dl techniques for the detection of covid-19 or pd, thus resulting in a potentially high number of false negatives when the model's performance is evaluated on unseen patient data. the authors have hypothesised that this impaired generalisation is due to the classical approach underpinning such relu-based afs that has been just leveraged and moulded in various ways so far, without breaking its inherent functional limitations. the hereby contribution proposes, for the first time, that a quantum-based methodology to relu would improve the learning and generalisation in cnns with relevant impact for critical applications, such as the above-mentioned diagnostic tasks. 
in particular, by blending the two key quantum principles of entanglement of qubits and the effects of superposition to help reach the global minimum in the solution, thus avoiding negative solutions differently from classical approaches as in 1.3, this study investigates the development of a novel af 'quantum relu' to avoid the problem of the 'dying relu' in a quantistic manner. this builds on recent research efforts by cong et al. (2019) to develop a quantum cnn that, although demonstrating how quantum states can be recognised, have not yet addressed the 'dying relu' problem, as it simply leveraged the traditional relus instead. patterns from lung ultrasound images and spiral drawings are known diagnostic biomarkers for covid-19 and pd respectively, pd being at times a delicate co-morbidity of covid-19 patients, and improvements in generalisation are key to an accurate and reliable early diagnosis that can improve outcomes, especially in the event of co-morbidities. thus, the novel quantum relu will be leveraged in a cnn to improve classification performance in such pattern recognition tasks, as quantified via clinically relevant and interpretable metrics, and compared against the same cnn with current gold-standard afs, including relu and lrelu. the proposed added capability of a quantum relu in a cnn is hypothesised to improve its generalisation for pattern recognition in image classification, such as detecting covid-19 and pd from ultrasound scans and spiral drawings, respectively. the remaining sections of the paper are structured as follows. section 2 deals with the methods, including sub-section 2.1 illustrating the two novel quantum afs, along with their mathematical formulation and respective implementations in python codes (in both tensorflow and keras libraries). 
sub-section 2.2 provides a description of the benchmark datasets selected, along with a standardised data pre-processing strategy, whilst section 3 summarises the results obtained comparing the accuracy, reliability and computational time of a cnn with the proposed quantum afs against salient gold standard afs outlined in table 1. finally, section 4 provides a thorough discussion of the results and section 5 summarises the current work and outlines its access, impact, and future applications.
despite appropriate initialisation of the weights to small random values via the he initialisation, large weight updates can leave the summed input to the traditional relu activation function always negative, regardless of the input values fed to the cnn. current improvements to the relu, such as the leaky relu, allow for a more non-linear output to either account for small negative values or facilitate the transition from positive to small negative values, without, however, eliminating the problem. consequently, this study investigates the development of a novel activation function to obviate the problem of the 'dying relu' in a quantistic manner, i.e., by achieving a positive solution where previously the solution was negative. such an added novel capability in a cnn was hypothesised to improve its generalisation for pattern recognition in image classification, which is particularly important in critical applications, such as medical diagnoses of covid-19 and pd. thus, using the same standard two-layered cnn in tensorflow for mnist data classification, and after identifying the main reproducible afs (with associated code available in tensorflow and keras) following a critical review of the literature (section 1), the following nine classical activation functions were considered: relu, leaky relu, crelu, sigmoid, tanh, softmax, vlrelu, elu and selu.
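the 'dying relu' mechanism described above can be illustrated with a minimal sketch in plain python. the neuron weights, bias and input samples below are hypothetical values chosen for illustration only, not values from the study. once the summed input is negative for every sample, the relu gradient is zero everywhere and the neuron can no longer be updated, whereas the leaky relu keeps a small non-zero gradient.

```python
def relu_grad(z):
    """Derivative of ReLU with respect to its summed input."""
    return 1.0 if z > 0 else 0.0

def leaky_relu_grad(z, alpha=0.01):
    """Derivative of leaky ReLU; alpha is the slope for non-positive inputs."""
    return 1.0 if z > 0 else alpha

def summed_input(weights, bias, inputs):
    """Weighted sum that is fed to the activation function."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias

# Hypothetical neuron after a large weight update (illustrative values).
weights = [-0.8, -1.2, -0.5]
bias = -0.3

# Hypothetical pixel intensities scaled to [0, 1].
samples = [[0.0, 0.0, 0.0], [1.0, 0.5, 0.2], [0.3, 0.9, 1.0]]

pre_activations = [summed_input(weights, bias, x) for x in samples]
relu_grads = [relu_grad(z) for z in pre_activations]
leaky_grads = [leaky_relu_grad(z) for z in pre_activations]
```

with non-negative inputs and all-negative weights and bias, no sample can push the summed input above zero, so every entry of `relu_grads` is zero.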
a two-step quantum approach was applied to relu first, by selecting its solution for positive values ($r(z) = z, \forall z > 0$), and the leaky relu's solution for negative values ($r(z) = \alpha \times z, \forall z \le 0$, where $\alpha = 0.01$) as a starting point to improve quantistically. by applying the quantum principle of entanglement, the tensor product of the two candidate state spaces from relu and leaky relu was performed and the quantum-based combination of solutions in (1) was obtained. thus, keeping $r(z) = z$ for positive values ($z > 0$) as in the relu, but with the added novelty of the entangled solution for negative values (1), the quantum relu (qrelu) was attained (fig. 1). the algorithms to describe the methodology and af were implemented in tensorflow and keras, and are presented in listings 1 and 2 respectively, thus avoiding the 'dying relu' problem by maintaining the positivity of the solution mathematically via this new quantum state.
listing 2 (excerpt):
())
model.add(layers.MaxPooling2D((2, 2)))
by leveraging the quantum principle of superposition on the qrelu's solution for positive and negative values, the modified qrelu (m-qrelu) in (2) was obtained (fig. 2). the algorithms to describe the methodology and af were implemented in tensorflow and keras, and are presented in listings 3 and 4 respectively, still avoiding the 'dying relu' issue. listing 3 provides the snippet of code in python to leverage the m-qrelu in tensorflow, using 'py_func' as per listing 1. its usage in tensorflow is the same as that of the 'qrelu' in listing 1, but using 'tf_m_q_relu' as the activation function of the second convolutional layer ('conv2_act').
listing 4 (excerpt):
Conv2D(32, (3, 3), input_shape=(32, 32, 3)))
#model.add(qrelu())
model.add(m_qrelu())
model.add(layers.MaxPooling2D((2, 2)))
the m-qrelu also satisfies the entanglement principle, being derived via the tensor outer product of the solutions from the qrelu.
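the piecewise behaviour described above can be sketched in plain python. the text fixes the positive branch (the identity) and the leaky relu slope (0.01), but it does not spell out the coefficients of the entangled negative branch. the values used below (0.01*z - 2*z for the qrelu and 0.01*z - z for the m-qrelu) are therefore assumptions, chosen so that a negative summed input yields a positive output as the text requires. the function and parameter names are hypothetical, and this is not the authors' listing.

```python
def q_relu(z, leak=0.01, shift=2.0):
    """Sketch of QReLU: identity for z > 0, (leak - shift) * z otherwise.

    The coefficients leak and shift are assumptions, not taken from the text.
    """
    return z if z > 0 else leak * z - shift * z

def m_q_relu(z, leak=0.01, shift=1.0):
    """Sketch of m-QReLU under the same assumption, with a smaller shift."""
    return z if z > 0 else leak * z - shift * z

def q_relu_grad(z, leak=0.01, shift=2.0):
    """Slope of the sketch above; it is non-zero on both branches."""
    return 1.0 if z > 0 else leak - shift

inputs = [-2.0, -1.0, 0.0, 1.0, 2.0]
q_outputs = [q_relu(z) for z in inputs]
m_outputs = [m_q_relu(z) for z in inputs]
```

under these assumed coefficients, strictly negative inputs map to strictly positive outputs and the slope on the negative side is non-zero, so a neuron with a negative summed input still receives a gradient.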
thus, a quantum-based blend of both superposition and entanglement principles mathematically leads the qrelu and the m-qrelu to obviate the 'dying relu' problem intrinsically. as shown in (1) and (2), although the two proposed afs are quantistic in nature, both qrelu and m-qrelu can be run on classical hardware, such as central processing unit (cpu), graphics processing unit (gpu) and tensor processing unit (tpu), the latter being the type of runtime used in this study via google colab (http://colab.research.google.com/) to perform the required evaluation on the datasets described in 2.1. the novel qrelu and m-qrelu were developed and tested using python 3.6 and written to be compatible with both tensorflow (1.12 and 1.15 tested, 1.15 supports tensorflow serving to deploy the novel afs on the cloud) and the keras sequential api. thus, both afs were programmed as new keras layers for ease of use. by selecting the positive quantum state of the summed input of the qrelu and m-qrelu, an optimal early diagnosis could be achieved for patients with covid-19 and pd. thus, this study demonstrates the qrelu and m-qrelu as a potential new benchmark activation function to use in cnns for critical image classification tasks, particularly useful in medical diagnoses, wherein generalisation is key to improving patient outcomes. to assess which af was suitable for each of the pattern recognition tasks involved in classifying the seven benchmark datasets as per 2.1, the performance of the baseline cnn was assessed via the test or out-of-sample classification accuracy, precision, sensitivity/recall and f1-score. precision, recall, and f1-score are important metrics to measure the reliability of the classification outcomes. 95% confidence intervals (cis) were also reported. to enable reproducibility and replicability of the results obtained, publicly available benchmark datasets were gathered and used in this study, as mentioned below. 
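the evaluation metrics named above can be sketched as follows for a binary task. the text does not state how the 95% confidence intervals were computed, so the normal-approximation interval below is an assumption, as are the function names and the toy labels. multi-class tasks (e.g., covid-19 versus pneumonia versus healthy) would additionally need per-class averaging, which is not shown.

```python
import math

def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1-score from two label lists."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": (tp + tn) / len(pairs), "precision": precision,
            "recall": recall, "f1": f1}

def normal_ci95(proportion, n):
    """95% interval for a proportion (normal approximation, an assumption)."""
    half_width = 1.96 * math.sqrt(proportion * (1 - proportion) / n)
    return max(0.0, proportion - half_width), min(1.0, proportion + half_width)

# Toy labels for illustration only (1 = disease, 0 = healthy).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
metrics = binary_metrics(y_true, y_pred)
ci_low, ci_high = normal_ci95(metrics["accuracy"], len(y_true))
```

recall (sensitivity) is the metric most directly tied to false negatives, which is why it matters for the diagnostic tasks considered here.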
moreover, to this end, full python code (.py and .ipynb formats) in both tensorflow (https://www.tensorflow.org/) and keras (https://keras.io/), showing how the datasets were used to train the model and to evaluate its performance, is also provided. as a general benchmark dataset for any image classifier, especially cnns, the mnist data (lecun et al., 1998), including 60,000 images of handwritten digits (50,000 images for training, 10,000 images for testing), was used for the initial model and af validation. this dataset is available in tensor format in tensorflow (https://www.tensorflow.org/datasets/catalog/mnist). to address the specific need to improve the diagnosis of parkinson's disease (pd) and of covid-19 dealt with in this study, further benchmark datasets were used. four benchmark datasets were leveraged to identify pd based on patterns in spiral drawings (1290 subjects in total): the 'spiral handpd' dataset, the 'newhandpd dataset', the kaggle spiral drawings dataset and the uci spiral drawings dataset. as in the mnist dataset, images in all benchmark datasets were converted to grayscale and resized to 28 × 28.
the two-layered cnn, designed as an mnist classifier, was initially validated on the mnist benchmark dataset itself, used for recognising handwritten digits. the qrelu and the m-qrelu were the best and second-best performing activation functions respectively, leading to an acc and an f1-score of 0.99 (99%) and of 0.98 (98%) respectively (table 2). the relu, the leaky relu and the vlrelu also led to the best classification performance on the mnist data (acc = 0.99/99%, f1-score = 0.99/99%) (table 2). thus, the proposed qrelu achieved gold standard classification performance on this benchmark dataset.
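the standardised pre-processing step (grayscale conversion and resizing to 28 × 28) can be sketched in plain python on nested lists. the interpolation method and any intensity scaling used in the study are not stated in the text, so nearest-neighbour sampling, the common luminance weights (0.299, 0.587, 0.114) and the scaling to [0, 1] are all assumptions here, as are the function names.

```python
def to_grayscale(rgb_image):
    """Convert rows of (r, g, b) tuples to luminance values."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in rgb_image]

def resize_nearest(image, out_h=28, out_w=28):
    """Nearest-neighbour resize of a 2-d list to out_h x out_w."""
    in_h, in_w = len(image), len(image[0])
    return [[image[(i * in_h) // out_h][(j * in_w) // out_w]
             for j in range(out_w)]
            for i in range(out_h)]

def preprocess(rgb_image, size=28):
    """Grayscale, resize to size x size, and scale intensities to [0, 1]."""
    gray = resize_nearest(to_grayscale(rgb_image), size, size)
    return [[value / 255.0 for value in row] for row in gray]

# Hypothetical 56 x 84 test image: top half white, bottom half black.
white, black = (255, 255, 255), (0, 0, 0)
example = [[white if r < 28 else black for _ in range(84)] for r in range(56)]
processed = preprocess(example)
```

in practice an image library would normally handle both steps; the sketch only makes the shape and value range of the cnn input explicit.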
notably, the qrelu and the m-qrelu led the same two-layered cnn architecture to achieve the best (acc = 0.92/92%, f1-score = 0.93/93%) and third-best (acc = 0.88/88%, f1-score = 0.90/90%) classification performance (table 3) on the benchmark dataset named 'spiral handpd', which contains images of spiral drawings taken via graphic tablets from patients with pd and healthy subjects. as illustrated in table 4, competitive results were achieved by the qrelu and the m-qrelu on a further benchmark dataset on spiral drawings, the 'newhandpd dataset', leading to the sixth and eighth classification performance respectively (acc = 0.83/83%, f1-score = 0.83/83%; acc = 0.79/79%, f1-score = 0.79/79%). very competitive outcomes were obtained by the two proposed quantum afs on the kaggle spiral drawings dataset, with m-qrelu (acc = 0.73/73%, f1-score = 0.70/70%) and qrelu (acc = 0.67/67%, f1-score = 0.67/67%) leading to the second and fourth classification performance respectively (table 5), as well as when evaluated against the uci spiral drawings dataset (qrelu ranked fifth with acc = 0.82/82% and f1-score = 0.74/74%; m-qrelu ranked sixth with acc = 0.78/78% and f1-score = 0.68/68%) (table 6). the overall increased generalisation brought about by the two novel quantum afs is evident in the outstanding and mutually consistent classification outcomes achieved on both benchmark lung us datasets to distinguish covid-19 from both pneumonia and healthy subjects, with both qrelu and m-qrelu attaining the best (table 7, acc = 0.73/73% and f1-score = 0.73/73%) and the second-best (table 8, acc = 0.60/60% and f1-score = 0.63/63%) classification performance respectively.
despite a higher computational cost (four-fold with respect to the other afs, except with respect to the crelu, against which the increase was almost three-fold), the results achieved by either or both of the proposed qrelu and m-qrelu afs, assessed on classification accuracy, precision, recall and f1-score, indicate an overall higher generalisation achieved on five of the seven benchmark datasets (table 2 on the mnist data, tables 3 and 5 on pd-related spiral drawings, tables 7 and 8 on covid-19 lung us images). consequently, the two quantum relu methods are the overall best performing afs that can be applied for aiding diagnosis of both covid-19 from lung us data and pd from spiral drawings. specifically, when using the novel quantum afs (qrelu and m-qrelu) as compared to the traditional relu and leaky relu afs, the gold standard afs in dnns, the following percentage increases in acc, precision, recall/sensitivity and f1-score were noted:
• an increase of 55.32% in acc and sensitivity/recall via m-qrelu as compared to relu, and of 37.74% with respect to leaky relu, thus avoiding the 'dying relu' problem when the cnn was evaluated on the kaggle spiral drawings benchmark dataset (table 5);
• an increase of 65.91% in f1-score via both qrelu and m-qrelu as opposed to leaky relu, hence obviating the 'dying relu' problem again when tested on the covid-19 ultrasound benchmark dataset (table 7);
• an increase of 50% in acc and sensitivity/recall via both qrelu and m-qrelu with regard to both relu and leaky relu, hence solving the 'dying relu' problem when evaluated on the pocus 19 ultrasound benchmark dataset (table 8);
• an increase in acc and sensitivity/recall from 0% via tanh to 82% via qrelu (reported as an increase by 82,000%), thus avoiding the vanishing gradient problem too, as assessed on the uci spiral drawings benchmark dataset (table 6).
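the percentage increases listed above are relative increases, i.e., 100 × (new − baseline) / baseline. the sketch below reproduces the arithmetic. the baseline values used (47%, 53%, 44% and 40%) are back-computed from the reported increases and are assumptions rather than figures quoted from the tables, and the function name is hypothetical.

```python
def relative_increase_pct(new, baseline):
    """Relative increase of new over baseline, in percent.

    Returns None when the baseline is zero, where the ratio is undefined.
    """
    if baseline == 0:
        return None
    return 100.0 * (new - baseline) / baseline

# acc via m-qrelu (73%) against assumed relu (47%) and leaky relu (53%) baselines
kaggle_vs_relu = round(relative_increase_pct(73, 47), 2)
kaggle_vs_leaky = round(relative_increase_pct(73, 53), 2)

# f1-score via qrelu / m-qrelu (73%) against an assumed leaky relu baseline (44%)
covid_f1_vs_leaky = round(relative_increase_pct(73, 44), 2)

# acc via qrelu / m-qrelu (60%) against an assumed baseline of 40%
pocus_vs_relu = round(relative_increase_pct(60, 40), 2)

# tanh baseline of 0% on the uci dataset: no finite relative increase exists
uci_vs_tanh = relative_increase_pct(82, 0)
```

when the baseline is 0%, as for tanh on the uci dataset, the absolute difference in percentage points (82) is the more informative figure.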
furthermore, it is worth noting that the proposed quantum afs led to improved classification outcomes as compared to recent advances in relu afs, such as crelu and vlrelu:
• qrelu led to acc, precision, sensitivity/recall, and f1-score all higher by 1% than those obtained via crelu when evaluating the cnn's classification performance on the mnist data (table 2).
• m-qrelu resulted in an acc and a sensitivity/recall higher by 3% than crelu, and an f1-score greater by 2% on the spiral handpd dataset (table 3).
• m-qrelu led to an acc and a sensitivity/recall greater by 11% than vlrelu, and an f1-score also higher by 11% on the spiral handpd dataset (table 3).
• m-qrelu resulted in an acc and a sensitivity/recall higher by 6% than vlrelu, and an f1-score greater by 3% on the kaggle spiral drawings dataset (table 5).
• qrelu and m-qrelu led to an acc and a sensitivity/recall greater by 9% and 18% than crelu and vlrelu respectively, and an f1-score higher by 5% and 14% on the covid-19 ultrasound dataset (table 7).
• qrelu and m-qrelu resulted in an acc and a sensitivity/recall higher by 20% than vlrelu, and an f1-score greater by 10% on the pocus 19 ultrasound dataset (table 8).
the results obtained via the qrelu and m-qrelu in a two-layered cnn on the mnist dataset (table 2) were comparable to those of deeper cnns, ranging from three to four layers (lecun et al., 1998; siddique et al., 2019; ahlawat et al., 2020), and of deeper architectures, e.g., resnet and densenet (chen et al., 2018). the two-layered cnn's classification performance via the proposed m-qrelu (acc = 92%, f1-score = 93%, table 3) was also higher by over 2% than that of the best performing five-layered cnns, whose hyperparameters were optimised via the bat algorithm and particle swarm optimisation (pso) respectively (pereira et al., 2016c), to aid diagnosis of pd from spiral drawings, such as using the 'spiral handpd' benchmark dataset.
a comparable precision was achieved by the two-layered cnn model (table 7) when the qrelu and m-qrelu were used as afs with respect to the best classifier so far on the covid-19 ultrasound dataset, i.e., the sixteen-layered pocovid-net model, which builds on the vgg 16 model (born et al., 2020).
table 5. results on performance evaluation of the first convolutional neural network having two convolutional layers, built in tensorflow, and tested on the kaggle spiral drawings benchmark dataset. the size of the images was set to 28 × 28, as per the mnist benchmark dataset.
table 6. results on performance evaluation of the first convolutional neural network having two convolutional layers, built in tensorflow, and tested on the university of california irvine (uci) spiral drawings benchmark dataset. the kaggle spiral drawings benchmark dataset, which includes drawings from both healthy subjects and patients with parkinson's disease, was used for training and the uci spiral drawings benchmark dataset, which only has spiral drawings acquired during both static and dynamic tests from patients with pd, was deployed for testing. the size of the images was set to 28 × 28, as per the mnist benchmark dataset.
table 7. results on performance evaluation of the first convolutional neural network having two convolutional layers, built in tensorflow, and tested on the covid-19 ultrasound benchmark dataset. the size of the images was set to 28 × 28, as per the mnist benchmark dataset.
further to the extensive review of existing relu afs provided in section 1.2, also considering that classical approaches have been unable to solve the 'dying relu' problem as reviewed in section 1.3, and taking into account the advantages of quantum states in afs (listed in section 1.4), two novel quantum-based afs were mathematically formulated in section 2.2 and developed in both tensorflow (listings 1 and 3, https://www.tensorflow.org/) and keras (listings 2 and 4, https://keras.io/) to enable reproducibility and replicability. thus, the mnist two-layered cnn-based classifier in tensorflow was selected as the baseline model to assess the impact of using either quantum af (qrelu or m-qrelu) on the classification performance on seven benchmark datasets as described in section 2.1 and evaluated based on test acc, precision, recall/sensitivity and f1-score, as mentioned in section 2.2. the fact that the proposed qrelu leads to the best classification performance on the mnist benchmark dataset (acc = 99%, f1-score = 99%, table 2) for recognising handwritten digits serves as a regression test to validate the hypothesis whereby, using the baseline cnn-based mnist classifier, the highest classification performance is achieved with the presumed best af. this hypothesis has been further confirmed by the m-qrelu achieving the second-best classification performance (acc = 99%, f1-score = 99%, table 2) across all eleven afs evaluated as in 2.2. achieving the same classification performance as the gold standard reproducible and replicable afs in cnns (relu, the leaky relu and the vlrelu), which are readily available in both tensorflow and keras, the qrelu can be granted the designation of benchmark af for the task of handwritten digit recognition performed on the mnist benchmark dataset.
the benefits of avoiding the 'dying relu' problem become evident when assessing the same two-layered cnn architecture with the qrelu especially (acc = 0.92/92%, f1-score = 0.93/93%, table 3 ), which achieved the best classification performance on critical image classification tasks, such as recognising pd-related patterns from spiral drawings in the 'spiral handpd' benchmark dataset. the higher generalisability achieved via the two proposed quantum afs in further support of the advantage of obviating the 'dying relu' issue is evident from the best classification performance in differentiating covid-19 from both bacterial pneumonia and healthy controls from the lung us data (table 7 -qrelu and m-qrelu with acc = 0.73/73% and f1-score = 0.73/73%). such an overall higher diagnostic performance is corroborated by the second-best classification outcomes attained on the second benchmark lung us dataset (table 8) . whilst traditional relu approaches show highly variable classification outcomes, especially when they experience the 'dying relu' problem (tables 5, 7 and 8), both the qrelu and the m-qrelu were able to ensure a consistently higher classification performance and generalisation across the entire variety of image classification tasks involved, from the benchmark handwritten digits recognition task (mnist), to recognising pd-related patterns from spiral drawings taken from graphic tablets, to aiding detection of covid-19 from bacteria pneumonia and healthy lungs based on us scans. the advantage of using the proposed afs for covid-19 detection lies in the potential for their translational applications in a clinical setting, i.e., in leveraging cnns with the qrelu or m-qrelu to detect covid-19 in patients with neurodegenerative co-morbidities, such as pd, via non-ionising medical imaging (e.g., us). 
this added capability will come handy in future, as portable mri and ml-enhanced mri technologies will also become more affordable and widespread, thus being improvable with deep learning models (e.g., the two-layered cnn with qrelu or m-qrelu afs in this study). solutions either on edge devices or on the cloud for tele-diagnosis and tele-monitoring required in pandemics similar to the current one (covid-19) could be soon suitable for in-home diagnostic and prognostic assessments too, which should improve personalised care for shielded or vulnerable individuals. moreover, competitive outcomes were obtained via the qrelu and the m-qrelu on three further benchmark datasets, e.g., 'newhandpd dataset', the kaggle and the uci spiral drawings benchmark datasets, with acc and f1-score mostly above 75% (tables 4-6) using the relatively simple deep neural network leveraged in this study (the two-layered mnist cnn classifier). such results also demonstrate the added capability of the proposed qrelu and the m-qrelu to avoid the vanishing gradient problem occurred using tanh (0% acc and sensitivity/recall), as evaluated on the uci spiral drawings benchmark dataset (table 6) . despite the overall increase in generalisability brought about by the qrelu and the m-qrelu, the computational cost of the cnn increased by four times as compared to the other nine afs evaluated, except for the crelu, against which a threefold increase was reported (tables 2-8) . nevertheless, considering the importance of achieving higher classification performance over lower computational cost for diagnostic applications in a clinical setting, especially for the critical image classification tasks involved in this study, such as the detection of pd (tables 3-6 ) and covid-19 (tables 7 and 8) , this increase in computational cost is not expected to impair the wide application of the two novel quantum afs to aid such diagnostic tasks and any other medical applications involving image classification. 
in fact, the qrelu and m-qrelu have been demonstrated as considerably better than the current (undisputedly assumed) gold standard afs in cnns, i.e., the traditional relu and the leaky relu. in particular, an increase by 50-66% in both accuracy and reliability (especially, sensitivity/recall and f1-score) metrics was reported across both pattern recognition tasks, i.e., detection of pd-related patterns from spiral drawings (tables 5 and 6 ) and aiding diagnosis of covid-19 from us scans ( table 7) . the two proposed quantum afs also outperformed more cutting-edge relu afs, such as the crelu and the vlrelu, by 5-20% across all classification tasks considered, i.e., mnist data classification (table 2) , spiral drawings pd-related pattern recognition (in particular, tables 3 and 5) , and covid-19 detection from us scans (tables 7 and 8) . moreover, the qrelu and the m-qrelu led the baseline two-layered cnn mnist classifier to achieve a comparable classification performance on the mnist dataset as deeper cnns, ranging from three to four layers (lecun et al., 1998; siddique et al., 2019; ahlawat et al., 2020) , including deeper architectures, e.g., resnet and densenet (chen et al., 2018) . it is worth noting that, when leveraging the qrelu and the m-qrelu, the two-layered cnn with hyperparameters based on the mnist data outperformed (acc = 92%, f1-score = 93%, table 3 ) deeper and ba-and pso-optimised cnns from published studies by over 2% (pereira et al., 2016c) in aiding the diagnosis of pd from patterns in spiral drawings (e.g., using the 'spiral handpd' benchmark data). the two-layered cnn model with either qrelu or m-qrelu as afs achieved a comparable precision (table 7) to the best-performing classifier on the covid 19 ultrasound dataset, i.e., the sixteen-layered pocovid-net model, which is an extension of the vgg 16 benchmark model (born et al., 2020) . 
these outcomes show the two main practical advantages brought about by the avoidance of the 'dying relu' problem in qrelu and the m-qrelu that outweigh the initial consideration on these two quantum afs leading to an overall higher computational cost despite the increased generalisation, which are as follows: 1. using qrelu or m-qrelu can obviate the need for several convolutional layers in cnns and any cnn-derived models, such as alexnet, resnet, densenet, condensenet, ccondensenet and vgg 16, as demonstrated above and in section 3 (results), 2. leveraging qrelu or m-qrelu as afs in cnn can minimise the need for optimisation of cnn's hyperparameters. the implications of the two above-mentioned practical benefits are multiple. firstly, the two proposed afs may not only improve generalisation but also computational cost when considering image classification tasks that involve deeper architectures than the two-layered cnn used in this study. thus, the proposed afs may be viable alternatives to the relu af, which is the current gold standard af in cnns. second, by improving both generalisation and computational cost when deeper architectures may be required, the qrelu and m-qrelu may be suitable for tasks that require scalability of deep neural networks. third, the proposed quantum afs may enable more effective transfer learning, such as for covid-19 detection in multiple geographical areas, as well as extending trained deep nets to further diagnostic tasks, including prognostic applications too, and aiding self-driving vehicles in image classification tasks essential to ensure passenger safety. overall, the avoidance of the 'dying relu' problem achieved via qrelu and m-qrelu is expected to radically shift the paradigm of blindly relying on the traditional relu af in cnn and any cnn-derived models, and embrace innovative approaches, including quantum-based, such as the two novel afs designed, developed and validated in this study. 
further to a thorough analysis of the classification performance of the two-layered cnn mnist classifier leveraging the two quantum afs developed in this study, qrelu and m-qrelu, and evaluated against nine benchmark afs, including relu and its main recent reproducible and replicable advances, as well as relevant published studies, the proposed qrelu and m-qrelu prove to be the first two afs in the recorded history of deep learning to successfully avoid the 'dying relu' problem, by design. their novel algorithms describing the methodology and af were implemented in tensorflow and keras, as well as presented in listings 1-4. this added capability ensured accurate and reliable classification for recognising pd-related patterns from spiral drawings and detecting covid-19 from non-ionising medical imaging (us) data. furthermore, their availability in both google's tensorflow and keras, the two most popular libraries in python for deep learning, facilitates their wide application beyond clinical diagnostics, including medical prognostics and any other applications involving image classification. thus, the qrelu and m-qrelu can aid detection of covid-19 during these unprecedented times of this pandemic, as well as deliver continued added value in aiding the diagnosis of pd based on pattern recognition from spiral drawings. notably, when leveraging the proposed quantum afs, the baseline cnn model achieved comparable classification performance to deeper cnn and cnn-derived architectures across all image recognition tasks involved in this study, from handwritten digit recognition, to detection of pd-related patterns from spiral drawings and covid-19 from lung us scans. thus, these outcomes corroborate the benefit of using afs that avoid the 'dying relu' problem for critical image classification tasks, such as for medical diagnoses, making them a viable alternative to the current gold standard af in cnns, i.e., the relu.
this study is expected to have a radical impact in redefining the benchmark afs in cnn and cnn-derived deep learning architectures for applications across academic research and industry.
improved handwritten digit recognition using convolutional neural networks (cnn) quantum computing and deep learning working together to solve optimization problems big data and machine learning in health care parkinsonism as a third wave of the covid-19 pandemic? chronic neurology in covid-19 era: clinical considerations and recommendations from the reprogram consortium. front. neurol pocovid-net: automatic detection of covid-19 from a new lung ultrasound imaging dataset (pocus) stability and bifurcation of a simple neural network with multiple time delays quantum neuron: an elementary building block for machine learning on quantum computers. arxiv assessing four neural networks on handwritten digit recognition dataset (mnist) quantum machine learning: a classical perspective quantum algorithms revisited quantum convolutional neural networks applying gradient descent in convolutional neural networks origin and evolution of pathogenic coronaviruses adaptive convolution relus. thirty-fourth aaai conference on artificial intelligence deep sparse rectifier neural networks softmax to softassign: neural network algorithms for combinatorial optimization the influence of the sigmoid function parameters on the speed of backpropagation learning sigmoid transfer functions in backpropagation neural networks delving deep into rectifiers: surpassing human-level performance on imagenet classification reducing the dimensionality of data with neural networks towards a real quantum neuron improved spiral test using digitized graphics tablet for monitoring parkinson's disease on the role of entanglement in quantum-computational speed-up deep learning applications in medical image analysis self-normalizing neural networks.
arxiv deep neural architectures for prediction in healthcare a quantum-inspired self-supervised network model for automatic segmentation of brain mr images imagenet classification with deep convolutional neural networks convolutional networks for images, speech, and time-series gradient-based learning applied to document recognition. proceedings of the ieee deep learning bounded activation functions for enhanced training stability of deep neural networks on visual pattern recognition problems a survey on deep learning in medical image analysis rectifier nonlinearities improve neural network acoustic models. proceedings of the 30th international conference on machine learning a variational algorithm for quantum neural networks rectified linear units improve restricted boltzmann machines quantum computation and quantum information comparison of non-linear activation functions for deep neural networks on mnist classification task a new computer vision-based approach to aid the diagnosis of parkinson's disease deep learning-aided parkinson's disease diagnosis from handwritten dynamics convolutional neural networks applied for parkinson's disease identification frelu: flexible rectified linear units for improving convolutional neural networks learning representations by back-propagating errors imagenet large scale visual recognition challenge collection and analysis of a parkinson speech dataset with multiple types of sound recordings the quest for a quantum neural network. 
quantum inf process understanding and improving convolutional neural networks via concatenated rectified linear units radiological findings from 81 patients with covid-19 pneumonia in wuhan recognition of handwritten digit using convolutional neural network in python with tensorflow and comparison of performance for various hidden layers the potential of quantum computing and machine learning to advance clinical research and change the practice of medicine leaky_relu | tensorflow core v2.3.0 tensorflow. 2020. tf.keras.layers.leakyrelu | tensorflow core v2.3.0 estimates of the severity of coronavirus disease 2019: a model-based analysis empirical evaluation of rectified activations in convolutional network visualizing and understanding convolutional networks distinguishing different stages of parkinson's disease using composite index of speed and pen-pressure of sketching a spiral clinical course and risk factors for mortality of adult inpatients with covid-19 in wuhan, china: a retrospective cohort study the authors would like to thank two research assistants from the university of bradford, ms smriti kotiyal and mr rohit trivedi, for their assistance to the background review relevant for this paper.the authors declare that no ethical approval was required for carrying out the study, as the data used in it were taken from publicly available repositories and appropriately referenced in text. moreover, the authors declare not to have any competing interests and an appropriate funding statement has been provided on the title page of this article. 
key: cord-034614-r429idtl authors: yasar, huseyin; ceylan, murat title: a new deep learning pipeline to detect covid-19 on chest x-ray images using local binary pattern, dual tree complex wavelet transform and convolutional neural networks date: 2020-11-04 journal: appl intell doi: 10.1007/s10489-020-02019-1 sha: doc_id: 34614 cord_uid: r429idtl in this study, which aims at early diagnosis of covid-19 disease using x-ray images, the deep-learning approach, a state-of-the-art artificial intelligence method, was used, and automatic classification of images was performed using convolutional neural networks (cnn). in the first training-test data set used in the study, there were 230 x-ray images, of which 150 were covid-19 and 80 were non-covid-19, while in the second training-test data set there were 476 x-ray images, of which 150 were covid-19 and 326 were non-covid-19. thus, classification results have been provided for two data sets, containing predominantly covid-19 images and predominantly non-covid-19 images, respectively. in the study, a 23-layer cnn architecture and a 54-layer cnn architecture were developed. within the scope of the study, the results were obtained using chest x-ray images directly in the training-test procedures and the sub-band images obtained by applying dual tree complex wavelet transform (dt-cwt) to the above-mentioned images. the same experiments were repeated using images obtained by applying local binary pattern (lbp) to the chest x-ray images. within the scope of the study, four new result generation pipeline algorithms having been put forward additionally, it was ensured that the experimental results were combined and the success of the study was improved. in the experiments carried out in this study, the training sessions were carried out using the k-fold cross validation method. here the k value was chosen as 23 for the first and second training-test data sets. 
considering the average highest results of the experiments performed within the scope of the study, the values of sensitivity, specificity, accuracy, f-1 score, and area under the receiver operating characteristic curve (auc) for the first training-test data set were 0,9947, 0,9800, 0,9843, 0,9881 and 0,9990 respectively; while for the second training-test data set, they were 0,9920, 0,9939, 0,9891, 0,9828 and 0,9991; respectively. within the scope of the study, finally, all the images were combined and the training and testing processes were repeated for a total of 556 x-ray images comprising 150 covid-19 images and 406 non-covid-19 images, by applying 2-fold cross. in this context, the average highest values of sensitivity, specificity, accuracy, f-1 score, and auc for this last training-test data set were found to be 0,9760, 1,0000, 0,9906, 0,9823 and 0,9997; respectively. in the last few months of 2019, a new type of virus, which is a member of the family coronaviridae, emerged. the virus in question is considered to have had a zoonotic origin [1] . the virus that emerged in the city of wuhan in hubei province in china affected this region first and then spread all over the world in a short time. the virus generally affects the upper and lower respiratory tract, lungs, and, less frequently, the heart muscles [2] . while the virus generally affects young and middle-aged people and people who do not have any chronic diseases to a lesser extent, it can cause severe consequences, resulting in death, in people who suffer from diseases such as hypertension, cardiovascular disease, and diabetes [3] . the epidemic, which was declared to be a pandemic in march 2020 by the world health organization; as of the first week of october of the same year, had a number of cases approaching thirty-six million, while the death toll reached one million hundred thousand. also, a modeling study carried out by hernandez-matamoros et al. 
[4] indicates that the effects of the epidemic will become more severe in the future. in people suffering severely from the disease, the serious adverse effects are generally in the lungs [3] . in this context, many literature studies have been carried out in a short time in which these effects of the disease in the lungs were shown using ct scans of lungs and chest x-ray imaging. literature studies indicate that radiological imaging, along with clinical symptoms, blood, and biochemical tests, is an effective and reliable diagnostic tool for the diagnosis of covid-19 disease. many clinical studies in which x-ray images were examined [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15] [16] [17] [18] [19] [20] [21] [22] [23] [24] have shown that covid-19 disease causes interstitial involvement, bilateral and irregular ground-glass opacities, parenchymal abnormalities, a unilobar reversed halo sign, and consolidation on the lungs. the recent review article published by long and ehrenfeld [25] highlighted the importance of using artificial intelligence methods to quickly diagnose covid-19 disease and reduce the effects of the outbreak crisis. in this context, some literature studies have been carried out that diagnose covid-19 disease (covid-19 and non-covid19 ) through x-ray images and using deep learning methods. table 1 contains some summary information about the number of images, study methods, and study results used in these literature studies. ct imaging generally contains more data than x-ray imaging. however, it has some disadvantages for the follow-up of all stages of the disease due to the excess amount of radiation that the patients are exposed to. for this reason, an artificial intelligence application using x-ray images was created and tested in the study. 
in this study, which aims at early diagnosis of covid-19 disease with the help of x-ray images, a deep learning approach, which is an artificial intelligence method applying the latest technology, was used. in this context, automatic classification of the images was carried out through the two different convolutional neural networks (cnns). in the study, experiments were carried out for the use of images directly, using local binary pattern (lbp) as a pre-process and dual tree complex wavelet transform (dt-cwt) as a secondary operation, and the results of the automatic classification were calculated separately. within the scope of the study, four new classification approaches that involve performing the experiments together and combining the results through a result generation algorithm, have been proposed and tested. the results of the study show that in the diagnosis of covid-19 disease, the analysis of chest x-ray images using deep learning methods provides fast and highly accurate results. the chest x-ray images of patients with covid-19 used in the study were obtained by combining metadata data sets that were made open access over github after being created by cohen et al. [42] and over kaggle after being created by dadario [43] . the images that these data sets contain in common and the clinical notes related to these images were combined and a mixed covid-19 image data set consisting of 150 chest x-ray images was created. in the study, images obtained while the patients were facing the x-ray device directly were used. in the studies, the images taken from the same patient were obtained on different days of the course of the disease and therefore do not contain exactly the same content. the dimensions of the images in question vary between 255 px × 249 px and 4280 px × 3520 px (px is pixel abbreviation) and show a wide variety. 
also, these images have different data formats such as png, jpg, jpeg and two different bit depths such as 8-bit (gray-level) and 24-bit (rgb). standardization of the images is an essential process for use in this study. in this context, all of the images have been converted to 8-bit gray-level images. then, to clarify the area of interest on the images, manual framing was performed so as to cover the chest area. after this process, all the images were rearranged to 448 px × 448 px and saved in png format. for the non-covid-19 x-ray images in the study, two data sets, a montgomery data set [44] and a shenzhen data set [44] , were used separately. these databases contain 80 and 326 non-covid-19 x-ray images, respectively. the first trainingtest data set contains a total of 230 x-ray images, of which 150 are covid-19 images and 80 are non-covid-19 images, while the second training-test data set contains 476 x-ray images, of which 150 are covid-19 images and 326 are non-covid-19 images. thus, it was ensured that classification results were obtained for the two data sets that contained predominantly covid-19 images and predominantly non-covid-19 images, respectively. the processes applied to the covid-19 images were likewise applied to the non-covid-19 images. in fig. 1 , original and edited versions of the x-ray images are shown; one belonging to a patient with covid-19 and two belonging to people without covid-19 (non-covid-19 people). local binary pattern (lbp) is an approach that was proposed by ojala et al. [45] to reveal local features. the method is basically based on comparing a pixel on the image to the neighboring pixels one by one, in terms of size. in fig. 2 , the images obtained by applying the lbp operation to the x-ray images given in fig. 1 are included. 
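the neighbour comparison performed by lbp can be illustrated with a minimal sketch. it assumes the basic 3 × 3 operator with eight neighbours; the function name `lbp_3x3`, the clockwise bit order and the `>=` comparison are illustrative assumptions and not necessarily the exact variant used in the study.

```python
import numpy as np

def lbp_3x3(image):
    """Basic 3x3 local binary pattern (illustrative sketch, not the authors' code).

    Each interior pixel is compared with its eight neighbours; a neighbour
    that is >= the centre contributes one bit to an 8-bit code.
    The one-pixel border is dropped, so the output is (H-2) x (W-2).
    """
    img = np.asarray(image, dtype=np.int32)
    h, w = img.shape
    center = img[1:-1, 1:-1]
    # Assumed bit order: clockwise, starting at the top-left neighbour.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(center)
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = img[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        codes |= (neighbour >= center).astype(np.int32) << bit
    return codes.astype(np.uint8)
```

because the border pixels have no complete neighbourhood, the output of this sketch is two pixels smaller in each dimension than the input, so an lbp image produced this way would have to be resized before being given to a network with a fixed input size.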
the purpose of benefiting from lbp operation within the scope of this study is to observe the effects of using lbp images, which reflect the local features in the cnn input on the study results, rather than the original images. additionally, the aim of the study is to increase the image feature depth used in the new result generation algorithm. dual tree complex wavelet transform (dt-cwt) was first introduced by kingsbury [46] [47] [48] . this method is generally similar to the gabor wavelet transform. in the gabor wavelet transform, low-pass and high-pass filters are applied to the rows and columns of the image horizontally and vertically. in this way, two different sub-band groups are formed in rows and columns as low (l) and high (h). crossing is made during the conversion of the said one-dimensional bands into two dimensions. at the end of the process, a low sub-band, named ll, is obtained. in addition, three sub-bands containing high bands, lh, hl, and hh, are formed. further sub-bands (such as lll, llh) can be obtained by applying the same operations to the ll sub-band. unlike the gabor wavelet transform, instead of a single filter, dt-cwt uses two filters that work in parallel. these two trees contain real and imaginary parts of complex numbers. that is, as a result of the dt-cwt process, a sub-band containing more directions than the gabor wavelet transform is obtained. when dt-cwt is applied to an image, the processes are performed for six different directions, +15, −15, +45, −45, +75, and − 75 degrees. three of these directions represent real sub-bands and the other three represent imaginary sub-bands. figure 3 shows the dt-cwt decomposition tree. in fig. 4 , real and imaginary sub-band images obtained by applying the dt-cwt process (scale = 1) to the x-ray images given in fig. 1 , are shown. 
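the row and column filtering scheme described above, which yields the ll, lh, hl and hh sub-bands, can be sketched with a one-level real haar transform. this is only an illustration of how the four sub-bands arise and why each is half the size of the input; it is not the dt-cwt itself, which uses two filter trees and produces complex, direction-selective sub-bands. the function name `haar_dwt2` and the choice of haar filters are assumptions made for the example, and the lh/hl naming convention varies between sources.

```python
import numpy as np

def haar_dwt2(image):
    """One-level separable Haar decomposition (illustrative sketch only).

    Low-pass (sum) and high-pass (difference) filters are applied along
    the rows, then along the columns, with downsampling by two each time.
    Height and width are assumed to be even.
    """
    x = np.asarray(image, dtype=np.float64)
    s = np.sqrt(2.0)
    # Filter along each row (pairs of horizontally adjacent pixels).
    low = (x[:, 0::2] + x[:, 1::2]) / s
    high = (x[:, 0::2] - x[:, 1::2]) / s
    # Filter along each column (pairs of vertically adjacent pixels).
    ll = (low[0::2, :] + low[1::2, :]) / s
    lh = (low[0::2, :] - low[1::2, :]) / s
    hl = (high[0::2, :] + high[1::2, :]) / s
    hh = (high[0::2, :] - high[1::2, :]) / s
    return ll, lh, hl, hh
```

applied to a 448 × 448 image, each of the four returned arrays is 224 × 224; repeating the call on `ll` would give the next decomposition level.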
within the scope of the study, the dt-cwt process was used with a scale (level) value of 1, and the dimensions of the sub-band images obtained were half the size of the original images. since the complex wavelet transform has been successful in many studies [49] [50] [51] where medical images have previously been used, this transform was preferred in the study.
deep learning has come to the fore in recent years as an artificial intelligence approach that provides successful results in many image processing applications from image enhancement (such as [52] ) to object identification (such as [53, 54] ). convolutional neural network (cnn) has been the preferred deep learning model in image processing applications in recent years. the cnn classifier, in general, consists of a convolution layer, activation functions, a pooling layer, a flatten layer, and fully connected layer components. in this context, fig. 5 describes the general operation of the cnn classifier. more detailed information about the functions and operating modes of the layers in the cnn classifier can be found in the studies [55] [56] [57] [58] [59] .
fig. 1 a) x-ray image of a patient with covid-19 (phan et al. [23] ) b) non-covid-19 x-ray image (montgomery data set [44] ) c) non-covid-19 x-ray image (shenzhen data set [44] )
within the scope of the study, a cnn architecture with a total of 23 layers was designed. an efficient design was sought, since increasing the number of layers in the cnn architecture leads to increased processing time in the training and classification processes. table 2 contains details of the first cnn architecture used in the study. also, a second cnn architecture was used to check whether the proposed pipeline approaches also apply to other cnn architectures. in this context, an architecture modeled on vgg-16 cnn was used. however, to reduce the processing load, the number of filters and the fully connected layer sizes have been reduced.
additionally, normalization layers were added after the intermediate convolution layers. details of this second cnn architecture are given in table 3 . matlab 2019a was used as the software in the study. the layer names and parameters in tables 2 and 3 are the names and parameters used directly in the software. more than one experiment was carried out in the study, and the sizes of the input images used in the experiments differ. for this reason, the input layer has different sizes in tables 2 and 3 . these cnn architectures were used in all the experiments carried out within the scope of the study.
within the scope of the study, the confusion matrix and the statistical parameters obtained from this matrix were used to evaluate the results. detailed information about the confusion matrix and the parameters derived from it, i.e., sensitivity (sen), specificity (spe), accuracy (acc), and f-1 score (f-1), can be found in [60] . receiver operating characteristic (roc) analysis was also used to evaluate the results. in addition, the area under the roc curve (auc) was calculated. roc analysis graphically reflects the variation of sensitivity (sen) (y-axis) relative to 1-spe (x-axis) as the threshold value is gradually changed, with a certain precision, between the minimum and maximum outputs predicted by the classifier.
first of all, in the proposed pipeline algorithm, training and test procedures for images of size 448 × 448 were performed and results were obtained.
- before the experiments after the first experiment were conducted, dt-cwt was applied to the images of size
- in the third experiment, training and testing procedures were carried out and results were obtained for the case of giving the imaginary part of the ll sub-band image obtained by applying dt-cwt, as input to the cnn.
- in the fourth experiment, training and testing procedures were carried out and results were obtained for the case of
- in the seventh experiment, results were obtained for the case of giving the real and imaginary parts of the ll, lh, hl sub-band images obtained by applying dt-cwt, as input to the cnn, together.
a block diagram of the experiments carried out in the study is shown in fig. 6 . the first seven experiments conducted were repeated using new images obtained by applying lbp to the x-ray images, and the first stage experiments were completed. since the image size decreases after lbp processing, these images were rearranged as 448 px × 448 px in size.
in the ongoing part of the study, four pipeline classification algorithms were designed using the principle of parallel operation. these algorithms are based on combining the results of previous experiments to obtain new results. the first two pipeline classification algorithms mentioned above work as follows:
- if the numbers of labelings (for a threshold value of 0,5) obtained in the experiments (with and without lbp) for an image are not equal to each other, the labeling result obtained in more than half of the experiments is considered to be the algorithm labeling result for covid-19 or non-covid-19.
the basic coding of the first two pipeline classification approaches is included in table 4 . in the codes between tables 4 and 6, result-1 and label-1 represent the actual test result and the label obtained without using lbp, while result-2 and label-2 represent the actual test result and the label obtained using lbp. in the third and fourth pipeline algorithms, unlike the first two pipeline algorithms, if the tags obtained as a result of the classification experiment differ from each other, the result obtained without applying lbp has been taken into consideration with priority.
accordingly, in the case where the two classification tags are different from each other in the third pipeline algorithm, if the tag result obtained without applying lbp was abnormal, the result was considered abnormal. in the fourth pipeline algorithm, in the case of the two classification tags being different from each other, if the tag result obtained without applying lbp was normal, the result was considered normal. the other procedures are the same as for the first two pipeline algorithms. a mixing rate of 50%-50% was applied in the third and fourth pipeline algorithms. the basic coding of the third and fourth pipeline classification approaches is given in tables 5 and 6 .
in this study, which aims to detect covid-19 disease early using x-ray images, the deep learning approach, which is the artificial intelligence method applying the latest technology, was used and automatic classification of the images was performed using cnn. in the first training-test data set used in the study, there were 230 x-ray images, of which 150 were covid-19 and 80 were non-covid-19, while in the second training-test data set there were 476 x-ray images, of which 150 were covid-19 and 326 were non-covid-19. thus, it was ensured that the classification results were obtained separately from the two data sets containing predominantly abnormal images and predominantly normal images. the information from the training-test data sets is given in table 7 . within the scope of the study, chest x-ray images were manually framed to cover the lung region, primarily to determine the areas of interest on the image. then, standardization was carried out since the images used were of very different sizes, formats, and bit depths.
table 4 basic coding of the pipeline algorithms (pipeline-1 and -2) proposed in the study
table 5 basic coding of the pipeline algorithm (pipeline-3) proposed in the study
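the disagreement rules of the third and fourth pipeline algorithms described earlier in this section can be sketched as follows. this is a hedged illustration, not the coding given in the tables of the study: it assumes that result-1 and result-2 are scores in [0, 1] for the abnormal (covid-19) class, that labels are obtained by thresholding the scores, and that cases not covered by the priority rule fall back to a 50%-50% average of the two scores. the function name `combine_results` and this fallback are assumptions.

```python
def combine_results(result_1, result_2, rule, threshold=0.5):
    """Combine a score obtained without LBP (result_1) and with LBP (result_2).

    Returns True for 'abnormal' and False for 'normal'.
    Illustrative sketch under the assumptions stated in the lead-in.
    """
    label_1 = result_1 >= threshold   # label without LBP
    label_2 = result_2 >= threshold   # label with LBP
    if label_1 != label_2:
        # Pipeline-3: on disagreement, an abnormal label without LBP wins.
        if rule == "pipeline-3" and label_1:
            return True
        # Pipeline-4: on disagreement, a normal label without LBP wins.
        if rule == "pipeline-4" and not label_1:
            return False
    # Assumed fallback: 50%-50% mixing of the two scores.
    mixed = 0.5 * result_1 + 0.5 * result_2
    return mixed >= threshold
```

under these assumptions pipeline-3 leans towards abnormal decisions and pipeline-4 towards normal decisions whenever the two classifiers disagree, which is in line with the higher sensitivity reported for pipeline-3 and the higher specificity reported for pipeline-4.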
the areas of interest on the image were resized and the image sizes were arranged as 448 px × 448 px. after that, the images in question were saved in png format so as to be as gray-scale and 8-bit depth. these operations were applied to all the abnormal and normal images used in the study. in the ongoing part of the study, a 23-layer cnn architecture and a 54-layer cnn architecture were designed and used, the details of which have been previously described. those cnn architectures were used in all the experiments. due to the fact that more than one experiment was performed within the scope of the study, only the images given to the cnn input differ in size. in the experiments conducted in the study, the trainings were carried out with the k-fold cross validation method. in this context, the k value was chosen as 23. since the first training-test data set consists of 230 images, 220 images, except for ten images at each stage (fold), were used for the training operations, and the remaining ten images were used for the testing operations. the second training-test data set consists of 476 images, and, in the same way, except 20/21 (16 groups consisting of 21 images and seven groups of 20 images) images, 456/455 images were used in the training operations, and the remaining 20/21 images were used in the testing operations. the test procedures were repeated 23 times and classification results were obtained for all the images. finally, within the scope of the study, all the images were combined and the training and testing procedures were repeated by applying a 2-fold cross for a total of 556 x-ray images comprising 150 covid-19 images and 406 non-covid-19 images. considering the length of the study as well, the results that have been shared in the study are only for the input data that provided the best results for the first and second data sets. in this part of the study, a total of 14 experiments were carried out. 
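the fold sizes quoted above can be checked with a short sketch of a 23-fold split. the helper names `fold_sizes` and `kfold_indices`, the shuffling step and the seed are assumptions for illustration; the text does not state how images were assigned to folds.

```python
import numpy as np

def fold_sizes(n_images, k=23):
    """Number of test images in each of the k folds."""
    return [len(fold) for fold in np.array_split(np.arange(n_images), k)]

def kfold_indices(n_images, k=23, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross validation.

    Shuffling with a fixed seed is an assumption made for this sketch.
    """
    order = np.random.default_rng(seed).permutation(n_images)
    folds = np.array_split(order, k)
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, test_idx

print(fold_sizes(230))  # 23 folds of 10 images
print(fold_sizes(476))  # 16 folds of 21 images, then 7 folds of 20 images
```

with 230 images every fold leaves 220 images for training, and with 476 images each fold leaves 455 or 456, matching the figures given in the text.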
some initial weights and parameters in the cnn are randomly assigned. to make the study results stable, each experiment was repeated five times, and average results are shown in the study. within the scope of the study, the cpu time taken for an experiment to be completed entirely, including the training and testing, was divided by the total number of images processed, and the processing cpu time per image was measured. the experiments of this study were carried out using matlab 2019a software running on a computer with 64 gb ram and intel(r) xeon(r) cpu e5-2680 2.7 ghz (32 cpus).
table 6 basic coding of the pipeline algorithm (pipeline-4) proposed in the study
in the first experimental group within the scope of the study, the training and testing procedures were first performed using the chest x-ray images, and the results were obtained. lbp operation was then applied to the images in question, and then the training and testing procedures were repeated and the results were calculated. finally, the results were calculated using the pipeline classification algorithms, the details of which were previously described and proposed within the scope of the study. due to the random assignment of some initial variables used in the internal structure of the cnn, each experiment group was repeated five times in order to make the results more stable. the image sizes given to the cnn as input for this experiment were 448 × 448 × 1. the results obtained from the experimental group are given in table 8 (first training-test data set) and table 9 (second training-test data set).
in the second experimental group within the scope of the study, the training and testing procedures were performed using the real part of the ll sub-image obtained by applying dt-cwt to the chest x-ray images, and the results were obtained.
then, the training and testing procedures were performed using the real part of the ll subimage obtained by applying the lbp and dt-cwt operations to the x-ray images, respectively. finally, the results were calculated using the pipeline classification algorithms, the details of which were previously described and proposed within the scope of the study. the image sizes given to the cnn as input for this experiment were 224 × 224 × 1. the results obtained from the experimental group are given in table 10 (first training-test data set) and table 11 (second training-test data set). in the third experimental group within the scope of the study, the training and testing procedures were performed using the imaginary part of the ll sub-image obtained by applying dt-cwt to the chest x-ray images, and the results were obtained. then, the training and testing procedures were performed using the imaginary part of the ll sub-image obtained by applying the lbp and dt-cwt operations to the x-ray images, respectively. finally, the results were calculated using the pipeline classification algorithms, the details of which were previously described and proposed within the scope of the study. the image sizes given to the cnn as input for this experiment were 224 × 224 × 1. the results obtained from the experimental group are given in table 12 (first training-test data set) and table 13 (second training-test data set). in the fourth experimental group within the scope of the study, the training and testing procedures were performed using the real part of the ll, lh and hl sub-images obtained by applying dt-cwt to the chest x-ray images, and the results were obtained. then, the training and testing procedures were performed using the real part of the ll, lh and hl sub-images obtained by applying the lbp and dt-cwt operations to the x-ray images, respectively. 
finally, the results were calculated using the pipeline classification algorithms, the details of which were previously described and proposed within the scope of the study. the image sizes given to the cnn as input for this experiment were 224 × 224 × 3. the results obtained from the experimental group are given in table 14 (first training-test data set) and table 15 (second training-test data set). in the fifth experimental group within the scope of the study, the training and testing procedures were performed using the imaginary part of the ll, lh and hl sub-images obtained by applying dt-cwt to the chest x-ray images, and the results were obtained. then, the training and testing procedures were performed using the imaginary part of the ll, lh and hl sub-images obtained by applying the lbp and dt-cwt operations to the x-ray images, respectively. finally, the results were calculated using the pipeline classification algorithms, the details of which were previously described and proposed within the scope of the study. the image in the sixth experimental group within the scope of the study, the training and testing procedures were performed using the real and imaginary parts of the ll sub-image obtained by applying dt-cwt to the chest x-ray images, and the results were obtained. then, the training and testing procedures were performed using the real and imaginary parts of the ll sub-image obtained by applying the lbp and dt-cwt operations to the x-ray images, respectively. finally, the results were calculated using the pipeline classification algorithms, the details of which were previously described and proposed within the scope of the study. the image sizes given to the cnn as input for this experiment were 224 × 224 × 2. the results obtained from the experimental group are given in table 18 (first training-test data set) and table 19 (second training-test data set). 
in the seventh experimental group within the scope of the study, the training and testing procedures were performed using the real and imaginary parts of the ll, lh, hl sub-images obtained by applying dt-cwt to the chest x-ray images, and the results were obtained. then, the training and testing procedures were performed using the real and imaginary parts of the ll, lh, hl sub-images obtained by applying the lbp and dt-cwt operations to the x-ray images, respectively. finally, the results were calculated using the pipeline classification algorithms, the details of which were previously described and proposed within the scope of the study. the image sizes given to the cnn as input for this experiment were 224 × 224 × 6. the results obtained from the experimental group are given in table 20 (first training-test data set) and table 21 (second training-test data set).
finally, all the training-test data sets were combined to test the performance of the proposed method and the pipeline approaches. in this context, a collective training-test data set containing a total of 556 x-ray images comprising 150 covid-19 and 406 non-covid-19 images was created. then the k value was determined as 2 (cross training and testing for 75 covid-19 and 203 non-covid-19 images). the training and testing processes were realized for the input images (original image and the ll (real sub-band)), ensuring the best results in the first and second training-test data sets. the results obtained are given in tables 22 and 23 .
in this section, first of all, the results that were obtained without using pipeline algorithms are compared. when the results of the study given between tables 8 and 23 are examined within the scope of the study, it can be seen that the results of the study obtained without using lbp are generally better than the results of the study using lbp, for the same input image.
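the input depths quoted for the experimental groups (1, 2, 3 or 6 channels) follow from how many sub-bands and which parts of them are stacked. a minimal sketch is given below; the function name `stack_subbands` and the channel ordering are assumptions, and random complex arrays stand in for real dt-cwt sub-bands.

```python
import numpy as np

def stack_subbands(subbands, parts=("real", "imag")):
    """Stack selected parts of complex sub-bands into an H x W x C array.

    The channel order (real before imaginary, band by band) is an
    assumption made for this sketch.
    """
    channels = []
    for band in subbands:
        if "real" in parts:
            channels.append(np.real(band))
        if "imag" in parts:
            channels.append(np.imag(band))
    return np.stack(channels, axis=-1)

# Stand-ins for the ll, lh and hl sub-bands of a 448 x 448 image.
rng = np.random.default_rng(0)
ll, lh, hl = (rng.random((224, 224)) + 1j * rng.random((224, 224))
              for _ in range(3))

print(stack_subbands([ll], parts=("real",)).shape)   # (224, 224, 1)
print(stack_subbands([ll]).shape)                    # (224, 224, 2)
print(stack_subbands([ll, lh, hl]).shape)            # (224, 224, 6)
```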
in this context, it is understood that there are exceptions for the sensitivity parameter of some results obtained using the first cnn architecture for the first training-test data set. within the scope of the study, the highest mean sensitivity, specificity, accuracy, f-1 score, and auc values obtained without using the pipeline algorithms were, respectively; within the scope of the study, dt-cwt was used to reduce the image dimensions. in this way, dt-cwt tolerated the increase in result-producing time due to the use of the pipeline algorithm. in this context, when the results obtained using the original images and the ones obtained using dt-cwt are compared, it can be seen that there is no serious decrease in the results, in general. using dt-cwt, the image sizes were reduced successfully and a reduction in the result-producing times was achieved, in the study. the pipeline algorithms proposed within the scope of the study are based on combining the results obtained without using lbp and with using lbp, as detailed previously. after this stage, the study results obtained by using the pipeline algorithms were analyzed. with the introduction of the pipeline algorithms, improvements were achieved in all the parameters obtained by using both training-test data sets and the cnn architectures. in this context, an improvement was achieved in general, according to the highest results obtained without lbp and with using lbp, in terms of percentage ranging between 0,67% and 3,73% for the sensitivity parameter, between 0,06% and 2,25% for the specificity parameter, between 0% to 2,61% for the accuracy parameter, between 0,03% and 2,04% for the f-1 score parameter, and between 0% and 1,20% for the auc parameter. it was also observed that similar improvements were achieved for the experiments performed by combining all data and using 2-fold cross. 
in this context, according to the highest results obtained without lbp and with using lbp, an improvement was achieved generally in terms of percentage ranging between 2,13% and 5,07% for the sensitivity parameter, between 0,59% and 1,08% for the specificity parameter, between 0,58% and 1,87% for the accuracy parameter, between 1,18% and 3,55% for the f-1 score parameter, and between 0,13% and 0,59% for the auc parameter. when comparing the success of pipeline algorithms in improving the results in general, it can be seen that the algorithms of pipeline-1 and pipeline-3 obtain the highest sensitivity values; pipeline-4 obtains the highest specificity values; pipeline-1 and pipeline-3 obtain the highest accuracy values; pipeline-1 and pipeline-3 obtain the highest f-1 scores values; and pipeline-1, pipeline-2 and pipeline-3 algorithms successfully obtained the highest auc values. when the input data with the best results obtained by using the pipeline algorithms are examined, it can be seen that using the real part of the ll sub-image band for the first training-test data set and using the original images for the second trainingtest data set provided the best results. experiments performed using the 2-fold cross by combining all the data also confirm this situation. for this reason, only the results of the experiments mentioned were included in the study, in consideration of the length of the study. 
the highest mean sensitivity, specificity, accuracy, f-1 score, and auc values obtained using the study pipeline algorithms are as follows, respectively; 0,9947, 0,9800, 0,9843, 0,9881, 0,9990 for the first training-test data set and the first cnn architecture; 0,9867, 0,9800, 0,9809, 0,9853, 0,9977 for the first training-test data set and the second cnn architecture; 0,9853, 0,9926, 0,9857, 0,9774, 0,9988 for the second training-test data set and the first cnn architecture; and 0,9920, 0,9939, 0,9891, 0,9828, 0,9991 for the second training-test and the second cnn architecture. the highest mean sensitivity, specificity, accuracy, f-1 score and auc values obtained in the experiments performed by combining all data and using the 2-fold cross were respectively; 0,9760, 1,0000, 0,9906, 0,9823, 0,9997 for the first cnn architecture; and 0,9707, 1,0000, 0,9867, 0,9752, 0,9994 for the second cnn architecture. within the scope of the study, the best results obtained before and after using the pipeline algorithm and the comparison of these results with the recent literature studies are given in table 24 . as a result of our study on the automatic classification of chest x-ray images and using one of the deep learning methods, the cnn, some important and comprehensive test results were obtained for early diagnosis of covid-19 disease. when the results obtained within the scope of the study are compared with the literature studies detailed in tables 1 and 24 , the results of the study were found to be better than the 14 out of the 16 studies in which this value was calculated for the sensitivity parameter, than all the 13 studies in which this value was calculated for the specificity parameter, than the 13 out of the 15 studies in which this value was calculated for the accuracy parameter, than the eight out of the nine studies in which this value was calculated for the f-1 score parameter, and than all the 3 studies in which this value was calculated for the auc parameter. 
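the sensitivity, specificity, accuracy, and f-1 score values reported above all follow from the four counts of a binary confusion matrix. the following is a minimal sketch of the standard definitions, not the authors' evaluation code; the function name `binary_metrics` and the example counts are hypothetical. auc is not included because it requires the ranked classifier scores rather than a single confusion matrix.

```python
def binary_metrics(tp, fp, tn, fn):
    """Standard binary classification metrics from confusion-matrix counts.

    tp/fn: positive (e.g. covid-19) images classified correctly/incorrectly.
    tn/fp: negative images classified correctly/incorrectly.
    Assumes every denominator below is non-zero.
    """
    sensitivity = tp / (tp + fn)            # true positive rate (recall)
    specificity = tn / (tn + fp)            # true negative rate
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "f1": f1,
    }


# Hypothetical counts, chosen only to illustrate the arithmetic.
example = binary_metrics(tp=90, fp=5, tn=95, fn=10)
```

with these hypothetical counts the sensitivity is 90/100 and the specificity is 95/100; averaging such values over the folds of a cross-validation run gives "mean" values of the kind quoted above.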
moreover, in terms of run-time, our method produced a result at least three times faster than the study conducted by mohammed et al. [29] and at least ten times faster than the study conducted by toraman et al. [39]. these were the only two previous studies in which run-times were shared; no information was given about run-times in the other previous studies. overall, the results obtained within the scope of the study lagged behind the results obtained in the studies conducted by tuncer et al. [26], benbrahim et al. [35], and loey et al. [38]. however, for a more detailed comparison, the number of images used in these studies should be compared with the number used in our study, which is higher. in particular, the number of images used in our study is almost three times the number of images used by loey et al. [38]. another important issue is the procedure for training and testing. there was no cross-validation in the studies by benbrahim et al. [35] and loey et al. [38]. in our study, cross-validation in the training-test processes is one of the important measures taken against the overfitting problem that occurs during the training of the network. cross-validation is also known to improve the reliability of study results while balancing them. these issues should be taken into consideration when making a comparison. regarding the difference between giving the images to the cnn directly and giving them after the lbp was applied, the images obtained by applying the lbp produced worse results than the original images.
however, the pipeline classification algorithm presented in the context of this study enabled the results obtained to be improved by combining the original and lbp-applied images. in this context, a significant part of the best results obtained in the study was provided using the pipeline classification algorithm. in this sense, the results of the study support some other literature studies [61] [62] [63] [64] [65] [66] in which the cnn and lbp methods are used together and the use of the lbp was shown to increase the success of the relevant study. the success achieved through the pipeline approaches in the study is due to the fact that some classification results that could not be revealed either without the lbp alone or with the lbp alone were revealed by using the two together. feeding the results from the two sources into the pipeline approaches results in an increase in running time. however, the results obtained within the scope of the study show that this time cost can be eliminated by using dt-cwt. in this way, it has been observed that working success can be increased significantly without time cost. it is considered that this model, developed within the scope of the study, can be used in many other deep learning studies.

study                             sensitivity    specificity    accuracy       f-1 score      auc
[27]                              0,9762         0,7857         0,881          x              x
ozturk et al. [28]                0,9513         0,953          0,9808         0,9651         x
mohammed et al. [29]              0,706-0,974    0,557-1,000    0,620-0,987    0,555-0,987    0,800-0,988
khan et al. [30]                  0,993          0,986          0,990          0,985          x
apostolopoulos and mpesiana [31]  0,9866         0,9646         0,9678         x              x
waheed et al. [32]                0,69-0,90      0,95-0,97      0,85-0,95      x              x
mahmud et al. [33]                0,978          0,947          0,974          0,971          0,969
vaid et al. [34]                  0,9863         0,9166         0,9633         0,9729         x
benbrahim et al. [35]             0,9803-0,9811  x              0,9803-0,9901  0,9803-0,9901  x
elaziz et al. [36]                0,9875-0,9891  x              0,9609-0,9809  x              x
martínez et al. [37]              0,97           x              0,97           0,97           x
loey et al. [38]                  1,0000         1,0000         1,0000         x              x
toraman et al. [39]               0,28-0,9742    0,8095-0,98    0,4914-0,9724  0,55-0,9724    x
duran-lopez et al. [40]           0
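for readers unfamiliar with the lbp operator discussed above, the sketch below computes the basic 3x3 local binary pattern code of each interior pixel. it is a minimal illustration of the textbook operator under stated assumptions, not the authors' implementation: the function name `lbp_3x3`, the neighbour ordering, and the use of `>=` for the comparison are conventions chosen here.

```python
import numpy as np


def lbp_3x3(image):
    """Basic 8-neighbour local binary pattern codes for the interior pixels.

    Each neighbour that is >= the centre pixel contributes one bit, giving a
    code between 0 and 255. The output is two rows and two columns smaller
    than the input because border pixels lack a full neighbourhood.
    """
    img = np.asarray(image, dtype=np.int64)
    rows, cols = img.shape
    center = img[1:-1, 1:-1]
    # Neighbour offsets (dy, dx), clockwise from the top-left pixel.
    # The ordering is a convention; it only permutes the bits of the code.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(center)
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = img[1 + dy: rows - 1 + dy, 1 + dx: cols - 1 + dx]
        codes += (neighbour >= center).astype(np.int64) << bit
    return codes


# A perfectly flat patch: every neighbour equals the centre, so all bits are set.
flat_codes = lbp_3x3(np.full((4, 4), 7))
```

the resulting code image can be given to a cnn in place of, or alongside, the original image, which is the kind of combination the pipeline approaches rely on.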
it was evaluated that another important factor in achieving the successful results in this study was the framing process, which included the chest region and clarified the area of interest before the training and test procedures started. hence, thanks to this pre-process carried out in this context, the parts lacking medical diagnostic information were removed from the images and only the relevant areas on the images were used in the procedures. as the size of the inputs given to the cnn increases, the time taken for the training and testing increases. the dt-cwt transformation used in the study reduces the size of the image by half. although the image sizes are reduced by half, there is no serious adverse effect on the study results. by contrast, some of the best results achieved in the study were obtained using the dt-cwt. in this context, although the pipeline classification algorithms proposed in the study increase the time to produce the results for the image, the times in question are less than half the time required for the images to be used directly without applying lbp and dt-cwt. also, all the training and test procedures provided in the study reflect the amount per image. however, approximately 98% of these periods are spent on the training procedures. in this context, in the case where the results obtained by the transfer learning approach are used with the pipeline classification algorithm proposed in the study, the periods mentioned will decrease accordingly. the pipeline algorithms revealed within the scope of the study were tested for data sets with different weights in terms of the number of covid-19 and non-covid-19 images, for different training-test ratios and different cnn architectures. the pipeline algorithms were successful for all these situations that may have affected the results. this shows that the proposed pipeline algorithms are not partial but are general solutions. 
from this point of view, it is obvious that if the pipeline algorithms mentioned above are added to the algorithms used in other literature studies, this would increase the success of these studies. the results of the study show that analyzing chest x-ray images in the diagnosis of covid-19 disease using deep learning methods will speed up the diagnosis and significantly reduce the burden on healthcare personnel. to further improve the results of the study, increasing the number of images in the training set, i.e., the creation of publicly accessible databases containing the clinical data of patients with covid-19, is of prime importance. after this stage, the aim is to realize applications using ct images of the lungs, which, like chest x-ray images, are an important diagnostic tool in covid-19 disease diagnosis. in addition, it is planned to analyze the effects, on the study results, of using results obtained through direct transfer learning in the pipeline classification algorithms. classifying the complex-valued sub-bands of images obtained by applying dt-cwt directly with a complex-valued cnn is considered another important application.

conflict of interest: dr. ceylan declares that he has no conflict of interest. mr. yasar declares that he has no conflict of interest.

ethical approval: this article does not contain any studies with human participants or animals performed by any of the authors.
[1] a novel coronavirus from patients with pneumonia in china
[2] clinical features of patients infected with 2019 novel coronavirus in
[3] a review of coronavirus disease-2019
[4] forecasting of covid19 per regions using arima models and polynomial functions
[5] 2019-novel coronavirus severe adult respiratory distress syndrome in two cases in italy: an uncommon radiological presentation
[6] featuring covid-19 cases via screening symptomatic patients with epidemiologic link during flu season in a medical center of central taiwan
[7] clinical characteristics of 140 patients infected with sars-cov-2 in wuhan
[8] epidemiological and clinical characteristics of 99 cases of 2019 novel coronavirus pneumonia in wuhan, china: a descriptive study
[9] a case of covid-19 and pneumonia returning from macau in taiwan: clinical course and anti-sars-cov-2 igg dynamic
[10] a locally transmitted case of sars-cov-2 infection in taiwan
[11] breadth of concomitant immune responses prior to patient recovery: a case report of non-severe covid-19
[12] case of the index patient who caused tertiary transmission of covid-19 infection in korea: the application of lopinavir/ritonavir for the treatment of covid-19 infected pneumonia monitored by quantitative rt-pcr
[13] chest imaging appearance of covid-19 infection
[14] chest radiographic and ct findings of the 2019 novel coronavirus disease (covid-19): analysis of nine patients treated in korea
[15] clinical characteristics of imported cases of coronavirus disease 2019 (covid-19) in jiangsu province: a multicenter descriptive study
[16] coronavirus disease 2019 (covid-19): a perspective from china
[17] emerging 2019 novel coronavirus (2019-ncov) pneumonia
[18] first case of 2019 novel coronavirus in the united states
[19] first case of coronavirus disease 2019 (covid-19) pneumonia in taiwan
[20] imaging profile of the covid-19 infection: radiologic findings and literature review
[21] evolution of ct manifestations in a patient recovered from 2019 novel coronavirus (2019-ncov) pneumonia in wuhan
[22] first imported case of 2019 novel coronavirus in canada, presenting as mild pneumonia
[23] importation and human-to-human transmission of a novel coronavirus in vietnam
[24] the first vietnamese case of covid-19 acquired from china
[25] the role of augmented intelligence (ai) in detecting and preventing the spread of novel coronavirus
[26] an automated residual exemplar local binary pattern and iterative relieff based corona detection method using lung x-ray image
[27] application of deep learning for fast detection of covid-19 in x-rays using ncovnet
[28] automated detection of covid-19 cases using deep neural networks with x-ray images
[29] benchmarking methodology for selection of optimal covid-19 diagnostic model based on entropy and topsis methods
[30] coronet: a deep neural network for detection and diagnosis of covid-19 from chest x-ray images
[31] covid-19: automatic detection from x-ray images utilizing transfer learning with convolutional neural networks
[32] covidgan: data augmentation using auxiliary classifier gan for improved covid-19 detection
[33] covxnet: a multidilation convolutional neural network for automatic covid-19 and other pneumonia detection from chest x-ray images with transferable multi-receptive feature optimization
[34] deep learning covid-19 detection bias: accuracy through artificial intelligence
[35] deep transfer learning with apache spark to detect covid-19 in chest x-ray images
[36] new machine learning method for image-based diagnosis of covid-19
[37] performance evaluation of the nasnet convolutional network in the automatic identification of covid-19
[38] within the lack of chest covid-19 x-ray dataset: a novel detection model based on gan and deep transfer learning
[39] convolutional capsnet: a novel artificial neural network approach to detect covid-19 disease from x-ray images using capsule networks
[40] covid-xnet: a custom deep learning system to diagnose and locate covid-19 in chest x-ray images
[41] deepcovid: predicting covid-19 from chest x-ray images using deep transfer learning
[42] covid-19 image data collection
[43] covid-19 x rays
[44] two public chest x-ray datasets for computer-aided screening of pulmonary diseases
[45] a comparative study of texture measures with classification based on featured distributions
[46] the dual-tree complex wavelet transform
[47] the dual-tree complex wavelet transform: a new efficient tool for image restoration and enhancement
[48] shift invariant properties of the dual-tree complex wavelet transform
[49] dual-tree complex wavelet transform and svd based medical image resolution enhancement
[50] a novel method for lung segmentation on chest ct images: complex-valued artificial neural network with complex wavelet transform
[51] blood vessel extraction from retinal images using complex wavelet transform and complex-valued artificial neural network
[52] improved adaptive image retrieval with the use of shadowed sets
[53] uncertainty-optimized deep learning model for small-scale person re-identification
[54] vehicle and wheel detection: a novel ssd-based approach and associated large-scale benchmark dataset
[55] kernel pooling for convolutional neural networks
[56] deep metric learning with angular loss
[57] a study on the cardinality of ordered average pooling in visual recognition
[58] data augmentation for eeg-based emotion recognition with deep convolutional neural networks
[59] a modified convolutional neural network for face sketch synthesis
[60] a novel comparative study for detection of covid-19 on ct lung images using texture analysis, machine learning, and deep learning methods
[61] a novel comparative study using multi-resolution transforms and convolutional neural network (cnn) for contactless palm print verification and identification
[62] a face recognition method based on lbp feature for cnn. in: advanced information technology, electronic and automation control conference (iaeac)
[63] facial expression recognition algorithm based on cnn and lbp feature fusion
[64] local binary convolutional neural networks. in: conference on computer vision and pattern recognition
[65] a novel face recognition algorithm based on the combination of lbp and cnn
[66] automated breast tumor diagnosis using local binary patterns (lbp) based on deep learning classification

publisher's note: springer nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

key: cord-337740-8ujk830g authors: matencio, adrián; caldera, fabrizio; cecone, claudio; lópez-nicolás, josé manuel; trotta, francesco title: cyclic oligosaccharides as active drugs, an updated review date: 2020-09-29 journal: pharmaceuticals (basel) doi: 10.3390/ph13100281 sha: doc_id: 337740 cord_uid: 8ujk830g

there have been many reviews of the cyclic oligosaccharide cyclodextrin (cd) and cd-based materials used for drug delivery, but the capacity of cds to complex different agents and their own intrinsic properties suggest they might also be considered for use as active drugs, not only as carriers. the aim of this review is to summarize the direct use of cds as drugs, without using their complexing potential with other substances. the direct application of another oligosaccharide called cyclic nigerosyl-1,6-nigerose (cnn) is also described. the review is divided into lipid-related diseases, aggregation diseases, antiviral and antiparasitic activities, anti-anesthetic agent, function in diet, removal of organic toxins, cds and collagen, cell differentiation, and finally, their use in contact lenses in which no drug other than cds is involved. in the case of cnn, its application as a dietary supplement and immunological modulator is explained. finally, a critical structure–activity explanation is provided.

cyclodextrins (cds, figure 1) are torus-shaped oligosaccharides made up of α-(1,4)-linked glucose units, obtained by the degradation of starch by the enzyme cyclodextrin glucosyltransferase (cgtase), and were discovered by antoine villiers [1].
although the most common cds are the natural α-, β-, and γ-cd forms (which contain six, seven, or eight glucose units, respectively), cds containing from nine up to nineteen units have also been characterized [2, 3], but they are not used because of their tendency to collapse. recently, even smaller cds with 3 or 4 glucose units have been synthesized [4]. the cd ring is a conical cylinder of an amphiphilic nature, with a hydrophilic outer layer (formed by the hydroxyl groups) and a lipophilic cavity [5]. although inorganic and organic salts and neutral molecules can form complexes with cds [6], it is generally poorly soluble drugs that are complexed with them to create so-called "inclusion complexes" or nanoparticles [7] [8] [9] [10] [11] [12]. different chemically obtained derivatives (e.g., hydroxypropyl-β-cd or methyl-β-cd, among others) and materials have been seen to possess better capacities, such as complexation efficiency and release accuracy, than natural cds [13] [14] [15]. however, only natural cds are considered to be suitable as food additives at present (e-457, e-458, and e-459 [5]). despite the importance of cds in science, especially as carriers in pharmacy [16, 17], their use as active drugs has been less studied, although recent reviews [18] [19] [20] have focused on this particular application. in addition to cds, another dietary indigestible cyclic oligosaccharide formed by four d-glucopyranosyl residues linked by alternating α(1→3) and α(1→6) glucosidic linkages, cyclic nigerosyl-1,6-nigerose or cyclotetraglucose (cnn, figure 1) [21], was recently found to have intrinsic bioactivity. the present review will update the most relevant applications mentioned in the review by braga et al., 2019, including applications such as the ability of cds to combat aggregation diseases, their dietary functions, toxin removal, cell differentiation, and their application in contact lenses.
the review aims to provide a general overview of the use of different oligosaccharides as active drugs, rather than as mere drug carriers, summarizing and updating the most relevant applications mentioned in previous reviews and adding new possible uses.

salivary α-amylase can rapidly hydrolyze dextrins, although their rapid transport to the stomach makes such degradation unimportant. of the three natural cds, α- and β-cd are essentially stable towards α-amylase, while γ-cd is rapidly digested [16, 22]. in the stomach, unspecific ph-dependent degradation may occur, and then, in the neutral ph environment found in the small intestine, pancreatic amylase continues the hydrolysis process. while α- and β-cd are mostly digested by bacteria in the colon (α-cd is more slowly digested than β-cd [16]), γ-cd is almost completely digested in the gastrointestinal tract. finally, non-digested remains are metabolized by microbiota in the lower section of the digestive system, where they are almost completely degraded. the low bioavailability of cds and their derivatives (they are not able to pass the intestinal barrier) makes them very safe when administered orally [16, 23, 24]. since dextrins can also be administered parenterally, their pharmacokinetics has also been studied, leading to monographs such as those included in the european pharmacopoeia [25]. as regards their pharmacokinetics, dextrins below 15 kda are almost totally excreted (≥90%) in the urine without any substantial modification. more specifically, the pharmacokinetics of hydroxypropyl-β-cd (hpβ-cd), sulfobutylether β-cd sodium salt, and sugammadex sodium salt have been studied in rats, where they showed a t1/2 of 1.9, 1.6, and 1.7 h, respectively. more than 90% of the cd was recovered in the urine in 24 h, although the cds may remain longer in the kidney of affected subjects [16, 24, 26-30]. in terms of toxicity, most studies refer to the medical uses.
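to give a feel for what the half-lives quoted above imply, the sketch below estimates the fraction of a dose still present after a given time. it assumes simple single-compartment, first-order elimination, which is a simplification introduced here and not a model taken from the cited studies; the function names are hypothetical.

```python
def fraction_remaining(t_hours, half_life_hours):
    """Fraction of the dose remaining after t_hours under first-order elimination.

    Assumption: a single compartment, so the amount halves every half-life.
    """
    return 0.5 ** (t_hours / half_life_hours)


def fraction_eliminated(t_hours, half_life_hours):
    """Complement of fraction_remaining: the share of the dose already cleared."""
    return 1.0 - fraction_remaining(t_hours, half_life_hours)


# Half-lives in rats as quoted in the text (hours).
half_lives = {
    "hpβ-cd": 1.9,
    "sulfobutylether β-cd sodium salt": 1.6,
    "sugammadex sodium salt": 1.7,
}
eliminated_after_24_h = {
    name: fraction_eliminated(24, t_half) for name, t_half in half_lives.items()
}
```

under this assumption, 24 h corresponds to more than twelve half-lives for each compound, so essentially the whole dose would be cleared, which is in line with the reported urinary recovery of more than 90% within 24 h.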
a high dose of orally administered cds can generate diarrhea and caecum enlargement or even affect the bioavailability of some substances; as a consequence, the european commission prepared a guide to help during drug development [31]. on the other hand, the toxicology of hpβ-cd has been better studied due to its classical use as a medical excipient [32], as has the effect of its degree of substitution [33]. the results showed that the best option for minimizing toxicity would be to have a low degree of substitution (d.s.); however, more studies are necessary in this respect. as regards the use of cds for supplementing food products, only natural cds are considered food additives (e-457, e-458, and e-459) and "generally recognized as safe" (gras). the joint fao/who expert committee on food additives (jecfa) recommended a maximum level of β-cd in foods of 5 mg/kg/day. on the other hand, for α- and γ-cd there is no acceptable daily intake (adi) recommendation, due to their favorable toxicological profiles. the european food safety authority (efsa) recognized that α-cd could be described as dietary fiber, and as suitable for reducing post-prandial glycemic responses [34] due to its competitive inhibition of α-amylase. generally, gras molecules are directly approved for use as excipients (in this case, natural cds). moreover, the food and drug administration (fda) has published a downloadable list of inactive pharmaceutical ingredients (https://www.fda.gov/drugs/informationondrugs/ucm113978.htm), in which the route, dosage form, and maximum concentration are indicated. additionally, the european medicines agency (ema) has published several reports on the use of different cds in pharmaceutical products (https://www.ema.europa.eu/en/cyclodextrins).
a recent question and answer document about cds and their uses [31] summarizes information on safety: for example, although about 200 mg/kg/day of cd is generally recommended for oral uses in pharma, this value will depend on the type of cd used; for instance, a dosage of 8000 mg/day in the case of hpβ-cd. another organization, the japanese pharmaceutical codex (jpc), has published monographs about cds [35]. in general, cyclic oligosaccharides are "generally recognized as safe" (gras) [36]. cnn is a food additive approved by the fda, but no adi is reported [37]. according to a who summary reporting several unpublished data about the safety of cnn [38], neither salivary nor pancreatic α-amylase is able to degrade cnn in vitro [39]. the in vivo digestibility of cnn was also checked in 35 rats given 100 mg cnn/kg. approximately 94% of the administered cnn dose was excreted intact in the feces, while the remaining portion (6%) was detected in the gastrointestinal tract undergoing slow degradation by microbiota. no cnn was detected in the blood during the experimental period [39]. studies about acute toxicity (male and female rats administered a single dose of cnn of 200, 2000, or 5000 mg/kg for 15 days) or short-term toxicity (male and female rats given a mean daily dose of 1568, 2012, and 6333 mg cnn/kg for males and 1799, 3597, and 7270 mg cnn/kg for females) did not reveal signs of toxicity or changes in weight, biochemical parameters, or abnormalities. however, the oral intake of 20 or 30 g at least 2 h after lunch or dinner had a laxative effect in a human clinical trial [38]. although cds are mainly used as excipients for carrying different molecules, new applications have been explored, including: (i) complexing metabolites (e.g., cholesterol), (ii) use as dietary fibers, (iii) reducing contaminants, and (iv) use as antiviral or antiparasitic agents.
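the per-kilogram figures quoted above (5 mg/kg/day for β-cd in foods and about 200 mg/kg/day for oral pharmaceutical use) translate into a daily amount by multiplying by body weight. the short sketch below shows only this arithmetic; the 70 kg body weight and the function name are hypothetical, and nothing here is dosing guidance.

```python
def daily_amount_mg(body_weight_kg, mg_per_kg_per_day):
    """Daily amount in mg implied by a per-kilogram daily level."""
    return body_weight_kg * mg_per_kg_per_day


# Levels quoted in the text (mg per kg of body weight per day).
BETA_CD_FOOD_LEVEL = 5        # jecfa maximum level for β-cd in foods
ORAL_PHARMA_LEVEL = 200       # approximate general oral level for cds

# Hypothetical 70 kg adult, used only to illustrate the conversion.
food_mg = daily_amount_mg(70, BETA_CD_FOOD_LEVEL)
pharma_mg = daily_amount_mg(70, ORAL_PHARMA_LEVEL)
```

for the hypothetical 70 kg adult this gives 350 mg/day at the food level and 14000 mg/day at the general oral level, which illustrates how far apart the two recommendations are.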
in the field of lipid-related diseases, the complexation of cholesterol is the principal application of cds [40]. however, in this section, we will also look at different targets associated with several diseases. niemann-pick disease type c (npc) is a rare recessive disease caused by the mutation of npc1 and/or npc2 genes, which changes the processing of low-density lipoproteins (ldl), resulting in an accumulation of lipids in the cells [41]. the most promising treatment would seem to be the use of cds, due to their ability to form inclusion complexes (figure 2) with lipids and to mobilize the deposits [30]. it has been demonstrated that methyl-β-cd (mβ-cd) and hpβ-cd can reduce cholesterol accumulation [29, 42, 43]. although the mechanisms that modulate cholesterol homeostasis are unclear, several hypotheses exist [44]: (i) to control the plasmatic membrane cholesterol, cds remove the cholesterol from the membrane, which is replaced by intracellular cholesterol; (ii) the cds enter the cell by pinocytosis, capturing the intracellular cholesterol [45] (it has recently been demonstrated that cds reach the cytosol and autophagy vesicles [46]); or (iii) the presence of cds activates an unknown system to remove the cholesterol, as the results obtained with hpγ-cd (a cd unable to complex cholesterol) suggest [44]. although treatment with cds has adverse effects, the blood lipids protect the membrane against injury by cds; indeed, the treatment with hpγ-cd and hpβ-cd [44] induced the expression of proteins such as lysosomal-associated membrane protein 1 (lamp-1), which is expressed in the lysosomal membrane. it is possible that cholesterol is linked to this protein, thus facilitating its sequestration [47].
however, the principal difficulty of treatment with cd is the molecule's inability to cross the blood-brain barrier, and while subcutaneous injection can decrease the level of cholesterol in several tissues, an intrathecal injection is needed to have any effect in the brain [48, 49]. indeed, intrathecal injections prevent some adverse effects of cd treatment such as lung toxicity. in 2018, berry-kravis et al. studied the neurological function of three patients treated intrathecally with hpβ-cd [50] and observed slight improvement in cognitive ability, mobility, equilibrium, and swallowing. finally, different cd monomers are at present being synthesized to optimize treatments, such as 6-o-maltosyl-β-cd (g2-β-cd) [51], mono-lactose β-cd (lac-β-cd), and multi-lactose (multi-lac-β-cd) [52] or octa-arginine derivatives [53]. another possibility is to use polymers, formed by covalent bonds or cyclodextrin-based polyrotaxanes (cdprx) [54]. as the cavity is covered by the polyrotaxane, the cd cannot take cholesterol from the membranes, reducing its toxicity. moreover, the structure of cdprx improves endocytosis [54]. another interesting polymer is orx-301, a ph-sensitive β-cd-based polymer with better pharmacokinetics and bioavailability [55].

atherosclerosis is a vascular disease caused by cholesterol accumulation on the walls of arteries. the inflammatory response starts with the recruitment of macrophages to remove the cholesterol, forming "foam cells" in combination with others such as smooth muscle cells or endothelial cells [56, 57]. this excessive cholesterol accumulation modifies the cellular cholesterol pools.
the regulation of cholesterol is carried out by abca1 (atp-binding cassette transporter, also known as the cholesterol efflux regulatory protein), abcg1 (atp-binding cassette sub-family g member 1), and sr-b1 (scavenger receptor class b type 1) transporters, which remove the cholesterol from the cell membranes towards the extracellular hdl (high-density lipoprotein); in this process, the alteration of any of these transporters may cause atherosclerosis [58] . another therapy that has been considered is to use cd to form complexes with cholesterol solubilizing the plaques. although a recent review has been published about this particular application [20] , a summary of the most interesting studies is presented below. for example, the potential use of kleptose ® crysmeβ (mβ-cd) was tested in vivo [59] in an apoe-deficient mouse model. the animals were fed with a normal or cholesterol-rich diet, and a vehicle (pbs) or the kleptose ® crysmeβ solution was injected. independently of the diet, the results showed an increase in hdl cholesterol and a decrease in blood triglyceride levels. moreover, kleptose ® crysmeβ also reduced the atherosclerotic plaque by lowering cholesterol levels to a greater extent than the control. in another in vitro study, the effect of several β-cds and their methylated derivatives on the cholesterol metabolism was tested in cell cultures [60] . another group recently published their interesting findings concerning atherosclerosis: β-cd (probably mβ-cd, although this is not specified) was found to be a shuttle of cholesterol at low β-cd concentrations and a sink of cholesterol at high concentrations [61] . this study showed that β-cds can extract cell membrane cholesterol. this extraction process has been correlated with the strongly decreased expression of abca1 and abcg1 transporters. the use of the cd derivative hpβ-cd was also evaluated. 
the cd was able to treat atherosclerosis not only by increasing the efflux of cholesterol [62] but also through macrophage reprogramming [63]. the latter authors demonstrated in vivo that the underlying mechanism of action of hpβ-cd involves the liver x receptor (lxr)-mediated signaling pathway; cholesterol efflux was increased as a result of abca1 and abcg1 upregulation, which was corroborated in another recent study where this cd reduced the levels of plasma triglycerides and inflammatory cytokines and also increased the level of plasma hdl-cholesterol. importantly, the atherosclerotic lesion areas and the macrophage and collagen contents in the atherosclerotic lesions were reduced [64]. pilely et al., in 2019, discovered that α-cd inhibits cholesterol crystal-induced complement-mediated inflammation better than hpβ-cd. it is even able to solubilize cholesterol crystals [65]. the role of α-cd was also studied in atherosclerosis. as this cd is approved for use as a nutritional supplement, it was thought its oral intake might help treat atherosclerosis by reducing the uptake of cholesterol. however, only a modest reduction in ldl [66] was observed, possibly due to the low bioavailability of α-cd [16]. nevertheless, in another study α-cd was able not only to reduce atherosclerosis, but also to change the microbiota [67]. it is clear that more studies are needed in this field to clarify the real effect of cds. when synthetic cd polymers were evaluated, those with a diameter of ~10 nm were found to exhibit outstanding pharmacokinetics and plaque targeting efficacy compared with monomeric cd in vivo and showed no ototoxicity [68]. to sum up, the potential effects of cyclodextrins on atherosclerosis progression are [20]: (i) to inhibit the entry of circulating monocytes into the lesion by inhibiting their adhesion to the endothelium.

parkinson disease (pd) is caused by α-synuclein protein aggregation and misfolding [69].
it has been reported that cds (in particular mβ-cd) have the capacity to complex α-synuclein, preventing its aggregation [70]. on the other hand, the essential activity of the transcription factor eb (tfeb), a master regulator of the autophagy-lysosomal pathway [71], could be activated pharmacologically using hpβ-cd to prevent the accumulation of synuclein aggregates [72] (figure 2). although these are promising results, more work is necessary in this field to clearly understand the potential use of cds.

alzheimer disease (ad) results from an accumulation of β-amyloid peptides (ap) in the brain, which is linked to an abnormal cholesterol metabolism [73] [74] [75]. in this disease, cds present interesting possibilities; β-cd and hpβ-cd can bind ap directly [76] [77] [78]. in a mouse model, the administration of hpβ-cd reduced β-amyloid (figure 2) deposits as a result of reduced app protein cleavage while upregulating abca1 and npc1 gene expression [79]. as a consequence, alginate or chitosan microparticles containing cds have been created to target the brain with no toxicity [80], as have more effective cd derivatives with a dimethylamino aromatic moiety [81].

huntington's disease (hd) is a rare autosomal dominant neurodegenerative disease caused by a mutation within the huntingtin gene leading to the expansion of a cag triplet repeat encoding for a polyglutamine (polyq) tract. this leads to the expression of a mutant form of the huntingtin protein bearing an expanded polyq tract and other peptides produced by translational frameshifting or atg-independent translation. the accumulation of these toxic peptides, through gain- and loss-of-function mechanisms [82] [83] [84] [85], drives the neurodegenerative processes in hd, leading to motor, behavioral, and cognitive dysfunctions generally related to the length of the cag expansion, with some individual variability [86].
in an interesting study, treatment with β-cd reduced the content of ordered cholesterol domains at the cell surface, which, in turn, protected cells against nmda (n-methyl-d-aspartate)-mediated excitotoxicity (figure 2). this is because mutant huntingtin causes cholesterol to accumulate and alters its cellular distribution, thus contributing to nmda-mediated excitotoxicity [87].
the capacity to complex cholesterol is the principal mechanism through which cds can reduce the infectivity of several viruses, as summarized by braga in 2019 [18]. however, in recent studies, novel derivatives able to directly block some parts of the virus have been prepared. the capacity of cds to complex cholesterol can be used to induce structural deformation in virus membranes containing cholesterol (figure 3). such is the case with the influenza virus, in which the presence of rameb (randomly methylated β-cd) not only deforms the membranes [88], but also reduces the infectivity of viral particles of influenza a (h1n1 strain) in vitro [89, 90]. however, these results alone are not a sufficient foundation for producing novel therapies against influenza virus. on the other hand, several cd derivatives have been designed for the treatment of influenza, including a family of pentacyclic triterpene-functionalized α-, β-, and γ-cd derivatives [91, 92] and fullerene-cyclodextrin conjugates [93]. the terpenic derivative was tested in vitro with promising (but limited) results, showing no toxicity and good affinity for hemagglutinin. recently, 18 water-soluble β-cyclodextrin-glycyrrhetinic acid conjugates [94] were tested against the influenza virus, of which six showed promising antiviral activity. in glycyrrhetinic acid, c-3 and c-30 were shown to be the best positions to modify. however, more experiments are needed to determine the real potential of these novel derivatives.
in japan, some researchers have been evaluating the first cd-adjuvanted vaccine in history, in this case against the influenza virus [95]. the vaccine uses cds as an adjuvant because of their ability to enhance the production of antibodies (by nearly 30%) and to induce dendritic cell maturation [96, 97].
in the same way, the capacity of cds (in this case hpβ-cd) to complex cholesterol (figure 3) can be used to decrease the infectivity of hiv and siv (simian immunodeficiency virus) [98, 99]. this capacity was demonstrated in vivo using a mouse model in which the vaginal administration of hpβ-cd blocked 91% of the infection [100]. however, the results in rhesus macaques advised against the use of hpβ-cd in repeated doses: while hpβ-cd blocked the first infection, prolonged treatment with hpβ-cd (when the viral inoculation was repeated 11 or 47 weeks later) increased the infectivity, causing a large-scale infection [101]. these data point to the need for further research before this treatment can be recommended for use against hiv. on the other hand, the treatment of monocytes cultured with hpβ-cd was seen to reduce inflammatory molecules such as interleukins (il-10) and cytokines (tnf-α, tumor necrosis factor alpha) [102]. finally, novel branched cds bearing a long-chain alkyl group have been developed to penetrate and be fixed into the lipid bilayer of hiv, while sulfated maltoheptaose moieties have been found to interact electrostatically with the hiv gp120 molecule and so could be used for anti-hiv applications [103, 104].
the sars-cov-2 pandemic and its impact on society demand new therapies to prevent infection and manage the disease.
in this case, it is possible that cds (figure 3) could be used to prevent and treat infection not only by complexing approved drugs, but also as drugs per se [105]: mβ-cd was previously reported to lessen the infectivity of coronavirus infectious bronchitis virus (ibv) [106], and cds may also act by affecting the lipid rafts and the levels of angiotensin-2 [107]. in addition, different derivatives of cd-based materials may have the capacity to block coronavirus [108]. more information about the different mechanisms of cds against sars-cov-2 (e.g., in inclusion complexes containing drugs), and also as active drugs, is available in a separate review [109].
it is clear that cholesterol-dependent viruses can be affected by cds (figure 3). for example, the capacity to reduce the infectivity of dengue virus was suggested to be due to the presence of cholesterol in the membrane [110]. in this respect, cds were seen to be able to reduce the infectivity of this virus in a monocyte cell model (u937 myelomonocyte cell line) [111], and the same was true in the case of herpes virus 1 [112], varicella-zoster virus (vzv) [113], hepatitis c virus [114], and enterovirus d68 [115]. on the other hand, these viruses are also hs-dependent (generally heparan sulfate proteoglycans) [116], and novel highly sulfonated cd derivatives (sodium undec-10-enesulfonate with different chain lengths) have been tested against this type of virus (respiratory syncytial virus, human metapneumovirus, dengue virus, hepatitis c, and others) in vitro, ex vivo, and in an animal model. the derivatives exhibited a broad-spectrum, virucidal, irreversible mechanism of action and high biocompatibility, and also acted as a barrier to viral resistance. moreover, to determine the inhibition mechanism of these novel cds, a molecular dynamics simulation was carried out in the presence of glycoprotein b (gb) from hsv-2.
the results suggested that the cds interacted with the binding loop of gb, producing a conformational change in the protein, thus blocking the attack on the host cell.
as previously reported [18], cds also have antiparasitic applications: in the case of leishmaniasis, for example, the use of hpβ-cd in balb/c mice infected with the parasite leishmania donovani caused a 21% reduction in liver infection compared with a control, due to its ability to complex cholesterol [117]. although the use of other cds such as mβ-cd has also been studied [118], more research is necessary before a useful therapy is available. another interesting application of cds could be against malaria, which is caused by protozoan parasites of the plasmodium genus; in this sense, sulfated cds were able to block the entry of p. falciparum [119].
neuromuscular blocking (nb) is used during surgery to prevent movement. several agents can block acetylcholine in nicotinic receptors in striated muscle cells. although there are different ways to remove the anesthetic agent, using cds was found to be a simple way to complex the molecules. using rocuronium bromide, one of the most widely used anesthetic molecules, as a model to test the ability of natural cds, the most suitable form for complexing was found to be γ-cd [120]. however, a derivative called sugammadex, obtained by perfunctionalization of the primary hydroxyl side of γ-cd with sulphanylpropanoic acid, has generated very stable complexes, which have been approved by the ema and fda for therapeutic use [121]. in addition to the capacity to complex rocuronium bromide, sugammadex is also able to complex vecuronium bromide and pancuronium bromide and needs less time to reverse the neuromuscular block than neostigmine (the classical agent used) [122], attaining good clearance in 24 h [123].
although sugammadex is generally safe, cases of anaphylaxis (a life-threatening clinical condition that is typically the result of drugs or substances used for anesthesia or surgery) have been reported with a low incidence of 29 per million cases [124]. additional information about this use can be found in a previous review [18].
the role of cds as a nutritional supplement has been evaluated in patients consuming cholesterol-rich diets (figure 4), finding that cds are able to reduce hypercholesterolemia by reducing cholesterol absorption, and even plasma cholesterol or triglyceride levels [125] [126] [127] [128]. another study found that supplementation with α-cd altered the gut microbiota and increased the production of lactic acid and short-chain fatty acids (scfas). this had beneficial antiobesity effects by modulating the expression of genes related to lipid metabolism, indicating the prebiotic property of α-cd due to its metabolization [10, 129]. finally, the efsa permitted α-cd to be described as dietary fiber, and it is suitable for reducing post-prandial glycemic responses due to its competitive inhibition of α-amylase [34].
the sequestering properties of cds and cd-based materials can be exploited to remove contaminants and toxins (figure 4). in an interesting study, insoluble β-cd bead polymers (bbp) were tested to remove zearalenone (zen), a fusarium-derived mycotoxin, which exerts xenoestrogenic effects in animals and humans and is formed in cereals and cereal-based products [130]. the results showed that even relatively small amounts of bbp can strongly decrease the mycotoxin content of aqueous solutions (including beer), and that the polymers can be easily recycled with an etoh/water (50:50) solution. in another study, alternariol (aoh), a mycotoxin that occurs in wine and tomato products as a contaminant, was removed by bbp from aqueous solutions (ph 3.0-7.4).
bbp strongly decreased the aoh content of both wine and tomato juice samples, suggesting the suitability of cd polymers as aoh binders in some beverages [131]. moreover, a study about cyclodextrin nanosponges for removing toxic organic molecules from the body was recently published [132]. cyclodextrin-based nanosponges are cross-linked polymer structures with a three-dimensional network, tunable particle size, and good swelling properties [14]. in the above article, different nanosponges were tested to complex indole, a metabolite of tryptophan formed by the gut microbiota, which can give rise to dangerous uremic toxins such as indoxyl sulfate, which is metabolized from indole in the liver. three of the four nanosponges tested were able to adsorb indole from aqueous solutions as well as from simulated gastric fluid. toluene diisocyanate cross-linked cd-nanosponges, especially, had a very high indole adsorption capacity (over 90%) and are promising agents for cleansing the body of toxic compounds from food or through oral ingestion in general. in addition, this derivative was more stable in gastrointestinal media. animal studies further revealed that orally administered cd-nss do not tend to accumulate and damage gastrointestinal tissues and are excreted from the gi tract with minimal absorption [132].
recent studies have demonstrated the ability of cds to modulate collagen-related processes: mβ-cd was able to up-regulate collagen i expression in chronologically-aged skin through its anti-caveolin-1 activity; the intra-dermal administration of a 2.5% concentration of mβ-cd (twice per week for two months) showed a strong collagen i up-regulation activity, leading to an increase in skin thickness without adverse reactions such as skin fibrosis [133].
in addition, cds can interact with hydrophobic amino acid residues of collagen for different uses: the application of β-cd to collagen vitrigels produces materials with aligned fibers and lamellae similar to those of the native cornea, resulting in mechanically robust and transparent materials that can be used to create β-cd/collagen implants with a curvature matching that of the cornea. the implants show good tissue integration and support re-epithelialization [134]. collagen-glycosaminoglycan scaffolds that incorporate β-cd showed improved sequestration as well as extended retention and release of tgf-β1 (transforming growth factor beta 1) and bmp-2 (bone morphogenetic protein 2), which influence the metabolic activity and proliferation of mesenchymal stem cells. moreover, a gene expression analysis showed that the tgf-β1 released from β-cd promoted early chondrogenic-specific differentiation [135]. finally, for the treatment or prevention of cartilage degeneration and arthrosis or arthritis, a patent has been registered for the use of cds with hyaluronate and chondroitin [136].
in the previous section, cds were able to promote the differentiation of chondrocytes by complexing several molecules (such as tgf-β1). however, there are some cases where cds interact directly with the pathway. in 2017, a study demonstrated that β-cd could induce the differentiation of resident cardiac stem cells to cardiomyocytes through autophagy [137]: β-cd increased the expression of cardiac transcription factors and structural proteins, among others, to induce cardiac stem cell differentiation. in addition, the jnk/stat3 (c-jun n-terminal kinase/signal transducer and activator of transcription 3) and gsk3β/β-catenin (glycogen synthase kinase 3 beta/β-catenin) pathways were shown to be downstream pathways of β-cd-induced autophagy and differentiation. moreover, β-cd performed its functions by improving intracellular cholesterol levels and so affecting cholesterol efflux [137].
the use of cds as a carrier in liquid solutions for contact lenses is extensive, but a recent patent [138] points to the role of dexolve™ (sbecd, a cd derivative) as a pharmaceutically active agent to prevent, treat, or reduce the risk of disorders or conditions associated with the wearing of contact lenses. cds reduce the concentration or the bioactivity of eye allergens, inflammatory mediators, or toxic aldehydes. cds may, therefore, inactivate mediators of the inflammatory response, such as prostaglandins, reactive aldehydes, and lipid peroxidation products. although this is a promising application, more studies are needed.
in contrast to the vast literature on cds, very few examples of the applications of cnn have been described to date, and what is known is presented below. more specifically, this section deals with the use of cnn as a diet supplement and as an immunological modulator. in a patented study, a group of rats received a diet containing 0-5% of cnn and supplemented with 3.5% of a mineral mix containing calcium, phosphorus, magnesium, iron, sodium, and potassium. a dose-related increase in the absorption rates of calcium, magnesium, phosphorus, and iron was reported [139]. two studies demonstrated the ability of cnn to increase the production of scfas. in one study, a dose of 7480 mg/kg in mice led to statistically significant increases in butyrate and lactate levels [140]. in a second study, similar results for the production of scfas were obtained in rats; furthermore, a decrease in serum triglyceride and cholesterol levels was observed for the high-dose diet (≈5000 mg/kg) [39]. dietary supplementation with cnn increased the production of iga (immunoglobulin a) in mice, also affecting the levels of il-6 and tgf-β (transforming growth factor beta) in peyer's patch cells [140]; the authors suggested a possible prebiotic effect of this molecule, which could change the microbiota profile.
the increase in the production of iga, the major antibody secreted into the gut, was studied in colitis induced in mice. the cnn-treated mice with induced colitis showed an improvement in colitis factors (e.g., mrna levels of interleukin-1) compared with the cnn-untreated mice. although there was no difference in the iga concentration among groups, a higher proportion of cecal microbiota was coated with iga in the cnn-treated group compared with that observed in the control. iga plays a crucial role in suppressing gut inflammation due to commensal gut microbiota. the authors concluded that cnn treatment reduced gut inflammation in mice with induced colitis, possibly through synergistic effects of the restoration of goblet cells, increased abundance of butyrate-producing bacteria, and promotion of iga coating of gut microbiota [141]. finally, the use of cnn was effective in the treatment of melanoma in vitro. cnn administered to b16 melanoma cells resulted in a dose-dependent decrease in melanin synthesis, even under conditions that stimulate melanin synthesis, with no significant degree of cytotoxicity. cnn was able to slightly reduce the tyrosinase activity directly and to moderately decrease its expression. moreover, the colocalization of the enzyme in lamp-1 organelles (where tyrosinase is degraded) was observed [142].
as is clear from the previous paragraphs, cyclic oligosaccharides can play several different roles (as antivirals, in pd treatment, in the removal of organic toxins, etc.). the principal factor that determines the activity is the affinity of the target for their internal cavity (figure 1). in the case of cnn, we have seen how this cyclic oligosaccharide is able to increase the absorption of several ions, increasing their bioaccessibility, through the formation of complexes. cnn is also able to interact directly with an enzyme, inhibiting its activity, and to increase the production of iga as a prebiotic.
however, more research in this field is needed before cnn can be administered as a drug. most of the applications of cds mentioned in this review are related to their ability to complex cholesterol, especially in the case of β-cd and its derivatives [143, 144]. the type and degree of substitution of cd derivatives have been correlated with cholesterol complexation [144]; these authors reported that methyl-β-derivatives provided better solubilization (and complexation) of cholesterol, with an optimum degree of substitution of fourteen. although other derivatives such as hpβ-cd presented a lower capacity to solubilize cholesterol, significant differences were found between the cytotoxicity of the highly toxic methylated derivatives (ic50 ≈ 50 mm, except those with a low degree of substitution) and that of other cds. the studied non-methylated derivatives, such as hpβ-cd and ionic β-cds, presented no cytotoxicity up to 200 mm. indeed, as we have shown in this review, hpβ-cd is used for diseases such as npc due to its low toxicity. the authors of the above-cited article concluded that in the case of methylated-β-cd compounds, the cell toxicity closely depends on the number and only slightly on the position of the methyl groups [144]. in the case of sugammadex and its better ability to complex rocuronium bromide, γ-cd has the best suited inner diameter for this purpose. the addition of sulphanylpropanoic acid to the primary -oh group of γ-cd increased the cavity depth and introduced an anionic charge, which increases the affinity for rocuronium bromide and similar molecules [120]. on the other hand, cds can also interact with proteins, as in alzheimer disease or parkinson disease, because the hydrophobic amino acids of the proteins are also suitable for complexation. indeed, cds are able to improve the three-dimensional structure of a protein, acting as chaperones [145, 146]. this effect also suggests the same use for other aggregates.
finally, we have described novel derivatives with structures better able to complex some molecules (see section on removing toxins) or to interact with certain receptors (see antiviral section). this ability can be used to carry a given molecule to the target [147] or to directly block the target.
the present review emphasizes the role of cds and cnn as active agents in themselves, rather than as carriers. their excellent biocompatibility, good interactions with biomolecules, and the fact that they can be easily functionalized make cyclic oligosaccharides highly versatile and multitasking materials suitable for a wide set of different applications. for example, the capacity of cds to complex cholesterol can be exploited not only to prevent diseases such as npc or atherosclerosis, but also to prevent viral or parasitic infections. moreover, their ability to disaggregate a variety of molecules, including proteins, suggests a possible application in the treatment of other diseases, such as parkinson's. they can be used as dietary supplements to control cholesterol uptake and to remove toxins. in addition, when used in combination with collagen, they present interesting properties as scaffolds and as pharmaceutically active agents in contact lens solutions. cnn is able to increase the absorption of ions and the expression of iga, thus acting as an immunological modulator. as a final remark, it should be mentioned that several research groups worldwide are currently working to unlock the full potential of cyclic oligosaccharides and their derivatives, so that new and surprising applications can be expected in the near future.
sur la fermentation de la fécule par l'action du ferment butyrique studies on the schardinger dextrins. xi. the isolation of new schardinger dextrins studies on the schardinger dextrins: xii.
the molecular size and structure of the δ-, ε-, ζ-, and η-dextrins conformationally supple glucose monomers enable synthesis of the smallest cyclodextrins applications of cyclodextrins in food science. a review environmental chemistry for a sustainable world aggregation of t10,c12 conjugated linoleic acid in presence of natural and modified cyclodextrins. a physicochemical, thermal and computational analysis separating and identifying the four stereoisomers of methyl jasmonate by rp-hplc and using cyclodextrins in a novel way nanoparticles of betalamic acid derivatives with cyclodextrins. physicochemistry, production characterization and stability evaluation of the properties of the essential oil citronellal nanoencapsulated by cyclodextrins encapsulation of piceatannol, a naturally occurring hydroxylated analogue of resveratrol, by natural and modified cyclodextrins ellagic acid-borax fluorescence interaction: application for novel cyclodextrin-borax nanosensors for analyzing ellagic acid in food samples removal of aromatic chlorinated pesticides from aqueous solution using β-cyclodextrin polymers decorated with fe3o4 nanoparticles study of oxyresveratrol complexes with insoluble cyclodextrin based nanosponges: developing a novel way to obtain their complexation constants and application in an anticancer study a way to increase the bioaccesibility and photostability of roflumilast, a copd treatment, by cyclodextrin monomers emerging medicines of the new millennium.
biomolecules cyclodextrins as emerging therapeutic tools in the treatment of cholesterol-associated vascular and neurodegenerative diseases cyclodextrins: potential therapeutics against atherosclerosis x-ray structure determination and modeling of the cyclic tetrasaccharide cyclo-{→6) kinetic difference between hydrolyses of γ-cyclodextrin by human salivary and pancreatic α-amylases recent findings on safety profiles of cyclodextrins, cyclodextrin conjugates, and polypseudorotaxanes cyclodextrins: structure, physicochemical properties and pharmaceutical applications pheur.) 9th edition | edqm the pharmacokinetics of β-cyclodextrin and hydroxypropyl-β-cyclodextrin in the rat pharmacokinetics of diclofenac and hydroxypropyl-β-cyclodextrin (hpβcd) following administration of injectable hpβcd-diclofenac in subjects with mild to moderate renal insufficiency or mild hepatic impairment pharmacokinetics of sulfobutylether-β-cyclodextrin (sbecd) in subjects on hemodialysis application of a simple methodology to analyze hydroxypropyl-β-cyclodextrin in urine using hplc-ls in early niemann-pick disease type c patient recent advances in the treatment of niemann pick disease type c: a mini-review questions and answers on cyclodextrins used as excipients in medicinal products for human use 2-hydroxypropyl-beta-cyclodextrin (hp-beta-cd): a toxicology review comparison in toxicity and solubilizing capacity of hydroxypropyl-β-cyclodextrin with different degree of substitution scientific opinion on the substantiation of health claims related to alpha cyclodextrin and reduction of post prandial glycaemic responses (id 2926, further assessment) pursuant to article 13(1) of regulation (ec) no 1924 pharmaceutical and medical device regulatory science society of japan. 
japanese pharmacopoeia functional oligosaccharides: production and action world health organization digestibility and suppressive effect on rats' body fat accumulation of cyclic tetrasaccharide characterization of an inclusion complex of cholesterol and hydroxypropyl-beta-cyclodextrin niemann-pick type c disease-the tip of the iceberg? a review of neuropsychiatric presentation, diagnosis and treatment analytical characterization of methyl-β-cyclodextrin for pharmacological activity to reduce lysosomal cholesterol accumulation in niemann-pick disease type c1 cells niemann-pick disease treatment: a systematic review of clinical trials cyclodextrins: assessing the impact of cavity size, occupancy, and substitutions on cytotoxicity and cholesterol homeostasis endocytosis of beta-cyclodextrins is responsible for cholesterol reduction in niemann-pick type c mutant cells methyl-β-cyclodextrin restores impaired autophagy flux in niemann-pick c1-deficient cells through activation of ampk hydroxypropyl-beta and -gamma cyclodextrins rescue cholesterol accumulation in niemann-pick c1 mutant cell via lysosome-associated membrane protein 1 2-hydroxypropyl-β-cyclodextrins and the blood-brain barrier: considerations for niemann-pick disease type c1 cyclodextrins in the treatment of a mouse model of niemann-pick c disease long-term treatment of niemann-pick type c1 disease with intrathecal 2-hydroxypropyl-β-cyclodextrin in vitro and in vivo evaluation of 6-o-α-maltosyl-β-cyclodextrin as a potential therapeutic agent against niemann-pick disease type c cholesterol lowering effects of mono-lactose-appended β-cyclodextrin in niemann-pick type c disease-like hepg2 cells cholesterol-lowering effect of octaarginine-appended β-cyclodextrin in npc1-trap-cho cells cyclodextrin-based macromolecular systems as cholesterol-mopping therapeutic agents in niemann-pick disease type c linear cyclodextrin polymer prodrugs as novel therapeutics for niemann-pick type c1 disorder vascular smooth muscle 
cells in atherosclerosis inflammation and atherosclerosis implications for the treatment of atherosclerosis treatment with kleptose ® crysmeb reduces mouse atherogenesis by impacting on lipid profile and th1 lymphocyte response β-cyclodextrins decrease cholesterol release and abc-associated transporter expression in smooth muscle cells and aortic endothelial cells shuttle/sink model composed of β-cyclodextrin and simvastatin-loaded discoidal reconstituted high-density lipoprotein for enhanced cholesterol efflux and drug uptake in macrophage/foam cells hydroxypropyl-β-cyclodextrin-mediated efflux of 7-ketocholesterol from macrophage foam cells cyclodextrin promotes atherosclerosis regression via macrophage reprogramming cyclodextrin ameliorates the progression of atherosclerosis via increasing high-density lipoprotein cholesterol plasma levels and anti-inflammatory effects in rabbits alpha-cyclodextrin inhibits cholesterol crystal-induced complement-mediated inflammation: a potential new compound for treatment of atherosclerosis randomized double blind clinical trial on the effect of oral α-cyclodextrin on serum lipids dietary α-cyclodextrin reduces atherosclerosis and modifies gut flora in apolipoprotein e-deficient mice cyclodextrin polymer improves atherosclerosis therapy and reduces ototoxicity the incidence of parkinson's disease: a systematic review and meta-analysis effects of the cholesterol-lowering compound methyl-beta-cyclodextrin in models of alpha-synucleinopathy the autophagy-lysosomal pathway in neurodegeneration: a tfeb perspective genetic and chemical activation of tfeb mediates clearance of aggregated α-synuclein biomarkers for alzheimer disease: classical and novel candidates review and hypotheses review of the advances in treatment for alzheimer disease: strategies for combating β-amyloid protein β-cyclodextrin interacts with the alzheimer amyloid β-a4 peptide two-site binding of β-cyclodextrin to the alzheimer aβ(1−40) peptide measured with 
combined pfg-nmr diffusion and induced chemical shifts hp-β-cyclodextrin as an inhibitor of amyloid-β aggregation and toxicity neuroprotection by cyclodextrin in cell and mouse models of alzheimer disease mucoadhesive microspheres for nasal administration of cyclodextrins synthesis and evaluation of new cyclodextrin derivatives as amyloid-β aggregation inhibitors essential role of coiled coils for aggregation and activity of q/n-rich prions and polyq proteins association of polyalanine and polyglutamine coiled coils mediates expansion disease-related protein aggregation and dysfunction polyserine repeats promote coiled coil-mediated fibril formation and length-dependent protein aggregation structure of n-terminal domain of npc1 reveals distinct subdomains for binding and transfer of cholesterol huntington's disease: a clinical review altered cholesterol homeostasis contributes to enhanced excitotoxicity in huntington's disease lipid raft disruption by cholesterol depletion enhances influenza a virus budding from mdck cells role for influenza virus envelope cholesterol in virus entry and infection host lipid rafts play a major role in binding and endocytosis of influenza a virus inhibition of influenza virus infection by multivalent pentacyclic triterpene-functionalized per-o-methylated cyclodextrin conjugates pentacyclic triterpenes grafted on cd cores to interfere with influenza virus entry: a dramatic multivalent effect design, synthesis and biological evaluation of water-soluble per-o-methylated cyclodextrin-c60 conjugates as anti-influenza virus agents synthesis and structure-activity relationship studies of water-soluble β-cyclodextrin-glycyrrhetinic acid conjugates as potential anti-influenza virus agents niph clinical trials search a phase1 study of hydroxypropyl-beta-cyclodextrin(hp-beta-cyd)-adjuvanted influenza split vaccine induction of dendritic cell maturation and activation by a potential adjuvant, 2-hydroxypropyl-β-cyclodextrin 
acknowledgments: this work is the result of an aid to postdoctoral training and improvement abroad (for adrián matencio) financed by the consejería de empleo, universidades, empresa y medio ambiente of the carm, through the fundación séneca-agencia de ciencia y tecnología de la región de murcia. the authors declare no conflict of interest. pharmaceuticals 2020, 13, 281

key: cord-121200-2qys8j4u authors: zogan, hamad; wang, xianzhi; jameel, shoaib; xu, guandong title: depression detection with multi-modalities using a hybrid deep learning model on social media date: 2020-07-03 journal: nan doi: nan sha: doc_id: 121200 cord_uid: 2qys8j4u

social networks enable people to interact with one another by sharing information, sending messages, making friends, and having discussions, generating massive amounts of data every day, popularly called user-generated content. this data is present in various forms such as images, text, videos, links, and others, and reflects user behaviours including their mental states. it is challenging yet promising to automatically detect mental health problems from such data, which is short, sparse and sometimes poorly phrased. however, there are efforts to automatically learn patterns using computational models on such user-generated content.
while many previous works have studied the problem on a small scale by assuming uni-modality of data, which may not give faithful results, we propose a novel scalable hybrid model that combines bidirectional gated recurrent units (bigrus) and convolutional neural networks to detect depressed users on social media such as twitter based on multi-modal features. specifically, we encode words in user posts using pre-trained word embeddings and bigrus to capture latent behavioural patterns, long-term dependencies, and correlation across the modalities, including semantic sequence features from the user timelines (posts). the cnn model then helps learn useful features. our experiments show that our model outperforms several popular and strong baseline methods, demonstrating the effectiveness of combining deep learning with multi-modal features. we also show that our model helps improve predictive performance when detecting depression in users who are posting messages publicly on social media.

mental illness is a serious issue faced by a large population around the world. in the united states (us) alone, every year, a significant percentage of the adult population is affected by different mental disorders, which include depression mental illness (6.7%), anorexia and bulimia nervosa (1.6%), and bipolar mental illness (2.6%) [1]. mass shootings in the us, which have taken numerous innocent lives, have sometimes been attributed to mental illness [26]. one of the common mental health problems is depression, which is more dominant than other mental illness conditions worldwide [60]. the fatality risk of suicides in depressed people is 20 times higher than in the general population [54]. diagnosis of depression is usually a difficult task because depression detection needs thorough and detailed psychological testing by experienced psychiatrists at an early stage [39].
moreover, it is very common among people who suffer from depression that they do not visit clinics to ask help from doctors in the early stages of the problem [66] . however, it is common for people who suffer from mental health problems to often "implicitly" (and sometimes even "explicitly") disclose their feelings and their daily struggles with mental health issues on social media as a way of relief [3, 33] . therefore, social media is an excellent resource to automatically help discover people who are under depression. while it would take a considerable amount of time to manually sift through individual social media posts and profiles to locate people going through depression, automatic scalable computational methods could provide timely and mass detection of depressed people which could help prevent many major fatalities in the future and help people who genuinely need it at the right moment. the daily activities of users on social media could be a gold-mine for data miners because this data helps provide rich insights on user-generated content. it not only helps give them a new platform to study user behaviour but also helps with interesting data analysis, which might not be possible otherwise. mining users' behavioural patterns for psychologists and scientists through examining their online posting activities on multiple social networks such as facebook, weibo [12, 25] , twitter, and others could help target the right people at right time and provide urgent crucial care [5] . there are existing startup companies such as neotas 1 with offices in london and elsewhere which mines publicly available user data on social media to help other companies automatically do the background check including understanding the mental states of prospective employees. this suggests that studying the mental health conditions of users online using automated means not only helps government or health organisations but it also has a huge commercial scope. 
the behavioural and social characteristics underlying social media information attract the interest of researchers from different domains, such as social scientists, marketing researchers, data mining experts and others, who analyze social media information as a source to examine human moods, emotions and behaviours. usually, depression diagnosis is difficult to achieve on a large scale because most traditional ways of diagnosis are based on interviews, questionnaires, self-reports or testimony from friends and relatives. such methods are hardly scalable, which limits their ability to cover a larger population. individuals and health organizations have thus shifted away from their traditional interactions and now meet online by building online communities for sharing information and for seeking and giving advice, which helps scale their approach to some extent so that they can cover more of the affected population in less time. besides sharing their mood and actions, recent studies indicate that many people on social media tend to share or give advice on health-related information [17, 29, 36, 40]. these sources provide a potential pathway to discover mental health knowledge for tasks such as diagnosis, medications and claims.

detecting depression through online social media is very challenging and requires overcoming various hurdles, ranging from acquiring data to learning the parameters of the model using sparse and complex data. concretely, one of the challenges is the availability of the relevant and right amount of data for mental illness detection. more data is ideal primarily because it gives the computational model more statistical and contextual information during training, leading to faithful parameter estimation. while there are approaches which have tried to learn a model on small-scale data, the performance of these methods is still sub-optimal.
for instance, in [10], the authors tried crawling tweets that contain depression-related keywords as ground truth from twitter. however, they could collect only a limited amount of relevant data, mainly because it is difficult to obtain relevant data on a large scale quickly given the underlying search intricacies associated with the twitter application programming interface (api) and the daily data download limit. despite using the right keywords, the service might return several false positives. as a result, their model suffered from unsatisfactory quantitative performance due to poor parameter estimation on small, unreliable data. the authors in [9] also faced a similar issue where they used a small number of data samples to train their classifier. as a result, their study suffered from the problem of unreliable model training using insufficient data, leading to poor quantitative performance. in [20] the authors propose a model to detect anxious depression of users. they have proposed an ensemble classification model that combines results from three popular models, and they also study the performance of each model in the ensemble individually. to obtain the relevant data, the authors introduced a method to collect their data set quickly by choosing the first randomly sampled 100 users who are followers of ms india student forum for one month. a very common problem faced by researchers in detecting depression on social media is the diversity in users' behaviours on social media, making it extremely difficult to define depression-related features to cope with mental health issues. for example, although social media could help us gather enough data through which useful feature engineering could be effectively done and several user interactions could be captured and thus studied, it was noticed in [15, 51] that one could only obtain a few crucial features to detect people with eating disorders.
in [44] the authors also suffered from the issue of inadequate features including the amount of relevant data set leading to poor results. different from the above works, we have proposed a novel model that is trained on a relatively large dataset showcasing that the method scales and it produces better and reliable quantitative performance than existing popular and strong comparative methods. we have also proposed a novel hybrid deep learning approach which can capture crucial features automatically based on data characteristic making the approach reliable. our results show that our model outperforms several state-of-the-art comparative methods. depressed users behave differently when they interact on social media, producing rich behavioural data, which is often used to extract various features. however, not all of them are related to depression characteristics. many existing studies have either neglected important features or selected less relevant features, which mostly are noise. on the other hand, some studies have considered a variety of user behaviour. for example, [41] is one such work that has collected a large-scale dataset with reliable ground truth labels. they then extracted various features representing user behaviour in social media and grouped these features into several modalities. finally, they proposed a new model called the multimodal dictionary learning model (mdl) to detect depressed users from tweets, based on dictionary learning. however, given the high-dimensional, sparse, figurative and ambiguous nature of tweet language use, dictionary learning cannot capture the semantic meaning of tweets. instead, word embedding is a new technique that can solve the above difficulties through neural network paradigms. 
hence, due to the capability of word embeddings to hold the semantic relationship between tweets and to capture the similarity between terms, we combine multi-modal features with word embeddings to build a comprehensive spectrum of behavioural, lexical, and semantic representations of users.

recently, using deep learning to gain insightful and actionable knowledge from complex and heterogeneous data has become mainstream in ai applications for healthcare; e.g., medical image processing and diagnosis has had great success. the advantage of deep learning lies in its outstanding capability of iteratively learning and automatically optimizing latent representations from a multi-layer network structure [32]. this motivates us to leverage the superior neural network learning capability with the rich and heterogeneous behavioural patterns of social media users. to be specific, this work aims to develop a novel deep learning-based solution for improving depression detection by utilizing multi-modal features from the diverse behaviour of depressed users in social media. apart from the latent features derived from lexical attributes, we notice that the dynamics of tweets, i.e. the tweet timeline, provide a crucial hint reflecting how a depressed user's emotion changes over time. to this end, we propose a hybrid model comprising a bidirectional gated recurrent unit (bigru) and a convolutional neural network (cnn) model to boost the classification of depressed users using multi-modal features and word embedding features. the model can derive new deterministic feature representations from training data and produce superior results for detecting the depression level of twitter users. our proposed model uses a bigru, which is a network that can capture distinct and latent features, as well as long-term dependencies and correlations across the features matrix.
bigru is designed to use backward and forward contextual information in text, which helps obtain a user's latent features from their various behaviours in a more robust way by using reset and update gates in a hidden layer. in general, gru-based models have shown better effectiveness and efficiency than other recurrent neural networks (rnn) such as the long short-term memory (lstm) model [8]. capturing the contextual patterns bidirectionally helps obtain a representation of a word based on its context, which means that under different contexts a word could have different representations. this is more powerful than other techniques such as the traditional unidirectional gru, where one word is represented by only one representation. motivated by this, we add a bidirectional network for gru that can effectively learn from multi-modal features and provide a better understanding of context, which helps reduce ambiguity. besides, bigru can extract more discrete features and helps improve the performance of our model. the bigru model can capture contextual patterns very well, but falls short in automatically learning the right features suitable for the model, which play a crucial role in predictive performance. to this end, we introduce a one-dimensional cnn as a new feature extractor method to classify user timeline posts. our full model can be regarded as a hybrid deep learning model where there is an interplay between a bigru and a cnn model during model training. there are some existing models which have combined cnn and birnn models. for instance, in [63] the authors combine bilstm or bigru and cnn to learn better features for text classification using an attention mechanism for feature fusion, which is a different modelling paradigm from the one introduced in this work, which captures the multi-modalities inherent in data. in [62], the authors proposed a hybrid bigru and cnn model which later constrains the semantic space of sentences with a gaussian.
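to make the role of the reset and update gates more concrete, the following is a minimal pure-python sketch of a generic gru step and of running it in both directions over a sequence. it is not the authors' implementation: the parameter names (`Wz`, `Uz`, and so on), the random initialisation, the toy dimensions, and the choice of the update convention h_t = (1 - z) * h_prev + z * h_candidate (one of two conventions in common use) are all assumptions made for illustration.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def init_gru(input_dim, hidden_dim, rng):
    """Random parameters for one GRU direction (illustrative initialisation)."""
    def mat(rows, cols):
        return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]
    params = {}
    for gate in ("z", "r", "h"):  # update gate, reset gate, candidate state
        params["W" + gate] = mat(hidden_dim, input_dim)
        params["U" + gate] = mat(hidden_dim, hidden_dim)
        params["b" + gate] = [0.0] * hidden_dim
    return params

def gru_step(p, x_t, h_prev):
    """One GRU step: the gates decide how much of h_prev is kept or overwritten."""
    def pre(gate, h_in):
        wx = matvec(p["W" + gate], x_t)
        uh = matvec(p["U" + gate], h_in)
        return [a + b + c for a, b, c in zip(wx, uh, p["b" + gate])]
    z = [sigmoid(v) for v in pre("z", h_prev)]          # update gate
    r = [sigmoid(v) for v in pre("r", h_prev)]          # reset gate
    gated = [ri * hi for ri, hi in zip(r, h_prev)]      # reset applied to old state
    h_cand = [math.tanh(v) for v in pre("h", gated)]    # candidate state
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, h_cand)]

def run_gru(p, xs):
    """Run one direction over the whole sequence, starting from a zero state."""
    h = [0.0] * len(p["bz"])
    states = []
    for x_t in xs:
        h = gru_step(p, x_t, h)
        states.append(h)
    return states

def bigru(p_fwd, p_bwd, xs):
    """Concatenate forward and backward states at every time step."""
    fwd = run_gru(p_fwd, xs)
    bwd = run_gru(p_bwd, xs[::-1])[::-1]  # process reversed input, then re-align
    return [f + b for f, b in zip(fwd, bwd)]

rng = random.Random(0)
p_fwd, p_bwd = init_gru(4, 3, rng), init_gru(4, 3, rng)
xs = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(5)]
states = bigru(p_fwd, p_bwd, xs)
print(len(states), len(states[0]))  # 5 6
```

each output vector has twice the hidden size because it joins a state that has only seen the past with one that has only seen the future, which is what gives every position a context-dependent representation.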
while the modelling paradigms may be closely related through the combination of a bigru and a cnn model, their model is designed to handle sentence sentiment classification rather than depression detection, which is a much more challenging task as tweets in our problem domain are short sentences, largely noisy and ambiguous. in [53], the authors propose a combination of bigru and cnn model for salary detection but do not exploit multi-modal and temporal features. finally, we also studied the performance of our model when we used the two attributes, word embedding and multi-modalities, separately. we found that model performance deteriorated when we used only multi-modal features. we further show that when we combined the two attributes, our model led to better performance. to summarize, our study makes the following contributions:
(1) we propose a novel depression detection framework by deep learning the textual, behavioural, temporal, and semantic modalities from social media.
(2) we use a gated recurrent unit to detect depression using several features extracted from user behaviours.
(3) we built a cnn network to classify user timeline posts, concatenated with a bigru network, to identify social media users who suffer from depression. to the best of our knowledge, this is the first work using multi-modalities of topical, temporal and semantic features jointly with word embeddings in deep learning for depression detection.
(4) the experimental results obtained on a real-world tweet dataset have shown the superiority of our proposed method when compared to baseline methods.

the rest of our paper is organized as follows. section 2 reviews the work related to our paper. section 3 presents the dataset used in this work and the pre-processing we applied to the data. section 4 describes the two different attributes that we extracted for our model. in section 5, we present our model for detecting depression. section 6 reports experiments and results.
finally, section 7 concludes this paper.

in this section, we discuss closely related literature and mention how it differs from our proposed method. in general, just like our work, most existing studies focus on user behaviour to detect whether a user suffers from depression or any mental illness. we also discuss other relevant literature covering word embeddings and hybrid deep learning methods which have been proposed for detecting mental health from online social networks and other resources including public discussion forums. since we also introduce the notion of latent topics in our work, we also cover related literature on topic modelling for depression detection, which has been widely studied.

data present in social media is usually in the form of information that a user shares for public consumption, which also includes related metadata such as user location, language, age, among others [20]. in the existing literature, there are generally two steps to analyzing social data. the first step is collecting the data generated by users on networking sites, and the second step is to analyze the collected data using, for instance, a computational model or manually. in any data analysis, feature extraction is an important task because using only a small set of relevant features, one can learn a high-quality model. understanding depression on online social networks could be carried out using two complementary approaches which are widely discussed in the literature:
• post-level behavioural analysis
• user-level behavioural analysis

methods that use post-level behavioural analysis mainly target the textual features of the user post, which are extracted in the form of statistical knowledge such as those based on count-based methods [21]. these features describe the linguistic content of the post and are discussed in [9, 19].
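as a small illustration of what a count-based textual feature can look like, the sketch below turns a single post into a vector of term counts over a fixed vocabulary. the vocabulary, the tokenisation rule and the example post are hypothetical and are not taken from the cited works.

```python
import re
from collections import Counter

def count_features(post, vocabulary):
    """Return how often each vocabulary term occurs in the post."""
    tokens = re.findall(r"[a-z']+", post.lower())  # crude word tokeniser
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

# Hypothetical vocabulary chosen only for this example.
vocabulary = ["depressed", "feel", "sleep", "happy"]
post = "i feel so depressed, i cannot sleep and i feel alone"
print(count_features(post, vocabulary))  # [1, 2, 1, 0]
```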
for instance, in [9] the authors propose classifier to understand the risk of depression. concretely, the goal of the paper is to estimate that there is a risk of user depression from their social media posts. to this end, the authors collect data from social media for a year preceding the onset of depression from user-profiles and distil behavioural attributes to be measured relating to social engagement, emotion, language and linguistic styles, ego network, and mentions of antidepressant medications. the authors collect their data using crowd-sourcing task, which is not a scalable strategy, on amazon mechanical turk. in their study, the crowd workers were asked to undertake a standardized clinical depression survey, followed by various questions on their depression history and demographics. while the authors have conducted thorough quantitative and qualitative studies, they are disadvantageous in that it does not scale to a large set of users and does not consider the notion of text-level semantics such as latent topics and semantic analysis using word embeddings. our work is both scalable and considers various features which are jointly trained using a novel hybrid deep learning model using a multi-modal learning approach. it harnesses high-performance graphics processing units (gpus) and as a result, has the potential to scale to large sets of instances. in hu et al., [19] the authors also consider various linguistic and behavioural features on data obtained from social media. their underlying model relies on both classification and regression techniques for predicting depression while our method performs classification, but on a large-scale using a varied set of crucial features relevant to this task. 
to analyze whether the post contains positive or negative words and/or emotions, or the degree of adverbs, [49] used cues from the text, for example, "i feel a little depressed" and "i feel so depressed", where they capture the usage of the word "depressed" in sentences that express two different feelings. the authors also analyzed the posts' interactions (i.e., on twitter: retweeted, liked, commented). some researchers studied post-level behaviours to predict mental problems by analysing tweets on twitter to find depression-related language. in [38], the authors have developed a model to uncover meaningful and useful latent structure in a tweet. similarly, in [41], the authors monitored different symptoms of depression that are mentioned in a user's tweet. in [42], they study users' behaviour on both twitter and weibo. to analyze users' posts, they have used linguistic features. they used a chinese language psychological analysis system called textmind in sentiment analysis. one of the interesting post-level behavioural studies was done by [41] on twitter by finding depression-relevant words, antidepressants, and depression symptoms. in [37] the authors used post-level behaviour for detecting anorexia; they analyze domain-related vocabulary such as anorexia, eating disorder, food, meals and exercises.

there are various features to model users in social media, as they reflect overall behaviour over several posts. different from post-level features extracted from a single post, user-level features are extracted from several tweets posted at different times [49]. they also capture the user's social engagement presented on twitter from many tweets, retweets and/or user interactions with others. generally, the posts' linguistic style could be considered to extract features [19, 59]. the authors in [41] extracted six depression-oriented feature groups for a comprehensive description of each user from the collected data set.
the authors used the number of tweets and social interactions as social network features. for user profile features, they have used the personal information shared by the user in a social network. analysing user behaviour also looks useful for detecting eating disorders. in wang et al., [51] they extracted user engagement and activity features on social media. they have extracted linguistic features of the users for psychometric properties, which resembles the settings described in [20, 37, 42], where the authors have extracted 70 features from two different social networks (twitter and weibo). they extracted features from the user profile, posting time and user interaction features such as the number of followers and followees. one interesting work is [56], where the authors combine user-level and post-level semantics and cast their problem as a multiple instance learning setup. the advantage of this method is that it can learn from user-level labels to identify post-level labels.

there is an extensive literature which has used deep learning for detecting depression on the internet in general, ranging from tweets to traditional document collections and user studies. while some of these works could also fall in one of the categories above, we separately present these latest findings which use modern deep learning methods. the most closely related recent work to ours is [23], where the authors propose a cnn-based deep learning model to classify twitter users based on depression using multi-modal features. the framework proposed by the authors has two parts. in the first part, the authors train their model in an offline mode where they exploit features from bidirectional encoder representations from transformers (bert) [11] and visual features from images using a cnn model. the two features are then combined, just as in our model, for joint feature learning.
there is then an online depression detection phase that considers user tweets and images jointly, where there is a feature fusion at a later stage. in another recently proposed work [7], the authors use visual and textual features to detect depressed users on instagram posts rather than twitter. their model also uses multi-modalities in data, but they keep themselves confined to instagram only. while the model in [23] showed promising results, it still has certain disadvantages. for instance, bert vectors for masked tokens are computationally demanding to obtain even during the fine-tuning stage, unlike our model, which does not have to train the word embeddings from scratch. another limitation of their work is that they obtain sentence representations from bert; for instance, bert imposes a 512 token length limit where longer sequences are simply truncated, resulting in some information loss, whereas our model has a much longer sequence length which we can tune easily because our model is computationally cheaper to train. we have proposed a hybrid model that considers a variety of features, unlike these works. while we have not specifically used visual features in our work, using a diverse set of crucial relevant textual features is indeed more reasonable than using just visual features. of course, our model has the flexibility to incorporate a variety of other features including visual features. multi-modal features from text, audio and images have also been used in [64], where a new graph attention-based model embedded with multi-modal knowledge is proposed for depression detection. while they have used a temporal cnn model, their overall architecture has been evaluated on small-scale questionnaire data. for instance, their dataset contains 189 sessions of interactions ranging between 7-33 min (with an average of 16 min). since they have not experimented with short and noisy data from social media, it remains to be seen how their method scales to such large collections.
xezonaki et al., [57] propose an attention-based model for detecting depression from transcribed clinical interviews rather than from online social networks. their main conclusion was that individuals diagnosed with depression use affective language to a greater extent than those who are not going through depression. in another recent work [55], the authors discuss depression among users during the covid-19 pandemic using lstm and fasttext [28] embeddings. in [43], the authors also propose a multi-model rnn-based model for depression prediction but apply their model on online user forum datasets. trotzek et al., [48] study the problem of early detection of depression from social media using deep learning, where they leverage different word embeddings in an ensemble-based learning setup. the authors even train a new word embedding on their dataset to obtain task-specific embeddings. while the authors have used the cnn model to learn high-quality features, their method does not consider temporal dynamics coupled with latent topics, which we show to play a crucial role in overall quantitative performance.

the general motivation of word embeddings is to find a low-dimensional representation of a word in the vocabulary that signifies its meaning in the latent semantic space. while word embeddings have been popularly applied in various domains in natural language processing [34] and information retrieval [61], they have also been applied in the domain of mental health issues such as depression. for instance, in [2], the authors study a few communities on reddit (reddit is also used in [47]) which contain discussions on mental health struggles such as depression and suicidal thoughts. to better model the individuals who may have these thoughts, the authors proposed to exploit the representations obtained from word embeddings, where they group related concepts close to each other in the embedding space.
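closeness between concepts in an embedding space is commonly measured with cosine similarity. the sketch below shows that measure on toy three-dimensional vectors; the vectors, the words and the helper name `nearest` are hypothetical, and the cited work may use a different distance.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

# Toy embeddings invented for this example; real ones have hundreds of dimensions.
embeddings = {
    "sad": [0.9, 0.1, 0.0],
    "depressed": [0.8, 0.2, 0.1],
    "football": [0.0, 0.1, 0.9],
}

def nearest(word, table):
    """Return the other word whose vector is most similar to the given word."""
    others = (w for w in table if w != word)
    return max(others, key=lambda w: cosine_similarity(table[word], table[w]))

print(nearest("depressed", embeddings))  # sad
```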
the authors then compute the distance between a list of manually generated concepts to discover how related concepts align in the semantic space and how users perceive those concepts. however, they do not exploit various multi-modal features including topical features in their space. farruque et al., [13] study the problem of creating word embeddings in cases where the data is scarce, for instance, depressive language detection from user tweets. the underlying motivation of their work is to simulate a retrofitting-based word embedding approach [14] where they begin with a pre-trained model and fine-tune the model on domain-specific data. gong et al., [16] proposed a topic modelling approach to depression detection using multi-modal analysis. they propose a novel topic model which is context-aware with temporal features. while the model produced satisfactory results on the 2017 audio/visual emotion challenge (avec), the method does not use a variety of rich features and could face scalability issues because simple posterior inference algorithms such as those based on gibbs or collapsed gibbs sampling do not parallelize unlike deep learning methods, or one needs sophisticated engineering to parallelize such models.

twitter has been popularly regarded as one online social media resource that provides free data for data mining on tweets. this is the reason for its popularity among researchers who have widely used data from twitter. one can freely and easily download tweet data through their apis. however, in the past, researchers have generally followed two methods for using twitter data, which are:
• using an already existing dataset shared freely and publicly by others. the downside of such datasets is that they might be too old to learn anything useful in the current context. recency may be crucial in some studies such as understanding current trends of a recently trending topic [22].
â�¢ crawling data using vocabulary from a social media network though is slow but helps get fresh, relevant and reliable data which would help learn patterns that are currently being discussed on online social networks. this method takes time to collect relevant and then process the data given that resources such as twitter which provide data freely impose tweet download restrictions per user per day, as a result of fair usage policy applied to all users. developing and validating the terms used in the vocabulary by users with mental illness is time-consuming but helps obtain a reliable list of words, by which reliable tweets could be crawled reducing the amount the false-positives. recent research conducted by the authors of [41] is one such work that has collected a large-scale data with reliable ground truth data, which we aim to reuse. we present the statistics of the data in table 1 . to exemplify the dataset further, the authors collected three complementary data sets, which are: â�¢ depression data set: each user is labelled as depressed, based on their tweet content between 2009 and 2016. this includes 1,402 depressed users and 292,564 tweets. â�¢ non-depression data set: each user is labelled as non-depressed and the tweets were collected in december 2016. this includes over 300 million active users and 10 billion tweets. â�¢ depression-candidate data set: the authors collected are labelled as depression-candidate, where the tweet was collected if contained the word "depress". this includes 36,993 depressioncandidate users and over 35 million tweets. data collection mechanisms are often loosely controlled, impossible data combinations, for instance, users labelled as depressed but have provided no posts, missing values, among others. after data has dataset depressed non-depressed no. of users 1402 300 million no. of tweets 292, 564 10 billion table 1 . statistics of the large dataset collected by the authors in [41] which is used in this study. 
been crawled, it is still not ready to be used directly by the machine learning model due to various noise still present in data, which is called the "raw data". the problem is even more exacerbated when data has been downloaded from online social media such as twitter because tweets may contain spelling and grammar mistakes, smileys, and other undesirable characters. therefore, a pre-processing strategy is needed to ensure satisfactory data quality for computational modal to achieve reliable predictive analysis. the raw data used in this study has labels of "depressed" and "non-depressed". this data is organised as follows: users: this data is packaged as a json file for each user account describing details about the user such as user id, number of followers, number of tweets etc. note that json is a standard popular data-interchange which is easy for humans to read and write. timeline: this data package contains files containing several tweets along with corresponding metadata, again in json format. to further clean the data we used natural language processing toolkit (nltk). this package has been widely used for text pre-processing [18] and various other works. it has also been widely used for removing common words such as stop words from text [10, 20, 38] . we have removed the common words from users tweets (such as "the", "an", etc.) as these are not discriminative or useful enough for our model. these common words sometimes also increase the dimensionality of the problem which could sometimes lead to the "curse-of-dimensionality" problem and may have an impact on the overall model efficiency. to further improve the text quality, we have also removed non-ascii characters which have also been widely used in literature [59] . pre-processing and removal of noisy content from the data helped get rid of plenty of noisy content from the dataset. we then obtained a high-quality reliable data which we could use in this study. 
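the cleaning steps described above (dropping non-ascii characters and removing stop words) can be sketched as follows. this is a minimal illustration, not the pipeline used in the study: the stop-word set is a tiny hypothetical stand-in for nltk's english list, and the function name `clean_tweet` is an assumption.

```python
import re

# Tiny illustrative stop-word set (assumption); the study uses NLTK's English list.
STOP_WORDS = {"the", "an", "a", "and", "is", "to", "of", "in"}

def clean_tweet(text):
    """Drop non-ASCII characters, lower-case, tokenise, and remove stop words."""
    # Encoding with errors="ignore" silently discards emojis and accented letters.
    ascii_only = text.encode("ascii", errors="ignore").decode("ascii")
    tokens = re.findall(r"[a-z0-9']+", ascii_only.lower())
    return [tok for tok in tokens if tok not in STOP_WORDS]

if __name__ == "__main__":
    print(clean_tweet("The café is an empty place 😞 and I feel tired"))
```

note that discarding non-ascii characters also discards emojis, so emoji-based features would have to be counted before this step.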
besides, this distillation helped reduce the computational complexity of the model because we are only dealing with informative data which eventually would be used in modelling. we present the statistics of this distilled data below. to further mitigate the issue of sparsity in data, we excluded those users who have posted fewer than ten posts and users who have less than 5000 followers; we therefore ended up with 2500 positive users and 2300 negative users.

social media data conveys user content, insights and emotions reflected in an individual's behaviour in the social network. this data shows how users interact with their connections. in this work, we collect information from each user and categorize it into two types of attributes, namely multi-modal attribute and word embedding, as follows:

we introduce this attribute type where the goal is to calculate the attribute value corresponding to each modality for each user. we estimate that the dimensionality for all modalities of interest is 76; we mainly consider four major modalities as listed below and ignore two modalities due to missing values. these features are extracted respectively for each user as follows:

4.1.1 social information and interaction. from this attribute, we extracted several features embedded in each user profile. these are features related to each user account as specified by each feature name. most of the features are directly available in the user data, such as the number of users following and friends, favourites, etc. moreover, the extracted features relate to user behaviour on their profile. for each user, we calculate their total number of tweets, the total length of all their tweets and the number of retweets. we further calculate the posting time distribution for each user, by counting how many tweets the user published during each of the 24 hours of a day. hence it is a 24-dimensional integer array.
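a minimal sketch of the 24-dimensional posting time distribution is given below. it assumes each tweet carries a timestamp in twitter's `created_at` string layout (for example "Wed Dec 07 21:15:03 +0000 2016"); the function name `posting_time_distribution` is hypothetical.

```python
from collections import Counter

def posting_time_distribution(timestamps):
    """Return a 24-element list with the number of tweets posted in each hour.

    Assumes timestamps such as "Wed Dec 07 21:15:03 +0000 2016", where the
    hour is the two digits immediately before the first colon.
    """
    counts = Counter()
    for stamp in timestamps:
        hour = int(stamp.split(":")[0][-2:])
        counts[hour] += 1
    return [counts.get(hour, 0) for hour in range(24)]

if __name__ == "__main__":
    user_tweets = [
        "Wed Dec 07 21:15:03 +0000 2016",
        "Thu Dec 08 21:01:00 +0000 2016",
        "Thu Dec 08 03:59:59 +0000 2016",
    ]
    print(posting_time_distribution(user_tweets))
```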
to get the posting time distribution, we extract the two hour digits from the timestamp of each tweet, then go through all tweets of each user and track the count of tweets posted in each hour of the day.

emojis allow users to express their emotions through simple icons and non-verbal elements, and are useful for getting the attention of the reader. emojis can give us a glimpse of the sentiment of any text or tweet, and it is essential to differentiate between positive and negative sentiment text [31]. user tweets contain a large number of emojis which can be classified into positive, negative and neutral. for each positive, neutral, and negative type, we count their frequency in each tweet. then we sum up the numbers from each user's tweets to get the sum for each user. so the final output is three values corresponding to positive, neutral and negative emojis by the user.

we also consider valence, arousal and dominance (vad) features. for that, we count first person singular and first person plural. using affective norms for english words, vad scores for 1030 words are obtained. we create a dictionary with each word as a key and a tuple of its (valence, arousal, dominance) scores as value. next, we parse each tweet and calculate the vad score for each tweet using this dictionary. finally, for each user, we add up the vad scores of tweets by that user, to calculate the vad score for each user.

topic modelling belongs to the class of statistical modelling frameworks which help in the discovery of abstract topics in a collection of text documents. it gives us a way of organizing, understanding and summarizing collections of textual information. it helps find hidden topical patterns throughout the process, where the number of topics is specified by the user a priori. it can be defined as a method of finding groups of words (i.e. topics) from a collection of documents that best represent the latent topical information in the collection. in our work, we applied the unsupervised latent dirichlet allocation (lda) [4] to extract the latent topic distribution from user tweets. to calculate topic level features, we first consider the corpus of all tweets of all depressed users. next, we split each tweet into a list of words and assemble all words in decreasing order of their frequency of occurrence, and common english words (stopwords) are removed from the list. finally, we apply lda to extract the latent distribution over k = 25 topics, where k is the number of topics. we have found experimentally that k = 25 is a suitable value. while there are tuning strategies and strategies based on bayesian non-parametrics [46], we have opted to use a simple, popular, and computationally efficient approach which gives us the desired results.

the depression symptoms feature is the count of depression symptoms occurring in tweets, as specified in nine groups in the dsm-iv criteria for a depression diagnosis. the symptoms are listed in appendix a. we count how many times the nine depression symptoms are mentioned by the user in their tweets. the symptoms are specified as a list of nine categories, each containing various synonyms for the particular symptom. we created a set of seed keywords for all these nine categories, and with the help of the pre-trained word embedding, we extracted words similar to these seeds to extend the list of keywords for each depression symptom. furthermore, we scan through all tweets, counting how many times a particular symptom is mentioned in each tweet. we also focused on antidepressants: we created a lexicon of antidepressants from the "antidepressant" wikipedia page, which contains an exhaustive list of items and is updated regularly, and we counted the number of antidepressant names mentioned. the medicine names are listed in appendix b.
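the symptom counting described above can be sketched as a lexicon lookup over tokenised tweets. the lexicon below is a hypothetical two-category excerpt with made-up keywords, not the lists of appendix a, and `symptom_counts` is an assumed name.

```python
import re

# Hypothetical excerpt: two of the nine symptom groups, with illustrative keywords.
SYMPTOM_LEXICON = {
    "depressed_mood": {"depressed", "hopeless", "sad"},
    "insomnia": {"insomnia", "sleepless", "awake"},
}

def symptom_counts(tweets, lexicon=SYMPTOM_LEXICON):
    """Count, per symptom group, how often any of its keywords occurs in a user's tweets."""
    counts = {name: 0 for name in lexicon}
    for tweet in tweets:
        tokens = re.findall(r"[a-z']+", tweet.lower())
        for name, keywords in lexicon.items():
            counts[name] += sum(tok in keywords for tok in tokens)
    return counts

if __name__ == "__main__":
    print(symptom_counts(["so sad and hopeless today", "awake again, insomnia is back"]))
```

the same pattern would apply to the antidepressant lexicon, with medicine names as keywords.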
word embeddings are a class of representation learning models which find the underlying meaning of words in the vocabulary in some low-dimensional semantic space. their underlying principle is based on optimising an objective function which helps bring words which are repeatedly occurring together under a certain contextual window, close to each other in the semantic space. the usual windows size that works well in many settings is 10 [34] . a remarkable ability of these models is that they can effectively capture various lexical properties in natural language such as the similarity between words, analogies among words, and others. these models have become increasingly popular in the natural language processing domain and have been used as input to deep learning models. among various word embedding models proposed in the literature, word2vec [27] is one of the most popular techniques that use shallow neural networks to learn word embedding. word2vec is a predictive model for learning word embeddings from raw text that is also computationally efficient. word2vec takes a large corpus of text as its input and generates a vector space with a corresponding vector in the space allocated to each specific word. word vectors are placed in the space of the vector. the words that share common meanings in the corpus are located in space near to each other. to learn the semantic meaning between the words that were posted by depressed users, we add a new attribute to extract more meaningful features. count features in multi-modalities attribute are useful and effective to extract features from normal text. however, they could not effectively capture the underlying semantics, structure, sequence and meaning in tweets. while count features are based on the independent occurrence of words in a text corpus, they cannot capture the contextual meaning of words in the text which is effectively captured by word embeddings. 
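to make the role of the contextual window concrete, the sketch below generates the (centre word, context word) training pairs that a skip-gram style model would learn from. it assumes the skip-gram variant of word2vec and uses a small window for readability; `skipgram_pairs` is a hypothetical helper, and no embedding is actually trained here.

```python
def skipgram_pairs(tokens, window=2):
    """Return (centre, context) pairs for every word within `window` positions."""
    pairs = []
    for i, centre in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:  # a word is not its own context
                pairs.append((centre, tokens[j]))
    return pairs

if __name__ == "__main__":
    for pair in skipgram_pairs(["i", "feel", "so", "tired"], window=1):
        print(pair)
```

words that repeatedly share such pairs are pulled close together in the semantic space, which count features cannot express.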
motivated by this, we apply word embedding techniques to extract more meaningful features from every user's tweets and capture the semantic relationship among word sequence. we used a popular model called word2vec [27] with a 300-dimensional set of word embeddings pre-trained on google news corpus to produce a matrix of word vectors. the skip-gram model is used to learn word vector representations which are characterised by low-dimensional real-valued representations for each word. this is usually done as a pre-processing stage, after which the vectors learned are fed into a model. in this section, we describe our hybrid model that learns from multi-modal features. while there are various hybrid deep learning models proposed in the literature, our method is novel in that it learns multi-modal features which include topical features as shown in figure 1 . the joint learning mechanism learns the model parameters in a consolidated parameter space where different model parameters are shared during the training phase leading to more reliable results. note that simple cascaded-based approaches incorporate error propagation from one stage to next [65] . at the end of the feature extraction step, we obtain the training data in the form of an embedding matrix for each user representing the user timeline posts attribute. we also have a 76-dimensional vector of integers for each user representing the multi-modalities attribute. due to the complexity of user posts and the diversity of their behaviour on social media, we propose a hybrid model based on cnn that combines with bigru to detect depression through social media as depicted in figure 1 . for each user, the model takes two inputs for the two attributes. first, the four modalities feature input that represents user behaviour vector runs into bigru, capturing distinct and latent features, as well as long-term dependencies and correlation across the features matrix. 
the second input represents each user's input tweets, which are replaced with their embeddings and fed to the convolution layer to learn representation features from the sequential data. the intermediate outputs of both attributes are concatenated into one single feature vector that is fed into a sigmoid activation layer for prediction. in the following sections, we will discuss the two existing separate architectures which will be combined, leading to a novel computational model for modelling spatial structures and multi-modalities. in particular, the model comprises a cnn network to learn the spatial structure from user tweets and a framework to extract latent features from the multi-modalities attribute followed by the application of bigru.

an individual user's timeline comprises semantic information and local features. recent studies show that cnn has been successfully used for learning strong, suitable and effective feature representations [24]. the effective feature learning capabilities of cnns make them an ideal choice to extract semantic features from a user post. in this work, we propose to apply a cnn network to extract semantic information features from user tweets. the input to our cnn network is the embedded matrix layer with a sentence matrix, and the sentence will be treated as a sequence of words $s : [w_1, w_2, w_3, \ldots, w_i]$. each word $w \in \mathbb{R}^{1 \times d}$ is one vector of the embedding matrix $\mathbb{R}^{w \times d}$, where d represents the dimension of each word in the matrix and w represents the length or number of words of each user's posts. we set the size of each user sentence between 0 and 1000 words and describe the average of only ten tweets for each user. note that this size is much larger than what has been used in other recent closely-related models which are based on bert.
also, we could train our model on the dataset, which helps create specific representations for our dataset in a computationally less demanding way, unlike models based on bert, which are both computationally and financially expensive to train and then fine-tune. the input layer is followed by three convolutional layers that learn n-gram features capturing word order, thereby capturing crucial text semantics which usually cannot be captured by a bag-of-words-based model [52]. we use a convolution operation $c_n$ to extract features between words as follows:

$c_n = f(k \cdot x_{n:n+h-1} + b)$ (1)

where f is a nonlinear function, k denotes the filter weights, b denotes the bias and $x_{n:n+h-1}$ is a window of h words. here the convolution is applied to the window of word vectors, where the window size is h. the network now creates a feature map according to the following equation:

$c = [c_1, c_2, \ldots, c_{w-h+1}]$ (2)

the feature map output by the convolution layer is the input for the pooling layer, which is an important step to reduce the dimension of the space by selecting appropriate features. we used the max pooling layer to calculate the maximum value for every feature-map patch. the output of the pooling operation is generated as follows:

$\hat{c} = \max\{c\}$ (3)

we add the lstm layer to create a stack of deep learning algorithms to optimize the results.

the recurrent neural network (rnn) is a powerful network when the input consists of fixed vectors to be processed in sequence, even if the data is non-sequential. models such as bigru, gru, and lstm fall in the class of rnns. the static attributes are usually inputted to the bigru. gru is an alternative to lstm that links the forget gate and the input gate into a single update gate, which makes it computationally more efficient than an lstm network due to the reduction of gates. gru can effectively and efficiently capture long-distance information between features, but a one-way or unidirectional gru can only partly capture the historical information features.
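the convolution, feature map and max-pooling steps described above can be sketched in plain python as follows. this is an illustration under assumptions, not the trained network: tanh stands in for the unspecified nonlinear function f, a single filter is used, and the toy vectors are made up.

```python
import math

def conv_feature_map(sentence, kernel, bias=0.0):
    """Slide one filter over windows of h word vectors and apply tanh to each sum."""
    h = len(kernel)  # window size: number of consecutive words per window
    feature_map = []
    for n in range(len(sentence) - h + 1):
        total = bias
        for word_vec, kernel_row in zip(sentence[n:n + h], kernel):
            total += sum(x * k for x, k in zip(word_vec, kernel_row))
        feature_map.append(math.tanh(total))
    return feature_map

def max_pool(feature_map, size):
    """Keep the maximum of every non-overlapping patch of `size` values."""
    return [max(feature_map[i:i + size]) for i in range(0, len(feature_map), size)]

if __name__ == "__main__":
    sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]  # 4 words, d = 2
    kernel = [[1.0, 0.0], [0.0, 1.0]]                            # h = 2
    fmap = conv_feature_map(sentence, kernel)
    print(fmap, max_pool(fmap, size=2))
```

a sentence of w words and a window of h words yields w - h + 1 feature values, which pooling then reduces further.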
moreover, for our static attributes, we would like to get the information about the behavioural semantics of each user. to this end, we have applied bigru to combine the forward and backward directions for every input feature to capture the behavioural semantics in both directions. bidirectional models, in general, consider both past and future contexts, which makes them more powerful than unidirectional models [11]. suppose the input which resembles a user behaviour is represented as $x_1, x_2, \ldots, x_n$. when we apply the traditional unidirectional gru, we have the following form:

$h_s = \mathrm{gru}(x_s, h_{s-1})$ (4)

a bidirectional gru actually consists of two gru layers, as in figure 2, introduced to obtain the forward and the backward information. the hidden layer has two values for the output, one for the backward output and the other for the forward output, and the algorithm can be described as follows:

$\overrightarrow{h}_s = \mathrm{gru}(x_s, \overrightarrow{h}_{s-1}), \quad \overleftarrow{h}_s = \mathrm{gru}(x_s, \overleftarrow{h}_{s+1}), \quad h_s = [\overrightarrow{h}_s ; \overleftarrow{h}_s]$ (5)

where $x_s$ represents the input of step s, while $\overrightarrow{h}_s$ and $\overleftarrow{h}_s$ represent the hidden state of the forward and the backward gru in step s. each gru network is defined as follows:

$z_s = \sigma(W_z x_s + U_z h_{s-1})$ (6)
$r_s = \sigma(W_r x_s + U_r h_{s-1})$ (7)
$\tilde{h}_s = \tanh(W_h x_s + U_h (r_s \odot h_{s-1}))$ (8)
$h_s = (1 - z_s) \odot h_{s-1} + z_s \odot \tilde{h}_s$ (9)

where $\sigma$ is the sigmoid function, $\odot$ is element-wise multiplication, and $W$ and $U$ are weight matrices. the gru network calculates the update gate $z_s$ in time step s (equation 6). this gate helps the model decide how much information obtained from the previous step should be passed to the next step. the reset gate in equation 7 is used to determine how much information from the past step needs to be forgotten. the gru model uses the reset gate to save related information from the past, as depicted in equation 8. lastly, the model calculates $h_s$, which holds all the information and passes it down the network, as depicted in equation 9. after we obtain the latent features from each model, we integrate these features and concatenate them as a feature vector to be input into an activation function for classification as mentioned below.
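the gating mechanism and the two-direction pass described above can be sketched with scalar states as follows. this assumes the standard gru formulation (update gate, reset gate, candidate state, interpolation); the parameter names in the dictionary and the function names are hypothetical, and real layers use weight matrices rather than single numbers.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def gru_step(x, h_prev, p):
    """One GRU step on scalar input x and scalar previous state h_prev."""
    z = sigmoid(p["wz"] * x + p["uz"] * h_prev)                # update gate
    r = sigmoid(p["wr"] * x + p["ur"] * h_prev)                # reset gate
    h_tilde = math.tanh(p["wh"] * x + p["uh"] * (r * h_prev))  # candidate state
    return (1.0 - z) * h_prev + z * h_tilde                    # blend old and new

def run_gru(xs, p):
    """Run a unidirectional GRU over xs, starting from a zero state."""
    h, states = 0.0, []
    for x in xs:
        h = gru_step(x, h, p)
        states.append(h)
    return states

def bigru(xs, p_fwd, p_bwd):
    """Pair the forward state with the backward state at every step."""
    forward = run_gru(xs, p_fwd)
    backward = run_gru(xs[::-1], p_bwd)[::-1]
    return list(zip(forward, backward))

if __name__ == "__main__":
    params = dict(wz=0.5, uz=0.5, wr=0.5, ur=0.5, wh=1.0, uh=1.0)
    print(bigru([1.0, -1.0, 0.5], params, params))
```

each output pair plays the role of the concatenated forward and backward hidden states, so every feature position sees context from both directions.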
6 experiments and results

we compare our model with the following classification methods:
• ~mdl: the multimodal dictionary learning model (mdl) detects depressed users on twitter [41]. it uses dictionary learning to extract latent data features and a sparse representation of a user. since we cannot get access to all of the attributes used in [41], we implement mdl in our own way.
• svm: support vector machines are a class of machine learning models for text classification that optimise a loss function to learn a maximum-margin separating hyperplane between two sets of labelled data, e.g., between positively and negatively labelled data [6]. this is one of the most popular classification algorithms.
• nb: naive bayes is a family of probabilistic algorithms based on applying bayes' theorem with the "naive" assumption of conditional independence between features [30]. while the suitability of the conditional independence assumption has been questioned by various researchers, these models surprisingly give superior performance when compared with many sophisticated models [45].

for our experiments, we have used the datasets mentioned in section 3. they provide large-scale data, especially for labelled negative and candidate positive users. after pre-processing and extracting information from the raw data, we filter out the following datasets to perform our experiments:
• number of users labelled positive: 5899.
• number of tweets from positive users: 508786.
• number of users labelled negative: 5160.
• number of tweets from negative users: 2299106.

we then further excluded users who posted fewer than ten posts and users who have more than 5000 followers, and ended up with a final dataset consisting of 2500 positive users and 2300 negative users. we adopt the ratio 80:20 to split our data into training and test sets. we used pre-trained word2vec embeddings trained on the google news corpus, which comprises 3 billion words.
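the 80:20 split of the 4800 retained users can be sketched as a seeded shuffle followed by a cut. this is only one plausible way to realise the split; the seed, the function name `split_users` and the use of integer user ids are assumptions, and no stratification by label is performed here.

```python
import random

def split_users(users, test_ratio=0.2, seed=0):
    """Shuffle users reproducibly and return (train, test) lists."""
    shuffled = list(users)
    random.Random(seed).shuffle(shuffled)  # local RNG, global state untouched
    n_test = int(round(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]

if __name__ == "__main__":
    user_ids = list(range(2500 + 2300))  # 2500 positive and 2300 negative users
    train, test = split_users(user_ids)
    print(len(train), len(test))
```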
we used python 3.6.3 and tensorflow 2.1.0 to develop our implementation. we set the embedding layer to be non-trainable so that we keep the feature representations, e.g., word vectors and topic vectors, in their original form. we used one hidden layer and a max-pooling layer of size 4, which gave better performance in our setting. for the optimization of both the bigru and cnn networks, we used the adam optimization algorithm. finally, we trained our model for 10 iterations, with a batch size of 32. the number of iterations was sufficient for the model to converge, and our experimental results support this claim, as we outperform existing strong baseline methods.

we employ traditional information retrieval metrics such as precision, recall, f1, and accuracy, based on the confusion matrix, to evaluate our model. a confusion matrix is a table used for evaluating classification performance; it is also called an error matrix because it shows the number of wrong predictions versus the number of right predictions in a tabulated manner. some important terminologies associated with computing the confusion matrix are the following:
• p: the actual positive case, which is depressed in our task.
• n: the actual negative case, which is not depressed in our task.
• tn: the actual case is not depressed, and the prediction is not depressed as well.
• fn: the actual case is depressed, but the prediction is not depressed.
• fp: the actual case is not depressed, but the prediction is depressed.
• tp: the actual case is depressed, and the prediction is depressed as well.

based on the confusion matrix, we can compute the accuracy, precision, recall and f1 score as follows:

$\mathrm{accuracy} = \frac{tp + tn}{tp + tn + fp + fn}$
$\mathrm{precision} = \frac{tp}{tp + fp}$
$\mathrm{recall} = \frac{tp}{tp + fn}$
$\mathrm{f1} = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}$

in our experiments, we study our model attributes including the quantitative performance of our hybrid model. we use the multi-modalities attribute and the user's timeline semantic features attribute jointly.
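the four evaluation metrics can be computed directly from the confusion-matrix counts, as in the minimal sketch below. the counts in the usage example are made up for illustration and are not results from this study; `classification_metrics` is a hypothetical helper name.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

if __name__ == "__main__":
    # Illustrative counts only: 50 depressed and 50 non-depressed users.
    print(classification_metrics(tp=40, tn=45, fp=5, fn=10))
```

the guards against empty denominators matter when a classifier predicts a single class for every user.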
after grouping user behaviour in social media into the multi-modalities attribute (mm), we evaluate the performance of the model. first, we examine the effectiveness of using the multi-modalities attribute (mm) only with different classifiers. second, we show how the model performance increases when we combine word embedding with mm. we summarise the results in table 2 and figure 4 as follows:
• naive bayes obtains the lowest f1 score, which demonstrates that this model is less capable of classifying tweets than the other existing models for detecting depression. the reason for its poor performance could be that the model is not robust enough to sparse and noisy data.
• the ~mdl model outperforms svm and nb and obtains better accuracy than these two methods. since this is a recent model especially designed to discover depressed users, it has captured the intricacies of the dataset well and learned its parameters faithfully, leading to better results.
• our proposed model improves depression detection by up to 6% in f1-score compared to the ~mdl model. this suggests that our model outperforms a strong model. our model performs well primarily because it leverages a rich set of features which is jointly learned in the consolidated parameter estimation, resulting in a robust model.
• we can also deduce from the table that our model consistently outperforms all existing strong baselines.
• furthermore, our model achieves the best performance with 85% in f1, indicating that combining bigru for the multimodal strategy with cnn for the user timeline semantic features strategy is sufficient to detect depression on twitter.

to get a better look at our model's performance and how it classifies the samples, we have used the confusion matrix. for this, we import the confusion matrix module from sklearn, which helps us generate the confusion matrix.
we visualize the confusion matrix, which shows, as percentages of the samples, how the predicted classes relate to the actual classes. we can observe from figure 3 that our model effectively predicts non-depressed users (tn) and depressed users (tp).

we have also compared the effectiveness of each of the two attributes of our model. to test the performance of the model with each attribute, we build the model, feed it with each attribute separately and compare how the model performs. first, we test the model using only the multi-modalities attribute; we can observe in figure 4 that the model performs less optimally when we use bigru only. in contrast, the model performs better when we use only cnn with the word embedding attribute. this signifies that extracting semantic information features from user tweets is crucial for depression detection. although the model using only the word embedding attribute outperforms the one using only multi-modalities, the true positive rates (sensitivity) for both attributes are still close to each other, as seen in the precision scores of bigru and cnn. finally, we can see that the model performance increases when cnn and bigru are combined, outperforming the use of each attribute independently.

after depressed users are classified, we examined the most common depression symptoms among depressed users. in figure 5, we can see that symptom one (feeling depressed) is the most common symptom posted by depressed users. this shows that depressed users expose and post their depressive mood on social media more than any other symptom. besides that, other symptoms such as energy loss, insomnia, a sense of worthlessness, and suicidal thoughts have appeared in more than 20% of the depressed users. to further investigate the five most influential symptoms among depressed users, we collected all the tweets associated with these symptoms.
then we created a tag cloud [50] for each of these five symptoms, to determine the frequent and important words related to each symptom, as shown in figure 6, where words in a larger font are relatively more important than the rest in the same cloud representation. this cloud gives us an overview of all the words that occur most frequently within each of these five symptoms.

in this paper, we propose a new model for detecting depressed users through social media analysis by extracting features from the user behaviour and the user's online timeline (posts). we have used a real-world data set of depressed and non-depressed users and applied it in our model. we have proposed a hybrid model which is characterised by introducing an interplay between the bigru and cnn models. we feed the multi-modalities attribute, which represents the user behaviour, into the bigru, and the user timeline posts into the cnn to extract the semantic features. our results show that training this hybrid network improves classification performance and identifies depressed users, outperforming other strong methods. this work has great potential to be further explored in the future. for instance, we can enhance the multi-modalities feature by using short-text topic modelling, for example by proposing a new variant of the biterm topic model (btm) [58] capable of generating depression-associated topics, as a feature extractor to detect depression. besides, we can use recently proposed word representation techniques, also known as pre-trained language models, such as deep contextualized word representations (elmo) [35] and bidirectional encoder representations from transformers (bert) [11], and train them on a large corpus of depression-related tweets instead of using a pre-trained word embedding model.
while there will be challenges when using such pre-trained language models can introduce because of the restriction that they impose on the sequence length; nevertheless, studying these models on this task helps to unearth their pros and cons. eventually, our future works aim to detect other mental illness in conjunction with depression to capture complex mental issues which have pervaded into an individual's life. diagnostic and statistical manual of mental disorders (dsm-5â®) towards using word embedding vector space for better cohort analysis depressed individuals express more distorted thinking on social media latent dirichlet allocation methods in predictive techniques for mental health status on social media: a critical review libsvm: a library for support vector machines multimodal depression detection on instagram considering time interval of posts empirical evaluation of gated recurrent neural networks on sequence modeling predicting depression via social media depression detection using emotion artificial intelligence bert: pre-training of deep bidirectional transformers for language understanding a depression recognition method for college students using deep integrated support vector algorithm augmenting semantic representation of depressive language: from forums to microblogs retrofitting word vectors to semantic lexicons analysis of user-generated content from online social communities to characterise and predict depression degree topic modeling based multi-modal depression detection take two aspirin and tweet me in the morning: how twitter, facebook, and other social media are reshaping health care natural language processing methods used for automatic prediction mechanism of related phenomenon predicting depression of social media user on different observation windows anxious depression prediction in real-time social data rehabilitation of count-based models for word vector representations text-based detection and understanding of changes in mental 
health sensemood: depression detection on social media supervised deep feature extraction for hyperspectral image classification using social media content to identify mental health problems: the case of# depression in sina weibo mental illness, mass shootings, and the politics of american firearms advances in pretraining distributed word representations rethinking communication in the e-health era on discriminative vs. generative classifiers: a comparison of logistic regression and naive bayes borut sluban, and igor mozetiä� deep learning for depression detection of twitter users depressive moods of users portrayed in twitter glove: global vectors for word representation deep contextualized word representations identifying health-related topics on twitter early risk detection of anorexia on social media beyond lda: exploring supervised topic modeling for depression-related language in twitter beyond modelling: understanding mental disorders in online social media dissemination of health information through social networks: twitter and antibiotics depression detection via harvesting social media: a multimodal dictionary learning solution cross-domain depression detection via harvesting social media multi-modal social and psycho-linguistic embedding via recurrent neural networks to identify depressed users in online forums detecting cognitive distortions through machine learning text analytics a comparison of supervised classification methods for the prediction of substrate type using multibeam acoustic and legacy grain-size data sharing clusters among related groups: hierarchical dirichlet processes understanding depression from psycholinguistic patterns in social media texts utilizing neural networks and linguistic metadata for early detection of depression indications in text sequences recognizing depression from twitter activity timelines tag clouds and the case for vernacular visualization detecting and characterizing eatingdisorder communities on social media 
topical n-grams: phrase and topic discovery, with an application to information retrieval salary prediction using bidirectional-gru-cnn model world health organization estimating the effect of covid-19 on mental health: linguistic indicators of depression during a global pandemic modeling depression symptoms from social network data through multiple instance learning georgios paraskevopoulos, alexandros potamianos, and shrikanth narayanan. 2020. affective conditioning on hierarchical networks applied to depression detection from transcribed clinical interviews a biterm topic model for short texts semi-supervised approach to monitoring clinical depressive symptoms in social media survey of depression detection using social networking sites via data mining relevance-based word embedding combining convolution neural network and bidirectional gated recurrent unit for sentence semantic classification feature fusion text classification model combining cnn and bigru with multi-attention mechanism graph attention model embedded with multi-modal knowledge for depression detection medlda: maximum margin supervised topic models. the depression and disclosure behavior via social media: a study of university students in china list of depression symptoms as per dsm-iv:(1) depressed mood.(2) iminished interest. key: cord-269270-i2odcsx7 authors: sahlol, ahmed t.; yousri, dalia; ewees, ahmed a.; al-qaness, mohammed a. a.; damasevicius, robertas; elaziz, mohamed abd title: covid-19 image classification using deep features and fractional-order marine predators algorithm date: 2020-09-21 journal: sci rep doi: 10.1038/s41598-020-71294-2 sha: doc_id: 269270 cord_uid: i2odcsx7 currently, we witness the severe spread of the pandemic of the new corona virus, covid-19, which causes dangerous symptoms to humans and animals, its complications may lead to death. 
although convolutional neural networks (cnns) are considered the current state-of-the-art image classification technique, they require massive computational cost for deployment and training. in this paper, we propose an improved hybrid classification approach for covid-19 images by combining the strengths of cnns (using a powerful architecture called inception) to extract features and a swarm-based feature selection algorithm (marine predators algorithm) to select the most relevant features. the fractional-order marine predators algorithm (fo-mpa) integrates the marine predators algorithm with a robust mathematical tool named fractional-order calculus (fo). the proposed approach was evaluated on two public covid-19 x-ray datasets, on which it achieves both high performance and a reduction of computational complexity. the two datasets consist of covid-19 x-ray images collected by international cardiothoracic radiologists, researchers and others, and published on kaggle. the proposed approach successfully selected 130 and 86 out of 51 k features extracted by inception from dataset 1 and dataset 2, respectively, while improving classification accuracy at the same time. the results are the best achieved on these datasets when compared to a set of recent feature selection algorithms. by achieving 98.7% and 98.2% of classification accuracy and f-score for dataset 1, and 99.6% and 99% for dataset 2, the proposed approach outperforms several cnns and all recent works on covid-19 images. medical imaging techniques are very important for diagnosing diseases. image segmentation is a necessary image processing task that is applied to discriminate regions of interest (rois) from the surrounding area. also, image segmentation can extract critical features, including the shape of tissues and texture 5, 6 . in general, feature selection (fs) methods are widely employed in various medical imaging applications. for example, lambin et al.
7 proposed an efficient approach called radiomics to extract medical image features. they showed that analyzing image features yielded more information, which improved medical imaging. chong et al. 8 proposed an fs model, called robustness-driven fs (rdfs), to select features from lung ct images in order to classify the patterns of fibrotic interstitial lung diseases. they applied the svm classifier with and without rdfs. the evaluation showed that rdfs improved svm robustness against reconstruction kernel and slice thickness. in 9 , to classify ultrasound medical images, the authors used distance-based fs methods and a fuzzy support vector machine (fsvm). moreover, a multi-objective genetic algorithm was applied to search for the optimal feature subset. in addition, a combination of partial differential equations and deep learning was applied for medical image classification by 10 . they employed partial differential equations for extracting texture features of medical images. acharya et al. 11 applied different fs methods to classify alzheimer's disease using mri images. the shearlet transform fs method showed better performance than several other fs methods. also, in 12 , an fs method based on svm was proposed to detect alzheimer's disease from spect images. duan et al. 13 applied the gaussian mixture model (gmm) to extract features of pulmonary nodules from ct images. the optimum path forest (opf) classifier was applied to classify pulmonary nodules based on ct images. in 14 , the authors proposed an fs method based on a convolutional neural network (cnn) to detect pneumonia from lung x-ray images. afzali et al. 15 proposed an fs method based on principal component analysis and contour-based shape descriptors to detect tuberculosis from lung x-ray images. they used k-nearest neighbor (knn) to classify x-ray images collected from the montgomery dataset, and it showed good performance. zhang et al.
16 proposed a kernel feature selection method to segment brain tumors from mri images. they applied the svm classifier to new mri images to segment brain tumors automatically. to segment brain tissues from mri images, kong et al. 17 proposed an fs method using two methods, called a discriminative clustering method and the information theoretic discriminative segmentation. harikumar et al. 18 proposed an fs method based on wavelets to classify normality or abnormality of different types of medical images, such as ct, mri, ultrasound, and mammographic images. it can be concluded that fs methods have proven their advantages in different medical imaging applications 19 . furthermore, deep learning using cnn is considered one of the best choices in medical imaging applications 20 , especially classification. cnns are more appropriate for large datasets. also, they require a lot of computational resources (memory & storage) for building & training. in some cases (as in this work), the dataset is limited, so it is not sufficient for building & training a cnn. in such a case, transfer learning can be applied to benefit from the power of cnns while minimizing the computational costs 21, 22 . in transfer learning, a cnn which was previously trained on a large & diverse image dataset can be applied to perform a specific classification task 23 . several pre-trained models have won many international image classification competitions, such as vggnet 24 , resnet 25 , nasnet 26 , mobilenet 27 , inception 28 and xception 29 . however, some of the features extracted by a cnn might be irrelevant, which may negatively affect the quality of the image classification. therefore, a feature selection technique can be applied to remove those irrelevant features. among the fs methods, metaheuristic techniques have established their performance over other fs methods when applied to classify medical images.
for example, da silva et al. 30 used the genetic algorithm (ga) to develop feature selection methods for ranking the quality of medical images. they used different images of lung nodules and breast to evaluate their fs methods. evaluation outcomes showed that ga-based fs methods outperformed traditional approaches, such as filter-based fs and traditional wrapper methods. johnson et al. 31 applied the flower pollination algorithm (fpa) to select features from ct images of the lung in order to detect lung cancers. they also used the svm to classify lung ct images. the evaluation confirmed that fpa-based fs enhanced classification accuracy. kharrat and mahmoud 32 proposed an fs method based on a hybrid of simulated annealing (sa) and ga to classify brain tumors using mri. the combination of sa and ga showed better performance than the original sa and ga. narayanan et al. 33 proposed a fuzzy particle swarm optimization (pso) as an fs method to enhance the classification of ct images of emphysema. they applied a fuzzy decision tree classifier, and they found that fuzzy pso improved the classification accuracy. li et al. 34 proposed a self-adaptive bat algorithm (ba) to address two problems in lung x-ray images: rebalancing and feature selection. they compared the ba to pso, and the comparison outcomes showed that ba had better performance. dhanachandra and chanu 35 proposed a hybrid method of dynamic pso and fuzzy c-means to segment two types of medical images, mri and synthetic images. they concluded that the hybrid method outperformed the original fuzzy c-means and was less sensitive to noise. li et al. 36 proposed an fs method using a discrete artificial bee colony (abc) to improve the classification of parkinson's disease. the evaluation outcomes demonstrate that abc enhanced precision and also reduced the number of features.
in this paper, we propose a novel covid-19 x-ray classification approach, which uses a cnn as an effective tool to extract features from covid-19 x-ray images and then applies an enhanced version of the marine predators algorithm to select only the relevant features. in general, mpa is a meta-heuristic technique that simulates the behavior of the prey and predator in nature 37 . this algorithm has been tested on global optimization problems. however, it has some limitations that affect its quality. in addition, to the best of our knowledge, mpa has not been applied to any real applications yet. so, based on this motivation, we apply mpa as a feature selector for the deep features produced by the cnn (which are largely redundant), which accordingly minimizes capacity and resource consumption and can improve the classification of covid-19 x-ray images. the proposed covid-19 x-ray classification approach starts by applying a cnn (specifically, a powerful architecture called inception, which is pre-trained on the imagenet dataset) to extract the discriminant features from raw images (with no pre-processing or segmentation) from the dataset that contains positive and negative covid-19 images. then, fo-mpa is applied to select the relevant features from the images. to do so, fo-mpa randomly generates a set of solutions, each of which represents a subset of potential features. the next step is to compute the performance of each solution using its fitness value and determine which one is the best solution. thereafter, the fo-mpa parameters are applied to update the solutions of the current population. the updating operation is repeated until the stop condition is reached. the best solution reached then determines the optimal/relevant features that should be used to produce the desired output, which is evaluated via several performance measures.
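the selection loop described above can be sketched in a few lines of python. this is only a minimal illustration and not the authors' implementation: the 0.5 threshold that turns a continuous solution into a feature mask, the toy fitness function, and the random perturbation used as the update step are all assumptions introduced here, and the names `to_mask`, `toy_fitness` and `select_features` are hypothetical. the actual fo-mpa update rules would take the place of the placeholder update.

```python
import random


def to_mask(solution, threshold=0.5):
    """Turn a continuous solution in [0, 1]^d into a 0/1 feature mask (threshold is an assumption)."""
    return [1 if value > threshold else 0 for value in solution]


def toy_fitness(mask, relevant):
    """Hypothetical fitness, lower is better: missed relevant features plus a small size penalty."""
    missed = sum(1 for index in relevant if mask[index] == 0)
    return missed + 0.01 * sum(mask)


def select_features(n_features, relevant, pop_size=10, n_iter=50, seed=0):
    rng = random.Random(seed)
    # 1) randomly generate a population of candidate solutions
    population = [[rng.random() for _ in range(n_features)] for _ in range(pop_size)]
    # 2) evaluate every solution and keep the best one
    best = min(population, key=lambda s: toy_fitness(to_mask(s), relevant))
    best_fit = toy_fitness(to_mask(best), relevant)
    history = [best_fit]
    for _ in range(n_iter):
        # 3) placeholder update (NOT fo-mpa): perturb the current best solution
        population = [
            [min(1.0, max(0.0, value + rng.gauss(0.0, 0.3))) for value in best]
            for _ in range(pop_size)
        ]
        for candidate in population:
            fit = toy_fitness(to_mask(candidate), relevant)
            if fit < best_fit:
                best, best_fit = candidate, fit
        history.append(best_fit)
    # 4) the best solution defines the selected feature subset
    return to_mask(best), best_fit, history


mask, fit, history = select_features(n_features=20, relevant=[1, 5, 7])
print(sum(mask), "features selected, fitness", fit)
```

because the best solution is only replaced by a strictly better one, the recorded best fitness never increases from one iteration to the next, which mirrors the "keep the best solution until the stop condition" behaviour described in the text.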
this work is inspired by our recent work 38 , where vgg-19 was combined with a statistically enhanced salp swarm algorithm to select the best features for white blood cell leukaemia classification, and by another recently published work 39 , which combined a cnn architecture with weighted symmetric uncertainty (wsu) to select optimal features for traffic classification. such a combination of deep features and a feature selection algorithm can be efficient in several image classification tasks. the main contributions of this study are as follows:
1. propose an efficient hybrid classification approach for covid-19 using a combination of cnn and an improved swarm-based feature selection algorithm. this combination should achieve two main targets: high performance, and reduced resource consumption and storage capacity, which consequently minimizes processing time.
2. propose a novel robust optimizer called fractional-order marine predators algorithm (fo-mpa) to efficiently select features from the huge feature vector produced by the cnn.
3. test the proposed inception fractional-order marine predators algorithm (ifm) approach on two publicly available datasets that contain a number of positive and negative chest x-ray scan images of covid-19.
4. evaluate the proposed approach by performing extensive comparisons to several state-of-art feature selection algorithms, the most recent cnn architectures, and the most recent relevant works and existing classification methods of covid-19 images.
we do not present a usable clinical tool for covid-19 diagnosis, but offer a new, efficient approach to optimize deep learning-based architectures for medical image classification purposes. such methods might play a significant role as a computer-aided tool for image-based clinical diagnosis soon. the remaining sections are organized as follows: the "material and methods" section presents the methodology and the techniques used in this work, including model structure and description.
the experimental results and comparisons with other works are presented in the "results and discussion" section, while they are discussed in the "discussion" section. finally, the conclusion is given in the "conclusion" section. features extraction using convolutional neural networks. in this paper, we apply a convolutional neural network (cnn) to extract features from covid-19 x-ray images. we adopt a special type of cnn called a pre-trained model, where the network is previously trained on the imagenet dataset, which contains millions of varied images (animals, plants, transports, objects, ...) in 1000 class categories. so, transfer learning is applied by transferring weights that were already learned and stored in the structure of the pre-trained model, which is inception in this paper. in inception, there are convolutions (conv.) of different sizes, such as 5 × 5, 3 × 3 and 1 × 1. for instance, a 1 × 1 conv. is applied before larger kernels to reduce the dimension of the channels, which accordingly reduces the computation cost. pool layers are used mainly to reduce the input's size, which accelerates the computation as well. for example, a 4 × 4 matrix results in a 2 × 2 matrix after applying max pooling. there are three main parameters for pooling: filter size, stride, and max pool. in this paper, filters of size 2, a stride of 2 and 2 × 2 max pooling were adopted. the inception architecture is described in fig. 1 . the main purpose of conv. layers is to extract features from input images. in this paper, different conv. layers are applied to extract different types of features such as edges, texture, colors, and high-lighted patterns from the images. the combination of conv. and pool layers is followed by three fully connected layers, the last of which performs classification. the softmax activation function is used for this purpose because the output should be binary (positive covid-19 or negative covid-19).
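the pooling step mentioned above (a 4 × 4 input reduced to 2 × 2 with a filter of size 2 and a stride of 2) can be illustrated with a small pure-python sketch. the function name `max_pool2d` and the input values are hypothetical and serve only as an example; in practice the pooling is done inside the deep learning framework.

```python
def max_pool2d(matrix, size=2, stride=2):
    """Max pooling over a 2-d list with a square window and no padding."""
    rows, cols = len(matrix), len(matrix[0])
    pooled = []
    for r in range(0, rows - size + 1, stride):
        pooled_row = []
        for c in range(0, cols - size + 1, stride):
            # collect the size x size window whose top-left corner is (r, c)
            window = [matrix[r + i][c + j] for i in range(size) for j in range(size)]
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled


x = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 0],
    [1, 8, 3, 4],
]
print(max_pool2d(x))  # [[6, 4], [8, 9]]
```

each output value is the maximum of one non-overlapping 2 × 2 block, so the spatial size is halved in both directions while the strongest activation of each block is kept.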
inception's layer details and layer parameters are given in table 1 . as seen in table 1 , we keep the last concatenation layer, which contains the extracted features, and remove the top layers such as the flatten, drop out and dense layers, the latter of which performs classification (named the fc layer). we have used the rmsprop optimizer for weight updates, the cross entropy loss function, and a learning rate of 0.0001. in this paper, inception is applied as a feature extractor, where the input image shape is (229, 229, 3). since its structure consists of some parallel paths, all the paths use padding of 1 pixel to preserve the same height & width for the inputs and the outputs. one of the drawbacks of pre-trained models, such as inception, is that the architecture has large memory requirements as well as storage capacity (92 mb), which makes deployment an exhausting and tiresome task. the shape of the output from inception is (5, 5, 2048), which represents a feature vector of size 51200. so some statistical operations have been added to exclude irrelevant and noisy features and to make the approach more computationally efficient and stable; they are summarized as follows:
• chi-square is applied to remove the features which have high correlation values by computing the dependence between them. it is calculated between each feature for all classes, as in eq. (1):
$$\chi^2 = \sum_{k} \frac{(O_k - E_k)^2}{E_k} \quad (1)$$
where $O_k$ and $E_k$ refer to the actual and the expected feature value, respectively. in this paper, after applying chi-square, the feature vector is minimized for both datasets from 51200 to 2000.
• tree based classifiers are the most popular methods to calculate feature importance to improve the classification, since they offer high accuracy, robustness, and simplicity 38 . for each decision tree, node importance is calculated using gini importance; eq. (2) is calculated for a node with two child nodes:
$$ni_j = w_j C_j - w_{left(j)} C_{left(j)} - w_{right(j)} C_{right(j)} \quad (2)$$
where $ni_j$ is the importance of node j, $w_j$ refers to the weighted number of samples reaching node j, and $C_j$ is the impurity value of node j. left(j) and right(j) are the child nodes from the left split and the right split on node j, respectively. the importance of each feature is then calculated as in eq. (3):
$$fi_i = \frac{\sum_{j:\ \text{node } j \text{ splits on feature } i} ni_j}{\sum_{k \in \text{all nodes}} ni_k} \quad (3)$$
where $fi_i$ represents the importance of feature i, while $ni_j$ refers to the importance of node j. these values are normalized between 0 and 1 by dividing by the sum of all feature importance values, as in eq. (4):
$$normfi_i = \frac{fi_i}{\sum_{j \in \text{all features}} fi_j} \quad (4)$$
finally, the sum of the feature's importance value on each tree is calculated and then divided by the total number of trees, as in eq. (5):
$$REfi_i = \frac{\sum_{j \in \text{all trees}} normfi_{ij}}{T} \quad (5)$$
where $REfi_i$ represents the importance of feature i calculated from all trees, $normfi_{ij}$ is the normalized feature importance for feature i in tree j, and T is the total number of trees. after applying this technique, the feature vector is minimized from 2000 to 459 and from 2000 to 462 for dataset 1 and dataset 2, respectively. fractional-order calculus (fc) has gained the interest of many researchers in different fields, not only in modeling but also in developing optimization algorithms. the memory properties of fc make it applicable to fields that require non-locality and memory effects. fc provides a clear interpretation of the memory and hereditary features of a process. accordingly, fc is an efficient tool for enhancing the performance of meta-heuristic algorithms by considering the memory perspective while updating the solutions. one of the well-known definitions of fc is the grunwald-letnikov (gl) definition, which can be mathematically formulated as below 40 :
$$D^{\delta}(U(t)) = \lim_{h \to 0} \frac{1}{h^{\delta}} \sum_{k=0}^{\infty} (-1)^k \binom{\delta}{k} U(t - kh) \quad (6)$$
where
$$\binom{\delta}{k} = \frac{\Gamma(\delta + 1)}{\Gamma(k + 1)\,\Gamma(\delta - k + 1)} \quad (7)$$
and where $D^{\delta}(U(t))$ refers to the gl fractional derivative of order δ and $\Gamma(t)$ indicates the gamma function.
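as a small worked illustration of the gl definition above, the signed weights $(-1)^k \Gamma(\delta+1) / (\Gamma(k+1)\Gamma(\delta-k+1))$ that multiply the past values can be computed as follows. this is a sketch under stated assumptions: the function names `gl_binomial` and `gl_weights` are hypothetical, and the recursive form is a standard equivalent of the gamma-function ratio that is used here only to avoid the poles of the gamma function at non-positive integers.

```python
import math


def gl_binomial(delta, k):
    """Generalized binomial coefficient Gamma(delta+1) / (Gamma(k+1) * Gamma(delta-k+1)).

    Usable as written only when delta - k + 1 is not zero or a negative integer,
    where the gamma function has poles.
    """
    return math.gamma(delta + 1) / (math.gamma(k + 1) * math.gamma(delta - k + 1))


def gl_weights(delta, m):
    """Signed weights (-1)^k * C(delta, k) for k = 0..m, built recursively."""
    weights = [1.0]
    for k in range(1, m + 1):
        # C(delta, k) = C(delta, k-1) * (delta - k + 1) / k, and the sign flips at each step
        weights.append(-weights[-1] * (delta - k + 1) / k)
    return weights


print(gl_weights(delta=0.5, m=4))  # [1.0, -0.5, -0.125, -0.0625, -0.0390625]
```

for a fractional order such as δ = 0.5 every past value receives a non-zero, slowly decaying weight, which is the memory effect referred to in the text; for δ = 1 only the first two weights are non-zero, so the operator reduces to a plain first-order difference.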
the gl in the discrete-time form can be modeled as below:
$$D^{\delta}[x(t)] = \frac{1}{T^{\delta}} \sum_{k=0}^{m} \frac{(-1)^k\, \Gamma(\delta + 1)\, x(t - kT)}{\Gamma(k + 1)\,\Gamma(\delta - k + 1)} \quad (8)$$
where T is the sampling period, and m is the length of the memory terms (memory window). the δ symbol refers to the derivative order coefficient. for the special case of δ = 1 , the definition of eq. (8) reduces to a first-order difference, eq. (9), where $D^{1}[x(t)]$ represents the difference between two consecutive events. marine predators algorithm. the marine predators algorithm (mpa) is a recently developed meta-heuristic algorithm that emulates the relation between the prey and predator in nature 37 . mpa simulates the main aim of most creatures, which is searching for food, where a predator continuously searches for food, as does the prey. inspired by this concept, faramarzi et al. 37 developed the mpa algorithm by considering both the predator and the prey as solutions. the mpa starts with the initialization phase and then passes through three other phases according to the velocity ratio between the prey and the predator.
• initialization phase: this phase provides a random set of solutions for both the prey and predator via eq. (10), in which each solution is drawn as $lower + rand_1 \times (upper - lower)$, where lower and upper are the lower and upper boundaries of the search space and $rand_1$ is a random vector in the interval (0, 1). according to eq. (10), the initial locations of the prey and predator are defined in matrix form, where the elite matrix refers to the fittest predators.
• stage 1: after the initialization, the exploration phase is implemented to discover the search space. in mpa, this stage covers the first third of the total iterations, i.e., $t < \frac{1}{3} t_{max}$, during which the prey position is updated. in the update equations, r ∈ [0, 1] is a random vector drawn from a uniform distribution and p = 0.5 is a constant number. the symbol $R_B$ refers to brownian motion, and ⊗ indicates element-wise multiplication.
• stage 2: the prey/predator in this stage begin exploiting the best location detected for their food. stage 2 is executed in the second third of the total number of iterations, when $\frac{1}{3} t_{max} < t < \frac{2}{3} t_{max}$. faramarzi et al. 37 divided the agents into two halves and formulated eqs. (14)-(15) to emulate the motion of the first half of the population (prey) and eqs. (18)-(19) for the second half (predator). in these equations, $R_L$ contains random numbers that follow the lévy distribution. eqs. (14)-(15) are applied to the first half of the agents, which represents the exploitation, while the second half of the agents follow eqs. (18)-(19), where cf is the parameter that controls the step size of movement for the predator.
• stage 3: this stage is executed in the last third of the iterations ( $t > \frac{2}{3} t_{max}$ ).
• eddy formation and fish aggregating devices' effect: faramarzi et al. 37 considered the external impacts from the environment, such as eddy formation or fish aggregating devices (fads) effects, to avoid local optimum solutions. in eq. (20), which implements this stage, fad = 0.2 , and w is a binary solution (0 or 1) that corresponds to random solutions: if the random solution is less than 0.2, it is converted to 0, while it becomes 1 when it is greater than 0.2. the symbol r ∈ [0, 1] represents a random number. r 1 and r 2 are random indices of the prey.
• marine memory: this is the main feature of the marine predators; it helps in catching the optimal solution very fast and avoiding local solutions. faramarzi et al. 37 implement this feature by saving the best solutions of the prior iteration and comparing them with the current ones; the solutions are modified based on the best one during the comparison stage.
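the iteration schedule of the three stages and the binary vector used in the fads step can be sketched as below. this is an illustrative sketch, not the reference mpa code: the function names `mpa_stage` and `fads_binary_vector` are hypothetical, and assigning the exact boundary iterations ( $t = \frac{1}{3} t_{max}$ and $t = \frac{2}{3} t_{max}$ ) to the later stage is an assumption, since the text only gives strict inequalities.

```python
import random


def mpa_stage(t, t_max):
    """Return the MPA stage (1, 2 or 3) that applies at iteration t out of t_max."""
    if t < t_max / 3:
        return 1  # exploration, first third of the iterations
    if t < 2 * t_max / 3:
        return 2  # mixed stage, second third
    return 3  # last third


def fads_binary_vector(dim, fad=0.2, rng=random):
    """Binary vector w: 0 where a uniform draw is below fad, 1 otherwise."""
    return [0 if rng.random() < fad else 1 for _ in range(dim)]


t_max = 300
print([mpa_stage(t, t_max) for t in (0, 99, 100, 199, 200, 299)])  # [1, 1, 2, 2, 3, 3]
print(fads_binary_vector(10, rng=random.Random(0)))
```

with fad = 0.2, roughly one entry in five of the binary vector is expected to be 0 on average.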
recently, combining the fractional calculus tool with meta-heuristics has opened new doors in providing robust and reliable variants 41 . motivated by this, we utilize the fc concept with the mpa algorithm to boost the second step of the standard version of the algorithm. hence, the fc memory is applied while updating the prey location in the second step of the algorithm to enhance the exploitation stage. moreover, the r b parameter has been changed to depend on the weibull distribution, as described below. taking into account the previously mentioned relation in eq. (23), the general formulation for the solutions of fo-mpa based on the fc memory perspective can be derived. from this formulation, it can be seen that the motion of the prey depends on some terms from the previous solutions with a memory length of m, as depicted in fig. 2 (left) . accounting for the first four previous events ( m = 4 ) from the memory data with derivative order δ , the position of the prey is modified as in eq. (24). • second: adjusting the r b random parameter based on the weibull distribution. for the exploration stage, the weibull distribution has been applied rather than brownian motion to boost the performance of the predator in stage 2 and the prey velocity in stage 1. here, k and ζ are the scale and shape parameters. the weibull distribution is a heavy-tailed distribution, as presented in fig. 2 (right) . in the current work, the values of k and ζ are both set to 2. our proposed approach is called inception fractional-order marine predators algorithm (ifm), where we combine inception (i) with fractional-order marine predators algorithm (fo-mpa). the proposed ifm approach is summarized as follows:
1. extract deep features from inception, where about 51 k features were extracted.
2. initialize solutions for the prey and predator.
the prey follows the weibull distribution while discovering the search space to detect potential locations of its food.
3. the predator tries to catch the prey while the prey exploits the locations of its food. the predator uses the weibull distribution to improve the exploration capability. meanwhile, the prey moves effectively based on its memory of the previous events to catch its food, as presented in eq. (24).
4. finally, the predator follows the levy flight distribution to exploit its prey location.
all of the above stages are repeated until the termination criterion is satisfied. these datasets contain hundreds of frontal view x-rays and are considered the largest public resource for covid-19 image data. they were manually aggregated from various web based repositories into a machine learning (ml) friendly format with accompanying data loader code. frontal and lateral view imagery and metadata were also collected, such as the time since first symptoms, intensive care unit (icu) status, survival status, intubation status, or hospital location. both datasets share some characteristics regarding the collecting sources. for both datasets, the covid-19 images were collected from patients of both genders with ages ranging from 40 to 84. it is also noted that both datasets contain a small number of positive covid-19 images, and to the best of our knowledge, there is no other sufficient available published dataset for covid-19. table 2 shows some samples from the two datasets. table 2 depicts the variation in morphology of the image, lighting, structure, black spaces, shape, and zoom level within the same dataset, as well as with the other dataset. the following performance measures are used:
• best accuracy
• best fitness value
• worst fitness value
• average of fitness value
• standard deviation of fitness value
where r is the number of runs and fit i denotes a fitness function value.
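the run-level measures listed above can be computed with python's standard library as sketched below. this is a minimal sketch with assumptions: the fitness is treated as a quantity to be minimized (so the best fitness is the smallest value), the sample standard deviation (division by r - 1) is used, the function name `summarize_runs` is hypothetical, and the numbers are made-up values for r = 4 runs, not results from the paper.

```python
import statistics


def summarize_runs(fitness_values, accuracies):
    """Summarize r independent runs of a feature selection algorithm."""
    return {
        "best_accuracy": max(accuracies),
        "best_fitness": min(fitness_values),   # assumption: lower fitness is better
        "worst_fitness": max(fitness_values),
        "mean_fitness": statistics.mean(fitness_values),
        "std_fitness": statistics.stdev(fitness_values),  # assumption: sample std (r - 1)
    }


# hypothetical values for r = 4 runs, for illustration only
fitness_per_run = [0.05, 0.07, 0.06, 0.08]
accuracy_per_run = [0.97, 0.95, 0.96, 0.94]
summary = summarize_runs(fitness_per_run, accuracy_per_run)
for name, value in summary.items():
    print(name, round(value, 4))
```

a small standard deviation indicates that an algorithm returns similar fitness values across runs, which is how stability is interpreted in the comparison of algorithms.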
google colaboratory 46 , commonly referred to as "google colab", is a research project for prototyping machine learning models on powerful hardware options such as gpus and tpus. in this paper, we used tpus for powerful computation, which is more appropriate for cnns. the model was developed using the keras library 47 with the tensorflow backend 48 . performance of the proposed approach. as inception examines all x-ray images over and over again in each epoch during the training, these rapid ups and downs are slowly minimized in the later part of the training. after feature extraction, we applied fo-mpa to select the most significant features. in this subsection, the results of fo-mpa are compared against the most popular and recent feature selection algorithms, such as whale optimization algorithm (woa) 49 , henry gas solubility optimization (hgso) 50 , sine cosine algorithm (sca), slime mould algorithm (sma) 51 , particle swarm optimization (pso), grey wolf optimization (gwo) 52 , harris hawks optimization (hho) 53 , genetic algorithm (ga), and basic mpa. the f-score is computed as in eq. (30):
$$F_{score} = 2 \times \frac{specificity \times sensitivity}{specificity + sensitivity} \quad (30)$$
in this paper, each feature selection algorithm was applied to the feature vector produced by inception; the results are listed in tables 3 and 4. table 3 shows the numerical results of the feature selection phase for both datasets. four measures for the proposed method and the compared algorithms are listed. as seen in table 3 , on dataset 1, fo-mpa outperformed the other algorithms in the mean of the fitness value, as it achieved the smallest average fitness function value, followed by sma, hho, hgso, sca, bgwo, mpa, and bpso, respectively, whereas sga and woa showed the worst results.
table 3 . results of the feature selection phase based on fitness function. highest results are in bold.
the results of the max measure (as in eq.
(33)) showed that fo-mpa also achieved the best value of the fitness function compared to the others. sma is in second place, while hgso, sca, and hho came in third to fifth place, respectively. according to the best measure, fo-mpa performed similarly to the hho algorithm, followed by sma, hgso, and sca, respectively. although the performance of mpa and bgwo was similar, the performance of sga and woa was the worst in both the max and min measures. generally, the most stable algorithms on dataset 1 are woa, sca, hgso, fo-mpa, and sga, respectively. however, woa showed the worst performance in these measures, which means that if it is run under the same conditions several times, the same results will be obtained. for dataset 2, fo-mpa showed acceptable (not the best) performance, as it achieved results close to those of the first and second ranked algorithms (i.e., mpa and sma) on the mean, best, max, and std measures. also, the woa algorithm showed good results in all measures, unlike on dataset 1, from which it can be concluded that no single algorithm can solve all kinds of problems. the worst algorithm was bpso. for further analysis of the feature selection algorithms based on the number of selected features (s.f) and time consumption, fig. 4 and table 4 list these results for all algorithms. regarding time consumption, as shown in fig. 4a , sma was the fastest algorithm, followed by bpso, fo-mpa, and hho, respectively, while mpa was the slowest algorithm. also, as seen in fig. 4b , the fo-mpa algorithm selected fewer features than the other algorithms, as it selected 130 and 86 features from dataset 1 and dataset 2, respectively. hgso was ranked second with 146 and 87 selected features from dataset 1 and dataset 2, respectively. the largest numbers of features were selected by sma and sga, respectively.
the convergence behaviour of fo-mpa was evaluated over 25 independent runs and compared to other algorithms, where the x-axis and the y-axis represent the iterations and the fitness value, respectively. figure 5 illustrates the convergence curves for fo-mpa and the other algorithms on both datasets. figure 5 shows that fo-mpa converges more efficiently and faster than the other optimization algorithms on both datasets, whereas the slowest and least sufficient convergence was reported by both sga and woa on dataset 1 and by sga on dataset 2. to further analyze the proposed algorithm, we evaluate the features selected by fo-mpa by performing classification. in this experiment, the features selected by fo-mpa were classified using knn. table 4 shows the classification accuracy of fo-mpa compared to other feature selection algorithms, where the best, mean, and std of the classification accuracy were calculated for each one, besides time consumption and the number of selected features (sf). in table 4 , for dataset 1, the proposed fo-mpa approach achieved the highest accuracy in the best and mean measures, as it reached 98.7% and 97.2% of correctly classified samples, respectively. mpa, bpso, sca, and sga obtained almost the same accuracy, followed by bgwo, woa, and sma. the lowest accuracy was obtained by hgso in both measures. based on the standard deviation measure (std), the most stable algorithms were sca, sga, bpso, and bgwo, respectively, whereas fo-mpa, mpa, hgso, and woa showed similar std results. hgso was also ranked last. on dataset 2, fo-mpa also reported the highest classification accuracy in the best and mean measures, followed by bpso. the classification accuracies of mpa, woa, sca, and sga are almost the same, whereas the worst one was the sma algorithm. besides, all algorithms showed the same statistical stability in the std measure, except for hho and hgso.
generally, the proposed fo-mpa approach showed satisfying performance in both the feature selection ratio and the classification rate. moreover, from table 4 , it can be seen that the proposed fo-mpa provides better results in terms of f-score, as it has the highest values on dataset 1 and dataset 2, which are 0.9821 and 0.99079, respectively. comparison with other cnn architectures. in this subsection, the performance of the proposed covid-19 classification approach is compared to other cnn architectures. it is noted that all feature vectors produced by the cnns used in this paper are more than 300 times bigger than the feature set produced by fo-mpa. for example, as our input image has the shape 224 × 224 × 3 , nasnet 26 produces 487 k features, resnet 25 and xception 29 produce about 100 k features and mobilenet 27 produces 50 k features, while fo-mpa produces 130 and 86 features for dataset 1 and dataset 2, respectively. figure 6 shows a comparison between our fo-mpa approach and other cnn architectures. from fig. 6 (left), for dataset 1, it can be seen that our proposed fo-mpa approach outperforms other cnn models like vggnet, xception, inception, mobilenet, nasnet, and resnet. it also shows that fo-mpa can select the smallest subset of features, which reflects positively on performance and, accordingly, on efficient usage of memory and lower resource consumption. on the second dataset, dataset 2 (fig. 6, right) , our approach still provides an overall accuracy of 99.68%, putting it first with a slight advantage over mobilenet (99.67 %). comparison with related works. in this subsection, a comparison with relevant works is discussed. figure 7 shows the most recent published works, as in 54-57 and 44 , on both dataset 1 and dataset 2.
in 54 , the alexnet pre-trained network was used to extract deep features, and pca was then applied to select the best features by eliminating highly correlated features. according to 54 , the latter step reduces the memory requirements and improves the efficiency of the framework. meanwhile, 55 used different cnn structures; it was clear that vgg19 and mobilenet achieved the best performance over the other cnns. also, in 58 a new cnn architecture called efficientnet was proposed, where more blocks were added on top of the model after normalizing the image pixel intensities to the range (0 to 1). some image transformations were also applied, such as rotation, horizontal flip, and scaling. in 57 , the resnet-50 cnn was applied after applying horizontal flipping, random rotation, random zooming, random lighting, and random warping to the raw images. as seen in fig. 7 , most works are pre-prints for two main reasons: covid-19 is the most recent and trending topic, and there are no sufficient datasets that can be used for reliable results. however, the proposed fo-mpa approach has an advantage in performance compared to the other works. also, the other works do not give further statistics about their models' complexity and the number of features produced, unlike our approach, which extracts the most informative features (130 and 86 features for dataset 1 and dataset 2), implying faster computation time and, accordingly, lower resource consumption. in 59 , which is one of the most recent published works on x-ray covid-19, a combination of you only look once (yolo), which is basically a real-time object detection system, and darknet as a classifier was proposed. they achieved 98.08% and 96.51% of accuracy and f-score, respectively, compared to our approach with 98.77% and 98.2% for accuracy and f-score, respectively. however, no feature selection was applied to select the best features or to reduce model complexity.
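two of the preprocessing steps mentioned for the related works, scaling pixel intensities to the range 0 to 1 and horizontal flipping, can be sketched as follows. this is a plain-python illustration on nested lists, assuming 8-bit grayscale pixels; it does not reproduce the pipeline of any cited work.

```python
def normalize(image, max_value=255):
    """Scale pixel intensities to the range [0, 1] (assumes 8-bit input)."""
    return [[px / max_value for px in row] for row in image]


def horizontal_flip(image):
    """Mirror the image left to right by reversing every row."""
    return [row[::-1] for row in image]


# A tiny 2 x 3 "image" used only for illustration.
raw = [[0, 51, 255],
       [102, 204, 0]]

scaled = normalize(raw)
augmented = horizontal_flip(scaled)
```

flipping is applied after scaling here only for readability; the two operations commute, so the order does not change the result.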
the proposed imf approach successfully achieves two important targets: selecting a small number of features and reaching high accuracy. reducing the size of the feature set from about 51 k, as extracted by the deep neural network (inception), to 128.5 and 86 in dataset 1 and dataset 2, respectively, after applying the fo-mpa algorithm while increasing the general performance can be considered a good achievement as a machine learning goal. besides, the statistical operations used improve the performance of the fo-mpa algorithm because they support the algorithm in selecting only the most important and relevant features. this also contributes to minimizing resource consumption, which consequently reduces the processing time. in addition, the good results achieved by fo-mpa against the other algorithms can be seen as an advantage of fo-mpa, in which a balance between the exploration and exploitation stages and an escape from local optima were achieved. as a result, the obtained outcomes outperformed previous works in terms of the model's general performance measure. furthermore, using a few hundred images to build and then train inception is considered challenging because deep neural networks need large numbers of images to work efficiently and produce efficient features. however, the proposed imf approach achieved the best results among the compared algorithms in the least time. one of the main disadvantages of our approach is that it is built within two different environments. the first is based on python, where the deep neural network architecture (inception) was built and the feature extraction part was performed. the second is based on matlab, where the feature selection part (fo-mpa algorithm) was performed. so, there might sometimes be conflicts regarding the feature vector file types or issues related to storage capacity and file transfer. computational image analysis techniques play a vital role in disease treatment and diagnosis.
taking into consideration the current spread of covid-19, we believe that these techniques can be applied as a computer-aided tool for diagnosing this virus. therefore, in this paper, we propose a hybrid classification approach for covid-19. it is based on using a deep convolutional neural network (inception) for extracting features from covid-19 images, then filtering the resulting features using the marine predators algorithm (mpa) enhanced by fractional-order calculus (fo). the proposed imf approach is employed to select only relevant features and eliminate unnecessary ones. extensive evaluation experiments were carried out with a collection of two public x-ray image datasets. extensive comparisons were made between fo-mpa and several feature selection algorithms, including sma, hho, hgso, woa, sca, bgwo, sga, and bpso, besides the classic mpa. the results showed that the proposed approach performed better in both classification accuracy and the number of extracted features, which positively affects resource consumption and storage efficiency. the results are the best achieved compared to other cnn architectures and all published works on the same datasets. according to these promising results, the proposed model, which combines a cnn as a feature extractor and fo-mpa as a feature selector, could be useful and might be successfully applied to other image classification tasks. all data used in this paper are available online in the repositories [https://github.com/ieee8023/covid-chestxray-dataset], [https://stanfordmlgroup.github.io/projects/chexnet], [https://www.kaggle.com/paultimothymooney/chest-xray-pneumonia] and [https://www.sirm.org/en/category/articles/covid-19-database/]. the code of the proposed approach is also available via the following link [https://drive.google.com/file/d/1-ok-eeegdcmcnykh364ikak3opmqa9rvasx/view?usp=sharing].
isolation and characterization of a bat sars-like coronavirus that uses the ace2 receptor optimization method for forecasting confirmed cases of covid-19 in china transmission scenarios for middle east respiratory syndrome coronavirus (mers-cov) and how to tell them apart use of chest ct in combination with negative rt-pcr assay for the 2019 novel coronavirus but high clinical suspicion medical image segmentation using fruit fly optimization and density peaks clustering brain tumor segmentation with deep neural networks radiomics: extracting more information from medical images using advanced feature analysis robustness-driven feature selection in classification of fibrotic interstitial lung disease patterns in computed tomography using 3d texture features classification of ultrasound medical images using distance based feature selection and fuzzy-svm detection of lung cancer on chest ct images using minimum redundancy maximum relevance feature selection method with convolutional neural networks automated detection of alzheimers disease using brain mri images-a study with various feature extraction techniques svm feature selection for classification of spect images of alzheimers disease using spatial information feature selection based on gaussian mixture model clustering for the classification of pulmonary nodules based on computed tomography a deep feature learning model for pneumonia detection applying a combination of mrmr feature selection and machine learning models feature selection for contour-based tuberculosis detection from chest x-ray images kernel feature selection to fuse multi-spectral mri images for brain tumor segmentation discriminative clustering and feature selection for brain mri segmentation performance analysis of neural networks for classification of medical images with wavelets as a feature extractor feature based nonrigid brain mr image registration with symmetric alpha stable filters a survey on deep learning in medical image analysis cnn 
features off-the-shelf: an astounding baseline for recognition decaf: a deep convolutional activation feature for generic visual recognition deep cnns for microscopic image classification by exploiting transfer learning and feature concatenation very deep convolutional networks for large-scale image recognition deep residual learning for image recognition automl for large scale image classification and object detection mobilenets: efficient convolutional neural networks for mobile vision applications going deeper with convolutions xception: deep learning with depthwise separable convolutions improving the ranking quality of medical image retrieval using a genetic feature selection method feature selection using flower pollination optimization to diagnose lung cancer from ct images feature selection based on hybrid optimization for magnetic resonance imaging brain tumor classification and segmentation emphysema medical image classification using fuzzy decision tree with fuzzy particle swarm optimization clustering dual feature selection and rebalancing strategy using metaheuristic optimization algorithms in x-ray image datasets an image segmentation approach based on fuzzy c-means and dynamic particle swarm optimization algorithm diagnosis of parkinson's disease with a hybrid feature selection algorithm based on a discrete artificial bee colony marine predators algorithm: a nature-inspired metaheuristic efficient classification of white blood cell leukemia with improved swarm optimization of deep features an efficient feature generation approach based on deep learning and feature selection techniques for traffic classification fractional differential equations: an introduction to fractional derivatives, fdifferential equations, to methods of their solution and some of their applications fractional-order cuckoo search algorithm for parameter identification of the fractional-order chaotic, chaotic with noise and hyper-chaotic financial systems covid-19 image data 
collection radiologist-level pneumonia detection on chest x-rays with deep learning can ai help in screening viral and covid-19 pneumonia? building machine learning and deep learning models on google cloud platform tensorflow: large-scale machine learning on heterogeneous systems the whale optimization algorithm henry gas solubility optimization: a novel physicsbased algorithm slime mould algorithm: a new method for stochastic optimization harris hawks optimization: algorithm and applications classification of covid-19 in chest x-ray images using detrac deep convolutional neural network covid-19: automatic detection from x-ray images utilizing transfer learning with convolutional neural networks towards an efficient deep learning model for covid-19 patterns detection in x-ray images the diagnostic evaluation of convolutional neural network (cnn) for the assessment of chest x-ray of patients jcs: an explainable covid-19 diagnosis system by joint classification and segmentation automated detection of covid-19 cases using deep neural networks with x-ray images the authors declare no competing interests. correspondence and requests for materials should be addressed to r.d.reprints and permissions information is available at www.nature.com/reprints.publisher's note springer nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. license, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the creative commons license, and indicate if changes were made. the images or other third party material in this article are included in the article's creative commons license, unless indicated otherwise in a credit line to the material. 
if material is not included in the article's creative commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. to view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/.
key: cord-255884-0qqg10y4 authors: chiroma, h.; ezugwu, a. e.; jauro, f.; al-garadi, m. a.; abdullahi, i. n.; shuib, l. title: early survey with bibliometric analysis on machine learning approaches in controlling coronavirus date: 2020-11-05 journal: nan doi: 10.1101/2020.11.04.20225698 sha: doc_id: 255884 cord_uid: 0qqg10y4
background and objective: the covid-19 pandemic has caused severe mortality across the globe with the usa as the current epicenter, although the initial outbreak was in wuhan, china. many studies successfully applied machine learning to fight the covid-19 pandemic from a different perspective. to the best of the authors' knowledge, no comprehensive survey with bibliometric analysis has been conducted on the adoption of machine learning for fighting covid-19. therefore, the main goal of this study is to bridge this gap by carrying out an in-depth survey with bibliometric analysis on the adoption of machine-learning-based technologies to fight the covid-19 pandemic from a different perspective, including an extensive systematic literature review and a bibliometric analysis. methods: a literature survey methodology is applied to retrieve data from academic databases, and a bibliometric technique is subsequently employed to analyze the accessed records. moreover, the concise summary, sources of covid-19 datasets, taxonomy, synthesis, and analysis are presented. the convolutional neural network (cnn) is found mainly utilized in developing covid-19 diagnosis and prognosis tools, mostly from chest x-ray and chest computed tomography (ct) scan images.
similarly, a bibliometric analysis of machine-learning-based covid-19-related publications in scopus and web of science citation indexes is performed. finally, a new perspective is proposed to solve the challenges identified as directions for future research. we believe that the survey with bibliometric analysis can help researchers easily detect areas that require further development and identify potential collaborators. results: the findings in this study reveal that machine-learning-based covid-19 diagnostic tools received the most considerable attention from researchers. specifically, the analyses of the results show that energy and resources are more dispensed toward covid-19 automated diagnostic tools, while covid-19 drugs and vaccine development remain grossly underexploited. moreover, the machine-learning-based algorithm predominantly utilized by researchers in developing the diagnostic tool is cnn mainly from x-rays and ct scan images. conclusions: the challenges hindering practical work on the application of machine-learning-based technologies to fight covid-19 and a new perspective to solve the identified problems are presented in this study. we believe that the presented survey with bibliometric analysis can help researchers determine areas that need further development and identify potential collaborators at author, country, and institutional levels to advance research in the focused area of machine learning application for disease control. study. similar bibliometric analyses have been reported in the literature as presented by chahrour et al. (2020) , hossain (2020) , and lou et al. (2020) . 
however, these existing analyses differ from the current bibliometric analysis in this study because the current analysis result focuses on the application of machine learning techniques to combat covid-19 pandemic as opposed to various literature reporting general medical practices on in this study, we propose to conduct a dedicated comprehensive survey on the adoption of machine learning to fight the covid-19 pandemic from a different perspective, including an extensive literature review and a bibliometric analysis. to the best of our knowledge, this study is the first comprehensive analysis of research output focusing on several possible applications of machine learning techniques for mitigating the worldwide spread of the ongoing covid-19 pandemic. we are mindful that other publications might not be captured in our scope because the current study is only limited to the eight academic databases mentioned in table 1 . we are also very dependent on the indexing of the databases used, which is akin to any other bibliometric research study. other sections of the study are organized as follows: section 2 presents the methodology for the survey. section 3 presents the rudiments of the major machine learning algorithms used in fighting the covid-19 pandemic. section 4 presents the adoption of machine learning to fight covid-19. section 5 unravels the different sources of covid-19 datasets. section 6 discusses the survey and bibliometric analysis. section 7 unveils challenges and future research directions before the conclusion in section 8. figure 1 presents the visual structure of the survey paper, which is similar to the work in (mohammadi et al., 2018) . inclusion/exclusion criteria were set up based on the research aim to decide which articles are eligible for the next review stage. articles that meet the inclusion criteria were considered relevant for the research, and those that do not meet the inclusion criteria were excluded. 
the set inclusion/exclusion criteria are provided in table 2 .
table 2. inclusion and exclusion criteria.
inclusion criteria | exclusion criteria
the review only focuses on covid-19. | other viral infections and health issues were not considered relevant in the survey.
only articles that applied machine learning techniques to fight covid-19 were considered. | articles using techniques other than machine learning techniques were excluded.
articles/conference papers published by prominent and indexed journals were included. | articles/conference papers published by nonindexed journals were excluded.
 | articles uploaded as preprints to preprint servers such as biorxiv, medrxiv, arxiv, etc. without peer review were excluded.
only articles written in the english language were considered for inclusion. | articles written in languages other than english were excluded.
this preprint (this version posted november 5, 2020) is made available under a cc-by-nd 4.0 international license; the copyright holder is the author/funder, who has granted medrxiv a license to display the preprint in perpetuity.
article selection for this research followed a three-stage analysis. the first analysis stage considered only the titles and abstracts of the papers to extract relevant papers. the second analysis stage considered the abstract, introduction, and conclusion to refine the selection made in the first stage. at the third and final analysis stage, papers were read thoroughly, and a threshold was set to rate the quality of papers in terms of their relevance to the research. a paper was selected if it reported an empirical application of machine learning to fight covid-19, similar to rodriguez-morales et al. (2020) . articles that met the threshold value were selected, and those below the threshold were dropped. figure 2 shows the total number of papers obtained from the academic databases and the final number of papers considered for the research after applying all the extraction criteria.
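the screening logic described above (inclusion/exclusion criteria followed by a relevance threshold) can be sketched as a simple filter. everything in this sketch is hypothetical: the record fields, the numeric relevance score, and the threshold value are assumptions for illustration, since the survey does not specify how the rating was encoded.

```python
from dataclasses import dataclass


@dataclass
class Article:
    title: str
    about_covid19: bool
    uses_machine_learning: bool
    indexed_venue: bool
    peer_reviewed: bool
    language: str
    relevance: float  # hypothetical rating assigned after full reading


def meets_criteria(article):
    """Apply the inclusion/exclusion criteria of table 2."""
    return (
        article.about_covid19
        and article.uses_machine_learning
        and article.indexed_venue
        and article.peer_reviewed
        and article.language == "english"
    )


def select_articles(articles, threshold=0.5):
    """Keep articles that pass the criteria and reach the relevance threshold."""
    return [
        a for a in articles
        if meets_criteria(a) and a.relevance >= threshold
    ]


candidates = [
    Article("cnn on chest x-rays", True, True, True, True, "english", 0.9),
    Article("unreviewed preprint", True, True, True, False, "english", 0.9),
    Article("influenza forecasting", False, True, True, True, "english", 0.8),
    Article("weakly related study", True, True, True, True, "english", 0.2),
]
selected = select_articles(candidates)
```

in the survey the three stages are applied to progressively larger parts of each paper; the sketch collapses them into one pass for brevity.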
vosviewer software was used to present a bibliometric analysis of the existing literature on covid-19. vosviewer software is a tool for constructing and visualizing bibliometric maps of items, such as journals, research, or individual publications. these maps can be created based on citation, bibliographic coupling, co-citation, or co-authorship relations. the bibliometric analysis software also offers text mining functionality that can be used to construct and visualize co-occurrence networks of important terms extracted from a body of scientific literature (see www.vosviewer.com). we only used 1,178 publications with the keyword "novel coronavirus" and 98 publications with the keyword "covid-19 and artificial intelligence" that were retrieved from scopus and web of science academic databases for the bibliometric analysis presented in this study. only 57 document results were extracted using the keyword "covid-19 and machine learning" from the same academic database.
deployed to enable continual learning of tasks by resetting the state of the lstm (greff et al., 2016) . its updated architecture consists of multiple lstm units with each unit having an input gate, a forget gate, an output gate, and a memory cell. sak et al. (2014) described the underlying architecture of lstm as consisting of memory blocks in its hidden layer. the memory blocks have memory cells for the storage of the temporal state of the network with additional units, known as gates, to supervise the information flow. a memory cell has an input gate that manages the inflow of input activations to the memory cell and an output gate to manage the outflow of cell activations to the network. the forget gate is incorporated to forget or reset the memory of the cell adaptively.
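the gate mechanism described above can be illustrated with one time step of a single lstm unit. the sketch uses the commonly used gate equations with scalar input and state; the parameter names (wi, ui, bi, and so on) are assumptions for illustration, and peephole connections, which some lstm variants include, are omitted.

```python
import math


def sigmoid(z):
    """Logistic function used by the three gates."""
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, p):
    """One time step of a single scalar LSTM unit.

    p is a dict of hypothetical scalar parameters: w* weights the input,
    u* weights the previous hidden state, b* is the bias.
    """
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate value
    c = f * c_prev + i * g   # memory cell: keep part of the old state, add new
    h = o * math.tanh(c)     # hidden state exposed to the rest of the network
    return h, c


def run_sequence(xs, p, h0=0.0, c0=0.0):
    """Apply the unit iteratively over an input sequence, returning all h_t."""
    h, c, outputs = h0, c0, []
    for x in xs:
        h, c = lstm_step(x, h, c, p)
        outputs.append(h)
    return outputs


zero_params = {k: 0.0 for k in
               ("wi", "ui", "bi", "wf", "uf", "bf",
                "wo", "uo", "bo", "wg", "ug", "bg")}
h1, c1 = lstm_step(1.0, 0.0, 1.0, zero_params)
```

with all parameters set to zero, every gate equals 0.5 and the candidate value is 0, so the cell simply halves its previous state; a strongly negative forget-gate bias would instead reset the memory, which is the adaptive forgetting mentioned in the text.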
lstm computes a mapping from an input sequence $x = (x_1, x_2, \ldots, x_T)$ to an output sequence $y = (y_1, y_2, \ldots, y_T)$ iteratively from time step $t = 1$ to $T$. lstm is good at addressing complex sequential machine learning problems (karim et al., 2018) . deep lstm architectures consist of stacked lstm layers (sak et al., 2014) . lstms are strong in handling temporal dependencies in sequences but weak in dealing with long sequence dependencies (karim et al., 2018) . in the fight against covid-19, different aspects of artificial intelligence (ai) were applied to curtail its adverse effect (dananjayan & raj, 2020) . the taxonomy in figure 3 was created from the projects that involved machine learning in fighting covid-19. the data used in creating the taxonomy were extracted from the papers that applied machine learning algorithms to fight covid-19. currently, the reverse transcription-polymerase chain reaction (rt-pcr)-based viral nucleic acid assay is used as the reference standard method to confirm covid-19 infection (corman et al., 2020) . however, such a laboratory test is time consuming, and the supply of test kits may be the bottleneck for a rapidly growing population of suspected cases, even in many developed countries such as the usa. more importantly, initial false-negative or weakly positive rt-pcr test results were found in several later-confirmed cases, while highly suspicious computed tomography (ct) imaging features were present (xie et al., 2020) . the treatment and screening of covid-19 can be more effective when the deep learning approach, ct features, and real-time rt-pcr results are integrated.
ai and deep learning can assist in developing diagnostic tools and deciding on treatment (rao and vazquez, 2020; shi et al., 2020) . as a result, many diagnostic tools were developed based on machine learning algorithms to fight covid-19. for example, apostolopoulos and mpesiana (2020) applied transfer learning with cnn to detect covid-19 from x-ray images containing common bacterial pneumonia, normal incidents, and established covid-19 infection. transfer learning cnn was used to diagnose covid-19 cases from x-ray datasets. the results indicated that vgg19 diagnosed covid-19 confirmed cases with better accuracy on two- and three-class classification problems compared with mobilenet v2, inception, xception, and inception resnet v2. the proposed approach can help develop a cost-effective, fast, and automatic covid-19 diagnostic tool, and reduce the exposure of medical workers to covid-19. similarly, rahaman et al. (2020) developed an automated computer-aided diagnosis (cad) system for distinguishing covid-19 samples from healthy cases and cases with pneumonia using chest x-ray (cxr) images. their study demonstrated the effectiveness of applying deep transfer learning techniques for the identification of covid-19 cases using cxr images. ardakani et al. (2020) were motivated by the time consumption and high cost of the traditional medical laboratory covid-19 test to investigate the performance of 10 well-known cnns in diagnosing covid-19. the 10 variants of cnn included alexnet, xception, squeezenet, and googlenet, among others. all the cnn variants were applied on ct scan images because the ct slice is a fast method of diagnosing patients with covid-19. the diagnostic results of the cnn variants indicated that resnet-101 and xception outperformed the other cnn variants in diagnosing covid-19. they concluded that resnet-101 has a high sensitivity in characterizing and diagnosing covid-19 infections.
therefore, it can be used as an alternative tool in the department of radiology for diagnosing covid-19 infection. it is cheaper and faster compared with traditional laboratory analysis. butt et al. (2020) applied cnn for the detection of covid-19 from the chest ct scan of patients. cnn was found to be very fast and reliable in the detection of covid-19 from a chest ct scan compared with the conventional rt-pcr testing. in summary, the cnn model is fast and reliable in detecting covid-19 infection. huang et al. (2020) applied a deep learning algorithm on chest ct scans of patients with covid-19 to quantify lung burden changes. the patients with covid-19 were grouped into mild, moderate, severe, and critical based on findings from the chest ct scan, clinical evaluation, and laboratory results. the deep learning algorithm was applied to assess the lung burden changes. they found that the assessment of lung opacification measured on the chest ct scan substantially differed from that of the clinical groups. the approach can remove the subjectivity in the initial assessment of covid-19 findings.
[table 3 fragment: machine learning algorithm; accuracy: 70%-80%; detect covid-19 severity in a patient at the initial presentation; help in optimal utilization of scarce]
mei et al.
(2020) proposed a joint model comprising cnn, support vector machine (svm), random forest (rf), and multilayer perceptron integrated with chest ct scan results and non-image clinical information to predict covid-19 infection in a patient. cnn was run on the ct image, while the other algorithms classified covid-19 using the non-image clinical information. the outputs of the cnn and the other algorithms were combined to predict the patient's covid-19 infection. the diagnostic tool can rapidly detect covid-19 infection in patients. used logistic regression for predicting the progression of covid-19 infection to severe illness in a covid-19 cohort. the results of the study showed that the ct quantification of the pneumonia lesions could predict the progression of a patient with covid-19 to a severe stage at an early, non-invasive level. this can provide a prognostic indicator for coronavirus clinical management. jiang et al. (2020) applied a machine learning algorithm to predict covid-19 clinical severity. they developed a predictive tool that identifies patients at risk for increased covid-19 severity at the first presentation. the tool can help in the optimal utilization of scarce resources to cope with the covid-19 pandemic. hurt et al. (2020) collected cxr images from patients with covid-19 in china and america. they applied a deep learning algorithm for the early diagnosis of covid-19 from the cxr. they found that deep learning predicted and consistently localized areas of pneumonia. the deep learning algorithm can diagnose a patient's covid-19 infection early. loey et al. (2020) were motivated by the insufficient covid-19 dataset to propose a generative adversarial network (gan) and cnn variants to detect covid-19 in patients. gan was used to generate more x-ray images. googlenet, alexnet, and resnet18 were applied as the deep transfer learning models.
they found that googlenet and alexnet scored 80.6%, 85.2%, and 100% in the four-, three-, and two-class classification problems, respectively. the study's method can facilitate the early detection of covid-19 and reduce the workload of a radiologist. wu et al. (2020) proposed a multi-view resnet50 for the screening of covid-19 from chest ct scan images. resnet50 was trained with the multi-view chest ct scan images. the results showed that the multi-view resnet50 fusion achieved a high performance compared with the single view. the diagnosis tool developed can reduce the workload of a radiologist by offering fast, accurate covid-19 diagnosis. ucar and korkmaz (2020) developed a rapid covid-19 diagnosis tool from x-ray images based on squeezenet (a pre-defined cnn) and the bayesian optimization method. the squeezenet hyperparameters were optimized using the bayesian optimization method. bayesian optimization-based squeezenet was applied to detect covid-19 from x-ray images labeled normal, pneumonia, and covid-19. bayesian-based squeezenet outperformed the baseline diagnostic tools. togaçar et al. (2020) applied cnn for the exploitation of social mimic and cxr based on fuzzy color and the stacking method to diagnose covid-19. the stacked data were trained using cnn, and the features obtained were processed with social mimic optimization. the compelling features were used for classification into covid-19, pneumonia, and standard x-ray imagery using svm. used cnn and multi-objective differential evolution (mode) for the early detection of covid-19 from a chest ct scan image. the initial parameters of the cnn were tuned using mode to create a mode-based cnn and classify patients with covid-19 based on positive or negative chest ct scan images. mode-based cnn outperformed the competitive models (ann, anfis, and traditional cnn). the proposed method is beneficial for covid-19 real-time classification owing to its speed in diagnosing covid-19. salman et al. (2020) constructed a cnn-based covid-19 diagnostic tool for the detection of covid-19 from cxr images. cnn-inceptionv3 was applied to detect covid-19 from 130 x-ray images of patients infected with covid-19 and 130 normal x-ray images. the results indicated that cnn-inceptionv3 could detect covid-19 from the x-ray images and reduce the testing time required by a radiologist. ozturk et al. (2020) used cnn to develop an automated tool for diagnosing covid-19 from raw cxr images. binary and multi-class categories were experimented on using a cnn with 17 convolution layers with a different filter on each convolution layer. the model can be used for the early screening of patients with covid-19 and assist the radiologist in validating covid-19 screening. developed an automated framework based on cnn for the detection of covid-19 from chest ct scans and for differentiating it from community-acquired pneumonia. the study collected data from 3,322 patients comprising 4,356 chest ct scans. cnn was applied to detect patients with covid-19 and typical community pneumonia. the experiment results showed that cnn can distinguish patients with covid-19 from those with community-acquired pneumonia and other similar lung diseases. the proposed framework automated the covid-19 testing and reduced the testing time and fatigue. yang et al. (2020b) applied densely connected convolutional networks optimized with the stochastic gradient descent algorithm for the detection of covid-19 from chest ct scan images. oh et al.
(2020) applied a patch-based cnn built on resnet-18 (p-cnn) to diagnose covid-19 from cxr images because sufficient training data were lacking. the study used imaging biomarkers of the cxr radiographs. p-cnn produced clinically salient maps that are useful in diagnosing covid-19 and in patient triage, and it achieved the best result compared with the baseline algorithms. the study showed that a limited amount of data can be used for covid-19 diagnosis and that the results were interpretable. table 3 summarizes the diagnostic tools developed based on machine learning. refer to dong et al. (2020) for an engaging research review on the role of imaging in the detection and management of covid-19 disease spread. decision support systems related to covid-19 can help decision makers and policymakers formulate policy to curtail covid-19. many covid-19 decision support systems were developed based on machine learning approaches. for example, one study applied lstm and linear regression to google search data to predict the number of positive covid-19 cases in iran. the results indicated that linear regression outperformed lstm in predicting the positive cases of covid-19. the algorithm can predict the trend of the covid-19 pandemic in iran, which can help policymakers plan the allocation of medical resources. chimmula and zhang (2020) applied deep lstm for forecasting covid-19 transmission and the possible ending period of covid-19 in canada and other parts of the world. the transmission rate of canada was compared with that of italy and the usa. the future outbreak of the covid-19 pandemic was predicted to help canadian decision makers monitor the covid-19 situation and prevent the future transmission of the epidemic in canada. liu et al. (2020b) proposed ann in modeling the trend of covid-19 and restoring the operational capability of medical services in china.
ann was used for modeling the pattern of covid-19 in wuhan, beijing, shanghai, and guangzhou. autoregressive integrated moving average (arima) was applied to estimate nonlocal hospital demands during the covid-19 pandemic in beijing, shanghai, and guangzhou. the results indicated that the number of people infected with covid-19 would increase by 45%, while deaths would increase by 567%. covid-19 will reach its peak by march 2020 and toward the end of april 2020. this finding will assist policymakers and health officials in planning for the unmet medical requirements of other diseases during the covid-19 pandemic. another study proposed a group method of data handling neural network to predict the number of covid-19 confirmed cases based on weather conditions. the dominant conditions used included temperature, city density, humidity, and wind speed. the results indicated that humidity and temperature have a substantial influence on covid-19 confirmed cases: temperature influences them negatively and humidity positively. these results can be used by decision makers to manage the covid-19 pandemic. yang et al. (2020b) applied lstm to predict the covid-19 trend in china. the prediction model indicated that the covid-19 pandemic should peak toward the end of february 2020 and start declining at the end of april 2020. the prediction model can be used by authorities in china to make decisions on controlling the covid-19 pandemic. vaid et al. (2020) adopted a machine learning approach to predict potential covid-19 infections based on reported cases in north america.
critical parameters were identified through dimension reduction. past infections were inferred from recent fatalities using a hierarchical bayesian estimator. the model predicted potential covid-19 infections in north america. policymakers in north america can use the projection to curtail the effect of the covid-19 pandemic. tuli et al. (2020) developed a machine learning covid-19 predictive model and deployed it in a cloud computing environment for real-time tracking of covid-19, predicting the growth and potential threat of covid-19 in different countries worldwide. governments and citizens can use the results to take proactive measures against covid-19. tiwari et al. (2020) used a machine learning approach to predict the number of covid-19 cases, recoveries, and deaths in india based on data from china. the prediction results indicated that covid-19 would peak between the third and fourth week of april 2020. the indian government can use the study to formulate policies and decide on mitigating the spread of covid-19. ribeiro et al. (2020) evaluated six machine learning algorithms, namely, cubist regression (cubist), rf, ridge regression (ridge), support vector regression (svr), arima, and stacking-ensemble learning (sel), on covid-19 datasets collected in brazil to predict confirmed cases one, three, and six days ahead. they found that svr outperformed ridge, arima, rf, cubist, and sel. the study can help monitor covid-19 cases in brazil and facilitate critical decisions on covid-19. tummers et al. (2020) applied k-means to cluster documents based on covid-19 and people with intellectual disability.
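the document clustering step mentioned above can be illustrated with a minimal sketch of k-means on bag-of-words vectors. this is not the pipeline of tummers et al. (2020); the toy documents, the helper names (`bag_of_words`, `k_means`), and the choice to seed the centroids with the first k documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def bag_of_words(doc, vocabulary):
    """Turn a document into a term-count vector over a fixed vocabulary."""
    counts = Counter(doc.lower().split())
    return [float(counts[term]) for term in vocabulary]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def k_means(points, k, iterations=20):
    """Plain Lloyd's k-means; the first k points seed the centroids (an assumption)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: euclidean(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical toy documents, not data from the cited study.
docs = [
    "covid lockdown care home",
    "vaccine trial results",
    "covid care home visits",
    "vaccine trial dose",
]
vocabulary = sorted({word for doc in docs for word in doc.split()})
vectors = [bag_of_words(doc, vocabulary) for doc in docs]
labels, centroids = k_means(vectors, k=2)
print(labels)
```

in practice, term counts would usually be replaced by a weighted representation such as tf-idf, and the number of clusters would be chosen by inspecting cluster quality rather than fixed in advance.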
table 4 summarizes the studies on covid-19 decision support systems. the protein sequence of covid-19 can be collected to apply the machine learning approach for the prediction of covid-19 (qiang et al., 2020). for example, qiang et al. (2020) used rf to predict the infection risk of covid-19 of non-human origin from the spike protein for prompt alarm. the genome data comprised covid-19 of non-human origin (positive) and of human origin (negative). rf was trained to predict non-human covid-19 origin. the results showed that the rf model achieved high accuracy in predicting non-human covid-19 origin. the study can be used in covid-19 genome mutation surveillance and in exploring evolutionary dynamics in a simple, fast, and large-scale manner. another study combined decision tree and digital signal processing (dt-dsp) to detect the covid-19 virus genome and identified an intrinsic genomic signature of the covid-19 virus. dt-dsp was applied to explore over 5,000 viral genome sequences totaling 61.8 million bp, including the 29 viral sequences of covid-19. the result obtained supported the bat origin of covid-19 and successfully classified covid-19 with 100% accuracy as sub-genus sarbecovirus within betacoronavirus. dt-dsp is a reliable real-time alternative for taxonomic classification. table 5 summarizes the studies. machine learning and ai provide approaches for the speedy processing of the large amount of medical data generated daily as well as the extraction of new information from widely different applications. in the prediction of disease, a viral mutation can be forecast before the emergence of new strains. machine learning also allows the prediction of new structures and makes broader structural information available. efficient drug repurposing can be achieved by mining existing data.
the stages for the development of covid-19 drugs are as follows. disease prediction: the prediction of future-generation viral mutation can be accomplished by ai and machine learning approaches. structural analysis: the covid-19 structure and primary functional site are characterized. drug repurposing: for insight into new disease treatment, existing drug data are mined. new drug development: efficiencies across the entire pharmaceutical life cycle are achieved by rapid processing. ke et al. (2020) applied machine learning to identify drugs already marketed that can treat covid-19. they compiled two independent datasets to develop two machine learning models. the first model was built based on drugs that are known to have antiviral activities. the second model was built based on 3c-like protease inhibitors. the database of market-approved drugs was screened by the machine learning model to predict the drugs with potential antiviral activities. the drugs predicted to have antiviral activities were evaluated by a cell-based feline infectious peritonitis virus duplication assay. the assay results were fed back to the machine learning model for incremental learning. finally, 80 marketed drugs were identified to have potential antiviral activities. old drugs with antiviral activities against feline infectious peritonitis covid-19 were found. typically, a vaccine prepares the immune system to elicit antibody or cell-mediated responses against a pathogen, thereby protecting the body from infectious diseases. immunogenicity is the ability of a vaccine to elicit this response.
for long-lasting, effective immunity, the vaccine has to properly activate innate and adaptive responses (klein et al., 2010). the following phases should be adopted to develop a covid-19 vaccine (gonzalez-dias et al., 2020). dataset preparation: the quality of the data to be used influences the machine learning algorithm. thus, preparing quality data before feeding them into the algorithm is essential. data come in different sizes, ranging from small to medium to large. data quality must be ensured because a quality immune response is needed. the reliability of the data needs to be guaranteed by ensuring that the serological assay is well qualified in case it is not validated based on known parameters (linearity, specificity, lloq, ruggedness, llod, uloq, and reproducibility). vaccines and relevant genes: in vaccinology, the machine learning algorithm is trained to discover the combination of genes and the best vaccine parameters. the training data are extracted from omics experiments and used to obtain the required combination. feature selection is performed to find the best representative of the discriminatory gene signatures. then, the new vaccines are predicted. the three main feature selection methods are filter, wrapper, and embedded. machine learning algorithm selection: this task is not straightforward because many factors must be considered before selecting the appropriate algorithm for the modeling. the choice of the algorithm depends on the nature of the data, and the options include supervised, unsupervised, and semi-supervised learning. for instance, if the data have no output, then an unsupervised learning algorithm, e.g., k-nearest neighbor, is a possible candidate for the modeling but is not guaranteed to work. many algorithms have to be tested on the same problem before the algorithm that produces the best output is selected. model testing: the performance of the model is tested.
the data are partitioned into training and testing sets; the former is used for training the algorithm, and the latter is used for evaluating the performance of the model using several performance measures, e.g., mse, accuracy, and f-measure (gonzalez-dias et al., 2020). applying a machine learning algorithm to sift through trillions of vaccine adjuvant compounds can shorten the vaccine development time. the machine learning algorithm can be used for screening compounds for a potential adjuvant candidate for the sars-cov-2 vaccine (ahuja et al., 2020). ahuja et al. (2020) reported that covid-19 data are now growing. in this section, we present the sources of covid-19 data to the machine learning community. given the novelty of the virus, centralizing the collection of data sources will help researchers access different types of covid-19-related data and provide them opportunities to work on different aspects of covid-19 that may lead to novel discoveries. table 6 has five columns, which represent the reference, data, owners, source/accessibility, and remarks, respectively. we only present the projects that revealed and fully discussed their data sources.

[table 6 (fragment; only the following entries are recoverable from the extraction): a chaos game representation dataset of sars-cov-2 containing both the raw and processed data, with 100 instances of the sars-cov-2 genome; butt et al. (2020), ct scan; huang et al. (2020), 126 covid-19 patients who underwent a chest ct scan from 1/1/2020 to 3/2/2020; hurt et al. (2020), x-ray.]

in this section, we discuss the diagnosis of covid-19 based on x-ray and ct scan images because of their high value in covid-19 screening. table 6 shows that researchers heavily utilize x-rays and ct scans in developing machine-learning-based covid-19 diagnosis tools. guan et al. (2020) and wong et al. (2020) found that portable chest radiography (cxr) has a sensitivity of 59% for the initial detection of covid-19-related abnormalities. radiographic abnormalities, when present, mirror those of ct, with a bilateral, lower zone, peripherally predominant consolidation and hazy opacities (wong et al., 2019). the radiological findings of covid-19 on cxr are those of atypical pneumonia or organizing pneumonia (kooraki et al., 2020). although cxrs are reportedly less sensitive than chest ct scans, chest radiography remains the first-line imaging modality for patients with suspected covid-19 infection because it is cheap, readily available, and easily cleaned. for ease of decontamination, the use of portable radiography units is preferred. chest radiographs are often normal in early or mild disease. according to a recent study of patients with covid-19 requiring hospitalization, 69% had an abnormal chest radiograph at the initial time of admission, and 80% had radiographic abnormalities sometime during hospitalization. the findings are reported to be most extensive about 10-12 days after symptom onset. the most frequent radiographic findings are airspace opacities, described either as consolidation or, less commonly, as ground-glass opacity (ggo).
the distribution is most often bilateral, peripheral, and lower zone predominant (rodrigues et al., 2020). unlike parenchymal abnormalities, pleural effusion is rare (3%). according to the centers for disease control and prevention (cdc), even if a chest ct or x-ray suggests covid-19, viral testing is the only specific method for diagnosis. among 20 patients seen in south korea, radiography's sensitivity for detecting covid-19-related lung opacities was reported at only 25%, with a reported specificity of 90% (wen et al., 2020). the x-ray image should nevertheless be considered a useful tool for detecting covid-19, which is challenging the healthcare system due to the overflow of patients. as the covid-19 pandemic grinds on, clinicians on the front lines may increasingly turn to radiography (casey, 2020). much of the imaging focus is on ct. in february 2020, chinese studies revealed that chest ct achieved a higher sensitivity for the diagnosis of covid-19 compared with initial rt-pcr tests of pharyngeal swab samples (fang et al., 2020). subsequently, the national health commission of china briefly accepted chest ct findings of viral pneumonia as a diagnostic tool for detecting covid-19 infection (yuen et al., 2020; zu et al., 2020). the typical appearance of covid-19 on chest ct consists of multi-lobar, bilateral, predominantly lower lung zone, rounded ggos, with or without consolidation, in a mostly peripheral distribution.
however, such findings are nonspecific; the differential diagnosis includes organizing pneumonia and other infections, drug reactions, and other inflammatory processes. consequently, using ct to screen for covid-19 may result in false positives. moreover, the presence of abnormalities not typically associated with covid-19 infection, including pure consolidation, cavitation, thoracic lymphadenopathy, and nodules suggests a different etiology . covid-19-related chest ct abnormalities are more likely to appear after symptom onset, but they may also precede clinical symptoms. in a retrospective study by bernheim et al. (2020) , 44% of patients presenting within two days of symptom onset had an abnormal chest ct, while 91% presenting within 3-5 days and 96% presenting after six days had abnormal chest cts. shi et al. (2020) found ggos in 14 of 15 asymptomatic healthcare workers with confirmed covid-19. similarly, 54% of 82 asymptomatic passengers with covid-19 on the diamond princess cruise ship had findings of viral pneumonia on the ct (inui et al., 2020) . in a prospective study by wang et al., pure ggos were the only abnormalities seen prior to symptom onset. subsequently, 28% of patients developed superimposed septal thickening 6-11 days after symptom onset . architectural distortion evolving from ggos appeared later in the disease course, likely reflecting organizing pneumonia and early fibrosis. long-term follow-up imaging is also needed to determine the sequelae of sars-cov-2 infection. in a retrospective study by das et al., 33% of patients who recovered from mers-cov developed pulmonary fibrosis; a similar outcome following covid-19 is likely (das et al., 2020) . lung ultrasound offers a low-cost, point-of-care evaluation of the lung parenchyma without ionizing radiation. the modality is especially useful in resource-limited settings (stewart et al., 2020). peng et al. 
(2020) found that sonographic findings in patients with covid-19 correlated with typical ct abnormalities. the predominantly peripheral distribution of lung involvement facilitated sonographic visibility. characteristic findings included thickened, irregular pleural lines, b lines (edema), and the eventual appearance of a lines (air) during recovery. peng et al. (2020) suggested that ultrasound may be useful in monitoring recruitment maneuvers and guiding prone positioning. previous studies confirmed that the majority of patients infected with covid-19 exhibited common chest ct characteristics, including ggos and consolidation, which reflect lesions affecting multiple lobes or infections in the bilateral lung parenchyma. increasing evidence suggests that these chest ct characteristics can be used to screen suspected patients and serve as a diagnostic tool for covid-19-caused acute respiratory distress syndrome (ards). these findings have led to the modification of the diagnosis and treatment protocols of sars-cov-2-caused pneumonia to include patients with characteristic pneumonia features on chest ct but negative rt-pcr results in severe epidemic areas such as wuhan city and hubei province. patients with negative rt-pcr but positive ct findings should be isolated or quarantined to prevent clustered or widespread infections. the critical role of ct in the early detection and diagnosis of covid-19 has become more widely accepted. however, several studies also reported that a proportion of rt-pcr-positive patients, including several severe cases, had initially normal cxr or ct findings. according to the diagnostic criteria of covid-19, patients might have no or atypical radiological manifestations even at the mild or moderate stages because several lesions are easily missed in the low-density resolution of cxr, suggesting that chest ct may be a better modality with a lower false-negative rate.
another possible explanation is that in several patients, the targeted organ of covid-19 may not be the lung. multiple-organ dysfunctions, including ards, acute cardiac injury, hepatic injury, and kidney injury, have been reported during covid-19 infection. studies also reported the chest ct appearances in patients with covid-19 after treatment, suggesting its critical role in treatment evaluation and follow-up. for example, a study investigated the change in chest ct findings associated with covid-19 at different time points during the infection course (pan et al., 2020). the results showed that most apparent abnormalities on the chest ct were still observable for 10 days but disappeared at 14 days after the initial onset of symptoms. unexpectedly, a case report showed pre- and post-treatment chest ct findings of a 46-year-old woman whose rt-pcr result became negative while her pulmonary lesions were reversed (duan et al., 2020). singh et al. (2020) developed a deep cnn, which was applied in the automated diagnosis and analysis of covid-19 in infected patients to save the time and energy of medical professionals. they tuned the hyperparameters of the cnn by using multi-objective adaptive differential evolution (made). in their extensive experiments, they used several benchmark covid-19 datasets. the data used to evaluate the performance of their proposed model were divided into training and testing datasets. the training sets were used to build the covid-19 classification model. then, the hyperparameters of the cnn model were optimized on the training sets by using the made-based optimization approach.
the results from the comparative analysis showed that their proposed method outperformed existing machine learning models such as cnn, ga-based cnn, and pso-based cnn in terms of different metrics (including f-measure, sensitivity, specificity, and kappa statistics). jaiswal et al. (2020) applied a deep learning model, called densenet201-based deep transfer learning (dtl), for the diagnosis and detection of covid-19. the authors used this pre-trained deep learning architecture as an automation tool to detect and diagnose covid-19 in chest ct scans. the dtl model was used to classify patients as covid-19 positive (+ve) or covid-19 negative (−ve). the proposed model was also utilized to extract several features by adopting its own learned weights on the imagenet dataset along with a convolutional neural structure. extensive analysis of the experiments showed that the proposed dtl-based covid-19 model was superior to competing methods. the proposed densenet201 model achieved 97% accuracy, higher than the other models, and could serve as an alternative to other covid-19 testing kits. another study developed a fully automated ai system to quantitatively assess the severity of covid-19 and its progression using thick-section chest ct images. the ai system was implemented to automatically segment and quantify the covid-19-infected lung regions on thick-section chest ct images. the automatically segmented lung abnormalities were compared with the manual segmentations of two professional radiologists by using the dice coefficient on a randomly selected subset of 30 ct scans. two imaging biomarkers were automatically computed, namely, the portion of infection (poi) and the average infection hu (ihu), which were then used to assess the severity and progression of the viral disease.
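the segmentation comparison and the two biomarkers described above can be made concrete with a small sketch. the dice coefficient is the standard overlap measure 2|a∩b| / (|a| + |b|). the definitions used here for poi (infected voxels divided by lung voxels) and ihu (mean hounsfield value over infected voxels) are assumptions inferred from the biomarker names, not the cited study's exact formulas, and the masks and hu values are made-up toy data.

```python
def dice_coefficient(mask_a, mask_b):
    """Dice overlap 2|A∩B| / (|A| + |B|) between two flat binary masks."""
    intersection = sum(a and b for a, b in zip(mask_a, mask_b))
    total = sum(mask_a) + sum(mask_b)
    return 1.0 if total == 0 else 2.0 * intersection / total


def portion_of_infection(infection_mask, lung_mask):
    """POI, taken here as infected voxels / lung voxels (assumed definition)."""
    lung = sum(lung_mask)
    infected = sum(i and l for i, l in zip(infection_mask, lung_mask))
    return infected / lung if lung else 0.0


def average_infection_hu(hu_values, infection_mask):
    """iHU, taken here as the mean HU over infected voxels (assumed definition)."""
    values = [hu for hu, m in zip(hu_values, infection_mask) if m]
    return sum(values) / len(values) if values else float("nan")


# Toy flattened "scan" of ten voxels; 1 marks membership in a mask.
lung_mask = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
manual_mask = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]   # radiologist segmentation
auto_mask = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]     # automatic segmentation
hu_values = [-600, -500, -700, -400, -800, -850, -820, -810, 40, 50]

dice = dice_coefficient(manual_mask, auto_mask)
poi = portion_of_infection(auto_mask, lung_mask)
ihu = average_infection_hu(hu_values, auto_mask)
print(dice, poi, ihu)
```

a dice value of 1 means the automatic and manual segmentations coincide exactly, and 0 means they do not overlap at all; real masks would be 3d arrays rather than flat lists.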
the performance of the assessments was then compared with the patients' status in the diagnosis reports, with key phrases extracted from the radiology reports, using the area under the receiver operating characteristic curve (auc) and cohen's kappa statistics. in their study, the poi was the only computed imaging biomarker that showed high sensitivity and specificity for differentiating the groups with severe covid-19 from the groups with non-severe covid-19. the ihu reflected the progress rate of the infection but was affected by several irrelevant factors such as the reconstruction slice thickness and the respiration status. the results of the analysis revealed that the proposed deep-learning-based ai system accurately quantified the covid-19 strains associated with the lung abnormalities and assessed the severity of the disease and its corresponding progression. their results also showed that the deep-learning-based tool can help cardiologists in the diagnosis and follow-up treatment of patients with covid-19 based on the ct scans. another study used a cnn to classify patients as covid-19 +ve or covid-19 −ve. the initial parameters of the cnn were tuned by using mode. the authors adopted the mutation, crossover, and selection operations of the differential evolution (de) algorithm. they extracted the chest ct dataset of covid-19-infected patients and split it into training and testing groups. the proposed mode-based cnn and the competitive classification models were then applied to the training dataset. they compared the competitive and proposed classification models by considering different fractions of the training and testing datasets. the extensive analysis showed that the proposed model classified the chest ct images with better accuracy than competing models such as ann, anfis, and cnn. the proposed model was also useful for covid-19 disease classification from chest ct images.
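the mutation, crossover, and selection operations of de mentioned above can be sketched as follows. note the simplification: this is the classic single-objective de/rand/1/bin scheme, not the multi-objective mode variant used in the cited study, and the objective `toy_validation_error` is a hypothetical stand-in for the validation error of a cnn trained with the given hyperparameters. the function name, default settings, and search bounds are likewise assumptions.

```python
import random


def differential_evolution(objective, bounds, pop_size=12, f=0.5, cr=0.9,
                           generations=60, seed=0):
    """Single-objective DE/rand/1/bin; returns (best_vector, best_score)."""
    rng = random.Random(seed)
    dim = len(bounds)
    population = [[rng.uniform(lo, hi) for lo, hi in bounds]
                  for _ in range(pop_size)]
    scores = [objective(ind) for ind in population]
    for _ in range(generations):
        for i in range(pop_size):
            # Mutation: combine three distinct individuals other than i.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            mutant = [population[a][d] + f * (population[b][d] - population[c][d])
                      for d in range(dim)]
            # Crossover: mix mutant and parent gene by gene; one gene is
            # always taken from the mutant so the trial differs from the parent.
            forced = rng.randrange(dim)
            trial = [mutant[d] if (rng.random() < cr or d == forced)
                     else population[i][d] for d in range(dim)]
            # Keep the trial vector inside the search bounds.
            trial = [min(max(v, lo), hi) for v, (lo, hi) in zip(trial, bounds)]
            # Selection: the trial replaces the parent only if it is no worse.
            trial_score = objective(trial)
            if trial_score <= scores[i]:
                population[i], scores[i] = trial, trial_score
    best = min(range(pop_size), key=lambda j: scores[j])
    return population[best], scores[best]


def toy_validation_error(params):
    """Hypothetical stand-in for a CNN's validation error (minimum at 0.3, 0.7)."""
    learning_rate, dropout = params
    return (learning_rate - 0.3) ** 2 + (dropout - 0.7) ** 2


best_params, best_error = differential_evolution(
    toy_validation_error, bounds=[(0.0, 1.0), (0.0, 1.0)])
print(best_params, best_error)
```

in a real tuning run, each call to the objective would train and validate a network, so the population size and number of generations would be chosen with that cost in mind; a multi-objective variant would additionally keep a set of non-dominated solutions instead of a single best vector.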
asif and yi (2020) implemented a model that automatically detected covid-19 pneumonia in patients using digital cxr images while maximizing detection accuracy by using deep convolutional neural networks (dcnn). their dcnn-based model, inception v3 with transfer learning, detected covid-19 infection in patients using cxr radiographs. the proposed dcnn also provided insights on how deep transfer learning methods were used for the early detection of the disease. the experimental results showed that the proposed dcnn model achieved high accuracy. the proposed model also exhibited excellent performance in classifying covid-19 pneumonia by effectively training itself from a comparatively small collection of images. hu et al. (2020) implemented a weakly supervised deep learning model for detecting and classifying covid-19 infection from ct images. the proposed model minimized the requirements of manual labelling of ct images and accurately detected the viral disease. the model could distinguish positive covid-19 cases from non-positive covid-19 cases by using covid-19 samples from retrospectively extracted ct images from multiple scanners and centers. the proposed method accurately pinpointed the exact position of the lesions (inflammations) caused by covid-19 and potentially provided advice on the patient's severity to guide disease triage and treatment. the experimental results indicated that the proposed model achieved high accuracy, precision, and classification performance as well as good qualitative visualization for the lesion detections. another study predicted the incidence and occurrence of covid-19 in iran.
the authors obtained data from the google trends website (recommender systems) and used linear regression and lstm models to estimate the number of positive covid-19 cases from the extracted data. root mean square error and 10-fold cross-validation were used for performance evaluation. the predictions obtained from the google trends data were not very precise but could serve as a base for accurate models built on more aggregated data. their study showed that the population (iranians) focused on the use of hand sanitizer and handwashing practices with antiseptic as preventive measures against the disease. the authors used specific keywords related to covid-19 to extract google search frequencies and used the extracted data to predict the degree of covid-19 epidemiology in iran. they suggested future research directions using other data sources such as social media information, people's contact with the special call center for covid-19, mass media, environmental and climate factors, and screening registries. another study integrated supervised machine learning with digital signal processing (mldsp) for genome analyses, augmented by a dt approach to the machine learning component and a spearman's rank correlation coefficient analysis for result validation. the authors identified an intrinsic covid-19 virus genome signature and used it together with a machine-learning-based alignment-free approach for an ultra-fast, scalable, and highly accurate classification of the covid-19 genomes. they also demonstrated how machine learning used the intrinsic genomic signature to provide a rapid alignment-free taxonomic classification of novel pathogens. the model accurately classified the covid-19 virus without a priori knowledge by simultaneously processing the geometric space of all relevant viral genomes. their result analysis supported the hypothesis of a bat origin and classified the covid-19 virus as sarbecovirus within betacoronavirus.
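to give a feel for what an alignment-free genomic signature and a spearman's rank correlation check might look like, the following is a deliberately simplified sketch. it uses plain k-mer counts as the signature, which is an assumption for illustration and not the signal-processing representation of the cited study; the two short sequences are invented, and the helper names are hypothetical.

```python
from itertools import product


def kmer_signature(sequence, k=2):
    """Count every k-mer over the alphabet ACGT, in lexicographic order."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = {kmer: 0 for kmer in kmers}
    for i in range(len(sequence) - k + 1):
        window = sequence[i:i + k]
        if window in counts:          # skip windows with ambiguous bases
            counts[window] += 1
    return [counts[kmer] for kmer in kmers]


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Invented toy sequences; real genomes are tens of thousands of bases long.
seq_a = "ATGCGATACGCTTGA"
seq_b = "ATGCGATACGCTAGA"
rho = spearman(kmer_signature(seq_a), kmer_signature(seq_b))
print(rho)
```

because no alignment is needed, signatures of this kind can be computed independently for each genome and compared in bulk, which is what makes alignment-free approaches fast on large collections.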
also, their results were obtained through a comprehensive alignment-free analysis of the 2d genomic signatures of over 5,000 unique viral sequences, combined with a dt use of supervised machine learning, and confirmed by spearman's rank correlation coefficient analyses. farhat et al. (2020) reviewed the developments of deep learning applications in medical image analysis that targeted pulmonary imaging and provided insights into contributions to covid-19. the study surveyed contributions from diverse fields over about three years and highlighted various deep learning tasks such as classification, segmentation, and detection as well as different pulmonary pathologies such as airway diseases, lung cancer, covid-19, and other infections. the study summarized and discussed current state-of-the-art approaches in the research domain, highlighting the challenges, especially given the current situation of covid-19. first, the authors provided an overview of several medical image modalities, deep learning, and surveys on deep learning in medical imaging, in addition to available datasets for pulmonary medical images. second, they provided a summarized survey on deep-learning-based applications and methods on pulmonary medical images. third, they described the covid-19 disease and related medical imaging concerns, summarized reviews on deep learning applications to covid-19 medical imaging analysis, and listed and described contributions to this domain. finally, they discussed the challenges experienced in the research domain and made suggestions for future research. in this survey, we review the projects that used machine learning to fight covid-19 from a different perspective. we only considered papers published in reputable journals and conferences, and no preprint papers uploaded to preprint servers were used in the survey. we appraised 30 studies that reported the description of the machine learning approach to fighting covid-19.
we found that machine learning has made inroads into fighting covid-19 from different aspects, with potential for real-life applications to curtail the negative effect of covid-19. the studies that used machine learning algorithms such as cnn, lstm, and ann in fighting covid-19 mostly reported excellent performance compared with the baseline approaches. many of the studies reported a scarcity of data sufficient for large-scale studies because of the novelty of the covid-19 pandemic. we found that various studies used different covid-19 data. figure 4 depicts the type of data used in different studies that applied machine learning algorithms to develop different models for fighting the covid-19 pandemic. the data used to plot figure 4 were extracted from machine learning research on covid-19 (refer to table 6). the longest bars show that x-rays and ct scans are the data types most widely used by the studies. many of the studies used deep learning algorithms, e.g., cnn and lstm, for the diagnosis of covid-19 on x-rays and ct scans. the evaluation indicated the excellent performance of the algorithms in detecting covid-19 on x-rays and ct scan images. the ct scan has great value in the screening, diagnosis, and follow-up of patients with covid-19. the ct scan has now been added as a criterion for diagnosing covid-19. the covid-19 x-ray data project by cohen, hosted on github, is receiving unprecedented attention from the research community as a source of freely available data. figure 5 presents the frequency of machine learning algorithms adopted to fight covid-19.
the longest bar indicates that cnn received the most considerable attention from the researchers working in this domain to fight covid-19. the likely reason why cnn has the highest number of applications is that most of the data used in detecting covid-19 infection in patients are images (see figure 4). cnn is well known for its robustness, effectiveness, and efficiency in image processing compared with other conventional machine learning algorithms because of its automated feature engineering and high performance. the cnn variant suitable for the diagnosis of covid-19 from x-ray and ct scan images is resnet. however, many of the studies did not provide the specific type of cnn adopted for the diagnosis of covid-19 from x-ray and ct scan images (see table 3).
figure 5: machine learning algorithms adopted in fighting covid-19.
figure 6 shows the different aspects where machine learning algorithms were applied in fighting covid-19. we found that the studies mainly adopted machine learning algorithms in developing covid-19 diagnosis tools, decision support systems, drug development, and detection from protein sequence. the largest portion of the pie chart indicates that diagnostic tools attracted the most considerable attention. this reflects the demand for diagnostic tools in the fight against the covid-19 pandemic: a diagnosis must come before the appropriate treatment is administered to save a life, and an incorrect diagnosis can lead to inappropriate medication, resulting in further health complications.
most of the studies that adopted machine learning to develop diagnostic tools intended to reduce the workload of radiologists, improve the speed of diagnosis, automate the covid-19 diagnostic process, reduce the cost compared with traditional laboratory tests, and help healthcare workers in making critical decisions. the studies argued that the diagnostic tool could reduce the exposure of healthcare workers to patients with covid-19, thus decreasing the risk of spreading covid-19 to healthcare workers. the second-largest portion of the pie chart is the decision support system for detecting the rate of spread of the virus, confirmed cases, mortalities, and recovered cases. this information from the decision support system can help government functionaries, policymakers, decision makers, and other stakeholders in formulating policy that can help fight the covid-19 pandemic. the primary purpose of the bibliometric analysis in this study is to reflect the trend of rapidly emerging topics in covid-19 research, where substantial research activity began during the early stage of the outbreak. the bibliometric analysis also aids in mapping the state of research on coronavirus disease as reported in the scientific literature by the research community. in this section, we present the bibliographic coupling among different article items on machine learning to fight covid-19. the link between the items on the constructed map corresponds to the weight between them either in terms of the number of publications, common references, or co-citations.
these items may belong to a group or a cluster. in the visualization, items within the same cluster are marked with the same color, and colors indicate the cluster to which a journal was assigned by the clustering technique implemented by the vosviewer software. circular nodes represent the items, and their sizes may vary depending on the weight of the article. the bibliographic coupling between the top 25 authors is shown in figure 8. the two clusters, namely, red and green, correspond to all authors working on the similar research field "covid-19" and citing the same sources in their reference listings. the similarity in cluster color for the authors also implies that the degree of overlap between the reference lists of publications of these authors is higher. figure 8 shows the visible names, and other names may not be represented in the constructed map. figure 9 shows the bibliographic coupling of the topmost productive countries. here, bibliographic coupling indicates that the papers published by these countries share common references. the five clusters are distinguished by color. red represents china and the usa with the highest strength in terms of contributions, followed by india and iran within the red cluster. green represents hong kong, which appears to have the highest strength in that cluster, whereas blue represents the united kingdom and saudi arabia, which have the highest strength in theirs. yellow denotes japan, singapore, thailand, and taiwan as the highest contributors. purple refers to italy and canada as the two contributing countries.
the link between the red and green clusters is thicker than that between the blue and red clusters or between the blue and purple clusters. the thickness of the link simply depicts the degree of intersection of the literature between the different locations or countries. bibliographic coupling between journals implies that the papers published in these journals have more common reference lists. three clusters are depicted on the map with red, blue, and green colors. the links with the highest strength occur between emerging microbes journal, journal of virology, and journal of infection. this link is closely followed by the links between eurosurveillance and journal of infection, archive of academic emergency medicine, chinese medical journal, and the lancet. the journal of infection control and hospital and journal of hospital infection form the weakest networks of a cluster. figure 11 illustrates the bibliographic coupling between the considered journals.
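to make the notion of bibliographic coupling used in this section concrete, the following is a minimal sketch, assuming that the coupling strength between two items is simply the number of references they share. the function name `coupling_strength` and the toy reference lists are hypothetical, and the sketch does not reproduce the normalization or clustering that vosviewer applies.

```python
from itertools import combinations


def coupling_strength(references):
    """Return {(a, b): shared_count} for every pair of items that
    cite at least one common reference (raw, unnormalized counts)."""
    strengths = {}
    for a, b in combinations(sorted(references), 2):
        shared = len(set(references[a]) & set(references[b]))
        if shared:
            strengths[(a, b)] = shared
    return strengths


# Hypothetical toy data: each item maps to the references it cites.
refs = {
    "paper_a": ["r1", "r2", "r3"],
    "paper_b": ["r2", "r3", "r4"],
    "paper_c": ["r5"],
}

links = coupling_strength(refs)
print(links)
```

in this toy example, only paper_a and paper_b are linked, because they are the only pair with overlapping reference lists; a thicker link on the map corresponds to a larger shared count.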
figure 11: bibliographic coupling among journals.
figure 12 illustrates the co-authors and author map visualization. this analysis aims to produce the visualization of all the major authors publishing together or working on similar research fields. the analysis type is co-authorship, and the unit of analysis is authors. the threshold of the minimum number of papers by an author is 25. network construction and analysis shows that of 2,381 authors, only 9 authors meet the threshold. however, the most extensive set of connected entities consists of only 8 authors, whose visual representation is depicted in figure 12, where only one cluster is denoted by red color. the connected link illustrates that these authors have collaborated on the same project or worked on the same research with a similar focus. the thickness of the link between these three authors indicates more common publications.
figure 12: co-authorship and authors' analysis.
figure 13 illustrates the citation analysis among authors' institutions. six clusters are represented using different colors. the red cluster has the highest number of author citations, from the following institutions: the huazhong university of science and technology, wuhan university (state key laboratory of virology), and the department of microbiology, university of hong kong. figure 14 shows the bibliometric analyses of author citations by journal sources. a link between two journal sources indicates the citation connectivity between the two sources.
the connected links between the journal of virology and the new england journal of medicine in figure 14 reveal that a publication from the journal of virology has cited another publication that is published in the new england journal of medicine or vice versa. greater link thickness and strength signify more citations among the clusters. therefore, among the different clusters identified in the analysis, the journal of virology is the source most cited by publications from other journal sources. in this section, we present challenges and future research prospects. in addition, figure 15 describes the course of conducting the literature survey and opportunities for future research with the possibility of solving the challenges, to help expert researchers easily identify areas that need development. the challenges and future research opportunities are presented as follows:
lack of sufficient covid-19 data: the primary concern with the research in covid-19 is the barrier prompted by the lack of adequate covid-19 clinical data (alimadadi et al., 2020; mei et al., 2020; fong et al., 2020; oh et al., 2020; togaçar et al., 2020; ucara and korkmazb, 2020; belfiore et al., 2020; oyelade and ezugwu, 2020). an in-depth analysis of patients with covid-19 requires much more data (apostolopoulos and mpesiana, 2020). data is the key component in machine learning.
without sufficient data, machine learning approaches are typically limited in their efficiency and effectiveness. therefore, insufficient covid-19 clinical data can limit the performance of specific machine learning algorithms, such as deep learning algorithms that require large-scale data. in this case, developing machine-learning-based covid-19 diagnostic and prognosis tools and therapeutic approaches to curtail covid-19, and predicting a future pandemic, can face a severe challenge in terms of performance due to insufficient covid-19 clinical data. alimadadi et al. (2020) suggested global collaborations among stakeholders to build a covid-19 clinical database and mitigate the issue of inadequate covid-19 clinical data. they also suggested integrating existing biobanks containing the data of patients with covid-19 with covid-19 clinical data. we suggest that researchers use gan to generate additional x-rays and ct scan images for covid-19 to obtain sufficient data for building covid-19 diagnosis tools. for example, loey et al. (2020) were motivated by insufficient data and used gan to generate more x-ray images and develop a covid-19 diagnostic tool. figure 4 shows that x-ray and ct scan are the two primary clinical data types for detecting covid-19 infection in patients. on x-ray images, the features that distinguish patients with covid-19 and mild symptoms from patients with pneumonia may be visualized inaccurately or not at all (apostolopoulos and mpesiana, 2020). we suggest that researchers propose machine learning strategies that can accurately differentiate patients with covid-19 and mild symptoms from patients with pneumonia symptoms based on x-ray images. covid-19, which is caused by a coronavirus, might have ct scan image characteristics similar to those of other pneumonia caused by a different virus.
in the future, the performance of cnn should be evaluated in classifying covid-19 and viral pneumonia with rt-pcr.
uncertainties: when a new pandemic breaks out, it comes with limited information and very high uncertainty, unlike the commonly known influenza. therefore, knowledge regarding the new epidemic is not sufficient due to the absence of a prior case that is the same as the recent pandemic. in the case of covid-19, many of the decision makers relied on sars for reference because of the similarity, even though it is considerably different from covid-19. a new pandemic typically poses a challenge to data analytics, considering its limited information and the geographical and temporal evolution of the epidemic. therefore, building an accurate model for predicting the future behavior of a pandemic becomes challenging due to uncertainty. we suggest that researchers propose a new pandemic forecasting model based on active learning in machine learning to reduce the level of uncertainty typically accompanying new pandemics such as covid-19. one study applied the susceptible, exposed, infectious, recovered (seir) model for modeling covid-19. however, the seir model could not capture the complete number of infected cases, and the study ignored imported covid-19 confirmed cases. seir was based on the people's natural distribution and cannot be applied to settings with a different population distribution, such as a welfare institute. the epidemiological trend of covid-19 was not predicted accurately by the seir model under the viral mutation and specific antiviral therapy development scenario. the seir model was unable to simulate non-uniform patterns, such as the issue of increasing medical professionals and bed capacity.
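for reference, the standard seir compartmental model mentioned above can be sketched as follows. this is a minimal illustration using forward euler integration, not the modified seir model of the cited study; the function name `simulate_seir` and all parameter values (beta, sigma, gamma, population size) are hypothetical and not fitted to any covid-19 data.

```python
def simulate_seir(s, e, i, r, beta, sigma, gamma, days, dt=0.1):
    """Integrate the textbook SEIR equations with forward Euler steps.

    dS/dt = -beta * S * I / N
    dE/dt =  beta * S * I / N - sigma * E
    dI/dt =  sigma * E - gamma * I
    dR/dt =  gamma * I
    Returns one (S, E, I, R) tuple per day, including day 0.
    """
    n = s + e + i + r
    steps_per_day = round(1 / dt)
    history = [(s, e, i, r)]
    for _ in range(days):
        for _ in range(steps_per_day):
            new_exposed = beta * s * i / n * dt
            new_infectious = sigma * e * dt
            new_recovered = gamma * i * dt
            s -= new_exposed
            e += new_exposed - new_infectious
            i += new_infectious - new_recovered
            r += new_recovered
        history.append((s, e, i, r))
    return history


# Illustrative (assumed) parameters: 1,000 people, one initial infection.
history = simulate_seir(s=999.0, e=0.0, i=1.0, r=0.0,
                        beta=0.5, sigma=0.2, gamma=0.1, days=60)
peak_day = max(range(len(history)), key=lambda d: history[d][2])
print("peak infectious on day", peak_day)
```

because beta, sigma, and gamma are constants here, the sketch assumes a uniformly mixed population and fixed conditions, which is precisely the limitation noted above: imported cases, changing bed capacity, or a different population distribution are outside what these four equations can express.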
we suggest that researchers propose a machine-learning-based strategy for handling the non-uniform pattern in the future and consider all the other factors not considered in the study. adequate covid-19 data for a particular region are lacking because the capacity to gather reliable data is not uniform across regions worldwide. this situation can bring a challenge to the region without available covid-19 data. we suggest that researchers apply the cross-population train-test model because a model trained in one region can be used to detect covid-19 in another region. for example, the model trained to detect the new virus in wuhan, china, can be used in italy (santosh, 2020).
image resolution: the resolution of the x-ray images affects the performance of the machine learning algorithm. dealing with low-resolution images typically poses a challenge to the machine learning approach. variable image dimensions have a negative effect. successful performance cannot be achieved if the input images of the data have different sizes. the original image resolution dimension, structured images, and stacking technique need to be the same (togaçar et al., 2020). we suggest using high-resolution x-ray images to develop covid-19 diagnostic and prognosis systems that can also work with low-resolution x-ray images.
outliers and noise: at the early phase of covid-19, the covid-19 data contained many outliers and much noise (tuli et al., 2020). an outlier in data is a subset of the data that appears inconsistent with the remaining data. outliers typically lower the fit of a statistical model (bellazzi et al., 1998).
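as a small illustration of the outlier problem described above, the following sketch flags outliers in a series of daily case counts using the modified z-score based on the median absolute deviation. this is one common convention (scaling constant 0.6745 and cutoff 3.5), chosen here as an assumption; it is not a method taken from the cited studies, and the function name `flag_outliers` and the case counts are hypothetical.

```python
from statistics import median


def flag_outliers(values, threshold=3.5):
    """Return a list of booleans marking values whose modified z-score,
    0.6745 * |x - median| / MAD, exceeds the threshold."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # All deviations are (almost) identical; nothing can be flagged.
        return [False] * len(values)
    return [abs(0.6745 * (v - med) / mad) > threshold for v in values]


# Hypothetical daily case counts with one reporting spike.
daily_cases = [10, 12, 11, 13, 12, 95, 14, 13]
flags = flag_outliers(daily_cases)
print(flags)
```

the median and the median absolute deviation are used instead of the mean and standard deviation because a single extreme value shifts them far less, so the spike does not mask itself.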
the presence of outliers and noise in covid-19 makes predicting the correct number of covid-19 cases challenging (tuli et al., 2020) . dealing with outliers and noise in data increases data engineering efforts and expenses. we suggest that researchers propose a robust machine learning approach that can effectively handle outliers and noise in covid-19 data. the limitation of deep learning algorithms is a deficiency in terms of transparency and interpretability. for instance, knowing the image features that are applied to decide the output of the deep learning algorithms is not possible. the unique features used by the deep learning algorithm to differentiate covid-19 from cap cannot be sufficiently visualized by the heatmap, although the heatmap is used to visualize region in images that led to the algorithm output . images, especially x-rays and ct scans, are heavily relied on in detecting covid-19. we suggest that researchers propose explainable deep learning algorithms for the detection of covid-19 to instill transparency and interpretation in deep learning algorithms. the application of a deep learning algorithm to detect covid-19 on a chest ct scan has the possibility of misdiagnosis because of the similarity of the covid-19 symptoms with other types of pneumonia (belfiore et al., 2020) . incorrect diagnoses can mislead the health professional in deciding and lead to inappropriate medication, further complicating the health condition of the patient with covid-19. we suggest that researchers combine the ct scan diagnosis using deep learning algorithm with clinical information such as the nucleic acid detection results, clinical symptoms, epidemiology, and laboratory indicators to avoid misdiagnosis . resource allocation is a challenge as the covid-19 pandemic keeps spreading because the increase in the number of patients means more resources are required to take care of them. 
the allocation of limited resources in a rapidly expanding pandemic entails a difficult decision for the distribution of scarce resources . the epicenters of the covid-19 are challenged with resource problems of shortage of beds, gowns, masks, medical staff, and ventilators (ahuja et al., 2020; taiwo and ezugwu, 2020) . we propose the development of a machine learning decision support system to help in crucial decisions on resource allocation. in this study, we propose a survey, including a bibliometric analysis of the adoption of machine learning, to fight covid-19. the concise summary of the projects that adopted machine learning to fight covid-19, sources of covid-19 datasets, new comprehensive taxonomy, synthesis and analysis, and bibliometric analysis is presented. the results reveal that covid-19 diagnostic tools received the most considerable attention from researchers, and energy and resources are more dispensed toward automated covid-19 diagnostic tools. by contrast, covid-19 drugs and vaccine development remain grossly underexploited. the algorithm predominantly utilized by the researchers in developing the diagnostic tool is cnn mainly from x-rays and ct scan images. the most suitable cnn architecture for the detection of covid-19 from the x-ray and ct scan images is resnet. the challenges hindering practical work on machine learning to fight covid-19 and a new perspective to solve the identified problems are presented in the study. we believe that our survey with bibliometric analysis could enable researchers to determine areas that need further development and identify potential collaborators at author, country, and institutional levels. based on the bibliometric analysis conducted on the global scientific research output on covid-19 disease spread and preventive measures, the analysis results reveal that most of the research outputs were published in prestigious journals with high influence factors. 
these journals include the lancet, journal of medical virology, and eurosurveillance. the bibliometric analysis also shows the focused subjects in various aspects of covid-19 infection transmission, diagnosis, treatment, prevention, and its complications. other prominent features include strong collaboration among research institutions and universities and co-authorships among researchers across the globe. machine learning algorithms have many practical applications in medicine, and novel contributions from different researchers are still evolving and growing exponentially in a bid to satisfy the essential clinical needs of individual patients, as is the case with their application to fighting the covid-19 pandemic. as a way forward, we suggest an in-depth machine learning application review that would focus on the critical analysis of the novel coronavirus disease and other related cases of global pandemics.
artificial intelligence and covid-19: a multidisciplinary approach correlation of chest ct and rt-pcr testing in coronavirus disease 2019 (covid-19) in china: a report of 1014 cases.
radiology artificial intelligence and machine learning to fight covid-19 application of deep learning technique to manage covid-19 in routine clinical practice using ct images: results of 10 convolutional neural networks automatic detection of covid-19 using x-ray images with deep convolutional neural networks and machine learning predicting covid-19 incidence through analysis of google trends data in iran: data mining and deep learning pilot study predicting covid-19 incidence through analysis of google trends data in iran: data mining and deep learning pilot study chaos game representation dataset of sars-cov-2 genome artificial neural networks : fundamentals , computing , design , and application artificial intelligence to codify lung ct in covid-19 patients qualitative and fuzzy reasoning for identifying non-linear physiological systems: an application to intracellular thiamine kinetics chest ct findings in coronavirus disease-19 (covid-19): relationship to duration of infection deep learning system to screen coronavirus disease 2019 pneumonia. applied intelligence how good is radiography for covid-19 detection? 
a bibliometric analysis of covid-19 research activity: a call for increased output time series forecasting of covid-19 transmission in canada using lstm networks detection of 2019 novel coronavirus (2019-ncov) by real-time rt-pcr artificial intelligence during a pandemic: the covid-19 example followup chest radiographic findings in patients with mers-cov after recovery a tutorial survey of architectures, algorithms, and applications for deep learning the role of imaging in the detection and management of covid-19: a review pre-and posttreatment chest ct findings: 2019 novel coronavirus (2019-ncov) pneumonia restructured society and environment: a review on potential technological strategies to control the covid-19 pandemic automatic clustering algorithms: a systematic review and bibliometric analysis of relevant literature sensitivity of chest ct for covid-19: comparison to rt-pcr deep learning applications in pulmonary medical imaging: recent updates and insights on covid-19. machine vision and applications composite monte carlo decision making under high uncertainty of novel coronavirus epidemic using hybridized deep learning and fuzzy rule induction methods for predicting vaccine immunogenicity and reactogenicity severe acute respiratory syndrome-related coronavirus-the species and its viruses, a statement of the coronavirus study group lstm : a search space odyssey clinical characteristics of coronavirus disease 2019 in china deep learning approaches to biomedical image segmentation first case of 2019 novel coronavirus in the united states current status of global research on novel coronavirus disease (covid-19): a bibliometric analysis and knowledge mapping. hossain mm. 
current status of global research on novel coronavirus disease (covid-19): a bibliometric analysis and knowledge mapping version 1 weakly supervised deep learning for covid-19 infection detection and classification from ct images clinical features of patients infected with 2019 novel coronavirus in wuhan serial quantitative chest ct assessment of covid-19: deep-learning approach the continuing 2019-ncov epidemic threat of novel coronaviruses to global health-the latest 2019 novel coronavirus outbreak in wuhan deep learning localization of pneumonia: 2019 coronavirus (covid-19) outbreak report 2: estimating the potential total number of novel coronavirus cases in wuhan city chest ct findings in cases from the cruise ship "diamond princess classification of the covid-19 infected patients using densenet201 based deep transfer learning towards an artificial intelligence framework for data-driven prediction of coronavirus clinical severity lstm fully convolutional networks for time series classification artificial intelligence approach fighting covid-19 with repurposing drugs the xs and y of immune responses to viral vaccines. 
the lancet infectious diseases coronavirus (covid-19) outbreak: what the department of radiology should know a review of modern technologies for tackling covid-19 pandemic deep learning false-negative results of real-time reversetranscriptase polymerase chain reaction for severe acute respiratory syndrome coronavirus 2: role of deeplearning-based ct diagnosis and insights from two cases artificial intelligence distinguishes covid-19 from community acquired pneumonia on chest ct from community acquired pneumonia to covid-19: a deep learning based method for quantitative analysis of covid-19 on thick-section ct scans ct quantification of pneumonia lesions in early days predicts progression to severe illness in a cohort of covid-19 patients deep learning-based channel prediction for edge computing networks toward intelligent connected vehicles the indispensable role of chest ct in the detection of coronavirus disease 2019 (covid-19) modeling the trend of coronavirus disease 2019 and restoration of operational capability of metropolitan medical service in china: a machine learning and mathematical model-based analysis artificial neural networks: methods and applications within the lack of chest covid-19 x-ray dataset: a novel detection model based on coronavirus disease 2019: a bibliometric analysis and review artificial intelligence-enabled rapid diagnosis of patients with covid-19 deep learning for iot big data and streaming analytics: a survey deep learning covid-19 features on cxr using limited training data sets a case-based reasoning framework for early detection and diagnosis of novel coronavirus automated detection of covid-19 cases using deep neural networks with x-ray images emergence of new disease-how can artificial intelligence help? 
the authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
key: cord-103297-4stnx8dw authors: widrich, michael; schäfl, bernhard; pavlović, milena; ramsauer, hubert; gruber, lukas; holzleitner, markus; brandstetter, johannes; sandve, geir kjetil; greiff, victor; hochreiter, sepp; klambauer, günter title: modern hopfield networks and attention for immune repertoire classification date: 2020-08-17 journal: biorxiv doi: 10.1101/2020.04.12.038158 sha: doc_id: 103297 cord_uid: 4stnx8dw a central mechanism in machine learning is to identify, store, and recognize patterns. how to learn, access, and retrieve such patterns is crucial in hopfield networks and the more recent transformer architectures. we show that the attention mechanism of transformer architectures is actually the update rule of modern hopfield networks that can store exponentially many patterns.
we exploit this high storage capacity of modern hopfield networks to solve a challenging multiple instance learning (mil) problem in computational biology: immune repertoire classification. accurate and interpretable machine learning methods solving this problem could pave the way towards new vaccines and therapies, which is currently a very relevant research topic intensified by the covid-19 crisis. immune repertoire classification based on the vast number of immunosequences of an individual is a mil problem with an unprecedentedly massive number of instances, two orders of magnitude larger than currently considered problems, and with an extremely low witness rate. in this work, we present our novel method deeprc that integrates transformer-like attention, or equivalently modern hopfield networks, into deep learning architectures for massive mil such as immune repertoire classification. we demonstrate that deeprc outperforms all other methods with respect to predictive performance on large-scale experiments, including simulated and real-world virus infection data, and enables the extraction of sequence motifs that are connected to a given disease class. source code and datasets: https://github.com/ml-jku/deeprc
transformer architectures (vaswani et al., 2017) and their attention mechanisms are currently used in many applications, such as natural language processing (nlp), imaging, and also in multiple instance learning (mil) problems. in mil, a set or bag of objects is labelled rather than objects themselves as in standard supervised learning tasks (dietterich et al., 1997). examples of mil problems are medical images, in which each sub-region of the image represents an instance, video classification, in which each frame is an instance, text classification, where words or sentences are instances of a text, point sets, where each point is an instance of a 3d object, and remote sensing data, where each sensor is an instance (carbonneau et al., 2018; uriot, 2019).
[figure caption, beginning not recovered: ... a pooling function f is used to obtain a repertoire-representation z for the input object. finally, an output network o predicts the class label ŷ. b) deeprc uses stacked 1d convolutions for a parameterized function h due to their computational efficiency. potentially, millions of sequences have to be processed for each input object. in principle, also recurrent neural networks (rnns), such as lstms (hochreiter et al., 2007), or transformer networks (vaswani et al., 2017) may be used but are currently computationally too costly. c) attention-pooling is used to obtain a repertoire-representation z for each input object, where deeprc uses weighted averages of sequence-representations. the weights are determined by an update rule of modern hopfield networks that allows the retrieval of exponentially many patterns.]
attention-based mil has been successfully used for image data, for example to identify tiny objects in large images (ilse et al., 2018; pawlowski et al., 2019; tomita et al., 2019; kimeswenger et al., 2019), and transformer-like attention mechanisms have been applied to sets of points and images. however, in mil problems considered by machine learning methods up to now, the number of instances per bag is in the range of hundreds or a few thousands (carbonneau et al., 2018; lee et al., 2019) (see also tab. a2). at the same time the witness rate (wr), the rate of discriminating instances per bag, is already considered low at 1%-5%. we will tackle the problem of immune repertoire classification with hundreds of thousands of instances per bag without instance-level labels and with extremely low witness rates down to 0.01% using an attention mechanism. we show that the attention mechanism of transformers is the update rule of modern hopfield networks (krotov & hopfield, 2016; demircigil et al., 2017) that are generalized to continuous states in contrast to classical hopfield networks (hopfield, 1982).
a detailed derivation and analysis of modern hopfield networks is given in our companion paper (ramsauer et al., 2020) . these novel continuous state hopfield networks allow to store and retrieve exponentially (in the dimension of the space) many patterns (see next section). thus, modern hopfield networks with their update rule, which are used as an attention mechanism in the transformer, enable immune repertoire classification in computational biology. immune repertoire classification, i.e. classifying the immune status based on the immune repertoire sequences, is essentially a text-book example for a multiple instance learning problem (dietterich et al., 1997; maron & lozano-pérez, 1998; wang et al., 2018) . briefly, the immune repertoire of an individual consists of an immensely large bag of immune receptors, represented as amino acid sequences. usually, the presence of only a small fraction of particular receptors determines the immune status with respect to a particular disease (christophersen et al., 2014; emerson et al., 2017) . this is because the immune system has already acquired a resistance if one or few particular immune receptors that can bind to the disease agent are present. therefore, classification of immune repertoires bears a high difficulty since each immune repertoire can contain millions of sequences as instances with only a few indicating the class. further properties of the data that complicate the problem are: (a) the overlap of immune repertoires of different individuals is low (in most cases, maximally low single-digit percentage values) (greiff et al., 2017; elhanati et al., 2018) , (b) multiple different sequences can bind to the same pathogen (wucherpfennig et al., 2007) , and (c) only subsequences within the sequences determine whether binding to a pathogen is possible (dash et al., 2017; glanville et al., 2017; akbar et al., 2019; springer et al., 2020; fischer et al., 2019) . 
in summary, immune repertoire classification can be formulated as multiple instance learning with an extremely low witness rate and large numbers of instances, which represents a challenge for currently available machine learning methods. furthermore, the methods should ideally be interpretable, since the extraction of class-associated sequence motifs is desired to gain crucial biological insights. the acquisition of human immune repertoires has been enabled by immunosequencing technology (georgiou et al., 2014; brown et al., 2019), which makes it possible to obtain the immune receptor sequences and immune repertoires of individuals. each individual is uniquely characterized by their immune repertoire, which is acquired and changed during life. this repertoire may be influenced by all diseases that an individual is exposed to during their life and hence contains highly valuable information about those diseases and the individual's immune status. immune receptors enable the immune system to specifically recognize disease agents or pathogens. each immune encounter is recorded as an immune event into immune memory by preserving and amplifying immune receptors in the repertoire used to fight a given disease. this is, for example, the working principle of vaccination. each human has about $10^7$ to $10^8$ unique immune receptors with low overlap across individuals and sampled from a potential diversity of $> 10^{14}$ receptors (mora & walczak, 2019). the ability to sequence and analyze human immune receptors at large scale has led to fundamental and mechanistic insights into the adaptive immune system and has also opened the opportunity for the development of novel diagnostics and therapy approaches (georgiou et al., 2014; brown et al., 2019). immunosequencing data have been analyzed with computational methods for a variety of different tasks (greiff et al., 2015; shugay et al., 2015; miho et al., 2018; yaari & kleinstein, 2015; wardemann & busse, 2017).
a large part of the available machine learning methods for immune receptor data has been focusing on the individual immune receptors in a repertoire, with the aim to, for example, predict the antigen or antigen portion (epitope) to which these sequences bind or to predict sharing of receptors across individuals (gielis et al., 2019; springer et al., 2020; jurtz et al., 2018; moris et al., 2019; fischer et al., 2019; greiff et al., 2017; sidhom et al., 2019; elhanati et al., 2018) . recently, jurtz et al. (2018) used 1d convolutional neural networks (cnns) to predict antigen binding of t-cell receptor (tcr) sequences (specifically, binding of tcr sequences to peptide-mhc complexes) and demonstrated that motifs can be extracted from these models. similarly, konishi et al. (2019) use cnns, gradient boosting, and other machine learning techniques on b-cell receptor (bcr) sequences to distinguish tumor tissue from normal tissue. however, the methods presented so far predict a particular class, the epitope, based on a single input sequence. immune repertoire classification has been considered as a mil problem in the following publications. a deep learning framework called deeptcr (sidhom et al., 2019) implements several deep learning approaches for immunosequencing data. the computational framework, inter alia, allows for attention-based mil repertoire classifiers and implements a basic form of attention-based averaging. ostmeyer et al. (2019) already suggested a mil method for immune repertoire classification. this method considers 4-mers, fixed sub-sequences of length 4, as instances of an input object and trained a logistic regression model with these 4-mers as input. the predictions of the logistic regression model for each 4-mer were max-pooled to obtain one prediction per input object. 
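the k-mer based mil scheme described above can be illustrated with a minimal sketch. it is not the implementation of ostmeyer et al. (2019): the feature encoding (one indicator per position and residue), the function names `kmers`, `kmer_score` and `predict_bag`, and the toy weights are assumptions made purely for illustration. the sketch only shows the structure of the approach, namely scoring every 4-mer with a logistic model and max-pooling the scores into one prediction per input object.

```python
import math


def kmers(sequence, k=4):
    """Return all contiguous sub-sequences of length k (empty if too short)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def kmer_score(kmer, weights, bias):
    """Logistic-regression score of a single k-mer.

    Assumed (hypothetical) features: one indicator per (position, residue)
    pair; `weights` maps such pairs to their coefficients.
    """
    logit = bias + sum(weights.get((pos, aa), 0.0) for pos, aa in enumerate(kmer))
    return 1.0 / (1.0 + math.exp(-logit))


def predict_bag(sequences, weights, bias=0.0, k=4):
    """Max-pool the k-mer scores of all sequences into one bag-level score."""
    scores = [kmer_score(km, weights, bias)
              for seq in sequences
              for km in kmers(seq, k)]
    return max(scores, default=0.0)


# Toy weights that reward the (made-up) motif "ABCD".
toy_weights = {(0, "A"): 1.0, (1, "B"): 1.0, (2, "C"): 1.0, (3, "D"): 1.0}
positive_bag = ["XXABCDXX", "YYYYYY"]   # one sequence carries the motif
negative_bag = ["XXXXXXXX", "YYYYYY"]   # no sequence carries the motif
print(predict_bag(positive_bag, toy_weights, bias=-2.0))
print(predict_bag(negative_bag, toy_weights, bias=-2.0))
```

because of the max operation, only the single top-scoring k-mer determines the bag-level prediction, and only that k-mer receives a gradient during training.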
this approach is characterized by (a) the rigidity of the k-mer features as compared to convolutional kernels (alipanahi et al., 2015; zhou & troyanskaya, 2015; zeng et al., 2016), (b) the max-pooling operation, which constrains the network to learn from a single, top-ranked k-mer for each iteration over the input object, and (c) the pooling of prediction scores rather than representations (wang et al., 2018). our experiments also support that these choices in the design of the method can lead to constraints on the predictive performance (see table 1). our proposed method, deeprc, also uses a mil approach but considers sequences rather than k-mers as instances within an input object and a transformer-like attention mechanism. deeprc sets out to avoid the above-mentioned constraints of current methods by (a) applying transformer-like attention-pooling instead of max-pooling and learning a classifier on the repertoire rather than on the sequence-representation, (b) pooling learned representations rather than predictions, and (c) using less rigid feature extractors, such as 1d convolutions or lstms. in this work, we contribute the following: we demonstrate that continuous generalizations of binary modern hopfield networks (krotov & hopfield, 2016; demircigil et al., 2017) have an update rule that is known as the attention mechanism in the transformer. we show that these modern hopfield networks have exponential storage capacity, which allows them to extract patterns among a large set of instances (next section). based on this result, we propose deeprc, a novel deep mil method based on modern hopfield networks for large bags of complex sequences, as they occur in immune repertoire classification (section "deep repertoire classification").
we evaluate the predictive performance of deeprc and other machine learning approaches for the classification of immune repertoires in a large comparative study (section "experimental results").
exponential storage capacity of continuous state modern hopfield networks with transformer attention as update rule
in this section, we show that modern hopfield networks have exponential storage capacity, which will later allow us to approach massive multiple-instance learning problems, such as immune repertoire classification. see our companion paper (ramsauer et al., 2020) for a detailed derivation and analysis of modern hopfield networks. we assume patterns $x_1, \dots, x_N \in \mathbb{R}^d$ that are stacked as columns to the matrix $X = (x_1, \dots, x_N)$ and a query pattern $\xi$ that also represents the current state. the largest norm of a pattern is $M = \max_i \|x_i\|$. the separation $\Delta_i$ of a pattern $x_i$ is defined as its minimal dot product difference to any of the other patterns:
$$\Delta_i = \min_{j,\, j \neq i} \left( x_i^T x_i - x_i^T x_j \right).$$
we consider a modern hopfield network with current state $\xi$ and the energy function
$$E = -\beta^{-1} \log\left( \sum_{i=1}^{N} \exp(\beta\, x_i^T \xi) \right) + \beta^{-1} \log N + \tfrac{1}{2}\, \xi^T \xi + \tfrac{1}{2}\, M^2 .$$
for energy $E$ and state $\xi$, the update rule
$$\xi^{\mathrm{new}} = X\, \mathrm{softmax}(\beta X^T \xi) \qquad (1)$$
is proven to converge globally to stationary points of the energy $E$, which are local minima or saddle points (see ramsauer et al. (2020), appendix, theorem a2). surprisingly, the update rule eq. (1) is also the formula of the well-known transformer attention mechanism. to see this more clearly, we simultaneously update several queries $\xi_i$. furthermore, the queries $\xi_i$ and the patterns $x_i$ are linear mappings of vectors $y_i$ into the space $\mathbb{R}^d$. for matrix notation, we set $x_i = W_K^T y_i$, $\xi_i = W_Q^T y_i$ and multiply the result of our update rule with $W_V$. using $Y = (y_1, \dots, y_N)^T$, we define the matrices
$$X^T = K = Y W_K, \qquad Q = Y W_Q, \qquad V = Y W_K W_V = X^T W_V,$$
and the patterns are now mapped to the hopfield space with dimension $d = d_k$. we set $\beta = 1/\sqrt{d_k}$ and change softmax to a row vector. the update rule eq. (1) multiplied by $W_V$ performed for all queries simultaneously becomes in row vector notation:
$$\mathrm{att}(Q, K, V; \beta) = \mathrm{softmax}(\beta\, Q K^T)\, V = \mathrm{softmax}\!\left( \frac{1}{\sqrt{d_k}}\, Q K^T \right) V. \qquad (2)$$
this formula is the transformer attention. if the patterns $x_i$ are well separated, the iterate eq. (1) converges to a fixed point close to a pattern to which the initial $\xi$ is similar. if the patterns are not well separated, the iterate eq. (1) converges to a fixed point close to the arithmetic mean of the patterns. if some patterns are similar to each other but well separated from all other vectors, then a metastable state between the similar patterns exists. iterates that start near a metastable state converge to this metastable state. for details see ramsauer et al. (2020), appendix, sect. a2. typically, the update converges after one update step (see ramsauer et al. (2020), appendix, theorem a8) and has an exponentially small retrieval error (see ramsauer et al. (2020), appendix, theorem a9). our main concern for application to immune repertoire classification is the number of patterns that can be stored and retrieved by the modern hopfield network, equivalently to the transformer attention head. the storage capacity of an attention mechanism is critical for massive mil problems. we first define what we mean by storing and retrieving patterns from the modern hopfield network.
definition 1 (pattern stored and retrieved). we assume that around every pattern $x_i$ a sphere $S_i$ is given. we say $x_i$ is stored if there is a single fixed point $x_i^* \in S_i$ to which all points $\xi \in S_i$ converge, and $S_i \cap S_j = \emptyset$ for $i \neq j$. we say $x_i$ is retrieved if the iteration eq. (1) converged to the single fixed point $x_i^* \in S_i$.
for randomly chosen patterns, the number of patterns that can be stored is exponential in the dimension $d$ of the space of the patterns ($x_i \in \mathbb{R}^d$).
theorem 1. we assume a failure probability $0 < p \leq 1$ and randomly chosen patterns on the sphere with radius $M = K \sqrt{d-1}$.
we define $a := \frac{2}{d-1} \left( 1 + \ln(2 \beta K^2 p\, (d-1)) \right)$, $b := \frac{2 K^2 \beta}{5}$, and $c = \frac{b}{W_0(\exp(a + \ln(b)))}$, where $W_0$ is the upper branch of the lambert $W$ function, and ensure
$$c \geq \left( \frac{2}{\sqrt{p}} \right)^{\frac{4}{d-1}}.$$
then with probability $1 - p$, the number of random patterns that can be stored is
$$N \geq \sqrt{p}\; c^{\frac{d-1}{4}}.$$
examples are $c \geq 3.1546$ for $\beta = 1$, $K = 3$, $d = 20$ and $p = 0.001$ ($a + \ln(b) > 1.27$) and $c \geq 1.3718$ for $\beta = 1$, $K = 1$, $d = 75$, and $p = 0.001$ ($a + \ln(b) < -0.94$). see ramsauer et al. (2020), appendix, theorem a5 for a proof. we have established that a modern hopfield network or a transformer attention mechanism can store and retrieve exponentially many patterns. this allows us to approach mil with massive numbers of instances from which we have to retrieve a few with an attention mechanism.
deep repertoire classification
problem setting and notation. we consider a mil problem, in which an input object $X$ is a bag of $N$ instances $X = \{s_1, \dots, s_N\}$. the instances have neither dependencies nor orderings between them, and $N$ can be different for every object. we assume that each instance $s_i$ is associated with a label $y_i \in \{0, 1\}$, assuming a binary classification task, to which we do not have access. we only have access to a label $y = \max_i y_i$ for an input object or bag. note that this poses a credit assignment problem, since the sequences that are responsible for the label $y$ have to be identified, and that the relation between instance-label and bag-label can be more complex (foulds & frank, 2010). a model $\hat{y} = g(X)$ should be (a) invariant to permutations of the instances and (b) able to cope with the fact that $N$ varies across input objects (ilse et al., 2018), which is a problem also posed by point sets (qi et al., 2017). two principled approaches exist. the first approach is to learn an instance-level scoring function $h: \mathcal{S} \to [0, 1]$, which is then pooled across instances with a pooling function $f$, for example by average-pooling or max-pooling (see below).
the second approach is to construct an instance representation $z_i$ of each instance by $h: \mathcal{S} \to \mathbb{R}^{d_v}$ and then encode the bag, or the input object, by pooling these instance representations (wang et al., 2018) via a function $f$. an output function $o: \mathbb{R}^{d_v} \to [0, 1]$ subsequently classifies the bag. the second approach, the pooling of representations rather than scoring functions, is currently best performing (wang et al., 2018). in the problem at hand, the input object $X$ is the immune repertoire of an individual that consists of a large set of immune receptor sequences (t-cell receptors or antibodies). immune receptors are primarily represented as sequences $s_i$ from a space $s_i \in \mathcal{S}$. these sequences act as the instances in the mil problem. although immune repertoire classification can readily be formulated as a mil problem, it is yet unclear how well machine learning methods solve the above-described problem with a large number of instances $N \gg 10{,}000$ and with instances $s_i$ being complex sequences. next we describe currently used pooling functions for mil problems.
pooling functions for mil problems. different pooling functions equip a model $g$ with the property to be invariant to permutations of instances and with the ability to process different numbers of instances. typically, a neural network $h_\theta$ with parameters $\theta$ is trained to obtain a function that maps each instance onto a representation: $z_i = h_\theta(s_i)$, and then a pooling function $z = f(\{z_1, \dots, z_N\})$ supplies a representation $z$ of the input object $X = \{s_1, \dots, s_N\}$. the following pooling functions are typically used:
average-pooling: $z = \frac{1}{N} \sum_{i=1}^{N} z_i$,
max-pooling: $z = \sum_{m=1}^{d_v} e_m \left( \max_{i,\, 1 \leq i \leq N} \{ z_{im} \} \right)$, where $e_m$ is the standard basis vector for dimension $m$, and
attention-pooling: $z = \sum_{i=1}^{N} a_i z_i$, where $a_i$ are non-negative ($a_i \geq 0$), sum to one ($\sum_{i=1}^{N} a_i = 1$), and are determined by an attention mechanism.
these pooling functions are invariant to permutations of $\{1, \dots, N\}$ and are differentiable.
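the three pooling functions can be sketched in a few lines of plain python. this is an illustrative sketch only; the function names and the list-of-lists layout (one inner list per instance representation) are assumptions and do not reflect the deeprc code base. here the attention weights are simply passed in as an argument.

```python
def average_pool(reps):
    """Average-pooling: mean over instances, per feature dimension."""
    n = len(reps)
    return [sum(column) / n for column in zip(*reps)]


def max_pool(reps):
    """Max-pooling: maximum over instances, per feature dimension."""
    return [max(column) for column in zip(*reps)]


def attention_pool(reps, weights):
    """Attention-pooling: weighted mean with non-negative weights summing to one."""
    return [sum(a * z for a, z in zip(weights, column)) for column in zip(*reps)]


# Two instances with two features each (toy values).
reps = [[1.0, 4.0], [3.0, 2.0]]
print(average_pool(reps))                  # [2.0, 3.0]
print(max_pool(reps))                      # [3.0, 4.0]
print(attention_pool(reps, [0.25, 0.75]))  # [2.5, 2.5]
```

reordering the instances (together with their weights in the attention case) leaves all three outputs unchanged, and with uniform weights attention-pooling reduces to average-pooling.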
therefore, they are suited as building blocks for deep learning architectures. we employ attention-pooling in our deeprc model as detailed in the following.
modern hopfield networks viewed as transformer-like attention mechanisms. the modern hopfield networks, as introduced above, have a storage capacity that is exponential in the dimension of the vector space and converge after just one update (see ramsauer et al. (2020), appendix). additionally, the update rule of modern hopfield networks is known as the key-value attention mechanism, which has been highly successful through the transformer (vaswani et al., 2017) and bert (devlin et al., 2019) models in natural language processing. therefore, using modern hopfield networks with the key-value attention mechanism as update rule is the natural choice for our task. in particular, modern hopfield networks are theoretically justified for storing and retrieving the large number of vectors (sequence patterns) that appear in the immune repertoire classification task. instead of using the terminology of modern hopfield networks, we explain our deeprc architecture in terms of key-value attention (the update rule of the modern hopfield network), since it is well known in the deep learning community. the attention mechanism assumes a space of dimension $d_k$ in which keys and queries are compared. a set of $N$ key vectors are combined to the matrix $K$. a set of $d_q$ query vectors are combined to the matrix $Q$. similarities between queries and keys are computed by inner products, therefore queries can search for similar keys that are stored. another set of $N$ value vectors are combined to the matrix $V$. the output of the attention mechanism is a weighted average of the value vectors for each query $q$. the $i$-th vector $v_i$ is weighted by the similarity between the $i$-th key $k_i$ and the query $q$. the similarity is given by the softmax of the inner products of the query $q$ with the keys $k_i$.
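for a single query, the mechanism just described can be sketched as follows. this is a minimal illustration in plain python under stated assumptions: the names `softmax` and `attend` are hypothetical, the toy keys and values are made up, and the scaling parameter `beta` multiplies the inner products before the softmax.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values, beta):
    """Weighted average of the values; weights are softmax(beta * <query, key_i>)."""
    scores = [beta * sum(q * k for q, k in zip(query, key)) for key in keys]
    weights = softmax(scores)
    output = [sum(w * v for w, v in zip(weights, column)) for column in zip(*values)]
    return output, weights


keys = [[1.0, 0.0], [0.0, 1.0]]       # two stored keys (toy values)
values = [[10.0, 0.0], [0.0, 10.0]]   # one value vector per key
output, weights = attend([1.0, 0.0], keys, values, beta=50.0)
print(weights)  # almost all weight on the first key
print(output)   # close to the first value vector
```

a large `beta` concentrates the weights on the best-matching key, whereas `beta = 0` yields uniform weights and therefore the plain average of the value vectors.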
all queries are calculated in parallel via matrix operations. consequently, the attention function $\mathrm{att}(Q, K, V; \beta)$ maps queries $Q$, keys $K$, and values $V$ to $d_v$-dimensional outputs: $\mathrm{att}(Q, K, V; \beta) = \mathrm{softmax}(\beta\, Q K^T)\, V$ (see also eq. (2)). while this attention mechanism has originally been developed for sequence tasks (vaswani et al., 2017), it can be readily transferred to sets (ye et al., 2018). this type of attention mechanism will be employed in deeprc.
the deeprc method. we propose a novel method deep repertoire classification (deeprc) for immune repertoire classification with attention-based deep massive multiple instance learning and compare it against other machine learning approaches. for deeprc, we consider immune repertoires as input objects, which are represented as bags of instances. in a bag, each instance is an immune receptor sequence and each bag can contain a large number of sequences. note that we will use $z_i$ to denote the sequence-representation of the $i$-th sequence and $z$ to denote the repertoire-representation. at the core, deeprc consists of a transformer-like attention mechanism that extracts the most important information from each repertoire. we first give an overview of the attention mechanism and then provide details on each of the sub-networks $h_1$, $h_2$, and $o$ of deeprc.
attention mechanism in deeprc. this mechanism is based on the three matrices $K$ (the keys), $Q$ (the queries), and $V$ (the values) together with a parameter $\beta$.
values. deeprc uses a 1d convolutional network $h_1$ (lecun et al., 1998; hu et al., 2014; kelley et al., 2016) that supplies a sequence-representation $z_i = h_1(s_i)$, which acts as the values $V = Z = (z_1, \dots, z_N)$ in the attention mechanism (see figure 2).
keys. a second neural network $h_2$, which shares its first layers with $h_1$, is used to obtain keys $K \in \mathbb{R}^{N \times d_k}$ for each sequence in the repertoire.
this network uses 2 self-normalizing layers (klambauer et al., 2017) with 32 units per layer (see figure 2).
query. we use a fixed $d_k$-dimensional query vector $\xi$ which is learned via backpropagation. for more attention heads, each head has a fixed query vector. with the quantities introduced above, the transformer attention mechanism (eq. (2)) of deeprc is implemented as follows:
$$z = \mathrm{att}\!\left( \xi^T, K, Z; \tfrac{1}{\sqrt{d_k}} \right) = \mathrm{softmax}\!\left( \frac{\xi^T K^T}{\sqrt{d_k}} \right) Z,$$
where $Z \in \mathbb{R}^{N \times d_v}$ are the sequence-representations stacked row-wise, $K$ are the keys, and $z$ is the repertoire-representation and at the same time a weighted mean of sequence-representations $z_i$. the attention mechanism can readily be extended to multiple queries; however, computational demand could constrain this depending on the application and dataset. theorem 1 demonstrates that this mechanism is able to retrieve a single pattern out of several hundreds of thousands.
attention-pooling and interpretability. each input object, i.e. repertoire, consists of a large number $N$ of sequences, which are reduced to a single fixed-size feature vector of length $d_v$ representing the whole input object by an attention-pooling function. to this end, a transformer-like attention mechanism adapted to sets is realized in deeprc, which supplies $a_i$, the importance of the sequence $s_i$. this importance value is an interpretable quantity, which is highly desired for the immunological problem at hand. thus, deeprc allows for two forms of interpretability methods. (a) a trained deeprc model can compute attention weights $a_i$, which directly indicate the importance of a sequence. (b) deeprc furthermore allows for the usage of contribution analysis methods, such as integrated gradients (ig) (sundararajan et al., 2017) or layer-wise relevance propagation (montavon et al., 2018; arras et al., 2019). see sect. a8 for details.
classification layer and network parameters.
the repertoire-representation $z$ is then used as input for a fully-connected output network $\hat{y} = o(z)$ that predicts the immune status, where we found it sufficient to train single-layer networks. in the simplest case, deeprc predicts a single target, the class label $y$, e.g. the immune status of an immune repertoire, using one output value. however, since deeprc is an end-to-end deep learning model, multiple targets may be predicted simultaneously in classification or regression settings or a mix of both. this allows for the introduction of additional information into the system via auxiliary targets such as age, sex, or other metadata.
[figure caption, beginning not recovered: ... table 1 with sub-networks $h_1$, $h_2$, and $o$. $d_l$ indicates the sequence length.]
network parameters, training, and inference. deeprc is trained using standard gradient descent methods to minimize a cross-entropy loss. the network parameters are $\theta_1$, $\theta_2$, $\theta_o$ for the sub-networks $h_1$, $h_2$, and $o$, respectively, and additionally $\xi$. in more detail, we train deeprc using adam (kingma & ba, 2014) with a batch size of 4 and dropout of input sequences.
implementation. to reduce computational time, the attention network first computes the attention weights $a_i$ for each sequence $s_i$ in a repertoire. subsequently, the top 10% of sequences with the highest $a_i$ per repertoire are used to compute the weight updates and prediction. furthermore, computation of $z_i$ is performed in 16-bit, others in 32-bit precision to ensure numerical stability in the softmax. see sect. a2 for details.
in this section, we report and analyze the predictive power of deeprc and the compared methods on several immunosequencing datasets. the roc-auc is used as the main metric for the predictive power.
methods compared. we compared previous methods for immune repertoire classification, (ostmeyer et al., 2019) ("log. mil (kmer)", "log. mil (tcrb)") and a burden test (emerson et al., 2017), as well as the baseline methods logistic regression ("log.
regr."), k-nearest neighbour ("knn"), and support vector machines ("svm") with kernels designed for sets, such as the jaccard kernel ("j") and the minmax ("mm") kernel (ralaivola et al., 2005). for the simulated data, we also added baseline methods that search for the implanted motif either in binary or continuous fashion ("known motif b.", "known motif c.") assuming that this motif was known (for details, see sect. a4).
datasets. we aimed at constructing immune repertoire classification scenarios with varying degrees of difficulty and realism in order to compare and analyze the suggested machine learning methods. to this end, we either use simulated or experimentally-observed immune receptor sequences and we implant signals, specifically, sequence motifs or sets thereof (weber et al., 2020), at different frequencies into sequences of repertoires of the positive class. these frequencies represent the witness rates and range from 0.01% to 10%. overall, we compiled four categories of datasets: (a) simulated immunosequencing data with implanted signals, (b) lstm-generated immunosequencing data with implanted signals, (c) real-world immunosequencing data with implanted signals, and (d) real-world immunosequencing data with known immune status, the cmv dataset (emerson et al., 2017). the average number of instances per bag, which is the number of sequences per immune repertoire, is ≈300,000 except for category (c), in which we consider the scenario of low-coverage data with only 10,000 sequences per repertoire. the number of repertoires per dataset ranges from 785 to 5,000. in total, all datasets comprise ≈30 billion sequences or instances. this represents the largest comparative study on immune repertoire classification (see sect. a3).
hyperparameter selection. we used a nested 5-fold cross validation (cv) procedure to estimate the performance of each of the methods.
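a generic nested cross-validation loop can be sketched as below. this is a minimal sketch and not the authors' exact protocol (their split is described in sect. a5 and may differ): the helper names `k_fold_indices` and `nested_cv` as well as the callback `fit_and_score(param, train_idx, eval_idx)` are hypothetical. the sketch shows the key property of the procedure, namely that hyperparameters are chosen using inner-loop validation folds only, so that the outer test fold never influences model selection.

```python
def k_fold_indices(n_samples, n_folds):
    """Split range(n_samples) into n_folds disjoint, nearly equal folds."""
    indices = list(range(n_samples))
    return [indices[i::n_folds] for i in range(n_folds)]


def nested_cv(n_samples, candidates, fit_and_score, n_folds=5):
    """Return one test score per outer fold.

    fit_and_score(param, train_idx, eval_idx) is a user-supplied (hypothetical)
    callback that trains with hyperparameter `param` on train_idx and returns
    a score (higher is better) on eval_idx.
    """
    folds = k_fold_indices(n_samples, n_folds)
    outer_scores = []
    for t in range(n_folds):
        test_idx = folds[t]
        inner_folds = [fold for i, fold in enumerate(folds) if i != t]
        best_param, best_score = None, float("-inf")
        for param in candidates:
            # Inner loop: each remaining fold serves once as validation set.
            val_scores = []
            for v, val_idx in enumerate(inner_folds):
                train_idx = [j for i, fold in enumerate(inner_folds)
                             if i != v for j in fold]
                val_scores.append(fit_and_score(param, train_idx, val_idx))
            mean_score = sum(val_scores) / len(val_scores)
            if mean_score > best_score:
                best_param, best_score = param, mean_score
        # Outer loop: refit with the selected hyperparameter, score on test fold.
        outer_train_idx = [j for fold in inner_folds for j in fold]
        outer_scores.append(fit_and_score(best_param, outer_train_idx, test_idx))
    return outer_scores
```

the mean and standard deviation of the returned per-fold scores then summarize the estimated performance.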
all methods could adjust their most important hyperparameters on a validation set in the inner loop of the procedure. see sect. a5 for details.
table 1: results in terms of auc of the competing methods on all datasets. the reported errors are standard deviations across 5 cross-validation (cv) folds (except for the column "simulated"). real-world cmv: average performance over 5 cv folds on the cmv dataset (emerson et al., 2017). real-world data with implanted signals: average performance over 5 cv folds for each of the four datasets. a signal was implanted with a frequency (=witness rate) of 1% or 0.1%. either a single motif ("om") or multiple motifs ("mm") were implanted. lstm-generated data: average performance over 5 cv folds for each of the 5 datasets. in each dataset, a signal was implanted with a frequency of 10%, 1%, 0.5%, 0.1%, or 0.05%, respectively. simulated: here we report the mean over 18 simulated datasets with implanted signals and varying difficulties (see tab. a9 for details). the error reported is the standard deviation of the auc values across the 18 datasets.
results. in each of the four categories, "real-world data", "real-world data with implanted signals", "lstm-generated data", and "simulated immunosequencing data", deeprc outperforms all competing methods with respect to average auc. across categories, the runner-up methods are either the svm for mil problems with minmax kernel or the burden test (see table 1 and sect. a6).
results on simulated immunosequencing data. in this setting the complexity of the implanted signal is in focus and varies throughout 18 simulated datasets (see sect. a3). some datasets are challenging for the methods because the implanted motif is hidden by noise and others because only a small fraction of sequences carries the motif, and hence have a low witness rate. these difficulties become evident by the method called "known motif binary", which assumes the implanted motif is known.
the performance of this method ranges from a perfect auc of 1.000 in several datasets to an auc of 0.532 in dataset '17' (see sect. a6). deeprc outperforms all other methods with an average auc of 0.846 ± 0.223, followed by the svm with minmax kernel with an average auc of 0.827 ± 0.210 (see sect. a6). the predictive performance of all methods suffers if the signal occurs only in an extremely small fraction of sequences. in datasets, in which only 0.01% of the sequences carry the motif, all auc values are below 0.550. results on lstm-generated data. on the lstm-generated data, in which we implanted noisy motifs with frequencies of 10%, 1%, 0.5%, 0.1%, and 0.05%, deeprc yields almost perfect predictive performance with an average auc of 1.000 ± 0.001 (see sect. a6 and a7). the second best method, svm with minmax kernel, has a similar predictive performance to deeprc on all datasets but the other competing methods have a lower predictive performance on datasets with low frequency of the signal (0.05%). results on real-world data with implanted motifs. in this dataset category, we used real immunosequences and implanted single or multiple noisy motifs. again, deeprc outperforms all other methods with an average auc of 0.980 ± 0.029, with the second best method being the burden test with an average auc of 0.883 ± 0.170. notably, all methods except for deeprc have difficulties with noisy motifs at a frequency of 0.1% (see tab. a11) . results on real-world data. on the real-world dataset, in which the immune status of persons affected by the cytomegalovirus has to be predicted, the competing methods yield predictive aucs between 0.515 and 0.825 (see table 1 ). we note that this dataset is not the exact dataset that was used in emerson et al. (2017) . it differs in pre-processing and also comprises a different set of samples and a smaller training set due to the nested 5-fold cross-validation procedure, which leads to a more challenging dataset. 
the best performing method is deeprc with an auc of 0.831 ± 0.002, followed by the svm with minmax kernel (auc 0.825 ± 0.022) and the burden test with an auc of 0.699 ± 0.041. the top-ranked sequences by deeprc significantly correspond to those detected by emerson et al. (2017) , which we tested by a mann-whitney u-test with the null hypothesis that the attention values of the sequences detected by emerson et al. (2017) would be equal to the attention values of the remaining sequences (p-value of 1.3 · 10 −93 ). the sequence attention values are displayed in tab. a14. we have demonstrated how modern hopfield networks and attention mechanisms enable successful classification of the immune status of immune repertoires. for this task, methods have to identify the discriminating sequences amongst a large set of sequences in an immune repertoire. specifically, even motifs within those sequences have to be identified. we have shown that deeprc, a modern hopfield network and an attention mechanism with a fixed query, can solve this difficult task despite the massive number of instances. deeprc furthermore outperforms the compared methods across a range of different experimental conditions. impact on machine learning and related scientific fields. 
we envision that with (a) the increasing availability of large immunosequencing datasets (kovaltsuk et al., 2018; corrie et al., 2018; christley et al., 2018; zhang et al., 2020; rosenfeld et al., 2018; shugay et al., 2018), (b) further fine-tuning of ground-truth benchmarking immune receptor datasets (weber et al., 2020; olson et al., 2019; marcou et al., 2018), (c) accounting for repertoire-impacting factors such as age, sex, ethnicity, and environment (potential confounding factors), and (d) increased gpu memory and increased computing power, it will be possible to identify discriminating immune receptor motifs for many diseases, potentially even for the current sars-cov-2 (covid-19) pandemic (minervina et al., 2020; galson et al., 2020). such results would greatly benefit ongoing research on antibody and tcr-driven immunotherapies and immunodiagnostics as well as rational vaccine design (brown et al., 2019). in the course of this development, the experimental verification and interpretation of machine-learning-identified motifs could receive additional focus, as for most of the sequences within a repertoire the corresponding antigen is unknown. nevertheless, recent technological breakthroughs in high-throughput antigen-labeled immunosequencing are beginning to generate large-scale antigen-labeled single-immune-receptor-sequence data, thus resolving this longstanding problem (setliff et al., 2019). from a machine learning perspective, the successful application of deeprc on immune repertoires with their large number of instances per bag might encourage the application of modern hopfield networks and attention mechanisms on new, previously unsolved or unconsidered, datasets and problems. impact on society. if the approach proves itself successful, it could lead to faster testing of individuals for their immune status w.r.t. a range of diseases based on blood samples. this might motivate changes in the pipeline of diagnostics and tracking of diseases, e.g.
automated testing of the immune status in regular intervals. it would furthermore make the collection and screening of blood samples for larger databases more attractive. in consequence, the improved testing of immune statuses might identify individuals that do not have a working immune response towards certain diseases to governments or insurance companies, which could then push for targeted immunisation of the individual. similarly to compulsory vaccination, such testing for the immune status could be made compulsory by governments, possibly violating privacy or personal self-determination in exchange for increased overall health of a population. ultimately, if the approach proves itself successful, the insights gained from the screening of individuals that have successfully developed resistances against specific diseases could lead to faster targeted immunisation, once a certain number of individuals with resistances can be found. this might strongly decrease the harm done by e.g. pandemics and lead to a change in the societal perception of such diseases. consequences of failures of the method. as is common with methods in machine learning, potential danger lies in the possibility that users rely too much on our new approach and use it without reflecting on the outcomes. however, the full pipeline in which our method would be used includes wet lab tests after its application, to verify and investigate the results, gain insights, and possibly derive treatments. failures of the proposed method would lead to unsuccessful wet lab validation and negative wet lab tests. since the proposed algorithm does not directly suggest treatment or therapy, human beings are not directly at risk of being treated with a harmful therapy. substantial wet lab and in-vitro testing would follow and would indicate wrong decisions by the system. leveraging of biases in the data and potential discrimination.
as for almost all machine learning methods, confounding factors, such as age or sex, could be used for classification. this might lead to biases in predictions or uneven predictive performance across subgroups. as a result, failures in the wet lab would occur (see paragraph above). moreover, insights into the relevance of the confounding factors could be gained, leading to possible therapies or counter-measures concerning said factors. furthermore, the amount of data available with respect to relevant confounding factors could lead to better or worse performance of our method. e.g. a dataset consisting mostly of data from individuals within a specific age group might yield better performance for that age group, possibly resulting in better or exclusive treatment methods for that specific group. here again, the application of deeprc would be followed by in-vitro testing and development of a treatment, where all target groups for the treatment have to be considered accordingly. all datasets and code are available at https://github.com/ml-jku/deeprc. the cmv dataset is publicly available at https://clients.adaptivebiotech.com/pub/emerson-2017-natgen. in section a2 we provide details on the architecture of deeprc, in section a3 we present details on the datasets, in section a4 we explain the methods that we compared, and in section a5 we elaborate on the hyperparameters and their selection process. then, in section a6 we present detailed results for each dataset category in tabular form, in section a7 we provide information on the lstm model that was used to generate antibody sequences, in section a8 we show how deeprc can be interpreted, in section a9 we show the correspondence of previously identified tcr sequences for cmv immune status with attention values by deeprc, and finally we present variations and an ablation study of deeprc in section a10. input layer. for the input layer of the cnn, the characters in the input sequence, i.e.
the amino acids (aas), are encoded in a one-hot vector of length 20. to also provide information about the position of an aa in the sequence, we add 3 additional input features with values in range [0, 1] to encode the position of an aa relative to the sequence. these 3 positional features encode whether the aa is located at the beginning, the center, or the end of the sequence, respectively, as shown in figure a1 . we concatenate these 3 positional features with the one-hot vector of aas, which results in a feature vector of size 23 per sequence position. each repertoire, now represented as a bag of feature vectors, is then normalized to unit variance. since the cytomegalovirus dataset (cmv dataset) provides sequences with an associated abundance value per sequence, which is the number of occurrences of a sequence in a repertoire, we incorporate this information into the input of deeprc. to this end, the one-hot aa features of a sequence are multiplied by a scaling factor of log(c a ) before normalization, where c a is the abundance of a sequence. we feed the sequences with 23 features per position into the cnn. sequences of different lengths were zero-padded to the maximum sequence length per batch at the sequence ends. 1d cnn for motif recognition. in the following, we describe how deeprc identifies patterns in the individual sequences and reduces each sequence in the input object to a fixed-size feature vector. deeprc employs 1d convolution layers to extract patterns, where trainable weight kernels are convolved over the sequence positions. in principle, also recurrent neural networks (rnns) or transformer networks could be used instead of 1d cnns, however, (a) the computational complexity of the network must be low to be able to process millions of sequences for a single update. additionally, (b) the learned network should be able to provide insights in the recognized patterns in form of motifs. 
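the input encoding described above (a one-hot vector of length 20 concatenated with 3 positional features, optionally scaled by the log-abundance) can be illustrated with a minimal sketch. the exact shape of the positional features is defined in figure a1, which is not reproduced here, so the piecewise-linear ramps below are an assumption; the function name `encode_sequence`, the alphabet ordering, and the `abundance` argument are hypothetical, and the per-repertoire normalization to unit variance is omitted.

```python
import math

# Alphabet ordering is an assumption; any fixed order of the 20 amino acids works.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def encode_sequence(seq, abundance=None):
    """Return one 23-dimensional feature vector per sequence position.

    Features 0..19: one-hot amino acid, scaled by log(abundance) if given.
    Features 20..22: start / center / end position features in [0, 1]
    (piecewise-linear ramps, an assumed stand-in for figure a1).
    """
    n = len(seq)
    scale = 1.0 if abundance is None else math.log(abundance)
    features = []
    for pos, aa in enumerate(seq):
        one_hot = [0.0] * len(AMINO_ACIDS)
        one_hot[AMINO_ACIDS.index(aa)] = scale
        t = pos / (n - 1) if n > 1 else 0.0   # relative position in [0, 1]
        start = max(0.0, 1.0 - 2.0 * t)        # 1 at the first position
        end = max(0.0, 2.0 * t - 1.0)          # 1 at the last position
        center = 1.0 - start - end             # 1 in the middle
        features.append(one_hot + [start, center, end])
    return features


x = encode_sequence("CASSLDR")
```

each row of `x` has 23 entries, matching the feature size per sequence position stated in the text.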
both properties (a) and (b) are fulfilled by 1d convolution operations that are used by deeprc. we use one 1d cnn layer (hu et al., 2014) with selu activation functions (klambauer et al., 2017) to identify the relevant patterns in the input sequences with a computationally light-weight operation. the larger the kernel size, the more surrounding sequence positions are taken into account, which influences the length of the motifs that can be extracted. we therefore adjust the kernel size during hyperparameter search. in prior works (ostmeyer et al., 2019) , a k-mer size of 4 yielded good predictive performance, which could indicate that a kernel size in the range of 4 may be a proficient choice. for d v trainable kernels, this produces a feature vector of length d v at each sequence position. subsequently, global max-pooling over all sequence positions of a sequence reduces the sequence-representations z i to vectors of the fixed length d v . given the challenging size of the input data per repertoire, the computation of the cnn activations and weight updates is performed using 16-bit floating point values. a list of hyperparameters evaluated for deeprc is given in table a3 . a comparison of rnn-based and cnn-based sequence embedding for motif recognition in a smaller experimental setting is given in sec. a10. regularization. we apply random and attention-based subsampling of repertoire sequences to reduce over-fitting and decrease computational effort. during training, each repertoire is subsampled to 10, 000 input sequences, which are randomly drawn from the respective repertoire. this can also be interpreted as random drop-out (hinton et al., 2012) on the input sequences or attention weights. during training and evaluation, the attention weights computed by the attention network are furthermore used to rank the input sequences. based on this ranking, the repertoire is reduced to the 10% of sequences with the highest attention weights. 
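the two subsampling steps described above (random subsampling to 10,000 sequences, then keeping the 10% with the highest attention weights) can be sketched as follows. this is not the authors' implementation: the function name, its arguments, and the final softmax-weighted average over the retained sequences are assumptions made for illustration, with per-sequence feature vectors and attention scores taken as given.

```python
import math
import random


def subsample_and_pool(features, attention_logits, n_random=10_000,
                       keep_fraction=0.1, rng=None):
    """Randomly subsample a repertoire, keep the top-attention fraction,
    and pool the kept feature vectors with softmax attention weights."""
    rng = rng or random.Random(0)
    idx = list(range(len(features)))
    if len(idx) > n_random:                      # random subsampling
        idx = rng.sample(idx, n_random)
    idx.sort(key=lambda i: attention_logits[i], reverse=True)
    n_keep = max(1, int(len(idx) * keep_fraction))
    top = idx[:n_keep]                           # attention-based subsampling
    # Numerically stable softmax over the kept sequences (assumed pooling).
    m = max(attention_logits[i] for i in top)
    weights = [math.exp(attention_logits[i] - m) for i in top]
    total = sum(weights)
    weights = [w / total for w in weights]
    dim = len(features[0])
    pooled = [sum(w * features[i][k] for w, i in zip(weights, top))
              for k in range(dim)]
    return top, pooled


feats = [[float(i)] for i in range(20)]
logits = [float(i) for i in range(20)]
top, pooled = subsample_and_pool(feats, logits)
```

with 20 toy sequences, only the two highest-scoring ones are retained and contribute to the pooled representation.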
these top 10% of sequences are then used to compute the weight updates and the prediction for the repertoire. additionally, one might employ further regularization techniques, which we only partly investigated further in a smaller experimental setting in sec. a10 due to high computational demands. such regularization techniques include l1 and l2 weight decay, noise in the form of random aa permutations in the input sequences, noise on the attention weights, or random shuffling of sequences between repertoires that belong to the negative class. the last regularization technique assumes that the sequences in positive-class repertoires carry a signal, such as an aa motif corresponding to an immune response, whereas the sequences in negative-class repertoires do not. hence, the sequences can be shuffled randomly between negative class repertoires without obscuring the signal in the positive class repertoires. hyperparameters. for the hyperparameter search of deeprc for the category "simulated immunosequencing data", we only conducted a full hyperparameter search on the more difficult datasets with motif implantation probabilities below 1%, as described in table a3 . this process was repeated for all 5 folds of the 5-fold cross-validation (cv) and the average score on the 5 test sets constitutes the final score of a method. table a3 provides an overview of the hyperparameter search, which was conducted as a grid search for each of the datasets in a nested 5-fold cv procedure, as described in section a4. computation time and optimization. we took measures on the implementation level to address the high computational demands, especially gpu memory consumption, in order to make the large number of experiments feasible. we train the deeprc model with a small batch size of 4 samples and perform computation of inference and updates of the 1d cnn using 16-bit floating point values. the rest of the network is trained using 32-bit floating point values. 
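one practical consequence of 16-bit arithmetic is its limited range for very small constants. the following sketch, using only the python standard library, rounds two candidate values of a stability constant to half precision; the helper name `to_float16` is hypothetical, and linking this underflow to any specific optimizer setting is an interpretation rather than something stated by the source.

```python
import struct


def to_float16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


tiny = to_float16(1e-8)     # below the smallest half-precision subnormal
larger = to_float16(1e-4)   # comfortably representable in half precision
```

here `tiny` rounds to zero, whereas `larger` stays close to its nominal value, which is why constants of the order of 10^-8 are problematic when parts of a model are computed in 16-bit floating point.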
the adam parameter for numerical stability was therefore increased from the default value of $\epsilon = 10^{-8}$ to $\epsilon = 10^{-4}$. training was performed on various gpu types, mainly nvidia rtx 2080 ti. computation times were highly dependent on the number of sequences in the repertoires and the number and sizes of cnn kernels. a single update on an nvidia rtx 2080 ti gpu took approximately 0.0129 to 0.0135 seconds, while requiring approximately 8 to 11 gb gpu memory. taking these optimizations and gpus with larger memory (≥ 16 gb) into account, it is already possible to train deeprc, possibly with multi-head attention and a larger network architecture, on larger datasets (see sec. a10). our network implementation is based on pytorch 1.3.1 (paszke et al., 2019). incorporation of additional inputs and metadata. additional metadata in the form of sequence-level or repertoire-level features could be incorporated into the input via concatenation with the feature vectors that result from taking the maximum of the 1d cnn outputs w.r.t. the sequence positions. this has the benefit that the attention mechanism and output network can utilize the sequence-level or repertoire-level features for their predictions. sparse metadata or metadata that is only available during training could be used as auxiliary targets to incorporate the information via gradients into the deeprc model. limitations. the current methods are mostly limited by computational complexity, since both hyperparameter and model selection are computationally demanding. for hyperparameter selection, a large number of hyperparameter settings have to be evaluated. for model selection, a single repertoire requires the propagation of many thousands of sequences through a neural network and keeping those quantities in gpu memory in order to perform the attention mechanism and weight update. thus, increased gpu memory would significantly boost our approach.
increased computational power would also allow for more advanced architectures and attention mechanisms, which may further improve predictive performance. another limiting factor is over-fitting of the model due to the currently relatively small number of samples (bags) in real-world immunosequencing datasets in comparison to the large number of instances per bag and features per instance. we aimed at constructing immune repertoire classification scenarios with varying degrees of realism and difficulty in order to compare and analyze the suggested machine learning methods. to this end, we either use simulated or experimentally-observed immune receptor sequences and we implant signals, which are sequence motifs (weber et al., 2020), into sequences of repertoires of the positive class. it has been shown previously that the interaction of immune receptors with antigens occurs via short sequence stretches . thus, implantation of short motif sequences simulating an immune signal is biologically meaningful. our benchmarking study comprises four different categories of datasets: (a) simulated immunosequencing data with implanted signals (where the signal is defined as sets of motifs), (b) lstm-generated immunosequencing data with implanted signals, (c) real-world immunosequencing data with implanted signals, and (d) real-world immunosequencing data. each of the first three categories consists of multiple datasets with varying difficulty depending on the type of the implanted signal and the ratio of sequences with the implanted signal. the ratio of sequences with the implanted signal, where each sequence carries at most 1 implanted signal, corresponds to the witness rate (wr). we consider binary classification tasks to simulate the immune status of healthy and diseased individuals. we randomly generate immune repertoires with varying numbers of sequences, where we implant sequence motifs in the repertoires of the diseased individuals, i.e. the positive class.
the sequences of a repertoire are also randomly generated by different procedures (detailed below). each sequence is composed of 20 different characters, corresponding to amino acids, and has an average length of 14.5 aas. in the first category, we aim at investigating the impact of the signal frequency, i.e. the wr, and the signal complexity on the performance of the different methods. to this end, we created 18 datasets, where each dataset contains a large number of repertoires with a large number of random aa sequences per repertoire. we then implanted signals in the aa sequences of the positive class repertoires, where the 18 datasets differ in frequency and complexity of the implanted signals. in detail, the aas were sampled randomly, independently of their respective position in the sequence, while the frequencies of aas, the distribution of sequence lengths, and the distribution of the number of sequences per repertoire, i.e. the number of instances per bag, follow the respective distributions observed in the real-world cmv dataset (emerson et al., 2017). for this, we first sampled the number of sequences for a repertoire from a gaussian distribution $\mathcal{N}(\mu = 316\mathrm{k}, \sigma = 132\mathrm{k})$ and rounded to the nearest positive integer. we re-sampled if the size was below 5k. we then generated random sequences of aas with lengths drawn from $\mathcal{N}(\mu = 14.5, \sigma = 1.8)$, again rounded to the nearest positive integer. each simulated repertoire was then randomly assigned to either the positive or negative class, with 2,500 repertoires per class. in the repertoires assigned to the positive class, we implanted motifs with an average length of 4 aas, following the results of the experimental analysis of antigen-binding motifs in antibodies and t-cell receptor sequences by . we varied the characteristics of the implanted motifs for each of the 18 datasets with respect to the following parameters: (a) ρ, the probability of a motif being implanted in a sequence of a positive repertoire, i.e.
the average ratio of sequences containing the motif, which is the witness rate. in this way, we generated 18 different datasets of variable difficulty containing in total roughly 28.7 billion sequences. see table a1 for an overview of the properties of the implanted motifs in the 18 datasets. in the second dataset category, we investigate the impact of the signal frequency and complexity in combination with more plausible immune receptor sequences by taking into account the positional aa distributions and other sequence properties. to this end, we trained an lstm (hochreiter & schmidhuber, 1997) in a standard next character prediction (graves, 2013) setting to create aa sequences with properties similar to experimentally observed immune receptor sequences. in the first step, the lstm model was trained on all immuno-sequences in the cmv dataset (emerson et al., 2017) that contain valid information about sequence abundance and have a known cmv label. such an lstm model is able to capture various properties of the sequences, including position-dependent probability distributions and combinations, relationships, and order of aas. we then used the trained lstm model to generate 1,000 repertoires in an autoregressive fashion, starting with a start sequence that was randomly sampled from the trained-on dataset. based on a visual inspection of the frequencies of 4-mers (see section a7), the similarity of lstm-generated sequences and real sequences was deemed sufficient for the purpose of generating the aa sequences for the datasets in this category. further details on lstm training and repertoire generation are given in section a7. after generation, each repertoire was assigned to either the positive or negative class, with 500 repertoires per class. we implanted motifs of length 4 with varying properties in the center of the sequences of the positive class to obtain 5 different datasets.
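implanting a motif in the center of a fraction of the sequences of a positive repertoire can be sketched as below. the function names, the centering rule, and the default per-position implantation probability `p_pos` are assumptions made for illustration; this is not the authors' data-generation code, and the motif and sequences in the usage lines are toy inputs.

```python
import random


def implant_motif(seq, motif, p_pos=0.9, rng=random):
    """Overwrite the center of `seq` with `motif`; each motif position is
    implanted with probability `p_pos`, otherwise the original aa remains."""
    start = max(0, (len(seq) - len(motif)) // 2)
    chars = list(seq)
    for offset, aa in enumerate(motif):
        i = start + offset
        if i < len(chars) and rng.random() < p_pos:
            chars[i] = aa
    return "".join(chars)


def implant_in_repertoire(sequences, motif, rho, p_pos=0.9, seed=0):
    """Implant `motif` into each sequence with probability `rho`
    (the witness rate); all other sequences are left unchanged."""
    rng = random.Random(seed)
    return [implant_motif(s, motif, p_pos, rng) if rng.random() < rho else s
            for s in sequences]


positive_repertoire = implant_in_repertoire(["AAAAAAAAAA"] * 1000, "LDRW", rho=0.01)
```

setting `p_pos=1.0` yields noise-free motifs, and `rho` directly controls the expected fraction of motif-carrying sequences.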
each sequence in the positive repertoires has a probability ρ to carry the motif, which was varied throughout 5 datasets and corresponds to the wr (see table a1). each position in the motif has a probability of 0.9 to be implanted and consequently a probability of 0.1 that the original aa in the sequence remains, which can be seen as noise on the motif.
table a1: properties of simulated repertoires, variations of motifs, and motif frequencies, i.e. the witness rate, for the datasets in categories "simulated immunosequencing data", "lstm-generated data", and "real-world data with implanted signals". noise types for * are explained in paragraph "real-world data with implanted signals".
in the third category, we implanted signals into experimentally obtained immuno-sequences, where we considered 4 dataset variations. each dataset consists of 750 repertoires for each of the two classes, where each repertoire consists of 10k sequences. in this way, we aim to simulate datasets with a low sequencing coverage, which means that only relatively few sequences per repertoire are available. the sequences were randomly sampled from healthy (cmv negative) individuals from the cmv dataset (see below paragraph for explanation). two signal types were considered: (a) one signal with one motif. the aa motif ldr was implanted in a certain fraction of sequences. the pattern is randomly altered at one of the three positions with probabilities 0.2, 0.6, and 0.2, respectively. (b) one signal with multiple motifs. one of the three possible motifs ldr, cas, and gl-n was implanted with equal probability. again, the motifs were randomly altered before implantation. the aa motif ldr changed as described above. the aa motif cas was altered at the second position with probability 0.6 and with probability 0.3 at the first position.
the pattern gl-n, where "-" denotes a gap location, is randomly altered at the first position with probability 0.6 and the gap has a length of 0, 1, or 2 aas with equal probability. additionally, the datasets differ in the values for ρ, the average ratio of sequences carrying a signal, which were chosen as 1% or 0.1%. the motifs were implanted at positions 107, 109, and 114 according to the imgt numbering scheme for immune receptor sequences (lefranc et al., 2003) with probabilities 0.3, 0.35 and 0.2, respectively. with the remaining 0.15 chance, the motif is implanted at any other sequence position. this means that the motif occurrence in the simulated sequences is biased towards the middle of the sequence. we used a real-world dataset of 785 repertoires, each containing between 4,371 and 973,081 (avg. 299,319) tcr sequences with a length of 1 to 27 (avg. 14.5) aas, originally collected and provided by emerson et al. (2017). of the 785 repertoires, 340 were labelled as positive for cytomegalovirus (cmv) serostatus, which we consider as the positive class, 420 were labelled as negative, which we consider as the negative class, and 25 had unknown status. we changed the number of sequence counts per repertoire from −1 to 1 for 3 sequences. furthermore, we exclude a total of 99 repertoires with unknown cmv status or unknown information about the sequence abundance within a repertoire, reducing the dataset for our analysis to 686 repertoires, 312 of which have positive and 374 negative cmv status. we give a non-exhaustive overview of previously considered mil datasets and problems in table a2. to our knowledge the datasets considered in this work pose the most challenging mil problems with respect to the number of instances per bag (column 5).
table a2: mil datasets with their numbers of bags and numbers of instances. "total number of instances" refers to the total number of instances in the dataset.
the simulated and real-world immunosequencing datasets considered in this work contain a number of instances per bag that is orders of magnitude larger than in the mil datasets considered by machine learning methods up to now.
we evaluate and compare the performance of deeprc against a set of machine learning methods that serve as baselines, were suggested for, or can readily be adapted to immune repertoire classification. in this section, we describe these compared methods. known motif. this method serves as an estimate for the achievable classification performance using prior knowledge about which motif was implanted. note that this does not necessarily lead to perfect predictive performance since motifs are implanted with a certain amount of noise and could also be present in the negative class by chance. the known motif method counts how often the known implanted motif occurs per sequence for each repertoire and uses this count to rank the repertoires. from this ranking, the area under the receiver operating characteristic curve (auc) is computed as performance measure. probabilistic aa changes in the known motif are not considered for this count, with the exception of gap positions. we consider two versions of this method: (a) known motif binary: counts the occurrence of the known motif in a sequence and (b) known motif continuous: counts the maximum number of overlapping aas between the known motif and all sequence positions, which corresponds to a convolution operation with a binary kernel followed by max-pooling. since the implanted signal is not known in the experimentally obtained cmv dataset, this method cannot be applied to this dataset. the support vector machine (svm) approach uses a fixed mapping from a bag of sequences to the corresponding k-mer counts. the function h kmer maps each sequence s i to a vector representing the occurrence of k-mers in the sequence.
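the two known-motif baselines described above can be sketched per sequence as follows. the function names are hypothetical, gap positions are not handled, and summing per-sequence counts into a repertoire-level score is an assumption about the aggregation step rather than the authors' stated procedure.

```python
def known_motif_binary(sequence, motif):
    """Number of exact (possibly overlapping) occurrences of the motif."""
    k = len(motif)
    return sum(sequence[i:i + k] == motif
               for i in range(len(sequence) - k + 1))


def known_motif_continuous(sequence, motif):
    """Maximum number of matching aas between the motif and any window of
    the sequence: a binary-kernel convolution followed by max-pooling."""
    k = len(motif)
    best = 0
    for i in range(len(sequence) - k + 1):
        matches = sum(a == b for a, b in zip(sequence[i:i + k], motif))
        best = max(best, matches)
    return best


def repertoire_score(sequences, motif, per_sequence=known_motif_binary):
    """Assumed aggregation: sum of per-sequence scores, used for ranking."""
    return sum(per_sequence(s, motif) for s in sequences)


score = repertoire_score(["CASSLDR", "CASS"], "LDR")
```

ranking repertoires by such a score and computing the auc against the class labels then gives the baseline performance.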
to avoid confusion with the sequence-representation obtained from the cnn layers of deeprc, we denote $u_i = h_{\mathrm{kmer}}(s_i)$, which is analogous to $z_i$. specifically, $u_{im} = \#\{p_m \in s_i\}$, where $\#\{p_m \in s_i\}$ denotes how often the k-mer pattern $p_m$ occurs in sequence $s_i$. afterwards, average-pooling is applied to obtain $u = \frac{1}{N} \sum_{i=1}^{N} u_i$, the k-mer representation of the input object $x$. for two input objects $x^{(n)}$ and $x^{(l)}$ with representations $u^{(n)}$ and $u^{(l)}$, respectively, we implement the minmax kernel (ralaivola et al., 2005) as follows: $k_{\mathrm{minmax}}(u^{(n)}, u^{(l)}) = \frac{\sum_m \min(u^{(n)}_m, u^{(l)}_m)}{\sum_m \max(u^{(n)}_m, u^{(l)}_m)}$, where $u^{(n)}_m$ is the m-th element of the vector $u^{(n)}$. the jaccard kernel (levandowsky & winter, 1971) is identical to the minmax kernel except that it operates on binary $u^{(n)}$. we used a standard c-svm, as introduced by cortes & vapnik (1995). the corresponding hyperparameter c is optimized by random search. the settings of the full hyperparameter search as well as the respective value ranges are given in table a4a. the same k-mer representation of a repertoire, as introduced above for the svm baseline, is used for the k-nearest neighbor (knn) approach. as this method classifies samples according to distances between them, the previous kernel definitions cannot be applied directly. it is therefore necessary to transform the minmax as well as the jaccard kernel from similarities to distances by constructing the following (levandowsky & winter, 1971): $d_{\mathrm{minmax}}(u^{(n)}, u^{(l)}) = 1 - k_{\mathrm{minmax}}(u^{(n)}, u^{(l)})$ (a1), $d_{\mathrm{jaccard}}(u^{(n)}, u^{(l)}) = 1 - k_{\mathrm{jaccard}}(u^{(n)}, u^{(l)})$ (a2). the number of neighbors is treated as the hyperparameter and optimized by an exhaustive grid search. the settings of the full hyperparameter search as well as the respective value ranges are given in table a5. we implemented logistic regression on the k-mer representation u of an immune repertoire. the model is trained by gradient descent using the adam optimizer (kingma & ba, 2014). the learning rate is treated as the hyperparameter and optimized by grid search.
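the k-mer representation and the set kernels used by the svm and knn baselines above can be sketched in a few lines. the sparse dictionary representation and all function names are choices made for this illustration only and are not taken from the authors' code.

```python
from collections import Counter


def kmer_counts(seq, k=4):
    """Count how often each k-mer occurs in a single sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def repertoire_representation(sequences, k=4):
    """Average-pool the per-sequence k-mer counts over the repertoire."""
    total = Counter()
    for s in sequences:
        total.update(kmer_counts(s, k))
    n = len(sequences)
    return {kmer: count / n for kmer, count in total.items()}


def minmax_kernel(u, v):
    """Sum of element-wise minima divided by sum of element-wise maxima."""
    keys = set(u) | set(v)
    num = sum(min(u.get(key, 0.0), v.get(key, 0.0)) for key in keys)
    den = sum(max(u.get(key, 0.0), v.get(key, 0.0)) for key in keys)
    return num / den if den > 0 else 0.0


def jaccard_kernel(u, v):
    """MinMax kernel on binarized representations."""
    ub = {key: 1.0 for key, val in u.items() if val > 0}
    vb = {key: 1.0 for key, val in v.items() if val > 0}
    return minmax_kernel(ub, vb)


def kernel_distance(similarity):
    """Turn a kernel similarity in [0, 1] into a distance for knn."""
    return 1.0 - similarity


u = repertoire_representation(["CASSL", "CASSF"])
v = repertoire_representation(["CASSL"])
```

for the two toy repertoires, the shared 4-mers dominate the minmax similarity, while the jaccard variant only looks at which 4-mers are present.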
furthermore, we explored two regularization settings using combinations of l1 and l2 weight decay. the settings of the full hyperparameter search as well as the respective value ranges are given in table a6 . we implemented a burden test (emerson et al., 2017; li & leal, 2008; wu et al., 2011) in a machine learning setting. the burden test first identifies sequences or k-mers that are associated with the individual's class, i.e., immune status, and then calculates a burden score per individual. concretely, for each k-mer or sequence, the phi coefficient of the contingency table for absence or presence and positive or negative immune status is calculated. then, j k-mers or sequences with the highest phi coefficients are selected as the set of associated k-mers or sequences. j is a hyperparameter that is selected on a validation set. additionally, we consider the type of input features, sequences or k-mers, as a hyperparameter. for inference, a burden score per individual is calculated as the sum of associated k-mers or sequences it carries. this score is used as raw prediction and to rank the individuals. hence, we have extended the burden test by emerson et al. (2017) to k-mers and to adaptive thresholds that are adjusted on a validation set. the logistic multiple instance learning (mil) approach for immune repertoire classification (ostmeyer et al., 2019) applies a logistic regression model to each k-mer representation in a bag. the resulting scores are then summarized by max-pooling to obtain a prediction for the bag. each amino acid of each k-mer is represented by 5 features, the so-called atchley factors (atchley et al., 2005) . as k-mers of length 4 are used, this gives a total of 4 × 5 = 20 features. one additional feature per 4-mer is added, which represents the relative frequency of this 4-mer with respect to its containing bag, resulting in 21 features per 4-mer. 
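the burden test described above (phi coefficient per feature, selection of the j highest-scoring features, and a per-individual burden score) can be sketched as follows. repertoires are represented as sets of k-mers or sequences; the function names and the alphabetical tie-breaking rule are assumptions made for this illustration.

```python
import math


def phi_coefficient(present, labels):
    """Phi coefficient of the 2x2 table presence/absence vs. class label."""
    n11 = sum(1 for p, y in zip(present, labels) if p and y)
    n10 = sum(1 for p, y in zip(present, labels) if p and not y)
    n01 = sum(1 for p, y in zip(present, labels) if not p and y)
    n00 = sum(1 for p, y in zip(present, labels) if not p and not y)
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    return (n11 * n00 - n10 * n01) / denom if denom > 0 else 0.0


def select_burden_set(repertoires, labels, j):
    """Select the j features with the highest phi coefficients."""
    features = set().union(*repertoires)
    phi = {f: phi_coefficient([f in r for r in repertoires], labels)
           for f in features}
    ranked = sorted(features, key=lambda f: (-phi[f], f))
    return set(ranked[:j])


def burden_score(repertoire, burden_set):
    """Number of associated features carried by one individual."""
    return len(set(repertoire) & burden_set)


train = [{"a", "b"}, {"a"}, {"b", "c"}, {"c"}]
labels = [1, 1, 0, 0]
burden_set = select_burden_set(train, labels, j=1)
scores = [burden_score(r, burden_set) for r in train]
```

in a setting like the one described in the text, `j` and the feature type would be chosen on a validation set.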
two options for the relative frequency feature exist, which are (a) whether the frequency of the 4-mer ("4mer") or (b) the frequency of the sequence in which the 4-mer appeared ("tcrβ") is used. we optimized the learning rate, batch size, and early stopping parameter on the validation set. the settings of the full hyperparameter search as well as the respective value ranges are given in table a8 . for all competing methods a hyperparameter search was performed, for which we split each of the 5 training sets into an inner training set and inner validation set. the models were trained on the inner training set and evaluated on the inner validation set. the model with the highest auc score on the inner validation set is then used to calculate the score on the respective test set. here we report the hyperparameter sets and search strategy that is used for all methods. deeprc. the set of hyperparameters of deeprc is shown in table a3 . these hyperparameter combinations are adjusted via a grid search procedure. table a3 : deeprc hyperparameter search space. every 5 · 10 3 updates, the current model was evaluated against the validation fold. the early stopping hyperparameter was determined by selecting the model with the best loss on the validation fold after 10 5 updates. * : experiments for {64; 128; 256} kernels were omitted for datasets with motif implantation probabilities ≥ 1% in the category "simulated immunosequencing data". known motif. this method does not have hyperparameters and has been applied to all datasets except for the cmv dataset. the corresponding hyperparameter c of the svm is optimized by randomly drawing 10 3 values in the range of [−6; 6] according to a uniform distribution. these values act as the exponents of a power of 10 and are applied for each of the two kernel types (see table a4a ). knn. 
the number of neighbors is treated as the hyperparameter and optimized by grid search operating in the discrete range of $[1; \min\{N, 10^3\}]$ with a step size of 1. the corresponding tight upper bound is automatically defined by the total number of samples $N \in \mathbb{N}_{>0}$ in the training set, capped at $10^3$ (see table a5).

number of neighbors: $[1; \min\{N, 10^3\}]$
type of kernel: {minmax; jaccard}
table a5 : settings used in the hyperparameter search of the knn baseline approach. the number of trials (per type of kernel) is automatically defined by the total number of samples $N \in \mathbb{N}_{>0}$ in the training set, capped at $10^3$.

logistic regression. the hyperparameters were optimized by grid search across the settings given in table a6.

learning rate: {$10^{-2}$; $10^{-3}$; $10^{-4}$}
batch size: 4
max. updates: $10^5$
coefficient β 1 (adam): 0.9
coefficient β 2 (adam): 0.999
weight decay weightings: {(l1 = $10^{-7}$, l2 = $10^{-3}$); (l1 = $10^{-7}$, l2 = $10^{-5}$)}
table a6 : settings used in the hyperparameter search of the logistic regression baseline approach.

burden test. the burden test selects two hyperparameters: the number of features in the burden set and the type of features, see table a7.

number of features in burden set: {50, 100, 150, 250}
type of features: {4mer; sequence}
table a7 : settings used in the hyperparameter search of the burden test approach.

logistic mil. for this method, we adjusted the learning rate as well as the batch size as hyperparameters by randomly drawing 25 different hyperparameter combinations from a uniform distribution. the learning rate is a power of 10 whose exponent is drawn from the range [−4.5; −1.5]. the batch size lies within the range of [1; 32]. for each hyperparameter combination, a model is optimized by gradient descent using adam, whereas the early stopping parameter is adjusted according to the corresponding validation set (see table a8).
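the random search described for logistic mil, and likewise for the svm parameter c, draws exponents uniformly and uses them as powers of 10. a minimal sketch of that sampling scheme is given below; the function names and seeds are hypothetical, and treating the batch size as a uniformly drawn integer is an assumption.

```python
import random


def sample_logistic_mil_trials(n_trials=25, seed=0):
    """Draw (learning rate, batch size) combinations for random search."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        exponent = rng.uniform(-4.5, -1.5)   # uniform in the exponent
        trials.append({
            "learning_rate": 10.0 ** exponent,
            "batch_size": rng.randint(1, 32),  # assumed integer-valued
        })
    return trials


def sample_svm_c_values(n_values=1000, seed=0):
    """Draw values of C as 10**e with e uniform in [-6, 6]."""
    rng = random.Random(seed)
    return [10.0 ** rng.uniform(-6.0, 6.0) for _ in range(n_values)]
```

sampling uniformly in the exponent spreads the trials evenly across orders of magnitude, which sampling the raw value uniformly would not do.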
learning rate 10 {−4.5;−1.5} batch size {1; 32} relative abundance term {4mer; tcrβ} number of trials 25 max. epochs 10 2 coefficient β 1 (adam) 0.9 coefficient β 2 (adam) 0.999 table a8 : settings used in the hyperparameter search of the logistic mil baseline approach. the number of trials (per type of relative abundance) defines the quantity of combinations of random values of the learning rate as well as the batch size. in this section, we report the detailed results on all four categories of datasets (a) simulated immunosequencing data (table a9 ) (b) lstm-generated data (table a10) , (c) real-world data with implanted signals (table a11) , and (d) real-world data on the cmv dataset (table a12) , as discussed in the main paper. ± 0.000 ± 0.000 ± 0.271 ± 0.000 ± 0.000 ± 0.218 ± 0.000 ± 0.000 ± 0.029 ± 0.000 ± 0.001 ± 0.017 ± 0.001 ± 0.002 ± 0.023 ± 0.001 ± 0.048 ± 0.013 ± 0.223 svm (minmax) 1.000 1.000 0.764 1.000 1.000 0.603 1.000 0.998 0.539 1.000 0.994 0.529 1.000 0.741 0.513 1.000 0.706 0.503 0.827 ± 0.000 ± 0.000 ± 0.016 ± 0.000 ± 0.000 ± 0.021 ± 0.000 ± 0.002 ± 0.024 ± 0.000 ± 0.004 ± 0.016 ± 0.000 ± 0.024 ± 0.006 ± 0.000 ± 0.013 ± 0.013 ± 0.013 ± 0.013 ± 0.014 ± 0.011 ± 0.009 ± 0.007 ± 0.008 ± 0.011 ± 0.012 ± 0.012 ± 0.007 ± 0.014 ± 0.017 ± 0.010 ± 0.020 ± 0.012 ± 0.016 ± 0.016 ± 0.074 known motif b. 1.000 1.000 0.973 1.000 1.000 0.865 1.000 1.000 0.700 1.000 0.989 0.609 1.000 0.946 0.570 1.000 0.834 0.532 0.890 ± 0.000 ± 0.000 ± 0.004 ± 0.000 ± 0.000 ± 0.004 ± 0.000 ± 0.000 ± 0.020 ± 0.000 ± 0.002 ± 0.017 ± 0.000 ± 0.010 ± 0.024 ± 0.000 ± 0.016 ± 0.020 ± 0.001 ± 0.014 ± 0.020 ± 0.001 ± 0.013 ± 0.017 ± 0.001 ± 0.012 ± 0.012 ± 0.001 ± 0.018 ± 0.018 ± 0.002 ± 0.010 ± 0.009 ± 0.002 ± 0.012 ± 0.013 ± 0.202 table a9 : auc estimates based on 5-fold cv for all 18 datasets in category "simulated immunosequencing data". 
the reported errors are standard deviations across the 5 cross-validation folds except for the last column "avg.", in which they show standard deviations across datasets. wildcard characters in motifs are indicated by z, characters with 50% probability of being removed by d . table a10 : auc estimates based on 5-fold cv for all 5 datasets in category "lstm-generated data". the reported errors are standard deviations across the 5 cross-validation folds except for the last column "avg.", in which they show standard deviations across datasets. characters affected by noise, as described in a3, paragraph "lstm-generated data", are indicated by r . table a12 : results on the cmv dataset (real-world data) in terms of auc, f1 score, balanced accuracy, and accuracy. for f1 score, balanced accuracy, and accuracy, all methods use their default thresholds. each entry shows mean and standard deviation across 5 cross-validation folds. we trained a conventional next-character lstm model (graves, 2013) based on the implementation in https://github.com/spro/practical-pytorch (access date 1st of may, 2020) using pytorch 1.3.1 (paszke et al., 2019) . for this, we applied an lstm model with 100 lstm blocks in 2 layers, which was trained for 5, 000 epochs using the adam optimizer (kingma & ba, 2014) with learning rate 0.01, an input batch size of 100 character chunks, and a character chunk length of 200. as input we used the immuno-sequences in the cdr3 column of the cmv dataset, where we repeated sequences according to their counts in the repertoires, as specified in the templates column of the cmv dataset. we excluded repertoires with unknown cmv status and unknown sequence abundance from training. after training, we generated 1, 000 repertoires using a temperature value of 0.8. the number of sequences per repertoire was sampled from a gaussian n (µ = 285k, σ = 156k) distribution, where the whole repertoire was generated by the lstm at once. 
that is, the lstm can base the generation of the individual aa sequences in a repertoire, including the aas and the lengths of the sequences, on the generated repertoire. a random immuno-sequence from the trained-on repertoires was used as initialization for the generation process. this immuno-sequence was not included in the generated repertoire. finally, we randomly assigned 500 of the generated repertoires to the positive (diseased) and 500 to the negative (healthy) class. we then implanted motifs in the positive class repertoires as described in section a3.2. as illustrated in the comparison of histograms given in fig. a2 , the generated immuno-sequences exhibit a very similar distribution of 4-mers and aas compared to the original cmv dataset. real-world data deeprc allows for two forms of interpretability methods. (a) due to its attention-based design, a trained model can be used to compute the attention weights of a sequence, which directly indicates its importance. (b) deeprc furthermore allows for the usage of contribution analysis methods, such as integrated gradients (ig) (sundararajan et al., 2017) or layer-wise relevance propagation (montavon et al., 2018; arras et al., 2019; montavon et al., 2019; preuer et al., 2019) . we apply ig to identify the input patterns that are relevant for the classification. to identify aa patterns with high contributions in the input sequences, we apply ig to the aas in the input sequences. additionally, we apply ig to the kernels of the 1d cnn, which allows us to identify aa motifs with high contributions. in detail, we compute the ig contributions for the aas and positional features in the kernels for every repertoire in the validation and test set, so as to exclude potential artifacts caused by over-fitting. averaging the ig values over these repertoires then results in concise aa motifs. we include qualitative visual analyses of the ig method on different datasets below. 
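to make the contribution analysis more concrete, the following sketch approximates integrated gradients for a toy logistic model with an analytic gradient. it only illustrates the general ig technique (sundararajan et al., 2017) and is not the deeprc code; the model, the all-zero baseline, and the function names are assumptions made for this example.

```python
import math


def model(x, w, b):
    """Toy differentiable model: logistic regression output."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def model_gradient(x, w, b):
    """Analytic gradient of the toy model with respect to its input."""
    p = model(x, w, b)
    return [p * (1.0 - p) * wi for wi in w]


def integrated_gradients(x, baseline, w, b, steps=50):
    """Approximate IG by averaging gradients along the straight path
    from the baseline to the input (midpoint Riemann sum)."""
    average_gradient = [0.0] * len(x)
    for step in range(steps):
        alpha = (step + 0.5) / steps
        point = [bi + alpha * (xi - bi) for xi, bi in zip(x, baseline)]
        gradient = model_gradient(point, w, b)
        average_gradient = [a + g / steps for a, g in zip(average_gradient, gradient)]
    # Scale the averaged gradient by the input difference per feature.
    return [(xi - bi) * a for xi, bi, a in zip(x, baseline, average_gradient)]
```

a useful sanity check is the completeness property: the attributions should sum approximately to the difference between the model output at the input and at the baseline.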
here, we provide examples for the interpretation of trained deeprc models using integrated gradients (ig) (sundararajan et al., 2017) as contribution analysis method. the following illustrations were created using 50 ig steps, which we found sufficient to achieve stable ig results. a visual analysis of deeprc models on the simulated datasets, as illustrated in tab. a13 and fig. a3 , shows that the implanted motifs can be successfully extracted from the trained model and are straightforward to interpret. in the real-world cmv dataset, deeprc finds complex patterns with high variability in the center regions of the immuno-sequences, as illustrated in figure a4 . real-world data with implanted signals extracted motif implanted motif(s) g r s r a r f r l r d r r r {l r d r r r ; c r a r s; g r l-n} motif freq. ρ 0.05% 0.1% 0.1% table a13 : visualization of motifs extracted from trained deeprc models for datasets from categories "simulated immunosequencing data", "lstm-generated data", and "real-world data with implanted signals". motif extraction was performed using integrated gradients on the 1d cnn kernels over the validation set and test set repertoires of one cv fold. wildcard characters are indicated by z, random noise on characters by r , characters with 50% probability of being removed by d , and gap locations of random lengths of {0; 1; 2} by -. larger characters in the extracted motifs indicate higher contribution, with blue indicating positive contribution and red indicating negative contribution towards the prediction of the diseased class. contributions to positional encoding are indicated by < (beginning of sequence), ∧ (center of sequence), and > (end of sequence). only kernels with relatively high contributions are shown, i.e. with contributions roughly greater than the average contribution of all kernels. b) c) figure a3 : integrated gradients applied to input sequences of positive class repertoires. 
three sequences with the highest contributions to the prediction of their respective repertoires are shown. a) input sequence taken from "simulated immunosequencing data" with implanted motif sz d z d n and motif implantation probability 0.1%. the deeprc model reacts to the s and n at the 5 th and 8 th sequence position, thereby identifying the implanted motif in this sequence. b) and c) input sequence taken from "real-world data with implanted signals" with implanted motifs {l r d r r r ; c r a r s; g r l-n} and motif implantation probability 0.1%. the deeprc model reacts to the fully implanted motif cas (b) and to the partly implanted motif aas c and a at the 5 th and 7 th sequence position (c), thereby identifying the implanted motif in the sequences. wildcard characters in implanted motifs are indicated by z, characters with 50% probability of being removed by d , and gap locations of random lengths of {0; 1; 2} by -. larger characters in the sequences indicate higher contribution, with blue indicating positive contribution and red indicating negative contribution towards the prediction of the diseased class. figure a4 : visualization of the contributions of characters within a sequence via ig. each sequence was selected from a different repertoire and showed the highest contribution in its repertoire. the model was trained on cmv dataset, using a kernel size of 9, 32 kernels and 137 repertoires for early stopping. larger characters in the extracted motifs indicate higher contribution, with blue indicating positive contribution and red indicating negative contribution towards the prediction of the disease class. table a14 : tcrβ sequences that had been discovered by emerson et al. (2017) with their associated attention values by deeprc. these sequences have significantly (p-value 1.3e-93) higher attention values than other sequences. 
the column "quantile" provides the quantile values of the empirical distribution of attention values across all sequences in the dataset. in this section we investigate the impact of different variations of deeprc on the performance on the cmv dataset. we consider both a cnn-based sequence embedding, as used in the main paper, and an lstm-based sequence embedding. in both cases we vary the number of attention heads and the β parameter of the softmax function in the attention mechanism (see eq. 2 in main paper). for the cnn-based sequence embedding we also vary the number of cnn kernels and the kernel sizes used in the 1d cnn. for the lstm-based sequence embedding we use a single unidirectional lstm layer, of which the output values at the last sequence position (without padding) are taken as the embedding of the sequence. here we vary the number of lstm blocks in the lstm layer. to counter over-fitting due to the increased complexity of these deeprc variations, we added an l2 weight penalty to the training loss. the factor with which the l2 weight penalty contributes to the training loss is varied over 3 orders of magnitude, where suitable value ranges were manually determined on one of the training folds beforehand. to reduce the computational effort, we do not consider all numbers of kernels that were considered in the main paper. furthermore, we only compute the auc scores on 3 of the 5 cross-validation folds. the hyperparameters, which were used in a grid search procedure, are listed in tab. a15 for the cnn-based sequence embedding and tab. a16 for the lstm-based sequence embedding. results. we show performance in terms of auc score with single hyperparameters set to fixed values so as to investigate their influence in tab. a18 for the cnn-based sequence embedding and tab. a17 for the lstm-based sequence embedding.
we note that due to restricted computational resources this study was conducted with fewer different numbers of cnn kernels, with the auc estimated from only 3 of the 5 cross-validation folds, which leads to a slight decrease of performance in comparison to the full hyperparameter search and cross-validation procedure used in the main paper. as can be seen in tab. a18 and a17, the lstm-based sequence embedding generalizes slightly better than the cnn-based sequence embedding. table a17 : impact of hyperparameters on deeprc with lstm for sequence encoding. mean ("mean") and standard deviation ("std") for the area under the roc curve over the first 3 folds of a 5-fold nested cross-validation for different sub-sets of hyperparameters ("sub-set") are shown. the following sub-sets were considered: "full": full grid search over hyperparameters; "beta=*": grid search over hyperparameters with reduction to specific value * of beta value of attention softmax; "heads=*": grid search over hyperparameters with reduction to specific number * of attention heads; "lstms=*": grid search over hyperparameters with reduction to specific number * of lstm blocks for sequence embedding. table a18 : impact of hyperparameters on deeprc with 1d cnn for sequence encoding. mean ("mean") and standard deviation ("std") for the area under the roc curve over the first 3 folds of a 5-fold nested cross-validation for different sub-sets of hyperparameters ("sub-set") are shown. 
the following sub-sets were considered: "full": full grid search over hyperparameters; "beta=*": grid search over hyperparameters with reduction to specific value * of beta value of attention softmax; "heads=*": grid search over hyperparameters with reduction to specific number * of attention heads; "ksize=*": grid search over hyperparameters with reduction to specific kernel size * of 1d cnn kernels for sequence embedding; "kernels=*": grid search over hyperparameters with reduction to specific number * of 1d cnn kernels for sequence embedding. a compact vocabulary of paratope-epitope interactions enables predictability of antibody-antigen binding predicting the sequence specificities of dna-and rna-binding proteins by deep learning explaining and interpreting lstms solving the protein sequence metric problem rank-loss support instance machines for miml instance annotation augmenting adaptive immunity: progress and challenges in the quantitative engineering and analysis of adaptive immune receptor repertoires multiple instance learning: a survey of problem characteristics and applications vdjserver: a cloud-based analysis portal and data commons for immune repertoire sequences and rearrangements tetramer-visualized gluten-specific cd4+ t cells in blood as a potential diagnostic marker for coeliac disease without oral gluten challenge ireceptor: a platform for querying and analyzing antibody/b-cell and t-cell receptor repertoire data across federated repositories support-vector networks quantifiable predictive features define epitope-specific t cell receptor repertoires on a model of associative memory with huge storage capacity bert: pre-training of deep bidirectional transformers for language understanding solving the multiple instance problem with axis-parallel rectangles predicting the spectrum of tcr repertoire sharing with a data-driven model of recombination immunosequencing identifies signatures of cytomegalovirus exposure history and hla-mediated effects 
on the t cell repertoire predicting antigen-specificity of single t-cells based on tcr cdr3 regions. biorxiv a review of multi-instance learning assumptions deep sequencing of b cell receptor repertoires from covid-19 evaluation and benchmark for biological image segmentation the promise and challenge of high-throughput sequencing of the antibody repertoire tcrex: detection of enriched t cell epitope specificity in full t cell receptor sequence repertoires. biorxiv identifying specificity groups in the t cell receptor repertoire generating sequences with recurrent neural networks. arxiv a bioinformatic framework for immune repertoire diversity profiling enables detection of immunological status learning the high-dimensional immunogenomic features that predict public and private antibody repertoires improving neural networks by preventing co-adaptation of feature detectors long short-term memory fast model-based protein homology detection without alignment neural networks and physical systems with emergent collective computational abilities convolutional neural network architectures for matching natural language sentences attention-based deep multiple instance learning nettcr: sequence-based prediction of tcr binding to peptide-mhc complexes using convolutional neural networks basset: learning the regulatory code of the accessible genome with deep convolutional neural networks detecting cutaneous basal cell carcinomas in ultra-high resolution and weakly labelled histopathological images self-normalizing neural networks capturing the differences between humoral immunity in the normal and tumor environments from repertoire-seq of b-cell receptors using supervised machine learning observed antibody space: a resource for data mining next-generation sequencing of antibody repertoires dense associative memory for pattern recognition dense associative memory is robust to adversarial inputs gradient-based learning applied to document recognition set transformer: a framework 
for attention-based permutation-invariant neural networks imgt unique numbering for immunoglobulin and t cell receptor variable domains and ig superfamily v-like domains distance between sets methods for detecting associations with rare variants for common diseases: application to analysis of sequence data the extended cohnkanade dataset (ck+): a complete dataset for action unit and emotion-specified expression high-throughput immune repertoire analysis with igor a framework for multiple-instance learning computational strategies for dissecting the high-dimensional complexity of adaptive immune repertoires longitudinal high-throughput tcr repertoire profiling reveals the dynamics of t cell memory formation after mild covid-19 infection. biorxiv methods for interpreting and understanding deep neural networks layer-wise relevance propagation: an overview how many different clonotypes do immune repertoires contain? current opinion in systems biology treating biomolecular interaction as an image classification problem -a case study on t-cell receptorepitope recognition prediction. biorxiv sumrep: a summary statistic framework for immune receptor repertoire comparison and model validation biophysicochemical motifs in t-cell receptor sequences distinguish repertoires from tumor-infiltrating lymphocyte and adjacent healthy tissue pytorch: an imperative style, high-performance deep learning library needles in haystacks: on classifying tiny objects in large images interpretable deep learning in drug discovery pointnet: deep learning on point sets for 3d classification and segmentation graph kernels for chemical informatics cov-abdab: the coronavirus antibody database. 
biorxiv immunedb, a novel tool for the analysis, storage, and dissemination of immune repertoire sequencing data a $$k$$-nearest neighbor based algorithm for multi-instance multi-label active learning machine learning in automated text categorization high-throughput mapping of b cell receptor sequences to antigen specificity vdjtools: unifying post-analysis of t cell receptor repertoires vdjdb: a curated database of t-cell receptor sequences with known antigen specificity deeptcr: a deep learning framework for understanding t-cell receptor sequence signatures within complex t-cell repertoires prediction of specific tcr-peptide binding from large dictionaries of tcr-peptide pairs. biorxiv axiomatic attribution for deep networks attention-based deep neural networks for detection of cancerous and precancerous esophagus tissue on histopathological slides learning with sets in multiple instance regression applied to remote sensing attention is all you need revisiting multiple instance neural networks novel approaches to analyze immunoglobulin repertoires immunesim: tunable multi-feature simulation of b-and t-cell receptor repertoires for immunoinformatics benchmarking genome-wide protein function prediction through multiinstance multi-label learning rare-variant association testing for sequencing data with the sequence kernel association test polyspecificity of t cell and b cell receptor recognition practical guidelines for b-cell receptor repertoire sequencing analysis learning embedding adaptation for few-shot learning convolutional neural network architectures for predicting dna-protein binding pird: pan immune repertoire database multi-instance multi-label learning with application to scene classification predicting effects of noncoding variants with deep learning-based sequence model the ellis unit linz, the lit ai lab and the institute for machine learning are supported by the land oberösterreich, lit grants deeptoxgen ( in the following, the appendix to the paper 
"modern hopfield networks and attention for immune key: cord-319868-rtt9i7wu authors: majeed, taban; rashid, rasber; ali, dashti; asaad, aras title: issues associated with deploying cnn transfer learning to detect covid-19 from chest x-rays date: 2020-10-06 journal: phys eng sci med doi: 10.1007/s13246-020-00934-8 sha: doc_id: 319868 cord_uid: rtt9i7wu covid-19 first occurred in wuhan, china in december 2019. subsequently, the virus spread throughout the world and as of june 2020 the total number of confirmed cases are above 4.7 million with over 315,000 deaths. machine learning algorithms built on radiography images can be used as a decision support mechanism to aid radiologists to speed up the diagnostic process. the aim of this work is to conduct a critical analysis to investigate the applicability of convolutional neural networks (cnns) for the purpose of covid-19 detection in chest x-ray images and highlight the issues of using cnn directly on the whole image. to accomplish this task, we use 12-off-the-shelf cnn architectures in transfer learning mode on 3 publicly available chest x-ray databases together with proposing a shallow cnn architecture in which we train it from scratch. chest x-ray images are fed into cnn models without any preprocessing to replicate researches used chest x-rays in this manner. then a qualitative investigation performed to inspect the decisions made by cnns using a technique known as class activation maps (cam). using cams, one can map the activations contributed to the decision of cnns back to the original image to visualize the most discriminating region(s) on the input image. we conclude that cnn decisions should not be taken into consideration, despite their high classification accuracy, until clinicians can visually inspect and approve the region(s) of the input image used by cnns that lead to its prediction. 
the severe acute respiratory syndrome coronavirus 2 (sars-cov-2), the virus causing covid-19, has caused a pandemic since its emergence in wuhan, china in dec 2019 [1] . the death toll from the infection escalated and many health systems around the world have struggled to cope. a critical step in the control of covid-19 is effective and accurate screening of patients so that positive cases receive timely treatment and are appropriately isolated from the public, a measure deemed crucial in curbing the spread of the infection. reverse-transcription polymerase chain reaction (rt-pcr) testing, which can detect sars-cov-2 rna from respiratory specimens (such as nasopharyngeal or oropharyngeal swabs), is the gold-standard screening method for detecting covid-19 cases. the high sensitivity of rt-pcr testing is overshadowed by the limited availability of test kits and the amount of time required for the result to be available (a few hours to a day or two) [2] . therefore, there is a growing need for fast and reliable screening techniques whose results could be further confirmed by rt-pcr testing. some studies have suggested the use of imaging techniques such as x-rays and computed tomography (ct) scans of the chest to look for visual indicators associated with sars-cov-2 viral infection. it was found in early studies that patients display abnormalities in chest radiographs that are characteristic of covid-19 infection, with some suggesting that radiography examination could be used as a primary tool for covid-19 screening in epidemic areas [3] . facilities for chest imaging are readily available in modern healthcare systems, making radiography examination a good complement to rt-pcr testing and, in some cases, showing even a higher sensitivity index. given that x-ray visual indicators can be subtle, radiologists face a great challenge in detecting those subtle changes and interpreting them accurately.
as such, it is highly desirable to have computer-aided diagnostic systems that can aid radiologists in making a more time-efficient and accurate interpretation of x-ray images that are characteristic of covid-19 [4] . in recent months, much research has addressed the problem of covid-19 detection in chest x-rays using deep learning approaches in general, and convolutional neural networks (cnns) in particular [3] [4] [5] [6] [7] [8] [9] [10] . the majority of papers report high covid-19 disease detection accuracy [2, 6, [10] [11] [12] [13] [14] . for a detailed survey of recent artificial intelligence algorithms, the reader is directed to the review by nguyen et al. [15] . however, deploying cnn architectures directly on chest radiography images may not produce reliable covid-19 detection results, especially when chest x-ray images are fed into cnn models directly without any preprocessing steps such as region of interest segmentation, noise elimination and unwanted object removal. we adopt this hypothesis and demonstrate that, despite their high classification accuracy, cnn models are 'cheating' by basing their predictions on artefacts in the images that have nothing to do with covid-19 disease. since the start of covid-19, researchers quickly divided their efforts to combat it, focusing on developing a vaccine on the one hand [16] and detecting covid-19 using pcr and imaging systems on the other hand [3] . here, we review studies devoted to the use of radiography images to aid and complement pcr in diagnosing covid-19 cases. ai et al. [3] built a deep convolutional neural network (cnn) based on resnet50, inceptionv3 and inception-resnetv2 models for the classification of chest x-ray images into normal and covid-19 classes. they reported a good correlation between ct image results and the pcr approach. chest x-ray images of 50 covid-19 patients have been obtained from the open source github repository shared by (dr.
joseph cohen [17] ). kumar et al. [5] proposed a method to detect covid-19 using x-ray images based on deep features and support vector machines (svm). they collected x-ray images from the github, kaggle and open-i repositories. they extracted the deep feature maps of a number of cnn models and concluded that resnet50 performed better despite the small number of images used in their investigation. maghdid et al. [6] proposed a simple cnn of only 16 filters to detect covid-19 using both x-ray and ct scans and reported good performance, but the dataset used is small. the work of fei et al. [1] focused on segmenting covid-19 ct scans using a deep learning approach known as vb-net and reported a dice similarity of 91% ± 10%. xu et al. [8] obtained an early prediction model to distinguish covid-19 pneumonia from influenza-a viral pneumonia and healthy cases in pulmonary ct images, using a resnet18 model fed with image patches focused on regions of interest. the highest accuracy for the cnn model was 86.7% on ct images. wang et al. [9] used ct images to predict covid-19 cases, deploying an inception transfer-learning model to reach an accuracy of 89.5% with a specificity of 88.0% and a sensitivity of 87.0%. in [4] a number of cnn architectures that are already used for other medical image classification tasks were evaluated on a dataset of x-ray images to distinguish the coronavirus cases from pneumonia and normal cases. the cnns were applied to a dataset of 224 images of covid-19, 700 of non-covid-19 pneumonia, and 504 normal, with a reported overall accuracy of 97.82. wang and wong [2] investigated a dataset that they called covidx and a neural network architecture called covid-net designed for the detection of covid-19 cases from open source chest x-ray radiography images.
the dataset consists of chest radiography images belonging to 4 classes: normal x-rays (cases without any infection), bacterial and viral x-rays (non-covid-19 pneumonia), and covid-19 x-rays. they reported an overall accuracy of 83.5% for these four classes. their lowest reported positive predictive value was for the non-covid-19 class (67.0%) and the highest was for the normal class (95.1%). to improve on the previous studies, farooq and hafeez [7] presented another cnn with fewer parameters but better performance. the authors used the same dataset as in [2] to build an open source and accurate covid-resnet for differentiating covid-19 cases from the other four pneumonia cases, and outperformed covid-net. in [10] , narin et al. experimented with several cnn architectures to classify normal versus covid-19 x-ray images and reported excellent classification accuracy, sensitivity and specificity. however, the authors failed to discuss the clinical importance of their approach: it may not be difficult to distinguish severe covid-19 cases from normal chest x-rays, as we show in table 2 , and this is not the situation radiologists face on a regular basis, nor may it be of importance in the current pandemic. finally, they trained their cnns on 50 images from each of the normal and covid-19 classes, which may result in some bias in the training phase. there are many papers focused on issues related to cnn deployment, not all of them covid-19 related, which demonstrate that cnn results can be misleading, not reproducible and in need of interpretation. hu et al. [18] criticized artificial intelligence based approaches to diagnosing covid-19 from medical images for their lack of transparency regarding their predictive outcomes, as well as for the small number of control cases on which many studies are based. darcema et al.
[19] discussed the problems of reproducing results by cnn models for recommender systems and that many of the proposed cnn models can be outperformed by other conceptually simple methods. in [20] , wynants et al. reviewed 91 models, mostly deep learning based, and concluded that all of the models are of high risk of bias due to non-representative selection of control patients. they also report high risk of model overfitting and vague reporting by not including any description of the study population and indented use of the models. in all the works discussed here, to the best of our knowledge, we did not encounter an explicit description of preprocessing, segmentation nor noise reduction on chest x-rays. we address this problem by assessing the quality of the decisions made by 12 cnn models using class activation mapping introduced in [21] . furthermore, there is no justification why researchers favored a particular cnn model over others and did not compare their final results if one opt to choose another cnn architecture. this paper benchmarks 12 popular cnn models and deploy them in a transfer learning mode on 3 public datasets popularized for the detection of covid-19 infection. finally, a qualitative analysis is performed on these 12 cnn models to demonstrate the most discriminating regions in the input image used by each cnn and the need of such process to reveal the bias in current datasets as well as cnn weaknesses. in recent years, the use of deep learning algorithms in general and convolutional neural networks (cnns) led to many breakthroughs in a variety of computer vision applications like segmentation, recognition and object detection [22] . deep learning methods have been shown to be successful in automating the task of feature-representation learning and gradually attempts to eliminate the tedious task of handcrafted feature engineering. 
deep learning, and convolutional neural networks (cnns), attempts to mimic the human visual cortex system in terms of structure and operation by adopting a hierarchical layer of feature representation. this approach of multi-layer feature representation made it possible to learn different image features automatically and hence enabled cnns to outperform handcrafted-feature methods [23] . in 1960s, hubel and wiesel [24] studied monkey's visual cortex system and found cells which are responsible for constructing image and detecting light signal in receptive filed. in the same vein, hubel and wiesel also showed that monkey's visual field can be represented using a topographic mapping. in 1980s, neocognitron proposed by fukushima and miyake [25] which is a self-organizing neural network and regarded as a predecessor of cnn. in [26] , lecun et al.'s groundbreaking work introduced modern cnn models for the purpose of handwritten digit recognition in which the architecture later popularized and known as lenet. after lenet architecture, convolutional layers and backpropagation algorithm for training popularized and became a fundamental building block of most of the modern cnn architectures. in 2012, alexnet architecture, proposed by krizhevsky et al. [27] , won imagenet large scale visual recognition challenge (ilsvrc) [28] by outperforming other methods and reducing the top-5 error from 26 to 15.3%. this was a turning point so that cnns became an exceptionally popular tool to be deployed in many computer visions tasks. roughly speaking, alexnet is a similar version of lenet but deeper structure and trained on 1.2 million high resolution images. complex architectures that has millions of parameters, and hyperparameters, to train and fine tune need a substantial amount of computational time and power but again alexnet popularized the use of powerful computational resources such as graphical processing units (gpus) to compensate the increase in trainable parameters. 
alexnet opened the door for researchers around the world to design novel cnn models that are deep yet efficient, especially after ilsvrc became an annual venue for the recognition of new cnn models. the participation of technology giants such as google, microsoft and facebook also helped push research in this direction: the depth of cnn architectures increased dramatically from 8 layers in 2012 to 152 layers in 2015, which helped the recognition error rate drop to 3.5%. cnn architectures pre-trained on imagenet have been open-sourced and were immediately used by researchers to transfer the learned knowledge to other application domains, with promising results [29] . transfer learning (tl) is particularly useful in domains such as medical image analysis, where millions of labeled images are not available; it is therefore natural to reuse the weights and biases of cnn architectures trained on imagenet, and other large databases, and fine-tune them for medical image analysis. hence, we opt to use 12 deep learning architectures in a tl mode and modify their final layers to adapt to the number of classes in our investigation. the deep learning architectures that we used for the purpose of covid19 detection from x-ray images are alexnet, vgg16, vgg19, resnet18, resnet50, resnet101, googlenet, inceptionv3, squeezenet, inception-resnet-v2, xception and densenet201. in what follows we briefly describe each of the 12 cnn architectures used here and highlight their distinct properties. giving full details of all 12 cnn models is outside the scope of this work, hence we direct the interested reader to survey articles on deep learning and cnn architectures such as [30, 31] . alexnet, proposed by krizhevsky et al. [27] , is the winner of ilsvrc 2012 and significantly outperformed methods based on handcrafted features.
alexnet consists of 5 convolutional layers and 3 fully connected layers together with the rectified linear unit (relu) activation function, whose use it helped popularize. it can be regarded as a scaled-up version of lenet: a deeper architecture trained on a larger dataset of images (imagenet) that benefitted from gpu computational power. a variant of alexnet with fine-tuned hyperparameters, later named zf-net, won the 2013 ilsvrc [28] . we use alexnet in a transfer learning mode and modify its last layer according to the number of x-ray image classes, i.e. instead of the 1000 classes that alexnet was trained on, we use the 4 x-ray classes considered here: covid19, bacteria, viral and normal. the same tl approach is used for the rest of the cnn models. vgg architectures were proposed by oxford university's visual geometry group [32] , hence the acronym vgg, who demonstrated that using small filters of size 3-by-3 in all of the convolutional layers throughout the network leads to better performance. the main intuition behind vgg architectures is that multiple small filters in sequence can imitate the effect of larger filters. due to their simplicity of design and generalization power, vgg architectures are widely used. we use vgg16 and vgg19, which consist of 16 and 19 weight layers (convolutional plus fully connected), respectively. googlenet, proposed by szegedy et al. [33] from google, is the winner of ilsvrc 2014. its novelty is the inception module, which is a small network inside a bigger network. furthermore, 1-by-1 convolutional layers/blocks are used for dimensionality reduction and feature aggregation. in total, googlenet is 22 layers deep with 9 inception modules. inception v1 (googlenet) was later improved in terms of batch normalization, representational bottleneck and computational complexity, which resulted in inception v2 and v3.
here we opt to use googlenet and inceptionv3 [34] in a transfer learning mode. in the same vein, we use xception [35] , which is another architecture proposed by f. chollet from google which uses the idea of extreme inception module whereby depthwise convolutional layers used first then followed by pointwise convolutional layers. in other words, they replaced inception modules by depthwise separable convolutions in such a way that the total number of parameters is the same as inceptionv3 but the performance on large datasets (350 million images of 17,000 classes) are significantly higher. resnet architectures are proposed by he et al. [36] from microsoft and won 2015 ilsvrc. main innovation in resnet architectures are the use of residual layers and skip connections to solve the problem of vanishing gradient that may result in stopping the weights in the network to further update/change. this is particularly a problem in deep networks because the value of gradient can vanish, i.e. shrink to zero, when several chain rules applied consecutively. skipping connections will help gradians to flow backwards directly from end layers to initial layer filters enabling cnn models to deepen with 152 layers. densenet can be regarded as a logical extension of resnet which was first proposed in 2016 by huang et al. from facebook [37] . in densenet, each layer of cnn connected to every other layer in the network in a feed-forward manner which helps in reducing the risk of gradient-vanishing, fewer parameters to train, feature-map reuse and each layer takes all preceding layer features as inputs. the authors also point out that when datasets used without augmentation, densenet is less prone to overfitting. there are a number of densenet architectures, but we opt to use densenet201 for our analysis of covid19 detection from x-ray images by using the weights trained on imagenet dataset in tl mode. squeezenet is a small architecture proposed by iandola et al. 
[38] in 2016 that uses the idea of fire module which contain 3 filters of size 1-by-1 feed into an expanded layer (4 filters of size 1-by-1 and 4 filters of size 3-by-3). even though the number of parameters of squeezenet is by 50 × less than alexnet but achieves the same accuracy of alexnet on imagenet. inception-resnetv2 is a combined architecture proposed by szegedy et al. [34] in 2016 that uses the idea of inception blocks and residual layers together. the aim of using residual connections is to avoid the problem of degradation causes by deep networks and reduce the training time. the inception-resnetv2 architecture used here contains 20 inception-resnet blocks that empower the network to become 164 layers deep, and we use the pre-trained weights in these layers to assist our mission of detecting covid19 in x-ray images. in this study, we designed a cnn model for covid-19 detection from chest radiography images guided by the fact that in order to properly classify and detect covid-19, radiologists need to discriminate covid-19 x-rays from normal chest x-ray first, and then from other viral and bacterial infections in order to isolate and treat the patient properly. therefore, we opt to choose the design of cnn to make one of the following predictions: (a) normal (i.e. no infection) (b) covid-19, (c) viral infection (none-covid-19) and (d) bacterial infection. the rationale behind using these 4 cases is to aid radiologists to prioritize covid-19 patients for pcr testing and employ treatments according to infection-specific causes. having these requirements in mind, we designed our simple cnn architecture, named cnn-x, that constitutes of 4 parallel layers where we have 16 filters in each layer in 3 different sizes (3-by-3, 5-by-5 and 9-by-9). batch normalization and rectified linear unit (relu) is then applied to the convolved images and two different types of pooling operation applied next which are average pooling and maximum pooling. 
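to make the effect of a pooling window and stride on feature-map size concrete, the following minimal python sketch computes the output side length of an unpadded pooling layer. the 224-pixel input side and the helper name `pooled_size` are assumptions for illustration only (the input size of cnn-x is not stated in the text); the two window/stride pairs are the ones compared in this section.

```python
def pooled_size(n: int, window: int, stride: int) -> int:
    """Side length of the output after unpadded pooling of an n-by-n map."""
    if n < window:
        raise ValueError("input is smaller than the pooling window")
    # Number of window positions that fit when stepping by `stride`.
    return (n - window) // stride + 1


if __name__ == "__main__":
    side = 224  # assumed input side length, not taken from the paper
    for window, stride in [(3, 2), (2, 3)]:
        out = pooled_size(side, window, stride)
        print(f"{window}-by-{window} pooling, stride {stride}: {side} -> {out}")
```

under these assumptions the 2-by-2 window with stride 3 yields the smaller feature map, because one row and one column are skipped between consecutive windows.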
the rationale behind using different filter sizes is to detect local-features using filters of size 3-by-3 and rather global features by filters of size 9-by-9 while 5-by-5 filter size is to detect what is missed by the other two filters. different pooling operations utilized to further reduce the dimensionality of feature maps. a stride of size 3 is adopted here, with pooling operations, to further reduce the dimension of the resulting feature maps taking into consideration the fact that there is redundant information in images and neglecting a row and a column after each pooling window is not causing a massive information loss. see fig. 1 where we visually depict the difference between pooling of size 3-by-3 with stride 2 versus pooling of size 2-by-2 with stride 3 and conclude that we are not losing much information while reducing the size of the image/feature map further. proposed architecture design is not deep, hence the feature map (i.e. convolved image) is not a very abstract representation of the input image yet and as such there are still redundant information. feature maps from the four parallel layers are then concatenated before fully connected layer. weights are generated using glorot method [39] with adam optimizer [40] and 0.0003 initial learning rate. training conducted using 20 epochs and 15 mini batch size. we visualize the structure of proposed cnn model in fig. 2 . to investigate and test the cnn architectures explained in section iii and iv, we used x-ray images collected from 3 publicly available sources. first dataset is a collection of 111 covid-19 chest x-ray images collected by cohen [17] . second dataset is a collection of 5840 chest x-ray images of confirmed normal, bacterial and other non-covid-19 viral infections from kermany et al. [41] . 
the third dataset contains 73 confirmed covid-19 chest x-rays collected from the following websites: radiological society of north america (rsna), radiopaedia, and italian society of medical and interventional radiology (sirm). this dataset is also publicly available in [42] . in total, 6024 chest x-ray images from the 3 datasets were used, which we divide into four classes as follows: 1575 normal chest x-rays, 2771 confirmed bacterial infection cases, 1494 viral (non-covid-19) cases and 184 covid19 images. in fig. 3 examples of all four radiographic x-ray classes are shown. to shed more light on the number and nature of the artifacts present in the 3 datasets used in this work, we inspected every single image to check whether it contains an artifact and, if so, of what type. in table 1 we report the percentage of images that contain some form of artifact, and in fig. 4 we highlight the different types of artifacts. each database contains images of different sizes (i.e. the images have different pixel resolutions). in table 1 , we also show the variety of image resolutions in the databases by presenting the minimum and maximum pixel resolution that every database contains. as can be seen from the percentages in table 1 , a high number of images contain some form of artifact that may affect the diagnostic results produced by cnn models. it is undesirable for artifacts to sway the results of any machine learning classifier, and we show the effect of these artifacts on the diagnostic decisions made by cnn models in the rest of this paper, especially in part a of section iii. figure 4 depicts different types of text and medical device traces present in the 3 datasets used in our experiments. some of the artefacts, such as those at the corners of the images, can be removed by cropping or automatic segmentation, but an artefact like the one in the middle image in fig.
4 is harder to remove automatically or manually. it should also be noted that despite the small amount of background present in the chest x-ray images, it does still affect the decisions of cnn models and we are going to demonstrate this in the next section. details of distributing the images to train set, validation set, and test set will be discussed and explained in the next section. we adopted transfer learning (tl) approach to investigate the performance of the cnn architectures discussed here and compare it with proposed cnn-x architecture. tl is the process of utilizing gained knowledge (learned weights) from solving one problem to a different but related problem. weights optimized from training the 12 cnn models on imagenet dataset used in tl mode such that weights in all layers are retrained on our x-ray images. all images from training and testing sets are resized to the suitable dimensions that each of the architectures designed for. no preprocessing applied to input images because none of the methods in the literature (so far) mentioned it and hence we followed the same norm. training parameters in tl for all 12 cnn architectures are as follows: number of epochs = 20, minibatch size = 15, initial learning rate = 0.0003. all experiments conducted using matlab version 2019b on a core i5 cpu machine with 16 gb of ram and 3.30 ghz. to measure cnn classification performance, four metrics were recorded which are sensitivity, specificity, f1-score and classification confidence. to be able to calculate the aforementioned metrics the following measures of test classification computed: true positive (tp): number of correctly identified disease x-ray images. false negative (fn): number of incorrectly classified disease x-ray images. true negative (tn): number of correctly identified healthy x-ray cases. false positive (fp): incorrectly identified healthy x-ray cases. 
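the four counts defined above determine the usual performance measures. the paper's own equations are given in its appendix, which is not reproduced in this text, so the following python sketch uses the standard definitions (sensitivity = tp/(tp+fn), specificity = tn/(tn+fp), f1 = 2tp/(2tp+fp+fn)). the function name and the example counts are hypothetical and are not results from the paper.

```python
def classification_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Sensitivity, specificity and F1-score from confusion counts."""
    # Guard against empty classes to avoid division by zero.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    f1_denominator = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denominator if f1_denominator else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity, "f1": f1}


if __name__ == "__main__":
    # Hypothetical counts, for illustration only.
    print(classification_metrics(tp=8, fn=2, tn=18, fp=2))
```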
furthermore, tp refers to disease (covid-19, bacterial or viral) x-ray images correctly identified as a disease x-ray image, while fp refers to normal or other pneumonia cases incorrectly identified as covid-19 disease. sensitivity measures the proportion of diseased cases correctly detected by cnns, while specificity measures the proportion of healthy cases correctly identified as healthy by cnn models. the equations for sensitivity, specificity and the f1-score are provided in the appendix. because the number of covid-19 chest x-ray images is small in comparison with the other 3 classes, it is sometimes misleading to rely on the sensitivity and specificity of cnn models alone. therefore, we also report an estimate of the 95% confidence interval (see the appendix) of the classification error of each of the cnn models utilised here, where we assume that the cnn classification output is distributed normally, i.e. follows a gaussian distribution. the smaller the confidence interval, the more reliable the predictive model is, and hence the more likely the cnn model is to work on other datasets.
fig. 4 highlights of different types of artifacts in the deployed datasets
three different scenarios were deployed to test the performance of the 12 off-the-shelf cnn architectures as well as our proposed cnn-x model, which are discussed next. in the first scenario, the cnn architectures are trained on 1341 normal x-ray images and 111 covid-19 cases, while 234 normal cases and 73 covid-19 cases are used for testing. table 2 shows the results obtained from all 13 cnn architectures. the aim of this scenario is to see how well covid-19 can be differentiated from normal chest x-rays. it can be seen from table 2 that all of the cnn models (except vgg19 and vgg19) can be deployed successfully to detect covid-19 x-rays with a sensitivity above 90%.
however, the specificity of some of the models is below 90%, so these can be avoided in practice. in this vein, one can opt to rely on the highest performing architectures such as xception, densenet201, squeezenet and inceptionresnetv2, as their specificity is > 99%. it should be noted that the performance of our proposed cnn architecture is comparable to other state-of-the-art cnn models: it achieves a sensitivity of 93% and a specificity of 97%, which is better than alexnet, googlenet, vgg19 and vgg16. despite the excellent results in table 2 , this is not a realistic scenario on which to build machine learning algorithms for covid-19 detection at the present time, because there is no guarantee that the system is not classifying other pneumonia infections as covid-19 and vice versa. furthermore, differentiating extreme covid-19 cases from normal chest x-rays may not be of clinical significance; it is the discrimination of covid-19 from other pneumonia that is of particular interest. hence, we designed the second scenario to address the task of discriminating covid-19 cases from other viral, bacterial and normal x-ray images. in this scenario we aim to classify x-ray images into the 4 respective classes of normal, covid-19, bacteria and viral (non-covid-19). this addresses the limitation of the first scenario: ultimately, any machine learning algorithm needs to discriminate covid-19 chest x-rays not only from normal x-rays but also from other viral and bacterial infections. this is a necessary condition to stop the spread of the virus and prepare covid-19 patients for special treatments. a total of 1341 normal x-rays, 2529 bacteria cases, 1346 viral x-rays and 111 covid-19 x-rays were used for training. for testing, 234, 242, 148 and 73 x-rays of normal, bacteria, viral and covid-19 were used, respectively.
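the split described above can be tallied, and the 95% confidence interval of a classification error estimated, with a short python sketch. the interval uses the normal approximation error ± 1.96·sqrt(error·(1 − error)/n); this is an assumption that is consistent with the gaussian assumption stated earlier, not a formula confirmed by the text. the function name and the example error of 0.30 are illustrative only.

```python
import math

# Per-class image counts for scenario 2, as given in the text.
TRAIN = {"normal": 1341, "bacteria": 2529, "viral": 1346, "covid-19": 111}
TEST = {"normal": 234, "bacteria": 242, "viral": 148, "covid-19": 73}


def error_confidence_interval(error: float, n: int, z: float = 1.96):
    """Normal-approximation interval for a classification error on n samples."""
    half_width = z * math.sqrt(error * (1.0 - error) / n)
    # Clip to [0, 1] because an error rate cannot leave that range.
    return max(0.0, error - half_width), min(1.0, error + half_width)


n_train = sum(TRAIN.values())  # 5327 training images
n_test = sum(TEST.values())    # 697 test images

if __name__ == "__main__":
    print("train:", n_train, "test:", n_test, "total:", n_train + n_test)
    # Illustrative error value, not a result reported in the paper.
    print(error_confidence_interval(0.30, n_test))
```

the tally also makes the class imbalance explicit: covid-19 accounts for 111 of the 5327 training images.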
it is worth to notice that we train the model on 111 covid chest x-rays from covidx dataset but we test the cnn models on 73 chest x-rays from a different source. this is critical to examine the effectiveness of feature maps learnt by cnn on one source and testing it on images coming from a different source. table 3 below demonstrates classification performance obtained by adopting this scenario. in this scenario we used part of the dataset to train cnn models to see the effect of each architecture with the smaller number of image samples. the rationale behind this scenario is the fact that most of the time the challenge in medical image analysis is limitation of available data for investigation and to reduce bias in having unbalanced number of images in training phase. hence, the design of this scenario is to get more insight of how these cnn models perform in the case of limited availability of image samples. in this scenario, four classes used with 350 x-ray images of normal, bacteria, viral and 111 x-rays of covid-19 for training whereas the same number of testing images used for the four classes are as scenario 2. table 3 shows experimental results obtained from scenario 2 and scenario 3, where s n and s p stand for sensitivity and specificity respectively in table 3 . it clearly depicts that none of the cnn architectures perform well on differentiating x-rays to all four classes. perhaps the only exception is inception-resnetv2 that performs better in comparison with the rest of the architectures especially on normal x-rays with sensitivity of > 76% using all image samples. the good performance of inception-resnetv2 is due to the idea of combining residual learning with inception blocks which makes the performance to be better than using resnet or google/inception architectures alone. 
furthermore, we notice that all cnn models work well on detecting two of the classes, namely bacteria and covid-19, but not performing well on classifying normal and viral x-rays to their respective classes. this suggests that deployed cnn models learn features of bacterial and covid-19 better than normal and non-covid19 viral infections. in other words, there is more similarity between features of x-ray images of viral infection and normal cases with each other and with other classes that cannot be distinguished easily. the second-best performing architecture, using all image samples, is xception architecture with sensitivity of 97%, 94%, 66% and 82% for bacteria, covid-19, normal and viral chest infections respectively. when it comes to scenario 3, where only 350 images used from normal, bacterial and viral chest x-rays, again inception-resnetv2 outperform all other cnn architectures including cnn-x. this confirms the effectiveness of inception-resnetv2 in terms of design and learning power. nonetheless, we want to remind the reader that input images have not been segmented and they contain artefact that may contribute to cnn prediction but has no relation to covid-19 infection. we confirm this point in the next section, see figs. 5 and 6, where we demonstrate the region(s) in the image used by cnns and some, if not all, of these regions are artifacts. direct comparison of best results obtained here, which is by inception-resnetv2, is not possible with other works in the literature because the covid-19 images used for testing here is different and more importantly the number of testing images is 73 which is higher than the number of test images used in [2] and [7] whereby they tested their cnns based on 8 covid-19 images only. nonetheless, our results are outperforming covid-net [2] in terms of sensitivity for viral and normal x-ray classification. 
the sensitivity of inception-resnet-v2 also outperforms that of covid-net for bacterial, covid-19, and viral infection classification. in scenario 2, the proposed cnn-x architecture does not perform better than any of the 12 cnn models used if we take the overall classification error obtained from each cnn architecture into consideration, see the 4th column of table 5 in the appendix. nonetheless, cnn-x's overall classification error is 0.341, which is comparable and close to squeezenet and vgg19 with classification errors of 0.324 and 0.303 respectively. in scenario 3, cnn-x with a classification error of 0.377 outperforms 7 cnn models, namely resnet101, xception, vgg16, alexnet, squeezenet, resnet18 and densenet201, with classification errors of 0.396, 0.418, 0.436, 0.443, 0.446, 0.449, and 0.494 respectively. the classification errors of scenario 2 and scenario 3 can be seen in table 5 and table 6 in the appendix, together with the classification confidence and f1-score of each class. table 4 contains the elapsed training time of each of the 13 cnn models used here. next, we qualitatively analyse the performance of all cnn models used here by visually inspecting the most discriminating regions on x-ray images used by cnns. this step is critical so that radiologists can visualize the regions used by cnns to predict pneumonia presence in input x-ray images. there are many ways to visualize the region(s) used by cnns to predict the class label of an input image, such as gradient-weighted class activation mappings, global average pooling class activation mappings and others [21, 43, 44] . to interpret the output decision made by any of the cnn architectures investigated in this study, heatmaps of the most discriminating regions were generated and visualized for the testing images using the method introduced in [21] , which is known as class activation mappings (cam). using cams, one can highlight the class-specific distinctive regions used by a cnn that lead to its prediction.
after fully training a cnn model, a testing image is fed into the network and feature maps are extracted from the final convolutional layer. in what follows we briefly introduce the procedure of generating cams. let $a_u(x, y)$ be the activation of unit $u$ of the last convolutional layer at spatial position $(x, y)$. let

$$F_u = \sum_{x,y} a_u(x, y) \tag{1}$$

be the result of the average pooling operation (up to a constant normalization factor); the input to the softmax layer can then be defined as follows:

$$S_l = \sum_{u} w_u^l F_u \tag{2}$$

where $l$ is the class label and $w_u^l$ is the weight of class $l$ for unit $u$. here, $w_u^l$ indicates the importance of the activation $a_u$ for a given class $l$. the probability score output by the softmax for a given class $l$ can then be defined as follows:

$$P_l = \frac{\exp(S_l)}{\sum_{k} \exp(S_k)} \tag{3}$$

substituting eq. (1) into eq. (2) we obtain the following:

$$S_l = \sum_{u} w_u^l \sum_{x,y} a_u(x, y) = \sum_{x,y} \sum_{u} w_u^l a_u(x, y) \tag{4}$$

then the activation map of each class $l$ can be defined at each spatial position $(x, y)$ as follows:

$$M_l(x, y) = \sum_{u} w_u^l a_u(x, y) \tag{5}$$

finally, substituting the activation maps for each class label in eq. (5) into eq. (4) we obtain the softmax input for each class label $l$ as follows:

$$S_l = \sum_{x,y} M_l(x, y) \tag{6}$$

hence, $M_l(x, y)$ indicates the discriminative power of the activation maps at the spatial position $(x, y)$ that leads to the decision made by the cnn to classify the input image into class $l$. to allow comparison with the input image, bilinear up-sampling is then applied to resize the activation map to the size of the input images accepted by each cnn model. in fig. 5 we demonstrate the image regions used by cnn models that lead to a successful class prediction. it can be observed that on very few occasions the cnn algorithms focus on the frontal region of the chest (i.e. lung region), where we search for signs/features of covid-19 and other infections. rather, they use regions outside the frontal view of the chest area, see 1st column of row (b) and 3rd and 4th column of row (e) of fig. 5 . direct overlaps of hot spots of cams with texts can be seen in fig. 5 , especially in the 1st column of row (b), 1st, 3rd and 4th columns of row (e), 1st column of row (g) and 1st-4th columns of row (j).
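the class activation map described above, a weighted sum of the activations of the last convolutional layer using the weights of one class, can be sketched in plain python as follows. this is a minimal illustration on toy 2-by-2 feature maps, not the authors' matlab implementation; the function names and the toy values are assumptions, and the bilinear up-sampling step is omitted.

```python
def class_activation_map(activations, class_weights):
    """Weighted sum over units of H-by-W feature maps for a single class.

    activations: list of U feature maps, each a list of H rows of W floats.
    class_weights: list of U weights linking each unit to the chosen class.
    """
    if len(activations) != len(class_weights):
        raise ValueError("one weight is required per feature map")
    height, width = len(activations[0]), len(activations[0][0])
    cam = [[0.0] * width for _ in range(height)]
    for feature_map, weight in zip(activations, class_weights):
        for y in range(height):
            for x in range(width):
                cam[y][x] += weight * feature_map[y][x]
    return cam


def class_score(cam):
    """Sum of the map over all positions, i.e. the softmax input of the class."""
    return sum(sum(row) for row in cam)


if __name__ == "__main__":
    # Toy activations and weights, chosen only to keep the arithmetic simple.
    toy_maps = [[[1.0, 0.0], [0.0, 2.0]], [[0.0, 3.0], [1.0, 0.0]]]
    toy_weights = [0.5, -1.0]
    cam = class_activation_map(toy_maps, toy_weights)
    print(cam, class_score(cam))
```

summing the map over all positions gives the same value as weighting the pooled activations of each unit, which is the identity the derivation relies on.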
medical device traces, on the other hand, can also be used by cnns on medical images to derive their decision as it can be seen in fig. 5, 1st column of rows (b, c, g-j) . furthermore, ranking the 13 cnn architectures deployed in this study according to cams will provide a new approach of using cnn architectures that are not solely based on classification results obtained. according to the intersection (overlap) between the lung region and cams hot spot distribution, we ranked the 13 cnn models into 7 categories (r1 being good and r7 being worst) as follows: r1: resnet50. r2: inceptionv3. r3: resnet18 and inceptionresnet. r4: resnet101 and xception. r5: googlenet and cnn-x. r6: densenet201, squeezenet and alexnet. r7: vgg16 and vgg19. in the same vein, incorrect classification may be caused by these artifacts, see fig. 6 where we show examples of mis-classified images by cnns and their corresponding cams to highlight the most discriminating regions lead to cnn decisions. for example, 4th column of most of the rows in fig. 6 is an x-ray image where texts on medical images lead to an incorrect classification decision by cnns. specifically, there is a letter r in the top left corner and small texts in top-right corner of a viral x-ray image whereby most of the cnn architectures cheated by using features of these texts to obtain their final prediction. in row (j) of fig. 6 , column number 3, we can see clearly that inceptionresnet used the small amount of the background in the image to derive its incorrect decision. this conclusion is mainly because there is a direct overlap between cams and the background region present in this image. first column of row (e) and row (m) in fig. 6 is a good example where regions outside roi have been used to obtain final classification prediction by vgg19 and xception architectures. 
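the text does not specify how the overlap between cam hot spots and the lung region was measured. one hypothetical way to quantify it, assuming a binary lung mask is available (the paper itself does not segment the lungs), is the fraction of hot-spot pixels that fall inside the mask. the function name, the 0.5 threshold and the toy arrays below are all assumptions made for illustration.

```python
def hotspot_overlap(cam, lung_mask, threshold=0.5):
    """Fraction of hot-spot pixels that lie inside a binary lung mask.

    A pixel counts as a hot spot when its CAM value is at least
    `threshold` times the maximum CAM value (an assumed convention).
    """
    peak = max(max(row) for row in cam)
    if peak <= 0:
        return 0.0  # no positive evidence anywhere in the map
    hot = inside = 0
    for cam_row, mask_row in zip(cam, lung_mask):
        for value, in_lung in zip(cam_row, mask_row):
            if value >= threshold * peak:
                hot += 1
                if in_lung:
                    inside += 1
    return inside / hot


if __name__ == "__main__":
    toy_cam = [[0.9, 0.1], [0.6, 0.2]]  # toy values, not real CAM output
    toy_mask = [[1, 0], [0, 0]]         # 1 marks a pixel inside the lungs
    print(hotspot_overlap(toy_cam, toy_mask))
```

a value close to 1 would indicate that the network mostly relies on the lung region, while a value close to 0 would indicate reliance on text, device traces or background.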
therefore, we conclude that using x-ray images as they are, without preprocessing to segment the region of interest and remove some hidden noise, is not good practice and results in biased and misleading classification predictions. in other words, we want a cnn model that learns the symptoms (i.e. features) of covid-19 disease and whose classification prediction is based solely on these features. this paper presented a critical analysis of 12 off-the-shelf cnn architectures, proposed originally for natural image analysis, for the purpose of aiding radiologists to discriminate covid-19 disease based on chest x-ray images. we also proposed a simple cnn architecture, with fewer parameters than many of the well-established cnn architectures, that can outperform 7 cnn architectures, namely xception, resnet101, vgg16, alexnet, squeezenet, resnet18 and densenet201, when trained on a small dataset of images. the overall classification error for each of the 13 cnn architectures deployed in our investigation to help radiologists to diagnose covid19 can be seen in tables 5 and 6 for scenarios 2 and 3 respectively. furthermore, besides the quantitative analysis of cnns, we qualitatively assessed the cnn methods investigated in this paper using class activation mappings, where we visualize the regions on x-ray images utilised by cnns to derive their final prediction scores. we demonstrated that deep learning predictions of covid-19 disease are not reliable when clear artefacts such as texts and medical device traces are present on the input x-ray image. in the same vein, we demonstrated that cnns will use regions/features in the input image which are outside the roi and have no relation to covid-19 pneumonia, see figs. 5 and 6 for more than one example as evidence. therefore, positive or negative class predictions by a cnn model must be treated cautiously unless qualitatively inspected and approved by radiologists.
whenever cnn models used/learnt features inside roi and these features lead to the final decision by cnn algorithms, then and only then radiologists can rely on such diagnostic decisions by cnns. figures 5 and 6 contain multiple examples where texts, medical device traces and irrelevant x-rays image regions (i.e. backgrounds) used by cnns to build their prediction results. it is important to note that, one needs to design machine learning algorithms based on radiologist opinions and not fully depend on data-driven mechanisms. one limitation of current study is the lack of using multiple quality assessment tools to analyse cnn models decisions beside class activation mappings. to address this issue, we need to expand the list of methods to qualitatively analyse cnn predictions to include gradient cams and saliency maps. future research directions, and in progress work, contain segmenting the lung region from chest x-rays and removing other artefact such as text and medical device traces on chest x-rays. we have not encountered any study that segmented the lung region in x-ray images and then feed it to cnn models, while this is considered as one of the important areas that needs to be further researched. the reliability of lung segmentation approaches is another problem that needs to be addressed and further researched by machine learning community. we have also not encountered, to the best of our knowledge, any study incorporated clinical and cardiac features with deep learning models or used cardiac features alone to prognosticate covid-19 pneumonia. data from other sources need to be incorporated to build cnn models that can be generalized and not biased towards a specific country, such as china or italy, or a targeted population. funding this study did not receive external funding. conflict of interest the authors declare that they have no conflict of interest. ethical approval for this type of study, formal consent is not required. 
informed consent this article does not contain any studies with human participants or animals performed by any of the authors. and n = number of observation used to evaluate the model.
key: cord-193356-hqbstgg7 authors: widrich, michael; schafl, bernhard; ramsauer, hubert; pavlović, milena; gruber, lukas; holzleitner, markus; brandstetter, johannes; sandve, geir kjetil; greiff, victor; hochreiter, sepp; klambauer, gunter title: modern hopfield networks and attention for immune repertoire classification date: 2020-07-16 journal: nan doi: nan sha: doc_id: 193356 cord_uid: hqbstgg7 a central mechanism in machine learning is to identify, store, and recognize patterns. how to learn, access, and retrieve such patterns is crucial in hopfield networks and the more recent transformer architectures. we show that the attention mechanism of transformer architectures is actually the update rule of modern hopfield networks that can store exponentially many patterns. we exploit this high storage capacity of modern hopfield networks to solve a challenging multiple instance learning (mil) problem in computational biology: immune repertoire classification. accurate and interpretable machine learning methods solving this problem could pave the way towards new vaccines and therapies, which is currently a very relevant research topic intensified by the covid-19 crisis. immune repertoire classification based on the vast number of immunosequences of an individual is a mil problem with an unprecedentedly massive number of instances, two orders of magnitude larger than currently considered problems, and with an extremely low witness rate. in this work, we present our novel method deeprc that integrates transformer-like attention, or equivalently modern hopfield networks, into deep learning architectures for massive mil such as immune repertoire classification. we demonstrate that deeprc outperforms all other methods with respect to predictive performance on large-scale experiments, including simulated and real-world virus infection data, and enables the extraction of sequence motifs that are connected to a given disease class.
source code and datasets: https://github.com/ml-jku/deeprc
transformer architectures (vaswani et al., 2017) and their attention mechanisms are currently used in many applications, such as natural language processing (nlp), imaging, and also in multiple instance learning (mil) problems. in mil, a set or bag of objects is labelled rather than the objects themselves, as in standard supervised learning tasks (dietterich et al., 1997). examples for mil problems are medical images, in which each sub-region of the image represents an instance, video classification, in which each frame is an instance, text classification, where words or sentences are instances of a text, point sets, where each point is an instance of a 3d object, and remote sensing data, where each sensor is an instance (carbonneau et al., 2018; uriot, 2019). attention-based mil has been successfully used for image data, for example to identify tiny objects in large images (ilse et al., 2018; pawlowski et al., 2019; tomita et al., 2019; kimeswenger et al., 2019) and transformer-like attention mechanisms for sets of points and images.
figure caption (beginning lost in extraction): a pooling function f is used to obtain a repertoire-representation z for the input object. finally, an output network o predicts the class label ŷ. b) deeprc uses stacked 1d convolutions for a parameterized function h due to their computational efficiency. potentially, millions of sequences have to be processed for each input object. in principle, also recurrent neural networks (rnns), such as lstms (hochreiter et al., 2007), or transformer networks (vaswani et al., 2017) may be used but are currently computationally too costly. c) attention-pooling is used to obtain a repertoire-representation z for each input object, where deeprc uses weighted averages of sequence-representations. the weights are determined by an update rule of modern hopfield networks that allows to retrieve exponentially many patterns.
however, in mil problems considered by machine learning methods up to now, the number of instances per bag is in the range of hundreds or few thousands (carbonneau et al., 2018; lee et al., 2019 ) (see also tab. a2). at the same time the witness rate (wr), the rate of discriminating instances per bag, is already considered low at 1% − 5%. we will tackle the problem of immune repertoire classification with hundreds of thousands of instances per bag without instance-level labels and with extremely low witness rates down to 0.01% using an attention mechanism. we show that the attention mechanism of transformers is the update rule of modern hopfield networks (krotov & hopfield, 2016 demircigil et al., 2017) that are generalized to continuous states in contrast to classical hopfield networks (hopfield, 1982) . a detailed derivation and analysis of modern hopfield networks is given in our companion paper (ramsauer et al., 2020) . these novel continuous state hopfield networks allow to store and retrieve exponentially (in the dimension of the space) many patterns (see next section). thus, modern hopfield networks with their update rule, which are used as an attention mechanism in the transformer, enable immune repertoire classification in computational biology. immune repertoire classification, i.e. classifying the immune status based on the immune repertoire sequences, is essentially a text-book example for a multiple instance learning problem (dietterich et al., 1997; maron & lozano-pérez, 1998; wang et al., 2018) . briefly, the immune repertoire of an individual consists of an immensely large bag of immune receptors, represented as amino acid sequences. usually, the presence of only a small fraction of particular receptors determines the immune status with respect to a particular disease (christophersen et al., 2014; emerson et al., 2017) . 
this is because the immune system has already acquired a resistance if one or few particular immune receptors that can bind to the disease agent are present. therefore, classification of immune repertoires bears a high difficulty since each immune repertoire can contain millions of sequences as instances with only a few indicating the class. further properties of the data that complicate the problem are: (a) the overlap of immune repertoires of different individuals is low (in most cases, maximally low single-digit percentage values) (greiff et al., 2017; elhanati et al., 2018) , (b) multiple different sequences can bind to the same pathogen (wucherpfennig et al., 2007) , and (c) only subsequences within the sequences determine whether binding to a pathogen is possible (dash et al., 2017; glanville et al., 2017; akbar et al., 2019; springer et al., 2020; fischer et al., 2019) . in summary, immune repertoire classification can be formulated as multiple instance learning with an extremely low witness rate and large numbers of instances, which represents a challenge for currently available machine learning methods. furthermore, the methods should ideally be interpretable, since the extraction of class-associated sequence motifs is desired to gain crucial biological insights. the acquisition of human immune repertoires has been enabled by immunosequencing technology (georgiou et al., 2014; brown et al., 2019) which allows to obtain the immune receptor sequences and immune repertoires of individuals. each individual is uniquely characterized by their immune repertoire, which is acquired and changed during life. this repertoire may be influenced by all diseases that an individual is exposed to during their lives and hence contains highly valuable information about those diseases and the individual's immune status. immune receptors enable the immune system to specifically recognize disease agents or pathogens. 
each immune encounter is recorded as an immune event into immune memory by preserving and amplifying immune receptors in the repertoire used to fight a given disease. this is, for example, the working principle of vaccination. each human has about $10^7$ to $10^8$ unique immune receptors with low overlap across individuals and sampled from a potential diversity of $> 10^{14}$ receptors (mora & walczak, 2019). the ability to sequence and analyze human immune receptors at large scale has led to fundamental and mechanistic insights into the adaptive immune system and has also opened the opportunity for the development of novel diagnostics and therapy approaches (georgiou et al., 2014; brown et al., 2019). immunosequencing data have been analyzed with computational methods for a variety of different tasks (greiff et al., 2015; shugay et al., 2015; miho et al., 2018; yaari & kleinstein, 2015; wardemann & busse, 2017). a large part of the available machine learning methods for immune receptor data has been focusing on the individual immune receptors in a repertoire, with the aim to, for example, predict the antigen or antigen portion (epitope) to which these sequences bind or to predict sharing of receptors across individuals (gielis et al., 2019; springer et al., 2020; jurtz et al., 2018; moris et al., 2019; fischer et al., 2019; greiff et al., 2017; sidhom et al., 2019; elhanati et al., 2018). recently, jurtz et al. (2018) used 1d convolutional neural networks (cnns) to predict antigen binding of t-cell receptor (tcr) sequences (specifically, binding of tcr sequences to peptide-mhc complexes) and demonstrated that motifs can be extracted from these models. similarly, konishi et al. (2019) use cnns, gradient boosting, and other machine learning techniques on b-cell receptor (bcr) sequences to distinguish tumor tissue from normal tissue. however, the methods presented so far predict a particular class, the epitope, based on a single input sequence.
immune repertoire classification has been considered as a mil problem in the following publications. a deep learning framework called deeptcr (sidhom et al., 2019) implements several deep learning approaches for immunosequencing data. the computational framework, inter alia, allows for attention-based mil repertoire classifiers and implements a basic form of attention-based averaging. ostmeyer et al. (2019) already suggested a mil method for immune repertoire classification. this method considers 4-mers, fixed sub-sequences of length 4, as instances of an input object and trained a logistic regression model with these 4-mers as input. the predictions of the logistic regression model for each 4-mer were max-pooled to obtain one prediction per input object. this approach is characterized by (a) the rigidity of the k-mer features as compared to convolutional kernels (alipanahi et al., 2015; zhou & troyanskaya, 2015; zeng et al., 2016) , (b) the max-pooling operation, which constrains the network to learn from a single, top-ranked k-mer for each iteration over the input object, and (c) the pooling of prediction scores rather than representations (wang et al., 2018) . our experiments also support that these choices in the design of the method can lead to constraints on the predictive performance (see table 1 ). our proposed method, deeprc, also uses a mil approach but considers sequences rather than k-mers as instances within an input object and a transformer-like attention mechanism. deeprc sets out to avoid the above-mentioned constraints of current methods by (a) applying transformer-like attention-pooling instead of max-pooling and learning a classifier on the repertoire rather than on the sequence-representation, (b) pooling learned representations rather than predictions, and (c) using less rigid feature extractors, such as 1d convolutions or lstms. 
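the k-mer based mil approach summarised above (4-mers as instances, a logistic model per 4-mer, max-pooling of the predictions) can be sketched as follows. this is a toy illustration only: the lookup table `toy_weights`, the bias and the example sequences are hypothetical, and a real model would learn its parameters from data rather than use a hand-set table.

```python
import math


def kmers(sequence, k=4):
    """All contiguous sub-sequences of length k."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def repertoire_score(sequences, kmer_weights, bias=-3.0, k=4):
    """Max-pool the logistic scores over every k-mer of every sequence."""
    best = 0.0
    for seq in sequences:
        for kmer in kmers(seq, k):
            logit = kmer_weights.get(kmer, 0.0) + bias
            best = max(best, sigmoid(logit))
    return best


# hypothetical weights: a single 4-mer is treated as disease-associated
toy_weights = {"SFGV": 6.0}
positive = ["CASSLGQAYEQYF", "CASSFGVNTEAFF"]   # second sequence contains SFGV
negative = ["CASSLGQAYEQYF", "CASSPDRGNTIYF"]
pos_score = repertoire_score(positive, toy_weights)
neg_score = repertoire_score(negative, toy_weights)
```

because of the max operation, only the single top-ranked 4-mer of a repertoire determines the output (and would receive gradient during training), which is constraint (b) in the list above.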
in this work, we contribute the following: (a) we demonstrate that continuous generalizations of binary modern hopfield networks (krotov & hopfield, 2016; demircigil et al., 2017) have an update rule that is known as the attention mechanism in the transformer, and we show that these modern hopfield networks have exponential storage capacity, which allows them to extract patterns among a large set of instances (next section). (b) based on this result, we propose deeprc, a novel deep mil method based on modern hopfield networks for large bags of complex sequences, as they occur in immune repertoire classification (section "deep repertoire classification"). (c) we evaluate the predictive performance of deeprc and other machine learning approaches for the classification of immune repertoires in a large comparative study (section "experimental results").
exponential storage capacity of continuous state modern hopfield networks with transformer attention as update rule
in this section, we show that modern hopfield networks have exponential storage capacity, which will later allow us to approach massive multiple-instance learning problems, such as immune repertoire classification. see our companion paper (ramsauer et al., 2020) for a detailed derivation and analysis of modern hopfield networks. we assume patterns $x_1, \ldots, x_N \in \mathbb{R}^d$ that are stacked as columns to the matrix $X = (x_1, \ldots, x_N)$ and a query pattern $\xi$ that also represents the current state. the largest norm of a pattern is $M = \max_i \|x_i\|$. the separation $\Delta_i$ of a pattern $x_i$ is defined as its minimal dot product difference to any of the other patterns:
$$\Delta_i = \min_{j,\, j \neq i} \left( x_i^T x_i - x_i^T x_j \right).$$
we consider a modern hopfield network with current state $\xi$ and the energy function
$$E = -\beta^{-1} \log \left( \sum_{i=1}^{N} \exp(\beta\, x_i^T \xi) \right) + \beta^{-1} \log N + \frac{1}{2} \xi^T \xi + \frac{1}{2} M^2.$$
for energy $E$ and state $\xi$, the update rule
$$\xi^{\mathrm{new}} = X\, \mathrm{softmax}(\beta X^T \xi) \qquad (1)$$
is proven to converge globally to stationary points of the energy $E$, which are local minima or saddle points (see ramsauer et al. (2020), appendix, theorem a2). surprisingly, the update rule eq.
(1) is also the formula of the well-known transformer attention mechanism. to see this more clearly, we simultaneously update several queries $\xi_i$. furthermore, the queries $\xi_i$ and the patterns $x_i$ are linear mappings of vectors $y_i$ into the space $\mathbb{R}^d$. for matrix notation, we set $x_i = W_K^T y_i$, $\xi_i = W_Q^T y_i$ and multiply the result of our update rule with $W_V$. using $Y = (y_1, \ldots, y_N)^T$, we define the matrices
$$X^T = K = Y W_K, \qquad Q = Y W_Q, \qquad V = Y W_K W_V = X^T W_V,$$
and the patterns are now mapped to the hopfield space with dimension $d = d_k$. we set $\beta = 1/\sqrt{d_k}$ and change softmax to a row vector. the update rule eq. (1) multiplied by $W_V$ performed for all queries simultaneously becomes in row vector notation:
$$\mathrm{softmax}\left( \frac{1}{\sqrt{d_k}}\, Q K^T \right) V. \qquad (2)$$
this formula is the transformer attention. if the patterns $x_i$ are well separated, the iterate eq. (1) converges to a fixed point close to a pattern to which the initial $\xi$ is similar. if the patterns are not well separated, the iterate eq. (1) converges to a fixed point close to the arithmetic mean of the patterns. if some patterns are similar to each other but well separated from all other vectors, then a metastable state between the similar patterns exists. iterates that start near a metastable state converge to this metastable state. for details see ramsauer et al. (2020), appendix, sect. a2. typically, the update converges after one update step (see ramsauer et al. (2020), appendix, theorem a8) and has an exponentially small retrieval error (see ramsauer et al. (2020), appendix, theorem a9). our main concern for application to immune repertoire classification is the number of patterns that can be stored and retrieved by the modern hopfield network, equivalently to the transformer attention head. the storage capacity of an attention mechanism is critical for massive mil problems. we first define what we mean by storing and retrieving patterns from the modern hopfield network. definition 1 (pattern stored and retrieved). we assume that around every pattern $x_i$ a sphere $S_i$ is given.
we say $x_i$ is stored if there is a single fixed point $x_i^* \in S_i$ to which all points $\xi \in S_i$ converge, and $S_i \cap S_j = \emptyset$ for $i \neq j$. we say $x_i$ is retrieved if the iteration eq. (1) converges to the single fixed point $x_i^* \in S_i$. for randomly chosen patterns, the number of patterns that can be stored is exponential in the dimension $d$ of the space of the patterns ($x_i \in \mathbb{R}^d$). theorem 1. we assume a failure probability $0 < p \le 1$ and randomly chosen patterns on the sphere with radius $M = K \sqrt{d-1}$. we define $a := \frac{2}{d-1} \left( 1 + \ln(2 \beta K^2 p (d-1)) \right)$, $b := \frac{2 K^2 \beta}{5}$, and $c = \frac{b}{W_0(\exp(a + \ln(b)))}$, where $W_0$ is the upper branch of the lambert w function, and ensure $c \ge \left( \frac{2}{\sqrt{p}} \right)^{\frac{4}{d-1}}$. then with probability $1 - p$, the number of random patterns that can be stored is
$$N \ge \sqrt{p}\; c^{\frac{d-1}{4}}.$$
examples are $c \ge 3.1546$ for $\beta = 1$, $K = 3$, $d = 20$ and $p = 0.001$ ($a + \ln(b) > 1.27$) and $c \ge 1.3718$ for $\beta = 1$, $K = 1$, $d = 75$, and $p = 0.001$ ($a + \ln(b) < -0.94$). see ramsauer et al. (2020), appendix, theorem a5 for a proof. we have established that a modern hopfield network or a transformer attention mechanism can store and retrieve exponentially many patterns. this allows us to approach mil with massive numbers of instances from which we have to retrieve a few with an attention mechanism.
deep repertoire classification
problem setting and notation. we consider a mil problem, in which an input object $X$ is a bag of $N$ instances $X = \{s_1, \ldots, s_N\}$. the instances have neither dependencies nor an ordering between them, and $N$ can be different for every object. we assume that each instance $s_i$ is associated with a label $y_i \in \{0, 1\}$, assuming a binary classification task, to which we do not have access. we only have access to a label $y = \max_i y_i$ for an input object or bag. note that this poses a credit assignment problem, since the sequences that are responsible for the label $y$ have to be identified, and that the relation between instance-label and bag-label can be more complex (foulds & frank, 2010).
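the two numerical examples given for theorem 1 can be checked with a few lines of code. the sketch below evaluates the constants a, b and c exactly as they are defined in the theorem; the newton iteration for the upper branch of the lambert w function and the helper names are assumptions of this illustration (a library routine could be used instead).

```python
import math


def lambert_w0(x, tol=1e-12, max_iter=100):
    """Upper branch of the Lambert W function for x > 0 (Newton's method)."""
    w = math.log1p(x)  # starting point to the right of the root for x > 0
    for _ in range(max_iter):
        ew = math.exp(w)
        step = (w * ew - x) / (ew * (w + 1.0))
        w -= step
        if abs(step) < tol:
            break
    return w


def theorem1_constants(beta, K, d, p):
    """Constants a, b, c of theorem 1 for given beta, K, d and p."""
    a = 2.0 / (d - 1) * (1.0 + math.log(2.0 * beta * K ** 2 * p * (d - 1)))
    b = 2.0 * K ** 2 * beta / 5.0
    c = b / lambert_w0(math.exp(a + math.log(b)))
    return a, b, c


a1, b1, c1 = theorem1_constants(beta=1.0, K=3.0, d=20, p=0.001)
a2, b2, c2 = theorem1_constants(beta=1.0, K=1.0, d=75, p=0.001)
print(round(c1, 4), round(c2, 4))
```

both settings reproduce the values of c quoted in the theorem to within rounding, and the quantities a + ln(b) fall on the stated sides of 1.27 and -0.94.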
a model $\hat{y} = g(X)$ should be (a) invariant to permutations of the instances and (b) able to cope with the fact that $N$ varies across input objects (ilse et al., 2018), which is a problem also posed by point sets (qi et al., 2017). two principled approaches exist. the first approach is to learn an instance-level scoring function $h: \mathcal{S} \to [0, 1]$, which is then pooled across instances with a pooling function $f$, for example by average-pooling or max-pooling (see below). the second approach is to construct an instance representation $z_i$ of each instance by $h: \mathcal{S} \to \mathbb{R}^{d_v}$ and then encode the bag, or the input object, by pooling these instance representations (wang et al., 2018) via a function $f$. an output function $o: \mathbb{R}^{d_v} \to [0, 1]$ subsequently classifies the bag. the second approach, the pooling of representations rather than scoring functions, is currently best performing (wang et al., 2018). in the problem at hand, the input object $X$ is the immune repertoire of an individual that consists of a large set of immune receptor sequences (t-cell receptors or antibodies). immune receptors are primarily represented as sequences $s_i$ from a space $\mathcal{S}$, i.e. $s_i \in \mathcal{S}$. these sequences act as the instances in the mil problem. although immune repertoire classification can readily be formulated as a mil problem, it is yet unclear how well machine learning methods solve the above-described problem with a large number of instances $N \gg 10{,}000$ and with instances $s_i$ being complex sequences. next we describe currently used pooling functions for mil problems. pooling functions for mil problems. different pooling functions equip a model $g$ with the property to be invariant to permutations of instances and with the ability to process different numbers of instances. typically, a neural network $h_\theta$ with parameters $\theta$ is trained to obtain a function that maps each instance onto a representation: $z_i = h_\theta(s_i)$ and then a pooling function z = f ({z 1 , . . .
, z n }) supplies a representation $z$ of the input object $X = \{s_1, \ldots, s_N\}$. the following pooling functions are typically used: average-pooling: $z = \frac{1}{N} \sum_{i=1}^{N} z_i$, max-pooling: $z = \sum_{m=1}^{d_v} e_m \left( \max_{i,\, 1 \le i \le N} \{ z_{im} \} \right)$, where $e_m$ is the standard basis vector for dimension $m$, and attention-pooling: $z = \sum_{i=1}^{N} a_i z_i$, where the $a_i$ are non-negative ($a_i \ge 0$), sum to one ($\sum_{i=1}^{N} a_i = 1$), and are determined by an attention mechanism. these pooling functions are invariant to permutations of $\{1, \ldots, N\}$ and are differentiable. therefore, they are suited as building blocks for deep learning architectures. we employ attention-pooling in our deeprc model as detailed in the following. modern hopfield networks viewed as transformer-like attention mechanisms. the modern hopfield networks, as introduced above, have a storage capacity that is exponential in the dimension of the vector space and converge after just one update (see ramsauer et al. (2020), appendix). additionally, the update rule of modern hopfield networks is known as key-value attention mechanism, which has been highly successful through the transformer (vaswani et al., 2017) and bert (devlin et al., 2019) models in natural language processing. therefore, using modern hopfield networks with the key-value-attention mechanism as update rule is the natural choice for our task. in particular, modern hopfield networks are theoretically justified for storing and retrieving the large number of vectors (sequence patterns) that appear in the immune repertoire classification task. instead of using the terminology of modern hopfield networks, we explain our deeprc architecture in terms of key-value-attention (the update rule of the modern hopfield network), since it is well known in the deep learning community. the attention mechanism assumes a space of dimension $d_k$ in which keys and queries are compared. a set of $N$ key vectors are combined to the matrix $K$. a set of $d_q$ query vectors are combined to the matrix $Q$.
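the claim that the hopfield update rule acts as key-value attention can be illustrated with a minimal sketch. here the stored patterns serve as both keys and values (i.e. the value projection is taken to be the identity, an assumption of this example), and a single noisy query is updated once; the toy patterns and the choice β = 8 are illustrative and not taken from the paper.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def hopfield_update(patterns, xi, beta):
    """One step xi_new = X softmax(beta * X^T xi); patterns are the columns of X."""
    weights = softmax([beta * dot(x, xi) for x in patterns])
    dim = len(xi)
    xi_new = [sum(w * x[j] for w, x in zip(weights, patterns)) for j in range(dim)]
    return xi_new, weights


# three well separated patterns and a noisy version of the first one as query
patterns = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
noisy_query = [0.9, 0.2, -0.1]
retrieved, weights = hopfield_update(patterns, noisy_query, beta=8.0)
```

the softmax weights are exactly the attention weights of the query over the keys, and the retrieved state is the corresponding weighted average of the values. with well separated patterns, a single update already places almost all weight on the pattern closest to the query.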
similarities between queries and keys are computed by inner products, therefore queries can search for similar keys that are stored. another set of $N$ value vectors are combined to the matrix $V$. the output of the attention mechanism is a weighted average of the value vectors for each query $q$. the $i$-th vector $v_i$ is weighted by the similarity between the $i$-th key $k_i$ and the query $q$. the similarity is given by the softmax of the inner products of the query $q$ with the keys $k_i$. all queries are calculated in parallel via matrix operations. consequently, the attention function $\mathrm{att}(Q, K, V; \beta)$ maps queries $Q$, keys $K$, and values $V$ to $d_v$-dimensional outputs: $\mathrm{att}(Q, K, V; \beta) = \mathrm{softmax}(\beta Q K^T) V$ (see also eq. (2)). while this attention mechanism has originally been developed for sequence tasks (vaswani et al., 2017), it can be readily transferred to sets (ye et al., 2018). this type of attention mechanism will be employed in deeprc. the deeprc method. we propose a novel method deep repertoire classification (deeprc) for immune repertoire classification with attention-based deep massive multiple instance learning and compare it against other machine learning approaches. for deeprc, we consider immune repertoires as input objects, which are represented as bags of instances. in a bag, each instance is an immune receptor sequence and each bag can contain a large number of sequences. note that we will use $z_i$ to denote the sequence-representation of the $i$-th sequence and $z$ to denote the repertoire-representation. at the core, deeprc consists of a transformer-like attention mechanism that extracts the most important information from each repertoire. we first give an overview of the attention mechanism and then provide details on each of the sub-networks $h_1$, $h_2$, and $o$ of deeprc. attention mechanism in deeprc. this mechanism is based on the three matrices $K$ (the keys), $Q$ (the queries), and $V$ (the values) together with a parameter $\beta$. values.
deeprc uses a 1d convolutional network $h_1$ (lecun et al., 1998; hu et al., 2014; kelley et al., 2016) that supplies a sequence-representation $z_i = h_1(s_i)$, which acts as the values $V = Z = (z_1, \ldots, z_N)$ in the attention mechanism (see figure 2). keys. a second neural network $h_2$, which shares its first layers with $h_1$, is used to obtain keys $K \in \mathbb{R}^{N \times d_k}$ for each sequence in the repertoire. this network uses 2 self-normalizing layers (klambauer et al., 2017) with 32 units per layer (see figure 2). query. we use a fixed $d_k$-dimensional query vector $\xi$ which is learned via backpropagation. for more attention heads, each head has a fixed query vector. with the quantities introduced above, the transformer attention mechanism (eq. (2)) of deeprc is implemented as follows:
$$z = \mathrm{att}\left( \xi^T, K, Z; \frac{1}{\sqrt{d_k}} \right) = \mathrm{softmax}\left( \frac{\xi^T K^T}{\sqrt{d_k}} \right) Z, \qquad (3)$$
where $Z \in \mathbb{R}^{N \times d_v}$ are the sequence-representations stacked row-wise, $K$ are the keys, and $z$ is the repertoire-representation and at the same time a weighted mean of sequence-representations $z_i$. the attention mechanism can readily be extended to multiple queries; however, computational demand could constrain this depending on the application and dataset. theorem 1 demonstrates that this mechanism is able to retrieve a single pattern out of several hundreds of thousands. attention-pooling and interpretability. each input object, i.e. repertoire, consists of a large number $N$ of sequences, which are reduced to a single fixed-size feature vector of length $d_v$ representing the whole input object by an attention-pooling function. to this end, a transformer-like attention mechanism adapted to sets is realized in deeprc, which supplies $a_i$, the importance of the sequence $s_i$. this importance value is an interpretable quantity, which is highly desired for the immunological problem at hand. thus, deeprc allows for two forms of interpretability methods. (a) a trained deeprc model can compute attention weights $a_i$, which directly indicate the importance of a sequence.
(b) deeprc furthermore allows for the usage of contribution analysis methods, such as integrated gradients (ig) (sundararajan et al., 2017) or layer-wise relevance propagation (montavon et al., 2018; arras et al., 2019). see sect. a8 for details. classification layer and network parameters. the repertoire-representation $z$ is then used as input for a fully-connected output network $\hat{y} = o(z)$ that predicts the immune status, where we found it sufficient to train single-layer networks. in the simplest case, deeprc predicts a single target, the class label $y$, e.g. the immune status of an immune repertoire, using one output value. however, since deeprc is an end-to-end deep learning model, multiple targets may be predicted simultaneously in classification or regression settings or a mix of both. this allows for the introduction of additional information into the system via auxiliary targets such as age, sex, or other metadata.
figure caption (beginning lost in extraction): table 1 with sub-networks $h_1$, $h_2$, and $o$. $d_l$ indicates the sequence length.
network parameters, training, and inference. deeprc is trained using standard gradient descent methods to minimize a cross-entropy loss. the network parameters are $\theta_1$, $\theta_2$, $\theta_o$ for the sub-networks $h_1$, $h_2$, and $o$, respectively, and additionally $\xi$. in more detail, we train deeprc using adam (kingma & ba, 2014) with a batch size of 4 and dropout of input sequences. implementation. to reduce computational time, the attention network first computes the attention weights $a_i$ for each sequence $s_i$ in a repertoire. subsequently, the top 10% of sequences with the highest $a_i$ per repertoire are used to compute the weight updates and prediction. furthermore, computation of $z_i$ is performed in 16-bit, others in 32-bit precision to ensure numerical stability in the softmax. see sect. a2 for details.
in this section, we report and analyze the predictive power of deeprc and the compared methods on several immunosequencing datasets.
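the attention-pooling with a single learned query and the top-10% selection described under "implementation" above can be sketched as follows. this is a minimal illustration on plain lists, not the deeprc code: the function names, the toy keys and values, and the renormalisation of the weights over the kept sequences are assumptions of this example.

```python
import math


def attention_weights(query, keys):
    """softmax(query . k_i / sqrt(d_k)) over all sequences of one repertoire."""
    d_k = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(query, keys, values, keep_fraction=1.0):
    """Weighted mean of values; optionally keep only the top-weighted fraction."""
    weights = attention_weights(query, keys)
    n_keep = max(1, math.ceil(keep_fraction * len(keys)))
    kept = sorted(range(len(keys)), key=lambda i: weights[i], reverse=True)[:n_keep]
    norm = sum(weights[i] for i in kept)  # renormalise over kept sequences (assumption)
    d_v = len(values[0])
    return [sum(weights[i] / norm * values[i][j] for i in kept) for j in range(d_v)]


# toy repertoire: 4 sequences with d_k = 2 keys and d_v = 3 sequence-representations
query = [1.0, 0.0]
keys = [[4.0, 0.0], [0.0, 1.0], [-2.0, 0.5], [0.5, 0.5]]
values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]]

z_full = attention_pool(query, keys, values)
z_top = attention_pool(query, keys, values, keep_fraction=0.5)
z_perm = attention_pool(query, keys[::-1], values[::-1])  # same bag, other order
```

reordering the sequences leaves the repertoire-representation unchanged, which is the permutation invariance required of a mil pooling function. a softmax restricted to the kept subset is identical to renormalising the full softmax weights over that subset, which is why the sketch renormalises.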
the roc-auc is used as the main metric for the predictive power. methods compared. we compared previous methods for immune repertoire classification (ostmeyer et al., 2019) ("log. mil (kmer)", "log. mil (tcrb)") and a burden test (emerson et al., 2017), as well as the baseline methods logistic regression ("log. regr."), k-nearest neighbour ("knn"), and support vector machines ("svm") with kernels designed for sets, such as the jaccard kernel ("j") and the minmax ("mm") kernel (ralaivola et al., 2005). for the simulated data, we also added baseline methods that search for the implanted motif either in binary or continuous fashion ("known motif b.", "known motif c.") assuming that this motif was known (for details, see sect. a4). datasets. we aimed at constructing immune repertoire classification scenarios with varying degrees of difficulty and realism in order to compare and analyze the suggested machine learning methods. to this end, we either use simulated or experimentally-observed immune receptor sequences and we implant signals, specifically, sequence motifs or sets thereof (weber et al., 2020), at different frequencies into sequences of repertoires of the positive class. these frequencies represent the witness rates and range from 0.01% to 10%. overall, we compiled four categories of datasets: (a) simulated immunosequencing data with implanted signals, (b) lstm-generated immunosequencing data with implanted signals, (c) real-world immunosequencing data with implanted signals, and (d) real-world immunosequencing data with known immune status, the cmv dataset (emerson et al., 2017). the average number of instances per bag, which is the number of sequences per immune repertoire, is ≈300,000 except for category (c), in which we consider the scenario of low-coverage data with only 10,000 sequences per repertoire. the number of repertoires per dataset ranges from 785 to 5,000. in total, all datasets comprise ≈30 billion sequences or instances.
this represents the largest comparative study on immune repertoire classification (see sect. a3). hyperparameter selection. we used a nested 5-fold cross validation (cv) procedure to estimate the performance of each of the methods. all methods could adjust their most important hyperparameters on a validation set in the inner loop of the procedure. see sect. a5 for details. table 1 : results in terms of auc of the competing methods on all datasets. the reported errors are standard deviations across 5 cross-validation (cv) folds (except for the column "simulated"). real-world cmv: average performance over 5 cv folds on the cmv dataset (emerson et al., 2017) . real-world data with implanted signals: average performance over 5 cv folds for each of the four datasets. a signal was implanted with a frequency (=witness rate) of 1% or 0.1%. either a single motif ("om") or multiple motifs ("mm") were implanted. lstm-generated data: average performance over 5 cv folds for each of the 5 datasets. in each dataset, a signal was implanted with a frequency of 10%, 1%, 0.5%, 0.1%, or 0.05%, respectively. simulated: here we report the mean over 18 simulated datasets with implanted signals and varying difficulties (see tab. a9 for details). the error reported is the standard deviation of the auc values across the 18 datasets. results. in each of the four categories, "real-world data", "real-world data with implanted signals", "lstm-generated data", and "simulated immunosequencing data", deeprc outperforms all competing methods with respect to average auc. across categories, the runner-up methods are either the svm for mil problems with minmax kernel or the burden test (see table 1 and sect. a6). results on simulated immunosequencing data. in this setting the complexity of the implanted signal is in focus and varies throughout 18 simulated datasets (see sect. a3). 
some datasets are challenging for the methods because the implanted motif is hidden by noise and others because only a small fraction of sequences carries the motif, and hence have a low witness rate. these difficulties become evident by the method called "known motif binary", which assumes the implanted motif is known. the performance of this method ranges from a perfect auc of 1.000 in several datasets to an auc of 0.532 in dataset '17' (see sect. a6). deeprc outperforms all other methods with an average auc of 0.846 ± 0.223, followed by the svm with minmax kernel with an average auc of 0.827 ± 0.210 (see sect. a6). the predictive performance of all methods suffers if the signal occurs only in an extremely small fraction of sequences. in datasets, in which only 0.01% of the sequences carry the motif, all auc values are below 0.550. results on lstm-generated data. on the lstm-generated data, in which we implanted noisy motifs with frequencies of 10%, 1%, 0.5%, 0.1%, and 0.05%, deeprc yields almost perfect predictive performance with an average auc of 1.000 ± 0.001 (see sect. a6 and a7). the second best method, svm with minmax kernel, has a similar predictive performance to deeprc on all datasets but the other competing methods have a lower predictive performance on datasets with low frequency of the signal (0.05%). results on real-world data with implanted motifs. in this dataset category, we used real immunosequences and implanted single or multiple noisy motifs. again, deeprc outperforms all other methods with an average auc of 0.980 ± 0.029, with the second best method being the burden test with an average auc of 0.883 ± 0.170. notably, all methods except for deeprc have difficulties with noisy motifs at a frequency of 0.1% (see tab. a11) . results on real-world data. 
on the real-world dataset, in which the immune status of persons affected by the cytomegalovirus has to be predicted, the competing methods yield predictive aucs between 0.515 and 0.825 (see table 1 ). we note that this dataset is not the exact dataset that was used in emerson et al. (2017) . it differs in pre-processing and also comprises a different set of samples and a smaller training set due to the nested 5-fold cross-validation procedure, which leads to a more challenging dataset. the best performing method is deeprc with an auc of 0.831 ± 0.002, followed by the svm with minmax kernel (auc 0.825 ± 0.022) and the burden test with an auc of 0.699 ± 0.041. the top-ranked sequences by deeprc significantly correspond to those detected by emerson et al. (2017) , which we tested by a mann-whitney u-test with the null hypothesis that the attention values of the sequences detected by emerson et al. (2017) would be equal to the attention values of the remaining sequences (p-value of 1.3 · 10 −93 ). the sequence attention values are displayed in tab. a14. we have demonstrated how modern hopfield networks and attention mechanisms enable successful classification of the immune status of immune repertoires. for this task, methods have to identify the discriminating sequences amongst a large set of sequences in an immune repertoire. specifically, even motifs within those sequences have to be identified. we have shown that deeprc, a modern hopfield network and an attention mechanism with a fixed query, can solve this difficult task despite the massive number of instances. deeprc furthermore outperforms the compared methods across a range of different experimental conditions. impact on machine learning and related scientific fields. 
we envision that with (a) the increasing availability of large immunosequencing datasets (kovaltsuk et al., 2018; corrie et al., 2018; christley et al., 2018; zhang et al., 2020; rosenfeld et al., 2018; shugay et al., 2018), (b) further fine-tuning of ground-truth benchmarking immune receptor datasets (weber et al., 2020; olson et al., 2019; marcou et al., 2018), (c) accounting for repertoire-impacting factors such as age, sex, ethnicity, and environment (potential confounding factors), and (d) increased gpu memory and increased computing power, it will be possible to identify discriminating immune receptor motifs for many diseases, potentially even for the current sars-cov-2 (covid-19) pandemic (minervina et al., 2020; galson et al., 2020). such results would greatly benefit ongoing research on antibody and tcr-driven immunotherapies and immunodiagnostics as well as rational vaccine design (brown et al., 2019). in the course of this development, the experimental verification and interpretation of machine-learning-identified motifs could receive additional focus, as for most of the sequences within a repertoire the corresponding antigen is unknown. nevertheless, recent technological breakthroughs in high-throughput antigen-labeled immunosequencing are beginning to generate large-scale antigen-labeled single-immune-receptor-sequence data, thus resolving this longstanding problem (setliff et al., 2019). from a machine learning perspective, the successful application of deeprc on immune repertoires with their large number of instances per bag might encourage the application of modern hopfield networks and attention mechanisms on new, previously unsolved or unconsidered, datasets and problems. impact on society. if the approach proves itself successful, it could lead to faster testing of individuals for their immune status w.r.t. a range of diseases based on blood samples. this might motivate changes in the pipeline of diagnostics and tracking of diseases, e.g.
automated testing of the immune status in regular intervals. it would furthermore make the collection and screening of blood samples for larger databases more attractive. in consequence, the improved testing of immune statuses might identify individuals that do not have a working immune response towards certain diseases to government or insurance companies, which could then push for targeted immunisation of the individual. similarly to compulsory vaccination, such testing for the immune status could be made compulsory by governments, possibly violating privacy or personal self-determination in exchange for increased overall health of a population. ultimately, if the approach proves itself successful, the insights gained from the screening of individuals that have successfully developed resistances against specific diseases could lead to faster targeted immunisation, once a certain number of individuals with resistances can be found. this might strongly decrease the harm done by e.g. pandemics and lead to a change in the societal perception of such diseases. consequences of failures of the method. as is common with methods in machine learning, potential danger lies in the possibility that users rely too much on our new approach and use it without reflecting on the outcomes. however, the full pipeline in which our method would be used includes wet lab tests after its application, to verify and investigate the results, gain insights, and possibly derive treatments. failures of the proposed method would lead to unsuccessful wet lab validation and negative wet lab tests. since the proposed algorithm does not directly suggest treatment or therapy, human beings are not directly at risk of being treated with a harmful therapy. substantial wet lab and in-vitro testing would indicate wrong decisions by the system. leveraging of biases in the data and potential discrimination.
as for almost all machine learning methods, confounding factors, such as age or sex, could be used for classification. this might lead to biases in predictions or uneven predictive performance across subgroups. as a result, failures in the wet lab would occur (see paragraph above). moreover, insights into the relevance of the confounding factors could be gained, leading to possible therapies or counter-measures concerning said factors. furthermore, the amount of data available with respect to relevant confounding factors could lead to better or worse performance of our method. e.g. a dataset consisting mostly of data from individuals within a specific age group might yield better performance for that age group, possibly resulting in better or exclusive treatment methods for that specific group. here again, the application of deeprc would be followed by in-vitro testing and development of a treatment, where all target groups for the treatment have to be considered accordingly. all datasets and code are available at https://github.com/ml-jku/deeprc. the cmv dataset is publicly available at https://clients.adaptivebiotech.com/pub/emerson-2017-natgen. in section a2 we provide details on the architecture of deeprc, in section a3 we present details on the datasets, in section a4 we explain the methods that we compared, and in section a5 we elaborate on the hyperparameters and their selection process. then, in section a6 we present detailed results for each dataset category in tabular form, in section a7 we provide information on the lstm model that was used to generate antibody sequences, in section a8 we show how deeprc can be interpreted, in section a9 we show the correspondence of previously identified tcr sequences for cmv immune status with attention values by deeprc, and finally we present variations and an ablation study of deeprc in section a10. input layer. for the input layer of the cnn, the characters in the input sequence, i.e.
the amino acids (aas), are encoded in a one-hot vector of length 20. to also provide information about the position of an aa in the sequence, we add 3 additional input features with values in range [0, 1] to encode the position of an aa relative to the sequence. these 3 positional features encode whether the aa is located at the beginning, the center, or the end of the sequence, respectively, as shown in figure a1 . we concatenate these 3 positional features with the one-hot vector of aas, which results in a feature vector of size 23 per sequence position. each repertoire, now represented as a bag of feature vectors, is then normalized to unit variance. since the cytomegalovirus dataset (cmv dataset) provides sequences with an associated abundance value per sequence, which is the number of occurrences of a sequence in a repertoire, we incorporate this information into the input of deeprc. to this end, the one-hot aa features of a sequence are multiplied by a scaling factor of log(c a ) before normalization, where c a is the abundance of a sequence. we feed the sequences with 23 features per position into the cnn. sequences of different lengths were zero-padded to the maximum sequence length per batch at the sequence ends. 1d cnn for motif recognition. in the following, we describe how deeprc identifies patterns in the individual sequences and reduces each sequence in the input object to a fixed-size feature vector. deeprc employs 1d convolution layers to extract patterns, where trainable weight kernels are convolved over the sequence positions. in principle, also recurrent neural networks (rnns) or transformer networks could be used instead of 1d cnns, however, (a) the computational complexity of the network must be low to be able to process millions of sequences for a single update. additionally, (b) the learned network should be able to provide insights in the recognized patterns in form of motifs. 
both properties (a) and (b) are fulfilled by 1d convolution operations that are used by deeprc. we use one 1d cnn layer (hu et al., 2014) with selu activation functions (klambauer et al., 2017) to identify the relevant patterns in the input sequences with a computationally light-weight operation. the larger the kernel size, the more surrounding sequence positions are taken into account, which influences the length of the motifs that can be extracted. we therefore adjust the kernel size during hyperparameter search. in prior works (ostmeyer et al., 2019) , a k-mer size of 4 yielded good predictive performance, which could indicate that a kernel size in the range of 4 may be a proficient choice. for d v trainable kernels, this produces a feature vector of length d v at each sequence position. subsequently, global max-pooling over all sequence positions of a sequence reduces the sequence-representations z i to vectors of the fixed length d v . given the challenging size of the input data per repertoire, the computation of the cnn activations and weight updates is performed using 16-bit floating point values. a list of hyperparameters evaluated for deeprc is given in table a3 . a comparison of rnn-based and cnn-based sequence embedding for motif recognition in a smaller experimental setting is given in sec. a10. regularization. we apply random and attention-based subsampling of repertoire sequences to reduce over-fitting and decrease computational effort. during training, each repertoire is subsampled to 10, 000 input sequences, which are randomly drawn from the respective repertoire. this can also be interpreted as random drop-out (hinton et al., 2012) on the input sequences or attention weights. during training and evaluation, the attention weights computed by the attention network are furthermore used to rank the input sequences. based on this ranking, the repertoire is reduced to the 10% of sequences with the highest attention weights. 
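the two subsampling stages described above, together with the fixed-query attention pooling, can be illustrated with a small pure-python sketch. it is not the authors' pytorch implementation: the function names, the `keep_fraction` parameter, and the choice to renormalize the softmax over the retained sequences are assumptions made for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(keys, query):
    """Attention weights a_i = softmax(query . k_i / sqrt(d_k)) for a fixed query."""
    d_k = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d_k) for key in keys]
    return softmax(scores)


def random_subsample(repertoire, n=10_000, rng=random):
    """Random subsampling of a repertoire to at most n sequences (training only)."""
    if len(repertoire) <= n:
        return list(repertoire)
    return rng.sample(repertoire, n)


def attention_pool(values, keys, query, keep_fraction=0.1):
    """Keep the top fraction of sequences by attention weight and pool them.

    Assumption: the softmax is recomputed over the retained sequences only.
    Returns the pooled repertoire-representation and the retained indices.
    """
    weights = attention_weights(keys, query)
    n_keep = max(1, int(len(weights) * keep_fraction))
    top = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:n_keep]
    top_weights = attention_weights([keys[i] for i in top], query)
    d_v = len(values[0])
    pooled = [sum(w * values[i][j] for w, i in zip(top_weights, top)) for j in range(d_v)]
    return pooled, top
```

because the pooled vector is a weighted mean, every component stays within the range spanned by the retained sequence-representations, which is what makes the weights directly interpretable as sequence importance.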
these top 10% of sequences are then used to compute the weight updates and the prediction for the repertoire. additionally, one might employ further regularization techniques, which we only partly investigated further in a smaller experimental setting in sec. a10 due to high computational demands. such regularization techniques include l1 and l2 weight decay, noise in the form of random aa permutations in the input sequences, noise on the attention weights, or random shuffling of sequences between repertoires that belong to the negative class. the last regularization technique assumes that the sequences in positive-class repertoires carry a signal, such as an aa motif corresponding to an immune response, whereas the sequences in negative-class repertoires do not. hence, the sequences can be shuffled randomly between negative class repertoires without obscuring the signal in the positive class repertoires. hyperparameters. for the hyperparameter search of deeprc for the category "simulated immunosequencing data", we only conducted a full hyperparameter search on the more difficult datasets with motif implantation probabilities below 1%, as described in table a3 . this process was repeated for all 5 folds of the 5-fold cross-validation (cv) and the average score on the 5 test sets constitutes the final score of a method. table a3 provides an overview of the hyperparameter search, which was conducted as a grid search for each of the datasets in a nested 5-fold cv procedure, as described in section a4. computation time and optimization. we took measures on the implementation level to address the high computational demands, especially gpu memory consumption, in order to make the large number of experiments feasible. we train the deeprc model with a small batch size of 4 samples and perform computation of inference and updates of the 1d cnn using 16-bit floating point values. the rest of the network is trained using 32-bit floating point values. 
the adam parameter for numerical stability was therefore increased from the default value of $\epsilon = 10^{-8}$ to $\epsilon = 10^{-4}$. training was performed on various gpu types, mainly nvidia rtx 2080 ti. computation times were highly dependent on the number of sequences in the repertoires and the number and sizes of cnn kernels. a single update on an nvidia rtx 2080 ti gpu took approximately 0.0129 to 0.0135 seconds, while requiring approximately 8 to 11 gb gpu memory. taking these optimizations and gpus with larger memory (≥ 16 gb) into account, it is already possible to train deeprc, possibly with multi-head attention and a larger network architecture, on larger datasets (see sec. a10). our network implementation is based on pytorch 1.3.1 (paszke et al., 2019). incorporation of additional inputs and metadata. additional metadata in the form of sequence-level or repertoire-level features could be incorporated into the input via concatenation with the feature vectors that result from taking the maximum of the 1d cnn outputs w.r.t. the sequence positions. this has the benefit that the attention mechanism and output network can utilize the sequence-level or repertoire-level features for their predictions. sparse metadata or metadata that is only available during training could be used as auxiliary targets to incorporate the information via gradients into the deeprc model. limitations. the current methods are mostly limited by computational complexity, since both hyperparameter and model selection are computationally demanding. for hyperparameter selection, a large number of hyperparameter settings have to be evaluated. for model selection, a single repertoire requires the propagation of many thousands of sequences through a neural network and keeping those quantities in gpu memory in order to perform the attention mechanism and weight update. thus, increased gpu memory would significantly boost our approach.
increased computational power would also allow for more advanced architectures and attention mechanisms, which may further improve predictive performance. another limiting factor is over-fitting of the model due to the currently relatively small number of samples (bags) in real-world immunosequencing datasets in comparison to the large number of instances per bag and features per instance. we aimed at constructing immune repertoire classification scenarios with varying degrees of realism and difficulty in order to compare and analyze the suggested machine learning methods. to this end, we either use simulated or experimentally-observed immune receptor sequences and we implant signals, which are sequence motifs (weber et al., 2020), into sequences of repertoires of the positive class. it has been shown previously that interaction of immune receptors with antigens occurs via short sequence stretches. thus, implantation of short motif sequences simulating an immune signal is biologically meaningful. our benchmarking study comprises four different categories of datasets: (a) simulated immunosequencing data with implanted signals (where the signal is defined as sets of motifs), (b) lstm-generated immunosequencing data with implanted signals, (c) real-world immunosequencing data with implanted signals, and (d) real-world immunosequencing data. each of the first three categories consists of multiple datasets with varying difficulty depending on the type of the implanted signal and the ratio of sequences with the implanted signal. the ratio of sequences with the implanted signal, where each sequence carries at most 1 implanted signal, corresponds to the witness rate (wr). we consider binary classification tasks to simulate the immune status of healthy and diseased individuals. we randomly generate immune repertoires with varying numbers of sequences, where we implant sequence motifs in the repertoires of the diseased individuals, i.e. the positive class.
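the relation between motif implantation and the witness rate can be illustrated with a minimal sketch. it assumes uniformly random amino acids, a uniformly random implantation position, and hypothetical function names; the sequence-length parameters and the motif in the usage note are placeholders, and the dataset-specific generation procedures are the ones detailed in the text.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def random_sequence(rng, mean_len=14.5, sd_len=1.8, min_len=4):
    """Random amino acid sequence with a Gaussian-distributed length (illustrative)."""
    length = max(min_len, round(rng.gauss(mean_len, sd_len)))
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def implant(sequence, motif, rng):
    """Overwrite a random window of the sequence with the motif."""
    start = rng.randrange(len(sequence) - len(motif) + 1)
    return sequence[:start] + motif + sequence[start + len(motif):]


def make_repertoire(n_sequences, positive, motif, witness_rate, rng):
    """Generate one repertoire; only positive repertoires receive implanted motifs.

    Each sequence carries at most one implanted motif, and the expected
    fraction of implanted sequences equals the witness rate.
    """
    sequences, n_implanted = [], 0
    for _ in range(n_sequences):
        seq = random_sequence(rng, min_len=len(motif))
        if positive and rng.random() < witness_rate:
            seq = implant(seq, motif, rng)
            n_implanted += 1
        sequences.append(seq)
    return sequences, n_implanted
```

for example, `make_repertoire(10_000, True, "LDR", 0.01, random.Random(0))` yields a positive repertoire in which roughly 1% of the sequences carry the motif; the motif may also occur by chance in other sequences, including those of negative repertoires.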
the sequences of a repertoire are also randomly generated by different procedures (detailed below). each sequence is composed of 20 different characters, corresponding to amino acids, and has an average length of 14.5 aas. in the first category, we aim at investigating the impact of the signal frequency, i.e. the wr, and the signal complexity on the performance of the different methods. to this end, we created 18 datasets, where each dataset contains a large number of repertoires with a large number of random aa sequences per repertoire. we then implanted signals in the aa sequences of the positive class repertoires, where the 18 datasets differ in frequency and complexity of the implanted signals. in detail, the aas were sampled randomly independent of their respective position in the sequence, while the frequencies of aas, the distribution of sequence lengths, and the distribution of the number of sequences per repertoire, i.e. the number of instances per bag, follow the respective distributions observed in the real-world cmv dataset (emerson et al., 2017). for this, we first sampled the number of sequences for a repertoire from a gaussian distribution n(µ = 316k, σ = 132k) and rounded to the nearest positive integer. we re-sampled if the size was below 5k. we then generated random sequences of aas with a length drawn from n(µ = 14.5, σ = 1.8), again rounded to the nearest positive integer. each simulated repertoire was then randomly assigned to either the positive or negative class, with 2,500 repertoires per class. in the repertoires assigned to the positive class, we implanted motifs with an average length of 4 aas, following the results of an experimental analysis of antigen-binding motifs in antibodies and t-cell receptor sequences. we varied the characteristics of the implanted motifs for each of the 18 datasets with respect to the following parameters: (a) ρ, the probability of a motif being implanted in a sequence of a positive repertoire, i.e.
the average ratio of sequences containing the motif, which is the witness rate. in this way, we generated 18 different datasets of variable difficulty containing in total roughly 28.7 billion sequences. see table a1 for an overview of the properties of the implanted motifs in the 18 datasets. in the second dataset category, we investigate the impact of the signal frequency and complexity in combination with more plausible immune receptor sequences by taking into account the positional aa distributions and other sequence properties. to this end, we trained an lstm (hochreiter & schmidhuber, 1997) in a standard next character prediction (graves, 2013) setting to create aa sequences with properties similar to experimentally observed immune receptor sequences. in the first step, the lstm model was trained on all immuno-sequences in the cmv dataset (emerson et al., 2017) that contain valid information about sequence abundance and have a known cmv label. such an lstm model is able to capture various properties of the sequences, including position-dependent probability distributions and combinations, relationships, and order of aas. we then used the trained lstm model to generate 1,000 repertoires in an autoregressive fashion, starting with a start sequence that was randomly sampled from the trained-on dataset. based on a visual inspection of the frequencies of 4-mers (see section a7), the similarity of lstm-generated sequences and real sequences was deemed sufficient for the purpose of generating the aa sequences for the datasets in this category. further details on lstm training and repertoire generation are given in section a7. after generation, each repertoire was assigned to either the positive or negative class, with 500 repertoires per class. we implanted motifs of length 4 with varying properties in the center of the sequences of the positive class to obtain 5 different datasets.
each sequence in the positive repertoires has a probability ρ to carry the motif, which was varied throughout 5 datasets and corresponds to the wr (see table a1). each position in the motif has a probability of 0.9 to be implanted and consequently a probability of 0.1 that the original aa in the sequence remains, which can be seen as noise on the motif.
table a1: properties of simulated repertoires, variations of motifs, and motif frequencies, i.e. the witness rate, for the datasets in categories "simulated immunosequencing data", "lstm-generated data", and "real-world data with implanted signals". noise types for * are explained in paragraph "real-world data with implanted signals".
in the third category, we implanted signals into experimentally obtained immuno-sequences, where we considered 4 dataset variations. each dataset consists of 750 repertoires for each of the two classes, where each repertoire consists of 10k sequences. in this way, we aim to simulate datasets with a low sequencing coverage, which means that only relatively few sequences per repertoire are available. the sequences were randomly sampled from healthy (cmv negative) individuals from the cmv dataset (see the paragraph below for an explanation). two signal types were considered: (a) one signal with one motif. the aa motif ldr was implanted in a certain fraction of sequences. the pattern is randomly altered at one of the three positions with probabilities 0.2, 0.6, and 0.2, respectively. (b) one signal with multiple motifs. one of the three possible motifs ldr, cas, and gl-n was implanted with equal probability. again, the motifs were randomly altered before implantation. the aa motif ldr changed as described above. the aa motif cas was altered at the second position with probability 0.6 and with probability 0.3 at the first position.
the pattern gl-n, where - denotes a gap location, is randomly altered at the first position with probability 0.6 and the gap has a length of 0, 1, or 2 aas with equal probability. additionally, the datasets differ in the values for ρ, the average ratio of sequences carrying a signal, which were chosen as 1% or 0.1%. the motifs were implanted at positions 107, 109, and 114 according to the imgt numbering scheme for immune receptor sequences (lefranc et al., 2003) with probabilities 0.3, 0.35 and 0.2, respectively. with the remaining 0.15 chance, the motif is implanted at any other sequence position. this means that the motif occurrence in the simulated sequences is biased towards the middle of the sequence. we used a real-world dataset of 785 repertoires, each containing between 4,371 and 973,081 (avg. 299,319) tcr sequences with a length of 1 to 27 (avg. 14.5) aas, originally collected and provided by emerson et al. (2017). 340 out of 785 repertoires were labelled as positive for cytomegalovirus (cmv) serostatus, which we consider as the positive class, 420 repertoires had a negative cmv serostatus, considered as the negative class, and 25 repertoires had an unknown status. we changed the number of sequence counts per repertoire from −1 to 1 for 3 sequences. furthermore, we exclude a total of 99 repertoires with unknown cmv status or unknown information about the sequence abundance within a repertoire, reducing the dataset for our analysis to 686 repertoires, 312 of which have a positive and 374 a negative cmv status. we give a non-exhaustive overview of previously considered mil datasets and problems in table a2. to our knowledge, the datasets considered in this work pose the most challenging mil problems with respect to the number of instances per bag (column 5). table a2: mil datasets with their numbers of bags and numbers of instances. "total number of instances" refers to the total number of instances in the dataset.
the simulated and real-world immunosequencing datasets considered in this work contain a number of instances per bag that is orders of magnitude larger than in the mil datasets that were considered by machine learning methods up to now. we evaluate and compare the performance of deeprc against a set of machine learning methods that serve as baselines, were suggested, or can readily be adapted to immune repertoire classification. in this section, we describe these compared methods. known motif. this method serves as an estimate for the achievable classification performance using prior knowledge about which motif was implanted. note that this does not necessarily lead to perfect predictive performance since motifs are implanted with a certain amount of noise and could also be present in the negative class by chance. the known motif method counts how often the known implanted motif occurs per sequence for each repertoire and uses this count to rank the repertoires. from this ranking, the area under the receiver operating characteristic curve (auc) is computed as performance measure. probabilistic aa changes in the known motif are not considered for this count, with the exception of gap positions. we consider two versions of this method: (a) known motif binary: counts the occurrence of the known motif in a sequence and (b) known motif continuous: counts the maximum number of overlapping aas between the known motif and all sequence positions, which corresponds to a convolution operation with a binary kernel followed by max-pooling. since the implanted signal is not known in the experimentally obtained cmv dataset, this method cannot be applied to this dataset. the support vector machine (svm) approach uses a fixed mapping from a bag of sequences to the corresponding k-mer counts. the function h kmer maps each sequence s i to a vector representing the occurrence of k-mers in the sequence.
to avoid confusion with the sequence representation obtained from the cnn layers of deeprc, we denote $u_i = h_{\text{kmer}}(s_i)$, which is analogous to $z_i$. specifically, $u_{im} = \#\{p_m \in s_i\}$, where $\#\{p_m \in s_i\}$ denotes how often the k-mer pattern $p_m$ occurs in sequence $s_i$. afterwards, average-pooling is applied to obtain $u = \frac{1}{N} \sum_{i=1}^{N} u_i$, the k-mer representation of the input object $x$, where $N$ is the number of sequences in the bag. for two input objects $x^{(n)}$ and $x^{(l)}$ with representations $u^{(n)}$ and $u^{(l)}$, respectively, we implement the minmax kernel (ralaivola et al., 2005) as follows:
$$k_{\text{minmax}}(u^{(n)}, u^{(l)}) = \frac{\sum_m \min(u^{(n)}_m, u^{(l)}_m)}{\sum_m \max(u^{(n)}_m, u^{(l)}_m)},$$
where $u^{(n)}_m$ is the m-th element of the vector $u^{(n)}$. the jaccard kernel (levandowsky & winter, 1971) is identical to the minmax kernel except that it operates on binary $u^{(n)}$. we used a standard c-svm, as introduced by cortes & vapnik (1995). the corresponding hyperparameter c is optimized by random search. the settings of the full hyperparameter search as well as the respective value ranges are given in table a4a.
the same k-mer representation of a repertoire, as introduced above for the svm baseline, is used for the k-nearest neighbor (knn) approach. as this method classifies samples according to the distances between them, the previous kernel definitions cannot be applied directly. it is therefore necessary to transform the minmax as well as the jaccard kernel from similarities to distances by constructing the following (levandowsky & winter, 1971):
$$d_{\text{minmax}}(u^{(n)}, u^{(l)}) = 1 - k_{\text{minmax}}(u^{(n)}, u^{(l)}), \quad \text{(a1)}$$
$$d_{\text{jaccard}}(u^{(n)}, u^{(l)}) = 1 - k_{\text{jaccard}}(u^{(n)}, u^{(l)}). \quad \text{(a2)}$$
the number of neighbors is treated as the hyperparameter and optimized by an exhaustive grid search. the settings of the full hyperparameter search as well as the respective value ranges are given in table a5.
we implemented logistic regression on the k-mer representation $u$ of an immune repertoire. the model is trained by gradient descent using the adam optimizer (kingma & ba, 2014). the learning rate is treated as the hyperparameter and optimized by grid search.
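the k-mer representation and the kernels used by the svm and knn baselines above can be illustrated with a minimal sketch. it is not the authors' implementation: the function names, the sparse dictionary representation, and the two toy repertoires are assumptions made only for illustration.

```python
from collections import Counter


def kmer_counts(sequence, k=4):
    """Count how often each k-mer occurs in one sequence (the vector u_i)."""
    return Counter(sequence[j:j + k] for j in range(len(sequence) - k + 1))


def repertoire_representation(sequences, k=4):
    """Average-pool the per-sequence k-mer counts into one sparse vector u."""
    total = Counter()
    for seq in sequences:
        total.update(kmer_counts(seq, k))
    n = len(sequences)
    return {kmer: count / n for kmer, count in total.items()}


def minmax_kernel(u_n, u_l):
    """Sum of element-wise minima divided by sum of element-wise maxima."""
    keys = set(u_n) | set(u_l)
    numerator = sum(min(u_n.get(m, 0.0), u_l.get(m, 0.0)) for m in keys)
    denominator = sum(max(u_n.get(m, 0.0), u_l.get(m, 0.0)) for m in keys)
    return numerator / denominator if denominator > 0 else 0.0


def jaccard_kernel(u_n, u_l):
    """Minmax kernel applied to binarised (presence/absence) vectors."""
    b_n = {kmer: 1.0 for kmer, value in u_n.items() if value > 0}
    b_l = {kmer: 1.0 for kmer, value in u_l.items() if value > 0}
    return minmax_kernel(b_n, b_l)


def kernel_distance(kernel, u_n, u_l):
    """Turn a similarity in [0, 1] into a distance, as needed for knn."""
    return 1.0 - kernel(u_n, u_l)


# Two toy repertoires (hypothetical sequences, not taken from any dataset).
rep_a = repertoire_representation(["CASSL", "CASSF"], k=4)
rep_b = repertoire_representation(["CASSL"], k=4)
print(minmax_kernel(rep_a, rep_b), jaccard_kernel(rep_a, rep_b))
```

storing only non-zero k-mer counts keeps the sketch small; a dense vector over all possible k-mers would give the same kernel values.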
furthermore, we explored two regularization settings using combinations of l1 and l2 weight decay. the settings of the full hyperparameter search as well as the respective value ranges are given in table a6 . we implemented a burden test (emerson et al., 2017; li & leal, 2008; wu et al., 2011) in a machine learning setting. the burden test first identifies sequences or k-mers that are associated with the individual's class, i.e., immune status, and then calculates a burden score per individual. concretely, for each k-mer or sequence, the phi coefficient of the contingency table for absence or presence and positive or negative immune status is calculated. then, j k-mers or sequences with the highest phi coefficients are selected as the set of associated k-mers or sequences. j is a hyperparameter that is selected on a validation set. additionally, we consider the type of input features, sequences or k-mers, as a hyperparameter. for inference, a burden score per individual is calculated as the sum of associated k-mers or sequences it carries. this score is used as raw prediction and to rank the individuals. hence, we have extended the burden test by emerson et al. (2017) to k-mers and to adaptive thresholds that are adjusted on a validation set. the logistic multiple instance learning (mil) approach for immune repertoire classification (ostmeyer et al., 2019) applies a logistic regression model to each k-mer representation in a bag. the resulting scores are then summarized by max-pooling to obtain a prediction for the bag. each amino acid of each k-mer is represented by 5 features, the so-called atchley factors (atchley et al., 2005) . as k-mers of length 4 are used, this gives a total of 4 × 5 = 20 features. one additional feature per 4-mer is added, which represents the relative frequency of this 4-mer with respect to its containing bag, resulting in 21 features per 4-mer. 
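the burden test described above can be sketched in a few lines. this is a simplified illustration under stated assumptions, not the authors' code: each repertoire is reduced to a set of features (k-mers or sequences), labels are 0/1, and all function names and the toy data are hypothetical.

```python
import math


def phi_coefficient(present_pos, present_neg, absent_pos, absent_neg):
    """Phi coefficient of the 2x2 table (presence/absence x positive/negative)."""
    numerator = present_pos * absent_neg - present_neg * absent_pos
    denominator = math.sqrt(
        (present_pos + present_neg) * (absent_pos + absent_neg)
        * (present_pos + absent_pos) * (present_neg + absent_neg)
    )
    return numerator / denominator if denominator > 0 else 0.0


def select_burden_set(repertoires, labels, j):
    """Pick the j features with the highest phi coefficient on the training data."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    scores = {}
    for feature in set().union(*repertoires):
        present_pos = sum(1 for r, y in zip(repertoires, labels) if y == 1 and feature in r)
        present_neg = sum(1 for r, y in zip(repertoires, labels) if y == 0 and feature in r)
        scores[feature] = phi_coefficient(
            present_pos, present_neg, n_pos - present_pos, n_neg - present_neg
        )
    # Sort by descending score; ties are broken alphabetically for determinism.
    ranked = sorted(scores, key=lambda f: (-scores[f], f))
    return set(ranked[:j])


def burden_score(repertoire, burden_set):
    """Number of associated features carried by one individual."""
    return len(repertoire & burden_set)


# Toy example with hypothetical 4-mers; j would be chosen on a validation set.
train = [{"CASS", "ASSL"}, {"CASS", "SSLG"}, {"ASSL"}, {"SSLG", "QQQQ"}]
train_labels = [1, 1, 0, 0]
burden_set = select_burden_set(train, train_labels, j=1)
print(burden_set, burden_score({"CASS", "XXXX"}, burden_set))
```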
two options for the relative frequency feature exist, which are (a) whether the frequency of the 4-mer ("4mer") or (b) the frequency of the sequence in which the 4-mer appeared ("tcrβ") is used. we optimized the learning rate, batch size, and early stopping parameter on the validation set. the settings of the full hyperparameter search as well as the respective value ranges are given in table a8 . for all competing methods a hyperparameter search was performed, for which we split each of the 5 training sets into an inner training set and inner validation set. the models were trained on the inner training set and evaluated on the inner validation set. the model with the highest auc score on the inner validation set is then used to calculate the score on the respective test set. here we report the hyperparameter sets and search strategy that is used for all methods. deeprc. the set of hyperparameters of deeprc is shown in table a3 . these hyperparameter combinations are adjusted via a grid search procedure. table a3 : deeprc hyperparameter search space. every 5 · 10 3 updates, the current model was evaluated against the validation fold. the early stopping hyperparameter was determined by selecting the model with the best loss on the validation fold after 10 5 updates. * : experiments for {64; 128; 256} kernels were omitted for datasets with motif implantation probabilities ≥ 1% in the category "simulated immunosequencing data". known motif. this method does not have hyperparameters and has been applied to all datasets except for the cmv dataset. the corresponding hyperparameter c of the svm is optimized by randomly drawing 10 3 values in the range of [−6; 6] according to a uniform distribution. these values act as the exponents of a power of 10 and are applied for each of the two kernel types (see table a4a ). knn. 
the number of neighbors is treated as the hyperparameter and optimized by grid search over the discrete range $[1; \min\{n, 10^3\}]$ with a step size of 1. the corresponding tight upper bound is automatically defined by the total number of samples $n \in \mathbb{N}_{>0}$ in the training set, capped at $10^3$ (see table a5).
number of neighbors: $[1; \min\{n, 10^3\}]$
type of kernel: {minmax; jaccard}
table a5: settings used in the hyperparameter search of the knn baseline approach. the number of trials (per type of kernel) is automatically defined by the total number of samples $n \in \mathbb{N}_{>0}$ in the training set, capped at $10^3$.
logistic regression. the hyperparameters were optimized by grid search across the values given in table a6.
learning rate: $10^{-\{2;3;4\}}$
batch size: 4
max. updates: $10^5$
coefficient β1 (adam): 0.9
coefficient β2 (adam): 0.999
weight decay weightings: {(l1 = $10^{-7}$, l2 = $10^{-3}$); (l1 = $10^{-7}$, l2 = $10^{-5}$)}
table a6: settings used in the hyperparameter search of the logistic regression baseline approach.
burden test. the burden test selects two hyperparameters: the number of features in the burden set and the type of features, see table a7.
number of features in burden set: {50, 100, 150, 250}
type of features: {4mer; sequence}
table a7: settings used in the hyperparameter search of the burden test approach.
logistic mil. for this method, we adjusted the learning rate as well as the batch size as hyperparameters by randomly drawing 25 different hyperparameter combinations from a uniform distribution. the corresponding range of the learning rate is [−4.5; −1.5], which acts as the exponent of a power of 10. the batch size lies within the range of [1; 32]. for each hyperparameter combination, a model is optimized by gradient descent using adam, whereas the early stopping parameter is adjusted according to the corresponding validation set (see table a8).
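the random searches described above for the svm parameter c and for logistic mil both draw an exponent uniformly and use it as a power of 10. the following minimal sketch shows that sampling scheme; the function names and the fixed seed are assumptions, and drawing the batch size as a uniform integer is one possible reading of the text.

```python
import random


def sample_log_uniform(low_exp, high_exp, n, rng):
    """Draw n values 10**e, with the exponent e uniform in [low_exp, high_exp]."""
    return [10.0 ** rng.uniform(low_exp, high_exp) for _ in range(n)]


def sample_logistic_mil_trials(n_trials, rng):
    """Draw (learning rate, batch size) combinations for logistic mil."""
    trials = []
    for _ in range(n_trials):
        trials.append({
            # Exponent range [-4.5; -1.5] as stated in the text.
            "learning_rate": 10.0 ** rng.uniform(-4.5, -1.5),
            # Assumption: the batch size is an integer drawn uniformly from 1..32.
            "batch_size": rng.randint(1, 32),
        })
    return trials


rng = random.Random(0)  # fixed seed only to make the sketch reproducible
svm_c_values = sample_log_uniform(-6, 6, 1000, rng)  # 10^3 candidates for c
mil_trials = sample_logistic_mil_trials(25, rng)     # 25 trials per abundance term
print(min(svm_c_values), max(svm_c_values), mil_trials[0])
```

sampling the exponent rather than the value itself spreads the candidates evenly across orders of magnitude, which is the usual motivation for this kind of search over learning rates and regularization strengths.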
learning rate 10 {−4.5;−1.5} batch size {1; 32} relative abundance term {4mer; tcrβ} number of trials 25 max. epochs 10 2 coefficient β 1 (adam) 0.9 coefficient β 2 (adam) 0.999 table a8 : settings used in the hyperparameter search of the logistic mil baseline approach. the number of trials (per type of relative abundance) defines the quantity of combinations of random values of the learning rate as well as the batch size. in this section, we report the detailed results on all four categories of datasets (a) simulated immunosequencing data (table a9 ) (b) lstm-generated data (table a10) , (c) real-world data with implanted signals (table a11) , and (d) real-world data on the cmv dataset (table a12) , as discussed in the main paper. ± 0.000 ± 0.000 ± 0.271 ± 0.000 ± 0.000 ± 0.218 ± 0.000 ± 0.000 ± 0.029 ± 0.000 ± 0.001 ± 0.017 ± 0.001 ± 0.002 ± 0.023 ± 0.001 ± 0.048 ± 0.013 ± 0.223 svm (minmax) 1.000 1.000 0.764 1.000 1.000 0.603 1.000 0.998 0.539 1.000 0.994 0.529 1.000 0.741 0.513 1.000 0.706 0.503 0.827 ± 0.000 ± 0.000 ± 0.016 ± 0.000 ± 0.000 ± 0.021 ± 0.000 ± 0.002 ± 0.024 ± 0.000 ± 0.004 ± 0.016 ± 0.000 ± 0.024 ± 0.006 ± 0.000 ± 0.013 ± 0.013 ± 0.013 ± 0.013 ± 0.014 ± 0.011 ± 0.009 ± 0.007 ± 0.008 ± 0.011 ± 0.012 ± 0.012 ± 0.007 ± 0.014 ± 0.017 ± 0.010 ± 0.020 ± 0.012 ± 0.016 ± 0.016 ± 0.074 known motif b. 1.000 1.000 0.973 1.000 1.000 0.865 1.000 1.000 0.700 1.000 0.989 0.609 1.000 0.946 0.570 1.000 0.834 0.532 0.890 ± 0.000 ± 0.000 ± 0.004 ± 0.000 ± 0.000 ± 0.004 ± 0.000 ± 0.000 ± 0.020 ± 0.000 ± 0.002 ± 0.017 ± 0.000 ± 0.010 ± 0.024 ± 0.000 ± 0.016 ± 0.020 ± 0.001 ± 0.014 ± 0.020 ± 0.001 ± 0.013 ± 0.017 ± 0.001 ± 0.012 ± 0.012 ± 0.001 ± 0.018 ± 0.018 ± 0.002 ± 0.010 ± 0.009 ± 0.002 ± 0.012 ± 0.013 ± 0.202 table a9 : auc estimates based on 5-fold cv for all 18 datasets in category "simulated immunosequencing data". 
the reported errors are standard deviations across the 5 cross-validation folds except for the last column "avg.", in which they show standard deviations across datasets. wildcard characters in motifs are indicated by z, characters with 50% probability of being removed by d . table a10 : auc estimates based on 5-fold cv for all 5 datasets in category "lstm-generated data". the reported errors are standard deviations across the 5 cross-validation folds except for the last column "avg.", in which they show standard deviations across datasets. characters affected by noise, as described in a3, paragraph "lstm-generated data", are indicated by r . table a12 : results on the cmv dataset (real-world data) in terms of auc, f1 score, balanced accuracy, and accuracy. for f1 score, balanced accuracy, and accuracy, all methods use their default thresholds. each entry shows mean and standard deviation across 5 cross-validation folds. we trained a conventional next-character lstm model (graves, 2013) based on the implementation in https://github.com/spro/practical-pytorch (access date 1st of may, 2020) using pytorch 1.3.1 (paszke et al., 2019) . for this, we applied an lstm model with 100 lstm blocks in 2 layers, which was trained for 5, 000 epochs using the adam optimizer (kingma & ba, 2014) with learning rate 0.01, an input batch size of 100 character chunks, and a character chunk length of 200. as input we used the immuno-sequences in the cdr3 column of the cmv dataset, where we repeated sequences according to their counts in the repertoires, as specified in the templates column of the cmv dataset. we excluded repertoires with unknown cmv status and unknown sequence abundance from training. after training, we generated 1, 000 repertoires using a temperature value of 0.8. the number of sequences per repertoire was sampled from a gaussian n (µ = 285k, σ = 156k) distribution, where the whole repertoire was generated by the lstm at once. 
that is, the lstm can base the generation of the individual aa sequences in a repertoire, including the aas and the lengths of the sequences, on the generated repertoire. a random immuno-sequence from the trained-on repertoires was used as initialization for the generation process. this immuno-sequence was not included in the generated repertoire. finally, we randomly assigned 500 of the generated repertoires to the positive (diseased) and 500 to the negative (healthy) class. we then implanted motifs in the positive class repertoires as described in section a3.2. as illustrated in the comparison of histograms given in fig. a2, the generated immuno-sequences exhibit a very similar distribution of 4-mers and aas compared to the original cmv dataset.
deeprc allows for two forms of interpretability methods. (a) due to its attention-based design, a trained model can be used to compute the attention weight of a sequence, which directly indicates its importance. (b) deeprc furthermore allows for the usage of contribution analysis methods, such as integrated gradients (ig) (sundararajan et al., 2017) or layer-wise relevance propagation (montavon et al., 2018; arras et al., 2019; montavon et al., 2019; preuer et al., 2019). we apply ig to identify the input patterns that are relevant for the classification. to identify aa patterns with high contributions in the input sequences, we apply ig to the aas in the input sequences. additionally, we apply ig to the kernels of the 1d cnn, which allows us to identify aa motifs with high contributions. in detail, we compute the ig contributions for the aas and positional features in the kernels for every repertoire in the validation and test set, so as to exclude potential artifacts caused by over-fitting. averaging the ig values over these repertoires then results in concise aa motifs. we include qualitative visual analyses of the ig method on different datasets below.
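to make the contribution analysis above more concrete, the following sketch approximates integrated gradients with a midpoint riemann sum. it does not use a deeprc model: a small logistic model with an analytic gradient stands in for the trained network, and all names, weights and inputs are hypothetical. in practice the gradients would come from automatic differentiation of the trained model.

```python
import math


def integrated_gradients(grad_fn, x, baseline, n_steps=50):
    """Approximate IG attributions along the straight path baseline -> x."""
    dim = len(x)
    avg_grad = [0.0] * dim
    for step in range(n_steps):
        alpha = (step + 0.5) / n_steps  # midpoint of the current sub-interval
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = grad_fn(point)
        for d in range(dim):
            avg_grad[d] += grad[d] / n_steps
    # Scale the averaged gradient by the difference between input and baseline.
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grad)]


# Toy stand-in model: logistic regression with fixed, made-up weights.
WEIGHTS = [2.0, -1.0, 0.5]


def toy_model(x):
    z = sum(w * xi for w, xi in zip(WEIGHTS, x))
    return 1.0 / (1.0 + math.exp(-z))


def toy_model_grad(x):
    p = toy_model(x)
    return [p * (1.0 - p) * w for w in WEIGHTS]


x = [1.0, 1.0, 0.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(toy_model_grad, x, baseline, n_steps=50)
print(attributions)
```

a useful sanity check is the completeness property of ig: the attributions should sum approximately to the difference between the model output at the input and at the baseline.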
here, we provide examples for the interpretation of trained deeprc models using integrated gradients (ig) (sundararajan et al., 2017) as contribution analysis method. the following illustrations were created using 50 ig steps, which we found sufficient to achieve stable ig results. a visual analysis of deeprc models on the simulated datasets, as illustrated in tab. a13 and fig. a3 , shows that the implanted motifs can be successfully extracted from the trained model and are straightforward to interpret. in the real-world cmv dataset, deeprc finds complex patterns with high variability in the center regions of the immuno-sequences, as illustrated in figure a4 . real-world data with implanted signals extracted motif implanted motif(s) g r s r a r f r l r d r r r {l r d r r r ; c r a r s; g r l-n} motif freq. ρ 0.05% 0.1% 0.1% table a13 : visualization of motifs extracted from trained deeprc models for datasets from categories "simulated immunosequencing data", "lstm-generated data", and "real-world data with implanted signals". motif extraction was performed using integrated gradients on the 1d cnn kernels over the validation set and test set repertoires of one cv fold. wildcard characters are indicated by z, random noise on characters by r , characters with 50% probability of being removed by d , and gap locations of random lengths of {0; 1; 2} by -. larger characters in the extracted motifs indicate higher contribution, with blue indicating positive contribution and red indicating negative contribution towards the prediction of the diseased class. contributions to positional encoding are indicated by < (beginning of sequence), ∧ (center of sequence), and > (end of sequence). only kernels with relatively high contributions are shown, i.e. with contributions roughly greater than the average contribution of all kernels. b) c) figure a3 : integrated gradients applied to input sequences of positive class repertoires. 
three sequences with the highest contributions to the prediction of their respective repertoires are shown. a) input sequence taken from "simulated immunosequencing data" with implanted motif sz d z d n and motif implantation probability 0.1%. the deeprc model reacts to the s and n at the 5 th and 8 th sequence position, thereby identifying the implanted motif in this sequence. b) and c) input sequence taken from "real-world data with implanted signals" with implanted motifs {l r d r r r ; c r a r s; g r l-n} and motif implantation probability 0.1%. the deeprc model reacts to the fully implanted motif cas (b) and to the partly implanted motif aas c and a at the 5 th and 7 th sequence position (c), thereby identifying the implanted motif in the sequences. wildcard characters in implanted motifs are indicated by z, characters with 50% probability of being removed by d , and gap locations of random lengths of {0; 1; 2} by -. larger characters in the sequences indicate higher contribution, with blue indicating positive contribution and red indicating negative contribution towards the prediction of the diseased class. figure a4 : visualization of the contributions of characters within a sequence via ig. each sequence was selected from a different repertoire and showed the highest contribution in its repertoire. the model was trained on cmv dataset, using a kernel size of 9, 32 kernels and 137 repertoires for early stopping. larger characters in the extracted motifs indicate higher contribution, with blue indicating positive contribution and red indicating negative contribution towards the prediction of the disease class. table a14 : tcrβ sequences that had been discovered by emerson et al. (2017) with their associated attention values by deeprc. these sequences have significantly (p-value 1.3e-93) higher attention values than other sequences. 
the column "quantile" provides the quantile values of the empirical distribution of attention values across all sequences in the dataset.
in this section, we investigate the impact of different variations of deeprc on the performance on the cmv dataset. we consider both a cnn-based sequence embedding, as used in the main paper, and an lstm-based sequence embedding. in both cases, we vary the number of attention heads and the β parameter of the softmax function in the attention mechanism (see eq. 2 in main paper). for the cnn-based sequence embedding, we also vary the number of cnn kernels and the kernel sizes used in the 1d cnn. for the lstm-based sequence embedding, we use a single unidirectional lstm layer, of which the output values at the last sequence position (without padding) are taken as embedding of the sequence. here we vary the number of lstm blocks in the lstm layer. to counter over-fitting due to the increased complexity of these deeprc variations, we added an l2 weight penalty to the training loss. the factor with which the l2 weight penalty contributes to the training loss is varied over 3 orders of magnitude, where suitable value ranges were manually determined on one of the training folds beforehand. to reduce the computational effort, we do not consider all numbers of kernels that were considered in the main paper. furthermore, we only compute the auc scores on 3 of the 5 cross-validation folds. the hyperparameters, which were used in a grid search procedure, are listed in tab. a15 for the cnn-based sequence embedding and tab. a16 for the lstm-based sequence embedding.
results. we show the performance in terms of auc score with single hyperparameters set to fixed values, so as to investigate their influence, in tab. a18 for the cnn-based sequence embedding and tab. a17 for the lstm-based sequence embedding.
we note that due to restricted computational resources this study was conducted with fewer different numbers of cnn kernels, with the auc estimated from only 3 of the 5 cross-validation folds, which leads to a slight decrease of performance in comparison to the full hyperparameter search and cross-validation procedure used in the main paper. as can be seen in tab. a18 and a17, the lstm-based sequence embedding generalizes slightly better than the cnn-based sequence embedding. table a17 : impact of hyperparameters on deeprc with lstm for sequence encoding. mean ("mean") and standard deviation ("std") for the area under the roc curve over the first 3 folds of a 5-fold nested cross-validation for different sub-sets of hyperparameters ("sub-set") are shown. the following sub-sets were considered: "full": full grid search over hyperparameters; "beta=*": grid search over hyperparameters with reduction to specific value * of beta value of attention softmax; "heads=*": grid search over hyperparameters with reduction to specific number * of attention heads; "lstms=*": grid search over hyperparameters with reduction to specific number * of lstm blocks for sequence embedding. table a18 : impact of hyperparameters on deeprc with 1d cnn for sequence encoding. mean ("mean") and standard deviation ("std") for the area under the roc curve over the first 3 folds of a 5-fold nested cross-validation for different sub-sets of hyperparameters ("sub-set") are shown. 
the following sub-sets were considered: "full": full grid search over hyperparameters; "beta=*": grid search over hyperparameters with reduction to specific value * of beta value of attention softmax; "heads=*": grid search over hyperparameters with reduction to specific number * of attention heads; "ksize=*": grid search over hyperparameters with reduction to specific kernel size * of 1d cnn kernels for sequence embedding; "kernels=*": grid search over hyperparameters with reduction to specific number * of 1d cnn kernels for sequence embedding. a compact vocabulary of paratope-epitope interactions enables predictability of antibody-antigen binding predicting the sequence specificities of dna-and rna-binding proteins by deep learning explaining and interpreting lstms solving the protein sequence metric problem rank-loss support instance machines for miml instance annotation augmenting adaptive immunity: progress and challenges in the quantitative engineering and analysis of adaptive immune receptor repertoires multiple instance learning: a survey of problem characteristics and applications vdjserver: a cloud-based analysis portal and data commons for immune repertoire sequences and rearrangements tetramer-visualized gluten-specific cd4+ t cells in blood as a potential diagnostic marker for coeliac disease without oral gluten challenge ireceptor: a platform for querying and analyzing antibody/b-cell and t-cell receptor repertoire data across federated repositories support-vector networks quantifiable predictive features define epitope-specific t cell receptor repertoires on a model of associative memory with huge storage capacity bert: pre-training of deep bidirectional transformers for language understanding solving the multiple instance problem with axis-parallel rectangles predicting the spectrum of tcr repertoire sharing with a data-driven model of recombination immunosequencing identifies signatures of cytomegalovirus exposure history and hla-mediated effects 
on the t cell repertoire predicting antigen-specificity of single t-cells based on tcr cdr3 regions. biorxiv a review of multi-instance learning assumptions deep sequencing of b cell receptor repertoires from covid-19 evaluation and benchmark for biological image segmentation the promise and challenge of high-throughput sequencing of the antibody repertoire tcrex: detection of enriched t cell epitope specificity in full t cell receptor sequence repertoires. biorxiv identifying specificity groups in the t cell receptor repertoire generating sequences with recurrent neural networks. arxiv a bioinformatic framework for immune repertoire diversity profiling enables detection of immunological status learning the high-dimensional immunogenomic features that predict public and private antibody repertoires improving neural networks by preventing co-adaptation of feature detectors long short-term memory fast model-based protein homology detection without alignment neural networks and physical systems with emergent collective computational abilities convolutional neural network architectures for matching natural language sentences attention-based deep multiple instance learning nettcr: sequence-based prediction of tcr binding to peptide-mhc complexes using convolutional neural networks basset: learning the regulatory code of the accessible genome with deep convolutional neural networks detecting cutaneous basal cell carcinomas in ultra-high resolution and weakly labelled histopathological images self-normalizing neural networks capturing the differences between humoral immunity in the normal and tumor environments from repertoire-seq of b-cell receptors using supervised machine learning observed antibody space: a resource for data mining next-generation sequencing of antibody repertoires dense associative memory for pattern recognition dense associative memory is robust to adversarial inputs gradient-based learning applied to document recognition set transformer: a framework 
for attention-based permutation-invariant neural networks imgt unique numbering for immunoglobulin and t cell receptor variable domains and ig superfamily v-like domains distance between sets methods for detecting associations with rare variants for common diseases: application to analysis of sequence data the extended cohnkanade dataset (ck+): a complete dataset for action unit and emotion-specified expression high-throughput immune repertoire analysis with igor a framework for multiple-instance learning computational strategies for dissecting the high-dimensional complexity of adaptive immune repertoires longitudinal high-throughput tcr repertoire profiling reveals the dynamics of t cell memory formation after mild covid-19 infection. biorxiv methods for interpreting and understanding deep neural networks layer-wise relevance propagation: an overview how many different clonotypes do immune repertoires contain? current opinion in systems biology treating biomolecular interaction as an image classification problem -a case study on t-cell receptorepitope recognition prediction. biorxiv sumrep: a summary statistic framework for immune receptor repertoire comparison and model validation biophysicochemical motifs in t-cell receptor sequences distinguish repertoires from tumor-infiltrating lymphocyte and adjacent healthy tissue pytorch: an imperative style, high-performance deep learning library needles in haystacks: on classifying tiny objects in large images interpretable deep learning in drug discovery pointnet: deep learning on point sets for 3d classification and segmentation graph kernels for chemical informatics cov-abdab: the coronavirus antibody database. 
biorxiv immunedb, a novel tool for the analysis, storage, and dissemination of immune repertoire sequencing data a $$k$$-nearest neighbor based algorithm for multi-instance multi-label active learning machine learning in automated text categorization high-throughput mapping of b cell receptor sequences to antigen specificity vdjtools: unifying post-analysis of t cell receptor repertoires vdjdb: a curated database of t-cell receptor sequences with known antigen specificity deeptcr: a deep learning framework for understanding t-cell receptor sequence signatures within complex t-cell repertoires prediction of specific tcr-peptide binding from large dictionaries of tcr-peptide pairs. biorxiv axiomatic attribution for deep networks attention-based deep neural networks for detection of cancerous and precancerous esophagus tissue on histopathological slides learning with sets in multiple instance regression applied to remote sensing attention is all you need revisiting multiple instance neural networks novel approaches to analyze immunoglobulin repertoires immunesim: tunable multi-feature simulation of b-and t-cell receptor repertoires for immunoinformatics benchmarking genome-wide protein function prediction through multiinstance multi-label learning rare-variant association testing for sequencing data with the sequence kernel association test polyspecificity of t cell and b cell receptor recognition practical guidelines for b-cell receptor repertoire sequencing analysis learning embedding adaptation for few-shot learning convolutional neural network architectures for predicting dna-protein binding pird: pan immune repertoire database multi-instance multi-label learning with application to scene classification predicting effects of noncoding variants with deep learning-based sequence model the ellis unit linz, the lit ai lab and the institute for machine learning are supported by the land oberösterreich, lit grants deeptoxgen ( in the following, the appendix to the paper 
"modern hopfield networks and attention for immune
key: cord-133273-kvyzuayp authors: christ, andreas; quint, franz title: artificial intelligence: research impact on key industries; the upper-rhine artificial intelligence symposium (ur-ai 2020) date: 2020-10-05 journal: nan doi: nan sha: doc_id: 133273 cord_uid: kvyzuayp the trirhenatech alliance presents a collection of accepted papers of the cancelled tri-national 'upper-rhine artificial intelligence symposium' planned for 13th may 2020 in karlsruhe. the trirhenatech alliance is a network of universities in the upper-rhine trinational metropolitan region comprising the german universities of applied sciences in furtwangen, kaiserslautern, karlsruhe, and offenburg, the baden-wuerttemberg cooperative state university loerrach, the french university network alsace tech (comprised of 14 'grandes écoles' in the fields of engineering, architecture and management) and the university of applied sciences and arts northwestern switzerland. the alliance's common goal is to reinforce the transfer of knowledge, research, and technology, as well as the cross-border mobility of students. in the area of privacy-preserving machine learning, many organisations could potentially benefit from sharing data with other, similar organisations to train good models. health insurers could, for instance, work together on solving the automated processing of unstructured paperwork such as insurers' claim receipts. the issue here is that organisations cannot share their data with each other for confidentiality and privacy reasons, which is why secure collaborative machine learning, where a common model is trained on distributed data in a way that prevents information from the participants from being reconstructed, is gaining traction. this shows that the biggest problem in the area of privacy-preserving machine learning is not technical implementation, but how much the entities involved (decision makers, legal departments, etc.) trust the technologies.
as a result, the degree to which ai can be explained, and the amount of trust people have in it, will be an issue requiring attention in the years to come.
the representation of language has undergone enormous development of late: new models and variants, which can be used for a range of natural language processing (nlp) tasks, seem to pop up almost monthly. such tasks include machine translation, extracting information from documents, text summarisation and generation, document classification, bots, and so forth. the new generation of language models, for instance, is advanced enough to be used to generate completely realistic texts.
these examples reveal the rapid development currently taking place in the ai landscape, so much so that the coming year may well witness major advances or even a breakthrough in the following areas:
• healthcare sector (reinforced by the covid-19 pandemic): ai facilitates the analysis of huge amounts of personal information, diagnoses, treatments and medical data, as well as the identification of patterns and the early identification and/or cure of disorders.
• privacy concerns: how civil society should respond to the fast-increasing use of ai remains a major challenge in terms of safeguarding privacy. the sector will need to explain ai to civil society in ways that can be understood, so that people can have confidence in these technologies.
• ai in retail: increasing reliance on online shopping (especially in the current situation) will change the way traditional (food) shops function. we are already seeing signs of new approaches with self-scanning checkouts, but this is only the beginning. going forward, food retailers will (have to) increasingly rely on a combination of staff and automated technologies to ensure cost-effective, frictionless shopping.
• process automation: an ever greater proportion of production is being automated or performed by robotic methods.
• bots: progress in the field of language (especially in natural language processing, outlined above) is expected to lead to major advances in the take-up of bots, such as in customer service, marketing, help desk services, healthcare/diagnosis, consultancy and many other areas. the rapid pace of development means it is almost impossible to predict either the challenges we will face in the future or the solutions destined to simplify our lives. one thing we can say is that there is enormous potential here. the universities in the trirhenatech alliance are actively contributing interdisciplinary solutions to the development of ai and its associated technical, societal and psychological research questions. utilizing toes of a humanoid robot is difficult for various reasons, one of which is that inverse kinematics is overdetermined with the introduction of toe joints. nevertheless, a number of robots with either passive toe joints like the monroe or hrp-2 robots [1, 2] or active toe joints like lola, the toyota robot or toni [3, 4, 5] have been developed. recent work shows considerable progress on learning model-free behaviors using genetic learning [6] for kicking with toes and deep reinforcement learning [7, 8, 9] for walking without toe joints. in this work, we show that toe joints can significantly improve the walking behavior of a simulated nao robot and can be learned model-free. the remainder of this paper is organized as follows: section 2 gives an overview of the domain in which learning took place. section 3 explains the approach for model free learning with toes. section 4 contains empirical results for various behaviors trained before we conclude in section 5. the robots used in this work are robots of the robocup 3d soccer simulation which is based on simspark 1 and initially initiated by [10] . it uses the ode physics engine 2 and runs at an update speed of 50hz. 
the simulator provides variations of aldebaran nao robots with 22 dof for the robot types without toes and 24 dof for the type with toes, naotoe henceforth. more specifically, the robot has 6 (7) dof in each leg, 4 in each arm and 2 in its neck. there are several simplifications in the simulation compared to the real nao:
• all motors of the simulated nao are of equal strength, whereas the real nao has weaker motors in the arms and different gears in the leg pitch motors.
• joints do not experience extensive backlash.
• the rotation axes of the hip yaw part of the hip are identical in both robots, but the simulated robot can move hip yaw for each leg independently, whereas for the real nao, left and right hip yaw are coupled.
• the simulated naos do not have hands.
• the touch model of the ground is softer and therefore more forgiving to stronger ground touches in the simulation.
• energy consumption and heat are not simulated.
• masses are assumed to be point masses in the center of each body part.
the feet of naotoe are modeled as rectangular body parts of size 8cm x 12cm x 2cm for the foot and 8cm x 4cm x 1cm for the toes (see figure 1 ). the two body parts are connected with a hinge joint that can move from -1 degree (downward) to 70 degrees. all joints can move at an angular speed of at most 7.02 degrees per 20ms. the simulation server expects to get the desired speed at 50 hz for each joint. if no speeds are sent to the server it will continue movement of the joint with the last speed received. joint angles are noiselessly perceived at 50 hz, but with a delay of 40ms compared to sent actions. so only after two cycles, the robot knows the result of a triggered action. a controller provided for each joint inside the server tries to achieve the requested speed, but is subject to maximum torque, maximum angular speed and maximum joint angles. the simulator is able to run 22 simulated naos in real-time on reasonable cpus.
it is used as competition platform for the robocup 3d soccer simulation league 3 . in this context, only a single agent was running in the simulator. the following subsections describe how we approached the learning problem. this includes a description of the design of the behavior parameters used, what the fitness functions for the genetic algorithm look like, which hyperparameters were used and how the fitness calculation in the simspark simulation environment works exactly. the guiding goal behind our approach is to learn a model-free walk behavior. with model-free we depict an approach that does not make any assumptions about a robot's architecture nor the task to be performed. thus, from the viewpoint of learning, our model consists of a set of flat parameters. these parameters are later grounded inside the domain. the server requires 50 values per second for each joint. to reduce the search space, we make use of the fact that output values of a joint over time are not independent. therefore, we learn keyframes, i.e. all joint angles for discrete phases of movement together with the duration of the phase from keyframe to keyframe. the experiments described in this paper used four to eight of such phases. the number of phases is variable between learning runs, but not subject to learning for now, except for skipping phases by learning a zero duration for it. the robocup server requires robots to send the actual angular speed of each joint as a command. when only leg joints are included, this would require to learn 15 parameters per phase (14 joints + 1 for the duration of the phase), resulting in 60, 90 and 120 parameters for the 4, 6, 8 phases worked with. the disadvantage of this approach is that the speed during a particular phase is constant, thus making it unable to adapt to discrepancies between the desired and the actual motor movement. therefore, a combination of angular value and the maximum amount of angular speed each joint should have is used. 
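The parameter counts quoted above follow directly from the speed-only encoding (14 leg joints plus one duration per phase). The following is a minimal sketch of that bookkeeping, not the authors' implementation. The flat genome layout (joint values first, then the duration) and the function names are assumptions made for illustration, and the count for the combined angle-plus-speed encoding assumes one angle and one maximum speed per joint.

```python
LEG_JOINTS = 14  # 7 dof per leg for the robot type with toes, as stated in the text


def params_speed_only(phases, joints=LEG_JOINTS):
    """Speed-only encoding: one speed per joint plus one duration per phase."""
    return phases * (joints + 1)


def params_angle_and_speed(phases, joints=LEG_JOINTS):
    """Assumed layout: target angle and max speed per joint, plus one duration."""
    return phases * (2 * joints + 1)


def decode_speed_only(genome, joints=LEG_JOINTS):
    """Split a flat parameter list into (joint_values, duration) keyframes.

    The ordering inside each phase is an assumption for this sketch.
    """
    step = joints + 1
    if len(genome) % step != 0:
        raise ValueError("genome length must be a multiple of joints + 1")
    return [
        (genome[i:i + joints], genome[i + joints])
        for i in range(0, len(genome), step)
    ]


counts = [params_speed_only(p) for p in (4, 6, 8)]
keyframes = decode_speed_only(list(range(60)))
```

Keeping the genome flat matches the model-free view taken here: the genetic algorithm only sees a list of numbers, and the grouping into keyframes happens when the parameters are grounded in the domain.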
the direction and final value of movement is entirely encoded in the angular values, but the speed can be controlled separately. it follows that:
- if the amount of angular speed does not allow reaching the angular value, the joint behaves as in the first, speed-only version.
- if the amount of angular speed is bigger, the joint stops moving even if the phase is not over.
this almost doubles the number of parameters to learn, but the co-domain of the speed values is half the size, since here we only require an absolute amount of angular speed. with these parameters, the robot learns a single step and mirrors the movement to get a double step. feedback from the domain is provided by a fitness function that defines the utility of a robot. the fitness function subtracts a penalty for falling from the walked distance in x-direction in meters. there is also a penalty for the maximum deviation in y-direction reached during an episode, weighted by a constant factor:
$fitness_{walk} = distanceX - fallenPenalty - (f \cdot maxY) \quad (1)$
in practice, the values chosen for $fallenPenalty$ and the factor $f$ were usually 3 and 2 respectively. this same fitness function can be used without modification for forward, backward and sideward walk learning, simply by adjusting the initial orientation of the agent. the turn behavior, which was also trained, requires a different fitness function:
$fitness_{turn} = (g \cdot totalTurn) - distance \quad (2)$
where $totalTurn$ refers to the cumulative rotation performed in degrees, weighted by a constant factor $g$ (typically 1/100). we penalize any deviation from the initial starting x / y position ($distance$) as incentive to turn in-place. it is noteworthy that other than swapping out the fitness function and a few more minor adjustments mentioned in section 3.3, everything else about the learning setup remained the same thanks to the model-free approach. naturally, the fitness calculation for an individual requires connecting an agent to the simspark simulation server and having it execute the behavior defined by the learned parameters.
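As a minimal sketch, the two fitness functions described above could look as follows. The walk formula is reconstructed from the prose description (walked distance minus a fall penalty minus a weighted lateral deviation). That the fall penalty is applied once and only when the robot has fallen is an assumption. The argument names are hypothetical, and the default constants are the typical values given in the text.

```python
def fitness_walk(distance_x, max_y, fallen, fallen_penalty=3.0, f=2.0):
    """Walk fitness: distance in x (m), minus fall penalty, minus weighted max y deviation."""
    penalty = fallen_penalty if fallen else 0.0  # assumption: applied only after a fall
    return distance_x - penalty - f * max_y


def fitness_turn(total_turn_deg, distance, g=1.0 / 100.0):
    """Turn fitness: weighted cumulative rotation (degrees) minus drift from the start position."""
    return g * total_turn_deg - distance


walk_ok = fitness_walk(distance_x=10.0, max_y=0.5, fallen=False)
walk_fall = fitness_walk(distance_x=10.0, max_y=0.5, fallen=True)
turn = fitness_turn(total_turn_deg=720.0, distance=0.2)
```

Because only the initial orientation changes between forward, backward and sideward learning, the same `fitness_walk` can score all three; only the turn behavior needs the second function.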
in detail, this works as follows: at the start of each "episode", the agent starts walking with the old model-based walk engine at full speed. once 80 simulation cycles (roughly 1.5 seconds) have elapsed, the robot starts checking the foot force sensors. as soon as the left foot touches the ground, it switches to the learned behavior. this ensures that the learned walk has comparable starting conditions each time. if this does not occur within 70 cycles (which sometimes happens due to non-determinism in the domain and noise in the foot force perception), the robot switches anyway. from that point on, the robot keeps performing the learned behavior that represents a single step, alternating between the original learned parameters and a mirrored version (right step and left step). an episode ends once the agents has fallen or 8 seconds have elapsed. to train different walk directions (forward, backward, sideward), the initial orientation of the player is simply changed accordingly. in addition, the robot uses a different walk direction of the model-based walk engine for the initial steps that are not subject to learning. in case of training a morphing behavior (see 4.5) , the episode duration is extended to 12 seconds. when a morphing behavior should be trained, the step behavior from another learning run is used. this also means that a morphing behavior is always trained for a specific set of walk parameters. after 6 seconds, the morphing behavior is triggered once the foot force sensors detect that the left foot has just touched the ground. unlike the step / walk behavior, this behavior is just executed once and not mirrored or repeated. then the robot switches back to walking at full speed with the model-based walk engine. to maximize the reward, the agent has to learn a morphing behavior that enables the transition between learned model-free and old model-based walk to work as reliably as possible. 
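The hand-over from the model-based walk to the learned step at the start of an episode can be sketched as a small decision function. This is an illustration of the rule as described in the text, not the authors' code. The function names and the way the foot contact is passed in are assumptions.

```python
WARMUP_CYCLES = 80    # cycles of model-based walking before foot sensors are checked
TIMEOUT_CYCLES = 70   # switch anyway if no left-foot contact is seen within this window


def should_switch(cycle, left_foot_on_ground):
    """Return True once the agent should switch to the learned behavior."""
    if cycle < WARMUP_CYCLES:
        return False
    if left_foot_on_ground:
        return True
    # fallback for noisy or missing foot force perception
    return cycle >= WARMUP_CYCLES + TIMEOUT_CYCLES


def step_variant(step_index):
    """Alternate between the learned step and its mirrored version."""
    return "learned" if step_index % 2 == 0 else "mirrored"
```

The timeout keeps an episode from stalling when the contact is never detected, at the cost of slightly less comparable starting conditions in those cases.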
finally, for the turn behavior, the robot keeps repeating the learned behavior without alternating with a mirrored version. in any case, if the robot falls, a training run is over. the overall runtime of each such learning run is 2.5 days on our hardware. learning is done using plain genetic algorithms. the following hyperparameters were used: more details on the approach can be found in [11] . this section presents the results for each kind of behavior trained. this includes three different walk directions, a turn behavior and a behavior for morphing. the main focus of this work has been on training a forward walk movement. figure 2 shows a sequence of images for a learned step. the best result reaches a speed of 1.3 m/s compared to the 1.0 m/s of our model-based walk and 0.96 m/s for a walk behavior learned on the nao robot without toes. the learned walk with toes is less stable, however, and shows a fall rate of 30% compared to 2% of the model-based walk. regarding the characteristics of this walk, it utilizes remarkably long steps 4 . table 1 shows an in-depth comparison of various properties, including step duration, length and height, which are all considerably bigger compared to our previous model-based walk. the forward leaning of the agent has increased by 80.4%, while 28.1% more time is spent with both legs off the ground. however, the maximum deviation from the intended path (maxy ) has also increased by 137.8%. table 1 : comparison of the previously fastest and the fastest learned forward walk once a working forward walk was achieved, it was natural to try to train a backward walk behavior as well, since this only requires a minor modification in the learning environment (changing the initial rotation of the agent and model-based walk direction to start with). the best backward walk learned reaches a speed of 1.03 m/s, which is significantly faster than the 0.67 m/s of its model-based counterpart. unfortunately, the agent also falls 15% more frequently. 
it is interesting just how backward-leaning the agent is during this walk behavior. it could almost be described as "controlled falling" 5 (see figure 3 ). sideward walk learning was the least successful out of the three walk directions. like with all directions, the agent starts out using the old walk engine and then switches to the learned behavior after a short time. in this case however, instead of continuing to walk sideward, the agent has learned to turn around and walk forward instead, see figure 4 . the resulting forward walk is not very fast and usually causes the agent to fall within a few meters 6 , but it is still remarkable that the learned behavior manages to both turn the agent around and make it walk forward with the same repeating step movement. it is also remarkable that the robot learned that it is quicker with the given legs at least for long distances to turn and run forward than to keep making sidesteps. with the alternate fitness function presented in section 3, the agent managed to learn a turn behavior that is comparable in speed to that of the existing walk engine. despite this, the approach is actually different: while the old walk engine uses small, angled steps 7 , the learned behavior uses the left leg as a "pivot", creating angular momentum with the right leg 8 . figure 5 shows the movement sequence in detail. unfortunately, despite the comparable speed, the learned turn behavior suffers from much worse stability. with the old turn behavior, the agent only falls in roughly 3% of cases, with the learned behavior it falls in roughly 55% of the attempts. one of the major hurdles for using the learned walk behaviors in a robocup competition is the smooth transition between them and other existing behaviors such as kicks. the initial transition to the learned walk is already built into the learning setup described in 3 by switching mid-walk, so it does not have to be given special consideration. 
more problematic is switching to another behavior afterwards without falling. to handle this, we simply attempted to train a "morphing" behavior using the same model-free learning setup. the result is something that could be described as a "lunge" (see figure 6 ) that, when successful, reduces the forward momentum sufficiently to allow the robot to transition to the slower model-based walk. 9 however, the morphing is not successful in about 50% of cases, resulting in a fall. we were able to successfully train forward and backward walk behaviors, as well as a morphing and turn behavior using plain genetic algorithms and a very flexible model-free approach. the usage of the toe joint in particular makes the walks look more natural and human-like than that of the model-based walk engine. however, while the learned behaviors outperform or at least match our old model-based walk engine in terms of speed, they are not stable enough to be used during actual robocup 3d simulation league competitions. we think this is an inherent limitation of the approach: we train a static behavior that is unable to adapt to changing circumstances in the environment, which are common in simspark's non-deterministic simulation with perception noise. deep reinforcement learning seems more promising in this regard, as the neural network can dynamically react to the environment since sensor data serves as input. it is also arguably even less restrictive than the keyframe-based behavior parameterization we presented in this paper, as a neural network can output raw joint actions each simulation cycle. at least two other robocup 3d simulation league teams, fc portugal [8] and itandroids [9] , have had great success with this approach. everything points towards this becoming the state-of-the-art approach in robocup 3d soccer simulation in the near future, so we want to concentrate our future efforts here as well.
retail companies dealing in alcoholic beverages are faced with a constant flux of products. apart from general product changes like modified bottle designs and sizes or new packaging units, two factors are responsible for this development. the first is the natural wine cycle with new vintages arriving at the market and old ones cycling out each year. the second is the impact of the rapidly growing craft beer trend which has also motivated established breweries to add to their range. the management of the corresponding product data is a challenge for most retail companies. the reason lies in the large amount of data and its complexity. data entry and maintenance processes are linked with considerable manual effort resulting in high data management costs. product data attributes like dimensions, weights and supplier information are often entered manually into the database and are often afflicted with errors. another widely used source of product data is the import from commercial data pools. a means of checking the data thus acquired for plausibility is necessary. sometimes product data is incomplete for various reasons and a method to fill the missing values is required. all these possible product data errors lead to complications in the downstream automated purchase and logistics processes. we propose a machine learning model which involves domain specific knowledge and compare it to a heuristic approach by applying both to real world data of a retail company. in this paper we address the problem of predicting the gross weight of product items in the merchandise category alcoholic beverages. to this end we introduce two levels of additional features. the first level consists of engineered features which can be determined by the basic features alone or by domain specific expert knowledge like which type of bottle is usually used for which grape variety. in the next step an advanced second level feature is computed from these first level features.
adding these two levels of engineered features increases the prediction quality of the suggestion values we are looking for. the results emphasize the importance of careful feature engineering using expert knowledge about the data domain. feature engineering is the process of extracting features from the data in order to train a prediction model. it is a crucial step in the machine learning pipeline, because the quality of the prediction is based on the choice of features used to training. the majority of time and effort in building a machine learning pipeline is spent on data cleaning and feature engineering [domingos 2012] . a first overview of basic feature engineering principles can be found in [zheng 2018 ]. the main problem is the dependency of the feature choice on the data set and the prediction algorithm. what works best for one combination does not necessarily work for another. a systematic approach to feature engineering without expert knowledge about the data is given in [heaton 2016 ]. the authors present a study whether different machine learning algorithms are able to synthesize engineered features on their own. as engineered features logarithms, ratios, powers and other simple mathematical functions of the original features are used. in [anderson 2017 ] a framework for automated feature engineering is described. the data set is provided by a major german retail company and consists of 3659 beers and 10212 wines. each product is characterized by the seven features shown in table 1. the product name obeys only a generalized format. depending on the user generating the product entry in the company data base, abbreviating style and other editing may vary. the product group is a company specific number which encodes the product category -dairy products, vegetables or soft drinks for example. in our case it allows a differentiation of the product into beer and wine. 
additionally wines are grouped by country of origin and for germany also into wine-growing regions. note that the product group is no inherent feature like length, width, height and volume, but depends on the product classification system a company uses. the dimensions length, width, height and the volume derived by multiplying them are given as float values. the feature (gross) weight, also given as a float value, is what we want to predict. as is often the case with real world data, a pre-processing step has to be performed prior to the actual machine learning in order to reduce data errors and inconsistencies. for our data we first removed all articles missing one or more of the required attributes of table 1. then all articles with dummy values were identified and discarded. dummy values are often introduced due to internal process requirements but do not add any relevant information to the data. if for example the attribute weight has to be filled for an article during article generation in order to proceed to the next step but the actual value is not known, often a dummy value of 1 or 999 is entered. these values distort the prediction model when used as training data in the machine learning step. the product name is subjected to lower casing and substitution of special german characters like umlauts. special symbolic characters like #, ! or separators are also deleted. with this preprocessing done the data is ready to be used for feature engineering. following this formal data cleaning we perform an additional content-focused pre-processing. the feature weight is discretized by binning it with bin width 10g. volume is likewise treated with bin size 10ml. this simplifies the value distribution without rendering it too coarse. all articles where length is not equal to width are removed, because in these cases there are no single items but packages of items. often the data at hand is not sufficient to train a meaningful prediction model.
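The cleaning steps above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the pipeline used in the paper. The field names, the sample rows, the units (grams and millilitres), the exact set of characters kept in the name, and the choice of flooring to the lower bin edge are all hypothetical.

```python
import re

REQUIRED = ("name", "group", "length", "width", "height", "volume", "weight")
DUMMY_WEIGHTS = {1, 999}  # dummy values named in the text
UMLAUTS = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"})


def clean_name(name):
    """Lower-case, substitute German special characters, drop symbols and separators."""
    name = name.lower().translate(UMLAUTS)
    name = re.sub(r"[^a-z0-9. ]+", " ", name)
    return re.sub(r"\s+", " ", name).strip()


def bin_value(value, width=10):
    """Discretize to the lower edge of a bin of the given width (assumed convention)."""
    return int(value // width) * width


def preprocess(articles):
    """Apply the formal and the content-focused cleaning to a list of article dicts."""
    cleaned = []
    for a in articles:
        if any(a.get(key) is None for key in REQUIRED):
            continue  # missing required attribute
        if a["weight"] in DUMMY_WEIGHTS:
            continue  # dummy weight entry
        if a["length"] != a["width"]:
            continue  # not a single item but a package of items
        cleaned.append({
            **a,
            "name": clean_name(a["name"]),
            "weight": bin_value(a["weight"]),
            "volume": bin_value(a["volume"]),
        })
    return cleaned


sample = [
    {"name": "Weißbier #Hell!", "group": 1, "length": 7.0, "width": 7.0,
     "height": 26.0, "volume": 1274.0, "weight": 876.0},
    {"name": "Pils", "group": 1, "length": 7.0, "width": 7.0,
     "height": 26.0, "volume": 1274.0, "weight": 999},
    {"name": "Sixpack", "group": 1, "length": 21.0, "width": 14.0,
     "height": 26.0, "volume": 7644.0, "weight": 5300.0},
    {"name": "Riesling", "group": 2, "length": 8.0, "width": 8.0,
     "height": 33.0, "volume": 2112.0, "weight": None},
]
result = preprocess(sample)
```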
in these cases feature engineering is a promising option. identifying and engineering new features depends heavily on expert knowledge of the application domain. the first level consists of engineered features which can be determined by the original features alone. in the next step advanced second level features are computed from these first level and the original features. for our data set the original features are product name and group as well as the dimensions length, width, height and volume. we see that the volume is computed in the most general way by multiplication of the dimensions. geometrically this corresponds to all products being modelled as cuboids. since angular beer or wine bottles are very much the exception in the real world, a sensible new feature would be a more appropriate modelling of the bottle shape. since weight is closely correlated to volume, the better the volume estimate the better the weight estimate. to this end we propose four first level engineered features: capacity, wine bottle type, beer packaging type and beer bottle type which are in turn used to compute a second level engineered feature namely the packaging specific volume. figure 1 shows all discussed features and their interdependencies. let us have a closer look at the first level engineered features. the capacity of a beverage states the amount of liquid contained and is usually limited to a few discrete values. 0.33l and 0.5l are typical values for beer cans and bottles while wines are almost exclusively sold in 0.75l bottles and sometimes in 0.375l bottles. the capacity can be estimated from the given volume with sufficient certainty using appropriate threshold values. outliers were removed from the data set. there are three main beer packaging types in retail: cans, bottles and kegs. while kegs are mainly of interest to pubs and restaurants and are not considered in this paper, cans and bottles target the typical super market shopper and come in a greater variety. 
in our data set, the product name in case of beers is preceded by a prefix denoting whether the product is packaged in a can or a bottle. extracting the relevant information is done using regular expressions. note, though, that the prefix is not always correct and needs to be checked against the dimensions. the shapes of cans are the same for all practical purposes, no matter the capacity. the only difference is in their wall thickness, which depends on the material, aluminium and tin foil being the two common ones. the difference in weight is small and the actual material used is impossible to extract from the data. a further distinction of cans into different types, as is done for beer and wine bottles, is therefore unnecessary. regarding the german beer market, the five bottle types shown in figure 2 are the main ones. the engineered feature beer packaging type assigns each article identified as beer by its product group to one of the classes bottle or can. the feature beer bottle type contains the most probable member of the five main beer bottle types. packages containing more than one bottle or can like crates or six packs are not considered in this paper and were removed from the data set. compared to beer the variety of commercially sold wine packagings is limited to bottles only. a corresponding packaging type attribute to distinguish between cans and bottles is not necessary. again there are a few bottle types which are used for the majority of wines, namely schlegel, bordeaux and burgunder ( figure 3 ). deciding what product is filled in which bottle type is a question of domain knowledge. the original data set does not contain a corresponding feature. from the product group the country of origin and in the case of german wines the region can be determined via a mapping table. this depends on the type of product classification system the respective company uses and need not be valid for all companies. our data set uses a customer specific classification with focus on germany.
a more general one would be the global product classification (gpc) standard for example. to determine wine growing regions in non-german countries like france the product name has to be analyzed using regular expressions. the type of grape is likewise to be deduced from the product name if possible. using the country and specifically the region of origin and type of grape of the wine in question is the only way to assign a bottle type with acceptable certainty. there are countries and regions in which a certain bottle type is used predominantly, sometimes also depending on the color of the wine. the schlegel bottle, for example, is almost exclusively used for german and alsatian white wines and almost nowhere else. bordeaux and burgunder bottles on the other hand are used throughout the world. some origins like california or chile use a mix of bottle types for their wines, which poses an additional challenge. with expert knowledge one can assign regions and grape types to the different bottle types. as with beer bottles this categorization is by no means comprehensive or free of exceptions but serves as a first step. the standard volume computation by multiplying the product dimensions length, width and height is a rather coarse cuboid approximation to the real shape of alcoholic beverage packagings. since the volume is intrinsically linked to the weight which we want to predict, a packaging type specific volume computation is required for cans and especially bottles. the modelling of a can is straightforward using a cylinder with the given height $h$ and a diameter equal to the given width $w$ (which equals the length $l$). thus the packaging type specific volume is:
$V_{can} = \pi \left(\frac{w}{2}\right)^2 h \quad (1)$
a bottle on the other hand needs to be modelled piecewise. its height can be divided into three parts: base, shoulders and neck as shown in figure 4. base and neck can be modeled by a cylinder. the shoulders are approximated by a truncated cone.
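The geometric models described above (a cylinder for cans; a cylindrical base, truncated-cone shoulders and a cylindrical neck for bottles) can be sketched with the standard volume formulas. The function and argument names are hypothetical, and the example dimensions are made up for illustration rather than taken from the data set.

```python
import math


def can_volume(width, height):
    """Cylinder whose diameter equals the given width (= length)."""
    return math.pi * (width / 2.0) ** 2 * height


def bottle_volume(d_base, d_neck, h_base, h_shoulders, h_neck):
    """Piecewise bottle model: cylinder + truncated cone + cylinder."""
    r_base = d_base / 2.0
    r_neck = d_neck / 2.0
    v_base = math.pi * r_base ** 2 * h_base
    # frustum volume: (pi * h / 3) * (R^2 + R*r + r^2)
    v_shoulders = math.pi * h_shoulders / 3.0 * (
        r_base ** 2 + r_base * r_neck + r_neck ** 2
    )
    v_neck = math.pi * r_neck ** 2 * h_neck
    return v_base + v_shoulders + v_neck


# hypothetical bottle: 7 cm base, 3 cm neck, 26 cm total height
cuboid = 7.0 * 7.0 * 26.0
bottle = bottle_volume(d_base=7.0, d_neck=3.0, h_base=14.0, h_shoulders=6.0, h_neck=6.0)
```

For this made-up example the piecewise model gives a noticeably smaller outer volume than the cuboid, which is the effect the packaging type specific volume is meant to capture.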
with the help of the corresponding partial heights $h_b$, $h_{sh}$ and $h_n$ we can compute coefficients $b = h_b / h$, $\mathit{sh} = h_{sh} / h$ and $n = h_n / h$ as fractions of the overall height $h$ of the bottle. the diameters of the bottle base and the neck opening are given by $d_b$ and $d_n$ and are likewise used to compute the ratio $r = d_n / d_b$. since bottles have circular bases, the values for width and length in the original data have to be the same and either one may be used for $d_b$. these four coefficients are characteristic for each bottle type, be it beer or wine (table 3) . with their help, a bottle type specific volume from the original data length, width and height can be computed which is a much better approximation to the true volume than the former cuboid model. the bottle base can be modelled as a cylinder as follows:
$V_{base} = \pi \left(\frac{d_b}{2}\right)^2 b \, h \quad (2)$
the bottle shoulders have the form of a truncated cone and are described by formula 3:
$V_{shoulders} = \frac{\pi}{3} \left(\frac{d_b}{2}\right)^2 \left(1 + r + r^2\right) \mathit{sh} \, h \quad (3)$
the bottle neck again is a simple cylinder:
$V_{neck} = \pi \left(\frac{r \, d_b}{2}\right)^2 n \, h \quad (4)$
summing up all three sections yields the packaging type specific volume for bottles:
$V_{bottle} = \pi \left(\frac{d_b}{2}\right)^2 h \left[ b + \frac{\mathit{sh}}{3} \left(1 + r + r^2\right) + n \, r^2 \right] \quad (5)$
the experiments follow the multi-level feature engineering scheme as shown in figure 1 . first, we use only the original features product group and dimensions. then we add the first level engineered features capacity and bottle type to the basic features. next the second level engineered feature packaging type specific volume is used along with the basic features. finally all features from every level are used for the prediction. after pre-processing and feature engineering the data set size is reduced from 3659 to 3380 beers and from 10212 to 8946 wines. for prediction of the continuous valued attribute gross weight, we use and compare several regression algorithms. both the decision-tree based random forests algorithm (breiman, 2001) and support vector machines (svm) (cortes, 1995) are available in regression mode (smola, 1997) . linear regression (lai, 1979) and stochastic gradient descent (sgd) (taddy, 2019) are also employed as examples of more traditional statistics-based methods.
our baseline is a heuristic approach that takes the median of the attribute gross weight for each product group and uses this value as the prediction for all products of the same product group. practical experience has shown this to be a surprisingly good strategy. the implementation was done in python 3.6 using the standard libraries scikit-learn and pandas. all numeric features were logarithmized prior to training the models. the non-numeric feature bottle type was converted to numbers. the final results were obtained using tenfold cross validation (kohavi, 1995) . for model training 80% of the data was used while the remaining 20% constituted the test data. we used the root mean square error (rmse) (6) as well as the mean and variance of the absolute percentage error (7) as metrics for the evaluation of the performance of the algorithms. all machine learning algorithms deliver significant improvements regarding the observed metrics compared to the heuristic median approach. the best results for each feature combination are highlighted in bold script. the results for the beer data set in table 4 show that, compared to the baseline approach, the rmse can be more than halved, the mean of the absolute percentage error reduced to almost a third and its variance quartered. the random forest regressor achieves the best results in terms of rmse and absolute percentage error for almost all feature combinations except basic features and basic features combined with the packaging type specific volume, in which cases support vector machines prove superior. linear regression and sgd are still better than the baseline approach but not on par with the other algorithms. linear regression tends to improve as features are successively added. sgd on the other hand exhibits no clear relation between number and level of features and corresponding prediction quality. a possible cause could be the choice of hyper parameters.
sgd is very sensitive in this regard and depends more heavily upon a higher number of correctly adjusted hyper parameters than the other algorithms we used. random forests is a method which is very well suited to problems where there is no easily discernible relation between the features. it is prone to overfitting, though, which we tried to avoid by using 20% of all data as test data. adding more engineered features leads to increasingly better results using random forest, with an outlier for the packaging type specific volume feature. svm are not affected by only first level engineered features but profit from using the bottle type specific volume. regarding the wine data set the results depicted in table 5 are not as good as for the beer data set though still much better than the baseline approach. a reduction of the rmse by over 29% and of the mean of the absolute percentage error by almost 50% compared to the baseline were achieved. the variance of the absolute percentage error could even be limited to under 10% of the baseline value. again random forests is the algorithm with the best metrics. linear regression and svm are comparable in terms of the absolute percentage error, while sgd is worse but shows good rmse values. in conclusion the general results of the wine data set show not much improvement when applying additional engineered features.
6 discussion and conclusion
the experiments show a much better prediction quality for beer than for wine. a possible cause could be the higher weight variance in wine bottle types compared to beer bottles and cans. it is also more difficult to correctly determine the bottle type for wine, since the higher overlap in dimensions does not allow computing the bottle type with the help of idealized bottle dimensions. using expert knowledge to assign the bottle type by region and grape variety seems not to be as reliable, though. especially with regard to the lack of a predominant bottle type in the region with the most bottles (red wine from baden for example), this approach should be improved.
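The median baseline and the evaluation metrics used in the experiments can be sketched with the Python standard library alone. This is a hedged illustration, not the scikit-learn and pandas code used in the paper. The sample weights are made up, the percentage error is expressed as a fraction rather than multiplied by 100, and the use of the population variance is an assumption.

```python
import math
from collections import defaultdict
from statistics import mean, median, pvariance


def median_baseline(train):
    """Map each product group to the median gross weight of its training items."""
    by_group = defaultdict(list)
    for group, weight in train:
        by_group[group].append(weight)
    return {group: median(weights) for group, weights in by_group.items()}


def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(mean((t - p) ** 2 for t, p in zip(y_true, y_pred)))


def ape_stats(y_true, y_pred):
    """Mean and (population) variance of the absolute percentage error, as fractions."""
    errors = [abs(t - p) / t for t, p in zip(y_true, y_pred)]
    return mean(errors), pvariance(errors)


# made-up (group, weight in g) pairs for illustration only
train = [("beer", 800), ("beer", 900), ("beer", 1000), ("wine", 1200), ("wine", 1400)]
test_items = [("beer", 1000), ("wine", 1300)]

model = median_baseline(train)
y_true = [weight for _, weight in test_items]
y_pred = [model[group] for group, _ in test_items]
error = rmse(y_true, y_pred)
ape_mean, ape_var = ape_stats(y_true, y_pred)
```

Reporting the variance next to the mean makes large individual deviations visible, which matters here because big deviations are worse for logistics than small ones.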
bordeaux bottles in particular often sport an indentation in the bottom, called a 'culot de bouteille'. the size and thickness of this indentation cannot be inferred from the bottle's dimensions. this means that the relation between bottle volume and weight is skewed compared to bottles without these indentations, which in turn decreases prediction quality. predicting gross weights with machine learning and domain-specifically engineered features leads to smaller discrepancies than using simple heuristic approaches. this is important for retail companies since big deviations are much worse for logistical reasons than small ones, which may well be within natural production tolerances for bottle weights. our method makes it possible to check manually generated as well as data pool imported product data for implausible gross weight entries and proposes suggested values in case of missing entries. the method we presented can easily be adapted to non-alcoholic beverages using the same engineered features. in this segment, plastic bottles are much more common than glass ones and hence the impact of the bottle weight compared to the liquid weight is significantly smaller. we assume that this will reduce the importance of the bottle type feature in the prediction. a more problematic kind of beverage is liquor. although there are only a few different standard capacities, the bottle types vary so greatly that identifying a common type is almost impossible. one of the main challenges of our approach is determining the correct bottle types. using expert knowledge is a solid approach but cannot capture all exceptions, especially if a wine growing region has no predominant bottle type and uses mixed bottle types instead. additionally, many wine growers use bottle types which have not been typical for their wine types because they want to stand out from other suppliers in order to get the customer's attention.
assuming that all rieslings are sold in schlegel bottles, for example, is therefore not exactly true. one option could be to model hybrid bottles using a weighted average of the coefficients for each bottle type in use. if a region uses both burgunder and bordeaux bottles with about equal frequency, all products from this region could be assigned a hybrid bottle with coefficients computed as the mean value of each coefficient. if a data set initially labeled with bottle types is available, preliminary simulations have shown that most bottle types can be predicted robustly using classification algorithms. the most promising strategy, in our opinion, is to learn the bottle types directly from product images, using deep neural nets for example. with regard to the ever increasing online retail sector, web stores need to have pictures of their products on display, so the data is there to be used.

quality assurance is one of the key issues for modern production technologies. new production methods in particular, like additive manufacturing and composite materials, require high resolution 3d quality assurance methods. computed tomography (ct) is one of the most promising technologies to acquire material and geometry data non-destructively at the same time. with ct it is possible to digitize objects in 3d, which also allows their inner structure to be visualized. a 3d-ct scanner produces voxel data, consisting of volumetric pixels that correlate with material properties. the voxel value (grey value) is approximately proportional to the material density. nowadays it is still common to analyse the data by manually inspecting the voxel data set, searching for and manually annotating defects. the drawback is that for high-resolution ct data this process is very time consuming and the result is operator-dependent. therefore, there is a high motivation to establish automatic defect detection methods. there are established methods for automatic defect detection using algorithmic approaches.
however, these methods show a low reliability in several practical applications. at this point artificial neural networks come into play, which have already been implemented successfully in medical applications [1]. the most common networks developed for medical data segmentation are the u-net by ronneberger et al. [2], the v-net by milletari et al. [3] and their derivatives. these networks are widely used for segmentation tasks. fuchs et al. describe three different ways of analysing industrial ct data [4]. one of these contains a 3d-cnn. this cnn is based on the u-net architecture and is shown in their previous paper [5]. the authors enhance and combine the u-net and v-net architecture to build a new network for examination of 3d volumes. in contrast, we investigate in our work how the networks introduced by ronneberger et al. and milletari et al. perform in industrial environments. furthermore, we investigate whether derivatives of these architectures are able to identify small features in industrial ct data. industrial ct systems differ from medical ct systems not only in the hardware design but also in the resulting 3d imaging data. voxel data from industrial parts differ from medical data in the contrast level and the resolution. state-of-the-art industrial ct scanners produce data sets that are one to two orders of magnitude larger than those of medical ct systems. the corresponding resolution is necessary to resolve small defects. medical ct scanners are optimised for a low x-ray dose for the patient, with x-ray photon energies typically up to 150 kev, whereas industrial scanners typically use energies up to 450 kev. in combination with the difference of the scan "object", the datasets differ significantly in size and image content. to store volume data there are a lot of different file formats. some of them are mainly used in medical applications, like dicom [6], nifti or raw. in industrial applications vgl, raw and tiff are commonly used.
depending on the format, it is also possible to store the data slice-wise or as a complete volume stack. industrial ct data, as mentioned in the previous section, differs in some respects from medical ct data. one aspect is the size of the features to be detected or learned by the neural network. our target is to find defects in industrial parts. as an example, we analyse pores in casting parts. these features may be very small, down to 1 to 7 voxels in each dimension. compared to the size of the complete data volume (typically larger than 512 x 512 x 512 voxels), the feature size is very small. the density difference between material and pores may be as low as 2% of the maximum grey value. thus, it is difficult to annotate the data even for human experts. the availability of real industrial data of good quality, annotated by experts, is very low. most companies don't reveal their quality analysis data. training a neural network with a small quantity of data is not possible. for medical applications, especially ai applications, there are several public datasets available. yet these datasets are not always sufficient and researchers are creating synthetic medical data [7]. therefore, we decided to create synthetic industrial ct data. another important reason for synthetic data is the quality of annotations done by human experts: the results are not consistent across different experts. fuchs et al. have shown that training on synthetic data and predicting on real data leads to good results [4]. however, synthetic data may not reflect all properties of real data. some of the properties are not obvious, which may lead to some variations in the data being ignored. in order to achieve a high associability, we use a large number of synthetic samples mixed with a small number of real samples. to achieve this, we developed an algorithm which generates large amounts of data, containing the large variation of aspects needed for a neural network to generalize.
the variation includes material density, pore density, pore size, pore amount, pore shape and size of the part. some samples can be learned easily, because the pores are clearly visible inside the material. however, some samples are more difficult to learn, because the pores are nearly invisible. this allows us to generate data with a wide variety, and hence the network can predict on different data. to train the neural networks, we can mix the real and synthetic data or use them separately. the real data was annotated manually by two operators. to create a dataset from this volume we sliced it into 64x64x64 blocks. only the blocks with a mean density greater than 50% of the grayscale range are used, to avoid too many empty volumes in the training data. another advantage of synthetic data is the class balance. we have two classes, where 0 corresponds to material and surrounding air and 1 to the defects. because of the size of the defects there is a high imbalance between the classes. by generating data with more features than in the real data, we could reduce the imbalance. reducing the size of the volume to 64x64x64 also leads to a better balance between the size of the defects and the volume than in the full volume. table 1 shows details of our dataset for training, evaluation and testing. the synthetic data will not be recombined into a larger volume, as the blocks represent separate small components or full material units. the following two slices of real data (figure 1) and synthetic data (figure 2) with annotated defects show the conformity between the data.
hardware and software setup
deep learning (dl) consists of two phases: the training and its application. while dl models can be executed very fast, the training of the neural network can be very time-consuming, depending on several factors. one major factor is the hardware. the time consumed can be reduced by a factor of around ten when graphics cards (gpus) are used [8]. to cache the training data before it is passed to the model, which is computed on the gpu, a lot of random-access memory (ram) is used [9] [10] [11]. our system is built on dual-cpu hardware with 10 cores each running at 2.1 ghz, an nvidia titan rtx gpu with 24gb of vram and 64gb of regular ram. all measurements in this work concerning training and execution time relate to this hardware setup. the operating system is ubuntu 18.04 lts. anaconda is used for python package management and deployment. the dl framework is tensorflow 2.1 with keras as a submodule in python. based on the 3du-net [12] and 3dv-net [3] architectures compared by paichao et al. [13], we created modified versions which differ in the number of layers and their hyperparameters. due to the small size of our data, no patch division is necessary. instead the training is performed on the full volumes. we do not use the z-net enhancement proposed in their paper. the input size, depending on our data, is defined as 64x64x64x1, with one dimension for the channel. the incoming data is normalized. as we have a binary segmentation task, our output activation is the sigmoid [14] function. based on paichao et al. [13], the convolutional layers of our 3du-nets have a kernel size of (3, 3, 3) and those of the 3dv-nets have a kernel size of (5, 5, 5). as convolution activation function we use elu [14] [15], and he_normal [16] as kernel initialization [17]. the adam optimisation method [18] [19] is used with a starting learning rate of 0.0001 and a decay factor of 0.1, and the loss function is the binary cross-entropy [20]. figure 3 shows a sample 3du-net architecture, where max pooling is used on the way down and transposed convolution on the way up. in comparison, the 3dv-net in figure 4 is a fully convolutional neural network, in which the descent is done with a (2, 2, 2) convolution with a stride of 2 and the ascent with transposed convolution.
it also has a layer-level addition, in which the input of a level is added to the last convolution output of the same level, as marked by the blue arrows. so that the tensor shapes match for the addition, the down-convolution and the last convolution of the same level must have the same number of kernel filters. our modified neural networks differ in the number of descending and ascending levels, the convolution filter kernel size and their hyperparameters, as shown in table 2. the convolutions on one level have the same number of filter kernels. after every down convolution the number of filters is multiplied by 2, and on the way up it is divided by 2.
training and evaluation of the neural networks
the training conditions and a careful selection of parameters are important. table 3 shows the training conditions fitted to our system and networks. we also take into account that different network architectures and numbers of layers perform better with different learning rates, batch sizes, etc. to evaluate our trained models, we mainly focus on the iou metric, also called the jaccard index, which is the intersection over union. this metric is widely used for segmentation tasks and compares the prediction and the ground truth voxel by voxel. the value of the iou ranges between 0 and 1, whereas the loss values range between 0 and infinity. therefore, the iou is a much clearer indicator. an iou close to 1 indicates a high overlap between the prediction and the ground truth. our networks were trained for between 30 and 90 epochs until no more improvement could be achieved. both datasets consist of a similar number of samples, which means the epoch time is equivalent. one epoch took around 4 minutes. figure 5 shows the loss determined based on the evaluation data. all models are trained on and evaluated against the synthetic dataset gdata and on the mixed dataset mdata.
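the level structure described earlier in this section (spatial size halved on every descent, filter count doubled after every down convolution and halved again on the way up) can be traced with a small helper. this is only a sketch: the base filter count of 16 and the depth of three levels are assumed values, since the actual settings are those of table 2, and `level_plan` is a hypothetical name.

```python
def level_plan(input_size=64, base_filters=16, levels=3):
    """List (direction, level, edge length, filters) for a u-net/v-net style network.

    input_size is the edge length of the cubic input block (64 in the text);
    base_filters and levels are assumptions chosen for illustration.
    """
    plan = []
    size, filters = input_size, base_filters
    # Descending path: each step down halves the edge length and doubles the filters.
    for level in range(levels):
        plan.append(("down", level, size, filters))
        if level < levels - 1:
            size //= 2
            filters *= 2
    # Ascending path: each step up doubles the edge length and halves the filters.
    for level in range(levels - 2, -1, -1):
        size *= 2
        filters //= 2
        plan.append(("up", level, size, filters))
    return plan


plan = level_plan()
for direction, level, size, filters in plan:
    print(f"{direction:>4} level {level}: {size}x{size}x{size}, {filters} filters")
```

with these assumed values, the deepest level works on 16x16x16 blocks with 64 filters, and the output returns to the 64x64x64 input size.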
in general, the loss achieved by all models is higher on mdata because the real data is harder to learn. a direct comparison between the models is only possible between models with the same architecture. the iou metric is shown in figure 6, where the evaluation is sorted by iou. if we compare the loss of unet-mdata with that of unet-gdata, which are nearly the same for mdata, with their corresponding iou (unet-mdata (~0.8) and unet-gdata (~0.93)), we can see that a lower loss does not necessarily lead to a higher iou score. if only the loss and iou are considered, the unets tend to be better than the vnets. in conclusion, considering the iou metric for model selection, unet-gdata is the best performing model and vnet-gdata the worst.
figure 5: the evaluation loss determined based on the evaluation data, sorted from lowest to highest.
figure 6: the evaluation iou determined based on the evaluation data, sorted from lowest to highest.
after comparing the automatic evaluation, we show prediction samples of different models on real and synthetic data (table 4). rows 1 and 2 show the comparison between unet-gdata and vnet-gdata, predicting on a synthetic test sample. the result of unet-gdata exactly hits the ground truth, whereas the vnet-gdata prediction has a 100% overlap with the ground truth but with surrounding false positive segmentations. in rows 3 and 4 both models predict the ground truth plus some false positive segmentations in the close neighbourhood. rows 5 and 6 show the prediction results of the same two models on real data, taking into account that neither model was trained on real data. unet-gdata delivers a good precision with some false positive segmentations in the ground truth area and one additional segmented defect. this shows that the model was able to find a defect which was missed by the expert. vnet-gdata shows a very high number of false positive segmentations.
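the iou (jaccard index) used for the evaluation above can be sketched in a few lines. this is a dependency-free illustration working on flattened voxel masks, not the implementation used for the reported results; the binarisation threshold of 0.5 for the sigmoid outputs and the convention that two empty masks give an iou of 1 are assumptions.

```python
def iou(prediction, ground_truth, threshold=0.5):
    """Intersection over union of two binary voxel masks.

    Both arguments are flat sequences of equal length, for example a
    flattened 64x64x64 block. Values >= threshold count as defect voxels.
    """
    prediction = list(prediction)
    ground_truth = list(ground_truth)
    if len(prediction) != len(ground_truth):
        raise ValueError("masks must have the same number of voxels")
    intersection = 0
    union = 0
    for p, g in zip(prediction, ground_truth):
        p_bin = p >= threshold
        g_bin = g >= threshold
        if p_bin and g_bin:
            intersection += 1
        if p_bin or g_bin:
            union += 1
    # Assumed convention: two empty masks agree perfectly.
    return intersection / union if union else 1.0


ground_truth = [0, 0, 1, 1, 0, 0, 1, 0]
sigmoid_out = [0.1, 0.2, 0.9, 0.8, 0.7, 0.0, 0.3, 0.1]
partial = iou(sigmoid_out, ground_truth)

# Full overlap with the ground truth plus surrounding false positives,
# as reported for vnet-gdata, still lowers the score.
over_segmented = iou([1, 1, 1, 1], [0, 1, 1, 0])
```

the second call shows why a prediction that covers every annotated voxel can still score poorly: every false positive enlarges the union without enlarging the intersection.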
in this paper, we have proposed a neural network to find defects in real and synthetic industrial ct volumes. we have shown that neural networks developed for medical applications can be adapted to industrial applications. to achieve high accuracy, we used a large variety of features in our data. based on the evaluation and a manual review of random samples, we have chosen the unet architecture for further research. this model achieved great performance on our real and synthetic datasets. in summary, this paper shows that artificial intelligence and its neural networks will be an important enrichment for industrial applications.

stress can affect all aspects of our lives, including our emotions, behaviors, thinking ability, and physical health, making our society sick, both mentally and physically. among the effects that stress and anxiety can cause are heart diseases, such as coronary heart disease and heart failure [5]. given this, this research presents a proposal to help people handle stress using the benefits of technological development, and to establish patterns of stress status as a way to propose interventions, since the first step in controlling stress is to know its symptoms. according to the american institute of stress [15], the symptoms of stress are very broad and can be confused with those of other diseases: for example frequent headaches, irritability, insomnia, nightmares, disturbing dreams, dry mouth, problems swallowing and increased or decreased appetite; stress can even cause other conditions such as frequent colds and infections. in view of the wide variety of symptoms caused by stress, this research intends to define, through physiological signals, the patterns generated by the body and obtained by wearable sensors, and to develop a standardized database for applying machine learning.
on the other hand, advances in sensor technology, wearable devices and mobile growth would help with online stress identification based on physiological signals and with the delivery of psychological interventions. advances in technology and improvements in the wearable sensors area have made it possible to use these devices as a source of data to monitor the user's physiological state. the majority of wearable devices consist of a low-cost board that can be used for the acquisition of physiological signals [1, 10]. after the data are obtained it is necessary to apply some filters to obtain a clean signal, without noise or distortions, so that machine learning approaches can be used to model and predict these stress states [2, 11]. the widespread use of mobile devices and microcomputers, such as the raspberry pi, and their capabilities present a great opportunity to collect and process those signals with a dedicated application. these devices can collect the physiological signals and detect specific stress states to generate interventions following the predetermined diagnosis based on the patterns already evaluated in the system [9, 6]. the literature review revealed few works dedicated to comprehensively evaluating the complete cycle of biofeedback, which comprises using the wearable devices, applying machine learning pattern detection algorithms and generating the psychological intervention, besides monitoring its effects and recording the history of events [9, 3]. stress is identified by professionals using human physiology, so wearable sensors could help with data acquisition and processing, with machine learning algorithms applied to the biosignal data suggesting psychological interventions. some works [6, 14] are dedicated to defining patterns through data acquisition experiments that simulate real situations. jebelli, khalili and lee [6] presented a deep learning approach that was compared with a baseline feedforward artificial neural network. schmidt et al.
[12] describe wearable stress and affect detection (wesad), a public dataset used to train classifiers and identify stress patterns, integrating several sensor signals with the emotion aspect, with a precision of 93% in the experiments. the work of gaggioli et al. [4] describes the main features and preliminary evaluation of a free mobile platform for the self-management of psychological stress. in terms of wearables, some studies [13, 14] evaluate the usability of devices to monitor the signals and the patient's well-being. pavic et al. [13] presented research on monitoring cancer patients remotely, as the majority of the patients have many symptoms but cannot stay in hospital during the whole treatment. the authors emphasize that good results were obtained and that this system is viable as long as the patient is not a critical case, as it does not replace medical equipment or the emergency care present in the hospital. henriques et al. [5] evaluated the effects of biofeedback on reducing anxiety in a group of students; in that paper the heart rate variability was monitored in two experiments lasting four weeks each. the work of wijman [8] describes the use of emg signals to identify stress; this experiment was conducted with 22 participants, evaluating both the wearable signals and questionnaires. this section describes the uniqueness of this research and the devices that were used. several studies in the literature address stress patterns and physiological aspects, but with few results. for this reason, our project addresses an experimental study protocol for signal acquisition from patients/participants with wearables, followed by data acquisition and processing and, subsequently, machine learning modeling and prediction on biosignal data regarding stress (fig. 1).
the protocol followed for the acquisition of signals during all the different states is the trier social stress test (tsst) [7], recognized as the gold standard protocol for stress experiments. the estimated total protocol time, involving pre-tests and post-tests, is 116 minutes with a total of thirteen steps, but the applied experiment was adapted and established with ten stages:
1. initial evaluation: the participant arrives at the scheduled time and answers the questionnaires.
2. habituation: a rest time of twenty minutes is taken before the pre-test to avoid the influence of prior events and to establish a safe baseline for the organism.
3. pre-test: the sensors are attached (fig. 2), a saliva sample is collected and the psychological instruments are applied.
4. explanation of procedure and preparation: the participant reads the instructions and the researcher ensures that he understands the job specifications; he is then sent to the room with the jurors (fig. 3), two collaborators of the research who were trained to remain neutral during the experiment, giving no positive verbal or non-verbal feedback.
5. free speech: after three minutes of preparation, the participant is requested to start his speech, being informed that he cannot use the notes.
6. arithmetic task: the jurors request an arithmetic task in which the participant must subtract mentally; sometimes the jurors interrupt and warn that the participant has made a mistake.
7. post-test evaluation: the experimenter receives the subject outside the room for the post-test evaluations.
8. feedback and clarification: the investigator and jurors talk to the subject and clarify what the task was about.
9. relaxation technique: a recording is used with guidelines on how to perform a relaxation technique, using only the breathing.
10. final post-test: some of the psychological instruments are reapplied and saliva samples are collected, while the sensors are still picking up the physiological signals.
based on the literature [14] and the wearable devices available, the signals selected for analysis in an initial experiment are ecg, eda and emg. this experimental study protocol on data acquisition started with 71 participants; each step of the protocol was annotated manually during the experiment, and the data were preprocessed based on feature selection. in the machine learning step, the metrics of different algorithms such as decision tree, random forest, adaboost, knn, k-means and svm are evaluated. the experiment was carried out using the bitalino kit from plux wireless biosignals s.a. (fig. 4), composed of an ecg sensor, which provides data on heart rate and heart rate variability; an eda sensor, which measures the electrodermal activity of the sweat glands; and an emg sensor, which collects the activity of the muscle signals. this section describes the results of the pre-processing step and how it was carried out, listing all parts related to categorizing and filtering the data, evaluating the plausibility of the signal and creating a standardized database.
the developed code is written in python due to the wide variety of libraries available; in this step the libraries numpy and pandas were used, both for data manipulation and analysis. in the first step it is necessary to read the files with the raw data and the timestamp. during this process the channels in use are renamed to the name of the signal, because the bitalino stores the data with the channel number as the name of each signal. next, the timestamp is converted to a format that can be compared with the annotations, and all unused channels are discarded to avoid unnecessary processing. the next step is to read the annotations taken manually during the experiment, as mentioned before, to compare the times and label each part of the signal with the corresponding stage of the experiment. after all signals are labeled with their respective stage of the tsst, the parts of the experiment are grouped into six categories, which will be analyzed later. the first category is "baseline", with just two parts of the experiment, representing the beginning of the experiment, when the participants had just arrived. the second, called "tsst", comprises the period in which the participant spoke; the third category is "arithmetic", with the data acquired in the arithmetic test. the other two relevant categories are "post_test_sensors_1" and "post_test_sensors_2", with the signals from the parts of the same name. every other part of the experiment was categorized as "no_category"; this category is then discarded because it will not be needed in the machine learning stage. once all signals in the dataframe are properly classified, the columns with the participant number and the timestamp are removed from the dataframe. the next step is to evaluate the signal, to verify whether it is really useful for the machine learning process.
for this, the signals are analyzed using the biosppy library, which performs the data filtering process and makes it possible to view the data. finally, the script checks the volume of data present in each category and returns the size of the smallest category. this is done because the categories were found to have different volumes of data, which would become a problem in the machine learning stage by offering more data from one category than from the others. for this reason, the code reduces the size of the other categories until all categories have the same number of rows; after this the dataframe is exported to a csv file, to be read in the machine learning stage. the purpose of this article is to describe some stages of the development of a system for the acquisition and analysis of physiological signals to determine patterns in these signals that would detect stress states. during the development of the project it was found that, for some participants, there are data gaps in the dataframe in the middle of the experiment; one hypothesis is that they are caused by communication issues of the bitalino at some specific sampling rates. the results obtained when reducing the acquisition rate are being evaluated; however, it is necessary to carefully assess the extent to which the reduction in the sampling rate will interfere with the results. during the evaluation of the plausibility of the signals, it was verified that there are evident differences between the signal patterns in the different stages of the process, thus validating the protocol followed in the acquisition of the patterns. the next step in this project is to implement the machine learning stage, applying different algorithms such as svm, decision tree, random forest, adaboost, knn and k-means, and to evaluate the results using metrics like accuracy, precision, recall and f1.
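the balancing step of the pre-processing described above (discard "no_category", find the smallest category, shrink the others to that size) can be sketched as follows. this is not the project's code, which is built on pandas dataframes; it is a standard-library illustration in which the row layout, the column name `category`, the toy `ecg` values and the use of seeded random sampling to shrink a category are all assumptions.

```python
import random
from collections import Counter, defaultdict


def balance_categories(rows, label_key="category", drop_label="no_category", seed=0):
    """Drop unlabeled rows and downsample every category to the smallest one."""
    grouped = defaultdict(list)
    for row in rows:
        if row[label_key] != drop_label:
            grouped[row[label_key]].append(row)
    if not grouped:
        return []
    # Size of the smallest remaining category.
    smallest = min(len(group) for group in grouped.values())
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    balanced = []
    for label in sorted(grouped):
        # Assumption: rows are removed by random sampling without replacement.
        balanced.extend(rng.sample(grouped[label], smallest))
    return balanced


# Toy rows, invented for illustration only.
rows = (
    [{"category": "baseline", "ecg": i} for i in range(5)]
    + [{"category": "tsst", "ecg": i} for i in range(3)]
    + [{"category": "arithmetic", "ecg": i} for i in range(4)]
    + [{"category": "no_category", "ecg": i} for i in range(2)]
)

balanced = balance_categories(rows)
counts = Counter(row["category"] for row in balanced)
print(counts)
```

random sampling breaks the temporal order within a category; if the later feature extraction relies on contiguous signal segments, truncating each category to the smallest size would be the alternative to consider.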
the next steps of this research will support the confirmation of the hypothesis that patterns of physiological signals can be defined to detect stress states. once the patterns are defined, a system can be deployed that acquires the signals and, in real time, analyzes these data based on the machine learning results. in this way the state of the person can be detected, and the psychologist can propose an intervention and monitor whether the stress is decreasing.

technological developments have been influencing all kinds of disciplines by transferring more competences from human beings to technical devices. the steps include [1]:
1. tools: transfer of mechanics (material) from the human being to the device
2. machines: transfer of energy from the human being to the device
3. automatic machines: transfer of information from the human being to the device
4. assistants: transfer of decisions from the human being to the device
with the introduction of artificial intelligence (ai), in particular its latest developments in deep learning, we let the system (in step 4) take over our decisions and creation processes. thus, in tasks and disciplines that were exclusively reserved for humans in the past, machines can now co-exist with the human or even take the human out of the loop. it is no wonder that this transformation does not stop at disciplines such as engineering, business and agriculture but also affects the humanities, art and design. each new technology has been adopted for artistic expression; just see the many wonderful examples in media art. therefore, it is not surprising that ai is going to be established as a novel tool to produce creative content of any form.
however, in contrast to other disruptive technologies, ai seems particularly challenging to accept in the area of art because it offers capabilities we once thought only humans were able to perform: the art is no longer done by artists using new technology to perform their art, but by the machine itself without the need for a human to intervene. the question of "what is art" has always been an emotionally debated topic in which everyone has a slightly different definition depending on his or her own experiences, knowledge base and personal aesthetics. however, there seems to be a broad consensus that art requires human creativity and imagination as, for instance, stated by the oxford dictionary: "the expression or application of human creative skill and imagination, typically in a visual form such as painting or sculpture, producing works to be appreciated primarily for their beauty or emotional power." every art movement challenges old ways and uses artistic creative abilities to spark new ideas and styles. each art movement brought diverse intentions and reasons for creating the artwork, along with critics who did not want to accept the new style as an art form. with the introduction of ai into the creation process, another art movement is trying to establish itself, one which is fundamentally changing the way we see art. for the first time, ai has the potential to take the artist out of the loop, leaving humans only in the positions of curators, observers and judges who decide if the artwork is beautiful and emotionally powerful. while there is a strong debate going on in the arts about whether creativity is profoundly human, we investigate how ai can foster inspiration and creativity and produce unexpected results. it has been shown by many publications that ai can generate images, music and the like which can resemble different styles and produce artistic content. for instance, elgammal et al.
[2] have used generative adversarial networks (gan) to generate images by learning about styles and deviating from style norms. the promise of ai-assisted creation is "a world where creativity is highly accessible, through systems that empower us to create from new perspectives and raise the collective human potential" as roelof pieters and samim winiger pointed out [3]. to get a better understanding of how ai is able to propose images, music, etc., we have to open the black box to investigate where and how the magic is happening. random variations in the image space (sometimes also referred to as pixel space) usually do not lead to any interesting result. this is because semantic knowledge cannot be applied. therefore, methods need to be applied which constrain the possible variations of the given dataset in a meaningful way. this can be realized by generative design or procedural generation. it is applied to generate geometric patterns, textures, shapes, meshes, terrain or plants. the generation processes may include, but are not limited to, self-organization, swarm systems, ant colonies, evolutionary systems, fractal geometry, and generative grammars. mccormack et al. [4] review some generative design approaches and discuss how art and design can benefit from those applications. these generative algorithms, which are usually realized by writing program code, are very limited. ai can change this process into data-driven procedures. ai, or more specifically artificial neural networks, can learn patterns from (labeled) examples or by reinforcement. before an artificial neural network can be applied to a task (classification, regression, image reconstruction), the general architecture is to extract features through many hidden layers. these layers represent different levels of abstraction.
data that have a similar structure or meaning should be represented as data points that are close together while divergent structures or meanings should be further apart from each other. to convert the image back (with some conversion/compression loss) from the low dimensional vector, which is the result of the first component, to the original input an additional component is needed. together they form the autoencoder which consists of the encoder and the decoder . the encoder compresses the data from a high dimensional input space to a low dimensional space, often called the bottleneck layer. then, the decoder takes this encoded input and converts it back to the original input as closely as possible. the latent space is the space in which the data lies in the bottleneck layer. if you look at figure 1 you might be wondering why a model is needed that converts the input data into a "close as possible" output data. it seems rather useless if all it outputs is itself. as discussed, the latent space contains a highly compressed representation of the input data, which is the only information the decoder can use to reconstruct the input as faithfully as possible. the magic happens by interpolating between points and performing vector arithmetic between points in latent space. these transformations result in meaningful effects on the generated images. as dimensionality is reduced, information which is distinct to each image is discarded from the latent space representation, since only the most important information of each image can be stored in this low-dimensional space. the latent space captures the structure in your data and usually offers some semantic meaningful interpretation. this semantic meaning is, however, not given a priori but has to be discovered. as already discussed autoencoders, after learning a particular non-linear mapping, are capable of producing photo-realistic images from randomly sampled points in the latent space. 
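the encoder/decoder round trip described above can be illustrated with a deliberately tiny sketch. it is not a trained neural network: the weights are set by hand, the bottleneck has a single dimension, and the names `W`, `encode`, `decode` and `reconstruction_error` are hypothetical choices made only for this illustration. it shows that whatever does not fit into the latent code cannot be recovered by the decoder.

```python
import math

# hand-set unit vector defining the single latent axis
# (an assumption for illustration; a real autoencoder learns its weights)
W = (0.6, 0.8)


def encode(x):
    """compress a 2-d input into a 1-d latent code (the bottleneck)."""
    return W[0] * x[0] + W[1] * x[1]


def decode(z):
    """expand a 1-d latent code back into the 2-d input space."""
    return (z * W[0], z * W[1])


def reconstruction_error(x):
    """euclidean distance between an input and its reconstruction."""
    x_hat = decode(encode(x))
    return math.hypot(x[0] - x_hat[0], x[1] - x_hat[1])


on_axis = (3.0, 4.0)    # lies along W: the latent code keeps everything
off_axis = (4.0, -3.0)  # orthogonal to W: the latent code keeps nothing

print(reconstruction_error(on_axis))
print(reconstruction_error(off_axis))
```

the first input is reconstructed almost perfectly, while the second one collapses to the origin of the latent axis, so its reconstruction error equals its full length.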
the latent space concept is definitely intriguing but at the same time non-trivial to comprehend. although latent space means hidden, understanding what is happening in latent space is not only helpful but necessary for various applications. exploring the structure of the latent space is both interesting for the problem domain and helps to develop an intuition for what has been learned and can be regenerated. it is obvious that the latent space has to contain some structure that can be queried and navigated. however, it is non-obvious how semantics are represented within this space and how different semantic attributes are entangled with each other. to investigate the latent space one should favor a dataset that offers a limited and distinctive feature set. therefore, faces are a good example in this regard because they share features common to most faces but offer enough variance. if aligned correctly, other meaningful representations of faces are also possible; see for instance the widely used approach of eigenfaces [5] to describe the specific characteristics of faces in a low dimensional space. in the latent space we can do vector arithmetic. this can correspond to particular features. for example, the vector $a_{\text{smiling woman}}$ representing the face of a smiling woman, minus the vector $a_{\text{neutral woman}}$ representing a neutral looking woman, plus the vector $a_{\text{neutral man}}$ representing a neutral looking man, results in the vector $a_{\text{smiling man}}$ representing a smiling man. this can also be done with all kinds of images; see e.g. the publication by radford et al. [6] who first observed the vector arithmetic property in latent space. a visual example is given in figure 2. please note that all images shown in this publication are produced using biggan [7]. the photo of the author on which most of the variations are based was taken by tobias schwerdt. in latent space, vector algebra can be carried out.
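the smiling-woman example can be written down as a toy calculation. the three-dimensional codes and the meaning of their axes (gender, smile, age) are invented for this sketch; latent codes of a real model such as biggan have many more dimensions, and their attributes are generally not aligned with single axes.

```python
def add(u, v):
    """element-wise sum of two latent codes."""
    return [a + b for a, b in zip(u, v)]


def sub(u, v):
    """element-wise difference of two latent codes."""
    return [a - b for a, b in zip(u, v)]


# toy latent codes; axes are (gender, smile, age) by assumption
smiling_woman = [1, 1, 0]
neutral_woman = [1, 0, 0]
neutral_man = [-1, 0, 0]

# the difference isolates the attribute that changed: the smile
smile_direction = sub(smiling_woman, neutral_woman)

# adding that direction to another code transfers the attribute
smiling_man = add(neutral_man, smile_direction)
print(smiling_man)
```

in a real system the resulting code would be passed to the decoder (or generator) to synthesize the corresponding image.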
semantic editing requires moving within the latent space along a certain 'direction'. identifying the 'direction' of only one particular characteristic is non-trivial since editing one attribute may affect others because they are correlated. this correlation can be attributed to some extent to pre-existing correlations in 'the real world' (e.g. old persons are more likely to wear eyeglasses) or bias in the training dataset (e.g. more women than men are smiling in photos). to identify the semantics encoded in the latent space, shen et al. proposed a framework for interpreting faces in latent space [8]. beyond the vector arithmetic property, their framework allows decoupling some entangled attributes (remember the aforementioned correlation between old people and eyeglasses) through linear subspace projection. shen et al. found that in their dataset pose and smile are almost orthogonal to other attributes while gender, age, and eyeglasses are highly correlated with each other. disentangled semantics enable precise control of facial attributes without retraining of any given model. in our examples, in figures 3 and 4, faces are varied according to gender or age. it has been widely observed that when linearly interpolating between two points in latent space, the appearance of the corresponding synthesized images 'morphs' continuously from one face to another; see figure 5. this implies that the semantic meaning contained in the two images also changes gradually. this is in stark contrast to a simple fading between two images in image space. it can be observed that the shape and style slowly transform from one image into the other. this demonstrates how well the latent space understands the structure and semantics of the images. other examples are given in section 3. even though our analysis has focused on face editing for the reasons discussed earlier, it also holds true for other domains. for instance, bau et al.
[9] generated living rooms using similar approaches. they showed that some units from intermediate layers of the generator are specialized to synthesize certain visual concepts such as sofas or tvs. so far we have discussed how autoencoders can connect the latent space and the image semantic space, as well as how the latent code can be used for image editing without influencing the image style. next, we want to discuss how this can be used for artistic expression. while in the former section we have seen how to use manipulation in the latent space to generate mathematically sound operations, not much artistic content has been generated, just variations of photographs such as faces. imprecision in ai systems can lead to unacceptable errors in the system and even result in deadly decisions, e.g. in autonomous driving or cancer treatment. in the case of artistic applications, errors or glitches might lead to interesting, unintended artifacts. whether those errors or glitches are treated as a bug or a feature lies in the eye of the artist. to create higher variations in the generated output, some artists randomly introduce glitches within the autoencoder. due to the complex structure of the autoencoder, these glitches (assuming that they are introduced at an early layer in the network) occur on a semantic level as already discussed and might cause the models to misinterpret the input data in interesting ways. some could even be interpreted as glimpses of autonomous creativity; see for instance the artistic work 'mistaken identity' by mario klingemann [10]. so far the latent space is explored by humans either by random walk or by intuitive steering in a particular direction. it is up to human decisions whether the synthesized image of a particular location in latent space is a visually appealing or otherwise interesting result. the question arises where to find those places and whether those places can be spotted by an automated process.
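the ways of moving through latent space mentioned so far (linear interpolation between two points, steering along a direction, and a random walk) can be sketched in a few lines. all function names are hypothetical. `decouple` is only one simple reading of the linear subspace projection mentioned for the framework of shen et al. [8], namely removing from one direction its component along another; it is not taken from their code.

```python
import random


def lerp(z_a, z_b, t):
    """linear interpolation: t=0 gives z_a, t=1 gives z_b."""
    return [(1.0 - t) * a + t * b for a, b in zip(z_a, z_b)]


def interpolation_path(z_a, z_b, steps):
    """steps + 1 evenly spaced latent codes from z_a to z_b inclusive."""
    return [lerp(z_a, z_b, i / steps) for i in range(steps + 1)]


def steer(z, direction, alpha):
    """move a latent code by alpha along a semantic direction."""
    return [zi + alpha * di for zi, di in zip(z, direction)]


def decouple(direction, other):
    """remove from `direction` its component along `other`."""
    norm_sq = sum(o * o for o in other)
    scale = sum(d * o for d, o in zip(direction, other)) / norm_sq
    return [d - scale * o for d, o in zip(direction, other)]


def random_walk(z_start, n_steps, step_size, seed=0):
    """gaussian random walk through latent space, seeded for repeatability."""
    rng = random.Random(seed)
    path = [list(z_start)]
    for _ in range(n_steps):
        path.append([zi + rng.gauss(0.0, step_size) for zi in path[-1]])
    return path


morph = interpolation_path([0.0, 0.0], [2.0, 4.0], steps=4)
age_only = decouple([1.0, 1.0], [0.0, 2.0])
walk = random_walk([0.0, 0.0], n_steps=3, step_size=0.1)
```

each code in `morph` or `walk` would be decoded into an image; the human (or an automated criterion) then decides which of the resulting images is worth keeping.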
the latent space is usually defined as a space of d dimensions in which the data are assumed to follow a multivariate gaussian distribution $\mathcal{N}(0, I_d)$ [11]. therefore, the mean representation of all images lies in the center of the latent space. but what does that mean for the generated results? it is said that "beauty lies in the eyes of the beholder"; however, research shows that there is a common understanding of beauty. for instance, averaged faces are perceived as more beautiful [12]. applying these findings to latent space, let us assume that the most beautiful images (in our case faces) can be found in the center of the space. particular deviations from the center stand for local sweet spots (e.g. female and male, ethnic groups). these types of sweet spots can be found by common means of data analysis (e.g. clustering). but where are interesting local sweet spots when it comes to artistic expression? figure 6 demonstrates some variation in style within the latent space. of course, one can search for locations in the latent space where particular artworks from a given artist or art styles are located; see e.g. figure 7 where the styles of different artists, as well as white noise, have been used for adoption. but does lingering around these sweet spots not merely produce "more of the same"? how can we find the local sweet spots which can define a new art style and can be deemed truly creative? or do those discoveries of new art styles lie outside of the latent space, because the latent space is trained within a particular set of defined art styles and can, therefore, produce only interpolations of those styles but nothing conceptually new? so far we have discussed how ai can help to generate different variations of faces and where to find visually interesting sweet spots. in this section, we want to show how ai is supporting the creation process by applying the discussed techniques to other areas of image and object processing.
probably, different variations of image-to-image translation are the most popular approach, at least when looking at the mass media. the most prominent example is style transfer: the capability to transfer the style of one image to draw the content of another (examples are shown in figure 7). but mapping an input image to an output image is also possible for a variety of other applications such as object transfiguration (e.g. horse-to-zebra, apple-to-orange), season transfer (e.g. summer-to-winter) or photo enhancement [13]. while some of the just mentioned systems are not yet in a state to be widely applicable, ai tools are taking over and gradually automating design processes which used to be time-consuming manual processes. indeed, the most potential for ai in art and design is seen in its application to tedious, uncreative tasks such as coloring black-and-white images [14]. marco kempf and simon zimmerman used ai in their work dubbed 'deepworld' to generate a compilation of 'artificial countries' using data of all existing countries (around 195) to generate new anthems, flags and other descriptors [15]. roman lipski uses an ai muse (developed by florian dohmann et al.) to foster his inspiration [16]. because the ai muse is trained only on the artist's previous drawings and fed with the current work in progress, it suggests image variations in line with roman's taste. cluzel et al. have proposed an interactive genetic algorithm to progressively sketch the desired side-view of a car profile [17]. for this, the user takes on the role of a fitness function through interaction with the system. the chair project [18] is a series of four chairs co-designed by ai and human designers. the project explores a collaborative creative process between humans and computers. it used a gan to propose new chairs which were then 'interpreted' by trained designers to resemble a chair. deep-wear [19] is a method using deep convolutional gans for clothes design.
the gan is trained on features of brand clothes and can generate images that are similar to actual clothes. a human interprets the generated images and tries to manually draw the corresponding pattern which is needed to make the finished product. li et al. [20] introduced an artificial neural network for encoding and synthesizing the structure of 3d shapes which, according to their findings, are effectively characterized by their hierarchical organization. german et al. [21] have applied different ai techniques, trained on a small sample set of shapes of bottles, to propose novel bottle-like shapes. the evaluation of their proposed methods revealed that they can be used by trained designers as well as non-designers to support the design process in different phases and that they could lead to novel designs not intended or foreseen by the designers. for decades, ai has fostered (often false) future visions ranging from transhumanist utopia to "world run by machines" dystopia. artists and designers explore solutions concerning the semiotic, the aesthetic and the dynamic realm, as well as confronting corporate, industrial, cultural and political aspects. the relationship between the artist and the artwork is directly connected through their intentions, although currently mediated by third parties and media tools. understanding the ethical and social implications of ai-assisted creation is becoming a pressing need. the implications, each of which has to be investigated in more detail in the future, include:
- bias: ai systems are sensitive to bias. as a consequence, the ai is not a neutral tool, but has pre-encoded preferences. biases relevant to creative ai systems are:
• algorithmic bias occurs when a computer system reflects the implicit values of the humans who created it; e.g.
the system is optimized on dataset a and later retrained on dataset b without reconfiguring the neural network (this is not uncommon, as many people do not fully understand what is going on in the network, but are able to use the given code to run training on other data).
• data bias occurs when your samples are not representative of your population of interest.
• prejudice bias results from cultural influences or stereotypes which are reflected in the data.
- art crisis: until 200 years ago painting served as the primary method for visual communication and was a widely and highly respected art form. with the invention of photography, painting began to suffer an identity crisis because painting, in its form at the time, was not able to reproduce the world as accurately and with as little effort as photography. as a consequence, visual artists had to change to different forms of representation not possible with photography, inventing different art styles such as impressionism, expressionism, cubism, pointillism, constructivism, surrealism, up to abstract expressionism. once ai can perfectly simulate those styles, what will happen to the artists? will artists still be needed, be replaced by ai, or will they have to turn to other artistic work which cannot yet be simulated by ai?
- inflation: similar to the image flood which has reached us, the same can happen with ai art. because of the glut, nobody values or looks at the images anymore.
- wrong expectations: only aesthetically appealing or otherwise interesting or surprising results are published, which can be attributed to similar effects as the well-known publication bias [22] in other areas. eventually, this leads to wrong expectations of what is already possible with ai. in addition, this misunderstanding is fueled by content claimed to be created by ai which has in fact been produced, or at least reworked, either by human labor or by methods not containing ai.
- unequal judgment: even though the emotions raised in viewing artworks emerge from the underlying structure of the works, people also include the creation process in their judgment (in the cases where they know about it). frequently, on learning that a computer or an ai has created the artwork, people find it boring, with no guts, no emotion and no soul, whereas before it was inspiring, creative and beautiful.
- authorship: the authorship of ai-generated content has not been clarified. for instance, does the authorship of a novel song composed by an ai trained exclusively on songs by johann sebastian bach belong to the ai, the developer/artist, or bach? see e.g. [23] for a more detailed discussion.
- trustworthiness: new ai-driven tools make it easy for non-experts to manipulate audio and/or visual media. thus, image, audio as well as video evidence is not trustworthy anymore. manipulated image, audio, and video are leading to fake information, truth skepticism, and claims that real audio/video footage is fake (known as the liar's dividend) [24].
the potential of ai in creativity has only just begun to be explored. we have investigated the creative power of ai which is represented, though not exclusively, in the semantically meaningful representation of data in a dimensionally reduced space, dubbed latent space, from which images, but also audio, video, and 3d models can be synthesized. ai is able to imagine visualizations that lie between everything the ai has learned from us and far beyond and might even develop its own art styles (see e.g. deep dream [25]). however, ai still lacks intention and is just processing data. those novel ai tools are shifting the creativity process from crafting to generating and selecting, a process which cannot yet be transferred to machine judgment only. however, ai can already be employed to find possible sweet spots or make suggestions based on the learned taste of the artist [21].
ai is without any doubt changing the way we experience art and the way we do art. doing art is shifting from handcrafting to exploring and discovering. this leaves humans more in the role of a curator instead of an artist, but it can also foster creativity (as discussed before in the case of roman lipski) or reduce the time between intention and realization. it has the potential, just as many other technical developments, to democratize creativity because the handcrafting skills are not so much in need to express his/her own ideas anymore. widespread misuse (e.g. image manipulation to produce fake pornography) can limit the social acceptance and require ai literacy. as human beings, we have to ask ourselves if feelings are wrong just because the ai never felt alike in its creation process as we do? or should we not worry too much and simply enjoy the new artworks created no matter if they are done by humans, by ai or as a co-creation between the two ones? [1] aims to design and implement a machine learning system for the sake of generating prediction models with respect to quality checks and reducing faulty products in manufacturing processes. it is based on an industrial case study in cooperation with sick ag. we will present first results of the project concerning a new process model for cooperating data scientists and quality engineers, a product testing model as knowledge base for machine learning computing and visual support of quality engineers in order to explain prediction results. a typical production line consists of various test stations that conduct several measurements. those measurements are processed by the system on the fly, to point out problematic products. among the many challenges, one focus of the project is on support for quality engineers. preparation of prediction models is usually done by data scientists. 
but the demand for data scientists is increasing too fast, when a big number of products, production lines and changing circumstances have to be considered. hence, a software is needed which quality engineers can operate directly and leverage the results from prediction models. based on quality management and data science standard processes [2] [3] we created a reference process model for production error detection and correction which includes needed actors and associated tasks. with ml system and data scientist assistance we bolster the quality engineer in his work. to support the ml system, we developed a product testing model which includes crucial information about a specific product. in this model we describe the relation to product specific features, test systems, production lines sequences etc. the idea behind this, is to provide metadata information which in turn is used by the ml system instead of individual script solutions for each product. a ml model with good predictions has often a lack of information about the internal decisions. therefore, it is beneficial to support the quality engineer with useful feature visualizations. by default, we support the quality engineer with 2d -3d feature plots and histograms, in which the error distribution is visualized. on top, we developed further feature importance measures based on shap values [4] . these can be used to get deeper insight for particular ml decisions to significant features which get lower ranked by standard feature importance measures. medicine is a highly empirical discipline, where important aspects have to be demonstrated using adequate data and sound evaluations. this is one of the core requirements, which were emphasized during the development of the medical device regulation (mdr) of the european union (eu) [1] . this applies to all medical devices, including mechanical and electrical devices as well as software systems. 
also, the us food & drug administration (fda) recently set a focus on the discussions about using data for demonstrating the safety and efficacy of medical devices [2] . beside pure approval steps, they foster the use of data for optimization of the products, as nowadays data can be acquired more and more, using modern it technology. in particular, they pursue the use of real world evidence, i.e. data that is collected through the lifetime of a device, for demonstrating improved outcomes. [2] such approaches require the use of sophisticated data analysis techniques. beside classical statistics, artificial intelligence (ai) and machine learning (ml) are considered to be powerful techniques for this purpose. currently, they gain more and more attention. these techniques allow to detect dependencies in complex situations, where inputs and/or outputs of a problem have high-dimensional parameter spaces. this can e.g. be the case when extensive data is collected from diverse clinical studies or also treatment protocols from local sites. furthermore, ai/ml based techniques may be used in the devices themselves. for example, devices may be developed which are considered to improve complex diagnostic tasks or find individualized treatment options for specific medical conditions (see e.g. [3, 4] for an overview). for some applications, it already has been demonstrated that ml algorithms are able to outperform human experts with respect to specific success rates (e.g. [5, 6] ). in this paper, it will be discussed how ml based techniques can be brought onto the market including an analysis of appropriate regulatory requirements. for this purpose, the main focus lies on ml based devices applied in the intensive care unit (icu) as e.g. proposed in [7, 8] . the need for specific regulatory requirements comes from the observation, that ai/ml based techniques pose specific risks which need to be considered and handled appropriately. 
for example, ai/ml based methods are more challenging w.r.t. bias effects, reduced transparency, vulnerability to cybersecurity attacks, or general ethical issues (see e.g. [9, 10] ). in particular cases, ml based techniques may lead to noticeably critical results, as it has been shown for the ibm watson for oncology device. in [11] , it was reported that the direct use of the system in particular clinical environments resulted in critical treatment suggestions. the characteristics of ml based systems led to various discussions about their reliability in the clinical context. it requires to find appropriate ways to guarantee their safety and performance. (cf. [12] ) this applies to the field of medicine / medical devices as well as ai/ml based techniques in general. the latter was e.g. approached by the eu in their ethics guidelines for trustworthy ai [9] . driven by this overall development, the fda started a discussion regarding an extended use of ml algorithms in samd (software as a medical device) with a focus in quicker release cycles. in [13] , it pursued the development of a specific process which makes it easier to bring ml based devices onto the market and also to update them during their lifecycle. current regulations for medical devices, e.g. in us or eu, do not provide specific guidelines for ml based devices. in particular, this applies to systems which continuously collect data in order to improve the performance of the device. current regulations focus on a fixed status of the device, which may only be adapted in a minor extent after the release. usually, a new release or clearance by the authority is required, when the clinical performance of a device is modified. but continuously learning systems exactly want to do such improvement steps using additional real-world data from daily applications without extra approvals (see fig. 1 ). basic approaches for ai/ml based medical devices. 
left side: classical approach, where the status of the software has to be fixed after the release / approval stage. right side: continuously learning system where data is collected during the lifetime of the device without a separate release / approval step. in this case, an automatic validation step has to guarantee proper safety and efficacy. in [13], the fda made suggestions on how this could be addressed. it proposed the definition of so-called samd pre-specifications (sps) and an algorithm change protocol (acp), which are considered to represent major tools for dealing with modifications of the ml based system during its lifetime. within the sps, the manufacturer has to define the anticipated changes which are considered to be allowed during the automatic update process. in addition, the acp defines the particular steps which have to be implemented to realize the sps specifications. see [13] for more information about sps and acp. the details, however, have not yet been fully elaborated by the fda, which has requested suggestions in this respect. in particular, these tools serve as a basis for performing an automated validation of the updates. the applicability of this approach depends on the risk of the samd. in [13], the fda uses the risk categories from the international medical device regulators forum (imdrf) [14]. this includes the categories state of healthcare situation or condition (critical vs. serious vs. non-serious) and significance of information provided by samd to healthcare decision (treat or diagnose vs. drive clinical management vs. inform clinical management) as the basic attributes. according to [13], the regulatory requirements for the management of ml based systems are considered to depend on this classification as well as the particular changes which may take place during the lifetime of the device. the fda categorizes them as changes in performance, inputs, and intended use.
such anticipated changes have to be defined in the sps in advance. the main purpose of the present paper is to discuss the validity of the described fda approach for enabling continuously learning systems. therefore, it uses a scenario based technique to analyze whether validation in terms of sps and acp can be considered adequate tools. the scenarios represent applications of ml based devices in the icu. it checks its consistency with other important regulatory requirements and analyzes pitfalls which may jeopardize the safety of the devices. additionally, it discusses whether more general requirements can be sufficiently addressed in the scenarios, as e.g. proposed in ethical guidelines for ai based systems like [9, 10] . this is not considered as a comprehensive analysis of the topics, but as an addition to current discussions about risks and ethical issues, as they are e.g. discussed in [10, 12] . finally, the paper proposes own suggestions to address the regulation of continuously learning ml based systems. again, this is not considered to be a full regulatory strategy, but a proposal of particular requirements, which may overcome some of the current limitations of the approach discussed in [13] . the overall aim of this paper is to contribute to a better understanding of the options and challenges of ai/ml based devices on the one hand and to enable the development of best practices and appropriate regulatory strategies, in the future. within this paper, the analysis of the fda approach proposed in [13] is performed using specific reference scenarios from icu applications, which are particularly taken from [13] itself. the focus lies on ml based devices which allow continuous updates of the model according to data collected during the lifetime of the device. in this context, sps and acp are considered as crucial steps which allow an automated validation of the device based on specified measures. 
in particular, the requirements and limitations of such an automated validation are analyzed and discussed, including the following topics / questions.
- is automated validation reasonable for these cases? what are limitations / potential pitfalls of such an approach when applied in the particular clinical context?
- which additional risks could apply to ai/ml based samd, in general, which go beyond the existing discussions in the literature as e.g. presented in [9, 10, 12]?
- how should such issues be taken into account in the future? what could be appropriate measures / best practices to achieve reliability?
the following exemplary scenarios are used for this purpose.
- base scenario icu: ml based intensive care unit (icu) monitoring system where the detection of critical situations (e.g. regarding physiological instability, potential myocardial infarcts or sepsis) is addressed by using ml. using auditory alarms, the icu staff is informed to initiate appropriate measures to treat the patients in these situations. this scenario addresses a 'critical healthcare situation or condition' and is considered to 'drive clinical management' (according to the risk classification used in [13]).
- modification "locked": icu scenario as presented above, where the release of the monitoring system is done according to a locked state of the algorithm.
- modification "cont-learn": icu scenario as presented above, where the detection of alarm situations is continuously improved according to data acquired during daily routine, including adaptation of performance to sub-populations and/or characteristics of the local environment. in this case, sps and acp have to define standard measures like success rates of alarms/detection and requirements for the management of data, update of the algorithm, and labeling. more details of such requirements are discussed later. this scenario was presented as scenario 1a in [13] with minor modifications.
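the risk classification referred to in the base scenario can be expressed as a simple lookup table. the two attributes are those named in the text; the category numbers i to iv are not given there and are reproduced here from the imdrf samd risk categorization framework cited as [14], so they should be verified against that document before being relied upon. imdrf names the lowest state of healthcare situation 'non-serious'. the names `IMDRF_CATEGORY` and `samd_risk_category` are hypothetical.

```python
# keys: (state of healthcare situation, significance of information)
# values: imdrf samd category, iv being the highest impact
IMDRF_CATEGORY = {
    ("critical", "treat or diagnose"): "IV",
    ("critical", "drive clinical management"): "III",
    ("critical", "inform clinical management"): "II",
    ("serious", "treat or diagnose"): "III",
    ("serious", "drive clinical management"): "II",
    ("serious", "inform clinical management"): "I",
    ("non-serious", "treat or diagnose"): "II",
    ("non-serious", "drive clinical management"): "I",
    ("non-serious", "inform clinical management"): "I",
}


def samd_risk_category(state, significance):
    """look up the category for one combination of the two attributes."""
    return IMDRF_CATEGORY[(state.lower(), significance.lower())]


# base scenario icu: critical situation, drives clinical management
base_scenario_icu = samd_risk_category("critical", "drive clinical management")
print(base_scenario_icu)
```

under this table the icu monitoring scenario falls into the second-highest category, which is one reason why the adequacy of a purely automated validation deserves scrutiny.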
this section provides the basic analysis of the scenarios according to the particular aspects addressed in this paper. it addresses the topics automated validation, man-machine interaction, explainability, bias effects and confounding, fairness and non-discrimination, as well as corrective actions to systematic deficiencies. according to standard regulatory requirements [1, 15, 16], validation is a core step in the development and for the release of medical devices. according to [17], a change in performance of a device (including an algorithm in a samd) as well as a change in particular risks (e.g. new risks, but also new risk assessment or new measures) usually triggers a new premarket notification (510(k)) for most of the devices which get onto the market in the us. thus, such situations require an fda review for clearance of the device. for samd, this requires including an analytical evaluation, i.e. correct processing of input data to generate accurate, reliable, and precise output data. additionally, a clinical validation as well as the demonstration of a valid clinical association need to be provided [18]. this is intended to show that the outputs of the device appropriately work in the clinical environment, i.e. have a valid association regarding the targeted clinical condition and achieve the intended purpose in the context of clinical care [18]. thus, based on the current standards, a device with continuously changing performance usually requires a thorough analysis regarding its validity. this is one of the main points where [13] proposes to establish a new approach for the "cont-learn" cases. as already mentioned, sps and acp basically have to be considered as tools for automated validation in this context. within this new approach, the manual validation step is replaced by an automated process with only reduced or even no additional control by a human observer. thus, it may work as an automated or fully automatic, closed loop validation approach.
the question is whether this change can be considered as an appropriate alternative. in the following, this question is addressed using the icu scenario with a main focus on the "cont-learn" case. some of the aspects also apply to the "locked" cases. but the impact is considered to be higher in the "cont-learn" situation, since the validation step has to be performed in an automated fashion. human oversight, which is usually considered important, is not included here during the particular updates. within the icu scenario, the validation step has to ensure that the alarm rates stay on a sufficiently high level, regarding standard factors like specificity, sensitivity, area under curve (auc), etc. basically, these are technical parameters which can be analyzed according to an analytical evaluation as discussed above. (see also [18] ) this could also be applied to situations, where continuous updates are made during the lifecycle of the device, i.e. in the "cont-learn". however, there are some limitations of the approach. on the one hand, it has to be ensured, that this analysis is sound and reliable, i.e. it is not compromised according to statistical effects like bias or other deficiencies in the data. on the other hand, it has to be ensured that the success rates really have a valid clinical association and can be used as a sole criterion for measuring the clinical impact. thus, the relationship between pure success rates and clinical effects has to be evaluated thoroughly and there may be some major limitations. one major question in the icu scenario is, whether better success rates really guarantee a higher or at least sufficient level of clinical benefit. this is not innately given. for example, a higher success rate of the alarms may still have a negative effect when the icu staff relies more and more on the alarms and subsequently reduces attention. 
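as a minimal illustration of the technical parameters named above, the following python sketch computes sensitivity and specificity from alarm counts and compares them with predefined acceptance thresholds, in the way an automated acceptance rule might be phrased. the class `AlarmCounts`, the function `update_passes` and all threshold values are hypothetical and are not taken from [13]; the sketch only covers the analytical evaluation and says nothing about the clinical association discussed here.

```python
from dataclasses import dataclass


@dataclass
class AlarmCounts:
    """Confusion counts for alarms on a labeled validation set (hypothetical)."""
    true_pos: int   # critical situation present, alarm raised
    false_neg: int  # critical situation present, no alarm
    true_neg: int   # no critical situation, no alarm
    false_pos: int  # no critical situation, alarm raised


def sensitivity(c: AlarmCounts) -> float:
    # Fraction of critical situations that triggered an alarm.
    return c.true_pos / (c.true_pos + c.false_neg)


def specificity(c: AlarmCounts) -> float:
    # Fraction of uncritical situations that correctly stayed silent.
    return c.true_neg / (c.true_neg + c.false_pos)


def update_passes(c: AlarmCounts,
                  min_sensitivity: float = 0.80,
                  min_specificity: float = 0.80) -> bool:
    # Thresholds are illustrative assumptions, fixed before the update is evaluated.
    return (sensitivity(c) >= min_sensitivity
            and specificity(c) >= min_specificity)


counts = AlarmCounts(true_pos=90, false_neg=10, true_neg=170, false_pos=30)
accepted = update_passes(counts)
```

such a rule checks only the alarm rates; whether a passing update also preserves the clinical benefit is exactly the open question of this section.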
thus, it may be the case that the initiation of appropriate treatment steps may be compromised even though the actually occurring alarms seem to be more reliable. in particular, this may apply in situations where the algorithms are adapted to local settings, like in the "cont-learn" scenario. here, the ml based system is intended to be optimized to subpopulations in the local environment or to specific treatment preferences at the local site. due to habituation effects, the staff's expectations get aligned to the algorithm's behavior to a certain degree after a period of time. but when the algorithm changes or an employee from another hospital or department takes over duties in the local unit, the reliability of the alarms may be affected. in these cases, it is not clear whether the expectations are well aligned with the current status of the algorithm, either in the positive or negative direction. since the data updates of the device are intended to improve its performance w.r.t. detection rates, it is clear that significant effects on user interaction may happen. under some circumstances, the overall outcome in terms of the clinical effect may be impaired. the evaluation of such risks has to be addressed during validation. it is questionable whether this can be performed by using an automatic validation approach which focuses on alarm rates but does not include an assessment of the associated risks. at least a clear relationship between these two aspects has to be demonstrated in advance. it is also unclear whether this could be achieved by assessment of pure technical parameters which are defined in advance as required by the sps and acp. usually, ml based systems are trained to a specific scenario. they provide a specific solution for this particular problem. but they do not have a more general intelligence and reasoning about potential risks, which were not under consideration at that point of time.
such a more general intelligence can only be provided when using human oversight. in general, it is not clear whether technical aspects like alarms lead to valid reactions by the users. in technical terms, alarm rates are basically related to the probability of occurrence of specific hazardous situations. but they do not address a full assessment of occurrence of harm. however, this is pivotal for risk assessment in medical devices, in particular for risks related to potential use errors. this is considered to be one of the main reasons why a change in risk parameters triggers a new premarket approval in the us according to [17] . also, the mdr [1] sets high requirements to address the final clinical impact and not only technical parameters. basically, the example emphasizes the importance to consider the interaction between man and machine, or in this case, the algorithm and its clinical environment. this is addressed in the usability standards for medical devices, e.g. iso 62366 [19] . for this reason, the iso 62366 requires that the final (summative) usability evaluation is performed using the final version of the device (in this case, the algorithm) or an equivalent version. this is in conflict with the fda proposal which allows to perform this assessment based on previous versions. at most, a predetermined relationship between technical parameters (alarm rates) and clinical effects (in particular, use related risks) can be obtained. for usage of ml based devices, it remains crucial to consider the interaction between the device and the clinical environment as there usually are important interrelationships. the outcome of an ml based algorithm always depends on the data it gets provided. whenever an input parameter is omitted, which is clinically relevant, the resulting outcome of the ml based system is limited. in the presented scenarios, the pure alarm rates may not be the only clinically relevant outcomes. 
even so, such parameters are usually the main focus regarding the quality of algorithms, e.g. in publications about ml based techniques. this is due to the fact that such quality measures are commonly considered the best available objective parameters, which allow a comparison of different techniques. this applies even more to other ml based techniques which are also very popular in the scientific community, like segmentation tasks in medical image analysis. here the standard quality measures are general distance metrics, i.e. differences between segmented areas [20]. they usually do not include specific clinical aspects like the accuracy in specific risk areas, e.g. important blood vessels or nerves. but such aspects are key factors to ensure the safety of a clinical procedure in many applications. again, only technical parameters are typically in focus. the association to the clinical effects is not assessed accordingly. this situation is depicted in fig. 2 for the icu as well as image segmentation cases. additionally, the validity of an outcome in medical treatments depends on many factors. regarding input data, multiple parameters from a patient's individual history may be important for deciding about a particular diagnosis or treatment. a surgeon usually has access to a multitude of data and also side conditions (like socio-economic aspects) which should be included in an individual diagnosis or treatment decision. the surgeon's general intelligence and background knowledge allow a variety of individual aspects to be included, which have to be considered for a specific case-based decision. in contrast, ml based algorithms rely on a more standardized structure of input data and are only trained for a specific purpose. they lack a more general intelligence, which would allow them to react in very specific situations. even more, ml based algorithms need to generalize and thus to mask out very specific conditions, which could be fatal in some cases.
in [13], the fda presents some examples where changes of the inputs in an ml based samd are included. it is surprising that the fda considers some of them as candidates for a continuous learning system, which does not need an additional review when a tailored sps/acp is available. such discrepancies between technical outcomes and clinical effects also apply to situations like the icu scenario, which only informs or drives clinical management. often users rely on automatically provided decisions, even when they are informed that this is only a proposal. again, this is a matter of man-machine interaction. this gets even worse due to the lack of explainability which ml based algorithms typically have [9, 21]. when surgeons or more general users (e.g. icu staff) detect situations which require a diverging treatment because of very specific individual conditions, they should overrule the algorithm. but users will often be confused by the outcome of the algorithm and do not have a clear idea how they should treat conflicting results between the algorithm's suggestions and their own belief. as long as the ml based decision is not transparent to the user, they will not be able to merge these two directions. the ibm watson example referenced in the introduction shows that this actually is an issue [11]. this may be even more serious when the users (i.e. healthcare professionals) fear litigation because they did not trust the algorithm. in a situation where the algorithm's outcome finally turns out to be true, they may be sued because of this documented deviation. because of such issues, the eu general data protection regulation (gdpr) [22] requires that the users get autonomy regarding their decisions and transparency about the mechanisms underlying the algorithm's outcome [23]. this may be less relevant for the patients, who usually have only limited medical knowledge. they will probably also not understand the medical decisions in conventional cases.
but it is highly relevant for responsible healthcare professionals. they need basic insights into how the decision emerged, as they finally are in charge of the treatment. this demonstrates that methods regarding the explainability of ml based techniques are important. fortunately, this is currently becoming a very active field [21, 24]. this need for explainability applies to locked algorithms as well as situations where continuous learning is applied. based on their own data-driven nature, ml based techniques highly depend on a very high quality of data which are provided for learning and validation. in particular, this is important for the analytical evaluation of the ml algorithms. one of the major aspects is bias effects due to unbalanced input data. for example, in [25] a substantially different detection rate between white and colored people was recognized due to unbalanced data. beside ethical considerations, this demonstrates dependencies of the outcome quality on sub-populations, which may be critical in some cases. even so, the fda proposal [13] currently does not consistently include specific requirements for assessing bias factors or imbalance of data. however, high quality requirements for data management are crucial for ml based devices. in particular, this applies to the icu "cont-learn" cases. there have to be very specific protocols that guarantee that new data and updates of the algorithms are highly reliable w.r.t. bias effects. most of the currently used ml based algorithms fall under the category of supervised learning. thus, they require accurate and clinically sound labeling of the data. during the data collection, it has to be ensured how this labeling is performed and how the data can be fed back into the system in a "cont-learn" scenario. additionally, the data needs to stay balanced, whatever this means in a situation where adaptations to sub-populations and/or local environments are intended for optimization.
it is unclear, whether and how this could be achieved by staff who is only operating with the system but possibly does not know potential algorithmic pitfalls. in the icu scenario, many data points probably need to be recorded by the system itself. thus, a precise and reliable recording scheme has to be established which automatically avoids imbalance of data on the one hand and fusion with manual labelings on the other hand. basically, the sps and acp (proposed in [13] ) are tools to achieve this. the question is whether this is possible in a reliable fashion using automated processes. a complete closed loop validation approach seems to be questionable, especially when the assessment of clinical impact has to be included. thus, the integration of humans including adequate healthcare professionals as well as ml/ai experts with sufficient statistical knowledge seems reasonable. at least, bias assessment steps should be included. as already mentioned, this is not addressed in [13] in a dedicated way. further on, the outcomes may be compromised by side effects in the data. it may be the case, that the main reason for a specific outcome of the algorithm is not a relevant clinical parameter but a specific data artifact, i.e. some confounding factor. in the icu case, it could be the case, that the icu staff reacts early to a potentially critical situation and e.g. gives specific medication in advance to prevent upcoming problems. the physiological reaction of the patient can then be visible in the data as some kind of artifact. during its learning phase, the algorithm may recognize the critical situation not based on a deeper clinical reason, but on detecting the physiological reaction pattern. this may cause serious problems as shown subsequently. 
in the presented scenario, the definition of the clinical situation and the pattern can be deeply coupled by design, since the labeling of the data by the icu staff and the administration of the medication will probably be done in combination at the particular site. this may increase the probability of such effects. usually, confounding factors are hard to determine. even when they can be detected, they are hard to communicate and manage in an appropriate way. how should healthcare professionals react when they get such potentially misleading information (see discussion about liability)? this further limits the explanatory power of ml based systems. when confounders are not detected, they may have unpredictable outcomes w.r.t. the clinical effects. for example, consider the following case. in the icu scenario, an ml based algorithm gets trained in a way that it basically detects the medication artifact as described above during the learning phase. in the next step, this algorithm is used in clinical practice and the icu staff relies on the outcome of the algorithm. then, on the one hand, the medication artifact is not visible unless the icu staff administers the medication. on the other hand, the algorithm does not recognize the pattern and thus does not provide an alarm. subsequently, the icu staff does not act appropriately to manage the critical situation. in particular, such confounders may be more likely in situations where a strong dependence between the outcome of the algorithm and the clinical treatment exists. further examples of such effects were discussed in [7] for icu scenarios. the occurrence of confounders may be a bit less probable in pure diagnostic cases without influence of the diagnostic task onto the generation of data. but even here, such confounding factors may occur. the discussion in [10] provided examples where confounders may occur in diagnostic cases, e.g. because of rulers placed for measurements on radiographs.
in most of the publications about ml based techniques, such side effects are not discussed (or only in a limited fashion). in many papers, the main focus is the technical evaluation and not the clinical environment and the interrelation between technical parameters and clinical effects. additional important aspects which are amply discussed in the context of ai/ml based systems are discrimination and fairness (see e.g. [10] ). in particular, the eu puts a high priority of their future ai/ml strategy on fairness requirements [9] . fairness is often closely related to bias effects. but it goes beyond to more general ethical questions, e.g. regarding the natural tendency of ml based systems to favor specific subgroups. for example, the icu scenario "cont-learn" is intended to optimize w.r.t. to specifics of sub-populations and local characteristics, i.e. it tries to make the outcome better for specific groups. based on such optimization, other groups (e.g. minorities, underrepresented groups) which are not well represented may be discriminated in some sense. this is not a statistical but a systematic effect. superiority of a medical device for a specific subgroup (e.g. gender, social environment, etc.) is not uncommon. for example, some diagnosis steps, implants, or treatments achieve deviating success rates when applied to women in comparison to men. this also applies to differences between adults and children. when assessing bias in clinical outcome in ml based devices, it will probably often be unclear whether this is due to imbalance of data or a true clinical difference between the groups. does an ml based algorithm has to adjust the treatment of a subgroup to a higher level, e.g. a better medication, to achieve comparable results, when the analysis recognized worse results for this subgroup? another example could be a situation where the particular group does not have the financial capabilities to afford the high-level treatment. this could e.g. 
be the case in a developing country or in subgroups with a lower insurance level. in these cases, the inclusion of socio-economic parameters into the analysis seems to be unavoidable. subsequently, this compromises the notion of fairness as a basic principle in some way. this is not unique to ml based devices. but in the case of ml based systems with a high degree of automation, the responsibility for the individual treatment decision more and more shifts from the health care professional to the device. it is implicitly defined in the ml algorithm. in comparison to human reasoning, which allows some weaknesses in terms of individual adjustments of general rules, ml based algorithms are rather deterministic / unique in their outcome. for a fixed input, they have one dedicated outcome (when we neglect statistical algorithms which may allow minor deviations). differences of opinions and room for individual decisions are main aspects of ethics. thus, it remains unclear how fairness can be defined and implemented at all when considering ml based systems. this is even more challenging as socio-economic aspects (even more than clinical aspects) are usually not included in the data and analysis of ml based techniques in medicine. additionally, they are hard to assess and implement in a fair way, especially when using automated validation processes. another disadvantage of ml based devices is the limited opportunity to fix systematic deficiencies in the outcome of the algorithm. let us assume that during the lifetime of the icu monitoring system a systematic deviation from the intended outcome was detected, e.g. in the context of post-market surveillance or due to an increased number of serious adverse events. according to standard rules, a proper preventive or corrective action has to be taken by the manufacturer. in conventional software devices, the error should simply be eliminated, i.e. some sort of bug fixing has to be performed.
for ml based devices it is less clear how bug fixing should work, especially when the systematic deficiency is deeply hidden in the data and/or ml model. in these cases, there usually is no clear reason for the deficiency. subsequently, the deficiency cannot be resolved in a straightforward way using standard bug fixing. there is no dedicated route to find the deeper reasons and to perform changes which could cure the deficiencies, e.g. by providing additional data or changing the ml model. even more, other side effects may easily occur when data and model are changed manually by intent to fix the issue.

4 discussion and outlook

in summary, there are many open questions which are not yet clarified. there still is little experience how ml based systems work in clinical practice and which concrete risks may occur. thus, the fda's commitment to foster the discussion about ml based samd is necessary and appreciated by many stakeholders, as the feedback docket [26] for [13] shows. however, it is a bit surprising that the fda proposes to substantially reduce its very high standards in [13] at this point of time. in particular, it is questionable whether an adequate validation can be achieved by using a fully automatic approach as proposed in [13]. ml based devices are usually optimized according to very specific goals. they can only account for the specific conditions that are reflected in the data and the used optimization / quality criteria. they do not include side conditions and a more general reasoning about potential risks in a complex environment. but this is important for medical devices. for this reason, a more deliberate path would be suited, from the author's perspective. in a first step, more experience should be gained w.r.t. the use of ml based devices in clinical practice. thus, continuous learning should not be an option from the outset.
first, it should be demonstrated that a device works in clinical practice before a continuous learning approach should be possible. this could also be justified from a regulatory point-of-view. the automated validation process itself should be considered as a feature of the device. it should be considered as part of the design transfer which enables safe use of the device during its lifecycle. as part of the design transfer, it should be validated itself. thus, it has to be demonstrated that this automated validation process, e.g. in terms of the sps and acp, works in a real clinical environment. ideally, this would have been demonstrated during the application of the device in clinical practice. thus, one reasonable approach for a regulatory strategy could be to reduce or prohibit the options for enabling automatic validation in a first release / clearance of the device. during the lifetime, direct clinical data could be acquired to demonstrate a better insight into the reliability and limitations of the automatic validation / continuous learning approach. in particular, the relation between technical parameters and clinical effects could be assessed on a broader and more stable basis. based on this evidence in real clinical environments, the automated validation feature could then be cleared in a second round. otherwise, the validity of the automated validation approach would have to be demonstrated in a comprehensive setting during the development phase. in principle, this is possible when enough data is available which truly reflects a comprehensive set of situations. as discussed in this paper, there are many aspects which do not render this approach impossible but very challenging. in particular, this applies to the clinical effects and the interdependency between the users and clinical environment on the one hand and the device, including the ml algorithm, data management, etc., on the other hand. 
this also includes not only variation in the status and needs of the individual patient but also the local clinical environment and potentially also the socioeconomic setting. following a consequent process validation approach, it would have to be demonstrated that the algorithm reacts in a valid and predictable way no matter which training data have been provided, which environment have to be addressed, and which local adjustments have been applied. this also needs to include deficient data and inputs in some way. in [20] , it has been shown that the variation of outcomes can be substantial, even w.r.t. rather simple technical parameters. in [20] , this was analyzed for scientific contests ("challenges") where renowned scientific groups supervised the quality of the submitted ml algorithms. this demonstrates the challenges validation steps for ml based systems still include, even w.r.t. technical evaluation. for these reasons, it seems adequate to pursue the regulatory strategy in a more deliberate way. this includes the restriction of the "cont-learn" cases as proposed. this also includes a better classification scheme, where automated or fully automatic validation is possible. currently, the proposal in [13] does not provide clear rules when continuous learning is allowed. it does not really address a dedicated risk-based approach that defines which options and limitations are applicable. for some options, like the change of the inputs, it should be reviewed, whether automatic validation is a natural option. additionally, the dependency between technical parameters and clinical effects as well as risks should get more attention. in particular, the grade of interrelationship between the clinical actions and the learning task should be considered. in general, the discussions about ml based medical devices are very important. 
these techniques provide valuable opportunities for improvements in fields like medical technologies, where evidence based on high quality data is crucial. this applies to the overall development of medicine as well as to the development of sophisticated ml based medical devices. this also includes the assessment of treatment options and success of particular devices during their lifetime. data-driven strategies will be important for ensuring high-level standards in the future. they may also strengthen regulatory oversight in the long term by amplifying the necessity of post-market activities. this seems to be one of the promises the fda envisions according to their concepts of "total product lifecycle quality (tplc)" and "organizational excellence" [13]. also, the mdr strengthens the requirements for data-driven strategies in the pre- as well as post-market phase. but it should not shift the priorities from a basically proven-quality-in-advance (ex-ante) to a primarily ex-post regulation, which boils down to a trial-and-error oriented approach in the extreme. thus, we should aim at a good compromise between pushing these valuable and innovative options on the one hand and potential challenges and deficiencies on the other hand.

computer-assisted technologies in medical interventions are intended to support the surgeon during treatment and improve the outcome for the patient. one possibility is to augment reality with additional information that would otherwise not be perceptible to the surgeon. in medical applications, it is particularly important that demanding spatial and temporal conditions are adhered to. challenges in augmenting the operating room are the correct placement of holograms in the real world, and thus, the precise registration of multiple coordinate frames to each other, the exact scaling of holograms, and the performance capacity of processing and rendering systems. in general, two different scenarios can be distinguished.
first, applications exist in which a placement of holograms with an accuracy of 1 cm and above is sufficient. these are mainly applications where a person needs a three-dimensional view of data. an example in the medical field may be the visualization of patient data, e.g. to understand and analyse the anatomy of a patient, for diagnosis or surgical planning. the correct visualization of these data can be of great benefit to the surgeon. often only 2d patient data is available, such as ct or mri scans. the availability of 3d representations depends strongly on the field of application. in neurosurgery, 3d views are available but often not extensively utilized due to their limited informative value. additionally, computer monitors are a big limitation, because the data cannot be visualized in real world scale. a further application area is the translation of known user interfaces into augmented reality (ar) space. the benefit here is that a surgeon refrains from touching anything, but can interact with the interface in space using hand or voice gestures. applications visualizing patient data, such as ct scans, only require a rough positioning of the image or holograms in the operation room (or). thus, the surgeon can conveniently place the application freely in space. the main requirement is then to keep the holograms in a constant position. therefore, the internal tracking of the ar device is sufficient to hold the holograms at a fixed position in space. the second scenario covers all applications in which an exact registration of holograms to the real world is required, in particular with a precision below 1 cm. these scenarios are more demanding, especially when holograms must be placed precisely over real patient anatomy. to achieve this, patient tracking is essential to determine position and to follow patient movements. the system therefore needs to track the patient and adjust the visualization to the current situation.
furthermore, it is necessary to track and augment surgical instruments and other objects in the operating room. the augmentation needs to be visualized at the correct spatial position and time constraints need to be fulfilled. therefore, the ar system needs to be embedded into the surgical workflow and react to it. to achieve these goals modern state of the art machine learning algorithms are required. however, the computing power on available ar devices is often not yet sufficient for sophisticated machine learning algorithms. one way to overcome this shortcoming is the integration of the ar system into a distributed system with higher capabilities, such as the digital operating theatre op:sense (see fig. 2 ). in this work an augmented reality system holomed [4] (see fig. 1 ) is integrated into the surgical research platform for robot assisted surgery op:sense [5] . the objective is to enable high-quality and patient-safe neurosurgical procedures in order to increase the surgical outcome by providing surgeons with an assistance system that supports them in cognitively demanding operations. the physician's perception limits are extended by the ar system, which bases on supporting intelligent machine learning algorithms. ar glasses allow the neurosurgeon to perceive the internal structures of the patient's brain. the complete system is demonstrated by applying this methodology to the ventricular puncture of the human brain, one of the most frequently performed procedures in neurosurgery. the ventricle system has an elongated shape with a width of 1-2 cm and is located in a depth of 4 cm inside the human head. patient models are generated fast (< 2s) from ct-data [3] , which are superimposed over the patient during operation and serve as a navigation aid for the surgeon. 
in this work, the expanded system architecture is presented to overcome some limitations of the original system, in which all information was processed on the microsoft hololens, which led to performance deficits. to overcome these shortcomings, the holomed project was integrated into op:sense for additional sensing and computing power. to integrate ar into the operating room and the surgical workflows, the patient, the instruments and the medical staff need to be tracked. to track the patient, a marker system is fixed to the patient's head and the registration from the marker system to the patient is determined. a two-stage process was implemented for this purpose. first, the rough position of the patient's head on the or table is determined by applying a yolo v3 net to reduce the search space. then a robot with a mounted rgb-d sensor scans the detected area and builds a point cloud of it. to determine the position of the patient's head in space as precisely as possible, a two-step surface matching approach is utilized. during recording, the markers are also tracked. with the positions of the patient and the markers known, the registration matrix can be calculated. for the ventricular puncture, a solution is proposed to track the puncture catheter in order to determine the depth of insertion into the human brain. by tracking the medical staff, the system is able to react to the current situation, e.g. if an instrument is passed. in the following, the solutions are described in detail. our digital operating room op:sense is illustrated in fig. 2a. to detect the patient's head, the coarse position is first determined with the yolo v3 cnn [6] , applied to the kinect rgb image streams. the position in 3d is determined through the depth stream of the sensors. the or table and the robots are tracked with retroreflective markers by the arttrack system. this step reduces the spatial search area for fine adjustment.
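the step from a 2d detection to a 3d position can be illustrated with a pinhole back-projection. the following is a minimal sketch, not the op:sense implementation: the function names, the bounding box and the camera intrinsics (`fx`, `fy`, `cx`, `cy`) are hypothetical, and it assumes that the depth image is registered to the rgb image and stores metric depth.

```python
from typing import Tuple


def box_center(box: Tuple[float, float, float, float]) -> Tuple[float, float]:
    """Return the pixel center (u, v) of a box given as (x_min, y_min, x_max, y_max)."""
    x_min, y_min, x_max, y_max = box
    return ((x_min + x_max) / 2.0, (y_min + y_max) / 2.0)


def backproject(u: float, v: float, depth_m: float,
                fx: float, fy: float, cx: float, cy: float) -> Tuple[float, float, float]:
    """Pinhole back-projection of pixel (u, v) with metric depth into camera coordinates."""
    if depth_m <= 0:
        raise ValueError("depth must be positive")
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)


# Hypothetical detection box and intrinsics, for illustration only.
box = (300.0, 200.0, 340.0, 280.0)
u, v = box_center(box)
point = backproject(u, v, 1.5, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
print(point)
```

the resulting point is expressed in the camera frame; transforming it into the world frame would additionally require the extrinsic calibration of the respective sensor.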
the franka panda has an attached intel realsense rgb-d camera as depicted in fig. 3 . the precise determination of the position is performed on the depth data with surface matching. the robot scans the area around the coarsely determined position of the patient's head. a combined surface matching approach with feature-based and icp matching was implemented. the surface matching process is depicted in fig. 4 . in clinical practice, a ct scan of the patient's head is always performed for diagnosis prior to a ventricular puncture, so we can safely assume the availability of ct data. a process to segment the patient models from ct data was proposed by kunz et al. in [3] . the algorithm processes the ct data in under two seconds. the data format is '.nrrd', a volume model format, which can easily be converted into surface models or point clouds. the point cloud of the patient's head ct scan is the reference model that needs to be found in or space. the second point cloud is recorded from the depth stream of the realsense camera mounted on the panda robot by scanning the previously determined rough position of the patient's head. all points are recorded in world coordinate space. the search space is further restricted with a segmentation step that filters out points located on the or table. additionally, manual changes can be made by the surgeon. as a performance optimization, the resolution of the point clouds is reduced to decrease processing time without losing too much accuracy. the normals of both point clouds, generated from the ct data and from the recorded realsense depth stream, are subsequently calculated and harmonised. the harmonisation is especially important as the normals are often misaligned. this misalignment occurs because the ct data is a combination of several individual scans.
for alignment of all normals, a point inside the patient's head is chosen manually as a reference point, followed by orienting all normals in the direction of this point and subsequently inverting all normals so that they point to the outside of the head (see fig. 5 ). after the preprocessing steps, the first surface fitting step is executed. it is based on the initial alignment algorithm proposed by rusu et al. [8] . an implementation within the point cloud library (pcl) is used. for this, fast point feature histograms need to be calculated as a preprocessing step. in the last step, an iterative closest point (icp) algorithm is used to refine the surface matching result. after the two point clouds have been aligned to each other, the inverse transformation matrix can be calculated to obtain the correct transformation from the marker system to the patient model coordinate space. as outlined in fig. 6 , catheter tracking was implemented based on semantic segmentation using a full-resolution residual network (frrn) [7] . after the semantic segmentation of the rgb stream of the kinect cameras, the image is fused with the depth stream to determine the voxels in the point cloud belonging to the catheter. as a further step, a density-based clustering approach [2] is performed on the chosen voxels. this is necessary because of noise, especially at the edges of the instrument voxels in the point cloud. based on the found clusters, the three-dimensional structure of the catheter is estimated. for this purpose, a narrow cylinder with variable length is constructed. the length is adapted according to the semantic segmentation and the clustered voxels of the point cloud. the approach is applicable to a variety of instruments. the openpose [1] library is used to track key points on the bodies of the medical staff. available ros nodes have been modified to integrate openpose into the op:sense ros environment. the architecture is outlined in fig. 7 .
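the harmonisation of the normals described above (orient every normal towards a manually chosen interior point, then invert it) can be sketched in a few lines. this is an illustrative sketch in plain python rather than the pcl-based implementation used in the work; the function name and the toy points are hypothetical.

```python
def orient_normals_outward(points, normals, inner_point):
    """Make every normal point away from a reference point inside the head.

    Step 1 orients each normal towards the reference point,
    step 2 inverts it so that it points to the outside.
    """
    oriented = []
    for p, n in zip(points, normals):
        to_ref = [r - c for r, c in zip(inner_point, p)]
        dot = sum(a * b for a, b in zip(n, to_ref))
        # Step 1: flip the normal if it points away from the reference point.
        toward = list(n) if dot >= 0 else [-c for c in n]
        # Step 2: invert, so that the normal points outward.
        oriented.append([-c for c in toward])
    return oriented


# Toy example: two surface points of a unit sphere around the origin.
points = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
normals = [(-1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]  # first one points inward
outward = orient_normals_outward(points, normals, inner_point=(0.0, 0.0, 0.0))
print(outward)
```

a single interior reference point gives consistent results only if the surface is roughly star-shaped with respect to that point, which is a plausible assumption for a head but is stated here as an assumption of the sketch.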
in this chapter, the results of the patient, catheter and medical staff tracking are described. the approach to find the coarse position of a patient's head was evaluated on a phantom head placed on the or table within op:sense. multiple scenarios with changing illumination and occlusion conditions were recorded. the results are depicted in fig. 8 and the evaluation results are listed in table 1 . precise detection of the patient was performed with a two-stage surface matching approach. different point cloud resolutions were tested with regard to runtime behaviour. voxel grid edge sizes of 6, 4 and 3 mm were tested, with a larger edge size corresponding to a smaller point cloud. the matching results of the two point clouds were analyzed manually. an average accuracy of 4.7 mm was found, with an accuracy range between 3.0 and 7.0 mm. in the first stage of the surface matching, the two point clouds are coarsely aligned as depicted in fig. 9 . in the second stage, icp is used for fine adjustment. a two-stage process was implemented because icp requires a good initial alignment of the two point clouds. for catheter tracking, the semantic segmentation reaches a precision between 47% and 84% (see table 3 ). tracking of instruments, especially neurosurgical catheters, is challenging due to their thin structure and non-rigid shape. detailed results on catheter tracking have been presented in [7] . the 3d estimation of the catheter is shown in fig. 10 . the catheter was moved in front of the camera and the 3d reconstruction was recorded simultaneously. over long periods of the recording, over 90% of the catheter is tracked correctly. in some situations this drops below 50%. the tracking of medical personnel is shown in fig. 11 . the different body parts and joint positions are determined, e.g. the head, eyes, shoulders, elbows, etc. the library yielded very good results, as described in [1] .
we reached a performance of 21 frames per second on a workstation (intel i7-9700k, geforce 1080 ti) processing one stream.
fig. 11 . results of the medical staff tracking.
4 discussion
as shown in the evaluation, our approach succeeds in detecting the patient in an automated two-stage process with an accuracy between 3 and 7 mm. the coarse position is determined by using a yolo v3 net. the results under normal or conditions are very satisfying. the performance of the solution drops strongly under bright illumination conditions. this is due to large flares that occur on the phantom, as it is made of plastic or silicone. however, these effects do not occur on human skin. the advantage of our system is that the detection is performed on all four kinect rgb streams, which provide different views of the operation area. unfavourable illumination conditions normally do not occur on all of these streams. therefore, a robust detection is still possible. in the future, the datasets will be expanded with samples recorded under strong illumination. the subsequent surface matching of the head yields good results and a robust and precise detection of the patient. most important is a good preprocessing of the ct data and of the recorded point cloud of the search area, as described in the methods. the algorithm does not find a result if there are larger holes in the point clouds or if the normals are not calculated correctly. additionally, challenges that have to be considered include skin deformations and noisy ct data. the silicone skin is not fixed to the skull (as human skin is), which leads to changes in position, some of which are greater than 1 cm. also, the processing time of 7 minutes is quite long and must be optimized in the future. the processing time may be shortened by reducing the size of the point clouds. however, in this case the matching results may also become worse.
catheter tracking [7] yielded good results, despite the challenging task of segmenting a very thin (2.5 mm) and deformable object. additionally, a 3d estimation of the catheter was implemented. the results showed that in many cases over 90% of the catheter can be estimated correctly. however, these results strongly depend on the orientation and the quality of the depth stream. using higher-quality sensors could improve the detection results. for tracking of the medical staff, openpose was used as a ready-to-use people detection algorithm and integrated into ros. the library produces very good results, despite the medical staff wearing surgical clothing. in this work, the integration of augmented reality into the digital operating room op:sense is demonstrated. this makes it possible to expand the capabilities of current ar glasses. the system can determine the patient's precise position by means of a two-stage process. first, a yolo v3 net is used to coarsely detect the patient in order to reduce the search area. in a subsequent step, a two-stage surface matching process is applied for refined detection. this approach allows for precise localization of the patient's head for later tracking. further, an frrn-based solution to track the surgical instruments in the or was implemented and demonstrated on a thin neurosurgical catheter for ventricular punctures. additionally, openpose was integrated into the digital or to track the surgical personnel. the presented solution will enable the system to react to the current situation in the operating room and is the basis for an integration into the surgical workflow. due to the emergence of commodity depth sensors, many classical computer vision tasks are employed on networks of multiple depth sensors, e.g. people detection [1] or full-body motion tracking [2] .
existing methods approach these applications using a sequential processing pipeline where the depth estimation and inference are performed on each sensor separately and the information is fused in a post-processing step. in previous work [3] we introduce a scene-adaptive optimization schema, which aims to leverage the accumulated scene context to improve perception as well as post-processing vision algorithms (see fig. 1 ). in this work we present a proof-of-concept implementation of the scene-adaptive optimization methods proposed in [3] for the specific task of stereomatching in a depth sensor network. we propose to improve the 3d data acquisition step with the help of an articulated shape model, which is fitted to the acquired depth data. in particular, we use the known camera calibration and the estimated 3d shape model to resolve disparity ambiguities that arise from repeating patterns in a stereo image pair. the applicability of our approach can be shown by preliminary qualitative results. in previous work [3] we introduce a general framework for scene-adaptive optimization of depth sensor networks. it is suggested to exploit inferred scene context by the sensor network to improve the perception and post-processing algorithms themselves. in this work we apply the proposed ideas in [3] to the process of stereo disparity estimation, also referred to as stereo matching. while stereo matching has been studied for decades in the computer vision literature [4, 5] it is still a challenging problem and an active area of research. stereo matching approaches can be categorized into two main categories, local and global methods. while local methods, such as block matching [6] , obtain a disparity estimation by finding the best matching point on the corresponding scan line by comparing local image regions, global methods formulate the problem of disparity estimation as a global energy minimization problem [7] . 
local methods lead to highly efficient real-time capable algorithms, however, they suffer from local disparity ambiguities. in contrast, global approaches are able to resolve local ambiguities and therefore provide high-quality disparity estimations. but they are in general very time consuming and without further simplifications not suitable for real-time applications. the semi-global matching (sgm) introduced by hirschmuller [8] aggregates many feasible local 1d smoothness constraints to approximate global disparity smoothness regularization. sgm and its modifications are still offering a remarkable trade-off between the quality of the disparity estimation and the run-time performance. more recent work from poggi et al. [9] focuses on improving the stereo matching by taking additional high-quality sources (e.g. lidar) into account. they propose to leverage sparse reliable depth measurements to improve dense stereo matching. the sparse reliable depth measurements act as a prior to the dense disparity estimation. the proposed approach can be used to improve more recent end-to-end deep learning architectures [10, 11] , as well as classical stereo approaches like sgm. this work is inspired by [9] , however, our approach does not rely on an additional lidar sensor but leverages a priori scene knowledge in terms of an articulated shape model instead to improve the stereo matching process. we set up four stereo depth sensors with overlapping fields of view. the sensors are extrinsically calibrated in advance, thus their pose with respect to a world coordinates system is known. the stereo sensors are pointed at a mannequin and capture eight greyscale images (one image pair for each stereo sensor, the left image of each pair is depicted in fig. 3a) . for our experiments we use a high-quality laser scan of the mannequin as ground truth. 
we assume that the proposed algorithm has access to an existing shape model that can express the observed geometry of the scene in some capacity. in our experimental setup, we assume a shape model of a mannequin with two articulated shoulders and a slightly different shape in the belly area of the mannequin (see fig. 2 ). in the remaining section we use the provided shape model to improve the depth data generation of the sensor network. first, we estimate the disparity values of each of the four stereo sensors with sgm without using the human shape model. let $p$ denote a pixel and $q$ denote an adjacent pixel. let $D$ denote a disparity map and $d_p$, $d_q$ denote the disparities at pixel locations $p$ and $q$. let $\mathcal{P}$ denote the set of all pixels and $\mathcal{N}$ the set of all pairs of adjacent pixels. then the sgm cost function can be defined as $E(D) = \sum_{p \in \mathcal{P}} D(p, d_p) + \sum_{(p, q) \in \mathcal{N}} R(p, d_p, q, d_q)$ (1), where $D(p, d_p)$ denotes the matching term (here the sum of absolute differences in a 7 × 7 neighborhood), which assigns a matching cost to the assignment of disparity $d_p$ to pixel $p$, and $R(p, d_p, q, d_q)$ penalizes disparity discontinuities between adjacent pixels $p$ and $q$. in sgm, the objective given in (1) is minimized with dynamic programming, leading to the resulting disparity map $\hat{D} = \arg\min_D E(D)$. as input for the shape model fitting, we apply sgm to all four stereo pairs, leading to four disparity maps as depicted in fig. 4a . to be able to exploit the articulated shape model for stereo matching, we first need to fit the model to the 3d data obtained by classical sgm as described in 3.2. to be more robust to outliers, we only use disparity values from pixels with high contrast and transform them into 3d point clouds. since we assume that the relative camera poses are known, it is straightforward to merge the resulting point clouds in one world coordinate system. finally, the shape model is fitted to the merged point cloud by optimizing over the shape model parameters, namely the pose of the model and the rotation of the shoulder joints.
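to make the structure of the sgm objective more tangible, the following sketch minimizes a one-dimensional version of it exactly along a single scanline with dynamic programming. it is not the implementation used in this work: full sgm aggregates costs along several path directions, the cost table is made up, and the concrete form of the discontinuity penalty (0 for equal disparities, `p1` for a change of one, `p2` otherwise) is an assumption borrowed from common sgm formulations.

```python
def smoothness(d_p, d_q, p1=1.0, p2=4.0):
    """Assumed discontinuity penalty between neighbouring disparities."""
    diff = abs(d_p - d_q)
    if diff == 0:
        return 0.0
    return p1 if diff == 1 else p2


def scanline_dp(cost, p1=1.0, p2=4.0):
    """Minimize matching cost plus smoothness penalty along one scanline.

    cost[x][d] is the matching cost of assigning disparity d to pixel x.
    Returns the optimal disparity sequence and its energy.
    """
    n_disp = len(cost[0])
    best = [list(cost[0])]  # best[x][d]: minimal energy of a path ending in (x, d)
    back = []               # back-pointers for recovering the optimal path
    for x in range(1, len(cost)):
        row, ptr = [], []
        for d in range(n_disp):
            cands = [best[-1][dp] + smoothness(dp, d, p1, p2) for dp in range(n_disp)]
            k = min(range(n_disp), key=cands.__getitem__)
            row.append(cost[x][d] + cands[k])
            ptr.append(k)
        best.append(row)
        back.append(ptr)
    d = min(range(n_disp), key=best[-1].__getitem__)
    energy = best[-1][d]
    path = [d]
    for ptr in reversed(back):
        d = ptr[d]
        path.append(d)
    path.reverse()
    return path, energy


# Made-up costs: the third pixel locally prefers disparity 0 (ambiguous match).
cost = [
    [5.0, 0.0, 5.0],
    [5.0, 0.0, 5.0],
    [0.5, 1.0, 5.0],
    [5.0, 0.0, 5.0],
]
path, energy = scanline_dp(cost)
print(path, energy)
```

in this toy example, a purely local winner-takes-all choice would pick disparity 0 for the third pixel, whereas the smoothness term makes the constant disparity sequence cheaper overall.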
we use an articulated mannequin shape model in this work as a proxy for an articulated human shape model (e.g. [2] ) as proof-of-concept and plan to transfer the proposed approach on real humans in future work. once the model parameters of the shape model are obtained we can reproject the model fit to each sensor view by making use of the known projection matrices. fig. 3b shows the rendered wireframe mesh of the fitted model as an overlay on the camera images. for our guided stereo matching approach we then need the synthetic disparity map which can be computed from the synthetic depth maps (a byproduct of 3d rendering). we denote the synthetic disparity image by d synth . one synthetic disparity image is created for each stereo sensor, see fig. 4b . in the final step we exploit the existing shape model fit, in particular the synthetic disparity image d synth of each stereo sensor and combine it with sgm (inspired by guided stereo matching [9] ). our augmented objective is defined as with the introduced objective is very similar to sgm and can be minimized in a similar fashion leading to the final disparity estimation in our scene-adaptive depth sensor network to summarize our approach, we exploit an articulated shape model fit to enhance sgm with minor adjustments. to show the applicability of our approach we present preliminary qualitative results. the results are depicted in fig. 4 . using sgm without exploiting the provided articulated shape model leads to reasonable results, but the disparity map is very noisy and no clean silhouette of the mannequin is extracted (see fig. 4a ). fitting our articulated shape model to the data leads to clean synthetic disparity maps as shown in fig. 4c , with a clean silhouette. in the belly area the synthetic model disparity map (fig. 4b) does not agree with the ground truth (fig. 4d) . the articulated shape model is not general enough to explain the recorded scene faithfully. 
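the conversion from a rendered synthetic depth map to a synthetic disparity map mentioned above can be sketched as follows. the sketch assumes a rectified stereo pair, for which disparity and depth are related by $d = f \cdot b / z$ with focal length $f$ in pixels and baseline $b$; the function name, the invalid marker and all numeric values are hypothetical.

```python
def depth_to_disparity(depth_map, focal_px, baseline_m, invalid=-1.0):
    """Convert a depth map (metres) to disparities (pixels) for a rectified stereo pair.

    Pixels that are not covered by the rendered model (depth <= 0) are marked invalid.
    """
    return [
        [focal_px * baseline_m / z if z > 0 else invalid for z in row]
        for row in depth_map
    ]


# Hypothetical 2 x 2 synthetic depth map; 0.0 marks background without model coverage.
synthetic_depth = [[2.0, 0.0],
                   [4.0, 1.0]]
d_synth = depth_to_disparity(synthetic_depth, focal_px=800.0, baseline_m=0.125)
print(d_synth)
```

keeping an explicit invalid marker lets a guided matching step fall back to the unguided cost for pixels outside the fitted model.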
using the guided stereo matching approach, we construct a much cleaner disparity map than sgm. in addition, the approach takes the current sensor data into account and exploits an existing articulated shape model. in this work we have proposed a method for scene-adaptive disparity estimation in depth sensor networks. our main contribution is the exploitation of a fitted human shape model to make the estimation of disparities more robust to local ambiguities. our early results indicate that our method can lead to more robust and accurate results compared to classical sgm. future work will focus on a quantitative evaluation as well as incorporating sophisticated statistical human shape models into our approach. inverse process-structure-property mapping abstract. workpieces for dedicated purposes must be composed of materials which have certain properties. the latter are determined by the compositional structure of the material. in this paper, we present the scientific approach of our current dfg funded project tailored material properties through microstructural optimization: machine learning methods for the modeling and inversion of structure-property relationships and their application to sheet metals. the project proposes a methodology to automatically find an optimal sequence of processing steps which produce a material structure that bears the desired properties. the overall task is split in two steps: first find a mapping which delivers a set of structures with given properties and second, find an optimal process path to reach one of these structures with least effort. the first step is achieved by machine learning the generalized mapping of structures to properties in a supervised fashion, and then inverting this relation with methods delivering a set of goal structure solutions. the second step is performed via reinforcement learning of optimal paths by finding the processing sequence which leads to the best reachable goal structure. 
the paper considers steel processing as an example, where the microstructure is represented by orientation density functions and elastic and plastic material target properties are considered. the paper shows the inversion of the learned structure-property mapping by means of genetic algorithms. the search for structures is thereby regularized by a loss term representing the deviation from process-feasible structures. it is shown how reinforcement learning is used to find deformation action sequences in order to reach the given goal structures, which finally lead to the required properties. keywords: computational materials science, property-structure-mapping, texture evolution optimization, machine learning, reinforcement learning the derivation of processing control actions to produce materials with certain, desired properties is the "inverse problem" of the causal chain "process control" -"microstructure instantiation" -"material properties". the main goal of our current project is the creation of a new basis for the solution of this problem by using modern approaches from machine learning and optimization. the inversion will be composed of two explicitly separated parts: "inverse structure-property-mapping" (spm) and "microstructure evolution optimization". the focus of the project lies on the investigation and development of methods which allow an inversion of the structure-property-relations of materials relevant in the industry. this inversion is the basis for the design of microstructures and for the optimal control of the related production processes. another goal is the development of optimal control methods yielding exactly those structures which have the desired properties. the developed methods will be applied to sheet metals within the frame of the project as a proof of concept. 
the goals include the development of methods for inverting technologically relevant "structure-property-mappings" and methods for efficient microstructure representation by supervised and unsupervised machine learning. adaptive processing path-optimization methods, based on reinforcement learning, will be developed for adaptive optimal control of manufacturing processes. we expect that the results of the project will lead to an increasing insight into technologically relevant process-structure-property-relationships of materials. the instruments resulting from the project will also promote the economically efficient development of new materials and process controls. in general, approaches to microstructure design make high demands on the mathematical description of microstructures, on the selection and presentation of suitable features, and on the determination of structure-property relationships. for example, the increasingly advanced methods in these areas enable microstructure sensitive design (msd), which is introduced in [1] and [2] and described in detail in [3] . the relationship between structures and properties descriptors can be abstracted from the concrete data by regression in the form of a structure-property-mapping. the idea of modeling a structure-property-mapping by means of regression and in particular using artificial neural networks was intensively pursued in the 1990s [4] and is still used today. the approach and related methods presented in [5] always consist of a structure-property-mapping and an optimizer (in [5] genetic algorithms) whose objective function represents the desired properties. the inversion of the spm can be alternatively reached via generative models. in contrast to discriminative models (e.g. spm), which are used to map conditional dependencies between data (e.g. 
classification or regression), generative models map the composite probabilities of the variables and can thus be used to generate new data from the assumed population. established, generative methods are for example mixture models [6] , hidden markov models [7] and in the field of artificial neural networks restricted boltzmann machines [8] . in the field of deep learning, generative models, in particular generative adversarial networks [9] , are currently being researched and successfully applied in the context of image processing. conditional generative models can generalize the probability of occurrence of structural features under given material properties. in this way, if desired, any number of microstructures could be generated. based on the work on the spm, the process path optimization in the context of the msd is treated depending on the material properties. for this purpose, the process is regarded as a sequence of structure-changing process operations which correspond to elementary processing steps. shaffer et al. [10] construct a so called texture evolution network based on process simulation samples, to represent the process. the texture evolution network can be considered as a graph with structures as vertices, connected by elementary processing steps as edges. the structure vertices are points in the structure-space and are mapped to the property-space by using the spm for property driven process path optimization. in [11] one-step deformation processes are optimized to reach the most reachable element of a texture-set from the inverse spm. processes are represented by so called process planes, principal component analysis (pca) projections of microstructures reachable by the process. the optimization then is conducted by searching for the process plane which best represents one of the texture-set elements. in [12] , a generic ontology based semantic system for processing path hypothesis generation (matcalo) is proposed and showcased. 
the required mapping of the structures to the properties is modeled based on data from simulations. the simulations are based on taylor models. the structures are represented using textures in the form of orientation density functions (odf), from which the properties are calculated. in the investigations, elastic and plastic properties are considered in particular. structural features are extracted from the odf for a more compact description. the project uses spectral methods such as generalized spherical harmonics (gsh) to approximate the odf. as an alternative representation, we investigate the discretization of the orientation space, where the orientation density is represented by a histogram. the solution of the inverse problem consists of a structure-property-mapping and an optimizer: as described in [4] , the spm is modeled by regression using artificial neural networks. in this investigation, we use a multilayer perceptron. differential evolution (de) is used for the optimization problem. de is an evolutionary algorithm developed by rainer storn and kenneth price [13] . it is an optimization method that repeatedly improves a set of candidate solutions with respect to a given quality measure over a continuous domain. the de algorithm optimizes a problem by maintaining a population of candidate solutions and generating new candidate solutions (structures) by mutation and recombination of existing ones. the candidate solution with the best fitness is considered for further processing. thus, for the generated structures, the reached properties are determined using the spm.
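the mutation, recombination and selection steps of de can be sketched as follows. this is a generic textbook variant (often called de/rand/1/bin) in plain python, not the implementation used in the project; the parameter values and the toy fitness, a squared distance to a made-up target vector standing in for the spm-based fitness, are assumptions for illustration.

```python
import random


def differential_evolution(fitness, bounds, pop_size=20, f=0.8, cr=0.9,
                           generations=300, seed=0):
    """Minimize `fitness` over box `bounds` with a basic DE/rand/1/bin scheme."""
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    scores = [fitness(x) for x in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Mutation: combine three distinct individuals other than i.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            j_rand = rng.randrange(dim)  # guarantees at least one mutated component
            trial = []
            for j in range(dim):
                if rng.random() < cr or j == j_rand:
                    v = pop[a][j] + f * (pop[b][j] - pop[c][j])
                    lo, hi = bounds[j]
                    v = min(max(v, lo), hi)  # clip to the search domain
                else:
                    v = pop[i][j]            # recombination: keep the parent's component
                trial.append(v)
            # Selection: the trial replaces the parent if it is not worse.
            s = fitness(trial)
            if s <= scores[i]:
                pop[i], scores[i] = trial, s
    k = min(range(pop_size), key=scores.__getitem__)
    return pop[k], scores[k]


# Toy stand-in for the real fitness: squared distance to a made-up target vector.
target = [0.2, 0.5, 0.3]


def toy_fitness(x):
    return sum((xi - ti) ** 2 for xi, ti in zip(x, target))


best, best_score = differential_evolution(toy_fitness, bounds=[(0.0, 1.0)] * 3)
print(best, best_score)
```

in the setting described above, each candidate vector would encode a structure and the fitness would be evaluated through the learned spm instead of the toy function.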
the fitness $f$ is composed of two terms: the property loss $l_p$, which expresses how close the property of a candidate is to the target property, and the structure loss $l_s$, which represents the degree of feasibility of the candidate structure in the process: $f = l_p + l_s$. the property loss is the mean squared error (mse) between the reached properties $p_r \in P_r$ and the desired properties $p_d \in P_d$: $l_p = \frac{1}{n} \sum_{i=1}^{n} (p_{r,i} - p_{d,i})^2$, where $n$ is the number of considered properties. to ensure that the genetic algorithm generates reachable structures, a neural network is set up that functions as an anomaly detector. the data basis of this neural network are structures that can be reached by a process. the goal of the anomaly detection is to exclude unreachable structures. the anomaly detection is implemented using an autoencoder [14] . this is a neural network (see fig. 1 ) which consists of two parts: the encoder and the decoder. the encoder maps the input data to an embedding space. the decoder maps the embedding back to a reconstruction that is as close as possible to the original data. due to the reduction to an embedding space, the autoencoder performs data compression and extracts relevant features. the cost function for the structures is a distance function in the odf space, which penalizes the network if it produces outputs that differ from the input. the cost function is also known as the reconstruction loss, with $s_i \in S$ as the original structures, $\hat{s}_i \in \hat{S}$ as the reconstructed structures and a constant $\lambda = 0.001$ to avoid division by zero. when used for anomaly detection, the autoencoder yields a high reconstruction loss if the input data are structures that are very different from the reachable structures. the overall approach is shown in fig. 2 and consists of the following steps: 1. the genetic algorithm generates structures. 2. the spm determines the reached properties of the generated structures. 3. the structure loss $l_s$ is determined by the reconstruction loss of the anomaly detector for the generated structures with respect to the reachable structures.
4. the property loss $l_p$ is determined by the mse of the reached properties and the desired properties. 5. the fitness is calculated as the sum of the structure loss $l_s$ and the property loss $l_p$. the structures resulting from the described approach form the basis for optimal process control. due to the forward mapping, the process evolution optimization based on texture evolution networks [10] is restricted to a-priori sampled process paths. the approach in [11] relies on linearization assumptions and is applicable to short process sequences only. the approach in [12] relies on a-priori learned process models in the form of regression trees and is also applicable to relatively short process sequences only. as an adaptive alternative for texture evolution optimization that can be trained to find process paths of arbitrary length, we propose methods from reinforcement learning. for desired material properties $p_d$, the inverted spm determines a set of goal microstructures $s_d \in G$, which are very likely reachable by the considered deformation process. the texture evolution optimization objective is then to find the shortest process path $P^*$ starting from a given structure $s_0$ and leading close to one of the structures from $G$: $P^* = \arg\min_{P} K$ subject to $E(s_0, P) \in G_\tau$ (4), where $P = (a_k)_{k=0,\dots,K}$ with $K \le T$ is a path of process actions $a$ and $T$ is the maximum allowed process length. the mapping $E(s, P) = s_K$ delivers the resulting structure when applying $P$ to the structure $s$. here, for the sake of simplicity, we assume the process to be deterministic, although the reinforcement learning methods we use are not restricted to deterministic processes. $G_\tau$ is a neighbourhood of $G$, the union of all open balls with radius $\tau$ and center points from $G$. to solve the optimization problem by reinforcement learning approaches, it must be reformulated as a markov decision process (mdp), which is defined by the tuple $(S, A, P, R)$.
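the shortest-path objective stated above can be made concrete with a brute-force search on a toy process. note that this sketch deliberately swaps in breadth-first search instead of reinforcement learning, which only works for tiny discrete action sets and short paths; the integer-grid "structures", the actions and the transition function are made up for illustration.

```python
import math
from collections import deque


def shortest_process_path(s0, goals, actions, transition, distance, tau, max_len):
    """Breadth-first search for the shortest action sequence from s0 into the
    tau-neighbourhood of any goal structure; returns None if none exists
    within max_len steps. Assumes a deterministic transition function."""
    def reached(s):
        return any(distance(s, g) < tau for g in goals)

    if reached(s0):
        return []
    queue = deque([(s0, [])])
    seen = {s0}
    while queue:
        s, path = queue.popleft()
        if len(path) == max_len:
            continue  # maximum allowed process length reached
        for a in actions:
            nxt = transition(s, a)
            if nxt in seen:
                continue
            new_path = path + [a]
            if reached(nxt):
                return new_path
            seen.add(nxt)
            queue.append((nxt, new_path))
    return None


# Toy deterministic process: structures are grid points, actions are unit steps.
def toy_transition(s, a):
    return (s[0] + a[0], s[1] + a[1])


toy_actions = [(1, 0), (0, 1), (-1, 0), (0, -1)]
toy_goals = [(3, 1), (5, 5)]
plan = shortest_process_path((0, 0), toy_goals, toy_actions, toy_transition,
                             math.dist, tau=0.5, max_len=10)
print(plan)
```

the search enumerates paths exhaustively, so its cost grows exponentially with the path length; this is one motivation for learning a policy instead, as proposed in the text.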
in our case $S$ is the space of structures $s$, $A$ is the parameter space of the deformation process, containing process actions $a$, and $P : S \times A \to S$ is the transition function of the deformation process, which we assume to be deterministic. $R_g : S \times A \to \mathbb{R}$ is a goal-specific reward function. the objective of the reinforcement learning agent is then to find the optimal goal-specific policy $\pi^*_g(s_t) = a_t$ that maximizes the discounted future goal-specific reward, where $\gamma \in [0, 1]$ discounts early attained rewards, the policy $\pi_g(s_k)$ determines $a_k$ and the transition function $P(s_k, a_k)$ determines $s_{k+1}$. for a distance function $d$ in the structure space, the binary reward function
$$R_g(s, a) = \begin{cases} 1, & \text{if } d(P(s, a), g) < \tau \\ 0, & \text{otherwise} \end{cases} \qquad (6)$$
if maximized, leads to an optimal policy $\pi^*_g$ that yields the shortest path to $g$ from every $s$ for $\gamma < 1$. moreover, if $V_g$ is given for every microstructure $g$ from $G$, the path $P^*$ from eq. 4 is identical with the application of the policy $\pi^*_\zeta$, where $\zeta = \arg\max_g [V_g]$. $\pi^*_g$ can be approached by methods from reinforcement learning. value-based reinforcement learning does so by learning expected discounted future reward functions [15] . one of these functions is the so-called value function $V$. in the case of a deterministic mdp and for a given $g$, this expectation value function reduces to $V_g$ from eq. 4, and $\zeta$ can be extracted if $V$ is learned for every $g$ from $G$. for doing so, a generalized form of expectation value functions can be learned, as is done e.g. in [16] . this exemplary mdp formulation shows how reinforcement learning can be used for texture evolution optimization tasks. the optimization thereby operates in the space of microstructures and does not rely on a-priori microstructure samples. when using off-policy reinforcement learning algorithms and due to the generalization over goal microstructures, the functions learned while solving a specific optimization task can be easily transferred to new optimization tasks (i.e.
different desired properties or even a different property space). industrial robots are mainly deployed in large-scale production, especially in the automotive industry. today, there are already 26.1 industrial robots deployed per 1,000 employees on average in these industry branches. in contrast, small and medium-sized enterprises (smes) only use 0.6 robots per 1,000 employees [1] . reasons for this low usage of industrial robots in smes include the lack of flexibility with great variance of products and the high investment expenses due to additional peripherals required, such as gripping or sensor technology. the robot as an incomplete machine accounts for a fourth of the total investment costs [2] . due to the constantly growing demand of individualized products, robot systems have to be adapted to new production processes and flows [3] . this development requires the flexibilization of robot systems and the associated frequent programming of new processes and applications as well as the adaption of existing ones. robot programming usually requires specialists who can adapt flexibly to different types of programming for the most diverse robots and can follow the latest innovations. in contrast to many large companies, smes often have no in-house expertise and a lack of prior knowledge with regard to robotics. this often has to be obtained externally via system integrators, which, due to high costs, is one of the reasons for the inhibited use of robot systems. during the initial generation or extensive adaption of process flows with industrial robots, there is a constant risk of injuring persons and damaging the expensive hardware components. therefore, the programs have to be tested under strict safety precautions and usually in a very slow test mode. this makes the programming of new processes very complex and therefore time-and cost-intensive. the concept presented in this paper combines intuitive, gesture-based programming with simulation of robot movements. 
using a mixed reality solution, it is possible to create a simulation-based visualization of the robot and to project, program and test it in the working environment without disturbing the workflow. a virtual control panel enables the user to adjust, save and generate a sequence of specific robot poses and gripper actions and to simulate the developed program. an interface is provided to transfer the developed program to the robot controller and to execute it on the real robot. the paper is structured as follows. first, related work is reviewed in section 2, followed by a description of the system of the gesture-based control concept in section 3. the function of robot positioning and program creation is described in section 4. finally, the evaluation follows in section 5 and the conclusion in section 6. various interfaces exist to program robots, such as lead-through, offline or walk-through programming, programming by demonstration, vision-based programming or vocal commanding. the survey of villani et al. [4] provides a clear overview of existing interfaces for robot programming and of current research. besides the named interfaces, the programming of robots using a virtual or mixed reality solution aims to make robot programming intuitive, simple and accessible for non-experts. for this purpose, guhl et al. [5] developed a generic architecture for human-robot interaction based on virtual and mixed reality. in the marker-tracking-based approach presented by [6] and [7] , the user defines a collision-free volume and generates and selects control points while the system creates and visualizes a path through the defined points. others [8] , [9] , [10] and [11] use handheld devices in combination with gesture control and motion tracking. here, the robot can be controlled through gestures, pointing or via the device, while the path, workpieces or the robot itself are visualized on several displays.
other gesture and virtual or mixed reality based concepts are developed by cousins et al. [12] or tran et al. [13] . here, the robot's perspective or the robot in the working environment is presented to the user on a display (head-mounted or stationary) and the user controls the robot via gestures. further concepts using a mixed reality method enable an image of the workpiece to be imported into cad, from which the system automatically generates a path for robot movements [14] , or visualize the intended motion of the robot on the microsoft hololens, so that the user knows where the robot will move to next [15] . other methods combine pointing at objects on a screen with speech instructions to control the robot [16] . sha et al. [17] also use a virtual control panel in their programming method, but for adjusting parameters and not for controlling robots. another approach pursues programming based on cognition, spatial augmented reality and multimodal input and output [18] , where the user interacts with a touchable table. krupke et al. [19] developed a concept in which humans can control the robot by head orientation or by pointing, both combined with speech. the user is equipped with a head-mounted display presenting a virtual robot superimposed over the real robot. the user can determine pick and place positions by specifying objects to be picked by head orientation or by pointing. the virtual robot then executes the potential pick movement and, after the user confirms by voice command, the real robot performs the same movement. a similar concept based on gesture and speech is pursued by quintero et al. [20] , whose method offers two different types of programming. on the one hand, the user can determine a pick and place position by head orientation and speech commands. the system automatically generates a path which is displayed to the user, can be manipulated by the user and is simulated by a virtual robot.
on the other hand, it is possible to create a path on a surface by having the user generate waypoints. ostanin and klimchik [21] introduced a concept to generate collision-free paths. the user is provided with virtual goal points that can be placed in the mixed reality environment and between which a path is automatically generated. by means of a virtual menu, the user can set process parameters such as speed, velocity etc. additionally, it is possible to draw paths with a virtual device, and the movement along the path is simulated by a virtual robot. in contrast to the concept described in this paper, only a pick and place task can be realized with the concepts of [19] and [20] . a differentiation between movements to positions and gripper commands, the movement to several positions in succession and the generation of a program structure are not supported by these concepts. another distinction is that the user only has the possibility to show certain objects to the robot, but not to move the robot to specific positions. in [19] a preview of the movement to be executed is provided, but the entire program (pick and place movements) is not simulated. in contrast to [21] , with the concept presented in this paper it is possible to integrate certain gripper commands into the program. with the programming method of [21] , the user can determine positions, but exact axis angles or robot poses cannot be set. overall, the approach presented in this paper offers an intuitive, virtual user interface without the use of handheld devices (cf. [6] , [7] , [8] , [9] , [10] and [11] ) which allows the exact positions of the robot to be specified. compared to other methods, such as [12] , [13] , [14] , [15] or [16] , it is possible to create more complex program structures, which include the specification of robot poses and gripper positions, and to simulate the program in a mixed reality environment with a virtual robot.
in this section the components of the mixed reality robot programming system are introduced and described. the system consists of multiple real and virtual interactive elements, whereby the virtual components are projected directly into the field of view using a mixed reality (mr) approach. in contrast to the real environment, which consists entirely of real objects, and to virtual reality (vr), which consists entirely of virtual objects that overlay the real environment, in mr the real scene is preserved and only supplemented by virtual representations [22] . in order to interact in the different realities, head-mounted devices similar to glasses, screens or mobile devices are often used. figure 1 provides an overview of the system's components and their interaction. the system presented in this paper includes kuka's collaborative, lightweight robot lbr iiwa 14 r820 combined with an equally collaborative gripper from zimmer as real components, and a virtual robot model and a user interface as virtual components. the virtual components are presented on the microsoft hololens. for calculating and rendering the robot model and visualizing the user interface, the 3d and physics engine of the unity3d development framework is used. furthermore, the microsoft mixed reality toolkit (mrtk) is utilized for supplementary functions and components and for building additional mr interactable elements. for spatial positioning of the virtual robot, marker tracking is used, a technique supported by the vuforia framework. in this use case, the image target is attached to the real robot's base, such that in mr the virtual robot superimposes the real robot. the program code is written in c#. the robot is controlled and programmed via an intuitive and virtual user interface that can be manipulated using the so-called airtap gesture, a gesture provided by microsoft hololens.
to ensure that the virtual robot mirrors the motion sequences and poses of the real robot, the most exact representation of the real robot is employed. the virtual robot consists of a total of eight links, matching the base and the seven joints of the iiwa 14 r820: the base frame, five joint modules, the central hand and the media flange. the eight links are connected together as a kinematic chain. the model is provided as open source files from [23] and [24] and is integrated into the unity3d project. the individual links are created as gameobjects in a hierarchy, with the base frame defining the top level, and their motion is limited similarly to that of the real robot. the cad data of the deployed gripping system is also imported into unity3d and linked to the robot model. the canvas of the head-up display of the microsoft hololens is divided into two parts and rendered at a fixed distance in front of the user and on top of the scene. at the top left side of the screen the current joint angles (a1 to a7) are displayed and on the left side the current program is shown. this setting simplifies the interaction with the robot, as the information does not behave like other objects in the mr scene, but is attached to the head-up display (hud) and moves with the user's field of view. the user interface, which consists of multiple interactable components, is placed into the scene and is shown at the right side of the head-up display. at the beginning of the application the user interface is in "clear screen" mode, i.e. only the buttons "drag", "cartesian", "joints", "play" and "clear screen" and the joint angles at the top left of the screen are visible. for interaction with the robot, the user has to switch into a particular control mode by tapping the corresponding button.
the user interface provides three different control modes for positioning the virtual robot:
- drag mode, for rough positioning,
- cartesian mode, for cartesian positioning and
- joint mode, for the exact adjustment of each joint angle.
figure 2 shows the interactable components that are visible and therefore controllable in the respective control modes. depending on the selected mode, different interactable components become visible in the user interface, with which the virtual robot can be controlled. in addition to the control modes, the user interface offers further groups of interactable elements:
- motion buttons, with which e.g. the speed of the robot movement can be adjusted or the robot movement can be started or stopped,
- application buttons, to save or delete specific robot poses, for example,
- gripper buttons, to adjust the gripper and
- interface buttons, that enable communication with the real robot.
this section focuses on the description of the usage of the presented approach. in addition to the description of the individual control modes, the procedure for creating a program is also described. as outlined in section 3.2, the user interface consists of three different control modes and four groups of further interactable components. through this concept, the virtual robot can be moved efficiently to certain positions with different movement modes, the gripper can be adjusted, the motion can be controlled and a sequence of positions can be chained.
drag: by gripping the tool of the virtual robot with the airtap gesture, the user can "drag" the robot to the required position. additionally, it is possible to rotate the position of the robot using both hands. this mode is particularly suitable for moving the robot very quickly to a certain position.
cartesian: this mode is used for the subsequent positioning of the robot tool with millimeter precision.
the tool can be moved to the required position using the cartesian coordinates x, y, z and the euler angles a, b, c. the user interface provides a separate slider for each of these six degrees of freedom. the tool of the robot moves analogously to the respective slider button, which the user can set to the required value.
joints: this mode is an alternative to the cartesian method for exact positioning. the joints of the virtual robot can be adjusted precisely to the required angle, which is particularly suitable for e.g. bypassing an obstacle. there is a separate slider for each joint of the virtual robot. in order to set the individual joint angles, the respective slider button is dragged to the required value, which is also displayed above the slider button for better orientation. to program the robot, the user interface provides various application buttons, such as saving and removing robot poses from the chain and a display of the poses in the chain. the user directs the virtual robot to the desired position and confirms using the corresponding button. the pose of the robot is then saved in a list as the joint angles a1 to a7 and one gripper position, and is displayed on the left side of the screen. when running the programmed application, the robot moves to the saved robot poses and gripper positions according to the defined sequence. for better orientation, the robot's current target position changes its color from white to red. after testing the application, the list of robot poses can be sent to the controller of the real robot via a webservice. the real robot then moves analogously to the virtual robot to the corresponding robot poses and gripper positions. the purpose of the evaluation is to determine how the gesture-based control concept compares to other concepts regarding intuitiveness, comfort and complexity.
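the program structure described above (a list of poses, each holding the joint angles a1 to a7 and one gripper position, which is later sent to the robot controller) can be sketched as follows. this is a minimal illustration in python, not the authors' unity3d implementation; the class names, the gripper unit and the json layout of the payload are hypothetical, since the source does not specify the webservice format.

```python
import json
from dataclasses import dataclass, asdict
from typing import List


@dataclass
class RobotPose:
    """One program step: seven joint angles (A1..A7) and one gripper position."""
    joints: List[float]   # assumed to be in degrees
    gripper: float        # hypothetical unit, e.g. opening width

    def __post_init__(self):
        if len(self.joints) != 7:
            raise ValueError("expected exactly 7 joint angles (A1..A7)")


class PoseProgram:
    """Ordered chain of saved poses, mirroring the list shown in the HUD."""

    def __init__(self):
        self.poses: List[RobotPose] = []

    def save_pose(self, joints, gripper):
        # Called when the user confirms the current pose of the virtual robot.
        self.poses.append(RobotPose(list(joints), float(gripper)))

    def remove_last(self):
        # Remove the most recently saved pose, if any.
        if self.poses:
            self.poses.pop()

    def to_json(self) -> str:
        # Hypothetical payload layout for the transfer to the controller.
        return json.dumps({"poses": [asdict(p) for p in self.poses]})


program = PoseProgram()
program.save_pose([0, 30, 0, -60, 0, 90, 0], gripper=50.0)   # approach, gripper open
program.save_pose([0, 45, 0, -75, 0, 60, 0], gripper=10.0)   # grasp, gripper closed
payload = program.to_json()
```

keeping the gripper position inside each pose reflects the statement that poses and gripper positions are replayed in one defined sequence.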
for the evaluation, a study was conducted with seven test persons, who had to solve a pick and place task with five different operating concepts and subsequently evaluate them. the developed concept based on gestures and mr was evaluated against a lead-through procedure, programming with java, programming with a simplified programming concept and approaching and saving points with the kuka smartpad. the test persons had no experience with microsoft hololens and mr, no to moderate experience with robots and no to moderate programming skills. the questionnaire for the evaluation of physical assistive devices (quead) developed by schmidtler et al. [25] was used to evaluate and compare the five control concepts. the questionnaire is classified into five categories (perceived usefulness, perceived ease of use, emotions, attitude and comfort) and contains a total of 26 questions, rated on an ordinal scale from 1 (entirely disagree) to 7 (entirely agree). each test person first received a short introduction to the respective control concept, then conducted the pick and place task and immediately afterwards evaluated the respective control concept using quead. all test persons agreed that they would reuse the concept in future tasks (3 mostly agree, 4 entirely agree). in addition, the test persons considered the gesture-based concept to be intuitive (1 mostly agree, 4 entirely agree), easy to use (5 mostly agree, 2 entirely agree) and easy to learn (1 mostly agree, 6 entirely agree). two test persons mostly agree and four entirely agree that the gesture-based concept enabled them to solve the task efficiently, and four test persons mostly agree and two entirely agree that the concept enhances their work performance. all seven subjects were comfortable using the gesture-based concept (4 mostly agree, 2 entirely agree). overall, the concept presented in this paper was evaluated as more comfortable, more intuitive and easier to learn than the other control concepts evaluated.
in comparison to them, the new operating concept was perceived as the most useful and easiest to use. the test persons felt physically and psychologically most comfortable when using the concept and were most positive in total. in this paper, a new concept for programming robots based on gestures and mr and for simulating the created applications was presented. this concept forms the basis for a new, gesture-based programming method, with which it is possible to project a virtual robot model of the real robot into the real working environment by means of an mr solution, to program it and to simulate the workflow. using an intuitive virtual user interface, the robot can be controlled by three control modes and further groups of interactable elements, and via certain functions several robot positions can be chained as a program. by using this concept, test and simulation times can be reduced: on the one hand, the program can be tested directly in the mr environment without disturbing the workflow. on the other hand, the robot model is rendered into the real working environment via the mr approach, thus eliminating the need for time-consuming and costly modeling of the environment. the results of the user study indicate that the control concept is easy to learn, intuitive and easy to use. this facilitates the introduction of robots, especially in smes, since no expert knowledge is required for programming, programs can be created rapidly and intuitively and processes can be adapted flexibly. in addition, the user study showed that tasks can be solved efficiently and the concept is perceived as performance-enhancing. potential directions of improvement are as follows. one is to implement various movement types, such as point-to-point, linear and circular movements, in the concept. this makes the robot motion more flexible and efficient, since positions can be approached in different ways depending on the situation.
another improvement is to extend the concept with collaborative functions of the robot, such as force sensitivity or the ability to conduct search movements. in this way, the functions that make collaborative robots special can be integrated into the program structure. a further approach for improvement is to conduct a larger-scale study. in 2019 the world's commercial fleet consisted of 95,402 ships with a total capacity of 1,976,491 thousand dwt (a plus of 2.6 % in carrying capacity compared to the previous year) [1] . according to the international chamber of shipping, the shipping industry is responsible for about 90 % of all trade [2] . in order to ensure the safe voyage of all participants in international sea travel, the need for monitoring is steadily increasing. while more and more data regarding the sea traffic is collected by using cheaper and more powerful sensors, the data still needs to be processed and understood by human operators. in order to support the operators, reliable anomaly detection and situation recognition systems are needed. one cornerstone for this development is a reliable automatic classification of vessels at sea. for example, by classifying the behaviour of non-cooperative vessels in ecologically protected areas, the identification of illegal, unreported and unregulated (iuu) fishing activities is possible. iuu fishing is a major problem in some areas of the world, e. g., »in the wider-caribbean, western central atlantic region, iuu fishing compares to 20-30 percent of the legitimate landings of fish« [3] , resulting in an estimated value between usd 700 and 930 million per year. one approach for gathering information on the sea traffic is based on the automatic identification system (ais). it was introduced as a collision avoidance system. as each vessel is broadcasting its information on an open channel, the data is often used for other purposes, like training and validating machine learning models.
ais provides dynamic data like position, speed and course over ground, static data like mmsi, shiptype and length, and voyage-related data like draught, type of cargo, and destination of a vessel. the system is self-reporting, it has no strong verification of transmission, and many of the fields in each message are set by hand. therefore, the data cannot be fully trusted. as harati-mokhtari et al. [4] stated, half of all ais messages contain some erroneous data. since the dataset for this work is collected from the ais stream provided by aishub, it is likely to contain some amount of false data. most of the errors will have no further consequences, as minor coordinate inaccuracies or wrong vessel dimensions are irrelevant, but some false vessel information can have an impact on the model performance. classification of maritime trajectories and the detection of anomalies is a challenging problem, e.g., since classifications should be based on short observation periods, only limited information is available for vessel identification. riveiro et al. [5] give a survey on anomaly detection at sea, where shiptype classification is a subtype. jiang et al. [6] present a novel trajectorynet capable of point-based classification. their approach is based on embedding gps coordinates into a new feature space. the classification itself is accomplished using a long short-term memory (lstm) network. further, jiang et al. [7] propose a partition-wise lstm (plstm) for point-based binary classification of ais trajectories into fishing or non-fishing activity. they evaluated their model against other recurrent neural networks and achieved a significantly better result than common recurrent network architectures based on lstm or gated recurrent units. a recurrent neural network is used by nguyen et al. in [8] to reconstruct incomplete trajectories, detect anomalies in the traffic data and identify the real type of a vessel.
they embed the position data to generate a new representation as input for the neural network. besides these neural network based approaches, other methods are also used for situation recognition tasks in the maritime domain. expert-knowledge based systems in particular are used frequently, as illegal or at least suspicious behaviour is not recorded as often as desirable for deep learning approaches. conditional random fields are used by hu et al. [9] for the identification of fishing activities from ais data. the data has been labelled by an expert and contains only longliner fishing boats. saini et al. [10] propose a hidden markov model (hmm) based approach to the classification of trajectories. they combine global-hmm and segmental-hmm using a genetic algorithm. in addition, they tested the robustness of the framework by adding gaussian noise. in [11] fischer et al. introduce a holistic approach for situation analysis based on situation-specific dynamic bayesian networks (ssdbn). this includes the modelling of the ssdbn as well as the presentation to end-users. for a bayesian network, the parametrisation of the conditional probability tables is crucial. fischer introduces an algorithm for choosing these parameters in a more transparent way. important for the functionality are the ability of the network to model the domain knowledge and the handling of noisy input data. for the evaluation, simulated and real data are used to assess the detection quality of the ssdbn. based on dbns, anneken et al. [12] implemented an algorithm for detecting illegal diving activities in the north sea. as explained by de rosa et al. [13] , an additional layer for modelling the reliability of different sensor sources is added to the dbn. in order to use the ais data, preprocessing is necessary. this includes cleaning wrong data, filtering data, segmentation, and calculation of additional features. the whole workflow is depicted in figure 1 .
the input in the form of ais data and different maps is shown as blue boxes. all relevant mmsis are extracted from the ais data. for each mmsi, the position data is used for further processing. segmentation into separate trajectories is the next step (yellow). the resulting trajectories are filtered (orange). based on the remaining trajectories, geographic (green) and trajectory (purple) based features are derived. for each of the resulting sequences, the data is normalized (red), which results in the final dataset. only the 6 major shiptypes in the dataset are used for the evaluation. these are "cargo", "tanker", "fishing", "passenger", "pleasure craft" and "tug". due to their similar behaviour, "cargo" and "tanker" are combined into a single class "cargo-tanker".
figure 1 : visualization of all preprocessing steps. input in blue, segmentation in yellow, filtering in orange, geographic features in green, trajectory features in purple and normalization in red.
four different trajectory features are used:
- time difference
- speed over ground
- course over ground
- trajectory transformation
as the incoming data from ais is not necessarily uniformly distributed in time, there is a need to create a feature representing the time dimension. therefore, the time difference between two samples is introduced. as the speed and course over ground are directly accessible through the ais data, the network is fed directly with these features. the vessel's speed is a numeric value in 0.1-knot resolution in the interval [0; 1022] and the course is the negative angle in degrees relative to true north and therefore in the interval [0; 359]. the position is transformed in two ways. the first transformation, further called "relative-to-first", shifts the trajectory to start at the origin. the second transformation, henceforth called "rotate-to-zero", rotates the trajectory in such a way that the end point is on the x-axis.
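the two position transformations can be illustrated with a short sketch. it assumes that positions are treated as planar (x, y) pairs and that "rotate-to-zero" is applied after shifting the trajectory to the origin and places the end point on the positive x-axis; the source does not state these details, and the function names are illustrative only.

```python
import math


def relative_to_first(points):
    """Shift a trajectory so that its first point lies at the origin."""
    x0, y0 = points[0]
    return [(x - x0, y - y0) for x, y in points]


def rotate_to_zero(points):
    """Shift to the origin, then rotate so the end point lies on the x-axis.

    Assumption: the rotation is about the (shifted) start point and maps
    the end point onto the positive x-axis.
    """
    shifted = relative_to_first(points)
    x_end, y_end = shifted[-1]
    angle = math.atan2(y_end, x_end)            # direction of the end point
    cos_a, sin_a = math.cos(-angle), math.sin(-angle)
    return [(x * cos_a - y * sin_a, x * sin_a + y * cos_a) for x, y in shifted]


track = [(1.0, 1.0), (2.0, 1.5), (3.0, 3.0)]
shifted = relative_to_first(track)
rotated = rotate_to_zero(track)
```

after both steps the start point is at the origin and the end point has a y-coordinate of zero, so trajectories that differ only by position and heading map to the same representation.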
in addition to the trajectory based features, two geographic features are derived by using coastline maps and a map of large harbours. the coastline map consists of a list of line strips. in order to reduce complexity, the edge points are used to calculate the "distance-to-coast". further, only a lower resolution of the shapefile itself is used. in figure 2 , the resolutions "high" and "low" for some fjords in norway are shown. since the geoindex' cell size is set to 40 km, a radius of 20 km can be queried. the world's 140 major harbours based on the world port index are used to calculate the "distance-to-closest-harbor". as fishing vessels are expected to stay near a certain harbour, this feature should help the network to identify some shiptypes. the geoindex' cell size is set to 5,000 km for this feature, resulting in a maximum radius of 2,500 km. the data is split into separate trajectories by using gaps in either time or space, or the sequence length. as real ais data is used, packet loss during the transmission is common. this problem is tackled by splitting the data if the time between two successive samples is larger than 2 hours, or if the distance between two successive samples is large. regarding the distance, even though the great circle distance is more accurate, the euclidean distance is used. for simplification the distance value is squared and $10^{-4}$ is used as a threshold. depending on latitude, this corresponds to a value of about 1 km at the equator and only about 600 m at 60° n. since the calculation includes an approximation, a relatively high threshold is chosen. as the neural network depends on a fixed input size, the data is split into fitting chunks by cutting and padding with these rules:
- longer sequences are split into chunks according to the desired sequence length.
- any leftover sequence shorter than 80 % of the desired length is discarded.
- the others are padded with zeroes.
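the segmentation and chunking rules above can be sketched as follows. this is an illustrative python sketch under stated assumptions, not the authors' code: samples are assumed to be (time in seconds, longitude, latitude) tuples sorted by time, the squared euclidean distance is computed directly on the degree values, and the function and parameter names are hypothetical.

```python
def split_on_gaps(samples, max_dt=2 * 3600, max_sq_dist=1e-4):
    """Split a time-sorted list of (t, lon, lat) samples at temporal or spatial gaps."""
    segments, current = [], []
    for sample in samples:
        if current:
            t_prev, lon_prev, lat_prev = current[-1]
            dt = sample[0] - t_prev
            sq_dist = (sample[1] - lon_prev) ** 2 + (sample[2] - lat_prev) ** 2
            if dt > max_dt or sq_dist > max_sq_dist:
                segments.append(current)      # close the segment at the gap
                current = []
        current.append(sample)
    if current:
        segments.append(current)
    return segments


def chunk_and_pad(segment, length, min_fraction=0.8, pad=(0.0, 0.0, 0.0)):
    """Cut a segment into fixed-size chunks; pad or discard the remainder."""
    chunks = []
    for start in range(0, len(segment), length):
        chunk = list(segment[start:start + length])
        if len(chunk) == length:
            chunks.append(chunk)
        elif len(chunk) >= min_fraction * length:
            # Long enough to keep: fill up with zeroes.
            chunks.append(chunk + [pad] * (length - len(chunk)))
        # Otherwise the leftover is shorter than 80 % and is discarded.
    return chunks


samples = [(0, 0.0, 0.0), (10, 0.0, 0.0), (7211, 0.0, 0.0), (7221, 0.02, 0.0)]
segments = split_on_gaps(samples)
kept = chunk_and_pad([(i, 0.0, 0.0) for i in range(19)], length=10)
dropped = chunk_and_pad([(i, 0.0, 0.0) for i in range(17)], length=10)
```

in the example, the third sample follows a gap of more than 2 hours and the fourth jumps by 0.02 degrees, so three segments result; a leftover of 9 samples is padded to 10, while a leftover of 7 samples is dropped.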
this results in segmented trajectories of similar but not necessarily the same duration. as this work is about the vessel behaviour at sea, stationary vessels (anchored and moored vessels) and vessels traversing rivers are removed from the segmented trajectories. the stationary vessels are identified by using a measure of movement in a trajectory, $\alpha_{stationary}$, where $n$ is the sequence length and $p_i$ are its data points. a trajectory is removed if $\alpha_{stationary}$ is below a certain threshold. a shapefile containing the major and most minor rivers is used in order to remove the vessels not on the high seas. a sequence with more than 50 % of its points on a river is removed from the dataset. in order to speed up the training process, the data is normalized to the interval [0; 1] by applying $x' = (x - x_{min}) / (x_{max} - x_{min})$. here, for the positional features a differentiation between "global normalization" and "local normalization" is taken into account. the "global normalization" scales the input data using the maximum $x_{max}$ and minimum $x_{min}$ calculated over the entire data set, while "local normalization" estimates the maximum $x_{max}$ and minimum $x_{min}$ only over the trajectory itself. as the data is processed in parallel, the parameters for the "global normalization" are calculated only for each chunk of data. this results in slight deviations in the minimum and maximum, but for large batches this should be negligible. all other additional features are normalized as well. for the geographic features "distance-to-coast" and "distance-to-closest-harbor" the maximum distance that can be queried depending on grid size is used as $x_{max}$ and 0 is used as the lower bound $x_{min}$. the time difference feature is scaled using a minimum $x_{min}$ of 0 and the threshold for the temporal gap as $x_{max}$, since this is the maximum value for this feature. speed and course are normalized using 0 and their respective maximum values. for the dataset, a period between 2018-07-24 and 2018-11-15 is used.
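the normalization described above can be illustrated with a small sketch. it assumes standard min-max scaling to [0; 1], which matches the named bounds $x_{min}$ and $x_{max}$ but is not spelled out in the extracted text; the function names are illustrative, and the fixed bounds for speed and time difference are taken from the values stated in the text (maximum speed value 1022, temporal gap threshold of 2 hours).

```python
def min_max(values, x_min, x_max):
    """Scale values to [0, 1] using fixed bounds (assumed min-max scaling)."""
    span = x_max - x_min
    if span == 0:
        return [0.0 for _ in values]          # degenerate case: constant feature
    return [(v - x_min) / span for v in values]


def normalize_local(trajectory):
    """'Local normalization': bounds are taken from the trajectory itself."""
    return min_max(trajectory, min(trajectory), max(trajectory))


def normalize_global(trajectories):
    """'Global normalization': bounds are taken over all given trajectories."""
    flat = [v for trajectory in trajectories for v in trajectory]
    lo, hi = min(flat), max(flat)
    return [min_max(trajectory, lo, hi) for trajectory in trajectories]


local = normalize_local([2.0, 4.0, 6.0])
global_ = normalize_global([[0.0, 5.0], [10.0]])
speed = min_max([0, 511, 1022], 0, 1022)                 # speed over ground
time_diff = min_max([0, 3600, 7200], 0, 2 * 3600)        # seconds between samples
```

local normalization removes the absolute position and extent of each trajectory, whereas global normalization preserves where a trajectory lies relative to all others.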
altogether 209,536 unique vessels with 2,144,317,101 raw data points are included. using this foundation and the previously described methods, six datasets are derived. all datasets use the same spatial and temporal thresholds. in addition, the filter thresholds are identical as well. the datasets differ in their sequence length and in whether only the "relative-to-first" transformation or additionally the "rotate-to-zero" transformation is applied. either 360, 1,080, or 1,800 points per sequence are used, resulting in approximately 1 h, 3 h, or 5 h long sequences. in figure 3, the distribution of shiptypes in the datasets after applying the different filters is shown.
for the shiptype classification, neural networks are chosen. the different networks are implemented using keras [14] with tensorflow as backend [15]. fawaz et al. [16] have shown that, despite its initial design for image data, a residual neural network (resnet) can perform quite well on time-series classification. thus, the resnet is used as the foundation for the evaluated architectures. the main difference to other neural network architectures is the inclusion of "skip connections". these allow for deeper networks by circumventing the vanishing gradient problem during the training phase. based on the main idea of a resnet, several architectures are designed and evaluated for this work. some information regarding their structure is given in table 1. further, the individual architectures are depicted in figures 4a to 4f. the main idea behind these architectures is to analyse the impact of the depth of the networks. furthermore, as the features themselves are not necessarily logically linked with each other, the hope is to capture the behaviour better by splitting up the network path for each feature.
to verify the necessity of cnns, two multilayer perceptron (mlp) based networks are tested: one with two hidden layers and one with four hidden layers, all with 64 neurons and fully connected with their adjacent layers. the majority of the parameters of the two networks are bound in the first layer. they are necessary to map the large number of input neurons, e. g., 360 * 9 = 3,240 input neurons for the 360 samples dataset, to the first hidden layer.
each of the datasets is split into three parts: 64 % for the training set, 16 % for the validation set, and 20 % for the test set. for solving or at least mitigating the problem of overfitting, regularization techniques (input noise, batch normalization, and early stopping) are used. small noise on the input in the training phase is used to support the generalization of the network. for each feature, a normal distribution with a standard deviation of 0.01 and a mean of 0 is used as noise. furthermore, batch normalization is implemented. this means that a batch normalization layer is added before each relu layer, allowing higher learning rates. therefore, the initial learning rate is doubled. additionally, the learning rate is halved if the validation error does not improve after ten training epochs, improving the training behaviour during oscillation on a plateau. in order to prevent overfitting, an early stopping criterion is introduced. the training is interrupted if the validation error has not decreased after 15 training epochs. to counter the dataset imbalance, class weights were considered but ultimately did not lead to better classification results and were discarded.
the different neural network architectures are evaluated on an amd ryzen threadripper system. for each architecture, batch normalization and the input noise are tested. the initial learning rate is set to 0.001 without batch normalization and 0.002 with batch normalization activated. the maximum number of epochs is set to 600.
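the interplay of the two patience values (halving the learning rate after ten epochs without improvement, stopping after 15) can be illustrated with a small framework-free simulation. this is only a sketch: the text does not state how the schedule was implemented, and the function name, the counters and the exact reset behaviour are assumptions made for this example.

```python
def simulate_schedule(val_errors, initial_lr=0.001, lr_patience=10,
                      stop_patience=15, factor=0.5):
    """Replay a list of per-epoch validation errors.

    Returns (last_epoch, learning_rates), where learning_rates[i] is the
    learning rate that was active during epoch i + 1.
    """
    lr = initial_lr
    best = float("inf")
    since_best = 0       # epochs without improvement, used for early stopping
    since_lr_change = 0  # epochs without improvement since the last halving
    learning_rates = []
    for epoch, err in enumerate(val_errors, start=1):
        learning_rates.append(lr)
        if err < best:
            best = err
            since_best = 0
            since_lr_change = 0
        else:
            since_best += 1
            since_lr_change += 1
        if since_best >= stop_patience:
            return epoch, learning_rates  # early stopping
        if since_lr_change >= lr_patience:
            lr *= factor                  # halve the learning rate
            since_lr_change = 0
    return len(val_errors), learning_rates


if __name__ == "__main__":
    # a validation error that never improves after the first epoch
    last_epoch, rates = simulate_schedule([1.0] * 30)
    print(last_epoch, rates[10], rates[11])
```

for a validation error that stays flat after the first epoch, the learning rate is halved once after epoch 11 and the training stops at epoch 16, well before the maximum of 600 epochs.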
the batch sizes are set to 64, 128, and 256 for 360, 1,080, and 1,800 samples per sequence respectively. in total, 144 different setups are evaluated. furthermore, 4 additional networks are trained on the 360 samples dataset with "relative-to-first" transformation: two mlps to verify the need for deep neural networks, and the shallow and deep resnet trained without geographic features to measure the impact of these features.
[figure 5, panels (a) to (f), e.g. (f) "rtz" with 1,800 samples: the first row shows the results for the "relative-to-first" (rtf) transformation, the second for the "rotate-to-zero" (rtz) transformation.]
the results for the six different architectures are depicted in figure 5. for 360 samples, the shallow resnet and the deep resnet outperformed the other networks. in case of the "relative-to-first" transformation (see figure 5a), the shallow resnet achieved an f1-score of 0.920, while the deep resnet achieved 0.919. for the "rotate-to-zero" transformation (see figure 5d), the deep resnet achieved 0.918 and the shallow resnet 0.913. in all these cases, the regularization methods led to no improvements. the "relative-to-first" transformation performs slightly better overall. for the datasets with 360 samples per sequence, the standard resnet variants achieve higher f1-scores compared to the split resnet versions, but this difference is relatively small. as expected, the tiny resnet is not large and deep enough to classify the data on a similar level.
for the "relative-to-first" transformation and trajectories based on 1,080 samples (see figure 5b), the split resnet and the total split resnet achieve the best results. the former performed well with an f1-score of 0.913, while the latter is slightly worse with 0.912. in both cases, the regularization again did not improve the result. for the "rotate-to-zero" transformation (see figure 5e), the shallow resnet achieved an f1-score of 0.907 without any regularization and 0.905 with only the noise added to the input.
for the largest sequence length of 1,800 samples, the split based networks slightly outperform the standard resnets. for the "relative-to-first" transformation (see figure 5c ), the split resnet achieved an f 1 -score of 0.911, while for the "rotate-to-zero" transformation (see figure 5f ) the total split resnet reached an f 1 -score of 0.898. again without noise and batch normalization. to verify, that the implementation of cnns is actually necessary, additional tests with mlps were carried out. two different mlps are trained on the 360 samples dataset with "relative-to-first" transformation since this dataset leads to best results for the resnet architectures. both networks lead to no results as their output always is the "cargo-tanker" class regardless of the actual input. the only thing the models are able to learn is, that the "cargo-tanker" class is the most probable class based on the uneven distribution of classes. an mlp is not the right model for this kind of data and performs badly. the large dimensionality of even the small sequence length makes the use of the fully connected networks impracticable. probably, further hand-crafted feature extraction is needed to achieve better results. to measure the impact the feature "distance to coast" and "distance to closest harbor" have on the overall performance, a shallow resnet and a deep resnet are trained on the 360 sample length data set with the "relative-to-first" transformation excluding these features. the trained networks have f 1 -scores of 0.888 and 0.871 respectively. this means, by including this features, we are able to increase the performance by 3.5 %. the "relative-to-first" transformation compared to the "rotate-to-zero" transformation yields the better results. especially, this is easily visible for the longest sequence length. a possible explanation can be seen in the "stationary" filter. 
this filter removes more trajectories for the "relative-to-first" transformation than for the additional "rotate-to-zero" transformation. a problem might be, that the end point is used for rotating the trajectory. this adds a certain randomness to the data, especially for round trip sequences. in some cases, the stretched deep resnet is not able to learn the classes. it is possible, that there is a problem with the structure of the network or the large number of parameters. further, there seems to be a problem with the batch normalization, as seen in figures 5c and 5e . the overall worse performance of the "rotate-to-zero" transformation could be because of the difference in the "stationary" filter. in the "rotate-to-zero" dataset, fewer sequences are filtered out. the filter leads to more "fishing" and "pleasure craft" sequences in relation to each other as described in section 3.6. this could also explain the difference in class prediction distribution since the network is punished more for mistakes in these classes because more classes are overall from this type. for the evaluation, the expectation based on previous work by other authors was, that the shorter sequence length should perform worse compared to the longer ones. instead the shorter sequences outperform the longer ones. the main advantages of the shorter sequences are essentially the larger number of sequences in the dataset. for example the 360 samples dataset with "relative-to-first" transformation contains about 2.2 million sequences, while the corresponding 1,800 sample dataset contains only approximately 250,000 sequences. in addition, the more frequent segmentation can yield more easily classifiable sequences: the behaviour of a fishing vessel in general contains different characteristics, like travelling from the harbour to the fishing ground, the fishing itself, and the way back. the travelling parts are similar to other vessels and only the fishing part is unique. 
a more aggressive segmentation will yield more fishing sequences that will be easier to classify regardless of observation length. the shallow resnet has the overall best results by using the 360 samples dataset and the "relative-to-first" transformation. the results for this setup are shown in the confusion matrix in figure 6. as expected, the tiny resnet is not able to compete with the others. the other standard resnet architectures performed well, especially on shorter sequences. the split architectures are able to perform better on datasets with longer sequences, with the shallow resnet achieving similar performance. comparing the number of parameters, all three architectures have about 400,000 parameters, the shallow resnet about 50,000 more and the total split resnet about 40,000 fewer. the deep resnet performs well only on the dataset with more sequences. this correlates with the need for more information due to the larger parameter count. due to the reduced flexibility, the split architecture can be interpreted as a "head start". this means that the network already has information regarding the structure of the data, which in turn does not need to be extracted from the data. this can result in a better performance for smaller datasets. all in all, the best results are always achieved by omitting the suggested regularization methods. nevertheless, the batch normalization had an effect on the learning rate and the number of training epochs needed: the learning rate is higher and fewer epochs are needed before convergence.
based on the resnet, several architectures are evaluated for the task of shiptype classification. from the initial dataset based on ais data with over 2.2 billion datapoints, six datasets with different trajectory lengths and preprocessing steps are derived. in addition to the kinematic information included in the dataset, geographical features are generated.
each network architecture is evaluated with each of the datasets with and without batch normalization and input noise. overall, the best result is an f1-score of 0.920 with the shallow resnet on the 360 samples per sequence dataset and a shift of the trajectories to the origin. additionally, we are able to show that the inclusion of geographic features yields an improvement in classification quality. the achieved results are quite promising, but there is still some room for improvement. first of all, the sequence length used for this work might still be too long for real world use cases. therefore, shorter sequences should be tried. additionally, interpolation for creating data with the same time delta between two samples or some kind of embedding or alignment layer might yield better results. as there are many sources for additional domain related information, further research in the integration of these sources is necessary.

comparison of cnn for the detection of small objects based on the example of components on an assembly

many tasks which only a few years ago had to be performed by humans can now be performed by robots or will be performed by robots in the near future. nevertheless, there are some tasks in assembly processes which cannot be automated in the next few years. this applies especially to workpieces that are only produced in very small series or tasks that require a lot of tact and sensitivity, such as inserting small screws into a thread or assembling small components. in conversations with companies we have found out that a big problem for the workers is learning new production processes. this is currently done with instructions and by supervisors, which requires a lot of time. this effort can be significantly reduced by modern systems which accompany workers in the learning process.
such intelligent systems require not only instructions that describe the target status and the individual work steps that lead to it, but also information on the current status at the assembly workstation. one way to obtain this information is to install cameras above the assembly workstation and use image recognition to calculate where an object is located at any given moment. the individual parts, often very small compared to the work surface, must be reliably detected. we have trained and tested several deep neural networks for this purpose. we have developed an assembly workstation where work instructions can be projected directly onto the work surface using a projector. at a distance, 21 containers for components are arranged in three rows, slightly offset to the rear, one above the other. these containers can also be illuminated by the projector. thus a very flexible pick-by-light system can be implemented. in order for the system behind it to automatically switch to the next work step and, in the event of errors, to point them out and provide support in correcting them, it is helpful to be able to identify the individual components on the work surface. we use a realsense depth camera for this purpose, from which, however, we are currently only using the colour image. the camera is mounted in a central position at a height of about two meters above the work surface. thus the camera image includes the complete working surface as well as the 21 containers and a small area next to the working surface. the objects to be detected are components of a kit for the construction of various toy cars. the kit contains 25 components in total. some of the components vary considerably from each other, but some others are very similar to each other. since it is the same with real components of a production, the choice of the kit seemed appropriate for the purposes of this project. 
object detection, one of the most fundamental and challenging problems in computer vision, seeks to locate object instances from a large number of predefined categories in natural images. until the early 2000s, similar approaches were mostly used in object detection. keypoints in one or more images of a category were searched for automatically. at these points a feature vector was generated. during the recognition process, keypoints in the image were again searched, the corresponding feature vectors were generated and compared with the stored feature vectors. above a certain threshold, an object was assigned to the category. one of the first approaches based on machine learning was published by viola and jones in 2001 [1]. they still selected features; in their case, the features were selected by using a haar basis function [2] and then a variant of adaboost [3]. starting in 2012 with the publication of alexnet by krizhevsky et al. [4], deep neural networks became more and more the standard in object detection tasks. they used a convolutional neural network which has 60 million parameters in five convolutional layers, some of which are followed by max-pooling layers, three fully-connected layers and a final softmax layer. they won the imagenet lsvrc-2012 competition with an error rate almost half as high as the second best. inception-v2 is mostly identical to inception-v3 by szegedy et al. [5]. it is based on inception-v1 [6]. all inception architectures are composed of dense modules. instead of stacking convolutional layers, they stack modules or blocks, within which are convolutional layers. for inception-v2 they redesigned the architecture of inception-v1 to avoid representational bottlenecks and to achieve more efficient computations by using factorisation methods. they are the first to use batch normalisation in object detection tasks. in previous architectures, the most significant difference has been the increasing number of layers.
but with the network depth increasing, accuracy gets saturated and then degrades rapidly. he et al. [7] addressed this problem with resnet by using skip connections while building deeper models. in 2017, howard et al. presented the mobilenet architecture [8]. mobilenet was developed for efficient work on mobile devices with less computational power and is very fast. they used depthwise convolutional layers for an extremely efficient network architecture. one year later, sandler et al. [9] published a second version of mobilenet. besides some minor adjustments, a bottleneck was added in the convolutional layers, which further reduced the dimensions of the convolutional layers. thus a further increase in speed could be achieved.
in addition to the neural network architectures presented so far, there are also different methods to detect in which area of the image the object is located. the two most frequently used are described briefly below. to bypass the problem of selecting a huge number of regions, girshick et al. [10] proposed a method where they use selective search on the features of the base cnn to extract just 2000 region proposals from the image. liu et al. [11] introduced the single shot multibox detector (ssd). they added some extra feature layers behind the base model for the detection of default boxes in different scales and aspect ratios. at prediction time, the network generates scores for the presence of each object in each default box. then it produces adjustments to the box to better match the object shape.
there is just one publication over the past few years which gives a survey of generic object detection methods: liu et al. [12] compared 18 common object detection architectures for generic object detection. there are many other comparisons of specific object detection tasks, for example pedestrian detection [13], face detection [14] and text detection [15].
the project is based on the methodology of supervised learning.
thereby the models are trained using a training dataset consisting of many samples. each sample within the training dataset is tagged with a so called label (also called annotation). the label provides the model with information about the desired output for this sample. during training, the output generated by the model is then compared to the desired output (labels) and the error is determined. this error on the one hand gives information about the current performance of the model and, on the other hand it is used for further mathematical computations to adjust the model's parameters, so that the model's performance improves. for the training of neural networks in the field of computer vision the following rule of thumb applies: the larger and more diverse the training dataset, the higher the accuracy that can be achieved by the trained model. if you have too little data and/or run it through the model too often, this can lead to so-called overfitting. overfitting means that instead of learning an abstract concept that can be applied to a variety of data, the model basically memorizes the individual samples [16, 17] . if you train neural networks for the purpose of this project from scratch, it is quite possible that you will need more than 100,000 different images -depending on the accuracy that the model should finally be able to achieve. however, the methodology of the so-called transfer learning offers the possibility to transfer results of neural networks, which have already been trained for a specific task, completely or partially to a new task and thus to save time and resources [18] . for this reason, we also applied transfer learning methods within the project. the training dataset was created manually: a tripod, a mobile phone camera (10 megapixel format 3104 x 3104) and an apeman action cam (20 megapixel format 5120x3840) were used to take 97 images for each of the 25 classes. 
this corresponds to 2,425 images in total (actually 100 images were taken per class, but only 97 were suitable for use as training data). all images were documented and sorted into close-ups (distance between camera and object less than or equal to 30 cm) and standards (distance between camera and object more than 30 cm). this procedure should ensure the traceability and controllability of the data set. in total, the training data set contains approx. 25% close-ups and approx. 75% standards, each taken on different backgrounds and under different lighting conditions (see fig. 2). the labelimg tool was used for the labelling of the data. with the help of this tool, bounding boxes, whose coordinates are stored in either yolo or pascal voc format, can be marked in the images [19]. for the training of the neural networks, the created dataset was finally divided into:
- training data (90% of all labelled images): images that are used for the training of the models and that pass through the models multiple times during the training.
- test data (10% of all labelled images): images that are used for later testing or validation of the training results. in contrast to the images used as training data, the model is presented these images for the first time after training.
the goal of this approach, which is common in deep learning, is to see how well the neural network, after the training, recognizes objects in images that it has never seen before. thus it is possible to make a statement about the accuracy and to meet any further training needs that may arise.
the training of deep neural networks is very demanding on resources due to the large number of computations. therefore, it is essential to use hardware with adequate performance. since the computations that run for each node in the graph can be highly parallelized, the use of a powerful graphical processing unit (gpu) is particularly suitable.
a gpu with its several hundred computing cores has a clear advantage over a current cpu with four to eight cores when processing parallel computing tasks [20]. these are the outline parameters of the project computer in use:
- operating system (os): ubuntu 18.04.2 lts
- gpu: geforce gtx 1080 ti (11 gb gddr5x memory, data transfer speed 11 gbit/s)
for the intended comparison, the tensorflow object detection api was used. the tensorflow object detection api is an open source framework based on tensorflow, which among other things provides implementations of pre-trained object detection models for transfer learning [21, 22]. the api was chosen because of its good and easy to understand documentation and its variety of pre-trained object detection models. for the comparison, the following models were selected:
- ssd mobilenet v1 coco [11, 23, 24]
- ssd mobilenet v2 coco [11, 25, 26]
- faster rcnn inception v2 coco [27] [28] [29]
- rfcn resnet101 coco [30] [31] [32]
to ensure comparability of the networks, all of the selected pre-trained models were trained on the coco dataset [33]. fundamentally, the algorithms based on cnn models can be grouped into two main categories: region-based algorithms and one-stage algorithms [34]. while both ssd models can be categorized as one-stage algorithms, faster r-cnn and r-fcn fall into the category of region-based algorithms. one-stage algorithms predict both the fields (or bounding boxes) and the class of the contained objects simultaneously. they are generally considered extremely fast, but are known for their trade-off between accuracy and real-time processing speed. region-based algorithms consist of two parts: a special region proposal method and a classifier.
instead of splitting the image into many small areas and then working with a large number of areas like conventional cnn would proceed, the region-based algorithm first proposes a set of regions of interest (roi) in the image and checks whether one of these fields contains an object. if an object is contained, the classifier classifies it [34] . region-based algorithms are generally considered as accurate, but also as slow. since, according to our requirements, both accuracy and speed are important, it seemed reasonable to compare models of both categories. besides the collection of pre-trained models for object detection, the tensorflow object detection api also offers corresponding configuration files for the training of each model. since these configurations have already shown to be successful, these files were used as a basis for own configurations. the configuration files contain information about the training parameters, such as the number of steps to be performed during training, the image resizer to be used, the number of samples processed as a batch before the model parameters are updated (batch size) and the number of classes which can be detected. to make the study of the different networks as comparable as possible, the training of all networks was configured in such a way that the number of images fed into the network simultaneously (batch size) was kept as small as possible. since the configurations of some models did not allow batch sizes larger than one, but other models did not allow batch sizes smaller than two, no general value for all models could be defined for this parameter. during training, each of the training images should be passed through the net 200 times (corresponds to 200 epochs). the number of steps was therefore adjusted accordingly, depending on the batch size. 
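the relation between epochs, batch size and the number of training steps used above can be written down as a short calculation. this is a sketch only: the function name is hypothetical, the count of 2,182 training images is an assumed rounding of 90 % of the 2,425 labelled images, and the batch sizes 1 and 2 are illustrative values rather than the ones from table 1.

```python
def training_steps(n_train_images, batch_size, epochs=200):
    """Number of optimizer steps needed so that every image is seen
    `epochs` times, rounded up to a whole step."""
    total_samples = epochs * n_train_images
    return (total_samples + batch_size - 1) // batch_size  # integer ceiling


if __name__ == "__main__":
    n_train = 2182  # assumption: about 90 % of 2,425 labelled images
    for batch_size in (1, 2):
        print(batch_size, training_steps(n_train, batch_size))
```

doubling the batch size halves the number of steps for the same 200 epochs, which is why the step count has to be set per model when the batch size cannot be kept identical.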
if a fixed shape resizer was used in the base configurations, two different dimensions of resizing (default: 300x300 pixels and custom: 512x512 pixels) were selected for the training. table 1 gives an overview of the training configurations used for the training of the different models. in this section we will first look at the training, before we then focus on evaluating the quality of the results and the speed of the selected convolutional neural networks. when evaluating the training results, we first considered the duration that the neural networks require for 200 epochs (see fig. 3 ). it becomes clear that especially the two region based object detectors (faster r-cnn inception v2 and rfcn resnet101) took significantly longer than the single shot object detectors (ssd mobilenet v1 and ssd mobilenet v2). in addition, the single shot object detectors clearly show that the size of the input data also has a decisive effect on the training duration: while ssd mobilenet v2 with an input data size of 300x300 pixels took the shortest time for the training with 9 hours 41 minutes and 47 seconds, the same neural network with an input data size of 512x512 pixels took almost three hours more for the training, but is still far below the time required by rfcn resnet101 for 200 epochs of training. the next point in which we compared the different networks was accuracy (see fig. 4 ). we focused on seeing which of the nets were correct in their detections and how often (absolute values), and we also wanted to see what proportion of the total detections were correct (relative values). the latter seemed to us to make sense especially because some of the nets showed more than three detections for a single object. the probability that the correct classification will be found for the same object with more than one detection is of course higher in this case than if only one detection per object is made. 
with regard to the later use at the assembly table, however, it does not help us if the neural net provides several possible interpretations for the classification of a component. figure 4 shows that, in this comparison, the two region based object detectors generally perform significantly better than the single shot object detectors -both in terms of the correct detections and their share of the total detections. it is also noticeable that for the single shot object detectors, the size of the input data also appears to have an effect on the comparison point on the result. however, there is a clear difference to the previous comparison of the required training durations: while the training duration increased uniformly with increasing size of the images with the single shot object detectors, such a uniform observation cannot be made with the accuracy, concerning the relation to the input data sizes. while ssd mobilenet v2 achieves good results with an input data size of 512x512 pixels, ssd mobilenet v1 delivers the worst result of this comparison for the same input data size (regarding the number of correct detections as well as their share of the total detections). with an input data size of 300x300 pixels, however, the result improves with ssd mobilenet v1, while the change to a smaller input data size has a deteriorating effect on the result with ssd mobilenet v2. the best result of this comparison -judging by the absolute values -was achieved by faster r-cnn inception v2. however, in terms of the proportion of correct detections in the total detections, the region based object detector is two percentage points behind rfcn resnet 101, also a region based object detector. we were particularly interested in how neural networks would react to particularly similar, small objects. therefore, we decided to investigate the behavior of neural networks within the comparison using an example to illustrate the behavior of the three very similar objects. 
figure 5 shows the selected components for the experiment. for each of these three components, we examined how often it was correctly detected and classified by the compared neural networks, and how often the networks confused it with one of the similar components. the first and the second component were detected in nearly all cases by both region based approaches. the classification by inception-v2 and resnet-101 failed in about one third of the images. the ssd networks detected the object in just one of twenty cases, but mobilenet classified it correctly. surprisingly, the results for the third component look very different from the others (see fig. 6). ssd mobilenet v1 correctly identified the component in seven of ten images and did not produce any detections that could be interpreted as misclassifications with one of the similar components. ssd mobilenet v2 did not detect any of the three components, as in the two previous investigations. the results of the two region based object detectors are rather moderate. faster r-cnn inception v2 detected the correct component in four of ten images, but still produced five misclassifications with the other two components. rfcn resnet101 caused many misclassifications with the other two components: only two of ten images were correctly detected, and it had six misclassifications with the similar components.
another important aspect of the study is the speed at which the neural networks can detect objects, especially with regard to later use at the assembly table. for the comparison of the speeds, on the one hand the data of the github repository of the tensorflow object detection api for the individual neural nets were used; on the other hand, the actual speeds of the neural nets within this project were measured.
it becomes clear that the speeds measured in the project are well below the achievable speeds stated in the github repository of the tensorflow object detection api. on the other hand, the differences between the speeds of the region-based object detectors and the single shot object detectors in the project are far less drastic than expected.
we have created a training dataset with small, partly very similar components and used it to train four common deep neural networks. in addition to the training times, we examined the accuracy and the recognition time on general evaluation data. furthermore, we examined the results for ten images each of three very similar and small components. none of the networks we trained produced suitable results for our scenario. nevertheless, we were able to gain some important insights from the results. at the moment, the runtime is not yet suitable for our scenario, but it is not far from the minimum requirements, so that these can be achieved with minor optimizations and better hardware. it was also important to realize that there are no serious runtime differences between the different network architectures. the two region-based approaches delivered significantly better results than the ssd approaches. however, the results for the third small component suggest that mobilenet in combination with faster r-cnn could possibly deliver even better results. longer training and training data better adapted to the intended use could also significantly improve the results of the object detectors.
team schluckspecht from offenburg university of applied sciences is a very successful participant of the shell eco marathon [1]. in this contest, student groups design and build their own vehicles with the aim of low energy consumption. since 2018, the event has featured an additional autonomous driving contest.
in this contest, the vehicle has to fulfill several tasks autonomously, such as driving a course, stopping within a defined parking space or circumventing obstacles. for the upcoming season, the schluckspecht v car of the so-called urban concept class has to be augmented with the appropriate hardware and software to reliably recognize (i.e. detect and classify) possible obstacles and incorporate them into the software framework for further planning. in this contribution, we describe the additional hardware and software components that are necessary to allow optical 3d object detection. the main criteria are accuracy, cost effectiveness, computational complexity with regard to real-time performance, and ease of use with regard to incorporation into the existing software framework and possible extensibility. this paper consists of the following sections. first, the schluckspecht v system is described in terms of the hardware and software components for autonomous driving and the additional parts for visual object recognition. the second part examines the object recognition pipeline; software frameworks, neural network architecture and the final data fusion in a global map are described in detail. the contribution closes with an evaluation of the object recognition results and conclusions.
the schluckspecht v is a self-designed and self-built vehicle that conforms to the eco marathon rules. the vehicle is depicted in figure 1. its main features are the relatively large size, including driver cabin, motor area and a large trunk, a fully equipped lighting system and two doors that can be opened separately. for the autonomous driving challenges, the vehicle is additionally equipped with several essential parts, which are divided into hardware and software. the hardware consists of actuators, sensors, computational hardware and communication controllers.
the software is based on a middleware, can-open communication layers, and localization, mapping and path planning algorithms that are embedded into a high-level state machine.
actuators. the car is equipped with two actuators, one for steering and one for braking. each actuator is paired with sensors for measuring steering angle and braking pressure.
environmental sensors. several sensors are needed for localization and mapping. the backbone is a multilayer 3d laser scanning system (lidar), which is combined with an inertial navigation system that consists of accelerometers, gyroscopes and magnetic field sensors, all realized as triads. odometry information is provided by a global navigation satellite system (gnss) and two wheel encoders. the communication is based on two separate can bus systems, one for basic operations and an additional one for the autonomous functions. the hardware can nodes are designed and built by the team, coupling usb, i2c, spi and can-open interfaces. messages are sent from the central processing unit or the driver, depending on the drive mode. the trunk of the car is equipped with an industrial-grade high-performance cpu and an additional graphics processing unit (gpu). can communication is ensured with an internal card; remote access is possible via generic wireless components.
software structure. the schluckspecht uses a modular software system consisting of several basic modules that are activated and combined within a high-level state machine as needed. an overview of the main modules and possible sensors and actuators is depicted in figure 2.
localization and mapping. the schluckspecht v runs a simultaneous localization and mapping (slam) framework for navigation, mission planning and environment representation. in its current version, we use a graph-based slam approach built upon the cartographer framework developed by google [2]. we calculate a dynamic occupancy grid map that can be used for further planning.
sensor data is provided by the lidar, inertial navigation and odometry systems. an example of a drivable map is shown in figure 3. this kind of map is also used as the basis for localizing and placing the obstacles detected later. the maps are accurate to roughly 20 centimeters, providing relative localization towards obstacles or homing regions.
path planning. to make use of the maps created by slam, an additional module calculates the motion commands from the start pose to the target pose of the car. the schluckspecht is a classical car-like mobile system, which means that the path planning must take the non-holonomic kinematics into account. parking maneuvers, driving close to obstacles and planning a trajectory between given points are realized as a combination of local control commands based upon modeled vehicle dynamics, the so-called local planner, and optimization algorithms that find the globally most cost-efficient path for a given cost function, the so-called global planner. we employ a kinodynamic strategy, the elastic band method presented in [3], for the local planning. global planning is realized with a variant of the a* algorithm as described in [4].
middleware and communication. all submodules, namely localization, mapping, path planning and the high-level state machines for each competition, are implemented within the robot operating system (ros) middleware [5]. ros provides a messaging system based upon the publisher/subscriber principle. each module is encapsulated in a process, called a node, that can asynchronously exchange messages as needed. owing to its open source character and an abundance of drivers and helper functions, ros provides additional features like hardware abstraction, device drivers, visualization and data storage. data structures for mobile robotic systems, e.g. static and dynamic maps or velocity control messages, allow for rapid development.
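the global planner mentioned above is described only as a variant of the a* algorithm, and its details are not given. the following minimal sketch therefore shows plain textbook a* on a small occupancy grid, with unit step costs, 4-connected moves and a manhattan heuristic as simplifying assumptions. it ignores the non-holonomic constraints of the car, and the function name and grid are hypothetical.

```python
import heapq


def astar(grid, start, goal):
    """Shortest 4-connected path on an occupancy grid (0 = free, 1 = occupied)."""
    rows, cols = len(grid), len(grid[0])

    def heuristic(cell):
        # manhattan distance: admissible for unit-cost 4-connected moves
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    open_heap = [(heuristic(start), 0, start)]  # (f = g + h, g, cell)
    came_from = {}
    best_cost = {start: 0}
    while open_heap:
        _, cost, cell = heapq.heappop(open_heap)
        if cell == goal:
            # walk back through the predecessors to rebuild the path
            path = [cell]
            while cell in came_from:
                cell = came_from[cell]
                path.append(cell)
            return path[::-1]
        if cost > best_cost.get(cell, float("inf")):
            continue  # stale heap entry, a cheaper route was found already
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                new_cost = cost + 1
                if new_cost < best_cost.get((nr, nc), float("inf")):
                    best_cost[(nr, nc)] = new_cost
                    came_from[(nr, nc)] = cell
                    heapq.heappush(
                        open_heap,
                        (new_cost + heuristic((nr, nc)), new_cost, (nr, nc)),
                    )
    return None  # goal not reachable


# toy occupancy grid: a wall with a single gap in the middle row
grid = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
]
path = astar(grid, (0, 0), (2, 0))
```

on a real occupancy grid map, the cell costs would typically also reflect the distance to obstacles, and the resulting path would be handed to the local planner for smoothing under the vehicle constraints.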
the lidar sensor system has only four rays, which limits it to incorporating walls and track delimiters into a map. therefore, a stereo camera system is additionally installed to allow the detection of persons, other cars, traffic signs or visual parking space delimiters and, at the same time, to measure the distance to objects in the environment.
camera hardware. a zed stereo camera system is installed on the car and incorporated into the ros framework. the system provides a color image stream for each camera and a depth map from stereo vision. the camera images are calibrated to each other and to the depth information. the algorithms for disparity estimation run at around 50 frames per second, making use of the provided gpu. the object recognition relies on deep neural networks. to work seamlessly with the other software parts and for easy integration, the networks are evaluated with the tensorflow [6] and pytorch [7] frameworks. both are connected to ros via the opencv image formats, providing ros nodes and topics for visualization and further processing. the object recognition pipeline relies on a combination of mono camera images and calibrated depth information to determine object and position. the core algorithm is a deep learning approach with convolutional neural networks.
the main contribution of this paper is the incorporation of a deep neural network object detector into our framework. object detection with deep neural networks can be subdivided into two approaches. the first is a two-step approach, where regions of interest are identified in a first step and classified in a second one. the second are so-called single shot detectors (like [8]), which extract and classify the objects in one network run. therefore, two network architectures are evaluated, namely yolov3 [9] as a single shot approach and faster r-cnn [10] as a two-step model.
both are trained on public data sets and fine-tuned to our setting by incorporating training images from the schluckspecht v in the zed image format. the models are pre-selected for their real-time capability in combination with the expected classification performance. this excludes the current best instance segmentation network, mask r-cnn [11], because of its computational burden, as well as fast but inaccurate networks based on the mobilenet backbone [12]. the class count is adapted for the contest, in the given case eight classes, including the relevant pedestrian, car, van, tram and cyclist classes. for this paper, the two chosen network architectures were trained in their respective frameworks, i.e. darknet for the yolov3 detector and tensorflow for the faster r-cnn detector. yolov3 is used in its standard form with the darknet-53 backbone; faster r-cnn is built on the resnet 101 [13] backbone. the models were trained on local hardware with the kitti [14] data set. alternatively, an open source data set from the teaching company udacity with only three classes (truck, car, pedestrian) was tested. to deal with the problem of domain adaptation, the training images for yolov3 were pre-processed to fit the aspect ratio of the zed camera. the faster r-cnn net can cope with ratio variations as it uses a two-stage approach for detection based on region of interest pooling. both networks were trained and stored. afterward, they are incorporated into the system via a ros node making use of standard python libraries. the detector output is represented by several labeled bounding boxes within the 2d image. three-dimensional information is extracted from the associated depth map by calculating the center of gravity of each box to get an x and a y coordinate within the image. by interpolating the depth map pixels accordingly, one obtains the distance coordinate z and thus the object position p(x, y, z) in the stereo camera coordinate system.
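the position extraction described above can be sketched as follows. this is a minimal illustration, not the authors' implementation: the box format (x_min, y_min, x_max, y_max), bilinear interpolation as the interpolation scheme, and all function names are assumptions, and invalid depth values (e.g. holes in the stereo depth map) are not handled.

```python
def box_center(box):
    """Center (x, y) of a bounding box (x_min, y_min, x_max, y_max) in pixels."""
    x_min, y_min, x_max, y_max = box
    return (x_min + x_max) / 2.0, (y_min + y_max) / 2.0


def interpolate_depth(depth, x, y):
    """Bilinear interpolation of a depth map (list of rows, meters) at pixel (x, y)."""
    rows, cols = len(depth), len(depth[0])
    # clamp the query point to the image area
    x = min(max(x, 0.0), cols - 1.0)
    y = min(max(y, 0.0), rows - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, cols - 1), min(y0 + 1, rows - 1)
    fx, fy = x - x0, y - y0
    top = depth[y0][x0] * (1 - fx) + depth[y0][x1] * fx
    bottom = depth[y1][x0] * (1 - fx) + depth[y1][x1] * fx
    return top * (1 - fy) + bottom * fy


def object_position(box, depth):
    """Return p(x, y, z): box center in pixels plus interpolated distance z."""
    x, y = box_center(box)
    return x, y, interpolate_depth(depth, x, y)


# toy 3x3 depth map in meters and one detection box (placeholder values)
depth_map = [
    [10.0, 10.0, 12.0],
    [10.0, 10.0, 12.0],
    [14.0, 14.0, 16.0],
]
position = object_position((1, 0, 2, 1), depth_map)
```

here x and y are still pixel coordinates; converting them to metric coordinates in the camera frame would additionally require the camera intrinsics, which are not given in the text.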
the ease of projection between different coordinate systems is one reason to use the ros middleware. the complete vehicle is modeled in a so-called transform tree (tf tree), which allows direct transformation between different coordinate systems in all six spatial degrees of freedom. the dynamic map created in the slam subsystem is now augmented with the current obstacles in the car coordinate system. the local path planner can take these into account and plan a trajectory, including kinodynamic constraints, to prevent a collision or initiate a braking maneuver.
both newly trained networks were first evaluated on the training data. exemplary results for the kitti data set are shown in figure 4. the results clearly indicate an advantage for the yolov3 system, both in speed and accuracy. the figure depicts good results for occlusions (e.g. the car on the upper right) and high object counts (see the black car on the lower left as an example). the evaluation on a desktop system showed 50 fps for yolov3 and approximately 10 fps for faster r-cnn. after validating the performance on the training data, both networks were started as ros nodes and tested on real data from the schluckspecht vehicle. as the training data differs from the zed camera images in format and resolution, several adaptations were necessary for the yolov3 detector. the images are cropped in real time before being presented to the neural net to emulate the format of the training images. the r-cnn-like two-stage networks are directly connected to the zed node. the test data is not labeled with ground truth. it is therefore not possible to give quantitative results for the recognition task. table 1 gives a qualitative overview of the object detection and classification; the subsequent figures give an impression of exemplary results. the evaluation on the schluckspecht videos showed an advantage for the yolov3 network.
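the chaining of coordinate systems that the transform tree provides can be illustrated with plain homogeneous transforms. the sketch below does not use the ros tf library; it only shows the underlying arithmetic. for brevity it restricts rotations to yaw, whereas the tf tree covers all six degrees of freedom, it assumes that all frames use an x-forward convention, and the mounting offset and car pose are made-up values.

```python
import math


def make_transform(yaw, tx, ty, tz):
    """4x4 homogeneous transform: rotation by yaw (radians) about z, then translation."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [
        [c, -s, 0.0, tx],
        [s, c, 0.0, ty],
        [0.0, 0.0, 1.0, tz],
        [0.0, 0.0, 0.0, 1.0],
    ]


def compose(a, b):
    """Matrix product a * b, i.e. apply b first, then a."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)] for i in range(4)]


def apply(transform, point):
    """Transform a 3d point (x, y, z) and return the transformed (x, y, z)."""
    vec = (point[0], point[1], point[2], 1.0)
    return tuple(sum(transform[i][k] * vec[k] for k in range(4)) for i in range(3))


# hypothetical camera mounting: 1.0 m ahead of and 1.2 m above the car origin
car_from_camera = make_transform(0.0, 1.0, 0.0, 1.2)
# hypothetical car pose in the map: at (5, 2), heading rotated by 90 degrees
map_from_car = make_transform(math.pi / 2, 5.0, 2.0, 0.0)

# chain the two edges of the tree into one camera-to-map transform
map_from_camera = compose(map_from_car, car_from_camera)

# an obstacle detected 10 m straight ahead of the camera
obstacle_in_car = apply(car_from_camera, (10.0, 0.0, 0.0))
obstacle_in_map = apply(map_from_camera, (10.0, 0.0, 0.0))
```

in this toy setup the obstacle lies 11 m ahead of the car origin and ends up at roughly (5, 13, 1.2) in the map, which is the kind of position that can then be entered into the occupancy grid for the planner.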
the main reason is the faster computation, which results in a frame rate nearly twice as high as that of the two-stage detector. in addition, the recognition of objects in the distance, i.e. smaller objects, is a strong point of yolo. the closer the camera gets, the more the balance shifts towards faster r-cnn, which outperforms yolo in all categories for larger objects. what becomes apparent is a maximum detection distance of approximately 30 meters, beyond which cars become too small in the image. figure 6 shows an additional result demonstrating the detection power for partially obstructed objects. another interesting finding concerns the capability of the networks to generalize. faster r-cnn copes much better with new object instances than yolov3. persons with previously unseen clothing colors or darker image areas containing vehicles remain a problem for yolo, but usually not for the r-cnn. the domain transfer from training data in berkeley and kitti to real zed vehicle images proved problematic.
this contribution describes an optical object recognition system in hardware and software for autonomous driving under restricted conditions within the shell eco marathon competition. an overview of the system and the incorporation of the detector within the framework is given. the main focus was the evaluation and implementation of several neural network detectors, namely yolov3 as a one-shot detector and faster r-cnn as a two-step detector, and their combination with distance information to obtain three-dimensional information for detected objects. for the given application, the advantage clearly lies with yolov3. in particular, the achievable frame rate of at least 10 hz allows a seamless integration into the localization and mapping framework. given the velocities and the map update rate, the object recognition and integration via sensor fusion for path planning and navigation works in quasi real time.
for future applications, we plan to further increase the detection quality by incorporating new classes and modern object detector frameworks like m2det [15]. this should additionally increase the frame rate and the bounding box quality. for more complex tasks, the data of the 3d lidar system shall be directly incorporated into the fusion framework to enhance the perception of object boundaries and object velocities.
references
shell: the shell eco marathon
real-time loop closure in 2d lidar slam
kinodynamic trajectory optimization and control for car-like robots
experiments with the graph traverser program
robot operating system
automatic differentiation in pytorch
ssd: single shot multibox detector
yolov3: an incremental improvement
rich feature hierarchies for accurate object detection and semantic segmentation
mobilenets: efficient convolutional neural networks for mobile vision applications
deep residual learning for image recognition
are we ready for autonomous driving? the kitti vision benchmark suite
m2det: a single-shot object detector based on multi-level feature pyramid network

the upper-rhine artificial intelligence symposium ur-ai 2020
we thank our sponsor! main sponsor: esentri ag, ettlingen
this research and development project is funded by the german federal ministry of education and research (bmbf) and the european social fund (esf) within the program "future of work" (02l17c550) and implemented by the project management agency karlsruhe (ptka). the author is responsible for the content of this publication.
the projects underlying this article are funded by the wtd 81 of the german federal ministry of defense. the authors are responsible for the content of this article.
this work was developed in the fraunhofer cluster of excellence "cognitive internet technologies".