IEEE Paper Template in A4 (V1) International Journal on Advances in ICT for Emerging Regions 2021 14(2): March 2021 International Journal on Advances in ICT for Emerging Regions Cluster Identification in Metagenomics – A Novel Technique of Dimensionality Reduction through Autoencoders Kalana Wijegunarathna#1*,Uditha Maduranga2*, Sadeep Weerasinghe3*, Indika Perera4*, and Anuradha Wickaramarachchi5 Ϯ Abstract— Analysis of metagenomic data is not only challenging because they are acquired from a sample in their natural habitats but also because of the high volume and high dimensionality. The fact that no prior lab based cultivation is carried out in metagenomics makes the inference on the presence of numerous microorganisms all the more challenging, accentuating the need for an informative visualization of this data. In a successful visualization, the congruent reads of the sequences should appear in clusters depending on the diversity and taxonomy of the microorganisms in the sequenced sample. The metagenomic data represented by their oligonucleotide frequency vectors is inherently high dimensional and therefore impossible to visualize as is. This raises the need for a dimensionality reduction technique to convert these higher dimensional sequence data into lower dimensional data for visualization purposes. In this process, preservation of the genomic characteristics must be given highest priority. Currently, for dimensionality reduction purposes in metagenomics, Principal Component Analysis (PCA) which is a linear technique and t-distributed Stochastic Neighbor Embedding (t-SNE), a non-linear technique, are widely used. Albeit their wide use, these techniques are not exceptionally suited to the domain of metagenomics with certain shortcomings and weaknesses. Our research explores the possibility of using autoencoders, a deep learning technique, that has the potential to overcome the prevailing impediments of the existing dimensionality reduction techniques eventually leading to richer visualizations. Keywords— Metagenomic data visualizations, nonlinear dimensionality reduction, autoencoders, clustering I. INTRODUCTION he field of metagenomics has shown popular interest among bioinformatics and computer science researchers in the recent years. It has opened up new pathways in many areas including population-level genomic diversity of the microbial organisms. Metagenomics [1] was coined, with the idea of performing analysis on similar in certain criteria, yet non identical, microorganisms which are extracted from diverse environmental samples or from the natural habitats in order to study the structure and functions of microorganisms. An earliest-known method for studying metagenomic DNA is the abundance of guanine-cytosine (%GC) content. %GC varying widely between species but remaining relatively constant within the species is proven. Acquiring oligonucleotide frequencies of the microbial organisms is a widely used method that identifies the nucleotide composition with much better accuracy and effectiveness, compared to %GC [2]. Contemporary studies have shown that the oligonucleotide frequencies as they appear in genomic sequences is unique for a given microorganism. Research on this which runs back to 1960s, showcase the fact that oligonucleotide frequencies having species-specific signatures [3]. Because of this, an array of all oligonucleotide frequencies for a given length provides genomic signatures for microorganisms. Oligonucleotide frequencies can be represented as vectors in high dimensional Euclidean space. Visualization of metagenomic data, without prior taxonomic references using sequence fragments can use frequency vectors to be used as genomic signatures. An ideal visualization must be capable of capturing the authentic characteristics of the microorganisms in the sample, given a set of metagenomic sequence fragments, and display the alike species separated from the rest. Consequently, the visuals must be capable to display the taxonomic structure which is inherited by the original sequence data. Being self-explanatory and ability to carry out further analysis are few of the other characteristics that are expected of visualizations. Visualization of metagenomic data is broadly twofold. The visualization of a single metagenome and the visualization of multiple metagenomes. While visualization of multiple metagenomes gives insight into the nature of the same or different types of species found in the two different environments and helps researchers gain insightful information on the environments, the study of single metagenomes focuses on the species richness and diversity in the particular environmental sample. Despite the high dimensionality and other challenges, metagenomic data can be visualized in various techniques as described by Sudarikov et al [4]. Conversely, taxonomic classification of metagenomic data can be broadly categorized into four categories. Sequence similarity based classification employs a database search on a database of reference sequences. This method has been successfully used in the identification of reads of length as short as 80 base pairs where most other methods have failed [5]. This method is usually slower and uses Basic Local Alignment Search Tool (BLAST) [6] to identify similarities. Classification based on sequence composition is another method of taxonomic classification. One way of doing this is by using nucleotide composition of the reads. The nucleotide composition of the reads are compared with the models built using the composition of reference genomes and the model that fits the composition of the reads best are chosen. The T Correspondence: K. Wijethunga#1(E-mail: kalanainduwara.16@cse.mrt.ac.lk) Received: 20-12-2020 Revised: 25-02-2021 Accepted: 14-03-2021 This paper is an extended version of the paper “Dimensionality Reduction for Cluster Identification in Metagenomics using Auto encoders” presented at the ICTer 2020 conference. K. Wijegunarathna#1*,U. Maduranga2*, S. Weerasinghe3*, I. Perera4* are from the Department of Computer Science & Engineering, University of Moratuwa, Sri Lanka. {kalanainduwara.16, maduranga.16, sadeep.16, indika}@cse.mrt.ac.lk Anuradha Wickramarachchi5Ϯ is from the Australian National University, Australia. (anuradha.wickramarachchi@anu.edu.au) DOI: http://doi.org/10.4038/icter.v14i2.7224 © 2021 International Journal on Advances in ICT for Emerging Regions Cluster Identification in Metagenomics – A Novel Technique of Dimensionality Reduction through Autoencoders 2 International Journal on Advances in ICT for Emerging Regions March 2021 main drawback of this method is that it fails to classify reads shorter than 1000 base pairs with reasonable accuracy [7]. The third classification method is a hybrid of the two previous methods that combines both approaches of read similarity and nucleotide composition. It is possible to increase the accuracy by taking an aggregation of the two matches for a better result. In this case the score from the read-sequence similarity and the score from the read composition and reference genome model are combined. Alternatively, it is possible to narrow down the database using composition matching as the initial step to apply the similarity search on the filtered, reduced database [8]. The last of the four approaches is the use of marker genes. The reads are classified according to the markers they hit. Although this method is faster and efficient, it induces a bias towards organisms with larger genomes since they generate a larger number of reads [7]. Due to variability in 16S rRNA (ribosomal Ribonucleic Acid) [9] copy number, Amplicon sequencing [10] also suffers with bias. A key challenge in bioinformatics as well as metagenomics to date is the visualization of metagenomic fragment data without prior taxonomic identification. Usually for the visualization needs, the higher dimensional frequency vectors need to be embedded into lower dimensions (2D or 3D). Preservation of the inherited data from the genomic sequences is a deciding factor of the quality and accuracy of the visualizations when reducing into lower dimensions. Principal Component Analysis (PCA) [11] and t-distributed Stochastic Neighbor Embedding (t-SNE) [12] are two of the most widely used techniques for this purpose. But limitations in these techniques which hinder faithful visualizations prevail. Massive volumes of genomic data can be produced in such an efficient manner using the advancements in the field of sequencing and with the introduction of latest sequencing technologies like Next-Generation Sequencing (NGS). It is evident that with the availability of large volumes of genomic data, optimizations and new analyzing techniques are becoming more crucial. Cutting-edge technologies that are being used throughout in computer science can be adopted in the field of bioinformatics as well. Deep learning techniques are built to handle the rapid rate of generation of data and for the intuitive and rapid analysis. This research aimed at producing better visualizations in metagenomics for cluster identification by adopting an autoencoder based approach of dimensionality reduction. Autoencoders can be used in this context to convert higher dimensional metagenomic data into visualizable lower- dimensional data preserving the important characteristics of the original sequences. Since it is a deep learning technique it will be useful in a context like genomics. This allows better analysis on data-inherent taxonomic structure, free from alignments. The rest of the paper is organized as follows. The related work on dimensionality reduction techniques and research that are carried out on using autoencoders in the genomics domain are discussed under Section 2. A broader introduction to autoencoders is given in Section 3. Dataset, methodology for dimensionality reduction and visualizations are discussed in Section 4 followed by the experimental results which are described and demonstrated in Section 5 in detail. Finally, Section 6 concludes the paper giving an overview of the future work. II. RELATED WORK A. Dimensionality Reduction Techniques Representing data visually can be considered as a major challenge in the field of genomics. Processes such as transforming, scaling, normalizing, color-encoding and clustering play major roles in visualizing genomic data. It is important not to hinder users’ ability to interpret data while facilitating the users to carry out their analysis more conveniently and precisely. According to the studies done by Rall et al. [13], still number of challenges prevail in visualizing the genomic data. One of the leading challenges will be dealing with the dimensionality of the genomic sequence data. A plethora of research were carried out on lower-dimensional embedding techniques. One of the dominant concerns is dimensionality reduction, in bioinformatics fields, which looks into analyzing sequence data. Genomic sequence data consists of extensive amounts of sequence data and features. Thus, it is essential to reduce the dimensionality of data to extract useful analysis and visualization by avoiding the curse of dimensionality. Transformation of high dimensional data to a lower number of dimensions is a major goal in dimensionality reduction, providing simple interpretations. In the ideal case, dimensionality of reduced representation must have the dimensionality that corresponds to the intrinsic data [14]. One of the related concerns is the dense preservation of information. PCA, without a doubt, is the most commonly used linear technique in dimensionality reduction across multiple domains. It converts data in the higher dimension spaces to the lower-dimensional subspace making sure maximized variance in the projected data. Making sure that the maximum variance in the projected data also means PCA minimizes the squared reconstruction error. One of the leading drawbacks of PCA is its restriction with respect to linear transformations. Hauskrecht et al. [15] in in their work also displays the restrictions of PCA to guarantee high- quality features for discriminatory purposes because it is a totally unsupervised technique. Stochastic Neighbor Embedding (SNE) [16] introduced by Hinton and Roweis is a non-linear technique to get lower dimensional representations. It uses a Gaussian distribution on each point of data in the higher dimension and defines a probability distribution for its neighbors. This unsupervised dimensionality reduction technique has been commonly used over many years. t-SNE [12] is a variant of the SNE for non- linear dimensionality reduction. This method is also used to produce lower dimensional representations of higher dimensions that can be visualized with ease, especially in scatterplots. t-SNE preserves the global structures of the data like clusters while capturing the local structure of the higher dimensions. t-SNE adopts a Gaussian kernel in order to identify the similarities between points in the higher dimensions. Lower dimensional points are plotted while providing similar probabilities to the points and they are usually configured in such a way that they will reduce the divergence between the two distributions. Importantly, t-SNE strongly advocates against using Gaussians to measure distances in lower dimensions. It will instead opt for the one- dimensional t distribution (i.e. the Cauchy Distribution). Thus, it has heavier tails and allows for more spread in the lower dimensional representation than Gaussian. Nevertheless, a 3 K. Wijegunarathna#1*, U. Maduranga2*, S. Weerasinghe3*, I. Perera4*, A Wickaramarachchi5Ϯ March 2021 International Journal on Advances in ICT for Emerging Regions significant limitation of SNE as well as t-SNE is that their computational and memory complexity scale quadratically in the number of data objects N. Thus, SNE and its variants can only be used with limited number of points [17]. The standard implementation of t-SNE has a time complexity of O(N2), where N is the number of genomic fragments. Barnes-Hut-SNE (BH-SNE) introduced by Laurens van der Maaten [17] has a time complexity of O(N log N). Therefore, BH-SNE is significantly faster compared to t-SNE. Despite the efficiency, the results produced by BH- SNE are similar to that of t-SNE. Sparse similarities between each pair of points were obtained using vantage-point trees while the forces of the corresponding embedding were acquired using an enhanced version of Barns-Hut algorithm [18]. Due to the lower time complexity, BH-SNE can process more than a million data points within a reasonable time. A 2009 paper [19] by Dick et al describes the use of Self- Organizing Maps [13] (SOMs) for reducing the dimensionality of tetranucleotide genomic signatures belonging to two acidophilic biofilm communities. The unsupervised learning technique SOM uses an artificial neural network to generate a two-dimensional representation of the high dimensional data. This technique was used by the researchers to bin tetranucleotide sequence fragments obtained from isolated genomes and metagenomic samples. Despite being neural networks based, SOM does not rely on error-correction learning. Instead, SOM depends on competitive learning to map the inputs to an output while preserving characteristics in the input space. Gisbrecht et al. [21], in their research on “Nonlinear dimensionality reduction for cluster identification in metagenomic samples” has compared several currently used techniques for dimensionality reduction. The researchers have obtained oligonucleotide frequency vectors from a set of sequences generated by simulating metagenomic next- generation sequencing. These vectors have then been fed into dimensionality reduction algorithms. Researchers have used the effectiveness of these algorithms at clustering the output to compare the dimensionality reduction techniques. The techniques being compared are PCA, GTM (Generative Topographic Mapping) [22] and t-SNE. The research demonstrates that t-SNE outperforms the other techniques in terms of accuracy. The researchers have also introduced improvements to t-SNE to overcome its high complexity. Datasets in bioinformatics are typically large. The quadratic time complexity of t-SNE does not scale well for the needs of the genomic datasets. This is a shortcoming that needs to be addressed. The use of autoencoders for this purpose has been explored by Wang and Gu in their paper on dimension reduction and visualization of single-cell RNA seq data [23]. B. Autoencoders in Genomics As deep learning became mainstream, we have observed a rise of neural network based techniques that rival the traditional mathematical & probabilistic methods in their corresponding areas. In the case of dimensionality reduction, autoencoders can be named as the candidate, deep learning has to offer. The viability of autoencoders to replace t-SNE, and PCA in the context of bioinformatics has to be explored. Autoencoders, however, have already played a role in the bioinformatics domain in several instances. A 2019 paper by Ersalan et al. [24] describes how they managed to denoise single-cell RNA sequencing (scRNA-seq) datasets using deep count autoencoders (DCA). They used an enhanced version of autoencoders although a number of other methods existed to perform this very task. The enhancement was in terms of the specialized loss functions which drive in the direction of denoised scRNA-seq data. In addition to that, DCA scales linearly with regard to the number of cells overcoming the limitation of limiting the datasets. Experiment results suggest that DCA has surpassed the existing methods for imputation by means of quality and speed. Wang et al. [23] in the research on dimension reduction of single-cell RNA seq data, propose the use of variational autoencoders. Research experimented the use of variational autoencoders for single-cell RNA seq data because the chances of dropouts are higher when dealing with single-cell levels with higher transcriptional fluctuations. This experiment went on to show how to overcome the limitations of PCA and ZIFA (Zero Inflated Factor Analysis) [25] by testing on over 20 different datasets. Variational autoencoders gain a special edge over both PCA and ZIFA because of its ability to deal with complex non-linear relationships. Wang and Wang [26] in their research have used variational autoencoders to study two subtypes of lung cancers, lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC). Although the researchers were expected to capture underlying DNA methylation patterns of the different original subtypes separately, some LUSC samples were classified into a LUAD group, which may be an indication that some parts of LUSC tumor samples may have similar DNA methylation expression compared to LUAD tumor. It was evident that a biologically meaningful latent space can be extracted using variational autoencoders from the data consisting of two or more subtypes of the lung cancers. III. AUTOENCODERS Although autoencoders took the center stage in the late 2000s with the introduction of the deep architecture, the original concept goes back as far as 1980s when Rumelhart et al. [27] introduced a new learning procedure that recurrently adjusts the weights of the connections of the network in such a way that the disparity between the actual output and the expected output of the network is minimal. Autoencoders can be considered as a special type of neural networks, which consists of two symmetric components namely encoder and decoder. Encoder maps input to more compressed lower dimensions, in contrast to the decoder which does exactly the opposite. Autoencoders are an unsupervised learning technique that leverages reducing the dimensionality of the input vectors efficiently, such that it will preserve the important characteristics of the original data and then reconstructing the original data with minimum loss using compressed representation. L(x, x’), which represents the deviation between the original input (x) and the consequent reconstruction (x’) should be diminished to get better reconstructions. A general autoencoder (Fig. 1) can be expressed using the tuple X, Y, Φ, Ψ, X’ where Φ denotes the encoder function and Ψ represents the decoder function of the autoencoder. X Cluster Identification in Metagenomics – A Novel Technique of Dimensionality Reduction through Autoencoders 4 International Journal on Advances in ICT for Emerging Regions March 2021 and X’ are the input and the output vectors of the autoencoder respectively. Fig. 1 Structure of a general autoencoder Φ : Χ → Υ (1) Ψ : Υ → Χ (2) Φ, Ψ = . argΦ,Ψ min (loss(x,x’)) (3) Φ denotes the function that maps the original data X, to a latent space Y, which appears at the bottleneck and Ψ function does the opposite. Deep autoencoders can have multiple hidden layers. Basically, when the layers of the neural network increased, those autoencoders are called deep autoencoders. Having advanced learnability will be essential to reconstruct the input data as it is if possible. A well trained autoencoder is capable of reconstructing the input with minimum loss. Although the reconstruction is done by the decoder, saving important characteristics of the input data plays a major role up to the bottle-neck layer of the autoencoder. Autoencoders can be considered as the non-linear generalization of PCA that converts higher dimensional data into lower dimensional code using the encoder part [28]. Autoencoders are being used as a powerful tool for dimensionality reduction [29]. It is proven to be useful in fields which have extensive amounts of data to work with. IV. METHODOLOGY A. Dataset For the experiment, we chose a subset of the genomic sequences used by Gisbrecht et al [21] as our dataset. The sequences were obtained from the NCBI (National Center for Biotechnology Information) microbial genomes database (https://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/). We first went on to reproduce the PCA, and t-SNE based results to work as the baseline for the research. To conduct the research the genomic sequences were downloaded in the FASTA file format. Complete genomes of 8 microbial species were used. Our chosen subset contained species that are taxonomically very close to each other, as well as species that are very different from each other. The intention was to explore how well the dimensional reduction process preserves the taxonomic similarities and dissimilarities. B. Implementation The DNA sequences that we obtained, must go through a multi-stage process before they are ready to be visualized. This does not mean that significant alterations are done to the underlying genomic features of the sequences. Two- or three- dimensional scatter plots are drawn based on the input sequences at the end of the following process. • Processing raw microbial DNA sequences and extracting genomic metrics • Converting the higher dimensional genomic metrics into a lower dimensional state • Visualizing the lower dimensional state with scatter plots First, we read contiguous blocks from each input DNA sequence. The blocks were taken from random locations within the sequence. The lengths of the sequences were chosen so that they form a normal distribution where the mean length is 5000 base pairs, and the standard deviation is 1000 base pairs. For a given k value, we calculated the number of the occurrences of each k-mer within each block. A k-mer and its reverse complement pairs were considered as a single k-mer; therefore, we had to take the sum of the occurrences of each k-mer in the pair and use that to represent that k-mer and its reverse complement. This method was proposed by Abe et al. [30]. For each block, we created a vector by taking all such k-mer frequencies. The resulting vector was then normalized. After following the above process, what we end up with is a set of vectors, each having an equal dimensionality. The dimensionality depends on k. The high dimensionality of these vectors makes them impossible to visualize as they are. Transforming these high dimensional vectors to a lower dimensionality is done by the autoencoder. The output of the autoencoder is a set of lower dimensional vectors, typically the dimensionality of these vectors is less than 4. These vectors are then directly used for visualization by drawing scatter plots. Other techniques can also be used in the place of the autoencoder for the process of dimensionality reduction. The autoencoders perform dimensionality reduction in 2 stages. In the first stage, we feed the high dimensional vectors to train the autoencoder. In this stage the input layer of the neural network is provided with the vectors, and the same vectors are expected from the output layer of the neural network. In other words, we train the neural network to produce the same vector as the input. Each of the vectors fed into the autoencoder corresponds to each block we read originally. Once the training is over, we send the same set of vectors as input and obtain the values produced in the bottleneck layer. The resulting values correspond to a lower dimensional representation of the input vector. We experimented using various neuron counts for each layer and analyzed the results. As we identified the following configurations worked the best for respectively 3-mers and 4- mers, 1. {32, 16, 2, 16, 32} 2. {136, 64, 2, 64, 136} The results we obtained using these autoencoders are presented in the paper. We used sigmoid as the activation function for neurons, while ‘Adam’ and mean square error were used as the optimizer and the loss function respectively. 1000 reads (blocks) were taken from each of the original genomes, each on average having a length of 5000 bp. For comparison purposes, the results of PCA and t-SNE obtained using the same data fed to the autoencoder were visualized. The sequence lengths and standard deviation were kept constant across all three approaches. T-SNE and PCA https://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/ 5 K. Wijegunarathna#1*, U. Maduranga2*, S. Weerasinghe3*, I. Perera4*, A Wickaramarachchi5Ϯ March 2021 International Journal on Advances in ICT for Emerging Regions were both imported from the ‘scikit-learn’ [31] library. Seaborn data visualization library was used to visualize the data in its reduced 2D space. The availability of ground truth labels ensured comparability of the performances of the three techniques of dimensionality reduction using the clusters produced by the respective approaches. Due to the use of a manually curated simulated metagenomic dataset, the ground truth label, i.e., the true identity of the microorganism to which the oligonucleotide frequency belonged to, is known. But this is rarely the case in the field of metagenomics, where a single metagenomic sample obtained from the environment may consist of large amounts of DNA fragments from a wide variety of species who may or may not be related to each other. Hence making the evaluation of latent space visualizations of real metagenomic datasets extremely challenging. Though this can be broadly named a clustering problem, several acute complications remain. While the creation of an isolated, lonely clusters consisting of data points from the right species is useful in the context of identifying the number of organisms in the sample, this sort of isolation can be costly in terms of taxonomic information. Not only is the creation of a cluster consisting of the same species important but the whole visualization must reflect the connections and similarities these different clusters bear with each other. Hence, mere arbitrary separation is not recommended. In our experiment, we used DBSCAN (Density-based spatial clustering of applications with noise) DBSCAN [32] for clustering. Though ideal method for clustering low dimensional metagenomic data is still a point of debate in the research community, DBSCAN is a powerful algorithm that discovers clusters of arbitrary shapes and can efficiently deal with large datasets that are part and parcel of genomic problems. DBSCAN relies on a density-based notion of clusters and requires only one essential input parameter. The low dimensional (2D) coordinate pairs, obtained from the bottleneck layer of the autoencoder, of the oligonucleotide frequencies are then fed to the DBSCAN for clustering. These 2D coordinate pairs are then clustered by DBSCAN without knowledge of their ground truth labels into clusters solely based on their distances in the 2D coordinate plane. It is these clusters that are produced by DBSCAN that need evaluation. Clustering evaluation can be broadly categorized into two; intrinsic metrics and extrinsic metrics [33]. Extrinsic metrics, when calculating the quality of a cluster, considers the ground truth label of the data points in each of the clusters. Therefore, the use of an extrinsic metric demands ground truth labels, which we fortunately possess. Contrary to extrinsic metrics, intrinsic metrics only consider the intra-cluster closeness and inter-cluster distance, not taking into consideration the ground truth labels of the data. Intrinsic clustering evaluations evaluate the integrity of the clusters formed in the low dimension that meets the user’s eyes. The primitive logic that is used in the density-based clustering approach is derived from a human-intuitive clustering method. In DBSCAN the resultant clusters are the dense regions in the given dataset, separated by regions of the lower density of points. The ability to identify arbitrary shaped clusters and robustness towards the outliers make DBSCAN more effective with the datasets that have relatively similar densities in the clusters. This algorithm is based on connecting data points within certain distance thresholds. For any given set of data points, DBSCAN separates data points into three categories. • Hub points - Points that are at the interior of a cluster (Centre). • Edge points - Points that fall within the neighborhood of a hub point that is not a hub point. • Noise points - Any point that is not a hub point or an edge point. The most important factor to consider when using DBSCAN is getting appropriate values for the parameters of the algorithm. Although there are a few parameters that can be tuned, the eps parameter, and the min_points (min_samples) are crucial. Among those two, min_points is used to identify a dense region in the dataset by considering the number of neighboring points required for a point to be considered as a dense region. The quality of clusters created by DBSCAN is governed by DBSCAN’s vital input parameter, epsilon. Epsilon is the threshold for the maximum distance between two data points, above which distance, the two points will no longer be neighbors. For a fair estimation of epsilon in each of the DBSCAN clustering, the “knee point” [34] of the plot between the 10 nearest neighbors in sorted order vs. the distance was calculated to find the optimal epsilon. Scikit- learn’s NearestNeighbors and “kneed’s” KneeLocator was used for the calculations. To improve the visibility of the clusters, Convex Hull algorithm [35] was used to draw boundaries around them. V. RESULTS For a formal evaluation two metrics were employed in addition to direct visual inspection. Distinct autoencoders were used with distinct number of layers and neurons for trinucleotide and tetranucleotide data to obtain separate visualizations. Clustering was then applied to the reduced dimensionality data obtained through the three dimensionality reduction approaches. As was discussed earlier, a comprehensive evaluation of clustering must take into account an intrinsic measure as well as an extrinsic measure. There are important metrics which can be used to evaluate clustering. But most of these metrics alone will not make it an effective performance evaluation on clustering as they can be biased. In this experiment one intrinsic metric and one extrinsic metric will be used for the evaluations and comparisons. We chose V-measure and Silhouette coefficient to evaluate the performance of overall clustering with respect to all three techniques that will be analyzed. Validity Measure (V-measure) [36] based on two other metrics called Homogeneity and Completeness, is an entropy based extrinsic clustering evaluation metric. It evaluates the successful clustering of data points with respect to their ground truth labels. Homogeneity and completeness are inversely proportional, and a good clustering should maintain a balance between those two metrics. Homogeneity evaluates how many data points in each cluster are with the same label. Maximum homogeneity is obtained by a clustering that has clusters that only have data points of the relevant class. This in turn results in zero entropy. Assume a clustering with N number of total data samples, C different class labels and K clusters. Assume also that the number of data points from class c in cluster k is ac,k. Cluster Identification in Metagenomics – A Novel Technique of Dimensionality Reduction through Autoencoders 6 International Journal on Advances in ICT for Emerging Regions March 2021 TABLE I CLUSTERING USING TRINUCLEOTIDES Data representation Trinucleotides V-measure Silhouette coefficient Autoencoder {32, 16, 2, 16, 32} 0.901 0.564 t-SNE 0.878 0.614 PCA 0.745 0.485 (4) (5) (6) Completeness is a symmetrical metric to homogeneity. A clustering is complete when all the data points from the same class are clustered together in each cluster. In a clustering, completeness can be evaluated using the conditional entropy of the cluster distribution in comparison with the label given. (7) (8) (9) V-measure, the metric that is used can be computed using weighted harmonic mean of homogeneity and completeness. (10) The second evaluation metric, silhouette coefficient [37] is a measure of distance. It gives a measure on how close each data point in one cluster is to other data points in the neighboring clusters. Silhouette coefficient ranges between -1 and 1. If the value of the coefficient is 0 that means the data point is on the inflection point of the two clusters. (11) The tetranucleotide and trinucleotide reductions of autoencoders were separately compared and evaluated with the reductions of t-SNE and PCA. Autoencoder generated reductions colored with using the ground truth label can be seen in Fig. 2. TABLE III CLUSTERING USING TETRANUCLEOTIDES Data representation Tetranucleotides V-measure Silhouette coefficient Autoencoder {136, 64, 2, 64, 136} 0.932 0.621 t-SNE 0.886 0.654 PCA 0.646 0.536 As visible, there are clear clusters and separations. The same 2D datapoints clustered using DBSCAN without taking their ground truth labels into consideration can be seen in Fig. 4. Worthy of notice is the close affinity of the two Clostridium bacteria, Clostridium phytofermentans and Clostridium beijerinckii.. As seen in Fig. 4., DBSCAN has managed to differentiate these two species to two different clusters but this is not always the case. In Fig. 3 and Fig. 5 with trinucleotide frequencies, the reduced dimensions do not distinctly classify the two Clostridium bacteria into two different clusters. But the close affinity of the two species on both occasions signals the preservation taxonomical information and relationships during dimensionality reduction when compared to t-SNE. In Comparison, t-SNE’s performance on the same 4-mer data can be found in Fig. 7 and Fig. 8. The distances between the clusters of the two Clostridium bacteria are arbitrary and conveys no information about the two species belonging to the same genus. Clostridium phytomenatas is in close affinity to all Microcytis, Mycobacterium, Erythrobacter and Rubrobacter as much as it is to Clostridium beijerinckii. Furthermore, DBSCAN seems to have identified separate clusters within the two Clostridium bacteria as well. PCA plots, Fig. 9 and Fig. 13 as seen to the naked eye and as also suggested by the evaluation metrics, are subpar. Well below the metrics for t-SNE and autoencoders. Note that a new optimized epsilon was calculated for each DBSCAN clustering, optimizing the algorithm to the data presented by each dimensionality reduction technique. Fig. 6. Gives the legend for plots colored using the ground truth labels. The autoencoder outperforms both t-SNE and PCA in all cases with regards to the V-measure. This improvement reflects the relatively lower loss of information compared to the ground truth labels. The formation of string-like shapes in t-SNE is another reason for the relatively lesser V-Measures of t-SNE compared to autoencoders. The breakage of these strings has more than once led DBSCAN to identify the data points of the same species as few different clusters. Silhouette coefficient, on the other hand, is highest in t- SNE. Compared to the autoencoder t-SNE’s Silhouette coefficient is always slightly higher. This is due to the arbitrary yet clear separation of data points of different species. This arbitrary yet clear separation comes at a greater cost. The arbitrary distancing of the different clusters by t- SNE has led to a loss in taxonomic relationships among species. On the contrary, autoencoder’s cluster distances are not arbitrary. The relative distances between clusters provide some useful insights into the taxonomic relationships in the real species the data points belong to. 7 K. Wijegunarathna#1*, U. Maduranga2*, S. Weerasinghe3*, I. Perera4*, A Wickaramarachchi5Ϯ March 2021 International Journal on Advances in ICT for Emerging Regions Fig. 2 Dimensionality reduction of tetranucleotide frequencies using autoencoder {136, 64, 2, 64, 136}. Coloured according to known true labels Fig. 3 Dimensionality reduction of trinucleotide frequencies using autoencoder {32, 16, 2, 16, 32}. Coloured according to known true labels VI. CONCLUSION AND FUTURE WORK The results obtained from the research backed by the superior results obtained by autoencoders back the potential of using autoencoders in the field of metagenomics for dimensionality reduction and visualization of metagenomic reads. The systematically optimized DBSCAN clustering algorithm has always managed to identify a number of clusters that is quite close to the actual number of microorganisms present in the sample. This congruency of the lower dimensions with the information in the higher dimensions is reflected in the improved V-Measures. The significantly higher V-measure produced in comparison with the t-SNE and PCA dimensionality reductions prove Autoencoder’s ability to preserve the intrinsic dimensionality of data in the process of dimensionality reduction. Silhouette coefficient does not consider whether the points in the same cluster actually belong to a single cluster in the higher dimensions. The autoencoder remains in close contention with t-SNE on the silhouette coefficient which is an intrinsic measure of clustering that only considers the visual integrity of the data in the lower dimensions. Not only have autoencoders outperformed PCA and t-SNE on the metrics front but it has also managed to preserve taxonomic data by placing the species of the same genus in Fig. 4 DBSCAN cluster identification with convex hull for tetranucleotide frequencies in Fig. 2. Black dots show noise points. Fig. 5 DBSCAN cluster identification with convex hull for trinucleotide frequencies in Fig. 3. Black dots show noise points. relatively closer affinity. Taxonomic data preservation ability of autoencoders standout in contrast to t-SNE’s shortcoming of arbitrarily separating clusters giving no relevance to species’ relationships. Additionally, unlike t-SNE’s faltering quality with growing data volume, autoencoders, being a deep learning technique thrives with growing data volumes. These results demand a place for autoencoders in bioinformatics. Noisy reads can however plague the ultimate analysability and interpretability of visualizations. Noise can result in loss of important insights and false interpretations. The use of denoising autoencoders to denoise large volumes of sequence data is another potential avenue of research. Other forms of autoencoders like the denoising autoencoders and variational autoencoders are potential techniques that can be integrated to improve metagenomic analysis and further the use of deep learning in the wider domain of bioinformatics. Fig. 6 Legend for Fig. 2, Fig 3, Fig 7, Fig 8, Fig 9 and Fig 13. Cluster Identification in Metagenomics – A Novel Technique of Dimensionality Reduction through Autoencoders 8 International Journal on Advances in ICT for Emerging Regions March 2021 Fig. 7 Dimensionality reduction of tetranucleotide frequencies using t-SNE . Coloured according to known true labels Fig. 8 Dimensionality reduction of trinucleotide frequencies using t-SNE. Coloured according to known true labels Fig. 9 Dimensionality reduction of tetranucleotide frequencies using PCA. Coloured according to known true labels Fig. 10 DBSCAN cluster identification with convex hull for tetranucleotide frequencies in Fig. 7. Black dots show noise points. Fig. 11 DBSCAN cluster identification with convex hull for trinucleotide frequencies in Fig. 8. Black dots show noise points. Fig. 12 DBSCAN cluster identification with convex hull for tetranucleotide frequencies in Fig. 9. Black dots show noise points. 9 K. Wijegunarathna#1*, U. Maduranga2*, S. Weerasinghe3*, I. Perera4*, A Wickaramarachchi5Ϯ March 2021 International Journal on Advances in ICT for Emerging Regions Fig. 13 Dimensionality reduction of trinucleotide frequencies using PCA. Coloured according to known true labels REFERENCES [1] J. Handelsman, M. R. Rondon, S. F. Brady, J. Clardy, and R. M. Goodman. “Molecular biological access to the chemistry of unknown soil microbes: a new frontier for natural products”. Chem. Biol. 5:R245–R249, 1998. [2] B. J. Baker, G. W. Tyson, R. I. Webb, J. Flanagan , P. Hugenholtz, E. E. Allen, J. F. Banfield, “Lineages of acidophilic archaea revealed by community genomic analysis”. Science 2006, 314:1933-1935. [3] C. C. Laczny, N. Pinel, N. Vlassis, and P. Wilmes, “Alignment-free visualization of metagenomic data by nonlinear dimension reduction,” Sci. Rep., vol. 4, pp. 1–12, 2014. [4] K. Sudarikov, A. Tyakht, and D. Alexeev, “Methods for the metagenomic data visualization and analysis,” Curr Issues Mol Biol, vol. 24, pp. 37– 58, 2017. [5] C. Ander, O. B. Schulz-Trieglaff, J. Stoye, and A. J. Cox, “metabeetl: high-throughput analysis of heterogeneous microbial populations from shotgun dna sequences,” in BMC bioinformatics, vol. 14, p. S2, Springer, 2013. [6] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, “Basic local alignment search tool,” Journal of molecular biology, vol. 215, no. 3, pp. 403–410, 1990. [7] M. A. Peabody, T. Van Rossum, R. Lo, and F. S. Brinkman, “Evaluation of shotgun metagenomics sequence classification methods using in silico and in vitro simulated communities,” BMC bioinformatics, vol. 16, no. 1, p. 362, 2015. [8] M. H. Mohammed, T. S. Ghosh, N. K. Singh, and S. S. Mande, “Sphinx—an algorithm for taxonomic binning of metagenomic sequences,” Bioinformatics, vol. 27, no. 1, pp. 22–30, 2011. [9] J. M. Janda and S. L. Abbott, “16s rrna gene sequencing for bacterial identification in the diagnostic laboratory: pluses, perils, and pitfalls,” Journal of clinical microbiology, vol. 45, no. 9, pp. 2761–2764, 2007. [10] N. A. Bokulich, S. Subramanian, J. J. Faith, D. Gevers, J. I. Gordon, R. Knight, D. A. Mills, and J. G. Caporaso, “Quality-filtering vastly improves diversity estimates from illumina amplicon sequencing,” Nature methods, vol. 10, no. 1, pp. 57–59, 2013. [11] S. Wold, K. I. M. Esbensen, and P. Geladi, “Principal Component Analysis,” vol. 2, pp. 37–52, 1987. [12] C. R. García-Alonso, L. M. Pérez-Naranjo, and J. C. Fernández- Caballero, “Multiobjective evolutionary algorithms to identify highly autocorrelated areas: The case of spatial distribution in financially compromised farms,” Ann. Oper. Res., vol. 219, no. 1, pp. 187–202, 2014. [13] G. F. Rall, M. J. Schnell, B. M. Davis, “Tasks, Techniques, and Tools for Genomic Data Visualization,” Comput Graph Forum., vol. 176, no. 1, pp. 139–148, 2019. [14] L. J. P. Van Der Maaten, E. O. Postma, and H. J. Van Den Herik, “Dimensionality Reduction: A Comparative Review,” J. Mach. Learn. Res., vol. 10, pp. 1–41, 2009. Fig. 14 DBSCAN cluster identification with convex hull for trinucleotide frequencies in Fig. 13. Black dots show noise points. [15] M. Hauskrecht, R. Pelikan, M. Valko, J. Lyons-Weiler, “Feature Selection and Dimensionality Reduction in Genomics and Proteomics,” Fundamentals of Data Mining in Genomics and Proteomics, Springer, pp.149-172, 2006, [16] G. Hinton and S. Roweis, “Stochastic neighbor embedding,” in Advances in Neural Information Processing Systems, 2003. [17] L. van der Maaten, “Barnes-Hut-SNE,” 1st Int. Conf. Learn. Represent. ICLR 2013 - Conf. Track Proc., pp. 1–11, 2013. [18] J. Barnes and P. Hut., “A hierarchical O(N log N) force-calculation algorithm,” Nature, 324(4):446–449, 1986. [19] G. J. Dick et al., Community-wide analysis of microbial genome sequence signatures. Genome Biol 10, R85 (2009). [20] T. Kohonen: Self-organizing maps. Volume 0. New York: SpringerVerlag; 1997. [21] A. Gisbrecht, B. Hammer, B. Mokbel, and A. Sczyrba, “Nonlinear dimensionality reduction for cluster identification in metagenomic samples,” Proc. Int. Conf. Inf. Vis., pp. 174–179, 2013. [22] C. M. Bishop, M. Svensn, and C. K. I. Williams, “Gtm: The generative topographic mapping,” Neural Computation, vol. 10, pp. 215–234, 1998. [23] D. Wang and J. Gu, “VASC: Dimension Reduction and Visualization of Single-cell RNA-seq Data by Deep Variational Autoencoder,” Genomics, Proteomics Bioinforma., vol. 16, no. 5, pp. 320–331, 2018. [24] G. Eraslan, L. M. Simon, M. Mircea, N. S. Mueller, and F. J. Theis, “Single-cell RNA-seq denoising using a deep count autoencoder,” Nat. Commun., vol. 10, no. 1, pp. 1–14, 2019. [25] E. Pierson and C. Yau, “ZIFA: Dimensionality reduction for zero- inflated single-cell gene expression analysis,” Genome Biol., vol. 16, no. 1, pp. 1–10, 2015. [26] Z. Wang and Y. Wang, “Extracting a biologically latent space of lung cancer epigenetics with variational autoencoders,” BMC Bioinformatics, vol. 20, no. Suppl 18, pp. 1–7, 2019. [27] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating errors,” Nature, vol. 323, no. 6088, pp. 533–536, 1986. [28] A. J. Holden et al., “Reducing the Dimensionality of Data with Neural Networks,” Science (80-. )., vol. 313, no. July, pp. 504–507, 2006. [29] W. Wang, Y. Huang, Y. Wang, and L. Wang, “Generalized Autoencoder: A Neural Network Framework for Dimensionality Reduction DeepVision: Deep Learning for Computer Vision 2014,” CVPR Work., pp. 490–497, 2014. [30] T. Abe, H. Sugawara, S. Kanaya, M. Kinouchi, and T. Ikemura, “Self- Organizing Map (SOM) unveils and visualizes hidden sequence characteristics of a wide range of eukaryote genomes,” Gene 365, 27– 34 (2006). [31] G. Varoquaux, L. Buitinck, G. Louppe, O. Grisel, F. Pedregosa, and A. Mueller, “Scikit-learn,” GetMobile Mob. Comput. Commun., vol. 19, no. 1, pp. 29–33, 2015. [32] M. Daszykowski and B. Walczak, “Density-Based Clustering Methods,” Compr. Chemom., vol. 2, pp. 635–654, 2009. Cluster Identification in Metagenomics – A Novel Technique of Dimensionality Reduction through Autoencoders 10 International Journal on Advances in ICT for Emerging Regions March 2021 [33] E. Amigó, J. Gonzalo, J. Artiles, and F. Verdejo, “A comparison of extrinsic clustering evaluation metrics based on formal constraints,” Inf. Retr. Boston., vol. 12, no. 4, pp. 461–486, 2009. [34] B. Chazelle, “An optimal convex hull algorithm in any fixed dimension,” Discrete Comput. Geom., vol. 10, no. 1, pp. 377–409, 1993. [35] V. Satopää, J. Albrecht, D. Irwin, and B. Raghavan, “Finding a ‘kneedle’ in a haystack: Detecting knee points in system behavior,” Proc. - Int. Conf. Distrib. Comput. Syst., pp. 166–171, 2011. [36] A. Rosenberg and J. Hirschberg, “V-Measure: A conditional entropy- based external cluster evaluation measure,” EMNLP-CoNLL 2007 - Proc. 2007 Jt. Conf. Empir. Methods Nat. Lang. Process. Comput. Nat. Lang. Learn., no. June, pp. 410–420, 2007. [37] M. Pant, T. Radha, and V. P. Singh, “Particle swarm optimization using Gaussian inertia weight,” Proc. - Int. Conf. Comput. Intell. Multimed. Appl. ICCIMA 2007, vol. 1, pp. 97–102, 2008. 2. {136, 64, 2, 64, 136}