key: cord-347547-makm0j09
authors: Duran-Frigola, Miquel; Bertoni, Martino; Blanco, Roi; Martínez, Víctor; Pauls, Eduardo; Alcalde, Víctor; Turon, Gemma; Villegas, Núria; Fernández-Torras, Adrià; Pons, Carles; Mateo, Lídia; Guitart-Pla, Oriol; Badia-i-Mompel, Pau; Gimeno, Aleix; Soler, Nicolas; Brun-Heath, Isabelle; Zaragoza, Hugo; Aloy, Patrick
title: Bioactivity Profile Similarities to Expand the Repertoire of COVID-19 Drugs
date: 2020-07-16
journal: J Chem Inf Model
DOI: 10.1021/acs.jcim.0c00420
doc_id: 347547
cord_uid: makm0j09

Until a vaccine becomes available, the current repertoire of drugs is our only therapeutic asset to fight the SARS-CoV-2 outbreak. Indeed, emergency clinical trials have been launched to assess the effectiveness of many marketed drugs, which aim to decrease viral load through several mechanisms. Here, we present an online resource, based on small-molecule bioactivity signatures and natural language processing, to expand the portfolio of compounds with potential to treat COVID-19. By comparing the set of drugs reported to be potentially active against SARS-CoV-2 to a universe of 1 million bioactive molecules, we identify compounds that display chemical and functional features analogous to those of the current COVID-19 candidates. Searches can be filtered by level of evidence and mechanism of action, and results can be restricted to drug molecules or include the much broader space of bioactive compounds. Moreover, we allow users to contribute COVID-19 drug candidates, which are automatically incorporated into the pipeline once per day. The computational platform, as well as the source code, is available at https://sbnb.irbbarcelona.org/covid19.

A new coronavirus, named SARS-CoV-2, is the causative agent of the current 2019-2020 viral pneumonia (COVID-19) outbreak [1,2], which is already affecting millions of people worldwide and causing hundreds of thousands of deaths.
The COVID-19 pandemic has prompted an unprecedented effort by the scientific community to understand its molecular constituents and find an effective treatment to mitigate viral infectiveness and symptoms. This is reflected in the more than 6000 COVID-related publications that have appeared in the past few weeks [3]. Huge efforts are being invested in the discovery of an effective vaccine, but even the most optimistic scenarios suggest that it will not be available until 2021. Other drug discovery projects have been launched to target specific viral proteins, particularly the main protease (Mpro) [4]. However, these initiatives, even if successful, could take even longer to deliver an approved drug. Thus, the repurposing of existing drugs is our best chance to face the current outbreak therapeutically, since approved drugs have known safety profiles and are ready to be tested in humans. For instance, several compounds initially developed to treat HIV (e.g., lopinavir/ritonavir) [5] or Ebola (e.g., remdesivir) [6], as well as antimalarial drugs (e.g., hydroxychloroquine) [7], are being tested against COVID-19. Indeed, we conducted a limited review of the most relevant scientific literature and identified over 200 compounds that are potentially active against COVID-19 with different levels of experimental support, ranging from purely computational predictions to preclinical results and drugs already in clinical trials.

We now exploit this literature mining effort to identify other compounds with the potential to be effective against COVID-19. To this end, we use the Chemical Checker (CC), a resource that provides processed, harmonized, and integrated bioactivity data for about 1 million small molecules [8]. In the CC, bioactivity data are expressed in a vector format, which naturally extends the notion of chemical similarity between compounds to similarities between bioactivity profiles.
The CC organizes data into five levels of increasing complexity, ranging from drug binding profiles to clinical outcomes, and thus enables similarity searches that should be mechanistically and clinically relevant. In the current resource, we use CC signatures to identify similarities between bioactive compounds and the list of current COVID-19 drug candidates (i.e., bait compounds). The similarity search is performed systematically across the large chemical space encompassed by the CC, thereby substantially expanding the portfolio of molecules potentially effective against SARS-CoV-2. Results are stratified between drug molecules and a broader medicinal chemistry space, thus offering ranked lists of compounds that should be of value for drug repurposing endeavors as well as preclinical screening campaigns.

Our resource capitalizes on an ongoing literature curation effort by our group. Additionally, we welcome contributions from the broader scientific community via a web form, which allows users to include compounds under investigation in their laboratories or to update the evidence level as new COVID-19 experiments accumulate.

The scientific evidence supporting COVID-19 drug candidates is variable: some compounds come from computational predictions, some have proven their value in preclinical tests, others are approved drugs with a therapeutic indication unrelated to infectious diseases, and, finally, some are drugs currently used to fight SARS-CoV-2-related pathogens. The mechanisms of action (MoA) suggested to confer efficacy are also variable, ranging from immunomodulators to protease inhibitors. During curation, we classify literature COVID-19 candidates by their level of evidence and MoA (Figure 1). As of April 18, 2020, we had identified 230 small molecules that have been suggested as potential treatments for COVID-19.

Starting from the SMILES representation of a compound, we derive CC bioactivity signatures for each COVID-19 literature bait compound.
We then run bioactivity similarity searches against the ∼1 million bioactive molecules characterized in the CC and keep the top 10,000 most similar compounds for each search type. Likewise, we conduct conventional similarity searches based solely on 2D representations of the compounds (2048-bit Morgan fingerprints, radius 2). Similarities are expressed as empirical P-values (−log10 scale) derived from the expected similarity distribution across the full search space. A simple support measure is provided for each compound by adding up the number of similar COVID-19 drugs (weighted by −log10 P-value and level of evidence, as shown in Figure 1).

In addition, we complement our literature curation effort with a further level of evidence, namely, text-mining, based on the automatic detection of experiments (bioassays) that could be relevant to COVID-19. More specifically, we process the text description of the ∼1.2 million bioassays catalogued in the ChEMBL database and rank them according to their relevance to the current corpus of about 30,000 articles related to COVID-19 and other coronavirus infections [9]. ChEMBL bioassays [10] are ranked using two complementary approaches. (i) We construct a retrieval query from the bioassay descriptions and use it to score each of the paragraphs and abstracts contained in the article collection. We then use statistics of the score distribution of the top-scoring documents to rank the bioassays. (ii) We manually label a set of (seed) molecules that tested positive in ∼100 bioassays relevant to COVID-19. We then automatically identify compounds from all the bioassay descriptions and compute their contextual embeddings. Finally, we rank the bioassays according to their cosine similarity to the seed molecules. We then keep the 1000 most relevant COVID-19 literature bioassays, as ranked by either text-mining approach, and identify those bioactive molecules within the CC universe that tested positive (<10 μM) in at least one of them.
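The empirical P-value and support measure described earlier in this section can be illustrated with a minimal sketch. It is not the code used by the resource: the function names, the pseudocount in the P-value, the evidence labels and weights, and the choice to multiply −log10 P-value by the evidence weight are all assumptions made for illustration (the text states only that both quantities enter the weighting, with the actual weights shown in Figure 1).

```python
import math
from bisect import bisect_left


def empirical_neg_log10_p(similarity, background_sorted):
    """Return -log10 of an empirical P-value for an observed similarity.

    The P-value is the fraction of background similarities that are at
    least as large as the observed one. A +1 pseudocount (an assumption)
    keeps the P-value strictly positive.
    """
    n = len(background_sorted)
    n_at_least = n - bisect_left(background_sorted, similarity)
    p_value = (n_at_least + 1) / (n + 1)
    return -math.log10(p_value)


def support_score(hits, evidence_weight):
    """Sum over similar bait compounds of (-log10 P) x (evidence weight).

    `hits` is a list of (bait_name, neg_log10_p, evidence_level) tuples.
    Multiplying the two factors is a hypothetical weighting scheme.
    """
    return sum(nlp * evidence_weight[level] for _, nlp, level in hits)


# Toy background distribution of similarities across the search space.
background = sorted(i / 100 for i in range(99))

# Hypothetical evidence weights; the real values are not given in the text.
evidence_weight = {"computational": 1.0, "preclinical": 2.0, "clinical": 3.0}

hits = [
    ("bait_a", empirical_neg_log10_p(0.995, background), "clinical"),
    ("bait_b", empirical_neg_log10_p(0.5, background), "computational"),
]
score = support_score(hits, evidence_weight)
print(round(score, 3))
```

In this toy setting, a candidate that is highly similar to a bait with strong evidence contributes far more to the support than one that is moderately similar to a purely computational prediction, which is the intended effect of the weighting.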
Finally, we cross these results with the 10,000 compounds obtained from the similarity searches described above and assign an extra literature-evidence level (text-mining) to those in common, which are then used as bait compounds. The pipeline runs automatically every day, so that we always provide the most up-to-date results. Searches are precomputed for each evidence strength and MoA.

Figure 1. Methodological strategy. We use the list of COVID-19 compounds extracted from the literature, with different levels of experimental evidence, as bait to search for compounds with similar bioactivity or chemical features among the 800,000 molecules contained in the CC. We also include compounds that are positive in relevant bioassays, identified through automatic mining of the COVID-19 literature, and for which we find further bioactivity support in the CC. We keep and rank the top 10,000 most similar molecules to bait compounds and weight them to favor molecules with similar properties to those with higher levels of experimental evidence.

Results of the large-scale similarity search are made available as a web resource at https://sbnb.irbbarcelona.org/covid19. The interface contains five tabs:

Candidates. We provide the 10,000 molecules, within the CC universe of 1 M bioactive compounds, that are most similar to the COVID-19 bait compounds collected from the literature (Figure 2). The precomputed similarity matrix can be queried to extract candidates that fulfill properties of interest by selecting among the levels of evidence for the bait compounds as well as their MoA. In addition, the resulting list of molecules can be sorted following different criteria, including whether they are approved/experimental drugs, the cumulative level of support, or their similarity to specific COVID-19 literature drugs. Full and partial tables can be downloaded and exported to several formats, including the SMILES string representation for all the compounds.

Literature.
This tab lists the COVID-19 bait compounds extracted from the literature, together with their level of experimental evidence and, if known, the MoA that confers efficacy against SARS-CoV-2.

Documentation. Here, we present a brief description of the methodological strategy and, more importantly, we offer updated statistics and benchmarks of the resource. In particular, we quantify the number of literature bait compounds available at each level of evidence and MoA (Figure 3A,B) and project CC signatures on a 2D plane to offer a global view of the chemical space explored by our resource (Figure 3C,D). We see that, while significantly diverse, COVID-19 bait compounds cluster in certain regions of the chemical space, and we find new candidate molecules in their vicinity. Reassuringly, when we analyze the therapeutic categories of the top-ranked candidates, we retrieve, as expected, a significant number of anti-infective drugs (Figure 4A). Other therapeutic categories, such as hormonal treatments, are enriched further down the ranking, after the highest-ranking compounds. Note that, for this enrichment analysis, only drug molecules could be considered, since ATC annotations are not available for most of the compounds in the CC. Finally, we perform a leave-one-out cross-validation to assess whether bait compounds can be retrieved by our similarity search. Figure 4B shows that known COVID-19 drugs are significantly up-ranked when using and evaluating all levels of evidence.

Contribute. Through this form, users can contribute to the resource by including their molecules of interest. We require the name and SMILES representation of the molecules as well as their level of experimental evidence, MoA, and references, if available. After each submission, we manually check the data and incorporate it in the next daily update.

Code. This links to the GitLab repository containing the complete code to run the pipeline and analyze results.
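The leave-one-out validation mentioned under the Documentation tab can be illustrated with a minimal sketch. It is a toy stand-in rather than the pipeline's implementation: compounds are represented as sets of "on" fingerprint bits and compared with Tanimoto similarity (an assumption; the resource searches with CC signatures and Morgan fingerprints), candidates are scored by their maximum similarity to the remaining baits (also an assumption), and all names and data below are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two sets of 'on' bit indices."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def leave_one_out_ranks(baits, library):
    """Rank each bait after hiding it from the bait set.

    For every bait, the remaining baits act as queries. The hidden bait and
    every library compound are scored by their maximum similarity to the
    remaining baits, and the hidden bait's rank (1 = best) is recorded.
    Requires at least two baits.
    """
    ranks = {}
    for held_out, held_fp in baits.items():
        remaining = [fp for name, fp in baits.items() if name != held_out]

        def score(fp):
            return max(tanimoto(fp, other) for other in remaining)

        held_score = score(held_fp)
        # Count library compounds that score strictly higher than the bait.
        better = sum(1 for fp in library.values() if score(fp) > held_score)
        ranks[held_out] = better + 1
    return ranks


# Hypothetical toy data: three related baits and three unrelated decoys.
baits = {
    "bait_a": {1, 2, 3, 4},
    "bait_b": {1, 2, 3, 5},
    "bait_c": {2, 3, 4, 6},
}
library = {
    "decoy_1": {10, 11, 12},
    "decoy_2": {1, 20, 21, 22},
    "decoy_3": {30, 31},
}
ranks = leave_one_out_ranks(baits, library)
print(ranks)
```

If held-out baits consistently rank near the top of the library, the similarity search is able to recover known candidates, which is the behavior the validation is meant to check.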
Overall, we believe that the tool presented herein explores regions of the bioactive chemical space that could be relevant to COVID-19 treatment. Our web-based resource is updated daily and can be used to dynamically search for candidates related to COVID-19 drugs with varying levels of evidence and MoA. Therefore, our resource will be useful to a broad range of COVID-19 drug discovery approaches, ranging from those seeking a repurposing opportunity to those starting from the in vitro screening of compounds.

A Novel Coronavirus from Patients with Pneumonia in China
A new coronavirus associated with human respiratory disease in China
Structure of Mpro from SARS-CoV-2 and discovery of its inhibitors
Compassionate Use of Remdesivir for Patients with Severe Covid-19
Quantifying treatment effects of hydroxychloroquine and azithromycin for COVID-19: a secondary analysis of an open label non-randomized clinical trial
Extending the small molecule similarity principle to all levels of biology with the Chemical Checker
ChEMBL: towards direct deposition of bioassay data

The authors declare no competing financial interest.

This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement no. 101003633 (RiPCoN).