ADME prediction with KNIME: A retrospective contribution to the second “Solubility Challenge”


doi: http://dx.doi.org/10.5599/admet.979   209 

ADMET & DMPK 9(3) (2021) 209-218; doi: https://doi.org/10.5599/admet.979  

 
Open Access : ISSN : 1848-7718  

http://www.pub.iapchem.org/ojs/index.php/admet/index   

Original scientific paper 

Gabriela Falcón-Cano¹, Christophe Molina*², and Miguel Ángel Cabrera-Pérez*¹,³,⁴

¹ Unit of Modelling and Experimental Biopharmaceutics. Centro de Bioactivos Químicos. Universidad Central “Marta Abreu” de las Villas. Santa Clara 54830, Villa Clara, Cuba
² PIKAÏROS S.A., 31650 Saint Orens de Gameville, France
³ Department of Pharmacy and Pharmaceutical Technology, University of Valencia, Burjassot 46100, Valencia, Spain
⁴ Department of Engineering, Area of Pharmacy and Pharmaceutical Technology, Miguel Hernández University, 03550 Sant Joan d'Alacant, Alicante, Spain

*Corresponding Authors: E-mail: macabreraster@gmail.com; Tel.: +53-42-281473; Fax: +53-42-281130; E-mail: 
christophe.molina@pikairos.com.  

Received: March 09, 2021; Revised: June 21, 2021; Available online: July 12, 2021  

 
Abstract 

Computational models for predicting aqueous solubility from the molecular structure represent a
promising strategy from the perspective of drug design and discovery. Since the first “Solubility Challenge”,
these initiatives have marked the state of the art of the modelling algorithms used to predict drug solubility. In
this regard, the quality of the input experimental data and its influence on model performance have been
frequently discussed. In our previous study, we developed a computational model for aqueous solubility
based on recursive random forest approaches. The aim of the current commentary is to analyse the
performance of this already trained predictive model on the molecules of the second “Solubility
Challenge”. Even though our training set has inconsistencies related to the pH, solid form and temperature
conditions of the solubility measurements, the model was able to predict the two sets from the second
“Solubility Challenge” with statistics comparable to those of the top-ranked models. Finally, we provide a
KNIME automated workflow to predict the aqueous solubility of new drug candidates during the early stages
of drug discovery and development, ensuring the applicability and reproducibility of our model.

©2021 by the authors. This article is an open-access article distributed under the terms and conditions of the Creative Commons 
Attribution license (http://creativecommons.org/licenses/by/4.0/). 

Keywords 

Second Solubility Challenge; Quantitative Structure-Property Relationship (QSPR); KNIME; aqueous 
solubility; ADME; machine learning; Random Forest; supervised recursive variable selection 

 
Introduction 

Pharmacokinetic parameters are usually influenced by a combination of different physicochemical 

properties. Among these, solubility has occupied a very important role due to its influence on the 

absorption process. The need to balance solubility, avoiding excess or insufficiency, is a challenge from the 

perspective of drug discovery. 

In this regard, several research efforts have been made to provide accurate prediction of aqueous 

solubility through Quantitative Structure-Property Relationship (QSPR) approaches. Undoubtedly, the first 

and second “Solubility Challenges” proposed by Llinas et al. have been a very effective indicator of the

progress and state of the art of solubility estimation [1,2]. Recently, Llinas et al. have reviewed the results of

the second “Solubility Challenge” to analyse the evolution of the computational methods used in this 

prediction task and the influence of data quality on the results [3].  

 In our previous publication, we presented a new method based on recursive random forest approaches 

to predict aqueous solubility values of drug and drug-like molecules [4]. It was based on the development 

of two novel recursive machine-learning approaches used for data cleaning and variable selection, and a 

consensus model generated by the combination of regression and classification algorithms. This model was 

able to provide good solubility prediction compared to many of the models described in the literature. 

Considering that our model was developed from a database of aqueous solubility values with limited 

information on the experimental conditions of the solubility assay, could our model successfully predict the 

intrinsic solubility values of the two sets of drugs used in the second "Solubility Challenge"? 

The present study describes the performance of our model with the molecules of the second “Solubility 

Challenge” and the comparison of the results with those obtained with the best performing models of the 

competition. It is necessary to clarify that, for this task, the model was not trained, retrained or optimized 

based on the molecules of the challenge tests, i.e., the model parameters or hyper-parameters remained 

exactly the same as those set in previously published work [4]. 

Materials and methods 

Challenge sets 

The second “Solubility Challenge” consisted of evaluating the intrinsic solubility estimation of two sets of 

drugs. The first set is composed of 100 drugs with an average inter-laboratory standard deviation of

~0.17 log units. The second test set consists of 32 “difficult” drugs, characterized by poor inter-laboratory

reproducibility (standard deviation ~0.62 log units). A detailed list of these molecules is given in a

previous paper [3].

Software 

The Konstanz Information Miner (KNIME) is a free and public software tool that has become one of the 

main analytical platforms for innovation, data mining and machine learning. The flexibility of workflows 

developed in KNIME to include different tools allows users to read, create, edit, train and test machine 

learning models, greatly facilitating the automation of predictions and application by any user [5,6]. In this 

study, we used the open source software KNIME Analytics Platform version 4.0.2 [7] and its free

complementary extensions for transformation, analysis, modelling, data visualization and data prediction. 

For the generation of molecular descriptors from structures, the “Descriptor” node from “alvaDesc” 

extension [8] and the “RDKit Descriptor” node [9] were employed.  

Modelling dataset 

To predict the molecules of the second “Solubility Challenge”, we used as the training set the curated set of aqueous solubility values published in our previous paper [4]. This set combines two large aqueous solubility databases [10,11]. For each molecule, taking the SMILES (Simplified Molecular Input Line Entry Specification) code as input format, a structure cleaning, standardization, and duplicate removal protocol was developed. The InChI (IUPAC International Chemical Identifier) code was used for duplicate identification, and the standard deviation among experimental measurements was computed. A detailed description of this procedure is given in our previous article [4]. Although the hypothesis that the quality of the experimental data is the main limiting factor in predicting aqueous solubility has been challenged [12], any variability in the experimental protocol is always “noise” for in silico modelling purposes. In this sense, our model faced several challenges: 1) the pH value for the solubility measurement of the collected compounds was not stated, 2) the solid form of the molecule (polymorphs, hydrates, solvates, amorphous) was not characterized in the reported solubility measurements, 3) it was not possible to verify the type of solubility measurement (kinetic or thermodynamic) and 4) the experimental measurement method was not specified.
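
The duplicate-handling step can be illustrated with a minimal Python sketch. It is not the KNIME workflow itself: it assumes the InChI strings have already been generated from the standardized structures, the function name `merge_duplicates` is hypothetical, and the use of the population standard deviation (so that a single measurement yields 0) is an assumption, since the text does not state which estimator was used.

```python
from collections import defaultdict
from statistics import mean, pstdev


def merge_duplicates(records):
    """Group (inchi, log_s) pairs by InChI and summarise each group.

    Returns {inchi: (n_measurements, mean_log_s, std_log_s)}.
    The population standard deviation is an assumption of this sketch;
    it gives 0.0 for molecules with a single reported value.
    """
    groups = defaultdict(list)
    for inchi, log_s in records:
        groups[inchi].append(log_s)
    return {
        inchi: (len(values), mean(values), pstdev(values))
        for inchi, values in groups.items()
    }


# Placeholder identifiers and values, for illustration only.
example = [("InChI-A", -2.0), ("InChI-A", -3.0), ("InChI-B", -1.5)]
summary = merge_duplicates(example)
```

Keying on a canonical identifier rather than on the raw SMILES avoids treating two different SMILES spellings of the same structure as distinct molecules.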

Modelling algorithm 

Due to the uncertainty of the database, a rigorous data selection protocol was applied during the development of the original model in order to identify molecules with potentially unreliable values. As a first step, we selected a RELIABLE Test Set, consisting of molecules with more than one reported measurement and with an inter-source standard deviation greater than 0 and less than 1 logarithmic unit. A standard deviation above 1 logarithmic unit was used as the threshold to flag unreliable samples. This RELIABLE Test Set was used for model optimization.
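
As a minimal sketch of this selection rule, the following Python function splits per-molecule summaries into the two groups. The function name and the input layout are hypothetical, and the treatment of a standard deviation of exactly 1 (placed here in the unreliable group) is an assumption, because the text only specifies "less than 1" and "greater than 1".

```python
def split_by_reliability(summary, max_std=1.0):
    """Split molecules into RELIABLE and UNRELIABLE groups.

    summary: {mol_id: (n_measurements, mean_log_s, std_log_s)}
    RELIABLE: more than one measurement and 0 < std < max_std.
    Everything else (std == 0, std >= max_std) is UNRELIABLE.
    """
    reliable, unreliable = {}, {}
    for mol_id, (n, mean_log_s, std) in summary.items():
        if n > 1 and 0.0 < std < max_std:
            reliable[mol_id] = mean_log_s
        else:
            unreliable[mol_id] = mean_log_s
    return reliable, unreliable
```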

From the QSPR perspective, it is necessary to select a set of descriptors that leads to the most predictive 

model and facilitates model interpretation. To this end, we developed a recursive variable selection 

algorithm based on regression random forest (RRF). RRF is a widely used ensemble method that assembles 

multiple decision trees and outputs the consensus predictions from individual trees [13]. It is recognized for 

its ability to select “important” descriptors. Based on this ability, we used the number of occurrences of a

variable in the RRF as a measure of the descriptor's importance, combined with a correlation analysis 

between variables to avoid collinearity. Each numerical descriptor was injected in the RRF in two ways: non-

shuffled and shuffled. Once the individual decision trees were trained and extracted from the ensemble, 

the total number of occurrences of each variable was calculated. Only variables with a number of 

occurrences greater than a marginal threshold of 110 were retained. Among those, variables were 

discarded if the non-shuffled variable had a number of occurrences lower than the number of occurrences 

of its homologous shuffled variable. All shuffled variables were eventually discarded too. The final set of 

variables was selected recursively by initially computing the linear correlation between variables, and then 

keeping only those with the highest number of occurrences among variables with a correlation coefficient 

greater than a threshold of 0.51 between them.  
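
The filtering logic described above can be sketched in plain Python, taking the occurrence counts and pairwise correlations as given inputs rather than extracting them from a trained forest. This is a simplified single-pass illustration under stated assumptions: the function name, the descriptor names and the counts are hypothetical, the correlation is compared in absolute value, and the recursive procedure of the original workflow is approximated by a greedy ranking.

```python
def select_variables(occurrences, shuffled_occurrences, correlation,
                     min_occurrences=110, max_correlation=0.51):
    """Greedy sketch of occurrence-based variable selection.

    occurrences / shuffled_occurrences: {name: count in the forest}
    correlation: {frozenset((a, b)): linear correlation coefficient}
    """
    # Keep variables above the occurrence threshold that are used at
    # least as often as their shuffled counterpart.
    candidates = [
        name for name, count in occurrences.items()
        if count > min_occurrences
        and count >= shuffled_occurrences.get(name, 0)
    ]
    # Rank by occurrences; within a correlated group only the most
    # frequently used variable survives.
    candidates.sort(key=lambda name: occurrences[name], reverse=True)
    selected = []
    for name in candidates:
        if all(abs(correlation.get(frozenset((name, kept)), 0.0)) <= max_correlation
               for kept in selected):
            selected.append(name)
    return selected


# Illustrative counts only; these are not values from the study.
occ = {"logP": 300, "TPSA": 250, "MW": 200, "nRing": 90, "noise": 120}
shuf = {"logP": 40, "TPSA": 30, "MW": 50, "nRing": 20, "noise": 150}
corr = {frozenset(("logP", "MW")): 0.8, frozenset(("logP", "TPSA")): -0.3}
kept = select_variables(occ, shuf, corr)  # ['logP', 'TPSA']
```

Comparing each descriptor with its shuffled copy gives a built-in null reference: a descriptor that is selected by the trees no more often than its permuted version carries no usable signal.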

In an attempt to reduce the uncertainty of the data, independent of any external set, a cleaning 

procedure based on an RRF approach was developed. This procedure uses the Prediction Variance (PV) of 

the RRF as a metric to discriminate unreliable samples. The PV is an RRF score that highlights the variability 

of each individual prediction with respect to the mean. A high PV can be a sign of anomalous behaviour or 

uncertainty. This procedure was applied to the UNRELIABLE Set, i.e. molecules with aqueous solubility 

standard deviation between sources equal to 0 or greater than 1. To set the parameters of this algorithm,

the minimization of the root mean squared error (RMSE) of the RELIABLE Test Set was used as the objective

function. First, the UNRELIABLE Set was randomly divided into two subsets of equal size. A

regression random forest was trained on one of the two sets and used to predict the aqueous solubility and 

PV of the other set. In addition, the PV of the out-of-bag samples was also calculated. Recursively, 

molecules were classified as within the PV threshold (CLEAN data) or alternatively as beyond the PV 

threshold (UNCLEAN data), until no molecules changed from CLEAN to UNCLEAN labelled set or vice versa.  
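
The labelling step at the core of this procedure can be sketched as follows. The sketch assumes that the PV of a molecule is the variance of the individual tree predictions around their mean, which is one reading of the definition above; the function names are hypothetical, and whether the threshold itself counts as CLEAN is also an assumption. Retraining the forest and repeating the labelling until no molecule changes set is left out.

```python
from statistics import pvariance


def prediction_variance(tree_predictions):
    """Variance of the per-tree predictions for one molecule (assumed PV)."""
    return pvariance(tree_predictions)


def label_by_pv(per_molecule_tree_predictions, pv_threshold):
    """Label each molecule CLEAN (PV within threshold) or UNCLEAN."""
    labels = {}
    for mol_id, predictions in per_molecule_tree_predictions.items():
        pv = prediction_variance(predictions)
        labels[mol_id] = "CLEAN" if pv <= pv_threshold else "UNCLEAN"
    return labels
```

In a full implementation, this labelling would be called inside a loop that retrains the forest on the current CLEAN set and stops when two consecutive passes yield identical labels.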

Using the CLEAN set, a Gradient Boosting Model (GBM) was trained for classification using logS = -2 as

the cut-off to label molecules as either “highly soluble or soluble” or “slightly soluble or insoluble”. Two

independent RRF models were developed based on these two subsets of labelled molecules and one more 

RRF model was trained on all CLEAN data. Finally, the average prediction among the three RRF models was

taken as the final prediction value. The parameters of all models were optimized based on the RMSE

minimization of the RELIABLE Test Set. Full details on our developed algorithm are given in our previously

published paper [4].

Second “Solubility Challenge” prediction 

First, we ensured that all test set molecules found in the initial source set used as the training set were 

removed. Since the model was previously validated using the RELIABLE Test Set and by 5-fold cross-

validation, we used the entire database (including the RELIABLE Test Set) to predict the test challenge 

samples. To analyse the performance of the solubility regression models, two types of coefficient of

determination (r²), root mean squared error (RMSE), mean absolute error (MAE), bias and the percentage of

molecules with an absolute error less than 0.5 logarithmic units (% 0.5 log) were calculated.
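
For readers who want to reproduce these statistics outside KNIME, a minimal Python sketch is given here. It follows the definitions listed in the Abbreviations section (residuals taken as observed minus calculated); the function name and dictionary keys are hypothetical. The Pearson variant is computed as the squared correlation coefficient, which equals the coefficient of determination after a least-squares linear rescaling of the predictions.

```python
from math import sqrt


def regression_metrics(y_obs, y_calc, tol=0.5):
    """Validation statistics for observed vs. calculated log S values."""
    n = len(y_obs)
    mean_obs = sum(y_obs) / n
    mean_calc = sum(y_calc) / n
    residuals = [o - c for o, c in zip(y_obs, y_calc)]
    ss_res = sum(r * r for r in residuals)
    ss_tot = sum((o - mean_obs) ** 2 for o in y_obs)
    cov = sum((o - mean_obs) * (c - mean_calc)
              for o, c in zip(y_obs, y_calc))
    ss_calc = sum((c - mean_calc) ** 2 for c in y_calc)
    return {
        "r2_validation": 1.0 - ss_res / ss_tot,
        "r2_pearson": cov * cov / (ss_tot * ss_calc),
        "rmse": sqrt(ss_res / n),
        "mae": sum(abs(r) for r in residuals) / n,
        "bias": sum(residuals) / n,
        # Percentage of molecules with |error| strictly below tol.
        "pct_within_tol": 100.0 * sum(abs(r) < tol for r in residuals) / n,
    }
```

The two r² values differ whenever the predictions are systematically shifted or scaled: the validation form penalizes such offsets, whereas the Pearson form does not.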

Results and Discussion 

Model performance 

The statistics obtained for both sets (Test Set 1 = 100 molecules and Test Set 2 = 32 molecules) are 

shown in Table 1 and Figure 1. To demonstrate model robustness, the results are reported as mean and 

standard deviation (Std). 

Table 1. Performance of the final consensus model for the molecules of the second “Solubility Challenge” 

Test                   r² (validation)   r² (Pearson)   RMSE (validation)   MAE (validation)   Bias            % 0.5 log
                       Mean    Std       Mean   Std     Mean    Std         Mean   Std         Mean    Std     Mean   Std
Test Set 1 (N = 100)   0.458   0.01      0.58   0.01    0.925   0.03        0.74   0.03        -0.234  0.01    40     1
Test Set 2 (N = 32)    0.777   0.02      0.78   0.01    1.019   0.1         0.77   0.1         -0.278  0.02    40     6

 
Figure 1. Plot of log S (predicted) vs log S0 (experimental) for both test sets. Molecules with residual values 
higher than 0.5 (logarithm units) are highlighted in red.  

Figure 2 compares our results with the top-rank models of the second “Solubility Challenge”. According to 

the mean RMSE value, our consensus model ranks ninth among the top-ranked models for the prediction of 

Test Set 1 and first for the prediction of Test Set 2.  


Figure 2. Comparison between the top-rank models of the Second Solubility Challenge and our results 
(according to RMSE) 

Although there are no significant differences in terms of prediction performance, the training set we 

have used contains aqueous solubility measurements under non-specified experimental conditions (pH, 

method and solid form), without information on their type of solubility (aqueous or intrinsic). It is known 

that the presence of acidic and basic groups in a molecule and the pH of the medium affect the solubility 

value. Intrinsic solubility corresponds to the solubility of the uncharged molecular species, whereas 

aqueous solubility depends on the pH used for measurements. Therefore, not all the values in the training 

set are true intrinsic solubility values, which influences the model prediction of the external test set with 

intrinsic solubility measurements, leading in some cases to higher uncertainty for samples contained in the 

training set.  

We analysed the overlap of our source set with the molecules from the second "Solubility Challenge",

which amounted to 88 and 21 molecules for the first and second test sets, respectively. For these

109 overlapping molecules only, a correlation analysis was performed between the intrinsic solubility values

reported in the second "Solubility Challenge" and the aqueous solubility values reported in our initial 

source set. The overlapping molecules were eliminated from the training set for modelling purposes. This 

analysis is shown in Figure 3. 

Considering the lack of real intrinsic solubility values in the training set, the most problematic molecules 

in the second "Solubility Challenge" should be the ionizable compounds. The analysis of residuals showed 

that Amiodarone (TS2), Cisapride (TS1) and Folic Acid (TS1) are response outliers. All of them contain at 

least one acidic or basic functional group and are practically insoluble compounds. For these molecules, the 

aqueous solubility value (log Sw) is different from the intrinsic solubility value, since not enough solute is 

dissolved to modify the pH in order to maintain a near-neutral species in the poorly buffered medium. 

Table 2 describes the values of log S0 (second "Solubility Challenge"), log Sw (initial data source), log Sw 

(reported in other sources) and log Sw (predicted).  

 
Figure 3. Overlapping log S0 against log Sw analysis between the molecules of the second “Solubility
Challenge” and the training set. For modelling purposes, these overlapping molecules were eliminated from
the training set.

Table 2. Summary of solubility values for the outliers 

Name         log S0 (a)   log Sw (b)             log Sw        log Sw (c)
                          (initial source set)   (predicted)   (other sources)
Amiodarone   -10.4        -9.35                  -7.54         -7.17 [14]
Cisapride    -6.78        -5.23                  -4.27         -4.7 [15]
Folic Acid   -5.96        -5.44                  -3.12         > -2.87 [15]

(a) Intrinsic aqueous solubility reported in the second “Solubility Challenge”; (b) aqueous solubility reported for the three outliers in the initial source set; (c) aqueous solubility reported in other sources.

To assess whether the method was able to deal with the uncertainty in the data, a simple experiment 

was performed. As shown in Figure 3, 88 molecules from the first test set of the challenge overlapped with 

our initial source set. A correlation analysis between the two solubility values reported for each overlapping

molecule showed a root mean squared error of 0.568 log units. We assume that the value reported in the 

challenge refers to a curated and reliable measurement, whereas the value reported in our initial source set 

could be of potential uncertainty. There is a significant difference between the two sets of values for the 88 


molecules (confidence interval (CI): 95 %; p = 2.9E-5). Next, a paired-sample t-test was performed to

compare the performance of two models based on two different training sets: (a) the literature solubility

data reported in our initial source set and (b) the reliable intrinsic solubility measurements reported in the 

first set of the challenge. Both models were evaluated on the second challenge test. There was no 

significant difference (CI: 95%, p = 0.58) between the root mean squared errors achieved on the second 

challenge test using one or the other training sets. However, if a single random forest regression without 

recursive selection of data and variables and without applying a consensus model is used as the modelling 

algorithm, the t-test highlights a significant difference (CI: 95%; p = 3.3 E-6). The influence of data quality on 

model performance depends on the modelling procedure used. Thus, data quality was not the determinant 

factor when an appropriate modelling approach was designed to address data uncertainty by selecting the 

most important variables and using a consensus model of combined single model predictions. Table 3

summarizes the results.
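
The paired-sample comparison used in this section can be illustrated with a short sketch that computes the t statistic and its degrees of freedom from two matched samples. The function name is hypothetical, and the p-value lookup is deliberately omitted because the Python standard library has no Student t distribution; a statistics package would be needed for that step.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(sample_a, sample_b):
    """Return (t, degrees_of_freedom) for two matched samples.

    The samples must be aligned: element i of each list refers to the
    same molecule, or to the same repetition of the modelling procedure.
    """
    diffs = [a - b for a, b in zip(sample_a, sample_b)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1
```

Pairing matters here because both samples share the same source of variation (the same molecule or the same repetition), so the test is applied to the differences rather than to the two samples separately.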

Table 3. Mean (Std) statistics obtained with two training sets, reliable solubility measurements (challenge data) and literature solubility data, when predicting the second test set of the second “Solubility Challenge” with our method (Recursive Random Forest (consensus)) versus a single RRF.

                                      Reliable solubility measurements       Literature solubility data
                                      (data challenge)                       (reported in Initial Data Source)
                                      n (training) = 88                      n (training) = 88
Model                                 r² (validation)*  RMSE (validation)*   r² (validation)*  RMSE (validation)*
Recursive Random Forest (consensus)   0.30 (0.05)       1.79 (0.06)          0.29 (0.05)       1.80 (0.05)
Single Random Forest Regression       0.19 (0.01)       1.93 (0.02)          0.14 (0.06)       1.98 (0.06)

*The results are reported as Mean (Std). The Std was computed by repeating the modelling procedure 10 times.

Automated system for aqueous solubility prediction

We believe there is a need to make publicly available a reliable and diverse data set of intrinsic solubility measurements for a rigorous comparison between modelling algorithms, due to the relative influence of data quality on the performance of a model. Furthermore, applicability and reproducibility of solubility QSPR models should be a priority for data to be Findable, Accessible, Interoperable and Reusable (FAIR) [16–18]. In this regard, the final purpose of the current commentary is to make publicly available an automated system for in silico aqueous solubility assessment. Our model has been successfully validated in a previously published study and has been blind tested with the second “Solubility Challenge”, showing an adequate performance. The KNIME workflow published with the paper contains the results of our model on the second “Solubility Challenge” and allows the prediction of new sets. The user can download the workflow and follow the instructions it contains from https://pikairos.eu/download/aqueous_solubility_prediction/. We developed a version based on RDKit and alvaDesc descriptors, calculated using the “Descriptor” node contained in the “alvaDesc” extension. alvaDesc 1.0.16 is available with academic or commercial licenses, which can be obtained by requesting a quote online (registration required) or by contacting the vendor directly by email (chm@kode-solutions.net). Only the SMILES codes of the structures are needed for aqueous solubility prediction, as the model does not require any experimentally determined value for solubility calculation. The model is characterized by its simplicity since it is only based on 0-2D descriptors. In addition, the model is implemented in the open-source analytics platform KNIME, which is a user-friendly software suitable for further data analysis and visualization.

Conclusions 

The results obtained with the evaluation of the second “Solubility Challenge” reinforce the idea that 

data quality is not the major limiting factor for obtaining adequate solubility predictions if the implemented 

modelling methodology can cope with data uncertainty. In our case, the developed algorithm was able to 

overcome data variability to obtain acceptable aqueous solubility prediction results. The results published 

here are a blind prediction, since the experimental aqueous solubility values of the challenge test set were

not accessible at the time of our model development and training. Although the achieved performance is 

comparable to those reported in the review of the second Solubility Challenge, our model is only based on 

public data compared to some of the best models of the second Solubility Challenge, which were based on 

the huge aqueous solubility databases available from pharmaceutical companies. Furthermore, the 

algorithm of our model is global, as demonstrated by the use of generic data without the bias of "training 

close to the test data". The automation of the proposed methodology and its possible application on larger 

databases, collected under more homogeneous conditions, could be a step forward to improve solubility 

prediction during drug discovery and development stages. Given the importance of sharing data

and methods to ensure reproducibility and applicability of QSPR models, we made the data publicly

available along with our predictive model based on the KNIME Analytics Platform as a new free tool for the

assessment of aqueous solubility of drug candidates. 

Abbreviations 

ADME:  Absorption-Distribution-Metabolism-Excretion 

QSPR:  Quantitative Structure-Property Relationship  

KNIME:  Konstanz Information Miner 

RF:  Random Forest 

RRF:  Regression Random Forest 

Std:  standard deviation 

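$y_i^{\mathrm{obs}}$:  experimental intrinsic solubility value

$y_i^{\mathrm{calc}}$:  predicted aqueous solubility value (model)

r² (val):  the square of the correlation coefficient of regression (validation). $r^2_{\mathrm{val}} = 1 - \sum_i (y_i^{\mathrm{obs}} - y_i^{\mathrm{calc}})^2 / \sum_i (y_i^{\mathrm{obs}} - \langle y^{\mathrm{obs}} \rangle)^2$, where $y_i^{\mathrm{obs}}$ is the experimental log S0 and $\langle y^{\mathrm{obs}} \rangle$ is the mean value of the experimental log S0 values.

r² (Pearson):  the square of the correlation coefficient of regression (Pearson). $r^2_{\mathrm{Pearson}} = 1 - \sum_i (y_i^{\mathrm{obs}} - a - b\,y_i^{\mathrm{calc}})^2 / \sum_i (y_i^{\mathrm{obs}} - \langle y^{\mathrm{obs}} \rangle)^2$, where $y_i^{\mathrm{obs}}$ is the experimental log S0, $\langle y^{\mathrm{obs}} \rangle$ is the mean value of the experimental log S0 values, $a$ is the intercept and $b$ the slope.

RMSE:  the root mean squared error. $\mathrm{RMSE} = \left[ \frac{1}{n} \sum_i (y_i^{\mathrm{obs}} - y_i^{\mathrm{calc}})^2 \right]^{1/2}$, where $y^{\mathrm{obs}}$ / $y^{\mathrm{calc}}$ = observed/calculated value of log S0, n = number of samples.

MAE:  mean absolute error. $\mathrm{MAE} = \frac{1}{n} \sum_i |y_i^{\mathrm{obs}} - y_i^{\mathrm{calc}}|$, where $y^{\mathrm{obs}}$ / $y^{\mathrm{calc}}$ = observed/calculated value of log S0, n = number of samples.

Bias:  $\mathrm{Bias} = \frac{1}{n} \sum_i (y_i^{\mathrm{obs}} - y_i^{\mathrm{calc}})$, where $y^{\mathrm{obs}}$ / $y^{\mathrm{calc}}$ = observed/calculated value of log S0, n = number of samples.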

TS:  Test Set 

PV:  Prediction Variance

CI:  Confidence interval 

FAIR:  Findable, Accessible, Interoperable and Reusable 


Acknowledgements: All the authors acknowledge KNIME and its many contributors for making the KNIME 

data-mining environment available free of charge, as well as Alvascience for the academic licence of 

alvaDesc. 

Conflict of interest: The authors declare no conflict of interest. 

References  

[1] A. Llinàs, R.C. Glen, J.M. Goodman. Solubility Challenge: Can You Predict Solubilities of 32 Molecules
Using a Database of 100 Reliable Measurements? J. Chem. Inf. Model. 48 (2008) 1289–1303.
https://doi.org/10.1021/ci800058v.

[2] A. Llinas, A. Avdeef. Solubility Challenge Revisited after Ten Years, with Multilab Shake-Flask Data, 
Using Tight (SD ∼ 0.17 log) and Loose (SD ∼ 0.62 log) Test Sets. J. Chem. Inf. Model. 59 (2019) 3036–
3040. https://doi.org/10.1021/acs.jcim.9b00345.  

[3] A. Llinas, I. Oprisiu, A. Avdeef. Findings of the Second Challenge to Predict Aqueous Solubility. J.
Chem. Inf. Model. 60 (2020) 4791–4803. https://doi.org/10.1021/acs.jcim.0c00701.

[4] G. Falcón-Cano, C. Molina, M. Á. Cabrera-Pérez. ADME prediction with KNIME: In silico aqueous 
solubility consensus model based on supervised recursive random forest approaches. ADMET DMPK 
8 (2020) 1–23. https://doi.org/10.5599/admet.852.  

[5] P.M. Mazanetz, J.R. Marmon, B.T.C. Reisser, I. Morao. Drug Discovery Applications for KNIME: An
Open Source Data Mining Platform. Curr. Top. Med. Chem. 12 (2012) 1965–1979.
https://doi.org/10.2174/156802612804910331.

[6] M.-A. Trapotsi. Development and evaluation of ADME models using proprietary and opensource 
data. University of Hertfordshire, 2017. https://doi.org/10.18745/th.19719.  

[7] “KNIME Analytics Platform 4.0.2.” [Online]. Available: https://www.knime.com/download-previous-
versions. [Accessed: 17-Mar-2021]. 

[8] A. Mauri, “alvaDesc: A Tool to Calculate and Analyze Molecular Descriptors and Fingerprints,” in 
Ecotoxicological QSARs. Methods in Pharmacology and Toxicology, K. Roy, Ed. Humana Press Inc., 
2020, pp. 801–820. 

[9] “RDKit KNIME Integration.” [Online]. Available: https://www.knime.com/rdkit. [Accessed: 19-Jun-
2020]. 

[10] M.C. Sorkun, A. Khetan, S. Er. AqSolDB, a curated reference set of aqueous solubility and 2D
descriptors for a diverse set of compounds. Sci. Data 6 (2019) 1–8.
https://doi.org/10.1038/s41597-019-0151-1.

[11] Q. Cui, S. Lu, B. Ni, X. Zeng, Y. Tan, Y.D. Chen, H. Zhao. Improved Prediction of Aqueous Solubility of
Novel Compounds by Going Deeper With Deep Learning. Front. Oncol. 10 (2020) 1–9.
https://doi.org/10.3389/fonc.2020.00121.

[12] D.S. Palmer, J.B.O. Mitchell. Is experimental data quality the limiting factor in predicting the aqueous
solubility of druglike molecules? Mol. Pharm. 11 (2014) 2962–2972.
https://doi.org/10.1021/mp500103r.

[13] V. Svetnik, A. Liaw, C. Tong, J.C. Culberson, R.P. Sheridan, B.P. Feuston. Random forest: a 
classification and regression tool for compound classification and QSAR modeling. J. Chem. Inf. 
Comput. Sci. 43 (2003) 1947–1958. https://doi.org/10.1021/ci034160g.  

[14] M. Salahinejad, T.C. Le, D.A. Winkler. Aqueous Solubility Prediction: Do Crystal Lattice Interactions 
Help?. Mol. Pharm. 10 (2013) 2757–2766. https://doi.org/10.1021/mp4001958.  

[15] S.H. Yalkowsky, Y. He, P. Jain. Handbook of Aqueous Solubility Data, 2nd ed. CRC Press, Taylor &
Francis Group, Boca Raton, FL, USA, 2010.

[16] M.D. Wilkinson, M. Dumontier, I.J. Aalbersberg et al. Comment: The FAIR Guiding Principles for
scientific data management and stewardship. Sci. Data 3 (2016) 1–9.
https://doi.org/10.1038/sdata.2016.18.

[17] J. Wise, A.G. de Barron, A. Splendiani et al. Implementation and relevance of FAIR data principles in
biopharmaceutical R&D. Drug Discovery Today 24 (2019) 933–938.
https://doi.org/10.1016/j.drudis.2019.01.008.

[18] K.M. Merz, R. Amaro, Z. Cournia, M. Rarey, T. Soares, A. Tropsha, H.A. Wahab, R. Wang. Editorial: 
Method and Data Sharing and Reproducibility of Scientific Results. J. Chem. Inf. Model. 60 (2020) 
5868–5869. https://doi.org/10.1021/acs.jcim.0c01389.  

 
©2021 by the authors; licensee IAPC, Zagreb, Croatia. This article is an open-access article distributed under the terms and 

conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/)  

 