Original article, Biomath 2 (2013), 1309089, 1–11
BIOMATH, ISSN 1314-684X, Editor-in-Chief: Roumen Anguelov
http://www.biomathforum.org/biomath/index.php/biomath/
Biomath Forum

Quantitative Structure-Activity Relationships: Linear Regression Modelling and Validation Strategies by Example

Sorana D. Bolboacă
Department of Medical Informatics and Biostatistics, "Iuliu Haţieganu" UMF Cluj-Napoca, Cluj-Napoca, Romania
Email: sbolboaca@gmail.com

Lorentz Jäntschi
Department of Physics and Chemistry, Technical University of Cluj-Napoca, Cluj-Napoca, Romania
Email: lorentz.jantschi@gmail.com

Received: 4 May 2013, accepted: 8 September 2013, published: 18 September 2013

Abstract: Quantitative structure-activity relationships are mathematical models constructed on the hypothesis that the structure of chemical compounds is related to their biological activity. A linear regression model is often used to estimate and/or to predict the nature of the relationship between a measured activity and some measured or calculated descriptors. Linear regression helps to answer three main questions: does the biological activity depend on structural information; if so, is the relationship linear; and if yes, how good is the model at predicting the biological activity of new compound(s). This manuscript presents the steps of linear regression analysis, moving from theoretical knowledge to an example conducted on sets of endocrine disrupting chemicals.

Keywords: robust regression; validation; diagnostic; predictive power; quantitative structure-activity relationships (QSARs)

I. LINEAR REGRESSION ON QSAR ANALYSIS

Quantitative structure-activity relationships (QSARs) are mathematical models linking chemical structure and pharmacological activity/property in a quantitative manner for a series of compounds [1].
The approaches are based on the assumption that the structure of chemical compounds (such as geometric, topologic, steric, electronic properties, etc.) contains features responsible for their physical, chemical and/or biological properties [2]. This assumption can be summarized as "similar compounds have similar properties" [3].

The two main fields where linear regression analysis has found its applicability are drug discovery [4], [5] and toxicology prediction [6], [7]. In both of these fields, linear regression is used mainly to predict, not to estimate (the model is used to quickly determine the activity/property of new/un-investigated compounds) [8].

Linear regression is used in QSAR analysis to linearly link the activity/property of chemical compounds (measured or observed value; outcome variable abbreviated as Y) and some values translated from the structure of the compounds, generally called descriptors (independent variables assumed not to be affected by error, abbreviated as X(s)). The multiple linear regression (MLR) expression is presented in Eq. (1):

$$\hat{Y} = b_0 + \sum_{i=1}^{k} b_i X_i \qquad (1)$$

where $\hat{Y}$ = estimated activity/property; $b_0$ = intercept; $b_i$ = coefficient of the ith independent/descriptor variable ($1 \le i \le k$, $5 \times k \le n$ [9]); k = number of descriptors (independent/descriptor variables) in the model; n = number of observations in the sample.

The regression coefficient $b_i$ can be interpreted as the change in Y when $X_i$ increases or decreases by 1 unit while all other independent variables are held constant ($b_0$ and $b_i$ estimate the population parameters $\beta_0$ and $\beta_i$ [10]). The values of $b_0$ and $b_i$ are calculated so as to minimize the sum of squared errors over all n observations.

Citation: Sorana D. Bolboacă, Lorentz Jäntschi, Quantitative Structure-Activity Relationships: Linear Regression Modelling and Validation Strategies by Example, Biomath 2 (2013), 1309089, http://dx.doi.org/10.11145/j.biomath.2013.09.089

A. Linear Regression Assumptions

The main assumptions of linear regression (Table I) can be summarized as:
1) Linearity. The relation between Y and each of the descriptors $X_i$ is linear.
2) Independence of the errors. Both the experimental values (Y) and the experimental/calculated descriptors ($X_i$) are measured without errors.
3) Homoscedasticity. The variance of the errors is constant.
4) Normality. The dependent variable (Y) is normally distributed.
5) Absence of multicollinearity. The independent variables ($X_i$) are linearly independent of each other. Please note that this constraint does not exclude a certain degree of collinearity.

Since it has been recognized that "normal law ... is not valid in a great many cases which are both common and important" [11], a series of transformations can be used to reach a normal distribution [29] (see Table II).

1) Model Selection and Diagnostic: Selection of the regression model is an important task that researchers must accomplish. The main criteria useful in this step are:
• Determination coefficient ($R^2$) and its adjusted form ($R^2_{adj}$, i.e. $R^2$ adjusted for the number of coefficients in the model, so that its value does not necessarily increase with the addition of X's). Generally, $R^2$ increases with the number of parameters in the model, while $R^2_{adj}$ penalizes according to the number of parameters (the model with the higher number of descriptors does not necessarily have the higher value of $R^2_{adj}$).
• Standard error of the estimate: the average error in predicting the activity/property of interest by the identified model.
• Statistics of overall model performance (F-value and associated p-value): assess the overall ability of a model to explain as much as possible of the observed variability in Y.
• Model performance in cross-validation by the leave-one-out analysis. It is said that a model with $Q^2$ (determination coefficient in cross-validation by the leave-one-out analysis) > 0.6 and $|R^2 - Q^2| < 0.1$ is a desired model in QSAR analysis [30]. However, the value of the F-statistic and its associated probability are as important as $Q^2$ in the assessment of the internal validation of a QSAR model.
• Mallows' $C_p$-statistic ($C_p = SS_{res}/MS_{res} - n + 2 \cdot (k + 1)$, k = number of descriptor variables in the model) [31], [32], [33]: measures the overall bias or mean square error in the estimated model parameters. This is a useful parameter when models with different X(s) are compared on the same sample of compounds. A low $C_p$ value, or a small positive/negative discrepancy between $C_p$ and (k+1), indicates good model prediction, so the statistic can be used in evaluating candidate regression models.
• Akaike's information criterion and derived formulas: assess the degree of fit by involving the goodness-of-fit of the model ($R^2$):
  - Akaike information criterion [34]: $AIC = n \cdot \ln(RSS/n) + 2 \cdot (k + 1)$ for the model with intercept and $AIC = n \cdot \ln(RSS/n) + 2 \cdot k$ for the model without intercept, where n = sample size, RSS = residual sum of squares, k = number of $X_i$;
  - AIC based on the determination coefficient: $AIC_{R2} = \ln[(1 - R^2)/n] + 2 \cdot (k + 1)$;
  - McQuarrie and Tsai corrected AIC [35]: $AIC_u = \ln[RSS/(n - k + 1)] + (n + k + 1)/(n - k - 1)$;
  - Bayesian Information Criterion [36]: $BIC = n \cdot \ln[RSS/(n - k + 1)] + (k + 1) \cdot \ln(n)$;
  - Amemiya Prediction Criterion [37]: $APC = RSS/n \cdot (n - k + 1)/(n + k + 1)$;
  - Hannan-Quinn Criterion [38]: $HQC = n \cdot \ln(RSS/n) + 2 \cdot (k + 1) \cdot \ln[\ln(n)]$.
  The smaller the AIC, BIC, APC and HQC values are, the better the model is considered. In addition to AIC values, the Akaike weights are also used in model assessment [39]:
  $$w_i = \frac{\exp(-0.5 \cdot \Delta_i)}{\sum_{j=1}^{J} \exp(-0.5 \cdot \Delta_j)}$$
  where $\Delta_i = AIC_i - \min(AIC)$ is the difference between the AIC of the ith model and that of the best fitting model, min(AIC) = minimum AIC value out of all models, and J = the number of models.
• Kubinyi function (FIT) [40], [41]: $FIT = [R^2 \cdot (n - k)]/[(n + (k + 1)^2) \cdot (1 - R^2)]$. The higher the FIT value, the better the model is considered.

[TABLE I. ASSUMPTIONS OF LINEAR REGRESSION: EFFECT - IDENTIFICATION - METHODS]

The diagnosis of a regression model when the dependent variable is continuous can be conducted by analyzing the residuals or rescaled residuals:
• Look at the largest and/or smallest experimental values to detect whether the values are in the plausible range. Also look at the descriptive statistics: mean, standard deviation, histogram.
• Plot the independent variable(s) vs the dependent variable.
• Plot the values of the studentized residuals ($s_i$), leverage ($h_i$) and Cook's distance ($D_i$) vs individual $X_i$ values. The hat values ($0 \le h_i \le 1$) are used to evaluate the leverage of observations in the dimensional space of the independent variables (covariates). If the $h_i$ value of a compound exceeds the threshold value ($2 \cdot (k+1)/n$ for a regression model with intercept and $2 \cdot k/n$ for a model without intercept, where k = number of $X_i$ [42]), the compound is considered influential whenever its removal leads to a significant improvement of the model.
Cook's distance considers in its formula both the residuals and the hat matrix values to identify influential compound(s) (threshold $D_i > 4/n$, where $D_i = \frac{1}{k+1} \cdot s_i^2 \cdot \frac{h_i}{1-h_i}$ for the model with intercept and $D_i = \frac{1}{k} \cdot s_i^2 \cdot \frac{h_i}{1-h_i}$ for the model without intercept, $s_i$ = studentized residuals [43]).

Several parameters that can be useful in the diagnosis of a MLR are presented in Table III. Some of the parameters presented in Table III are also used by some authors as measures of model predictive power (see for example MAE [44]).

B. Model Predictive Power

The ability to predict the activity/property of new compounds is of major importance in QSAR/QSPR analysis. Several parameters have been proposed and are used to assess model predictive power; they are presented in Table IV.

The diagnosis of a linear regression model can also be conducted using a series of statistical parameters calculated on a contingency table [58], after transformation of the observed and estimated/predicted logRBA into dichotomous variables using criteria for classification of compounds as active or inactive. The total fraction of compounds correctly classified (a parameter called concordance / accuracy / non-error rate) is one parameter that can bring useful information when choosing which model to apply.

TABLE II. METHODS FOR DATA TRANSFORMATION

| Transformation | Applied to: | Appropriate when: |
|---|---|---|
| 'log': Y' = log Y | Stabilize the variance of Y; normalize the dependent variable ← positively skewed distribution of the residuals for Y; linearize the regression model | Y has positive values |
| 'square root': Y' = √Y | Stabilize the variance (the variance is proportional to the mean of Y) | Y has the Poisson distribution |
| 'reciprocal': Y' = 1/Y | Stabilize the variance | The variance is proportional to the fourth power of the mean of Y |
| 'square': Y' = Y² | Stabilize the variance (the variance decreases with the mean of Y); normalize the dependent variable ← negatively skewed distribution of the residuals for Y; linearize the regression model ← the original relation with some independent variable is curvilinear downward (such as a decrease of slope with the increase of the independent variable) | |
| 'arcsine': Y' = asin √Y | Stabilize the variance | Y is a proportion or a percentage |

II. PRACTICAL CONSIDERATIONS

Three data sets of endocrine disrupting chemicals with experimental values of relative binding affinity expressed in logarithmic scale (logRBA) [59] were used for exemplification. The investigated compounds can be classified according to their logRBA values as weak binders (logRBA < −2.0), moderate binders (−2.0 ≤ logRBA ≤ 0) and strong binders (logRBA > 0) [60].
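As an illustration of the classification criteria above, the following minimal Python sketch assigns a binder class to a logRBA value (weak below −2.0, moderate from −2.0 to 0 inclusive, strong above 0) and computes the concordance, i.e. the fraction of compounds that the observed and the predicted values place in the same class. The function names `binder_class` and `concordance` and the sample values are assumptions made for illustration; they are not taken from the original study.

```python
def binder_class(log_rba):
    """Binder class of a compound from its logRBA, using the thresholds of [60]."""
    if log_rba < -2.0:
        return "weak"
    if log_rba <= 0.0:
        return "moderate"
    return "strong"


def concordance(observed, predicted):
    """Fraction of compounds whose observed and predicted classes agree."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("need two non-empty sequences of equal length")
    hits = sum(binder_class(o) == binder_class(p)
               for o, p in zip(observed, predicted))
    return hits / len(observed)


# Hypothetical logRBA values (not data from the paper).
observed = [-3.1, -1.2, 0.4, -2.5, 1.3]
predicted = [-2.4, -0.8, -0.1, -1.9, 0.9]
print(concordance(observed, predicted))
```

The same per-class comparison can be restricted to one class at a time (weak, moderate or strong) to obtain class-wise accuracy.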
The following descriptors were previously calculated on the investigated structures [59] and were used here to illustrate how linear regression analysis works:
• TIE = E-state topological parameter;
• TIC1 = total information content index (neighbourhood symmetry of 1-order);
• ATS4m = Broto-Moreau autocorrelation of a topological structure, lag 4, weighted by atomic masses;
• EEig02d = eigenvalue 02 from edge adjacency matrix weighted by dipole moments;
• E1s = 1st component accessibility directional WHIM index, weighted by atomic electrotopological states;
• Dv = total accessibility index, weighted by atomic van der Waals volumes.

The first set was used to identify the model and comprised 132 compounds (training set; 1 withdrawn, 60 weak binders, 41 moderate binders and 30 strong binders). The second data set was used to test the performance of the model (test set) and comprised 23 compounds (3 weak binders, 16 moderate binders and 4 strong binders). The third data set was used as external validation set and consisted of 9 compounds (4 weak binders and 5 moderate binders).

A. MLR in Training Sets

The first step in the linear regression analysis was to investigate the distribution of logRBA in the training set. One out of three tests rejected the null hypothesis of normality (Chi-Square statistic = 14.862, p-value = 0.03781). No outlier was identified when the Grubbs test was applied, but there was one compound with a studentized residual higher than 3 standard deviations. The experimental data in the training set proved not to be normally distributed according to the Chi-Square test only (see Table V), a normality test that is known to be affected by the presence of outlier(s) [12], even if in this example no outlier was identified. Normality was not achieved even by withdrawing that compound, but the correlation coefficient increased from 0.810 to 0.837.
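The influence diagnostics used in this example (hat values, studentized residuals and Cook's distance, with the thresholds $2(k+1)/n$, $|s_i| > 3$ and $4/n$ quoted in Section I) can be computed as in the following minimal sketch. It is written in plain Python for a model with intercept and uses a small made-up data set. The helper names (`solve`, `diagnostics`), the use of internally studentized residuals and the data are assumptions for illustration, not the procedure or software used by the authors.

```python
from math import sqrt


def solve(a, b):
    """Solve a·x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def diagnostics(descriptors, y):
    """Hat values, studentized residuals, Cook's distances (model with intercept)."""
    rows = [[1.0] + list(d) for d in descriptors]   # design matrix with intercept
    n, p = len(rows), len(rows[0])                   # p = k + 1 coefficients
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    b = solve(xtx, xty)                              # least-squares coefficients
    resid = [yi - sum(bj * rj for bj, rj in zip(b, r)) for r, yi in zip(rows, y)]
    s2 = sum(e * e for e in resid) / (n - p)         # residual mean square
    # Hat value of compound i: h_i = x_i' (X'X)^-1 x_i
    hat = [sum(ri * zi for ri, zi in zip(r, solve(xtx, r))) for r in rows]
    # Internally studentized residuals (an assumption of this sketch)
    stud = [e / sqrt(s2 * (1.0 - h)) for e, h in zip(resid, hat)]
    # Cook's distance: D_i = 1/(k+1) * s_i^2 * h_i/(1 - h_i)
    cook = [s * s * h / (p * (1.0 - h)) for s, h in zip(stud, hat)]
    return hat, stud, cook


# Made-up data: one descriptor, eight compounds; the last one lies far from the rest.
x = [[0.5], [1.0], [1.5], [2.0], [2.5], [3.0], [3.5], [9.0]]
y = [-3.0, -2.6, -2.1, -1.4, -1.1, -0.4, 0.1, 1.0]
hat, stud, cook = diagnostics(x, y)
n, k = len(y), len(x[0])
high_leverage = [i for i, h in enumerate(hat) if h > 2 * (k + 1) / n]
influential = [i for i, d in enumerate(cook) if d > 4 / n]
large_residual = [i for i, s in enumerate(stud) if abs(s) > 3]
print(high_leverage, influential, large_residual)
```

A useful sanity check on such a computation is that the hat values sum to the number of fitted coefficients, k + 1.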
The studentized residuals, hat matrix and Cook's distance values were plotted against logRBA to identify how the data were distributed (Figure 1). Three models obtained on the same data sets were investigated: the full-model (the model comprising all compounds assigned to the training set), the Di-model (the model comprising just the compounds that did not exceed the imposed Cook's distance threshold), and the hi-model (the model comprising just the compounds that did not exceed the imposed hat matrix threshold). The Cook's distance and hat matrix approaches were applied to withdraw compounds from the training sample until two criteria were met: logRBA proved normally distributed and withdrawing further compound(s) did not lead to an improvement in the determination coefficient. Both reduced models showed smaller RMSE and RMSEP values than the full-model. The characteristics of all investigated models are presented in Table V.

[Fig. 1. Studentized residuals (a), Cook's distance (b) and hat matrix values (c) versus logRBA in the model with all compounds in the training set (n=132). Panel annotations: (a) $s_i > 3$ → 1 compound; (b) $D_i > 4/n$ → 9 compounds; (c) $h_i > 2(k+1)/n$ → 6 compounds.]

TABLE III. STATISTICAL PARAMETERS FOR DIAGNOSIS OF MLR

| Parameter (Abbreviation) | Formula [ref] | Remarks |
|---|---|---|
| Residual Mean Square (RMS): error variance | $RMS = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - k}$ | The smaller the better; 0 < RMS < ∞ |
| Average Prediction Variance (APV) | $APV = \frac{RMS}{n} \cdot (n + k)$ [45] | The smaller the better |
| Total Squared Error (TSE) | $TSE = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\hat{\sigma}^2} + 2 \cdot k - n$ [46]; $TSE = \frac{SSE}{MSE} - (n - 2 \cdot k) + 2$ [33] | The smaller the better; TSE > (k+1) → bias due to incompletely specified model; TSE < (k+1) → the model is over-specified (contains too many variables) |
| Average Prediction Mean Squared Error (APMSE) | $APMSE = \frac{RMS}{n - k - 1}$ [47] | The smaller the better |
| Mean Absolute Error (MAE): measures the average magnitude of the errors; can also be used to compare two models | $MAE = \frac{\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert}{n}$ | MAE = 0 → perfect accuracy; 0 < MAE < ∞ |
| Root Mean Square Error (RMSE): measures the average magnitude of the error | $RMSE = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n}}$ | RMSE > MAE → variation in the errors exists; 0 < RMSE < ∞ |
| Mean Absolute Percentage Error (MAPE): measure of accuracy expressed as percentage | $MAPE = \frac{\sum_{i=1}^{n} \lvert (y_i - \hat{y}_i)/y_i \rvert}{n}$ [48], [49] | MAPE ~ 0 → perfect fit |
| Standard Error of Prediction (SEP) | $SEP = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - 1}}$ | The smaller the better |
| Relative Error of Prediction (REP%) | $REP(\%) = \frac{100}{\bar{y}} \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n}}$ | The smaller the better |

n = sample size; k = number of independent variables in the model; $\bar{y}$ = the mean of estimated/predicted activity/property; $\hat{y}_i$ = predicted value of the ith compound in the sample; $y_i$ = observed/measured activity/property of the ith compound; SSE = sum of squared errors; MSE = mean of squared errors.

TABLE IV. STATISTICS FOR ASSESSMENT OF THE PREDICTIVE POWER OF MLR

| Parameter (abbr.) | Formula [ref] | Remarks |
|---|---|---|
| Predictive Squared Correlation Coefficient in Training Set ($Q^2_{F1}$) | $Q^2_{F1} = 1 - \frac{\sum_{i=1}^{n_{TS}} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n_{TS}} (y_i - \bar{y}_{TR})^2}$ [50] | Prediction is considered accurate if the predictive power of the model is > 0.6 [51] |
| Predictive Squared Correlation Coefficient in Test Set ($Q^2_{F2}$) | $Q^2_{F2} = 1 - \frac{\sum_{i=1}^{n_{TS}} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n_{TS}} (y_i - \bar{y}_{TS})^2}$ [52] | As above |
| External Predictive Ability ($Q^2_{F3}$) | $Q^2_{F3} = 1 - \frac{\sum_{i=1}^{n_{TS}} (y_i - \hat{y}_i)^2 / n_{TS}}{\sum_{i=1}^{n_{TR}} (y_i - \bar{y}_{TR})^2 / n_{TR}}$ [53] | As above |
| $r^2_m$ metrics | $r^2_m = r^2 \cdot \left(1 - \sqrt{r^2 - r^2_0}\right)$ [44], [54]; $\Delta r^2_m = \lvert r^2_m - r'^2_m \rvert$ | Values of $r^2_m$ higher than 0.5 indicate an acceptable model [44], [54]; $\Delta r^2_m < 0.2$ indicates an acceptable model |
| Concordance Correlation Coefficient (CCC) | $CCC = \frac{2 \sum_{i=1}^{n} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sum_{i=1}^{n} (y_i - \bar{y})^2 + \sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})^2 + n \cdot (\bar{y} - \bar{\hat{y}})^2}$ [55] | Strength of agreement between observed and predicted values [56]: > 0.99 almost perfect; [0.95; 0.99) substantial; [0.90; 0.95) moderate; < 0.90 poor |
| Predictive Power (PP): Fisher's approach | $t = \frac{\overline{res}_{TS} - 0}{stdev(res_{TS}) / \sqrt{n_{TS}}}$ [57]; p = TDIST(abs(t), $n_{TS}$ − 1, 1) | Evaluates whether the mean of the residuals is statistically different from the expected value (0) |

n = sample size; v = number of independent variables in the model; $\bar{y}$ = the mean of observed/measured activity/property; $\bar{\hat{y}}$ = the mean of estimated/predicted activity/property; $\hat{y}_i$ = predicted value of the ith compound in the sample; $y_i$ = observed/measured activity/property of the ith compound; $\overline{res}$ = mean of residuals; stdev = standard deviation; TR = training set; TS = test set; $r^2_m$ = a metric calculated using observed (y-axis) and estimated/predicted (x-axis) values; $r'^2_m$ = a metric calculated using observed (x-axis) and estimated/predicted (y-axis) values; $r^2_0$ = determination coefficient calculated by forcing the regression through the origin; $\Delta r^2_m$ = absolute difference between $r^2_m$ and $r'^2_m$; EXT = external set; abs = absolute value.

TABLE V. MLR IN TRAINING SETS: MODELS CHARACTERISTICS

| Statistical parameter | Full-model (n=132) | Di-model (n=115) a | hi-model (n=123) b |
|---|---|---|---|
| Normality tests: KS - AD - CS | 0.116* - 2.409* - 14.862** | 0.124* - 2.432* - 12.613* | 0.120* - 2.428* - 12.083* |
| Durbin-Watson | 1.275 | 1.292 | 1.263 |
| Collinearity: highest R; higher VIF & lower T | 0.7700; TIE: 3.367 & 0.297 | 0.7889; ATS4m: 4.082 & 0.245 | 0.7752; ATS4m: 4.516 & 0.221 |
| $R^2$ | 0.6559 | 0.7797 | 0.6928 |
| $R^2_{adj}$ | 0.6394 | 0.7675 | 0.6769 |
| RMSE | 1.0701 | 0.8293 | 0.9977 |
| F-value (p-value) | 39.711 (9.89·10^-27) | 63.721 (3.12·10^-33) | 43.59 (1.62·10^-27) |
| $Q^2$ | 0.5832 | 0.7543 | 0.6497 |
| RMSEP | 1.1827 | 0.8764 | 1.0668 |
| Floo-value (p-value) | 28.74 (9.49·10^-22) | 55.17 (1.85·10^-31) | (1.62·10^-27) |
| $\lvert R^2 - Q^2 \rvert$ | 0.0727 | 0.0254 | 0.0431 |
| Concordance Correlation Coefficient (CCC) | 0.8108 [0.7476 to 0.8595] | 0.8762 [0.8278 to 0.9117] | 0.8185 [0.7545 to 0.8671] |
| $r^2_m$ ($\Delta r^2_m$) | 0.6071 (0.1324) | 0.7797 (0.1278) | 0.6921 (0.1586) |
| Cp-statistic | 7.00 | 7.00 | 7.00 |
| AIC (wi-AIC) | 18.9639 (0.2856) | 18.3078 (0.3965) | 18.7490 (0.3180) |
| AICR2 (wi-AICR2) | 8.0504 (0.3137) | 7.7421 (0.3659) | 8.0077 (0.3204) |
| AICc (wi-AICc) | 1.2657 (0.2990) | 0.7766 (0.3819) | 1.1358 (0.3191) |
| BIC | 52.0750 | 9.8317† | 33.1255 |
| HQC | 26.2887 | 34.7113† | 7.8043 |
| FIT | 1.3058 | 2.3097 | 1.5076 |

* p ≥ 0.05; ** p = 0.0378; † = absolute values; KS = Kolmogorov-Smirnov; AD = Anderson-Darling; CS = Chi-Square; R = correlation coefficient; VIF = Variance Inflation Factor; T = tolerance; $R^2$ = determination coefficient; $R^2_{adj}$ = adjusted determination coefficient; RMSE = root mean square error; F-value = Fisher's statistic; $Q^2$ = determination coefficient in cross-validation by the leave-one-out analysis; RMSEP = root mean square error in prediction; CCC = concordance correlation coefficient [95% confidence interval]; Cp-statistic = Mallows' statistic; AIC = Akaike's information criterion; AICR2 = AIC based on the determination coefficient; AICc = AIC corrected by McQuarrie and Tsai; BIC = Bayesian Information Criterion; HQC = Hannan-Quinn Criterion; FIT = Kubinyi's function.
a 56 weak binders, 35 moderate binders, and 24 strong binders; withdrawn (16 compounds): 4 weak binders, 6 moderate binders and 6 strong binders.
b 57 weak binders, 38 moderate binders, and 28 strong binders; withdrawn (8 compounds): 3 weak binders, 3 moderate binders and 2 strong binders.

The analysis of the models (Table V) revealed that none of the models showed collinearity (the highest correlation coefficient did not exceed 0.8 and the VIF values were less than 10). In terms of internal validity, evaluated by the $|R^2 - Q^2|$ difference, the Di-model is about twice as good as the hi-model and about three times as good as the full-model. The Mallows $C_p$-statistic did not find its applicability in our example because the same descriptors are used in all models. The smallest values of the information criteria were systematically obtained by the Di-model, followed by the hi-model, while the full-model systematically obtained the highest values (see Table V). The concordance correlation coefficients for the training sets had values close to the correlation coefficients and were higher than 0.80 for all models (see Table V).

Looking at the weights of Akaike's information criteria, which can be interpreted as the probability that a certain model is the best model, no model with robust inference could be identified (none of the models had weights higher than 0.9 [61]). The Di-model had weights around 0.37, which is far from 0.90 but a little higher than the weights obtained by the full-model (around 0.30) or by the hi-model (around 0.32). Recall that the Di-model could be considered the preferred model; from the inspection of the Akaike weights in Table V, this model is 1.2 (wi-AICR2) to 1.4 (wi-AICc) times more likely in terms of the Kullback-Leibler discrepancy, a measure of the distance between the probability generated by the model and reality [62], compared with the hi-model.

Significant differences between models can also be observed if the BIC and HQC parameters are analyzed; the smallest value of BIC was obtained by the Di-model while the smallest value of HQC was obtained by the hi-model.

The plots of residuals versus predicted values for the investigated models are presented in Figure 2. The analysis of residuals allows one to identify whether the assumptions of the regression appear to have been met (specifically linearity and homoscedasticity), in which case the residual plot looks like a horizontal band. Thus, according to the pattern of the residuals [63], the most appropriate model is the Di-model, since the distribution indicates a homoscedastic model. Furthermore, both the full-model and the hi-model showed evidence of heteroscedasticity, the error in estimating logRBA increasing as the value of logRBA increases. However, both these models could be accepted because neither of them showed the presence of systematic errors or inadequacy [63]. If the assumption of linearity and/or of homoscedasticity is violated, the residual plots show an increasing and narrow pattern if systematic error exists, or depict a Gaussian trend when the model is inadequate [64]. Other proposed plot methods, such as linear residual plots, have been shown to be useful in the identification of non-linearity, while squared residual plots proved their utility in the detection of non-constant variances [65].

The normal probability plots (right graphical representations in Figure 2) can be used to verify the normality assumption of the residuals.
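A normal probability plot pairs the ordered residuals with the quantiles expected under a normal distribution; points lying close to a straight line support the normality assumption. The sketch below, in plain Python, builds these pairs and reports their correlation as a rough numerical measure of straightness. The plotting positions (i − 0.5)/n, the function names and the residual values are assumptions for illustration, not the procedure used to draw Figure 2.

```python
from math import sqrt
from statistics import NormalDist, mean


def normal_probability_points(residuals):
    """Pairs (theoretical normal quantile, ordered residual)."""
    n = len(residuals)
    ordered = sorted(residuals)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Plotting positions (i - 0.5)/n for i = 1..n: one common convention (assumed).
    quantiles = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(quantiles, ordered))


def correlation(pairs):
    """Pearson correlation between the two coordinates of the points."""
    xs, ys = zip(*pairs)
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical residuals (not taken from the paper).
residuals = [-1.4, -0.9, -0.6, -0.2, -0.1, 0.1, 0.3, 0.5, 0.8, 1.5]
points = normal_probability_points(residuals)
print(round(correlation(points), 3))
```

A correlation close to 1 corresponds to points that follow a straight line; the visual inspection of the plot remains the primary diagnostic.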
Figure 2 showed that the hi-model fits a straight line better than both the full-model and the Di-model.

The results obtained on our data for the statistical parameters useful in model diagnosis introduced in Table III are presented in Table VI. The total squared error is the single parameter that has the same value for all models; in all cases it is equal to 7 (obtained by adding 1 to the number of descriptors in the model, 6 in our example), indicating that none of the models was over-specified or contained bias due to an incompletely specified model. The ranking of our models based on the parameters presented in Table VI is the same as the ranking obtained according to the parameters presented in Table V: Di-model, hi-model, and full-model.

Several parameters were used to assess the predictive power of the models and their results are presented in Table VII. The analysis of the results presented in Table VII revealed the following:
• The external predictive ability parameter ($Q^2_{F3}$) [53] systematically took negative values for both external and withdrawn sets. At least for the external set, this result could be explained by the distribution of logRBA values (min = −3.3, max = −0.6) compared to the training (min = −4.5, max = 2.6) and test (min = −2.51, max = 1.41) sets. It could also be of interest to analyze how different the compounds contained in the external and withdrawn data sets are compared to the compounds from the training set (in terms of similarity of their structure, for example).
• The Di-model achieved the criterion of exceeding 0.6 [52] in just one of 6 possible cases, while the hi-model reached this criterion in four out of 6 cases. The hi-model thus most frequently met the criterion of having values higher than 0.6, while the full-model did not meet this criterion at all.
Thus, it seems that the compounds in the test and external sets are uniformly distributed over the range of the training set, at least for the hi-model, in view of the fact that otherwise $Q^2_{F1}$ and $Q^2_{F2}$ suffer from drawbacks [66].
• The concordance correlation coefficients reached values higher than 0.70 in the test sets. The values for the external sets proved smaller than 0.5 for all investigated models, but were higher than 0.50 (Di-model and hi-model) when the withdrawn set was investigated.
• The residuals of the models proved significantly different from zero in the test set for the full-model and the Di-model, and in the external set for all models. Both the Di- and hi-models proved to have residuals not significantly different from zero in the samples that contain the withdrawn compounds. According to this criterion, just the hi-model proved prediction power.

The $r^2_m$ metric and the associated $\Delta r^2_m$ obtained in the test sets were as follows: 0.3726 (0.1743) for the full-model, 0.3134 (0.1796) for the Di-model, and 0.5248 (0.1494) for the hi-model. These metrics showed that the hi-model is an acceptable model. The $r^2_m$ is a parameter computed by forcing the regression through the origin [54], with certain applicability and limitations (it fails to detect the differences between experimental and predicted values when the slopes of the regression line are not near 1) [67]. The values of these metrics were smaller than the determination coefficient in all investigated models; the highest value was observed for the Di-model when the training set was investigated (see Table V), but acceptable values were obtained just by the hi-model when the test set was investigated ($r^2_m > 0.5$ and $\Delta r^2_m < 0.2$).

The ranking of the models according to the results presented in Table VII is as follows: hi-model, Di-model, and full-model.

One remark about the parameters used to assess the predictive power, namely $Q^2_{F1}$, $Q^2_{F2}$ and $Q^2_{F3}$, can be made.
Even though the symbols contain a "square", these parameters can take both positive and negative values according to their formulas (see Table IV). A simulation study of these parameters needs to be done to identify their possible values as well as their proper interpretation.

The best way to see the abilities of a MLR model is to plot the measured values against the estimated/predicted values to visualize how well each model works (see Figure 3). With one exception, represented by the hi-model in the external set (p-value = 0.0632), all other correlation coefficients proved statistically significant (p < 0.04). The analysis of the models presented in Figure 3 revealed the following:
• The distribution of compounds in the training set is narrower in the Di-model compared to both the full-model and the hi-model.
• The Di-model obtained higher determination coefficients in the training and external sets, while the hi-model obtained the higher determination coefficients in the training and withdrawn sets.
• The hi-model is more stable compared to the Di-model as far as the difference in determination between training and test set is concerned.
• Both the Di-model and the hi-model performed better in the training and test sets compared to the full-model.

Whenever applicable, the accuracy of a model will show its ability to classify compounds correctly. The overall accuracy as well as the accuracy on each class (weak binder, moderate binder and strong binder) were computed and the obtained results are presented in Figure 4. The analysis of Figure 4 revealed the following:
• The accuracy of all three models was identical for strong binders in the test set (75%) and weak binders in the external set (25%).
Overall, out of 16 possibilities, each of the models (full-model, Di-model, and hi-model) reached the highest accuracy in almost 38% of the cases.
• The full-model showed the highest overall accuracy in both test and external sets, and the highest accuracy for moderate binders in the test and external sets.
• The Di-model showed the highest overall accuracy in the training set, as well as the highest accuracy for strong binders, weak binders and moderate binders in the training set.
• The hi-model showed the highest overall accuracy, as well as higher accuracy for weak binders, moderate binders and strong binders, for the withdrawn compounds.
• No model showed any ability to classify correctly the weak binders in the test set or the strong binders in the external set.

Regarding the accuracy of the investigated models, it is impossible to rank them since their performances are generally the same (38%). It could be observed that the models were able to identify the compounds accurately, on average, in two sets out of three or four. The absence of accurate classification of weak binders in the test set and of strong binders in the external set could be explained by differences in the chemical structure or measured logRBA of the compounds included in these sets.

III. SUMMARY AND FURTHER WORK

Choosing a proper linear model is crucial in QSAR analysis because a model able to predict accurately the activity of interest of new chemical compounds is desired, under the hypothesis that changes in molecular structure are directly reflected in the compound activity/property. Input data and data preparation for regression analysis are of great importance, but these subjects were beyond the aim of the present manuscript.

Linear regression analysis identifies, in QSAR analysis, the linearity between a compound's activity and the descriptors calculated from its chemical structure. Regression analysis answers the following questions: Does the biological activity depend on structural information?
If so, is the nature of the relationship linear? If yes, how good is the model at predicting the biological activity of new compounds?

In this manuscript, some rules have been presented: (1) test the assumptions of linear regression (normality, linearity, independence, homoscedasticity, and/or collinearity); (2) construct the model(s) if the assumptions are met and analyze the data (choose the best-performing model); (3) assess and diagnose the alternative models (analyze the MLR); (4) decide which model best fits your objectives. Following these steps in linear regression analysis certainly leads to a well-performing estimation model, but the predictive power of the model will always depend on the structure and biological activity of the compounds for which the model is used to predict; in other words, it will depend on similarity in terms of structure and activity.

Research on linear regression analysis is of general interest since MLR has found applicability in many research fields. The classical approach implemented in the available dedicated software deals with maximization of the correlation coefficient. Maximization of the observed probability under the assumption of random error affecting all variables in the model is ongoing research and will be reported elsewhere. It is known that the classical method is exposed to type I errors (accepting a regression model obtained by maximization of the determination (correlation) coefficient even if the relationship does not exist), while this new approach is not, because it maximizes just the chance of observation, under the hypothesis that the error between the observed value and the value obtained by the model is random and depends only on the observed/measured value (therefore being symmetric relative to its arithmetic mean).

ACKNOWLEDGMENT

The authors are grateful to the organizers of BIOMATH 2013 for the opportunity to present their results.
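As an illustration of the workflow summarized in Section III (fit an MLR model on a training set, then judge it on compounds not used in fitting), the following is a minimal sketch in Python with NumPy. It is not the software or the data used in this study: the function names (`fit_mlr`, `predict_mlr`, `r_squared`, `class_accuracy`), the synthetic descriptors and activities, and the class cut-offs are all hypothetical and serve only to show the calculations. The determination coefficient is taken here as the squared Pearson correlation between observed and predicted values, which is an assumption about the intended definition.

```python
import numpy as np

def fit_mlr(X, y):
    """Least-squares fit of y = b0 + sum(b_i * X_i); returns [b0, b1, ..., bk]."""
    A = np.column_stack([np.ones(len(X)), X])      # prepend the intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_mlr(coef, X):
    """Estimated/predicted activity for descriptor matrix X."""
    return coef[0] + X @ coef[1:]

def r_squared(y_obs, y_pred):
    """Squared Pearson correlation between observed and predicted values."""
    r = np.corrcoef(y_obs, y_pred)[0, 1]
    return r ** 2

def class_accuracy(y_obs, y_pred, cuts):
    """Overall and per-class agreement after binning both vectors at `cuts`."""
    obs_cls = np.digitize(y_obs, cuts)
    pred_cls = np.digitize(y_pred, cuts)
    overall = float(np.mean(obs_cls == pred_cls))
    per_class = {}
    for c in np.unique(obs_cls):
        mask = obs_cls == c                         # compounds observed in class c
        per_class[int(c)] = float(np.mean(pred_cls[mask] == c))
    return overall, per_class

# Synthetic stand-in data (hypothetical; not the endocrine disruptor sets).
rng = np.random.default_rng(0)
n_train, n_ext, k = 40, 15, 2                       # respects 5 * k <= n
b_true = np.array([0.5, 1.2, -0.8])
X_train = rng.normal(size=(n_train, k))
X_ext = rng.normal(size=(n_ext, k))
y_train = b_true[0] + X_train @ b_true[1:] + rng.normal(scale=0.3, size=n_train)
y_ext = b_true[0] + X_ext @ b_true[1:] + rng.normal(scale=0.3, size=n_ext)

coef = fit_mlr(X_train, y_train)
r2_train = r_squared(y_train, predict_mlr(coef, X_train))
r2_ext = r_squared(y_ext, predict_mlr(coef, X_ext))

cuts = [-1.0, 1.0]                                  # hypothetical class boundaries
overall, per_class = class_accuracy(y_ext, predict_mlr(coef, X_ext), cuts)

print("coefficients:", coef)
print("r2 training:", r2_train, "r2 external:", r2_ext)
print("external accuracy:", overall, per_class)
```

Comparing `r2_train` with `r2_ext` mirrors the stability comparison made for the Di-model and hi-model, and the per-class dictionary corresponds to the class-wise accuracies discussed for Figure 4, with classes 0, 1 and 2 standing in for three binding categories.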