Engineering, Technology & Applied Science Research, Vol. 8, No. 4, 2018, 3135-3140
www.etasr.com
Ahmad et al.: Statistical Modeling via Bootstrapping and Weighted Techniques Based on Variances

Statistical Modeling via Bootstrapping and Weighted Techniques Based on Variances: An Oral Health Case Study

W. M. A. W. Ahmad, School of Dental Sciences, Universiti Sains Malaysia, Malaysia, wmamir@usm.my
N. A. Aleng, School of Informatics and Applied Mathematics, Universiti Malaysia Terengganu, Malaysia, azlida_aleng@umt.edu.my
Z. Ali, School of Mathematical Sciences, Universiti Sains Malaysia, Malaysia, zalila_ali@usm.my
M. S. M. Ibrahim, School of Dental Sciences, Universiti Sains Malaysia, Malaysia, shafiqmat786@gmail.com

Abstract-Multiple logistic regression is a method for handling dependent variables with a binary outcome. It is becoming increasingly widespread as a statistical technique that represents a discrete probability model. Many studies have focused on its application but fewer on building the methodology. This study aims to provide an applied method for multiple logistic regression, called modified Bayesian logistic regression modeling, as an alternative technique for logistic regression analysis. It combines the bootstrap method, implemented as a SAS macro, with weighted techniques based on variances, implemented as a SAS algorithm. Data on oral cancer were used to illustrate a real scenario of oral health data. The data were analyzed with both the multiple logistic regression algorithm and the modified Bayesian logistic regression. Results from both cases are strongly supported by clinical studies. Through the proposed algorithm, the researcher has the option of analyzing the data with either the usual or the alternative method. Final results indicate that the modified procedure can provide more efficient results, especially for cases that involve statistical inference.
Keywords-multiple logistic regression; bootstrap; Bayesian and weighted techniques

I. INTRODUCTION

Logistic regression analyzes the relationship between multiple independent variables and a categorical dependent variable [1]. The multiple logistic response function is

$$E\{Y\} = \frac{\exp(\mathbf{X}'\boldsymbol{\beta})}{1 + \exp(\mathbf{X}'\boldsymbol{\beta})}$$

where

$$\mathbf{X}'\boldsymbol{\beta} = \beta_0 + \beta_1 X_1 + \cdots + \beta_{p-1} X_{p-1}, \qquad \mathbf{X}_i'\boldsymbol{\beta} = \beta_0 + \beta_1 X_{i1} + \cdots + \beta_{p-1} X_{i,p-1}$$

and, as $p \times 1$ vectors,

$$\boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_{p-1} \end{bmatrix}, \qquad \mathbf{X} = \begin{bmatrix} 1 \\ X_1 \\ \vdots \\ X_{p-1} \end{bmatrix}, \qquad \mathbf{X}_i = \begin{bmatrix} 1 \\ X_{i1} \\ \vdots \\ X_{i,p-1} \end{bmatrix}$$

The multiple logistic regression model can therefore be stated as follows: the $Y_i$ are independent Bernoulli random variables with expected values $E\{Y_i\} = \pi_i$, where

$$E\{Y_i\} = \pi_i = \frac{\exp(\mathbf{X}_i'\boldsymbol{\beta})}{1 + \exp(\mathbf{X}_i'\boldsymbol{\beta})}$$

The X observations are considered to be constants. Alternatively, if the X variables are random, $E\{Y_i\}$ is viewed as a conditional mean, given the values of $X_{i1}, X_{i2}, \ldots, X_{i,p-1}$.

A. Bootstrapping, Weighted Techniques and Bayesian Approach with SAS

Authors in [2] introduced the bootstrap method, which is built on the empirical distribution function (EDF). The bootstrap starts from an original sample taken from the studied population. The second step is to copy the original sample a number of times in order to create a pseudo-population. Several samples are then drawn at random from this pseudo-population, each providing a new sample derived from the original one. The resampled data are stored and used to create a new distribution for further analysis [2, 3].

Bayesian analysis centres on the posterior distribution, which plays an important role in Bayesian estimation and especially in statistical inference. When the analysis is run, summary statistics for the posterior samples are produced by default.
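For reference, the posterior distribution mentioned above can be written out for the logistic model. This is the standard statement of Bayes' theorem rather than a formula taken from this paper; $p(\boldsymbol{\beta})$ denotes a generic prior density and $n$ the number of observations, both notation introduced here for illustration.

```latex
p(\boldsymbol{\beta} \mid y_1, \ldots, y_n)
  \;\propto\; p(\boldsymbol{\beta}) \prod_{i=1}^{n} \pi_i^{\,y_i} (1 - \pi_i)^{1 - y_i},
\qquad
\pi_i = \frac{\exp(\mathbf{X}_i'\boldsymbol{\beta})}{1 + \exp(\mathbf{X}_i'\boldsymbol{\beta})}
```

When the prior is flat, $p(\boldsymbol{\beta}) \propto 1$, the posterior is proportional to the Bernoulli likelihood alone, so the posterior mode coincides with the maximum likelihood estimate.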
The OUTPOST= option saves the posterior samples in a SAS data set for further processing. PROC GENMOD fits generalized linear models with Bayesian methods (Bayesian estimation procedures), including models with a normal error term [4]. The SEED= option fixes the random number seed so that the results are reproducible. By default, a uniform prior is used: a flat prior that reflects ignorance about the location of the parameter and places equal weight on all values that the regression coefficients can take. When ODS Graphics is enabled, PROC GENMOD also produces diagnostics for assessing Markov chain convergence and their interpretation [5].

Weighting is a very important technique which adjusts the data to reflect differences in the number of population units that each respondent represents [6]. Several quantities can serve as weights, such as the mean, the standard deviation or the variance, chosen according to the sample population of interest. Moreover, a weighted technique allows different weights to be assigned to different cases in the data analysis. The aim of weighting is to correct skewness and to make the sample more representative of the true population.

II. METHODOLOGY: ALGORITHM BUILDING AND RESULTS

We used secondary oral medical data from 23 oral cancer patients at Universiti Sains Malaysia (USM). The selected variables are nerve invasion (nerv_inv), gender (gen), betel quid (bet), tumour site (tum_site) and tumour size (tum_size).
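A sample of 23 patients is small, which is where the bootstrap idea from Section I.A comes in. The sketch below shows one way to resample with replacement in SAS using PROC SURVEYSELECT. It is an illustration under assumptions, not the procedure used in this paper: the data set names toy and toy_boot, the variables id, y and x, and the five records are all hypothetical.

```sas
/* Hypothetical toy data set; the values are illustrative only */
data toy;
  input id y x;
  datalines;
1 0 2
2 1 3
3 0 1
4 1 4
5 0 2
;
run;

/* Draw bootstrap replicates by sampling rows with replacement */
proc surveyselect data=toy out=toy_boot
     seed=1234      /* fixed seed so the draw is reproducible        */
     method=urs     /* unrestricted random sampling: with replacement */
     samprate=1     /* each replicate has as many rows as the input   */
     outhits        /* write one output row per selection             */
     reps=4;        /* number of bootstrap replicates                 */
run;

/* Count the rows in each replicate */
proc freq data=toy_boot;
  tables Replicate;
run;
```

PROC SURVEYSELECT adds the variable Replicate to the output data set when REPS= is specified, so later procedures can process each bootstrap replicate separately with a BY statement.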
To explore the underlying association between nerve invasion and the selected explanatory variables, a regression model is fitted in this section. Define the dichotomous response variable as follows: $Y_{ij}=0$ if the patient has no nerve invasion and $Y_{ij}=1$ if the patient has nerve invasion. The proposed model is given in equation form as follows:

$$\mathbf{b}'\mathbf{X}_{ij} = \beta_0 + \beta_1(\text{Gender}) + \beta_2(\text{Betel Quid}) + \beta_3(\text{Tumour site}) + \beta_4(\text{Tumour size}) + \varepsilon_{ij} \tag{1}$$

Then the logistic regression model for (1) is given as:

$$\begin{aligned}
\hat{P}(Y_i = 1 \mid X_i) &= \frac{\exp\left(\beta_0 + \beta_1(\text{Gender}) + \beta_2(\text{Betel Quid}) + \beta_3(\text{Tumour site}) + \beta_4(\text{Tumour size})\right)}{1 + \exp\left(\beta_0 + \beta_1(\text{Gender}) + \beta_2(\text{Betel Quid}) + \beta_3(\text{Tumour site}) + \beta_4(\text{Tumour size})\right)} \\
&= \frac{1}{1 + \exp\left[-\left(\beta_0 + \beta_1(\text{Gender}) + \beta_2(\text{Betel Quid}) + \beta_3(\text{Tumour site}) + \beta_4(\text{Tumour size})\right)\right]}
\end{aligned}$$

Then we obtain (2) as follows:

$$\hat{P}(Y_i = 1 \mid X_i) = \frac{1}{1 + \exp\left[-\left(\beta_0 + \beta_1(\text{Gender}) + \beta_2(\text{Betel Quid}) + \beta_3(\text{Tumour site}) + \beta_4(\text{Tumour size})\right)\right]} \tag{2}$$

The estimated model for our case is given in (2). Before applying the equation, two main steps are needed: bootstrapping and weighting the data.

A. Multiple Logistic Regression Modeling on Nerve Invasion

• The cancer data should be entered in SAS as follows to fit the multiple logistic regression.

    data cancer;
      input Nerv_inv Gen Bet Tum_site Tum_size;
      datalines;
    0 0 0 1 1
    0 1 0 2 0
    0 1 0 2 0
    0 1 0 2 0
    0 1 0 3 0
    0 1 0 2 0
    0 1 0 3 0
    0 0 0 2 0
    1 0 1 1 1
    0 0 0 4 0
    0 0 1 1 1
    1 0 1 3 1
    0 0 1 4 0
    0 0 1 1 0
    0 0 1 1 1
    0 0 0 2 0
    0 1 0 4 0
    0 0 0 2 0
    0 1 0 2 0
    0 1 0 2 0
    1 0 1 2 0
    1 1 0 2 0
    1 0 0 4 0
    ;
    run;
    ods rtf file='abc.rtf' style=journal;

• Run the analysis using multiple logistic regression. Below is the syntax of multiple logistic regression.
    /* Run the logistic regression through PROC LOGISTIC */
    ods graphics on;
    proc logistic descending data=cancer;
      model Nerv_inv(event='1') = Gen Bet Tum_site Tum_size / rsquare expb lackfit;
      roc 'Gen Bet Tum_site Tum_size' Gen Bet Tum_site Tum_size;
    run;
    ods graphics off;
    ods rtf close;

B. Results: Using Multiple Logistic Regression Modeling on Nerve Invasion

Figure 1 and Table I show the results obtained with the above syntax. None of the variables in the list is significant. The area under the ROC curve is 0.70556, so the model can accurately discriminate about 70.6% of the cases (more than half of the cases).

Fig. 1. ROC curve

TABLE I. MAXIMUM LIKELIHOOD ESTIMATES

Parameter    Estimate    Standard Error    Pr > ChiSq
Intercept    -3.2222     2.1179            0.1282
Gen          -0.1454     1.5915            0.9272
Bet           0.9728     1.4332            0.4973
Tum_site      0.5295     0.5908            0.3701
Tum_size      1.2608     1.6022            0.4314

Hosmer and Lemeshow Goodness-of-Fit Test
Chi-Square    4.6580
Pr > ChiSq    0.4590

C. Modified Multiple Bayesian Logistic Regression on Nerve Invasion

• Adding bootstrapping to the calculation. The following syntax resamples the data using the bootstrap method:

    %macro bootstrap(data=_last_, booted=booted, boots=4, seed=1234);
    data &booted;
      pickobs = int(ranuni(&seed)*n)+1;   /* randomly select an integer from 1 to n */
      set &data point=pickobs nobs=n;     /* POINT= tells SAS to read observation pickobs;
                                             NOBS= sets n to the number of observations in &data;
                                             when POINT= is used SAS will loop through the data step forever */
      replicate = int(i/n)+1;             /* saves the number of the current bootstrap */
      i+1;
      if i > n*&boots then stop;          /* STOP leaves the data step when n*&boots
                                             observations have been created */
    run;
    %mend bootstrap;

• Logistic regression using bootstrap and weighted sample.

    data cancer;
      input Nerv_inv Gen Bet Tum_site Tum_size;
      datalines;
    0 0 0 1 1
    0 1 0 2 0
    0 1 0 2 0
    0 1 0 2 0
    0 1 0 3 0
    0 1 0 2 0
    0 0 1 3 0
    0 0 0 2 0
    1 0 1 1 1
    0 0 0 4 0
    0 0 1 1 1
    1 0 1 3 1
    0 0 1 4 0
    0 0 1 1 0
    0 0 1 1 1
    0 0 0 2 0
    0 1 0 4 0
    0 0 0 2 0
    0 1 0 2 0
    0 1 0 2 0
    1 0 1 2 0
    1 1 0 2 0
    1 0 0 4 0
    ;
    run;
    ods rtf file='abc.rtf' style=journal;

• The syntax for generating a bootstrap sample:

    %bootstrap(data=cancer, boots=4);
    run;

• The syntax for printing the bootstrap data:

    proc print data=booted;
    run;

• The syntax for logistic regression on the bootstrap sample to get the residuals:

    proc genmod data=booted descending;
      class Nerv_inv;
      model Nerv_inv = Gen Bet Tum_site Tum_size / dist=binomial link=logit;
      output out=pseudo_data reschi=residual;
    run;

• Compute the absolute and squared residuals:

    data data_nerve_invasion1;
      set pseudo_data;
      absresid=abs(residual);
      sqresid=residual**2;
    run;

• Run a regression with the absolute residuals vs. independent variables to get the estimated standard deviation:

    proc genmod data=data_nerve_invasion1;
      model absresid = Gen Bet Tum_site Tum_size;
      output out=data_nerve_invasion p=varians_hat;
    run;

• Compute the weights using the estimated variances:

    data data_nerve_invasion;
      set data_nerve_invasion;
      varians_weight=1/varians_hat;
      label varians_weight = "weights using squared residuals";
    run;

• Do the weighted least squares using the weights from the estimated variances:

    proc genmod data=data_nerve_invasion;
      weight varians_weight;
      model Nerv_inv = Gen Bet Tum_site Tum_size / dist=binomial link=logit;
    run;
    ods graphics on;

• Run the logistic regression with weighted data through proc logistic:

    proc logistic descending data=data_nerve_invasion plots=effect plots=roc(id=prob);
      weight varians_weight;
      model Nerv_inv(event='1') = Gen Bet Tum_site Tum_size / rsquare expb lackfit;
      roc 'Gen Bet Tum_site Tum_size' Gen Bet Tum_site Tum_size;
    run;

• Run the logistic regression with weighted data through proc genmod:

    proc genmod data=data_nerve_invasion descending;
      weight varians_weight;
      model Nerv_inv = Gen Bet Tum_site Tum_size / dist=binomial link=logit;
      bayes nbi=1000 nmc=10000 thin=2 seed=1 out=posterior;
    run;
    ods graphics off;
    ods rtf close;

D. Results: Modified Multiple Bayesian Logistic Regression on Nerve Invasion

The area under the ROC curve (Figure 2) is 0.78556, so the model can accurately discriminate about 78.6% of the cases (more than half of the cases).

TABLE II. MAXIMUM LIKELIHOOD PARAMETER ESTIMATES

Parameter    Estimate    Standard Error    Pr > ChiSq
Intercept     1.9942     1.5181            0.1890
Gen           1.3686     1.1029            0.2146
Bet          -1.6603     0.9156            0.0698
Tum_site     -0.1978     0.4045            0.6248
Tum_size     -0.3001     1.0968            0.7844

Hosmer and Lemeshow Goodness-of-Fit Test
Chi-Square    4.9294
Pr > ChiSq    0.2946

Fig. 2. ROC curve

A converged Markov chain has explored the parameter space sufficiently according to the target distribution. Convergence can also be assessed visually through the trace plot of each variable, not just the ones of interest, and inference should be based on converged Markov chains. Figures 3 to 6 show by visual examination of the trace plots that the Markov chain has stabilized with good mixing. The samples stay close to the high-density region of the target distribution, so convergence looks good and consistent across the parameters.

Fig. 3. Trace plot for gender
Fig. 4. Trace plot for betel quid
Fig. 5. Trace plot for tumour site
Fig. 6. Trace plot for tumour size

E. Comparison of the Multiple Logistic Regression with Modified Multiple Bayesian Logistic Regression Model

Table III shows the summary of the results for both methods, multiple logistic regression and modified multiple Bayesian logistic regression. As expected, the results of the modified method are improved. The gender and betel quid factors become significant at p<0.25, which is clinically significant. The significance of gender is also supported in [7], in which skin tumours with perineural invasion were found in males, in the high-risk mask zone of the face. Authors in [8-10] found that chewing betel quid independently contributes to the risk of cancer, which agrees with our findings. Moreover, the factors which are not related to the dependent variable remain non-significant in their association with nerve invasion. This shows that the alternative method can improve the results in terms of the p-values and the ROC curve. Figure 7 shows that the area under the curve increases from 0.7056 (multiple logistic regression) to 0.7856 (modified multiple Bayesian logistic regression). The position of the ROC curve on the graph reflects the accuracy of the model and the overall test performance.

TABLE III. RESULTS COMPARISON OF THE TWO METHODS

               Multiple Logistic Regression              Modified Multiple Bayesian Logistic Regression
Parameter      Estimate   Standard Error   Sig. Value    Estimate   Standard Error   Sig. Value
Intercept      -3.2222    2.1179           0.1282         1.9942    1.5181           0.1890
Gender         -0.1454    1.5915           0.9272         1.3686    1.1029           0.2146*
Betel Quid      0.9728    1.4332           0.4973        -1.6603    0.9156           0.0698*
Tumour site     0.5295    0.5908           0.3701        -0.1978    0.4045           0.6248
Tumour size     1.2608    1.6022           0.4314        -0.3001    1.0968           0.7844

Hosmer and Lemeshow Goodness-of-Fit Test
Chi-Square      4.6580                                    4.9294
Pr > ChiSq      0.4590                                    0.2946

*Significant at p < 0.25

Fig. 7. Comparison of ROC curve of the two methods

III. CONCLUSION

This paper focused on two main parts. The first is the methodology, based on algorithm building with bootstrapped data weighted by variances. The second is the output gained as a new finding for applied research. In our study, the potential variables were those influencing nerve invasion among patients with oral health problems. According to Table II, nerve invasion may depend on factors related to gender ($\beta_1 = 1.3686$, p<0.25) and taking betel quid ($\beta_2 = -1.6603$, p<0.25). This paper demonstrates two different techniques that can be used to model such relationships. First, by using multiple logistic regression we obtained the following result:

$$\hat{P}(Y_i = 1 \mid X_i) = \frac{1}{1 + \exp\left[-\left(\beta_0 - 0.1454(\text{Gender}) + 0.9728(\text{Betel Quid}) + 0.5295(\text{Tumour site}) + 1.2608(\text{Tumour size})\right)\right]}$$

And second, the modified Bayesian multiple logistic regression:

$$\hat{P}(Y_i = 1 \mid X_i) = \frac{1}{1 + \exp\left[-\left(\beta_0 + 1.3686^{*}(\text{Gender}) - 1.6603^{*}(\text{Betel Quid}) - 0.1978(\text{Tumour site}) - 0.3001(\text{Tumour size})\right)\right]}$$

The proposed method can be applied as an alternative to ordinary multiple logistic regression in the case of a small sample size. This method produces better results compared to the previous one.

IV. DISCUSSION AND RECOMMENDATIONS

The objective of this research is to develop an alternative method for multiple logistic regression with some modification of the programming calculation. Through this procedure, we can determine the factors associated with nerve invasion (for the case of the oral study). On top of that, the primary focus is the effectiveness of the calculation obtained by combining several statistical methods, which leads to significantly better results. The combined calculation method, using weighted techniques and the bootstrap approach, showed an improvement of the results, which can be seen in Tables I and II. Table I shows the standard calculation and Table II shows the modified calculation. Taken together, the tables show some improvement in the results.

In Table II, factors which are clinically associated with the dependent variable show their true association. We can also estimate the precise odds ratio based on the Wald chi-square. This promising method led to successful research and gave better results for decision making, especially for the decision maker. Therefore, in order to keep the effectiveness of the results, it is necessary to have a good way of calculation with some improvement of the proposed strategy. The proposed method can give better predictions for decision making in the future. In this paper, the algorithms show good potential for determining the factors that lead to oral cancer. Some recommendations arising from the study findings are as follows:

• There is a need to explore more on the methodology improvement in order to optimize the gained output. This could include a higher level of combination of theory, methodology building and computation, which may lead to higher precision and accuracy of the results.

• Performance measurements can be taken into consideration when measuring the quality of the recommended algorithms.
This knowledge will empower researchers and serve as a roadmap to improve future studies.

ACKNOWLEDGMENT

Authors would like to express their gratitude to Universiti Sains Malaysia for providing the research funding (Grant no. 1001/PPSG/8012278, School of Dental Sciences, Universiti Sains Malaysia).

REFERENCES

[1] D. W. Hosmer, S. Lemeshow, R. X. Sturdivant, Applied Logistic Regression, 3rd ed., John Wiley & Sons, 2013
[2] B. Efron, R. J. Tibshirani, An Introduction to the Bootstrap, Chapman and Hall/CRC, 1993
[3] G. E. Higgins, "Statistical Significance Testing: The Bootstrapping Method and an Application to Self-Control Theory", The Southwest Journal of Criminal Justice, Vol. 2, No. 1, pp. 54-76, 2005
[4] A. Gelman, J. B. Carlin, H. S. Stern, D. B. Rubin, Bayesian Data Analysis, Chapman and Hall/CRC, 2004
[5] M. Stokes, F. Chen, F. Gunes, "An Introduction to Bayesian Analysis with SAS/STAT Software", SAS Global Forum 2014 Conference, Washington DC, USA, Paper SAS400-2014, March 23-26, 2014
[6] J. Mickey, S. Greenland, "The Impact of Confounder-Selection Criteria on Effect Estimation", American Journal of Epidemiology, Vol. 129, No. 1, pp. 125-137, 1989
[7] I. Leibovitch, S. C. Huilgol, D. Selva, S. Richards, R. Paver, "Basal cell carcinoma treated with Mohs surgery in Australia III. Perineural invasion", Journal of the American Academy of Dermatology, Vol. 53, No. 3, pp. 458-463, 2005
[8] C. T. Liao, C. J. Tung-Chieh, H. M. Wang, I. H. Chen, C. Y. Lin, T. M. Chen, L. L. Hsieh, A. J. Cheng, "Telomerase as an independent prognostic factor in head and neck squamous cell carcinoma", Head & Neck, Vol. 26, No. 6, pp. 504-512, 2004
[9] C. T. Liao, J. T. Chang, H. M. Wang, S. H. Ng, C. Hsueh, L. Y. Lee, C. H. Lin, I. H. Chen, S. F. Huang, A. J. Cheng, T. C. Yen, "Analysis of risk factors predictive of local tumor control in oral cavity cancer", Annals of Surgical Oncology, Vol. 15, No. 3, pp. 915-922, 2008
[10] C. T. Liao, C. J. Kang, J. T. Chang, H. M. Wang, S. H. Ng, C. Hsueh, L. Y. Lee, C. H. Lin, A. J. Cheng, I. H. Chen, S. F. Huang, T. C. Yen, "Survival of second and multiple primary tumors in patients with oral cavity squamous cell carcinoma in the betel quid chewing area", Oral Oncology, Vol. 43, No. 8, pp. 811-819, 2007