An Enhanced Binary Classifier Incorporating Weighted Scores

Deepali Virmani, Nikita Jain, Abhishek Srivastav, Mahima Mittal, Surbhi Mittal
Computer Science Engineering Department, Bhagwan Parshuram Institute of Technology, Delhi, India
deepalivirmani@gmail.com, nikitajain1210@gmail.com, srisaiabhi2196@gmail.com, mahimamittal01@gmail.com, surbhimittal19970@gmail.com

Abstract—In this study, an approach is proposed that predicts the class of an observation from several parameters using the weighted score classification method. Weighted scores are used for classification by representing the data points on a graph with respect to a threshold value found through the proposed algorithm. Cluster analysis is then employed to group the observations and verify the approach. The algorithm is simple in terms of the calculations required to reach a conclusion and provides high accuracy on large datasets. The use of the weighted score method along with curve fitting and cluster analysis improves its performance, and the algorithm is designed so that the intermediate values can be processed for clustering at the same time. The proposed algorithm excels due to its simple approach and provides an accuracy of 97.72%.

Keywords-weighted score; classification; clustering; deviation; threshold; SVM; decision tree

I. INTRODUCTION

Classification is a data mining technique in which a collection of data is categorized into classes; the training dataset may or may not have class labels, and a dataset may have two or more class labels. In this work we focus on binary classification using a clustering technique based on curve analysis and the weighted score method, followed by verification. To illustrate with an example, suppose we have a dataset containing data about spam from a repository. We want to identify the data points above and below a threshold level, which are classified as spam and not spam respectively. The threshold level can be obtained through sorting and processing of the dataset. The proposed algorithm preprocesses the dataset to find the weighted scores as well as to predict the threshold value, which is then represented graphically. After this step, verification is done using cluster analysis. Observations on various datasets were found to be accurate to a high degree. Deviations of the data points from the threshold value were obtained and various inferences were drawn. We also calculated the individual effect of each attribute on the classification. Data provided by datasets contain hidden information that might not be known to the user. An effort has been made to develop a new algorithm to facilitate mining techniques. The simple approach of the algorithm is easy to understand and implement. The proposed algorithm is based on clustering, which acts as a stable preprocessing method for binary classification.
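To make the spam illustration concrete, the following minimal sketch labels points by comparing an already computed score against a threshold. The scores and the threshold value here are invented purely for illustration and are not taken from the paper's datasets.

```python
# Hypothetical per-message scores and threshold, for illustration only.
scores = [0.12, 0.91, 0.47, 0.88, 0.05]
threshold = 0.50

# Points above the threshold are labeled spam, points below it not spam.
labels = ["spam" if s > threshold else "not spam" for s in scores]
print(list(zip(scores, labels)))
```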
The weighted score method assigns different importance degrees to the instances of a dataset. The proposed classifier calculates the mean of each sample, multiplies it with each of the sample's attribute values and sums the products to assign a weighted score to that sample. A threshold value is then taken, and the plotted data points fall below or above it. The minimum and maximum values among the weighted sample sums are each subtracted from the threshold value and the differences are halved to obtain the centers of two clusters. Clustering is performed using these centers, taking into consideration a maximum distance equal to the distance between a center and the threshold value. The clusters obtained correspond to the binary class labels which classify the dataset, and the observations are cross-verified using the clustering method. The weighted score is a simple technique that also captures the individual contribution of each attribute to the deviation of the data point from the threshold.

II. LITERATURE REVIEW

Various studies have been proposed for classifying datasets into two categories, and previous researchers have utilized different classification approaches. The support vector machine (SVM) is used for classification and regression analysis and employs supervised learning techniques. In the SVM algorithm, new examples are assigned directly to one of the categories rather than being given class probabilities, so SVM is regarded as a non-probabilistic classifier. SVM can be thought of as a clustering algorithm in a space in which points belonging to a cluster are distant from points of other clusters. In that space a hyperplane divides the points into groups, and the particular hyperplane selected is the one that maximizes the separation from the data points on either of its sides. This is also called a linear classifier. There are several variations of the basic SVM approach, namely the linear kernel, polynomial kernel and radial kernel SVM. The most efficient method for fitting an SVM is sequential minimal optimization (SMO), which breaks the problem down into sub-problems that can be solved analytically rather than numerically. SVM has various applications, such as the recognition of standing people in a picture. Authors in [1] used SVM along with k nearest neighbors (KNN) for visual category recognition, and authors in [2] used variations of SVM to predict the future popularity of social media messages. A disadvantage of SVM is that the theory only covers the determination of the parameters for a given value of the regularization and kernel parameters, so the result depends on the kernel choice. As a consequence, SVM moves the problem of overfitting from parameter optimization to model selection, and kernel models can be quite sensitive to overfitting the model selection criterion [3]. In [4], local space-time features were used with SVM to recognize complex motion patterns. A decision tree is a predictive model that maps observations about an item to conclusions about the item's target value. It has various applications in statistics, data mining and machine learning. In this structure, each internal node denotes a test on an attribute, leaves represent class labels and branches represent conjunctions of features corresponding to the test outcomes.
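As a point of reference for the two baselines discussed above, the following is a minimal scikit-learn sketch of a radial-kernel SVM and a depth-limited decision tree on a synthetic binary problem. The dataset and hyperparameters are illustrative assumptions and are unrelated to the experiments reported later in this paper.

```python
# Illustrative comparison of the two surveyed baselines on synthetic binary data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

svm = SVC(kernel="rbf").fit(X_train, y_train)                         # radial kernel SVM
tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_train, y_train)

print("SVM accuracy:          ", svm.score(X_test, y_test))
print("Decision tree accuracy:", tree.score(X_test, y_test))
```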
Besides being simple to interpret and understand, decision trees are able to handle both categorical and numerical data [5]. To solve the problems of fragmentation and replication, the notion of decision graphs has been introduced, which allows disjunctions (joins). Several assumptions are made by the decision tree algorithm. At the beginning, the whole training set is considered as the root. Feature values are preferred to be categorical; if the values are continuous, they are discretized prior to building the model. Records are distributed recursively on the basis of attribute values. The decision tree algorithm is sensitive to root selection: if the dataset consists of n attributes, deciding which attribute to place at the root or at different tree levels as internal nodes is a complicated step, and a random node cannot simply be placed at the root. If a random approach is followed, it may give bad results with low accuracy, so attribute placement is done by a statistical approach. A variation, the weighted class based decision tree [6], has been proposed in which weights are easily assigned according to the importance of class labels, which are then classified using a decision tree. The dataset is split in this approach, which might introduce bias, since small changes in the dataset can have a big impact. Decision trees can also lead to over-complex trees that do not generalize well from the training data. Studies have shown that classification is often more precise when a combination of classifiers is used, which can outperform a highly specific classifier [7]. With a combination of classifiers, noisy data can be handled better, with augmented accuracy and speed, even though complexity issues may emerge [8]. The weighted score method assigns different importance degrees to the instances of a dataset and is often used as a pre-processing step. Automated weighted sum (AWSum) uses a weighted sum approach in which feature values are assigned weights that are summed and compared to a threshold in order to classify an example; it also provides insight into the data [9]. Authors in [10] dealt with a weighted score fusion method for classifying a fruit based on the diverse and complementary features that describe it. The algorithm involves preprocessing, multiple feature selection, optimal feature selection and SVM; however, the approach requires improvements for real-world environments. A quadratic classifier is used in statistical classification to separate measurements of two or more classes of objects or events using a quadric surface; it is a more general version of the linear classifier. Statistical classification considers a set of observation vectors x of an object or event, each of which has a known type y, referred to as the training set. The problem is then to determine the class of a new observation vector, and the separating surface is quadratic in nature. In the special case where each observation consists of two measurements, the surfaces separating the classes are conic sections, so the quadratic model generalizes the linear approach by incorporating conic separating surfaces. Quadratic discriminant analysis (QDA) is closely related to linear discriminant analysis (LDA), where it is assumed that the measurements from each class are normally distributed.
Unlike LDA, however, in QDA there is no assumption that the covariance of each of the classes is identical. The classification error rate ranges around 20%-30%. An artificial neural network consists of units (neurons), arranged in layers, which convert an input vector into some output. Each unit takes an input, applies a (usually nonlinear) function to it and then passes the output on to the next layer. The networks are defined to be feed-forward: a unit feeds its output to all the units of the next layer, but there is no feedback to the previous layer. Weightings are applied to the signals passing from one unit to another, and these weightings are tuned in the training (learning) phase to adapt the neural network to the particular problem. The network processes records one at a time and learns by comparing its classification of each record with the known actual classification. The errors from the initial classification are fed back into the network and used to modify the network for further iterations. Neurons are organized into layers: input, hidden and output. The input layer is composed not of full neurons but simply of the record's values, which are fed to the next layer of neurons. Next there are one or more hidden layers. The final layer is the output layer, where there is one node for each class. A single sweep forward through the network results in the assignment of a value to each output node, and the record is assigned to the class node with the highest value. A key feature of neural networks is the iterative learning process in which records (rows) are presented to the network one at a time and the weights associated with the input values are adjusted each time. Authors in [12] used a neural network classifier for the diagnosis and classification of medical problems, and authors in [13] modified the output of the neural network classifier into the form of Bayes classifier values.

The search for a confirmed improvement of classification techniques has been a continuing topic in the data mining field. The proposed algorithm is advantageous in terms of the calculations required to arrive at a conclusion and of its accuracy on large datasets. Secondly, the use of weighted scores for data pre-processing improves the clustering results [11]. The algorithm is made in such a way that the intermediate values can be processed for clustering at the same time: the pre-processing of the data and its representation allow clustering and cluster production simultaneously.

TABLE I. COMPARISON OF BINARY CLASSIFICATION TECHNIQUES

Name | Complexity
SVM-KNN | Highest
Weighted Classification / Decision tree | Medium
Binary Classification / Quantum Adiabatic Algorithm | Medium
AWSum | Lowest
Quadratic Classifier | Low
Neural Network Classifier | Moderate

III. METHODOLOGY

1. WEIGHTED_SCORE(x, n) // function to find the row average of the dataset; x holds the row's column values, n is the number of columns
   1.1. LOOP1: for i in 1 to n
        1.1.1. w = (1/n) ∑ x_i, i = 1 to n // w is the row average (weight of the sample)
   1.2. LOOP2: for i in 1 to n
        1.2.1. r = ∑ w · x_i, i = 1 to n // r is the weighted score of the row
   1.3. return r
2. FIND_THRESHOLD(r_i, n, k) // function that finds the threshold value by clustering (k means); k is the number of clusters
   2.1. LOOP3: for i in 1 to k
        2.1.1. LOOP4: for j in 1 to n // n is the number of data points
               assign each data point to the nearest center
   2.2. c1 ∪ c2 ∪ c3 ∪ ... ∪ ck = dataset
   2.3. END LOOP3, LOOP4
3. THRESHOLD_VERIFY(r_i, n, K, K_i) // verifying the threshold value such that T(−X) = max{t ∈ T | #{s ∈ T | s ≥ t} = X} and T(X) = {t ∈ T | t ≥ T(−X)}, where X = n (number of dataset entries)
   3.1. LOOP5: for i in 1 to n
        3.1.1. if TMAX < r_i
        3.1.2. set TMAX = r_i
   3.2. LOOP6: for i in 1 to n
        3.2.1. if TMIN > r_i
        3.2.2. set TMIN = r_i
   3.3. threshold_value TH = (TMAX − TMIN)/2 // this TH coincides with the k means center provided appropriate scaling is there
   3.4. O1 = (TMAX − TH)/2 // initial centroids
   3.5. O2 = (TH − TMIN)/2
   3.6. LOOP7: for i in 1 to K // K is the number of clusters
        3.6.1. LOOP8: for j in 1 to K_i // K_i is the number of objects of cluster i
               3.6.1.1. F{C1, C2, ..., Ck} = ∑_{i=1..K} ∑_{j=1..K_i} |T_ij − O_i| // T_ij is the j-th object of the i-th cluster and O_i is the centroid of the i-th cluster
   3.7. The dataset is divided into two parts, one part above the threshold and the other below it.
4. MEASURE_DEVIATION(x_i, TH, r_i, n) // measures the deviation of the weighted score from the threshold value
   4.1. dev = TH − r_i
   4.2. LOOP9: for i in 1 to n
        4.2.1. dom_factor[i] = x_i / r_i // influence of each attribute value on the weighted score
   4.3. END LOOP9

A flow chart of the algorithm is shown in Figure 1.

Fig. 1. Flow chart of the algorithm used
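The pseudocode above can be read as the following minimal Python sketch. It assumes that a sample's weight is its row mean and that its weighted score is the sum of its weighted attribute values; since parts of the pseudocode admit more than one reading, this is an illustrative interpretation (with invented function names and data), not the authors' implementation.

```python
import numpy as np

def weighted_scores(X):
    """Step 1: weighted score of each row, read as (row mean) times (sum of row values)."""
    X = np.asarray(X, dtype=float)
    w = X.mean(axis=1)               # per-row weight w_i (row average)
    return w * X.sum(axis=1)         # r_i = sum_j w_i * x_ij

def threshold_and_centers(r):
    """Steps 3.3-3.5: threshold and the two initial centroids from the extreme scores."""
    t_max, t_min = r.max(), r.min()
    th = (t_max - t_min) / 2.0       # threshold value TH as written in step 3.3
    o1 = (t_max - th) / 2.0          # initial centroid O1 (cluster above the threshold)
    o2 = (th - t_min) / 2.0          # initial centroid O2 (cluster below the threshold)
    return th, o1, o2

def classify(r, th):
    """Step 3.7: label 1 if the weighted score lies above the threshold, else 0."""
    return (r > th).astype(int)

def deviation_and_dominance(X, r, th):
    """Step 4: deviation from the threshold and per-attribute dominance factors."""
    dev = th - r
    dom = np.asarray(X, dtype=float) / r[:, None]   # x_ij / r_i
    return dev, dom

# Illustrative usage on a tiny made-up dataset (three rows, three attributes).
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [0.5, 0.2, 0.1]])
r = weighted_scores(X)
th, o1, o2 = threshold_and_centers(r)
print(r, th, classify(r, th))
```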
IV. RESULTS

In this paper we focused on the modification of the weighted score, and the processing of the weighted scores was done to perform clustering. The proposed algorithm was applied to various datasets and the observations were recorded. The datasets were of different types and varied in complexity. The execution time of the algorithm on the datasets is provided in the results (Table II). Seven datasets were selected, and each row entry or data point of a dataset was classified to one of the two levels, either below or above the threshold level. The algorithm can classify the data points of a dataset into two parts as soon as it is provided with an input. Figures 2-8 show the algorithm application on the dataset (each upper image) and the application of the clustering algorithm on the weighted scores (each lower image). The results of both algorithms were verified, and it was found that the threshold value and the cluster center coincided in both cases, provided that appropriate scaling is present. The threshold value is represented in the figures with an orange dot which divides the dataset into two parts, the data points being depicted as red and black dots: the points above the threshold are black and the points below it are red. The chosen datasets show a wide range of variations, and every dataset is unique because of its nature and entries. Some of the datasets, like [12] and [17], contain a smaller number of data points, and the algorithm was highly accurate in those cases. For a slightly larger number of data points the accuracy was a little lower, and when the data points were scattered along the graph the accuracy was also slightly lower, due to variations in the dataset and deviation from an ideal behavior.
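As a rough guide to how figures of this kind can be reproduced, the sketch below plots synthetic weighted scores against their sorted index, using the color convention described above (black above the orange threshold line, red below). The scores are randomly generated stand-ins, not the values from datasets [14]-[20].

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic weighted scores standing in for a real dataset (illustration only).
rng = np.random.default_rng(0)
r = np.sort(rng.normal(loc=5.0, scale=2.0, size=200))
th = (r.max() - r.min()) / 2.0            # threshold as defined in the methodology

idx = np.arange(r.size)
above = r > th
plt.scatter(idx[above], r[above], c="black", s=10, label="above threshold")
plt.scatter(idx[~above], r[~above], c="red", s=10, label="below threshold")
plt.axhline(th, color="orange", linewidth=2, label="threshold")
plt.xlabel("data point index (sorted by weighted score)")
plt.ylabel("weighted score")
plt.legend()
plt.show()
```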
Fig. 2. Up: Algorithm application on sorted dataset_1 [14]. Down: Clustering on unsorted dataset_1 [14]
Fig. 3. Up: Algorithm application on sorted dataset_2 [15]. Down: Clustering on unsorted dataset_2 [15]
Fig. 4. Up: Algorithm application on sorted dataset_3 [16]. Down: Clustering on unsorted dataset_3 [16]
Fig. 5. Up: Algorithm application on sorted dataset_4 [17]. Down: Clustering on unsorted dataset_4 [17]
Fig. 6. Up: Algorithm application on sorted dataset_5 [18]. Down: Clustering on unsorted dataset_5 [18]
Fig. 7. Up: Algorithm application on sorted dataset_6 [19]. Down: Clustering on unsorted dataset_6 [19]
Fig. 8. Up: Algorithm application on sorted dataset_7 [20]. Down: Clustering on unsorted dataset_7 [20]

TABLE II. RESULTS

Dataset | Error (%) | Accuracy | Execution Time
Dataset_1 [14] | 0.227% | 99.72% | 2.143s
Dataset_2 [15] | 2% | 98% | 1.022s
Dataset_3 [16] | 2.197% | 97.802% | 1.561s
Dataset_4 [17] | 2.3% | 97.7% | 4.820s
Dataset_5 [18] | 5.2% | 94.8% | 2.69s
Dataset_6 [19] | 4.329% | 95.67% | 3.01s
Dataset_7 [20] | 3.125% | 96.875% | 0.992s

V. ERROR ANALYSIS AND DOMINATING FACTOR

The error percentage was calculated on the basis of the data points that were incorrectly clustered, i.e. those deviating from the expected behavior. The number of data points that were incorrectly clustered was divided by the total number of data points to obtain the error percentage. The execution time gives the processing time of the dataset by the algorithm. It was observed that the datasets in which there were smaller distances between individual data points had a relatively larger error percentage than the datasets in which the data points had larger distances between them. No data point was found to be lying exactly on the threshold value, and therefore each data point belongs either below or above the threshold level.
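The error computation described above amounts to the fraction of points whose cluster assignment disagrees with their threshold-based label. A minimal sketch, assuming both label arrays use the same 0/1 encoding, is shown below.

```python
import numpy as np

def error_percentage(threshold_labels, cluster_labels):
    """Share (in %) of points whose cluster assignment deviates from the threshold-based label."""
    threshold_labels = np.asarray(threshold_labels)
    cluster_labels = np.asarray(cluster_labels)
    mismatched = np.count_nonzero(threshold_labels != cluster_labels)
    return 100.0 * mismatched / threshold_labels.size

# Example: one disagreement out of four points gives a 25% error (75% accuracy).
print(error_percentage([1, 0, 1, 1], [1, 0, 0, 1]))
```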
VI. CONCLUSION

From the results obtained by the implementation of the algorithm, it was found that the algorithm has a 97.72% average accuracy. The algorithm allowed pre-processing of the data, and its weighted representation allowed simultaneous clustering and cluster production. The precision rate for a dataset can be simply calculated using the number of points that are above or below the threshold value but belong to the other cluster, divided by the total number of points in the clusters. It was found that fixing the initial centroids and maximum values can lead to effective clustering. Apart from using k means for clustering, other clustering algorithms can also be involved for verification to achieve a higher degree of precision, such as fuzzy k means. It is important that the plotting of the data points and cluster centers is done on the same graph with a proper scale to obtain useful results. The threshold point is of great significance, as it gives a center point for classifying the data points of the dataset. There were two data representations: one with respect to the proposed algorithm and the other pertaining to the clustering algorithm employed, the accuracy of the classification depending on the selection of the threshold value. It was observed that the center coincided in both cases, therefore the obtained threshold value was accurate. The algorithm differs from the traditional weighted score approach in that it utilizes the weighted score of a data point for clustering and classification at the same time.

VII. FUTURE WORK

The proposed classifier can be enhanced in various ways. Other efficient clustering algorithms, such as fuzzy k means or probabilistic k means, can be applied to the weighted scores to get a higher level of accuracy in clustering, as they are more efficient in handling initial centroids. The algorithm can also be used as a neural classifier to automatically identify the threshold level and perform the classification.

ACKNOWLEDGMENT

The authors would like to thank Dr. Payal Pahwa, principal of the Bhagwan Parshuram Institute of Technology, for her support and encouragement during the research and for providing the necessary facilities at the institute.

REFERENCES

[1] H. Zhang, A. C. Berg, M. Maire, J. Malik, "SVM-KNN: Discriminative nearest neighbor classification for visual category recognition", IEEE Computer Society Conference on Computer Vision and Pattern Recognition, New York, USA, pp. 2126-2136, June 17-22, 2006
[2] B. Yu, C. Miao, L. Kwok, "Toward predicting popularity of social marketing messages", in: Social Computing, Behavioral-Cultural Modeling and Prediction. SBP 2011. Lecture Notes in Computer Science, Vol. 6589, pp. 317-324, Springer, Berlin, Heidelberg, 2011
[3] C. J. C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition", Data Min. Knowl. Disc., Vol. 2, No. 2, pp. 121-167, 1998
[4] C. Schuldt, I. Laptev, B. Caputo, "Recognizing human actions: a local SVM approach", 17th International Conference on Pattern Recognition (ICPR) 2004, Cambridge, UK, Vol. 3, pp. 32-36, 2004
[5] P. J. Tan, D. L. Dowe, "MML inference of decision graphs with multi-way joins", in: Advances in Artificial Intelligence. AI 2003. Lecture Notes in Computer Science, Vol. 2903, pp. 269-281, Springer, Berlin, Heidelberg, 2003
[6] J. L. Polo, F. Berzal, J. C. Cubero, "Weighted Classification Using Decision Trees for Binary Classification Problems", II Congreso Español de Informática, pp. 333-341, Zaragoza, Spain, September 11-14, 2007
[7] C. Chiu, Y. Ku, T. Lie, Y. Chen, "Internet auction fraud detection using social network analysis and classification tree approaches", International Journal of Electronic Commerce, Vol. 15, No. 3, pp. 123-147, 2011
[8] H. Neven, V. S. Denchev, G. Rose, W. G. Macready, "Training a binary classifier with the quantum adiabatic algorithm", arXiv preprint arXiv:0811.0416, 2008
[9] A. Quinn, A. Stranieri, J. Yearwood, "Classification for accuracy and insight: A weighted sum approach", Proceedings of the Sixth Australasian Conference on Data Mining and Analytics, Vol. 70, pp. 203-208, Australian Computer Society Inc., 2007
[10] L. Kuncheva, J. Bezdek, R. Duin, "Decision templates for multiple classifier fusion: an experimental comparison", Pattern Recognition, Vol. 24, No. 2, pp. 299-314, 2001
[11] D. Virmani, S. Taneja, G. Malhotra, "Normalization based K means Clustering Algorithm", arXiv preprint arXiv:1503.00900, 2015
[12] M. A. Mazurowski, P. A. Habas, J. M. Zurada, J. Y. Lo, J. A. Baker, G. D. Tourassi, "Training neural network classifiers for medical decision making: The effects of imbalanced datasets on classification performance", Neural Networks, Vol. 21, No. 2-3, pp. 427-436, 2008
[13] J. S. Bridle, "Probabilistic Interpretation of Feedforward Classification Network Outputs, with Relationships to Statistical Pattern Recognition", in: Neurocomputing. NATO ASI Series (Series F: Computer and Systems Sciences), Vol. 68, pp. 227-236, Springer, Berlin, Heidelberg, 1990
[14] UCI Machine Learning Repository, Wholesale customers Data Set, https://archive.ics.uci.edu/ml/datasets/wholesale+customers
[15] USArrests, http://stat.ethz.ch/R-manual/R-devel/library/datasets/html/USArrests.html
[16] "Datasets distributed with R Git Source Tree", https://forge.scilab.org/index.php/p/rdataset/source/tree/master/csv/datasets/attenu.csv
[17] https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/quakes.csv
[18] https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/datasets/volcano.csv
[19] https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/boot/channing.csv
[20] https://raw.githubusercontent.com/vincentarelbundock/Rdatasets/master/csv/boot/neuro.csv