INTERNATIONAL JOURNAL OF COMPUTERS COMMUNICATIONS & CONTROL Online ISSN , ISSN-L , Volume: 17, Issue: 6, Month: December, Year: 2022 Article Number: 4988, https://doi.org/10.15837/ijccc.2022.6.4988 CCC Publications A Feature Engineering and Ensemble Learning Based Approach for Repeated Buyers Prediction M. Zhang, J. Lu, N. Ma, T.C.E. Cheng, G. Hua Mingyang Zhang Department of Management Science and Engineering, School of Economics and Management Beijing Forestry University, China No. 35 Qinghua East Road, Haidian District, Beijing 100083, China mingyangzhang@bjfu.edu.cn Jiayue Lu 1. National Science Library, Chinese Academy of Sciences Address33 Beisihuan Xilu, Zhongguancun, Beijing 100190, China lujiayue22@mails.ucas.ac.cn 2. School of Economics and Management, University of Chinese Academy of Sciences No.19A Yuquan Road, Beijing 100049, China Ning Ma* Department of Management Science and Engineering, School of Economics and Management Beijing Forestry University, China No. 35 Qinghua East Road, Haidian District, Beijing 100083, China *Corresponding author: maning@bjfu.edu.cn T.C. Edwin Cheng Department of Logistics and Maritime Studies The Hong Kong Polytechnic University M923, Li Ka Shing Tower, Hong Kong Special Administrative Region, China edwin.cheng@polyu.edu.hk Guowei Hua Department of Logistics Management, School of Economics and Management Beijing Jiaotong University, China Siyuan East Building, Beijing Jiaotong University, Haidian District, Beijing 100044, China gwhua@bjtu.edu.cn Abstract The global e-commerce market is growing at a rapid pace, but the percentage of repeat buyers is low. According to Tmall, the repurchase rate is only 6.1%, while research shows that a 5% increase in the repurchase rate can lead to a 25% to 95% increase in profit. To increase the repurchase rate, merchants need to predict potential repeat buyers and convert them into repurchasers. Therefore, it is necessary to predict repeat buyers. In this paper we build a prediction model of repeat https://doi.org/10.15837/ijccc.2022.6.4988 2 purchasers using Tmall’s dataset. First, we build high-quality feature engineering for e-commerce scenarios by manual construction and algorithmic selection. We introduce the synthetic minority oversampling technique (SMOTE) algorithm to solve the data imbalance problem and improve prediction performance. Then we train classical classifiers including factorization machine and logistic regression, and ensemble learning classifiers including extreme gradient boosting, and light gradient boosting machine machines. Finally, we construct a two-layer fusion model based on the Stacking algorithm to further enhance prediction performance. The results show that through a series of innovations such as data imbalance processing, feature engineering, and fusion models, the model area under curve (AUC) value is improved by 0.01161. Our findings provide important implications for managing e-commerce platforms and the platform merchants. Keywords: feature engineering; ensemble learning; fusion model; repeat buyer prediction. 1 Introduction The rapid development of e-commerce has had a profound impact on the global economy[23]. In the US, e-commerce sales reached US$960.4 billion in 2021, accounting for 15% of total US re- tail spending[48]. In China, 81.6% of Internet users have undertaken online shopping[49]. Global e-commerce platforms have run promotions on specific dates (e.g., Black Friday, “6.18”), which signif- icantly stimulate the potential purchase value of new and existing users. In 2021, the Taobao “Double Eleven” promotion, one of China’s largest e-commerce promotions, achieved an all-time high trans- action volume of 540 billion RMB. However, many buyers that participated in the promotion were one-time users, making it difficult to generate long-term benefits [50]. Studies have shown that the repurchase rate for Tmall is only 6.1%. The cost of acquiring new users is 5-10 times higher than the cost of maintaining old users, while a 5% increase in customer retention can increase profit by 25% to 95% [4][10]. Thus ,if focusing solely on new user acquisition in the long term and ignoring low levels of repurchase rates„ it can cost the platform a great deal of money but not the desired return on investment. Increasing repurchase rates is imminent. A possible method is to predict the users with the tendency to repurchase and then implement precise marketing to lead them to become repurchase users. Recently, data-driven analytics have been increasingly developed to help companies analyze user behavior and optimize decision making[18][21][31][43]. However, e-commerce scenarios are complex and varied, and their data has special characteristics. The question of how to use a data-driven approach in e-commerce repurchase prediction is worth thinking about. We try to answer the following questions: How to comprehensively and accurately portray the features in e-commerce scenarios? How to efficiently predict potential repurchase users? We propose an AI-based data-driven approach to address thess problem. We use the Tmall’s real user dataset from AliTianchi to build the model. These data are log data generated by users in e-commerce behavior, which are implicit feedback. We carry out a series of steps to process the special characteristics of e-commerce data, introduce the synthetic minority oversampling technique (SMOTE) algorithm to solve the data imbalance problem. We manually construct feature engineering from three perspectives: merchant, user, and merchant-user. Meanwhile, we use different models for prediction, compare model performance, and choose the best approach. Results show that fusion model can significantly improve the accuracy of prediction. Finally, we efficiently and accurately predicted the potential target users in the dataset with a AUC value of 0.68406. Predicting repeat buyers can help e-commerce firms provide personalized services, such as accurate product recommendations, differentiated pricing, and demand management, to effectively improve the repurchase rate. We organize the rest of the paper as follows: In Section 2 we review and summarize the related work. In Section 3 we introduce the data and the data pre-processing process. In Section 4 we discuss the data-driven methodology, including data imbalance processing, feature engineering, and the prediction models. In Section 5 we present the experimental results and analysis. Finally, in Section 6, we conclude the paper, discuss the management implications of the research findings, and suggest topics for future research. https://doi.org/10.15837/ijccc.2022.6.4988 3 2 Literature Review In this paper, we use user behaviour data to predict whether he/she will become a repeat buyer. In this section we review and summarize related work covering two research streams, namely user behaviour research and user purchase behaviour prediction, and discuss the motivation for this study. 2.1 User Behaviour Research Users generate massive data during using the online platform, which can be used to analyze their behavior. Further, user data can be divided into. "explicit feedback" and "implicit feedback"[26]. Explicit feedback is a direct and quantifiable expression of users’ preferences, while implicit feedback records users’ natural behaviours when using a product[12]. Explicit feedback relies on users’ active evaluation and is difficult to collect, so the related data are quite sparse, while implicit feedback is easy to obtain and a large amount of the related data is available. Thus, many researchers have used implicit feedback. Early researchers analyze implicit feedback data, such as click and search, to provide support for interface UI improvement and user behavior understanding[1][3][26]. Later studies have made richer use of user behavior data. Some scholars focus on the optimization of algorithmic models based on implicit feedback data. They combined implicit feedback data with the SVD algorithm in matrix decomposition[16][17], and applying collaborative filtering algorithms based on items and users[45][47] to improve algorithm accuracy. Further, more scholars have combined implicit feedback data with specific scenarios, such as e-commerce[34], finance[28], beauty[44],and medicine[24]. However, these studies only stay at analyzing the current behavior of users. User behaviour analysis in the e-commerce context ultimately aims to predict subsequent user behaviour and raise the user purchase rate. The common methods used for user purchase prediction are statistical methods and machine learning methods. Statistical methods achieve prediction by modelling the relationships between input variables and output variables in advance[41]. However, under practical scenarios, the complex relationships between variables are often difficult to model. In addition, different models are based on different assumptions, rendering it difficult to achieve experimental prediction accuracy in reality[8]. Therefore, researchers have started to use machine learning, a method that does not require experimental simulations, for user purchase prediction. models that are often used for purchase prediction are decision trees[22], artificial neural networks[37]. However, all the related studies use a single prediction model, which has problems such as vulnerability to random factors and low generalization ability. To effectively exclude the interference of random factors in the single model and improve prediction accuracy, some researchers introduce the ideas of ensemble learning and fusion model into research on user behaviour prediction, such as GBDT[? ], AdaBoost[27], CatBoost[6]. The above studies demonstrate that models using ensemble learning and fusion ideas have better prediction results than traditional single algorithm models. In summary, early scholars’ research stopped at behavioral analysis; however, some scholars have already started to make behavioral predictions. Implicit feedback is more suitable for behavioral analysis because of its large amount of data and high availability. Several scholars have attempted to predict e-commerce user behavior. They have confirmed the superiority of machine learning methods, and stronger ensemble models have emerged. Like these studies, this paper will also use user logs, an implicit feedback, for analysis. However, currently, a single machine learning model is mainly used, and this paper will try a more powerful integration approach. 2.2 Repeat Purchase Behaviour Prediction In an increasingly competitive market, the prediction of e-commerce buying behavior has entered a new phase. For content, repeat purchase behavior is emphasized; for methodology, more advanced methods such as ensemble models are introduced. Research on prediction of repeat buyers is much less abundant than research on topics such as user purchase behaviour prediction and user repurchase intention. Some researchers use non-machine learn- ing methods such as interview method[33], game theory[29] and Buy till You Die (BTYD) models[9] https://doi.org/10.15837/ijccc.2022.6.4988 4 to model and predict e-commerce users’ repurchase behaviour. Other researchers have tried to use machine learning to improve the accuracy and robustness of prediction models. [39]applied the ex- planation method based on an improved decision tree algorithm to enable firms to explore the factors that drive customers’ repurchases.[46]used the vote-stacking method to combine the prediction results of three separate models, namely DeepCatboost, DeepGBM, and DABiGRU, and found that the ac- curacy of fusion models is significantly higher than that of a single model. [13]proposed a BERT-MLP prediction model, with large-scale data unsupervised pre-training and small amount of labeled data fine-tuning, to predict repeat buyer. However, the important step of feature engineering has been neglected in the above studies. While machine learning algorithms tend to be generic, feature engineering is specific, and good feature engineering determines the final prediction results. The importance of feature engineering can be seen in the study of general user behavior prediction.[30] constructed feature engineering from the perspectives of users, merchants, products, brands, categories, and their interactions to improve prediction accuracy. Considering time-evolving features in feature en- gineering, [15] found that it could more realistically depict users’ purchase intention. [? ] dynamically updated user features monthly, characterizing customers in a given month, and achieved a prediction accuracy of 98%. Obviously, these studies have placed great emphasis on feature engineering. In summary, methods such as game theory rely on strict assumptions, while the questionnaire method suffers from small sample size and under-representation, making it difficult for traditional statistical schemes to fully model user behavior. Therefore, we will use real data provided by Alibaba and use machine learning methods to prediction. Insufficient attention has been paid to feature engi- neering in existing studies, and we will manually construct a high-quality feature set for e-commerce scenarios. Finally, real data are prone to data imbalance and cheating users, and we will also give improvement measures. 2.3 Research Method Ensemble learning is one of the frontiers in the field of computing. Its main principle is to train multiple learners and use special rules to combine their prediction results to improve the final prediction performance[36]. [11]introduced the concept of "ensemble learning" for the first time. [19]constructed an ensemble model based on a neural network model, and showed that the integrated model has lower absolute value of variance and superior generalization ability compared with the common neural network. [38]upgraded many weak performance classifiers by Boosting idea to obtain strong performance classi- fication models. Since then, ensemble learning has jumped to be a popular research topic. Many new models have been born, such as hybrid expert models[20], stacked generalization models[42], bagging algorithms[5], and so on. Ensemble learning has rich applications in e-commerce, such as e-commerce review mining[25][40], e-commerce product category labeling and recommendation[? ][35], e-commerce security[7], among others. These studies confirm the superiority of ensemble learning approaches. In summary, the superiority of ensemble learning is undeniable and is widely used in e-commerce. We will use ensemble learning for repurchase prediction. In addition, are there methods that can combine ensemble learning with classical models to optimize the results? We will also explore this question. 3 Data Description and Pre-Process In this section we introduce the dataset and discuss data pre-processing process. 3.1 Research Method The data come from Alibaba’s Tianchi platform. The data are provided by Tmall, which records the real data of 4,995 merchants and 424,170 new buyers in the 2014 Double Eleven shopping festival on the Tmall platform. The purpose of the experiment is to predict whether a new user that purchases a merchant’s product on the Double Eleven Day will make a second purchase from that merchant https://doi.org/10.15837/ijccc.2022.6.4988 5 Table 1: User Behaviour Log Field Name Description user_id Unique ID code of the purchaser item_id Unique ID code of the merchandise cat_id Unique ID code for merchandise categories merchant_id Unique ID code for merchants brand_id Unique ID code for merchandise brands time_tamp Purchase time action_type Contains 0, 1, 2, 3. 0=Click, 1=Cart Add, 2=Purchase, 3=Bookmark to Favorites Table 2: User Profile Table Field Name Description user_id Unique ID code of the purchaser age_range User age range. 1 for <18 years; 2 for [18,24]; 3 for [25,29]; 4 for [30,34]; 5 for [35,39]; 6 for [40,49]; 7 and 8 for > = 50; 0 and NULL for unknown gender User gender. 0 for female, 1 for male, 2 and NULL for unknown within six months, and the new user is are a “repeat buyer” for each merchant. The dataset consists of three tables, namely the user behaviour log table, the user profile table, and the training set table. The user behaviour log records the behaviours of all the “new users” on the day of Double Eleven and 6 months before Double Eleven, spanning 12 May - 11 November 2014. The field information is listed in Table 1. The user profile table records the demographic information. Table 2 lists the field information. The training set table records whether a specific user makes repeat purchases at a specific merchant. The field information is listed in Table 3. 3.2 Data Pre-Processing Data pre-processing includes data cleaning, data integration, data transformation, data imputation etc. In this paper we use real behaviour log data, and there is inevitably data noise. Data quality will directly affect the accuracy and universality of the prediction model. Therefore, we combine the characteristics of the dataset and the special attributes of e-commerce to pre-process the data to improve data quality. 3.2.1 Data Integration We use three datasets, where the two fields of user id (user_id) and merchant id (merchant_id) are common fields. To improve the efficiency of feature engineering, we use user id and merchant id as the primary keys for data integration. We show the structure of the integrated user purchase behaviour information table in table 4. 3.2.2 Missing Value Processing The age and gender data of users in the dataset are missing 0.52% and 1.52%, respectively because they are category attributes with significant differences in repurchase behaviour, the missing values are filled by the plural of the whole data. For the brand id, on the one hand, it is difficult to replace it by plenary features or other attributes of the dataset; on the other hand, it has no impact on the subsequent feature engineering, so the missing data are excluded. Table 3: Training Set Table Field Name Description user_id Unique ID code of the purchaser merchant_id Unique ID code of a merchant label Repeat purchase user identifier. Contains 0, 1. 1 for a repeat buyer, 0 for a non-repeat buyer. https://doi.org/10.15837/ijccc.2022.6.4988 6 Table 4: User Purchase Behaviour Information Field Name Description user_id Unique ID code of a purchaser merchant_id Unique ID code of a merchant item_id Unique ID code of an item cat_id Unique ID code of a merchandise category brand_id Unique ID code of a merchandise brand time_tamp Purchase time action_type Contains 0, 1, 2, 3. 0=Click, 1=Cart Add, 2=Purchase, 3=Bookmark to Favorites age_range User age range. 1 for <18 years; 2 for [18,24]; 3 for [25,29]; 4 for [30,34]; 5 for [35,39]; 6 for [40,49]; 7 and 8 for > = 50; 0 and NULL for unknown gender User gender. 0 for female, 1 for male, 2 and NULL for unknown label Repeat purchase user identifier. Contains 0, 1. 1 for a repeat buyer, 0 for a non-repeat buyer. 3.2.3 Abnormal User Identification In the e-commerce environment, there are phenomena such as crawlers and swipers, whose behav- iors are different from the normal purchase behaviour, and who belong to abnormal users. If a user has a large amount of product browsing behaviour in a period but the purchase behaviour is 0, then the user is likely to be a “crawler user”. If a user has a large amount of product purchase behaviour in a period but little or no browsing behaviour, then the user is likely to be a “crawler user”. After identifying the abnormal users, we delete their records. 4 Methodology In this section we present methodologies related to data imbalance processing, feature engineering, predictive models, and evaluation metrics. 4.1 Data Imbalance Processing In our dataset, the positive sample, i.e., the percentage of repeat users is only 6.1%, and the data are severely imbalanced. When the data are unbalanced, the minority class samples in the overlapping region between classes will be misclassified in large batches, and the class interval surface will move to the side with sparse sample distribution, thus interfering with the classification accuracy of the model for the minority classes. Existing research deals with data imbalance in three aspects, namely data pre-processing, features, and algorithms. In this paper we use SMOTE for imbalanced data processing(Figure 1). The process is as follows: – Each sample x in the minority class is obtained as its k-nearest neighbour based on its distance to all the samples in the minority class set (generally using the Euclidean distance). – The sampling multiplicity N is determined by the degree of sample positive and negative class imbalance; many samples are arbitrarily selected from the k nearest neighbours of each minority class sample x. The selected nearest neighbour is assumed to be xnew. – To the arbitrarily selected nearest neighbour xnew, construct its new sample about the original sample as follows: xnew = x + rand(0, 1) ∗ (x̃ − x). (1) 4.2 Feature Engineering Feature construction refers to the use of manual methods to select meaningful data in the initial dataset, or to combine and deform the initial data to obtain new features. To better characterize e-commerce behaviour, we focus on two main subjects in the e-commerce context, namely users and merchants, and construct features manually in three dimensions, namely user portraits, merchant portraits, and user-merchant interaction portraits. Besides basic features (gender, age), different https://doi.org/10.15837/ijccc.2022.6.4988 7 Figure 1: Principle of SMOTE statistical methods are used to construct features, including counting and ratio features, aggregation features (mean, median etc), and temporal features. The construction of features is based on the features of e-commerce and the subjective judgement of researchers, which are prone to developing invalid features and feature redundancy. If we use all the features, the model will be inefficient. Therefore, we keep the strong features and drop the useless features to avoid the loss of important information. Feature selection mainly addresses three types of problems, namely dimensional catastrophe, over- fitting, and noise, which can not only reduce the model complexity and computation but also improve the final prediction of the model by including high-quality features. There are three types of main- stream methods for feature selection as follows: – Filtering: Feature selection and model training are independent of each other. First, select the experimental data features, then train the model using the filtered data set, and determine the weights by the scores. – Wrapping: Feature selection and model training are correlated with each other. A subset of all the features is selected for model training and compared each time, and the best features are selected based on the classifier results. – Embedding: Feature selection is carried out together with model training. The dataset is trained on the model, the weights of the features are obtained by model fitting, and feature selection is performed in the order from highest to lowest. There is no uniform way for feature selection, so we use four methods for feature selection including random forest, ANOVA, recursive feature elimination, and L1 regularization, and compare the results and retain the best set of features. 4.3 Prediction Models We use classical machine learning models, ensemble models, and fusion models to predict repeat buyers and compare the prediction results of different models. – For classical machine learning, we use the common classification predictors such as logistic regression, factorization machine, decision tree, and support vector machine for experiments, and use prediction accuracy as the baseline in this paper. –Ensemble models combine multiple learners into one stronger learner and can crack problems that cannot be solved by a single model. Therefore, using integrated learning can yield higher prediction accuracy and more reliable prediction results than a single model[14]. We use ensemble learning models such as XGBoost and LightGBM, which are commonly used for dichotomous prediction, to make predictions. –Different models have unique advantages, and fusing models by certain methods can construct stronger classifiers and greatly improve the prediction results, so we fuse different models to im- prove prediction accuracy. Stacking is one strategy of model fusion, which is usually a two-layer construction[42]. Its framework is shown in Figure 2. First, the original dataset is divided into several sub-datasets and input to the n base learners in the first layer in turn; the output training results become the input to the second layer learners, which are trained to output the final results. https://doi.org/10.15837/ijccc.2022.6.4988 8 Figure 2: The Framework of Stacking The above model contains numerous parameters and optimizing the parameters can effectively improve the model’s performance. Gradient method, genetic algorithm, and other common parameter optimization methods converge faster, but with more than two parameters, the parameters will affect one and other and interfere with the results. The grid search method combines the parameters and performs the optimization search at the same time, avoiding the local optimum, improving efficiency, and obtaining the best generalization ability. In this paper we use grid search with cross-validation (GSCV) algorithm for model parameter optimization. This method combines cross-validation and grid search. In the grid search, the parameters are gradually adjusted by the learning rate in a limited range, the learner is trained, and the results are continuously compared. The scores of the model on the test set are calculated, and the final score is averaged over k times. 4.4 Model Evaluation Metrics We use the area under curve (AUC) value as the model classification ability evaluation metric. The problem in this paper is dichotomous, i.e., whether a user is a repeat buyer, there are the positive class (1) and negative class (0), and the positive class is the repeat buyer. The positive class and the negative class of the real type and the predicted type constitute the confusion matrix, as shown in Figure 3. Figure 3: Confusion Matrix. Four major categories are derived from the confusion matrix, namely TP, FN, FP, and TN, rep- resenting the number of samples in the four categories of true positive, false negative, false positive, and true negative, respectively. The accuracy rate, which is the ratio of the number of samples with correct classification results to the total number of samples is calculated as follows: Accuracy = T P + T N T P + F N + F P + T N . (2) https://doi.org/10.15837/ijccc.2022.6.4988 9 In addition, two indicators can be derived, namely the true positive rate (T P Rate) and the false positive rate (F P Rate), which are calculated as follows: T P Rate = T P T P + F N . (3) F P Rate = F P F P + T N . (4) T P Rate indicates the probability that a sample with true category 1 is predicted to be class 1, and FPRate indicates the probability that a sample with true category 0 is predicted to be class 0. The curve formed by taking F P Rate as the horizontal axis and T P Rate as the vertical axis is the receiver operating characteristic (ROC) curve, as shown in Figure 4. In general, AU C = 1 means the perfect classifier, 1 > AU C > 0.5 means fair performance, and 0.5 > AU C > 0 means poor performance. Figure 4: Confusion Matrix. When the sample is imbalanced between the positive and negative classes, the AUC value can measure the classification accuracy when the true type is positive (=1) and negative (=0) at the same time, which can avoid the impact of the sample imbalance on the evaluation of the classifier. 5 Experimental Results and Analysis In this section we present the results after data processing and modelling based on the methodology presented in the previous section, including the feature sets, prediction model parameters and results, and important features. 5.1 Feature Construction and Selection The user portrait dimension includes four aspects: – Basic information. During the initial exploration of the data, we find that user repurchase differs in different ages and genders, so these attributes should be added. –Time information. It mainly describes the user’s activity of “shopping” and fluctuations of the activity over time. –Preference information. It mainly describes the user’s favorite items/categories/shops/brands, the pattern of the user’s various operations, and the comparison between a user’s value and the average value of all the users. https://doi.org/10.15837/ijccc.2022.6.4988 10 Table 5: Results of Feature Selection. Method No Select AUC Feature Select AUC Feature dimension before selection Feature dimension after selection Random forest-based feature selection 0.582 0.587 147 94 ANOVA 0.582 0.585 147 79 Recursive feature elimination 0.582 0.585 147 74 L1 regularization 0.582 0.569 147 34 –Behaviour information. It mainly describes the frequency, extensiveness, and recent repurchase behaviour of users’ various operations. A total of 83 features are constructed. The merchant portrait dimension includes three aspects: –Basic information. In the initial exploration of the data, we find that there are differences in the repurchase situations of stores, and that items, brands, and categories are the most basic attributes of an e-commerce merchant. –Time information. It mainly describes the frequency and status of the merchant in operation. –Strength and popularity information. It mainly describes the merchant’s popularity with users, the merchant’s strength, and the merchant’s repurchase situation. A total of 41 features are con- structed. The user-merchant interaction portrait dimension is the most important dimension to show the repurchase characteristics and relationship between different users and merchants, and describes the interaction information between merchants and users in terms of frequency, time, and status through user-merchant matching. A total of 23 features are constructed. We successively use four methods for feature selection based on random forest, ANOVA, recursive feature elimination, and L1 regularization. The results of the AUC values and feature dimensionality before and after the four methods are shown in Table 5. Finally, we pick the random forest-based feature selection as the best, and retain the selected 94 features as the feature set. 5.2 Data Preparation We first address the data imbalance problem by using the SMOTE algorithm, which is implemented using the SMOTE interface in the imblearn library in python. The pseudo-code is shown in Figure 5. To avoid high generalization error, the dataset is often split by the self-help method, which will change Figure 5: Pseudo-code for the SMOTE Algorithm. https://doi.org/10.15837/ijccc.2022.6.4988 11 Table 6: Best Parameters and Prediction Results of a Single Model. Model Best parameter AUC score LR Penalty: L2; intercept_scaling: liblinear; C: 0.05; class_weight: None; max_iter: 100 0.67305 FM n_iter: 0; l2_reg_w: 0.1; rank: 4 0.67245 XGBoost n_estimator: 2000; learning_rate: 0.01; max_features: 9; subsample: 0.5; max_depth: 8; min_samples_split: 1000; min_samples_leaf: 30; scale_pos_weight: 0.061 0.67956 LightGBM max_depth: 7; num_leaves: 80; colsample_bytree: 1; learn- ing_rate: 0.01; reg_alpha: 0 0.67991 the distribution of data, so we do not apply it to this study. As the amount of data used in this paper is sufficient, we use the simple and efficient leave-out method, dividing the training set by 1:1, with 50% as the training set and 50% as the test set. We apply five-fold cross-validation, i.e., we divide the training set into five equal and arbitrarily divided subsets, where the data sets are mutually exclusive. Each time, we randomly chose a subset as the validation set and the other four as the training set. 5.3 Model Construction and Parameters We first use the classical models such as logistic regression (LR), factorization machine (FM), decision tree (DT), and support vector machine (SVM), finding that LR and FM perform the best. Meanwhile, the good prediction accuracy achieved by the classical algorithms confirms the effectiveness of our study in dealing with data imbalance and constructing feature engineering. Seeking better prediction performance, we apply the ensemble learning models to construct XGBoost, and LightGBM for prediction. The parameter settings and AUC values of a single model are shown in Table 6. The optimal parameters here are obtained by the grid search with cross-validation (GSCV) algorithm. To achieve the best prediction possible, we introduce the Stacking method to construct a two- layer fusion model. We consider the following factors in the selection of the first layer base learner: a) Performance: A strong learner should be selected to achieve the effect of combining advantages; otherwise, it will affect the efficiency of the whole model; there should be differences in the structure of each learner to fuse and learn from different perspectives. b) Number: The number n of the first layer model determines the feature dimension n+1 of the second layer model input, and the dimension should be at least three, so at least two models are selected. After comprehensive consideration and trials, we select LightGBM, XGBoost, and RF as the base learners in the first layer. The second layer generally uses weak learners, and we choose the LR model, which is commonly used in machine learning studies. The LR model has strong generalizability and can avoid the risk of overfitting in the Stacking algorithm. The pseudo-code of the Stacking fusion model is shown in Figure 6. The Figure 6: Pseudo-code for the Stacking Fusion Model. https://doi.org/10.15837/ijccc.2022.6.4988 12 Table 7: Comparison of Model Results. Model AUC score Ranking Gain Other studies FM 0.67245 Top 9.7% - - LR 0.67305 Top 9.5% +0.0006 0.617[51] XGBoost 0.67956 Top 5.6% +0.00651 0.6774[52] LightGBM 0.67991 Top 5.5% +0.00035 0.6797[53] Stacking fusion model 0.68406 Top 2.8% +0.00415 0.6232[54] Table 8: Top Ten Features. Rank Feature 1 Number of purchases made by users at merchants 2 The number of repeat purchases made by users before “Double Eleven” 3 The click-to-purchase conversion rate of users to merchants 4 Number of repeated purchases made by users before “Double Eleven” 5 Number of store item categories 6 Difference between number of times repurchase is made to a merchant before “Double Eleven” and the average value of all the merchants 7 Difference between the user click-to-purchase conversion rate and the average value of all the users 8 Variance of user active days 9 Number of user clicks as a percentage of the number of all user actions 10 Difference between a merchant’s click-to-purchase conversion rate and the average of all the merchants modelling process comprising four steps is as follows: – Divide the training dataset D into five random and uniform copies to obtain D1−5; – For the first learner M1 in the first layer, four sub-datasets are randomly taken as the training data, the remaining copy is used as test data, and the new learner generated by learning from the training data is used to predict the test data to obtain the prediction results. Specifically, the first-fold cross-training of learner M1 takes D1−4 as the training data and D5 as the test data to obtain the prediction result P11 for the first-fold; – Complete five-fold cross-validation for each first-layer trainer based on Step 2. Combine the results of the first layer to obtain the input P of the second layer; – Perform five-fold cross-validation for the second layer of learners M according to Step 2, using the dataset as the output training set P of the first layer to obtain the final prediction model and results. 5.4 Comparison of Model Results The AUC values of the classical, ensemble, and fusion models are shown in Table 7 (ranked in ascending order of the AUC values). On the one hand, comparing the five models used in this paper, we see that the Stacking fusion model has a significant improvement in the AUC value compared with the general model, with an increase of 0.01161 in absolute amount and 6.9% in the ranking ratio, which can reach top 2.8% of all the participants in Alibaba Tianchi, confirming that the fusion model constructed in this paper is effective. On the other hand, comparing the results of this paper with other studies using the same model, based on the same model, we achieve a higher AUC value, which confirms the effectiveness of our approach that embraces feature engineering and data processing. From the official baseline of 0.704954, there is still room for improvement in the results of this paper. However, after incorporating the innovations of feature engineering and data imbalance processing, even the classical model can reach the top 10% of the ranking. 5.5 Important Features The prediction of repeat buyers can bring insights to merchants and platforms. Focusing on the features of potential repeat buyers further helps merchants and platforms understand user behaviour and adjust their business actions. We obtain the importance score of each feature through the Light- GBM model, and the top ten features are shown in Table 8. The top ten features mainly concern the user profile (six features), merchant profile (two features), and user-merchant interaction profile (two https://doi.org/10.15837/ijccc.2022.6.4988 13 features). These features provide guidance for management. Among them, user purchase features such as “number of purchases made by users at merchants”, “number of repeat purchases made by users before “Double Eleven””, and preference information such as “number of days users are active”, “proportion of user clicks to the number of all user actions”, and other characteristics reflect the user’s habit of using the e-commerce platform. Users with these characteristics may be potential repeat buyers, and managers should give timely and targeted measures. The characteristics of merchants such as “number of store categories” and “difference between the merchant’s click-to-purchase con- version rate and the average value of all the merchants” also affect user repurchase. So merchants should optimize their stores based on these key features, improve their strength and competitiveness to attract repurchase users. 6 Conclusion The development of e-commerce platforms has so far resulted in a low percentage of new users in the market, so it is the future direction to focus on existing users of the platform and carry out targeted marketing strategies for repeat buyers. We use Tmall real user behaviour log data, and apply machine learning algorithms such as ensemble learning and fusion model to carry out repurchase prediction. Introducing feature engineering based on e-commerce characteristics and imbalance data processing, we construct a repurchase prediction model that contain a feature set of 94 features. The model attains an AUC value as high as 0.68406 after incorporating the fusion model, realizing efficient and accurate repeat buyer prediction. We also make the following findings: (1) Compared with traditional machine learning models, the ensemble models and fusion models can improve the predictive effect of models, allowing us to predict repeat buyers more accurately and efficiently. (2) In the e-commerce context, especially when using real, large data sets, problems such as data imbalance and anomalous samples are inevitable, and solving these problems is important to attain accuracy in model prediction results. (3) It is necessary to adopt advanced and popular models, but the application of feature engineering should also be emphasized, and appropriate feature engineering can help bring the best model results. We deduced important management insights from the study: (1) The study confirms the possibility of predicting repeat users, and by predicting it also captures important characteristics of potential users. Platforms should pay attention to users with these characteristics and guide them to repurchase through product personalized recommendations and SMS alerts; (2) the prediction model also contains merchant features. for store improvement, the current store operations can be checked based on the important features, and optimized and upgraded to better convert users into repeat buyers, such as diversifying their product range. (3) the merchant can develop separate strategies for repeat buyers and new users based on the prediction results, implementing differentiated marketing and pricing, and (4) predictions of repeat users indicate future orders, and merchants can adjust their product generation and inventory management based on the predictions. For example, when an item has a large number of potential repeat customers, production should be increased. Due to the single data source and limited technology, our study has shortcomings. From a man- agement perspective, future research can be improved in the following ways:(1) real-time modelling analysis should be conducted using real-time datasets from more platforms to broaden the platform sources, thus enhancing the generalizability of the model across different platforms and uncovering the differences in repeat buyers between platforms. (2) machine learning methods should be introduced into the feature construction process to improve the explanatory power of features to the problem. Funding Supported by Natural Science Foundation of China(71901027); Beijing Forestry University 2021 Course Ideological and Political Teaching, Research and Teaching Reform Project, "Management Model and Basic Decision-making" project(2021KCSZXY010) https://doi.org/10.15837/ijccc.2022.6.4988 14 Author contributions The authors contributed equally to this work. Conflict of interest The authors declare no conflict of interest. References [1] Abel, F.; Gao, Q.; Houben, G. J.; Tao, K. (2011). Analyzing user modeling on twitter for per- sonalized news recommendations. International Conference on User Modeling, Adaptation, And Personalization, 1-2, 2011 [2] Belem, F. M.; Silva, R. M.; de Andrade, C. M.; Person, G.; Mingote, F.; Ballet, R.; Alponti, H.; de Oliveira, H. P.; Almeida, J. M.; Goncalves, M. A. (2020). “Fixing the curse of the bad product descriptions”–Search-boosted tag recommendation for E-commerce products. Information Pro- cessing Management, 57(5), 102289, 2020 [3] Benevenuto, F.; Rodrigues, T.; Cha, M.; Almeida, V. (2009). Characterizing user behavior in online social networks. Proceedings of the 9th ACM SIGCOMM Conference on Internet Measure- ment, 49-62, 2009 [4] Bhattacharya, C. B. (1998). When customers are members: Customer retention in paid member- ship contexts. Journal of The Academy of Marketing Science, 26(1), 31-44, 1998 [5] Breiman, L. (1996). Bagging predictors. Machine learning, 24(2): 123-140, 1996 [6] Cao, W.; Wang, K.; Gan, H.; Yang, M. (2021). User online purchase behavior prediction based on fusion model of CatBoost and Logit. Journal of Physics: Conference Series, 2003(01), 012011, 2021 [7] Carta, S.; Fenu, G.; Recupero, D. R.; Saia, R. (2019). Fraud detection for E-commerce transac- tions by employing a prudential Multiple Consensus model. Journal of Information Security and Applications, 46, 13-22, 2019 [8] Chen, S.; Wang, J. Q.; Zhang, H. Y. (2019). A hybrid PSO-SVM model based on clustering algo- rithm for short-term atmospheric pollutant concentration forecasting. Technological Forecasting and Social Change, 146, 41-54, 2019 [9] Chou, P.; Chuang, H. H. C.; Chou, Y. C.; Liang, T. P. (2022). Predictive analytics for customer repurchase: Interdisciplinary integration of buy till you die modeling and machine learning. Eu- ropean Journal of Operational Research, 296(2), 635-651, 2022 [10] Daly, J. L. (2002). Pricing for profitability: Activity-based pricing for competitive advantage. John Wiley & Sons, 2002. [11] Dasarathy, B. V.; Sheela, B. V.(1979). A composite classifier system design: Concepts and methodology. Proceedings of the IEEE, 67(5): 708-713, 1979 [12] Deng, Z. H.; Huang, L.; Wang, C. D.; Lai, J. H.; Philip, S. Y. (2019). Deepcf: A unified framework of representation learning and matching function learning in recommender system. Proceedings of The AAAI Conference on Artificial Intelligence, 33(01), 61-68, 2019 [13] Dong, J.; Huang, T.; Min, L.; Wang, W. (2022). Prediction of Online Consumers’ Repeat Purchase Behavior via BERT-MLP Model. Journal of Electronic Research and Application, 6(3), 12-19, 2022 https://doi.org/10.15837/ijccc.2022.6.4988 15 [14] Dong, X.; Yu, Z.; Cao, W.; Shi, Y.; Ma, Q. (2020). A survey on ensemble learning. Frontiers of Computer Science, 14(2), 241-258, 2020 [15] Dong, Y.; Jiang, W. (2019). Brand purchase prediction based on time-evolving user behaviors in e-commerce. Concurrency and Computation: Practice and Experience, 31(1), e4882, 2019 [16] Enrich, M.; Braunhofer, M.; Ricci, F. (2013). Cold-start management with cross-domain collabo- rative filtering and tags. International Conference on Electronic Commerce and Web Technologies 101-112, 2013 [17] Fernández-Tobías, I.; Cantador, I. (2014). Exploiting Social Tags in Matrix Factorization Models for Cross-domain Collaborative Filtering. Proceedings of the 1st Workshop on New Trends in Content-based Recommender Systems, 34-41, 2014 [18] Gajsek B.; Dukic G.; Kovacic M.; Brezocnik M. (2021). A Multi-Objective Genetic Algorithms Approach for Modelling of Order Picking. Int. Journal of Simulation Modelling, 20(4), 719-729, 2021 [19] Hansen, L. K.; Salamon, P. (1990). Neural network ensembles. IEEE transactions on pattern analysis and machine intelligence, 12(10): 993-1001, 1990 [20] Jacobs, R.; Jordan, M.; Nowlan, S.; Hinton G. (2014). Adaptive mixtures of local experts. Neural Computation, 3(1): 79-87, 1991 [21] Janekova J.; Fabianova J.; Kadarova J. (2021). Selection of Optimal Investment Variant Based on Monte Carlo Simulations. Int. Journal of Simulation Modelling, 20(2), 279-290, 2021 [22] Kagan, S.; Bekkerman, R. (2018). Predicting purchase behavior of website audiences. Interna- tional Journal of Electronic Commerce, 22(4), 510-539, 2018 [23] Knezevic, B.; Skrobot, P.; Pavic, E. (2021). Differentiation of e-commerce consumer approach by product categories. Journal of Logistics, Informatics and Service Science, 8(1), 1-19, 2021 [24] Kocheturov, A.; Pardalos, P. M.; Karakitsiou, A. (2019). Massive datasets and machine learning for computational biomedicine: trends and challenges. Annals of Operations Research, 276(1), 5-34, 2019 [25] Koehn, D.; Lessmann, S.; Schaal, M. (2020). Predicting online shopping behaviour from click- stream data using deep learning. Expert Systems with Applications, 150, 113342, 2020 [26] Koren, Y. (2008). Factorization meets the neighborhood: a multifaceted collaborative filtering model. Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 426-434, 2008 [27] Kumar, A.; Kabra, G.; Mussada, E. K.; Dash, M. K.; Rana, P. S. (2019). Combined artificial bee colony algorithm and machine learning techniques for prediction of online consumer repurchase intention. Neural Computing and Applications, 31(2), 877-890, 2019 [28] Kyriakou, I.; Mousavi, P.; Nielsen, J. P.; Scholz, M. (2021). Forecasting benchmarks of long-term stock returns via machine learning. /emphAnnals of Operations Research, 297(1), 221-240, 2021 [29] Li, X.; Hitt, L. M.; Zhang, Z. J. (2011). Product reviews and competition in markets for repeat purchase products. Journal of Management Information Systems, 27(4), 9-42, 2011 [30] Liu, X.; Li, J. (2016). Using support vector machine for online purchase predication. Emph2016 International Conference on Logistics, Informatics and Service Sciences, 1-6, 2016 [31] Ma X. Y.; Lin Y.; Ma Q. W. (2021). Data-Driven Robust Model for Container Slot Allocation with Uncertain Demand. Int. Journal of Simulation Modelling, 20(4), 707-718, 2021 https://doi.org/10.15837/ijccc.2022.6.4988 16 [32] Martínez, A.; Schmuck, C.; Pereverzyev Jr, S.; Pirker, C.; Haltmeier, M. (2020). A machine learning framework for customer purchase prediction in the non-contractual setting. European Journal of Operational Research, 281(3), 588-596, 2020 [33] Moriuchi, E.; Takahashi, I. (2022). An empirical study on repeat consumer’s shopping satisfaction on C2C e-commerce in Japan: the role of value, trust and engagement. Asia Pacific Journal of Marketing and Logistics, ahead-of-print, 2022 [34] Ni, Y.; Chen, X.; Pan, W.; Chen, Z.; Ming, Z. (2021). Factored heterogeneous similarity model for recommendation with implicit feedback. Neurocomputing, 455(2021), 59-67, 2021 [35] Oyewole, S. A.; Olugbara, O. O. (2018). Product image classification using Eigen Colour feature with ensemble machine learning. Egyptian Informatics Journal, 19(2), 83-100, 2018 [36] Sagi, O.; Rokach, L. (2018). Ensemble learning: A survey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 8(4), e1249, 2018 [37] Sakar, C. O.; Polat, S. O.; Katircioglu, M.; Kastro, Y. (2019). Real-time prediction of online shoppers’ purchasing intention using multilayer perceptron and LSTM recurrent neural networks. Neural Computing and Applications, 31(10), 6893-6908, 2019 [38] Schapire, R. E.; Freund, Y. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of computer and system sciences, 55(1): 119-139, 1997 [39] Shen, Y.; Xu, X.; Cao, J. (2020). Reconciling predictive and interpretable performance in repeat buyer prediction via model distillation and heterogeneous classifiers fusion. Neural Computing and Applications, 32(13), 9495-9508, 2020 [40] Tripathi, P.; Singh, S.; Chhajer, P.; Trivedi, M. C.; Singh, V. K. (2020). Analysis and prediction of extent of helpfulness of reviews on E-commerce websites. Materials Today: Proceedings, 33, 4520-4525, 2020 [41] Van Nguyen, T.; Zhou, L.; Chong, A. Y. L.; Li, B.; Pu, X. (2020). Predicting customer demand for remanufactured products: A data-mining approach. European Journal of Operational Research, 281(3), 543-558, 2020 [42] Wolpert, D. H. (1992). Stacked generalization. Neural networks, 5(2): 241-259, 1992 [43] Wu P. J., Yang D. (2021). E-Commerce Workshop Scheduling Based on Deep Learning and Genetic Algorithm. Int. Journal of Simulation Modelling, 20(1),192-200,2021 [44] Xu, J.; Kim, H.K. (2021). A study on the factors influencing consumers’ purchase intention towards Chinese beauty industry: focusing on SNS characteristic elements. Journal of Logistics, Informatics and Service Science, 8(2), 47-64, 2021 [45] Yin, X. C.; Liu, C. P.; Han, Z. (2005). Feature combination using boosting. Pattern Recognition Letters, 26(14), 2195-2205, 2005 [46] Zhang, H.; Dong, J. (2020). Prediction of repeat customers on E-commerce platform based on blockchain. Wireless Communications and Mobile Computing, 2020(8841437), 2020 [47] Zhang, Z.; Zeng, D. D.; Abbasi, A.; Peng, J.; Zheng, X. (2013). A random walk model for item recommendation in social tagging systems. ACM Transactions on Management Information Systems 4(2), 1-24, 2013 [48] [Online]. Available: https://www.census.gov/retail/index.html [49] [Online]. Available: https://www.cnnic.net.cn/n4/2022/0401/c88-1131.html [50] [Online]. Available: https://tianchi.aliyun.com/competition/entrance/231576/introduction https://doi.org/10.15837/ijccc.2022.6.4988 17 [51] [Online]. Available: https://github.com/huiminren/RepeatBuyersPrediction [52] [Online]. Available: https://github.com/leowang7553/repeatBuyersPrediction [53] [Online]. Available: https://github.com/Ashitemaru/DM-Tmall-prediction [54] [Online]. Available: https://github.com/DatAvalon/RepeatBuyersPrediction Copyright ©2022 by the authors. Licensee Agora University, Oradea, Romania. This is an open access article distributed under the terms and conditions of the Creative Commons Attribution-NonCommercial 4.0 International License. Journal’s webpage: http://univagora.ro/jour/index.php/ijccc/ This journal is a member of, and subscribes to the principles of, the Committee on Publication Ethics (COPE). https://publicationethics.org/members/international-journal-computers-communications-and-control Cite this paper as: Zhang, M.; Lu, J.; Ma, N.; Cheng, T.C.E.; Hua G. (2022). A Feature Engineering and Ensem- ble Learning Based Approach for Repeated Buyers Prediction International Journal of Computers Communications & Control, 17(6), 4988, 2022. https://doi.org/10.15837/ijccc.2022.6.4988 Introduction Literature Review User Behaviour Research Repeat Purchase Behaviour Prediction Research Method Data Description and Pre-Process Research Method Data Pre-Processing Data Integration Missing Value Processing Abnormal User Identification Methodology Data Imbalance Processing Feature Engineering Prediction Models Model Evaluation Metrics Experimental Results and Analysis Feature Construction and Selection Data Preparation Model Construction and Parameters Comparison of Model Results Important Features Conclusion