CAR PRICE PREDICTION IN THE USA BY USING LINEAR REGRESSION

Huseyn Mammadov
Carlo Bo University of Urbino, Italy

Received: November 18, 2021 Accepted: December 27, 2021 Online Published: December 29, 2021

Abstract
This paper studies a linear regression model to predict car prices for the U.S. market, in order to help a new entrant understand the important pricing factors and variables in the U.S. automobile industry. The prediction of car prices has become a high-interest research area, as it requires significant knowledge of the field. I have applied a comprehensive analysis covering data cleaning, exploration, visualization, feature selection and model building. The data used for the prediction were collected from the web portal fred.stlouisfed.org using a web scraper written in Python (Jupyter). Following a problem-solving approach, I have split the analysis into 5 parts: (1) data understanding and exploration; (2) data cleaning; (3) data preparation: feature engineering and scaling; (4) feature selection using RFE and model building; and (5) linear regression assumptions validation and outlier removal. In the diagnostic plots of observed versus predicted values and residuals versus predicted values, the desired outcome is that points lie symmetrically along a diagonal line in the former and along a horizontal line in the latter. In the residuals vs. predicted plot, some points have very high residual values, and one value is predicted as negative by the model. Other prominent high-residual points may be influential outliers.

Keywords: Car Price Prediction, Linear Regression, Data Understanding, Data Cleaning.

1. Introduction
The purpose of this paper is to explain the price of cars in the US using linear regression, which is used to produce the price predictions. An accurate estimation of automobile prices requires specialized expertise, as price typically depends on several different features and variables.
In addition, the amount of fuel a vehicle uses and its fuel consumption per mile have a significant effect on a car's price, because demand for fuel changes regularly.

This analysis is organized as follows:
• Data understanding and exploration
• Data cleaning
• Data preparation: Feature Engineering and Scaling
• Feature Selection using Recursive Feature Elimination (RFE) and Model Building
• Linear Regression Assumptions Validation and Outlier Removal.

International Journal of Economic Behavior, vol. 11 n. 1, 2021, 99-108. https://doi.org/10.14276/2285-0430.3049

2. Literature Review
Noor and Jan (2017) use multiple linear regression to construct a model for forecasting car prices. Their dataset was collected over a two-month span and included the following characteristics: size, cubic capacity, exterior color, date of posting of the ad, number of ad views, power steering, mileage in kilometers, type of transmission, engine type, area, registered area, layout, edition, make and model year. With this setup, the researchers reported 98 per cent prediction accuracy.

In the research reviewed above, the authors proposed a prediction model based on a single machine learning algorithm. Nevertheless, it is notable that a single-algorithm approach did not produce impressive predictive outcomes and could be improved by combining multiple machine learning methods into an ensemble.

Gongqi et al. (2011) proposed a model built using artificial neural networks (ANN) to estimate the price of a used vehicle. They considered several attributes: miles travelled, estimated car life and make. The model was developed to cope with nonlinear relationships in the data, which prior models using standard linear regression techniques could not handle. The non-linear model was able to forecast car prices with greater accuracy than the linear models.

Wu et al.
(2009) performed an analysis of car price estimation using a knowledge-based neuro-fuzzy method. They took the following characteristics into account: model, year of production, and engine size. Their prediction model produced results comparable to those of a simple regression model.

Du et al. (2009) created an expert system named ODAV (Optimal Distribution of Auction Vehicles), since there is strong demand from car dealers to sell vehicles at the end of the leasing year. This system offers insight into the best prices for vehicles, as well as the location where the best price can be obtained. A regression model based on the k-nearest neighbour machine learning algorithm was used to predict a car's price. The system appears to be remarkably effective, as more than two million vehicles have been exchanged through it.

In his thesis, Richardson (2009) offered a different approach. His expectation was that automakers would produce more durable vehicles. Richardson applied multiple regression analysis and found that electric vehicles retain their value longer than conventional vehicles. This stems from concerns about climate warming and from their greater fuel efficiency.

3. Methodology and Problem Solving

3.1 Data Understanding and Exploration
We first look at the dataset to understand its size, attribute names and so on. Figure 1 shows the data types and names of the columns of the dataset. Python is used for the analysis and to fit the linear regression.

Figure 1 − Understanding the features and data

Observations on Target Variable: Price
The target variable price has a positive skew, since the majority of the cars are low priced. More than 50% of the cars (around 105-107 out of a total of 205) are priced up to 10,000, and close to 35% are priced between 10,000 and 20,000. So around 85% of cars in the US market are priced between 5,000 and 20,000.
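The skew and the price-band shares described above could be checked along the following lines. This is a minimal sketch, assuming the prices are held in a pandas Series; the values below are synthetic stand-ins generated for illustration (not the paper's data), and only the band edges of 10,000 and 20,000 are taken from the text.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the price column (hypothetical values, not the paper's data).
rng = np.random.default_rng(0)
prices = pd.Series(
    np.concatenate([
        rng.uniform(5_000, 20_000, size=175),   # bulk of lower-priced cars
        rng.uniform(25_000, 45_000, size=30),   # smaller group of high-priced cars
    ]),
    name="price",
)

# A positive skew indicates a long right tail (a few expensive cars).
skewness = prices.skew()

# Share of cars falling into each price band.
bands = pd.cut(
    prices,
    bins=[0, 10_000, 20_000, np.inf],
    labels=["up to 10k", "10k-20k", "above 20k"],
)
shares = bands.value_counts(normalize=True).sort_index()

print(f"skew = {skewness:.2f}")
print(shares.round(3))
```

On real data, the same two quantities (sample skewness and band shares) give a quick numerical counterpart to the histogram and KDE plots.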
Based on the above observations and the graph on the right side (the green KDE plot), there appear to be 2 distributions: one for cars priced between 5,000 and 25,000 and another for high-priced cars at 25,000 and above. (Notice the approximate bell curve from a little less than 30,000 up to 45,000/50,000.)

Data Exploration
To perform linear regression, the target variable should be linearly related to the independent variables. Let's see whether that is true in this case. Figure 2 shows the dataset in tabular form, on which we carry out the operations.

Figure 2 − Var indicators

These variables appear to have a linear relation with price: carwidth, curbweight, enginesize, horsepower, boreratio and citympg. The other variables either have no relation with price or the relationship is not strong. None of the variables appears to have a polynomial relation with price. In the linear regression assumptions validation section, we check the linearity assumption in detail.

Figure 3 shows the insights from the correlation heatmap, a 2D matrix of pairwise correlations among the dependent and independent variables.

Figure 3 − Heatmap Correlation

• Positive correlation: price is highly correlated with enginesize, curbweight, horsepower and carwidth (all of these variables represent the size, weight or engine power of the car).
• Negative correlation: price is negatively correlated with the mpg variables citympg and highwaympg. This suggests that cars with high mileage per gallon may fall into the 'economy' category; in other words, low-priced cars mostly have high mpg.
• Correlation among independent variables: many independent variables are highly correlated; wheelbase, carlength, curbweight, enginesize etc.
are all measures of 'size/weight' and are positively correlated.

Since the independent variables are highly correlated (more than 80% correlation among many of them), we have to pay attention to multicollinearity, which we check in the assumptions validation section using the VIF (variance inflation factor) score.

3.2 Data Cleaning: Missing values and feature data type check
In this section we check the dataset for missing values and check the data types of the different features. Figure 4 shows the data types and names of the columns of the dataset, while Figure 5 shows the conversion of the desired column.

Figure 4 − The types of columns of the dataset

Figure 5 − The conversion of the column

3.3 Data Preparation: feature engineering
In this section we prepare the data for model building and the subsequent operations. Data preparation consists of drop and merge operations and the creation of dummy variables. Scaling the features is not strictly necessary in multiple linear regression (MLR), but it is good practice because it makes the regression coefficients easier to interpret.

3.4 Model Building and Feature Selection Using RFE (Recursive Feature Elimination)
Since our dependent variable price appears to be linearly related to most of the independent variables, we use linear regression only (the standard choice when the dependent variable is linearly related to the independent variables) and no other type of regression such as polynomial, random forest or boosting regression. Including all features in the model risks massive overfitting and is rarely a good idea unless there are very few features and all of them are important, so we used recursive feature elimination to reduce dimensionality. First, we need to split the data into train and test sets, as shown in Figure 6.
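As a rough illustration of the preparation and selection steps in Sections 3.3 and 3.4, the sketch below creates dummies, splits the data, scales the features and runs RFE around a linear regression using scikit-learn. The data are synthetic, the 70/30 split, the random seed and the choice of 3 retained features are assumptions made for the example, and only the column names are borrowed from the text; it is not the paper's actual notebook.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (hypothetical values; column names borrowed from the text).
rng = np.random.default_rng(42)
n = 205
df = pd.DataFrame({
    "enginesize": rng.normal(130, 40, n),
    "curbweight": rng.normal(2500, 500, n),
    "horsepower": rng.normal(100, 35, n),
    "citympg": rng.normal(25, 6, n),
    "fueltype": rng.choice(["gas", "diesel"], n),
})
df["price"] = (
    60 * df["enginesize"] + 4 * df["curbweight"] + 50 * df["horsepower"]
    - 100 * df["citympg"] + rng.normal(0, 1500, n)
)

# Dummy-encode the categorical column, then separate features and target.
X = pd.get_dummies(df.drop(columns="price"), drop_first=True, dtype=float)
y = df["price"]

# Split first, then fit the scaler on the training part only to avoid leakage.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42  # 70/30 split is an assumption
)
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# Recursive feature elimination wrapped around a plain linear regression.
n_keep = 3  # illustrative value for this toy data
rfe = RFE(estimator=LinearRegression(), n_features_to_select=n_keep)
rfe.fit(X_train_s, y_train)
selected = list(X.columns[rfe.support_])

# R-squared and RMSE on both splits for the reduced model.
for name, Xs, ys in [("train", X_train_s, y_train), ("test", X_test_s, y_test)]:
    pred = rfe.predict(Xs)
    rmse = np.sqrt(mean_squared_error(ys, pred))
    print(f"{name}: R2={r2_score(ys, pred):.3f} RMSE={rmse:.1f}")
print("selected:", selected)
```

Repeating the RFE fit for a range of values of n_keep and comparing train and test scores is one way to choose the feature count instead of fixing it arbitrarily.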
We then compute R-squared and root mean squared error (RMSE) on both the train and test data; the resulting values are shown in Figure 6.

Figure 6 − Data Split

Feature selection using RFE
First we determine the optimal number of features, rather than arbitrarily specifying the count of features to be used in the model in the RFE function. From the graphs shown in Figure 7 we find:
− R-squared for test data peaks at 13 features, and at this point the model generalizes well, as train R2 is very close to test R2. Train R2 keeps increasing beyond 13 features, but R2 on train data always increases as more features are added. We have selected the number of features at which model accuracy and generalization are both at a satisfactory level.
− RMSE for test data is lowest at 13 features and increases beyond that. Train RMSE at 13 features also looks good; adding more features decreases train RMSE, but there is always a tradeoff between removing features (that is, reducing complexity) and model performance.
So, we go with 13 features (Figure 8).

Figure 7 − Features count

3.5 Linear Regression
To check linearity, we inspect plots of observed vs. predicted values and residuals vs. predicted values. The desired outcome is that points are symmetrically distributed around a diagonal line in the former plot and around a horizontal line in the latter. From the graphs shown in Figure 9:
1. The observed vs. predicted plot shows that most of the values are close to the diagonal line; however, some are not, which is a problem.
2. The residuals vs. predicted plot does not give conclusive evidence that the residuals are evenly scattered around the zero line, as the residuals increase with the predicted values, so the assumption of linearity cannot be confirmed.
3. There seem to be outliers, which may be what makes the residuals vs. predicted plot inconclusive.
Some points have very high residual values; a point at (~ -3000, ~ 8000) shows that one value is predicted as negative by the model. There are many other prominent high-residual points which could be influential outliers.

Figure 8 − Model Building with optimal features

Figure 9 − Comparison of Observed and Predicted Values

Figure 10 shows the actual vs. predicted prices. The blue label indicates the actual prices of the cars and the red label indicates the predicted values.

Figure 10 − Relation of Actuals and Predictions

4. Conclusions
The aim and methodology of this research may be applied to other countries using the same statistical analysis. The analysis indicates which variables are associated with variation in car prices. From the data understanding step and the observations on the target variable, price has a positive skew because most vehicles are low priced. Over 50 per cent of the vehicles are priced up to 10,000 and approximately 35 per cent are priced between 10,000 and 20,000. So, in the US market about 85 per cent of cars are priced between 5,000 and 20,000. On the basis of these findings and the graph on the right side, there are 2 distributions: one for cars priced between 5,000 and 25,000 and another for high-priced cars, at 25,000 and above.

In the data exploration step, we checked the suitability of linear regression. Some variables appear to have a linear relation with price: carwidth, curbweight, enginesize, horsepower, boreratio and citympg; the others either have no relation with price or only a weak association. None of the variables appears to have a polynomial relation with price. In the linear regression assumptions validation section, we tested the linearity assumption.
In the correlation heatmap, price is highly correlated with enginesize, curbweight, horsepower and carwidth, and negatively correlated with the mpg variables citympg and highwaympg. This means that cars with high mileage per gallon may fall into the 'economy' category; in other words, low-priced cars often have high mpg. In the plots of observed versus predicted values and residuals versus predicted values, the desired outcome is that the points are symmetrically arranged along a diagonal line in the former and along a horizontal line in the latter. In the residuals vs. predicted plot, some points have very high residual values, and one value is predicted as negative by the model. There are also other prominent high-residual points that may be influential outliers.

References
1. Du, J., Xie, L., Schroeder, S. (2009). Practice Prize Paper − PIN Optimal Distribution of Auction Vehicles System: Applying Price Forecasting, Elasticity Estimation and Genetic Algorithms to Used-Vehicle Distribution. Marketing Science, 28(4), 637-644.
2. Gelman, A., Hill, J. (2006). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, New York, USA.
3. Gongqi, S., Yansong, W., Qiang, Z. (2011). New Model for Residual Value Prediction of the Used Car Based on BP Neural Network and Nonlinear Curve Fit. In Measuring Technology and Mechatronics Automation (ICMTMA), 2011 Third International Conference, Vol. 2, pp. 682-685, IEEE.
4. Listiani, M. (2009). Support Vector Regression Analysis for Price Prediction in a Car Leasing Application. Thesis (MSc). Hamburg University of Technology.
5. Noor, K., Jan, S. (2017). Vehicle Price Prediction System using Machine Learning Techniques. International Journal of Computer Applications, 167(9), 27-31.
6. Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann.
7. Richardson, M. S. (2009). Determinants of used car resale value.
Retrieved from: https://digitalcc.coloradocollege.edu/islandora/ 3A1346 [accessed: August 1, 2020].
8. Used cars database. (n.d.) Retrieved from: https://fred.stlouisfed.org/
9. Wu, J.D., Hsu, C.C., Chen, H.C. (2009). An expert system of price forecasting for used cars using adaptive neuro-fuzzy inference. Expert Systems with Applications, 36(4), 7809-7817.
10. Angioni M., Musso F. (2020), “New perspectives from technology adoption in senior cohousing facilities”, The TQM Journal, Vol. 32, n. 4, pp. 761-777. doi: 10.1108/TQM-10-2019-0250.
11. Musso F. (2004), “Il sistema distributivo cinese fra tradizione e modernizzazione”, China News, n. 1, Milano, Franco Angeli, pp. 11-31.
12. Musso F. (2009), “La Cina come mercato: prospettive, vincoli, illusioni”, in Beretta S., Pissavino P.C. (a cura di), Cina e oltre. Piccola e media impresa tra internazionalizzazione e innovazione, Rubbettino, Soveria Mannelli.
13. Musso F. (2013), “Is Industrial Districts Logistics suitable for Industrial Parks?”, Acta Universitatis Danubius. Œconomica, Vol. 9, n. 4, pp. 221-233.
14. Musso F., Risso M. (2006), “Responsabilità sociale d'impresa nelle filiere internazionali della grande distribuzione”, Symphonya: Emerging Issues in Management, n. 1, pp. 91-107.
15. Musso F., Risso M. (2007), “Sistemi di supporto alle decisioni di internazionalizzazione commerciale: un modello applicativo per le piccole e medie imprese”, in Ferrero G. (ed.), Le ICT per la qualificazione delle Piccole Imprese Marchigiane, Carocci, Roma, pp. 205-255.
16. Musso F., Risso M. (2013), “CSR for retailers' led channel relationships: Evidence from Italian SME manufacturers”, International Journal of Information Systems and Social Change (IJISSC), Vol. 4, n. 1, January-March, pp. 21-36. doi: 10.4018/ijissc.2013010102.
17. Pepe C., Musso F. (1994), “Integrazione europea e distribuzione commerciale: politiche comunitarie ed evoluzione del fenomeno”, Economia e Diritto del Terziario, n. 1, ISSN: 1593-9464, pp. 129-175.
18. Pepe C., Musso F. (1999), “Imprese distrettuali e rapporto col mercato: potenzialità e limiti dei processi di internazionalizzazione del distretto pesarese del mobile”, Atti del Convegno: Il futuro dei distretti, Vicenza, 4 giugno.
19. Musso F. (2010), “Le nuove frontiere del marketing internazionale fra approccio strategico, contestualizzazione e interculturalità”, Mercati e competitività, n. 4/2010, pp. 15-19. doi: 10.3280/MC2010-004002.