CAR PRICE PREDICTION IN THE USA BY USING LINEAR REGRESSION
 

Huseyn Mammadov 
Carlo Bo University of Urbino, Italy 

 
Received: November 18, 2021     Accepted: December 27, 2021    Online Published: December 29, 2021 
 
 
Abstract 
 
This paper studies a linear regression model to predict car prices for the U.S. market, in
order to help a new entrant understand the important pricing factors and variables in the U.S.
automobile industry. The prediction of car prices has become a high-interest research area,
as it requires significant knowledge of the field. I carry out a comprehensive analysis covering
data cleaning, exploration, visualization, feature selection and model building.
The data used for the prediction were collected from the web portal fred.stlouisfed.org using
a web scraper written in Python in a Jupyter notebook. Following a problem-solving approach,
I split the work into five parts: (1) data understanding and exploration, (2) data cleaning,
(3) data preparation: feature engineering and scaling, (4) feature selection using RFE and model
building, and (5) linear regression assumptions validation and outlier removal. In the diagnostic
plots of observed versus predicted values and of residuals versus predicted values, the desired
outcome is that the points lie symmetrically along a diagonal line in the former plot and along
a horizontal line in the latter. In the residuals versus predicted plot, several points have very
high residuals, and one value is predicted as negative by the model. Other prominent
high-residual points may be influential outliers.

  
Keywords: Car Price Prediction, Linear Regression, Data Understanding, Data Cleaning.

1. Introduction  
The purpose of this paper is to explain the price of cars in the US using linear regression,
which is also used to produce price predictions. An accurate estimate of automobile prices
requires specialized expertise, as price typically depends on several different features and
variables. In addition, the fuel a vehicle uses and its fuel consumption per mile have a
significant effect on a car's price, owing to regular shifts in fuel demand.

This analysis is organized as follows:
• Data understanding and exploration
• Data cleaning
• Data preparation: Feature Engineering and Scaling
• Feature Selection using Recursive Feature Elimination (RFE) and Model Building
• Linear Regression Assumptions Validation and Outlier Removal.

International Journal of Economic Behavior, vol. 11 n. 1, 2021, 99-108.
https://doi.org/10.14276/2285-0430.3049

2. Literature Review 
Noor and Jan (2017) use multiple linear regression to construct a model for forecasting car
prices. The dataset was collected over a two-month span and included the following
characteristics: size, cubic capacity, exterior color, date of posting of the ad, number of ad
views, power steering, mileage in kilometers, type of transmission, type of motor, area,
registered area, layout, edition, make and model year. With this setup, the researchers were
able to reach 98 per cent prediction accuracy. In the research described above, the authors
proposed a prediction model based on a single machine learning algorithm. Nevertheless, it is
notable that a single-algorithm approach did not always produce impressive predictive results,
and it could be improved by combining multiple machine learning methods into an ensemble.

Gongqi et al. (2011) proposed a model built with an ANN (Artificial Neural Network) to
estimate the price of a used vehicle. They considered several attributes: mileage, estimated
car life and make. The new model was developed in order to cope with nonlinear relationships
in the data, which prior models based on standard linear regression techniques could not
handle. The non-linear model was able to forecast car prices more accurately than the linear
models.

Wu et al. (2009) analysed car price estimation using a knowledge-based neuro-fuzzy method.
They took the following characteristics into account: model, year of production, and engine
size. Their prediction model produced results comparable to those of a simple regression
model. Du et al. (2009) created an expert system named ODAV (Optimal Distribution of Auction
Vehicles), since there is strong demand from car dealers to sell vehicles at the end of the
leasing period. This system offers insight into the best prices for vehicles, as well as the
location where the best price can be obtained. A regression model based on the k-nearest
neighbour machine learning algorithm was used to predict a car's price. The system appears to
be remarkably effective, as more than two million vehicles have been exchanged through it.

In his thesis, Richardson (2009) took a different approach. His expectation was that
automakers would make more durable vehicles. Richardson applied multiple regression analysis
and found that electric vehicles retain their value longer than conventional vehicles. This
stems from concerns about global warming and from their greater fuel efficiency.

3. Methodology and Problem Solving 
3.1 Data Understanding and Exploration 
We first look at the dataset to understand its size, attribute names and so on. Figure 1 shows
the data types and names of the columns of the dataset. Python is used for the estimation,
including fitting the linear regression.
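A minimal sketch of this first look at the data in Python with pandas. The small frame below is invented illustrative data, not the paper's dataset; the column names price, enginesize and horsepower follow the text, while fueltype is an assumed categorical column.

```python
import pandas as pd

# Hypothetical miniature stand-in for the car dataset (all values are invented).
cars = pd.DataFrame({
    "fueltype": ["gas", "gas", "diesel", "gas"],
    "enginesize": [130, 152, 109, 136],
    "horsepower": [111, 154, 102, 115],
    "price": [13495.0, 16500.0, 13950.0, 17450.0],
})

n_rows, n_cols = cars.shape          # size of the dataset
column_types = cars.dtypes           # data type of every column
numeric_summary = cars.describe()    # count, mean, quartiles of numeric columns

print(n_rows, n_cols)
print(column_types)
```

With the real data, the frame would be read from file instead of being typed in; the inspection calls stay the same.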
 

Figure 1 − Understanding the features and data

Observations on Target Variable: Price
 

The target variable price has a positive skew; that is, the majority of the cars are low priced.

More than 50% of the cars (around 105-107 out of a total of 205) are priced at about 10,000 or
below, and close to 35% of the cars are priced between 10,000 and 20,000. So around 85% of cars
in the US market are priced between 5,000 and 20,000. Based on the above observations and the
graph on the right side (the green KDE plot), there appear to be two distributions: one for cars
priced between 5,000 and 25,000 and another for high-priced cars at 25,000 and above. (Notice
the approximate bell curve from a little less than 30,000 up to 45,000/50,000.)
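The skew and the share of cars per price band can be computed directly. The sketch below uses invented prices, so the numbers it produces are not the paper's; the band edges at 10,000 and 20,000 follow the text.

```python
import pandas as pd

# Invented prices for illustration only (not the paper's data).
price = pd.Series([5500, 6800, 7200, 7900, 8500, 9300, 9900,
                   11500, 13900, 16500, 18900, 31000, 36000, 41000],
                  dtype=float)

# Sample skewness: a value above 0 means a long right tail (positive skew).
skewness = price.skew()

# Share of cars falling into each price band.
bands = pd.cut(price, bins=[0, 10_000, 20_000, float("inf")],
               labels=["<=10k", "10k-20k", ">20k"])
band_share = bands.value_counts(normalize=True).sort_index()

print(round(skewness, 2))
print(band_share)
```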

 
Data Exploration 
To perform linear regression, the target variable should be linearly related to the independent
variables. Let's see whether that is true in this case. Figure 2 shows the tabular form of the
dataset on which we will carry out the operations.
 

Figure 2 − Var indicators 

 
These variables appear to have a linear relation with price: carwidth, curbweight, enginesize,
horsepower, boreratio and citympg. The other variables either have no relation with price or
only a weak one. None of the variables appears to have a polynomial relation with price.
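One simple way to back up the visual impression from the scatter plots is to rank the variables by their Pearson correlation with price. The sketch below uses synthetic data built so that enginesize drives price, citympg falls with enginesize, and carheight (an assumed, hypothetical column here) is unrelated; it illustrates the check, not the paper's results.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
enginesize = rng.normal(130, 40, n)

# Synthetic stand-in for the dataset (assumption, not the paper's data).
cars = pd.DataFrame({
    "enginesize": enginesize,
    "citympg": 60 - 0.2 * enginesize + rng.normal(0, 3, n),
    "carheight": rng.normal(54, 2, n),          # built to be unrelated to price
    "price": 100 * enginesize + rng.normal(0, 1500, n),
})

# Pearson correlation of every variable with the target, strongest first.
corr_with_price = (cars.corr()["price"]
                   .drop("price")
                   .sort_values(ascending=False))
print(corr_with_price.round(2))
```

A correlation near zero only rules out a linear relation, so the scatter plots remain necessary to spot curved relationships.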



 
In the linear regression assumptions validation section, we will check the linearity assumption
in detail. Figure 3 shows useful insights from the correlation heatmap, a 2D matrix of pairwise
correlations among the dependent variable and the independent variables.

 
Figure 3 − Correlation heatmap

 
Positive correlation: price is highly correlated with enginesize, curbweight, horsepower and
carwidth (all of these variables represent the size, weight or engine power of the car).

Negative correlation: price is negatively correlated with the mpg variables citympg and
highwaympg. This suggests that cars with high mileage per gallon tend to fall into the 'economy'
category; in other words, low-priced cars mostly have high mpg.

Correlation among independent variables: many independent variables are highly correlated with
each other; wheelbase, carlength, curbweight, enginesize etc. are all measures of size or weight
and are positively correlated.

Since the independent variables are highly correlated (a correlation above 0.8 among many of
them), we have to pay attention to multicollinearity, which we check in the assumptions
validation section using the VIF (variance inflation factor) score.
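The VIF of a feature is 1 / (1 − R²), where R² comes from regressing that feature on all the other features. The sketch below computes it with NumPy on synthetic data in which enginesize is deliberately tied to curbweight; the helper name vif_scores and the data are assumptions for illustration, and peakrpm is an assumed column name.

```python
import numpy as np

def vif_scores(X):
    """VIF of each column of X: 1 / (1 - R^2_j), where R^2_j is obtained
    by regressing column j (with an intercept) on all other columns."""
    X = np.asarray(X, dtype=float)
    scores = []
    for j in range(X.shape[1]):
        y = X[:, j]
        others = np.delete(X, j, axis=1)
        A = np.column_stack([np.ones(len(y)), others])   # add intercept
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        scores.append(1.0 / (1.0 - r2))
    return np.array(scores)

rng = np.random.default_rng(1)
curbweight = rng.normal(2500, 500, 300)
enginesize = 0.05 * curbweight + rng.normal(0, 5, 300)   # tied to curbweight
peakrpm = rng.normal(5000, 400, 300)                     # independent

vif = vif_scores(np.column_stack([curbweight, enginesize, peakrpm]))
print(vif.round(2))
```

A commonly used rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic multicollinearity; the exact threshold is a judgment call.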

 
3.2 Data Cleaning: Missing values and feature data type check 
In this section we check the dataset for missing values and check the data types of the
different features. Figure 4 shows the data types and names of the columns of the dataset, and
Figure 5 shows the conversion of the desired column.
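Both checks take one line each in pandas. In the sketch below the data are invented, and the choice of symboling as the column to convert is an assumption made for illustration; the paper does not name the converted column.

```python
import numpy as np
import pandas as pd

# Invented rows for illustration; "symboling" is a hypothetical example of a
# numeric code that is better treated as a category.
cars = pd.DataFrame({
    "symboling": [3, 1, 2, 0],
    "horsepower": [111.0, np.nan, 102.0, 115.0],
    "price": [13495.0, 16500.0, 13950.0, 17450.0],
})

# Number of missing values in each column.
missing_per_column = cars.isnull().sum()

# Convert the chosen column from integer to a categorical type.
cars["symboling"] = cars["symboling"].astype("category")

print(missing_per_column)
print(cars.dtypes)
```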

  
 
 
Figure 4 − The types of columns of the dataset

 
Figure 5 − The conversion of the column

 

 
3.3 Data Preparation: feature engineering 
In this section we prepare the data for model building and the subsequent operations. Data
preparation consists of dropping columns, merging, and creating dummy variables. Scaling the
features is not necessary in multiple linear regression (MLR), but it is good practice because
it makes the regression coefficients easier to interpret.
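A minimal sketch of these preparation steps, assuming pandas and scikit-learn. The data, the dropped identifier column carID and the use of StandardScaler are all assumptions for illustration; the paper does not state which scaler was used.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Invented rows; "carID" stands for an identifier column with no predictive value.
cars = pd.DataFrame({
    "carID": [1, 2, 3, 4],
    "fueltype": ["gas", "diesel", "gas", "gas"],
    "enginesize": [130.0, 152.0, 109.0, 136.0],
})

# Drop columns that should not enter the model.
cars = cars.drop(columns=["carID"])

# One dummy column per category, dropping the first level to avoid
# perfect collinearity among the dummies.
cars = pd.get_dummies(cars, columns=["fueltype"], drop_first=True, dtype=int)

# Standardize numeric features to zero mean and unit variance.
numeric_cols = ["enginesize"]
cars[numeric_cols] = StandardScaler().fit_transform(cars[numeric_cols])

print(cars)
```

In a full pipeline the scaler would normally be fitted on the training split only and then applied to the test split, so that no information from the test data leaks into the model.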
 

3.4 Model Building and Feature Selection Using RFE (Recursive Feature Elimination) 
Since our dependent variable price appears to be linearly related to most of the independent
variables, we use linear regression only, which is the standard choice when the dependent
variable is linearly related to the independent variables, and no other types of regression
such as polynomial, random forest or boosting regression.

Including all features in the model risks massive overfitting; it is rarely a good idea unless
the features are few and all of them are important, so we use recursive feature elimination to
reduce dimensionality. First, we split the data into train and test sets, as shown in Figure 6.
We then compute R-squared and root mean squared error (RMSE) on both the train and the test
data; the resulting values are also shown in Figure 6.
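The split and the two metrics can be sketched as follows with scikit-learn. The data are synthetic, and the 70/30 split ratio and the random seeds are assumptions; the paper does not report them in the text.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: 205 cars, 5 features, two of which carry no signal.
rng = np.random.default_rng(42)
X = rng.normal(size=(205, 5))
true_coef = np.array([4000.0, 2500.0, 0.0, 0.0, 1000.0])
y = 13000 + X @ true_coef + rng.normal(0, 1500, 205)

# Assumed 70/30 train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LinearRegression().fit(X_train, y_train)

def r2_and_rmse(fitted_model, X_part, y_part):
    """Return (R-squared, RMSE) of the model on one data split."""
    pred = fitted_model.predict(X_part)
    rmse = float(np.sqrt(mean_squared_error(y_part, pred)))
    return r2_score(y_part, pred), rmse

train_r2, train_rmse = r2_and_rmse(model, X_train, y_train)
test_r2, test_rmse = r2_and_rmse(model, X_test, y_test)
print(round(train_r2, 3), round(test_r2, 3))
```

A train R-squared far above the test R-squared would point to overfitting, which is the motivation for reducing the number of features.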

 
Figure 6 – Data Split 

 
Feature selection using RFE 
First we determine the optimal number of features, rather than arbitrarily specifying the
number of features for the RFE function to use in the model.

From the graphs shown in Figure 7 we find:
− R-squared for the test data peaks at 13 features, and at this point the model generalizes
well, as the train R-squared is very close to the test R-squared. Train R-squared keeps
increasing beyond 13 features, which is expected, because R-squared on the training data
keeps increasing as more features are added. We have selected the number of features at
which model accuracy and generalization are both at a satisfactory level.

− RMSE for the test data is lowest at 13 features and increases beyond that. Train RMSE
at 13 features also looks good; adding more features decreases train RMSE further, but
there is always a tradeoff between removing features (that is, reducing complexity) and
model performance. So, we will go with 13 features (Figure 8).
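The search over the number of features can be sketched with scikit-learn's RFE: for every candidate count, RFE selects that many features on the training data, and the train and test R-squared are recorded. The data below are synthetic with 10 features, of which 3 are informative, so the selected count will differ from the 13 features found in the paper; the split ratio and seeds are assumptions.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data: only the first three features influence the target.
rng = np.random.default_rng(7)
n_samples, n_features = 205, 10
X = rng.normal(size=(n_samples, n_features))
true_coef = np.array([5000.0, 3000.0, 2000.0] + [0.0] * 7)
y = 13000 + X @ true_coef + rng.normal(0, 1500, n_samples)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

results = []  # (number of features, train R-squared, test R-squared)
for k in range(1, n_features + 1):
    # Recursively drop the weakest feature until k remain.
    selector = RFE(LinearRegression(), n_features_to_select=k)
    selector.fit(X_train, y_train)
    X_train_k = selector.transform(X_train)
    X_test_k = selector.transform(X_test)
    model = LinearRegression().fit(X_train_k, y_train)
    results.append((k,
                    r2_score(y_train, model.predict(X_train_k)),
                    r2_score(y_test, model.predict(X_test_k))))

# Pick the feature count with the highest test R-squared.
best_k, best_train_r2, best_test_r2 = max(results, key=lambda r: r[2])
print(best_k, round(best_train_r2, 3), round(best_test_r2, 3))
```

Because RFE with a linear model ranks features by coefficient size, the features should be on a comparable scale before it is applied.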

 
Figure 7 − Features count

 
3.5 Linear Regression 
To check linearity, let's inspect plots of observed vs. predicted values and of residuals vs.
predicted values. The desired outcome is that the points are symmetrically distributed around a
diagonal line in the former plot and around a horizontal line in the latter.

From the graphs shown in Figure 9:
1. The observed vs. predicted plot shows that most of the values are close to the diagonal
line; however, some are not, which is a problem.
2. The residuals vs. predicted plot does not give conclusive evidence that the residuals are
evenly scattered around the zero line, as the residuals increase with the predicted
values, so the assumption of linearity cannot be confirmed.

3. There seem to be outliers, which may be what makes the residuals vs. predicted plot
inconclusive. Some points have very high residual values; a point at about
(-3000, 8000) shows that one value is predicted as negative by the model. There are
many other prominent high-residual points which could be influential outliers.
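The same diagnostics can be expressed numerically. The sketch below uses synthetic predictions whose error spread grows with the predicted value, mimicking the pattern described in point 2; the cut-off of 3 standard deviations for flagging outliers is an assumption, as the paper does not state its outlier rule.

```python
import numpy as np

# Synthetic observed and predicted prices (not the paper's model output).
rng = np.random.default_rng(3)
predicted = rng.uniform(5000, 40000, 150)
# Noise whose spread grows with the prediction, giving a fan-shaped plot.
observed = predicted + rng.normal(0, 0.1 * predicted)

residuals = observed - predicted
standardized = (residuals - residuals.mean()) / residuals.std()

# Candidate outliers: residuals more than 3 standard deviations from the mean
# (assumed threshold).
outlier_mask = np.abs(standardized) > 3

# Predictions below zero are impossible prices and deserve a closer look.
negative_prediction_mask = predicted < 0

# A positive correlation between predicted values and absolute residuals
# indicates that the residual spread grows with the prediction.
spread_trend = np.corrcoef(predicted, np.abs(residuals))[0, 1]

print(int(outlier_mask.sum()), int(negative_prediction_mask.sum()),
      round(float(spread_trend), 2))
```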



 
Figure 8 − Model Building with optimal features 
 

Figure 9 − Comparison of Observed and Predicted Values
 

 
 
Figure 10 shows the actual versus predicted prices. The blue label indicates the actual prices
of the cars and the red label indicates the predicted prices of the cars.
 
 
Figure 10 − Relation of Actuals and Predictions 
 

4. Conclusions 
The aim and methodology of this research may be applied to other countries using the same
statistical analysis. The analysis indicates the variables with which car prices vary. From
the observations on the target variable, price has a positive skew, because most vehicles are
low priced. Over 50 per cent of the vehicles are priced at about 10,000 or below, and
approximately 35 per cent are priced between 10,000 and 20,000. So, in the US market, about
85 per cent of cars are priced between 5,000 and 20,000.

On the basis of the above findings and the graph on the right side, there appear to be two
distributions: one for cars priced between 5,000 and 25,000 and another for high-priced cars
at 25,000 and beyond. In the data exploration we checked whether linear regression is
applicable. In detail, some variables appear to have a linear relation with price: carwidth,
curbweight, enginesize, horsepower, boreratio and citympg; the other variables either have no
relation with price or only a weak one. None of the variables appears to have a polynomial
relation with price. In the linear regression assumptions validation section, we tested the
linearity assumption. In the correlation heatmap, price is highly correlated with enginesize,
curbweight, horsepower and carwidth, and negatively correlated with the mpg variables citympg
and highwaympg. This means that cars with high mileage per gallon tend to fall into the
'economy' category or, in other words, that low-priced cars often have high mpg.

In the diagnostic plots of observed versus predicted values and of residuals versus predicted
values, the desired outcome is that the points are symmetrically arranged along a diagonal
line in the former plot and along a horizontal line in the latter. In the residuals versus
predicted plot, several points have very high residual values, and one value is predicted as
negative by the model. There are also other prominent high-residual points that may be
influential outliers.



 
References 
1. Du, J., Xie, L., Schroeder, S. (2009). Practice Prize Paper − PIN Optimal Distribution of
Auction Vehicles System: Applying Price Forecasting, Elasticity Estimation and Genetic
Algorithms to Used-Vehicle Distribution. Marketing Science, 28(4), 637-644.

2. Gelman, A., Hill, J. (2006). Data Analysis Using Regression and Multilevel/Hierarchical
Models. Cambridge University Press, New York, USA.

3. Gongqi, S., Yansong, W., Qiang, Z. (2011). New Model for Residual Value Prediction of 
the Used Car Based on BP Neural Network and Nonlinear Curve Fit. In Measuring 
Technology and Mechatronics Automation (ICMTMA), 2011 Third International 
Conference, Vol. 2, pp. 682-685,  IEEE. 

4. Listiani, M. (2009). Support Vector Regression Analysis for Price Prediction in a Car 
Leasing Application. Thesis (MSc). Hamburg University of Technology. 

5. Noor, K., Jan, S. (2017). Vehicle Price Prediction System using Machine Learning 
Techniques. International Journal of Computer Applications, 167(9), 27-31. 

6. Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Morgan Kaufmann.

7. Richardson, M. S. (2009). Determinants of used car resale value. Retrieved from:
https://digitalcc.coloradocollege.edu/islandora/ 3A1346 [accessed: August 1, 2020].

8. Used cars database. (n.d.) Retrieved from: https://fred.stlouisfed.org/

9. Wu, J.D., Hsu, C.C., Chen, H.C. (2009). An expert system of price forecasting for used cars
using adaptive neuro-fuzzy inference. Expert Systems with Applications, 36(4), 7809-7817.

10. Angioni M., Musso F. (2020) “New perspectives from technology adoption in senior cohousing
facilities”, The TQM Journal, Vol. 32, n. 4, pp. 761-777. doi: 10.1108/TQM-10-2019-0250

11. Musso F. (2004), “Il sistema distributivo cinese fra tradizione e modernizzazione”, China News, n. 
1, Milano, Franco Angeli, pp. 11-31. 

12. Musso F. (2009), “La Cina come mercato: prospettive, vincoli, illusioni”, in Beretta S., Pissavino 
P.C. (a cura di), Cina e oltre. Piccola e media impresa tra internazionalizzazione e innovazione, 
Rubbettino, Soveria Mannelli 

13. Musso F. (2013), "Is Industrial Districts Logistics suitable for Industrial Parks?", Acta Universitatis 
Danubius. Œconomica, Vol 9, No 4, pp. 221-233. 

14. Musso F., Risso M. (2006), “Responsabilità sociale d'impresa nelle filiere internazionali della 
grande distribuzione”, Symphonya: Emerging Issues in Management, n. 1, pp. 91-107. 

15. Musso F., Risso M. (2007). Sistemi di supporto alle decisioni di internazionalizzazione 
commerciale: un modello applicativo per le piccole e medie imprese, in Ferrero G. (ed.), Le ICT 
per la qualificazione delle Piccole Imprese Marchigiane, Carocci, Roma, 205-255. 

16. Musso F., Risso M., (2013) "CSR for retailers' led channel relationships: Evidence from Italian 
SME manufacturers", International Journal of Information Systems and Social Change (IJISSC), 
Vol. 4, n. 1, January-March, pp.21-36, doi: 10.4018/ijissc.2013010102. 

17. Pepe C., Musso F. (1994), "Integrazione europea e distribuzione commerciale: politiche
comunitarie ed evoluzione del fenomeno", Economia e Diritto del Terziario, n. 1,
ISSN: 1593-9464, pp. 129-175.

18. Pepe C., Musso F. (1999), “Imprese distrettuali e rapporto col mercato: potenzialità e limiti dei 
processi di internazionalizzazione del distretto pesarese del mobile”, Atti del Convegno: Il futuro 
dei distretti, Vicenza, 4 giugno.  

19. Musso F. (2010), “Le nuove frontiere del marketing internazionale fra approccio strategico, contestualizzazione 
e interculturalità”, Mercati e competitività, n.  4/2010, pp. 15-19. doi: 10.3280/MC2010-004002. 

