In this notebook I will attempt to predict automobile prices using Python and its data analysis and machine learning packages such as pandas and scikit-learn. Forecasting car prices can be useful for businesses in general: insurance companies can use this information to calculate their premia; websites and enterprises can provide estimates even when asking prices are not available in a specific application; enterprises can set contracts where a car's resale value must be defined a priori with greater information, or determine if a car is overvalued or undervalued with respect to the market.
My data source is Mercado Livre, a widely used e-commerce platform in Brazil. For simplicity, and also to isolate other geographical factors that affect prices, this research restricts to ads in the city of Porto Alegre/RS. I also excluded trucks and minivans from the analysis. While new cars can be announced, most of the cars advertised are used. To publish an ad, the seller must fill information about the car, such as brand, model, mileage, engine power and additional features. It is a common belief that this features can help predict a car's asking price in the market, and in our methodology I will explore them to improve our predictions (of course, there also will be unobservable factors that affect prices). A popular source for automobile prices in Brazil is the FIPE table. In the table, average prices are calculated from newspaper and web ads. Data from the FIPE table can be improved here, as learning models can benefit from data updates, predict prices for cars with a specific set of features and location.
Along the notebook, I will go through all steps of data analysis, with code and commentary.