Using machine learning to beat the market for lemons

Few purchases in life come with as much uncertainty as buying a car. As a buyer you have no idea what’s under the hood or whether the car will break down in its first hundred kilometers. This is known as the lemons problem, after the famous paper by economist George Akerlof, who won the Nobel Memorial Prize in Economic Sciences for his work on information asymmetry. In this blog post I try to find out if it is possible to beat the lemons problem using machine learning… while making movie title puns.

In 2017, the website kapaza.be, owned by the Norwegian Schibsted group, closed down. It lost the classified ad battle against eBay’s tweedehands.be. Ever since, tweedehands.be has been the go-to website for your daily portion of second-hand phones and cars. Yes, daily. Swiping through classified ads during toilet visits has never been easier. There’s no longer any reason not to score a second-hand Dora bag or that Abtronic you always wanted.

But tweedehands.be has evolved into more than a classified ad website for private individuals. Any store with a product feed can offer its goods on the website through Admarkt. These days, you no longer have to visit half a dozen car dealers to buy a second-hand car. The website and app offer you thousands of cars at the click of a button. But pictures are not the same as a test drive. Is a car overpriced? Hard to tell, especially if you’re a layman when it comes to cars.

So I wondered: could I beat the lemons problem with (not so big) data and machine learning? Please, read on.

Scraped in less than 60 seconds

Nicolas Cage once stole 50 cars in one night. That’s pretty cool, but I scraped 17.000 in one night. I stored quite a lot of information from the ads, which can roughly be classified into three categories: (1) information about the offer, such as how many times it has been viewed and how long it has been online; (2) information about the seller, such as his location (per province), the age of his profile and the number of ratings he received; and finally (3) information about the car, such as the production year, the kilometers driven, fuel type, transmission, etc.

Next, I had to impute a lot of missing values, because many sellers did not provide all the information in their ads. Based on the known values of each offer, an imputation algorithm predicted the missing ones.
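
The post does not say which imputation algorithm was used. As a minimal sketch of the idea, assuming a nearest-neighbour approach (an assumption, not necessarily what ran on the scraped data), a missing value can be filled with the average of the most similar ads. The field names and the four ads below are made up for illustration, and the sketch assumes all other fields are complete.

```python
# Hypothetical ads; None marks a value the seller did not provide.
toy_ads = [
    {"year": 2015, "km": 60000, "power_kw": 85},
    {"year": 2014, "km": 80000, "power_kw": 81},
    {"year": 2008, "km": 190000, "power_kw": 55},
    {"year": 2015, "km": 65000, "power_kw": None},
]


def impute_knn(rows, target, k=2):
    """Fill missing `target` values with the mean of the k most similar rows."""
    others = [f for f in rows[0] if f != target]
    # Range of each known feature, so that km does not dominate year.
    spans = {}
    for f in others:
        vals = [r[f] for r in rows]
        spans[f] = (max(vals) - min(vals)) or 1

    def distance(a, b):
        return sum(((a[f] - b[f]) / spans[f]) ** 2 for f in others) ** 0.5

    complete = [r for r in rows if r[target] is not None]
    filled = []
    for r in rows:
        r = dict(r)  # copy, so the input rows stay untouched
        if r[target] is None:
            nearest = sorted(complete, key=lambda c: distance(r, c))[:k]
            r[target] = sum(c[target] for c in nearest) / len(nearest)
        filled.append(r)
    return filled


imputed = impute_knn(toy_ads, "power_kw")
print(imputed[3]["power_kw"])
```

Here the last ad borrows its engine power from the two ads that are closest in year and kilometers.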

In the next phase, I took a look at possible aspects (‘features’ in machinelearnish) of each ad that could possibly have an impact on the price demanded by the seller. Here are some interesting insights.

The Fast and the Curious

What are the most expensive brands (with more than 5 ads) that can be found on tweedehands.be? Yes, you guessed right: it’s Ferrari, at a staggering average price of over 80.000 euro. The runner-up, at half that average price, is Dodge, closely followed by Porsche.
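
A ranking like this boils down to a group-by with a minimum ad count. Here is a minimal sketch, assuming each ad is reduced to a (brand, price) pair; the function name and the toy prices are made up and are not the scraped data.

```python
from collections import defaultdict


def average_price_by_brand(ads, min_ads=6):
    """Average ask price per brand, most expensive first.

    Brands with fewer than `min_ads` ads are dropped
    ("more than 5 ads" means at least 6).
    """
    prices = defaultdict(list)
    for brand, price in ads:
        prices[brand].append(price)
    averages = {
        brand: sum(p) / len(p)
        for brand, p in prices.items()
        if len(p) >= min_ads
    }
    return sorted(averages.items(), key=lambda kv: kv[1], reverse=True)


# Made-up toy data: two brands with six ads each, one with only two.
brand_ads = (
    [("Ferrari", 80000 + 1000 * i) for i in range(6)]
    + [("Dodge", 40000)] * 6
    + [("Bugatti", 900000)] * 2
)
ranking = average_price_by_brand(brand_ads)
print(ranking)
```

The threshold keeps a brand with a couple of exotic ads from hijacking the top of the list.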

Something else that seemed very relevant to me is the number of kilometers a car has traveled. I’ve always been told that cars with over 200.000 km on the odometer are ready for the scrapyard. And it seems that many sellers on tweedehands.be actually agree. There appears to be a linear relationship between kilometers and price, with the trend line crossing the x-axis (a price of zero) around the 200k mark. After that, other factors seem to be more important.
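
To make the "crosses the x-axis" part concrete: fit a straight line through (kilometers, price) points and solve for the mileage where the predicted price hits zero. This is a minimal sketch using ordinary least squares; the four data points are invented so that the arithmetic is easy to follow, they are not taken from the ads.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented points: every 50.000 km knocks 5.000 euro off the price.
km = [0, 50000, 100000, 150000]
price = [20000, 15000, 10000, 5000]

slope, intercept = fit_line(km, price)
zero_price_km = -intercept / slope  # where the line crosses the x-axis
print(round(zero_price_km))
```

On real ads the points scatter around the line, which is why the 200k crossing is a rough mark and not a hard limit.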

The relationship is even clearer for the age of the car. At the age of 10 years, the average price of most cars approaches 0. Unsurprisingly, the age of the car is very closely related to its kilometers.

Besides looking at the car, I wondered if we could learn something from the location of the seller. As you can see from the graph, in West-Flanders the average price of an offer is 12.500 euro, while in Brussels it’s only 5.000 euro. One should be careful about drawing conclusions about causation between location and price, because it could mean several things: in West-Flanders cars trade at higher prices, or sellers there offer more expensive brands, or maybe they are more experienced salesmen.

Experience in particular seemed like an interesting path. There are probably multiple ways to quantify it, but the number of ratings of a seller is a good proxy. One could argue that more experienced sellers are able to take a higher markup because they know the market better. However, as you can see, by visual inspection it is hard to distinguish any correlation between ratings and price. That doesn’t necessarily mean there is none: there could be multicollinearity (for example between brand and region, e.g. more BMWs in West-Flanders) that obscures the effect.
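
A correlation coefficient puts a number on what the eye cannot see, and it also shows how a third variable can hide an effect. The sketch below uses entirely made-up numbers for two seller groups (think of two regions or brands): within each group more ratings go with higher prices, yet the pooled data point the other way.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Made-up numbers: (number of ratings, ask price) per seller group.
group_a = ([1, 2, 3], [10000, 11000, 12000])
group_b = ([11, 12, 13], [4000, 5000, 6000])

within_a = pearson(*group_a)
within_b = pearson(*group_b)
pooled = pearson(group_a[0] + group_b[0], group_a[1] + group_b[1])
print(within_a, within_b, pooled)
```

This is an extreme case, but it is the same mechanism: unless the model accounts for brand and region, the effect of ratings can be blurred or even flipped.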

Model Max

Finally, I tried modeling the ask price of the offers to identify which offers deviate the most from the model’s predictions. However, there’s something funky going on here. Normally you train the model on a verified data set (meaning the offers would have been valued by an expert) and you evaluate it on a test set. Once your model is accurate on the test set, you can take it to new data and see how far the ask prices deviate from the predictions. In this experiment, however, the target of both the train and the test set is the ask price itself. Nevertheless, one could argue that on such a large data set we’d get both too high and too low ask prices, cancelling each other out. I tend to relate this to the law of large numbers: the average ask price of cars of the same type should be close to the expected value, their real market value.

Using the devised features, I ran a popular, yet quite simple, algorithm: a random forest. Not only is the algorithm relatively fast, especially when parallelized, it also offers the possibility to calculate variable importance. As we can clearly see from the next graph, the power of the engine is really important in determining the price, followed by the car’s kilometers and its age.
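
For readers wondering what variable importance measures: one common flavour is permutation importance, where you shuffle a single feature and see how much worse the predictions get. The sketch below shows that idea only. It is not necessarily how the random forest in this post computed its importances, and `toy_model`, the feature names and the numbers are hypothetical stand-ins for a trained model and real ads.

```python
import random


def mean_abs_error(predict, rows, targets):
    return sum(abs(predict(r) - t) for r, t in zip(rows, targets)) / len(rows)


def permutation_importance(predict, rows, targets, feature, repeats=20, seed=0):
    """Average increase in error after shuffling one feature's column."""
    rng = random.Random(seed)
    baseline = mean_abs_error(predict, rows, targets)
    increases = []
    for _ in range(repeats):
        column = [r[feature] for r in rows]
        rng.shuffle(column)
        shuffled = [dict(r, **{feature: v}) for r, v in zip(rows, column)]
        increases.append(mean_abs_error(predict, shuffled, targets) - baseline)
    return sum(increases) / repeats


# Hypothetical stand-in for a trained model: it uses power and km, ignores colour.
def toy_model(car):
    return 2000 + 100 * car["power_kw"] - 0.02 * car["km"]


powers = [55, 70, 85, 100, 110, 130, 150, 200]
kms = [120000, 30000, 90000, 150000, 10000, 60000, 200000, 45000]
cars = [
    {"power_kw": p, "km": k, "colour_code": i % 3}
    for i, (p, k) in enumerate(zip(powers, kms))
]
asks = [toy_model(c) for c in cars]  # toy targets the model fits perfectly

importances = {
    f: permutation_importance(toy_model, cars, asks, f)
    for f in ("power_kw", "km", "colour_code")
}
print(importances)
```

A feature the model never looks at, like the colour code here, scores exactly zero, while shuffling power or kilometers makes the predictions worse.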

If we assume that using ask prices to train the model is an appropriate thing to do and that the model is correct, we can use the following graph to identify which cars are underpriced and overpriced. If a car (represented by a dot) is correctly priced, it lies near the pink line. If it is underpriced, it lies below the line, and analogously an overpriced car lies above it.

If you are looking to buy a car that’s valued between four thousand and eight thousand euro, you simply pick a dot below the pink line and see which car it represents. You could call the seller, negotiate and maybe sell it afterwards to make a quick hundred bucks.
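
Picking dots below the line can also be done without squinting at a graph. Here is a minimal sketch, assuming every offer carries its ask price and the model’s predicted price; the field names, ids and amounts are invented for illustration.

```python
def bargains(offers, low=4000, high=8000):
    """Ids of offers valued within [low, high] whose ask is below the
    predicted price, biggest gap first."""
    scored = []
    for offer in offers:
        gap = offer["predicted"] - offer["ask"]
        if low <= offer["predicted"] <= high and gap > 0:
            scored.append((gap, offer["id"]))
    return [offer_id for gap, offer_id in sorted(scored, reverse=True)]


# Invented offers: "predicted" plays the role of the pink line.
toy_offers = [
    {"id": "a", "ask": 5000, "predicted": 6500},   # well below the line
    {"id": "b", "ask": 7000, "predicted": 6800},   # above the line
    {"id": "c", "ask": 4200, "predicted": 4500},   # slightly below the line
    {"id": "d", "ask": 9000, "predicted": 12000},  # outside the price band
]
shortlist = bargains(toy_offers)
print(shortlist)
```

The shortlist is only as good as the predictions behind it, so a large gap is a reason to call the seller, not proof of a bargain.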

Need for Heed

But that’s not what I’m going to do. I’m just heeding the call to read up on Akerlof, make the model and write a blog post.

Code here: https://github.com/RoelPi/2hands