“You can best learn data mining and data science by doing, so start analyzing data as soon as you can! However, don’t forget to learn the theory, since you need a good statistical and machine learning foundation to understand what you are doing and to find real nuggets of value in the noise of big data.”
- Gregory Piatetsky-Shapiro

In this post, we are going to look at a dataset from the Nigerian used car market. We will explore the data, do some necessary data cleaning and answer some questions about the data. The data set consists of information about 1451 cars in the Nigerian car market. You can find more details on the data set here. Let’s read in the data and take a brief look at it.

library(tidyverse)

car <- read_csv("C:/Users/Adejumo/Downloads/car_scrape(1).csv")
                     
head(car)                 
## # A tibble: 6 x 10
##   title    odometer location isimported  engine  transmission fuel  paint  price
##   <chr>       <dbl> <chr>    <chr>       <chr>   <chr>        <chr> <chr>  <dbl>
## 1 Toyota ~    60127 Lagos    Locally us~ 4-cyli~ automatic    petr~ Silv~ 2.00e6
## 2 Acura M~   132908 Lagos    Foreign Us~ 6-cyli~ automatic    petr~ Whine 3.32e6
## 3 Lexus E~   120412 Lagos    Locally us~ 6-cyli~ automatic    petr~ Silv~ 2.66e6
## 4 Mercede~    67640 Lagos    Foreign Us~ 4-cyli~ automatic    petr~ Black 9.02e6
## 5 Mercede~    92440 Abuja    Foreign Us~ 4-cyli~ automatic    petr~ Black 5.79e6
## 6 Mercede~    39979 Abuja    Foreign Us~ 4-cyli~ automatic    petr~ Brown 1.94e7
## # ... with 1 more variable: year <dbl>
glimpse(car)
## Rows: 1,451
## Columns: 10
## $ title        <chr> "Toyota Corolla", "Acura MDX", "Lexus ES 350", "Mercedes-~
## $ odometer     <dbl> 60127, 132908, 120412, 67640, 92440, 39979, 144211, 82828~
## $ location     <chr> "Lagos", "Lagos", "Lagos", "Lagos", "Abuja", "Abuja", "La~
## $ isimported   <chr> "Locally used", "Foreign Used", "Locally used", "Foreign ~
## $ engine       <chr> "4-cylinder(I4)", "6-cylinder(V6)", "6-cylinder(V6)", "4-~
## $ transmission <chr> "automatic", "automatic", "automatic", "automatic", "auto~
## $ fuel         <chr> "petrol", "petrol", "petrol", "petrol", "petrol", "petrol~
## $ paint        <chr> "Silver", "Whine", "Silver", "Black", "Black", "Brown", "~
## $ price        <dbl> 1995000, 3315000, 2655000, 9015000, 5790000, 19440000, 19~
## $ year         <dbl> 2009, 2009, 2008, 2013, 2013, 2016, 2008, 2000, 2010, 200~
skimr::skim(car)
Table 1: Data summary

Name                     car
Number of rows           1451
Number of columns        10
Column type frequency:
  character              7
  numeric                3
Group variables          None

Variable type: character

skim_variable  n_missing  complete_rate  min  max  empty  n_unique  whitespace
title                  0              1    6   37      0       240           0
location               0              1    3   16      0        13           0
isimported             0              1    3   12      0         3           0
engine                 0              1   14   16      0         9           0
transmission           0              1    6    9      0         2           0
fuel                   0              1    6    6      0         2           0
paint                  0              1    3   23      0        75           0

Variable type: numeric

skim_variable  n_missing  complete_rate        mean          sd      p0      p25      p50        p75       p100  hist
odometer               0              1   116802.14    115841.9       0    53338    92919   152769.5    1775588  ▇▁▁▁▁
price                  0              1  8431088.23  13089603.8  400000  2615000  4215000  8865000.0  167015008  ▇▁▁▁▁
year                   0              1     2008.59        39.2    1217     2006     2010     2014.0       2626  ▁▁▇▁▁

From the results above, we can see that there are no missing values. There seems to be a problem with the year variable, though: it has extreme values at both the lower end (1217) and the upper end (2626). Let us take a closer look at the year variable.

car %>% 
  filter(year <  1960 | year > 2021)
## # A tibble: 5 x 10
##   title     odometer location isimported  engine transmission fuel  paint  price
##   <chr>        <dbl> <chr>    <chr>       <chr>  <chr>        <chr> <chr>  <dbl>
## 1 Mercedes~   403461 Lagos    Locally us~ 4-cyl~ manual       dies~ white 6.02e6
## 2 Mercedes~   701934 Lagos    Locally us~ 8-cyl~ manual       dies~ white 1.20e7
## 3 Mercedes~        0 Lagos    Locally us~ 8-cyl~ manual       dies~ white 1.20e7
## 4 Mercedes~   510053 Lagos    Locally us~ 6-cyl~ manual       dies~ white 7.50e7
## 5 Mercedes~   650923 Lagos    Locally us~ 6-cyl~ manual       dies~ blue  7.02e6
## # ... with 1 more variable: year <dbl>

There are 5 cars with implausible years, which probably happened when the data was entered. Since that number is negligible, we can drop those rows and proceed with our analysis. Next, we rescale the price column by dividing it by one million, so that prices are expressed in millions of naira.

car_data <- car %>% 
  filter(!(year <  1960 | year > 2021)) %>% 
  mutate(price_millions = price/1000000, .keep = "unused")
  
head(car_data)
## # A tibble: 6 x 10
##   title     odometer location isimported  engine  transmission fuel  paint  year
##   <chr>        <dbl> <chr>    <chr>       <chr>   <chr>        <chr> <chr> <dbl>
## 1 Toyota C~    60127 Lagos    Locally us~ 4-cyli~ automatic    petr~ Silv~  2009
## 2 Acura MDX   132908 Lagos    Foreign Us~ 6-cyli~ automatic    petr~ Whine  2009
## 3 Lexus ES~   120412 Lagos    Locally us~ 6-cyli~ automatic    petr~ Silv~  2008
## 4 Mercedes~    67640 Lagos    Foreign Us~ 4-cyli~ automatic    petr~ Black  2013
## 5 Mercedes~    92440 Abuja    Foreign Us~ 4-cyli~ automatic    petr~ Black  2013
## 6 Mercedes~    39979 Abuja    Foreign Us~ 4-cyli~ automatic    petr~ Brown  2016
## # ... with 1 more variable: price_millions <dbl>

Now let’s start answering some interesting questions about the data.

Are there more locally used cars in the market?

Some cars are used abroad and then imported into the country, while others are bought new and used within the country. Let’s see the percentage of locally used and foreign used cars.

car_data %>% 
  group_by(isimported) %>% 
  summarise(count = n()) %>% 
  mutate(percentage = (count/sum(count))*100) %>% 
  ggplot(aes(x = isimported, 
             y = percentage, 
             fill = isimported)) +
  geom_col() +
  labs(x = "Cars",
       y = "Percentage",
       title = "Percentage of the type of used car",
       fill = "Type of Use")  # legend title for the fill aesthetic

Wow, we also have new cars in the market, but there don’t seem to be many of them. It also seems that foreign used cars outnumber locally used cars in the market.
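The percentages behind the bar chart can also be inspected as a plain table. Here is a minimal sketch of the same calculation, assuming a small made-up tibble called `toy` in place of `car_data` (the rows are invented for illustration and are not taken from the real data set):

```r
library(dplyr)

# Toy stand-in for car_data; these four rows are invented for illustration only
toy <- tibble(
  isimported = c("Foreign Used", "Foreign Used", "Locally used", "New")
)

shares <- toy %>%
  count(isimported, name = "count") %>%          # shorthand for group_by() + summarise(n())
  mutate(percentage = count / sum(count) * 100)  # share of each type, in percent

shares
```

`count()` is simply a shorter way of writing the `group_by()` and `summarise(count = n())` steps, so either form should give the same percentages.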

Prices by type of used car

We now know the kinds of cars sold in the market. Let’s compare their prices and see which type of car is the most expensive.

car_data %>% 
    ggplot(aes(x = isimported,
               y = price_millions,
               colour = isimported))+
  geom_boxplot()+
  labs(x = "Cars",
       y = "Price in millions",
       title = "Prices of various types of car",
       colour = "Type of Use")  # the plot maps colour, not fill

New cars are very expensive: an average new car costs even more than an expensive foreign or locally used car. Some locally and foreign used cars also carry very high prices. These are outliers, extreme price values, especially among locally used cars, and they may well be expensive luxury cars. Let’s take a look at the cars priced above 100 million naira.

car_data %>% 
  filter(price_millions > 100)
## # A tibble: 6 x 10
##   title     odometer location isimported  engine  transmission fuel  paint  year
##   <chr>        <dbl> <chr>    <chr>       <chr>   <chr>        <chr> <chr> <dbl>
## 1 Land Rov~    13687 Lagos    Foreign Us~ 8-cyli~ automatic    petr~ Green  2019
## 2 Mercedes~       20 Lagos    New         8-cyli~ automatic    petr~ Black  2019
## 3 Mercedes~     6758 Lagos    New         12-cyl~ automatic    petr~ Black  2019
## 4 Land Rov~    18720 Lagos    Foreign Us~ 8-cyli~ automatic    petr~ Grey   2019
## 5 Lexus LX~    55530 Abuja    Foreign Us~ 8-cyli~ automatic    petr~ Black  2014
## 6 Rolls-Ro~    16069 Lagos    Locally us~ 4-cyli~ automatic    petr~ Black  2011
## # ... with 1 more variable: price_millions <dbl>

We guessed right, they are luxury cars! I wish I could be the owner of the Rolls-Royce Ghost (lol). Now let’s take a closer look at foreign and locally used cars below 25 million naira, since the majority of them fall in this range.

car_data %>% 
  filter(isimported != "New",
         price_millions < 25) %>% 
    ggplot(aes(x = isimported,
               y = price_millions,
               colour = isimported))+
  geom_boxplot() +
  labs(x = "Cars",
       y = "Price in millions",
       title = "Prices of used cars below 25 million naira",
       colour = "Type of Use")  # the plot maps colour, not fill

Most foreign used cars are more expensive than locally used cars, and an average foreign used car costs more than an expensive locally used car.
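A comparison read off a box plot can be checked numerically by computing the median and the upper quartile of price for each type of use. This is a minimal sketch, assuming a small made-up tibble called `toy` in place of `car_data` (the prices are invented for illustration, so the numbers say nothing about the real market):

```r
library(dplyr)

# Toy stand-in for car_data; prices are invented for illustration only
toy <- tibble(
  isimported     = c("Foreign Used", "Foreign Used", "Foreign Used",
                     "Locally used", "Locally used", "Locally used"),
  price_millions = c(5, 7, 9, 2, 3, 4)
)

price_summary <- toy %>%
  group_by(isimported) %>%
  summarise(
    median_price = median(price_millions),                         # the "average" car
    q75_price    = quantile(price_millions, 0.75, names = FALSE),  # an "expensive" car
    .groups = "drop"
  )

price_summary
```

If the median of one group is larger than the 75th percentile of the other, the "average versus expensive" statement holds for that data. Running the same summary on `car_data` (filtered to prices below 25 million) would show whether it holds here.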

Prices of cars in various locations

Let’s see how many locations we have.

car_data %>% 
  group_by(location) %>% 
  summarize(count = n()) %>% 
  arrange(desc(count))
## # A tibble: 13 x 2
##    location         count
##    <chr>            <int>
##  1 Lagos             1159
##  2 Abuja              216
##  3 Ogun                34
##  4 Lagos State         21
##  5 other                5
##  6 Abia                 2
##  7 FCT                  2
##  8 Ogun State           2
##  9 Abia State           1
## 10 Accra                1
## 11 Adamawa              1
## 12 Arepo ogun state     1
## 13 Mushin               1

There is a problem: we have both “Lagos State” and “Lagos” instead of only Lagos. Let’s fix this and also limit the locations to Lagos and Abuja only.

car_data %>% 
  mutate(location = replace(location, 
                            location == "Lagos State",
                            "Lagos")) %>% 
  group_by(location) %>% 
  summarize(count = n()) %>% 
  arrange(desc(count)) %>%
  # %in% tests membership; == would recycle c("Lagos", "Abuja") row by row
  filter(location %in% c("Lagos", "Abuja"))
## # A tibble: 2 x 2
##   location count
##   <chr>    <int>
## 1 Lagos     1180
## 2 Abuja      216
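When filtering on more than one value, it is worth knowing the difference between `==` and `%in%`. Comparing a column to a vector with `==` recycles the vector along the rows, so each row is tested against only one of the values, while `%in%` tests every row against all of them. Here is a minimal sketch of the difference, assuming a small made-up tibble called `toy` (the rows are invented for illustration):

```r
library(dplyr)

# Toy stand-in for the location column; rows are invented for illustration only
toy <- tibble(
  location = c("Lagos", "Lagos", "Abuja", "Abuja", "Ogun", "Lagos")
)

# `==` recycles c("Lagos", "Abuja"): row 1 is compared with "Lagos",
# row 2 with "Abuja", row 3 with "Lagos" again, and so on
by_equals <- toy %>% 
  filter(location == c("Lagos", "Abuja"))

# `%in%` keeps every row whose location is either "Lagos" or "Abuja"
by_in <- toy %>% 
  filter(location %in% c("Lagos", "Abuja"))

nrow(by_equals)
nrow(by_in)
```

On a summary table that is already sorted with Lagos first and Abuja second, the `==` form happens to return the expected rows, but on row-level data it silently drops rows whose position does not line up with the recycled vector.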

Now let us look at the prices of cars in the above locations.

car_data %>% 
  mutate(location = replace(location, 
                            location == "Lagos State",
                            "Lagos")) %>% 
  filter(location %in% c("Lagos", "Abuja")) %>% 
  ggplot(aes(x = location,
             y = price_millions,
             colour = location))+
  geom_boxplot() +
  xlab("Location")+
  ylab("Price in Million")+
  ggtitle("Car price in various locations")

Well, we can’t say much from this plot, but cars with a higher price tag seem to be more common in Lagos than in Abuja. Let us restrict the visualization to cars priced at or below 25 million naira, since the box plot above shows that most of our data lies below 25 million naira.

car_data %>% 
  mutate(location = replace(location, 
                            location == "Lagos State",
                            "Lagos")) %>% 
  filter(location %in% c("Lagos", "Abuja"),
         price_millions <= 25) %>% 
  ggplot(aes(x = location,
             y = price_millions,
             colour = location))+
  geom_boxplot() +
  xlab("Location")+
  ylab("Price in Million")+
  ggtitle("Car price in various locations")

A better conclusion is that the majority of cars in both locations are priced below 9 million naira. Also, cars in Lagos tend to have a higher price tag than cars in Abuja.

Conclusion

Well, that is all for this post. There are many more questions that can be answered from this data set, such as how the price of a car changes with its year. The ones covered here are only a few.
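As a possible starting point for the yearly question, the median price can be computed for each model year. This is only a sketch, assuming a small made-up tibble called `toy` with the same `year` and `price_millions` columns as `car_data` (the values are invented for illustration, not results from the real data):

```r
library(dplyr)

# Toy stand-in for car_data; years and prices are invented for illustration only
toy <- tibble(
  year           = c(2008, 2008, 2013, 2013, 2016),
  price_millions = c(2, 3, 6, 9, 19)
)

yearly <- toy %>%
  group_by(year) %>%
  summarise(
    median_price = median(price_millions),  # typical price for that model year
    n_cars       = n(),                     # how many cars support the estimate
    .groups = "drop"
  ) %>%
  arrange(year)

yearly
```

Keeping `n_cars` next to the median helps to spot years with only a handful of cars, where the median is not very reliable.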