library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.0 ✔ purrr 1.0.0
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.5.0
## ✔ readr 2.1.3 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(ggplot2)
First, we look at the dataset, its variables and observations, and the data type of each column, to get an overview of what we can analyse with it. Since this dataset ships with R's ggplot2 package and is already tidy, it needs no further cleaning before the analysis.
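The claim that no cleaning is needed can be checked quickly. The following is a minimal sketch, assuming that "clean" here means "no missing values"; the name n_missing is only chosen for illustration.

```r
library(ggplot2)  # provides the mpg dataset
library(dplyr)    # provides glimpse()

# Column names, data types and the first values of every variable
glimpse(mpg)

# Number of missing values per column; all zeros would support skipping the cleaning step
n_missing <- colSums(is.na(mpg))
n_missing
```

If any column showed a non-zero count, a cleaning step would have to come before the analysis.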
head(mpg)
## # A tibble: 6 × 11
## manufacturer model displ year cyl trans drv cty hwy fl class
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr>
## 1 audi a4 1.8 1999 4 auto(l5) f 18 29 p compa…
## 2 audi a4 1.8 1999 4 manual(m5) f 21 29 p compa…
## 3 audi a4 2 2008 4 manual(m6) f 20 31 p compa…
## 4 audi a4 2 2008 4 auto(av) f 21 30 p compa…
## 5 audi a4 2.8 1999 6 auto(l5) f 16 26 p compa…
## 6 audi a4 2.8 1999 6 manual(m5) f 18 26 p compa…
After inspecting the dataset, I decided to focus my analysis on three existing variables: manufacturer (car manufacturer name), hwy (highway miles per gallon) and cty (city miles per gallon). With these three variables I can set a goal for this analysis: to find out which car manufacturer produces, on average, the most fuel-efficient cars.
To work towards this goal, we first get an overview of the number of entries per manufacturer:
table_count <- table(mpg$manufacturer)
table_count
##
## audi chevrolet dodge ford honda hyundai jeep
## 18 19 37 25 9 14 8
## land rover lincoln mercury nissan pontiac subaru toyota
## 4 3 4 13 5 14 34
## volkswagen
## 27
This table gives us a good overview of the number of observations per manufacturer. Still, to get a better impression at a glance, I will plot the number of cars per manufacturer, so that we can easily compare the manufacturers.
df_count <- as.data.frame(table_count)
count_plot <- ggplot(df_count, aes(x = Var1, y = Freq)) +
geom_col() + scale_x_discrete(guide = guide_axis(angle = 90)) +
labs(title = "Amount of Cars in Dataset per Manufacturer", x = "Manufacturer", y = "Amount of Cars")
count_plot
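The same count and plot could also be produced with dplyr alone, without the detour through table() and as.data.frame(). This is only a sketch of an alternative; the names manufacturer_counts and sorted_count_plot are hypothetical, and sorting the bars by count is my own choice.

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

# One row per manufacturer with its number of observations, largest first
manufacturer_counts <- mpg %>%
  count(manufacturer, sort = TRUE)

# Horizontal bars sorted by count, so no rotated axis labels are needed
sorted_count_plot <- ggplot(manufacturer_counts,
                            aes(x = n, y = reorder(manufacturer, n))) +
  geom_col() +
  labs(title = "Number of Cars in Dataset per Manufacturer",
       x = "Number of Cars", y = "Manufacturer")
sorted_count_plot
```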
It is important that we counted and plotted the observations per manufacturer: if some manufacturers show a narrower range in fuel efficiency later in the analysis, we can trace this back to a smaller pool of observations. For example, the “mean plot” (further along) shows that the manufacturers lincoln and mercury have an especially narrow range of average miles per gallon. The table above suggests that this is not necessarily because these cars have a similar fuel consumption, but potentially because there are only 3 observations for the manufacturer lincoln and 4 for the manufacturer mercury.
On the other hand, we can expect a more reliable picture of a manufacturer's fuel efficiency when there is a large set of observations (e.g. for dodge, toyota, volkswagen and ford).
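To make the link between sample size and range concrete, the number of observations can be put next to the spread of the combined mileage. This is a rough sketch; spread_by_manufacturer and its column names are hypothetical, and the combined mileage is computed here as the mean of hwy and cty.

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

spread_by_manufacturer <- mpg %>%
  mutate(mean_fuel = (hwy + cty) / 2) %>%   # combined mileage per car
  group_by(manufacturer) %>%
  summarise(
    n         = n(),                        # observations per manufacturer
    min_mpg   = min(mean_fuel),
    max_mpg   = max(mean_fuel),
    range_mpg = max_mpg - min_mpg           # width of the observed range
  ) %>%
  arrange(n)                                # smallest samples first
spread_by_manufacturer
```

A narrow range in a row with very small n should be read with caution, since a few cars cannot show much variation.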
To answer the question of the analysis (which car manufacturer produces, on average, the most fuel-efficient cars), we focus on three main variables: manufacturer, cty, and hwy.
One way to analyse the overall fuel efficiency is to add another column to the dataset that holds, for each car, the mean of its city and highway miles per gallon:
# we are naming this dataset fuel_mpg because it focuses on the fuel efficiency analysis:
fuel_mpg <- mpg %>% mutate(mean_fuel = (hwy + cty)/2)
head(fuel_mpg)
## # A tibble: 6 × 12
## manufact…¹ model displ year cyl trans drv cty hwy fl class mean_…²
## <chr> <chr> <dbl> <int> <int> <chr> <chr> <int> <int> <chr> <chr> <dbl>
## 1 audi a4 1.8 1999 4 auto… f 18 29 p comp… 23.5
## 2 audi a4 1.8 1999 4 manu… f 21 29 p comp… 25
## 3 audi a4 2 2008 4 manu… f 20 31 p comp… 25.5
## 4 audi a4 2 2008 4 auto… f 21 30 p comp… 25.5
## 5 audi a4 2.8 1999 6 auto… f 16 26 p comp… 21
## 6 audi a4 2.8 1999 6 manu… f 18 26 p comp… 22
## # … with abbreviated variable names ¹manufacturer, ²mean_fuel
mean_plot <- ggplot(fuel_mpg, aes(mean_fuel, manufacturer)) +
  geom_boxplot() +
  labs(title = "Average Miles per Gallon depending on the Car Manufacturer", x = "Average Miles per Gallon", y = "Manufacturer")
mean_plot
This graph shows us the following things:
This graph already gives a good indication of which car manufacturer is likely to produce the most fuel-efficient cars. Still, the box plot shows the median fuel efficiency of each manufacturer, not the mean. So, to get accurate results, I will create a graph with the mean fuel efficiency overall, and the mean city and highway mileage.
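The gap between median and mean can also be shown directly in the box plot. The following is a sketch under my own assumptions: the names fuel_mpg_sketch, mean_vs_median and box_with_means are hypothetical, and the red cross marking the mean is an arbitrary styling choice.

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

fuel_mpg_sketch <- mpg %>%
  mutate(mean_fuel = (hwy + cty) / 2)

# Median (the line inside each box) next to the mean, per manufacturer
mean_vs_median <- fuel_mpg_sketch %>%
  group_by(manufacturer) %>%
  summarise(median_mpg = median(mean_fuel),
            mean_mpg   = mean(mean_fuel))
mean_vs_median

# Same box plot, with the mean of each manufacturer overlaid as a red cross
box_with_means <- ggplot(fuel_mpg_sketch, aes(mean_fuel, manufacturer)) +
  geom_boxplot() +
  stat_summary(fun = mean, geom = "point", shape = 4, colour = "red",
               orientation = "y") +
  labs(x = "Average Miles per Gallon", y = "Manufacturer")
box_with_means
```

Where the cross sits far from the median line, a few unusually efficient or inefficient cars are pulling the mean.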
# this dataframe contains the average value of the mean miles per gallon
mean_per_manufacturer <- fuel_mpg %>% group_by(manufacturer) %>% summarise(mean_per_manufacturer = mean(mean_fuel)) %>% arrange(desc(mean_per_manufacturer))
mean_per_manufacturer
## # A tibble: 15 × 2
## manufacturer mean_per_manufacturer
## <chr> <dbl>
## 1 honda 28.5
## 2 volkswagen 25.1
## 3 hyundai 22.8
## 4 subaru 22.4
## 5 audi 22.0
## 6 toyota 21.7
## 7 pontiac 21.7
## 8 nissan 21.3
## 9 chevrolet 18.4
## 10 ford 16.7
## 11 mercury 15.6
## 12 jeep 15.6
## 13 dodge 15.5
## 14 lincoln 14.2
## 15 land rover 14
# this dataframe contains the average value of the city miles per gallon
mean_per_manufacturer_cty <- fuel_mpg %>% group_by(manufacturer) %>% summarise(mean_per_manufacturer = mean(cty)) %>% arrange(desc(mean_per_manufacturer))
mean_per_manufacturer_cty
## # A tibble: 15 × 2
## manufacturer mean_per_manufacturer
## <chr> <dbl>
## 1 honda 24.4
## 2 volkswagen 20.9
## 3 subaru 19.3
## 4 hyundai 18.6
## 5 toyota 18.5
## 6 nissan 18.1
## 7 audi 17.6
## 8 pontiac 17
## 9 chevrolet 15
## 10 ford 14
## 11 jeep 13.5
## 12 mercury 13.2
## 13 dodge 13.1
## 14 land rover 11.5
## 15 lincoln 11.3
# this dataframe contains the average value of the highway miles per gallon
mean_per_manufacturer_hwy <- fuel_mpg %>% group_by(manufacturer) %>% summarise(mean_per_manufacturer = mean(hwy)) %>% arrange(desc(mean_per_manufacturer))
mean_per_manufacturer_hwy
## # A tibble: 15 × 2
## manufacturer mean_per_manufacturer
## <chr> <dbl>
## 1 honda 32.6
## 2 volkswagen 29.2
## 3 hyundai 26.9
## 4 audi 26.4
## 5 pontiac 26.4
## 6 subaru 25.6
## 7 toyota 24.9
## 8 nissan 24.6
## 9 chevrolet 21.9
## 10 ford 19.4
## 11 mercury 18
## 12 dodge 17.9
## 13 jeep 17.6
## 14 lincoln 17
## 15 land rover 16.5
df_plot <- mean_per_manufacturer_hwy %>% mutate(Type = "highway miles per gallon") %>% bind_rows(mean_per_manufacturer_cty %>% mutate(Type = "city miles per gallon")) %>% bind_rows(mean_per_manufacturer %>% mutate(Type = "average miles per gallon"))
# I decided to rearrange the manufacturers on the graph not in alphabetical order as done in the graphs before, but in descending average mileage per gallon, to make the results more visible.
cty_hwy <- ggplot(df_plot, aes( x = reorder(manufacturer, -mean_per_manufacturer), y = mean_per_manufacturer, color = Type)) +
geom_point() + scale_x_discrete(guide = guide_axis(angle = 90)) +
labs(title = "Average Miles per Gallon depending on the Car Manufacturer", subtitle = "This Graph shows three points per manufacturer: the average-, city- and highway miles per gallon", x = "Manufacturer ", y = "Average Miles per Gallon")
cty_hwy
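The three summary tables and the combined data frame could also be built in a single pipeline. This is a sketch of an alternative, not a replacement for the code above; the names means_long and mean_mpg are hypothetical.

```r
library(dplyr)
library(tidyr)
library(ggplot2)  # provides the mpg dataset

means_long <- mpg %>%
  mutate(mean_fuel = (hwy + cty) / 2) %>%
  group_by(manufacturer) %>%
  # Mean of all three mileage columns in one step
  summarise(across(c(cty, hwy, mean_fuel), mean)) %>%
  # One row per manufacturer and mileage type, as needed for the colour aesthetic
  pivot_longer(cols = c(cty, hwy, mean_fuel),
               names_to = "Type", values_to = "mean_mpg") %>%
  mutate(Type = recode(Type,
                       cty       = "city miles per gallon",
                       hwy       = "highway miles per gallon",
                       mean_fuel = "average miles per gallon"))
means_long
```

The result has the same long shape as df_plot (one row per manufacturer and type), so it could be passed to the same ggplot call after adjusting the column name.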
This graph shows us the most fuel-efficient car manufacturer, which is Honda. Each of its three averages (city, highway and combined miles per gallon) is higher than that of any other manufacturer, which means that on average this manufacturer's cars in the dataset have very good miles per gallon values.
We can see a trend in this dataset: the car manufacturers from the UK and the US generally show low miles per gallon values, which means that their cars are less fuel efficient than those of the Asian and German manufacturers.
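The regional trend can be checked with a small lookup table. The dataset itself contains no region column, so the mapping below is my own assumption, based on where each brand is headquartered; region_lookup and mpg_by_region are hypothetical names.

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

# Assumed mapping from manufacturer to region (not part of the mpg dataset)
region_lookup <- c(
  audi = "Germany", volkswagen = "Germany",
  honda = "Asia", hyundai = "Asia", nissan = "Asia",
  subaru = "Asia", toyota = "Asia",
  chevrolet = "US", dodge = "US", ford = "US", jeep = "US",
  lincoln = "US", mercury = "US", pontiac = "US",
  `land rover` = "UK"
)

mpg_by_region <- mpg %>%
  mutate(mean_fuel = (hwy + cty) / 2,
         region    = unname(region_lookup[manufacturer])) %>%
  group_by(region) %>%
  summarise(n = n(), mean_mpg = mean(mean_fuel)) %>%
  arrange(desc(mean_mpg))
mpg_by_region
```

The n column matters here as well: the UK is represented only by the 4 land rover observations.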
Although the goal of the analysis is met, i.e. I know which car manufacturer produces, on average, the most fuel-efficient cars, it would be good to know which factors contribute to the miles per gallon values. One way to do this is to fit a multiple linear regression (MLR), which estimates the association between each independent variable and the dependent variable while holding the other independent variables constant.
For this, I have created two MLR models: one models city miles per gallon as a function of engine displacement in liters, year of manufacture and number of cylinders, and the other models highway miles per gallon as a function of the same three variables.
I have refrained from using hwy and mean_fuel as independent variables for cty (and cty and mean_fuel for hwy), because they are bound to be closely correlated.
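How closely the two mileage variables move together can be checked before fitting the models. A minimal sketch, with cty_hwy_cor as a hypothetical name:

```r
library(ggplot2)  # provides the mpg dataset

# Pearson correlation between city and highway miles per gallon
cty_hwy_cor <- cor(mpg$cty, mpg$hwy)
cty_hwy_cor
```

A value close to 1 would confirm that one mileage variable adds little information for explaining the other.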
mlr_model_cty <- lm(cty ~ displ + year + cyl, data = fuel_mpg)
summary(mlr_model_cty)
##
## Call:
## lm(formula = cty ~ displ + year + cyl, data = fuel_mpg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.2614 -1.4456 -0.2509 1.0013 14.1903
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -114.30388 72.15026 -1.584 0.114511
## displ -1.26087 0.34015 -3.707 0.000263 ***
## year 0.07121 0.03603 1.976 0.049303 *
## cyl -1.21204 0.27174 -4.460 1.28e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.451 on 230 degrees of freedom
## Multiple R-squared: 0.6726, Adjusted R-squared: 0.6684
## F-statistic: 157.5 on 3 and 230 DF, p-value: < 2.2e-16
We can see that if the number of cylinders increases by one, the city miles per gallon decreases on average by 1.21 miles per gallon (holding all else constant). Similarly, if we increase the displacement by 1 liter, the city miles per gallon decreases on average by 1.26 miles per gallon (holding all else constant). Both of these variables are statistically significant (p < 0.001).
This means that displacement and cylinders have a negative relationship with city miles per gallon.
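What "holding all else constant" means can be illustrated with two made-up cars that differ only in the number of cylinders. This is a sketch; the values in hypothetical_cars are invented for illustration and do not describe real cars from the dataset.

```r
library(ggplot2)  # provides the mpg dataset

# Same model as above; mpg has the same displ, year, cyl and cty columns as fuel_mpg
mlr_model_cty <- lm(cty ~ displ + year + cyl, data = mpg)

# Two hypothetical cars: identical displacement and year, one cylinder apart
hypothetical_cars <- data.frame(
  displ = c(2.0, 2.0),
  year  = c(2008, 2008),
  cyl   = c(4, 5)
)

predicted_cty <- predict(mlr_model_cty, newdata = hypothetical_cars)

# Difference in predicted city mileage; equals the cyl coefficient (about -1.21)
cyl_effect <- unname(predicted_cty[2] - predicted_cty[1])
cyl_effect
```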
mlr_model_hwy <- lm(hwy ~ displ + year + cyl, data = fuel_mpg)
summary(mlr_model_hwy)
##
## Call:
## lm(formula = hwy ~ displ + year + cyl, data = fuel_mpg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.3225 -2.2018 0.0091 1.9684 15.4732
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -259.12213 109.15941 -2.374 0.01843 *
## displ -2.09122 0.51462 -4.064 6.63e-05 ***
## year 0.14850 0.05451 2.724 0.00694 **
## cyl -1.30653 0.41112 -3.178 0.00169 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.708 on 230 degrees of freedom
## Multiple R-squared: 0.6172, Adjusted R-squared: 0.6122
## F-statistic: 123.6 on 3 and 230 DF, p-value: < 2.2e-16
Here, we can see that if the number of cylinders increases by one, the highway miles per gallon decreases on average by 1.31 miles per gallon (holding all else constant). If we increase the displacement by 1 liter, the highway miles per gallon decreases on average by 2.09 miles per gallon (holding all else constant). Both of these variables are statistically significant (p < 0.01).
This means that displacement and cylinders also have a negative relationship with highway miles per gallon. Compared to the city model, the estimated effect of displacement is larger (-2.09 versus -1.26 miles per gallon per liter), while the estimated effect of the number of cylinders is only slightly larger (-1.31 versus -1.21 miles per gallon per cylinder).
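To compare the two models more carefully, the estimates can be placed side by side together with their standard errors. This is a sketch; model_comparison and its column names are hypothetical.

```r
library(ggplot2)  # provides the mpg dataset

mlr_model_cty <- lm(cty ~ displ + year + cyl, data = mpg)
mlr_model_hwy <- lm(hwy ~ displ + year + cyl, data = mpg)

# Coefficient tables as printed by summary(): estimate, standard error, t and p value
cty_table <- coef(summary(mlr_model_cty))
hwy_table <- coef(summary(mlr_model_hwy))

model_comparison <- data.frame(
  term         = rownames(cty_table),
  cty_estimate = unname(cty_table[, "Estimate"]),
  cty_se       = unname(cty_table[, "Std. Error"]),
  hwy_estimate = unname(hwy_table[, "Estimate"]),
  hwy_se       = unname(hwy_table[, "Std. Error"])
)
model_comparison

# 95% confidence intervals for the highway model
confint(mlr_model_hwy, level = 0.95)
```

If the difference between two estimates is small relative to their standard errors, as it appears to be for cyl, it should not be over-interpreted.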