R for Data Science Day 2: Data Viz Exercises

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.2.1     ✓ purrr   0.3.3
## ✓ tibble  2.1.3     ✓ dplyr   0.8.3
## ✓ tidyr   1.0.0     ✓ stringr 1.4.0
## ✓ readr   1.3.1     ✓ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
# devtools::install_github("thomasp85/patchwork")
library(patchwork)

Exercises 3.2.4

  1. Run ggplot(data = mpg). What do you see?
ggplot(data = mpg)

An empty plot background and nothing else: the data has been attached to the plot, but we haven’t added a geom layer to draw it.

  2. How many rows are in mpg? How many columns?
nrow(mpg)
## [1] 234
ncol(mpg)
## [1] 11
dim(mpg)
## [1] 234  11
  3. What does the drv variable describe? Read the help for ?mpg to find out.
?mpg
  • f = front-wheel drive, r = rear wheel drive, 4 = 4wd

  4. Make a scatterplot of hwy vs cyl.

# set all ggplot figures to use the classic theme
theme_set(theme_classic())
mpg %>%
  ggplot(aes(x = cyl, y = hwy)) + geom_point()

# use transparency and jitter to make the points separate better
mpg %>%
  ggplot(aes(x = cyl, y = hwy)) + geom_jitter(width = 0.4, alpha = 0.5)

  5. What happens if you make a scatterplot of class vs drv? Why is the plot not useful?
theme_set(theme_minimal())
mpg %>%
  ggplot(aes(x = class, y = drv)) + geom_point()

Because both variables are categorical, every car falls on one of a few grid positions and many points are drawn on top of each other, so the plot cannot show how many cars share each combination. However, I’d argue that it’s not completely useless, as it does show that all 2seater cars have rear-wheel drive, while all minivans have front-wheel drive.

The barplot below is probably a better visualisation, as it also shows the counts: we may not have sampled enough 2seater vehicles to tell whether any of them could have front-wheel drive. Having said that, I do like the dot-plot visualisation as well.

theme_set(theme_minimal())
mpg %>%
  ggplot(aes(x = class, fill = drv)) + geom_bar()
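
To see how much the scatterplot hides, it can help to tabulate the number of cars behind each point. This is a minimal sketch that is not part of the original exercise; the name combos is just an illustrative choice, and geom_count() is one possible alternative to the barplot.

```r
library(ggplot2)  # provides the mpg dataset
library(dplyr)

# Count the cars behind each point of the class-vs-drv scatterplot
combos <- mpg %>% count(class, drv)
combos

# geom_count() draws the same information, sizing each point by its count
ggplot(mpg, aes(x = class, y = drv)) + geom_count()
```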


Exercises 3.3.1

  1. What’s gone wrong with this code? Why are the points not blue?
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))

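Because color has been set inside aes(), ggplot treats the string blue as a data value rather than as a colour: it maps the colour aesthetic to a one-level categorical variable and picks the colour from its default palette. To fix, set color outside aes():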

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy),color = "blue")
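
One way to check the difference without looking at the plots is to inspect the values ggplot2 actually intends to draw. This is only a sketch: the names mapped and fixed are my own, and layer_data() is used here purely as an inspection helper.

```r
library(ggplot2)

# Mapped inside aes(): "blue" is treated as data, so the default colour scale applies
mapped <- ggplot(mpg) + geom_point(aes(x = displ, y = hwy, colour = "blue"))

# Set outside aes(): "blue" is used directly as the point colour
fixed <- ggplot(mpg) + geom_point(aes(x = displ, y = hwy), colour = "blue")

# layer_data() returns the values ggplot2 will draw for a layer
unique(layer_data(mapped)$colour)
unique(layer_data(fixed)$colour)
```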

  2. Which variables in mpg are categorical? Which variables are continuous? (Hint: type ?mpg to read the documentation for the dataset). How can you see this information when you run mpg?
glimpse(mpg)
## Observations: 234
## Variables: 11
## $ manufacturer <chr> "audi", "audi", "audi", "audi", "audi", "audi", "audi", …
## $ model        <chr> "a4", "a4", "a4", "a4", "a4", "a4", "a4", "a4 quattro", …
## $ displ        <dbl> 1.8, 1.8, 2.0, 2.0, 2.8, 2.8, 3.1, 1.8, 1.8, 2.0, 2.0, 2…
## $ year         <int> 1999, 1999, 2008, 2008, 1999, 1999, 2008, 1999, 1999, 20…
## $ cyl          <int> 4, 4, 4, 4, 6, 6, 6, 4, 4, 4, 4, 6, 6, 6, 6, 6, 6, 8, 8,…
## $ trans        <chr> "auto(l5)", "manual(m5)", "manual(m6)", "auto(av)", "aut…
## $ drv          <chr> "f", "f", "f", "f", "f", "f", "f", "4", "4", "4", "4", "…
## $ cty          <int> 18, 21, 20, 21, 16, 18, 18, 18, 16, 20, 19, 15, 17, 17, …
## $ hwy          <int> 29, 29, 31, 30, 26, 26, 27, 26, 25, 28, 27, 25, 25, 25, …
## $ fl           <chr> "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "p", "…
## $ class        <chr> "compact", "compact", "compact", "compact", "compact", "…

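We can use the glimpse() command (or simply print mpg), which will show us the type of each variable. Those that are ‘chr’ (character) are categorical, whereas those that are ‘int’ (integer) or ‘dbl’ (double) are numeric. Of the numeric variables, displ, cty and hwy are continuous, while year and cyl take only a few distinct values and can also be treated as categorical.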
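
If we want the types as a plain vector rather than reading them off the glimpse() output, something like the following should work. It is a small sketch, with col_types as a made-up name, and it assumes every column of mpg has a single class (which is the case here).

```r
library(ggplot2)  # provides the mpg dataset
library(purrr)

# Class of every column in mpg, as a named character vector
col_types <- map_chr(mpg, class)
col_types

# Names of the character (categorical) columns
names(col_types)[col_types == "character"]
```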

  3. Map a continuous variable to color, size, and shape. How do these aesthetics behave differently for categorical vs. continuous variables?
mpg %>% ggplot(aes(x = displ, y = hwy, col = hwy)) + geom_point() 

For a continuous variable, colour is used to represent a gradient.

mpg %>% ggplot(aes(x = displ, y = hwy, size = hwy)) + geom_point()

Points get bigger as the values get bigger.

# mpg %>% ggplot(aes(x = displ, y = hwy, shape = hwy)) + geom_point()

Shape gives an error: a continuous variable cannot be mapped to shape, because shapes have no natural order.

mpg %>% ggplot(aes(x = displ, y = hwy, col = manufacturer)) + geom_point() 

Each level of the categorical variable gets its own colour.

mpg %>% ggplot(aes(x = displ, y = hwy, size = manufacturer)) + geom_point() 
## Warning: Using size for a discrete variable is not advised.

Size is not advised, but still works.

# mpg %>% ggplot(aes(x = displ, y = hwy, shape = manufacturer)) + geom_point()
# By default this gives a warning and drops points, since the default shape palette only has 6 shapes
mpg %>% ggplot(aes(x = displ, y = hwy, shape = manufacturer)) + geom_point() + scale_shape_manual(values = seq_along(unique(mpg$manufacturer)))

The default shape scale only handles up to 6 levels, but scale_shape_manual() can be used to supply more than 6 shapes.
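
A quick check of why the default palette falls short here: count the distinct manufacturers and compare against the six available shapes. This is just a sketch, and n_makers is a name I made up.

```r
library(ggplot2)  # provides the mpg dataset
library(dplyr)

# Number of distinct manufacturers that would each need their own shape
n_makers <- n_distinct(mpg$manufacturer)
n_makers

# TRUE when the default six-shape palette is too small
n_makers > 6
```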

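  4. What happens if you map the same variable to multiple aesthetics?

It gets mapped to each of them! The aesthetics then encode the same information redundantly, and ggplot merges them into a single legend.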

mpg %>% ggplot(aes(x = displ, y = hwy, shape = manufacturer, col=manufacturer)) + geom_point() + scale_shape_manual(values=1:length(unique(mpg$manufacturer)))

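  5. What does the stroke aesthetic do? What shapes does it work with? (Hint: use ?geom_point)
mpg %>% ggplot(aes(x = displ, y = hwy, stroke = displ)) + geom_point()

It controls the width of a point’s border, so the stroke gets thicker as the values of the variable get larger. It is most useful with shapes 21 to 25, which have a border (colour) that is separate from their interior (fill).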
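
Stroke is perhaps easier to see when it is set to a constant on a filled shape. A minimal sketch, assuming shape 21 (a circle with separate border and fill); p_stroke is a hypothetical name.

```r
library(ggplot2)

# Shape 21 is a filled circle with a separate border:
# stroke sets the border width, fill the interior, colour the border
p_stroke <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(shape = 21, fill = "white", colour = "black", stroke = 2, size = 3)
p_stroke
```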

  6. What happens if you map an aesthetic to something other than a variable name, like aes(colour = displ < 5)? Note, you’ll also need to specify x and y.
mpg %>% ggplot(aes(x = displ, y = hwy, colour = displ < 5)) + geom_point()

The expression is evaluated for every row, and the resulting logical variable (TRUE or FALSE for displ < 5) is mapped to colour.


Exercises 3.5.1 Facets

  1. What happens if you facet on a continuous variable?
mpg %>% ggplot(aes(x = displ, y = hwy)) + geom_point() + facet_grid(~hwy)

It treats it as categorical and draws one panel for every distinct value, so we get a long row of panels that are too narrow to read.

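  2. What do the empty cells in plot with facet_grid(drv ~ cyl) mean? How do they relate to this plot?
ggplot(data = mpg) + geom_point(mapping = aes(x = drv, y = cyl))

The empty cells are combinations of drv and cyl with no cars: there are no rear-wheel drive cars with 4 or 5 cylinders, and no 4wd cars with 5 cylinders. They correspond to the positions without a point in this scatterplot. (No car has 7 cylinders either, but since 7 is not a value of cyl it gets no facet at all.)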
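
The empty facets can also be listed directly by comparing the combinations that occur against all possible ones. This is a sketch of one way to do it; observed, all_combos and empty are names I picked for illustration.

```r
library(ggplot2)  # provides the mpg dataset
library(dplyr)

# One row per drv/cyl combination that actually occurs in mpg
observed <- mpg %>% count(drv, cyl)
observed

# All possible combinations, minus the observed ones, are the empty facets
all_combos <- expand.grid(drv = unique(mpg$drv), cyl = unique(mpg$cyl),
                          stringsAsFactors = FALSE)
empty <- anti_join(all_combos, observed, by = c("drv", "cyl"))
empty
```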

  3. What plots does the following code make? What does . do?
ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) + facet_grid(drv ~ .)

. says not to facet on that dimension: here the plot is split into rows by drv, with no split into columns.

  4. Take the first faceted plot in this section. What are the advantages to using faceting instead of the colour aesthetic? What are the disadvantages? How might the balance change if you had a larger dataset?

    ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) + facet_wrap(~ class, nrow = 2)

Faceting makes it cleaner to see the trend within each level of class, since the groups no longer overlap. The disadvantage is that the groups are harder to compare directly, as they no longer share a panel. If we had a larger dataset faceting would be more important, as overlaying all of the points would create a data blob instead of a meaningful visualisation.

  5. Read ?facet_wrap. What does nrow do? What does ncol do? What other options control the layout of the individual panels? Why doesn’t facet_grid() have nrow and ncol arguments?

nrow and ncol specify how many rows and columns we want our panels to be split into. facet_grid() doesn’t need them, as the numbers of rows and columns are determined by the number of levels of the variables we’re faceting by.

scales is very useful as it allows us to have free scales (i.e. different scales) for each of our individual plots.
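
As a rough illustration of the layout options, here is the same faceted plot with fixed and with free y scales. It is only a sketch; p_free is a made-up name, and free_y is just one of the scales options.

```r
library(ggplot2)

# Fixed scales (the default): every panel shares the same axes
ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ class, nrow = 2)

# Free y scales: each panel gets its own y-axis range
p_free <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ class, ncol = 4, scales = "free_y")
p_free
```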

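  6. When using facet_grid() you should usually put the variable with more unique levels in the columns. Why?

Probably because that makes better use of the available space: plots are usually wider than they are tall, so there is more room for extra columns than for extra rows.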

Exercises 3.6 Geometric objects

  1. What geom would you use to draw a line chart ? A boxplot ? A histogram? An area chart?
  • geom_line()
  • geom_boxplot()
  • geom_histogram()
  • geom_area()
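  2. Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions.
ggplot(data = mpg, mapping = aes(x = displ, y = hwy, color = drv)) + geom_point() + geom_smooth(se = FALSE)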
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

  3. What does show.legend = FALSE do? What happens if you remove it? Why do you think I used it earlier in the chapter?

It hides the legend for that geom layer. Note that if you want to hide the legend completely, you need to set it in every layer, so geom_point(show.legend = FALSE) and geom_smooth(show.legend = FALSE) for the plot above.

  4. What does the se argument to geom_smooth() do?

It specifies whether to draw the confidence interval around the smooth line (the shaded band, which is based on the standard error).

  5. Will these two graphs look different? Why/why not?
ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
      geom_point() + 
      geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot() + 
      geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) + 
      geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

No, they should be identical, because they specify the same data and the same x/y aesthetics for both geoms.
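
One way to convince ourselves is to compare the data ggplot2 computes for the point layer of each plot. This is a sketch rather than a full proof (it only checks the x and y positions of the first layer), and p_global and p_local are names of my own.

```r
library(ggplot2)

# Mappings declared once and inherited by both layers
p_global <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point() + geom_smooth()

# Same data and mappings repeated in each layer
p_local <- ggplot() +
  geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))

# Compare the positions drawn by the point layer (layer 1) of each plot
all.equal(layer_data(p_global, 1)[c("x", "y")],
          layer_data(p_local, 1)[c("x", "y")])
```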

  6. Recreate the R code necessary to generate the following graphs.

I first assign the six plots to variables, and then use the patchwork package to present them in one figure below:

one <- mpg %>% ggplot(aes(x = displ, y = hwy)) + geom_point() + geom_smooth(se = FALSE)
two <- mpg %>% ggplot(aes(x = displ, y = hwy)) + geom_point() + geom_smooth(aes(fill = drv), se = FALSE, show.legend = FALSE)
three <- mpg %>% ggplot(aes(x = displ, y = hwy, col = drv)) + geom_point() + geom_smooth(se = FALSE)
four <- mpg %>% ggplot() + geom_point(aes(x = displ, y = hwy, col = drv)) + geom_smooth(aes(x = displ, y = hwy), se = FALSE)
five <- mpg %>% ggplot(aes(x = displ, y = hwy, col = drv, linetype = drv)) + geom_point() + geom_smooth(se = FALSE)
six <- mpg %>% ggplot(aes(x = displ, y = hwy, fill = drv)) + geom_point(shape = 21, col = "white", stroke = 2, size = 3) + theme_gray()

one + two + three + four + five + six + plot_layout(ncol = 2)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Exercises 3.7 Statistical transformations

  1. What is the default geom associated with stat_summary()? How could you rewrite the previous plot to use that geom function instead of the stat function?

  • geom_pointrange
# original
ggplot(data = diamonds) +
  stat_summary(
    mapping = aes(x = cut, y = depth),
    fun.ymin = min,
    fun.ymax = max,
    fun.y = median
  )

#modified
ggplot(data = diamonds, aes(x = cut, y = depth)) + 
  geom_pointrange(stat = "summary",
    fun.ymin = min,
    fun.ymax = max,
    fun.y = median)

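  2. What does geom_col() do? How is it different to geom_bar()?

It draws bars whose heights are the values already in the data, so it is the equivalent of geom_bar(stat = "identity"). geom_bar() by default uses stat_count(), which makes the height of each bar the number of rows in that group.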
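
The two geoms can be made to draw the same chart if we do the counting ourselves first. A minimal sketch using the diamonds data; cut_counts is a made-up name.

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)

# geom_bar() counts the rows in each cut for us
ggplot(diamonds) + geom_bar(aes(x = cut))

# geom_col() expects the bar heights to be in the data already
cut_counts <- diamonds %>% count(cut)
ggplot(cut_counts) + geom_col(aes(x = cut, y = n))
```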

  3. Most geoms and stats come in pairs that are almost always used in concert. Read through the documentation and make a list of all the pairs. What do they have in common?
geom               stat
geom_bar()         stat_count()
geom_bin2d()       stat_bin_2d()
geom_boxplot()     stat_boxplot()
geom_contour()     stat_contour()
geom_count()       stat_sum()
geom_density()     stat_density()
geom_density_2d()  stat_density_2d()
geom_hex()         stat_bin_hex()
geom_freqpoly()    stat_bin()
geom_histogram()   stat_bin()
geom_qq_line()     stat_qq_line()
geom_qq()          stat_qq()
geom_quantile()    stat_quantile()
geom_smooth()      stat_smooth()
geom_violin()      stat_ydensity()
geom_sf()          stat_sf()

Many (but not all) have similar names, and in each pair the geom uses the stat as its default.

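  4. What variables does stat_smooth() compute? What parameters control its behaviour?
  • y - predicted value
  • ymin - lower pointwise confidence interval around the mean
  • ymax - upper pointwise confidence interval around the mean
  • se - standard error
  • Its behaviour is controlled by parameters such as method (the smoothing function), formula, se (whether to show the confidence interval), level (the confidence level) and span (the amount of smoothing for loess).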
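
The computed variables can be inspected by pulling the layer data out of a plot. This is a sketch under my own choice of parameters (a linear model with a 99% interval); p_lm is a hypothetical name.

```r
library(ggplot2)

# method, se and level are arguments of geom_smooth()/stat_smooth()
p_lm <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_smooth(method = "lm", se = TRUE, level = 0.99)

# The computed variables appear as columns of the layer data
head(layer_data(p_lm))
```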
  5. In our proportion bar chart, we need to set group = 1. Why? In other words what is the problem with these two graphs?
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, y = ..prop..))

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, fill = color, y = ..prop..))

ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = cut, y = ..prop.., fill = color, group = 1))

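The proportions are calculated within each group. Without group = 1, each cut (or each cut/color combination, when fill is used) is its own group, so every proportion is 1 (100%). Setting group = 1 puts all rows into a single group, so the proportions are calculated relative to the whole dataset. To get the “best” visualisation, with the bars also filled by color: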

ggplot(data = diamonds) +
  geom_bar(aes(x = cut, y = ..count.. / sum(..count..), fill = color))
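
The same proportions can be computed by hand, which makes it clearer what the bars should show. A small sketch (without the fill by color); cut_props is a made-up name.

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)

# Proportion of all diamonds that fall in each cut
cut_props <- diamonds %>%
  count(cut) %>%
  mutate(prop = n / sum(n))
cut_props

# Draw the precomputed proportions directly
ggplot(cut_props) + geom_col(aes(x = cut, y = prop))
```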

Exercises 3.8 Position adjustments

  1. What is the problem with this plot? How could you improve it?

    ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + geom_point()

 ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + 
      geom_jitter(height = 1, width  = 1, alpha = 0.6)

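Many points overlap: cty and hwy are recorded as whole numbers, so cars with the same values are drawn on top of each other. To address this, use jitter and alpha, as in the second plot above.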

  2. What parameters to geom_jitter() control the amount of jittering?
  • width and height
  3. Compare and contrast geom_jitter() with geom_count()
 ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) + 
      geom_count()

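geom_count() leaves the points where they are and sizes each point by the number of observations at that position, whereas geom_jitter() moves the points by a small random amount so that overlapping observations become visible.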
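
The counts that geom_count() turns into point sizes can be tabulated directly, which also shows how much overplotting there is. A rough sketch; overlap is a name I made up.

```r
library(ggplot2)  # provides the mpg dataset
library(dplyr)

# Number of cars at each distinct cty/hwy position, largest first
overlap <- mpg %>% count(cty, hwy, sort = TRUE)
head(overlap)

# Far fewer distinct positions than rows, hence the overplotting
nrow(overlap)
nrow(mpg)
```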

  4. What’s the default position adjustment for geom_boxplot()? Create a visualisation of the mpg dataset that demonstrates it.
  • position = "dodge2"
mpg %>% ggplot(aes(x = as.factor(cyl), y = hwy, colour = fl)) + geom_boxplot()

Exercises 3.9 Coordinate systems

  1. Turn a stacked bar chart into a pie chart using coord_polar().
ggplot(data = diamonds) + 
  geom_bar(mapping = aes(x = 1, fill = clarity)) + coord_polar(theta = "y")

  2. What does labs() do? Read the documentation.

It sets the labels of a plot: the x and y axis labels, the title, subtitle and caption, and the legend titles.

  3. What’s the difference between coord_quickmap() and coord_map()?
  • coord_map() uses the Mercator projection (by default)
  • coord_quickmap() uses a quick approximation that preserves straight lines, and works best for smaller areas closer to the equator.
  4. What does the plot below tell you about the relationship between city and highway mpg? Why is coord_fixed() important? What does geom_abline() do?
  • hwy and cty are linearly related
  • The points lie above the reference line, so you get more mileage on a highway than in the city.
  • coord_fixed() fixes the aspect ratio so that one unit on the x axis is the same length as one unit on the y axis; the reference line is then drawn at 45 degrees, which makes the comparison easy to see
  • geom_abline() plots the diagonal reference line (by default, slope 1 and intercept 0)
ggplot(data = mpg, mapping = aes(x = cty, y = hwy)) +
  geom_point() + 
  geom_abline() +
  coord_fixed()
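
To back up the reading of the plot with a number, we can compute the share of cars that sit above the reference line. This is a sketch only; p_ref is a made-up name, and the slope and intercept are written out just to make the defaults of geom_abline() explicit.

```r
library(ggplot2)  # provides the mpg dataset

# Share of cars whose highway mileage exceeds their city mileage
mean(mpg$hwy > mpg$cty)

# geom_abline() defaults to slope = 1 and intercept = 0;
# writing them out gives the same reference line
p_ref <- ggplot(mpg, aes(x = cty, y = hwy)) +
  geom_point() +
  geom_abline(slope = 1, intercept = 0) +
  coord_fixed()
p_ref
```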