Today we will practice using ggplot, focusing on the data, aesthetic, and geom layers. For this recitation, we will use the Giant Pumpkins data from the TidyTuesday project, which was collected from the Great Pumpkin Commonwealth.
At the end of this module, you will create one of these descriptive plots:
library(tidyverse)
library(lubridate)
pumpkins_raw <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-10-19/pumpkins.csv')
## Rows: 28065 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (14): id, place, weight_lbs, grower_name, city, state_prov, country, gpc...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
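# Plot the weight of the first-place giant pumpkin (type "P") for each year
pumpkins_raw %>%
  separate(col = "id", into = c("year", "type")) %>%     # split id into year and type
  filter(type == "P" & place == "1") %>%                 # winners only
  mutate(weight_lbs = str_remove(weight_lbs, ",")) %>%   # drop the thousands separator
  mutate(weight_lbs = as.numeric(weight_lbs)) %>%
  mutate(year = ymd(year, truncated = 2)) %>%            # "2013" becomes 2013-01-01
  ggplot(aes(year, weight_lbs)) +
  geom_point() +
  geom_line()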
The plot above layers geom_point() and geom_line(). Which might make sense in this situation? Below, the wrangling is shown first and the plots come after.
pumpkins_raw %>%
  separate(col = "id", into = c("year", "type")) %>%
  filter(type == "P" & place == "1") %>%
  mutate(weight_lbs = str_remove(weight_lbs, ",")) %>%
  mutate(weight_lbs = as.numeric(weight_lbs)) %>%
  mutate(year = ymd(year, truncated = 2L))
## # A tibble: 9 × 15
## year type place weight_…¹ growe…² city state…³ country gpc_s…⁴ seed_…⁵
## <date> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 2013-01-01 P 1 2032 Mathis… Napa Califo… United… Uesugi… "2009 …
## 2 2014-01-01 P 1 2324. Meier,… Pfun… Other Switze… Europa… "2009 …
## 3 2015-01-01 P 1 2230. Wallac… Gree… Rhode … United… SNGPG … "2009 …
## 4 2016-01-01 P 1 2625. Willem… Deur… East F… Belgium Europa… "2145 …
## 5 2017-01-01 P 1 2363 Hollan… Sumn… Washin… United… Safewa… "2145.…
## 6 2018-01-01 P 1 2528 Geddes… Bosc… New Ha… United… Deerfi… "1911 …
## 7 2019-01-01 P 1 2517 Haist,… Clar… New Yo… United… Ohio V… "2005 …
## 8 2020-01-01 P 1 2594. Paton,… Ever… England United… Royal … "1875 …
## 9 2021-01-01 P 1 2703. Cutrup… Radd… Tuscany Italy Campio… "1885.…
## # … with 5 more variables: pollinator_father <chr>, ott <chr>,
## # est_weight <chr>, pct_chart <chr>, variety <chr>, and abbreviated variable
## # names ¹weight_lbs, ²grower_name, ³state_prov, ⁴gpc_site, ⁵seed_mother
## # ℹ Use `colnames()` to see all variable names
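To see what the two wrangling steps do in isolation, here is a minimal sketch on a toy tibble. The rows are invented for illustration and are not taken from the real data. It also swaps in readr's parse_number() for the str_remove() plus as.numeric() pair used above, which is an alternative we are assuming you might prefer, not the approach the recitation uses.

```r
library(tidyverse)

# Hypothetical toy rows shaped like pumpkins_raw (values invented for illustration)
toy <- tibble(
  id = c("2013-P", "2013-F", "2014-P"),
  place = c("1", "1", "2"),
  weight_lbs = c("2,032.00", "1,894.50", "987.00")
)

toy_clean <- toy %>%
  # separate() splits on the non-alphanumeric hyphen: year = "2013", type = "P"
  separate(col = "id", into = c("year", "type")) %>%
  # parse_number() ignores the thousands separator and returns a double
  mutate(weight_lbs = parse_number(weight_lbs))

toy_clean
```

Note that year is still a character column after separate(), which is why the pipeline above converts it to a date in a later step.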
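We can also parse the date in a slightly different way, using base R. Because only the year is supplied, as.POSIXct() fills in the current month and day, which is why the dates below show 09-20 instead of 01-01.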
(pumpkins_to_plot <- pumpkins_raw %>%
  separate(col = "id", into = c("year", "type")) %>%
  filter(type == "P" & place == "1") %>%
  mutate(weight_lbs = str_remove(weight_lbs, ",")) %>%
  mutate(weight_lbs = as.numeric(weight_lbs)) %>%
  mutate(year = as.POSIXct(year, format = "%Y")) %>%
  mutate(year = as.Date(year, format = "%Y")))
## # A tibble: 9 × 15
## year type place weight_…¹ growe…² city state…³ country gpc_s…⁴ seed_…⁵
## <date> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 2013-09-20 P 1 2032 Mathis… Napa Califo… United… Uesugi… "2009 …
## 2 2014-09-20 P 1 2324. Meier,… Pfun… Other Switze… Europa… "2009 …
## 3 2015-09-20 P 1 2230. Wallac… Gree… Rhode … United… SNGPG … "2009 …
## 4 2016-09-20 P 1 2625. Willem… Deur… East F… Belgium Europa… "2145 …
## 5 2017-09-20 P 1 2363 Hollan… Sumn… Washin… United… Safewa… "2145.…
## 6 2018-09-20 P 1 2528 Geddes… Bosc… New Ha… United… Deerfi… "1911 …
## 7 2019-09-20 P 1 2517 Haist,… Clar… New Yo… United… Ohio V… "2005 …
## 8 2020-09-20 P 1 2594. Paton,… Ever… England United… Royal … "1875 …
## 9 2021-09-20 P 1 2703. Cutrup… Radd… Tuscany Italy Campio… "1885.…
## # … with 5 more variables: pollinator_father <chr>, ott <chr>,
## # est_weight <chr>, pct_chart <chr>, variety <chr>, and abbreviated variable
## # names ¹weight_lbs, ²grower_name, ³state_prov, ⁴gpc_site, ⁵seed_mother
## # ℹ Use `colnames()` to see all variable names
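If you want the result to be the same no matter what day the code is run, one option is lubridate's make_date(), which defaults the month and day to 1. This is a sketch of an alternative, not what the recitation code does, and year_chr is a hypothetical stand-in for the year column.

```r
library(lubridate)

# Hypothetical stand-in for the character year column
year_chr <- c("2013", "2014", "2015")

# With only "%Y" supplied, the month and day are borrowed from today's date
as.Date(as.POSIXct(year_chr, format = "%Y"))

# make_date() defaults month and day to 1, so the result does not depend on the run date
year_date <- make_date(year = as.integer(year_chr))
year_date
```

This gives the same January 1 dates as ymd(year, truncated = 2) in the earlier pipeline.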
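# Set the line color to blue (a fixed value, so it goes outside aes())
pumpkins_to_plot %>%
  ggplot(aes(x = year, y = weight_lbs)) +
  geom_line(color = "blue") +
  geom_point()
# Map point color to year (a variable, so it goes inside aes())
pumpkins_to_plot %>%
  ggplot(aes(x = year, y = weight_lbs)) +
  geom_line() +
  geom_point(aes(color = year))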
Because year is stored as a date, which is a continuous variable, we get a continuous color scale, which might not be what we want. We can get around this by converting year to a factor.
pumpkins_to_plot %>%
  ggplot(aes(x = year, y = weight_lbs)) +
  geom_line() +
  geom_point(aes(color = as.factor(year)))
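One side effect of calling as.factor() on a date is that the legend labels are full dates. If you would rather see just the year, a possible variation is to pull the year out with lubridate's year() before converting to a factor. This is a sketch on invented toy data, and year_label is a hypothetical column name.

```r
library(tidyverse)
library(lubridate)

# Hypothetical toy data (weights invented for illustration)
toy <- tibble(
  year = ymd(c("2013", "2014", "2015"), truncated = 2),
  weight_lbs = c(2000, 2300, 2200)
)

# year() extracts the calendar year, so the factor levels read 2013, 2014, 2015
toy_labeled <- toy %>%
  mutate(year_label = as.factor(year(year)))

p <- ggplot(toy_labeled, aes(x = year, y = weight_lbs)) +
  geom_line() +
  geom_point(aes(color = year_label))

p
```

The x-axis still uses the date column, so the spacing between years is unchanged, and only the legend labels differ.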
# Additionally map the grower's country to point shape
pumpkins_to_plot %>%
  ggplot(aes(x = year, y = weight_lbs)) +
  geom_line() +
  geom_point(aes(color = as.factor(year), shape = country))
# Keep every giant pumpkin (type "P") entered in 2021, not just the winner
pumpkins_2021 <- pumpkins_raw %>%
  separate(col = "id", into = c("year", "type")) %>%
  filter(type == "P" & year == 2021) %>%
  mutate(weight_lbs = str_remove(weight_lbs, ",")) %>%
  mutate(weight_lbs = as.numeric(weight_lbs)) %>%
  mutate(year = ymd(year, truncated = 2L))
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
pumpkins_2021 %>%
  ggplot(aes(x = weight_lbs)) +
  geom_density()
## Warning: Removed 1 rows containing non-finite values (stat_density).
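The two warnings above are related: a weight_lbs value that as.numeric() cannot convert becomes NA, and the geom then drops that row. Here is a minimal sketch of how you might find and remove such rows yourself. The toy values are invented, and "not a weight" is a hypothetical placeholder for whatever non-numeric text is in the real column.

```r
library(tidyverse)

# Hypothetical toy values; the second entry stands in for non-numeric text
toy <- tibble(weight_lbs = c("2,703.00", "not a weight", "1,885.50"))

toy_num <- toy %>%
  # suppressWarnings() hides the coercion warning here because we inspect the NAs next
  mutate(weight_num = suppressWarnings(as.numeric(str_remove(weight_lbs, ","))))

# Rows whose text could not be converted to a number
toy_num %>% filter(is.na(weight_num))

# Drop them explicitly so that later geoms have nothing to remove
toy_ok <- toy_num %>% filter(!is.na(weight_num))
toy_ok
```

Looking at the rows that fail before dropping them is a good habit, since it tells you whether they are real observations or something else.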
Can you also add all the data points on top of the boxplot? Is this a good idea? Might there be a better geom to use than a boxplot?
pumpkins_all <- pumpkins_raw %>%
  separate(col = "id", into = c("year", "type")) %>%
  filter(type == "P") %>%
  mutate(weight_lbs = str_remove(weight_lbs, ",")) %>%
  mutate(weight_lbs = as.numeric(weight_lbs)) %>%
  mutate(year = ymd(year, truncated = 2L))
## Warning in mask$eval_all_mutate(quo): NAs introduced by coercion
# Hide the boxplot's own outlier points so they are not drawn twice under the jitter
pumpkins_all %>%
  ggplot(aes(x = as.factor(year), y = weight_lbs)) +
  geom_boxplot(outlier.shape = NA) +
  geom_jitter(alpha = 0.1)
## Warning: Removed 9 rows containing non-finite values (stat_boxplot).
## Warning: Removed 9 rows containing missing values (geom_point).
# A violin shows the shape of each distribution; the line at 0.5 marks the median
pumpkins_all %>%
  ggplot(aes(x = as.factor(year), y = weight_lbs)) +
  geom_violin(draw_quantiles = 0.5)
## Warning: Removed 9 rows containing non-finite values (stat_ydensity).