(Refer back to the Advanced Data Manipulation lesson).

Key Concepts

  • dplyr verbs
  • the pipe %>%
  • the tbl_df
  • variable creation
  • multiple conditions
  • properties of grouped data
  • aggregation
  • summary functions
  • window functions

Getting Started

We’re going to work with a different dataset for the homework here. It’s a cleaned-up excerpt from the Gapminder data. Download the gapminder.csv data by clicking here or using the link above, and save it in a data/ subfolder of the project directory where you can access it easily from R.

Load the dplyr and readr packages, and read the gapminder data into R using the read_csv() function (n.b. read_csv() is not the same as read.csv()). Assign the data to an object called gm.

In your submitted homework assignment, I would prefer you use the read_csv() function to read the data directly from the web (see below). This way I can run your R code without worrying about whether I have the data/ directory or not.

library(dplyr)
library(readr)

# Preferably: read data from web
gm <- read_csv("http://bioconnector.org/workshops/data/gapminder.csv")

# Alternatively read from file:
# gm <- read_csv("data/gapminder.csv")

# Display the data
gm
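
The note that read_csv() is not the same as read.csv() can be checked on a small file. The sketch below is only an illustration: it writes a made-up two-row CSV to a temporary file (the column names and values are hypothetical, not taken from the Gapminder data) and compares the class of what each function returns.

```r
library(readr)

# Hypothetical toy data written to a temporary file; not the Gapminder data
tmp <- tempfile(fileext = ".csv")
writeLines(c("name,value", "a,1", "b,2"), tmp)

# Base R reader: returns a plain data.frame
base_df <- read.csv(tmp)

# readr reader: returns a tibble (tbl_df); suppressMessages() hides the
# column specification that read_csv() reports while parsing
readr_df <- suppressMessages(read_csv(tmp))

# Compare the classes of the two results
class(base_df)
class(readr_df)
```

Because the tibble class is layered on top of data.frame, dplyr verbs work on either object; the main visible difference is how the two print.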

Problem set

Use dplyr functions to address the following questions:

  1. How many unique countries are represented per continent?
## # A tibble: 5 x 2
##   continent     n
##   <chr>     <int>
## 1 Africa       52
## 2 Americas     25
## 3 Asia         33
## 4 Europe       30
## 5 Oceania       2
  2. Which European nation had the lowest GDP per capita in 1997?
## # A tibble: 1 x 6
##   country continent  year lifeExp     pop gdpPercap
##   <chr>   <chr>     <dbl>   <dbl>   <dbl>     <dbl>
## 1 Albania Europe     1997    73.0 3428038     3193.
  3. According to the data available, what was the average life expectancy across each continent in the 1980s?
## # A tibble: 5 x 2
##   continent mean.lifeExp
##   <chr>            <dbl>
## 1 Africa            52.5
## 2 Americas          67.2
## 3 Asia              63.7
## 4 Europe            73.2
## 5 Oceania           74.8
  4. What 5 countries have the highest total GDP over all years combined?
## # A tibble: 5 x 2
##   country        Total.GDP
##   <chr>              <dbl>
## 1 United States    7.68e13
## 2 Japan            2.54e13
## 3 China            2.04e13
## 4 Germany          1.95e13
## 5 United Kingdom   1.33e13
  5. What countries and years had life expectancies of at least 80 years? N.b. only output the columns of interest: country, life expectancy and year (in that order).
## # A tibble: 22 x 3
##    country          lifeExp  year
##    <chr>              <dbl> <dbl>
##  1 Australia           80.4  2002
##  2 Australia           81.2  2007
##  3 Canada              80.7  2007
##  4 France              80.7  2007
##  5 Hong Kong, China    80    1997
##  6 Hong Kong, China    81.5  2002
##  7 Hong Kong, China    82.2  2007
##  8 Iceland             80.5  2002
##  9 Iceland             81.8  2007
## 10 Israel              80.7  2007
## # … with 12 more rows
  6. What 10 countries have the strongest correlation (in either direction) between life expectancy and per capita GDP?
## # A tibble: 10 x 2
##    country            r
##    <chr>          <dbl>
##  1 France         0.996
##  2 Austria        0.993
##  3 Belgium        0.993
##  4 Norway         0.992
##  5 Oman           0.991
##  6 United Kingdom 0.990
##  7 Italy          0.990
##  8 Israel         0.988
##  9 Denmark        0.987
## 10 Australia      0.986
  7. Which combinations of continent (besides Asia) and year have the highest average population across all countries? N.b. your output should include all results sorted by highest average population. With what you already know, this one may stump you. See this Q&A for how to ungroup before arranging. Note that recent versions of dplyr behave differently here: arrange() ignores grouping unless you ask for it with .by_group = TRUE.
## # A tibble: 48 x 3
##    continent  year  mean.pop
##    <chr>     <dbl>     <dbl>
##  1 Americas   2007 35954847.
##  2 Americas   2002 33990910.
##  3 Americas   1997 31876016.
##  4 Americas   1992 29570964.
##  5 Americas   1987 27310159.
##  6 Americas   1982 25211637.
##  7 Americas   1977 23122708.
##  8 Americas   1972 21175368.
##  9 Europe     2007 19536618.
## 10 Europe     2002 19274129.
## # … with 38 more rows
  8. Which three countries have had the most consistent population estimates (i.e. lowest standard deviation) across the years of available data?
## # A tibble: 3 x 2
##   country               sd.pop
##   <chr>                  <dbl>
## 1 Sao Tome and Principe 45906.
## 2 Iceland               48542.
## 3 Montenegro            99738.
  9. Subset gm to only include observations from 1992 and store the results as gm1992. What kind of object is this?
## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"
  10. Bonus! Which observations indicate that the population of a country has decreased from the previous recorded year and the life expectancy has increased from the previous recorded year? See the vignette on window functions.
## # A tibble: 36 x 6
## # Groups:   country [22]
##    country                continent  year lifeExp      pop gdpPercap
##    <chr>                  <chr>     <dbl>   <dbl>    <dbl>     <dbl>
##  1 Afghanistan            Asia       1982    39.9 12881816      978.
##  2 Bosnia and Herzegovina Europe     1992    72.2  4256013     2547.
##  3 Bosnia and Herzegovina Europe     1997    73.2  3607000     4766.
##  4 Bulgaria               Europe     2002    72.1  7661799     7697.
##  5 Bulgaria               Europe     2007    73.0  7322858    10681.
##  6 Croatia                Europe     1997    73.7  4444595     9876.
##  7 Czech Republic         Europe     1997    74.0 10300707    16049.
##  8 Czech Republic         Europe     2002    75.5 10256295    17596.
##  9 Czech Republic         Europe     2007    76.5 10228744    22833.
## 10 Equatorial Guinea      Africa     1977    42.0   192675      959.
## # … with 26 more rows
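
Two of the hints above (ungrouping before arranging, and window functions) can be tried on a toy table before tackling the real data. The sketch below is not a solution to any problem in the set: the tibble toy and its columns region, yr and size are invented for illustration, and the pattern shown is one reasonable way to do it rather than the only one.

```r
library(dplyr)

# Hypothetical toy data: two regions observed in three years
toy <- tibble(
  region = rep(c("north", "south"), each = 3),
  yr     = rep(c(2000, 2005, 2010), times = 2),
  size   = c(10, 12, 11, 50, 45, 48)
)

# Grouped summary: with two grouping variables, the result of summarize()
# is still grouped by the first one (region) by default
by_region_yr <- toy %>%
  group_by(region, yr) %>%
  summarize(mean.size = mean(size))

# Drop the remaining grouping explicitly, then sort over all rows
sorted <- by_region_yr %>%
  ungroup() %>%
  arrange(desc(mean.size))

# Window function: lag() returns the value from the previous row within
# each group, so the rows are ordered by region and yr first
changes <- toy %>%
  arrange(region, yr) %>%
  group_by(region) %>%
  mutate(prev.size = lag(size),
         decreased = size < prev.size) %>%
  ungroup()

sorted
changes
```

The first row of each group has no previous value, so lag() yields NA there and the comparison is NA as well; filter() drops such rows, which is usually what you want.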

Source: https://raw.githubusercontent.com/4va/biodatasci/master/r-dplyr-homework.Rmd