(Refer back to the Advanced Data Visualization lesson).
- geoms
- aesthetic mappings
- statistical layers
- scales
- ggthemes
- ggsave
The data we’re going to look at is a cleaned-up version of a gene expression dataset from Brauer et al., Coordination of Growth Rate, Cell Cycle, Stress Response, and Metabolic Activity in Yeast (2008) Mol Biol Cell 19:352-367. The data come from a gene expression microarray experiment in which the authors examine the relationship between growth rate and gene expression in yeast cultures limited by one of six different nutrients (glucose, leucine, ammonium, sulfate, phosphate, uracil). If you give yeast a rich medium loaded with nutrients but restrict the supply of a single nutrient, you can control the growth rate to any rate you choose. By starving yeast of specific nutrients you can find genes that:
You can download the cleaned-up version of the data at the link above. The file is called brauer2007_tidy.csv. Load the ggplot2, dplyr, and readr packages, and read the tidy Brauer data into R using the read_csv() function (note: not read.csv()). Assign the data to an object called ydat.
library(tidyverse)
# or ...
# library(ggplot2)
# library(dplyr)
# library(readr)
library(ggthemes)
# Preferably read data from the web
#ydat <- read_csv("http://bioconnector.org/workshops/data/brauer2007_tidy.csv")
# Alternatively read data from file
ydat <- read_csv("data/brauer2007_tidy.csv")
# Display the data
ydat
## # A tibble: 198,430 x 7
## symbol systematic_name nutrient rate expression bp mf
## <chr> <chr> <chr> <dbl> <dbl> <chr> <chr>
## 1 SFB2 YNL049C Glucose 0.05 -0.24 ER to Golgi transport molecular function unknown
## 2 <NA> YNL095C Glucose 0.05 0.28 biological process unknown molecular function unknown
## 3 QRI7 YDL104C Glucose 0.05 -0.02 proteolysis and peptidolysis metalloendopeptidase activity
## 4 CFT2 YLR115W Glucose 0.05 -0.33 mRNA polyadenylylation* RNA binding
## 5 SSO2 YMR183C Glucose 0.05 0.05 vesicle fusion* t-SNARE activity
## 6 PSP2 YML017W Glucose 0.05 -0.69 biological process unknown molecular function unknown
## 7 RIB2 YOL066C Glucose 0.05 -0.55 riboflavin biosynthesis pseudouridylate synthase activity*
## 8 VMA13 YPR036W Glucose 0.05 -0.75 vacuolar acidification hydrogen-transporting ATPase activity, rotationa…
## 9 EDC3 YEL015W Glucose 0.05 -0.24 deadenylylation-independent dec… molecular function unknown
## 10 VPS5 YOR069W Glucose 0.05 -0.16 protein retention in Golgi* protein transporter activity
## # … with 198,420 more rows
Follow the prompts and use ggplot2 to reproduce the plots below.
We can start by taking a look at the distribution of the expression values.
Wow. That’s ugly. Might be a candidate for accidental aRt but not very helpful for our analysis.
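A minimal sketch of this kind of exploration, assuming ydat has been loaded as shown earlier: a histogram of all expression values, followed by overlaid per-nutrient density curves where some transparency keeps the curves readable. The bins and alpha values are arbitrary choices for illustration, not settings taken from the lesson.

```r
library(ggplot2)
library(readr)

# Reuse ydat if it is already loaded; otherwise read it from the path used earlier.
if (!exists("ydat")) ydat <- read_csv("data/brauer2007_tidy.csv")

# Histogram of every expression value (bins = 100 is an arbitrary choice).
p_hist <- ggplot(ydat, aes(x = expression)) +
  geom_histogram(bins = 100)

# One density curve per nutrient; alpha < 1 makes overlapping fills legible.
p_density <- ggplot(ydat, aes(x = expression, fill = nutrient)) +
  geom_density(alpha = 0.4)

p_hist
p_density
```

Swapping the fill aesthetic for facet_wrap(~nutrient) would give one panel per nutrient instead of overlaid curves.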
The basic exploratory process above suggests that the overall distribution (as well as each distribution by nutrient) is approximately normal.
Let’s compare the genes with the highest and lowest average expression values.
We can figure out which these are using some familiar logic:
The code below implements that pipeline in dplyr syntax:
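ydat %>%
  # one group per gene
  group_by(symbol) %>%
  # mean expression across all rates and nutrients
  summarise(meanexp = mean(expression)) %>%
  # sort from highest to lowest mean
  arrange(desc(meanexp)) %>%
  # keep only the first (highest) and last (lowest) rows
  filter(row_number() == 1 | row_number() == n())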
## # A tibble: 2 x 2
## symbol meanexp
## <chr> <dbl>
## 1 HXT3 4.01
## 2 HXT6 -2.68
The output tells us that the gene with the highest mean expression is HXT3, while the gene with the lowest mean expression is HXT6.
HINT: you can add a “jitter” position to geom_point(), but it’s easier to control the width of the effect if you use geom_jitter().
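As a rough sketch of this comparison, assuming ydat is loaded as above and using the two gene symbols reported by the pipeline (HXT3 and HXT6), the points can be spread horizontally with geom_jitter(). The width of 0.2 and the choice to put nutrient on the x-axis are assumptions made for illustration.

```r
library(dplyr)
library(ggplot2)
library(readr)

# Reuse ydat if it is already loaded; otherwise read it from the path used earlier.
if (!exists("ydat")) ydat <- read_csv("data/brauer2007_tidy.csv")

# Keep only the two genes with the highest and lowest mean expression.
hxt <- ydat %>%
  filter(symbol %in% c("HXT3", "HXT6"))

# Jitter horizontally only (height = 0) so the expression values are not distorted.
p_jitter <- ggplot(hxt, aes(x = nutrient, y = expression, color = symbol)) +
  geom_jitter(width = 0.2, height = 0)

p_jitter
```

Setting height = 0 matters here: jittering the y-axis would alter the plotted expression values themselves.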
Although these two genes are on opposite ends of the distribution of average expression values, they both seem to express similar amounts when Glucose is the restricted nutrient.
Now let’s try to make something that has a little bit more of a polished look.
Calculate the mean expression for each combination of rate and nutrient (using group_by() and summarize()). Create a plot of this data with rate on the x-axis and mean expression on the y-axis and lines colored by nutrient.
HINT: The read_csv() function read in the rate variable as continuous rather than discrete. There are a few ways to remedy this, but first see if you can set the scale for the x-axis variable without changing the data frame.
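One possible approach, sketched under the assumption that ydat is loaded as above: summarize mean expression by rate and nutrient, then leave rate numeric and place the x-axis breaks at the rate values actually observed. The object names mean_by_rate and p_line are hypothetical.

```r
library(dplyr)
library(ggplot2)
library(readr)

# Reuse ydat if it is already loaded; otherwise read it from the path used earlier.
if (!exists("ydat")) ydat <- read_csv("data/brauer2007_tidy.csv")

# Mean expression for each rate/nutrient combination.
mean_by_rate <- ydat %>%
  group_by(rate, nutrient) %>%
  summarize(meanexp = mean(expression), .groups = "drop")

# rate stays continuous in the data; only the axis breaks are changed,
# so each observed rate gets its own tick.
p_line <- ggplot(mean_by_rate, aes(x = rate, y = meanexp, color = nutrient)) +
  geom_line() +
  scale_x_continuous(breaks = sort(unique(mean_by_rate$rate)))

p_line
```

An alternative is to convert rate with factor(), but that changes the variable and would require a group aesthetic for the lines to connect.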
ggplot() will name the x and y axes with the names of their respective variables. You might want to apply more meaningful labels. Change the name of the x-axis to “Rate”, the name of the y-axis to “Mean Expression” and the plot title to “Mean Expression By Rate (Brauer)”.
HINT: ?labs will pull up the ggplot2 documentation on axis labels and plot titles.
HINT 1: library(ggthemes) not working for you? Install the package first.
HINT 2: You can either do this by trial and error or check out the package vignette to get an idea of what each theme looks like: https://github.com/jrnold/ggthemes
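Putting the polishing steps together, a sketch might look like the following, assuming ydat is loaded as above and ggthemes is installed. theme_few() is just one of the ggthemes themes, picked arbitrarily, and the output file name and dimensions are hypothetical.

```r
library(dplyr)
library(ggplot2)
library(ggthemes)
library(readr)

# Reuse ydat if it is already loaded; otherwise read it from the path used earlier.
if (!exists("ydat")) ydat <- read_csv("data/brauer2007_tidy.csv")

p_final <- ydat %>%
  group_by(rate, nutrient) %>%
  summarize(meanexp = mean(expression), .groups = "drop") %>%
  ggplot(aes(x = rate, y = meanexp, color = nutrient)) +
  geom_line() +
  # Axis labels and title requested in the prompt above.
  labs(x = "Rate", y = "Mean Expression",
       title = "Mean Expression By Rate (Brauer)") +
  # Any ggthemes theme can be swapped in here.
  theme_few()

# Write the plot to disk; file name and size (in inches) are illustrative.
ggsave("mean_expression_by_rate.png", plot = p_final, width = 8, height = 5)
```

Passing the plot object explicitly to ggsave() avoids relying on whichever plot happened to be displayed last.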