(Refer back to the RNA-seq data analysis lesson).

Recount

Take a look at the preprint: Collado-Torres, Leonardo, et al. “recount: A large-scale resource of analysis-ready RNA-seq expression data.” bioRxiv (2016): 068478. It’s available at http://biorxiv.org/content/early/2016/08/08/068478.

Here’s the abstract:

recount is a resource of processed and summarized expression data spanning nearly 60,000 human RNA-seq samples from the Sequence Read Archive (SRA). The associated recount Bioconductor package provides a convenient API for querying, downloading, and analyzing the data. Each processed study consists of meta/phenotype data, the expression levels of genes and their underlying exons and splice junctions, and corresponding genomic annotation. We also provide data summarization types for quantifying novel transcribed sequence including base-resolution coverage and potentially unannotated splice junctions. We present workflows illustrating how to use recount to perform differential expression analysis including meta-analysis, annotation-free base-level analysis, and replication of smaller studies using data from larger studies. recount provides a valuable and user-friendly resource of processed RNA-seq datasets to draw additional biological insights from existing public data. The resource is available at https://jhubiostatistics.shinyapps.io/recount/.

This preprint describes an incredible resource. The authors here have downloaded and pre-processed several thousand RNA-seq studies from the raw data available in the SRA. They make this data available through a web application1 at where you can browse, search, and sort the studies they’ve processed, and it gives you direct links to download analysis-ready pre-processed count data and metadata. As the abstract mentions, you can access recount at jhubiostatistics.shinyapps.io/recount/.

Get some data

In the accession search box, search for SRP026387. This has some pre-processed data from a study looking at the Wnt/Beta-catenin pathway and how that’s affected by androgen deprivation therapy in treating prostate cancer. You can learn more about the data and the study by clicking the accession number hyperlink.

Androgen ablation therapy (AAT) is standard treatment for locally-advanced/metastatic prostate cancer (PCa). Many patients develop castration-resistance (CRPCa) after ~2-3 years, with a poor prognosis. The molecular mechanisms underlying CRPCa progression are unclear. mRNA-Seq was performed on tumours from 7 patients with locally-advanced/metastatic PCa before and ~22 weeks after AAT initiation. Differentially regulated genes were identified in treatment pairs. Overall design: Tumour biopsies from 7 patients were taken before and after AAT treatment.

You could get the gene-level counts and metadata by clicking the appropriate links. I downloaded the data and and cleaned up the metadata and gene identifiers a bit, and made that data available on the Data page. The files are called SRP026387_scaledcounts.csv and SRP026387_metadata.csv.

Go to the Data page, download these data files, and load them into R. As we did in the original lesson) under the importing data heading, you can load the tidyverse and load them with read_csv(), but before you can import them into a DESeqDataSet you’ll have to convert them back to a regular data frame. You could alternatively use base R’s read.csv() function and not have to worry about it.

# Load the count data and metadata. Here, the "data" folder
# is relative to my current working directory. Load the data
# From wherever you keep it, or just load it directly from
# the web using the bioconnector.org/data/... URLs.
mycounts <- read.csv("data/SRP026387_scaledcounts.csv")
metadata <- read.csv("data/SRP026387_metadata.csv")

Take a look at the metadata:

metadata
##           id replicate prepost
## 1  SRR923920        R1     Pre
## 2  SRR923921        R2     Pre
## 3  SRR923923        R4     Pre
## 4  SRR923924        R5     Pre
## 5  SRR923925        R6     Pre
## 6  SRR923927        R1    Post
## 7  SRR923926        R7     Pre
## 8  SRR923929        R3    Post
## 9  SRR923928        R2    Post
## 10 SRR923930        R4    Post
## 11 SRR923931        R5    Post
## 12 SRR923933        R7    Post
## 13 SRR923932        R6    Post

And the first few rows of the count data:

head(mycounts)
##           ensgene SRR923920 SRR923921 SRR923923 SRR923924 SRR923925 SRR923927 SRR923926 SRR923929 SRR923928 SRR923930 SRR923931
## 1 ENSG00000000003       159      1508      2062      1369      1625       401      2024      1075       719      1104      1441
## 2 ENSG00000000005         0         0        17         7         0         0         0         7         8         0         6
## 3 ENSG00000000419       416       312       505       404       609       376       754       387       324       454       652
## 4 ENSG00000000457       529       507       718       851       687       722       599       705       689       855      1016
## 5 ENSG00000000460       214       197       330       279       249       213       217       281       220       184       379
## 6 ENSG00000000938       170       278       100       142       174       537       214       206       255       110       253
##   SRR923933 SRR923932
## 1      1125      1376
## 2         6         2
## 3       640       620
## 4       883       671
## 5       383       269
## 6       147       185

Finally, load the annotation data (best to do this with dplyr’s read_csv() because it’s pretty big. We’ll need tidyverse later to do some joins).

library(tidyverse)
anno <- read_csv("data/annotables_grch38.csv")
anno
## # A tibble: 66,531 x 9
##    ensgene      entrez symbol   chr      start     end strand biotype    description                                                  
##    <chr>         <dbl> <chr>    <chr>    <dbl>   <dbl>  <dbl> <chr>      <chr>                                                        
##  1 ENSG0000000…   7105 TSPAN6   X       1.01e8  1.01e8     -1 protein_c… tetraspanin 6 [Source:HGNC Symbol;Acc:HGNC:11858]            
##  2 ENSG0000000…  64102 TNMD     X       1.01e8  1.01e8      1 protein_c… tenomodulin [Source:HGNC Symbol;Acc:HGNC:17757]              
##  3 ENSG0000000…   8813 DPM1     20      5.09e7  5.10e7     -1 protein_c… dolichyl-phosphate mannosyltransferase polypeptide 1, cataly…
##  4 ENSG0000000…  57147 SCYL3    1       1.70e8  1.70e8     -1 protein_c… SCY1-like, kinase-like 3 [Source:HGNC Symbol;Acc:HGNC:19285] 
##  5 ENSG0000000…  55732 C1orf112 1       1.70e8  1.70e8      1 protein_c… chromosome 1 open reading frame 112 [Source:HGNC Symbol;Acc:…
##  6 ENSG0000000…   2268 FGR      1       2.76e7  2.76e7     -1 protein_c… FGR proto-oncogene, Src family tyrosine kinase [Source:HGNC …
##  7 ENSG0000000…   3075 CFH      1       1.97e8  1.97e8      1 protein_c… complement factor H [Source:HGNC Symbol;Acc:HGNC:4883]       
##  8 ENSG0000000…   2519 FUCA2    6       1.43e8  1.44e8     -1 protein_c… fucosidase, alpha-L- 2, plasma [Source:HGNC Symbol;Acc:HGNC:…
##  9 ENSG0000000…   2729 GCLC     6       5.35e7  5.36e7     -1 protein_c… glutamate-cysteine ligase, catalytic subunit [Source:HGNC Sy…
## 10 ENSG0000000…   4800 NFYA     6       4.11e7  4.11e7      1 protein_c… nuclear transcription factor Y, alpha [Source:HGNC Symbol;Ac…
## # … with 66,521 more rows

What genes are DE?

  1. After you’ve loaded the data, load the DESeq2 library and import your data into a DESeqDataSet using the tidy=TRUE argument, with the design formula being the prepost variable from the metadata.
  2. Run the DESeq pipeline.
  3. Get the results for the pre vs post analysis.2
  4. Arrange these results by p-value, and join the results to the annotation data (hint: %>% inner_join(anno, by=c("row"="ensgene"))).
  5. Print out the top 10 differentially expressed genes, their fold changes, p-values, adjusted p-values, annotated with the gene symbol and the description from the annotation data.3

Here’s what your results should look like. I’m showing actual values for the first few genes so you can check to see if your results are close. Don’t worry about rounding errors.

## Warning: `funs()` is deprecated as of dplyr 0.8.0.
## Please use a list of either functions or lambdas: 
## 
##   # Simple named list: 
##   list(mean = mean, median = median)
## 
##   # Auto named with `tibble::lst()`: 
##   tibble::lst(mean, median)
## 
##   # Using lambdas
##   list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
## # A tibble: 10 x 6
##    row             log2FoldChange pvalue   padj     symbol     description
##    <chr>           <chr>          <chr>    <chr>    <chr>      <chr>      
##  1 ENSG00000151503 3.46           3.8e-37  1.04e-32 NCAPD3     ?          
##  2 ENSG00000249599 3.62           1.44e-29 1.96e-25 BMPR1B-AS1 ?          
##  3 ENSG00000228278 7.48           3.98e-28 3.62e-24 ORM2       ?          
##  4 ?               ?              ?        ?        ?          ?          
##  5 ?               ?              ?        ?        ?          ?          
##  6 ?               ?              ?        ?        ?          ?          
##  7 ?               ?              ?        ?        ?          ?          
##  8 ?               ?              ?        ?        ?          ?          
##  9 ?               ?              ?        ?        ?          ?          
## 10 ?               ?              ?        ?        ?          ?


  1. Incidentally, the web application itself was built with R using the Shiny framework (https://shiny.rstudio.com/). Pete teaches a workshop on this!

  2. Note that without further specifying to the results() function, the default is to treat the alphabetically first condition as the control, and the alphabetically second condition as the experimental group. Therefore, the differential expression values will be comparing “Pre” versus “Post”, since “Post” comes first.

  3. Note that you may need to explicitly call select() from the dplyr library using dplyr::select(), because loading DESeq2 will mask dplyr’s select if you loaded dplyr or tidyverse first.