(Refer back to the RNA-seq data analysis lesson).
Take a look at the preprint: Collado-Torres, Leonardo, et al. “recount: A large-scale resource of analysis-ready RNA-seq expression data.” bioRxiv (2016): 068478. It’s available at http://biorxiv.org/content/early/2016/08/08/068478.
Here’s the abstract:
recount is a resource of processed and summarized expression data spanning nearly 60,000 human RNA-seq samples from the Sequence Read Archive (SRA). The associated recount Bioconductor package provides a convenient API for querying, downloading, and analyzing the data. Each processed study consists of meta/phenotype data, the expression levels of genes and their underlying exons and splice junctions, and corresponding genomic annotation. We also provide data summarization types for quantifying novel transcribed sequence including base-resolution coverage and potentially unannotated splice junctions. We present workflows illustrating how to use recount to perform differential expression analysis including meta-analysis, annotation-free base-level analysis, and replication of smaller studies using data from larger studies. recount provides a valuable and user-friendly resource of processed RNA-seq datasets to draw additional biological insights from existing public data. The resource is available at https://jhubiostatistics.shinyapps.io/recount/.
This preprint describes an incredible resource. The authors here have downloaded and pre-processed several thousand RNA-seq studies from the raw data available in the SRA. They make this data available through a web application1 at where you can browse, search, and sort the studies they’ve processed, and it gives you direct links to download analysis-ready pre-processed count data and metadata. As the abstract mentions, you can access recount at jhubiostatistics.shinyapps.io/recount/.
In the accession search box, search for SRP026387. This has some pre-processed data from a study looking at the Wnt/Beta-catenin pathway and how that’s affected by androgen deprivation therapy in treating prostate cancer. You can learn more about the data and the study by clicking the accession number hyperlink.
Androgen ablation therapy (AAT) is standard treatment for locally-advanced/metastatic prostate cancer (PCa). Many patients develop castration-resistance (CRPCa) after ~2-3 years, with a poor prognosis. The molecular mechanisms underlying CRPCa progression are unclear. mRNA-Seq was performed on tumours from 7 patients with locally-advanced/metastatic PCa before and ~22 weeks after AAT initiation. Differentially regulated genes were identified in treatment pairs. Overall design: Tumour biopsies from 7 patients were taken before and after AAT treatment.
You could get the gene-level counts and metadata by clicking the appropriate links. I downloaded the data and and cleaned up the metadata and gene identifiers a bit, and made that data available on the Data page. The files are called SRP026387_scaledcounts.csv and SRP026387_metadata.csv.
Go to the Data page, download these data files, and load them into R. As we did in the original lesson) under the importing data heading, you can load the tidyverse and load them with read_csv()
, but before you can import them into a DESeqDataSet you’ll have to convert them back to a regular data frame. You could alternatively use base R’s read.csv()
function and not have to worry about it.
# Load the count data and metadata. Here, the "data" folder
# is relative to my current working directory. Load the data
# From wherever you keep it, or just load it directly from
# the web using the bioconnector.org/data/... URLs.
mycounts <- read.csv("data/SRP026387_scaledcounts.csv")
metadata <- read.csv("data/SRP026387_metadata.csv")
Take a look at the metadata:
metadata
## id replicate prepost
## 1 SRR923920 R1 Pre
## 2 SRR923921 R2 Pre
## 3 SRR923923 R4 Pre
## 4 SRR923924 R5 Pre
## 5 SRR923925 R6 Pre
## 6 SRR923927 R1 Post
## 7 SRR923926 R7 Pre
## 8 SRR923929 R3 Post
## 9 SRR923928 R2 Post
## 10 SRR923930 R4 Post
## 11 SRR923931 R5 Post
## 12 SRR923933 R7 Post
## 13 SRR923932 R6 Post
And the first few rows of the count data:
head(mycounts)
## ensgene SRR923920 SRR923921 SRR923923 SRR923924 SRR923925 SRR923927 SRR923926 SRR923929 SRR923928 SRR923930 SRR923931
## 1 ENSG00000000003 159 1508 2062 1369 1625 401 2024 1075 719 1104 1441
## 2 ENSG00000000005 0 0 17 7 0 0 0 7 8 0 6
## 3 ENSG00000000419 416 312 505 404 609 376 754 387 324 454 652
## 4 ENSG00000000457 529 507 718 851 687 722 599 705 689 855 1016
## 5 ENSG00000000460 214 197 330 279 249 213 217 281 220 184 379
## 6 ENSG00000000938 170 278 100 142 174 537 214 206 255 110 253
## SRR923933 SRR923932
## 1 1125 1376
## 2 6 2
## 3 640 620
## 4 883 671
## 5 383 269
## 6 147 185
Finally, load the annotation data (best to do this with dplyr’s read_csv()
because it’s pretty big. We’ll need tidyverse later to do some joins).
library(tidyverse)
anno <- read_csv("data/annotables_grch38.csv")
anno
## # A tibble: 66,531 x 9
## ensgene entrez symbol chr start end strand biotype description
## <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl> <chr> <chr>
## 1 ENSG0000000… 7105 TSPAN6 X 1.01e8 1.01e8 -1 protein_c… tetraspanin 6 [Source:HGNC Symbol;Acc:HGNC:11858]
## 2 ENSG0000000… 64102 TNMD X 1.01e8 1.01e8 1 protein_c… tenomodulin [Source:HGNC Symbol;Acc:HGNC:17757]
## 3 ENSG0000000… 8813 DPM1 20 5.09e7 5.10e7 -1 protein_c… dolichyl-phosphate mannosyltransferase polypeptide 1, cataly…
## 4 ENSG0000000… 57147 SCYL3 1 1.70e8 1.70e8 -1 protein_c… SCY1-like, kinase-like 3 [Source:HGNC Symbol;Acc:HGNC:19285]
## 5 ENSG0000000… 55732 C1orf112 1 1.70e8 1.70e8 1 protein_c… chromosome 1 open reading frame 112 [Source:HGNC Symbol;Acc:…
## 6 ENSG0000000… 2268 FGR 1 2.76e7 2.76e7 -1 protein_c… FGR proto-oncogene, Src family tyrosine kinase [Source:HGNC …
## 7 ENSG0000000… 3075 CFH 1 1.97e8 1.97e8 1 protein_c… complement factor H [Source:HGNC Symbol;Acc:HGNC:4883]
## 8 ENSG0000000… 2519 FUCA2 6 1.43e8 1.44e8 -1 protein_c… fucosidase, alpha-L- 2, plasma [Source:HGNC Symbol;Acc:HGNC:…
## 9 ENSG0000000… 2729 GCLC 6 5.35e7 5.36e7 -1 protein_c… glutamate-cysteine ligase, catalytic subunit [Source:HGNC Sy…
## 10 ENSG0000000… 4800 NFYA 6 4.11e7 4.11e7 1 protein_c… nuclear transcription factor Y, alpha [Source:HGNC Symbol;Ac…
## # … with 66,521 more rows
tidy=TRUE
argument, with the design formula being the prepost
variable from the metadata.%>% inner_join(anno, by=c("row"="ensgene"))
).Here’s what your results should look like. I’m showing actual values for the first few genes so you can check to see if your results are close. Don’t worry about rounding errors.
## Warning: `funs()` is deprecated as of dplyr 0.8.0.
## Please use a list of either functions or lambdas:
##
## # Simple named list:
## list(mean = mean, median = median)
##
## # Auto named with `tibble::lst()`:
## tibble::lst(mean, median)
##
## # Using lambdas
## list(~ mean(., trim = .2), ~ median(., na.rm = TRUE))
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
## # A tibble: 10 x 6
## row log2FoldChange pvalue padj symbol description
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 ENSG00000151503 3.46 3.8e-37 1.04e-32 NCAPD3 ?
## 2 ENSG00000249599 3.62 1.44e-29 1.96e-25 BMPR1B-AS1 ?
## 3 ENSG00000228278 7.48 3.98e-28 3.62e-24 ORM2 ?
## 4 ? ? ? ? ? ?
## 5 ? ? ? ? ? ?
## 6 ? ? ? ? ? ?
## 7 ? ? ? ? ? ?
## 8 ? ? ? ? ? ?
## 9 ? ? ? ? ? ?
## 10 ? ? ? ? ? ?
Incidentally, the web application itself was built with R using the Shiny framework (https://shiny.rstudio.com/). Pete teaches a workshop on this!↩
Note that without further specifying to the results()
function, the default is to treat the alphabetically first condition as the control, and the alphabetically second condition as the experimental group. Therefore, the differential expression values will be comparing “Pre” versus “Post”, since “Post” comes first.↩
Note that you may need to explicitly call select()
from the dplyr library using dplyr::select()
, because loading DESeq2 will mask dplyr’s select if you loaded dplyr or tidyverse first.↩