Required R packages

If you want to reproduce this document, you will need the following R packages installed. It’s best if you install R 3.5 from CRAN and maybe even RStudio.

## Install R 3.5 
install.packages('BiocManager')
BiocManager::install(c('recount', 'downloader', 'devtools', 'DT'))

Get the predictions data

First we load the R packages.

library('recount')
library('downloader')
library('devtools')
library('DT')

Next we load the prediction data from Shannon Ellis which is stored in a data.frame (a table) called PredictedPhenotypes.

## Load the predictions
predfile <- 'PredictedPhenotypes_v0.0.03.rda'
url <- paste0("https://github.com/leekgroup/recount-website/blob/master/predictions/",
    predfile, "?raw=true")
destfile <- file.path(tempdir(), predfile)
downloader::download(url, destfile = destfile, mode = "wb")
load(destfile)

## Explore the data a bit
head(PredictedPhenotypes)
##    sample_id dataset reported_sex predicted_sex accuracy_sex
## 1  SRR660824    gtex         male          male    0.9981128
## 2 SRR2166176    gtex         male          male    0.9981128
## 3  SRR606939    gtex       female        female    0.9981128
## 4 SRR2167642    gtex         male          male    0.9981128
## 5 SRR2165473    gtex         male          male    0.9981128
## 6  SRR603858    gtex         male          male    0.9981128
##   reported_samplesource predicted_samplesource accuracy_samplesource
## 1                tissue                 tissue              0.961732
## 2                tissue                 tissue              0.961732
## 3                tissue                 tissue              0.961732
## 4                tissue                 tissue              0.961732
## 5                tissue                 tissue              0.961732
## 6                tissue              cell_line              0.961732
##   reported_tissue predicted_tissue accuracy_tissue
## 1            Lung             Lung       0.9505137
## 2           Brain            Brain       0.9505137
## 3           Heart            Heart       0.9505137
## 4           Brain            Brain       0.9505137
## 5            Skin             Skin       0.9505137
## 6            Lung             Lung       0.9505137
##   reported_sequencingstrategy predicted_sequencingstrategy
## 1                      PAIRED                       PAIRED
## 2                      PAIRED                       PAIRED
## 3                      PAIRED                       PAIRED
## 4                      PAIRED                       PAIRED
## 5                      PAIRED                       PAIRED
## 6                      PAIRED                       PAIRED
##   accuracy_sequencingstrategy
## 1                   0.9992661
## 2                   0.9992661
## 3                   0.9992661
## 4                   0.9992661
## 5                   0.9992661
## 6                   0.9992661
dim(PredictedPhenotypes)
## [1] 70479    14

Now we can identify which of the 70k+ samples are predicted to be from the brain.

brain <- tolower(PredictedPhenotypes$predicted_tissue) == 'brain'
length(brain)
## [1] 70479
table(brain)
## brain
## FALSE  TRUE 
## 60314  6199

Match with SRA metadata

Using the all_metadata() function from recount we download the metadata for the roughly 50k samples from SRA.

meta <- all_metadata()
## 2017-07-06 15:40:15 downloading the metadata to /var/folders/cx/n9s558kx6fb7jf5z_pgszgb80000gn/T//RtmpEc4716/metadata_clean_sra.Rdata
dim(meta)
## [1] 50099    21

We can match the predicted brain samples to the SRA metadata using the run identifier. Since the metadata doesn’t include the roughly 20k samples from GTEx and TCGA, not all predicted brain samples match. That is ok because we want to focus on the SRA data for which we don’t have proper metadata.

map <- match(PredictedPhenotypes$sample_id[brain], meta$run)
map <- map[!is.na(map)]

## Number of predicted brain samples in SRA
length(map)
## [1] 4601
## Number of corresponding studies in SRA with at least 1 predicted brain sample
length(unique(meta$project[map]))
## [1] 363

Now we can get the project ids and check what percent of the samples for a given project have predicted brain samples.

proj <- meta$project[map]
proj.n <- table(proj)

## Get the total number of samples in the projects
proj.all <- table(meta$project)
map.proj <- match(names(proj.n), names(proj.all))
proj.all <- proj.all[map.proj]

## Drop projects with less than 4 samples
table(proj.all >= 4)
## 
## FALSE  TRUE 
##    37   326
proj.n <- proj.n[proj.all >= 4]
proj.all <- proj.all[proj.all >= 4]

## Calculate the percent of predicted brain samples
proj.perc <- as.vector(round(proj.n / proj.all * 100, 2))

## Explore that distribution
summary(proj.perc)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.15    5.88   18.30   33.79   59.58  100.00
boxplot(proj.perc, main = 'Percent of predicted brain samples')

Selected projects

To reduce false positives brain predictions, we will focus on studies that are mostly brain samples.

cutoffs <- c(70, 80, 90)
cutoffs.pass <- sapply(cutoffs, function(cut) { table(proj.perc > cut) })
colnames(cutoffs.pass) <- cutoffs
cutoffs.pass
##        70  80  90
## FALSE 259 267 275
## TRUE   67  59  51

We can then put all the information in a single table, export it to a text file, and explore it interactively in this document.

proj.df <- data.frame(brain_predicted_n = as.vector(proj.n), project_n = as.vector(proj.all),
    brain_percent = proj.perc, project_id = names(proj.n))
proj.df <- proj.df[order(proj.df$brain_percent, decreasing = TRUE), ]

write.table(proj.df, file = 'projects_list.txt', sep = '\t', col.names = TRUE, quote = FALSE, row.names = FALSE)

datatable(proj.df,
    options = list(pagingType='full_numbers', pageLength=25, scrollX='100%'),
    escape = FALSE, rownames = FALSE)

Data to use

Lets focus on the projects where at least 70 percent of the samples are predicted to come from the brain. That would include project SRP025982 which from https://trace.ncbi.nlm.nih.gov/Traces/study/?acc=SRP025982 is mostly “universal human brain”.

Reproducibility information

## Reproducibility information
print('Reproducibility information:')
## [1] "Reproducibility information:"
Sys.time()
## [1] "2017-07-06 15:40:18 EDT"
proc.time()
##    user  system elapsed 
##  11.375   0.756  16.828
options(width = 120)
session_info()
## Session info ----------------------------------------------------------------------------------------------------------
##  setting  value                       
##  version  R version 3.4.0 (2017-04-21)
##  system   x86_64, darwin15.6.0        
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  tz       America/New_York            
##  date     2017-07-06
## Packages --------------------------------------------------------------------------------------------------------------
##  package              * version  date       source        
##  acepack                1.4.1    2016-10-29 CRAN (R 3.4.0)
##  AnnotationDbi          1.38.1   2017-06-01 Bioconductor  
##  backports              1.1.0    2017-05-22 CRAN (R 3.4.0)
##  base                 * 3.4.0    2017-04-21 local         
##  base64enc              0.1-3    2015-07-28 CRAN (R 3.4.0)
##  Biobase              * 2.36.2   2017-05-04 Bioconductor  
##  BiocGenerics         * 0.22.0   2017-04-25 Bioconductor  
##  BiocParallel           1.10.1   2017-05-03 Bioconductor  
##  biomaRt                2.32.1   2017-06-09 Bioconductor  
##  Biostrings             2.44.1   2017-06-01 Bioconductor  
##  bit                    1.1-12   2014-04-09 CRAN (R 3.4.0)
##  bit64                  0.9-7    2017-05-08 CRAN (R 3.4.0)
##  bitops                 1.0-6    2013-08-17 CRAN (R 3.4.0)
##  blob                   1.1.0    2017-06-17 CRAN (R 3.4.0)
##  BSgenome               1.44.0   2017-04-25 Bioconductor  
##  bumphunter             1.17.2   2017-05-09 Bioconductor  
##  checkmate              1.8.2    2016-11-02 CRAN (R 3.4.0)
##  cluster                2.0.6    2017-03-10 CRAN (R 3.4.0)
##  codetools              0.2-15   2016-10-05 CRAN (R 3.4.0)
##  colorspace             1.3-2    2016-12-14 CRAN (R 3.4.0)
##  compiler               3.4.0    2017-04-21 local         
##  data.table             1.10.4   2017-02-01 CRAN (R 3.4.0)
##  datasets             * 3.4.0    2017-04-21 local         
##  DBI                    0.7      2017-06-18 CRAN (R 3.4.0)
##  DelayedArray         * 0.2.7    2017-06-03 Bioconductor  
##  derfinder              1.10.4   2017-05-11 Bioconductor  
##  derfinderHelper        1.10.0   2017-04-25 Bioconductor  
##  devtools             * 1.13.2   2017-06-02 CRAN (R 3.4.0)
##  digest                 0.6.12   2017-01-27 CRAN (R 3.4.0)
##  doRNG                  1.6.6    2017-04-10 CRAN (R 3.4.0)
##  downloader           * 0.4      2015-07-09 CRAN (R 3.4.0)
##  DT                   * 0.2      2016-08-09 CRAN (R 3.4.0)
##  evaluate               0.10     2016-10-11 CRAN (R 3.4.0)
##  foreach                1.4.3    2015-10-13 CRAN (R 3.4.0)
##  foreign                0.8-69   2017-06-21 CRAN (R 3.4.0)
##  Formula                1.2-1    2015-04-07 CRAN (R 3.4.0)
##  GenomeInfoDb         * 1.12.2   2017-06-09 Bioconductor  
##  GenomeInfoDbData       0.99.0   2017-02-14 Bioconductor  
##  GenomicAlignments      1.12.1   2017-05-12 Bioconductor  
##  GenomicFeatures        1.28.3   2017-06-09 Bioconductor  
##  GenomicFiles           1.12.0   2017-04-26 Bioconductor  
##  GenomicRanges        * 1.28.3   2017-05-25 Bioconductor  
##  GEOquery               2.42.0   2017-04-25 Bioconductor  
##  ggplot2                2.2.1    2016-12-30 CRAN (R 3.4.0)
##  graphics             * 3.4.0    2017-04-21 local         
##  grDevices            * 3.4.0    2017-04-21 local         
##  grid                   3.4.0    2017-04-21 local         
##  gridExtra              2.2.1    2016-02-29 CRAN (R 3.4.0)
##  gtable                 0.2.0    2016-02-26 CRAN (R 3.4.0)
##  Hmisc                  4.0-3    2017-05-02 CRAN (R 3.4.0)
##  htmlTable              1.9      2017-01-26 CRAN (R 3.4.0)
##  htmltools              0.3.6    2017-04-28 CRAN (R 3.4.0)
##  htmlwidgets            0.8      2016-11-09 CRAN (R 3.4.0)
##  httr                   1.2.1    2016-07-03 CRAN (R 3.4.0)
##  IRanges              * 2.10.2   2017-05-25 Bioconductor  
##  iterators              1.0.8    2015-10-13 CRAN (R 3.4.0)
##  jsonlite               1.5      2017-06-01 CRAN (R 3.4.0)
##  knitr                  1.16     2017-05-18 CRAN (R 3.4.0)
##  lattice                0.20-35  2017-03-25 CRAN (R 3.4.0)
##  latticeExtra           0.6-28   2016-02-09 CRAN (R 3.4.0)
##  lazyeval               0.2.0    2016-06-12 CRAN (R 3.4.0)
##  locfit                 1.5-9.1  2013-04-20 CRAN (R 3.4.0)
##  magrittr               1.5      2014-11-22 CRAN (R 3.4.0)
##  Matrix                 1.2-10   2017-04-28 CRAN (R 3.4.0)
##  matrixStats          * 0.52.2   2017-04-14 CRAN (R 3.4.0)
##  memoise                1.1.0    2017-04-21 CRAN (R 3.4.0)
##  methods              * 3.4.0    2017-04-21 local         
##  munsell                0.4.3    2016-02-13 CRAN (R 3.4.0)
##  nnet                   7.3-12   2016-02-02 CRAN (R 3.4.0)
##  parallel             * 3.4.0    2017-04-21 local         
##  pkgmaker               0.22     2014-05-14 CRAN (R 3.4.0)
##  plyr                   1.8.4    2016-06-08 CRAN (R 3.4.0)
##  qvalue                 2.8.0    2017-04-25 Bioconductor  
##  R6                     2.2.2    2017-06-17 CRAN (R 3.4.0)
##  RColorBrewer           1.1-2    2014-12-07 CRAN (R 3.4.0)
##  Rcpp                   0.12.11  2017-05-22 CRAN (R 3.4.0)
##  RCurl                  1.95-4.8 2016-03-01 CRAN (R 3.4.0)
##  recount              * 1.2.1    2017-05-06 Bioconductor  
##  registry               0.3      2015-07-08 CRAN (R 3.4.0)
##  rentrez                1.1.0    2017-06-01 CRAN (R 3.4.0)
##  reshape2               1.4.2    2016-10-22 CRAN (R 3.4.0)
##  rlang                  0.1.1    2017-05-18 CRAN (R 3.4.0)
##  rmarkdown              1.6      2017-06-15 CRAN (R 3.4.0)
##  rngtools               1.2.4    2014-03-06 CRAN (R 3.4.0)
##  rpart                  4.1-11   2017-03-13 CRAN (R 3.4.0)
##  rprojroot              1.2      2017-01-16 CRAN (R 3.4.0)
##  Rsamtools              1.28.0   2017-04-25 Bioconductor  
##  RSQLite                2.0      2017-06-19 CRAN (R 3.4.0)
##  rtracklayer            1.36.3   2017-05-25 Bioconductor  
##  S4Vectors            * 0.14.3   2017-06-03 Bioconductor  
##  scales                 0.4.1    2016-11-09 CRAN (R 3.4.0)
##  splines                3.4.0    2017-04-21 local         
##  stats                * 3.4.0    2017-04-21 local         
##  stats4               * 3.4.0    2017-04-21 local         
##  stringi                1.1.5    2017-04-07 CRAN (R 3.4.0)
##  stringr                1.2.0    2017-02-18 CRAN (R 3.4.0)
##  SummarizedExperiment * 1.6.3    2017-05-29 Bioconductor  
##  survival               2.41-3   2017-04-04 CRAN (R 3.4.0)
##  tibble                 1.3.3    2017-05-28 CRAN (R 3.4.0)
##  tools                  3.4.0    2017-04-21 local         
##  utils                * 3.4.0    2017-04-21 local         
##  VariantAnnotation      1.22.2   2017-06-22 Bioconductor  
##  withr                  1.0.2    2016-06-20 CRAN (R 3.4.0)
##  XML                    3.98-1.9 2017-06-19 CRAN (R 3.4.0)
##  xtable                 1.8-2    2016-02-05 CRAN (R 3.4.0)
##  XVector                0.16.0   2017-04-25 Bioconductor  
##  yaml                   2.1.14   2016-11-12 CRAN (R 3.4.0)
##  zlibbioc               1.22.0   2017-04-25 Bioconductor