Code and results for the recount-brain project, which enhances the recount2 project. The recount_brain table can be accessed via the recount (Collado-Torres, Nellore, Kammers, Ellis, et al., 2017) Bioconductor package using recount::add_metadata(source = 'recount_brain_v2').
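
As a minimal sketch of the access pattern described above, the helper below wraps the recount::add_metadata() call and restricts the source to the two version names used in this README. The helper name load_recount_brain is hypothetical and not part of recount. The sketch assumes that calling add_metadata() without an rse argument returns the metadata table itself, as the calls shown in this README suggest.

```r
# Hypothetical helper (not part of recount): fetch a recount_brain table by
# version name. Running it requires the recount package and internet access.
load_recount_brain <- function(version = c('recount_brain_v2', 'recount_brain_v1')) {
    # Accept only the two source names mentioned in this README
    version <- match.arg(version)

    if (!requireNamespace('recount', quietly = TRUE)) {
        stop('Please install the recount package from Bioconductor first.')
    }

    # Assumption: with no rse supplied, add_metadata() returns the table
    recount::add_metadata(source = version)
}
```

For example, recount_brain <- load_recount_brain() would retrieve the version 2 table. To append the metadata to an existing RangedSummarizedExperiment object instead, pass it through the rse argument of recount::add_metadata().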

Contents

  • select_studies uses version 0.0.03 of the predicted phenotype information by Shannon Ellis et al. (Ellis, Collado-Torres, Jaffe, and Leek, 2018) to determine candidate studies for recount_brain: studies from the Sequence Read Archive (SRA) that have at least 4 samples, over 70% of which are from the brain. It creates the list of candidate projects saved in projects_lists.txt.
  • SRA_run_selector_info contains a table per study in projects_lists.txt with the data downloaded from the SRA Run Selector website https://www.ncbi.nlm.nih.gov/Traces/study/.
  • SRA_metadata contains a CSV table with the curated metadata for each study. This is the data that is then used to create recount_brain. Note that not all candidate studies were brain studies, so the final number of projects considered is 62.
  • merged_metadata contains the recount_brain table that can be easily accessed via recount (Collado-Torres, Nellore, Kammers, Ellis, et al., 2017) using the add_metadata() function. The document merging_data describes how recount_brain was created using the files from SRA_metadata and includes some brief examples of how to explore the recount_brain table. You can access this initial version of recount_brain using recount::add_metadata(source = 'recount_brain_v1').
  • metadata_reproducibility contains a document describing how the metadata was processed for each SRA study. It is intended to be useful for reproducibility purposes.
  • The cross_studies_metadata directory contains the cross_studies_metadata document describing how the recount-brain version 1 table was merged with GTEx and TCGA brain samples metadata to create the recount-brain version 2 table that facilitates cross-study comparisons. SupplementaryTable2.csv describes which fields from the GTEx and TCGA data were used to merge them with recount_brain and any manipulations required to do so. The cross_studies_metadata directory also contains a second document, recount_brain_ontologies, with the code used for adding Brodmann area, disease, and tissue ontology information to recount_brain. This final table is the one you can access using recount::add_metadata(source = 'recount_brain_v2').
  • metasra_comp contains a comparison of recount_brain_v2 and MetaSRA (Bernstein, Doan, and Dewey, 2017) as described in the metasra_comp html document.
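
The candidate study criteria used by select_studies (at least 4 samples, over 70% of them from the brain) can be sketched as follows. This is an illustration on made-up data, not the code from select_studies: the data frame pred, its column names, and the project names are all hypothetical.

```r
# Hypothetical per-sample tissue predictions; the column names and project
# identifiers are assumptions made for this illustration only.
pred <- data.frame(
    project = c(rep('SRP_A', 5), rep('SRP_B', 4), rep('SRP_C', 3)),
    predicted_tissue = c(
        'Brain', 'Brain', 'Brain', 'Brain', 'Liver', # SRP_A: 5 samples, 80% brain
        'Brain', 'Brain', 'Liver', 'Liver',          # SRP_B: 4 samples, 50% brain
        'Brain', 'Brain', 'Brain'                    # SRP_C: only 3 samples
    ),
    stringsAsFactors = FALSE
)

# Number of samples and proportion of brain samples per project
n_samples <- tapply(pred$predicted_tissue, pred$project, length)
prop_brain <- tapply(pred$predicted_tissue == 'Brain', pred$project, mean)

# Keep projects with at least 4 samples and over 70% brain samples
candidates <- names(n_samples)[n_samples >= 4 & prop_brain > 0.70]
candidates
```

In this toy example only SRP_A passes both thresholds: SRP_B has too few brain samples and SRP_C has too few samples overall.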

Example analyses

  • We used the data from SRP027383 (Bao, Chen, Yang, Zhang, et al., 2014) to show how recount_brain can be used for a gene differential expression analysis. See the full example for more information: example_SRP027383. You can also access the pdf version if you prefer it over the HTML version.
  • We used the data from ten studies to replicate some of the analyses by Ferreira et al. (Ferreira, Muñoz-Aguirre, Reverter, Godinho, et al., 2018) that explore the relationship between post-mortem interval and gene expression. See the full example for more information: example_PMI. You can also access the pdf version if you prefer it over the HTML version.
  • We also illustrate how to perform an analysis across multiple studies present in recount_brain and combine them with specific tissue data from The Cancer Genome Atlas (TCGA). See the full example for more information: example_multistudy. You can also access the pdf version if you prefer it over the HTML version.

List of variables

This information is also available as a csv file at SupplementaryTable1.csv.

  1. age: Age of donor
  2. age_units: Units of age - (Years / Months / Post Conception Weeks)
  3. assay_type_s: Sequencing technique - (RNA-Seq)
  4. avgspotlen_l: Average length of sequenced read
  5. bioproject_s: NCBI BioProject ID
  6. biosample_s: NCBI BioSample ID
  7. brain_bank: Brain tissue repository source
  8. brodmann_area: Brodmann area for tissue from cerebral cortex - (1-52)
  9. cell_line: Cell line description
  10. center_name_s: Project center
  11. clinical_stage_1: Clinically relevant tissue sample information
  12. clinical_stage_2: Clinically relevant tissue sample information
  13. consent_s: Data availability - (Public)
  14. development: Stage of human development - (Fetus / Infant / Child / Adolescent / Adult)
  15. disease: Disease description
  16. disease_status: Nature of tissue - (Disease / Control)
  17. experiment_s: NCBI Experiment ID
  18. hemisphere: Cerebral hemisphere - (Left / Right)
  19. insertsize_l: Length of sequence between adaptors
  20. instrument_s: High throughput sequencing system
  21. library_name_s: Internal sample ID used by original study
  22. librarylayout_s: Sequencing layout - (Single / Paired)
  23. libraryselection_s: Sequencing library - (cDNA)
  24. librarysource_s: Sequencing source - (Transcriptomic)
  25. loaddate_s: Sequencing load date
  26. mbases_l: Megabases
  27. mbytes_l: Megabytes
  28. organism_s: Organism - (Homo sapiens)
  29. pathology: Tissue pathology
  30. platform_s: Sequencing platform - (Illumina)
  31. pmi: Postmortem interval
  32. pmi_units: Units of postmortem interval - (Hours)
  33. preparation: Specimen preparation - (Frozen)
  34. present_in_recount: Expression data present in recount2
  35. race: Race of donor - (Asian / Black / Hispanic / White)
  36. releasedate_s: Sequencing release date
  37. rin: RNA integrity number
  38. run_s: NCBI Run ID
  39. sample_name_s: GEO Accession ID
  40. sample_origin: Tissue origin - (Brain / iPSC)
  41. sex: Sex of donor - (Female / Male)
  42. sra_sample_s: NCBI SRA Sample ID
  43. sra_study_s: NCBI SRA Study ID
  44. tissue_site_1: Anatomic site of tissue
  45. tissue_site_2: Anatomic site of tissue, further specified
  46. tissue_site_3: Anatomic site of tissue, further specified
  47. tumor_type: Type of tumor - (Glioblastoma / Astrocytoma / Ependymoma / Oligodendroglioma)
  48. viability: Tissue viability - (Postmortem / Biopsy)

You can access this initial version with recount::add_metadata(source = 'recount_brain_v1').
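
As a small sketch of how the variables listed above might be used to select samples, the example below filters a toy table that reuses a few of the documented column names. The rows are invented for illustration, and treating present_in_recount as a logical column is an assumption.

```r
# Toy table reusing some recount_brain column names; all values are made up.
toy <- data.frame(
    run_s = c('RUN1', 'RUN2', 'RUN3', 'RUN4'),
    disease_status = c('Disease', 'Control', 'Control', 'Disease'),
    sample_origin = c('Brain', 'Brain', 'iPSC', 'Brain'),
    present_in_recount = c(TRUE, TRUE, TRUE, FALSE), # assumed to be logical
    stringsAsFactors = FALSE
)

# Keep brain-derived samples that have expression data in recount2
keep <- toy$present_in_recount & toy$sample_origin == 'Brain'
brain <- toy[keep, ]

# Tabulate cases versus controls among the retained samples
table(brain$disease_status)
```

The same pattern should apply to the real table once it has been loaded, after checking how each column is actually encoded.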

List of additional variables present in recount_brain_v2.

  1. Study_full: either the SRA study accession, GTEX or TCGA.
  2. drugName_full: the drug name for TCGA samples.
  3. drug_info_full: logical, whether the sample has drug information; only for TCGA.
  4. drug_type_full: the drug classification (chemotherapy, immunotherapy, …); only for TCGA.
  5. full_260_280: the 260 to 280 ratio; only for TCGA.
  6. count_file_identifier: the SRA run accession or the TCGA run (sample) identifier. Useful for merging with the rest of recount2 metadata.
  7. Dataset: either SRA, GTEX or TCGA.
  8. brodmann_ontology: URL for the Brodmann region ontology. See the recount_brain_ontologies file for how this information was added.
  9. brodmann_synonyms: synonyms used for the Brodmann regions. These facilitate text based searches. Separated by |.
  10. brodmann_parents: URLs for the Brodmann ontology parents. Separated by |.
  11. brodmann_parents_label: Brodmann ontology parent text preferred labels. Separated by |.
  12. disease_ontology: URL for the disease ontology.
  13. tissue: tissue as prioritized by tissue_site_3 over tissue_site_2 over tissue_site_1.
  14. tissue_ontology: URL for the tissue ontology.
  15. tissue_synonyms: tissue synonyms which facilitate text based searches. Separated by |.
  16. tissue_parents: URLs for the tissue ontology parents. Separated by |.
  17. tissue_parents_label: tissue ontology parent text preferred labels. Separated by |.

You can access this version with recount::add_metadata(source = 'recount_brain_v2').
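
Two conventions from the list above lend themselves to a short illustration: the tissue variable prioritizes tissue_site_3 over tissue_site_2 over tissue_site_1, and multi-valued fields such as tissue_synonyms are separated by |. The sketch below reproduces both on a toy table. The values are invented, and this is not the code from recount_brain_ontologies.

```r
# Toy table with made-up values; NA marks information that is unavailable.
meta <- data.frame(
    tissue_site_1 = c('Cerebral cortex', 'Cerebellum', NA),
    tissue_site_2 = c('Frontal lobe', NA, NA),
    tissue_site_3 = c('Dorsolateral prefrontal cortex', NA, NA),
    tissue_synonyms = c('DLPFC|dorsolateral prefrontal cortex', 'cerebellum', NA),
    stringsAsFactors = FALSE
)

# Most specific site available: tissue_site_3, then tissue_site_2, then tissue_site_1
meta$tissue <- ifelse(
    !is.na(meta$tissue_site_3), meta$tissue_site_3,
    ifelse(!is.na(meta$tissue_site_2), meta$tissue_site_2, meta$tissue_site_1)
)

# Split the |-separated synonyms into one character vector per sample
synonyms <- strsplit(meta$tissue_synonyms, '|', fixed = TRUE)

# Case-insensitive text search across the synonyms
has_dlpfc <- grepl('dlpfc', meta$tissue_synonyms, ignore.case = TRUE)
```

Note that fixed = TRUE is needed in strsplit() because | is otherwise interpreted as the alternation operator in a regular expression.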

Explore interactively

We recommend opening the interactive recount_brain exploration in another window.

This application is a custom version of shinycsv (Collado-Torres, Semick, and Jaffe, 2018). The code for making this application is available in the shinytable directory.

Questions

If you have any questions about recount_brain please post them as an issue at LieberInstitute/recount-brain and include the relevant session information using the following code. Thank you!

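# Load the package that reports the R session details
library('sessioninfo')
# Widen the console output so the package table is easier to read
options(width = 120)
# Print the R version, platform, and loaded package versions
session_info()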

References

The analyses were made possible thanks to BioPortal (Whetzel, Noy, Shah, Alexander, et al., 2011), MetaSRA (Bernstein, Doan, and Dewey, 2017), and:

Bibliography file

[1] Z. Bao, H. Chen, M. Yang, C. Zhang, et al. “RNA-seq of 272 gliomas revealed a novel, recurrent PTPRZ1-MET fusion transcript in secondary glioblastomas”. In: Genome Research 24.11 (Aug. 2014), pp. 1765–1773. DOI: 10.1101/gr.165126.113. URL: https://doi.org/10.1101/gr.165126.113.

[2] M. N. Bernstein, A. Doan, and C. N. Dewey. “MetaSRA: normalized human sample-specific metadata for the Sequence Read Archive”. In: Bioinformatics 33.18 (May 2017). Ed. by J. Wren, pp. 2914–2923. DOI: 10.1093/bioinformatics/btx334. URL: https://doi.org/10.1093/bioinformatics/btx334.

[3] C. Boettiger. knitcitations: Citations for ‘Knitr’ Markdown Files. R package version 1.0.8. 2017. URL: https://CRAN.R-project.org/package=knitcitations.

[4] W. Chang. downloader: Download Files over HTTP and HTTPS. R package version 0.4. 2015. URL: https://CRAN.R-project.org/package=downloader.

[5] L. Collado-Torres, A. Nellore, K. Kammers, S. E. Ellis, et al. “Reproducible RNA-seq analysis using recount2”. In: Nature Biotechnology (2017). DOI: 10.1038/nbt.3838. URL: http://www.nature.com/nbt/journal/v35/n4/full/nbt.3838.html.

[6] L. Collado-Torres, S. Semick, and A. E. Jaffe. shinycsv: Explore a table interactively in a shiny application. R package version 0.99.8. 2018. URL: https://github.com/LieberInstitute/shinycsv.

[7] G. Csárdi, R. core, H. Wickham, W. Chang, et al. sessioninfo: R Session Information. R package version 1.1.1. 2018. URL: https://CRAN.R-project.org/package=sessioninfo.

[8] S. E. Ellis, L. Collado-Torres, A. E. Jaffe, and J. T. Leek. “Improving the value of public RNA-seq expression data by phenotype prediction”. In: Nucl. Acids Res. (2018). DOI: 10.1093/nar/gky102. URL: https://doi.org/10.1093/nar/gky102.

[9] P. G. Ferreira, M. Muñoz-Aguirre, F. Reverter, C. P. S. Godinho, et al. “The effects of death and post-mortem cold ischemia on human tissue transcriptomes”. In: Nature Communications 9.1 (Feb. 2018). DOI: 10.1038/s41467-017-02772-x. URL: https://doi.org/10.1038/s41467-017-02772-x.

[10] A. Oleś, M. Morgan, and W. Huber. BiocStyle: Standard styles for vignettes and other Bioconductor documents. R package version 2.10.0. 2018. URL: https://github.com/Bioconductor/BiocStyle.

[11] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria, 2019. URL: https://www.R-project.org/.

[12] P. L. Whetzel, N. F. Noy, N. H. Shah, P. R. Alexander, et al. “BioPortal: enhanced functionality via new Web services from the National Center for Biomedical Ontology to access and use ontologies in software applications”. In: Nucleic Acids Research 39.suppl (Jun. 2011), pp. W541–W545. DOI: 10.1093/nar/gkr469. URL: https://doi.org/10.1093/nar/gkr469.

[13] H. Wickham. tidyverse: Easily Install and Load the ‘Tidyverse’. R package version 1.2.1. 2017. URL: https://CRAN.R-project.org/package=tidyverse.

[14] H. Wickham, J. Hester, and W. Chang. devtools: Tools to Make Developing R Packages Easier. R package version 2.0.2. 2019. URL: https://CRAN.R-project.org/package=devtools.

[15] Y. Xie. “knitr: A Comprehensive Tool for Reproducible Research in R”. In: Implementing Reproducible Computational Research. Ed. by V. Stodden, F. Leisch and R. D. Peng. ISBN 978-1466561595. Chapman and Hall/CRC, 2014. URL: http://www.crcpress.com/product/isbn/9781466561595.

[16] Y. Xie, J. Cheng, and X. Tan. DT: A Wrapper of the JavaScript Library ‘DataTables’. R package version 0.5. 2018. URL: https://CRAN.R-project.org/package=DT.