Introduction to BiocMAP

Sep 27, 2023 4 min read research, software, pipelines

Over the past few years, I’ve had the opportunity to work with a lot of whole-genome bisulfite-sequencing (WGBS) datasets. They provide a powerful way to look at DNA methylation across the entire genome, in contrast to microarrays, which target a narrower set of important CpG sites. But for this same reason, the data is often unwieldy, and can feel difficult to tackle even with access to powerful computational resources. At LIBD, we were excited by the opportunity to better characterize the role of methylation in development and in psychiatric disorders like schizophrenia, and we’ve performed WGBS on thousands of samples in just a few years.

The Challenges of WGBS

Despite the massive research opportunity, we had a huge computational challenge in the way. How could we turn thousands of raw sequencing files into methylation proportions at each cytosine? It’s not as though the basic logistics of this preprocessing task are unsolved. In fact, some great tools like nf-core/methylseq exist to chain together the various steps (alignment to a reference genome, counting methylated and unmethylated reads at each site, etc.) into a fairly easy-to-use workflow. Could we just use something like nf-core/methylseq?

At the scale of our datasets, existing pipeline tools could take years of (wall-clock!) computational time, even with access to a high-performance computing cluster. We also noticed that many existing solutions would simply run out of memory, even when allocated hundreds of gigabytes of RAM. We knew that our situation was unusual, and that we’d need to carefully implement a workflow optimized for speed and efficient memory use.

Our Solution

We developed BiocMAP after refining our internal preprocessing workflow. Most of the speed gain came simply from using Arioc, a GPU-based tool for alignment to the reference genome, at a time when the standard in the field was Bismark or other CPU-based tools. We limited memory usage with tricks like splitting data by chromosome and using disk-based backends where possible; we describe the details in the manuscript.
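To give a flavor of the per-chromosome trick, here is a minimal shell sketch. It is not BiocMAP’s actual code: the helper name split_by_chrom, the .tsv suffix, and the assumption that the input is a tab-separated report whose first column is the chromosome name are all illustrative.

```shell
# Hypothetical helper (not part of BiocMAP): split a tab-separated report,
# assumed to carry the chromosome name in column 1, into one file per chromosome.
split_by_chrom() {
    local report="$1" outdir="$2"
    mkdir -p "$outdir"
    # Each row is appended to the file named after its first column
    awk -F'\t' -v outdir="$outdir" '{ print > (outdir "/" $1 ".tsv") }' "$report"
}

# Example call (paths are placeholders):
#   split_by_chrom sample1_report.tsv sample1_by_chrom
```

Each per-chromosome file can then be processed on its own, so peak memory depends on the largest chromosome rather than on the whole genome.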

The “Bioc” in BiocMAP stands for Bioconductor-friendly: BiocMAP collects all the methylation counts and proportions into SummarizedExperiment-based objects in R, since these objects are how the Bioconductor community likes to represent experimental data of all kinds. A whole ecosystem of R packages is built around performing statistical analyses on SummarizedExperiment-based objects.

Image credit: Morgan et al, retrieved from https://bioconductor.org/packages/release/bioc/vignettes/SummarizedExperiment/inst/doc/SummarizedExperiment.html

So I’m excited that we’ve just published the paper and that the software is ready to share with the world!

Using BiocMAP

We aimed to make BiocMAP simple to install and use on a variety of computing environments, while allowing a good deal of customization for interested users. I’ll show examples of running BiocMAP on a SLURM-managed cluster, though running on an SGE-managed cluster or just a single machine is possible too.

#   Install BiocMAP with Singularity
git clone git@github.com:LieberInstitute/BiocMAP.git
cd BiocMAP
bash install_software.sh singularity

#   Submit the first module to SLURM
sbatch run_first_half_slurm.sh

The install_software.sh script lets you set BiocMAP up with either Docker or Singularity, and we provide shell scripts and configuration files that make it easy to run on computing clusters with job schedulers like SLURM or SGE.
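If you are not sure which container runtime a given machine offers, a small check like the following can pick the install mode for you. This is only a sketch: pick_container_mode is a hypothetical helper, not something shipped with BiocMAP, and it assumes the docker mode is passed the same way as the singularity mode shown above.

```shell
# Hypothetical helper (not part of BiocMAP): print the container runtime
# found on PATH, preferring Singularity since it is common on HPC clusters.
pick_container_mode() {
    if command -v singularity >/dev/null 2>&1; then
        echo singularity
    elif command -v docker >/dev/null 2>&1; then
        echo docker
    else
        echo "neither singularity nor docker found on PATH" >&2
        return 1
    fi
}

# Example call, from inside the cloned BiocMAP directory:
#   bash install_software.sh "$(pick_container_mode)"
```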

We split the BiocMAP pipeline into two pieces because our experience with processing WGBS data involved collaboration and the use of more than one computing cluster. Since the use of GPUs is still sometimes seen as a new thing, some clusters may have more impressive GPU resources than others, while others may have more CPUs or overall memory. We found it useful to allow the flexibility of running the GPU-intensive alignment in a potentially different location than the remaining analysis steps. Nothing’s stopping you from running everything on one machine or cluster though:

sbatch run_first_half_slurm.sh

#   Once the first module finishes:
sbatch run_second_half_slurm.sh
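If both halves run on the same SLURM cluster, you can also let the scheduler handle the waiting. The sketch below uses standard sbatch options (--parsable and --dependency=afterok); the wrapper name submit_both_halves is hypothetical, and it assumes that the job started by run_first_half_slurm.sh stays alive until the first module is complete and that the second half is already configured.

```shell
# Hypothetical wrapper (not shipped with BiocMAP): submit both halves at once
# and let SLURM start the second only after the first exits successfully.
submit_both_halves() {
    local first_id
    # --parsable prints "jobid" or "jobid;cluster" instead of a full sentence
    first_id=$(sbatch --parsable run_first_half_slurm.sh) || return 1
    first_id=${first_id%%;*}
    # afterok: start only if the first job finishes with exit code 0
    sbatch --dependency=afterok:"$first_id" run_second_half_slurm.sh
}

# Example call, from inside the BiocMAP directory:
#   submit_both_halves
```

If the first half fails, the second job never starts and can be removed with scancel.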

Do you have a lot of WGBS data and access to GPUs? BiocMAP may help you power through the preprocessing so you can focus on the interesting part of your research: the statistical analysis.

Check out our manuscript, code, and documentation!

