GMD pipeline | ExScalibur

How to Run ExScalibur GMD Pipeline: Detection of Germline Variants in Whole Exome Sequencing Data, Step by Step

In this tutorial, we will perform the full WES data analysis to detect germline mutations from three family members, father (SRR504517), mother (SRR504515), and child (SRR504516). Please note that though we used a trio in our tutorial, GMD does not assume any relationship between individuals and can run on any number of samples given sufficient computational power.

To illustrate the basic steps of the GMD pipeline, we retrieved a subset of reads from the original WES FastQ files and prepared small paired-end sequencing FastQ files for testing. These reads cover ~30 genes of the human genome. The full analysis takes approximately 2 hours with 2 nodes (8 cores each) on AWS EC2.

Outline

For full documentation of ExScalibur GMD pipeline, please go to Documentation.

Sample data overview

For this tutorial, we provide small 2x101bp paired-end FastQ files from three samples:

After downloading and installing the pipelines, you may find the sample data and pre-built files and scripts at directory GMDinstallationPath/data/.

##  Pipeline data directory
$ ls data/
data/Build_ExScaliburGMD.LCAexome.sh            ## Shell script to build pipeline scripts for project LCAexome
data/father_SRR504517_1.fastq.gz                ## R1 FQ for sample father
data/father_SRR504517_2.fastq.gz                ## R2 FQ for sample father
data/hg19.exons.refGene.bed                     ## Exome target region BED file
data/LCAdau_SRR504516_1.fastq.gz                ## R1 FQ for sample LCAdau
data/LCAdau_SRR504516_2.fastq.gz                ## R2 FQ for sample LCAdau
data/LCAexome.metadata.txt                      ## Metadata table for project LCAexome
data/LCAexome.pipeline.yaml                     ## Pipeline configuration file for project LCAexome
data/mother_SRR504515_1.fastq.gz                ## R1 FQ for sample mother
data/mother_SRR504515_2.fastq.gz                ## R2 FQ for sample mother

In the next step, we will copy some of the files to our working directory and start building up the project directory the pipeline scripts.

Step 1 - Prepare input files

To run the pipeline, as a user, you will need to prepare two input files:

Metadata table A tab-delimited table containing sample and sequencing information
Pipeline configuration file A YAML file containing the environmental settings, pipeline flags, software parameters and reference files

Description of each field in metadata and pipeline config files can be found in Documentation.

Step 2 - Build project directory and configuration files

First, let us create our own working directory.
```
$ mkdir myData
```

Then, copy the pre-built files from pipeline data directory to our working directory (replace GMDinstallationPath with the pipeline installation path on your system).

$ cd myData
$ cp -p GMDinstallationPath/data/Build_ExScaliburGMD.LCAexome.sh myData
$ cp -p GMDinstallationPath/data/LCAexome.metadata.txt myData
$ cp -p GMDinstallationPath/data/LCAexome.pipeline.yaml myData

Now we take a look at the three files in our working directory.

$ ls myData
Build_ExScaliburGMD.LCAexome.sh
LCAexome.metadata.txt
LCAexome.pipeline.yaml

LCAexome.metadata.txt Sample metadata table
LCAexome.pipeline.yaml Pipeline configuration file
Build_ExScaliburGMD.LCAexome.sh Build pipeline script
- Calls Build_ExScaliburGMD.pl script from the pipeline installation directory
- Creates the project directory and generates project and sample configuration files for the pipeline run

We can use more command to peek into the build shell script.

## Show the content
$ more Build_ExScaliburGMD.LCAexome.sh

## build pipeline scripts 
now=$(date +"%m-%d-%Y_%H:%M:%S")

## script 
BuildExScaliburGMDExe=Build_ExScaliburGMD.pl

## project info 
project=LCAexome
metadata=LCAexome.metadata.txt
config=LCAexome.pipeline.yaml
projdir=$PWD/LCAexomeProj

## options
threads=4
mapq=30
force="--force"
split="--split"
tree="--tree tree"

## bds 
platform="--cluster"
scheduler="--sge"
bdscfg="--bdscfg /data/rbao/BDS-ExScaliburGMD-032215/.bds/bds.sge.config"
retry=0

## command 
echo "START" `date`
perl $BuildExScaliburGMDExe \
	--project $project \
	--metadata $metadata \
	--config $config \
	--projdir $projdir \
	--threads $threads \
	--mapq $mapq \
	--retry $retry \
	$force \
	$split \
	$tree \
	$platform \
	$scheduler \
	$bdscfg \
	>& Build_ExScaliburGMD.$project.$now.log
echo "END" `date`

Next, we will need to update a few fields in the build shell script Build_ExScaliburGMD.LCAexome.shaccording to your project and system settings.
Any parameter in the script may be modified.
In particular, we require to update
- project=your project name (no space allowed)
- metadata=sample metadata table
- config=pipeline configuration YAML file
- projdir=project output directory (must be full path)
In addition, we recommend to update
- platform=your machine type (local, ssh or cluster; details on BigDataScript)
- scheduler=job scheduler on the cluster (moab, pbs or sge; details on BigDataScript)
- bdscfg=BDS configuration file (if you want to use your own BDS cfg with tuned up settings; details on BigDataScript)
For description of each parameter in Build_ExScaliburGMD.pl, please go to Documentation.
Lastly, we will run this build shell script to prepare project directory LCAexomeProj and all config files ready for the pipeline run.
```
## Run build
$ ./Build_ExScaliburGMD.LCAexome.sh
## Check output directory/files
$ ls Submit_ExScalibur.LCAexome.sh
$ ls LCAexomeProj/
```
Submit_ExScalibur.LCAexome.sh is the master job submission script for the pipeline.

Step 3 - Submit master script

This step is rather simple. We will launch the pipeline by first creating a screen on the machine, and then run the master job submission script.

## Create a screen on the system
$ screen -S exscalgmd
## Within the screen, run the submission script
$ ./Submit_ExScalibur.LCAexome.sh
## Detach the screen
$ screen -d

Now, the pipeline starts and will run all analysis from QC to the annotated variants, with pipeline and project report generated at the end of the run.
You may grab a coffee, enjoy some sunshine and breezes, and then come back. The results will be ready for you.
It takes approximately 2 hours to finish on sample data.

Step 4 - Monitor pipeline progress

While the pipeline is running, you may check the progress of the pipeline instantly.
To do this, simple peek into the log file

## Show the last few lines of the pipeline log file
$ tail Run_ExScaliburGMD.LCAexome.timestamp.logout

Note that timestamp refers to the time printed in your log file name.
If you want to reattach the screen
$ screen -r exscalgmd

You may also check the status of submitted jobs by qstat on a cluster, or top -u yourUsrName to check the running stage of your jobs.

Step 5 - Obtain results

Once pipeline run finished, we provide a script in the pipeline util directory Collect_results.pl to put the most important result files into archive in the project directory.
To run it, try

$ Collect_Results.pl -f LCAexome.archive.yaml -t -o LCAexomeProj/archive/LCAexome_archive

Then all results will be collected in LCAexomeProj/archive/LCAexome_archive, and (except BAM files) provided as a single LCAexome_archive.slim.tar.gz file for download.

Step 6 - Visualization

To visualize the data analysis results, we provide a standalone RShiny utility ExSCaliburViz to view the QC, alignment and variant calls interactively on your own desktop.
INSERT VIZ SCREEN SHOT IMAGE
In addition, the alignment BAM files can be downloaded and visualized on Integrative Genomics Viewer.
INSERT IGV SCREEN SHOT IMAGE

Tips

A few tips that might be useful for setting up or tuning up the pipeline run.

The recommended memory setting of BWA alignment jobs is 6GB. This is also the default setting in the pipeline configuration file. If set less, the job might run out of memory and get killed.
The recommended memory setting for GATK is 4GB. If set less, VM may not start with error returned as "GATK, Could not reserve enough space for object heap".

Questions