How to Run ExScalibur GMD Pipeline: Detection of Germline Variants in Whole Exome Sequencing Data, Step by Step
In this tutorial, we will perform the full WES data analysis to detect germline mutations from three family members, father (SRR504517), mother (SRR504515), and child (SRR504516). Please note that though we used a trio in our tutorial, GMD does not assume any relationship between individuals and can run on any number of samples given sufficient computational power.
To illustrate the basic steps of the GMD pipeline, we retrieved a subset of reads from the original WES FastQ files and prepared small paired-end sequencing FastQ files for testing. These reads cover ~30 genes of the human genome. The full analysis takes approximately 2 hours with 2 nodes (8 cores each) on AWS EC2.
Outline
- Sample data overview
- Step 1 - Prepare input files and work directory
- Step 2 - Build project directory and configuration files
- Step 3 - Submit master script
- Step 4 - Monitor pipeline progress
- Step 5 - Obtain results
- Step 6 - Visualization
- A few useful tips for running the pipeline
- Questions
- More
For full documentation of ExScalibur GMD pipeline, please go to Documentation.
Sample data overview
For this tutorial, we provide small 2x101bp paired-end FastQ files from three samples:
- Mother
- Father
- Child
After downloading and installing the pipelines, you may find the sample data and pre-built files and scripts at directory GMDinstallationPath/data/
.
## Pipeline data directory $ ls data/ data/Build_ExScaliburGMD.LCAexome.sh ## Shell script to build pipeline scripts for project LCAexome data/father_SRR504517_1.fastq.gz ## R1 FQ for sample father data/father_SRR504517_2.fastq.gz ## R2 FQ for sample father data/hg19.exons.refGene.bed ## Exome target region BED file data/LCAdau_SRR504516_1.fastq.gz ## R1 FQ for sample LCAdau data/LCAdau_SRR504516_2.fastq.gz ## R2 FQ for sample LCAdau data/LCAexome.metadata.txt ## Metadata table for project LCAexome data/LCAexome.pipeline.yaml ## Pipeline configuration file for project LCAexome data/mother_SRR504515_1.fastq.gz ## R1 FQ for sample mother data/mother_SRR504515_2.fastq.gz ## R2 FQ for sample mother
In the next step, we will copy some of the files to our working directory and start building up the project directory the pipeline scripts.
Step 1 - Prepare input files
To run the pipeline, as a user, you will need to prepare two input files:
- Metadata table A tab-delimited table containing sample and sequencing information
- Pipeline configuration file A YAML file containing the environmental settings, pipeline flags, software parameters and reference files
Description of each field in metadata and pipeline config files can be found in Documentation.
Step 2 - Build project directory and configuration files
- First, let us create our own working directory.
$ mkdir myData
- Then, copy the pre-built files from pipeline data directory to our working directory (replace
GMDinstallationPath
with the pipeline installation path on your system).$ cd myData $ cp -p GMDinstallationPath/data/Build_ExScaliburGMD.LCAexome.sh myData $ cp -p GMDinstallationPath/data/LCAexome.metadata.txt myData $ cp -p GMDinstallationPath/data/LCAexome.pipeline.yaml myData
- Now we take a look at the three files in our working directory.
$ ls myData Build_ExScaliburGMD.LCAexome.sh LCAexome.metadata.txt LCAexome.pipeline.yaml
- LCAexome.metadata.txt Sample metadata table
- LCAexome.pipeline.yaml Pipeline configuration file
- Build_ExScaliburGMD.LCAexome.sh Build pipeline script
- Calls
Build_ExScaliburGMD.pl
script from the pipeline installation directory - Creates the project directory and generates project and sample configuration files for the pipeline run
- Calls
We can usemore
command to peek into the build shell script.## Show the content $ more Build_ExScaliburGMD.LCAexome.sh ## build pipeline scripts now=$(date +"%m-%d-%Y_%H:%M:%S") ## script BuildExScaliburGMDExe=Build_ExScaliburGMD.pl ## project info project=LCAexome metadata=LCAexome.metadata.txt config=LCAexome.pipeline.yaml projdir=$PWD/LCAexomeProj ## options threads=4 mapq=30 force="--force" split="--split" tree="--tree tree" ## bds platform="--cluster" scheduler="--sge" bdscfg="--bdscfg /data/rbao/BDS-ExScaliburGMD-032215/.bds/bds.sge.config" retry=0 ## command echo "START" `date` perl $BuildExScaliburGMDExe \ --project $project \ --metadata $metadata \ --config $config \ --projdir $projdir \ --threads $threads \ --mapq $mapq \ --retry $retry \ $force \ $split \ $tree \ $platform \ $scheduler \ $bdscfg \ >& Build_ExScaliburGMD.$project.$now.log echo "END" `date`
- Next, we will need to update a few fields in the build shell script
Build_ExScaliburGMD.LCAexome.sh
according to your project and system settings.
Any parameter in the script may be modified.
In particular, we require to update- project=your project name (no space allowed)
- metadata=sample metadata table
- config=pipeline configuration YAML file
- projdir=project output directory (must be full path)
In addition, we recommend to update
- platform=your machine type (local, ssh or cluster; details on BigDataScript)
- scheduler=job scheduler on the cluster (moab, pbs or sge; details on BigDataScript)
- bdscfg=BDS configuration file (if you want to use your own BDS cfg with tuned up settings; details on BigDataScript)
For description of each parameter inBuild_ExScaliburGMD.pl
, please go to Documentation. - Lastly, we will run this build shell script to prepare project directory
LCAexomeProj
and all config files ready for the pipeline run.## Run build $ ./Build_ExScaliburGMD.LCAexome.sh ## Check output directory/files $ ls Submit_ExScalibur.LCAexome.sh $ ls LCAexomeProj/
Submit_ExScalibur.LCAexome.sh
is the master job submission script for the pipeline.
Step 3 - Submit master script
This step is rather simple. We will launch the pipeline by first creating a screen on the machine, and then run the master job submission script.
## Create a screen on the system $ screen -S exscalgmd ## Within the screen, run the submission script $ ./Submit_ExScalibur.LCAexome.sh ## Detach the screen $ screen -d
Now, the pipeline starts and will run all analysis from QC to the annotated variants, with pipeline and project report generated at the end of the run.
You may grab a coffee, enjoy some sunshine and breezes, and then come back. The results will be ready for you.
It takes approximately 2 hours to finish on sample data.
Step 4 - Monitor pipeline progress
While the pipeline is running, you may check the progress of the pipeline instantly.
To do this, simple peek into the log file
## Show the last few lines of the pipeline log file
$ tail Run_ExScaliburGMD.LCAexome.timestamp
.logout
Note that timestamp
refers to the time printed in your log file name.
If you want to reattach the screen
$ screen -r exscalgmd
You may also check the status of submitted jobs by qstat
on a cluster, or top -u yourUsrName
to check the running stage of your jobs.
Step 5 - Obtain results
Once pipeline run finished, we provide a script in the pipeline util directory Collect_results.pl
to put the most important result files into archive
in the project directory.
To run it, try
$ Collect_Results.pl -f LCAexome.archive.yaml -t -o LCAexomeProj/archive/LCAexome_archive
Then all results will be collected in LCAexomeProj/archive/LCAexome_archive
, and (except BAM files) provided as a single LCAexome_archive.slim.tar.gz
file for download.
Step 6 - Visualization
To visualize the data analysis results, we provide a standalone RShiny utility ExSCaliburViz to view the QC, alignment and variant calls interactively on your own desktop.
INSERT VIZ SCREEN SHOT IMAGE
In addition, the alignment BAM files can be downloaded and visualized on Integrative Genomics Viewer.
INSERT IGV SCREEN SHOT IMAGE
Tips
A few tips that might be useful for setting up or tuning up the pipeline run.
- The recommended memory setting of BWA alignment jobs is 6GB. This is also the default setting in the pipeline configuration file. If set less, the job might run out of memory and get killed.
- The recommended memory setting for GATK is 4GB. If set less, VM may not start with error returned as "GATK, Could not reserve enough space for object heap".
Questions
More