Guidelines for Reproducible Research
Overall, the results of a pipeline run are determined by three factors:
- Versions of the software and scripts used, at both the pipeline level and the individual tool level
- Pipeline input files, including the sample metadata table and the pipeline configuration file
- Sequencing reads in FastQ files
To reproduce previous results, all three factors must match the previous run exactly.
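One way to confirm that the input files and sequencing reads are unchanged between runs is to record their checksums at the time of the first run and check them again later. The following is a minimal sketch, assuming GNU coreutils `sha256sum` is installed; the function names and the manifest file are hypothetical and not part of ExScalibur.

```shell
# Hypothetical helpers (not part of ExScalibur): record and verify checksums
# of pipeline inputs so a later run can confirm it uses identical files.
# Assumes GNU coreutils `sha256sum` is available on the system.

# record_inputs MANIFEST FILE...
# Write one "checksum  path" line per input file to MANIFEST.
record_inputs() {
    local manifest="$1"
    shift
    sha256sum "$@" > "$manifest"
}

# verify_inputs MANIFEST
# Return 0 only if every file listed in MANIFEST still has the same checksum.
# Paths are read exactly as they were recorded, so run it from the same
# directory that record_inputs was run from.
verify_inputs() {
    sha256sum -c "$1"
}
```

For example, `record_inputs inputs.sha256 data/*.fastq.gz NA12878trio.pipeline.yaml` could be run once after a successful analysis and `verify_inputs inputs.sha256` before any attempt to reproduce it (`inputs.sha256` is an illustrative file name).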
In this document, we will focus on how to use version control systems to access and obtain specific versions of ExScalibur and its tools. To illustrate the steps, we will take the NIST-GIAB benchmark data as an example.
For full documentation of ExScalibur pipelines, please go to Documentation.
Pipeline Version
To reproduce previous results, the most important consideration is to use the same version of the pipelines as before. In this section, we describe how to access a particular version of the pipelines, either through the source code stored on BitBucket or from an AWS EC2 image.
Git Tagging
We use Git Tagging to label a particular version of the pipelines. There are two ways to check and download a tagged repo from BitBucket.
1. Through the Git command line.
(i) Show tag in the current repo:
    $ git tag
    0.5
(ii) Download tagged repo:
The command to use is git clone -b, followed by the tag name and the repository address.
For example, if we want to download ExScaliburGMD version 0.5:
    ## Clone GMD pipeline version 0.5
    $ git clone -b 0.5 git@bitbucket.org:cribioinformatics/exscaliburgmd.git
    Initialized empty Git repository in /Pipelines/test/exscaliburgmd/.git/
    remote: Counting objects: 122, done.
    remote: Compressing objects: 100% (103/103), done.
    remote: Total 122 (delta 27), reused 105 (delta 18)
    Receiving objects: 100% (122/122), 87.74 MiB | 2.08 MiB/s, done.
    Resolving deltas: 100% (27/27), done.
    warning: Remote branch 0.5 not found in upstream origin, using HEAD instead
    ## cd into the destination directory
    $ cd exscaliburgmd/
    ## Show tag of the current repo
    $ git tag
    0.5
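The clone log above ends with a warning that the remote branch 0.5 was not found and HEAD was used instead, so it may be worth confirming that the working copy really sits on the tagged commit. The following is a minimal sketch using standard Git commands; the helper name `on_tag` is hypothetical, and `git -C` assumes a reasonably recent Git.

```shell
# on_tag REPO_DIR TAG
# Succeed only if HEAD in REPO_DIR is exactly the commit that TAG points to.
on_tag() {
    local repo="$1" tag="$2" head_commit tag_commit
    head_commit=$(git -C "$repo" rev-parse HEAD) || return 1
    # "^{commit}" peels an annotated tag down to the commit it labels.
    tag_commit=$(git -C "$repo" rev-parse "refs/tags/$tag^{commit}") || return 1
    [ "$head_commit" = "$tag_commit" ]
}

# Example: check the clone from the step above and, if HEAD is elsewhere,
# move the working copy to the tag (this leaves a detached HEAD).
# on_tag exscaliburgmd 0.5 || git -C exscaliburgmd checkout tags/0.5
```

Listing tags with `git tag` only shows which tags exist in the repository; it does not show which commit is currently checked out, which is why the comparison above uses `git rev-parse`.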
2. Through web browser.
(i) Show tag in the current repo:
(ii) Download the tagged repo from web:
Now we have downloaded ExScaliburGMD version 0.5. It should be used to reproduce any previous results generated by this particular version of the pipelines.
More instructions about how to install the pipelines can be found in Source.
AWS EC2 Image
To facilitate quick launching of the pipelines, we provide a stable image with pre-configured environments and pre-installed tools on AWS EC2 (Image ID available at Cloud). We will continuously maintain software updates, add new tools and release new images as they become stable. Users have the option to start with a stable image and update/install software as they desire. If anything breaks, the original image can be easily restored.
More instructions about using ExScalibur for data analysis on the cloud can be found in Cloud.
Tool Version
Pipelines of a particular version are equipped with a set of tools with their own versions. This is the second level of version control.
As discussed in the previous section, users can update or add tools after installing the pipelines, in which case the tool versions may differ from those in the original copy. We are aware that the original pipelines may not work well with newer (or older) tool versions, because parameters and libraries may change, and the current implementation of ExScalibur cannot automatically check tool versions and "learn" to launch the corresponding commands. We are working on solutions to better address this in the next release.
Version control of individual tools remains a challenging task in pipeline development. Our current approach is to provide
This includes log files of each individual job (from the tool stdout and stderr streams; collected in directory myProject/logs) plus a project-level runtime report (generated by BigDataScript). A sample block is shown below, which documents the tool, command, input and output files, computational resources, time and exit status.
    # SYS command. line 54
    echo "father::SRR504517::fastqc"
    # SYS command. line 55
    . /etc/profile.d/modules.sh; module load java/1.7.0; module load fastqc/0.11.2;
    # SYS command. line 56
    fastqc --extract -o /data/rbao/BDS-ExScaliburGMD-032215/LCAexomeProj/results/LCAexome_samples/father/qc_reports -t 2 --nogroup /data/rbao/BDS-ExScaliburGMD-032215/data/father_SRR504517_1.fastq.gz /data/rbao/BDS-ExScaliburGMD-032215/data/father_SRR504517_2.fastq.gz >& /data/rbao/BDS-ExScaliburGMD-032215/LCAexomeProj/logs/LCAexome_samples/father/SRR504517.fastqc.log
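Because each logged job records the `module load` calls it ran, the tool versions actually used in a run can be recovered from the logs afterwards. The following is a minimal sketch, assuming log lines follow the `module load name/version` pattern shown in the sample block; the function name is hypothetical.

```shell
# modules_in_log
# Read runtime-log text on stdin and print each distinct "name/version"
# that was passed to `module load`, one per line, sorted.
modules_in_log() {
    grep -o 'module load [^;[:space:]]*' | awk '{ print $3 }' | sort -u
}

# Example over all log files of a project (directory name from the text above):
# find myProject/logs -type f -exec cat {} + | modules_in_log
```

Applied to the sample block above, this would list `fastqc/0.11.2` and `java/1.7.0`.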
All tools and their versions can be specified in the pipeline configuration input file (YAML format; example). A sample block is shown below. Note that it sets the version of BWA used in the pipeline run to 0.7.10 (along with other parameters). We use the module software to organize and call specific versions of tools, which keeps version selection clean and simple.
    ## Configuration block for BWA aligner
    bwa:
      aln_per_read: 100
      barcode_length: 0
      exe: bwa
      fastq_format: 33
      max_SE_hits: 1
      max_PE_hits: 1
      max_discor_PE_hits: 1
      max_mate_rescue: 50
      max_seed_occur: 500
      mem: 6
      min_base_qual: 10
      module: bwa/0.7.10
      threads: 4
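To document which tool versions a run will request, the `module:` entries can be pulled out of the configuration file. The following is a minimal sketch, assuming each tool block carries a `module: name/version` line as in the BWA block above. It uses a plain-text scan rather than a YAML parser, so it will not handle unusual YAML layouts, and the function name is hypothetical.

```shell
# modules_in_config FILE
# Print the value of every "module:" key found in a pipeline configuration
# file, one per line, sorted and de-duplicated.
modules_in_config() {
    awk '$1 == "module:" { print $2 }' "$1" | sort -u
}

# Example with the NIST-GIAB configuration file shipped in the example directory:
# modules_in_config NA12878trio.pipeline.yaml > tool_versions.txt
```

Saving this list next to the results gives a compact record of the requested versions; `tool_versions.txt` is an illustrative file name.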
For example, here is the list of tools implemented in ExScalibur version 0.5. The versions listed are those we have tested with the sample data.
Note that the sample data are provided in the example directory of the pipeline distribution and can be used to test the pipeline after updating or installing tools.
Empty cells indicate this tool is not used in one of the pipelines (GMD or SMD).
Tool | Version | GMD | SMD |
---|---|---|---|
Annovar | Nov 12,2014 | ● | ● |
BCFtools | 0.1.19 | ● | |
BEDTools | 2.21.0 | ● | ● |
bgzip | 0.2.6 | ● | ● |
BWA | 0.7.10 | ● | ● |
Cutadapt | 1.1 | ● | ● |
FastQC | 0.11.2 | ● | ● |
FreeBayes | 0.9.13 | ● | |
GATK | 3.1.1 | ● | ● |
gvcftools | 0.16 | ● | |
igvtools | 2.3.32 | ● | |
IVC | 1.0.6 | ● | |
MuTect | 1.1.7 | ● | |
Novoalign | 3.02.08 | ● | ● |
Picard Tools | 1.123 | ● | ● |
pigz | 2.3.1 | ● | ● |
SAMtools | 0.1.19 | ● | ● |
SeqPrep | b5efabc5f7 | ● | ● |
Shimmer | b62f433 | ● | |
SomaticSniper | 1.0.4 | ● | |
Strelka | 1.0.14 | ● | ● |
tabix | 0.2.6 | ● | ● |
VarScan2 | 2.3.6 | ● | |
vcflib* | 07.23.2014 | ● | |
vcfsorter* | 09.16.2014 | ● | |
VCFtools | 0.1.12a | ● | ● |
vcfutils | 0.1.19 | ● | |
Virmid | 1.1.1 | ● | |
vt* | 07.23.2014 | ● | |
[ * ] Version not available. Tool download date is provided.
The table is a good reference for users who would like to upgrade tools or test new versions.
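When upgrading, it can help to flag which requested tool versions fall outside the tested set. The following is a minimal sketch, assuming the user keeps two plain-text lists with one `name/version` entry per line: one transcribed from the table above in module form, and one describing the versions a run requests. Both files and the function name are hypothetical.

```shell
# untested_modules TESTED REQUESTED
# Both arguments are files with one "name/version" entry per line.
# Print the requested entries that do not appear in the tested list.
# grep exits with status 1 when nothing is printed, i.e. when every
# requested entry has been tested.
untested_modules() {
    grep -v -x -F -f "$1" "$2"
}

# Example (file names are illustrative):
# untested_modules tested_versions.txt requested_versions.txt
```

Any entry printed here is a version that differs from the tested configuration and is a reasonable candidate for a trial run on the sample data first.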
Input Files
ExScalibur requires two input files: (1) Sample metadata table; (2) Pipeline configuration file.
The files used for the NIST-GIAB benchmark evaluation are provided in the example directory along with the pipeline distribution. We also include them here:
As discussed in the previous section, tool versions are specified in the pipeline configuration file (NA12878trio.pipeline.yaml) and can be easily documented for future use.
Details about how to run the pipeline with the input files can be found at GMD Tutorials.
More
Questions? Contact our bioinformatics team (biocore AT cri DOT uchicago DOT edu)!
For pipeline-specific questions, please post on the BitBucket issue tracking board (GMD; SMD). We prefer this route because it helps us organize and track questions. All of us watch the queue daily.
If you have any suggestions or comments, please do let us know!
Cheers.