Version Control

Guidelines for Reproducible Research



Overall, the results of a pipeline run are decided by three factors:

  • Versions of the software and scripts, at both the pipeline level and the individual tool level
  • Pipeline input files, including the sample metadata table and the pipeline configuration file
  • Sequencing reads in FastQ files

To reproduce previous results, all three factors must be exactly the same as in the previous run.

In this document, we will focus on how to use version control systems to access and obtain specific versions of ExScalibur and its tools. To illustrate the steps, we will take the NIST-GIAB benchmark data as an example.



For full documentation of ExScalibur pipelines, please go to Documentation.




Pipeline Version

To reproduce previous results, the most important consideration is to use the same version of the pipelines as before. In this section, we describe how to access a particular version of the pipelines, either through the source code stored on BitBucket or from an AWS EC2 image.


Git Tagging

We use Git Tagging to label a particular version of the pipelines. There are two ways to check and download a tagged repo from BitBucket.

1. Through Git command.

(i) List the tags in the current repo:

$ git tag
0.5


(ii) Download tagged repo:

The command to use is git clone -b <tag> <repository URL>

For example, if we want to download ExScaliburGMD version 0.5:

## Clone GMD pipeline version 0.5
$ git clone -b 0.5 git@bitbucket.org:cribioinformatics/exscaliburgmd.git
Initialized empty Git repository in /Pipelines/test/exscaliburgmd/.git/
remote: Counting objects: 122, done.
remote: Compressing objects: 100% (103/103), done.
remote: Total 122 (delta 27), reused 105 (delta 18)
Receiving objects: 100% (122/122), 87.74 MiB | 2.08 MiB/s, done.
Resolving deltas: 100% (27/27), done.
warning: Remote branch 0.5 not found in upstream origin, using HEAD instead

## cd into the destination directory
$ cd exscaliburgmd/

## List the tags in the cloned repo
$ git tag
0.5
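The clone output above ends with the warning "Remote branch 0.5 not found in upstream origin, using HEAD instead". Older Git releases print this when -b is given a tag name rather than a branch name, and in that case the working tree is left at the default branch HEAD rather than at the tag. Plain git tag does not reveal this, because it lists every tag in the repo, not the one that is checked out. The following is a minimal sketch of an explicit check; assert_at_tag is a hypothetical helper name and not part of ExScalibur.

```shell
# Hypothetical helper (not part of ExScalibur): succeed only if the commit
# that is currently checked out is the one the expected tag points to.
assert_at_tag() {
    expected_tag="$1"
    head_commit="$(git rev-parse HEAD)" || return 1
    # ^{commit} peels annotated tags down to the commit they label.
    tag_commit="$(git rev-parse --verify --quiet "refs/tags/${expected_tag}^{commit}")" || {
        echo "No such tag: $expected_tag" >&2
        return 1
    }
    if [ "$head_commit" = "$tag_commit" ]; then
        echo "OK: working tree is at tag $expected_tag"
    else
        echo "MISMATCH: HEAD is not at tag $expected_tag" >&2
        return 1
    fi
}

# Usage inside the cloned repo:
#   git checkout 0.5      # move the working tree to the tagged commit
#   assert_at_tag 0.5
```

If the check reports a mismatch, running git checkout 0.5 inside the repo moves the working tree to the tagged commit.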


2. Through a web browser.

(i) Show the tags in the repo:

[Figure: git_tag.demo]

(ii) Download the tagged repo from the web:

[Figure: git_clone.demo]

Now we have downloaded ExScaliburGMD version 0.5. It should be used to reproduce any previous results generated by this particular version of the pipelines.

More instructions about how to install the pipelines can be found in Source.


AWS EC2 Image

To facilitate quick launching of the pipelines, we provide a stable image with pre-configured environments and pre-installed tools on AWS EC2 (Image ID available at Cloud). We will continuously maintain software updates, add new tools and release new images as they become stable. Users have the option to start with a stable image and update/install software as they desire. If anything breaks, the original image can be easily restored.

More instructions about using ExScalibur for data analysis on the cloud can be found in Cloud.



Tool Version

Pipelines of a particular version are equipped with a set of tools with their own versions. This is the second level of version control.

As discussed in the previous section, users can update or add tools after installing the pipelines, in which case the tool versions may differ from those in the original copy. We are aware that the original pipelines may not work well with newer (or older) tool versions, because parameters and libraries may change, and the current implementation of ExScalibur cannot automatically check tool versions and adjust the commands it launches accordingly. We are working on solutions to better address this in the next release.
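Until such a check is built into the pipelines, a tool's version can be compared by hand before a run. The following is a minimal sketch, not part of ExScalibur: check_tool_version is a hypothetical helper, and it assumes the tool prints its version to stdout or stderr when called with the arguments given. The flag that prints the version differs from tool to tool.

```shell
# Hypothetical helper (not part of ExScalibur): run a tool's version command
# and check that the expected version string appears in its output.
check_tool_version() {
    expected="$1"
    shift
    # Capture stdout and stderr together; tools differ in where they print versions.
    output="$("$@" 2>&1)"
    if printf '%s\n' "$output" | grep -qF -- "$expected"; then
        echo "OK: found version $expected"
    else
        echo "MISMATCH: expected $expected, got: $output" >&2
        return 1
    fi
}

# Usage (assumes the fastqc module is already loaded):
#   check_tool_version 0.11.2 fastqc --version
```

The match is a plain substring test, so it is only a coarse guard: an expected value of 0.11.2 would also accept 0.11.20.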

Version control of individual tools remains a challenging task in pipeline development. Our current approach is to provide:

  • Complete report of the tools and commands involved in a pipeline run
    This includes log files of each individual job (from the tool stdout and stderr streams; collected in directory myProject/logs) plus a project-level runtime report (generated by BigDataScript). A sample block is shown below, which documents the tool, command, input and output files, computational resources, time and exit status.

    [Figure: bds_runtime_report_block.demo]

    # SYS command. line 54
     echo "father::SRR504517::fastqc"
    
    # SYS command. line 55
     . /etc/profile.d/modules.sh; module load java/1.7.0; module load fastqc/0.11.2; 
    
    # SYS command. line 56
     fastqc --extract -o /data/rbao/BDS-ExScaliburGMD-032215/LCAexomeProj/results/LCAexome_samples/father/qc_reports -t 2 --nogroup /data/rbao/BDS-ExScaliburGMD-032215/data/father_SRR504517_1.fastq.gz /data/rbao/BDS-ExScaliburGMD-032215/data/father_SRR504517_2.fastq.gz >& /data/rbao/BDS-ExScaliburGMD-032215/LCAexomeProj/logs/LCAexome_samples/father/SRR504517.fastqc.log
     
  • Extensive documentation of tools and their versions at the beginning of a run
    All tools and their versions can be specified in the pipeline configuration input file (YAML format; example). A sample block is shown below. It specifies that BWA version 0.7.10 will be used in the pipeline run, along with other parameters. We use the module software to organize and call specific versions of tools, which keeps the setup clean and simple.

    ## Configuration block for BWA aligner
    bwa:
          aln_per_read: 100
          barcode_length: 0
          exe: bwa
          fastq_format: 33
          max_SE_hits: 1
          max_PE_hits: 1
          max_discor_PE_hits: 1
          max_mate_rescue: 50
          max_seed_occur: 500
          mem: 6
          min_base_qual: 10
          module: bwa/0.7.10
          threads: 4
    
  • In addition, for every release of the pipelines, we document the list of tools and their versions that we have explicitly tested to work with pipelines of that version
    For example, here is the list of tools implemented in ExScalibur version 0.5. The versions listed are those with which we completed tests on sample data.
    Note that sample data are provided in the example directory of the pipeline distribution and can be used to test the pipelines after updating or installing tools.
    Empty cells indicate that the tool is not used in one of the pipelines (GMD or SMD).

    Tool            Version        GMD   SMD
    Annovar         Nov 12, 2014
    BCFtools        0.1.19
    BEDTools        2.21.0
    bgzip           0.2.6
    BWA             0.7.10
    Cutadapt        1.1
    FastQC          0.11.2
    FreeBayes       0.9.13
    GATK            3.1.1
    gvcftools       0.16
    igvtools        2.3.32
    IVC             1.0.6
    MuTect          1.1.7
    Novoalign       3.02.08
    Picard Tools    1.123
    pigz            2.3.1
    SAMtools        0.1.19
    SeqPrep         b5efabc5f7
    Shimmer         b62f433
    SomaticSniper   1.0.4
    Strelka         1.0.14
    tabix           0.2.6
    VarScan2        2.3.6
    vcflib*         07.23.2014
    vcfsorter*      09.16.2014
    VCFtools        0.1.12a
    vcfutils        0.1.19
    Virmid          1.1.1
    vt*             07.23.2014



    [ * ] Version not available. Tool download date is provided.

    This table is a good reference for users who would like to upgrade tools or test new versions.
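To compare a particular setup against the tested versions listed above, the module entries can be pulled out of a pipeline configuration file, and the module load calls out of a runtime report or job log. The following is a rough sketch under two assumptions taken from the sample blocks in this section: the configuration uses lines of the form "module: bwa/0.7.10", and the report or log contains calls of the form "module load fastqc/0.11.2;". Both function names are hypothetical and not part of ExScalibur.

```shell
# List the tool/version pairs requested in a pipeline configuration file,
# assuming lines of the form "module: bwa/0.7.10".
list_config_modules() {
    sed -n 's/^[[:space:]]*module:[[:space:]]*\([^[:space:]#]*\).*$/\1/p' "$1" | sort -u
}

# List the tool/version pairs actually loaded during a run, assuming the
# report or log contains calls of the form "module load fastqc/0.11.2;".
list_loaded_modules() {
    grep -o 'module load [^; ]*' "$1" | sed 's/^module load //' | sort -u
}

# Usage (the report path is a placeholder):
#   list_config_modules NA12878trio.pipeline.yaml
#   list_loaded_modules path/to/runtime_report_or_log
```

Each function prints one name/version pair per line, so the two listings can be compared with each other or checked by eye against the table.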



Input Files

ExScalibur requires two input files: (1) a sample metadata table; (2) a pipeline configuration file.

The files used for the NIST-GIAB benchmark evaluation are provided in the example directory along with the pipeline distribution. We also include them here:

As discussed in the previous section, tool versions are specified in the pipeline configuration file (NA12878trio.pipeline.yaml) and can be easily documented for future use.

Details about how to run the pipeline with the input files can be found at GMD Tutorials.
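Because the input files and the FastQ reads must match the previous run exactly, it may help to record their checksums alongside the results. The following is a minimal sketch, assuming GNU coreutils sha256sum is available; the helper names and the manifest file name are hypothetical and not part of ExScalibur.

```shell
# Record SHA-256 checksums of the given files in a manifest.
write_manifest() {
    manifest="$1"
    shift
    sha256sum "$@" > "$manifest"
}

# Re-check the files against a previously written manifest.
# Returns non-zero if any file is missing or has changed.
verify_manifest() {
    sha256sum --check --quiet "$1"
}

# Usage (paths other than the configuration file are placeholders):
#   write_manifest inputs.sha256 NA12878trio.pipeline.yaml path/to/metadata_table path/to/reads_1.fastq.gz
#   verify_manifest inputs.sha256
```

The manifest stores file paths as they were given, so verification needs to be run from the same directory, or with the same absolute paths, as when the manifest was written.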



More

Questions? Contact our bioinformatics team (biocore AT cri DOT uchicago DOT edu)!

For pipeline-specific questions, please post on the BitBucket issue tracking board (GMD; SMD). We prefer this route because it helps us organize and track the questions. All of us watch the queue daily.

If you have any suggestions or comments, please do let us know!

Cheers.