GATK Workflow

This workflow calls variants in bacterial genomes. It is designed to run on the HPCC at Michigan State University and was adapted from a workflow that can be found here.

Moving files onto the HPCC:

  1. Open FileZilla, WinSCP or another filesystem manager.
  2. Enter hpcc.msu.edu as the host, your username and password in the corresponding boxes, and 22 as the port number.
  3. Your computer’s file system is shown on the left and your HPCC account’s file system on the right.
  4. Drag and drop files from your computer to the HPCC or vice versa.

Things you will need to use GATK:

  1. A reference FASTA file (with the extension .fa or .fasta), which can be obtained by running an assembly with software such as SPAdes.
  2. Paired-end reads (both forward and reverse reads for each sample you wish to call variants for), preferably quality-trimmed with Trimmomatic.

Getting onto the HPCC:

  1. Open a terminal (command line) on your computer.

  2. Use the ssh command to access the HPCC with your username:

     $ ssh username@hpcc.msu.edu
    
  3. You will be prompted for your password. The characters will not appear as you type, but they are still being entered.

  4. You are now in the HPCC gateway and you need to access a development node. You can use the ssh command again with the name of a node to access it:

     $ ssh dev-intel14
    
  5. You are now on a development node and can run commands on the HPCC.

Helpful things:

These commands will help you manipulate the file system on the HPCC so you can organize all the output and data from GATK.
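
The commands themselves are not listed here. As a minimal sketch, the standard shell commands below cover the basics of creating folders and moving files around. The folder name gatk_project and its subfolders are hypothetical; use whatever layout suits your data.

```shell
# Hypothetical project folder; choose any name you like.
project=gatk_project

pwd                                   # print the directory you are currently in
mkdir -p "$project/reads" "$project/reference" "$project/output"
ls "$project"                         # list the contents of a directory

touch "$project/example.txt"          # create an empty file to practice on
cp "$project/example.txt" "$project/output/copy.txt"     # copy a file
mv "$project/example.txt" "$project/output/moved.txt"    # move (or rename) a file
rm "$project/output/copy.txt"         # delete a file (there is no undo)

cd "$project/output"                  # change into a directory
cd ../..                              # go back up two levels
```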

HPCC modules

Modules are used on the HPCC to make software available in, or remove it from, your terminal session. This workflow relies on them heavily, so it helps to know the basics.

Using GATK:


1) Load the modules you’ll need

You will need to load the bwa, picardTools, GATK and vcflib modules in order to use the software they contain.
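
A minimal sketch of this step, assuming the module names are spelled exactly as above and that no specific versions are needed. The function name load_workflow_modules is only illustrative.

```shell
# Load every module this workflow uses, stopping at the first one that fails.
load_workflow_modules() {
    for name in bwa picardTools GATK vcflib; do
        module load "$name" || return 1
    done
    module list    # print the loaded modules so you can confirm they are present
}
```

Run load_workflow_modules once per login session. If a name is not recognized, `module avail` lists what can be loaded, and `module unload NAME` removes a module again.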

2) Create the FM-index

Estimated Time: 0 min
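
The command for this step is not shown. A sketch of how the index is typically built with `bwa index`, assuming the reference is a FASTA file in the current directory (the function name is hypothetical):

```shell
# Build the FM-index files that bwa needs; they are written next to the reference.
# Usage: index_reference reference.fa
index_reference() {
    reference=$1
    if [ ! -f "$reference" ]; then
        echo "reference not found: $reference" >&2
        return 1
    fi
    bwa index "$reference"
}
```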

3) Sort the Reference File

Estimated Time: 0 min

4) Create Sequence Dictionary

Estimated Time: 0 min
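
One way to create the dictionary is Picard's CreateSequenceDictionary tool. The sketch below assumes a single picard.jar whose location is stored in PICARD_JAR; that variable name is an assumption, so check where the picardTools module actually places the jar.

```shell
# Assumed location of the Picard jar; adjust to match your installation.
PICARD_JAR=${PICARD_JAR:-picard.jar}

# Write reference.dict next to the reference (GATK expects that file name).
# Usage: create_dictionary reference.fa
create_dictionary() {
    reference=$1
    dictionary="${reference%.*}.dict"    # swap the .fa/.fasta extension for .dict
    java -jar "$PICARD_JAR" CreateSequenceDictionary R="$reference" O="$dictionary"
}
```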

5) Align Reads

Estimated Time: 10 min
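
As a sketch, paired-end reads can be aligned with `bwa mem`. The sample name, file names and read group values below are placeholders, not values from this workflow.

```shell
# Align one sample's paired-end reads and write the result to <sample>.sam.
# Usage: align_reads reference.fa sample1 sample1_R1.fastq sample1_R2.fastq
align_reads() {
    reference=$1; sample=$2; forward=$3; reverse=$4
    # GATK needs a read group on every read. The ID, SM and PL values here are
    # illustrative and should describe your own sequencing run.
    read_group="@RG\tID:${sample}\tSM:${sample}\tPL:ILLUMINA"
    bwa mem -R "$read_group" "$reference" "$forward" "$reverse" > "${sample}.sam"
}
```

Wrapping the command in a function makes it easy to repeat the alignment for each sample.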

6) Sort the Aligned File

Estimated Time: 5 min
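
Sorting can be done with Picard's SortSam, which also converts SAM to BAM when the output name ends in .bam. As before, PICARD_JAR is an assumed variable pointing at the Picard jar.

```shell
# Assumed location of the Picard jar; adjust to match your installation.
PICARD_JAR=${PICARD_JAR:-picard.jar}

# Sort an alignment by genomic coordinate.
# Usage: sort_alignment sample1.sam sample1.sorted.bam
sort_alignment() {
    java -jar "$PICARD_JAR" SortSam I="$1" O="$2" SORT_ORDER=coordinate
}
```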

7) Mark Duplicates

Estimated Time: 0 min
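
A sketch using Picard's MarkDuplicates, which flags duplicate reads and writes a small metrics report. File names are placeholders and PICARD_JAR is an assumed variable.

```shell
# Assumed location of the Picard jar; adjust to match your installation.
PICARD_JAR=${PICARD_JAR:-picard.jar}

# Usage: mark_duplicates sample1.sorted.bam sample1.dedup.bam sample1.metrics.txt
mark_duplicates() {
    # I = input BAM, O = output BAM with duplicates flagged, M = metrics file
    java -jar "$PICARD_JAR" MarkDuplicates I="$1" O="$2" M="$3"
}
```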

8) Sort the Marked Duplicates

Estimated Time: 5 min

9) Create Targets for Realignment

Estimated Time: 20 min
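
This step corresponds to the RealignerTargetCreator tool, which exists only in GATK 3 and earlier. The sketch assumes the GATK 3 jar is reachable through a variable called GATK_JAR (an assumed name) and that the BAM file has already been indexed.

```shell
# Assumed location of the GATK 3 jar; adjust to match your installation.
GATK_JAR=${GATK_JAR:-GenomeAnalysisTK.jar}

# List the intervals around indels where reads should be realigned.
# Usage: create_realign_targets reference.fa sample1.dedup.bam sample1.intervals
create_realign_targets() {
    java -jar "$GATK_JAR" -T RealignerTargetCreator -R "$1" -I "$2" -o "$3"
}
```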

10) Realign Reads for Indels

Estimated Time: 5 min
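
The realignment itself is done by GATK 3's IndelRealigner, using the intervals file from the previous step. GATK_JAR is again an assumed variable, and the file names are placeholders.

```shell
# Assumed location of the GATK 3 jar; adjust to match your installation.
GATK_JAR=${GATK_JAR:-GenomeAnalysisTK.jar}

# Usage: realign_indels reference.fa sample1.dedup.bam sample1.intervals sample1.realigned.bam
realign_indels() {
    java -jar "$GATK_JAR" -T IndelRealigner -R "$1" -I "$2" -targetIntervals "$3" -o "$4"
}
```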

11) Call Variants

Estimated Time: 5 min
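
The caller used in the original workflow is not named here. As one possibility, the sketch below uses GATK 3's HaplotypeCaller and sets the ploidy to 1, on the assumption that the bacterial genome is haploid. GATK_JAR is an assumed variable.

```shell
# Assumed location of the GATK 3 jar; adjust to match your installation.
GATK_JAR=${GATK_JAR:-GenomeAnalysisTK.jar}

# Call SNPs and indels for one sample and write them to a VCF file.
# Usage: call_variants reference.fa sample1.realigned.bam sample1.raw.vcf
call_variants() {
    # -ploidy 1 treats the organism as haploid (assumption for bacteria)
    java -jar "$GATK_JAR" -T HaplotypeCaller -R "$1" -I "$2" -ploidy 1 -o "$3"
}
```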

12) Filter Variants

Estimated Time: 0 min
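
The filter criteria of the original workflow are not given. A sketch using `vcffilter` from vcflib with a simple quality cutoff follows; the default threshold of 30 is purely illustrative and should be chosen from your own data.

```shell
# Keep only variants whose QUAL exceeds a minimum (third argument, default 30).
# Usage: filter_variants sample1.raw.vcf sample1.filtered.vcf 30
filter_variants() {
    min_qual=${3:-30}
    vcffilter -f "QUAL > ${min_qual}" "$1" > "$2"
}
```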

13) Extract SNPs

Estimated Time: 0 min

14) Extract Indels

Estimated Time: 0 min
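
Both extraction steps (SNPs and indels) can be sketched with GATK 3's SelectVariants, which keeps only one variant type. GATK_JAR is an assumed variable and the file names are placeholders.

```shell
# Assumed location of the GATK 3 jar; adjust to match your installation.
GATK_JAR=${GATK_JAR:-GenomeAnalysisTK.jar}

# Usage: select_variants reference.fa sample1.filtered.vcf SNP   sample1.snps.vcf
#        select_variants reference.fa sample1.filtered.vcf INDEL sample1.indels.vcf
select_variants() {
    java -jar "$GATK_JAR" -T SelectVariants -R "$1" -V "$2" -selectType "$3" -o "$4"
}
```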

Viewing your data


Viewing the raw SNP files can be important for quality checks. We’ll use nano, a command-line text editor, to read the .vcf files. After viewing the files, you can make decisions about the quality and quantity of your SNPs and indels.

Using nano to read files
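
A file is opened by passing its name to nano. Because nano is interactive, the sketch below shows that call only as a comment and adds a small, hypothetical helper that counts the variants in a file without opening it.

```shell
# Open a VCF file in nano (press Ctrl+X to exit):
#   nano sample1.snps.vcf

# Count the variant records in a VCF file.
# Usage: count_variants sample1.snps.vcf
count_variants() {
    # Header lines start with "#"; every other line is one variant record.
    grep -vc '^#' "$1"
}
```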

Organizing .vcf data into a table

Estimated Time: 0 min
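
The tool used for this step in the original workflow is not shown. As a stand-in, the sketch below builds a tab-separated table with plain awk, relying only on the fixed column order of the VCF format. The function name and the choice of columns are illustrative.

```shell
# Print a tab-separated table of the main fields of every variant.
# Usage: vcf_to_table sample1.snps.vcf > sample1.snps.tsv
vcf_to_table() {
    printf 'CHROM\tPOS\tREF\tALT\tQUAL\n'
    # VCF columns: 1=CHROM 2=POS 3=ID 4=REF 5=ALT 6=QUAL; header lines are skipped
    awk -F '\t' -v OFS='\t' '!/^#/ { print $1, $2, $4, $5, $6 }' "$1"
}
```

The resulting .tsv file can be opened in a spreadsheet program after moving it to your computer.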

Removing Background Anomalies


Background anomalies are SNPs and indels found by GATK that are present in your sequencing control as well as in your samples. This step is only needed to eliminate large numbers of SNPs in a control sample, which most likely arise when the reference genome (reference.fa) differs too much from your samples.

Download and move the file to the HPCC

  1. To download the file go here.
  2. Then download the ZIP file that contains the program.
  3. Move the file to the HPCC using FileZilla or WinSCP as described previously, or by other means.

Load the Python 3 module

Filter out background SNPs
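
The downloaded program and its command line are not shown here, so the following is not that program. It is a minimal awk sketch of the same idea: drop every variant from a sample that also appears, with the same position and alleles, in the control. The function and file names are hypothetical.

```shell
# Print the sample VCF without the variants that are also in the control VCF.
# Usage: remove_background control.vcf sample1.snps.vcf > sample1.clean.vcf
remove_background() {
    awk -F '\t' '
        # A variant is identified by CHROM, POS, REF and ALT.
        { key = $1 FS $2 FS $4 FS $5 }
        # First file (control): remember every variant, print nothing.
        NR == FNR { if ($0 !~ /^#/) seen[key] = 1; next }
        # Second file (sample): keep headers and variants not seen in the control.
        /^#/ || !(key in seen)
    ' "$1" "$2"
}
```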