Array Imputation

V2

There are two major differences between v1 and v2. First, v2 uses a left-aligned reference panel. Second, the imputed hom ref variants have been moved to their own sites-only VCF output. A detailed description of the full pipeline can be found at Array Imputation v2 Pipeline Overview.

Left Aligned VCF

We discovered that when we processed the v1 reference panel, some indel sites were not properly left aligned. This means that the output of the v1 pipeline contains some indel sites that are not properly left aligned. These variants remain meaningful, but their representation is not what would be considered standard. The v2 reference panel has had this issue addressed, and all outputs from the v2 pipeline will be properly left aligned.

To fix the alignment issues in a v1 output, run it through any left-aligning tool, such as bcftools norm (which needs the reference FASTA, supplied with -f, to left align indels).
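
For readers unfamiliar with what "left aligned" means for an indel, the idea can be sketched in a few lines of Python. This is only an illustration of the general normalization technique for a single biallelic variant, not the code used by the pipeline or by bcftools. The function name left_align and the toy reference sequence are assumptions made for the example.

```python
def left_align(ref_seq, pos, ref, alt):
    """Left-align and trim one biallelic variant (illustrative sketch).

    ref_seq: reference sequence of the contig as a string
    pos:     1-based position of the first base of `ref`
    Returns the normalized (pos, ref, alt).
    """
    ref, alt = ref.upper(), alt.upper()
    if ref == alt:
        raise ValueError("REF and ALT are identical")
    while True:
        if not ref or not alt:
            # An allele became empty: extend both alleles one base to the left.
            if pos == 1:
                raise ValueError("cannot left-extend a variant at position 1")
            pos -= 1
            base = ref_seq[pos - 1].upper()
            ref, alt = base + ref, base + alt
        elif ref[-1] == alt[-1]:
            # Both alleles end in the same base: drop it.
            ref, alt = ref[:-1], alt[:-1]
        else:
            break
    # Trim shared leading bases, keeping at least one base per allele.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# Toy reference with a CA repeat; the deletion below is written at the
# right-hand end of the repeat and gets shifted to the left-hand end.
toy_reference = "GCACACT"
print(left_align(toy_reference, 4, "CAC", "C"))
```

Both representations describe the same resulting sequence; only the position and alleles used to write the variant change, which is why the v1 variants are still meaningful.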

Output Changes

Because hom ref sites account for the large majority of the file size of our imputed outputs, we decided to split imputed variants and imputed hom ref variants into separate files. Imputed variants continue to be output in standard VCF format. Imputed hom ref variants are output in a new sites-only VCF file, which contains no FORMAT annotations. In most cases this change decreases overall output file size by an order of magnitude.
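
As a rough illustration of the difference between the two outputs, the sketch below inspects the column layout of a single VCF data line. The function name describe_record and the two example records are made up for this example and are not real pipeline output. It relies only on the standard VCF layout of eight fixed columns, optionally followed by FORMAT and one column per sample.

```python
def describe_record(line):
    """Summarize the column layout of one tab-separated VCF data line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        raise ValueError("a VCF data line needs at least 8 columns")
    return {
        "chrom": fields[0],
        "pos": int(fields[1]),
        # Sites-only records stop after the INFO column.
        "sites_only": len(fields) == 8,
        "format": fields[8] if len(fields) > 8 else None,
        "n_samples": max(0, len(fields) - 9),
    }


# Illustrative records with made-up values.
full_record = "chr1\t1000\t.\tA\tG\t.\t.\t.\tGT:DS\t0|1:1\t0|0:0"
sites_only_record = "chr1\t2000\t.\tC\tT\t.\t.\t."

print(describe_record(full_record))
print(describe_record(sites_only_record))
```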

To reconstruct a single VCF that contains both the imputed variant sites and the imputed hom ref sites, you can use the following bash script. The script fills in GT:DS as 0|0:0 for every sample at each hom ref site. The result will not exactly match the equivalent output of the v1 pipeline, but it should be good enough for most needs.

Requirements:

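Python 3
- any recent version will do
bcftools
- https://samtools.github.io/bcftools/howtos/install.html
bgzip and tabix (part of htslib)
- used by the script below to compress and index the expanded file
GATK
- https://github.com/broadinstitute/gatk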

bash script to run

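# Copy the header (including the sample columns) from the imputed variants file
bcftools view -h v2.imputed_variants.vcf.gz > header.txt

# Count the samples so the sites-only records can be padded to match
sample_count=$(bcftools query -l v2.imputed_variants.vcf.gz | wc -l)

# Give the sites-only file the same header as the imputed variants file
bcftools reheader -h header.txt -o reheadered_sites_only.vcf.gz v2.hom_ref_sites_only.vcf.gz

# Append FORMAT and per-sample columns to every record (requires python 3)
gunzip -c reheadered_sites_only.vcf.gz | python3 expand_sites_only_vcf.py -n $sample_count | bgzip -c > reheadered_sites_only_expanded.vcf.gz

# Index the expanded file
tabix reheadered_sites_only_expanded.vcf.gz

# Merge both files into one VCF (requires gatk jar)
java -jar gatk.jar MergeVcfs -I v2.imputed_variants.vcf.gz -I reheadered_sites_only_expanded.vcf.gz -O merged.all_variants.vcf.gz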

Code for expand_sites_only_vcf.py python file

#!/usr/bin/env python3

import sys
import argparse

def add_columns_to_tsv(fill_value="0|0:0", num_columns=10):
    """
    Add specified number of columns to each row in a TSV while preserving header lines starting with '#'
    Reads from stdin and writes to stdout
    """
    # Create the additional columns string
    additional_columns = '\t'.join([fill_value] * num_columns)

    for line in sys.stdin:
        line = line.rstrip('\n')

        if line.startswith('#'):
            # Pass through header lines unchanged
            print(line)
        else:
            print(line + '\t' + 'GT:DS' + '\t' + additional_columns)

def main():
    parser = argparse.ArgumentParser(
        description='Append a GT:DS FORMAT column and hom ref sample columns '
                    'to each record of a sites-only VCF read from stdin')
    parser.add_argument('-n', '--num-columns', type=int, default=10,
                        help='number of samples to add default values for (default: 10)')
    
    args = parser.parse_args()
    
    add_columns_to_tsv(num_columns=args.num_columns)

if __name__ == "__main__":
    main()
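
Before merging, it can be worth confirming that every expanded record has the same number of columns as the #CHROM header line, since a wrong -n value would leave the two out of step. The following is a minimal sketch of such a check. The function name count_checked_records is hypothetical and is not part of the pipeline or of any of the tools listed above.

```python
def count_checked_records(lines):
    """Check that every data line has as many columns as the #CHROM line.

    `lines` is any iterable of VCF text lines (for example sys.stdin).
    Returns the number of data lines checked; raises ValueError on mismatch.
    """
    expected = None
    checked = 0
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        if line.startswith("#CHROM"):
            expected = len(line.split("\t"))
            continue
        if expected is None:
            raise ValueError("data line found before the #CHROM header line")
        found = len(line.split("\t"))
        if found != expected:
            raise ValueError(
                f"line {number}: expected {expected} columns, found {found}")
        checked += 1
    return checked

# Possible use on the expanded file, reading decompressed text from stdin:
#   import sys
#   print(count_checked_records(sys.stdin))
```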

V1

The v1 pipeline is described in Array Imputation v1 Pipeline Overview.
