Array Imputation
V2
There are two major differences between v1 and v2. First, v2 uses a left-aligned reference panel. Second, the imputed hom ref variants have been moved to their own sites-only VCF output. A detailed description of the full pipeline can be found in Array Imputation v2 Pipeline Overview.
Left Aligned VCF
We discovered that during our processing of the v1 reference panel, some indel sites were not properly left aligned. This means that the output of the v1 pipeline contains some indel sites that are not properly left aligned. The variants themselves remain meaningful, but their representation is not what would be considered standard. The v2 reference panel fixes this issue, and all outputs from the v2 pipeline are properly left aligned.
To fix the alignment in a v1 output, run any left-aligning tool on it, such as bcftools norm (which needs the reference FASTA passed with -f).
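To illustrate what left alignment means for an indel, here is a minimal Python sketch of the usual shift-and-trim procedure applied to a toy reference string. It is not the pipeline's code and not how bcftools norm is implemented. The function name left_align and the toy sequence are assumptions made for this example, and it handles only a single alternate allele.

```python
def left_align(ref_seq, pos, ref, alt):
    """Left-align one variant against ref_seq.

    pos is the 1-based position of the first base of the ref allele.
    Returns (pos, ref, alt) in left-aligned, parsimonious form.
    """
    changed = True
    while changed:
        changed = False
        # Drop a shared trailing base from both alleles.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, extend both one base to the left.
        if not ref or not alt:
            if pos == 1:
                raise ValueError("variant at contig start is not handled in this sketch")
            pos -= 1
            base = ref_seq[pos - 1]
            ref, alt = base + ref, base + alt
            changed = True
    # Drop shared leading bases while both alleles keep at least one base.
    while len(ref) >= 2 and len(alt) >= 2 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# Toy reference with a CA repeat: T G C A C A C A G (positions 1..9).
toy_reference = "TGCACACAG"
# A deletion of "CA" written at the right end of the repeat...
print(left_align(toy_reference, 6, "ACA", "A"))  # → (2, 'GCA', 'G')
```

Both representations describe the same resulting sequence. Left alignment simply picks the leftmost position so that equivalent indels compare equal across files.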
Output Changes
Because hom ref sites make up the large majority of the file size of our imputed outputs, we decided to split the imputed variants from the imputed hom ref variants. Imputed variants continue to be output in standard VCF format. Imputed hom ref variants are now output in a new sites-only VCF file, which contains no FORMAT or per-sample columns. In most cases this change decreases overall output file size by an order of magnitude.
To reconstruct a single VCF that contains both the imputed variant sites and the hom ref sites from the sites-only VCF, you can use the following bash script. The result will not exactly match the equivalent output of the v1 pipeline, but it should be good enough for most needs.
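As a rough illustration of the difference between the two record shapes, the Python sketch below reduces a full VCF record to its sites-only form by keeping the eight fixed columns (CHROM through INFO). The helper name to_sites_only and the example record are assumptions made for this illustration; this is not how the pipeline produces its output.

```python
def to_sites_only(record_line):
    """Keep only the eight fixed VCF columns (CHROM..INFO) of a data row."""
    fields = record_line.rstrip("\n").split("\t")
    return "\t".join(fields[:8])


# Hypothetical full record with a FORMAT column and two samples.
full_record = "chr1\t100\t.\tA\tG\t.\tPASS\tAF=0.1\tGT:DS\t0|0:0\t0|0:0"
sites_only = to_sites_only(full_record)
print(sites_only)
# Per-sample columns grow with the number of samples, the fixed columns do not.
print(len(full_record), len(sites_only))
```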
Requirements:
- Python 3
  - any recent version will do
- bcftools
  - https://samtools.github.io/bcftools/howtos/install.html
- bgzip and tabix (both part of htslib)
- GATK
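A quick way to confirm the tools are installed before running the script is to look for them on PATH. The Python sketch below is only a convenience check. The helper name missing_tools and the list of executable names are assumptions; for example, GATK is checked here through java because the script runs it from a jar.

```python
import shutil


def missing_tools(names):
    """Return the executables from names that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


# Assumed executable names for the requirements listed above.
REQUIRED_TOOLS = ["python3", "bcftools", "bgzip", "tabix", "java"]

missing = missing_tools(REQUIRED_TOOLS)
if missing:
    print("Not found on PATH: " + ", ".join(missing))
else:
    print("All required tools were found on PATH.")
```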
bash script to run
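# copy the full header (including sample names) from the imputed variants vcf
bcftools view -h v2.imputed_variants.vcf.gz > header.txt
# count the samples so the sites only records can be padded to match
sample_count=$(bcftools query -l v2.imputed_variants.vcf.gz | wc -l)
# apply that header to the sites only vcf
bcftools reheader -h header.txt -o reheadered_sites_only.vcf.gz v2.hom_ref_sites_only.vcf.gz
# add FORMAT and per-sample hom ref columns to every record (requires python 3)
gunzip -c reheadered_sites_only.vcf.gz | python3 expand_sites_only_vcf.py -n $sample_count | bgzip -c > reheadered_sites_only_expanded.vcf.gz
tabix -p vcf reheadered_sites_only_expanded.vcf.gz
# merge the two files into one vcf (requires gatk jar)
java -jar gatk.jar MergeVcfs -I v2.imputed_variants.vcf.gz -I reheadered_sites_only_expanded.vcf.gz -O merged.all_variants.vcf.gz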
Code for the expand_sites_only_vcf.py Python file
#!/usr/bin/env python3
import sys
import argparse


def add_columns_to_tsv(fill_value="0|0:0", num_columns=10):
    """
    Add specified number of columns to each row in a TSV while preserving header lines starting with '#'
    Reads from stdin and writes to stdout
    """
    # Create the additional columns string
    additional_columns = '\t'.join([fill_value] * num_columns)
    for line in sys.stdin:
        line = line.rstrip('\n')
        if line.startswith('#'):
            # Pass through header lines unchanged
            print(line)
        else:
            # Append the FORMAT column, then one hom ref value per sample
            print(line + '\t' + 'GT:DS' + '\t' + additional_columns)


def main():
    parser = argparse.ArgumentParser(
        description='add FORMAT and per-sample hom ref columns to a sites only vcf read from stdin')
    parser.add_argument('-n', '--num-columns', type=int, default=10,
                        help='number of samples to add default values for (default: 10)')
    args = parser.parse_args()
    add_columns_to_tsv(num_columns=args.num_columns)


if __name__ == "__main__":
    main()
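After expanding the sites-only VCF, it can be worth checking that every record has as many columns as the #CHROM header line, since a wrong sample count would otherwise only surface at the merge step. The Python sketch below is one possible check. The function name find_column_mismatches and the example rows are assumptions made for this illustration.

```python
def find_column_mismatches(lines):
    """Return (line_number, column_count) for data rows whose column
    count differs from the #CHROM header line."""
    expected = None
    mismatches = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            # The #CHROM line defines the expected number of columns.
            if line.startswith("#CHROM"):
                expected = len(line.split("\t"))
            continue
        if expected is None:
            raise ValueError("data row found before the #CHROM header line")
        count = len(line.split("\t"))
        if count != expected:
            mismatches.append((number, count))
    return mismatches


# Hypothetical example: the last row was never expanded.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2",
    "chr1\t1\t.\tA\tG\t.\t.\t.\tGT:DS\t0|0:0\t0|0:0",
    "chr1\t2\t.\tA\tG\t.\t.\t.",
]
print(find_column_mismatches(example))  # → [(4, 8)]
```

To run it on a real file, the function can be passed any iterable of text lines, for example a file opened with gzip.open(path, "rt").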
V1
This pipeline is described in Array Imputation v1 Pipeline Overview.