A detailed description of the pipeline can be found at the WARP website.
Inputs and Outputs
Inputs
The All of Us + AnVIL array imputation service expects the following user-provided inputs:
Multi-sample VCF file
See Input VCF requirements to ensure your input file will be accepted and processed by the service.
Output basename
The prefix for all of the outputs' filenames. May only contain alphanumeric characters, dashes, and underscores.
Minimum imputation quality for inclusion
Optional. The minimum imputation quality (DR2) for inclusion in output VCF. Value must be between 0 and 1 (inclusive). Default is 0.0.
Description
Optional. User-provided description of the job.
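The constraints on the output basename and the minimum imputation quality can be checked locally before submitting a job. The following is a minimal sketch, not the service's own validation code: the function name `validate_inputs` is hypothetical, and "alphanumeric" is assumed to mean ASCII letters and digits.

```python
import re

# Assumption: "alphanumeric" means ASCII letters and digits only.
BASENAME_PATTERN = re.compile(r"[A-Za-z0-9_-]+")


def validate_inputs(output_basename: str, min_dr2: float = 0.0) -> list[str]:
    """Return a list of problems found; an empty list means both inputs look valid."""
    errors = []
    # The basename may only contain alphanumeric characters, dashes, and underscores.
    if not BASENAME_PATTERN.fullmatch(output_basename):
        errors.append("output basename may only contain letters, digits, '-' and '_'")
    # The minimum DR2 must lie between 0 and 1, inclusive.
    if not 0.0 <= min_dr2 <= 1.0:
        errors.append("minimum imputation quality (DR2) must be between 0 and 1")
    return errors


if __name__ == "__main__":
    print(validate_inputs("my_cohort-2024", 0.3))
    print(validate_inputs("my cohort", 1.5))
```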
Outputs
When your job is complete, you will be able to download the following output files:
Imputed multi-sample VCF file
A multi-sample VCF file containing imputed genotypes for all samples. The VCF includes the following annotations:
- INFO fields:
- AF (float): Allele Frequency for the ALT allele
- DR2 (float): Dosage R-Squared: estimated squared correlation between estimated REF dose [P(RA) + 2*P(RR)] and true REF dose
- IMP (flag): Imputed marker
- FORMAT fields:
- GT (string): Genotype
- DS (float): Estimated ALT dose [P(RA) + 2*P(AA)]
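As an illustration of the DS formula above, the estimated ALT dose can be computed from the three genotype probabilities. This is a sketch only: the function name `alt_dose` and the tolerance used to check that the probabilities sum to 1 are assumptions, and the output VCF already contains DS, so no such calculation is required of the user.

```python
def alt_dose(p_rr: float, p_ra: float, p_aa: float) -> float:
    """Estimated ALT dose: P(RA) + 2 * P(AA).

    p_rr, p_ra, p_aa are the probabilities of the hom-ref, het,
    and hom-alt genotypes for one sample at one site.
    """
    total = p_rr + p_ra + p_aa
    # Assumed tolerance for rounding error in the input probabilities.
    if abs(total - 1.0) > 1e-6:
        raise ValueError(f"genotype probabilities sum to {total}, expected 1")
    return p_ra + 2.0 * p_aa


if __name__ == "__main__":
    # A sample that is most likely heterozygous has a dose near 1.
    print(alt_dose(0.25, 0.5, 0.25))
```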
Imputed multi-sample VCF index file
Imputation chunks QC TSV
A TSV file containing QC information about the imputation chunks, with the following columns:
- chrom: the contig/chromosome
- start and end: the start and end positions of the chunk
- var_in_filtered_input: the number of variants in the input array chunk, after filtering (see Site filtering)
- var_in_panel: the number of variants in the filtered input array chunk overlapping with the reference panel
- chunk_was_imputed: boolean; currently always true, because any chunk with insufficient overlap with the reference panel will fail the job
Contigs metrics TSV
A TSV file containing metrics information about the input contigs (chromosomes). It contains one line for each contig/chromosome in the input, with the following columns:
- chrom: the contig/chromosome
- var_in_raw_input: the number of variants in the raw input
- var_in_filtered_input: the number of variants in the filtered input (see Site filtering)
- var_in_panel: the number of variants in the filtered input overlapping with the reference panel
- percent_passing_filter: the percentage of variants in the raw input that passed Site filtering
- percent_overlap_with_panel: the percentage of filtered variants overlapping the reference panel
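The two percentage columns follow from the three count columns. The sketch below shows that relationship under stated assumptions: percentages are expressed on a 0 to 100 scale, and a zero denominator yields 0.0. Neither detail is confirmed by the service, and the function name `contig_percentages` is hypothetical.

```python
def contig_percentages(var_in_raw_input: int,
                       var_in_filtered_input: int,
                       var_in_panel: int) -> dict[str, float]:
    """Derive the percentage columns of the contigs metrics TSV from its counts."""

    def pct(part: int, whole: int) -> float:
        # Assumption: report 0.0 rather than failing when the denominator is 0.
        return 100.0 * part / whole if whole else 0.0

    return {
        # Share of raw input variants that survived site filtering.
        "percent_passing_filter": pct(var_in_filtered_input, var_in_raw_input),
        # Share of filtered variants that are also in the reference panel.
        "percent_overlap_with_panel": pct(var_in_panel, var_in_filtered_input),
    }


if __name__ == "__main__":
    print(contig_percentages(1000, 800, 600))
```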
Input Validation
The service will validate that your inputs meet all the requirements. To ensure your inputs will pass these checks, please see Input VCF requirements.
Quality Control
QC is performed as part of the Imputation Pipeline to ensure that only reliable data is returned to the user. This QC happens at both the site level and the chunk level.
Site filtering
The following types of sites are filtered from the input VCF before chunk QC is done:
- Insertions or deletions
- Sites containing symbolic alleles
- Multi-allelic sites
- Sites where more than 10% of samples are no-calls
- Sites that are already filtered in the input VCF
A site that meets any one of these criteria is filtered out before chunk QC.
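The site-filtering criteria above can be sketched as a single check on one VCF record. This is an illustration, not the pipeline's implementation. The `Site` class and helper names are hypothetical, and several details are assumptions: a no-call is a genotype with every allele missing, a symbolic allele is one in angle-bracket or breakend notation, and a site counts as filtered when its FILTER value is anything other than `PASS` or `.`.

```python
import re
from dataclasses import dataclass


@dataclass
class Site:
    """Hypothetical, simplified view of one VCF record."""
    ref: str
    alts: list[str]
    filters: list[str]    # FILTER values, e.g. ["PASS"] or ["LowQual"]
    genotypes: list[str]  # per-sample GT strings, e.g. "0/1", "./."


def is_symbolic(allele: str) -> bool:
    # Assumption: angle-bracket alleles (<DEL>) and breakends (with [ or ]).
    return allele.startswith("<") or "[" in allele or "]" in allele


def is_no_call(gt: str) -> bool:
    # Assumption: a no-call has every allele missing ("./." or ".|.").
    return all(a == "." for a in re.split(r"[/|]", gt))


def filter_reasons(site: Site, max_no_call_fraction: float = 0.10) -> list[str]:
    """Return the criteria this site meets; an empty list means it is kept."""
    reasons = []
    if any(not is_symbolic(a) and len(a) != len(site.ref) for a in site.alts):
        reasons.append("insertion or deletion")
    if any(is_symbolic(a) for a in site.alts):
        reasons.append("symbolic allele")
    if len(site.alts) > 1:
        reasons.append("multi-allelic")
    if site.genotypes:
        no_calls = sum(is_no_call(g) for g in site.genotypes)
        # "More than 10%": exactly 10% no-calls is still accepted.
        if no_calls / len(site.genotypes) > max_no_call_fraction:
            reasons.append("more than 10% no-calls")
    if any(f not in ("PASS", ".") for f in site.filters):
        reasons.append("site is filtered")
    return reasons


if __name__ == "__main__":
    snp = Site("A", ["G"], ["PASS"], ["0/1"] * 10)
    print(filter_reasons(snp))
```

Returning the list of matched criteria, rather than a single boolean, makes it easier to see why a given site would be dropped.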
Chunk QC
Chunks are created by splitting the site-filtered input VCF into 25 Mb chunks with 2 Mb overlaps. Each chunk must pass both of the following QC checks:
- There must be at least 3 variants in the site-filtered input VCF that are also in the reference panel.
- At least 50% of the variants in the site-filtered input VCF must be included in the reference panel.
If any chunk does not meet these two criteria, then the pipeline will fail.
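The two chunk-level checks can be expressed in terms of the counts reported in the imputation chunks QC TSV. The following is a minimal sketch, assuming that `var_in_filtered_input` and `var_in_panel` are the quantities being compared. The function name `chunk_passes_qc` is hypothetical, and treating a chunk with no filtered variants as a failure is an assumption.

```python
def chunk_passes_qc(var_in_filtered_input: int,
                    var_in_panel: int,
                    min_panel_variants: int = 3,
                    min_overlap_fraction: float = 0.5) -> bool:
    """Check one chunk against the two chunk QC criteria."""
    # Assumption: a chunk with no site-filtered variants cannot pass.
    if var_in_filtered_input == 0:
        return False
    # Criterion 1: at least 3 filtered input variants are in the reference panel.
    enough_variants = var_in_panel >= min_panel_variants
    # Criterion 2: at least 50% of filtered input variants are in the panel.
    enough_overlap = var_in_panel / var_in_filtered_input >= min_overlap_fraction
    return enough_variants and enough_overlap


if __name__ == "__main__":
    print(chunk_passes_qc(100, 60))  # meets both criteria
    print(chunk_passes_qc(100, 40))  # overlap below 50%
    print(chunk_passes_qc(4, 2))     # fewer than 3 variants in the panel
```

Applying this check to every row of a local variant count before submission can indicate which chunk would cause the job to fail.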
Phasing
Phasing is done with Beagle using the All of Us + AnVIL reference panel. Phasing is performed whether the input VCF is phased or not. Each chunk is phased separately.
Note: Variants in the input VCF that do not overlap with the reference panel will be dropped here and will not be in the final imputed output VCF.
Imputing
Imputation is done with Beagle using the All of Us + AnVIL reference panel. Each chunk is imputed separately.
Non Autosomal Imputing
Currently, the imputation service performs only autosomal imputation (chr1-chr22). We are exploring chrX, chrY, and HLA imputation, but these are not yet available.
Troubleshooting
If you have any problems running the pipeline, please refer to the Troubleshooting article.