About the reference panel – Data Science Services at Broad Clinical Labs

The All of Us + AnVIL reference panel is the largest and most ancestrally diverse reference panel currently available. It gives researchers a resource for making new discoveries and for imputing their datasets with greater accuracy.

Jump to:

Panel Description
Preliminary Array Imputation scientific validation
Preliminary Low-Pass Imputation scientific validation

Panel Description

Our multi-ancestry reference panel was built by jointly phasing 414,830 genomes from the All of Us Curated Data Repository v8 and an additional 100,749 genomes from the Center for Common Disease Genomics (CCDG) datasets hosted on AnVIL (Analysis, Visualization, and Informatics Lab-space), NHGRI’s cloud computing platform. Together these make up a total of 515,579 genomes.

The reference panel contains 665,398,839 high-quality sites comprising 989,868,760 variants. A sites-only VCF without annotations is available to download here for v1 or here for v2 (you will need to authenticate using the account you use with the service).
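The distinction between sites and variants can be illustrated with a small sketch. It assumes, without confirmation from the panel documentation, that a multi-allelic site is stored as a single VCF record with comma-separated ALT alleles; if the downloadable file splits such sites into separate records, the counts would be derived differently. The function name and the toy records are hypothetical.

```python
import io

def count_sites_and_variants(vcf_lines):
    """Count records (sites) and ALT alleles (variants) in VCF text lines.

    Assumption: one record per site, ALT alleles separated by commas.
    """
    sites = 0
    variants = 0
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        alt = fields[4]  # ALT is the fifth column of a VCF record
        sites += 1
        # Ignore missing (".") and spanning-deletion ("*") placeholders.
        variants += sum(1 for allele in alt.split(",") if allele not in (".", "*"))
    return sites, variants

# Toy sites-only VCF with made-up records (not taken from the panel).
toy_vcf = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t1000\t.\tA\tG\t.\tPASS\t.\n"
    "chr1\t2000\t.\tC\tT,G\t.\tPASS\t.\n"
    "chr1\t3000\t.\tGA\tG\t.\tPASS\t.\n"
)

sites, variants = count_sites_and_variants(io.StringIO(toy_vcf))
print(f"{sites} sites, {variants} variants")
```

In the toy input, the second record carries two ALT alleles, so the variant count exceeds the site count, which is the same relationship the panel totals show.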

Based on genetically inferred ancestry (GIA) from the combined dataset, the panel includes 254,416 genomes with European GIA (EUR, 49%), 101,982 with African/African American GIA (AFR, 20%), 90,553 with admixed American GIA (AMR, 18%), 13,226 with East Asian GIA (EAS, 3%), 9,710 with South Asian GIA (SAS, 2%), and 1,065 with predominantly Middle Eastern and North African GIA (MENA, 0.2%), as well as 44,627 with other or multiple GIA (9%).
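As a quick arithmetic check, the counts quoted above can be summed and converted to shares of the full panel. This is only a sketch using the numbers stated on this page; the dictionary keys are shorthand labels chosen for illustration.

```python
# Genome counts per genetically inferred ancestry (GIA) group, as quoted above.
gia_counts = {
    "EUR": 254_416,
    "AFR": 101_982,
    "AMR": 90_553,
    "EAS": 13_226,
    "SAS": 9_710,
    "MENA": 1_065,
    "Other/multiple": 44_627,
}

total = sum(gia_counts.values())

# The group counts should add up to the All of Us plus CCDG genome totals.
assert total == 414_830 + 100_749

# Share of the panel for each group, in percent.
shares = {label: 100 * count / total for label, count in gia_counts.items()}

for label, pct in shares.items():
    print(f"{label}: {pct:.1f}%")
```

The percentages in the text are these shares rounded to the nearest whole percent (one decimal place for MENA).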

Preliminary Array Imputation scientific validation

We performed preliminary scientific validation by imputing Global Diversity Array (GDA) data from 42 participants, chosen as an ancestrally diverse set of samples, and comparing the imputed genotypes against the same participants' 30x whole genome sequences (WGS) (Table 1). Genotyping and sequencing were performed at the Broad Institute.

Table 1. Number of genomes, by inferred genetic ancestry, whose GDA and 30x WGS data were used for scientific validation.

Inferred genetic ancestry	Number of genomes
African American (AFR-American)	10
Admixed American (AMR)	10
South Asian (SAS)	8
Non-Finnish European (NFE)	6
African (AFR)	4
East Asian (EAS)	4

The GDA data were imputed with both the All of Us + AnVIL imputation service and the NHLBI TOPMed imputation server. The All of Us + AnVIL imputation service gave higher mean R² at all SNP and indel allele frequencies in every ancestry group except African (Figures 1 and 2). The African samples used in our validation are four Coriell cell lines from the “Africans South of the Sahara” Human Variation Panel. This result is expected because the TOPMed panel includes more African (as opposed to African American) samples than the All of Us + AnVIL reference panel, which is composed of American participants.

Figures 1 and 2. Mean R² values for imputed SNPs (Figure 1) and indels (Figure 2) shared between the two imputation panels, across all chromosomes, compared to the respective 30x whole genomes for 42 samples from diverse ancestries (African, African American, Admixed American, East Asian, Non-Finnish European, South Asian). The red dotted line shows R² values for imputation results from the All of Us + AnVIL imputation service, and the blue dotted line shows those from the NHLBI TOPMed imputation server. The X axis is allele frequency, and the Y axis is the mean R².
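To make the plotted quantity concrete, here is a minimal sketch of one common way to compute a mean R² per allele frequency bin. It assumes, since this page does not state the exact method, that R² is the squared Pearson correlation between imputed dosages and sequenced genotypes for each variant, averaged within half-open frequency bins. The function names, bin edges, and toy values are all hypothetical.

```python
from collections import defaultdict

def r_squared(dosages, truth):
    """Squared Pearson correlation between imputed dosages and true genotypes."""
    n = len(dosages)
    mean_d = sum(dosages) / n
    mean_t = sum(truth) / n
    cov = sum((d - mean_d) * (t - mean_t) for d, t in zip(dosages, truth))
    var_d = sum((d - mean_d) ** 2 for d in dosages)
    var_t = sum((t - mean_t) ** 2 for t in truth)
    if var_d == 0 or var_t == 0:
        return None  # undefined when either vector is constant
    return cov * cov / (var_d * var_t)

def mean_r2_by_af_bin(variants, bin_edges):
    """Average per-variant R² within half-open bins [low, high).

    variants: iterable of (allele_frequency, dosages, truth) tuples.
    """
    per_bin = defaultdict(list)
    for af, dosages, truth in variants:
        r2 = r_squared(dosages, truth)
        if r2 is None:
            continue  # skip variants where R² cannot be computed
        for low, high in zip(bin_edges, bin_edges[1:]):
            if low <= af < high:
                per_bin[(low, high)].append(r2)
                break
    return {bin_: sum(vals) / len(vals) for bin_, vals in per_bin.items()}

# Toy data for four samples (made-up values, not validation results).
toy_variants = [
    (0.002, [0.1, 0.0, 0.9, 0.0], [0, 0, 1, 0]),  # rare variant, near-perfect
    (0.2, [0.0, 1.0, 2.0, 1.0], [0, 1, 2, 1]),    # common variant, perfect
    (0.3, [1.0, 1.0, 1.0, 1.0], [0, 1, 2, 1]),    # constant dosage, skipped
]
result = mean_r2_by_af_bin(toy_variants, [0.0, 0.01, 0.05, 1.0])
for (low, high), mean_r2 in sorted(result.items()):
    print(f"AF [{low}, {high}): mean R² = {mean_r2:.3f}")
```

A per-ancestry comparison like the one in the figures would apply the same calculation to the subset of samples in each ancestry group, once per imputation panel.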

Preliminary Low-Pass Imputation scientific validation

Coming soon!
