Data description
Overview
This repository contains haploid local ancestry inference (LAI) tracts for Phase 3 admixed populations from the 1000 Genomes Project. The data were generated in 2014 by the 1000 Genomes Project Admixture Working Group and downloaded from the official FTP server on December 5, 2025. All tracts were generated with respect to hg19.
Data source: https://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20140818_ancestry_deconvolution/.
Populations Included
The following admixed populations are included:
- ACB – Afro-Caribbean from Barbados
- ASW – African American from Southwest US
- CLM – Colombians from Medellín
- MXL – Mexicans from Los Angeles
- PEL – Peruvians from Lima
- PUR – Puerto Ricans
File Organization
For each population, data are distributed in compressed archives:
*_phase3_ancestry_deconvolution.zip
Each archive contains two haploid BED files per individual (one per chromosome copy), for example:
PUR/PopPhased/bed_files/HG00553_A.bed
PUR/PopPhased/bed_files/HG00553_B.bed
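As a minimal sketch of how the per-individual files might be located inside one archive, the snippet below groups BED members by sample ID with Python's standard `zipfile` module. The function name `list_haploid_beds` is hypothetical, and the `_A`/`_B` suffix pattern is assumed from the two example paths above rather than from any documented naming rule.

```python
import re
import zipfile
from collections import defaultdict

# Assumed from the example paths: <sample>_A.bed and <sample>_B.bed
BED_NAME = re.compile(r"(?P<sample>[^/]+)_(?P<copy>[AB])\.bed$")


def list_haploid_beds(archive, phasing="PopPhased"):
    """Map sample ID -> {"A": member, "B": member} for one phasing mode.

    `archive` may be a path or a binary file-like object.
    """
    beds = defaultdict(dict)
    with zipfile.ZipFile(archive) as zf:
        for member in zf.namelist():
            # Keep only members under [POP]/<phasing>/bed_files/
            if f"/{phasing}/bed_files/" not in f"/{member}":
                continue
            match = BED_NAME.search(member)
            if match:
                beds[match["sample"]][match["copy"]] = member
    return dict(beds)
```

Reading the member names without extracting the archive keeps the check cheap; a sample whose entry lacks either `"A"` or `"B"` would indicate an incomplete pair.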
Directory Structure
Each population directory ([POP]) follows the structure below:
[POP]/
├── PopPhased/
│ ├── alleles_rephased/
│ ├── bed_files/
│ ├── karyograms/
│ └── lai_global_*.txt
├── TrioPhased/
│ ├── bed_files/
│ ├── karyograms/
│ └── lai_global_*.txt
└── rfmix_input/
Where:
- `PopPhased`: phase-corrected LAI calls
- `TrioPhased`: phase-uncorrected LAI calls
- `bed_files`: collapsed haploid ancestry tracts
- `karyograms`: PNG visualizations of ancestry along chromosomes
- `rfmix_input`: input files used to generate LAI calls
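For an archive that has already been extracted, the layout above can be turned into file paths directly. The helper below is a small sketch; `haploid_bed_paths` and the `root` argument are hypothetical names, and the `_A.bed`/`_B.bed` suffixes are again assumed from the example file names.

```python
from pathlib import Path

PHASING_MODES = ("PopPhased", "TrioPhased")


def haploid_bed_paths(root, pop, sample, phasing="PopPhased"):
    """Return the (copy A, copy B) BED paths for one individual."""
    if phasing not in PHASING_MODES:
        raise ValueError(f"phasing must be one of {PHASING_MODES}")
    bed_dir = Path(root) / pop / phasing / "bed_files"
    return bed_dir / f"{sample}_A.bed", bed_dir / f"{sample}_B.bed"
```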
BED File Format
Each haploid BED file is tab-delimited with the following six columns:

| Column | Description |
|---|---|
| 1 | Chromosome number (1–22) |
| 2 | Start physical position (0-based, bp) |
| 3 | End physical position (1-based, bp) |
| 4 | Haploid ancestry call |
| 5 | Start genetic position (cM) |
| 6 | End genetic position (cM) |
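The six-column layout can be read with a few lines of standard-library Python. This is a sketch under the assumption that the files carry no header line and no comment lines; the `Tract` type and the `read_tracts` function are illustrative names, not part of the distributed scripts.

```python
from typing import Iterable, Iterator, NamedTuple


class Tract(NamedTuple):
    chrom: str
    start_bp: int
    end_bp: int
    ancestry: str
    start_cm: float
    end_cm: float


def read_tracts(lines: Iterable[str]) -> Iterator[Tract]:
    """Parse tab-delimited BED lines into Tract records, skipping blank lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) != 6:
            raise ValueError(f"expected 6 tab-separated columns, got {len(fields)}")
        chrom, start, end, ancestry, cm_start, cm_end = fields
        yield Tract(chrom, int(start), int(end), ancestry,
                    float(cm_start), float(cm_end))
```

Because `read_tracts` accepts any iterable of lines, it can be fed an open file handle, for example `with open(path) as handle: tracts = list(read_tracts(handle))`.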
Ancestry Codes

| Code | Meaning |
|---|---|
| AFR | African |
| EUR | European |
| NAT | Native American |
| UNK | Uncertain (posterior < 0.9) |
| centromere | Centromeric regions (3 Mb each) |
| miscall | Masked region with high miscall rate |
Masked Region on Chromosome 15
The following region is masked due to high empirical miscall rates:

| Chromosome | Start (bp) | End (bp) | Code | cM start | cM end |
|---|---|---|---|---|---|
| 15 | 20,071,673 | 22,422,348 | miscall | 0.00598 | 23.70848 |
A BED file containing all masked regions (centromeres + miscall) is provided as:
cent_miscall.bed
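To illustrate how masked intervals might be used downstream, the sketch below tests whether a query interval overlaps a masked region. Only the chromosome 15 miscall interval is hard-coded, because the centromere coordinates are not listed in this description; in practice the full set would be loaded from `cent_miscall.bed`. The name `masked_overlaps` and the half-open interval convention are assumptions.

```python
# Chromosome 15 miscall interval from the table above (bp, hg19)
MASKED = [("15", 20_071_673, 22_422_348, "miscall")]


def masked_overlaps(chrom, start, end, masked=MASKED):
    """Return the masked intervals that overlap [start, end) on `chrom`."""
    return [
        region for region in masked
        if region[0] == str(chrom) and start < region[2] and region[1] < end
    ]
```

An empty list means the query interval is clear of the listed masks.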
Global Ancestry Proportions
For each population, global ancestry proportions are provided in files named:
lai_global_*.txt
Columns

| Column | Description |
|---|---|
| 1 | 1000 Genomes individual ID |
| 2 | Proportion African ancestry |
| 3 | Proportion European ancestry |
| 4 | Proportion Native American ancestry |
| 5 | Proportion Unknown (UNK only) |
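Example definition for European ancestry:

$$
\text{EUR proportion} = \frac{\operatorname{sum}[\text{EUR/EUR}] + \bigl(\operatorname{sum}[\text{EUR/AFR}] + \operatorname{sum}[\text{EUR/NAT}]\bigr)/2}{\operatorname{sum}[\text{EUR} + \text{AFR} + \text{NAT} + \text{UNK}]}
$$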
Note: UNK excludes regions masked in all individuals (centromere and miscall).
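The definition above can be approximated from the two haploid tract lists of one individual. The sketch below sums tract lengths per ancestry over both chromosome copies and divides by the total called length: a EUR/EUR stretch then counts fully and a EUR/AFR or EUR/NAT stretch counts half, as in the formula. Two points are assumptions rather than facts from the source: whether lengths are measured in bp or cM, and the handling of stretches paired with UNK (here they contribute half to the known ancestry). The function name `global_proportions` is hypothetical, and the released `lai_global_*.txt` files may have been computed differently.

```python
from collections import defaultdict

# Codes that enter the denominator; centromere and miscall are excluded
CALLED = ("AFR", "EUR", "NAT", "UNK")


def global_proportions(hap_a, hap_b):
    """Estimate global ancestry from two haploid tract lists.

    hap_a, hap_b: iterables of (ancestry_code, tract_length) pairs,
    one iterable per chromosome copy.
    """
    totals = defaultdict(float)
    for ancestry, length in list(hap_a) + list(hap_b):
        if ancestry in CALLED:
            totals[ancestry] += length
    denom = sum(totals.values())
    if denom == 0:
        raise ValueError("no called tracts found")
    return {code: totals[code] / denom for code in CALLED}
```

For instance, an individual with 60 units of EUR/EUR and 40 units of EUR/AFR has hap_a = EUR 60 + AFR 40 and hap_b = EUR 100, which gives a European proportion of (60 + 40/2) / 100 = 0.8 under both the diploid formula and this haploid bookkeeping.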
Methodology
Reference Panels and Phasing
Phased haplotypes were obtained from the 1000 Genomes Phase 3 FTP site (SHAPEIT2).
Native American reference samples (>99% NAT ancestry) were phased similarly.
The reference panel included 50 CEU, 50 YRI, and 43 Native American individuals.
Local Ancestry Inference
Local ancestry was inferred using RFMix v1.5.4, a discriminative method that uses random forests within a conditional random field framework (Maples et al. 2013).
Two inference modes were used:
- TrioPhased: no phase correction. Options: `-w 0.5`, `--forward-backward`.
- PopPhased: includes phase correction via EM. Options: `-w 0.5`, `--forward-backward`, `--use-reference-panels-in-EM`, and `-e 5`. Calls from the 5th EM iteration were retained.
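The two option sets can be kept side by side as data, which makes the difference between the modes explicit. This sketch only records the flags listed above; the rest of the RFMix command line (entry point, input files, output prefix) is not given in this description and is deliberately left out. The name `rfmix_options` is hypothetical.

```python
# Flags copied from the text above; values are not independently verified
COMMON_OPTIONS = ["-w", "0.5", "--forward-backward"]

MODE_OPTIONS = {
    "TrioPhased": list(COMMON_OPTIONS),
    # PopPhased adds EM-based phase correction with 5 iterations
    "PopPhased": COMMON_OPTIONS + ["--use-reference-panels-in-EM", "-e", "5"],
}


def rfmix_options(mode):
    """Return the option tokens listed for one inference mode."""
    try:
        return list(MODE_OPTIONS[mode])
    except KeyError:
        raise ValueError(f"unknown mode: {mode!r}") from None
```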
Post-processing
- Haploid ancestry calls were collapsed into tracts.
- Karyograms were generated per individual.
- Global ancestry proportions were computed.
Parsing scripts are available at: https://github.com/armartin/ancestry_pipeline.
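The collapsing step can be pictured as run-length encoding of per-SNP ancestry calls along one haplotype. The sketch below is not the code from the linked repository; it assumes a hypothetical input of `(chrom, pos_bp, pos_cm, ancestry)` tuples sorted by chromosome and position, and it places tract boundaries at the first and last SNP of each run, which may differ from the boundary rule used for the released BED files.

```python
from itertools import groupby


def collapse_calls(calls):
    """Collapse sorted per-SNP calls into ancestry tracts.

    calls: iterable of (chrom, pos_bp, pos_cm, ancestry), sorted by
    chromosome and position.
    Returns a list of (chrom, start_bp, end_bp, ancestry, start_cm, end_cm).
    """
    tracts = []
    # Consecutive SNPs with the same chromosome and ancestry form one tract
    for (chrom, ancestry), run in groupby(calls, key=lambda c: (c[0], c[3])):
        run = list(run)
        first, last = run[0], run[-1]
        tracts.append((chrom, first[1], last[1], ancestry, first[2], last[2]))
    return tracts
```

Grouping on the `(chrom, ancestry)` pair ensures that a tract never spans two chromosomes even when the ancestry call is unchanged across the boundary.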
References
Mao et al. Am. J. Hum. Genet. 80, 1171 (2007).
O’Connell et al. PLoS Genet. 10, e1004234 (2014).
Maples et al. Am. J. Hum. Genet. 93, 278–288 (2013).
Native American training data available at: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20130711_native_american_admix_train/.
Contact
Data generated by the 1000 Genomes Project Admixture Working Group (2014):
Alicia Martin, Brian Maples, Carlos Bustamante (Stanford University),
Simon Gravel, Soheil Baharian (McGill University),
Eimear Kenny (Icahn School of Medicine at Mount Sinai).
For questions, contact Alicia Martin: armartin@stanford.edu