tracts.population.Population#

class Population(list_indivs=None, names=None, fname=None, labs=None, selectchrom=None, allosomes=None, ignore_length_consistency=False, filenames_by_individual=None, male_list=None)#

Bases: object

A class representing a population of diploid individuals. A Population is a list of Indiv objects.

currentplot#

The index of the currently plotted individual.

Type:

int

win#

The Tkinter window used for plotting.

Type:

tk.Tk

canv#

The Tkinter canvas used for plotting the current individual. This is stored as an attribute to allow for updating the plot when navigating between individuals.

Type:

tk.Canvas

chro_canvas#

The Tkinter canvas used for plotting chromosomes.

Type:

tk.Canvas

colordict#

A dictionary mapping ancestry labels to color strings.

Type:

dict[str, str]

_flats#

A cached flattened list of tracts for the population. If None, the flattened list has not been computed yet.

Type:

list[Tract] | None

allosome_labels#

A list of labels for the allosomes in the population.

Type:

list[str]

allosome_lengths#

A dictionary mapping allosome labels to their lengths.

Type:

dict[str, float]

indivs#

A list of individuals in the population.

Type:

list[Indiv]

male_list#

A list of labels for male individuals in the population. This is used to determine the sex of individuals when the sex cannot be inferred from the data. If None, the sex of individuals will be inferred from the data by checking the number of X chromosomes.

Type:

list[str] | None

nind#

The number of individuals in the population.

Type:

int

Ls#

A list of chromosome lengths for the individuals in the population. It is assumed that all individuals have the same chromosome lengths.

Type:

list[float]

maxLen#

The maximum chromosome length among the individuals in the population.

Type:

float

num_males#

The number of male individuals in the population.

Type:

int

num_females#

The number of female individuals in the population.

Type:

int

__init__(list_indivs=None, names=None, fname=None, labs=None, selectchrom=None, allosomes=None, ignore_length_consistency=False, filenames_by_individual=None, male_list=None)#

Initializes the Population class.

Parameters:
  • list_indivs (list[Indiv] | None) – A list of Indiv objects representing the individuals in the population. If provided, this will be used to initialize the population directly.

  • names (list[str] | None) – A list of names for the individuals in the population. This is used when initializing the population from a file format.

  • fname (tuple[str, str, str] | None) – A tuple with the start, middle and end of the filenames for loading individuals from files. The individual files should be specified in the format start–Indiv–Middle–_A–End.

  • labs (list[str] | None) – A list of labels for the chromosome copies. This is used when initializing the population from a file format.

  • selectchrom (list[int | str] | None) – A list of chromosome labels to select when initializing the population from a file format. If None, all chromosomes will be selected.

  • allosomes (list[str] | None) – A list of labels for the allosomes in the population. This is used when initializing the population from a file format to identify which chromosomes are allosomes.

  • ignore_length_consistency (bool) – A flag indicating whether to ignore consistency in chromosome lengths across individuals when initializing the population from a file format. If False, an error will be raised if individuals have different chromosome lengths. If True, the population will be initialized even if individuals have different chromosome lengths.

  • filenames_by_individual (dict[str, list[str]] | None) – A dictionary mapping individual names to lists of filenames for loading individuals from files. The individual files should be specified in the format start–Indiv–Middle–_A–End. This is an alternative to using fname and names for loading individuals from files, and allows for more flexibility in specifying the filenames for each individual.

  • male_list (list[str] | None) – A list of labels for male individuals in the population.

Notes

There are two ways to build populations, either from a dataset stored in files or from a list of individuals. The facilities for loading populations from files present in this constructor are deprecated. It is advised to instead load a list of individuals, using tracts.indiv.Indiv.from_files(), and to then pass that list to this constructor.

The population can be initialized either from a list list_indivs of Indiv objects, or from a file format fname together with a list names of individual names. When reading from files, fname should be a tuple with the start, middle and end of the filenames, where an individual file is specified by start–Indiv–Middle–_A–End. Otherwise, provide a list of individuals.
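As an illustration of the naming scheme described above, the following sketch assembles per-individual file names by plain string concatenation. The helper build_filenames, the default labels and the sample values are hypothetical, and joining the pieces as start + name + middle + label + end is one reading of the description, not the library's own code.

```python
def build_filenames(fname, names, labs=("_A", "_B")):
    """Return {individual name: [one file name per chromosome copy]}.

    Assumption: each path is start + name + middle + label + end.
    """
    start, middle, end = fname
    return {
        name: [start + name + middle + lab + end for lab in labs]
        for name in names
    }


# Hypothetical prefix, infix, suffix and individual names.
files = build_filenames(("data/", "_anc", ".bed"), ["NA19700", "NA19701"])
print(files["NA19700"])
```

The resulting dictionary has the same shape (dict[str, list[str]]) as the filenames_by_individual argument.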

ancestry_at_pos(select_chrom=0, pos=0, cutoff=0.0)#

Finds the ancestry proportion at a specific position. The cutoff is used to restrict the calculation to tracts that extend beyond the given position.

Parameters:
  • select_chrom (int) – The index of the chromosome to analyze. It is assumed that all individuals have the same number of chromosomes and that the chromosome with index select_chrom corresponds to the same chromosome across individuals.

  • pos (int) – The position along the chromosome at which to calculate ancestry proportions. It is assumed that all individuals have the same chromosome lengths and that the position pos corresponds to the same location along the chromosome across individuals.

  • cutoff (float) – A threshold for the length of ancestry tracts to consider when calculating ancestry proportions. Only tracts that extend beyond the position pos by at least cutoff will be included in the calculation of ancestry proportions.

Returns:

A tuple containing two dictionaries. The first dictionary maps ancestry labels to the count of tracts of that ancestry that extend beyond the position pos by at least cutoff. The second dictionary maps ancestry labels to the average length of tracts of that ancestry that extend beyond the position pos by at least cutoff.

Return type:

tuple[dict[str, int], dict[str, float]]
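The return value can be pictured with a minimal sketch that works on plain tuples rather than Tract objects. The function ancestry_at and the (label, start, end) representation are hypothetical, and the rule used here (a tract qualifies if it covers pos and ends at least cutoff past it) is an assumed interpretation of the cutoff description.

```python
from collections import defaultdict

# Hypothetical stand-in data: one list of (label, start, end) per chromosome copy.
copies = [
    [("CEU", 0.0, 0.6), ("YRI", 0.6, 1.5)],
    [("YRI", 0.0, 0.9), ("CEU", 0.9, 1.5)],
    [("CEU", 0.0, 1.5)],
]


def ancestry_at(copies, pos, cutoff=0.0):
    """Count tracts covering pos and average their lengths, per ancestry.

    Assumption: a tract qualifies if it covers pos and ends at least
    cutoff past it.
    """
    counts = defaultdict(int)
    total_length = defaultdict(float)
    for tracts in copies:
        for label, start, end in tracts:
            if start <= pos < end and end - pos >= cutoff:
                counts[label] += 1
                total_length[label] += end - start
    mean_length = {lab: total_length[lab] / counts[lab] for lab in counts}
    return dict(counts), mean_length


counts, mean_length = ancestry_at(copies, pos=0.5)
print(counts, mean_length)
```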

ancestry_per_pos(select_chrom=0, npts=50, cutoff=0.0)#

Prepares the ancestry per position across a chromosome.

Parameters:
  • select_chrom (int) – The index of the chromosome to analyze. It is assumed that all individuals have the same number of chromosomes and that the chromosome with index select_chrom corresponds to the same chromosome across individuals.

  • npts (int) – The number of positions along the chromosome at which to calculate ancestry proportions. The positions will be evenly spaced along the chromosome, starting from position 0 and ending at the length of the chromosome.

  • cutoff (float) – A threshold for the length of ancestry tracts to consider when calculating ancestry proportions. Only tracts that extend beyond each sampled position by at least cutoff will be included in the calculation of ancestry proportions.

Returns:

  • np.ndarray – An array of positions along the chromosome at which ancestry proportions were calculated. The positions are evenly spaced along the chromosome, starting from position 0 and ending at the length of the chromosome.

  • list[tuple[dict[str, int], dict[str, float]]] – A list of tuples, where each tuple corresponds to a position along the chromosome and contains two dictionaries. The first dictionary maps ancestry labels to the count of tracts of that ancestry that extend beyond the corresponding position by at least cutoff. The second dictionary maps ancestry labels to the average length of tracts of that ancestry that extend beyond the corresponding position by at least cutoff.
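The position grid described above can be sketched in a few lines. The helper sample_positions is hypothetical and uses only the standard library; numpy.linspace(0, chrom_length, npts) produces the same grid as an array.

```python
def sample_positions(chrom_length, npts=50):
    """Return npts evenly spaced positions from 0 to chrom_length inclusive."""
    if npts < 2:
        return [0.0]
    step = chrom_length / (npts - 1)
    return [i * step for i in range(npts)]


positions = sample_positions(2.0, npts=5)
print(positions)  # [0.0, 0.5, 1.0, 1.5, 2.0]
```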

applychrom(func, indlist=None)#

Apply a function to chromosomes.

Parameters:
  • func (callable) – A function that takes a Chrom object as input and returns a value. This function will be applied to each chromosome in the population.

  • indlist (list) – A list of individuals to which the function should be applied. If None, the function will be applied to all individuals in the population.

Returns:

A list of the results of applying the function func to each chromosome in the population (or to the chromosomes of the individuals in indlist if it is not None).

Return type:

list

bootinds(seed=0)#

Returns a bootstrapped list of individuals in the population. Pass the returned list as the indlist parameter of get_global_tractlengths() to get a bootstrapped sample.

Parameters:

seed (int) – The random seed to use for bootstrapping. Setting the seed allows for reproducibility of the bootstrapped samples.
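Bootstrapping here means resampling individuals with replacement. The sketch below shows the idea with the standard library; the function bootstrap is hypothetical, and the random number generator the library actually uses may differ, so the same seed will not necessarily select the same individuals.

```python
import random


def bootstrap(individuals, seed=0):
    """Resample individuals with replacement, keeping the sample size."""
    rng = random.Random(seed)  # seeded generator for reproducibility
    return [rng.choice(individuals) for _ in individuals]


inds = ["ind1", "ind2", "ind3", "ind4"]
sample = bootstrap(inds, seed=0)
print(sample)
```

Calling bootstrap again with the same seed returns the same sample, which is what makes a bootstrap analysis repeatable.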

static calculate_allosome_lengths(indivs, allosome_labels)#

Calculate the lengths of allosomes across individuals.

Parameters:
  • indivs (list[Indiv]) – A list of individuals in the population.

  • allosome_labels (list[str]) – A list of labels for the allosomes in the population. This is used to identify which chromosomes are allosomes.

Returns:

A dictionary mapping allosome labels to their lengths. The length of an allosome is determined by the length of the chromosome with that label in the first individual that has that allosome. It is assumed that all individuals have the same lengths for their allosomes, and an error is raised if this is not the case.

Return type:

dict[str, float]

calculate_allosome_proportions(population_labels, allosome_label, cutoff=0.0)#

Calculates the mean ancestry proportion across individuals in the population using only data from a specified allosome.

Parameters:
  • population_labels (list[str]) – A list of ancestry labels for which to calculate the ancestry proportions.

  • allosome_label (str) – The label for the allosome to use for calculating ancestry proportions. It is assumed that all individuals have the same allosomes and that the allosome with label allosome_label corresponds to the same chromosome across individuals.

  • cutoff (float) – A threshold for the length of ancestry tracts to consider when calculating ancestry proportions.

Returns:

A list of ancestry proportions corresponding to the ancestry labels in the input population_labels list, averaged across all individuals in the population, calculated using only data from the specified allosome.

Return type:

list[float]

Notes

IDE warnings may appear and can be ignored.

calculate_ancestry_proportions(population_labels, cutoff=0.0)#

Calculates the mean ancestry proportion across individuals in the population using only autosomal data.

Parameters:
  • population_labels (list[str]) – A list of ancestry labels for which to calculate the ancestry proportions. The function will calculate the ancestry proportion for each ancestry label in this list for each individual in the population.

  • cutoff (float) – A threshold for the length of ancestry tracts to consider when calculating ancestry proportions.

Returns:

A list of ancestry proportions corresponding to the ancestry labels in the input population_labels list, averaged across all individuals in the population.

Return type:

list[float]
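A minimal sketch of the averaging step, assuming each individual is reduced to a list of (label, length) pairs. The function names are hypothetical, and two details are assumptions: tracts no longer than cutoff are dropped, and only tracts whose label appears in the requested list count toward the denominator.

```python
def individual_proportions(tracts, labels, cutoff=0.0):
    """Fraction of one individual's tract length assigned to each label."""
    kept = [(lab, length) for lab, length in tracts
            if lab in labels and length > cutoff]
    total = sum(length for _, length in kept)
    return [sum(length for lab, length in kept if lab == target) / total
            for target in labels]


def mean_proportions(individuals, labels, cutoff=0.0):
    """Average the per-individual proportions across individuals."""
    per_ind = [individual_proportions(t, labels, cutoff) for t in individuals]
    return [sum(column) / len(per_ind) for column in zip(*per_ind)]


individuals = [
    [("CEU", 1.0), ("YRI", 3.0)],   # 25% CEU, 75% YRI
    [("CEU", 2.0), ("YRI", 2.0)],   # 50% CEU, 50% YRI
]
props = mean_proportions(individuals, ["CEU", "YRI"])
print(props)  # [0.375, 0.625]
```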

static calculate_num_sexes(indivs, allosome_labels)#

Calculate the number of males and females in the population based on their allosome composition. If the allosome labels do not include ‘X’, a warning is raised and the number of males and females is recorded as zero.

Parameters:
  • indivs (list[Indiv]) – A list of individuals in the population.

  • allosome_labels (list[str]) – A list of labels for the allosomes in the population. This is used to identify which chromosomes are allosomes and to determine the sex of individuals based on their allosome composition.

Returns:

A tuple containing the number of males and females in the population.

Return type:

tuple[int, int]

Notes

Currently, the function only checks for the presence of ‘X’ in the allosome labels.
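The counting logic can be sketched as follows. The function count_sexes and its input (the number of X copies per individual) are hypothetical, and treating one X copy as male and two as female is an assumption based on the description of male_list above.

```python
import warnings


def count_sexes(x_copies_per_individual, allosome_labels):
    """Return (num_males, num_females) from each individual's X copy count.

    Assumption: one X copy means male, two X copies mean female.
    """
    if "X" not in allosome_labels:
        warnings.warn("No 'X' among allosome labels; sexes not counted.")
        return 0, 0
    males = sum(1 for n in x_copies_per_individual if n == 1)
    females = sum(1 for n in x_copies_per_individual if n == 2)
    return males, females


print(count_sexes([1, 2, 2, 1, 2], ["X"]))  # (2, 3)
```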

flatpop(ls=None)#

Returns a flattened version of a population-wide list at the tract level, and throws away the start and end information of the tract.

Parameters:

ls (list | None) – A list of tracts to flatten. If None, the function will flatten the complete list of tracts contained in this population. If a list is provided, the function will flatten that list instead of the complete list of tracts in the population.

Returns:

A list of Tract objects representing the flattened version of the input list of tracts (or the complete list of tracts in the population if ls is None). The start and end information of the tracts is discarded in the returned list.

Return type:

list[Tract]

getMeansByChrom(ancestries)#

Gets the ancestry proportions in each individual of the population for each chromosome.

Parameters:

ancestries (list[str]) – A list of ancestry labels for which to calculate the ancestry proportions for each chromosome. The function will calculate the ancestry proportions for each ancestry label in this list for each chromosome in each individual in the population.

Returns:

A list of lists of lists, where the outer list contains one inner list for each individual in the population, the middle list contains one inner list for each ancestry label in the input ancestries list, and the innermost list contains the ancestry proportions for each chromosome for that ancestry label for that individual.

Return type:

list[list[list[float]]]

get_global_allosome_tractlengths(allosome, npts=50, tol=0.01, indlist=None, exclude_tracts_below_cM=0)#

Returns the allosomal tractlength histogram in males and the allosomal tractlength histogram in females.

Parameters:
  • allosome (str) – The label for the allosome to analyze.

  • npts (int) – The number of bins for the histogram.

  • tol (float) – The tolerance for full chromosomes.

  • indlist (list) – The individuals for which we want the tractlength.

  • exclude_tracts_below_cM (float) – The minimum length of tracts to include in the histogram.

Returns:

  • np.ndarray – The bins for the histogram.

  • dict[SexType, dict[str, np.ndarray]] – A dictionary with keys SexType.MALE and SexType.FEMALE, where the value for each key is a dictionary with ancestry labels as keys and a histogram of tract lengths as values for each ancestry.

get_global_tractlength_table(lenbound)#

Calculates the fraction of the genome covered by ancestry tracts of different lengths, specified by lenbound (which must be sorted).

Parameters:

lenbound (list[float]) – A sorted list of length boundaries for categorizing ancestry tracts. The function will calculate the fraction of the genome covered by ancestry tracts that fall into each of the length categories defined by these boundaries.

Returns:

  • list[float] – The length boundaries for categorizing ancestry tracts, as specified by the input lenbound.

  • dict[str, np.ndarray] – A dictionary with ancestry labels as keys and an array of the fraction of the genome covered by ancestry tracts of different lengths as values.
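One way to picture the table is to bin tracts by length and accumulate the share of total length per ancestry. The sketch below is only an illustration: the function length_table, the (label, length) input and the binning convention (category i holds lengths from lenbound[i-1] up to but excluding lenbound[i], with a final open-ended category) are all assumptions rather than the library's behavior.

```python
from bisect import bisect_right


def length_table(tracts, lenbound):
    """Fraction of total tract length per ancestry and length category."""
    total = sum(length for _, length in tracts)
    table = {}
    for label, length in tracts:
        # One slot per category: len(lenbound) bounded ones plus an open-ended one.
        row = table.setdefault(label, [0.0] * (len(lenbound) + 1))
        row[bisect_right(lenbound, length)] += length / total
    return table


tracts = [("CEU", 0.5), ("CEU", 1.5), ("YRI", 2.0)]
table = length_table(tracts, lenbound=[1.0, 2.0])
print(table)
```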

get_global_tractlengths(npts=50, tol=0.01, indlist=None, split_count=1, exclude_tracts_below_cM=0)#

Returns the tract length histogram for each ancestry in the population.

Parameters:
  • tol (float) – The tolerance for full chromosomes.

  • npts (int) – The number of bins for the histogram.

  • indlist (list) – The individuals for which we want the tractlength. To bootstrap over individuals, provide a bootstrapped list of individuals.

  • split_count (int) – If greater than 1, the population is split into split_count groups according to their ancestry proportions, and the tractlength histogram is computed separately for each group.

  • exclude_tracts_below_cM (float) – Exclude tracts below this length in cM.

Returns:

  • np.ndarray – The bins for the histogram.

  • dict[str, np.ndarray] – A dictionary with ancestry labels as keys and a histogram of tract lengths as values.

Notes

Sometimes there are small issues at the edges of the chromosomes. If a segment is within tol Morgans of the full chromosome, it counts as a full chromosome. Note that the histogram includes an extra bin for complete chromosomes, so there is one more data point than there are bins.
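The handling of full chromosomes can be sketched with a small pure-Python histogram. The function tract_histogram and its (tract length, chromosome length) input are hypothetical, and equal-width bins from 0 to max_len are an assumption; the point is only to show where the extra data point comes from.

```python
def tract_histogram(tracts, max_len, npts=50, tol=0.01):
    """Histogram (length, chromosome length) pairs into npts bins plus one.

    Tracts within tol of their chromosome's full length are counted in
    the extra last slot instead of a regular bin.
    """
    edges = [i * max_len / npts for i in range(npts + 1)]
    counts = [0] * (npts + 1)  # npts regular bins + 1 full-chromosome slot
    for length, chrom_length in tracts:
        if length >= chrom_length - tol:
            counts[-1] += 1
        else:
            index = min(int(length / max_len * npts), npts - 1)
            counts[index] += 1
    return edges, counts


tracts = [(0.10, 2.0), (0.45, 2.0), (1.995, 2.0), (1.0, 1.0)]
edges, counts = tract_histogram(tracts, max_len=2.0, npts=4)
print(counts)  # [2, 0, 0, 0, 2]
```

With npts=4 there are four regular bins but five counts: the last entry holds the two tracts that span (almost) a whole chromosome.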

get_mean_ancestry_proportions(ancestries)#

Gets the mean ancestry proportion averaged across individuals in the population.

Parameters:

ancestries (list[str]) – A list of ancestry labels for which to calculate the mean ancestry proportion. The function will calculate the mean ancestry proportion for each ancestry label in this list, averaged across all individuals in the population.

Returns:

A list of mean ancestry proportions corresponding to the ancestry labels in the input ancestries list.

Return type:

list[float]

get_means(ancestries)#

Gets the mean ancestry proportion (only among ancestries in ancestries) for all individuals.

Parameters:

ancestries (list[str]) – A list of ancestry labels for which to calculate the mean ancestry proportion for each individual.

Returns:

A list of lists, where each inner list contains the mean ancestry proportions for the ancestry labels in the input ancestries list for a single individual in the population. The outer list contains one inner list for each individual in the population.

Return type:

list[list[float]]

get_meanvar(ancestries)#

Gets the mean and variance of ancestry proportions across individuals in the population, for ancestries in ancestries.

Parameters:

ancestries (list[str]) – A list of ancestry labels for which to calculate the mean and variance of ancestry proportions across individuals in the population.

Returns:

  • list[float] – A list of mean ancestry proportions corresponding to the ancestry labels in the input ancestries list, averaged across all individuals in the population.

  • list[float] – A list of variances of ancestry proportions corresponding to the ancestry labels in the input ancestries list, calculated across all individuals in the population.
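Given per-individual proportions in the shape returned by get_means() (one inner list per individual, one entry per ancestry), the mean and variance per ancestry can be sketched with the standard library. The sample values are made up, and the use of the population variance (dividing by n rather than n - 1) is an assumption.

```python
from statistics import fmean, pvariance

# Hypothetical per-individual proportions for two ancestries.
per_individual = [[0.2, 0.8], [0.4, 0.6], [0.6, 0.4]]

# Transpose so that each column holds one ancestry across individuals.
columns = list(zip(*per_individual))
means = [fmean(column) for column in columns]
variances = [pvariance(column) for column in columns]
print(means, variances)
```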

get_variance(ancestries)#

Calculates the total variance in ancestry proportions, the genealogy variance, and the assortment variance. The assortment variance corresponds to the mean uncertainty about the proportion of genealogical ancestors, given observed ancestry patterns.

Parameters:

ancestries (list[str]) – A list of ancestry labels for which to calculate the variance in ancestry proportions. The function will calculate the variance in ancestry proportions for each ancestry label in this list across all individuals in the population.

Returns:

  • list[float] – A list of total variances in ancestry proportions corresponding to the ancestry labels in the input ancestries list, calculated across all individuals in the population.

  • list[float] – A list of genealogy variances corresponding to the ancestry labels in the input ancestries list, calculated across all individuals in the population.

  • list[float] – A list of assortment variances corresponding to the ancestry labels in the input ancestries list, calculated across all individuals in the population.

Notes

All unlisted ancestries are considered uncalled. For example, calling the function with a single ancestry leads to no variance (and some 0/0 errors).

iflatten(indivs=None)#

Flattens a list of individuals to the tract level.

Parameters:

indivs (list | None) – A list of individuals to flatten. If None, the function will flatten the complete list of individuals contained in this population. If a list is provided, the function will flatten that list of individuals instead of the complete list of individuals in the population.

Returns:

A generator that yields Tract objects representing the flattened version of the input list of individuals (or the complete list of individuals in the population if indivs is None). The start and end information of the tracts is preserved in the yielded Tract objects.

Return type:

generator
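Flattening to the tract level can be sketched as a nested generator. The function iter_tracts is hypothetical, and the nesting used here (individual, then chromosome pair, then copy, then tract) is an assumed stand-in built from plain lists and tuples.

```python
def iter_tracts(individuals):
    """Yield every tract, walking individual -> chromosome -> copy -> tract."""
    for indiv in individuals:
        for chrom in indiv:
            for copy in chrom:
                yield from copy


# Hypothetical nested lists standing in for individuals with one chromosome pair each.
individuals = [
    [[[("CEU", 0.0, 1.0)], [("YRI", 0.0, 0.4), ("CEU", 0.4, 1.0)]]],
    [[[("YRI", 0.0, 1.0)], [("YRI", 0.0, 1.0)]]],
]
flat = list(iter_tracts(individuals))
print(len(flat))  # 5
```

Because the tracts are yielded lazily, a caller can filter or count them without building the full list first.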

list_chromosome(chronum)#

Collects the chromosomes with the given number across the whole population.

Parameters:

chronum (int) – The index of the chromosome to collect across the population. It is assumed that all individuals have the same number of chromosomes and that the chromosome with index chronum corresponds to the same chromosome across individuals.

Returns:

A list of Chrom objects corresponding to the chromosome with index chronum across all individuals in the population.

Return type:

list[Chrom]

merge_ancestries(ancestries, newlabel)#

Treats ancestries in label list ancestries as a single population with label newlabel. Adjacent tracts of the new ancestry are merged.

Parameters:
  • ancestries (list[str]) – A list of ancestry labels to merge into a single population. The function will treat all tracts with labels in this list as belonging to the same population and will merge adjacent tracts of these ancestries into a single tract with the label newlabel.

  • newlabel (str) – The label to assign to the merged ancestry.
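The relabel-and-merge step can be sketched on a single chromosome copy. The function merge_labels and the (label, start, end) tuples are hypothetical stand-ins for the library's objects; the sketch assumes the tracts are sorted by position.

```python
def merge_labels(tracts, ancestries, newlabel):
    """Relabel tracts in ancestries as newlabel and fuse adjacent ones."""
    merged = []
    for label, start, end in tracts:
        if label in ancestries:
            label = newlabel
        # Fuse with the previous tract only if both carry the new label
        # and the previous tract ends where this one starts.
        if merged and merged[-1][0] == label == newlabel and merged[-1][2] == start:
            merged[-1] = (label, merged[-1][1], end)
        else:
            merged.append((label, start, end))
    return merged


tracts = [("CEU", 0.0, 0.3), ("TSI", 0.3, 0.7), ("YRI", 0.7, 1.0)]
print(merge_labels(tracts, ["CEU", "TSI"], "EUR"))
# [('EUR', 0.0, 0.7), ('YRI', 0.7, 1.0)]
```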

new_indiv()#

Creates a new individual by randomly selecting two parents from the population, creating gametes from each parent, and combining those gametes to form a new individual.

Returns:

A new Indiv object representing the offspring of the two randomly selected parents.

Return type:

Indiv

newgen()#

Build a new generation from this population.

Returns:

A new Population object representing the next generation.

Return type:

Population

plot(colordict)#

Plots the individuals in the population using a color dictionary that maps ancestry labels to colors.

Parameters:

colordict (dict[str, str]) – A dictionary that maps ancestry labels (as strings) to color codes (also as strings) that can be used in plotting.

plot_all_ancestries(npts=50, colordict=None, startfig=0, cutoff=0)#

Plots the ancestry proportions along all chromosomes across individuals in the population using a color dictionary that maps ancestry labels to colors.

Parameters:
  • npts (int) – The number of points along each chromosome at which to plot the ancestry proportions.

  • colordict (dict[str, str]) – A dictionary that maps ancestry labels (as strings) to color codes (also as strings) that can be used in plotting. If None, a default color dictionary will be used that maps “CEU” to ‘blue’ and “YRI” to ‘red’.

  • startfig (int) – The starting figure number for plotting. The function will plot the ancestry proportions for each chromosome in a separate subplot, and the figure numbers for these subplots will start from this value.

  • cutoff (float) – A threshold for the length of ancestry tracts to consider when calculating ancestry proportions at each point along the chromosomes. Only tracts that are longer than this threshold will be considered when calculating the ancestry proportions at each point.

plot_ancestries(chrom=0, npts=50, colordict=None, cutoff=0.0)#

Plots the ancestry proportions along a chromosome across individuals in the population using a color dictionary that maps ancestry labels to colors.

Parameters:
  • chrom (int) – The index of the chromosome to plot. It is assumed that all individuals have the same number of chromosomes and that the chromosome with index chrom corresponds to the same chromosome across individuals.

  • npts (int) – The number of points along the chromosome at which to plot the ancestry proportions.

  • colordict (dict[str, str]) – A dictionary that maps ancestry labels (as strings) to color codes (also as strings) that can be used in plotting. If None, a default color dictionary will be used that maps “CEU” to ‘blue’ and “YRI” to ‘red’.

  • cutoff (float) – A threshold for the length of ancestry tracts to consider when calculating ancestry proportions at each point along the chromosome. Only tracts that are longer than this threshold will be considered when calculating the ancestry proportions at each point.

plot_chromosome(i, colordict, win=None)#

Plot a single chromosome across individuals in the population using a color dictionary that maps ancestry labels to colors.

Parameters:
  • i (int) – The index of the chromosome to plot. It is assumed that all individuals have the same number of chromosomes and that the chromosome with index i corresponds to the same chromosome across individuals.

  • colordict (dict[str, str]) – A dictionary that maps ancestry labels (as strings) to color codes (also as strings) that can be used in plotting.

  • win (Tk) – A Tkinter window in which to plot the chromosome. If None, a new window will be created for the plot. If a window is provided, the chromosome will be plotted in that window instead of creating a new one.

plot_global_tractlengths(colordict, npts=50, legend=True)#

Plot the distribution of global tract lengths for each population.

Parameters:
  • colordict (dict[str, str]) – A dictionary that maps ancestry labels (as strings) to color codes (also as strings) that can be used in plotting. The function will plot the distribution of global tract lengths for each ancestry label in this dictionary using the corresponding color.

  • npts (int) – The number of bins for the histogram of tract lengths. The function will use this number of bins when plotting the distribution of global tract lengths for each ancestry label.

  • legend (bool) – Whether to include a legend in the plot. If True, a legend will be included that maps ancestry labels to colors. If False, no legend will be included in the plot.

plot_indiv()#

Plots the individual at the current plot index and stores it in self.canv.

plot_next()#

Plots the next individual in the population.

Returns:

A visual representation of the individual’s ancestry tracts. See plot() for details on the visual representation.

Return type:

tk.Tk

plot_previous()#

Plots the previous individual in the population.

Returns:

A visual representation of the individual’s ancestry tracts. See plot() for details on the visual representation.

Return type:

tk.Tk

save()#

Saves the current plot of the population to a file. The user is prompted to choose a file location and name for saving the plot.

set_males(male_list, allosome_label='X')#

Sets the list of males for each individual.

Parameters:
  • male_list (list[str] | None) – A list of labels for male individuals in the population.

  • allosome_label (str) – The label for the allosome to use for determining the sex of individuals when the sex cannot be inferred from the data.

smooth_unknowns(allosome_labels=None)#

Smooths the unknown labels for each individual in the population.

Parameters:

allosome_labels (list[str] | None) – A list of labels for the allosomes in the population.

split_by_props(count)#

Splits this population into groups according to their ancestry proportions. The individuals are sorted in ascending order of their ancestry named anc.

Parameters:

count (int) – The number of groups to split the population into. If count is 1, the function returns a list containing this population without splitting.

Returns:

A list of count Population objects, each containing a group of individuals from the original population.

Return type:

list[Population]
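The sort-then-split idea can be sketched as follows. The function split_sorted is hypothetical, it returns plain lists rather than Population objects, and cutting the sorted individuals into groups of near-equal size is an assumption about how the groups are formed.

```python
def split_sorted(individuals, proportion, count):
    """Sort individuals by proportion(ind) and cut into count groups."""
    if count == 1:
        return [list(individuals)]
    ordered = sorted(individuals, key=proportion)
    n = len(ordered)
    # Integer cut points that divide the sorted list as evenly as possible.
    bounds = [i * n // count for i in range(count + 1)]
    return [ordered[bounds[i]:bounds[i + 1]] for i in range(count)]


# Hypothetical ancestry proportion for each individual.
props = {"a": 0.9, "b": 0.1, "c": 0.5, "d": 0.3}
groups = split_sorted(list(props), props.get, count=2)
print(groups)  # [['b', 'd'], ['c', 'a']]
```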

tractlength_histogram(tracts_by_population, npts=50, tol=0.01, exclude_tracts_below_cM=0, maxLen=None)#

Helper function for get_global_tractlengths that takes in a dictionary of tracts organized by population and returns the histogram of tract lengths for each population.

Parameters:
  • tracts_by_population (dict[str, list[tuple[Tract, float]]]) – A dictionary where the keys are ancestry labels and the values are lists of tuples, where each tuple contains a Tract object and the length of the chromosome that tract is on. This dictionary is used to compute the histogram of tract lengths for each ancestry label.

  • npts (int) – The number of bins for the histogram.

  • tol (float) – The tolerance for full chromosomes. Sometimes there are small issues at the edges of the chromosomes. If a segment is within tol Morgans of the full chromosome, it counts as a full chromosome. Note that the histogram includes an extra bin for complete chromosomes, so there is one more data point than there are bins.

  • exclude_tracts_below_cM (float) – Exclude tracts below this length in centiMorgans from the histogram.

Returns:

  • np.ndarray – The bins for the histogram.

  • dict[str, np.ndarray] – A dictionary with ancestry labels as keys and a histogram of tract lengths as values.