Population Structure Modules

Population genetics analysis modules for Fst statistics, dimensionality reduction (PCA, UMAP, t-SNE), and visualization.

Fst Statistics

class ideal_genom.population.fst_stats.FstSummary(input_path: Path, input_name: str, output_path: Path, high_ld_file: Path = PosixPath('.'), build: str = '38', recompute_merge: bool = True, reference_files: dict = {})[source]

Bases: object

__init__(input_path: Path, input_name: str, output_path: Path, high_ld_file: Path = PosixPath('.'), build: str = '38', recompute_merge: bool = True, reference_files: dict = {}) None[source]

Initialize FstSummary object for Fst analysis.

Parameters:
  • input_path (Path) – Path to the directory containing input files

  • input_name (str) – Name of the input file

  • output_path (Path) – Path to the directory where results will be saved

merge_reference_study(ind_pair: list = [50, 5, 0.2]) None[source]

Merge reference and study data by applying quality control filters and merging steps.

This method performs the following quality control steps to merge study data with reference data:

  1. Filters problematic SNPs
  2. Performs LD pruning
  3. Fixes chromosome mismatches
  4. Fixes position mismatches
  5. Fixes allele flips
  6. Removes remaining mismatches
  7. Merges the datasets

Parameters:

ind_pair (list, default [50, 5, 0.2]) – Parameters for LD pruning: [window size, step size, r2 threshold]

Return type:

None

Notes

If recompute_merge is False, the method will skip the merging process and expect merged data to already exist in the merging directory.

Raises:

TypeError – If ind_pair is not a list

add_population_tags() None[source]

Add population tags to the merged dataset. This method adds super-population tags from the reference dataset to the merged dataset. It reads population information from the reference PSAM file, merges it with the study dataset, and assigns ‘StPop’ (study population) to samples not present in the reference dataset.

Requirements

  • Merged dataset files (.bed, .bim, .fam) must exist in the merging directory

  • Reference files dictionary must contain a valid ‘psam’ Path

Raises:
  • FileNotFoundError – If any of the required merged files are not found

  • ValueError – If the reference files dictionary doesn’t contain a valid ‘psam’ Path

Side Effects

  • Creates a new tab-separated file with population tags at {merging_dir}/cleaned-with-ref-merged-pop-tags.csv

  • Sets self.population_tags to the path of the created file

Return type:

None
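
The tagging rule described above (reference samples keep their super-population, everything else becomes ‘StPop’) can be sketched in a few lines. This is an illustration of the rule only, not the library's implementation: the function name, the sample IDs, and the dictionary-based inputs are hypothetical, and the real method reads its labels from the reference PSAM file.

```python
def tag_populations(merged_ids, reference_superpop, study_tag="StPop"):
    """Return {sample_id: tag}; samples absent from the reference get study_tag."""
    return {sid: reference_superpop.get(sid, study_tag) for sid in merged_ids}


# Hypothetical sample IDs and labels, for illustration only.
reference_superpop = {"REF_001": "EUR", "REF_002": "AFR"}
merged_ids = ["REF_001", "REF_002", "STUDY_001", "STUDY_002"]

tags = tag_populations(merged_ids, reference_superpop)
for sample_id, tag in tags.items():
    print(sample_id, tag, sep="\t")
```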

compute_fst() None[source]

Compute FST (fixation index) statistics between populations.

This method calculates FST statistics between each super-population in the dataset and a study population (‘StPop’). The process involves:

  1. Reading population tags from the specified file
  2. For each unique super-population (except ‘StPop’):
     - Creating population filter files (keep and within files)
     - Running PLINK commands to filter the dataset and compute FST statistics

The method requires the following instance variables to be set:
  • population_tags: Path to a file containing population information

  • results_dir: Directory where results will be stored

  • merging_dir: Directory containing the merged genotype data

Returns:

None
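
The per-population loop described above can be sketched as follows. The sketch only shows how samples might be grouped for each comparison against ‘StPop’; the function name, the tuple layout, and the example IDs are assumptions, and the PLINK calls that the method runs on the resulting keep and within files are not shown.

```python
def fst_comparison_rows(population_tags, study_tag="StPop"):
    """Group samples for each 'super-population vs. study population' comparison.

    population_tags is a list of (fid, iid, tag) tuples. The result maps each
    super-population to the rows that belong to it or to the study population.
    """
    superpops = sorted({tag for _, _, tag in population_tags if tag != study_tag})
    comparisons = {}
    for pop in superpops:
        comparisons[pop] = [
            (fid, iid, tag)
            for fid, iid, tag in population_tags
            if tag in (pop, study_tag)
        ]
    return comparisons


# Hypothetical family/sample IDs, for illustration only.
example_tags = [
    ("F1", "REF_001", "EUR"),
    ("F2", "REF_002", "AFR"),
    ("F3", "STUDY_001", "StPop"),
]
comparisons = fst_comparison_rows(example_tags)
for pop, rows in comparisons.items():
    print(pop, [iid for _, iid, _ in rows])
```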

report_fst() pandas.DataFrame[source]

Generate a report of Fst results. This method reads the Fst results from the results directory and generates a summary report.

Returns:

DataFrame containing the Fst results summary

Return type:

pd.DataFrame

Raises:

FileNotFoundError – If no Fst result files are found in the results directory.

Dimensionality Reduction and Projection

Module for dimensionality reduction (PCA, UMAP, t-SNE) and for plotting the resulting projections.

class ideal_genom.population.projection.PCAReduction(input_path: Path, input_name: str, output_path: Path, build: str = '38', high_ld_regions: Path | None = None, generate_plot: bool = True)[source]

Bases: object

__init__(input_path: Path, input_name: str, output_path: Path, build: str = '38', high_ld_regions: Path | None = None, generate_plot: bool = True) None[source]
execute_ld_pruning(maf: float = 0.001, geno: float = 0.1, mind: float = 0.2, hwe: float = 5e-08, ind_pair: list = [50, 5, 0.2]) None[source]

Execute linkage disequilibrium (LD) pruning on study and reference data.

This method performs LD-based pruning using PLINK to remove highly correlated SNPs from both study and reference datasets. The pruning is done using a sliding window approach where SNPs are removed based on their pairwise correlation (r²).

Parameters:

ind_pair (list) –

A list containing three elements:

  • ind_pair[0] (int): Window size in SNPs

  • ind_pair[1] (int): Number of SNPs to shift the window at each step

  • ind_pair[2] (float): r² threshold for pruning

Raises:
  • TypeError – If ind_pair is not a list.

  • TypeError – If first two elements of ind_pair are not integers.

  • TypeError – If third element of ind_pair is not a float.

Return type:

None

Notes

  • Uses PLINK’s --indep-pairwise command for pruning.

  • Excludes high LD regions specified in self.high_ld_regions.

  • Creates pruned datasets for both study and reference data.

  • Updates self.pruned_reference and self.pruned_study with paths to pruned files.

  • Uses all available CPU threads except 2 for processing.
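
As a rough illustration of the checks and notes above, the following sketch validates ind_pair and assembles a pruning command line. It is not the library's code: the helper names and the `plink` executable name are assumptions, and the QC filters (maf, geno, mind, hwe) and the high-LD region exclusion are left out for brevity.

```python
import os


def validate_ind_pair(ind_pair):
    """Apply the three documented type checks and return the unpacked values."""
    if not isinstance(ind_pair, list):
        raise TypeError("ind_pair must be a list")
    window, step, r2 = ind_pair
    if not isinstance(window, int) or not isinstance(step, int):
        raise TypeError("window size and step size must be integers")
    if not isinstance(r2, float):
        raise TypeError("r2 threshold must be a float")
    return window, step, r2


def default_threads():
    """All available CPU threads except 2, but never fewer than 1."""
    return max(1, (os.cpu_count() or 1) - 2)


def ld_prune_command(bfile, out, ind_pair, threads=None):
    """Build the argument list for a PLINK --indep-pairwise run."""
    window, step, r2 = validate_ind_pair(ind_pair)
    if threads is None:
        threads = default_threads()
    return [
        "plink",  # assumed executable name
        "--bfile", str(bfile),
        "--indep-pairwise", str(window), str(step), str(r2),
        "--threads", str(threads),
        "--out", str(out),
    ]


command = ld_prune_command("study", "study-pruned", [50, 5, 0.2], threads=4)
print(" ".join(command))
```

Returning the command as a list keeps it ready for `subprocess.run` without shell quoting.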

execute_pca(pca: int = 20, maf: float = 0.01) None[source]

Perform Principal Component Analysis (PCA) on the genetic data using PLINK.

This method executes PCA on the merged genetic data file, calculating the specified number of principal components. It automatically determines the optimal number of threads and memory allocation based on system resources.

Parameters:
  • pca (int, default=20) – Number of principal components to calculate. Must be a positive integer.

  • maf (float, default=0.01) – Minor allele frequency threshold for filtering variants. Must be between 0 and 0.5.

Return type:

None

Raises:
  • TypeError – If pca is not an integer or maf is not a float

  • ValueError – If pca is not positive or maf is not between 0 and 0.5

Notes

The method creates two output files:

  • {output_name}-pca.eigenvec: Contains the eigenvectors (per-sample principal component coordinates)

  • {output_name}-pca.eigenval: Contains the eigenvalues

The results are stored in self.einvectors and self.eigenvalues attributes.
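
A common follow-up is to turn the eigenvalues into the share of variance carried by each component. The sketch below assumes the .eigenval file holds one eigenvalue per line, which is how PLINK writes it; the function name and the example values are made up. The shares are relative to the components that were computed, not to the total genetic variance.

```python
def explained_variance(eigenval_text):
    """Return each eigenvalue divided by the sum of all listed eigenvalues."""
    values = [float(line) for line in eigenval_text.splitlines() if line.strip()]
    total = sum(values)
    return [value / total for value in values]


# Made-up eigenvalues standing in for the contents of an .eigenval file.
example = "4.0\n2.0\n1.5\n0.5\n"
proportions = explained_variance(example)
for index, share in enumerate(proportions, start=1):
    print(f"PC{index}: {share:.4f}")
```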

execute_pcareduction_pipeline(maf: float = 0.001, geno: float = 0.1, mind: float = 0.2, hwe: float = 5e-08, ind_pair: list = [50, 5, 0.2], pca: int = 20, case_control_markers: bool = True) None[source]

Execute the full preparation pipeline: LD pruning followed by PCA.

This method sequentially performs LD pruning on the genetic data and then computes principal components using PCA. It combines the functionalities of execute_ld_pruning and execute_pca methods.

Parameters:
  • ind_pair (list) – A list containing three elements for LD pruning: ind_pair[0] (int), window size in SNPs; ind_pair[1] (int), number of SNPs to shift the window at each step; ind_pair[2] (float), r² threshold for pruning.

  • pca (int, default=20) – Number of principal components to calculate. Must be a positive integer.

  • maf (float, default=0.001) – Minor allele frequency threshold for filtering variants. Must be between 0 and 0.5.

Return type:

None

class ideal_genom.population.projection.UMAPReduction(eigenvector: Path, output_path: Path)[source]

Bases: object

Class for performing UMAP dimensionality reduction on PCA eigenvectors.

This class handles UMAP transformation of high-dimensional PCA data into 2D space for visualization. Use Plot2D class for generating plots.

__init__(eigenvector: Path, output_path: Path) None[source]

Initialize UMAPReduction object.

Parameters:
  • eigenvector (Path) – Path to the eigenvector file (.eigenvec) from PCA analysis

  • output_path (Path) – Path to the directory where results will be saved

Notes

Creates ‘umap_results’ directory in the output path.

fit_transform(n_neighbors: int = 15, min_dist: float = 0.1, metric: str = 'euclidean', random_state: int | None = None, n_components: int = 2, umap_kwargs: dict | None = None) pandas.DataFrame[source]

Perform UMAP dimensionality reduction on PCA eigenvectors.

Parameters:
  • n_neighbors (int, default=15) – Number of neighbors for UMAP manifold approximation. Must be positive.

  • min_dist (float, default=0.1) – Minimum distance between points in low-dimensional space. Must be non-negative.

  • metric (str, default='euclidean') – Distance metric for UMAP (e.g., ‘euclidean’, ‘cosine’, ‘manhattan’)

  • random_state (int, optional) – Random seed for reproducibility. Must be non-negative.

  • n_components (int, default=2) – Number of dimensions in the output

  • umap_kwargs (dict, optional) – Additional keyword arguments to pass to UMAP constructor.

Returns:

DataFrame with columns [‘ID1’, ‘ID2’, ‘umap_1’, ‘umap_2’, …]

Return type:

pd.DataFrame

Raises:
  • TypeError – If parameters are not of correct type

  • ValueError – If parameter values are invalid

class ideal_genom.population.projection.TSNEReduction(eigenvector: Path, output_path: Path)[source]

Bases: object

Class for performing t-SNE dimensionality reduction on PCA eigenvectors.

This class handles t-SNE transformation of high-dimensional PCA data into 2D or 3D space for visualization. Use Plot2D class for generating plots.

__init__(eigenvector: Path, output_path: Path) None[source]

Initialize TSNEReduction object.

Parameters:
  • eigenvector (Path) – Path to the eigenvector file (.eigenvec) from PCA analysis

  • output_path (Path) – Path to the directory where results will be saved

Notes

Creates ‘tsne_results’ directory in the output path.

fit_transform(n_components: int = 2, perplexity: float = 30.0, learning_rate: float = 200.0, n_iter: int = 1000, metric: str = 'euclidean', random_state: int | None = None, early_exaggeration: float = 12.0, init: Literal['pca', 'random'] = 'pca', tsne_kwargs: dict | None = None) pandas.DataFrame[source]

Perform t-SNE dimensionality reduction on PCA eigenvectors.

Parameters:
  • n_components (int, default=2) – Number of dimensions in the output (typically 2 or 3)

  • perplexity (float, default=30.0) – Related to number of nearest neighbors. Should be between 5 and 50. Larger datasets require larger perplexity.

  • learning_rate (float, default=200.0) – Learning rate for t-SNE optimization. Usually between 10.0 and 1000.0.

  • n_iter (int, default=1000) – Maximum number of iterations for optimization

  • metric (str, default='euclidean') – Distance metric to use (‘euclidean’, ‘manhattan’, ‘cosine’, etc.)

  • random_state (int, optional) – Random seed for reproducibility. Must be non-negative.

  • early_exaggeration (float, default=12.0) – Controls how tight natural clusters from the original space are in the embedded space and how much space lies between them

  • init (str, default='pca') – Initialization method (‘pca’ or ‘random’)

  • tsne_kwargs (dict, optional) – Additional keyword arguments to pass to TSNE constructor.

Returns:

DataFrame with columns [‘ID1’, ‘ID2’, ‘tsne_1’, ‘tsne_2’, …]

Return type:

pd.DataFrame

Raises:
  • TypeError – If parameters are not of correct type

  • ValueError – If parameter values are invalid

Notes

t-SNE is computationally expensive. For large datasets (>10,000 samples), consider using a perplexity between 30 and 50 and reducing n_iter if needed.

class ideal_genom.population.projection.Plot2D(output_dir: Path)[source]

Bases: object

Class for generating 2D scatter plots with metadata integration.

This class handles the preparation of metadata (color hue files, case-control markers) and generates publication-quality 2D scatter plots for dimensionality reduction results.

__init__(output_dir: Path) None[source]

Initialize Plot2D object.

Parameters:

output_dir (Path) – Directory where plots will be saved

prepare_metadata(color_hue_file: Path | None = None, case_control_markers: bool = False, fam_file: Path | None = None) pandas.DataFrame | None[source]

Prepare metadata DataFrame from color hue file and/or case-control markers.

Parameters:
  • color_hue_file (Path, optional) – Path to tab-separated file with metadata for coloring. Must have at least 3 columns: ID1, ID2, and a metadata column.

  • case_control_markers (bool, default=False) – Whether to load case-control labels from .fam file

  • fam_file (Path, optional) – Path to .fam file containing case-control information. Required if case_control_markers is True.

Returns:

Metadata DataFrame with columns [‘ID1’, ‘ID2’, …] or None if no metadata

Return type:

pd.DataFrame or None
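
For the case-control option, the labels come from the sixth column of the .fam file, which PLINK defines as the phenotype (1 = control, 2 = case, 0 or -9 = missing). The sketch below shows one way to read it. The function name, the label strings, and the mapping of family and within-family IDs onto ‘ID1’ and ‘ID2’ are assumptions rather than the library's behavior.

```python
PHENOTYPE_LABELS = {"1": "Control", "2": "Case"}  # any other code is treated as missing


def read_fam_phenotypes(fam_text):
    """Parse .fam content into rows with ID1, ID2 and a Phenotype label."""
    rows = []
    for line in fam_text.splitlines():
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        rows.append({
            "ID1": fields[0],  # family ID
            "ID2": fields[1],  # within-family ID
            "Phenotype": PHENOTYPE_LABELS.get(fields[5], "Missing"),
        })
    return rows


# Made-up .fam content: FID IID father mother sex phenotype
example_fam = "F1 S1 0 0 1 2\nF2 S2 0 0 2 1\nF3 S3 0 0 0 -9\n"
rows = read_fam_phenotypes(example_fam)
for row in rows:
    print(row["ID1"], row["ID2"], row["Phenotype"])
```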

generate_plot(data: pandas.DataFrame, x_col: str, y_col: str, plot_name: str, hue_col: str | None = None, style_col: str | None = None, title: str | None = None, xlabel: str | None = None, ylabel: str | None = None, figsize: tuple = (5, 5), dpi: int = 500, format: str = 'pdf', marker: str = '.', marker_size: int = 10, alpha: float = 0.5, equal_aspect: bool = True, legend_params: dict | None = None) Path[source]

Generate a 2D scatter plot with optional metadata coloring and styling.

Parameters:
  • data (pd.DataFrame) – DataFrame containing the 2D coordinates and IDs (must have ‘ID1’, ‘ID2’ columns)

  • x_col (str) – Column name for x-axis values

  • y_col (str) – Column name for y-axis values

  • plot_name (str) – Name of the output plot file

  • hue_col (str, optional) – Column name for point coloring. If None and metadata exists, uses third column or ‘Phenotype’ if available.

  • style_col (str, optional) – Column name for point styling (different markers)

  • title (str, optional) – Plot title

  • xlabel (str, optional) – X-axis label. If None, uses x_col.

  • ylabel (str, optional) – Y-axis label. If None, uses y_col.

  • figsize (tuple, default=(5, 5)) – Figure size in inches (width, height)

  • dpi (int, default=500) – Resolution for saving the plot

  • format (str, default='pdf') – Output format (‘pdf’, ‘png’, ‘jpeg’, ‘svg’)

  • marker (str, default='.') – Marker style for scatter plot

  • marker_size (int, default=10) – Size of markers

  • alpha (float, default=0.5) – Transparency of markers (0-1)

  • equal_aspect (bool, default=True) – Whether to set equal aspect ratio

  • legend_params (dict, optional) – Parameters for legend customization (bbox_to_anchor, ncols, fontsize, etc.)

Returns:

Path to the saved plot file

Return type:

Path

Raises:
  • ValueError – If required columns are missing or hue_col not found

  • TypeError – If parameters are of incorrect type

class ideal_genom.population.projection.Plot3D[source]

Bases: object

class ideal_genom.population.projection.DimensionalityReductionPipeline(input_path: Path, input_name: str, output_path: Path, build: str = '38', high_ld_regions_file: Path | None = None, generate_plot: bool = True)[source]

Bases: object

Pipeline for running PCA preparation and dimensionality reduction workflows.

This class orchestrates the complete workflow from raw genetic data to dimensionality reduction visualizations, including:

  1. PCA preparation (LD pruning + PCA)
  2. Optional UMAP reduction
  3. Optional t-SNE reduction
  4. Automated plotting with metadata

__init__(input_path: Path, input_name: str, output_path: Path, build: str = '38', high_ld_regions_file: Path | None = None, generate_plot: bool = True) None[source]

Initialize the dimensionality reduction pipeline.

Parameters:
  • input_path (Path) – Path to directory containing input genetic data files (.bed/.bim/.fam)

  • input_name (str) – Base name of input files (without extension)

  • output_path (Path) – Path to directory where all results will be saved

  • build (str, default='38') – Genome build version (‘37’ or ‘38’)

  • high_ld_regions_file (Path, optional) – Path to file containing high LD regions. If None, will be fetched automatically.

  • generate_plot (bool, default=True) – Whether to generate plots automatically

execute_pca_preparation(maf: float = 0.001, geno: float = 0.1, mind: float = 0.2, hwe: float = 5e-08, ind_pair: list = [50, 5, 0.2], pca: int = 20, case_control_markers: bool = False) Path[source]

Run PCA preparation: LD pruning and principal component analysis.

Parameters:
  • maf (float, default=0.001) – Minor allele frequency threshold

  • geno (float, default=0.1) – Genotype missingness threshold

  • mind (float, default=0.2) – Sample missingness threshold

  • hwe (float, default=5e-8) – Hardy-Weinberg equilibrium p-value threshold

  • ind_pair (list, default=[50, 5, 0.2]) – LD pruning parameters: [window_size, step_size, r2_threshold]

  • pca (int, default=20) – Number of principal components to calculate

Returns:

Path to the generated eigenvector file

Return type:

Path

Notes

This step is required before running UMAP or t-SNE reductions.

execute_umap(n_neighbors: int = 15, min_dist: float = 0.1, metric: str = 'euclidean', random_state: int | None = None, n_components: int = 2, umap_kwargs: dict | None = None) pandas.DataFrame[source]

Run UMAP dimensionality reduction.

Parameters:
  • n_neighbors (int, default=15) – Number of neighbors for UMAP

  • min_dist (float, default=0.1) – Minimum distance between points

  • metric (str, default='euclidean') – Distance metric

  • random_state (int, optional) – Random seed for reproducibility

  • n_components (int, default=2) – Number of output dimensions

  • umap_kwargs (dict, optional) – Additional UMAP parameters

Returns:

UMAP results with columns [‘ID1’, ‘ID2’, ‘umap_1’, ‘umap_2’, …]

Return type:

pd.DataFrame

Raises:

RuntimeError – If PCA preparation hasn’t been run yet

execute_tsne(n_components: int = 2, perplexity: float = 30.0, learning_rate: float = 200.0, n_iter: int = 1000, metric: str = 'euclidean', random_state: int | None = None, early_exaggeration: float = 12.0, init: Literal['pca', 'random'] = 'pca', tsne_kwargs: dict | None = None) pandas.DataFrame[source]

Run t-SNE dimensionality reduction.

Parameters:
  • n_components (int, default=2) – Number of output dimensions

  • perplexity (float, default=30.0) – t-SNE perplexity parameter

  • learning_rate (float, default=200.0) – Learning rate for optimization

  • n_iter (int, default=1000) – Number of optimization iterations

  • metric (str, default='euclidean') – Distance metric

  • random_state (int, optional) – Random seed for reproducibility

  • early_exaggeration (float, default=12.0) – Early exaggeration parameter

  • init ({'pca', 'random'}, default='pca') – Initialization method

  • tsne_kwargs (dict, optional) – Additional t-SNE parameters

Returns:

t-SNE results with columns [‘ID1’, ‘ID2’, ‘tsne_1’, ‘tsne_2’, …]

Return type:

pd.DataFrame

Raises:

RuntimeError – If PCA preparation hasn’t been run yet

generate_plots(color_hue_file: Path | None = None, case_control_markers: bool = False, fam_file: Path | None = None, plot_format: str = 'pdf', dpi: int = 500, figsize: tuple = (5, 5)) dict[source]

Generate plots for all completed reductions.

Parameters:
  • color_hue_file (Path, optional) – Path to metadata file for coloring

  • case_control_markers (bool, default=False) – Whether to use case-control markers

  • fam_file (Path, optional) – Path to .fam file (required if case_control_markers=True)

  • plot_format (str, default='pdf') – Output format (‘pdf’, ‘png’, ‘jpeg’, ‘svg’)

  • dpi (int, default=500) – Resolution for plots

  • figsize (tuple, default=(5, 5)) – Figure size in inches

  • include_pca (bool, default=True) – Whether to generate PCA plots

Returns:

Dictionary mapping method names to plot file paths

Return type:

dict

Raises:

RuntimeError – If no reductions have been run

execute_dimensionality_reduction_pipeline(pca_params: dict | None = None, force_pca_recompute: bool = False, run_umap: bool = True, umap_params: dict | None = None, run_tsne: bool = True, tsne_params: dict | None = None, color_hue_file: Path | None = None, case_control_markers: bool = False, fam_file: Path | None = None, plot_format: str = 'pdf', dpi: int = 500, include_pca: bool = True, save_all_coordinates: bool = True, generate_all_plots: bool = True, grid_summary: bool = True) dict[source]

Run the complete dimensionality reduction pipeline with automatic parameter grid detection.

This method automatically detects whether parameters contain single values or lists. If lists are detected, it runs a parameter grid search exploring all combinations. Otherwise, it runs a single analysis with the provided parameters.

Parameters:
  • pca_params (dict, optional) – Parameters for PCA preparation (maf, geno, mind, hwe, ind_pair, pca)

  • force_pca_recompute (bool, default=False) – If True, recompute PCA even if files already exist. If False, skip PCA computation if eigenvector and eigenvalue files are found.

  • run_umap (bool, default=True) – Whether to run UMAP reduction

  • umap_params (dict, optional) – Parameters for UMAP. Can contain single values or lists for grid search. Example single: {‘n_neighbors’: 15, ‘min_dist’: 0.1} Example grid: {‘n_neighbors’: [10, 15, 30], ‘min_dist’: [0.1, 0.5]}

  • run_tsne (bool, default=True) – Whether to run t-SNE reduction

  • tsne_params (dict, optional) – Parameters for t-SNE. Can contain single values or lists for grid search. Example single: {‘perplexity’: 30, ‘learning_rate’: 200} Example grid: {‘perplexity’: [20, 30, 50], ‘learning_rate’: [100, 200]}

  • color_hue_file (Path, optional) – Metadata file for plot coloring

  • case_control_markers (bool, default=False) – Whether to use case-control markers in plots

  • fam_file (Path, optional) – Path to .fam file. If not provided, will automatically look for {input_name}.fam in the input_path directory

  • plot_format (str, default='pdf') – Output format for plots

  • dpi (int, default=500) – Resolution for plots

  • include_pca (bool, default=True) – Whether to generate PCA plots

  • save_all_coordinates (bool, default=True) – For grid search: whether to save coordinate files for all parameter combinations

  • generate_all_plots (bool, default=True) – For grid search: whether to generate plot files for all parameter combinations

  • grid_summary (bool, default=True) – For grid search: whether to generate summary table of all parameter combinations

Returns:

Results summary with file paths and metadata. For grid searches, includes information about all parameter combinations explored.

Return type:

dict

Examples

Single analysis:

>>> pipeline = DimensionalityReductionPipeline(...)
>>> results = pipeline.execute_dimensionality_reduction_pipeline(
...     umap_params={'n_neighbors': 15, 'min_dist': 0.1},
...     tsne_params={'perplexity': 30}
... )

Parameter grid search (automatically detected):

>>> results = pipeline.execute_dimensionality_reduction_pipeline(
...     umap_params={
...         'n_neighbors': [10, 15, 30],
...         'min_dist': [0.1, 0.5],
...         'random_state': [42]
...     },
...     tsne_params={
...         'perplexity': [20, 30, 50],
...         'random_state': [42]
...     }
... )
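
The automatic grid detection can be pictured as two small steps: check whether any parameter value is a list, then expand the lists into every combination. The sketch below illustrates that idea with `itertools.product`; the helper names are hypothetical and the library's own detection logic may differ in detail.

```python
from itertools import product


def is_grid(params):
    """True if at least one parameter value is a list of candidates."""
    return any(isinstance(value, list) for value in params.values())


def expand_grid(params):
    """Return one dict per combination; scalar values are held fixed."""
    keys = list(params)
    value_lists = [v if isinstance(v, list) else [v] for v in params.values()]
    return [dict(zip(keys, combo)) for combo in product(*value_lists)]


umap_params = {
    "n_neighbors": [10, 15, 30],
    "min_dist": [0.1, 0.5],
    "random_state": [42],
}
combos = expand_grid(umap_params)
print(is_grid(umap_params), len(combos))
for combo in combos:
    print(combo)
```

With three values for n_neighbors, two for min_dist, and one for random_state, the expansion yields 3 × 2 × 1 = 6 combinations.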

execute_parameter_grid(umap_grid: dict | None = None, tsne_grid: dict | None = None, plot_params: dict | None = None, save_coordinates: bool = True, generate_plots: bool = True, color_hue_file: Path | None = None, case_control_markers: bool = False, fam_file: Path | None = None, plot_format: str = 'pdf') dict[source]

Run systematic parameter grid exploration for UMAP and/or t-SNE.

This method explores all combinations of specified parameters, saving coordinates and generating plots for each combination. Results are organized with clear naming conventions for easy comparison.

Parameters:
  • umap_grid (dict, optional) – Dictionary with parameter names as keys and lists of values as values. Example: {‘n_neighbors’: [15, 30], ‘min_dist’: [0.1, 0.5]}

  • tsne_grid (dict, optional) – Dictionary with parameter names as keys and lists of values as values. Example: {‘perplexity’: [20, 50], ‘learning_rate’: [100, 200]}

  • plot_params (dict, optional) – Additional parameters for plot generation (figsize, dpi, etc.)

  • save_coordinates (bool, default=True) – Whether to save coordinate files for each combination

  • generate_plots (bool, default=True) – Whether to generate plot files for each combination

  • color_hue_file (Path, optional) – Metadata file for plot coloring

  • case_control_markers (bool, default=False) – Whether to use case-control markers in plots

  • fam_file (Path, optional) – Path to .fam file for case-control markers

  • plot_format (str, default='pdf') – Output format for plots

Returns:

Summary of all parameter combinations and results

Return type:

dict

Raises:
  • RuntimeError – If PCA preparation hasn’t been run yet

  • ValueError – If neither umap_grid nor tsne_grid is provided

Examples

>>> pipeline = DimensionalityReductionPipeline(...)
>>> pipeline.execute_pca_preparation()
>>> results = pipeline.execute_parameter_grid(
...     umap_grid={
...         'n_neighbors': [15, 30],
...         'min_dist': [0.1, 0.5],
...         'random_state': [42]
...     },
...     tsne_grid={
...         'perplexity': [20, 50],
...         'random_state': [42]
...     }
... )