Population Structure Modules

Population genetics analysis modules for Fst statistics, dimensionality reduction (PCA, UMAP, t-SNE), and visualization.

Fst Statistics

class ideal_genom.population.fst_stats.FstSummary(input_path: Path, input_name: str, output_path: Path, high_ld_file: Path = PosixPath('.'), build: str = '38', recompute_merge: bool = True, reference_files: dict = {})

Bases: object

__init__(input_path: Path, input_name: str, output_path: Path, high_ld_file: Path = PosixPath('.'), build: str = '38', recompute_merge: bool = True, reference_files: dict = {}) → None

Initialize FstSummary object for Fst analysis.

Parameters:

input_path (Path) – Path to the directory containing input files
input_name (str) – Name of the input file
output_path (Path) – Path to the directory where results will be saved

Raises:

TypeError – If input types are incorrect for any parameter
FileNotFoundError – If input_path or output_path do not exist

merge_reference_study(ind_pair: list = [50, 5, 0.2]) → None

Merge reference and study data by applying quality control filters and merging steps. This method performs the following quality control steps to merge study data with reference data:

1. Filters problematic SNPs
2. Performs LD pruning
3. Fixes chromosome mismatches
4. Fixes position mismatches
5. Fixes allele flips
6. Removes remaining mismatches
7. Merges the datasets

Parameters:

ind_pair (list, default=[50, 5, 0.2]) – Parameters for LD pruning: [window size, step size, r² threshold]

Return type:

None

Notes

If recompute_merge is False, the method will skip the merging process and expect merged data to already exist in the merging directory.

Raises:

TypeError – If ind_pair is not a list
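
Step 5 in the list above (fixing allele flips) is easier to follow with a concrete case. The sketch below is an illustration only and is not taken from the library: `classify_alleles` and `COMPLEMENT` are hypothetical names, and the actual implementation behind merge_reference_study may differ. It assumes a flip means that the study alleles match the reference only after taking the complementary strand.

```python
# Hypothetical helper for illustration; not part of ideal_genom.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def classify_alleles(study: tuple[str, str], reference: tuple[str, str]) -> str:
    """Classify a study SNP against the reference: 'match', 'flip', or 'mismatch'."""
    study_set = set(study)
    ref_set = set(reference)
    if study_set == ref_set:
        return "match"
    try:
        # Complement every study allele to simulate reading the other strand.
        flipped = {COMPLEMENT[allele] for allele in study}
    except KeyError:
        # Indels and non-ACGT codes cannot be strand-flipped in this sketch.
        return "mismatch"
    return "flip" if flipped == ref_set else "mismatch"


print(classify_alleles(("A", "G"), ("T", "C")))
```

Strand-ambiguous pairs (A/T and C/G) complement to themselves, so a check like this cannot tell whether they are flipped.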

add_population_tags() → None

Add population tags to the merged dataset. This method adds super-population tags from the reference dataset to the merged dataset. It reads population information from the reference PSAM file, merges it with the study dataset, and assigns ‘StPop’ (study population) to samples not present in the reference dataset.

Requirements

Merged dataset files (.bed, .bim, .fam) must exist in the merging directory

Reference files dictionary must contain a valid ‘psam’ Path

Raises:

FileNotFoundError – If any of the required merged files are not found
ValueError – If the reference files dictionary doesn’t contain a valid ‘psam’ Path

Side Effects

Creates a new tab-separated file with population tags at {merging_dir}/cleaned-with-ref-merged-pop-tags.csv

Sets self.population_tags to the path of the created file

Return type:

None
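
The tagging rule described above (reference samples keep their super-population, everything else becomes ‘StPop’) can be sketched in a few lines. This is a simplified illustration, not the library's code: `tag_populations` is a hypothetical helper, and it assumes the PSAM file has already been reduced to a mapping from sample ID to super-population.

```python
# Hypothetical helper for illustration; not part of ideal_genom.
def tag_populations(merged_ids, reference_superpop, study_label="StPop"):
    """Map every sample in the merged dataset to a population tag.

    merged_ids: iterable of sample IDs in the merged dataset.
    reference_superpop: dict of sample ID -> super-population (from the reference).
    Samples missing from the reference are treated as study samples.
    """
    return {iid: reference_superpop.get(iid, study_label) for iid in merged_ids}


reference = {"HG00096": "EUR", "NA18486": "AFR"}
tags = tag_populations(["HG00096", "NA18486", "study_001"], reference)
print(tags)
```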

compute_fst() → None

Compute FST (fixation index) statistics between populations.

This method calculates FST statistics between each super-population in the dataset and a study population (‘StPop’). The process involves:

1. Reading population tags from the specified file
2. For each unique super-population (except ‘StPop’):
   - Creating population filter files (keep and within files)
   - Running PLINK commands to filter the dataset and compute FST statistics

The method requires the following instance variables to be set:

population_tags: Path to a file containing population information
results_dir: Directory where results will be stored
merging_dir: Directory containing the merged genotype data

Returns:

None
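
The keep and within files mentioned above follow PLINK's usual layouts: a keep file lists family ID and sample ID, and a within file adds a cluster name as a third column. The sketch below shows how the lines for one super-population could be assembled. `fst_filter_lines` is a hypothetical helper, and the exact file names and column handling used by compute_fst are not shown in this section.

```python
# Hypothetical helper for illustration; not part of ideal_genom.
def fst_filter_lines(samples, super_pop, study_label="StPop"):
    """Build keep/within file lines for one super-population versus the study samples.

    samples: iterable of (family_id, sample_id, population_tag) tuples.
    Returns (keep_lines, within_lines) restricted to the two populations compared.
    """
    keep_lines, within_lines = [], []
    for fid, iid, pop in samples:
        if pop in (super_pop, study_label):
            keep_lines.append(f"{fid}\t{iid}")
            within_lines.append(f"{fid}\t{iid}\t{pop}")
    return keep_lines, within_lines


samples = [("0", "HG00096", "EUR"), ("0", "NA18486", "AFR"), ("F1", "S1", "StPop")]
keep, within = fst_filter_lines(samples, "EUR")
print(keep)
print(within)
```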

report_fst() → pandas.DataFrame

Generate a report of Fst results. This method reads the Fst results from the results directory and generates a summary report.

Returns:

DataFrame containing the Fst results summary

Return type:

pd.DataFrame

Raises:

FileNotFoundError – If no Fst result files are found in the results directory.

Dimensionality Reduction and Projection

Module for dimensionality reduction of genotype data (PCA, UMAP, t-SNE) and for plotting the resulting projections.

class ideal_genom.population.projection.PCAReduction(input_path: Path, input_name: str, output_path: Path, build: str = '38', high_ld_regions: Path | None = None, generate_plot: bool = True)

Bases: object

__init__(input_path: Path, input_name: str, output_path: Path, build: str = '38', high_ld_regions: Path | None = None, generate_plot: bool = True) → None

execute_ld_pruning(maf: float = 0.001, geno: float = 0.1, mind: float = 0.2, hwe: float = 5e-08, ind_pair: list = [50, 5, 0.2]) → None

Execute linkage disequilibrium (LD) pruning on study and reference data.

This method performs LD-based pruning using PLINK to remove highly correlated SNPs from both study and reference datasets. The pruning is done using a sliding window approach where SNPs are removed based on their pairwise correlation (r²).

Parameters:

ind_pair (list) –

A list containing three elements:

ind_pair[0] (int): Window size in SNPs
ind_pair[1] (int): Number of SNPs to shift the window at each step
ind_pair[2] (float): r² threshold for pruning

Raises:

TypeError – If ind_pair is not a list.
TypeError – If first two elements of ind_pair are not integers.
TypeError – If third element of ind_pair is not a float.

Return type:

None

Notes

Uses PLINK’s --indep-pairwise command for pruning.
Excludes high LD regions specified in self.high_ld_regions.
Creates pruned datasets for both study and reference data.
Updates self.pruned_reference and self.pruned_study with paths to pruned files.
Uses all available CPU threads except 2 for processing.
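
The type checks listed under Raises and the thread rule in the notes above can be sketched as follows. This is a hedged illustration, not the library's code: `indep_pairwise_args` and `worker_threads` are hypothetical helpers, and the floor of one thread on small machines is an assumption.

```python
import os


# Hypothetical helpers for illustration; not part of ideal_genom.
def indep_pairwise_args(ind_pair):
    """Validate ind_pair and turn it into PLINK --indep-pairwise arguments."""
    if not isinstance(ind_pair, list):
        raise TypeError("ind_pair must be a list")
    window, step, r2 = ind_pair
    if not isinstance(window, int) or not isinstance(step, int):
        raise TypeError("window size and step size must be integers")
    if not isinstance(r2, float):
        raise TypeError("r2 threshold must be a float")
    return ["--indep-pairwise", str(window), str(step), str(r2)]


def worker_threads():
    """All CPU threads except 2, never fewer than 1 (the floor is an assumption)."""
    return max(1, (os.cpu_count() or 1) - 2)


print(indep_pairwise_args([50, 5, 0.2]), worker_threads())
```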

execute_pca(pca: int = 20, maf: float = 0.01) → None

Perform Principal Component Analysis (PCA) on the genetic data using PLINK.

This method executes PCA on the merged genetic data file, calculating the specified number of principal components. It automatically determines the optimal number of threads and memory allocation based on system resources.

Parameters:

pca (int, default=20) – Number of principal components to calculate. Must be a positive integer.
maf (float, default=0.01) – Minor allele frequency threshold for filtering variants. Must be between 0 and 0.5.

Return type:

None

Raises:

TypeError – If pca is not an integer or maf is not a float
ValueError – If pca is not positive or maf is not between 0 and 0.5

Notes

The method creates two output files:

{output_name}-pca.eigenvec: Contains the eigenvectors (per-sample principal component coordinates)
{output_name}-pca.eigenval: Contains the eigenvalues

The results are stored in self.einvectors and self.eigenvalues attributes.
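
A common first look at the .eigenval output is the share of variance carried by each computed component. The sketch below assumes the usual PLINK layout of one eigenvalue per line. `parse_eigenval` and `variance_share` are hypothetical helpers, and the shares are relative to the computed components only, not to the total genetic variance.

```python
# Hypothetical helpers for illustration; not part of ideal_genom.
def parse_eigenval(text):
    """Parse the contents of an .eigenval file (one eigenvalue per line)."""
    return [float(line) for line in text.splitlines() if line.strip()]


def variance_share(eigenvalues):
    """Fraction of the summed eigenvalues attributed to each component."""
    total = sum(eigenvalues)
    if total <= 0:
        raise ValueError("eigenvalues must sum to a positive number")
    return [value / total for value in eigenvalues]


eigenvalues = parse_eigenval("4.0\n3.0\n1.0\n")
shares = variance_share(eigenvalues)
print(shares)
```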

execute_pcareduction_pipeline(maf: float = 0.001, geno: float = 0.1, mind: float = 0.2, hwe: float = 5e-08, ind_pair: list = [50, 5, 0.2], pca: int = 20, case_control_markers: bool = True) → None

Execute the full preparation pipeline: LD pruning followed by PCA.

This method sequentially performs LD pruning on the genetic data and then computes principal components using PCA. It combines the functionalities of execute_ld_pruning and execute_pca methods.

Parameters:

ind_pair (list) –

A list containing three elements for LD pruning:

ind_pair[0] (int): Window size in SNPs
ind_pair[1] (int): Number of SNPs to shift the window at each step
ind_pair[2] (float): r² threshold for pruning

pca (int, default=20) – Number of principal components to calculate. Must be a positive integer.
maf (float, default=0.001) – Minor allele frequency threshold for filtering variants. Must be between 0 and 0.5.

Return type:

None

class ideal_genom.population.projection.UMAPReduction(eigenvector: Path, output_path: Path)

Bases: object

Class for performing UMAP dimensionality reduction on PCA eigenvectors.

This class handles UMAP transformation of high-dimensional PCA data into 2D space for visualization. Use Plot2D class for generating plots.

__init__(eigenvector: Path, output_path: Path) → None

Initialize UMAPReduction object.

Parameters:

eigenvector (Path) – Path to the eigenvector file (.eigenvec) from PCA analysis
output_path (Path) – Path to the directory where results will be saved

Raises:

TypeError – If input types are incorrect
FileNotFoundError – If eigenvector file or output_path do not exist

Notes

Creates ‘umap_results’ directory in the output path.

fit_transform(n_neighbors: int = 15, min_dist: float = 0.1, metric: str = 'euclidean', random_state: int | None = None, n_components: int = 2, umap_kwargs: dict | None = None) → pandas.DataFrame

Perform UMAP dimensionality reduction on PCA eigenvectors.

Parameters:

n_neighbors (int, default=15) – Number of neighbors for UMAP manifold approximation. Must be positive.
min_dist (float, default=0.1) – Minimum distance between points in low-dimensional space. Must be non-negative.
metric (str, default='euclidean') – Distance metric for UMAP (e.g., ‘euclidean’, ‘cosine’, ‘manhattan’)
random_state (int, optional) – Random seed for reproducibility. Must be non-negative.
n_components (int, default=2) – Number of dimensions in the output
umap_kwargs (dict, optional) – Additional keyword arguments to pass to UMAP constructor.

Returns:

DataFrame with columns [‘ID1’, ‘ID2’, ‘umap_1’, ‘umap_2’, …]

Return type:

pd.DataFrame

Raises:

TypeError – If parameters are not of correct type
ValueError – If parameter values are invalid
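
Both UMAP and t-SNE operate on the principal components stored in the .eigenvec file, with the two leading ID columns carried through to the output as ‘ID1’ and ‘ID2’. The sketch below shows one way to separate IDs from components before handing the matrix to a reducer. `split_eigenvec` is a hypothetical helper, and it assumes two ID columns followed by numeric components, with an optional header line beginning with ‘#’ or ‘FID’.

```python
# Hypothetical helper for illustration; not part of ideal_genom.
def split_eigenvec(text):
    """Split .eigenvec contents into sample IDs and a list of PC rows.

    Assumes whitespace-delimited rows of the form: ID1 ID2 PC1 PC2 ...
    Header lines (starting with '#' or 'FID') and blank lines are skipped.
    """
    ids, components = [], []
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0].startswith("#") or fields[0] == "FID":
            continue
        ids.append((fields[0], fields[1]))
        components.append([float(value) for value in fields[2:]])
    return ids, components


example = "#FID\tIID\tPC1\tPC2\nF1\tS1\t0.1\t-0.2\nF2\tS2\t0.3\t0.4\n"
ids, components = split_eigenvec(example)
print(ids, components)
```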

class ideal_genom.population.projection.TSNEReduction(eigenvector: Path, output_path: Path)

Bases: object

Class for performing t-SNE dimensionality reduction on PCA eigenvectors.

This class handles t-SNE transformation of high-dimensional PCA data into 2D or 3D space for visualization. Use Plot2D class for generating plots.

__init__(eigenvector: Path, output_path: Path) → None

Initialize TSNEReduction object.

Parameters:

eigenvector (Path) – Path to the eigenvector file (.eigenvec) from PCA analysis
output_path (Path) – Path to the directory where results will be saved

Raises:

TypeError – If input types are incorrect
FileNotFoundError – If eigenvector file or output_path do not exist

Notes

Creates ‘tsne_results’ directory in the output path.

fit_transform(n_components: int = 2, perplexity: float = 30.0, learning_rate: float = 200.0, n_iter: int = 1000, metric: str = 'euclidean', random_state: int | None = None, early_exaggeration: float = 12.0, init: Literal['pca', 'random'] = 'pca', tsne_kwargs: dict | None = None) → pandas.DataFrame

Perform t-SNE dimensionality reduction on PCA eigenvectors.

Parameters:

n_components (int, default=2) – Number of dimensions in the output (typically 2 or 3)
perplexity (float, default=30.0) – Related to number of nearest neighbors. Should be between 5 and 50. Larger datasets require larger perplexity.
learning_rate (float, default=200.0) – Learning rate for t-SNE optimization. Usually between 10.0 and 1000.0.
n_iter (int, default=1000) – Maximum number of iterations for optimization
metric (str, default='euclidean') – Distance metric to use (‘euclidean’, ‘manhattan’, ‘cosine’, etc.)
random_state (int, optional) – Random seed for reproducibility. Must be non-negative.
early_exaggeration (float, default=12.0) – Controls how tight natural clusters are in the original space
init (str, default='pca') – Initialization method (‘pca’ or ‘random’)
tsne_kwargs (dict, optional) – Additional keyword arguments to pass to TSNE constructor.

Returns:

DataFrame with columns [‘ID1’, ‘ID2’, ‘tsne_1’, ‘tsne_2’, …]

Return type:

pd.DataFrame

Raises:

TypeError – If parameters are not of correct type
ValueError – If parameter values are invalid

Notes

t-SNE is computationally expensive. For large datasets (>10,000 samples), consider using perplexity between 30-50 and reducing n_iter if needed.

class ideal_genom.population.projection.Plot2D(output_dir: Path)

Bases: object

Class for generating 2D scatter plots with metadata integration.

This class handles the preparation of metadata (color hue files, case-control markers) and generates publication-quality 2D scatter plots for dimensionality reduction results.

__init__(output_dir: Path) → None

Initialize Plot2D object.

Parameters:

output_dir (Path) – Directory where plots will be saved

Raises:

TypeError – If output_dir is not a Path object
FileNotFoundError – If output_dir does not exist

prepare_metadata(color_hue_file: Path | None = None, case_control_markers: bool = False, fam_file: Path | None = None) → pandas.DataFrame | None

Prepare metadata DataFrame from color hue file and/or case-control markers.

Parameters:

color_hue_file (Path, optional) – Path to tab-separated file with metadata for coloring. Must have at least 3 columns: ID1, ID2, and a metadata column.
case_control_markers (bool, default=False) – Whether to load case-control labels from .fam file
fam_file (Path, optional) – Path to .fam file containing case-control information. Required if case_control_markers is True.

Returns:

Metadata DataFrame with columns [‘ID1’, ‘ID2’, …] or None if no metadata

Return type:

pd.DataFrame or None

Raises:

FileNotFoundError – If specified files don’t exist
TypeError – If parameters are of incorrect type
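
Case-control labels come from the sixth column of a PLINK .fam file, where 1 denotes a control, 2 denotes a case, and 0 or -9 denotes a missing phenotype. The sketch below shows that mapping in isolation. `fam_case_control` is a hypothetical helper, and the label strings it returns are assumptions rather than the values prepare_metadata produces.

```python
# Hypothetical helper for illustration; not part of ideal_genom.
PHENOTYPE_LABELS = {"1": "control", "2": "case"}


def fam_case_control(text):
    """Extract (ID1, ID2, label) rows from the contents of a .fam file.

    A .fam row has six whitespace-delimited fields:
    family ID, sample ID, father, mother, sex, phenotype.
    Any phenotype code other than 1 or 2 is labelled 'missing'.
    """
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        rows.append((fields[0], fields[1], PHENOTYPE_LABELS.get(fields[5], "missing")))
    return rows


fam_text = "F1 S1 0 0 1 2\nF2 S2 0 0 2 1\nF3 S3 0 0 1 -9\n"
print(fam_case_control(fam_text))
```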

generate_plot(data: pandas.DataFrame, x_col: str, y_col: str, plot_name: str, hue_col: str | None = None, style_col: str | None = None, title: str | None = None, xlabel: str | None = None, ylabel: str | None = None, figsize: tuple = (5, 5), dpi: int = 500, format: str = 'pdf', marker: str = '.', marker_size: int = 10, alpha: float = 0.5, equal_aspect: bool = True, legend_params: dict | None = None) → Path

Generate a 2D scatter plot with optional metadata coloring and styling.

Parameters:

data (pd.DataFrame) – DataFrame containing the 2D coordinates and IDs (must have ‘ID1’, ‘ID2’ columns)
x_col (str) – Column name for x-axis values
y_col (str) – Column name for y-axis values
plot_name (str) – Name of the output plot file
hue_col (str, optional) – Column name for point coloring. If None and metadata exists, uses third column or ‘Phenotype’ if available.
style_col (str, optional) – Column name for point styling (different markers)
title (str, optional) – Plot title
xlabel (str, optional) – X-axis label. If None, uses x_col.
ylabel (str, optional) – Y-axis label. If None, uses y_col.
figsize (tuple, default=(5, 5)) – Figure size in inches (width, height)
dpi (int, default=500) – Resolution for saving the plot
format (str, default='pdf') – Output format (‘pdf’, ‘png’, ‘jpeg’, ‘svg’)
marker (str, default='.') – Marker style for scatter plot
marker_size (int, default=10) – Size of markers
alpha (float, default=0.5) – Transparency of markers (0-1)
equal_aspect (bool, default=True) – Whether to set equal aspect ratio
legend_params (dict, optional) – Parameters for legend customization (bbox_to_anchor, ncols, fontsize, etc.)

Returns:

Path to the saved plot file

Return type:

Path

Raises:

ValueError – If required columns are missing or hue_col not found
TypeError – If parameters are of incorrect type

class ideal_genom.population.projection.Plot3D

Bases: object

class ideal_genom.population.projection.DimensionalityReductionPipeline(input_path: Path, input_name: str, output_path: Path, build: str = '38', high_ld_regions_file: Path | None = None, generate_plot: bool = True)

Bases: object

Pipeline for running PCA preparation and dimensionality reduction workflows.

This class orchestrates the complete workflow from raw genetic data to dimensionality reduction visualizations, including:

1. PCA preparation (LD pruning + PCA)
2. Optional UMAP reduction
3. Optional t-SNE reduction
4. Automated plotting with metadata

__init__(input_path: Path, input_name: str, output_path: Path, build: str = '38', high_ld_regions_file: Path | None = None, generate_plot: bool = True) → None

Initialize the dimensionality reduction pipeline.

Parameters:

input_path (Path) – Path to directory containing input genetic data files (.bed/.bim/.fam)
input_name (str) – Base name of input files (without extension)
output_path (Path) – Path to directory where all results will be saved
build (str, default='38') – Genome build version (‘37’ or ‘38’)
high_ld_regions_file (Path, optional) – Path to file containing high LD regions. If None, will be fetched automatically.
generate_plot (bool, default=True) – Whether to generate plots automatically

Raises:

TypeError – If input types are incorrect
FileNotFoundError – If input_path or output_path don’t exist
ValueError – If build is not ‘37’ or ‘38’

execute_pca_preparation(maf: float = 0.001, geno: float = 0.1, mind: float = 0.2, hwe: float = 5e-08, ind_pair: list = [50, 5, 0.2], pca: int = 20, case_control_markers: bool = False) → Path

Run PCA preparation: LD pruning and principal component analysis.

Parameters:

maf (float, default=0.001) – Minor allele frequency threshold
geno (float, default=0.1) – Genotype missingness threshold
mind (float, default=0.2) – Sample missingness threshold
hwe (float, default=5e-8) – Hardy-Weinberg equilibrium p-value threshold
ind_pair (list, default=[50, 5, 0.2]) – LD pruning parameters: [window_size, step_size, r2_threshold]
pca (int, default=20) – Number of principal components to calculate

Returns:

Path to the generated eigenvector file

Return type:

Path

Notes

This step is required before running UMAP or t-SNE reductions.

execute_umap(n_neighbors: int = 15, min_dist: float = 0.1, metric: str = 'euclidean', random_state: int | None = None, n_components: int = 2, umap_kwargs: dict | None = None) → pandas.DataFrame

Run UMAP dimensionality reduction.

Parameters:

n_neighbors (int, default=15) – Number of neighbors for UMAP
min_dist (float, default=0.1) – Minimum distance between points
metric (str, default='euclidean') – Distance metric
random_state (int, optional) – Random seed for reproducibility
n_components (int, default=2) – Number of output dimensions
umap_kwargs (dict, optional) – Additional UMAP parameters

Returns:

UMAP results with columns [‘ID1’, ‘ID2’, ‘umap_1’, ‘umap_2’, …]

Return type:

pd.DataFrame

Raises:

RuntimeError – If PCA preparation hasn’t been run yet

execute_tsne(n_components: int = 2, perplexity: float = 30.0, learning_rate: float = 200.0, n_iter: int = 1000, metric: str = 'euclidean', random_state: int | None = None, early_exaggeration: float = 12.0, init: Literal['pca', 'random'] = 'pca', tsne_kwargs: dict | None = None) → pandas.DataFrame

Run t-SNE dimensionality reduction.

Parameters:

n_components (int, default=2) – Number of output dimensions
perplexity (float, default=30.0) – t-SNE perplexity parameter
learning_rate (float, default=200.0) – Learning rate for optimization
n_iter (int, default=1000) – Number of optimization iterations
metric (str, default='euclidean') – Distance metric
random_state (int, optional) – Random seed for reproducibility
early_exaggeration (float, default=12.0) – Early exaggeration parameter
init ({'pca', 'random'}, default='pca') – Initialization method
tsne_kwargs (dict, optional) – Additional t-SNE parameters

Returns:

t-SNE results with columns [‘ID1’, ‘ID2’, ‘tsne_1’, ‘tsne_2’, …]

Return type:

pd.DataFrame

Raises:

RuntimeError – If PCA preparation hasn’t been run yet

generate_plots(color_hue_file: Path | None = None, case_control_markers: bool = False, fam_file: Path | None = None, plot_format: str = 'pdf', dpi: int = 500, figsize: tuple = (5, 5)) → dict

Generate plots for all completed reductions.

Parameters:

color_hue_file (Path, optional) – Path to metadata file for coloring
case_control_markers (bool, default=False) – Whether to use case-control markers
fam_file (Path, optional) – Path to .fam file (required if case_control_markers=True)
plot_format (str, default='pdf') – Output format (‘pdf’, ‘png’, ‘jpeg’, ‘svg’)
dpi (int, default=500) – Resolution for plots
figsize (tuple, default=(5, 5)) – Figure size in inches
include_pca (bool, default=True) – Whether to generate PCA plots

Returns:

Dictionary mapping method names to plot file paths

Return type:

dict

Raises:

RuntimeError – If no reductions have been run

execute_dimensionality_reduction_pipeline(pca_params: dict | None = None, force_pca_recompute: bool = False, run_umap: bool = True, umap_params: dict | None = None, run_tsne: bool = True, tsne_params: dict | None = None, color_hue_file: Path | None = None, case_control_markers: bool = False, fam_file: Path | None = None, plot_format: str = 'pdf', dpi: int = 500, include_pca: bool = True, save_all_coordinates: bool = True, generate_all_plots: bool = True, grid_summary: bool = True) → dict

Run the complete dimensionality reduction pipeline with automatic parameter grid detection.

This method automatically detects whether parameters contain single values or lists. If lists are detected, it runs a parameter grid search exploring all combinations. Otherwise, it runs a single analysis with the provided parameters.

Parameters:

pca_params (dict, optional) – Parameters for PCA preparation (maf, geno, mind, hwe, ind_pair, pca)
force_pca_recompute (bool, default=False) – If True, recompute PCA even if files already exist. If False, skip PCA computation if eigenvector and eigenvalue files are found.
run_umap (bool, default=True) – Whether to run UMAP reduction
umap_params (dict, optional) – Parameters for UMAP. Can contain single values or lists for grid search. Example (single): {'n_neighbors': 15, 'min_dist': 0.1}. Example (grid): {'n_neighbors': [10, 15, 30], 'min_dist': [0.1, 0.5]}.
run_tsne (bool, default=True) – Whether to run t-SNE reduction
tsne_params (dict, optional) – Parameters for t-SNE. Can contain single values or lists for grid search. Example (single): {'perplexity': 30, 'learning_rate': 200}. Example (grid): {'perplexity': [20, 30, 50], 'learning_rate': [100, 200]}.
color_hue_file (Path, optional) – Metadata file for plot coloring
case_control_markers (bool, default=False) – Whether to use case-control markers in plots
fam_file (Path, optional) – Path to .fam file. If not provided, will automatically look for {input_name}.fam in the input_path directory
plot_format (str, default='pdf') – Output format for plots
dpi (int, default=500) – Resolution for plots
include_pca (bool, default=True) – Whether to generate PCA plots
save_all_coordinates (bool, default=True) – For grid search: whether to save coordinate files for all parameter combinations
generate_all_plots (bool, default=True) – For grid search: whether to generate plot files for all parameter combinations
grid_summary (bool, default=True) – For grid search: whether to generate summary table of all parameter combinations

Returns:

Results summary with file paths and metadata. For grid searches, includes information about all parameter combinations explored.

Return type:

dict

Examples

Single analysis:

>>> pipeline = DimensionalityReductionPipeline(...)
>>> results = pipeline.execute_dimensionality_reduction_pipeline(
...     umap_params={'n_neighbors': 15, 'min_dist': 0.1},
...     tsne_params={'perplexity': 30}
... )

Parameter grid search (automatically detected):

>>> results = pipeline.execute_dimensionality_reduction_pipeline(
...     umap_params={
...         'n_neighbors': [10, 15, 30],
...         'min_dist': [0.1, 0.5],
...         'random_state': [42]
...     },
...     tsne_params={
...         'perplexity': [20, 30, 50],
...         'random_state': [42]
...     }
... )
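
The automatic grid detection described for this method can be pictured as two small steps: check whether any parameter value is a list, then expand the lists into every combination. The sketch below is one plausible reading of that behaviour, not the library's code. `is_grid` and `expand_grid` are hypothetical helpers.

```python
from itertools import product


# Hypothetical helpers for illustration; not part of ideal_genom.
def is_grid(params):
    """Return True when at least one parameter value is a list."""
    return any(isinstance(value, list) for value in params.values())


def expand_grid(params):
    """Expand a parameter dict into one dict per combination of values.

    Scalar values are treated as single-element lists, so a dict without
    any lists expands to exactly one combination.
    """
    keys = list(params)
    value_lists = [value if isinstance(value, list) else [value] for value in params.values()]
    return [dict(zip(keys, combo)) for combo in product(*value_lists)]


umap_grid = {"n_neighbors": [10, 15, 30], "min_dist": [0.1, 0.5], "random_state": [42]}
combinations = expand_grid(umap_grid)
print(is_grid(umap_grid), len(combinations))
```

With three values for n_neighbors and two for min_dist, the grid above expands to six UMAP runs. A parameter whose single value is itself a list would need special handling in a sketch like this.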

execute_parameter_grid(umap_grid: dict | None = None, tsne_grid: dict | None = None, plot_params: dict | None = None, save_coordinates: bool = True, generate_plots: bool = True, color_hue_file: Path | None = None, case_control_markers: bool = False, fam_file: Path | None = None, plot_format: str = 'pdf') → dict

Run systematic parameter grid exploration for UMAP and/or t-SNE.

This method explores all combinations of specified parameters, saving coordinates and generating plots for each combination. Results are organized with clear naming conventions for easy comparison.

Parameters:

umap_grid (dict, optional) – Dictionary with parameter names as keys and lists of values as values. Example: {‘n_neighbors’: [15, 30], ‘min_dist’: [0.1, 0.5]}
tsne_grid (dict, optional) – Dictionary with parameter names as keys and lists of values as values. Example: {‘perplexity’: [20, 50], ‘learning_rate’: [100, 200]}
plot_params (dict, optional) – Additional parameters for plot generation (figsize, dpi, etc.)
save_coordinates (bool, default=True) – Whether to save coordinate files for each combination
generate_plots (bool, default=True) – Whether to generate plot files for each combination
color_hue_file (Path, optional) – Metadata file for plot coloring
case_control_markers (bool, default=False) – Whether to use case-control markers in plots
fam_file (Path, optional) – Path to .fam file for case-control markers
plot_format (str, default='pdf') – Output format for plots

Returns:

Summary of all parameter combinations and results

Return type:

dict

Raises:

RuntimeError – If PCA preparation hasn’t been run yet
ValueError – If neither umap_grid nor tsne_grid is provided

Examples

>>> pipeline = DimensionalityReductionPipeline(...)
>>> pipeline.execute_pca_preparation()
>>> results = pipeline.execute_parameter_grid(
...     umap_grid={
...         'n_neighbors': [15, 30],
...         'min_dist': [0.1, 0.5],
...         'random_state': [42]
...     },
...     tsne_grid={
...         'perplexity': [20, 50],
...         'random_state': [42]
...     }
... )