Population Structure Modules
Population genetics analysis modules for Fst statistics, dimensionality reduction (PCA, UMAP, t-SNE), and visualization.
Fst Statistics
- class ideal_genom.population.fst_stats.FstSummary(input_path: Path, input_name: str, output_path: Path, high_ld_file: Path = PosixPath('.'), build: str = '38', recompute_merge: bool = True, reference_files: dict = {})[source]
Bases: object
- __init__(input_path: Path, input_name: str, output_path: Path, high_ld_file: Path = PosixPath('.'), build: str = '38', recompute_merge: bool = True, reference_files: dict = {}) None[source]
Initialize FstSummary object for Fst analysis.
- Parameters:
input_path (Path) – Path to the directory containing input files
input_name (str) – Name of the input file
output_path (Path) – Path to the directory where results will be saved
- Raises:
TypeError – If input types are incorrect for any parameter
FileNotFoundError – If input_path or output_path do not exist
- merge_reference_study(ind_pair: list = [50, 5, 0.2]) None[source]
Merge reference and study data by applying quality control filters and merging steps. This method performs a series of quality control steps to merge study data with reference data:
1. Filters problematic SNPs
2. Performs LD pruning
3. Fixes chromosome mismatches
4. Fixes position mismatches
5. Fixes allele flips
6. Removes remaining mismatches
7. Merges the datasets
- Parameters:
ind_pair (list, default [50, 5, 0.2]) – Parameters for LD pruning: [window size, step size, r2 threshold]
- Return type:
None
Notes
If recompute_merge is False, the method will skip the merging process and expect merged data to already exist in the merging directory.
- Raises:
TypeError – If ind_pair is not a list
- add_population_tags() None[source]
Add super-population tags to the merged dataset. This method reads population information from the reference PSAM file, merges it with the study dataset, and assigns ‘StPop’ (study population) to samples not present in the reference dataset.
Requirements
Merged dataset files (.bed, .bim, .fam) must exist in the merging directory
Reference files dictionary must contain a valid ‘psam’ Path
- Raises:
FileNotFoundError – If any of the required merged files are not found
ValueError – If the reference files dictionary doesn’t contain a valid ‘psam’ Path
Side Effects
Creates a new tab-separated file with population tags at {merging_dir}/cleaned-with-ref-merged-pop-tags.csv
Sets self.population_tags to the path of the created file
- Return type:
None
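The tagging rule described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the library's implementation: `tag_populations`, the sample IDs, and the in-memory mapping are hypothetical, whereas the real method reads the reference PSAM and the merged files from disk.

```python
def tag_populations(merged_ids, reference_superpop, study_label="StPop"):
    """Return {sample_id: population tag} for every sample in the merged dataset.

    Samples present in the reference mapping keep their super-population;
    every other sample is treated as a study sample.
    """
    return {
        sample_id: reference_superpop.get(sample_id, study_label)
        for sample_id in merged_ids
    }


# Hypothetical reference samples and their super-populations
reference_superpop = {"REF_A": "AFR", "REF_B": "EUR", "REF_C": "EAS"}
merged_ids = ["REF_A", "REF_B", "REF_C", "STUDY_001", "STUDY_002"]

tags = tag_populations(merged_ids, reference_superpop)
```

Here the two `STUDY_*` samples receive the ‘StPop’ tag because they do not appear in the reference mapping.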
- compute_fst() None[source]
Compute FST (fixation index) statistics between populations.
This method calculates FST statistics between each super-population in the dataset and the study population (‘StPop’). The process involves:
1. Reading population tags from the specified file
2. For each unique super-population (except ‘StPop’):
   - Creating population filter files (keep and within files)
   - Running PLINK commands to filter the dataset and compute FST statistics
The method requires the following instance variables to be set:
population_tags: Path to a file containing population information
results_dir: Directory where results will be stored
merging_dir: Directory containing the merged genotype data
- Return type:
None
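The per-population filtering step can be sketched as follows. The helper `build_fst_filters` and the sample rows are hypothetical, and the exact files the library writes are not shown in this reference; the sketch only assumes PLINK's usual layouts, namely FID and IID for a keep file and FID, IID, and cluster name for a within file.

```python
def build_fst_filters(pop_tags, study_label="StPop"):
    """Collect keep/within rows for each reference super-population vs. the study.

    pop_tags: list of (fid, iid, population) tuples.
    Returns {population: {"keep": [...], "within": [...]}}.
    """
    study = [row for row in pop_tags if row[2] == study_label]
    populations = sorted({pop for _, _, pop in pop_tags if pop != study_label})

    filters = {}
    for pop in populations:
        # Each comparison keeps one super-population plus all study samples
        selected = [row for row in pop_tags if row[2] == pop] + study
        filters[pop] = {
            "keep": [f"{fid}\t{iid}" for fid, iid, _ in selected],
            "within": [f"{fid}\t{iid}\t{tag}" for fid, iid, tag in selected],
        }
    return filters


# Hypothetical tagged samples
pop_tags = [
    ("0", "REF_A", "AFR"),
    ("0", "REF_B", "EUR"),
    ("0", "REF_C", "EUR"),
    ("FAM1", "STUDY_001", "StPop"),
]
filters = build_fst_filters(pop_tags)
```

Each entry of `filters` then corresponds to one pairwise comparison between a super-population and ‘StPop’.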
- report_fst() pandas.DataFrame[source]
Generate a report of Fst results. This method reads the Fst results from the results directory and generates a summary report.
- Returns:
DataFrame containing the Fst results summary
- Return type:
pd.DataFrame
- Raises:
FileNotFoundError – If no Fst result files are found in the results directory.
Dimensionality Reduction and Projection
Module for dimensionality reduction of genetic data (PCA, UMAP, t-SNE) and for plotting the resulting projections.
- class ideal_genom.population.projection.PCAReduction(input_path: Path, input_name: str, output_path: Path, build: str = '38', high_ld_regions: Path | None = None, generate_plot: bool = True)[source]
Bases: object
- __init__(input_path: Path, input_name: str, output_path: Path, build: str = '38', high_ld_regions: Path | None = None, generate_plot: bool = True) None[source]
- execute_ld_pruning(maf: float = 0.001, geno: float = 0.1, mind: float = 0.2, hwe: float = 5e-08, ind_pair: list = [50, 5, 0.2]) None[source]
Execute linkage disequilibrium (LD) pruning on study and reference data.
This method performs LD-based pruning using PLINK to remove highly correlated SNPs from both study and reference datasets. The pruning is done using a sliding window approach where SNPs are removed based on their pairwise correlation (r²).
- Parameters:
ind_pair (list) –
A list containing three elements:
ind_pair[0] (int): Window size in SNPs
ind_pair[1] (int): Number of SNPs to shift the window at each step
ind_pair[2] (float): r² threshold for pruning
- Return type:
None
Notes
Uses PLINK’s --indep-pairwise command for pruning.
Excludes high LD regions specified in self.high_ld_regions.
Creates pruned datasets for both study and reference data.
Updates self.pruned_reference and self.pruned_study with paths to pruned files.
Uses all available CPU threads except 2 for processing.
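As a rough sketch of how the `ind_pair` list and the thread rule in these notes could map onto a PLINK command line: the helper names `indep_pairwise_args` and `threads_to_use` are hypothetical, and the validation rules are assumptions rather than the library's actual checks. Only the `--indep-pairwise` and `--threads` flags themselves come from PLINK.

```python
import os


def indep_pairwise_args(ind_pair):
    """Translate [window size, step size, r2 threshold] into PLINK arguments."""
    if not isinstance(ind_pair, list) or len(ind_pair) != 3:
        raise TypeError("ind_pair must be a list of three elements")
    window, step, r2 = ind_pair
    if not isinstance(window, int) or not isinstance(step, int):
        raise TypeError("window size and step size must be integers")
    if not isinstance(r2, float) or not 0.0 < r2 <= 1.0:
        raise ValueError("r2 threshold must be a float in (0, 1]")
    return ["--indep-pairwise", str(window), str(step), str(r2)]


def threads_to_use(reserved=2):
    """Use all available CPU threads except `reserved`, but never fewer than one."""
    available = os.cpu_count() or 1
    return max(1, available - reserved)


# Arguments that would be appended to a PLINK call (binary and file paths omitted)
args = indep_pairwise_args([50, 5, 0.2]) + ["--threads", str(threads_to_use())]
```

Clamping the thread count at one keeps the sketch usable on machines with two or fewer cores.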
- execute_pca(pca: int = 20, maf: float = 0.01) None[source]
Perform Principal Component Analysis (PCA) on the genetic data using PLINK.
This method executes PCA on the merged genetic data file, calculating the specified number of principal components. It automatically determines the optimal number of threads and memory allocation based on system resources.
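- Parameters:
pca (int, default=20) – Number of principal components to calculate. Must be a positive integer.
maf (float, default=0.01) – Minor allele frequency threshold for filtering variants. Must be between 0 and 0.5.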
- Return type:
None
- Raises:
TypeError – If pca is not an integer or maf is not a float
ValueError – If pca is not positive or maf is not between 0 and 0.5
Notes
The method creates two output files:
- {output_name}-pca.eigenvec: Contains the eigenvectors (per-sample principal component coordinates)
- {output_name}-pca.eigenval: Contains the eigenvalues
The results are stored in self.einvectors and self.eigenvalues attributes.
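One common use of the eigenvalue output is to estimate how much each component contributes relative to the others. The sketch below assumes the usual PLINK layout of one eigenvalue per line; `parse_eigenval` and `variance_shares` are hypothetical helpers and the values are made up. Because only the requested components are computed, the shares are relative to those components, not to the total variance in the data.

```python
def parse_eigenval(text):
    """Read one eigenvalue per non-empty line."""
    return [float(line) for line in text.splitlines() if line.strip()]


def variance_shares(eigenvalues):
    """Return each eigenvalue as a fraction of the sum of the computed eigenvalues."""
    total = sum(eigenvalues)
    if total <= 0:
        raise ValueError("eigenvalues must sum to a positive number")
    return [value / total for value in eigenvalues]


# Made-up contents of an .eigenval file with four components
eigenval_text = "4.0\n3.0\n2.0\n1.0\n"
shares = variance_shares(parse_eigenval(eigenval_text))
```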
- execute_pcareduction_pipeline(maf: float = 0.001, geno: float = 0.1, mind: float = 0.2, hwe: float = 5e-08, ind_pair: list = [50, 5, 0.2], pca: int = 20, case_control_markers: bool = True) None[source]
Execute the full preparation pipeline: LD pruning followed by PCA.
This method sequentially performs LD pruning on the genetic data and then computes principal components using PCA. It combines the functionalities of execute_ld_pruning and execute_pca methods.
- Parameters:
ind_pair (list) – A list containing three elements for LD pruning:
- ind_pair[0] (int): Window size in SNPs
- ind_pair[1] (int): Number of SNPs to shift the window at each step
- ind_pair[2] (float): r² threshold for pruning
pca (int, default=20) – Number of principal components to calculate. Must be a positive integer.
maf (float, default=0.001) – Minor allele frequency threshold for filtering variants. Must be between 0 and 0.5.
- Return type:
None
- class ideal_genom.population.projection.UMAPReduction(eigenvector: Path, output_path: Path)[source]
Bases: object
Class for performing UMAP dimensionality reduction on PCA eigenvectors.
This class handles UMAP transformation of high-dimensional PCA data into 2D space for visualization. Use Plot2D class for generating plots.
- __init__(eigenvector: Path, output_path: Path) None[source]
Initialize UMAPReduction object.
- Parameters:
eigenvector (Path) – Path to the eigenvector file (.eigenvec) from PCA analysis
output_path (Path) – Path to the directory where results will be saved
- Raises:
TypeError – If input types are incorrect
FileNotFoundError – If eigenvector file or output_path do not exist
Notes
Creates ‘umap_results’ directory in the output path.
- fit_transform(n_neighbors: int = 15, min_dist: float = 0.1, metric: str = 'euclidean', random_state: int | None = None, n_components: int = 2, umap_kwargs: dict | None = None) pandas.DataFrame[source]
Perform UMAP dimensionality reduction on PCA eigenvectors.
- Parameters:
n_neighbors (int, default=15) – Number of neighbors for UMAP manifold approximation. Must be positive.
min_dist (float, default=0.1) – Minimum distance between points in low-dimensional space. Must be non-negative.
metric (str, default='euclidean') – Distance metric for UMAP (e.g., ‘euclidean’, ‘cosine’, ‘manhattan’)
random_state (int, optional) – Random seed for reproducibility. Must be non-negative.
n_components (int, default=2) – Number of dimensions in the output
umap_kwargs (dict, optional) – Additional keyword arguments to pass to UMAP constructor.
- Returns:
DataFrame with columns [‘ID1’, ‘ID2’, ‘umap_1’, ‘umap_2’, …]
- Return type:
pd.DataFrame
- Raises:
TypeError – If parameters are not of correct type
ValueError – If parameter values are invalid
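The documented constraints and output layout can be illustrated with a small stand-alone check. This is a sketch, not the method's actual validation code: `umap_output_columns` is a hypothetical helper, and requiring `n_components` to be positive is an assumption that the reference does not state.

```python
def umap_output_columns(n_neighbors=15, min_dist=0.1, random_state=None, n_components=2):
    """Validate the documented constraints and return the expected column names."""
    for name, value in (("n_neighbors", n_neighbors), ("n_components", n_components)):
        if not isinstance(value, int) or isinstance(value, bool):
            raise TypeError(f"{name} must be an integer")
        if value <= 0:
            raise ValueError(f"{name} must be positive")
    if not isinstance(min_dist, (int, float)) or isinstance(min_dist, bool):
        raise TypeError("min_dist must be a number")
    if min_dist < 0:
        raise ValueError("min_dist must be non-negative")
    if random_state is not None:
        if not isinstance(random_state, int) or isinstance(random_state, bool):
            raise TypeError("random_state must be an integer or None")
        if random_state < 0:
            raise ValueError("random_state must be non-negative")
    # Two ID columns followed by one column per output dimension
    return ["ID1", "ID2"] + [f"umap_{i}" for i in range(1, n_components + 1)]


columns_3d = umap_output_columns(n_components=3, random_state=42)
```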
- class ideal_genom.population.projection.TSNEReduction(eigenvector: Path, output_path: Path)[source]
Bases: object
Class for performing t-SNE dimensionality reduction on PCA eigenvectors.
This class handles t-SNE transformation of high-dimensional PCA data into 2D or 3D space for visualization. Use Plot2D class for generating plots.
- __init__(eigenvector: Path, output_path: Path) None[source]
Initialize TSNEReduction object.
- Parameters:
eigenvector (Path) – Path to the eigenvector file (.eigenvec) from PCA analysis
output_path (Path) – Path to the directory where results will be saved
- Raises:
TypeError – If input types are incorrect
FileNotFoundError – If eigenvector file or output_path do not exist
Notes
Creates ‘tsne_results’ directory in the output path.
- fit_transform(n_components: int = 2, perplexity: float = 30.0, learning_rate: float = 200.0, n_iter: int = 1000, metric: str = 'euclidean', random_state: int | None = None, early_exaggeration: float = 12.0, init: Literal['pca', 'random'] = 'pca', tsne_kwargs: dict | None = None) pandas.DataFrame[source]
Perform t-SNE dimensionality reduction on PCA eigenvectors.
- Parameters:
n_components (int, default=2) – Number of dimensions in the output (typically 2 or 3)
perplexity (float, default=30.0) – Related to number of nearest neighbors. Should be between 5 and 50. Larger datasets require larger perplexity.
learning_rate (float, default=200.0) – Learning rate for t-SNE optimization. Usually between 10.0 and 1000.0.
n_iter (int, default=1000) – Maximum number of iterations for optimization
metric (str, default='euclidean') – Distance metric to use (‘euclidean’, ‘manhattan’, ‘cosine’, etc.)
random_state (int, optional) – Random seed for reproducibility. Must be non-negative.
early_exaggeration (float, default=12.0) – Controls how tight natural clusters are in the original space
init (str, default='pca') – Initialization method (‘pca’ or ‘random’)
tsne_kwargs (dict, optional) – Additional keyword arguments to pass to TSNE constructor.
- Returns:
DataFrame with columns [‘ID1’, ‘ID2’, ‘tsne_1’, ‘tsne_2’, …]
- Return type:
pd.DataFrame
- Raises:
TypeError – If parameters are not of correct type
ValueError – If parameter values are invalid
Notes
t-SNE is computationally expensive. For large datasets (>10,000 samples), consider using perplexity between 30-50 and reducing n_iter if needed.
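The perplexity guidance can be turned into a quick pre-flight check before a long run. The helper `check_perplexity` is hypothetical and not part of the documented API; the hard limit reflects the fact that recent scikit-learn versions reject a perplexity that is not smaller than the number of samples, while the 5 to 50 range is only the advisory range given in the parameter description.

```python
def check_perplexity(perplexity, n_samples):
    """Return advisory notes; raise if the value cannot work for this sample size."""
    if not isinstance(perplexity, (int, float)) or isinstance(perplexity, bool):
        raise TypeError("perplexity must be a number")
    if perplexity <= 0:
        raise ValueError("perplexity must be positive")
    if perplexity >= n_samples:
        raise ValueError("perplexity must be smaller than the number of samples")
    notes = []
    if not 5 <= perplexity <= 50:
        notes.append("perplexity is outside the commonly suggested 5-50 range")
    return notes


notes_default = check_perplexity(30.0, n_samples=2000)
notes_high = check_perplexity(120.0, n_samples=2000)
```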
- class ideal_genom.population.projection.Plot2D(output_dir: Path)[source]
Bases: object
Class for generating 2D scatter plots with metadata integration.
This class handles the preparation of metadata (color hue files, case-control markers) and generates publication-quality 2D scatter plots for dimensionality reduction results.
- __init__(output_dir: Path) None[source]
Initialize Plot2D object.
- Parameters:
output_dir (Path) – Directory where plots will be saved
- Raises:
TypeError – If output_dir is not a Path object
FileNotFoundError – If output_dir does not exist
- prepare_metadata(color_hue_file: Path | None = None, case_control_markers: bool = False, fam_file: Path | None = None) pandas.DataFrame | None[source]
Prepare metadata DataFrame from color hue file and/or case-control markers.
- Parameters:
color_hue_file (Path, optional) – Path to tab-separated file with metadata for coloring. Must have at least 3 columns: ID1, ID2, and a metadata column.
case_control_markers (bool, default=False) – Whether to load case-control labels from .fam file
fam_file (Path, optional) – Path to .fam file containing case-control information. Required if case_control_markers is True.
- Returns:
Metadata DataFrame with columns [‘ID1’, ‘ID2’, …] or None if no metadata
- Return type:
pd.DataFrame or None
- Raises:
FileNotFoundError – If specified files don’t exist
TypeError – If parameters are of incorrect type
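For orientation, case-control labels in a PLINK .fam file live in the sixth column, where 1 denotes a control, 2 a case, and 0 or -9 a missing phenotype. The sketch below shows one way to turn such lines into ID and label records. The helper `read_case_control` and the label strings are assumptions, and the real method returns a pandas DataFrame rather than a list of dictionaries.

```python
def read_case_control(fam_lines):
    """Map .fam lines to records with ID1, ID2 and a readable phenotype label."""
    labels = {"1": "Control", "2": "Case"}  # label text is an assumption
    rows = []
    for line in fam_lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        fid, iid, phenotype = fields[0], fields[1], fields[5]
        rows.append(
            {"ID1": fid, "ID2": iid, "Phenotype": labels.get(phenotype, "Missing")}
        )
    return rows


# Hypothetical .fam content: FID IID father mother sex phenotype
fam_lines = [
    "FAM1 S1 0 0 1 2",
    "FAM2 S2 0 0 2 1",
    "FAM3 S3 0 0 1 -9",
]
metadata = read_case_control(fam_lines)
```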
- generate_plot(data: pandas.DataFrame, x_col: str, y_col: str, plot_name: str, hue_col: str | None = None, style_col: str | None = None, title: str | None = None, xlabel: str | None = None, ylabel: str | None = None, figsize: tuple = (5, 5), dpi: int = 500, format: str = 'pdf', marker: str = '.', marker_size: int = 10, alpha: float = 0.5, equal_aspect: bool = True, legend_params: dict | None = None) Path[source]
Generate a 2D scatter plot with optional metadata coloring and styling.
- Parameters:
data (pd.DataFrame) – DataFrame containing the 2D coordinates and IDs (must have ‘ID1’, ‘ID2’ columns)
x_col (str) – Column name for x-axis values
y_col (str) – Column name for y-axis values
plot_name (str) – Name of the output plot file
hue_col (str, optional) – Column name for point coloring. If None and metadata exists, uses third column or ‘Phenotype’ if available.
style_col (str, optional) – Column name for point styling (different markers)
title (str, optional) – Plot title
xlabel (str, optional) – X-axis label. If None, uses x_col.
ylabel (str, optional) – Y-axis label. If None, uses y_col.
figsize (tuple, default=(5, 5)) – Figure size in inches (width, height)
dpi (int, default=500) – Resolution for saving the plot
format (str, default='pdf') – Output format (‘pdf’, ‘png’, ‘jpeg’, ‘svg’)
marker (str, default='.') – Marker style for scatter plot
marker_size (int, default=10) – Size of markers
alpha (float, default=0.5) – Transparency of markers (0-1)
equal_aspect (bool, default=True) – Whether to set equal aspect ratio
legend_params (dict, optional) – Parameters for legend customization (bbox_to_anchor, ncols, fontsize, etc.)
- Returns:
Path to the saved plot file
- Return type:
Path
- Raises:
ValueError – If required columns are missing or hue_col not found
TypeError – If parameters are of incorrect type
- class ideal_genom.population.projection.DimensionalityReductionPipeline(input_path: Path, input_name: str, output_path: Path, build: str = '38', high_ld_regions_file: Path | None = None, generate_plot: bool = True)[source]
Bases: object
Pipeline for running PCA preparation and dimensionality reduction workflows.
This class orchestrates the complete workflow from raw genetic data to dimensionality reduction visualizations, including:
1. PCA preparation (LD pruning + PCA)
2. Optional UMAP reduction
3. Optional t-SNE reduction
4. Automated plotting with metadata
- __init__(input_path: Path, input_name: str, output_path: Path, build: str = '38', high_ld_regions_file: Path | None = None, generate_plot: bool = True) None[source]
Initialize the dimensionality reduction pipeline.
- Parameters:
input_path (Path) – Path to directory containing input genetic data files (.bed/.bim/.fam)
input_name (str) – Base name of input files (without extension)
output_path (Path) – Path to directory where all results will be saved
build (str, default='38') – Genome build version (‘37’ or ‘38’)
high_ld_regions_file (Path, optional) – Path to file containing high LD regions. If None, will be fetched automatically.
generate_plot (bool, default=True) – Whether to generate plots automatically
- Raises:
TypeError – If input types are incorrect
FileNotFoundError – If input_path or output_path don’t exist
ValueError – If build is not ‘37’ or ‘38’
- execute_pca_preparation(maf: float = 0.001, geno: float = 0.1, mind: float = 0.2, hwe: float = 5e-08, ind_pair: list = [50, 5, 0.2], pca: int = 20, case_control_markers: bool = False) Path[source]
Run PCA preparation: LD pruning and principal component analysis.
- Parameters:
maf (float, default=0.001) – Minor allele frequency threshold
geno (float, default=0.1) – Genotype missingness threshold
mind (float, default=0.2) – Sample missingness threshold
hwe (float, default=5e-8) – Hardy-Weinberg equilibrium p-value threshold
ind_pair (list, default=[50, 5, 0.2]) – LD pruning parameters: [window_size, step_size, r2_threshold]
pca (int, default=20) – Number of principal components to calculate
- Returns:
Path to the generated eigenvector file
- Return type:
Path
Notes
This step is required before running UMAP or t-SNE reductions.
- execute_umap(n_neighbors: int = 15, min_dist: float = 0.1, metric: str = 'euclidean', random_state: int | None = None, n_components: int = 2, umap_kwargs: dict | None = None) pandas.DataFrame[source]
Run UMAP dimensionality reduction.
- Parameters:
n_neighbors (int, default=15) – Number of neighbors for UMAP
min_dist (float, default=0.1) – Minimum distance between points
metric (str, default='euclidean') – Distance metric
random_state (int, optional) – Random seed for reproducibility
n_components (int, default=2) – Number of output dimensions
umap_kwargs (dict, optional) – Additional UMAP parameters
- Returns:
UMAP results with columns [‘ID1’, ‘ID2’, ‘umap_1’, ‘umap_2’, …]
- Return type:
pd.DataFrame
- Raises:
RuntimeError – If PCA preparation hasn’t been run yet
- execute_tsne(n_components: int = 2, perplexity: float = 30.0, learning_rate: float = 200.0, n_iter: int = 1000, metric: str = 'euclidean', random_state: int | None = None, early_exaggeration: float = 12.0, init: Literal['pca', 'random'] = 'pca', tsne_kwargs: dict | None = None) pandas.DataFrame[source]
Run t-SNE dimensionality reduction.
- Parameters:
n_components (int, default=2) – Number of output dimensions
perplexity (float, default=30.0) – t-SNE perplexity parameter
learning_rate (float, default=200.0) – Learning rate for optimization
n_iter (int, default=1000) – Number of optimization iterations
metric (str, default='euclidean') – Distance metric
random_state (int, optional) – Random seed for reproducibility
early_exaggeration (float, default=12.0) – Early exaggeration parameter
init ({'pca', 'random'}, default='pca') – Initialization method
tsne_kwargs (dict, optional) – Additional t-SNE parameters
- Returns:
t-SNE results with columns [‘ID1’, ‘ID2’, ‘tsne_1’, ‘tsne_2’, …]
- Return type:
pd.DataFrame
- Raises:
RuntimeError – If PCA preparation hasn’t been run yet
- generate_plots(color_hue_file: Path | None = None, case_control_markers: bool = False, fam_file: Path | None = None, plot_format: str = 'pdf', dpi: int = 500, figsize: tuple = (5, 5)) dict[source]
Generate plots for all completed reductions.
- Parameters:
color_hue_file (Path, optional) – Path to metadata file for coloring
case_control_markers (bool, default=False) – Whether to use case-control markers
fam_file (Path, optional) – Path to .fam file (required if case_control_markers=True)
plot_format (str, default='pdf') – Output format (‘pdf’, ‘png’, ‘jpeg’, ‘svg’)
dpi (int, default=500) – Resolution for plots
figsize (tuple, default=(5, 5)) – Figure size in inches
include_pca (bool, default=True) – Whether to generate PCA plots
- Returns:
Dictionary mapping method names to plot file paths
- Return type:
dict
- Raises:
RuntimeError – If no reductions have been run
- execute_dimensionality_reduction_pipeline(pca_params: dict | None = None, force_pca_recompute: bool = False, run_umap: bool = True, umap_params: dict | None = None, run_tsne: bool = True, tsne_params: dict | None = None, color_hue_file: Path | None = None, case_control_markers: bool = False, fam_file: Path | None = None, plot_format: str = 'pdf', dpi: int = 500, include_pca: bool = True, save_all_coordinates: bool = True, generate_all_plots: bool = True, grid_summary: bool = True) dict[source]
Run the complete dimensionality reduction pipeline with automatic parameter grid detection.
This method automatically detects whether parameters contain single values or lists. If lists are detected, it runs a parameter grid search exploring all combinations. Otherwise, it runs a single analysis with the provided parameters.
- Parameters:
pca_params (dict, optional) – Parameters for PCA preparation (maf, geno, mind, hwe, ind_pair, pca)
force_pca_recompute (bool, default=False) – If True, recompute PCA even if files already exist. If False, skip PCA computation if eigenvector and eigenvalue files are found.
run_umap (bool, default=True) – Whether to run UMAP reduction
umap_params (dict, optional) – Parameters for UMAP. Can contain single values or lists for grid search.
Example single: {'n_neighbors': 15, 'min_dist': 0.1}
Example grid: {'n_neighbors': [10, 15, 30], 'min_dist': [0.1, 0.5]}
run_tsne (bool, default=True) – Whether to run t-SNE reduction
tsne_params (dict, optional) – Parameters for t-SNE. Can contain single values or lists for grid search.
Example single: {'perplexity': 30, 'learning_rate': 200}
Example grid: {'perplexity': [20, 30, 50], 'learning_rate': [100, 200]}
color_hue_file (Path, optional) – Metadata file for plot coloring
case_control_markers (bool, default=False) – Whether to use case-control markers in plots
fam_file (Path, optional) – Path to .fam file. If not provided, will automatically look for {input_name}.fam in the input_path directory
plot_format (str, default='pdf') – Output format for plots
dpi (int, default=500) – Resolution for plots
include_pca (bool, default=True) – Whether to generate PCA plots
save_all_coordinates (bool, default=True) – For grid search: whether to save coordinate files for all parameter combinations
generate_all_plots (bool, default=True) – For grid search: whether to generate plot files for all parameter combinations
grid_summary (bool, default=True) – For grid search: whether to generate summary table of all parameter combinations
- Returns:
Results summary with file paths and metadata. For grid searches, includes information about all parameter combinations explored.
- Return type:
dict
Examples
Single analysis:
>>> pipeline = DimensionalityReductionPipeline(...)
>>> results = pipeline.execute_dimensionality_reduction_pipeline(
...     umap_params={'n_neighbors': 15, 'min_dist': 0.1},
...     tsne_params={'perplexity': 30}
... )
Parameter grid search (automatically detected):
>>> results = pipeline.execute_dimensionality_reduction_pipeline(
...     umap_params={
...         'n_neighbors': [10, 15, 30],
...         'min_dist': [0.1, 0.5],
...         'random_state': [42]
...     },
...     tsne_params={
...         'perplexity': [20, 30, 50],
...         'random_state': [42]
...     }
... )
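The automatic grid detection described for this method can be approximated with `itertools.product`. The helpers `is_grid` and `expand_grid` are hypothetical names, and the library's own detection logic may differ, for example in how it treats tuples or nested values.

```python
from itertools import product


def is_grid(params):
    """A parameter dict is treated as a grid if any value is a list."""
    return any(isinstance(value, list) for value in params.values())


def expand_grid(params):
    """Return one dict per combination; scalar values are held fixed."""
    keys = list(params)
    value_lists = [v if isinstance(v, list) else [v] for v in params.values()]
    return [dict(zip(keys, combo)) for combo in product(*value_lists)]


single = {"n_neighbors": 15, "min_dist": 0.1}
grid = {"n_neighbors": [10, 15, 30], "min_dist": [0.1, 0.5], "random_state": [42]}

combinations = expand_grid(grid)
```

With three values for `n_neighbors`, two for `min_dist`, and one for `random_state`, the grid expands to six combinations, while the single-value dict expands to exactly one.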
- execute_parameter_grid(umap_grid: dict | None = None, tsne_grid: dict | None = None, plot_params: dict | None = None, save_coordinates: bool = True, generate_plots: bool = True, color_hue_file: Path | None = None, case_control_markers: bool = False, fam_file: Path | None = None, plot_format: str = 'pdf') dict[source]
Run systematic parameter grid exploration for UMAP and/or t-SNE.
This method explores all combinations of specified parameters, saving coordinates and generating plots for each combination. Results are organized with clear naming conventions for easy comparison.
- Parameters:
umap_grid (dict, optional) – Dictionary with parameter names as keys and lists of values as values.
Example: {'n_neighbors': [15, 30], 'min_dist': [0.1, 0.5]}
tsne_grid (dict, optional) – Dictionary with parameter names as keys and lists of values as values.
Example: {'perplexity': [20, 50], 'learning_rate': [100, 200]}
plot_params (dict, optional) – Additional parameters for plot generation (figsize, dpi, etc.)
save_coordinates (bool, default=True) – Whether to save coordinate files for each combination
generate_plots (bool, default=True) – Whether to generate plot files for each combination
color_hue_file (Path, optional) – Metadata file for plot coloring
case_control_markers (bool, default=False) – Whether to use case-control markers in plots
fam_file (Path, optional) – Path to .fam file for case-control markers
plot_format (str, default='pdf') – Output format for plots
- Returns:
Summary of all parameter combinations and results
- Return type:
dict
- Raises:
RuntimeError – If PCA preparation hasn’t been run yet
ValueError – If neither umap_grid nor tsne_grid is provided
Examples
>>> pipeline = DimensionalityReductionPipeline(...)
>>> pipeline.execute_pca_preparation()
>>> results = pipeline.execute_parameter_grid(
...     umap_grid={
...         'n_neighbors': [15, 30],
...         'min_dist': [0.1, 0.5],
...         'random_state': [42]
...     },
...     tsne_grid={
...         'perplexity': [20, 50],
...         'random_state': [42]
...     }
... )
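To give a feel for how results from a grid run might be told apart, the sketch below derives one label per parameter combination and mirrors the documented ValueError when no grid is supplied. The function `plan_grid_runs` and the `method_key-value` naming scheme are purely hypothetical; the naming convention the library actually uses is not specified in this reference.

```python
from itertools import product


def plan_grid_runs(umap_grid=None, tsne_grid=None):
    """Return one label per parameter combination across both methods."""
    if not umap_grid and not tsne_grid:
        raise ValueError("provide at least one of umap_grid or tsne_grid")
    labels = []
    for method, grid in (("umap", umap_grid), ("tsne", tsne_grid)):
        if not grid:
            continue
        keys = sorted(grid)  # sorted keys keep labels stable across runs
        for values in product(*(grid[key] for key in keys)):
            parts = [f"{key}-{value}" for key, value in zip(keys, values)]
            labels.append("_".join([method] + parts))
    return labels


labels = plan_grid_runs(
    umap_grid={"n_neighbors": [15, 30], "min_dist": [0.1, 0.5]},
    tsne_grid={"perplexity": [20, 50]},
)
```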