Get References
Module for downloading and managing reference datasets (1000 Genomes, reference genomes, dbSNP).
- class ideal_genom.core.get_references.Fetcher1000Genome(destination: Path | None = None, build: str = '38')[source]
Bases:
object
- __init__(destination: Path | None = None, build: str = '38')[source]
Initialize a reference data handler.
This class manages reference data files from the 1000 Genomes Project.
- Parameters:
destination (Path, optional) – Path where reference files will be stored. If not provided, defaults to ‘../data/1000genomes_build_{build}’.
build (str, optional) – Human genome build version. Defaults to ‘38’.
- destination
Directory path where reference files are stored
- Type:
Path
- pgen_file
Path to PGEN format file
- Type:
Path
- pvar_file
Path to PVAR format file
- Type:
Path
- psam_file
Path to PSAM format file
- Type:
Path
- bed_file
Path to BED format file
- Type:
Path
- bim_file
Path to BIM format file
- Type:
Path
- fam_file
Path to FAM format file
- Type:
Path
- get_1000genomes(url_pgen: str | None = None, url_pvar: str | None = None, url_psam: str | None = None) Path[source]
Download and decompress 1000 Genomes reference data. This method downloads the PLINK2 binary files (.pgen, .pvar, .psam) for the 1000 Genomes reference dataset, corresponding to the specified genome build (37 or 38). If the files already exist in the destination directory, the download is skipped.
- Parameters:
url_pgen (str, optional) – Custom URL for downloading the .pgen file. If None, uses the default URL for the genome build.
url_pvar (str, optional) – Custom URL for downloading the .pvar file. If None, uses the default URL for the genome build.
url_psam (str, optional) – Custom URL for downloading the .psam file. If None, uses the default URL for the genome build.
- Returns:
Path object pointing to the decompressed .pgen file location.
- Return type:
Path
Note:
The method requires plink2 to be installed and accessible in the system path for decompressing the .pgen file.
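The skip-if-present behaviour described above can be sketched as follows. This is a minimal illustration, not the library's code: `fetch_if_missing` and the injected `download` callable are hypothetical names, and the actual transfer logic is left to the caller.

```python
from pathlib import Path
from typing import Callable


def fetch_if_missing(url: str, target: Path,
                     download: Callable[[str, Path], None]) -> bool:
    """Call download(url, target) only when target does not exist yet.

    Returns True if a download was performed, False if it was skipped.
    """
    if target.exists():
        return False  # file already present, nothing to do
    target.parent.mkdir(parents=True, exist_ok=True)
    download(url, target)
    return True
```

Passing the downloader in as a callable keeps the existence check testable without network access.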
- get_1000genomes_binaries() Path[source]
Convert downloaded 1000 Genomes data into PLINK binary files (.bed, .bim, .fam). If the binary files already exist, the conversion is skipped. The method also handles file cleanup and renaming of the output files. The conversion is done in two steps:
1. Convert the pfile to binary format, keeping only SNPs from chromosomes 1-22, X, Y, and MT.
2. Update variant IDs and create the final binary files.
- Returns:
Path object pointing to the generated binary files (without extension). The actual files created will be .bed, .bim, .fam and .psam with the same prefix.
- Return type:
Path
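The two conversion steps could be expressed as a pair of plink2 invocations along these lines. The exact flags the method passes are not documented here, so the flag selection below (`--snps-only just-acgt`, `--chr`, `--set-all-var-ids` with an `@:#:$r:$a` template) is an assumption, and `build_plink2_commands` is a hypothetical helper that only builds the argument lists without running anything.

```python
from pathlib import Path


def build_plink2_commands(pfile_prefix: Path, out_prefix: Path) -> list[list[str]]:
    """Return argument lists for a two-step pfile -> bed/bim/fam conversion."""
    # Intermediate prefix for the output of step 1 (name is illustrative).
    tmp_prefix = out_prefix.with_name(out_prefix.name + "_tmp")
    # Step 1: keep SNPs on chromosomes 1-22, X, Y, MT and write bed/bim/fam.
    step1 = [
        "plink2", "--pfile", str(pfile_prefix),
        "--snps-only", "just-acgt",
        "--chr", "1-22,X,Y,MT",
        "--make-bed", "--out", str(tmp_prefix),
    ]
    # Step 2: rewrite variant IDs and produce the final binary fileset.
    step2 = [
        "plink2", "--bfile", str(tmp_prefix),
        "--set-all-var-ids", "@:#:$r:$a",
        "--make-bed", "--out", str(out_prefix),
    ]
    return [step1, step2]
```

Each list can be handed to `subprocess.run(cmd, check=True)` once plink2 is on the system path.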
- class ideal_genom.core.get_references.ReferenceDataFetcher(base_url: str, build: str, source: str, destination_folder: str | None = None)[source]
Bases:
object
A class for fetching, downloading, and processing reference genome data.
This class provides a framework for retrieving genomic reference data from various sources. It handles downloading compressed files, unzipping them, and extracting gene information from GTF files.
- destination_folder
The directory to save downloaded files. If None, defaults to project_root/data/{source}_latest.
- Type:
Optional[str]
Notes
This is an abstract base class that requires subclasses to implement the get_latest_release() method for specific data sources.
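The subclassing contract can be illustrated with Python's `abc` module. Whether ReferenceDataFetcher itself relies on `abc` is not stated here, so treat this as a sketch of the pattern only; `BaseFetcher` and `StaticFetcher` are hypothetical stand-ins, and the URL layout is invented for the example.

```python
from abc import ABC, abstractmethod
from typing import Optional


class BaseFetcher(ABC):
    """Stand-in for a fetcher base class with one required hook."""

    def __init__(self, base_url: str, build: str, source: str) -> None:
        self.base_url = base_url
        self.build = build
        self.source = source
        self.latest_url: Optional[str] = None

    @abstractmethod
    def get_latest_release(self) -> None:
        """Subclasses must set self.latest_url."""


class StaticFetcher(BaseFetcher):
    """Toy subclass that derives the URL without any network access."""

    def get_latest_release(self) -> None:
        self.latest_url = f"{self.base_url}/{self.source}_{self.build}.gtf.gz"
```

With this pattern, instantiating the base class directly raises TypeError, which surfaces a missing `get_latest_release()` early.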
- __init__(base_url: str, build: str, source: str, destination_folder: str | None = None) None[source]
- download_latest() str[source]
Downloads the latest file from self.latest_url to self.destination_folder.
- Raises:
AttributeError – If self.latest_url is not set.
requests.exceptions.RequestException – If the HTTP request fails.
- get_all_genes() str[source]
Extract all genes from the GTF file and save them to a new compressed file.
This method reads the GTF file specified in self.gtf_file, filters for gene features, and creates a new GTF file containing only the gene entries. If the output file already exists, it will return the path without reprocessing.
- Returns:
Path to the compressed GTF file containing all genes
- Return type:
str
- Raises:
FileNotFoundError – If the reference GTF file (self.gtf_file) is not found
TypeError – If read_gtf does not return a pandas DataFrame
Note
The output file will be named based on the input GTF file with “-all_genes.gtf.gz” suffix
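The gene-filtering step can be approximated with the standard library alone. The method itself is described as using read_gtf and a pandas DataFrame; the sketch below swaps that for plain line filtering on the third GTF column, and `extract_gene_rows` is a hypothetical name.

```python
import gzip
from pathlib import Path


def extract_gene_rows(gtf_gz: Path) -> Path:
    """Write only 'gene' feature rows to '<name>-all_genes.gtf.gz'."""
    if not gtf_gz.exists():
        raise FileNotFoundError(gtf_gz)
    suffix = ".gtf.gz"
    base = gtf_gz.name[: -len(suffix)] if gtf_gz.name.endswith(suffix) else gtf_gz.name
    out = gtf_gz.with_name(base + "-all_genes.gtf.gz")
    if out.exists():
        return out  # already processed, skip the work
    with gzip.open(gtf_gz, "rt") as src, gzip.open(out, "wt") as dst:
        for line in src:
            if line.startswith("#"):
                continue  # drop header/comment lines
            fields = line.split("\t")
            # Column 3 of a GTF record holds the feature type.
            if len(fields) >= 3 and fields[2] == "gene":
                dst.write(line)
    return out
```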
- class ideal_genom.core.get_references.FetcherLDRegions(destination: Path | None = None, build: str = '38')[source]
Bases:
ReferenceDataFetcher
- __init__(destination: Path | None = None, build: str = '38')[source]
Initialize LDRegions object. This initializer sets up the destination path for LD regions files and the genome build version. If no destination is provided, it defaults to a ‘data/ld_regions_files’ directory relative to the parent directory of the current file.
- Parameters:
destination (Path, optional) – Path where LD region files will be stored. If None, uses default path.
build (str, optional) – Genome build version, defaults to ‘38’.
- destination
Directory path where LD region files are stored
- Type:
Path
- ld_regions
Placeholder for LD regions data, initially set to None
- Type:
None
- get_ld_regions() Path[source]
Download or create high LD regions file based on genome build version.
This method handles the retrieval of high Linkage Disequilibrium (LD) regions for different genome builds (37 or 38). For build 37, it downloads the regions from a GitHub repository. For build 38, it creates the file from predefined coordinates.
- Returns:
Path to the created/downloaded LD regions file. Returns empty Path if download fails for build 37.
- Return type:
Path
- Raises:
No exceptions are raised explicitly, but standard I/O exceptions may propagate.
Notes
For build 37: Downloads from genepi-freiburg/gwas repository
For build 38: Creates file from hardcoded coordinates from GWAS-pipeline
Files are named as ‘high-LD-regions_GRCh{build}.txt’
Creates destination directory if it doesn’t exist
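The file naming and directory creation noted above might look like the following sketch. `write_ld_regions` is a hypothetical helper, the tab-separated chromosome/start/end layout is an assumption about the file format, and the regions are supplied by the caller rather than taken from any real coordinate list.

```python
from pathlib import Path


def write_ld_regions(destination: Path, build: str,
                     regions: list[tuple[str, int, int]]) -> Path:
    """Write regions to 'high-LD-regions_GRCh{build}.txt' under destination."""
    destination.mkdir(parents=True, exist_ok=True)  # create folder if absent
    out = destination / f"high-LD-regions_GRCh{build}.txt"
    with out.open("w") as fh:
        for chrom, start, end in regions:
            # Assumed layout: chromosome, start, end, tab-separated.
            fh.write(f"{chrom}\t{start}\t{end}\n")
    return out
```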
- class ideal_genom.core.get_references.Ensembl38Fetcher(destination_folder=None)[source]
Bases:
ReferenceDataFetcher
A class for fetching human genome reference data from Ensembl based on GRCh38 build.
This class extends ReferenceDataFetcher to specifically handle Ensembl’s human genome data with build 38. It provides functionality to find and retrieve the latest GTF file from Ensembl’s FTP server.
- Raises:
Exception – If the Ensembl FTP server cannot be accessed
FileNotFoundError – If no matching GTF file is found
- get_latest_release() None[source]
Retrieves the URL of the latest GTF file for human genome (GRCh38) from the base URL.
This method scrapes the base URL to find the most recent Homo_sapiens GRCh38 GTF file available for download. Upon finding the file, it constructs the complete URL and stores it in the instance variable latest_url.
- Return type:
None
- Raises:
Exception – If the base URL cannot be accessed (non-200 response)
FileNotFoundError – If no GTF file matching the criteria is found
- class ideal_genom.core.get_references.Ensembl37Fetcher(destination_folder=None)[source]
Bases:
ReferenceDataFetcher
A class for fetching reference genome data from Ensembl’s GRCh37 (hg19) repository.
This class specializes the ReferenceDataFetcher to work specifically with Ensembl’s GRCh37 human genome build. It provides functionality to automatically detect and download the latest available GTF file for Homo sapiens from the Ensembl GRCh37 archive. The fetcher connects to Ensembl’s FTP server, identifies the most recent release available for GRCh37, and locates the chromosome GTF file for human genome data.
- latest_url
The complete URL to the latest GTF file, populated after calling get_latest_release()
- Type:
str
- __init__(destination_folder=None)[source]
Initialize a reference genome downloader for Ensembl GRCh37. This constructor configures the downloader to retrieve data from Ensembl’s GRCh37 repository.
- Parameters:
destination_folder (str, optional) – The folder where downloaded files will be stored. If None, a default location will be used based on the parent class implementation.
- get_latest_release() None[source]
Fetches the URL of the latest GTF file for Homo sapiens GRCh37 from Ensembl.
This method:
1. Connects to the base URL and identifies all available release folders
2. Determines the latest release by finding the highest release number
3. Navigates to the GTF directory for that release
4. Locates the Homo sapiens GRCh37 chromosome GTF file
5. Stores the complete download URL in self.latest_url
- Raises:
Exception – If the base URL cannot be accessed
Exception – If no release folders are found
Exception – If the latest release folder cannot be accessed
FileNotFoundError – If the GTF file is not found in the latest release
- Return type:
None
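Step 2 above, picking the highest release number, is worth doing numerically rather than by string comparison, since "release-87" sorts after "release-110" as text. A minimal sketch, assuming folder names of the form `release-<number>`; `latest_release` is a hypothetical helper and raises ValueError where the method above documents a generic Exception.

```python
import re
from typing import Iterable


def latest_release(folder_names: Iterable[str]) -> int:
    """Return the highest N among names shaped like 'release-N' or 'release-N/'."""
    numbers = []
    for name in folder_names:
        match = re.fullmatch(r"release-(\d+)/?", name)
        if match:
            numbers.append(int(match.group(1)))  # compare as integers
    if not numbers:
        raise ValueError("no release folders found")
    return max(numbers)
```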
- class ideal_genom.core.get_references.RefSeqFetcher(build: str, destination_folder: str | None = None)[source]
Bases:
ReferenceDataFetcher
A class for fetching and downloading reference genome data from NCBI’s RefSeq repository.
This class extends ReferenceDataFetcher to specifically handle downloading human genome reference files from the RefSeq database. It supports different genome builds (e.g., ‘GRCh37’, ‘GRCh38’) and automatically identifies the latest version available for the specified build. The class handles navigating the NCBI FTP directory structure, finding the appropriate GTF files for the requested genome build, and managing the download process.
- get_latest_release() None[source]
Fetches the latest GTF file dynamically from the specified base URL.
This method sends a GET request to the base URL, parses the HTML response to find the latest GTF file link, and sets the latest_url attribute to the full URL of the latest GTF file.
- Raises:
FileNotFoundError – If no GTF file is found in the HTML response.
- Return type:
None
- class ideal_genom.core.get_references.AssemblyReferenceFetcher(base_url: str, build: str, extension: str, destination_folder: str | None = None, avoid_substring: str = 'extra')[source]
Bases:
ReferenceDataFetcher
A class for fetching and preparing genomic reference files from online repositories.
This class handles the process of:
1. Finding the appropriate reference file URL based on build parameters
2. Downloading the reference file
3. Unzipping compressed reference files if necessary
- Parameters:
base_url (str) – The base URL where reference files are hosted
build (str) – The genome build identifier (e.g., ‘GRCh38’, ‘hg19’)
extension (str) – File extension to look for (e.g., ‘.gtf.gz’, ‘.fa.gz’)
destination_folder (Optional[str], default=None) – Path where files should be downloaded. If None, uses project_root/data/assembly_references
avoid_substring (str, default='extra') – Substring to avoid when selecting reference files
- file_path
Local path to the downloaded reference file
- Type:
Path or None
- Raises:
Exception – If the base URL cannot be accessed
FileNotFoundError – If no matching reference file is found
AttributeError – If methods are called out of sequence
ValueError – If required attributes are None when needed
- __init__(base_url: str, build: str, extension: str, destination_folder: str | None = None, avoid_substring: str = 'extra') None[source]
- get_reference_url() str[source]
Retrieves the URL for the reference file from the base URL.
This method performs an HTTP GET request to the base URL, parses the HTML content, and searches for links that contain the build version string, end with the specified extension, and do not contain the substring to avoid. The first matching link is taken as the reference file.
- Returns:
The complete URL to the reference file
- Return type:
str
- Raises:
Exception – If the base URL cannot be accessed
FileNotFoundError – If no matching reference file is found
Notes
Sets self.reference_file to the name of the found file
Sets self.reference_url to the complete URL
Logs information about the found file and URL
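The link-selection rule can be sketched on an already-fetched HTML string using only the standard library. Which HTML parser the method actually uses is not stated, so `html.parser` is an assumption here, and `find_reference_link` is a hypothetical function that leaves the HTTP request to the caller.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _LinkCollector(HTMLParser):
    """Collect href values of all <a> tags in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for key, value in attrs:
                if key == "href" and value:
                    self.hrefs.append(value)


def find_reference_link(html: str, base_url: str, build: str,
                        extension: str, avoid_substring: str = "extra") -> str:
    """Return the full URL of the first link that satisfies all three criteria."""
    parser = _LinkCollector()
    parser.feed(html)
    for href in parser.hrefs:
        if build in href and href.endswith(extension) and avoid_substring not in href:
            return urljoin(base_url, href)
    raise FileNotFoundError(f"no {extension} link found for build {build}")
```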
- download_reference_file() str[source]
Downloads a reference file from the specified URL to the destination folder.
This method first checks whether the reference file already exists locally. If not found, it also looks for an alternative version with a ‘.fa’ extension. If neither is present, it downloads the file from the given URL.
- Raises:
AttributeError – If self.reference_url or self.reference_file are not set.
ValueError – If self.reference_url or self.reference_file are set to None.
- Returns:
The path to the downloaded or existing reference file.
- Return type:
str
Note
self.reference_url and self.reference_file must be set by calling get_reference_url() before using this method.
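The local-file check that precedes the download can be sketched as below. It assumes the "alternative version" is the same file name with the `.gz` suffix removed (for example `ref.fa` for `ref.fa.gz`); that reading, and the name `find_existing_reference`, are assumptions rather than documented behaviour.

```python
from pathlib import Path
from typing import Optional


def find_existing_reference(folder: Path, reference_file: str) -> Optional[Path]:
    """Return a local copy of the reference file if one is already present."""
    compressed = folder / reference_file
    if compressed.exists():
        return compressed
    if reference_file.endswith(".gz"):
        # Assumed alternative: the already-unzipped file, e.g. 'ref.fa'.
        unzipped = folder / reference_file[: -len(".gz")]
        if unzipped.exists():
            return unzipped
    return None  # caller should download
```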
- unzip_reference_file() str[source]
Unzips a reference genome file (typically .fa.gz to .fa) and returns the path to the unzipped file.
This method checks if the file is already unzipped, and if not, unzips it using gzip. After successful unzipping, the original compressed file is deleted.
- Returns:
Path to the unzipped reference file (.fa)
- Return type:
str
- Raises:
AttributeError – If self.reference_file is not set (get_reference_url should be called first)
AttributeError – If self.file_path is not set or None (download_reference_file should be called first)
OSError – If an error occurs during the unzipping process
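The unzip-then-delete behaviour described above corresponds to a short routine with `gzip` and `shutil`. This is a sketch under the assumption that the output name is the input name minus its `.gz` suffix; `gunzip_and_remove` is a hypothetical name.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_and_remove(gz_path: Path) -> Path:
    """Decompress 'x.fa.gz' to 'x.fa' and delete the compressed original."""
    out_path = gz_path.with_suffix("")  # strips only the final '.gz'
    if out_path.exists():
        return out_path  # already unzipped
    with gzip.open(gz_path, "rb") as src, out_path.open("wb") as dst:
        shutil.copyfileobj(src, dst)  # stream, so large genomes fit in memory
    gz_path.unlink()  # remove the archive only after a successful write
    return out_path
```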