Utilities
nplinker.genomics.utils
generate_mappings_genome_id_bgc_id
generate_mappings_genome_id_bgc_id(
bgc_dir: str | PathLike,
output_file: str | PathLike | None = None,
) -> None
Generate a file that maps genome id to BGC id.
The input bgc_dir must follow the structure of the antismash directory defined in
Working Directory Structure, e.g.:
bgc_dir
├── genome_id_1
│ ├── bgc_id_1.gbk
│ └── ...
├── genome_id_2
│ ├── bgc_id_2.gbk
│ └── ...
└── ...
Parameters:
-
bgc_dir(str | PathLike) – The directory has one layer of subfolders, and each subfolder contains BGC files in
.gbk format. It assumes that
- the subfolder name is the genome id (e.g. a RefSeq id),
- the BGC file name is the BGC id.
-
output_file(str | PathLike | None, default:None) – The path to the output file. The file will be overwritten if it already exists.
Defaults to None, in which case the output file is placed in the directory
bgc_dir with the file name GENOME_BGC_MAPPINGS_FILENAME.
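The directory convention above can be illustrated with a minimal sketch. It is not the NPLinker implementation: the helper name map_genome_id_to_bgc_ids is hypothetical, and taking the BGC id as the file name without the .gbk extension and skipping subfolders without .gbk files are assumptions made for illustration. The sketch only builds an in-memory mapping and says nothing about the format of the output file.

```python
import tempfile
from pathlib import Path


def map_genome_id_to_bgc_ids(bgc_dir):
    """Return {genome_id: sorted list of BGC ids} for a one-layer directory.

    Hypothetical helper for illustration only, not part of NPLinker.
    """
    mappings = {}
    for subfolder in sorted(Path(bgc_dir).iterdir()):
        if not subfolder.is_dir():
            continue  # only subfolders are treated as genomes
        # Assumption: the BGC id is the file name without the ".gbk" extension.
        bgc_ids = sorted(p.stem for p in subfolder.glob("*.gbk"))
        if bgc_ids:  # assumption: subfolders without .gbk files are skipped
            mappings[subfolder.name] = bgc_ids
    return mappings


# Build a tiny example tree that mirrors the layout shown above.
with tempfile.TemporaryDirectory() as tmp:
    for genome_id, bgc_id in [("genome_id_1", "bgc_id_1"), ("genome_id_2", "bgc_id_2")]:
        folder = Path(tmp) / genome_id
        folder.mkdir()
        (folder / f"{bgc_id}.gbk").touch()
    example_mappings = map_genome_id_to_bgc_ids(tmp)

print(example_mappings)
```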
Source code in src/nplinker/genomics/utils.py
add_strain_to_bgc
Assign a Strain object to BGC.strain for input BGCs.
BGC id is used to find the corresponding Strain object. It's possible that no Strain object is found for a BGC id.
Note
The input bgcs will be changed in place.
Parameters:
-
strains(StrainCollection) – A collection of all strain objects.
-
bgcs(Sequence[BGC]) – A list of BGC objects.
Returns:
-
tuple[list[BGC], list[BGC]] – A tuple of two lists of BGC objects,
- the first list contains BGC objects that are updated with a Strain object;
- the second list contains BGC objects that are not updated with a Strain object because no Strain object is found.
Raises:
-
ValueError – If multiple Strain objects are found for a BGC id.
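The behaviour described above (in-place update, a matched/unmatched split, and an error on ambiguous matches) can be sketched with stand-in types. ToyBGC, assign_strains and the plain dictionary used as a strain lookup are all hypothetical; the real function works with BGC and StrainCollection objects, whose lookup API is not shown here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyBGC:
    """Hypothetical stand-in for a BGC object."""

    id: str
    strain: Optional[str] = None


def assign_strains(strain_lookup, bgcs):
    """Split bgcs into (with_strain, without_strain), updating them in place.

    strain_lookup is a dict mapping an id to the list of strains known
    under that id (a simplification of a strain collection).
    """
    with_strain, without_strain = [], []
    for bgc in bgcs:
        matches = strain_lookup.get(bgc.id, [])
        if len(matches) > 1:
            raise ValueError(f"Multiple strains found for BGC id '{bgc.id}'")
        if matches:
            bgc.strain = matches[0]  # the input object itself is modified
            with_strain.append(bgc)
        else:
            without_strain.append(bgc)
    return with_strain, without_strain


bgcs = [ToyBGC("bgc_id_1"), ToyBGC("bgc_id_2")]
matched, unmatched = assign_strains({"bgc_id_1": ["strain_A"]}, bgcs)
print([b.id for b in matched], [b.id for b in unmatched])
```

Because the input objects are modified in place, bgcs[0].strain is set after the call even though the original list is not returned.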
Source code in src/nplinker/genomics/utils.py
add_bgc_to_gcf
add_bgc_to_gcf(
bgcs: Sequence[BGC], gcfs: Sequence[GCF]
) -> tuple[list[GCF], list[GCF], dict[GCF, set[str]]]
Add BGC objects to GCF objects based on each GCF's BGC ids.
The GCF.bgc_ids attribute contains the ids of BGC objects. These ids
are used to look up BGC objects in the input bgcs list. The BGC objects
that are found are added to the bgcs attribute of the GCF object. Some
BGC ids may not be found in the input bgcs list, in which case their BGC
objects are missing from the GCF object.
Note
This method changes the lists bgcs and gcfs in place.
Parameters:
-
bgcs(Sequence[BGC]) – A list of BGC objects.
-
gcfs(Sequence[GCF]) – A list of GCF objects.
Returns:
-
tuple[list[GCF], list[GCF], dict[GCF, set[str]]] – A tuple of two lists and a dictionary,
- the first list contains GCF objects that are updated with BGC objects;
- the second list contains GCF objects that are not updated with BGC objects because no BGC objects are found;
- the dictionary contains GCF objects as keys and a set of ids of missing BGC objects as values.
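The shape of the three return values can be sketched as follows. ToyBGC, ToyGCF and attach_bgcs are hypothetical stand-ins, not the NPLinker classes or function. In particular, placing a GCF with only some of its BGCs found in the first list is an assumption of this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class ToyBGC:
    """Hypothetical stand-in for a BGC object."""

    id: str


@dataclass(eq=False)  # eq=False keeps identity hashing, so instances can be dict keys
class ToyGCF:
    """Hypothetical stand-in for a GCF object."""

    id: str
    bgc_ids: set = field(default_factory=set)
    bgcs: list = field(default_factory=list)


def attach_bgcs(bgcs, gcfs):
    """Return (gcfs_with_bgc, gcfs_without_bgc, missing_ids_per_gcf)."""
    bgc_by_id = {bgc.id: bgc for bgc in bgcs}
    with_bgc, without_bgc, missing = [], [], {}
    for gcf in gcfs:
        found = [bgc_by_id[i] for i in sorted(gcf.bgc_ids) if i in bgc_by_id]
        gcf.bgcs.extend(found)  # the GCF object is modified in place
        missing_ids = {i for i in gcf.bgc_ids if i not in bgc_by_id}
        if missing_ids:
            missing[gcf] = missing_ids
        # Assumption: a GCF counts as "updated" if at least one BGC was found.
        (with_bgc if found else without_bgc).append(gcf)
    return with_bgc, without_bgc, missing


gcf_a = ToyGCF("gcf_a", {"bgc_1", "bgc_x"})
gcf_b = ToyGCF("gcf_b", {"bgc_y"})
updated, not_updated, missing = attach_bgcs([ToyBGC("bgc_1")], [gcf_a, gcf_b])
print([g.id for g in updated], [g.id for g in not_updated])
```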
Source code in src/nplinker/genomics/utils.py
get_mibig_from_gcf
Get MIBiG BGCs and strains from GCF objects.
Parameters:
Returns:
-
tuple[list[BGC], StrainCollection] – A tuple of two objects,
- the first is a list of MIBiG BGC objects used in the GCFs;
- the second is a StrainCollection object that contains all Strain objects used in the GCFs.
Source code in src/nplinker/genomics/utils.py
extract_mappings_strain_id_original_genome_id
extract_mappings_strain_id_original_genome_id(
podp_project_json_file: str | PathLike,
) -> dict[str, set[str]]
Extract mappings "strain_id <-> original_genome_id".
Tip
The podp_project_json_file is the JSON file downloaded from the PODP platform.
For example, for PODP project MSV000079284, its JSON file is https://pairedomicsdata.bioinformatics.nl/api/projects/4b29ddc3-26d0-40d7-80c5-44fb6631dbf9.4.
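Parameters:
-
podp_project_json_file(str | PathLike) – The path to the PODP project JSON file.
Returns:
-
dict[str, set[str]] – Mappings "strain_id <-> original_genome_id".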
See Also
- podp_generate_strain_mappings: Generate strain mappings JSON file for PODP pipeline.
Source code in src/nplinker/genomics/utils.py
extract_mappings_original_genome_id_resolved_genome_id
extract_mappings_original_genome_id_resolved_genome_id(
genome_status_json_file: str | PathLike,
) -> dict[str, str]
Extract mappings "original_genome_id <-> resolved_genome_id".
Tip
The genome_status_json_file is generated by the podp_download_and_extract_antismash_data
function with a default file name GENOME_STATUS_FILENAME.
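Parameters:
-
genome_status_json_file(str | PathLike) – The path to the genome status JSON file.
Returns:
-
dict[str, str] – Mappings "original_genome_id <-> resolved_genome_id".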
See Also
- podp_generate_strain_mappings: Generate strain mappings JSON file for PODP pipeline.
Source code in src/nplinker/genomics/utils.py
extract_mappings_resolved_genome_id_bgc_id
extract_mappings_resolved_genome_id_bgc_id(
genome_bgc_mappings_file: str | PathLike,
) -> dict[str, set[str]]
Extract mappings "resolved_genome_id <-> bgc_id".
Tip
The genome_bgc_mappings_file is usually generated by the
generate_mappings_genome_id_bgc_id
function with a default file name GENOME_BGC_MAPPINGS_FILENAME.
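Parameters:
-
genome_bgc_mappings_file(str | PathLike) – The path to the genome BGC mappings file.
Returns:
-
dict[str, set[str]] – Mappings "resolved_genome_id <-> bgc_id".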
See Also
- podp_generate_strain_mappings: Generate strain mappings JSON file for PODP pipeline.
Source code in src/nplinker/genomics/utils.py
get_mappings_strain_id_bgc_id
get_mappings_strain_id_bgc_id(
mappings_strain_id_original_genome_id: Mapping[
str, set[str]
],
mappings_original_genome_id_resolved_genome_id: Mapping[
str, str
],
mappings_resolved_genome_id_bgc_id: Mapping[
str, set[str]
],
) -> dict[str, set[str]]
Get mappings "strain_id <-> bgc_id".
Parameters:
-
mappings_strain_id_original_genome_id(Mapping[str, set[str]]) – Mappings "strain_id <-> original_genome_id".
-
mappings_original_genome_id_resolved_genome_id(Mapping[str, str]) – Mappings "original_genome_id <-> resolved_genome_id".
-
mappings_resolved_genome_id_bgc_id(Mapping[str, set[str]]) – Mappings "resolved_genome_id <-> bgc_id".
Returns:
-
dict[str, set[str]] – Mappings "strain_id <-> bgc_id".
See Also
- extract_mappings_strain_id_original_genome_id: Extract mappings "strain_id <-> original_genome_id".
- extract_mappings_original_genome_id_resolved_genome_id: Extract mappings "original_genome_id <-> resolved_genome_id".
- extract_mappings_resolved_genome_id_bgc_id: Extract mappings "resolved_genome_id <-> bgc_id".
- podp_generate_strain_mappings: Generate strain mappings JSON file for PODP pipeline.
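Composing the three mappings into "strain_id <-> bgc_id" amounts to following each strain through its original genome ids, then the resolved genome id, then the BGC ids. The sketch below shows one way this could look. The function name chain_mappings and the example ids are hypothetical, and skipping unresolved genome ids and dropping strains that end up with no BGCs are assumptions rather than documented behaviour.

```python
def chain_mappings(strain_to_original, original_to_resolved, resolved_to_bgc):
    """Compose three mappings into {strain_id: set of BGC ids}.

    Hypothetical helper for illustration only, not part of NPLinker.
    """
    strain_to_bgc = {}
    for strain_id, original_ids in strain_to_original.items():
        bgc_ids = set()
        for original_id in original_ids:
            resolved_id = original_to_resolved.get(original_id)
            if resolved_id is None:
                continue  # assumption: unresolved genome ids are skipped
            bgc_ids |= resolved_to_bgc.get(resolved_id, set())
        if bgc_ids:  # assumption: strains without any BGC are left out
            strain_to_bgc[strain_id] = bgc_ids
    return strain_to_bgc


# Made-up ids that follow the three mapping types described above.
strain_to_original = {"strain_1": {"genome_A"}, "strain_2": {"genome_B"}}
original_to_resolved = {"genome_A": "genome_A.1"}
resolved_to_bgc = {"genome_A.1": {"bgc_id_1", "bgc_id_2"}}

strain_bgc = chain_mappings(strain_to_original, original_to_resolved, resolved_to_bgc)
print(strain_bgc)
```

Here strain_2 drops out because genome_B has no resolved genome id in the second mapping.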