Dataset Loader

nplinker.loader

DatasetLoader

DatasetLoader(config: Dynaconf)

Load datasets from the working directory with the given configuration.

Concept and Diagram

Working Directory Structure

Dataset Loading Pipeline

Loaded data are stored in the loader's data container attributes, such as self.bgcs and self.gcfs.

Attributes:

Parameters:

  • config (Dynaconf) –

    A Dynaconf object that contains the configuration settings.

Examples:

>>> from nplinker.config import load_config
>>> from nplinker.loader import DatasetLoader
>>> config = load_config("nplinker.toml")
>>> loader = DatasetLoader(config)
>>> loader.load()
See Also

DatasetArranger: Download, generate and/or validate datasets to ensure they are ready for loading.

Source code in src/nplinker/loader.py
def __init__(self, config: Dynaconf) -> None:
    """Initialize the DatasetLoader.

    Args:
        config: A Dynaconf object that contains the configuration settings.

    Examples:
        >>> from nplinker.config import load_config
        >>> from nplinker.loader import DatasetLoader
        >>> config = load_config("nplinker.toml")
        >>> loader = DatasetLoader(config)
        >>> loader.load()

    See Also:
        [DatasetArranger][nplinker.arranger.DatasetArranger]: Download, generate and/or validate
            datasets to ensure they are ready for loading.
    """
    self.config = config

    self.bgcs: list[BGC] = []
    self.gcfs: list[GCF] = []
    self.spectra: list[Spectrum] = []
    self.mfs: list[MolecularFamily] = []
    self.mibig_bgcs: list[BGC] = []
    self.mibig_strains_in_use: StrainCollection = StrainCollection()
    self.product_types: list = []
    self.strains: StrainCollection = StrainCollection()

    self.class_matches = None
    self.chem_classes = None

RUN_CANOPUS_DEFAULT class-attribute instance-attribute

RUN_CANOPUS_DEFAULT = False

EXTRA_CANOPUS_PARAMS_DEFAULT class-attribute instance-attribute

EXTRA_CANOPUS_PARAMS_DEFAULT = (
    "--maxmz 600 formula zodiac structure canopus"
)
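
The default above is a single space-separated string that appears to be a fragment of command-line arguments. As a minimal sketch, assuming a caller wants the individual tokens (how NPLinker itself consumes this string is not shown on this page), the standard-library `shlex.split` can tokenize it:

```python
import shlex

# Value copied from DatasetLoader.EXTRA_CANOPUS_PARAMS_DEFAULT above.
EXTRA_CANOPUS_PARAMS_DEFAULT = "--maxmz 600 formula zodiac structure canopus"

# Split the string into an argument list using shell-like quoting rules.
canopus_args = shlex.split(EXTRA_CANOPUS_PARAMS_DEFAULT)
print(canopus_args)
```

Using `shlex.split` rather than `str.split` keeps quoted values intact if a custom parameter string ever contains them.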

OR_CANOPUS class-attribute instance-attribute

OR_CANOPUS = 'canopus_dir'

OR_MOLNETENHANCER class-attribute instance-attribute

OR_MOLNETENHANCER = 'molnetenhancer_dir'

config instance-attribute

config = config

bgcs instance-attribute

bgcs: list[BGC] = []

gcfs instance-attribute

gcfs: list[GCF] = []

spectra instance-attribute

spectra: list[Spectrum] = []

mfs instance-attribute

mfs: list[MolecularFamily] = []

mibig_bgcs instance-attribute

mibig_bgcs: list[BGC] = []

mibig_strains_in_use instance-attribute

mibig_strains_in_use: StrainCollection = StrainCollection()

product_types instance-attribute

product_types: list = []

strains instance-attribute

strains: StrainCollection = StrainCollection()

class_matches instance-attribute

class_matches = None

chem_classes instance-attribute

chem_classes = None

load

load() -> bool

Load all data from data files in the working directory.

See Dataset Loading Pipeline for the detailed steps.

Returns:

  • bool

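    True if all data are loaded successfully; False if loading the strain mappings, metabolomics data or genomics data fails.

Raises:

  • Exception –

    If no strains are found after loading.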

Source code in src/nplinker/loader.py
def load(self) -> bool:
    """Load all data from data files in the working directory.

    See [Dataset Loading Pipeline][dataset-loading-pipeline] for the detailed steps.

    Returns:
        True if all data are loaded successfully.
    """
    if not self._load_strain_mappings():
        return False

    if not self._load_metabolomics():
        return False

    if not self._load_genomics():
        return False

    # set self.strains with all strains from input plus mibig strains in use
    self.strains = self.strains + self.mibig_strains_in_use

    if len(self.strains) == 0:
        raise Exception("Failed to find *ANY* strains.")

    return True
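
The control flow of load() above can be sketched in isolation: run the steps in order, stop at the first one that reports failure, merge the two strain containers, and raise if the result is empty. `SketchLoader` is a hypothetical stand-in, not the NPLinker class; plain lists replace StrainCollection, and the step results are injected through a dict so no files are read.

```python
class SketchLoader:
    """Hypothetical stand-in that mirrors the control flow of DatasetLoader.load()."""

    STEPS = ("strain_mappings", "metabolomics", "genomics")

    def __init__(self, step_results, strains, mibig_strains):
        self.step_results = step_results  # maps step name to its bool outcome
        self.strains = list(strains)
        self.mibig_strains_in_use = list(mibig_strains)
        self.calls = []  # records which steps actually ran

    def _run(self, name):
        self.calls.append(name)
        return self.step_results.get(name, True)

    def load(self):
        # Short-circuit: a failed step prevents all later steps from running.
        for step in self.STEPS:
            if not self._run(step):
                return False
        # Combine input strains with the MIBiG strains in use.
        self.strains = self.strains + self.mibig_strains_in_use
        if len(self.strains) == 0:
            raise Exception("Failed to find *ANY* strains.")
        return True


failing = SketchLoader({"metabolomics": False}, ["s1"], [])
print(failing.load(), failing.calls)
```

In this sketch a metabolomics failure means the genomics step is never attempted, which matches the ordering of the early returns in the listing above.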