Quickstart
NPLinker can be run in two modes:
The local mode assumes that the data required by NPLinker is available on your local machine.
The required input data includes:
- GNPS molecular networking data from one of the following GNPS workflows:
  - METABOLOMICS-SNETS
  - METABOLOMICS-SNETS-V2
  - FEATURE-BASED-MOLECULAR-NETWORKING
- AntiSMASH BGC data
- BigScape data (optional)
The podp mode assumes that you use an identifier from the Paired Omics Data Platform (PODP) as the input for NPLinker. NPLinker then downloads and prepares all necessary data based on the PODP id, which refers to the metadata of the dataset.
So, which mode will you use? The answer is important for the next steps.
1. Create a working directory
The working directory is used to store all input and output data for NPLinker. You can name this directory as you like, for example nplinker_quickstart.
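The directory can be created with any tool you like. As a minimal sketch in Python, using only the standard library (the helper name `create_working_dir` is illustrative and not part of NPLinker):

```python
from pathlib import Path


def create_working_dir(name: str = "nplinker_quickstart", parent: str = ".") -> Path:
    """Create the working directory if needed and return its absolute path."""
    root = Path(parent) / name
    root.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    return root.resolve()


# Example: create ./nplinker_quickstart and print its absolute path
# print(create_working_dir())
```

The returned absolute path is what the `root_dir` setting of the configuration file expects later on.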
Important
Before going to the next step, make sure you get familiar with how NPLinker organizes data in the working directory, see Working Directory Structure page.
2. Prepare input data (local mode only)
Details
Skip this step if you choose to use the podp mode.
If you choose to use the local mode, meaning the input data for NPLinker is stored on your local machine, you need to move the input data to the working directory created in the previous step.
GNPS data
NPLinker accepts data from the output of the following GNPS workflows:
- METABOLOMICS-SNETS
- METABOLOMICS-SNETS-V2
- FEATURE-BASED-MOLECULAR-NETWORKING
NPLinker provides the tools GNPSDownloader and GNPSExtractor to download and extract the GNPS data. All you need to provide is a valid GNPS task ID that refers to a task of one of the GNPS workflows supported by NPLinker.
GNPS task id and workflow
For example, for the GNPS task at https://gnps.ucsd.edu/ProteoSAFe/status.jsp?task=c22f44b14a3d450eb836d607cb9521bb, the task id is the last part of the URL, i.e. c22f44b14a3d450eb836d607cb9521bb. Open the link to find the workflow information in the "Workflow" row of the "Job Status" table; in this case it is METABOLOMICS-SNETS.
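If you handle many task URLs, the task id can also be read from the URL programmatically. The following is a small sketch using only the Python standard library; the function name `gnps_task_id_from_url` is illustrative and not part of NPLinker:

```python
from urllib.parse import parse_qs, urlparse


def gnps_task_id_from_url(url: str) -> str:
    """Return the value of the `task` query parameter of a GNPS status URL."""
    query = parse_qs(urlparse(url).query)
    values = query.get("task")
    if not values:
        raise ValueError(f"No 'task' parameter found in URL: {url}")
    return values[0]


url = "https://gnps.ucsd.edu/ProteoSAFe/status.jsp?task=c22f44b14a3d450eb836d607cb9521bb"
print(gnps_task_id_from_url(url))  # c22f44b14a3d450eb836d607cb9521bb
```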
import os

from nplinker.metabolomics.gnps import GNPSDownloader, GNPSExtractor

# Go to the working directory
os.chdir("nplinker_quickstart")

# Download GNPS data & get the path to the downloaded archive
downloader = GNPSDownloader("gnps_task_id", "downloads")  # (1)!
downloaded_archive = downloader.download().get_download_file()

# Extract GNPS data to `gnps` directory
extractor = GNPSExtractor(downloaded_archive, "gnps")  # (2)!
- If you already have the downloaded archive of GNPS data, you can skip the download steps.
- Replace downloaded_archive with the actual path to your GNPS data archive if you skipped the download steps.
The required data for NPLinker will be extracted to the gnps subdirectory of the working directory.
Info
Not all GNPS data are required by NPLinker, and only the necessary data will be extracted. During the extraction, these data will be renamed to the standard names used by NPLinker. See the page GNPS Data for more information.
Prepare GNPS data manually
If you have GNPS data that is not in the archive format downloaded from GNPS, it is recommended to re-download the data from GNPS.
If (re-)downloading is not possible, you can manually prepare the data for the gnps directory.
In this case, you must make sure that the data is organized as expected by NPLinker.
See the page GNPS Data for examples of how to prepare the data.
AntiSMASH data
NPLinker requires AntiSMASH BGC data as input, organized in the antismash subdirectory of the working directory.
For each AntiSMASH run, the BGC data must be stored in a subdirectory named after the NCBI accession number (e.g. GCF_000514975.1). Only the *.region*.gbk files are required by NPLinker.
When manually preparing AntiSMASH data for NPLinker, you must make sure that the data is organized as expected by NPLinker. See the page Working Directory Structure for more information.
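As an illustration of that layout, the sketch below copies the `*.region*.gbk` files of one AntiSMASH run into `antismash/<accession>/` inside the working directory. It assumes the region files sit at the top level of the AntiSMASH output directory; the helper name `collect_antismash_regions` and the example paths are hypothetical and not part of NPLinker.

```python
import shutil
from pathlib import Path


def collect_antismash_regions(antismash_output: str, root_dir: str, accession: str) -> list:
    """Copy the `*.region*.gbk` files of one AntiSMASH run into
    `<root_dir>/antismash/<accession>/` and return the copied paths."""
    target = Path(root_dir) / "antismash" / accession
    target.mkdir(parents=True, exist_ok=True)
    copied = []
    # Assumption: region files are at the top level of the output directory
    for gbk in sorted(Path(antismash_output).glob("*.region*.gbk")):
        copied.append(Path(shutil.copy2(gbk, target)))
    return copied


# Example with hypothetical paths:
# collect_antismash_regions("path/to/antismash_output", "nplinker_quickstart", "GCF_000514975.1")
```

Other files in the AntiSMASH output are left where they are, since only the region files are needed.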
BigScape data (optional)
It is optional to provide the output of BigScape to NPLinker. If the output of BigScape is not provided, NPLinker will run BigScape automatically to generate the data using the AntiSMASH BGC data.
If you have the output of BigScape, you can put its mix_clustering_c{cutoff}.tsv file in the bigscape subdirectory of the NPLinker working directory, where {cutoff} is the cutoff value used in the BigScape run.
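To check which clustering file is present and which cutoff it corresponds to, a sketch like the following can help. It relies only on the file naming pattern described above; the helper name `find_bigscape_clustering` is illustrative, and the cutoff is returned as text exactly as it appears in the file name.

```python
import re
from pathlib import Path

# File name pattern from the text: mix_clustering_c{cutoff}.tsv
_CLUSTERING_NAME = re.compile(r"^mix_clustering_c(?P<cutoff>.+)\.tsv$")


def find_bigscape_clustering(root_dir: str) -> list:
    """Return (path, cutoff) pairs for clustering files in `<root_dir>/bigscape`."""
    bigscape_dir = Path(root_dir) / "bigscape"
    if not bigscape_dir.is_dir():
        return []
    found = []
    for path in sorted(bigscape_dir.glob("mix_clustering_c*.tsv")):
        match = _CLUSTERING_NAME.match(path.name)
        if match:
            found.append((path, match.group("cutoff")))
    return found
```

An empty result means no BigScape output has been placed in the working directory yet.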
Strain mappings file
The strain mappings file strain_mappings.json is required by NPLinker to map the strain to genomics and metabolomics data.
{
    "strain_mappings": [
        {
            "strain_id": "strain_id_1", # (1)!
            "strain_alias": ["bgc_id_1", "spectrum_id_1", ...] # (2)!
        },
        {
            "strain_id": "strain_id_2",
            "strain_alias": ["bgc_id_2", "spectrum_id_2", ...]
        },
        ...
    ],
    "version": "1.0" # (3)!
}
- strain_id is the unique identifier of the strain.
- strain_alias is a list of aliases of the strain, which are the identifiers of the BGCs and spectra of the strain.
- version is the schema version of this file. It is recommended to use the latest version of the schema. The current latest version is 1.0.
The BGC id is the same as the name of the BGC file in the antismash directory without the .gbk extension. For example, given a BGC file xxxx.region001.gbk, the BGC id is xxxx.region001.
The spectrum id is the same as the scan number in the spectra.mgf file in the gnps directory. For example, given a spectrum in the mgf file with the scan SCANS=1, the spectrum id is 1.
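The two naming rules can be expressed in a few lines of Python. This is a sketch under the rules stated above, not NPLinker's own parsing code; the function names are illustrative, and the MGF handling only looks for `SCANS=` lines.

```python
from pathlib import Path


def bgc_id_from_file(gbk_file: str) -> str:
    """Return the BGC id: the file name without the `.gbk` extension."""
    return Path(gbk_file).stem


def spectrum_ids_from_mgf(mgf_text: str) -> list:
    """Return the SCANS value of every spectrum found in MGF-formatted text."""
    ids = []
    for line in mgf_text.splitlines():
        line = line.strip()
        if line.upper().startswith("SCANS="):
            ids.append(line.split("=", 1)[1].strip())
    return ids


print(bgc_id_from_file("antismash/GCF_000514975.1/xxxx.region001.gbk"))  # xxxx.region001
print(spectrum_ids_from_mgf("BEGIN IONS\nSCANS=1\nEND IONS\n"))  # ['1']
```

For a real dataset, the text passed to `spectrum_ids_from_mgf` would be the content of the spectra.mgf file in the gnps directory.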
If you labelled the mzXML files (input for GNPS) with the strain id, you may need the function extract_mappings_ms_filename_spectrum_id to extract the mappings from mzXML files to the spectrum ids.
For the local mode, you need to create this file manually and put it in the working directory.
It takes some effort to prepare this file manually, especially when you have a large number of strains.
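If the strain-to-alias relationships are already available in some tabular or dictionary form, the file can be generated rather than typed by hand. The sketch below writes the layout shown above with the standard `json` module; the function name `write_strain_mappings` and the input dictionary shape are assumptions made for illustration. Note that the `# (n)!` markers in the listing are documentation annotations and must not appear in a real JSON file, since JSON does not allow comments.

```python
import json
from pathlib import Path


def write_strain_mappings(mappings: dict, output_file: str, version: str = "1.0") -> dict:
    """Write a strain mappings document.

    `mappings` maps each strain id to a list of its aliases
    (BGC ids and spectrum ids). Duplicate aliases are removed.
    """
    document = {
        "strain_mappings": [
            {"strain_id": strain_id, "strain_alias": sorted(set(aliases))}
            for strain_id, aliases in mappings.items()
        ],
        "version": version,
    }
    Path(output_file).write_text(json.dumps(document, indent=2), encoding="utf-8")
    return document


# Example with placeholder ids; pass the path of the strain mappings file
# inside your working directory as `output_file`.
# write_strain_mappings({"strain_id_1": ["bgc_id_1", "spectrum_id_1"]}, "path/to/output.json")
```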
3. Prepare config file
The configuration file nplinker.toml is required by NPLinker to specify the working directory, mode, and other settings for the run of NPLinker. You can put the nplinker.toml file anywhere, but it is recommended to put it in the working directory created in step 1.
The details of all settings can be found on the Config File page.
To keep things simple, NPLinker automatically uses default values for any settings you do not set in your nplinker.toml config file.
All you need to do is set root_dir and mode in the nplinker.toml file.
For the local mode:
root_dir = "absolute/path/to/working/directory" # (1)!
mode = "local"
# and other settings you want to override the default settings
- Replace absolute/path/to/working/directory with the absolute path to the working directory created in step 1.
For the podp mode:
root_dir = "absolute/path/to/working/directory" # (1)!
mode = "podp"
podp_id = "podp_id" # (2)!
# and other settings you want to override the default settings
- Replace absolute/path/to/working/directory with the absolute path to the working directory created in step 1.
- Replace podp_id with the identifier of the dataset in the Paired Omics Data Platform (PODP).
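Because root_dir must be an absolute path, it can be convenient to generate the minimal configuration from Python. The following sketch produces only the settings named on this page (root_dir, mode, podp_id); the function name `render_config` is illustrative, and `json.dumps` is used here simply as a way to quote strings, which for ordinary paths and ids also yields valid TOML strings.

```python
import json
from pathlib import Path


def render_config(root_dir: str, mode: str = "local", podp_id: str = None) -> str:
    """Return the text of a minimal nplinker.toml for the given mode."""
    if mode not in ("local", "podp"):
        raise ValueError("mode must be 'local' or 'podp'")
    if mode == "podp" and not podp_id:
        raise ValueError("podp_id is required in podp mode")
    # Resolve to an absolute path and use forward slashes on every platform
    root = Path(root_dir).resolve().as_posix()
    lines = [
        f"root_dir = {json.dumps(root, ensure_ascii=False)}",
        f"mode = {json.dumps(mode)}",
    ]
    if mode == "podp":
        lines.append(f"podp_id = {json.dumps(podp_id, ensure_ascii=False)}")
    return "\n".join(lines) + "\n"


# Example: write the config file into the working directory
# Path("nplinker_quickstart/nplinker.toml").write_text(render_config("nplinker_quickstart"))
```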
4. Run NPLinker
Before running NPLinker, make sure your working directory has the correct directory structure and names as described in the Working Directory Structure page.
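For the local mode, a quick pre-flight check can catch missing inputs early. The sketch below verifies only the items named on this page (gnps/spectra.mgf, the antismash region files, and the strain mappings file); the Working Directory Structure page remains the authoritative description. The function name `check_local_inputs` is illustrative, and the name of the strain mappings file is passed in as an argument.

```python
from pathlib import Path


def check_local_inputs(root_dir: str, strain_mappings_file: str) -> list:
    """Return a list of problems found in a local-mode working directory."""
    root = Path(root_dir)
    problems = []
    if not (root / "gnps" / "spectra.mgf").is_file():
        problems.append("gnps/spectra.mgf is missing")
    antismash_dir = root / "antismash"
    has_regions = antismash_dir.is_dir() and any(antismash_dir.glob("*/*.region*.gbk"))
    if not has_regions:
        problems.append("no antismash/<accession>/*.region*.gbk files found")
    if not (root / strain_mappings_file).is_file():
        problems.append(f"{strain_mappings_file} is missing")
    return problems
```

An empty list means the checked items are in place; it does not guarantee that the file contents are valid.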
from nplinker import NPLinker
# create an instance of NPLinker
npl = NPLinker("nplinker.toml") # (1)!
# load data
npl.load_data()
# check loaded data
print(npl.bgcs)
print(npl.gcfs)
print(npl.spectra)
print(npl.mfs)
print(npl.strains)
# compute the links for the first 3 GCFs using metcalf scoring method
link_graph = npl.get_links(npl.gcfs[:3], "metcalf") # (2)!
# get links as a list of tuples
link_graph.links
# get the link data between two objects or entities
link_graph.get_link_data(npl.gcfs[0], npl.spectra[0])
# Save data to a pickle file
npl.save_data("npl.pkl", link_graph)
- Replace nplinker.toml with the actual path to your configuration file.
- The get_links method returns a LinkGraph object that represents the calculated links between the GCFs and other entities as a graph.
For more info about the classes and methods, see the API Documentation.