Working Directory Structure
NPLinker requires a working directory with a fixed structure and fixed names for the input and output data.
root_dir # (1)!
│
├── nplinker.toml # (2)!
├── strain_mappings.json [F] # (3)!
├── strains_selected.json [F][O] # (4)!
│
├── gnps [F] # (5)!
│ ├── spectra.mgf [F]
│ ├── molecular_families.tsv [F]
│ ├── annotations.tsv [F]
│ └── file_mappings.tsv (.csv) [F] # (6)!
│
├── antismash [F] # (7)!
│ ├── GCF_000514975.1 # (8)!
│ │ ├── xxx.region001.gbk
│ │ └── ...
│ ├── GCF_000016425.1
│ │ ├── xxxx.region001.gbk
│ │ └── ...
│ └── ...
│
├── bigscape [F][O] # (9)!
│ ├── mix_clustering_c0.30.tsv [O] # (10)!
│ ├── data_sqlite.db [O] # (11)!
│ └── bigscape_running_output [A] # (12)!
│     └── ...
│
├── downloads [F][A] # (13)!
│ ├── GCF_000016425.1.zip
│ ├── GCF_000514975.1.zip
│ ├── c22f44b14a3d450eb836d607cb9521bb.zip
│ ├── genome_status.json
│ ├── mibig_json_3.1.tar.gz
│ └── ...
│
├── mibig [F][A] # (14)!
│ ├── BGC0000001.json
│ ├── BGC0000002.json
│ └── ...
│
├── output [F][A] # (15)!
│ └── ...
│
└── ... # (16)!
1. root_dir is the working directory you created, used as the root directory for NPLinker.
2. nplinker.toml is the configuration file (toml format) provided by the user for running NPLinker.
3. strain_mappings.json contains the mappings from strain to genomics and metabolomics data. It is generated by NPLinker for podp mode; for local mode, users need to create it manually.
   [F] means the file name strain_mappings.json is fixed (including the extension) and must be named as required.
4. strains_selected.json is an optional file containing the list of strains to be used in the analysis. If it is not provided, NPLinker will use all strains detected from the input data.
   [O] means it is optional for users to provide the file strains_selected.json.
5. The gnps directory contains the GNPS data. The files in this directory must be named as shown. See gnps data for more information.
6. This file can be in .tsv or .csv format.
7. The antismash directory contains a collection of AntiSMASH BGC data. The BGC data (*.region*.gbk files) must be stored in subdirectories named after the NCBI accession number (e.g. GCF_000514975.1).
8. The name GCF_000514975.1 has nothing to do with a BigScape GCF; it is just the NCBI accession number of the genome.
9. This directory contains the output of BigScape. If the directory is not provided, NPLinker will run BigScape automatically to generate it using the AntiSMASH BGC data. If you provide the BigScape output, you just need to provide output from v1 or v2, not both.
10. mix_clustering_c0.30.tsv is an example output of BigScape v1. The file name must follow the pattern mix_clustering_c{cutoff}.tsv, where {cutoff} is the cutoff value used in the BigScape run.
11. data_sqlite.db is the output of BigScape v2.
12. The bigscape_running_output directory is automatically created and managed by NPLinker. It stores the output data of BigScape. Users should not interfere with this directory and its content.
    [A] means the directory is automatically created and/or managed by NPLinker.
13. The downloads directory is automatically created and managed by NPLinker. It stores the downloaded data from the internet. Users can also use it to store their own downloaded data.
14. The mibig directory contains the MIBiG metadata, which is automatically created and downloaded by NPLinker. Users should not interfere with this directory and its content.
15. The output directory is automatically created by NPLinker. It stores the output data of NPLinker.
16. It's flexible to extend NPLinker by adding other types of data.
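
The naming rules for the antismash and bigscape directories can be checked with a few lines of Python before running NPLinker. The following is a minimal sketch and not part of NPLinker: the helper names parse_bigscape_cutoff and list_region_files are hypothetical, and the regular expression assumes the cutoff is written as a plain decimal number such as 0.30.

```python
from __future__ import annotations

import re
from pathlib import Path

# Assumption: {cutoff} is a plain decimal number, e.g. "0.30".
_CUTOFF_RE = re.compile(r"^mix_clustering_c(\d+(?:\.\d+)?)\.tsv$")


def parse_bigscape_cutoff(filename: str) -> str | None:
    """Return the cutoff text from a BigScape v1 file name, or None if it does not match."""
    match = _CUTOFF_RE.match(filename)
    return match.group(1) if match else None


def list_region_files(antismash_dir: Path) -> dict[str, list[str]]:
    """Map each genome subdirectory name to its sorted *.region*.gbk file names."""
    result: dict[str, list[str]] = {}
    for subdir in sorted(p for p in antismash_dir.iterdir() if p.is_dir()):
        result[subdir.name] = sorted(f.name for f in subdir.glob("*.region*.gbk"))
    return result
```

A subdirectory that maps to an empty list contains no region files and is worth inspecting before the run.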
Tip
- [F] means the file or directory name is fixed and must be named as shown. The names are defined in the defaults module.
- [O] means the file or directory is optional for users to provide. It does not mean the file or directory is optional for NPLinker to use. If it's not provided by the user, NPLinker may generate it.
- [A] means the directory is automatically created and/or managed by NPLinker.
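
As an illustration of how the markers translate into a pre-flight check, the sketch below reports which user-provided entries are missing from a working directory. It is not NPLinker code: the function name missing_entries is hypothetical, and treating every entry that carries neither [O] nor [A] as required is an assumption that fits local mode (in podp mode NPLinker generates strain_mappings.json itself).

```python
from __future__ import annotations

from pathlib import Path

# Entries shown in the tree without [O] or [A]; treating them as required is an assumption.
REQUIRED_DIRS = ["gnps", "antismash"]
REQUIRED_FILES = [
    "nplinker.toml",
    "strain_mappings.json",
    "gnps/spectra.mgf",
    "gnps/molecular_families.tsv",
    "gnps/annotations.tsv",
]


def missing_entries(root_dir: Path) -> list[str]:
    """Return the relative paths of expected entries that are absent under root_dir."""
    missing = [d for d in REQUIRED_DIRS if not (root_dir / d).is_dir()]
    missing += [f for f in REQUIRED_FILES if not (root_dir / f).is_file()]
    # file_mappings may be provided as either .tsv or .csv.
    candidates = [root_dir / "gnps" / f"file_mappings{ext}" for ext in (".tsv", ".csv")]
    if not any(c.is_file() for c in candidates):
        missing.append("gnps/file_mappings.tsv (.csv)")
    return missing
```

An empty return value means every entry assumed to be required is present; the optional and automatically managed entries are deliberately not checked.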