This page explains how to make input files, what the MAGI job parameters are and how they affect results, how to access your jobs once they've run, and what the various MAGI scores mean and how they are calculated. It also briefly describes the output files. If you want more detail on how to analyze MAGI results, please go through the Tutorial.
Submitting a Job
Input Files
MAGI Scores, Explained
Job Parameters
Accessing Job Results
MAGI Results, Explained
What is an InChiKey?
Frequently Asked Questions
To submit a job, all you need to do is provide two input files and your email address. The two files are a FASTA file consisting of the genes/proteins present in your biological sample, and a metabolites file consisting of the metabolites observed in your sample. Your email address will be used to inform you of your job's status (submitted, queued, running, completed, etc.).
Sample files that can be run on MAGI can be found here, which you can use as templates to make your own files. Here are some more details...
This file should be in the standard DNA, RNA, or amino acid FASTA format, where each sequence begins with a header line that starts with > and contains no additional > symbols. The header line ends with a newline, and then the nucleic acid or amino acid sequence begins. The sequence can contain newlines, but no blank spaces. Only standard nucleotides (A,T,U,C,G) and standard amino acids (A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y,*,-) are accepted in the FASTA file.
>sequence_1 my description of sequence 1
SCIENCEISGREAT
QIKDLLVSSSTDLDTTLVLVNAIYFKGMWK
TAFNAEDTREMPFHVTKQESKPVQMMCMNNSFNVATLPAE
KMKILELPFASGDLSMLVLLPDEVSDLERIEKTINFEKLT
EWTNPNTMEKRRVKVYLPQMKIEEKYNLTS
VLMALGMTDLFIPSANLTGISSAESLKISQAVHGAFMELSEDGIEMAGSTGVIEDIKHSPESEQFRADHP
FLFLIKHNPTNTIVYFGRYWSP
>sequence_2 annotation of sequence 2
ILIKEMAGIBCITISEASYYMMV
MGVDPFQVAVGVSNRHIHLSRTD
MDTLFGPGAELQRKKAMKQPGQF
AAEETVTLKGPKGSLSKVRVLGP
LRRETQVEVSVADGFALGITPPL
RQSGQLDDTPGLTIIGPQGSVTK
DHGVIVAQRHIHMHPSTAAKLGL
RNGDEVDVEAGGERGGVMHRVLI
RVAEASADEMHIDVEEANALCLK
NDDVVRICKKLE
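The formatting rules above can be checked before uploading. The following is a minimal sketch, not MAGI's own validator: the function name `check_fasta` is hypothetical, and ignoring empty lines and letter case are assumptions that the rules above do not address.

```python
# Characters the text above lists as accepted: nucleotides plus amino acids.
ALLOWED = set("ATUCG") | set("ACDEFGHIKLMNPQRSTVWY*-")


def check_fasta(text):
    """Return a list of human-readable problems found in FASTA text."""
    problems = []
    seen_header = False
    for number, line in enumerate(text.splitlines(), start=1):
        if line.startswith(">"):
            seen_header = True
            if ">" in line[1:]:
                problems.append(f"line {number}: extra '>' in header")
        elif not line:
            continue  # assumption: empty lines are ignored
        elif not seen_header:
            problems.append(f"line {number}: sequence before any header")
        else:
            # Assumption: letter case is ignored when checking characters.
            bad = sorted(set(line.upper()) - ALLOWED)
            if bad:
                problems.append(f"line {number}: unexpected characters {bad}")
    return problems


good = ">sequence_1 my description\nSCIENCEISGREAT\nQIKDLLVSSSTDLD\n"
bad = ">sequence_2 bad > header\nMGVDP FQVAV\n"
print(check_fasta(good))
print(check_fasta(bad))
```

An empty list means no problems were found; the second input is reported for both the extra > in its header and the blank space in its sequence.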
The metabolite file describes the metabolites you wish to connect with the proteins provided in the protein FASTA file. This file can be in any standard table format with the appropriate file extension (CSV: .csv; tab-delimited: .tab or .tsv; Excel: .xlsx or .xls), with column names in the first line of the file. The only required column name is original_compound, which describes the metabolites observed in an experiment. You may enter m/z or neutral mass values in this column, or standard InChI Keys. If you enter m/z or neutral masses, you must select the checkbox stating that your metabolite input file contains these values, and enter your desired accurate mass search parameters.
An optional column name is compound_score, which allows you to weight or otherwise score a compound. This is primarily useful when one is trying to determine the appropriate compound identity for a mass spectrometry feature (a high intensity signal with unique m/z and retention time). Other than those two column names, users are free to add any additional metadata columns describing the compound structure; all columns will be passed through to the final results.
The following example metabolite input files are all shown in tab-delimited format.
Basic file with InChI Keys:

original_compound
ISFCPXILUVJVOC-KYGJEJSHSA-N
VTIKDEXOEJDMJP-WYUUTHIRSA-N
VEDWXCWBMDQNCV-SCFUHWHPSA-N

Basic file with m/z values:

original_compound
123.4567
111.1234
567.4321

Input file with compound_score values and extra metadata. Here, three candidate compounds are listed for a single feature (m/z 123.4567 at retention time 8.90 minutes), with varying probability represented by the compound_score column:

feature	original_compound	compound_score	compound_name
123.4567@8.90	ISFCPXILUVJVOC-KYGJEJSHSA-N	1.00	compound_A
123.4567@8.90	VTIKDEXOEJDMJP-WYUUTHIRSA-N	0.49	compound_B
123.4567@8.90	VEDWXCWBMDQNCV-SCFUHWHPSA-N	0.99	compound_C
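A metabolite table can be sanity-checked before upload. This is a rough sketch using only the Python standard library; the function name `check_metabolites` and its `masses` flag are hypothetical, and the pattern reflects the standard InChI Key layout (blocks of 14, 10, and 1 uppercase letters) rather than any check MAGI is known to perform.

```python
import csv
import io
import re

# Standard InChI Key layout: 14 letters, 10 letters, 1 letter, hyphen-separated.
INCHIKEY = re.compile(r"^[A-Z]{14}-[A-Z]{10}-[A-Z]$")


def check_metabolites(text, delimiter="\t", masses=False):
    """Return a list of problems found in a delimited metabolite table."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    if "original_compound" not in (reader.fieldnames or []):
        return ["missing required column: original_compound"]
    problems = []
    for line_number, row in enumerate(reader, start=2):  # header is line 1
        value = (row["original_compound"] or "").strip()
        if masses:
            # m/z or neutral mass input: every value must parse as a number.
            try:
                float(value)
            except ValueError:
                problems.append(f"line {line_number}: not a number: {value!r}")
        elif not INCHIKEY.match(value):
            problems.append(f"line {line_number}: not an InChI Key: {value!r}")
    return problems


keys = ("feature\toriginal_compound\tcompound_score\n"
        "123.4567@8.90\tISFCPXILUVJVOC-KYGJEJSHSA-N\t1.00\n"
        "123.4567@8.90\tnot-a-key\t0.49\n")
mz_values = "original_compound\n123.4567\n111.1234\n567.4321\n"
print(check_metabolites(keys))
print(check_metabolites(mz_values, masses=True))
```

For a .csv file, pass `delimiter=","`; Excel files would need a separate reader and are outside the scope of this sketch.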
Currently, you are required to find InChI Keys yourself. The Fiehn Lab provides an easy to use Chemical Translation Service. This service allows batch conversion from almost any chemical ID to an InChI Key. Alternatively, you can search for compounds by name on PubChem; the InChI Key is in the top set of data describing each compound.
If you are comfortable doing a bit of programming and you already have a structural identifier for your compounds (e.g. SMILES, SMARTS, or InChI string), you can batch-convert them yourself. For fast batch conversion of these datatypes, our preferred method is to use RDKit in a Python environment. You can follow the installation instructions here (the first set of instructions describing the (ana)conda installation is very easy!). Once you have installed RDKit, you can quickly convert from SMILES, SMARTS, or InChI to InChI Key:
from rdkit import Chem

# compound_list = a Python list-like that has your chemical structure datatypes
# Use ONE of the two options below, depending on your datatype.

# Option 1: your datatype is an InChI string
inchikeys = []
for cpd in compound_list:
    ikey = Chem.InchiToInchiKey(cpd)
    inchikeys.append(ikey)
# Done

# Option 2: your datatype is SMILES or SMARTS
inchikeys = []
for cpd in compound_list:
    # Convert to an RDKit Mol object
    m = Chem.MolFromSmiles(cpd)    # SMILES
    # m = Chem.MolFromSmarts(cpd)  # SMARTS (use instead of the line above)
    # Convert the Mol object to an InChI string, then to an InChI Key
    i = Chem.MolToInchi(m)
    ikey = Chem.InchiToInchiKey(i)
    inchikeys.append(ikey)
# Done
MAGI is not a replacement for using your brain! The purpose of MAGI is simply to connect input metabolites with input sequences, and calculate a score representing the strength or probability of that association. While the best scoring association is usually correct, it is unwise to blindly trust an algorithm. This section explains in detail all of the scores MAGI uses and what they mean. Then we go through some examples of how one should interpret these results. For any of your questions that went unanswered, please Contact Us to help us improve this guide.
MAGI_score
The MAGI score is an aggregation of all the scores discussed below. Specifically, it is the geometric mean of the homology score, reciprocal score, reaction connection score, and compound score. The geometric mean is used here as a way to "normalize" the various scores respective to each other. The geometric mean is then adjusted by the level searched in the chemical similarity network.
pre_score = geometric_mean([
    homology_score,
    reciprocal_score,
    reaction_connection,
    compound_score
])
level_penalty = 4 ** level
MAGI_score = pre_score / level_penalty
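The formula above can be written as a small runnable function. This is a sketch of the stated formula, not MAGI's source code; the function name `magi_score` and the `penalty_base` argument are illustrative. The input values are taken from row #3 of the magi_results.csv example later on this page.

```python
import math


def magi_score(homology_score, reciprocal_score, reaction_connection,
               compound_score, level, penalty_base=4):
    """Geometric mean of the four component scores, divided by base ** level."""
    scores = [homology_score, reciprocal_score, reaction_connection, compound_score]
    pre_score = math.prod(scores) ** (1 / len(scores))
    return pre_score / penalty_base ** level


# Component scores from row #3 of the example results table (level 0).
direct = magi_score(400.0, 2.0, 2.01, 1.28513, level=0)
# The same components, had they been found one step into the network.
one_step = magi_score(400.0, 2.0, 2.01, 1.28513, level=1)
print(direct, one_step)
```

With these inputs the level 0 result agrees with the MAGI_score listed for that row (6.74231) to within rounding, and each additional network level divides the score by 4.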
compound_score
The compound score is not something calculated by MAGI; it is defined in the metabolites file. If a compound_score column is not found in the metabolites input file, then one is created and populated with "1.0", giving all metabolites an equal weight.
level and neighbor
This is actually not "calculated" by MAGI, but it is used in the final scoring function and contains valuable information for interpretation. The number here represents how "deep" the chemical similarity network was searched in order to find a reaction that an input metabolite participates in:
0: the chemical similarity network was not used, and the neighbor column is empty.
1: the compound in the neighbor column is the immediate neighbor to the queried metabolite in the chemical network. This neighbor compound was found in the reaction listed, not the input metabolite.
2 or higher: the compound in the neighbor column is the neighbor's (neighbor's, etc.) neighbor in the chemical network. The higher this number goes, the less similar the neighbors will be to your original input metabolite.
e_score_r2g and e_score_g2r
An E-score is a log-transformed E-value from a BLAST result. Specifically, the formula is:
if e_value > 0:
    e_score = -log10(e_value)
else:
    # The lowest non-zero e-value is 1e-180,
    # so a perfect 0 gets a slight boost
    e_score = 200.
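As a runnable version of the formula above, here is a minimal sketch; the function name `e_score` is chosen for illustration only.

```python
import math


def e_score(e_value):
    """Convert a BLAST E-value to an E-score as described above."""
    if e_value > 0:
        return -math.log10(e_value)
    # An E-value of exactly 0 is assigned the fixed score of 200.
    return 200.0


for value in (1e-100, 1e-5, 0.0):
    print(value, e_score(value))
```

Smaller E-values give larger E-scores, so an E-value of 1e-100 corresponds to an E-score of about 100.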
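There are two homology searches conducted in MAGI: a reaction-to-gene (r2g) search, in which reaction reference sequences are used as queries against your input sequences, and a gene-to-reaction (g2r) search, in which your input sequences are used as queries against the reaction reference sequences.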
homology_score
The MAGI homology score is a number that represents the reciprocal homology of an integrated metabolite-reaction-gene element. To calculate this score, the reaction-to-gene and gene-to-reaction results are joined on the shared gene. Then, the E-scores are combined using the following formula:
homology_score = r2g_e_score + g2r_e_score - |r2g_e_score - g2r_e_score|
This formula results in a high score if both scores are high, a medium score if both scores are medium, and a low score if both scores are low or one score is high and the other score is low. Overall, this one score can be used to judge the reciprocal BLAST results as "good", "medium", or "bad".
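The homology score formula can be sketched in a few lines. The function name is illustrative; the identity noted in the comment (the result equals twice the smaller E-score) follows directly from the formula. The first pair of E-scores is taken from row #1 of the example results tables later on this page.

```python
def homology_score(r2g_e_score, g2r_e_score):
    """Combine the two E-scores; algebraically this is 2 * min(r2g, g2r)."""
    return r2g_e_score + g2r_e_score - abs(r2g_e_score - g2r_e_score)


print(homology_score(60.0585, 105.425))  # one high score, one lower score
print(homology_score(200.0, 200.0))      # two "perfect" E-scores
```

Because the lower of the two E-scores determines the result, a single strong hit cannot compensate for a weak hit in the other direction.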
reciprocal_score
The reciprocal score is a direct representation of whether or not the reaction-to-gene and gene-to-reaction homology searches converged on the same reaction or not. Traditionally, reciprocal agreement is judged by using the top BLAST result. However, as MAGI is meant to allow flexibility to homology searches, the top set of BLAST results are taken when matching up reaction to gene and gene to reaction BLAST searches (the stringency of this set is judged by the BLAST cutoff parameter).
reaction_connection
This score simply logs whether the metabolite was found in no reactions (0.1), either in the reaction-to-gene or the gene-to-reaction reaction (1.1) or in both reactions (2.1). It essentially acts as a tiebreaker.
2
This parameter tells MAGI how deep into the chemical network to look when trying to match a compound to a reaction. The higher this number, the less similar the matched compounds will be! Currently, keeping this parameter at 2 or less is best; as we improve our chemical similarity network, we will increase the default. Setting this value to 0 means that the chemical network will not be searched; we recommend you only do this if you are 100% confident that the reactions you are interested in have a reference sequence, and 100% confident that the metabolites you are interested in are represented in reactions that have a reference sequence.
These parameters can have drastic effects on the results and we don't recommend you change them unless you really know what you're doing!
85%
This parameter adjusts how many top BLAST results will be kept for each gene or reaction. It sets the cutoff, relative to the top-scoring result, that decides which BLAST results are retained. When MAGI conducts a BLAST search, it retains all results with an E-value smaller than 1. However, we usually only care about the highest scoring results. This parameter allows you to adjust the tolerance for what a "high scoring" result means, as a percent of the top "E-score" (see above).
For example, if the top scoring BLAST result has an E-value of 1E-100, this corresponds to an "E-score" (see above) of 100. If the BLAST cutoff parameter is 85%, then all BLAST results with an E-score greater than or equal to 85 will be retained. Increasing the BLAST cutoff results in fewer BLAST results. Setting this parameter to 100% means that only the top scoring BLAST results will be retained for further MAGI analysis.
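The cutoff described above amounts to a simple filter on E-scores. The sketch below is an illustration under that reading, not MAGI's implementation; the function name `keep_top_hits` and the example hit list are made up.

```python
def keep_top_hits(e_scores, blast_cutoff=0.85):
    """Keep E-scores that are at least blast_cutoff times the best E-score."""
    if not e_scores:
        return []
    threshold = blast_cutoff * max(e_scores)
    return [score for score in e_scores if score >= threshold]


hits = [100.0, 92.0, 86.0, 84.0, 40.0]   # hypothetical E-scores for one query
print(keep_top_hits(hits))                # default cutoff of 85%
print(keep_top_hits(hits, blast_cutoff=1.0))
```

Raising the cutoff shortens the list, and a cutoff of 100% leaves only results tied with the top score.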
75%
This represents how close in value a reciprocal BLAST result needs to be in order to call it "close." When MAGI conducts a reciprocal BLAST with respect to a gene sequence, it compares the E-scores (see above) of the two searches if they did not converge on the same reaction. If the lower E-score is within N% of the higher E-score, they are deemed "close" and are less penalized.
For example, when using a reaction (let's call it Reaction A) reference sequence as a query, the top-scoring gene has an E-score of 90. However, when using that gene as a query, the top-scoring reaction is different (Reaction B), with an E-score of 100. The Reaction A "association" is 90% of the Reaction B association. While the reciprocal BLAST matches disagree, this would constitute a "close" reciprocal agreement if the Reciprocal closeness cutoff is 90% or lower. Setting this parameter to 100% means that nothing will be classified as a "close" agreement.
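The Reaction A and Reaction B example can be expressed as a ratio test. This is a sketch of the comparison as described, with a hypothetical function name; how MAGI treats exactly equal E-scores at a 100% cutoff is not covered here.

```python
def is_close(e_score_a, e_score_b, closeness_cutoff=0.75):
    """True if the lower E-score is at least closeness_cutoff of the higher one."""
    low, high = sorted((e_score_a, e_score_b))
    if high <= 0:
        return False
    return low / high >= closeness_cutoff


# Reaction A has an E-score of 90 and Reaction B has an E-score of 100.
for cutoff in (0.75, 0.90, 0.95):
    print(cutoff, is_close(90.0, 100.0, cutoff))
```

The pair counts as "close" at cutoffs of 75% and 90%, but not at 95%.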
One important consideration when adjusting this parameter is that the E-value of a BLAST search is dependent on the database size. Because different databases are being used, the E-values will be different even when there is reciprocal agreement.
4
This parameter adjusts the penalty levied on a result if a compound was connected to a gene via the chemical similarity network. The higher the number, the heavier the penalty. This parameter is the base number that is raised to the exponent of the network level searched.
[1, 1, 1, 1]
These parameters allow you to skew the weights of the individual components that make up the consensus MAGI score. The higher the number for an individual score component, the stronger that component will be considered when calculating the final MAGI consensus score.
A value of zero in any of the fields indicates that that particular score component will not be considered in the final score.
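This page does not state exactly how the weights enter the calculation. One plausible reading, shown here purely as an assumption, is a weighted geometric mean, in which a weight of zero removes a component and equal weights reproduce the plain geometric mean. The function name is hypothetical.

```python
import math


def weighted_geometric_mean(scores, weights):
    """Weighted geometric mean; an ASSUMED form, not confirmed MAGI behavior."""
    total = sum(weights)
    if total == 0:
        raise ValueError("at least one weight must be non-zero")
    # Components with a weight of zero are skipped entirely.
    log_sum = sum(w * math.log(s) for s, w in zip(scores, weights) if w)
    return math.exp(log_sum / total)


# Order: homology, reciprocal, reaction connection, compound score.
scores = [400.0, 2.0, 2.01, 1.28513]
print(weighted_geometric_mean(scores, [1, 1, 1, 1]))  # all components equal
print(weighted_geometric_mean(scores, [2, 1, 1, 1]))  # emphasize homology
print(weighted_geometric_mean(scores, [1, 0, 0, 0]))  # homology only
```

Under this assumed form, doubling the homology weight pulls the result toward the homology score, while zeroing the other three weights returns the homology score itself.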
When you first go to the Jobs page, a unique Job ID is created for you. This Job ID is your unique key for returning to your job page to view job status and job results: simply enter your Job ID into the MAGI URL: https://magi.nersc.gov/jobs/?id=your_job_id. When you successfully upload and submit a MAGI job, your Job ID and Job URL will be emailed to the email address you specified on the Job form for easy access.
Remember, anyone who has your unique Job ID can navigate to your job's page, so if you want to keep your job a secret, treat your Job ID like a password.
Oh no! You've accidentally deleted your Job Submission email, and you didn't bookmark your job page. Or maybe you just want to see a summary of all the jobs you've ever submitted to MAGI. On the Jobs page, you can click the "email me my jobs" button, which will prompt you to enter your email address. All Job IDs associated with that email address will be collected into a summary page and emailed to that address. For a Job ID to be included in this report, you must have successfully uploaded files and submitted the job (unsubmitted job IDs will not be included).
This section is a detailed description of the contents of the three MAGI output files, but only a superficial explanation of how this information can be interpreted. For help on how to interpret these data, please go through the tutorial.
magi_results.csv file
This table has all of the MAGI connections between metabolites and genes. There are 3 "groups" of columns in this table with varying information.
# | MAGI_score | Gene ID | original_compound | neighbor | note | ... |
---|---|---|---|---|---|---|
1 | 0.250381 | 637265280 | GUBGYTABKSRVRQ-LVIVMJSQSA-N | WQZGKKKJIJFFOK-GASJEMHNSA-N | direct | ... |
2 | 0.63712 | | VTIKDEXOEJDMJP-WYUUTHIRSA-N | | direct | ... |
3 | 6.74231 | 637271525 | CDAISMWEOUEBRE-SHFUYGGZSA-N | | flat tautomer | ... |
4 | | 637276554 | | | | ... |
Row #1 shows the gene 637265280 (Gene ID column) is associated with the compound GUBGYTABKSRVRQ-LVIVMJSQSA-N (original_compound column) via the similar compound WQZGKKKJIJFFOK-GASJEMHNSA-N (in the neighbor column). The most important concept to understand here is that the neighbor compound is the one that is in the reaction that mapped to the gene via homology. The note column tells us that the neighbor compound was directly associated in a reaction, not one of its flattened tautomers. This is important when considering stereochemistry-specific reactions, for example L-Glucose versus D-Glucose.
Row #2 shows that the compound VTIKDEXOEJDMJP-WYUUTHIRSA-N did not connect to any gene (the Gene ID column is empty).
Row #3 shows that the gene 637271525 is associated with the compound CDAISMWEOUEBRE-SHFUYGGZSA-N. Because the neighbor column is empty, this means that CDAISMWEOUEBRE-SHFUYGGZSA-N was directly associated with the gene; the chemical similarity network was not searched. However, the note column indicates that a non stereospecific tautomer of CDAISMWEOUEBRE-SHFUYGGZSA-N was associated with a reaction that matched to gene 637271525 via homology. In this case, care must be taken to ensure that the reaction can accommodate CDAISMWEOUEBRE-SHFUYGGZSA-N (most often this turns out to be okay).
Row #4 indicates that the gene 637276554 did not connect to any compounds in the input column list (the original_compound column is empty).
These scores are useful if you want to dig in to see what the individual component scores were for calculating the final MAGI consensus score.
# | ... | compound_score | level | homology_score | reciprocal_score | reaction_connection | ... |
---|---|---|---|---|---|---|---|
1 | ... | 0.833445 | 1 | 120.116977 | 0.01 | 2.01 | ... |
2 | ... | 1.63141 | 0 | 1.000000 | 0.10 | 1.01 | ... |
3 | ... | 1.28513 | 0 | 400.000000 | 2.00 | 2.01 | ... |
4 | ... | | | 1.000000 | 0.10 | 1.01 | ... |
In these columns, the compound_score column is a direct pass-through from the user-supplied compound_score column in the metabolites input file. If the user didn't provide compound_score, then this should be all 1. Row #4 does not have a compound_score value because it describes a gene that did not connect to a compound (see above).
The level column describes how "far" into the chemical similarity network MAGI went to connect the compound to the gene. Row #1 has a value of 1, meaning that the compound WQZGKKKJIJFFOK-GASJEMHNSA-N is an immediate neighbor of GUBGYTABKSRVRQ-LVIVMJSQSA-N in the chemical similarity network (so is very similar). Rows #2 and #3 have a value of zero, meaning the chemical network was not used. Row #4 is blank because there is no compound associated with this row.
The homology_score is described more in depth above. Row #3 has a "perfect" homology score, meaning that the bidirectional BLAST results both resulted in a "perfect" homology match (i.e. an E-value of 0.0; an E-score of 200.0 each). Rows #2 and #4 have a homology score of 1.0 because they do not have a reciprocal BLAST to assess.
The reciprocal_score and reaction_connection scores are described in depth above.
These columns are useful after initial filtering of results, when you are trying to narrow your results down to an actionable list and/or assess the results in a specific biochemical context.
# | ... | e_score_r2g | database_id_r2g | e_score_g2r | database_id_g2r |
---|---|---|---|---|---|
1 | ... | 60.0585 | 6-PHOSPHO-BETA-GLUCOSIDASE-RXN | 105.425 | RXN-13701 |
2 | ... | | RXN-18116 | | |
3 | ... | 200 | RHEA:16951 | 200 | RHEA:16951 |
4 | ... | | | 200 | RHEA:34481 |
Notes: Row #1 shows what a reciprocal disagreement looks like: the r2g and g2r searches point to different reactions. Row #2 lists just one reaction that the compound is involved in. Row #3 shows what a perfect result looks like. Row #4 shows that even though the gene was not connected to an observed compound, it still has a very strong match to a reaction; results like this are worth designing further experiments for.
magi_compound_results.csv file
This table's sole purpose is to score and rank compound identifications. It is completely compound-centric and should NOT be used for any other purpose. What we mean by "compound-centric" is that for each metabolite structure, only the best metabolite-reaction-gene association is retained. Furthermore, if you do not include a column describing the mass spectrometry feature that might represent each compound in your metabolites file (e.g. the feature column in the example above), then this table will be useless to you.
An example slice of a magi_compound_results.csv table is shown below, where the first few putative compound identifications of feature 203.0345@3.07 are shown and ranked according to MAGI score:
feature | original_compound | compound_score | MAGI_score | reciprocal_score | homology_score | reaction_id_r2g |
---|---|---|---|---|---|---|
203.0345@3.07 | HVZYIHBMRFYBRI-UHFFFAOYSA-N | 0.648596 | 5.682836 | 2.0 | 400.000000 | RXN-10622 |
203.0345@3.07 | VOJUXHHACRXLTD-UHFFFAOYSA-N | 0.648596 | 3.241197 | 2.0 | 42.327352 | RHEA:26311 |
203.0345@3.07 | MGZOXZPZHVOXQB-UHFFFAOYSA-N | 0.373115 | 0.770257 | 2.0 | 120.155587 | RHEA:12208 |
203.0345@3.07 | HWWWTOHAFWXPCB-UHFFFAOYSA-N | 1.030241 | 0.179157 | 0.1 | 1.000000 | NaN |
203.0345@3.07 | YKPXIWHBRBFRQM-UHFFFAOYSA-N | 1.030112 | 0.179152 | 0.1 | 1.000000 | NaN |
203.0345@3.07 | CQDXJBJBEQPBEM-UHFFFAOYSA-N | 0.941043 | 0.175147 | 0.1 | 1.000000 | NaN |
... | ... | ... | ... | ... | ... | ... |
The top scoring compound is MAGI's best suggestion for what 203.0345@3.07 actually represents based on 1) its presence in a reaction, 2) the reaction's reference sequence having high homology to a gene, and 3) that gene having high homology to the reaction's reference sequence:
The reaction_id_r2g column shows one reaction that this metabolite is a reactant or product in. More specifically, it is the reaction whose reference sequence had a higher homology score than any other reaction the metabolite was a member of.
The homology_score column shows the reciprocal homology score between the reaction in reaction_id_r2g and one gene in the supplied input file. 400 is the maximum possible homology score, so in this case the original_compound was connected to reaction_id_r2g with perfect homology, and the gene was also connected to a reaction with perfect homology. Note: without looking at reciprocal_score, we cannot know if these reactions were the same or not.
The reciprocal_score column represents whether or not there was reciprocal agreement between the reaction reference sequence and user input gene. In this case, a 2.0 indicates that the gene was connected to the reaction_id_r2g reaction as well.
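Picking the top-ranked compound for each feature can be done with the Python standard library. The sketch below assumes the downloaded file is comma-separated with the column names shown in the example table; the inline sample holds three rows from that table, and the function name is hypothetical.

```python
import csv
import io

# Three rows from the example table above, written as comma-separated text.
sample = """feature,original_compound,compound_score,MAGI_score
203.0345@3.07,VOJUXHHACRXLTD-UHFFFAOYSA-N,0.648596,3.241197
203.0345@3.07,HVZYIHBMRFYBRI-UHFFFAOYSA-N,0.648596,5.682836
203.0345@3.07,HWWWTOHAFWXPCB-UHFFFAOYSA-N,1.030241,0.179157
"""


def best_compound_per_feature(handle):
    """Map each feature to its row with the highest MAGI_score."""
    best = {}
    for row in csv.DictReader(handle):
        score = float(row["MAGI_score"])
        current = best.get(row["feature"])
        if current is None or score > float(current["MAGI_score"]):
            best[row["feature"]] = row
    return best


top = best_compound_per_feature(io.StringIO(sample))
for feature, row in top.items():
    print(feature, row["original_compound"], row["MAGI_score"])
```

To run this on a real results file, pass an open file handle instead of the `io.StringIO` object.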
magi_gene_results.csv file
This table is only meant to assess different possible functions for gene products based on the integrative MAGI analysis. It is NOT meant to do any sort of compound identification. This table summarizes the MAGI results for each gene-reaction association, showing only the top-scoring result for each unique gene-reaction pair. Please note that if your compound inputs were not certain (i.e. you were also using MAGI to score compound identifications), this table may not be useful for you. In this case, you should identify which compounds were actually present in your sample, filter the MAGI results to contain only results associated with those compounds, and then re-create this gene-centric table.
Gene ID | MAGI_score | database_id_r2g | database_id_g2r | homology_score | reciprocal_score | e_score_r2g | e_score_g2r | original_compound |
---|---|---|---|---|---|---|---|---|
637266218 | 4.765548 | RHEA:42893 | RHEA:42893 | 176.697444 | 2.0 | 88.348722 | 89.504456 | IPJDTNIZLKTLEU-KQYNXXCUSA-N |
637266218 | 4.719569 | RHEA:25026 | RHEA:25026 | 193.534008 | 2.0 | 96.767004 | 99.580044 | GXYLOXCSJFJFKA-IOSLPCCCSA-N |
637266218 | 4.704383 | RHEA:20717 | RHEA:20717 | 193.534008 | 2.0 | 96.767004 | 99.580044 | VNPWVMVYUSNFAW-WFMPWKQPSA-N |
This example table shows three possible functions for gene 637266218. Although there is one top score, all three suggestions have similar scores. All three reactions involve a metabolite with an adenosine substructure that may contain an additional chemical group at the 5' position. Furthermore, the three compounds associated with this gene are all the deamination products of these reactions. Therefore, by investigating the reactions and the compounds associated with this gene product, all three functions can be deemed equally plausible.
An InChI Key is a database-friendly representation of a chemical structure. An InChI Key is generated from a hash code of an InChI, which is a type of digital signature for chemical structures. InChI Keys are how MAGI stores compound information, so knowing how to "read" an InChI Key will make your life a lot easier when analyzing the results.
The InChI key documentation and examples are here, but here is a quick graphic with the aspects of the key relevant to MAGI:
What can I use MAGI for?
Here are some common examples for how you might use MAGI:
How is MAGI Bayesian-like?
Although MAGI does not use the Bayesian formula, it does follow the underlying philosophy. The three principles of the MAGI philosophy are:
What is the reaction reference database used?
The reaction reference sequence database is a custom curated MAGI-specific database. It is essentially the join of all MetaCyc and Rhea reference sequences to each reaction. Duplicate reactions with different reference sequences do exist within and between MetaCyc and Rhea, and we have collapsed them into one reaction representation and combined all reference sequences for that reaction.
Can you use annotations if you have them?
If you have annotation information about your genes, you can merge that into the MAGI results tables after analysis (joining on your gene identifier) and assess them by eye. If you want to force a gene to have a specific reaction within the MAGI workflow, that functionality is currently not supported by MAGI.
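The post-analysis merge described above can be sketched with the standard library. The column name `gene_id`, the annotation text, and the function name are placeholders for illustration; substitute the gene identifier column used in your own files.

```python
import csv
import io

# Hypothetical file contents; "gene_id" is a placeholder column name.
results_csv = """gene_id,MAGI_score,database_id_r2g
637266218,4.765548,RHEA:42893
637271525,6.74231,RHEA:16951
"""
annotations_csv = """gene_id,annotation
637266218,placeholder annotation text
"""


def merge_annotations(results_handle, annotations_handle, key="gene_id"):
    """Left-join annotation columns onto result rows using the gene identifier."""
    reader = csv.DictReader(annotations_handle)
    extra_columns = [name for name in reader.fieldnames if name != key]
    annotations = {row[key]: row for row in reader}
    merged = []
    for row in csv.DictReader(results_handle):
        match = annotations.get(row[key], {})
        for name in extra_columns:
            # Genes without an annotation get an empty string.
            row[name] = match.get(name, "")
        merged.append(row)
    return merged


merged = merge_annotations(io.StringIO(results_csv), io.StringIO(annotations_csv))
for row in merged:
    print(row["gene_id"], row["MAGI_score"], row["annotation"])
```

Every result row is kept, so genes without an annotation remain visible when assessing the merged table by eye.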