This tutorial will go through step-by-step examples of how to submit a job to MAGI web, how to download your results, and some tips on how to analyze your data in Microsoft Excel. For specific details of how to create and submit a job, what the various MAGI scores mean, and how they were calculated, please see the Help page.
This example input data consists of a metabolite input describing 7 features observed in a metabolomics experiment conducted on Streptomyces coelicolor, and a gene input file describing all the genes that were associated with the compounds that could represent those features. For an explanation of what a feature is, you can go to the Help page. For the purposes of this tutorial, you can pretend that we compared two different experimental conditions on S. coelicolor, and observed that those 7 features were significantly increased in one condition. Now we want to use MAGI to 1) help refine putative compound identities for the features, 2) help us understand the biological relevance of these features, and 3) point us to genes that we could knock out, overexpress, or monitor the expression of in follow-up experiments.
Download this .zip file and extract it on your personal computer.
If you want to look at what a metabolites file can look like, open the compounds_input.csv file in Excel or any similar spreadsheet viewer. The table below shows the first 5 rows of the example metabolites file input:
The only columns with "protected names" are original_compound and compound_score. The other 5 columns are descriptive and purely for your own benefit and notes. The metabolites file input can have any number of columns with any names, but it must have a column named original_compound. See the Help page for more details on input file formats and alternative requirements if you are performing an accurate mass search.
For these particular rows, we can see that 5 different compounds could represent a mass spectrometry feature observed at the retention time 4.70 with an m/z value of 207.0297. For this dataset, the compound_score for each putative compound identification was calculated using Pactolus to score mass fragmentation patterns of each feature against a library of in silico ion fragmentation trees. These scores are user-defined and can come from anywhere, or be left out entirely.
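Before uploading, it can help to confirm that a metabolites table has the required column. The following is a minimal sketch, not part of MAGI itself: the sample rows and the function name check_metabolites_table are made up for illustration, and it assumes that blank compound_score cells are acceptable.

```python
import csv
import io

# Made-up stand-in for a metabolites input file; names and scores are placeholders.
SAMPLE = """label,original_compound,compound_score
feature_1,compound_1,0.8
feature_1,compound_2,
"""

def check_metabolites_table(text):
    """Return a list of problems found in a metabolites table given as CSV text."""
    reader = csv.DictReader(io.StringIO(text))
    columns = reader.fieldnames or []
    problems = []
    # original_compound is the one column the tutorial says must be present.
    if "original_compound" not in columns:
        problems.append("missing required column: original_compound")
    # compound_score is optional; when present, non-blank cells should be numeric
    # (treating blank cells as acceptable is an assumption of this sketch).
    if "compound_score" in columns:
        for line_no, row in enumerate(reader, start=2):
            value = (row["compound_score"] or "").strip()
            if value:
                try:
                    float(value)
                except ValueError:
                    problems.append(
                        f"line {line_no}: compound_score is not a number: {value!r}"
                    )
    return problems

problems = check_metabolites_table(SAMPLE)
```

An empty list means the sketch found nothing to complain about; the Help page remains the reference for the actual input requirements.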
You may also open the example_fasta.faa file in any text viewer; it is a standard FASTA file. This file looks something like this:
>640738191 SCO5328 [Streptomyces coelicolor A3(2): NC_003888]
LIIYPAALDLPHALVEWVTMLIVTREGDRRCKLRPSQRAMVALVYLREHT
TLAKIAAGFGISESTAHAYTSAVIHLLAQRAPGLLKVLLETGPDFVLLDG
TLAECDRVGDSRADYSHKHRRHGVNVQVVTDPEGRLLWLPALPGRAHDLT
AARPPHHPYLRAPGRPHPGSCLPGRRPVADHGHQTQAPAGTHPHQEDRQP
GPGRSAGIGRTRCCPPEVLADFPQSPLQPQPHDVNRQSCTHLGAATL
>640738170 SCO0206 [Streptomyces coelicolor A3(2): NC_003888]
RGRTAARLRAYMSAGGFHMNVNVLDRAVLEDAMEHPEKYPELTIRV
>637276685 SCP1.13c hypothetical protein [Streptomyces coelicolor A3(2) plasmid SCP1: NC_003903]
MPKRGSWSWIGSSQAVRRDNRDVTKHRFLVARQIGAEPFSYTRTCLDTSL
VERLRNTSLDDMTDLDAAYELTRLAMVDLEAFGTGGGETLDDQEIAVLLR
TLPAVLKRLGVALKLPFRDFKSFKSYWKDQGMSDSYAARRSYVNSVFGPA
LDQLDEYHPHLRQSRDQTRRITDVTRRRLRDGLAGAWWGGTLDEVQFLSR
LYDLDNLPSHDSRFANAARDIWQHCINNPADWEDDWIWHDERFGLADGDE
ALLRFLAEMLHPAARTDTAEVERLHAFFNETLAHDGYELIEVDSISGAPV
FAARTIGAGVPGVMKNLIFAADGPKPEFVLGDAINNDVLIVKNEQFCLVY
SQPLSASGLTWGEMISWWRARESLPADMPDNEVGRALYKRLWTSLHSDPE
RPKWLSPEQRVFRTYCELYPISETGASHPALLPQVYLHMDPKTRKERGNK
DTALGRERMDFLLLLPHGVRVVVEVDGKHHYAEGDEASPRLYAKMVAEDR
ALRLKGYEVYRFGGYELGLNSAPAMLRRFFAELLGSQP
...
The header of each gene sequence may contain any characters except '>'. The input FASTA file can also be nucleotide instead of protein.
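If you would rather inspect a FASTA file programmatically than in a text viewer, a small reader is enough. This is only a sketch: read_fasta is a hypothetical helper, the second record in the sample is invented, and the header check simply mirrors the rule stated above that a header may not contain '>'.

```python
def read_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

# The first record is copied from the example file; the second is made up
# to show a sequence wrapped over two lines.
SAMPLE = """>640738170 SCO0206 [Streptomyces coelicolor A3(2): NC_003888]
RGRTAARLRAYMSAGGFHMNVNVLDRAVLEDAMEHPEKYPELTIRV
>example_gene hypothetical two-line record
MPKRG
SWSWI
"""

records = read_fasta(SAMPLE)
# Headers that still contain '>' after the leading one would break the stated rule.
bad_headers = [header for header, _ in records if ">" in header]
```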
Navigate to the Jobs page and upload the two files you just downloaded. Click on the "Choose File" button and select the file for each file type (i.e. FASTA or Metabolite).
You may change the parameters from their default setting if you so wish. See the Help page for more information.
Next, click Submit Job. You should receive an email within a few minutes to say that your job submission was received and is being prepared and queued to run. Once submitted, this job should complete within 10 minutes. If you left all job parameters at their default values, you may proceed to the next step to download the results now (we already ran this job!). Otherwise, go grab a coffee and wait for your job completion email.
To see the MAGI results for the default parameters, click here. Otherwise, click on the link that is in your job completion email and download the magi_compound_results.csv, magi_results.csv, and magi_gene_results.csv files. You may also reach this page if you know your job ID by using the API.
Once the job has completed, your Jobs page looks different. Here are some cool things you can do and see there:
Download your results files by clicking the download results button. This is a .zip file containing 3 files: magi_results.csv, magi_compound_results.csv, and magi_gene_results.csv. Briefly: magi_results.csv is the complete set of results from the MAGI run. magi_compound_results.csv is joined with your metabolite input file, and only the best MAGI association is kept for each compound. magi_gene_results.csv is joined with the gene headers in your FASTA input file, and only the best MAGI association is kept for each gene-reaction pair. For a detailed explanation of these files, see the Help page.
Every dataset will have a different range of MAGI scores, ranging from very close to zero and up, depending on the parameters set and the compound scores you provided in the metabolites input file. As a result, knowing what the range of MAGI scores is for your dataset will come in very handy when assessing results. To do this in Excel, highlight the "MAGI_score" column, click the drop-down menu at the bottom of the window, and select "Max", then "Min" and keep in mind what those values were. For this dataset, the "Max" is 5.657 and "Min" is 0.0176.
Optionally, you could normalize the MAGI scores within the spreadsheet by dividing each score by the maximum value. To do this in Excel, insert a column (e.g. so the new column is at position "A"), and in cell A2 type in "=B2/MAX(B:B)", and hit enter. Then double click on the little box in the bottom-right of the highlighted A2 cell, which will copy that formula all the way down to the bottom of the spreadsheet.
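The same range check and normalization can be done outside Excel. Below is a minimal sketch using only the Python standard library; the three rows are invented (the largest and smallest scores reuse the Max and Min quoted above), and the new column name MAGI_score_normalized is just a label chosen for this example.

```python
import csv
import io

# Made-up rows; only the MAGI_score column name comes from the results files.
SAMPLE = """MAGI_score,gene_id
5.657,gene_a
1.25,gene_b
0.0176,gene_c
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
scores = [float(row["MAGI_score"]) for row in rows]

# Equivalent of reading "Max" and "Min" off the Excel status bar.
score_min, score_max = min(scores), max(scores)

# Equivalent of the =B2/MAX(B:B) formula copied down the column.
for row, score in zip(rows, scores):
    row["MAGI_score_normalized"] = score / score_max
```

To run this on a real file, the io.StringIO(SAMPLE) argument would be replaced by an opened results file.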
Optionally, you can use conditional formatting to make visual interpretation of the MAGI score easier. Highlight the "MAGI_score" column (or the normalized column if you made one), click on Conditional Formatting (under the Format menu), and choose your favorite color scale.
The file magi_compound_results.csv only keeps the best result for a given compound (or compound-compound_score pair if provided). For example, if one compound was associated with two reactions and three genes, only the top-scoring metabolite-reaction-gene row would be retained.
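To make the "best result per compound" idea concrete, here is a small sketch that reproduces the example in the previous paragraph. It illustrates the described behavior and is not MAGI's own code; the compound, reaction, and gene names and all scores are placeholders.

```python
def best_row_per_compound(rows):
    """Keep only the highest-scoring row for each (compound, compound_score) pair."""
    best = {}
    for row in rows:
        key = (row["original_compound"], row.get("compound_score", ""))
        if key not in best or row["MAGI_score"] > best[key]["MAGI_score"]:
            best[key] = row
    return list(best.values())

# compound_1 is linked to two reactions and three genes; compound_2 has one row.
rows = [
    {"original_compound": "compound_1", "compound_score": "0.7",
     "database_id_r2g": "reaction_1", "gene_id": "gene_a", "MAGI_score": 2.0},
    {"original_compound": "compound_1", "compound_score": "0.7",
     "database_id_r2g": "reaction_1", "gene_id": "gene_b", "MAGI_score": 3.5},
    {"original_compound": "compound_1", "compound_score": "0.7",
     "database_id_r2g": "reaction_2", "gene_id": "gene_c", "MAGI_score": 1.0},
    {"original_compound": "compound_2", "compound_score": "0.4",
     "database_id_r2g": "reaction_3", "gene_id": "gene_d", "MAGI_score": 0.6},
]

best_rows = best_row_per_compound(rows)
```

Of the three rows for compound_1, only the gene_b row survives, because it has the highest MAGI_score.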
In this example experiment, there were seven mass spectrometry features (m/z values at a retention time peak) that could represent a total of 77 metabolites (see the compound_input.csv file). This simulates an untargeted metabolomics experiment, where one has detected a feature of interest and wants to rank its identity among all possible metabolites. This section will show you how to interpret the relevant MAGI result files. In this case, our metabolites input file had a column named "label" that corresponded to the mass spectrometry feature that may represent a given metabolite. Several different metabolites had the same "label" description.
Please note that "label" is not a required field to run MAGI. However, if you wish to assess compound identities for a feature, you will need to include a description of each feature, as we have here in the "label" column.
The first order of business is to sort (and effectively group) the spreadsheet by the "label" column. This will group all the results for each mass spectrometry feature together. In Excel, you can highlight the entire column and sort, or highlight the entire spreadsheet, go to the "Data" menu, select "Sort", and then choose the column(s) to sort by. A third way to do this is to go to the "Data" menu and select "Filter", which will add a drop-down menu at each heading in Row 1 that can be used to sort the "label" column. I personally like to use the Data > Sort method, because it is easy to add multiple sort rules. For example, I would do Data > Sort > Sort by label on values A to Z, then by MAGI_score on values largest to smallest.
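The two-level sort described above (label A to Z, then MAGI_score largest to smallest) can be sketched in a few lines of Python. The rows are invented, and the handling of blank scores is an assumption: this sketch treats a blank MAGI_score as "not linked to any gene" and sorts it last within its feature.

```python
import csv
import io

# Made-up rows standing in for magi_compound_results.csv.
SAMPLE = """label,original_compound,MAGI_score
feature_b,compound_3,0.9
feature_a,compound_1,
feature_a,compound_2,4.2
feature_b,compound_4,2.5
"""

def magi_score(row):
    """Return MAGI_score as a float, treating blank cells as the lowest possible score."""
    value = row["MAGI_score"].strip()
    return float(value) if value else float("-inf")

rows = list(csv.DictReader(io.StringIO(SAMPLE)))

# Sort by label A to Z, then by MAGI_score largest to smallest.
rows.sort(key=lambda row: (row["label"], -magi_score(row)))

ordered = [(row["label"], row["original_compound"]) for row in rows]
```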
Here we'll go over two examples of how one might go about interpreting the compound results:
What you should see is that the first feature and all of the metabolites it could represent are in the top 10 rows. However, only one out of these 10 possibilities was connected to a gene (gene_id 637268777). Also, this metabolite-gene association had one of the highest MAGI scores in the dataset, which can be interpreted to mean that it is a very likely association. Given the genome input, it is highly probable that this feature is the compound HGVAHYJMDVROLE-UHFFFAOYSA-N (deoxydehydrochorismic acid).
The next feature appears on the surface to not be so clear-cut, as all 3 seemingly different metabolites were associated with genes. However, this feature had 9 total compounds associated with it, so this is already a reduction of about two-thirds in possible identities. But we can do better by digging into the various scores and interpreting the InChI key.
The lowest scoring metabolite that was associated with a gene was XQQIMTUYVDUWKJ-DJLDLDEBSA-N. The reason why this metabolite scored so much lower can be seen in the "level" column: the value here is "1", as opposed to "0." This means that XQQIMTUYVDUWKJ-DJLDLDEBSA-N or its tautomers were not connected to a gene at all; the chemical similarity network was used. In the "neighbor" column you can see that the compound WJBNIBFTNGZFBW-DJLDLDEBSA-N was the compound that was connected to gene 637267437 via the reaction RHEA:36432 with perfect homology (i.e. the E-value from the BLAST search was 0.0, which translates to a MAGI E-score of 200). Investigating the compound structures reveals that they are indeed similar compounds:
The hypothesis that MAGI makes is that XQQIMTUYVDUWKJ-DJLDLDEBSA-N can substitute for WJBNIBFTNGZFBW-DJLDLDEBSA-N in the reaction RHEA:36432. If this hypothesis is correct, then observing XQQIMTUYVDUWKJ-DJLDLDEBSA-N is plausible, since it could also take part in a reaction supported by the genome. However, this hypothesis should be further assessed by looking at the reaction itself:
How likely do you think it is that the extra imidazole ring on XQQIMTUYVDUWKJ-DJLDLDEBSA-N will interfere with this enzyme-catalyzed reaction? Ultimately, answering that would take docking simulations or a biochemical experiment. This uncertainty is why searching the chemical network is penalized in the final MAGI score.
The remaining two top-scoring compounds are ZFEGYUMHFZOYIY-MKFCKLDKSA-N and ZFEGYUMHFZOYIY-YVNCZSHWSA-N. Since the "first block" of the InChI key is the same (ZFEGYUMHFZOYIY), we know that these two compounds share the same atomic connectivity, meaning the only difference between them is stereochemistry or possible tautomerism. Looking at the database_id_r2g and database_id_g2r columns, we see that both stereoisomers were linked to the same reaction (RHEA:24952) with perfect homology. Finally, looking at the "note" column, we see that only ZFEGYUMHFZOYIY-YVNCZSHWSA-N was linked directly to the reaction; the other stereoisomer was linked via a "flat tautomer," meaning that stereochemistry was ignored. So now we can hypothesize that, based on the input genome, we are most likely seeing this specific stereoisomer of 1,6-anhydro-N-acetyl-beta-muramate.
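Comparing first blocks by eye gets tedious when a feature has many candidates. As a rough sketch, the candidates can be grouped by the part of the InChI key before the first hyphen; group_by_connectivity is a name invented for this example, and the three keys are the ones discussed above.

```python
from collections import defaultdict

def group_by_connectivity(inchi_keys):
    """Group InChI keys by their first block (the part before the first hyphen)."""
    groups = defaultdict(list)
    for key in inchi_keys:
        first_block = key.split("-")[0]
        groups[first_block].append(key)
    return dict(groups)

keys = [
    "ZFEGYUMHFZOYIY-MKFCKLDKSA-N",
    "ZFEGYUMHFZOYIY-YVNCZSHWSA-N",
    "XQQIMTUYVDUWKJ-DJLDLDEBSA-N",
]

groups = group_by_connectivity(keys)
# Keys that share a first block differ only in the later blocks.
shared = {block: members for block, members in groups.items() if len(members) > 1}
```

Here the two ZFEGYUMHFZOYIY keys end up in the same group, which is the cue to go and compare their "note" values.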
This table shows the best results for a given gene-reaction association. So for example, if MAGI associated one gene with one reaction, and that one reaction with three observed compounds, this would correspond to three rows in the full magi_results.csv table. However, in the magi_gene_results table, only the top-scoring of these three would be retained, leaving just one row to assess the gene-reaction suggestion made by MAGI.
In the example experiment, the FASTA file input had 781 genes in it. Via homology, MAGI associated these gene products with 2,277 reactions; the gene product may catalyze all or none of its proposed reactions. Since there are metabolomics data, MAGI scores these proposed functions based on the strength of the functional prediction and the strength of the metabolite observation. Here we will go through a few examples to show how you can interpret this table.
While going through this table, it is important to note that absence of evidence is not evidence of absence. For example, suppose one gene is associated with 2 mutually exclusive reactions, one with weak homology and one with strong, and only the one with weak homology was associated with an observed metabolite. This does not mean that the weaker reaction is correct, because another metabolomics experiment might provide evidence for the stronger reaction.
The first order of business is to sort (and effectively group) the spreadsheet by the "gene_id" column. This will group all the results for each gene together. Since we will be using slightly more advanced data manipulation features in Excel, the preferred way to do this is to go to the "Data" menu and select the "Filter" option. This will add a drop-down menu at each heading in Row 1, which can be used to sort the "gene_id" column. Because this spreadsheet contains so many genes, we will also use the filtering options in the dropdown menu.
If you don't have a targeted list of genes to analyze, there are a few ways you might go about prioritizing your analysis. You might want genes that had a MAGI score greater than a threshold, or a certain homology strength. The simplest way to do this in Excel is probably to use the dropdown menu for "MAGI_score" and/or any of the homology score columns ("homology_score", "e_score_r2g", "e_score_g2r", "reciprocal_score") to filter by your desired threshold, keeping note of the genes that meet this threshold. Then, you should clear the threshold filter and filter the spreadsheet on the "gene_id" column for each or all of those genes, so that you can consider all of the results for each gene.
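The two-step filtering above (find genes that pass a threshold, then bring back every row for those genes) can be sketched as follows. The rows, the threshold of 2.0, and the helper names are all invented for illustration.

```python
# Made-up rows; only the column names come from the results files.
rows = [
    {"gene_id": "gene_a", "database_id_g2r": "reaction_1", "MAGI_score": 4.1},
    {"gene_id": "gene_a", "database_id_g2r": "reaction_2", "MAGI_score": 0.3},
    {"gene_id": "gene_b", "database_id_g2r": "reaction_3", "MAGI_score": 0.8},
    {"gene_id": "gene_c", "database_id_g2r": "reaction_4", "MAGI_score": 2.6},
]

def genes_passing(rows, column, threshold):
    """Return the set of gene_ids with at least one row above the threshold."""
    return {row["gene_id"] for row in rows if row[column] > threshold}

def rows_for_genes(rows, gene_ids):
    """Return every row for the given genes, including rows below the threshold."""
    return [row for row in rows if row["gene_id"] in gene_ids]

# Step 1: note which genes meet the threshold.
selected = genes_passing(rows, "MAGI_score", 2.0)
# Step 2: clear the threshold and look at all results for those genes.
review = rows_for_genes(rows, selected)
```

Note that the low-scoring reaction_2 row for gene_a is kept in the second step, which is the point of clearing the threshold filter.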
A slightly more advanced method of finding "good" genes to go after is to use Excel Pivot Tables. To make a pivot table, go to the "Data" menu and select "Pivot Table," selecting "create manual PivotTable" if it is an option. This should create a new sheet with an empty pivot table, with a pivot table builder window open. Add the column names you want in the builder, with the gene_id and database_id_g2r columns as the "row labels," and whatever columns you want to filter/sort by in the "values" section. You will likely have to change the values formulas to something more meaningful, like "Max of MAGI_score" or "Min of level" instead of the default. This should make a neat table that has the stats for each gene's reaction associations, and you can sort each stat by selecting all the rows within the "values" section of the pivot table (don't select the header or the entire column), and sorting. This way you can prioritize genes while still seeing all of their results.
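For readers who prefer a script to a pivot table, roughly the same summary ("Max of MAGI_score" and "Min of level" for each gene_id and database_id_g2r pair) can be sketched like this. The rows are invented and the output key names max_MAGI_score and min_level are labels chosen for this example.

```python
def summarize(rows):
    """Aggregate rows by (gene_id, database_id_g2r), like pivot table row labels."""
    stats = {}
    for row in rows:
        key = (row["gene_id"], row["database_id_g2r"])
        entry = stats.setdefault(
            key, {"max_MAGI_score": row["MAGI_score"], "min_level": row["level"]}
        )
        entry["max_MAGI_score"] = max(entry["max_MAGI_score"], row["MAGI_score"])
        entry["min_level"] = min(entry["min_level"], row["level"])
    return stats

# Made-up rows; only the column names come from the results files.
rows = [
    {"gene_id": "gene_a", "database_id_g2r": "reaction_1", "MAGI_score": 3.0, "level": 1},
    {"gene_id": "gene_a", "database_id_g2r": "reaction_1", "MAGI_score": 5.0, "level": 0},
    {"gene_id": "gene_a", "database_id_g2r": "reaction_2", "MAGI_score": 1.5, "level": 1},
    {"gene_id": "gene_b", "database_id_g2r": "reaction_3", "MAGI_score": 4.0, "level": 0},
]

summary = summarize(rows)

# Sort the summary by the best MAGI score, largest first.
ranked = sorted(
    summary.items(), key=lambda item: item[1]["max_MAGI_score"], reverse=True
)
ranked_keys = [key for key, _ in ranked]
```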
We will go through a handful of examples of how one might go about assessing these results: Genes 637266218, 637270178, 637269590
Filter your gene_results table by the gene_id column to only show the gene 637266218. This should leave only three rows to look at, all with great MAGI scores. Two rows have the same compound (GXYLOXCSJFJFKA-IOSLPCCCSA-N) linked to two different reactions, but one of them does not have reciprocal agreement, making it a weaker suggestion. Looking closer at the remaining two suggestions, we see that the two compounds GXYLOXCSJFJFKA-IOSLPCCCSA-N and VNPWVMVYUSNFAW-WFMPWKQPSA-N are actually quite similar in structure (they both have a methylthioinosine substructure), and the two reactions (RHEA:25026 and RHEA:20717) associated with them and the gene have identical mechanisms. This could be interpreted as metabolomic evidence that the gene product can catalyze both reactions.
Filter the gene_results table to only show results for gene 637270178, which should just be 2 rows. The results for this gene have medium MAGI_scores, mainly because the chemical similarity network was used to associate the gene with any observed compound. It is important to remember that these are the top-scoring MAGI results for this gene; there were no better compounds associated with this gene. The reactions that this gene product could catalyze are listed in database_id_g2r: RXN-15787 and RXN-15788, catalyzing the prodigiosin and norprodigiosin synthase reactions, respectively. It is also important to remember that these results are from a gene-centric point of view: the gene sequence was used to query a reaction database. The database_id_r2g column contains the reactions that were associated with both the gene and the observed compound; this can be thought of as a compound-centric view. What happened here is that reactions involving the observed compound or similar compounds were used to find genes that may catalyze those reactions. In this particular case, ISFCPXILUVJVOC-KYGJEJSHSA-N was associated with the undecylprodigiosin synthase reaction, but the reaction did not have any reference sequences with which to conduct homology searches. Instead, the chemical similarity network was used to find the similar compound WKGQSEFBQTWRPT-UHFFFAOYSA-N (prodigiosin), which is a product of the prodigiosin synthase reaction, which does have a reference sequence. This sequence had a perfect homology score of 200 (seen in the e_score_r2g column) to reaction RXN-15787. When the gene-centric and compound-centric results are merged, we see that the top-scoring row for the gene involves undecylprodigiosin (ISFCPXILUVJVOC-KYGJEJSHSA-N) associated with gene 637270178 via the prodigiosin synthase reaction (RXN-15787), with perfect homology and reciprocal agreement.
Since ISFCPXILUVJVOC-KYGJEJSHSA-N and WKGQSEFBQTWRPT-UHFFFAOYSA-N are in fact structurally similar, it is feasible to imagine that an enzyme catalyzing the reaction RXN-15787 could also produce ISFCPXILUVJVOC-KYGJEJSHSA-N; assuming this is true, one would have strong metabolomic evidence to suggest that the gene 637270178 catalyzes a reaction similar to RXN-15787 that synthesizes undecylprodigiosin.
In our last example, we will look at gene 637269590. After filtering magi_gene_results to only show results for this gene, we see four rows, each with fairly high MAGI scores, that show the gene was associated with one compound, KQWNMCHCVXVLRS-UHFFFAOYSA-N. There are four rows here because, when the gene 637269590 was used as a homology search query against the reaction database, MAGI found four different reactions that the gene could have catalyzed: RXN-15413, CDPDIGLYSYN-RXN, RXN0-5515, and RXN-15409, all with similar homology scores (some with identical scores). However, only one of these reactions (RXN-15413) had support from a compound observed in the metabolomics experiment (KQWNMCHCVXVLRS-UHFFFAOYSA-N). We know this because only that reaction had reciprocal agreement (reciprocal_score value equaled 2). In this case, we would have experimental evidence that suggested one gene annotation over three mutually exclusive competing annotations.
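Picking out the reaction with reciprocal agreement can also be scripted. In this sketch the gene and reaction identifiers are the ones from the example, but the reciprocal_score values for the three unsupported reactions are placeholders (set to 0 here), since the tutorial only states that the supported reaction has a value of 2.

```python
# reciprocal_score of 2 marks reciprocal agreement, as described above;
# the 0 values are placeholders for "no reciprocal agreement", not real output.
rows = [
    {"gene_id": "637269590", "database_id_g2r": "RXN-15413", "reciprocal_score": 2},
    {"gene_id": "637269590", "database_id_g2r": "CDPDIGLYSYN-RXN", "reciprocal_score": 0},
    {"gene_id": "637269590", "database_id_g2r": "RXN0-5515", "reciprocal_score": 0},
    {"gene_id": "637269590", "database_id_g2r": "RXN-15409", "reciprocal_score": 0},
]

def reactions_with_reciprocal_agreement(rows, gene_id):
    """List the reactions for one gene whose reciprocal_score equals 2."""
    return [
        row["database_id_g2r"]
        for row in rows
        if row["gene_id"] == gene_id and row["reciprocal_score"] == 2
    ]

supported = reactions_with_reciprocal_agreement(rows, "637269590")
```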
This table is an unfiltered table of all MAGI results. It can be used to delve deeply into compounds, genes, reactions, and the associations between them all.
For example, if you are interested in all of the reactions and/or genes that a compound was associated with, you may filter the table to only show rows pertaining to that compound, and then look in the "database_id_r2g" column for a list of all reactions, and "gene_id" for a list of genes associated with that compound. This could be a useful method for stitching together biochemical pathways.
Or, if you want to see all the compounds that a gene was connected to, you can filter the table to only show one gene, and look at the "database_id_g2r" column. This could be a useful method for curating a metabolic model.
Or, if you were interested to see all the evidence for a particular biochemical reaction, you could filter either the database_id_g2r or database_id_r2g columns to show only one reaction, and you could see all the compounds of that reaction that were observed (and similar compounds), as well as genes that were associated with that reaction. This could also be a useful method for curating a metabolic model.
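The three kinds of lookup described for this table (by compound, by gene, by reaction) come down to simple row filters. The sketch below uses invented rows and helper names, and it assumes that the compound column in the full results table is called original_compound, as in the metabolites input.

```python
# Made-up rows standing in for magi_results.csv.
rows = [
    {"original_compound": "compound_1", "gene_id": "gene_a",
     "database_id_r2g": "reaction_1", "database_id_g2r": "reaction_1"},
    {"original_compound": "compound_1", "gene_id": "gene_b",
     "database_id_r2g": "reaction_2", "database_id_g2r": "reaction_3"},
    {"original_compound": "compound_2", "gene_id": "gene_a",
     "database_id_r2g": "reaction_3", "database_id_g2r": "reaction_1"},
]

def rows_matching(rows, column, value):
    """Return the rows whose given column equals the given value."""
    return [row for row in rows if row[column] == value]

def rows_for_reaction(rows, reaction_id):
    """Return rows where the reaction appears in either reaction column."""
    return [
        row for row in rows
        if reaction_id in (row["database_id_g2r"], row["database_id_r2g"])
    ]

# All reactions and genes associated with one compound.
compound_rows = rows_matching(rows, "original_compound", "compound_1")
compound_reactions = sorted({row["database_id_r2g"] for row in compound_rows})
compound_genes = sorted({row["gene_id"] for row in compound_rows})

# All evidence for one reaction, from either search direction.
reaction_rows = rows_for_reaction(rows, "reaction_3")
```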