ArMone: a software suite for the validation and processing of phosphoproteome data sets

 

Author:   Xinning Jiang

 

Contact:       Prof. Hanfa Zou

                 Prof. Mingliang Ye

 

 

[Figure: mainfrm.jpg]

 


 

I Introduction

Although several proteome pipelines have been developed to facilitate data processing in proteome research, they generally neither support phosphoproteome data sets well nor report them with sufficient information. The difficulty comes from features unique to the identification of phosphorylated peptides, such as poor fragmentation in MS2 and the multiple possible phosphorylation site localizations on a single peptide. To address these problems, we present ArMone, a software suite that generates phosphopeptide identifications and phosphosite localizations with high reliability and high sensitivity, and that simplifies the preparation of phosphoproteome data sets following the proteome guidelines (http://www.mcponline.org/misc/ParisReport_Final.dtl). ArMone also provides easy-to-use batch filtering and manual validation modules for further distinguishing false positive identifications in the high-confidence data set on the fly. It is a stand-alone application with a friendly graphical user interface that supports multiple operating systems and multiple database search engines. Because ArMone was originally developed for phosphoproteome research, it is particularly powerful and easy to use when processing phosphoproteome data sets.

 


 

II Requirements

1.     Java 2 Runtime Environment 6.0 update 12 (J2RE 6u12) or higher.

2.     Other required class libraries

(1) JFreeChart:

(2) IText:

3.     Database search algorithms:

SEQUEST, Mascot, X!Tandem, OMSSA, Crux, Inspect


 

III Modules and usage

1.   The peak list format conversion module.

✓  This module converts peak lists between the different formats used by database search algorithms.

✓  Click the “Peak list format conversion” button in the main frame of ArMone to invoke this module.

✓  Currently supported formats:

(1) SEQUEST dta format (*.dta)

(2) Mascot generic format (*.mgf)
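
The conversion between the two formats can be illustrated with a minimal sketch. It is not ArMone's own code: the function name `dta_to_mgf` and the default title are hypothetical. It relies on the usual conventions that a dta file starts with a line holding the singly protonated precursor mass (MH+) and the charge, while an MGF block stores the precursor m/z in PEPMASS.

```python
PROTON = 1.007276  # mass of a proton in Da


def dta_to_mgf(dta_text, title="spectrum"):
    """Convert the text of one .dta file into one MGF ion block (hypothetical helper)."""
    lines = [ln.split() for ln in dta_text.splitlines() if ln.strip()]
    # First dta line: MH+ (singly protonated precursor mass) and charge state.
    mh_plus, charge = float(lines[0][0]), int(lines[0][1])
    # MGF expects the observed precursor m/z, so convert MH+ to m/z at this charge.
    precursor_mz = (mh_plus + (charge - 1) * PROTON) / charge
    out = [
        "BEGIN IONS",
        f"TITLE={title}",
        f"PEPMASS={precursor_mz:.4f}",
        f"CHARGE={charge}+",
    ]
    # Remaining dta lines: one "m/z intensity" pair per peak.
    out += [f"{float(mz):.4f} {float(inten):.1f}" for mz, inten in lines[1:]]
    out.append("END IONS")
    return "\n".join(out) + "\n"
```

Going from MGF back to dta reverses the same arithmetic: MH+ = PEPMASS × charge − (charge − 1) × proton mass.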

 

[Figure: Format convertion.jpg]

 


 

2.   Peptide list creation module

✓  This module creates a common-format file that holds the peptide identification results from any of the supported database search algorithms together with the peak lists.

✓  Click the “Peptide list file creation” button in the main frame of ArMone to invoke this module.

✓  The peptide list file has the extension “ppl” and contains the following parts:

(1) Peptides identified by different search algorithms. Multiple peptides (top n, n > 1) are accepted for a single spectrum.

(2) The peak list of each spectrum used for peptide identification.

(3) Database search parameters, stored in a global format that supports the search parameters of all the supported database search algorithms.

(4) Association between the peptide identifications and the spectra peak lists.

✓  The module frame

 

✓  How to create a peptide list file (ppl) from the database search results?

1.     The supported database search algorithms:

(1) SEQUEST: in Bioworks, licensed by Thermo

(2) Mascot: Matrix Science

(3) X!Tandem

(4) OMSSA

(5) Inspect

(6) Crux

       The correct database search algorithm must be selected for your database search results.

2.     The database search result files. The accepted input formats for the different database search algorithms are described below:

(1) SEQUEST: preferably *.out files with the *.dta files in the same directory. The *.sqt file, which can be exported from Bioworks, is also supported.

(2) Mascot: the dat file, which can be found on the Mascot server.

(3) X!Tandem: the xml search result file.

(4) OMSSA: the omx search result file.

(5) Inspect:

(6) Crux: the sqt search result file.

3.     To obtain the peak list of each spectrum that led to a peptide identification, the raw spectra data file must be supplied. The currently accepted file formats are mzXML and mzData. Some database search results also contain the peak lists; for example, a SEQUEST search result contains a *.dta file corresponding to each *.out file. In this case, just select the “Embedded peak list” checkbox and the raw spectra file is not needed. Currently, only SEQUEST dta & out results can use this option.

4.     Select the FASTA database. It MUST be the same database as the one used for the database search.

5.     Select the output path of the ppl file. By default, the ppl file is written to the same directory as the input file.

6.     For the Mascot database search algorithm, the regular expression used to generate the accession of each protein during the preprocessing of the database MUST be selected. NOTICE: this value MUST be selected carefully. For other database search algorithms, this option is not needed.

7.     After all the entries are selected, click the “Add a task” button at the top of the left panel to add the task to the task list.

8.     Set the number n of top matched peptides to write to the peptide list file. For example, if n is set to 1, only the top matched peptide for each spectrum will be written to the peptide list file. NOTICE: if the original search results contain fewer matches for a spectrum than the set value, all of those matches will be written to the peptide list file.

9.     Repeat steps 1 to 7 to add other tasks. These tasks can be used to process search results from different database search algorithms.

10.  After all the tasks are added to the task list, select the tasks that need to be processed and click the “Start” button to run them. Wait until all the tasks have finished.
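
The top n rule in step 8 can be sketched as follows. This is only an illustration of the described behaviour, not ArMone's implementation; the function name `keep_top_n` and the `(scan, rank, peptide)` tuple layout are assumptions made for the example.

```python
from collections import defaultdict


def keep_top_n(matches, n):
    """Keep at most the n best-ranked peptides per spectrum (hypothetical helper).

    matches: iterable of (scan, rank, peptide) tuples, rank 1 being the best match.
    Returns {scan: [peptide, ...]} ordered by rank.
    """
    by_scan = defaultdict(list)
    for scan, rank, peptide in matches:
        by_scan[scan].append((rank, peptide))
    kept = {}
    for scan, hits in by_scan.items():
        hits.sort()  # best rank first
        # Slicing never fails: a spectrum with fewer than n matches keeps all of them.
        kept[scan] = [peptide for _, peptide in hits[:n]]
    return kept
```

With n = 2, a spectrum that has three matches keeps its two best, while a spectrum with a single match keeps that one, as the NOTICE in step 8 describes.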


3.   Peptide list file merge module.

✓  In some cases, database search results need to be merged for further processing, for example the fractions of a two-dimensional LC-MS/MS experiment.

✓  Add all the peptide list files that need to be merged, select the output path for the merged peptide list file, and click the start button to merge.


4.   Viewer of peptide list file.

✓  The peptide list file must be loaded into the viewer for further processing.

✓  The loading frame.

While loading, criteria can be preset to eliminate the vast majority of useless identifications. If the “use filters” check box is not selected, no filter is applied while loading.

The number of top matched peptides (top n) displayed for each spectrum can be set while loading. Default: top 1. NOTICE: if top n was set to 1 when the peptide list file was created, setting a larger number here has no effect.

NOTICE: the selected search algorithm must be the same as the one used for the results in the peptide list file. Otherwise, an error message will be shown.

 

✓  The peptides in the viewer.

To reduce memory usage, only a limited number of peptides is shown on the screen at a time. Turn to the next page to show the next portion of peptides.

1.     The terms for a peptide identification:

(1)  The common terms: the peptide index, the scan number and base name of the raw file, the sequence with variable modifications, MH+, Delta MH+, charge state, rank (top n matches in the search result), protein references for this peptide, pI value, and number of enzymatic termini.

(2) The database search engine terms: these terms are the scores for the peptide identifications. For example, for SEQUEST, the scores are Xcorr, DeltaCn, Sp, and Rsp (Rank of Sp).

(3) The check box indicating whether this peptide is selected for use. If it is deselected, the corresponding peptide will not be processed by the other function modules.

(4) Other special terms: for phosphopeptides, there is an additional Ascore term; for peptides with a probability calculation, an additional probability term is shown.

2.     Function modules:

(1) Showing peptide information, including the count of peptide hits for each charge state, the FDR of peptides with different charge states, and the global information for all charge states.

NOTICE: to show the FDR information, a composite database containing both target and decoy protein sequences must be used for the database search, and the composite database should be created using the decoy database creation module in ArMone.

(2) Filtering module: filters can be set to further control the displayed peptides. After setting a filter, use the peptide information module to show the information for the remaining peptides.

Combined with the peptide information module, this module makes it easy to generate peptide identifications at different confidence levels.

(3) Select or deselect all the peptides.

(4) The spectrum viewer module.

This is a real-time spectrum viewer. Using the floating window of the spectrum drawing panel, one can easily check the match quality of the current peptide identification. Click a peptide row in the peptide viewer panel to view its spectrum with labeled peaks, or use the up and down arrow keys to select the previous or next peptide.

(i)       The neutral loss peak panel. Check a term in the panel to label the corresponding neutral loss peak in the spectrum. For example, check the MH term to show the precursor peak, and check the MH-H3PO4 term to show the peak for the neutral loss of phosphoric acid.

In addition, you can manually set the mass of a neutral loss peak.

(ii)     The ion type panel. Select the ion types to show the matches between the experimental spectrum and the theoretical ions of those types. For example, select b & y type ions for spectra acquired using CID and c & z type ions for spectra acquired using ETD or ECD.

(iii)   Select whether to use the monoisotopic mass or the average mass for MS/MS matching.

(iv)  Set the match tolerance for MS/MS matching and the minimum intensity threshold (0 - 1) for a peak to be considered a valid match. Default: mass tol, 1.0 and min intensity, 0.1.
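
The per-charge-state hit counts and FDR values reported by the peptide information module (function module (1) above) can be illustrated with a short sketch. This manual does not state the formula ArMone uses, so the code assumes one common estimate for a composite target-decoy search, FDR = 2 × decoy hits / all hits. The function name `fdr_by_charge` and the `(charge, is_decoy)` input layout are hypothetical.

```python
from collections import Counter


def fdr_by_charge(peptides):
    """Summarise hits and estimated FDR per charge state (hypothetical helper).

    peptides: iterable of (charge, is_decoy) pairs from a composite
    target-decoy search. Returns {charge: (hits, fdr)} plus an "all" entry.
    """
    total, decoy = Counter(), Counter()
    for charge, is_decoy in peptides:
        for key in (charge, "all"):
            total[key] += 1
            decoy[key] += int(is_decoy)
    # Assumed estimate (not taken from this manual): FDR = 2 * decoy hits / all hits.
    return {key: (total[key], 2 * decoy[key] / total[key]) for key in total}
```

Calling the function again on the peptides that pass a filter mirrors the workflow of setting a filter and then re-opening the peptide information module.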


5.   Automatic filtering module and manual validation

✓  Automatically validate the peptide identifications using preset manual validation criteria.

✓  Parameters

1.     The ion types for the matches between the experimental spectra and the theoretical peptide ions.

2.     The minimum number of continuous b or y ions (or c or z ions). A peptide that matches the experimental spectrum with more than the specified number of continuous ions is considered a valid match.

3.     The peak list intensity filters.

(1) Exclude the neutral loss peaks. The neutral loss peaks to exclude can be selected from the right panel.

(2) The intensity filters for the peak lists: (i) remove the peaks with intensity lower than a specific percentage of the base peak, e.g. 10%; (ii) retain only a specific number of the highest peaks within an m/z region; (iii) retain a specific number of ions.

(3) The match ion tolerance

(4) Whether or not to use the monoisotopic mass

4.     Click the start button to begin.
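
The continuous-ion criterion can be sketched as follows. This is a simplified illustration under stated assumptions, not ArMone's algorithm: it considers only singly charged b and y ions of an unmodified peptide, uses monoisotopic masses, applies only the base-peak intensity filter, and treats the tolerance as Da. The function names are hypothetical.

```python
PROTON, WATER = 1.007276, 18.010565
# Monoisotopic residue masses in Da.
RESIDUE = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def fragment_mz(peptide):
    """Singly charged b and y ion m/z values of an unmodified peptide."""
    masses = [RESIDUE[aa] for aa in peptide]
    b = [sum(masses[:i]) + PROTON for i in range(1, len(masses))]
    y = [sum(masses[-i:]) + WATER + PROTON for i in range(1, len(masses))]
    return {"b": b, "y": y}


def longest_continuous_run(peptide, peaks, tol=1.0, min_rel_intensity=0.1):
    """Longest run of consecutive matched b ions or y ions.

    peaks: list of (m/z, intensity). Peaks below min_rel_intensity times the
    base peak are ignored before matching.
    """
    base = max(intensity for _, intensity in peaks)
    kept = [mz for mz, intensity in peaks if intensity >= min_rel_intensity * base]
    best = 0
    for series in fragment_mz(peptide).values():
        run = 0
        for theoretical in series:
            if any(abs(theoretical - mz) <= tol for mz in kept):
                run += 1
                best = max(best, run)
            else:
                run = 0  # a missing ion breaks the continuous series
    return best
```

A peptide would then be accepted when the returned run length exceeds the configured minimum.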

✓  Manual validation: please refer to the peptide viewer section.


6.   Classification filtering strategy for phosphopeptide identification (APIVASE II)

✓  Uses an MS2/MS3 strategy to improve phosphopeptide identifications.

✓  Details of the algorithm can be found in the original paper.

✓  Usage:

1.     Select the peptide list file for the MS2 spectra identifications and the peptide list file for the MS3 spectra identifications.

2.     Select the raw spectrum data file. Both mzData and mzXML are accepted.

3.     Set the tolerance for matching MS/MS ions.

4.     Set the data-dependent, neutral-loss-triggered MS3 strategy used during acquisition. For example, if the spectra were acquired in the series MS1-MS2-MS2-MS2-MS3-MS3-MS3 or MS1-MS2-MS3-MS2-MS3-MS2-MS3, the MS/MS3 count is 3.

5.     After all the necessary terms are selected, add the processing task.

6.     When all the tasks have been added, select the tasks to be processed and press the start button.

7.     The output phosphopeptide pairs and the phosphopeptides identified only from MS2 spectra are written to two different peptide list files. Load these ppl files for further processing as described in the other sections.
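
The meaning of the MS/MS3 count in step 4 can be illustrated with a small sketch that pairs MS2 and MS3 scans within one acquisition cycle. It assumes, purely for illustration, that the i-th MS2 scan of a cycle belongs with the i-th MS3 scan of the same cycle. How ArMone actually pairs the scans is described in the original paper, not in this manual, and the function name `pair_ms2_ms3` is hypothetical.

```python
def pair_ms2_ms3(levels, count):
    """Pair MS2 and MS3 scans inside each MS1-delimited cycle (hypothetical helper).

    levels: MS levels in acquisition order, e.g. [1, 2, 2, 2, 3, 3, 3].
    count:  the MS/MS3 count, i.e. the maximum number of pairs per cycle.
    Returns a list of (ms2_index, ms3_index) positions into levels.
    """
    pairs, ms2, ms3 = [], [], []

    def flush():
        # Assumed pairing rule: i-th MS2 of the cycle goes with i-th MS3.
        pairs.extend(zip(ms2[:count], ms3[:count]))
        ms2.clear()
        ms3.clear()

    for index, level in enumerate(levels):
        if level == 1:  # a new MS1 scan starts a new cycle
            flush()
        elif level == 2:
            ms2.append(index)
        elif level == 3:
            ms3.append(index)
    flush()
    return pairs
```

Both acquisition series given in step 4 yield three pairs per cycle under this rule, which is why the same MS/MS3 count of 3 applies to either of them.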

✓  The peak list data preprocessing module for APIVASE

1.     First extract the MS2 peak lists as *.dta files into a directory.

2.     Create a directory to hold the *.dta files of the MS3 spectra.

3.     Select the MS2 directory containing the extracted dta files and the MS3 directory.

4.     Select the raw spectra file.

5.     Set the MS/MS threshold and MS/MS count. Then add the data preprocessing task to the task list.

6.     After all the tasks are added, select them and click the start button.


7.   Phosphorylation specific modules

✓  These modules are specific to phosphorylated peptides or proteins.

1.     PhosphoSite statistics module. First load the peptides from a peptide list file. Then check “Show the phosphopeptide information” to show the statistics module.

NOTICE: the phosphorylation symbol MUST be set before computing the statistics.

(1) First set the database, then click the “show global site information” button to generate the phosphorylation site statistics.

(2) Use the “Export the site details” button to export the detailed phosphorylation site information to a csv file. The format of the phosphorylation detail file is similar to the following:

2.     Show only phosphopeptides or non-phosphopeptides. The phosphorylation symbol must be set before filtering. Check “show all”, “only show the non-phosphorylated peptides” or “only show the phosphorylated peptides” to set the filter.
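
The role of the phosphorylation symbol can be illustrated with a minimal sketch. It assumes that a modified residue is followed by the symbol in the peptide sequence (for example `AS#PEK` with `#` as the symbol); the actual symbol depends on the search parameters, and the function names and mode strings are hypothetical. Unlike the statistics module, the sketch counts sites per peptide and does not map them to protein positions, which would require the database.

```python
from collections import Counter


def is_phosphopeptide(sequence, symbol):
    """True if the sequence carries the phosphorylation symbol."""
    return symbol in sequence


def filter_peptides(sequences, symbol, mode="all"):
    """mode: "all", "phospho" or "non-phospho" (mirrors the three check boxes)."""
    if mode == "all":
        return list(sequences)
    want_phospho = mode == "phospho"
    return [s for s in sequences if is_phosphopeptide(s, symbol) == want_phospho]


def site_counts(sequences, symbol):
    """Count phosphorylated S, T and Y residues over all sequences."""
    counts = Counter()
    for seq in sequences:
        for i, ch in enumerate(seq):
            # The symbol marks the residue immediately before it (assumed notation).
            if ch == symbol and i > 0 and seq[i - 1] in "STY":
                counts[seq[i - 1]] += 1
    return counts
```

This also shows why the symbol must be set first: without it, neither the filter nor the site count can tell modified residues from unmodified ones.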