MS-GFDB

MS-GF+ Documentation home

MS-GFDB is an old application that is no longer under development. It was supserseded by MS-GF+.
MS-GF+ has all the functionalities provided by MS-GFDB, plus numerous improvements.

Differences between MS-GF+ and MS-GFDB

Input
- MS-GF+ supports mzML in addition to mzXML, mgf, ms2, pkl and _dta.txt
- "-t PrecursorMassTolerance" is optional with MS-GF+ (default 20ppm)
- "-c13 0/1/2" was changed to "-ti IsotopeErrorRange" in MS-GF+
  - IsotopeErrorRange: MinIsotopeError,MaxIsotopeError (both are inclusive)
  - -c13 x == -ti 0,x
- "-nnet" was changed to "-ntt" in MS-GF+
  - -nnet 0 == -ntt 2, -nnet 1 == -ntt 1, -nnet 2 == -ntt 0
- Modification file format change
  - In MS-GF+, the name of the modification should match the PSI-MS name (accessible from http://www.unimod.org)
  - CompositionStr can take Br, Cl, Fe, Se in addition to C, H, N, O, S, and P
  - The sequence of the atoms can be arbitrary.
    - Previously C2H2O was valid but OH2C2 was invalid
    - With MS-GF+, both are valid
- "-uniformAAProb 0/1" was deleted in MS-GF+
- "-addFeatures 0/1" was added to MS-GF+; "-addFeatures 1" will output the following extra features for each PSM (will be useful to downstream tools like Percolator or IDPicker):
  - MS2IonCurrent: Summed intensity of all product ions
  - ExplainedIonCurrentRatio: Summed intensity of all matched product ions (e.g. b, b-H2O, y, etc.) divided by MS2IonCurrent
  - NTermIonCurrentRatio: Summed intensity of all matched prefix ions (e.g. b, b-H2O, etc.) divided by MS2IonCurrent
  - CTermIonCurrentRatio: Summed intensity of all matched suffix ions (e.g. y, y-H2O, etc.) divided by MS2IonCurrent
- "-showQValue 0/1" was added to MS-GF+
- "-showDecoy 0/1" was added to MS-GF+
Output
- Output format for MS-GF+ is the HUPO PSI mzIdentML version 1.1 (*.mzid); see http://www.psidev.info/mzidentml for details.
  - Decoy protein prefix is "XXX" in MS-GF+ (vs. "REV" in MS-GFDB)
- MS-GF+ provides a converter from mzIdentML to tsv (the resulting tsv file will be similar to the MS-GFDB output file).
  - The converter is included in the MSGFPlus.jar file
  - It can be run by "java -Xmx2000M edu.ucsd.msjava.ui.MzIDToTsv"
  - A faster converter that supports larger result files is the Mzid-To-Tsv-Converter, available on GitHub. This is a C# application that works under Windows or on Linux with mono.
- Difference between the MS-GFDB output and the MS-GF+ TSV output
  - MS-GF+ includes SpecID (native spectrum ID) instead of SpecIndex
  - MS-GF+ reports IsotopeError
  - When a peptide matches to multiple proteins, all protein accessions will be reported by MS-GF+
  - SpecProb was renamed to SpecEValue in MS-GF+
  - MS-GF+ reports EValue (database-level E-value) instead of PValue (database-level P-value)
  - FDR and PepFDR were renamed to QValue and PepQValue in MS-GF+

MS-GFDB

Usage: java -Xmx2000M -jar MSGFDB.jar
	-s SpectrumFile (*.mzXML, *.mzML, *.mgf, *.ms2, *.pkl or *_dta.txt)
	-d DatabaseFile (*.fasta or .fa)
	-t ParentMassTolerance (e.g. 2.5Da, 30ppm, or 0.5Da,2.5Da)
	   Use comma to set asymmetric values. E.g. "-t 0.5Da,2.5Da" will set 0.5Da to the left (expMass<theoMass) and 2.5Da to the right (expMass>theoMass).
	[-o outputFileName] (Default: stdout)
	[-thread NumOfThreads] (Number of concurrent threads to be executed, Default: Number of available cores)
	[-tda 0/1] (0: don't search decoy database (default), 1: search decoy database to compute FDR)
	[-m FragmentationMethodID] (0: as written in the spectrum or CID if no info (Default), 1: CID, 2: ETD, 3: HCD, 4: Merge spectra from the same precursor)
	[-inst InstrumentID] (0: Low-res LCQ/LTQ (Default for CID and ETD), 1: High-res LTQ (Default for HCD), 2: TOF)
	[-e EnzymeID] (0: No enzyme, 1: Trypsin (Default), 2: Chymotrypsin, 3: Lys-C, 4: Lys-N, 5: Glu-C, 6: Arg-C, 7: Asp-N, 8: aLP, 9: Endogenous peptides)
	[-c13 0/1/2] (Number of allowed C13, Default: 1)
	[-nnet 0/1/2] (Number of allowed non-enzymatic termini, Default: 1)
	[-mod ModificationFileName] (Modification file, Default: standard amino acids with fixed C+57)
	[-minLength MinPepLength] (Minimum peptide length to consider, Default: 6)
	[-maxLength MaxPepLength] (Maximum peptide length to consider, Default: 40)
	[-minCharge MinPrecursorCharge] (Minimum precursor charge to consider if not specified in the spectrum file, Default: 2)
	[-maxCharge MaxPrecursorCharge] (Maximum precursor charge to consider if not specified in the spectrum file, Default: 3)
	[-n NumMatchesPerSpec] (Number of matches per spectrum to be reported, Default: 1)
	[-uniformAAProb 0/1] (0: use amino acid probabilities computed from the input database (default), 1: use probability 0.05 for all amino acids)

Parameters:

-s SpectrumFile (*.mzXML, *.mzML, *.mgf, *.ms2, *.pkl or *_dta.txt) - Required
- Spectrum file name. Currently, MS-GFDB supports the following file formats: mzXML, mzML, mgf, ms2, pkl and _dta.txt.
-d DatabaseFile (*.fasta or *.fa) - Required
- Path to the protein database file. If the database file does not have auxiliary index files (*.canno, *.cnlcp, *.csarr, and *.cseq), MS-GFDB will create them.
- When "-tda 1" option is used, the database must contain only target protein sequences.

If multiple MS-GFDB processes access the same database file, it is strongly recommended to index the database prior to the database search by running BuildSA (see below).

-t ParentMassTolerance - Required
- Parent mass tolerance in Da. or ppm. There must be no space between the number and the unit. E.g. 2.5Da, 30ppm
- To set asymmetric tolerances, use comma to separate left (experimental mass < theoretical mass) or right (experimental mass > theoretical mass) tolerances. E.g. 0.5Da,2.5Da
-o OutputFile (Default: stdout)
- Filename where the output will be written.
- The output will be printed to standard out by default.
-thread NumOfThreads (Number of concurrent threads to be executed, Default: Number of available cores)
- Number of concurrent threads to be executed together.
- Default value is the number of available logical cores (e.g. 8 for quad-core processor with hyper-threading support).
-tda 0/1 (0: don't search decoy database (default), 1: search decoy database to compute FDR)
- Indicates whether to search the decoy database or not.
- If 0, the decoy database is not searched and FDRs are theoretically derived from P-values (EFDR).
- If 1, FDRs are computed based on the target-decoy approach (i.e. reversed database is appended to the target database and MS-GFDB searches the combined database)
  - FDR(t) = #(DecoyPSMs with score equal or above t) / #(TargetPSMs with score equal or above t).
  - PSM: Peptide-Spectrum Match
  - -log(SpecProb) is used as the score to compute FDR.

If -tda 1 is specified, MS-GFDB automatically creates a combined target/reversed database file (DBFileName.revConcat.fasta). Thus, when specifying "-d" parameter, DatabaseFile must contain only target proteins.

-m FragmentationMethodID (0: as written in the spectrum or CID if no info (Default), 1: CID, 2: ETD, 3: HCD, 4: Merge spectra from the same precursor)
- Fragmentation method identifier (used to determine the scoring model).
- If the identifier is 0 and fragmentation method is written in the spectrum file (e.g. activationMethod field in mzXML files), MS-GFDB will recognize the fragmentation method and use a relevant scoring model.
- If the identifier is 0 and there is no fragmentation method information in the spectrum (e.g. mgf files), CID model will be used by default.
- If the identifier is non-zero and the spectrum has fragmentation method information, only the spectra that match with the identifier will be processed.
- If the identifier is non-zero and the spectrum has no fragmentation method information, MS-GFDB will process all spectra assuming the specified fragmentation method.
- If the identifier is 4, MS/MS spectra from the same precursor ion (e.g. CID/ETD pairs, CID/HCD/ETD triplets) will be merged and the "merged" spectrum will be used for searching instead of individual spectra. See Kim et al., MCP 2010 for details.
-inst InstrumentID (0: Low-res LCQ/LTQ (Default for CID and ETD), 1: TOF , 2: High-res LTQ (Default for HCD))
- Identifier of the instrument to generate MS/MS spectra (used to determine the scoring model).
- For "hybrid" spectra with high-precision MS1 and low-precision MS2, use 0.
- For usual low-precision instruments (e.g. Thermo LTQ), use 0.
- For TOF instruments, use 1.
- If MS/MS fragment ion peaks are of high-precision (e.g. tolerance = 10ppm), use 2.
-e EnzymeID (Default: 1)
- Enzyme identifier. Trypsin (1) will be used by default.
- 0: No enzyme, 1: Trypsin (default), 2: Chymotrypsin, 3: Lys-C, 4: Lys-N, 5: Glu-C, 6: Arg-C, 7: Asp-N, 8: alphaLP, 9: Endogenous peptides
-c13 0/1/2 (Number of allowed isotope errors, Default: 1)
- Instruments often choose 2nd or 3rd isotope peak instead of mono-isotope peak from MS1 spectrum.
- If this value is non-zero, expPeptideMass-1.00335 (i.e. mass(13C)-mass(12C)) and expPeptideMass-2.00671 (i.e. 2*(mass(C13)-mass(C12)) (only if -c13 2) will be considered along with expPeptideMass.
- If accurate precursor ion mass is available (e.g. LTQ-Orbitrap), it is better to set a narrow parent mass tolerance and non-zero -c13 value (e.g. -t 30ppm -c13 1) than to set a wide tolerance (e.g. -t 0.5Da,2.5Da).
- If the parent mass tolerance is equal to or larger than 0.5Da or 500ppm, this parameter will be ignored.
-nnet 0/1/2 (Number of allowed non-enzymatic termini, Default: 1)
- This parameter is used to determine the enzyme cleavage rule.
- Specifies the maximum number of peptide termini that are not cleaved by the enzyme.
- For example, for trypsin, K.ACDEFGHR.C, G.ACDEFGHR.C, K.ACDEFGHI.C and G.ACDEFGHR.C have 0, 1, 1 and 2 non-enzymatic termini, accordingly.
- By default, -nnet 1 will be used. Using -nnet 0 (or 2) will make the search significantly faster (slower).
-mod ModificationFile (Default: standard amino acids with fixed C+57)]
- Modification file name. ModificationFile contains the modifications to be considered in the search.
- If -mod option is not specified, standard amino acids with fixed Carbamidomethylation C will be used.
- See an example MS-GFDB modification file.
-minLength MinPepLength (Default: 6)
- Minimum length of the peptide to be considered.
-maxLength MaxPepLength (Default: 40)
- Maximum length of the peptide to be considered.
-minCharge MinPrecursorCharge (Default: 2)
- Minimum precursor charge to consider. This parameter is used only for spectra with no charge.
-maxCharge MinPrecursorCharge (Default: 3)
- Maximum precursor charge to consider. This parameter is used only for spectra with no charge.
-n NumMatchesPerSpec (Default: 1)
- Number of peptide matches per spectrum to report.
- Expected false discovery rates (EFDRs) will be reported only when this value is 1.
-uniformAAProb 0/1 (Default: 0)
- If 0, compute amino acid frequencies from the input database and use them as amino acid probabilities.
- If 1, use uniform amino acid probability (preferable when the database size is small).

MS-GFDB output

MS-GFDB outputs a tab-delimited file with the following columns: #SpecFile, Scan#, FragMethod, Precursor, PMError, Charge, Peptide, Protein, DeNovoScore, MSGFScore, SpecProb, P-value, EFDR.

SpecFile: spectrum file name
SpecIndex: spectrum index (1-based) in the file. The first spectrum has index 1, the second has index 2, and so on. For mzXML files this value is same as the scan number.
Scan#: scan number of the spectrum. If the scan number is not available, the value will be -1.
FragMethod: fragmentation method used to generate the spectrum (e.g. CID, ETD, etc.). When spectra from the same precursor are merged, fragmentation methods of merged spectra will be shown as a form "FragMethod1/FragMethod2/..." (e.g. CID/ETD, CID/HCD/ETD).
Precursor: precursor mass in m/z or ppm
Charge: precursor ion charge
Peptide: peptide sequence with neighboring amino acids
Protein: protein name
DeNovoScore: the score of the optimal scoring peptide (not necessary in the database)
MSGFScore: MS-GF raw score of the peptide-spectrum match (MSGFScore <= DeNovoScore)
SpecProb: spectral probability (spectrum level p-value) of the peptide-spectrum match
P-value: database level p-value (probability that a random PSM have an equal or better score against a random database of the same size)
EFDR or FDR: false discovery rate
- If "-tda 1" is specified, FDRs are estimated using the target-decoy approach using the spectral probability (SpecProb) as the score (the lower, the better).
- Otherwise, FDRs are estimated using P-values without searching the decoy database (EFDR). See Gupta et al., JASMS 2011 for details.
- MS-GFDB reports EFDR only when it is configured to report 1 peptide match per spectrum (i.e. -n 1).
- EFDR accurately estimates FDR when the parent mass tolerance is equal or larger than 0.5.
- EFDR conservatively estimates FDR when the parent mass tolerance is small.
  - E.g. When parent mass tolerance is 30ppm, at EFDR 1% threshold, one identifies approximately 7% less peptide-spectrum matches (PSMs) compared to the case when the target-decoy approach is used to estimate the FDR.
PepFDR
- Peptide-level FDR estimated using the target-decoy approach.
- Reported only if "-tda 1" is specified.
- If multiple spectra are matched to the same peptide, only the best scoring PSM (lowest SpecProb) is retained. After that, PepFDR is calculated as #DecoyPSMs>s / #TargetPSMs>s among the retained PSMs. This approximates the FDR of the set of unique peptides. In the MS-GFDB output, the same PepFDR value is given to all PSMs sharing the peptide. So, even a low-quality PSM may get a low PepFDR value (if it has a high-quality "sibling" PSM sharing the peptide). Note that this should not be used to count the number of identified PSMs.

MS-GFDB output example

    
        
          #SpecFile
          SpecIndex
          Scan#
          FragMethod
          Precursor
          PMError(ppm)
          Charge
          Peptide
          Protein
          DeNovoScore
          MSGFScore
          SpecProb
          P-value
          FDR
          PepFDR
        

          090121_NM_Trypsin_20.mzXML
          2838
          2838
          CID
          964.7707
          1.5199227
          3
          K.TIQNSSVSPTSSSSSSSSTGETQTQSSSR.L
          IPI:IPI00002349.2|SWISS-PROT:Q7Z417|TREMBL:A1L3A7|ENSEMBL:ENSP00000225388|REFSEQ:NP_065823|H-INV:HIT000001036|VEGA:OTTHUMP00000181037 Tax_Id=9606 Gene_Symbol=NUFIP2 Nuclear fragile X mental retardation-interacting protein 2
          190
          181
          9.380133E-30
          2.9333857E-22
          0.0
          0.0
        

          090121_NM_Trypsin_20.mzXML
          3671
          3671
          ETD
          1113.4758
          0.6583758
          2
          R.VGPADDGPAPSGEEEGEGGGEAGGK.E
          IPI:IPI00016725.2|SWISS-PROT:Q9UKN8|TREMBL:B3KNH2;Q05CN7|ENSEMBL:ENSP00000361219|REFSEQ:NP_036336|H-INV:HIT000071196|VEGA:OTTHUMP00000022434 Tax_Id=9606 Gene_Symbol=GTF3C4 General transcription factor 3C polypeptide 4
          162
          158
          1.9912463E-28
          6.0892146E-21
          0.0
          0.0
        

          090121_NM_Trypsin_20.mzXML
          3031
          3031
          ETD
          651.64874
          1.7510794
          3
          K.GAAAAAAASGAAGGGGGGAGAGAPGGGR.L
          IPI:IPI00644073.1|VEGA:OTTHUMP00000038687 Tax_Id=9606 Gene_Symbol=INTS3 18 kDa protein
          214
          202
          6.7318633E-28
          2.093763E-20
          0.0
          0.0
        

          090121_NM_Trypsin_20.mzXML
          19088
          19088
          CID
          1199.0916
          10.392676
          2
          K.VNFSPPGDTNSLFPGTWYLER.V
          IPI:IPI00945760.1|TREMBL:B7Z784;B7Z7M8;B7Z8R3|REFSEQ:NP_001159579 Tax_Id=9606 Gene_Symbol=HMGCS2 hydroxymethylglutaryl-CoA synthase, mitochondrial isoform 2 precursor
          243
          243
          2.9611275E-27
          8.838129E-20
          0.0
          0.0
        

          090121_NM_Trypsin_20.mzXML
          3030
          3030
          CID/ETD
          651.64874
          1.7510794
          3
          K.GAAAAAAASGAAGGGGGGAGAGAPGGGR.L
          IPI:IPI00644073.1|VEGA:OTTHUMP00000038687 Tax_Id=9606 Gene_Symbol=INTS3 18 kDa protein
          389
          389
          7.508096E-33
          2.335189E-25
          0.0
          0.0
        

    

#SpecFile	SpecIndex	Scan#	FragMethod	Precursor	PMError(ppm)	Charge	Peptide	Protein	DeNovoScore	MSGFScore	SpecProb	P-value
090121_NM_Trypsin_20.mzXML	2838	2838	CID	964.7707	1.5199227	3	K.TIQNSSVSPTSSSSSSSSTGETQTQSSSR.L	IPI:IPI00002349.2\|SWISS-PROT:Q7Z417\|TREMBL:A1L3A7\|ENSEMBL:ENSP00000225388\|REFSEQ:NP_065823\|H-INV:HIT000001036\|VEGA:OTTHUMP00000181037 Tax_Id=9606 Gene_Symbol=NUFIP2 Nuclear fragile X mental retardation-interacting protein 2	190	181	9.380133E-30	2.9333857E-22
090121_NM_Trypsin_20.mzXML	3671	3671	ETD	1113.4758	0.6583758	2	R.VGPADDGPAPSGEEEGEGGGEAGGK.E	IPI:IPI00016725.2\|SWISS-PROT:Q9UKN8\|TREMBL:B3KNH2;Q05CN7\|ENSEMBL:ENSP00000361219\|REFSEQ:NP_036336\|H-INV:HIT000071196\|VEGA:OTTHUMP00000022434 Tax_Id=9606 Gene_Symbol=GTF3C4 General transcription factor 3C polypeptide 4	162	158	1.9912463E-28	6.0892146E-21
090121_NM_Trypsin_20.mzXML	3031	3031	ETD	651.64874	1.7510794	3	K.GAAAAAAASGAAGGGGGGAGAGAPGGGR.L	IPI:IPI00644073.1\|VEGA:OTTHUMP00000038687 Tax_Id=9606 Gene_Symbol=INTS3 18 kDa protein	214	202	6.7318633E-28	2.093763E-20
090121_NM_Trypsin_20.mzXML	19088	19088	CID	1199.0916	10.392676	2	K.VNFSPPGDTNSLFPGTWYLER.V	IPI:IPI00945760.1\|TREMBL:B7Z784;B7Z7M8;B7Z8R3\|REFSEQ:NP_001159579 Tax_Id=9606 Gene_Symbol=HMGCS2 hydroxymethylglutaryl-CoA synthase, mitochondrial isoform 2 precursor	243	243	2.9611275E-27	8.838129E-20
090121_NM_Trypsin_20.mzXML	3030	3030	CID/ETD	651.64874	1.7510794	3	K.GAAAAAAASGAAGGGGGGAGAGAPGGGR.L	IPI:IPI00644073.1\|VEGA:OTTHUMP00000038687 Tax_Id=9606 Gene_Symbol=INTS3 18 kDa protein	389	389	7.508096E-33	2.335189E-25

BuildSA

Index a protein database for fast searching.

Usage: java -cp MSGFDB.jar msdbsearch.BuildSA
	-d DatabaseFile (*.fasta or *.fa)
	[-tda 0/1/2] (0: target only, 1: target-decoy database only, 2: both)

Parameters:

-d DbPath
- Name of a protein database (*.fasta or *.fa)
- Database file must ends with ".fasta" or ".fa".
-tda 0/1/2
- If 0, only "DatabaseFile" will be indexed.
- If 1, a new database file (*.revConcat.fasta) will be generated by appending reversed proteins. This forward-reverse database will be indexed.
- If 2, both the original database and the forward-reverse database file will be indexed.

BuildSA creates a suffix array of the protein database. For an input database file DBFileName.fasta, BuildSA will generate 4 auxiliary files (DbFileName.canno, DBFileName.cnlcp, DBFileName.csarr, DBFileName.cseq).It needs to be executed only once per each database file.