Abstract
Over the past decade, the widespread use of genomic approaches has generated complete genome sequences for over 800 bacterial species. However, the use of different bioinformatic approaches to determine the presence of a gene or open reading frame (ORF) in those genomes causes divergent gene annotations, even for data generated from the same genomic sequences. The use of a correct dataset for protein identification is a key step in many fields, such as phylogenetics and protein expression experiments, and it affects the identification capacity of a proteomic workflow. In this review, we describe successful attempts by proteomic groups to improve gene annotation in bacteria using different bioinformatic and mass spectrometry (MS) technologies. The review emphasizes the most recent advances in high-resolution MS technology, which have increased sequence coverage and peptide identification reliability severalfold. The capacity to compile deeper and more complete protein catalogues allows the correction of the annotation of several known genes, as well as the discovery of protein products from regions of the genome not yet predicted to be coding. Recent results from our group show how such technology can guide the correction of mistaken translational start site (TSS) choices for proteins of Mycobacterium tuberculosis and Mycobacterium leprae, as well as identify N-terminal peptides resulting from signal peptidase cleavage.
Keywords: bacteria, Mycobacterium, genomic sequences, gene annotation, protein translation, protein databases, mass spectrometry