Abstract
Using recently developed full-length cDNA technologies, several cDNA projects have carried out large-scale cDNA sequencing. Full-length cDNA resources now cover most protein-coding human genes. Comprehensive analyses of the collected full-length cDNA data revealed not only the complete sequences of thousands of novel gene transcripts but also novel alternatively spliced isoforms of previously identified genes. However, deducing the encoded amino acid sequences from the full-length cDNA sequences alone proved harder than expected. The longest open reading frame did not always correspond to the actual protein-coding region, nor was the first ATG always the translation initiation codon. In addition, proteome-wide mass-spectrometry analysis has shown that cells contain an unexpectedly large population of small proteins encoded by so-called upstream open reading frames. Because sound manual annotation by experts was still indispensable for addressing these problems, an international meeting, the H-invitational, was held to produce transcriptome-wide functional annotations of cDNAs. At this meeting, functional annotations were made both manually and computationally for most of the existing full-length cDNAs collected from cDNA projects worldwide. The resulting integrated information for each cDNA was published as a database. The full-length cDNA data also proved useful for identifying alternative splicing variants, the exact transcriptional start sites of mRNAs, and the adjacent promoter regions. Rapidly accumulating genome data, together with versatile use of transcriptome information, will soon lay a firm foundation for a proteome-level understanding of human gene networks.
Keywords: Full-length cDNA, transcriptome, upstream ATG, functional annotation, ORFeome