Abstract
We analyzed the frequency distribution of TATA Box and TATA extension sequences in six data sets of human promoters. The promoters in these sets differ in length and come from different types of genes (housekeeping genes, tissue-specific genes, and all genes). The statistical approach developed in this study first partitions the promoters into 20 bp bins and then calculates the frequency distribution of TATA elements and TATA extension sequences across the bins. The median value is used in calculating statistical significance, in order to capture outstanding TATA elements or TATA extension sequences. We found that two of the 16 TATA Box elements (TATAAAAG and TATATAAG) showed the sharpest peaks 10 to 30 bp upstream of the transcription start site, where the TATA Box is believed to reside. Among all TATA extension sequences, fourteen also showed their sharpest peaks at this location. Two of these fourteen have been verified as transcription factor binding sites by other studies, and we suggest that the remaining twelve are new putative TATA binding sites. We also found very little difference between the frequency distribution of TATA elements in housekeeping genes and that in tissue-specific genes.
Keywords: Promoter identification, statistical analysis, human gene, transcription factor, motif
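The binning step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `bin_counts` and `peak_over_median`, the toy sequences, and the choice to score a peak as its count minus the median bin count are all assumptions made for illustration. The sketch also assumes that every promoter has the same length and ends at the transcription start site.

```python
from statistics import median

BIN_SIZE = 20  # bin width in bp, as stated in the abstract


def bin_counts(promoters, motif, bin_size=BIN_SIZE):
    """Count motif occurrences per bin, summed over all promoters.

    Assumption: all promoters have equal length and end at the TSS,
    so the last bin is the one closest to the transcription start site.
    A hit is assigned to the bin that contains its first base.
    """
    length = len(promoters[0])
    n_bins = (length + bin_size - 1) // bin_size
    counts = [0] * n_bins
    for seq in promoters:
        start = seq.find(motif)
        while start != -1:
            counts[start // bin_size] += 1
            start = seq.find(motif, start + 1)  # allow overlapping hits
    return counts


def peak_over_median(counts):
    """Return (index of the highest bin, its count minus the median count).

    Assumption: the median bin count is taken as the background level;
    the study's actual significance calculation may differ.
    """
    peak = max(range(len(counts)), key=counts.__getitem__)
    return peak, counts[peak] - median(counts)


# Toy data (hypothetical): three 60 bp promoters, i.e. three bins.
toy_promoters = [
    "C" * 40 + "TATAAAAG" + "C" * 12,
    "C" * 42 + "TATAAAAG" + "C" * 10,
    "TATAAAAG" + "C" * 52,
]
toy_counts = bin_counts(toy_promoters, "TATAAAAG")
toy_peak = peak_over_median(toy_counts)
```

On real data, the same per-bin counts would be computed for each TATA element and each extension sequence, and the resulting peak scores compared across sequences.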