Abstract
Sequence alignment has become an essential tool in many areas of molecular biology for studying the structure-function relationships of proteins. With the rapid growth in the number of available sequences, alignments produced by various computational methods provide a substantial amount of information. These approaches have become crucial for formulating working hypotheses ahead of time-consuming bench work, such as protein engineering and site-directed mutagenesis. However, alignment methods still leave considerable room for improvement. All methods are severely limited in the twilight zone, which lies around 25% identity between pairs of sequences. More worrying is the very high rate of false-positive results generated by most algorithms, whose results depend on empirical parameters and are hard to validate by statistical criteria. After reviewing the main methods, this paper draws users' attention to the fact that evaluations of algorithm performance are limited entirely to alignment power (sensitivity). With reference to a given truth defined from the alignment of known structures, power is defined as the proportion of the truth that is recovered in the solution. Power may be overestimated when independent sets of poorly related sequences are lacking, and its value depends entirely on the criterion used to define the truth. Confidence (selectivity), on the other hand, represents the proportion of the solution that is true. Depending on the method and the parameters used, confidence may be much lower than power, yet it is almost never evaluated. For non-trivial alignments, when power is high, confidence is low, which means that correctly aligned positions are embedded in large regions that are wrongly aligned. One possible solution to these problems is to use a consensus of several multiple alignment methods, which will increase the confidence of the results.
Adding external information, such as predicted secondary structure and/or predicted solvent accessibility, is another way that should improve the performance of existing multiple alignment methods.
Keywords: sequence alignment, scoring matrix, twilight zone, alignment methods, matrices specificity
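The definitions of power and confidence given in the abstract, and the effect of taking a consensus of methods, can be illustrated with a minimal sketch. It assumes that an alignment is represented as a set of aligned residue pairs (i, j), and that the consensus is the strict intersection of the pairs reported by every method. Both are illustrative assumptions, not the representation or consensus rule used by the paper. The function names and the toy data are hypothetical.

```python
# Illustrative sketch only: alignments are modelled as sets of aligned
# residue pairs (i, j). This representation is an assumption made for
# the example, not one taken from the paper.

def power(truth, solution):
    """Proportion of the truth that is recovered in the solution."""
    return len(truth & solution) / len(truth) if truth else 0.0


def confidence(truth, solution):
    """Proportion of the solution that is true."""
    return len(truth & solution) / len(solution) if solution else 0.0


def consensus(solutions):
    """Pairs reported by every method (strict intersection, an assumed rule)."""
    solutions = list(solutions)
    if not solutions:
        return set()
    return set.intersection(*solutions)


# Hypothetical toy data: a reference alignment and two method outputs.
truth = {(1, 1), (2, 2), (3, 3), (4, 4)}
method_a = {(1, 1), (2, 2), (3, 3), (4, 4), (5, 6), (6, 7), (7, 8), (8, 9)}
method_b = {(1, 1), (2, 2), (3, 3), (5, 5), (6, 6)}
agreed = consensus([method_a, method_b])

for name, sol in [("method_a", method_a), ("method_b", method_b), ("consensus", agreed)]:
    print(name, "power =", power(truth, sol), "confidence =", confidence(truth, sol))
```

In this toy case method_a recovers the whole reference but only half of what it reports is correct, whereas the consensus keeps fewer pairs and all of them are correct. This mirrors the trade-off described above; real alignments need not behave so cleanly.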