Abstract
Predicting protein structure and function from amino acid sequences is a central aim of bioinformatics. Most bioinformatics analyses use sequence alignment as the basis for measuring similarity. However, there is growing evidence that many protein families are resistant to this straightforward method of comparison. Increasingly, machine-learning techniques are being combined with abstract representations of protein sequences to classify proteins by the similarity of their physico-chemical properties rather than by sequence alignment scores. This approach is particularly effective for protein families that are structurally conserved but appear to lack conserved sequences. Here we describe the inherent limitations of alignment-dependent approaches to protein classification and present ‘alignment-free’ representations as a viable and realistic alternative for solving complex problems in bioinformatics.
Keywords: Multiple alignment, protein motifs, alignment independent, protein classification, discrete form, sequential form, local descriptors, proteochemometrics