Abstract
Comparing DNA sequences is a basic task in computational biology and bioinformatics; it helps to infer the structure, function, and evolutionary relationships of sequences that were previously poorly characterized. In this article, we propose a novel DNA sequence comparison scheme that constructs feature vectors based on Markov chains and information entropy. A new measure, calculated as the entropy of a K-string’s four one-step transition probabilities, is used to compose the feature vector that characterizes a DNA sequence. We also introduce a new concept, named the K-string list, to address the computational burden caused by the exponential growth in complexity as K increases in a traditional K-string model. The proposed scheme allows us to conduct similarity research and phylogenetic analysis on two real datasets, the first exon of 11 species’
Keywords: DNA sequence comparison, entropy, feature vector, K-string list, Markov model, phylogenetic analysis.
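To make the measure described in the abstract concrete, the following is a minimal sketch of how the entropy of a K-string's four one-step transition probabilities could be computed. It is an illustration under stated assumptions, not the authors' implementation: the function name `transition_entropies` and the constant `ALPHABET` are hypothetical, transition probabilities are estimated as simple relative frequencies of the nucleotide that follows each occurrence of a K-string, entropy is taken in bits, and the K-string list concept is not implemented here.

```python
from collections import Counter, defaultdict
from math import log2

ALPHABET = "ACGT"  # hypothetical constant: the four nucleotides


def transition_entropies(seq, k):
    """Map each K-string observed in `seq` to the Shannon entropy (in bits)
    of its one-step transition distribution over A, C, G and T.

    Assumption: probabilities are relative frequencies of the nucleotide
    that immediately follows each occurrence of the K-string.
    """
    counts = defaultdict(Counter)
    # Slide over the sequence; each window of length k is followed by one base.
    for i in range(len(seq) - k):
        kstring = seq[i:i + k]
        nxt = seq[i + k]
        # Skip windows containing ambiguous symbols such as 'N'.
        if nxt in ALPHABET and all(c in ALPHABET for c in kstring):
            counts[kstring][nxt] += 1

    entropies = {}
    for kstring, nxt_counts in counts.items():
        total = sum(nxt_counts.values())
        h = 0.0
        for base in ALPHABET:
            p = nxt_counts[base] / total
            if p > 0:  # 0 * log(0) is treated as 0
                h -= p * log2(p)
        entropies[kstring] = h
    return entropies
```

Calling `transition_entropies(sequence, k)` on a DNA string returns one entropy value per observed K-string. A K-string that is always followed by the same nucleotide has entropy 0, and one followed equally often by all four nucleotides has entropy 2 bits. How these values are arranged into the feature vector is left open in this sketch.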