Abstract
Several alignment-free sequence comparison methods are available, each measuring similarity on the basis of a particular numerical descriptor of biological sequences. Any loss of information incurred in transforming a sequence into a numerical descriptor affects the results. A pool of descriptors computed by different algorithms is expected to minimize this loss, and that approach is taken here to study the similarity of DNA sequences that are homogeneous or heterogeneous. Several numerical descriptors for the characterization of DNA sequences are described, based on an information-theoretic approach, on the connectivity of vertex-weighted line graphs, and on matrices derived from graphs that depict DNA sequences as random walks on a Euclidean plane. The information-theoretic descriptors were obtained with the L-tuple approach for combinations of different numbers of bases. The connectivity-type descriptors were calculated by converting the DNA sequence into vertex-weighted graphs in which the vertices (nucleotides) were assigned weights based on the pKa of the bases. The graphical representations were converted into numerical descriptors by constructing matrices. Computer programs were developed to calculate seventy DNA descriptors for 560 sequences from different types of organisms. After an initial data analysis to eliminate almost perfectly correlated descriptors, orthogonal descriptors were obtained by principal component analysis. The principal components (PCs) were used to construct an N-dimensional similarity space in which the 560 sequences were clustered by the k-means clustering algorithm. Five PCs (orthogonal descriptors) were extracted and found to explain 92% of the data variance, so the sequences were clustered in a five-dimensional similarity space.
The similarity/dissimilarity-based clustering procedure using numerical descriptors was found to be effective for studying the similarity and dissimilarity of a large number of sequences.
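To make the information-theoretic descriptors more concrete, the following is a minimal Python sketch of an L-tuple information content, computed as the Shannon entropy of L-tuple frequencies. It is an illustration under stated assumptions, not the programs developed for this study: the function name `l_tuple_information_content`, the use of overlapping L-tuples, and the choice of base-2 logarithms are all assumptions made here.

```python
from collections import Counter
from math import log2


def l_tuple_information_content(sequence: str, L: int = 1) -> float:
    """Shannon entropy (bits per L-tuple) of the L-tuples in a DNA sequence.

    Assumption: L-tuples are read as overlapping windows that advance
    one base at a time; the study may define them differently.
    """
    seq = sequence.upper()
    if L < 1 or len(seq) < L:
        raise ValueError("L must be at least 1 and no longer than the sequence")

    # Collect every overlapping window of length L.
    tuples = [seq[i:i + L] for i in range(len(seq) - L + 1)]
    counts = Counter(tuples)
    total = len(tuples)

    # H = -sum(p * log2(p)) over the observed L-tuple frequencies.
    return -sum((n / total) * log2(n / total) for n in counts.values())
```

For example, `l_tuple_information_content("ACGT", 1)` evaluates to 2.0 bits because the four bases are equally frequent, whereas a homopolymer such as `"AAAA"` gives zero. Calling the function with L = 1, 2, 3 and so on yields one descriptor per tuple length.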
Keywords: Alignment-free, sequence comparison, DNA, similarity, numerical characterization, information content, probability distribution, genomic evolution, coding, non-coding regions, nucleotide distribution, overlapping path, amino-keto ratio, purine-pyrimidine ratio, transformation of sequence, sequence, L-Tuple, zero-order, Codons, sequence information content, complementary sequence information content, Physicochemical Properties, Randic connectivity index, vertex degrees, bond-order, valency, connectivity-based descriptors, nitrogenous base, dissociation constant, Secondary Descriptors, negative X-direction, Wiener index, distance matrix, Leading Eigenvalue, square symmetric containing, Ovality, Graph radius, Nandy-type representation, homogenous, heterogeneous, multi-collinearity, Orthogonalization, SPSS software, Five-Dimensional Similarity Clustering, beta globin, alpha globin, mitochondrial DNA, accession number, Bacterial domain, monophyletic cluster, Influenza-A virus subtype, unicellular green alga sequence, base composition, pair-wise alignment
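The step that eliminates almost perfectly correlated descriptors before principal component analysis can be sketched as a simple greedy filter. This is a hedged illustration only: the abstract does not state how the filtering was done, so the function names `pearson` and `drop_correlated`, the greedy keep-first strategy, and the 0.95 threshold are hypothetical choices.

```python
from math import sqrt
from typing import Dict, List


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant descriptor has no defined correlation
    return sxy / sqrt(sxx * syy)


def drop_correlated(descriptors: Dict[str, List[float]],
                    threshold: float = 0.95) -> List[str]:
    """Return the names of descriptors to keep.

    A descriptor is kept only if its absolute correlation with every
    descriptor already kept is below the (hypothetical) threshold.
    """
    kept: List[str] = []
    for name, values in descriptors.items():
        if all(abs(pearson(values, descriptors[k])) < threshold for k in kept):
            kept.append(name)
    return kept
```

Each dictionary entry holds one descriptor evaluated over all sequences. The surviving descriptors would then be passed to principal component analysis and k-means clustering; because the filter is greedy, its result depends on the order in which descriptors are supplied.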