Abstract
Motifs, known as short and recurring patterns in DNA sequences, are presumed to have biological significance. They can indicate binding sites for transcription factors, which regulate gene expression. Finding motifs from biological sequences is a major task for unraveling the mechanisms of gene expression. Many algorithms have been designed for various problem models, among which, the problem model with substitution, deletion and insertion in motifs is especially a challenge. In 2010, Le et al. designed an algorithm named HIGEDA, which combines EM (expectation maximization) algorithm with GA (genetic algorithm), to find gapped motifs (motifs with substitution and deletion) in biological sequences. In this work, we improve HIGEDA by developing an algorithm named GAEM to solve the planted edited motif finding problem, i.e., to find motifs with substitution, deletion, and also insertion in DNA sequences. We test GAEM on both simulated DNA datasets and DNA transcription factor binding site datasets. More precisely, performance evaluation of GAEM is carried on simulated datasets for the planted edited motif finding problem, where the simulation results show that GAEM can recover motifs with higher and more stable success rate as compared with HIGEDA. For DNA transcription factor binding site datasets, one eukaryotic dataset, ere, and four Escherichia coli datasets, crp, arcA, argR and purR, are used to compare GAEM with HIGEDA, GLAM2 and MEME. The simulation results show that GAEM performs better on finding motifs in realistic DNA datasets as compared with the other three algorithms.
Keywords: Deletion and insertion, dynamic programming, expectation maximization, gene-set based genetic algorithm, motif finding.