Abstract
Background: Duplicate content often corrupts the filtering mechanism in online question answering. Moreover, as users are usually more comfortable conversing in their native language questions, transliteration adds to the challenges in detecting duplicate questions. This compromises with the response time and increases the answer overload. Thus, it has now become crucial to build clever, intelligent, and semantic filters which semantically match linguistically disparate questions.
Objective: Most of the research on duplicate question detection has been done on mono-lingual, majorly English Q&A platforms. The aim is to build a model which extends the cognitive capabilities of machines to interpret, comprehend, and learn features for semantic matching in transliterated bi-lingual Hinglish (Hindi + English) data acquired from different Q&A platforms.
Methods: In the proposed DQDHinglish (Duplicate Question Detection) Model, firstly language transformation (transliteration & translation) is done to convert the bi-lingual transliterated question into a monolingual English only text. Next, a hybrid of Siamese neural network containing two identical Long-Term- Short-Memory (LSTM) models and Multi-layer perceptron network is proposed to detect semantically similar question pairs. Manhattan distance function is used as the similarity measure.
Results: A dataset was prepared by scrapping 100 question pairs from various social media platforms, such as Quora and TripAdvisor. The performance of the proposed model on the basis of accuracy and Fscore. The proposed DQDHinglish achieves a validation accuracy of 82.40%.
Conclusion: A deep neural model was introduced to find a semantic match between an English question and a Hinglish (Hindi + English) question such that similar intent questions can be combined to enable fast and efficient information processing and delivery. A dataset was created and the proposed model was evaluated on the basis of performance accuracy. To the best of our knowledge, this work is the first reported study on transliterated Hinglish semantic question matching.
Keywords: Duplicate detection, bi-lingual, transliteration, siamese network, deep learning, LSTM.
Graphical Abstract