Abstract
Background: Part-of-speech (POS) tagging is the process of assigning each word its correct grammatical category based on its meaning and context in a text document. It is one of the preliminary steps in the processing of natural language text. Any error in POS tagging propagates to the downstream natural language processing (NLP) applications that depend on it, so tagging must be accurate and reliable.
Aim: The purpose of this study is to develop a deep-level tagger for Malayalam that indicates the semantics of nouns and verbs in a text document.
Methods: The proposed model is a two-tier architecture that combines deep learning and rule-based approaches. The first tier consists of a tagging model trained on a tagged corpus of 287,000 words. To improve the depth of tagging, a suffix stripper is also used, which provides morphological features to the shallow machine learning model.
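As an illustration of how a suffix stripper can supply morphological features, the following is a minimal sketch only. The romanized suffix list, the feature labels, and the function name strip_suffix are assumptions made for this example and are not the rule set used in the proposed tagger.

```python
# Illustrative, romanized case suffixes mapped to a feature label.
# This table is an assumption for the sketch, not the authors' rules.
SUFFIXES = {
    "kku": "DATIVE",
    "nte": "GENITIVE",
    "ute": "GENITIVE",
    "aal": "INSTRUMENTAL",
    "il": "LOCATIVE",
}


def strip_suffix(word, min_stem=2):
    """Return (stem, feature) for the longest matching suffix.

    A suffix is stripped only if at least `min_stem` characters remain,
    which avoids treating a very short word as a bare suffix.
    """
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)], SUFFIXES[suffix]
    return word, "NONE"


# Example with romanized forms chosen for illustration.
for token in ["viittil", "avante", "il"]:
    print(token, strip_suffix(token))
```

In a pipeline of this kind, the returned feature label could be passed to the second-stage classifier as an additional input alongside the word itself.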
Results: The system is trained on 230,000 words and tested on 57,000 words. The tagging accuracy of the phase-1 architecture is 92.03%, and that of the phase-2 architecture is 98.11%. The overall tagging accuracy is 91.82%.
Conclusion: The distinguishing feature of the proposed tagger is the depth at which it tags nouns. This deep-level information can be used in semantic processing applications for natural language text, such as anaphora resolution, text summarization, and machine translation.
Keywords: POS tagging, Malayalam, deep level tagging, LSTM, sequence-to-sequence learning, word embeddings, MLP.
Graphical Abstract