This work aims to alleviate the loss of translation quality caused by the frequent occurrence of Out-of-Vocabulary (OOV) words during machine translation of low-resource languages (LRLs). We propose a novel word-to-character embedding mapping algorithm and apply it to three variants of attention-based seq2seq models to transduce such words from Hindi to Bhojpuri (an LRL instance), learning from a set of cognate pairs built upon a bilingual Hindi-Bhojpuri dictionary. We generalize our method to a similar Hindi-Bangla cognate pair dataset, and we also construct an artificial parallel corpus that we leverage to carry out a first-ever Hindi-Bhojpuri machine translation task. A detailed error analysis further shows that our models deliver better results than those of the Transformer models.
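To make the idea of character-level cognate transduction concrete, the sketch below trains a small character-level encoder-decoder with dot-product (Luong-style) attention on toy (Hindi, Bhojpuri) word pairs. It is only an illustrative assumption of how such a transducer could be set up, not the paper's exact word-to-character embedding mapping or architecture, and the transliterated pairs are placeholders for the real dictionary data.

```python
# Minimal character-level seq2seq with attention for cognate transduction.
# Illustrative sketch only; not the paper's exact model or data.
import torch
import torch.nn as nn

PAD, SOS, EOS = 0, 1, 2

def build_vocab(pairs):
    chars = sorted({c for src, tgt in pairs for c in src + tgt})
    return {c: i + 3 for i, c in enumerate(chars)}  # reserve 0-2 for specials

def encode(word, stoi):
    return [SOS] + [stoi[c] for c in word] + [EOS]

class CharSeq2Seq(nn.Module):
    def __init__(self, vocab_size, emb=64, hid=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb, padding_idx=PAD)
        self.encoder = nn.GRU(emb, hid, batch_first=True)
        self.decoder = nn.GRU(emb, hid, batch_first=True)
        self.out = nn.Linear(hid * 2, vocab_size)

    def forward(self, src, tgt_in):
        enc_out, h = self.encoder(self.emb(src))         # (B, S, H)
        dec_out, _ = self.decoder(self.emb(tgt_in), h)   # (B, T, H)
        # Dot-product attention of decoder states over encoder states.
        scores = torch.bmm(dec_out, enc_out.transpose(1, 2))  # (B, T, S)
        context = torch.bmm(torch.softmax(scores, dim=-1), enc_out)
        return self.out(torch.cat([dec_out, context], dim=-1))

# Toy transliterated cognate pairs (placeholders for dictionary entries).
pairs = [("ladka", "laika"), ("ghar", "ghar"), ("pani", "paani")]
stoi = build_vocab(pairs)
model = CharSeq2Seq(len(stoi) + 3)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss(ignore_index=PAD)

for epoch in range(50):
    for src_w, tgt_w in pairs:
        src = torch.tensor([encode(src_w, stoi)])
        tgt = torch.tensor([encode(tgt_w, stoi)])
        logits = model(src, tgt[:, :-1])                 # teacher forcing
        loss = loss_fn(logits.reshape(-1, logits.size(-1)), tgt[:, 1:].reshape(-1))
        opt.zero_grad(); loss.backward(); opt.step()
```

In practice the cognate pairs would come from the bilingual dictionary described above, and decoding at inference time would be greedy or beam search over characters rather than teacher forcing.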

Update: Our article has recently been accepted at the Journal of Language Modeling. A preprint and the code are publicly available.
