Data generation by combining Markov Chains and Word Embeddings

Our research team, formed by Eva Martínez García, Alberto Nogales and Álvaro García Tejedor, has just published a new research paper in collaboration with Javier Morales, a UFV lecturer and Avanade representative. The paper presents a method for generating a new corpus from existing texts. Data augmentation methods of this kind matter because state-of-the-art Natural Language Processing (NLP) techniques are neural-based and therefore highly data-dependent. Specifically, the paper presents a hybrid method that combines Markov Chains and Word Embeddings to generate new high-quality sentences similar to those in an initial dataset, which makes it possible to augment the training data. The method was validated by building several Transformer-based Language Models (LMs) with data from three different domains and evaluating how well each LM was able to model each domain's language.
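To give a feel for the general idea, here is a minimal Python sketch of how a Markov chain and word embeddings could be combined for text augmentation. It is not the algorithm from the paper, since this post does not describe how the two components are wired together. It assumes one plausible reading: a first-order Markov chain proposes a word sequence, and some words are then swapped for their nearest neighbour in an embedding space. The corpus, the two-dimensional embedding vectors, and all function names (`build_chain`, `generate`, `nearest`, `augment`) are invented for illustration.

```python
import math
import random
from collections import defaultdict

START, END = "<s>", "</s>"  # hypothetical sentence boundary markers


def build_chain(sentences):
    """First-order Markov chain: map each word to the words observed after it."""
    chain = defaultdict(list)
    for sentence in sentences:
        words = [START] + sentence.split() + [END]
        for current, following in zip(words, words[1:]):
            chain[current].append(following)
    return chain


def generate(chain, rng, max_len=20):
    """Random walk over the chain from START until END or max_len words."""
    word, output = START, []
    while len(output) < max_len:
        word = rng.choice(chain[word])
        if word == END:
            break
        output.append(word)
    return output


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms


def nearest(word, embeddings):
    """Closest other word in embedding space; the word itself if it has no vector."""
    if word not in embeddings:
        return word
    others = [w for w in embeddings if w != word]
    return max(others, key=lambda w: cosine(embeddings[word], embeddings[w]))


def augment(words, embeddings, rng, swap_prob=0.3):
    """Swap each word for its nearest embedding neighbour with probability swap_prob."""
    return [nearest(w, embeddings) if rng.random() < swap_prob else w for w in words]


# Toy data, invented for this sketch (a real setup would use trained embeddings).
corpus = [
    "the model learns the domain language",
    "the model generates new sentences",
    "new sentences extend the training data",
]
embeddings = {
    "model": [0.9, 0.1],
    "system": [0.8, 0.2],
    "sentences": [0.1, 0.9],
    "texts": [0.2, 0.8],
}

rng = random.Random(0)  # fixed seed so runs are repeatable
chain = build_chain(corpus)
sentence = generate(chain, rng)
print(" ".join(augment(sentence, embeddings, rng)))
```

In this sketch the Markov chain keeps the local word order close to the original corpus, while the embedding step introduces vocabulary that never appeared in it. Raising `swap_prob` makes the output more varied at the cost of drifting further from the source sentences.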

The paper can be found at this link.
