Data augmentation

One of the largest bottlenecks in building machine learning systems is collecting enough labeled data to train the downstream models. This project focuses on automatically enlarging a small set of labeled training examples with a larger set of samples generated by an unsupervised method.

The idea comes from our recent experience generating synthetic galaxies. Given a set of galaxy images, each with an associated label (e.g., age), we can apply Variational Auto-encoders (VAEs) to change the label and generate new galaxies. The goal of this project is to apply the same technique to text. Our own information extraction system will be used to automatically label a large text corpus with the specified attributes. The student will then develop a Variational Auto-encoder that generates new text by changing the label. The approach will be evaluated on a relation extraction task.
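To make the label-swapping idea concrete, here is a minimal sketch of one common way to set it up: a label-conditioned VAE, in which both encoder and decoder see the label. It is not the project's actual model. The weights are random and untrained, the encoder and decoder are single linear maps, and the vector x merely stands in for some fixed-length representation of a sentence. All names and sizes (INPUT_DIM, NUM_LABELS, LATENT_DIM, augment, and so on) are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
INPUT_DIM, NUM_LABELS, LATENT_DIM = 16, 3, 4

# Randomly initialised (untrained) linear encoder and decoder weights.
W_enc = rng.normal(scale=0.1, size=(INPUT_DIM + NUM_LABELS, 2 * LATENT_DIM))
W_dec = rng.normal(scale=0.1, size=(LATENT_DIM + NUM_LABELS, INPUT_DIM))


def one_hot(label, num_labels=NUM_LABELS):
    """Encode an integer label as a one-hot vector."""
    vec = np.zeros(num_labels)
    vec[label] = 1.0
    return vec


def encode(x, label):
    """Map (input, label) to the mean and log-variance of q(z | x, y)."""
    h = np.concatenate([x, one_hot(label)]) @ W_enc
    return h[:LATENT_DIM], h[LATENT_DIM:]


def sample_latent(mu, logvar):
    """Reparameterisation trick: z = mu + sigma * eps, with eps ~ N(0, I)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps


def decode(z, label):
    """Map (latent code, label) back to input space."""
    return np.concatenate([z, one_hot(label)]) @ W_dec


def kl_to_standard_normal(mu, logvar):
    """KL(q(z | x, y) || N(0, I)), the regulariser in the VAE objective."""
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)


def augment(x, source_label, target_label):
    """Encode under the original label, then decode under a different one."""
    mu, logvar = encode(x, source_label)
    z = sample_latent(mu, logvar)
    return decode(z, target_label)


# Stand-in for the representation of one labeled sentence.
x = rng.normal(size=INPUT_DIM)
x_new = augment(x, source_label=0, target_label=2)
```

In a trained model the latent code z would ideally capture the content that is independent of the label, so decoding it with a different label yields a new example carrying that label. For text, the linear decoder would have to be replaced by a sequence model that emits tokens.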

Supervised by Ce Zhang

ce.zhang@inf.ethz.ch