TMLS2021 Workshop: NLP Without a Ready-made Labeled Dataset

About this event

Speaker: Sowmya Vajjala, Researcher, National Research Council, Canada


NLP tutorials and workshops typically start with a labeled (annotated) dataset and then discuss different ways of representing text and building models. In many real-world scenarios, however, we don't have the luxury of a dataset that is already labeled.

We often end up with a problem and an idea for how to solve it, but no dataset to start working with. In this workshop, I will introduce some ways of approaching this situation: looking for existing datasets, annotating data manually, labeling data automatically, augmenting data, and transfer learning.

What You'll Learn:

- Different ways of collecting labeled data

- Automatic data labeling

- Quick manual annotation of data

- Data augmentation
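
As a rough illustration of what automatic data labeling can look like (this is not taken from the workshop materials), the sketch below applies a few hand-written keyword rules to unlabeled text and keeps a label only when the rules agree. The task (toy sentiment labels), the word lists, and every function name here are assumptions made up for the example.

```python
import re
from collections import Counter

# Hypothetical keyword lists for a toy sentiment task (assumed, not from the workshop).
POSITIVE_WORDS = {"great", "love", "excellent"}
NEGATIVE_WORDS = {"terrible", "hate", "broken"}


def tokens(text):
    """Lowercase the text and return its set of word tokens."""
    return set(re.findall(r"[a-z']+", text.lower()))


# Each rule returns a label, or None to abstain.
def rule_positive(text):
    return "pos" if tokens(text) & POSITIVE_WORDS else None


def rule_negative(text):
    return "neg" if tokens(text) & NEGATIVE_WORDS else None


def rule_refund(text):
    return "neg" if "refund" in tokens(text) else None


RULES = [rule_positive, rule_negative, rule_refund]


def auto_label(text):
    """Majority vote over the rules; None if no rule fires or the vote is tied."""
    votes = [rule(text) for rule in RULES]
    votes = [v for v in votes if v is not None]
    if not votes:
        return None
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None  # conflicting rules: leave for manual annotation
    return counts[0][0]


unlabeled = [
    "I love this phone, it is great",
    "Terrible battery, I want a refund",
    "I love it but it arrived broken",
    "It arrived on Tuesday",
]
labeled = [(text, auto_label(text)) for text in unlabeled]
```

Texts that end up with `None` are the natural candidates for quick manual annotation, so the two approaches can complement each other.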


Sowmya Vajjala currently works as a researcher in Digital Technologies at the National Research Council, Canada’s largest federal research and development organization. She has worked in the area of Natural Language Processing (NLP) for the past decade in various roles: software developer, researcher, educator, and senior data scientist.

She recently co-authored the book “Practical Natural Language Processing: A Comprehensive Guide to Building Real World NLP Systems”, published by O’Reilly Media (June 2020), which has also been translated into Chinese. Her research interests lie in multilingual computing and in the relevance of NLP beyond research, both in industry practice and in other disciplines through interdisciplinary research.