Free

Text parsing & matching with High Performance Computing resources

This talk will provide a brief introduction to some of the core concepts of analyzing text using computational tools.

We will demonstrate how standard calculations can be scaled to work on very large data sets through simple parallelization strategies that are easy to deploy in an HPC environment using job arrays.
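As a rough illustration of the job-array idea, the sketch below shows how each task in an array might pick its own share of the work. It assumes a Slurm scheduler, which sets the `SLURM_ARRAY_TASK_ID` and `SLURM_ARRAY_TASK_COUNT` environment variables, and an array submitted with zero-based indices (for example `--array=0-3`). The `shard_for_task` helper and the `blocks` list are hypothetical names used only for this example; the talk itself may organize the work differently.

```python
import os


def shard_for_task(items, task_id, n_tasks):
    """Return the items assigned to one array task using a round-robin split."""
    return [item for i, item in enumerate(items) if i % n_tasks == task_id]


# Assumption: Slurm sets these variables for array jobs. The defaults let the
# script run as a single task when it is started outside the scheduler.
task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
n_tasks = int(os.environ.get("SLURM_ARRAY_TASK_COUNT", "1"))

# Hypothetical, independent work units (for example, one per blocking key).
blocks = ["A", "B", "C", "D", "E"]

# Every task runs the same script but processes a different subset.
my_blocks = shard_for_task(blocks, task_id, n_tasks)
print(f"task {task_id} of {n_tasks} handles {my_blocks}")
```

Because the work units are independent, no communication between tasks is needed; each task can write its own output file, and the files can be concatenated afterwards.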

These ideas will be illustrated by a concrete example implemented in Python using the pandas, re, and nltk libraries. The example comes from social science research, where multiple data sets that refer to the same individuals must be merged while accounting for variations in how those individuals are named or described.

To illustrate a typical solution, we will demonstrate three key steps:

  1. text parsing and cleaning with data frames and regular expressions
  2. a parallelization strategy using blocking keys
  3. approximate text matching, string similarity measures, and reduction to a well-defined machine learning problem.
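The three steps above can be sketched in a few lines. This is a minimal illustration, not the speaker's implementation: to stay dependency-free it uses only the standard library, with plain lists in place of pandas data frames and `difflib.SequenceMatcher` in place of an nltk similarity measure. The cleaning rules, the blocking key (first letter of the last name token), the 0.85 threshold, and all function names are assumptions chosen for the example.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher


def clean_name(raw):
    """Step 1: lowercase, replace non-letters with spaces, collapse whitespace."""
    name = re.sub(r"[^a-z\s]", " ", raw.lower())
    return re.sub(r"\s+", " ", name).strip()


def blocking_key(name):
    """Step 2: first letter of the last token (an assumed, deliberately crude key)."""
    return name.split()[-1][0] if name else ""


def block(records):
    """Group cleaned names by blocking key; each group can be processed independently."""
    groups = defaultdict(list)
    for rec in records:
        cleaned = clean_name(rec)
        groups[blocking_key(cleaned)].append(cleaned)
    return groups


def similarity(a, b):
    """Step 3: a string similarity score in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def match(left, right, threshold=0.85):
    """Compare only names that share a blocking key and keep pairs above the threshold."""
    left_blocks, right_blocks = block(left), block(right)
    pairs = []
    for key in left_blocks.keys() & right_blocks.keys():
        for a in left_blocks[key]:
            for b in right_blocks[key]:
                score = similarity(a, b)
                if score >= threshold:
                    pairs.append((a, b, round(score, 2)))
    return sorted(pairs)


if __name__ == "__main__":
    survey = ["Dr. John  SMITH", "Maria Garcia"]
    registry = ["john smith", "Maria  Garcia.", "Peter Jones"]
    for pair in match(survey, registry):
        print(pair)
```

Blocking is what makes the problem parallel: names are compared only within a block, so each block can be handed to a separate job-array task. The fixed threshold stands in for the machine learning step; one plausible extension is to treat one or more similarity scores as features and train a classifier to label each candidate pair as a match or a non-match.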

This problem and solution process are representative of a very large class of data analysis problems that involve text comparison.

We will close by indicating some powerful extensions to the presented solution that can be used to apply this overall strategy to more complex problems of text analysis.


Speaker:

  • Ian Percel, Data Scientist, University of Calgary


Webinar Instructions:

  • Click the green "Register" button on this page to register for this event. All registrants will be emailed the connection instructions for this webinar.

If you have questions or would like more information, please contact info@westgrid.ca.

