3rd Annual MLOps World Conference on Machine Learning in Production 2022

$0 – $330.20

Location

Sheraton Centre Toronto Hotel

123 Queen Street West

Toronto, ON M5H 2M9

Canada

Refund policy

Refunds up to 30 days before event

An initiative for anybody who currently has, or will eventually have, ML/AI models in production. Come out for all of the sessions, or just a couple!

About this event

Thank you for joining us here.

The goal of MLOps World is to help companies put more models into production environments effectively, responsibly, and efficiently. Whether you're working towards a live production environment or already working in production, this event is geared towards you on your journey.

We hope to see you soon.

-TMLS and MLOps World Team

Special Notes

**If you'd like to attend virtually only and would like to request a special pass, email us at info@mlopsworld.com.

**Some speakers may elect not to have their talk recorded.

**Completion-of-workshop certificates are available upon request.

**Visa letters are available upon request.

80+ speakers from leading organizations.

Presented by the team at the Toronto Machine Learning Society (TMLS), an event dedicated to those working in ML/AI.

Each ticket includes:

  • Access to 40+ virtual workshops to help build and deploy (Kubernetes, etc.) - June 7-8
  • Access to 80+ in-person talks - June 9 - 10
  • Access to 90+ hours of recordings
  • Access to in-app brain-dates, parties and networking
  • Access to in-person Start-up Expo, Career Fair, Demo Sessions
  • Access to in-person Women in AI celebration
  • Network and connect through our event app 
  • Q+A with speakers
  • Channels to share your work with the community
  • Run your chat groups and virtual gatherings!

www.mlopsworld.com

Too few companies have effective AI leaders and an effective AI strategy. 

Taken from the real-life experiences of our community, the Steering Committee has selected the top applications, achievements and knowledge areas to highlight across this dynamic event.

Talk Tracks include:

- Real World Case Studies

- Business & Strategy

- Technical & Research (levels 1-7)

- Workshops (levels 1-7)

- In-person coding sessions

Top Industries Served:

  • Technology & Service
  • Computer Software
  • Banking & Financial Services
  • Insurance
  • Hospital & Health Care
  • Automotive
  • Telecommunications
  • Environmental Services
  • Food & Beverages
  • Marketing & Advertising

We believe these events should be as accessible as possible and price our ticket passes accordingly.

MLOps World is an international community group of practitioners trying to better understand the science of deploying ML models into live production environments, and everything both technical and non-technical that goes with it!

Created initially by the Toronto Machine Learning Society (TMLS), this initiative is intended to unite and support the wider AI ecosystem: companies, practitioners, academics, and contributors to the open-source communities operating within it.

With an explorative approach, our initiatives address the unique needs of our community of over 10,000 ML researchers, professionals, entrepreneurs and engineers, and are intended to empower its members and propel productionized ML. Our community gatherings and events attempt to re-imagine what it means to have a connected community, offering support, growth, and inclusion for all participants.

FAQs

Q: Is this a virtual or an in-person conference?

A portion is virtual and a portion is in person (the full conference will not be completely hybrid):

- June 7-8: Bonus workshop days 1 and 2, held virtually (for ticket holders)

- June 9-10: Conference talks, expo, workshops, coding sessions (in person)

**If you'd like to attend virtually only, you can request a special pass by emailing info@mlopsworld.com with ONLINE ONLY PASS in the Subject Header.**

Q: What is your in-person conference policy?

We’re aware that everyone’s comfort level and risk tolerance can vary. We are working to support every attendee’s level of comfort with regard to interactions and socializing, which will be indicated through Green/Yellow/Red badge indicators.

We also take all safety precautions very seriously and follow local health and safety guidelines in accordance with the City of Toronto and Marriott Hotels.

If you’re unsure or have personal requirements, message us! We’re happy to work with you to provide a safe and enjoyable experience!

Q: Which sessions are going to be recorded? When will the recordings be available and do I have access to them?

Most sessions will be recorded during the event (provided speakers give permission), will be made available to attendees approximately 2-4 weeks after the event, and will remain available for 12 months after release.

Q: Are there ID or minimum age requirements to enter the event? There is not. Everyone is welcome.

Q: How can I contact the organizer with any questions? Please email info@mlopsworld.com

Q: What's the refund policy? Tickets are refundable up to 30 days before the event.

Q: Why should I attend? From over 300+ submissions, the committee has selected the top sessions to help your learning. From hands-on coding workshops to case studies, you won't find a conference gathering that packs in as much information at such a low cost of entry. Come join our community and celebrate the major triumphs of the year, as well as the main lessons learned. Aside from the sessions, there will also be brain-dates via the app for networking and evening socials that provide opportunities to meet peers and build your network.

Q: Who will attend? Please see our Who Attends section for a full breakdown. Participants range from data scientists to engineers, business executives, and students. We'll have multiple tracks and in-app brain-dates to accommodate various vantage points and maturity levels.

Q: Can I speak at the event?

You can submit an abstract here. Submissions are reviewed by our committee.

*Content is non-commercial and speaking spots cannot be purchased. 

Q: Will you give out the attendee list? No, we do our best to ensure attendees are not inundated with messages. We allow attendees to stay in contact through our Slack channel and monthly follow-up socials.

Q: Can my company have a display? Yes, there will be spaces for company displays. You can inquire at faraz@mlopsworld.com

Machine Learning Monitoring in Production: Lessons Learned from 30+ Use Cases

Lina Weichbrodt, Lead Machine Learning Engineer, DKB Bank

Abstract: 

Traditional software monitoring best practices are not enough to detect problems with machine learning stacks. How can you detect issues and be alerted in real-time?

This talk will give you a practical guide on how to do machine learning monitoring: which metrics should you implement and in which order? Can you use your team's existing monitoring and dashboard tools, or do you need an MLOps Platform?

Technical Level: 5/7

What you will learn: 

  • Monitor the four golden signals, plus add machine learning-specific monitoring
  • For ML monitoring, prioritize monitoring the response of the service
  • You often don’t need a new tool: use the tools you already have and add a few metrics (see the sketch below)
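
To make that last point concrete, here is a minimal, hypothetical sketch of exposing model-response metrics through a monitoring stack many teams already run (Prometheus via its Python client); the metric names, labels, and score buckets are illustrative assumptions, not part of the talk.

```python
# Hypothetical sketch: reuse an existing Prometheus setup to watch the model's responses.
# Metric names, labels, and buckets are assumptions for illustration.
from prometheus_client import Counter, Histogram, start_http_server

PREDICTIONS = Counter("model_predictions", "Number of predictions served", ["model_version"])
SCORES = Histogram(
    "model_prediction_score",
    "Distribution of model output scores",
    ["model_version"],
    buckets=[i / 10 for i in range(11)],  # 0.0 .. 1.0
)

def predict_and_record(model, features, model_version="v1"):
    # assumes a scikit-learn-style binary classifier with predict_proba
    score = model.predict_proba([features])[0][1]
    PREDICTIONS.labels(model_version).inc()
    SCORES.labels(model_version).observe(score)  # alert on shifts in this distribution
    return score

if __name__ == "__main__":
    start_http_server(8000)  # becomes a scrape target for the existing Prometheus server
```

Alert rules on the score histogram (for example, a sudden shift in the share of scores near the decision boundary) can then live in the same alerting stack as the golden-signal alerts.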

Implementing MLOps Practices on AWS using Amazon SageMaker

Shelbee Eigenbrode, Principal AI/ML Specialist Solutions Architect / Bobby Lindsey, AI/ML Specialist Solutions Architect / Kirit Thadaka, ML Solutions Architect, Amazon Web Services (AWS)

Abstract: 

In this workshop, attendees will get hands-on with SageMaker Pipelines to implement ML pipelines that incorporate CI/CD practices.
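
For readers who want a feel for the API before the workshop, here is a hedged, minimal sketch of defining a one-step SageMaker Pipeline with the Python SDK; the role ARN, S3 paths, training script, and instance types are placeholders, and the real workshop pipeline will be more involved.

```python
# Hedged sketch of a minimal SageMaker Pipeline (placeholders: role ARN, S3 paths, train.py).
import sagemaker
from sagemaker.inputs import TrainingInput
from sagemaker.sklearn.estimator import SKLearn
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.steps import TrainingStep

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder

estimator = SKLearn(
    entry_point="train.py",              # assumed training script in the local directory
    framework_version="1.0-1",
    py_version="py3",
    instance_type="ml.m5.large",
    instance_count=1,
    role=role,
    sagemaker_session=session,
)

train_step = TrainingStep(
    name="TrainModel",
    estimator=estimator,
    inputs={"train": TrainingInput("s3://my-bucket/train/")},  # placeholder S3 prefix
)

pipeline = Pipeline(name="demo-mlops-pipeline", steps=[train_step], sagemaker_session=session)
# pipeline.upsert(role_arn=role)   # register/update the pipeline definition
# pipeline.start()                 # trigger a run, e.g. from a CI/CD job
```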

Technical Level: 5/7

What's unique about this talk: 

The opportunity to get hands-on

What you will learn: 

Familiarity with end-to-end features of Amazon SageMaker used in implementing ML pipelines

ML Monitoring on Edge Devices

Niv Hertz, Solutions Architect, Aporia

Abstract: 

In this session, we will learn how to design a data pipeline for monitoring ML models on edge devices. We will start by understanding the pieces of the pipeline and the important criteria to consider. We will then see how to utilize the pieces to best fit the criteria and design a proper pipeline.

Technical Level: 5/7

What you will learn: 

  • Understand the criteria to consider while building an ML monitoring pipeline for edge devices.
  • Understand an example architecture of a proper ML monitoring pipeline for edge devices.

Automated Machine Learning & Tuning with FLAML

Qingyun Wu, Assistant Professor, Penn State University, and  Chi Wang, Principal Researcher, Microsoft Research

Abstract: 

In this tutorial, we will provide an in-depth and hands-on training on Automated Machine Learning & Tuning with a fast python library FLAML. FLAML finds accurate machine learning models automatically, efficiently and economically. It frees users from selecting learners and hyperparameters for each learner.
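
As a taste of what the tutorial covers, here is a minimal sketch of FLAML's AutoML interface; the dataset and the 60-second time budget are chosen purely for illustration.

```python
# Minimal sketch of FLAML's AutoML API (toy dataset and time budget for illustration only).
from flaml import AutoML
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

automl = AutoML()
automl.fit(X_train=X_train, y_train=y_train, task="classification", time_budget=60)  # seconds

print("best learner:", automl.best_estimator)
print("best config:", automl.best_config)
print("holdout accuracy:", accuracy_score(y_test, automl.predict(X_test)))
```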

Technical Level: 4/7

What's unique about this talk: 

In addition to a set of hands-on examples, the speakers will also share some rules of thumb, pitfalls, open problems, and challenges learned from AutoML practice.

What you will learn: 

  • How to use FLAML to find accurate ML models with low computational resources for common ML tasks.
  • How to leverage the flexible and rich customization choices provided in FLAML to customize your AutoML or tuning tasks.

Taking MLOps 0-60: How to Version Control, Unify Data and Manage Code Lifecycles

Jimmy Whitaker, Chief Scientist of AI,  Pachyderm

Abstract: 

Machine learning models are never done. The world is always changing and models rely on data to learn useful information about this world. In ML systems we need to be able to embrace change without sacrificing reliability. But how do we do it? MLOps. MLOps, the process of operationalizing your machine learning technology, is fundamental to any organization leveraging AI. However, the complexities of machine learning require managing two lifecycles: the code and the data. Pachyderm is a platform that provides the foundation for unifying these two lifecycles.

Technical Level: 2/7

Deploy High Scale ML Models Without the Hustle

Pavel Klushin, Head of Solutions Architecture, QWAK

Technical Level: 4/7

What you will learn: 

How to deploy ML models to production

How to MLEM Your Models to Production

Mikhail Sveshnikov, MLEM Lead Developer, Iterative

Abstract: 

MLEM, a new open-source product from Iterative, will help you store, access, package and deploy your models in different scenarios. I will present MLEM, we'll go through a simple tutorial, and we'll discuss other use cases where MLEM can help fellow MLOps engineers.
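
As a rough preview (not the session's actual tutorial), the sketch below shows the basic save/load flow of MLEM's Python API; exact keyword names can vary across MLEM versions, so treat it as illustrative.

```python
# Hedged sketch of MLEM's save/load flow (keyword names may differ across versions).
from mlem.api import load, save
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
model = RandomForestClassifier().fit(X, y)

# Persist the model plus inferred metadata (framework, input schema) alongside the binary
save(model, "rf-model", sample_data=X)

# Later: reload and use it, or expose it with the MLEM CLI (e.g. `mlem serve`)
reloaded = load("rf-model")
print(reloaded.predict(X[:5]))
```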

Technical Level: 4/7

What you will learn: 

What MLEM is (besides a popular meme) and how its features can be used in an MLOps engineer's day-to-day work, like wrapping models into web services or dockerizing them (but that is not all of it!). And probably a bit about MLEM's inner workings.

Production ML for Mission-Critical Applications

Robert Crowe, TensorFlow Developer Engineer,  Google

Abstract: 

Deploying advanced Machine Learning technology to serve customers and/or business needs requires a rigorous approach and production-ready systems. This is especially true for maintaining and improving model performance over the lifetime of a production application. Unfortunately, the issues involved and approaches available are often poorly understood. An ML application in production must address all of the issues of modern software development methodology, as well as issues unique to ML and data science. Often ML applications are developed using tools and systems which suffer from inherent limitations in testability, scalability across clusters, training/serving skew, and the modularity and reusability of components. In addition, ML application measurement often emphasizes top level metrics, leading to issues in model fairness as well as predictive performance across user segments.

Rigorous analysis of model performance at a deep level, including edge and corner cases is a key requirement of mission-critical applications. Measuring and understanding model sensitivity is also part of any rigorous model development process.

We discuss the use of ML pipeline architectures for implementing production ML applications, and in particular we review Google’s experience with TensorFlow Extended (TFX), as well as available tooling for rigorous analysis of model performance and sensitivity. Google uses TFX for large scale ML applications, and offers an open-source version to the community. TFX scales to very large training sets and very high request volumes, and enables strong software methodology including testability, hot versioning, and deep performance analysis.
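
For orientation, here is a hedged skeleton of a tiny TFX pipeline run locally, in the spirit of the public TFX tutorials; the paths, the trainer module, and the two-component pipeline are placeholders, not Google's production setup.

```python
# Hedged skeleton of a minimal local TFX pipeline (paths and trainer module are placeholders).
from tfx import v1 as tfx

def make_pipeline(pipeline_name, pipeline_root, data_root, module_file, metadata_path):
    example_gen = tfx.components.CsvExampleGen(input_base=data_root)
    trainer = tfx.components.Trainer(
        module_file=module_file,                       # user-supplied training code
        examples=example_gen.outputs["examples"],
        train_args=tfx.proto.TrainArgs(num_steps=100),
        eval_args=tfx.proto.EvalArgs(num_steps=10),
    )
    return tfx.dsl.Pipeline(
        pipeline_name=pipeline_name,
        pipeline_root=pipeline_root,
        components=[example_gen, trainer],
        metadata_connection_config=tfx.orchestration.metadata.sqlite_metadata_connection_config(metadata_path),
    )

if __name__ == "__main__":
    tfx.orchestration.LocalDagRunner().run(
        make_pipeline("demo", "/tmp/tfx-root", "/tmp/data", "trainer_module.py", "/tmp/tfx-metadata.db")
    )
```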

Technical Level: 5/7

What you will learn: 

  • How Production ML is fundamentally different from Research or Academic ML
  • Methods and architectures for creating an MLOps infrastructure that adapts to change
  • Review of several approaches to implementing MLOps in production settings

Using RIME to Eliminate AI Failures

Daniel Glogowski, Head of Product and Jerry Liu, Machine Learning Lead,  Robust Intelligence

Abstract: 

AI Fails, All the Time: AI Failure is when you train an ML model and it behaves poorly in production because of issues like novel corner case inputs, upstream ETL changes, and distributional drift. Data science teams constantly face these issues and more, spending time root causing and firefighting. Data science teams may optimize for a single performance metric like accuracy, but this is inadequate to prevent AI Failure. Combatting AI Failure takes time and energy. Robust Intelligence helps to prevent AI Failure so that you can focus on what truly matters.

RIME Prevents AI Failures: The Robust Intelligence Model Engine (RIME) helps your team accelerate your AI lifecycle. Detect Weaknesses: Train a candidate model, and automatically discover its individual weaknesses with AI Stress Testing. Go beyond simply optimizing for model performance. Improve the model with automatic suggestions. Compare with other candidate models. Establish and enforce standards across your organization. Prevent AI Failure: Confidently deploy the best model into production with AI Firewall with one line of code. Observe your model in production and automate the discovery and remediation of any issues that occur post-deployment. Automatically flag, block, or impute erroneous data in real-time.

Technical Level: 4/7

What you will learn: 

How to help prevent AI Failure so that you can focus on what truly matters.

Implementing a Parallel MLOps Test Pipeline for Open Source Development

Miguel Gonzalez-Fierro, Principal Data Science Manager,  Microsoft

Abstract: 

GitHub has become a hugely popular service for building software, open source or private. As part of the continuous development and integration process, frequent, reliable and efficient testing of repository code is necessary. GitHub provides functionality and resources for automating testing workflows (GitHub Workflows), which allow for both managed and self-hosted test machines.

However, managed hosts are of computational size that is limited for many machine learning workloads. Moreover, they don’t include GPU hosts currently. As for self-hosted machines, there is the inconvenience and cost of keeping machines online 24 x 7. Another issue is that it is cumbersome to distribute test jobs to multiple machines.

Our goal is to leverage Azure Machine Learning along with GitHub Workflows in order to address these issues. With AzureML, we can access powerful compute with both CPU and GPU. Bringing the compute online is automatic and on demand for all the testing jobs. Moreover, we can easily distribute testing jobs to multiple hosts, in order to limit the end-to-end execution time of the workflow.
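
The sketch below is not the Recommenders repository's actual script, but a hedged illustration of the idea: a GitHub Actions workflow calls a small Python driver that submits one group of tests as an Azure ML command job, so compute is created on demand and test groups can run in parallel. Workspace identifiers, the environment name, the compute target, and the test selection are placeholders.

```python
# Hedged illustration: submit one test group to AzureML from CI (identifiers are placeholders).
from azure.ai.ml import MLClient, command
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

test_job = command(
    code=".",  # the repository checkout
    command="pip install -e .[dev] && pytest tests/smoke -m 'group_a'",  # one test group
    environment="AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest",        # assumed curated env
    compute="cpu-cluster",                                                # assumed compute target
    display_name="nightly-smoke-group-a",
)

returned_job = ml_client.jobs.create_or_update(test_job)
print(returned_job.studio_url)  # surfaced in the GitHub Actions log for debugging
```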

We show a configuration for achieving the above programmatically, which we have developed as part of the Microsoft Recommenders repository (https://github.com/microsoft/recommenders/), which is a popular open-source repository that we maintain and develop. In our setting, we have three workflows that trigger nightly runs as well as a workflow triggered by pull requests.

Nightly workflows, in particular, include smoke and integration tests and are long (more than 6 hours) if run sequentially. Using our parallelized approach on AzureML, we have managed to bring the end-to-end time down to less than 1 hour. We also discuss how to divide the tests into groups in order to maximize machine utilization.

We also talk about how we retrieve the logs associated with runs from AzureML and register them as artifacts on GitHub. This allows one to view the progress of testing jobs from the GitHub Actions dashboard, which makes monitoring and debugging of errors easier.

Technical Level: 5/7

What's unique about this talk: 

We have one of the most sophisticated test pipelines among GitHub repositories related to machine learning.

What you will learn: 

  • Best practices on testing GitHub repositories of Python code, based on our experience with the Microsoft/Recommenders repository
  • Guidelines on testing in economical ways
  • How to use GitHub workflows for setting up testing pipelines
  • How to benefit from Azure Machine Learning capabilities in order to automate testing jobs that run in parallel

MLOps Beyond Training: Simplifying and Automating the Operational Pipeline

Yaron Haviv, Co-Founder & CTO,  Iguazio

Abstract: 

Most data science teams start their AI journey from what they perceive to be the logical beginning: building AI models using manually extracted datasets. Operationalizing machine learning, in the sense of considering all the requirements of the business (handling online and federated data sources, scale, performance, security, continuous operations, etc.), comes as an afterthought, making it hard and resource-intensive to create real business value with AI.

Technical Level: 5/7

What you will learn: 

How to simplify and automate your production pipeline to bring data science to production faster and more efficiently

How to Treat Your Data Platform Like a Product: 5 Key Best Practices

Barr Moses, CEO & Co-Founder, Monte Carlo

Abstract: 

Your team just migrated to a data mesh (or so they think). Your CTO is all in on this “modern data stack,” or as he calls it: “The Enterprise Data Discovery.” To satisfy your company’s insatiable appetite for data, you may even be building a complex, multi-layered data ecosystem: in other words, a data platform. Still, it’s one thing to build a data platform, but how do you ensure it actually drives value for your business?

In this fireside chat, Barr Moses, CEO & co-founder of Monte Carlo, will walk through why best in class data teams are treating their data platforms like product software and how to get started with reliability and scale in mind.

Technical Level: 3/7

What's unique about this talk: 

I've never discussed these best practices before in a public talk or blog article, and they're pulled from my own experience at Monte Carlo working with hundreds of data teams attempting to build their own data platforms.

What you will learn: 

5 best practices (across technology, processes, and culture) for treating your data platform like a scalable, measurable product with machine learning and automation.

WarpDrive: Orders-of-Magnitude Faster Multi-Agent Deep RL on a GPU

Stephan Zheng, Lead Research Scientist; Tian Lan, Senior Applied Scientist; Sunil Srinivasa, Research Engineer, Salesforce

Abstract: 

Reinforcement learning is a powerful tool that has enabled big technical successes in AI, including superhuman gameplay, optimizing data center cooling, nuclear fusion control, economic policy analysis, etc. For wider real-world deployment, users need to be able to run RL workflows efficiently and quickly. WarpDrive is an open-source framework that runs multi-agent deep RL end-to-end on a GPU. This enables orders of magnitude faster RL.

In this talk, we will review how WarpDrive works and several new features introduced since its first release in Sep 2021. These include automatic GPU utilization tuning, distributed training on multiple GPUs, and sharing multiple GPU blocks across a simulation. These features result in throughput scaling linearly with the number of devices, to a scale of millions of agents. WarpDrive also provides several utility functions that improve quality-of-life and enable users to quickly implement and train RL workflows.

Technical Level: 6/7

What's unique about this talk: 

Accessible explanations of the latest features, demos, and future roadmap.

What you will learn: 

How WarpDrive enables you to run reinforcement learning orders of magnitude faster.

Supercharging MLOps with the Petuum Platform

Aurick Qiao, Ph.D., CEO and Tong Wen, Director of Engineering, Petuum

Abstract: 

Today’s widespread practice of ad hoc integration between many fragmented ML tools leaves hard-to-fill gaps in end-to-end automation, scalability, and management of AI/ML applications. With the Petuum Platform, ML applications and infrastructure can be composed quickly and flexibly from standardized and reusable building blocks, thus transforming MLOps from craft production into a repeatable assembly-line process. We will discuss new innovations in Composable, Automatic, and Scalable ML (CASL), developed in collaboration with CMU, UC Berkeley, and Stanford, and how they play a pivotal role in the Petuum Platform.

Technical Level: 3/7

What you will learn: 

This workshop will show how your team can easily compose, manage, and monitor AI/ML infrastructure across multiple systems on a single pane of glass, seamlessly scale ML pipelines from local development to batch execution and online serving, and optimize end-to-end ML pipelines in an automatic and cost-efficient way.

Scaling ML Embedding Models to Serve a Billion Queries

Senthilkumar Gopal, Senior Engineering Manager (Search ML), eBay Inc.

Abstract: 

This talk is aimed at providing a deeper insight into the scale, challenges and solutions formulated for powering embeddings-based visual search at eBay. It walks the audience through the model architecture, the application architecture for serving users, the workflow pipelines produced for building the embeddings to be used by Cassini (eBay's search engine), and the unique challenges faced during this journey. The talk provides key insights specific to embedding handling and how to scale systems to provide real-time clustering-based solutions for users.

Technical Level: 5/7

What's unique about this talk: 

Most of the online content dwells on pieces of the infrastructure required without providing a coherent end-to-end picture. Most critically, the content does not relate to the model architecture and how the pipelines and model architecture/parameters influence each other. This talk also goes into the workings of a large-scale search engine and how the application architecture influences the operational aspects needed to reach the required scale.

What you will learn: 

The audience will learn how to productionize embedding-based data pipelines, key challenges and potential solutions, and get an introduction to different quantization algorithms and their advantages/disadvantages. The audience will also get a deeper view of how data pipelines and workflows are modeled for optimal scale.

Personalized Recommendations and Search with Retrieval and Ranking at scale on Hopsworks

Jim Dowling, CEO,  Hopsworks

Abstract: 

Personalized recommendations and personalized search systems at scale are increasingly being built on retrieval and ranking architectures based on the two-tower embedding model. This architecture requires a lot of infrastructure. A single user query will cause a large fanout of traffic to the backend, with hundreds of database lookups in a feature store, similarity search in an embedding store, and model outputs from both a query embedding model and a ranking model. You will also need to index your items in the embedding store using an item embedding model, and instrument your existing systems to store observations of user queries and the items they select.
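
To make the architecture concrete, here is a framework-agnostic sketch (written in PyTorch, which is an assumption on our part) of the two-tower idea: a query tower and an item tower map into the same embedding space, item embeddings are indexed offline, and retrieval is a nearest-neighbour search whose candidates are then passed to a ranking model.

```python
# Conceptual two-tower sketch (illustrative; feature dimensions and sizes are made up).
import torch
import torch.nn as nn
import torch.nn.functional as F

class Tower(nn.Module):
    def __init__(self, in_dim, emb_dim=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, emb_dim))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)  # unit-length embeddings -> dot product = cosine

query_tower, item_tower = Tower(in_dim=20), Tower(in_dim=50)

queries = query_tower(torch.randn(4, 20))     # user/query features at request time
items = item_tower(torch.randn(1000, 50))     # catalogue features, embedded and indexed offline

scores = queries @ items.T                    # similarity; in production this is an ANN lookup
candidates = scores.topk(5, dim=-1).indices   # candidates handed to the ranking model
print(candidates)
```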

Technical Level: 6/7

What's unique about this talk: 

The only integrated open-source platform for scalable retrieval and ranking systems.

What you will learn: 

How to build a state-of-the-art two tower model for personalized recommendations that scales with Hopsworks.

Accelerating Transformers with Hugging Face Optimum and Infinity

Philipp Schmid, Machine Learning Engineer and Lewis Tunstall, Machine Learning Engineer, Hugging Face

Abstract: 

Since their introduction in 2017, Transformers have become the de facto standard for tackling a wide range of NLP tasks in both academia and industry. However, in many situations accuracy is not enough — your state-of-the-art model is not very useful if it’s too slow or large to meet the business requirements of your application.

Technical Level: 5/7

What you will learn: 

How Hugging Face Optimum and Infinity provide developers with the tools to easily optimize Transformers with techniques such as quantization and pruning.

Parallelizing Your ETL with Dask on Kubeflow

Jacob Tomlinson, Senior Software Engineer, NVIDIA

Abstract: 

Kubeflow is a popular MLOps platform built on Kubernetes for designing and running Machine Learning pipelines for training models and providing inference services. Kubeflow has a notebook service that lets you launch interactive Jupyter servers (and more) on your Kubernetes cluster. Kubeflow also has a pipelines service with a DSL library written in Python for designing and building repeatable workflows that can be executed on your cluster, either ad-hoc or on a schedule. It also has tools for hyperparameter tuning and running model inference servers, everything you need to build a robust ML service.

Technical Level: 5/7

What's unique about this talk: 

It's common to talk about parallelism and GPU acceleration at the model training stage, but we are working hard to also accelerate ETL stages. There isn't a huge amount of content online about this yet.

What you will learn: 

Data Scientists commonly use Python tools like Pandas on their laptops with CPU compute. Production systems are usually distributed multi-node GPU setups. Dask is an open source Python library that takes the pain out of scaling up from laptop to production.
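
As a small illustration of that point, the sketch below expresses a pandas-style ETL step with Dask so the same code can run on a laptop or on a cluster (for example, one launched from a Kubeflow notebook); the Parquet path and column names are placeholders.

```python
# Illustrative Dask ETL sketch (placeholder data path and columns).
import dask.dataframe as dd
from dask.distributed import Client

client = Client()  # local cluster by default; pass a scheduler address to use a remote cluster

df = dd.read_parquet("s3://my-bucket/events/*.parquet")  # lazily partitioned across workers
daily_spend = (
    df[df["event_type"] == "purchase"]
    .groupby("user_id")["amount"]
    .sum()
)
print(daily_spend.compute())  # triggers parallel execution; pandas-like API throughout
```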

What's in the box? Automatic ML Model Containerization

 Clayton Davis, Head of Data Science, Modzy

Abstract: 

This talk will include a deep dive on building machine learning (ML) models into container images to run in production for inference. Based on our experience setting up ML container builds for many customers, we’ll share a set of best practices for ensuring secure, multi-tenant image builds that avoid lock-in, and we’ll also cover some tooling (chassis.ml) and a standard (Open Model Interface (OMI)) to execute this process. Data scientists and developers will walk away with an understanding of the merits of a standard container specification that allows for interoperability, portability, and security for models to seamlessly be integrated into production applications.

Technical Level: 5/7

What you will learn: 

Prerequisite: Basic familiarity with ML models and/or common ML frameworks (PyTorch, scikit-learn, etc.)

A Zero-Downtime Set-up for Models: How and Why

Anouk Dutrée, Product Owner, UbiOps

Abstract: 

When a model is in production, you ideally want zero downtime: whenever the model is needed, it should be ready to respond. This issue is two-sided. On one hand, you need to make sure that there is no downtime when updating your model; on the other hand, you need to ensure that a request can be processed even if your model itself fails. In this talk we will take you through the set-up we use to ensure zero downtime when updating models, and how this set-up can be expanded to ensure you can handle failing models as well.
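
The following is a deliberately generic sketch (not UbiOps' implementation) of the second half of that idea: route requests to the current "champion" model and fall back to a standby version when the champion errors, so callers never see downtime.

```python
# Generic fallback-routing sketch (not a specific vendor's API).
def predict_with_fallback(features, champion, challenger):
    """Try the champion model first; degrade gracefully to the challenger on failure."""
    try:
        return {"served_by": "champion", "prediction": champion(features)}
    except Exception:
        # champion may be mid-redeploy, overloaded, or simply broken
        return {"served_by": "challenger", "prediction": challenger(features)}

# Example usage with stand-in callables
champion = lambda x: sum(x) > 1.0    # placeholder for the current production model
challenger = lambda x: sum(x) > 1.2  # placeholder for the previous/backup version
print(predict_with_fallback([0.4, 0.9], champion, challenger))
```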

Technical Level: 4/7

What's unique about this talk: 

I personally find most of the articles on this topic to be too specific to one part of the chain. In this talk I want to go over the entire process as a whole and cover both sides of downtime (i.e., downtime caused by maintenance and downtime caused by the model failing).

What you will learn: 

  • How to create an easy to work with zero-downtime set-up for data science models using smart routing
  • How to expand this set-up to a champion challenger set-up to ensure there is always a model available that can take over if a model fails unexpectedly
  • What a champion challenger set-up is

MLOps is Just HPC in Disguise: A Real-World, No-Nonsense Guide to Upgrading Your Workflow

Victor Sonck, Evangelist,  ClearML

Abstract: 

Does the following sound familiar to you? Overwriting existing plots and model files, having to put a model in production in 10 days, or running out of GPU availability again. If it does, this workshop is for you: you'll end up with a set of tools and workflows that can make your life much easier and increase your productivity by automating mundane tasks.

Technical Level: 4/7

What you will learn: 

A set of tools, tips, tricks and example workflows you can use in your own work to help alleviate common data science challenges.

Critical Use of MLOps in Finance: Using Cloud-managed ML Services that Scale

Vinnie Saini, Director and Senior Principal, Enterprise Data Architecture & Cloud Strategy, Scotiabank

Abstract: 

With ML Engineering being a superset of Software Engineering, treating data as a first-class citizen is key to ML Engineering. The talk will focus on how leveraging MLOps is key to improving the quality and consistency of machine learning solutions, managing the lifecycle of your models with the goals of:

- Faster experimentation and development of models

- Faster deployment of models into production

- Quality assurance and end-to-end lineage tracking

With trained machine learning models deployed as web services in the cloud or locally, we'll see how deployments use CPU, GPU, or field-programmable gate arrays (FPGA) for inferencing, using different compute targets:

- Container Instance

- Kubernetes Service

- Development environment

Technical Level: 5/7

What you will learn: 

This talk is intended for technology leaders and enterprise architects who want to understand what MLOps looks like in practice: capture the governance data for the end-to-end ML lifecycle; monitor ML applications for operational and ML-related issues; compare model inputs between training and inference, explore model-specific metrics, and provide monitoring and alerts on your ML infrastructure; and automate the end-to-end ML lifecycle with pipelines to continuously roll out new ML models alongside your other applications and services.

Building Real-Time ML Features with Feast, Spark, Redis, and Kafka

Danny Chiao, Engineering Lead and Achal Shah, Software Engineer, Tecton/Feast

Abstract: 

This workshop will focus on the core concepts underlying Feast, the open-source feature store. We’ll explain how Feast integrates with underlying data infrastructure including Spark, Redis, and Kafka, to provide an interface between models and data.

Technical Level: 4/7

What you will learn: 

We’ll provide coding examples to showcase how Feast can be used to:

  • Curate features in online and offline storage
  • Process features in real-time
  • Ensure data consistency between training and serving environments
  • Serve feature data online for real-time inference
  • Quickly create training datasets
  • Share and re-use features across models (see the sketch after this list)
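
Here is a minimal, hedged sketch of the first and last items above using Feast's Python SDK; the feature view, entity names, and repository layout are assumptions rather than the workshop's actual code.

```python
# Hedged Feast sketch (feature/entity names are assumptions).
from feast import FeatureStore

store = FeatureStore(repo_path=".")  # directory containing feature_store.yaml

# Online path: low-latency lookup at inference time (e.g. served from Redis)
online_features = store.get_online_features(
    features=["driver_stats:trips_today", "driver_stats:avg_rating"],
    entity_rows=[{"driver_id": 1001}],
).to_dict()
print(online_features)

# Offline path: point-in-time-correct training data (e.g. computed on Spark / a warehouse),
# where entity_df is a DataFrame of entity keys plus event timestamps:
# training_df = store.get_historical_features(entity_df=entity_df, features=[...]).to_df()
```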

Generalizing Diversity: Machine Learning Operationalization for Pharma Research

Daniel Butnaru, Principal Architect & Head of Scientific Software Engineering & Architecture, Roche

Abstract: 

Many machine learning use cases in pharma research are transitioning from a one-off scenario, where the model is built once and run a few times, to repeated usage of the same model in critical research workflows. This shift significantly raises the bar on the quality and setup necessary to train and deploy ML models. Given the number and diversity of ML models, how does a larger enterprise go about leveraging an MLOps platform? How does one ensure seamless operational embedding of ML models in a heterogeneous enterprise operational landscape?

Technical Level: 4/7

What's unique about this talk: 

It shows MLOps scenarios in early pharma research. Also, some of the presented models consist of hundreds of individual models that need to be delivered as one, which is a rather unique setup.

What you will learn: 

  • how Roche Pharma Research operationalizes molecular property predictors (100s of models)
  • why the embedding of the model in operational systems needs to be considered from the start
  • implementation patterns for formalizing the exchange between data scientist, ML engineer and data engineer

Lessons Learned from DAG-based Workflow Orchestration

Kevin Kho, Senior Open Source Community Engineer, Prefect

Abstract: 

Workflow orchestration has traditionally been closely coupled to the concept of Directed Acyclic Graphs (DAGs). Building data pipelines involved registering a static graph containing all the tasks and their respective dependencies. During workflow execution, this graph would be traversed and executed. The orchestration engine would then be responsible for determining which tasks to trigger based on the success and failure of upstream tasks.

This system was sufficient for standard batch processing-oriented data engineering pipelines but proved to be constraining for some emerging common use cases. Data professionals would have to compromise their vision to get their workflow to fit in a DAG.

For example:

1. How do I re-run a part of my workflow based on a downstream condition?

2. How do I execute a long-running workflow?

3. How do I dynamically add tasks to the DAG during runtime?

These questions have led to the development of Prefect Orion (Prefect 2.0), a DAG-less workflow orchestration system that emphasizes runtime flexibility and an enhanced developer experience. By removing the DAG constraint, Orion offers an interface to workflow orchestration that feels more Pythonic than ever. Developers only need to wrap as little code as they want to get observability into a specific task of their workflows.
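
To illustrate the DAG-less style, here is a minimal sketch using Prefect 2's flow/task decorators; the tasks are toy placeholders, and the point is that ordinary Python control flow decides what runs at runtime, with observability added only to the code you wrap.

```python
# Minimal Prefect 2 (Orion) sketch: plain Python control flow, no static DAG registration.
from prefect import flow, task

@task(retries=2)
def extract(day):
    return [1, 2, 3]  # placeholder for a real extract step

@task
def load(rows):
    return len(rows)

@flow
def nightly_etl(day="2022-06-09"):
    rows = extract(day)
    if rows:                      # a runtime, downstream condition decides what happens next
        return load(rows)

if __name__ == "__main__":
    nightly_etl()
```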

Technical Level: 5/7

What's unique about this talk: 

A lot of the content here will come from supporting the Prefect community over the last 3 years and the difficulties we recognized with traditional orchestration systems. There are not a lot of people with the experience of supporting thousands of use cases and extracting insight from that.

What you will learn: 

They will learn about workflow orchestration, and why pinning it to the Directed Acyclic Graph concept proved to be limiting. They will also learn how to spin up their own free open-source orchestrator.

Defending Against Decision Degradation with Full-Spectrum Model Monitoring: Case Study and AMA

Mihir Mathur, Product Manager, Machine Learning, Lyft

Abstract: 

ML models at Lyft make millions of high stakes decisions per day including decisions for real-time pricing, physical safety classification, fraud detection, and much more. Preventing models from degrading and making ineffective decisions is therefore critical. Over the past two years, we’ve invested in building a full-spectrum model monitoring solution to catch and prevent model degradation.

In this talk, we’ll discuss our suite of approaches for model monitoring including real-time feature validation, performance drift detection, anomaly detection, and model score monitoring as well as the cultural change needed to get ML practitioners to effectively monitor their models. We’ll also discuss the impact our monitoring system delivered by catching problems.

Technical Level: 4/7

What you will learn: 

  • Why model monitoring is needed
  • Challenges in building a model monitoring system
  • How to prioritize among a plethora of things that can be built
  • Overview of Lyft's model monitoring architecture
  • How to cause cultural change at a company for better AI/ML practices

Leaner, Greener and Faster Pytorch Inference with Quantization

Suraj Subramanian, Developer Advocate, PyTorch

Abstract: 

Quantization refers to the practice of taking a neural network's painstakingly tuned FP32 parameters and rounding them to integers, without destroying accuracy, while actually making the model leaner, greener and faster. In this session, we'll learn more about this sorcery from first principles and see how it is implemented in PyTorch. We'll break down all of the available approaches to quantize your model, their benefits and pitfalls, and most importantly how you can make an informed decision for your use case. Finally, we put our learnings to the test on a large non-academic model to see how this works in the real world.
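
As a taste of one of those approaches, here is a minimal sketch of post-training dynamic quantization in PyTorch; the toy model is illustrative, and the other workflows (static quantization, quantization-aware training) are what the session compares it against.

```python
# Minimal sketch: post-training dynamic quantization of Linear layers in PyTorch.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 2)).eval()

quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8  # int8 weights; activations quantized on the fly
)

x = torch.randn(1, 128)
print(quantized(x))  # same call signature, smaller and typically faster on CPU
```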

Technical Level: 4/7

What you will learn: 

  • Foundations of quantization in deep learning
  • Summary of current research in this area
  • Approaches to quantization, their benefits and pitfalls
  • How to debug issues with your quantized model
  • Choosing the quantization workflow for your particular use case

Scale and Accelerate the Distributed Model Training in Kubernetes Cluster

Jack Jin, Lead ML Infra Engineer, Zoom

Abstract: 

In order to orchestrate Deep Learning workloads that scale across multiple GPUs and nodes, Kubernetes offers a compelling solution. With Kubernetes and the Kubeflow PyTorchJob operator, we can easily schedule and track a distributed training job on a single node with multiple GPUs, or on multiple multi-GPU nodes, in a shared GPU resource pool. To accelerate deep learning training at Zoom, we enable RDMA (RoCE) to bypass the CPU kernel and offload the TCP/IP protocol. We apply this technology in Kubernetes with SR-IOV via the NVIDIA Network Operator in a heterogeneous GPU cluster of 4-GPU and 8-GPU servers, and reach a near-linear performance increase as the number of GPUs and worker nodes increases.
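
For context, the sketch below shows the training-side code a Kubeflow PyTorchJob typically launches: the operator injects the rendezvous environment variables, torch.distributed initializes NCCL (which can use RDMA/RoCE transports when the network is configured for them), and DistributedDataParallel synchronizes gradients across GPUs and nodes. The cluster setup, model, and training loop here are placeholders.

```python
# Illustrative worker script for a distributed PyTorch job (cluster specifics assumed).
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")            # reads RANK/WORLD_SIZE/MASTER_* env vars
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    model = torch.nn.Linear(32, 2).cuda(local_rank)    # placeholder model
    model = DDP(model, device_ids=[local_rank])        # gradient sync across all GPUs/nodes
    # ... standard training loop using a DistributedSampler ...
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```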

Technical Level: 5/7

What's unique about this talk: 

SR-IOV with the training operator for PyTorchJob in Kubernetes.

What you will learn: 

How to setup a Kubernetes based ML model training platform, and how to manage the training job with Kubernetes

Building Real-Time ML Features with a Feature Platform

Mike del Balso, Co-Founder & CEO and Willem Pienaar, Feast Committer and Tech Lead, Tecton

Abstract: 

Deploying ML in production is hard, and data is often the hardest part. Production ML pipelines are different from traditional analytics pipelines. They need to process both historical data for training, and fresh data for online serving, often using streaming or real-time data sources. They must ensure training/serving parity, provide point-in-time correctness, and serve data with production service levels. These challenges are difficult to tackle with traditional ETL tools, and can often add weeks or months to project timelines.

In this session, Mike Del Balso and Willem Pienaar will present the challenges faced when building the core ML infrastructure at Uber and Gojek, and how their teams built feature stores to scale their ML efforts to thousands of models in production. Uber and Gojek used these internal ML platforms to power every aspect of their business: ride ETAs, demand forecasting, pricing, and restaurant recommendations.

Feature stores have now emerged as the tool of choice to solve the challenges of production ML. At their core, they provide a simple solution to store, serve and share features. However, feature stores are not enough. Teams still need to create bespoke data pipelines to process raw data into features in real-time. To solve the data problem for ML, organizations need a complete feature platform, which extends a feature store to include automated ML data pipelines that can transform data from batch and real-time sources. Mike and Willem will share their views on the evolution of feature stores to feature platforms that can manage the complete lifecycle of real-time ML features.

Technical Level: 4/7

What's unique about this talk: 

In this session, attendees will learn about the data challenges faced by ML teams at Uber and Gojek, and how they were solved with feature stores. Attendees will also get a hands-on example of how a feature platform can be used to build and operationalize enterprise-grade feature pipelines for a fraud detection use case.

What you will learn: 

In this session, attendees will learn about the data challenges faced by ML teams at Uber and Gojek, and how they were solved with feature stores. Attendees will also get a hands-on example of how a feature platform can be used to build and operationalize enterprise-grade feature pipelines for a fraud detection use case.

We’ll show how to:

  • Define features as code
  • Transform data and materialize feature values 
  • Store values in offline and online store
  • Serve data for training
  • Serve data online for real-time inference
  • Monitor pipeline health, data drift, and online service levels

Understanding Foundation Models: a New Paradigm for Building and Productizing AI Systems

Hagay Lupesko, Director of Engineering, Meta AI

Abstract: 

The term Foundation Models was coined in a 2021 technical report published by dozens of Stanford researchers, describing Foundation Models as no less than a new paradigm for building AI systems. In this session we will unpack this bold concept and identify practical ways for companies to start leveraging foundation models today.
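
One practical example of leveraging a foundation model today (our illustration, not necessarily the speaker's): using a pretrained model zero-shot through the Hugging Face pipeline API, with no task-specific training data. The model choice and candidate labels below are assumptions.

```python
# Illustrative zero-shot use of a pretrained foundation model (model choice is an assumption).
from transformers import pipeline

classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")
result = classifier(
    "The package arrived two weeks late and the box was damaged.",
    candidate_labels=["shipping problem", "billing problem", "product quality"],
)
print(result["labels"][0], round(result["scores"][0], 3))  # best label without any fine-tuning
```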

Technical Level: 4/7

What's unique about this talk: 

The combination of a new paradigm with real world examples of how this is leveraged today.

What you will learn: 

  • What are Foundation Models
  • Real world examples of Foundation Models
  • How companies can leverage Foundation Models today

A GitOps Approach to Machine Learning

Amy Bachir, Senior MLOps Engineer and Stephan Brown, MLOps Engineer, Interos

Abstract: 

The focus of this talk is the application of GitOps principles to machine learning in production. At Interos we use GitOps for most of our MLOps work, storing our ML configurations as code. GitOps has many benefits, including traceability, stability, reliability, consistency, and enhanced productivity, and it provides a single source of truth. We apply GitOps to our deployment configurations, onboarding process, and monitoring configurations, and use it at all stages of the model lifecycle. The portable and declarative nature of GitOps has led to increased traceability and, as a small team, has increased our development capacity.
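
As a generic illustration of "ML configuration as code" (not Interos' actual setup), the sketch below keeps a deployment spec in the repository and validates it in CI before a GitOps controller applies it; the field names and rules are assumptions.

```python
# Illustrative "configuration as code" sketch validated in CI (field names are assumptions).
from dataclasses import dataclass

@dataclass(frozen=True)
class ModelDeployment:
    name: str
    model_uri: str    # reference to a registered model version
    replicas: int
    cpu: str
    memory: str

def validate(spec: ModelDeployment) -> None:
    assert spec.replicas >= 2, "need at least two replicas for zero-downtime rollouts"
    assert spec.model_uri.startswith("models:/"), "only deploy registered model versions"

# The file checked into Git is the single source of truth for what runs in production.
fraud_v3 = ModelDeployment("fraud-scorer", "models:/fraud/3", replicas=3, cpu="500m", memory="1Gi")
validate(fraud_v3)
```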

Technical Level: 5/7

What's unique about this talk: 

We use GitOps principles at every stage of the ML lifecycle instead of just on IaC tasks.

What you will learn: 

How far GitOps can take you: it's not just about deployments, it's taking a model from idea to production using GitOps.

Hands-on: A Beginner-Friendly Crash Course on Kubernetes

Eric Hammel, ML Engineer,  Mohamed Sabri, Senior Consultant in MLOps, and Asim Sultan,  Rocket Science

Abstract: 

Have you ever wondered what Kubernetes and Cloud Native applications are? Here is the perfect opportunity to get exposed to these complex yet powerful tools. You will discover concepts and tools such as Container Orchestration, Cloud Native, Kubernetes, and application deployment.

Technical Level: 4/7

What you will learn: 

Participants will get a crash course on Kubernetes and Cloud Native concepts. They will learn how to deploy an application on a managed Kubernetes cluster using the presented abstractions.

Don't Fear Compliance Requirements & Audits: Implementing SecMLOps at Every Stage of the Pipeline

Ganesh Nagarathnam, S&P Global

Abstract: 

Over the last few years, MLOps as a discipline has made significant inroads in operationalizing and democratizing ML for a variety of use cases spanning industries. With so many tools being available for the conventional ML pipeline, organizations have stayed away from a 'Swiss army knife' kind of tool mindset and made better choices in choosing the right tool stack for their problems in the pipeline. With this kind of proliferation, managing, building and monitoring security in the ML pipeline poses unique challenges. The speaker dissects the ML pipeline, applies core drivers for incorporating security at every stage, and proposes an extension framework, namely SecMLOps.

Technical Level: 4/7

What's unique about this talk: 

SecMLOps as a discipline is new to the ML world, but it is well established in the DevOps world as SecDevOps.

What you will learn: 

They will learn how to integrate security early into the ML development process, enabling them to come up with their own core set of drivers for SecMLOps as needed. Product managers, program managers, and application security managers in the organization will feel empowered to be a part of the ML development cycle. Compliance requirements and audits will not be feared; rather, they will be simplified when such a framework is in place!

One Cluster to Rule Them All - ML on the Cloud Using Ray on Kubernetes and AWS

Victor Yap, MLOps Engineer, Rev.com

Abstract: 

Distributed compute clusters (aka HPC) are fundamental to machine learning in order to scale data processing, model training, model serving and more. However, each of these areas require diverse compute resources, both in quantity (10s-1000s) and type (CPU/GPU/Memory). On top of all of that, data scientists can face significant friction when trying to run their experiments across their local environment and the cluster's environment. This talk will cover how to build a single cluster, on AWS with Ray and Kubernetes, that can dynamically scale any resource type to any quantity, bridge the gap between local and cluster environments, and describe how Rev uses it to handle any compute problem.
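
To show what "one cluster for any compute problem" looks like from the user's side, here is a minimal sketch of attaching to an existing Ray cluster (for example, one running on Kubernetes) and fanning out work; the address, resource requests, and task body are placeholders.

```python
# Minimal sketch of submitting parallel work to an existing Ray cluster (placeholders throughout).
import ray

ray.init(address="auto")  # attach to the running cluster instead of starting a local one

@ray.remote(num_cpus=1)
def preprocess(shard_id):
    # placeholder for real data processing; Ray schedules these tasks across the cluster
    return shard_id * shard_id

results = ray.get([preprocess.remote(i) for i in range(100)])
print(sum(results))
```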

Technical Level: 6/7

What's unique about this talk: 

This talk covers a modern approach to distributed compute clusters and an intersection of tooling (Ray, AWS, Kubernetes, Karpenter) that cannot be found online.

What you will learn: 

How to build a dynamic, scalable distributed cluster on AWS with Ray and Kubernetes. The audience will learn why distributed compute clusters are required in ML, how to implement one that dynamically creates instances, and how Rev uses one to run all distributed computing needs. The audience will learn about Ray, Kubernetes and AWS, and what they have to offer in the space of MLOps.

How MLOps Tools Will Need to Adapt to Responsible and Ethical AI: Stay Ahead of the Curve

Patricia Thaine, Co-Founder & CEO, Private AI

Abstract: 

We are at the dawn of a new age for responsible AI: there's a flourishing field studying its benefits and harms, and the EU is actively legislating AI ethics. But while MLOps platforms have grown in capability and complexity, their consideration of responsible/ethical AI has lagged significantly behind. In this talk, we'll dive into the ethical guardrails every MLOps solution should implement to be prepared for a fast-approaching future, and into how they can help with GDPR compliance (data residency, data security, data privacy) as well as cater to future regulatory requirements.

Technical Level: 4/7

What's unique about this talk: 

I don't think there's much content out there linking AI regulations and ethics directly to MLOps.

What you will learn: 

How legislators are thinking about regulating AI and how the requirements fit into MLOps, including privacy and explainability.

Solving MLOps From First Principles: A Framework to Reduce Complexity

Dean Pleban, Co-Founder & CEO, DagsHub

Abstract: 

One of the hardest challenges data teams face today is selecting which tools to use in their workflow. Marketing messages are vague, and you continuously hear of new buzzwords you "just have to have in your stack". There is a constant stream of new tools, open-source and proprietary, that makes buyer's remorse especially bad. I call it "MLOps Fatigue". This talk will not discuss a specific MLOps tool, but will instead present guidelines and mental models for how to think about the problems you and your team are facing, and how to select the best tools for the task. We will review a few example problems, analyze them, and suggest open-source solutions for them. We will provide a mental framework that will help you tackle future problems you might face and extract the concrete value each tool provides.

Technical Level: 4/7

What you will learn: 

You'll learn what signals to watch for to notice you might have MLOps fatigue, how to define the challenge you're facing, and which questions to ask in order to build a "decision tree" for selecting the best-suited tools for the task. We'll also cover a few examples of using this framework in practice on challenges involving data management and automating training/pipeline tasks.

Low-latency Neural Network Inference for ML Ranking Applications: Yelp Case Study

Ryan Irwin, Engineering Manager and Rajvinder Singh,  Yelp, Inc.

Abstract: 

At Yelp, we train and deploy models for a variety of business applications requiring low-latency model inference. At first we focused on streamlining support for XGboost and LR models built in Spark to support business recommendations, search, ads, restaurants, and trust & safety use-cases. However, we didn’t have a way of supporting low-latency neural network models with Tensorflow. Such models usually relied on batched model inference in support of models used for photo classification [1] and popular dishes [2].

Technical Level: 6/7

What's unique about this talk: 

The solution uses a combination of open source tools that other solutions do not.

What you will learn: 

The audience will learn how different technologies in MLOps can be used to solve low-latency ranking problems.

Building Production ML Monitoring from Scratch: Live Coding Session

Alon Gubkin, CTO, Aporia

Abstract: 

In this session, together we will build a cloud-native ML monitoring stack using open-source tools. We'll start by explaining the basic principles of machine learning monitoring in production, and then create a live web dashboard to measure model drift (training compared to prediction), feature statistics, and performance metrics in production. This monitoring stack will also enable us to integrate Python-based custom metrics, which will be displayed on the dashboard. The code will be available on GitHub after the workshop.
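
As a preview of one building block such a stack needs, here is a small, generic sketch comparing a feature's training distribution to recent production values with a two-sample Kolmogorov-Smirnov test; the data and threshold are synthetic placeholders.

```python
# Generic drift-check sketch (synthetic data; the alpha threshold is an illustrative choice).
import numpy as np
from scipy.stats import ks_2samp

def drift_report(train_values, prod_values, alpha=0.05):
    stat, p_value = ks_2samp(train_values, prod_values)
    return {"statistic": stat, "p_value": p_value, "drifted": p_value < alpha}

rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, 5_000)     # stands in for the training distribution
production = rng.normal(0.4, 1.0, 5_000)   # stands in for the last hour of requests
print(drift_report(baseline, production))  # would feed a dashboard panel or alert rule
```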

Technical Level: 5/7

What's unique about this talk: 

This is the first time this workshop will be presented. Currently ML teams have a really hard time building ML Monitoring, especially small teams who have ML platforms based on open source tools.

What you will learn:

They'll learn how to build a cloud native ML Monitoring stack using open-source tools that will integrate into their MLOps platform.

How We Reduced 83% of ML Computing Cost on 100+ ML Projects

Jaeman An, Founder & CEO, VESSL AI

Abstract: 

In this session, we'll describe how our startup, VESSL, reduced computing costs by building time- and cost-effective machine learning infrastructure. We'll explain how to build hybrid cloud architectures with Terraform and Kubernetes. We'll describe our cost optimization methodologies with spot instances and fractional GPUs. We'll highlight some problems we encountered when using multiple environments, such as dataset mounts, network performance, and server monitoring, and how we solved them. We'll also introduce some cases where we actually saved more than 80% of the cost.

Technical Level: 4/7

What's unique about this talk: 

Hybrid cloud ML infrastructure and practical issues in using/managing multiple environments.

What you will learn: 

The audience will gain insight into what goes on from behind the scenes when building an ML platform at scale. From the specific tools required for machine learning to the rationale behind build versus buy for MLOps tools, audience members can use this talk to help frame their evaluations of various tools or internal efforts to stand up ML infrastructure for their organizations. Audience members will learn common challenges and problems from real-world examples, and how engineers approached those challenges head-on.

The Key Pillars of ML Observability and How to Apply them to Your ML Systems

Aman Khan, Senior Product Manager, Arize AI and Gandalf Hernandez, Senior Machine Learning Engineering Manager, Spotify

Abstract: 

"If you build it, they will come" is a totally bogus way to approach building an ML platform. All the time, teams learn the hard way that the details -- justifying the platform, identifying the key components that matter, how it fits into the broader whole, business impact, etc. -- are what determines success, not unnecessarily technical specifications or wasting time building a product that will only be irrelevant once it's done. It's about fundamentals, and getting those right is hard. Compared to DevOps or data engineering, MLOps is still relatively young as a discipline and best practices are often learned on the fly…so sometimes it pays to buy over build. In this session Gandalf Hernandez, Senior Machine Learning Manager at Spotify, and Aparna Dhinakaran – Chief Product Officer at Arize AI share best practices, war stories, and debate questions such as:

  • How do you justify building an ML platform internally?
  • What are the key components that matter to your team?
  • And why is ML infrastructure necessarily distinct from software infrastructure?

Technical Level: 4/7

What's unique about this talk: 

While the challenges of productionizing ML are not a unique experience, the ways each team goes about solving those challenges and the rationale behind their decisions are seldom looked at, especially at large organizations. This talk will shine a light on the not-so-talked-about aspects of MLOps that are almost as important as building the tools themselves: who buys it and why.

What you will learn: 

The audience will gain insight into what goes on from behind the scenes when building an ML platform at scale. From the specific tools required for machine learning to the rationale behind build versus buy for MLOps tools, audience members can use this talk to help frame their evaluations of various tools or internal efforts to stand up ML infrastructure for their organizations. Audience members will learn common challenges and problems from real-world examples, and how engineers approached those challenges head-on.

Eliminating AI Risk, One Model Failure at a Time

Yaron Singer,  CEO & Co-Founder, Robust Intelligence

Abstract: 

As organizations adopt AI, they inherit AI risk. AI risk often manifests itself in AI models that produce erroneous predictions that go undetected and result in serious consequences for the organization and the individuals affected by the decisions. In this talk we will discuss root causes of AI models going haywire, and present a rigorous framework for eliminating risk from AI. We will show how this methodology can be used as the building blocks of an AI firewall that can prevent AI model failures.
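
As a generic illustration of the firewall idea (not Robust Intelligence's product or method), a thin wrapper can refuse to trust the model on inputs that fall far outside the ranges seen in training:

```python
# Hypothetical "input firewall" sketch: flag out-of-distribution inputs before
# the model is allowed to predict. Names, model, and threshold are placeholders.
import numpy as np

class ModelFirewall:
    def __init__(self, model, train_features: np.ndarray, z_threshold: float = 6.0):
        self.model = model
        self.mean = train_features.mean(axis=0)
        self.std = train_features.std(axis=0) + 1e-9
        self.z_threshold = z_threshold

    def predict(self, x: np.ndarray) -> dict:
        z_scores = np.abs((x - self.mean) / self.std)
        if (z_scores > self.z_threshold).any():
            # Route to a fallback instead of silently returning an erroneous prediction.
            return {"status": "rejected", "reason": "out-of-distribution input"}
        return {"status": "ok", "prediction": self.model(x)}

rng = np.random.default_rng(0)
train = rng.normal(size=(1_000, 3))
firewall = ModelFirewall(model=lambda x: float(x.sum()), train_features=train)
print(firewall.predict(np.array([0.1, -0.2, 0.3])))   # passes the checks
print(firewall.predict(np.array([50.0, 0.0, 0.0])))   # rejected as an edge case
```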

Technical Level: 5/7

What's unique about this talk: 

Expertise, specific examples, anecdotal evidence, business advice, & product details.

What you will learn: 

How to eliminate AI failures from your model pipelines.

Top 5 Lessons Learned in Helping Organizations Adopt MLOps Practices

Shelbee Eigenbrode, Principal AI/ML Specialist Solutions Architect, Amazon Web Services

Abstract: 

In this session, I'll cover the top 5 lessons learned in helping organizations implement MLOps practices at scale. You'll learn about some of the common challenges encountered, as well as recommendations on how to mitigate those challenges.

Technical Level: 2/7

What's unique about this talk: 

A lot of content online is theoretical but when it comes down to implementation it's often more complex for a variety of reasons.

What you will learn: 

In this session, the audience will learn pitfalls to avoid based on large scale adoption of MLOps as well as technical implementations.

Scotiabank's Path Towards Accelerated Analytics Through GCP

Shimona Narang and Vipul Upadhye, Data Scientists, Scotiabank

Abstract: 

Investing in data and analytics has been critical for financial institutions for years, but it has risen to the forefront during the pandemic as a critical tool for assisting customers during difficult times. Recently, Scotiabank partnered with Google Cloud Platform to strengthen the bank's cloud-first strategy and accelerate its global data and analytics efforts. As part of this partnership, the AI/ML team in International Banking has been leveraging GCP for analytics experiments, model training, and operationalizing models. With GCP in place, we enhanced the performance of the bank's Peru analytics operations by reducing bottlenecks between the data science and engineering teams. Our architecture on GCP also facilitates governing ML artifacts to support auditability, traceability, and compliance.

In this talk, we will share our journey of onboarding our machine learning use case to GCP at Scotiabank.
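
As a rough illustration of the Vertex AI training-to-serving pattern covered in the learnings below (placeholder project, bucket, and container names; not Scotiabank's code), the google-cloud-aiplatform SDK can upload a trained model and deploy it to a managed endpoint:

```python
# Hypothetical sketch using the google-cloud-aiplatform SDK; all names are placeholders.
from google.cloud import aiplatform

aiplatform.init(project="example-project", location="northamerica-northeast1")

model = aiplatform.Model.upload(
    display_name="example-risk-model",
    artifact_uri="gs://example-bucket/models/risk/",   # exported model artifacts
    serving_container_image_uri=(
        "us-docker.pkg.dev/vertex-ai/prediction/sklearn-cpu.1-0:latest"  # prebuilt container
    ),
)

endpoint = model.deploy(machine_type="n1-standard-2")   # managed, autoscaled endpoint
response = endpoint.predict(instances=[[0.2, 1.5, 3.1]])
print(response.predictions)
```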

Technical Level: 5/7

What you will learn: 

  • Scotiabank’s architecture, MLOps lifecycle and development on Google Cloud
  • Leveraging GCP’s Vertex AI for model training and serving pattern
  • Secure handling of production data on Google Cloud Platform
  • Advanced customer analytics success stories in International Banking

Robustness and Security for AI and the Dangerous Dismissal of Edge Cases

James Stewart, CEO, TrojAI Inc.

Abstract: 

As we look to deploy AI in mission-critical systems, it is no longer acceptable to dismiss AI limitations as edge cases. Doing so is akin to ignoring the robustness and security of models based on the fallacy that there is only a small, finite number of edge cases to be addressed. The truth is, there is a near-limitless number of edge cases and, paired with the probable improbable, AI systems across the world will encounter edge cases every single day. An edge case is something that will rarely occur in practice and can be either naturally occurring or malicious, like an adversarial attack. It is tempting to dismiss edge cases as unlikely to happen again once addressed. Sometimes the situation is so obscure that even humans might be confused, except that most won't be, and those that are will typically deal with the confusion more gracefully than AI. Conversely, when an AI is confused by an edge case, every instance of that AI in the system will be confused. The problem of edge cases is amplified because we cannot predict model performance using traditional accuracy metrics like recall, precision, and F1-score, which do not translate well from the lab to the real world.

In this talk, we present examples of both naturally occurring and malicious edge cases and discuss possible strategies for avoiding the situation where a new model is more accurate but more brittle to failure. Robustness metrics provide insight into problem classes and model failure bias, which can reduce risk by shaping models towards more benign failure cases. Regulations and significant penalties are emerging around Responsible AI, requiring industry to articulate what could go wrong with models and what steps have been taken to mitigate the risks and ultimately protect the pace of innovation.

Technical Level: 2/7

What's unique about this talk: 

Real-world examples of naturally occurring and malicious edge cases, and why dismissing them is dangerous for mission-critical AI.

What you will learn: 

The limitations and risks of AI and the coming regulations for Responsible AI.

Workshop/Tutorial: Introduction to Model Deployment with Ray Serve

Jules Damji, Lead Developer Advocate and Archit Kulkarni, Software Engineer, Anyscale Inc.

Abstract: 

This is a two-part introductory and hands-on guided tutorial of Ray and Ray Serve.

Part one covers a hands-on coding tour through the Ray core APIs, which provide powerful yet easy-to-use design patterns (tasks and actors) for implementing distributed systems in Python.

Building on the foundation of the Ray Core APIs, part two of this tutorial focuses on Ray Serve concepts: what Ray Serve is and why to use it, its scalable architecture, and model deployment patterns. Then, using code examples in Jupyter notebooks, we will take a coding tour of creating, exposing, and deploying models to Ray Serve using core deployment APIs.

And lastly, we will touch upon Ray Serve’s integration with model registries such as MLflow, walk through an end-to-end example, and discuss and show Ray Serve’s integration with FastAPI.
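
For readers who have not yet used Ray Serve, a minimal deployment looks roughly like the sketch below (Ray 2.x-style API, placeholder model; the actual tutorial notebooks go much further):

```python
# Minimal Ray Serve sketch (assumes `pip install "ray[serve]"` and a Ray 2.x API).
from ray import serve
from starlette.requests import Request


@serve.deployment(num_replicas=2)   # scale out by adding replicas
class Doubler:
    async def __call__(self, request: Request) -> dict:
        payload = await request.json()
        return {"result": 2 * payload["x"]}


serve.run(Doubler.bind())   # starts Serve and exposes the deployment over HTTP on port 8000
# Keep the process alive (e.g. in a notebook or via the `serve run` CLI), then:
#   curl -X POST http://localhost:8000/ -d '{"x": 21}'   ->  {"result": 42}
```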

Technical Level: 5/7

What's unique about this talk: 

This tutorial/hands-on in-person workshop will be delivered by an original creator of Ray and Ray Serve contributors and committers.

What you will learn: 

  • Use Ray Core APIs to convert Python functions/classes into a distributed setting
  • Use Ray Serve APIs to create, expose, and deploy models
  • Access and call deployment endpoints in Ray Serve via Python or HTTP
  • Configure compute resources and replicas to scale models in production
  • Learn about Ray Serve integrations with MLflow and FastAPI

Tips on a Successful MLOps Adoption Strategy: DoorDash Case Study

Hien Luu, Head of Machine Learning Platform, DoorDash

Abstract: 

MLOps is one of the hottest topics being discussed in the ML practitioner community. Streamlining ML development and productionizing ML are important ingredients in realizing the power of ML; however, they require vast and complex infrastructure. The ROI of ML projects starts only when they are in production. The journey to implementing MLOps will be unique to each company. At DoorDash, we've been applying MLOps for a couple of years to support a diverse set of ML use cases and to perform large-scale predictions at low latency. This session will share our approach to MLOps, as well as some of the learnings and challenges.

Technical Level: 5/7

What's unique about this talk: 

The learnings and insights from a real-world case study.

What you will learn: 

A strategy for adopting MLOps

Model Multi-Tenancy Isn’t Just a Glitch in the Matrix

Or Itzary, Chief Architect, Superwise

Abstract: 

Just like in software engineering, multi-tenancy is a choice, and a challenge, of scale. Unlike software engineering, ML multi-tenancy results in tenants with completely different model instances, data, metrics, hyperparameters, etc., running in production. Basically, the one model you started out with is about to take the red pill and go visit Alice.

Technical Level: 5/7

What you will learn: 

Selecting, building, and maintaining the right ML multi-tenant architecture for your organization while remaining sane.

  • Why and when to use model multi-tenancy
  • 4 degrees of model separation
  • Multi-tenancy deployment architecture evaluation
  • Multi-tenancy observability and monitoring
  • What’s next? How deep does the rabbit hole go?

MLOps at Rovio for Personalization (Self-Service Reinforcement Learning in Production)

Ignacio Amaya dela Pena, Lead Machine Learning Engineer,  Rovio

Abstract: 

Rovio's game teams leverage Beacon, our internal cloud services platform, which among other things enables them to leverage data to grow their games. Machine Learning is part of Beacon's offering. With a few clicks, games can start using Reinforcement Learning models with "personalized rules", which aim to replace the complex sets of rules and heuristics that are still common across all industries.

Technical Level: 6/7

What's unique about this talk: 

Technical details about MLOps for personalization at Rovio have not been published before. The business case, however, has been presented in https://www.rovio.com/articles/creating-personally-tailored-games-with-machine-learning.

What you will learn: 

From a business point of view, you will learn about the games personalization use case and how Rovio's ML product offering helps grow its games. From a technical point of view, you will learn about the MLOps required to run Reinforcement Learning use cases in production (both contextual bandits and deep reinforcement learning) and the main challenges we faced.

A Guide for Start-ups: How to Scale a PoC to a Production System and Not Go Up in Smoke

Maia Brenner, AI Specialist, Tryolabs

Abstract: 

Nowadays, AI is more than a promising idea; it has become imperative for gaining a competitive advantage. Companies have started to look to their data for insights to stay on track, taking their first steps on their AI journey with ad hoc pilot and PoC projects.

But without the proper roadmap and building blocks, many of these efforts will fall short and projects will never get into production. According to McKinsey, only 8% of companies have integrated AI into core practices that support widespread adoption.

How do you avoid falling into that category and succeed as an AI organization?

In this talk, we will introduce best practices to avoid getting trapped in the PoC phase.

Attendees will learn practical approaches for:

  • Avoiding common pitfalls when starting a PoC
  • Understanding the feasibility, impact, and ROI of different AI initiatives
  • Building a simple framework for PoC development to break away from the pack and be able to put the system in production

Technical Level: 6/7

What you will learn: 

  • MLOps best practices need to be adopted and integrated from the very beginning. 
  • Avoiding common pitfalls when starting a PoC
  • Understanding the feasibility, impact, and ROI of different AI initiatives
  • Building a simple framework for PoC development to break away from the pack and be able to put the system in production

CyclOps - A framework for Data Extraction, Model Evaluation, and Drift Detection for Clinical Use-Cases

Amrit Krishnan, Senior Applied ML Specialist, and Vallijah Subasri, Graduate Researcher & Applied Machine Learning Intern, Vector Institute

Abstract: 

The ever-growing application of Machine Learning (ML) in healthcare emphasizes the increasing need for a unified framework that harmonizes the various components involved in the development and deployment of robust clinical ML models. Namely, data extraction and model robustness are primary challenges in the healthcare domain. Data extraction is particularly convoluted due to a lack of standardization in the Electronic Health Record (EHR) systems used across hospitals. Building robust clinical ML systems has also proven difficult, attributed to dataset shifts that change feature distributions and lead to spurious predictions. Rigorous evaluation of ML models across time, hospital sites, and diverse patient cohorts is critical for identifying model degradation and informing model retraining.

Technical Level: 6/7

What's unique about this talk: 

The work is novel for a couple of reasons. First, a unified framework for data querying and processing is missing for healthcare EHR data. Second, the framework also includes a suite of experiments that attempt to detect dataset shift and its impact on model performance. Drift-detection approaches have not been benchmarked on tabular data, especially health data. Hence, by using our framework, we showcase the ease of running experiments and bring clinical risk-prediction models closer to deployment.

What you will learn: 

We wish to share the rationale and technical design of the framework with the broader MLOps community. Our framework will be open source and takes a different approach from enterprise solutions trying to solve similar problems in the healthcare domain. Furthermore, we are building it on top of one of Canada's largest retrospective databases collected for clinical use cases, with several hospitals in the Greater Toronto Area as partners, which makes it uniquely positioned.

Concrete Guidelines to Improve ML Model Quality, Based on Future ISO Certifications

Olivier Blais, VP Decision Science, Moov AI

Abstract: 

Only 15% of AI projects will yield results in 2022. That's bad. The good news: there is a better way. We can deliver high-quality AI systems that meet business objectives and drive adoption. Olivier Blais is Head of Decision Science and editor of the international AI ISO project on quality evaluation guidelines for AI systems. He lives and breathes redefining the quality of AI systems and applying it to real-world business challenges. He'll share a new quality evaluation approach, supported by the upcoming ISO standards, that redefines how you deliver your ML models and AI systems.

Technical Level: 5/7

What's unique about this talk: 

Many approaches and tools are still not widely available.

What you will learn: 

  • Current validation and testing methodologies are often not sufficient
  • There are new quality evaluation processes and tools
  • Proper quality evaluation enhances the likelihood of delivery success as well as adoption

A Guide to Building a Continuous MLOps Stack

Itay Ben Haim, ML Engineer, Superwise

Abstract: 

In this workshop, we'll take a dive into MLOps CI/CD pipeline automation with GCP, Superwise, and retraining/auto-resolution notebooks.

In part 1, we’ll focus on how to put together a continuous ML pipeline to train, deploy, and monitor models. Part 2 will focus on automations and production-first insights to detect and resolve issues continuously.
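
As a loose sketch of the "detect and resolve continuously" idea (placeholder functions and thresholds; not the workshop's Superwise or GCP code), an automation can poll a monitoring metric and kick off retraining when it crosses a threshold:

```python
# Hypothetical automation: check a drift metric and trigger retraining when needed.
# `get_drift_score` and `trigger_retraining_pipeline` are stand-ins for calls to a
# monitoring service and a CI/CD or pipeline system.

DRIFT_THRESHOLD = 0.10

def get_drift_score() -> float:
    """Stand-in for querying the monitoring platform for the latest drift value."""
    return 0.12

def trigger_retraining_pipeline() -> None:
    """Stand-in for launching a training pipeline run."""
    print("Retraining pipeline triggered")

def check_and_resolve() -> None:
    score = get_drift_score()
    if score > DRIFT_THRESHOLD:
        trigger_retraining_pipeline()
    else:
        print(f"Drift {score:.2f} is below threshold, nothing to do")

if __name__ == "__main__":
    check_and_resolve()   # in practice this runs on a schedule or from a webhook
```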

Technical Level: 3/7

What's unique about this talk: 

This is a practical live coding session together with the participants that shows how to implement MLOps level two.

What you will learn: 

  • How to build a continuous MLOps stack
  • Platform and tool alternatives for each step
  • Considerations for scaling up
  • Production-first insights and automations

Shopify's ML Platform Journey Using Open Source Tools: A Case Study on Building Merlin & AMA

Isaac Vidas, Machine Learning Platform Tech Lead,  Shopify

Abstract: 

Merlin, Shopify's new machine learning platform, is based on an open-source stack and tooling end to end. In this talk I will share a deeper look at the process and architecture, and how Merlin is helping us scale our ML work. This talk will be based on the following blog post, with additional details on our architecture, technologies, and tools: https://shopify.engineering/merlin-shopify-machine-learning-platform

Technical Level: 5/7

What's unique about this talk: 

The journey to come up with an architecture that is scalable and versatile for many different machine learning use cases.

What you will learn: 

How to build a machine learning platform with open source tools (Ray, Kubernetes, ML libraries, etc.)

MLOps for Deep Learning

Diego Klabjan, Professor and Yegna Jambunath, MLOps Researcher, Northwestern University, Center for Deep Learning

Abstract: 

In model serving, two important decisions are when to retrain the model and how to retrain it efficiently. Keeping one fixed model for the entire, often life-long, inference process is usually detrimental to model performance, as the data distribution evolves over time and a model trained on historical data becomes unreliable. It is important to detect drift and retrain the model in time. We present an ensemble drift-detection technique utilizing three different signals to capture data and concept drift. In a practical scenario, ground-truth labels for samples are received after a time lag, which our framework takes into account. Our framework automatically decides what data to use for retraining based on the signals. It also triggers a warning indicating a likelihood of drift.
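
A toy illustration of combining several drift signals into one retraining decision (illustrative only, not the presenters' algorithm) might look like this:

```python
# Hypothetical majority-vote drift ensemble over three signals; functions,
# data, and thresholds are illustrative placeholders.
import numpy as np
from scipy import stats

def feature_drift(train: np.ndarray, recent: np.ndarray) -> bool:
    """Data drift: distribution shift in an input feature."""
    return stats.ks_2samp(train, recent).pvalue < 0.05

def prediction_drift(train_preds: np.ndarray, recent_preds: np.ndarray) -> bool:
    """Concept-drift proxy: shift in the model's output distribution."""
    return abs(float(train_preds.mean()) - float(recent_preds.mean())) > 0.1

def delayed_label_drift(baseline_accuracy: float, recent_accuracy: float) -> bool:
    """Performance drop once lagged ground-truth labels arrive."""
    return baseline_accuracy - recent_accuracy > 0.05

def should_retrain(signals: list) -> bool:
    return sum(signals) >= 2   # retrain when a majority of signals fire

rng = np.random.default_rng(1)
signals = [
    feature_drift(rng.normal(size=2_000), rng.normal(0.4, 1.0, size=500)),
    prediction_drift(rng.uniform(size=2_000), rng.uniform(0.1, 1.1, size=500)),
    delayed_label_drift(baseline_accuracy=0.91, recent_accuracy=0.84),
]
print("retrain:", should_retrain(signals))
```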

Technical Level: 6/7

What's unique about this talk: 

In model serving, two important decisions are when to retrain the model and how to efficiently retrain it. We delve deeper into research and arrive at algorithms that decide when and how to retrain.

What you will learn: 

  • The practical challenges in Model serving for Deep Learning
  • Possible algorithmic and modeling solutions
  • How to use our open-source project which incorporates these aspects.

MLOps for Fairness - Creating Comprehensive Fairness Workflows

Bhaktipriya Radharapu, Tech Lead, Responsible AI, Google

Abstract: 

ML systems are creating new opportunities to improve the lives of people around the world, from business to healthcare to education. Although these ML systems may bring many benefits, they also contain inherent risks, such as codifying and entrenching biases, even when there is no intention for it.

This talk presents an overview of the main MLOps practices for identifying, measuring, and remediating bias in ML systems at scale. We begin by discussing the causes of algorithmic bias and metrics for fairness. We then deep-dive into performing bias remediation at every step of the ML life-cycle: data collection, pre-processing, training, and post-processing. We will also cover a gamut of tools in the Python ecosystem that can be used to create comprehensive fairness workflows. These tools have not only been vetted by the academic ML community but have also scaled very well to industry-level challenges.

We hope that by the end of this talk, ML developers will not only be able to flag fairness issues in ML but also fix them by incorporating these best practices into their ML workflows.
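
As a small, generic illustration of one widely used fairness metric (not the speaker's toolkit), demographic parity compares positive-prediction rates across groups:

```python
# Hypothetical sketch: demographic parity difference between two groups.
# The predictions and group labels below are toy data.
import numpy as np

def demographic_parity_difference(y_pred: np.ndarray, group: np.ndarray) -> float:
    """Difference in positive-prediction rate between group 1 and group 0."""
    rate_group_1 = float(y_pred[group == 1].mean())
    rate_group_0 = float(y_pred[group == 0].mean())
    return rate_group_1 - rate_group_0

y_pred = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])   # model decisions
group  = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0])   # protected attribute
gap = demographic_parity_difference(y_pred, group)
print(f"demographic parity gap = {gap:+.2f}")        # 0 means equal positive rates
```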

Technical Level: 5/7

What's unique about this talk: 

I will be giving industry-level examples backing each of the remediation techniques mentioned in the talk. There will also be several tips and gotchas discussed for using these techniques in production.

What you will learn: 

  • The causes of algorithmic bias and the metrics used to measure fairness
  • How to remediate bias at each step of the ML life-cycle
  • Python tools for building comprehensive fairness workflows

Managing a Data Science Team During the Great Resignation

Jessie Lamontagne, Data Science Manager,  Kinaxis

Abstract: 

The COVID-19 pandemic fundamentally changed the way we work, and no industry has seen as much change as the tech industry, with many tech giants committing to continue to support remote work for their employees and hybrid or remote work becoming the new normal. In this talk I cover the challenges and opportunities we face as leaders when managing an increasingly remote workforce that can now tap into global demand for tech talent. How we retain, motivate, and grow our teams requires rethinking the relationship we have with the firm and with each other, and what it means to build trust.

Technical Level: 5/7

What's unique about this talk: 

Candid discussion of what trust means, and a (former) labour economist's view on how workers maximize their lifetime income and life balance.

What you will learn: 

Keeping good talent requires treating them as individuals - each with unique goals, dreams and aspirations.

Managing Human in the Loop Systems Without Burning Out Your Engineers

Charles Huang, Software Engineer,  Pinterest

Abstract: 

Pinterest has a large-scale machine learning system that mines shopping content from merchant websites. We use human-in-the-loop systems to label data, train our machine learning models, and measure the quality of our shopping product content. How does our relatively small team of engineers ensure high accuracy and scalability?

This talk will discuss the systems and processes we have developed to manage humans in the loop with minimal engineering ops time. We will discuss our in-house HTML labeling system, which allows our contractor team to label data and train new models. We will also discuss working with vendor solutions to measure the accuracy of our shopping product content with the help of anonymous online workers.

Technical Level: 5/7

What you will learn: 

Processes and techniques for managing human in the loop machine learning systems in production. Stories and lessons from working with human labelers, labeling platforms, and in-house labeling tools. Challenges associated with large scale information extraction for shopping data with machine learning.

Cutting-edge NLP, Large Language Models, and Their Implications For Products and Research

Ehsan Amjadian, Director of AI & Technology,  RBC

Technical Level: 5/7

What you will learn: 

1. The current state of NLP and self-supervised learning

2. How we got here

3. What recent developments mean for industry and research

It’s All About The Data: Continuously Improve ML Models, The Data-Centric Way

Bernease Herman,  Senior Data Scientist, WhyLabs

Abstract: 

Data-centric AI is a growing trend in the ML and MLOps community, for good reason: issues in your dataset are a common cause of AI system failures and poor performance, and are often more important than improvements to the model alone. Visibility into ML data requires approaches that are purpose-built for datasets, their distributions, and their dependencies. At WhyLabs, we've open-sourced whylogs as the standard for ML data telemetry with these issues in mind.
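
whylogs is open source; with its v1 API, profiling a batch of data is roughly a one-liner (the DataFrame below is a placeholder, and the exact API may differ by version):

```python
# Minimal whylogs sketch (assumes `pip install whylogs` and the v1 API).
import pandas as pd
import whylogs as why

df = pd.DataFrame({
    "age": [34, 45, 29, 52],
    "income": [48_000, 72_000, 39_000, 91_000],
})

results = why.log(df)            # build a statistical profile of this batch
profile_view = results.view()    # lightweight telemetry, not the raw data
print(profile_view.to_pandas().head())
```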

Technical Level: 5/7

What you will learn: 

  • Understand ML dataset basics and statistical properties
  • Gain understanding of how to detect input data and label data quality issues
  • Gain understanding of how to improve ML model performance via dataset and segment analysis
  • Gain understanding of how to automatically identify data problems post deployment and trigger data improvement pipelines

The Critical Things You Have to Build to Transform Your Company to be ML-Driven

Yuval Fernbach, Co-Founder & CTO, Qwak

Abstract: 

1. First things first: how to define the problem you're trying to solve and make sure it's an ML challenge.

2. How do you know you have what it takes? Even if it's the right problem to solve with ML, you have to make sure you have the relevant coverage that will allow you to build ML on top.

3. How do you know you have the right team for the job? Not all data scientists are the same; when you're hiring your very first data scientist, you have to go as wide as possible, since data scientists who do only research/algorithms will get stopped quite early by real-world challenges.

4. What is the production-first approach and how can you use it? Make sure that you can expose the very first version of your models to their consumers, usually R&D engineers.

5. Do you have fully autonomous data scientists? Many data scientists are heavily dependent on others in order to release new models or even new versions; make sure you're setting up the right infrastructure for them to work fast and independently, as they are very expensive resources that you want to maximize.

Technical Level: 5/7

µlearn: a Microframework for Building Machine Learning Applications

Niels Bantilan, Machine Learning Engineer,  Union.ai

Abstract: 

A common problem in the machine learning development life cycle is the challenge of going from research to production. An ML team might need to modularize and refactor their code to work more efficiently or effectively in production. Sometimes this might even require re-implementing and maintaining feature engineering or model prediction logic in multiple places depending on whether the application requires offline, online, and/or streaming predictions.

Thinking about a solution to this problem, we can take inspiration from the web. The HTTP protocol, for example, standardizes the way we transfer data across the internet, providing a backbone of methods with clearly defined but flexible interfaces. As machine learning systems become more prevalent across industries, we wanted to ask the question: what if we had such a protocol for building and deploying machine learning applications at scale?

In this talk, we introduce µlearn (pronounced "micro-learn"), an open-source microframework for building machine learning applications. Created by the team behind Flyte, µlearn provides a simple, user-friendly interface for defining the building blocks of your machine learning application, from dataset curation and sampling to model training and prediction. Using these building blocks, µlearn automatically creates the workflows that you need to tune your models and deploy them to production for different prediction use cases, such as offline, online, or streaming contexts.

Technical Level: 5/7

What's unique about this talk: 

This will be the first talk given about µlearn, which is early in its development but has gathered interest among Flyte’s user base, including Spotify, Stripe, and ZipRecruiter. This talk will provide a unique perspective on the design of ergonomic high-level frameworks for machine learning.

What you will learn: 

µlearn is a microframework for building machine learning services that massively simplifies the process of going from research to production by providing a lightweight API for defining all the core components of a machine learning system.

Supporting Sales Forecasting at Scale for Canada's Largest Grocery Store

Mefta Sadat, Sr. Software Engineer and Cheng Chen, Senior Data Scientist, Loblaw Digital

Abstract: 

Loblaws is one of the largest grocery store chains in Canada, and our team at Loblaw Digital runs several ML systems in production, such as search re-ranking, recommendations, inventory prediction, and forecasting.

In this talk, we will share our experience setting up our MLOps platform on Google’s Vertex AI and walk you through a data science project using Vertex components like ML Metadata, Pipelines, and Model Training. The goal is to track different stages of the ML lifecycle using Vertex AI components and help our internal data science teams get from exploration to production rapidly on this platform.
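
For readers unfamiliar with Vertex AI Pipelines, submitting a compiled pipeline looks roughly like the sketch below (paths, project, and parameter names are placeholders, not Loblaw Digital's setup):

```python
# Hypothetical sketch: submit a compiled Kubeflow Pipelines definition to Vertex AI Pipelines.
from google.cloud import aiplatform

aiplatform.init(project="example-project", location="northamerica-northeast1")

job = aiplatform.PipelineJob(
    display_name="sales-forecasting-training",
    template_path="forecasting_pipeline.json",            # compiled with the KFP SDK
    pipeline_root="gs://example-bucket/pipeline-root",
    parameter_values={"train_table": "bq://example.sales.history"},
)
job.run(sync=False)   # runs remotely; Vertex ML Metadata records each stage of the run
```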

Technical Level: 6/7

What's unique about this talk: 

We will share how a production e-commerce ML system is running on Vertex AI.

What you will learn: 

Running an ML system in production using Vertex AI, and how tools and processes help a team adopt MLOps practices.

How to Conquer Data Drift & Prevent Stale Models in Production using DVC

Milecia McGregor, Developer Advocate, Iterative AI

Abstract: 

Deploying a machine learning model to production is not the end of the project. You have to constantly monitor the model for model drift and the underlying data drift that causes it. That means you have to retrain your model on new datasets often.

In this talk, we'll cover how you can use DVC to track all of the changes to your dataset across each model that gets trained and deployed to production. You’ll see how to reproduce experiments and how you can share experiments and their results with others on your team. By the end of the talk, you should feel comfortable switching between datasets as you keep your model up to date.
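
As a small illustration of this workflow (placeholder paths and revision tags, not the speaker's repository), the DVC Python API can pull specific dataset versions by Git revision:

```python
# Hypothetical sketch: read two versions of a DVC-tracked dataset by Git revision.
import pandas as pd
import dvc.api

# "v1.0" and "v2.0" are placeholder Git tags in a DVC-enabled repository.
with dvc.api.open("data/train.csv", rev="v1.0") as f:
    old_train = pd.read_csv(f)

with dvc.api.open("data/train.csv", rev="v2.0") as f:
    new_train = pd.read_csv(f)

print(len(old_train), "rows in v1.0 vs", len(new_train), "rows in v2.0")
# Retrain on new_train, then record the experiment and push the data with DVC.
```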

Technical Level: 4/7

What's unique about this talk: 

There aren't a lot of talks about how to prevent stale models in production using experiment and data tracking.

What you will learn: 

How to version your data as you get insights from production, and how to use that to deploy updated models.

SLA-Aware Machine Learning Inference Serving on Serverless Computing Platforms

Nima Mahmoudi, Machine Learning Engineer,  Telus Communications Inc.

Abstract: 

Serving machine learning inference workloads in the cloud is still a challenging task at the production level. Optimally configuring an inference workload to meet SLA requirements while minimizing infrastructure costs is highly complicated, due to the complex interaction between batch configuration, resource configuration, and a variable arrival process. Serverless computing has emerged in recent years to automate most infrastructure management tasks. Workload batching has shown the potential to improve the response time and cost-effectiveness of machine learning serving workloads, but it is not yet supported out of the box by serverless computing platforms. Our experiments have shown that for various machine learning workloads, batching can hugely improve the system's efficiency by reducing the processing overhead per request.

In this work, we present MLProxy, an adaptive reverse proxy to support efficient machine learning serving workloads on serverless computing systems. MLProxy supports adaptive batching to ensure SLA compliance while optimizing serverless costs. We performed rigorous experiments on Knative to demonstrate the effectiveness of MLProxy, showing that it can reduce the cost of serverless deployment by up to 92% while reducing SLA violations by up to 99%, results that generalize across state-of-the-art model-serving frameworks.
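
A toy illustration of the batching idea behind this line of work (not MLProxy itself): hold requests until either a maximum batch size or a latency budget is reached, then make one model call for the whole batch:

```python
# Hypothetical adaptive-batching sketch; the queue contents, budget, and model
# are placeholders, not the MLProxy implementation.
import time
from queue import Queue, Empty

def batch_worker(request_queue: Queue, model, max_batch: int = 8,
                 max_wait_s: float = 0.05) -> None:
    """Collect requests up to max_batch or max_wait_s, then run one inference call."""
    while True:   # runs forever; stop it with a sentinel or thread shutdown in practice
        batch, deadline = [], time.monotonic() + max_wait_s
        while len(batch) < max_batch:
            timeout = deadline - time.monotonic()
            if timeout <= 0:
                break
            try:
                batch.append(request_queue.get(timeout=timeout))
            except Empty:
                break
        if batch:
            outputs = model([item["input"] for item in batch])   # one call amortizes overhead
            for item, output in zip(batch, outputs):
                item["reply"](output)                            # hand each result back
```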

Technical Level: 3/7

What's unique about this talk: 

It is based on the latest research on serverless computing platforms, which is projected to be the dominant mode of deployment on the cloud.

What you will learn: 

About the design of our algorithm, MLProxy, which can help improve the performance and cost of deployment for machine learning workloads on serverless computing platforms.

Feature Engineering Made Simple

Anindya Datta, CEO & Founder and Kajanan Sangaralingam, Head of Data Science, Mobilewalla

Abstract: 

More art than science, feature engineering consumes 70-80% of the machine learning workflow. It is ad hoc, messy, error-prone, and has an outsize influence on the quality and resilience of predictive models. Join us for a hands-on workshop that explores new ways of refining feature engineering, turning it into a systematic, procedural process that is far more efficient than how it occurs today.

In this workshop, participants will perform a hands-on, end-to-end model-building workflow, with particular emphasis on feature engineering using Anovos, an open-source library that supports data ingestion, cleansing, analytics, feature generation, and transformation. Audience: this workshop is for practitioners with knowledge of machine learning, data engineering, or data science who have basic Python skills.

Technical Level: 5/7

What's unique about this talk: 

Feature engineering takes up a significant portion of a modeler's time and is critical to the success of ML efforts, but there has been little focus on bringing structure to this process. Mobilewalla has a team of experienced data scientists with over 100 ML models in production, who developed this open-source library to support their own efforts.

What you will learn: 

Feature engineering is a critically important part of the process, and there are tools available to help automate it and develop the stable, predictive features that build resilient ML models.

Wild Wild Tests: Testing Recommender Systems in the Wild

Jacopo Tagliabue, Director of AI, Coveo

Abstract: 

As with most machine learning systems, recommender systems are typically evaluated through performance metrics computed over held-out data points. However, real-world behavior is undoubtedly nuanced: ad hoc error analysis is often employed to ensure the desired quality in the wild. We argue in favor of behavioral testing for RecSys and leverage an open-source package, RecList, to show how to scale up real-world testing. In this workshop, we demonstrate how RecList can be used in research and production settings, with hands-on coding and practical examples.
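
As a generic illustration of a behavioral test (illustrative only, not the RecList API), one might assert that "similar item" recommendations stay within the seed item's category:

```python
# Hypothetical behavioral test: similar-item recommendations should mostly share
# the seed item's category. The catalog, recommender, and threshold are toy stand-ins.
catalog = {
    "sneaker-1": "shoes", "sneaker-2": "shoes", "boot-1": "shoes",
    "tent-1": "camping", "stove-1": "camping",
}

def toy_recommender(item_id: str, k: int = 2) -> list:
    """Stand-in model: recommend other items from the same category."""
    category = catalog[item_id]
    return [i for i in catalog if i != item_id and catalog[i] == category][:k]

def same_category_rate_test(recommend, seed: str, min_rate: float = 0.8) -> bool:
    recs = recommend(seed)
    rate = sum(catalog[r] == catalog[seed] for r in recs) / max(len(recs), 1)
    return rate >= min_rate

print(same_category_rate_test(toy_recommender, "sneaker-1"))   # True for the toy model
```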

Technical Level: 4/7

What you will learn: 

  • Testing recommender systems is hard
  • Standard quantitative tests fall short of fully describing RecSys behavior
  • Behavioral tests are great but hard to scale, so we need good software and heuristics to make them a feasible strategy

Machine Learning Infrastructure at Meta Scale

Shivam Bharuka, Senior AI Infra Engineer, Meta

Abstract: 

Machine learning models are growing rapidly in scale to support ranking models at Meta. In order to support this growth, we have re-imagined the entire AI infrastructure stack, from creating specialized hardware with powerful GPUs and network devices to designing optimized distributed training algorithms with PyTorch. In this talk, I will cover the challenges we encountered and the approach we took to re-design and scale the stack.

Technical Level: 6/7

What's unique about this talk: 

Insight into how the biggest social media company operates ML Infra at Scale

What you will learn: 

Understand how Meta supports the growth in machine learning by pushing the limits of the different layers in the stack.

If you're unemployed, or in a difficult financial situation, but would stand to benefit from this event, please email info@mlopsworld.com and we can help you out!
