Week 3 Overview

Week 3 turns our focus from the big picture to the details of collecting and organizing data from the real world, with a focus on quality. Gone are the days when machine learning was mostly done with fixed data sets on Kaggle. These days, end-to-end machine learning starts by developing a proof of concept and deploying it as quickly as possible. Generally, this means scoping an ML project so that you can start with a pre-trained model and then fine-tune it with data that is easy to find or collect.

The best practice from industry titans like Hugging Face is to create your proof of concept (POC) and get it out of the notebook and into stakeholders' hands fast.

Take a listen to what Julien Simon, Chief Evangelist at Hugging Face (formerly of AWS), had to say about doing this within hours or days, not weeks.

https://youtu.be/IcC0RZ8Eb5Q

This is where intelligently scoping projects meets a data-centric approach. Now that large general-purpose pre-trained models (think GPT-3) exist alongside emerging technology like AutoML, being able to make the most of application-specific data is quickly becoming the most valuable skill in industry. Short of deep subject-matter expertise, there are standard ways of ensuring “high-quality” data and taking a data-centric approach.

This week we review what dealing with data in the real world is like by taking a closer look at the importance of data, the fundamental principles that underlie data-centric approaches, and how to define your data’s lineage when you begin a project. We’ll also build on our sentiment analyzer to get to a POC quickly by collecting high-quality streaming text data from social media platforms (including Twitter and Reddit), and then using a subset of the transformed data to fine-tune a pre-trained transformer model.
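To make that last step concrete, here is a minimal sketch of fine-tuning a pre-trained transformer on a small labeled subset of collected text. It assumes the Hugging Face transformers and datasets libraries are installed; the checkpoint choice and the tiny placeholder dataset are illustrative, not the exact code we'll write in class.

```python
# A minimal sketch of fine-tuning a pre-trained transformer for sentiment
# classification. The texts/labels below are placeholders standing in for a
# small, hand-labeled subset of the social-media data we collect this week.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

texts = ["Great earnings call!", "This launch was a disaster."]  # placeholder examples
labels = [1, 0]                                                  # 1 = positive, 0 = negative

checkpoint = "distilbert-base-uncased"  # example checkpoint; any sentiment-capable model works
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Tokenize the raw text so the model receives input_ids and attention masks.
dataset = Dataset.from_dict({"text": texts, "label": labels})
dataset = dataset.map(lambda row: tokenizer(row["text"], truncation=True, padding="max_length"))

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sentiment-poc",
                           num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=dataset,
)
trainer.train()
```

In practice, you would swap the placeholder examples for the cleaned, labeled subset of the streaming data we collect later in the week.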

To prepare for technical discussions this week, we recommend that you review the following selected materials on API basics, the AI product development lifecycle, data-centric AI, and ethical AI.

To prepare for your live coding session, check out the section below for an overview of sentiment analysis and a few key tools that we’ll be using as we get started building our first minimum viable ML product over the first few weeks.

Additional supplemental materials this week include a deeper dive into API basics. This is incredibly important information for those of you who have never worked as a software engineer, as part of an agile team, or on software applications built from microservices.

Preparing for Technical Discussions

  1. Read These six best practices for data collection and evaluation when building ML models

    People + AI Guidebook

  2. Watch (~30 minutes) the “Collecting Data” videos from Course 2 in the MLOps Specialization on Canvas here.

  3. Read Labeling & Crowdsourcing, by Michael Bernstein

    Labeling and Crowdsourcing - Data-centric AI Resource Hub

  4. Read about the six most common types of bias when working with data. Remember, neither too much variance nor too much bias is a good thing when developing ML applications!

    The 6 most common types of bias when working with data

Preparing for Live Coding

Real-World Example APIs

We’ll be hitting a real-world API this week, starting by taking the work we did last week to the next level with live Twitter data. A sketch of what a request might look like follows the documentation link below.

Twitter API Documentation
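If you want to poke at the API before class, here is a minimal sketch of calling the Twitter API v2 recent search endpoint. It assumes you have a developer account with a bearer token stored in a TWITTER_BEARER_TOKEN environment variable; the query string is just an illustrative example, not the one we'll use in class.

```python
# A minimal sketch of pulling recent tweets from the Twitter API v2
# recent search endpoint using a bearer token.
import os
import requests

BEARER_TOKEN = os.environ["TWITTER_BEARER_TOKEN"]  # assumes you exported your token

response = requests.get(
    "https://api.twitter.com/2/tweets/search/recent",
    headers={"Authorization": f"Bearer {BEARER_TOKEN}"},
    params={
        "query": "(tesla OR $TSLA) lang:en -is:retweet",  # example market-sentiment query
        "max_results": 10,
        "tweet.fields": "created_at,public_metrics",
    },
)
response.raise_for_status()

# Print the timestamp and text of each returned tweet.
for tweet in response.json().get("data", []):
    print(tweet["created_at"], tweet["text"])
```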

For our in-class assignment, we’ll use the Reddit API to pull relevant, high-quality subreddit threads and fine-tune our sentiment analyzer for market sentiment analysis.

reddit.com: api documentation
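As a rough preview (not the exact in-class code), here is a minimal sketch of pulling subreddit submissions with the PRAW library. The credentials, subreddit name, and limit are placeholders; you would register an app at reddit.com/prefs/apps to obtain your own client id and secret.

```python
# A minimal sketch of collecting subreddit posts with PRAW to feed our
# sentiment analyzer. Credentials and subreddit below are placeholders.
import praw

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="market-sentiment-poc by u/your_username",
)

documents = []
for submission in reddit.subreddit("stocks").hot(limit=25):
    # Combine the title and body text into one document for the sentiment model.
    documents.append(f"{submission.title}\n{submission.selftext}".strip())

print(f"Collected {len(documents)} documents for fine-tuning.")
```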