Machine Learning Yearning
Table of Contents
1 Why Machine Learning Strategy
2 How to use this book to help your team
3 Prerequisites and Notation
4 Scale drives machine learning progress
5 Your development and test sets
1 Why Machine Learning Strategy
Machine learning is the foundation of countless important applications, including web
search, email anti-spam, speech recognition, product recommendations, and more. I assume
that you or your team is working on a machine learning application, and that you want to
make rapid progress. This book will help you do so.
Example: Building a cat picture startup
Say you’re building a startup that will provide an endless stream of cat pictures to cat lovers.
You use a neural network to build a computer vision system for detecting cats in pictures.
But tragically, your learning algorithm’s accuracy is not yet good enough. You are under
tremendous pressure to improve your cat detector. What do you do?
Your team has a lot of ideas, such as:
- Get more data: Collect more pictures of cats.
- Collect a more diverse training set. For example, pictures of cats in unusual positions; cats
with unusual coloration; pictures shot with a variety of camera settings; ….
- Train the algorithm longer, by running more gradient descent iterations.
- Try a bigger neural network, with more layers/hidden units/parameters.
Page 6 Machine Learning Yearning-Draft Andrew Ng
- Try a smaller neural network.
- Try adding regularization (such as L2 regularization).
- Change the neural network architecture (activation function, number of hidden units, etc.)
- …
If you choose well among these possible directions, you’ll build the leading cat picture
platform, and lead your company to success. If you choose poorly, you might waste months.
How do you proceed?
2 How to use this book to help your team
After finishing this book, you will have a deep understanding of how to set technical
direction for a machine learning project.
But your teammates might not understand why you’re recommending a particular direction.
Perhaps you want your team to define a single-number evaluation metric, but they aren’t
convinced. How do you persuade them?
That’s why I made the chapters short: So that you can print them out and get your
teammates to read just the 1-2 pages you need them to know.
A few changes in prioritization can have a huge effect on your team’s productivity. By helping
your team with a few such changes, I hope that you can become the superhero of your team!
3 Prerequisites and Notation
If you have taken a Machine Learning course such as my machine learning MOOC on
Coursera, or if you have experience applying supervised learning, you will be able to
understand this text.
I assume you are familiar with supervised learning: learning a function that maps from x
to y, using labeled training examples (x,y). Supervised learning algorithms include linear
regression, logistic regression, and neural networks. There are many forms of machine
learning, but the majority of Machine Learning’s practical value today comes from
supervised learning.
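To make the notation concrete, here is a minimal sketch of supervised learning: it learns a function from x to y by fitting a one-dimensional linear regression to labeled examples (x, y) with least squares. The names `fit_line` and `predict` and the toy data are assumptions made up for illustration.

```python
from typing import List, Tuple

Example = Tuple[float, float]  # one labeled training example (x, y)


def fit_line(examples: List[Example]) -> Tuple[float, float]:
    """Least-squares fit of y ~ w * x + b to the labeled examples."""
    n = len(examples)
    mean_x = sum(x for x, _ in examples) / n
    mean_y = sum(y for _, y in examples) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in examples)
    var_x = sum((x - mean_x) ** 2 for x, _ in examples)
    w = cov_xy / var_x
    b = mean_y - w * mean_x
    return w, b


def predict(w: float, b: float, x: float) -> float:
    """Apply the learned function to a new input x."""
    return w * x + b


# Toy labeled data (an assumption for this sketch): y = 2x + 1 exactly.
training_examples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b = fit_line(training_examples)
prediction = predict(w, b, 4.0)  # the learned mapping applied to an unseen x
```

Logistic regression and neural networks follow the same pattern: labeled examples go in, and a function that maps new x values to predicted y values comes out.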
I will frequently refer to neural networks (also known as “deep learning”). You’ll only need a
basic understanding of what they are to follow this text.
If you are not familiar with the concepts mentioned here, watch the first three weeks of
videos in the Machine Learning course on Coursera at http://ml-class.org
4 Scale drives machine learning progress
Many of the ideas of deep learning (neural networks) have been around for decades. Why are
these ideas taking off now?
Two of the biggest drivers of recent progress have been:
- Data availability. People are now spending more time on digital devices (laptops, mobile
devices). Their digital activities generate huge amounts of data that we can feed to our
learning algorithms.
- Computational scale. We started just a few years ago to be able to train neural
networks that are big enough to take advantage of the huge datasets we now have.
In detail, even as you accumulate more data, usually the performance of older learning
algorithms, such as logistic regression, “plateaus.” This means its learning curve “flattens
out,” and the algorithm stops improving even as you give it more data:
It was as if the older algorithms didn’t know what to do with all the data we now have.
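One way to check for a plateau is to record the error on held-out examples as the training set grows. The sketch below is a toy illustration under stated assumptions: `learning_curve`, `fit_mean`, and `mean_squared_error` are hypothetical helpers, the data is synthetic, and the "model" is a single constant, chosen only because it has too little capacity to benefit from more data.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[float, float]  # (x, y)


def learning_curve(
    train_examples: Sequence[Example],
    held_out: Sequence[Example],
    sizes: Sequence[int],
    fit: Callable[[Sequence[Example]], float],
    error: Callable[[float, Sequence[Example]], float],
) -> List[Tuple[int, float]]:
    """Train on the first n examples for each n and record held-out error."""
    curve = []
    for n in sizes:
        model = fit(train_examples[:n])
        curve.append((n, error(model, held_out)))
    return curve


def fit_mean(examples: Sequence[Example]) -> float:
    """Toy low-capacity model: always predict the average training label."""
    return sum(y for _, y in examples) / len(examples)


def mean_squared_error(prediction: float, examples: Sequence[Example]) -> float:
    """Average squared difference between the constant prediction and y."""
    return sum((y - prediction) ** 2 for _, y in examples) / len(examples)


# Synthetic data (an assumption): labels cycle through 0, 1, 2, 3.
train_examples = [(float(i), float(i % 4)) for i in range(16)]
held_out = [(float(i), float(i % 4)) for i in range(4)]

curve = learning_curve(
    train_examples, held_out, [1, 2, 4, 8, 16], fit_mean, mean_squared_error
)
# The error falls from 3.5 to 2.25 to 1.25 and then stays at 1.25:
# this model cannot make use of the extra examples.
```

A larger model would be plugged in through the same `fit` and `error` arguments, which makes it easy to compare curves for models of different sizes.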
If you train a small neural network (NN) on the same supervised learning task, you might
get slightly better performance:
Here, by “Small NN” we mean a neural network with only a small number of hidden
units/layers/parameters. Finally, if you train larger and larger neural networks, you can
obtain even better performance:
5 Your development and test sets
Let’s return to our earlier cat pictures example: You run a mobile app, and users are
uploading pictures of many different things to your app. You want to automatically find the
cat pictures.
Your team gets a large training set by downloading pictures of cats (positive examples) and
non-cats (negative examples) off of different websites. They split the dataset 70%/30% into
training and test sets. Using this data, they build a cat detector that works well on the
training and test sets.
But when you deploy this classifier into the mobile app, you find that the performance is
really poor!
What happened?
You figure out that the pictures users are uploading have a different look than the website
images that make up your training set: Users are uploading pictures taken with mobile
phones, which tend to be lower resolution, blurrier, and poorly lit. Since your training/test
sets were made of website images, your algorithm did not generalize well to the actual distribution you care about: mobile phone pictures.
Before the modern era of big data, it was a common rule in machine learning to use a
random 70%/30% split to form your training and test sets. This practice can work, but it’s a
bad idea in more and more applications where the training distribution (website images in
our example above) is different from the distribution you ultimately care about (mobile
phone images).
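For reference, the traditional random 70%/30% split can be sketched as follows. The function name `random_split`, the fixed seed, and the file names are assumptions for illustration only.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def random_split(
    examples: Sequence[T], train_fraction: float = 0.7, seed: int = 0
) -> Tuple[List[T], List[T]]:
    """Shuffle the examples and cut them into training and test sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


# Hypothetical file names standing in for the downloaded website images.
images = [f"website_image_{i}.jpg" for i in range(10)]
train_set, test_set = random_split(images)
```

Note that both sets are drawn from the same pool of website images, so a good test score here says nothing about how the classifier does on mobile phone images.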
We usually define:
- Training set: which you run your learning algorithm on.
- Dev (development) set: which you use to tune parameters, select features, and
make other decisions regarding the learning algorithm. Sometimes also called the
hold-out cross validation set.
- Test set: which you use to evaluate the performance of the algorithm, but not to make
any decisions regarding what learning algorithm or parameters to use.
Once you define a dev set (development set) and test set, your team will try a lot of ideas,
such as different learning algorithm parameters, to see what works best. The dev and test
sets allow your team to quickly see how well your algorithm is doing.
In other words, the purpose of the dev and test sets is to direct your team toward
the most important changes to make to the machine learning system.
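The division of labor between the dev and test sets can be sketched in a few lines. This is a simplified illustration, not a full training pipeline: the candidates are hand-written threshold classifiers standing in for trained models, and the names `accuracy`, `make_threshold_classifier`, and `select_on_dev`, along with the data, are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[float, int]  # (feature x, label y) with y in {0, 1}
Classifier = Callable[[float], int]


def accuracy(classifier: Classifier, examples: Sequence[Example]) -> float:
    """Fraction of examples the classifier labels correctly."""
    correct = sum(1 for x, y in examples if classifier(x) == y)
    return correct / len(examples)


def make_threshold_classifier(threshold: float) -> Classifier:
    """Stand-in for a trained model: predict 1 when x >= threshold."""
    return lambda x: 1 if x >= threshold else 0


def select_on_dev(thresholds: List[float], dev_set: Sequence[Example]) -> float:
    """Pick the candidate setting with the highest dev set accuracy."""
    return max(
        thresholds,
        key=lambda t: accuracy(make_threshold_classifier(t), dev_set),
    )


# Toy data, made up for this sketch.
dev_set = [(0.1, 0), (0.3, 0), (0.45, 0), (0.6, 1), (0.8, 1), (0.9, 1)]
test_set = [(0.2, 0), (0.4, 0), (0.55, 1), (0.7, 1)]

# All decisions are made on the dev set ...
best_threshold = select_on_dev([0.2, 0.5, 0.7], dev_set)
# ... and the test set is used only once, to report the final number.
test_accuracy = accuracy(make_threshold_classifier(best_threshold), test_set)
```

Keeping the test set out of the selection step is what lets its score serve as an honest estimate of how the chosen system performs.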
So, you should do the following:
Choose dev and test sets to reflect data you expect to get in the future
and want to do well on.
In other words, your test set should not simply be 30% of the available data, especially if you
expect your future data (mobile phone images) to be different in nature from your training
set (website images).
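In code, this recommendation might look like the sketch below: train on the plentiful website images, but draw the dev and test sets only from the data you care about. The helper `build_splits`, the even dev/test division, and the file names are all assumptions for illustration.

```python
import random
from typing import Dict, List, Sequence


def build_splits(
    website_images: Sequence[str], mobile_images: Sequence[str], seed: int = 0
) -> Dict[str, List[str]]:
    """Train on website images; take dev and test from mobile images only."""
    target = list(mobile_images)
    random.Random(seed).shuffle(target)  # seeded so the split is repeatable
    half = len(target) // 2  # even dev/test division is an assumption
    return {
        "train": list(website_images),
        "dev": target[:half],
        "test": target[half:],
    }


# Hypothetical file names standing in for the two sources of images.
website_images = [f"web_{i}.jpg" for i in range(1000)]
mobile_images = [f"mobile_{i}.jpg" for i in range(200)]
splits = build_splits(website_images, mobile_images)
```

With this arrangement, dev and test accuracy measure performance on mobile phone images, even though the training data comes from elsewhere.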
If you have not yet launched your mobile app, you might not have any users yet, and thus
might not be able to get data that accurately reflects what you have to do well on in the
future. But you might still try to approximate this. For example, ask your friends to take
mobile phone pictures of cats and send them to you. Once your app is launched, you can
update your dev/test sets using actual user data.
If you really don’t have any way of getting data that approximates what you expect to get in
the future, perhaps you can start by using website images. But you should be aware of the
risk of this leading to a system that doesn’t generalize well.
It requires judgment to decide how much to invest in developing great dev and test sets. But
don’t assume your training distribution is the same as your test distribution. Try to pick test examples that reflect what you ultimately want to perform well on, rather than whatever data
you happen to have for training.