Let’s Write a Pipeline - Machine Learning Recipes #4

0:00   [MUSIC PLAYING]
0:06   Welcome back.
0:07   We've covered a lot of ground already,
0:09   so today I want to review and reinforce concepts.
0:12   To do that, we'll explore two things.
0:14   First, we'll code up a basic pipeline
0:16   for supervised learning.
0:17   I'll show you how multiple classifiers
0:19   can solve the same problem.
0:21   Next, we'll build up a little more intuition
0:23   for what it means for an algorithm to learn something
0:25   from data, because that sounds kind of magical, but it's not.
0:29   To kick things off, let's look at a common experiment
0:31   you might want to do.
0:33   Imagine you're building a spam classifier.
0:35   That's just a function that labels an incoming email
0:37   as spam or not spam.
0:39   Now, say you've already collected a data set
0:41   and you're ready to train a model.
0:42   But before you put it into production,
0:44   there's a question you need to answer first--
0:46   how accurate will it be when you use it to classify emails that
0:49   weren't in your training data?
0:51   As best we can, we want to verify our models work well
0:54   before we deploy them.
0:56   And we can do an experiment to help us figure that out.
0:59   One approach is to partition our data set into two parts.
1:02   We'll call these Train and Test.
1:05   We'll use Train to train our model
1:07   and Test to see how accurate it is on new data.
1:10   That's a common pattern, so let's see how it looks in code.
1:13   To kick things off, let's import a data set into scikit-learn.
1:17   We'll use Iris again, because it's handily included.
1:20   Now, we already saw Iris in episode two.
1:21   But what we haven't seen before is
1:23   that I'm calling the features x and the labels y.
1:26   Why is that?
1:28   Well, that's because one way to think of a classifier
1:30   is as a function.
1:32   At a high level, you can think of x as the input
1:34   and y as the output.
1:36   I'll talk more about that in the second half of this episode.
1:39   After we import the data set, the first thing we want to do
1:42   is partition it into Train and Test.
1:44   And to do that, we can import a handy utility
1:46   called train_test_split, which makes the syntax clear.
1:48   We're taking our x's and our y's,
1:50   or our features and labels, and partitioning them
1:52   into two sets.
1:54   X_train and y_train are the features and labels
1:56   for the training set.
1:57   And X_test and y_test are the features and labels
2:00   for the testing set.
2:02   Here, I'm just saying that I want half the data to be
2:04   used for testing.
2:05   So if we have 150 examples in Iris, 75 will be in Train
2:09   and 75 will be in Test.
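A minimal sketch of that partitioning step, assuming a modern scikit-learn where `train_test_split` lives in `sklearn.model_selection` (older versions, like the one in the video, imported it from `sklearn.cross_validation`):

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data, iris.target

# test_size=0.5 holds out half the data for testing:
# 75 examples in Train, 75 in Test.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)

print(len(X_train), len(X_test))  # 75 75
```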
2:11   Now we'll create our classifier.
2:13   I'll use two different types here
2:14   to show you how they accomplish the same task.
2:17   Let's start with the decision tree we've already seen.
2:20   Note there's only two lines of code
2:22   that are classifier-specific.
2:25   Now let's train the classifier using our training data.
2:28   At this point, it's ready to be used to classify data.
2:31   And next, we'll call the predict method
2:33   and use it to classify our testing data.
2:35   If you print out the predictions,
2:37   you'll see a list of numbers.
2:38   These correspond to the type of Iris
2:40   the classifier predicts for each row in the testing data.
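Put together, the decision tree version of the pipeline might look like this. Note the two classifier-specific lines the narration calls out; everything else is generic setup. The split is random, so the exact predictions will vary from run to run.

```python
from sklearn import datasets, tree
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.5)

# The two classifier-specific lines: create the classifier, then train it.
my_classifier = tree.DecisionTreeClassifier()
my_classifier.fit(X_train, y_train)

# Classify the held-out test examples.
predictions = my_classifier.predict(X_test)
print(predictions)  # one predicted species (0, 1, or 2) per test row
```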
2:44   Now let's see how accurate our classifier
2:46   was on the testing set.
2:48   Recall that up top, we have the true labels for the testing
2:50   data.
2:51   To calculate our accuracy, we can
2:53   compare the predicted labels to the true labels,
2:55   and tally up the score.
2:57   There's a convenience method in scikit-learn
2:59   we can import to do that.
3:00   Notice here, our accuracy was over 90%.
3:03   If you try this on your own, it might be a little bit different
3:06   because of some randomness in how the Train/Test
3:08   data is partitioned.
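The convenience method in question is `accuracy_score` from `sklearn.metrics`, which compares the predicted labels to the true labels and tallies the score. A sketch of the full experiment:

```python
from sklearn import datasets, tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.5)

clf = tree.DecisionTreeClassifier()
clf.fit(X_train, y_train)
predictions = clf.predict(X_test)

# Compare predicted labels to the true test labels and tally the score.
accuracy = accuracy_score(y_test, predictions)
print(accuracy)  # typically above 0.9, varying with the random split
```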
3:10   Now, here's something interesting.
3:11   By replacing these two lines, we can use a different classifier
3:14   to accomplish the same task.
3:16   Instead of using a decision tree,
3:18   we'll use one called k-nearest neighbors.
3:20   If we run our experiment, we'll see that the code
3:23   works in exactly the same way.
3:25   The accuracy may be different when you run it,
3:27   because this classifier works a little bit differently
3:29   and because of the randomness in the Train/Test split.
3:32   Likewise, if we wanted to use a more sophisticated classifier,
3:35   we could just import it and change these two lines.
3:38   Otherwise, our code is the same.
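Swapping classifiers really is a two-line change. Here is the same pipeline with scikit-learn's `KNeighborsClassifier` standing in for the decision tree; only the marked lines differ from the previous version.

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.5)

# Only these two lines change; the rest of the pipeline is identical.
clf = KNeighborsClassifier()
clf.fit(X_train, y_train)

predictions = clf.predict(X_test)
print(accuracy_score(y_test, predictions))
```

Any other scikit-learn classifier with `fit` and `predict` methods could be dropped in the same way, which is the "similar interface" point the narration makes next.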
3:40   The takeaway here is that while there are many different types
3:42   of classifiers, at a high level, they have a similar interface.
3:49   Now let's talk a little bit more about what
3:50   it means to learn from data.
3:53   Earlier, I said we called the features x and the labels y,
3:56   because they were the input and output of a function.
3:58   Now, of course, a function is something we already
4:00   know from programming.
4:02   def classify-- there's our function.
4:04   As we already know in supervised learning,
4:06   we don't want to write this ourselves.
4:09   We want an algorithm to learn it from training data.
4:12   So what does it mean to learn a function?
4:15   Well, a function is just a mapping from input
4:17   to output values.
4:18   Here's a function you might have seen before-- y
4:20   equals mx plus b.
4:22   That's the equation for a line, and there
4:24   are two parameters-- m, which gives the slope;
4:27   and b, which gives the y-intercept.
4:29   Given these parameters, of course,
4:31   we can plot the function for different values of x.
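As a tiny illustration, y = mx + b really is just a function of x once the two parameters are fixed:

```python
# A line as a function: m is the slope, b is the y-intercept.
def line(x, m, b):
    return m * x + b

# With m=2 and b=1 fixed, we can evaluate the line at any x:
print(line(0, m=2, b=1))  # 1
print(line(3, m=2, b=1))  # 7
```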
4:34   Now, in supervised learning, our classifier function
4:36   might have some parameters as well,
4:38   but the input x are the features for an example we
4:41   want to classify, and the output y
4:43   is a label, like Spam or Not Spam, or a type of flower.
4:47   So what could the body of the function look like?
4:49   Well, that's the part we want to write algorithmically
4:51   or in other words, learn.
4:53   The important thing to understand here
4:55   is we're not starting from scratch
4:57   and pulling the body of the function out of thin air.
5:00   Instead, we start with a model.
5:01   And you can think of a model as the prototype for,
5:04   or the rules that define, the body of our function.
5:07   Typically, a model has parameters
5:08   that we can adjust with our training data.
5:10   And here's a high-level example of how this process works.
5:14   Let's look at a toy data set and think about what kind of model
5:17   we could use as a classifier.
5:19   Pretend we're interested in distinguishing
5:20   between red dots and green dots, some of which
5:23   I've drawn here on a graph.
5:25   To do that, we'll use just two features--
5:27   the x- and y-coordinates of a dot.
5:29   Now let's think about how we could classify this data.
5:32   We want a function that considers
5:34   a new dot it's never seen before,
5:35   and classifies it as red or green.
5:38   In fact, there might be a lot of data we want to classify.
5:40   Here, I've drawn our testing examples
5:42   in light green and light red.
5:44   These are dots that weren't in our training data.
5:47   The classifier has never seen them before, so how can
5:49   it predict the right label?
5:51   Well, imagine if we could somehow draw a line
5:53   across the data like this.
5:56   Then we could say the dots to the left
5:57   of the line are green and dots to the right of the line are
6:00   red.
6:00   And this line can serve as our classifier.
6:03   So how can we learn this line?
6:05   Well, one way is to use the training data to adjust
6:08   the parameters of a model.
6:09   And let's say the model we use is a simple straight line
6:12   like we saw before.
6:14   That means we have two parameters to adjust-- m and b.
6:17   And by changing them, we can change where the line appears.
6:21   So how could we learn the right parameters?
6:23   Well, one idea is that we can iteratively adjust
6:25   them using our training data.
6:27   For example, we might start with a random line
6:29   and use it to classify the first training example.
6:32   If it gets it right, we don't need to change our line,
6:35   so we move on to the next one.
6:36   But on the other hand, if it gets it wrong,
6:38   we could slightly adjust the parameters of our model
6:41   to make it more accurate.
6:43   The takeaway here is this.
6:44   One way to think of learning is using training data
6:47   to adjust the parameters of a model.
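The loop the narration describes can be sketched as a perceptron-style update rule. This toy example is an assumption on my part, not the video's code: dots above a hidden line are green (+1), dots below are red (-1), and whenever the current line misclassifies a dot we nudge m and b slightly toward the correct side.

```python
import random

random.seed(0)

# Toy data: dots above the (hidden) true line y = x + 1 are green (+1),
# dots below it are red (-1).
def true_label(px, py):
    return 1 if py > px + 1 else -1

dots = [(random.uniform(-5, 5), random.uniform(-5, 5)) for _ in range(200)]
data = [(px, py, true_label(px, py)) for px, py in dots]

# Start with a random line y = m*x + b.
m, b = random.uniform(-1, 1), random.uniform(-1, 1)
lr = 0.01  # how much to nudge the parameters on each mistake

for _ in range(50):  # several passes over the training data
    for px, py, label in data:
        predicted = 1 if py > m * px + b else -1
        if predicted != label:
            # Wrong answer: slightly adjust the parameters so the line
            # moves toward putting this dot on the correct side.
            m -= lr * label * px
            b -= lr * label

# After training, the learned line should classify most dots correctly.
correct = sum((1 if py > m * px + b else -1) == label
              for px, py, label in data)
print(correct / len(data))
```

Real classifiers use more principled update rules (gradient descent on a loss function, for instance), but the core idea is the same: training data drives small adjustments to the model's parameters.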
6:50   Now, here's something really special.
6:52   It's called TensorFlow Playground.
6:55   This is a beautiful example of a neural network
6:57   you can run and experiment with right in your browser.
7:00   Now, this deserves its own episode for sure,
7:02   but for now, go ahead and play with it.
7:03   It's awesome.
7:04   The playground comes with different data
7:06   sets you can try out.
7:08   Some are very simple.
7:09   For example, we could use our line to classify this one.
7:12   Some data sets are much more complex.
7:15   This data set is especially hard.
7:17   And see if you can build a network to classify it.
7:20   Now, you can think of a neural network
7:21   as a more sophisticated type of classifier,
7:24   like a decision tree or a simple line.
7:26   But in principle, the idea is similar.
7:29   OK.
7:29   Hope that was helpful.
7:30   I just created a Twitter account that you can follow
7:32   to be notified of new episodes.
7:33   And the next one should be out in a couple of weeks,
7:36   depending on how much work I'm doing for Google I/O. Thanks,
7:38   as always, for watching, and I'll see you next time.
Transcript: YouTube