Visualizing a Decision Tree - Machine Learning Recipes #2

0:00   [MUSIC PLAYING]
0:06   Last episode, we used a decision tree as our classifier.
0:09   Today we'll add code to visualize it
0:10   so we can see how it works under the hood.
0:13   There are many types of classifiers
0:14   you may have heard of before-- things like neural nets
0:16   or support vector machines.
0:17   So why did we use a decision tree to start?
0:20   Well, they have a unique property--
0:21   they're easy to read and understand.
0:23   In fact, they're one of the few models that are interpretable,
0:26   where you can understand exactly why the classifier makes
0:28   a decision.
0:29   That's amazingly useful in practice.
0:33   To get started, I'll introduce you
0:34   to a real data set we'll work with today.
0:37   It's called Iris.
0:38   Iris is a classic machine learning problem.
0:41   In it, you want to identify what type of flower
0:43   you have based on different measurements,
0:45   like the length and width of the petal.
0:46   The data set includes three different types of flowers.
0:49   They're all species of iris-- setosa, versicolor,
0:52   and virginica.
0:53   Scrolling down, you can see we're
0:55   given 50 examples of each type, so 150 examples total.
1:00   Notice there are four features that are
1:01   used to describe each example.
1:03   These are the length and width of the sepal and petal.
1:06   And just like in our apples and oranges problem,
1:08   the first four columns give the features and the last column
1:11   gives the labels, which is the type of flower in each row.
1:15   Our goal is to use this data set to train a classifier.
1:18   Then we can use that classifier to predict what species
1:21   of flower we have if we're given a new flower that we've never
1:23   seen before.
1:25   Knowing how to work with an existing data set
1:26   is a good skill, so let's import Iris into scikit-learn
1:29   and see what it looks like in code.
1:32   Conveniently, the friendly folks at scikit
1:33   provided a bunch of sample data sets,
1:35   including Iris, as well as utilities
1:37   to make them easy to import.
1:39   We can import Iris into our code like this.
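For reference, importing Iris might look like this (a minimal sketch, assuming scikit-learn is installed):

```python
from sklearn.datasets import load_iris

# Load the Iris dataset bundled with scikit-learn.
iris = load_iris()

# The metadata: feature names and species names.
print(iris.feature_names)  # four measurements, in centimeters
print(iris.target_names)   # setosa, versicolor, virginica
```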
1:42   The data set includes both the table
1:44   from Wikipedia as well as some metadata.
1:47   The metadata tells you the names of the features
1:49   and the names of different types of flowers.
1:52   The features and examples themselves
1:54   are contained in the data variable.
1:56   For example, if I print out the first entry,
1:58   you can see the measurements for this flower.
2:00   These index to the feature names, so the first value
2:03   refers to the sepal length, and the second to sepal width,
2:06   and so on.
2:09   The target variable contains the labels.
2:11   Likewise, these index to the target names.
2:14   Let's print out the first one.
2:16   A label of 0 means it's a setosa.
2:19   If you look at the table from Wikipedia,
2:21   you'll notice that we just printed out the first row.
2:24   Now both the data and target variables have 150 entries.
2:27   If you want, you can iterate over them
2:29   to print out the entire data set like this.
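As a sketch, inspecting the first entry and iterating over the full dataset could look like:

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Measurements for the first flower; the values index to feature_names.
print(iris.data[0])    # sepal length, sepal width, petal length, petal width

# Its label; 0 indexes to target_names[0], i.e. setosa.
print(iris.target[0])

# Print the entire dataset, one example per line.
for i in range(len(iris.target)):
    print("Example %d: label %s, features %s" % (i, iris.target[i], iris.data[i]))
```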
2:32   Now that we know how to work with the data set,
2:34   we're ready to train a classifier.
2:35   But before we do that, first we need to split up the data.
2:39   I'm going to remove several of the examples
2:41   and put them aside for later.
2:43   We'll call the examples I'm putting aside our testing data.
2:46   We'll keep these separate from our training data,
2:48   and later on we'll use our testing examples
2:50   to test how accurate the classifier is
2:53   on data it's never seen before.
2:55   Testing is actually a really important part
2:57   of doing machine learning well in practice,
2:59   and we'll cover it in more detail in a future episode.
3:02   Just for this exercise, I'll remove one example
3:04   of each type of flower.
3:06   And as it happens, the data set is
3:07   ordered so the first setosa is at index 0,
3:10   and the first versicolor is at 50, and so on.
3:14   The syntax looks a little bit complicated, but all I'm doing
3:16   is removing three entries from the data and target variables.
3:21   Then I'll create two new sets of variables-- one
3:24   for training and one for testing.
3:26   Training will have the majority of our data,
3:28   and testing will have just the examples I removed.
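In code, that split can be sketched with numpy's delete, assuming the three test indices described above (0, 50, and 100):

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()
test_idx = [0, 50, 100]  # first example of each species

# Training data: everything except the three examples we set aside.
train_target = np.delete(iris.target, test_idx)
train_data = np.delete(iris.data, test_idx, axis=0)

# Testing data: just the three examples we removed.
test_target = iris.target[test_idx]
test_data = iris.data[test_idx]
```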
3:31   Now, just as before, we can create a decision tree
3:33   classifier and train it on our training data.
3:40   Before we visualize it, let's use the tree
3:42   to classify our testing data.
3:44   We know we have one flower of each type,
3:47   and we can print out the labels we expect.
3:50   Now let's see what the tree predicts.
3:52   We'll give it the features for our testing data,
3:54   and we'll get back labels.
3:56   You can see the predicted labels match our testing data.
3:59   That means it got them all right.
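Putting the training and prediction steps together, a sketch of the whole check (with the same three held-out examples) might be:

```python
import numpy as np
from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()
test_idx = [0, 50, 100]

train_target = np.delete(iris.target, test_idx)
train_data = np.delete(iris.data, test_idx, axis=0)
test_target = iris.target[test_idx]
test_data = iris.data[test_idx]

# Train a decision tree on the training data.
clf = tree.DecisionTreeClassifier()
clf.fit(train_data, train_target)

# Compare the labels we expect with the labels the tree predicts.
print(test_target)             # the labels we expect
print(clf.predict(test_data))  # should match
```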
4:01   Now, keep in mind, this was a very simple test,
4:04   and we'll go into more detail down the road.
4:07   Now let's visualize the tree so we can
4:09   see how the classifier works.
4:11   To do that, I'm going to copy-paste
4:13   some code in from scikit's tutorials,
4:15   and because this code is for visualization
4:16   and not machine-learning concepts,
4:18   I won't cover the details here.
4:20   Note that I'm combining the code from these two examples
4:22   to create an easy-to-read PDF.
4:26   I can run our script and open up the PDF,
4:28   and we can see the tree.
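One way to export the tree for visualization is a sketch like the following, using scikit-learn's export_graphviz; actually rendering the PDF additionally requires the third-party graphviz package:

```python
import numpy as np
from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()
test_idx = [0, 50, 100]
train_target = np.delete(iris.target, test_idx)
train_data = np.delete(iris.data, test_idx, axis=0)

clf = tree.DecisionTreeClassifier()
clf.fit(train_data, train_target)

# Export the trained tree in Graphviz "dot" format.
dot_data = tree.export_graphviz(
    clf,
    out_file=None,
    feature_names=iris.feature_names,
    class_names=iris.target_names,
    filled=True,
    rounded=True,
    impurity=False,
)

# To write a PDF, render the dot source, e.g.:
#   import graphviz
#   graphviz.Source(dot_data).render("iris")  # creates iris.pdf
```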
4:30   To use it to classify data, you start by reading from the top.
4:33   Each node asks a yes or no question
4:35   about one of the features.
4:37   For example, this node asks if the petal width
4:39   is less than 0.8 centimeters.
4:41   If it's true for the example you're classifying, go left.
4:44   Otherwise, go right.
4:46   Now let's use this tree to classify an example
4:48   from our testing data.
4:50   Here are the features and label for our first testing flower.
4:53   Remember, you can find the feature names
4:54   by looking at the metadata.
4:56   We know this flower is a setosa, so let's see
4:58   what the tree predicts.
5:00   I'll resize the windows to make this easier to see.
5:03   And the first question the tree asks
5:04   is whether the petal width is less than 0.8 centimeters.
5:08   That's the fourth feature.
5:09   The answer is true, so we proceed left.
5:11   At this point, we're already at a leaf node.
5:14   There are no other questions to ask,
5:15   so the tree gives us a prediction, setosa,
5:18   and it's right.
5:19   Notice the label is 0, which indexes to that type of flower.
5:23   Now let's try our second testing example.
5:25   This one is a versicolor.
5:27   Let's see what the tree predicts.
5:29   Again we read from the top, and this time the petal width
5:31   is greater than 0.8 centimeters.
5:33   The answer to the tree's question is false,
5:35   so we go right.
5:36   The next question the tree asks is whether the petal width
5:39   is less than 1.75.
5:40   It's trying to narrow it down.
5:42   That's true, so we go left.
5:44   Now it asks if the petal length is less than 4.95.
5:47   That's true, so we go left again.
5:49   And finally, the tree asks if the petal width
5:51   is less than 1.65.
5:52   That's true, so left it is.
5:54   And now we have our prediction-- it's a versicolor,
5:57   and that's right again.
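Read this way, the tree is just nested if/else statements. A hand-written sketch of the branches traced above (thresholds copied from the rendered tree; the deeper branches are omitted):

```python
def classify_like_the_tree(petal_length, petal_width):
    """Illustrative hand-translation of the branches walked above."""
    if petal_width < 0.8:
        return "setosa"
    if petal_width < 1.75:
        if petal_length < 4.95:
            if petal_width < 1.65:
                return "versicolor"
            return "virginica"
        # ... the real tree asks more questions on this branch
    return "virginica"

# The two testing flowers traced above:
print(classify_like_the_tree(petal_length=1.4, petal_width=0.2))  # setosa
print(classify_like_the_tree(petal_length=4.7, petal_width=1.4))  # versicolor
```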
5:58   You can try the last one on your own as an exercise.
6:01   And remember, the way we're using the tree
6:03   is the same way it works in code.
6:05   So that's how you quickly visualize and read
6:07   a decision tree.
6:08   There's a lot more to learn here,
6:09   especially how they're built automatically from examples.
6:12   We'll get to that in a future episode.
6:14   But for now, let's close with an essential point.
6:17   Every question the tree asks must be about one
6:19   of your features.
6:20   That means the better your features are, the better a tree
6:22   you can build.
6:23   And the next episode will start looking
6:25   at what makes a good feature.
6:26   Thanks very much for watching, and I'll see you next time.
6:28   [MUSIC PLAYING]
Transcript: YouTube
