0:00 [MUSIC PLAYING]
0:06 Last episode, we used a decision tree as our classifier.
0:09 Today we'll add code to visualize it
0:10 so we can see how it works under the hood.
0:13 There are many types of classifiers
0:14 you may have heard of before-- things like neural nets
0:16 or support vector machines.
0:17 So why did we use a decision tree to start?
0:20 Well, they have a unique property--
0:21 they're easy to read and understand.
0:23 In fact, they're one of the few models that are interpretable,
0:26 where you can understand exactly why the classifier makes
0:28 a decision.
0:29 That's amazingly useful in practice.
0:33 To get started, I'll introduce you
0:34 to a real data set we'll work with today.
0:37 It's called Iris.
0:38 Iris is a classic machine learning problem.
0:41 In it, you want to identify what type of flower
0:43 you have based on different measurements,
0:45 like the length and width of the petal.
0:46 The data set includes three different types of flowers.
0:49 They're all species of iris-- setosa, versicolor,
0:52 and virginica.
0:53 Scrolling down, you can see we're
0:55 given 50 examples of each type, so 150 examples total.
1:00 Notice there are four features that are
1:01 used to describe each example.
1:03 These are the length and width of the sepal and petal.
1:06 And just like in our apples and oranges problem,
1:08 the first four columns give the features and the last column
1:11 gives the labels, which is the type of flower in each row.
1:15 Our goal is to use this data set to train a classifier.
1:18 Then we can use that classifier to predict what species
1:21 of flower we have if we're given a new flower that we've never
1:23 seen before.
1:25 Knowing how to work with an existing data set
1:26 is a good skill, so let's import Iris into scikit-learn
1:29 and see what it looks like in code.
1:32 Conveniently, the friendly folks at scikit
1:33 provided a bunch of sample data sets,
1:35 including Iris, as well as utilities
1:37 to make them easy to import.
1:39 We can import Iris into our code like this.
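A minimal sketch of that import, assuming scikit-learn is installed (the `load_iris` utility is the one the scikit data-set helpers provide):

```python
# Load the Iris sample data set bundled with scikit-learn.
from sklearn.datasets import load_iris

iris = load_iris()

# The metadata: names of the four features and the three species.
print(iris.feature_names)  # e.g. ['sepal length (cm)', 'sepal width (cm)', ...]
print(iris.target_names)   # e.g. ['setosa' 'versicolor' 'virginica']
```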
1:42 The data set includes both the table
1:44 from Wikipedia as well as some metadata.
1:47 The metadata tells you the names of the features
1:49 and the names of different types of flowers.
1:52 The features and examples themselves
1:54 are contained in the data variable.
1:56 For example, if I print out the first entry,
1:58 you can see the measurements for this flower.
2:00 These index to the feature names, so the first value
2:03 refers to the sepal length, and the second to sepal width,
2:06 and so on.
2:09 The target variable contains the labels.
2:11 Likewise, these index to the target names.
2:14 Let's print out the first one.
2:16 A label of 0 means it's a setosa.
2:19 If you look at the table from Wikipedia,
2:21 you'll notice that we just printed out the first row.
2:24 Now both the data and target variables have 150 entries.
2:27 If you want, you can iterate over them
2:29 to print out the entire data set like this.
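A short sketch of that inspection, assuming scikit-learn is installed; the first example's values are the standard Iris measurements:

```python
from sklearn.datasets import load_iris

iris = load_iris()

# First example's measurements; these line up with iris.feature_names.
print(iris.data[0])    # [5.1 3.5 1.4 0.2]

# Its label; this indexes into iris.target_names, so 0 means setosa.
print(iris.target[0])  # 0

# Iterate over all 150 entries to print the entire data set.
for i in range(len(iris.target)):
    print("Example %d: label %s, features %s" % (i, iris.target[i], iris.data[i]))
```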
2:32 Now that we know how to work with the data set,
2:34 we're ready to train a classifier.
2:35 But before we do that, first we need to split up the data.
2:39 I'm going to remove several of the examples
2:41 and put them aside for later.
2:43 We'll call the examples I'm putting aside our testing data.
2:46 We'll keep these separate from our training data,
2:48 and later on we'll use our testing examples
2:50 to test how accurate the classifier is
2:53 on data it's never seen before.
2:55 Testing is actually a really important part
2:57 of doing machine learning well in practice,
2:59 and we'll cover it in more detail in a future episode.
3:02 Just for this exercise, I'll remove one example
3:04 of each type of flower.
3:06 And as it happens, the data set is
3:07 ordered so the first setosa is at index 0,
3:10 and the first versicolor is at 50, and so on.
3:14 The syntax looks a little bit complicated, but all I'm doing
3:16 is removing three entries from the data and target variables.
3:21 Then I'll create two new sets of variables-- one
3:24 for training and one for testing.
3:26 Training will have the majority of our data,
3:28 and testing will have just the examples I removed.
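The split described above can be sketched like this, using `numpy.delete` to remove the first example of each species (indices 0, 50, and 100) from the training data:

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# The data set is ordered, so the first example of each species
# sits at indices 0, 50, and 100.
test_idx = [0, 50, 100]

# Training data: everything except the three removed examples.
train_target = np.delete(iris.target, test_idx)
train_data = np.delete(iris.data, test_idx, axis=0)

# Testing data: just the three examples we put aside.
test_target = iris.target[test_idx]
test_data = iris.data[test_idx]
```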
3:31 Now, just as before, we can create a decision tree
3:33 classifier and train it on our training data.
3:40 Before we visualize it, let's use the tree
3:42 to classify our testing data.
3:44 We know we have one flower of each type,
3:47 and we can print out the labels we expect.
3:50 Now let's see what the tree predicts.
3:52 We'll give it the features for our testing data,
3:54 and we'll get back labels.
3:56 You can see the predicted labels match our testing data.
3:59 That means it got them all right.
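Put together, the training and prediction steps look roughly like this (a sketch assuming the same split of indices 0, 50, and 100 into testing data):

```python
import numpy as np
from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()
test_idx = [0, 50, 100]
train_target = np.delete(iris.target, test_idx)
train_data = np.delete(iris.data, test_idx, axis=0)
test_target = iris.target[test_idx]
test_data = iris.data[test_idx]

# Train a decision tree classifier on the training data.
clf = tree.DecisionTreeClassifier()
clf.fit(train_data, train_target)

# The labels we expect for the three held-out flowers...
print(test_target)             # [0 1 2]
# ...and what the tree predicts given their features.
print(clf.predict(test_data))  # should match: [0 1 2]
```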
4:01 Now, keep in mind, this was a very simple test,
4:04 and we'll go into more detail down the road.
4:07 Now let's visualize the tree so we can
4:09 see how the classifier works.
4:11 To do that, I'm going to copy-paste
4:13 some code in from scikit's tutorials,
4:15 and because this code is for visualization
4:16 and not machine-learning concepts,
4:18 I won't cover the details here.
4:20 Note that I'm combining the code from these two examples
4:22 to create an easy-to-read PDF.
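One way that combined visualization code might look today, using scikit-learn's `export_graphviz` (trained here on the full data set for brevity; rendering the dot text to a PDF additionally requires the Graphviz tools):

```python
from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()
clf = tree.DecisionTreeClassifier().fit(iris.data, iris.target)

# Export the trained tree in Graphviz dot format, with readable
# feature and class names taken from the data set's metadata.
dot = tree.export_graphviz(clf, out_file=None,
                           feature_names=iris.feature_names,
                           class_names=iris.target_names,
                           filled=True, rounded=True)

# To produce the PDF, write the dot text to a file and render it, e.g.:
#   with open("iris.dot", "w") as f: f.write(dot)
#   dot -Tpdf iris.dot -o iris.pdf
print(dot[:80])
```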
4:26 I can run our script and open up the PDF,
4:28 and we can see the tree.
4:30 To use it to classify data, you start by reading from the top.
4:33 Each node asks a yes or no question
4:35 about one of the features.
4:37 For example, this node asks if the petal width
4:39 is less than 0.8 centimeters.
4:41 If it's true for the example you're classifying, go left.
4:44 Otherwise, go right.
4:46 Now let's use this tree to classify an example
4:48 from our testing data.
4:50 Here are the features and label for our first testing flower.
4:53 Remember, you can find the feature names
4:54 by looking at the metadata.
4:56 We know this flower is a setosa, so let's see
4:58 what the tree predicts.
5:00 I'll resize the windows to make this easier to see.
5:03 And the first question the tree asks
5:04 is whether the petal width is less than 0.8 centimeters.
5:08 That's the fourth feature.
5:09 The answer is true, so we proceed left.
5:11 At this point, we're already at a leaf node.
5:14 There are no other questions to ask,
5:15 so the tree gives us a prediction, setosa,
5:18 and it's right.
5:19 Notice the label is 0, which indexes to that type of flower.
5:23 Now let's try our second testing example.
5:25 This one is a versicolor.
5:27 Let's see what the tree predicts.
5:29 Again we read from the top, and this time the petal width
5:31 is greater than 0.8 centimeters.
5:33 The answer to the tree's question is false,
5:35 so we go right.
5:36 The next question the tree asks is whether the petal width
5:39 is less than 1.75.
5:40 It's trying to narrow it down.
5:42 That's true, so we go left.
5:44 Now it asks if the petal length is less than 4.95.
5:47 That's true, so we go left again.
5:49 And finally, the tree asks if the petal width
5:51 is less than 1.65.
5:52 That's true, so left it is.
5:54 And now we have our prediction-- it's a versicolor,
5:57 and that's right again.
5:58 You can try the last one on your own as an exercise.
6:01 And remember, the way we're using the tree
6:03 is the same way it works in code.
6:05 So that's how you quickly visualize and read
6:07 a decision tree.
6:08 There's a lot more to learn here,
6:09 especially how they're built automatically from examples.
6:12 We'll get to that in a future episode.
6:14 But for now, let's close with an essential point.
6:17 Every question the tree asks must be about one
6:19 of your features.
6:20 That means the better your features are, the better a tree
6:22 you can build.
6:23 And the next episode will start looking
6:25 at what makes a good feature.
6:26 Thanks very much for watching, and I'll see you next time.
6:28 [MUSIC PLAYING]