
Let’s Write a Pipeline - Machine Learning Recipes #4

0:00   [MUSIC PLAYING]
0:06   Welcome back.
0:07   We've covered a lot of ground already,
0:09   so today I want to review and reinforce concepts.
0:12   To do that, we'll explore two things.
0:14   First, we'll code up a basic pipeline
0:16   for supervised learning.
0:17   I'll show you how multiple classifiers
0:19   can solve the same problem.
0:21   Next, we'll build up a little more intuition
0:23   for what it means for an algorithm to learn something
0:25   from data, because that sounds kind of magical, but it's not.
0:29   To kick things off, let's look at a common experiment
0:31   you might want to do.
0:33   Imagine you're building a spam classifier.
0:35   That's just a function that labels an incoming email
0:37   as spam or not spam.
0:39   Now, say you've already collected a data set
0:41   and you're ready to train a model.
0:42   But before you put it into production,
0:44   there's a question you need to answer first--
0:46   how accurate will it be when you use it to classify emails that
0:49   weren't in your training data?
0:51   As best we can, we want to verify our models work well
0:54   before we deploy them.
0:56   And we can do an experiment to help us figure that out.
0:59   One approach is to partition our data set into two parts.
1:02   We'll call these Train and Test.
1:05   We'll use Train to train our model
1:07   and Test to see how accurate it is on new data.
1:10   That's a common pattern, so let's see how it looks in code.
1:13   To kick things off, let's import a data set into scikit-learn.
1:17   We'll use Iris again, because it's handily included.
1:20   Now, we already saw Iris in episode two.
1:21   But what we haven't seen before is
1:23   that I'm calling the features x and the labels y.
1:26   Why is that?
1:28   Well, that's because one way to think of a classifier
1:30   is as a function.
1:32   At a high level, you can think of x as the input
1:34   and y as the output.
1:36   I'll talk more about that in the second half of this episode.
1:39   After we import the data set, the first thing we want to do
1:42   is partition it into Train and Test.
1:44   And to do that, we can import a handy utility,
1:46   which makes the syntax clear.
1:48   We're taking our x's and our y's,
1:50   or our features and labels, and partitioning them
1:52   into two sets.
1:54   X_train and y_train are the features and labels
1:56   for the training set.
1:57   And X_test and y_test are the features and labels
2:00   for the testing set.
2:02   Here, I'm just saying that I want half the data to be
2:04   used for testing.
2:05   So if we have 150 examples in Iris, 75 will be in Train
2:09   and 75 will be in Test.
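The split described here might look like this in code (a sketch assuming current scikit-learn, where `train_test_split` lives in `sklearn.model_selection`; older versions kept it in `sklearn.cross_validation`):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X = iris.data    # features
y = iris.target  # labels

# Hold out half of the 150 examples for testing.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)

print(len(X_train), len(X_test))  # 75 75
```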
2:11   Now we'll create our classifier.
2:13   I'll use two different types here
2:14   to show you how they accomplish the same task.
2:17   Let's start with the decision tree we've already seen.
2:20   Note there's only two lines of code
2:22   that are classifier-specific.
2:25   Now let's train the classifier using our training data.
2:28   At this point, it's ready to be used to classify data.
2:31   And next, we'll call the predict method
2:33   and use it to classify our testing data.
2:35   If you print out the predictions,
2:37   you'll see a list of numbers.
2:38   These correspond to the type of Iris
2:40   the classifier predicts for each row in the testing data.
2:44   Now let's see how accurate our classifier
2:46   was on the testing set.
2:48   Recall that up top, we have the true labels for the testing
2:50   data.
2:51   To calculate our accuracy, we can
2:53   compare the predicted labels to the true labels,
2:55   and tally up the score.
2:57   There's a convenience method in scikit-learn
2:59   we can import to do that.
3:00   Notice here, our accuracy was over 90%.
3:03   If you try this on your own, it might be a little bit different
3:06   because of some randomness in how the Train/Test
3:08   data is partitioned.
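Putting the pieces together, the decision-tree half of the experiment can be sketched like this (a sketch with current scikit-learn; the exact accuracy varies from run to run because of the random split):

```python
from sklearn import tree
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5)

# The only two classifier-specific lines:
clf = tree.DecisionTreeClassifier()
clf = clf.fit(X_train, y_train)

# Classify the held-out data and score the predictions.
predictions = clf.predict(X_test)
print(accuracy_score(y_test, predictions))
```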
3:10   Now, here's something interesting.
3:11   By replacing these two lines, we can use a different classifier
3:14   to accomplish the same task.
3:16   Instead of using a decision tree,
3:18   we'll use one called KNeighborsClassifier.
3:20   If we run our experiment, we'll see that the code
3:23   works in exactly the same way.
3:25   The accuracy may be different when you run it,
3:27   because this classifier works a little bit differently
3:29   and because of the randomness in the Train/Test split.
3:32   Likewise, if we wanted to use a more sophisticated classifier,
3:35   we could just import it and change these two lines.
3:38   Otherwise, our code is the same.
3:40   The takeaway here is that while there are many different types
3:42   of classifiers, at a high level, they have a similar interface.
3:49   Now let's talk a little bit more about what
3:50   it means to learn from data.
3:53   Earlier, I said we called the features x and the labels y,
3:56   because they were the input and output of a function.
3:58   Now, of course, a function is something we already
4:00   know from programming.
4:02   def classify-- there's our function.
4:04   As we already know in supervised learning,
4:06   we don't want to write this ourselves.
4:09   We want an algorithm to learn it from training data.
4:12   So what does it mean to learn a function?
4:15   Well, a function is just a mapping from input
4:17   to output values.
4:18   Here's a function you might have seen before-- y
4:20   equals mx plus b.
4:22   That's the equation for a line, and there
4:24   are two parameters-- m, which gives the slope;
4:27   and b, which gives the y-intercept.
4:29   Given these parameters, of course,
4:31   we can plot the function for different values of x.
4:34   Now, in supervised learning, our classify function
4:36   might have some parameters as well,
4:38   but the input x are the features for an example we
4:41   want to classify, and the output y
4:43   is a label, like Spam or Not Spam, or a type of flower.
4:47   So what could the body of the function look like?
4:49   Well, that's the part we want to write algorithmically
4:51   or in other words, learn.
4:53   The important thing to understand here
4:55   is we're not starting from scratch
4:57   and pulling the body of the function out of thin air.
5:00   Instead, we start with a model.
5:01   And you can think of a model as the prototype for
5:04   or the rules that define the body of our function.
5:07   Typically, a model has parameters
5:08   that we can adjust with our training data.
5:10   And here's a high-level example of how this process works.
5:14   Let's look at a toy data set and think about what kind of model
5:17   we could use as a classifier.
5:19   Pretend we're interested in distinguishing
5:20   between red dots and green dots, some of which
5:23   I've drawn here on a graph.
5:25   To do that, we'll use just two features--
5:27   the x- and y-coordinates of a dot.
5:29   Now let's think about how we could classify this data.
5:32   We want a function that considers
5:34   a new dot it's never seen before,
5:35   and classifies it as red or green.
5:38   In fact, there might be a lot of data we want to classify.
5:40   Here, I've drawn our testing examples
5:42   in light green and light red.
5:44   These are dots that weren't in our training data.
5:47   The classifier has never seen them before, so how can
5:49   it predict the right label?
5:51   Well, imagine if we could somehow draw a line
5:53   across the data like this.
5:56   Then we could say the dots to the left
5:57   of the line are green and dots to the right of the line are
6:00   red.
6:00   And this line can serve as our classifier.
6:03   So how can we learn this line?
6:05   Well, one way is to use the training data to adjust
6:08   the parameters of a model.
6:09   And let's say the model we use is a simple straight line
6:12   like we saw before.
6:14   That means we have two parameters to adjust-- m and b.
6:17   And by changing them, we can change where the line appears.
6:21   So how could we learn the right parameters?
6:23   Well, one idea is that we can iteratively adjust
6:25   them using our training data.
6:27   For example, we might start with a random line
6:29   and use it to classify the first training example.
6:32   If it gets it right, we don't need to change our line,
6:35   so we move on to the next one.
6:36   But on the other hand, if it gets it wrong,
6:38   we could slightly adjust the parameters of our model
6:41   to make it more accurate.
6:43   The takeaway here is this.
6:44   One way to think of learning is using training data
6:47   to adjust the parameters of a model.
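The iterate-and-adjust idea can be made concrete with a toy, perceptron-style sketch. This code is illustrative, not from the video: the data, the learning rate `lr`, and the update rule are all assumptions.

```python
import random

random.seed(0)

# Toy data: a dot is "green" (+1) if it lies above the line y = x, else "red" (-1).
points = [(random.uniform(-5, 5), random.uniform(-5, 5)) for _ in range(200)]
labels = [1 if y > x else -1 for x, y in points]

# Model: a line y = m*x + b. Predict +1 for dots above it, -1 below it.
m = random.uniform(-2, 2)
b = random.uniform(-2, 2)
lr = 0.05  # how strongly one mistake nudges the parameters

for _ in range(100):  # several passes over the training data
    for (x, y), label in zip(points, labels):
        pred = 1 if y > m * x + b else -1
        if pred != label:
            # Wrong: nudge the line toward classifying this dot correctly.
            m -= lr * label * x
            b -= lr * label

accuracy = sum((1 if y > m * x + b else -1) == label
               for (x, y), label in zip(points, labels)) / len(points)
print(round(m, 2), round(b, 2), accuracy)
```

With enough passes the line should settle near the true boundary (m close to 1, b close to 0), and the training accuracy approaches 1.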
6:50   Now, here's something really special.
6:52   It's called tensorflow/playground.
6:55   This is a beautiful example of a neural network
6:57   you can run and experiment with right in your browser.
7:00   Now, this deserves its own episode for sure,
7:02   but for now, go ahead and play with it.
7:03   It's awesome.
7:04   The playground comes with different data
7:06   sets you can try out.
7:08   Some are very simple.
7:09   For example, we could use our line to classify this one.
7:12   Some data sets are much more complex.
7:15   This data set is especially hard.
7:17   And see if you can build a network to classify it.
7:20   Now, you can think of a neural network
7:21   as a more sophisticated type of classifier,
7:24   like a decision tree or a simple line.
7:26   But in principle, the idea is similar.
7:29   OK.
7:29   Hope that was helpful.
7:30   I just created a Twitter that you can follow
7:32   to be notified of new episodes.
7:33   And the next one should be out in a couple of weeks,
7:36   depending on how much work I'm doing for Google I/O. Thanks,
7:38   as always, for watching, and I'll see you next time.
Transcript: YouTube

What Makes a Good Feature? - Machine Learning Recipes #3

0:06   JOSH GORDON: Classifiers are only
0:08   as good as the features you provide.
0:10   That means coming up with good features
0:12   is one of your most important jobs in machine learning.
0:14   But what makes a good feature, and how can you tell?
0:17   If you're doing binary classification,
0:19   then a good feature makes it easy to decide
0:21   between two different things.
0:23   For example, imagine we wanted to write a classifier
0:26   to tell the difference between two types of dogs--
0:29   greyhounds and Labradors.
0:30   Here we'll use two features-- the dog's height in inches
0:34   and their eye color.
0:35   Just for this toy example, let's make a couple assumptions
0:38   about dogs to keep things simple.
0:40   First, we'll say that greyhounds are usually
0:43   taller than Labradors.
0:44   Next, we'll pretend that dogs have only two eye
0:47   colors-- blue and brown.
0:48   And we'll say the color of their eyes
0:50   doesn't depend on the breed of dog.
0:53   This means that one of these features is useful
0:55   and the other tells us nothing.
0:57   To understand why, we'll visualize them using a toy
1:01   dataset I'll create.
1:02   Let's begin with height.
1:04   How useful do you think this feature is?
1:06   Well, on average, greyhounds tend
1:08   to be a couple inches taller than Labradors, but not always.
1:11   There's a lot of variation in the world.
1:13   So when we think of a feature, we
1:15   have to consider how it looks for different values
1:17   in a population.
1:19   Let's head into Python for a programmatic example.
1:22   I'm creating a population of 1,000
1:24   dogs-- 50-50 greyhounds and Labradors.
1:27   I'll give each of them a height.
1:29   For this example, we'll say that greyhounds
1:31   are on average 28 inches tall and Labradors are 24.
1:35   Now, all dogs are a bit different.
1:37   Let's say that height is normally distributed,
1:39   so we'll make both of these plus or minus 4 inches.
1:42   This will give us two arrays of numbers,
1:44   and we can visualize them in a histogram.
1:47   I'll add a parameter so greyhounds are in red
1:49   and Labradors are in blue.
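The script described here can be sketched with NumPy and Matplotlib (a sketch under those assumptions; the video doesn't show every line on screen):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # draw off-screen; use plt.show() interactively
import matplotlib.pyplot as plt

# A population of 1,000 dogs, 50-50 greyhounds and Labradors.
greyhounds = 500
labs = 500

# Heights are normally distributed: 28 and 24 inches on average,
# plus or minus about 4 inches.
grey_height = 28 + 4 * np.random.randn(greyhounds)
lab_height = 24 + 4 * np.random.randn(labs)

# Histogram: greyhounds in red, Labradors in blue.
plt.hist([grey_height, lab_height], stacked=True, color=["r", "b"])
plt.savefig("dog_heights.png")
```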
1:51   Now we can run our script.
1:53   This shows how many dogs in our population have a given height.
1:57   There's a lot of data on the screen,
1:58   so let's simplify it and look at it piece by piece.
2:03   We'll start with dogs on the far left
2:05   of the distribution-- say, who are about 20 inches tall.
2:08   Imagine I asked you to predict whether a dog with this height
2:11   was a lab or a greyhound.
2:13   What would you do?
2:14   Well, you could figure out the probability of each type
2:16   of dog given their height.
2:18   Here, it's more likely the dog is a lab.
2:20   On the other hand, if we go all the way
2:22   to the right of the histogram and look
2:24   at a dog who is 35 inches tall, we
2:26   can be pretty confident they're a greyhound.
2:29   Now, what about a dog in the middle?
2:31   You can see the graph gives us less information
2:33   here, because the probability of each type of dog is close.
2:36   So height is a useful feature, but it's not perfect.
2:40   That's why in machine learning, you almost always
2:42   need multiple features.
2:43   Otherwise, you could just write an if statement
2:45   instead of bothering with the classifier.
2:47   To figure out what types of features you should use,
2:50   do a thought experiment.
2:52   Pretend you're the classifier.
2:53   If you were trying to figure out if this dog is
2:55   a lab or a greyhound, what other things would you want to know?
3:00   You might ask about their hair length,
3:01   or how fast they can run, or how much they weigh.
3:04   Exactly how many features you should use
3:06   is more of an art than a science,
3:08   but as a rule of thumb, think about how many you'd
3:10   need to solve the problem.
3:12   Now let's look at another feature like eye color.
3:15   Just for this toy example, let's imagine
3:17   dogs have only two eye colors, blue and brown.
3:20   And let's say the color of their eyes
3:22   doesn't depend on the breed of dog.
3:24   Here's what a histogram might look like for this example.
3:28   For most values, the distribution is about 50/50.
3:32   So this feature tells us nothing,
3:33   because it doesn't correlate with the type of dog.
3:36   Including a useless feature like this in your training
3:39   data can hurt your classifier's accuracy.
3:41   That's because there's a chance it might appear useful purely
3:45   by accident, especially if you have only a small amount
3:48   of training data.
3:50   You also want your features to be independent.
3:52   And independent features give you
3:54   different types of information.
3:56   Imagine we already have a feature-- height in inches--
3:59   in our dataset.
4:00   Ask yourself, would it be helpful
4:02   if we added another feature, like height in centimeters?
4:05   No, because it's perfectly correlated with one
4:08   we already have.
4:09   It's good practice to remove highly correlated features
4:12   from your training data.
4:14   That's because a lot of classifiers
4:15   aren't smart enough to realize that height in inches
4:18   and height in centimeters are the same thing,
4:20   so they might double count how important this feature is.
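You can see the redundancy numerically (a quick illustration, not from the video; the heights here are made up):

```python
import numpy as np

height_in = np.array([28.0, 24.5, 26.0, 30.5, 22.0])
height_cm = height_in * 2.54  # the same information, just rescaled

# Perfectly correlated features: the correlation coefficient is 1.0.
print(np.corrcoef(height_in, height_cm)[0, 1])  # 1.0
```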
4:23   Last, you want your features to be easy to understand.
4:26   For a new example, imagine you want
4:28   to predict how many days it will take
4:30   to mail a letter between two different cities.
4:33   The farther apart the cities are, the longer it will take.
4:37   A great feature to use would be the distance
4:39   between the cities in miles.
4:42   A much worse pair of features to use
4:44   would be the cities' locations given by their latitude
4:47   and longitude.
4:48   And here's why.
4:48   I can look at the distance and make
4:51   a good guess of how long it will take the letter to arrive.
4:54   But learning the relationship between latitude, longitude,
4:56   and time is much harder and would require many more
5:00   examples in your training data.
5:01   Now, there are techniques you can
5:03   use to figure out exactly how useful your features are,
5:05   and even what combinations of them are best,
5:08   so you never have to leave it to chance.
5:11   We'll get to those in a future episode.
5:13   Coming up next time, we'll continue building our intuition
5:16   for supervised learning.
5:17   We'll show how different types of classifiers
5:19   can be used to solve the same problem and dive a little bit
5:22   deeper into how they work.
5:24   Thanks very much for watching, and I'll see you then.
Transcript: YouTube

Visualizing a Decision Tree - Machine Learning Recipes #2

0:00   [MUSIC PLAYING]
0:06   Last episode, we used a decision tree as our classifier.
0:09   Today we'll add code to visualize it
0:10   so we can see how it works under the hood.
0:13   There are many types of classifiers
0:14   you may have heard of before-- things like neural nets
0:16   or support vector machines.
0:17   So why did we use a decision tree to start?
0:20   Well, they have a very unique property--
0:21   they're easy to read and understand.
0:23   In fact, they're one of the few models that are interpretable,
0:26   where you can understand exactly why the classifier makes
0:28   a decision.
0:29   That's amazingly useful in practice.
0:33   To get started, I'll introduce you
0:34   to a real data set we'll work with today.
0:37   It's called Iris.
0:38   Iris is a classic machine learning problem.
0:41   In it, you want to identify what type of flower
0:43   you have based on different measurements,
0:45   like the length and width of the petal.
0:46   The data set includes three different types of flowers.
0:49   They're all species of iris-- setosa, versicolor,
0:52   and virginica.
0:53   Scrolling down, you can see we're
0:55   given 50 examples of each type, so 150 examples total.
1:00   Notice there are four features that are
1:01   used to describe each example.
1:03   These are the length and width of the sepal and petal.
1:06   And just like in our apples and oranges problem,
1:08   the first four columns give the features and the last column
1:11   gives the labels, which is the type of flower in each row.
1:15   Our goal is to use this data set to train a classifier.
1:18   Then we can use that classifier to predict what species
1:21   of flower we have if we're given a new flower that we've never
1:23   seen before.
1:25   Knowing how to work with an existing data set
1:26   is a good skill, so let's import Iris into scikit-learn
1:29   and see what it looks like in code.
1:32   Conveniently, the friendly folks at scikit
1:33   provided a bunch of sample data sets,
1:35   including Iris, as well as utilities
1:37   to make them easy to import.
1:39   We can import Iris into our code like this.
1:42   The data set includes both the table
1:44   from Wikipedia as well as some metadata.
1:47   The metadata tells you the names of the features
1:49   and the names of different types of flowers.
1:52   The features and examples themselves
1:54   are contained in the data variable.
1:56   For example, if I print out the first entry,
1:58   you can see the measurements for this flower.
2:00   These index to the feature names, so the first value
2:03   refers to the sepal length, and the second to sepal width,
2:06   and so on.
2:09   The target variable contains the labels.
2:11   Likewise, these index to the target names.
2:14   Let's print out the first one.
2:16   A label of 0 means it's a setosa.
2:19   If you look at the table from Wikipedia,
2:21   you'll notice that we just printed out the first row.
2:24   Now both the data and target variables have 150 entries.
2:27   If you want, you can iterate over them
2:29   to print out the entire data set like this.
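The import-and-inspect steps described here might look like this (a sketch with current scikit-learn):

```python
from sklearn.datasets import load_iris

iris = load_iris()

# Metadata: the names of the features and of the flower types.
print(iris.feature_names)  # sepal/petal length and width, in cm
print(iris.target_names)   # setosa, versicolor, virginica

# The first example: its measurements and its label.
print(iris.data[0])    # e.g. [5.1 3.5 1.4 0.2]
print(iris.target[0])  # 0, which indexes to "setosa"

# Iterate over all 150 entries to print the whole data set.
for i in range(len(iris.target)):
    print("Example %d: label %s, features %s" % (i, iris.target[i], iris.data[i]))
```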
2:32   Now that we know how to work with the data set,
2:34   we're ready to train a classifier.
2:35   But before we do that, first we need to split up the data.
2:39   I'm going to remove several of the examples
2:41   and put them aside for later.
2:43   We'll call the examples I'm putting aside our testing data.
2:46   We'll keep these separate from our training data,
2:48   and later on we'll use our testing examples
2:50   to test how accurate the classifier is
2:53   on data it's never seen before.
2:55   Testing is actually a really important part
2:57   of doing machine learning well in practice,
2:59   and we'll cover it in more detail in a future episode.
3:02   Just for this exercise, I'll remove one example
3:04   of each type of flower.
3:06   And as it happens, the data set is
3:07   ordered so the first setosa is at index 0,
3:10   and the first versicolor is at 50, and so on.
3:14   The syntax looks a little bit complicated, but all I'm doing
3:16   is removing three entries from the data and target variables.
3:21   Then I'll create two new sets of variables-- one
3:24   for training and one for testing.
3:26   Training will have the majority of our data,
3:28   and testing will have just the examples I removed.
3:31   Now, just as before, we can create a decision tree
3:33   classifier and train it on our training data.
3:40   Before we visualize it, let's use the tree
3:42   to classify our testing data.
3:44   We know we have one flower of each type,
3:47   and we can print out the labels we expect.
3:50   Now let's see what the tree predicts.
3:52   We'll give it the features for our testing data,
3:54   and we'll get back labels.
3:56   You can see the predicted labels match our testing data.
3:59   That means it got them all right.
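The split, training, and prediction described above can be sketched like this (current scikit-learn; the held-out indices 0, 50, and 100 are the first flower of each species):

```python
import numpy as np
from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()
test_idx = [0, 50, 100]  # one example of each species

# Training data: everything except the three held-out rows.
train_data = np.delete(iris.data, test_idx, axis=0)
train_target = np.delete(iris.target, test_idx)

# Testing data: just those three rows.
test_data = iris.data[test_idx]
test_target = iris.target[test_idx]

clf = tree.DecisionTreeClassifier()
clf.fit(train_data, train_target)

print(test_target)             # the labels we expect: [0 1 2]
print(clf.predict(test_data))  # the tree's predictions
```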
4:01   Now, keep in mind, this was a very simple test,
4:04   and we'll go into more detail down the road.
4:07   Now let's visualize the tree so we can
4:09   see how the classifier works.
4:11   To do that, I'm going to copy-paste
4:13   some code in from scikit's tutorials,
4:15   and because this code is for visualization
4:16   and not machine-learning concepts,
4:18   I won't cover the details here.
4:20   Note that I'm combining the code from these two examples
4:22   to create an easy-to-read PDF.
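The code on screen combines graphviz and pydot snippets from scikit-learn's tutorials, which need extra packages. As an assumption on my part, here is a dependency-light alternative using the `plot_tree` helper that current scikit-learn ships; it is not the exact code from the video, but it produces a comparable PDF.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render the figure off-screen
import matplotlib.pyplot as plt
from sklearn import tree
from sklearn.datasets import load_iris

iris = load_iris()
test_idx = [0, 50, 100]
train_data = np.delete(iris.data, test_idx, axis=0)
train_target = np.delete(iris.target, test_idx)

clf = tree.DecisionTreeClassifier()
clf.fit(train_data, train_target)

# Draw the tree and save it as an easy-to-read PDF.
fig = plt.figure(figsize=(12, 8))
tree.plot_tree(clf, feature_names=iris.feature_names,
               class_names=list(iris.target_names), filled=True)
fig.savefig("iris_tree.pdf")
```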
4:26   I can run our script and open up the PDF,
4:28   and we can see the tree.
4:30   To use it to classify data, you start by reading from the top.
4:33   Each node asks a yes or no question
4:35   about one of the features.
4:37   For example, this node asks if the petal width
4:39   is less than 0.8 centimeters.
4:41   If it's true for the example you're classifying, go left.
4:44   Otherwise, go right.
4:46   Now let's use this tree to classify an example
4:48   from our testing data.
4:50   Here are the features and label for our first testing flower.
4:53   Remember, you can find the feature names
4:54   by looking at the metadata.
4:56   We know this flower is a setosa, so let's see
4:58   what the tree predicts.
5:00   I'll resize the windows to make this easier to see.
5:03   And the first question the tree asks
5:04   is whether the petal width is less than 0.8 centimeters.
5:08   That's the fourth feature.
5:09   The answer is true, so we proceed left.
5:11   At this point, we're already at a leaf node.
5:14   There are no other questions to ask,
5:15   so the tree gives us a prediction, setosa,
5:18   and it's right.
5:19   Notice the label is 0, which indexes to that type of flower.
5:23   Now let's try our second testing example.
5:25   This one is a versicolor.
5:27   Let's see what the tree predicts.
5:29   Again we read from the top, and this time the petal width
5:31   is greater than 0.8 centimeters.
5:33   The answer to the tree's question is false,
5:35   so we go right.
5:36   The next question the tree asks is whether the petal width
5:39   is less than 1.75.
5:40   It's trying to narrow it down.
5:42   That's true, so we go left.
5:44   Now it asks if the petal length is less than 4.95.
5:47   That's true, so we go left again.
5:49   And finally, the tree asks if the petal width
5:51   is less than 1.65.
5:52   That's true, so left it is.
5:54   And now we have our prediction-- it's a versicolor,
5:57   and that's right again.
5:58   You can try the last one on your own as an exercise.
6:01   And remember, the way we're using the tree
6:03   is the same way it works in code.
6:05   So that's how you quickly visualize and read
6:07   a decision tree.
6:08   There's a lot more to learn here,
6:09   especially how they're built automatically from examples.
6:12   We'll get to that in a future episode.
6:14   But for now, let's close with an essential point.
6:17   Every question the tree asks must be about one
6:19   of your features.
6:20   That means the better your features are, the better a tree
6:22   you can build.
6:23   And the next episode will start looking
6:25   at what makes a good feature.
6:26   Thanks very much for watching, and I'll see you next time.
6:28   [MUSIC PLAYING]
Transcript: YouTube
