Transcript
0:06JOSH GORDON: Classifiers are only
0:08as good as the features you provide.
0:10That means coming up with good features
0:12is one of your most important jobs in machine learning.
0:14But what makes a good feature, and how can you tell?
0:17If you're doing binary classification,
0:19then a good feature makes it easy to decide
0:21between two different things.
0:23For example, imagine we wanted to write a classifier
0:26to tell the difference between two types of dogs--
0:29greyhounds and Labradors.
0:30Here we'll use two features-- the dog's height in inches
0:34and their eye color.
0:35Just for this toy example, let's make a couple assumptions
0:38about dogs to keep things simple.
0:40First, we'll say that greyhounds are usually
0:43taller than Labradors.
0:44Next, we'll pretend that dogs have only two eye
0:47colors-- blue and brown.
0:48And we'll say the color of their eyes
0:50doesn't depend on the breed of dog.
0:53This means that one of these features is useful
0:55and the other tells us nothing.
0:57To understand why, we'll visualize them using a toy
1:01dataset I'll create.
1:02Let's begin with height.
1:04How useful do you think this feature is?
1:06Well, on average, greyhounds tend
1:08to be a couple inches taller than Labradors, but not always.
1:11There's a lot of variation in the world.
1:13So when we think of a feature, we
1:15have to consider how it looks for different values
1:17in a population.
1:19Let's head into Python for a programmatic example.
1:22I'm creating a population of 1,000
1:24dogs-- half greyhounds, half Labradors.
1:27I'll give each of them a height.
1:29For this example, we'll say that greyhounds
1:31are on average 28 inches tall and Labradors are 24.
1:35Now, all dogs are a bit different.
1:37Let's say that height is normally distributed,
1:39so we'll make both of these plus or minus 4 inches.
1:42This will give us two arrays of numbers,
1:44and we can visualize them in a histogram.
1:47I'll add a parameter so greyhounds are in red
1:49and Labradors are in blue.
1:51Now we can run our script.
1:53This shows how many dogs in our population have a given height.
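The script itself is not part of the transcript, so the following is only a minimal sketch of what it might look like. It uses the standard library and prints a text histogram instead of a colored plot. The variable names, the fixed seed, the 2-inch bin width, and reading "plus or minus 4 inches" as a standard deviation of 4 are all assumptions.

```python
import random
from collections import Counter

random.seed(0)  # assumption: fixed seed so the run is repeatable

N_PER_BREED = 500  # 1,000 dogs, half of each breed

# Heights in inches, normally distributed around 28 (greyhounds) and
# 24 (Labradors). The standard deviation of 4 is an interpretation of
# "plus or minus 4 inches".
grey_height = [random.gauss(28, 4) for _ in range(N_PER_BREED)]
lab_height = [random.gauss(24, 4) for _ in range(N_PER_BREED)]


def histogram(heights, bin_width=2):
    """Count dogs per height bin; each key is the lower edge of a bin."""
    return Counter(int(h // bin_width) * bin_width for h in heights)


grey_hist = histogram(grey_height)
lab_hist = histogram(lab_height)

# One row per bin, one '#' per 5 dogs.
for edge in sorted(set(grey_hist) | set(lab_hist)):
    grey_bar = "#" * (grey_hist[edge] // 5)
    lab_bar = "#" * (lab_hist[edge] // 5)
    print(f"{edge:>2}-{edge + 2:<2} in  greyhounds {grey_bar:<22} Labradors {lab_bar}")
```

The two columns of bars overlap heavily in the middle bins and separate toward the edges, which is the pattern the rest of the walkthrough describes.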
1:57There's a lot of data on the screen,
1:58so let's simplify it and look at it piece by piece.
2:03We'll start with dogs on the far left
2:05of the distribution-- say, who are about 20 inches tall.
2:08Imagine I asked you to predict whether a dog with this height
2:11was a lab or a greyhound.
2:13What would you do?
2:14Well, you could figure out the probability of each type
2:16of dog given their height.
2:18Here, it's more likely the dog is a lab.
2:20On the other hand, if we go all the way
2:22to the right of the histogram and look
2:24at a dog who is 35 inches tall, we
2:26can be pretty confident they're a greyhound.
2:29Now, what about a dog in the middle?
2:31You can see the graph gives us less information
2:33here, because the probability of each type of dog is close.
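To put rough numbers on "the probability of each type of dog given their height", here is a small sketch. It assumes the same toy model as above (means of 28 and 24 inches, a standard deviation of 4, and a 50-50 population) and applies Bayes' rule with equal priors. The function name `p_greyhound` is made up for illustration.

```python
from statistics import NormalDist

# Assumed toy model: both breeds have normally distributed heights.
greyhound = NormalDist(mu=28, sigma=4)
labrador = NormalDist(mu=24, sigma=4)


def p_greyhound(height_in):
    """P(greyhound | height), assuming a 50-50 mix of the two breeds."""
    g = greyhound.pdf(height_in)
    l = labrador.pdf(height_in)
    return g / (g + l)


for h in (20, 26, 35):
    print(f"{h} in: P(greyhound) = {p_greyhound(h):.2f}")
# 20 in: P(greyhound) = 0.18
# 26 in: P(greyhound) = 0.50
# 35 in: P(greyhound) = 0.90
```

At 26 inches, halfway between the two means, the feature gives no information at all under these assumptions.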
2:36So height is a useful feature, but it's not perfect.
2:40That's why in machine learning, you almost always
2:42need multiple features.
2:43Otherwise, you could just write an if statement
2:45instead of bothering with the classifier.
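As a rough illustration of that point, here is what the "if statement" version looks like for height alone, along with how often it would be right. The threshold of 26 inches and the normal-distribution model are assumptions carried over from the toy example, not something stated in the video.

```python
from statistics import NormalDist

# Assumed toy model from the example: same spread, different means.
greyhound = NormalDist(mu=28, sigma=4)
labrador = NormalDist(mu=24, sigma=4)

THRESHOLD = 26.0  # assumption: midpoint between the two means


def classify(height_in):
    """One-feature 'classifier': a single if statement on height."""
    if height_in > THRESHOLD:
        return "greyhound"
    return "Labrador"


# Expected accuracy on a 50-50 population: greyhounds above the
# threshold plus Labradors at or below it.
accuracy = 0.5 * (1 - greyhound.cdf(THRESHOLD)) + 0.5 * labrador.cdf(THRESHOLD)
print(f"{accuracy:.2f}")  # 0.69
```

Under these assumptions the single feature gets about 69% of dogs right, which is better than guessing but leaves plenty of room for additional features.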
2:47To figure out what types of features you should use,
2:50do a thought experiment.
2:52Pretend you're the classifier.
2:53If you were trying to figure out if this dog is
2:55a lab or a greyhound, what other things would you want to know?
3:00You might ask about their hair length,
3:01or how fast they can run, or how much they weigh.
3:04Exactly how many features you should use
3:06is more of an art than a science,
3:08but as a rule of thumb, think about how many you'd
3:10need to solve the problem.
3:12Now let's look at another feature like eye color.
3:15Just for this toy example, let's imagine
3:17dogs have only two eye colors, blue and brown.
3:20And let's say the color of their eyes
3:22doesn't depend on the breed of dog.
3:24Here's what a histogram might look like for this example.
3:28For most values, the distribution is about 50/50.
3:32So this feature tells us nothing,
3:33because it doesn't correlate with the type of dog.
3:36Including a useless feature like this in your training
3:39data can hurt your classifier's accuracy.
3:41That's because there's a chance it might appear useful purely
3:45by accident, especially if you have only a small amount
3:48of training data.
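A quick simulation can make the "useful purely by accident" effect concrete. This sketch assumes eye color is a fair coin flip for every dog, regardless of breed, and measures how far apart the blue-eye rates of the two breeds look in small versus large samples. The sample sizes, trial count, and function names are arbitrary choices for illustration.

```python
import random

random.seed(0)  # assumption: fixed seed so the run is repeatable


def blue_eye_gap(n_per_breed):
    """Gap in blue-eye rate between two breeds when eye color is a coin flip."""
    grey_blue = sum(random.random() < 0.5 for _ in range(n_per_breed))
    lab_blue = sum(random.random() < 0.5 for _ in range(n_per_breed))
    return abs(grey_blue - lab_blue) / n_per_breed


def average_gap(n_per_breed, trials=200):
    """Average the gap over many simulated datasets of the same size."""
    return sum(blue_eye_gap(n_per_breed) for _ in range(trials)) / trials


small = average_gap(10)    # 10 dogs per breed
large = average_gap(1000)  # 1,000 dogs per breed
print(f"average gap with 10 dogs per breed:   {small:.3f}")
print(f"average gap with 1000 dogs per breed: {large:.3f}")
```

The true gap is zero in both cases, but with only 10 dogs per breed the observed gap is typically much larger, so a classifier could mistake noise for signal.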
3:50You also want your features to be independent.
3:52Independent features give you
3:54different types of information.
3:56Imagine we already have a feature-- height in inches--
3:59in our dataset.
4:00Ask yourself, would it be helpful
4:02if we added another feature, like height in centimeters?
4:05No, because it's perfectly correlated with one
4:08we already have.
4:09It's good practice to remove highly correlated features
4:12from your training data.
4:14That's because a lot of classifiers
4:15aren't smart enough to realize that height in inches
4:18and in centimeters are the same thing,
4:20so they might double count how important this feature is.
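One way to spot a redundant feature like this is to measure the correlation between two columns before training. The sketch below computes the Pearson correlation by hand on made-up heights. The data, the seed, and the function name `pearson` are assumptions for illustration.

```python
import random
from math import sqrt

random.seed(0)  # assumption: fixed seed so the run is repeatable


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Made-up heights in inches, and the same heights converted to centimeters.
height_in = [random.gauss(26, 4) for _ in range(1000)]
height_cm = [h * 2.54 for h in height_in]

r = pearson(height_in, height_cm)
print(f"{r:.4f}")  # 1.0000
```

A correlation at or very near 1 (or -1) suggests that one of the two features can be dropped without losing information.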
4:23Last, you want your features to be easy to understand.
4:26For a new example, imagine you want
4:28to predict how many days it will take
4:30to mail a letter between two different cities.
4:33The farther apart the cities are, the longer it will take.
4:37A great feature to use would be the distance
4:39between the cities in miles.
4:42A much worse pair of features to use
4:44would be the cities' locations given by their latitude
4:47and longitude.
4:48And here's why.
4:48I can look at the distance and make
4:51a good guess of how long it will take the letter to arrive.
4:54But learning the relationship between latitude, longitude,
4:56and time is much harder and would require many more
5:00examples in your training data.
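If the raw data only has latitude and longitude, the distance feature can be computed up front instead of leaving the classifier to learn it. The video does not say how to do this. The sketch below uses the haversine great-circle formula as one reasonable choice, with approximate coordinates for New York and Los Angeles as a hypothetical example.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def distance_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in degrees (haversine)."""
    phi1, phi2 = radians(lat1), radians(lat2)
    dphi = phi2 - phi1
    dlambda = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(phi1) * cos(phi2) * sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * asin(sqrt(a))


# Hypothetical example: approximate coordinates of two cities.
new_york = (40.7128, -74.0060)
los_angeles = (34.0522, -118.2437)

miles = distance_miles(*new_york, *los_angeles)
print(f"{miles:.0f} miles")
```

Four numbers per example collapse into one that relates directly to delivery time, which is the kind of easy-to-understand feature described above.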
5:01Now, there are techniques you can
5:03use to figure out exactly how useful your features are,
5:05and even what combinations of them are best,
5:08so you never have to leave it to chance.
5:11We'll get to those in a future episode.
5:13Coming up next time, we'll continue building our intuition
5:16for supervised learning.
5:17We'll show how different types of classifiers
5:19can be used to solve the same problem and dive a little bit
5:22deeper into how they work.
5:24Thanks very much for watching, and I'll see you then.