When looking at the paper machine learning, often see the author mentions "curse of dimensionality", Chinese translated as "curse of dimensionality", which in the end is what kind of "disaster"? This article will introduce this annoying "curse of dimensionality" and its importance in the classification problem through an example.
Suppose now that there is a group of photos, each photo has a cat or a dog. We wanted to design a classifier can automatically identify the animal in the picture off. To achieve this goal, we first need to consider how the feature photos of the animals expressed in digital form. What is the biggest difference between a cat and a dog is? One might first think of cats and dogs is not the same color, while others may think of cats and dogs sample sizes. Suppose from color to distinguish cat and dog, you can design three characteristics: the average of the red, green and blue mean averages to determine which pictures belong to a class of animals:
1 if 0.5 * red + 0.3 * green + 0.2 * blue> 0.6:
2 return cat
4 return dog
However, only these three traits, classification may not be a satisfactory result. Thus, it can add some features: size, texture. Perhaps after adding features, classification results will be increased. However, the feature is not better?
To see the performance of the classifier with the increasing number of variation, over a certain value, performance falling instead of rising. This phenomenon is called "curse of dimensionality."
Continue the previous example. Suppose the number of cats and dogs on the planet is unlimited. Due to limited time and computing power, we only selected 10 photos as training samples. Our aim is based on these 10 photographs training a linear classifier, linear classifier can make the remaining cat or dog photos correctly classified. We only use a feature to identify cats and dogs start
If only one of the characteristics of the words, cats and dogs are almost uniformly distributed in this segment, it is difficult to 10 photos linear classification. So, the situation is what will happen after adding a feature
Add a feature we found that you still can not find a line to separate the cats and dogs. So, consider the need to add a feature
At this time, we finally found a plane separate from cats and dogs. Note that only one feature, the feature space is assumed as a segment length of 5, the sample density is 10/2 = 5. There are two features, the feature size is the 5 * 5 = 25, sample density is 10/25 = 0.4. There are three characteristics, the feature size is the 5 * 5 * 5 = 125, sample density is 10/125 = 0.08. If we continue to increase the number of features, sample density will be more sparse, it is easier to find a hyper-plane training sample separately. Because as the number of features tends to infinity, the sample density is very sparse, the training sample is the possibility of misclassification tends to zero. When we classify the results of the high-dimensional space is mapped to the low-dimensional space, a serious problem arises
See the three-dimensional feature space mapped to the results of a two-dimensional feature space after. Although linear separable training samples in high dimensional feature space, but after mapped into low-dimensional space, the result is just the opposite. In fact, increasing the number of features such high-dimensional space linearly separable, the equivalent of a low-dimensional space in a training complex nonlinear classifier. However, this non-linear classifier too "smart" just learned a few exceptions. If it is used to identify those who never appear in the training sample test sample, the results are usually less than ideal. In fact, this is what we learn in high school had a machine overfitting.
Using only two characteristics of linear classifier misclassification some training samples, high accuracy rate does not seem to FIG. 4, however, the use of two characteristic linear classifier generalization than using three characteristics of linear classifier stronger. Because the use of two characteristic linear classifier learning to not just a special case, but an overall trend for those samples has not been seen to be better distinguish open. In other words, by reducing the number of features, you can avoid overfitting, thus avoiding the "curse of dimensionality."
Another point of interpretation of the "curse of dimensionality." Assuming that only one feature, the feature of the range is 0-1, only cats and dogs each characteristic values are unique. If we want the training sample covering 20% of the value range of features, you will need 20% of the total number of cats and dogs. We've added a feature after feature in order to continue covering 20% of the value range would require 45% of the total number of cats and dogs (0.45 ^ 2 = 0.2). Continue to add a feature, it needs 58% of the total number of cats and dogs (0.58 ^ 3 = 0.2). With the increasing number of features, in order to cover 20% of the characteristic value range, we need more training samples. If there is not enough training samples may occur overfitting.
Through the above examples, we can see the number of features more training samples will be more sparse, classifier parameter estimates will be less accurate, more prone to overfitting. Another effect of "curse of dimensionality" is the sparsity of training samples are not evenly distributed. Training at the center of the sample is more sparse than training samples around.
Suppose there is a two-dimensional feature space rectangle as shown in FIG. 8, inside the rectangle has an inscribed circle. Because the closer the center of the sample is sparse, therefore, compared to the sample within the circle, those samples located in the four corners of the rectangle is more difficult to classify. So, with the increasing number of features, circular area will not change? Here we assume that hypercube (hypercube) edge length d = 1, then calculate the radius of the hypersphere (hypersphere) volume of 0.5 (volume) of the formula.
As seen in the increasing number of features, super-spherical volume is gradually reduced until it goes to zero, but hypercube volume is constant. This result was unexpected, but some described the classification of the "curse of dimensionality": In the high-dimensional feature space, most of the training sample is located in the corner of the hypercube.
Under different dimensions, the distribution of samples. In 8-dimensional feature space, a total of 2 ^ 8 = 256 corners, while 98% of the sample distribution in these corners. With the increasing dimension, Equation 2 will tend to 0, where dist_max and dist_min denote sample-to-center distance between the maximum and minimum.
Thus, in the high-dimensional feature space for sample measurement distance meaningless. Since the basic classifier relies on as Euclidean distance, Manhattan distance, so the excessive number of features, classifier performance will decline.
So, how do we avoid the "curse of dimensionality"? Figure 1 shows the performance of the classifier with the increasing number of variation, over a certain value, performance falling instead of rising. Here's how much a certain value in the end what is it? Currently, there is no way to determine the classification in question is how much this threshold, depending on the number of training samples, the decision complexity and the type of classification boundaries. Theoretically, if the number of training samples infinite, then there will be no "curse of dimensionality", we can use any number of features to train a classifier. In fact, the number of training samples is limited, so you should not use too many features. In addition, those who need accurate nonlinear decision boundary classifiers, such as neural network, knn, decision trees and other generalization often not very good, more prone to overfitting. Therefore, in designing these classifiers should be carefully considered when the number of features. In contrast, those better generalization ability classifier, such as naive Bayesian, linear classifier, etc., may be appropriate to increase the number of features.
Given the characteristics of the N, how do we choose from the best features of the M? The most simple and crude way is to try a combination of all the features, the M pick out the best features. In fact, it is very time-consuming, or impractical. In fact, there have been many feature selection algorithm (feature selection algorithms) to help us determine the number of features and feature selection. In addition, there are many feature extraction methods (feature extraction methods), such as the PCA and the like. Cross-validation (cross-validation) it is also often used to detect and avoid over-fitting problems.