Learning Data Science without any prior experience - Part 5
Bias-Variance Trade-Off
Intro
Hi guys! My name is Sunain, and I have been working on a company in the climate tech space. What started as a newer form of carbon credits (a market-based tool to boost net-zero) is now a net-zero transition company. I started learning data science to better understand the impact of emitted carbon dioxide in the atmosphere, both environmentally and financially.
Continuing from where we left off (I hope you have read the previous article and are comfortable with the book's notation; also note that in this series I have copied most of the material as-is from the book, because I think it is relatively simple, and only changed the parts that were difficult for me to grasp).
The Bias-Variance Trade-off
Bias and Variance are two competing properties of statistical learning methods.
Variance - Variance refers to the amount by which f̂ would change if we estimated it using a different training data set. If a method has high variance, then small changes in the training data can result in large changes in f̂.
Bias - Bias refers to the error that is introduced by approximating a real-life problem, which may be extremely complicated, by a much simpler model.
Generally, more flexible methods result in less bias.
The expected test MSE, for a given value x0, can always be decomposed into the sum of three fundamental quantities:
The variance of f̂(x0)
The squared bias of f̂(x0)
The variance of the error term ε
that is,
E[(y0 − f̂(x0))²] = Var(f̂(x0)) + [Bias(f̂(x0))]² + Var(ε)
This equation tells us that in order to minimize the expected test error, we need to select a statistical learning method that simultaneously achieves low variance and low bias.
(Note that variance is inherently a nonnegative quantity, and squared bias is also nonnegative. Hence, the expected test MSE can never lie below Var(ε), the irreducible error.)
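To make this decomposition concrete, here is a minimal simulation sketch of my own (not from the book): the "true" function, the noise level sigma, the point x0, and the degree-3 polynomial fit are all arbitrary choices, used only to check numerically that Var(f̂(x0)) + Bias²(f̂(x0)) + Var(ε) matches the expected test MSE at x0.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # The "true" regression function (unknown in practice); chosen arbitrarily here.
    return np.sin(2 * x)

sigma = 0.5          # standard deviation of the irreducible error eps
x0 = 1.0             # the test point at which we decompose the expected MSE
n, reps = 50, 2000   # training-set size and number of simulated training sets

preds = np.empty(reps)      # f_hat(x0) from each simulated training set
sq_errors = np.empty(reps)  # (y0 - f_hat(x0))^2 for an independent test response
for r in range(reps):
    x = rng.uniform(0, 3, n)
    y = f(x) + rng.normal(0, sigma, n)
    coefs = np.polyfit(x, y, deg=3)        # a moderately flexible fit
    preds[r] = np.polyval(coefs, x0)
    y0 = f(x0) + rng.normal(0, sigma)      # a fresh test observation at x0
    sq_errors[r] = (y0 - preds[r]) ** 2

variance = preds.var()                     # Var(f_hat(x0))
bias_sq = (preds.mean() - f(x0)) ** 2      # [Bias(f_hat(x0))]^2
print("Var + Bias^2 + Var(eps):", variance + bias_sq + sigma ** 2)
print("Monte Carlo expected test MSE:", sq_errors.mean())
```

With enough repetitions the two printed numbers should agree closely, which is exactly what the decomposition promises.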
The relative rate of change of these two quantities determines whether the test MSE increases or decreases. As we increase the flexibility of a class of methods, the bias tends to initially decrease faster than the variance increases. Consequently, the expected test MSE declines. However, at some point increasing flexibility has little impact on the bias but starts to significantly increase the variance. When this happens, the test MSE increases.
The relationship between bias, variance, and test set MSE given in the equation above is referred to as the bias-variance trade-off. Good test set performance of a statistical learning method requires low variance as well as low squared bias. This is referred to as a trade-off because it is easy to obtain a method with extremely low bias but high variance (for instance, by drawing a curve that passes through every single training observation) or a method with very low variance but high bias (by fitting a horizontal line to the data). The challenge lies in finding a method for which both the variance and the squared bias are low.
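Building on the same toy setup as before (again my own assumptions, with the polynomial degree standing in crudely for "flexibility"), we can sweep the degree and watch the squared bias fall while the variance rises:

```python
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(2 * x)          # same made-up "true" function as before
sigma, x0, n, reps = 0.5, 1.0, 100, 1000

for deg in [1, 3, 10]:               # increasing flexibility
    preds = np.empty(reps)
    for r in range(reps):
        x = rng.uniform(0, 3, n)
        y = f(x) + rng.normal(0, sigma, n)
        preds[r] = np.polyval(np.polyfit(x, y, deg), x0)
    bias_sq = (preds.mean() - f(x0)) ** 2
    var = preds.var()
    print(f"degree {deg:2d}: bias^2={bias_sq:.4f}  variance={var:.4f}  "
          f"expected test MSE ~ {bias_sq + var + sigma ** 2:.4f}")
```

The rigid degree-1 fit carries high bias, the very flexible degree-10 fit carries high variance, and the expected test MSE traces out the U-shape described above.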
The Classification Setting
Many of the concepts that we have encountered, such as the bias-variance trade-off, transfer over to the classification setting with only some modifications due to the fact that yi is now qualitative.
The most common approach for quantifying the accuracy of our estimate f̂ is the training error rate, the proportion of mistakes that are made if we apply our estimate f̂ to the training observations:

(1/n) Σ I(yi ≠ ŷi), where the sum runs over the n training observations.

Here ŷi is the predicted class label for the ith observation using f̂, and I(yi ≠ ŷi) is an indicator variable that equals 1 if yi ≠ ŷi (the ith observation is misclassified) and 0 if yi = ŷi (the ith observation is classified correctly). This quantity is referred to as the training error rate.
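As a quick illustration (the toy labels below are made up), the training error rate is just the average of the indicator I(yi ≠ ŷi) over the training observations:

```python
import numpy as np

# Toy labels, made up purely for illustration.
y_train = np.array(["blue", "blue", "orange", "blue", "orange", "orange"])   # true classes y_i
y_hat   = np.array(["blue", "orange", "orange", "blue", "blue", "orange"])   # predictions y_hat_i

# (y_train != y_hat) is the indicator I(y_i != y_hat_i): 1 where misclassified, 0 otherwise.
# Averaging the indicators gives the training error rate.
train_error_rate = np.mean(y_train != y_hat)
print(train_error_rate)   # 2 mistakes out of 6 observations -> 0.333...
```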
A good classifier is one for which the test error is smallest.
The Bayes Classifier
It is possible to show that the test error rate is minimized, on average, by a very simple classifier that assigns each observation to the most likely class, given its predictor values.
We should simply assign a test observation with predictor vector x0 to the class j for which the conditional probability Pr(Y = j | X = x0), that is, the probability that Y = j given the observed predictor vector x0, is largest. This very simple classifier is called the Bayes classifier.
In a two-class problem where there are only two possible response values, say class 1 or class 2, the Bayes classifier corresponds to predicting class one if Pr(Y = 1 | X = x0) > 0.5, and class two otherwise. (The classes are simply the possible outcomes, for example red and blue, or true and false.)
Consider an example using a simulated data set in a two-dimensional space consisting of predictors X1 and X2. The orange and blue circles correspond to training observations that belong to two different classes. For each value of X1 and X2, there is a different probability of the response being orange or blue. Since this is simulated data, we know how the data were generated and we can calculate the conditional probabilities for each value of X1 and X2. The orange shaded region reflects the set of points for which Pr(Y = orange | X) is greater than 50%, while the blue shaded region indicates the set of points for which the probability is below 50%. The purple dashed line represents the points where the probability is exactly 50%; this is called the Bayes decision boundary. The Bayes classifier's prediction is determined by the Bayes decision boundary: an observation that falls on the orange side of the boundary will be assigned to the orange class, and similarly an observation on the blue side of the boundary will be assigned to the blue class.

The Bayes classifier produces the lowest possible test error rate, called the Bayes error rate. Since the Bayes classifier will always choose the class for which Pr(Y = j | X = x0) is largest, the error rate at X = x0 will be 1 − maxj Pr(Y = j | X = x0). In general, the overall Bayes error rate is given by

1 − E( maxj Pr(Y = j | X) )
where the expectation averages the probability over all possible values of X. For our simulated data, the Bayes error rate is 0.1304. It is greater than zero because the classes overlap in the true population, which implies that maxj Pr(Y = j | X = x0) < 1 for some values of x0. The Bayes error rate is analogous to the irreducible error.
In theory we would always like to predict qualitative responses using the Bayes classifier. But for real data, we do not know the conditional distribution of Y given X, and so computing the Bayes classifier is impossible. Therefore, the Bayes classifier serves as an unattainable gold standard against which to compare other methods.
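To see what the Bayes classifier and Bayes error rate look like when the truth is known, here is a hedged sketch using a made-up generative model of my own (a logistic-shaped Pr(Y = orange | X1, X2)), not the book's simulated data set:

```python
import numpy as np

rng = np.random.default_rng(2)

def p_orange(x1, x2):
    # A made-up "true" conditional probability Pr(Y = orange | X1, X2).
    return 1 / (1 + np.exp(-(x1 + x2 - 1)))

# Draw a large sample of predictor values and their responses from this model.
n = 200_000
x1, x2 = rng.uniform(-2, 2, n), rng.uniform(-2, 2, n)
p = p_orange(x1, x2)
y = np.where(rng.uniform(size=n) < p, "orange", "blue")

# Bayes classifier: assign each observation to the class whose conditional
# probability is largest (here, orange whenever Pr(Y = orange | X) > 0.5).
bayes_pred = np.where(p > 0.5, "orange", "blue")

# Its error rate estimates the Bayes error rate, 1 - E[max_j Pr(Y = j | X)].
print("error rate of the Bayes classifier:", np.mean(bayes_pred != y))
print("1 - E[max_j Pr(Y = j | X)]        :", np.mean(1 - np.maximum(p, 1 - p)))
```

The two printed numbers agree (up to simulation noise) because the error rate of the Bayes classifier is, by definition, the Bayes error rate. On real data we never know p, which is why this classifier is only a benchmark.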
K-Nearest Neighbors
The KNN classifier is one of many approaches that attempt to estimate the conditional distribution of Y given X, and then classify a given observation to the class with the highest estimated probability.
Given a positive integer K and a test observation x0, the KNN classifier first identifies the K points in the training data that are closest to x0, represented by N0. It then estimates the conditional probability for class j as the fraction of points in N0 whose response values equal j:

Pr(Y = j | X = x0) = (1/K) Σ I(yi = j), where the sum is over the observations i in N0.

Finally, KNN classifies the test observation x0 to the class with the largest estimated probability.
In the left-hand panel, we have plotted a small training data set consisting of six blue and six orange observations. Our goal is to make a prediction for the point labeled by the black cross. Suppose that we choose K = 3. Then KNN will first identify the three observations that are closest to the cross. This neighborhood is shown as a circle. It consists of two blue points and one orange point, resulting in estimated probabilities of 2/3 for the blue class and 1/3 for the orange class. Hence KNN will predict that the black cross belongs to the blue class.
In the right-hand panel we have applied the KNN approach with K = 3 at all of the possible values for X1 and X2, and have drawn in the corresponding KNN decision boundary. Despite the fact that it is a very simple approach, KNN can often produce classifiers that are surprisingly close to the optimal Bayes classifier.
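Here is a from-scratch sketch of that calculation (the coordinates and labels below are my own toy points, chosen so that the three nearest neighbours of the test point are two blue points and one orange point, mirroring the K = 3 example above):

```python
import numpy as np

# Toy training data: columns are X1 and X2, with a class label for each row.
X_train = np.array([[1.0, 2.0], [2.0, 1.5], [3.0, 3.0],
                    [2.5, 0.5], [0.5, 1.0], [3.5, 2.5]])
y_train = np.array(["blue", "blue", "orange", "orange", "blue", "orange"])

def knn_predict(x0, K=3):
    # Identify the K training points closest to x0 (the neighbourhood N0).
    dists = np.linalg.norm(X_train - x0, axis=1)
    neighbours = y_train[np.argsort(dists)[:K]]
    # Estimate Pr(Y = j | X = x0) as the fraction of N0 with label j,
    # then classify x0 to the class with the largest estimated probability.
    classes, counts = np.unique(neighbours, return_counts=True)
    probs = dict(zip(classes, counts / K))
    return max(probs, key=probs.get), probs

print(knn_predict(np.array([2.0, 2.0]), K=3))
```

For this test point the neighbourhood contains two blue points and one orange point, so the estimated probabilities are 2/3 and 1/3 and the prediction is blue, just like the worked example.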
The book's next figure displays the KNN decision boundary, using K = 10, when applied to the larger simulated data set. Notice that even though the true distribution is not known by the KNN classifier, the KNN decision boundary is very close to that of the Bayes classifier. The test error rate using KNN is 0.1363, which is close to the Bayes error rate of 0.1304.

The choice of K has a drastic effect on the KNN classifier obtained. Another figure displays two KNN fits to the same simulated data, using K = 1 and K = 100. When K = 1, the decision boundary is overly flexible and finds patterns in the data that don't correspond to the Bayes decision boundary. This corresponds to a classifier that has low bias but very high variance. As K grows, the method becomes less flexible and produces a decision boundary that is close to linear. This corresponds to a low-variance but high-bias classifier. On this simulated data set, neither K = 1 nor K = 100 gives good predictions: they have test error rates of 0.1695 and 0.1925, respectively.

Just as in the regression setting, there is not a strong relationship between the training error rate and the test error rate. With K = 1, the KNN training error rate is 0, but the test error rate may be quite high. In general, as we use more flexible classification methods, the training error rate will decline but the test error rate may not.
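As a hedged sketch of the effect of K, the snippet below uses scikit-learn's KNeighborsClassifier on simulated data of my own (so the exact error rates will differ from the book's), comparing training and test error rates for K = 1, 10, and 100:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(3)

# Simulate two overlapping classes (a made-up model, so the Bayes error is not zero).
n = 2000
X = rng.normal(size=(n, 2))
p = 1 / (1 + np.exp(-(X[:, 0] + X[:, 1])))        # assumed Pr(Y = 1 | X)
y = (rng.uniform(size=n) < p).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for K in [1, 10, 100]:
    knn = KNeighborsClassifier(n_neighbors=K).fit(X_tr, y_tr)
    train_err = np.mean(knn.predict(X_tr) != y_tr)
    test_err = np.mean(knn.predict(X_te) != y_te)
    print(f"K={K:3d}  training error rate={train_err:.3f}  test error rate={test_err:.3f}")
```

Typically K = 1 drives the training error rate to 0 while the test error rate stays higher, which illustrates why training error alone is a poor guide for choosing K.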
In both the regression and classification settings, choosing the correct level of flexibility is critical to the success of any statistical learning method. The bias-variance tradeoff, and the resulting U-shape in the test error, can make this a difficult task.
In the next article, we will work through some practical examples and solve a few problems.
Bye!