Learning Data Science without any prior experience - Part 4

The fourth article of this series

Intro

Hi guys! My name is Sunain, and I have been working on a company in the climate tech space. What started as a newer form of carbon credits (a market-based tool to boost net-zero) is now a net-zero transition company. I started learning data science to better understand the impact of carbon dioxide emitted into the atmosphere, both environmentally and financially.

Continuing from where we left off (I hope you’ve read the previous article and are comfortable with the book’s notation; also note that in this series I have copied most of the material as-is from the book, because I think it’s relatively simple, and only changed the parts that were difficult for me to grasp).

Assessing Model Accuracy

There is no free lunch in statistics: no one method dominates all others over all possible data sets. Different methods work well on different data sets, which is why we learn a wide range of statistical learning methods beyond the standard linear regression approach.

Hence, for any given data set, it is important to decide which method produces the best results. Selecting the best approach can be one of the most challenging parts of performing statistical learning in practice.

Measuring the Quality of Fit

To evaluate the performance of a statistical learning method on a given data set, we need some way to measure how well its predictions actually match the observed data.

In the regression setting, the most commonly used measure is the mean squared error (MSE):

MSE = (1/n) Σ (yi − ˆf(xi))²

where the sum runs over i = 1, …, n and ˆf(xi) is the prediction that ˆf gives for the i-th observation.

The MSE will be small if the predicted responses are very close to the true responses, and large if for some observations the predicted and true responses differ substantially.
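
To make the formula concrete, here is a minimal sketch of the MSE computation in Python with NumPy (the book itself uses R, so treat this as an illustration only); the responses and predictions below are made-up toy numbers.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: the average squared difference
    between observed responses and model predictions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)

# Toy numbers (hypothetical): observed responses y and a model's
# predictions y_hat for the same observations.
y = [3.0, 5.0, 7.0, 9.0]
y_hat = [2.8, 5.3, 6.9, 9.4]
print(mse(y, y_hat))  # small value -> predictions track the truth closely
```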

In general, we do not really care how well the method works on the training data. Rather, we are interested in the accuracy of the predictions we obtain when we apply our method to previously unseen test data.

Why is this what we care about? Suppose that we are interested in developing an algorithm to predict a stock’s price based on previous stock returns. We can train the method using stock returns from the past six months. But we don’t really care how well our method predicts last week’s stock price; we care about how well it will predict tomorrow’s price or next month’s price.

To state it more mathematically, suppose that we fit our statistical learning method on our training observations {(x1, y1), (x2, y2), . . . , (xn, yn)} and obtain the estimate ˆf. We can then compute ˆf(x1), ˆf(x2), . . . , ˆf(xn). If these are approximately equal to y1, y2, . . . , yn, then the training MSE is small.

However, we are not really interested in whether ˆf(xi) ≈ yi; instead, we want to know whether ˆf(x0) is approximately equal to y0, where (x0, y0) is a previously unseen test observation not used to train the statistical learning method.

We want to choose methods that give the lowest test MSE, as opposed to the lowest training MSE. In simple words: while preparing for an exam, it doesn’t matter how much you score on mock tests; the result that counts is your score on the actual exam.

With a large number of test observations (x0, y0), we could compute the average squared prediction error for these test observations:

Ave(y0 − ˆf(x0))²

We’d like to select the model for which this quantity, the test MSE, is as small as possible.
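
As a sketch of this training-versus-test distinction, the snippet below (again Python with NumPy, on an invented sin-plus-noise data set) fits a model on the training observations only and then computes the MSE separately on held-out test observations:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: the true f is sin(x), but the method only
# ever sees noisy samples of it.
x = rng.uniform(0, 2 * np.pi, 100)
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

# Hold out the last 30 observations; these play the role of the
# previously unseen test pairs (x0, y0).
x_train, x_test = x[:70], x[70:]
y_train, y_test = y[:70], y[70:]

# Fit a degree-3 polynomial using the training data only.
coefs = np.polyfit(x_train, y_train, deg=3)

train_mse = np.mean((y_train - np.polyval(coefs, x_train)) ** 2)
test_mse = np.mean((y_test - np.polyval(coefs, x_test)) ** 2)
print(f"training MSE: {train_mse:.3f}  test MSE: {test_mse:.3f}")
```

The test MSE printed at the end, computed only on the held-out (x0, y0) pairs, is the quantity we actually want to be small.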

When test data is available (data not used while training the model), choose the model with the lowest test MSE. When no test data is available, we have to estimate the test MSE instead, because simply picking the model with the lowest training MSE gives no guarantee of a low test MSE.

When a given method yields a small training MSE but a large test MSE, we are said to be overfitting the data. This happens because our statistical learning procedure is working too hard to find patterns in the training data, and may be picking up patterns that are caused by random chance rather than by true properties of the unknown function f. When we overfit the training data, the test MSE will be very large, because the supposed patterns that the method found in the training data simply don’t exist in the test data.
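
To see overfitting in action, here is a small experiment (again a sketch with invented data): fitting polynomials of increasing degree to the same noisy sample typically keeps driving the training MSE down, while the test MSE bottoms out near the flexibility of the true function and then climbs.

```python
import numpy as np

rng = np.random.default_rng(1)

# True function f(x) = x^2; training and test sets both get noise.
x_train = np.sort(rng.uniform(-1, 1, 30))
y_train = x_train ** 2 + rng.normal(scale=0.1, size=x_train.size)
x_test = np.sort(rng.uniform(-1, 1, 200))
y_test = x_test ** 2 + rng.normal(scale=0.1, size=x_test.size)

for deg in (1, 2, 5, 12):
    coefs = np.polyfit(x_train, y_train, deg=deg)
    train_mse = np.mean((y_train - np.polyval(coefs, x_train)) ** 2)
    test_mse = np.mean((y_test - np.polyval(coefs, x_test)) ** 2)
    print(f"degree {deg:2d}: training MSE {train_mse:.4f}, "
          f"test MSE {test_mse:.4f}")

# Expect the training MSE to keep shrinking as the degree grows,
# while the test MSE is lowest around degree 2 (the true f) and
# grows for the very flexible fits -- the signature of overfitting.
```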

This is a short one; tomorrow we will discuss the bias-variance trade-off and move to the final theory topic, which is classification.

bye