Learning Data Science without any prior experience - Part 3
Third article of this series (this is a long one because I am studying aggressively)
Intro
Hi guys, my name is Sunain and I have been working on a company in the climate tech space. What started as a newer form of carbon credits (a market-based tool to boost net-zero) is now a net-zero transition company. I have started learning data science to better understand the impact of emitted carbon dioxide in the atmosphere, both environmentally and financially.
Continuing from where we left off (I hope you’ve read the previous article and are comfortable with the book’s notation; also note that in this series I have copied most of the material as it is from the book, because I think it’s relatively simple, and only changed the parts that were difficult for me to grasp).
How do we Estimate f?
Our goal is to apply a statistical learning method to the training data to estimate the unknown function f. In other words, we want to find a function f̂ such that Y ≈ f̂(X) for any observation (X, Y).
Most statistical learning methods for this task can be characterized as either parametric or non-parametric.
More on this in the next article.
Parametric Methods
Parametric methods involve a two-step model-based approach:
1. First, we make an assumption about the functional form, or shape, of f.
2. After a model has been selected, we need a procedure that uses the training data to fit or train the model.
The model-based approach just described is referred to as parametric; it reduces the problem of estimating f down to one of estimating a set of parameters. Assuming a parametric form for f simplifies the problem of estimating f because it is generally much easier to estimate a set of parameters than it is to fit an entirely arbitrary function f.
for example
one very simple assumption is that f is linear in X:
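f(X) = β0 + β1X1 + β2X2 + . . . + βpXp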
(this is a linear model which is to be discussed further)
Once we have assumed that f is linear, the problem of estimating f is greatly simplified. Instead of having to estimate an entirely arbitrary p-dimensional function f(X), one only needs to estimate the p + 1 coefficients.
we want to find values of these parameters β0, β1, . . . , βp such that Y ≈ β0 + β1X1 + β2X2 + . . . + βpXp
The potential disadvantage of a parametric approach is that the model we choose will usually not match the true unknown form of f. If the chosen model is too far from the true f, then our estimate will be poor. We can try to address this problem by choosing flexible models that can fit many different possible functional forms for f. But in general, fitting a more flexible model requires estimating a greater number of parameters. These more complex models can lead to a phenomenon known as overfitting the data, which essentially means they follow the errors, or noise, too closely.
(don’t worry about the data itself; just look at the figure in the book and notice how the linear approach gives a yellow plane, with red dots marking the observations and black lines showing the errors)
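To make this concrete, here is a minimal sketch (my own example, not from the book; the data is entirely made up) of fitting a linear model in Python with scikit-learn. The point to notice is that estimating f reduces to estimating just p + 1 numbers:

```python
# Minimal sketch: fitting a parametric (linear) model on synthetic data.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Two predictors (p = 2), e.g. something like years of education and seniority
X = rng.uniform(0, 10, size=(100, 2))
# The true relationship here is linear plus noise: f(X) = 5 + 2*X1 + 3*X2
y = 5 + 2 * X[:, 0] + 3 * X[:, 1] + rng.normal(0, 1, size=100)

model = LinearRegression().fit(X, y)

# Estimating f has been reduced to estimating p + 1 = 3 coefficients:
print(model.intercept_)   # estimate of beta_0 (close to 5)
print(model.coef_)        # estimates of beta_1, beta_2 (close to 2 and 3)
```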
Non-Parametric Methods
Non-parametric methods do not make explicit assumptions about the functional form of f. Instead, they seek an estimate of f that gets as close to the data points as possible without being too rough or wiggly. Such approaches can have a major advantage over parametric approaches: by avoiding the assumption of a particular functional form for f, they have the potential to accurately fit a wider range of possible shapes for f. Any parametric approach brings with it the possibility that the functional form used to estimate f is very different from the true f, in which case the resulting model will not fit the data well. In contrast, non-parametric approaches completely avoid this danger, since essentially no assumption about the form of f is made. But non-parametric approaches do suffer from a major disadvantage: since they do not reduce the problem of estimating f to a small number of parameters, a very large number of observations (far more than is typically needed for a parametric approach) is required in order to obtain an accurate estimate for f.
for example
An example of a non-parametric approach to fitting the Income data is shown in Figure 2.5 of the book, where a thin-plate spline is used to estimate f. This approach does not impose any pre-specified model on f. It instead attempts to produce an estimate for f that is as close as possible to the observed data, subject to the fit (that is, the yellow surface in Figure 2.5) being smooth. In this case, the non-parametric fit has produced a remarkably accurate estimate of the true f. In order to fit a thin-plate spline, the data analyst must select a level of smoothness. Figure 2.6 shows the same thin-plate spline fit using a lower level of smoothness, allowing for a rougher fit.
The resulting estimate fits the observed data perfectly! However, the spline fit shown in Figure 2.6 is far more variable than the true function f. This is an example of overfitting the data, which we discussed previously. It is an undesirable situation because the fit obtained will not yield accurate estimates of the response on new observations that were not part of the original training data set.
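If you want to play with this idea in Python, here is a rough sketch (again my own, on synthetic data, not from the book) using SciPy’s RBFInterpolator, whose thin-plate-spline kernel has a smoothing argument that plays the role of the smoothness level discussed above:

```python
# Rough sketch of the non-parametric idea with a thin-plate spline on synthetic data.
import numpy as np
from scipy.interpolate import RBFInterpolator

rng = np.random.default_rng(1)

# 2-D inputs (think Years of Education and Seniority) and a noisy response (think Income)
X = rng.uniform(0, 1, size=(200, 2))
y = np.sin(2 * np.pi * X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 0.1, size=200)

# Larger 'smoothing' -> a smoother surface (closer in spirit to Figure 2.5)
smooth_fit = RBFInterpolator(X, y, kernel='thin_plate_spline', smoothing=1.0)

# smoothing=0 -> the surface passes through every training point exactly,
# i.e. it "fits the observed data perfectly" (the overfitting of Figure 2.6)
rough_fit = RBFInterpolator(X, y, kernel='thin_plate_spline', smoothing=0.0)

# Predictions on new observations
X_new = rng.uniform(0, 1, size=(5, 2))
print(smooth_fit(X_new))
print(rough_fit(X_new))
```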
The Trade-Off Between Prediction Accuracy and Model Interpretability
Prediction accuracy - how close Ŷ is to Y
Model interpretability - the extent to which we can understand how the model arrives at its predictions, i.e. how the prediction changes given a change in an input or in the model’s parameters
Of the many methods that we examine in this book, some are less flexible, or more restrictive, in the sense that they can produce just a relatively small range of shapes to estimate f.
For example, linear regression is a relatively inflexible approach, because it can only generate linear functions. Other methods, such as the thin plate splines, are considerably more flexible because they can generate a much wider range of possible shapes to estimate f.
In general, as the flexibility of a method increases, its interpretability decreases.
One might reasonably ask the following question:
why would we ever choose to use a more restrictive method instead of a very flexible approach? There are several reasons that we might prefer a more restrictive model. If we are mainly interested in inference, then restrictive models are much more interpretable. For instance, when inference is the goal, the linear model may be a good choice since it will be quite easy to understand the relationship between Y and X1,X2, . . . ,Xp. In contrast, very flexible approaches, such as the splines and the boosting methods, can lead to such complicated estimates of f that it is difficult to understand how any individual predictor is associated with the response.
We have established that when inference is the goal, there are clear advantages to using simple and relatively inflexible statistical learning methods.
In some settings, however, we are only interested in prediction, and the interpretability of the predictive model is simply not of interest. For instance, if we seek to develop an algorithm to predict the price of a stock, our sole requirement for the algorithm is that it predict accurately— interpretability is not a concern. In this setting, we might expect that it will be best to use the most flexible model available. Surprisingly, this is not always the case! We will often obtain more accurate predictions using a less flexible method. This phenomenon, which may seem counterintuitive at first glance, has to do with the potential for overfitting in highly flexible methods.
Supervised vs Unsupervised Learning
So far our examples have been about supervised learning, where for each observation of the predictors X we also had a response Y.
Unsupervised learning, in contrast, describes the somewhat more challenging situation in which for every observation i = 1, . . . , n, we observe a vector of measurements xi but no associated response yi. It is not possible to fit a linear regression model, since there is no response variable to predict. In this setting, we are in some sense working blind; the situation is referred to as unsupervised because we lack a response variable that can supervise our analysis.
What sort of statistical analysis is possible?
One statistical learning tool that we may use in this setting is cluster analysis, or clustering. The goal of cluster analysis is to ascertain, on the basis of x1, . . . , xn, whether the observations fall into relatively distinct groups.
for example
in a market segmentation study we might observe multiple characteristics (variables) for potential customers, such as zip code, family income, and shopping habits. We might believe that the customers fall into different groups, such as big spenders versus low spenders. If the information about each customer’s spending patterns were available, then a supervised analysis would be possible. However, this information is not available—that is, we do not know whether each potential customer is a big spender or not. In this setting, we can try to cluster the customers on the basis of the variables measured, in order to identify distinct groups of potential customers. Identifying such groups can be of interest because it might be that the groups differ with respect to some property of interest, such as spending habits.
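As a toy illustration (entirely made-up data, my own example rather than the book’s), here is how we might cluster such customers with k-means in scikit-learn:

```python
# Illustrative sketch only: clustering made-up "customer" data with k-means.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)

# Fake customers: the columns could stand for income and a shopping-habit score.
# Note there is no response column -- nothing tells us who is a "big spender".
customers = np.vstack([
    rng.normal([30_000, 2], [5_000, 0.5], size=(50, 2)),   # one hypothetical group
    rng.normal([90_000, 8], [10_000, 1.0], size=(50, 2)),  # another hypothetical group
])

# Scale the features so income does not dominate the distance calculation
scaled = StandardScaler().fit_transform(customers)

# Ask for two clusters and see which group each customer lands in
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(scaled)
print(labels)
```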
Regression Versus Classification Problems
Variables can be characterized as either quantitative or qualitative (also known as categorical). Quantitative variables take on numerical values. Examples include a person’s age, height, or income, the value of a house, and the price of a stock. In contrast, qualitative variables take on values in one of K different classes, or categories. Examples of qualitative variables include a person’s marital status (married or not), the brand of product purchased (brand A, B, or C), whether a person defaults on a debt (yes or no), or a cancer diagnosis (Acute Myelogenous Leukemia, Acute Lymphoblastic Leukemia, or No Leukemia). We tend to refer to problems with a quantitative response as regression problems, while those involving a qualitative response are often referred to as classification problems. (the distinction is not always that crisp)
We tend to select statistical learning methods on the basis of whether the response is quantitative or qualitative; i.e. we might use linear regression when the response is quantitative and logistic regression when it is qualitative. However, whether the predictors are qualitative or quantitative is generally considered less important. Most of the statistical learning methods discussed in this book can be applied regardless of the predictor variable type, provided that any qualitative predictors are properly coded before the analysis is performed.
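For instance, here is a tiny sketch (my own example, not from the book) of coding a qualitative predictor as dummy variables with pandas before fitting a model:

```python
# Small sketch of "properly coding" a qualitative predictor before analysis.
import pandas as pd

df = pd.DataFrame({
    "income": [40_000, 85_000, 60_000],
    "brand":  ["A", "B", "C"],   # a qualitative (categorical) predictor
})

# One-hot / dummy encoding turns the categorical column into numeric indicator columns
encoded = pd.get_dummies(df, columns=["brand"])
print(encoded)
```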
In the next article we will discuss assessing model accuracy.
until then, Happy Reading
bye