Photo by Luke Chesser on Unsplash
Learning Data Science without any prior experience - Part 2
The second article in this series
Intro
Hi guys, my name is Sunain. I have been working on a company in the climate tech space: what I started as a newer form of carbon credits (a market-based tool to boost net-zero) is now a net-zero transition company. I started learning data science to better understand the impact of carbon dioxide emitted into the atmosphere, both environmentally and financially.
Continuing from where we left off (I hope you've read the previous article and are comfortable with the book's notation).
Why Estimate f?
There are two main reasons we may wish to estimate f:
1. Prediction
2. Inference
Prediction
As the name suggests, in this scenario we wish to predict an output based on the inputs and previously observed outputs. For example, predicting tomorrow's stock price based on data from the previous week.
Technically speaking
In many situations, a set of inputs X is readily available, but the output Y cannot be easily obtained. In this setting, since the error term averages to zero, we can predict Y using
$$\hat{Y} = \hat{f}(X),$$
where $\hat{f}$ represents our estimate for f and $\hat{Y}$ represents the resulting prediction for Y.
In this setting, $\hat{f}$ is often treated as a black box, in the sense that one is not typically concerned with the exact form of $\hat{f}$, provided that it yields accurate predictions for Y.
The accuracy of $\hat{Y}$ as a prediction for Y depends on two quantities:
1. Reducible error - $\hat{f}$ will not be a perfect estimate for f, and this inaccuracy will introduce some error. This error is reducible because we can potentially improve the accuracy of $\hat{f}$ by using the most appropriate statistical learning technique to estimate f.
2. Irreducible error - Y is also a function of the error term $\epsilon$ (epsilon), which, by definition, cannot be predicted using X. Therefore, variability associated with $\epsilon$ also affects the accuracy of our predictions. This is known as the irreducible error. The quantity $\epsilon$ may contain unmeasured variables that are useful in predicting Y: since we don't measure them, f cannot use them for its prediction. The quantity $\epsilon$ may also contain unmeasurable variation.
For example, the risk of an adverse reaction to a drug might vary for a given patient on a given day, depending on manufacturing variation in the drug itself or the patient's general feeling of well-being on that day.
It is important to keep in mind that the irreducible error will always provide an upper bound on the accuracy of our prediction for Y. This bound is almost always unknown in practice.
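To make the two kinds of error concrete, here is a small simulation sketch. Everything in it is an assumption made for illustration: the "true" function `true_f`, the noise level `noise_sd`, and the deliberately imperfect estimate `rough_f_hat` are all made up. It relies on the standard result that the expected squared error splits into a reducible part, $[f(X) - \hat{f}(X)]^2$, and an irreducible part, $\mathrm{Var}(\epsilon)$.

```python
import random

def true_f(x):
    # Hypothetical "true" relationship, chosen only for this illustration.
    return 2.0 + 3.0 * x

def rough_f_hat(x):
    # A deliberately imperfect estimate of f: the slope is wrong.
    return 2.0 + 2.5 * x

def mean_squared_error(predict, xs, ys):
    # Average squared gap between the observed Y and the prediction.
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

random.seed(0)
noise_sd = 1.0  # standard deviation of epsilon, so Var(epsilon) = 1.0

xs = [random.uniform(0, 10) for _ in range(20000)]
ys = [true_f(x) + random.gauss(0, noise_sd) for x in xs]

# Error of the imperfect estimate: reducible part + irreducible part.
mse_rough = mean_squared_error(rough_f_hat, xs, ys)

# Error when we predict with the true f itself: only the irreducible part is left.
mse_perfect = mean_squared_error(true_f, xs, ys)

print("MSE with a rough estimate of f:", round(mse_rough, 2))
print("MSE with the true f:", round(mse_perfect, 2))
```

Even when we predict with the true f, the error does not go to zero: it stays close to Var(epsilon), which is the bound described above. In a real problem we would not know `true_f` or `noise_sd`, which is why this bound is usually unknown.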
Inference
Here we want to understand the relationship between the inputs and the output, not just predict the output.
We are often interested in understanding the association between Y and $X_1, \ldots, X_p$. In this situation, we wish to estimate f, but our goal is not necessarily to make predictions for Y.
In this case $\hat{f}$ cannot be treated as a black box, because we need to know its exact form.
In this setting, one may be interested in answering the following questions:
1. Which predictors (inputs) are associated with the response (output)? Identifying the few important predictors among a large set of possible variables can be extremely useful, depending on the application.
2. What is the relationship between the response (output) and each predictor (input)? The relationship between the response and a given predictor may also depend on the values of the other predictors.
3. Can the relationship between Y and each predictor be adequately summarized using a linear equation, or is the relationship more complicated? Historically, most methods for estimating f have taken a linear form. In some situations, such an assumption is reasonable or even desirable. But often the true relationship is more complicated.
Let's take an example to understand both prediction and inference.
In a real estate setting, one may seek to relate values of homes to inputs such as crime rate, zoning, distance from a river, air quality, schools, income level of community, size of houses, and so forth. In this case, one might be interested in the association between each individual input variable and housing price. For instance, how much extra will a house be worth if it has a view of the river? This is an inference problem.
Alternatively, one may simply be interested in predicting the value of a home given its characteristics: is this house under- or over-valued? This is a prediction problem.
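The same fitted model can serve both goals. The sketch below uses synthetic data with made-up numbers (a base price, a price per square metre, and a river-view premium are all assumptions, not real housing figures) and a small hand-written least squares fit, which is just one possible way to estimate f. The helper names `solve_linear_system`, `fit_least_squares`, and `predict_price` are hypothetical.

```python
import random

def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so row operations update both sides together.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def fit_least_squares(rows, ys):
    """Return [intercept, b1, ..., bp] that minimise the squared error."""
    design = [[1.0] + list(row) for row in rows]
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(p)]
    return solve_linear_system(xtx, xty)

random.seed(1)
# Synthetic homes: (size in square metres, river view as 0 or 1).
homes = [(random.uniform(50, 200), random.choice([0, 1])) for _ in range(500)]
prices = [50_000 + 2_000 * size + 30_000 * view + random.gauss(0, 10_000)
          for size, view in homes]

intercept, per_sq_metre, river_premium = fit_least_squares(homes, prices)

# Inference: look inside the model. How much extra is a river view worth?
print("Estimated river-view premium:", round(river_premium))

# Prediction: use the model as a black box to value one particular home.
def predict_price(size, view):
    return intercept + per_sq_metre * size + river_premium * view

estimate = predict_price(120, 1)
print("Predicted price for a 120 square metre home with a view:", round(estimate))
```

For the inference question we read the coefficient `river_premium` directly; for the prediction question we only care about the number that `predict_price` returns and could compare it with an asking price to judge whether the house looks under- or over-valued.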
How do we Estimate f?
We can estimate f through many linear and non-linear approaches. These methods generally share certain characteristics.
Let's take an overview.
We will always assume that we have observed a set of n different data points. For example, we observed n = 30 data points. These observations are called the training data because we will use these training observations to train, or teach, our method how to estimate f.
Our goal is to apply a statistical learning method to the training data to estimate the unknown function f. In other words, we want to find a function $\hat{f}$ such that $Y \approx \hat{f}(X)$ for any observation (X, Y).
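As a rough sketch of what "training" can look like, the snippet below generates n = 30 synthetic training observations and fits a straight line to them with ordinary least squares. The function `true_f` and the noise level are assumptions made up for this example (in practice f is unknown), and a straight line is only one of many possible choices for the form of the estimate.

```python
import random

random.seed(42)
n = 30  # number of training observations

def true_f(x):
    # Hypothetical true relationship; unknown in a real problem.
    return 1.5 + 0.8 * x

# Training data: n pairs (x, y) where y = f(x) + epsilon.
train_x = [random.uniform(0, 10) for _ in range(n)]
train_y = [true_f(x) + random.gauss(0, 0.5) for x in train_x]

# Ordinary least squares for a single predictor (closed-form solution).
mean_x = sum(train_x) / n
mean_y = sum(train_y) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y))
         / sum((x - mean_x) ** 2 for x in train_x))
intercept = mean_y - slope * mean_x

def f_hat(x):
    # Our estimate of f, learned from the training data.
    return intercept + slope * x

new_x = 4.0
print("f_hat(4.0):", round(f_hat(new_x), 2))
print("true f(4.0):", round(true_f(new_x), 2))
```

The fitted `f_hat` should land close to `true_f` at a new input, which is the sense in which Y ≈ f_hat(X). With only 30 noisy points the match will not be exact.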
Most statistical learning methods for this task can be characterized as either parametric or non-parametric.
More on this in the next article.