Beginner’s Guide to Learning Data Science

Intro

Hi guys, it’s been a long time since I wrote a technical blog. Since my last blog about web3, I have been working on a company in the climate tech space. What I started as a newer form of carbon credits (a market-based tool to boost net-zero) is now a net-zero transition company. I have started learning data science to better understand the impact of carbon dioxide emitted into the atmosphere, both environmentally and financially.

I am writing this blog to help folks like me, who have no prior experience in data science but a passion to understand it, get started easily.

The learning material - I am using HarvardX CS109x, Introduction to Data Science with Python, on edX. All my upcoming blogs will be based on what I learn from this course and my viewpoint on it.

Here’s to all of us who have decided to give our interests a shot.

Part 1 - Introduction to Statistical Learning

This course is based on the book An Introduction to Statistical Learning, so let me briefly introduce Statistical Learning.

Statistical learning refers to a vast set of tools for understanding data. These tools can be classified as:

1. Supervised - building a statistical model for predicting, or estimating, an output based on one or more inputs.

2. Unsupervised - there are inputs but no supervising output.
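To make the difference concrete, here is a minimal sketch in plain Python. Everything in it is my own assumption for illustration: the data is simulated, and `fit_line` and `two_means` are hypothetical helper names, not something from the course or the book. The supervised part uses inputs and outputs to fit a straight line, while the unsupervised part only has inputs and looks for two groups in them.

```python
import random
import statistics

random.seed(0)

# Simulated toy data (not real): one input x and one output y.
xs = [float(i) for i in range(1, 21)]
ys = [2.0 * x + 1.0 + random.gauss(0, 1) for x in xs]


def fit_line(xs, ys):
    """Least squares fit for a single input: returns (intercept, slope)."""
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Supervised: we have inputs AND outputs, so we can learn to predict y from x.
intercept, slope = fit_line(xs, ys)
print(f"supervised: y is roughly {intercept:.2f} + {slope:.2f} * x")

# Unsupervised: only inputs, with no output to supervise the learning.
points = [random.gauss(0, 1) for _ in range(30)] + [random.gauss(10, 1) for _ in range(30)]


def two_means(points, iterations=10):
    """A tiny 1-D k-means with k=2: returns the two group centers."""
    low, high = min(points), max(points)
    for _ in range(iterations):
        group_low = [p for p in points if abs(p - low) <= abs(p - high)]
        group_high = [p for p in points if abs(p - low) > abs(p - high)]
        low, high = statistics.mean(group_low), statistics.mean(group_high)
    return low, high


centers = two_means(points)
print(f"unsupervised: found group centers near {centers[0]:.2f} and {centers[1]:.2f}")
```

In the supervised case the known outputs tell us how good the fitted line is. In the unsupervised case there is nothing to check against, so we can only describe the structure we find.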

The ISL book is based on four premises:

1. Many statistical learning methods are relevant and useful in a wide range of academic and non-academic disciplines, beyond just the statistical sciences.

2. Statistical learning should not be viewed as a series of black boxes - it is important to understand the components of each model and how they work.

3. While it is important to know what job is performed by each cog, it is not necessary to have the skills to construct the machine inside the box - this book doesn’t focus on creating new models but on understanding and working with existing models.

4. The authors of this book have presumed that the reader is interested in applying statistical learning methods to real-world problems.

Make sure you study the notation and organization of this book for a better understanding.

What is statistical Learning?

In a nutshell, statistical learning is the study of the relationship between inputs and results, learned from observed pairs such as {input1, result1} and {input2, result2} (I know it’s a bit more technical than that, but I am trying to simplify).

Let me explain it using an example from the book.

Suppose that we are hired by a client to assess the relationship between advertising and sales of a product in 200 different markets, across three media (TV, radio, and newspaper), each with its own budget. The client cannot increase sales directly, but they can control how much they spend on advertising. So if we determine that there is some relationship between advertising and sales, we can advise our client to adjust the advertising budgets, thereby indirectly increasing sales.

In this setting:
input variables - advertising budgets - typically denoted by X with a subscript to distinguish them

X1 = TV
X2 = Radio
X3 = Newspaper

output variables - sales - typically denoted by Y
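As a rough sketch of how this notation maps onto data in Python, the snippet below stores a few markets and splits them into X and Y. The numbers are made up by me purely for illustration (they are not the book's Advertising data), and the names `markets` and `predictor_names` are hypothetical.

```python
# Made-up records for illustration only: budgets and sales for a few markets.
markets = [
    {"TV": 100.0, "Radio": 20.0, "Newspaper": 30.0, "Sales": 12.0},
    {"TV": 50.0, "Radio": 35.0, "Newspaper": 10.0, "Sales": 9.5},
    {"TV": 200.0, "Radio": 5.0, "Newspaper": 60.0, "Sales": 15.0},
    {"TV": 10.0, "Radio": 40.0, "Newspaper": 45.0, "Sales": 7.0},
]

predictor_names = ["TV", "Radio", "Newspaper"]  # X1, X2, X3

# X holds one row per market and one column per input variable.
X = [[row[name] for name in predictor_names] for row in markets]

# Y holds the output (sales) for each market, in the same order as X.
Y = [row["Sales"] for row in markets]

n = len(X)     # number of observations (markets)
p = len(X[0])  # number of predictors (here 3)

print(f"n = {n} markets, p = {p} predictors")
print("first market: X =", X[0], "Y =", Y[0])
```

In the book's example n would be 200, one row per market.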

More generally, suppose that we observe a quantitative response Y and p different predictors, X1, X2, ..., Xp. We assume that there is some relationship between Y and X = (X1, X2, ..., Xp), which can be written in the very general form

Y = f(X) + ε

Here:
f - some fixed but unknown function of X1, X2, ..., Xp
ε (the Greek letter epsilon, not the letter e) - a random error term, which is independent of X and has mean zero

In this formulation, f represents the systematic information that X provides about Y.

The figure above illustrates this formula (please note that it comes from a different example in the book, not the advertising example I have used in this article). The blue curve represents the true underlying relationship between income and years of education (f), which is generally unknown (but is known in this case because the data were simulated). The black lines represent the error associated with each observation (ε). Note that some errors are positive (if an observation lies above the blue curve) and some are negative (if an observation lies below the curve). Overall, these errors have a mean of approximately zero.
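The same idea can be sketched with simulated data in Python. Please treat this as an assumption-heavy toy: the function `f` below is a curve I made up (the book does not give the formula behind its figure), and the noise level is arbitrary. The point is only that, when f is known, each error is the gap between an observation and the curve, some gaps are positive, some are negative, and they average out close to zero.

```python
import math
import random
import statistics

random.seed(1)


def f(years_of_education):
    """A made-up 'true' relationship; in real problems f is unknown."""
    return 20.0 + 60.0 / (1.0 + math.exp(-(years_of_education - 16.0) / 2.0))


n = 500
X = [random.uniform(10, 22) for _ in range(n)]    # simulated years of education
epsilon = [random.gauss(0, 5) for _ in range(n)]  # random error with mean zero
Y = [f(x) + e for x, e in zip(X, epsilon)]        # Y = f(X) + epsilon

# Because we simulated the data, we can recover each error exactly.
errors = [y - f(x) for x, y in zip(X, Y)]
positive = sum(1 for e in errors if e > 0)  # observations above the curve
negative = sum(1 for e in errors if e < 0)  # observations below the curve
mean_error = statistics.mean(errors)

print(f"{positive} errors above the curve, {negative} below")
print(f"mean error: {mean_error:.3f}")
```

With real data we never see f or ε directly, which is why f has to be estimated.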

In general, the function f may involve more than one input variable. In the book's figure for this case, income is plotted as a function of years of education and seniority. Here f is a two-dimensional surface that must be estimated based on the observed data. In essence, statistical learning refers to a set of approaches for estimating f. During our journey, we will outline some of the key theoretical concepts that arise in estimating f, as well as tools for evaluating the estimates obtained.
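Here is a minimal sketch of what "estimating f" can look like with two inputs. It rests on assumptions of mine: the data is simulated, the true surface is a flat plane that I chose, and the estimate uses a least squares plane, which is just one of many possible approaches and not necessarily the one behind the book's figure. `fit_plane` is a hypothetical helper name.

```python
import random
import statistics

random.seed(2)


def true_f(education, seniority):
    """A made-up true surface; normally we would not know this."""
    return 10.0 + 3.0 * education + 0.5 * seniority


n = 300
education = [random.uniform(10, 22) for _ in range(n)]
seniority = [random.uniform(0, 40) for _ in range(n)]
income = [true_f(e, s) + random.gauss(0, 4) for e, s in zip(education, seniority)]


def fit_plane(x1, x2, y):
    """Least squares fit of y = b0 + b1*x1 + b2*x2; returns (b0, b1, b2)."""
    m1, m2, my = statistics.mean(x1), statistics.mean(x2), statistics.mean(y)
    d1 = [a - m1 for a in x1]
    d2 = [a - m2 for a in x2]
    dy = [a - my for a in y]
    s11 = sum(a * a for a in d1)
    s22 = sum(a * a for a in d2)
    s12 = sum(a * b for a, b in zip(d1, d2))
    s1y = sum(a * b for a, b in zip(d1, dy))
    s2y = sum(a * b for a, b in zip(d2, dy))
    det = s11 * s22 - s12 * s12  # solve the 2x2 system with Cramer's rule
    b1 = (s1y * s22 - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    b0 = my - b1 * m1 - b2 * m2
    return b0, b1, b2


b0, b1, b2 = fit_plane(education, seniority, income)


def f_hat(e, s):
    """Our estimate of f, built only from the observed data."""
    return b0 + b1 * e + b2 * s


print(f"estimated f: income is roughly {b0:.2f} + {b1:.2f}*education + {b2:.2f}*seniority")
print(f"true f(16, 20) = {true_f(16, 20):.2f}, estimated = {f_hat(16, 20):.2f}")
```

The estimate only comes close here because I assumed the right shape (a plane) for f. Choosing that shape well is a big part of what the later material is about.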

Why Estimate f?

There are two main reasons we may wish to estimate f:
1. Prediction
2. Inference

We will cover both in the next article.
Until then, happy reading.

Learning Data Science without any prior experience

I am starting my data science journey with no prior experience with math. The goal of this learning is to work with climate data accurately.
