Can you guess what that one thing is? Yeah, you got it right: we humans are knowingly or unknowingly generating a hell of a lot of Mr. ‘Data’ every second.
Data, data, data, it’s everywhere, and wherever it is, it is being harnessed by data engineers like us, who love to extract meaning and value out of it in order to serve humanity at large.

Having said that, these data sets are not easy to handle. We need to make sure the data is properly sampled, cleaned, curated, structured, and analyzed, so that it can yield meaningful inferences, patterns, and predictions.

As an aspiring or professional data engineer, you need to be able to identify which features/attributes in that sea of data are the real gold that needs to be hunted. That is where data engineers came across one wonderful tool/algorithm called PCA. It treats data redundancy based on the correlation features have with each other and is able to reduce a higher-dimensional data set to a lower-dimensional space.
PCA, the Principal Component Analysis method, is what we will look into deeply today. Along the way we will discuss:
- What is PCA?
- Why PCA?
- How does it work?
- PCA use case examples
- How to apply it using Python to solve one real-world problem
Let’s get started.

Technical Definition:
Principal Component Analysis (PCA) is a statistical procedure that orthogonally transforms the original n coordinates of a data set into a new set of n coordinates called principal components. As a result of the transformation, the first principal component has the largest possible variance; each succeeding component has the highest possible variance under the constraint that it is orthogonal to (i.e., uncorrelated with) the preceding components.
Too technical, yeah… Let’s make it easier for you with a

Functional Definition:
In a real-world scenario, we often get data sets that go beyond 2 or 3 dimensions. It is difficult for us to visualize and comprehend them when they have hundreds or thousands of features. Training ML models becomes difficult, time-consuming, and very expensive to run. To deal with this high-dimensionality problem, we came up with PCA, where we work out mathematically which features are important and which we can leave out.

This non-parametric technique, used to identify a smaller number of uncorrelated variables from a larger data set so that it becomes easier to comprehend and visualize, is what we call Principal Component Analysis. Here we reduce the dimension of the data set to make life easier for the data engineers who analyze it.
Principal Component Analysis (PCA) is one of the popular techniques for dimension reduction, feature extraction, and data visualization.
PCA is defined by a transformation of a high dimensional vector space into a low dimensional space. Consider the visualization of 20-dimensional data: it is barely possible to effectively show the shape of such a high dimensional distribution. PCA provides an efficient way to reduce the dimensionality (say, from 20 to 2 or 3), so it becomes much easier to visualize the shape of the data distribution. PCA is also useful for building a robust classifier when only a comparatively small amount of high dimensional training data is available: by reducing the dimensions of the learning data sets, PCA provides an effective and efficient method for data description and classification.
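To make that 20-to-2 reduction concrete, here is a minimal sketch using scikit-learn; the data set, its size, and its cluster structure are made up purely for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical 20-dimensional data set: 500 samples drawn around two cluster centres
rng = np.random.default_rng(42)
centres = rng.normal(size=(2, 20)) * 5
X = np.vstack([centres[i] + rng.normal(size=(250, 20)) for i in range(2)])

# Reduce 20 dimensions to 2 principal components for visualization
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(X.shape, "->", X_2d.shape)        # (500, 20) -> (500, 2)
print(pca.explained_variance_ratio_)    # the first component captures the most variance
# X_2d[:, 0] and X_2d[:, 1] can now be scatter-plotted to inspect the cluster structure
```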
A vector is a mathematical object that has a size, called the magnitude, and a direction.
For example, a vector would be used to show the distance and direction something moved in. If you ask for directions and a person says “Walk one kilometer towards the North”, that’s a vector. If he says “Walk one kilometer” without giving a direction, it would be a scalar.

In linear algebra, an eigenvector or characteristic vector of a linear transformation is a nonzero vector that changes at most by a scalar factor when that linear transformation is applied to it. Geometrically, an eigenvector corresponding to a real nonzero eigenvalue points in a direction in which it is stretched by the transformation, and the eigenvalue is the factor by which it is stretched.
Every eigenvector has a corresponding eigenvalue. An eigenvector is a direction. If the eigenvalue is negative, the direction is reversed. Loosely speaking, in a multidimensional vector space, the eigenvector is not rotated. However, in a one-dimensional vector space, the concept of rotation is meaningless.
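As a quick numeric check of the “stretched, not rotated” idea, here is a small NumPy sketch; the matrix values are arbitrary:

```python
import numpy as np

# An arbitrary 2x2 matrix used as the linear transformation
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# Eigen-decomposition: the columns of vecs are the eigenvectors
vals, vecs = np.linalg.eig(A)

for lam, v in zip(vals, vecs.T):
    # Applying A only scales the eigenvector by its eigenvalue: A v == lam * v
    print(lam, np.allclose(A @ v, lam * v))
```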
In machine learning, variance is an error that comes from sensitivity to small fluctuations in the training set. High variance can cause an algorithm to model the random noise in the training data rather than the intended outputs (overfitting). In other words, it is the change in a model’s prediction accuracy between the training data and the test data.
In probability, the variance of a random variable X is a measure of how much the values in its distribution vary, on average, with respect to the mean. It is denoted as the function Var() applied to the variable, e.g. Var(X).
In probability, covariance is the measure of the joint probability for two random variables. It describes how the two variables change together. It is denoted as the function Cov(X, Y), where X and Y are the two variables being considered.
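Here is a short NumPy sketch of Var() and Cov() on two made-up variables; the numbers are arbitrary, and note that np.cov uses the sample estimate (ddof=1) by default:

```python
import numpy as np

# Two made-up variables: y moves together with x, plus some noise
rng = np.random.default_rng(0)
x = rng.normal(loc=10.0, scale=2.0, size=1000)
y = 3.0 * x + rng.normal(scale=1.0, size=1000)

print("Var(x)    =", np.var(x, ddof=1))     # spread of x around its mean
print("Var(y)    =", np.var(y, ddof=1))
print("Cov(x, y) =", np.cov(x, y)[0, 1])    # positive: x and y increase together
```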
Some other terminologies one needs to understand:

Dimensionality:
It is the number of random variables in a dataset, or simply the number of features, or rather more simply, the number of columns present in your data set.

Correlation:
It shows how strongly two variables are related to each other. Its value ranges from -1 to +1. A positive value indicates that when one variable increases, the other increases as well, while a negative value indicates that the other decreases as the former increases. The absolute value indicates the strength of the relation.

Basically, PCA is a dimension reduction methodology that aims to reduce a large set of (often correlated) variables into a smaller set of (uncorrelated) variables, called principal components, which hold sufficient information without losing much of the relevant detail.

Mathematically, PCA is a projection of some higher dimensional object into a lower dimension. What sounds complicated is really something we encounter every day: when we watch TV, we see a 2D projection of 3D objects!
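A quick sketch of correlation with NumPy, on the same kind of made-up variables as above (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=1000)
pos = 2.0 * x + rng.normal(scale=0.5, size=1000)   # strongly, positively related to x
neg = -x + rng.normal(scale=0.5, size=1000)        # negatively related to x
noise = rng.normal(size=1000)                      # unrelated to x

# Off-diagonal entries of corrcoef lie between -1 and +1
print(np.corrcoef(x, pos)[0, 1])    # close to +1
print(np.corrcoef(x, neg)[0, 1])    # close to -1
print(np.corrcoef(x, noise)[0, 1])  # close to 0
```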
The first step of PCA is to scale the data and compute the covariance matrix of the features. Note that the diagonal entries of the covariance matrix are just the variances of the individual features:

Var[X1] = Cov[X1,X1] and Var[X2] = Cov[X2,X2].
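A small sketch of that step, again with made-up data, checking that the diagonal of the covariance matrix equals the individual variances:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))                     # made-up data: 200 samples, 3 features

X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize each feature
cov = np.cov(X_scaled, rowvar=False)              # 3x3 covariance matrix

# Diagonal entries are the per-feature variances: Var[Xi] = Cov[Xi, Xi]
print(np.allclose(np.diag(cov), np.var(X_scaled, axis=0, ddof=1)))  # True
```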
Next we calculate the eigenvalues and eigenvectors of the covariance matrix. The covariance matrix is a square matrix A, and ƛ is an eigenvalue of A if it satisfies the equation given below:

det(A - ƛI) = 0

where:

I is the identity matrix of the same dimension as A, which is a required condition for the matrix subtraction, and
det is the determinant of the matrix.

For each eigenvalue ƛ, a corresponding eigenvector v can be found by solving:

(A - ƛI)v = 0
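Here is a sketch of this step in NumPy, building a small covariance matrix from made-up data and checking both equations above numerically:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 3))
A = np.cov(X, rowvar=False)          # the covariance matrix plays the role of A

vals, vecs = np.linalg.eig(A)        # eigenvalues and eigenvectors of A
I = np.eye(A.shape[0])

for lam, v in zip(vals, vecs.T):
    print("det(A - lam*I) ~ 0:", np.isclose(np.linalg.det(A - lam * I), 0.0))
    print("(A - lam*I) v  ~ 0:", np.allclose((A - lam * I) @ v, 0.0))
```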
Once we have the eigenvectors, we sort them by their eigenvalues from highest to lowest and keep only the top ones; these become the columns of the feature vector:

Feature Vector = (eig1, eig2, eig3, …), where how many eigenvectors we keep depends upon the dimensionality of the space we are dealing with and how much variance we want to retain.
reducedData = FeatureVector^T x ScaledData^T
Here, reducedData is the matrix consisting of the principal components,
FeatureVector is the matrix we formed using the eigenvectors we chose to keep, and
ScaledData is the scaled version of the original dataset. ^T stands for the transpose we perform on the feature vector and the scaled data.
PCA combines our predictors and allows us to drop the eigenvectors that are relatively unimportant.
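Putting the steps above together, here is a from-scratch sketch in NumPy (made-up data): it scales the data, builds the covariance matrix, sorts the eigenvectors, and projects onto the top two.

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 5))                     # made-up data: 200 samples, 5 features

# 1. Scale the data (zero mean, unit variance per feature)
scaled_data = (X - X.mean(axis=0)) / X.std(axis=0)

# 2. Covariance matrix of the scaled features
cov = np.cov(scaled_data, rowvar=False)

# 3. Eigenvalues and eigenvectors, sorted from highest to lowest eigenvalue
vals, vecs = np.linalg.eigh(cov)                  # eigh: covariance matrices are symmetric
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]

# 4. Feature vector: keep the top k eigenvectors as columns
k = 2
feature_vector = vecs[:, :k]

# 5. Project: reducedData = FeatureVector^T x ScaledData^T
reduced_data = feature_vector.T @ scaled_data.T   # shape (k, n_samples)

print(reduced_data.T.shape)                       # (200, 2): two principal components per sample
print(vals / vals.sum())                          # proportion of variance captured by each component
```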
To help you all better understand and dry run the code, a complete worked example is given below.
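This is a minimal end-to-end sketch with scikit-learn, using its built-in Iris data set as a stand-in real-world problem (an assumption on my part; any tabular data set with more features than you can plot would do):

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Load a small real-world data set: 150 flowers, 4 numeric features
X, y = load_iris(return_X_y=True)

# Scale first: PCA is sensitive to the units/variance of each feature
X_scaled = StandardScaler().fit_transform(X)

# Keep the two components that capture the most variance
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

print(X.shape, "->", X_pca.shape)        # (150, 4) -> (150, 2)
print(pca.explained_variance_ratio_)     # roughly [0.73, 0.23]
# The 2-D scores in X_pca can now be scatter-plotted, coloured by y, to see the class structure
```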
Please get your hands dirty and share your feedback if you found it helpful.

So make data your friend: get into its depth and breadth to analyze it, and you will be surprised how much it has to say and infer. As a business leader, data needs to be at the core of your day-to-day decision making if you really want to be in the game for a longer period of time. Machines, when well trained with meaningful data, can empower you with the right set of information to compete in this highly dynamic marketplace.
Thanks for being together on this journey. Happy learning!
(Picture Credits: Victor Lavrenko)