Principal Component Analysis (PCA)
- Full form of PCA = Principal Component Analysis
- Each principal component is a linear combination of the original features
- Principal Component 1 (PC1) captures the largest share of the variance in the data
- PCA is used for Dimensionality Reduction and visualizing high dimensional data
- Differences along PC1 are more important than differences along PC2, differences along PC2 are more important than differences along PC3, and so on
- The newly projected dataset has maximum variance along the first component direction, then the next, then the next, and so on
- Variance can be treated as a measure of information, so the data carries more information along the first component direction than along the later ones
Steps:
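- For each dimension, compute the mean
- Plot that point (the mean of the data)
- Shift the data so that this point sits at the origin
- This shifts every feature in the same way, so the relative positions of the data points do not change
- Fit a line through the origin with the lowest sum of squared residuals (perpendicular distances from the points to the line)
- That is PC1
- For the next principal component $PC_i$, fit the best line through the origin that is perpendicular to $PC_1, \dots, PC_{i-1}$
- Project the $D$-dimensional data onto the chosen principal components (usually 2 or 3)
- Now rotate PC1 to make it horizontal, along the X axis
- That rotates the data points too
- That is the final PCA result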
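The steps above can be sketched for the two-dimensional case in plain Python. This is only a minimal illustration, not a library implementation: the function name `pca_2d` and the sample points are made up, and the closed-form angle $\theta = \frac{1}{2}\operatorname{atan2}(2 s_{xy},\, s_{xx} - s_{yy})$ for the best-fitting line only works in 2D (for more dimensions a library would typically rely on an eigendecomposition or SVD).

```python
import math

def pca_2d(points):
    """Return (scores, pc1, pc2) for a list of (x, y) pairs."""
    n = len(points)
    # Mean of each dimension.
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Shift the data so the mean sits at the origin.
    centered = [(x - mean_x, y - mean_y) for x, y in points]
    # Sample variances and covariance (divide by n - 1).
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)
    # Direction of the best-fitting line through the origin: PC1.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    pc1 = (math.cos(theta), math.sin(theta))
    # PC2 is perpendicular to PC1.
    pc2 = (-math.sin(theta), math.cos(theta))
    # Projecting onto PC1 and PC2 is the rotation that makes PC1 horizontal.
    scores = [(x * pc1[0] + y * pc1[1], x * pc2[0] + y * pc2[1])
              for x, y in centered]
    return scores, pc1, pc2

# Hypothetical sample points lying close to the line y = x.
points = [(2.0, 1.9), (4.0, 4.2), (6.0, 5.8), (8.0, 8.1)]
scores, pc1, pc2 = pca_2d(points)

# Variation along each component (see the definition that follows).
var_pc1 = sum(s[0] ** 2 for s in scores) / (len(scores) - 1)
var_pc2 = sum(s[1] ** 2 for s in scores) / (len(scores) - 1)
print("PC1 direction:", pc1)
print("variation along PC1, PC2:", var_pc1, var_pc2)
```

With these points PC1 comes out close to the 45° diagonal, and almost all of the variation ends up in the first score coordinate.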
[!def] Variation for Principal Component
$$
\text{Variation for PC}_i = \frac{\text{Sum of squared distances from the origin to the points projected onto PC}_i}{n-1}
$$
- If the sum of squared distances for PC1 is 15 and for PC2 is 3
- Then about 83% ($\frac{15}{15+3}$) of the variation in the data is represented by PC1
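The arithmetic in this example can be checked with a few lines of Python. The sample count `n` below is a hypothetical value chosen only for illustration; it cancels out of the ratio, so any `n > 1` gives the same proportions.

```python
# Sums of squared distances from the example above.
sum_sq = {"PC1": 15.0, "PC2": 3.0}
n = 10  # hypothetical number of samples; it cancels in the ratio

# Variation for each PC = sum of squared distances / (n - 1).
variation = {pc: ss / (n - 1) for pc, ss in sum_sq.items()}
total = sum(variation.values())

# Share of the total variation carried by each component.
proportion = {pc: v / total for pc, v in variation.items()}
for pc, p in proportion.items():
    print(f"{pc}: {p:.1%}")
# PC1: 83.3%
# PC2: 16.7%
```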
Practical Tips:
- Make sure all variables are on the same scale, e.g., [0, 1] (data normalization)
- Make sure the data is centered on the origin (if the library does not already do this)
[!question] Why do we need to do data normalization in PCA?
If we don't normalize the data before PCA, features with a large variance can dominate the principal components purely because of their units. For example, if one feature is a weight in kg and another is a weight in grams, the feature in grams might dominate the first component because its values, and therefore its variance, are numerically much larger than those of the feature in kg.
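A small sketch of this effect, assuming two made-up measurements (one recorded in kg, one in grams) and min-max scaling to [0, 1]. The helper names `pc1_direction` and `min_max` are hypothetical, and the closed-form PC1 angle used here is valid for two features only.

```python
import math

def pc1_direction(xs, ys):
    """Unit vector of PC1 for two features (closed form, 2D only)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cx = [x - mx for x in xs]
    cy = [y - my for y in ys]
    sxx = sum(x * x for x in cx) / (n - 1)
    syy = sum(y * y for y in cy) / (n - 1)
    sxy = sum(x * y for x, y in zip(cx, cy)) / (n - 1)
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return math.cos(theta), math.sin(theta)

def min_max(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Made-up data: the second feature has much larger numbers due to its unit.
weight_kg = [1.0, 2.0, 3.0, 4.0, 5.0]
weight_g = [3000.0, 1000.0, 5000.0, 2000.0, 4000.0]

raw = pc1_direction(weight_kg, weight_g)
scaled = pc1_direction(min_max(weight_kg), min_max(weight_g))
print("PC1 weights without scaling:", raw)
print("PC1 weights after scaling:  ", scaled)
```

Without scaling, PC1 is almost entirely the gram feature (its weight is close to 1 and the kg feature's weight is close to 0). After scaling, both features contribute equally to PC1 for this data.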
[!question] Why is it required to center the data in PCA?
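PCA needs centered data because it looks for the directions of largest variance, and variance is measured around the mean. By centering the data, we shift every feature to a mean of 0. If we don't center the data, the first component can be misled: it may point from the origin toward the mean of the data instead of along the direction in which the data actually varies the most.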
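To illustrate, the sketch below fits the best line through the origin (in 2D, using a closed-form angle) once on raw points and once on centered points. The helper name `line_through_origin` and the three sample points are made up for the example.

```python
import math

def line_through_origin(points):
    """Direction of the line through the origin that best fits the points (2D)."""
    sxx = sum(x * x for x, _ in points)
    syy = sum(y * y for _, y in points)
    sxy = sum(x * y for x, y in points)
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    return math.cos(theta), math.sin(theta)

# Made-up points: far from the origin, varying along the (1, -1) direction.
points = [(9.0, 11.0), (10.0, 10.0), (11.0, 9.0)]
n = len(points)
mean = (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)
centered = [(x - mean[0], y - mean[1]) for x, y in points]

without_centering = line_through_origin(points)
with_centering = line_through_origin(centered)
print("without centering:", without_centering)
print("with centering:   ", with_centering)
```

Without centering, the fitted direction points from the origin toward the mean at (10, 10), so both of its coordinates have the same sign. With centering, it follows the (1, -1) direction along which the points actually vary.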