
Wednesday, 8 March 2023

PCA in brief

Principal Component Analysis (PCA) is a data reduction technique that condenses many features into a few new ones while explaining as much of the variation as possible. This makes it possible to visualise and interpret data with many dimensions. The data are transformed into a new coordinate system whose axes are called principal components. Each principal component is a linear combination of the original features, in which every feature is multiplied by some constant and the results are summed.

Briefly, PCA is applied in these steps:

  • The data are centred around the mean. In other words, the mean of each feature is subtracted from every data point of that feature.

  • The covariance matrix of the features is calculated. The covariance matrix arranges the variances of the features and the covariances of each pair of features. For a sample covariance, the deviations of the two features from their respective means are multiplied for each corresponding data point, summed, and divided by the number of samples minus 1.

  • The eigenvalues of the covariance matrix are calculated. An eigenvalue is the constant which, when multiplied with a given vector, gives the same result as multiplying the covariance matrix with that vector. Such a vector is called an eigenvector.

  • The eigenvectors are used to multiply the centred data points. The resulting values are the scores of each observation on that principal component.

  • Generally, principal components with eigenvalues greater than 1 are retained. Each principal component explains some share of the variance. In most cases the first two or three principal components explain most of the variance, although this certainly varies with the type of data.
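The steps above can be sketched with NumPy (a minimal illustration on made-up data; the variable names are my own):

```python
import numpy as np

# Toy data: 6 samples, 3 features (made-up numbers)
X = np.array([[2.5, 2.4, 1.2],
              [0.5, 0.7, 0.3],
              [2.2, 2.9, 1.0],
              [1.9, 2.2, 0.9],
              [3.1, 3.0, 1.4],
              [2.3, 2.7, 1.1]])

# 1. Centre each feature around its mean
Xc = X - X.mean(axis=0)

# 2. Covariance matrix of the features (divisor n - 1)
C = np.cov(Xc, rowvar=False)

# 3. Eigenvalues and eigenvectors of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(C)   # eigh: C is symmetric
order = np.argsort(eigvals)[::-1]      # largest variance first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# 4. Scores: project the centred data onto the eigenvectors
scores = Xc @ eigvecs

# 5. Proportion of variance explained by each component
explained = eigvals / eigvals.sum()
print(explained.round(3))
```

The first entry of `explained` is the share of variance captured by the first principal component, the second entry by the second, and so on.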

The complexity of the data is reduced to a few features, which is an advantage when interpreting the data. The principal component scores add the further advantage of showing which features matter most in each principal component. In simpler terms, PCA is a method that replaces the X and Y axes of a scatter plot with new axes along which the spread of the data points is easiest to see.

The following resources are helpful for diving deeper into PCA.

- PCA : the math - step-by-step with a simple example https://youtu.be/S51bTyIwxFs 
- StatQuest: Principal Component Analysis (PCA), Step-by-Step https://youtu.be/FgakZw6K1QQ 
- Eigenvalues and Eigenvectors (Math is Fun) https://www.mathsisfun.com/algebra/eigenvalue.html 


Sunday, 18 October 2020

Strip Plot (Split Block) RCBD Design

Appropriate when the interaction between the two factors is important,

Appropriate when the two factors are applied in large plots: one factor is applied in horizontal strips and the other in vertical strips

Factor A and Factor B are each randomized independently of the other factor: both are randomized within each block

ERROR is divided into three components: Error (a), Error (b), Error (ab)

~ The number of treatments is determined for Factor A; its treatments are randomized in one direction

~ The number of treatments is determined for Factor B; its treatments are randomized in the direction perpendicular to Factor A

~ The above process is repeated in the other blocks. The number of levels of Factor A, the number of levels of Factor B, and the number of replications should be determined beforehand. [Square plots are recommended to reduce variability within the blocks.]


WE HAVE:

a = number of levels of Factor A

b = number of levels of Factor B

r = number of replications

GT = Grand Total

Mean = GT/abr


DEGREES OF FREEDOM:

Block = r-1

Factor A = a-1,         Factor B = b-1,

Error (a) = (a-1)(r-1)    Error (b) = (b-1)(r-1)

A x B = (a-1)(b-1)

Error (ab) = (a-1) (b-1) (r-1)

Total = (abr - 1)
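The degrees-of-freedom bookkeeping above is easy to check in code (a minimal sketch; the values of a, b and r are made up):

```python
# Degrees of freedom for a strip-plot (split-block) RCBD
a, b, r = 4, 3, 5   # example: levels of A, levels of B, replications

df = {
    "Block":      r - 1,
    "Factor A":   a - 1,
    "Error (a)":  (a - 1) * (r - 1),
    "Factor B":   b - 1,
    "Error (b)":  (b - 1) * (r - 1),
    "A x B":      (a - 1) * (b - 1),
    "Error (ab)": (a - 1) * (b - 1) * (r - 1),
}
df["Total"] = a * b * r - 1

# The component df must add up to the total df
assert sum(v for k, v in df.items() if k != "Total") == df["Total"]
```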


SQUARES (each "Total X square" below means: square each X total, then sum these squares):

Correction Factor (CF) = GT * GT / abr

Total R square / ab

Total A square / br

Total B square / ar

Total AB square / r

Total AR square / b

Total BR square / a

Total ABR square (the sum of the squares of all individual observations)


SUMS OF SQUARES:

Total SS = Total ABR square - CF

RSS = Total R square/ab - CF

ASS = Total A square/br - CF

BSS = Total B square/ar - CF

ESS (a) = Total AR square/b - Total A square/br - Total R square/ab + CF

ESS (b) = Total BR square/a - Total B square/ar - Total R square/ab + CF

ABSS = Total AB square/r - Total A square/br - Total B square/ar + CF

ESS (ab) = Total ABR square - (Total AB square/r + Total AR square/b + Total BR square/a) + (Total A square/br + Total B square/ar + Total R square/ab) - CF
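Reading "Total X square" as the sum of the squared X totals, the sums of squares can be computed like this (a sketch on random data; the array layout is my own assumption):

```python
import numpy as np

rng = np.random.default_rng(1)
a, b, r = 4, 3, 5
y = rng.normal(size=(a, b, r))      # y[i, j, k]: A level i, B level j, block k

sq = lambda t: float((t ** 2).sum())  # "Total X square" = sum of squared totals

GT = y.sum()
CF = GT ** 2 / (a * b * r)

A_sq  = sq(y.sum(axis=(1, 2)))      # Factor A totals
B_sq  = sq(y.sum(axis=(0, 2)))      # Factor B totals
R_sq  = sq(y.sum(axis=(0, 1)))      # block (replication) totals
AB_sq = sq(y.sum(axis=2))           # A x B cell totals
AR_sq = sq(y.sum(axis=1))           # A x block totals
BR_sq = sq(y.sum(axis=0))           # B x block totals

TotalSS = sq(y) - CF
RSS  = R_sq / (a * b) - CF
ASS  = A_sq / (b * r) - CF
BSS  = B_sq / (a * r) - CF
ABSS = AB_sq / r - A_sq / (b * r) - B_sq / (a * r) + CF
ESSa = AR_sq / b - A_sq / (b * r) - R_sq / (a * b) + CF
ESSb = BR_sq / a - B_sq / (a * r) - R_sq / (a * b) + CF

# Error (ab) takes whatever is left of the total
ESSab = TotalSS - RSS - ASS - BSS - ESSa - ESSb - ABSS
```

Computing `ESSab` by subtraction gives the same number as the explicit formula, which is a handy arithmetic check.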


MEAN SQUARES:

Each mean square (MS) is obtained by dividing the sum of squares by its respective df.


F COMPUTED:

Factor A = MS (A) / EMS (a)

Factor B = MS (B) / EMS (b)

A x B = MS (AB) / EMS (ab)


POOLED MSE = (SS a + SS b + SS ab) / (df of Error a + df of Error b + df of Error ab)

Pooled CV (%) = 100 * Sq root (Pooled MSE) / Mean
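With the error sums of squares and their df in hand, the pooled MSE and CV follow directly (a sketch; all numeric values are made up):

```python
import math

# Made-up error sums of squares and their degrees of freedom
SSa, SSb, SSab = 12.0, 8.0, 20.0
dfa, dfb, dfab = 12, 8, 24          # e.g. a = 4, b = 3, r = 5

pooled_mse = (SSa + SSb + SSab) / (dfa + dfb + dfab)

mean = 6.5                          # grand mean (made-up)
pooled_cv = 100 * math.sqrt(pooled_mse) / mean   # CV expressed in percent

print(round(pooled_mse, 4), round(pooled_cv, 2))
```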


The CVs of A, B and A x B are computed from their respective MSEs in the same way.


LSD: computed only if the factor is significant

LSD = t (at the corresponding error df) * Sq root [(2 * MSE) / (number of levels of the other factor * number of replications)]
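For example, the LSD for comparing two Factor A means might look like this (an illustrative sketch; the MSE is made up and the t value is taken from a table for Error (a) df = 12):

```python
import math

a, b, r = 4, 3, 5
mse_a = 0.91    # Error (a) mean square (made-up)
t = 2.179       # two-sided t at alpha = 0.05, Error (a) df = (a - 1)(r - 1) = 12

# Each Factor A mean averages over b levels of B and r blocks
se_diff = math.sqrt(2 * mse_a / (b * r))
lsd = t * se_diff

print(round(lsd, 3))
```

Two Factor A means that differ by more than `lsd` would be declared significantly different.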



Resources:

Arnouts, Heidi, et al. "Design and Analysis of Industrial Strip-Plot Experiments." Quality and Reliability Engineering International, 2018, www.academia.edu/14930782/Design_and_analysis_of_industrial_strip_plot_experiments. Accessed 18 Oct. 2020.

Friday, 15 May 2020

Free Resources for Learning R (Programming)

  R is a popular and widely used programming language for statistical analysis. With its many programming and statistical features, and being completely free, it has much to offer every data science enthusiast and academic. 

As a learner myself, I have tried to collect free resources for learning the R programming language. By free, I mean freely available (100% free) and accessible over the internet, with no need to pay anything online. Luckily, the current digital world grants our wish to learn almost anything; time and dedication, however, are certainly required. Moreover, the collection is best prepared well before beginning the learning journey.

