# How to select the components that show the most variance in PCA

I have a huge data set that I need for training (32000*2500). This seems to be too much for my classifier. So I decided to do some reading on dimensionality reduction and specifically into PCA.

From my understanding, PCA takes the current data and replots them in another (x,y) coordinate system. These new coordinates don't mean anything by themselves, but the data are rearranged so that one axis has maximum variation. After obtaining these new coefficients, I can drop the ones having minimum variation.

Now I am trying to implement this in MATLAB and am having trouble with the output provided. MATLAB always considers rows as observations and columns as variables. So my input to the pca function would be my matrix of size (32000*2500). This would return the PCA coefficients in an output matrix of size 2500*2500.

The help for pca states:

Each column of coeff contains coefficients for one principal component, and the columns are in descending order of component variance.

In this output, which dimension is the observations of my data? I mean, if I have to give this to the classifier, will the rows of coeff represent my data's observations, or is it now the columns of coeff?

And how do I remove the coefficients having the least variation, and thus effectively reduce the dimension of my data?

##### 0 Comments

### Accepted Answer

the cyclist
on 27 Feb 2016

Edited: the cyclist
on 12 Apr 2021

Here is some code I wrote to help myself understand the MATLAB syntax for PCA.

rng 'default'

M = 7; % Number of observations

N = 5; % Number of variables observed

% Made-up data

X = rand(M,N);

% De-mean (MATLAB will de-mean inside of PCA, but I want the de-meaned values later)

X = X - mean(X); % Use X = bsxfun(@minus,X,mean(X)) if you have an older version of MATLAB

% Do the PCA

[coeff,score,latent,~,explained] = pca(X);

% Calculate eigenvalues and eigenvectors of the covariance matrix

covarianceMatrix = cov(X);

[V,D] = eig(covarianceMatrix);

% "coeff" are the principal component vectors.

% These are the eigenvectors of the covariance matrix.

% Compare "coeff" and "V". Notice that they are the same,

% except for column ordering and an unimportant overall sign.

coeff

V

% Multiply the original data by the principal component vectors to get the

% projections of the original data on the principal component vector space.

% This is also the output "score". Compare ...

dataInPrincipalComponentSpace = X*coeff

score

% The columns of X*coeff are orthogonal to each other.

% This is shown with ...

corrcoef(dataInPrincipalComponentSpace)

% The variances of these vectors are the eigenvalues of the covariance matrix,

% and are also the output "latent". Compare these three outputs

var(dataInPrincipalComponentSpace)'

latent

sort(diag(D),'descend')

The first figure on the wikipedia page for PCA is really helpful in understanding what is going on. There is variation along the original (x,y) axes. The superimposed arrows show the principal axes. The long arrow is the axis that has the most variation; the short arrow captures the rest of the variation.

Before thinking about dimension reduction, the first step is to redefine a coordinate system (x',y'), such that x' is along the first principal component, and y' along the second component (and so on, if there are more variables).

In my code above, those new variables are dataInPrincipalComponentSpace. As in the original data, each row is an observation, and each column is a dimension.

These data are just like your original data, except it is as if you measured them in a different coordinate system -- the principal axes.

Now you can think about dimension reduction. Take a look at the variable explained. It tells you how much of the variation is captured by each column of dataInPrincipalComponentSpace. Here is where you have to make a judgement call. How much of the total variation are you willing to ignore? One guideline is that if you plot explained, there will often be an "elbow" in the plot, where each additional variable explains very little additional variation. Keep only the components that add a lot more explanatory power, and ignore the rest.

In my code, notice that the first 3 components together explain 87% of the variation; suppose you decide that that's good enough. Then, for your later analysis, you would only keep those 3 dimensions -- the first three columns of dataInPrincipalComponentSpace. You will have 7 observations in 3 dimensions (variables) instead of 5.
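The selection step described above can be sketched in a few lines. This is a minimal illustration, not part of the original answer: the 85% cutoff and the names threshold, k and Xreduced are assumptions chosen for the example.

```matlab
rng 'default'
X = rand(7,5);                                  % made-up data: 7 observations, 5 variables
[coeff,score,~,~,explained] = pca(X);

threshold = 85;                                 % assumed cutoff, in percent of total variance
k = find(cumsum(explained) >= threshold, 1);    % smallest number of components reaching the cutoff
Xreduced = score(:,1:k);                        % same 7 observations, only k columns
```

Xreduced is what you would pass to a classifier in place of the original data. The number of rows is unchanged; only the number of columns shrinks.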

I hope that helps!

##### 38 Comments

Faraz
on 27 Feb 2016

That was immensely helpful. Thanks a lot :)

So in PCA, we can reduce the size of each observation (variables). I think I misunderstood that after the new component space we can choose the most prominent observations and in doing so reduce the number of observations as well.

I applied pca on my matrix and the explained vector determined that out of 2500 variables of each dimension, 96% of the variation is in the first 500 components. So after PCA my data can be represented as 32000*500 ?

Thanks again

Image Analyst
on 27 Feb 2016

Faraz
on 29 Feb 2016

the cyclist
on 12 Oct 2018

Matheus Henrique Fessel
on 25 Oct 2018

This code helped a lot enlightening what happens inside the pca function!

I would like to know how to plot these data in a biplot form, like the figure mentioned by the Cyclist. I know that there is function to perform this, but my license doesn't include it unfortunately. Could anyone explain the steps?

Thanks in advance.

the cyclist
on 25 Oct 2018

The following code will make a biplot (of PC1 and PC2) in a similar style to MATLAB's, using the variables from my code.

x_comp = 1; % Principal component for x-axis

y_comp = 2; % Principal component for y-axis

figure

hold on

for nc = 1:N

h = plot([0 coeff(nc,x_comp)],[0 coeff(nc,y_comp)]);

set(h,'Color','b')

end

Sarah Spector
on 6 Nov 2018

the cyclist
on 12 Feb 2019

Sorry that I did not see this for a while. I hope it is not too late to be useful!

The variable explained gives the percentage of the total variance, for each principal component. Therefore,

sum(explained(1:3))

is the percentage of the total variance for the first three.

Yvo Delaere
on 4 Mar 2019

Dear Cyclist,

Thanks for your reply! However, for me the output 'coeff' and 'V' are not the same. I am using Matlab R2018b. All the values are there, but are mirrored and the signs are different.

Thanks

coeff =

-0.5173 0.7366 -0.1131 0.4106 0.0919

0.6256 0.1345 0.1202 0.6628 -0.3699

-0.3033 -0.6208 -0.1037 0.6252 0.3479

0.4829 0.1901 -0.5536 -0.0308 0.6506

0.1262 0.1334 0.8097 0.0179 0.5571

V =

0.0919 0.4106 -0.1131 -0.7366 -0.5173

-0.3699 0.6628 0.1202 -0.1345 0.6256

0.3479 0.6252 -0.1037 0.6208 -0.3033

0.6506 -0.0308 -0.5536 -0.1901 0.4829

0.5571 0.0179 0.8097 -0.1334 0.1262

the cyclist
on 4 Mar 2019

There is no implied ordering of the eigenvectors, and the sign of an eigenvector is unimportant (because that sign would appear on both sides of the equation that it solves for).

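The "mirroring" is not essential. For this symmetric covariance matrix, eig happened to return the eigenvalues in ascending order, while pca always sorts in descending order, so the columns appear reversed. The eig documentation does not guarantee sorted output, so don't rely on that ordering.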

the cyclist
on 17 Apr 2019

I'd like to add one clarification to my comment above. While there is no implied ordering of the vectors V, since they are simply eigenvectors, the vectors in coeff are ordered in descending order of component variance (as stated in the documentation).

You can see that

var(dataInPrincipalComponentSpace)

has descending values.
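To compare coeff and V directly, one option is to sort the eig output yourself and then ignore the sign of each column. The sketch below assumes the same made-up data as my code above; the names order, Vsorted and maxDifference are introduced only for this illustration.

```matlab
rng 'default'
X = rand(7,5);
X = X - mean(X);                          % de-mean (needs implicit expansion, R2016b or later)
coeff = pca(X);
[V,D] = eig(cov(X));

[~,order] = sort(diag(D),'descend');      % indices of eigenvalues, largest first
Vsorted = V(:,order);                     % reorder the eigenvectors to match

% Columns now agree up to an overall sign, so compare absolute values
maxDifference = max(abs(abs(Vsorted(:)) - abs(coeff(:))))
```

maxDifference should be at the level of floating-point round-off.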

the cyclist
on 13 Sep 2019

Run my code above, and then

figure

bar(latent)

will give a bar chart of the variances, in descending order.

Alternatively,

figure

bar(explained)

will plot the percentage of variance explained by each component. Note that

100*latent./sum(latent) == explained

to within floating-point error.
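Because exact == comparisons on floating-point values can fail, a tolerance-based check is safer in practice. A minimal sketch, where the tolerance 1e-10 is an arbitrary assumption:

```matlab
rng 'default'
X = rand(7,5);
[~,~,latent,~,explained] = pca(X);

percentFromLatent = 100*latent./sum(latent);              % variances as percent of total
agree = all(abs(percentFromLatent - explained) < 1e-10)   % compare within a tolerance
```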

Warid Islam
on 15 Jun 2020

Hi @the cyclist,

I am having a similar problem but I am unable to find any solution. I have a 32*22 matrix (32 observations and 22 variables). I applied the following code for PCA. The resulting output is a 22*22 matrix. It seems that the number of observations has been reduced, which is not desirable. Is it possible to do PCA while keeping the number of observations the same and reducing the number of variables? Thank you.

coeff = pca(statsArray);

coeff2 = coeff(:,1:2);

the cyclist
on 15 Jun 2020

The number of observations has not been reduced. coeff does not give the observations, but rather the transformation from the old to the new coordinate system.

The observations in the new principal component space are given by the output score, not coeff. Note my code above:

dataInPrincipalComponentSpace = X*coeff

score

These arrays are 7x5 in my case, and will be 32x22 in your case.

Warid Islam
on 16 Jun 2020

Hi @the cyclist,

Thank you for the answer. It worked for me. I have one more query. The score array is a 32*22 matrix. If I take the first two columns of the matrix, does that mean that I take the first two components of the PCA? All I want to do is take the two components that have the maximum variances.

the cyclist
on 16 Jun 2020

Yes.

The score array is the original data, but transformed into the principal component space. In that space, the variables (i.e. the columns) are orthogonal to one another. The first two columns are the ones with the largest variances.
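For the 32-by-22 case discussed here, the shapes work out as in the sketch below. The random statsArray is only a stand-in for the real data, and reduced is a name made up for the illustration.

```matlab
rng 'default'
statsArray = rand(32,22);            % stand-in for the real 32-by-22 data
[coeff,score] = pca(statsArray);

size(coeff)                          % 22-by-22: the transformation, not the data
size(score)                          % 32-by-22: the observations in PC space
reduced = score(:,1:2);              % 32-by-2: the two components with the largest variance
```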

Warid Islam
on 16 Jun 2020

Hi @the cyclist,

If I want to incorporate PCA into SVM, should I use the score array or dataInPrincipalComponentSpace? I am actually confused about the difference between score and dataInPrincipalComponentSpace.

the cyclist
on 16 Jun 2020

Warid Islam
on 16 Jun 2020

Hi @the cyclist,

I did run the code and it worked perfectly. Thank you for the clarification about the above two parameters.

NN
on 4 Dec 2020

Dear Cyclist ,

I have gone through the discussion to understand PCA. I am working on a forecasting problem with a neural network and used the syntax below to find the PCA components for reducing the dimension of the training and testing data.

Can I use the output of this command (the coeff matrix) as the new training and testing data for the neural network and use it for forecasting?

I have 9 input features for forecasting. How can I plot the contribution rates of each feature and principal component against the variance, to know the contribution of the features and the dimension reduction?

kindly help

the cyclist
on 4 Dec 2020

I have several comments here.

No, you should not use coeff as the new training data. coeff is not data -- it is the transformation matrix from the original coordinate system to the PCA coordinate system.

You can use the data from the new coordinate system for your neural network. These data are the score output from pca(). [These are equivalent to the variable I called dataInPrincipalComponentSpace.] Note that if you use all columns of dataInPrincipalComponentSpace, then you have not done dimensional reduction -- you will simply be in a new coordinate system where the vectors are orthogonal to each other. The dimensional reduction step is when you choose to drop columns from dataInPrincipalComponentSpace.

I haven't thought deeply about this, but I'm pretty sure you should only do the PCA on the training set to determine coeff. Otherwise you are leaking information from your test set back to the training set. (But you'll want to apply coeff to the test set, before putting it into your neural network.)
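A minimal sketch of that train/test workflow follows. The 80/20 split, the 95% threshold and all variable names are assumptions for illustration. The sixth output of pca, mu, is the column mean it subtracted, and the test set is centered with that same training mean.

```matlab
rng 'default'
X = rand(100,9);                                     % made-up data: 100 observations, 9 features
XTrain = X(1:80,:);                                  % assumed split, for illustration only
XTest  = X(81:end,:);

[coeff,scoreTrain,~,~,explained,mu] = pca(XTrain);   % fit on the training data only
k = find(cumsum(explained) >= 95, 1);                % assumed 95% threshold

trainReduced = scoreTrain(:,1:k);                    % 80-by-k
testReduced  = (XTest - mu)*coeff(:,1:k);            % 20-by-k, centered with the TRAINING mean
```

Both reduced arrays then live in the same coordinate system, so a network trained on trainReduced can be evaluated on testReduced.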

You can use the output explained to make a scree plot of the amount of explained variance in each principal component.

% Scree plot

figure

h = plot(explained,'.-');

set(h,'LineWidth',3,'MarkerSize',36)

ylim([0 100])

set(gca,'XTick',1:N)

title('Explained variance by principal components')

xlabel('Principal component number')

ylabel('Fraction of variation explained [%]')

nissrine Neyy
on 23 Mar 2021

I can't thank you enough for the explanation and time you put into this, so much appreciated sir.

Reading all the comments, I have a question. Let's say we have decided on the components that explain a certain percentage of the variation. To use them (for example, the first 2 columns), do we transform them back to the original data or use them as they are?

I want to use this in image processing to reduce the feature dimension. How do I do it? I mean, after getting the output results, do the chosen columns become the new features?

the cyclist
on 23 Mar 2021

You have two options. (I'll stick with your example where the first two principal components are sufficient.)

The first option is to use the first two columns of dataInPrincipalComponentSpace. If you do this, then you are now doing your calculations in the new space, with the new, transformed variables.

The second option is a little trickier. If, after finding the principal components, you find that the first two principal components are composed of a very small number of features from the original space, then you could stick with the original variables, but only use the ones that load very heavily on the first two principal components. For example, suppose the first two principal components were

[ 0.01 0.07;

0.06 0.89;

0.92 0.04;

... % only small values after this

]

then you see that the 1st PC is almost entirely composed of the 3rd original variable, and the 2nd PC is composed almost entirely of the 2nd original variable. So, you could choose to work in the original space, with just the 2nd and 3rd variable. (It's not usually this "clean", and there are usually some moderate values like 0.56 in there, that make this approach messier. But it can also be easier to interpret.)

Sai Pavan Batchu
on 26 Jun 2021

Hello,

How can I know which of the predictors (original data) have more weight in the principal components that explain 95% of the variance? Is it through the eigenvectors?

I wanted to know: if I just take those predictors which explain the prominent principal components, can I get the same results?

Let's say the predictors which explain 90-95% of the first 3 principal components.

(I am doing a multi class classification problem)

the cyclist
on 26 Jun 2021

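Yes, the eigenvectors (the output coeff) tell you the weight of the original predictors for each principal component (PC). Specifically,

coeff(i,j).^2

gives the fractional weighting of the i'th original predictor in the j'th PC. (Each column of coeff has unit length, so these squared values sum to 1 down each column; multiply by 100 for a percentage.)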

But what you want to do is tricky. It goes something like this:

You want to explain 95% of the variance. Therefore, find where cumsum(explained) >= 95. Let's suppose this requires 4 of the PCs.

So, you study the relationship of those 4 PCs to the original variables, by inspecting the first 4 columns of coeff. If just a few of your original variables contribute to the first 4 PCs, this is great news! You drop all the other variables -- losing a little bit of the explained variance -- and you are all set.

But suppose your first column of coeff looks like this:

coeff = [0.37;

0.42;

0.62;

0.41;

...

];

where the first PC has significant contribution from every variable? In that case, you cannot isolate a small number of the original predictors, and you cannot do what you want. Then it is sad trombone for you.
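The inspection steps above can be sketched as follows. The variance-weighted share computed at the end is just one possible heuristic for ranking the original variables, not an established rule and not something pca returns; the names weights, share and ranking are made up for the example.

```matlab
rng 'default'
X = rand(7,5);                              % made-up data
[coeff,~,~,~,explained] = pca(X);

k = find(cumsum(explained) >= 95, 1);       % number of PCs needed to reach 95%
weights = coeff(:,1:k).^2;                  % N-by-k; each column sums to 1
columnSums = sum(weights,1)

% Heuristic (an assumption): weight each PC by the variance it explains
share = weights*explained(1:k)/sum(explained(1:k));
[~,ranking] = sort(share,'descend')         % original variables, most to least involved
```

If share is concentrated in a few entries, those original variables are candidates to keep. If it is spread evenly, you are in the "sad trombone" situation described above.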

Tom
on 13 Sep 2021

I also think that your explanation was very good and useful, but I still have a question about how to do something specific. I have 6 variables and want 2 principal components. Now I want to create a scatterplot as follows:

The only difference to this graphic with my plot needs to be, that the ellipse should be a circle for me and it should be enclosing 95% of the cases.

I would be really glad if you can help me with this.

Kind regards

TG

the cyclist
on 13 Sep 2021

Here is how to do everything you asked except for figuring out how to select the radius such that 95% of the points are inside the circle. That is off-topic for this question. If you can't figure that part out on your own, I suggest you open a new question just for that. (Feel free to tag me on it, if you'd like.)

Note that a mathematical circle might not seem to be a circle on a plot, if your axes are not set to equal extents.

rng 'default'

M = 7; % Number of observations

N = 5; % Number of variables observed

% Made-up data

X = rand(M,N);

% De-mean (MATLAB will de-mean inside of PCA, but I want the de-meaned values later)

X = X - mean(X); % Use X = bsxfun(@minus,X,mean(X)) if you have an older version of MATLAB

% Do the PCA (Recall that "score" is the data in PC space)

[~,score] = pca(X);

% Scatter plot of the first two components

figure

hold on

h = plot(score(:,1),score(:,2),'r.');

set(h,'MarkerSize',16)

set(gca,'XLim',[-1 1],'YLim',[-1 1],'Box','on')

axis square

xlabel('Component 1')

ylabel('Component 2')

% Add a circle

p = nsidedpoly(1000, 'Center', [0 0], 'Radius', 0.8);

plot(p, 'FaceColor', 'w', 'EdgeColor', 'r')
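For the radius, one simple possibility is the empirical 95th percentile of the distances from the origin in the plane of the first two components. This is only a sketch under that assumption (an origin-centered circle and a purely empirical cutoff); the data and the names distance, sortedDistance and r are made up.

```matlab
rng 'default'
X = rand(200,6);                                   % made-up data: 200 observations, 6 variables
[~,score] = pca(X);

distance = sqrt(sum(score(:,1:2).^2, 2));          % distance of each point from the origin
sortedDistance = sort(distance);
r = sortedDistance(ceil(0.95*numel(distance)));    % empirical 95th percentile of the distances
fractionInside = mean(distance <= r)               % share of points inside the circle
```

The value r can then be passed as the 'Radius' argument of nsidedpoly in the plotting code above.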

Tom
on 13 Sep 2021

the cyclist
on 13 Sep 2021

No, it is not necessarily weird. The magnitude of the components in PC space will be dependent on the magnitude in the original space.

Depending on your application, you may want to normalize your input variables (e.g. dividing each variable by its standard deviation). The theory of why you might want to do that is beyond the scope of this forum, but if you search for keywords PCA and normalize, you can find lots of info online.
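As a rough illustration of why scaling matters, the sketch below runs pca on made-up data whose columns differ in scale by factors of 10, once as-is and once after zscore (which divides each de-meaned column by its standard deviation). The data and variable names are assumptions for the example.

```matlab
rng 'default'
X = rand(50,4) .* [1 10 100 1000];           % made-up data with very different scales (R2016b or later)

[~,~,~,~,explainedRaw] = pca(X);             % dominated by the large-scale column
[~,~,~,~,explainedStd] = pca(zscore(X));     % every variable scaled to unit variance

[explainedRaw explainedStd]                  % side-by-side percentages
```

Without scaling, the first component is essentially the column with the largest units. After scaling, the explained variance is spread much more evenly.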

Tom
on 15 Sep 2021

@the cyclist thanks a lot for your help so far! I managed to do all I wanted with your help, but I still can't figure out how to plot the circle that includes 95% of the cases. I have created a new question for that and added your name to the tags. Maybe you can take a look at it if you find some time.

Thanks in advance.

the cyclist
on 15 Sep 2021

Tom
on 16 Sep 2021

### More Answers (3)

naghmeh moradpoor
on 1 Jul 2017

Dear Cyclist,

I used your code and I was successful to find all the PCAs for my dataset. Thank you! On my dataset, PC1, PC2 and PC3 explained more than 90% of the variance. I would like to know how to find which variables from my dataset are related to PC1, PC2 and PC3?

Please could you help me with this Regards, Ngh

##### 1 Comment

Abdul Haleem Butt
on 3 Nov 2017

Sahil Bajaj
on 12 Feb 2019

Dear Cyclist,

Thanks a lot for your helpful explanation. I used your code and successfully found 4 principal components explaining 97% of the variance for my dataset, which had a total of 14 components initially. I was just wondering how to find which variables from my dataset are related to PC1, PC2, PC3 and PC4, so that I can ignore the others and know which parameters I should use for further analysis?

Thanks !

Sahil

##### 9 Comments

the cyclist
on 12 Feb 2019

In general, every variable contributes to every principal component. (The m-th element of the n-th column of the variable coeff is the weight of the m-th original variable in the n-th principal component; its square is the fraction of that component attributable to that variable.) For example, I have done analyses in which the first principal component was made up of approximately equal proportions of every initial variable. They were all highly correlated, and had about the same amount of impact on the total variation!

PCA can be a dimensional reduction technique, but not necessarily. It depends on what the data say, and your needs.

There are techniques that go beyond simple PCA (e.g. varimax), which provide a further "rotation" to the variable, that try to do variable reduction. It looks like MATLAB has the rotatefactors command. I've never used it, so I can't advise.
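For anyone who wants to experiment, a minimal call might look like the sketch below. It assumes the Statistics and Machine Learning Toolbox, made-up data, and an arbitrary choice of k = 2 retained components. Treat it as a starting point rather than a recommendation.

```matlab
rng 'default'
X = rand(50,5);                                % made-up data
coeff = pca(X);

k = 2;                                         % assumed number of retained components
[rotated,T] = rotatefactors(coeff(:,1:k));     % default rotation method is varimax
```

rotated has the same size as coeff(:,1:k), and T is the rotation matrix that was applied. The rotated columns are no longer ordered by explained variance.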

Yaser Khojah
on 18 Apr 2019

Is there an answer for this question?

Which variables from my dataset are related to PC1, PC2, PC3 and PC4?

Here is the explanation of each output. Each relates to the PCs, and nothing relates to the original data?

- coeff: each column contains coefficients for one principal component, and the columns are in descending order of component variance
- score: Rows of score correspond to observations, and columns correspond to components.
- explained: the percentage of the total variance explained by each principal component
- latent: Principal component variances, that is the eigenvalues of the covariance matrix of X, returned as a column vector.

I have used your code and I see that coeff and V do not match in order?

coeff =

-0.5173 0.7366 -0.1131 0.4106 0.0919

0.6256 0.1345 0.1202 0.6628 -0.3699

-0.3033 -0.6208 -0.1037 0.6252 0.3479

0.4829 0.1901 -0.5536 -0.0308 0.6506

0.1262 0.1334 0.8097 0.0179 0.5571

V =

0.0919 0.4106 -0.1131 -0.7366 -0.5173

-0.3699 0.6628 0.1202 -0.1345 0.6256

0.3479 0.6252 -0.1037 0.6208 -0.3033

0.6506 -0.0308 -0.5536 -0.1901 0.4829

0.5571 0.0179 0.8097 -0.1334 0.1262

However, (dataInPrincipalComponentSpace and score) and (var(dataInPrincipalComponentSpace)' and latent) are matching. Does that mean the first row in latent is related to the first column in the original data? I think any new user is confused about how to relate these answers to the original data's variables. Can you please explain? Thank you

the cyclist
on 19 Apr 2019

Your first question

Recall that the original data is an array with M observations of N variables. There will also be N principal components. The relationship between the original data and the nth PC is

PCn = X*coeff(:,n) % Scores on the n-th PC, for a chosen index n (X is the de-meaned data)

For example, PC1 is given by

PC1 = X*coeff(:,1)

You can recover the original (de-meaned) data from the principal components by

dataInPrincipalComponentSpace * coeff'

Your second question

The first row of latent is not related to the first column of the original data. It is related to the first principal component (which you can see is a linear combination of the original data).
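The round trip can be checked numerically. The sketch below uses the sixth output of pca, mu (the column mean that pca subtracted), so that it also works on data that were not de-meaned beforehand; the names Xrecovered and recoveryError are made up for the example.

```matlab
rng 'default'
X = rand(7,5);                                   % made-up data, NOT de-meaned
[coeff,score,~,~,~,mu] = pca(X);                 % mu is the column mean pca removed

Xrecovered = score*coeff' + mu;                  % back to the original coordinates
recoveryError = max(abs(Xrecovered(:) - X(:)))   % should be at round-off level
```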

Harsha K
on 27 Feb 2020

Dear @the cyclist

Regarding the answer to your first question.

Lets say I have found the Eigen Values sorted in descending order which is the case after following your code above.

For the Eigen vectors corresponding to the sorted Eigen Values, I would like to recover the original data, so only those variables (or columns) of the original matrix that correspond to the first 3 principle vectors, for example.

Please advise on the backwards transformation.

% Can I do this,

% is this corresponding first 3 or most needed 3 variable columns ?

Xexp = dataInPrincipalComponentSpace(:, 1:3) * coeff(1:3, 1:3)' + meanX(1:3);

% Where meanX = mean(X, 1);

the cyclist
on 28 Feb 2020

Please carefully read the question asked by Sahil Bajaj in this sequence of comments, and my answer to it.

I'll quote myself here: "In general, every variable contributes to every principal component." In my example with 5 variables, if they had all been very highly correlated with each other, then all 5 of them would have contributed significantly to the first principal component. You could not eliminate any of the original variables without significant loss of information.

Referring again to the figure at the top of the wikipedia page on PCA: you can't eliminate either the x-axis variable or the y-axis variable. Instead, you choose a linear combination of them that captures the maximal variation.

And, repeating myself one more time ... there are techniques like varimax, applied after PCA, that do allow you to remove some of the original variables.
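For the backward transformation itself, keeping the first k components means taking k columns of both score and coeff while keeping all rows of coeff, and adding back the full mean. A minimal sketch, with k = 3 and the variable names as assumptions:

```matlab
rng 'default'
X = rand(7,5);
[coeff,score,~,~,~,mu] = pca(X);

k = 3;                                                  % assumed number of retained components
Xapprox = score(:,1:k)*coeff(:,1:k)' + mu;              % 7-by-5: all five variables are still present
relError = norm(Xapprox - X,'fro')/norm(X - mu,'fro')   % relative size of what was discarded
```

Note that Xapprox still has all 5 columns. It is an approximation of every original variable, not a selection of 3 of them, which is consistent with the point that every variable contributes to every component.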

Darren Lim
on 2 Feb 2021

Thanks for answering this post. You wouldn't imagine how much time I have saved by studying your answer, so thank you!

I just picked up PCA a few days ago to solve a financial trading problem, so I am very new to PCA. Just to confirm my understanding, in the coeff example you provided:

coeff =

-0.5173 0.7366 -0.1131 0.4106 0.0919

0.6256 0.1345 0.1202 0.6628 -0.3699

-0.3033 -0.6208 -0.1037 0.6252 0.3479

0.4829 0.1901 -0.5536 -0.0308 0.6506

0.1262 0.1334 0.8097 0.0179 0.5571

Can I clarify that for column 1, the variable with coefficient 0.6256 carries the largest "weightage" in PC 1? So if my Variable(2,1) is, say, the Mathematics (0.6256) subject of my 7 sample students (observations), can I say that Mathematics then accounts for the largest "variance" among all 7 students in the whole data set (since PC1 has the highest variance and also accounts for 42.2% of the entire data set)?

And say Variable(1,1) is English (-0.5173), does it mean that English tends to anti-correlate with Mathematics?

...and for PC2, Variable(2,1) English (0.7366) describes the difference the most for the sample students?

In essence, I think I roughly understand PCA at a high level. What I am not so sure about is how to interpret the data, as I think PCA is powerful but won't be useful if the output is misinterpreted. Any help interpreting coeff will be appreciated :) (My challenge is to find out which variables are useful for my trading and eliminate unnecessary variables so that I can optimise a trading strategy.)

Thanks in advance !

the cyclist
on 2 Feb 2021

I'm happy to hear you have found my answer to be helpful.

The way you are trying to interpret the results is a little confusing to me. Using your example of school subjects, I'll try to explain how I would interpret.

Let's suppose that the original dataset variables (X) are scores on a standardized exam:

- Math (column 1)
- Writing
- History
- Art
- Science

[Sorry I changed up your subject ordering.]

Each row is one student's scores. Row 3 is the 3rd student's scores, and X(3,4) is the 3rd student's Art score.

Now we do the PCA, to see what combination of variables explains the variation among observations (i.e. students).

coeff contains the coefficients of the linear combinations of the original variables. coeff(:,1) are the coefficients to get from the original variables to the first new variable (which explains the most variation between observations):

-0.5173*Math + 0.6256*Writing -0.3033*History + 0.4829*Art + 0.1262*Science

At this point, the researcher might try to interpret these coefficients. For example, because Writing and Art are very positively weighted, maybe this variable -- which is NOT directly measured! -- is something like "Creativity".

Similarly, maybe the coefficients coeff(:,2), which weight Math very heavily, correspond to "Logic".

And so on.

So, interpreting that single value of 0.6256, I think you can say, "Writing is the most highly weighted original variable in the new variable that explains the most variation."

But, it also seems to me that to answer a couple of your questions, you actually want to look at the original variables, and not the PCA-transformed data. If you want to know which school subject had the largest variance -- just calculate that on the original data. Similarly for the correlation between subjects.

PCA is (potentially) helpful for determining if there is some underlying variable that explains the variation among multiple variables. (For example, "Creativity" explaining variation in both Writing and Art.) But factor analysis and other techniques are more explicitly designed to find those latent factors.
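The two kinds of question can be separated in code as in the sketch below. The exam scores are made up, and the subject names simply follow the example ordering above; names is a string array, which needs R2016b or later.

```matlab
rng 'default'
names = ["Math","Writing","History","Art","Science"];
X = 100*rand(7,5);                        % made-up exam scores: 7 students, 5 subjects

% Questions about the subjects themselves: use the original data
subjectVariance = var(X);                 % variance of each subject
[~,mostVariable] = max(subjectVariance);
names(mostVariable)                       % subject with the largest variance
R = corrcoef(X);                          % pairwise correlations between subjects

% Questions about the new variables: use coeff
coeff = pca(X);
[~,heaviest] = max(abs(coeff(:,1)));      % subject with the largest weight in PC1
names(heaviest)
```

The two answers need not coincide: the subject with the largest variance is not automatically the one most heavily weighted in the first component.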

Darren Lim
on 3 Feb 2021

Crystal Clear! I think many others will find this answer helpful as well , thanks again for your insights and time!

Darren

Salma Hassan
on 18 Sep 2019

I still do not understand.

I need an answer to my question: how many eigenvectors do I have to use?

from these figures

##### 3 Comments

the cyclist
on 19 Sep 2019

It is not a simple answer. The first value of the explained variable is about 30. That means that the first principal component explains about 30% of the total variance of all your variables. The next value of explained is 14. So, together, the first two components explain about 44% of the total variation. Is that enough? It depends on what you are trying to do. It is difficult to give generic advice on this point.

You can plot the values of explained or latent, to see how the explained variance is captured as you add each additional component. See, for example, the wikipedia article on scree plots.

Salma Hassan
on 19 Sep 2019

If we say that the first two components, which explain about 44%, are enough for me, what does this mean for latent and coeff? How can this lead me to the number of eigenvectors?

Thanks for your interest in replying. I appreciate this.

the cyclist
on 20 Sep 2019

It means that the first two columns of coeff are the coefficients you want to use.
