How can I use Principal Component Analysis (PCA) for this problem?

Dear all,
I have a dataset of size 2643 (n) x 8 (p), where p is the number of predictors and n is the number of observations. The MATLAB code I am using generates a 1D PCA output from a 2D input (e.g. p1 and p2). Since I have 8 predictors in total, I can generate 28 1D PCA outputs, one for each pair of predictors: p1xp2, p1xp3, p1xp4, ..., p1xp6, ..., p7xp8. My question is: is it correct to generate 28 plots for 8 predictors and then calculate the PC score vector? I would appreciate any guidance on this. This is the code I am using:
*****************************************
% Step 1: load the data from the dataset
datasetname = 'mydataset.csv';
dataset = csvread(datasetname);
numdata = size(dataset, 1);   % number of rows in my dataset (2643)
x1 = dataset(:, 4);
x2 = dataset(:, 6);
% Step 2: find the mean and subtract it
% (first step towards the covariance matrix: average and subtract)
x1mean = mean(x1);
x2mean = mean(x2);
x1new = x1 - x1mean*ones(numdata, 1);
x2new = x2 - x2mean*ones(numdata, 1);
subplot(3,1,1);
plot(x1, x2, 'o');
title('Original Data: AppIDxAcID');
% Step 3: covariance matrix (2x2 for the two selected columns)
covariancematrix = cov(x1new, x2new);
% Step 4: eigenvectors and eigenvalues
[V, D] = eig(covariancematrix);
D = diag(D);
[~, imax] = max(D);
maxeigval = V(:, imax);       % eigenvector belonging to the largest eigenvalue
% Step 5: derive the new data set by projecting onto that eigenvector
finaldata = maxeigval' * [x1new, x2new]';
subplot(3,1,2);
stem(finaldata, 'DisplayName', 'finaldata', 'YDataSource', 'finaldata');
title('PCA 1D output: AppIDxAcID')
% Now classify each point by the sign of its projection
subplot(3,1,3);
title('Final Classification: AppIDxAcID')
hold on
for i = 1:size(finaldata, 2)
    if finaldata(i) >= 0
        plot(x1(i), x2(i), 'o')
        plot(x1(i), x2(i), 'r*')
    else
        plot(x1(i), x2(i), 'o')
        plot(x1(i), x2(i), 'g*')
    end
end                           % this closing end for the loop was missing
*****************************************
Regards, Ngh
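For comparison, the same five steps can be applied to all 8 columns at once instead of to each of the 28 pairs. The following is only a sketch under stated assumptions: the attached file is not available here, so `X` is synthetic stand-in data of the same size, and the names `Xc`, `C`, `lambda`, `scores` and `pc1` are illustrative.

```matlab
% Stand-in data (assumption): with the real file this would be
%   X = csvread('mydataset.csv');
rng(1);                               % fixed seed so the run is repeatable
n = 2643;  p = 8;
X = randn(n, p) * randn(p, p);        % synthetic, correlated columns

% Step 2: centre every column
Xc = X - repmat(mean(X, 1), n, 1);

% Step 3: one p-by-p covariance matrix already contains all 28 pairwise covariances
C = cov(Xc);

% Step 4: eigenvectors and eigenvalues, sorted by decreasing variance
[V, D] = eig(C);
[lambda, order] = sort(diag(D), 'descend');
V = V(:, order);

% Step 5: project the centred data onto the eigenvectors
scores = Xc * V;                      % n-by-p matrix of PC scores
pc1 = scores(:, 1);                   % score vector of the first component

numPairs = size(nchoosek(1:p, 2), 1); % number of distinct column pairs
```

Here `lambda` holds the variance captured by each component, so a single decomposition gives one score vector per component rather than one per pair of columns.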
  5 Comments
the cyclist on 21 Jun 2017
I've looked at your dataset. You seem to have a mix of some variables that are continuous, and some that are discrete. And you actually have a relatively small number of variables. So, again, I would say that this is not the typical kind of problem that PCA would be used on. And, again, I would ask what are you actually trying to accomplish?
I hope you are not offended by this statement, but so far you are not showing to us a very clear understanding of your own problem. That makes it difficult or impossible for us to help.
naghmeh moradpoor on 22 Jun 2017
Edited: naghmeh moradpoor on 22 Jun 2017
Hello,
Thank you very much for all your answers, and sorry if I was not clear. I am trying to use PCA for dimensionality reduction, followed by a Self-Organizing Map (SOM) for clustering (unsupervised learning). The dataset contains user behaviour within an organisation, and I am trying to find insider threats within that organisation by first using PCA and then SOM. The dataset has been standardized and normalized. I would appreciate your help.
Ngh
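A minimal sketch of the dimensionality-reduction step described above, assuming the Statistics and Machine Learning Toolbox function `pca` is available. The data `X` is a synthetic stand-in for the real 2643 x 8 matrix, and the 95% variance threshold is an arbitrary choice for illustration, not something stated in this thread.

```matlab
% Stand-in data (assumption): replace with the real 2643-by-8 matrix
rng(2);                                   % fixed seed so the run is repeatable
X = randn(2643, 8) * randn(8, 8);         % synthetic, correlated columns

% pca centres the columns by default; 'explained' is in percent
[coeff, score, latent, ~, explained] = pca(X);

% Keep the smallest number of components reaching the chosen threshold
threshold = 95;                           % hypothetical cut-off, in percent
k = find(cumsum(explained) >= threshold, 1);

Xreduced = score(:, 1:k);                 % 2643-by-k input for the clustering step
fprintf('Keeping %d of %d components\n', k, size(X, 2));
```

`Xreduced` would then be passed to whichever SOM implementation is used for the clustering stage.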

Answers (4)

John D'Errico on 21 Jun 2017
Edited: John D'Errico on 21 Jun 2017
Let me explain my comment as to why using PCA here is a silly task.
PCA is a tool that tries to reduce the dimensionality of your data. It looks at all of your data, trying to find things that go together.
For example, suppose I took a snapshot of stock market prices of 2643 different stocks on the exchange. I'll do this for 8 days, just reading the opening prices of each of those 2643 stocks. I would now have 8 observations, made on 2643 predictors, just as you have.
In fact, those stocks will vary due to factors that are often independent. Some stocks will vary together, so all health care stocks may move as a group. Oil companies, mining companies, electronics manufacturing, etc. But even there, some stocks will vary just due to random factors that are not under control or measurement.
PCA looks for patterns though. And if two companies just happen to vary in the same direction on each day, it will decide that they are related!
The problem comes in when you have THOUSANDS of comparisons. Just by random chance, you will see two of these variables moving together, even though they have no relationship at all. This is a known fact, not unlike the birthday paradox. If you have so many possible pairs, then some of them will seem to be related even though they are completely unrelated. Random chance assures that will happen.
The problem is, with only 8 data points, you have no ability to know that some signals are random chance, and some are valid signals.
Yes, if some factor drives the entire stock market up over those 8 days, then PCA will pick out that signal as a strong one. For example, midway through your sampling time, a major war breaks out! This drives the overall stock market down, but defense stocks go up, on speculation that planes will need to be built and bombs manufactured. PCA will see this pattern and convince you that is how things work.
Of course, that is only one possible way all of these stocks might move around. Suppose that instead, you took your snapshot right in the middle of a major health care law change? Again, some stocks will go up, some down, but not the same stocks as before. PCA would be telling you something completely different about the stock markets, and you would draw completely different conclusions.
Now, suppose instead, you have the same stock market "measurements", daily data, but now taken over the course of many years. A couple of wars along the way, healthcare changes, other legal changes, presidents being elected and (I can only hope) fired. Lots of stuff has happened, and PCA can now see patterns in the data. It still won't understand why some stocks sometimes move together, but it will see patterns that are not just random happenstance. Patterns that rise above the random noise in the system.
The point of all this is that with only 8 data points, you have no way to understand how all of these predictors move, what influences what, which variables are connected together through some (unknown) web of relationships. You have no idea if what you might see in some cases is just random happenstance. For example, today companies A and B just independently announced changes on their board of directors. Company A just fired their entire board due to corruption at the top. So their stock plummets. On the same day, company B, in a completely unrelated event, announces a huge earnings increase. The two stocks jump by wide amounts, up and down. PCA will see something happening, but it is just chance. The two stocks are unrelated, but given enough measurements, there will always be random variability that happens to look as if they are related.
8 observations simply is insufficient to tell you anything about the true variability of a system, not when you have thousands of variables, all moving randomly. Even if you had only a few variables, 8 observations is worth little in terms of information content.
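The chance-correlation effect described above can be checked with a small simulation. This is only a sketch: the "stocks" are pure independent noise generated with `randn`, and the variable names are illustrative.

```matlab
rng(0);                               % fixed seed so the run is repeatable
nObs  = 8;                            % observations (days)
nVars = 2643;                         % variables (stocks)
X = randn(nObs, nVars);               % independent noise: no true relationships

% Correlation between every pair of variables
R = corrcoef(X);                      % nVars-by-nVars matrix
R(1:nVars+1:end) = 0;                 % ignore each variable's correlation with itself
maxAbsCorr = max(abs(R(:)));

% After centring, 8 observations span at most 7 directions
Xc = X - repmat(mean(X, 1), nObs, 1);
nComponents = rank(Xc);

fprintf('Largest |correlation| between unrelated variables: %.3f\n', maxAbsCorr);
fprintf('Directions of variation available to PCA: %d\n', nComponents);
```

Even though no two columns are related by construction, some pair ends up almost perfectly correlated, and PCA has at most `nObs - 1` components to work with no matter how many variables there are.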
  1 Comment
naghmeh moradpoor on 21 Jun 2017
Sorry guys, my mistake, it is the other way around: 2643 observations and 8 predictors, i.e. 2643 (n) x 8 (p).


naghmeh moradpoor on 21 Jun 2017
Edited: naghmeh moradpoor on 21 Jun 2017
Sorry guys, my mistake, it is the other way around: 2643 observations and 8 predictors, i.e. 2643 (n) x 8 (p).
My dataset is attached. It is standardised and normalised.
Regards,
Ngh

Muhammad Ibrar on 22 Apr 2019
Why did you use the following lines if you have a big dataset, and why did you use numdata? Please explain.
x1=dataset(:,4);
x2=dataset(:,6);

Muhammad Ibrar on 31 May 2019
Can anyone explain why he used x1=dataset(:,4) and x2=dataset(:,6) in the PCA? There are 8 predictors, so why were columns 4 and 6 used? Please answer.
