# How to see how a dataset changes over time with PCA.

Alasdair Lees on 8 Mar 2018
Edited: njj1 on 8 Mar 2018
Hi all,
I have a large dataset X of size n by t, where each column is a set of variables (which represents a network condensed into a vector) at a certain timestep. I'd like to see how this data (specifically its structure) evolves over time, and I was wondering if it was possible to do so with PCA by treating each column as its own separate class. What I planned is to use PCA on X, then, after projecting it onto a 2D graph, colour the points so I can see how the data changes over time (so the points that correspond to data from t=1 would be red, t=2's would be slightly darker and so on).
Is such a thing possible? I know how to use PCA in MATLAB ([W, pc] = pca(a), then transpose pc and plot the first two rows) but I'm not so sure how to do the colouring step, if indeed it can actually be done (since PCA just reduces the dimensions of the matrix).
Any advice would be greatly appreciated, even if it's just saying it can't be done or pointing out another way to see how the structure changes over time. Failing that, advice on how to label points on the 2d plot as being similar/part of the same class would be useful as well.

njj1 on 8 Mar 2018
Edited: njj1 on 8 Mar 2018
It's hard to know exactly what you are hoping to achieve here. When using pca(x) in MATLAB, we are assuming that "rows of X correspond to observations and columns correspond to variables", which is reversed in your case. That is, unless you consider each time step to be a different variable. If you had a matrix that was t x n, where each column was a different variable and each row was the set of n variables at a given time, then [W,pc] = pca(x) would produce an n x n matrix, W (assuming t > n), whose columns are the weights that define linear combinations of the variables. The matrix pc is the projection of x onto the respective columns of W, so the first column of pc is the projection of your data matrix onto the first (primary) principal component. The primary principal component is the direction that contains the greatest variance in your data, the second contains the second greatest variance, and so on. If you took the first two columns of pc, you would be getting the linear combinations of your n variables that produce the first and second greatest variances, with one row per time step. Without seeing what this would produce, it's hard for me to judge whether it will communicate what you want. Nevertheless, it can be done, and there's certainly no harm in giving it a shot and seeing what happens.
As to the plotting, you can use "scatter" and color code the points according to your time vector. It does this automatically if you pass the time vector as the color argument (see the help file for a more thorough explanation), in which case each point is colored by mapping its time value onto the current colormap. If you want a different colormap, you will have to set that manually.
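As a rough sketch of the transposed setup described above: the data here is synthetic, and the sizes n and t and the random-walk stand-in for X are assumptions made purely for illustration. It also assumes the Statistics and Machine Learning Toolbox (for pca) and R2016b or later (for the implicit expansion used in the centering check).

```matlab
% Synthetic stand-in for the real data: n variables by t time steps.
% The sizes and the random-walk data are hypothetical.
rng(0);
n = 5; t = 200;
X = cumsum(randn(n, t), 2);      % each column is one time step

% pca expects observations in rows, so transpose X to t-by-n.
[W, pc] = pca(X.');              % W: n-by-n weights, pc: t-by-n projections

% pca centers the data by default, so pc should equal the centered data times W.
Xc = X.' - mean(X.', 1);
disp(norm(pc - Xc * W));         % expected to be close to zero

% One 2-D point per time step: projections onto the first two components.
xy = pc(:, 1:2);
```

Each row of xy then corresponds to one time step, which is what makes coloring the points by time possible afterwards.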
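A minimal sketch of the coloring step, again on synthetic data (the sizes, the random-walk X, the marker size, and the red-to-dark-red colormap are all assumptions chosen to match the question's description, not the only way to do it):

```matlab
% Hypothetical data: n variables by t time steps, one column per time step.
rng(0);
n = 5; t = 200;
X = cumsum(randn(n, t), 2);

% Project the time steps (rows of X.') onto the principal components.
[~, pc] = pca(X.');

% Time vector used as the color argument: one value per plotted point.
tvec = (1:t).';

% Custom colormap running from bright red (early) to dark red (late).
cmap = [linspace(1, 0.2, 256).', zeros(256, 2)];

figure;
scatter(pc(:, 1), pc(:, 2), 25, tvec, 'filled');  % color mapped from tvec
colormap(cmap);
cb = colorbar;
cb.Label.String = 'time step';
xlabel('PC 1');
ylabel('PC 2');
```

Leaving out the colormap(cmap) call keeps the default colormap, which is the "automatic" behaviour mentioned above.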