How to write a large .dat file in a loop?

Hello,
I am trying to convert a large amount of data from one format (64 separate Neuralynx .ncs files) into another (a .dat file with all 64 combined, in int16 format) for use with the Kilosort package. Kilosort then uses this huge file in some clever way. The .dat file should be 2D with 64 channels x number of samples per channel (about 3 hours at 20 kHz).
Because my data is so large, I am unable to load it all into one matrix and then write it to a file (the total filesize for all 64 inputs is at least 30GB).
So, I have been trying to read each of my input files separately, convert the data type and then write it to a file. In the code below I managed to make it work for a huge .mat file using the matfile function, but I need .dat.
I then tried to do a similar thing using fwrite, where I tried to append each new .ncs readout. All the data gets written to file, but in an enormous 1-D list of numbers, and not in a 64-row array.
I tried to transpose my input in the hope that I would obtain separated rows that way, but that makes no difference.
How can I control how fwrite appends data to my file? Is it at all possible to make a 2D output like this?
Thanks,
Susan
% make .mat file
m = matfile(fullfile(OutFolder,[RatID,'_',RecDate,'.mat']),'Writable',true);
datFile = fullfile(OutFolder,[RatID,'_',RecDate,'.dat']);
% read .ncs files for all traces and write to .mat and .dat file
for ch = 1:64
    % load data
    InFile = fullfile(InFolder,['CSC',num2str(ch),'.ncs']);
    [~,~,samples] = readEegDataForKilosort(InFile); % gives 1D array of type double
    int_samples = int16(samples); clear samples;
    % write to .mat file (not actually useful)
    m.WholeRec(1:length(int_samples),ch) = int_samples;
    % write to .dat file
    if ch == 1 % make file for 1st channel
        fileID = fopen(datFile,'w');
    else % append for all next channels
        fileID = fopen(datFile,'a');
    end
    % write as 'int16' to match the data; 'uint16' cannot represent negative samples
    fwrite(fileID,int_samples','int16');
    fclose(fileID);
    clear int_samples
end

4 Comments

Where's a description of the input file for your secondary application?
You're writing a stream file here channel by channel: all of the time data for one channel, then all of it for the next. To write all channels sequentially for a single timestep instead, you would need every channel's data for that timestep in memory when it is written. In the order in which you're processing the files you don't have the other channels' data available if, as the comments and variable names indicate, each file you read is one channel.
If you can't hold it all in memory at once, you would need to read each file in segments that can be held in memory and then write that section of the time history; then process another group.
memmapfile and/or datastore could help here.
What is the storage format for the input files? Are they being saved as conventional doubles? If they are also stored as ints, it would seem you could save some memory by reading them straight into a variable of the proper type.
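As a minimal sketch of the memmapfile idea, assuming each channel has first been dumped to its own flat int16 file (a hypothetical intermediate step; the file name chan01.bin is made up for illustration), a segment can be indexed without loading the whole file:

```matlab
% Hypothetical intermediate: one flat int16 file per channel (name is made up).
tmpFile = fullfile(tempdir,'chan01.bin');
fid = fopen(tmpFile,'w');
fwrite(fid,int16(1:1000),'int16');   % stand-in for one channel's samples
fclose(fid);

% Map the file as a long int16 vector.
m = memmapfile(tmpFile,'Format','int16');
seg = m.Data(101:200);               % pull just this segment into a variable
nTotal = numel(m.Data);              % total number of samples in the file

clear m                              % release the mapping before deleting
delete(tmpFile);
```

With 64 such maps, the same index range could be taken from every channel in turn and written out as one block of the time history.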
Susan Leemburg on 30 Jan 2021
Edited: dpb on 30 Jan 2021
I'm not sure I understand completely, so I apologize if I'm being a bit obtuse (I also accidentally deleted my earlier reply).
Are you saying that instead of trying to write one file, and then another etc, I should rather read e.g. the first 30min of all my channels (or whatever fits in memory), write that to my .dat file, followed by the next section and so on?
My goal is to make a file that I can use as input for Kilosort2 https://github.com/MouseLand/Kilosort. It should be a .dat file in int16, with all data from all channels (Nchannels x Nsamples). Kilosort reads this data in sections, but requires a single input file.
My data is brain activity recorded with a Neuralynx system on 64 channels simultaneously, which is saved as one .ncs file per channel. NCS is a proprietary file format and I use the import function provided by Neuralynx to read it into MATLAB (https://neuralynx.com/software/category/matlab-netcom-utilities). This gives me my data for each channel as doubles. I don't load all the channels at once, because my PC can't deal with >30GB in memory (the MATLAB doubles seem to be much larger than the original ncs).
I then reshape the data a bit so that it is simply a list of subsequent datapoints (the original import is in 512-sample long columns). This is what my readEegDataForKilosort function does. After this, I have a long 1-D array with doubles. However, if I save all my data as doubles, the files become very large, so I converted to int16 before saving using matfile. The resulting .mat file doesn't turn out to be very useful, so I don't think I will be doing that going forward.
I'm trying to figure out for sure what the input file you need actually looks like, specifically.
I didn't find a description of that file format at the link; it's probably there, but isn't clear where that is.
Let's talk something small in size instead...if you had four channels and 3 observations, there would be twelve values. Are these to be arranged as a sequence of three (3) 4-vectors, sequentially in time as
Ch1O1 Ch2O1 Ch3O1 Ch4O1
Ch1O2 Ch2O2 Ch3O2 Ch4O2
Ch1O3 Ch2O3 Ch3O3 Ch4O3
? or as
Ch1O1 Ch1O2 Ch1O3
Ch2O1 Ch2O2 Ch2O3
Ch3O1 Ch3O2 Ch3O3
Ch4O1 Ch4O2 Ch4O3
?
In both cases I've introduced phantom records that would not be in a stream file simply to aid in readability.
The first writes each timestep for all channels, the second writes all timesteps (observations "O") for each channel sequentially.
Or, does the input processor have, by any chance, the ability to tell it which order the data are in?
The second above is what you have written; a stream file will be just a sequence of bytes; to write in the order by timestep/observation you will have to have those data in memory for all channels for each timestep as it is written.
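To make the byte order concrete, here is a small sketch using the four-channel, three-observation case. It relies on fwrite writing a matrix in column order; the scratch file name is made up, and each value encodes 10*channel + observation so the order in the stream can be read off directly.

```matlab
% 4 channels x 3 observations; entry = 10*channel + observation
nCh = 4; nObs = 3;
[chIdx,obsIdx] = ndgrid(1:nCh,1:nObs);
A = int16(10*chIdx + obsIdx);             % A(ch,obs), one channel per row

f = fullfile(tempdir,'order_demo.dat');   % hypothetical scratch file
fid = fopen(f,'w');
fwrite(fid,A,'int16');                    % written column by column
fclose(fid);

fid = fopen(f,'r');
stream = fread(fid,Inf,'int16=>int16')';  % raw order in the file, as a row
fclose(fid);
delete(f);
% stream is 11 21 31 41 12 22 32 42 13 23 33 43:
% all channels for observation 1, then observation 2, and so on.
```

So a channels-by-samples matrix written in one fwrite call gives the first (timestep-by-timestep) layout, while writing one channel vector after another gives the second.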
I've checked with the people from Kilosort, and they told me that I need the data to be intermingled: first sample 1 for all channels, then sample 2... like in the first example.
I also enter my sampling rate and the number of channels in Kilosort, and I'm pretty sure that the individual traces are reconstructed based on those.
I should be able to get the correct kind of output by reading a portion of each channel, building that into a matrix (one channel per column), writing the matrix to my .dat file, and then repeating and appending until I've written all my data, right?
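A minimal sketch of that loop, under stated assumptions: readChunk is a hypothetical stand-in for whatever partial read the Neuralynx importer allows (here it just fabricates values as 100*channel + sample number), the sizes are tiny, and the output file name is made up. One detail worth flagging: fwrite goes column by column, so a matrix with one channel per column has to be transposed to channels x samples before it is written.

```matlab
nCh = 4; nSamp = 10; chunkLen = 4;   % tiny stand-in sizes
% Hypothetical partial reader: returns samples idx of channel ch as a column.
% Fake data here: value = 100*channel + sample number.
readChunk = @(ch,idx) int16(100*ch + idx(:));

outFile = fullfile(tempdir,'chunk_demo.dat');  % hypothetical output name
fid = fopen(outFile,'w');            % open once, keep open while appending
for first = 1:chunkLen:nSamp
    idx = first:min(first+chunkLen-1,nSamp);
    chunk = zeros(numel(idx),nCh,'int16');     % one channel per column
    for ch = 1:nCh
        chunk(:,ch) = readChunk(ch,idx);
    end
    % fwrite goes column by column, so transpose to channels x samples
    fwrite(fid,chunk.','int16');
end
fclose(fid);

% Check: read back as nCh rows, one column per time step
fid = fopen(outFile,'r');
back = fread(fid,[nCh Inf],'int16=>int16');
fclose(fid);
delete(outFile);
```

Keeping the file open across chunks avoids reopening it in append mode on every pass; the chunk length would be chosen to fit 64 channels in memory.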
I also think that a lot of my confusion comes from initially misunderstanding how these particular files work. I thought that, just like for e.g. .mat files and text files, the structure I write into the files with fwrite will just come out in the same shape when I read that file back. But that is clearly not totally the case. Not without some extra instructions anyway.
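Those extra instructions amount to telling fread the shape, since the file itself stores only the values. A small sketch, assuming 64 interleaved channels and a made-up file name:

```matlab
nCh = 64;
f = fullfile(tempdir,'shape_demo.dat');   % hypothetical file name
fid = fopen(f,'w');
fwrite(fid,int16(1:nCh*5),'int16');  % 5 time steps of 64 interleaved channels
fclose(fid);

fid = fopen(f,'r');
flat = fread(fid,Inf,'int16=>int16');          % no shape given: one long column
frewind(fid);
data = fread(fid,[nCh Inf],'int16=>int16');    % shape given: 64 x 5
fclose(fid);
delete(f);
```

Here data(ch,t) is the sample of channel ch at time step t, which is the kind of reconstruction a reader can do once it knows the number of channels.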


Answers (0)

Release: R2019a

Asked: 30 Jan 2021

Commented: 1 Feb 2021
