You are now following this question
- You will see updates in your followed content feed.
- You may receive emails, depending on your communication preferences.
How to extract rows of data according to text containing specific words in cells in Matlab
16 views (last 30 days)
Show older comments
Hi all,
I am quite stuck with a problem, I am trying to extract certain variables from large Excel files that classify organisms from multiple years in Excel so I can process it in MATLAB. I want to extract all columns from A to L and the row-number of the data I need starts from 657828: 1048576. I have tried the filter function in excel but it doesn't work so I am doing it in MATLAB. How I want to filter it includes the column j called object_annotation_hierachy and the precise species I am trying to filter out are the following:
Arthropoda_Crustacea_Maxillopoda_Copepoda_Calanoida_Calanidae
Arthropoda_Crustacea_Maxillopoda_Copepoda_Calanoida_Metridinidae
Arthropoda_Crustacea_Maxillopoda_Copepoda_Calanoida_Candaciidae
Arthropoda_Crustacea_Maxillopoda_Copepoda_Calanoida_Heterorhabdidae
Arthropoda_Crustacea_Maxillopoda_Copepoda_Calanoida_Euchaetidae
Arthropoda_Crustacea_Maxillopoda_Copepoda_Calanoida_Metridinidae
Arthropoda_Crustacea_Maxillopoda_Copepoda_Cyclopoida_Oithonidae
Arthropoda_Crustacea_Maxillopoda_Copepoda_Calanoida_Acartiidae
Arthropoda_Crustacea_Maxillopoda_Copepoda_Calanoida_Temoridae
All other species are a variations of this but I am trying to include all data with 'Copepoda' in the title.
Further I want to extract by year which is in the first column and called object_id with the name ['cruise2012'] up to 2016.
The code so far looks like this, however it does not work:
C=csvread('cruise_2004_2016_ZooScan_dataset.csv');
%R657828, C1048576 (just used in first line of code to show location)
copepods= contains(C.object_id=="cruise2012")&(C.object_annotation_hierachy,"Copepoda");
C1=C(copepods,:);
Any help would be much appreciated!
5 Comments
the cyclist
on 27 May 2023
Can you upload the data? You can use the paper clip icon in the INSERT section of the toolbar.
Image Analyst
on 27 May 2023
Edited: Image Analyst
on 27 May 2023
If you have any more questions, then attach your data with the paperclip icon after you read this:
Maybe you need the "IgnoreCase" option of contains to be true and put the thing in parentheses after the & also into contains.
Sophia
on 27 May 2023
Hi i cant i'm afraid it is confidential, i have changed some of the headings names as well to make it distinguished. Could you help without it?
The layout is as follows:
Column A- cruise2012 (changes to cruise2013.. up to cruise2016), Column B, Column C (wanted data to extract), Column D (wanted data to extract), E-H, and then J (column name for copepod species), up to L (wanted data to extract).
The length of the data is 1048576 long but data i need starts from 657828 (as i dont need previous years).
Thanks :)
Sophia
on 30 May 2023
Okay i have uploaded it and replaced it will dummy text, it is usually a lot longer. I have also split and numerized the taxonomy and code is in the sheet in case there is an easier way to do it. Because i have numerized the 1st and 10th row i used this code but it still doesnt seem to be working.
opts=detectImportOptions("Zoocam.xlsx");
opts.VariableTypes(2)={'double'};
opts.VariableTypes(19)={'double'};
opts.VariableTypes(20)={'double'};
C=readtable('Zoocam.xlsx', opts);
C.index=(C.object==2017 & C.object_annotation_hierarchy<18);
C_new=C(C.index==1,{'object','object_lat', 'object_lon','object_annotation_hierarchy', 'object_area'});
writetable(C_new,'2017_data.csv');
Accepted Answer
Matt J
on 27 May 2023
Edited: Matt J
on 30 May 2023
copepods= contains(C.object_id,"cruise2012") & ...
contains(C.object_annotation_hierarchy,"Copepoda");
14 Comments
Sophia
on 27 May 2023
Hi, thanks for the help still not working different error messages when i run the line of code you sent me. All other code seems to be fine, i get this error when using a csv file:
Error using dlmread
Unable to parse a "Numeric" field when reading row 1, field 1.
Actual Text:
"variables_heading_names,..,..."
Expected: A number or literal "NaN", "Inf". (possibly signed, case insensitive)
Error in csvread (line 53)
m=dlmread(filename, ',', r, c); %#ok<*DLMRD>
Error in Copepoda (line 1)
C=csvread('DATASET.csv');
And this when i use an excel spreadsheet:
Dot indexing is not supported for variables of this type.
Error in Copepoda (line 4)
copepods= contains(C.object_id,"PELGAS2012") & contains(C.object_annotation_hierachy,"Copepoda");
Any thoughts? I am not an expert at this so i cannot infer many possibilities
Sophia
on 28 May 2023
Thanks Matt for the help. I am just getting a string of zeros in one column, however i want to paste the data from rows 1, 3, 4, 10 and 12. I have tried with some other code (where i replaced all the copepods in the excel spreadsheet with 1).
copepods= contains(C.object_id,"cruise2012");
[idx]=find(copepods(:,10)==1);
copepodindex=copepods(:,1, 3, 4, 12); % column
Output=copepodindex(idx);
But this hasnt worked, any ideas?
Walter Roberson
on 28 May 2023
csvread() can only be used to read files that have non-numeric text in them if the text is in a leading row to be skipped according to the R parameter, or if the text is in a leading column that is to be skipped according to the C parameter. If you have a text column that is not a leading column (to be skipped) then csvread cannot deal with it even by specifying an RNG.
The csvread() process is to call dlmread() with comma delimiter, and the dlmread() process is to call textscan() with format '' (empty format) and HeaderLines according to R, and undocumented option indicating how many leading columns to skip. Everything that is not a leading skipped row or leading skipped column must be numeric. dlmread() does not apply the RNG until after it has read in the bottom-right data -- so if the RNG specified that you wanted to ignore a trailing column that contained text, it would be too late, textscan() would have already failed to read the data.
C=csvread('cruise_2004_2016_ZooScan_dataset.csv');
%R657828, C1048576 (just used in first line of code to show location)
copepods= contains(C.object_id=="cruise2012")&(C.object_annotation_hierachy,"Copepoda");
csvread() always returns numeric data, never a struct or object or table that you can use dot-indexing on. csvread() does not create one variable per named column. csvread() cannot deal with column names at all, except by the user having specified a R offset that is sufficient to skip the column names.
Sophia
on 28 May 2023
Thanks Walter, i am now using readtable; And no Matt it isnt working with the original problem i am just getting a string of 0's, not even any 1's so it is not identifying a match Where i want to print all rows of data from all columns (but prefentially 1, 3, 4, 10 and 12) with 1 and 10 being the columns i want to filter by the word 'cruise2014' and 'copepoda' respectively. It is a such a simple take but one i seemingly cant get around!!
Matt J
on 30 May 2023
Edited: Matt J
on 30 May 2023
This seems to work.
filename = "small_data.xlsx";
opts=detectImportOptions(filename, 'ExpectedNumVariables',13);
opts.VariableTypes([1,10])={'string'};
C=readtable(filename,opts);
Warning: Column headers from the file were modified to make them valid MATLAB identifiers before creating variable names for the table. The original column headers are saved in the VariableDescriptions property.
Set 'VariableNamingRule' to 'preserve' to use the original column headers as table variable names.
Set 'VariableNamingRule' to 'preserve' to use the original column headers as table variable names.
copepods= contains(C.object_id,"cruise2012") & ...
contains(C.object_annotation_hierarchy,"Copepoda");
C_new=C(copepods,:)
C_new = 5×13 table
object_id objid object_lat object_lon object_date object_time object_depth_min object_depth_max object_annotation_category object_annotation_hierarchy classif_id object__item object_area
____________ __________ __________ __________ ___________ ___________ ________________ ________________ __________________________ ________________________________________________________________________________________________________________ __________ ____________ ___________
"cruise2012" 4.1535e+08 47 -4.3383 2.009e+07 16101 20 100.1 {'Calanoida'} "living>Eukaryota>Opisthokonta>Holozoa>Metazoa>Arthropoda>Crustacea>Maxillopoda>Copepoda>Calanoida>Euchaetidae" 45074 23 1479
"cruise2012" 4.1535e+08 47 -4.3383 2.009e+07 16101 20 100.1 {'Calanoida'} "living>Eukaryota>Opisthokonta>Holozoa>Metazoa>Arthropoda>Crustacea>Maxillopoda>Copepoda>Calanoida>Metridinidae" 45074 23 1479
"cruise2012" 4.1535e+08 47 -4.3383 2.009e+07 16101 20 100.1 {'Calanoida'} "living>Eukaryota>Opisthokonta>Holozoa>Metazoa>Arthropoda>Crustacea>Maxillopoda>Copepoda>Cyclopoida>Oithonidae" 45074 23 1479
"cruise2012" 4.1535e+08 47 -4.3383 2.009e+07 16101 20 100.1 {'Calanoida'} "living>Eukaryota>Opisthokonta>Holozoa>Metazoa>Arthropoda>Crustacea>Maxillopoda>Copepoda>Calanoida>Acartiidae" 45074 23 1479
"cruise2012" 4.1535e+08 47 -4.3383 2.009e+07 16101 20 100.1 {'Calanoida'} "living>Eukaryota>Opisthokonta>Holozoa>Metazoa>Arthropoda>Crustacea>Maxillopoda>Copepoda>Calanoida>Temoridae" 45074 23 1479
Matt J
on 30 May 2023
Do you mind explaining where you got the values for the opts.VariableTypes([1,10])={'string'};
We want the first and tenth column to be strings because those are the ones we are giving to contains().
Sophia
on 30 May 2023
Yes i just realized, very long day completely forget that, thanks; it seems it wont stop running maybe because that is because the real file is too large (around 700 000 rows), would this significantly slow down matlab and maybemake it an issue to run? The data i am trying to extract is roughly 100 000 rows long. Thanks a lot for your dedication to help me with this problem :)
Matt J
on 30 May 2023
Edited: Matt J
on 30 May 2023
Large files do take longer to read. I also don't know what code changes you've made or how fast your hard drive is.
Possibly you could accelerate things by reading just the object_id column and use that to decide on a continguous subset of rows that contains the data you want. Then you can read in just the subset.
Walter Roberson
on 31 May 2023
sometimes you can get better performance with code along the lines of
mask1 = find(contains(C.object_id=="cruise2012"));
mask2 = contains(C.object_annotation_hierachy(mask1),"Copepoda");
matching_rows = mask1(mask2);
Sophia
on 31 May 2023
Thanks a lot Matt, it ended up working when i left it overnight! Thanks Walter, i could certaily need someone to improve efficiecny, however i get:
Error using contains
Not enough input arguments.
Error in test5 (line 12)
mask1 = find(contains(C.object_id=="cruise2012"));
When i run:
filename = "small_data.xlsx";
opts=detectImportOptions(filename, 'ExpectedNumVariables',13);
opts.VariableTypes([1,10])={'string'};
C=readtable(filename,opts);
mask1 = find(contains(C.object_id=="cruise2012"));
mask2 = contains(C.object_annotation_hierarchy(mask1),"Copepoda");
matching_rows = mask1(mask2);
C_new=C(matching_rows,:);
Any ideas why?
More Answers (0)
See Also
Categories
Find more on Spreadsheets in Help Center and File Exchange
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!An Error Occurred
Unable to complete the action because of changes made to the page. Reload the page to see its updated state.
Select a Web Site
Choose a web site to get translated content where available and see local events and offers. Based on your location, we recommend that you select: .
You can also select a web site from the following list
How to Get Best Site Performance
Select the China site (in Chinese or English) for best site performance. Other MathWorks country sites are not optimized for visits from your location.
Americas
- América Latina (Español)
- Canada (English)
- United States (English)
Europe
- Belgium (English)
- Denmark (English)
- Deutschland (Deutsch)
- España (Español)
- Finland (English)
- France (Français)
- Ireland (English)
- Italia (Italiano)
- Luxembourg (English)
- Netherlands (English)
- Norway (English)
- Österreich (Deutsch)
- Portugal (English)
- Sweden (English)
- Switzerland
- United Kingdom (English)
Asia Pacific
- Australia (English)
- India (English)
- New Zealand (English)
- 中国
- 日本Japanese (日本語)
- 한국Korean (한국어)