Using readtable on a csv with missing columns

Question

Samuel on 3 Sep 2024

0
Link

Direct link to this question

https://au.mathworks.com/matlabcentral/answers/2149859-using-readtable-on-a-csv-with-missing-columns

Edited: dpb on 3 Sep 2024

Accepted Answer: dpb

Open in MATLAB Online

Hey all,

I'm writing matlab script to load a number of CSV files into matlab to process them.

The last column of these CSVs is a notes column, which is often left blank. The software we use to edit our CSVs sometimes doesn't even fill in the last comma, like so:

colA,colB,colC,Notes
1000,2000,3000
1100,2200,3300
1110,2220,3330
1111,2222,3333

My ideal output would be

    colA    colB    colC    Notes
    ____    ____    ____    ____
    1000    2000    3000    NaN 
    1100    2200    3300    NaN 
    1110    2220    3330    NaN 
    1111    2222    3333    NaN 

I'm currently using readtable to load this data in, and have tried a few different parameters. Unfortunately, they all seem to fall short.

tab = readtable("tt.csv", 'MissingRule', 'fill');
    Var1    Var2    Var3
    ____    ____    ____
    1000    2000    3000
    1100    2200    3300
    1110    2220    3330
    1111    2222    3333  
    
tab = readtable("tt.csv", 'ReadVariableNames', true)
    x1000    x2000    x3000
    _____    _____    _____
    1100     2200     3300 
    1110     2220     3330 
    1111     2222     3333 

Note how both examples are missing the variable names (they should be named with col instead of var), and they're both missing the 4th column.

Any ideas? I could just write a helper script that goes around adding extra commas, but that seems like a lot of extra work for what might just be a setting.

Thanks!

2 Comments
Show NoneHide None

Voss on 3 Sep 2024

Can you upload (using the paperclip button) at least two of these CSV files - at least one that has a Notes column and at least one that does not?

Samuel on 3 Sep 2024

@Voss uploaded 3 files for you:

tt.csv: the original, failing case
tt_no_notes.csv: the Notes column in the header has been entirely deleted
tt_added_commas.csv: added commas after each line, the table is now read as intended

Sign in to comment.

Sign in to answer this question.

Answer 1

dpb on 3 Sep 2024

1
Link

Direct link to this answer

https://au.mathworks.com/matlabcentral/answers/2149859-using-readtable-on-a-csv-with-missing-columns#answer_1510074

Edited: dpb on 3 Sep 2024

Open in MATLAB Online

f=["colA,colB,colC,colD"
"1000,2000,3000"
"1100,2200,3300"
"1110,2220,3330"
"1111,2222,3333"];
writelines(f,'tt.csv')
type tt.csv
colA,colB,colC,colD
1000,2000,3000
1100,2200,3300
1110,2220,3330
1111,2222,3333
opt=detectImportOptions('tt.csv','filetype','text','readvariablenames',1,'Range',1)
opt = 
  DelimitedTextImportOptions with properties:

   Format Properties:
                    Delimiter: {','}
                   Whitespace: '\b\t '
                   LineEnding: {'\n'  '\r'  '\r\n'}
                 CommentStyle: {}
    ConsecutiveDelimitersRule: 'split'
        LeadingDelimitersRule: 'keep'
       TrailingDelimitersRule: 'ignore'
                EmptyLineRule: 'skip'
                     Encoding: 'UTF-8'

   Replacement Properties:
                  MissingRule: 'fill'
              ImportErrorRule: 'fill'
             ExtraColumnsRule: 'addvars'

   Variable Import Properties: Set types by name using setvartype
                VariableNames: {'colA', 'colB', 'colC' ... and 1 more}
                VariableTypes: {'double', 'double', 'double' ... and 1 more}
        SelectedVariableNames: {'colA', 'colB', 'colC' ... and 1 more}
              VariableOptions: [1-by-4 matlab.io.VariableImportOptions] 
	Access VariableOptions sub-properties using setvaropts/getvaropts
           VariableNamingRule: 'modify'

   Location Properties:
                    DataLines: [2 Inf]
            VariableNamesLine: 1
               RowNamesColumn: 0
            VariableUnitsLine: 0
     VariableDescriptionsLine: 0 
	To display a preview of the table, use preview
t=readtable('tt.csv',opt)
t = 4x4 table
    colA    colB    colC       colD   
    ____    ____    ____    __________

    1000    2000    3000    {0x0 char}
    1100    2200    3300    {0x0 char}
    1110    2220    3330    {0x0 char}
    1111    2222    3333    {0x0 char}

NOTA BENE: If the notes column ever does contain a note, then reading will fail if try to read as numeric.

ERRATUM: I think there is a failure in that setting only 'ReadVariableNames',1 didn't read the first record as the variable names but skipped it and read the first data row instead. I would submit this as a bug/support request. Inferring import options from a general file of almost any description is a tough job; detectImportOptions continues to evolve; example cases that don't work as expected are always good fodder for enhancements.

That the file is malformed in missing the delimiter to always have the fourth column is an excuse, but it is a fairly common thing with csv files.

2 Comments
Show NoneHide None

Samuel on 3 Sep 2024

Open in MATLAB Online

I had never seen the Range argument used like that, so I gave it a shot:

t = readtable("tt.csv", 'Range', 1)
t =
  4×4 table
    colA    colB    colC    colD
    ____    ____    ____    ____
    1000    2000    3000    NaN 
    1100    2200    3300    NaN 
    1110    2220    3330    NaN 
    1111    2222    3333    NaN 

So, uh... looks like that works?

I looked over the documentation and I can't figure out how it's working, but it certainly seems to be outputting the right thing.

Thanks so much!

dpb on 3 Sep 2024

Edited: dpb on 3 Sep 2024

Open in MATLAB Online

The how/why is compare the import options structure with and without it...the comment about 'ReadVariableNames' is that without the range value for some reason the first line was skipped as it should have been for data, but it was not read as the variable names per the documented behavior, but the second line was as your example above. That's what I think is worthy of a bug report.

To work around that, I forced it to begin reading at the first row.

What I didn't try was

f=["colA,colB,colC,colD"
"1000,2000,3000"
"1100,2200,3300"
"1110,2220,3330"
"1111,2222,3333"];
writelines(f,'tt.csv')
%type tt.csv
%opt=detectImportOptions('tt.csv','filetype','text','readvariablenames',1,'Range',1)
t=readtable('tt.csv','readvariablenames',1,'Range',1)
t = 4x4 table
    colA    colB    colC    colD
    ____    ____    ____    ____

    1000    2000    3000    NaN 
    1100    2200    3300    NaN 
    1110    2220    3330    NaN 
    1111    2222    3333    NaN 

That also works at least on this example without the extra step of the import options object.

Sign in to comment.

Using readtable on a csv with missing columns

2 Comments
Show NoneHide None

Accepted Answer

2 Comments
Show NoneHide None

More Answers (0)

See Also

Categories

Tags

Products

Release

Community Treasure Hunt

Using readtable on a csv with missing columns

2 Comments Show NoneHide None

Accepted Answer

2 Comments Show NoneHide None

More Answers (0)

See Also

Categories

Tags

Products

Release

Community Treasure Hunt

2 Comments
Show NoneHide None

2 Comments
Show NoneHide None