How can I find and remove the rows starting with different word?
Hi All;
I have a 40000-row document. Words 1 to 485 start with the letter "a", words 486 to 1158 start with the letter "b", words between 1158 and 4000 start with the letter "c", and so on. There are some erroneous entries within these word groups. For example, among the words between 1 and 485, which should all start with the letter "a", there are words starting with another letter such as "b, c, d, v ...". How can I find and remove these rows?
Thank you for your help.
5 Comments
Elias Gule
on 29 Mar 2018
Guillaume
on 29 Mar 2018
What is a document in MATLAB? A 2D char array? A cell array of 1D char arrays? A string array? Something else?
Can you give a short example of the input using valid MATLAB syntax?
Ergün AKGÜN
on 29 Mar 2018
Edited: Ergün AKGÜN
on 29 Mar 2018
Guillaume
on 29 Mar 2018
So, the problem is
- extracting the first letter of each line
- finding out which ones are out of order
That's fairly straightforward but for one thing: you're obviously not using the English alphabet (Turkish?), and the order of your alphabet may not match MATLAB's idea of order. For example:
>> sort('aba güreşi değnek göstermek')
ans =
' aabdeeeeeggikkmnrrstöüğş'
is probably the wrong order.
If the alphabet for 1st letter is just US-ASCII [a-z] then it's easy.
Ergün AKGÜN
on 29 Mar 2018
Edited: Ergün AKGÜN
on 29 Mar 2018
Accepted Answer
More Answers (1)
Guillaume
on 29 Mar 2018
Your English is fine and I understood what you want to do.
If the first letter of every line belongs to the character set [a-z] (and it looks like you want to ignore case), then it's very easy to solve.
However, if we have to take into account accented letters such as ğ, then it's a lot more complicated because MATLAB has no concept of internationalisation. I have no idea where ğ is located in your alphabet, but it's not going to be where MATLAB thinks it is.
If we assume US-ASCII alphabet only, the intruders can be detected easily:
firstletter = lower(cellfun(@(s) s(1), Cnew)).'; %get first letter and convert to lower case
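ldiff = sign(diff(firstletter)); %sign of the change between consecutive first letters
outoforderrows = union(strfind(ldiff, [-1 1]), strfind(ldiff, [1 -1])) + 1 %rows where the first letter dips then rises, or rises then dips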
But with the Turkish alphabet, lower may not work correctly, for a start. In addition, since MATLAB may have the wrong idea about the order of letters, it may tell you that some lines are out of order when they are not.
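One possible way around both problems is to write the alphabet out by hand and compare positions in that list instead of character codes. The following is only a sketch under assumptions: the sample words, the two alphabet strings (29-letter Turkish alphabet) and the names lowerTr, upperTr, pos, intruders and cleaned are all made up for illustration; only Cnew comes from the code above. It reuses the same neighbour comparison, so it shares its limits: it targets single misplaced rows, not runs of them.

```matlab
% Hypothetical sample data; Cnew is the variable name used in the answer.
Cnew = {'aba'; 'Ağaç'; 'çay'; 'acı'; 'ada'; 'bal'; 'ışık'; 'ben'; 'bir'; 'can'};

% Alphabet written out explicitly (assumed Turkish order), so the ordering
% does not depend on character codes. Lower and upper case are listed
% separately so that lower() is not needed.
lowerTr = regexp('abcçdefgğhıijklmnoöprsştuüvyz', '.', 'match');
upperTr = regexp('ABCÇDEFGĞHIİJKLMNOÖPRSŞTUÜVYZ', '.', 'match');

% First character of each row, as a cell array of one-character strings.
first = regexp(Cnew, '^.', 'match', 'once');

% Position of each first character in the alphabet
% (0 if it appears in neither list).
[~, posLower] = ismember(first, lowerTr);
[~, posUpper] = ismember(first, upperTr);
pos = max(posLower(:), posUpper(:)).';

% A row is flagged when the sequence rises then falls (or falls then
% rises) around it, i.e. the two neighbouring differences have
% opposite signs.
d = sign(diff(pos));
intruders = find(d(1:end-1) .* d(2:end) == -1) + 1;

% Remove the flagged rows.
cleaned = Cnew;
cleaned(intruders) = [];
```

With this sample data, the rows 'çay' and 'ışık' are the ones sitting in the wrong group. Characters that are in neither alphabet list get position 0, so rows starting with digits or punctuation would need separate handling.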