Text Extraction and Retrieval
<P ID=1>
A LITTLE BLACK BIRD.
</P>
<P ID=2>
Story about a bird,
(1811)
</P>
<P ID=3>
Part 1.
</P>
As I am new to text extraction, I need help with: (1) counting the delimiters '</P>', (2) removing all punctuation, and (3) breaking the text into individual documents at each delimiter.
Accepted Answer
Akira Agata
on 25 Oct 2017
I just tried to make a script to do that. Here is the result (assuming the maximum ID is 10).
% Read your text file
fid = fopen('yourText.txt');
C = textscan(fid,'%s','TextType','string','Delimiter','\n','EndOfLine','\r\n');
C = C{1};
fclose(fid);

% 1. Count the delimiters '</P>'
idx = strfind(C,'</P>');
n = nnz(~cellfun(@isempty,idx));

% 2. Remove all punctuation
C2 = regexprep(C,'[.,!?:;]','');

% 3. Break the text into individual documents at each delimiter
idx2 = find(strcmp(C,'</P>'));
for kk = 1:10
    str = ['<P ID=',num2str(kk),'>'];
    idx_s = find(strcmp(C,str));
    if ~isempty(idx_s)
        idx_e = idx2(find(idx2 > idx_s,1));
        fileName = ['document',num2str(kk),'.txt'];
        fid = fopen(fileName,'w');
        fprintf(fid,'%s\r\n',C(idx_s:idx_e));   % write C2(idx_s:idx_e) instead to drop the punctuation
        fclose(fid);
    end
end
6 Comments
Akira Agata
on 30 Oct 2017
Edited: Akira Agata
on 30 Oct 2017
Thanks for your reply. I've just made a script to do items 1–3, as follows. I hope it helps.
Regarding your last question ("count the number of documents each word appears in"), I think you can do that by combining the following script with my previous one.
% Read your text file
fid = fopen('yourText.txt');
C = textscan(fid,'%s','TextType','string','Delimiter','\n','EndOfLine','\r\n');
C = C{1};
fclose(fid);
C = regexprep(C,'<[\w \=\/]+>',''); % Remove tags
C = regexprep(C,'[.,!?:;()]',''); % Remove punctuation and brackets
C = regexprep(C,'[0-9]+',''); % Remove numbers
C = lower(C); % Convert to lower case
% Extract all the words
words = regexp(C,'[a-z\-]+','match');
words = [words{:}];
% (1) Count total number of words
numOfWords = numel(words); % --> 9
% (2) Count the total number of distinct words
numOfDistWords = numel(unique(words)); % --> 7
% (3) Find the number of times each word is used in the original text
wordList = unique(words);
wordCount = arrayfun(@(x) nnz(strcmp(x,words)), wordList);
% Show the result
figure
bar(wordCount)
xticklabels(wordList)
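For the document-frequency part ("the number of documents each word appears in"), here is a minimal sketch of how the two scripts could be combined. It assumes the files document1.txt, document2.txt, ... written by the script in the accepted answer are in the current folder, and it reuses wordList from above; docFiles and docFreq are names introduced here just for illustration.
% (4) Count the number of documents each word appears in (document frequency)
docFiles = dir('document*.txt');          % files produced by the previous script
docFreq  = zeros(size(wordList));
for kk = 1:numel(docFiles)
    str = string(fileread(docFiles(kk).name));
    str = regexprep(str,'<[\w \=\/]+>|[.,!?:;()]|[0-9]+','');   % same cleanup as above
    wordsInDoc = unique(regexp(lower(str),'[a-z\-]+','match'));
    docFreq = docFreq + ismember(wordList,wordsInDoc);
end
% Show the result
table(wordList',docFreq','VariableNames',{'Word','NumDocuments'})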
More Answers (2)
Cedric
on 26 Oct 2017
Here is another approach based on pattern matching:
>> data = regexp(fileread('data.txt'), '(?<=<P[^>]+>\s*)[\w ]+', 'match' )
data =
1×3 cell array
{'A LITTLE BLACK BIRD'} {'Story about a bird'} {'Part 1'}
If you don't need the IDs (e.g. if they always just run from 1 to the number of P tags), you are done.
If you do need the IDs, you can get both the IDs and the content as follows:
>> data = regexp(fileread('data.txt'), '<P ID=(\d+)>\s*([\w ]+)', 'tokens' ) ;
data = vertcat( data{:} ) ;
ids = str2double( data(:,1) )
data = data(:,2)
ids =
1
2
3
data =
3×1 cell array
{'A LITTLE BLACK BIRD'}
{'Story about a bird' }
{'Part 1' }
Christopher Creutzig
on 2 Nov 2017
Edited: Christopher Creutzig
on 2 Nov 2017
It's probably easiest to split the text with string functions and then count the number of pieces the split creates:
str = extractFileText('file.txt');
paras = split(str,"</P>");
paras(end) = []; % the split left an empty last entry
paras = extractAfter(paras,">") % Drop the "<P ID=n>" from the beginning
Then, numel(paras) will give you the number of </P> delimiters.
If you do not have extractFileText, calling string(fileread('file.txt')) should work just fine, too.
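To make that concrete, only the first line of the snippet above changes; the rest stays the same (a sketch, assuming the same file.txt):
str = string(fileread('file.txt'));   % base MATLAB replacement for extractFileText
paras = split(str,"</P>");
paras(end) = [];                      % the split leaves an empty last entry
paras = extractAfter(paras,">")       % drop the "<P ID=n>" from the beginning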
In one of the comments, you indicated you also need to count the frequency of words in documents. That is what bagOfWords is for:
tdoc = tokenizedDocument(lower(paras));
bag = bagOfWords(tdoc)
bag =
  bagOfWords with 13 words and 3 documents:

    a    little    black    bird    .    …
    1      1         1        1     1
    1      0         0        1     0
    …
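Building on the bag above, the number of documents each word appears in can be read directly from its word-count matrix; a minimal sketch, assuming Counts and Vocabulary (the standard bagOfWords properties) of the bag variable created above:
% Document frequency: how many of the 3 documents contain each word
df = full(sum(bag.Counts > 0, 1));    % 1-by-NumWords vector
table(bag.Vocabulary', df', 'VariableNames',{'Word','NumDocuments'})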
2 Comments
shilpa patil
on 23 Sep 2019
Edited: shilpa patil
on 23 Sep 2019
How can the above code be rewritten to work on a document image instead of a text file?