Efficient identification of quoted substrings in a substring

I'm looking for help from the MATLAB string-parsing experts out there to help me come up with a computationally efficient way (perhaps using regular expressions) to identify the quoted parts of a string drawn from arbitrary sources of text (e.g. a journal article). The method needs to work regardless of whether the quoted substrings are enclosed in single or double quotes. Furthermore, the text may contain apostrophes either inside or outside of the quoted substrings.
For example, in this sentence:
Sally said "It's a wonderful life" when she heard Molly's sister proclaim "It's a great day".
I would like to identify "It's a wonderful life" and "It's a great day", while in this text:
The attributes of the <table> tag were 'width=80%' and 'align="center"'.
I would like to identify 'width=80%' and 'align="center"'. [Note: I purposely did not show the above example sentences as MATLAB code, but rather as free text, so as not to confuse my question with how to properly capture such sentences in a MATLAB variable.]
I recognize these examples are a bit pedantic, but since the code won't be able to control the source of the text it is searching, it needs to be robust across these cases.
I have been able to do this with a "brute force" linear search through the text, but it's pretty inefficient and complex. I am not enough of a regexp expert to figure out a way to do this with regular expressions, but I've seen such experts come up with pretty elegant and efficient solutions to such problems. Hence, I was hoping my case might be tantalizing to one of the experts in this community. Thanks for any suggestions.
  3 Comments
Peter on 13 Feb 2021
Thanks John. Yes, I do recognize that in the general case things can be even more complex and even ambiguous. I figured I'd start with the 90-10 (99-1?) solution, and then see how I can, or even if I will, deal with the remaining edge cases. And time does matter to me as I'll be processing O(gb) of data routinely, and profiling tells me this search functionality is taking up the majority of the time of my overall processing. Thanks for the comments.
Peter on 14 Feb 2021
Turns out the extractBetween() function is roughly 2X faster than my homegrown version, so replacing that element of my code will help with efficiency. It is not the be-all, end-all to my use case however, as I have to deal with cases like this:
It's sentences like this, with "quoted parts" here and "some more there", which require inferring context and meaning in the sentence to understand the user's intent in using a single quote as a quote or an apostrophe, that are particularly difficult.
So, I'll plug away. I'm going to look into HTML parsers as they presumably need to deal with this problem somehow given that both single and double quotes are allowed there.
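For the mixed single/double-quote case, one possible heuristic (a rough sketch, not a general solution) is a single regexp that treats a quote character as an opening quote only when it is not preceded by a word character, and as the matching closing quote only when it is not followed by one. That assumption skips apostrophes in contractions and possessives such as "It's" and "Molly's", but it will misfire on cases like plural possessives ("the dogs' bowls"). The variable names `txt`, `pat`, `tok`, and `quoted` are just illustrative.

```matlab
% Sample text as a char vector ('' is an escaped single quote).
txt = 'Sally said "It''s a wonderful life" when she heard Molly''s sister proclaim "It''s a great day".';

% (?<!\w)   opening quote must not follow a word character
% (["'])    token 1: the quote character, single or double
% (.+?)     token 2: the quoted text, matched lazily
% \1        closing quote must be the same character as token 1
% (?!\w)    closing quote must not be followed by a word character
pat = '(?<!\w)(["''])(.+?)\1(?!\w)';

% Each element of tok is a 1-by-2 cell: {quote character, quoted text}.
tok = regexp(txt, pat, 'tokens');

% Keep only the quoted text from each match.
quoted = cellfun(@(t) t{2}, tok, 'UniformOutput', false);
```

The same pattern applied to the <table> example text should pick out the two single-quoted attributes, since the double quotes around center sit inside an already-open single-quoted span. Whether one regexp pass is faster than repeated extractBetween calls on gigabytes of text is something only profiling can answer.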


Accepted Answer

Cris LaPierre on 13 Feb 2021
Have you tried extractBetween?
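As a minimal sketch of how extractBetween might be applied to the two example texts from the question (the variable names are illustrative, and the choice of which quote character to pass is made by hand here):

```matlab
% Example texts from the question, stored as string scalars.
% Inside a double-quoted string literal, "" is an escaped double quote.
txt1 = "Sally said ""It's a wonderful life"" when she heard Molly's sister proclaim ""It's a great day"".";
txt2 = "The attributes of the <table> tag were 'width=80%' and 'align=""center""'.";

% Extract the text between successive pairs of double quotes.
dq = extractBetween(txt1, '"', '"');

% Extract the text between successive pairs of single quotes.
% The pattern '''' is a char vector holding one single-quote character.
sq = extractBetween(txt2, '''', '''');

disp(dq)
disp(sq)
```

This works when you know which quote character delimits the quotations. It does not distinguish a single quote used as a quotation mark from an apostrophe, so calling it with a single-quote delimiter on txt1 would pair the apostrophes in "It's" and "Molly's" and return unwanted text.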
  1 Comment
Peter on 13 Feb 2021
Wow! I never knew that function existed. I'm not sure how I overlooked it, as I was already taking advantage of many of the other string functions (contains, startsWith, endsWith, etc.). Much of my "brute force" linear method was likely replicating what extractBetween does. Thank you for pointing it out. That will definitely simplify my code, and I'll do some testing to see if it makes it more computationally efficient too.

