regexrep for the neophyte

How to write expression to find and convert financial strings written with comma separator to be able to parse numerically?
Example stirng is something like:
recipients={'John Doe $200, Freddy Flint $132.40 SP19; Mary Lamb $1,423.00-SP19, Joe Blow $1,200'};
which is a list of student scholarship awards entered, unfortunately, free-style in a remarks field in an Excel spreadsheet. I need to parse by student and extract each.
I managed to find/return the location/token of the amounts containing the punctuation, but I'm illiterate with regexp and haven't figured out how to locate the comma then within that substring to remove it.
K>> regexp(recipients,'([$]\d+[,]\d+)','match')
ans =
1×1 cell array
{1×2 cell}
K>> ans{:}
ans =
1×2 cell array
{'$1,423'} {'$1,200'}
K>>
As can be seen, the original text also has "issues" in that the separator between students isn't consistent--it may be a comma or semicolon, not sure what else I'll find as I progress.
For context, this is the next step past the previous Q? of piecing back together disparate databases/spreadsheets...now that I can compare the award to the billing by account, I can find coding or other errors--but need to be able compare the details.

5 Comments

regexprep(receipts, ',', '') perhaps?
dpb
dpb on 20 Apr 2019
Edited: dpb on 20 Apr 2019
That would let me parse the values themselves, Walter, but doesn't solve the total problem of being able to separate the recipients by records because they used commas, semicolons, what-else-I'm-yet-to-find as separators in the comment record...
Right now I'm just iterating through the returned list and doing a lookup from these starting locations to find the embedded commas and removing them.
I figured there was bound to be a way to find the comma in the currency field and remove it, but my facility in regular expressions just isn't up to the task...and I've not found any examples similar-enough alike to follow...
Maybe try making this into a Cody problem, too. :-)
dpb
dpb on 21 Apr 2019
Edited: dpb on 24 Apr 2019
Not sure what that is, Cyclist??? I've not poked around any of the rest of the site much at all...
ADDENDUM: Oh. I'd forgotten about it...do know what it is, never played. I think I answered one or two easy problems "way back when" ... I don't have the needed 200 points accumulated to post, though...
dpb
dpb on 24 Apr 2019
Edited: dpb on 24 Apr 2019
OK, if anybody is still listening/watching -- I've found another general pattern that occurs often enough it would be big help to handle --
{'Mare Hartman (Pres. Tuit/Bks) $1,374 AY18/19; Mark Lester $1,376 SP19'}
The pattern adds a parenthetical comment and the semester time code is academic year of YR1FA/YR2SP. Present gets confused by the extra words/strings:
>> mtch=regexp(char(tAll.Recipient(lib)),rexp,'names')
mtch =
1×2 struct array with fields:
name
sum
time
>> struct2table(mtch)
ans =
2×3 table
name sum time
________________ _______ ______
'AY' '18' ''
'Mark Lester' '1,376' 'SP19'
>>
It would be ideal if could return an additional optional 'notes' variable containing the text in the parentheses and somehow manage to not split the semester string and interpret it as the $$ amount...
I've made a few feeble attempts but only managed to break what did work and not get further... :(

Sign in to comment.

 Accepted Answer

TADA
TADA on 21 Apr 2019
Edited: TADA on 21 Apr 2019
match = regexp(recipients,'(?<name>[a-zA-Z]+(?:\s*[a-zA-Z]+)*)\s?\$?(?<sum>\d+(?:[\.,]\d+)*)[^\w,;]*(?<time>[a-zA-Z]+\d+)?','names');
Will return a struct array with like that:
ans = 1x4 struct array with fields:
name % recipient name
sum % scholarship amount
time % time string
Time is not mandatory so it can be an empty char vector
sum is convertible to numbers using str2double even though it contains the commas
If you want clarifications for the regex pattern I'll gladly add them

12 Comments

Stephen23
Stephen23 on 21 Apr 2019
Edited: Stephen23 on 21 Apr 2019
Note that this will also match "numbers" such as 1.2.3.4.5.6.7 or 123456,7,8,9, because it does not require any particular number of digits in each digit-group, and the decimal radix/comma can repeat any number of times. This might be an advantage or a disadvantage, depending on your source data.
Yes, that can be fixed if necessary.
If in some Chilean University a student receives 1,000,000 pesos that regex would still work though
It depends on how much vallidation is required
dpb
dpb on 21 Apr 2019
Since is a US community college, no awards will exceed the four digit mark! :) So far, the numeric data has shown to be entered well as far as formatting; it just doesn't always match the other worksheet as to which student was awarded or how much; sometimes it appears there's some leftover data from previous billing cycle still...that's what I'm trying to find for QC checks before the actual check is written for an overall billing total.
Thanks, I'll give it a go and see...having the split that way could be a really beneficial side benefit of getting the students grouped to be able to compare to the other worksheet by name as well--that was next step in parsing you may have saved me quite a lot of time!
I see I didn't put all the nuances of the names in the example; there are hyphenated names as well and I'm not sure what other punctuation may be there but I think I see how that would get fixed. I'll just try and go forward for now given short time span but definitely come back to it and dig into the "how"...
dpb
dpb on 21 Apr 2019
Edited: dpb on 21 Apr 2019
That works a treat! :) I made the one fixup to handle hyphenated names of:
rexp ='(?<name>[a-zA-Z\-]+(?:\s*[a-zA-Z\-]+)*)\s?\$?(?<sum>\d+(?:[\.,]\d+)*)[^\w,;]*(?<time>[a-zA-Z\-]+\d+)?';
by putting the "\-" into the alphanumeric matching expressions and it got everything out of the first real test case of a list of 21 in a string that's a remark of 492(!!) characters in length...obviously why needed a tool to be able to do sanity checks other than trying to read by hand in a spreadsheet...
On to the next step of now trying to match the two lists of names--since they're retyped into the remarks instead of copied/pasted, who knows how many typos or variations introduced there will be???
However, if I can just get them listed side-by-side in a viewable manner, that will be a win.
dpb
dpb on 21 Apr 2019
Edited: dpb on 21 Apr 2019
Well, not to be picky or anything... :)
Additional glitches so far:
  1. A dollar amount wasn't entered in the remarks field so end up with empty struct. I can do a test for empty struct, of course, so not a critical need; just convenience. Sometimes that's a student not been awarded yet, there are some cases that just didn't make the comment entry, only the data one(*).
  2. Malformed award semester field as "SP 19" is interpreted as sum=$19 and time='SP'. This one is a problem by introducing a bum $$ amount. I can edit the current spreadsheet to make it go away, but I can't reliably control what the secretatry does again (altho I will instruct "don't DO that!!!"
(*) To make this even worser, the actual dollar values are kept as a running sum of the individual amounts in the freeform Recipient/Remarks field so there's no guarantee the two match. I added a column in the spreadsheet that is the formula for that summation and will do a sanity check on the sum from the values retrieved this way to that as part of the QC code.
And, yes, this is no way to run a railroad! :( All these spreadsheets have just proliferated over last 20-30 years--once upon a time the foundation was small enough it was doable; now it's grossly overgrown the tools...solving that is on the to-do list but ain't gonna' happen for the immediate need.
Sounds like a serious headache U_U
1. adding a question mark right after the sum named group like that
(?<sum>blah balh blah)?
should solve that (like with the time named group), however, it might cause the whole expression to detect false negative time strings...
2. If the time string is different and there is no $ amount it's going to be more complicated... Im going to have to think a bit about it
If you can throw in a few more of these problematic strings it might help
dpb
dpb on 21 Apr 2019
Thanks for the offer...let me think how can do something useful that doesn't put out too much private data...I hate to just post the real data having actual names in it but it's tough to make up fully realistic sample cases. Yours does better than you might think altho I did go back and clean up some of the data entries and added one fixup programmatically that if the $ amount was missing and was only a single entry, added that entry into the string before parsing it.
I've actually gotten a ways farther to the point of one of the problematic funds with 20-some awardees on one sheet and 18 on the other and three separate totals! :) I was now trying to match up the student names from the official institution billing record with that the secretary used in her worksheet...after turning around last,first to first last and doing ismember, amazingly enough 15 were actually matched, three more are things like "Marty" in place of "Martin" or a 30-character double but unhyphenated last name shortened to just the first-last.
You wouldn't happen to know of any ML routines for fuzzy name matching by any chance? For starters I can list the ones that match exactly and the other list of those that don't and at least have it separated out to look out.
This alone is a major step forward with your parser for the remarks field -- I've not tried it on some of the more esoteric as yet ... but it's so much better to just have the list separated than not!
dpb
dpb on 21 Apr 2019
OK, attached is a subset of the database -- the .Recipient field is where the dirty deeds are done. When I read it, I find the ' - ' string after the preamble stuff if it exists--that's another bastardization of the original spreadsheet that has two other columns of donor and institutional guidelines for each fund but because there's so much real estate consumed, the secretary put some of the guideline info at the beginning so it was there always visible in that column...and, as you can see, there's been no real discipline in coding individual entries for consistency...
That's the problem usually with unstructured input systems.... Text input is so hard to analyze
I will try to have a look at the data you posted when I can
Unfortunately I know next to nothing about fuzzy logic, so I'd suggest opening a new question about that to attract the attention of Matlab answers "big guns"
dpb
dpb on 22 Apr 2019
Anything further would be gravy... :)
I've gotten a first pass at matching names for the first large block of awards to a given fund that doesn't match the billing sheet totals -- and, by just a couple of machinations all the names match excepting two...one that the first/last names were reversed --an Owen Martin became Marty Owen and a second that is just misspelled...both of which would be where some general matching logic could probably have discovered. But, being able to spit them out as not matching is also a win outside having to search the comment field by hand/eye.
For it, there are the same number of entries in the formula, and I can match all but one with the billing but the numbers for the individual recipients don't match -- again, having that in a table by Fund/Student is a win even if it isn't perfect parsing...
I may be able to get something working well enough tomorrow to actually be able believe can get the problems uncovered and solved...before getting this far I wasn't so sure it was possible in less than "months" time frame which would be past the time the books have to be closed and errors would become permanent for evermore at that point, at least without a tremendous amount of additional forensic accounting.
Cheers then! =)
dpb
dpb on 22 Apr 2019
Not that I wouldn't find use for any further improvements... :)
Probably it's quicker despite the pain to just edit the remarks to make them more nearly regular and then reparse.
My plan of attack is to find the donor funds that show a mismatch and work through them first, then expand to more thoroughly vet the remainder that could have totals that agree but just by accidental cancellation of errors and keep expanding the thoroughness until either get completed or simply do run out of time...

Sign in to comment.

More Answers (0)

Asked:

dpb
on 20 Apr 2019

Edited:

dpb
on 24 Apr 2019

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!