Regex Match Characters When Not Preceded By A String
Solution 1:
This is the closest regex I could get (the trailing space is the one we match):
(?<=(?<!(No|\.\w))[\.\?\!])(?! *\d+ *)
which will also split after Sgt., for the simple reason that a lookbehind assertion has to be fixed width in Python (what a limitation!).
This is how I would do it in vim, which has no such limitation (the trailing space is the one we match):
\(\(No\|Sgt\|\.\w\)\@<![?.!]\)\( *\d\+ *\)\@!\zs
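A common way around the fixed-width restriction in Python is to chain several lookbehinds, one per abbreviation, each with its own fixed width. The sketch below is not the pattern from this answer but an adaptation of it under that assumption; the names `variable_width_ok`, `splitter` and `text` are made up for illustration.

```python
import re

text = ("I am from New York, N.Y. and I would like to say hello! "
        "How are you today? I am well. I owe you $6. 00 because you "
        "bought me a No. 3 burger. -Sgt. Smith")

# Alternatives of different widths inside ONE lookbehind are rejected by `re`.
try:
    re.compile(r'(?<!No|Sgt)[.?!]')
    variable_width_ok = True
except re.error:
    variable_width_ok = False

# Workaround: one fixed-width negative lookbehind per case, all checked
# from the position of the space that follows the punctuation mark.
splitter = re.compile(
    r'(?<!No[.?!])'      # punctuation preceded by "No"
    r'(?<!Sgt[.?!])'     # punctuation preceded by "Sgt"
    r'(?<!\.\w[.?!])'    # punctuation preceded by ".X", as in "N.Y."
    r'(?<=[.?!])'        # there must be punctuation right before the space
    r'(?! *\d)'          # do not split when digits follow, as in "$6. 00"
    r' +'                # the space(s) actually matched
)

for sentence in splitter.split(text):
    print(sentence)
```

Every abbreviation still has to be listed by hand, so this only helps when the set of edge cases is known in advance.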
For the OP as well as the casual reader, this question and the answers to it are about lookarounds and are very interesting.
Solution 2:
You may consider a matching approach instead of splitting. It gives you better control over the entities you want to treat as single words rather than as sentence-break signals.
Use a pattern like
\s*((?:\d+\.\s*\d+|(?:No|M[rs]|[JD]r|S(?:r|gt))\.|\.(?!\s+-?[A-Z0-9])|[^.!?])+(?:[.?!]|$))
See the regex demo
It is very similar to what I posted here, but it contains a pattern to match poorly formatted float numbers, adds support for the No. and Sgt. abbreviations, and handles strings that do not end with final sentence punctuation better.
import re
p = re.compile(r'\s*((?:\d+\.\s*\d+|(?:No|M[rs]|[JD]r|S(?:r|gt))\.|\.(?!\s+-?[A-Z0-9])|[^.!?])+(?:[.?!]|$))')
s = "I am from New York, N.Y. and I would like to say hello! How are you today? I am well. I owe you $6. 00 because you bought me a No. 3 burger. -Sgt. Smith"
for m in p.findall(s):
    print(m)
Output:
I am from New York, N.Y. and I would like to say hello!
How are you today?
I am well.
I owe you $6. 00 because you bought me a No. 3 burger.
-Sgt. Smith
Pattern details
\s* - matches 0 or more whitespace characters (used to trim the results)
(?:\d+\.\s*\d+|(?:No|M[rs]|[JD]r|S(?:r|gt))\.|\.(?!\s+-?[A-Z0-9])|[^.!?])+ - one or more occurrences of several alternatives:
    \d+\.\s*\d+ - 1+ digits, a ., 0+ whitespaces, 1+ digits
    (?:No|M[rs]|[JD]r|S(?:r|gt))\. - abbreviated strings like No., Mr., Ms., Jr., Dr., Sr., Sgt.
    \.(?!\s+-?[A-Z0-9]) - a dot not followed by 1 or more whitespaces, then an optional -, then an uppercase letter or a digit
    | - or
    [^.!?] - any character but ., ! and ?
(?:[.?!]|$) - a ., ! or ?, or the end of the string.
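The same breakdown can be kept next to the pattern itself by compiling it with `re.VERBOSE`, which ignores whitespace and allows `#` comments inside the regex. This is only a sketch of the pattern above in a more readable layout; the names `SENTENCE`, `COMPACT` and `sample` are introduced here for illustration.

```python
import re

# The one-line pattern from the answer, unchanged.
COMPACT = re.compile(
    r'\s*((?:\d+\.\s*\d+|(?:No|M[rs]|[JD]r|S(?:r|gt))\.'
    r'|\.(?!\s+-?[A-Z0-9])|[^.!?])+(?:[.?!]|$))'
)

# The same pattern, laid out with comments.
SENTENCE = re.compile(r"""
    \s*                                   # leading whitespace, outside the group
    (
      (?:
          \d+\.\s*\d+                     # badly formatted float, e.g. 6. 00
        | (?:No|M[rs]|[JD]r|S(?:r|gt))\.  # known abbreviations
        | \.(?!\s+-?[A-Z0-9])             # dot that does not end a sentence
        | [^.!?]                          # any other non-terminator character
      )+
      (?:[.?!]|$)                         # terminator or end of string
    )
""", re.VERBOSE)

sample = ("I am from New York, N.Y. and I would like to say hello! "
          "How are you today? I am well. I owe you $6. 00 because you "
          "bought me a No. 3 burger. -Sgt. Smith")

for sentence in SENTENCE.findall(sample):
    print(sentence)
```

New abbreviations would go into the second alternative; the order of the alternatives matters, because the abbreviation branch has to be tried before the generic dot branch.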
Solution 3:
As mentioned in my comment above, if you are not able to define a fixed set of edge cases, this might not be possible without false positives or false negatives. Again, without context you cannot distinguish between abbreviations like "-Sgt. Smith" and ends of sentences like "Sergeant is oftentimes abbreviated as Sgt. This makes it shorter.".
However, if you can define a fixed set of edge cases, it's probably easier and much more readable to do this in multiple steps.
1. Identify your edge cases
For example, you can distinguish "I'll have a No. 3" from "No. I am your father" by checking for a subsequent number. So you would identify that edge case with a regex like this: No\. \d (Again, context matters. Sentences like "Is 200 enough? No. 200 is not enough." will still give you a false positive.)
2. Mask your edge cases
For each edge case, replace the matched text with a placeholder string that is guaranteed not to occur in the original text. E.g. "No." => "======NUMBER======"
3. Run your algorithm
Now that you got rid of your unwanted punctuations, you can run a simpler regex like this to identify the true positives: [\.\!\?]\s
4. Unmask your edge cases
Turn "======NUMBER======" back into "No."
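The four steps can be sketched in Python as follows. This is a minimal illustration with a single edge case, not a complete solution: the function name `split_sentences` and the sample text are made up, the placeholder is assumed never to appear in the input, and the split uses a lookbehind variant of the step 3 regex so that the punctuation stays attached to each sentence.

```python
import re

MASK = "======NUMBER======"  # assumed never to appear in the input text


def split_sentences(text):
    # Steps 1 and 2: find "No." followed by a space and a digit, and mask it.
    masked = re.sub(r"No\.(?= \d)", MASK, text)
    # Step 3: split on whitespace that follows sentence-ending punctuation.
    parts = re.split(r"(?<=[.!?])\s", masked)
    # Step 4: restore the masked abbreviation in every part.
    return [part.replace(MASK, "No.") for part in parts]


sample = "I'll have a No. 3 burger. No. I am your father! Is that all?"
for sentence in split_sentences(sample):
    print(sentence)
```

Further edge cases would each need their own pattern and their own placeholder, which keeps the individual regexes short at the cost of a few extra passes over the text.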
Solution 4:
Doing it with only one regex will be tricky - as stated in comments, there are lots of edge cases.
Personally, I would do it in three steps:
- Replace the spaces that should stay with some special character (re.sub)
- Split the text (re.split)
- Replace the special character with a space
For example:
import re
zero_width_space = '\u200B'
s = 'I am from New York, N.Y. and I would like to say hello! How are you today? I am well. I owe you $6. 00 because you bought me a No. 3 burger. -Sgt. Smith'
s = re.sub(r'(?<=\.)\s+(?=[\da-z])|(?<=,)\s+|(?<=Sgt\.)\s+', zero_width_space, s)
s = re.split(r'(?<=[.?!])\s+', s)
from pprint import pprint
pprint([line.replace(zero_width_space, ' ') for line in s])
Prints:
['I am from New York, N.Y. and I would like to say hello!',
 'How are you today?',
 'I am well.',
 'I owe you $6. 00 because you bought me a No. 3 burger.',
 '-Sgt. Smith']