Regex Match Characters When Not Preceded By A String

I am trying to match spaces just after punctuation marks so that I can split up a large corpus of text, but I am seeing some common edge cases with places, titles and common abbrev

Solution 1:

This is the closest regex I could get (the trailing space is the one we match):

(?<=(?<!(No|\.\w))[\.\?\!])(?! *\d+ *) 

It will still split after Sgt., for the simple reason that a lookbehind assertion has to be fixed width in Python's re module: No and \.\w are both two characters wide, so the three-letter Sgt cannot be added to the same alternation.

This is how I would do it in vim, which has no such limitation (the trailing space is the one we match):

\(\(No\|Sgt\|\.\w\)\@<![?.!]\)\( *\d\+ *\)\@!\zs 
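Back in Python, the fixed-width restriction applies to each lookbehind on its own, so one possible workaround is to chain several lookbehinds of different widths instead of putting them in one alternation. The following is only a sketch under that assumption; the name SPLIT_AFTER_SENTENCE is made up for illustration, and the sample text is the one used elsewhere on this page.

```python
import re

# Each negative lookbehind has its own fixed width, so they can be
# stacked one after another at the same position.
SPLIT_AFTER_SENTENCE = re.compile(
    r'(?<=[.?!])'     # we are right after sentence punctuation
    r'(?<!No\.)'      # ...but the punctuation is not the dot of "No."
    r'(?<!Sgt\.)'     # ...nor the dot of "Sgt."
    r'(?<!\.\w\.)'    # ...nor the last dot of something like "N.Y."
    r'(?! *\d+ *)'    # and no number follows (e.g. "$6. 00")
    r' '              # the space we actually split on
)

s = ("I am from New York, N.Y. and I would like to say hello! "
     "How are you today? I am well. I owe you $6. 00 because you "
     "bought me a No. 3 burger. -Sgt. Smith")

sentences = SPLIT_AFTER_SENTENCE.split(s)
for sentence in sentences:
    print(sentence)
```

Like the original pattern, this never splits after "N.Y.", even when it really does end a sentence, so it shares the same false negatives.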

For the OP as well as the casual reader, this question and the answers to it are about lookarounds and are very interesting.

Solution 2:

You may consider a matching approach instead: it gives you better control over the entities you want to treat as single words rather than as sentence-break signals.

Use a pattern like

\s*((?:\d+\.\s*\d+|(?:No|M[rs]|[JD]r|S(?:r|gt))\.|\.(?!\s+-?[A-Z0-9])|[^.!?])+(?:[.?!]|$))

See the regex demo

It is very similar to what I posted here, but it contains a pattern to match poorly formatted float numbers, added No. and Sgt. abbreviation support and a better handling of strings not ending with final sentence punctuation.

Python demo:

import re
p = re.compile(r'\s*((?:\d+\.\s*\d+|(?:No|M[rs]|[JD]r|S(?:r|gt))\.|\.(?!\s+-?[A-Z0-9])|[^.!?])+(?:[.?!]|$))')
s = "I am from New York, N.Y. and I would like to say hello! How are you today? I am well. I owe you $6. 00 because you bought me a No. 3 burger. -Sgt. Smith"
for m in p.findall(s):
    print(m)

Output:

I am from New York, N.Y. and I would like to say hello!
How are you today?
I am well.
I owe you $6. 00 because you bought me a No. 3 burger.
-Sgt. Smith

Pattern details

  • \s* - matches 0 or more whitespace characters (used to trim the results)
  • (?:\d+\.\s*\d+|(?:No|M[rs]|[JD]r|S(?:r|gt))\.|\.(?!\s+-?[A-Z0-9])|[^.!?])+ - one or more occurrences of several alternatives:
    • \d+\.\s*\d+ - 1+ digits, ., 0+ whitespace characters, 1+ digits
    • (?:No|M[rs]|[JD]r|S(?:r|gt))\. - abbreviated strings like No., Mr., Ms., Jr., Dr., Sr., Sgt.
    • \.(?!\s+-?[A-Z0-9]) - a dot that is not followed by 1+ whitespace characters, an optional -, and then an uppercase letter or a digit
    • | - or
    • [^.!?] - any character other than ., ! or ?
  • (?:[.?!]|$) - a ., ! or ?, or the end of the string.
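To make the breakdown above easier to follow and to extend, the same pattern can be written with re.VERBOSE so that each alternative sits on its own commented line. This is just a sketch of an equivalent layout; the name SENTENCE is chosen here for illustration.

```python
import re

# Same pattern as above, spread out with re.VERBOSE so each part is labelled.
SENTENCE = re.compile(r'''
    \s*                                   # leading whitespace, kept outside the group
    (
      (?:
          \d+\.\s*\d+                     # poorly formatted float such as 6. 00
        | (?:No|M[rs]|[JD]r|S(?:r|gt))\.  # known abbreviations: No., Mr., Sgt., ...
        | \.(?!\s+-?[A-Z0-9])             # a dot that does not look like a sentence end
        | [^.!?]                          # any other character
      )+
      (?:[.?!]|$)                         # closing punctuation or end of string
    )
''', re.VERBOSE)

for sentence in SENTENCE.findall("Dr. Who paid $6. 00. He left."):
    print(sentence)
```

Adding another abbreviation then only means touching the abbreviations line.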

Solution 3:

As mentioned in my comment above, if you are not able to define a fixed set of edge cases, this might not be possible without false positives or false negatives. Again, without context you are not able to distinguish between abbreviations like "-Sgt. Smith" and ends of sentences like "Sergeant is oftentimes abbreviated as Sgt. This makes it shorter.".

However, if you can define a fixed set of edge cases, it's probably easier and much more readable to do this in multiple steps.


1. Identify your edge cases

For example, you can distinguish "I'll have a No. 3" from "No. I am your father" by checking for a subsequent number. So you would identify that edge case with a regex like this: No\. \d (the dot is escaped so that it matches only a literal period). Again, context matters: a passage like "Is 200 enough? No. 200 is not enough." will still give you a false positive.

2. Mask your edge cases

For each edge case, replace the matched text with a placeholder string that is certain not to occur in the original text. E.g. "No." => "======NUMBER======"

3. Run your algorithm

Now that the unwanted punctuation is masked, you can run a simpler regex like this to identify the true positives: [\.\!\?]\s

4. Unmask your edge cases

Turn "======NUMBER======" back into "No."
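The four steps can be sketched in Python roughly as follows. This is only an illustration that handles the single "No." edge case; the function name split_sentences and the constant MASK are made up for the example, and the split uses a lookbehind so that the punctuation stays attached to each sentence, which is a small variation on the regex in step 3.

```python
import re

# Placeholder assumed never to occur in the input text.
MASK = "======NUMBER======"

def split_sentences(text):
    # Steps 1 and 2: find "No." followed by a space and a digit, and mask it.
    masked = re.sub(r'No\.(?= \d)', MASK, text)
    # Step 3: split on whitespace that directly follows ., ! or ?
    parts = re.split(r'(?<=[.!?])\s', masked)
    # Step 4: turn the placeholder back into "No."
    return [part.replace(MASK, "No.") for part in parts]

for sentence in split_sentences("I'll have a No. 3. No. I am your father."):
    print(sentence)
```

More edge cases would each need their own placeholder, or a single placeholder that encodes which abbreviation it replaced.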

Solution 4:

Doing it with only one regex will be tricky: as stated in the comments, there are lots of edge cases.

I would do it in three steps:

  1. Replace spaces that should stay with some special character (re.sub)
  2. Split the text (re.split)
  3. Replace the special character with space

For example:

import re

zero_width_space = '\u200B'

s = 'I am from New York, N.Y. and I would like to say hello! How are you today? I am well. I owe you $6. 00 because you bought me a No. 3 burger. -Sgt. Smith'

s = re.sub(r'(?<=\.)\s+(?=[\da-z])|(?<=,)\s+|(?<=Sgt\.)\s+', zero_width_space, s)
s = re.split(r'(?<=[.?!])\s+', s)

from pprint import pprint
pprint([line.replace(zero_width_space, ' ') for line in s])

Prints:

['I am from New York, N.Y. and I would like to say hello!',
 'How are you today?',
 'I am well.',
 'I owe you $6. 00 because you bought me a No. 3 burger.',
 '-Sgt. Smith']
