Monday, February 21, 2011

Why does my regular expression select everything?

Hey guys, I'm trying to select a specific block out of some text, but I'm no master of regular expressions. My attempt starts matching at the right place, but it also matches everything after the part I want.

My regex:

\nSCR((?s).*)(GI|SI)(.*?)\n

The text I'm matching against:

Hierbij een test

SCR
S09
/vince@test.be
05FEB
GI BRGDS OPS

middle text string (may not selected)

SCR
S09
05FEB
LHR
NPVT700 PVT701 30MAR30MAR 1000000 005CRJ FAB1900 07301NCE DD
/ RE.GBFLY/
GI BRGDS

The middle string gets selected too. Each match should only run from an SCR line to the next GI line.

From Stack Overflow
  • To match from a line starting with SCR to a line starting with GI or SI (inclusive), you would use the following regular expression:

    (?m:^SCR\n(?:^(?!GI|SI).*\n)*(?:GI|SI).*)
    

    This will:

    • Find the start of a line.
    • Match SCR and a new line.
    • Match all lines not starting with GI or SI.
    • Match the last line, requiring it to start with GI or SI (this prevents it from matching to the end of the string if there is no GI or SI).
    Blixt: I just changed my regex a bit, inspired by Gumbo. His regular expression accounts for the case where a group has no `GI` or `SI` line, in which case the regular expression shouldn't match. My regex and his second regex are now pretty similar, except that mine uses the start-of-line anchor `^` instead of matching a newline.
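
The steps listed in this answer can be tried out with a short script. This is only a sketch in Python, which is an assumption (the question never says which regex flavor it uses). `re.MULTILINE` stands in for the `(?m:...)` wrapper, and the names `TEXT`, `SCR_BLOCK` and `blocks` are made up for illustration.

```python
import re

# Sample input modelled on the text in the question (trailing newline assumed).
TEXT = (
    "Hierbij een test\n"
    "\n"
    "SCR\n"
    "S09\n"
    "/vince@test.be\n"
    "05FEB\n"
    "GI BRGDS OPS\n"
    "\n"
    "middle text string (may not selected)\n"
    "\n"
    "SCR\n"
    "S09\n"
    "05FEB\n"
    "LHR\n"
    "NPVT700 PVT701 30MAR30MAR 1000000 005CRJ FAB1900 07301NCE DD\n"
    "/ RE.GBFLY/\n"
    "GI BRGDS\n"
)

# Same pattern as in the answer; re.MULTILINE makes ^ match at every line start.
SCR_BLOCK = re.compile(r"^SCR\n(?:^(?!GI|SI).*\n)*(?:GI|SI).*", re.MULTILINE)

# No capturing groups, so findall returns the full text of each match.
blocks = SCR_BLOCK.findall(TEXT)

for block in blocks:
    print(block)
    print("-----")
```

On this sample the script should print the two SCR blocks and leave the middle text out. A block that never reaches a GI or SI line is not matched at all.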
  • Make the first quantifier non-greedy as well:

    \nSCR((?s).*?)(GI|SI)(.*?)\n
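
To see the difference the extra `?` makes, here is a rough comparison in Python (an assumption, since the question does not name its regex flavor). Recent Python versions reject an inline `(?s)` in the middle of a pattern, so this sketch uses the scoped form `(?s:...)`, which should behave the same way for this group. `TEXT`, `GREEDY` and `LAZY` are illustrative names.

```python
import re

# Sample input modelled on the text in the question (trailing newline assumed).
TEXT = (
    "Hierbij een test\n"
    "\n"
    "SCR\n"
    "S09\n"
    "/vince@test.be\n"
    "05FEB\n"
    "GI BRGDS OPS\n"
    "\n"
    "middle text string (may not selected)\n"
    "\n"
    "SCR\n"
    "S09\n"
    "05FEB\n"
    "LHR\n"
    "NPVT700 PVT701 30MAR30MAR 1000000 005CRJ FAB1900 07301NCE DD\n"
    "/ RE.GBFLY/\n"
    "GI BRGDS\n"
)

# Greedy: .* grabs as much as possible, then backtracks to the LAST GI/SI.
GREEDY = re.compile(r"\nSCR((?s:.*))(GI|SI)(.*?)\n")

# Non-greedy: .*? grows one character at a time and stops at the FIRST GI/SI.
LAZY = re.compile(r"\nSCR((?s:.*?))(GI|SI)(.*?)\n")

greedy_match = GREEDY.search(TEXT).group(0)
lazy_matches = [m.group(0) for m in LAZY.finditer(TEXT)]

print("greedy swallows the middle text:", "middle text" in greedy_match)
print("lazy match count:", len(lazy_matches))
```

One caveat: the non-greedy version stops at the first `GI` or `SI` anywhere in the text, even in the middle of a line, so it only works when those letters never appear earlier inside a block.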
    

    Or you could use a negative look-ahead assertion (?!expr) to capture just those lines that do not start with either GI or SI:

    \nSCR((?:\n(?!GI|SI).*)*)\n(?:GI|SI).*\n
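
A quick way to check the look-ahead variant is sketched below, again assuming Python (the pattern needs no changes there). `TEXT`, `LOOKAHEAD` and `bodies` are illustrative names. Group 1 holds the lines between the SCR line and the GI/SI line.

```python
import re

# Sample input modelled on the text in the question (trailing newline assumed).
TEXT = (
    "Hierbij een test\n"
    "\n"
    "SCR\n"
    "S09\n"
    "/vince@test.be\n"
    "05FEB\n"
    "GI BRGDS OPS\n"
    "\n"
    "middle text string (may not selected)\n"
    "\n"
    "SCR\n"
    "S09\n"
    "05FEB\n"
    "LHR\n"
    "NPVT700 PVT701 30MAR30MAR 1000000 005CRJ FAB1900 07301NCE DD\n"
    "/ RE.GBFLY/\n"
    "GI BRGDS\n"
)

# Each repetition consumes a newline plus one line that does not start
# with GI or SI; the pattern then requires a GI/SI line to finish.
LOOKAHEAD = re.compile(r"\nSCR((?:\n(?!GI|SI).*)*)\n(?:GI|SI).*\n")

# Group 1: the lines between SCR and the terminating GI/SI line.
bodies = [m.group(1) for m in LOOKAHEAD.finditer(TEXT)]

for body in bodies:
    print(repr(body))
```

Because the pattern begins and ends with a literal `\n`, it will not match an SCR block at the very start of the input, or one whose GI/SI line is the last line without a trailing newline.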
    
