How to Find Word Pairs

This example illustrates the use of lookaround in regular expressions. In the discussion below, the file being searched through contains the four words one two three four.

Matching two consecutive words with a regular expression is easy: \w+\s+\w+. But when you try this regex in a collect data action, PowerGREP will find only two pairs: one two and three four. The middle pair two three is missing. The reason is that when PowerGREP finds a search match, it continues searching at the end of the match. After matching one two, PowerGREP continues at the space after two.
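The same behavior can be reproduced outside PowerGREP. The following is a minimal sketch using Python's re module, which is an assumption made for illustration and not the engine PowerGREP uses; like PowerGREP, re.findall resumes each search at the end of the previous match.

```python
import re

# The four-word sample text used throughout this example.
text = "one two three four"

# Each match consumes both words, so the next attempt starts after "two".
pairs = re.findall(r"\w+\s+\w+", text)
print(pairs)  # ['one two', 'three four']
```

The pair two three never appears because the word two has already been consumed by the first match.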

The solution is to use lookahead for the second word. Lookahead applies its regex as usual, but the text it matches does not become part of the overall match, so the next search attempt starts right after the whitespace rather than after the second word. When you collect data with \w+\s+(?=\w+) PowerGREP will find all three pairs, but collect only one , two  and three , trailing spaces included.
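As a rough illustration, again assuming Python's re module rather than PowerGREP itself, the lookahead version finds three matches, each consisting of a word and the whitespace that follows it:

```python
import re

text = "one two three four"

# (?=\w+) checks that a word follows but leaves it unconsumed,
# so that word is still available as the start of the next match.
matches = re.findall(r"\w+\s+(?=\w+)", text)
print(matches)  # ['one ', 'two ', 'three ']
```

The word four is not matched on its own because no whitespace and second word follow it.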

To also collect the text matched by the lookahead, we need to use a capturing group. This does not change the nature of the lookahead. To make the output prettier, we’ll also capture the first word. That allows us to collect both words separated by just one space, rather than by whatever was matched by \s+.

When we search for (\w+)\s+(?=(\w+)) and collect \1 \2, the results will list all three word pairs: one two, two three and three four. You may need to select “replacement only” in the “display replacements” list on the Results panel to remove the regex match from the results and show the collected pairs only.
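The capturing-group approach can be sketched in Python as follows. The re module and the variable names are assumptions for illustration; Match.expand plays the role of the collected text \1 \2 here.

```python
import re

text = "one two three four"

# Group 1 is the first word; group 2 sits inside the lookahead,
# so the second word is captured without being consumed.
pattern = re.compile(r"(\w+)\s+(?=(\w+))")

pairs = [m.expand(r"\1 \2") for m in pattern.finditer(text)]
print(pairs)  # ['one two', 'two three', 'three four']

# Whatever \s+ matched is replaced by the single space in the template.
messy = "one  two\nthree"
messy_pairs = [m.expand(r"\1 \2") for m in pattern.finditer(messy)]
print(messy_pairs)  # ['one two', 'two three']
```

Building the output from the two groups, rather than from the whole match, is what normalizes the separator to a single space.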

You can take this example as far as you want. Search for (\w+)\s+(?=(\w+)\s+(\w+)) and collect \1 \2 \3 to gather word triplets. In the four-word example file, that yields one two three and two three four.
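The triplet variant works the same way. A short sketch, once more assuming Python's re module as a stand-in for PowerGREP:

```python
import re

text = "one two three four"

# The lookahead now has to see two more words, both captured.
pattern = re.compile(r"(\w+)\s+(?=(\w+)\s+(\w+))")

triplets = [m.expand(r"\1 \2 \3") for m in pattern.finditer(text)]
print(triplets)  # ['one two three', 'two three four']
```

Only the first word and its trailing whitespace are consumed per match, so consecutive triplets overlap by two words.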

These actions are available in the PowerGREP5.pgl library as “Find word pairs” and “Find word triplets”.