Put Anchors Around URLs That Are Not Already Inside a Tag or Anchor

Suppose you have an HTML file whose body text contains URLs that are not clickable. You want to make them clickable by placing each URL inside an anchor tag. But like any other HTML file, yours also has URLs inside the tags themselves, as attributes of anchors (links), images, and other tags. Those URLs should be left alone. You also want to skip URLs that are already the contents of an anchor.

  1. Select the files you want to search through in the File Selector.
  2. Make sure the file format configuration searches through the raw (unconverted) contents of HTML files. The predefined "None" configuration is one that does this.
  3. Start with a fresh action.
  4. Set the action type to "search and replace". Leave the search type as "regular expression".
  5. Select "split along delimiters" from the File Sectioning drop-down list.
  6. Set the "section search type" to "list of regular expressions".
  7. Add <a\b[^<>]*>.*?</a> as the first file sectioning regular expression. It matches any opening <a> tag together with its contents and its closing </a> tag.
  8. Add <[^<>]+> as the second regex. This regex matches any opening or closing HTML tag. This regex assumes all < characters in your HTML file that aren't part of tags are properly escaped as &lt;.
  9. Make sure "non-overlapping search" is turned on. The file sectioning should make a single pass through the file using both regular expressions.
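  10. In the search box of the main part of the action, enter the regular expression https?://\S+ which is a quick way of matching any web URL. Because \S+ matches everything up to the next whitespace character, punctuation that directly follows a URL, such as a sentence-ending period, becomes part of the match.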
  11. Enter <a href="\0">\0</a> in the Replacement box. This replaces each URL with itself wrapped inside an anchor using itself as the destination.
  12. Set the target and backup file options as you like them.
  13. Click the Preview button to run a test.
  14. If all looks well, click the Replace button to actually replace the URLs.
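
For readers who want to see the logic of the steps above outside PowerGREP, the following is a minimal Python sketch of the same idea. It is not how PowerGREP is implemented. The function name link_bare_urls is hypothetical, the two sectioning regexes are joined into one alternation, and Python's \g<0> stands in for the \0 of the Replacement box. No regex flags are set, so this sketch only recognizes lowercase <a> tags whose contents sit on one line.

```python
import re

# The two file sectioning regexes from steps 7 and 8, anchor regex first.
DELIMITERS = re.compile(r'<a\b[^<>]*>.*?</a>|<[^<>]+>')
# The main search regex from step 10.
URL = re.compile(r'https?://\S+')
# Python spelling of the replacement text from step 11.
REPLACEMENT = r'<a href="\g<0>">\g<0></a>'


def link_bare_urls(html: str) -> str:
    """Wrap URLs in anchors, but only in the text between delimiters."""
    out = []
    pos = 0
    for m in DELIMITERS.finditer(html):
        # Text before this delimiter is a section: search and replace in it.
        out.append(URL.sub(REPLACEMENT, html[pos:m.start()]))
        # The delimiter itself (a tag, or an anchor with contents) is copied as-is.
        out.append(m.group(0))
        pos = m.end()
    # Text after the last delimiter is the final section.
    out.append(URL.sub(REPLACEMENT, html[pos:]))
    return ''.join(out)


print(link_bare_urls('<p>Visit http://x.example</p>'))
# prints: <p><a href="http://x.example">http://x.example</a></p>
```

Note that the URL in the usage line is followed directly by </p>. Because the closing tag is a delimiter, the section ends before it, so \S+ cannot run on into the tag.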

When PowerGREP executes this action, it first uses the file sectioning regexes to match all anchor tags along with their contents, and all other HTML tags without their contents. Because the anchor tag regex comes first in the list, it takes precedence over the HTML tag regex: at a position where both regexes can match, only the first one does. With "non-overlapping search" turned on, searching for the list of regular expressions one and two (in that order) is exactly the same as searching for the single regex one|two. A list of several short regexes is easier to manage than one long regex with many alternatives, but there is no functional difference.
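
The precedence rule can be checked with a small Python sketch. It assumes that a non-overlapping search over a list of regexes behaves as described above: at each position the regexes are tried in list order, and the scan resumes after the match. The helper names scan_list and scan_alternation are hypothetical and only serve this illustration.

```python
import re

ANCHOR = r'<a\b[^<>]*>.*?</a>'   # anchor tag with its contents
TAG = r'<[^<>]+>'                # any single HTML tag


def scan_list(patterns, text):
    """Try each regex in list order at every position, without overlaps."""
    compiled = [re.compile(p) for p in patterns]
    pos, found = 0, []
    while pos < len(text):
        for rx in compiled:
            m = rx.match(text, pos)
            if m and m.end() > pos:   # guard against zero-length matches
                found.append(m.group(0))
                pos = m.end()         # continue after the match
                break
        else:
            pos += 1                  # no regex matched here; move on
    return found


def scan_alternation(patterns, text):
    """Search for the same regexes joined into a single one|two regex."""
    joined = '|'.join(f'(?:{p})' for p in patterns)
    return [m.group(0) for m in re.finditer(joined, text)]


sample = '<p><a href="u">x</a></p>'
print(scan_list([ANCHOR, TAG], sample))
# prints: ['<p>', '<a href="u">x</a>', '</p>']
print(scan_list([TAG, ANCHOR], sample))
# prints: ['<p>', '<a href="u">', '</a>', '</p>']
```

With the anchor regex first, the whole anchor is one match. With the order reversed, the generic tag regex wins at the opening <a> tag, and the anchor's contents are no longer covered by any match. In both orders, scan_alternation returns the same list as scan_list.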

Because "file sectioning" is set to "split along delimiters", PowerGREP treats the matches of the file sectioning regexes as delimiters that chop the file into pieces. The action's search-and-replace separately processes each bit of text between two delimiters (and before the first and after the last delimiter). In this case, the search-and-replace works on each bit of text between two HTML tags, between two anchor tags (with contents), or between an anchor tag and another HTML tag. Essentially, the search-and-replace skips over all anchor tags (with contents) and all HTML tags.
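
To make the sections visible, the short Python sketch below splits a sample string along the same delimiters and returns the pieces that the search-and-replace would see. The function name sections is hypothetical, and Python's re.split only approximates PowerGREP's sectioning.

```python
import re

# Both file sectioning regexes as one alternation, anchor regex first.
DELIMITERS = re.compile(r'<a\b[^<>]*>.*?</a>|<[^<>]+>')


def sections(html: str) -> list:
    """Return the text before, between, and after the delimiter matches."""
    # The pattern has no capturing groups, so split() drops the delimiters.
    return DELIMITERS.split(html)


print(sections('intro <b>bold http://x.example</b> <a href="u">http://y.example</a> end'))
# prints: ['intro ', 'bold http://x.example', ' ', ' end']
```

The URL inside the <b> element appears in a section and would be replaced. The URL inside the anchor is part of a delimiter, so it appears in no section.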

You can find this action in the PowerGREP5.pgl standard library as "Put anchors around URLs that are not already inside a tag".