Fix Invalid Characters in XML

Sometimes, XML files generated by poorly written software or by careless programmers will contain unescaped < and & characters in their text content. XML parsers reject such files. The characters must be replaced with the entities &lt; and &amp;. Using PowerGREP, we can easily fix this with a search-and-replace using two regular expressions.

  1. Select the files you want to search through in the File Selector.
  2. Start with a fresh action.
  3. Set the action type to “search and replace”. Set the search type to “list of regular expressions”.
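  4. In the Search box, enter the regular expression <(?!/?[_:a-z][-._:a-z0-9]*\b[^<>]*>) and make sure to leave “case sensitive search” off. This regex matches any < symbol that is not followed by what looks like a valid XML tag. The optional /? lets closing tags such as </item> through. I’m using [_:a-z][-._:a-z0-9]*\b to check for an XML tag name, and [^<>]* to skip over any attributes. This regex isn’t 100% exact: it also matches the < that opens comments, processing instructions such as <?xml ...?>, DOCTYPE declarations, and CDATA sections, and it skips a lone < that happens to be followed by text that looks like a tag. Check the preview for such cases. The example in the PowerGREP Library does include an exact regex.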
  5. In the Replacement box, type &lt;.
  6. Click the button with the green plus symbol to the left of the Search box to prepare for another search-and-replace pair.
  7. In the Search box, enter the regular expression &(?!(?:[a-z]+|#[0-9]+|#x[0-9a-f]+);). This regex matches any ampersand that is not followed by an entity name, a decimal character code, or a hexadecimal character code, terminated by a semicolon.
  8. In the Replacement box, type &amp;.
  9. Set the target and backup file options as you like them.
  10. Click the Preview button to run a test.
  11. If all looks well, click the Replace button to actually replace the characters.

This action will replace all invalid < and & characters with their respective entities. This action is a solution for XML files generated by a computer program that inserted arbitrary text into an XML structure without replacing the < and & characters in that text first.
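For readers who want to try the two regular expressions outside PowerGREP, the steps above can be sketched in Python. This is only an approximation under stated assumptions: Python's re module stands in for PowerGREP's regex engine, the IGNORECASE flag plays the role of leaving “case sensitive search” off, the < pattern allows an optional / after the < so that closing tags are left alone, and the helper name escape_lone_chars and the sample string are made up for illustration.

```python
import re

# Assumption: Python's re module stands in for PowerGREP's regex engine.
# A "<" that does not start something shaped like an opening or closing tag.
LONE_LT = re.compile(r"<(?!/?[_:a-z][-._:a-z0-9]*\b[^<>]*>)", re.IGNORECASE)
# An "&" that does not start a named, decimal, or hexadecimal entity.
LONE_AMP = re.compile(r"&(?!(?:[a-z]+|#[0-9]+|#x[0-9a-f]+);)", re.IGNORECASE)


def escape_lone_chars(text: str) -> str:
    """Hypothetical helper: escape stray < and & in otherwise valid XML text."""
    text = LONE_LT.sub("&lt;", text)
    # Running the & pass second is safe: the "&lt;" just inserted is an entity,
    # so the lookahead in LONE_AMP skips it.
    return LONE_AMP.sub("&amp;", text)


sample = "<note id='1'>if a < b && c then &amp; done</note>"
print(escape_lone_chars(sample))
# <note id='1'>if a &lt; b &amp;&amp; c then &amp; done</note>
```

Because both patterns skip text that is already an entity or a tag, running the helper a second time on its own output should change nothing.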

If the computer program inserts the invalid XML only between certain XML elements, you can leverage PowerGREP’s “file sectioning” feature to use simpler regular expressions. The example below assumes a computer-generated XML file that is valid, except that the program inserted some SQL code between <sql>...</sql> tags without replacing the “less than”, “greater than”, and ampersand symbols in the SQL with XML entities.

  1. Select the files you want to search through in the File Selector.
  2. Start with a fresh action.
  3. Set the action type to “search and replace”.
  4. Set “file sectioning” to “search and collect sections”. Leave the section search type as “regular expression”.
  5. In the “section search” box, enter the regular expression <sql>(.*?)</sql> to match the <sql> element and its contents.
  6. Turn on “dot matches newlines” to allow the section to span across lines.
  7. Enter \1 into the “section collect” box. This restricts the section to the contents of the <sql> element without the element’s enclosing tags. This is important because we only want to replace the reserved characters in the element’s contents.
  8. Set the search type of the main part of the action to “delimited literal text”.
  9. Leave the “search term delimiter” field set to “Line break”. Type a single equals sign in the “search pair delimiter” field.
  10. Paste these three lines into the search box:
    <=&lt;
    >=&gt;
    &=&amp;
  11. Set the target and backup file options as you like them.
  12. Click the Preview button to run a test.
  13. If all looks well, click the Replace button to actually replace the characters.

You can find this action in the PowerGREP5.pgl standard library as “Replace reserved characters in XML files”.
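The sectioned search-and-replace can also be approximated in Python. This is a rough sketch, not a description of how PowerGREP works internally. It assumes that re.DOTALL corresponds to “dot matches newlines” and that the three literal pairs are applied in a single pass over each section. The helper name escape_sql_sections and the sample data are hypothetical.

```python
import re

# Assumption: DOTALL plays the role of "dot matches newlines".
# The middle group is the section body, like \1 in the "section collect" box.
SQL_SECTION = re.compile(r"(<sql>)(.*?)(</sql>)", re.DOTALL)
# The three search/replace pairs from the delimited literal text list.
PAIRS = {"<": "&lt;", ">": "&gt;", "&": "&amp;"}
RESERVED = re.compile(r"[<>&]")


def escape_sql_sections(text: str) -> str:
    """Hypothetical helper: escape reserved characters inside <sql> elements only."""

    def fix_section(m: re.Match) -> str:
        # One pass over the body, so an inserted "&lt;" is never rescanned
        # and its "&" is not escaped a second time.
        body = RESERVED.sub(lambda c: PAIRS[c.group(0)], m.group(2))
        return m.group(1) + body + m.group(3)

    return SQL_SECTION.sub(fix_section, text)


sample = "<q><sql>SELECT * FROM t\nWHERE a < 5 AND b > 2</sql><n>ok</n></q>"
print(escape_sql_sections(sample))
```

Unlike the lookahead-based regexes, this sketch escapes every reserved character in the section, so running it twice on the same text would turn &lt; into &amp;lt;. Apply it only once to files that have not been fixed yet.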