Search and Replace Through Microsoft Word Documents

Microsoft Word 2003 and earlier save documents in the DOC file format, a proprietary binary format. PowerGREP can convert DOC files to plain text so that you can search through them. This converter is read-only: PowerGREP cannot make replacements in DOC files.

Microsoft Word 2007 and later use the DOCX file format. DOCX files are technically ZIP archives that contain XML and assorted other files. While DOCX is an open format in principle, the XML it uses is still quite complex. PowerGREP can convert DOCX files to plain text so you can easily search through them without having to deal with the XML. PowerGREP can also convert the edited plain text back into the original DOCX file, so you can search-and-replace through DOCX files just as easily.
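To illustrate that a DOCX file is an ordinary ZIP archive, the following Python sketch builds a tiny stand-in archive in memory and lists its parts. The helper names build_sample_docx and list_parts are hypothetical, and the sample archive is a simplified placeholder rather than a document Word would open. Only the part name word/document.xml mirrors where a real DOCX file keeps its main body.

```python
import io
import zipfile

def build_sample_docx():
    """Build a minimal ZIP archive shaped like a DOCX file (illustration only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        # Placeholder content; a real DOCX file contains more parts than this.
        z.writestr("[Content_Types].xml", "<Types/>")
        z.writestr(
            "word/document.xml",
            "<w:document><w:body><w:p><w:r><w:t>Hello</w:t></w:r></w:p></w:body></w:document>",
        )
    return buf.getvalue()

def list_parts(docx_bytes):
    """Return the names of the files stored inside the archive."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as z:
        return z.namelist()

parts = list_parts(build_sample_docx())
print(parts)
```

Pointing list_parts at the bytes of a real DOCX file should show a similar layout, with the XML files that PowerGREP either converts to plain text or exposes as a compound document.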

File Format Configurations for Editing DOCX Files

To be able to search-and-replace through DOCX files as if they were plain text documents, you need to set the “file formats to convert to plain text” on the File Selector panel to a configuration that uses the option “Use PowerGREP’s built-in decoder to convert files to plain text” for the file format “Microsoft Word 2007 to 2016”. The configuration should not enable any read-only converters. PowerGREP will refuse to run a search-and-replace if the selected configuration enables read-only converters. This means the configuration needs to select “always exclude files of this type” for the “Microsoft Word 95 to 2003” file format. Default configurations that satisfy these requirements are “writable proprietary formats”, “all writable formats”, “attachments & proprietary formats”, “attachments & writable proprietary formats”, and “attachments & all formats”.

With one of these configurations selected on the File Selector panel, you can use the Editor|Open menu item to open a DOCX file and edit it in PowerGREP’s built-in editor. Though PowerGREP won’t show you the file’s formatting, it will be preserved when you save the file.

If you want to search-and-replace only through DOCX files, enter the file mask *.do[ct][xm] in the “include files” box on the File Selector panel. This mask matches the extensions .docx, .docm, .dotx, and .dotm. If you leave the “include files” and “exclude files” boxes blank, then PowerGREP searches through the plain text conversion of all file formats enabled by the configuration, as well as through the raw contents of all files that are not recognized as one of those file formats.
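As a quick check of which file names the mask *.do[ct][xm] selects, here is a small Python sketch. It assumes that Python's fnmatch wildcards treat this particular mask the same way PowerGREP's file masks do, which holds for the * wildcard and the bracketed character sets used here but is not a statement about PowerGREP's mask syntax in general. The file names are made up for the example.

```python
from fnmatch import fnmatchcase

MASK = "*.do[ct][xm]"

# Hypothetical file names used only for this illustration.
names = ["report.docx", "macro.docm", "letter.dotx", "form.dotm", "old.doc", "notes.txt"]

# Lowercase the name first so the comparison does not depend on the platform.
matched = [n for n in names if fnmatchcase(n.lower(), MASK)]
print(matched)
```

The legacy .doc extension is not matched, because the mask requires two characters after "do".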

This file selection is available in the PowerGREP5.pgl library as “Office: Search-and-replace through Word documents (DOCX only)”.

To indicate which Word documents to search through, click on the folder that contains them in the “folders and files” tree. Then select Include File or Folder or Include Folder and Subfolders from the File Selector menu.

Finally, prepare and execute your search-and-replace on the Action panel.

Search-and-Replace Through the Raw XML Inside DOCX Files

When PowerGREP converts Word documents to plain text, you can only search-and-replace through the body text of the documents. The conversion does not show any metadata such as hyperlinks, so you can’t edit that.

If you are familiar with the XML format used by DOCX files, you can edit the raw XML instead. This allows you to edit anything in the files, as long as you make sure not to mess up the XML structure. To do so, select a file format configuration on the File Selector panel that uses the option “search through the individual files inside the compound document” for the “Microsoft Word 2007 to 2016” file format and that does not enable read-only converters. Default configurations that do so are “compound documents” and “compound documents & writable proprietary formats”.

With one of these configurations selected on the File Selector panel, you can expand the nodes for DOCX files in the folders and files tree. You’ll then see the file and folder structure inside the DOCX file. Right-click on one of the XML files inside it and click the Edit item to edit it in PowerGREP’s built-in editor.

If you want to search only through DOCX files, enter the file mask *.do[ct][xm] in the “include files” box on the File Selector panel. When DOCX files are treated as compound documents, the file masks still include and exclude the DOCX files themselves rather than the files they contain. Target and backup settings on the Action panel also make PowerGREP save and back up DOCX files as a whole, even though the search results will show the replacements made in the XML files inside the DOCX files.
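The idea that a change to one XML file results in the DOCX file being saved as a whole can be sketched in Python as follows. This is not how PowerGREP is implemented; it is a minimal illustration under the assumption that the archive is simply rewritten with one member changed. The function names rewrite_member and build_sample_docx are hypothetical, and the sample archive is a placeholder rather than a valid Word document.

```python
import io
import zipfile

def build_sample_docx():
    """Build a minimal ZIP archive shaped like a DOCX file (illustration only)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        z.writestr("[Content_Types].xml", "<Types/>")
        z.writestr("word/document.xml", "<w:body><w:t>Hello</w:t></w:body>")
    return buf.getvalue()

def rewrite_member(docx_bytes, member, transform):
    """Return a new archive in which only `member` has been passed through `transform`."""
    out = io.BytesIO()
    src = zipfile.ZipFile(io.BytesIO(docx_bytes))
    with src, zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename == member:
                # Only the targeted XML file changes; all others are copied as-is.
                data = transform(data.decode("utf-8")).encode("utf-8")
            dst.writestr(info.filename, data)
    return out.getvalue()

updated = rewrite_member(
    build_sample_docx(),
    "word/document.xml",
    lambda xml: xml.replace("Hello", "Goodbye"),
)
with zipfile.ZipFile(io.BytesIO(updated)) as z:
    new_body = z.read("word/document.xml").decode("utf-8")
    untouched = z.read("[Content_Types].xml").decode("utf-8")
print(new_body)
```

The output of such a rewrite is a complete archive, which is why target and backup settings apply to the DOCX file rather than to the XML files inside it.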

This file selection is available in the PowerGREP5.pgl library as “Office: Search through the raw XML inside DOCX files”.

To work effectively with the XML, you will likely want to use file sectioning. This makes it easy to restrict the main part of the action to the contents of specific XML tags. The XML in DOCX files does not have line breaks, so you need to change “context type” to avoid flooding the Results panel with XML. Select “use sections as context” when using file sectioning. Otherwise, set it to “no context” or to “search for context” with <?+[^<>]++>?+ as the context regex. This regex matches a single XML tag or the text between two tags. If your replacement string contains reserved XML characters, use extra processing to automatically replace those.
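To see what the context regex picks out, the Python sketch below applies it to a short, made-up XML fragment. The possessive quantifiers (?+ and ++) are written as ordinary greedy quantifiers here, since Python's re module only accepts possessive quantifiers from version 3.11 onward. For this fragment the greedy form splits the text in the same places.

```python
import re

# Context regex from the text, with possessive quantifiers replaced by greedy ones.
CONTEXT = re.compile(r"<?[^<>]+>?")

# Made-up fragment in the style of the XML found inside a DOCX file.
xml = "<w:r><w:t>Hello</w:t></w:r>"

chunks = CONTEXT.findall(xml)
for chunk in chunks:
    print(chunk)
```

Each match is either one tag or one run of text between tags, so each context block on the Results panel stays short even though the XML has no line breaks.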

To search-and-replace through the body text of DOCX files, for example, take these steps on the Action panel:

  1. Start with a fresh action.
  2. Set the action type to “search and replace”.
  3. Enter the text to search for and replace with.
  4. Select “search and collect sections” from the “file sectioning” list. Leave the section search type as “regular expression”.
  5. In the Section Search box, enter the regular expression <w:t>([^<]++)</w:t>. This regular expression matches a pair of <w:t> and </w:t> XML tags, and the text between them. In Word .docx files, all printable text is stored between such tags. Turn on “case sensitive search” in the file sectioning for better performance. XML tags are case sensitive.
  6. In the Section Collect box, enter the backreference \1 to restrict the main action to the contents of the <w:t> tag.
  7. Turn on “extra processing”.
  8. Set the extra processing search type to “delimited literal text”, the “extra term delimiter” to a comma, and the “extra pair delimiter” to an equals sign.
  9. Enter <=&lt;,>=&gt;,&=&amp; into the “extra processing search” box.
  10. Make sure “non-overlapping search” is turned on for extra processing.
  11. Set “context type” to “use sections as context”.
  12. Click the Preview button to run the action.

This example is available in the PowerGREP5.pgl library as “Office: Search-and-replace in printable text in the raw XML inside DOCX files”.
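The steps above can be approximated outside PowerGREP with the following Python sketch. It is an illustration of the technique under stated assumptions, not a description of PowerGREP's internals. The section regex from step 5 is used without its possessive quantifier, the search term is treated as literal text, and the reserved characters in the replacement are escaped in a single pass to mirror the non-overlapping extra processing from steps 8 to 10. The function names escape_xml and replace_in_text_runs are hypothetical. Like the regex in step 5, the sketch only matches <w:t> tags written without attributes.

```python
import re

# Section regex from step 5 (greedy instead of possessive).
SECTION = re.compile(r"<w:t>([^<]+)</w:t>")

# Pairs from step 9: reserved XML characters and their entities.
ESCAPES = {"<": "&lt;", ">": "&gt;", "&": "&amp;"}

def escape_xml(text):
    """Escape reserved characters in one pass so produced entities are not escaped again."""
    return re.sub(r"[<>&]", lambda m: ESCAPES[m.group(0)], text)

def replace_in_text_runs(xml, search, replacement):
    """Replace literal `search` with `replacement`, but only inside <w:t>...</w:t>."""
    safe = escape_xml(replacement)

    def fix(match):
        # Group 1 is the text between the tags; the tags themselves are kept as-is.
        return "<w:t>" + match.group(1).replace(search, safe) + "</w:t>"

    return SECTION.sub(fix, xml)

xml = "<w:p><w:r><w:t>Tom and Jerry</w:t></w:r></w:p>"
result = replace_in_text_runs(xml, "and", "&")
print(result)
```

Because the replacement runs only on the captured text, a search term that also occurs in tag names leaves the XML structure untouched.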