Search Through Microsoft Word Documents

Microsoft Word 2003 and earlier versions save documents in the DOC file format, which is a proprietary binary format. PowerGREP can convert DOC files to plain text so that you can search through them.

Microsoft Word 2007 and later use the DOCX file format. DOCX files are ZIP archives that contain XML files and assorted other files. While DOCX is an open format in principle, the XML it uses is complex. PowerGREP can convert DOCX files to plain text so you can search through them without having to deal with the XML. PowerGREP can also convert the modified plain text back into the original DOCX file, so you can search-and-replace through DOCX files.
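To illustrate that a DOCX file is a ZIP archive, here is a minimal Python sketch that lists the files inside one. It is not how PowerGREP works internally. The helper names make_sample_docx and list_parts are hypothetical, and the sample archive is a simplified stand-in built in memory; a file saved by Word contains more parts.

```python
import io
import zipfile


def make_sample_docx() -> bytes:
    """Build a tiny, simplified stand-in for a DOCX file in memory."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("[Content_Types].xml", "<Types/>")
        archive.writestr(
            "word/document.xml",
            "<w:document><w:body><w:p><w:r><w:t>Hello</w:t></w:r></w:p>"
            "</w:body></w:document>",
        )
    return buffer.getvalue()


def list_parts(docx_bytes: bytes) -> list:
    """Return the names of the files stored inside a DOCX (ZIP) archive."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        return archive.namelist()


parts = list_parts(make_sample_docx())
print(parts)
```

To inspect a real document, read its bytes from disk and pass them to list_parts. The body text of a Word document normally lives in word/document.xml.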

File Format Configurations for Word Documents

To search through Word documents as if they were plain text documents, set “file formats to convert to plain text” on the File Selector panel to a configuration that converts Word documents to plain text. In that configuration, the option “Use PowerGREP’s built-in decoder to convert files to plain text” should be turned on for the file formats “Microsoft Word 95 to 2003 (DOC)” and “Microsoft Word 2007 to 2016”. Default configurations that use these options are “proprietary formats”, “all formats”, “attachments & proprietary formats”, and “attachments & all formats”.

If you want to search only through Word documents, enter the file mask *.do[ct];*.do[ct][xm] in the “include files” box on the File Selector panel. If you leave the “include files” and “exclude files” boxes blank, then PowerGREP searches through the plain text conversion of all file formats enabled by the configuration, as well as through the raw contents of all files that are not recognized as one of those file formats.
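To see which file names the mask above selects, the following Python sketch approximates it with the standard fnmatch module. This is an assumption made for illustration: PowerGREP's own mask matching may differ in details. The helper name matches_mask is hypothetical.

```python
from fnmatch import fnmatchcase


def matches_mask(filename: str, mask: str) -> bool:
    """Check a file name against a semicolon-delimited list of wildcard masks."""
    # Lowercase both sides so the comparison ignores case,
    # as file names on Windows do.
    name = filename.lower()
    return any(
        fnmatchcase(name, part.strip().lower()) for part in mask.split(";")
    )


WORD_MASK = "*.do[ct];*.do[ct][xm]"

for name in ["report.doc", "letter.DOCX", "template.dotm", "notes.txt"]:
    print(name, matches_mask(name, WORD_MASK))
```

The first part of the mask covers the .doc and .dot extensions. The second part covers .docx, .docm, .dotx, and .dotm.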

This file selection is available in the PowerGREP5.pgl library as “Office: Search through Word documents”.

To indicate which Word documents to search through, click on the folder that contains them in the “folders and files” tree. Then select either “Include File or Folder” or “Include Folder and Subfolders” from the File Selector menu.

Finally, prepare and execute your search on the Action panel.

Search Through The Raw XML Inside DOCX Files

When PowerGREP converts Word documents to plain text, you can only search through the body text of the documents. The conversion does not include any metadata, so you can’t search through that. For DOC files, the plain text conversion is the only way to search.

If you are familiar with the XML format used by DOCX files, you can tell PowerGREP to search through the raw XML instead. This allows you to search for anything in the files, as long as you know how it is represented in the XML. To do so, select a file format configuration on the File Selector panel that uses the option “search through the individual files inside the compound document” for the “Microsoft Word 2007 to 2016” file format. Default configurations that do so are “compound documents”, “compound documents & proprietary formats”, and “compound documents & writable proprietary formats”. Choose “compound documents & proprietary formats” if you also want to search through the plain text conversion of DOC files while searching through the XML inside DOCX files. The other two configurations skip DOC files.
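As an illustration of what searching the individual files inside a DOCX file makes possible, the Python sketch below runs a regular expression against every file in the archive and finds the author stored in the document properties. It is not PowerGREP's implementation. The helper names are hypothetical, and the sample archive is a simplified stand-in without namespace declarations. In files saved by Word, the author is normally stored in a dc:creator element in docProps/core.xml.

```python
import io
import re
import zipfile


def make_sample_docx() -> bytes:
    """Build a simplified stand-in for a DOCX file with body text and metadata."""
    core = (
        "<cp:coreProperties>"
        "<dc:title>Quarterly Report</dc:title>"
        "<dc:creator>Jane Doe</dc:creator>"
        "</cp:coreProperties>"
    )
    body = (
        "<w:document><w:body><w:p><w:r><w:t>Sales rose.</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("docProps/core.xml", core)
        archive.writestr("word/document.xml", body)
    return buffer.getvalue()


def search_parts(docx_bytes: bytes, pattern: str) -> list:
    """Return (file name, matched text) pairs for every match in every file."""
    regex = re.compile(pattern)
    hits = []
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        for name in archive.namelist():
            # Real documents can hold binary files such as images,
            # so undecodable bytes are replaced rather than raising an error.
            text = archive.read(name).decode("utf-8", errors="replace")
            hits.extend((name, match.group(0)) for match in regex.finditer(text))
    return hits


hits = search_parts(make_sample_docx(), r"<dc:creator>[^<]+</dc:creator>")
print(hits)
```

The match comes from the properties file rather than from the body text, which is why the plain text conversion cannot find it.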

If you want to search only through DOCX files, enter the file mask *.do[ct][xm] in the “include files” box on the File Selector panel.

This file selection is available in the PowerGREP5.pgl library as “Office: Search through the raw XML inside DOCX files”.

To effectively work with the XML, you will likely want to use file sectioning. This makes it easy to restrict the main part of the action to the contents of specific XML tags. To search through the body text of DOCX files, for example, take these steps on the Action panel:

Start with a fresh action.
Set the action type to “search”.
Enter the search terms that you want to find.
Select “search and collect sections” from the “file sectioning” list. Leave the section search type as “regular expression”.
In the Section Search box, enter the regular expression <w:t>([^<]++)</w:t>. This regular expression matches a pair of <w:t> and </w:t> XML tags, and the text between them. In Word .docx files, all printable text is stored between such tags. Because XML tags are case sensitive, turn on “case sensitive search” in the file sectioning for better performance.
In the Section Collect box, enter the backreference \1 to restrict the main action to the contents of the <w:t> tag.
Click the Preview button to run the action.
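The effect of these steps can be sketched in Python: first collect the sections matched by the section regular expression, then search only inside the captured text. This is an illustration, not PowerGREP's implementation. The function name search_sections and the sample XML are hypothetical. The sketch uses [^<]+ instead of the possessive [^<]++ because Python versions before 3.11 do not support possessive quantifiers. Both forms match the same text here, since [^<] can never match the < that follows.

```python
import re

# Section search: a <w:t> element with its text captured in group 1.
SECTION_REGEX = re.compile(r"<w:t>([^<]+)</w:t>")


def search_sections(xml: str, term: str) -> list:
    """Return the text of each section that contains the search term."""
    results = []
    for section in SECTION_REGEX.finditer(xml):
        text = section.group(1)  # the equivalent of collecting \1
        if term in text:
            results.append(text)
    return results


sample_xml = (
    '<w:p><w:pPr><w:pStyle w:val="Title"/></w:pPr>'
    "<w:r><w:t>Title of the report</w:t></w:r></w:p>"
    "<w:p><w:r><w:t>The report covers the third quarter.</w:t></w:r></w:p>"
)

print(search_sections(sample_xml, "report"))
print(search_sections(sample_xml, "Title"))
```

In the sample, the word Title also occurs as a style name in an attribute. Because the search is restricted to the sections, only the occurrence in the printable text is reported.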

Note that in .docx files, paragraphs with mixed formatting (bold, italics, etc.) are broken up into multiple <w:t> tags, one for each block of text with contiguous formatting. This means that the PowerGREP action above processes each contiguously formatted part of the paragraph as a separate section. The action does not find search terms that span multiple sections.
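The following Python sketch shows this limitation on a hypothetical, simplified paragraph in which the first five letters are bold. A search term that straddles the formatting boundary is absent from every individual section, even though it appears in the paragraph as displayed.

```python
import re

SECTION_REGEX = re.compile(r"<w:t>([^<]+)</w:t>")

# Simplified sample: "Power" is bold, the rest of the sentence is not,
# so the paragraph is stored as two separate <w:t> elements.
paragraph_xml = (
    "<w:p>"
    "<w:r><w:rPr><w:b/></w:rPr><w:t>Power</w:t></w:r>"
    "<w:r><w:t>GREP searches files.</w:t></w:r>"
    "</w:p>"
)

sections = SECTION_REGEX.findall(paragraph_xml)

# Searching each section separately misses the term.
found_in_sections = any("PowerGREP" in section for section in sections)

# The term only appears once the sections are joined.
found_in_joined = "PowerGREP" in "".join(sections)

print(sections)
print(found_in_sections, found_in_joined)
```

Searching through the plain text conversion described earlier works on the body text rather than on individual XML tags, so it is usually the better choice for terms that may cross formatting boundaries.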

This action is available in the PowerGREP5.pgl library as “Office: Search printable text in the raw XML inside DOCX files”.