Search Through OpenDocument Format Files

PowerGREP has a built-in decoder for OpenDocument Text (ODT) files. This allows you to search through OpenOffice Writer and LibreOffice Writer documents as if they were plain text files, as described in the preceding example.

PowerGREP does not have a decoder for other OpenDocument formats such as database files (*.odb), chart files (*.odc), formula files (*.odf), graphics files (*.odg), image files (*.odi), presentation files (*.odp), and spreadsheets (*.ods). All these files are technically ZIP archives containing one or more XML files and other support files such as image files.
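
The ZIP structure is easy to verify outside PowerGREP. The following is a minimal Python sketch, not part of PowerGREP. It builds a tiny in-memory archive that only mimics the layout of an ODF file and lists its members. The helper names build_sample_odf and list_members are hypothetical, and the member list is a simplified assumption. A real ODF file contains more entries.

```python
import io
import zipfile


def build_sample_odf() -> bytes:
    """Build a tiny in-memory ZIP that mimics the layout of an ODF file (simplified)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("mimetype", "application/vnd.oasis.opendocument.spreadsheet")
        archive.writestr("content.xml", "<office:document-content/>")
        archive.writestr("styles.xml", "<office:document-styles/>")
        archive.writestr("Pictures/logo.png", b"\x89PNG")  # stand-in for an embedded image
    return buffer.getvalue()


def list_members(odf_bytes: bytes) -> list[str]:
    """Return the names of all files stored inside the archive."""
    with zipfile.ZipFile(io.BytesIO(odf_bytes)) as archive:
        return archive.namelist()


members = list_members(build_sample_odf())
xml_members = [name for name in members if name.endswith(".xml")]
print(members)
print(xml_members)
```

To inspect a real document, pass the bytes of an .ods or .odp file from disk to list_members instead of the sample archive.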

If OpenOffice or LibreOffice is installed on your PC, then your PC will also have an IFilter installed for all the OpenDocument formats. OpenOffice and LibreOffice provide this IFilter so that Windows Search can extract and index the text from these files. PowerGREP can use the same IFilter to search through these files. Microsoft designed the IFilter system for Windows Search, which only reads files, so the IFilter system is read-only. As a result, PowerGREP can’t make replacements in files that are converted to plain text using an IFilter.

To have PowerGREP use the OpenOffice or LibreOffice IFilter, you need to set the “file formats to convert to plain text” on the File Selector panel to a configuration that uses the option “Use IFilter, if available for this format, to convert files to plain text” for files that have an extension used by OpenDocument Format. The default configurations “proprietary formats”, “all formats”, “attachments & proprietary formats”, and “attachments & all formats” all include a custom file format named “IFilter” that uses the IFilter option for all ODF formats except ODT. For ODT, these configurations use PowerGREP’s built-in converter.

If you want to search only through OpenDocument Format files, enter the file mask *.od[bcfgimpst];*.sxw in the “include files” box on the File Selector panel. If you leave the “include files” and “exclude files” boxes blank, then PowerGREP searches through the plain text conversion of all file formats enabled by the configuration, as well as through the raw contents of all files that are not recognized as one of those file formats.
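
To see which file names the mask covers, here is a rough Python sketch. It approximates the mask with Python's fnmatch module. It assumes that semicolons separate alternative masks and that matching ignores case, as is usual for Windows file names. The function matches_mask and the sample file names are hypothetical. This only illustrates the pattern and is not PowerGREP's own matching code.

```python
from fnmatch import fnmatchcase

MASK = "*.od[bcfgimpst];*.sxw"


def matches_mask(filename: str, mask: str = MASK) -> bool:
    """Return True if the file name matches any semicolon-separated part of the mask."""
    name = filename.lower()  # assumption: masks are matched case-insensitively
    return any(fnmatchcase(name, part) for part in mask.split(";"))


samples = ["report.odt", "budget.ods", "slides.odp", "legacy.sxw", "notes.docx", "data.odx"]
matched = [name for name in samples if matches_mask(name)]
print(matched)
```

The character class [bcfgimpst] accepts exactly one letter after "od", so notes.docx and data.odx are skipped.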

This file selection is available in the PowerGREP5.pgl library as “Office: Search through OpenDocument Format files (requires OpenOffice or LibreOffice)”.

Search Through the Raw XML Inside OpenDocument Format Files

Using the built-in converter for ODT files and the IFilter for other ODF files to convert these files to plain text is the most practical way to search through ODF files. But there is another way. You can tell PowerGREP to treat these files as compound documents and search through the raw XML inside them. This works even if you don’t have OpenOffice or LibreOffice installed. It even allows you to search-and-replace through the XML.

To handle ODF files as compound documents, select a file format configuration on the File Selector panel that uses the option “search through the individual files inside the compound document” for the “OpenDocument Text (ODT)” and the “zipped compound documents” file formats. The file masks for these two file formats need to match all extensions used by OpenOffice and LibreOffice. If you want to search-and-replace, then the file format configuration shouldn’t enable any read-only converters. Default configurations that fit these requirements are “compound documents” and “compound documents & writable proprietary formats”.

With one of these configurations selected on the File Selector panel, you can expand the nodes for ODF files in the folders and files tree. You’ll then see the file and folder structure inside the ODF file. Right-click on one of the XML files inside it and click the Edit item to edit it in PowerGREP’s built-in editor.

If you want to search only through OpenDocument Format files, enter the file mask *.od[bcfgimpst];*.sxw in the “include files” box on the File Selector panel.

This file selection is available in the PowerGREP5.pgl library as “Office: Search through the raw XML inside OpenDocument Format files”.

You can use PowerGREP’s file sectioning feature to search only through specific parts of a file, such as only the body text, as the following steps show. When doing a search-and-replace through these files, you’ll need to be careful not to upset the XML structure.

  1. Start with a fresh action.
  2. Set the action type to “search”.
  3. Enter the search terms that you want to find.
  4. Select “search for sections” from the “file sectioning” list. Leave the section search type as “regular expression”.
  5. In the Section Search box, enter the regular expression <text:p[^<>]*+>.*?</text:p>. This regular expression matches a pair of <text:p> and </text:p> XML tags, and the text between them. In OpenDocument Format files, all printable text is stored between such tags. One tag holds one paragraph of text. Turn on “case sensitive search” in the file sectioning for better performance. XML tags are case sensitive.
  6. Click the Preview button to run the action.
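
The effect of the steps above can be sketched in Python. This is only an illustration of searching within sections, not PowerGREP's implementation. The function search_in_sections and the sample XML are hypothetical. The possessive quantifier *+ is written as a plain * because Python versions before 3.11 do not support possessive quantifiers. Both forms find the same matches here. The sketch also assumes each paragraph sits on a single line, since the dot does not match line breaks by default.

```python
import re

# Section regex from step 5, with the possessive "*+" relaxed to "*".
SECTION_RE = re.compile(r"<text:p[^<>]*>.*?</text:p>")


def search_in_sections(xml: str, term: str) -> list[str]:
    """Find matches of `term`, but only inside <text:p>...</text:p> sections."""
    term_re = re.compile(term)
    hits = []
    for section in SECTION_RE.finditer(xml):
        for match in term_re.finditer(section.group(0)):
            hits.append(match.group(0))
    return hits


sample_xml = (
    '<office:text>'
    '<text:p text:style-name="P1">First paragraph about cats.</text:p>'
    '<text:p text:style-name="P2">Second with '
    '<text:span text:style-name="T1">bold</text:span> cats.</text:p>'
    '</office:text>'
    '<office:meta>cats outside any paragraph</office:meta>'
)

hits = search_in_sections(sample_xml, r"cats")
print(len(hits))  # matches inside paragraphs only
print(len(re.findall(r"cats", sample_xml)))  # matches in the whole string
```

The sample contains the word three times, but only the two occurrences inside paragraph tags are reported by the sectioned search.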

Note that in OpenDocument Format files, paragraphs with mixed formatting (bold, italics, etc.) will have extra formatting tags inside them. If you follow the above steps, PowerGREP will search the document one paragraph at a time, including the paragraph tag itself and any formatting tags inside it.

If you do a search-and-replace, it’s important to make sure that the replacement text consists of a valid piece of XML. One way to do this is to use file sectioning as described above, and to make sure that your search-and-replace does not touch the XML tags (codes between angle brackets) in the sections that are found.
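
One way to picture a replacement that leaves the tags alone is the following Python sketch. It is an illustration under assumptions, not how PowerGREP works internally. The function replace_outside_tags and the sample XML are hypothetical. Each section is split into tags and text, and the replacement is applied to the text pieces only. The replacement text is not XML-escaped here, so it must not contain characters such as & or <.

```python
import re

SECTION_RE = re.compile(r"<text:p[^<>]*>.*?</text:p>")
TAG_SPLIT_RE = re.compile(r"(<[^<>]+>)")  # capturing group keeps the tags in the split result


def replace_outside_tags(xml: str, search: str, replacement: str) -> str:
    """Replace `search` inside paragraph sections without touching any XML tag."""

    def fix_section(section_match: re.Match) -> str:
        pieces = TAG_SPLIT_RE.split(section_match.group(0))
        # Even indexes hold text between tags; odd indexes hold the tags themselves.
        for i in range(0, len(pieces), 2):
            pieces[i] = re.sub(search, replacement, pieces[i])
        return "".join(pieces)

    return SECTION_RE.sub(fix_section, xml)


sample_xml = '<text:p text:style-name="span-note">A span of text.</text:p>'
result = replace_outside_tags(sample_xml, r"span", "range")
print(result)
```

In the sample, the word inside the style name attribute stays as it is, while the same word in the paragraph text is replaced.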

This action is available in the PowerGREP5.pgl library as “Office: Search through printable text the raw XML inside OpenDocument Format files”.