Search (and Replace) through RTF and HTML as Plain Text

RTF and HTML are two common file formats used for word processing documents and web pages. Both are text-based formats, so you can often get search results even when searching through their raw contents. If you are familiar with the RTF and HTML formats, you may even prefer to work with their raw contents, as that allows you to search for and even manipulate the RTF and HTML tags.

But in many cases, those tags just get in the way. The French word élève, for example, may appear as \'e9l\'e8ve in an RTF file or as something much more complicated. In an HTML file it could appear literally as élève, but also with character entities like &eacute;l&egrave;ve or &#233;l&#232;ve or &#xE9;l&#xE8;ve or a mixture of those. If you just want to search for some text, it’s much easier to do so if all the text is rendered like a word processor or web browser does.
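To illustrate why the raw contents are awkward to search, the following Python sketch decodes several raw spellings of the same word back to plain text. It is not PowerGREP’s decoder. The function names are hypothetical, the RTF part handles only the \'hh hex escape, and it assumes the Windows-1252 code page.

```python
import html
import re

# Matches the RTF hex escape: a backslash, an apostrophe, two hex digits.
RTF_HEX_ESCAPE = re.compile(r"\\'([0-9a-fA-F]{2})")


def decode_rtf_hex_escapes(text, codepage="cp1252"):
    """Replace each RTF hex escape with the character it encodes.

    Assumption: the document uses a single-byte code page (cp1252 here).
    Real RTF files declare their code page and use many other control words.
    """
    def replace(match):
        return bytes([int(match.group(1), 16)]).decode(codepage)

    return RTF_HEX_ESCAPE.sub(replace, text)


def decode_html_entities(text):
    """Resolve named, decimal, and hexadecimal character references."""
    return html.unescape(text)


rtf_fragment = r"\'e9l\'e8ve"
html_variants = [
    "élève",                  # literal characters
    "&eacute;l&egrave;ve",    # named entities
    "&#233;l&#232;ve",        # decimal references
    "&#xE9;l&#xE8;ve",        # hexadecimal references
    "&eacute;l&#232;ve",      # a mixture
]

print(decode_rtf_hex_escapes(rtf_fragment))
for variant in html_variants:
    print(decode_html_entities(variant))
```

Every line printed by this sketch is the same word, élève. A single search term matches all of these variants only after a conversion of this kind.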

File Format Configurations for RTF and HTML Files

To search through a plain text conversion of RTF and HTML files that eliminates all tags and mimics the page layout, you need to set the “file formats to convert to plain text” on the File Selector panel to a configuration that uses the option “Use PowerGREP’s built-in decoder to convert files to plain text” for the file formats “Rich Text Format (RTF)” and “HyperText Markup Language (HTML)”. Default configurations that do this are “all formats”, “all writable formats”, “attachments & all formats”, and “attachments & all writable formats”.

If you want to search-and-replace through the plain text conversion of RTF and HTML files, then the file format configuration should not use any read-only converters, in addition to using the built-in converters for RTF and HTML. Default configurations that satisfy all this are “all writable formats” and “attachments & all writable formats”.
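The choice between the default configurations can be summarized as a small lookup. This Python sketch only restates the two paragraphs above and does not call any PowerGREP API. The type and function names are hypothetical. The "writable" flag for “all formats” and “attachments & all formats” is inferred from their absence in the search-and-replace list.

```python
from typing import List, NamedTuple


class FormatConfiguration(NamedTuple):
    name: str
    converts_rtf_html: bool  # uses the built-in decoder for RTF and HTML
    writable: bool           # uses no read-only converters (inferred for some)


# Only the four default configurations named in the text are listed here.
DEFAULT_CONFIGURATIONS = [
    FormatConfiguration("all formats", True, False),
    FormatConfiguration("all writable formats", True, True),
    FormatConfiguration("attachments & all formats", True, False),
    FormatConfiguration("attachments & all writable formats", True, True),
]


def suitable_configurations(need_replace: bool) -> List[str]:
    """Return the names of configurations usable for the given task."""
    return [
        config.name
        for config in DEFAULT_CONFIGURATIONS
        if config.converts_rtf_html and (config.writable or not need_replace)
    ]


print(suitable_configurations(need_replace=False))
print(suitable_configurations(need_replace=True))
```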

With one of these configurations selected on the File Selector panel, you can use the Editor|Open menu item to open an RTF or HTML file and edit it in PowerGREP’s built-in editor. Though PowerGREP won’t show you the file’s formatting, it will be preserved when you save the file.

If you want to search through only RTF files, enter the file mask *.rtf in the “include files” box on the File Selector panel. If you want to search through only HTML files, use the file mask *.html;*.htm;*.shtml;*.hta. To search through both RTF and HTML files, enter both masks, delimited with a semicolon or a line break. Leave the “include files” and “exclude files” boxes blank if you want to search through all files. The four configurations mentioned above do not exclude any file formats.
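The effect of such a mask list can be approximated with Python’s fnmatch module. This is a rough sketch of wildcard matching in general and not PowerGREP’s own mask engine. The function names are hypothetical, and case-insensitive matching is an assumption.

```python
import re
from fnmatch import fnmatchcase


def split_masks(mask_text):
    """Split a mask list on semicolons or line breaks, dropping blanks."""
    return [mask.strip() for mask in re.split(r"[;\n]", mask_text) if mask.strip()]


def matches_any(file_name, mask_text):
    """Return True if file_name matches at least one mask.

    An empty mask list matches every file, mirroring a blank
    "include files" box. Matching is case-insensitive (an assumption).
    """
    masks = split_masks(mask_text)
    if not masks:
        return True
    name = file_name.lower()
    return any(fnmatchcase(name, mask.lower()) for mask in masks)


RTF_MASKS = "*.rtf"
HTML_MASKS = "*.html;*.htm;*.shtml;*.hta"
BOTH_MASKS = RTF_MASKS + ";" + HTML_MASKS

for name in ["Report.RTF", "index.shtml", "notes.txt"]:
    print(name, matches_any(name, BOTH_MASKS))
```

With the combined mask list, the two RTF and HTML file names match and notes.txt does not.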

These file selections are available in the PowerGREP5.pgl library as “Office: Search through RTF and HTML as plain text” and “Office: Search-and-replace through RTF and HTML as plain text”.

If you’d rather deal with the RTF and HTML code directly, then you need to select a file format configuration that uses the option “search through the file’s raw (unconverted) contents” for the RTF and HTML file formats. All default configurations except the four mentioned above and “(unused)” do this. The “(unused)” configuration does not assign any file masks to any file formats, so no files are recognized as being convertible. As a result, “(unused)” also tells PowerGREP to search through the raw contents of RTF and HTML files.