Search Through PDF Files

PDF (Portable Document Format) is a format designed to electronically store printed pages. PowerGREP can extract the text from PDF files and arrange it to reconstruct the text on the page.

To enable this, you need to set the “file formats to convert to plain text” on the File Selector panel to a configuration that uses the option “Use PowerGREP’s built-in decoder to convert files to plain text” for the file format “Portable Document Format (PDF)”. Default configurations that use this option are “proprietary formats”, “all formats”, “attachments & proprietary formats”, and “attachments & all formats”.

If you click the (...) button in the File Selector to edit the configuration and select “Portable Document Format (PDF)” in the list, you will see a checkbox labeled “Convert the text in PDF files in reading order rather than trying to mimic the page layout”. This option is turned off in all the default configurations. So by default, PowerGREP’s plain text conversion of PDF files mimics the layout the text would have when you print the PDF or view it in a PDF viewer. Line breaks are added to limit the length of lines. Whitespace is used to preserve indentation. Text in columns is arranged in columns using extra whitespace.

If you turn on this option, then the plain text conversion will have the text in reading order. No line breaks are added to paragraphs. No whitespace is added to preserve indentation. Text that was in columns in the PDF appears as normal text in the plain text conversion, with all text of the first column before all the text of the second column.

Mimicking the page layout generally makes the text easier to read. But you have to take the extra spaces and line breaks into account when searching. Instead of searching for the phrase “two words” as literal text, you should search for the regular expression two\s+words. The \s+ matches one or more whitespace characters, including line breaks. So two\s+words matches two words even when there are multiple spaces or line breaks between the two words.
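
The difference between the literal phrase and two\s+words can be tried outside PowerGREP. The following is a minimal sketch that uses Python's re module as a stand-in for PowerGREP's regex engine. The sample text and the variable names are made up for illustration and are not output from an actual PDF conversion.

```python
import re

# Hypothetical text, made up for this sketch, shaped like a conversion that
# mimics the page layout: the phrase is split once by extra spaces and once
# by a line break followed by indentation.
converted = "Search for two   words here.\nThe phrase two\n    words wraps."

# The literal phrase requires exactly one space between the two words.
literal_hits = re.findall(r"two words", converted)

# \s+ accepts one or more whitespace characters, line breaks included.
flexible_hits = re.findall(r"two\s+words", converted)

print(len(literal_hits), len(flexible_hits))
```

On this sample the literal search finds nothing, while the regular expression finds both occurrences of the phrase.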

Converting text in reading order means you don’t have to deal with extra spaces or line breaks in your search terms. But it is mainly helpful when dealing with text in columns. If the phrase “two words” appears in a column in the PDF with “two” at the end of the line and “words” at the start of the next line, then a plain text conversion that mimics the page layout will have the entire line of text of the other column between those two words. Compare this plain text conversion mimicking two columns:

Converted from a PDF file with text     The second column has the phrase "two
in two columns.                         words" wrapped across two lines.

With this plain text conversion in reading order:

Converted from a PDF file with text in two columns. The second column has the phrase "two words" wrapped across two lines.

The regex two.*?words with the option “dot matches line breaks” turned on matches “two” followed by “words” with any amount of any text between them. In the second conversion, the phrase two words is intact and even a literal search finds it. But in the first conversion the phrase can only be matched by spanning text from the other column, and the regex matches this:

two
in two columns.                         words

The regex engine has no concept of columns. It just processes the text from left to right and from top to bottom. Converting PDFs in reading order makes sure the regex engine sees the text in the order you would read it.
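
The effect of the two conversions on a search can be reproduced with the sample text above. This is a sketch only. It uses Python's re module rather than PowerGREP's regex engine, with re.DOTALL standing in for “dot matches line breaks”. The column width of 40 characters and all variable names are assumptions made for the illustration.

```python
import re

# Rebuild the layout-mimicking sample: each line is a left-column fragment
# padded to 40 characters (assumed width), followed by the right column.
left = ["Converted from a PDF file with text", "in two columns."]
right = ['The second column has the phrase "two',
         'words" wrapped across two lines.']
layout = "\n".join(l.ljust(40) + r for l, r in zip(left, right))

# The same text in reading order: first column, then second column.
reading_order = (
    "Converted from a PDF file with text in two columns. "
    'The second column has the phrase "two words" wrapped across two lines.'
)

loose = re.compile(r"two.*?words", re.DOTALL)   # dot also matches line breaks
strict = re.compile(r"two\s+words")             # only whitespace in between

# In the layout conversion the loose regex swallows the other column's line.
loose_layout = loose.search(layout).group(0)
# The strict regex cannot find the phrase in the layout conversion at all.
strict_layout = strict.search(layout)
# In reading order the phrase is contiguous, so the strict regex finds it.
strict_reading = strict.search(reading_order).group(0)

print(repr(loose_layout))
print(strict_layout)
print(strict_reading)
```

In the layout conversion, two.*?words matches across the left column's second line, and two\s+words finds nothing. In the reading-order conversion, two\s+words finds the phrase directly.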

To see how your own PDF files fare with these conversions, first select the file format configuration that enables PDF conversion the way you want it on the File Selector panel. Then use the Editor|Open menu item to open a PDF file and view it in PowerGREP’s built-in editor. The text shown by the editor is the text that PowerGREP searches through when the PDF is included in an action.

If you want to search only through PDF files, enter the file mask *.pdf in the “include files” box on the File Selector panel. If you leave the “include files” and “exclude files” boxes blank, then PowerGREP searches through the plain text conversion of all file formats enabled by the configuration, as well as through the raw contents of all files that are not recognized as one of those file formats.

These file selections are available in the PowerGREP5.pgl library as “Office: Search through PDF documents (mimic page layout)” and “Office: Search through PDF documents (text in reading order)”.

The PDF format is designed to store final printouts. It is not designed to be editable. PowerGREP cannot make replacements in PDF files.