Find Bytes That Are Not Part of Valid UTF-8 Sequences

With its default settings, PowerGREP does a very good job of automatically handling all Unicode text files. When you have files in a variety of legacy encodings that cannot be auto-detected, you can use a text encoding configuration to make sure PowerGREP always shows you the correct text. PowerGREP is also very flexible at handling files that contain bytes that aren’t strictly valid for their encoding. Even when searching and replacing through such files, PowerGREP preserves any invalid bytes in the files.

But many other applications aren’t as flexible. Many scripting languages, for example, abort your script with a decoding error when it reads a file as UTF-8 and the file contains even one byte that is not part of a valid UTF-8 sequence. This example shows how you can disable PowerGREP’s smart handling of text file encodings and instead look at the raw bytes in UTF-8 files. Then you can search for bytes that aren’t part of valid UTF-8 sequences.
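
To illustrate the kind of failure described above, here is a minimal sketch in Python, chosen only as one example of a scripting language with a strict UTF-8 decoder. The sample bytes and the helper name `first_invalid_offset` are made up for this illustration.

```python
# Hypothetical sample: valid UTF-8 text followed by one stray byte.
# 0xC3 0xA9 is "é" in UTF-8; a lone 0xE9 is "é" in Latin-1, not valid UTF-8 here.
data = b"caf\xc3\xa9 and caf\xe9\n"


def first_invalid_offset(raw: bytes):
    """Return the offset of the first byte that breaks strict UTF-8 decoding,
    or None if the whole buffer decodes cleanly."""
    try:
        raw.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError as exc:
        return exc.start
    return None


print(first_invalid_offset(data))
```

Without the `try` block, the `decode` call raises `UnicodeDecodeError` and the script stops at the first invalid byte, even though the rest of the file is fine.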

  1. Open the PowerGREP.pgl library file included with PowerGREP. You can find it in the folder where PowerGREP is installed, c:\Program Files\JGsoft\PowerGREP3 by default.
  2. Select the action “Encodings: Find bytes that are not part of valid UTF-8 sequences” in the library, and click the Use button. This loads a somewhat complicated regular expression onto the Action panel. It matches any byte that is not part of a valid UTF-8 sequence.
  3. There is a file selection labeled “Encodings: All files as binary” in the library. Loading this does steps 5 through 7 below, but clears any other settings. So load the file selection from the library only if you haven’t already marked the files or folders you want to search through.
  4. Select the UTF-8 files you want to inspect in the File Selector. This action only produces meaningful results on files that are mostly UTF-8, but have invalid bytes here and there. If you use it on files that aren’t UTF-8, the action will find pretty much all bytes 0x80 through 0xFF.
  5. Set “file formats to convert to plain text” to “None”. PowerGREP’s converters for proprietary formats produce UTF-16 for most formats. They never produce invalid UTF-8. So there’s no point in including files in proprietary formats in this action.
  6. Set “text encodings to read files with” to “all files as binary”. This predefined configuration tells PowerGREP to treat all files as binary files.
  7. Turn on “search through binary files”. Because the previous step makes PowerGREP treat every file as binary, it would otherwise skip all the files you selected.
  8. Click the Preview button to run the action.
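
The idea behind the search in the steps above can be sketched outside PowerGREP as well. The Python pattern below is not the regular expression stored in the PowerGREP library. It is an independent sketch built on the well-formed byte sequences listed in RFC 3629: at each position it first tries to consume one valid UTF-8 sequence, and only if that fails does it capture the single byte as invalid. The names `VALID_OR_INVALID` and `invalid_bytes` are assumptions made for this example.

```python
import re

# Each alternative except the last matches one well-formed UTF-8 sequence.
# The last alternative captures a byte that no valid sequence could start with
# at this position. ASCII bytes always match the first alternative.
VALID_OR_INVALID = re.compile(
    rb"""
      [\x00-\x7F]                        # ASCII
    | [\xC2-\xDF][\x80-\xBF]             # 2-byte sequence (no overlong C0, C1)
    | \xE0[\xA0-\xBF][\x80-\xBF]         # 3-byte sequence, excluding overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # 3-byte sequence
    | \xED[\x80-\x9F][\x80-\xBF]         # 3-byte sequence, excluding surrogates
    | \xF0[\x90-\xBF][\x80-\xBF]{2}      # 4-byte sequence, excluding overlongs
    | [\xF1-\xF3][\x80-\xBF]{3}          # 4-byte sequence
    | \xF4[\x80-\x8F][\x80-\xBF]{2}      # 4-byte sequence up to U+10FFFF
    | ([\x80-\xFF])                      # anything else: an invalid byte
    """,
    re.VERBOSE,
)


def invalid_bytes(raw: bytes):
    """Return (offset, byte value) for every byte outside a valid sequence."""
    return [
        (m.start(1), m.group(1)[0])
        for m in VALID_OR_INVALID.finditer(raw)
        if m.group(1) is not None
    ]


print(invalid_bytes(b"caf\xc3\xa9 and caf\xe9\n"))
```

Running the same function on text in a legacy 8-bit encoding shows why the action is only useful on files that are mostly UTF-8: nearly every byte from 0x80 through 0xFF is reported, except where two or more such bytes happen to form a valid UTF-8 sequence by coincidence.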

After running this example, the Results panel will show you all the bytes that PowerGREP found that aren’t part of valid UTF-8 sequences. This allows you to fix those files manually, or to determine why they aren’t valid UTF-8.

If you just want to get a list of files that aren’t valid UTF-8 without seeing the individual bytes, load the action “Encodings: Find files with bytes that are not part of valid UTF-8 sequences” from the library instead. This action uses the “list files” rather than the “search” action type. This is faster because PowerGREP moves on to the next file as soon as it finds one invalid byte.
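
The same "stop at the first invalid byte" idea can be sketched in a script. This is a rough Python equivalent, not PowerGREP's implementation. It relies on Python's own strict UTF-8 decoder instead of a regular expression, and the function names, the `chunk_size` default, and the folder argument are all assumptions made for this example.

```python
import codecs
from pathlib import Path


def is_valid_utf8(path, chunk_size=65536):
    """Decode the file chunk by chunk and give up at the first invalid byte."""
    decoder = codecs.getincrementaldecoder("utf-8")()  # strict by default
    with open(path, "rb") as handle:
        try:
            while chunk := handle.read(chunk_size):
                decoder.decode(chunk)
            # final=True also rejects a sequence that is cut off at end of file
            decoder.decode(b"", final=True)
        except UnicodeDecodeError:
            return False
    return True


def files_with_invalid_utf8(folder):
    """Return the files under folder that are not valid UTF-8, sorted by path."""
    return sorted(
        p for p in Path(folder).rglob("*")
        if p.is_file() and not is_valid_utf8(p)
    )
```

Reading in chunks means a large file with an invalid byte near the start is rejected without reading the rest of it.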

If you want to remove the offending bytes, load the action “Encodings: Delete bytes that are not part of valid UTF-8 sequences” from the library. This uses the “search and delete” action type to delete all bytes matched by the regular expression.
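
For comparison, a script can drop invalid bytes by decoding with a lenient error handler and encoding the result again. The sketch below uses Python's `errors="ignore"` handler, so what it removes is decided by Python's decoder, not by the regular expression in the PowerGREP library. The helper name `strip_invalid_utf8` is hypothetical. It works on a byte string in memory and leaves writing the result back to disk, and keeping a backup, to the caller.

```python
def strip_invalid_utf8(raw: bytes) -> bytes:
    """Return raw with every byte that is not part of a valid UTF-8
    sequence removed, as judged by Python's UTF-8 decoder."""
    return raw.decode("utf-8", errors="ignore").encode("utf-8")


cleaned = strip_invalid_utf8(b"caf\xc3\xa9 and caf\xe9\n")
print(cleaned)
```

Deleting bytes loses information. If the invalid bytes are characters in a legacy encoding, inspecting them first and converting them is usually a better fix than removing them.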