Text Encoding Configuration

Computers deal with numbers, not with characters. When you save a text file, each character is mapped to a number, and the numbers are stored on disk. When you open a text file, the numbers are read and mapped back to characters. When saving a file in one application, and opening the that file in another application, both applications need to use the same character mappings.

Traditional character mappings or code pages use only 8 bits per character. This means that only 256 characters can be represented in a text file. As a result, different character mappings are used for different language and scripts. Since different computer manufacturers had different ideas about how to create character mappings, there’s a wide variety of legacy character mappings. PowerGREP supports a wide range of these.

In addition to conversion problems, the main problem with using traditional character mappings is that it is impossible to create text files written in multiple languages using multiple scripts. You can’t mix Chinese, Russian, and French in a text file, unless you use Unicode. Unicode is a standard that aims to encompass all traditional character mappings, and all scripts used by current and historical human languages.

In order for PowerGREP to correctly search through and display the text in your files, PowerGREP needs to know which encoding to use to map the numbers stored in the file to characters. For files converted to plain text according to a file format configuration the encoding is determined as part of the conversion. For all other files, the encoding is determined by the Text Encoding Configuration that you select under “text encodings to read files with” on the File Selector panel. That includes files that need to be treated as raw files according to the File Format Configuration, as well as files not matched by any file masks in the File Format Configuration.

Predefined Text Encoding Configurations

PowerGREP ships with three predefined text encoding configurations. The “specific auto detection” configuration performs quick automatic Unicode detection on all files. It performs slower automatic detection of XML declarations and HTML charset tags only on files with extensions used for XML and HTML files. It also knows that certain file formats need special treatment. It tells PowerGREP not to add a byte order marker to XML files as many XML parsers choke on the BOM. It disables detection of NULL characters on RTF files, because RTF files sometimes end with a NULL.

The “generic auto detection” configuration performs all automatic text encoding detection methods on all files. It does not have specific settings for any kind of file.

The “all files as binary” configuration treats all files as binary files. This is useful for actions that use the “binary data” search type.

Testing Text Encoding Configurations

Use the Open item in the Editor menu to open a text file. The text shown in the editor is the text that PowerGREP searches through when the file is included in an action. Make sure the text appears correctly. Pay particular attention to letters with diacritics or letters from non-Latin scripts if your file is supposed to contain those. Many legacy code pages are extensions of ASCII. Since ASCII covers the English alphabet, English text often comes out correctly even when using the wrong code page.

If the text is not correct, use the Editor|New menu to close the file. Select a different text encoding configuration on the File Selector panel or edit the configuration you were using. Then open the file again in the editor. The editor uses the text encoding settings only when opening a file. So you need to close and reopen the file for the editor to use the new text encoding configuration.

Editing Text Encoding Configurations

Text Encoding Configuration

Click the (...) button next to the “text encodings to read files with” drop-down list on the File Selector panel to edit the text encoding configurations, or just to see their details.

The list on the left shows the available text encoding configurations. Select one to see its settings or edit it. You can edit all configurations. You can even delete all the configurations. If you delete them all and do not add your own, PowerGREP restores the configurations that were predefined when you first installed PowerGREP.

If you edit a configuration presently selected on the File Selector panel, those changes take effect immediately. But editing configurations does not change the behavior of previously saved file selections. When you save a file selection, it stores the full details of the selected configurations. When you load a file selection, it continues to use the configuration you saved it with. If you edited that configuration between the time you saved and loaded the file selection, then the configuration loaded with the file selection is indicated with a number such as (2) to indicate its details are different from the configuration with the same name in the Preferences. If you want the loaded file selection to use the edited configuration, then you need need to select the edited configuration (without the number in parenthesis) on the File Selector panel after loading the exiting file selection. If you click the (...) button, both the edited configuration and the loaded configuration are shown in the dialog.

Each configuration has a name that identifies it on the File Selector panel. You can also add comments to explain in which situations you want to use this configuration.

The list on the right shows all the file format specific settings in the selected configuration. You can add as many file formats to this list as you like. You can also leave this list blank if you want to use the configuration’s default settings for all files. Each configuration has its own list of file formats. Adding or deleting file formats only affects the selected configuration.

Each file format has a name that identifies it in the list of file formats. To actually enable the format, you need to specify one or more file masks that match the files that should use the text encoding settings specific to that file format. You can use the full syntax for traditional file masks as explained in the help topic about the File Selector panel. File masks are applied to the full file name, not just the extension. You can even use backslashes in file masks if you want the file mask to be applied to the full path of the files instead of just their names.

Text Encoding Settings

The settings in the left hand section “default settings for files without file format specific settings” are used for all files that do not match any of the file masks that you added. The settings in the right hand section are used for the file format you selected in the right hand list. These two sets of settings are exactly the same. Only one set is used for each file, depending on whether it matches one of the file masks for specific settings, or not.

Encoding

The “encoding” is the character mapping or code page to interpret the file. In most cases you’ll want to select the Windows code page that matches the language you work with. All Windows code pages are an extension of US ASCII, which supports the English alphabet. Code page 1252 is the default for the Americas and Western Europe. The encoding you select in the drop-down list is used as long as the automatic detection options are turned off or don’t detect anything.

Encoding of XML Files

XML files start with an XML declaration that indicates the encoding of the XML file. Before searching a file, PowerGREP will check if it starts with an XML declaration. If so, PowerGREP will process the file using the encoding indicated by the XML declaration.

PowerGREP will always do the XML declaration check for files for which you did not define a text encoding on the Text Encoding page in the Preferences. There is no XML option among the default text encoding settings.

You can turn off the XML declaration check for text encoding definitions. This can be useful when you want to search through XML files that use an encoding not recognized by PowerGREP. Otherwise, PowerGREP will skip those files with an error message indicating the unsupported encoding. Instead, you can select a fixed encoding from PowerGREP’s list that is close enough to the one actually used by the XML file.

E.g. PowerGREP supports only a handful of EBCDIC encodings. If you have XML files that use another EBCDIC encoding, you can create a text encoding definition for those XML files. Turn off the XML declaration check, and select the EBCDIC 037 encoding. This way you can still properly search through the parts of the XML file written in English, which could very well be the whole file.

Note that PowerGREP will not automatically add or update the XML declaration when writing XML files. It’s up to you if you want to add the declaration to your files, and to make sure it is correct.

Encoding of HTML Files

Most web servers and web browsers do not support the Unicode byte order marker. For HTML file types like HTML, PHP and ASP pages, you should turn off the options to write and preserve the byte order marker, and turn on the option to detect Unicode files without a byte order marker.

HTML files can use a meta tag to set a Content-Type header which can specify the encoding used by the web page. E.g. <meta http-equiv="Content-Type" content="text/html; charset=win1252"> specifies the Windows 1252 code page. Turn on the “detect HTML Content-Type meta tag” option to make PowerGREP search for this tag at the start of the file, and use the specified encoding if the tag can be found. Note that there’s no such option for the default settings. PowerGREP will not look for the HTML meta tag by default.

This is indeed different from the XML declaration check, which PowerGREP does do by default. The reason is that the XML declaration always appears at the very start of the file, so checking its presence is trivial. The meta tag can appear deep in the HTML file’s header, so PowerGREP has to search for it.

Note that PowerGREP will not automatically add or update the meta tag when writing HTML files. It’s up to you to decide if you want to add the tag to your files, and to make sure it is correct.

NULL Characters

Text files normally should not contain NULL characters. Binary files usually do contain NULL bytes. Turn on “treat files containing NULL characters as binary files” to be able to exclude binary files in the File Selector with the “search through binary files” checkbox. When you do search through binary files, PowerGREP displays the results in hexadecimal. The file editor edits binary files in hexadecimal mode.

If PowerGREP treats certain text files as binary files, that is because those text files contain spurious NULL characters. You can turn off “treat files containing NULL characters as binary files” to force PowerGREP to treat all files as text files.

Byte Order Marker

On the Windows platform, Unicode files should start with a byte order marker which is more accurately called a Unicode signature. The byte order marker is a special code that indicates the Unicode encoding (UTF-8, UTF-16 or UTF-32) used by the file. PowerGREP will always detect the byte order marker, and treat the file with the corresponding Unicode encoding.

Some applications save Unicode files without byte order markers. Reading a UTF-16 file as if it was encoded with a Windows code page will cause every other character in the file to appear as a NULL character. PowerGREP can detect this situation and read the file as UTF-16. Reading a UTF-8 file as if it was encoded with a Windows code page will cause non-ASCII characters to appear as two or three garbage characters. E.g. the French character é will appear as é. PowerGREP can detect if a file contains non-ASCII characters and if all of them are valid UTF-8 sequences, indicating the file is highly likely an UTF-8 file. You should turn on the option “detect UTF-8 and UTF-16 files without a byte order marker”, to prevent UTF-16 files from being treated as binary files, and to make sure UTF-8 files are processed properly.

On the Windows platform, Unicode files should start with a byte order marker. PowerGREP will write this marker at the start of each Unicode file, unless you’ve turned off that option for certain text encoding definitions. If an application that claims to support Unicode can’t read the Unicode files created by PowerGREP, try turning off the option to write the byte order marker for the files you’re trying to open with that application.

If you’re not sure whether files of this type should use a byte order marker or not, or if some applications require it and others can’t handle it, turn on the option to “preserve the presence of absence of the byte order marker in existing files”. When this option is on, and PowerGREP modifies a file, PowerGREP will keep the BOM if it was present in the file, but won’t write it if it wasn’t present, regardless of whether you turned the “write a byte order marker” option on or off. Note that the “write a byte order marker” option will still determine whether PowerGREP writes the byte order marker to new files that it creates.

Detect ASCII files using \uFFFF, &#65535; or &#xFFFF; as Unicode files

Some files consist only of ASCII characters and specify Unicode characters using numeric Unicode character escapes in the form of \uFFFF or using XML entities in the form of &#65535; and/or &#xFFFF;. Turn on this option to make PowerGREP treat these files as Unicode and transparently convert the Unicode escapes or XML entities into Unicode characters and back. PowerGREP will display the Unicode characters in the search results, and you'll need to type or paste in those Unicode characters as literal characters if you want to search for them. Turn this option off to make PowerGREP treat these files as ASCII files. The Unicode escapes and XML entities will appear as such in the search results.