Search through, Edit, and Convert All Plain Text Files

A plain text file is a computer file that is stored entirely as text in a human-readable form. A text document or .txt file created with a plain text editor such as Notepad or EditPad is stored in that way. On the other hand, a document created with Microsoft Word is stored in a binary format. If you open a DOCX file in a plain text editor, you will see only garbage.

Line Breaks and Code Pages

Not all text files are alike, though. Computers deal with numbers, not with characters. When you save a text file, each character is mapped to a number, and the numbers are stored on disk. Different character mappings or code pages are used for different languages and scripts. Since different computer manufacturers had different ideas about how to create character mappings, there’s a wide variety of legacy character mappings.
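The effect of different character mappings is easy to see with a small Python sketch. It uses the codec names that ship with Python's standard library (cp1252 for a Windows code page, cp437 for an MS-DOS code page, mac_roman for classic Mac OS); the helper names byte_values and characters are made up for this illustration and have nothing to do with PowerGREP itself.

```python
# One character, several code pages: the stored byte value differs.
CODE_PAGES = ("cp1252", "cp437", "mac_roman")


def byte_values(char, encodings=CODE_PAGES):
    """Return {encoding: hex string of the bytes used to store char}."""
    return {enc: char.encode(enc).hex() for enc in encodings}


def characters(byte_value, encodings=CODE_PAGES):
    """Return {encoding: character that a single byte value decodes to}."""
    return {enc: bytes([byte_value]).decode(enc) for enc in encodings}


# The same letter is stored as a different number in each code page ...
print(byte_values("é"))
# ... and the same number on disk is shown as a different character.
print(characters(0xE9))
```

A file written with one mapping and read with another therefore shows the wrong characters, even though no byte on disk has changed.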

While most Windows grep and search tools only support text files saved with a Windows code page or Unicode, PowerGREP supports all character sets that have or ever had any importance, including Unicode (UTF-8, UTF-16 and UTF-32), all Windows code pages, all ISO-8859 character sets (often used on Linux in the past), most legacy MS-DOS, PC DOS, and classic Mac OS code pages, EBCDIC (used by IBM mainframes), KOI8 (popular in Russia and CIS countries), the many Vietnamese encodings, and a variety of other specialized code pages. PowerGREP can read and write all these encodings. So you can search through, make replacements in, collect matches from, and write results to files in any encoding that your other software may be using or expecting.

PowerGREP can automatically detect encodings in a variety of ways. This includes Unicode signatures or byte order marks, HTML meta tags, XML declarations, and even UTF-8 and UTF-16 byte patterns. If you work with many different encodings, you can use the setting “text encoding to read files with” on the File Selector panel to tell PowerGREP exactly which encodings it should use for exactly which files.
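To give a feel for what signature-based detection involves, here is a minimal Python sketch. It is not PowerGREP's detection logic: it only checks Unicode signatures and then tests whether the bytes are valid UTF-8. The function name sniff_encoding and the cp1252 fallback are assumptions chosen for this example.

```python
import codecs

# Check longer signatures first: the UTF-32 LE signature begins with
# the same two bytes as the UTF-16 LE signature.
SIGNATURES = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_encoding(data: bytes, fallback: str = "cp1252") -> str:
    """Guess an encoding from a Unicode signature or a UTF-8 byte pattern."""
    for signature, name in SIGNATURES:
        if data.startswith(signature):
            return name
    try:
        # Bytes that decode cleanly as UTF-8 are very likely UTF-8.
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        # Hypothetical default; a real tool would let the user choose.
        return fallback
```

A guess like this can be wrong for files without a signature, which is why an explicit per-file encoding setting remains useful.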

Inconsistent line break handling is also a problem with many grep tools. Windows text files normally use a CRLF pair for line breaks. But UNIX and Linux use a single LF and classic Mac used a single CR to end lines. This causes many Windows applications to display text from Linux files all on one line. On top of that, Unicode introduced additional line break characters. With PowerGREP, you don’t need to worry about line breaks. PowerGREP transparently handles all line break styles, even when mixed together in one file. PowerGREP’s regex flavor is also smart about line breaks. Anchors that match at line breaks recognize all line breaks, and treat CRLF pairs as indivisible. Literal line breaks match a line break in any style. And matches can span across lines if you want them to.
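The idea of treating every line break style alike, with CRLF as one indivisible unit, can be sketched in Python as follows. This illustrates the concept with Python's re module rather than PowerGREP's regex flavor; the name normalize_line_breaks and the exact set of characters counted as line breaks are assumptions made for the example.

```python
import re

# CRLF comes first in the alternation so the pair is consumed as one
# line break instead of being counted as a CR followed by an LF.
# The class covers LF, CR, vertical tab, form feed, NEL, and the
# Unicode line and paragraph separators.
LINE_BREAK = re.compile("\r\n|[\n\r\x0b\x0c\x85\u2028\u2029]")


def normalize_line_breaks(text: str, newline: str = "\n") -> str:
    """Replace every line break, whatever its style, with newline."""
    return LINE_BREAK.sub(newline, text)


mixed = "windows\r\nunix\nclassic mac\runicode\u2028end"
print(normalize_line_breaks(mixed).split("\n"))
```

The sample string mixes four styles in one piece of text, and each of them ends up as a single line break.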

Convert Between Encodings and Line Break Styles

Other software that you use may not be as flexible as PowerGREP. If you have plain text files that use an encoding or line break style not supported by the application you want to use them with, use PowerGREP to convert or translate your plain text files from one encoding and/or line break style to another. To do so, simply run a search with “action type” set to “list files”. You don’t need to enter a search term if you want to convert all files you’ve included in the search. Then set “target file creation” to “convert matched files to text” or to “convert copies of matched files to text”. Then choose the text encoding and/or line break style that the converted files should use.
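For readers who want to see what such a conversion amounts to, the following Python sketch decodes a file with one encoding, unifies its line breaks, and writes it back with another encoding and line break style. It is only an illustration of the concept, not how PowerGREP performs the conversion. The function names, the file names in the usage note, and the choice to handle only CRLF, CR, and LF are assumptions.

```python
from pathlib import Path


def convert_bytes(data: bytes, src_encoding: str, dst_encoding: str,
                  newline: str = "\r\n") -> bytes:
    """Re-encode text and rewrite its line breaks in one style."""
    text = data.decode(src_encoding)
    # Unify CRLF and CR to LF first, then apply the requested style.
    unified = text.replace("\r\n", "\n").replace("\r", "\n")
    return unified.replace("\n", newline).encode(dst_encoding)


def convert_text_file(src, dst, src_encoding, dst_encoding, newline="\r\n"):
    """Convert a copy of src into dst; src itself is left untouched."""
    # Reading and writing bytes avoids Python's own newline translation.
    data = Path(src).read_bytes()
    Path(dst).write_bytes(convert_bytes(data, src_encoding, dst_encoding, newline))
```

A call such as convert_text_file("old.txt", "new.txt", "iso-8859-1", "utf-8") would write a UTF-8 copy with CRLF line breaks. Characters that the target encoding cannot represent raise a UnicodeEncodeError in this sketch rather than being silently replaced.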

Convert Plain Text Files to Pure Text

Many file formats that you may not think of as plain text files are actually plain text files. HTML files (i.e. web pages), XML files, and RTF files (Rich Text Format), for example, are all plain text files. If you open an HTML file in your browser, you will see a nicely rendered web page. If you open the same HTML file in a plain text editor such as Notepad, you will see the text of the web page along with the HTML tags that provide the formatting.

By default, PowerGREP handles HTML and RTF files like a plain text editor would. This means you will see and can search for and make replacements on HTML tags and RTF tags. This opens up a lot of possibilities if you are familiar with these file formats. You can modify the file’s formatting and structure as well as its content. On a web page, <b>text</b> is bold text, and <i>text</i> is italic text. You can make all bold text italic by searching for the regular expression <b>(.*?)</b> and replacing it with <i>\1</i>. Don’t worry if you do not have any experience with regular expressions. The documentation that comes with PowerGREP includes a detailed tutorial on regular expressions.
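The bold-to-italic replacement can also be tried outside PowerGREP. The sketch below uses Python's re module and assumes that bold text is wrapped in plain <b> tags and italic text in <i> tags, without attributes. Python's regex flavor differs from PowerGREP's in many details, but this particular pattern and replacement text work the same way.

```python
import re

html = "This is <b>bold</b> and this is <b>also bold</b>."

# (.*?) is lazy, so each match stops at the nearest closing </b> tag.
# \1 in the replacement reinserts the text captured between the tags.
result = re.sub(r"<b>(.*?)</b>", r"<i>\1</i>", html)

print(result)
# prints: This is <i>bold</i> and this is <i>also bold</i>.
```

With a greedy (.*) instead of the lazy (.*?), the single match would run from the first <b> to the last </b> on the line, and only the outermost pair of tags would be replaced.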

But when you just want to search for some information or redact some text, those HTML and RTF codes just get in the way. If you set “file formats to convert to plain text” to “all formats” or “all writable formats” then PowerGREP converts HTML and RTF files to pure text. Then you won’t see any HTML tags or RTF tags in the search results or when you open HTML or RTF files in PowerGREP’s built-in editor. But PowerGREP does keep track of the HTML tags and RTF codes and keeps them in the file when you execute a search-and-replace or edit the file in PowerGREP’s editor. The document will retain all its formatting when you open it later in a web browser or word processor.

If you want to permanently remove the HTML and RTF tags from your files, you can use the same method as for converting text files to a particular encoding or line break style. Include your HTML and RTF files in the action, and make sure “file formats to convert to plain text” is set to “all formats”.
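What "pure text" means for an HTML file can be sketched with the HTML parser in Python's standard library. This is a deliberately small illustration, not PowerGREP's conversion: the class name TextExtractor and the function html_to_text are made up for the example, and the sketch keeps the contents of script and style elements, which a real converter would be expected to drop.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text between tags and discard the tags themselves."""

    def __init__(self):
        # convert_charrefs=True turns references like &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)


print(html_to_text("<p>Hello <b>world</b> &amp; all</p>"))
```

Unlike the reversible view described earlier, text extracted this way has lost its tags for good, which matches the permanent removal this section describes.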

See PowerGREP in Action

There are four ways to see PowerGREP in action:

Just sit back and watch the videos in your web browser.
Take a closer look at the screen shots.
Download the free evaluation version, which comes with full documentation.
Buy PowerGREP now and try it risk-free with our 3-month unconditional money-back guarantee.

Page URL: https://www.powergrep.com/textfiles.html
Page last updated: 31 December 2020
Site last updated: 18 January 2024