Extract Data into a CSV File or Spreadsheet

With PowerGREP, you can easily extract any sort of information from large numbers of documents, archives or spreadsheets, and save the collected information into a comma-delimited text files or CSV files. Here are the basic steps:

  1. Select the files you want to extract information from in the File Selector.
  2. Start with a fresh action.
  3. Set the action type to “collect data”. Leave the search type as “regular expression”.
  4. Enter the regular expression that matches a data record. Use capturing groups to extract specific parts of each record.
  5. As the text to be collected, enter a comma-delimited list of backreferences to those capturing groups.
  6. Set the target type to “save results into a single file” if you want to create one CSV file that holds all the records. Specify the name of the file as the target location. Or, set the target type to “save one file for each searched file” to create one CSV file for each original file. In that case, you may want to set the target destination type to “path placeholders”. Enter c:\Output\%FILENAMENOEXT%.csv or c:\Output\%FOLDER\FILENAMENOEXT%.csv as the target location. The former placeholder will create one CSV file in the folder c:\Output for each source file with the same name as the source, but a .csv extension. The latter will also recreate the folder structure under c:\Output.
  7. Leave “between collected text” set to “Line break”. PowerGREP will insert a line break between each collected match. PowerGREP will not insert it before the first match or after the last match.
  8. Click the Collect button to create the CSV files.

Extracting a List of Delivery Addresses

Suppose you have a large number of orders stored in text documents, and you want to make a list of the delivery addresses. In each file, the delivery address has the following layout:

Deliver to:
Joe N. Doe
Street address (one or two lines)
City, ST, 12345-6789

In the CSV file, you want to have the following fields: name,address 1,address 2,city,state,zip

You can easily achieve this following the steps above. First, we need to create a regular expression that matches a delivery address, which is quite straightforward. We match “Deliver to:” first. Then we capture one line of text with (.*)\R which is the name. Then one or two lines with (.*)\R(?:(.*)\R)? which are the address. Finally, we match one line that ends with a state code [A-Z]{2} and ZIP code [0-9]{5}(?:-[0-9]{4})?. Using []+ we allow commas and/or spaces as delimiters in the last line. The complete regular expression becomes: Deliver to:\R(.*)\R(.*)\R(?:(.*)\R)?(.*?)[]+([A-Z]{2})[]+([0-9]{5}(?:-[0-9]{4})?).

The (?:group) parts in the regular expression are non-capturing groups. Those simply group items to repeat them together. The (group) parts are capturing groups. They’re essential in allowing us to insert part of the regular expression match into the text to be collected. In this case, we simply reference each group once. As the text to be collected, enter: \1,\2,\3,\4,\5,\6. If a capturing group did not participate in the match, it is substituted with nothing. E.g. in a delivery address with only one line for the street address, \3 will remain blank.

Set the target type to save results into a single file to get one CSV file with all delivery addresses. You can then open the CSV file in a spreadsheet program or other application.

You can find this action in the PowerGREP5.pgl standard library as “CSV: Extract data into a CSV file or spreadsheet”.