While there is a lot of specialized software available for gathering useful information from web server logs, sometimes you want to get some information that standard web log analyzers do not offer.
PowerGREP is most useful for analyzing logs for which no specialized software is available. The basic concepts illustrated in this example are applicable to analyzing any kind of server or system log.
In this example, we will use Apache’s extended log format. Most other web servers also use this format, or offer it as a choice. In this log format, each event gets one line in the log file:
bdsl.220.127.116.11.gte.net - - [31/Jan/2005:00:06:55 -0500] "GET / HTTP/1.1" 200 8669 "http://www.google.com/search?q=regex+tutorial" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"
(In the actual log file, all this is on a single line.)
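Each line consists of eight elements. If we assume that we will only apply our regular expression to valid log files, so that our regex need not exclude invalid log file lines, we can easily write the regular expression for each item:
Client IP address or host name: \S++
Authentication (two fields): \S++ \S++
Date and time: \[[^]]++\]
HTTP request: "(?:GET|POST|HEAD) [^\s"]++ HTTP/[0-9.]++"
Status code: [0-9]++
Size: [-0-9]++
Referrer: "[^"]*+"
User agent: "[^"]*+"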
We can easily put all of this together. Items are separated by whitespace, which we match with a literal space. The result is:
^\S++ \S++ \S++ \[[^]]++\] "(?:GET|POST|HEAD) [^\s"]++ HTTP/[0-9.]++" [0-9]++ [-0-9]++ "[^"]*+" "[^"]*+"$
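To try this regular expression outside PowerGREP, it can be carried over to another regex flavor. The following is a minimal Python sketch, not part of the PowerGREP action. It assumes that the possessive quantifiers (++ and *+) may be replaced by ordinary greedy ones, which keeps the code running on Python versions older than 3.11 and still accepts every line the original pattern accepts. The names LOG_LINE, sample, and is_log_line are made up for illustration.

```python
import re

# Python translation of the pattern above. Possessive quantifiers are
# written as plain greedy quantifiers (an assumption made for portability).
LOG_LINE = re.compile(
    r'^\S+ \S+ \S+ \[[^\]]+\] "(?:GET|POST|HEAD) [^\s"]+ HTTP/[0-9.]+" '
    r'[0-9]+ [-0-9]+ "[^"]*" "[^"]*"$'
)

# The sample log entry from this example, joined into a single line.
sample = (
    'bdsl.220.127.116.11.gte.net - - [31/Jan/2005:00:06:55 -0500] '
    '"GET / HTTP/1.1" 200 8669 '
    '"http://www.google.com/search?q=regex+tutorial" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"'
)


def is_log_line(line: str) -> bool:
    """Return True if the whole line has the shape of a log entry."""
    return LOG_LINE.match(line) is not None


print(is_log_line(sample))
```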
While this regular expression properly matches a server log line, it is not useful for collecting information. To make it useful, we have to add capturing groups, so we can collect only the information we want. To make things easy, we’ll use named capturing groups. If we capture everything, and split the file in the HTTP request into file name and parameters, we get:
^(?<client>\S++) (?<auth>\S++ \S++) \[(?<datetime>[^]]++)\] "(?:GET|POST|HEAD) (?<file>[^\s?"]++)\??(?<parameters>[^\s?"]++)?+ HTTP/[0-9.]++" (?<status>[0-9]++) (?<size>[-0-9]++) "(?<referrer>[^"]*+)" "(?<useragent>[^"]*+)"$
This action is available in the PowerGREP5.pgl library as “Logs: Inspect Apache web logs”.
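The named groups can be tried out in other regex flavors too. Here is a hedged Python sketch, assuming Python's (?P<name>...) syntax in place of (?<name>...) and greedy quantifiers in place of the possessive ones. The helper parse_line is hypothetical and only shows how the captured items could be read back by name.

```python
import re
from typing import Optional

# Same structure as the regex above, rewritten for Python's re module:
# (?P<name>...) for named groups, greedy instead of possessive quantifiers.
LOG_ENTRY = re.compile(
    r'^(?P<client>\S+) (?P<auth>\S+ \S+) \[(?P<datetime>[^\]]+)\] '
    r'"(?:GET|POST|HEAD) (?P<file>[^\s?"]+)\??(?P<parameters>[^\s?"]+)? '
    r'HTTP/[0-9.]+" (?P<status>[0-9]+) (?P<size>[-0-9]+) '
    r'"(?P<referrer>[^"]*)" "(?P<useragent>[^"]*)"$'
)

# The sample log entry from this example, joined into a single line.
sample = (
    'bdsl.220.127.116.11.gte.net - - [31/Jan/2005:00:06:55 -0500] '
    '"GET / HTTP/1.1" 200 8669 '
    '"http://www.google.com/search?q=regex+tutorial" '
    '"Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1)"'
)


def parse_line(line: str) -> Optional[dict]:
    """Return the captured items by name, or None if the line does not match."""
    match = LOG_ENTRY.match(line)
    return match.groupdict() if match else None


fields = parse_line(sample)
if fields is not None:
    for name, value in fields.items():
        print(f'{name}: {value}')
```

In Python, a group that did not take part in the match (here: parameters, because the request has no query string) is reported as None rather than as an empty string.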
The regex for matching complete log file entries was clipped at the start and the end to produce this example. By removing the parts we aren’t interested in, we speed up the action. Capturing groups we don’t care for were also removed. We match the HTTP request with GET [^\s?"]+?\.html?+[^\s"]*+ HTTP/[0-9.]++ to restrict matches to page hits only. This makes sure the statistics aren’t skewed, since most browsers send the same referrer information when loading images as when loading the page containing those images.
This action is available in the PowerGREP5.pgl library as “Inspect Apache web logs - Referring URLs”.
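As a rough illustration of the clipped regex, the Python sketch below tallies referring URLs for page hits only. It is a sketch under assumptions: the pattern is pieced together from the fragments quoted in this example with a single capturing group around the referrer, possessive quantifiers are replaced by greedy ones, and the three log lines are invented sample data.

```python
import re
from collections import Counter

# Clipped pattern: starts at the HTTP request and stops after the referrer.
# The request part only matches GET requests for .htm or .html files.
PAGE_HIT_REFERRER = re.compile(
    r'"GET [^\s?"]+?\.html?[^\s"]* HTTP/[0-9.]+" [0-9]+ [-0-9]+ "([^"]*)"'
)

# Invented sample data: two page hits and one image request.
log_lines = [
    'host1.example.com - - [31/Jan/2005:00:07:01 -0500] '
    '"GET /tutorial.html HTTP/1.1" 200 5120 '
    '"http://www.example.org/links.html" "ExampleBrowser/1.0"',
    'host1.example.com - - [31/Jan/2005:00:07:02 -0500] '
    '"GET /images/logo.png HTTP/1.1" 200 900 '
    '"http://www.example.org/links.html" "ExampleBrowser/1.0"',
    'host2.example.com - - [31/Jan/2005:00:08:15 -0500] '
    '"GET /index.htm?lang=en HTTP/1.1" 200 1234 "-" "ExampleBrowser/1.0"',
]


def count_referrers(lines):
    """Count the referrer of every line that is a page hit."""
    counts = Counter()
    for line in lines:
        match = PAGE_HIT_REFERRER.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


referrers = count_referrers(log_lines)
for url, hits in referrers.most_common():
    print(hits, url)
```

The image request is skipped because its file name does not contain .htm, so its referrer does not inflate the totals.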
If we want to collect referring sites (domain names) rather than complete URLs, we have to refine the regular expression to separate the domain name from the rest of the URL. Instead of using [^"]*+, we will use (?:-|http://([-.a-z0-9]++)[^"]*+). We are using two pairs of parentheses now: the outer, non-capturing pair to group the alternatives separated by the pipe symbol, and the inner pair to capture the domain name part of the URL into a backreference. The complete regular expression thus becomes:
"GET [^\s?"]+?\.html?+[^\s"]*+ HTTP/[0-9.]++" [0-9]++ [-0-9]++ "(?:-|http://([-.a-z0-9]++)[^"]*+)"
If the web browser did not pass referrer info, then the referrer item in the logs will show up as “-”, including the quotes. This is why we are using the pipe symbol to match this option, in addition to the domain name. If the dash was matched, the part of the regular expression in the capturing group will not have matched anything. In that case, the backreference will be empty. Since we only put \1 in the collect box, an empty string will be collected in that case.
This action is available in the PowerGREP5.pgl library as “Logs: Inspect Apache web logs - Referring domains”.
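The same idea can be sketched in Python to see how the alternation behaves. This is an illustration under assumptions, not the PowerGREP action itself: possessive quantifiers are replaced by greedy ones, the log lines are invented, and count_domains is a hypothetical helper. One difference from the behavior described above: in Python, a group that did not take part in the match is None rather than empty, so the sketch converts it to an empty string.

```python
import re
from collections import Counter

# Group 1 holds the domain name; it stays unset when the referrer is "-".
PAGE_HIT_DOMAIN = re.compile(
    r'"GET [^\s?"]+?\.html?[^\s"]* HTTP/[0-9.]+" [0-9]+ [-0-9]+ '
    r'"(?:-|http://([-.a-z0-9]+)[^"]*)"'
)

# Invented sample data: three page hits and one image request.
log_lines = [
    'host1.example.com - - [31/Jan/2005:00:07:01 -0500] '
    '"GET /tutorial.html HTTP/1.1" 200 5120 '
    '"http://www.google.com/search?q=regex+tutorial" "ExampleBrowser/1.0"',
    'host2.example.com - - [31/Jan/2005:00:07:30 -0500] '
    '"GET /reference.html HTTP/1.1" 200 4096 '
    '"http://www.google.com/search?q=regex" "ExampleBrowser/1.0"',
    'host3.example.com - - [31/Jan/2005:00:08:15 -0500] '
    '"GET /index.html HTTP/1.1" 200 1234 "-" "ExampleBrowser/1.0"',
    'host3.example.com - - [31/Jan/2005:00:08:16 -0500] '
    '"GET /logo.gif HTTP/1.1" 200 900 '
    '"http://www.example.org/" "ExampleBrowser/1.0"',
]


def count_domains(lines):
    """Count referring domains for page hits; '' means no referrer was sent."""
    counts = Counter()
    for line in lines:
        match = PAGE_HIT_DOMAIN.search(line)
        if match:
            # group(1) is None when the "-" alternative matched.
            counts[match.group(1) or ''] += 1
    return counts


domains = count_domains(log_lines)
for domain, hits in domains.most_common():
    print(hits, domain or '(no referrer)')
```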