When working with any file, the first task is to become familiar with the file schema. In simple terms, we need to know what is represented by each field and what is used to delimit the fields. We will be working with the access log file from an Apache HTTPD web server. The location of the log file can be controlled from the httpd.conf file. The default log file location on a Debian-based system is /var/log/apache2/access.log; other systems may use the httpd directory in place of apache2.
The log file is already in the code bundle, so you can download it and use it directly.
Using the tail command, we can display the end of the log file. To be fair, cat will do just as well with this file, as it contains only a few lines:
$ tail /var/log/apache2/access.log
The output of the command and the contents of the file are shown in the following screenshot:

The output wraps a little onto new lines, but we do get a feel for the layout of the log. We can also see that, even though it feels as if we accessed just one web page, we in fact requested two items: index.html and ubuntu-logo.png. We also failed to access the favicon.ico file. We can see that the file is space separated. The meaning of each of the fields is laid out in the following table:
| Field | Purpose |
|-------|---------|
| 1 | Client IP address. |
| 2 | Client identity as defined by RFC 1413 and the identd client. This is not read unless IdentityCheck is enabled. If it is not read, the value will be a hyphen. |
| 3 | The user ID from user authentication, if enabled. If authentication is not enabled, the value will be a hyphen. |
| 4 | The date and time of the request in the format day/month/year:hour:minute:second offset. |
| 5 | The actual request and method. |
| 6 | The return status code, such as 200 or 404. |
| 7 | Size of the returned object in bytes. |
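As a quick check of this layout, the following minimal sketch feeds a single entry to awk and prints the first three fields. The sample line is made up for illustration and is not taken from the log in the code bundle; the variable names sample and first_three are likewise only placeholders.

```shell
# Hypothetical sample entry laid out like an Apache access log line
# (invented for illustration, not copied from the real log).
sample='192.168.0.10 - - [10/Sep/2015:10:24:31 +0100] "GET /index.html HTTP/1.1" 200 3594'

# awk splits on runs of whitespace by default, so the first three
# fields in the table map directly to $1, $2, and $3.
first_three=$(printf '%s\n' "$sample" | awk '{ print $1, $2, $3 }')

echo "$first_three"   # prints: 192.168.0.10 - -
```

The two hyphens show that neither the identd lookup nor user authentication supplied a value for this request.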
Even though these fields are defined by Apache, we have to be careful. The date, time, and time zone are a single field enclosed in square brackets; however, there is a space inside the field between the time and the time zone offset, so awk splits it into two fields. To ensure that we print the complete time field if required, we need to print both $4 and $5. This is shown in the following command example:
$ awk ' { print $4,$5 } ' /var/log/apache2/access.log
We can view the command and the output it produces in the following screenshot:

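Because awk also splits the quoted request on its spaces, the later fields shift further along the line than the table numbering suggests. The sketch below, which again uses an invented sample entry and placeholder variable names, joins $4 and $5, strips the square brackets, and prints the status code and size from $9 and $10. Those positions assume that the request contains exactly three space-separated parts (method, path, and protocol).

```shell
# Hypothetical sample entry (invented for illustration).
sample='192.168.0.10 - - [10/Sep/2015:10:24:31 +0100] "GET /favicon.ico HTTP/1.1" 404 503'

# $4 and $5 together hold the timestamp; gsub removes the brackets.
# The request occupies $6, $7, and $8, so the status code is $9
# and the size is $10.
summary=$(printf '%s\n' "$sample" | awk '{
    stamp = $4 " " $5
    gsub(/\[|\]/, "", stamp)
    print stamp, $9, $10
}')

echo "$summary"   # prints: 10/Sep/2015:10:24:31 +0100 404 503
```

To run the same awk program against the real log, replace the printf pipeline with the path to the access log file.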