Chapter 9. The SiLK Suite

SiLK, the System for Internet-Level Knowledge, is a toolkit originally developed by Carnegie Mellon’s CERT to conduct large-scale NetFlow analysis. SiLK is now used extensively by the US Department of Defense, academic institutions, and technical companies as a basic analytical toolkit.

This chapter focuses primarily on using SiLK as an analytical tool. The CERT Network Situational Awareness (NetSA) Group has published extensive references on using SiLK, installing collectors, and setting up the suite.

What Is SiLK and How Does It Work?

SiLK is a suite of tools for querying and analyzing NetFlow data. The SiLK suite enables an analyst to rapidly and efficiently query very large volumes of network traffic in order to identify complex aggregate phenomena or extract individual events.

SiLK is effectively a database at the command line. Each tool performs a specific query, manipulation, or aggregation of data, and commands are chained together to produce results. By chaining multiple tools together along pipes, SiLK enables the analyst to create complex commands that feed data along multiple channels simultaneously. For example, the sequence of SiLK queries in Example 9-1 pulls HTTP (port 80) traffic from flow data, producing a time series and a list of activity by busiest address. This example illustrates the basics of SiLK operation: records are passed between commands through a series of pipes, which can be stdin, stdout, or FIFOs (named pipes).

Example 9-1. Some overly complicated rwfilter voodoo

$ mkfifo out2
$ rwfilter --proto=6 --aport=80 data.rwf --pass=stdout |
       rwfilter --input=stdin  --proto=6 --pass=stdout \
     --all=out2 | rwstats --top --count=10 --fields=1 &
     rwcount out2 --bin-size=300

Data is kept in an efficient binary representation until the last possible moment, when a command that produces text (or some other optional output) is called.

SiLK is very much an old-school Unix application suite: a family of tools tied together with pipes and using a lot of optional arguments. By using this approach, it’s possible to create powerful analytic scripts with SiLK, because the tools have well-defined interfaces that will efficiently handle binary data. Effectively using SiLK involves connecting the appropriate tools together in order to process binary data and produce text only at the very end of the process.

This chapter also uses some basic Unix shell commands, such as ls, cat, and head, but doesn’t require you to know the shell on an expert level.

Acquiring and Installing SiLK

The SiLK package is available as a free download on the CERT NetSA Security Suite web page, and can be installed on most Unix systems without much difficulty. CERT also provides a live CD image that can be used on its own.

The SiLK live CD comes with a training dataset called LBNL-05, containing anonymized header traces from Lawrence Berkeley National Labs in 2005. If you install the live CD, the data will be immediately accessible. If not, you can fetch the data from the LBNL-05 reference data page.¹

In addition to the live CD, SiLK is available in several package managers, including Homebrew.

The Datafiles

The LBNL datafiles are stored in a file hierarchy; Example 9-2 shows the results of downloading and unarchiving them.

Example 9-2. Downloading the SiLK archives

$ gunzip -c SiLK-LBNL-05-noscan.tar.gz | tar xf -
$ gunzip -c SiLK-LBNL-05-scanners.tar.gz | tar xf -
$ cd SiLK-LBNL-05
$ ls
README-S0.txt in    out   silk.conf
README-S1.txt inweb       outweb
$ ls in/2005/01/07/*.01
in/2005/01/07/in-S0_20050107.01 in/2005/01/07/in-S1_20050107.01

When collecting data, SiLK partitions the data into subdirectories that divide traffic by the type of traffic and the time the event occurred. This provides scalability and speeds up analysis. However, it’s also generally a black box, and one we’re breaking right now simply to have some files to work with. For the purposes of demonstration and education, we’re going to work with four specific files:

inweb/2005/01/06/iw-S0_20050106.20
inweb/2005/01/06/iw-S0_20050106.21
in/2005/01/07/in-S0_20050107.01
in/2005/01/07/in-S1_20050107.01

These files are not special in any way. I chose them just to provide examples of scan and nonscan traffic. The following section discusses how to partition data and what the filenames mean.

Choosing and Formatting Output Field Manipulation: rwcut

SiLK records are stored in a compact binary format. They can’t be read directly, and are instead accessed using the rwcut tool (see Example 9-3). In the following example, and any other examples with an output longer than 80 characters, the lines are manually broken for clarity.

Example 9-3. Simple file access with rwcut

$ rwcut inweb/2005/01/06/iw-S0_20050106.20 | more
            sIP|            dIP|sPort|dPort|pro|   packets|     bytes|\
   flags|                  sTime|   dur|                  eTime|sen|
 148.19.251.179|   128.3.148.48| 2497|   80|  6|        16|      2631|\
FS PA   |2005/01/06T20:01:54.119| 0.246|2005/01/06T20:01:54.365|  ?|
 148.19.251.179|   128.3.148.48| 2498|   80|  6|        14|      2159|\
 S PA   |2005/01/06T20:01:54.160| 0.260|2005/01/06T20:01:54.420|  ?|
...

In its default invocation, rwcut outputs 12 fields: source and destination IP addresses and ports, protocol, number of packets, number of bytes, TCP flags, start time, duration, end time, and sensor of a flow. These values were discussed previously in Chapter 2, except for the sensor field. SiLK can be configured to identify individual sensors, which is useful when you’re trying to figure out where traffic came from or where it’s going. The sensor field is whatever ID is assigned during configuration. In the default data there are no sensors, so the value is set to a question mark (?).

All SiLK commands have built-in documentation. Typing rwcut --help brings up an enormous help page. We will cover the basic options. A fuller description of options can be found in the SiLK documentation for rwcut.

The most commonly used rwcut options select the fields displayed during invocation. rwcut can actually print 29 different fields, in arbitrary order. A list of these fields is in Table 9-1.

rwcut fields are specified using the --fields= option, which takes the numeric values in Table 9-1 or the string values and prints the requested fields in the order specified, as in Example 9-4.

Table 9-1. rwcut fields
Field	Numeric ID	Description
`sIP`	`1`	Source IP address
`dIP`	`2`	Destination IP address
`sPort`	`3`	Source port
`dPort`	`4`	Destination port; for ICMP, the ICMP type and code are also encoded here
`protocol`	`5`	IP protocol number (for example, 6 for TCP, 17 for UDP)
`packets`	`6`	Packets in the flow
`bytes`	`7`	Bytes in the flow
`flags`	`8`	OR of TCP flags
`sTime`	`9`	Start time in seconds
`dur`	`10`	Duration (eTime–sTime)
`eTime`	`11`	End time in seconds
`sensor`	`12`	Sensor ID
`in`	`13`	SNMP ID of the incoming interface on the router
`out`	`14`	SNMP ID of the outgoing interface on the router
`nhIP`	`15`	Next hop address
`sType`	`16`	Classification of the source address (internal, external)
`dType`	`17`	Classification of the destination address (internal, external)
`scc`	`18`	Country code of the source IP
`dcc`	`19`	Country code of the destination IP
`class`	`20`	Class of the flow
`type`	`21`	Type of the flow
`sTime+msec`	`22`	sTime in milliseconds
`eTime+msec`	`23`	eTime in milliseconds
`dur+msec`	`24`	Duration in milliseconds
`icmpTypeCode`	`25`	ICMP type and code
`initialFlags`	`26`	Flags in the first TCP packet
`sessionFlags`	`27`	Flags in all packets except the first
`attributes`	`28`	Attributes of the flow observed by the generator
`application`	`29`	Guess as to the application in the flow

Example 9-4. Some examples of field ordering

# Show a limited set of fields
$ rwcut --field=1-5 inweb/2005/01/06/iw-S0_20050106.20 | head -2
            sIP|            dIP|sPort|dPort|pro|
 148.19.251.179|   128.3.148.48| 2497|   80|  6|
# Note the dash in the range above; now enumerate the fields explicitly
$ rwcut --field=1,2,3,4,5 inweb/2005/01/06/iw-S0_20050106.20 | head -2
            sIP|            dIP|sPort|dPort|pro|
 148.19.251.179|   128.3.148.48| 2497|   80|  6|
# Field order is based on what you enter in --field
$ rwcut --field=5,1,2,3,4 inweb/2005/01/06/iw-S0_20050106.20 | head -2
pro|            sIP|            dIP|sPort|dPort|
  6| 148.19.251.179|   128.3.148.48| 2497|   80|
# We can use text instead of numbers
$ rwcut --field=sIP,dIP,proto inweb/2005/01/06/iw-S0_20050106.20 |head -2
            sIP|            dIP|pro|
 148.19.251.179|   128.3.148.48|  6|
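The way a field specification turns into an ordered list of columns can be sketched in a few lines of Python. This is only an illustrative model of the behavior seen in Example 9-4, not SiLK's parser: the `expand_fields` helper and its lookup tables are hypothetical names, only the first five fields are included, and abbreviated names such as proto are not modeled.

```python
# Illustrative model of a --fields specification; not SiLK's own parser.
# Only the first five fields are included to keep the sketch short.
FIELD_NAMES = {1: "sIP", 2: "dIP", 3: "sPort", 4: "dPort", 5: "protocol"}
NAME_TO_ID = {name.lower(): num for num, name in FIELD_NAMES.items()}


def expand_fields(spec):
    """Expand a spec such as '1-3,5' or 'sIP,dIP' into ordered field names."""
    result = []
    for token in spec.split(","):
        token = token.strip()
        if token.lower() in NAME_TO_ID:      # a field name
            ids = [NAME_TO_ID[token.lower()]]
        elif "-" in token:                   # a numeric range such as 1-5
            low, high = token.split("-")
            ids = list(range(int(low), int(high) + 1))
        else:                                # a single numeric ID
            ids = [int(token)]
        result.extend(FIELD_NAMES[i] for i in ids)
    return result


print(expand_fields("1-5"))        # same columns as 1,2,3,4,5
print(expand_fields("5,1,2,3,4"))  # protocol first, then the rest
```

The point of the sketch is that the output order follows the order of the specification, which is why 5,1,2,3,4 moves the protocol column to the front.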

rwcut supports a number of other output formatting and manipulation options. Some particularly useful ones, which let you control the lines that appear in the output, include:

--no-titles: Common to SiLK commands that produce tabular output. Drops the title row from the output table.
--num-recs: Outputs a specific number of records, eliminating the need for the head pipe in Example 9-4. The default value is 0, which makes rwcut dump the entire contents of whatever file it’s reading.
--start-rec-num and --end-rec-num: Can be used to fetch a range of records in the file.

Example 9-5 shows a few ways to manipulate record numbers and headers.

Example 9-5. Manipulating record numbers and headers

# Drop the title
$ rwcut --field=1-9 --no-title inweb/2005/01/06/iw-S0_20050106.20 | head -5
 148.19.251.179|   128.3.148.48| 2497|   80|  6|        16|      2631|FS PA
  |2005/01/06T20:01:54.119|
 148.19.251.179|   128.3.148.48| 2498|   80|  6|        14|      2159| S PA
  |2005/01/06T20:01:54.160|
 148.19.251.179|   128.3.148.48| 2498|   80|  6|         2|        80|F   A
  |2005/01/06T20:07:07.845|
  56.71.233.157|   128.3.148.48|48906|   80|  6|         5|       300| S
  |2005/01/06T20:01:50.011|
   56.96.13.225|   128.3.148.48|50722|   80|  6|         6|       360| S
  |2005/01/06T20:02:57.132|
# Drop the head statement
$ rwcut --field=1-9 inweb/2005/01/06/iw-S0_20050106.20 --num-recs=5
            sIP|            dIP|sPort|dPort|pro|   packets|     bytes|   flags
|                  sTime|
 148.19.251.179|   128.3.148.48| 2497|   80|  6|        16|      2631|FS PA
|2005/01/06T20:01:54.119|
 148.19.251.179|   128.3.148.48| 2498|   80|  6|        14|      2159| S PA
|2005/01/06T20:01:54.160|
 148.19.251.179|   128.3.148.48| 2498|   80|  6|         2|        80|F   A
|2005/01/06T20:07:07.845|
  56.71.233.157|   128.3.148.48|48906|   80|  6|         5|       300| S
|2005/01/06T20:01:50.011|
   56.96.13.225|   128.3.148.48|50722|   80|  6|         6|       360| S
|2005/01/06T20:02:57.132|
# Print only the third through fifth records
$ rwcut --field=1-9 inweb/2005/01/06/iw-S0_20050106.20 --start-rec-num=3 \
  --end-rec-num=5
            sIP|            dIP|sPort|dPort|pro|   packets|     bytes|   flags
|                  sTime|
 148.19.251.179|   128.3.148.48| 2498|   80|  6|         2|        80|F   A
|2005/01/06T20:07:07.845|
  56.71.233.157|   128.3.148.48|48906|   80|  6|         5|       300| S
|2005/01/06T20:01:50.011|
   56.96.13.225|   128.3.148.48|50722|   80|  6|         6|       360| S
|2005/01/06T20:02:57.132|

A number of options manipulate output format. Tabulation is controllable with the --column-separator, --no-final-delimiter, and --no-columns switches. --column-separator changes the character used to distinguish columns, while --no-final-delimiter drops the delimiter at the end of the line. --no-columns removes any space padding between columns. The --delimited switch combines all three: it takes a character as an argument, uses that character as a column separator, removes all padding in the columns, and drops the final column separator.

In addition, there are a variety of switches for changing column content:

--integer-ips: Converts IP addresses to integers rather than dotted quads. This switch is deprecated as of SiLK v3, and users should now use --ip-format=decimal.
--ip-format: The updated version of --integer-ips, --ip-format specifies how addresses are rendered. Options include canonical (dotted quad for IPv4, canonical IPv6 for IPv6), zero-padded (canonical, except that each part of the address is padded with zeros to its maximum width, so 127.0.0.1 is 127.000.000.001), decimal (prints as the corresponding 32-bit or 128-bit integer), hexadecimal (prints the integer in hexadecimal format), and force-ipv6 (prints all addresses in canonical IPv6 format, including IPv4 addresses mapped to the ::ffff:0:0/96 netblock).
--epoch-time: Prints timestamps as epoch values with floating-point millisecond precision.
--integer-tcp-flags: Converts TCP flags to their integer equivalents.
--zero-pad-ips: Pads the dotted quad IP address format with zeros, so that 128.2.11.12 is printed as 128.002.011.012. Deprecated in favor of --ip-format in SiLK v3.
--icmp-type-and-code: Places the ICMP type in the source port and the ICMP code in the destination port.
--pager: Specifies the program to use for paging output.
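To make the alternative renderings concrete, here is a small Python sketch that converts an address and a timestamp by hand. It is not part of SiLK; the helper names are made up for illustration, the zero-padding helper handles IPv4 only, and the timestamp is assumed to be in UTC.

```python
import ipaddress
from datetime import datetime, timezone


def ip_decimal(addr):
    """Render an address as its integer value (32-bit for IPv4)."""
    return str(int(ipaddress.ip_address(addr)))


def ip_hexadecimal(addr):
    """Render the same integer in hexadecimal."""
    return format(int(ipaddress.ip_address(addr)), "x")


def ip_zero_padded(addr):
    """IPv4 only: pad each quad with zeros to three digits."""
    return ".".join(f"{int(quad):03d}" for quad in addr.split("."))


def epoch_time(stamp):
    """Convert 'YYYY/MM/DDTHH:MM:SS.mmm', assumed to be UTC, to epoch seconds."""
    parsed = datetime.strptime(stamp, "%Y/%m/%dT%H:%M:%S.%f")
    return f"{parsed.replace(tzinfo=timezone.utc).timestamp():.3f}"


print(ip_decimal("148.19.251.179"))           # 2484337587
print(ip_zero_padded("128.3.148.48"))         # 128.003.148.048
print(epoch_time("2005/01/06T20:01:54.119"))  # 1105041714.119
```

The three printed values agree with the integer, zero-padded, and epoch columns that rwcut prints for the first record of the sample file in Example 9-6.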

Example 9-6 shows some of the preceding options.

Example 9-6. Other formatting examples

# Change from fixed-width columns to delims
$ rwcut --field=1-5 inweb/2005/01/06/iw-S0_20050106.20 --no-columns --num-recs=2
sIP|dIP|sPort|dPort|protocol|
148.19.251.179|128.3.148.48|2497|80|6|
148.19.251.179|128.3.148.48|2498|80|6|
# Change the column separator
$ rwcut --field=1-5 inweb/2005/01/06/iw-S0_20050106.20 --column-sep=: \
  --num-recs=2
            sIP:            dIP:sPort:dPort:pro:
 148.19.251.179:   128.3.148.48: 2497:   80:  6:
 148.19.251.179:   128.3.148.48: 2498:   80:  6:
# Use --delim to change everything at once
$ rwcut --field=1-5 inweb/2005/01/06/iw-S0_20050106.20 --delim=: --num-recs=2
sIP:dIP:sPort:dPort:protocol
148.19.251.179:128.3.148.48:2497:80:6
148.19.251.179:128.3.148.48:2498:80:6
# Convert IP addresses to integers
$ rwcut --field=1-5 inweb/2005/01/06/iw-S0_20050106.20 --integer-ip --num-recs=2
       sIP|       dIP|sPort|dPort|pro|
2484337587|2147718192| 2497|   80|  6|
2484337587|2147718192| 2498|   80|  6|
# Use epoch time
$ rwcut --field=1-5,9 inweb/2005/01/06/iw-S0_20050106.20 --epoch --num-recs=2
            sIP|            dIP|sPort|dPort|pro|         sTime|
 148.19.251.179|   128.3.148.48| 2497|   80|  6|1105041714.119|
 148.19.251.179|   128.3.148.48| 2498|   80|  6|1105041714.160|
# Zero-pad IP addresses
$ rwcut --field=1-5,9 inweb/2005/01/06/iw-S0_20050106.20 --zero-pad --num-recs=2
            sIP|            dIP|sPort|dPort|pro|                  sTime|
148.019.251.179|128.003.148.048| 2497|   80|  6|2005/01/06T20:01:54.119|
148.019.251.179|128.003.148.048| 2498|   80|  6|2005/01/06T20:01:54.160|

You will note that, as the command lines get more complex, I have truncated the longer options. SiLK uses GNU-style long options universally, so the only requirement for specifying an option is to type enough characters to make the name unambiguous. Expect more and more truncation as we build more and more complex commands.
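The unambiguous-prefix rule is easy to model. The Python sketch below is an assumption-laden illustration rather than SiLK's option parser: `resolve_option` is a hypothetical helper, and `RWCUT_OPTIONS` is a deliberately partial list containing only options mentioned in this chapter, so a prefix that is unique here could still be ambiguous against the full option set.

```python
# Partial list of rwcut long options, for illustration only.
RWCUT_OPTIONS = [
    "fields", "num-recs", "no-titles", "no-columns", "column-separator",
    "delimited", "epoch-time", "integer-ips", "zero-pad-ips",
]


def resolve_option(prefix, options):
    """Return the one option that the prefix identifies, or raise ValueError."""
    if prefix in options:                 # an exact name always wins
        return prefix
    matches = [name for name in options if name.startswith(prefix)]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise ValueError(f"unknown option --{prefix}")
    raise ValueError(f"--{prefix} is ambiguous: {', '.join(matches)}")


print(resolve_option("field", RWCUT_OPTIONS))       # fields
print(resolve_option("column-sep", RWCUT_OPTIONS))  # column-separator
print(resolve_option("delim", RWCUT_OPTIONS))       # delimited
```

A prefix such as no- matches both no-titles and no-columns in this list, so the sketch rejects it. The same thing happens on the command line when a truncation is cut too short.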

Basic Field Manipulation: rwfilter

The most basic SiLK command with analytical value is rwcut paired with rwfilter through a pipe. Example 9-7 shows a simple rwfilter command.

Example 9-7. A simple rwfilter command

$ rwfilter --dport=80 inweb/2005/01/06/iw-S0_20050106.20 --pass=stdout \
  | rwcut --field=1-9 --num-recs=5
            sIP|            dIP|sPort|dPort|pro|   packets|     bytes|   flags
  |                  sTime|
 148.19.251.179|   128.3.148.48| 2497|   80|  6|        16|      2631|FS PA
  |2005/01/06T20:01:54.119|
 148.19.251.179|   128.3.148.48| 2498|   80|  6|        14|      2159| S PA
  |2005/01/06T20:01:54.160|
 148.19.251.179|   128.3.148.48| 2498|   80|  6|         2|        80|F   A
  |2005/01/06T20:07:07.845|
  56.71.233.157|   128.3.148.48|48906|   80|  6|         5|       300| S
  |2005/01/06T20:01:50.011|
   56.96.13.225|   128.3.148.48|50722|   80|  6|         6|       360| S
  |2005/01/06T20:02:57.132|

rwfilter with a single filter (the --dport option in this case) and a single redirect (the --pass=stdout) is about as simple as you can get. rwfilter is the workhorse of the SiLK suite: it reads input (directly from a file, using a set of globbing specifications, or through a pipe), applies one or more filters to each record in the data, and then redirects the records based on whether a record matches the filters (passes) or doesn’t match (fails).

SiLK’s rwfilter documentation is humongous, but primarily consists of repetitively describing the filter specifications for every field, so don’t be intimidated. rwfilter options basically do one of three things: they specify how to filter data, how to read data, or how to direct the results of those filters.

Ports and Protocols

The easiest filters to start with are --sport, --dport, and --protocol. As the names imply, they filter on the source port, destination port, and protocol, respectively (see Example 9-8). These values can filter on a specific value (e.g., --sport=80 will pass any traffic where the source port is 80), or a range specified with a dash or commas (so --sport=79-83 will pass anything where the source port is between 79 and 83 inclusive, and could be expressed as --sport=79,80,81,82,83).

Example 9-8. Examples of filtering by ports and by protocol

$ rwfilter --dport=4350-4360  inweb/2005/01/06/iw-S0_20050106.20 \
  --pass=stdout | rwcut --field=1-9 --num-recs=5
            sIP|            dIP|sPort|dPort|pro|   packets|     bytes|   flags
  |                  sTime|
 218.131.115.42| 131.243.105.35|   80| 4360|  6|         2|        80|F   A
  |2005/01/06T20:24:21.879|
  148.19.96.160|131.243.107.239|   80| 4350|  6|        27|     35445|FS PA
  |2005/01/06T20:59:42.451|
  148.19.96.160|131.243.107.239|   80| 4352|  6|         4|       709|FS PA
  |2005/01/06T20:59:42.507|
  148.19.96.160|131.243.107.239|   80| 4351|  6|        15|     16938|FS PA
  |2005/01/06T20:59:42.501|
  148.19.96.160|131.243.107.239|   80| 4353|  6|         4|       704|FS PA
  |2005/01/06T20:59:42.544|
$ rwfilter --sport=4000-  inweb/2005/01/06/iw-S0_20050106.20 \
  --pass=stdout | rwcut --field=1-9 --num-recs=5
            sIP|            dIP|sPort|dPort|pro|   packets|     bytes|   flags
  |                  sTime|
  56.71.233.157|   128.3.148.48|48906|   80|  6|         5|       300| S
  |2005/01/06T20:01:50.011|
   56.96.13.225|   128.3.148.48|50722|   80|  6|         6|       360| S
  |2005/01/06T20:02:57.132|
   56.96.13.225|   128.3.148.48|50726|   80|  6|         6|       360| S
  |2005/01/06T20:02:57.432|
  58.236.56.129|   128.3.148.48|32621|   80|  6|         3|       144| S
  |2005/01/06T20:12:10.747|
   56.96.13.225|   128.3.148.48|54497|  443|  6|         6|       360| S
  |2005/01/06T20:09:30.124|
$ rwfilter --dport=4350,4352  inweb/2005/01/06/iw-S0_20050106.20 \
  --pass=stdout | rwcut --field=1-9 --num-recs=5
            sIP|            dIP|sPort|dPort|pro|   packets|     bytes|   flags
  |                  sTime|
  148.19.96.160|131.243.107.239|   80| 4350|  6|        27|     35445|FS PA
  |2005/01/06T20:59:42.451|
  148.19.96.160|131.243.107.239|   80| 4352|  6|         4|       709|FS PA
  |2005/01/06T20:59:42.507|
  148.19.96.160|131.243.107.239|   80| 4352|  6|         1|        40|    A
  |2005/01/06T20:59:42.516|
$ rwfilter --proto=1 in/2005/01/07/in-S0_20050107.01 --pass=stdout \
 | rwcut --field=1-6 --num-recs=2
            sIP|            dIP|sPort|dPort|pro|   packets|
 35.223.112.236|    128.3.23.93|    0| 2048|  1|         1|
 62.198.182.170|    128.3.23.81|    0| 2048|  1|         1|
$ rwfilter --proto=1,6,17 in/2005/01/07/in-S0_20050107.01 --pass=stdout \
 | rwcut --num-recs=2 --fields=1-6
            sIP|            dIP|sPort|dPort|pro|   packets|
  116.66.41.147|131.243.163.201| 4283| 1026| 17|         1|
  116.66.41.147|131.243.163.201| 3131| 1027| 17|         1|
$ rwfilter --proto=1,6,17 in/2005/01/07/in-S0_20050107.01 --fail=stdout \
 | rwcut --num-recs=2  --fields=1-6
            sIP|            dIP|sPort|dPort|pro|   packets|
 57.120.186.177|   128.3.26.171|    0|    0| 50|        70|
 57.120.186.177|   128.3.26.171|    0|    0| 50|        81|

Note the use of --fail in the last example. Because there are 256 potential protocol values (0 through 255), specifying “everything but TCP, ICMP, and UDP” could be expressed in two ways: either by specifying everything you want (--proto=0,2-5,7-16,18-), or by using the --fail option. I’ll discuss more advanced manipulation of --pass and --fail in the next chapter.
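The equivalence between the two approaches can be checked mechanically. The following Python sketch parses the comma-dash notation, including an open-ended range such as 18-, and derives the complement of 1,6,17 over the protocol values 0 through 255. The `parse_spec` and `to_spec` helpers are hypothetical names written for this illustration; they are not SiLK functions.

```python
def parse_spec(spec, maximum):
    """Expand a comma-dash spec such as '0,2-5,18-' into a set of integers."""
    values = set()
    for token in spec.split(","):
        if "-" in token:
            low, high = token.split("-")
            high = high or str(maximum)   # '18-' means 18 through the maximum
            values.update(range(int(low), int(high) + 1))
        else:
            values.add(int(token))
    return values


def to_spec(values):
    """Render a set of integers as a compact comma-dash spec."""
    ordered = sorted(values)
    if not ordered:
        return ""
    parts = []
    start = prev = ordered[0]
    for value in ordered[1:]:
        if value == prev + 1:             # still inside the current run
            prev = value
            continue
        parts.append(str(start) if start == prev else f"{start}-{prev}")
        start = prev = value
    parts.append(str(start) if start == prev else f"{start}-{prev}")
    return ",".join(parts)


everything = set(range(256))
others = everything - parse_spec("1,6,17", 255)
print(to_spec(others))   # 0,2-5,7-16,18-255
```

Writing the complement by hand is error-prone even for three excluded values, which is the practical argument for letting --fail do the work.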

Size

Size options (e.g., bytes and packets) are similar to the protocol and port options in that you express them numerically. Unlike the enumerations (ports and protocols), these numeric values can be expressed only as single values or ranges, not as comma-separated lists. So, --packets=70-81 is acceptable, but --bytes=1,2,3,4 is not.

IP Addresses

The simplest form of IP address filtering simply expresses the IP address directly (see Example 9-9). The following examples show strict filtering on the source (--saddress) and destination (--daddress) address, and the --any-address option. --any-address will match either source or destination addresses.

Example 9-9. Filtering on IP addresses

$ rwfilter --saddress=197.142.156.83 --pass=stdout \
   in/2005/01/07/in-S0_20050107.01 | rwcut --num-recs=2
            sIP|            dIP|sPort|dPort|pro|   packets|     bytes|   flags|
                  sTime|      dur|                  eTime|sen|
 197.142.156.83|  224.2.127.254|44510| 9875| 17|        12|      7163|        |
2005/01/07T01:24:44.359|   16.756|2005/01/07T01:25:01.115|  ?|
 197.142.156.83|  224.2.127.254|44512| 9875| 17|         4|      2590|        |
2005/01/07T01:25:02.375|    5.742|2005/01/07T01:25:08.117|  ?|
$ rwfilter --daddress=128.3.26.249 --pass=stdout \
  in/2005/01/07/in-S0_20050107.01 | rwcut --num-recs=2
            sIP|            dIP|sPort|dPort|pro|   packets|     bytes|   flags|
                    sTime|      dur|                  eTime|sen|
211.210.215.142|   128.3.26.249| 4068|   25|  6|         7|       388|FS PA   |
  2005/01/07T01:27:06.789|    5.052|2005/01/07T01:27:11.841|  ?|
 203.126.20.182|   128.3.26.249|51981| 4587|  6|        56|      2240|F   A   |
  2005/01/07T01:27:04.812|   18.530|2005/01/07T01:27:23.342|  ?|
$ rwfilter --any-address=128.3.26.249 \
  --pass=stdout in/2005/01/07/in-S0_20050107.01 | rwcut --num-recs=2
            sIP|            dIP|sPort|dPort|pro|   packets|     bytes|   flags|
                    sTime|      dur|                  eTime|sen|
211.210.215.142|   128.3.26.249| 4068|   25|  6|         7|       388|FS PA   |
  2005/01/07T01:27:06.789|    5.052|2005/01/07T01:27:11.841|  ?|
 203.126.20.182|   128.3.26.249|51981| 4587|  6|        56|      2240|F   A   |
  2005/01/07T01:27:04.812|   18.530|2005/01/07T01:27:23.342|  ?|

Address options accept a variety of range descriptors. Each quad in an IP address can be expressed using the same comma-dash format that protocols and ports use. IP addresses will also accept the character x to mean 0–255. This expression can be used within each quad; SiLK will match each quad separately. In addition to this comma-dash format, SiLK can match on CIDR blocks.

SiLK supports IPv6 by using IPv6’s colon-based notation. The following are all examples of valid IPv6 filters in SiLK (Example 9-10 demonstrates the same range syntax on IPv4 data):

::ffff:x
::ffff:0:aaaa,0-5
::ffff:0.0.5-130,1,255.x

Example 9-10. Filtering on IP ranges

# Filtering on the last quad
$ rwfilter --daddress=131.243.104.x inweb/2005/01/06/iw-S0_20050106.20 \
  --pass=stdout | rwcut --field=1-5 --num-recs=5
            sIP|            dIP|sPort|dPort|pro|
 150.52.105.212|131.243.104.181|   80| 1262|  6|
 150.52.105.212|131.243.104.181|   80| 1263|  6|
  59.100.39.174| 131.243.104.27|   80| 3188|  6|
  59.100.39.174| 131.243.104.27|   80| 3191|  6|
  59.100.39.174| 131.243.104.27|   80| 3193|  6|
# Filtering a range of specific values in the third quad
$ rwfilter --daddress=131.243.104,107,219.x inweb/2005/01/06/iw-S0_20050106.20 \
   --pass=stdout | rwcut --field=1-5 --num-recs=5
            sIP|            dIP|sPort|dPort|pro|
  208.122.23.36|131.243.219.201|   80| 2473|  6|
205.233.167.250|131.243.219.201|   80| 2471|  6|
   58.68.205.40| 131.243.219.37|   80| 3433|  6|
208.233.181.122| 131.243.219.37|   80| 3434|  6|
   58.68.205.40| 131.243.219.37|   80| 3435|  6|
# Using CIDR blocks
$ rwfilter --saddress=56.81.0.0/16 inweb/2005/01/06/iw-S0_20050106.20 \
  --pass=stdout | rwcut --field=1-5 --num-recs=5
            sIP|            dIP|sPort|dPort|pro|
   56.81.19.218|131.243.219.201|   80| 2480|  6|
    56.81.16.73|131.243.219.201|   80| 2484|  6|
    56.81.16.73|131.243.219.201|   80| 2486|  6|
    56.81.30.48|131.243.219.201|  443| 2490|  6|
   56.81.31.159|131.243.219.201|  443| 2489|  6|
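The per-quad matching used in Example 9-10 can be modeled as follows. This is a rough Python sketch for IPv4 only, written to show the logic rather than to mirror SiLK's implementation; `quad_matches` and `address_matches` are hypothetical helper names, and CIDR matching is delegated to Python's standard ipaddress module.

```python
import ipaddress


def quad_matches(pattern, value):
    """Match one quad against 'x' or a comma-dash list such as '104,107,219'."""
    if pattern == "x":                     # x stands for 0-255
        return True
    for token in pattern.split(","):
        if "-" in token:
            low, high = token.split("-")
            if int(low) <= value <= int(high):
                return True
        elif int(token) == value:
            return True
    return False


def address_matches(spec, addr):
    """Match an IPv4 address against a wildcard spec or a CIDR block."""
    if "/" in spec:
        return ipaddress.ip_address(addr) in ipaddress.ip_network(spec)
    quads = [int(quad) for quad in addr.split(".")]
    # Each quad is tested separately against its own pattern.
    return all(quad_matches(p, q) for p, q in zip(spec.split("."), quads))


print(address_matches("131.243.104.x", "131.243.104.181"))          # True
print(address_matches("131.243.104,107,219.x", "131.243.105.35"))   # False
print(address_matches("56.81.0.0/16", "56.81.19.218"))              # True
```

Because every quad is tested independently, a specification such as 131.243.104,107,219.x covers three separate /24 networks, something a single CIDR block cannot express.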

Time

There are three time options: --stime, --etime, and --active-time. These options require a time range, which in SiLK is written in the format:

YYYY/MM/DDTHH:MM:SS-YYYY/MM/DDTHH:MM:SS

Note the T separating the day and hour. The --stime and --etime options filter exactly what it says on the can, which can be a bit counterintuitive; specifying --stime=2016/11/08T00:00:00-2016/11/08T00:02:00 passes any record whose start time is between midnight and two minutes after midnight on November 8, 2016. Records that started before midnight and are still being transmitted during that range will not pass. To find records that were active at any point within a particular period, use the --active-time filter.
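The difference between the two kinds of filter is easiest to see on a flow that straddles the start of the window. The Python sketch below models the comparison. It is an illustration, not SiLK code: the function names are hypothetical, and treating both ends of the range as inclusive is an assumption.

```python
from datetime import datetime

TIME_FORMAT = "%Y/%m/%dT%H:%M:%S"


def parse_range(spec):
    """Split 'START-END' into a pair of datetimes."""
    start, end = spec.split("-")
    return (datetime.strptime(start, TIME_FORMAT),
            datetime.strptime(end, TIME_FORMAT))


def passes_stime(flow_start, window):
    """Model of --stime: only the start time is compared with the window."""
    low, high = window
    return low <= flow_start <= high


def passes_active_time(flow_start, flow_end, window):
    """Model of --active-time: the flow passes if it overlaps the window."""
    low, high = window
    return flow_start <= high and flow_end >= low


window = parse_range("2016/11/08T00:00:00-2016/11/08T00:02:00")
flow_start = datetime(2016, 11, 7, 23, 59, 30)   # begins before midnight
flow_end = datetime(2016, 11, 8, 0, 1, 0)        # still running inside the window

print(passes_stime(flow_start, window))                  # False
print(passes_active_time(flow_start, flow_end, window))  # True
```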

TCP Options

Flows are aggregates of packets, and in the majority of cases, this aggregation is relatively easy to understand. For example, the number of bytes in a flow is the sum of the number of bytes in all the packets that comprise the flow. TCP flags, however, are a bit more problematic. In NetFlow v5, a flow’s flags are the bitwise OR of the flags in its constituent packets—meaning that a flow indicates that a flag was present or absent in the entire flow, but not where. A flow could conceivably consist of a gibberish sequence of flags such as a FIN, then an ACK and SYN. Monitoring software such as Yet Another Flowmeter (YAF) expands NetFlow to include additional flag fields, which SiLK can take advantage of.
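A short Python sketch shows why the OR loses ordering information. The bit values are the standard TCP flag bits; everything else (the `packet`, `flow_flags`, and `render` helpers and the two made-up packet sequences) is hypothetical and exists only to illustrate the aggregation.

```python
from functools import reduce
from operator import or_

# Standard TCP flag bits, listed in the order the flags column prints them.
FLAG_BITS = [("F", 0x01), ("S", 0x02), ("R", 0x04), ("P", 0x08),
             ("A", 0x10), ("U", 0x20), ("E", 0x40), ("C", 0x80)]
BIT = dict(FLAG_BITS)


def packet(letters):
    """Build one packet's flag byte from letters such as 'PA'."""
    return reduce(or_, (BIT[ch] for ch in letters), 0)


def flow_flags(packets):
    """OR together the flag bytes of every packet in the flow."""
    return reduce(or_, packets, 0)


def render(flags):
    """Format a flag byte as an eight-character column, blank where unset."""
    return "".join(ch if flags & bit else " " for ch, bit in FLAG_BITS)


orderly = [packet(p) for p in ("S", "A", "PA", "FA")]    # a plausible session
gibberish = [packet(p) for p in ("F", "A", "S", "PA")]   # nonsense ordering

print(repr(render(flow_flags(orderly))))    # 'FS PA   '
print(repr(render(flow_flags(gibberish))))  # 'FS PA   '
```

Both sequences collapse to the same flags value, so the flow record alone cannot tell you which one was on the wire.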

The core flag filtering switches are --flags-initial, --flags-all, and --flags-session. These options accept flags in the form <high flags>/<mask flags>. If a flag is listed in the mask, SiLK examines it; flags outside the mask are ignored. If an examined flag is also listed in the high flags, the record passes only when that flag is set; if it is listed only in the mask, the record passes only when that flag is unset. The flags themselves are expressed using the characters in Table 9-2.

Table 9-2. Expressing TCP flags in rwfilter
Character	Flag
`F`	`FIN`
`S`	`SYN`
`R`	`RST`
`P`	`PSH`
`A`	`ACK`
`U`	`URG`
`E`	`ECE`
`C`	`CWR`

The combination of high flags and mask flags tends to confuse people, so let’s review some examples. Remember that the basic rule is that for a flag to be evaluated, it must be in the mask. A flag specified as high but not specified in the mask will be ignored. So:

Setting the value to S/S will pass any record where the SYN flag is high, regardless of what the other flags are set to.
Setting the value to S/SA will pass any record where the SYN flag is high and the ACK flag is low.
Setting the value to SA/SA will pass any record where both the SYN and ACK flags are high.
A combination like SAF/SAFR will return any record where the SYN, ACK, and FIN flags are high and the RST flag is low, which would be expected of a normal TCP connection.
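The high/mask rule reduces to one bitwise comparison. The sketch below is a Python model of the rule as described above, not SiLK's implementation; `to_bits` and `flags_match` are hypothetical helper names, and flows are written as flag letters for readability.

```python
# Standard TCP flag bits keyed by the characters used in the flag specification.
FLAG_BITS = {"F": 0x01, "S": 0x02, "R": 0x04, "P": 0x08,
             "A": 0x10, "U": 0x20, "E": 0x40, "C": 0x80}


def to_bits(letters):
    """Convert flag letters such as 'FS PA' into a bit field (spaces ignored)."""
    bits = 0
    for ch in letters.replace(" ", ""):
        bits |= FLAG_BITS[ch]
    return bits


def flags_match(spec, flow_letters):
    """Apply a 'high/mask' spec to the flags seen in a flow."""
    high_part, mask_part = spec.split("/")
    mask = to_bits(mask_part)
    high = to_bits(high_part) & mask      # high flags outside the mask are ignored
    # Only masked bits are compared; each must equal its value in high.
    return (to_bits(flow_letters) & mask) == high


print(flags_match("S/S", "FS PA"))        # True: SYN is set, the rest is ignored
print(flags_match("S/SA", "SA"))          # False: ACK must be low
print(flags_match("SAF/SAFR", "FS PA"))   # True: SYN, ACK, FIN high and RST low
print(flags_match("SAF/SAFR", "FSRPA"))   # False: RST is set
```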

In addition to these options, SiLK provides a set of flag-specific options in the form of --syn-flag, --fin-flag, and so on for each potential flag. These options take a 1 or 0 as an argument: setting the value to 1 will pass records where the flag is high, 0 will pass records where the flag is low, and not including the option will pass all records.

What Should TCP Flags Look Like?

The combination of TCP flags in any particular flow can be a useful indicator of the flow’s behavior, and there are certain flag combinations that raise suspicion.

Almost all TCP flows should pass either SAF/SAFR or SAR/SAFR, without passing SAFR/SAFR. This is because most sessions will end in a FIN, with aberrations ending in an RST. If both FIN and RST are seen, that should be suspicious; I’ve seen servers that intentionally terminated sessions with RST, but that’s bad practice.

A TCP session without an ACK flag is curious, especially if that session has four or more packets. Stacks are usually hardcoded to give up after n packets, where n tends to be in the neighborhood of three.

For a client, the initial flag should be a SYN, while a server should have a SYN+ACK. You should never see a SYN after the initial flag. Resynchronization would mean a new session started using the same ephemeral port, which is weird for TCP.

The PSH and URG flags are, in my mind, the universal indicators of boring sessions. If I see a session without PSH, especially if the session is long, it strikes me as curious. In my mind, a “normal” TCP session will have FSPA high. A flow with just PA high is usually a keepalive and an indication of a broken flow—look in the repository for the same address combination and you’ll probably find a SAP flow occurring before it.

Backscatter/response messages include A, SA, and RA flows. A good number of RA packets will arrive on any large network due to backscatter from spoofed DDoS attacks. There isn’t really anything you can do about these packets; they’re not even directly aimed at your network.

The “new flags” (ECE and CWR) are used to manage congestion notification, and fit into the same category as PSH and URG in my mind—indicators of tedious normality. However, they are new flags, and are only partially adopted at this time—Apple has been aggressively using these flags, but they still show up only in a minority of traffic.

Helper Options

If you compare rwfilter’s option-based filtering against tcpdump’s BPF filtering, it’s immediately obvious that rwfilter’s approach is much more primitive. This was an intentional decision: rwfilter is focused on processing large volumes as quickly as possible, and the overhead involved in processing some kind of parseable language was deemed too expensive.

What usually trips people up is the lack of obvious not and or operators. For example, if you want to pull out all web sessions, you might try to pass traffic where one port is 80 and the other is ephemeral. The initial attempt might be:

$ rwfilter --sport=80,1024-65535 --dport=80,1024-65535 --pass=stdout

The problem is that this will also pass any flows where the source and destination port are both 80, and flows where the source and destination port are both ephemeral. To deal with such issues, rwfilter has a collection of helper options that, combined with the --fail option and multiple filters, should be able to address any of these problems.

In the case of ports, the --aport option refers to either the source or the destination port. Using --aport and two filters, you can identify the appropriate sessions as follows:

$ rwfilter --aport=80 --pass=stdout | rwfilter --input-pipe=stdin \
   --aport=1024-65535 --pass=stdout

The first filter identifies anything engaged in port 80 traffic, and the second takes that set and identifies anything that also used an ephemeral port.
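The difference between the naive filter and the chained one can be checked with a small truth-table style sketch in Python. This models the logic only; the function names are hypothetical, and ephemeral ports are taken to be 1024 through 65535 as in the commands above.

```python
EPHEMERAL = range(1024, 65536)


def web_or_ephemeral(port):
    """True if the port is 80 or in the ephemeral range."""
    return port == 80 or port in EPHEMERAL


def single_filter(sport, dport):
    """Model of --sport=80,1024-65535 --dport=80,1024-65535 in one rwfilter."""
    return web_or_ephemeral(sport) and web_or_ephemeral(dport)


def chained_filter(sport, dport):
    """Model of --aport=80 piped into --aport=1024-65535."""
    first = sport == 80 or dport == 80                   # either port is 80
    second = sport in EPHEMERAL or dport in EPHEMERAL    # either port is ephemeral
    return first and second


for ports in [(49152, 80), (80, 80), (5000, 6000)]:
    print(ports, single_filter(*ports), chained_filter(*ports))
```

Only the first pair, an ephemeral port talking to port 80, passes both versions. The single filter also lets through the 80-to-80 and ephemeral-to-ephemeral flows that the chained version rejects.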

A number of IP address helper options are available. --any-address filters across source and destination addresses simultaneously. --not-saddress and --not-daddress pass records with addresses that don’t match the option specification.

Miscellaneous Filtering Options and Some Hacks

rwfilter has a couple of direct text output options: --print-stat and --print-volume-stat (see Example 9-11). These can be used to print a summary of the traffic without having to resort to rwcut, rwcount, or other display tools. They also print the volume of records that did not pass a filter.

Example 9-11. Using --print-stat

$ rwfilter --print-volume-stat in/2005/01/07/in-S0_20050107.01 --proto=0-255
     |              Recs|           Packets|               Bytes|     Files|
Total|              2019|           2730488|           402105501|         1|
 Pass|              2019|           2730488|           402105501|          |
 Fail|                 0|                 0|                   0|          |
$ rwfilter --print-stat in/2005/01/07/in-S0_20050107.01 --proto=0-255
Files     1.  Read       2019.  Pass       2019. Fail           0.

Note in Example 9-11 the use of the --proto=0-255 option. In almost all invocations, rwfilter expects some form of filtering applied to it, so when you need a filter that passes everything, the easiest approach is just to specify all the protocols. --print-stat and --print-volume-stat output to stderr, so you can still use stdout for pass, fail, and all channels.

Like rwcut, rwfilter has record-limiting options. --max-pass-records and --max-fail-records can be used to limit the number of records passed through a pass or fail channel.

rwfileinfo and Provenance

SiLK filter files contain a fair amount of metadata, which can be accessed using the rwfileinfo command (see Example 9-12). rwfileinfo can work with files, as seen in the examples here, or directly on stdin by using stdin or - as an argument.

Example 9-12. Using rwfileinfo

$ rwfileinfo in/2005/01/07/in-S0_20050107.01
in/2005/01/07/in-S0_20050107.01:
  format(id)          FT_RWAUGMENTED(0x14)
  version             2
  byte-order          littleEndian
  compression(id)     none(0)
  header-length       28
  record-length       28
  record-version      2
  silk-version        0
  count-records       2019
  file-size           56560
  packed-file-info    2005/01/07T01:00:00 ? ?
$ rwfilter --print-stat in/2005/01/07/in-S0_20050107.01 --proto=6
  --pass=example.rwf
Files     1.  Read       2019.  Pass       1353. Fail         666.
$ rwfileinfo example.rwf
example.rwf:
  format(id)          FT_RWGENERIC(0x16)
  version             16
  byte-order          littleEndian
  compression(id)     none(0)
  header-length       156
  record-length       52
  record-version      5
  silk-version        2.1.0
  count-records       1353
  file-size           70512
  command-lines
                   1  rwfilter --print-stat --proto=6 --pass=example.rwf
  in/2005/01/07/in-S0_20050107.01
$ rwfilter --aport=25 example.rwf --pass=example2.rwf --fail=example2_fail.rwf
$ rwfileinfo example2.rwf
example2.rwf:
  format(id)          FT_RWGENERIC(0x16)
  version             16
  byte-order          littleEndian
  compression(id)     none(0)
  header-length       208
  record-length       52
  record-version      5
  silk-version        2.1.0
  count-records       95
  file-size           5148
  command-lines
                   1  rwfilter --print-stat --proto=6 --pass=example.rwf
  in/2005/01/07/in-S0_20050107.01
                   2  rwfilter --aport=25 --pass=example2.rwf
  --fail=example2_fail.rwf example.rwf

The fields reported by rwfileinfo are as follows:

example2.rwf: The first line of every rwfileinfo dump is the name of the file.
format(id): SiLK files are maintained in a number of different optimized formats; the format value is a C macro describing the type of the file, followed by the hexadecimal ID of that type.
version: The version of the file format.
byte-order: The order in which bytes are stored on disk; SiLK maintains distinct little- and big-endian formats for faster reading.
compression(id): Whether the file is natively compressed, again for faster reading.
header-length: The size of the file header; a SiLK file with no records will be just the size of the header.
record-length: The size of individual file records. This value will be 1 if records are variable length.
record-version: The version of the records (note that record versions are distinct from file versions and SiLK versions).
silk-version: The version of the SiLK suite used to create the file.
count-records: The number of records in the file.
file-size: The total size of the file; if the file is uncompressed, this value should be equivalent to the header length added to the product of the record length and record count.
command-lines: A record of the SiLK commands used to create the file.
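
The file-size relationship described above can be checked against the dumps in Example 9-12. The following is a minimal sketch; the function name is an illustrative assumption, and the numbers are copied from the rwfileinfo output shown earlier.

```python
def expected_file_size(header_length, record_length, count_records):
    """Size of an uncompressed SiLK file: header plus fixed-length records."""
    return header_length + record_length * count_records

# (header-length, record-length, count-records, file-size) from Example 9-12
dumps = {
    "in-S0_20050107.01": (28, 28, 2019, 56560),
    "example.rwf": (156, 52, 1353, 70512),
    "example2.rwf": (208, 52, 95, 5148),
}

for name, (header, record, count, size) in dumps.items():
    print(name, expected_file_size(header, record, count) == size)
```

Note how the header grows from 156 to 208 bytes between example.rwf and example2.rwf: each additional entry under command-lines is stored in the header.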

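Example 9-13 shows how to use the --note-add option, which attaches a free-text annotation to the output file’s header; rwfileinfo reports it under annotations.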

Example 9-13. Using --note-add

$ rwfilter --aport=22 example.rwf --note-add='Filtering ssh' --pass=ex2.rwf
$ rwfileinfo ex2.rwf
ex2.rwf:
  format(id)          FT_RWGENERIC(0x16)
  version             16
  byte-order          littleEndian
  compression(id)     none(0)
  header-length       260
  record-length       52
  record-version      5
  silk-version        2.1.0
  count-records       10
  file-size           780
  command-lines
                   1  rwfilter --print-stat --proto=6 --pass=example.rwf
  in/2005/01/07/in-S0_20050107.01
                   2  rwfilter --aport=22 --note-add=Filtering ssh
  --pass=ex2.rwf example.rwf
  annotations
                   1  Filtering ssh

Combining Information Flows: rwcount

rwcount can produce time series data from the output of an rwfilter command. It works by placing counts of bytes, packets, and flow records into fixed-duration bins, which are equally sized time periods specified by the user. rwcount is a relatively straightforward application. Most of its complexity comes from relating the flows, which themselves have a duration, to the bins.

The simplest invocation of rwcount is shown in Example 9-14. The first thing to notice is the use of the --bin-size option. In this example, the bins are half an hour, or 1,800 seconds. If --bin-size isn’t specified, rwcount will default to 30-second bins. Bin sizes don’t have to be integers; floating-point specifications with a resolution down to the millisecond are acceptable for people who like lots of bins in their output.

Example 9-14. Simple rwcount invocation

$ rwfilter in/2005/01/07/in-S0_20050107.01 --all=stdout |
   rwcount --bin-size=1800
               Date|        Records|               Bytes|          Packets|
2005/01/07T01:00:00|         257.58|         42827381.72|        248724.14|
2005/01/07T01:30:00|        1589.61|        211453506.60|       1438751.93|
2005/01/07T02:00:00|         171.81|        147824612.67|       1043011.93|

As Example 9-14 shows, rwcount outputs four columns: a date column in SiLK’s standard date format (YYYY/MM/DDTHH:MM:SS), followed by record, byte, and packet columns. The floating-point values are a function of rwcount interpolating how much traffic should be in each bin; rwcount calls this a load scheme.

The load scheme is an attempt by rwcount to approximate how much of a flow took place over the period specified by the bins. In the default load scheme, rwcount splits each flow proportionally across all the bins during which the flow was taking place. For example, if a flow takes place from 00:04:00 to 00:11:00, and bins are 5 minutes long, 1/7 of the flow will be added to the first (00:00:00–00:04:59) bin, 5/7 to the second bin (00:05:00–00:09:59), and 1/7 to the third (00:10:00–00:14:59) bin. rwcount takes an integer parameter in the --load-scheme option, with the following results:

0: Split the traffic evenly across all bins covered. In the example flow given in the previous paragraph, the flow would be split into thirds, and a third added to each bin.
1: Add the entire flow to the first bin covered by the flow: 00:00:00–00:04:59 in the above example.
2: Add the entire flow to the last bin covered by the flow: in the example above, 00:10:00–00:14:59.
3: Add the entire flow to the middle bin covered by the flow: in the example above, 00:05:00–00:09:59.
4: The default load scheme.
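
The load schemes can be sketched in Python as follows. This is a simplified model under stated assumptions, not rwcount's implementation: times are integer seconds, the flow covers the half-open interval from start to end, and the choice of "middle" bin when a flow covers an even number of bins is arbitrary here. The function name allocate is hypothetical.

```python
from fractions import Fraction

def allocate(start, end, bin_size, scheme=4):
    """Return {bin_start: fraction_of_flow} for a flow active over [start, end)."""
    first = start // bin_size
    last = max(first, (end - 1) // bin_size)
    bins = [b * bin_size for b in range(first, last + 1)]
    if scheme == 0:                       # split evenly across covered bins
        return {b: Fraction(1, len(bins)) for b in bins}
    if scheme == 1:                       # everything in the first bin
        return {bins[0]: Fraction(1)}
    if scheme == 2:                       # everything in the last bin
        return {bins[-1]: Fraction(1)}
    if scheme == 3:                       # everything in the middle bin
        return {bins[len(bins) // 2]: Fraction(1)}
    if scheme == 4:                       # proportional to time spent in each bin
        duration = end - start
        if duration == 0:
            return {bins[0]: Fraction(1)}
        return {
            b: Fraction(min(end, b + bin_size) - max(start, b), duration)
            for b in bins
        }
    raise ValueError("unknown load scheme")

# The flow from the text: 00:04:00 to 00:11:00 with 5-minute bins.
for scheme in range(5):
    print(scheme, allocate(240, 660, 300, scheme))
```

For scheme 4, the sketch assigns 1/7, 5/7, and 1/7 of the flow to the bins starting at 0, 300, and 600 seconds, matching the worked example in the text.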

rwcount uses the flow data provided to guess which time bins are required, but sometimes you have to explicitly specify the time, especially when coordinating multiple files. This can be done using the --start-epoch and --end-epoch options to specify starting and ending bin times. Note that these parameters can use the epoch time or yyyy/mm/dd:HH:MM:SS format. rwcount also has an option to print dates using epoch time: the --epoch-slots option.

The --skip-zero option (see Example 9-15) is one of a number of output format options. Normally, rwcount prints every empty bin it has allocated, but --skip-zero causes empty bins to be omitted from the output. In addition, rwcount supports many of the output options mentioned for rwcut: --no-titles, --no-columns, --column-separator, --no-final-delimiter, and --delimited.

Example 9-15. Using epoch slots and the --skip-zero option

$ rwfilter in/2005/01/07/in-S0_20050107.01 --all=stdout |
   rwcount --bin-size=1800.00 --epoch
               Date|        Records|               Bytes|          Packets|
         1105059600|         257.58|         42827381.72|        248724.14|
         1105061400|        1589.61|        211453506.60|       1438751.93|
         1105063200|         171.81|        147824612.67|       1043011.93|
$ rwfilter in/2005/01/07/in-S0_20050107.01 --all=stdout |
   rwcount --bin-size=1800.00 --epoch --start-epoch=1105057800
               Date|        Records|               Bytes|          Packets|
         1105057800|           0.00|                0.00|             0.00|
         1105059600|         257.58|         42827381.72|        248724.14|
         1105061400|        1589.61|        211453506.60|       1438751.93|
         1105063200|         171.81|        147824612.67|       1043011.93|
$ rwfilter in/2005/01/07/in-S0_20050107.01 --all=stdout |
   rwcount --bin-size=1800.00 --epoch --start-epoch=1105056000
               Date|        Records|               Bytes|          Packets|
         1105056000|           0.00|                0.00|             0.00|
         1105057800|           0.00|                0.00|             0.00|
         1105059600|         257.58|         42827381.72|        248724.14|
         1105061400|        1589.61|        211453506.60|       1438751.93|
         1105063200|         171.81|        147824612.67|       1043011.93|
$ rwfilter in/2005/01/07/in-S0_20050107.01 --all=stdout |
   rwcount --bin-size=1800.00 --epoch --start-epoch=1105056000 --skip-zero
               Date|        Records|               Bytes|          Packets|
         1105059600|         257.58|         42827381.72|        248724.14|
         1105061400|        1589.61|        211453506.60|       1438751.93|
         1105063200|         171.81|        147824612.67|       1043011.93|
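
To relate the epoch values in Example 9-15 to the dates in Example 9-14, the conversion can be done in Python. This sketch assumes the timestamps are in UTC; the helper name silk_date is hypothetical.

```python
from datetime import datetime, timezone

def silk_date(epoch):
    """Format an epoch time in the YYYY/MM/DDTHH:MM:SS style used by rwcount."""
    moment = datetime.fromtimestamp(epoch, tz=timezone.utc)
    return moment.strftime("%Y/%m/%dT%H:%M:%S")

# Bin start times taken from Example 9-15 (1,800-second bins).
for epoch in (1105059600, 1105061400, 1105063200):
    print(epoch, silk_date(epoch))
```

Under the UTC assumption, the three values correspond to the 01:00:00, 01:30:00, and 02:00:00 bins on 2005/01/07.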

rwset and IP Sets

IP sets are SiLK’s most powerful capability, and something that distinguishes the toolkit from most other analytical tools. An IP set is a binary representation of an arbitrary collection of IP addresses. IP sets can be created from text files, from SiLK data, or by using other binary SiLK structures.

The easiest way to start with IP sets is to create one, as in Example 9-16.

Example 9-16. Creating IP sets with rwset

$ rwfilter in/2005/01/07/in-S0_20050107.01 --all=stdout |
  rwset --sip-file=sip.set --dip-file=dip.set
$ ls -l *.set
-rw-r--r--  1 mcollins  staff    580 Jan 10 01:06 dip.set
-rw-r--r--  1 mcollins  staff  15088 Jan 10 01:06 sip.set
$ rwsetcat sip.set | head -5
0.0.0.0
32.16.40.178
32.24.41.181
32.24.215.49
32.30.13.177
$ rwfileinfo sip.set
sip.set:
  format(id)          FT_IPSET(0x1d)
  version             16
  byte-order          littleEndian
  compression(id)     none(0)
  header-length       76
  record-length       1
  record-version      2
  silk-version        2.1.0
  count-records       15012
  file-size           15088
  command-lines
                   1  rwset --sip-file=sip.set --dip-file=dip.set

rwset takes flow records and produces up to four output files. The file specified with --sip-file will contain source IP addresses from the flow, --dip-file will contain destination addresses, --any-file will contain source and destination IP addresses, and --nhip-file will contain next hop addresses. The output is binary and is read with rwsetcat, and as with all SiLK files, the file can be examined using rwfileinfo.

The power of IP sets comes when they’re combined with rwfilter. rwfilter has eight options that accept IP sets (--sipset, --dipset, --nhipset, --anyset, and their negations). Sets are explicitly designed so rwfilter can rapidly query using them, enabling a variety of useful queries, as seen in Example 9-17.

Example 9-17. Set manipulation and response

# First, we create IP sets; I use --aport=123 (NTP on UDP) to filter down
# to a reasonable set of addresses.  NTP clients and servers use the same
# port.
$ rwfilter in/2005/01/07/in-S0_20050107.01 --pass=stdout --aport=123 |
   rwset --sip-file=sip.set --dip-file=dip.set
# Now, let's see how many IP addresses are created.
$ rwsetcat --count-ip sip.set
15
# Generating output using rwfilter; note the use of sip.set as the argument
# to --dipset; this means that I'm now looking for messages sent in response
# to these addresses.  I've therefore seen NTP going to and from the
# address, meaning it's likely to be a legitimate speaker, as opposed to a
# scan on port 123.
$ rwfilter out/2005/01/07/out-S0_20050107.01 --dipset=sip.set --pass=stdout
   --aport=123 | rwcut | head -5
            sIP|            dIP|sPort|dPort|pro|   packets|     bytes|   \
flags|                    sTime|      dur|                  eTime|sen|
   128.3.23.152|    56.7.90.229|  123|  123| 17|         1|        76|   \
     |  2005/01/07T01:10:00.520|    0.083|2005/01/07T01:10:00.603|  ?|
   128.3.23.152|  192.41.221.11|  123|  123| 17|         1|        76|   \
     |  2005/01/07T01:10:15.519|    0.000|2005/01/07T01:10:15.519|  ?|
   128.3.23.231| 87.221.134.185|  123|  123| 17|         1|        76|   \
     |  2005/01/07T01:24:46.251|    0.005|2005/01/07T01:24:46.256|  ?|
   128.3.26.152| 58.243.214.183|  123|10123| 17|         1|        76|   \
     |  2005/01/07T01:27:08.854|    0.000|2005/01/07T01:27:08.854|  ?|
# Let's look at statistics; using the same file, I look at the hosts
# that responded.
$ rwfilter out/2005/01/07/out-S0_20050107.01 --dipset=sip.set  --aport=123
   --print-stat
Files     1.  Read      12393.  Pass         21. Fail       12372.
# Now I look at everyone else; --not-dipset means that I'm looking at everything
# on port 123 that doesn't go to these addresses.
$ rwfilter out/2005/01/07/out-S0_20050107.01 --not-dipset=sip.set  --aport=123
   --print-stat
Files     1.  Read      12393.  Pass        337. Fail       12056.

Sets can also be generated by hand using rwsetbuild, which takes text input and produces a set file as the output. The rwsetbuild specification takes any of the IP address specifications used by the --saddress option in rwfilter: literal addresses, integers, ranges within dotted quads, and netmasks. Example 9-18 demonstrates this.

Example 9-18. Building a set using rwsetbuild

$ cat > setsample.txt
# Comments in set files are prefaced with a hashmark
# Literal address
255.230.1.1
# Note that I'm putting addresses in some semi-random order; the output
# will be ordered
111.2.3-4.1-2
# Netmask
22.11.1.128/30
^D
$ rwsetbuild setsample.txt setsample.set
$ rwsetcat --print-ip setsample.set
22.11.1.128
22.11.1.129
22.11.1.130
22.11.1.131
111.2.3.1
111.2.3.2
111.2.4.1
111.2.4.2
255.230.1.1
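
The expansion performed on the entries in Example 9-18 can be sketched with Python's ipaddress module. This is an illustrative model rather than rwsetbuild's parser: it handles only literal addresses, per-octet ranges, and CIDR blocks, and the function name expand is an assumption.

```python
import ipaddress
from itertools import product

def expand(spec):
    """Expand one text entry into a list of IPv4Address objects."""
    if "/" in spec:                                   # CIDR block
        return list(ipaddress.ip_network(spec, strict=False))
    octets = []
    for part in spec.split("."):                      # literal or a-b per octet
        low, _, high = part.partition("-")
        octets.append(range(int(low), int(high or low) + 1))
    return [
        ipaddress.IPv4Address(".".join(map(str, combo)))
        for combo in product(*octets)
    ]

entries = ["255.230.1.1", "111.2.3-4.1-2", "22.11.1.128/30"]
# A set removes duplicates; sorting orders addresses numerically.
addresses = sorted({ip for entry in entries for ip in expand(entry)})
for ip in addresses:
    print(ip)
```

The sketch yields the same nine addresses, in the same order, as the rwsetcat output in Example 9-18.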

Sets can also be manipulated using the rwsettool command, which provides a variety of mechanisms for combining and subtracting sets. The basic manipulations are:

--union: Creates a set that includes any address that appears in any of the sets.
--intersect: Creates a set that includes only addresses that appear in all the sets specified.
--difference: Removes addresses in the latter sets from the first set.
--sample: Randomly samples a set to produce a subset.

rwsettool is generally invoked using an output path (--output=file), but if nothing is specified, it will dump to stdout. As with rwfilter, rwsettool output is binary, so a pure terminal dump triggers an error. Example 9-19 shows a manipulation with rwsettool.

Example 9-19. Set manipulation with rwsettool

$ rm setsample2.set
$ cat > setsample2.txt
# Build a set that covers our original setsample file to
# see what happens with various functions
22.11.1.128/29
^D
$ rwsetbuild setsample2.txt setsample2.set
$ rwsettool --union setsample.set setsample2.set | rwsetcat
22.11.1.128
22.11.1.129
22.11.1.130
22.11.1.131
22.11.1.132
22.11.1.133
22.11.1.134
22.11.1.135
111.2.3.1
111.2.3.2
111.2.4.1
111.2.4.2
255.230.1.1
$ rwsettool --intersect setsample.set setsample2.set | rwsetcat
22.11.1.128
22.11.1.129
22.11.1.130
22.11.1.131
$ rwsettool --difference setsample.set setsample2.set | rwsetcat
111.2.3.1
111.2.3.2
111.2.4.1
111.2.4.2
255.230.1.1
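
The three manipulations in Example 9-19 map directly onto ordinary set algebra. The following Python sketch models them with built-in sets; it illustrates the semantics only and says nothing about how rwsettool stores or processes sets. The helper names are assumptions.

```python
import ipaddress

def block(cidr):
    """All addresses in a CIDR block, as strings."""
    return {str(ip) for ip in ipaddress.ip_network(cidr)}

def ordered(addresses):
    """Sort addresses numerically, as rwsetcat does in its output."""
    return sorted(addresses, key=ipaddress.ip_address)

setsample = block("22.11.1.128/30") | {
    "111.2.3.1", "111.2.3.2", "111.2.4.1", "111.2.4.2", "255.230.1.1",
}
setsample2 = block("22.11.1.128/29")

union = ordered(setsample | setsample2)           # --union
intersect = ordered(setsample & setsample2)       # --intersect
difference = ordered(setsample - setsample2)      # --difference

print(len(union), intersect, difference)
```

The union has 13 addresses, the intersection is the original /30, and the difference leaves the five addresses outside 22.11.1.128/29, as in the example output.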

Finally, there’s the rwsetmember command, which is effectively a set-based grep. Using rwsetmember, you can query multiple sets simultaneously about whether an IP address is present, as seen in the following examples:

$ rwsetcat x.set
4.8.2.1
92.11.3.15
128.2.1.1
$ rwsetcat y.set
44.3.17.2
99.3.5.5
128.2.1.1
$ rwsetmember 128.2.1.1 *.set
x.set
y.set
$ rwsetmember 99.3.5.5 *.set
y.set
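
As a rough model of what rwsetmember reports, the following Python sketch checks one address against several named sets. The dictionary stands in for the set files on disk, and the function name set_member is hypothetical.

```python
# Contents of x.set and y.set, as listed by rwsetcat above.
named_sets = {
    "x.set": {"4.8.2.1", "92.11.3.15", "128.2.1.1"},
    "y.set": {"44.3.17.2", "99.3.5.5", "128.2.1.1"},
}

def set_member(address, sets):
    """Return the names of every set that contains the address."""
    return [name for name in sorted(sets) if address in sets[name]]

print(set_member("128.2.1.1", named_sets))
print(set_member("99.3.5.5", named_sets))
```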

Caching IP Sets

rwsetmember facilitates a very common and handy SiLK hack: generating periodic (usually hourly or daily) sets of IP addresses engaged in a particular activity. For example, you may run a daily query creating a set of all the incoming IP addresses per day, storing the results as a distinct set. Then, in your directory of day1.set to dayn.set, you can run rwsetmember and get a list of every day on which the IP address appeared.

Raw flow data has a very low signal-to-noise ratio, so caching information on a per-address basis via sets and bags saves you space and time. Useful information to save includes:

IP addresses (incoming and outgoing)
Scanners
Visits from blacklists or other threat intelligence sources
Traffic per host
Legitimate traffic per host (TCP flows of four or more packets with the SYN, ACK, and FIN flags, i.e., SAF, set)

rwuniq

rwuniq is the utility knife of counting tools. It allows an analyst to specify a key containing one or more fields, and will then count a number of different values, including total number of bytes, packets, flow records, or unique IP addresses matching the key.

rwuniq’s default configuration counts the number of flows that occurred for a particular key. The key itself must be specified using the --field option, which accepts the field specifiers in Table 9-1. rwuniq can accept multiple fields, and the key will be generated in the order specified in the command line. Example 9-20 demonstrates the key features of the --field option. As it shows, field order in the option affects field ordering in the output.

Example 9-20. Various field specifiers using rwuniq

$ rwfilter out/2005/01/07/out-S0_20050107.01 --all=stdout |
  rwuniq --field=sip,proto | head -4
            sIP|pro|   Records|
 131.243.142.85| 17|         1|
131.243.141.187| 17|         6|
    128.3.23.41| 17|         4|
$ rwfilter out/2005/01/07/out-S0_20050107.01 --all=stdout |
  rwuniq --field=1,2 | head -4
            sIP|            dIP|   Records|
  128.3.174.158|    128.3.23.44|         2|
    128.3.191.1|239.255.255.253|         8|
   128.3.161.98|131.243.163.206|         1|
$ rwfilter out/2005/01/07/out-S0_20050107.01 --all=stdout |
  rwuniq --field=sip,sport | head -4
            sIP|sPort|   Records|
 131.243.63.143|53504|         1|
 131.243.219.52|61506|         1|
131.243.163.206| 1032|         1|
$ rwfilter out/2005/01/07/out-S0_20050107.01 --all=stdout |
  rwuniq --field=sport,sip | head -4
sPort|            sIP|   Records|
55876|  131.243.61.70|         1|
51864|131.243.103.106|         1|
50955| 131.243.103.13|         1|

Note that when the order of the fields changes, the order in which records are output also changes. rwuniq does not guarantee record ordering by default; output can be sorted by using the --sort-output option.

rwuniq provides a number of count switches that instruct it to count additional values (see Example 9-21). The counting switches are --bytes, --packets, --flows, --sip-distinct, and --dip-distinct. Each of these switches can be used on its own or with a threshold (e.g., --bytes, --bytes=10, or --bytes=10-100). A single-value threshold (--bytes=10) provides a minimum, while a two-value threshold (--bytes=10-100) provides a range with a minimum and maximum. If you don’t specify an argument, then the switch returns all values.

Example 9-21. Field spec with rwuniq

$ rwfilter out/2005/01/07/out-S0_20050107.01 --all=stdout |
  rwuniq --field=sport,sip --bytes --packets | head -5
sPort|            sIP|               Bytes|   Packets|
55876|  131.243.61.70|                 308|         4|
51864|131.243.103.106|                 308|         4|
50955| 131.243.103.13|                 308|         4|
56568|  128.3.212.145|                 360|         5|
$ rwfilter out/2005/01/07/out-S0_20050107.01 --all=stdout |
  rwuniq --field=sport,sip --bytes --packets=8 | head -5
sPort|            sIP|               Bytes|   Packets|
    0| 131.243.30.224|                2520|        30|
  959|   128.3.215.60|                 876|        19|
 2315|131.243.124.237|                 608|         8|
56838| 131.243.61.187|                 616|         8|
$ rwfilter out/2005/01/07/out-S0_20050107.01 --all=stdout |
  rwuniq --field=sport,sip --bytes --packets=8-20 | head -5
sPort|            sIP|               Bytes|   Packets|
  959|   128.3.215.60|                 876|        19|
 2315|131.243.124.237|                 608|         8|
56838| 131.243.61.187|                 616|         8|
  514|   128.3.97.166|                2233|        20|
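
The keyed counting and thresholding that rwuniq performs can be modeled in a few lines of Python. This is a sketch under assumptions: the flow records are hypothetical (they are not taken from the sample data), the function name uniq is invented, and the threshold is applied to the per-key totals.

```python
from collections import defaultdict

# Hypothetical flow records: (sip, sport, bytes, packets)
flows = [
    ("10.0.0.1", 1025, 308, 4),
    ("10.0.0.1", 1025, 300, 4),
    ("10.0.0.2", 514, 2233, 20),
    ("10.0.0.3", 80, 6000, 30),
    ("10.0.0.4", 2000, 308, 4),
]

def uniq(records, packets_min=None, packets_max=None):
    """Sum bytes and packets per (sport, sip) key, then apply a packet threshold."""
    totals = defaultdict(lambda: [0, 0])
    for sip, sport, nbytes, npackets in records:
        key = (sport, sip)                 # key order mirrors --field=sport,sip
        totals[key][0] += nbytes
        totals[key][1] += npackets
    rows = []
    for key, (nbytes, npackets) in totals.items():
        if packets_min is not None and npackets < packets_min:
            continue
        if packets_max is not None and npackets > packets_max:
            continue
        rows.append((*key, nbytes, npackets))
    return sorted(rows)                    # sorted here only for stable output

print(uniq(flows))                                   # like --bytes --packets
print(uniq(flows, packets_min=8))                    # like --packets=8
print(uniq(flows, packets_min=8, packets_max=20))    # like --packets=8-20
```

In this model the two records for 10.0.0.1 on port 1025 merge into one row with 8 packets, which is why that key survives a minimum threshold of 8 even though neither individual flow would.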

rwbag

The last set of tools to discuss in this chapter is the bag tools. A bag is a form of storage structure that contains a key (which can be an IP address, a port, the protocol, or an interface index) and a count of values for that key. Bags can be created from scratch or from flow data using the rwbag command (see Example 9-22).

Example 9-22. An rwbag call, creating an IP address bag

$ rwfilter out/2005/01/07/out-S0_20050107.01 --all=stdout |
  rwbag --sip-bytes=sip_bytes.bag
$ rwbagcat sip_bytes.bag | head -5
     128.3.2.16|            10026403|
     128.3.2.46|               27946|
     128.3.2.96|              218605|
     128.3.2.98|                 636|
    128.3.2.102|                1568|

Like sets, bags are a second-order binary structure for SiLK, meaning that they have their own toolkit (rwbagcat, rwbagtool, and rwbagbuild), the data is binary (so it can’t be read with cat or a text editor), and they can be derived from flow data or built from a datafile.

The basic bag generation tool is rwbag, which as seen in Example 9-22 takes flow data and produces a bag file from it. rwbag can generate 27 types of bags, simultaneously if you’re so inclined. These 27 types comprise 3 types of counting (bytes, packets, and flows) and 9 types of key (sip, dip, sport, dport, proto, sensor, input, output, nhip). Combine the key and the counting type, and you have a switch that will create a bag. For example, to count all packets from source and destination IP addresses, call rwbag --sip-packets=b1.bag --dip-packets=b2.bag.
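
The key-and-count naming pattern can be enumerated mechanically. The following Python sketch builds the switch names from the nine keys and three counting types listed above; it only illustrates the pattern described in the text, so check rwbag --help on your installation for the switches your version actually accepts.

```python
from itertools import product

KEYS = ["sip", "dip", "sport", "dport", "proto", "sensor", "input", "output", "nhip"]
COUNTS = ["bytes", "packets", "flows"]

# Combine each key with each counting type, e.g. --sip-bytes.
switches = [f"--{key}-{count}" for key, count in product(KEYS, COUNTS)]

print(len(switches))
print(switches[:3])
```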

Advanced SiLK Facilities

In this section, we discuss more advanced SiLK facilities: in particular, the use of PMAPs and the collection and conversion of SiLK data.

PMAPs

A SiLK prefix map (PMAP) is a binary file that associates specific subnetworks (prefixes) with tags. PMAPs are used to record various mappings of a network, such as whether a network belongs to a particular organization or ASN, and for country code lookup. Using a source such as GeoIP, you can build a PMAP that associates IP addresses with their country of origin.

The SiLK tool suite expects some basic PMAPs:

address_types.pmap: Describes an address’s type, conventionally indicating whether the address is inside or outside of the network you are monitoring. Specify the default filesystem location for this PMAP using the SILK_ADDRESS_TYPES environment variable.
country_codes.pmap: This PMAP describes the country code for an address. Specify the default location for this PMAP using the SILK_COUNTRY_CODES environment variable.

PMAPs, like set files, can be created from text. Example 9-23 shows a simple PMAP file. Note the following attributes:

The set of labels at the beginning. PMAPs do not store strings, but enumerable types identified by an integer. This enumeration is defined using the labels. You can see that the PMAP in Example 9-23, for instance, stores a 3 to mark normal traffic.
The default key. Any value that doesn’t match one of the network blocks listed in the map is given the default value.
The actual declarations. Each declaration consists of a network specification, such as 192.168.0.0/16, followed by a label.

Example 9-23. PMAP input

# This is a simple PMAP file that tracks some of the standard RFC 1918
# reserved addresses
#
# First we create some labels
label 0 1918-reserved
label 1 multicast
label 2 future
label 3 normal
#
# Specify the mode; this must be either ip or proto-port. ip in this case
# refers to v4 addresses.
#
mode ip
#
# Everything otherwise not specified is normal
default normal
# Now the maps
192.168.0.0/16    1918-reserved
10.0.0.0/8        1918-reserved
172.16.0.0/12     1918-reserved
224.0.0.0/4       multicast
240.0.0.0/4   future
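
The lookup behavior that the PMAP input above describes (match an address against the listed blocks, fall back to the default) can be sketched in Python. This is a model of the semantics only, not of the binary PMAP format; the function name lookup is an assumption, and choosing the longest matching prefix is the assumed rule for overlapping blocks, which does not arise in this example.

```python
import ipaddress

DEFAULT = "normal"
PREFIXES = [
    ("192.168.0.0/16", "1918-reserved"),
    ("10.0.0.0/8", "1918-reserved"),
    ("172.16.0.0/12", "1918-reserved"),
    ("224.0.0.0/4", "multicast"),
    ("240.0.0.0/4", "future"),
]
TABLE = [(ipaddress.ip_network(prefix), label) for prefix, label in PREFIXES]

def lookup(address):
    """Return the label of the longest matching prefix, or the default."""
    ip = ipaddress.ip_address(address)
    matches = [(net.prefixlen, label) for net, label in TABLE if ip in net]
    return max(matches)[1] if matches else DEFAULT

for address in ("192.168.1.12", "224.0.0.251", "240.0.0.1", "8.8.8.8"):
    print(address, lookup(address))
```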

Once you’ve created a text representation of the PMAP, you can compile the binary PMAP file using the rwpmapbuild command. rwpmapbuild has two mandatory arguments: an input filename, with the file in the text format described previously, and a name for the output file. As with most SiLK commands, rwpmapbuild will not overwrite an existing output file. For example:

$ rwpmapbuild -i reserve.txt -o reserve.pmap
$ ls -l reserve.*
  -rw-r--r-- 1 mcollins staff 406 May 27 17:16 reserve.pmap
  -rw-r--r-- 1 mcollins staff 526 May 27 17:00 reserve.txt

Once a PMAP file is created, it can be added to rwfilter and rwcut using the --pmap-file argument. Specifying the use of a PMAP file effectively creates a new set of fields in the filter and cut commands; since PMAP files are explicitly related to IP addresses, these new fields are bound to IP addresses.

Consider Example 9-24, which uses rwcut. In this example, the --pmap-file argument is colon-delimited; the value before the colon (reserve in the example) is a label, and the value after is a filename. rwcut binds the term reserve to the PMAPs for the source and destination IP address, creating two new fields: src-reserve (for the mapping of the source address to the PMAP) and dst-reserve (for the mapping of the destination address).

Example 9-24. Creating the src-reserve and dst-reserve fields

$ rwcut --pmap-file=reserve:reserve.pmap --fields=1-4,src-reserve,dst-reserve
  traceroute.rwf | head -5
          sIP|          dIP|sPort|dPort|   src-reserve|   dst-reserve|
 192.168.1.12|  192.168.1.1|65428|   53| 1918-reserved| 1918-reserved|
 192.168.1.12|  192.168.1.1|56126|   53| 1918-reserved| 1918-reserved|
 192.168.1.12|  192.168.1.1|52055|   53| 1918-reserved| 1918-reserved|
  192.168.1.1| 192.168.1.12|   53|56126| 1918-reserved| 1918-reserved|


# Using the pmap in filter; note that rwcut is not using the pmap
$ rwfilter --pmap-file=reserve:reserve.pmap --pass=stdout traceroute.rwf
   --pmap-src-reserve=1918-reserved  | rwcut --field=1-5
   | head -5
          sIP|          dIP|sPort|dPort|pro|
 192.168.1.12|  192.168.1.1|65428|   53| 17|
 192.168.1.12|  192.168.1.1|56126|   53| 17|
 192.168.1.12|  192.168.1.1|52055|   53| 17|
  192.168.1.1| 192.168.1.12|   53|56126| 17|

Collecting SiLK Data

There are a number of different tools for collecting data and pushing it into SiLK. The major ones are YAF, which is a flow collector, and rwptoflow and rwtuc, which convert other data into SiLK format.

YAF

Yet Another Flowmeter (YAF) is the reference implementation for the IETF IPFIX standard, and is the standard flow collection software for the SiLK toolkit. YAF can read pcap data from files or capture packets directly, which it then assembles into flow records and exports to disk. The tool itself can be entirely configured using command-line options, but the number of options is fairly daunting. At its simplest, a YAF command looks like this:

$ sudo yaf -i en1 --live=pcap --out /tmp/yaf/yaf

This reads data from interface en1 and writes it to the file /tmp/yaf/yaf. Additional options control how data is read and how it is converted into flow records and other output formats.

yaf output is specified via the --out switch in tandem with the --ipfix and --rotate switches. By default, --out outputs to a file; in the preceding example, the file is /tmp/yaf/yaf, but any valid filename will do (if --out is set to -, then yaf will output to stdout).

When --out is specified with --rotate, yaf writes the output to files that are rotated at an interval specified by the --rotate switch (e.g., --rotate 3600 will start a new file every hour). In this mode, yaf uses the name specified by --out as a base filename, and attaches a suffix in YYYYMMDDhhmmss format, along with a decimal serial number and a .yaf file extension.

When yaf is invoked with the --ipfix switch, it communicates IPFIX data to a daemon located elsewhere on the network. In this case (the most complicated option), --ipfix takes a transport protocol as an argument, while --out takes the IP address of the host. The additional --ipfix-port switch takes a port number when needed. Consult the documentation for more information.

The most important options are:

--live: Specifies the type of data being read; possible values are pcap, dag, or napatech. dag and napatech refer to proprietary packet capture systems, so unless you have that hardware, just set --live to pcap.
--filter: Applies a BPF filter to the pcap data.
--out: The output specifier, discussed previously. This will be a file, a file prefix, or an IP address depending on whatever other switches are used.
--ipfix: Takes a transport protocol (tcp, udp, sctp, or spread) as an argument, and specifies that output is IPFIX-transported over the network. Consult the yaf documentation for more information.
--ipfix-port: Used only if --ipfix is specified. Specifies the port that the IPFIX data is sent to.
--rotate: Used only with files. If present, the filename in --out is used as a prefix, and files are written with a timestamp appended to them. The --rotate option takes as its argument the number of seconds to wait before moving to a new file.
--silk: Specifies output that can be parsed by SiLK’s rwflowpack tools.
--idle-timeout: Specifies the idle timeout for flows in seconds. If a flow is present in the flow cache and isn’t active, it’s flushed as soon as it’s been inactive for the duration of the idle timeout. Defaults to 300 seconds (5 minutes).
--active-timeout: Specifies the active timeout for flows, or the maximum amount of time an active flow will be stored in the cache before being flushed. Defaults to 30 minutes (1,800 seconds). Note that the active timeout determines the maximum observed duration of collected flows.
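
The effect of the two timeouts can be illustrated with a simplified Python model. This is not how yaf is implemented (a real flow cache flushes on timers rather than on the arrival of the next packet); it only shows how one long conversation ends up as several flow records. The function name split_flow and the packet timestamps are hypothetical.

```python
def split_flow(packet_times, idle_timeout=300, active_timeout=1800):
    """Split one conversation's packet timestamps (seconds) into flow records.

    Returns a list of (start_time, end_time, packet_count) tuples.
    """
    flows = []
    current = []
    for t in sorted(packet_times):
        idle_expired = current and t - current[-1] > idle_timeout
        active_expired = current and t - current[0] >= active_timeout
        if idle_expired or active_expired:
            flows.append((current[0], current[-1], len(current)))
            current = []
        current.append(t)
    if current:
        flows.append((current[0], current[-1], len(current)))
    return flows

# One packet per minute for an hour: cut in two by the active timeout.
print(split_flow(range(0, 3600, 60)))
# A burst, a long silence, then another burst: cut by the idle timeout.
print(split_flow([0, 10, 20, 1000, 1010]))
```

In the first case no flow record lasts longer than the active timeout, which is the sense in which that setting bounds the observed duration of collected flows.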

YAF has many more options, but these are the basic ones to consider when configuring flows. Consult the yaf manpage for more details.

rwptoflow

SiLK uses its own compact binary formats to represent NetFlow data that tools such as rwcut and rwcount present in a human-readable form. There are times when an analyst needs to convert other data into SiLK format, such as when taking packet captures from IDS alerts and converting them into a format where IP set filtering can be done on the data.

The go-to tool for this task is rwptoflow, which converts packet data to flow data. It does not aggregate packets into flows; instead, each packet read by rwptoflow is converted into a one-packet flow record. The resulting file can then be manipulated by the SiLK suite like any other flow file.

rwptoflow is invoked relatively simply with an input filename as its argument. In Example 9-25, the pcap data from a traceroute is converted into flow data using rwptoflow. The resulting raw file is then read using rwcut, and you can see the correspondence between the traceroute records and the resulting flow records.

Example 9-25. Converting pcap data with rwptoflow

$ tcpdump -v -n -r traceroute.pcap  | head -6
reading from file traceroute.pcap, link-type EN10MB (Ethernet)
21:06:50.559146 IP (tos 0x0, ttl 255, id 8010, offset 0, flags [none],
    proto UDP (17), length 64)
    192.168.1.12.65428 > 192.168.1.1.53: 63077+ A? jaws.oscar.aol.com. (36)
21:06:50.559157 IP (tos 0x0, ttl 255, id 37467, offset 0, flags [none],
    proto UDP (17), length 86)
    192.168.1.12.56126 > 192.168.1.1.53: 30980+ PTR?
    dr._dns-sd._udp.0.1.168.192.in-addr.arpa. (58)
21:06:50.559158 IP (tos 0x0, ttl 255, id 2942, offset 0, flags [none],
    proto UDP (17), length 66)
    192.168.1.12.52055 > 192.168.1.1.53: 990+ PTR? db._dns-sd._udp.home. (38)
$ rwptoflow traceroute.pcap > traceroute.rwf
$ rwcut --num-recs=3 --fields=1-5 traceroute.rwf
          sIP|          dIP|sPort|dPort|pro|
 192.168.1.12|  192.168.1.1|65428|   53| 17|
 192.168.1.12|  192.168.1.1|56126|   53| 17|
 192.168.1.12|  192.168.1.1|52055|   53| 17|
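
The difference between one-record-per-packet conversion and ordinary flow aggregation can be sketched in Python. This is an illustration only, not rwptoflow's code: the function names and the dictionary layout are hypothetical, and setting the byte count to the IP length reported by tcpdump is an assumption.

```python
def one_flow_per_packet(packets):
    """Emit a one-packet flow record for every packet (no aggregation)."""
    return [
        {
            "sip": p["sip"], "dip": p["dip"],
            "sport": p["sport"], "dport": p["dport"],
            "proto": p["proto"],
            "packets": 1,
            "bytes": p["length"],  # assumption: bytes is the IP length
        }
        for p in packets
    ]


def aggregate_by_five_tuple(packets):
    """For contrast: merge packets that share a five-tuple, as a flow
    collector would (timeouts are ignored in this sketch)."""
    flows = {}
    for p in packets:
        key = (p["sip"], p["dip"], p["sport"], p["dport"], p["proto"])
        flow = flows.setdefault(key, {"packets": 0, "bytes": 0})
        flow["packets"] += 1
        flow["bytes"] += p["length"]
    return flows


# The first three packets mirror Example 9-25. The fourth is a made-up
# repeat of the first five-tuple, added only to show the difference.
PACKETS = [
    {"sip": "192.168.1.12", "dip": "192.168.1.1", "sport": 65428,
     "dport": 53, "proto": 17, "length": 64},
    {"sip": "192.168.1.12", "dip": "192.168.1.1", "sport": 56126,
     "dport": 53, "proto": 17, "length": 86},
    {"sip": "192.168.1.12", "dip": "192.168.1.1", "sport": 52055,
     "dport": 53, "proto": 17, "length": 66},
    {"sip": "192.168.1.12", "dip": "192.168.1.1", "sport": 65428,
     "dport": 53, "proto": 17, "length": 64},
]

if __name__ == "__main__":
    print(len(one_flow_per_packet(PACKETS)))      # one record per packet
    print(len(aggregate_by_five_tuple(PACKETS)))  # one record per five-tuple
```

Because nothing is merged, the per-packet output grows linearly with the capture, which is worth keeping in mind before converting large pcap files.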

rwtuc

When correlating data from different sources, you will occasionally want to convert non-SiLK data into SiLK's format. rwtuc is the default tool for this conversion, as it works with columnar text files. Using rwtuc, you can convert IDS alerts and other data into SiLK data for further manipulation.

The easiest way to invoke rwtuc is to use it as an inverse of rwcut. Create a file with columnar entries and make sure that the titles match those used by rwcut:

$ cat rwtuc_sample.txt
sIP         |dIP        |proto
128.2.11.4  | 29.3.11.4 | 6
11.8.3.15   | 9.12.1.4  | 17
$ rwtuc < rwtuc_sample.txt > rwtuc_sample.rwf
$ rwcut rwtuc_sample.rwf --fields=1-6
         sIP|        dIP|sPort|dPort|pro|   packets|
  128.2.11.4|  29.3.11.4|    0|    0|  6|         1|
   11.8.3.15|   9.12.1.4|    0|    0| 17|         1|

As the preceding fragment shows, rwtuc reads the columns, uses the headers to determine each column's content, and fills any field that has no column with a default value. rwtuc can also take column specifications at the command line using the --fields and --column-separator switches, like so:

$ cat rwtuc_sample2.txt
128.2.11.4  x 29.3.11.4 x 6 x 5
7.3.1.1    x  128.2.11.4 x 17 x 3
$ rwtuc --fields=sip,dip,proto,packets --column-sep=x < rwtuc_sample2.txt \
  > rwtuc_sample2.rwf
$ rwcut --fields=1-7 rwtuc_sample2.rwf
        sIP|        dIP|sPort|dPort|pro|   packets|     bytes|
 128.2.11.4|  29.3.11.4|    0|    0|  6|         5|         5|
    7.3.1.1| 128.2.11.4|    0|    0| 17|         3|         3|

SiLK’s binary format requires values for every field, which means that rwtuc makes a best guess for field values that it doesn’t have. For instance, the previous example specifies packets as a field but not bytes, so rwtuc simply sets the byte value to be identical to the packet value.

If there exists a common default value (e.g., all traffic has the same protocol), this value can be defined using one of a number of field-stuffing options in rwtuc. These options are identical to the field filtering options in rwfilter, except they only take single values. For example, --proto=17 sets the protocol of every entry to 17.
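
The way a record gets its values can be modeled in a few lines of Python. This is a sketch of the behavior visible in this section's sample outputs, not rwtuc's actual logic: the function name is hypothetical, the defaults (ports of 0, one packet, bytes equal to packets) are inferred from the rwcut output shown here, and the case where a field appears both as a column and as a fixed value is deliberately left unmodeled.

```python
# Defaults inferred from the sample outputs in this section,
# not taken from rwtuc's source or manual.
OBSERVED_DEFAULTS = {"sport": 0, "dport": 0, "packets": 1}


def fill_record(parsed, fixed=None):
    """Combine parsed columns, fixed values, and defaults into one record.

    `parsed` holds the values read from the text file; `fixed` holds
    single values of the kind supplied by a field-stuffing option.
    """
    fixed = fixed or {}
    overlap = set(parsed) & set(fixed)
    if overlap:
        # Which side wins in rwtuc is not covered by the text, so the
        # sketch refuses to guess.
        raise ValueError("field given twice: %s" % sorted(overlap))
    record = dict(parsed)
    record.update(fixed)
    for name, value in OBSERVED_DEFAULTS.items():
        record.setdefault(name, value)
    # With no byte count available, reuse the packet count.
    record.setdefault("bytes", record["packets"])
    return record


if __name__ == "__main__":
    row = {"sip": "128.2.11.4", "dip": "29.3.11.4", "proto": 6, "packets": 5}
    print(fill_record(row))
    print(fill_record(row, fixed={"bytes": 300}))
```
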

In the following fragment, we use the field-stuffing option --bytes=300 to set a value of 300 bytes for every entry in rwtuc_sample2.txt:

$ rwtuc --fields=sip,dip,proto,packets --column-sep=x --bytes=300 \
  < rwtuc_sample2.txt > rwtuc_sample2.rwf
$ rwcut --fields=1-7 rwtuc_sample2.rwf
          sIP|         dIP|sPort|dPort|pro|   packets|     bytes|
   128.2.11.4|   29.3.11.4|    0|    0|  6|         5|       300|
      7.3.1.1|  128.2.11.4|    0|    0| 17|         3|       300|

Every record in the resulting RWF file will contain a byte value of 300, even though no byte value appears in the original text file. The packet values, which are specified in the file, are taken from it unchanged.
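
When the source data is something like a list of IDS alerts, a small script can produce the columnar text that rwtuc expects. The following Python sketch assumes a hypothetical alert structure (dictionaries with src, dst, sport, dport, and proto keys) and uses column titles taken from the examples in this section; the alert values are made up for illustration.

```python
# (column title, alert key) pairs. Titles follow the examples in this
# section; the alert keys are a hypothetical structure.
COLUMNS = [
    ("sIP", "src"),
    ("dIP", "dst"),
    ("sPort", "sport"),
    ("dPort", "dport"),
    ("proto", "proto"),
]


def alerts_to_rwtuc_text(alerts):
    """Render alerts as pipe-delimited text with a title row."""
    lines = ["|".join(title for title, _ in COLUMNS)]
    for alert in alerts:
        lines.append("|".join(str(alert[key]) for _, key in COLUMNS))
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Made-up alerts, for illustration only.
    sample = [
        {"src": "128.2.11.4", "dst": "29.3.11.4",
         "sport": 40000, "dport": 80, "proto": 6},
        {"src": "11.8.3.15", "dst": "9.12.1.4",
         "sport": 5353, "dport": 53, "proto": 17},
    ]
    with open("alerts.txt", "w") as handle:
        handle.write(alerts_to_rwtuc_text(sample))
```

The resulting alerts.txt could then be fed to rwtuc on stdin in the same way as rwtuc_sample.txt above.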

rwrandomizeip

rwrandomizeip is a tool to shuffle IP addresses in order to anonymize data for public release. Anonymization is itself a complex process, and should be considered on a case-by-case basis. To that end, rwrandomizeip provides a number of different anonymization techniques, including pure randomization and consistent mapping.

The basic invocation of rwrandomizeip takes an input file and an output file, and replaces both the source and destination addresses with randomly generated ones. Each invocation produces a different result:

$ cat rwtuc_sample3.txt
sIP         |dIP        |proto
128.2.11.4  | 29.3.11.4 | 6
11.8.3.15   | 9.12.1.4  | 17
128.2.11.4  | 29.3.99.8 | 6
9.88.4.17   | 29.3.11.4 | 6
$ rwtuc < rwtuc_sample3.txt | rwrandomizeip stdin stdout | rwcut --fields=1-7 \
  --ipv6=ignore
            sIP|            dIP|sPort|dPort|pro|   packets|     bytes|
    10.93.81.37|   10.85.44.118|    0|    0|  6|         1|         1|
   10.99.53.145| 10.130.150.112|    0|    0| 17|         1|         1|
  10.146.120.29|   10.31.222.59|    0|    0|  6|         1|         1|
    10.3.86.205| 10.206.186.249|    0|    0|  6|         1|         1|
$ rwtuc < rwtuc_sample3.txt | rwrandomizeip stdin stdout | rwcut --fields=1-7 \
  --ipv6=ignore
            sIP|            dIP|sPort|dPort|pro|   packets|     bytes|
 10.147.117.187| 10.161.218.135|    0|    0|  6|         1|         1|
   10.15.216.69|  10.85.128.237|    0|    0| 17|         1|         1|
  10.148.145.16|  10.231.231.13|    0|    0|  6|         1|         1|
   10.255.35.36| 10.240.107.198|    0|    0|  6|         1|         1|

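Specifying a seed with the --seed switch (which takes an integer) makes the randomization reproducible: repeated invocations with the same seed produce the same output. The mapping is still not consistent within a run, however; in the output below, the source address 128.2.11.4 (first and third records) is replaced by two different addresses: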

$ rwtuc < rwtuc_sample3.txt | rwrandomizeip --seed=590 stdin stdout | rwcut \
  --fields=1-7 --ipv6=ignore
            sIP|            dIP|sPort|dPort|pro|   packets|     bytes|
  10.147.108.49|  10.207.87.141|    0|    0|  6|         1|         1|
   10.193.249.8| 172.29.236.141|    0|    0| 17|         1|         1|
     10.3.188.2|   10.103.37.28|    0|    0|  6|         1|         1|
  10.40.122.115| 10.247.125.160|    0|    0|  6|         1|         1|
$ rwtuc < rwtuc_sample3.txt | rwrandomizeip --seed=590 stdin stdout | rwcut \
  --fields=1-7 --ipv6=ignore
            sIP|            dIP|sPort|dPort|pro|   packets|     bytes|
  10.147.108.49|  10.207.87.141|    0|    0|  6|         1|         1|
   10.193.249.8| 172.29.236.141|    0|    0| 17|         1|         1|
     10.3.188.2|   10.103.37.28|    0|    0|  6|         1|         1|
  10.40.122.115| 10.247.125.160|    0|    0|  6|         1|         1|
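
The role of the seed can be shown with a small Python sketch. It is not rwrandomizeip's algorithm: the function name is hypothetical, and drawing every replacement from 10.0.0.0/8 is a simplifying assumption (the tool's output above also contains a 172.29 address).

```python
import random


def randomize_addresses(addresses, seed=None):
    """Replace each address with an independently drawn 10.x.y.z address.

    With a seed, the sequence of draws repeats on every call; without
    one, the generator is seeded from system entropy.
    """
    rng = random.Random(seed)
    return [
        "10.%d.%d.%d" % (rng.randrange(256), rng.randrange(256),
                         rng.randrange(256))
        for _ in addresses
    ]


if __name__ == "__main__":
    sources = ["128.2.11.4", "11.8.3.15", "128.2.11.4", "9.88.4.17"]
    print(randomize_addresses(sources, seed=590))
    print(randomize_addresses(sources, seed=590))  # same list again
    print(randomize_addresses(sources))            # differs run to run
```

Each address is drawn independently of its input value, so a seed makes a run repeatable but does not make two occurrences of the same address map to the same replacement.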

An alternative approach is to use the --consistent switch, which generates a per-octet randomization that can be saved to a separate shuffle file with --save-table. Once created, the shuffle file can be reloaded and reused (via the --load-table switch) so that later runs apply the same mapping:

$ rwtuc < rwtuc_sample3.txt | rwrandomizeip --consistent --save-table=ipmap \
  stdin stdout | rwcut --fields=1-7 --ipv6=ignore
            sIP|            dIP|sPort|dPort|pro|   packets|     bytes|
  47.116.224.20|  60.107.224.20|    0|    0|  6|         1|         1|
   211.8.97.234|  41.140.114.20|    0|    0| 17|         1|         1|
  47.116.224.20|  60.107.220.71|    0|    0|  6|         1|         1|
   41.24.235.32|  60.107.224.20|    0|    0|  6|         1|         1|

Note that in this example, the IP addresses in 29.3 are consistently mapped to 60.107.
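
A per-octet mapping of this kind can be sketched in Python. This illustrates the general idea only and does not reproduce rwrandomizeip's table format or algorithm: the function names are hypothetical, and the use of one shuffled 256-entry table per octet position is an assumption that happens to agree with the output above.

```python
import random


def build_octet_tables(seed=None):
    """Build four permutations of 0..255, one per octet position."""
    rng = random.Random(seed)
    tables = []
    for _ in range(4):
        table = list(range(256))
        rng.shuffle(table)
        tables.append(table)
    return tables


def map_address(address, tables):
    """Replace each octet using the table for its position."""
    octets = [int(part) for part in address.split(".")]
    return ".".join(str(tables[i][octet]) for i, octet in enumerate(octets))


if __name__ == "__main__":
    tables = build_octet_tables(seed=1)
    for addr in ("29.3.11.4", "29.3.99.8", "128.2.11.4"):
        print(addr, "->", map_address(addr, tables))
```

Because each table is a permutation, distinct addresses stay distinct, and addresses that share leading octets keep a shared (randomized) prefix. That preserves network structure for analysis, but it also leaks some information, which is part of why anonymization needs case-by-case consideration.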
