(extended) Berkeley Packet Filter
Be conservative in what you do, be liberal in what you accept from others. - RFC 793
Preamble
Throughout human history, we have witnessed various forms of communication. We use language to communicate on a daily basis. Writing letters is also a form of communication, albeit one with slow transmission and reception times. Then came the era of telecommunication. Through electrical and electronic means, we soon realized that communication can be near-instantaneous. This, I believe, is one of the greatest achievements of mankind. It really made the world feel like a small village. The Internet is the closest thing we have to the Anywhere Door from Doraemon.
Computer networking is no different from other forms of telecommunication. A separate hardware component, the Network Interface Controller (NIC), is used to communicate with the external world. This component communicates using a specific physical layer and data link layer standard, most commonly Ethernet or Wi-Fi. Through this device, the computer can communicate with hosts on the same Local Area Network (LAN) or, through routable protocols such as the Internet Protocol (IP), with hosts on other networks.
If we recall the fundamentals, communication takes place by the transmission and reception of a sequence of 0s and 1s, adhering to some protocol. A protocol is a code of conduct that governs how systems interact with each other: a set of rules that communicating systems must follow. A protocol stack is a layered collection of communication protocols that work together to enable reliable and efficient data exchange between devices.
In networking, a packet is analogous to an envelope. The terminology varies with the layer of the protocol stack. The lowest layers use the terms bits and frame, while the higher layers use the term message. The network layer is the one that uses the term packet to describe the header together with the actual payload. We'll discuss how packet filtering is achieved, the BPF instruction set, and how Linux extends this philosophy to provide a programmable kernel. [Gregg 19] provides an excellent analogy for this:
JavaScript allows a website to run mini programs on browser events such as mouse clicks, enabling a wide variety of web-based applications. BPF allows the [Linux] kernel to run mini programs on system and application events, such as disk I/O, thereby enabling new system technologies.
The terms packet capture and packet filtering may seem identical, but they are distinct, each with its own properties. Strictly speaking, packet filtering is the process of deciding whether a packet should be sent to userspace or discarded. Packet capture builds on this idea, but it is used for logging purposes. We'll use some low-level facilities when working with packet capture, such as network interfaces, the BPF assembly instruction set (see [McCanne 92]), and a library for capturing packets (see pcap(3PCAP)).
BPF has its limitations on macOS-based systems. There, it is strictly a packet capturing tool and cannot be used to filter out unwanted packets. This does not imply that we cannot drop packets on such systems. To do so, we need to use the pfctl(1) program provided by the system. Furthermore, pf.conf(5) describes how to write rules that handle packets, such as dropping unwanted ones.
Packet filtering
The notion of packet filtering was used in early network firewalls. A packet filter can be a dedicated hardware device, but I'll mostly refer to the software kind. It is the software that inspects ingress packets (incoming packets) as well as egress packets (outgoing packets). As you might have guessed, we can provide rules that govern how packet transmission and reception should be handled. For example, we can filter out packets that are destined for port 3000 on the system. Moreover, we can decide the fate of a packet. For example, if we blacklist a certain IP address and a packet arrives whose source IP address matches the blacklisted one, we can drop the packet entirely before it even reaches userspace. [McCanne 92] gives a simple definition of what a packet filter is used for:
A packet filter is simply a boolean valued function on a packet. If the value of the function is true the kernel copies the packet for the application; if it is false the packet is ignored.
As mentioned earlier, to filter something out, we need to define rules that select the packets we're interested in. To express them, we can use a boolean expression tree (used by the CMU/Stanford Packet Filter (CSPF)) or a directed acyclic Control Flow Graph (CFG). BPF uses the latter. [McCanne 92] also mentions that the expression tree model maps naturally onto a stack machine, whereas the CFG model maps naturally into code for a register machine. It also describes the overhead that is incurred when using the CSPF model.
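To make the CFG model a little more concrete, here is a minimal sketch in Python. It is not BPF bytecode and not the layout used by any real implementation: the `Node` structure, the `run_filter` function, and the example graph are hypothetical names chosen for illustration. Each node compares one field of the packet against a constant and then branches to another node or to a final verdict, and branches may only point forward, which keeps the graph acyclic. The example graph assumes an Ethernet frame, where the EtherType sits at byte offset 12 and the IPv4 protocol field at byte offset 23.

```python
import struct
from typing import List, NamedTuple, Union


class Node(NamedTuple):
    """One comparison in the control flow graph (hypothetical layout)."""
    offset: int                  # byte offset of the field in the packet
    size: int                    # field width in bytes: 1, 2 or 4
    value: int                   # constant the field is compared against
    on_true: Union[int, bool]    # next node index, or a final verdict
    on_false: Union[int, bool]


_FORMATS = {1: "!B", 2: "!H", 4: "!I"}  # network byte order


def run_filter(nodes: List[Node], packet: bytes) -> bool:
    """Walk the graph from node 0 until a boolean verdict is reached."""
    current = 0
    while True:
        node = nodes[current]
        if node.offset + node.size > len(packet):
            return False  # a read past the end of the packet rejects it
        (field,) = struct.unpack_from(_FORMATS[node.size], packet, node.offset)
        target = node.on_true if field == node.value else node.on_false
        if isinstance(target, bool):
            return target
        if target <= current:
            raise ValueError("branches must point forward (acyclic graph)")
        current = target


# Accept IPv4 frames that carry TCP; reject everything else.
ETH_IP_TCP = [
    Node(offset=12, size=2, value=0x0800, on_true=1, on_false=False),
    Node(offset=23, size=1, value=6, on_true=True, on_false=False),
]
```

Calling `run_filter(ETH_IP_TCP, frame)` returns the boolean verdict for a frame. A non-IP frame is rejected after a single comparison, because the first node branches straight to a verdict without evaluating the rest of the graph.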
Network TAP
I'd like to point out an important distinction: the packet filter and the network tap are two separate components. [McCanne 92] mentions that it is the network tap that collects copies of packets from the network device drivers and delivers them to listening applications. Fun fact: tap in network tap is a backronym that stands for Test Access Point (also called Terminal Access Point). Think of a network tap as a passive device whose only job is to monitor the network traffic. The Wikipedia page on network taps describes it as follows:
The network tap has (at least) three ports: an A port, a B port, and a monitor port. A tap inserted between A and B passes all traffic (send and receive data streams) through unimpeded in real time, but also copies that same data to its monitor port, enabling a third party to listen.
From here, we get a basic idea of what a network tap is. The tap does not process the traffic; it only monitors it.
Another form of tapping is promiscuous mode. Normally, the NIC (or WNIC, Wireless Network Interface Controller) delivers to the CPU only the frames that the controller is specifically programmed to receive. If we enable promiscuous mode on the bpf device, the NIC will also capture foreign packets and pass them to the CPU.[0] Most OSes restrict this functionality to the superuser. It should be noted that on a switched network this form of tapping is largely useless, as the switch forwards frames only to the port of the respective device.
Classic BPF Packet Capture
We now have a rough idea of what packet filtering means. Let's write a simple program that captures packets under the following conditions: the packet must be an Internet Protocol (IP) packet, its source IP address must be 127.0.0.1, and its destination IP address must also be 127.0.0.1. Finally, the destination port must be 3000. If a packet matches these conditions, we capture it; otherwise, we ignore it.
We'll first look at the various interfaces available on the system. We'll also inspect various ioctl(2) commands available for a bpf device (not all of them, but interested readers can refer to [bpf 4] for the complete list).
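Before turning to BPF itself, the condition above can be written down as a plain boolean function. The following is only a sketch in Python, not the filter program developed in this article. It assumes the buffer starts directly at the IPv4 header (no link-layer header in front) and that the transport protocol is TCP or UDP, since those are the protocols that carry a destination port at this position. The names `wanted`, `LOOPBACK`, and `TARGET_PORT` are hypothetical.

```python
import socket
import struct

LOOPBACK = socket.inet_aton("127.0.0.1")  # 4 raw bytes: 7f 00 00 01
TARGET_PORT = 3000


def wanted(packet: bytes) -> bool:
    """Return True if the IPv4 packet goes 127.0.0.1 -> 127.0.0.1, port 3000."""
    if len(packet) < 20:
        return False                      # too short for an IPv4 header
    version = packet[0] >> 4
    header_len = (packet[0] & 0x0F) * 4   # IHL is counted in 32-bit words
    if version != 4 or header_len < 20:
        return False
    if packet[12:16] != LOOPBACK or packet[16:20] != LOOPBACK:
        return False                      # source or destination mismatch
    if packet[9] not in (6, 17):          # 6 = TCP, 17 = UDP
        return False
    (flags_and_offset,) = struct.unpack_from("!H", packet, 6)
    if flags_and_offset & 0x1FFF:
        return False                      # later fragments carry no ports
    if len(packet) < header_len + 4:
        return False
    # The destination port is the second 16-bit field of TCP and UDP headers.
    (dst_port,) = struct.unpack_from("!H", packet, header_len + 2)
    return dst_port == TARGET_PORT
```

Each early `return False` corresponds to one branch that rejects the packet, which is the same shape a filter expressed as a control flow graph takes.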
Interface
Systems provide various interfaces with which we can interact. For example, en0 (or eth0 on Linux) is the Ethernet interface. There are plenty of commands that allow us to view the interfaces. Linux uses the ip(1) utility to work with interfaces, while macOS uses the traditional ifconfig(1). For example, to list all the interfaces available on the system, we can use the command:
$ ifconfig -a
The output is... verbose, to say the least. I'll try to demystify the output we receive from this command.
Before I begin discussing the configuration details provided by ifconfig(1), I should alert readers that we use it only to inspect the interfaces, not to change any current configuration. Even though the configuration of a device can be changed through ifconfig(1), this is not advised, as the command is considered old and newer alternatives exist.
On macOS, for instance, changes made through the command may be overridden because of configd(8). Refer to this StackExchange thread for more info. Another command, networksetup(1), provides various functionality, as can be seen from its manual page.
On Linux, one might encounter a disparity between the output of ifconfig(1) and ip(1), as asked in this StackExchange thread. ifconfig(1) should only be used as a fallback when ip(1) is not available.
- The interface name and flags are separated by a colon. The flags have the following format:
  flags=<number><<symbols-for-number>>
  For example, it could be: flags=1<UP>. The flags seen here are defined as IFF_XXX symbols inside the <net/if.h> header file. The same line also shows the mtu field followed by a number. This is the Maximum Transmission Unit, which describes the largest size of a data packet (in bytes/octets) that can be transmitted over a network without any sort of fragmentation.
- Some interfaces have additional capabilities that are expressed in the options field. For example, the output for an interface could contain: options=1203<RXCSUM,TXCSUM,TXSTATUS,SW_TXSTATUS>. Similar to the flags, the bits corresponding to these symbols are defined in <net/if.h> as symbols of the form IFCAP_XXX.
- If the interface has an Internet Protocol version 4 address, the inet field will be present. The value of this field is four decimal numbers in the range [0, 255] separated by periods (.). Some more information may be present, such as netmask. The netmask is followed by a hexadecimal number that is used to differentiate the network bits from the host bits.
- If the interface has an Internet Protocol version 6 address, the inet6 field will be present. This field may occur multiple times for the same interface. If the address starts with fe80::, then it is a Link-Local Address. This address is non-routable and is only valid within a network segment. This field may contain prefixlen, which describes the netmask for an IPv6 address. For instance, prefixlen 64 states that the first 64 bits of the address are to be treated as the network portion, and the remaining 64 bits as the host portion.
- As mentioned earlier, there can be multiple instances of the inet6 field. On my machine, what differentiates these inet6 fields is the address properties and privacy extensions. For example, the inet6 fields may contain keywords like secured, temporary, and such. Historically, the IPv6 address for a device would be generated from the MAC address of the device. This turned out to be a privacy problem, as the same device would have the same interface identifier on different networks. The secured address is generated using a stable, non-temporary key that is unique to your device, without relying on the MAC address. This address stays consistent across reboots and reconnections, making it suitable for services that need a stable identity. In contrast, a temporary address is short-lived and disposable. Such an address is used for outbound client connections, like browsing the web, sending mail, and so on. Using this address makes it harder to trace the user's behavior across different websites and services.
- The inet6 field carries some more labels. The labels describing the address configuration are: autoconf, dynamic, or static. The address's lifecycle status is indicated by the labels deprecated or detached. The secured and temporary labels we discussed earlier are the security and privacy features. The logical boundary of the address is given by the label scopeid. Each interface has its own unique identifier, say, 0xc for en0. This is rather useful in IPv6, as some address ranges are not globally unique, but the addresses are unique within the same network segment. In the inet6 field, you'll see the output as: inet6 <address>%<interface-name>. The % symbol acts as a delimiter between the address and the interface. When you're sending an IPv6 packet to such an address, you need to specify which interface will be used to send the packet, hence the use of scopeid.
- You may see the nd6 field on some interfaces. This stands for the IPv6 Neighbor Discovery Protocol. The symbols used in its options field are defined as ND6_IFF_XXX in the <netinet6/nd6.h> header file. The two most common ones are PERFORMNUD (perform Neighbor Unreachability Detection), which checks whether the neighbors are reachable, and DAD (Duplicate Address Detection), which ensures that the IPv6 address assigned to the interface is unique on the link.
- Finally, some interfaces have the media field, which describes the physical characteristics and current state of the network connection. Some components of this field are: the type, which can be Ethernet, Wi-Fi, or coaxial; the subtype, which can be 10baseT, 100baseT, or 1000baseT; and the options, which are full-duplex, half-duplex, or autoselect. The autoselect option allows the interface to negotiate the best speed and duplex mode.
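The relationship between the number and the symbols in the flags field can be sketched in a few lines of Python. The bit values below follow the BSD-style IFF_XXX definitions as I understand them, with the names that ifconfig(1) prints on macOS (for instance, SMART for the 0x20 bit). Treat the table as an assumption and verify it against <net/if.h> on your own system. It does not apply to Linux, where some bits differ and ifconfig prints the number in decimal. The names `IFF_FLAGS` and `decode_flags` are hypothetical.

```python
from typing import List

# (bit, name as printed by ifconfig) in ascending bit order.
# Assumed from BSD-style <net/if.h>; check your own header file.
IFF_FLAGS = [
    (0x0001, "UP"),
    (0x0002, "BROADCAST"),
    (0x0004, "DEBUG"),
    (0x0008, "LOOPBACK"),
    (0x0010, "POINTOPOINT"),
    (0x0020, "SMART"),
    (0x0040, "RUNNING"),
    (0x0080, "NOARP"),
    (0x0100, "PROMISC"),
    (0x0200, "ALLMULTI"),
    (0x0400, "OACTIVE"),
    (0x0800, "SIMPLEX"),
    (0x1000, "LINK0"),
    (0x2000, "LINK1"),
    (0x4000, "LINK2"),
    (0x8000, "MULTICAST"),
]


def decode_flags(hex_text: str) -> List[str]:
    """Turn the hexadecimal number printed after flags= into symbol names."""
    value = int(hex_text, 16)
    return [name for bit, name in IFF_FLAGS if value & bit]


if __name__ == "__main__":
    for number in ("8a63", "8963"):
        print(f"flags={number}<{','.join(decode_flags(number))}>")
```

The two numbers in the usage example are the flags values that appear in the bridge100 and vmenet0 outputs later in this section.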
There's one special interface known as the bridge interface (bridge0 on macOS or br0 on Linux). This interface operates at the data-link layer of the OSI model. It functions as a virtual network switch that connects different network segments or interfaces. On Linux, the bridge interface sees all frames, but it uses only Layer 2 (data-link layer) headers/information.
The bridge interface, like its physical counterpart the switch, is primarily used to connect different network segments or interfaces, making them all appear as if they're on the same local network. When you inspect the bridge interface through ifconfig(1), you'll notice a new field: member. This field lists the network interfaces that have been attached to the bridge. You can think of each one as a device connected to a switch. For example, from the output:
$ ifconfig bridge100
bridge100: flags=8a63<UP,BROADCAST,SMART,RUNNING,ALLMULTI,SIMPLEX,MULTICAST> mtu 1500
options=3<RXCSUM,TXCSUM>
ether 3e:06:30:c0:dc:64
inet 192.168.64.1 netmask 0xffffff00 broadcast 192.168.64.255
inet6 fe80::3c06:30ff:fec0:dc64%bridge100 prefixlen 64 scopeid 0x18
inet6 fd3b:14e3:3a82:6841:1074:695b:e70d:3fc7 prefixlen 64 autoconf secured
Configuration:
id 0:0:0:0:0:0 priority 0 hellotime 0 fwddelay 0
maxage 0 holdcnt 0 proto stp maxaddr 100 timeout 1200
root id 0:0:0:0:0:0 priority 0 ifcost 0 port 0
ipfilter disabled flags 0x0
member: vmenet0 flags=3<LEARNING,DISCOVER>
ifmaxaddr 0 port 23 priority 0 path cost 0
Address cache:
7a:9b:b0:51:db:79 Vlan1 vmenet0 1197 flags=0<>
16:7c:df:f1:fc:de Vlan1 vmenet0 1197 flags=0<>
nd6 options=201<PERFORMNUD,DAD>
media: autoselect
status: active
$ ifconfig vmenet0
vmenet0: flags=8963<UP,BROADCAST,SMART,RUNNING,PROMISC,SIMPLEX,MULTICAST> mtu 1500
ether 7a:f5:ae:98:20:22
media: autoselect
status: active
We can see that the bridge has one member, the vmenet0 interface, which is connected to a port of the virtual bridge. It should be noted that some texts use the term bridge group to describe a bridge interface. Unfortunately, the symbols that make up the flags field within the member field do not seem to be available in the usual header files on my machine. I did find the definitions in the <bsd/net/if_bridgevar.h> file in XNU's source code, which are akin to the ones described in OpenBSD's bridge(4) manual. The symbols have the form IFBIF_XXX. The LEARNING flag indicates that the bridge learns source MAC addresses from frames arriving on this member to build its address table. The DISCOVER flag, according to the OpenBSD manual, indicates that packets with an unknown destination may be sent out through this member.
Within the member field, we notice some other fields. priority and path cost are used by the Spanning Tree Protocol (STP), a layer 2 protocol used to prevent network loops. The priority value determines which port is the preferred forwarding path in a loop. A lower priority value means the port is more preferred. path cost represents the cost of sending traffic through a specific port. A lower cost indicates a faster or more desirable path. The ifmaxaddr field represents the maximum number of MAC addresses the bridge can "learn" from a specific port. The value 0 indicates there's no limit. Lastly, port is an internal identifier for the port. It is used by the bridge interface's forwarding table to refer to the member interface, vmenet0 in our case.
The bridge100 interface above is created when I spin up a Linux instance in a virtual machine (VM). Notice that the bridge interface has its own IP address. This implies that it acts as the gateway for the interface inside the VM. Inside the Linux VM, the route(1) command shows the default gateway:
$ # Inside Linux VM
$ route
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
default 192.168.64.1 0.0.0.0 UG 1024 0 0 enp0s1
192.168.64.0 0.0.0.0 255.255.255.0 U 1024 0 0 enp0s1
192.168.64.1 0.0.0.0 255.255.255.255 UH 1024 0 0 enp0s1
The interface inside the VM is enp0s1. If we inspect it using ip(1) and ifconfig(1), we get:
$ # Inside Linux VM
$ ip address show dev enp0s1
2: enp0s1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel master br100 state UP group default qlen 1000
link/ether 16:7c:df:f1:fc:de brd ff:ff:ff:ff:ff:ff
altname enx167cdff1fcde
inet 192.168.64.2/24 metric 1024 brd 192.168.64.255 scope global dynamic enp0s1
valid_lft 62775sec preferred_lft 62775sec
inet6 fd3b:14e3:3a82:6841:147c:dfff:fef1:fcde/64 scope global dynamic mngtmpaddr noprefixroute
valid_lft 2572080sec preferred_lft 584880sec
inet6 fe80::147c:dfff:fef1:fcde/64 scope link proto kernel_ll
valid_lft forever preferred_lft forever
$ ifconfig enp0s1
enp0s1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
inet 192.168.64.2 netmask 255.255.255.0 broadcast 192.168.64.255
inet6 fd3b:14e3:3a82:6841:147c:dfff:fef1:fcde prefixlen 64 scopeid 0x0<global>
inet6 fe80::147c:dfff:fef1:fcde prefixlen 64 scopeid 0x20<link>
ether 16:7c:df:f1:fc:de txqueuelen 1000 (Ethernet)
RX packets 4252 bytes 1567600 (1.4 MiB)
RX errors 0 dropped 0 overruns 0 frame 0
TX packets 600 bytes 53663 (52.4 KiB)
TX errors 0 dropped 0 overruns 0 carrier 0 collisions 0
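The addresses reported on the host side (inet 192.168.64.1 netmask 0xffffff00) and on the VM side (inet 192.168.64.2/24) can be compared with Python's standard ipaddress module. This is only a small sketch to show how the hexadecimal netmask printed by macOS relates to the prefix length printed by ip(1). The helper name `network_from_hex_mask` is hypothetical.

```python
import ipaddress


def network_from_hex_mask(address: str, hex_mask: str) -> ipaddress.IPv4Network:
    """Build the network for an address and a macOS-style hex netmask."""
    # int() with base 16 accepts the leading "0x" that ifconfig prints.
    mask = ipaddress.IPv4Address(int(hex_mask, 16))
    # strict=False lets us pass a host address instead of the network address.
    return ipaddress.ip_network(f"{address}/{mask}", strict=False)


# Host side: bridge100 as printed by ifconfig(1) on macOS.
host_side = network_from_hex_mask("192.168.64.1", "0xffffff00")
# VM side: enp0s1 as printed by ip(1) on Linux.
vm_side = ipaddress.ip_interface("192.168.64.2/24").network

if __name__ == "__main__":
    print(host_side, vm_side, host_side == vm_side)
    print("broadcast:", host_side.broadcast_address)
```

Both values resolve to the same network, 192.168.64.0/24, and its broadcast address matches the broadcast field that ifconfig(1) reports on both sides.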
Notice that the Host OS's bridge100 interface and the VM's enp0s1 interface are in the same subnet. The vmenet0 interface on the Host OS is a port on the virtual bridge, bridge100. The VM is "plugged" into this port and obtains a private IP address through it. Let's trace the route a packet takes from inside the VM:
$ # Inside Linux VM
$ traceroute example.org
traceroute to example.org (23.220.75.238), 30 hops max, 60 byte packets
1 _gateway (192.168.64.1) 0.844 ms 0.789 ms 0.775 ms
2 192.168.1.254 (192.168.1.254) 6.046 ms 6.005 ms 6.000 ms
...
8 125.17.58.157 (125.17.58.157) 13.709 ms 13.667 ms 13.661 ms
9 116.119.158.232 (116.119.158.232) 316.859 ms 116.119.44.132 (116.119.44.132) 316.854 ms 116.119.44.134 (116.119.44.134) 316.851 ms
10 206.72.210.82.any2ix.coresite.com (206.72.210.82) 316.847 ms 316.843 ms 316.839 ms
11 * * *
12 * * *
13 * * *
14 a23-220-75-238.deploy.static.akamaitechnologies.com (23.220.75.238) 287.200 ms 306.862 ms 306.805 ms
This tells us a couple of things. First, the IP address of bridge100 is the first hop, as mentioned earlier. Second, the interface the VM uses to access the Internet is distinct from the Host's interface. I mention this because the bridge interface is really general (on Linux), and virtualization is just one of its use cases. Getting back to the topic, the Darwin kernel, upon receiving the packet from the vmenet0 interface, silently translates the packet's source address from 192.168.64.2 (the VM's address) to 192.168.1.68 (the IP address of the Host's en0 interface). The tcpdump(1) output below demonstrates this. Be aware that the three captures were taken simultaneously, while the VM's console ran the ping(1) command to contact a host.
$ # Inside Linux VM
$ ping google.com
PING google.com (142.250.192.174) 56(84) bytes of data.
64 bytes from del11s11-in-f14.1e100.net (142.250.192.174): icmp_seq=1 ttl=113 time=28.4 ms
64 bytes from del11s11-in-f14.1e100.net (142.250.192.174): icmp_seq=2 ttl=113 time=24.8 ms
64 bytes from del11s11-in-f14.1e100.net (142.250.192.174): icmp_seq=3 ttl=113 time=24.3 ms
--- google.com ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2004ms
rtt min/avg/max/mdev = 24.253/25.837/28.423/1.843 ms
As stated earlier, the output below is joined together, but the actual observation was done in three separate terminal windows.
$ # Host Machine
$ # First terminal window (capturing two packets from the vmenet0 interface)
$ tcpdump -i vmenet0 -n -c 2 icmp
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on vmenet0, link-type EN10MB (Ethernet), snapshot length 524288 bytes
13:09:15.214392 IP 192.168.64.2 > 142.250.192.174: ICMP echo request, id 2, seq 1, length 64
13:09:15.238388 IP 142.250.192.174 > 192.168.64.2: ICMP echo reply, id 2, seq 1, length 64
2 packets captured
2 packets received by filter
0 packets dropped by kernel
$ # Second terminal window (capturing two packets from the bridge100 interface)
$ tcpdump -i bridge100 -n -c 2 icmp
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on bridge100, link-type EN10MB (Ethernet), snapshot length 524288 bytes
13:09:15.214553 IP 192.168.64.2 > 142.250.192.174: ICMP echo request, id 2, seq 1, length 64
13:09:15.238359 IP 142.250.192.174 > 192.168.64.2: ICMP echo reply, id 2, seq 1, length 64
2 packets captured
2 packets received by filter
0 packets dropped by kernel
$ # Third terminal window (capturing two packets from the en0 interface)
$ tcpdump -i en0 -n -c 2 icmp
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on en0, link-type EN10MB (Ethernet), snapshot length 524288 bytes
13:09:15.214612 IP 192.168.1.68 > 142.250.192.174: ICMP echo request, id 44667, seq 1, length 64
13:09:15.238287 IP 142.250.192.174 > 192.168.1.68: ICMP echo reply, id 44667, seq 1, length 64
2 packets captured
1444 packets received by filter
0 packets dropped by kernel
We can clearly see the address translation thanks to the microsecond precision of tcpdump(1) timestamps. Notice that the echo request first arrives on the vmenet0 interface (13:09:15.214392), is passed to the bridge100 interface (13:09:15.214553), and finally leaves the host machine through the en0 interface (13:09:15.214612), by which point its source address has become 192.168.1.68. The ICMP identifier also changes from 2 to 44667 along the way, which is consistent with the host rewriting it as part of the translation. It is the Host OS's kernel that is responsible for this Network Address Translation (NAT), and it is distinct from the NAT performed by the router. The Internet Control Message Protocol (ICMP) echo reply sent back by the remote host takes the reverse path: it is first received on the en0 interface (13:09:15.238287), passed to the bridge100 interface (13:09:15.238359), and then delivered to the vmenet0 interface (13:09:15.238388).
This is one of the use cases of the bridge interface. It is a powerful interface used for various purposes, such as container networking, transparent firewalls, network loop detection, and much more. Linux's bridge interface offers many options that allow us to tune the interface to our needs. Container runtimes like Docker and LXC use the bridge interface to connect all containers on a host. This allows communication between the containers while also providing a single point of exit for them to access the Internet. I've provided some references below for interested readers who want to take a deep dive into this topic.
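The time the packet spends between interfaces can be computed from the timestamps in the three captures. The sketch below uses Python's datetime module. The timestamps are copied from the tcpdump(1) output above, and the helper name `micros_between` is hypothetical.

```python
from datetime import datetime, timedelta

TIME_FORMAT = "%H:%M:%S.%f"  # tcpdump's default timestamp layout


def micros_between(earlier: str, later: str) -> int:
    """Return the number of microseconds from one timestamp to the other."""
    delta = datetime.strptime(later, TIME_FORMAT) - datetime.strptime(earlier, TIME_FORMAT)
    return delta // timedelta(microseconds=1)


# Echo request, in the order it traverses the host.
REQUEST = ["13:09:15.214392", "13:09:15.214553", "13:09:15.214612"]
# Echo reply, in the reverse direction.
REPLY = ["13:09:15.238287", "13:09:15.238359", "13:09:15.238388"]

if __name__ == "__main__":
    print("vmenet0 -> bridge100:", micros_between(REQUEST[0], REQUEST[1]), "us")
    print("bridge100 -> en0:    ", micros_between(REQUEST[1], REQUEST[2]), "us")
    print("en0 -> bridge100:    ", micros_between(REPLY[0], REPLY[1]), "us")
    print("bridge100 -> vmenet0:", micros_between(REPLY[1], REPLY[2]), "us")
```

For this capture, each hop inside the host takes well under a millisecond, while nearly all of the round-trip time is spent outside the machine.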
For further reading, refer to the articles below:
Introduction to Linux interfaces for virtual networking
OpenBSD bridge(4) manual
What is the br0 interface?
Deep Guide to Bridge Command Line in Linux
Linux bridge(8) manual
Ethernet Bridging
Special Addresses
The table below lists some of the special addresses and their purpose in networking.
| Address Type | IPv4 Address Range | IPv6 Address Range | Purpose |
|---|---|---|---|
| Loopback | 127.0.0.0/8 (Section 3.2.1.3 (g) of RFC 1122) | ::1/128 (Section 2.5.3 of RFC 4291) | Communicate with services on the local host. |
| Link-Local | 169.254.0.0/16 (APIPA) (Section 2.1 of RFC 3927) | fe80::/10 (Section 2.5.6 of RFC 4291) | Non-Routable address and communication only on local network segment. |
| Private/Unique Local | 10.0.0.0/8, 172.16.0.0/12, and 192.168.0.0/16 (Section 3 of RFC 1918) | fc00::/7 (Section 3 of RFC 4193) | Used for private networks to avoid public address scarcity. |
| Broadcast | 255.255.255.255 or, 192.168.1.255 for directed broadcast (RFC 919 and RFC 922) | N/A (Multicast is used, as per second paragraph of Section 2 of RFC 4291, and Section 2.7 of RFC 4291 for the details) (Serverfault thread) | For one-to-all communication on a network segment. |
| All-Nodes Multicast | 255.255.255.255 or 224.0.0.1 (specific multicast group) (RFC 1122 and Multicast Address Wiki) | ff01::1 (all-nodes multicast, interface-local scope) or ff02::1 (all-nodes multicast, link-local scope) (Cisco thread) (Section 2.7.1 of RFC 4291) | To send a packet to every device on the local network segment. |
| All-Routers Multicast | 224.0.0.2 (RFC 2365) | ff01::2 (interface-local) or ff02::2 (link-local) or ff05::2 (site-local) | Find all routers on the local network segment. |
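Most of the categories in the table can be checked with Python's standard ipaddress module. The sketch below is one possible classification, not an authoritative one: the function name `classify` and the category strings are hypothetical. The order of the checks matters, because the module also reports loopback and link-local addresses as private. A directed broadcast address such as 192.168.1.255 cannot be recognized without knowing the netmask, so it is not handled here.

```python
import ipaddress

LIMITED_BROADCAST = ipaddress.IPv4Address("255.255.255.255")


def classify(text: str) -> str:
    """Map an IPv4 or IPv6 address to one of the categories in the table."""
    addr = ipaddress.ip_address(text)
    if addr == LIMITED_BROADCAST:
        return "broadcast"
    if addr.is_loopback:
        return "loopback"
    if addr.is_link_local:
        return "link-local"
    if addr.is_multicast:
        return "multicast"
    if addr.is_private:
        return "private/unique local"
    return "other"


if __name__ == "__main__":
    samples = ["127.0.0.1", "::1", "169.254.10.20", "fe80::1",
               "192.168.64.1", "fd3b:14e3:3a82:6841::1",
               "224.0.0.1", "ff02::1", "255.255.255.255"]
    for sample in samples:
        print(f"{sample:28} {classify(sample)}")
```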
References
- [Gregg 19] BPF Performance Tools; Brendan Gregg. ISBN-13: 9780136554820.
- [McCanne 92] The BSD Packet Filter: A New Architecture for User-level Packet Capture; https://www.tcpdump.org/papers/bpf-usenix93.pdf
- [bpf 4] https://man.netbsd.org/bpf.4
- [0] The manual for bpf(4) on my computer has a BUGS section that mentions the following quirk of promiscuous mode: "A file that does not request promiscuous mode may receive promiscuously received packets as a side effect of another file requesting this mode on the same hardware interface. This could be fixed in the kernel with additional processing overhead. However, we favor the model where all files must assume that the interface is promiscuous, and if so desired, must utilize a filter to reject foreign packets."