让我们代码TCP/IP堆栈，1：以太网和ARP（2016）

让我们代码TCP/IP堆栈，1：以太网和ARP（2016）
Let’s code a TCP/IP stack, 1: Ethernet and ARP (2016)

原始链接: https://www.saminiir.com/lets-code-tcp-ip-stack-1-ethernet-arp/

这篇博客文章介绍了一个项目，以在Linux的用户空间中构建最小的TCP/IP堆栈，重点是学习网络和系统编程。堆栈使用Linux Tap设备拦截了低级网络流量，从而可以操纵第2层流量。最初的实现涵盖以太网框架处理和ARP（地址分辨率协议）。该项目利用了TAP设备，该设备配置为无需额外的数据包信息即可捕获以太网帧。它定义了以太网和ARP标头的C结构，反映了协议格式。代码解析传入的以太网框架，标识ARP数据包，并通过提供与给定IP地址关联的MAC地址来响应ARP请求。关键要素包括定义以太网和ARP标头结构，解析框架以及实现ARP算法，包括缓存。该博客展示了成功的ARP实现，该实现与内核的网络堆栈进行交互，从而允许自定义堆栈通过其虚拟网络设备的信息填充主机的ARP缓存。

黑客新闻线程讨论了一个文章系列“让我们代码tcp/ip堆栈”，重点是以太网和ARP。评论者分享了他们建立用户空间网络堆栈和解析网络协议的经验，尽管很复杂，但有些人发现它很有意义。一个用户指出，最小的Linux内核中TCP/IP堆栈的惊人大小，而其他用户则解释了堆栈大小的原因，例如安全性，硬件支持和IPv6。提到了一个涉及禁用ARP或使用回环别名进行负载平衡的窍门。评论者批评文章对先验知识的假设，而其他人则捍卫作者。用户还讨论点击设备，ARP在本地网络中的作用以及RARP等替代协议。提供了相关文章，事先讨论和类似项目的链接。

（评论） 2025-03-05

Show HN：我做了一个分立逻辑网卡 2024-04-10

（评论） 2024-09-10

（评论） 2024-09-16

Linux 内核模块编程指南 2024-07-28

原文

Writing your own TCP/IP stack may seem like a daunting task. Indeed, TCP has accumulated many specifications over its lifetime of more than thirty years. The core specification, however, is seemingly compact^{- the important parts being TCP header parsing, the state machine, congestion control and retransmission timeout computation.}

The most common layer 2 and layer 3 protocols, Ethernet and IP respectively, pale in comparison to TCP’s complexity. In this blog series, we will implement a minimal userspace TCP/IP stack for Linux.

The purpose of these posts and the resulting software is purely educational - to learn network and system programming at a deeper level.

To intercept low-level network traffic from the Linux kernel, we will use a Linux TAP device. In short, a TUN/TAP device is often used by networking userspace applications to manipulate L3/L2 traffic, respectively. A popular example is tunneling, where a packet is wrapped inside the payload of another packet.

The advantage of TUN/TAP devices is that they’re easy to set up in a userspace program and they are already being used in a multitude of programs, such as OpenVPN.

As we want to build the networking stack from the layer 2 up, we need a TAP device. We instantiate it like so:

/*
 * Taken from Kernel Documentation/networking/tuntap.txt
 */
int tun_alloc(char *dev)
{
    struct ifreq ifr;
    int fd, err;

    if( (fd = open("/dev/net/tap", O_RDWR)) < 0 ) {
        print_error("Cannot open TUN/TAP dev");
        exit(1);
    }

    CLEAR(ifr);

    /* Flags: IFF_TUN   - TUN device (no Ethernet headers)
     *        IFF_TAP   - TAP device
     *
     *        IFF_NO_PI - Do not provide packet information
     */
    ifr.ifr_flags = IFF_TAP | IFF_NO_PI;
    if( *dev ) {
        strncpy(ifr.ifr_name, dev, IFNAMSIZ);
    }

    if( (err = ioctl(fd, TUNSETIFF, (void *) &ifr)) < 0 ){
        print_error("ERR: Could not ioctl tun: %s\n", strerror(errno));
        close(fd);
        return err;
    }

    strcpy(dev, ifr.ifr_name);
    return fd;
}

After this, the returned file descriptor fd can be used to read and write data to the virtual device’s ethernet buffer.

The flag IFF_NO_PI is crucial here, otherwise we end up with unnecessary packet information prepended to the Ethernet frame. You can actually take a look at the kernel’s source code of the tun-device driver and verify this yourself.

The multitude of different Ethernet networking technologies are the backbone of connecting computers in Local Area Networks (LANs). As with all physical technology, the Ethernet standard has greatly evolved from its first version^{, published by Digital Equipment Corporation, Intel and Xerox in 1980.}

The first version of Ethernet was slow in today’s standards - about 10Mb/s and it utilized half-duplex communication, meaning that you either sent or received data, but not at the same time. This is why a Media Access Control (MAC) protocol had to be incorporated to organize the data flow. Even to this day, Carrier Sense, Multiple Access with Collision Detection (CSMA/CD) is required as the MAC method if running an Ethernet interface in half-duplex mode.

The invention of the 100BASE-T Ethernet standard used twisted-pair wiring to enable full-duplex communication and higher throughput speeds. Additionally, the simultaneous increase in popularity of Ethernet switches made CSMA/CD largely obsolete.

The different Ethernet standards are maintained by the IEEE 802.3^{working group.}

Next, we’ll take a look at the Ethernet Frame header. It can be declared as a C struct followingly:

#include <linux/if_ether.h>

struct eth_hdr
{
    unsigned char dmac[6];
    unsigned char smac[6];
    uint16_t ethertype;
    unsigned char payload[];
} __attribute__((packed));

The dmac and smac are pretty self-explanatory fields. They contain the MAC addresses of the communicating parties (destination and source, respectively).

The overloaded field, ethertype, is a 2-octet field, that depending on its value, either indicates the length or the type of the payload. Specifically, if the field’s value is greater or equal to 1536, the field contains the type of the payload (e.g. IPv4, ARP). If the value is less than that, it contains the length of the payload.

After the type field, there is a possibility of several different tags for the Ethernet frame. These tags can be used to describe the Virtual LAN (VLAN) or the Quality of Service (QoS) type of the frame. Ethernet frame tags are excluded from our implementation, so the corresponding field also does not show up in our protocol declaration.

The field payload contains a pointer to the Ethernet frame’s payload. In our case, this will contain an ARP or IPv4 packet. If the payload length is smaller than the minimum required 48 bytes (without tags), pad bytes are appended to the end of the payload to meet the requirement.

We also include the if_ether.h Linux header to provide a mapping between ethertypes and their hexadecimal values.

Lastly, the Ethernet Frame Format also includes the Frame Check Sequence field in the end, which is used with Cyclic Redundancy Check (CRC) to check the integrity of the frame. We will omit the handling of this field in our implementation.

The attribute packed in a struct’s declaration is an implementation detail - It is used to instruct the GNU C compiler not to optimize the struct memory layout for data alignment with padding bytes^{. The use of this attribute stems purely out of the way we are “parsing” the protocol buffer, which is just a type cast over the data buffer with the proper protocol struct:}

struct eth_hdr *hdr = (struct eth_hdr *) buf;

A portable, albeit slightly more laborious approach, would be to serialize the protocol data manually. This way, the compiler is free to add padding bytes to conform better to different processor’s data alignment requirements.

The overall scenario for parsing and handling incoming Ethernet frames is straightforward:

if (tun_read(buf, BUFLEN) < 0) {
    print_error("ERR: Read from tun_fd: %s\n", strerror(errno));
}

struct eth_hdr *hdr = init_eth_hdr(buf);

handle_frame(&netdev, hdr);

The handle_frame function just looks at the ethertype field of the Ethernet header, and decides its next action based upon the value.

The Address Resolution Protocol (ARP) is used for dynamically mapping a 48-bit Ethernet address (MAC address) to a protocol address (e.g. IPv4 address). The key here is that with ARP, multitude of different L3 protocols can be used: Not just IPv4, but other protocols like CHAOS, which declares 16-bit protocol addresses.

The usual case is that you know the IP address of some service in your LAN, but to establish actual communications, also the hardware address (MAC) needs to be known. Hence, ARP is used to broadcast and query the network, asking the owner of the IP address to report its hardware address.

The ARP packet format is relatively straightforward:

struct arp_hdr
{
    uint16_t hwtype;
    uint16_t protype;
    unsigned char hwsize;
    unsigned char prosize;
    uint16_t opcode;
    unsigned char data[];
} __attribute__((packed));

The ARP header (arp_hdr) contains the 2-octet hwtype, which determines the link layer type used. This is Ethernet in our case, and the actual value is 0x0001.

The 2-octet protype field indicates the protocol type. In our case, this is IPv4, which is communicated with the value 0x0800.

The hwsize and prosize fields are both 1-octet in size, and they contain the sizes of the hardware and protocol fields, respectively. In our case, these would be 6 bytes for MAC addresses, and 4 bytes for IP addresses.

The 2-octet field opcode declares the type of the ARP message. It can be ARP request (1), ARP reply (2), RARP request (3) or RARP reply (4).

The data field contains the actual payload of the ARP message, and in our case, this will contain IPv4 specific information:

struct arp_ipv4
{
    unsigned char smac[6];
    uint32_t sip;
    unsigned char dmac[6];
    uint32_t dip;
} __attribute__((packed));

The fields are pretty self explanatory. smac and dmac contain the 6-byte MAC addresses of the sender and receiver, respectively. sip and dip contain the sender’s and receiver’s IP addresses, respectively.

The original specification depicts this simple algorithm for address resolution:

?Do I have the hardware type in ar$hrd?
Yes: (almost definitely)
  [optionally check the hardware length ar$hln]
  ?Do I speak the protocol in ar$pro?
  Yes:
    [optionally check the protocol length ar$pln]
    Merge_flag := false
    If the pair <protocol type, sender protocol address> is
        already in my translation table, update the sender
        hardware address field of the entry with the new
        information in the packet and set Merge_flag to true.
    ?Am I the target protocol address?
    Yes:
      If Merge_flag is false, add the triplet <protocol type,
          sender protocol address, sender hardware address> to
          the translation table.
      ?Is the opcode ares_op$REQUEST?  (NOW look at the opcode!!)
      Yes:
        Swap hardware and protocol fields, putting the local
            hardware and protocol addresses in the sender fields.
        Set the ar$op field to ares_op$REPLY
        Send the packet to the (new) target hardware address on
            the same hardware on which the request was received.

Namely, the translation table is used to store the results of ARP, so that hosts can just look up whether they already have the entry in their cache. This avoids spamming the network for redundant ARP requests.

The algorithm is implemented in arp.c.

Finally, the ultimate test for an ARP implementation is to see whether it replies to ARP requests correctly:

[saminiir@localhost lvl-ip]$ arping -I tap0 10.0.0.4
ARPING 10.0.0.4 from 192.168.1.32 tap0
Unicast reply from 10.0.0.4 [00:0C:29:6D:50:25]  3.170ms
Unicast reply from 10.0.0.4 [00:0C:29:6D:50:25]  13.309ms

[saminiir@localhost lvl-ip]$ arp
Address                  HWtype  HWaddress           Flags Mask            Iface
10.0.0.4                 ether   00:0c:29:6d:50:25   C                     tap0

The kernel’s networking stack recognized the ARP reply from our custom networking stack, and consequently populated its ARP cache with the entry of our virtual network device. Success!

The minimal implementation of Ethernet Frame handling and ARP is relatively easy and can be done in a few lines of code. On the contrary, the reward-factor is quite high, since you get to populate a Linux host’s ARP cache with your own make-belief Ethernet device!

The source code for the project can be found at GitHub.

In the next post, we’ll continue the implementation with ICMP echo & reply (ping) and IPv4 packet parsing.

If you liked this post, you can share it with your followers and follow me on Twitter!

Kudos to Xiaochen Wang, whose similar implementation proved invaluable for me in getting up to speed with C network programming and protocol handling. I find his source code^{easy to understand and some of my design choices were straight-out copied from his implementation.}

让我们代码TCP/IP堆栈，1：以太网和ARP（2016） Let’s code a TCP/IP stack, 1: Ethernet and ARP (2016)

让我们代码TCP/IP堆栈，1：以太网和ARP（2016）
Let’s code a TCP/IP stack, 1: Ethernet and ARP (2016)