Using Multipath TCP on recent Linux kernels
The first version of Multipath TCP on Linux was an out-of-tree patch initially developed by UCLouvain researchers [40]. It served as the reference implementation of Multipath TCP and influenced the design of the protocol, as new features were always tested on it. You can find additional information about this implementation on https://www.multipath-tcp.org.
Starting with version 5.6, the official Linux kernel includes support for Multipath TCP. The set of features supported by this implementation has increased over time as shown by its ChangeLog.
To avoid any interference with regular TCP, this implementation only creates a Multipath TCP connection if the application has created its socket using the IPPROTO_MPTCP protocol. Applications will probably be modified in the coming months and years to add specific support for Multipath TCP. In the meantime, the Multipath TCP developers have created a workaround that forces legacy applications to use Multipath TCP: the mptcpize command, which is bundled with the mptcpd daemon. We use this approach in this section and discuss applications with native Multipath TCP support later.
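As an illustration of what native support looks like, here is a minimal Python sketch that requests a Multipath TCP socket and falls back to regular TCP when the kernel refuses. The helper name open_stream_socket is hypothetical. The constant socket.IPPROTO_MPTCP only exists in recent Python versions on Linux, so the sketch assumes the numeric value 262 from the Linux headers as a fallback.

```python
import socket

# socket.IPPROTO_MPTCP is only defined by recent Python versions on Linux;
# 262 is assumed here to be the value used by the Linux kernel headers.
IPPROTO_MPTCP = getattr(socket, "IPPROTO_MPTCP", 262)


def open_stream_socket(family=socket.AF_INET):
    """Return (socket, uses_mptcp), preferring Multipath TCP over TCP."""
    try:
        sock = socket.socket(family, socket.SOCK_STREAM, IPPROTO_MPTCP)
        return sock, True
    except OSError:
        # Kernel without Multipath TCP support, or feature disabled.
        sock = socket.socket(family, socket.SOCK_STREAM, socket.IPPROTO_TCP)
        return sock, False


if __name__ == "__main__":
    sock, uses_mptcp = open_stream_socket()
    print("Multipath TCP socket" if uses_mptcp else "regular TCP socket")
    sock.close()
```

The fallback keeps the application usable on hosts where Multipath TCP is unavailable, which is the same goal that mptcpize pursues for unmodified applications.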
To illustrate Multipath TCP, we use a very simple setup with a Linux client using Ubuntu 22 and a Linux server using Debian. The client uses Linux kernel version 5.15 while the server uses version 5.17. The server has a single network interface with an IPv4 and an IPv6 address. The client has both a Wi-Fi and an Ethernet interface. These two interfaces are connected to the same router that allocates IP addresses in the same subnet on both interfaces. The client has both an IPv4 and an IPv6 address.
Enabling Multipath TCP
Multipath TCP is a feature that needs to be compiled inside the kernel. If you compile your own kernel, you can manually select Multipath TCP.
Most users prefer to rely on already compiled Linux kernels that are included in their distribution. The following distributions support Multipath TCP:
CentOS starting with
Debian starting with
Ubuntu starting with 22.04
You need to install a recent kernel to benefit from Multipath TCP. On some distributions, this installation will be part of the regular upgrade. On other distributions, you will need to add it manually.
Once the kernel has been installed and your computer has rebooted, you first need to verify that Multipath TCP is enabled.
sudo sysctl -a | grep mptcp.enabled
net.mptcp.enabled = 1
Here, the kernel supports Multipath TCP. If, for any reason, you want to disable Multipath TCP, you need to set this sysctl variable to 0.
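The same check can be scripted. The sketch below assumes that the net.mptcp.enabled sysctl is exposed as the file /proc/sys/net/mptcp/enabled, which is the usual mapping between sysctl names and the /proc/sys tree. The function name mptcp_enabled is hypothetical.

```python
from pathlib import Path

# Assumed to mirror the net.mptcp.enabled sysctl variable.
MPTCP_SYSCTL = Path("/proc/sys/net/mptcp/enabled")


def mptcp_enabled(path=MPTCP_SYSCTL):
    """Return True or False for the sysctl value, None if it is absent."""
    try:
        return path.read_text().strip() == "1"
    except OSError:
        # Missing file: the kernel probably lacks Multipath TCP support.
        return None


if __name__ == "__main__":
    state = mptcp_enabled()
    print({True: "enabled", False: "disabled", None: "not supported"}[state])
```

Returning None for a missing file separates a kernel without Multipath TCP from a kernel where the feature has been switched off.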
To illustrate the basic operation of mptcpize, let us first use the netcat command over the loopback interface. This is obviously not the target use case for Multipath TCP, but it is a convenient way to perform tests.
Netcat makes it easy to launch clients and servers. We start the server using: mptcpize run nc -l -p 12345. This is a TCP server that listens on port 12345. The client connects to this server using the mptcpize run nc 127.0.0.1 12345 command. The connection is established and all text lines sent by the client are printed by the server on standard output.
# mptcpize run nc -l -p 12345
Simple test
There are several ways to check that Multipath TCP is used for this connection. First, the ss command provides information about the status of the different sockets.
ss -iaM
State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
ESTAB 0 0 127.0.0.1:12345 127.0.0.1:52854
subflows_max:2 remote_key token:5bba80d9 write_seq:2266a099179e2476 snd_una:2266a099179e2476 rcv_nxt:de9999038d0a29a2
ESTAB 0 0 127.0.0.1:52854 127.0.0.1:12345
subflows_max:2 remote_key token:c1f12b87 write_seq:de9999038d0a29a2 snd_una:de9999038d0a29a2 rcv_nxt:2266a099179e2476
ss provides several pieces of information that help to debug a Multipath TCP connection. The first column indicates that the connection is in the Established state, which means that it can currently transfer data. The following columns give the length of the Receive and Send queues at the TCP level and the four-tuple that identifies the connection. The next line provides Multipath TCP information: the maximum number of subflows which can be attached to the connection, the token assigned by the remote host and the write_seq, snd_una and rcv_nxt parameters of the state machine. The next two lines provide the same information for the other direction of the connection.
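When many connections need to be inspected, the Multipath TCP line can be split into fields automatically. The following sketch is only an illustration: the function name parse_mptcp_info is hypothetical and the parsing assumes the name:value layout visible in the output above.

```python
def parse_mptcp_info(line):
    """Split a Multipath TCP detail line printed by `ss -iaM` into a dict.

    Fields look like `name:value`; bare words such as `remote_key`
    are stored as flags with the value True.
    """
    info = {}
    for field in line.split():
        name, sep, value = field.partition(":")
        info[name] = value if sep else True
    return info


if __name__ == "__main__":
    sample = ("subflows_max:2 remote_key token:5bba80d9 "
              "write_seq:2266a099179e2476 snd_una:2266a099179e2476 "
              "rcv_nxt:de9999038d0a29a2")
    info = parse_mptcp_info(sample)
    # Equal values suggest that all written data has been acknowledged.
    print(info["token"], info["write_seq"] == info["snd_una"])
```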
It is also possible to capture packets on the loopback interface to verify that Multipath TCP is used. The output below provides the first collected packets:
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on lo, link-type EN10MB (Ethernet), snapshot length 262144 bytes
18:43:42.676396 IP 127.0.0.1.52854 > 127.0.0.1.12345: Flags [S], seq 904893125, win 65495, options [mss 65495,sackOK,TS val 4026038040 ecr 0,nop,wscale 7,mptcp capable v1], length 0
18:43:42.676426 IP 127.0.0.1.12345 > 127.0.0.1.52854: Flags [S.], seq 1804351310, ack 904893126, win 65483, options [mss 65495,sackOK,TS val 4026038040 ecr 4026038040,nop,wscale 7,mptcp capable v1 {0x45edb502d861e7b1}], length 0
18:43:42.676472 IP 127.0.0.1.52854 > 127.0.0.1.12345: Flags [.], ack 1, win 512, options [nop,nop,TS val 4026038040 ecr 4026038040,mptcp capable v1 {0xdbb760db1d55e07b,0x45edb502d861e7b1}], length 0
18:44:59.519697 IP 127.0.0.1.52854 > 127.0.0.1.12345: Flags [P.], seq 1:13, ack 1, win 512, options [nop,nop,TS val 4026114884 ecr 4026038040,mptcp capable v1 {0xdbb760db1d55e07b,0x45edb502d861e7b1},nop,nop], length 12
18:44:59.519755 IP 127.0.0.1.12345 > 127.0.0.1.52854: Flags [.], ack 13, win 512, options [nop,nop,TS val 4026114884 ecr 4026114884,mptcp dss ack 16040019788386937262], length 0
The first packet is the SYN that includes the MP_CAPABLE option. The server replies with a SYN+ACK whose MP_CAPABLE option contains the server key. The client returns the third ACK with an MP_CAPABLE option that carries the two keys. As the server did not send any data, the MP_CAPABLE option is sent again in the packet containing the Simple test string. In the trace above, this packet carries no separate DSS option: in Multipath TCP version 1, the MP_CAPABLE option of the first data packet also describes the data that it carries. The server replies with an acknowledgment that carries the DSS option.
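The handshake steps described above can be recognized in a capture by counting the keys printed with the MP_CAPABLE option: none in the SYN, one in the SYN+ACK and two afterwards. The sketch below assumes the textual format produced by tcpdump in the trace above. The function name mptcp_option is hypothetical.

```python
import re

# The word following "mptcp" in tcpdump's option list (capable, dss, ...).
SUBTYPE = re.compile(r"\bmptcp ([a-z-]+)")
# The keys that tcpdump prints between braces for MP_CAPABLE.
KEYS = re.compile(r"\{([0-9a-fx,]+)\}")


def mptcp_option(line):
    """Return (subtype, keys) for a tcpdump line, or None without MPTCP."""
    subtype = SUBTYPE.search(line)
    if subtype is None:
        return None
    keys = KEYS.search(line)
    key_list = keys.group(1).split(",") if keys else []
    return subtype.group(1), key_list


if __name__ == "__main__":
    line = ("18:43:42.676426 IP 127.0.0.1.12345 > 127.0.0.1.52854: "
            "Flags [S.], options [mss 65495,sackOK,nop,wscale 7,"
            "mptcp capable v1 {0x45edb502d861e7b1}], length 0")
    print(mptcp_option(line))
```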
We can now use the netcat application to explore the operation of Multipath TCP over the Internet. Let us start with a very simple example.
mptcpize run nc serverv4 12345
hello
The netcat process listens on port 12345 on the server. This results in the following Multipath TCP connection:
09:05:23.695876 IP host-78-129-5-171.dynamic.voo.be.41510 > serverv4.12345: Flags [S], seq 3525674027, win 64240, options [mss 1460,sackOK,TS val 2619832768 ecr 0,nop,wscale 7,mptcp capable v1], length 0
09:05:23.696076 IP serverv4.12345 > host-78-129-5-171.dynamic.voo.be.41510: Flags [S.], seq 1745741580, ack 3525674028, win 65160, options [mss 1460,sackOK,TS val 3069340264 ecr 2619832768,nop,wscale 7,mptcp capable v1 {0x82aa42ef4245f0d0}], length 0
09:05:23.707909 IP host-78-129-5-171.dynamic.voo.be.41510 > serverv4.12345: Flags [.], ack 1, win 502, options [nop,nop,TS val 2619832783 ecr 3069340264,mptcp capable v1 {0x9dc8e3972e3d9f25,0x82aa42ef4245f0d0}], length 0
09:05:30.776312 IP host-78-129-5-171.dynamic.voo.be.41510 > serverv4.12345: Flags [P.], seq 1:7, ack 1, win 502, options [nop,nop,TS val 2619839851 ecr 3069340264,mptcp capable v1 {0x9dc8e3972e3d9f25,0x82aa42ef4245f0d0},nop,nop], length 6
09:05:30.776484 IP serverv4.12345 > host-78-129-5-171.dynamic.voo.be.41510: Flags [.], ack 7, win 510, options [nop,nop,TS val 3069347345 ecr 2619839851,mptcp dss ack 1561335003985645838], length 0
This is a Multipath TCP connection since it includes the Multipath TCP options, but the client does not create an additional subflow and the server does not announce its other addresses. This is the expected behavior since these operations are controlled by the path manager. On Linux, the Multipath TCP path manager can be configured using the ip mptcp command. This command can be used to configure different parameters that are associated with an IP address. The path manager associates a numeric identifier with each IP address or endpoint. The ip mptcp endpoint show command lists the identifiers of the active IP addresses on the host. Here is an example of the output of this command on our client:
ip mptcp endpoint show
fe80::3934:7572:b1ff:b555 id 1 dev wlp3s0
192.168.0.43 id 2 dev wlp3s0
fe80::5642:39bd:3390:43d3 id 3 dev enp2s0
192.168.0.37 id 4 dev enp2s0
2a02:2788:10c4:123:3d66:f590:d891:8fb3 id 5 dev wlp3s0
2a02:2788:10c4:123:6636:10c6:692b:18cc id 6 dev enp2s0
2a02:2788:10c4:123:2a09:5ec7:9b99:4a97 id 7 dev enp2s0
The two fe80 addresses are the IPv6 link-local addresses configured on the Ethernet (enp2s0) and Wi-Fi (wlp3s0) interfaces of our client host. There are three flags which can be associated with each endpoint identifier:
subflow. When this flag is set, the path manager will try to create a subflow over this interface when a Multipath TCP connection is created or when the interface becomes active during an ongoing Multipath TCP connection. This flag is mainly useful for clients.
signal. When this flag is set, the path manager will announce the address of the endpoint over any Multipath TCP connection created using other addresses. This flag can be used on clients or servers. It is mainly useful on servers that have multiple addresses.
backup. This flag can be combined with the two other flags. When combined with the subflow flag, it indicates that a backup subflow will be created. When combined with the signal flag, it indicates that the address will be advertised as a backup address.
On our client host, we can configure the Wi-Fi interface as a backup interface that creates subflows as follows:
sudo ip mptcp endpoint del id 2
sudo ip mptcp endpoint add 192.168.0.43 dev wlp3s0 subflow backup
sudo ip mptcp endpoint show
fe80::3934:7572:b1ff:b555 id 1 dev wlp3s0
fe80::5642:39bd:3390:43d3 id 3 dev enp2s0
192.168.0.37 id 4 dev enp2s0
2a02:2788:10c4:123:3d66:f590:d891:8fb3 id 5 dev wlp3s0
2a02:2788:10c4:123:6636:10c6:692b:18cc id 6 dev enp2s0
2a02:2788:10c4:123:2a09:5ec7:9b99:4a97 id 7 dev enp2s0
192.168.0.43 id 8 subflow backup dev wlp3s0
We had to first remove the configuration for this endpoint because a default one was already active. Then we added the new parameters and verified them.
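To check such a configuration from a script, the endpoint list can be turned into structured records. The sketch below assumes the line layout shown in the output above (address, id, optional flags, dev). The function name parse_endpoint is hypothetical and only the three flags discussed in this section are recognized.

```python
KNOWN_FLAGS = {"subflow", "signal", "backup"}


def parse_endpoint(line):
    """Parse one line of `ip mptcp endpoint show` into a dict."""
    words = line.split()
    endpoint = {"address": words[0], "id": None, "dev": None, "flags": []}
    rest = iter(words[1:])
    for word in rest:
        if word == "id":
            endpoint["id"] = int(next(rest))   # value follows the keyword
        elif word == "dev":
            endpoint["dev"] = next(rest)
        elif word in KNOWN_FLAGS:
            endpoint["flags"].append(word)
    return endpoint


if __name__ == "__main__":
    print(parse_endpoint("192.168.0.43 id 8 subflow backup dev wlp3s0"))
```

A script could then verify, for example, that the Wi-Fi endpoint carries both the subflow and backup flags before starting a test.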
The path manager also has some limits which can be configured using the ip mptcp limits command. Two limits can be set.
ip mptcp limits set subflows n, where n is an integer. This restricts each Multipath TCP connection to at most n different subflows. Servers should protect themselves by setting this limit to a few subflows. Most use cases would work well with 2 or 4 subflows.
ip mptcp limits set add_addr_accepted n, where n is an integer. This parameter limits the number of addresses that are learned over each Multipath TCP connection. It can be used to protect the Multipath TCP implementation against attacks where too many addresses are advertised. Most use cases would work with 4 accepted addresses.
These parameters control the path manager, but before creating Multipath TCP subflows over different paths, we need to configure the IP routing table of our client host. Our client host has two network interfaces: Wi-Fi and Ethernet. By default, Linux prefers the Ethernet interface to Wi-Fi. The two interfaces are configured as follows:
ip -4 addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
2: enp2s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
inet 192.168.0.37/24 brd 192.168.0.255 scope global dynamic noprefixroute enp2s0
valid_lft 75697sec preferred_lft 75697sec
3: wlp3s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
inet 192.168.0.43/24 brd 192.168.0.255 scope global dynamic noprefixroute wlp3s0
valid_lft 75696sec preferred_lft 75696sec
By default, Linux creates the two following default routes.
route -n
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
0.0.0.0 192.168.0.1 0.0.0.0 UG 100 0 0 enp2s0
0.0.0.0 192.168.0.1 0.0.0.0 UG 600 0 0 wlp3s0
We need to configure the routing tables to be able to use the two interfaces simultaneously. For this, we need to ensure that packets with source address 192.168.0.37 are sent over the enp2s0 interface while packets with source address 192.168.0.43 are sent over the wlp3s0 interface. This can be achieved using two different routing tables.
# create the two routing tables
ip rule add from 192.168.0.37 table 1
ip rule add from 192.168.0.43 table 2
# Configure routing table 1 for enp2s0
ip route add 192.168.0.0/24 dev enp2s0 scope link table 1
ip route add default via 192.168.0.1 dev enp2s0 table 1
# Configure routing table 2 for wlp3s0
ip route add 192.168.0.0/24 dev wlp3s0 scope link table 2
ip route add default via 192.168.0.1 dev wlp3s0 table 2
# Configure a default route to regular internet
ip route add default scope global nexthop via 192.168.0.1 dev enp2s0
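On hosts whose addresses change regularly, the per-source commands above can be generated instead of typed. The sketch below only builds the command strings and does not execute them. The function name policy_routing_commands is hypothetical, and numbering the tables from 1 in interface order is an assumption that matches our example.

```python
import ipaddress


def policy_routing_commands(interfaces, gateway):
    """Build the per-source routing commands for each interface.

    `interfaces` is a list of (device, address_with_prefix) pairs; table
    numbers are assigned in order, starting at 1.
    """
    commands = []
    for table, (dev, cidr) in enumerate(interfaces, start=1):
        iface = ipaddress.ip_interface(cidr)
        commands += [
            f"ip rule add from {iface.ip} table {table}",
            f"ip route add {iface.network} dev {dev} scope link table {table}",
            f"ip route add default via {gateway} dev {dev} table {table}",
        ]
    return commands


if __name__ == "__main__":
    for command in policy_routing_commands(
        [("enp2s0", "192.168.0.37/24"), ("wlp3s0", "192.168.0.43/24")],
        gateway="192.168.0.1",
    ):
        print(command)
```

The generated commands are grouped per table rather than listing the two rules first, which should not change the resulting configuration.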
We can check the routing tables using the ip command.
ip rule show
0: from all lookup local
32764: from 192.168.0.43 lookup 2
32765: from 192.168.0.37 lookup 1
32766: from all lookup main
32767: from all lookup default
ip route
default via 192.168.0.1 dev enp2s0
default via 192.168.0.1 dev enp2s0 proto dhcp metric 100
default via 192.168.0.1 dev wlp3s0 proto dhcp metric 600
169.254.0.0/16 dev wlp3s0 scope link metric 1000
192.168.0.0/24 dev enp2s0 proto kernel scope link src 192.168.0.37 metric 100
192.168.0.0/24 dev wlp3s0 proto kernel scope link src 192.168.0.43 metric 600
ip route show table 1
default via 192.168.0.1 dev enp2s0
192.168.0.0/24 dev enp2s0 scope link
ip route show table 2
default via 192.168.0.1 dev wlp3s0
192.168.0.0/24 dev wlp3s0 scope link
We can verify that the two routing tables are correct using nc by forcing it to use a specific source address.
echo -e "GET / HTTP/1.0\r\n" | nc -4 -s 192.168.0.37 test.multipath-tcp.org 80
HTTP/1.0 200 OK
Content-Type: text/html
ETag: "4215149735"
Last-Modified: Tue, 05 Jul 2022 16:11:47 GMT
Content-Length: 389
Connection: close
Date: Wed, 06 Jul 2022 11:34:24 GMT
Server: lighttpd/1.4.59
<!DOCTYPE html>
<html>
<head>
<title>Welcome to test.multipath-tcp.org!</title>
<style>
body {
width: 35em;
margin: 0 auto;
font-family: Tahoma, Verdana, Arial, sans-serif;
}
</style>
</head>
<body>
<h1>Welcome to test.multipath-tcp.org !</h1>
<p>This web server runs Multipath TCP v1</p>
<p><em>Thank you for using Multipath TCP.</em></p>
</body>
</html>
You should get the same result when using the second interface, IP address 192.168.0.43 in our example.
echo -e "GET / HTTP/1.0\r\n" | nc -4 -s 192.168.0.43 test.multipath-tcp.org 80
The next step is to verify that Multipath TCP is working correctly and that two subflows are created. For this, we use the -i parameter of nc to add a delay between the two lines of the HTTP GET. We leverage this delay to check that Multipath TCP is working correctly using ss or tcpdump.
echo -e "GET / HTTP/1.0\r\n" | mptcpize run nc -4 -i 5 test.multipath-tcp.org 80
We can observe the creation of the connection and the subflow using both ss and tcpdump. ss shows that there are two subflows towards test.multipath-tcp.org.
ss -4 -iatM dst test.multipath-tcp.org
Netid State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
tcp ESTAB 0 0 192.168.0.43%wlp3s0:34801 5.196.67.207:http
cubic wscale:7,7 rto:220 rtt:17.439/8.719 mss:1448 pmtu:1500 rcvmss:536 advmss:1448 cwnd:10 bytes_acked:1 segs_out:2 segs_in:2 send 6.64Mbps lastsnd:1776 lastrcv:1776 lastack:1764 pacing_rate 13.3Mbps delivered:1 rcv_space:14480 rcv_ssthresh:64088 minrtt:17.439
tcp ESTAB 0 0 192.168.0.37:47672 5.196.67.207:http
cubic wscale:7,7 rto:216 rtt:14/5.405 mss:1448 pmtu:1500 rcvmss:536 advmss:1448 cwnd:10 bytes_sent:16 bytes_acked:17 segs_out:3 segs_in:3 data_segs_out:1 send 8.27Mbps lastsnd:1808 lastrcv:1808 lastack:1792 pacing_rate 16.5Mbps delivery_rate 790kbps delivered:2 busy:16ms rcv_space:14480 rcv_ssthresh:64088 minrtt:13.905
mptcp ESTAB 0 0 192.168.0.37:47672 5.196.67.207:http
subflows:1 subflows_max:8 remote_key token:e1e3cdeb write_seq:1045ecfa3f05f4ea snd_una:1045ecfa3f05f4ea rcv_nxt:d0568f430363c9aa
The line starting with mptcp indicates that the Multipath TCP connection above has one additional subflow.
The tcpdump output reveals precisely which packets have been sent over each network interface.
sudo tcpdump -n -i any host test.multipath-tcp.org and port 80
tcpdump: data link type LINUX_SLL2
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
13:43:26.620667 enp2s0 Out IP 192.168.0.37.47672 > 5.196.67.207.80: Flags [S], seq 3585891423, win 64240, options [mss 1460,sackOK,TS val 3993892549 ecr 0,nop,wscale 7,mptcp capable v1], length 0
13:43:26.634537 enp2s0 In IP 5.196.67.207.80 > 192.168.0.37.47672: Flags [S.], seq 1788691420, ack 3585891424, win 65160, options [mss 1460,sackOK,TS val 1030255900 ecr 3993892549,nop,wscale 7,mptcp capable v1 {0x54f04ad5bd2d9f42}], length 0
13:43:26.634609 enp2s0 Out IP 192.168.0.37.47672 > 5.196.67.207.80: Flags [.], ack 1, win 502, options [nop,nop,TS val 3993892563 ecr 1030255900,mptcp capable v1 {0xff2ec3a2f6151881,0x54f04ad5bd2d9f42}], length 0
13:43:26.634718 enp2s0 Out IP 192.168.0.37.47672 > 5.196.67.207.80: Flags [P.], seq 1:17, ack 1, win 502, options [nop,nop,TS val 3993892563 ecr 1030255900,mptcp capable v1 {0xff2ec3a2f6151881,0x54f04ad5bd2d9f42},nop,nop], length 16: HTTP: GET / HTTP/1.0
13:43:26.649351 enp2s0 In IP 5.196.67.207.80 > 192.168.0.37.47672: Flags [.], ack 17, win 509, options [nop,nop,TS val 1030255916 ecr 3993892563,mptcp dss ack 1172603837543216362], length 0
13:43:26.649351 enp2s0 In IP 5.196.67.207.80 > 192.168.0.37.47672: Flags [.], ack 17, win 509, options [nop,nop,TS val 1030255916 ecr 3993892563,mptcp dss ack 1172603837543216362], length 0
13:43:26.649498 wlp3s0 Out IP 192.168.0.43.34801 > 5.196.67.207.80: Flags [S], seq 2321572505, win 64240, options [mss 1460,sackOK,TS val 1218002018 ecr 0,nop,wscale 7,mptcp join id 8 token 0xeef7df2f nonce 0xc0d346f6], length 0
13:43:26.666895 wlp3s0 In IP 5.196.67.207.80 > 192.168.0.43.34801: Flags [S.], seq 1973196884, ack 2321572506, win 65160, options [mss 1460,sackOK,TS val 1030255931 ecr 1218002018,nop,wscale 7,mptcp join id 0 hmac 0xc7489cf7056428b4 nonce 0xa54f9af], length 0
13:43:26.666966 wlp3s0 Out IP 192.168.0.43.34801 > 5.196.67.207.80: Flags [.], ack 1, win 502, options [nop,nop,TS val 1218002035 ecr 1030255931,mptcp join hmac 0xb4e6a41bf5861313df7f5f454966998ad7e698a4], length 0
13:43:26.677776 wlp3s0 In IP 5.196.67.207.80 > 192.168.0.43.34801: Flags [.], ack 1, win 510, options [nop,nop,TS val 1030255944 ecr 1218002035,mptcp dss ack 1172603837543216362], length 0
13:43:31.635023 enp2s0 Out IP 192.168.0.37.47672 > 5.196.67.207.80: Flags [P.], seq 17:18, ack 1, win 502, options [nop,nop,TS val 3993897563 ecr 1030255916,mptcp dss ack 56871338 seq 1172603837543216362 subseq 17 len 1,nop,nop], length 1: HTTP
13:43:31.646703 enp2s0 In IP 5.196.67.207.80 > 192.168.0.37.47672: Flags [.], ack 18, win 509, options [nop,nop,TS val 1030260913 ecr 3993897563,mptcp dss ack 1172603837543216363], length 0
13:43:31.647276 enp2s0 In IP 5.196.67.207.80 > 192.168.0.37.47672: Flags [P.], seq 1:602, ack 18, win 509, options [nop,nop,TS val 1030260914 ecr 3993897563,mptcp dss ack 1172603837543216363 seq 15012343925868579242 subseq 1 len 601,nop,nop], length 601: HTTP: HTTP/1.0 200 OK
13:43:31.647300 enp2s0 Out IP 192.168.0.37.47672 > 5.196.67.207.80: Flags [.], ack 602, win 501, options [nop,nop,TS val 3993897576 ecr 1030260914,mptcp dss ack 15012343925868579843], length 0
13:43:31.647276 enp2s0 In IP 5.196.67.207.80 > 192.168.0.37.47672: Flags [.], ack 18, win 509, options [nop,nop,TS val 1030260914 ecr 3993897563,mptcp dss fin ack 1172603837543216363 seq 15012343925868579843 subseq 0 len 1,nop,nop], length 0
13:43:31.647321 enp2s0 Out IP 192.168.0.37.47672 > 5.196.67.207.80: Flags [.], ack 602, win 501, options [nop,nop,TS val 3993897576 ecr 1030260914,mptcp dss ack 15012343925868579844], length 0
13:43:31.647330 wlp3s0 Out IP 192.168.0.43.34801 > 5.196.67.207.80: Flags [.], ack 1, win 502, options [nop,nop,TS val 1218002046 ecr 1030255944,mptcp dss ack 15012343925868579844], length 0
13:43:31.648565 wlp3s0 In IP 5.196.67.207.80 > 192.168.0.43.34801: Flags [.], ack 1, win 510, options [nop,nop,TS val 1030255944 ecr 1218002035,mptcp dss fin ack 1172603837543216363 seq 15012343925868579843 subseq 0 len 1,nop,nop], length 0
13:43:36.635392 enp2s0 Out IP 192.168.0.37.47672 > 5.196.67.207.80: Flags [.], ack 602, win 501, options [nop,nop,TS val 3993897576 ecr 1030260914,mptcp dss fin ack 15012343925868579844 seq 1172603837543216363 subseq 0 len 1,nop,nop], length 0
13:43:36.635416 wlp3s0 Out IP 192.168.0.43.34801 > 5.196.67.207.80: Flags [.], ack 1, win 502, options [nop,nop,TS val 1218007017 ecr 1030255944,mptcp dss fin ack 15012343925868579844 seq 1172603837543216363 subseq 0 len 1,nop,nop], length 0
13:43:36.636468 enp2s0 Out IP 192.168.0.37.47672 > 5.196.67.207.80: Flags [.], ack 602, win 501, options [nop,nop,TS val 3993897576 ecr 1030260914,mptcp dss fin ack 15012343925868579844 seq 1172603837543216363 subseq 0 len 1,nop,nop], length 0
13:43:36.636482 wlp3s0 Out IP 192.168.0.43.34801 > 5.196.67.207.80: Flags [.], ack 1, win 502, options [nop,nop,TS val 1218007017 ecr 1030255944,mptcp dss fin ack 15012343925868579844 seq 1172603837543216363 subseq 0 len 1,nop,nop], length 0
13:43:36.640425 enp2s0 Out IP 192.168.0.37.47672 > 5.196.67.207.80: Flags [.], ack 602, win 501, options [nop,nop,TS val 3993897576 ecr 1030260914,mptcp dss fin ack 15012343925868579844 seq 1172603837543216363 subseq 0 len 1,nop,nop], length 0
13:43:36.640431 wlp3s0 Out IP 192.168.0.43.34801 > 5.196.67.207.80: Flags [.], ack 1, win 502, options [nop,nop,TS val 1218007017 ecr 1030255944,mptcp dss fin ack 15012343925868579844 seq 1172603837543216363 subseq 0 len 1,nop,nop], length 0
13:43:36.645605 enp2s0 In IP 5.196.67.207.80 > 192.168.0.37.47672: Flags [.], ack 18, win 509, options [nop,nop,TS val 1030265912 ecr 3993897576,mptcp dss ack 1172603837543216364], length 0
13:43:36.645659 enp2s0 Out IP 192.168.0.37.47672 > 5.196.67.207.80: Flags [F.], seq 18, ack 602, win 501, options [nop,nop,TS val 3993902574 ecr 1030265912,mptcp dss ack 15012343925868579844], length 0
13:43:36.645674 wlp3s0 Out IP 192.168.0.43.34801 > 5.196.67.207.80: Flags [F.], seq 1, ack 1, win 502, options [nop,nop,TS val 1218012014 ecr 1030255944,mptcp dss ack 15012343925868579844], length 0
13:43:36.646315 wlp3s0 In IP 5.196.67.207.80 > 192.168.0.43.34801: Flags [.], ack 1, win 510, options [nop,nop,TS val 1030260930 ecr 1218002046,mptcp dss ack 1172603837543216364], length 0
13:43:36.647699 enp2s0 In IP 5.196.67.207.80 > 192.168.0.37.47672: Flags [F.], seq 602, ack 18, win 509, options [nop,nop,TS val 1030265912 ecr 3993897576,mptcp dss ack 1172603837543216364], length 0
13:43:36.647718 enp2s0 Out IP 192.168.0.37.47672 > 5.196.67.207.80: Flags [.], ack 603, win 501, options [nop,nop,TS val 3993902576 ecr 1030265912,mptcp dss ack 15012343925868579844], length 0
13:43:36.648629 wlp3s0 In IP 5.196.67.207.80 > 192.168.0.43.34801: Flags [F.], seq 1, ack 1, win 510, options [nop,nop,TS val 1030265912 ecr 1218002046,mptcp dss ack 1172603837543216364], length 0
13:43:36.648649 wlp3s0 Out IP 192.168.0.43.34801 > 5.196.67.207.80: Flags [.], ack 2, win 502, options [nop,nop,TS val 1218012017 ecr 1030265912,mptcp dss ack 15012343925868579844], length 0
13:43:36.657040 enp2s0 In IP 5.196.67.207.80 > 192.168.0.37.47672: Flags [.], ack 19, win 509, options [nop,nop,TS val 1030265923 ecr 3993902574,mptcp dss ack 1172603837543216364], length 0
13:43:36.662211 wlp3s0 In IP 5.196.67.207.80 > 192.168.0.43.34801: Flags [.], ack 2, win 510, options [nop,nop,TS val 1030265928 ecr 1218012014,mptcp dss ack 1172603837543216364], length 0
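A trace of this size is easier to read once it has been summarized per interface. The sketch below counts packets by interface and by Multipath TCP option, assuming the line format that tcpdump -i any prints above (timestamp, interface, direction). The function name summarize is hypothetical.

```python
import re
from collections import Counter

# Timestamp, interface name and direction, as printed by `tcpdump -i any`.
PACKET = re.compile(r"^\S+ (\S+) (In|Out) IP6? ")
# The word following "mptcp" in the option list (capable, join, dss, ...).
SUBTYPE = re.compile(r"\bmptcp ([a-z-]+)")


def summarize(lines):
    """Count packets per (interface, Multipath TCP option) pair."""
    counts = Counter()
    for line in lines:
        packet = PACKET.match(line)
        if packet is None:
            continue  # header lines printed by tcpdump itself
        subtype = SUBTYPE.search(line)
        option = subtype.group(1) if subtype else "none"
        counts[(packet.group(1), option)] += 1
    return counts


if __name__ == "__main__":
    import sys
    # Usage sketch: sudo tcpdump -n -i any ... | python3 this_script.py
    for (dev, option), count in sorted(summarize(sys.stdin).items()):
        print(dev, option, count)
```

On the trace above, such a summary would show the capable packets only on enp2s0 and the join packets only on wlp3s0, which matches the roles of the initial and additional subflows.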
If your host is dual stack, you also need to apply the same configuration for IPv6. Our test host uses the following IPv6 addresses.
ip -6 addr show
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 state UNKNOWN qlen 1000
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: enp2s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP qlen 1000
inet6 2a02:2788:10c4:123:f468:1851:9a9f:7d44/64 scope global temporary dynamic
valid_lft 592298sec preferred_lft 73422sec
inet6 2a02:2788:10c4:123:6636:10c6:692b:18cc/64 scope global dynamic mngtmpaddr noprefixroute
valid_lft 1209600sec preferred_lft 604800sec
inet6 fe80::5642:39bd:3390:43d3/64 scope link noprefixroute
valid_lft forever preferred_lft forever
3: wlp3s0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 state UP qlen 1000
inet6 2a02:2788:10c4:123:3d66:f590:d891:8fb3/64 scope global dynamic noprefixroute
valid_lft 1209600sec preferred_lft 604800sec
inet6 fe80::3934:7572:b1ff:b555/64 scope link noprefixroute
valid_lft forever preferred_lft forever
We thus had to configure the following IPv6 routing tables. This is similar to the commands used to configure the IPv4 routing tables.
ip -6 rule add from 2a02:2788:10c4:123:6636:10c6:692b:18cc table 1
ip -6 rule add from 2a02:2788:10c4:123:3d66:f590:d891:8fb3 table 2
ip route add 2a02:2788:10c4:123::/64 dev enp2s0 scope link table 1
ip route add 2a02:2788:10c4:123::/64 dev wlp3s0 scope link table 2
ip route add default via fe80::10:18ff:fe07:fc33 dev enp2s0 table 1
ip route add default via fe80::10:18ff:fe07:fc33 dev wlp3s0 table 2
ip route add default scope global nexthop via fe80::10:18ff:fe07:fc33 dev enp2s0
Remember that if you want to create subflows using IPv6 addresses, you also need to configure the stack with ip mptcp endpoint add as you did for the IPv4 addresses.
Note
The current versions of the Linux kernel only use one address family at a time. If a connection was created using IPv4, then only IPv4 addresses will be used to create new subflows. Future versions of the kernel will allow mixing IPv4 and IPv6 subflows.
Analyzing the output of ss
Analyzing the output of nstat
The Linux TCP/IP stack maintains a lot of counters that track various events inside the kernel. These counters are very useful for system administrators who need to manage Linux hosts and debug some specific networking problems.
Linux supports a few hundred counters associated with the protocols in the network and transport layers. Other operating systems have defined their own counters to track similar networking events. Fortunately, the IETF has standardized some counters that are common to different operating systems and TCP/IP implementations. These counters are exported as variables which can be queried using a management protocol such as SNMP. This enables a management server to collect statistics from a series of hosts in order to process and analyze them. Several versions of SNMP exist, but we will not discuss them in detail in this document. Instead, we focus on the Linux TCP/IP implementation and explain the different counters that the nstat application exposes to the user.
Linux kernel version 5.18 collects 363 different counters that are divided into 7 categories:
67 counters track the IPv4 implementation
80 counters track the ICMPv4 implementation
32 counters track the IPv6 implementation
46 counters track ICMPv6
135 counters track TCP
35 counters track UDP
46 counters track Multipath TCP
Some of these counters are part of the Management Information Bases (MIB) defined within the IETF, e.g. RFC 1213 for IPv4 and ICMPv4, RFC 4293 for IPv6 and ICMPv6, RFC 4022 for TCP, RFC 4113 for UDP. As of this writing, there is no official IETF MIB for Multipath TCP.
Using nstat
In this document, we describe the counters that are exposed by nstat for the different protocols of the TCP/IP stack. Before discussing these counters, it is useful to understand how nstat works.
nstat is a command line tool that supports a small number of arguments:
nstat --help
Usage: nstat [OPTION] [ PATTERN [ PATTERN ] ]
-h, --help this message
-a, --ignore ignore history
-d, --scan=SECS sample every statistics every SECS
-j, --json format output in JSON
-n, --nooutput do history only
-p, --pretty pretty print
-r, --reset reset history
-s, --noupdate don't update history
-t, --interval=SECS report average over the last SECS
-V, --version output version information
-z, --zeros show entries with zero activity
By default, nstat displays the counters whose value has changed since the latest invocation of the tool. This is usually a small subset of the counters that depends on the networking activity of the host.
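The idea of reporting only the counters that changed can be illustrated with two snapshots of absolute values, such as two outputs of nstat -az. This sketch does not reproduce the history file that nstat maintains; it only compares two captured outputs. The function names parse_nstat and changed are hypothetical.

```python
def parse_nstat(text):
    """Parse `nstat -az` output into a {counter: value} dict."""
    counters = {}
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and the "#kernel" header
        name, value, *_rate = line.split()
        counters[name] = int(value)
    return counters


def changed(before, after):
    """Return the increase of each counter that differs between snapshots."""
    return {
        name: value - before.get(name, 0)
        for name, value in after.items()
        if value != before.get(name, 0)
    }


if __name__ == "__main__":
    first = parse_nstat("#kernel\nTcpInSegs 100 0.0\nTcpOutSegs 80 0.0\n")
    second = parse_nstat("#kernel\nTcpInSegs 112 0.0\nTcpOutSegs 80 0.0\n")
    print(changed(first, second))
```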
nstat can collect historical information and provides average counters.
nstat can also list the current value of the different counters.
#nstat -az
#kernel
IpInReceives 1073367 0.0
IpInHdrErrors 0 0.0
IpInAddrErrors 0 0.0
IpForwDatagrams 0 0.0
IpInUnknownProtos 0 0.0
IpInDiscards 0 0.0
IpInDelivers 1072518 0.0
IpOutRequests 484889 0.0
IpOutDiscards 0 0.0
IpOutNoRoutes 0 0.0
IpReasmTimeout 0 0.0
IpReasmReqds 0 0.0
IpReasmOKs 0 0.0
IpReasmFails 0 0.0
IpFragOKs 0 0.0
IpFragFails 0 0.0
IpFragCreates 0 0.0
IcmpInMsgs 561 0.0
IcmpInErrors 125 0.0
IcmpInCsumErrors 0 0.0
IcmpInDestUnreachs 6 0.0
IcmpInTimeExcds 125 0.0
IcmpInParmProbs 0 0.0
IcmpInSrcQuenchs 0 0.0
IcmpInRedirects 0 0.0
IcmpInEchos 298 0.0
IcmpInEchoReps 0 0.0
IcmpInTimestamps 33 0.0
IcmpInTimestampReps 0 0.0
IcmpInAddrMasks 99 0.0
IcmpInAddrMaskReps 0 0.0
IcmpOutMsgs 331 0.0
IcmpOutErrors 0 0.0
IcmpOutDestUnreachs 0 0.0
IcmpOutTimeExcds 0 0.0
IcmpOutParmProbs 0 0.0
IcmpOutSrcQuenchs 0 0.0
IcmpOutRedirects 0 0.0
IcmpOutEchos 0 0.0
IcmpOutEchoReps 298 0.0
IcmpOutTimestamps 0 0.0
IcmpOutTimestampReps 33 0.0
IcmpOutAddrMasks 0 0.0
IcmpOutAddrMaskReps 0 0.0
IcmpMsgInType3 6 0.0
IcmpMsgInType8 298 0.0
IcmpMsgInType11 125 0.0
IcmpMsgInType13 33 0.0
IcmpMsgInType17 99 0.0
IcmpMsgOutType0 298 0.0
IcmpMsgOutType14 33 0.0
TcpActiveOpens 3330 0.0
TcpPassiveOpens 252 0.0
TcpAttemptFails 0 0.0
TcpEstabResets 78 0.0
TcpInSegs 3202615 0.0
TcpOutSegs 6431616 0.0
TcpRetransSegs 7584 0.0
TcpInErrs 0 0.0
TcpOutRsts 102 0.0
TcpInCsumErrors 0 0.0
UdpInDatagrams 18972 0.0
UdpNoPorts 0 0.0
UdpInErrors 0 0.0
UdpOutDatagrams 19257 0.0
UdpRcvbufErrors 0 0.0
UdpSndbufErrors 0 0.0
UdpInCsumErrors 0 0.0
UdpIgnoredMulti 19989 0.0
UdpMemErrors 0 0.0
UdpLiteInDatagrams 0 0.0
UdpLiteNoPorts 0 0.0
UdpLiteInErrors 0 0.0
UdpLiteOutDatagrams 0 0.0
UdpLiteRcvbufErrors 0 0.0
UdpLiteSndbufErrors 0 0.0
UdpLiteInCsumErrors 0 0.0
UdpLiteIgnoredMulti 0 0.0
UdpLiteMemErrors 0 0.0
Ip6InReceives 2198489 0.0
Ip6InHdrErrors 0 0.0
Ip6InTooBigErrors 0 0.0
Ip6InNoRoutes 200 0.0
Ip6InAddrErrors 0 0.0
Ip6InUnknownProtos 0 0.0
Ip6InTruncatedPkts 0 0.0
Ip6InDiscards 0 0.0
Ip6InDelivers 2177604 0.0
Ip6OutForwDatagrams 0 0.0
Ip6OutRequests 1567967 0.0
Ip6OutDiscards 0 0.0
Ip6OutNoRoutes 6 0.0
Ip6ReasmTimeout 0 0.0
Ip6ReasmReqds 0 0.0
Ip6ReasmOKs 0 0.0
Ip6ReasmFails 0 0.0
Ip6FragOKs 0 0.0
Ip6FragFails 0 0.0
Ip6FragCreates 0 0.0
Ip6InMcastPkts 20785 0.0
Ip6OutMcastPkts 13 0.0
Ip6InOctets 2578707266 0.0
Ip6OutOctets 3533261025 0.0
Ip6InMcastOctets 1442288 0.0
Ip6OutMcastOctets 1252 0.0
Ip6InBcastOctets 0 0.0
Ip6OutBcastOctets 0 0.0
Ip6InNoECTPkts 2060704 0.0
Ip6InECT1Pkts 0 0.0
Ip6InECT0Pkts 137799 0.0
Ip6InCEPkts 0 0.0
Icmp6InMsgs 7525 0.0
Icmp6InErrors 0 0.0
Icmp6OutMsgs 7511 0.0
Icmp6OutErrors 0 0.0
Icmp6InCsumErrors 0 0.0
Icmp6InDestUnreachs 10 0.0
Icmp6InPktTooBigs 0 0.0
Icmp6InTimeExcds 0 0.0
Icmp6InParmProblems 0 0.0
Icmp6InEchos 2 0.0
Icmp6InEchoReplies 6 0.0
Icmp6InGroupMembQueries 0 0.0
Icmp6InGroupMembResponses 0 0.0
Icmp6InGroupMembReductions 0 0.0
Icmp6InRouterSolicits 0 0.0
Icmp6InRouterAdvertisements 0 0.0
Icmp6InNeighborSolicits 4316 0.0
Icmp6InNeighborAdvertisements 3189 0.0
Icmp6InRedirects 0 0.0
Icmp6InMLDv2Reports 2 0.0
Icmp6OutDestUnreachs 0 0.0
Icmp6OutPktTooBigs 0 0.0
Icmp6OutTimeExcds 0 0.0
Icmp6OutParmProblems 0 0.0
Icmp6OutEchos 6 0.0
Icmp6OutEchoReplies 2 0.0
Icmp6OutGroupMembQueries 0 0.0
Icmp6OutGroupMembResponses 0 0.0
Icmp6OutGroupMembReductions 0 0.0
Icmp6OutRouterSolicits 0 0.0
Icmp6OutRouterAdvertisements 0 0.0
Icmp6OutNeighborSolicits 3179 0.0
Icmp6OutNeighborAdvertisements 4316 0.0
Icmp6OutRedirects 0 0.0
Icmp6OutMLDv2Reports 8 0.0
Icmp6InType1 10 0.0
Icmp6InType128 2 0.0
Icmp6InType129 6 0.0
Icmp6InType135 4316 0.0
Icmp6InType136 3189 0.0
Icmp6InType143 2 0.0
Icmp6OutType128 6 0.0
Icmp6OutType129 2 0.0
Icmp6OutType135 3179 0.0
Icmp6OutType136 4316 0.0
Icmp6OutType143 8 0.0
Udp6InDatagrams 460 0.0
Udp6NoPorts 0 0.0
Udp6InErrors 0 0.0
Udp6OutDatagrams 95 0.0
Udp6RcvbufErrors 0 0.0
Udp6SndbufErrors 0 0.0
Udp6InCsumErrors 0 0.0
Udp6IgnoredMulti 0 0.0
Udp6MemErrors 0 0.0
UdpLite6InDatagrams 0 0.0
UdpLite6NoPorts 0 0.0
UdpLite6InErrors 0 0.0
UdpLite6OutDatagrams 0 0.0
UdpLite6RcvbufErrors 0 0.0
UdpLite6SndbufErrors 0 0.0
UdpLite6InCsumErrors 0 0.0
UdpLite6MemErrors 0 0.0
TcpExtSyncookiesSent 0 0.0
TcpExtSyncookiesRecv 0 0.0
TcpExtSyncookiesFailed 0 0.0
TcpExtEmbryonicRsts 0 0.0
TcpExtPruneCalled 3791 0.0
TcpExtRcvPruned 0 0.0
TcpExtOfoPruned 0 0.0
TcpExtOutOfWindowIcmps 0 0.0
TcpExtLockDroppedIcmps 0 0.0
TcpExtArpFilter 0 0.0
TcpExtTW 2283 0.0
TcpExtTWRecycled 0 0.0
TcpExtTWKilled 0 0.0
TcpExtPAWSActive 0 0.0
TcpExtPAWSEstab 11 0.0
TcpExtDelayedACKs 31995 0.0
TcpExtDelayedACKLocked 47 0.0
TcpExtDelayedACKLost 282 0.0
TcpExtListenOverflows 0 0.0
TcpExtListenDrops 0 0.0
TcpExtTCPHPHits 699069 0.0
TcpExtTCPPureAcks 997468 0.0
TcpExtTCPHPAcks 1235546 0.0
TcpExtTCPRenoRecovery 0 0.0
TcpExtTCPSackRecovery 2526 0.0
TcpExtTCPSACKReneging 0 0.0
TcpExtTCPSACKReorder 36858 0.0
TcpExtTCPRenoReorder 0 0.0
TcpExtTCPTSReorder 85 0.0
TcpExtTCPFullUndo 1 0.0
TcpExtTCPPartialUndo 67 0.0
TcpExtTCPDSACKUndo 11 0.0
TcpExtTCPLossUndo 0 0.0
TcpExtTCPLostRetransmit 184 0.0
TcpExtTCPRenoFailures 0 0.0
TcpExtTCPSackFailures 0 0.0
TcpExtTCPLossFailures 0 0.0
TcpExtTCPFastRetrans 7084 0.0
TcpExtTCPSlowStartRetrans 0 0.0
TcpExtTCPTimeouts 168 0.0
TcpExtTCPLossProbes 345 0.0
TcpExtTCPLossProbeRecovery 82 0.0
TcpExtTCPRenoRecoveryFail 0 0.0
TcpExtTCPSackRecoveryFail 0 0.0
TcpExtTCPRcvCollapsed 0 0.0
TcpExtTCPBacklogCoalesce 10938 0.0
TcpExtTCPDSACKOldSent 300 0.0
TcpExtTCPDSACKOfoSent 49 0.0
TcpExtTCPDSACKRecv 317 0.0
TcpExtTCPDSACKOfoRecv 2 0.0
TcpExtTCPAbortOnData 25 0.0
TcpExtTCPAbortOnClose 54 0.0
TcpExtTCPAbortOnMemory 0 0.0
TcpExtTCPAbortOnTimeout 4 0.0
TcpExtTCPAbortOnLinger 0 0.0
TcpExtTCPAbortFailed 0 0.0
TcpExtTCPMemoryPressures 0 0.0
TcpExtTCPMemoryPressuresChrono 0 0.0
TcpExtTCPSACKDiscard 0 0.0
TcpExtTCPDSACKIgnoredOld 2 0.0
TcpExtTCPDSACKIgnoredNoUndo 272 0.0
TcpExtTCPSpuriousRTOs 0 0.0
TcpExtTCPMD5NotFound 0 0.0
TcpExtTCPMD5Unexpected 0 0.0
TcpExtTCPMD5Failure 0 0.0
TcpExtTCPSackShifted 34290 0.0
TcpExtTCPSackMerged 11301 0.0
TcpExtTCPSackShiftFallback 40480 0.0
TcpExtTCPBacklogDrop 0 0.0
TcpExtPFMemallocDrop 0 0.0
TcpExtTCPMinTTLDrop 0 0.0
TcpExtTCPDeferAcceptDrop 0 0.0
TcpExtIPReversePathFilter 0 0.0
TcpExtTCPTimeWaitOverflow 0 0.0
TcpExtTCPReqQFullDoCookies 0 0.0
TcpExtTCPReqQFullDrop 0 0.0
TcpExtTCPRetransFail 0 0.0
TcpExtTCPRcvCoalesce 100585 0.0
TcpExtTCPOFOQueue 15954 0.0
TcpExtTCPOFODrop 0 0.0
TcpExtTCPOFOMerge 38 0.0
TcpExtTCPChallengeACK 0 0.0
TcpExtTCPSYNChallenge 0 0.0
TcpExtTCPFastOpenActive 0 0.0
TcpExtTCPFastOpenActiveFail 0 0.0
TcpExtTCPFastOpenPassive 0 0.0
TcpExtTCPFastOpenPassiveFail 0 0.0
TcpExtTCPFastOpenListenOverflow 0 0.0
TcpExtTCPFastOpenCookieReqd 0 0.0
TcpExtTCPFastOpenBlackhole 0 0.0
TcpExtTCPSpuriousRtxHostQueues 0 0.0
TcpExtBusyPollRxPackets 0 0.0
TcpExtTCPAutoCorking 73847 0.0
TcpExtTCPFromZeroWindowAdv 40 0.0
TcpExtTCPToZeroWindowAdv 40 0.0
TcpExtTCPWantZeroWindowAdv 2870 0.0
TcpExtTCPSynRetrans 91 0.0
TcpExtTCPOrigDataSent 5948573 0.0
TcpExtTCPHystartTrainDetect 34 0.0
TcpExtTCPHystartTrainCwnd 1880 0.0
TcpExtTCPHystartDelayDetect 3 0.0
TcpExtTCPHystartDelayCwnd 261 0.0
TcpExtTCPACKSkippedSynRecv 0 0.0
TcpExtTCPACKSkippedPAWS 9 0.0
TcpExtTCPACKSkippedSeq 11 0.0
TcpExtTCPACKSkippedFinWait2 0 0.0
TcpExtTCPACKSkippedTimeWait 0 0.0
TcpExtTCPACKSkippedChallenge 0 0.0
TcpExtTCPWinProbe 0 0.0
TcpExtTCPKeepAlive 67 0.0
TcpExtTCPMTUPFail 0 0.0
TcpExtTCPMTUPSuccess 0 0.0
TcpExtTCPDelivered 5951000 0.0
TcpExtTCPDeliveredCE 0 0.0
TcpExtTCPAckCompressed 3021 0.0
TcpExtTCPZeroWindowDrop 0 0.0
TcpExtTCPRcvQDrop 0 0.0
TcpExtTCPWqueueTooBig 0 0.0
TcpExtTCPFastOpenPassiveAltKey 0 0.0
TcpExtTcpTimeoutRehash 72 0.0
TcpExtTcpDuplicateDataRehash 0 0.0
TcpExtTCPDSACKRecvSegs 371 0.0
TcpExtTCPDSACKIgnoredDubious 0 0.0
TcpExtTCPMigrateReqSuccess 0 0.0
TcpExtTCPMigrateReqFailure 0 0.0
IpExtInNoRoutes 0 0.0
IpExtInTruncatedPkts 0 0.0
IpExtInMcastPkts 62 0.0
IpExtOutMcastPkts 24 0.0
IpExtInBcastPkts 19989 0.0
IpExtOutBcastPkts 0 0.0
IpExtInOctets 533061309 0.0
IpExtOutOctets 5153892360 0.0
IpExtInMcastOctets 7448 0.0
IpExtOutMcastOctets 3592 0.0
IpExtInBcastOctets 2082276 0.0
IpExtOutBcastOctets 0 0.0
IpExtInCsumErrors 0 0.0
IpExtInNoECTPkts 1073527 0.0
IpExtInECT1Pkts 0 0.0
IpExtInECT0Pkts 0 0.0
IpExtInCEPkts 0 0.0
IpExtReasmOverlaps 0 0.0
MPTcpExtMPCapableSYNRX 0 0.0
MPTcpExtMPCapableSYNTX 2203 0.0
MPTcpExtMPCapableSYNACKRX 2172 0.0
MPTcpExtMPCapableACKRX 0 0.0
MPTcpExtMPCapableFallbackACK 0 0.0
MPTcpExtMPCapableFallbackSYNACK 22 0.0
MPTcpExtMPFallbackTokenInit 0 0.0
MPTcpExtMPTCPRetrans 0 0.0
MPTcpExtMPJoinNoTokenFound 0 0.0
MPTcpExtMPJoinSynRx 0 0.0
MPTcpExtMPJoinSynAckRx 0 0.0
MPTcpExtMPJoinSynAckHMacFailure 0 0.0
MPTcpExtMPJoinAckRx 0 0.0
MPTcpExtMPJoinAckHMacFailure 0 0.0
MPTcpExtDSSNotMatching 0 0.0
MPTcpExtInfiniteMapRx 0 0.0
MPTcpExtDSSNoMatchTCP 0 0.0
MPTcpExtDataCsumErr 0 0.0
MPTcpExtOFOQueueTail 0 0.0
MPTcpExtOFOQueue 0 0.0
MPTcpExtOFOMerge 0 0.0
MPTcpExtNoDSSInWindow 0 0.0
MPTcpExtDuplicateData 0 0.0
MPTcpExtAddAddr 0 0.0
MPTcpExtEchoAdd 0 0.0
MPTcpExtPortAdd 0 0.0
MPTcpExtAddAddrDrop 0 0.0
MPTcpExtMPJoinPortSynRx 0 0.0
MPTcpExtMPJoinPortSynAckRx 0 0.0
MPTcpExtMPJoinPortAckRx 0 0.0
MPTcpExtMismatchPortSynRx 0 0.0
MPTcpExtMismatchPortAckRx 0 0.0
MPTcpExtRmAddr 0 0.0
MPTcpExtRmAddrDrop 0 0.0
MPTcpExtRmSubflow 0 0.0
MPTcpExtMPPrioTx 0 0.0
MPTcpExtMPPrioRx 0 0.0
MPTcpExtMPFailTx 0 0.0
MPTcpExtMPFailRx 0 0.0
MPTcpExtMPFastcloseTx 0 0.0
MPTcpExtMPFastcloseRx 0 0.0
MPTcpExtMPRstTx 17 0.0
MPTcpExtMPRstRx 0 0.0
MPTcpExtRcvPruned 0 0.0
MPTcpExtSubflowStale 0 0.0
MPTcpExtSubflowRecover 0 0.0
Among all these variables, the ones named *Ext* are Linux-specific variables that are not defined in IETF MIBs. The others are usually defined in an IETF RFC. The counters maintained by the Linux kernel are defined in include/uapi/linux/snmp.h and net/mptcp/mib.h for the Multipath TCP counters.
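The naming convention described above can be used to sort the output of nstat programmatically. The following is a minimal Python sketch, assuming the three-column `name value rate` layout shown in the listing above. The function names `parse_nstat` and `is_linux_specific` are illustrative helpers, not part of any existing tool, and the test on the substring `Ext` is only the heuristic suggested by the text.

```python
def parse_nstat(text):
    """Parse `name value rate` lines into a {name: int} dictionary."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and the '#kernel' header printed by nstat.
        if len(fields) < 2 or fields[0].startswith("#"):
            continue
        counters[fields[0]] = int(fields[1])
    return counters


def is_linux_specific(name):
    """Heuristic from the text: names containing 'Ext' are Linux specific."""
    return "Ext" in name


# A few lines copied from the listing above.
sample = """\
TcpActiveOpens 3330 0.0
TcpExtTCPTimeouts 168 0.0
IpExtInOctets 533061309 0.0
MPTcpExtMPCapableSYNTX 2203 0.0
"""
counters = parse_nstat(sample)
linux_only = sorted(name for name in counters if is_linux_specific(name))
print(linux_only)
```

On a live system, the same parser could be fed with the captured output of the nstat command instead of the `sample` string.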
Each of the counters exposed by nstat corresponds to one specific identifier in the Linux kernel. For example, the beginning of the IP part of the counters is defined as follows:
enum
{
IPSTATS_MIB_NUM = 0,
/* frequently written fields in fast path, kept in same cache line */
IPSTATS_MIB_INPKTS, /* InReceives */
IPSTATS_MIB_INOCTETS, /* InOctets */
IPSTATS_MIB_INDELIVERS, /* InDelivers */
IPSTATS_MIB_OUTFORWDATAGRAMS, /* OutForwDatagrams */
IPSTATS_MIB_OUTPKTS, /* OutRequests */
IPSTATS_MIB_OUTOCTETS, /* OutOctets */
/* other fields */
Before looking at the precise meaning of each of the counters managed by nstat, it is interesting to recall the definition of the Case diagrams. This graphical representation of SNMP variables can be really useful to understand the meaning of the Linux networking counters.
The Case diagrams
The Case diagrams were introduced by Jeffrey Case and Craig Partridge in 1989 in the paper Case diagrams: a first step to diagrammed management information bases. This article describes a simple but powerful graphical representation of the interactions among the different SNMP variables that a networking stack maintains.
A Case diagram represents the flow of packets through a stack and the different variables that are updated as the packets progress through the stack. The incoming packets are represented as progressing from the bottom layer of the stack to the upper layer, while the outgoing packets are represented in the other direction. The progression of these packets is represented using a large arrow. A horizontal line that crosses this arrow indicates the point in the stack where the associated SNMP counter is updated. A small arrow that leaves the main packet processing flow indicates a specific treatment for a packet and a counter that is updated. In some cases, an arrow enters the main workflow and updates the associated counter.
The original paper used the IP counters of the MIB-2 to illustrate the Case diagrams. This figure is reproduced below in ASCII format to simplify the updates to the document.
Transport Layer
-----------------------------------------------------
/\
||
||
IpInDelivers +++++++++++++++
||
||
|+-----------> IpInUnknownProtos
||
||
IpInDiscards <----------+|
||
|<---------------- IpReasmOKs
|| /\
|| IpReasmFails <---+|
|| ||
|+--------------> IpReasmReqds
||
||
IpForwDatagrams <-------+|
||
|+--------------> IpInAddrErrors
||
||
IpInHdrErrors <---------+|
||
||
||
+++++++++++++++++++ IpInReceives
||
-----------------------------------------------------
Interface Layer
The Case diagram above shows how the packets are processed by the IP stack. First, the Interface layer extracts the payload of the received frame and passes it to the IP layer. At this point, the IpInReceives counter is incremented. The processing of the IPv4 packet starts. First, the stack checks for errors inside the IPv4 header. If an error is detected in the IPv4 header, the packet is dropped and IpInHdrErrors is incremented. Then, the destination address is checked. If the address is incorrect, the packet processing stops and IpInAddrErrors is incremented.
If IP forwarding is enabled and the packet is not destined to this host, then the packet is forwarded using the FIB. The IpForwDatagrams counter is incremented.
The next step is to check whether the received packet is a fragment of a larger packet that needs to be reassembled. If the received packet is a fragment, then the IpReasmReqds counter is incremented and the packet is passed through the reassembly process. This reassembly can take time since more fragments can be required to recover a complete packet. If the packet reassembly succeeds, then IpReasmOKs is incremented and the processing of the full packet continues. If the reassembly fails, e.g. because a fragment is missing before the timeout expires, then IpReasmFails gets incremented.
At this point, the IP stack has almost finished processing the packet. Most packets will be delivered to the transport layer and increment the IpInDelivers counter, except if the IP queue becomes full. In this case, the IpInDiscards counter is incremented.
The incoming packet could also be discarded if its Protocol field does not match one of the transport layers supported by the stack (i.e. UDP, TCP, DCCP, ...). In this case, the IpInUnknownProtos counter is incremented.
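The walk through the Case diagram can be mimicked with a few lines of Python. The sketch below is a toy model and not the kernel's receive path: a packet is a plain dictionary, the flags `bad_header`, `fragment`, `reassembled` and `queue_full` are hypothetical, reassembly is collapsed into a single flag, and only TCP and UDP are treated as known protocols. A packet for a non-local address is counted in IpInAddrErrors when forwarding is disabled, which is an assumption of this model. Each test is placed in the order in which the diagram is crossed from bottom to top.

```python
from collections import Counter

KNOWN_PROTOCOLS = {6, 17}  # TCP and UDP only, an assumption of this toy model


def receive(pkt, mib, local_addrs, forwarding=False):
    """Walk one packet (a dict) up the diagram and update the counters."""
    mib["IpInReceives"] += 1
    if pkt.get("bad_header"):
        mib["IpInHdrErrors"] += 1
        return "dropped"
    if pkt["dst"] not in local_addrs:
        if forwarding:
            mib["IpForwDatagrams"] += 1
            return "forwarded"
        mib["IpInAddrErrors"] += 1
        return "dropped"
    if pkt.get("fragment"):
        mib["IpReasmReqds"] += 1
        if not pkt.get("reassembled"):
            mib["IpReasmFails"] += 1
            return "dropped"
        mib["IpReasmOKs"] += 1
    if pkt.get("queue_full"):
        mib["IpInDiscards"] += 1
        return "dropped"
    if pkt["proto"] not in KNOWN_PROTOCOLS:
        mib["IpInUnknownProtos"] += 1
        return "dropped"
    mib["IpInDelivers"] += 1
    return "delivered"


mib = Counter()
local = {"192.0.2.1"}
packets = [
    {"dst": "192.0.2.1", "proto": 6},
    {"dst": "192.0.2.1", "proto": 6, "bad_header": True},
    {"dst": "198.51.100.7", "proto": 17},
    {"dst": "192.0.2.1", "proto": 132},
    {"dst": "192.0.2.1", "proto": 17, "fragment": True, "reassembled": True},
]
outcomes = [receive(p, mib, local) for p in packets]
print(outcomes)
print(dict(mib))
```

Every packet increments IpInReceives and then exactly one of the exit counters, which is the accounting property that makes Case diagrams useful when reading real counters.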
The Multipath TCP counters
Linux version 5.18 maintains 46 counters for Multipath TCP. These counters correspond to different parts of the protocol and can be organized in four groups. The first group gathers the counters that are incremented when TCP packets containing the MP_CAPABLE option are processed. The second group gathers the counters that are incremented when processing packets with the MP_JOIN option. The third group gathers the counters that are modified when packets with the ADD_ADDR, RM_ADDR or MP_PRIO option are processed. The fourth group gathers the remaining counters of the Multipath TCP stack.
Two versions of Multipath TCP have been specified within the IETF. Version 0 was initially defined in RFC 6824. The off-tree but well maintained set of patches distributed by https://www.multipath-tcp.org implemented this version of Multipath TCP. Based on the experience gathered with this implementation and also Apple's implementation, Multipath TCP evolved and the IETF published version 1 in RFC 8684. The Multipath TCP counters correspond to this version of Multipath TCP.
The MPCapable counters
This group gathers the following counters: MPTcpExtMPCapableSYNRX, MPTcpExtMPCapableSYNTX, MPTcpExtMPCapableSYNACKRX, MPTcpExtMPCapableACKRX, MPTcpExtMPCapableFallbackACK, MPTcpExtMPCapableFallbackSYNACK and MPTcpExtMPFallbackTokenInit. They relate to the establishment of the initial Multipath TCP subflow, which is described in the Multipath TCP handshake section.
The MPTcpExtMPCapableSYNTX counter is similar to the TcpActiveOpens counter maintained by TCP. It counts the number of Multipath TCP connections that this host has tried to establish. Its value will usually be much smaller than TcpActiveOpens. When a Multipath TCP connection is initiated using the connect system call, both MPTcpExtMPCapableSYNTX and TcpActiveOpens are incremented. Although the name of the counter is MPTcpExtMPCapableSYNTX, it is only incremented once per Multipath TCP connection, even if the SYN packet needs to be retransmitted.
The MPTcpExtMPCapableSYNACKRX counter is incremented every time a Multipath TCP connection is confirmed by the reception of a SYN+ACK with the MP_CAPABLE option in response to a SYN packet that the host sent earlier. The value of this counter should be lower than MPTcpExtMPCapableSYNTX since only a subset of the connections initiated by a host will typically reach a Multipath TCP compliant server.
If a client receives a SYN+ACK without the MP_CAPABLE option in response to a SYN sent with the MP_CAPABLE option, then the MPTcpExtMPCapableFallbackSYNACK counter is incremented. This counter tracks the Multipath TCP connections that were forced to fall back to regular TCP during the three-way handshake of the initial subflow.
On the other hand, the MPTcpExtMPCapableSYNRX counter tracks the number of Multipath TCP connections that were accepted by the host. Its value will usually be much smaller than TcpPassiveOpens, which tracks all accepted TCP connections. When a Multipath TCP connection is accepted, both MPTcpExtMPCapableSYNRX and TcpPassiveOpens are incremented. As for MPTcpExtMPCapableSYNTX, the MPTcpExtMPCapableSYNRX counter is only incremented once per connection and not each time a packet is received.
Upon reception of a SYN with the MP_CAPABLE option, a Multipath TCP server returns a SYN+ACK with the MP_CAPABLE option. The MPTcpExtMPCapableACKRX counter is incremented upon reception of the third ACK containing the MP_CAPABLE option. If this option is not present in this ACK, then the MPTcpExtMPCapableFallbackACK counter gets incremented. If this counter increases, it probably indicates some interference with a middlebox that injects acknowledgments during the three-way handshake.
||
||
MPTcpExtMPCapableSYNTX+++++++++++++++
||
||
|+-----------> MPTcpExtMPCapableSYNACKRX
||
||
|+-----------> MPTcpExtMPCapableFallbackSYNACK
||
||
||
||
MPTcpExtMPCapableSYNRX+++++++++++++++
||
||
|+-----------> MPTcpExtMPCapableACKRX
||
||
|+-----------> MPTcpExtMPCapableFallbackACK
||
||
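The client-side counters of this group can be combined to estimate how often the Multipath TCP handshake succeeds. The sketch below is a minimal Python illustration using the three values from the nstat listing shown earlier. The function name `client_handshake_summary` and the `unresolved` field are hypothetical, and the computation assumes that each attempt ends in at most one of the two outcomes (confirmation or fallback).

```python
def client_handshake_summary(counters):
    """Summarise the outcome of the MP_CAPABLE handshakes started by a host."""
    attempted = counters["MPTcpExtMPCapableSYNTX"]
    confirmed = counters["MPTcpExtMPCapableSYNACKRX"]
    fallback = counters["MPTcpExtMPCapableFallbackSYNACK"]
    return {
        "attempted": attempted,
        "confirmed": confirmed,
        "fallback_to_tcp": fallback,
        # Attempts with neither outcome: failed, or still in progress.
        "unresolved": attempted - confirmed - fallback,
        "confirmed_ratio": confirmed / attempted if attempted else 0.0,
    }


# Values taken from the nstat output shown earlier in this section.
sample = {
    "MPTcpExtMPCapableSYNTX": 2203,
    "MPTcpExtMPCapableSYNACKRX": 2172,
    "MPTcpExtMPCapableFallbackSYNACK": 22,
}
summary = client_handshake_summary(sample)
print(summary)
```

With these values, most attempts were confirmed by a SYN+ACK carrying MP_CAPABLE, a small number fell back to regular TCP, and the remainder never received a SYN+ACK that was counted.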
The Join counters
There are thirteen counters in this group. They are incremented when a host processes SYN packets corresponding to additional subflows.
The first counter, MPTcpExtMPJoinSynRx, is incremented every time a SYN packet with the MP_JOIN option is received. Upon reception of such a packet, the host first verifies that it knows the token of the Multipath TCP connection. If so, the processing continues and the host returns a SYN+ACK packet with the MP_JOIN option, its random number and an HMAC. Otherwise, the MPTcpExtMPJoinNoTokenFound counter is incremented. The host then waits for the third ACK which contains the MP_JOIN option and the HMAC computed by the remote host. It then checks the validity of the received HMAC. If the HMAC is invalid, then the MPTcpExtMPJoinAckHMacFailure counter is incremented.
The MPTcpExtMPJoinSynRx counter will increase on Multipath TCP hosts that accept subflows, typically servers. The value of the MPTcpExtMPJoinAckRx counter should be close to the previous one. If one of the two other counters, MPTcpExtMPJoinNoTokenFound or MPTcpExtMPJoinAckHMacFailure, increases, then the system administrator should probably investigate, as these are indications of possible attacks.
||
||
MPTcpExtMPJoinSynRx +++++++++++++++
||
||
|+-----------> MPTcpExtMPJoinNoTokenFound
||
||
MPTcpExtMPJoinAckRx +++++++++++++++
||
|+-----------> MPTcpExtMPJoinAckHMacFailure
||
||
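The checks performed by a host that receives MP_JOIN packets can be sketched in Python. This is a toy model that follows the diagram above, not the kernel code: the class `JoinListener` and its methods are hypothetical, and the point where the real kernel increments each counter may differ. The token and HMAC computations follow the description of RFC 8684 as understood here (token taken from the first 32 bits of the SHA-256 hash of the key, HMAC-SHA256 keyed with both keys over both random numbers), but the truncation of the HMAC on the wire is deliberately omitted.

```python
import hashlib
import hmac
import os
from collections import Counter


def token_of(key):
    """Connection token: first 32 bits of SHA-256(key)."""
    return hashlib.sha256(key).digest()[:4]


def join_hmac(key_first, key_second, nonce_first, nonce_second):
    """HMAC-SHA256 over both nonces, keyed with both connection keys."""
    return hmac.new(key_first + key_second, nonce_first + nonce_second,
                    hashlib.sha256).digest()


class JoinListener:
    """Toy model of the host that accepts additional subflows."""

    def __init__(self):
        self.connections = {}  # token -> (local_key, remote_key)
        self.pending = {}      # token -> (remote_nonce, local_nonce)
        self.mib = Counter()

    def add_connection(self, local_key, remote_key):
        self.connections[token_of(local_key)] = (local_key, remote_key)

    def on_syn(self, token, remote_nonce):
        self.mib["MPTcpExtMPJoinSynRx"] += 1
        if token not in self.connections:
            self.mib["MPTcpExtMPJoinNoTokenFound"] += 1
            return None  # unknown connection, the subflow is refused
        local_nonce = os.urandom(4)
        self.pending[token] = (remote_nonce, local_nonce)
        return local_nonce  # would be carried in the SYN+ACK

    def on_ack(self, token, received_hmac):
        self.mib["MPTcpExtMPJoinAckRx"] += 1
        local_key, remote_key = self.connections[token]
        remote_nonce, local_nonce = self.pending.pop(token)
        expected = join_hmac(remote_key, local_key, remote_nonce, local_nonce)
        if not hmac.compare_digest(expected, received_hmac):
            self.mib["MPTcpExtMPJoinAckHMacFailure"] += 1
            return False
        return True


server_key = bytes(range(8))       # fixed demo keys, not real secrets
client_key = bytes(range(8, 16))
listener = JoinListener()
listener.add_connection(server_key, client_key)
token = token_of(server_key)
nonce = b"\x01\x02\x03\x04"

# 1. A legitimate join.
server_nonce = listener.on_syn(token, nonce)
accepted = listener.on_ack(
    token, join_hmac(client_key, server_key, nonce, server_nonce))

# 2. A SYN carrying an unknown token.
unknown = listener.on_syn(bytes(b ^ 0xFF for b in token), nonce)

# 3. A third ACK carrying a forged HMAC.
listener.on_syn(token, nonce)
rejected = listener.on_ack(token, bytes(32))
print(accepted, unknown, rejected, dict(listener.mib))
```

The second and third scenarios are the ones that increment the two failure counters, which is why an increase of these counters on a production host deserves attention.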
Unfortunately, there is no counter that tracks the creation of new subflows by a host. The TCP stack counts these new subflows as active opens, but there is no specific Multipath TCP counter. However, the MPTcpExtMPJoinSynAckRx counter tracks the reception of SYN+ACK packets containing the MP_JOIN option. This is thus an indirect way to track the creation of new subflows. Upon reception of such a packet, in response to a previously sent SYN packet with the MP_JOIN option, a host checks the validity of the received HMAC. If the HMAC is invalid, the MPTcpExtMPJoinSynAckHMacFailure counter is incremented. This counter should rarely increase. If it increases, then the problem should be investigated by collecting packet traces.
||
||
MPTcpExtMPJoinSynAckRx +++++++++++++++
||
||
|+-----------> MPTcpExtMPJoinSynAckHMacFailure
||
||
A Multipath TCP host will usually accept additional subflows on the address and ports where the initial subflow was accepted. The following counters track the arrival of packets destined to different port numbers:
MPTcpExtMPJoinPortSynRx
MPTcpExtMPJoinPortSynAckRx
MPTcpExtMPJoinPortAckRx
The last two counters, MPTcpExtMismatchPortSynRx and MPTcpExtMismatchPortAckRx, are a bit different. They are incremented when a SYN or ACK sent to a different port number is received.
The MP_JOIN option contains a B bit that indicates whether the new subflow should be considered as a backup subflow or a regular one. This information is used by the path manager, but no counter tracks the value of the backup bit in the MP_JOIN option. Once a subflow has been established, its backup status can be changed using the MP_PRIO option. The MPTcpExtMPPrioTx counter is incremented every time such an option is sent. The MPTcpExtMPPrioRx counter is incremented by each received MP_PRIO option.
The address advertisement counters
There are seven counters in this group. The advertisement of addresses by Multipath TCP is described in the Address management section.
When a host receives a packet with a valid ADD_ADDR option with its Echo bit set to zero, the MPTcpExtAddAddr counter is incremented. If this option includes an optional port number, the MPTcpExtPortAdd counter is also incremented. In addition to these two counters, the MPTcpExtAddAddrDrop counter tracks the address advertisements that were received by the host, but not processed by the path manager, e.g. because no user space path manager was active.
Multipath TCP does not count the address advertisements that a host sends with the ADD_ADDR option. However, it tracks the reception of packets containing the ADD_ADDR option with the Echo bit set to one with the MPTcpExtEchoAdd counter. These packets are echoed by the remote host.
Similarly, the MPTcpExtRmAddr counter tracks the number of received RM_ADDR options. These options typically indicate a change in the addresses owned by a remote peer. Mobile hosts are likely to send these options when they move from one type of network to another. The MPTcpExtRmAddrDrop counter is incremented when the path manager cannot process an incoming RM_ADDR option.
When a host receives an RM_ADDR option from a remote peer, its path manager should remove the subflows associated with this address. The MPTcpExtRmSubflow counter tracks the number of subflows that have been destroyed by a path manager.
The connection termination counters
There are seven counters in this group. They track the abnormal termination of a Multipath TCP connection. A normal Multipath TCP connection should end with the exchange of DATA_FIN in both directions. However, other scenarios are possible. First, one of the hosts may wish to quickly terminate the Multipath TCP connection without having to maintain state. Multipath TCP uses the MP_FASTCLOSE option in this case. The MPTcpExtMPFastcloseTx and MPTcpExtMPFastcloseRx counters track the transmission and the reception of such options.
Multipath TCP was designed to prevent interference from middleboxes as much as possible, but some types of interference force Multipath TCP to fall back to regular TCP. In this case, the host that first noticed the interference (e.g. problem during the handshake, DSS checksum problem, ...) sends a packet with the MP_FAIL option. This forces the Multipath TCP connection to fall back to a regular TCP connection. The MPTcpExtMPFailTx and MPTcpExtMPFailRx counters track the transmission and the reception of the MP_FAIL option. During some types of fallbacks, a host may also send an infinite DSS mapping. The MPTcpExtInfiniteMapRx counter tracks the reception of such infinite DSS mappings.
An increase of these counters would indicate some type of middlebox interference which should be investigated since it could prevent a complete utilization of Multipath TCP.
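One simple way to notice such an increase is to compare two snapshots of the counters taken at different times. The following Python sketch illustrates the idea. The function name `increased`, the `WATCHED` tuple and the two snapshot dictionaries are hypothetical, and the snapshot values are invented for the example rather than measured.

```python
# Counters whose growth suggests middlebox interference, as discussed above.
WATCHED = (
    "MPTcpExtMPFailTx",
    "MPTcpExtMPFailRx",
    "MPTcpExtInfiniteMapRx",
)


def increased(before, after, names=WATCHED):
    """Return {name: delta} for the watched counters that grew."""
    deltas = {}
    for name in names:
        delta = after.get(name, 0) - before.get(name, 0)
        if delta > 0:
            deltas[name] = delta
    return deltas


# Two invented snapshots, e.g. parsed from nstat output taken an hour apart.
before = {"MPTcpExtMPFailTx": 0, "MPTcpExtMPFailRx": 0,
          "MPTcpExtInfiniteMapRx": 0, "MPTcpExtMPRstTx": 17}
after = {"MPTcpExtMPFailTx": 2, "MPTcpExtMPFailRx": 0,
         "MPTcpExtInfiniteMapRx": 1, "MPTcpExtMPRstTx": 25}

alerts = increased(before, after)
print(alerts)
```

Counters that are not in the watched list, such as MPTcpExtMPRstTx in this example, are ignored even if they grow.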
Like TCP, Multipath TCP uses TCP RST to terminate subflows. Multipath TCP also defines the MP_TCPRST option, which can carry a reason code and flags that give some information about why the RST was sent. The MPTcpExtMPRstTx and MPTcpExtMPRstRx counters track the transmission and the reception of such RST packets.
The other counters
The remaining eleven counters are mainly related to the processing of data. If the DSS checksum is enabled, the MPTcpExtDataCsumErr counter is incremented every time a check of the DSS checksum fails. This should be a rare event that likely indicates the presence of middleboxes. It should be correlated with the MPTcpExtMPFailTx and MPTcpExtMPFailRx counters discussed in the previous section.
Three counters track the DSS option of the incoming packets: MPTcpExtDSSNotMatching, MPTcpExtDSSNoMatchTCP and MPTcpExtNoDSSInWindow. The first counter is incremented when a mapping is received for data that has already been mapped and the new mapping is not the same as the existing one. The second counter is incremented when the TCP sequence numbers found in the mapping do not match with the current TCP sequence numbers. The third counter is incremented upon reception of a packet that indicates a DSS option that is outside the current window. These three counters should rarely increase.
The last counter that tracks data at the Multipath TCP connection level is MPTcpExtDuplicateData. It counts the number of received packets whose data has been ignored because it had already been received earlier. Such duplicated data can occur with Multipath TCP when data sent over a subflow is retransmitted over another subflow. It would be interesting to follow the evolution of this counter on a server that interacts with mobile devices.
Multipath TCP tracks losses on the subflows that compose a Multipath TCP connection. If one subflow accumulates losses, it may be marked as stale and the packet scheduler will stop using it to transmit data until the losses have been recovered. The MPTcpExtSubflowStale counter is incremented every time a subflow is marked as being stale. The MPTcpExtSubflowRecover counter tracks the transitions from stale to active.
Multipath TCP uses an out-of-order queue to reorder the data received over the different subflows. The MPTcpExtOFOQueueTail and MPTcpExtOFOQueue counters track the insertion of data at the tail of the out-of-order queue and elsewhere in this queue. The MPTcpExtOFOMerge counter is incremented when data present in the out-of-order queue can be merged.
Finally, the MPTcpExtRcvPruned counter tracks the number of packets that were dropped because the memory available for Multipath TCP was full. If this counter increases, you should probably check the memory configuration of your host.
Native Multipath TCP applications on Linux
On recent Linux kernels, Multipath TCP is enabled on a per-socket basis by passing IPPROTO_MPTCP as the third parameter of the socket system call that creates the socket. This implies that existing applications that use TCP need to be changed to support Multipath TCP.
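As an illustration of this per-socket selection, the Python sketch below requests a Multipath TCP socket and falls back to regular TCP when the kernel refuses it. The helper name `create_stream_socket` is hypothetical. The sketch assumes a Linux host: recent Python versions expose `socket.IPPROTO_MPTCP` there, and the numeric value 262 used as a fallback is the one defined by the Linux kernel headers.

```python
import socket

# Use the constant provided by Python when it exists, else the Linux value.
IPPROTO_MPTCP = getattr(socket, "IPPROTO_MPTCP", 262)


def create_stream_socket(family=socket.AF_INET):
    """Return (sock, uses_mptcp), falling back to TCP if MPTCP is refused."""
    try:
        sock = socket.socket(family, socket.SOCK_STREAM, IPPROTO_MPTCP)
        return sock, True
    except OSError:
        # Kernel without Multipath TCP, Multipath TCP disabled by sysctl,
        # or a host that is not running Linux.
        sock = socket.socket(family, socket.SOCK_STREAM, socket.IPPROTO_TCP)
        return sock, False


sock, uses_mptcp = create_stream_socket()
print("Multipath TCP socket" if uses_mptcp else "regular TCP socket")
sock.close()
```

The returned socket is then used with connect, bind, listen and accept exactly like a regular TCP socket, which is why the change required in most applications is limited to the call that creates the socket.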
The mptcp-hello project on GitHub provides simple examples showing how to enable Multipath TCP on a TCP application in the following programming languages:
Patches have been proposed to add Multipath TCP support to the following applications:
In addition, some specific applications are developed with Multipath TCP support: