Netdev 1.2 Day III

Last day of Netdev 1.2 in Tokyo. We attended the following talks:

Microservice Networking Leveraging VRF on the Host

David Ahern (from Cumulus Networks) presented an approach that leverages VRF on the host to provide network isolation between containers running on different hosts.

Accelerating Linux IP Virtual Server with OpenNPU

Gilad Ben-Yossef (from Mellanox) presented HW offload for LVS with OpenNPU.

Linux Forwarding Stack Fastpath

Nishit Shah, Jagdish Motwani (from Sophos) presented an interesting approach to squeeze more throughput out of the network interfaces by marking forwarded packets with a fastpath mark.

XDP workshop: Introduction, experience, and future development

Tom Herbert (from Facebook) organized the workshop on eXpress Data Path (XDP), which provides a programmable eBPF interface to build firewalls, load balancers and similar devices.

KEYNOTE: The Undiscovered Internet: The Linux Network Stack, The Internet, Self-driving cars, Virtual Reality, and Pokemon Go

Tom Herbert (from Facebook) presented an overview of near-future networking challenges such as IPv6 adoption, stream content parsing, and secure connections for everything…

Accelerating container network with channel based IO

Rony Efraim, Or Gerlitz (from Mellanox) presented HW offloads to accelerate container networking.

Advanced programmability and recent updates with tc’s cls_bpf

Daniel Borkmann presented the latest changes of tc’s cls_bpf infrastructure.

IPSec Workshop

Steffen Klassert organized the IPsec workshop, where collaborators presented the latest changes, open issues, and the differences with kTLS.

Scalable VM and Container Networking using /32bit subnets and BGP routing

Andrew Yongjoon Kong (from Kakao) presented the scalable virtual infrastructure they implemented using /32 subnets and BGP routing.

eBPF/XDP hardware offload to SmartNICs

Jakub Kicinski, Nic Viljoen (from Netronome) presented some HW offload ideas for eBPF and XDP.

Using SR-IOV offload with applications like Open vSwitch

Rony Efraim, Or Gerlitz (from Mellanox) presented HW offload for Open vSwitch.


Netdev 1.2 Day II

During the second day of Netdev 1.2 we attended the following talks:

Scaling with multiple network namespaces in a single application

PJ Waskiewicz presented how to manage large applications that use the multiple network namespaces supported in the Linux kernel (as containers do), along with some good practices to avoid bottlenecks on the kernel side.

What is an L3 Master Device?

David Ahern (from Cumulus Networks) explained how an L3 master device (l3mdev) can be used to manipulate packets at layer 3, and the API this driver exposes. VRF is an example of how to use it.

User Space TCP – Getting LKL ready for the Prime Time Use

Jerry Chu (from Google) presented investigations into running the kernel TCP stack in userspace, motivated by the heterogeneous platforms Android runs on. David Miller pointed out that such work could become a lost cause, as porting the full kernel interface is a huge effort.

Introduction to switchdev SR-IOV offloads

Or Gerlitz, Hadar Hen-Zion, Amir Vadai, Rony Efraim (from Mellanox) presented HW offload for switchdev SR-IOV drivers.

Single Virtual function driver for current and future Intel Network devices

Anjali Singhai Jain, Mitch Williams, Jesse Brandeburg (from Intel) presented the challenge of a single virtual function driver that will be included in the Intel networking drivers.

KEYNOTE: Fast Programmable Networks & Encapsulated Protocols

David S. Miller (netdev kernel subsystem maintainer) presented the XDP framework, a fast path for networking applications; drivers need to support it in order to call the hook even before the skb structure is created.

Network Performance Workshop

Jesper D. Brouer (from RedHat) invited several collaborators to debate different points of optimization in the networking stack; they noted that Unix sockets still need a lot of optimization work.

Nftables Workshop

Pablo Neira Ayuso (Netfilter Core Team) organized the nftables workshop, where I had the opportunity to present the development of the load balancing infrastructure in nftables that took place during my Outreachy internship. Several points of improvement I gathered from this experience:

  1. Work on implementing lightweight NAT from the ingress hook, given the performance we can get there.
  2. Support for consistent hashing in the hash expression.

Florian and Pablo also presented the improvements in nftables since the last Netdev 1.1.

Data Center Networking Stack

Tom Herbert (from Facebook) presented the challenges datacenters need to tackle to perform as well as possible.

The adventures of a Suricata in eBPF land

Eric Leblond (from Suricata) presented eBPF support in Suricata, with examples such as filtering Facebook chats for a given content.

Network interface configuration on a Linux NOS

Roopa Prabhu (from Cumulus Networks) presented ifupdown2, a new package for Debian systems that supports advanced network configuration of switch interfaces.


Netdev 1.2 Day I

The first day of the Netdev 1.2 conference in Tokyo! We attended the following talks:

Practical Guide to Run an IEEE 802.15.4 Network with 6LoWPAN under Linux

Stefan Schmidt presented the wireless solution for embedded devices, oriented to IoT constraints like low battery and small RAM. 6LoWPAN is an IPv6 encapsulation with header compression that permits running optimized IPv6 on IoT networks.

Some tools used are linux-wpan and wpan-tools. RIOT is an operating system for IoT, and Contiki is an alternative to Linux+RIOT.

Implementing IPv6 Segment Routing

David Lebrun presented Segment Routing, a new implementation that lets the source steer packets along a path through chosen intermediate nodes.

Packets have to be encapsulated at the ingress and egress stages, and userspace needs to be aware of the key used to identify packets.

Making Linux TCP Fast

Yuchung Cheng and Neal Cardwell (from Google) presented a new feature arriving in kernel 4.9 to speed up TCP by reducing retransmissions through better detection of packet losses. They noted that buffering packets in deep queues worsens delay and increases the retransmissions needed.

BBR (Bottleneck Bandwidth and Round-trip propagation time) is a new congestion-control module that seeks high throughput with small queue sizes by probing bandwidth and RTT sequentially. When the buffer is being filled, it drains and probes bandwidth again.

With BBR+RACK, Google improved TCP speed 2-20x on WANs, with higher bandwidth and lower latency.

Kernel TLS (Transport Layer Security) Socket

Dave Watson presented the kernel TLS (or SSL) implementation, which reduces copies from/to userspace and hence improves performance. With sendfile, kTLS reduces CPU cycles (93/100).

The implementation needs two fds: a TCP socket and an AF_TLS socket.

In addition, Datagram TLS and KCM could be used in this implementation.

TLS Offload to Network Devices

Boris Pismenny (from Mellanox) presented HW offload solutions that take kTLS from 4.4 Gb/s to 8.6 Gb/s.

KEYNOTE: Accelerating Service Delivery Using Linux Network Innovations

Damascene Joachimpillai (from Verizon) presented why a corporation like Verizon relies on Linux for its datacenters, although not currently for the carrier network.

They’re very interested in TC stuff.

TC extended workshop

Jamal Hadi Salim (from Mojatatu Networks) organized the TC workshop, where several speakers talked about the evolution of the tool, mainly its eBPF support.

Switchdev BoF

Jiri Pirko, Yotam Gigi and Matty Kadosh (from Mellanox) presented some problematic topics that they’re facing with the switchdev implementation.

Encapsulation Offloads LCO, GSO_PARTIAL, TSO_MANGLEID, and Why Less is More

Alexander Duyck (from Intel) explained why it’s important to calculate checksums correctly, how to accelerate that with HW offload, and the implementation in the most popular Intel network drivers.

VNIC offloads fact or fiction

Stephen Hemminger (from Microsoft) explained the research into different drivers and Linux kernel infrastructure to provide the Hyper-V hypervisor with a high-performance networking stack on Linux. A performance comparison of virtual NIC drivers such as virtio (QEMU), vmxnet3 (VMware), netvsc (Hyper-V) and Xen was shown.

Stacked Vlan: Performance Improvement and Challenges

Toshiaki Makita (from NTT) presented the implementation improvements, challenges and performance work applied to stacked VLANs since the last Netdev 0.1.


Ensuring data from netlink is within the bounds

While implementing the expression modules, a known bug came up, and I was asked to fix it both in the modules I’m implementing and in other modules like nft_exthdr.

The problem arises because netlink attributes are sent as u32 values, while sometimes a u8 is enough for the private data in the kernel. It’s common to assign the netlink value directly to the private attribute without checking whether it fits in a u8.

For this reason, a piece of code like:

priv->offset = ntohl(nla_get_be32(tb[NFTA_EXTHDR_OFFSET]));
priv->len    = ntohl(nla_get_be32(tb[NFTA_EXTHDR_LEN]));


struct nft_exthdr {
    u8                      type;
    u8                      offset;
    u8                      len;
    enum nft_registers      dreg:8;
};
produces an overflow: the u32 is silently truncated to a u8. So we have to validate the loaded value with something like:

u32 offset, len;

offset = ntohl(nla_get_be32(tb[NFTA_EXTHDR_OFFSET]));
len    = ntohl(nla_get_be32(tb[NFTA_EXTHDR_LEN]));

if (offset > U8_MAX || len > U8_MAX)
	return -EINVAL;

priv->offset = offset;
priv->len = len;

Fixes for other cases were found and patched.

Brand new hash expression

The new hash expression provides a way to generate a Jenkins Hash operation given a source register that could be a source IP address, destination IP address, or any other packet field.

meta mark set hash ip saddr mod 10

An existing module called nft_hash already implemented a hash table, so I first had to apply a patch renaming that module.

I had to learn the Jenkins hash API (jhash) in order to use it.

I also made the changes in the libnftnl package to support this new expression, in the form:

reg1 <- payload(base, offset, len)
reg1 <- hash(reg1, len, mod)
mark set reg1

But after implementing the kernel and libnftnl sides, I got the following error:

 root@nfkernel:~# /usr/src/libnftnl/examples/nft-rule-add ip filter
 mnl_cb_run: No such file or directory

This error can appear if the given table or chain doesn’t exist, or if the module is not loaded. But…

The nft structure was created

 root@nfkernel:~# nft list ruleset
 table ip filter {
         chain input {
                 type filter hook input priority 0; policy accept;
         }
 }

And the module was loaded

 Module                  Size  Used by
 nft_hash                9946  0
 nf_tables_ipv6          2206  4
 nf_tables_ipv4          2206  4
 nf_tables              56474  3 nf_tables_ipv4,nf_tables_ipv6,nft_hash
 nfnetlink               5700  1 nf_tables

But then I realized that the nft_hash module size in memory was too big for such a relatively “small” expression.

So I suspected some kind of incompatibility or collision in the kernel between the “old” and the “new” nft_hash modules.

Finally, and thanks to my mentor Pablo Neira, the fix was to include one line in the source code and then rebuild and reinstall the modules:

make clean
make modules_install

And it’s ready!

From this work came the first patches for the kernel and libnftnl.


Brand new nth expression

I’ve been in charge of creating a new netfilter match expression that provides an nth-packet counter which resets every time a given value is reached. Such an expression allows round-robin packet matching, very useful for load balancing or for emulating network failures, for example:

 ip daddr <ipsaddr> dnat nth 3 map {
         0: <ipdaddrA>,
         1: <ipdaddrB>,
         2: <ipdaddrC>
 }

This expression is the equivalent of the nth mode of the statistic match in iptables.

In order to face this challenge, I’ve been studying how nft expressions work on both the kernel and libnftnl sides, using as references expressions like nft_meta, nft_counter and nft_cmp:

  • nft_meta was useful as a template, as its random key seems similar to nth. But it’s quite different for several reasons: no multiple operations are needed, and no sreg registers are required.
  • nft_cmp was useful for seeing how a data structure is passed from netlink through to netfilter, but it’s not very similar to what we need to build.
  • nft_counter is likely the most similar code, as it performs SMP-safe counting. But the counter counts packets and bytes independently on every CPU, and only when the user requests the counter value does it add up all the per-CPU counters to return the final result. This is not what we need for nth, where we need the last updated value.

Additionally, I took inspiration from the current implementation of the nth mode in the xt_statistic match from xtables, which operates on atomic values to ensure all CPUs are synced with the last updated counter value.

From the libnftnl point of view, I drew on the counter.c, cmp.c and meta.c expressions to implement nth.c.

Supporting inverted bitwise in nft I

I’m still banging my head on support for the inverted bitwise that I referenced in an older post. Now the challenge is not only to provide the functionality but also to simplify the code.

In the nftables source code there is currently a function in the file netlink_linearize.c which generates the bitwise and cmp operations needed when the list of bitwise flags is positive, as shown below:

nft --debug=netlink add rule ip filter INPUT ct state new,related,established,untracked
ip filter INPUT 
  [ ct load state => reg 1 ]
  [ bitwise reg 1 = (reg=1 & 0x0000004e ) ^ 0x00000000 ]
  [ cmp neq reg 1 0x00000000 ]

Now, the challenge is to improve this behavior by generating both operations in the evaluation phase, within the file evaluate.c, creating the logical structure:

        relational (OP_NEQ)
                / \
               /   \
              /     \
         bitwise   value
            /  \
           /    \
     ct state   mask

No luck so far, but I’ll post updates on the state of this development.

Conntrack translation to nft

One more translation for the conntrack expression.

extensions: libxt_conntrack: Add translation to nft

This patch raised an issue with the inverted list of bitwise values: the nft syntax parser returns an error when an inverted list of conntrack states is passed to nft.

 $ nft add rule ip filter INPUT ct state != new,related counter accept
 <cmdline>:1:41-41: Error: syntax error, unexpected comma, expecting end of file or newline or semicolon
 add rule ip filter INPUT ct state != new,related counter accept

After several days working on this issue I sent a patch to make it parse, but it seems that the regenerated byte code is not correct.

 nft --debug=netlink add rule ip filter INPUT ct state != new,related,established,untracked counter accept
 ip filter INPUT
   [ ct load state => reg 1 ]
   [ cmp neq reg 1 0x0000004e ]
   [ counter pkts 0 bytes 0 ]
   [ immediate reg 0 accept ]

It should be something like:

 nft --debug=netlink add rule ip filter INPUT ct state new,related,established,untracked
 ip filter INPUT
   [ ct load state => reg 1 ]
   [ bitwise reg 1 = (reg=1 & 0x0000004e ) ^ 0x00000000 ]
   [ cmp eq reg 1 0x00000000 ]

I’ll be working further on this issue. To be continued…



Translation phase passed

The iptables-to-nftables translation patches released so far are the following:

extensions: libxt_ipcomp: Add translation to nft

extensions: libip6t_hbh: Add translation to nft

extensions: libxt_multiport: Add translation to nft

extensions: libxt_dscp: Add translation to nft

extensions: libip6t_frag: Add translation to nft

extensions: libxt_cgroup: Add translation to nft

And some other documentation fixes in the nftables package:

doc: fix compression parameter index

doc: fix old parameters and update datatypes

doc: Update datatypes