Country-based routing made fast

Country-based routing sounds simple: if the destination IP belongs to a country, send it through a different WAN or VPN; otherwise use the normal default route. On OpenWrt, the slow version is also simple: download a country CIDR list and append thousands of firewall rules.

That works until it does not. A country list can contain thousands or tens of thousands of prefixes. If each packet has to walk a long linear rule chain before the router can choose an output interface, the router is doing routing policy work in one of the hottest parts of the network stack.

The fast version has three ideas:

  1. Store the country list in a kernel set, not as thousands of rules.
  2. Mark packets before the route lookup, then let Linux policy routing choose the table.
  3. Cache the result in the connection mark so the expensive country lookup happens once per flow instead of once per packet.

This post uses OpenWrt with firewall4 (fw4) and nftables, which is the default firewall stack on modern OpenWrt. I will also compare it with the older iptables and ipset approach because the concepts are the same, but the performance characteristics and tooling are different.

The packet path that matters

For forwarded IPv4 or IPv6 traffic, the simplified Linux path looks like this:

NIC RX
  -> driver / NAPI
  -> skb enters network stack
  -> netfilter prerouting
  -> routing decision
  -> netfilter forward
  -> netfilter postrouting / NAT
  -> qdisc / driver TX

For split routing, the important point is the boundary between prerouting and the routing decision.

If you set a mark in forward, it is too late for that packet’s route lookup. The packet has already been assigned an output path. For policy routing, mark the packet in prerouting, then use an ip rule that matches the mark.

The routing part is not nftables-specific:

ip rule add fwmark 0x100/0x100 lookup 100 priority 100
ip route replace default dev tun0 table 100

That says: if the packet mark has bit 0x100, use routing table 100. Table 100 sends traffic to tun0. Everything without the mark continues through the normal main table.
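The value/mask form of fwmark can be checked by hand. The sketch below models the comparison as (mark XOR value) AND mask, which is my reading of how the rule is evaluated. The function name rule_matches and the mark 0x300 are hypothetical, chosen only to show that bits outside the mask are ignored.

```shell
#!/bin/sh
# rule_matches MARK VALUE MASK
# Succeeds when MARK equals VALUE on every bit selected by MASK.
rule_matches() {
    [ $(( ($1 ^ $2) & $3 )) -eq 0 ]
}

# fwmark 0x100/0x100: only bit 0x100 is compared.
rule_matches 0x100 0x100 0x100 && echo "0x100 -> table 100"
rule_matches 0x300 0x100 0x100 && echo "0x300 -> table 100"   # extra bit ignored
rule_matches 0x001 0x100 0x100 || echo "0x001 -> main table"
```

Using a mask rather than a bare value is what lets other packages keep their own bits in the same 32-bit mark.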

Netfilter classifies the packet. Linux policy routing chooses the route.

FLOPS are the wrong unit

People sometimes describe this kind of optimization as reducing “FLOPS”. Routers usually are not limited by floating-point work. Packet forwarding is mostly integer comparisons, pointer chasing, cache behavior, memory bandwidth, branch prediction, lock contention, and per-packet kernel overhead.

A better mental model is cycles per packet.

At 1 Gbit/s, the packet rate depends heavily on packet size:

1 Gbit/s with 1500-byte packets: about 81,000 packets/s
1 Gbit/s with minimum Ethernet frames: about 1,488,000 packets/s

On an 880 MHz router CPU, that gives roughly:

880,000,000 cycles/s / 81,000 pps     = about 10,800 cycles/packet
880,000,000 cycles/s / 1,488,000 pps  = about 590 cycles/packet

Those budgets include everything: interrupts or polling, skb handling, conntrack, NAT, firewall, route lookup, qdisc, driver work, and any VPN encryption. If country routing burns thousands of operations per packet, it can easily become the bottleneck.
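The budgets above can be reproduced with integer shell arithmetic. This is a rough sketch, assuming each Ethernet frame costs an extra 20 bytes on the wire (8 bytes of preamble plus a 12-byte inter-frame gap) and that a 1500-byte packet becomes a 1518-byte frame. The function name budget is made up for the example.

```shell
#!/bin/sh
# budget LINK_BPS CPU_HZ FRAME_BYTES
# Prints: frame size, packets per second, CPU cycles available per packet.
budget() {
    wire_bits=$(( ($3 + 20) * 8 ))   # frame + preamble + inter-frame gap
    pps=$(( $1 / wire_bits ))
    echo "$3 $pps $(( $2 / pps ))"
}

budget 1000000000 880000000 1518   # full-size frames
budget 1000000000 880000000 64     # minimum-size frames
```

Swap in your own link speed and CPU clock to see how quickly the per-packet budget shrinks.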

So the goal is not “fewer FLOPS”. The goal is fewer cache misses, fewer rule evaluations, fewer branches, and fewer repeated lookups in the per-packet path.

Three ways to match a country

Assume the country list has N prefixes. For a large country or an unaggregated feed, N might be 10,000 or more.

1. Naive iptables rules

The most direct old-style implementation is one mangle rule per prefix:

iptables -t mangle -A PREROUTING -d 198.51.100.0/24 -j MARK --set-mark 0x100
iptables -t mangle -A PREROUTING -d 203.0.113.0/24 -j MARK --set-mark 0x100
# thousands more rules...

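This is easy to understand and terrible at scale. The kernel evaluates rules in order, and MARK is a non-terminating target: after a match, the packet keeps walking the chain unless the rules live in their own chain and each match is followed by a RETURN. Even with an early exit, a match near the end walks most of the chain, and a packet that does not match the country at all walks the entire chain.

The rough cost, assuming an early exit after the first match, is:

matching packet:       about N/2 rule checks on average
non-matching packet:   N rule checks

With 20,000 prefixes and 81,000 packets/s, an average of 10,000 checks per packet means about 810 million rule checks per second. On an 880 MHz CPU that is nearly one check per available cycle, before any other work is done. Each check is small, but the aggregate cost is not.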

This is the version to avoid.

2. iptables with ipset

The better iptables-era design uses ipset:

ipset create country_vpn hash:net family inet
ipset add country_vpn 198.51.100.0/24
ipset add country_vpn 203.0.113.0/24

iptables -t mangle -A PREROUTING \
    -m set --match-set country_vpn dst \
    -j MARK --set-mark 0x100

Now the firewall has one rule. The country list lives in a kernel set. Instead of walking 20,000 firewall rules, the packet performs a set lookup.

That changes the model:

firewall rules:        O(1)
country lookup:        hash lookups, one per distinct prefix length in the set
update mechanism:      ipset add/del/restore/swap

This is much faster than naive iptables. It was the practical solution for a long time on OpenWrt fw3.

The downside is that ipset is a separate subsystem bolted onto iptables through -m set. It works well, but the ruleset and the set are managed through different tools. Atomic updates are possible with ipset restore and swap, but the integration is not as clean as nftables sets.
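A common way to get an atomic ipset update is to build a second set and swap it in. The sketch below only generates the text to feed to ipset restore, so it can be inspected before it touches the kernel. The function name gen_restore and the _new suffix are made up for this example, and it assumes the live set already exists with the same type.

```shell
#!/bin/sh
# gen_restore LIVE_SET
# Reads IPv4 CIDRs on stdin and prints an `ipset restore` script that
# fills a temporary set, swaps it with LIVE_SET, and drops the old data.
gen_restore() {
    live=$1
    tmp="${1}_new"
    echo "create $tmp hash:net family inet"
    # Keep only lines that look like IPv4 prefixes.
    sed -n "s|^[0-9][0-9./]*\$|add $tmp &|p"
    echo "swap $tmp $live"
    echo "destroy $tmp"
}

printf '198.51.100.0/24\n203.0.113.0/24\n' | gen_restore country_vpn
```

On the router this would be used as gen_restore country_vpn < /tmp/country.zone | ipset restore. The matching rule keeps referencing country_vpn the whole time, so packets see either the old list or the new one, never a half-loaded set.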

3. nftables interval sets

With nftables, sets are native. A country list can be represented directly in the nftables ruleset:

set country_vpn4 {
    type ipv4_addr
    flags interval
    auto-merge
    elements = {
        198.51.100.0/24,
        203.0.113.0/24
    }
}

Then one rule uses the set:

ip daddr @country_vpn4 meta mark set 0x100

This gives the same big win as ipset: one rule, kernel-side set lookup. It also gives better ruleset integration. Sets, maps, counters, marks, and policy logic all live in the same nftables language.

For country routing, flags interval matters. Country feeds are ranges or prefixes. An interval set lets nftables store and match ranges compactly. With auto-merge, adjacent or overlapping ranges can be collapsed when the ruleset is loaded.

The exact kernel backend can vary, but the performance model is the important part:

naive iptables chain:      thousands of rule checks
iptables + ipset:          one rule plus set lookup
nftables interval set:     one rule plus native set lookup

The large speedup comes from removing the linear firewall chain.

The next bottleneck: every packet lookup

A native set lookup is fast, but it is still a lookup. If you check the destination country for every packet in a long TCP connection, you are repeating work after the answer is already known.

Conntrack gives us a cache. The first packet of a flow can do the country set lookup. Then we store the result in ct mark. Later packets restore the packet mark from the connection mark and skip the country set lookup.

Use two bits:

0x001  checked this flow
0x100  route this flow through the country route table
0x101  checked + route through country table

The packet mark only needs 0x100, because ip rule only cares whether the route bit is set. The connection mark can keep both bits.

This changes the cost model again:

first packet of flow:       country set lookup
later routed packets:       cheap ct mark restore
later non-routed packets:   cheap ct mark check and return

For a large download, this is a major difference. You pay the country lookup once, not for every packet in the transfer.
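To put a number on it, here is a back-of-the-envelope count for a 1 GiB download. It assumes 1448 bytes of TCP payload per packet (a 1500-byte MTU minus IP, TCP, and timestamp headers) and counts only data packets in one direction. The function name packets_for is hypothetical.

```shell
#!/bin/sh
# packets_for BYTES PAYLOAD
# Number of data packets needed to carry BYTES, rounded up.
packets_for() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

pkts=$(packets_for 1073741824 1448)
echo "set lookups without the cache: $pkts"
echo "set lookups with the ct mark cache: 1"
```

Acknowledgements in the other direction also pass through prerouting, so the real saving is somewhat larger than this estimate.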

OpenWrt fw4 implementation

Assume:

  • LAN subnet: 192.168.1.0/24
  • VPN device: tun0
  • Policy routing table: 100
  • Route bit: 0x100
  • Checked bit: 0x001
  • Combined checked+routed connection mark: 0x101

Pick mark bits that do not conflict with other routing packages such as mwan3 or pbr. The examples below assume these bits are reserved for this policy.

Install the full ip command if needed:

opkg update
opkg install ip-full

Add the policy rule and route:

ip rule add fwmark 0x100/0x100 lookup 100 priority 100 2>/dev/null
ip route replace default dev tun0 table 100

If your tunnel uses a gateway:

ip route replace default via 10.8.0.1 dev tun0 table 100

Check it:

ip rule show
ip route show table 100
ip route get 203.0.113.10 mark 0x100

Define the country set

Files in /etc/nftables.d/*.nft are included inside OpenWrt’s generated table inet fw4, so they are a good place for table-scope objects such as sets.

Create /etc/nftables.d/30-country-routing-sets.nft:

set country_vpn4 {
    type ipv4_addr
    flags interval
    auto-merge
    elements = {
        198.51.100.0/24,
        203.0.113.0/24
    }
}

In a real setup, generate the elements list from your country feed. Keep the generated file boring: sorted, aggregated, and easy to replace.

For example, if /tmp/country.zone already contains one CIDR per line:

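out=/etc/nftables.d/30-country-routing-sets.nft

# Write to a temporary name and rename it into place, so the firewall
# never reads a half-written file. The .tmp suffix keeps the temporary
# file out of the *.nft include glob.
{
    echo 'set country_vpn4 {'
    echo '    type ipv4_addr'
    echo '    flags interval'
    echo '    auto-merge'
    echo '    elements = {'
    # Keep only lines that look like IPv4 prefixes; skip blanks and comments.
    sed -n 's|^\([0-9][0-9./]*\)[[:space:]]*$|        \1,|p' /tmp/country.zone
    echo '    }'
    echo '}'
} > "$out.tmp" && mv "$out.tmp" "$out"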

Before loading a huge list, aggregate it. Fewer intervals mean smaller rulesets, faster load times, and less memory pressure. Aggregation does not change the packet path as dramatically as moving from linear rules to sets, but it still helps.
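A dedicated aggregation tool is the usual choice, but the idea fits in a few lines of awk. The following is a minimal sketch, not a replacement for a real aggregator: it converts each IPv4 prefix to a numeric range, sorts the ranges, and merges neighbours that touch or overlap. The function name merge_cidrs is made up. The output is start-end ranges rather than CIDRs, which an nftables interval set also accepts as elements.

```shell
#!/bin/sh
# merge_cidrs: read IPv4 CIDRs on stdin, print merged ranges on stdout.
merge_cidrs() {
    awk -F'[./]' '
        NF == 5 {
            start = (($1 * 256 + $2) * 256 + $3) * 256 + $4
            size = 2 ^ (32 - $5)
            start -= start % size            # clear any host bits
            printf "%.0f %.0f\n", start, start + size - 1
        }' |
    sort -n -k1,1 -k2,2 |
    awk '
        function ip(n) {
            return sprintf("%d.%d.%d.%d", int(n / 16777216),
                int(n / 65536) % 256, int(n / 256) % 256, n % 256)
        }
        function emit() {
            if (!have) return
            if (lo == hi) print ip(lo)
            else print ip(lo) "-" ip(hi)
        }
        {
            if (have && $1 <= hi + 1) {      # touches or overlaps current range
                if ($2 > hi) hi = $2
            } else {
                emit(); lo = $1; hi = $2; have = 1
            }
        }
        END { emit() }'
}

printf '198.51.100.128/25\n198.51.100.0/25\n203.0.113.0/24\n' | merge_cidrs
```

Here the two adjacent /25 prefixes collapse into one range covering 198.51.100.0 through 198.51.100.255, while 203.0.113.0/24 stays separate.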

Add the fast prerouting logic

The marking logic must run in mangle_prerouting, before the route lookup. Keep the chain snippet outside /etc/nftables.d/*.nft, because /etc/nftables.d is included at table scope. This file is a chain body, not a table object.

Create /etc/firewall.country-prerouting.nft:

# Replies share the flow's ct mark but must follow the normal route back
# to the LAN, so restore the route bit only in the original direction.
ct direction original ct mark 0x101 meta mark set 0x100 return
ct mark 0x001 return

ip saddr 192.168.1.0/24 ip daddr @country_vpn4 ct mark set 0x101 meta mark set 0x100 return
ip saddr 192.168.1.0/24 ct mark set 0x001

What these rules do:

  1. If this flow was already checked and should use the VPN, restore packet mark 0x100 on packets in the original direction and stop. Replies fall through unmarked; otherwise table 100 would send them back into tun0 instead of to the LAN.
  2. If this flow was already checked and should not use the VPN, stop.
  3. If this is an unchecked LAN flow and the destination is in the country set, set connection mark 0x101, set packet mark 0x100, and stop.
  4. If this is an unchecked LAN flow and did not match the country set, remember that it was checked.
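The four steps above can be traced with a small shell model. This is only a sketch of the decision logic for a packet travelling from the LAN toward its destination, not something that runs in the firewall. The function name classify and its yes/no arguments are invented for the example, and a packet mark of 0x0 stands for "left untouched".

```shell
#!/bin/sh
# classify CT_MARK FROM_LAN IN_SET
# Prints "<new ct mark> <packet mark>" for one packet.
classify() {
    ct=$1
    from_lan=$2
    in_set=$3
    case "$ct" in
        0x101) echo "0x101 0x100"; return ;;   # step 1: cached, use the VPN
        0x001) echo "0x001 0x0"; return ;;     # step 2: cached, default route
    esac
    if [ "$from_lan" = yes ] && [ "$in_set" = yes ]; then
        echo "0x101 0x100"                     # step 3: first packet, matched
    elif [ "$from_lan" = yes ]; then
        echo "0x001 0x0"                       # step 4: first packet, no match
    else
        echo "$ct 0x0"                         # not a LAN flow: nothing changes
    fi
}

classify 0x0 yes yes     # first packet to a listed destination
classify 0x101 yes yes   # later packet of the same flow
classify 0x0 yes no      # first packet to an unlisted destination
classify 0x001 yes no    # later packet of that flow
```

Only the first and third calls correspond to a set lookup on the router; the second and fourth are answered from the connection mark alone.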

Now tell fw4 to include this file at the start of mangle_prerouting:

uci add firewall include
uci set firewall.@include[-1].type='nftables'
uci set firewall.@include[-1].path='/etc/firewall.country-prerouting.nft'
uci set firewall.@include[-1].position='chain-prepend'
uci set firewall.@include[-1].chain='mangle_prerouting'
uci commit firewall

Reload and inspect:

/etc/init.d/firewall reload
nft list chain inet fw4 mangle_prerouting

If you want counters while testing, add counter to the country-match rule:

ip saddr 192.168.1.0/24 ip daddr @country_vpn4 counter ct mark set 0x101 meta mark set 0x100 return

Then inspect the chain again and watch whether the counter increases.

Make the route persistent

If tun0 exists only after a VPN comes up, a hotplug script is better than rc.local.

Create /etc/hotplug.d/iface/90-country-routing:

#!/bin/sh

[ "$ACTION" = "ifup" ] || exit 0
[ "$INTERFACE" = "vpn" ] || exit 0

ip rule add fwmark 0x100/0x100 lookup 100 priority 100 2>/dev/null
ip route replace default dev tun0 table 100

Use the OpenWrt interface name for $INTERFACE. That might be vpn, while the Linux device is tun0.

For a simple always-up interface, /etc/rc.local is acceptable:

ip rule add fwmark 0x100/0x100 lookup 100 priority 100 2>/dev/null
ip route replace default dev tun0 table 100

exit 0

IPv6

IPv6 is the same idea with a separate set and route:

set country_vpn6 {
    type ipv6_addr
    flags interval
    auto-merge
    elements = {
        2001:db8:100::/40,
        2001:db8:200::/40
    }
}

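Then add IPv6 rules to the same prerouting snippet:

ip6 saddr fd00::/8 ip6 daddr @country_vpn6 ct mark set 0x101 meta mark set 0x100 return
ip6 saddr fd00::/8 ct mark set 0x001

Policy rules are per address family, so IPv6 needs its own ip rule as well as its own route in table 100:

ip -6 rule add fwmark 0x100/0x100 lookup 100 priority 100
ip -6 route replace default dev tun0 table 100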

Be careful with source prefixes. If your LAN uses global IPv6 prefixes delegated by your ISP, match those prefixes instead of fd00::/8.

What about flow offloading?

OpenWrt can use software or hardware flow offloading. Flow offloading changes the packet path after a flow is established, which is exactly why it is fast.

For policy routing, the first packets still need to be classified correctly. The route bit must be set before the route lookup, and the route table must point to the right output interface. After that, flow offload may reduce how often later packets visit the normal netfilter path.

When debugging, disable flow offloading first (the flow_offloading and flow_offloading_hw options in the firewall defaults section). Otherwise counters can be misleading because later packets may bypass the rules you are watching.

iptables and ipset equivalent

The older OpenWrt fw3 style would use ipset plus iptables mangle rules:

ipset create country_vpn hash:net family inet
ipset add country_vpn 198.51.100.0/24
ipset add country_vpn 203.0.113.0/24

iptables -t mangle -A PREROUTING \
    -s 192.168.1.0/24 \
    -m set --match-set country_vpn dst \
    -j MARK --set-mark 0x100

ip rule add fwmark 0x100/0x100 lookup 100 priority 100
ip route replace default dev tun0 table 100

A connection-mark cache is also possible with CONNMARK, but the rules become harder to read:

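# Restore the route bit only in the original direction, so replies
# are routed back to the LAN by the main table.
iptables -t mangle -A PREROUTING \
    -m conntrack --ctdir ORIGINAL \
    -m connmark --mark 0x101 \
    -j MARK --set-mark 0x100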

iptables -t mangle -A PREROUTING \
    -m connmark --mark 0x101 \
    -j RETURN

iptables -t mangle -A PREROUTING \
    -m connmark --mark 0x001 \
    -j RETURN

iptables -t mangle -A PREROUTING \
    -s 192.168.1.0/24 \
    -m set --match-set country_vpn dst \
    -j CONNMARK --set-mark 0x101

iptables -t mangle -A PREROUTING \
    -m connmark --mark 0x101 \
    -j MARK --set-mark 0x100

iptables -t mangle -A PREROUTING \
    -m connmark --mark 0x101 \
    -j RETURN

iptables -t mangle -A PREROUTING \
    -s 192.168.1.0/24 \
    -j CONNMARK --set-mark 0x001

This works, but it shows why nftables is nicer. In nftables, matching, set lookup, packet mark, connection mark, and return can be expressed in a compact rule sequence.

Comparison

Approach                     | Packet path cost                                        | Update model                                | OpenWrt fit
-----------------------------|---------------------------------------------------------|---------------------------------------------|-----------------------------
Thousands of iptables rules  | Linear rule walk, O(N) worst case                       | Append/delete many rules                    | Avoid
iptables + ipset             | One firewall rule plus set lookup                       | ipset restore, swap, separate tool          | Good for old fw3 systems
nftables set                 | One nft rule plus native set lookup                     | Native nft ruleset/set updates              | Best fit for fw4
nftables set + ct mark cache | First packet does set lookup, later packets use ct mark | Same as nftables, with conntrack dependency | Best for large country lists

The biggest jump is from linear rules to sets. The next jump is avoiding repeated set lookups on every packet of an already-classified flow.

Debugging checklist

Print the generated firewall:

fw4 print
nft list ruleset

Check that the set exists:

nft list set inet fw4 country_vpn4

Check that the prerouting rules were included:

nft list chain inet fw4 mangle_prerouting

Check routing policy:

ip rule show
ip route show table 100
ip route get 203.0.113.10 mark 0x100

Check live conntrack marks if conntrack is installed:

opkg install conntrack
conntrack -L --mark 257

257 is decimal for 0x101.

If the nftables counter increases but traffic goes out the wrong interface, the problem is probably policy routing. If the counter never increases, the problem is the set, the source match, the destination match, or flow offloading hiding later packets during testing.

Practical notes

Country-based routing is only as accurate as the IP data. CDNs, anycast, cloud providers, and VPNs make “country” a fuzzy property. DNS-based approaches have their own problems because one hostname can resolve to different addresses depending on resolver, client subnet, and time.

For router performance, the important part is to keep the hot path small:

  • Do not generate one firewall rule per prefix.
  • Use a kernel set.
  • Aggregate the country list before loading it.
  • Mark in mangle_prerouting, not forward.
  • Use ip rule and a separate route table for the actual routing decision.
  • Cache the result in ct mark if the list is large or traffic volume is high.
  • Disable flow offloading while debugging counters and path selection.

Modern OpenWrt with fw4 and nftables is a good fit for this. The rules are shorter, the country list is represented as a native interval set, and the connection-mark cache keeps the expensive lookup out of the steady-state packet path.
