GRE over WireGuard mesh with BGP and BIRD

A WireGuard mesh gives each site a secure underlay. BGP gives each site a way to tell the rest of the network which prefixes it can reach. GRE, placed inside WireGuard, gives each adjacency a normal point-to-point tunnel interface that Linux, BIRD, firewall rules, and monitoring tools can treat like any other routed link.

The result is a multihomed private intranet: a house, VPS, office, and travel router can all announce their local LAN prefixes, learn the other LAN prefixes, and reroute when a link becomes slow or fails.

The important caveat is that BGP does not measure latency or throughput by itself. BGP chooses routes from attributes. If you want routes to follow measured quality, you need a control loop:

  1. Measure each tunnel.
  2. Smooth the measurements.
  3. Convert the score into BGP policy.
  4. Ask BIRD to re-evaluate the routes.

That sounds complicated, but the model is small enough to run on OpenWrt.

Topology

Imagine four sites:

flowchart LR
  A["site A<br/>OpenWrt<br/>10.10.1.0/24"]
  B["site B<br/>VPS<br/>10.10.2.0/24"]
  C["site C<br/>office<br/>10.10.3.0/24"]
  D["site D<br/>travel<br/>10.10.4.0/24"]

  A -- "wg0 + gre-ab<br/>BGP" --> B
  A -- "wg0 + gre-ac<br/>BGP" --> C
  B -- "wg0 + gre-bc<br/>BGP" --> C
  B -- "wg0 + gre-bd<br/>BGP" --> D
  C -- "wg0 + gre-cd<br/>BGP" --> D

  A -. "fallback path<br/>A -> B -> D" .-> D

WireGuard is the encrypted transport. GRE is the routed link inside that transport. BIRD runs BGP over the GRE addresses.

For example:

WireGuard underlay:
  site A wg0: 10.255.0.1/32
  site B wg0: 10.255.0.2/32

GRE adjacency over WireGuard:
  gre-ab local 10.255.0.1 remote 10.255.0.2
  site A gre-ab: 172.31.12.1/30
  site B gre-ab: 172.31.12.2/30

BGP:
  site A AS65001 peers with site B AS65002 over 172.31.12.0/30

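You can run BGP directly over WireGuard, but WireGuard forwards by AllowedIPs: every prefix BGP might send toward a peer must already be listed for that peer, and on a shared wg0 the same prefix cannot belong to two peers, which usually pushes you to one WireGuard interface per peer. With GRE inside WireGuard, routed traffic is always addressed to the tunnel endpoint, so AllowedIPs stay minimal. GRE is also useful when you want a separate Linux interface per adjacency, a clean per-link MTU, per-link counters, per-link firewall rules, or a topology that looks like traditional routed point-to-point links. The cost is one more layer of encapsulation: a GRE header plus the IP header that carries it.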

Encapsulation and MTU

MTU mistakes are the first failure mode. The packet is wrapped like this:

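outer IP
  UDP
    WireGuard
      GRE delivery IP (10.255.0.1 -> 10.255.0.2)
        GRE
          inner IP packet

The GRE delivery header is easy to forget: GRE packets travel between the WireGuard overlay addresses, so they carry their own IP header inside the WireGuard payload. With IPv4 outside, IPv4 GRE endpoints, and no GRE key, the overhead is approximately:

$$ H = H_{\text{IPv4}} + H_{\text{UDP}} + H_{\text{WireGuard}} + H_{\text{IPv4,GRE}} + H_{\text{GRE}} $$

$$ H \approx 20 + 8 + 32 + 20 + 4 = 84\text{ bytes} $$

For a 1500-byte WAN:

$$ MTU_{\text{inner}} \le 1500 - 84 = 1416 $$

For PPPoE:

$$ MTU_{\text{inner}} \le 1492 - 84 = 1408 $$

If the outer underlay is IPv6, add another 20 bytes compared with IPv4:

$$ MTU_{\text{inner,IPv6 outer}} \le 1500 - (40 + 8 + 32 + 20 + 4) = 1396 $$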

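In practice I would start the GRE interfaces around 1380, then raise them only after testing PMTU, ICMP handling, and real TCP flows. The GRE MTU also has to stay at least 24 bytes below the wg0 MTU (20-byte IPv4 delivery header plus 4-byte GRE header), so a wg0 left at the common 1420 caps GRE at 1396. A slightly smaller MTU is usually less painful than intermittent black holes.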
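A minimal sketch of the MTU arithmetic, counting every header between the WAN and the inner packet, including the 20-byte IPv4 header that carries GRE between the tunnel endpoints inside WireGuard. The function and constant names are made up for illustration; the 4-byte GRE key field only applies if you configure a key.

```python
# Header sizes in bytes.
IPV4_HEADER = 20
IPV6_HEADER = 40
UDP_HEADER = 8
WIREGUARD_OVERHEAD = 32   # WireGuard data-message header plus auth tag
GRE_DELIVERY_IPV4 = 20    # IP header between the overlay endpoints, inside WireGuard
GRE_BASE_HEADER = 4
GRE_KEY_FIELD = 4         # only present when a GRE key is configured


def inner_mtu(outer_mtu: int, outer_ipv6: bool = False, gre_key: bool = False) -> int:
    """Largest inner packet that fits in one outer packet without fragmentation."""
    outer_ip = IPV6_HEADER if outer_ipv6 else IPV4_HEADER
    gre = GRE_BASE_HEADER + (GRE_KEY_FIELD if gre_key else 0)
    overhead = outer_ip + UDP_HEADER + WIREGUARD_OVERHEAD + GRE_DELIVERY_IPV4 + gre
    return outer_mtu - overhead


def gre_mtu_over_wg(wg_mtu: int, gre_key: bool = False) -> int:
    """Upper bound for the GRE interface MTU given the WireGuard interface MTU."""
    gre = GRE_BASE_HEADER + (GRE_KEY_FIELD if gre_key else 0)
    return wg_mtu - GRE_DELIVERY_IPV4 - gre


if __name__ == "__main__":
    print("1500-byte WAN, IPv4 outer:", inner_mtu(1500))
    print("PPPoE, IPv4 outer:", inner_mtu(1492))
    print("1500-byte WAN, IPv6 outer:", inner_mtu(1500, outer_ipv6=True))
    print("GRE over wg0 at 1420:", gre_mtu_over_wg(1420))
```

These are upper bounds. The configured value should sit below them until path MTU discovery has been tested end to end.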

OpenWrt packages

On OpenWrt, install the routing pieces explicitly:

opkg update
opkg install wireguard-tools gre kmod-gre ip-full bird2 bird2c

Use ip-full if the tiny ip build is missing tunnel or policy commands. The OpenWrt GRE protocol helper is convenient, but the plain ip tunnel form is easier to reason about:

ip tunnel add gre-ab mode gre local 10.255.0.1 remote 10.255.0.2 ttl 255
ip addr add 172.31.12.1/30 dev gre-ab
ip link set dev gre-ab mtu 1380 up

Make the WireGuard peer route only the peer’s overlay address, not the whole remote LAN:

peer B allowed_ips = 10.255.0.2/32

BGP should learn the LAN routes. WireGuard should only make the next tunnel endpoint reachable.

BIRD baseline

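Here is a compact BIRD 2 style configuration for site A. It announces 10.10.1.0/24 and learns other site LANs from BGP peers. Because the export filter passes only the local LAN, site A is a stub in this example; transit sites such as B and C would also have to export the mesh prefixes they learn, or the A -> B -> D fallback cannot exist.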

router id 10.255.0.1;

define MY_AS = 65001;
define SITE_A_LAN = 10.10.1.0/24;

protocol device {
}

protocol direct direct_lans {
  ipv4;
  interface "br-lan";
}

protocol kernel kernel4 {
  ipv4 {
    import none;
    export all;
  };
}

filter export_site_a {
  if net = SITE_A_LAN then accept;
  reject;
}

filter import_mesh_default {
  if net ~ [ 10.10.0.0/16{24,32} ] then accept;
  reject;
}

template bgp mesh_peer {
  local as MY_AS;
  ipv4 {
    import filter import_mesh_default;
    export filter export_site_a;
    next hop self;
  };
}

protocol bgp site_b from mesh_peer {
  neighbor 172.31.12.2 as 65002;
  source address 172.31.12.1;
}

protocol bgp site_c from mesh_peer {
  neighbor 172.31.13.2 as 65003;
  source address 172.31.13.1;
}

For a small intranet, private AS numbers per site make loop prevention simple because the AS path carries the path history. For a larger mesh, iBGP with route reflectors can reduce session count, but the policy model below is the same: each router prefers the neighbor that gives the best route to a prefix.

The scoring model

Let link \(i \to j\) have measured round-trip time \(r_{ij}\), jitter \(q_{ij}\), packet loss \(p_{ij}\), and available throughput estimate \(b_{ij}\).

Raw measurements are noisy, so do not feed them directly into BGP. Use an exponentially weighted moving average:

$$ \hat r_{ij}(t) = \alpha r_{ij}(t) + (1 - \alpha)\hat r_{ij}(t-1) $$

$$ \hat b_{ij}(t) = \alpha b_{ij}(t) + (1 - \alpha)\hat b_{ij}(t-1) $$

For a fast-reacting mesh, \(\alpha = 0.25\) is reasonable. For a stable home network, \(\alpha = 0.1\) avoids route churn.
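The smoothing step can be sketched as follows. Seeding the average with the first sample is an assumption (the formula leaves the initial value open), and the function names are illustrative.

```python
from typing import Iterable, List, Optional


def ewma_update(previous: Optional[float], sample: float, alpha: float) -> float:
    """One EWMA step. The first sample seeds the average (an assumption)."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    if previous is None:
        return sample
    return alpha * sample + (1.0 - alpha) * previous


def smooth(samples: Iterable[float], alpha: float) -> List[float]:
    """Return the running EWMA after each sample."""
    smoothed: List[float] = []
    current: Optional[float] = None
    for sample in samples:
        current = ewma_update(current, sample, alpha)
        smoothed.append(current)
    return smoothed


if __name__ == "__main__":
    # A single 100 ms RTT spike on a 20 ms link moves the average only partway.
    print(smooth([20.0, 20.0, 100.0, 20.0], alpha=0.25))
```

With \(\alpha = 0.25\) the spike lifts the smoothed RTT to 40 ms rather than 100 ms, and the next clean sample pulls it back to 35 ms.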

Now convert link quality into a score:

$$ \begin{aligned} S_{ij} &= w_b \cdot \min\left(1, \frac{\hat b_{ij}}{B_{\text{ref}}}\right) \\ &{}- w_r \cdot \frac{\hat r_{ij}}{R_{\text{ref}}} \\ &{}- w_q \cdot \frac{\hat q_{ij}}{Q_{\text{ref}}} \\ &{}- w_p \cdot \hat p_{ij} \end{aligned} $$

The terms are normalized so they can be compared:

  • \(B_{\text{ref}}\) is the bandwidth that counts as “good enough”, for example 200 Mbit/s.
  • \(R_{\text{ref}}\) is a latency scale, for example 50 ms.
  • \(Q_{\text{ref}}\) is a jitter scale, for example 20 ms.
  • \(p_{ij}\) is loss as a fraction, so 1% loss is 0.01.

A practical weight set for an interactive intranet is:

$$ w_b = 45,\quad w_r = 35,\quad w_q = 10,\quad w_p = 300 $$

Loss gets a high penalty because a lossy high-throughput path is usually worse than a clean lower-throughput path for SSH, DNS, TCP, and WireGuard itself.

Then map the score to BGP local preference:

$$ LP_{ij} = 100 + \left\lfloor 2 \cdot \text{clamp}(S_{ij}, 0, 50) \right\rfloor $$

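This bounds local preference to 100..200, where higher is better. With the weights above, \(S_{ij}\) cannot exceed \(w_b = 45\), so the values actually produced top out at 190.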
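A sketch of the score and the local-preference mapping, using the reference scales and weights quoted in this section. The `LinkSample` type and function names are hypothetical, and the inputs are assumed to be already smoothed.

```python
import math
from dataclasses import dataclass

# Reference scales and weights from the text.
B_REF_MBIT = 200.0
R_REF_MS = 50.0
Q_REF_MS = 20.0
W_B, W_R, W_Q, W_P = 45.0, 35.0, 10.0, 300.0


@dataclass(frozen=True)
class LinkSample:
    bandwidth_mbit: float   # smoothed throughput estimate
    rtt_ms: float           # smoothed round-trip time
    jitter_ms: float        # smoothed jitter
    loss_fraction: float    # 1% loss is 0.01


def link_score(s: LinkSample) -> float:
    """Bandwidth reward (capped at B_ref) minus latency, jitter, and loss penalties."""
    return (W_B * min(1.0, s.bandwidth_mbit / B_REF_MBIT)
            - W_R * s.rtt_ms / R_REF_MS
            - W_Q * s.jitter_ms / Q_REF_MS
            - W_P * s.loss_fraction)


def local_pref(score: float) -> int:
    """Clamp the score to 0..50 and map it onto 100..200."""
    clamped = min(max(score, 0.0), 50.0)
    return 100 + math.floor(2 * clamped)


if __name__ == "__main__":
    clean = LinkSample(bandwidth_mbit=200.0, rtt_ms=25.0, jitter_ms=5.0, loss_fraction=0.0)
    lossy = LinkSample(bandwidth_mbit=400.0, rtt_ms=25.0, jitter_ms=5.0, loss_fraction=0.0625)
    for name, sample in (("clean", clean), ("lossy", lossy)):
        print(name, link_score(sample), local_pref(link_score(sample)))
```

The lossy link has twice the bandwidth, but the cap at \(B_{\text{ref}}\) and the loss weight leave it well below the clean link.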

Why not use shortest path directly?

You can define a link cost:

$$ \begin{aligned} C_{ij} &= \beta \hat r_{ij} \\ &{}+ \gamma \frac{1}{\max(\hat b_{ij}, \epsilon)} \\ &{}+ \delta \hat p_{ij} \end{aligned} $$

Then the best path from node \(s\) to node \(d\) is:

$$ P^* = \arg\min_{P \in \mathcal{P}_{s,d}} \sum_{(i,j)\in P} C_{ij} $$

That is the normal shortest-path view. OSPF, IS-IS, Babel, and many overlay controllers think this way. BGP does not. BGP is path-vector policy. It compares route attributes: local preference, AS path length, origin, MED, next-hop reachability, and tie breakers.
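For contrast, here is what the shortest-path view looks like in code. This is a plain Dijkstra sketch, not something BGP or BIRD runs; the coefficient defaults for \(\beta\), \(\gamma\), \(\delta\), \(\epsilon\) and the example link costs are invented for illustration.

```python
import heapq
from typing import Dict, List, Tuple

Graph = Dict[str, Dict[str, float]]


def link_cost(rtt_ms: float, bandwidth_mbit: float, loss_fraction: float,
              beta: float = 1.0, gamma: float = 1000.0,
              delta: float = 500.0, epsilon: float = 1e-3) -> float:
    """C_ij = beta * rtt + gamma / max(bandwidth, epsilon) + delta * loss."""
    return beta * rtt_ms + gamma / max(bandwidth_mbit, epsilon) + delta * loss_fraction


def shortest_path(graph: Graph, source: str, target: str) -> Tuple[float, List[str]]:
    """Dijkstra over non-negative edge costs; returns (total cost, node list)."""
    queue: List[Tuple[float, str, List[str]]] = [(0.0, source, [source])]
    settled = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return cost, path
        if node in settled:
            continue
        settled.add(node)
        for neighbor, edge_cost in graph.get(node, {}).items():
            if neighbor not in settled:
                heapq.heappush(queue, (cost + edge_cost, neighbor, path + [neighbor]))
    raise ValueError(f"no path from {source} to {target}")


if __name__ == "__main__":
    # The four-site topology with made-up link costs (no direct A-D link).
    mesh: Graph = {
        "A": {"B": 10.0, "C": 30.0},
        "B": {"A": 10.0, "C": 5.0, "D": 40.0},
        "C": {"A": 30.0, "B": 5.0, "D": 10.0},
        "D": {"B": 40.0, "C": 10.0},
    }
    print(shortest_path(mesh, "A", "D"))
```

A link-state protocol gets this result because every router sees the whole graph. A BGP speaker only sees the routes its neighbors chose to advertise, which is why the measurements have to be expressed as per-neighbor policy instead.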

So with BGP, the trick is not to make BGP run Dijkstra. The trick is to translate your local measurements into BGP attributes that BIRD can use. For a small private mesh, local preference is the simplest knob:

  • prefer high-quality direct neighbor paths with higher bgp_local_pref
  • let AS path length break ties when quality is similar
  • use MED only when you intentionally want to tell a neighbor which ingress point you prefer
  • use route filters to avoid exporting routes back toward the neighbor that supplied them

Feeding scores into BIRD

BIRD configuration is declarative, so the monitor should generate a small include file and reload BIRD when the values change enough to matter.

Example generated file:

# /tmp/bird-mesh-metrics.conf
define LP_SITE_B = 178;
define LP_SITE_C = 142;

Main BIRD config:

include "/tmp/bird-mesh-metrics.conf";

filter import_from_b {
  if net ~ [ 10.10.0.0/16{24,32} ] then {
    bgp_local_pref = LP_SITE_B;
    accept;
  }
  reject;
}

filter import_from_c {
  if net ~ [ 10.10.0.0/16{24,32} ] then {
    bgp_local_pref = LP_SITE_C;
    accept;
  }
  reject;
}

protocol bgp site_b from mesh_peer {
  neighbor 172.31.12.2 as 65002;
  source address 172.31.12.1;
  ipv4 {
    import filter import_from_b;
    export filter export_site_a;
    next hop self;
  };
}

Then the monitor loop can do:

birdc configure
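One way the monitor could produce the include file and trigger the reload is sketched below. The helper names are hypothetical. Writing to a temporary file and renaming it over the target is an assumption about what you want: it keeps BIRD from ever reading a half-written file. The sketch does not inspect birdc's output, so a rejected configuration would go unnoticed here.

```python
import os
import subprocess
import tempfile
from typing import Dict


def render_metrics(local_prefs: Dict[str, int]) -> str:
    """Render BIRD define lines, e.g. {"SITE_B": 178} -> 'define LP_SITE_B = 178;'."""
    lines = ["# generated by the mesh monitor; do not edit"]
    for name in sorted(local_prefs):
        lines.append(f"define LP_{name} = {int(local_prefs[name])};")
    return "\n".join(lines) + "\n"


def write_atomically(path: str, text: str) -> None:
    """Write to a temp file in the target directory, then rename over the target."""
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".bird-metrics-")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(text)
        os.chmod(tmp_path, 0o644)   # mkstemp creates 0600; let the BIRD user read it
        os.replace(tmp_path, path)
    except Exception:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def reload_bird() -> None:
    """Ask BIRD to re-read its configuration (same command as in the text)."""
    subprocess.run(["birdc", "configure"], check=False)


def publish(local_prefs: Dict[str, int],
            path: str = "/tmp/bird-mesh-metrics.conf") -> None:
    write_atomically(path, render_metrics(local_prefs))
    reload_bird()
```

Calling `publish({"SITE_B": 178, "SITE_C": 142})` would regenerate the example file shown above and then run `birdc configure`.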

Do not reload on every tiny measurement change. Add hysteresis:

$$ \Delta LP = |LP_{\text{new}} - LP_{\text{old}}| $$

Only reconfigure when:

$$ \Delta LP \ge 5 $$

or when the link crosses a hard health boundary, such as loss > 5% or RTT > 500 ms.
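The reconfigure decision can be written as a small predicate. This sketch reads "crosses a hard health boundary" as a change of health state in either direction, which is an interpretation; the function name and the `was_unhealthy` flag are illustrative.

```python
def should_reconfigure(old_lp: int, new_lp: int,
                       loss_fraction: float, rtt_ms: float,
                       was_unhealthy: bool,
                       min_delta: int = 5,
                       max_loss: float = 0.05,
                       max_rtt_ms: float = 500.0) -> bool:
    """Reload only on a meaningful LP change or a health-state transition."""
    is_unhealthy = loss_fraction > max_loss or rtt_ms > max_rtt_ms
    crossed_boundary = is_unhealthy != was_unhealthy
    return crossed_boundary or abs(new_lp - old_lp) >= min_delta


if __name__ == "__main__":
    # Small wobble on a healthy link: leave BIRD alone.
    print(should_reconfigure(150, 153, 0.0, 20.0, was_unhealthy=False))
    # Loss jumps past 5%: reload even though the LP barely moved.
    print(should_reconfigure(150, 151, 0.08, 20.0, was_unhealthy=False))
```

The caller has to remember the previous health state per link and the last local preference it actually published, not the last one it computed.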

Measurement without burning the router

Active throughput tests are expensive. Running iperf3 across every tunnel every few seconds can become the bottleneck you are trying to avoid.

Use a three-layer measurement plan:

  1. Fast health probe: ping or BFD-like keepalive to measure reachability, RTT, jitter, and loss.
  2. Passive throughput estimate: read interface byte counters and compute EWMA rate while real traffic is present.
  3. Occasional active probe: short iperf3 or UDP probe only when the passive estimate is stale or two paths are close.

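The passive estimator measures the traffic the link actually carried during the interval, so treat it as a lower bound on available throughput:

$$ b_{ij}(t) = \frac{8 \cdot (\text{bytes}(t) - \text{bytes}(t - \Delta t))}{\Delta t} $$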

Then smooth it:

$$ \hat b_{ij}(t) = \alpha b_{ij}(t) + (1-\alpha)\hat b_{ij}(t-1) $$

For counters:

rx=$(cat /sys/class/net/gre-ab/statistics/rx_bytes)
tx=$(cat /sys/class/net/gre-ab/statistics/tx_bytes)
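The same counters can feed the rate formula directly. In this sketch the function names are made up, and returning `None` when a counter goes backwards (for example after the interface was recreated) is an assumption about how you want to handle resets.

```python
from pathlib import Path
from typing import Optional


def read_counter(interface: str, direction: str = "rx") -> int:
    """Read rx_bytes or tx_bytes for an interface from sysfs."""
    path = Path(f"/sys/class/net/{interface}/statistics/{direction}_bytes")
    return int(path.read_text().strip())


def rate_bps(previous_bytes: int, current_bytes: int, interval_s: float) -> Optional[float]:
    """Bits per second over the interval, or None if the counter was reset."""
    if interval_s <= 0:
        raise ValueError("interval must be positive")
    delta = current_bytes - previous_bytes
    if delta < 0:
        return None  # counter went backwards: skip this sample
    return 8.0 * delta / interval_s


if __name__ == "__main__":
    import time
    before = read_counter("gre-ab", "rx")
    time.sleep(15)
    after = read_counter("gre-ab", "rx")
    print(rate_bps(before, after, 15.0))
```

Feed the result into the EWMA only when it is not `None`, and remember that an idle link reads as zero without being slow.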

For latency:

ping -I gre-ab -c 5 -q 172.31.12.2
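A sketch of turning the ping summary into numbers for the scoring step. It assumes the summary lines look like the usual BusyBox or iputils output ("N% packet loss" and a "min/avg/max" line); other ping builds may print something different, so treat the regular expressions as a starting point. The function names are hypothetical.

```python
import re
import subprocess
from typing import Optional, Tuple

LOSS_RE = re.compile(r"(\d+(?:\.\d+)?)% packet loss")
# Matches "round-trip min/avg/max = a/b/c ms" and "rtt min/avg/max/mdev = a/b/c/d ms".
RTT_RE = re.compile(r"min/avg/max(?:/\w+)? = ([\d.]+)/([\d.]+)/([\d.]+)")


def parse_ping_summary(output: str) -> Tuple[float, Optional[float]]:
    """Return (loss_fraction, avg_rtt_ms). avg_rtt_ms is None when nothing came back."""
    loss_match = LOSS_RE.search(output)
    if loss_match is None:
        raise ValueError("no packet-loss line found in ping output")
    loss_fraction = float(loss_match.group(1)) / 100.0
    rtt_match = RTT_RE.search(output)
    avg_rtt_ms = float(rtt_match.group(2)) if rtt_match else None
    return loss_fraction, avg_rtt_ms


def probe(interface: str, target: str, count: int = 5) -> Tuple[float, Optional[float]]:
    """Run the ping command from the text and parse its summary."""
    result = subprocess.run(
        ["ping", "-I", interface, "-c", str(count), "-q", target],
        capture_output=True, text=True, check=False,
    )
    return parse_ping_summary(result.stdout)
```

For example, `probe("gre-ab", "172.31.12.2")` would run the command above. Jitter is not derived here; it needs the per-packet times, or the extra deviation field that some ping builds print.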

On small routers, keep the loop slow. A 15- to 60-second interval is usually enough for intranet routing. BGP is not a per-packet load balancer; it is a control plane.

Route stability

If two links have almost equal local preference, a naive controller will flap:

minute 1: site B LP 151, site C LP 149 -> choose B
minute 2: site B LP 148, site C LP 150 -> choose C
minute 3: site B LP 151, site C LP 149 -> choose B

The fix is a switching margin. Move to a candidate path only when:

$$ LP_{\text{candidate}} > LP_{\text{current}} + H $$

where \(H\) might be 8 or 10. Keep the current path unless the alternative is clearly better. Also enforce a minimum dwell time:

$$ t_{\text{now}} - t_{\text{last-change}} \ge T_{\text{min}} $$

For example, \(T_{\text{min}} = 120\) seconds. This makes the network slightly less optimal but much more usable.
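Both rules fit in a small stateful selector. This is a sketch with hypothetical names; switching immediately when the current neighbor disappears from the input is an added assumption, since a dead link should not wait out the dwell timer.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class PathSelector:
    margin: int = 8                 # H: how much better a candidate must be
    min_dwell_s: float = 120.0      # T_min: minimum time between changes
    current: Optional[str] = None
    last_change: float = float("-inf")

    def update(self, prefs: Dict[str, int], now: float) -> str:
        """Return the preferred neighbor given the latest local preferences."""
        best = max(prefs, key=prefs.get)
        if self.current is None or self.current not in prefs:
            # First decision, or the current neighbor vanished: switch at once.
            self.current, self.last_change = best, now
            return self.current
        clearly_better = prefs[best] > prefs[self.current] + self.margin
        dwelled = now - self.last_change >= self.min_dwell_s
        if best != self.current and clearly_better and dwelled:
            self.current, self.last_change = best, now
        return self.current


if __name__ == "__main__":
    selector = PathSelector()
    samples = [
        (0.0, {"B": 151, "C": 149}),
        (60.0, {"B": 148, "C": 150}),
        (120.0, {"B": 151, "C": 149}),
    ]
    for now, prefs in samples:
        print(now, selector.update(prefs, now))
```

Fed the flapping sequence from above, the selector stays on B throughout because C never beats it by more than the margin.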

Packet path cost

The forwarding path is still simple:

LAN packet
  -> Linux route lookup chooses next hop from BIRD-installed route
  -> packet enters GRE interface
  -> GRE encapsulates packet
  -> WireGuard encrypts GRE packet
  -> UDP packet leaves WAN

The control-plane math does not run per forwarded packet. The hot path pays:

$$ \begin{aligned} C_{\text{packet}} &= C_{\text{route lookup}} \\ &{}+ C_{\text{GRE encap}} \\ &{}+ C_{\text{WireGuard crypto}} \\ &{}+ C_{\text{UDP/IP output}} \end{aligned} $$

The scoring loop is control-plane work:

$$ \begin{aligned} C_{\text{control}} &= O(E) \text{ probes} \\ &{}+ O(E) \text{ score updates} \\ &{}+ O(R \log R) \text{ route table work} \end{aligned} $$

where \(E\) is the number of tunnel adjacencies and \(R\) is the number of learned routes. In a home or small lab mesh, \(R\) is usually tiny. The expensive part is not BIRD. The expensive part is unnecessary probing and encryption overhead on small CPUs.

Operational rules

Keep these rules and the design stays boring:

  • WireGuard only routes tunnel endpoint addresses.
  • GRE carries the routed overlay.
  • BIRD owns LAN prefix distribution.
  • The monitor owns metrics, not routes.
  • BGP policy converts metrics into preference.
  • Hysteresis prevents churn.
  • MTU is set deliberately, not guessed.

The biggest design decision is whether to use GRE at all. If your WireGuard mesh already gives every peer a stable point-to-point address and you only need unicast IPv4 routes, BGP directly over WireGuard is simpler. Use GRE when the interface boundary is worth the overhead.
