Mac mini as 10 GbE firewall

posted in Network

A Mac mini with the 10 GbE option is an interesting firewall box. The hardware has more memory bandwidth than a 10 Gbit/s link can consume, it is quiet, and Apple Silicon gives the CPU and GPU access to the same physical memory. That makes it tempting to imagine a small desktop machine doing routing, firewalling, BGP policy, telemetry, and Metal-accelerated packet classification at line rate.

The hard part is not the 10 Gbit/s number by itself.

The hard part is packet rate.

A 10 Gbit/s stream is about 1.25 GB/s of payload in one direction. Even if a forwarding path reads RX payload and writes TX payload, the raw byte rate is still small compared with the memory bandwidth of the M4 Mac mini. Apple lists the M4 Mac mini at 120 GB/s memory bandwidth, and the M4 Pro version at 273 GB/s. The problem is that minimum Ethernet frames at 10 GbE reach about 14.88 million packets per second. That gives about 67 ns per packet before the whole system has to receive, classify, route, forward, and account for the packet.

For 1500-byte-class traffic, the packet rate is much friendlier:

10 GbE with minimum Ethernet frames: about 14.88 Mpps
10 GbE with 1500-byte-class frames:  about 0.813 Mpps

So the design target is not “use the GPU for every packet.” That would add scheduling latency and synchronization overhead to the hottest part of the path.

The useful design is:

CPU owns the immediate packet decision. Metal accelerates selected, batched, metadata-heavy work.

That is the thread through the whole design.

macOS packet paths

There are four practical layers on macOS.

LayerAPI or toolPacket visibilityCan drop?Can modify or forward?Role in this design
Supported firewallNetworkExtension, especially NEFilterPacketProvider and NEFilterDataProviderPacket or flow visibility depending on providerYesVerdict API is allow, drop, or delay. It is not a packet rewrite APIMain supported product path
Local appliance configpf, routing table, route, sysctl, BGP daemonKernel forwarding pathYesNAT and routing through system configUseful for your own box, but not a stable app API
Capture and telemetry/dev/bpf*, cBPF filtersPacket tapDrops from capture stream onlyCan inject raw frames, but it is not a good router pathSampling, pcap, analytics input
Aggressive researchDriverKit networking, private NECP or Skywalk-style mechanisms, old Darwin socketsPotentially earlier ownershipDepends on pathPotentially yes if you own queuesLab path for near-zero-copy queue ownership

NEFilterPacketProvider is the clean supported firewall path. The provider evaluates packets and returns verdicts. Apple’s documentation also makes clear that packet handlers may be executed by multiple simultaneous threads, so a serious implementation must be built as a parallel, lock-minimized datapath.

pf is still very useful for a private appliance. You can combine system routing, NAT, anchors, tables, and a BGP daemon to build a real firewall. But Apple says Packet Filter is not API for software products, so I would not design a distributed commercial product around direct pf control.

BPF is excellent for visibility. It can prefilter capture traffic and feed samples into analysis code. It is not where I would put the inline forwarding decision.

The aggressive path is queue ownership: DriverKit if you control the NIC driver path, or private Skywalk-like mechanisms in a research build. That is where zero-copy forwarding becomes plausible, but it is also where the Apple API contract becomes the weakest.

Top-level architecture

The architecture I would build has three planes:

  1. A CPU fast path for immediate packet decisions.
  2. A BGP policy compiler that publishes immutable rule snapshots.
  3. A Metal batch engine for expensive secondary classification.
flowchart LR
    WAN[10 GbE WAN] --> RX[macOS packet ingress]
    RX --> CPU[CPU fast-path classifier]
    CPU -->|allow or drop| ROUTE[macOS routing and forwarding]
    ROUTE --> LAN[10 GbE LAN]

    CPU -->|selected descriptors| RING[Shared Metal descriptor ring]
    RING --> GPU[Metal batch classifier]
    GPU --> VERDICTS[Shared verdict ring]
    VERDICTS --> CPU

    BGP[BGP daemon] --> COMPILER[Policy compiler]
    COMPILER --> SNAP[Immutable CPU rule snapshot]
    COMPILER --> GRULES[GPU rule buffers]
    SNAP --> CPU
    GRULES --> GPU

The CPU path handles anything that must be decided immediately:

  • L2, L3, and L4 parsing
  • state table lookup
  • interface policy
  • anti-spoof checks
  • exact 5-tuple rules
  • BGP-derived blackhole prefixes
  • normal port and protocol policy
  • route hints and connection tracking

The GPU path handles work that becomes efficient only when batched:

  • large rule-matrix evaluation
  • secondary classification of suspicious new flows
  • anomaly features
  • Bloom-filter rebuild validation
  • payload-window scanning for selected flows
  • telemetry scoring
  • QUIC or TLS metadata feature extraction

The BGP daemon should not sit inside the packet path. It should feed a compiler that turns route policy into compact tables.

Supported Network Extension memory path

In the supported path, the system owns packet storage. The firewall extension should read only the header bytes it needs, produce a compact descriptor when needed, and return a verdict quickly for common traffic.

flowchart TD
    A[NIC RX DMA writes packet into kernel-owned memory] --> B[macOS network stack]
    B --> C[NEFilterPacketProvider]
    C --> D[CPU reads first 64 to 128 bytes]
    D --> E[CPU reads rule snapshot]
    E --> F{Verdict}
    F -->|allow| G[Kernel route or forward]
    F -->|drop| H[Release packet]
    G --> I[NIC TX DMA reads packet for egress]

    D -->|selected only| J[Write compact descriptor to shared Metal buffer]
    J --> K[Metal batch classifier]
    K --> L[Write result to shared verdict buffer]
    L --> F

The important discipline is to avoid treating the extension as a general per-packet application callback. The common path should not allocate, log synchronously, do DNS lookups, make IPC calls, or wait on the GPU.

The rough memory budget is favorable:

ComponentWorst-case estimate at 10 GbE
RX payload DMA1.25 GB/s
TX payload DMA1.25 GB/s
Read first 64 bytes at 64-byte frame line rateabout 0.95 GB/s
RX plus TX plus 64-byte header readabout 3.45 GB/s
64-byte GPU descriptor for every minimum-size packetabout 0.95 GB/s
64-byte GPU descriptor for 1 percent sampled minimum-size packetsabout 9.5 MB/s
64-byte GPU descriptor for 10 percent sampled minimum-size packetsabout 95 MB/s

Those numbers are small next to Apple Silicon memory bandwidth. The issue is synchronous software overhead per packet.

That is why the first rule is simple:

Do not wait on Metal in the hot path.

More aggressive queue ownership

The highest-performance design owns RX and TX queues. In that model, packet memory stays packet memory. The classifier reads metadata and header cache lines. Forwarding attaches the same packet buffer to a TX descriptor, or recycles it through a forwarding buffer pool.

flowchart LR
    NIC[NIC RX queue] --> RXDESC[RX descriptor ring]
    RXDESC --> PKT[Packet buffer pages]
    PKT --> CPU[CPU header classifier]
    CPU -->|drop| FREE[Return buffer to free ring]
    CPU -->|forward| TXDESC[TX descriptor ring]
    TXDESC --> NIC2[NIC TX queue]

    CPU -->|selected descriptors| MBUF[Metal shared descriptor ring]
    MBUF --> GPU[Metal compute]
    GPU --> VBUF[Metal shared verdict ring]
    VBUF --> CPU

The ideal memory behavior is:

RX packet buffer:
    NIC writes the packet once.
    CPU reads only the needed header cache lines.
    CPU does not copy payload.

Forwarded packet:
    TX descriptor references the packet buffer or a recycled forwarding buffer.
    NIC reads the packet once.

GPU:
    CPU writes compact descriptors into shared Metal buffers.
    GPU reads descriptors and rule buffers.
    GPU writes small verdict records.

This is the shape used by high-performance routers and packet engines. The Mac-specific problem is API access. NEFilterPacketProvider gives verdict hooks, not NIC ring ownership. DriverKit can expose packet queues when you own the relevant driver path, especially for external or custom NICs. Private Skywalk-like channels are research-only unless Apple gives you a supported contract.

Packet descriptor layout

Do not send full packets to the GPU unless the classifier really needs payload bytes. Most policy and metadata classification can run on a compact descriptor.

For the CPU, a 64-byte descriptor is convenient:

struct PacketDesc64 {
    uint64_t src_hi;        // IPv6 high, or zero for IPv4
    uint64_t src_lo;        // IPv4 stored in low bits
    uint64_t dst_hi;
    uint64_t dst_lo;

    uint16_t src_port;
    uint16_t dst_port;
    uint16_t ingress_if;
    uint16_t egress_hint;

    uint8_t  ip_version;    // 4 or 6
    uint8_t  l4_proto;      // TCP, UDP, ICMP, other
    uint8_t  tcp_flags;
    uint8_t  direction;

    uint32_t flow_hash;
    uint32_t packet_len;
    uint32_t policy_epoch;
    uint32_t flags;         // fragmented, VLAN, checksum state, suspicious, etc.
};

For Metal, structure-of-arrays usually maps better to coalesced memory reads:

struct GpuBatch {
    device const ulong  *src_hi    [[id(0)]];
    device const ulong  *src_lo    [[id(1)]];
    device const ulong  *dst_hi    [[id(2)]];
    device const ulong  *dst_lo    [[id(3)]];
    device const ushort *src_port  [[id(4)]];
    device const ushort *dst_port  [[id(5)]];
    device const uchar  *proto     [[id(6)]];
    device const uint   *flow_hash [[id(7)]];
    device       uchar  *verdict   [[id(8)]];
};

The CPU parses once. Every later subsystem consumes PacketDesc64 or a transposed GPU view of it.

CPU fast path

The CPU classifier should be organized around a simple lookup order:

parse
  -> state table
  -> interface policy
  -> Bloom filter
  -> prefix trie
  -> exact 5-tuple rules
  -> L4 policy
  -> GPU cold path

The order matters. Most packets in established flows should stop at the state table. Most non-denied new traffic should get a Bloom negative and skip expensive prefix checks.

The rule snapshot should be immutable. Control-plane changes build a new snapshot, then publish it with an atomic pointer swap. Packet threads never mutate global policy.

Verdict classify_packet(PacketBytes pkt, Iface ingress, Direction dir) {
    PacketDesc64 d;

    if (!parse_l2_l3_l4(pkt, &d)) {
        return DROP_MALFORMED;
    }

    d.ingress_if = ingress.id;
    d.direction = dir;
    d.flow_hash = siphash_5tuple(&d);

    RuleSnapshot *rs = atomic_load_acquire(&active_rules);

    StateShard *shard = &rs->state.shards[d.flow_hash & rs->state.mask];
    FlowState *st = shard_lookup_readmostly(shard, d.flow_hash, &d);

    if (st && st->verdict == ALLOW && st->epoch == rs->epoch) {
        return ALLOW;
    }

    if (interface_policy_denies(rs, ingress, dir, &d)) {
        return DROP_POLICY;
    }

    if (bloom_maybe_contains(&rs->deny_bloom, d.dst_hi, d.dst_lo)) {
        if (prefix_trie_contains(&rs->rtbh, d.dst_hi, d.dst_lo)) {
            return DROP_RTBH;
        }
        if (prefix_trie_contains(&rs->bogon, d.src_hi, d.src_lo)) {
            return DROP_BOGON;
        }
        if (prefix_trie_contains(&rs->bgp_deny, d.dst_hi, d.dst_lo)) {
            return DROP_BGP_POLICY;
        }
    }

    ExactRule *er = exact5_lookup(&rs->exact5, d.flow_hash, &d);
    if (er) {
        return er->verdict;
    }

    Verdict v = port_proto_rule_eval(&rs->l4_rules, &d);
    if (v == DROP) {
        return DROP_POLICY;
    }

    if (needs_gpu_secondary_classification(rs, &d)) {
        enqueue_gpu_descriptor(&gpu_ring, &d, pkt.handle);
        return DELAY_OR_TEMPORARY_POLICY;
    }

    shard_insert_new_flow(shard, &d, ALLOW, rs->epoch);
    return ALLOW;
}

In a strict inline firewall, DELAY_OR_TEMPORARY_POLICY is a hard design choice. Delaying too much traffic creates latency and queue pressure. A practical policy is:

  • delay only small volumes of truly suspicious new flows
  • allow with short probation for low-risk flows
  • drop immediately for cheap, known-bad matches
  • sample ordinary traffic for analytics without affecting forwarding

BGP policy compiler

BGP belongs in the control plane. It should feed the firewall, not run inside the packet classifier.

flowchart TD
    A[BGP neighbors] --> B[FRR, BIRD, or OpenBGPD]
    B --> C[Import policy: prefix lists, route maps, AS paths, communities]
    C --> D[Accepted RIB]
    C --> E[Rejected, blackhole, and bogon sets]
    D --> F[Kernel route table]
    E --> G[Firewall prefix compiler]
    G --> H[CPU LPM tries and Bloom filters]
    G --> I[GPU rule buffers]
    H --> J[Atomic RuleSnapshot swap]
    I --> K[Async Metal buffer update]

A useful snapshot looks like this:

struct RuleSnapshot {
    uint32_t epoch;

    FlowStateTable state;
    Exact5Table exact5;

    PrefixTrieV4 allow_v4;
    PrefixTrieV4 deny_v4;
    PrefixTrieV4 rtbh_v4;
    PrefixTrieV4 bogon_v4;

    PrefixTrieV6 allow_v6;
    PrefixTrieV6 deny_v6;
    PrefixTrieV6 rtbh_v6;
    PrefixTrieV6 bogon_v6;

    BloomFilter deny_bloom;
    L4RuleTable l4_rules;
    IfacePolicy if_policy;

    GpuRuleTableHandle gpu_rules;
};

For IPv4, a DIR-24-8 table is a good first version. One direct lookup covers /0 through /24, and a second table handles longer prefixes. A dense 2^24 table with 4-byte entries is 64 MiB. That is large for cache, but small relative to unified memory. If cache locality becomes the bottleneck, use a compressed multibit trie, LC-trie, or Poptrie-style layout.

For IPv6, do not use a naive pointer trie. Pointer chasing is poison in a packet path. Use a compressed multibit trie or bitmap/vector layout.

Publishing a new policy should look like this:

def publish_new_policy(bgp_rib, static_rules, old_snapshot):
    prefixes = extract_policy_prefixes(bgp_rib, static_rules)

    snap = RuleSnapshot()
    snap.epoch = old_snapshot.epoch + 1

    snap.deny_v4 = compile_lpm_v4(prefixes.deny_v4)
    snap.rtbh_v4 = compile_lpm_v4(prefixes.rtbh_v4)
    snap.bogon_v4 = compile_lpm_v4(prefixes.bogon_v4)

    snap.deny_v6 = compile_lpm_v6(prefixes.deny_v6)
    snap.rtbh_v6 = compile_lpm_v6(prefixes.rtbh_v6)
    snap.bogon_v6 = compile_lpm_v6(prefixes.bogon_v6)

    snap.deny_bloom = build_bloom(
        prefixes.deny_v4,
        prefixes.rtbh_v4,
        prefixes.bogon_v4,
        prefixes.deny_v6,
        prefixes.rtbh_v6,
        prefixes.bogon_v6,
    )

    snap.exact5 = compile_exact_rules(static_rules.exact5)
    snap.l4_rules = compile_l4_rules(static_rules.l4)
    snap.if_policy = compile_interface_rules(static_rules.interfaces)

    gpu_rules = build_gpu_rule_table(snap)
    upload_gpu_rules_async(gpu_rules)

    atomic_store_release(active_rules, snap)
    retire_after_grace_period(old_snapshot)

The important property is that packet threads see one complete snapshot at a time.

Metal memory model

Apple Silicon has unified physical memory. Metal exposes this through device properties and storage modes. MTLStorageMode.shared is system memory that both CPU and GPU can access, and it is the default storage mode for buffers on integrated GPUs and for buffers and textures on Apple Silicon GPUs.

That does not make synchronization free. It removes explicit PCIe-style transfer cost, but it does not remove the need for batching and ownership boundaries.

Use three classes of Metal buffers:

BufferStorage modeAccess patternReason
Descriptor ringMTLStorageModeSharedCPU writes, GPU readsAvoid a separate copy
Verdict ringMTLStorageModeSharedGPU writes, CPU readsSmall result transfer
Large rule tablesMTLStorageModePrivate after upload, or Shared for frequent updatesGPU reads heavilyPrivate can be better for GPU-side access; shared is simpler during churn

The scheduler should be triple-buffered:

sequenceDiagram
    participant CPU
    participant GPU

    CPU->>CPU: Fill descriptor buffer A
    CPU->>GPU: Commit batch A
    CPU->>CPU: Fill descriptor buffer B
    GPU->>GPU: Classify A
    CPU->>GPU: Commit batch B
    GPU->>CPU: Complete A verdicts
    CPU->>CPU: Apply A verdicts
    CPU->>CPU: Fill descriptor buffer C
    GPU->>GPU: Classify B

The CPU should submit GPU work through command buffers and then continue packet processing. Completion handlers, shared events, or polling can move a ring slot back to the CPU. The packet hot path should never block on the current command buffer.

Metal classifier shape

The GPU classifier should evaluate many descriptors against large, mostly static tables. The data layout matters more than the syntax of the shader.

struct GpuRuleMeta {
    uint deny_bloom_words;
    uint matrix_rule_count;
    uint trie_node_count;
    uint epoch;
};

kernel void classify_batch(
    device const ulong *dst_hi       [[buffer(0)]],
    device const ulong *dst_lo       [[buffer(1)]],
    device const ushort *src_port    [[buffer(2)]],
    device const ushort *dst_port    [[buffer(3)]],
    device const uchar *proto        [[buffer(4)]],
    device const GpuRuleMeta *meta   [[buffer(5)]],
    device const uint *bloom         [[buffer(6)]],
    device const TrieNode *trie      [[buffer(7)]],
    device const MatrixRule *rules   [[buffer(8)]],
    device uchar *verdict            [[buffer(9)]],
    uint tid [[thread_position_in_grid]],
    uint lane [[thread_index_in_threadgroup]]
) {
    ulong dhi = dst_hi[tid];
    ulong dlo = dst_lo[tid];

    if (!gpu_bloom_maybe_contains(bloom, meta->deny_bloom_words, dhi, dlo)) {
        verdict[tid] = VERDICT_ALLOW;
        return;
    }

    if (gpu_lpm_match(trie, meta->trie_node_count, dhi, dlo)) {
        verdict[tid] = VERDICT_DROP;
        return;
    }

    bool denied = false;
    uchar p = proto[tid];
    ushort sp = src_port[tid];
    ushort dp = dst_port[tid];

    for (uint i = 0; i < meta->matrix_rule_count; ++i) {
        MatrixRule r = rules[i];
        denied = denied || (
            proto_match(r, p) &&
            port_match(r.src_port_range, sp) &&
            port_match(r.dst_port_range, dp) &&
            address_tag_match(r, dhi, dlo)
        );
    }

    verdict[tid] = denied ? VERDICT_DROP : VERDICT_ALLOW;
}

Before dispatch, split batches by packet family and protocol:

Batch 1: IPv4 TCP descriptors
Batch 2: IPv4 UDP descriptors
Batch 3: IPv6 TCP descriptors
Batch 4: IPv6 UDP descriptors
Batch 5: fragmented, unusual, or expensive descriptors

This reduces branch divergence. A single batch with IPv4, IPv6, TCP, UDP, ICMP, fragments, and payload scans will waste lanes.

For large rule matrices, have one threadgroup cooperate on hot rule blocks:

threadgroup MatrixRule hot_rules[256];

for (uint base = 0; base < meta->matrix_rule_count; base += 256) {
    if (lane < 256) {
        hot_rules[lane] = rules[base + lane];
    }

    threadgroup_barrier(mem_flags::mem_threadgroup);

    for (uint j = 0; j < 256; ++j) {
        denied = denied || match_rule(hot_rules[j], packet);
    }

    threadgroup_barrier(mem_flags::mem_threadgroup);
}

The point is not that this exact shader is production-ready. The point is the memory pattern: descriptors and rules are arranged so adjacent GPU threads read adjacent memory.

When to send traffic to the GPU

A firewall that sends every packet to the GPU will lose on latency. The GPU queue should be selective.

Use GPU for:

new flows with rare ports
flows matching weak deny Bloom positives
large deny or allow rule matrices
payload-window scanning of selected unencrypted traffic
QUIC and TLS metadata feature extraction
sampled telemetry and anomaly scoring
BGP policy validation after route churn

Keep CPU-only for:

established flows
exact 5-tuple rules
BGP peer sessions
basic ICMP policy
ordinary LPM route and firewall checks
bogon and RTBH prefix drops
simple port/protocol allow or drop rules

A predicate might look like this:

bool needs_gpu_secondary_classification(RuleSnapshot *rs, PacketDesc64 *d) {
    if (d->flags & PKT_FRAGMENTED) {
        return true;
    }

    if (d->flags & PKT_SUSPICIOUS_TCP_FLAGS) {
        return true;
    }

    if (rare_l4_service(rs, d->l4_proto, d->dst_port)) {
        return true;
    }

    if (bloom_maybe_contains(&rs->weak_suspicion_bloom, d->dst_hi, d->dst_lo)) {
        return true;
    }

    if (rs->gpu_policy_enabled_for_interface[d->ingress_if]) {
        return sampled(d->flow_hash, rs->gpu_sample_rate);
    }

    return false;
}

The GPU path is a cold path for packet verdicts and a hot path for analytics. That distinction keeps the firewall stable under load.

Private queue research branch

The private or lab branch should be thought of as a ring-buffer dataplane. The names here are intentionally abstract. The idea is packet ownership, not a promise that these are stable public Apple APIs.

while (running) {
    RxBatch rx = channel_rx_dequeue(wan_channel, MAX_BATCH);

    for (int i = 0; i < rx.count; ++i) {
        PacketSlot *slot = rx.slots[i];

        // Packet payload stays in packet-owned memory.
        PacketDesc64 d = parse_packet_slot_header(slot);

        RuleSnapshot *rs = atomic_load_acquire(&active_rules);
        Verdict v = classify_cpu_fast(rs, &d);

        if (v == DROP) {
            channel_rx_release(slot);
            continue;
        }

        if (v == NEED_GPU) {
            gpu_ring_enqueue(slot, d);
            continue;
        }

        TxSlot *tx = channel_tx_reserve(lan_channel);
        tx_attach_packet(tx, slot);
        channel_tx_submit(lan_channel, tx);
    }
}

The intended memory access pattern is:

NIC -> packet page:
    DMA write once

CPU:
    read Ethernet/IP/L4 header cache lines
    read compact rule/state data
    write TX descriptor or release descriptor

GPU selected path:
    descriptor only, normally 64 bytes
    optional payload window, such as the first 256 or 512 bytes
    verdict byte or word

This is the path where “use all hardware resources” makes the most sense. The CPU owns rings and immediate policy. The GPU is a co-processor that consumes selected metadata. The hard part is API access and OS support. Public Network Extension does not give this packet ownership. DriverKit can expose packet queues when you control the driver. Private Skywalk-style mechanisms may expose similar concepts internally, but that belongs in lab builds unless Apple gives you a supported contract.

BPF as telemetry, not routing

BPF is useful as a side channel:

flowchart LR
    NIC --> STACK[Normal macOS stack and firewall path]
    STACK --> ROUTE[Forwarding]

    NIC --> BPF[BPF tap with cBPF prefilter]
    BPF --> SAMPLE[Sample ring]
    SAMPLE --> GPU[Metal telemetry batch]
    GPU --> POLICY[Policy compiler feedback]

A BPF capture filter might copy only DNS, BGP, SYN packets, ICMP errors, and selected unusual flows:

bpf_filter =
    "tcp port 179 or udp port 53 or tcp[tcpflags] & tcp-syn != 0 or icmp";

That helps with telemetry and training data. It should not be confused with an inline router datapath. A capture drop is not the same thing as dropping the live packet from the forwarding path.

Memory control plan

The design has four memory regions.

Region A: packet-owned memory
    Owner: macOS kernel, DriverKit queue, or private channel
    Contents: full Ethernet frames
    Rule: never copy payload unless required

Region B: CPU rule snapshot
    Owner: firewall control process
    Contents: immutable tries, Bloom filters, exact tables, L4 tables
    Rule: atomic pointer swap, no mutation in packet path

Region C: Metal shared rings
    Owner: CPU and GPU
    Contents: compact descriptors and verdicts
    Rule: triple-buffer, cache-line aligned, no per-packet allocation

Region D: GPU rule buffers
    Owner: GPU after upload
    Contents: SoA rule arrays, prefix trie nodes, Bloom words
    Rule: rebuild on policy epoch change

Graphically:

flowchart TB
    subgraph UM[Unified memory]
        subgraph A[Region A: packet-owned memory]
            PKT[Packet pages / mbufs / driver slots<br/>full Ethernet frames]
        end

        subgraph B[Region B: CPU rule snapshot]
            STATE[State shards]
            TRIES[Prefix tries]
            BLOOM[Bloom filters]
            EXACT[Exact and L4 tables]
        end

        subgraph C[Region C: Metal shared rings]
            DESCA[Descriptor ring A]
            DESCB[Descriptor ring B]
            DESCC[Descriptor ring C]
            VERDICT[Verdict rings]
        end

        subgraph D[Region D: GPU rule buffers]
            SOA[SoA rules]
            GTRIE[GPU trie]
            GBLOOM[GPU Bloom]
            MATRIX[Matrix rules]
        end
    end

    NICRX[NIC RX DMA] -->|writes full frame once| PKT
    PKT -->|CPU reads headers| CPU[CPU fast path]
    CPU -->|rule/state reads| STATE
    CPU --> TRIES
    CPU --> BLOOM
    CPU --> EXACT
    CPU -->|selected descriptors| DESCA
    CPU --> DESCB
    CPU --> DESCC

    DESCA -->|GPU reads batches| GPU[Metal compute]
    DESCB --> GPU
    DESCC --> GPU
    SOA -->|GPU reads heavily| GPU
    GTRIE --> GPU
    GBLOOM --> GPU
    MATRIX --> GPU
    GPU -->|small results| VERDICT
    VERDICT -->|CPU applies verdicts| CPU
    CPU -->|forwarded packet descriptor| NICTX[NIC TX DMA]
    PKT -->|NIC reads frame for egress| NICTX

Practical performance target

For normal 1500-byte-class traffic, this design is plausible on an M4 or M4 Pro Mac mini if the common path is CPU-only, parallel, and allocation-free. At about 0.813 Mpps, the CPU has time for compact parsing, one state lookup, and a few cache-friendly rule checks.

For minimum-frame floods, the design needs to degrade gracefully. At 14.88 Mpps, user-space callbacks and GPU round trips are too expensive for deterministic per-packet decisions. The firewall should drop early with simple CPU rules, rate-limit logging, and avoid extra classification work.

The practical architecture is:

Fast path:
    CPU, immediate verdict, no allocation, no IPC, no GPU wait

Control path:
    BGP daemon -> policy compiler -> immutable CPU and GPU snapshots

GPU path:
    selected packets, compact descriptors, asynchronous batches

Aggressive path:
    DriverKit queue ownership where possible
    private Skywalk-like research only for lab builds

The Mac mini has enough bandwidth for the job. The engineering battle is keeping the packet path boring: fewer locks, fewer cache misses, fewer copies, and no synchronous detours.

References