Mac mini as 10 GbE firewall
A Mac mini with the 10 GbE option is an interesting firewall box. The hardware has more memory bandwidth than a 10 Gbit/s link can consume, it is quiet, and Apple Silicon gives the CPU and GPU access to the same physical memory. That makes it tempting to imagine a small desktop machine doing routing, firewalling, BGP policy, telemetry, and Metal-accelerated packet classification at line rate.
The hard part is not the 10 Gbit/s number by itself.
The hard part is packet rate.
A 10 Gbit/s stream is at most 1.25 GB/s on the wire in one direction. Even if a forwarding path reads RX payload and writes TX payload, the raw byte rate is still small compared with the memory bandwidth of the M4 Mac mini. Apple lists the M4 Mac mini at 120 GB/s memory bandwidth, and the M4 Pro version at 273 GB/s. The problem is that minimum-size Ethernet frames at 10 GbE arrive at about 14.88 million packets per second, because each 64-byte frame occupies 84 bytes on the wire once the preamble and inter-frame gap are counted. That leaves about 67 ns per packet for the whole system to receive, classify, route, forward, and account for it.
For 1500-byte-class traffic, the packet rate is much friendlier:
10 GbE with minimum Ethernet frames: about 14.88 Mpps
10 GbE with 1500-byte-class frames: about 0.813 Mpps
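The two packet rates above follow from simple wire arithmetic. The sketch below reproduces them, assuming standard Ethernet framing (8 bytes of preamble and start-of-frame delimiter plus a 12-byte inter-frame gap) and treating a "1500-byte-class" frame as 1518 bytes including header and FCS. The function names are illustrative, not from any library.

```python
# Ethernet adds 20 bytes per frame on the wire beyond the frame itself:
# 7-byte preamble + 1-byte start-of-frame delimiter + 12-byte inter-frame gap.
WIRE_OVERHEAD_BYTES = 20


def packets_per_second(link_bps: float, frame_bytes: int) -> float:
    """Maximum frame rate for back-to-back frames of a single size."""
    return link_bps / ((frame_bytes + WIRE_OVERHEAD_BYTES) * 8)


def ns_per_packet(link_bps: float, frame_bytes: int) -> float:
    """Time budget per frame at line rate, in nanoseconds."""
    return 1e9 / packets_per_second(link_bps, frame_bytes)


if __name__ == "__main__":
    for size in (64, 1518):
        pps = packets_per_second(10e9, size)
        budget = ns_per_packet(10e9, size)
        print(f"{size:>5} B frames: {pps / 1e6:.3f} Mpps, {budget:.1f} ns each")
```

The same helper gives roughly 1.2 microseconds per packet for full-size frames, which is why the common case is comfortable and the minimum-frame flood is not.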
So the design target is not “use the GPU for every packet.” That would add scheduling latency and synchronization overhead to the hottest part of the path.
The useful design is:
CPU owns the immediate packet decision. Metal accelerates selected, batched, metadata-heavy work.
That is the thread through the whole design.
macOS packet paths
There are four practical layers on macOS.
| Layer | API or tool | Packet visibility | Can drop? | Can modify or forward? | Role in this design |
|---|---|---|---|---|---|
| Supported firewall | NetworkExtension, especially NEFilterPacketProvider and NEFilterDataProvider | Packet or flow visibility depending on provider | Yes | Verdict API is allow, drop, or delay. It is not a packet rewrite API | Main supported product path |
| Local appliance config | pf, routing table, route, sysctl, BGP daemon | Kernel forwarding path | Yes | NAT and routing through system config | Useful for your own box, but not a stable app API |
| Capture and telemetry | /dev/bpf*, cBPF filters | Packet tap | Drops from capture stream only | Can inject raw frames, but it is not a good router path | Sampling, pcap, analytics input |
| Aggressive research | DriverKit networking, private NECP or Skywalk-style mechanisms, old Darwin sockets | Potentially earlier ownership | Depends on path | Potentially yes if you own queues | Lab path for near-zero-copy queue ownership |
NEFilterPacketProvider is the clean supported firewall path. The provider evaluates packets and returns verdicts. Apple’s documentation also makes clear that packet handlers may be executed by multiple simultaneous threads, so a serious implementation must be built as a parallel, lock-minimized datapath.
pf is still very useful for a private appliance. You can combine system routing, NAT, anchors, tables, and a BGP daemon to build a real firewall. But Apple says Packet Filter is not API for software products, so I would not design a distributed commercial product around direct pf control.
BPF is excellent for visibility. It can prefilter capture traffic and feed samples into analysis code. It is not where I would put the inline forwarding decision.
The aggressive path is queue ownership: DriverKit if you control the NIC driver path, or private Skywalk-like mechanisms in a research build. That is where zero-copy forwarding becomes plausible, but it is also where the Apple API contract becomes the weakest.
Top-level architecture
The architecture I would build has three planes:
- A CPU fast path for immediate packet decisions.
- A BGP policy compiler that publishes immutable rule snapshots.
- A Metal batch engine for expensive secondary classification.
flowchart LR
WAN[10 GbE WAN] --> RX[macOS packet ingress]
RX --> CPU[CPU fast-path classifier]
CPU -->|allow or drop| ROUTE[macOS routing and forwarding]
ROUTE --> LAN[10 GbE LAN]
CPU -->|selected descriptors| RING[Shared Metal descriptor ring]
RING --> GPU[Metal batch classifier]
GPU --> VERDICTS[Shared verdict ring]
VERDICTS --> CPU
BGP[BGP daemon] --> COMPILER[Policy compiler]
COMPILER --> SNAP[Immutable CPU rule snapshot]
COMPILER --> GRULES[GPU rule buffers]
SNAP --> CPU
GRULES --> GPU
The CPU path handles anything that must be decided immediately:
- L2, L3, and L4 parsing
- state table lookup
- interface policy
- anti-spoof checks
- exact 5-tuple rules
- BGP-derived blackhole prefixes
- normal port and protocol policy
- route hints and connection tracking
The GPU path handles work that becomes efficient only when batched:
- large rule-matrix evaluation
- secondary classification of suspicious new flows
- anomaly features
- Bloom-filter rebuild validation
- payload-window scanning for selected flows
- telemetry scoring
- QUIC or TLS metadata feature extraction
The BGP daemon should not sit inside the packet path. It should feed a compiler that turns route policy into compact tables.
Supported Network Extension memory path
In the supported path, the system owns packet storage. The firewall extension should read only the header bytes it needs, produce a compact descriptor when needed, and return a verdict quickly for common traffic.
flowchart TD
A[NIC RX DMA writes packet into kernel-owned memory] --> B[macOS network stack]
B --> C[NEFilterPacketProvider]
C --> D[CPU reads first 64 to 128 bytes]
D --> E[CPU reads rule snapshot]
E --> F{Verdict}
F -->|allow| G[Kernel route or forward]
F -->|drop| H[Release packet]
G --> I[NIC TX DMA reads packet for egress]
D -->|selected only| J[Write compact descriptor to shared Metal buffer]
J --> K[Metal batch classifier]
K --> L[Write result to shared verdict buffer]
L --> F
The important discipline is to avoid treating the extension as a general per-packet application callback. The common path should not allocate, log synchronously, do DNS lookups, make IPC calls, or wait on the GPU.
The rough memory budget is favorable:
| Component | Worst-case estimate at 10 GbE |
|---|---|
| RX payload DMA | 1.25 GB/s |
| TX payload DMA | 1.25 GB/s |
| Read first 64 bytes at 64-byte frame line rate | about 0.95 GB/s |
| RX plus TX plus 64-byte header read | about 3.45 GB/s |
| 64-byte GPU descriptor for every minimum-size packet | about 0.95 GB/s |
| 64-byte GPU descriptor for 1 percent sampled minimum-size packets | about 9.5 MB/s |
| 64-byte GPU descriptor for 10 percent sampled minimum-size packets | about 95 MB/s |
Those numbers are small next to Apple Silicon memory bandwidth. The issue is synchronous software overhead per packet.
That is why the first rule is simple:
Do not wait on Metal in the hot path.
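The rows of the memory budget table can be recomputed in a few lines. This is a sketch under the same assumptions as the table: a 10 Gbit/s link, minimum-size 64-byte frames with 20 bytes of wire overhead, and one 64-byte header read or descriptor write per packet. The dictionary keys are made up for this example.

```python
LINK_BPS = 10e9
# 64-byte frames plus 20 bytes of preamble and inter-frame gap on the wire.
MIN_FRAME_PPS = LINK_BPS / ((64 + 20) * 8)


def gb_per_s(bytes_per_s: float) -> float:
    return bytes_per_s / 1e9


payload_dma = LINK_BPS / 8            # bytes per second, one direction
header_read = MIN_FRAME_PPS * 64      # first 64 bytes of every minimum-size frame

budget = {
    "rx_dma": gb_per_s(payload_dma),
    "tx_dma": gb_per_s(payload_dma),
    "header_read": gb_per_s(header_read),
    "rx_tx_header": gb_per_s(2 * payload_dma + header_read),
    "gpu_desc_all": gb_per_s(MIN_FRAME_PPS * 64),
    "gpu_desc_1pct": gb_per_s(MIN_FRAME_PPS * 64 * 0.01),
    "gpu_desc_10pct": gb_per_s(MIN_FRAME_PPS * 64 * 0.10),
}

if __name__ == "__main__":
    for name, value in budget.items():
        print(f"{name:>15}: {value:.4f} GB/s")
```

Even the combined RX, TX, and header-read figure is under 3 percent of the 120 GB/s quoted for the base M4, which is the quantitative reason the constraint is per-packet software overhead and not bandwidth.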
More aggressive queue ownership
The highest-performance design owns RX and TX queues. In that model, packet memory stays packet memory. The classifier reads metadata and header cache lines. Forwarding attaches the same packet buffer to a TX descriptor, or recycles it through a forwarding buffer pool.
flowchart LR
NIC[NIC RX queue] --> RXDESC[RX descriptor ring]
RXDESC --> PKT[Packet buffer pages]
PKT --> CPU[CPU header classifier]
CPU -->|drop| FREE[Return buffer to free ring]
CPU -->|forward| TXDESC[TX descriptor ring]
TXDESC --> NIC2[NIC TX queue]
CPU -->|selected descriptors| MBUF[Metal shared descriptor ring]
MBUF --> GPU[Metal compute]
GPU --> VBUF[Metal shared verdict ring]
VBUF --> CPU
The ideal memory behavior is:
RX packet buffer:
  NIC writes the packet once.
  CPU reads only the needed header cache lines.
  CPU does not copy payload.
Forwarded packet:
  TX descriptor references the packet buffer or a recycled forwarding buffer.
  NIC reads the packet once.
GPU:
  CPU writes compact descriptors into shared Metal buffers.
  GPU reads descriptors and rule buffers.
  GPU writes small verdict records.
This is the shape used by high-performance routers and packet engines. The Mac-specific problem is API access. NEFilterPacketProvider gives verdict hooks, not NIC ring ownership. DriverKit can expose packet queues when you own the relevant driver path, especially for external or custom NICs. Private Skywalk-like channels are research-only unless Apple gives you a supported contract.
Packet descriptor layout
Do not send full packets to the GPU unless the classifier really needs payload bytes. Most policy and metadata classification can run on a compact descriptor.
For the CPU, a 64-byte descriptor is convenient:
struct PacketDesc64 {
    uint64_t src_hi;        // IPv6 high, or zero for IPv4
    uint64_t src_lo;        // IPv4 stored in low bits
    uint64_t dst_hi;
    uint64_t dst_lo;
    uint16_t src_port;
    uint16_t dst_port;
    uint16_t ingress_if;
    uint16_t egress_hint;
    uint8_t  ip_version;    // 4 or 6
    uint8_t  l4_proto;      // TCP, UDP, ICMP, other
    uint8_t  tcp_flags;
    uint8_t  direction;
    uint32_t flow_hash;
    uint32_t packet_len;
    uint32_t policy_epoch;
    uint32_t flags;         // fragmented, VLAN, checksum state, suspicious, etc.
};
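One detail of this layout is worth checking: the named fields add up to 60 bytes, and the struct reaches 64 only through tail padding, which common 64-bit ABIs add because of the 8-byte alignment of the `uint64_t` members. The sketch below mirrors the layout with Python's `struct` module and makes that padding explicit. `PACKET_DESC64` and `pack_desc` are names invented for this example, and little-endian byte order is an assumption.

```python
import struct

# Little-endian, no implicit padding: 4 x u64, 4 x u16, 4 x u8, 4 x u32,
# then 4 explicit pad bytes standing in for the C struct's tail padding.
PACKET_DESC64 = struct.Struct("<4Q4H4B4I4x")

MASK64 = (1 << 64) - 1


def pack_desc(src, dst, src_port, dst_port, ingress_if, egress_hint,
              ip_version, l4_proto, tcp_flags, direction,
              flow_hash, packet_len, policy_epoch, flags) -> bytes:
    """Pack one descriptor. src and dst are 128-bit integers; an IPv4
    address sits in the low 32 bits with the high half zero."""
    return PACKET_DESC64.pack(
        src >> 64, src & MASK64, dst >> 64, dst & MASK64,
        src_port, dst_port, ingress_if, egress_hint,
        ip_version, l4_proto, tcp_flags, direction,
        flow_hash, packet_len, policy_epoch, flags,
    )


if __name__ == "__main__":
    raw = pack_desc(0xC0000201, 0xC6336401, 51000, 443, 1, 2,
                    4, 6, 0x02, 0, 0xDEADBEEF, 64, 7, 0)
    print(len(raw), raw.hex())
```

In the C version, a `static_assert(sizeof(struct PacketDesc64) == 64)` or an explicit reserved field would pin the size down instead of relying on the compiler's padding.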
For Metal, structure-of-arrays usually maps better to coalesced memory reads:
struct GpuBatch {
    device const ulong  *src_hi    [[id(0)]];
    device const ulong  *src_lo    [[id(1)]];
    device const ulong  *dst_hi    [[id(2)]];
    device const ulong  *dst_lo    [[id(3)]];
    device const ushort *src_port  [[id(4)]];
    device const ushort *dst_port  [[id(5)]];
    device const uchar  *proto     [[id(6)]];
    device const uint   *flow_hash [[id(7)]];
    device uchar        *verdict   [[id(8)]];
};
The CPU parses once. Every later subsystem consumes PacketDesc64 or a transposed GPU view of it.
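The "transposed GPU view" is an array-of-structures to structure-of-arrays conversion. A minimal sketch of that transposition follows, using one typed `array` per `GpuBatch` member. The dict-based descriptor and the `SOA_COLUMNS` table are assumptions made for illustration; a real implementation would write directly into the shared Metal buffers.

```python
from array import array

# (GpuBatch array name, array typecode, PacketDesc64 field it is read from)
SOA_COLUMNS = (
    ("src_hi", "Q", "src_hi"), ("src_lo", "Q", "src_lo"),
    ("dst_hi", "Q", "dst_hi"), ("dst_lo", "Q", "dst_lo"),
    ("src_port", "H", "src_port"), ("dst_port", "H", "dst_port"),
    ("proto", "B", "l4_proto"), ("flow_hash", "I", "flow_hash"),
)


def transpose_to_soa(descs):
    """Turn a list of descriptor dicts into one contiguous array per field."""
    soa = {
        name: array(code, (d[field] for d in descs))
        for name, code, field in SOA_COLUMNS
    }
    # One verdict byte per descriptor, zero-filled; the GPU would write these.
    soa["verdict"] = array("B", bytes(len(descs)))
    return soa


if __name__ == "__main__":
    batch = transpose_to_soa([
        dict(src_hi=0, src_lo=0x0A000001, dst_hi=0, dst_lo=0x0A000002,
             src_port=51000, dst_port=443, l4_proto=6, flow_hash=0xDEADBEEF),
        dict(src_hi=0, src_lo=0x0A000003, dst_hi=0, dst_lo=0x0A000004,
             src_port=40000, dst_port=53, l4_proto=17, flow_hash=1),
    ])
    print({name: list(col) for name, col in batch.items()})
```

With this layout, thread `i` and thread `i + 1` read neighbouring elements of the same array, which is the coalescing property the SoA form is chosen for.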
CPU fast path
The CPU classifier should be organized around a simple lookup order:
parse
-> state table
-> interface policy
-> Bloom filter
-> prefix trie
-> exact 5-tuple rules
-> L4 policy
-> GPU cold path
The order matters. Most packets in established flows should stop at the state table. Most non-denied new traffic should get a Bloom negative and skip expensive prefix checks.
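A Bloom filter answers exact-membership questions, so using it in front of prefix tables needs one extra step: the address has to be masked to each prefix length present in the set before probing. The following sketch shows that idea for IPv4 only. The class name, the sizing, and the use of salted BLAKE2b as the hash family are all assumptions chosen for readability, not a recommendation for a real packet path.

```python
import hashlib
import ipaddress


class PrefixBloom:
    """Bloom filter keyed on (prefix length, masked IPv4 address)."""

    def __init__(self, bits: int = 1 << 16, hashes: int = 4):
        self.bits = bits
        self.hashes = hashes
        self.words = bytearray(bits // 8)
        self.prefix_lengths = set()   # lengths that must be probed on lookup

    def _positions(self, key: bytes):
        for i in range(self.hashes):
            digest = hashlib.blake2b(key, digest_size=8, salt=bytes([i])).digest()
            yield int.from_bytes(digest, "little") % self.bits

    @staticmethod
    def _key(addr: int, plen: int) -> bytes:
        # Keep only the top plen bits so every address in a prefix maps to one key.
        return bytes([plen]) + (addr >> (32 - plen)).to_bytes(4, "little")

    def add(self, cidr: str) -> None:
        net = ipaddress.ip_network(cidr)
        self.prefix_lengths.add(net.prefixlen)
        for pos in self._positions(self._key(int(net.network_address), net.prefixlen)):
            self.words[pos // 8] |= 1 << (pos % 8)

    def maybe_contains(self, addr: str) -> bool:
        a = int(ipaddress.ip_address(addr))
        return any(
            all(self.words[p // 8] & (1 << (p % 8))
                for p in self._positions(self._key(a, plen)))
            for plen in self.prefix_lengths
        )


if __name__ == "__main__":
    bloom = PrefixBloom()
    bloom.add("203.0.113.0/24")
    bloom.add("198.51.100.0/22")
    print(bloom.maybe_contains("203.0.113.77"), bloom.maybe_contains("192.0.2.1"))
```

A negative answer is definitive and lets the packet skip the trie walk; a positive answer only means the trie has to be consulted. The cost grows with the number of distinct prefix lengths, which is one reason to keep the deny sets at a few fixed lengths where possible.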
The rule snapshot should be immutable. Control-plane changes build a new snapshot, then publish it with an atomic pointer swap. Packet threads never mutate global policy.
Verdict classify_packet(PacketBytes pkt, Iface ingress, Direction dir) {
    PacketDesc64 d;
    if (!parse_l2_l3_l4(pkt, &d)) {
        return DROP_MALFORMED;
    }
    d.ingress_if = ingress.id;
    d.direction = dir;
    d.flow_hash = siphash_5tuple(&d);

    // One acquire load per packet; every policy read below uses this snapshot.
    RuleSnapshot *rs = atomic_load_acquire(&active_rules);

    // Established flows should stop here.
    StateShard *shard = &rs->state.shards[d.flow_hash & rs->state.mask];
    FlowState *st = shard_lookup_readmostly(shard, d.flow_hash, &d);
    if (st && st->verdict == ALLOW && st->epoch == rs->epoch) {
        return ALLOW;
    }
    if (interface_policy_denies(rs, ingress, dir, &d)) {
        return DROP_POLICY;
    }

    // Probe the Bloom filter once per address. RTBH and BGP deny match on the
    // destination; bogon filtering matches on the source, so it needs its own
    // probe and must not be gated on the destination result.
    bool dst_hit = bloom_maybe_contains(&rs->deny_bloom, d.dst_hi, d.dst_lo);
    bool src_hit = bloom_maybe_contains(&rs->deny_bloom, d.src_hi, d.src_lo);
    if (dst_hit && prefix_trie_contains(&rs->rtbh, d.dst_hi, d.dst_lo)) {
        return DROP_RTBH;
    }
    if (src_hit && prefix_trie_contains(&rs->bogon, d.src_hi, d.src_lo)) {
        return DROP_BOGON;
    }
    if (dst_hit && prefix_trie_contains(&rs->bgp_deny, d.dst_hi, d.dst_lo)) {
        return DROP_BGP_POLICY;
    }

    ExactRule *er = exact5_lookup(&rs->exact5, d.flow_hash, &d);
    if (er) {
        return er->verdict;
    }
    Verdict v = port_proto_rule_eval(&rs->l4_rules, &d);
    if (v == DROP) {
        return DROP_POLICY;
    }

    // Cold path: hand the descriptor to the GPU ring and return without waiting.
    if (needs_gpu_secondary_classification(rs, &d)) {
        enqueue_gpu_descriptor(&gpu_ring, &d, pkt.handle);
        return DELAY_OR_TEMPORARY_POLICY;
    }

    // Flow state is per-shard mutable data, not policy; this insert is the
    // only write the packet path performs.
    shard_insert_new_flow(shard, &d, ALLOW, rs->epoch);
    return ALLOW;
}
In a strict inline firewall, DELAY_OR_TEMPORARY_POLICY is a hard design choice. Delaying too much traffic creates latency and queue pressure. A practical policy is:
- delay only small volumes of truly suspicious new flows
- allow with short probation for low-risk flows
- drop immediately for cheap, known-bad matches
- sample ordinary traffic for analytics without affecting forwarding
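The first three bullets can be read as a small decision function for new flows. The sketch below is one hypothetical way to encode them: the risk score, the thresholds, the `delay_cap` budget, and the choice to fail closed when the budget is exhausted are all assumptions introduced for illustration. Sampling for analytics is left out because it does not affect the verdict.

```python
from enum import Enum


class Action(Enum):
    DROP = "drop"
    DELAY = "delay"
    ALLOW_PROBATION = "allow_probation"


def new_flow_action(known_bad: bool, risk: float, delayed_now: int,
                    delay_cap: int = 256, delay_risk: float = 0.8,
                    fail_closed: bool = True) -> Action:
    """Pick a verdict for the first packet of a new flow.

    known_bad   -- a cheap CPU rule already matched
    risk        -- hypothetical score in [0, 1] from the CPU classifier
    delayed_now -- packets currently held while waiting for the GPU
    """
    if known_bad:
        return Action.DROP
    if risk >= delay_risk:
        if delayed_now < delay_cap:
            return Action.DELAY
        # Budget exhausted: do not queue more; choose a side explicitly.
        return Action.DROP if fail_closed else Action.ALLOW_PROBATION
    return Action.ALLOW_PROBATION


if __name__ == "__main__":
    print(new_flow_action(False, 0.9, delayed_now=10))
    print(new_flow_action(False, 0.9, delayed_now=256))
    print(new_flow_action(False, 0.1, delayed_now=0))
```

The important property is the cap: the number of delayed packets is bounded, so a burst of suspicious flows cannot turn into unbounded queue growth.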
BGP policy compiler
BGP belongs in the control plane. It should feed the firewall, not run inside the packet classifier.
flowchart TD
A[BGP neighbors] --> B[FRR, BIRD, or OpenBGPD]
B --> C[Import policy: prefix lists, route maps, AS paths, communities]
C --> D[Accepted RIB]
C --> E[Rejected, blackhole, and bogon sets]
D --> F[Kernel route table]
E --> G[Firewall prefix compiler]
G --> H[CPU LPM tries and Bloom filters]
G --> I[GPU rule buffers]
H --> J[Atomic RuleSnapshot swap]
I --> K[Async Metal buffer update]
A useful snapshot looks like this:
struct RuleSnapshot {
    uint32_t epoch;
    FlowStateTable state;
    Exact5Table exact5;
    PrefixTrieV4 allow_v4;
    PrefixTrieV4 deny_v4;
    PrefixTrieV4 rtbh_v4;
    PrefixTrieV4 bogon_v4;
    PrefixTrieV6 allow_v6;
    PrefixTrieV6 deny_v6;
    PrefixTrieV6 rtbh_v6;
    PrefixTrieV6 bogon_v6;
    BloomFilter deny_bloom;
    L4RuleTable l4_rules;
    IfacePolicy if_policy;
    GpuRuleTableHandle gpu_rules;
};
For IPv4, a DIR-24-8 table is a good first version. One direct lookup covers /0 through /24, and a second table handles longer prefixes. A dense 2^24 table with 4-byte entries is 64 MiB. That is large for cache, but small relative to unified memory. If cache locality becomes the bottleneck, use a compressed multibit trie, LC-trie, or Poptrie-style layout.
For IPv6, do not use a naive pointer trie. Pointer chasing is poison in a packet path. Use a compressed multibit trie or bitmap/vector layout.
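For the IPv4 case, the DIR-24-8 idea fits in a short sketch: index a first table by the top 24 bits of the address, and follow an indirection into a 256-entry block only when a prefix longer than /24 covers that slot. To keep the example light, a dict stands in for the dense 2^24-entry array, and the class and method names are invented here. Values are assumed to be nonzero identifiers below 2^31, with 0 meaning no match.

```python
import ipaddress

LONG_FLAG = 1 << 31  # high bit set: entry is an index into the long table


class Dir248:
    def __init__(self):
        self.tbl24 = {}      # sparse stand-in for the dense 2^24-entry array
        self.tbl_long = []   # 256-entry blocks for prefixes longer than /24

    def build(self, routes):
        """routes: iterable of (cidr, value). Shorter prefixes are written
        first so that longer, more specific ones overwrite them."""
        nets = sorted(((ipaddress.ip_network(c), v) for c, v in routes),
                      key=lambda r: r[0].prefixlen)
        for net, value in nets:
            base = int(net.network_address)
            if net.prefixlen <= 24:
                first = base >> 8
                for idx in range(first, first + (1 << (24 - net.prefixlen))):
                    self.tbl24[idx] = value
                continue
            idx = base >> 8
            entry = self.tbl24.get(idx, 0)
            if not entry & LONG_FLAG:
                # Seed the new block with whatever shorter prefix covered the slot.
                self.tbl_long.append([entry] * 256)
                entry = LONG_FLAG | (len(self.tbl_long) - 1)
                self.tbl24[idx] = entry
            block = self.tbl_long[entry & (LONG_FLAG - 1)]
            low = base & 0xFF
            for off in range(low, low + (1 << (32 - net.prefixlen))):
                block[off] = value

    def lookup(self, addr: str) -> int:
        a = int(ipaddress.ip_address(addr))
        entry = self.tbl24.get(a >> 8, 0)
        if entry & LONG_FLAG:
            return self.tbl_long[entry & (LONG_FLAG - 1)][a & 0xFF]
        return entry


if __name__ == "__main__":
    t = Dir248()
    t.build([("10.0.0.0/8", 1), ("10.1.2.0/24", 2), ("10.1.2.128/25", 3)])
    print(t.lookup("10.9.9.9"), t.lookup("10.1.2.5"), t.lookup("10.1.2.200"))
```

Every lookup costs at most two memory reads with no pointer chasing, which is the property that makes the layout attractive despite its size.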
Publishing a new policy should look like this:
def publish_new_policy(bgp_rib, static_rules, old_snapshot):
    prefixes = extract_policy_prefixes(bgp_rib, static_rules)
    snap = RuleSnapshot()
    snap.epoch = old_snapshot.epoch + 1
    snap.deny_v4 = compile_lpm_v4(prefixes.deny_v4)
    snap.rtbh_v4 = compile_lpm_v4(prefixes.rtbh_v4)
    snap.bogon_v4 = compile_lpm_v4(prefixes.bogon_v4)
    snap.deny_v6 = compile_lpm_v6(prefixes.deny_v6)
    snap.rtbh_v6 = compile_lpm_v6(prefixes.rtbh_v6)
    snap.bogon_v6 = compile_lpm_v6(prefixes.bogon_v6)
    snap.deny_bloom = build_bloom(
        prefixes.deny_v4,
        prefixes.rtbh_v4,
        prefixes.bogon_v4,
        prefixes.deny_v6,
        prefixes.rtbh_v6,
        prefixes.bogon_v6,
    )
    snap.exact5 = compile_exact_rules(static_rules.exact5)
    snap.l4_rules = compile_l4_rules(static_rules.l4)
    snap.if_policy = compile_interface_rules(static_rules.interfaces)
    gpu_rules = build_gpu_rule_table(snap)
    upload_gpu_rules_async(gpu_rules)
    atomic_store_release(active_rules, snap)
    retire_after_grace_period(old_snapshot)
The important property is that packet threads see one complete snapshot at a time.
Metal memory model
Apple Silicon has unified physical memory. Metal exposes this through device properties and storage modes. MTLStorageMode.shared is system memory that both CPU and GPU can access, and it is the default storage mode for buffers on integrated GPUs and for buffers and textures on Apple Silicon GPUs.
That does not make synchronization free. It removes explicit PCIe-style transfer cost, but it does not remove the need for batching and ownership boundaries.
Use three classes of Metal buffers:
| Buffer | Storage mode | Access pattern | Reason |
|---|---|---|---|
| Descriptor ring | MTLStorageModeShared | CPU writes, GPU reads | Avoid a separate copy |
| Verdict ring | MTLStorageModeShared | GPU writes, CPU reads | Small result transfer |
| Large rule tables | MTLStorageModePrivate after upload, or Shared for frequent updates | GPU reads heavily | Private can be better for GPU-side access; shared is simpler during churn |
The scheduler should be triple-buffered:
sequenceDiagram
participant CPU
participant GPU
CPU->>CPU: Fill descriptor buffer A
CPU->>GPU: Commit batch A
CPU->>CPU: Fill descriptor buffer B
GPU->>GPU: Classify A
CPU->>GPU: Commit batch B
GPU->>CPU: Complete A verdicts
CPU->>CPU: Apply A verdicts
CPU->>CPU: Fill descriptor buffer C
GPU->>GPU: Classify B
The CPU should submit GPU work through command buffers and then continue packet processing. Completion handlers, shared events, or polling can move a ring slot back to the CPU. The packet hot path should never block on the current command buffer.
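The ownership rules behind the triple-buffered schedule can be written as a small state machine. The sketch below is single-threaded and uses invented names (`Slot`, `TripleBuffer`); a real implementation would use atomics for the slot states and would drive `gpu_completed` from a Metal completion handler or a polling loop. The point it illustrates is that `begin_fill` returns `None` when no slot is free, so the caller skips GPU work instead of waiting.

```python
from enum import Enum, auto


class Slot(Enum):
    FREE = auto()        # CPU may start filling
    FILLING = auto()     # CPU is writing descriptors
    IN_FLIGHT = auto()   # committed; the GPU owns the buffer
    DONE = auto()        # GPU finished; CPU may apply verdicts


class TripleBuffer:
    def __init__(self, slots: int = 3):
        self.state = [Slot.FREE] * slots

    def begin_fill(self):
        """Claim a free slot, or return None so the hot path never blocks."""
        for i, s in enumerate(self.state):
            if s is Slot.FREE:
                self.state[i] = Slot.FILLING
                return i
        return None

    def commit(self, i: int) -> None:
        assert self.state[i] is Slot.FILLING
        self.state[i] = Slot.IN_FLIGHT

    def gpu_completed(self, i: int) -> None:
        assert self.state[i] is Slot.IN_FLIGHT
        self.state[i] = Slot.DONE

    def apply_and_release(self, i: int) -> None:
        assert self.state[i] is Slot.DONE
        self.state[i] = Slot.FREE


if __name__ == "__main__":
    ring = TripleBuffer()
    a = ring.begin_fill(); ring.commit(a)
    b = ring.begin_fill(); ring.commit(b)
    c = ring.begin_fill()
    print(ring.begin_fill())          # all three slots busy
    ring.gpu_completed(a); ring.apply_and_release(a)
    print(ring.begin_fill())          # slot a is reusable
```

Each slot has exactly one owner at a time, which is what makes shared storage safe without a per-descriptor lock.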
Metal classifier shape
The GPU classifier should evaluate many descriptors against large, mostly static tables. The data layout matters more than the syntax of the shader.
struct GpuRuleMeta {
    uint deny_bloom_words;
    uint matrix_rule_count;
    uint trie_node_count;
    uint epoch;
    uint batch_count;   // number of valid descriptors in this batch
};
kernel void classify_batch(
    device const ulong *dst_hi [[buffer(0)]],
    device const ulong *dst_lo [[buffer(1)]],
    device const ushort *src_port [[buffer(2)]],
    device const ushort *dst_port [[buffer(3)]],
    device const uchar *proto [[buffer(4)]],
    device const GpuRuleMeta *meta [[buffer(5)]],
    device const uint *bloom [[buffer(6)]],
    device const TrieNode *trie [[buffer(7)]],
    device const MatrixRule *rules [[buffer(8)]],
    device uchar *verdict [[buffer(9)]],
    uint tid [[thread_position_in_grid]],
    uint lane [[thread_index_in_threadgroup]]
) {
    // Guard the tail: the grid may be rounded up to whole threadgroups,
    // in which case the last threads have no descriptor to read.
    if (tid >= meta->batch_count) {
        return;
    }
    ulong dhi = dst_hi[tid];
    ulong dlo = dst_lo[tid];
    // Cheap negative first: most descriptors should exit here.
    if (!gpu_bloom_maybe_contains(bloom, meta->deny_bloom_words, dhi, dlo)) {
        verdict[tid] = VERDICT_ALLOW;
        return;
    }
    if (gpu_lpm_match(trie, meta->trie_node_count, dhi, dlo)) {
        verdict[tid] = VERDICT_DROP;
        return;
    }
    // Remaining descriptors are evaluated against the full rule matrix.
    bool denied = false;
    uchar p = proto[tid];
    ushort sp = src_port[tid];
    ushort dp = dst_port[tid];
    for (uint i = 0; i < meta->matrix_rule_count; ++i) {
        MatrixRule r = rules[i];
        denied = denied || (
            proto_match(r, p) &&
            port_match(r.src_port_range, sp) &&
            port_match(r.dst_port_range, dp) &&
            address_tag_match(r, dhi, dlo)
        );
    }
    verdict[tid] = denied ? VERDICT_DROP : VERDICT_ALLOW;
}
Before dispatch, split batches by IP version and transport protocol:
Batch 1: IPv4 TCP descriptors
Batch 2: IPv4 UDP descriptors
Batch 3: IPv6 TCP descriptors
Batch 4: IPv6 UDP descriptors
Batch 5: fragmented, unusual, or expensive descriptors
This reduces branch divergence. A single batch with IPv4, IPv6, TCP, UDP, ICMP, fragments, and payload scans will waste lanes.
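The split itself is a cheap CPU-side bucketing step. A minimal sketch follows, assuming dict-shaped descriptors with `ip_version`, `l4_proto`, and a `fragmented` flag; the `"other"` bucket corresponds to the fifth batch above. The field names are illustrative.

```python
from collections import defaultdict

TCP, UDP = 6, 17  # IANA protocol numbers


def split_batches(descs):
    """Group descriptors so each GPU dispatch sees one IP version and one
    transport protocol; everything unusual goes to a separate bucket."""
    batches = defaultdict(list)
    for d in descs:
        if d["fragmented"] or d["l4_proto"] not in (TCP, UDP):
            key = "other"
        else:
            key = (d["ip_version"], d["l4_proto"])
        batches[key].append(d)
    return dict(batches)


if __name__ == "__main__":
    sample = [
        {"ip_version": 4, "l4_proto": TCP, "fragmented": False},
        {"ip_version": 6, "l4_proto": UDP, "fragmented": False},
        {"ip_version": 4, "l4_proto": 1, "fragmented": False},    # ICMP
        {"ip_version": 4, "l4_proto": TCP, "fragmented": True},
    ]
    for key, group in split_batches(sample).items():
        print(key, len(group))
```

Within a homogeneous batch, neighbouring GPU threads take the same branches, so the per-family kernels can also drop the checks that do not apply to them.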
For large rule matrices, have one threadgroup cooperate on hot rule blocks:
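// Assumes a threadgroup of at least 256 threads, and that every thread in the
// group reaches both barriers (no early return before this loop).
threadgroup MatrixRule hot_rules[256];
for (uint base = 0; base < meta->matrix_rule_count; base += 256) {
    // Rules remaining in this block; the last block may be partial.
    uint block = min(256u, meta->matrix_rule_count - base);
    if (lane < block) {
        hot_rules[lane] = rules[base + lane];
    }
    threadgroup_barrier(mem_flags::mem_threadgroup);
    for (uint j = 0; j < block; ++j) {
        denied = denied || match_rule(hot_rules[j], packet);
    }
    threadgroup_barrier(mem_flags::mem_threadgroup);
}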
The point is not that this exact shader is production-ready. The point is the memory pattern: descriptors and rules are arranged so adjacent GPU threads read adjacent memory.
When to send traffic to the GPU
A firewall that sends every packet to the GPU will lose on latency. The GPU queue should be selective.
Use GPU for:
new flows with rare ports
flows matching weak deny Bloom positives
large deny or allow rule matrices
payload-window scanning of selected unencrypted traffic
QUIC and TLS metadata feature extraction
sampled telemetry and anomaly scoring
BGP policy validation after route churn
Keep CPU-only for:
established flows
exact 5-tuple rules
BGP peer sessions
basic ICMP policy
ordinary LPM route and firewall checks
bogon and RTBH prefix drops
simple port/protocol allow or drop rules
A predicate might look like this:
bool needs_gpu_secondary_classification(RuleSnapshot *rs, PacketDesc64 *d) {
    if (d->flags & PKT_FRAGMENTED) {
        return true;
    }
    if (d->flags & PKT_SUSPICIOUS_TCP_FLAGS) {
        return true;
    }
    if (rare_l4_service(rs, d->l4_proto, d->dst_port)) {
        return true;
    }
    if (bloom_maybe_contains(&rs->weak_suspicion_bloom, d->dst_hi, d->dst_lo)) {
        return true;
    }
    if (rs->gpu_policy_enabled_for_interface[d->ingress_if]) {
        return sampled(d->flow_hash, rs->gpu_sample_rate);
    }
    return false;
}
The GPU path is a cold path for packet verdicts and a hot path for analytics. That distinction keeps the firewall stable under load.
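The `sampled` helper in the predicate is left undefined. One plausible shape for it is deterministic, hash-based sampling: the decision depends only on the flow hash, so every packet of a flow gets the same answer and no per-flow state is needed. The sketch below is an assumption about how such a helper could work, not the article's definition; it remixes the hash with the MurmurHash3 32-bit finalizer so that sampling is not correlated with the low bits used for shard selection.

```python
SAMPLE_SCALE = 1 << 32


def mix32(x: int) -> int:
    """MurmurHash3 32-bit finalizer: a cheap bijective bit mixer."""
    x &= 0xFFFFFFFF
    x ^= x >> 16
    x = (x * 0x85EBCA6B) & 0xFFFFFFFF
    x ^= x >> 13
    x = (x * 0xC2B2AE35) & 0xFFFFFFFF
    x ^= x >> 16
    return x


def sampled(flow_hash: int, rate: float) -> bool:
    """True for roughly `rate` of all flows, always the same flows."""
    return mix32(flow_hash) < int(rate * SAMPLE_SCALE)


if __name__ == "__main__":
    hits = sum(sampled(h, 0.10) for h in range(100_000))
    print(f"sampled {hits} of 100000 flows")
```

In C the comparison becomes a single multiply-free integer compare against a precomputed threshold stored in the snapshot, which keeps the predicate cheap enough for the fast path.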
Private queue research branch
The private or lab branch should be thought of as a ring-buffer dataplane. The names here are intentionally abstract. The idea is packet ownership, not a promise that these are stable public Apple APIs.
while (running) {
    RxBatch rx = channel_rx_dequeue(wan_channel, MAX_BATCH);
    for (int i = 0; i < rx.count; ++i) {
        PacketSlot *slot = rx.slots[i];
        // Packet payload stays in packet-owned memory.
        PacketDesc64 d = parse_packet_slot_header(slot);
        RuleSnapshot *rs = atomic_load_acquire(&active_rules);
        Verdict v = classify_cpu_fast(rs, &d);
        if (v == DROP) {
            channel_rx_release(slot);
            continue;
        }
        if (v == NEED_GPU) {
            gpu_ring_enqueue(slot, d);
            continue;
        }
        TxSlot *tx = channel_tx_reserve(lan_channel);
        tx_attach_packet(tx, slot);
        channel_tx_submit(lan_channel, tx);
    }
}
The intended memory access pattern is:
NIC -> packet page:
  DMA write once
CPU:
  read Ethernet/IP/L4 header cache lines
  read compact rule/state data
  write TX descriptor or release descriptor
GPU selected path:
  descriptor only, normally 64 bytes
  optional payload window, such as the first 256 or 512 bytes
  verdict byte or word
This is the path where “use all hardware resources” makes the most sense. The CPU owns rings and immediate policy. The GPU is a co-processor that consumes selected metadata. The hard part is API access and OS support. Public Network Extension does not give this packet ownership. DriverKit can expose packet queues when you control the driver. Private Skywalk-style mechanisms may expose similar concepts internally, but that belongs in lab builds unless Apple gives you a supported contract.
BPF as telemetry, not routing
BPF is useful as a side channel:
flowchart LR
NIC --> STACK[Normal macOS stack and firewall path]
STACK --> ROUTE[Forwarding]
NIC --> BPF[BPF tap with cBPF prefilter]
BPF --> SAMPLE[Sample ring]
SAMPLE --> GPU[Metal telemetry batch]
GPU --> POLICY[Policy compiler feedback]
A BPF capture filter might copy only DNS, BGP, SYN packets, ICMP errors, and selected unusual flows:
bpf_filter =
"tcp port 179 or udp port 53 or tcp[tcpflags] & tcp-syn != 0 or icmp";
That helps with telemetry and training data. It should not be confused with an inline router datapath. A capture drop is not the same thing as dropping the live packet from the forwarding path.
Memory control plan
The design has four memory regions.
Region A: packet-owned memory
  Owner: macOS kernel, DriverKit queue, or private channel
  Contents: full Ethernet frames
  Rule: never copy payload unless required
Region B: CPU rule snapshot
  Owner: firewall control process
  Contents: immutable tries, Bloom filters, exact tables, L4 tables
  Rule: atomic pointer swap, no mutation in packet path
Region C: Metal shared rings
  Owner: CPU and GPU
  Contents: compact descriptors and verdicts
  Rule: triple-buffer, cache-line aligned, no per-packet allocation
Region D: GPU rule buffers
  Owner: GPU after upload
  Contents: SoA rule arrays, prefix trie nodes, Bloom words
  Rule: rebuild on policy epoch change
Graphically:
flowchart TB
subgraph UM[Unified memory]
subgraph A[Region A: packet-owned memory]
PKT[Packet pages / mbufs / driver slots<br/>full Ethernet frames]
end
subgraph B[Region B: CPU rule snapshot]
STATE[State shards]
TRIES[Prefix tries]
BLOOM[Bloom filters]
EXACT[Exact and L4 tables]
end
subgraph C[Region C: Metal shared rings]
DESCA[Descriptor ring A]
DESCB[Descriptor ring B]
DESCC[Descriptor ring C]
VERDICT[Verdict rings]
end
subgraph D[Region D: GPU rule buffers]
SOA[SoA rules]
GTRIE[GPU trie]
GBLOOM[GPU Bloom]
MATRIX[Matrix rules]
end
end
NICRX[NIC RX DMA] -->|writes full frame once| PKT
PKT -->|CPU reads headers| CPU[CPU fast path]
CPU -->|rule/state reads| STATE
CPU --> TRIES
CPU --> BLOOM
CPU --> EXACT
CPU -->|selected descriptors| DESCA
CPU --> DESCB
CPU --> DESCC
DESCA -->|GPU reads batches| GPU[Metal compute]
DESCB --> GPU
DESCC --> GPU
SOA -->|GPU reads heavily| GPU
GTRIE --> GPU
GBLOOM --> GPU
MATRIX --> GPU
GPU -->|small results| VERDICT
VERDICT -->|CPU applies verdicts| CPU
CPU -->|forwarded packet descriptor| NICTX[NIC TX DMA]
PKT -->|NIC reads frame for egress| NICTX
Practical performance target
For normal 1500-byte-class traffic, this design is plausible on an M4 or M4 Pro Mac mini if the common path is CPU-only, parallel, and allocation-free. At about 0.813 Mpps, the CPU has time for compact parsing, one state lookup, and a few cache-friendly rule checks.
For minimum-frame floods, the design needs to degrade gracefully. At 14.88 Mpps, user-space callbacks and GPU round trips are too expensive for deterministic per-packet decisions. The firewall should drop early with simple CPU rules, rate-limit logging, and avoid extra classification work.
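Rate-limited logging is the easiest of these degradations to get wrong, because a flood that triggers one log line per dropped packet turns the logger into the bottleneck. A token bucket is a common way to bound it. The sketch below is a generic version with invented names; the clock is passed in explicitly so the behaviour is deterministic, and in practice the caller would supply a monotonic timestamp.

```python
class TokenBucket:
    """Allow at most `burst` events at once and `rate_per_s` on average."""

    def __init__(self, rate_per_s: float, burst: float):
        self.rate = rate_per_s
        self.burst = burst
        self.tokens = burst
        self.last = None

    def allow(self, now: float) -> bool:
        if self.last is not None:
            # Refill for the elapsed time, never beyond the burst size.
            elapsed = now - self.last
            self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


if __name__ == "__main__":
    log_limiter = TokenBucket(rate_per_s=10.0, burst=5.0)
    logged = sum(log_limiter.allow(now=0.0) for _ in range(1_000_000))
    print(f"logged {logged} of 1000000 drops in the first instant")
```

Drops that are not logged individually can still be counted in a per-shard counter and reported in aggregate, which keeps the accounting without the per-packet cost.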
The practical architecture is:
Fast path:
  CPU, immediate verdict, no allocation, no IPC, no GPU wait
Control path:
  BGP daemon -> policy compiler -> immutable CPU and GPU snapshots
GPU path:
  selected packets, compact descriptors, asynchronous batches
Aggressive path:
  DriverKit queue ownership where possible
  private Skywalk-like research only for lab builds
The Mac mini has enough bandwidth for the job. The engineering battle is keeping the packet path boring: fewer locks, fewer cache misses, fewer copies, and no synchronous detours.
References
- Mac mini technical specifications, Apple Support
- NEFilterPacketProvider, Apple Developer Documentation
- packetHandler, Apple Developer Documentation
- TN3165: Packet Filter is not API, Apple Developer Documentation
- Choosing a resource storage mode for Apple GPUs, Apple Developer Documentation
- MTLStorageMode.shared, Apple Developer Documentation
- FRRouting filtering documentation
- NewOSXBook: Skywalk notes