Using Apple Metal to accelerate DPI recognition
GPU acceleration for DPI sounds attractive, but the wrong design can make a router slower. The expensive part is not just “running math.” The expensive part is moving packet data, synchronizing CPU and GPU work, and deciding which packets deserve the GPU path.
Apple Metal is interesting for DPI because Apple Silicon has a unified memory architecture. The CPU and GPU can both access system memory. That does not mean data movement is free. It means the design can avoid explicit PCIe-style copies and instead spend most of its effort on batching, cache-friendly layout, and synchronization.
The right shape is:
CPU does packet ownership, parsing, flow state, and fast decisions. Metal does batched scoring and feature extraction when there is enough parallel work.
This post uses two examples:
- VoIP detection, using RTP-like packet timing and header behavior.
- Encrypted BitTorrent detection, using uTP/DHT hints, fan-out, flow symmetry, and packet-size patterns.
The goal is not to decrypt traffic. The goal is fast recognition from visible metadata and early packet structure.
What the GPU is good at
The GPU is good at doing the same small program over many records:
packet descriptor 0 -> feature vector 0 -> score 0
packet descriptor 1 -> feature vector 1 -> score 1
packet descriptor 2 -> feature vector 2 -> score 2
...
packet descriptor N -> feature vector N -> score N
It is bad at being the owner of every packet decision when the traffic is branchy, sparse, and latency sensitive.
A DPI engine has both kinds of work:
- branchy work: parse Ethernet/IP/TCP/UDP, handle fragmentation, update flow tables, expire state
- regular work: compute packet statistics, update counters, score feature vectors, run small classifiers
Metal should get the regular work.
The CPU should keep the branchy work.
First mental model
Think of DPI as a two-stage system:
flowchart LR
A[NIC RX] --> B[CPU packet parser]
B --> C{known flow}
C -->|yes| D[fast cached action]
C -->|no| E[feature descriptor ring]
E --> F[Metal batch scoring]
F --> G[CPU policy merge]
G --> H[flow label cache]
H --> D
The most important optimization is the first branch. If a flow already has a label, do not send every packet through Metal. A cached flow should pay only:
L3/L4 parse + flow hash + table lookup + cached action
Metal is for unknown flows, suspicious flows, and periodic re-scoring. It is not the replacement for the flow cache.
Unified memory does not remove scheduling
On Apple GPUs, a shared Metal resource is system memory visible to both CPU and GPU. That is convenient for a DPI pipeline because the CPU can write packet descriptors into a Metal buffer and the GPU can read them without a manual copy into a discrete GPU memory heap.
But there are still three costs:
- CPU writes must be complete before GPU reads.
- GPU writes must be complete before CPU reads.
- CPU and GPU caches still need a synchronization boundary.
In practice, use a ring of shared buffers. Each slot has a state:
FREE -> CPU_WRITING -> READY_FOR_GPU -> GPU_READING -> GPU_DONE -> CPU_READING -> FREE
Do not let CPU and GPU fight over the same slot.
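A minimal sketch of that ownership rule, written as plain Python rather than Metal code. The `Slot` class, `NEXT_STATE` table, and `claim_free_slot` helper are hypothetical names for illustration. The sketch includes the READY_FOR_GPU state that the slot diagram in this section uses between CPU_WRITING and GPU_READING.

```python
from enum import Enum, auto


class SlotState(Enum):
    FREE = auto()
    CPU_WRITING = auto()
    READY_FOR_GPU = auto()
    GPU_READING = auto()
    GPU_DONE = auto()
    CPU_READING = auto()


# Each state has exactly one legal successor, so a slot has one owner at a time.
NEXT_STATE = {
    SlotState.FREE: SlotState.CPU_WRITING,
    SlotState.CPU_WRITING: SlotState.READY_FOR_GPU,
    SlotState.READY_FOR_GPU: SlotState.GPU_READING,
    SlotState.GPU_READING: SlotState.GPU_DONE,
    SlotState.GPU_DONE: SlotState.CPU_READING,
    SlotState.CPU_READING: SlotState.FREE,
}


class Slot:
    def __init__(self, index):
        self.index = index
        self.state = SlotState.FREE

    def advance(self, expected):
        """Move to the next state, refusing if the caller's view is stale."""
        if self.state is not expected:
            raise RuntimeError(
                f"slot {self.index}: expected {expected.name}, found {self.state.name}"
            )
        self.state = NEXT_STATE[self.state]
        return self.state


def claim_free_slot(ring):
    """Return a slot moved to CPU_WRITING, or None when the ring is full."""
    for slot in ring:
        if slot.state is SlotState.FREE:
            slot.advance(SlotState.FREE)
            return slot
    return None
```

Returning `None` instead of waiting is deliberate: a full ring is a backpressure signal for the packet path, not something to block on.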
flowchart TB
subgraph SharedBufferRing["shared Metal buffers"]
S0["slot 0 descriptors and results"]
S1["slot 1 descriptors and results"]
S2["slot 2 descriptors and results"]
S3["slot 3 descriptors and results"]
end
CPUW["CPU writes descriptors"] --> S0
CPUW --> S1
GPU["Metal compute kernels"] --> S2
GPU --> S3
CPUR["CPU reads completed scores"] --> S2
S0 --> Q0["state CPU_WRITING"]
S1 --> Q1["state READY_FOR_GPU"]
S2 --> Q2["state GPU_DONE"]
S3 --> Q3["state GPU_READING"]
The ring keeps both processors busy. The CPU fills one slot while the GPU processes another. A completion handler, shared event, or command-buffer status moves a slot from GPU_READING to GPU_DONE.
Data layout
Do not put full packets into the GPU path unless you must. Most DPI recognition uses a compact descriptor:
#include <stdint.h>

struct PacketDesc {
    uint64_t flow_id;
    uint32_t timestamp_delta_us;
    uint16_t payload_len;
    uint16_t l4_sport;
    uint16_t l4_dport;
    uint8_t  ip_proto;
    uint8_t  direction;
    uint8_t  tcp_flags;
    uint8_t  first_bytes_len;   /* number of valid bytes in first_bytes */
    uint8_t  first_bytes[32];
};
That descriptor is small enough to move through memory cheaply. It also avoids exposing the GPU to variable-length packet buffers.
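To make the size concrete, here is a rough sketch that packs the same fields with Python's `struct` module. The `PACKET_DESC` format string and the `pack_desc` helper are illustrative names, not part of any real pipeline. The packed layout is 54 bytes; a C or Metal compiler would typically pad the struct to 56 bytes to keep the 64-bit field aligned.

```python
import struct

# Little-endian, no padding, same field order as PacketDesc:
# Q=flow_id, I=timestamp_delta_us, H=payload_len, H=l4_sport, H=l4_dport,
# B=ip_proto, B=direction, B=tcp_flags, B=first_bytes_len, 32s=first_bytes
PACKET_DESC = struct.Struct("<QIHHHBBBB32s")
MAX_FIRST_BYTES = 32


def pack_desc(flow_id, delta_us, payload, sport, dport, proto, direction, tcp_flags=0):
    """Build one fixed-size descriptor; only a bounded payload prefix is copied."""
    head = payload[:MAX_FIRST_BYTES]
    return PACKET_DESC.pack(
        flow_id,
        min(delta_us, 0xFFFFFFFF),      # clamp to the 32-bit field
        min(len(payload), 0xFFFF),      # clamp to the 16-bit field
        sport,
        dport,
        proto,
        direction,
        tcp_flags,
        len(head),
        head,                           # "32s" zero-pads short prefixes
    )
```

A 1,400-byte packet and a 40-byte packet both cost the same 54 bytes here, which is what keeps the GPU path free of variable-length buffers.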
If payload scanning is needed, cap it:
max bytes copied into descriptor: 32 or 64
max packets sent to deep path: first 8 per flow
max flow age for deep path: 10 seconds
For scoring, use a structure-of-arrays layout when the batch is large:
flow_id[] contiguous
payload_len[] contiguous
timestamp_delta_us[] contiguous
first_word[] contiguous
direction[] contiguous
The GPU likes contiguous memory. An array-of-structs is easier for the CPU to write. A structure-of-arrays is easier for the GPU to read. A practical design can use one CPU-friendly descriptor ring and one Metal kernel to unpack into GPU-friendly scratch buffers.
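The unpack step can be sketched on the CPU side with Python's `array` module, which stores each column as one contiguous typed buffer. The `to_soa` helper and the dictionary-based descriptors are assumptions made for illustration; a real pipeline would do this in a Metal kernel or in native code.

```python
from array import array


def to_soa(descs):
    """Split a list of descriptor dicts into one contiguous typed array per field."""
    soa = {
        "flow_id": array("Q"),             # unsigned 64-bit
        "payload_len": array("H"),         # unsigned 16-bit
        "timestamp_delta_us": array("I"),  # unsigned int, 32-bit on common platforms
        "direction": array("B"),           # unsigned 8-bit
    }
    for d in descs:
        for name, column in soa.items():
            column.append(d[name])
    return soa
```

Each column's `tobytes()` result is a single dense run of same-sized values, which is the access pattern a one-thread-per-record kernel wants.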
The cost model
Let a CPU-only classifier cost:
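$$ T_{\mathrm{cpu}}(n) = n(c_{\mathrm{parse}} + c_{\mathrm{score}}) $$
For Metal, parsing still happens on the CPU, so the per-packet parse cost stays in the total:
$$ T_{\mathrm{metal}}(n) = c_{\mathrm{submit}} + c_{\mathrm{sync}} + n(c_{\mathrm{parse}} + c_{\mathrm{pack}} + c_{\mathrm{gpu}}) $$
The parse terms cancel. Provided \(c_{\mathrm{score}} > c_{\mathrm{pack}} + c_{\mathrm{gpu}}\), Metal wins when:
$$ n > \frac{c_{\mathrm{submit}} + c_{\mathrm{sync}}} {c_{\mathrm{score}} - c_{\mathrm{pack}} - c_{\mathrm{gpu}}} $$
This inequality is the whole design. If the batch is too small, the CPU wins because command submission and synchronization dominate. If the batch is large enough, the GPU wins because the scoring work is parallel. If packing plus GPU time per packet is not cheaper than CPU scoring, no batch size helps.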
That means a DPI scheduler should have two triggers:
submit when batch_size >= N_min
submit when oldest_descriptor_age >= L_max
The first trigger protects throughput. The second protects latency.
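As a rough illustration, the break-even batch size and the two triggers fit in a few lines. The function names and every cost figure used with them are hypothetical; real values have to come from measurement on the target machine.

```python
import math


def break_even_batch(c_submit, c_sync, c_score, c_pack, c_gpu):
    """Smallest integer n that satisfies the inequality, or None if Metal never wins."""
    saving_per_packet = c_score - c_pack - c_gpu
    if saving_per_packet <= 0:
        return None
    return math.floor((c_submit + c_sync) / saving_per_packet) + 1


def should_submit(batch_size, oldest_age_us, n_min, l_max_us):
    """Throughput trigger OR latency trigger; never submit an empty batch."""
    if batch_size == 0:
        return False
    return batch_size >= n_min or oldest_age_us >= l_max_us
```

With made-up costs in nanoseconds (40,000 to submit, 10,000 to synchronize, 120 to score on CPU, 20 to pack, 50 per packet on GPU), `break_even_batch` returns 1001, so `N_min` would sit somewhere above a thousand descriptors.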
What stays on CPU
Keep these on CPU:
- packet capture and ownership
- Ethernet/IP/TCP/UDP parsing
- fragmentation and reassembly policy
- flow table lookup
- flow expiration
- NAT/firewall action
- exact early signatures
- final policy decision
The CPU can also run a cheap first classifier:
if known_flow:
    apply cached label
elif packet is definitely boring:
    pass
elif packet is early and useful:
    enqueue descriptor for Metal
else:
    update CPU counters only
The GPU should not be asked to decide whether an IPv4 header is malformed. That is control-flow-heavy and security-sensitive. Parse once on CPU, normalize into descriptors, then let the GPU compute scores.
What goes to Metal
Metal is useful for:
- scoring thousands of flow feature vectors
- testing RTP-like header candidates in parallel
- computing packet-size histograms
- computing inter-arrival statistics
- updating per-batch partial counters
- evaluating logistic or linear models
- reducing per-host fan-out sketches
The GPU should operate on bounded records:
one thread per packet descriptor
one thread per flow feature vector
one threadgroup per host bucket
Avoid unbounded parsers, regex, linked lists, and hash-table-heavy kernels.
CPU and Metal schedule
The simplest scheduler is double or triple buffering:
slot 0: CPU writes descriptors
slot 1: GPU scores descriptors
slot 2: CPU reads scores from previous GPU pass
Pseudocode:
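while running:
    packets = rx_poll()
    for pkt in packets:
        meta = parse_headers(pkt)
        if meta.invalid:
            continue
        # Create state for new flows so the label check below is always valid.
        flow = flow_table.lookup_or_insert(meta.five_tuple)
        if flow.label_is_final:
            apply(flow.label, pkt)
            continue
        # Unlabeled flows get the default action now; the packet never waits for Metal.
        apply(default_action, pkt)
        update_cpu_fast_counters(flow, meta)
        # No free slot means backpressure: stay on CPU counters only.
        if ring.current_slot is None:
            ring.current_slot = ring.next_free_slot()
        if ring.current_slot is not None and should_enqueue_for_gpu(flow, meta):
            ring.current_slot.append(make_packet_desc(pkt, meta, flow))
            if ring.current_slot.count >= N_min:
                submit_slot_to_metal(ring.current_slot)
                ring.current_slot = ring.next_free_slot()
    # Latency trigger: checked once per poll so it also fires when RX goes quiet.
    slot = ring.current_slot
    if slot is not None and slot.count > 0 and slot.age_us >= L_max:
        submit_slot_to_metal(slot)
        ring.current_slot = ring.next_free_slot()
    for done in completed_gpu_slots():
        merge_scores_into_flow_table(done.results)
        done.mark_free()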
The scheduler should never block the packet path waiting for Metal. If the GPU queue is full, fall back to CPU counters and sampling.
if no_free_gpu_slot:
    skip_gpu_for_this_packet
    maybe_sample_later
A DPI engine that blocks forwarding on GPU completion has made the GPU part of the critical path. That is usually the wrong tradeoff.
Metal command structure
A batch usually needs several compute passes:
flowchart LR
A[descriptor buffer] --> B[packet feature kernel]
B --> C[per flow reduce kernel]
C --> D[protocol score kernel]
D --> E[host context score kernel]
E --> F[result buffer]
In Metal terms, this is one command buffer with multiple compute encoders or multiple dispatches. The CPU commits the command buffer, then continues parsing packets. When the command buffer completes, the CPU merges the result buffer into the flow table.
The result should be small:
struct FlowResult {
    uint64_t flow_id;
    int16_t  voip_score;
    int16_t  bittorrent_score;
    uint8_t  confidence;
    uint8_t  flags;             /* reason flags for the policy engine */
};
Do not copy huge intermediate arrays back to CPU. The CPU only needs final scores and a few reason flags.
Example 1: VoIP detection
VoIP is a good GPU example because many RTP-like checks are simple and independent per packet.
RTP has a fixed header with a version field, sequence number, timestamp, and SSRC. In common SRTP deployments, the payload is encrypted, but the RTP packet shape and timing can still be visible. The detector should not assume every UDP packet with RTP version bits is voice. It should look for a sequence over time.
Useful features:
UDP flow
RTP version field looks like 2
sequence number increments
timestamp increases coherently
SSRC is stable
payload size is small or medium
packet interval is regular
both directions have similar cadence
RTCP appears nearby or is multiplexed
RTP/RTCP multiplexing matters because RTP data and RTCP control can share one UDP port. RFC 5761 discusses how RTP and RTCP packets can be distinguished when they are multiplexed.
VoIP scoring
For each packet:
$$ s_{\mathrm{rtp}} = w_v f_{\mathrm{version}} + w_p f_{\mathrm{payloadtype}} + w_r f_{\mathrm{ssrc}} + w_s f_{\mathrm{seq}} + w_t f_{\mathrm{timestamp}} + w_l f_{\mathrm{length}} $$
For the flow, with \(\bar{s}_{\mathrm{rtp}}\) an aggregate of the per-packet scores (for example, their mean over the scoring window):
$$ S_{\mathrm{voip}} = \bar{s}_{\mathrm{rtp}} + w_j f_{\mathrm{jitter}} + w_c f_{\mathrm{cadence}} + w_b f_{\mathrm{bidirectional}} $$
The cadence feature can be:
$$ f_{\mathrm{cadence}} = \exp\left(-\frac{\sigma_{\Delta t}^{2}}{\tau^2}\right) $$
where \(\sigma_{\Delta t}^{2}\) is the variance of packet inter-arrival time and \(\tau\) is the tolerance. A low-variance stream of small UDP packets is more VoIP-like than a bursty download.
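A small sketch of the cadence feature, assuming arrival times in microseconds and the population variance of the gaps. The `cadence_feature` name, the choice to return 0 when fewer than two gaps are available, and the tolerance value are assumptions.

```python
import math


def cadence_feature(arrival_times_us, tau_us):
    """exp(-variance / tau^2) over inter-arrival gaps; 1.0 means perfectly regular."""
    deltas = [b - a for a, b in zip(arrival_times_us, arrival_times_us[1:])]
    if len(deltas) < 2:
        return 0.0  # not enough evidence to call the stream regular
    mean = sum(deltas) / len(deltas)
    variance = sum((d - mean) ** 2 for d in deltas) / len(deltas)
    return math.exp(-variance / (tau_us ** 2))
```

A stream with a packet exactly every 20 ms scores 1.0. A burst followed by a long gap scores close to 0 for any tolerance in the low-millisecond range.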
VoIP kernel pseudocode
kernel score_voip(packet_descs, previous_flow_state, packet_scores, packet_flags):
    i = global_thread_id
    p = packet_descs[i]
    score = 0
    flags = 0
    # Default outputs for every early return.
    packet_scores[i] = 0
    packet_flags[i] = 0
    if p.ip_proto != UDP:
        return
    # The fixed RTP header is 12 bytes; the packet and the copied prefix must both cover it.
    if p.payload_len < 12 or p.first_bytes_len < 12:
        return
    b0 = p.first_bytes[0]
    version = b0 >> 6
    if version == 2:
        score += W_RTP_VERSION
        flags |= FLAG_RTP_VERSION
    payload_type = p.first_bytes[1] & 0x7f
    if plausible_payload_type(payload_type):
        score += W_PAYLOAD_TYPE
    seq = u16be(p.first_bytes[2:4])
    ts = u32be(p.first_bytes[4:8])
    ssrc = u32be(p.first_bytes[8:12])
    # Assumes flow_id is a dense index into a read-only, pre-batch snapshot.
    prev = previous_flow_state[p.flow_id]
    if prev.ssrc == ssrc:
        score += W_SSRC_STABLE
    if seq_distance(prev.seq, seq) in [1, 2, 3]:
        score += W_SEQ_MONOTONIC
    if ts_after(ts, prev.timestamp):
        score += W_TIMESTAMP_MONOTONIC
    if p.payload_len >= 60 and p.payload_len <= 400:
        score += W_VOICE_SIZE
    packet_scores[i] = score
    packet_flags[i] = flags
The CPU should merge packet scores into flow state. The GPU should not own the canonical flow table unless the whole packet engine is already GPU-resident.
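The header checks in the kernel can be prototyped on the CPU before any Metal code is written. The sketch below parses the 12-byte fixed RTP header from RFC 3550 with `struct` and applies the same sequence, timestamp, and SSRC tests with wraparound handled. The weights and helper names are hypothetical.

```python
import struct

# Hypothetical weights; a real deployment would tune these on labeled traffic.
W_RTP_VERSION = 2
W_SSRC_STABLE = 2
W_SEQ_MONOTONIC = 3
W_TIMESTAMP_MONOTONIC = 1


def parse_rtp_fixed_header(first_bytes):
    """Return (version, payload_type, seq, timestamp, ssrc), or None if too short."""
    if len(first_bytes) < 12:
        return None
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", first_bytes[:12])
    return b0 >> 6, b1 & 0x7F, seq, ts, ssrc


def seq_distance(prev_seq, seq):
    """Forward distance between 16-bit sequence numbers, modulo wraparound."""
    return (seq - prev_seq) & 0xFFFF


def ts_after(ts, prev_ts):
    """True when ts is ahead of prev_ts in 32-bit serial-number order."""
    return 0 < ((ts - prev_ts) & 0xFFFFFFFF) < 0x80000000


def score_rtp(first_bytes, prev):
    """Score one packet against the flow's previous (seq, timestamp, ssrc), if any."""
    hdr = parse_rtp_fixed_header(first_bytes)
    if hdr is None:
        return 0
    version, _payload_type, seq, ts, ssrc = hdr
    score = W_RTP_VERSION if version == 2 else 0
    if prev is not None:
        prev_seq, prev_ts, prev_ssrc = prev
        if ssrc == prev_ssrc:
            score += W_SSRC_STABLE
        if seq_distance(prev_seq, seq) in (1, 2, 3):
            score += W_SEQ_MONOTONIC
        if ts_after(ts, prev_ts):
            score += W_TIMESTAMP_MONOTONIC
    return score
```

The wraparound case is the one that tends to get missed: a packet with sequence number 0 that follows 65535 is a normal increment, not a reset.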
Example 2: encrypted BitTorrent detection
Encrypted BitTorrent is different from VoIP. There may be no stable RTP-like header. The GPU should score behavior and visible side protocols.
Useful features:
many remote peers
many new flows in a short window
bidirectional transfer
mixed packet sizes
long-lived data flows
DHT-like UDP nearby
uTP-like UDP nearby
plain BitTorrent handshake if visible
BEP 3 defines the plain peer wire handshake and length-prefixed messages. If the handshake is visible, the CPU can classify it immediately. BEP 29 defines uTP over UDP, which gives a visible packet shape even when payload bytes are not useful.
For encrypted traffic, the GPU should not look for one magic string. It should evaluate a feature vector:
$$ \mathbf{x}_{\mathrm{bt}} = \left[ x_{\mathrm{fanout}}, x_{\mathrm{newflows}}, x_{\mathrm{symmetry}}, x_{\mathrm{sizehist}}, x_{\mathrm{dht}}, x_{\mathrm{utp}}, x_{\mathrm{duration}} \right] $$
Then:
$$ S_{\mathrm{bt}} = \mathbf{w}_{\mathrm{bt}} \cdot \mathbf{x}_{\mathrm{bt}} $$
and:
$$ P_{\mathrm{bt}} = \frac{1}{1 + e^{-S_{\mathrm{bt}}}} $$
BitTorrent batch algorithm
The CPU pre-aggregates descriptors by flow shard. Metal computes partials:
kernel packet_features(packet_descs, packet_features):
    i = global_thread_id
    p = packet_descs[i]
    f = zero_feature_vector()
    f.bytes = p.payload_len
    f.small = p.payload_len <= SMALL_PACKET
    f.large = p.payload_len >= LARGE_PACKET
    f.direction = p.direction
    if p.ip_proto == UDP:
        f.utp_hint = score_utp_shape(p.first_bytes, p.payload_len)
        f.dht_hint = score_dht_prefix(p.first_bytes, p.first_bytes_len)
    if p.ip_proto == TCP:
        f.bt_plain = match_plain_bt_prefix(p.first_bytes, p.first_bytes_len)
    packet_features[i] = f
Then a flow reduce kernel:
kernel reduce_flow_features(packet_features, flow_ranges, flow_features):
    flow_index = global_thread_id
    range = flow_ranges[flow_index]
    acc = zero_flow_features()
    # Assumes the CPU records the local host for each flow range; the score kernel reads it.
    acc.local_host_id = range.local_host_id
    for i in range.start .. range.end:
        acc.packet_count += 1
        acc.bytes += packet_features[i].bytes
        acc.small += packet_features[i].small
        acc.large += packet_features[i].large
        acc.utp_score += packet_features[i].utp_hint
        acc.dht_score += packet_features[i].dht_hint
        acc.bt_plain += packet_features[i].bt_plain
        acc.dir_bytes[packet_features[i].direction] += packet_features[i].bytes
    flow_features[flow_index] = acc
Then a score kernel:
kernel score_bittorrent(flow_features, host_snapshot, results):
    i = global_thread_id
    f = flow_features[i]
    h = host_snapshot[f.local_host_id]
    symmetry =
        min(f.dir_bytes[0], f.dir_bytes[1]) /
        (max(f.dir_bytes[0], f.dir_bytes[1]) + 1)
    size_mix =
        (f.small + f.large) / max(f.packet_count, 1)
    score =
        W_PLAIN * f.bt_plain +
        W_UTP * f.utp_score +
        W_DHT * f.dht_score +
        W_SYM * symmetry +
        W_SIZE * size_mix +
        W_FANOUT * log1p(h.remote_peer_estimate) +
        W_NEWFLOWS * log1p(h.new_flow_count)
    results[i] = logistic(score)
The host snapshot should be read-only for the GPU. The CPU can publish a new snapshot every few milliseconds or every few batches.
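The score kernel is easy to check on the CPU first. This sketch mirrors its arithmetic in plain Python. The weight values are invented for illustration, and the dictionary field names simply follow the pseudocode.

```python
import math

# Hypothetical weights, chosen only to make the example readable.
WEIGHTS = {
    "plain": 6.0, "utp": 0.5, "dht": 0.5, "sym": 1.0,
    "size": 1.0, "fanout": 0.4, "newflows": 0.3,
}


def logistic(s):
    """Numerically stable 1 / (1 + e^-s)."""
    if s >= 0:
        return 1.0 / (1.0 + math.exp(-s))
    e = math.exp(s)
    return e / (1.0 + e)


def bittorrent_probability(flow, host, w=WEIGHTS):
    """Combine per-flow partials with a read-only host snapshot into one probability."""
    lo, hi = sorted(flow["dir_bytes"])
    symmetry = lo / (hi + 1)
    size_mix = (flow["small"] + flow["large"]) / max(flow["packet_count"], 1)
    score = (
        w["plain"] * flow["bt_plain"]
        + w["utp"] * flow["utp_score"]
        + w["dht"] * flow["dht_score"]
        + w["sym"] * symmetry
        + w["size"] * size_mix
        + w["fanout"] * math.log1p(host["remote_peer_estimate"])
        + w["newflows"] * math.log1p(host["new_flow_count"])
    )
    return logistic(score)
```

Because the formula has no bias term, an all-zero feature vector maps to exactly 0.5. A trained model would normally let a negative weight or an offset pull quiet flows well below that.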
Why not update the flow table on GPU?
You can, but it is usually not the best first design.
A DPI flow table is a hash table with expiration, locking, collision handling, protocol-specific state, and policy output. That is CPU-friendly. A GPU prefers dense arrays.
The compromise is:
CPU flow table:
    canonical state
    hash lookup
    expiration
    final labels
GPU arrays:
    batch descriptors
    packet features
    flow partials
    score results
The CPU converts sparse, branchy packet traffic into dense batches. Metal scores the dense batches. The CPU merges the results back into the canonical flow table.
Data movement budget
Assume one packet descriptor is \(D\) bytes, one feature result is \(F\) bytes, and one final score is \(R\) bytes.
For a batch of \(n\) packets:
$$ B_{\mathrm{shared}} = n(D + F) + mR $$
where \(m\) is the number of flows in the batch.
If \(D = 64\), \(F = 32\), \(R = 8\), \(n = 8192\), and \(m = 2048\):
$$ B_{\mathrm{shared}} = 8192(64 + 32) + 2048(8) = 802816\ \mathrm{bytes} $$
That is less than 1 MB per batch. The payload itself never enters the GPU path except for the bounded first bytes embedded in the descriptor.
This is the key data-movement rule:
Move facts about packets, not packets.
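The budget is simple enough to keep as a helper next to the ring sizing code. This is a sketch with the same symbols as the formula; the function name and default sizes are assumptions taken from the worked example.

```python
def shared_bytes(n_packets, m_flows, desc_bytes=64, feature_bytes=32, result_bytes=8):
    """B_shared = n(D + F) + mR: bytes touched in shared buffers for one batch."""
    return n_packets * (desc_bytes + feature_bytes) + m_flows * result_bytes


batch = shared_bytes(8192, 2048)
print(batch)  # 802816
```

Multiplying the result by the number of ring slots gives a first estimate of how much shared memory the pipeline pins.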
Avoiding normal-packet overhead
Normal traffic should not pay for GPU classification forever.
Use these escape hatches:
flow becomes normal_likely -> stop GPU enqueue
flow becomes known_service -> stop GPU enqueue
flow exceeds first_packet_window -> stop deep features
GPU queue backs up -> sample only
host is below suspicion floor -> CPU counters only
A good threshold policy:
$$ \operatorname{enqueue}(flow) = \begin{cases} 0, & flow_{\mathrm{final}} \\ 0, & gpu_{\mathrm{backpressure}} \\ 1, & flow_{\mathrm{unknown}} \land age < A_{\max} \\ 1, & host_{\mathrm{suspicious}} \land sample(flow) \\ 0, & \text{otherwise} \end{cases} $$
The cases are evaluated top to bottom, so a final label or GPU backpressure always wins. The GPU should be a scarce accelerator, not a tax on every packet.
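One way to write that policy as code, assuming the two "do not enqueue" conditions take precedence over the two "enqueue" conditions. The `FlowView` record, its field names, and the 10-second default for the age limit are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class FlowView:
    label_final: bool       # flow already has a final label
    age_s: float            # seconds since the flow was first seen
    host_suspicious: bool   # local host is above the suspicion floor
    sampled: bool           # this flow was picked by the sampler


def should_enqueue(flow, gpu_backpressure, max_age_s=10.0):
    """Return True when a descriptor for this flow should go to the GPU ring."""
    # Stop conditions first: they win over any reason to enqueue.
    if flow.label_final or gpu_backpressure:
        return False
    # Young unknown flows get the deep path.
    if flow.age_s < max_age_s:
        return True
    # Older flows only ride along when the host is suspicious and sampling picks them.
    return flow.host_suspicious and flow.sampled
```

Keeping the stop conditions at the top makes the overload behavior obvious from the first two lines of the function.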
Backpressure policy
The scheduler needs a clear overload behavior.
if gpu_pending_slots < soft_limit:
    enqueue normal unknown descriptors
elif gpu_pending_slots < hard_limit:
    enqueue only suspicious hosts
else:
    disable gpu enqueue for this poll cycle
    rely on CPU fast path
This is better than letting latency explode. DPI is normally inline with forwarding. A slightly stale label is better than a packet queue that grows without bound.
Combining CPU and Metal scores
The CPU often has signals the GPU does not need to see:
- known local service
- DNS/SNI/domain context
- interface or VLAN role
- policy allowlist
- conntrack state
- NAT mapping
The final score can combine CPU and GPU terms:
$$ S = \alpha S_{\mathrm{cpu}} + \beta S_{\mathrm{metal}} + \gamma S_{\mathrm{host}} $$
Then:
$$ P = \frac{1}{1 + e^{-S}} $$
Use hysteresis:
if P >= 0.90 for two batches:
    label = likely_target_protocol
elif P <= 0.20 after first window:
    label = normal_likely
else:
    label = unknown
One GPU batch should not permanently label a flow unless the signal is exact, such as a visible plaintext BitTorrent handshake.
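A minimal sketch of that rule as a per-flow object. It reads "for two batches" as two consecutive batches, which is an assumption, and the class and parameter names are hypothetical.

```python
class HysteresisLabeler:
    """Tracks one flow's label across batches using the thresholds from the rule."""

    def __init__(self, high=0.90, low=0.20, batches_required=2):
        self.high = high
        self.low = low
        self.batches_required = batches_required
        self.high_streak = 0
        self.label = "unknown"

    def update(self, p, first_window_done):
        """Feed one batch probability; return the label after this batch."""
        # Count consecutive high-probability batches; any lower batch resets the run.
        self.high_streak = self.high_streak + 1 if p >= self.high else 0
        if self.high_streak >= self.batches_required:
            self.label = "likely_target_protocol"
        elif p <= self.low and first_window_done:
            self.label = "normal_likely"
        else:
            self.label = "unknown"
        return self.label
```

A single batch at 0.95 leaves the flow at `unknown`; the second consecutive one promotes it. Exact signals, such as a plaintext handshake, would bypass this object entirely.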
End-to-end design
Here is the complete pipeline:
flowchart TB
RX["RX packets"] --> P["CPU parse headers"]
P --> F["flow cache lookup"]
F --> K{final label}
K -->|yes| A["apply action"]
K -->|no| C["CPU cheap counters"]
C --> B{eligible for Metal}
B -->|no| U["keep unknown or normal"]
B -->|yes| R["write shared ring slot"]
R --> M["commit Metal command buffer"]
M --> G1["packet feature kernel"]
G1 --> G2["flow reduce kernel"]
G2 --> G3["VoIP and BT score kernels"]
G3 --> O["result buffer"]
O --> CM["CPU completion merge"]
CM --> FC["update flow labels"]
FC --> A
The critical path is still CPU forwarding. Metal runs beside it.
Implementation notes
Use multiple command buffers in flight. A single command buffer at a time often leaves either the CPU or GPU idle.
Use shared buffers for descriptors and results on Apple Silicon. Use private buffers for GPU-only scratch data if a kernel reads and writes the same intermediate data heavily.
Use compact integer features where possible:
uint16 packet_len
uint16 delta_us_clamped
uint8 direction
uint8 protocol
int16 score
Floating point is fine for final scoring, but most packet features are counters and flags.
Avoid atomics in the hot GPU kernels when possible. If flow aggregation needs atomics, first ask whether the CPU can sort or group descriptors by flow. Dense ranges are easier for Metal than random writes.
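The grouping step might look like the sketch below: sort descriptors by flow on the CPU, then emit one half-open range per flow so that each GPU thread reduces a dense slice without atomics. The `group_by_flow` name and the dictionary descriptors are assumptions made for illustration.

```python
from itertools import groupby
from operator import itemgetter


def group_by_flow(descs):
    """Return (descriptors sorted by flow_id, [(flow_id, start, end), ...])."""
    # sorted() is stable, so arrival order is preserved inside each flow.
    ordered = sorted(descs, key=itemgetter("flow_id"))
    ranges = []
    start = 0
    for flow_id, group in groupby(ordered, key=itemgetter("flow_id")):
        count = sum(1 for _ in group)
        ranges.append((flow_id, start, start + count))  # end is exclusive
        start += count
    return ordered, ranges
```

The ranges play the role of `flow_ranges` in the reduce kernel: one thread per range, each reading only its own slice.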
Keep reason flags:
FLAG_RTP_VERSION
FLAG_RTP_CADENCE
FLAG_BT_DHT
FLAG_BT_UTP
FLAG_BT_FANOUT
FLAG_BT_SYMMETRY
The final policy engine should be able to say why it classified a flow.
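As an illustration, the flags fit in the single `flags` byte of the result record and can be decoded with `IntFlag`. The bit positions and the `explain` helper are hypothetical; only the flag names come from the list above.

```python
from enum import IntFlag


class Reason(IntFlag):
    RTP_VERSION = 1 << 0
    RTP_CADENCE = 1 << 1
    BT_DHT = 1 << 2
    BT_UTP = 1 << 3
    BT_FANOUT = 1 << 4
    BT_SYMMETRY = 1 << 5


def explain(flags):
    """Turn a packed flags byte into a list of reason names for logs or UI."""
    value = Reason(flags)
    return [reason.name for reason in Reason if reason & value]
```

For example, `explain(Reason.BT_UTP | Reason.BT_FANOUT)` yields the two names a policy log would need to justify a BitTorrent label.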
What a reader should take away
Metal acceleration helps when DPI recognition becomes a batched feature-scoring problem. It does not help when every packet needs a different parser and a different branch.
For VoIP, Metal can quickly score RTP-like packet structure and cadence across thousands of UDP flows.
For encrypted BitTorrent, Metal can score many weak signals together: uTP shape, DHT hints, fan-out, bidirectional transfer, packet-size mixture, and host behavior.
The unified memory model is useful because the CPU and GPU can share descriptor and result buffers. But the fast design is still about reducing movement:
parse on CPU
copy only compact descriptors
batch enough work
score on Metal
merge tiny results
cache the decision
The core rule is:
Do not accelerate packets. Accelerate uncertainty.
Packets that are already known should stay on the CPU fast path. Metal should spend its time on the small fraction of traffic where parallel scoring changes the answer.
References
- Apple Developer Documentation: Choosing a resource storage mode for Apple GPUs
- Apple Developer Documentation: Setting up a command structure
- Apple Developer Documentation: Synchronizing CPU and GPU work
- Apple Developer Documentation: Synchronizing events between a GPU and the CPU
- RFC 3550: RTP, A Transport Protocol for Real-Time Applications
- RFC 5761: Multiplexing RTP Data and Control Packets on a Single Port
- BEP 3: The BitTorrent Protocol Specification
- BEP 29: uTorrent Transport Protocol