Using Apple Metal to accelerate DPI recognition

2026-06-18 posted in Network

GPU acceleration for DPI sounds attractive, but the wrong design can make a router slower. The expensive part is not just “running math.” The expensive part is moving packet data, synchronizing CPU and GPU work, and deciding which packets deserve the GPU path.

Apple Metal is interesting for DPI because Apple Silicon has a unified memory architecture. The CPU and GPU can both access system memory. That does not mean data movement is free. It means the design can avoid explicit PCIe-style copies and instead spend most of its effort on batching, cache-friendly layout, and synchronization.

The right shape is:

CPU does packet ownership, parsing, flow state, and fast decisions. Metal does batched scoring and feature extraction when there is enough parallel work.

This post uses two examples:

VoIP detection, using RTP-like packet timing and header behavior.
Encrypted BitTorrent detection, using uTP/DHT hints, fan-out, flow symmetry, and packet-size patterns.

The goal is not to decrypt traffic. The goal is fast recognition from visible metadata and early packet structure.

What the GPU is good at

The GPU is good at doing the same small program over many records:

packet descriptor 0 -> feature vector 0 -> score 0
packet descriptor 1 -> feature vector 1 -> score 1
packet descriptor 2 -> feature vector 2 -> score 2
...
packet descriptor N -> feature vector N -> score N

It is bad at being the owner of every packet decision when the traffic is branchy, sparse, and latency sensitive.

A DPI engine has both kinds of work:

branchy work: parse Ethernet/IP/TCP/UDP, handle fragmentation, update flow tables, expire state
regular work: compute packet statistics, update counters, score feature vectors, run small classifiers

Metal should get the regular work.

The CPU should keep the branchy work.

First mental model

Think of DPI as a two-stage system:

flowchart LR
  A[NIC RX] --> B[CPU packet parser]
  B --> C{known flow}
  C -->|yes| D[fast cached action]
  C -->|no| E[feature descriptor ring]
  E --> F[Metal batch scoring]
  F --> G[CPU policy merge]
  G --> H[flow label cache]
  H --> D

The most important optimization is the first branch. If a flow already has a label, do not send every packet through Metal. A cached flow should pay only:

L3/L4 parse + flow hash + table lookup + cached action

Metal is for unknown flows, suspicious flows, and periodic re-scoring. It is not the replacement for the flow cache.
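As a rough illustration of that first branch, here is a minimal Python sketch (a real packet path would be written in C or Rust). The names `FlowCache` and `handle_packet`, the five-tuple shape, and the label strings are hypothetical and only serve to show that a labelled flow never produces GPU work.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical five-tuple: (src_ip, dst_ip, src_port, dst_port, ip_proto)
FiveTuple = Tuple[str, str, int, int, int]


class FlowCache:
    """Maps a five-tuple to a final label; unknown flows return None."""

    def __init__(self) -> None:
        self._labels: Dict[FiveTuple, str] = {}

    def lookup(self, key: FiveTuple) -> Optional[str]:
        return self._labels.get(key)

    def store(self, key: FiveTuple, label: str) -> None:
        self._labels[key] = label


def handle_packet(cache: FlowCache, key: FiveTuple, gpu_queue: List[FiveTuple]) -> str:
    """Return the cached label, or 'enqueued' when the flow goes to the Metal path."""
    label = cache.lookup(key)
    if label is not None:
        return label          # fast path: no descriptor, no GPU work
    gpu_queue.append(key)     # slow path: only unknown flows reach the batch
    return "enqueued"
```

Once a label is stored, later packets of the same flow return it directly and the GPU queue stops growing.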

Unified memory does not remove scheduling

On Apple GPUs, a Metal resource created with the shared storage mode (MTLStorageMode.shared) is system memory visible to both CPU and GPU. That is convenient for a DPI pipeline because the CPU can write packet descriptors into a Metal buffer and the GPU can read them without a manual copy into a discrete GPU memory heap.

But there are still three costs:

CPU writes must be complete before GPU reads.
GPU writes must be complete before CPU reads.
CPU and GPU caches still need a synchronization boundary.

In practice, use a ring of shared buffers. Each slot has a state:

FREE -> CPU_WRITING -> READY_FOR_GPU -> GPU_READING -> GPU_DONE -> CPU_READING -> FREE

Do not let CPU and GPU fight over the same slot.

flowchart TB
  subgraph SharedBufferRing["shared Metal buffers"]
    S0["slot 0 descriptors and results"]
    S1["slot 1 descriptors and results"]
    S2["slot 2 descriptors and results"]
    S3["slot 3 descriptors and results"]
  end

  CPUW["CPU writes descriptors"] --> S0
  CPUW --> S1
  GPU["Metal compute kernels"] --> S2
  GPU --> S3
  CPUR["CPU reads completed scores"] --> S2

  S0 --> Q0["state CPU_WRITING"]
  S1 --> Q1["state READY_FOR_GPU"]
  S2 --> Q2["state GPU_DONE"]
  S3 --> Q3["state GPU_READING"]

The ring keeps both processors busy. The CPU fills one slot while the GPU processes another. A completion handler, shared event, or command-buffer status moves a slot from GPU_READING to GPU_DONE.
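The slot lifecycle can be sketched as a small state machine. This is an illustrative Python model, not Metal code: `SlotState`, `Slot`, and `advance` are hypothetical names, and the model includes the READY_FOR_GPU state that appears in the diagram. The point is that every state has exactly one legal successor, so a slot always has exactly one owner.

```python
from enum import Enum


class SlotState(Enum):
    FREE = "FREE"
    CPU_WRITING = "CPU_WRITING"
    READY_FOR_GPU = "READY_FOR_GPU"
    GPU_READING = "GPU_READING"
    GPU_DONE = "GPU_DONE"
    CPU_READING = "CPU_READING"


# Each state has exactly one legal successor, so ownership is never ambiguous.
NEXT_STATE = {
    SlotState.FREE: SlotState.CPU_WRITING,
    SlotState.CPU_WRITING: SlotState.READY_FOR_GPU,
    SlotState.READY_FOR_GPU: SlotState.GPU_READING,
    SlotState.GPU_READING: SlotState.GPU_DONE,
    SlotState.GPU_DONE: SlotState.CPU_READING,
    SlotState.CPU_READING: SlotState.FREE,
}


class Slot:
    def __init__(self) -> None:
        self.state = SlotState.FREE

    def advance(self, expected: SlotState) -> SlotState:
        """Move to the next state; fail loudly if the caller does not own the slot."""
        if self.state is not expected:
            raise RuntimeError(
                f"slot is {self.state.name}, caller expected {expected.name}"
            )
        self.state = NEXT_STATE[self.state]
        return self.state
```

In a real pipeline the GPU_READING to GPU_DONE transition would be driven by the command buffer's completion callback rather than by the packet thread.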

Data layout

Do not put full packets into the GPU path unless you must. Most DPI recognition uses a compact descriptor:

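struct PacketDesc {
    uint64_t flow_id;
    uint32_t timestamp_delta_us;  // microseconds, clamped to 32 bits
    uint16_t payload_len;
    uint16_t l4_sport;
    uint16_t l4_dport;
    uint8_t  ip_proto;
    uint8_t  direction;           // 0 or 1
    uint8_t  tcp_flags;
    uint8_t  first_bytes_len;     // valid bytes in first_bytes, at most 32
    uint8_t  first_bytes[32];     // leading payload bytes, zero-padded
};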

That descriptor is small enough to move through memory cheaply. It also avoids exposing the GPU to variable-length packet buffers.
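To make the layout concrete, the sketch below packs the same fields with Python's `struct` module. The format string and the helper `pack_desc` are assumptions for illustration. The fields add up to 54 bytes, and the two trailing pad bytes mirror the rounding to 56 that a C or Metal compiler would typically apply because of the 8-byte alignment of `flow_id`.

```python
import struct

# Little-endian, explicit layout:
#   Q   flow_id
#   I   timestamp_delta_us
#   HHH payload_len, l4_sport, l4_dport
#   BBBB ip_proto, direction, tcp_flags, first_bytes_len
#   32s first_bytes (struct zero-pads shorter input)
#   2x  trailing padding up to a multiple of 8 bytes
PACKET_DESC = struct.Struct("<QIHHHBBBB32s2x")


def pack_desc(flow_id, delta_us, payload_len, sport, dport,
              proto, direction, tcp_flags, payload):
    """Serialize one descriptor, copying at most 32 payload bytes."""
    first = payload[:32]
    return PACKET_DESC.pack(flow_id, delta_us, payload_len, sport, dport,
                            proto, direction, tcp_flags, len(first), first)
```

A design that wants one descriptor per 64-byte cache line could pad further, at the cost of a little more memory traffic per batch.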

If payload scanning is needed, cap it:

max bytes copied into descriptor: 32 or 64
max packets sent to deep path: first 8 per flow
max flow age for deep path: 10 seconds

For scoring, use a structure-of-arrays layout when the batch is large:

flow_id[]              contiguous
payload_len[]          contiguous
timestamp_delta_us[]   contiguous
first_word[]           contiguous
direction[]            contiguous

The GPU likes contiguous memory. An array-of-structs is easier for the CPU to write. A structure-of-arrays is easier for the GPU to read. A practical design can use one CPU-friendly descriptor ring and one Metal kernel to unpack into GPU-friendly scratch buffers.
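The transposition itself is simple. The following Python sketch does on the CPU what the unpack kernel would do on the GPU, using typed arrays from the standard library. The function name `to_soa` and the dictionary-based record format are hypothetical.

```python
from array import array
from typing import Dict, List


def to_soa(descs: List[dict]) -> Dict[str, array]:
    """Transpose per-packet records into one contiguous typed array per field."""
    return {
        "flow_id": array("Q", (d["flow_id"] for d in descs)),
        "payload_len": array("H", (d["payload_len"] for d in descs)),
        "timestamp_delta_us": array("I", (d["timestamp_delta_us"] for d in descs)),
        "direction": array("B", (d["direction"] for d in descs)),
    }
```

With this layout, neighbouring GPU threads that read `payload_len[i]` and `payload_len[i + 1]` touch adjacent memory instead of striding across whole descriptors.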

The cost model

Let a CPU-only classifier cost:

$$ T_{\mathrm{cpu}}(n) = n(c_{\mathrm{parse}} + c_{\mathrm{score}}) $$

For Metal, the CPU still parses every packet before packing its descriptor:

$$ T_{\mathrm{metal}}(n) = c_{\mathrm{submit}} + c_{\mathrm{sync}} + n(c_{\mathrm{parse}} + c_{\mathrm{pack}} + c_{\mathrm{gpu}}) $$

Metal wins when:

$$ n > \frac{c_{\mathrm{submit}} + c_{\mathrm{sync}}} {c_{\mathrm{score}} - c_{\mathrm{pack}} - c_{\mathrm{gpu}}} $$

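This inequality is the whole design. It only has a solution when $c_{\mathrm{score}} > c_{\mathrm{pack}} + c_{\mathrm{gpu}}$, that is, when packing a descriptor and scoring it on the GPU costs less per packet than scoring it on the CPU. Given that, if the batch is too small, the CPU wins because command submission and synchronization dominate. If the batch is large enough, the GPU wins because the fixed cost is amortized and the scoring work is parallel.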

That means a DPI scheduler should have two triggers:

submit when batch_size >= N_min
submit when oldest_descriptor_age >= L_max

The first trigger protects throughput. The second protects latency.
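A value for N_min can be derived from the break-even inequality above. The sketch below works in integer nanoseconds to avoid rounding surprises. The function name and the cost figures in the usage note are hypothetical; real values have to be measured on the target machine.

```python
def min_batch_size(c_submit_ns, c_sync_ns, c_score_ns, c_pack_ns, c_gpu_ns):
    """Smallest integer n for which the Metal path beats CPU-only scoring.

    Returns None when the per-packet GPU path is not cheaper than CPU
    scoring, because then no batch size makes Metal win.
    """
    per_packet_gain = c_score_ns - c_pack_ns - c_gpu_ns
    if per_packet_gain <= 0:
        return None
    fixed = c_submit_ns + c_sync_ns
    # Strict inequality n > fixed / gain, so take the floor and add one.
    return fixed // per_packet_gain + 1
```

With made-up costs of 20 µs to submit, 10 µs to synchronize, 50 ns to score on the CPU, 10 ns to pack, and 5 ns of amortized GPU time per packet, `min_batch_size(20_000, 10_000, 50, 10, 5)` returns 858. Batches smaller than that should stay on the CPU.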

What stays on CPU

Keep these on CPU:

packet capture and ownership
Ethernet/IP/TCP/UDP parsing
fragmentation and reassembly policy
flow table lookup
flow expiration
NAT/firewall action
exact early signatures
final policy decision

The CPU can also run a cheap first classifier:

if known_flow:
    apply cached label
elif packet is definitely boring:
    pass
elif packet is early and useful:
    enqueue descriptor for Metal
else:
    update CPU counters only

The GPU should not be asked to decide whether an IPv4 header is malformed. That is control-flow-heavy and security-sensitive. Parse once on CPU, normalize into descriptors, then let the GPU compute scores.

What goes to Metal

Metal is useful for:

scoring thousands of flow feature vectors
testing RTP-like header candidates in parallel
computing packet-size histograms
computing inter-arrival statistics
updating per-batch partial counters
evaluating logistic or linear models
reducing per-host fan-out sketches

The GPU should operate on bounded records:

one thread per packet descriptor
one thread per flow feature vector
one threadgroup per host bucket

Avoid unbounded parsers, regex, linked lists, and hash-table-heavy kernels.

CPU and Metal schedule

The simplest scheduler is double or triple buffering:

slot 0: CPU writes descriptors
slot 1: GPU scores descriptors
slot 2: CPU reads scores from previous GPU pass

Pseudocode:

while running:
    packets = rx_poll()

    for pkt in packets:
        meta = parse_headers(pkt)
        if meta.invalid:
            continue

        flow = flow_table.lookup(meta.five_tuple)
        if flow.label_is_final:
            apply(flow.label, pkt)
            continue

        update_cpu_fast_counters(flow, meta)

        # current_slot is None while every slot is owned by the GPU
        if ring.current_slot and should_enqueue_for_gpu(flow, meta):
            ring.current_slot.append(make_packet_desc(pkt, meta, flow))

        # forward under the default policy; never hold a packet for Metal
        apply(DEFAULT_LABEL, pkt)

    # one submit per poll cycle, and never an empty slot
    slot = ring.current_slot
    if slot and slot.count > 0 and (slot.count >= N_min or slot.age_us >= L_max):
        submit_slot_to_metal(slot)
        ring.current_slot = ring.next_free_slot()  # None under backpressure

    for done in completed_gpu_slots():
        merge_scores_into_flow_table(done.results)
        done.mark_free()

    if ring.current_slot is None:
        ring.current_slot = ring.next_free_slot()

The scheduler should never block the packet path waiting for Metal. If the GPU queue is full, fall back to CPU counters and sampling.

if no_free_gpu_slot:
    skip_gpu_for_this_packet
    maybe_sample_later

A DPI engine that blocks forwarding on GPU completion has made the GPU part of the critical path. That is usually the wrong tradeoff.
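The two submit triggers and the non-blocking fallback fit in a small helper. This is a hedged Python model of the scheduling logic only, with no Metal calls. `Batcher`, its method names, and the explicit `now_us` clock parameter are all hypothetical.

```python
class Batcher:
    """Collects descriptors and reports when a batch should be submitted."""

    def __init__(self, n_min: int, l_max_us: int, free_slots: int) -> None:
        self.n_min = n_min
        self.l_max_us = l_max_us
        self.free_slots = free_slots   # slots not currently owned by the GPU
        self.pending = []              # descriptors in the slot being filled
        self.oldest_us = None          # arrival time of the oldest pending descriptor
        self.skipped = 0               # descriptors that bypassed the GPU path

    def add(self, desc, now_us: int) -> bool:
        """Queue a descriptor; return False instead of blocking when no slot is free."""
        if self.free_slots == 0:
            self.skipped += 1
            return False
        if not self.pending:
            self.oldest_us = now_us
        self.pending.append(desc)
        return True

    def take_if_due(self, now_us: int):
        """Return a batch when the size or age trigger fires, else None."""
        if not self.pending:
            return None
        full = len(self.pending) >= self.n_min
        stale = now_us - self.oldest_us >= self.l_max_us
        if not (full or stale):
            return None
        batch, self.pending, self.oldest_us = self.pending, [], None
        self.free_slots -= 1           # the GPU owns this slot until completion
        return batch

    def on_gpu_complete(self) -> None:
        self.free_slots += 1
```

The `skipped` counter is worth exporting as a metric: a steadily growing value means the GPU path is saturated and the sampling policy is doing the real work.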

Metal command structure

A batch usually needs several compute passes:

flowchart LR
  A[descriptor buffer] --> B[packet feature kernel]
  B --> C[per flow reduce kernel]
  C --> D[protocol score kernel]
  D --> E[host context score kernel]
  E --> F[result buffer]

In Metal terms, this is one command buffer with multiple compute encoders or multiple dispatches. The CPU commits the command buffer, then continues parsing packets. When the command buffer completes, the CPU merges the result buffer into the flow table.

The result should be small:

struct FlowResult {
    uint64_t flow_id;
    int16_t  voip_score;
    int16_t  bittorrent_score;
    uint8_t  confidence;
    uint8_t  flags;            // reason flags for the policy engine
};

Do not copy huge intermediate arrays back to CPU. The CPU only needs final scores and a few reason flags.

Example 1: VoIP detection

VoIP is a good GPU example because many RTP-like checks are simple and independent per packet.

RTP has a fixed header with a version field, sequence number, timestamp, and SSRC. In common SRTP deployments, the payload is encrypted, but the RTP packet shape and timing can still be visible. The detector should not assume every UDP packet with RTP version bits is voice. It should look for a sequence over time.

Useful features:

UDP flow
RTP version field looks like 2
sequence number increments
timestamp increases coherently
SSRC is stable
payload size is small or medium
packet interval is regular
both directions have similar cadence
RTCP appears nearby or is multiplexed

RTP/RTCP multiplexing matters because RTP data and RTCP control can share one UDP port. RFC 5761 discusses how RTP and RTCP packets can be distinguished when they are multiplexed.

VoIP scoring

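For each packet:

$$ s_{\mathrm{rtp}} = w_v f_{\mathrm{version}} + w_p f_{\mathrm{payloadtype}} + w_r f_{\mathrm{ssrc}} + w_s f_{\mathrm{seq}} + w_t f_{\mathrm{timestamp}} + w_l f_{\mathrm{length}} $$

For the flow, average the packet scores over the scoring window and add the timing terms:

$$ S_{\mathrm{voip}} = \bar{s}_{\mathrm{rtp}} + w_j f_{\mathrm{jitter}} + w_c f_{\mathrm{cadence}} + w_b f_{\mathrm{bidirectional}} $$

where $\bar{s}_{\mathrm{rtp}}$ is the mean of $s_{\mathrm{rtp}}$ over the flow's packets in the window.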

The cadence feature can be:

$$ f_{\mathrm{cadence}} = \exp\left(-\frac{\sigma_{\Delta t}^{2}}{\tau^2}\right) $$

where $\sigma_{\Delta t}^{2}$ is the variance of packet inter-arrival time and $\tau$ is the tolerance. A low-variance stream of small UDP packets is more VoIP-like than a bursty download.
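A CPU reference for the cadence feature is useful for checking the GPU output. This sketch assumes arrival timestamps in microseconds and uses the population variance. The function name and the choice to return 0.0 when fewer than two intervals are available are assumptions.

```python
import math
from statistics import pvariance


def cadence_feature(arrival_us, tau_us):
    """exp(-variance(inter-arrival) / tau^2); 1.0 means perfectly regular."""
    deltas = [b - a for a, b in zip(arrival_us, arrival_us[1:])]
    if len(deltas) < 2:
        return 0.0  # assumption: too few packets to say anything about cadence
    return math.exp(-pvariance(deltas) / (tau_us ** 2))
```

A stream with one packet every 20 ms scores 1.0 regardless of the tolerance, while two short bursts separated by a long gap score close to zero.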

VoIP kernel pseudocode

kernel score_voip(packet_descs, previous_flow_state, packet_scores, packet_flags):
    i = global_thread_id
    p = packet_descs[i]

    score = 0
    flags = 0

    # The fixed RTP header is 12 bytes; all of it must be in the copied prefix.
    if p.ip_proto != UDP or p.payload_len < 12 or p.first_bytes_len < 12:
        packet_scores[i] = 0
        packet_flags[i] = 0
        return

    b0 = p.first_bytes[0]
    version = b0 >> 6

    if version == 2:
        score += W_RTP_VERSION
        flags |= FLAG_RTP_VERSION

    # Byte 1 is the marker bit followed by the 7-bit payload type.
    payload_type = p.first_bytes[1] & 0x7f
    if plausible_payload_type(payload_type):
        score += W_PAYLOAD_TYPE

    seq = u16be(p.first_bytes[2:4])
    ts  = u32be(p.first_bytes[4:8])
    ssrc = u32be(p.first_bytes[8:12])

    # Read-only state the CPU published before this batch.
    prev = previous_flow_state[p.flow_id]

    if prev.ssrc == ssrc:
        score += W_SSRC_STABLE

    if seq_distance(prev.seq, seq) in [1, 2, 3]:
        score += W_SEQ_MONOTONIC

    if ts_after(ts, prev.timestamp):
        score += W_TIMESTAMP_MONOTONIC

    if p.payload_len >= 60 and p.payload_len <= 400:
        score += W_VOICE_SIZE

    packet_scores[i] = score
    packet_flags[i] = flags

The CPU should merge packet scores into flow state. The GPU should not own the canonical flow table unless the whole packet engine is already GPU-resident.
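When validating a kernel like this, it helps to have a CPU reference for the header decode. The sketch below parses the 12-byte fixed RTP header and computes a wraparound-safe sequence distance. The function names and the dictionary result are illustrative choices, and the decision to reject anything that is not version 2 is an assumption of this sketch rather than something the kernel does.

```python
import struct


def parse_rtp_header(first_bytes: bytes):
    """Decode the 12-byte fixed RTP header; None if too short or not version 2."""
    if len(first_bytes) < 12:
        return None
    # Network byte order: two flag bytes, 16-bit seq, 32-bit timestamp, 32-bit SSRC.
    b0, b1, seq, ts, ssrc = struct.unpack("!BBHII", first_bytes[:12])
    if b0 >> 6 != 2:
        return None
    return {
        "payload_type": b1 & 0x7F,
        "marker": b1 >> 7,
        "seq": seq,
        "timestamp": ts,
        "ssrc": ssrc,
    }


def seq_distance(prev_seq: int, seq: int) -> int:
    """Forward distance between 16-bit sequence numbers, tolerant of wraparound."""
    return (seq - prev_seq) & 0xFFFF
```

The masking in `seq_distance` matters: a stream that rolls over from 65535 to 0 has distance 1, not a large negative number.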

Example 2: encrypted BitTorrent detection

Encrypted BitTorrent is different from VoIP. There may be no stable RTP-like header. The GPU should score behavior and visible side protocols.

Useful features:

many remote peers
many new flows in a short window
bidirectional transfer
mixed packet sizes
long-lived data flows
DHT-like UDP nearby
uTP-like UDP nearby
plain BitTorrent handshake if visible

BEP 3 defines the plain peer wire handshake and length-prefixed messages. If the handshake is visible, the CPU can classify it immediately. BEP 29 defines uTP over UDP, which gives a visible packet shape even when payload bytes are not useful.

For encrypted traffic, the GPU should not look for one magic string. It should evaluate a feature vector:

$$ \mathbf{x}_{\mathrm{bt}} = \left[ x_{\mathrm{fanout}}, x_{\mathrm{newflows}}, x_{\mathrm{symmetry}}, x_{\mathrm{sizehist}}, x_{\mathrm{dht}}, x_{\mathrm{utp}}, x_{\mathrm{duration}} \right] $$

Then:

$$ S_{\mathrm{bt}} = \mathbf{w}_{\mathrm{bt}} \cdot \mathbf{x}_{\mathrm{bt}} $$

and:

$$ P_{\mathrm{bt}} = \frac{1}{1 + e^{-S_{\mathrm{bt}}}} $$
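The two formulas translate directly into a few lines of Python. The `bias` parameter and the clamping of the score are additions of this sketch and not part of the formulas above. Without a bias term, an all-zero feature vector maps to exactly 0.5, which is worth remembering when choosing thresholds.

```python
import math


def bt_probability(weights, features, bias=0.0):
    """Logistic of the weighted feature sum; weights and features share one ordering."""
    if len(weights) != len(features):
        raise ValueError("weights and features must have the same length")
    s = bias + sum(w * x for w, x in zip(weights, features))
    s = max(-60.0, min(60.0, s))  # keep math.exp well inside float range
    return 1.0 / (1.0 + math.exp(-s))
```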

BitTorrent batch algorithm

The CPU pre-aggregates descriptors by flow shard. Metal computes partials:

kernel packet_features(packet_descs, packet_features):
    i = global_thread_id
    p = packet_descs[i]

    f = zero_feature_vector()
    f.bytes = p.payload_len
    f.small = p.payload_len <= SMALL_PACKET
    f.large = p.payload_len >= LARGE_PACKET
    f.direction = p.direction

    if p.ip_proto == UDP:
        f.utp_hint = score_utp_shape(p.first_bytes, p.payload_len)
        f.dht_hint = score_dht_prefix(p.first_bytes, p.first_bytes_len)

    if p.ip_proto == TCP:
        f.bt_plain = match_plain_bt_prefix(p.first_bytes, p.first_bytes_len)

    packet_features[i] = f

Then a flow reduce kernel:

kernel reduce_flow_features(packet_features, flow_ranges, flow_features):
    flow_index = global_thread_id
    range = flow_ranges[flow_index]

    acc = zero_flow_features()
    # the CPU fills local_host_id when it builds flow_ranges
    acc.local_host_id = range.local_host_id

    for i in range.start .. range.end:
        acc.packet_count += 1
        acc.bytes += packet_features[i].bytes
        acc.small += packet_features[i].small
        acc.large += packet_features[i].large
        acc.utp_score += packet_features[i].utp_hint
        acc.dht_score += packet_features[i].dht_hint
        acc.bt_plain += packet_features[i].bt_plain
        acc.dir_bytes[packet_features[i].direction] += packet_features[i].bytes

    flow_features[flow_index] = acc

Then a score kernel:

kernel score_bittorrent(flow_features, host_snapshot, results):
    i = global_thread_id
    f = flow_features[i]
    h = host_snapshot[f.local_host_id]

    symmetry =
        min(f.dir_bytes[0], f.dir_bytes[1]) /
        (max(f.dir_bytes[0], f.dir_bytes[1]) + 1)

    size_mix =
        (f.small + f.large) / max(f.packet_count, 1)

    score =
        W_PLAIN * f.bt_plain +
        W_UTP * f.utp_score +
        W_DHT * f.dht_score +
        W_SYM * symmetry +
        W_SIZE * size_mix +
        W_FANOUT * log1p(h.remote_peer_estimate) +
        W_NEWFLOWS * log1p(h.new_flow_count)

    results[i] = logistic(score)

The host snapshot should be read-only for the GPU. The CPU can publish a new snapshot every few milliseconds or every few batches.

Why not update the flow table on GPU?

You can, but it is usually not the best first design.

A DPI flow table is a hash table with expiration, locking, collision handling, protocol-specific state, and policy output. That is CPU-friendly. A GPU prefers dense arrays.

The compromise is:

CPU flow table:
  canonical state
  hash lookup
  expiration
  final labels

GPU arrays:
  batch descriptors
  packet features
  flow partials
  score results

The CPU converts sparse, branchy packet traffic into dense batches. Metal scores the dense batches. The CPU merges the results back into the canonical flow table.

Data movement budget

Assume one packet descriptor is $D$ bytes, one feature result is $F$ bytes, and one final score is $R$ bytes.

For a batch of $n$ packets:

$$ B_{\mathrm{shared}} = n(D + F) + mR $$

where $m$ is the number of flows in the batch.

If $D = 64$, $F = 32$, $R = 8$, $n = 8192$, and $m = 2048$:

$$ B_{\mathrm{shared}} = 8192(64 + 32) + 2048(8) = 802816\ \mathrm{bytes} $$

That is 784 KiB, comfortably under 1 MB per batch. The payload itself never enters the GPU path except for the bounded first bytes embedded in the descriptor.
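The budget is easy to recompute for other batch shapes. In this small helper, the function name is hypothetical and the defaults are the example sizes used above.

```python
def shared_bytes(n_packets, n_flows, desc_bytes=64, feature_bytes=32, result_bytes=8):
    """Bytes of shared memory touched by one batch: n(D + F) + mR."""
    return n_packets * (desc_bytes + feature_bytes) + n_flows * result_bytes
```

`shared_bytes(8192, 2048)` gives 802816, matching the figure above. Doubling the batch to 16384 packets over the same 2048 flows gives 1589248 bytes, so the footprint is dominated by the per-packet terms.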

This is the key data-movement rule:

Move facts about packets, not packets.

Avoiding normal-packet overhead

Normal traffic should not pay for GPU classification forever.

Use these escape hatches:

flow becomes normal_likely -> stop GPU enqueue
flow becomes known_service -> stop GPU enqueue
flow exceeds first_packet_window -> stop deep features
GPU queue backs up -> sample only
host is below suspicion floor -> CPU counters only

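A good threshold policy, evaluated top to bottom so that the first matching case wins:

$$ \operatorname{enqueue}(\mathrm{flow}) = \begin{cases} 0, & \mathrm{flow}_{\mathrm{final}} \lor \mathrm{gpu}_{\mathrm{backpressure}} \\ 1, & \mathrm{flow}_{\mathrm{unknown}} \land \mathrm{age} < A_{\max} \\ 1, & \mathrm{host}_{\mathrm{suspicious}} \land \mathrm{sample}(\mathrm{flow}) \\ 0, & \text{otherwise} \end{cases} $$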

The GPU should be a scarce accelerator, not a tax on every packet.
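Written as code, the policy is a short chain of checks. This sketch tests the two zero cases first so that a final label or GPU backpressure always wins over the enqueue conditions. That ordering is a choice of this sketch. The parameter names are hypothetical, and the 10-second default for the age limit is taken from the deep-path cap earlier in the post.

```python
def should_enqueue(*, flow_final, flow_unknown, age_s, host_suspicious,
                   sampled, gpu_backpressure, max_age_s=10.0):
    """Decide whether a packet descriptor goes to the Metal batch."""
    if flow_final or gpu_backpressure:
        return False                      # never spend GPU time here
    if flow_unknown and age_s < max_age_s:
        return True                       # young unknown flow: worth scoring
    return bool(host_suspicious and sampled)  # otherwise only sampled suspects
```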

Backpressure policy

The scheduler needs a clear overload behavior.

if gpu_pending_slots < soft_limit:
    enqueue normal unknown descriptors
elif gpu_pending_slots < hard_limit:
    enqueue only suspicious hosts
else:
    disable gpu enqueue for this poll cycle
    rely on CPU fast path

This is better than letting latency explode. DPI is normally inline with forwarding. A slightly stale label is better than a packet queue that grows without bound.

Combining CPU and Metal scores

The CPU often has signals the GPU does not need to see:

known local service
DNS/SNI/domain context
interface or VLAN role
policy allowlist
conntrack state
NAT mapping

The final score can combine CPU and GPU terms:

$$ S = \alpha S_{\mathrm{cpu}} + \beta S_{\mathrm{metal}} + \gamma S_{\mathrm{host}} $$

Then:

$$ P = \frac{1}{1 + e^{-S}} $$

Use hysteresis:

if P >= 0.90 for two batches:
    label = likely_target_protocol
elif P <= 0.20 after first window:
    label = normal_likely
else:
    label = unknown

One GPU batch should not permanently label a flow unless the signal is exact, such as a visible plaintext BitTorrent handshake.
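One way to implement that rule is a small per-flow tracker. The sketch below makes two assumptions that the rule above leaves open: the two high-probability batches must be consecutive, and a label, once set, is sticky. The class name and label strings are illustrative.

```python
class LabelHysteresis:
    """Turns per-batch probabilities into a stable flow label."""

    def __init__(self, high=0.90, low=0.20, needed=2):
        self.high = high
        self.low = low
        self.needed = needed
        self.streak = 0          # consecutive batches at or above `high`
        self.label = "unknown"

    def update(self, p, first_window_done):
        if self.label != "unknown":
            return self.label    # assumption: a label, once set, is sticky
        if p >= self.high:
            self.streak += 1
            if self.streak >= self.needed:
                self.label = "likely_target_protocol"
        else:
            self.streak = 0      # assumption: high batches must be consecutive
            if first_window_done and p <= self.low:
                self.label = "normal_likely"
        return self.label
```

An exact signal, such as a visible plaintext handshake, would bypass this tracker and set the label directly.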

End-to-end design

Here is the complete pipeline:

flowchart TB
  RX["RX packets"] --> P["CPU parse headers"]
  P --> F["flow cache lookup"]
  F --> K{final label}
  K -->|yes| A["apply action"]
  K -->|no| C["CPU cheap counters"]
  C --> B{eligible for Metal}
  B -->|no| U["keep unknown or normal"]
  B -->|yes| R["write shared ring slot"]
  R --> M["commit Metal command buffer"]
  M --> G1["packet feature kernel"]
  G1 --> G2["flow reduce kernel"]
  G2 --> G3["VoIP and BT score kernels"]
  G3 --> O["result buffer"]
  O --> CM["CPU completion merge"]
  CM --> FC["update flow labels"]
  FC --> A

The critical path is still CPU forwarding. Metal runs beside it.

Implementation notes

Use multiple command buffers in flight. A single command buffer at a time often leaves either the CPU or GPU idle.

Use shared buffers for descriptors and results on Apple Silicon. Use private buffers for GPU-only scratch data if a kernel reads and writes the same intermediate data heavily.

Use compact integer features where possible:

uint16 packet_len
uint16 delta_us_clamped
uint8  direction
uint8  protocol
int16  score

Floating point is fine for final scoring, but most packet features are counters and flags.

Avoid atomics in the hot GPU kernels when possible. If flow aggregation needs atomics, first ask whether the CPU can sort or group descriptors by flow. Dense ranges are easier for Metal than random writes.
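The grouping step can be sketched on the CPU side as a stable sort followed by a single pass that records where each flow starts and ends. The function name and the `(flow_id, start, end)` tuple shape are hypothetical, and `end` is treated as exclusive here.

```python
from itertools import groupby
from operator import itemgetter


def group_by_flow(descs):
    """Sort descriptors by flow_id and return (sorted_descs, ranges).

    Each range is (flow_id, start, end) with end exclusive, so one GPU
    thread can reduce sorted_descs[start:end] without atomics.
    """
    # sorted() is stable, so arrival order is preserved inside each flow.
    ordered = sorted(descs, key=itemgetter("flow_id"))
    ranges = []
    start = 0
    for flow_id, group in groupby(ordered, key=itemgetter("flow_id")):
        count = sum(1 for _ in group)
        ranges.append((flow_id, start, start + count))
        start += count
    return ordered, ranges
```

Because every flow owns a dense, non-overlapping slice, the reduce kernel can write one output record per thread with no contention.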

Keep reason flags:

FLAG_RTP_VERSION
FLAG_RTP_CADENCE
FLAG_BT_DHT
FLAG_BT_UTP
FLAG_BT_FANOUT
FLAG_BT_SYMMETRY

The final policy engine should be able to say why it classified a flow.

What a reader should take away

Metal acceleration helps when DPI recognition becomes a batched feature-scoring problem. It does not help when every packet needs a different parser and a different branch.

For VoIP, Metal can quickly score RTP-like packet structure and cadence across thousands of UDP flows.

For encrypted BitTorrent, Metal can score many weak signals together: uTP shape, DHT hints, fan-out, bidirectional transfer, packet-size mixture, and host behavior.

The unified memory model is useful because the CPU and GPU can share descriptor and result buffers. But the fast design is still about reducing movement:

parse on CPU
copy only compact descriptors
batch enough work
score on Metal
merge tiny results
cache the decision

The core rule is:

Do not accelerate packets. Accelerate uncertainty.

Packets that are already known should stay on the CPU fast path. Metal should spend its time on the small fraction of traffic where parallel scoring changes the answer.