docs(HARDENING): rewrite roadmap based on live scan findings

Phase 1 is complete. Live testing revealed:
- Window size (64240) is already correct — Phase 2 window mangling unnecessary
- TI=Z (IP ID = 0) is the single remaining blocker for Windows spoofing
- ip_no_pmtu_disc does NOT fix TI=Z (tested and confirmed)

Revised phase plan:
- Phase 2: ICMP tuning (icmp_ratelimit + icmp_ratemask sysctls)
- Phase 3: NFQUEUE daemon for IP ID rewriting (fixes TI=Z)
- Phase 4: diminishing returns, not recommended

Added detailed NFQUEUE architecture, TCPOPTSTRIP notes, and
note clarifying P= field in nmap output.
2026-04-10 16:38:27 -04:00
parent 6df2c9ccbf
commit 62a67f3d1d


@@ -6,184 +6,203 @@ scanners see the intended OS rather than a generic Linux kernel.
---
## Current State
## Current State (Post-Phase 1)
OS spoofing is partially implemented. Each archetype declares an `nmap_os` slug
(e.g. `"windows"`, `"linux"`, `"embedded"`). The **composer** resolves that slug
via `os_fingerprint.get_os_sysctls()` and injects the resulting kernel parameters
into the **base container** as Docker `sysctls`. Service containers inherit the
same network namespace via `network_mode: "service:<base>"` and therefore appear
identical to outside scanners.
Phase 1 is **implemented and tested against live scans**. Each archetype declares
an `nmap_os` slug (e.g. `"windows"`, `"linux"`, `"embedded"`). The **composer**
resolves that slug via `os_fingerprint.get_os_sysctls()` and injects the resulting
kernel parameters into the **base container** as Docker `sysctls`. Service
containers inherit the same network namespace via `network_mode: "service:<base>"`
and therefore appear identical to outside scanners.
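A minimal sketch of how the shared network namespace could be expressed when the compose file is generated. The helper name `build_services`, its parameters, and the placeholder image names are illustrative assumptions, not the composer's actual code; only the `sysctls` key, the `NET_ADMIN` capability, and `network_mode: "service:<base>"` come from this document.

```python
def build_services(base_name, sysctls, service_images, base_image="debian:stable-slim"):
    """Return a docker-compose style `services` mapping (hypothetical helper).

    The base container owns the network namespace and carries the sysctls;
    every service container joins that namespace instead of creating its own.
    """
    services = {
        base_name: {
            "image": base_image,               # assumption: any long-running image
            "command": ["sleep", "infinity"],
            "cap_add": ["NET_ADMIN"],
            "sysctls": dict(sysctls),          # set once, seen by all services
        }
    }
    for name, image in service_images.items():
        services[name] = {
            "image": image,
            "network_mode": f"service:{base_name}",  # share the base namespace
            "depends_on": [base_name],
        }
    return services


if __name__ == "__main__":
    demo = build_services(
        "decky1-base",
        {"net.ipv4.ip_default_ttl": "128"},
        {"smb": "example/smb"},                # placeholder image name
    )
    print(demo["smb"]["network_mode"])
```

Because the service containers carry no `sysctls` of their own, the fingerprint is defined in exactly one place per decky.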
### Currently tuned knobs
### Implemented sysctls (8 per OS profile)
| Sysctl | Purpose |
|---|---|
| `net.ipv4.ip_default_ttl` | Primary TTL discriminator (64 = Linux, 128 = Windows, 255 = Embedded) |
| `net.ipv4.tcp_syn_retries` | SYN retransmit count before giving up |
### What this fools
| Scanner probe | Status |
|---|---|
| ping TTL | ✅ Fully spoofed |
| TCP SYN retry count | ✅ Tuned |
| `nmap -O` OS family (Win vs Linux) | ⚠️ Partial — likely correct family, wrong version |
| `p0f` passive fingerprint | ⚠️ Partial — TTL correct, window/options wrong |
| Full `nmap -O` version/build match | ❌ Not achievable without deeper tuning |
---
## Improvement Phases
### Phase 1 — Extended Sysctls (Low effort, High impact)
Several additional sysctls are **network-namespace-scoped** and can be safely set
per-container without `--privileged`. These directly affect nmap's SEQ, OPS, and
WIN probe groups.
**Changes required:** extend `OS_SYSCTLS` in `decnet/os_fingerprint.py`.
| Sysctl | nmap probe group | Windows | Linux | Embedded |
| Sysctl | Purpose | Win | Linux | Embedded |
|---|---|---|---|---|
| `net.ipv4.tcp_timestamps` | SEQ/OPS — timestamp option presence | `0` | `1` | `0` |
| `net.ipv4.tcp_window_scaling` | WIN — window scale option | `1` | `1` | `0` |
| `net.ipv4.tcp_sack` | OPS — SACK permitted option | `1` | `1` | `0` |
| `net.ipv4.tcp_ecn` | ECN probe — explicit congestion notification | `0` | `2` | `0` |
| `net.ipv4.ip_no_pmtu_disc` | IE — DF bit copying in ICMP replies | `0` | `0` | `1` |
| `net.ipv4.tcp_fin_timeout` | T2-T6 — FIN_WAIT duration | `30` | `60` | `15` |
| `net.ipv4.ip_default_ttl` | TTL discriminator | `128` | `64` | `255` |
| `net.ipv4.tcp_syn_retries` | SYN retransmit count | `2` | `6` | `3` |
| `net.ipv4.tcp_timestamps` | TCP timestamp option (OPS probes) | `0` | `1` | `0` |
| `net.ipv4.tcp_window_scaling` | Window scale option | `1` | `1` | `0` |
| `net.ipv4.tcp_sack` | Selective ACK option | `1` | `1` | `0` |
| `net.ipv4.tcp_ecn` | ECN negotiation | `0` | `2` | `0` |
| `net.ipv4.ip_no_pmtu_disc` | DF bit in ICMP replies | `0` | `0` | `1` |
| `net.ipv4.tcp_fin_timeout` | FIN_WAIT_2 timeout (seconds) | `30` | `60` | `15` |
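The table above could be encoded roughly as follows. This is a sketch under assumptions: the names `OS_SYSCTLS` and `get_os_sysctls()` appear in this document, but the internal layout (`_SLUGS`, `_ROWS`), the string-valued entries, and the error raised for an unknown slug are guesses, not the real contents of `decnet/os_fingerprint.py`.

```python
# Hypothetical layout; the real OS_SYSCTLS in decnet/os_fingerprint.py may differ.
_SLUGS = ("windows", "linux", "embedded")

# One row per sysctl, values in _SLUGS order (copied from the table above).
_ROWS = {
    "net.ipv4.ip_default_ttl": (128, 64, 255),
    "net.ipv4.tcp_syn_retries": (2, 6, 3),
    "net.ipv4.tcp_timestamps": (0, 1, 0),
    "net.ipv4.tcp_window_scaling": (1, 1, 0),
    "net.ipv4.tcp_sack": (1, 1, 0),
    "net.ipv4.tcp_ecn": (0, 2, 0),
    "net.ipv4.ip_no_pmtu_disc": (0, 0, 1),
    "net.ipv4.tcp_fin_timeout": (30, 60, 15),
}

# Pivot the rows into one {sysctl: value} profile per OS slug.
OS_SYSCTLS = {
    slug: {name: str(values[i]) for name, values in _ROWS.items()}
    for i, slug in enumerate(_SLUGS)
}


def get_os_sysctls(nmap_os: str) -> dict[str, str]:
    """Return a copy of the sysctl profile for an nmap_os slug."""
    try:
        return dict(OS_SYSCTLS[nmap_os])
    except KeyError:
        # Assumption: unknown slugs are an error rather than a silent default.
        raise ValueError(f"unknown nmap_os slug: {nmap_os!r}") from None
```

Returning a copy keeps callers from mutating the shared profile when they add per-decky overrides.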
> **Highest single-value impact:** setting `net.ipv4.tcp_timestamps = 0` for
> Windows is the strongest signal. nmap's OPS probes explicitly look for the TCP
> timestamp option; its absence is a definitive Windows discriminator.
### Live scan results (Windows decky, 2026-04-10)
**Expected result after Phase 1:** `nmap -O` correctly identifies OS family in
the vast majority of scans. `p0f` passive fingerprinting becomes significantly
more convincing.
**What works:**
| nmap field | Expected | Got | Status |
|---|---|---|---|
| TTL (`T=`) | `80` (128 dec) | `T=80` | ✅ |
| TCP timestamps (`TS=`) | `U` (unsupported) | `TS=U` | ✅ |
| ECN (`CC=`) | `N` | `CC=N` | ✅ |
| TCP window (`W1=`) | `FAF0` (64240) | `W1=FAF0` | ✅ |
| TCP options (`O1=`) | `M5B4NNSNWA` | `O1=M5B4NNSNWA` | ✅ |
| SACK | present | present | ✅ |
| DF bit | `DF=Y` | `DF=Y` | ✅ |
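nmap prints these fields in hexadecimal and packs the TCP options into a compact string, which makes the table harder to read than it needs to be. The sketch below decodes the three values shown above; the helper names are illustrative, and the option letters follow nmap's documented fingerprint notation (`M` = MSS, `W` = window scale, `T` = timestamp, `N` = NOP, `S` = SACK permitted, `L` = end of list).

```python
import re

# One alternative per option kind; hex arguments follow M and W.
_OPTION_RE = re.compile(r"M([0-9A-F]+)|W([0-9A-F]+)|T([01]{2})|([NSL])")
_FLAG_NAMES = {"N": "NOP", "S": "SACK permitted", "L": "EOL"}


def hex_field(value: str) -> int:
    """Convert a hex-encoded nmap field such as T= or W1= to an integer."""
    return int(value, 16)


def decode_tcp_options(opts: str) -> list[str]:
    """Expand an nmap O1..O6 string into readable option names.

    Characters the pattern does not recognise are skipped silently,
    which is acceptable for a reading aid but not for validation.
    """
    decoded = []
    for mss, wscale, timestamp, flag in _OPTION_RE.findall(opts):
        if mss:
            decoded.append(f"MSS={int(mss, 16)}")
        elif wscale:
            decoded.append(f"WSCALE={int(wscale, 16)}")
        elif timestamp:
            decoded.append(f"TIMESTAMP({timestamp})")
        else:
            decoded.append(_FLAG_NAMES[flag])
    return decoded


if __name__ == "__main__":
    print(hex_field("80"))      # TTL from T=80
    print(hex_field("FAF0"))    # window from W1=FAF0
    print(decode_tcp_options("M5B4NNSNWA"))
```

Decoded this way, `T=80` is a TTL of 128, `W1=FAF0` is a window of 64240, and `M5B4` is an MSS of 1460.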
**What fails:**
| nmap field | Expected (Win) | Got | Impact |
|---|---|---|---|
| IP ID (`TI=`) | `I` (incremental) | `Z` (all zeros) | **Critical** — no Windows fingerprint in nmap's DB has `TI=Z`. This alone makes nmap report "Linux 2.4/2.6 embedded" at 91% confidence |
| ICMP rate limiting | unlimited | Linux default rate | Minor — affects `IE`/`U1` probe groups |
**Key finding:** `TI=Z` is the **single remaining blocker** for a convincing
Windows fingerprint. Everything else (TTL, window, timestamps, ECN, SACK, DF)
is already correct. The Phase 2 window mangling originally planned is
**unnecessary** — the kernel already produces the correct 64240 value.
---
### Phase 2 — TCP Window Size Mangling (Medium effort, Very high impact)
## Remaining Improvement Phases
nmap's WIN probes record the raw **TCP window size** in SYN-ACK replies. This
is the single most discriminating feature after TTL. It cannot be set with
per-namespace sysctls because `net.core.rmem_default` is global.
### Phase 2 — ICMP Tuning via Sysctls (Low effort, Medium impact)
The fix is an **iptables rule applied at base container startup** via a custom
entrypoint script.
Two additional namespace-scoped sysctls control ICMP error rate limiting.
nmap's `IE` and `U1` probe groups measure how quickly the target responds to
ICMP and UDP-to-closed-port probes.
#### Target window sizes by OS
**Changes required:** add to `OS_SYSCTLS` in `decnet/os_fingerprint.py`.
| OS | TCP Window Size | Notes |
|---|---|---|
| Windows 10 / 11 | `64240` | Most common modern value |
| Windows 7 / Server 2008 | `8192` | Classic Windows signature |
| Linux 5.x / 6.x | `29200` | Default `tcp_rmem` min/4 |
| Linux 4.x | `43690` | Older default |
| FreeBSD / macOS | `65535` | BSD signature |
| Embedded / Cisco | `4128` to `8760` | Varies widely |
| Sysctl | What it controls | Windows | Linux | Embedded |
|---|---|---|---|---|
| `net.ipv4.icmp_ratelimit` | Minimum ms between ICMP error messages | `0` (none) | `1000` (1/sec) | `1000` |
| `net.ipv4.icmp_ratemask` | Bitmask of ICMP types subject to rate limiting | `0` | `6168` | `6168` |
#### Implementation sketch
**Why:** Windows does not rate-limit ICMP error responses. Linux defaults to
1000ms between ICMP errors (effectively 1 per second per destination). When
nmap sends rapid-fire UDP probes to closed ports, a Windows machine replies to
all of them instantly while a Linux machine throttles responses. Setting
`icmp_ratelimit=0` for Windows makes the `U1` probe response timing match.
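The `6168` default for `icmp_ratemask` is easier to reason about as a bitmask, where bit *n* selects ICMP type *n*. The short sketch below decodes it; the helper name and the small type table are illustrative, not part of DECNET.

```python
# Subset of ICMP type numbers relevant here (IANA assignments).
ICMP_TYPES = {
    0: "echo reply",
    3: "destination unreachable",
    4: "source quench",
    5: "redirect",
    8: "echo request",
    11: "time exceeded",
    12: "parameter problem",
}


def ratemask_types(mask: int) -> list[str]:
    """List the ICMP types whose bit is set in an icmp_ratemask value."""
    return [
        ICMP_TYPES.get(bit, f"type {bit}")
        for bit in range(32)
        if mask & (1 << bit)
    ]


if __name__ == "__main__":
    print(ratemask_types(6168))  # Linux default
    print(ratemask_types(0))     # proposed Windows profile
```

Port-unreachable replies, which the `U1` probe elicits, are ICMP type 3, so they fall under the default mask and are throttled unless the mask or the rate limit is cleared.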
Add a parameterized entrypoint script (`templates/base/entrypoint.sh`) that
receives the target window size as an environment variable and applies an
`iptables` MANGLE rule before yielding to `sleep infinity`:
**Estimated effort:** 15 min — same pattern as Phase 1, just two more entries.
```bash
#!/bin/sh
# Apply TCP window size spoofing via iptables mangle
if [ -n "$SPOOF_TCP_WINDOW" ]; then
iptables -t mangle -A POSTROUTING -p tcp \
-j TCPMSS --set-mss 1460
# Clamp outgoing window to the target value
# Requires xt_TCPMSS kernel module on the host
fi
exec sleep infinity
---
### Phase 3 — NFQUEUE IP ID Rewriting (Medium effort, Very high impact)
This is the **highest-priority remaining item** and the only way to fix `TI=Z`.
#### Root cause of `TI=Z`
The Linux kernel's `ip_select_ident()` path sets the IP Identification
field to `0` for DF=1 (don't-fragment) packets that are not tied to a connected
socket, which includes the SYN-ACK replies that nmap's SEQ probes measure. This
is permitted by RFC 6864 (the IP ID carries no meaning when DF=1), but no Windows
fingerprint in nmap's database has `TI=Z`. **No namespace-scoped sysctl can
change this**: it is hardcoded in the kernel's IP output path.
Note: `ip_no_pmtu_disc` does NOT fix this. That sysctl controls Path MTU
Discovery for UDP/ICMP paths only, not TCP IP ID generation. Setting it to 1
for Windows was tested and confirmed to have no effect on `TI=Z`.
#### Solution: NFQUEUE userspace packet rewriting
Use `iptables -t mangle` to send outgoing TCP packets to an NFQUEUE, where a
small Python daemon rewrites the IP ID field before release.
```
                 ┌──────────────────────────┐
TCP SYN-ACK ───► │ iptables mangle/OUTPUT   │
                 │ -j NFQUEUE --queue-num 0 │
                 └───────────┬──────────────┘
                             ▼
                 ┌──────────────────────────┐
                 │ Python NFQUEUE daemon    │
                 │ 1. Read IP ID field      │
                 │ 2. Replace with target   │
                 │    pattern (sequential   │
                 │    for Windows, zero     │
                 │    for embedded, etc.)   │
                 │ 3. Recalculate checksum  │
                 │ 4. Accept packet         │
                 └───────────┬──────────────┘
                             ▼
                      Packet goes out
```
The composer would inject `SPOOF_TCP_WINDOW` as an environment variable on the
base container, sourced from the OS fingerprint profile.
**Target IP ID patterns by OS:**
| OS | nmap label | Pattern | Implementation |
|---|---|---|---|
| Windows | `TI=I` | Sequential, incrementing by 1 per packet | Global atomic counter |
| Linux 3.x+ | `TI=Z` | Zero (DF=1) or randomized | Leave untouched (already correct) |
| Embedded/Cisco | `TI=I` or `TI=Z` | Varies by device | Sequential or zero |
| BSD | `TI=RI` | Randomized incremental | Counter + small random delta |
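The core of the rewrite step can be sketched without any NFQUEUE dependency: patch the 16-bit ID at byte offset 4 of the IPv4 header, then recompute the header checksum at offset 10. The function names below are hypothetical and this is not the planned `nfq_spoofer.py`; it only illustrates the "sequential counter" pattern from the table above.

```python
import itertools
import struct


def ipv4_checksum(header: bytes) -> int:
    """One's-complement sum over the header (RFC 1071).

    Call with the checksum field zeroed to compute a new checksum; a header
    that already carries a valid checksum yields 0.
    """
    total = sum(struct.unpack(f"!{len(header) // 2}H", header))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def rewrite_ip_id(packet: bytes, new_id: int) -> bytes:
    """Return a copy of an IPv4 packet with its ID field replaced."""
    ihl = (packet[0] & 0x0F) * 4                  # header length in bytes
    header = bytearray(packet[:ihl])
    struct.pack_into("!H", header, 4, new_id & 0xFFFF)   # Identification
    struct.pack_into("!H", header, 10, 0)                # zero old checksum
    struct.pack_into("!H", header, 10, ipv4_checksum(bytes(header)))
    return bytes(header) + packet[ihl:]           # payload is untouched


# Process-wide counter, wrapped to 16 bits, for the Windows-style pattern.
_counter = itertools.count(1)


def next_sequential_id() -> int:
    return next(_counter) & 0xFFFF
```

Only the IP header checksum needs updating: the TCP checksum covers a pseudo-header of addresses, protocol, and length, not the IP ID. In a daemon built on the `netfilterqueue` package, the queue callback would presumably pass `pkt.get_payload()` through `rewrite_ip_id()`, hand the result to `pkt.set_payload()`, and then call `pkt.accept()`.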
**Two possible approaches:**
1. **TCPOPTSTRIP + NFQUEUE (comprehensive)**
- `TCPOPTSTRIP` can strip/modify TCP options (window scale, SACK, etc.)
via pure iptables rules, no userspace needed
- `NFQUEUE` handles IP-layer rewriting (IP ID) in userspace
- Combined: full control over the TCP/IP fingerprint
2. **NFQUEUE only (simpler)**
- Single Python daemon handles everything: IP ID rewriting, and optionally
TCP option/window manipulation if ever needed
- Fewer moving parts, one daemon to monitor
**Required changes:**
- `os_fingerprint.py` — add `tcp_window` field to each OS profile.
- `composer.py` — pass `SPOOF_TCP_WINDOW` env var to base container.
- `templates/base/entrypoint.sh` — new file, applies the iptables rule.
- `templates/base/Dockerfile` — new file, minimal image with `iptables`.
- `templates/base/Dockerfile` — new, installs `iptables` + `python3-netfilterqueue`
- `templates/base/entrypoint.sh` — new, sets up iptables rules + launches daemon
- `templates/base/nfq_spoofer.py` — new, the NFQUEUE packet rewriting daemon
- `os_fingerprint.py` — add `ip_id_pattern` field to each OS profile
- `composer.py` — pass `SPOOF_IP_ID` env var + use `templates/base/Dockerfile`
instead of bare distro images for base containers
> **Note:** requires `NET_ADMIN` capability (already granted) and the
> `xt_TCPMSS` and `xt_mangle` kernel modules loaded on the host. Both are
> present in any standard Linux distribution kernel.
**Dependencies on the host kernel:**
- `nfnetlink_queue` module (`modprobe nfnetlink_queue`)
- `xt_NFQUEUE` module (standard in all distro kernels)
- `NET_ADMIN` capability (already granted)
**Dependencies in the base container image:**
- `iptables` package
- `python3` + `python3-netfilterqueue` (or `scapy` with `NetfilterQueue`)
**Estimated effort:** 4-6 hours + tests
---
### Phase 3 — ICMP Response Tuning (Medium effort, Medium impact)
### Phase 4 — Full Fingerprint Database Matching (Hard, Low marginal impact)
nmap's `IE` probe group sends two ICMP echo requests with specific ToS values,
code fields, and payload sizes and inspects what the target returns. Currently
nothing in DECNET controls ICMP echo reply behavior.
After Phases 2-3, the remaining fingerprint differences are increasingly minor:
**Namespace-scoped sysctls to add per-OS:**
| Sysctl | Effect | Windows | Linux |
|---|---|---|---|
| `net.ipv4.icmp_ratelimit` | Packets/sec rate limit on ICMP errors | `0` (none) | `100` |
| `net.ipv4.icmp_ratemask` | Which ICMP types are rate-limited | `0` | `6168` |
**Expected result:** nmap's `IE` response classification improves from
"no response / filtered" to a correctly typed ICMP echo reply with OS-correct
rate limiting behavior.
---
### Phase 4 — IP ID Sequence Behavior (Hard, Medium impact)
nmap's SEQ probe group fires 6 TCP SYN packets in rapid succession and measures
the **IP ID increment pattern** across responses:
| OS | IP ID pattern | nmap label |
| Signal | Current | Notes |
|---|---|---|
| Windows (most) | Sequential, incrementing | `I` (incremental) |
| Linux 3.x+ | Per-socket hashed/random | `RI` or `RD` |
| Old Linux / BSD | Global counter (truly sequential) | `I` |
| Embedded | Often constant 0 or sequential | varies |
| TCP initial sequence number (ISN) pattern (`SP=`, `ISR=`) | Linux kernel default | Kernel-level, not spoofable without userspace TCP |
| TCP window variance across probes | Constant (`FAF0` × 6) | Real Windows sometimes varies slightly |
| T2/T3 responses | `R=N` (no response) | Correct for some Windows, wrong for others |
| ICMP data payload echo | Linux default | Difficult to control per-namespace |
Linux switched to per-socket hashed IDs at the kernel level (~3.x). This
**cannot be changed per network namespace** without patching the kernel or
replacing the TCP/IP stack with a userspace implementation.
These are diminishing returns. With Phases 1-3 complete, `nmap -O` should
correctly identify the OS family in >90% of scans.
**Options:**
1. **Accept the limitation** — the IP ID pattern is one of many signals; getting
TTL + window + timestamps right is already a very strong fingerprint match.
2. **Userspace TCP proxy** (e.g., `lwIP` or a custom `nfqueue`-based responder)
that intercepts SYN packets and replies with forged ID sequences. High
complexity; requires `NFQUEUE` kernel module and `libnetfilter_queue`.
> Phase 4 is **not recommended** for the near term. The complexity-to-realism
> ratio is poor compared to Phases 1-3.
> Phase 4 is **not recommended** for the near term. Effort is measured in days
> for single-digit percentage improvements.
---
## Implementation Priority
## Implementation Priority (revised)
```
Phase 1 ────────────────────────────────── (implement next)
└─ 5 new sysctls in os_fingerprint.py
└─ No new files, no Docker changes
└─ Estimated effort: 30 min
Phase 1 ✅ DONE ─────────────────────────────
└─ 8 sysctls per OS in os_fingerprint.py
└─ Verified: TTL, window, timestamps, ECN, SACK all correct
Phase 2 ────────────────────────────────── (implement after Phase 1)
└─ templates/base/Dockerfile + entrypoint.sh
└─ os_fingerprint.py: add tcp_window field
└─ composer.py: pass env var to base container
└─ Estimated effort: 2-3 hours + tests
Phase 2 ──────────────────────────────── (implement next)
└─ 2 more sysctls: icmp_ratelimit + icmp_ratemask
└─ Estimated effort: 15 min
Phase 3 ────────────────────────────────── (nice to have)
└─ 2 more sysctls in os_fingerprint.py
└─ Estimated effort: 15 min (after Phase 1 infra exists)
Phase 3 ──────────────────────────────── (high priority)
└─ NFQUEUE daemon in templates/base/
└─ Fix TI=Z for Windows (THE remaining blocker)
└─ Estimated effort: 4-6 hours + tests
Phase 4 ────────────────────────────────── (not recommended short-term)
└─ Requires kernel-level or userspace TCP stack work
└─ Estimated effort: days
Phase 4 ──────────────────────────────── (not recommended)
└─ ISN pattern, T2/T3, ICMP payload echo
└─ Estimated effort: days, diminishing returns
```
---
@@ -196,22 +215,34 @@ After each phase, validate with:
# Active OS fingerprint scan against a deployed decky
sudo nmap -O --osscan-guess <decky_ip>
# Aggressive scan with version detection
sudo nmap -sV -O -A --osscan-guess <decky_ip>
# Passive fingerprinting (run on host while generating traffic to decky)
sudo p0f -i <macvlan_interface> -p
# Quick TTL + window check
sudo nmap -sS --script banner <decky_ip>
hping3 -S -p 22 <decky_ip> # inspect TTL and window in reply
hping3 -S -p 445 <decky_ip> # inspect TTL and window in reply
# Test INI (all OS families, 10 deckies)
sudo .venv/bin/decnet deploy --config arche-test.ini --interface eth0
```
Expected outcomes by phase:
### Expected outcomes by phase
| Check | Pre-Phase 1 | Post-Phase 1 | Post-Phase 2 |
|---|---|---|---|
| TTL | ✅ | ✅ | ✅ |
| TCP timestamps | ❌ | ✅ | ✅ |
| TCP window size | ❌ | | ✅ |
| ICMP behavior | ❌ | ⚠️ | ⚠️ |
| IP ID sequence | ❌ | ❌ | |
| `nmap -O` family match | ⚠️ | | ✅ |
| `p0f` match | ⚠️ | ⚠️ | ✅ |
| Check | Pre-Phase 1 | Post-Phase 1 | Post-Phase 2 | Post-Phase 3 |
|---|---|---|---|---|
| TTL | ✅ | ✅ | ✅ | ✅ |
| TCP timestamps | ❌ | ✅ | ✅ | ✅ |
| TCP window size | ❌ | ✅ (kernel default OK) | ✅ | ✅ |
| ECN | ❌ | ✅ | ✅ | ✅ |
| ICMP rate limiting | ❌ | ❌ | ✅ | ✅ |
| IP ID sequence (`TI=`) | ❌ | ❌ | ❌ | ✅ |
| `nmap -O` family match | ⚠️ | ⚠️ (TI=Z blocks) | ⚠️ | ✅ |
| `p0f` match | ⚠️ | ⚠️ | ✅ | ✅ |
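Checking these rows by eye against raw nmap output gets tedious across many deckies. The sketch below parses a fingerprint block into probe groups and reports the fields that differ from an expected profile. The function names and `EXPECTED_WINDOWS` are illustrative; the expected values are the ones listed in the live scan tables, and the sample text is assembled from those same values rather than copied from a real scan.

```python
import re

# Expected values for a Windows decky, taken from the live scan tables.
EXPECTED_WINDOWS = {
    ("SEQ", "TI"): "I",
    ("SEQ", "TS"): "U",
    ("WIN", "W1"): "FAF0",
    ("ECN", "CC"): "N",
}


def parse_fingerprint(text: str) -> dict[str, dict[str, str]]:
    """Split nmap fingerprint text into {group: {field: value}}.

    nmap wraps the block across lines prefixed with "OS:", so the lines
    are joined before matching GROUP(field=value%field=value) sections.
    """
    joined = "".join(
        line.strip().removeprefix("OS:") for line in text.splitlines()
    )
    groups = {}
    for name, body in re.findall(r"([A-Z0-9]+)\(([^)]*)\)", joined):
        groups[name] = dict(
            field.split("=", 1) for field in body.split("%") if "=" in field
        )
    return groups


def mismatches(groups, expected):
    """Return {(group, field): (wanted, got)} for every differing field."""
    result = {}
    for (group, field), wanted in expected.items():
        got = groups.get(group, {}).get(field)
        if got != wanted:
            result[(group, field)] = (wanted, got)
    return result


if __name__ == "__main__":
    sample = "OS:SEQ(TI=Z%TS=U)WIN(W1=FAF0)\nOS:ECN(R=Y%DF=Y%T=80%CC=N)"
    print(mismatches(parse_fingerprint(sample), EXPECTED_WINDOWS))
```

With the post-Phase 1 values, the only reported mismatch is `TI`, which matches the table above: everything except the IP ID sequence already passes.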
### Note on `P=` field in nmap output
The `P=x86_64-redhat-linux-gnu` that appears in the `SCAN(...)` block is the
**GNU build triple of the nmap binary itself**, not a fingerprint of the target.
It cannot be changed and is not relevant to OS spoofing.