docs(HARDENING): rewrite roadmap based on live scan findings

Phase 1 is complete. Live testing revealed:
- Window size (64240) is already correct — Phase 2 window mangling unnecessary
- TI=Z (IP ID = 0) is the single remaining blocker for Windows spoofing
- ip_no_pmtu_disc does NOT fix TI=Z (tested and confirmed)

Revised phase plan:
- Phase 2: ICMP tuning (icmp_ratelimit + icmp_ratemask sysctls)
- Phase 3: NFQUEUE daemon for IP ID rewriting (fixes TI=Z)
- Phase 4: diminishing returns, not recommended

Added detailed NFQUEUE architecture, TCPOPTSTRIP notes, and a
note clarifying the P= field in nmap output.
2026-04-10 16:38:27 -04:00
parent 6df2c9ccbf
commit 62a67f3d1d


@@ -6,184 +6,203 @@ scanners see the intended OS rather than a generic Linux kernel.
--- ---
## Current State ## Current State (Post-Phase 1)
OS spoofing is partially implemented. Each archetype declares an `nmap_os` slug Phase 1 is **implemented and tested against live scans**. Each archetype declares
(e.g. `"windows"`, `"linux"`, `"embedded"`). The **composer** resolves that slug an `nmap_os` slug (e.g. `"windows"`, `"linux"`, `"embedded"`). The **composer**
via `os_fingerprint.get_os_sysctls()` and injects the resulting kernel parameters resolves that slug via `os_fingerprint.get_os_sysctls()` and injects the resulting
into the **base container** as Docker `sysctls`. Service containers inherit the kernel parameters into the **base container** as Docker `sysctls`. Service
same network namespace via `network_mode: "service:<base>"` and therefore appear containers inherit the same network namespace via `network_mode: "service:<base>"`
identical to outside scanners. and therefore appear identical to outside scanners.
### Currently tuned knobs ### Implemented sysctls (8 per OS profile)
| Sysctl | Purpose | | Sysctl | Purpose | Win | Linux | Embedded |
|---|---|
| `net.ipv4.ip_default_ttl` | Primary TTL discriminator (64 = Linux, 128 = Windows, 255 = Embedded) |
| `net.ipv4.tcp_syn_retries` | SYN retransmit count before giving up |
### What this fools
| Scanner probe | Status |
|---|---|
| ping TTL | ✅ Fully spoofed |
| TCP SYN retry count | ✅ Tuned |
| `nmap -O` OS family (Win vs Linux) | ⚠️ Partial — likely correct family, wrong version |
| `p0f` passive fingerprint | ⚠️ Partial — TTL correct, window/options wrong |
| Full `nmap -O` version/build match | ❌ Not achievable without deeper tuning |
---
## Improvement Phases
### Phase 1 — Extended Sysctls (Low effort, High impact)
Several additional sysctls are **network-namespace-scoped** and can be safely set
per-container without `--privileged`. These directly affect nmap's SEQ, OPS, and
WIN probe groups.
**Changes required:** extend `OS_SYSCTLS` in `decnet/os_fingerprint.py`.
| Sysctl | nmap probe group | Windows | Linux | Embedded |
|---|---|---|---|---| |---|---|---|---|---|
| `net.ipv4.tcp_timestamps` | SEQ/OPS — timestamp option presence | `0` | `1` | `0` | | `net.ipv4.ip_default_ttl` | TTL discriminator | `128` | `64` | `255` |
| `net.ipv4.tcp_window_scaling` | WIN — window scale option | `1` | `1` | `0` | | `net.ipv4.tcp_syn_retries` | SYN retransmit count | `2` | `6` | `3` |
| `net.ipv4.tcp_sack` | OPS — SACK permitted option | `1` | `1` | `0` | | `net.ipv4.tcp_timestamps` | TCP timestamp option (OPS probes) | `0` | `1` | `0` |
| `net.ipv4.tcp_ecn` | ECN probe — explicit congestion notification | `0` | `2` | `0` | | `net.ipv4.tcp_window_scaling` | Window scale option | `1` | `1` | `0` |
| `net.ipv4.ip_no_pmtu_disc` | IE — DF bit copying in ICMP replies | `0` | `0` | `1` | | `net.ipv4.tcp_sack` | Selective ACK option | `1` | `1` | `0` |
| `net.ipv4.tcp_fin_timeout` | T2–T6 — FIN_WAIT duration | `30` | `60` | `15` | | `net.ipv4.tcp_ecn` | ECN negotiation | `0` | `2` | `0` |
| `net.ipv4.ip_no_pmtu_disc` | DF bit in ICMP replies | `0` | `0` | `1` |
| `net.ipv4.tcp_fin_timeout` | FIN_WAIT_2 timeout (seconds) | `30` | `60` | `15` |
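The eight-sysctl table can be pictured as a per-OS mapping in `decnet/os_fingerprint.py`. The following is a minimal sketch, assuming `OS_SYSCTLS` is a plain dict keyed by the `nmap_os` slug and that `get_os_sysctls()` returns string values ready for Docker `sysctls`. The dict layout and the fallback to the Linux profile are assumptions of this sketch, not confirmed details of the module.

```python
from typing import Dict

# Hypothetical layout; the values are copied from the sysctl table in this document.
OS_SYSCTLS: Dict[str, Dict[str, str]] = {
    "windows": {
        "net.ipv4.ip_default_ttl": "128",
        "net.ipv4.tcp_syn_retries": "2",
        "net.ipv4.tcp_timestamps": "0",
        "net.ipv4.tcp_window_scaling": "1",
        "net.ipv4.tcp_sack": "1",
        "net.ipv4.tcp_ecn": "0",
        "net.ipv4.ip_no_pmtu_disc": "0",
        "net.ipv4.tcp_fin_timeout": "30",
    },
    "linux": {
        "net.ipv4.ip_default_ttl": "64",
        "net.ipv4.tcp_syn_retries": "6",
        "net.ipv4.tcp_timestamps": "1",
        "net.ipv4.tcp_window_scaling": "1",
        "net.ipv4.tcp_sack": "1",
        "net.ipv4.tcp_ecn": "2",
        "net.ipv4.ip_no_pmtu_disc": "0",
        "net.ipv4.tcp_fin_timeout": "60",
    },
    "embedded": {
        "net.ipv4.ip_default_ttl": "255",
        "net.ipv4.tcp_syn_retries": "3",
        "net.ipv4.tcp_timestamps": "0",
        "net.ipv4.tcp_window_scaling": "0",
        "net.ipv4.tcp_sack": "0",
        "net.ipv4.tcp_ecn": "0",
        "net.ipv4.ip_no_pmtu_disc": "1",
        "net.ipv4.tcp_fin_timeout": "15",
    },
}


def get_os_sysctls(nmap_os: str) -> Dict[str, str]:
    """Return a copy of the sysctl profile for the given slug.

    Falling back to the Linux profile for unknown slugs is an assumption
    of this sketch.
    """
    return dict(OS_SYSCTLS.get(nmap_os, OS_SYSCTLS["linux"]))
```

Returning a copy keeps callers such as the composer from mutating the shared profile when they add per-deployment overrides.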
> **Highest single-value impact:** setting `net.ipv4.tcp_timestamps = 0` for ### Live scan results (Windows decky, 2026-04-10)
> Windows is the strongest signal. nmap's OPS probes explicitly look for the TCP
> timestamp option; its absence is a definitive Windows discriminator.
**Expected result after Phase 1:** `nmap -O` correctly identifies OS family in **What works:**
the vast majority of scans. `p0f` passive fingerprinting becomes significantly
more convincing. | nmap field | Expected | Got | Status |
|---|---|---|---|
| TTL (`T=`) | `80` (128 dec) | `T=80` | ✅ |
| TCP timestamps (`TS=`) | `U` (unsupported) | `TS=U` | ✅ |
| ECN (`CC=`) | `N` | `CC=N` | ✅ |
| TCP window (`W1=`) | `FAF0` (64240) | `W1=FAF0` | ✅ |
| Window options (`O1=`) | `M5B4NNSNWA` | `O1=M5B4NNSNWA` | ✅ |
| SACK | present | present | ✅ |
| DF bit | `DF=Y` | `DF=Y` | ✅ |
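nmap prints several of these fields in hexadecimal, which is why a TTL of 128 shows up as `T=80` and a window of 64240 as `W1=FAF0`. A small sketch of the conversion, where the helper names are hypothetical and the `M` prefix handling assumes the MSS value in `O1=` is hex-encoded as in nmap's fingerprint format:

```python
def decode_hex_field(value: str) -> int:
    """Convert a hex-encoded nmap fingerprint value to an integer."""
    return int(value, 16)


def decode_mss(options: str) -> int:
    """Extract the MSS from an option string that starts with 'M<hex>'."""
    if not options.startswith("M"):
        raise ValueError("option string does not start with an MSS entry")
    digits = ""
    for char in options[1:]:
        if char not in "0123456789ABCDEF":
            break  # first non-hex character starts the next option
        digits += char
    return int(digits, 16)


ttl = decode_hex_field("80")        # 128
window = decode_hex_field("FAF0")   # 64240
mss = decode_mss("M5B4NNSNWA")      # 1460
```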
**What fails:**
| nmap field | Expected (Win) | Got | Impact |
|---|---|---|---|
| IP ID (`TI=`) | `I` (incremental) | `Z` (all zeros) | **Critical** — no Windows fingerprint in nmap's DB has `TI=Z`. This alone causes 91% confidence "Linux 2.4/2.6 embedded" |
| ICMP rate limiting | unlimited | Linux default rate | Minor — affects `IE`/`U1` probe groups |
**Key finding:** `TI=Z` is the **single remaining blocker** for a convincing
Windows fingerprint. Everything else (TTL, window, timestamps, ECN, SACK, DF)
is already correct. The Phase 2 window mangling originally planned is
**unnecessary** — the kernel already produces the correct 64240 value.
--- ---
### Phase 2 — TCP Window Size Mangling (Medium effort, Very high impact) ## Remaining Improvement Phases
nmap's WIN probes record the raw **TCP window size** in SYN-ACK replies. This ### Phase 2 — ICMP Tuning via Sysctls (Low effort, Medium impact)
is the single most discriminating feature after TTL. It cannot be set with
per-namespace sysctls because `net.core.rmem_default` is global.
The fix is an **iptables rule applied at base container startup** via a custom Two additional namespace-scoped sysctls control ICMP error rate limiting.
entrypoint script. nmap's `IE` and `U1` probe groups measure how quickly the target responds to
ICMP and UDP-to-closed-port probes.
#### Target window sizes by OS **Changes required:** add to `OS_SYSCTLS` in `decnet/os_fingerprint.py`.
| OS | TCP Window Size | Notes | | Sysctl | What it controls | Windows | Linux | Embedded |
|---|---|---| |---|---|---|---|---|
| Windows 10 / 11 | `64240` | Most common modern value | | `net.ipv4.icmp_ratelimit` | Minimum ms between ICMP error messages | `0` (none) | `1000` (1/sec) | `1000` |
| Windows 7 / Server 2008 | `8192` | Classic Windows signature | | `net.ipv4.icmp_ratemask` | Bitmask of ICMP types subject to rate limiting | `0` | `6168` | `6168` |
| Linux 5.x / 6.x | `29200` | Default `tcp_rmem` min/4 |
| Linux 4.x | `43690` | Older default |
| FreeBSD / macOS | `65535` | BSD signature |
| Embedded / Cisco | `4128`–`8760` | Varies widely |
#### Implementation sketch **Why:** Windows does not rate-limit ICMP error responses. Linux defaults to
1000ms between ICMP errors (effectively 1 per second per destination). When
nmap sends rapid-fire UDP probes to closed ports, a Windows machine replies to
all of them instantly while a Linux machine throttles responses. Setting
`icmp_ratelimit=0` for Windows makes the `U1` probe response timing match.
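To see what the default mask of `6168` actually covers: the kernel documentation describes `icmp_ratemask` as a bitmask in which bit N corresponds to ICMP type N. The sketch below decodes a mask into type numbers; the helper name and the small name table are illustrative only.

```python
from typing import List

# Names for the ICMP types relevant here (subset, for readability only).
ICMP_TYPE_NAMES = {
    0: "Echo Reply",
    3: "Destination Unreachable",
    4: "Source Quench",
    8: "Echo Request",
    11: "Time Exceeded",
    12: "Parameter Problem",
}


def rate_limited_types(ratemask: int) -> List[int]:
    """Return the ICMP type numbers whose bit is set in the mask."""
    return [bit for bit in range(ratemask.bit_length()) if (ratemask >> bit) & 1]


for icmp_type in rate_limited_types(6168):
    print(icmp_type, ICMP_TYPE_NAMES.get(icmp_type, "unknown"))
```

Type 3 (Destination Unreachable) is the reply a closed UDP port produces, so it is the bit that matters for the `U1` probe. A mask of `0` clears every bit, which is consistent with the Windows column in the table.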
Add a parameterized entrypoint script (`templates/base/entrypoint.sh`) that **Estimated effort:** 15 min — same pattern as Phase 1, just two more entries.
receives the target window size as an environment variable and applies an
`iptables` MANGLE rule before yielding to `sleep infinity`:
```bash ---
#!/bin/sh
# Apply TCP window size spoofing via iptables mangle ### Phase 3 — NFQUEUE IP ID Rewriting (Medium effort, Very high impact)
if [ -n "$SPOOF_TCP_WINDOW" ]; then
iptables -t mangle -A POSTROUTING -p tcp \ This is the **highest-priority remaining item** and the only way to fix `TI=Z`.
-j TCPMSS --set-mss 1460
# Clamp outgoing window to the target value #### Root cause of `TI=Z`
# Requires xt_TCPMSS kernel module on the host
fi The Linux kernel's `ip_select_ident()` function sets the IP Identification
exec sleep infinity field to `0` for all TCP packets where DF=1 (don't-fragment bit set). This is
correct behavior per RFC 6864 ("IP ID is meaningless when DF=1") but no Windows
fingerprint in nmap's database has `TI=Z`. **No namespace-scoped sysctl can
change this** — it's hardcoded in the kernel's TCP stack.
Note: `ip_no_pmtu_disc` does NOT fix this. That sysctl controls Path MTU
Discovery for UDP/ICMP paths only, not TCP IP ID generation. Setting it to 1
for Windows was tested and confirmed to have no effect on `TI=Z`.
#### Solution: NFQUEUE userspace packet rewriting
Use `iptables -t mangle` to send outgoing TCP packets to an NFQUEUE, where a
small Python daemon rewrites the IP ID field before release.
```
┌──────────────────────────┐
TCP SYN-ACK ───► │ iptables mangle/OUTPUT │
│ -j NFQUEUE --queue-num 0 │
└───────────┬──────────────┘
┌──────────────────────────┐
│ Python NFQUEUE daemon │
│ 1. Read IP ID field │
│ 2. Replace with target │
│ pattern (sequential │
│ for Windows, zero │
│ for embedded, etc.) │
│ 3. Recalculate checksum │
│ 4. Accept packet │
└───────────┬──────────────┘
Packet goes out
``` ```
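Steps 2 and 3 in the diagram (replace the ID, recalculate the checksum) can be sketched with the standard library alone. This is a minimal illustration operating on raw IPv4 bytes, not the planned `nfq_spoofer.py`; the function names are hypothetical, and the queue binding that would hand packets to it is deliberately left out.

```python
import struct


def ipv4_checksum(header: bytes) -> int:
    """Ones'-complement checksum over an IPv4 header (RFC 1071).

    The checksum field inside `header` must be zeroed by the caller.
    """
    words = struct.unpack("!%dH" % (len(header) // 2), header)
    total = sum(words)
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


def rewrite_ip_id(packet: bytes, new_id: int) -> bytes:
    """Return a copy of `packet` with a new IP ID and a valid header checksum."""
    header_len = (packet[0] & 0x0F) * 4            # IHL is in 32-bit words
    header = bytearray(packet[:header_len])
    struct.pack_into("!H", header, 4, new_id & 0xFFFF)  # Identification field
    struct.pack_into("!H", header, 10, 0)               # zero old checksum
    struct.pack_into("!H", header, 10, ipv4_checksum(bytes(header)))
    return bytes(header) + packet[header_len:]
```

Only the IP header checksum needs recomputing: the TCP checksum covers a pseudo-header of addresses, protocol, and length, none of which include the IP ID.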
The composer would inject `SPOOF_TCP_WINDOW` as an environment variable on the **Target IP ID patterns by OS:**
base container, sourced from the OS fingerprint profile.
| OS | nmap label | Pattern | Implementation |
|---|---|---|---|
| Windows | `TI=I` | Sequential, incrementing by 1 per packet | Global atomic counter |
| Linux 3.x+ | `TI=Z` | Zero (DF=1) or randomized | Leave untouched (already correct) |
| Embedded/Cisco | `TI=I` or `TI=Z` | Varies by device | Sequential or zero |
| BSD | `TI=RI` | Randomized incremental | Counter + random delta above 1000 (nmap labels deltas under 10 as `I`) |
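The patterns in this table could be produced by one small generator class. The sketch below is an assumption-laden illustration: the class name, the pattern strings, and the step range are hypothetical. The step range for the randomized case follows nmap's documented `TI` classification as I understand it (differences above 1000 and not divisible by 256 read as `RI`), so treat those thresholds as something to verify against live scans.

```python
import random
import threading
from typing import Optional


class IpIdGenerator:
    """Produce 16-bit IP ID values for one OS profile (hypothetical helper)."""

    PATTERNS = {"sequential", "zero", "random_incremental"}

    def __init__(self, pattern: str, start: int = 1, seed: Optional[int] = None):
        if pattern not in self.PATTERNS:
            raise ValueError("unknown IP ID pattern: %r" % pattern)
        self._pattern = pattern
        self._current = start & 0xFFFF
        self._rng = random.Random(seed)
        self._lock = threading.Lock()  # one shared counter across all flows

    def _step(self) -> int:
        if self._pattern == "sequential":
            return 1
        step = self._rng.randint(1001, 5000)
        if step % 256 == 0:
            step += 1  # avoid increments that are multiples of 256
        return step

    def next_id(self) -> int:
        if self._pattern == "zero":
            return 0
        with self._lock:
            value = self._current
            self._current = (self._current + self._step()) & 0xFFFF
            return value
```

The lock stands in for the "global atomic counter" in the table: every packet, regardless of connection, draws from the same sequence, and the value wraps at 65535 because the field is 16 bits wide.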
**Two possible approaches:**
1. **TCPOPTSTRIP + NFQUEUE (comprehensive)**
- `TCPOPTSTRIP` can strip/modify TCP options (window scale, SACK, etc.)
via pure iptables rules, no userspace needed
- `NFQUEUE` handles IP-layer rewriting (IP ID) in userspace
- Combined: full control over the TCP/IP fingerprint
2. **NFQUEUE only (simpler)**
- Single Python daemon handles everything: IP ID rewriting, and optionally
TCP option/window manipulation if ever needed
- Fewer moving parts, one daemon to monitor
**Required changes:** **Required changes:**
- `os_fingerprint.py` — add `tcp_window` field to each OS profile. - `templates/base/Dockerfile` — new, installs `iptables` + `python3-netfilterqueue`
- `composer.py` — pass `SPOOF_TCP_WINDOW` env var to base container. - `templates/base/entrypoint.sh` — new, sets up iptables rules + launches daemon
- `templates/base/entrypoint.sh` — new file, applies the iptables rule. - `templates/base/nfq_spoofer.py` — new, the NFQUEUE packet rewriting daemon
- `templates/base/Dockerfile` — new file, minimal image with `iptables`. - `os_fingerprint.py` — add `ip_id_pattern` field to each OS profile
- `composer.py` — pass `SPOOF_IP_ID` env var + use `templates/base/Dockerfile`
instead of bare distro images for base containers
> **Note:** requires `NET_ADMIN` capability (already granted) and the **Dependencies on the host kernel:**
> `xt_TCPMSS` and `xt_mangle` kernel modules loaded on the host. Both are - `nfnetlink_queue` module (`modprobe nfnetlink_queue`)
> present in any standard Linux distribution kernel. - `xt_NFQUEUE` module (standard in all distro kernels)
- `NET_ADMIN` capability (already granted)
**Dependencies in the base container image:**
- `iptables` package
- `python3` + `python3-netfilterqueue` (or `scapy` with `NetfilterQueue`)
**Estimated effort:** 4–6 hours + tests
--- ---
### Phase 3 — ICMP Response Tuning (Medium effort, Medium impact) ### Phase 4 — Full Fingerprint Database Matching (Hard, Low marginal impact)
nmap's `IE` probe group sends two ICMP echo requests with specific ToS values, After Phases 2–3, the remaining fingerprint differences are increasingly minor:
code fields, and payload sizes and inspects what the target returns. Currently
nothing in DECNET controls ICMP echo reply behavior.
**Namespace-scoped sysctls to add per-OS:** | Signal | Current | Notes |
| Sysctl | Effect | Windows | Linux |
|---|---|---|---|
| `net.ipv4.icmp_ratelimit` | Packets/sec rate limit on ICMP errors | `0` (none) | `100` |
| `net.ipv4.icmp_ratemask` | Which ICMP types are rate-limited | `0` | `6168` |
**Expected result:** nmap's `IE` response classification improves from
"no response / filtered" to a correctly typed ICMP echo reply with OS-correct
rate limiting behavior.
---
### Phase 4 — IP ID Sequence Behavior (Hard, Medium impact)
nmap's SEQ probe group fires 6 TCP SYN packets in rapid succession and measures
the **IP ID increment pattern** across responses:
| OS | IP ID pattern | nmap label |
|---|---|---| |---|---|---|
| Windows (most) | Sequential, incrementing | `I` (incremental) | | TCP initial sequence number (ISN) pattern (`SP=`, `ISR=`) | Linux kernel default | Kernel-level, not spoofable without userspace TCP |
| Linux 3.x+ | Per-socket hashed/random | `RI` or `RD` | | TCP window variance across probes | Constant (`FAF0` × 6) | Real Windows sometimes varies slightly |
| Old Linux / BSD | Global counter (truly sequential) | `I` | | T2/T3 responses | `R=N` (no response) | Correct for some Windows, wrong for others |
| Embedded | Often constant 0 or sequential | varies | | ICMP data payload echo | Linux default | Difficult to control per-namespace |
Linux switched to per-socket hashed IDs at the kernel level (~3.x). This These are diminishing returns. With Phases 1–3 complete, `nmap -O` should
**cannot be changed per network namespace** without patching the kernel or correctly identify the OS family in >90% of scans.
replacing the TCP/IP stack with a userspace implementation.
**Options:** > Phase 4 is **not recommended** for the near term. Effort is measured in days
1. **Accept the limitation** — the IP ID pattern is one of many signals; getting > for single-digit percentage improvements.
TTL + window + timestamps right is already a very strong fingerprint match.
2. **Userspace TCP proxy** (e.g., `lwIP` or a custom `nfqueue`-based responder)
that intercepts SYN packets and replies with forged ID sequences. High
complexity; requires `NFQUEUE` kernel module and `libnetfilter_queue`.
> Phase 4 is **not recommended** for the near term. The complexity-to-realism
> ratio is poor compared to Phases 1–3.
--- ---
## Implementation Priority ## Implementation Priority (revised)
``` ```
Phase 1 ────────────────────────────────── (implement next) Phase 1 ✅ DONE ─────────────────────────────
└─ 5 new sysctls in os_fingerprint.py └─ 8 sysctls per OS in os_fingerprint.py
└─ No new files, no Docker changes └─ Verified: TTL, window, timestamps, ECN, SACK all correct
└─ Estimated effort: 30 min
Phase 2 ────────────────────────────────── (implement after Phase 1) Phase 2 ──────────────────────────────── (implement next)
└─ templates/base/Dockerfile + entrypoint.sh └─ 2 more sysctls: icmp_ratelimit + icmp_ratemask
└─ os_fingerprint.py: add tcp_window field └─ Estimated effort: 15 min
└─ composer.py: pass env var to base container
└─ Estimated effort: 2–3 hours + tests
Phase 3 ────────────────────────────────── (nice to have) Phase 3 ──────────────────────────────── (high priority)
└─ 2 more sysctls in os_fingerprint.py └─ NFQUEUE daemon in templates/base/
└─ Estimated effort: 15 min (after Phase 1 infra exists) └─ Fix TI=Z for Windows (THE remaining blocker)
└─ Estimated effort: 4–6 hours + tests
Phase 4 ────────────────────────────────── (not recommended short-term) Phase 4 ──────────────────────────────── (not recommended)
└─ Requires kernel-level or userspace TCP stack work └─ ISN pattern, T2/T3, ICMP payload echo
└─ Estimated effort: days └─ Estimated effort: days, diminishing returns
``` ```
--- ---
@@ -196,22 +215,34 @@ After each phase, validate with:
# Active OS fingerprint scan against a deployed decky # Active OS fingerprint scan against a deployed decky
sudo nmap -O --osscan-guess <decky_ip> sudo nmap -O --osscan-guess <decky_ip>
# Aggressive scan with version detection
sudo nmap -sV -O -A --osscan-guess <decky_ip>
# Passive fingerprinting (run on host while generating traffic to decky) # Passive fingerprinting (run on host while generating traffic to decky)
sudo p0f -i <macvlan_interface> -p sudo p0f -i <macvlan_interface> -p
# Quick TTL + window check # Quick TTL + window check
sudo nmap -sS --script banner <decky_ip> hping3 -S -p 445 <decky_ip> # inspect TTL and window in reply
hping3 -S -p 22 <decky_ip> # inspect TTL and window in reply
# Test INI (all OS families, 10 deckies)
sudo .venv/bin/decnet deploy --config arche-test.ini --interface eth0
``` ```
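When comparing scans across phases, it can help to pull individual fields out of the fingerprint block that `nmap -O` prints instead of reading it by eye. The following is a rough sketch of such a parser, assuming the usual `GROUP(key=value%key=value)` layout with optional `OS:` line prefixes. The function name is hypothetical, and the sample text is assembled from the values reported in the live scan tables rather than copied from real scan output.

```python
import re
from typing import Dict

_GROUP_RE = re.compile(r"([A-Z0-9]+)\(([^)]*)\)")


def parse_fingerprint(text: str) -> Dict[str, Dict[str, str]]:
    """Parse nmap fingerprint groups into {group: {field: value}}."""
    # nmap wraps the block and prefixes each line with "OS:"; undo both.
    joined = "".join(
        line[3:] if line.startswith("OS:") else line
        for line in text.splitlines()
    )
    groups: Dict[str, Dict[str, str]] = {}
    for name, body in _GROUP_RE.findall(joined):
        fields = dict(item.split("=", 1) for item in body.split("%") if "=" in item)
        groups.setdefault(name, {}).update(fields)
    return groups


# Illustrative sample only, built from the field values listed in this document.
sample = "OS:SEQ(TI=Z%TS=U)OPS(O1=M5B4NNSNWA)\nOS:WIN(W1=FAF0)ECN(R=Y%DF=Y%T=80%CC=N)"
fingerprint = parse_fingerprint(sample)
ti_blocks_windows = fingerprint["SEQ"]["TI"] != "I"
```

A check like `ti_blocks_windows` makes the Phase 3 acceptance criterion explicit: the Windows decky passes once `TI` reads `I` instead of `Z`.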
Expected outcomes by phase: ### Expected outcomes by phase
| Check | Pre-Phase 1 | Post-Phase 1 | Post-Phase 2 | | Check | Pre-Phase 1 | Post-Phase 1 | Post-Phase 2 | Post-Phase 3 |
|---|---|---|---| |---|---|---|---|---|
| TTL | ✅ | ✅ | ✅ | | TTL | ✅ | ✅ | ✅ | ✅ |
| TCP timestamps | ❌ | ✅ | ✅ | | TCP timestamps | ❌ | ✅ | ✅ | ✅ |
| TCP window size | ❌ | | ✅ | | TCP window size | ❌ | ✅ (kernel default OK) | ✅ | ✅ |
| ICMP behavior | ❌ | ⚠️ | ⚠️ | | ECN | ❌ | ✅ | ✅ | ✅ |
| IP ID sequence | ❌ | ❌ | | | ICMP rate limiting | ❌ | ❌ | ✅ | ✅ |
| `nmap -O` family match | ⚠️ | | ✅ | | IP ID sequence (`TI=`) | ❌ | ❌ | ❌ | ✅ |
| `p0f` match | ⚠️ | ⚠️ | ✅ | | `nmap -O` family match | ⚠️ | ⚠️ (TI=Z blocks) | ⚠️ | ✅ |
| `p0f` match | ⚠️ | ⚠️ | ✅ | ✅ |
### Note on `P=` field in nmap output
The `P=x86_64-redhat-linux-gnu` that appears in the `SCAN(...)` block is the
**GNU build triple of the nmap binary itself**, not a fingerprint of the target.
It cannot be changed and is not relevant to OS spoofing.