docs(HARDENING): rewrite roadmap based on live scan findings
Phase 1 is complete. Live testing revealed:
- Window size (64240) is already correct, so the Phase 2 window mangling is unnecessary
- TI=Z (IP ID = 0) is the single remaining blocker for Windows spoofing
- ip_no_pmtu_disc does NOT fix TI=Z (tested and confirmed)

Revised phase plan:
- Phase 2: ICMP tuning (icmp_ratelimit + icmp_ratemask sysctls)
- Phase 3: NFQUEUE daemon for IP ID rewriting (fixes TI=Z)
- Phase 4: diminishing returns, not recommended

Added detailed NFQUEUE architecture, TCPOPTSTRIP notes, and a note clarifying the P= field in nmap output.
@@ -6,184 +6,203 @@ scanners see the intended OS rather than a generic Linux kernel.
|
||||
|
||||
---
|
||||
|
||||
## Current State
|
||||
## Current State (Post-Phase 1)
|
||||
|
||||
OS spoofing is partially implemented. Each archetype declares an `nmap_os` slug
|
||||
(e.g. `"windows"`, `"linux"`, `"embedded"`). The **composer** resolves that slug
|
||||
via `os_fingerprint.get_os_sysctls()` and injects the resulting kernel parameters
|
||||
into the **base container** as Docker `sysctls`. Service containers inherit the
|
||||
same network namespace via `network_mode: "service:<base>"` and therefore appear
|
||||
identical to outside scanners.
|
||||
Phase 1 is **implemented and tested against live scans**. Each archetype declares
|
||||
an `nmap_os` slug (e.g. `"windows"`, `"linux"`, `"embedded"`). The **composer**
|
||||
resolves that slug via `os_fingerprint.get_os_sysctls()` and injects the resulting
|
||||
kernel parameters into the **base container** as Docker `sysctls`. Service
|
||||
containers inherit the same network namespace via `network_mode: "service:<base>"`
|
||||
and therefore appear identical to outside scanners.
|
||||
|
||||
### Currently tuned knobs
|
||||
### Implemented sysctls (8 per OS profile)
|
||||
|
||||
| Sysctl | Purpose |
|
||||
|---|---|
|
||||
| `net.ipv4.ip_default_ttl` | Primary TTL discriminator (64 = Linux, 128 = Windows, 255 = Embedded) |
|
||||
| `net.ipv4.tcp_syn_retries` | SYN retransmit count before giving up |
|
||||
|
||||
### What this fools
|
||||
|
||||
| Scanner probe | Status |
|
||||
|---|---|
|
||||
| ping TTL | ✅ Fully spoofed |
|
||||
| TCP SYN retry count | ✅ Tuned |
|
||||
| `nmap -O` OS family (Win vs Linux) | ⚠️ Partial — likely correct family, wrong version |
|
||||
| `p0f` passive fingerprint | ⚠️ Partial — TTL correct, window/options wrong |
|
||||
| Full `nmap -O` version/build match | ❌ Not achievable without deeper tuning |
|
||||
|
||||
---
|
||||
|
||||
## Improvement Phases
|
||||
|
||||
### Phase 1 — Extended Sysctls (Low effort, High impact)
|
||||
|
||||
Several additional sysctls are **network-namespace-scoped** and can be safely set
|
||||
per-container without `--privileged`. These directly affect nmap's SEQ, OPS, and
|
||||
WIN probe groups.
|
||||
|
||||
**Changes required:** extend `OS_SYSCTLS` in `decnet/os_fingerprint.py`.
|
||||
|
||||
| Sysctl | nmap probe group | Windows | Linux | Embedded |
|
||||
| Sysctl | Purpose | Win | Linux | Embedded |
|
||||
|---|---|---|---|---|
|
||||
| `net.ipv4.tcp_timestamps` | SEQ/OPS — timestamp option presence | `0` | `1` | `0` |
|
||||
| `net.ipv4.tcp_window_scaling` | WIN — window scale option | `1` | `1` | `0` |
|
||||
| `net.ipv4.tcp_sack` | OPS — SACK permitted option | `1` | `1` | `0` |
|
||||
| `net.ipv4.tcp_ecn` | ECN probe — explicit congestion notification | `0` | `2` | `0` |
|
||||
| `net.ipv4.ip_no_pmtu_disc` | IE — DF bit copying in ICMP replies | `0` | `0` | `1` |
|
||||
| `net.ipv4.tcp_fin_timeout` | T2–T6 — FIN_WAIT duration | `30` | `60` | `15` |
|
||||
| `net.ipv4.ip_default_ttl` | TTL discriminator | `128` | `64` | `255` |
|
||||
| `net.ipv4.tcp_syn_retries` | SYN retransmit count | `2` | `6` | `3` |
|
||||
| `net.ipv4.tcp_timestamps` | TCP timestamp option (OPS probes) | `0` | `1` | `0` |
|
||||
| `net.ipv4.tcp_window_scaling` | Window scale option | `1` | `1` | `0` |
|
||||
| `net.ipv4.tcp_sack` | Selective ACK option | `1` | `1` | `0` |
|
||||
| `net.ipv4.tcp_ecn` | ECN negotiation | `0` | `2` | `0` |
|
||||
| `net.ipv4.ip_no_pmtu_disc` | DF bit in ICMP replies | `0` | `0` | `1` |
|
||||
| `net.ipv4.tcp_fin_timeout` | FIN_WAIT_2 timeout (seconds) | `30` | `60` | `15` |
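
The table above maps directly onto a per-OS dictionary. The following is a minimal sketch of how `OS_SYSCTLS` and `get_os_sysctls()` might be laid out; only those two names come from this document. The tuple-to-dict layout, the string-typed values, and the `ValueError` on an unknown slug are assumptions made for illustration and may not match `decnet/os_fingerprint.py`.

```python
# Sketch only: values are copied from the table above; the data layout and
# error handling are assumptions, not the project's actual code.
_SYSCTL_KEYS = (
    "net.ipv4.ip_default_ttl",
    "net.ipv4.tcp_syn_retries",
    "net.ipv4.tcp_timestamps",
    "net.ipv4.tcp_window_scaling",
    "net.ipv4.tcp_sack",
    "net.ipv4.tcp_ecn",
    "net.ipv4.ip_no_pmtu_disc",
    "net.ipv4.tcp_fin_timeout",
)

# One row of values per nmap_os slug, in the same order as _SYSCTL_KEYS.
_PROFILE_VALUES = {
    "windows": (128, 2, 0, 1, 1, 0, 0, 30),
    "linux": (64, 6, 1, 1, 1, 2, 0, 60),
    "embedded": (255, 3, 0, 0, 0, 0, 1, 15),
}

# Values are stored as strings so they can be emitted unchanged into a
# compose-style `sysctls` mapping.
OS_SYSCTLS = {
    slug: {key: str(value) for key, value in zip(_SYSCTL_KEYS, values)}
    for slug, values in _PROFILE_VALUES.items()
}


def get_os_sysctls(nmap_os: str) -> dict:
    """Return a copy of the sysctl mapping for the given nmap_os slug."""
    try:
        return dict(OS_SYSCTLS[nmap_os])
    except KeyError:
        raise ValueError(f"unknown nmap_os slug: {nmap_os!r}") from None
```

Returning a copy keeps callers from mutating the shared profile by accident when they add per-deployment overrides.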
|
||||
|
||||
> **Highest single-value impact:** setting `net.ipv4.tcp_timestamps = 0` for
|
||||
> Windows is the strongest signal. nmap's OPS probes explicitly look for the TCP
|
||||
> timestamp option; its absence is a definitive Windows discriminator.
|
||||
### Live scan results (Windows decky, 2026-04-10)
|
||||
|
||||
**Expected result after Phase 1:** `nmap -O` correctly identifies OS family in
|
||||
the vast majority of scans. `p0f` passive fingerprinting becomes significantly
|
||||
more convincing.
|
||||
**What works:**
|
||||
|
||||
| nmap field | Expected | Got | Status |
|---|---|---|---|
| TTL (`T=`) | `80` (128 dec) | `T=80` | ✅ |
| TCP timestamps (`TS=`) | `U` (unsupported) | `TS=U` | ✅ |
| ECN (`CC=`) | `N` | `CC=N` | ✅ |
| TCP window (`W1=`) | `FAF0` (64240) | `W1=FAF0` | ✅ |
| Window options (`O1=`) | `M5B4NNSNWA` | `O1=M5B4NNSNWA` | ✅ |
| SACK | present | present | ✅ |
| DF bit | `DF=Y` | `DF=Y` | ✅ |
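
The hex values and the option string in this table can be checked by hand. The helper below is an illustrative sketch (not part of DECNET; all function names are hypothetical) that decodes nmap's hex fields and its TCP option notation, where `M` is MSS, `N` is NOP, `S` is SACK-permitted, `W` is window scale, `T` is timestamp and `L` is end-of-options.

```python
HEX_DIGITS = "0123456789ABCDEF"
SIMPLE_OPTIONS = {"L": "EOL", "N": "NOP", "S": "SACK-permitted"}


def decode_hex_field(value: str) -> int:
    """Convert an nmap hex field such as T=80 or W1=FAF0 to an integer."""
    return int(value, 16)


def decode_tcp_options(ops: str) -> list:
    """Decode an nmap option string such as 'M5B4NNSNWA' into (name, value) pairs."""
    decoded = []
    i = 0
    while i < len(ops):
        letter = ops[i]
        i += 1
        if letter in SIMPLE_OPTIONS:
            decoded.append((SIMPLE_OPTIONS[letter], None))
        elif letter in ("M", "W"):
            # MSS and window scale carry a hex argument of variable length.
            start = i
            while i < len(ops) and ops[i] in HEX_DIGITS:
                i += 1
            name = "MSS" if letter == "M" else "WScale"
            decoded.append((name, int(ops[start:i], 16)))
        elif letter == "T":
            # Timestamp is followed by two flag characters (TSval, TSecr).
            decoded.append(("Timestamp", ops[i:i + 2]))
            i += 2
        else:
            raise ValueError(f"unexpected option letter {letter!r}")
    return decoded
```

Applied to the observed values, `T=80` decodes to 128, `W1=FAF0` to 64240, and `O1=M5B4NNSNWA` to an MSS of 1460 followed by NOP, NOP, SACK-permitted, NOP and a window scale of 10.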
|
||||
|
||||
**What fails:**
|
||||
|
||||
| nmap field | Expected (Win) | Got | Impact |
|---|---|---|---|
| IP ID (`TI=`) | `I` (incremental) | `Z` (all zeros) | **Critical**: no Windows fingerprint in nmap's DB has `TI=Z`. This alone makes nmap report "Linux 2.4/2.6 embedded" at 91% confidence |
| ICMP rate limiting | unlimited | Linux default rate | Minor: affects the `IE`/`U1` probe groups |
|
||||
|
||||
**Key finding:** `TI=Z` is the **single remaining blocker** for a convincing
|
||||
Windows fingerprint. Everything else (TTL, window, timestamps, ECN, SACK, DF)
|
||||
is already correct. The Phase 2 window mangling originally planned is
|
||||
**unnecessary** — the kernel already produces the correct 64240 value.
|
||||
|
||||
---
|
||||
|
||||
### Phase 2 — TCP Window Size Mangling (Medium effort, Very high impact)
|
||||
## Remaining Improvement Phases
|
||||
|
||||
nmap's WIN probes record the raw **TCP window size** in SYN-ACK replies. This
|
||||
is the single most discriminating feature after TTL. It cannot be set with
|
||||
per-namespace sysctls because `net.core.rmem_default` is global.
|
||||
### Phase 2 — ICMP Tuning via Sysctls (Low effort, Medium impact)
|
||||
|
||||
The fix is an **iptables rule applied at base container startup** via a custom
|
||||
entrypoint script.
|
||||
Two additional namespace-scoped sysctls control ICMP error rate limiting.
|
||||
nmap's `IE` and `U1` probe groups measure how quickly the target responds to
|
||||
ICMP and UDP-to-closed-port probes.
|
||||
|
||||
#### Target window sizes by OS
|
||||
**Changes required:** add to `OS_SYSCTLS` in `decnet/os_fingerprint.py`.
|
||||
|
||||
| OS | TCP Window Size | Notes |
|
||||
|---|---|---|
|
||||
| Windows 10 / 11 | `64240` | Most common modern value |
|
||||
| Windows 7 / Server 2008 | `8192` | Classic Windows signature |
|
||||
| Linux 5.x / 6.x | `29200` | Default `tcp_rmem` min/4 |
|
||||
| Linux 4.x | `43690` | Older default |
|
||||
| FreeBSD / macOS | `65535` | BSD signature |
|
||||
| Embedded / Cisco | `4128`–`8760` | Varies widely |
|
||||
| Sysctl | What it controls | Windows | Linux | Embedded |
|---|---|---|---|---|
| `net.ipv4.icmp_ratelimit` | Minimum interval (ms) between rate-limited ICMP messages; `0` disables limiting | `0` (none) | `1000` (1/sec) | `1000` |
| `net.ipv4.icmp_ratemask` | Bitmask of ICMP types subject to rate limiting (bit N = ICMP type N) | `0` | `6168` | `6168` |
|
||||
|
||||
#### Implementation sketch
|
||||
**Why:** Windows does not rate-limit ICMP error responses. Linux defaults to
|
||||
1000ms between ICMP errors (effectively 1 per second per destination). When
|
||||
nmap sends rapid-fire UDP probes to closed ports, a Windows machine replies to
|
||||
all of them instantly while a Linux machine throttles responses. Setting
|
||||
`icmp_ratelimit=0` for Windows makes the `U1` probe response timing match.
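
The `icmp_ratemask` value `6168` is easier to reason about once expanded: each set bit N marks ICMP type N as rate-limited. The short sketch below (helper names are hypothetical) decodes and rebuilds the mask; the type names are the standard ICMP assignments.

```python
ICMP_TYPE_NAMES = {
    0: "Echo Reply",
    3: "Destination Unreachable",
    4: "Source Quench",
    5: "Redirect",
    8: "Echo Request",
    11: "Time Exceeded",
    12: "Parameter Problem",
}


def ratemask_types(mask: int) -> list:
    """Return the ICMP type numbers whose bits are set in an icmp_ratemask."""
    return [icmp_type for icmp_type in range(32) if mask & (1 << icmp_type)]


def build_ratemask(icmp_types) -> int:
    """Build an icmp_ratemask value from an iterable of ICMP type numbers."""
    mask = 0
    for icmp_type in icmp_types:
        mask |= 1 << icmp_type
    return mask


def describe_ratemask(mask: int) -> list:
    """Human-readable names for the rate-limited types in a mask."""
    return [ICMP_TYPE_NAMES.get(t, f"type {t}") for t in ratemask_types(mask)]
```

For the Linux default, `ratemask_types(6168)` yields types 3, 4, 11 and 12, which includes the Destination Unreachable replies that nmap's `U1` probe elicits. A mask of `0` rate-limits nothing.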
|
||||
|
||||
Add a parameterized entrypoint script (`templates/base/entrypoint.sh`) that
|
||||
receives the target window size as an environment variable and applies an
|
||||
`iptables` MANGLE rule before yielding to `sleep infinity`:
|
||||
**Estimated effort:** 15 min — same pattern as Phase 1, just two more entries.
|
||||
|
||||
```bash
|
||||
#!/bin/sh
|
||||
# Apply TCP window size spoofing via iptables mangle
|
||||
if [ -n "$SPOOF_TCP_WINDOW" ]; then
|
||||
iptables -t mangle -A POSTROUTING -p tcp \
|
||||
-j TCPMSS --set-mss 1460
|
||||
# Clamp outgoing window to the target value
|
||||
# Requires xt_TCPMSS kernel module on the host
|
||||
fi
|
||||
exec sleep infinity
|
||||
---
|
||||
|
||||
### Phase 3 — NFQUEUE IP ID Rewriting (Medium effort, Very high impact)
|
||||
|
||||
This is the **highest-priority remaining item** and the only way to fix `TI=Z`.
|
||||
|
||||
#### Root cause of `TI=Z`
|
||||
|
||||
nmap derives `TI` from the IP Identification field of the SYN-ACK replies to its
SEQ probes. The Linux kernel's `ip_select_ident()` logic sets that field to `0`
for these TCP replies because they carry DF=1 (don't-fragment bit set). This is
permitted by RFC 6864, under which the IP ID carries no meaning for an
unfragmented DF=1 datagram, but no Windows fingerprint in nmap's database has
`TI=Z`. **No namespace-scoped sysctl can change this**: the behavior is
hardcoded in the kernel's IPv4 output path.
|
||||
|
||||
Note: `ip_no_pmtu_disc` does NOT fix this. That sysctl controls Path MTU
|
||||
Discovery for UDP/ICMP paths only, not TCP IP ID generation. Setting it to 1
|
||||
for Windows was tested and confirmed to have no effect on `TI=Z`.
|
||||
|
||||
#### Solution: NFQUEUE userspace packet rewriting
|
||||
|
||||
Use `iptables -t mangle` to send outgoing TCP packets to an NFQUEUE, where a
|
||||
small Python daemon rewrites the IP ID field before release.
|
||||
|
||||
```
|
||||
┌──────────────────────────┐
|
||||
TCP SYN-ACK ───► │ iptables mangle/OUTPUT │
|
||||
│ -j NFQUEUE --queue-num 0 │
|
||||
└───────────┬──────────────┘
|
||||
▼
|
||||
┌──────────────────────────┐
|
||||
│ Python NFQUEUE daemon │
|
||||
│ 1. Read IP ID field │
|
||||
│ 2. Replace with target │
|
||||
│ pattern (sequential │
|
||||
│ for Windows, zero │
|
||||
│ for embedded, etc.) │
|
||||
│ 3. Recalculate checksum │
|
||||
│ 4. Accept packet │
|
||||
└───────────┬──────────────┘
|
||||
▼
|
||||
Packet goes out
|
||||
```
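
Steps 1 to 3 of the daemon in the diagram can be sketched with the standard library alone. The function names below are hypothetical and this is not the project's `nfq_spoofer.py`; it only shows the byte-level rewrite. The TCP checksum does not need to change, because its pseudo-header covers addresses, protocol and length but not the IP ID.

```python
import struct


def ipv4_checksum(header: bytes) -> int:
    """Compute the one's-complement checksum over an IPv4 header."""
    if len(header) % 2:
        header += b"\x00"
    total = sum(struct.unpack(f"!{len(header) // 2}H", header))
    # Fold any carries back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def rewrite_ip_id(packet: bytes, new_id: int) -> bytes:
    """Return a copy of an IPv4 packet with its ID field and header checksum replaced."""
    if not packet or packet[0] >> 4 != 4:
        raise ValueError("not an IPv4 packet")
    header_len = (packet[0] & 0x0F) * 4
    if header_len < 20 or len(packet) < header_len:
        raise ValueError("truncated or malformed IPv4 header")
    header = bytearray(packet[:header_len])
    struct.pack_into("!H", header, 4, new_id & 0xFFFF)  # Identification field
    struct.pack_into("!H", header, 10, 0)  # zero the checksum before recomputing
    struct.pack_into("!H", header, 10, ipv4_checksum(bytes(header)))
    return bytes(header) + packet[header_len:]
```

In a daemon built on the `NetfilterQueue` Python package, the queue callback would presumably read the raw bytes with `get_payload()`, pass them through `rewrite_ip_id()`, write them back with `set_payload()` and then call `accept()`. That wiring is left out here because it needs root privileges and a live queue to run.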
|
||||
|
||||
The composer would inject `SPOOF_TCP_WINDOW` as an environment variable on the
|
||||
base container, sourced from the OS fingerprint profile.
|
||||
**Target IP ID patterns by OS:**
|
||||
|
||||
| OS | nmap label | Pattern | Implementation |
|---|---|---|---|
| Windows | `TI=I` | Sequential, incrementing by 1 per packet | Global atomic counter |
| Linux 3.x+ | `TI=Z` | Zero (DF=1) or randomized | Leave untouched (already correct) |
| Embedded/Cisco | `TI=I` or `TI=Z` | Varies by device | Sequential or zero |
| BSD | `TI=RI` | Randomized incremental | Counter + small random delta |
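
The patterns in this table could be produced by a small stateful generator. The class below is a sketch under assumptions: the class name, the pattern strings and the `max_delta` parameter are all hypothetical, and the delta range for the BSD-style pattern is a placeholder that would need tuning against real scans to land in nmap's `RI` class.

```python
import random
from typing import Optional


class IpIdSequence:
    """Produce IP ID values following one of the patterns in the table above."""

    PATTERNS = ("sequential", "zero", "random_increment", "untouched")

    def __init__(self, pattern: str, start: int = 1, max_delta: int = 4,
                 rng: Optional[random.Random] = None):
        if pattern not in self.PATTERNS:
            raise ValueError(f"unknown IP ID pattern: {pattern!r}")
        self.pattern = pattern
        self._current = start & 0xFFFF
        self._max_delta = max_delta
        self._rng = rng or random.Random()

    def next_id(self) -> Optional[int]:
        """Return the next ID, or None when the packet should be left as-is."""
        if self.pattern == "untouched":
            return None
        if self.pattern == "zero":
            return 0
        value = self._current
        if self.pattern == "sequential":
            step = 1
        else:
            step = self._rng.randint(1, self._max_delta)
        # The ID field is 16 bits wide, so the counter wraps at 65536.
        self._current = (self._current + step) & 0xFFFF
        return value
```

One instance per base container matches the "global counter" idea, since all service containers share that network namespace. A single-threaded daemon needs no locking; a threaded one would have to guard `next_id()`.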
|
||||
|
||||
**Two possible approaches:**
|
||||
|
||||
1. **TCPOPTSTRIP + NFQUEUE (comprehensive)**
|
||||
- `TCPOPTSTRIP` can strip TCP options (window scale, SACK, etc.) by
replacing them with NOPs via pure iptables mangle rules, no userspace
needed; it removes options but cannot rewrite their values
|
||||
- `NFQUEUE` handles IP-layer rewriting (IP ID) in userspace
|
||||
- Combined: full control over the TCP/IP fingerprint
|
||||
|
||||
2. **NFQUEUE only (simpler)**
|
||||
- Single Python daemon handles everything: IP ID rewriting, and optionally
|
||||
TCP option/window manipulation if ever needed
|
||||
- Fewer moving parts, one daemon to monitor
|
||||
|
||||
**Required changes:**
|
||||
- `os_fingerprint.py` — add `tcp_window` field to each OS profile.
|
||||
- `composer.py` — pass `SPOOF_TCP_WINDOW` env var to base container.
|
||||
- `templates/base/entrypoint.sh` — new file, applies the iptables rule.
|
||||
- `templates/base/Dockerfile` — new file, minimal image with `iptables`.
|
||||
- `templates/base/Dockerfile` — new, installs `iptables` + `python3-netfilterqueue`
|
||||
- `templates/base/entrypoint.sh` — new, sets up iptables rules + launches daemon
|
||||
- `templates/base/nfq_spoofer.py` — new, the NFQUEUE packet rewriting daemon
|
||||
- `os_fingerprint.py` — add `ip_id_pattern` field to each OS profile
|
||||
- `composer.py` — pass `SPOOF_IP_ID` env var + use `templates/base/Dockerfile`
|
||||
instead of bare distro images for base containers
|
||||
|
||||
> **Note:** requires `NET_ADMIN` capability (already granted) and the
|
||||
> `xt_TCPMSS` and `xt_mangle` kernel modules loaded on the host. Both are
|
||||
> present in any standard Linux distribution kernel.
|
||||
**Dependencies on the host kernel:**
|
||||
- `nfnetlink_queue` module (`modprobe nfnetlink_queue`)
|
||||
- `xt_NFQUEUE` module (standard in all distro kernels)
|
||||
- `NET_ADMIN` capability (already granted)
|
||||
|
||||
**Dependencies in the base container image:**
|
||||
- `iptables` package
|
||||
- `python3` + `python3-netfilterqueue` (or `scapy` with `NetfilterQueue`)
|
||||
|
||||
**Estimated effort:** 4–6 hours + tests
|
||||
|
||||
---
|
||||
|
||||
### Phase 3 — ICMP Response Tuning (Medium effort, Medium impact)
|
||||
### Phase 4 — Full Fingerprint Database Matching (Hard, Low marginal impact)
|
||||
|
||||
nmap's `IE` probe group sends two ICMP echo requests with specific ToS values,
|
||||
code fields, and payload sizes and inspects what the target returns. Currently
|
||||
nothing in DECNET controls ICMP echo reply behavior.
|
||||
After Phases 2–3, the remaining fingerprint differences are increasingly minor:
|
||||
|
||||
**Namespace-scoped sysctls to add per-OS:**
|
||||
|
||||
| Sysctl | Effect | Windows | Linux |
|
||||
|---|---|---|---|
|
||||
| `net.ipv4.icmp_ratelimit` | Packets/sec rate limit on ICMP errors | `0` (none) | `100` |
|
||||
| `net.ipv4.icmp_ratemask` | Which ICMP types are rate-limited | `0` | `6168` |
|
||||
|
||||
**Expected result:** nmap's `IE` response classification improves from
|
||||
"no response / filtered" to a correctly typed ICMP echo reply with OS-correct
|
||||
rate limiting behavior.
|
||||
|
||||
---
|
||||
|
||||
### Phase 4 — IP ID Sequence Behavior (Hard, Medium impact)
|
||||
|
||||
nmap's SEQ probe group fires 6 TCP SYN packets in rapid succession and measures
|
||||
the **IP ID increment pattern** across responses:
|
||||
|
||||
| OS | IP ID pattern | nmap label |
|
||||
| Signal | Current | Notes |
|
||||
|---|---|---|
|
||||
| Windows (most) | Sequential, incrementing | `I` (incremental) |
|
||||
| Linux 3.x+ | Per-socket hashed/random | `RI` or `RD` |
|
||||
| Old Linux / BSD | Global counter (truly sequential) | `I` |
|
||||
| Embedded | Often constant 0 or sequential | varies |
|
||||
| TCP initial sequence number (ISN) pattern (`SP=`, `ISR=`) | Linux kernel default | Kernel-level, not spoofable without userspace TCP |
|
||||
| TCP window variance across probes | Constant (`FAF0` × 6) | Real Windows sometimes varies slightly |
|
||||
| T2/T3 responses | `R=N` (no response) | Correct for some Windows, wrong for others |
|
||||
| ICMP data payload echo | Linux default | Difficult to control per-namespace |
|
||||
|
||||
Linux switched to per-socket hashed IDs at the kernel level (~3.x). This
|
||||
**cannot be changed per network namespace** without patching the kernel or
|
||||
replacing the TCP/IP stack with a userspace implementation.
|
||||
These are diminishing returns. With Phases 1–3 complete, `nmap -O` should
|
||||
correctly identify the OS family in >90% of scans.
|
||||
|
||||
**Options:**
|
||||
1. **Accept the limitation** — the IP ID pattern is one of many signals; getting
|
||||
TTL + window + timestamps right is already a very strong fingerprint match.
|
||||
2. **Userspace TCP proxy** (e.g., `lwIP` or a custom `nfqueue`-based responder)
|
||||
that intercepts SYN packets and replies with forged ID sequences. High
|
||||
complexity; requires `NFQUEUE` kernel module and `libnetfilter_queue`.
|
||||
|
||||
> Phase 4 is **not recommended** for the near term. The complexity-to-realism
|
||||
> ratio is poor compared to Phases 1–3.
|
||||
> Phase 4 is **not recommended** for the near term. Effort is measured in days
|
||||
> for single-digit percentage improvements.
|
||||
|
||||
---
|
||||
|
||||
## Implementation Priority
|
||||
## Implementation Priority (revised)
|
||||
|
||||
```
|
||||
Phase 1 ────────────────────────────────── (implement next)
|
||||
└─ 5 new sysctls in os_fingerprint.py
|
||||
└─ No new files, no Docker changes
|
||||
└─ Estimated effort: 30 min
|
||||
Phase 1 ✅ DONE ─────────────────────────────
|
||||
└─ 8 sysctls per OS in os_fingerprint.py
|
||||
└─ Verified: TTL, window, timestamps, ECN, SACK all correct
|
||||
|
||||
Phase 2 ────────────────────────────────── (implement after Phase 1)
|
||||
└─ templates/base/Dockerfile + entrypoint.sh
|
||||
└─ os_fingerprint.py: add tcp_window field
|
||||
└─ composer.py: pass env var to base container
|
||||
└─ Estimated effort: 2–3 hours + tests
|
||||
Phase 2 ──────────────────────────────── (implement next)
|
||||
└─ 2 more sysctls: icmp_ratelimit + icmp_ratemask
|
||||
└─ Estimated effort: 15 min
|
||||
|
||||
Phase 3 ────────────────────────────────── (nice to have)
|
||||
└─ 2 more sysctls in os_fingerprint.py
|
||||
└─ Estimated effort: 15 min (after Phase 1 infra exists)
|
||||
Phase 3 ──────────────────────────────── (high priority)
|
||||
└─ NFQUEUE daemon in templates/base/
|
||||
└─ Fix TI=Z for Windows (THE remaining blocker)
|
||||
└─ Estimated effort: 4–6 hours + tests
|
||||
|
||||
Phase 4 ────────────────────────────────── (not recommended short-term)
|
||||
└─ Requires kernel-level or userspace TCP stack work
|
||||
└─ Estimated effort: days
|
||||
Phase 4 ──────────────────────────────── (not recommended)
|
||||
└─ ISN pattern, T2/T3, ICMP payload echo
|
||||
└─ Estimated effort: days, diminishing returns
|
||||
```
|
||||
|
||||
---
|
||||
@@ -196,22 +215,34 @@ After each phase, validate with:
|
||||
# Active OS fingerprint scan against a deployed decky
|
||||
sudo nmap -O --osscan-guess <decky_ip>
|
||||
|
||||
# Aggressive scan with version detection
|
||||
sudo nmap -sV -O -A --osscan-guess <decky_ip>
|
||||
|
||||
# Passive fingerprinting (run on host while generating traffic to decky)
|
||||
sudo p0f -i <macvlan_interface> -p
|
||||
|
||||
# Quick TTL + window check
|
||||
sudo nmap -sS --script banner <decky_ip>
|
||||
hping3 -S -p 22 <decky_ip> # inspect TTL and window in reply
|
||||
hping3 -S -p 445 <decky_ip> # inspect TTL and window in reply
|
||||
|
||||
# Test INI (all OS families, 10 deckies)
|
||||
sudo .venv/bin/decnet deploy --config arche-test.ini --interface eth0
|
||||
```
|
||||
|
||||
Expected outcomes by phase:
|
||||
### Expected outcomes by phase
|
||||
|
||||
| Check | Pre-Phase 1 | Post-Phase 1 | Post-Phase 2 |
|
||||
|---|---|---|---|
|
||||
| TTL | ✅ | ✅ | ✅ |
|
||||
| TCP timestamps | ❌ | ✅ | ✅ |
|
||||
| TCP window size | ❌ | ❌ | ✅ |
|
||||
| ICMP behavior | ❌ | ⚠️ | ⚠️ |
|
||||
| IP ID sequence | ❌ | ❌ | ❌ |
|
||||
| `nmap -O` family match | ⚠️ | ✅ | ✅ |
|
||||
| `p0f` match | ⚠️ | ⚠️ | ✅ |
|
||||
| Check | Pre-Phase 1 | Post-Phase 1 ✅ | Post-Phase 2 | Post-Phase 3 |
|---|---|---|---|---|
| TTL | ✅ | ✅ | ✅ | ✅ |
| TCP timestamps | ❌ | ✅ | ✅ | ✅ |
| TCP window size | ❌ | ✅ (kernel default OK) | ✅ | ✅ |
| ECN | ❌ | ✅ | ✅ | ✅ |
| ICMP rate limiting | ❌ | ❌ | ✅ | ✅ |
| IP ID sequence (`TI=`) | ❌ | ❌ | ❌ | ✅ |
| `nmap -O` family match | ⚠️ | ⚠️ (TI=Z blocks) | ⚠️ | ✅ |
| `p0f` match | ⚠️ | ⚠️ | ✅ | ✅ |
|
||||
|
||||
### Note on `P=` field in nmap output
|
||||
|
||||
The `P=x86_64-redhat-linux-gnu` that appears in the `SCAN(...)` block is the
|
||||
**GNU build triple of the nmap binary itself**, not a fingerprint of the target.
|
||||
It cannot be changed and is not relevant to OS spoofing.
|
||||
|
||||