Heavyweight Layer-2 extractors land alongside the cheap projections
shipped in commit e9324aca, so the EmailLifter R0042 / R0046 (macros
/ password / smuggling lanes) / R0048 fire from the bus payload
without the lifter having to reach back to disk.
Extractors:
* body_simhash — inlined 64-bit Charikar simhash (md5-keyed,
frequency-weighted) over word tokens of the union of text/* body
parts. Inlined rather than pulling the `simhash` PyPI dep, which
transitively brings numpy ~50 MB into a slim decky container; the
algorithm is ~15 lines and identical in extraction quality.
* body_base64_bytes — largest decoded base64 chunk's byte count,
scanning text body parts with the same `_BASE64_RE` the lifter's
`_p_encoded_payload` fallback uses. R0048 fires from this scalar
alone; the lifter's body_text fallback becomes dead in normal
operation.
* attachment_macro_indicator — stdlib zipfile sniff for
`vbaProject.bin` inside OOXML containers. Catches modern .docm /
.xlsm / .pptm and macro-injected .docx; legacy .xls (CFBF) is a
follow-up.
* attachment_encrypted — flag_bits & 0x01 on any ZIP / OOXML entry's
central directory; magic-byte match for 7z / RAR / CFBF (encrypted
Office wrap).
* html_smuggling — structural lxml parse first: fires when an `<a
download>` element coexists with a `<script>` referencing
`Blob` / `Uint8Array` / `URL.createObjectURL`. Regex pair-check
fallback on lxml parse failure (real-world phish HTML is often
malformed). Cuts the FP rate that pure-regex would produce on
legitimate "click to download" links.
Add `python3-lxml` (~5 MB Debian package, C-extension, no transitive
Python deps) to the SMTP decky's Dockerfile. simhash stays inline.
Per the dependency rule: lxml earns its weight by cutting R0046's
OR-combined FP rate; a heavier macro-detection lib (oletools ~5 MB
pure-python with msoffcrypto) would not measurably improve the
boolean signal we need, so stdlib stays for that lane.
27 lines
908 B
Docker
27 lines
908 B
Docker
ARG BASE_IMAGE=debian:bookworm-slim
|
|
FROM ${BASE_IMAGE}
|
|
|
|
RUN apt-get update && apt-get install -y --no-install-recommends \
|
|
python3 \
|
|
python3-lxml \
|
|
&& rm -rf /var/lib/apt/lists/*
|
|
|
|
COPY syslog_bridge.py /opt/syslog_bridge.py
|
|
COPY instance_seed.py /opt/instance_seed.py
|
|
COPY server.py /opt/server.py
|
|
COPY entrypoint.sh /entrypoint.sh
|
|
RUN chmod +x /entrypoint.sh
|
|
|
|
EXPOSE 25 587
|
|
RUN useradd -r -s /bin/false -d /opt logrelay \
|
|
&& apt-get update && apt-get install -y --no-install-recommends libcap2-bin \
|
|
&& rm -rf /var/lib/apt/lists/* \
|
|
&& (find /usr/bin/ -maxdepth 1 -name 'python3*' -type f -exec setcap 'cap_net_bind_service+eip' {} \; 2>/dev/null || true)
|
|
|
|
HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 \
|
|
CMD kill -0 1 || exit 1
|
|
|
|
# Entrypoint runs as root so it can fix quarantine dir permissions before
|
|
# dropping to logrelay via su.
|
|
ENTRYPOINT ["/entrypoint.sh"]
|