Troubleshooting¶
When critical errors occur during operation or startup, they are systematically reported through these interfaces:
- The captive portal dashboard (operator-facing pages).
- The engine's console output (visible on the host via eghost logs or journalctl -u enforcegate on the appliance).
- The engine's log file at /var/log/enforcegate/engine.log inside the container (also surfaced to docker logs by the container log pipeline).
- A structured diagnostic file at /var/lib/enforcegate/apm-last-error.json, written atomically on fatal APM (license-validation) failures. The appliance installer reads this file via docker cp to dispatch per-reason hints.
Triage entry points¶
In order of expected payoff when something is wrong:
1. show policy match <url> in eghost cli: instantly tells you what verdict the engine returns and which rule fired. Replaces the "browse to the URL, hit a deny page, read the rule_name off the URL bar" diagnostic dance.
2. eghost status: engine alive? Database connected? API responsive? Connector sessions up?
3. eghost logs: the engine logs at [error] by default; bump to [debug] via [logging].level for verbose tracing.
4. /var/lib/enforcegate/apm-last-error.json: present if the engine failed to start due to a license issue. Read the reason field for a structured dispatch code.
Why is this URL blocked?¶
The canonical diagnostic for "I can't reach <URL>" tickets. Run show policy match <url> in eghost cli to send the URL through the live policy engine without browsing to it.
Three outputs to expect:
Matched + admin viewer (sees the regex pattern):
URI: https://example.com/foo
Matched: yes
Code: 300 (deny — redirect to portal)
Rule ID: 14
Rule name: deny-some-rule_e14
Action: deny
Reason: <the rule's description field>
Pattern: ^(https?:\/\/)?([^/]+\.)?example\.com(/|$)
Matched + monitoring/standard viewer (no pattern — admin-only field):
URI: https://example.com/foo
Matched: yes
Code: 300 (deny — redirect to portal)
Rule ID: 14
Rule name: deny-some-rule_e14
Action: deny
Reason: <the rule's description field>
Not matched — the engine's synthesised [policy].default_action verdict takes effect:
The default-permit / default-deny / default-warn / default-aup synthetic rule name reports which default_action knob value is configured. Flip the no-match posture in engine.conf — see [policy].default_action.
Step 2 — Inspect the matched rule¶
Once you have the rule_name, grep the rule's source:
Or look up the compiled regex directly:
docker exec enforcegate duckdb /var/lib/enforcegate/engine.db \
"SELECT id, name, action, pattern FROM http_policy WHERE id = 14;"
Common false-positive causes¶
- Unescaped . in match-uri-regex: t.co matches t/co, t?co, etc. Hand-written regexes must escape: t\.co. The match-domain-list helper auto-escapes (post-2026-05-27 fix), but match-uri-regex is the operator's responsibility.
- Missing ^ anchor: substring matches everywhere. foo.com matches barfoo.com. Anchor with ^https?:// or ^(https?://)?.
- Missing trailing (/|$): foo.com matches foo.complaint. End-anchor with (/|$) to require a path separator or string end.
The learning-mode synthesiser's regex shape is a good template:
(Optional scheme, optional subdomain, required path delimiter at the end.)
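The three pitfalls above can be reproduced in a few lines. This is a minimal sketch that uses Python's `re` module purely for illustration; the engine compiles rules with its own regex compiler, so edge-case semantics may differ. The helper name `matches` and the `foo.com` template are hypothetical, modelled on the shape described above.

```python
import re

def matches(pattern: str, uri: str) -> bool:
    """Return True if the pattern matches anywhere in the URI (unanchored search)."""
    return re.search(pattern, uri) is not None

# Pitfall 1: an unescaped dot matches any character.
assert matches(r"t.co", "t/co")
assert not matches(r"t\.co", "t/co")

# Pitfall 2: without a ^ anchor the pattern matches as a substring.
assert matches(r"foo\.com", "barfoo.com")

# Pitfall 3: without a trailing (/|$) the match runs on into longer names.
assert matches(r"^(https?://)?foo\.com", "foo.complaint")

# Template shape: optional scheme, optional subdomain, required delimiter.
TEMPLATE = r"^(https?://)?([^/]+\.)?foo\.com(/|$)"

for uri in ("foo.com", "https://foo.com/path", "http://www.foo.com"):
    assert matches(TEMPLATE, uri)
for uri in ("barfoo.com", "foo.complaint", "https://evil.example/foo.com"):
    assert not matches(TEMPLATE, uri)
```

Running a candidate pattern against a handful of should-match and should-not-match URIs like this is a cheap check before committing a hand-written match-uri-regex.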
Toolbox / shared-path deny isn't taking effect¶
Symptom — the toolbox sidecar (or any other writer to [policy].shared_path) has written a deny rule, the file is loaded (visible in show policy files), but the matching traffic still gets through. show policy match <url> reports the verdict as permit and the matched rule as some default-permit-shaped or catch-all rule in [policy].path.
Cause — [policy].path (operator-authored) loads before [policy].shared_path (toolbox / generated) and gets the lower rule ids. A catch-all permit rule in [policy].path therefore wins under lowest-id-wins and shadows every shared-path rule.
Fix — remove the catch-all from [policy].path and rely on [policy].default_action instead:
host> show policy files
File Size Modified Rules
------------------------------------ -------- -------------------- -----
99-default-permit.policy 96 B 2026-04-12 08:00:00 1 ← shadows shared rules
...
host> enable
host# configure
host(config)# delete policy 99-default-permit
host(config)# commit
Then set default_action = "permit" (or "deny" for default-deny) in engine.conf and reload. The synthesised no-match verdict occupies no rule id, so it can no longer shadow shared-path rules. Pre-2026.32.0 deployments shipped this catch-all; from 2026.32.0 it is retired.
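The shadowing effect can be modelled in a few lines. This is a simplified sketch of lowest-id-wins evaluation under stated assumptions, not the engine's implementation: the `Rule` class, the `evaluate` function, and the `toolbox-deny` rule name are hypothetical, and Python's `re` stands in for the engine's regex compiler.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass(frozen=True)
class Rule:
    rule_id: int
    name: str
    action: str
    pattern: str

def evaluate(rules: List[Rule], uri: str, default_action: str) -> Tuple[str, Optional[str]]:
    """Return (action, rule name) for the matching rule with the lowest id.

    When no rule matches, the default action applies and no rule is named.
    """
    for rule in sorted(rules, key=lambda r: r.rule_id):
        if re.search(rule.pattern, uri):
            return rule.action, rule.name
    return default_action, None

# Operator-authored rules load first and get the lower ids.
catch_all = Rule(1, "99-default-permit", "permit", r".*")
# Shared-path rules load afterwards and get higher ids.
shared_deny = Rule(2, "toolbox-deny", "deny", r"^(https?://)?([^/]+\.)?example\.com(/|$)")

# With the catch-all present, the shared-path deny is never reached.
shadowed = evaluate([catch_all, shared_deny], "https://example.com/foo", "permit")

# With the catch-all removed, the deny fires; unmatched traffic uses the default.
fixed = evaluate([shared_deny], "https://example.com/foo", "permit")
fallback = evaluate([shared_deny], "https://other.test/", "permit")

print(shadowed)  # ('permit', '99-default-permit')
print(fixed)     # ('deny', 'toolbox-deny')
print(fallback)  # ('permit', None)
```

The last call shows why the default action cannot shadow anything in this model: it is only consulted after every rule has failed to match.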
Time-scheduled rule fires at the wrong time¶
A rule with a time-window: attribute is firing earlier, later, or not at all relative to what its window says. Two common causes:
- Engine clock is wrong. Time-windows evaluate against the engine's wall clock, so a drifted system clock makes them fire at the drifted time. Confirm in eghost cli:

    host> show clock
    Engine clock
    UTC: 2026-06-06T08:42:15Z
    Local: 2026-06-06 10:42:15 CEST
    Epoch: 1780735335 (unix seconds)

  If the time is off, fix NTP on the engine host (or the appliance's time-sync service). The engine picks up the new clock immediately; no reload is needed.
- Timezone mismatch. [policy].time_window_tz controls whether HH:MM is interpreted as the engine's local time or UTC. The default is local. Deployments with operators in multiple timezones should consider utc so windows mean the same thing everywhere; operators authoring rules then write the window in UTC.
If show policy match reports the rule did not match a request you expected to be inside the window, double-check both the day specifier (weekdays excludes Saturday and Sunday; weekend excludes Monday through Friday) and the wrap-past-midnight form (daily 22:00-06:00 is active overnight, not during the day).
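The day-specifier and wrap-past-midnight behaviour described above can be sketched as follows. This is an illustrative model, not the engine's code: the function name `in_window` is hypothetical, the end of the window is assumed to be exclusive, and the day specifier is assumed to be tested against the day of the evaluated timestamp. The caller is expected to pass a timestamp already expressed in the zone selected by [policy].time_window_tz.

```python
from datetime import datetime, time

DAY_SETS = {
    "daily": {0, 1, 2, 3, 4, 5, 6},
    "weekdays": {0, 1, 2, 3, 4},  # Monday=0 .. Friday=4
    "weekend": {5, 6},            # Saturday, Sunday
}

def in_window(spec: str, now: datetime) -> bool:
    """Check whether `now` falls inside a window such as 'daily 22:00-06:00'."""
    days, span = spec.split()
    start_s, end_s = span.split("-")
    start = time.fromisoformat(start_s)
    end = time.fromisoformat(end_s)
    if now.weekday() not in DAY_SETS[days]:
        return False
    t = now.time()
    if start <= end:
        # Ordinary same-day window.
        return start <= t < end
    # Wrapped window: active from start to midnight and from midnight to end.
    return t >= start or t < end

saturday_noon = datetime(2026, 6, 6, 12, 0)    # 2026-06-06 is a Saturday
saturday_night = datetime(2026, 6, 6, 23, 30)

print(in_window("daily 22:00-06:00", saturday_night))    # True
print(in_window("daily 22:00-06:00", saturday_noon))     # False
print(in_window("weekdays 09:00-17:00", saturday_noon))  # False
print(in_window("weekend 09:00-17:00", saturday_noon))   # True
```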
Engine refuses to start¶
Symptom — exits with -1 immediately, no logs¶
Look at /var/lib/enforcegate/apm-last-error.json. The reason field is one of a stable set:
| Reason | Meaning | Operator fix |
|---|---|---|
| apm.config.serial_missing | [license].serial empty. | Set ENGINE_LICENSE_SERIAL in .env (or write it into engine.conf directly). |
| apm.config.credentials_missing | [license].username or password empty. | Set ENGINE_LICENSE_USERNAME + ENGINE_LICENSE_PASSWORD in .env. |
| apm.cs.unreachable | Can't reach the licensing endpoint (DNS, firewall, proxy). | Verify outbound HTTPS connectivity from the host. For firewall coordination on the specific endpoint, contact support. |
| apm.cs.auth_rejected | Control server refused the credentials. | Verify the username + password with the customer onboarding team. |
| apm.license.expired | License expired. | Renew via the customer onboarding team. |
| apm.fingerprint.mismatch | Host fingerprint doesn't match the license. | Likely a hardware change (different DMI UUID, new machine-id). Contact support to re-activate. |
| apm.permissions.failed | License file modes too permissive (when enforce_permissions = true). | chmod 0400 on the .key and .lic files in /etc/enforcegate/license/. |
| apm.ratchet.tampered | Engine internal state check failed. | Don't roll back system time. If the host's clock genuinely needed re-syncing, contact support. |
The engine's exit code is intentionally generic (-1) for anti-piracy opacity. Operators dispatch on the JSON file, not the exit code.
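Dispatching on the JSON file rather than the exit code could look like the sketch below. Only the reason field is taken from this page; the rest of the file's schema is not assumed. The function name `hint_from_json` and the shortened hint strings are hypothetical, condensed from the table above.

```python
import json

# Hints condensed from the reason table above.
HINTS = {
    "apm.config.serial_missing": "Set ENGINE_LICENSE_SERIAL in .env.",
    "apm.config.credentials_missing": "Set ENGINE_LICENSE_USERNAME and ENGINE_LICENSE_PASSWORD in .env.",
    "apm.cs.unreachable": "Verify outbound HTTPS connectivity from the host.",
    "apm.cs.auth_rejected": "Verify the username and password with the customer onboarding team.",
    "apm.license.expired": "Renew via the customer onboarding team.",
    "apm.fingerprint.mismatch": "Contact support to re-activate.",
    "apm.permissions.failed": "chmod 0400 the .key and .lic files in /etc/enforcegate/license/.",
    "apm.ratchet.tampered": "Do not roll back system time; contact support if the clock needed re-syncing.",
}

def hint_from_json(raw: str) -> str:
    """Map the diagnostic file's `reason` field to an operator hint."""
    try:
        doc = json.loads(raw)
    except json.JSONDecodeError:
        return "Diagnostic file is not valid JSON; attach it to a ticket as-is."
    reason = doc.get("reason") if isinstance(doc, dict) else None
    if not isinstance(reason, str) or reason not in HINTS:
        # An unlisted reason is an escalation case, not something to guess at.
        return f"Unknown reason {reason!r}; escalate with the JSON content."
    return HINTS[reason]

print(hint_from_json('{"reason": "apm.license.expired"}'))
# Renew via the customer onboarding team.
```

In practice the input would be the contents of a copy of apm-last-error.json pulled out of the container, for example with docker cp.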
Symptom — engine starts, then [critical] log line about config¶
The config parser couldn't read engine.conf. Most common causes: wrong -c path, missing file, malformed TOML. The log line names the file path and line number.
Symptom — engine starts but show status reports DB unavailable¶
Either [database].database_file doesn't exist (the engine creates it on first boot, so this is usually a permission issue) or another process (egpolicy running concurrently, a stuck previous engine instance) holds the DuckDB file lock. Stop the offender and restart the engine.
Common boot-card failures¶
| Boot-card line | Cause | Resolution |
|---|---|---|
| [system] Integrity check ... [ FAIL ] | Bundled binary doesn't match its build-time SHA-256. | Image is corrupt or tampered. Re-pull and verify the cosign signature. |
| [system] APM activation failed ... [ FAIL ] | Engine cannot reach the Control Server, or license credentials are wrong. | See APM reason table above. |
| [system] SSL inspection: bump requested but acknowledgment not set ... [ FAIL ] | ENFORCEGATE_SSL_INSPECT=bump was set without ENFORCEGATE_SSL_INSPECT_ACK=1. | See SSL inspection for the binding acknowledgement gate. |
| [system] Generate SSL certs ... [ FAIL ] | Insufficient entropy on the host, or /etc/enforcegate/ssl/ bind-mounted read-only by mistake. | Wait for entropy or remove the read-only bind-mount. |
| [system] Compile default policy ... [ FAIL ] | Syntax error in a .policy file under /etc/enforcegate/rules.d/. | Run eghost shell -- egpolicy compile --dry-run to see the diagnostic. |
| [critical] policy git mode: force-on but no valid .git/ at /etc/enforcegate/rules.d/ | [policy].git_mode = "force-on" set in engine.conf, but [policy].path does not contain a .git/ directory (never git init-ed, or .git/ was removed). The engine logs Critical but continues in snapshot-only mode. | Either run git init inside [policy].path and restart (see policy audit), or set [policy].git_mode = "auto" to silence the message. |
| show policy log is empty / reports snapshot-only mode after git init | [policy].git_mode = "auto", .git/ is present, but the git binary is not installed in the engine container. Auto-detect silently falls back to snapshot-only. | Install git in the engine container (the standalone bundle's image ships it; only hand-rolled deployments need this). Restart, then re-check with show policy fingerprint. |
| [error] Defendr protocol version mismatch: remote:0.X, local:0.Y | The engine and a connector are on incompatible Defendr-protocol versions. The wire-format authentication changed between engine releases (most recently in 2026.28.0, when the neighbour protocol bumped from v1 to v2). | Upgrade engine and connectors together. The standalone bundle does this automatically: one artifact carries both. Multi-image or independently-upgraded deployments need to roll engine + connector together. See upgrading your deployment for the rolling-update shape. |
| Container exits with code 137 / OOM | mem_limit too low for the active bump cert DB on busy bump-mode deployments. | Bump ENFORCEGATE_MEMORY in .env (default 1g; 2g is safe for bump mode). Policy size is not a typical OOM driver; even multi-million-rule domain lists fit comfortably under 1 GiB. |
Squid keeps spawning connectors that immediately exit¶
If eghost logs enforcegate shows repeated url_rewrite_program respawns, the most common cause is the connector failing to reach the engine.
Confirm by entering the REPL and looking at the neighbour table:
An empty table means no connector is connected. From inside the container:
eghost shell
# Check the squid-connector config
cat /etc/enforcegate/squid-connector.conf
# Check the engine is listening
netstat -lntp | grep 11224
# Check the pre-shared key matches
grep '^key' /etc/enforcegate/{squid-connector,engine}.conf
The pre-shared key in [engine.<name>].key (connector) must match [connectors.<name>].key (engine) exactly. A typo is a common silent-failure cause.
Policy reload reports "permission denied" on a .policy file¶
Common after a host-side docker cp of a new .policy file — docker cp preserves the source's owner (root) and permissions (0600) inside the container, but the engine runs as a non-root user:
eghost shell
chown enforcegate:enforcegate /etc/enforcegate/rules.d/50-new.policy
chmod 0644 /etc/enforcegate/rules.d/50-new.policy
exit
eghost cli
> request policy reload
The post-2026-05-26 fix in the engine's snapshot path surfaces the actual permission-denied error in the reload response — earlier builds only reported the snapshot path, which made the diagnosis harder.
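To spot files in this state before reloading, a check along the following lines may help. It is a sketch under a stated assumption: the engine's user is neither the owner nor in the group of a freshly copied file, so the "other" read bit decides whether the engine can open it. The function names are hypothetical.

```python
import stat
from pathlib import Path
from typing import List

def readable_by_others(mode: int) -> bool:
    """True when the 'other' read bit is set in a st_mode value."""
    return bool(mode & stat.S_IROTH)

def unreadable_policy_files(rules_dir: Path) -> List[Path]:
    """Return the .policy files in rules_dir that lack the 'other' read bit."""
    return [
        path
        for path in sorted(rules_dir.glob("*.policy"))
        if not readable_by_others(path.stat().st_mode)
    ]

print(readable_by_others(0o600))  # False: the mode docker cp carried over
print(readable_by_others(0o644))  # True: the mode set by the chmod above
```

Calling `unreadable_policy_files(Path("/etc/enforcegate/rules.d"))` from inside the container would list the candidates for the chown / chmod fix.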
Engine [critical] log — hyperscan compile failed¶
The regex compile failed on the merged policy set. The engine falls back to permit-by-default but no rules fire.
Most common cause: a .policy file's match-uri-regex is malformed (unbalanced parens, invalid regex syntax). Validate with a dry-run:
If dry-run reports Parsed rules: N cleanly but the engine still logs the compile failure, the parser accepted the rule shape but the regex compiler rejected it. Inspect each rule's match-uri-regex manually and try the patterns one at a time.
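Trying the patterns one at a time can be scripted. The sketch below compiles each pattern separately and reports the ones that fail. It uses Python's `re` as a stand-in, which is an assumption: the engine's regex compiler accepts a different dialect, so a pattern that passes here may still be rejected by the engine, but gross syntax errors such as unbalanced parentheses show up either way. The rule names and the function name are hypothetical.

```python
import re
from typing import Dict

def find_bad_patterns(patterns: Dict[str, str]) -> Dict[str, str]:
    """Compile each pattern on its own and collect the compiler errors."""
    failures = {}
    for rule_name, pattern in patterns.items():
        try:
            re.compile(pattern)
        except re.error as exc:
            failures[rule_name] = str(exc)
    return failures

rules = {
    "deny-good": r"^(https?://)?([^/]+\.)?example\.com(/|$)",
    "deny-unbalanced": r"^(https?://)?(example\.com(/|$)",  # one "(" never closed
}
bad = find_bad_patterns(rules)
print(sorted(bad))  # ['deny-unbalanced']
```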
Captive portal — page doesn't render¶
| Symptom | Likely cause |
|---|---|
| Visitor hits a denied URL and sees a connection error, not the portal. | [captive_portal].base_url points at a host the visitor's browser cannot reach (e.g. 127.0.0.1). Or the captive-portal container is down. Or a firewall blocks the path. |
| Portal renders but shows "invalid token" / decrypt fails. | [captive_portal].secret (engine) doesn't match the portal's configured secret. Generate fresh with openssl rand -base64 32 and set the same base64 value on both sides. The engine needs a restart to pick up the new secret. |
| warn/aup page reloads when the visitor clicks "Proceed". | The POST to /api/captive/ack isn't reaching the engine. Check [debug] logs for captive_ack lines. Common: the load balancer in front of the portal doesn't route POSTs the same as GETs, or ack_token_max_age_s expired between the redirect and the click. |
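For the secret-mismatch row above, the following sketch shows a Python equivalent of openssl rand -base64 32 and a strict comparison of the two configured values. The function names are hypothetical, and how each side stores or trims the value is not specified on this page.

```python
import base64
import secrets

def new_portal_secret(num_bytes: int = 32) -> str:
    """Return num_bytes of random data, base64-encoded."""
    return base64.b64encode(secrets.token_bytes(num_bytes)).decode("ascii")

def secrets_match(engine_value: str, portal_value: str) -> bool:
    """Byte-for-byte comparison; stray whitespace counts as a mismatch."""
    return secrets.compare_digest(engine_value.encode(), portal_value.encode())

secret = new_portal_secret()
print(len(secret))                           # 44 characters for 32 random bytes
print(secrets_match(secret, secret))         # True
print(secrets_match(secret, secret + "\n"))  # False: trailing newline from a paste
```

A trailing newline or space picked up while pasting the value into one of the two configs is a common reason for two secrets that look identical to differ.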
Learning mode — capture returned zero URIs¶
Possible causes:
- Session filter doesn't match. Test with request learning create ua ".*" 1000 (a UA regex that matches everything) to confirm capture is wired up at all.
- Engine's authorisation path isn't being exercised: no traffic is flowing through Squid (operators sometimes test from the same host the engine runs on, missing the proxy). Confirm with eghost logs squid.
- [learning] section absent or database_file unset. request learning create returns "learning subsystem not initialised". Add the section and restart the engine.
egctl — "API authentication failed"¶
The username + password don't match an entry in /etc/enforcegate/passwd. Common causes:
- Wrong password. The shipped default admin / enforcegate-changeme may have been changed during bootstrap.
- Looking at the wrong engine host. Confirm [global].host / port in egctl.conf.
- The engine's passwd file was wiped or rebuilt. Re-add the user with request user add (if you have a working admin credential elsewhere), or use the first-boot bootstrap path with [aaa].auto_bootstrap = true.
Debugging outputs¶
For diagnostic purposes, any EnforceGate vX component executable can be started in the foreground with configurable logging verbosity. This mode displays real-time output directly on the terminal.
Performance hit
When debug logging is used, performance is significantly impacted due to the volume of output. Use only when necessary.
Example: launching a foreground engine instance from inside the running container, with debug logging directed to the console:
$ eghost shell
[root@enforcegate /]$ /usr/local/bin/enforcegate-engine --logtype console --loglevel debug
[14-05-2026 16:04:54] [pid:127] [debug] [apm] Host fingerprint: adad4c755a26ae7a
[14-05-2026 16:04:54] [pid:127] [debug] [apm] Control server public key successfully loaded
[14-05-2026 16:04:54] [pid:127] [debug] [apm] Key pair successfully loaded
[14-05-2026 16:04:54] [pid:127] [debug] [apm] Digital signature verification success
[14-05-2026 16:04:54] [pid:127] [debug] [apm] This license expires on 2027-01-01 00:00:00
[14-05-2026 16:04:54] [pid:127] [warning] Unix-domain local socket file already exists
[14-05-2026 16:04:54] [pid:127] [info] File /run/enforcegate/aggregator.sock successfully deleted
[14-05-2026 16:04:54] [pid:127] [debug] Starting monitoring thread...
[14-05-2026 16:04:54] [pid:127] [debug] Starting control interface thread...
[14-05-2026 16:04:54] [pid:127] [debug] Starting database interface thread...
[14-05-2026 16:04:54] [pid:127] [debug] Creating database instance...
[14-05-2026 16:04:54] [pid:127] [error] Error while creating database instance: Cannot open database file: /var/lib/enforcegate/engine.db
[...]
The error above is typically caused by the database file being deleted out from under the engine or by filesystem-permission damage to /var/lib/enforcegate/. The supervised engine won't surface this output directly on docker logs; the failure shows through the closing one-shot or the long-running supervisor instead.
A connector instance launched with debug logging:
[root@enforcegate /]$ /usr/local/bin/enforcegate-squid-connector --logtype console --loglevel debug
[14-05-2026 16:05:12] [pid:293] [debug] Establishing connection to engine host 127.0.0.1:11224...
[14-05-2026 16:05:12] [pid:293] [debug] Building authentication (init) message
[14-05-2026 16:05:12] [pid:293] [info] Successfully connected to engine host 127.0.0.1:11224
[14-05-2026 16:05:12] [pid:293] [debug] Received authentication (challenge) message from engine
[14-05-2026 16:05:12] [pid:293] [debug] Authentication response (peer): 71f610bab2dc4bf0b7218318f32c0e2e5aaea748ef76d94f292114270d141d59
[14-05-2026 16:05:12] [pid:293] [debug] Authentication response (local): 71f610bab2dc4bf0b7218318f32c0e2e5aaea748ef76d94f292114270d141d59
[14-05-2026 16:05:12] [pid:293] [debug] Engine authentication successful
[14-05-2026 16:05:12] [pid:293] [debug] Starting SSL/TLS session handshake...
[14-05-2026 16:05:12] [pid:293] [debug] Using SSL/TLS protocol version: TLSv1.3, negotiated cipher: TLS_AES_256_GCM_SHA384
[14-05-2026 16:05:12] [pid:293] [info] Session with engine (id:1) established
The output above shows the connector authenticating to the engine and upgrading the session to TLS.
Connector writes to stderr
The squid-connector's LoggerConsole writes to stderr specifically so that stdout stays exclusively for Squid's url_rewrite_program protocol responses. When tailing docker logs, you see both streams interleaved. When deploying outside a container (bare-metal Squid forwarding to a remote engine), redirect them separately: 2>connector.log.
State that can be safely deleted to recover¶
Read order: least disruptive → most disruptive.
| File / dir | What it is | Effect of deleting |
|---|---|---|
| /var/lib/enforcegate/apm-last-error.json | APM diagnostic | Cleared on the next successful boot anyway. Harmless. |
| /var/lib/enforcegate/learning.db (+ .wal) | Learning sessions and captures | All learning history lost. Sessions are re-created from scratch on next request learning create. |
| /var/lib/enforcegate/policy-backups/<timestamp>/ | A specific policy snapshot | Loses ability to roll back to that generation. Other snapshots unaffected. |
| /var/lib/enforcegate/engine.db (+ .wal) | Live policy DB | Engine refuses to start until egpolicy load rebuilds it. Use only when the DB is corrupted. |
Never delete:
- /var/lib/enforcegate/.time-ratchet: the engine will refuse to start (ratchet rebuilt from license; out of scope for operators).
- /etc/enforcegate/license/*: license material; replacement requires customer onboarding coordination.
Reset to clean state¶
For a complete factory reset that destroys all operator state (configs, license activation, SSL-bump CA, policy DB, audit log):
eghost down
docker compose -f /opt/enforcegate/bundles/standalone/docker-compose.yml down -v # `-v` wipes the four named volumes
eghost up # next boot re-seeds defaults from skel
For a softer reset that preserves the policy DB and operator config but resets transient state:
If you only need to re-compile policies (after editing .policy files outside the eghost policy workflow):
This recompiles /etc/enforcegate/rules.d/*.policy into engine.db and tells the engine to reload — no container restart required. The eghost policy new / edit / remove verbs do this automatically on save; the manual egpolicy load is only needed when policy files were modified by some other path (bind-mount, docker cp, external orchestration).
When to escalate to engineering¶
- APM diagnostic JSON reason isn't in the known set above: likely a build mismatch between engine and installer. File a ticket with the JSON content.
- Hyperscan compile failure persists after request policy reload dry-run true runs cleanly: possible engine bug. File with the offending rule and engine version.
- Engine segfaults: file with eghost logs output, the engine version, and recent config changes.
- Captive portal flow works in the bundled tests but fails in production: likely a deployment / TLS / routing issue. Try SSL inspection and the captive portal reference first; ticket if stuck.
Collecting a support bundle¶
For most tickets the in-engine show tech-support verb is the fastest first step: one command aggregates ten of the most-asked-for show * outputs (including version, uptime, memory, threads, listeners, status, license, neighbors, and the last 50 log lines) into a single paste-into-ticket block:
Paste the output directly into your support ticket. See show tech-support for the full sample output.
For deeper investigations where Exosys engineering needs the boot card, recent logs, version manifest, and the configuration files alongside the engine outputs, generate the host-side diagnostic bundle:
This writes a redacted tarball under /tmp/enforcegate-diag-<timestamp>.tar.gz (sensitive material — API keys, license credentials, audit log entries — redacted). Attach the tarball to the ticket.
For an inline diagnostic block (printable, no tarball):