Troubleshooting

When critical errors occur during operation or startup, they are systematically reported through these interfaces:

  • The captive portal dashboard (operator-facing pages).
  • The engine's console output (visible on the host via eghost logs or journalctl -u enforcegate on the appliance).
  • The engine's log file at /var/log/enforcegate/engine.log inside the container (also surfaced to docker logs by the container log pipeline).
  • A structured diagnostic file at /var/lib/enforcegate/apm-last-error.json written atomically on fatal APM (license-validation) failures. The appliance installer reads this file via docker cp to dispatch per-reason hints.

Triage entry points

In order of expected payoff when something is wrong:

  1. show policy match <url> in eghost cli — instantly tells you what verdict the engine returns and which rule fired. Replaces the "browse to the URL, hit a deny page, read the rule_name off the URL bar" diagnostic dance.
  2. eghost status — engine alive? Database connected? API responsive? Connector sessions up?
  3. eghost logs — the engine logs at [error] by default; bump to [debug] via [logging].level for verbose tracing.
  4. /var/lib/enforcegate/apm-last-error.json — present if the engine failed to start due to a license issue. Read the reason field for a structured dispatch code.

Why is this URL blocked?

The canonical diagnostic for "I can't reach <URL>" tickets. Send the URL through the live policy engine without browsing to it:

eghost cli
> show policy match https://example.com/foo

Three outputs to expect:

Matched + admin viewer (sees the regex pattern):

URI:        https://example.com/foo
Matched:    yes
Code:       300 (deny — redirect to portal)
Rule ID:    14
Rule name:  deny-some-rule_e14
Action:     deny
Reason:     <the rule's description field>
Pattern:    ^(https?:\/\/)?([^/]+\.)?example\.com(/|$)

Matched + monitoring/standard viewer (no pattern — admin-only field):

URI:        https://example.com/foo
Matched:    yes
Code:       300 (deny — redirect to portal)
Rule ID:    14
Rule name:  deny-some-rule_e14
Action:     deny
Reason:     <the rule's description field>

Not matched — the engine's synthesised [policy].default_action verdict takes effect:

URI:        https://example.com/foo
Matched:    no (default-permit)
Code:       200 (permit)

The default-permit / default-deny / default-warn / default-aup synthetic rule name reports which default_action knob value is configured. Flip the no-match posture in engine.conf — see [policy].default_action.

Step 2 — Inspect the matched rule

Once you have the rule_name, grep the rule's source:

docker exec enforcegate grep -r "deny-some-rule" /etc/enforcegate/rules.d/

Or look up the compiled regex directly:

docker exec enforcegate duckdb /var/lib/enforcegate/engine.db \
    "SELECT id, name, action, pattern FROM http_policy WHERE id = 14;"

Common false-positive causes

  • Unescaped . in match-uri-regex: t.co matches t/co, t?co, etc. Hand-written regexes must escape the dot: t\.co. The match-domain-list helper auto-escapes (post-2026-05-27 fix), but match-uri-regex is the operator's responsibility.
  • Missing ^ anchor: the pattern matches as a substring anywhere. foo.com matches barfoo.com. Anchor with ^https?:// or ^(https?://)?.
  • Missing trailing (/|$): foo.com matches foo.complaint. End-anchor with (/|$) to require a path separator or string end.

The learning-mode synthesiser's regex shape is a good template:

^(https?:\/\/)?([^/]+\.)?<host>(/|$)

(Optional scheme, optional subdomain, required path delimiter at the end.)
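
The three pitfalls and the template can be reproduced offline. The following is a minimal sketch using Python's re module as a stand-in for the engine's regex compiler, so edge-case behaviour may differ from what the engine does. host_pattern and matches are hypothetical helper names, not part of any EnforceGate tool.

```python
import re

def host_pattern(host: str) -> str:
    """Build the anchored template above for one literal host name."""
    # re.escape turns "." into "\." so the dot matches only a literal dot.
    return r"^(https?:\/\/)?([^/]+\.)?" + re.escape(host) + r"(/|$)"

def matches(pattern: str, uri: str) -> bool:
    # Unanchored search: any anchoring has to come from the pattern itself.
    return re.search(pattern, uri) is not None

# Each hand-written pattern below over-matches in the way described above.
pitfalls = {
    "unescaped dot": matches(r"t.co", "t/co"),
    "missing ^ anchor": matches(r"foo\.com", "barfoo.com"),
    "missing (/|$)": matches(r"^(https?://)?foo\.com", "foo.complaint"),
}

# The template accepts the real host and rejects both look-alikes.
safe = host_pattern("foo.com")
template_results = {
    uri: matches(safe, uri)
    for uri in ("https://www.foo.com/path", "foo.com", "barfoo.com", "foo.complaint")
}

print(pitfalls)
print(template_results)
```

Every entry in pitfalls is a false positive; the template matches only the first two URIs in template_results.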

Toolbox / shared-path deny isn't taking effect

Symptom — the toolbox sidecar (or any other writer to [policy].shared_path) has written a deny rule, the file is loaded (visible in show policy files), but the matching traffic still gets through. show policy match <url> reports the verdict as permit and the matched rule as some default-permit-shaped or catch-all rule in [policy].path.

Cause — [policy].path (operator-authored) loads before [policy].shared_path (toolbox / generated) and gets the lower rule ids. A catch-all permit rule in [policy].path therefore wins under lowest-id-wins and shadows every shared-path rule.

Fix — remove the catch-all from [policy].path and rely on [policy].default_action instead:

host> show policy files
 File                                  Size      Modified              Rules
 ------------------------------------  --------  --------------------  -----
 99-default-permit.policy                 96 B  2026-04-12 08:00:00       1   ← shadows shared rules
 ...
host> enable
host# configure
host(config)# delete policy 99-default-permit
host(config)# commit

Then set default_action = "permit" (or "deny" for default-deny) in engine.conf and reload. The synthesised no-match verdict occupies no rule id, so it can no longer shadow shared-path rules. Pre-2026.32.0 deployments shipped this catch-all; from 2026.32.0 it is retired.
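
To see why the catch-all shadows shared-path rules, here is a simplified model of lowest-id-wins evaluation. It is a sketch under stated assumptions, not the engine's implementation: rule ids are assigned sequentially in load order, matching uses Python's re, and the rule names catch-all-permit and toolbox-deny are made up for illustration.

```python
import re
from dataclasses import dataclass

@dataclass
class Rule:
    rule_id: int
    name: str
    action: str
    pattern: str

def load(files):
    """Assign ids in load order, so rules from earlier files get lower ids."""
    rules = []
    for file_rules in files:
        for name, action, pattern in file_rules:
            rules.append(Rule(len(rules) + 1, name, action, pattern))
    return rules

def verdict(rules, uri, default_action="permit"):
    """Lowest-id-wins: return (action, rule name) of the first matching rule."""
    for rule in sorted(rules, key=lambda r: r.rule_id):
        if re.search(rule.pattern, uri):
            return rule.action, rule.name
    # No rule matched: the synthesised default verdict occupies no rule id.
    return default_action, f"default-{default_action}"

operator_path = [("catch-all-permit", "permit", r".*")]
shared_path = [("toolbox-deny", "deny", r"^(https?:\/\/)?([^/]+\.)?example\.com(/|$)")]

# Operator files load first, so the catch-all gets id 1 and always wins.
shadowed = verdict(load([operator_path, shared_path]), "https://example.com/foo")

# With the catch-all removed, the shared-path deny is the first match.
fixed = verdict(load([[], shared_path]), "https://example.com/foo")
unmatched = verdict(load([[], shared_path]), "https://other.test/")

print(shadowed, fixed, unmatched)
```

In this model the deny only takes effect once the catch-all is gone, while unmatched traffic still gets the configured default.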

Time-scheduled rule fires at the wrong time

A rule with a time-window: attribute is firing earlier, later, or not at all relative to what its window says. Two common causes:

  1. Engine clock is wrong. Time-windows evaluate against the engine's wall clock — a drifted system clock makes them fire at the drifted time. Confirm in eghost cli:

    host> show clock
    Engine clock
      UTC:                     2026-06-06T08:42:15Z
      Local:                   2026-06-06 10:42:15 CEST
      Epoch:                   1780735335 (unix seconds)
    

    If the time is off, fix NTP on the engine host (or the appliance's time-sync service). The engine picks up the new clock immediately — no reload needed.

  2. Timezone mismatch. [policy].time_window_tz controls whether HH:MM is interpreted as the engine's local time or UTC. The default is local. Deployments with operators in multiple timezones should consider utc so windows mean the same thing everywhere; operators authoring rules then write the window in UTC.

If show policy match reports the rule did not match a request you expected to be inside the window, double-check both the day specifier (weekdays excludes Saturday and Sunday; weekend excludes Monday through Friday) and the wrap-past-midnight form (daily 22:00-06:00 is active overnight, not during the day).
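
The wrap-past-midnight form and the local-versus-UTC interpretation are easy to get wrong, so a small model helps when sanity-checking a window by hand. This sketch is an assumption-laden illustration, not the engine's code: windows are treated as half-open [start, end), the day specifier is not modelled, and in_window and clock_for are hypothetical names. The fixed +02:00 offset stands in for CEST from the show clock sample above.

```python
from datetime import datetime, time, timedelta, timezone

def in_window(now: time, start: time, end: time) -> bool:
    """True when `now` is inside [start, end), wrapping past midnight if needed."""
    if start <= end:
        return start <= now < end
    # Wrapped window such as 22:00-06:00: late evening OR early morning.
    return now >= start or now < end

def clock_for(instant: datetime, tz_mode: str, local_tz: timezone) -> time:
    """Pick the wall-clock time that HH:MM is compared against."""
    zone = timezone.utc if tz_mode == "utc" else local_tz
    return instant.astimezone(zone).time()

instant = datetime(2026, 6, 6, 8, 42, 15, tzinfo=timezone.utc)
cest = timezone(timedelta(hours=2))  # assumed local zone for the example

# The same instant is inside 09:00-17:00 in local time but outside it in UTC.
office = (time(9, 0), time(17, 0))
active_local = in_window(clock_for(instant, "local", cest), *office)
active_utc = in_window(clock_for(instant, "utc", cest), *office)

# A 22:00-06:00 window is active overnight, not at midday.
overnight = (time(22, 0), time(6, 0))
active_late = in_window(time(23, 30), *overnight)
active_noon = in_window(time(12, 0), *overnight)

print(active_local, active_utc, active_late, active_noon)
```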

Engine refuses to start

Symptom — exits with -1 immediately, no logs

Look at /var/lib/enforcegate/apm-last-error.json. The reason field is one of a stable set:

Reason | Meaning | Operator fix
apm.config.serial_missing | [license].serial empty. | Set ENGINE_LICENSE_SERIAL in .env (or write it into engine.conf directly).
apm.config.credentials_missing | [license].username or password empty. | Set ENGINE_LICENSE_USERNAME + ENGINE_LICENSE_PASSWORD in .env.
apm.cs.unreachable | Can't reach the licensing endpoint (DNS, firewall, proxy). | Verify outbound HTTPS connectivity from the host. For firewall coordination on the specific endpoint, contact support.
apm.cs.auth_rejected | Control server refused the credentials. | Verify the username + password with the customer onboarding team.
apm.license.expired | License expired. | Renew via the customer onboarding team.
apm.fingerprint.mismatch | Host fingerprint doesn't match the license. | Likely a hardware change (different DMI UUID, new machine-id). Contact support to re-activate.
apm.permissions.failed | License file modes too permissive (when enforce_permissions = true). | chmod 0400 on the .key and .lic files in /etc/enforcegate/license/.
apm.ratchet.tampered | Engine internal state check failed. | Don't roll back system time. If the host's clock genuinely needed re-syncing, contact support.

The engine's exit code is intentionally generic (-1) for anti-piracy opacity. Operators dispatch on the JSON file, not the exit code.
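
Dispatching on the JSON file could look like the sketch below. Only the reason field is documented here, so the sample payload is a made-up minimal object and the rest of the file's schema is not assumed. APM_HINTS abbreviates the operator fixes from the table, and hint_for is a hypothetical helper, not the installer's actual code.

```python
import json

# Reason codes and abbreviated fixes taken from the table above.
APM_HINTS = {
    "apm.config.serial_missing": "Set ENGINE_LICENSE_SERIAL in .env.",
    "apm.config.credentials_missing": "Set ENGINE_LICENSE_USERNAME and ENGINE_LICENSE_PASSWORD in .env.",
    "apm.cs.unreachable": "Verify outbound HTTPS connectivity from the host.",
    "apm.cs.auth_rejected": "Verify the username and password with customer onboarding.",
    "apm.license.expired": "Renew via the customer onboarding team.",
    "apm.fingerprint.mismatch": "Contact support to re-activate.",
    "apm.permissions.failed": "chmod 0400 the .key and .lic files.",
    "apm.ratchet.tampered": "Do not roll back system time; contact support.",
}

def hint_for(raw_json: str) -> str:
    """Map the `reason` field of apm-last-error.json to an operator hint."""
    try:
        payload = json.loads(raw_json)
    except json.JSONDecodeError:
        return "Diagnostic file is not valid JSON; attach it to a ticket."
    reason = payload.get("reason") if isinstance(payload, dict) else None
    if not isinstance(reason, str):
        return "No usable reason field; attach the file to a ticket."
    # Unknown codes are an escalation case rather than an error.
    return APM_HINTS.get(reason, f"Unknown reason {reason}; escalate with the JSON content.")

# Hypothetical minimal payload; a real file may carry more fields.
known = hint_for('{"reason": "apm.license.expired"}')
unknown = hint_for('{"reason": "apm.something.new"}')
broken = hint_for("not json")

print(known)
print(unknown)
print(broken)
```

In practice the input string would come from the file after copying it out of the container, for example with docker cp.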

Symptom — engine starts, then [critical] log line about config

The config parser couldn't read engine.conf. Most common causes: wrong -c path, missing file, malformed TOML. The log line names the file path and line number.

Symptom — engine starts but show status reports DB unavailable

Either [database].database_file doesn't exist (the engine creates it on first boot, so this is usually a permission issue) or another process (egpolicy running concurrently, a stuck previous engine instance) holds the DuckDB file lock. Stop the offender and restart the engine.

Common boot-card failures

Boot-card line | Cause | Resolution
[system] Integrity check ... [ FAIL ] | Bundled binary doesn't match its build-time SHA-256. | Image is corrupt or tampered. Re-pull and verify the cosign signature.
[system] APM activation failed ... [ FAIL ] | Engine cannot reach the Control Server, or license credentials are wrong. | See APM reason table above.
[system] SSL inspection: bump requested but acknowledgment not set ... [ FAIL ] | ENFORCEGATE_SSL_INSPECT=bump was set without ENFORCEGATE_SSL_INSPECT_ACK=1. | See SSL inspection for the binding acknowledgement gate.
[system] Generate SSL certs ... [ FAIL ] | Insufficient entropy on the host, or /etc/enforcegate/ssl/ bind-mounted read-only by mistake. | Wait for entropy or remove the read-only bind-mount.
[system] Compile default policy ... [ FAIL ] | Syntax error in a .policy file under /etc/enforcegate/rules.d/. | Run eghost shell -- egpolicy compile --dry-run to see the diagnostic.
[critical] policy git mode: force-on but no valid .git/ at /etc/enforcegate/rules.d/ | [policy].git_mode = "force-on" set in engine.conf, but [policy].path does not contain a .git/ directory (never git init-ed, or .git/ was removed). The engine logs Critical but continues in snapshot-only mode. | Either run git init inside [policy].path and restart (see policy audit), or set [policy].git_mode = "auto" to silence the message.
show policy log is empty / reports snapshot-only mode after git init | [policy].git_mode = "auto", .git/ is present, but the git binary is not installed in the engine container. Auto-detect silently falls back to snapshot-only. | Install git in the engine container (the standalone bundle's image ships it; only hand-rolled deployments need this). Restart, then re-check with show policy fingerprint.
[error] Defendr protocol version mismatch: remote:0.X, local:0.Y | The engine and a connector are on incompatible Defendr-protocol versions. The wire-format authentication changed between engine releases (most recently in 2026.28.0, when the neighbour protocol bumped from v1 to v2). | Upgrade engine and connectors together. The standalone bundle does this automatically: one artifact carries both. Multi-image or independently-upgraded deployments need to roll engine + connector together. See upgrading your deployment for the rolling-update shape.
Container exits with code 137 / OOM | mem_limit too low for the active bump cert DB on busy bump-mode deployments. | Bump ENFORCEGATE_MEMORY in .env (default 1g; 2g is safe for bump mode). Policy size is not a typical OOM driver; even multi-million-rule domain lists fit comfortably under 1 GiB.

Squid keeps spawning connectors that immediately exit

If eghost logs enforcegate shows repeated url_rewrite_program respawns, the most common cause is the connector failing to reach the engine.

Confirm by entering the REPL and looking at the neighbour table:

eghost cli
> show neighbor summary

An empty table means no connector is connected. From inside the container:

eghost shell
# Check the squid-connector config
cat /etc/enforcegate/squid-connector.conf
# Check the engine is listening
netstat -lntp | grep 11224
# Check the pre-shared key matches
grep '^key' /etc/enforcegate/{squid-connector,engine}.conf

The pre-shared key in [engine.<name>].key (connector) must match [connectors.<name>].key (engine) exactly. A typo is a common silent-failure cause.
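
Because the two keys must match exactly, differences that are invisible to the eye (trailing whitespace, letter case) are worth ruling out first. The sketch below compares two key strings and reports the kind of difference without printing either key. diagnose_key_mismatch is a hypothetical helper and the sample values are placeholders, not real keys.

```python
def diagnose_key_mismatch(connector_key: str, engine_key: str) -> str:
    """Describe how two pre-shared keys differ without echoing them."""
    if connector_key == engine_key:
        return "match"
    if connector_key.strip() == engine_key.strip():
        return "differs only by leading/trailing whitespace"
    if connector_key.lower() == engine_key.lower():
        return "differs only by letter case"
    if len(connector_key) != len(engine_key):
        return f"length differs ({len(connector_key)} vs {len(engine_key)})"
    # Same length, so zip covers every character and one must differ.
    first = next(
        i for i, (a, b) in enumerate(zip(connector_key, engine_key)) if a != b
    )
    return f"first difference at character {first + 1}"

# Placeholder values standing in for the two configured keys.
results = [
    diagnose_key_mismatch("s3cr3t", "s3cr3t"),
    diagnose_key_mismatch("s3cr3t ", "s3cr3t"),
    diagnose_key_mismatch("S3CR3T", "s3cr3t"),
    diagnose_key_mismatch("s3cr3tX", "s3cr3t"),
    diagnose_key_mismatch("s3cr4t", "s3cr3t"),
]
for line in results:
    print(line)
```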

Policy reload reports "permission denied" on a .policy file

Common after a host-side docker cp of a new .policy file: docker cp creates the file inside the container owned by root and keeps the source's permission bits (for example 0600), but the engine runs as a non-root user:

eghost shell
chown enforcegate:enforcegate /etc/enforcegate/rules.d/50-new.policy
chmod 0644 /etc/enforcegate/rules.d/50-new.policy
exit

eghost cli
> request policy reload

The post-2026-05-26 fix in the engine's snapshot path surfaces the actual permission-denied error in the reload response — earlier builds only reported the snapshot path, which made the diagnosis harder.
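
Files with this problem can be spotted before a reload by checking mode bits. The following sketch flags .policy files that are not world-readable. That is only a heuristic, since a 0600 file owned by the engine's own user would still be readable by the engine. unreadable_policy_files is a hypothetical helper, and the demo runs against a throwaway directory rather than the real rules.d.

```python
import os
import stat
import tempfile
from pathlib import Path

def unreadable_policy_files(rules_dir):
    """Return (name, mode) for .policy files lacking the world-readable bit."""
    flagged = []
    for path in sorted(Path(rules_dir).glob("*.policy")):
        mode = stat.S_IMODE(path.stat().st_mode)
        if not mode & stat.S_IROTH:
            flagged.append((path.name, oct(mode)))
    return flagged

# Demo: one file with the expected 0644 mode, one left at 0600.
with tempfile.TemporaryDirectory() as demo_dir:
    for name, mode in (("10-base.policy", 0o644), ("50-new.policy", 0o600)):
        path = Path(demo_dir) / name
        path.write_text("# placeholder\n")
        os.chmod(path, mode)
    flagged = unreadable_policy_files(demo_dir)

print(flagged)
```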

Engine [critical] log — hyperscan compile failed

The regex compile failed on the merged policy set. The engine falls back to permit-by-default, and no rules fire.

Most common cause: a .policy file's match-uri-regex is malformed (unbalanced parens, invalid regex syntax). Validate with a dry-run:

eghost cli
> request policy reload dry-run true

If dry-run reports Parsed rules: N cleanly but the engine still logs the compile failure, the parser accepted the rule shape but the regex compiler rejected it. Inspect each rule's match-uri-regex manually and try the patterns one at a time.
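
Trying the patterns one at a time can be scripted. The sketch below compiles each pattern on its own and collects the failures. It uses Python's re as a stand-in, which is an important caveat: Hyperscan supports a different regex subset (backreferences, for example, are not supported), so a pattern that passes here can still be rejected by the engine. The rule names and find_bad_patterns are hypothetical.

```python
import re

def find_bad_patterns(patterns):
    """Compile each pattern separately; return {rule name: error message}."""
    failures = {}
    for rule_name, pattern in patterns.items():
        try:
            re.compile(pattern)
        except re.error as exc:
            failures[rule_name] = str(exc)
    return failures

# Hypothetical rule set: the second pattern has an unbalanced parenthesis.
patterns = {
    "deny-some-rule": r"^(https?:\/\/)?([^/]+\.)?example\.com(/|$)",
    "broken-rule": r"^(https?://)?(example\.com",
}

failures = find_bad_patterns(patterns)
for name, message in failures.items():
    print(f"{name}: {message}")
```

This catches plain syntax errors such as unbalanced parentheses; anything that survives still needs checking against the engine itself.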

Captive portal — page doesn't render

Symptom | Likely cause
Visitor hits a denied URL and sees a connection error, not the portal. | [captive_portal].base_url points at a host the visitor's browser cannot reach (e.g. 127.0.0.1). Or the captive-portal container is down. Or a firewall blocks the path.
Portal renders but shows "invalid token" / decrypt fails. | [captive_portal].secret (engine) doesn't match the portal's configured secret. Generate fresh with openssl rand -base64 32 and set the same base64 value on both sides. The engine needs a restart to pick up the new secret.
warn/aup page reloads when the visitor clicks "Proceed". | The POST to /api/captive/ack isn't reaching the engine. Check [debug] logs for captive_ack lines. Common: the load balancer in front of the portal doesn't route POSTs the same as GETs, or ack_token_max_age_s expired between the redirect and the click.

Learning mode — capture returned zero URIs

Possible causes:

  • Session filter doesn't match. Test with request learning create ua ".*" 1000 (a UA regex that matches everything) to confirm capture is wired up at all.
  • Engine's authorisation path isn't being exercised — no traffic is flowing through Squid (operators sometimes test from the same host the engine runs on, missing the proxy). Confirm with eghost logs squid.
  • [learning] section absent or database_file unset. request learning create returns "learning subsystem not initialised". Add the section and restart the engine.

egctl — "API authentication failed"

The username + password don't match an entry in /etc/enforcegate/passwd. Common causes:

  • Wrong password. The shipped default admin / enforcegate-changeme may have been changed during bootstrap.
  • Looking at the wrong engine host. Confirm [global].host / port in egctl.conf.
  • The engine's passwd file was wiped or rebuilt. Re-add the user with request user add (if you have a working admin credential elsewhere), or use the first-boot bootstrap path with [aaa].auto_bootstrap = true.

Debugging outputs

For diagnostic purposes, any EnforceGate vX component executable can be started in the foreground with configurable logging verbosity. This mode displays real-time output directly on the terminal.

Performance hit

When debug logging is used, performance is significantly impacted due to the volume of output. Use only when necessary.

Example: launching a foreground engine instance from inside the running container, with debug logging directed to the console:

Engine with console output and debug-level logging
$ eghost shell
[root@enforcegate /]$ /usr/local/bin/enforcegate-engine --logtype console --loglevel debug
[14-05-2026 16:04:54] [pid:127] [debug] [apm] Host fingerprint: adad4c755a26ae7a
[14-05-2026 16:04:54] [pid:127] [debug] [apm] Control server public key successfully loaded
[14-05-2026 16:04:54] [pid:127] [debug] [apm] Key pair successfully loaded
[14-05-2026 16:04:54] [pid:127] [debug] [apm] Digital signature verification success
[14-05-2026 16:04:54] [pid:127] [debug] [apm] This license expires on 2027-01-01 00:00:00
[14-05-2026 16:04:54] [pid:127] [warning] Unix-domain local socket file already exists
[14-05-2026 16:04:54] [pid:127] [info] File /run/enforcegate/aggregator.sock successfully deleted
[14-05-2026 16:04:54] [pid:127] [debug] Starting monitoring thread...
[14-05-2026 16:04:54] [pid:127] [debug] Starting control interface thread...
[14-05-2026 16:04:54] [pid:127] [debug] Starting database interface thread...
[14-05-2026 16:04:54] [pid:127] [debug] Creating database instance...
[14-05-2026 16:04:54] [pid:127] [error] Error while creating database instance: Cannot open database file: /var/lib/enforcegate/engine.db
[...]

The error above is typically caused by the database file being deleted out from under the engine or by filesystem-permission damage to /var/lib/enforcegate/. The supervised engine won't surface this output directly on docker logs; the failure shows through the closing one-shot or the long-running supervisor instead.

A connector instance launched with debug logging:

Squid connector with console output and debug-level logging
[root@enforcegate /]$ /usr/local/bin/enforcegate-squid-connector --logtype console --loglevel debug
[14-05-2026 16:05:12] [pid:293] [debug] Establishing connection to engine host 127.0.0.1:11224...
[14-05-2026 16:05:12] [pid:293] [debug] Building authentication (init) message
[14-05-2026 16:05:12] [pid:293] [info] Successfully connected to engine host 127.0.0.1:11224
[14-05-2026 16:05:12] [pid:293] [debug] Received authentication (challenge) message from engine
[14-05-2026 16:05:12] [pid:293] [debug] Authentication response (peer):  71f610bab2dc4bf0b7218318f32c0e2e5aaea748ef76d94f292114270d141d59
[14-05-2026 16:05:12] [pid:293] [debug] Authentication response (local): 71f610bab2dc4bf0b7218318f32c0e2e5aaea748ef76d94f292114270d141d59
[14-05-2026 16:05:12] [pid:293] [debug] Engine authentication successful
[14-05-2026 16:05:12] [pid:293] [debug] Starting SSL/TLS session handshake...
[14-05-2026 16:05:12] [pid:293] [debug] Using SSL/TLS protocol version: TLSv1.3, negotiated cipher: TLS_AES_256_GCM_SHA384
[14-05-2026 16:05:12] [pid:293] [info] Session with engine (id:1) established

The output above shows the connector authenticating to the engine and upgrading the session to TLS.

Connector writes to stderr

The squid-connector's LoggerConsole writes to stderr specifically so that stdout stays exclusively for Squid's url_rewrite_program protocol responses. When tailing docker logs, you see both streams interleaved. When deploying outside a container (bare-metal Squid forwarding to a remote engine), redirect them separately: 2>connector.log.

State that can be safely deleted to recover

Read order: least disruptive → most disruptive.

File / dir | What it is | Effect of deleting
/var/lib/enforcegate/apm-last-error.json | APM diagnostic | Cleared on the next successful boot anyway. Harmless.
/var/lib/enforcegate/learning.db (+ .wal) | Learning sessions and captures | All learning history lost. Sessions are re-created from scratch on next request learning create.
/var/lib/enforcegate/policy-backups/<timestamp>/ | A specific policy snapshot | Loses ability to roll back to that generation. Other snapshots unaffected.
/var/lib/enforcegate/engine.db (+ .wal) | Live policy DB | Engine refuses to start until egpolicy load rebuilds it. Use only when the DB is corrupted.

Never delete:

  • /var/lib/enforcegate/.time-ratchet — the engine will refuse to start (ratchet rebuilt from license; out of scope for operators).
  • /etc/enforcegate/license/* — license material; replacement requires customer onboarding coordination.

Reset to clean state

For a complete factory reset that destroys all operator state (configs, license activation, SSL-bump CA, policy DB, audit log):

eghost down
docker compose -f /opt/enforcegate/bundles/standalone/docker-compose.yml down -v   # `-v` wipes the four named volumes
eghost up                                                                          # next boot re-seeds defaults from skel

For a softer reset that preserves the policy DB and operator config but resets transient state:

eghost restart enforcegate

If you only need to re-compile policies (after editing .policy files outside the eghost policy workflow):

docker exec enforcegate egpolicy load

This recompiles /etc/enforcegate/rules.d/*.policy into engine.db and tells the engine to reload — no container restart required. The eghost policy new / edit / remove verbs do this automatically on save; the manual egpolicy load is only needed when policy files were modified by some other path (bind-mount, docker cp, external orchestration).

When to escalate to engineering

  • APM diagnostic JSON reason isn't in the known set above — likely a build mismatch between engine and installer. File a ticket with the JSON content.
  • Hyperscan compile failure persists after request policy reload dry-run true runs cleanly — possible engine bug. File with the offending rule and engine version.
  • Engine segfaults — file with eghost logs output, the engine version, and recent config changes.
  • Captive portal flow works in the bundled tests but fails in production — likely a deployment / TLS / routing issue. Try SSL inspection and the captive portal reference first; ticket if stuck.

Collecting a support bundle

For most tickets the in-engine show tech-support verb is the fastest first step — one command aggregates ten of the most-asked-for show * outputs (version, uptime, memory, threads, listeners, status, license, neighbors, and the last 50 log lines) into a single paste-into-ticket block:

eghost cli
host> show tech-support

Paste the output directly into your support ticket. See show tech-support for the full sample output.

For deeper investigations where Exosys engineering needs the boot card, recent logs, version manifest, and the configuration files alongside the engine outputs, generate the host-side diagnostic bundle:

eghost support bundle

This writes a redacted tarball to /tmp/enforcegate-diag-<timestamp>.tar.gz. Sensitive material (API keys, license credentials, audit log entries) is redacted. Attach the tarball to the ticket.

For an inline diagnostic block (printable, no tarball):

eghost support