Audit & observability¶
Three signal sources¶
- Regular logs: tracing output to stderr / journald, levels DEBUG/INFO/WARN/ERROR, for development and debugging.
- Audit JSONL: events tagged target=futu_audit only, for compliance, post-hoc forensics, attack investigation.
- Prometheus metrics: counter aggregates for alerting + dashboards.
Audit JSONL¶
Enable:
futu-opend --audit-log /var/log/futu-audit.jsonl
# or
futu-opend --audit-log /var/log/futu/ # directory → daily rotating futu-audit.log.YYYY-MM-DD
futu-mcp / futucli take the same flag.
Event schema¶
{
  "timestamp": "2026-04-15T10:23:45.123Z",
  "level": "WARN", // reject=WARN, allow=INFO, trade=WARN
  "target": "futu_audit",
  "iface": "rest" | "grpc" | "ws" | "mcp" | "cli",
  "endpoint": "/api/order" | "proto_id=2202" | "futu_place_order",
  "key_id": "bot_a" | "<missing>" | "<invalid>" | "<none>",
  "outcome": "allow" | "reject" | "success" | "failure",
  "reason": "limit: rate limit exceeded: 5 in 60s (cap 3)",
  "scope": "trade:real", // on allow
  "args_hash": "8a3f2b9c" // on trade events: first 8 hex of SHA-256
}
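To make the args_hash field concrete, here is a minimal sketch that takes the first 8 hex characters of a SHA-256 digest. How the daemon serializes the arguments before hashing is not stated here, so the canonical JSON encoding (sorted keys, compact separators), the function name, and the sample order are all assumptions; values produced this way will not necessarily match the daemon's output.

```python
import hashlib
import json


def args_hash(args: dict) -> str:
    """Return the first 8 hex chars of SHA-256 over a canonical JSON encoding.

    Assumption: sorted keys and compact separators. The daemon's real
    serialization may differ.
    """
    canonical = json.dumps(args, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:8]


# Hypothetical order arguments, for illustration only.
order = {"code": "HK.00700", "qty": 100, "side": "BUY"}
print(args_hash(order))
```

A short prefix like this is enough to correlate repeated requests in the log without storing the arguments themselves.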
Common jq queries¶
# recent rejects
jq -c 'select(.outcome=="reject")' /var/log/futu-audit.jsonl | tail -20
# orders from a specific key
jq 'select(.key_id=="bot_a" and (.endpoint|test("order|place|modify")))' \
  /var/log/futu-audit.jsonl
# reject reason histogram
jq -r 'select(.outcome=="reject") | .reason' /var/log/futu-audit.jsonl \
| awk -F': ' '{print $1}' \
| sort | uniq -c | sort -rn
# request distribution per iface
jq -r '.iface' /var/log/futu-audit.jsonl | sort | uniq -c
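Where jq is not available, the reject-reason histogram can be reproduced in a few lines. This is a minimal sketch, assuming one JSON object per line with the fields from the event schema; the function name and the sample events are made up for illustration.

```python
import json
from collections import Counter
from typing import Iterable


def reject_histogram(lines: Iterable[str]) -> Counter:
    """Count reject events by the part of `reason` before the first ': '."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        event = json.loads(line)
        if event.get("outcome") != "reject":
            continue
        reason = event.get("reason", "")
        counts[reason.split(": ", 1)[0]] += 1
    return counts


# Made-up events; only the first reason string appears in the schema above.
sample = [
    '{"outcome": "reject", "reason": "limit: rate limit exceeded: 5 in 60s (cap 3)"}',
    '{"outcome": "allow", "scope": "trade:real"}',
    '{"outcome": "reject", "reason": "limit: daily cap exceeded"}',
]
print(reject_histogram(sample).most_common())
```

To run it against the real log, pass an open file handle instead of the sample list.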
Batch analysis with DuckDB¶
-- load JSONL
CREATE TABLE audit AS SELECT * FROM read_json_auto('/var/log/futu-audit.jsonl');
-- daily order count for one key (bot_a)
SELECT DATE(timestamp) AS day, COUNT(*) AS orders
FROM audit
WHERE key_id = 'bot_a' AND endpoint LIKE '%order%'
GROUP BY day;
Prometheus metrics¶
Scrape config:
scrape_configs:
  - job_name: futu-opend
    static_configs: [{ targets: ['opend:22222'] }]
  - job_name: futu-mcp
    static_configs: [{ targets: ['mcp:38765'] }]
Three counters¶
# HELP futu_auth_events_total Auth / trade events by iface, outcome, key_id
# TYPE futu_auth_events_total counter
futu_auth_events_total{iface="rest",outcome="allow",key_id="bot_a"} 1234
# HELP futu_auth_limit_rejects_total Limit-check rejects by iface, key_id, reason
# TYPE futu_auth_limit_rejects_total counter
futu_auth_limit_rejects_total{iface="grpc",key_id="bot_b",reason="rate"} 7
# HELP futu_ws_filtered_pushes_total Pushes filtered out for client lacking scope
# TYPE futu_ws_filtered_pushes_total counter
futu_ws_filtered_pushes_total{required_scope="trade",key_id="bot_c"} 42
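For ad-hoc checks without a Prometheus server, the exposition text above can be parsed directly. The sketch below is deliberately simplified (an assumption, not a full parser): it handles only labelled samples and does not support escaped quotes inside label values. The function and variable names are illustrative.

```python
import re
from typing import Dict, List, Tuple

SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')

Sample = Tuple[str, Dict[str, str], float]


def parse_samples(text: str) -> List[Sample]:
    """Parse labelled samples, skipping # HELP / # TYPE comment lines."""
    samples: List[Sample] = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_RE.match(line)
        if match:
            labels = dict(LABEL_RE.findall(match.group("labels")))
            samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


exposition = """\
# TYPE futu_auth_events_total counter
futu_auth_events_total{iface="rest",outcome="allow",key_id="bot_a"} 1234
futu_auth_limit_rejects_total{iface="grpc",key_id="bot_b",reason="rate"} 7
"""
for name, labels, value in parse_samples(exposition):
    print(name, labels, value)
```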
Reason buckets¶
The reason label on futu_auth_limit_rejects_total is a finite set:
| reason | Meaning |
|---|---|
| rate | Rate-limit exceeded |
| daily | Daily cap exceeded |
| per_order | Per-order cap exceeded |
| market | Market whitelist |
| symbol | Symbol whitelist |
| side | Direction whitelist |
| hours | Time window |
| other | Other (not covered by classify_limit_reason) |
Alert rule examples¶
groups:
  - name: futu-opend
    rules:
      - alert: FutuAuthRejectSpike
        expr: rate(futu_auth_events_total{outcome="reject"}[5m]) > 10
        for: 5m
        annotations:
          summary: "Auth reject rate high ({{ $value }}/s)"
          description: "possibly an attack or misconfigured key"
      - alert: FutuRateLimitFrequent
        expr: rate(futu_auth_limit_rejects_total{reason="rate"}[15m]) > 1
        for: 15m
        annotations:
          summary: "Key {{ $labels.key_id }} repeatedly hitting rate limit"
      - alert: FutuDailyCapNearLimit
        # increase() over 24h: the raw counter only grows until the process restarts
        expr: increase(futu_auth_limit_rejects_total{reason="daily"}[24h]) > 5
        for: 5m
        annotations:
          summary: "Key {{ $labels.key_id }} hit the daily cap multiple times in the last 24h"
Grafana dashboard¶
Starting with v1.4.103, release tarballs bundle a pre-built Grafana
dashboard JSON: examples/grafana/futu-opend-dashboard.json.
Pre-built dashboard — import steps¶
1. Extract the release tarball and locate examples/grafana/futu-opend-dashboard.json
2. In Grafana, go to Dashboards → New → Import
3. Click Upload JSON file and upload the file from step 1
4. On the import page, pick your Prometheus datasource (the dashboard uses a ${DS_PROMETHEUS} variable so no JSON edit is needed)
5. Click Import: the dashboard loads, and the $interface / $key_id variables auto-populate from your Prometheus labels
Dashboard contents¶
Two multi-select variables at the top ($interface / $key_id, both support All):
| Row | Panel | Type | Core PromQL |
|---|---|---|---|
| Auth | Auth events — allow vs reject | timeseries | sum by (outcome) (rate(futu_auth_events_total[5m])) |
| Auth | Request rate by interface (stacked) | timeseries | sum by (iface) (rate(futu_auth_events_total[5m])) |
| Auth | Top rejected keys | table | topk(10, sum by (key_id) (increase(futu_auth_events_total{outcome="reject"}[$__range]))) |
| Auth | Limit-reject reasons (pie) | piechart | sum by (reason) (increase(futu_auth_limit_rejects_total[$__range])) |
| Limits | Limit rejects by reason (per second) | timeseries | sum by (reason) (rate(futu_auth_limit_rejects_total[5m])) |
| Limits | Top keys by limit-rejects | table | topk(10, sum by (key_id, reason) (increase(futu_auth_limit_rejects_total[$__range]))) |
| WS Push | WS push filtered (scope mismatch) | timeseries | sum by (required_scope) (rate(futu_ws_filtered_pushes_total[5m])) |
| WS Push | Top keys by WS push filtered | table | topk(10, sum by (key_id, required_scope) (increase(futu_ws_filtered_pushes_total[$__range]))) |
| Summary | Allow rate / Total events / Limit rejects / WS filtered | stat × 4 | range-aggregate |
Total: 12 data panels + 4 row headers = 16 items, schemaVersion 38 (compatible with Grafana v10.x).
Manual panels (fallback — you already have your own dashboard)¶
If you don't want to import the whole dashboard and just want to bolt a few panels onto an existing board, copy these PromQL queries directly:
- Request rate by iface: sum by (iface) (rate(futu_auth_events_total[5m]))
- Allow vs Reject: sum by (outcome) (rate(futu_auth_events_total[5m]))
- Top rejected keys: topk(10, sum by (key_id) (rate(futu_auth_events_total{outcome="reject"}[1h])))
- Limit reject breakdown: sum by (reason) (increase(futu_auth_limit_rejects_total[$__range]))
- WS filter dropped: sum by (required_scope) (rate(futu_ws_filtered_pushes_total[5m]))
Push / trade health (non-Prometheus)¶
In the current release, Prometheus exposes only the three counters above. Push stream health (F3 staleness / F4 circuit breaker / F5 subscriber info) lives on the /api/push-subscriber-info sync endpoint, not in Prometheus; see the "Push-chain self-healing (v1.4.84)" section below. To get these fields into Grafana, add a JSON API datasource pointed at /api/push-subscriber-info (not included in the default dashboard).
How they fit together¶
- Day-to-day monitoring → Grafana dashboard
- Alert fires → look up the specific event in audit JSONL
- Deep investigation → audit JSONL + DuckDB / jq
Metrics cover trends (numeric aggregates), audit covers forensics (specific events), and logs cover debugging (why did it fail).
Push-chain self-healing (v1.4.84)¶
v1.4.84 §9 introduced a 6-layer CMD3020 chain recovery to keep the push channel in a steady state over the long term. The v1.4.84 A3 canary gates are its real-machine verification exit. Operators need to know how these components work and how to observe their state.
F3 staleness detector (30s interval / 60s threshold)¶
A background task scans the push channel every 30 seconds:
- If stale for >60s and there are active subscriptions → auto-trigger re-subscribe
- daemon log: tracing::warn! "v1.4.84 §9 F3: push stream stale >60s, auto re-subscribe"
- /api/push-subscriber-info counter resubscribe_triggers bumps
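The staleness rule above reduces to one predicate evaluated on each 30-second scan. A minimal sketch, not the daemon's actual code: the function and parameter names are hypothetical, and only the 60s threshold and the active-subscription condition come from the text.

```python
STALE_THRESHOLD_MS = 60_000  # 60s threshold from the text
SCAN_INTERVAL_S = 30         # scan cadence from the text; the caller's loop would use it


def should_resubscribe(now_ms: int, last_push_ms: int, active_subscriptions: int) -> bool:
    """True when no push arrived for more than 60s and there is something to re-subscribe."""
    stale = (now_ms - last_push_ms) > STALE_THRESHOLD_MS
    return stale and active_subscriptions > 0


# 70s of silence with two active subscriptions
print(should_resubscribe(now_ms=100_000, last_push_ms=30_000, active_subscriptions=2))
```

Requiring at least one active subscription keeps an idle daemon, which legitimately receives no pushes, from re-subscribing forever.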
F4 circuit breaker (30s cooldown)¶
If F3's auto re-subscribe fails to restore the stream within 60s → circuit trips:
- Dispatcher skips further push events for 30s to avoid spinning on bad state
- 30s elapses or any successful push arrives → auto-reset
- Observation: /api/push-subscriber-info.is_circuit_tripped_now + counter circuit_breaker_trips
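The trip and reset behaviour can be modelled as a small state machine. The class below is an illustrative sketch under the rules listed above (30s cooldown, reset on timeout or on a successful push); the class and method names are assumptions, not the daemon's API.

```python
from dataclasses import dataclass
from typing import Optional

COOLDOWN_MS = 30_000  # 30s cooldown from the text


@dataclass
class PushCircuitBreaker:
    tripped_at_ms: Optional[int] = None
    trips: int = 0  # plays the role of the circuit_breaker_trips counter

    def trip(self, now_ms: int) -> None:
        """Open the circuit; repeated calls while already open are not counted again."""
        if self.tripped_at_ms is None:
            self.tripped_at_ms = now_ms
            self.trips += 1

    def on_successful_push(self) -> None:
        """Any successful push resets the circuit immediately."""
        self.tripped_at_ms = None

    def is_tripped(self, now_ms: int) -> bool:
        """Report the current state, auto-resetting once the cooldown has elapsed."""
        if self.tripped_at_ms is None:
            return False
        if now_ms - self.tripped_at_ms >= COOLDOWN_MS:
            self.tripped_at_ms = None
            return False
        return True


breaker = PushCircuitBreaker()
breaker.trip(now_ms=0)
print(breaker.is_tripped(now_ms=10_000), breaker.is_tripped(now_ms=30_000))
```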
F2 retry (TradeReQuery 0ms / 1s / 3s / 9s)¶
Backend queries triggered by order/fill push notify (query_orders / query_account_info / query_order_fills) use 4-attempt exponential backoff:
- Attempt 1: 0ms (immediate)
- Attempt 2: 1s
- Attempt 3: 3s
- Attempt 4: 9s
- daemon log tag: "v1.4.84 §9 F2: retry" with attempt number
F5 /api/push-subscriber-info field reference¶
| Field | Type | Meaning |
|---|---|---|
| push_stream_healthy | bool | Composite (circuit not tripped + consecutive_parse_errors <5 + last push <60s) |
| last_push_received_at_ms | int | Unix ms timestamp of the most recent push |
| consecutive_parse_errors | int | Consecutive parse failures (F3 fires when >=5) |
| total_parse_errors | int | Cumulative parse errors (monotonic counter) |
| resubscribe_triggers | int | How many times F3 fired auto re-sub |
| circuit_breaker_trips | int | How many times F4 has tripped |
| is_circuit_tripped_now | bool | Whether the circuit is currently tripped |
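The composite push_stream_healthy flag can be recomputed client-side from the other fields, which is handy for cross-checking a captured response. A minimal sketch following the three conditions in the table; the function name and the sample payload are illustrative.

```python
def push_stream_healthy(info: dict, now_ms: int) -> bool:
    """Recompute the composite: circuit closed, <5 consecutive parse errors, last push <60s ago."""
    return (
        not info["is_circuit_tripped_now"]
        and info["consecutive_parse_errors"] < 5
        and (now_ms - info["last_push_received_at_ms"]) < 60_000
    )


# Hypothetical payload with only the fields this check needs.
snapshot = {
    "is_circuit_tripped_now": False,
    "consecutive_parse_errors": 0,
    "last_push_received_at_ms": 1_000_000,
}
print(push_stream_healthy(snapshot, now_ms=1_030_000))
```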
F6 orphan order scan (30s interval / 5min threshold)¶
A background task periodically scans for orders with status=1 (Unsubmitted):
- If an order is stuck in Unsubmitted for >5 minutes → daemon warn log "v1.4.84 §9 F6: orphan order detected acc_id=X order_id=Y age=Zs"
- Diagnoses cases such as the broker losing the fill notification, the daemon failing to rebuild its subscription, etc.
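The orphan check itself is a filter over open orders. The sketch below assumes a hypothetical Order record whose created_at_s approximates when the order entered the Unsubmitted state; only status=1 and the 5-minute threshold come from the text.

```python
from dataclasses import dataclass
from typing import Iterable, List

STATUS_UNSUBMITTED = 1   # status=1 from the text
ORPHAN_AGE_S = 5 * 60    # 5-minute threshold from the text


@dataclass(frozen=True)
class Order:
    acc_id: int
    order_id: str
    status: int
    created_at_s: float  # hypothetical field: when the order became Unsubmitted


def find_orphans(orders: Iterable[Order], now_s: float) -> List[Order]:
    """Return orders that have stayed Unsubmitted for more than 5 minutes."""
    return [
        o for o in orders
        if o.status == STATUS_UNSUBMITTED and (now_s - o.created_at_s) > ORPHAN_AGE_S
    ]


orders = [
    Order(acc_id=1, order_id="a", status=1, created_at_s=600.0),  # 400s old, Unsubmitted
    Order(acc_id=1, order_id="b", status=1, created_at_s=800.0),  # 200s old, Unsubmitted
    Order(acc_id=1, order_id="c", status=2, created_at_s=100.0),  # old, but another status
]
for o in find_orphans(orders, now_s=1000.0):
    print(f"orphan order detected acc_id={o.acc_id} order_id={o.order_id}")
```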
Canary real-machine verify (v1.4.84)¶
Location: scripts/canary.sh. 6 gates from v1.4.82 + 4 new gates from
v1.4.84 A3 = 10 gates total.
Prerequisites¶
- daemon already running (futu-opend up)
- env vars $ACCOUNT / $PWD set (non-interactive login credentials)
Usage¶
# Run all gates
./scripts/canary.sh
# Run a single gate
./scripts/canary.sh canary_7_push_health_f5_live
./scripts/canary.sh canary_10_f3_staleness_auto_resub
Gate reference¶
| Gate | Introduced | What it verifies |
|---|---|---|
| canary_1_subscribe_push | v1.4.82 | Subscribe → WS push event received |
| canary_2_place_order_cache | v1.4.82 | PlaceOrder → visible in /api/orders within 0ms |
| canary_3_subscribe_wrong_fields | v1.4.82 | Subscribe with bad fields → loud error (deny_unknown_fields guard) |
| canary_4_sim_place_order_hint | v1.4.82 | sim PlaceOrder with bad params → returns sim hint |
| canary_5_history_kline_validation | v1.4.82 | history-kline validation → loud error |
| canary_6_cmd3020_recovery | v1.4.84 | CMD3020 chain recovery real-machine (placeholder, depends on backend fault injection) |
| canary_7_push_health_f5_live | v1.4.84 | /api/push-subscriber-info returns all 5 live fields |
| canary_8_orphan_scan_f6 | v1.4.84 | 5.5min wait + daemon log shows orphan warn |
| canary_9_f2_retry_exp_backoff | v1.4.84 | daemon log contains F2 retry smoke test |
| canary_10_f3_staleness_auto_resub | v1.4.84 | resubscribe_triggers counter bumps |
SKIP semantics (NOT FAIL)¶
Canary gates SKIP (not FAIL) in these cases:
- daemon not running (no response on :22222)
- daemon log file missing ($LOG_FILE unset or empty)
- No retry / re-sub events in the observation window (the system is healthy, not a bug)
SKIP doesn't block a release; FAIL does. On a healthy daemon, some of gates 7-10 often SKIP; a full PASS typically requires real-machine fault injection or waiting long enough for the background scan to fire.
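If gate results are collected by a wrapper (for example in CI), the release decision follows directly from the rule that only FAIL blocks. The sketch below is an assumption about how such a wrapper could be written; it does not describe how scripts/canary.sh reports results, and the enum and function names are hypothetical.

```python
from enum import Enum
from typing import Dict


class GateResult(Enum):
    PASS = "PASS"
    SKIP = "SKIP"
    FAIL = "FAIL"


def release_blocked(results: Dict[str, GateResult]) -> bool:
    """A release is blocked only if at least one gate FAILed; SKIP is neutral."""
    return any(result is GateResult.FAIL for result in results.values())


# Illustrative outcome on a healthy daemon: some v1.4.84 gates SKIP.
results = {
    "canary_7_push_health_f5_live": GateResult.PASS,
    "canary_9_f2_retry_exp_backoff": GateResult.SKIP,
    "canary_10_f3_staleness_auto_resub": GateResult.SKIP,
}
print(release_blocked(results))
```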
Push stream troubleshooting (v1.4.84)¶
When ops sees push_stream_healthy=false or the FutuAuthRejectSpike alert fires, follow this triage flow:
1. Check /api/push-subscriber-info first. Inspect push_stream_healthy / last_push_received_at_ms / consecutive_parse_errors / is_circuit_tripped_now.
2. If push_stream_healthy=false, dig deeper:
   - consecutive_parse_errors >= 5 → F3 is firing, auto re-sub within 30s
   - is_circuit_tripped_now=true → F4 in cooldown, auto-reset in 30s
   - last_push_received_at_ms >60s ago → channel stale, F3 is firing
3. Check daemon log for trigger history
4. Escalation scenarios
   - resubscribe_triggers keeps growing but push_stream_healthy stays false → long-term backend failure or bad subscription parameters; manual restart + backend-status check required
   - circuit_breaker_trips >3 in a short window → F3 re-sub isn't working; the backend side may need admin intervention
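The triage and escalation rules can be scripted against two captured /api/push-subscriber-info responses. This is a sketch under stated assumptions: the function names are hypothetical, the payloads are plain dicts with the F5 fields, and "a short window" is taken to be whatever interval separates the two snapshots.

```python
from typing import List


def triage(info: dict, now_ms: int) -> List[str]:
    """Map one snapshot to the findings listed in step 2."""
    if info["push_stream_healthy"]:
        return []
    findings = []
    if info["consecutive_parse_errors"] >= 5:
        findings.append("parse errors >= 5: F3 firing, auto re-sub within 30s")
    if info["is_circuit_tripped_now"]:
        findings.append("circuit tripped: F4 in cooldown, auto-reset in 30s")
    if now_ms - info["last_push_received_at_ms"] > 60_000:
        findings.append("no push for >60s: channel stale, F3 firing")
    return findings


def needs_escalation(before: dict, after: dict) -> bool:
    """Compare two snapshots taken some time apart (the window is the caller's choice)."""
    still_unhealthy = not before["push_stream_healthy"] and not after["push_stream_healthy"]
    resub_growing = after["resubscribe_triggers"] > before["resubscribe_triggers"]
    many_trips = after["circuit_breaker_trips"] - before["circuit_breaker_trips"] > 3
    return (still_unhealthy and resub_growing) or many_trips


# Hypothetical snapshots for illustration.
first = {
    "push_stream_healthy": False, "consecutive_parse_errors": 0,
    "is_circuit_tripped_now": False, "last_push_received_at_ms": 0,
    "resubscribe_triggers": 2, "circuit_breaker_trips": 0,
}
second = {**first, "resubscribe_triggers": 5}
print(triage(second, now_ms=120_000), needs_escalation(first, second))
```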