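Audit and Observability¶
Three signal sources¶
- Regular logs: tracing output to stderr / journald at DEBUG/INFO/WARN/ERROR levels, for development and troubleshooting
- Audit JSONL: contains only events with target=futu_audit, for compliance, after-the-fact tracing, and attack investigation
- Prometheus metrics: aggregated data in counter form, for alerting and dashboards
Audit JSONL¶
To enable: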
futu-opend --audit-log /var/log/futu-audit.jsonl
# or
futu-opend --audit-log /var/log/futu/ # directory → daily rotation as futu-audit.log.YYYY-MM-DD
futu-mcp and futucli accept a flag of the same name.
Event schema¶
{
  "timestamp": "2026-04-15T10:23:45.123Z",
  "level": "WARN", // reject=WARN, allow=INFO, trade=WARN
  "target": "futu_audit",
  "iface": "rest" | "grpc" | "ws" | "mcp" | "cli",
  "endpoint": "/api/order" | "proto_id=2202" | "futu_place_order",
  "key_id": "bot_a" | "<missing>" | "<invalid>" | "<none>",
  "outcome": "allow" | "reject" | "success" | "failure",
  "reason": "limit: rate limit exceeded: 5 in 60s (cap 3)",
  "scope": "trade:real", // present on allow
  "args_hash": "8a3f2b9c" // trade events: first 8 hex chars of SHA-256
}
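For consumers that read the audit file programmatically, a minimal Python sketch of validating one line against the schema above may help. The names parse_audit_line and short_args_hash are illustrative, and the set of keys treated as required is an assumption drawn from the example. How the daemon serializes arguments before hashing is not specified here, so the hash helper only demonstrates the "first 8 hex characters of SHA-256" truncation.

```python
import hashlib
import json

ALLOWED_IFACES = {"rest", "grpc", "ws", "mcp", "cli"}
ALLOWED_OUTCOMES = {"allow", "reject", "success", "failure"}
# Assumption: these keys appear on every event; reason / scope / args_hash are conditional.
REQUIRED_KEYS = {"timestamp", "level", "target", "iface", "endpoint", "key_id", "outcome"}


def parse_audit_line(line: str) -> dict:
    """Parse one JSONL line and check it against the documented schema."""
    event = json.loads(line)
    missing = REQUIRED_KEYS - event.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if event["target"] != "futu_audit":
        raise ValueError(f"unexpected target: {event['target']!r}")
    if event["iface"] not in ALLOWED_IFACES:
        raise ValueError(f"unexpected iface: {event['iface']!r}")
    if event["outcome"] not in ALLOWED_OUTCOMES:
        raise ValueError(f"unexpected outcome: {event['outcome']!r}")
    return event


def short_args_hash(serialized_args: bytes) -> str:
    """Return the first 8 hex characters of SHA-256 over the given bytes.

    The byte serialization of the arguments is an assumption of this sketch.
    """
    return hashlib.sha256(serialized_args).hexdigest()[:8]
```

A caller would typically loop over the file and pass each non-empty line to parse_audit_line, logging and skipping any line that raises.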
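Common jq queries¶
# most recent rejects (-c keeps each event on one line, so tail counts events rather than pretty-printed lines)
jq -c 'select(.outcome=="reject")' /var/log/futu-audit.jsonl | tail -20
# order activity for a given key (the parentheses keep the pipe from swallowing the whole "and" expression)
jq 'select(.key_id=="bot_a" and (.endpoint|test("order|place|modify")))' \
/var/log/futu-audit.jsonl
# count rejects by reason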
jq -r 'select(.outcome=="reject") | .reason' /var/log/futu-audit.jsonl \
| awk -F': ' '{print $1}' \
| sort | uniq -c | sort -rn
# request distribution per iface
jq -r '.iface' /var/log/futu-audit.jsonl | sort | uniq -c
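Where jq is not available, the "count rejects by reason" pipeline above can be approximated in Python. This is a sketch under the same assumption the awk step makes, namely that the category is the text before the first ": " in reason; the function name count_reject_prefixes is illustrative.

```python
import json
from collections import Counter
from typing import Iterable


def count_reject_prefixes(lines: Iterable[str]) -> Counter:
    """Count reject events grouped by the text before the first ': ' in reason."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the file
        event = json.loads(line)
        if event.get("outcome") != "reject":
            continue
        prefix = str(event.get("reason", "")).split(": ", 1)[0]
        counts[prefix] += 1
    return counts
```

Passing an open file handle for /var/log/futu-audit.jsonl works directly, and counts.most_common() gives the same descending order as sort | uniq -c | sort -rn.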
Batch analysis with DuckDB¶
-- load the JSONL
CREATE TABLE audit AS SELECT * FROM read_json_auto('/var/log/futu-audit.jsonl');
-- daily count of order-related events for a given key
SELECT DATE(timestamp) AS day, COUNT(*) AS orders
FROM audit
WHERE key_id = 'bot_a' AND endpoint LIKE '%order%'
GROUP BY day;
Prometheus metrics¶
Scrape config:
scrape_configs:
  - job_name: futu-opend
    static_configs: [{ targets: ['opend:22222'] }]
  - job_name: futu-mcp
    static_configs: [{ targets: ['mcp:38765'] }]
The three counters¶
# HELP futu_auth_events_total Auth / trade events by iface, outcome, key_id
# TYPE futu_auth_events_total counter
futu_auth_events_total{iface="rest",outcome="allow",key_id="bot_a"} 1234
# HELP futu_auth_limit_rejects_total Limit-check rejects by iface, key_id, reason
# TYPE futu_auth_limit_rejects_total counter
futu_auth_limit_rejects_total{iface="grpc",key_id="bot_b",reason="rate"} 7
# HELP futu_ws_filtered_pushes_total Pushes filtered out for client lacking scope
# TYPE futu_ws_filtered_pushes_total counter
futu_ws_filtered_pushes_total{required_scope="trade",key_id="bot_c"} 42
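For quick checks without a Prometheus server, the exposition text above can be summed by label in a few lines of Python. This is a simplified sketch, not a full parser of the exposition format; the function name sum_by_label is illustrative, and fetching the text (URL and path) is left to the caller because the scrape path is not stated here.

```python
import re
from collections import defaultdict
from typing import Dict

# metric name, optional {labels}, value, optional integer timestamp
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)(?:\s+-?\d+)?$'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def sum_by_label(text: str, metric: str, label: str) -> Dict[str, float]:
    """Sum the samples of one metric, grouped by the value of one label."""
    totals: Dict[str, float] = defaultdict(float)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP / TYPE comments and blank lines
        match = SAMPLE_RE.match(line)
        if match is None or match.group("name") != metric:
            continue
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        totals[labels.get(label, "")] += float(match.group("value"))
    return dict(totals)
```

For example, sum_by_label(text, "futu_auth_limit_rejects_total", "reason") gives the per-reason totals across all keys and interfaces.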
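Reason buckets¶
The reason label of futu_auth_limit_rejects_total takes values from a fixed set:
| reason | Meaning |
|---|---|
| rate | Rate limit exceeded |
| daily | Daily cumulative limit exceeded |
| per_order | Per-order limit exceeded |
| market | Market whitelist |
| symbol | Symbol whitelist |
| side | Side whitelist |
| hours | Time-of-day window |
| other | Anything else (not covered by classify_limit_reason) |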
Example alert rules¶
groups:
  - name: futu-opend
    rules:
      - alert: FutuAuthRejectSpike
        expr: rate(futu_auth_events_total{outcome="reject"}[5m]) > 10
        for: 5m
        annotations:
          summary: "Auth reject rate high ({{ $value }}/s)"
          description: "Possibly an attack or a misconfigured key"
      - alert: FutuRateLimitFrequent
        expr: rate(futu_auth_limit_rejects_total{reason="rate"}[15m]) > 1
        for: 15m
        annotations:
          summary: "Key {{ $labels.key_id }} has been hitting the rate limit for an extended period"
      - alert: FutuDailyCapNearLimit
        # rolling 24h increase; the raw counter value is cumulative since process start
        expr: increase(futu_auth_limit_rejects_total{reason="daily"}[24h]) > 5
        for: 5m
        annotations:
          summary: "Key {{ $labels.key_id }} hit the daily cumulative cap multiple times today"
Grafana dashboard¶
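Starting with v1.4.103, the release tarball ships a prebuilt Grafana dashboard JSON: examples/grafana/futu-opend-dashboard.json.
Prebuilt dashboard: import steps¶
- Unpack the release tarball and take examples/grafana/futu-opend-dashboard.json
- In Grafana's left sidebar → Dashboards → New → Import
- Choose Upload JSON file and upload the file from the previous step
- On the import page, select the Prometheus datasource (the dashboard binds through the ${DS_PROMETHEUS} variable, so the JSON needs no edits)
- Click Import → the dashboard loads, and the two variables $interface / $key_id populate automatically from Prometheus labels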
Dashboard contents¶
The two variables at the top ($interface / $key_id) support multi-select + All:
| Row | Panel | Type | Core PromQL |
|---|---|---|---|
| Auth | Auth events — allow vs reject | timeseries | sum by (outcome) (rate(futu_auth_events_total[5m])) |
| Auth | Request rate by interface (stacked) | timeseries | sum by (iface) (rate(futu_auth_events_total[5m])) |
| Auth | Top rejected keys | table | topk(10, sum by (key_id) (increase(futu_auth_events_total{outcome="reject"}[$__range]))) |
| Auth | Limit-reject reasons (pie) | piechart | sum by (reason) (increase(futu_auth_limit_rejects_total[$__range])) |
| Limits | Limit rejects by reason (per second) | timeseries | sum by (reason) (rate(futu_auth_limit_rejects_total[5m])) |
| Limits | Top keys by limit-rejects | table | topk(10, sum by (key_id, reason) (increase(futu_auth_limit_rejects_total[$__range]))) |
| WS Push | WS push filtered (scope mismatch) | timeseries | sum by (required_scope) (rate(futu_ws_filtered_pushes_total[5m])) |
| WS Push | Top keys by WS push filtered | table | topk(10, sum by (key_id, required_scope) (increase(futu_ws_filtered_pushes_total[$__range]))) |
| Summary | Allow rate / Total events / Limit rejects / WS filtered | stat × 4 | range-aggregate |
In total: 12 data panels + 4 row headers = 16 items, schemaVersion 38 (compatible with Grafana v10.x).
Building panels by hand (fallback if you already have your own dashboard style)¶
If you would rather not import the whole dashboard and only want to add a few panels to an existing one, copy the PromQL below:
- Request rate by iface: sum by (iface) (rate(futu_auth_events_total[5m]))
- Allow vs Reject: sum by (outcome) (rate(futu_auth_events_total[5m]))
- Top rejected keys: topk(10, sum by (key_id) (rate(futu_auth_events_total{outcome="reject"}[1h])))
- Limit reject breakdown: sum by (reason) (increase(futu_auth_limit_rejects_total[$__range]))
- WS filter dropped: sum by (required_scope) (rate(futu_ws_filtered_pushes_total[5m]))
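Push / trade health observation (outside Prometheus)¶
In the current version, Prometheus exposes only the 3 counters above. Push stream health (F3 staleness / F4 circuit breaker / F5 subscriber_info) is served by the synchronous /api/push-subscriber-info endpoint rather than Prometheus; see the Push chain self-healing section (v1.4.84) below. To view these fields in Grafana, you can wire /api/push-subscriber-info in through a JSON API datasource (it is not part of the default dashboard).
Working together¶
- Day-to-day monitoring → Grafana dashboard
- Alert fires → look up the specific events in the audit JSONL
- Deep investigation → audit JSONL + DuckDB / jq
The three complement each other: metrics show trends (aggregated numbers), the audit log provides traceability (specific events), and regular logs serve debugging (why something went wrong).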
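Push chain self-healing (v1.4.84)¶
v1.4.84 §9 CMD3020 chain recovery introduced 6 layers of defense to keep the push channel stable over the long run. The v1.4.84 A3 canary serves as the live-environment verification gate, so ops needs to know how these components work and how to observe their state.
F3 staleness detector (30-second interval / 60-second threshold)¶
A background task inspects the push channel every 30 seconds:
- If no push has arrived for more than 60 seconds and there are active subscriptions → automatically trigger a re-subscribe
- The daemon log shows tracing::warn! "v1.4.84 §9 F3: push stream stale >60s, auto re-subscribe"
- The resubscribe_triggers counter in /api/push-subscriber-info increments
F4 circuit breaker (30-second cooldown)¶
After an F3 auto re-subscribe, if the stream is still stale within 60 seconds → the circuit trips:
- The dispatcher skips subsequent push events for 30 seconds to avoid spinning and flooding errors
- When the 30 seconds expire or any successful push arrives → auto-reset
- Observation points: /api/push-subscriber-info.is_circuit_tripped_now + the circuit_breaker_trips counter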
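To make the interplay between F3 and F4 easier to follow, here is a small Python model of the two timers. It is a sketch under stated assumptions, not the daemon's implementation: the class name PushHealthModel is hypothetical, time is passed in explicitly, "still stale within 60 seconds" is read as "no push has arrived by the time 60 seconds have passed since the re-subscribe", and the dispatcher's skipping of events during cooldown is not modeled.

```python
from dataclasses import dataclass
from typing import Optional

STALE_AFTER_S = 60    # F3: no push for longer than this counts as stale
RECHECK_AFTER_S = 60  # F4: how long a re-subscribe gets before it is judged ineffective
COOLDOWN_S = 30       # F4: circuit breaker cooldown


@dataclass
class PushHealthModel:
    last_push_at: float
    active_subscriptions: int = 1
    resubscribe_triggers: int = 0
    circuit_breaker_trips: int = 0
    last_resub_at: Optional[float] = None
    tripped_at: Optional[float] = None

    def on_push(self, now: float) -> None:
        """Any successful push clears the pending re-subscribe and resets the circuit."""
        self.last_push_at = now
        self.last_resub_at = None
        self.tripped_at = None

    def tick(self, now: float) -> str:
        """One inspection pass; the text above runs it every 30 seconds."""
        if self.tripped_at is not None and now - self.tripped_at >= COOLDOWN_S:
            self.tripped_at = None  # cooldown expired, auto-reset
        stale = now - self.last_push_at > STALE_AFTER_S
        if not stale or self.active_subscriptions == 0:
            return "ok"
        if self.tripped_at is not None:
            return "cooldown"
        if self.last_resub_at is None:
            self.last_resub_at = now
            self.resubscribe_triggers += 1
            return "resubscribe"
        if now - self.last_resub_at >= RECHECK_AFTER_S:
            self.circuit_breaker_trips += 1  # the re-subscribe did not help
            self.tripped_at = now
            self.last_resub_at = None
            return "trip"
        return "waiting"
```

Driving tick with a fake clock shows why resubscribe_triggers and circuit_breaker_trips climb together when the backend stays silent: each trip is preceded by a re-subscribe that did not bring pushes back.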
F2 retry (TradeReQuery 0ms / 1s / 3s / 9s)¶
Backend queries triggered by order / fill push notifications (query_orders / query_account_info / query_order_fills) use 4 attempts with exponential backoff:
- Attempt 1: 0ms (immediate)
- Attempt 2: 1s
- Attempt 3: 3s
- Attempt 4: 9s
- Daemon log tag: "v1.4.84 §9 F2: retry" with the attempt number
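The retry schedule can be sketched in Python as follows. This assumes each listed delay is the wait before that attempt, measured from the previous failure; the source does not say whether the delays are relative or absolute. The function name requery_with_backoff and the injectable sleep parameter are illustrative.

```python
import time
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")

# Wait before attempts 1..4, in seconds (0ms / 1s / 3s / 9s)
RETRY_DELAYS_S = (0.0, 1.0, 3.0, 9.0)


def requery_with_backoff(
    query: Callable[[], T],
    sleep: Callable[[float], None] = time.sleep,
    delays: Sequence[float] = RETRY_DELAYS_S,
) -> T:
    """Call query up to len(delays) times, waiting the listed delay before each attempt."""
    last_error: Optional[Exception] = None
    for delay in delays:
        if delay > 0:
            sleep(delay)
        try:
            return query()
        except Exception as exc:  # a real client would narrow this to transport errors
            last_error = exc
    assert last_error is not None, "delays must not be empty"
    raise last_error
```

Injecting sleep keeps the schedule testable without real waiting: a test can pass a list's append method and assert on the recorded delays.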
F5 /api/push-subscriber-info field reference¶
| Field | Type | Meaning |
|---|---|---|
| push_stream_healthy | bool | Overall verdict (circuit not tripped + consecutive_parse_errors <5 + last push <60s ago) |
| last_push_received_at_ms | int | Unix timestamp in ms of the most recent push arrival |
| consecutive_parse_errors | int | Number of consecutive parse failures (F3 threshold: >=5 triggers a re-sub) |
| total_parse_errors | int | Cumulative parse errors (monotonic counter) |
| resubscribe_triggers | int | Cumulative number of F3 auto re-sub triggers |
| circuit_breaker_trips | int | Cumulative number of F4 trips |
| is_circuit_tripped_now | bool | Whether the circuit is currently tripped |
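A client that consumes this endpoint might decode the response into a typed record and cross-check the overall verdict. The sketch below assumes the response body is a flat JSON object carrying exactly the fields in the table, which is not confirmed here; the class name PushSubscriberInfo and the method recomputed_healthy are illustrative.

```python
import json
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class PushSubscriberInfo:
    push_stream_healthy: bool
    last_push_received_at_ms: int
    consecutive_parse_errors: int
    total_parse_errors: int
    resubscribe_triggers: int
    circuit_breaker_trips: int
    is_circuit_tripped_now: bool

    @classmethod
    def from_json(cls, body: str) -> "PushSubscriberInfo":
        """Decode the body and enforce the bool / int types listed in the table."""
        raw = json.loads(body)
        values = {}
        for field in fields(cls):
            value = raw[field.name]  # KeyError if the field is absent
            if type(value) is not field.type:
                raise TypeError(f"{field.name}: expected {field.type.__name__}")
            values[field.name] = value
        return cls(**values)

    def recomputed_healthy(self, now_ms: int) -> bool:
        """Re-derive the verdict from the three documented conditions."""
        return (
            not self.is_circuit_tripped_now
            and self.consecutive_parse_errors < 5
            and now_ms - self.last_push_received_at_ms < 60_000
        )
```

Comparing recomputed_healthy against push_stream_healthy is a cheap sanity check when a dashboard and the daemon appear to disagree.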
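F6 orphan order scan (30-second interval / 5-minute threshold)¶
A background task periodically scans orders in status=1 Unsubmitted:
- If an order stays in Unsubmitted for more than 5 minutes → daemon warn log "v1.4.84 §9 F6: orphan order detected acc_id=X order_id=Y age=Zs"
- Useful for pinpointing cases such as a lost fill notify from the broker or a failed subscription rebuild in the daemon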
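The scan rule itself is simple enough to express in a few lines. The following Python sketch uses a hypothetical Order record; in particular, the created_at field (epoch seconds at which the order entered Unsubmitted) is an assumption, since the daemon's internal order representation is not described here.

```python
from dataclasses import dataclass
from typing import Iterable, List

STATUS_UNSUBMITTED = 1
ORPHAN_AFTER_S = 5 * 60  # 5-minute threshold


@dataclass(frozen=True)
class Order:
    acc_id: int
    order_id: str
    status: int
    created_at: float  # assumed: epoch seconds when the order entered Unsubmitted


def find_orphans(orders: Iterable[Order], now: float) -> List[str]:
    """Return one warning message per order stuck in Unsubmitted past the threshold."""
    warnings = []
    for order in orders:
        age = now - order.created_at
        if order.status == STATUS_UNSUBMITTED and age > ORPHAN_AFTER_S:
            warnings.append(
                f"orphan order detected acc_id={order.acc_id} "
                f"order_id={order.order_id} age={int(age)}s"
            )
    return warnings
```

Because the scan only reads state and emits warnings, running it every 30 seconds repeats the warning for an order until its status changes.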
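Canary live verification (v1.4.84)¶
Location: scripts/canary.sh. The first version in v1.4.82 had 6 gates, and v1.4.84 A3 added 4 more, for 10 gates in total.
Prerequisites¶
- The daemon is already running (futu-opend is up)
- The environment variables $ACCOUNT / $PWD are set (non-interactive login credentials)
Usage¶
# run all gates
./scripts/canary.sh
# run a single gate
./scripts/canary.sh canary_7_push_health_f5_live
./scripts/canary.sh canary_10_f3_staleness_auto_resub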
Gate list¶
| Gate | Introduced in | What it verifies |
|---|---|---|
| canary_1_subscribe_push | v1.4.82 | WS push events arrive after subscribing |
| canary_2_place_order_cache | v1.4.82 | The order is visible in /api/orders within 0ms of PlaceOrder |
| canary_3_subscribe_wrong_fields | v1.4.82 | Wrong subscription parameters → loud error returned (deny_unknown_fields as the backstop) |
| canary_4_sim_place_order_hint | v1.4.82 | sim PlaceOrder with wrong parameters → sim hint returned |
| canary_5_history_kline_validation | v1.4.82 | history-kline validation → loud error |
| canary_6_cmd3020_recovery | v1.4.84 | CMD3020 chain recovery on a live environment (placeholder; depends on backend fault injection) |
| canary_7_push_health_f5_live | v1.4.84 | /api/push-subscriber-info returns the 5 real fields |
| canary_8_orphan_scan_f6 | v1.4.84 | 5.5-minute wait + the orphan warn appears in the daemon log |
| canary_9_f2_retry_exp_backoff | v1.4.84 | Daemon log contains F2 retry (smoke test) |
| canary_10_f3_staleness_auto_resub | v1.4.84 | resubscribe_triggers counter bump |
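SKIP semantics (not FAIL)¶
In the following situations the canary SKIPs rather than FAILs:
- The daemon is not running (no response on :22222)
- The daemon log file does not exist ($LOG_FILE is unset or empty)
- No retry / re-sub event occurred within the observation window (which means the system is healthy, not that there is a bug)
SKIP does not block a release; only FAIL does. Gates 7-10 typically produce some SKIPs on a healthy daemon and only pass in full after faults are injected on a live environment or enough time has elapsed.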
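For a CI wrapper around the canary, the rule "SKIP does not block, FAIL does" might be encoded as below. This is a sketch: the GateResult enum and both function names are hypothetical, and how scripts/canary.sh actually reports per-gate outcomes is not described here.

```python
from enum import Enum
from typing import Dict


class GateResult(Enum):
    PASS = "PASS"
    SKIP = "SKIP"
    FAIL = "FAIL"


def release_blocked(results: Dict[str, GateResult]) -> bool:
    """A release is blocked only when at least one gate FAILs; SKIP is neutral."""
    return any(result is GateResult.FAIL for result in results.values())


def summarize(results: Dict[str, GateResult]) -> str:
    """Produce a one-line tally such as 'PASS=6 SKIP=3 FAIL=1'."""
    tally = {kind: 0 for kind in GateResult}
    for result in results.values():
        tally[result] += 1
    return " ".join(f"{kind.value}={count}" for kind, count in tally.items())
```

Keeping SKIP as its own outcome, rather than folding it into PASS, preserves the information that a gate was never exercised.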
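Troubleshooting push stream anomalies (v1.4.84)¶
Triage flow when ops sees push_stream_healthy=false or the FutuAuthRejectSpike alert:
- First query /api/push-subscriber-info and look at push_stream_healthy / last_push_received_at_ms / consecutive_parse_errors / is_circuit_tripped_now
- If push_stream_healthy=false, look further:
  - consecutive_parse_errors >= 5 → F3 is firing and will auto re-sub within 30 seconds
  - is_circuit_tripped_now=true → F4 is cooling down and resets automatically after 30 seconds
  - last_push_received_at_ms is more than 60s ago → the channel is stale and F3 is being triggered
- Check the daemon log for trigger history
- Escalation scenarios
  - resubscribe_triggers keeps increasing while push_stream_healthy stays false → a long-running backend failure or a problem with the subscription parameters; this requires a manual restart and confirming the backend state
  - circuit_breaker_trips exceeds 3 within a short time → the F3 re-sub is not working; the backend side may need admin intervention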
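The triage flow can be turned into a small helper that an on-call script runs against periodic snapshots of /api/push-subscriber-info. This Python sketch takes already-decoded dicts; the function names are hypothetical, and "a short time" is left to the caller, who decides how wide a window of snapshots to pass in.

```python
from typing import Dict, List


def triage(info: Dict, now_ms: int) -> List[str]:
    """Map one snapshot to the findings listed in the flow above."""
    if info["push_stream_healthy"]:
        return []
    findings = []
    if info["consecutive_parse_errors"] >= 5:
        findings.append("F3 firing: auto re-sub expected within 30 seconds")
    if info["is_circuit_tripped_now"]:
        findings.append("F4 cooling down: auto-reset after 30 seconds")
    if now_ms - info["last_push_received_at_ms"] > 60_000:
        findings.append("channel stale: F3 being triggered")
    return findings


def needs_escalation(snapshots: List[Dict]) -> bool:
    """Check the two escalation scenarios over snapshots ordered oldest to newest."""
    if len(snapshots) < 2:
        return False
    first, last = snapshots[0], snapshots[-1]
    resub_not_helping = (
        last["resubscribe_triggers"] > first["resubscribe_triggers"]
        and not any(s["push_stream_healthy"] for s in snapshots)
    )
    repeated_trips = last["circuit_breaker_trips"] - first["circuit_breaker_trips"] > 3
    return resub_not_helping or repeated_trips
```

An empty list from triage on an unhealthy snapshot is itself a signal worth a look at the daemon log, since none of the three documented causes explains the verdict.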