
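Audit and Observability

Three signal sources

  1. Regular logs: tracing output to stderr / journald at DEBUG/INFO/WARN/ERROR levels, used for development and fault diagnosis
  2. Audit JSONL: contains only events with target=futu_audit, used for compliance, after-the-fact tracing, and attack investigation
  3. Prometheus metrics: aggregated data in counter form, used for alerting and dashboards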

Audit JSONL

Enable:

futu-opend --audit-log /var/log/futu-audit.jsonl
# or
futu-opend --audit-log /var/log/futu/       # directory → daily rotation: futu-audit.log.YYYY-MM-DD

futu-mcp and futucli accept a flag of the same name.

Event schema

{
  "timestamp": "2026-04-15T10:23:45.123Z",
  "level": "WARN",                      // reject=WARN, allow=INFO, trade=WARN
  "target": "futu_audit",
  "iface": "rest" | "grpc" | "ws" | "mcp" | "cli",
  "endpoint": "/api/order" | "proto_id=2202" | "futu_place_order",
  "key_id": "bot_a" | "<missing>" | "<invalid>" | "<none>",
  "outcome": "allow" | "reject" | "success" | "failure",
  "reason": "limit: rate limit exceeded: 5 in 60s (cap 3)",
  "scope": "trade:real",                // present when the outcome is allow
  "args_hash": "8a3f2b9c"               // trade events: first 8 hex chars of the SHA-256
}
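
A minimal Python sketch of reading one audit event and producing a short hash in the same shape as args_hash. The field names come from the schema above. Which fields are always present, and how the daemon serializes arguments before hashing, are not stated here, so REQUIRED_FIELDS, the canonical-JSON step, and both helper names are assumptions for illustration only.

```python
import hashlib
import json

# Assumption: these fields appear on every event; reason/scope/args_hash are optional.
REQUIRED_FIELDS = ("timestamp", "level", "target", "iface", "endpoint", "key_id", "outcome")


def parse_audit_line(line: str) -> dict:
    """Parse one JSONL line and check the fields an audit event is expected to carry."""
    event = json.loads(line)
    missing = [name for name in REQUIRED_FIELDS if name not in event]
    if missing:
        raise ValueError(f"audit event is missing fields: {missing}")
    if event["target"] != "futu_audit":
        raise ValueError(f"unexpected target: {event['target']!r}")
    return event


def short_args_hash(args: dict) -> str:
    """First 8 hex chars of SHA-256 over a canonical JSON encoding (the encoding is assumed)."""
    canonical = json.dumps(args, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:8]


sample = (
    '{"timestamp": "2026-04-15T10:23:45.123Z", "level": "WARN", "target": "futu_audit", '
    '"iface": "rest", "endpoint": "/api/order", "key_id": "bot_a", "outcome": "reject", '
    '"reason": "limit: rate limit exceeded: 5 in 60s (cap 3)"}'
)
event = parse_audit_line(sample)
print(event["outcome"], short_args_hash({"code": "HK.00700", "qty": 100}))
```

Because the real serialization is unknown, hashes produced this way are not expected to match the args_hash values the daemon writes; the sketch only illustrates the truncated SHA-256 format.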

Common jq queries

# Most recent rejects (-c prints one event per line, so tail counts events)
jq -c 'select(.outcome=="reject")' /var/log/futu-audit.jsonl | tail -20

# Order activity for one key
jq 'select(.key_id=="bot_a" and (.endpoint|test("order|place|modify")))' \
  /var/log/futu-audit.jsonl

# Count rejects by reason
jq -r 'select(.outcome=="reject") | .reason' /var/log/futu-audit.jsonl \
  | awk -F': ' '{print $1}' \
  | sort | uniq -c | sort -rn

# Request distribution per iface
jq -r '.iface' /var/log/futu-audit.jsonl | sort | uniq -c
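
For scripts that cannot shell out to jq, the reject-by-reason count can be approximated in plain Python. This is a sketch under the assumption that each line is one JSON object as in the event schema; the function name is made up for the example.

```python
import json
from collections import Counter
from typing import Iterable


def reject_reason_counts(lines: Iterable[str]) -> Counter:
    """Count reject events by the part of `reason` before the first ': '."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event.get("outcome") != "reject":
            continue
        prefix = event.get("reason", "").split(": ", 1)[0]
        counts[prefix] += 1
    return counts


demo_lines = [
    '{"outcome": "reject", "reason": "limit: rate limit exceeded: 5 in 60s (cap 3)"}',
    '{"outcome": "allow"}',
    '{"outcome": "reject", "reason": "limit: rate limit exceeded: 4 in 60s (cap 3)"}',
]
print(reject_reason_counts(demo_lines).most_common())
```

To run it against the real file, pass an open file handle, for example `reject_reason_counts(open("/var/log/futu-audit.jsonl"))`.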

Batch analysis with DuckDB

-- Load the JSONL file
CREATE TABLE audit AS SELECT * FROM read_json_auto('/var/log/futu-audit.jsonl');

-- Daily order-event count for one key
SELECT DATE(timestamp) AS day, COUNT(*) AS orders
FROM audit
WHERE key_id = 'bot_a' AND endpoint LIKE '%order%'
GROUP BY day;

Prometheus metrics

Scrape configuration:

prometheus.yml
scrape_configs:
  - job_name: futu-opend
    static_configs: [{ targets: ['opend:22222'] }]
  - job_name: futu-mcp
    static_configs: [{ targets: ['mcp:38765'] }]

Three counters

# HELP futu_auth_events_total Auth / trade events by iface, outcome, key_id
# TYPE futu_auth_events_total counter
futu_auth_events_total{iface="rest",outcome="allow",key_id="bot_a"} 1234

# HELP futu_auth_limit_rejects_total Limit-check rejects by iface, key_id, reason
# TYPE futu_auth_limit_rejects_total counter
futu_auth_limit_rejects_total{iface="grpc",key_id="bot_b",reason="rate"} 7

# HELP futu_ws_filtered_pushes_total Pushes filtered out for client lacking scope
# TYPE futu_ws_filtered_pushes_total counter
futu_ws_filtered_pushes_total{required_scope="trade",key_id="bot_c"} 42
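
When checking a /metrics endpoint by hand, it can help to turn sample lines into structured values. The sketch below parses the simple shape shown above (name, optional labels, numeric value). It is not a full parser for the Prometheus text format: it ignores trailing timestamps and leaves escape sequences in label values untouched.

```python
import re

_SAMPLE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_sample(line: str):
    """Return (metric_name, labels, value), or None for blank lines and # comments."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    match = _SAMPLE.match(line)
    if match is None:
        raise ValueError(f"not a sample line: {line!r}")
    name, raw_labels, raw_value = match.groups()
    labels = {key: value for key, value in _LABEL.findall(raw_labels or "")}
    return name, labels, float(raw_value)


print(parse_sample('futu_auth_limit_rejects_total{iface="grpc",key_id="bot_b",reason="rate"} 7'))
```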

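reason buckets

The reason label on futu_auth_limit_rejects_total takes values from a finite set:

| reason | Meaning |
|---|---|
| rate | Rate limit exceeded |
| daily | Daily cumulative cap exceeded |
| per_order | Per-order cap exceeded |
| market | Market allowlist |
| symbol | Symbol allowlist |
| side | Side (direction) allowlist |
| hours | Time-of-day window |
| other | Anything else (not covered by classify_limit_reason) |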
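
Because the label set is finite, downstream tooling (reports, custom exporters) can validate against it. The helper below is a hypothetical consumer-side sketch, not the daemon's classify_limit_reason: it only folds any unexpected value into other.

```python
# The eight bucket names listed for the reason label.
KNOWN_REASONS = frozenset(
    {"rate", "daily", "per_order", "market", "symbol", "side", "hours", "other"}
)


def normalize_reason(label: str) -> str:
    """Return the label unchanged if it is a known bucket, otherwise 'other'."""
    return label if label in KNOWN_REASONS else "other"


print(normalize_reason("daily"), normalize_reason("something_new"))
```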

Example alert rules

alerts.yml
groups:
  - name: futu-opend
    rules:
      - alert: FutuAuthRejectSpike
        expr: rate(futu_auth_events_total{outcome="reject"}[5m]) > 10
        for: 5m
        annotations:
          summary: "Auth reject rate high ({{ $value }}/s)"
          description: "Possibly an attack or a misconfigured key"

      - alert: FutuRateLimitFrequent
        expr: rate(futu_auth_limit_rejects_total{reason="rate"}[15m]) > 1
        for: 15m
        annotations:
          summary: "Key {{ $labels.key_id }} keeps hitting the rate limit"

      - alert: FutuDailyCapNearLimit
        expr: increase(futu_auth_limit_rejects_total{reason="daily"}[24h]) > 5
        for: 5m
        annotations:
          summary: "Key {{ $labels.key_id }} hit the daily cumulative cap several times in the last 24h"
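
To make the thresholds concrete, the sketch below approximates what rate() measures: the per-second increase of a counter across a window, treating any drop in value as a counter reset. Prometheus additionally extrapolates to the window boundaries, which this simplification ignores; the function name and sample values are invented for the example.

```python
def simple_rate(samples):
    """Per-second increase over `samples`, a list of (unix_seconds, counter_value), oldest first."""
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        # A lower value than before means the process restarted and the counter began again at 0.
        increase += (curr - prev) if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


# 3600 rejects over a 5-minute window is 12 per second, above the "> 10" threshold.
window = [(0, 100), (150, 1900), (300, 3700)]
print(simple_rate(window))
```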

Grafana dashboard

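Starting with v1.4.103, the release tarball ships a prebuilt Grafana dashboard JSON: examples/grafana/futu-opend-dashboard.json.

Prebuilt dashboard: import steps

  1. Unpack the release tarball and take examples/grafana/futu-opend-dashboard.json
  2. In Grafana's left sidebar, go to Dashboards → New → Import
  3. Choose Upload JSON file and upload the file from step 1
  4. On the import page, select your Prometheus datasource (the dashboard binds it through the ${DS_PROMETHEUS} variable, so the JSON needs no edits)
  5. Click Import; the dashboard loads automatically, and the two variables $interface / $key_id are populated from Prometheus labels

Dashboard contents

The two variables at the top ($interface / $key_id) support multi-select and All.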

| Row | Panel | Type | Core PromQL |
|---|---|---|---|
| Auth | Auth events — allow vs reject | timeseries | sum by (outcome) (rate(futu_auth_events_total[5m])) |
| Auth | Request rate by interface (stacked) | timeseries | sum by (iface) (rate(futu_auth_events_total[5m])) |
| Auth | Top rejected keys | table | topk(10, sum by (key_id) (increase(futu_auth_events_total{outcome="reject"}[$__range]))) |
| Auth | Limit-reject reasons (pie) | piechart | sum by (reason) (increase(futu_auth_limit_rejects_total[$__range])) |
| Limits | Limit rejects by reason (per second) | timeseries | sum by (reason) (rate(futu_auth_limit_rejects_total[5m])) |
| Limits | Top keys by limit-rejects | table | topk(10, sum by (key_id, reason) (increase(futu_auth_limit_rejects_total[$__range]))) |
| WS Push | WS push filtered (scope mismatch) | timeseries | sum by (required_scope) (rate(futu_ws_filtered_pushes_total[5m])) |
| WS Push | Top keys by WS push filtered | table | topk(10, sum by (key_id, required_scope) (increase(futu_ws_filtered_pushes_total[$__range]))) |
| Summary | Allow rate / Total events / Limit rejects / WS filtered | stat × 4 | range-aggregate |

In total: 12 data panels + 4 row headers = 16 items; schemaVersion 38 (compatible with Grafana v10.x).

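Building panels by hand (fallback, if you already have a dashboard in your own style)

If you would rather not import the whole dashboard and only want to add a few panels to an existing one, copy the PromQL below: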

  • Request rate by iface: sum by (iface) (rate(futu_auth_events_total[5m]))
  • Allow vs Reject: sum by (outcome) (rate(futu_auth_events_total[5m]))
  • Top rejected keys: topk(10, sum by (key_id) (rate(futu_auth_events_total{outcome="reject"}[1h])))
  • Limit reject breakdown: sum by (reason) (increase(futu_auth_limit_rejects_total[$__range]))
  • WS filter dropped: sum by (required_scope) (rate(futu_ws_filtered_pushes_total[5m]))
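
The same expressions can be tried outside Grafana through the Prometheus HTTP API (/api/v1/query). The sketch below assumes a Prometheus server reachable at http://prometheus:9090, a placeholder address, and uses hypothetical helper names. Note that $__range is a Grafana variable, so queries that use it need a concrete range such as [1h] when sent directly.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def instant_query_url(base_url: str, promql: str) -> str:
    """Build the instant-query URL for a PromQL expression."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def run_instant_query(base_url: str, promql: str, timeout: float = 5.0) -> list:
    """Run the query and return the `result` list from the API response."""
    with urlopen(instant_query_url(base_url, promql), timeout=timeout) as response:
        payload = json.load(response)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload}")
    return payload["data"]["result"]


# Only builds the URL; run_instant_query needs a reachable server.
print(instant_query_url("http://prometheus:9090",
                        "sum by (iface) (rate(futu_auth_events_total[5m]))"))
```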

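Push / trade health observation (not via Prometheus)

In the current version Prometheus exposes only the 3 counters above. Push stream health (F3 staleness / F4 circuit breaker / F5 subscriber_info) is served by the synchronous /api/push-subscriber-info endpoint rather than by Prometheus; see the Push self-healing section (v1.4.84) below. To view these fields in Grafana, you can connect /api/push-subscriber-info through a JSON API datasource (not part of the default dashboard).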

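How the signals work together

  • Day-to-day monitoring → Grafana dashboard
  • An alert fires → look up the specific events in the audit JSONL
  • Deep investigation → audit JSONL + DuckDB / jq

The three complement each other: metrics show trends (aggregated numbers), the audit log provides traceability (specific events), and regular logs support debugging (why something failed).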

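Push pipeline self-healing (v1.4.84)

The v1.4.84 §9 CMD3020 chain recovery work introduced 6 layers of defense to keep the push channel stable over the long run. The v1.4.84 A3 canary serves as the live-environment verification gate, so ops needs to know how these components work and how to observe their state.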

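F3 staleness detector (30-second interval / 60-second threshold)

A background task inspects the push channel every 30 seconds:

  • If no push has arrived for more than 60 seconds while there are active subscriptions → a re-subscribe is triggered automatically
  • The daemon log shows a tracing::warn! entry: "v1.4.84 §9 F3: push stream stale >60s, auto re-subscribe"
  • The resubscribe_triggers counter in /api/push-subscriber-info increments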
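
The staleness decision can be sketched as a pure function evaluated on each 30-second tick. This illustrates the rule as described and is not the daemon's code; the function and parameter names are assumptions.

```python
STALE_THRESHOLD_MS = 60_000   # 60-second threshold
CHECK_INTERVAL_MS = 30_000    # 30-second inspection interval


def should_resubscribe(now_ms: int, last_push_ms: int, active_subscriptions: int) -> bool:
    """True when there are active subscriptions but no push for more than 60 seconds."""
    if active_subscriptions <= 0:
        return False  # nothing subscribed, so silence is expected
    return now_ms - last_push_ms > STALE_THRESHOLD_MS


# Simulated ticks after a last push at t=0 with one active subscription.
for tick in range(1, 4):
    now = tick * CHECK_INTERVAL_MS
    print(now, should_resubscribe(now, last_push_ms=0, active_subscriptions=1))
```

With a 30-second interval and a strict 60-second threshold, the first tick that can fire is the one at 90 seconds after the last push.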

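F4 circuit breaker (30-second cooldown)

If the stream is still stale within 60 seconds after an F3 auto re-subscribe → the circuit trips:

  • The dispatcher skips subsequent push events for 30 seconds to avoid spinning and flooding the log with errors
  • When the 30 seconds expire, or when any successful push arrives → auto-reset
  • Where to look: is_circuit_tripped_now and the circuit_breaker_trips counter in /api/push-subscriber-info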
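
A minimal state-machine sketch of the cooldown behaviour. It treats cooldown expiry and a successful push as two independent reset conditions, which is one reading of the description above; the class and method names are hypothetical and the real dispatcher logic may differ.

```python
class PushCircuitBreaker:
    COOLDOWN_MS = 30_000  # 30-second cooldown

    def __init__(self) -> None:
        self.tripped_at_ms = None  # None means the circuit is closed
        self.trips = 0             # cumulative count, like circuit_breaker_trips

    def trip(self, now_ms: int) -> None:
        """Open the circuit; repeated trips while already open are not double counted."""
        if self.tripped_at_ms is None:
            self.tripped_at_ms = now_ms
            self.trips += 1

    def on_push_success(self) -> None:
        """Any successful push resets the circuit immediately."""
        self.tripped_at_ms = None

    def is_tripped(self, now_ms: int) -> bool:
        """True while inside the cooldown; resets automatically once it has expired."""
        if self.tripped_at_ms is None:
            return False
        if now_ms - self.tripped_at_ms >= self.COOLDOWN_MS:
            self.tripped_at_ms = None
            return False
        return True


breaker = PushCircuitBreaker()
breaker.trip(now_ms=0)
print(breaker.is_tripped(10_000), breaker.is_tripped(30_000), breaker.trips)
```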

F2 retry (TradeReQuery 0ms / 1s / 3s / 9s)

Backend queries triggered by order / fill push notifications (query_orders / query_account_info / query_order_fills) use 4 attempts with exponential backoff:

  • Attempt 1: 0ms (immediate)
  • Attempt 2: 1s
  • Attempt 3: 3s
  • Attempt 4: 9s
  • daemon log tag: "v1.4.84 §9 F2: retry", with the attempt number
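
The retry schedule can be sketched as a small wrapper. It assumes each listed delay is the wait before that attempt (the text does not say whether the delays are cumulative), and the function name is hypothetical. The sleep function is injectable so the example runs instantly.

```python
import time

RETRY_DELAYS_S = (0.0, 1.0, 3.0, 9.0)  # attempt 1 is immediate, then 1s, 3s, 9s


def query_with_retry(query, delays=RETRY_DELAYS_S, sleep=time.sleep):
    """Call `query()` up to len(delays) times, waiting the listed delay before each attempt."""
    last_error = None
    for attempt, delay in enumerate(delays, start=1):
        if delay > 0:
            sleep(delay)
        try:
            return query()
        except Exception as exc:  # a real implementation would catch narrower errors
            last_error = exc
            print(f"F2 retry sketch: attempt {attempt} failed: {exc}")
    raise RuntimeError(f"query failed after {len(delays)} attempts") from last_error


# Demo with a query that fails twice, using a no-op sleep so nothing actually waits.
state = {"calls": 0}

def flaky_query():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("backend not ready")
    return ["order-1"]

print(query_with_retry(flaky_query, sleep=lambda seconds: None))
```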

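F5 /api/push-subscriber-info field reference

| Field | Type | Meaning |
|---|---|---|
| push_stream_healthy | bool | Overall verdict (circuit not tripped + consecutive_parse_errors < 5 + last push less than 60s ago) |
| last_push_received_at_ms | int | Unix timestamp in ms of the most recent push |
| consecutive_parse_errors | int | Consecutive parse failures (F3 triggers a re-sub at >= 5) |
| total_parse_errors | int | Cumulative parse errors (monotonic counter) |
| resubscribe_triggers | int | Cumulative number of F3 auto re-subs |
| circuit_breaker_trips | int | Cumulative number of F4 trips |
| is_circuit_tripped_now | bool | Whether the circuit is currently tripped |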
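
A client-side sketch that loads these fields and recomputes the overall verdict from its three stated conditions. It assumes the endpoint returns a flat JSON object keyed by the field names above (the real response may be wrapped differently); the dataclass and helper names are hypothetical.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class PushSubscriberInfo:
    push_stream_healthy: bool
    last_push_received_at_ms: int
    consecutive_parse_errors: int
    total_parse_errors: int
    resubscribe_triggers: int
    circuit_breaker_trips: int
    is_circuit_tripped_now: bool

    def recomputed_healthy(self, now_ms: int) -> bool:
        """Circuit not tripped, fewer than 5 consecutive parse errors, last push under 60s ago."""
        return (
            not self.is_circuit_tripped_now
            and self.consecutive_parse_errors < 5
            and now_ms - self.last_push_received_at_ms < 60_000
        )


def parse_info(body: str) -> PushSubscriberInfo:
    """Build the dataclass from a JSON body, ignoring any extra keys."""
    data = json.loads(body)
    return PushSubscriberInfo(**{f.name: data[f.name] for f in fields(PushSubscriberInfo)})


body = json.dumps({
    "push_stream_healthy": True, "last_push_received_at_ms": 1_000_000,
    "consecutive_parse_errors": 0, "total_parse_errors": 2,
    "resubscribe_triggers": 1, "circuit_breaker_trips": 0,
    "is_circuit_tripped_now": False,
})
info = parse_info(body)
print(info.recomputed_healthy(now_ms=1_020_000))
```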

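F6 orphan order scan (30-second interval / 5-minute threshold)

A background task periodically scans orders with status=1 (Unsubmitted):

  • If an order has been stuck in Unsubmitted for more than 5 minutes → the daemon logs a warning: "v1.4.84 §9 F6: orphan order detected acc_id=X order_id=Y age=Zs"
  • Useful for pinpointing cases such as a lost fill notify from the broker or a failed subscription rebuild in the daemon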
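
The scan rule can be sketched as a filter over cached orders. The Order shape below, in particular the created_at_s field, is invented for the example; only status=1 and the 5-minute threshold come from the description.

```python
from dataclasses import dataclass

STATUS_UNSUBMITTED = 1
ORPHAN_AGE_S = 5 * 60  # 5-minute threshold


@dataclass
class Order:
    acc_id: int
    order_id: str
    status: int
    created_at_s: float  # hypothetical: when the order entered its current status


def find_orphans(orders, now_s: float):
    """Orders still Unsubmitted after more than 5 minutes."""
    return [
        order for order in orders
        if order.status == STATUS_UNSUBMITTED and now_s - order.created_at_s > ORPHAN_AGE_S
    ]


orders = [
    Order(acc_id=1, order_id="A", status=1, created_at_s=0.0),    # stuck for 400s
    Order(acc_id=1, order_id="B", status=1, created_at_s=200.0),  # only 200s old
    Order(acc_id=1, order_id="C", status=5, created_at_s=0.0),    # some other status
]
for order in find_orphans(orders, now_s=400.0):
    age = int(400.0 - order.created_at_s)
    print(f"orphan order detected acc_id={order.acc_id} order_id={order.order_id} age={age}s")
```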

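Canary live verification (v1.4.84)

Location: scripts/canary.sh. The first version in v1.4.82 had 6 gates and v1.4.84 A3 added 4, for 10 gates in total.

Prerequisites

  • The daemon is already running (futu-opend is up)
  • The environment variables $ACCOUNT / $PWD are set (credentials for non-interactive login)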

Usage

# Run all gates
./scripts/canary.sh

# Run a single gate
./scripts/canary.sh canary_7_push_health_f5_live
./scripts/canary.sh canary_10_f3_staleness_auto_resub

Gate list

| Gate | Introduced in | What it verifies |
|---|---|---|
| canary_1_subscribe_push | v1.4.82 | A WS push event is received after subscribing |
| canary_2_place_order_cache | v1.4.82 | The order is visible in /api/orders immediately (0ms) after PlaceOrder |
| canary_3_subscribe_wrong_fields | v1.4.82 | Bad subscribe parameters → loud error (deny_unknown_fields as the backstop) |
| canary_4_sim_place_order_hint | v1.4.82 | sim PlaceOrder with bad parameters → returns the sim hint |
| canary_5_history_kline_validation | v1.4.82 | history-kline validation → loud error |
| canary_6_cmd3020_recovery | v1.4.84 | Live CMD3020 chain recovery (placeholder; depends on backend fault injection) |
| canary_7_push_health_f5_live | v1.4.84 | /api/push-subscriber-info returns 5 fields with real values |
| canary_8_orphan_scan_f6 | v1.4.84 | 5.5-minute wait, then the orphan warn appears in the daemon log |
| canary_9_f2_retry_exp_backoff | v1.4.84 | Smoke test: the daemon log contains F2 retry entries |
| canary_10_f3_staleness_auto_resub | v1.4.84 | The resubscribe_triggers counter bumps |

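SKIP semantics (not FAIL)

In the following situations the canary reports SKIP rather than FAIL:

  • The daemon is not running (no response on :22222)
  • The daemon log file does not exist ($LOG_FILE is unset or empty)
  • No retry / re-sub event appeared within the observation window (this means the system is healthy, not that there is a bug)

SKIP does not block a release; only FAIL does. Gates 7-10 usually produce a few SKIPs on a healthy daemon and only reach all-PASS after faults are injected on a live system or enough time has elapsed.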
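
The release rule reduces to "any FAIL blocks, SKIP never does". A sketch of that aggregation, with hypothetical names and made-up results (canary.sh itself is a shell script and may report results differently):

```python
from enum import Enum


class GateResult(Enum):
    PASS = "PASS"
    SKIP = "SKIP"
    FAIL = "FAIL"


def release_blocked(results: dict) -> bool:
    """A release is blocked only when at least one gate reports FAIL."""
    return any(result is GateResult.FAIL for result in results.values())


healthy_run = {
    "canary_1_subscribe_push": GateResult.PASS,
    "canary_9_f2_retry_exp_backoff": GateResult.SKIP,    # no retry seen in the window
    "canary_10_f3_staleness_auto_resub": GateResult.SKIP,
}
print(release_blocked(healthy_run))
```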

Troubleshooting push stream anomalies (v1.4.84)

Triage flow for when ops sees push_stream_healthy=false or the FutuAuthRejectSpike alert fires:

  1. Check /api/push-subscriber-info first

    curl -s http://localhost:22222/api/push-subscriber-info | jq

    Fields to look at: push_stream_healthy / last_push_received_at_ms / consecutive_parse_errors / is_circuit_tripped_now

  2. If push_stream_healthy=false, look further:

    • consecutive_parse_errors >= 5 → F3 is firing and will auto re-sub within 30 seconds
    • is_circuit_tripped_now=true → F4 is in cooldown and resets automatically after 30 seconds
    • last_push_received_at_ms is more than 60s old → the channel is stale and F3 is being triggered

  3. Check the trigger history in the daemon log

    grep "v1.4.84 §9 F3\|v1.4.84 §9 F4\|v1.4.84 §9 F6" /var/log/futu-opend.log | tail -20

  4. Escalation scenarios

    • resubscribe_triggers keeps increasing but push_stream_healthy stays false → a long-running backend fault or bad subscription parameters; restart manually and confirm the backend status
    • circuit_breaker_trips rises by more than 3 within a short period → F3 re-sub is not working; the backend may need admin intervention
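
The triage steps can be sketched as two small functions over the JSON returned by /api/push-subscriber-info. They assume a flat object with the documented field names, and "a short period" is modelled as the gap between two snapshots taken by the caller; all names and message strings are illustrative.

```python
def triage(info: dict, now_ms: int) -> list:
    """Return human-readable findings for an unhealthy push stream (empty if healthy)."""
    if info.get("push_stream_healthy"):
        return []
    findings = []
    if info.get("consecutive_parse_errors", 0) >= 5:
        findings.append("F3 firing: auto re-sub expected within 30s")
    if info.get("is_circuit_tripped_now"):
        findings.append("F4 cooling down: auto-reset after 30s")
    last_push = info.get("last_push_received_at_ms")
    if last_push is not None and now_ms - last_push > 60_000:
        findings.append("channel stale: no push for more than 60s")
    return findings


def needs_escalation(previous: dict, current: dict) -> bool:
    """Compare two snapshots: re-subs growing while still unhealthy, or more than 3 new trips."""
    still_unhealthy = not previous.get("push_stream_healthy") and not current.get("push_stream_healthy")
    resub_growing = current.get("resubscribe_triggers", 0) > previous.get("resubscribe_triggers", 0)
    new_trips = current.get("circuit_breaker_trips", 0) - previous.get("circuit_breaker_trips", 0)
    return (still_unhealthy and resub_growing) or new_trips > 3


snapshot = {"push_stream_healthy": False, "consecutive_parse_errors": 6,
            "is_circuit_tripped_now": False, "last_push_received_at_ms": 0}
print(triage(snapshot, now_ms=90_000))
```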