日志与监控指南
日志与监控指南
本指南介绍如何在 Silly 应用中实现全面的日志记录和性能监控,构建完整的可观测性方案。
简介
可观测性是现代应用运维的核心要素,包括三大支柱:
- 日志(Logging):记录离散的事件和错误信息
- 指标(Metrics):记录可聚合的数值数据,反映系统状态
- 追踪(Tracing):追踪请求在系统中的完整生命周期
Silly 框架为这三个方面提供了内置支持:
silly.logger:分级日志系统,支持日志轮转silly.metrics.prometheus:Prometheus 指标收集和导出silly.tracespawn/traceset:分布式追踪 ID 生成和传播
日志系统
基本使用
silly.logger 提供了 DEBUG、INFO、WARN、ERROR 四个日志级别:
local logger = require "silly.logger"
-- 设置日志级别(只输出 INFO 及以上级别)
logger.setlevel(logger.INFO)
-- 基本日志输出
logger.debug("调试信息") -- 不会输出
logger.info("服务器启动") -- 会输出
logger.warn("连接超时,重试中") -- 会输出
logger.error("数据库连接失败") -- 会输出日志格式
框架会自动为每条日志添加以下信息:
2025-10-21 09:37:27 0001e3d700010000 I cluster/node1.lua:30 [node1] Received HTTP GET /test日志格式说明:
2025-10-21 09:37:27- 时间戳0001e3d700010000- Trace ID(自动打印,无需业务代码显式添加)I- 日志级别(D=DEBUG, I=INFO, W=WARN, E=ERROR)cluster/node1.lua:30- 文件名和行号[node1] Received HTTP GET /test- 日志消息
Trace ID 自动打印
框架会自动在每条日志前打印当前协程的 Trace ID,业务代码无需在日志消息中显式包含 Trace ID。这使得同一请求的所有日志都可以通过 Trace ID 进行关联。
-- ❌ 错误:不要显式打印 trace ID
local trace_id = silly.tracepropagate()
logger.info("[" .. trace_id .. "] Processing request")
-- ✅ 正确:框架会自动打印 trace ID
logger.info("Processing request")日志级别选择
根据不同场景选择合适的日志级别:
| 级别 | 使用场景 | 示例 |
|---|---|---|
| DEBUG | 开发调试、问题排查 | 变量值、函数调用、详细请求信息 |
| INFO | 正常业务流程 | 服务启动/停止、用户登录、订单创建 |
| WARN | 潜在问题、降级操作 | 重试次数超限、缓存未命中、配置缺失 |
| ERROR | 错误和异常 | 数据库连接失败、请求处理失败 |
local logger = require "silly.logger"
-- 生产环境:使用 INFO 级别
logger.setlevel(logger.INFO)
-- 调试模式:使用 DEBUG 级别
logger.setlevel(logger.DEBUG)
-- 检查当前级别
if logger.getlevel() <= logger.DEBUG then
-- 只在 DEBUG 模式下执行昂贵的序列化操作
local json = require "json"
logger.debug("请求详情:", json.encode(request))
end格式化日志
使用格式化日志函数(*f 系列)提高日志可读性:
local logger = require "silly.logger"
-- 使用 string.format 格式
logger.infof("用户 [%s] 在 %d 秒内完成了 %d 次操作",
username, duration, count)
logger.errorf("订单 #%d 处理失败: %s (错误码: %d)",
order_id, error_msg, error_code)
-- 格式化参数
logger.debugf("%.2f%% 的请求在 %dms 内完成",
percentage, latency_ms)结构化日志
为了便于日志分析,建议使用结构化的日志格式:
local logger = require "silly.logger"
local json = require "silly.encoding.json"
-- 定义日志辅助函数
local function log_request(method, path, status, duration)
local log_entry = {
timestamp = os.time(),
level = "INFO",
event = "http_request",
method = method,
path = path,
status = status,
duration_ms = duration,
}
logger.info(json.encode(log_entry))
end
-- 使用
log_request("GET", "/api/users", 200, 15.3)
-- 输出: {"timestamp":1703001234,"level":"INFO","event":"http_request",...}日志轮转
Silly 支持通过信号进行日志轮转,避免日志文件无限增长:
-- 启动时指定日志文件
-- ./silly main.lua --logpath=/var/log/myapp.log执行日志轮转的 shell 脚本:
#!/bin/bash
# rotate-logs.sh
LOG_FILE="/var/log/myapp.log"
APP_PID=$(cat /var/run/myapp.pid)
# 1. 重命名当前日志文件
mv "$LOG_FILE" "$LOG_FILE.$(date +%Y%m%d-%H%M%S)"
# 2. 发送 SIGUSR1 信号,让 Silly 重新打开日志文件
kill -USR1 "$APP_PID"
# 3. 压缩旧日志(可选)
gzip "$LOG_FILE".*
# 4. 清理 7 天前的日志(可选)
find /var/log -name "myapp.log.*" -mtime +7 -delete配合 crontab 定期执行:
# 每天凌晨 2 点执行日志轮转
0 2 * * * /path/to/rotate-logs.sh动态调整日志级别
在生产环境中,通过信号动态调整日志级别可以避免重启服务:
local logger = require "silly.logger"
local signal = require "silly.signal"
-- 初始化为 INFO 级别
logger.setlevel(logger.INFO)
-- 通过 SIGUSR2 信号切换 DEBUG 模式
signal("SIGUSR2", function()
if logger.getlevel() == logger.DEBUG then
logger.setlevel(logger.INFO)
logger.info("日志级别切换为 INFO")
else
logger.setlevel(logger.DEBUG)
logger.info("日志级别切换为 DEBUG")
end
end)切换日志级别:
# 切换到 DEBUG 模式
kill -USR2 <pid>
# 再次执行切换回 INFO 模式
kill -USR2 <pid>性能监控
Prometheus 集成
Silly 内置了完整的 Prometheus 指标系统,支持 Counter、Gauge、Histogram 三种指标类型。
创建 /metrics 端点
最基本的监控集成是暴露 /metrics 端点供 Prometheus 抓取:
local http = require "silly.net.http"
local prometheus = require "silly.metrics.prometheus"
-- 启动 HTTP 服务器
local server = http.listen {
addr = "0.0.0.0:8080",
handler = function(stream)
if stream.path == "/metrics" then
-- 收集所有指标并以 Prometheus 格式返回
local metrics = prometheus.gather()
stream:respond(200, {
["content-type"] = "text/plain; version=0.0.4; charset=utf-8",
})
stream:close(metrics)
else
-- 业务逻辑
stream:respond(200, {["content-type"] = "text/plain"})
stream:close("Hello World")
end
end
}内置指标
prometheus.gather() 会自动收集以下内置指标:
Silly 运行时指标:
silly_worker_backlog:Worker 队列待处理消息数silly_timer_pending:待触发定时器数量silly_tasks_runnable:可运行任务数量silly_tcp_connections:活跃 TCP 连接数silly_network_sent_bytes_total:网络发送字节总数silly_network_received_bytes_total:网络接收字节总数
进程资源指标:
process_cpu_seconds_user:用户态 CPU 时间(秒)process_cpu_seconds_system:内核态 CPU 时间(秒)process_resident_memory_bytes:常驻内存大小(字节)process_heap_bytes:堆内存大小(字节)
JeMalloc 指标(如果使用 MALLOC=jemalloc 编译):
- 详细的内存分配统计
自定义指标
根据业务需求创建自定义指标:
Counter:累加计数器
Counter 只能增加,适合统计请求总数、错误次数等累积值:
local prometheus = require "silly.metrics.prometheus"
-- 创建 Counter
local http_requests_total = prometheus.counter(
"http_requests_total",
"HTTP 请求总数",
{"method", "path", "status"}
)
-- 记录请求
http_requests_total:labels("GET", "/api/users", "200"):inc()
http_requests_total:labels("POST", "/api/users", "201"):inc()
http_requests_total:labels("GET", "/api/users", "500"):inc()Gauge:仪表盘
Gauge 可增可减,适合统计当前连接数、队列深度等瞬时值:
local prometheus = require "silly.metrics.prometheus"
-- 创建 Gauge
local active_connections = prometheus.gauge(
"active_connections",
"当前活跃连接数"
)
local queue_depth = prometheus.gauge(
"queue_depth",
"队列深度",
{"queue_name"}
)
-- 使用
active_connections:inc() -- 增加 1
active_connections:dec() -- 减少 1
active_connections:set(42) -- 设置为 42
active_connections:add(10) -- 增加 10
active_connections:sub(5) -- 减少 5
queue_depth:labels("jobs"):set(128)Histogram:直方图
Histogram 统计数据分布,适合统计延迟、响应大小等需要分位数分析的场景:
local prometheus = require "silly.metrics.prometheus"
-- 创建 Histogram(默认桶)
local request_duration = prometheus.histogram(
"http_request_duration_seconds",
"HTTP 请求延迟(秒)"
)
-- 自定义桶边界
local response_size = prometheus.histogram(
"http_response_size_bytes",
"HTTP 响应大小(字节)",
{"method"},
{100, 500, 1000, 5000, 10000, 50000, 100000}
)
-- 记录观测值
local start = os.clock()
-- ... 处理请求 ...
local duration = os.clock() - start
request_duration:observe(duration)
response_size:labels("GET"):observe(1234)完整监控示例
一个带有完整监控的 HTTP 服务示例:
local silly = require "silly"
local http = require "silly.net.http"
local logger = require "silly.logger"
local prometheus = require "silly.metrics.prometheus"
-- 定义指标
local http_requests_total = prometheus.counter(
"myapp_http_requests_total",
"HTTP 请求总数",
{"method", "path", "status"}
)
local http_requests_in_flight = prometheus.gauge(
"myapp_http_requests_in_flight",
"正在处理的 HTTP 请求数"
)
local http_request_duration = prometheus.histogram(
"myapp_http_request_duration_seconds",
"HTTP 请求延迟(秒)",
{"method", "path"},
{0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0}
)
-- HTTP 处理函数
local function handle_request(stream)
local start = os.clock()
http_requests_in_flight:inc()
-- 处理不同路径
local status_code = 200
local response_body = ""
if stream.path == "/metrics" then
-- Prometheus 指标端点
local metrics = prometheus.gather()
stream:respond(200, {
["content-type"] = "text/plain; version=0.0.4",
})
stream:close(metrics)
elseif stream.path == "/api/users" then
-- 业务 API
logger.info("处理用户 API 请求:", stream.method)
response_body = '{"users": []}'
stream:respond(200, {["content-type"] = "application/json"})
stream:close(response_body)
else
-- 404
status_code = 404
response_body = "Not Found"
stream:respond(404, {["content-type"] = "text/plain"})
stream:close(response_body)
end
-- 记录指标
local duration = os.clock() - start
http_requests_in_flight:dec()
http_request_duration:labels(stream.method, stream.path):observe(duration)
http_requests_total:labels(stream.method, stream.path, tostring(status_code)):inc()
-- 记录日志
logger.infof("%s %s %d %.3fs",
stream.method, stream.path, status_code, duration)
end
-- 启动服务
local server = http.listen {
addr = "0.0.0.0:8080",
handler = function(stream)
local ok, err = silly.pcall(handle_request, stream)
if not ok then
logger.error("请求处理失败:", err)
stream:respond(500, {["content-type"] = "text/plain"})
stream:close("Internal Server Error")
end
end
}
logger.info("服务器启动在 0.0.0.0:8080")
logger.info("Prometheus 指标: http://localhost:8080/metrics")Grafana 可视化
配置 Prometheus 抓取 Silly 应用的指标:
# prometheus.yml
scrape_configs:
- job_name: 'silly-app'
scrape_interval: 15s
static_configs:
- targets: ['localhost:8080']
labels:
app: 'my-silly-app'
env: 'production'在 Grafana 中创建仪表盘,常用查询:
# QPS(每秒请求数)
rate(myapp_http_requests_total[1m])
# 按状态码分组的 QPS
sum by (status) (rate(myapp_http_requests_total[1m]))
# P95 延迟
histogram_quantile(0.95, rate(myapp_http_request_duration_seconds_bucket[5m]))
# 错误率
rate(myapp_http_requests_total{status=~"5.."}[1m])
/
rate(myapp_http_requests_total[1m])
# 当前活跃连接数
myapp_http_requests_in_flight
# 内存使用
process_resident_memory_bytes
# CPU 使用率
rate(process_cpu_seconds_total[1m]) * 100请求追踪
Trace ID 生成
Silly 提供了分布式追踪 ID 系统,每个协程都有独立的 trace ID:
local silly = require "silly"
local logger = require "silly.logger"
silly.fork(function()
-- 创建新的 trace ID(如果当前协程没有)
local old_trace_id = silly.tracespawn()
logger.infof("开始处理请求")
logger.infof("请求处理完成")
silly.traceset(old_trace_id)
end)跨服务追踪
在微服务架构中,需要将 trace ID 传播到下游服务:
local silly = require "silly"
local http = require "silly.net.http"
local logger = require "silly.logger"
-- 服务 A:发起 HTTP 请求
local function call_service_b()
-- 生成传播用的 trace ID
local trace_id = silly.tracepropagate()
logger.info("调用服务 B")
-- 通过 HTTP Header 传递 trace ID
local response = http.request {
method = "POST",
url = "http://service-b:8080/api/process",
headers = {
["X-Trace-Id"] = tostring(trace_id),
},
body = '{"data": "value"}',
}
return response
end
-- 服务 B:接收请求并使用传入的 trace ID
local server = http.listen {
addr = "0.0.0.0:8080",
handler = function(stream)
-- 提取并设置 trace ID
local trace_id = tonumber(stream.headers["x-trace-id"])
if trace_id then
silly.traceset(trace_id)
else
trace_id = silly.tracespawn()
end
logger.info("服务 B 收到请求")
-- 处理业务逻辑
stream:respond(200, {["content-type"] = "application/json"})
stream:close('{"status": "ok"}')
end
}RPC 自动追踪
使用 silly.net.cluster 进行 RPC 调用时,trace ID 会自动传播:
local cluster = require "silly.net.cluster"
local logger = require "silly.logger"
-- 创建 cluster 服务
cluster.serve {
marshal = ...,
unmarshal = ...,
call = function(peer, cmd, body)
-- trace ID 已由 cluster 自动设置,logger 会自动使用
logger.info("RPC 调用:", cmd)
-- 处理 RPC 请求
return handle_rpc(body, cmd)
end,
close = function(peer, errno)
logger.info("RPC 连接关闭, errno:", errno)
end,
}
-- 发起 RPC 调用(trace ID 自动传播)
local peer = cluster.connect("127.0.0.1:8080")
local result = cluster.call(peer, "get_user", {user_id = 123})日志关联
将 trace ID 融入日志,实现请求的完整追踪:
local silly = require "silly"
local logger = require "silly.logger"
local json = require "silly.encoding.json"
-- 结构化日志辅助函数
local function log_with_trace(level, event, data)
local log_entry = {
timestamp = os.time(),
level = level,
event = event,
}
-- 合并数据
for k, v in pairs(data or {}) do
log_entry[k] = v
end
local log_str = json.encode(log_entry)
if level == "ERROR" then
logger.error(log_str)
elseif level == "WARN" then
logger.warn(log_str)
elseif level == "DEBUG" then
logger.debug(log_str)
else
logger.info(log_str)
end
end
-- 使用
log_with_trace("INFO", "user_login", {
user_id = 12345,
ip = "192.168.1.100",
})
log_with_trace("ERROR", "database_error", {
query = "SELECT * FROM users",
error = "connection timeout",
})在日志收集系统(如 ELK)中,可以通过 trace_id 查询一个请求的完整日志链路。
告警配置
Prometheus 告警规则
在 Prometheus 中配置告警规则:
# alerts.yml
groups:
- name: silly_app_alerts
interval: 30s
rules:
# 错误率过高
- alert: HighErrorRate
expr: |
rate(myapp_http_requests_total{status=~"5.."}[5m])
/
rate(myapp_http_requests_total[5m]) > 0.05
for: 5m
labels:
severity: warning
annotations:
summary: "高错误率: {{ $value | humanizePercentage }}"
description: "应用 {{ $labels.app }} 的错误率超过 5%"
# P95 延迟过高
- alert: HighLatency
expr: |
histogram_quantile(0.95,
rate(myapp_http_request_duration_seconds_bucket[5m])
) > 1.0
for: 5m
labels:
severity: warning
annotations:
summary: "P95 延迟过高: {{ $value }}s"
description: "应用 {{ $labels.app }} 的 P95 延迟超过 1 秒"
# 内存使用过高
- alert: HighMemoryUsage
expr: process_resident_memory_bytes > 1073741824 # 1GB
for: 10m
labels:
severity: critical
annotations:
summary: "内存使用过高: {{ $value | humanize1024 }}"
description: "应用 {{ $labels.app }} 内存使用超过 1GB"
# Worker 队列积压
- alert: WorkerBacklog
expr: silly_worker_backlog > 1000
for: 5m
labels:
severity: warning
annotations:
summary: "Worker 队列积压: {{ $value }} 条消息"
description: "应用 {{ $labels.app }} 的 Worker 队列积压严重"
# 服务不可用
- alert: ServiceDown
expr: up{job="silly-app"} == 0
for: 1m
labels:
severity: critical
annotations:
summary: "服务不可用"
description: "应用 {{ $labels.app }} 无法访问"告警渠道
配置 Alertmanager 发送告警:
# alertmanager.yml
route:
group_by: ['alertname', 'app']
group_wait: 10s
group_interval: 10s
repeat_interval: 1h
receiver: 'default'
routes:
# 严重告警立即通知
- match:
severity: critical
receiver: 'pager'
continue: true
# 警告级别告警发送到邮件
- match:
severity: warning
receiver: 'email'
receivers:
# 默认接收器
- name: 'default'
webhook_configs:
- url: 'http://webhook-service:8080/alerts'
# 邮件通知
- name: 'email'
email_configs:
- to: 'ops@example.com'
from: 'alertmanager@example.com'
smarthost: 'smtp.example.com:587'
auth_username: 'alertmanager@example.com'
auth_password: 'password'
# 紧急呼叫
- name: 'pager'
webhook_configs:
- url: 'http://pagerduty-integration:8080/alert'应用内告警
也可以在应用内实现简单的告警逻辑:
local silly = require "silly"
local time = require "silly.time"
local logger = require "silly.logger"
local prometheus = require "silly.metrics.prometheus"
-- 定义告警阈值
local ALERT_CONFIG = {
error_rate_threshold = 0.05, -- 5% 错误率
latency_p95_threshold = 1.0, -- 1 秒
memory_threshold = 1073741824, -- 1GB
}
-- 告警状态
local alert_state = {
error_rate_fired = false,
latency_fired = false,
memory_fired = false,
}
-- 发送告警
local function send_alert(alert_name, message)
logger.errorf("[ALERT] %s: %s", alert_name, message)
-- 这里可以集成告警渠道,如 HTTP 回调、邮件等
-- http.request {
-- method = "POST",
-- url = "http://alert-service/webhook",
-- body = json.encode({
-- alert = alert_name,
-- message = message,
-- timestamp = os.time(),
-- })
-- }
end
-- 定期检查指标
local function check_alerts()
-- 这里是简化示例,实际应该从 Prometheus 指标计算
local error_rate = 0.06 -- 示例值
local latency_p95 = 1.2 -- 示例值
local memory_usage = 1200000000 -- 示例值
-- 检查错误率
if error_rate > ALERT_CONFIG.error_rate_threshold then
if not alert_state.error_rate_fired then
send_alert("HighErrorRate",
string.format("错误率 %.2f%% 超过阈值 %.2f%%",
error_rate * 100,
ALERT_CONFIG.error_rate_threshold * 100))
alert_state.error_rate_fired = true
end
else
alert_state.error_rate_fired = false
end
-- 检查延迟
if latency_p95 > ALERT_CONFIG.latency_p95_threshold then
if not alert_state.latency_fired then
send_alert("HighLatency",
string.format("P95 延迟 %.2fs 超过阈值 %.2fs",
latency_p95,
ALERT_CONFIG.latency_p95_threshold))
alert_state.latency_fired = true
end
else
alert_state.latency_fired = false
end
-- 检查内存
if memory_usage > ALERT_CONFIG.memory_threshold then
if not alert_state.memory_fired then
send_alert("HighMemoryUsage",
string.format("内存使用 %d MB 超过阈值 %d MB",
memory_usage / 1024 / 1024,
ALERT_CONFIG.memory_threshold / 1024 / 1024))
alert_state.memory_fired = true
end
else
alert_state.memory_fired = false
end
end
-- 每 60 秒检查一次
silly.fork(function()
while true do
time.sleep(60000)
check_alerts()
end
end)完整示例:生产级 HTTP 服务
一个具备完整日志、监控和追踪的生产级 HTTP 服务示例:
local silly = require "silly"
local http = require "silly.net.http"
local logger = require "silly.logger"
local signal = require "silly.signal"
local time = require "silly.time"
local prometheus = require "silly.metrics.prometheus"
local json = require "silly.encoding.json"
-- ========== 日志配置 ==========
logger.setlevel(logger.INFO)
-- 动态调整日志级别
signal("SIGUSR2", function()
if logger.getlevel() == logger.DEBUG then
logger.setlevel(logger.INFO)
logger.info("日志级别切换为 INFO")
else
logger.setlevel(logger.DEBUG)
logger.info("日志级别切换为 DEBUG")
end
end)
-- ========== 监控指标 ==========
-- 请求指标
local http_requests_total = prometheus.counter(
"api_http_requests_total",
"HTTP 请求总数",
{"method", "path", "status"}
)
local http_requests_in_flight = prometheus.gauge(
"api_http_requests_in_flight",
"正在处理的 HTTP 请求数"
)
local http_request_duration = prometheus.histogram(
"api_http_request_duration_seconds",
"HTTP 请求延迟(秒)",
{"method", "path"},
{0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0, 5.0}
)
local http_request_size = prometheus.histogram(
"api_http_request_size_bytes",
"HTTP 请求大小(字节)",
nil,
{100, 1000, 10000, 100000, 1000000}
)
local http_response_size = prometheus.histogram(
"api_http_response_size_bytes",
"HTTP 响应大小(字节)",
nil,
{100, 1000, 10000, 100000, 1000000}
)
-- 业务指标
local user_operations = prometheus.counter(
"api_user_operations_total",
"用户操作总数",
{"operation", "status"}
)
-- ========== 结构化日志 ==========
local function log_request(trace_id, method, path, status, duration, req_size, resp_size)
local log_entry = {
timestamp = os.time(),
trace_id = trace_id,
level = "INFO",
event = "http_request",
method = method,
path = path,
status = status,
duration_ms = duration * 1000,
request_size_bytes = req_size,
response_size_bytes = resp_size,
}
logger.info(json.encode(log_entry))
end
-- ========== 业务处理 ==========
local function handle_user_get(stream)
logger.debug("获取用户列表")
-- 模拟数据库查询
time.sleep(10)
local response = json.encode({
users = {
{id = 1, name = "Alice"},
{id = 2, name = "Bob"},
}
})
user_operations:labels("get_users", "success"):inc()
return 200, response
end
local function handle_user_post(stream)
logger.debug("创建用户")
-- 模拟数据库插入
time.sleep(20)
local response = json.encode({
id = 3,
name = "Charlie",
status = "created",
})
user_operations:labels("create_user", "success"):inc()
return 201, response
end
-- ========== HTTP 处理 ==========
local function handle_request(stream)
local start = os.clock()
http_requests_in_flight:inc()
-- 获取或创建 trace ID
local trace_id = tonumber(stream.headers["x-trace-id"])
if trace_id then
silly.traceset(trace_id)
else
silly.tracespawn()
trace_id = silly.tracepropagate() -- 获取当前 trace ID 用于响应头
end
-- 记录请求大小
local req_size = tonumber(stream.headers["content-length"]) or 0
http_request_size:observe(req_size)
-- 路由处理
local status_code = 200
local response_body = ""
if stream.path == "/metrics" then
-- Prometheus 指标端点
local metrics = prometheus.gather()
stream:respond(200, {
["content-type"] = "text/plain; version=0.0.4",
})
stream:close(metrics)
status_code = 200
response_body = metrics
elseif stream.path == "/api/users" then
-- 用户 API
if stream.method == "GET" then
status_code, response_body = handle_user_get(stream)
elseif stream.method == "POST" then
status_code, response_body = handle_user_post(stream)
else
status_code = 405
response_body = "Method Not Allowed"
end
stream:respond(status_code, {
["content-type"] = "application/json",
["x-trace-id"] = tostring(trace_id),
})
stream:close(response_body)
elseif stream.path == "/health" then
-- 健康检查
status_code = 200
response_body = json.encode({status = "healthy"})
stream:respond(status_code, {["content-type"] = "application/json"})
stream:close(response_body)
else
-- 404
status_code = 404
response_body = json.encode({error = "Not Found"})
stream:respond(status_code, {["content-type"] = "application/json"})
stream:close(response_body)
end
-- 记录指标
local duration = os.clock() - start
http_requests_in_flight:dec()
http_response_size:observe(#response_body)
http_request_duration:labels(stream.method, stream.path):observe(duration)
http_requests_total:labels(stream.method, stream.path, tostring(status_code)):inc()
-- 记录日志
log_request(trace_id, stream.method, stream.path, status_code,
duration, req_size, #response_body)
end
-- ========== 启动服务 ==========
local server = http.listen {
addr = "0.0.0.0:8080",
handler = function(stream)
local ok, err = silly.pcall(handle_request, stream)
if not ok then
silly.tracespawn() -- 创建新的 trace ID
logger.error("请求处理失败:", err)
stream:respond(500, {["content-type"] = "application/json"})
stream:close(json.encode({error = "Internal Server Error"}))
http_requests_total:labels(stream.method, stream.path, "500"):inc()
end
end
}
logger.info("========================================")
logger.info("应用启动成功")
logger.infof("API 服务: http://localhost:8080/api/users")
logger.infof("健康检查: http://localhost:8080/health")
logger.infof("监控指标: http://localhost:8080/metrics")
logger.info("========================================")
logger.info("信号控制:")
logger.info(" kill -USR1 <pid> # 重新打开日志文件")
logger.info(" kill -USR2 <pid> # 切换日志级别 (INFO <-> DEBUG)")
logger.info("========================================")监控面板配置
Grafana 仪表盘 JSON
创建一个 Grafana 仪表盘来可视化 Silly 应用的监控数据:
{
"dashboard": {
"title": "Silly Application Monitoring",
"panels": [
{
"title": "QPS",
"targets": [
{
"expr": "sum(rate(api_http_requests_total[1m]))"
}
]
},
{
"title": "错误率",
"targets": [
{
"expr": "sum(rate(api_http_requests_total{status=~\"5..\"}[1m])) / sum(rate(api_http_requests_total[1m]))"
}
]
},
{
"title": "P95 延迟",
"targets": [
{
"expr": "histogram_quantile(0.95, rate(api_http_request_duration_seconds_bucket[5m]))"
}
]
},
{
"title": "活跃连接数",
"targets": [
{
"expr": "silly_tcp_connections"
}
]
},
{
"title": "内存使用",
"targets": [
{
"expr": "process_resident_memory_bytes"
}
]
},
{
"title": "Worker 队列深度",
"targets": [
{
"expr": "silly_worker_backlog"
}
]
}
]
}
}最佳实践
日志最佳实践
- 合理使用日志级别:避免在生产环境使用 DEBUG 级别,会产生大量日志
- 结构化日志:使用 JSON 格式便于日志收集和分析
- 避免敏感信息:不要记录密码、token 等敏感数据
- 控制日志量:对于高频操作,考虑采样记录
- 定期轮转:避免日志文件无限增长
监控最佳实践
- 指标命名:遵循 Prometheus 命名规范(snake_case,带单位后缀)
- 避免高基数:不要使用用户 ID 等作为标签
- 合理选择指标类型:
- Counter:累积值(请求总数)
- Gauge:瞬时值(当前连接数)
- Histogram:分布(延迟)
- 设置合理的桶边界:根据实际数据分布选择 Histogram 的桶
- 监控关键业务指标:不仅要监控系统指标,还要监控业务指标
追踪最佳实践
- 始终传播 trace ID:在跨服务调用时传递 trace ID
- 日志关联:将 trace ID 融入日志,便于问题排查
- 保留足够信息:trace ID 要在日志、指标、错误报告中都包含
参见
- silly.logger - 日志系统 API 参考
- silly.metrics.prometheus - Prometheus 指标 API 参考
- silly - 核心调度器和追踪 API
- silly.signal - 信号处理