一、SQL Server性能监控概述
有效的性能监控是保障SQL Server数据库稳定运行的关键。在Ubuntu环境下,我们可以使用多种工具和方法来实时监控数据库性能,及时发现并解决潜在问题。
1.1 监控的核心指标
| 指标类别 | 关键指标 | 正常范围 | 异常表现 |
|---|---|---|---|
| CPU | 处理器利用率 | <70% | 持续>90% |
| 内存 | 缓冲池命中率 | >95% | <90% |
| 磁盘I/O | 磁盘队列长度 | <2 | >10 |
| 网络 | 网络带宽利用率 | <50% | >80% |
| 数据库 | 死锁次数/分钟 | 0 | >5 |
| 数据库 | 阻塞等待时间 | <100ms | >1000ms |
1.2 监控架构设计
┌─────────────────────────────────────────────────────┐
│ 监控数据采集层 │
├─────────────────────────────────────────────────────┤
│ 系统指标 │ SQL Server指标 │ 数据库指标 │
│ (top/vmstat)│ (DMVs/计数器) │ (查询性能) │
└──────┬────────────────┬────────────────┬────────────┘
│ │ │
▼ ▼ ▼
┌─────────────────────────────────────────────────────┐
│ 监控数据处理层 │
├─────────────────────────────────────────────────────┤
│ 数据收集 │ 数据分析 │ 告警触发 │
└──────┬────────────────┬────────────────┬────────────┘
│ │ │
▼ ▼ ▼
┌─────────────────────────────────────────────────────┐
│ 监控展示与告警层 │
├─────────────────────────────────────────────────────┤
│ Grafana │ Prometheus │ 邮件/短信告警 │
└─────────────────────────────────────────────────────┘
二、系统级监控工具
2.1 使用top监控CPU和内存
# 实时监控
top -u mssql
# 查看SQL Server进程
ps aux | grep mssql
# 使用htop更直观查看
sudo apt install htop
htop -u mssql
2.2 使用vmstat监控系统资源
# 每5秒输出一次
vmstat 5
# 输出含义说明
# r: 运行队列长度
# b: 阻塞进程数
# si/so: 交换内存进出
# us/sy/id: 用户/系统/空闲CPU百分比
2.3 使用iostat监控磁盘I/O
# 安装sysstat包
sudo apt install sysstat
# 监控磁盘I/O
iostat -x 5
# 关键指标
# %util: 设备利用率
# await: 平均I/O等待时间
# svctm: 平均服务时间
2.4 使用sar长期监控
# 收集历史数据
sar -u 5 12 > cpu_stats.log
sar -d 5 12 > disk_stats.log
sar -r 5 12 > memory_stats.log
三、SQL Server动态管理视图(DMV)监控
3.1 监控CPU性能
-- 查看CPU密集查询
SELECT TOP 10
qs.execution_count,
qs.total_worker_time / 1000 AS total_cpu_time_ms,
qs.total_elapsed_time / 1000 AS total_elapsed_time_ms,
qs.total_worker_time / qs.execution_count / 1000 AS avg_cpu_time_ms,
SUBSTRING(qt.text, (qs.statement_start_offset/2)+1,
((CASE qs.statement_end_offset
WHEN -1 THEN DATALENGTH(qt.text)
ELSE qs.statement_end_offset
END - qs.statement_start_offset)/2) + 1) AS query_text,
DB_NAME(qt.dbid) AS database_name
FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) qt
ORDER BY qs.total_worker_time DESC;
GO
3.2 监控内存使用
-- 查看内存分配
SELECT
type,
name,
pages_kb/1024 AS pages_mb,
(pages_kb * 100.0 / SUM(pages_kb) OVER()) AS percentage
FROM sys.dm_os_memory_clerks
ORDER BY pages_kb DESC;
GO
-- 查看缓冲池命中率
SELECT
CAST((1 - (CAST(COUNT(*) AS FLOAT) /
(SELECT COUNT(*) FROM sys.dm_os_buffer_descriptors))) * 100 AS DECIMAL(10,2))
AS buffer_pool_hit_ratio;
GO
3.3 监控磁盘I/O
-- 查看I/O延迟
SELECT
database_id,
file_id,
io_stall_read_ms,
io_stall_write_ms,
io_stall,
num_of_reads,
num_of_writes,
io_stall/(num_of_reads + num_of_writes) AS avg_io_stall_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL)
ORDER BY io_stall DESC;
GO
-- 查看等待统计
SELECT
wait_type,
waiting_tasks_count,
wait_time_ms/1000 AS wait_time_seconds,
signal_wait_time_ms/1000 AS signal_wait_seconds,
wait_time_ms/waiting_tasks_count AS avg_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE waiting_tasks_count > 0
ORDER BY wait_time_ms DESC;
GO
3.4 监控阻塞和死锁
-- 查看当前阻塞
SELECT
t1.session_id AS blocked_session,
t2.session_id AS blocking_session,
DB_NAME(t1.database_id) AS database_name,
t1.wait_type,
t1.wait_time/1000 AS wait_seconds,
t1.status,
t1.command,
t2.command AS blocking_command
FROM sys.dm_exec_requests t1
JOIN sys.dm_exec_requests t2 ON t1.blocking_session_id = t2.session_id
WHERE t1.blocking_session_id > 0;
GO
-- 查看死锁图(需要启用死锁跟踪)
-- 启用死锁跟踪
ALTER SERVER SET TRACK_DEADLOCK ON;
GO
-- 查看死锁信息
SELECT
xed.value('@timestamp', 'datetime') AS CreationTime,
xed.query('.') AS DeadlockGraph
FROM
(
SELECT CAST([target_data] AS XML) AS TargetData
FROM sys.dm_xe_session_targets AS st
INNER JOIN sys.dm_xe_sessions AS s ON s.address = st.event_session_address
WHERE s.name = 'system_health'
) AS Data
CROSS APPLY TargetData.nodes('Deadlock/deadlock') AS XEventData(xed);
GO
四、自动化监控脚本
4.1 综合监控脚本
#!/bin/bash
# monitor_sqlserver.sh - SQL Server综合监控脚本
LOG_DIR="/var/log/mssql_monitor"
mkdir -p $LOG_DIR
DATE=$(date +%Y%m%d_%H%M%S)
LOG_FILE="$LOG_DIR/monitor_$DATE.log"
echo "========== SQL Server监控报告 - $DATE ==========" >> $LOG_FILE
echo "" >> $LOG_FILE
# 1. 系统资源监控
echo "=== 系统资源 ===" >> $LOG_FILE
echo "CPU使用率:" >> $LOG_FILE
top -bn1 | grep "Cpu(s)" | awk '{print $2}' | awk -F'%' '{print $1}' >> $LOG_FILE
echo "内存使用:" >> $LOG_FILE
free -h >> $LOG_FILE
echo "磁盘I/O:" >> $LOG_FILE
iostat -x 1 2 | tail -n +4 >> $LOG_FILE
echo "" >> $LOG_FILE
# 2. SQL Server进程状态
echo "=== SQL Server进程状态 ===" >> $LOG_FILE
ps aux | grep mssql | grep -v grep >> $LOG_FILE
systemctl status mssql-server | head -10 >> $LOG_FILE
echo "" >> $LOG_FILE
# 3. 数据库连接数
echo "=== 数据库连接统计 ===" >> $LOG_FILE
/opt/mssql-tools/bin/sqlcmd -S localhost -U SA -P 'YourPassword' \
-Q "SELECT DB_NAME(dbid) AS DatabaseName, COUNT(dbid) AS Connections FROM sys.sysprocesses GROUP BY dbid" >> $LOG_FILE
# 4. 性能计数器快照
echo "" >> $LOG_FILE
echo "=== 关键性能计数器 ===" >> $LOG_FILE
/opt/mssql-tools/bin/sqlcmd -S localhost -U SA -P 'YourPassword' \
-Q "SELECT object_name, counter_name, cntr_value FROM sys.dm_os_performance_counters WHERE counter_name IN ('Buffer cache hit ratio', 'Page life expectancy', 'Batch Requests/sec', 'User Connections')" >> $LOG_FILE
echo "监控报告已保存至: $LOG_FILE"
4.2 配置定时监控
# 添加到crontab
# 每5分钟执行一次监控
*/5 * * * * /home/user/monitor_sqlserver.sh >> /var/log/mssql_monitor/cron.log 2>&1
# 每小时生成汇总报告
0 * * * * /home/user/generate_hourly_report.sh
五、集成Prometheus + Grafana监控
5.1 安装Prometheus SQL Server Exporter
# 安装Go环境
sudo apt install -y golang-go
# 下载sql_exporter
git clone https://github.com/free/sql_exporter.git
cd sql_exporter
# 配置数据源
cat > sql_exporter.yml << 'EOF'
global:
scrape_interval: 15s
target:
data_source_name: 'sqlserver://SA:YourPassword@localhost:1433'
collectors: [mssql_standard]
collectors:
- name: mssql_standard
metrics:
- name: mssql_buffer_pool_hit_ratio
type: gauge
help: 'Buffer pool hit ratio'
values: [buffer_pool_hit_ratio]
query: |
SELECT CAST((1 - (CAST(COUNT(*) AS FLOAT) /
(SELECT COUNT(*) FROM sys.dm_os_buffer_descriptors))) * 100 AS DECIMAL(10,2))
AS buffer_pool_hit_ratio FROM sys.dm_os_buffer_descriptors
EOF
# 启动exporter
./sql_exporter
5.2 Prometheus配置
# prometheus.yml
scrape_configs:
- job_name: 'sqlserver'
static_configs:
- targets: ['localhost:9399']
5.3 Grafana仪表盘配置
推荐监控面板:
– SQL Server Overview(ID: 14624)
– SQL Server Performance(ID: 9236)
关键指标配置:
# 关键告警规则
groups:
- name: sqlserver_alerts
rules:
- alert: HighCPUUsage
expr: mssql_cpu_utilization > 90
for: 5m
labels:
severity: warning
annotations:
summary: "SQL Server CPU使用率过高"
- alert: LowBufferPoolHitRatio
expr: mssql_buffer_pool_hit_ratio < 90
for: 5m
labels:
severity: warning
annotations:
summary: "SQL Server缓冲池命中率过低"
- alert: HighDeadlockCount
expr: rate(mssql_deadlocks[5m]) > 1
for: 1m
labels:
severity: critical
annotations:
summary: "SQL Server死锁次数过多"
六、性能问题诊断流程
6.1 标准诊断流程
发现性能问题
↓
检查系统资源(CPU/内存/磁盘)
↓
┌─────────────────┬─────────────────┐
│ 系统资源正常 │ 系统资源异常 │
│ │ │
│ 检查SQL Server │ 检查资源瓶颈 │
│ 内部性能 │ │
│ │ │
│ • 查看等待统计 │ • 增加CPU核心 │
│ • 检查阻塞情况 │ • 扩展内存 │
│ • 分析执行计划 │ • 升级磁盘 │
└─────────────────┴─────────────────┘
↓
定位问题根因
↓
实施优化方案
↓
验证效果
6.2 常见问题诊断SQL
-- 1. 查看最耗资源的查询
SELECT TOP 10
qs.execution_count,
qs.total_logical_reads,
qs.total_logical_writes,
qs.total_elapsed_time/1000 AS total_elapsed_time_ms,
SUBSTRING(qt.text, 1, 100) AS query_text
FROM sys.dm_exec_query_stats qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) qt
ORDER BY qs.total_logical_reads DESC;
-- 2. 查看缺失索引建议
SELECT
d.name AS database_name,
mid.statement AS table_name,
mid.equality_columns,
mid.inequality_columns,
mid.included_columns,
migs.user_seeks,
migs.avg_total_user_cost,
migs.avg_user_impact
FROM sys.dm_db_missing_index_details mid
JOIN sys.dm_db_missing_index_groups mig ON mid.index_handle = mig.index_handle
JOIN sys.dm_db_missing_index_group_stats migs ON mig.index_group_handle = migs.group_handle
CROSS JOIN sys.databases d
WHERE d.database_id = DB_ID()
ORDER BY migs.avg_user_impact DESC;
-- 3. 查看索引碎片
SELECT
OBJECT_NAME(i.object_id) AS table_name,
i.name AS index_name,
ips.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') ips
JOIN sys.indexes i ON i.object_id = ips.object_id AND i.index_id = ips.index_id
WHERE ips.avg_fragmentation_in_percent > 10
ORDER BY ips.avg_fragmentation_in_percent DESC;
七、常见问题解答
Q1: Buffer Pool Hit Ratio持续低于95%怎么办?
A: 建议措施:
– 增加服务器内存
– 检查是否有内存泄漏
– 优化查询减少内存占用
– 调整max server memory配置
Q2: 如何判断是否需要添加索引?
A: 查看dm_db_missing_index_details视图的建议,优先考虑user_seeks高且avg_user_impact高的索引。
Q3: 监控数据保留多久合适?
A: 建议:
– 细粒度数据(5秒间隔):保留7天
– 中粒度数据(1分钟间隔):保留30天
– 汇总数据(小时级):保留1年
Q4: 如何设置告警阈值?
A: 基线建议:
– CPU使用率 > 90% 持续5分钟
– Buffer Pool命中率 < 90% 持续5分钟
– 死锁次数 > 5次/分钟
– 平均查询响应时间 > 1000ms
八、总结
有效的SQL Server性能监控需要:
- 建立监控体系:系统级+数据库级+应用级
- 定期分析数据:发现趋势和异常
- 设置合理告警:及时发现和响应问题
- 持续优化:基于监控数据改进系统
- 保留历史数据:用于容量规划和趋势分析
掌握以上监控技巧,可以确保SQL Server数据库高效稳定运行。
注:本文基于SQL Server 2022 on Ubuntu 20.04/22.04编写,具体版本可能略有差异。