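1. SQL Admin High Availability Overview
As a database management tool, SQL Admin carries a key administrative role in enterprise IT. If the SQL Admin service goes down, operators can no longer monitor or manage their databases, which can delay the response to business issues or leave failures unhandled. Deploying SQL Admin with high availability is therefore essential in production.
High availability (HA) means the system stays operational most of the time and downtime is kept to a minimum. On CentOS, a sound architecture can reach 99.9% availability or better (at most about 8.76 hours of downtime per year), keeping the database management service running continuously.
Core HA metrics: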
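| Metric | Definition | Formula | Target |
|---|---|---|---|
| Availability | Fraction of time the system is operational | Uptime / (Uptime + Downtime) | ≥99.9% |
| MTBF | Mean time between failures | Total operating time / number of failures | ≥8760 hours |
| MTTR | Mean time to repair | Total downtime / number of repairs | ≤30 minutes |
| RTO | Recovery time objective | Maximum acceptable time from outage to recovery | ≤15 minutes |
| RPO | Recovery point objective | Maximum acceptable window of data loss | ≤5 minutes |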
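The targets in the table are related: steady-state availability follows from MTBF and MTTR, and an availability target implies a yearly downtime budget. The sketch below works through that arithmetic; the function names are illustrative and not part of SQL Admin.

```python
HOURS_PER_YEAR = 8760


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def annual_downtime_hours(target: float) -> float:
    """Downtime budget per year implied by an availability target."""
    return (1.0 - target) * HOURS_PER_YEAR


if __name__ == "__main__":
    # Targets from the table: MTBF >= 8760 h, MTTR <= 30 min
    a = availability(mtbf_hours=8760, mttr_hours=0.5)
    print(f"availability with table targets: {a:.5%}")
    print(f"downtime budget at 99.9%: {annual_downtime_hours(0.999):.2f} h/year")
```

With the MTBF and MTTR targets above, the resulting availability comfortably exceeds the 99.9% goal, so the MTTR target is the stricter of the two in practice.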
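2. High Availability Architecture Design
2.1 Common HA Architectures
An HA design for SQL Admin has to balance cost, performance, and reliability. Three common approaches are shown below:
Architecture 1: Load balancing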
                ┌───────────────────────┐
                │       Nginx/LB        │
                │ (VIP + health check)  │
                └───────────┬───────────┘
                            │
            ┌───────────────┼───────────────┐
            │               │               │
      ┌─────▼─────┐   ┌─────▼─────┐   ┌─────▼─────┐
      │ SQL Admin │   │ SQL Admin │   │ SQL Admin │
      │  Node 1   │   │  Node 2   │   │  Node 3   │
      └─────┬─────┘   └─────┬─────┘   └─────┬─────┘
            │               │               │
            └───────────────┼───────────────┘
                            │
                   ┌────────▼────────┐
                   │ Shared storage  │
                   │ (NFS/GlusterFS) │
                   └─────────────────┘
Architecture 2: Active/standby failover
     Active                  Standby
┌──────────────┐         ┌──────────────┐
│  SQL Admin   │heartbeat│  SQL Admin   │
│  (primary)   │◄───────►│  (standby)   │
└──────┬───────┘         └──────┬───────┘
       │                        │
┌──────▼───────┐         ┌──────▼───────┐
│  SQL Server  │         │  SQL Server  │
│   primary    │         │   standby    │
└──────────────┘         └──────────────┘
Architecture 3: Multi-active
┌────────────┐  ┌────────────┐  ┌────────────┐
│  Region A  │  │  Region B  │  │  Region C  │
│ ┌────────┐ │  │ ┌────────┐ │  │ ┌────────┐ │
│ │Admin A │ │  │ │Admin B │ │  │ │Admin C │ │
│ └────────┘ │  │ └────────┘ │  │ └────────┘ │
└─────┬──────┘  └─────┬──────┘  └─────┬──────┘
      │               │               │
      └───────────────┼───────────────┘
                      │
               ┌──────▼──────┐
               │  Data sync  │
               │ (DRBD/Git)  │
               └─────────────┘
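2.2 Choosing an Architecture
Pick the architecture that fits your business requirements and available resources:
| Architecture | Best suited for | Cost | Complexity | Rating |
|---|---|---|---|---|
| Load balancing | Small and mid-sized companies, many concurrent users | Medium | Low | ⭐⭐⭐⭐ |
| Active/standby failover | High reliability requirements | High | Medium | ⭐⭐⭐⭐⭐ |
| Multi-active | Large enterprises, multi-region deployments | Very high | High | ⭐⭐⭐ |
| Containers + orchestration | Cloud-native environments | Medium | Medium | ⭐⭐⭐⭐ |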
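3. Environment Preparation
3.1 System Requirements
Operating system versions:
| Component | Recommended | Minimum | Notes |
|---|---|---|---|
| CentOS | Stream 9 | 7 | Stream 9 preferred; CentOS 7 and Stream 8 have reached end of life |
| Kernel | 5.x | 3.10 | Newer kernels perform better |
| glibc | 2.34 | 2.17 | Affects binary compatibility (2.34 is the version shipped with Stream 9) |
| OpenSSL | 3.0.x | 1.1.1 | 3.0+ recommended |
Network requirements: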
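# Set the hostname (unique on every server)
sudo hostnamectl set-hostname sqladmin-node1.example.com
# Add host entries (all nodes); tee runs under sudo, so the write to /etc/hosts succeeds
sudo tee -a /etc/hosts > /dev/null << 'EOF'
192.168.1.101 sqladmin-node1
192.168.1.102 sqladmin-node2
192.168.1.103 sqladmin-node3
192.168.1.200 sqladmin-vip
EOF
# Option A: disable the firewall (trusted internal networks only)
sudo systemctl stop firewalld
sudo systemctl disable firewalld
# Option B: keep the firewall and open only what is needed
sudo firewall-cmd --permanent --add-port=8080/tcp
sudo firewall-cmd --permanent --add-port=8443/tcp
# Keepalived speaks VRRP (IP protocol 112), not a TCP port
sudo firewall-cmd --permanent --add-rich-rule='rule protocol value="vrrp" accept'
sudo firewall-cmd --reload
# Verify network connectivity
ping -c 3 sqladmin-node1
ping -c 3 sqladmin-node2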
Time synchronization:
# Install chrony
sudo yum install -y chrony
# chrony server configuration (Node1)
# Note: "sudo cat > file" fails because the redirect runs as the calling user; use tee instead
sudo tee /etc/chrony.conf > /dev/null << 'EOF'
server 0.pool.ntp.org iburst
server 1.pool.ntp.org iburst
server 2.pool.ntp.org iburst
allow 192.168.1.0/24
local stratum 10
EOF
# chrony client configuration (Node2, Node3)
sudo tee /etc/chrony.conf > /dev/null << 'EOF'
server sqladmin-node1 iburst
server 1.pool.ntp.org iburst
server 2.pool.ntp.org iburst
EOF
# Enable and start the service
sudo systemctl enable chronyd
sudo systemctl start chronyd
# Verify synchronization
chronyc sources
chronyc tracking
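3.2 Shared Storage
For deployments that need shared session state or shared files, set up NFS:
NFS server (Node1):
# Install the NFS server
sudo yum install -y nfs-utils rpcbind
# Create the shared directory
sudo mkdir -p /data/sqladmin/shared
# 777 is convenient for a first test; in production, restrict ownership to the SQL Admin service user
sudo chmod 777 /data/sqladmin/shared
# Configure the export (no_root_squash lets remote root act as root on the share; drop it unless required)
sudo tee -a /etc/exports > /dev/null << 'EOF'
/data/sqladmin/shared 192.168.1.0/24(rw,sync,no_root_squash,no_all_squash)
EOF
# Enable and start the services
sudo systemctl enable rpcbind nfs-server
sudo systemctl start rpcbind nfs-server
sudo exportfs -r
# Verify the export
sudo exportfs -v
NFS clients (Node2, Node3):
# Install the NFS client
sudo yum install -y nfs-utils
# Create the mount point
sudo mkdir -p /data/sqladmin/shared
# Test with a manual mount
sudo mount -t nfs sqladmin-node1:/data/sqladmin/shared /data/sqladmin/shared
# Mount automatically at boot (/etc/fstab)
echo "sqladmin-node1:/data/sqladmin/shared /data/sqladmin/shared nfs defaults,_netdev 0 0" | sudo tee -a /etc/fstab
# Verify the mount
df -h | grep shared
mount | grep nfs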
3.3 Docker Environment
Running SQL Admin in Docker containers is recommended because it simplifies the HA setup:
Install Docker (all nodes):
# Install Docker
sudo yum install -y yum-utils
sudo yum-config-manager --add-repo \
https://download.docker.com/linux/centos/docker-ce.repo
sudo yum install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
# Enable and start Docker
sudo systemctl enable docker
sudo systemctl start docker
# Add the current user to the docker group (takes effect at next login)
sudo usermod -aG docker $USER
# Configure registry mirrors and logging
sudo mkdir -p /etc/docker
sudo tee /etc/docker/daemon.json > /dev/null << 'EOF'
{
"registry-mirrors": [
"https://docker.mirrors.ustc.edu.cn",
"https://hub-mirror.c.163.com"
],
"log-driver": "json-file",
"log-opts": {
"max-size": "100m",
"max-file": "3"
},
"storage-driver": "overlay2"
}
EOF
sudo systemctl restart docker
Docker Compose configuration:
# docker-compose.yml
version: '3.8'
services:
  sqladmin:
    image: sqladmin:3.2
    container_name: sqladmin
    restart: always
    ports:
      - "8080:8080"
      - "8443:8443"
    environment:
      - DB_HOST=${DB_HOST}
      - DB_PORT=${DB_PORT}
      - DB_USER=${DB_USER}
      - DB_PASSWORD=${DB_PASSWORD}
      - SESSION_TYPE=redis
      # redis-cluster is not defined in this file; the name must resolve from inside the container
      - REDIS_HOST=redis-cluster
      - REDIS_PORT=6379
    volumes:
      - /data/sqladmin/shared:/app/shared
      - /data/sqladmin/logs:/app/logs
    healthcheck:
      # Requires curl to be present in the image
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s
    networks:
      - sqladmin-net
networks:
  sqladmin-net:
    # The overlay driver needs Swarm mode; a single-host docker compose deployment uses bridge
    driver: bridge
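4. Load Balancer High Availability
4.1 Nginx Load Balancing
Use Nginx as the load balancer in front of SQL Admin:
Install Nginx:
# CentOS Stream 8/9
sudo yum install -y nginx
# CentOS 7 (Nginx comes from EPEL)
sudo yum install -y epel-release
sudo yum install -y nginx
# Start Nginx and enable it at boot
sudo systemctl enable nginx
sudo systemctl start nginx
# Verify the installation
nginx -v
Configure load balancing:
# /etc/nginx/conf.d/sqladmin-lb.conf
upstream sqladmin_backend {
    # Session affinity by client IP (optional); remove ip_hash to use
    # the default weighted round-robin instead
    ip_hash;
    server sqladmin-node1:8080 weight=5;
    server sqladmin-node2:8080 weight=5;
    server sqladmin-node3:8080 weight=5;
    # Idle upstream connections kept open per worker (connection reuse, not a health check)
    keepalive 32;
}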
# HTTP服务器
server {
listen 80;
server_name sqladmin.example.com;
# 重定向到HTTPS
return 301 https://$server_name$request_uri;
}
# HTTPS服务器
server {
listen 443 ssl http2;
server_name sqladmin.example.com;
# SSL证书配置
ssl_certificate /etc/nginx/ssl/sqladmin.crt;
ssl_certificate_key /etc/nginx/ssl/sqladmin.key;
ssl_protocols TLSv1.2 TLSv1.3;
ssl_ciphers HIGH:!aNULL:!MD5;
ssl_prefer_server_ciphers on;
# 安全头
add_header X-Frame-Options "SAMEORIGIN" always;
add_header X-Content-Type-Options "nosniff" always;
add_header X-XSS-Protection "1; mode=block" always;
# 日志配置
access_log /var/log/nginx/sqladmin-access.log;
error_log /var/log/nginx/sqladmin-error.log;
# 代理配置
location / {
proxy_pass http://sqladmin_backend;
proxy_http_version 1.1;
# 传递真实客户端IP
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
# 超时配置
proxy_connect_timeout 60s;
proxy_send_timeout 60s;
proxy_read_timeout 60s;
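# Clear the Connection header so upstream keepalive connections can be reused
proxy_set_header Connection "";
# Passive failover: retry the request on the next upstream server for these failures
proxy_next_upstream error timeout invalid_header http_500 http_502 http_503;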
}
# 健康检查端点
location /health {
proxy_pass http://sqladmin_backend/health;
access_log off;
}
# 静态资源缓存
location ~* \.(js|css|png|jpg|jpeg|gif|ico|svg|woff|woff2|ttf|eot)$ {
proxy_pass http://sqladmin_backend;
expires 30d;
add_header Cache-Control "public, immutable";
}
}
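Health check configuration:
# Active health checks require NGINX Plus or a third-party module.
# Open-source Nginx relies on passive checks: after max_fails failed attempts
# within fail_timeout, the server is skipped for the next fail_timeout.
# This block replaces the upstream definition above (define sqladmin_backend only once).
upstream sqladmin_backend {
    server sqladmin-node1:8080 weight=5 max_fails=3 fail_timeout=30s;
    server sqladmin-node2:8080 weight=5 max_fails=3 fail_timeout=30s;
    server sqladmin-node3:8080 weight=5 max_fails=3 fail_timeout=30s;
    # Backup server (used only when all primary servers are unavailable)
    server sqladmin-backup:8080 backup;
}
# Responses that count as failures and trigger a retry on the next server.
# This directive belongs inside the location / block; it is not valid at the top level of the file.
proxy_next_upstream error timeout http_500 http_502 http_503 http_504;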
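To make the `max_fails` / `fail_timeout` semantics concrete, here is a small simulation of passive health checking for one backend. It is a simplified model written for illustration (the `PassiveCheck` class is hypothetical and does not reproduce Nginx's internal bookkeeping exactly): failures are counted inside a sliding window, and once the limit is reached the server is skipped for `fail_timeout` seconds.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PassiveCheck:
    """Simplified model of passive health checking for a single backend."""
    max_fails: int = 3
    fail_timeout: float = 30.0
    failures: List[float] = field(default_factory=list)  # timestamps of recent failures
    down_until: float = 0.0

    def record_failure(self, now: float) -> None:
        # Keep only the failures that are still inside the fail_timeout window
        self.failures = [t for t in self.failures if now - t < self.fail_timeout]
        self.failures.append(now)
        if len(self.failures) >= self.max_fails:
            # Too many failures: skip this backend for fail_timeout seconds
            self.down_until = now + self.fail_timeout
            self.failures.clear()

    def is_available(self, now: float) -> bool:
        return now >= self.down_until


if __name__ == "__main__":
    node = PassiveCheck(max_fails=3, fail_timeout=30.0)
    for t in (0.0, 1.0, 2.0):
        node.record_failure(t)
    print(node.is_available(10.0))  # still marked down
    print(node.is_available(32.0))  # eligible for traffic again
```

The practical consequence is that a failed node keeps receiving a few real user requests before it is taken out of rotation, which is why `proxy_next_upstream` retries matter alongside `max_fails`.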
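Test load balancing:
# Check the configuration syntax
sudo nginx -t
# Reload the configuration
sudo systemctl reload nginx
# Watch the number of connections on port 8080 (ss replaces netstat on current CentOS releases)
watch 'ss -tan | grep -c :8080'
# Send repeated requests to observe distribution (with ip_hash, one client always reaches the same backend)
for i in {1..20}; do
curl -s -o /dev/null -w "%{http_code} %{time_total}s\n" https://sqladmin.example.com/api/status
done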
4.2 HAProxy Load Balancing
HAProxy is another widely used load balancer:
Install HAProxy:
# Install HAProxy
sudo yum install -y haproxy
# Back up the default configuration
sudo cp /etc/haproxy/haproxy.cfg /etc/haproxy/haproxy.cfg.bak
Configure HAProxy:
# /etc/haproxy/haproxy.cfg
global
log 127.0.0.1 local2
chroot /var/lib/haproxy
pidfile /var/run/haproxy.pid
maxconn 4000
user haproxy
group haproxy
daemon
# SSL证书
tune.ssl.default-dh-param 2048
defaults
log global
mode http
option httplog
option dontlognull
option http-server-close
option forwardfor except 127.0.0.0/8
option redispatch
retries 3
timeout connect 5000
timeout client 50000
timeout server 50000
errorfile 400 /etc/haproxy/errors/400.http
errorfile 403 /etc/haproxy/errors/403.http
errorfile 503 /etc/haproxy/errors/503.http
# 统计页面
listen stats
bind *:8404
mode http
stats enable
stats uri /stats
stats refresh 30s
stats auth admin:password
# SQL Admin前端
frontend sqladmin_front
bind *:80
bind *:443 ssl crt /etc/haproxy/ssl/sqladmin.pem
mode http
default_backend sqladmin_back
# ACL规则
acl is_api path_beg /api
acl is_admin path_beg /admin
# X-Forwarded-For
http-request set-header X-Forwarded-For %[src]
# SQL Admin后端
backend sqladmin_back
mode http
balance roundrobin
# 健康检查
option httpchk GET /health
http-check expect status 200
# 服务器配置
server sqladmin1 sqladmin-node1:8080 check inter 2000 rise 2 fall 3 weight 100
server sqladmin2 sqladmin-node2:8080 check inter 2000 rise 2 fall 3 weight 100
server sqladmin3 sqladmin-node3:8080 check inter 2000 rise 2 fall 3 weight 100
# 备份服务器
server sqladmin-backup sqladmin-node-backup:8080 backup check inter 2000 rise 2 fall 3
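# The statistics page is already served by the "listen stats" section above
# (port 8404, password protected). A second listener on port 8081 without
# "stats auth" would expose the same data unauthenticated, so none is defined here.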
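Start HAProxy:
# Validate the configuration
sudo haproxy -f /etc/haproxy/haproxy.cfg -c
# Enable and start the service
sudo systemctl enable haproxy
sudo systemctl start haproxy
# Check the status
sudo systemctl status haproxy
# Troubleshooting only: run in the foreground with debug output.
# Stop the service first, otherwise the ports are already in use.
# sudo haproxy -f /etc/haproxy/haproxy.cfg -d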
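5. Keepalived Active/Standby Configuration
5.1 Installing and Configuring Keepalived
Keepalived moves a virtual IP (VIP) between nodes to provide automatic active/standby failover:
Install Keepalived (all nodes):
# Install Keepalived
sudo yum install -y keepalived
# Back up the default configuration
sudo cp /etc/keepalived/keepalived.conf /etc/keepalived/keepalived.conf.bak
Primary node configuration (Node1):
# /etc/keepalived/keepalived.conf
! Configuration File for keepalived
global_defs {
    router_id sqladmin_lb
    script_user root
    enable_script_security
}
# Health check script
vrrp_script check_haproxy {
    script "/etc/keepalived/check_haproxy.sh"
    interval 2
    # No weight is set on purpose: a failing script then puts the instance into
    # FAULT and releases the VIP. With nopreempt, a negative weight alone would
    # not cause a failover, because the standby never preempts a live master.
    fall 2
    rise 1
}
# VRRP instance
vrrp_instance VI_1 {
    state BACKUP            # nopreempt only works when the initial state is BACKUP
    interface eth0          # change to the actual interface name
    virtual_router_id 51
    priority 100            # highest priority: this node becomes master at first start
    advert_int 1
    # Authentication (replace with your own secret, at most 8 characters)
    authentication {
        auth_type PASS
        auth_pass 1111
    }
    # Virtual IP
    virtual_ipaddress {
        192.168.1.200/24 dev eth0   # change to the actual interface and IP
    }
    # Tracked script
    track_script {
        check_haproxy
    }
    # Notification scripts
    notify_master "/etc/keepalived/notify_master.sh"
    notify_backup "/etc/keepalived/notify_backup.sh"
    notify_fault "/etc/keepalived/notify_fault.sh"
    # Non-preemptive mode: a recovered node does not take the VIP back
    nopreempt
    # Alternative: to let this node reclaim the VIP after a delay, remove
    # nopreempt and set "preempt_delay 300" (the two options are mutually exclusive)
}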
Standby node configuration (Node2):
# /etc/keepalived/keepalived.conf
! Configuration File for keepalived
global_defs {
    router_id sqladmin_lb
    script_user root
    enable_script_security
}
vrrp_script check_haproxy {
    script "/etc/keepalived/check_haproxy.sh"
    interval 2
    # No weight: a failing script moves the instance to FAULT (see the note on nopreempt)
    fall 2
    rise 1
}
vrrp_instance VI_1 {
    state BACKUP
    interface eth0
    virtual_router_id 51
    priority 90             # lower priority than the primary node
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass 1111
    }
    virtual_ipaddress {
        192.168.1.200/24 dev eth0
    }
    track_script {
        check_haproxy
    }
    nopreempt               # non-preemptive mode
}
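The interaction between `priority`, a negative `vrrp_script` weight, and `nopreempt` is easy to get wrong, so the following sketch models the election outcome. It is a deliberately simplified model of VRRP as understood here, not Keepalived's implementation; `Node`, `effective_priority`, and `elect` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    priority: int
    script_ok: bool = True   # result of the tracked health check script
    alive: bool = True       # whether the node is still sending VRRP adverts


def effective_priority(node: Node, weight: int = -20) -> int:
    """A negative script weight lowers the priority while the script fails."""
    return node.priority if node.script_ok else node.priority + weight


def elect(nodes: List[Node], current: Optional[str], nopreempt: bool,
          weight: int = -20) -> Optional[str]:
    """Return the name of the node that holds the VIP after an election."""
    live = [n for n in nodes if n.alive]
    if not live:
        return None
    # In non-preemptive mode a master that is still advertising keeps the VIP,
    # even if another node now has a higher effective priority.
    if nopreempt and current in {n.name for n in live}:
        return current
    return max(live, key=lambda n: effective_priority(n, weight)).name


if __name__ == "__main__":
    nodes = [Node("node1", 100, script_ok=False), Node("node2", 90)]
    print(elect(nodes, "node1", nopreempt=False))  # node2 (80 < 90)
    print(elect(nodes, "node1", nopreempt=True))   # node1 keeps the VIP
```

In this model, a weighted script failure combined with `nopreempt` leaves the VIP on the unhealthy node; the VIP only moves once that node stops advertising, which is what happens when the script failure drives the instance into the FAULT state instead.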
Health check script:
#!/bin/bash
# /etc/keepalived/check_haproxy.sh
# Succeeds only if an HAProxy process is running (-x matches the exact process name)
if pgrep -x haproxy > /dev/null; then
exit 0
else
exit 1
fi
Notification scripts:
#!/bin/bash
# /etc/keepalived/notify_master.sh
LOG_FILE="/var/log/keepalived-notify.log"
VIP="192.168.1.200"
echo "[$(date)] Keepalived entered MASTER state" >> "$LOG_FILE"
# Send an email notification
# mail -s "SQL Admin HA: MASTER node activated" admin@example.com << EOF
# Virtual IP $VIP is now active
# This node has become the master
# EOF
exit 0
#!/bin/bash
# /etc/keepalived/notify_backup.sh
LOG_FILE="/var/log/keepalived-notify.log"
echo "[$(date)] Keepalived entered BACKUP state" >> "$LOG_FILE"
exit 0
#!/bin/bash
# /etc/keepalived/notify_fault.sh
LOG_FILE="/var/log/keepalived-notify.log"
echo "[$(date)] Keepalived entered FAULT state" >> "$LOG_FILE"
# Send an alert email
# mail -s "ALERT: SQL Admin HA fault" admin@example.com << EOF
# Keepalived detected a fault
# The virtual IP may have moved to another node
# Check the service status immediately
# EOF
exit 0
Start Keepalived:
# Make the scripts executable
sudo chmod +x /etc/keepalived/check_haproxy.sh
sudo chmod +x /etc/keepalived/notify_*.sh
# Validate the configuration (requires Keepalived 2.x)
sudo keepalived --config-test
# Enable and start the service
sudo systemctl enable keepalived
sudo systemctl start keepalived
# Check whether the VIP is bound
ip addr show eth0 | grep 192.168.1.200
# Check the VRRP state in the logs
sudo grep -i keepalived /var/log/messages | tail -20
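5.2 VIP Failover Test
Verify that Keepalived fails over correctly:
Trigger a failover manually:
# On the primary node, stop Keepalived
sudo systemctl stop keepalived
# Check that the VIP has moved to the standby node
# Run on the standby node
ip addr show eth0 | grep 192.168.1.200
sudo grep -i "VRRP_Script\|vrrp_" /var/log/messages | tail -10
# Restore the primary node
sudo systemctl start keepalived
# Check where the VIP is now: with nopreempt in effect it stays on the standby node;
# in preemptive mode it returns to the primary
ip addr show eth0 | grep 192.168.1.200
Automated failover test: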
#!/bin/bash
# /opt/scripts/ha-failover-test.sh
VIP="192.168.1.200"
MASTER_NODE="sqladmin-node1"
BACKUP_NODE="sqladmin-node2"
echo "===== SQL Admin HA failover test ====="
echo "Test time: $(date)"
echo ""
# Report which node currently holds the VIP
# (assumes that if the primary does not hold it, the standby does)
check_vip() {
current_node=$(ssh root@$MASTER_NODE "ip addr show eth0 2>/dev/null | grep $VIP" > /dev/null && echo $MASTER_NODE || echo $BACKUP_NODE)
echo "VIP is currently on: $current_node"
}
# Simulate a failure
echo "1. Simulating a primary node failure..."
ssh root@$MASTER_NODE "systemctl stop keepalived"
echo " Keepalived stopped"
echo ""
echo "2. Waiting for the VIP to move (10 seconds)..."
sleep 10
echo ""
echo "3. Checking the new VIP location..."
check_vip
echo ""
echo "4. Restoring the primary node..."
ssh root@$MASTER_NODE "systemctl start keepalived"
sleep 5
echo ""
echo "5. Final check (if nopreempt is in effect, the VIP stays on the standby node)..."
check_vip
echo ""
echo "===== Test complete ====="
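The shell test waits a fixed 10 seconds, which confirms that failover happened but not how long the service was unreachable. The sketch below polls a probe and reports the observed outage so the result can be compared with the RTO target. `measure_outage` is a hypothetical helper; the probe is passed in as a callable (for example an HTTP request against the VIP) so the function itself makes no assumption about SQL Admin's endpoints.

```python
import time
from typing import Callable, Optional


def measure_outage(probe: Callable[[], bool],
                   clock: Callable[[], float] = time.monotonic,
                   sleep: Callable[[float], None] = time.sleep,
                   timeout: float = 60.0,
                   interval: float = 0.5) -> Optional[float]:
    """Poll probe() and return the seconds between the first failed probe
    and the next successful one, or None if no complete outage was seen
    before the timeout."""
    start = clock()
    first_fail = None
    while clock() - start < timeout:
        ok = probe()
        now = clock()
        if not ok and first_fail is None:
            first_fail = now           # outage begins
        if ok and first_fail is not None:
            return now - first_fail    # service is back
        sleep(interval)
    return None


if __name__ == "__main__":
    # Demo with a scripted probe and a fake clock instead of a real service
    fake_time = [0.0]
    answers = iter([True, False, False, True])
    outage = measure_outage(
        probe=lambda: next(answers),
        clock=lambda: fake_time[0],
        sleep=lambda s: fake_time.__setitem__(0, fake_time[0] + s),
        timeout=10.0,
        interval=1.0,
    )
    print(f"observed outage: {outage} s")
```

The measured value has a resolution of one polling interval, so a shorter interval gives a tighter estimate at the cost of more probe traffic.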
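6. Redis Session Sharing
6.1 Installing Redis
Redis stores SQL Admin sessions so that user sessions survive a failover:
Install a standalone Redis instance (all nodes):
# Install Redis
sudo yum install -y redis
# Configure Redis (/etc/redis/redis.conf on CentOS Stream 9; /etc/redis.conf on CentOS 7 with EPEL)
sudo tee /etc/redis/redis.conf > /dev/null << 'EOF'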
bind 0.0.0.0
protected-mode no
port 6379
tcp-backlog 511
timeout 0
tcp-keepalive 300
daemonize no
supervised systemd
pidfile /var/run/redis/redis.pid
loglevel notice
logfile /var/log/redis/redis.log
databases 16
# 持久化配置
save 900 1
save 300 10
save 60 10000
stop-writes-on-bgsave-error yes
rdbcompression yes
rdbchecksum yes
dbfilename dump.rdb
dir /var/lib/redis
# AOF持久化
appendonly yes
appendfilename "appendonly.aof"
appendfsync everysec
no-appendfsync-on-rewrite no
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb
# 内存配置
maxmemory 1gb
maxmemory-policy allkeys-lru
# 安全配置
requirepass your_redis_password
EOF
# Enable and start Redis
sudo systemctl enable redis
sudo systemctl start redis
# Test the connection
redis-cli -a your_redis_password ping
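Redis Sentinel mode (recommended):
# Sentinel ships with the redis package; there is no separate redis-sentinel package to install.
# Before enabling Sentinel, configure Node2 and Node3 as replicas of Node1
# (replicaof 192.168.1.101 6379 plus masterauth), otherwise there is nothing to fail over to.
# Configure Sentinel (/etc/redis/sentinel.conf on CentOS Stream 9; /etc/redis-sentinel.conf on CentOS 7)
sudo tee /etc/redis/sentinel.conf > /dev/null << 'EOF'
port 26379
daemonize no
protected-mode no
pidfile /var/run/redis-sentinel.pid
logfile /var/log/redis/sentinel.log
dir /tmp
# Monitor the master. Use the master's real address (Node1 here) rather than
# 127.0.0.1, so that every Sentinel watches the same instance. Quorum is 2.
sentinel monitor sqladmin-master 192.168.1.101 6379 2
sentinel auth-pass sqladmin-master your_redis_password
# Failover settings
sentinel down-after-milliseconds sqladmin-master 3000
sentinel parallel-syncs sqladmin-master 1
sentinel failover-timeout sqladmin-master 18000
# Notification (the script must exist and be executable, or Sentinel will not start)
sentinel notification-script sqladmin-master /etc/redis/sentinel-notify.sh
EOF
# Enable and start Sentinel
sudo systemctl enable redis-sentinel
sudo systemctl start redis-sentinel
# Check the Sentinel state
redis-cli -p 26379 sentinel masters
redis-cli -p 26379 sentinel master sqladmin-master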
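The quorum value in `sentinel monitor` (2 above) is often confused with the majority needed to actually perform a failover. As described in the Redis Sentinel documentation, the quorum decides when the master is flagged as objectively down, while the failover itself must be authorized by a majority of all Sentinels. The hypothetical helper below captures that rule in a simplified form.

```python
def can_failover(total_sentinels: int, reachable: int,
                 agree_down: int, quorum: int) -> bool:
    """Simplified Sentinel rule: the master must be flagged down by at least
    `quorum` Sentinels, and a majority of all Sentinels must be reachable
    to authorize the failover."""
    majority = total_sentinels // 2 + 1
    return agree_down >= quorum and reachable >= majority


if __name__ == "__main__":
    # Three Sentinels (one per node), quorum 2 as configured above
    print(can_failover(3, reachable=2, agree_down=2, quorum=2))  # one node lost
    print(can_failover(3, reachable=1, agree_down=1, quorum=2))  # isolated Sentinel
    # Five Sentinels: quorum reached, but only a minority is reachable
    print(can_failover(5, reachable=2, agree_down=2, quorum=2))
```

This is why three Sentinels on three separate nodes is the usual minimum: with only two, losing either node leaves no majority and no failover can take place.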
6.2 SQL Admin Session Configuration
Configure SQL Admin to store sessions in Redis:
Environment variables:
# Every SQL Admin node uses the same Redis connection settings
cat >> ~/.bashrc << 'EOF'
export SESSION_TYPE=redis
export REDIS_HOST=sqladmin-redis-master
export REDIS_PORT=6379
export REDIS_PASSWORD=your_redis_password
export REDIS_DB=0
export REDIS_PREFIX=sqladmin:
export REDIS_TTL=3600
EOF
source ~/.bashrc
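To show how these variables fit together, the sketch below reads them into a typed configuration object and derives a session key. How SQL Admin itself consumes the variables is not documented here, so `SessionConfig`, `load_session_config`, and `session_key` are illustrative assumptions, as are the fallback defaults.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class SessionConfig:
    host: str
    port: int
    db: int
    prefix: str
    ttl: int  # session lifetime in seconds


def load_session_config(env: Mapping[str, str] = os.environ) -> SessionConfig:
    """Build the Redis session settings from environment variables.
    The defaults are assumptions for this sketch, not SQL Admin defaults."""
    return SessionConfig(
        host=env.get("REDIS_HOST", "localhost"),
        port=int(env.get("REDIS_PORT", "6379")),
        db=int(env.get("REDIS_DB", "0")),
        prefix=env.get("REDIS_PREFIX", "sqladmin:"),
        ttl=int(env.get("REDIS_TTL", "3600")),
    )


def session_key(cfg: SessionConfig, session_id: str) -> str:
    """Namespace session IDs so several applications can share one Redis DB."""
    return f"{cfg.prefix}{session_id}"


if __name__ == "__main__":
    cfg = load_session_config()
    print(cfg)
    print(session_key(cfg, "abc123"))
```

Because every node derives the same key for the same session ID, a request that lands on a different node after a failover still finds the existing session.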
Docker Compose configuration with Redis:
# docker-compose.yml
version: '3.8'
services:
  redis:
    image: redis:7-alpine
    container_name: redis-master
    restart: always
    command: redis-server --requirepass your_redis_password --appendonly yes
    volumes:
      - redis-data:/data
    networks:
      - sqladmin-net
    healthcheck:
      test: ["CMD", "redis-cli", "-a", "your_redis_password", "ping"]
      interval: 10s
      timeout: 5s
      retries: 5
  redis-sentinel:
    image: redis:7-alpine
    container_name: redis-sentinel
    restart: always
    command: redis-sentinel /usr/local/etc/redis/sentinel.conf
    volumes:
      # Sentinel rewrites its configuration file, so the mount must be writable
      - ./sentinel.conf:/usr/local/etc/redis/sentinel.conf
    networks:
      - sqladmin-net
  sqladmin:
    image: sqladmin:3.2
    depends_on:
      redis:
        condition: service_healthy
    environment:
      - SESSION_TYPE=redis
      - REDIS_HOST=redis-master
      - REDIS_PORT=6379
      - REDIS_PASSWORD=your_redis_password
    networks:
      - sqladmin-net
volumes:
  redis-data:
networks:
  sqladmin-net:
    driver: bridge
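7. Health Monitoring and Alerting
7.1 Monitoring Metrics
Thorough monitoring is essential to keeping the service highly available:
Key metrics:
| Category | Metric | Alert threshold | Response |
|---|---|---|---|
| Service availability | SQL Admin process | Stopped | Restart automatically |
| Load balancer | Nginx/HAProxy | Down | Trigger failover |
| VIP state | Keepalived | Not master | Check the node |
| Response time | HTTP requests | >3 s | Scale out or optimize |
| Connections | Connections on the service port | >5000 | Scale out |
| Disk usage | Disk space | >80% | Clean up |
| Memory usage | RAM utilization | >85% | Tune configuration |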
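The numeric thresholds in the table can be expressed as data and evaluated in one place, which keeps the alerting logic and the documented limits from drifting apart. The metric names and the `breached` helper below are illustrative choices, not identifiers used by SQL Admin or Prometheus.

```python
from typing import Dict, List

# Numeric alert thresholds taken from the table above
THRESHOLDS: Dict[str, float] = {
    "response_time_s": 3.0,
    "connections": 5000,
    "disk_pct": 80,
    "memory_pct": 85,
}


def breached(samples: Dict[str, float]) -> List[str]:
    """Return the names of all metrics whose sample exceeds its threshold."""
    return sorted(
        name for name, limit in THRESHOLDS.items()
        if samples.get(name, 0) > limit
    )


if __name__ == "__main__":
    current = {"response_time_s": 3.5, "connections": 120,
               "disk_pct": 80, "memory_pct": 90}
    print(breached(current))
```

A value exactly at the limit (disk usage of 80% in the example) does not alert, matching the strict ">" comparisons in the table.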
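7.2 Monitoring with Prometheus and Grafana
Build the monitoring platform on Prometheus and Grafana:
Install Prometheus:
# Download Prometheus
cd /opt
sudo wget https://github.com/prometheus/prometheus/releases/download/v2.48.0/prometheus-2.48.0.linux-amd64.tar.gz
sudo tar -xzf prometheus-2.48.0.linux-amd64.tar.gz
sudo mv prometheus-2.48.0.linux-amd64 prometheus
# Create the data directory referenced by --storage.tsdb.path
sudo mkdir -p /var/lib/prometheus
# Create the systemd unit
sudo tee /etc/systemd/system/prometheus.service > /dev/null << 'EOF'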
[Unit]
Description=Prometheus
After=network.target
[Service]
Type=simple
User=root
ExecStart=/opt/prometheus/prometheus --config.file=/opt/prometheus/prometheus.yml --storage.tsdb.path=/var/lib/prometheus
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable prometheus
sudo systemctl start prometheus
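Prometheus configuration:
# /opt/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  # Assumes SQL Admin exposes Prometheus metrics at /metrics
  - job_name: 'sqladmin-nodes'
    metrics_path: '/metrics'
    static_configs:
      - targets:
          - sqladmin-node1:8080
          - sqladmin-node2:8080
          - sqladmin-node3:8080
  # Host-level metrics from node_exporter (default port 9100) on the load balancer nodes.
  # Keepalived has no built-in Prometheus endpoint; add a dedicated exporter if needed.
  - job_name: 'node'
    static_configs:
      - targets:
          - sqladmin-node1:9100
          - sqladmin-node2:9100
  # Nginx metrics from nginx-prometheus-exporter (default port 9113)
  - job_name: 'nginx'
    static_configs:
      - targets:
          - sqladmin-node1:9113
          - sqladmin-node2:9113
  # Redis metrics from redis_exporter (default port 9121)
  - job_name: 'redis'
    static_configs:
      - targets:
          - sqladmin-node1:9121
          - sqladmin-node2:9121
          - sqladmin-node3:9121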
7.3 Alerting Script
Set up automated alerting:
#!/usr/bin/env python3
"""
SQL Admin HA monitoring and alerting script
"""
import os
import sys
import subprocess
import smtplib
from email.mime.text import MIMEText
from datetime import datetime


class HAAlarm:
    def __init__(self):
        self.alarm_rules = {
            'sqladmin_process': {'check': self.check_sqladmin_process, 'threshold': 0},
            'nginx_status': {'check': self.check_nginx_status, 'threshold': 0},
            'keepalived_vip': {'check': self.check_keepalived_vip, 'threshold': 0},
            'disk_usage': {'check': self.check_disk_usage, 'threshold': 80},
            'memory_usage': {'check': self.check_memory_usage, 'threshold': 85},
        }

    def check_sqladmin_process(self):
        """Check that a SQL Admin process is running."""
        try:
            result = subprocess.run(
                ['pgrep', '-f', 'sqladmin'],
                capture_output=True, text=True
            )
            # pgrep exits with 0 only when at least one process matched
            return result.returncode == 0
        except Exception:
            return False

    def check_nginx_status(self):
        """Check that the nginx service is active."""
        try:
            result = subprocess.run(
                ['systemctl', 'is-active', 'nginx'],
                capture_output=True, text=True
            )
            # Compare exactly: "inactive" also contains the substring "active"
            return result.stdout.strip() == 'active'
        except Exception:
            return False

    def check_keepalived_vip(self):
        """Check that this node holds the VIP (expected to fail on standby nodes)."""
        try:
            result = subprocess.run(
                ['ip', 'addr', 'show'],
                capture_output=True, text=True
            )
            return '192.168.1.200' in result.stdout
        except Exception:
            return False

    def check_disk_usage(self):
        """Return the usage of the root filesystem in percent."""
        try:
            result = subprocess.run(
                ['df', '-h', '/'],
                capture_output=True, text=True
            )
            lines = result.stdout.strip().split('\n')
            if len(lines) > 1:
                usage = lines[1].split()[4]
                return int(usage.rstrip('%'))
            return 0
        except Exception:
            return 0

    def check_memory_usage(self):
        """Return the RAM usage in percent."""
        try:
            result = subprocess.run(
                ['free', '-m'],
                capture_output=True, text=True
            )
            lines = result.stdout.strip().split('\n')
            # Line 0 is the header, line 1 is "Mem:", line 2 is "Swap:"
            if len(lines) > 1:
                mem_line = lines[1].split()
                total = int(mem_line[1])
                used = int(mem_line[2])
                return int(used / total * 100) if total else 0
            return 0
        except Exception:
            return 0

    def send_alarm(self, alarm_type, message):
        """Print the alert and send it by email."""
        print(f"[ALARM] {alarm_type}: {message}")
        # Email alert
        try:
            smtp_server = os.environ.get('SMTP_SERVER', 'localhost')
            smtp_port = int(os.environ.get('SMTP_PORT', 25))
            smtp_user = os.environ.get('SMTP_USER', '')
            smtp_password = os.environ.get('SMTP_PASSWORD', '')
            body = (
                "SQL Admin HA alert\n"
                f"Type: {alarm_type}\n"
                f"Time: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}\n"
                f"Details: {message}\n"
                "Please investigate promptly."
            )
            msg = MIMEText(body)
            msg['Subject'] = f'[ALERT] SQL Admin HA - {alarm_type}'
            msg['From'] = smtp_user
            msg['To'] = os.environ.get('ALARM_EMAIL', 'admin@example.com')
            with smtplib.SMTP(smtp_server, smtp_port) as server:
                if smtp_user:
                    server.login(smtp_user, smtp_password)
                server.send_message(msg)
            print("[ALARM] alert email sent")
        except Exception as e:
            print(f"[ALARM] failed to send email: {e}")

    def run_check(self):
        """Run every check and return True if all of them pass."""
        print(f"\n===== SQL Admin HA health check - {datetime.now().strftime('%Y-%m-%d %H:%M:%S')} =====\n")
        all_ok = True
        for check_name, config in self.alarm_rules.items():
            try:
                if check_name in ['disk_usage', 'memory_usage']:
                    value = config['check']()
                    status = value <= config['threshold']
                    status_str = f"OK ({value}%)" if status else f"ALERT ({value}%)"
                else:
                    status = config['check']()
                    status_str = "OK" if status else "ALERT"
                status_icon = "✅" if status else "❌"
                print(f"{status_icon} {check_name}: {status_str}")
                if not status:
                    all_ok = False
                    self.send_alarm(check_name, status_str)
            except Exception as e:
                print(f"❌ {check_name}: check failed - {e}")
                all_ok = False
        print(f"\nOverall status: {'✅ healthy' if all_ok else '❌ degraded'}")
        return all_ok


if __name__ == '__main__':
    alarm = HAAlarm()
    ok = alarm.run_check()
    sys.exit(0 if ok else 1)
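Run the monitor on a schedule:
# Add to /etc/crontab (every 5 minutes). Entries in /etc/crontab need a user field (root here).
echo "*/5 * * * * root /usr/bin/python3 /opt/scripts/ha_monitor.py >> /var/log/ha_monitor.log 2>&1" | sudo tee -a /etc/crontab
# crond picks up changes to /etc/crontab automatically; no restart is required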
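8. Failover Testing
8.1 Failure Scenarios
Regular failover drills are the only reliable way to confirm that high availability actually works:
Test scenarios:
| Scenario | Action | Expected result | Verification |
|---|---|---|---|
| Single node failure | Stop the SQL Admin process | The load balancer removes the node and routes requests to the remaining nodes; the VIP stays where it is | Check service availability |
| Load balancer failure | Stop Nginx/HAProxy | VIP moves; the standby node takes over | Access the service through the VIP |
| Keepalived failure | Stop Keepalived | VIP switches automatically | Check the VIP location |
| Network partition | Disconnect the network | The standby node takes over | ping the VIP |
| Shared storage failure | Stop NFS | Falls back to local cache mode | Verify data consistency |
Failover test script: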
#!/bin/bash
# /opt/scripts/ha-failover-comprehensive-test.sh
LOG_FILE="/var/log/ha-failover-test.log"
VIP="192.168.1.200"
NODES=("sqladmin-node1" "sqladmin-node2" "sqladmin-node3")
log() {
echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a $LOG_FILE
}
check_vip() {
for node in "${NODES[@]}"; do
if ssh root@$node "ip addr show eth0 2>/dev/null | grep -q $VIP"; then
echo "$node"
return 0
fi
done
echo "none"
return 1
}
# 场景1: SQL Admin进程故障
log "===== 场景1: SQL Admin进程故障测试 ====="
ssh root@${NODES[0]} "docker stop sqladmin"
log "SQL Admin容器已停止"
sleep 15
result=$(check_vip)
log "VIP当前位置: $result"
if [ "$result" != "none" ]; then
log "✅ 场景1测试通过"
else
log "❌ 场景1测试失败"
fi
# 恢复
ssh root@${NODES[0]} "docker start sqladmin"
sleep 10
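# Scenario 2: load balancer failure
# Note: the Keepalived check script in section 5.1 tracks HAProxy. Stop whichever
# load balancer your check script monitors, otherwise the VIP will not move.
log "===== Scenario 2: Nginx failure test ====="
ssh root@${NODES[0]} "systemctl stop nginx"
log "Nginx stopped"
sleep 10
result=$(check_vip)
log "Current VIP holder: $result"
if [ "$result" == "${NODES[1]}" ] || [ "$result" == "${NODES[2]}" ]; then
log "✅ Scenario 2 passed (VIP moved)"
else
log "⚠️ Scenario 2: VIP did not move; check that the Keepalived track script monitors Nginx"
fi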
# 恢复
ssh root@${NODES[0]} "systemctl start nginx"
sleep 10
# 场景3: Keepalived故障
log "===== 场景3: Keepalived故障测试 ====="
ssh root@${NODES[0]} "systemctl stop keepalived"
log "Keepalived已停止"
sleep 5
result=$(check_vip)
log "VIP当前位置: $result"
if [ "$result" == "${NODES[1]}" ] || [ "$result" == "${NODES[2]}" ]; then
log "✅ 场景3测试通过(VIP已漂移到备用节点)"
else
log "❌ 场景3测试失败"
fi
# 恢复
ssh root@${NODES[0]} "systemctl start keepalived"
sleep 10
log "===== 所有测试完成 ====="
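9. Disaster Recovery Plan
9.1 RTO and RPO Targets
Define explicit recovery targets:
| Tier | RTO | RPO | Description |
|---|---|---|---|
| Core business | ≤15 minutes | ≤5 minutes | Critical database management |
| General business | ≤1 hour | ≤1 hour | Internal systems |
| Test environment | ≤4 hours | ≤4 hours | Development and testing |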
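After each failover drill, the measured recovery time and data-loss window can be checked against these tiers. The sketch below encodes the table in minutes; the tier keys and the `meets_objectives` function are illustrative names chosen for this example.

```python
from typing import Dict, Tuple

# (RTO, RPO) in minutes, taken from the table above
TIERS: Dict[str, Tuple[int, int]] = {
    "core": (15, 5),
    "general": (60, 60),
    "test": (240, 240),
}


def meets_objectives(tier: str, measured_rto_min: float,
                     measured_rpo_min: float) -> bool:
    """True when both the measured recovery time and the measured
    data-loss window are within the targets of the given tier."""
    rto, rpo = TIERS[tier]
    return measured_rto_min <= rto and measured_rpo_min <= rpo


if __name__ == "__main__":
    # Example drill: service back after 12 minutes, 5 minutes of data lost
    print(meets_objectives("core", 12, 5))
    # Recovery took 16 minutes: misses the core-tier RTO
    print(meets_objectives("core", 16, 1))
```

Recording these results per drill gives a simple trend of whether the deployment is still within its recovery targets.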
9.2 Backup Strategy
Configuration backup:
#!/bin/bash
# /opt/scripts/backup-ha-config.sh
BACKUP_DIR="/backup/ha-config"
DATE=$(date +%Y%m%d_%H%M%S)
mkdir -p "$BACKUP_DIR"
# Back up configuration files (paths that do not exist on this node are skipped silently)
tar czf "$BACKUP_DIR/sqladmin-config-$DATE.tar.gz" \
/opt/sqladmin/conf \
/etc/nginx/conf.d \
/etc/haproxy/haproxy.cfg \
/etc/keepalived/keepalived.conf \
/etc/redis/redis.conf \
/opt/prometheus/prometheus.yml \
2>/dev/null
# Back up the Docker Compose file
cp /opt/docker-compose.yml "$BACKUP_DIR/docker-compose-$DATE.yml"
# Keep the last 30 days of backups (archives and Compose copies)
find "$BACKUP_DIR" \( -name "*.tar.gz" -o -name "docker-compose-*.yml" \) -mtime +30 -delete
echo "Configuration backup complete: $BACKUP_DIR/sqladmin-config-$DATE.tar.gz"
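10. Best Practices
10.1 Architecture Design Principles
- Eliminate single points of failure: every component needs redundancy so that one failure cannot take down the whole service
- Automate failover: prefer automatic switching to reduce manual intervention time and human error
- Monitor everything: cover the full stack from infrastructure to the application layer
- Drill regularly: run a complete failover drill at least once per quarter
- Keep documentation current: record the architecture, configuration parameters, and incident procedures
10.2 Operations Checklist
Routine checks:
| Check | Frequency | Owner | Notes |
|---|---|---|---|
| VIP state | Hourly | Automated monitoring | Alert |
| Service processes | Every 5 minutes | Automated monitoring | Alert |
| Disk space | Daily | Scheduled job | Clean up |
| Log review | Weekly | Operations staff | Analyze anomalies |
| SSL certificates | Monthly | Operations staff | Expiry reminder |
| Backup verification | Monthly | Operations staff | Restore test |
Following this guide, you can build a stable and reliable SQL Admin cluster on CentOS and keep the database management service continuously available.
This article is based on CentOS Stream 9 and SQL Admin 3.2 and applies to designing and implementing an HA architecture for SQL Admin on CentOS.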