A Complete Guide to SQL Admin High Availability on CentOS (2026)

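1. Overview of SQL Admin High Availability

SQL Admin is a database administration tool that plays a key management role in enterprise IT. If the SQL Admin service goes down, operators can no longer monitor or manage their databases effectively, which can delay business responses and leave incidents unhandled. Deploying SQL Admin in a highly available configuration is therefore essential in production.

High availability (HA) means that a system stays operational most of the time and keeps downtime to a minimum. On CentOS, a well-designed SQL Admin deployment can reach 99.9% availability or better (at most about 8.76 hours of downtime per year), keeping the database administration service running continuously and reliably.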

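Core high-availability metrics

Metric | Definition | Calculation | Target
Availability | Share of time the system is up | Uptime / (Uptime + Downtime) | ≥ 99.9%
MTBF | Mean time between failures | Total uptime / number of failures | ≥ 8,760 hours
MTTR | Mean time to repair | Total downtime / number of repairs | ≤ 30 minutes
RTO | Recovery time objective | Longest acceptable time from outage to recovery | ≤ 15 minutes
RPO | Recovery point objective | Largest acceptable window of data loss | ≤ 5 minutes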
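
The targets in the table are linked by simple arithmetic. The sketch below is an illustration only (the function names are invented for this example and are not part of SQL Admin): it computes steady-state availability from MTBF and MTTR and converts an availability target into a yearly downtime budget.

```python
HOURS_PER_YEAR = 8760  # 365 days * 24 hours


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def downtime_budget_hours(target: float, period_hours: float = HOURS_PER_YEAR) -> float:
    """Maximum downtime allowed in the period for a given availability target."""
    return (1.0 - target) * period_hours


if __name__ == "__main__":
    # Targets from the table: MTBF >= 8760 h, MTTR <= 30 min (0.5 h)
    print(f"availability at target MTBF/MTTR: {availability(8760, 0.5):.5%}")
    print(f"downtime budget at 99.9%: {downtime_budget_hours(0.999):.2f} h/year")
```

Meeting both the MTBF and MTTR targets gives an availability well above 99.9%, so the 99.9% figure leaves room for more than one short outage per year.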

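2. High Availability Architecture Design

2.1 Common HA architectures

An HA architecture for SQL Admin has to balance cost, performance, and reliability. Three common designs are shown below:

Architecture 1: load balancing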

                    ┌─────────────────┐
                    │    Nginx/LB     │
                    │(VIP+healthcheck)│
                    └────────┬────────┘
                             │
              ┌──────────────┼──────────────┐
              │              │              │
        ┌─────▼─────┐  ┌─────▼─────┐  ┌─────▼─────┐
        │ SQL Admin │  │ SQL Admin │  │ SQL Admin │
        │  Node 1   │  │  Node 2   │  │  Node 3   │
        └─────┬─────┘  └─────┬─────┘  └─────┬─────┘
              │              │              │
              └──────────────┼──────────────┘
                             │
                    ┌────────▼────────┐
                    │ Shared storage  │
                    │ (NFS/GlusterFS) │
                    └─────────────────┘

Architecture 2: active/standby failover

         Active                    Standby
    ┌──────────────┐          ┌──────────────┐
    │ SQL Admin    │ heartbeat│ SQL Admin    │
    │ (primary)    │◄────────►│ (standby)    │
    └──────┬───────┘          └──────┬───────┘
           │                         │
    ┌──────▼───────┐          ┌──────▼───────┐
    │  SQL Server  │          │  SQL Server  │
    │   primary    │          │   standby    │
    └──────────────┘          └──────────────┘

Architecture 3: multi-active

    ┌────────────┐  ┌────────────┐  ┌────────────┐
    │ Region A   │  │ Region B   │  │ Region C   │
    │ ┌────────┐ │  │ ┌────────┐ │  │ ┌────────┐ │
    │ │Admin A │ │  │ │Admin B │ │  │ │Admin C │ │
    │ └────────┘ │  │ └────────┘ │  │ └────────┘ │
    └─────┬──────┘  └─────┬──────┘  └─────┬──────┘
          │               │               │
          └───────────────┼───────────────┘
                          │
                   ┌──────▼──────┐
                   │  Data sync  │
                   │ (DRBD/GIT)  │
                   └─────────────┘

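2.2 Choosing an architecture

Choose an architecture based on business requirements and available resources:

Architecture   Use case   Cost   Complexity   Rating
Load balancing   Small and medium businesses, many concurrent users   ⭐⭐⭐⭐
Active/standby failover   High reliability requirements   ⭐⭐⭐⭐⭐
Multi-active   Large enterprises, cross-region deployment   Very high   ⭐⭐⭐
Containers + orchestration   Cloud-native environments   ⭐⭐⭐⭐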

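3. Environment Preparation

3.1 System requirements

Operating system versions

Component | Recommended | Minimum | Notes
CentOS | Stream 9 | 7 | Stream 9 recommended; CentOS 7 and Stream 8 reached end of life in 2024
Kernel | 5.x | 3.10 | Newer kernels perform better
glibc | 2.34 | 2.17 | Affects binary compatibility
OpenSSL | 3.0.x | 1.1.1 | 3.0 or later recommended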

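Network requirements

# Set the hostname (unique on each server)
sudo hostnamectl set-hostname sqladmin-node1.example.com

# Add every node to /etc/hosts (on all nodes)
sudo tee -a /etc/hosts > /dev/null << 'EOF'
192.168.1.101  sqladmin-node1
192.168.1.102  sqladmin-node2
192.168.1.103  sqladmin-node3
192.168.1.200  sqladmin-vip
EOF

# Option A: disable the firewall (trusted internal networks only)
sudo systemctl stop firewalld
sudo systemctl disable firewalld

# Option B: keep the firewall and open only what is needed
sudo firewall-cmd --permanent --add-service=http --add-service=https
sudo firewall-cmd --permanent --add-port=8080/tcp
sudo firewall-cmd --permanent --add-port=8443/tcp
# Keepalived speaks VRRP (IP protocol 112), not a TCP port
sudo firewall-cmd --permanent --add-rich-rule='rule protocol value="vrrp" accept'
sudo firewall-cmd --reload

# Verify connectivity
ping -c 3 sqladmin-node1
ping -c 3 sqladmin-node2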
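Time synchronization

# Install chrony
sudo yum install -y chrony

# chrony server configuration (Node1)
# "sudo cat > file" would open the file as the calling user, so use tee instead
sudo tee /etc/chrony.conf > /dev/null << 'EOF'
server 0.pool.ntp.org iburst
server 1.pool.ntp.org iburst
server 2.pool.ntp.org iburst
driftfile /var/lib/chrony/drift
makestep 1.0 3
rtcsync
allow 192.168.1.0/24
local stratum 10
EOF

# chrony client configuration (Node2, Node3)
sudo tee /etc/chrony.conf > /dev/null << 'EOF'
server sqladmin-node1 iburst
server 1.pool.ntp.org iburst
server 2.pool.ntp.org iburst
driftfile /var/lib/chrony/drift
makestep 1.0 3
rtcsync
EOF

# Start the service
sudo systemctl enable chronyd
sudo systemctl start chronyd

# Verify synchronization
chronyc sources
chronyc tracking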

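3.2 Shared storage

For applications that need to share session state across nodes, configure NFS shared storage:

NFS server (Node1)

# Install the NFS server
sudo yum install -y nfs-utils rpcbind

# Create the shared directory
sudo mkdir -p /data/sqladmin/shared
sudo chmod 777 /data/sqladmin/shared   # permissive; restrict to the service user in production

# Configure the export
# (no_root_squash lets a remote root act as root; prefer root_squash where possible)
sudo tee -a /etc/exports > /dev/null << 'EOF'
/data/sqladmin/shared 192.168.1.0/24(rw,sync,no_root_squash,no_all_squash)
EOF

# Start the services
sudo systemctl enable rpcbind nfs-server
sudo systemctl start rpcbind nfs-server
sudo exportfs -r

# Verify the export
sudo exportfs -v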

NFS clients (Node2, Node3)

# Install the NFS client
sudo yum install -y nfs-utils

# Create the mount point
sudo mkdir -p /data/sqladmin/shared

# Mount manually to test
sudo mount -t nfs sqladmin-node1:/data/sqladmin/shared /data/sqladmin/shared

# Mount automatically at boot (/etc/fstab)
echo "sqladmin-node1:/data/sqladmin/shared /data/sqladmin/shared nfs defaults,_netdev 0 0" | sudo tee -a /etc/fstab

# Verify the mount
df -h | grep shared
mount | grep nfs

3.3 Docker environment

Running SQL Admin in Docker containers is recommended because it simplifies the HA setup:

Install Docker (all nodes)

# Install Docker
sudo yum install -y yum-utils
sudo yum-config-manager --add-repo \
    https://download.docker.com/linux/centos/docker-ce.repo

sudo yum install -y docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin

# Start Docker
sudo systemctl enable docker
sudo systemctl start docker

# Add the current user to the docker group (takes effect at the next login)
sudo usermod -aG docker $USER

# Configure registry mirrors and logging
sudo mkdir -p /etc/docker
sudo tee /etc/docker/daemon.json > /dev/null << 'EOF'
{
    "registry-mirrors": [
        "https://docker.mirrors.ustc.edu.cn",
        "https://hub-mirror.c.163.com"
    ],
    "log-driver": "json-file",
    "log-opts": {
        "max-size": "100m",
        "max-file": "3"
    },
    "storage-driver": "overlay2"
}
EOF

sudo systemctl restart docker

Docker Compose configuration

# docker-compose.yml
version: '3.8'

services:
  sqladmin:
    image: sqladmin:3.2
    container_name: sqladmin
    restart: always
    ports:
      - "8080:8080"
      - "8443:8443"
    environment:
      - DB_HOST=${DB_HOST}
      - DB_PORT=${DB_PORT}
      - DB_USER=${DB_USER}
      - DB_PASSWORD=${DB_PASSWORD}
      - SESSION_TYPE=redis
      - REDIS_HOST=redis-cluster
      - REDIS_PORT=6379
    volumes:
      - /data/sqladmin/shared:/app/shared
      - /data/sqladmin/logs:/app/logs
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s
    networks:
      - sqladmin-net

networks:
  sqladmin-net:
    driver: bridge   # the overlay driver is only available in Swarm mode
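
To make the healthcheck parameters above concrete, here is a small model of how a container's health status evolves from a series of probe results. It is a simplified sketch of Docker's documented behaviour, not Docker's code: it assumes the probe results start after `start_period`, and the function name is invented for this example.

```python
def health_status(results, retries=3):
    """Return the health status after a sequence of probe results.

    results: iterable of booleans, True meaning the probe command succeeded.
    retries: consecutive failures needed before the container is unhealthy.
    """
    status = "starting"
    consecutive_failures = 0
    for ok in results:
        if ok:
            # Any success resets the failure streak and marks the container healthy
            consecutive_failures = 0
            status = "healthy"
        else:
            consecutive_failures += 1
            if consecutive_failures >= retries:
                status = "unhealthy"
    return status
```

With `interval: 30s` and `retries: 3`, a hung SQL Admin container is reported unhealthy after roughly 90 seconds. Plain Docker only reports that status; `restart: always` reacts to the container exiting, so an external check (the load balancer or a monitoring job) still has to act on an unhealthy container.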

4. Load Balancer High Availability Configuration

4.1 Nginx load balancing

Use Nginx as the load balancer in front of SQL Admin:

Install Nginx

# CentOS Stream 8/9
sudo yum install -y nginx

# CentOS 7
sudo yum install -y epel-release
sudo yum install -y nginx

# Start Nginx and enable it at boot
sudo systemctl enable nginx
sudo systemctl start nginx

# Verify the installation
nginx -v

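Configure load balancing

# /etc/nginx/conf.d/sqladmin-lb.conf

upstream sqladmin_backend {
    # Session affinity (optional): ip_hash pins each client address to one backend.
    # Remove it to fall back to weighted round-robin.
    ip_hash;

    server sqladmin-node1:8080 weight=5;
    server sqladmin-node2:8080 weight=5;
    server sqladmin-node3:8080 weight=5;

    # Pool of idle keepalive connections to the backends (this is not a health check)
    keepalive 32;
}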

# HTTP server
server {
    listen 80;
    server_name sqladmin.example.com;

    # Redirect to HTTPS
    return 301 https://$server_name$request_uri;
}

# HTTPS server
server {
    listen 443 ssl http2;
    server_name sqladmin.example.com;

    # TLS certificate
    ssl_certificate /etc/nginx/ssl/sqladmin.crt;
    ssl_certificate_key /etc/nginx/ssl/sqladmin.key;
    ssl_protocols TLSv1.2 TLSv1.3;
    ssl_ciphers HIGH:!aNULL:!MD5;
    ssl_prefer_server_ciphers on;

    # Security headers
    add_header X-Frame-Options "SAMEORIGIN" always;
    add_header X-Content-Type-Options "nosniff" always;
    add_header X-XSS-Protection "1; mode=block" always;

    # Logging
    access_log /var/log/nginx/sqladmin-access.log;
    error_log /var/log/nginx/sqladmin-error.log;

    # Proxy settings
    location / {
        proxy_pass http://sqladmin_backend;
        proxy_http_version 1.1;

        # Pass the real client address to the backend
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;

        # Timeouts
        proxy_connect_timeout 60s;
        proxy_send_timeout 60s;
        proxy_read_timeout 60s;

        # Clear the Connection header so upstream keepalive connections can be reused
        proxy_set_header Connection "";

        # Retry the next backend on errors, timeouts, and 5xx responses
        proxy_next_upstream error timeout invalid_header http_500 http_502 http_503;
    }

    # Health check endpoint
    location /health {
        proxy_pass http://sqladmin_backend/health;
        access_log off;
    }

    # Static asset caching
    location ~* \.(js|css|png|jpg|jpeg|gif|ico|svg|woff|woff2|ttf|eot)$ {
        proxy_pass http://sqladmin_backend;
        expires 30d;
        add_header Cache-Control "public, immutable";
    }
}

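Health check configuration

# Active health checks need NGINX Plus or a third-party module.
# Open-source Nginx relies on passive checks (max_fails / fail_timeout).
# This block is an alternative to the upstream block above, not an addition to it.

upstream sqladmin_backend {
    server sqladmin-node1:8080 weight=5 max_fails=3 fail_timeout=30s;
    server sqladmin-node2:8080 weight=5 max_fails=3 fail_timeout=30s;
    server sqladmin-node3:8080 weight=5 max_fails=3 fail_timeout=30s;

    # Backup node, used only when all primary nodes are unavailable
    # (the backup parameter cannot be combined with ip_hash)
    server sqladmin-backup:8080 backup;
}

# Responses that count as a failed attempt; place this inside the "location /" block
proxy_next_upstream error timeout http_500 http_502 http_503 http_504;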
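
The effect of `max_fails=3 fail_timeout=30s` can be modelled in a few lines. The sketch below is a simplified illustration of passive health checking, not Nginx's implementation; the class and method names are invented for this example, and time is passed in explicitly so the behaviour is easy to follow.

```python
class PassivePeer:
    """Toy model of one upstream server under passive health checking."""

    def __init__(self, max_fails=3, fail_timeout=30.0):
        self.max_fails = max_fails
        self.fail_timeout = fail_timeout
        self.fail_times = []   # timestamps of recent failed attempts
        self.down_until = 0.0  # the peer is skipped until this time

    def record(self, now, ok):
        """Record the outcome of one proxied request at time `now` (seconds)."""
        if ok:
            self.fail_times.clear()
            return
        # Only failures inside the fail_timeout window count
        self.fail_times = [t for t in self.fail_times if now - t < self.fail_timeout]
        self.fail_times.append(now)
        if len(self.fail_times) >= self.max_fails:
            # Too many failures: take the peer out of rotation for fail_timeout
            self.down_until = now + self.fail_timeout
            self.fail_times.clear()

    def available(self, now):
        return now >= self.down_until
```

Because the check is passive, a broken node is only detected by real client requests failing against it, which is why `proxy_next_upstream` matters: it retries those requests on another node instead of returning the error to the user.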

Testing the load balancer

# Check the configuration syntax
sudo nginx -t

# Reload the configuration
sudo systemctl reload nginx

# Watch the number of connections involving port 8080
watch "ss -ant | grep -c ':8080'"

# Send repeated requests to exercise load distribution
for i in {1..20}; do
    curl -s -o /dev/null -w "%{http_code} %{time_total}s\n" https://sqladmin.example.com/api/status
done

4.2 HAProxy load balancing

HAProxy is another popular load balancer:

Install HAProxy

# Install HAProxy
sudo yum install -y haproxy

# Back up the default configuration
sudo cp /etc/haproxy/haproxy.cfg /etc/haproxy/haproxy.cfg.bak

Configure HAProxy

# /etc/haproxy/haproxy.cfg

global
    log         127.0.0.1 local2
    chroot      /var/lib/haproxy
    pidfile     /var/run/haproxy.pid
    maxconn     4000
    user        haproxy
    group       haproxy
    daemon

    # Diffie-Hellman parameter size for TLS
    tune.ssl.default-dh-param 2048

defaults
    log         global
    mode        http
    option      httplog
    option      dontlognull
    option      http-server-close
    option      forwardfor except 127.0.0.0/8
    option      redispatch
    retries     3
    timeout     connect 5000
    timeout     client  50000
    timeout     server  50000
    errorfile   400 /etc/haproxy/errors/400.http
    errorfile   403 /etc/haproxy/errors/403.http
    errorfile   503 /etc/haproxy/errors/503.http

# Statistics page
listen stats
    bind *:8404
    mode http
    stats enable
    stats uri /stats
    stats refresh 30s
    stats auth admin:password        # replace with a strong credential

# SQL Admin frontend
frontend sqladmin_front
    bind *:80
    bind *:443 ssl crt /etc/haproxy/ssl/sqladmin.pem
    mode http

    default_backend sqladmin_back

    # ACL rules
    acl is_api path_beg /api
    acl is_admin path_beg /admin

    # Overwrite any client-supplied X-Forwarded-For with the real source address
    http-request set-header X-Forwarded-For %[src]

# SQL Admin backend
backend sqladmin_back
    mode http
    balance roundrobin

    # Active health check
    option httpchk GET /health
    http-check expect status 200

    # Servers
    server sqladmin1 sqladmin-node1:8080 check inter 2000 rise 2 fall 3 weight 100
    server sqladmin2 sqladmin-node2:8080 check inter 2000 rise 2 fall 3 weight 100
    server sqladmin3 sqladmin-node3:8080 check inter 2000 rise 2 fall 3 weight 100

    # Backup server
    server sqladmin-backup sqladmin-node-backup:8080 backup check inter 2000 rise 2 fall 3

# The statistics page is already served by the "listen stats" section on port 8404,
# so a second stats listener without authentication is not needed.

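Starting HAProxy

# Validate the configuration
sudo haproxy -f /etc/haproxy/haproxy.cfg -c

# Start the service
sudo systemctl enable haproxy
sudo systemctl start haproxy

# Check status and recent logs
# (do not run "haproxy -d" here: it would start a second instance on the same ports)
sudo systemctl status haproxy
sudo journalctl -u haproxy -n 50 --no-pager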
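
The `check inter 2000 rise 2 fall 3` parameters describe a small state machine: a server that is up goes down after 3 consecutive failed checks, and a server that is down comes back after 2 consecutive successful checks. The following sketch illustrates that rule; it is a model for reasoning about timing, not HAProxy code, and the function name is invented for this example.

```python
def server_states(checks, rise=2, fall=3, initially_up=True):
    """Return the server state ("UP"/"DOWN") after each health check result.

    checks: iterable of booleans, True meaning the check succeeded.
    """
    up = initially_up
    streak = 0  # consecutive results that contradict the current state
    states = []
    for ok in checks:
        if ok == up:
            streak = 0
        else:
            streak += 1
            if streak >= (fall if up else rise):
                up = not up
                streak = 0
        states.append("UP" if up else "DOWN")
    return states


if __name__ == "__main__":
    # Three failed checks take the server down, two good checks bring it back
    print(server_states([False, False, False, True, True]))
```

With checks every 2000 ms, a dead node leaves rotation after about 6 seconds (3 × 2 s) and returns about 4 seconds (2 × 2 s) after it recovers.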

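5. Keepalived Active/Standby Configuration

5.1 Installing and configuring Keepalived

Keepalived moves a virtual IP (VIP) between nodes, which provides automatic switchover between the primary and the standby:

Install Keepalived (all nodes)

# Install Keepalived
sudo yum install -y keepalived

# Back up the default configuration
sudo cp /etc/keepalived/keepalived.conf /etc/keepalived/keepalived.conf.bak

Primary node configuration (Node1)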

# /etc/keepalived/keepalived.conf

! Configuration File for keepalived

global_defs {
    router_id sqladmin_lb
    script_user root
    enable_script_security
}

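# Health check script
vrrp_script check_haproxy {
    script "/etc/keepalived/check_haproxy.sh"
    interval 2                       # run every 2 seconds
    fall 2                           # 2 consecutive failures mark the check as failed
    rise 1                           # 1 success marks it as healthy again
    # No "weight": a failed check puts the instance into FAULT and releases the VIP.
    # A weight of -20 would only lower the priority, which a nopreempt backup ignores.
}

# VRRP instance
vrrp_instance VI_1 {
    state BACKUP                     # nopreempt requires the initial state BACKUP on every node
    interface eth0                   # change to the actual interface name
    virtual_router_id 51
    priority 100                     # highest priority: wins the first election
    advert_int 1

    # Authentication
    authentication {
        auth_type PASS
        auth_pass 1111               # change this value
    }

    # Virtual IP
    virtual_ipaddress {
        192.168.1.200/24 dev eth0    # change to the actual interface and IP
    }

    # Tracked script
    track_script {
        check_haproxy
    }

    # Notification scripts
    notify_master "/etc/keepalived/notify_master.sh"
    notify_backup "/etc/keepalived/notify_backup.sh"
    notify_fault "/etc/keepalived/notify_fault.sh"

    # Non-preemptive mode: a recovered node does not take the VIP back.
    # To preempt after a delay instead, remove nopreempt and set "preempt_delay 300".
    nopreempt
}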

Standby node configuration (Node2)

# /etc/keepalived/keepalived.conf

! Configuration File for keepalived

global_defs {
    router_id sqladmin_lb
    script_user root
    enable_script_security
}

vrrp_script check_haproxy {
    script "/etc/keepalived/check_haproxy.sh"
    interval 2
    fall 2
    rise 1
    # No "weight": a failed check puts the instance into FAULT and releases the VIP
}

vrrp_instance VI_1 {
    state BACKUP
    interface eth0
    virtual_router_id 51
    priority 90                      # lower priority than the primary node
    advert_int 1

    authentication {
        auth_type PASS
        auth_pass 1111
    }

    virtual_ipaddress {
        192.168.1.200/24 dev eth0
    }

    track_script {
        check_haproxy
    }

    nopreempt                         # non-preemptive mode
}
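
How the two nodes decide who holds the VIP can be summarised as: among the nodes still taking part in VRRP, the highest priority wins, except that in non-preemptive mode a working master keeps the VIP. The sketch below models only that rule. It is an illustration with invented names (`Node`, `elect_master`), and it ignores timers and VRRP's tie-breaking by IP address.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    priority: int
    healthy: bool = True  # False: Keepalived stopped or instance in FAULT


def elect_master(nodes, current=None, nopreempt=True):
    """Return the name of the node that should hold the VIP, or None."""
    candidates = [n for n in nodes if n.healthy]
    if not candidates:
        return None
    # Non-preemptive mode: a master that is still healthy keeps the VIP
    if nopreempt and current is not None and any(n.name == current for n in candidates):
        return current
    return max(candidates, key=lambda n: n.priority).name


if __name__ == "__main__":
    node1, node2 = Node("node1", 100), Node("node2", 90)
    print(elect_master([node1, node2]))                   # first election
    node1.healthy = False
    print(elect_master([node1, node2], current="node1"))  # failover
    node1.healthy = True
    print(elect_master([node1, node2], current="node2"))  # node1 is back
```

The last call shows the practical consequence of `nopreempt`: after the original primary recovers, the VIP stays where it is, which avoids a second interruption but means the "primary" label in the configuration no longer tells you where the VIP lives.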

Health check script

#!/bin/bash
# /etc/keepalived/check_haproxy.sh

# Check that the HAProxy process is alive
# (if Nginx is the load balancer, check nginx instead)
if pgrep -x haproxy > /dev/null; then
    exit 0
else
    exit 1
fi

Notification scripts

#!/bin/bash
# /etc/keepalived/notify_master.sh

LOG_FILE="/var/log/keepalived-notify.log"
VIP="192.168.1.200"

echo "[$(date)] Keepalived entered MASTER state" >> "$LOG_FILE"

# Send an email notification
# mail -s "SQL Admin HA: MASTER node activated" admin@example.com << EOF
# Virtual IP $VIP is now active
# This node has become the master
# EOF

exit 0

#!/bin/bash
# /etc/keepalived/notify_backup.sh

LOG_FILE="/var/log/keepalived-notify.log"

echo "[$(date)] Keepalived entered BACKUP state" >> "$LOG_FILE"

exit 0

#!/bin/bash
# /etc/keepalived/notify_fault.sh

LOG_FILE="/var/log/keepalived-notify.log"

echo "[$(date)] Keepalived entered FAULT state" >> "$LOG_FILE"

# Send an alert email
# mail -s "ALERT: SQL Admin HA failure" admin@example.com << EOF
# Keepalived detected a failure
# The virtual IP may have moved to another node
# Check the service immediately
# EOF

exit 0

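Starting Keepalived

# Make the scripts executable
sudo chmod +x /etc/keepalived/check_haproxy.sh
sudo chmod +x /etc/keepalived/notify_*.sh

# Validate the configuration (newer Keepalived releases only;
# the 1.3.x package on CentOS 7 does not have this option)
sudo keepalived --config-test

# Start the service
sudo systemctl enable keepalived
sudo systemctl start keepalived

# Check whether the VIP is bound on this node
ip addr show eth0 | grep 192.168.1.200

# Check the VRRP state in the logs
sudo journalctl -u keepalived -n 20 --no-pager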

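5.2 VIP failover test

Verify that Keepalived fails over correctly:

Trigger a failover manually

# On the master node, stop Keepalived
sudo systemctl stop keepalived

# Check that the VIP has moved to the standby node
# Run on the standby node
ip addr show eth0 | grep 192.168.1.200
sudo grep -i "VRRP_Script\|vrrp_" /var/log/messages | tail -10

# Bring the original master back
sudo systemctl start keepalived

# Check where the VIP is now (with nopreempt it stays on the standby node)
ip addr show eth0 | grep 192.168.1.200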

Automated failover test

#!/bin/bash
# /opt/scripts/ha-failover-test.sh

VIP="192.168.1.200"
MASTER_NODE="sqladmin-node1"
BACKUP_NODE="sqladmin-node2"

echo "===== SQL Admin HA failover test ====="
echo "Test time: $(date)"
echo ""

# Report which node currently holds the VIP
check_vip() {
    local node
    for node in "$MASTER_NODE" "$BACKUP_NODE"; do
        if ssh "root@$node" "ip addr show eth0 2>/dev/null | grep -q '$VIP'"; then
            echo "VIP is on: $node"
            return 0
        fi
    done
    echo "VIP is not bound on any node"
    return 1
}

# Simulate a failure
echo "1. Simulating a failure on the master node..."
ssh "root@$MASTER_NODE" "systemctl stop keepalived"
echo "   Keepalived stopped"

echo ""
echo "2. Waiting for the VIP to move (10 seconds)..."
sleep 10

echo ""
echo "3. Checking the new VIP location..."
check_vip

echo ""
echo "4. Restoring the master node..."
ssh "root@$MASTER_NODE" "systemctl start keepalived"
sleep 5

echo ""
echo "5. Final check (with nopreempt the VIP stays on the standby node)..."
check_vip

echo ""
echo "===== Test finished ====="
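
The 10-second wait in the script can be checked against the VRRP timers. Under VRRPv2 (RFC 3768) a backup declares the master dead after 3 × advert_int plus a skew time of (256 - priority) / 256 seconds, and a failed load balancer is noticed after interval × fall seconds of the tracked script. The helper names below are invented for this sketch; the values come from the configuration in section 5.1.

```python
def master_down_interval(advert_int: float, backup_priority: int) -> float:
    """Seconds of silence after which a VRRPv2 backup takes over."""
    skew_time = (256 - backup_priority) / 256
    return 3 * advert_int + skew_time


def script_detection_time(interval: float, fall: int) -> float:
    """Seconds until a tracked script is considered failed."""
    return interval * fall


if __name__ == "__main__":
    # advert_int 1, backup priority 90, script interval 2, fall 2
    crash = master_down_interval(1, 90)
    lb_failure = script_detection_time(2, 2)
    print(f"master crash detected after about {crash:.2f} s")
    print(f"load balancer failure detected after about {lb_failure:.0f} s")
```

Both figures are well under 10 seconds, so the wait is sufficient. A clean `systemctl stop keepalived` is usually faster still, because a master that shuts down gracefully announces priority 0 and the backup then waits only for the skew time.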

6. Redis Session Sharing

6.1 Installing Redis

Redis stores SQL Admin sessions centrally so that user sessions survive a failover:

Single-instance Redis installation (all nodes)

# Install Redis
sudo yum install -y redis

# Configure Redis
# (on CentOS Stream 9 the package reads /etc/redis/redis.conf instead)
sudo tee /etc/redis.conf > /dev/null << 'EOF'
bind 0.0.0.0
protected-mode no
port 6379
tcp-backlog 511
timeout 0
tcp-keepalive 300
daemonize no
supervised systemd
pidfile /var/run/redis/redis.pid
loglevel notice
logfile /var/log/redis/redis.log
databases 16

# RDB persistence
save 900 1
save 300 10
save 60 10000
stop-writes-on-bgsave-error yes
rdbcompression yes
rdbchecksum yes
dbfilename dump.rdb
dir /var/lib/redis

# AOF persistence
appendonly yes
appendfilename "appendonly.aof"
appendfsync everysec
no-appendfsync-on-rewrite no
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb

# Memory
maxmemory 1gb
maxmemory-policy allkeys-lru

# Security
requirepass your_redis_password
EOF

# Start Redis
sudo systemctl enable redis
sudo systemctl start redis

# Test the connection
redis-cli -a your_redis_password ping

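Redis Sentinel mode (recommended)

# Sentinel ships with the redis package (redis-sentinel binary and systemd unit);
# there is no separate redis-sentinel package to install
sudo yum install -y redis

# Sentinel needs replication. On Node2 and Node3, add to the Redis configuration:
#   replicaof 192.168.1.101 6379      (slaveof on Redis versions before 5)
# and on every node:
#   masterauth your_redis_password

# Configure Sentinel (all nodes)
sudo tee /etc/redis-sentinel.conf > /dev/null << 'EOF'
port 26379
daemonize no
protected-mode no
pidfile /var/run/redis-sentinel.pid
logfile /var/log/redis/sentinel.log
dir /tmp

# Monitor the master: use its real address (not 127.0.0.1); quorum is 2
sentinel monitor sqladmin-master 192.168.1.101 6379 2
sentinel auth-pass sqladmin-master your_redis_password

# Failover settings
sentinel down-after-milliseconds sqladmin-master 3000
sentinel parallel-syncs sqladmin-master 1
sentinel failover-timeout sqladmin-master 18000

# Notification (the script must exist and be executable before Sentinel starts)
sentinel notification-script sqladmin-master /etc/redis/sentinel-notify.sh
EOF

# Start Sentinel
sudo systemctl enable redis-sentinel
sudo systemctl start redis-sentinel

# Check the Sentinel state
redis-cli -p 26379 sentinel masters
redis-cli -p 26379 sentinel master sqladmin-master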
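
The last argument of `sentinel monitor` is the quorum. Two separate conditions apply: the master is flagged as objectively down once `quorum` Sentinels agree it is unreachable, but a failover only starts if a majority of all Sentinels can be reached to authorise it. The function below is a small sketch of that rule with an invented name; it is a reasoning aid, not Sentinel's implementation.

```python
def sentinel_decision(total_sentinels, quorum, agree_down, reachable_sentinels):
    """Return (objectively_down, can_failover) for one monitored master.

    total_sentinels:     number of Sentinels deployed
    quorum:              value from "sentinel monitor <name> <ip> <port> <quorum>"
    agree_down:          Sentinels that currently see the master as down
    reachable_sentinels: Sentinels that can still talk to each other
    """
    objectively_down = agree_down >= quorum
    majority = total_sentinels // 2 + 1
    can_failover = objectively_down and reachable_sentinels >= majority
    return objectively_down, can_failover
```

With three Sentinels and a quorum of 2, losing one node (and its Sentinel) still allows a failover, which is why an odd number of at least three Sentinels is the usual minimum.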

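6.2 SQL Admin session configuration

Configure SQL Admin to use Redis for shared sessions:

Environment variables

# Use the same Redis connection settings on every SQL Admin node
# (REDIS_HOST must resolve on every node, for example through /etc/hosts)
cat >> ~/.bashrc << 'EOF'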
export SESSION_TYPE=redis
export REDIS_HOST=sqladmin-redis-master
export REDIS_PORT=6379
export REDIS_PASSWORD=your_redis_password
export REDIS_DB=0
export REDIS_PREFIX=sqladmin:
export REDIS_TTL=3600
EOF

source ~/.bashrc
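
As an illustration of how an application might consume these variables, the sketch below builds a Redis connection URL and a namespaced session key from them. How SQL Admin itself reads the variables is not documented here, so treat the function names and the key layout as assumptions made for this example.

```python
import os
from urllib.parse import quote


def redis_url(env=os.environ):
    """Build a redis:// URL from the REDIS_* variables."""
    host = env.get("REDIS_HOST", "localhost")
    port = int(env.get("REDIS_PORT", "6379"))
    db = int(env.get("REDIS_DB", "0"))
    password = env.get("REDIS_PASSWORD", "")
    # Percent-encode the password so characters such as "@" do not break the URL
    auth = f":{quote(password, safe='')}@" if password else ""
    return f"redis://{auth}{host}:{port}/{db}"


def session_key(session_id, env=os.environ):
    """Prefix the session ID so SQL Admin keys do not collide with other data."""
    return env.get("REDIS_PREFIX", "sqladmin:") + session_id


if __name__ == "__main__":
    demo = {"REDIS_HOST": "sqladmin-redis-master", "REDIS_PASSWORD": "p@ss"}
    print(redis_url(demo))
    print(session_key("abc123", demo))
```

Because every node derives the same URL and key prefix, a session created through one node can be read by any other node after a failover, for as long as the key's TTL (`REDIS_TTL`) has not expired.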

Redis in Docker Compose

# docker-compose.yml
version: '3.8'

services:
  redis:
    image: redis:7-alpine
    container_name: redis-master
    restart: always
    command: redis-server --requirepass your_redis_password --appendonly yes
    volumes:
      - redis-data:/data
    networks:
      - sqladmin-net
    healthcheck:
      test: ["CMD", "redis-cli", "-a", "your_redis_password", "ping"]
      interval: 10s
      timeout: 5s
      retries: 5

  redis-sentinel:
    image: redis:7-alpine
    container_name: redis-sentinel
    restart: always
    command: redis-sentinel /usr/local/etc/redis/sentinel.conf
    volumes:
      - ./sentinel.conf:/usr/local/etc/redis/sentinel.conf
    networks:
      - sqladmin-net

  sqladmin:
    image: sqladmin:3.2
    depends_on:
      redis:
        condition: service_healthy
    environment:
      - SESSION_TYPE=redis
      - REDIS_HOST=redis-master
      - REDIS_PORT=6379
      - REDIS_PASSWORD=your_redis_password
    networks:
      - sqladmin-net

volumes:
  redis-data:

networks:
  sqladmin-net:
    driver: bridge

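7. Health Monitoring and Alerting

7.1 Monitoring metrics

A thorough monitoring system is key to keeping the service highly available:

Key metrics

Category | Metric | Threshold | Action
Service availability | SQL Admin process | stopped = alert | Restart automatically
Load balancer | Nginx/HAProxy | down = alert | Trigger failover
VIP state | Keepalived | not master = alert | Check the node
Response time | HTTP requests | > 3 s = alert | Scale out or optimize
Connections | Connections on the service port | > 5000 = alert | Scale out
Disk usage | Disk space | > 80% = alert | Clean up
Memory usage | RAM utilization | > 85% = alert | Tune the configuration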
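
The numeric thresholds in the table translate directly into a rule check. The sketch below is a minimal illustration; the metric names are invented for this example and would need to match whatever your collector actually reports.

```python
# Thresholds from the table above; a value strictly above the limit raises an alert
THRESHOLDS = {
    "response_time_s": 3.0,
    "connections": 5000,
    "disk_percent": 80,
    "memory_percent": 85,
}


def breached(metrics):
    """Return the sorted names of all metrics that exceed their threshold."""
    return sorted(
        name
        for name, limit in THRESHOLDS.items()
        if metrics.get(name, 0) > limit
    )


if __name__ == "__main__":
    sample = {"response_time_s": 3.5, "connections": 1200,
              "disk_percent": 80, "memory_percent": 91}
    print(breached(sample))
```

Metrics missing from the input are treated as zero here, which keeps the example short; in a real check a missing metric should probably raise its own alert.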

7.2 Monitoring with Prometheus and Grafana

Build the monitoring platform with Prometheus and Grafana:

Install Prometheus

# Download Prometheus
cd /opt
sudo wget https://github.com/prometheus/prometheus/releases/download/v2.48.0/prometheus-2.48.0.linux-amd64.tar.gz
sudo tar -xzf prometheus-2.48.0.linux-amd64.tar.gz
sudo mv prometheus-2.48.0.linux-amd64 prometheus
sudo mkdir -p /var/lib/prometheus

# Create the systemd service
sudo tee /etc/systemd/system/prometheus.service > /dev/null << 'EOF'
[Unit]
Description=Prometheus
After=network.target

[Service]
Type=simple
User=root
ExecStart=/opt/prometheus/prometheus --config.file=/opt/prometheus/prometheus.yml --storage.tsdb.path=/var/lib/prometheus
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable prometheus
sudo systemctl start prometheus

Prometheus configuration

# /opt/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'sqladmin-nodes'
    static_configs:
      - targets:
          - sqladmin-node1:8080
          - sqladmin-node2:8080
          - sqladmin-node3:8080
    metrics_path: '/metrics'

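  # node_exporter (port 9100) reports host metrics for the load balancer nodes.
  # Keepalived has no built-in exporter; add a job for one if you deploy it.
  - job_name: 'node'
    static_configs:
      - targets:
          - sqladmin-node1:9100
          - sqladmin-node2:9100

  # nginx-prometheus-exporter listens on port 9113 by default
  - job_name: 'nginx'
    static_configs:
      - targets:
          - sqladmin-node1:9113
          - sqladmin-node2:9113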

  - job_name: 'redis'
    static_configs:
      - targets:
          - sqladmin-node1:9121
          - sqladmin-node2:9121
          - sqladmin-node3:9121

7.3 Alerting script

Set up automated alerting:

#!/usr/bin/env python3
"""
SQL Admin high-availability monitoring and alerting script
"""
import os
import sys
import subprocess
import smtplib
from email.mime.text import MIMEText
from datetime import datetime

class HAAlarm:
    def __init__(self):
        self.alarm_rules = {
            'sqladmin_process': {'check': self.check_sqladmin_process, 'threshold': 0},
            'nginx_status': {'check': self.check_nginx_status, 'threshold': 0},
            'keepalived_vip': {'check': self.check_keepalived_vip, 'threshold': 0},
            'disk_usage': {'check': self.check_disk_usage, 'threshold': 80},
            'memory_usage': {'check': self.check_memory_usage, 'threshold': 85},
        }

    def check_sqladmin_process(self):
        """Check that a SQL Admin process is running."""
        try:
            result = subprocess.run(
                ['pgrep', '-f', 'sqladmin'],
                capture_output=True, text=True
            )
            # pgrep exits with 0 only when at least one process matched
            return result.returncode == 0
        except Exception:
            return False

    def check_nginx_status(self):
        """Check that the Nginx service is active."""
        try:
            result = subprocess.run(
                ['systemctl', 'is-active', 'nginx'],
                capture_output=True, text=True
            )
            # Compare exactly: "inactive" also contains the substring "active"
            return result.stdout.strip() == 'active'
        except Exception:
            return False

    def check_keepalived_vip(self):
        """Check that the VIP is bound on this node (expected on the master only)."""
        try:
            result = subprocess.run(
                ['ip', 'addr', 'show'],
                capture_output=True, text=True
            )
            return '192.168.1.200' in result.stdout
        except Exception:
            return False

    def check_disk_usage(self):
        """Return the usage of the root filesystem in percent."""
        try:
            result = subprocess.run(
                ['df', '-P', '/'],
                capture_output=True, text=True
            )
            lines = result.stdout.strip().split('\n')
            if len(lines) > 1:
                usage = lines[1].split()[4]
                return int(usage.rstrip('%'))
            return 0
        except Exception:
            return 0

    def check_memory_usage(self):
        """Return the RAM usage in percent."""
        try:
            result = subprocess.run(
                ['free', '-m'],
                capture_output=True, text=True
            )
            lines = result.stdout.strip().split('\n')
            if len(lines) > 1:
                # lines[0] is the header, lines[1] is "Mem:", lines[2] is "Swap:"
                mem_line = lines[1].split()
                total = int(mem_line[1])
                used = int(mem_line[2])
                return int(used / total * 100)
            return 0
        except Exception:
            return 0

    def send_alarm(self, alarm_type, message):
        """Send an alert."""
        print(f"[ALARM] {alarm_type}: {message}")

        # Email alert
        try:
            smtp_server = os.environ.get('SMTP_SERVER', 'localhost')
            smtp_port = int(os.environ.get('SMTP_PORT', 25))
            smtp_user = os.environ.get('SMTP_USER', '')
            smtp_password = os.environ.get('SMTP_PASSWORD', '')

            msg = MIMEText(f"""
SQL Admin high-availability alert

Alert type: {alarm_type}
Alert time: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}
Details: {message}

Please take action promptly.
""")

            msg['Subject'] = f'[ALERT] SQL Admin HA - {alarm_type}'
            # Fall back to a placeholder sender when no SMTP user is configured
            msg['From'] = smtp_user or 'ha-monitor@localhost'
            msg['To'] = os.environ.get('ALARM_EMAIL', 'admin@example.com')

            with smtplib.SMTP(smtp_server, smtp_port) as server:
                if smtp_user:
                    server.login(smtp_user, smtp_password)
                server.send_message(msg)

            print("[ALARM] Email alert sent")
        except Exception as e:
            print(f"[ALARM] Failed to send email: {str(e)}")

    def run_check(self):
        """Run all checks and return True when everything is healthy."""
        print(f"\n===== SQL Admin HA health check - {datetime.now().strftime('%Y-%m-%d %H:%M:%S')} =====\n")

        all_ok = True

        for check_name, config in self.alarm_rules.items():
            try:
                if check_name in ['disk_usage', 'memory_usage']:
                    # Numeric checks: compare the measured value with the threshold
                    value = config['check']()
                    status = value <= config['threshold']
                    status_str = f"OK ({value}%)" if status else f"ALERT ({value}%)"
                else:
                    # Boolean checks: True means healthy
                    status = config['check']()
                    status_str = "OK" if status else "ALERT"

                status_icon = "✅" if status else "❌"
                print(f"{status_icon} {check_name}: {status_str}")

                if not status:
                    all_ok = False
                    self.send_alarm(check_name, status_str)

            except Exception as e:
                print(f"❌ {check_name}: check failed - {str(e)}")
                all_ok = False

        print(f"\nOverall status: {'✅ healthy' if all_ok else '❌ degraded'}")
        return all_ok

if __name__ == '__main__':
    alarm = HAAlarm()
    ok = alarm.run_check()
    sys.exit(0 if ok else 1)

Scheduled execution

# Add to /etc/crontab (every 5 minutes; entries in /etc/crontab need a user field)
echo "*/5 * * * * root /usr/bin/python3 /opt/scripts/ha_monitor.py >> /var/log/ha_monitor.log 2>&1" | sudo tee -a /etc/crontab

# Restart the cron service
sudo systemctl restart crond

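8. Failover Testing

8.1 Failure scenarios

Regular failover drills are the only reliable way to verify high availability:

Test scenarios

Scenario | Action | Expected result | Verification
Single node failure | Stop the SQL Admin process | Load balancer removes the node; requests go to the remaining nodes | Check service availability
Load balancer failure | Stop Nginx/HAProxy | VIP moves; standby node takes over | Access the service through the VIP
Keepalived failure | Stop Keepalived | VIP switches automatically | Check where the VIP is
Network partition | Disconnect the network | Standby node takes over | ping the VIP
Shared storage failure | Stop NFS | Local cache mode | Verify data consistency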

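Failover test script

#!/bin/bash
# /opt/scripts/ha-failover-comprehensive-test.sh

LOG_FILE="/var/log/ha-failover-test.log"
VIP="192.168.1.200"
NODES=("sqladmin-node1" "sqladmin-node2" "sqladmin-node3")
# Must be the service tracked by the Keepalived check script (haproxy or nginx)
LB_SERVICE="haproxy"

log() {
    echo "[$(date '+%Y-%m-%d %H:%M:%S')] $1" | tee -a "$LOG_FILE"
}

# Print the name of the node that currently holds the VIP
check_vip() {
    local node
    for node in "${NODES[@]}"; do
        if ssh "root@$node" "ip addr show eth0 2>/dev/null | grep -q '$VIP'"; then
            echo "$node"
            return 0
        fi
    done
    echo "none"
    return 1
}

# Scenario 1: SQL Admin process failure
log "===== Scenario 1: SQL Admin process failure ====="
ssh "root@${NODES[0]}" "docker stop sqladmin"
log "SQL Admin container stopped"
sleep 15
result=$(check_vip)
log "VIP is on: $result"
# The VIP should stay put; the load balancer should route around the stopped node
http_code=$(curl -ks -o /dev/null -w '%{http_code}' "https://$VIP/health")
if [ "$result" != "none" ] && [ "$http_code" == "200" ]; then
    log "✅ Scenario 1 passed"
else
    log "❌ Scenario 1 failed (HTTP $http_code)"
fi

# Restore
ssh "root@${NODES[0]}" "docker start sqladmin"
sleep 10

# Scenario 2: load balancer failure on the node that holds the VIP
log "===== Scenario 2: $LB_SERVICE failure ====="
holder=$(check_vip)
ssh "root@$holder" "systemctl stop $LB_SERVICE"
log "$LB_SERVICE stopped on $holder"
sleep 10
result=$(check_vip)
log "VIP is on: $result"
if [ "$result" != "none" ] && [ "$result" != "$holder" ]; then
    log "✅ Scenario 2 passed (VIP moved)"
else
    log "⚠️ Scenario 2: VIP did not move; check the Keepalived health check for $LB_SERVICE"
fi

# Restore
ssh "root@$holder" "systemctl start $LB_SERVICE"
sleep 10

# Scenario 3: Keepalived failure on the node that holds the VIP
log "===== Scenario 3: Keepalived failure ====="
holder=$(check_vip)
ssh "root@$holder" "systemctl stop keepalived"
log "Keepalived stopped on $holder"
sleep 5
result=$(check_vip)
log "VIP is on: $result"
if [ "$result" != "none" ] && [ "$result" != "$holder" ]; then
    log "✅ Scenario 3 passed (VIP moved to another node)"
else
    log "❌ Scenario 3 failed"
fi

# Restore
ssh "root@$holder" "systemctl start keepalived"
sleep 10

log "===== All tests finished ====="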

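9. Disaster Recovery Plan

9.1 RTO and RPO targets

Define explicit recovery targets:

Tier | RTO | RPO | Description
Core services | ≤ 15 minutes | ≤ 5 minutes | Critical database administration
General services | ≤ 1 hour | ≤ 1 hour | Internal systems
Test environments | ≤ 4 hours | ≤ 4 hours | Development and testing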
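
To compare a failover drill against these targets, record periodic probe results during the test and compute the longest outage. The sketch below does that for a list of timestamped probes; the function names and tier keys are invented for this example, and the RTO values are the ones from the table.

```python
# RTO targets from the table, in seconds
RTO_SECONDS = {"core": 15 * 60, "general": 60 * 60, "test": 4 * 60 * 60}


def longest_outage(samples):
    """Return the longest outage in seconds.

    samples: list of (timestamp_seconds, ok) tuples sorted by time, where
    ok is True when the probe (for example an HTTP request to the VIP) succeeded.
    """
    longest = 0.0
    outage_start = None
    for ts, ok in samples:
        if not ok and outage_start is None:
            outage_start = ts  # first failed probe of a new outage
        elif ok and outage_start is not None:
            longest = max(longest, ts - outage_start)
            outage_start = None
    # An outage still open at the end counts up to the last sample
    if outage_start is not None:
        longest = max(longest, samples[-1][0] - outage_start)
    return longest


def meets_rto(samples, tier="core"):
    return longest_outage(samples) <= RTO_SECONDS[tier]
```

The measured outage is only as precise as the probe interval, so probe at least every few seconds when validating a 15-minute RTO and more often if you also want to confirm the failover times estimated in section 5.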

9.2 Backup strategy

Configuration backup

#!/bin/bash
# /opt/scripts/backup-ha-config.sh

BACKUP_DIR="/backup/ha-config"
DATE=$(date +%Y%m%d_%H%M%S)

mkdir -p "$BACKUP_DIR"

# Back up configuration files (paths that do not exist on this node are skipped)
tar czf "$BACKUP_DIR/sqladmin-config-$DATE.tar.gz" \
    /opt/sqladmin/conf \
    /etc/nginx/conf.d \
    /etc/haproxy/haproxy.cfg \
    /etc/keepalived \
    /etc/redis.conf \
    /etc/redis-sentinel.conf \
    /opt/prometheus/prometheus.yml \
    2>/dev/null

# Back up the Docker Compose file
cp /opt/docker-compose.yml "$BACKUP_DIR/docker-compose-$DATE.yml"

# Keep the last 30 days of backups
find "$BACKUP_DIR" \( -name "*.tar.gz" -o -name "docker-compose-*.yml" \) -mtime +30 -delete

echo "Configuration backup finished: $BACKUP_DIR/sqladmin-config-$DATE.tar.gz"

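10. Best Practices

10.1 Architecture principles

  1. Eliminate single points of failure: every component needs redundancy so that one failure cannot take down the whole service
  2. Fail over automatically: prefer automatic switchover to reduce manual intervention time and human error
  3. Monitor everything: cover the full stack, from infrastructure to the application layer
  4. Drill regularly: run a complete failover drill at least once per quarter
  5. Keep documentation current: record the architecture, configuration parameters, and incident procedures

10.2 Operations checklist

Routine checks

Check | Frequency | Owner | Notes
VIP state | Hourly | Automated monitoring | Alert
Service processes | Every 5 minutes | Automated monitoring | Alert
Disk space | Daily | Scheduled job | Clean up
Log review | Weekly | Operations staff | Analyze anomalies
SSL certificates | Monthly | Operations staff | Expiry reminder
Backup verification | Monthly | Operations staff | Restore test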
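
The monthly certificate check is easy to automate. The sketch below converts a certificate's `notAfter` string, in the format returned by Python's `ssl.SSLSocket.getpeercert()`, into the number of days left. The function name and the 30-day warning threshold are choices made for this example.

```python
import ssl
import time
from typing import Optional

WARN_DAYS = 30  # assumed reminder threshold for this example


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days until a certificate expires.

    not_after: expiry time such as "Jan  1 00:00:00 2030 GMT"
    now:       current time as a Unix timestamp (defaults to the system clock)
    """
    expires = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires - current) / 86400


def needs_renewal(not_after: str, now: Optional[float] = None) -> bool:
    return days_until_expiry(not_after, now) <= WARN_DAYS
```

To check the live certificate, open a TLS connection to `sqladmin.example.com:443` with `ssl.create_default_context().wrap_socket(...)`, read `getpeercert()["notAfter"]`, and pass it to `needs_renewal`.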

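Following the configuration steps in this guide, you can build a stable and reliable SQL Admin cluster on CentOS and keep the database administration service continuously available.

This article is based on CentOS Stream 9 and SQL Admin 3.2 and covers the design and implementation of SQL Admin high-availability architectures on CentOS.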
