朋友的数据库,做了主从replication复制。在slave实例上,SQL thread的长时间处于Opening tables状态,复制进程异常。
  
 
整个实例大概20个database,总共300G左右。
master是5.5版本,slave是5.6版本,master上执行xtrabackup全库备份后搭建的slave。
我的第一反应是table cache是不是太小了,导致open table比较慢,所以才长时间处于这个状态。无论如何,先一层层排查吧。
先看下slave status(部分无用信息我隐掉了):
mysql> show slave status/G
*************************** 1. row ***************************
Slave_IO_State: Queueing master event to the relay log
Master_Log_File: master-bin.000618
Read_Master_Log_Pos: 614915856
Relay_Log_File: replicate.000008
Relay_Log_Pos: 2384117
Relay_Master_Log_File: master-bin.000617
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 252048331
Relay_Log_Space: 1438994074
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Seconds_Behind_Master: 59240
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 1
Master_UUID:
Master_Info_File: /home/mysql/master.info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Opening tables
Master_Retry_Count: 86400
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
Master_SSL_Crlpath:
Retrieved_Gtid_Set:
Executed_Gtid_Set:
Auto_Position: 0
看不出来有什么异常的。
再看下系统负载情况:
[root@localhost mysql]# vmstat -S m 1
procs ———–memory———- —swap– —–io—- –system– —–cpu—–
r b swpd free buff cache si so bi bo in cs us sy id wa st
0 2 1 19591 458 230576 0 0 11 20 0 0 0 0 100 0 0
1 1 1 19587 458 230579 0 0 2032 2528 1645 584 1 1 93 4 0
1 1 1 19583 458 230582 0 0 1664 2712 1773 461 1 1 93 4 0
0 2 1 19578 458 230585 0 0 2080 3376 1810 660 1 1 93 4 0
2 0 1 19576 458 230587 0 0 2224 1804 1634 594 1 1 94 4 0
3 1 1 19569 458 230590 0 0 1968 3488 1693 566 1 1 93 4 0
1 1 1 19567 458 230593 0 0 2016 2632 1775 515 1 1 93 4 0
[root@localhost mysql]# sar -d 1
Linux 2.6.32-431.el6.x86_64 (localhost.localdomain) 03/21/2015 _x86_64_ (24 CPU)
03:21:57 PM DEV tps rd_sec/s wr_sec/s avgrq-sz avgqu-sz await svctm %util
03:21:58 PM dev8-0 185.86 5753.54 64.65 31.30 1.20 6.48 5.08 94.44
03:21:58 PM DEV tps rd_sec/s wr_sec/s avgrq-sz avgqu-sz await svctm %util
03:21:59 PM dev8-0 197.03 6114.85 95.05 31.52 1.16 5.88 4.48 88.32
03:21:59 PM DEV tps rd_sec/s wr_sec/s avgrq-sz avgqu-sz await svctm %util
03:22:00 PM dev8-0 188.89 5882.83 88.89 31.61 1.14 6.03 5.09 96.16
03:22:00 PM DEV tps rd_sec/s wr_sec/s avgrq-sz avgqu-sz await svctm %util
03:22:01 PM dev8-0 166.00 5152.00 80.00 31.52 1.42 8.56 5.58 92.70
也看不出来异常,继续看看mysql的日志吧:
2015-04-21 14:51:04 18667 [ERROR] Native table ‘performance_schema’.’objects_summary_global_by_type’ has the wrong structure
2015-04-21 14:51:04 18667 [ERROR] Incorrect definition of table performance_schema.rwlock_instances: expected column ‘WRITE_LOCKED_BY_THREAD_ID’ at position 2 to have type bigint(20), found type int(11).
2015-04-21 14:51:04 18667 [ERROR] Native table ‘performance_schema’.’setup_actors’ has the wrong structure
2015-04-21 14:51:04 18667 [ERROR] Native table ‘performance_schema’.’setup_objects’ has the wrong structure
2015-04-21 14:51:04 18667 [ERROR] Native table ‘performance_schema’.’table_io_waits_summary_by_index_usage’ has the wrong structure
2015-04-21 14:51:04 18667 [ERROR] Native table ‘performance_schema’.’table_io_waits_summary_by_table’ has the wrong structure
2015-04-21 14:51:04 18667 [ERROR] Native table ‘performance_schema’.’table_lock_waits_summary_by_table’ has the wrong structure
2015-04-21 14:51:04 18667 [ERROR] Column count of mysql.threads is wrong. Expected 14, found 3. Created with MySQL 50524, now running 50623. Please use mysql_upgrade to fix this error.
2015-04-21 14:51:04 18667 [ERROR] Native table ‘performance_schema’.’events_stages_current’ has the wrong structure
2015-04-21 14:51:04 18667 [ERROR] Native table ‘performance_schema’.’events_stages_history’ has the wrong structure
2015-04-21 14:51:04 18667 [ERROR] Native table ‘performance_schema’.’events_stages_history_long’ has the wrong structure
…
…
2015-04-21 15:05:17 7f5997fff700 InnoDB: Error: Table “mysql”.”innodb_table_stats” not found.
2015-04-21 15:05:17 7f5997fff700 InnoDB: Error: Fetch of persistent statistics requested for table “db_anonymous_info_10”.”t_friend_info_3″ but the required system tables mysql.innodb_table_stats and mysql.innodb_index_stats are not present or have unexpected structure. Using transient stats instead.
2015-04-21 15:05:17 7f7beb0cd700 InnoDB: Error: Table “mysql”.”innodb_table_stats” not found.
2015-04-21 15:05:17 7f7beb0cd700 InnoDB: Error: Fetch of persistent statistics requested for table “db_user_trade_1”.”t_trade_15″ but the required system tables mysql.innodb_table_stats and mysql.innodb_index_stats are not present or have unexpected structure. Using transient stats instead.
可以看到,这个实例中,P_S(performanc_schema)库的几个表结构都异常了。另外,mysql.innodb_table_stats表也异常了,导致InnoDB表数据收集异常,然后SQL thread也跟着异常了。
造成上面问题主要原因是,没有用正确的方式来部署slave实例。从master上用xtrabackup物理备份到slave,启动实例后,应该再执行 mysql_upgrade 升级相关表结构,确保P_S(performanc_schema)、I_S(information_schema)以及 mysql 等几个系统库表结构都升级到最新版本。