Contents
- Preface
- I. Data volume in the problematic environment
- II. Reproducing the problem by running the SQL manually
- III. Solutions
- 1. Enable batch commit (an unsafe, non-transactional execution path)
- 2. Try to find a way to disable transactions
- 3. Delete in batches
- Summary
Preface
In a version of the project released this year, we recently found that a statistics chart was missing data for the most recent hour, and it happened every day. After it was reported I made a fix: some troubleshooting finally located the problem, the patch was verified locally and in the test environment, and it was shipped. I thought that was the end of it. Unexpectedly, that patch triggered an earlier data-accumulation problem, and the customer soon reported that the data had disappeared entirely. When I first heard the report, and even after remoting into the customer environment, I was baffled: how could the data collection and statistics tasks suddenly stop working?
Once on site, the first step was to decide how to narrow the problem down. Based on the data flow of this feature, the chart queries a statistics table, and that statistics table is aggregated from a raw collection table, so either the collection thread or the statistics thread had gone wrong. From the logs and Arthas it was easy to confirm that the collection thread was running normally, so the problem had to be in the statistics thread. Tracing the code led straight to the issue described in my previous post: a ScheduledExecutorService scheduled task that stops running (a minimal sketch of that failure mode is shown below).
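To recap that behavior with a minimal, self-contained sketch (illustration only, not the project code): with ScheduledExecutorService.scheduleAtFixedRate, an uncaught exception in one run suppresses all subsequent runs, so the task body has to catch and log its own exceptions.

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class ScheduledTaskDemo {
    public static void main(String[] args) {
        ScheduledExecutorService pool = Executors.newScheduledThreadPool(1);
        // If the exception below escaped run(), the ScheduledFuture would complete
        // exceptionally and no further executions would be scheduled.
        pool.scheduleAtFixedRate(() -> {
            try {
                aggregateAndCleanUp(); // placeholder for the real statistics/cleanup work
            } catch (Exception e) {
                // Catch and log so a single failed run does not stop the schedule.
                System.err.println("scheduled task failed: " + e);
            }
        }, 0, 1, TimeUnit.MINUTES);
    }

    private static void aggregateAndCleanUp() {
        // placeholder body
    }
}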
It was confirmed that an exception thrown during one execution of the scheduled task had suppressed all of its subsequent executions, which is why no statistics appeared on the page. The task actually does only two things per run: first, delete expired data from the raw collection table so it does not pile up; second, aggregate the recent data into the next time dimension for the charts. The problem had to be in one of those two steps, but since exceptions were not being caught before, the exact cause could not be confirmed. I suspected the delete was at fault but had no evidence, so the plan was simply to add exception handling and hand the patch over for testing. Fortunately, the environment used to verify the patch reproduced the problem at the end of the day and printed this log:
2023-01-05 18:31:37:902 [pool-12-thread-1] ERROR [xxx-monitor,xxx-monitor-59b7547c9f-7lcrr,,,,13] com.xxx.xxx.monitor.port.service.statistic.InfFlowStatistic.aggregateOneMinuteData:296 - aggregate minute statistic error:
org.springframework.jdbc.UncategorizedSQLException:
### Error updating database. Cause: java.sql.SQLException: Transaction is too large, size: 104857600
### The error may exist in com/xxx/xxx/monitor/port/service/statistic/dao/IInterfaceCountInfoDao.java (best guess)
### The error may involve com.xxx.xxx.monitor.port.service.statistic.dao.IInterfaceCountInfoDao.clearFixTimeAndTDBTableData-Inline
### The error occurred while setting parameters
### SQL: delete from xxx_interface_count_info where time_millisecond between 1641376915106 and 1672912915105
### Cause: java.sql.SQLException: Transaction is too large, size: 104857600
; uncategorized SQLException; SQL state [HY000]; error code [8004]; Transaction is too large, size: 104857600; nested exception is java.sql.SQLException: Transaction is too large, size: 104857600
	at org.springframework.jdbc.support.AbstractFallbackSQLExceptionTranslator.translate(AbstractFallbackSQLExceptionTranslator.java:89)
	at org.springframework.jdbc.support.AbstractFallbackSQLExceptionTranslator.translate(AbstractFallbackSQLExceptionTranslator.java:81)
	at org.springframework.jdbc.support.AbstractFallbackSQLExceptionTranslator.translate(AbstractFallbackSQLExceptionTranslator.java:81)
	at org.mybatis.spring.MyBatisExceptionTranslator.translateExceptionIfPossible(MyBatisExceptionTranslator.java:88)
	at org.mybatis.spring.SqlSessionTemplate$SqlSessionInterceptor.invoke(SqlSessionTemplate.java:440)
	at com.sun.proxy.$Proxy135.delete(Unknown Source)
	at org.mybatis.spring.SqlSessionTemplate.delete(SqlSessionTemplate.java:303)
	at org.apache.ibatis.binding.MapperMethod.execute(MapperMethod.java:72)
	at org.apache.ibatis.binding.MapperProxy$PlaixxxethodInvoker.invoke(MapperProxy.java:152)
	at org.apache.ibatis.binding.MapperProxy.invoke(MapperProxy.java:85)
	at com.sun.proxy.$Proxy146.clearFixTimeAndTDBTableData(Unknown Source)
	at com.xxx.xxx.monitor.port.service.statistic.InfFlowStatistic.clearFixTimeAndTDBTableData(InfFlowStatistic.java:308)
	at com.xxx.xxx.monitor.port.service.statistic.InfFlowStatistic.aggregateOneMinuteData(InfFlowStatistic.java:269)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:855)
Caused by: java.sql.SQLException: Transaction is too large, size: 104857600
	at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:129)
	at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:97)
	at com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:122)
	at com.mysql.cj.jdbc.ClientPreparedStatement.executeInternal(ClientPreparedStatement.java:953)
	at com.mysql.cj.jdbc.ClientPreparedStatement.execute(ClientPreparedStatement.java:370)
	at com.zaxxer.hikari.pool.ProxyPreparedStatement.execute(ProxyPreparedStatement.java:44)
	at com.zaxxer.hikari.pool.HikariProxyPreparedStatement.execute(HikariProxyPreparedStatement.java)
	at org.apache.ibatis.executor.statement.PreparedStatementHandler.update(PreparedStatementHandler.java:47)
	at org.apache.ibatis.executor.statement.RoutingStatementHandler.update(RoutingStatementHandler.java:74)
	at org.apache.ibatis.executor.SimpleExecutor.doUpdate(SimpleExecutor.java:50)
	at org.apache.ibatis.executor.BaseExecutor.update(BaseExecutor.java:117)
	at org.apache.ibatis.executor.CachingExecutor.update(CachingExecutor.java:76)
	at sun.reflect.GeneratedMethodAccessor295.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.ibatis.plugin.Plugin.invoke(Plugin.java:63)
	at com.sun.proxy.$Proxy183.update(Unknown Source)
	at sun.reflect.GeneratedMethodAccessor295.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.ibatis.plugin.Invocation.proceed(Invocation.java:49)
	at com.xxx.xxx.common.db.AutoIdInterceptor.intercept(AutoIdInterceptor.java:83)
	at org.apache.ibatis.plugin.Plugin.invoke(Plugin.java:61)
	at com.sun.proxy.$Proxy183.update(Unknown Source)
	at org.apache.ibatis.session.defaults.DefaultSqlSession.update(DefaultSqlSession.java:197)
	at org.apache.ibatis.session.defaults.DefaultSqlSession.delete(DefaultSqlSession.java:212)
	at sun.reflect.GeneratedMethodAccessor298.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.mybatis.spring.SqlSessionTemplate$SqlSessionInterceptor.invoke(SqlSessionTemplate.java:426)
	... 15 common frames omitted
The root cause is java.sql.SQLException: Transaction is too large, size: 104857600: the task failed because the delete touched too much data, so the TiDB transaction exceeded the configured limit (104857600 bytes is the 100 MiB default of TiDB's txn-total-size-limit setting). This happened because the released version had a delete problem that let data pile up; to clean up the expired data, the previous patch enlarged the delete range, and the resulting oversized transaction made the task fail. I had not anticipated this, which is worth reflecting on. Another lesson: although I vaguely suspected that the large data volume was making the delete fail, I never actually ran the delete SQL in a performance environment to verify it. If the patch-verification environment had not happened to reproduce the problem, the root cause would still be unsolved. Make bold hypotheses, but verify them carefully.
The next questions, then, were how to restore the environment and how to fix the problem for good.
I. Data volume in the problematic environment
The data volume was already at the level of tens of millions of rows.
II. Reproducing the problem by running the SQL manually
Here the statement deletes roughly the last year of data.
III. Solutions
According to blog posts and the official TiDB site, there are two commonly used solutions:
References:
Handling large data deletion in TiDB
TiDB Best Practice
On the importance of TiDB development standards, starting from a simple DELETE scenario
1. Enable batch commit (an unsafe, non-transactional execution path)
However, following the batch-delete mode suggested online, the statement still failed:
set tidb_batch_delete=1;
delete from nm_interface_count_info where time_millisecond between 1643087626208 and 1674623626208;
set tidb_batch_delete=0;
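For completeness, here is a hypothetical sketch of the same attempt from application code; it is an illustration under assumptions, not the project's code (dataSource, the batch size of 5000 and the hard-coded time range are invented for the example). Session variables such as tidb_batch_delete are per-connection, so all statements must run on the same connection, and my understanding is that this legacy batch-dml path only takes effect in autocommit mode with a non-zero tidb_dml_batch_size, which may be why the plain attempt above still failed.

import java.sql.Connection;
import java.sql.Statement;
import javax.sql.DataSource;

public class BatchDmlAttempt {
    // Hypothetical helper: runs the session-variable changes and the DELETE
    // on one and the same connection.
    public static int deleteWithBatchDml(DataSource dataSource) throws Exception {
        try (Connection conn = dataSource.getConnection();
             Statement stmt = conn.createStatement()) {
            conn.setAutoCommit(true); // the legacy batch-dml path is said to require autocommit
            stmt.execute("set tidb_batch_delete = 1");
            stmt.execute("set tidb_dml_batch_size = 5000"); // assumed batch size
            int deleted = stmt.executeUpdate(
                "delete from nm_interface_count_info"
                    + " where time_millisecond between 1643087626208 and 1674623626208");
            stmt.execute("set tidb_batch_delete = 0");
            return deleted;
        }
    }
}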
2. Try to find a way to disable transactions
Unfortunately, the official documentation shows that the TiDB version used by the project does not support non-transactional DML; that feature is only available from v6.1 onward. See the batch-delete example in the official docs.
Syntax for a non-transactional delete:
BATCH ON id LIMIT 2 DELETE FROM t where v < 6;
TiDB version used by the project:
mysql> select tidb_version();
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| tidb_version() |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Release Version: v4.0.0
Edition: Community
Git Commit Hash: 689a6b6439ae7835947fcaccf329a3fc303986cb
Git Branch: heads/refs/tags/v4.0.0
UTC Build Time: 2020-05-28 01:37:40
GoVersion: go1.13
Race Enabled: false
TiKV Min Version: v3.0.0-60965b006877ca7234adaced7890d7b029ed1306
Check Table Before Drop: false |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)
With no other option, the only way forward was to follow the recommendation and clean up with batched deletes.
3. Delete in batches
This implementation follows the recommendation in TiDB Best Practice. However, if each pass deletes too much, it leaves behind too many delete markers (old MVCC versions not yet garbage-collected), which makes subsequent deletes slower and slower; refining the time range into smaller intervals avoids this problem.
// Implementation (the delete range is not yet refined into smaller intervals here)
private void clearXxxData(String table) {
    long currentTimeMillis = System.currentTimeMillis();
    int count = 1;
    int effectRowsTotal = 0;
    // First pass: delete at most BATCH_DEL_NUM rows.
    int effectRows = this.xxxDao.clearXxxTableData(table, BATCH_DEL_NUM);
    effectRowsTotal += effectRows;
    logger.info("cleanup pass 1, table:{}, effectRows:{}", table, effectRows);
    // Keep deleting while a full batch was affected, i.e. there may still be expired rows left.
    while (effectRows >= BATCH_DEL_NUM) {
        effectRows = this.xxxDao.clearXxxTableData(table, BATCH_DEL_NUM);
        effectRowsTotal += effectRows;
        ++count;
        logger.warn("cleanup pass {}, table:{}, effectRows:{}", count, table, effectRows);
    }
    logger.info("cleanup finished, table:{} effectRowsTotal:{} timeUse:{} ms",
        table, effectRowsTotal, System.currentTimeMillis() - currentTimeMillis);
}
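For reference, the mapper behind such a batched delete could look roughly like the sketch below. This is a hypothetical MyBatis annotation mapper, not the project's actual IInterfaceCountInfoDao: the name deleteExpiredBatch, the explicit endTime parameter and the DELETE ... LIMIT condition are assumptions made to show the idea. Note that ${table} is spliced into the SQL text (table names cannot be bound as prepared-statement parameters), so it must come from a fixed whitelist.

import org.apache.ibatis.annotations.Delete;
import org.apache.ibatis.annotations.Param;

public interface XxxCleanupDao {
    // Hypothetical sketch: delete at most #{batchSize} expired rows per call so
    // that each statement stays far below TiDB's transaction size limit.
    @Delete("DELETE FROM ${table} WHERE time_millisecond < #{endTime} LIMIT #{batchSize}")
    int deleteExpiredBatch(@Param("table") String table,
                           @Param("endTime") long endTime,
                           @Param("batchSize") int batchSize);
}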
After making the change above and verifying it in the performance environment, the delete was indeed still slow, but the functional problem appeared to be fixed. I then tried to improve efficiency by tuning the SQL. The original statement bounded the range with BETWEEN; based on the EXPLAIN output below I assumed that adding an ORDER BY would make the delete faster, but in practice it did not.
mysql> explain delete from xxx_table where time_millisecond between 1643087626208 and 1674623626208;
+---------------------------+-----------+-----------+-------------------------------+--------------------------------------------------------------------------------------------------------------------------------+
| id | estRows | task | access object | operator info |
+---------------------------+-----------+-----------+-------------------------------+--------------------------------------------------------------------------------------------------------------------------------+
| Delete_4 | N/A | root | | N/A |
| └─TableReader_8 | 428191.00 | root | | data:Selection_7 |
| └─Selection_7 | 428191.00 | cop[tikv] | | ge(nm.xxx_table.time_millisecond, 1643087626208), le(nm.xxx_table.time_millisecond, 1674623626208) |
| └─TableFullScan_6 | 428191.00 | cop[tikv] | table:xxx_table | keep order:false |
+---------------------------+-----------+-----------+-------------------------------+--------------------------------------------------------------------------------------------------------------------------------+
4 rows in set (0.00 sec)
mysql> explain delete from xxx_table where time_millisecond between 1643087626208 and 1674623626208 order by time_millisecond asc limit 10000;
+----------------------------------+----------+-----------+----------------------------------------------------------------------------+------------------------------------------------------+
| id | estRows | task | access object | operator info |
+----------------------------------+----------+-----------+----------------------------------------------------------------------------+------------------------------------------------------+
| Delete_6 | N/A | root | | N/A |
| └─IndexLookUp_32 | 10000.00 | root | | limit embedded(offset:0, count:10000) |
| ├─Limit_31(Build) | 10000.00 | cop[tikv] | | offset:0, count:10000 |
| │ └─IndexRangeScan_29 | 10000.00 | cop[tikv] | table:xxx_table, index:idx_timemillisecond(time_millisecond) | range:[1643087626208,1674623626208], keep order:true |
| └─TableRowIDScan_30(Probe) | 10000.00 | cop[tikv] | table:xxx_table | keep order:false, stats:pseudo |
+----------------------------------+----------+-----------+----------------------------------------------------------------------------+------------------------------------------------------+
5 rows in set (0.00 sec)
mysql> explain delete from xxx_table where time_millisecond < 1674623626208;
+---------------------------+-----------+-----------+-------------------------------+----------------------------------------------------------------+
| id | estRows | task | access object | operator info |
+---------------------------+-----------+-----------+-------------------------------+----------------------------------------------------------------+
| Delete_4 | N/A | root | | N/A |
| └─TableReader_8 | 441570.00 | root | | data:Selection_7 |
| └─Selection_7 | 441570.00 | cop[tikv] | | lt(nm.xxx_table.time_millisecond, 1674623626208) |
| └─TableFullScan_6 | 441570.00 | cop[tikv] | table:xxx_table | keep order:false |
+---------------------------+-----------+-----------+-------------------------------+----------------------------------------------------------------+
4 rows in set (0.00 sec)
mysql> explain delete from xxx_table where time_millisecond < 1674623626208 order by time_millisecond asc limit 10000;
+----------------------------------+----------+-----------+----------------------------------------------------------------------------+---------------------------------------------+
| id | estRows | task | access object | operator info |
+----------------------------------+----------+-----------+----------------------------------------------------------------------------+---------------------------------------------+
| Delete_6 | N/A | root | | N/A |
| └─IndexLookUp_32 | 10000.00 | root | | limit embedded(offset:0, count:10000) |
| ├─Limit_31(Build) | 10000.00 | cop[tikv] | | offset:0, count:10000 |
| │ └─IndexRangeScan_29 | 10000.00 | cop[tikv] | table:xxx_table, index:idx_timemillisecond(time_millisecond) | range:[-inf,1674623626208), keep order:true |
| └─TableRowIDScan_30(Probe) | 10000.00 | cop[tikv] | table:xxx_table | keep order:false, stats:pseudo |
+----------------------------------+----------+-----------+----------------------------------------------------------------------------+---------------------------------------------+
5 rows in set (0.00 sec)
The plan scans far fewer rows than the version without ORDER BY, but the actual execution was still unsatisfactory; see the trace screenshots:
Trace with the range bounded by BETWEEN:
Trace with the range bounded by <:
Neither variant made a noticeable difference, so further investigation is needed. The current plan is to shrink the batch size from 10000 to 5000 and see how that behaves. In the customer environment the tables simply need to be cleaned with TRUNCATE, because all the data currently in them is expired and useless.
To strictly follow the best-practice recommendation, you first need to determine the range to delete, and the implementation differs by business scenario; for example, you can combine a query with the delete, but the query itself has a cost, so the right balance has to be evaluated and verified against the actual workload.
The approach I took here is to wrap the cleanup in a helper method:
private int clearData(String table, long endTime) {
    // Find the oldest timestamp in the table, then clear everything up to endTime.
    Long startTime = this.interfaceCountInfoDao.getOldestTime(table);
    if (startTime != null && startTime <= endTime) {
        return this.interfaceCountInfoDao.clearXxxTableData(table, startTime, endTime);
    }
    logger.info("start time is later than end time, nothing to clean: {} > {}", startTime, endTime);
    return 0;
}
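On top of this, the "refine the interval" idea from the best-practice section can be sketched as follows. This is only an illustrative sketch: the one-hour slice size is an assumption, and it reuses the getOldestTime and clearXxxTableData(table, startTime, endTime) DAO methods shown above, deleting window by window so that no single statement covers too large a range.

private static final long SLICE_MILLIS = 60 * 60 * 1000L; // assumed one-hour window

private int clearDataInSlices(String table, long endTime) {
    Long oldest = this.interfaceCountInfoDao.getOldestTime(table);
    if (oldest == null || oldest > endTime) {
        logger.info("nothing to clean for table {}", table);
        return 0;
    }
    int total = 0;
    // Walk from the oldest timestamp towards endTime in fixed-size windows so each
    // delete only touches a bounded number of rows.
    for (long from = oldest; from <= endTime; from += SLICE_MILLIS) {
        long to = Math.min(from + SLICE_MILLIS - 1, endTime);
        total += this.interfaceCountInfoDao.clearXxxTableData(table, from, to);
    }
    return total;
}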
Summary
- During development, evaluate potential performance bottlenecks properly.
- When troubleshooting, think through where the breakthrough point is, and verify hypotheses hands-on.