原文链接: https://blog.fliaping.com/a-high-concurrency-problem-of-log4j2/
一个APP推送服务,在一次发布之后, 当遇到批量推送任务时, 线程池打满, 队列打满, 任务堆积. (注: 由于该次发布的改动点很普通, 没有先怀疑该点.)
@RefreshScope
以下日志为改为异步日志(AsyncLogger)之后的阻塞线程栈,可以发现日志并没有用到disruptor这个无锁库,而是转为了同步Logger
"fastExecutor-670" #2178 prio=5 os_prio=0 tid=0x00007f2d483cd800 nid=0x6026 waiting for monitor entry [0x00007f2c158c3000]
java.lang.Thread.State: BLOCKED (on object monitor)
at java.lang.ClassLoader.loadClass(ClassLoader.java:404)
- waiting to lock <0x0000000088104b60> (a java.lang.Object)
at org.springframework.boot.loader.LaunchedURLClassLoader.loadClass(LaunchedURLClassLoader.java:94)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.logging.log4j.core.impl.ThrowableProxy.loadClass(ThrowableProxy.java:539)
at org.apache.logging.log4j.core.impl.ThrowableProxy.toExtendedStackTrace(ThrowableProxy.java:660)
at org.apache.logging.log4j.core.impl.ThrowableProxy.<init>(ThrowableProxy.java:137)
at org.springframework.boot.loader.LaunchedURLClassLoader.loadClass(LaunchedURLClassLoader.java:94)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.logging.log4j.core.impl.ThrowableProxy.loadClass(ThrowableProxy.java:539)
at org.apache.logging.log4j.core.impl.ThrowableProxy.toExtendedStackTrace(ThrowableProxy.java:660)
at org.apache.logging.log4j.core.impl.ThrowableProxy.<init>(ThrowableProxy.java:137)
at org.apache.logging.log4j.core.impl.ThrowableProxy.<init>(ThrowableProxy.java:121)
at org.apache.logging.log4j.core.impl.Log4jLogEvent.getThrownProxy(Log4jLogEvent.java:555)
at org.apache.logging.log4j.core.pattern.ExtendedThrowablePatternConverter.format(ExtendedThrowablePatternConverter.java:61)
... 精简部分日志 ...
at org.apache.logging.log4j.core.config.LoggerConfig.callAppenders(LoggerConfig.java:447)
at org.apache.logging.log4j.core.async.AsyncLoggerConfig.callAppendersInCurrentThread(AsyncLoggerConfig.java:105)
at org.apache.logging.log4j.core.async.EventRoute$2.logMessage(EventRoute.java:65)
at org.apache.logging.log4j.core.async.AsyncLoggerConfig.callAppenders(AsyncLoggerConfig.java:95)
at org.apache.logging.log4j.core.config.LoggerConfig.processLogEvent(LoggerConfig.java:432)
at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:416)
at org.apache.logging.log4j.core.config.LoggerConfig.log(LoggerConfig.java:402)
at org.apache.logging.log4j.core.config.AwaitCompletionReliabilityStrategy.log(AwaitCompletionReliabilityStrategy.java:63)
at org.apache.logging.log4j.core.Logger.logMessage(Logger.java:146)
at org.apache.logging.log4j.spi.AbstractLogger.logMessageSafely(AbstractLogger.java:2091)
at org.apache.logging.log4j.spi.AbstractLogger.logMessage(AbstractLogger.java:1988)
at org.apache.logging.log4j.spi.AbstractLogger.logIfEnabled(AbstractLogger.java:1960)
at org.apache.logging.slf4j.Log4jLogger.warn(Log4jLogger.java:259)
at com.dianwoda.delibird.common.traffic.FastCommander.lambda$runnersEnter$0(FastCommander.java:175)
at com.dianwoda.delibird.common.traffic.FastCommander$$Lambda$111/638548222.run(Unknown Source)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
以下为应用异常栈内容
com.dianwoda.delibird.common.domain.DeliException:
at com.dianwoda.delibird.push.manager.PushManager.send(PushManager.java:104) ~[classes!/:?]
at sun.reflect.GeneratedMethodAccessor119.invoke(Unknown Source) ~[?:?]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_77]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_77]
at org.springframework.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:216) ~[spring-core-4.3.15.RELEASE.jar!/:4.3.15.RELEASE]
at org.springframework.cloud.context.scope.GenericScope$LockedScopedProxyFactoryBean.invoke(GenericScope.java:472) ~[spring-cloud-context-1.3.3.RELEASE.jar!/:1.3.3.RELEASE]
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:179) ~[spring-aop-4.3.15.RELEASE.jar!/:4.3.15.RELEASE]
at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:673) ~[spring-aop-4.3.15.RELEASE.jar!/:4.3.15.RELEASE]
at com.dianwoda.delibird.push.manager.PushManager$$EnhancerBySpringCGLIB$$74851038.send(<generated>) ~[classes!/:?]
at com.dianwoda.delibird.push.provider.DeliPushProviderImpl.lambda$send$0(DeliPushProviderImpl.java:84) ~[classes!/:?]
at com.dianwoda.delibird.common.tool.DeliRunnable.run(DeliRunnable.java:26) ~[delibird-common-1.0-SNAPSHOT.jar!/:?]
at com.dianwoda.delibird.common.traffic.FastCommander.lambda$runnersEnter$0(FastCommander.java:173) ~[delibird-common-1.0-SNAPSHOT.jar!/:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_77]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_77]
at java.lang.Thread.run(Thread.java:745) [?:1.8.0_77]
Log4j2中默认会优先尝试将LogEvent放入RingBuffer中,如果放不进去(原因大概是队列已满),则会通过AsyncQueueFullPolicy(可自定义)来决策下一步行为(EventRoute:丢弃 - discard,同步 - synchronous,等待入队 - enqueue),在log4j-2.7中默认是同步写日志(2.9中默认是等待入队Enqueue的方式),因此,当日志量突增时,异步日志变同步日志!
根据线程栈, 问题主要是因为 org.apache.logging.log4j.core.impl.ThrowableProxy.toExtendedStackTrace 这个方法. 其中 ThrowableProxy 这个类是因为LogEvent可能会被跨网络传输,而LogEvent中的异常栈可能不会被另一端识别,因此需要对异常栈进行重新封装,会取出栈中异常的jar包及版本等信息。该类的注释如下:
包装一个Throwable并添加每个堆栈跟踪元素的包信息。
用于在不同ClassLoader或JVM中表示Throwable的一个代理,当应用反序列化一个ThrowableProxy, Throwable也许没有被设置,但是Throwable信息被保存在该代理的其它字段,像 message和stack trace
另外该方法是为了"解析此堆栈跟踪中与父元素不同的所有堆栈条目", 简单的讲,就是把异常堆栈与当前的线程栈作对比,把异常堆栈中不一样的类信息解析出来,包括类名,行号,包名,包版本等信息,这时候就需要根据异常栈中的类名获取Class对象,便需要类加载. 根据线程栈来分析下类加载的过程, 可以看到阻塞的地方是在等待一个锁, 正是因为锁等待导致打印日志的操作耗时3s.
以下为阻塞线程的线程栈:
at java.lang.ClassLoader.loadClass(ClassLoader.java:404) - waiting to lock <0x0000000088104b60> (a java.lang.Object) at org.springframework.boot.loader.LaunchedURLClassLoader.loadClass(LaunchedURLClassLoader.java:94) at java.lang.ClassLoader.loadClass(ClassLoader.java:357) at org.apache.logging.log4j.core.impl.ThrowableProxy.loadClass(ThrowableProxy.java:539) at org.apache.logging.log4j.core.impl.ThrowableProxy.toExtendedStackTrace(ThrowableProxy.java:660)
以下为类加载获取锁的代码片段: 代码1
// java.lang.ClassLoader#loadClass(java.lang.String, boolean)
protected Class<?> loadClass(String name, boolean resolve)
throws ClassNotFoundException
{
synchronized (getClassLoadingLock(name)) { // 源码行号:404
//(1. 注: 先从jvm检查下类加载过没有,已加载就直接返回该对象,防止类重复加载)
// First, check if the class has already been loaded
Class<?> c = findLoadedClass(name);
if (c == null) {
long t0 = System.nanoTime();
try {
// 2. 如果该类加载器有父对象,先用父加载器加载,这就是双亲委派机制
if (parent != null) {
c = parent.loadClass(name, false);
} else {
// 3. 如果父加载器加载不到,就用引导加载器加载
c = findBootstrapClassOrNull(name);
}
} catch (ClassNotFoundException e) {
// ClassNotFoundException thrown if class not found
// from the non-null parent class loader
}
if (c == null) {
// If still not found, then invoke findClass in order
// to find the class.
long t1 = System.nanoTime();
// 4. 如果前面双亲都没有加载到, 采用当前自定义的findClass加载
c = findClass(name); // 源码行号:424
// this is the defining class loader; record the stats
sun.misc.PerfCounter.getParentDelegationTime().addTime(t1 - t0);
sun.misc.PerfCounter.getFindClassTime().addElapsedTimeFrom(t1);
sun.misc.PerfCounter.getFindClasses().increment();
}
}
if (resolve) {
resolveClass(c);
}
return c;
}
}
可以看到类加载的时候会先用检查下是否已经加载过,加载过就直接返回jvm中的类对象. 如果是走的是获取加载过的类,应该是非常快的,因为仅仅是一个内存操作,获取的锁会被马上释放, 在几千QPS的情况下根本不可能发生阻塞3s这样的事情, 那么此时的类到底有没有被加载过呢?
可以看下获取到该锁 0x0000000088104b60 的线程栈:
"fastExecutor-671" #1739 prio=5 os_prio=0 tid=0x00007f2d00015000 nid=0x5e64 runnable [0x00007f2c237f9000]
java.lang.Thread.State: RUNNABLE
at sun.misc.URLClassPath$Loader.getResource(URLClassPath.java:702)
at sun.misc.URLClassPath.getResource(URLClassPath.java:212)
at java.net.URLClassLoader$1.run(URLClassLoader.java:365)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
- locked <0x0000000088104b60> (a java.lang.Object)
at org.springframework.boot.loader.LaunchedURLClassLoader.loadClass(LaunchedURLClassLoader.java:94)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.logging.log4j.core.impl.ThrowableProxy.loadClass(ThrowableProxy.java:539)
... 精简部分内容 ...
根据上面的线程栈, 获取锁之后走到源码的424行, 就表示类没有在jvm中加载过,并且双亲加载器也加载不到,调用了 findClass 去加载. 这里就非常奇怪了, jvm如果加载过一次,下次一定会从jvm中直接拿到, 结合前面说的锁会马上释放, 根本不会阻塞. 也就是说实际每次要重新加载. 为了找到具体是哪个类每次都需要加载一次,需要来debug确定.
下面是log4j2中 ThrowableProxy 类加载流程的代码: 代码2 ,是调用类加载的地方.
// org.apache.logging.log4j.core.impl.ThrowableProxy#loadClass(java.lang.String)
/**
* Loads classes not located via Reflection.getCallerClass.
*
* @param lastLoader The ClassLoader that loaded the Class that called this Class.
* @param className The name of the Class.
* @return The Class object for the Class or null if it could not be located.
*/
private Class<?> loadClass(final ClassLoader lastLoader, final String className) {
// XXX: this is overly complicated (注:确实过于复杂,哈哈哈)
Class<?> clazz;
if (lastLoader != null) {
try {
// 1. 先用ClassLoader加载一下, 加载上就返回
clazz = lastLoader.loadClass(className); // 源码行号: 539
if (clazz != null) {
return clazz;
}
} catch (final Throwable ignore) {
// Ignore exception.
}
}
try {
// 2. 上一步没加载上,或者出现异常,用LoaderUtil再次加载(使用Class.forName以及当前线程的ClassLoader)
clazz = LoaderUtil.loadClass(className);
} catch (final ClassNotFoundException | NoClassDefFoundError e) {
// 3. 加载出现异常,再次尝试一种加载方式
return loadClass(className);
} catch (final SecurityException e) {
return null;
}
return clazz;
}
// 4. 接上面的3, 再次用当前对象类的加载器加载,出现异常返回空
private Class<?> loadClass(final String className) {
try {
return this.getClass().getClassLoader().loadClass(className);
} catch (final ClassNotFoundException | NoClassDefFoundError | SecurityException e) {
return null;
}
}
先让代码跑起来,并走几次出问题的流程,确保该加载的类已经加载过了, 然后在 代码1 中 findLoadedClass 方法处打断点并查看返回值是否为空. 代码2 中 lastLoader 的类型是 LaunchedURLClassLoader 继承关系如下图
loadClass 方法如: 代码3 : org.springframework.boot.loader.LaunchedURLClassLoader#loadClass
@Override
protected Class<?> loadClass(String name, boolean resolve)
throws ClassNotFoundException {
Handler.setUseFastConnectionExceptions(true);
try {
try {
definePackageIfNecessary(name);
}
catch (IllegalArgumentException ex) {
// Tolerate race condition due to being parallel capable
if (getPackage(name) == null) {
// This should never happen as the IllegalArgumentException indicates
// that the package has already been defined and, therefore,
// getPackage(name) should not return null.
throw new AssertionError("Package " + name + " has already been "
+ "defined but it could not be found");
}
}
return super.loadClass(name, resolve);
}
finally {
Handler.setUseFastConnectionExceptions(false);
}
}
执行 lastLoader.loadClass(className) 跳到 代码1 , 整个加载流程遵循双亲委派机制, 如下图
LaunchedURLClassLoader是一个自定义类加载器, 直接调用父类 ClassLoader#loadClass 即 代码1 中所示, 分别用“应用类加载器”、“扩展类加载器”、“引导类加载器”加载,最终发现了当出现类名 sun.reflect.GeneratedMethodAccessor204 时经过 parent loaders、bootstrap loader、URLClassLoader#findClass都加载不到,最后抛出 ClassNotFoundException 被 代码2 步骤1处捕获并忽略,接着执行步骤2继续尝试加载,随后抛出异常,捕获后在步骤3处再次尝试加载,再次异常返回空。
如异常日志 at sun.reflect.GeneratedMethodAccessor119.invoke(Unknown Source) ~[?:?] , 由于加载不到该类,源码信息、包信息都是没有的。
插曲:调试过程中重启了一次应用,发现再也走不到之前的调试逻辑了,好像所有的事情都正常了,没有加载不上的类了,顿时一脸懵逼。然后误打误撞,用一个批量调用的脚本吭哧吭哧一顿调用,然后问题又重现了
在准备看看 sun.reflect.GeneratedMethodAccessor204 这个类为什么加载不上时,发现根本没有这个类,一看类名就怀疑是反射生成的类。加上之前的插曲,另外还有别人的提醒,以及google,发现根本原因竟是jvm对反射的优化策略。
jvm对待反射有两种方式:
在ReflectionFactory里有一种机制,就是当一个方法被反射调用的次数超过一定的阀值时(inflationThreshold),会使用第二种方式来提升速度。这个阀值的默认值是15.该阈值可以通过jvm参数 -Dsun.reflect.inflationThreshold 进行配置。
分析至此,引起阻塞的原因就清楚了,是因为jvm对待反射的优化,使得动态生成的类始终不能通过classLoader加载,于是每次解析异常栈都会进行类加载,并且由于双亲委派和 ThrowableProxy#loadClass 的多次异常处理,导致锁占有的时间很长,最终导致阻塞。
关于发布改动点,就是加上了 @RefreshScope 注解,通过对比加了和没加的异常栈日志,也发现了差别。
没有加 @RefreshScope 注解的异常栈:
com.dianwoda.delibird.common.domain.DeliException:
at com.dianwoda.delibird.push.manager.PushManager.send(PushManager.java:104) ~[classes/:?]
at com.dianwoda.delibird.push.provider.DeliPushProviderImpl.lambda$send$0(DeliPushProviderImpl.java:84) ~[classes/:?]
at com.dianwoda.delibird.common.tool.DeliRunnable.run(DeliRunnable.java:26) ~[classes/:?]
at com.dianwoda.delibird.common.traffic.FastCommander.lambda$runnersEnter$0(FastCommander.java:174) ~[classes/:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_161]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_161]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_161]
加了 @RefreshScope 注解的异常栈:
com.dianwoda.delibird.common.domain.DeliException:
at com.dianwoda.delibird.push.manager.PushManager.send(PushManager.java:104) ~[classes/:?]
at sun.reflect.GeneratedMethodAccessor204.invoke(Unknown Source) ~[?:?]
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_161]
at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_161]
at org.springframework.util.ReflectionUtils.invokeMethod(ReflectionUtils.java:216) ~[spring-core-4.3.15.RELEASE.jar:4.3.15.RELEASE]
at org.springframework.cloud.context.scope.GenericScope$LockedScopedProxyFactoryBean.invoke(GenericScope.java:472) ~[spring-cloud-context-1.3.3.RELEASE.jar:1.3.3.RELEASE]
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:179) ~[spring-aop-4.3.15.RELEASE.jar:4.3.15.RELEASE]
at org.springframework.aop.framework.CglibAopProxy$DynamicAdvisedInterceptor.intercept(CglibAopProxy.java:673) ~[spring-aop-4.3.15.RELEASE.jar:4.3.15.RELEASE]
at com.dianwoda.delibird.push.manager.PushManager$$EnhancerBySpringCGLIB$$14df0707.send(<generated>) ~[classes/:?]
at com.dianwoda.delibird.push.provider.DeliPushProviderImpl.lambda$send$0(DeliPushProviderImpl.java:84) ~[classes/:?]
at com.dianwoda.delibird.common.tool.DeliRunnable.run(DeliRunnable.java:26) ~[classes/:?]
at com.dianwoda.delibird.common.traffic.FastCommander.lambda$runnersEnter$0(FastCommander.java:174) ~[classes/:?]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_161]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_161]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_161]
该注解是spring-cloud中用来在配置更新后刷新bean,其原理如下:
RefreshScope 的父类方法 GenericScope#postProcessBeanFactory 把自己注册到beanFactory中,而实现了Scope接口的 GenericScope#get 的方法,会在get时放入 StandardScopeCache 缓存中, 其实声明了 RefreshScope 的bean都是懒加载,在初次使用时才进行创建并缓存。 RefreshScope#refresh 方法时,先从缓存中删除先前的Bean,然后再执行 GenericScope.BeanLifecycleWrapper#destroy 方法彻底结束上一个Bean生命周期,然后再发布一个 RefreshScopeRefreshedEvent 事件。 org.springframework.aop.framework.CglibAopProxy.DynamicAdvisedInterceptor#intercept 方法,方法中会调用 org.springframework.aop.target.SimpleBeanTargetSource#getTarget 重新生成一个bean 放入 StandardScopeCache 中,从而实现了bean更新 从上面的原理看出经过@RefreshScope注解过的bean就是一个代理,这也不难理解为什么会用到反射,然后由于jvm对反射的优化而产生该问题
该问题的原因,其实有很多方面:
@RefreshScope log4j 2.7