运维笔记-服务未知原因宕机
文章内索引
[显示]
服务未知原因宕机
130测试服务器在运行几天后,发现有俩个服务挂掉了,其中一个是API,另外一个是Service-Center
首先排查API服务宕机原因
1. 首先本地目录没有异常排除JVM异常退出
启动目录中并没有hs_err_pidXXXX.log
2. 查看log4j日志,发现8月3号服务就挂掉了。
-rw-r–r– 1 root root 371487 Aug 1 23:59 HJ-Data-Api-2018-08-01.0.log
-rw-r–r– 1 root root 249120 Aug 2 23:59 HJ-Data-Api-2018-08-02.0.log
-rw-r–r– 1 root root 184540 Aug 3 16:40 HJ-Data-Api.log
3. 查看log发现,16:40 应该是接收了一个SHUTDOWN的指令,之后退出应用。
2018-08-03 16:40:32,367 Thread-9 INFO com.netflix.loadbalancer.PollingServerListUpdater Shutting down the Executor Pool for PollingServerListUpdater 2018-08-03 16:40:32,374 Thread-5 INFO org.springframework.boot.context.embedded.AnnotationConfigEmbeddedWebApplicationContext Closing org.springframework.boot.context.embedded.AnnotationConfigEmbeddedWebApplicationContext@3eb77ea8: startup date [Wed Aug 01 16:08:46 CST 2018]; parent: org.springframework.context.annotation.AnnotationConfigApplicationContext@34340fab 2018-08-03 16:40:32,375 Thread-5 INFO org.springframework.cloud.netflix.eureka.serviceregistry.EurekaServiceRegistry Unregistering application HJ-Data-Api with eureka with status DOWN 2018-08-03 16:40:32,375 Thread-5 WARN com.netflix.discovery.DiscoveryClient Saw local status change event StatusChangeEvent [timestamp=1533285632375, current=DOWN, previous=UP] 2018-08-03 16:40:32,376 DiscoveryClient-InstanceInfoReplicator-0 INFO com.netflix.discovery.DiscoveryClient DiscoveryClient_HJ-DATA-API/192.168.1.130:8500: registering service... 2018-08-03 16:40:32,381 DiscoveryClient-InstanceInfoReplicator-0 INFO com.netflix.discovery.DiscoveryClient DiscoveryClient_HJ-DATA-API/192.168.1.130:8500 - registration status: 204 2018-08-03 16:40:32,996 Thread-5 WARN org.springframework.boot.context.embedded.AnnotationConfigEmbeddedWebApplicationContext Exception thrown from ApplicationListener handling ContextClosedEvent java.lang.IndexOutOfBoundsException: Index: 256, Size: 256 at java.util.ArrayList.rangeCheck(ArrayList.java:653) at java.util.ArrayList.get(ArrayList.java:429) at com.hzcf.api.components.rabbitMq.RabbitMqInit.destroy(RabbitMqInit.java:127) at com.hzcf.api.ApplicationEventListener.onApplicationEvent(HjDataApiApplication.java:51) at org.springframework.context.event.SimpleApplicationEventMulticaster.doInvokeListener(SimpleApplicationEventMulticaster.java:172) at org.springframework.context.event.SimpleApplicationEventMulticaster.invokeListener(SimpleApplicationEventMulticaster.java:165) at org.springframework.context.event.SimpleApplicationEventMulticaster.multicastEvent(SimpleApplicationEventMulticaster.java:139) at org.springframework.context.support.AbstractApplicationContext.publishEvent(AbstractApplicationContext.java:393) at org.springframework.context.support.AbstractApplicationContext.publishEvent(AbstractApplicationContext.java:347) at org.springframework.context.support.AbstractApplicationContext.doClose(AbstractApplicationContext.java:991) at org.springframework.context.support.AbstractApplicationContext$2.run(AbstractApplicationContext.java:929) 2018-08-03 16:40:32,997 Thread-5 INFO org.springframework.context.support.DefaultLifecycleProcessor Stopping beans in phase 2147483647 2018-08-03 16:40:33,017 Thread-5 INFO org.springframework.context.support.DefaultLifecycleProcessor Stopping beans in phase 0 2018-08-03 16:40:33,142 Thread-5 INFO org.springframework.context.annotation.AnnotationConfigApplicationContext Closing org.springframework.context.annotation.AnnotationConfigApplicationContext@4830c9b2: startup date [Wed Aug 01 16:09:14 CST 2018]; parent: org.springframework.boot.context.embedded.AnnotationConfigEmbeddedWebApplicationContext@3eb77ea8 2018-08-03 16:40:33,147 Thread-5 WARN org.springframework.context.annotation.AnnotationConfigApplicationContext Exception thrown from ApplicationListener handling ContextClosedEvent com.rabbitmq.client.AlreadyClosedException: channel is already closed due to clean channel shutdown; protocol method: #method<channel.close>(reply-code=200, reply-text=OK, class-id=0, method-id=0) at com.rabbitmq.client.impl.AMQChannel.processShutdownSignal(AMQChannel.java:286) at com.rabbitmq.client.impl.ChannelN.startProcessShutdownSignal(ChannelN.java:264) at com.rabbitmq.client.impl.ChannelN.close(ChannelN.java:574) at com.rabbitmq.client.impl.ChannelN.close(ChannelN.java:508) at com.rabbitmq.client.impl.ChannelN.close(ChannelN.java:501) at com.hzcf.api.components.rabbitMq.RabbitMqInit.destroy(RabbitMqInit.java:127) at com.hzcf.api.ApplicationEventListener.onApplicationEvent(HjDataApiApplication.java:51) at org.springframework.context.event.SimpleApplicationEventMulticaster.doInvokeListener(SimpleApplicationEventMulticaster.java:172) at org.springframework.context.event.SimpleApplicationEventMulticaster.invokeListener(SimpleApplicationEventMulticaster.java:165) at org.springframework.context.event.SimpleApplicationEventMulticaster.multicastEvent(SimpleApplicationEventMulticaster.java:139) at org.springframework.context.support.AbstractApplicationContext.publishEvent(AbstractApplicationContext.java:393) at org.springframework.context.support.AbstractApplicationContext.publishEvent(AbstractApplicationContext.java:399) at org.springframework.context.support.AbstractApplicationContext.publishEvent(AbstractApplicationContext.java:347) at org.springframework.context.support.AbstractApplicationContext.doClose(AbstractApplicationContext.java:991) at org.springframework.context.support.AbstractApplicationContext.close(AbstractApplicationContext.java:958) at org.springframework.cloud.context.named.NamedContextFactory.destroy(NamedContextFactory.java:76) at org.springframework.beans.factory.support.DisposableBeanAdapter.destroy(DisposableBeanAdapter.java:272) at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.destroyBean(DefaultSingletonBeanRegistry.java:578) at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.destroySingleton(DefaultSingletonBeanRegistry.java:554) at org.springframework.beans.factory.support.DefaultListableBeanFactory.destroySingleton(DefaultListableBeanFactory.java:961) at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.destroySingletons(DefaultSingletonBeanRegistry.java:523) at org.springframework.beans.factory.support.FactoryBeanRegistrySupport.destroySingletons(FactoryBeanRegistrySupport.java:230) at org.springframework.beans.factory.support.DefaultListableBeanFactory.destroySingletons(DefaultListableBeanFactory.java:968) at org.springframework.context.support.AbstractApplicationContext.destroyBeans(AbstractApplicationContext.java:1032) at org.springframework.context.support.AbstractApplicationContext.doClose(AbstractApplicationContext.java:1008) at org.springframework.context.support.AbstractApplicationContext$2.run(AbstractApplicationContext.java:929) 2018-08-03 16:40:33,151 Thread-5 INFO com.netflix.util.concurrent.ShutdownEnabledTimer Shutdown hook removed for: NFLoadBalancer-PingTimer-192.168.1.132 2018-08-03 16:40:33,152 Thread-5 INFO com.netflix.util.concurrent.ShutdownEnabledTimer Exception caught (might be ok if at shutdown) java.lang.IllegalStateException: Shutdown in progress at java.lang.ApplicationShutdownHooks.remove(ApplicationShutdownHooks.java:82) at java.lang.Runtime.removeShutdownHook(Runtime.java:239) at com.netflix.util.concurrent.ShutdownEnabledTimer.cancel(ShutdownEnabledTimer.java:70) at com.netflix.loadbalancer.BaseLoadBalancer.cancelPingTask(BaseLoadBalancer.java:613) at com.netflix.loadbalancer.BaseLoadBalancer.shutdown(BaseLoadBalancer.java:864) at com.netflix.loadbalancer.DynamicServerListLoadBalancer.shutdown(DynamicServerListLoadBalancer.java:285) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.springframework.beans.factory.support.DisposableBeanAdapter.invokeCustomDestroyMethod(DisposableBeanAdapter.java:364) at org.springframework.beans.factory.support.DisposableBeanAdapter.destroy(DisposableBeanAdapter.java:287) at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.destroyBean(DefaultSingletonBeanRegistry.java:578) at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.destroySingleton(DefaultSingletonBeanRegistry.java:554) at org.springframework.beans.factory.support.DefaultListableBeanFactory.destroySingleton(DefaultListableBeanFactory.java:961) at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.destroySingletons(DefaultSingletonBeanRegistry.java:523) at org.springframework.beans.factory.support.FactoryBeanRegistrySupport.destroySingletons(FactoryBeanRegistrySupport.java:230) at org.springframework.beans.factory.support.DefaultListableBeanFactory.destroySingletons(DefaultListableBeanFactory.java:968) at org.springframework.context.support.AbstractApplicationContext.destroyBeans(AbstractApplicationContext.java:1032) at org.springframework.context.support.AbstractApplicationContext.doClose(AbstractApplicationContext.java:1008) at org.springframework.context.support.AbstractApplicationContext.close(AbstractApplicationContext.java:958) at org.springframework.cloud.context.named.NamedContextFactory.destroy(NamedContextFactory.java:76) at org.springframework.beans.factory.support.DisposableBeanAdapter.destroy(DisposableBeanAdapter.java:272) at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.destroyBean(DefaultSingletonBeanRegistry.java:578) at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.destroySingleton(DefaultSingletonBeanRegistry.java:554) at org.springframework.beans.factory.support.DefaultListableBeanFactory.destroySingleton(DefaultListableBeanFactory.java:961) at org.springframework.beans.factory.support.DefaultSingletonBeanRegistry.destroySingletons(DefaultSingletonBeanRegistry.java:523) at org.springframework.beans.factory.support.FactoryBeanRegistrySupport.destroySingletons(FactoryBeanRegistrySupport.java:230) at org.springframework.beans.factory.support.DefaultListableBeanFactory.destroySingletons(DefaultListableBeanFactory.java:968) at org.springframework.context.support.AbstractApplicationContext.destroyBeans(AbstractApplicationContext.java:1032) at org.springframework.context.support.AbstractApplicationContext.doClose(AbstractApplicationContext.java:1008) at org.springframework.context.support.AbstractApplicationContext$2.run(AbstractApplicationContext.java:929) 2018-08-03 16:40:33,168 Thread-5 INFO org.springframework.boot.actuate.endpoint.jmx.EndpointMBeanExporter Unregistering JMX-exposed beans on shutdown 2018-08-03 16:40:33,170 Thread-5 INFO org.springframework.boot.actuate.endpoint.jmx.EndpointMBeanExporter Unregistering JMX-exposed beans 2018-08-03 16:40:33,170 Thread-5 INFO org.springframework.jmx.export.annotation.AnnotationMBeanExporter Unregistering JMX-exposed beans on shutdown 2018-08-03 16:40:33,173 Thread-5 INFO org.springframework.jmx.export.annotation.AnnotationMBeanExporter Unregistering JMX-exposed beans 2018-08-03 16:40:33,173 Thread-5 INFO com.netflix.discovery.DiscoveryClient Shutting down DiscoveryClient ... 2018-08-03 16:40:33,174 Thread-5 INFO com.netflix.discovery.DiscoveryClient Unregistering ... 2018-08-03 16:40:33,201 Thread-5 INFO com.netflix.discovery.DiscoveryClient DiscoveryClient_HJ-DATA-API/192.168.1.130:8500 - deregister status: 200 2018-08-03 16:40:33,204 Thread-5 INFO com.netflix.discovery.DiscoveryClient Completed shut down of DiscoveryClient 2018-08-03 16:40:33,272 Thread-5 INFO org.springframework.scheduling.concurrent.ThreadPoolTaskScheduler Shutting down ExecutorService 'poolScheduler' 2018-08-03 16:40:33,291 Thread-5 INFO io.undertow.servlet Destroying Spring FrameworkServlet 'dispatcherServlet'
尴尬的是我忘记了是自己调用的shutdown接口还是被系统kill的?
查看日志发现服务并未接收到shutdown的请求,那么应该是系统发送的kill指令?
使用history查看历史命令
看到了一些关于服务的kill操作,但是history我并没有添加操作时间。
890 ps -ef |grep java 891 jps 892 kill 13876 893 jps 894 cd /home/release/app/ 895 ifconfig 896 vi /etc/hosts 897 jps 898 kill 12533 899 vi /home/release/app/ 900 vi /home/release/app/start-fetchData.sh
使用dmesg查看服务关闭记录
其中1号发生了一个OOM kill process,2-3号没有什么游价值的日志。
[Wed Aug 1 17:58:22 2018] Out of memory: Kill process 10644 (memtester) score 220 or sacrifice child [Wed Aug 1 17:58:22 2018] Killed process 10644 (memtester) total-vm:3288060kB, anon-rss:3218348kB, file-rss:4kB [Thu Aug 2 13:32:37 2018] UDP: bad checksum. From 192.168.1.37:137 to 192.168.1.255:137 ulen 58 [Thu Aug 2 14:32:38 2018] UDP: bad checksum. From 192.168.1.37:137 to 192.168.1.255:137 ulen 58 [Fri Aug 3 04:32:57 2018] UDP: bad checksum. From 192.168.1.37:137 to 192.168.1.255:137 ulen 58 [Fri Aug 3 09:33:04 2018] UDP: bad checksum. From 192.168.1.37:137 to 192.168.1.255:137 ulen 58 [Fri Aug 3 09:52:03 2018] UDP: bad checksum. From 192.168.1.37:137 to 192.168.1.255:137 ulen 58 [Fri Aug 3 10:33:06 2018] UDP: bad checksum. From 192.168.1.37:137 to 192.168.1.255:137 ulen 58 [Fri Aug 3 20:33:19 2018] UDP: bad checksum. From 192.168.1.37:137 to 192.168.1.255:137 ulen 58 [Sun Aug 5 00:33:56 2018] UDP: bad checksum. From 192.168.1.37:137 to 192.168.1.255:137 ulen 58
查看系统日志
查看messagelog
/var/log 下的message 列表
-rw------- 1 root root 16252 Aug 23 10:01 messages -rw------- 1 root root 136365 Jul 29 03:07 messages-20180729 -rw------- 1 root root 82456 Aug 5 03:36 messages-20180805 -rw------- 1 root root 28364 Aug 13 03:12 messages-20180813 -rw------- 1 root root 21378 Aug 19 03:17 messages-20180819查看八月五号的message日志发现在那个时间点发生了登出操作,难道没用守护方式执行?cat messages-20180805Aug 3 16:01:01 localhost systemd: Starting Session 311 of user root. Aug 3 16:40:26 localhost systemd-logind: New session 312 of user root. Aug 3 16:40:26 localhost systemd: Started Session 312 of user root. Aug 3 16:40:26 localhost systemd: Starting Session 312 of user root. Aug 3 16:40:33 localhost systemd-logind: Removed session 259. Aug 3 16:59:12 localhost systemd: Starting Cleanup of Temporary Directories...使用journalctl
发现通过36登录
Aug 03 16:01:01 hanlin CROND[4839]: (root) CMD (run-parts /etc/cron.hourly) Aug 03 16:01:01 hanlin run-parts(/etc/cron.hourly)[4842]: starting 0anacron Aug 03 16:01:01 hanlin run-parts(/etc/cron.hourly)[4848]: finished 0anacron Aug 03 16:40:26 hanlin sshd[5141]: Accepted password for root from 192.168.1.36 port 64728 ssh2 Aug 03 16:40:26 hanlin systemd-logind[618]: New session 312 of user root. -- Subject: A new session 312 has been created for user root -- Defined-By: systemd -- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel -- Documentation: http://www.freedesktop.org/wiki/Software/systemd/multiseat -- -- A new session with the ID 312 has been created for the user root. -- -- The leading process of the session is 5141. Aug 03 16:40:26 hanlin systemd[1]: Started Session 312 of user root. -- Subject: Unit session-312.scope has finished start-up -- Defined-By: systemd -- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel -- -- Unit session-312.scope has finished starting up. -- -- The start-up result is done. Aug 03 16:40:26 hanlin systemd[1]: Starting Session 312 of user root. -- Subject: Unit session-312.scope has begun start-up -- Defined-By: systemd -- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel -- -- Unit session-312.scope has begun starting up. Aug 03 16:40:26 hanlin sshd[5141]: pam_unix(sshd:session): session opened for user root by (uid=0) Aug 03 16:40:33 hanlin systemd-logind[618]: Removed session 259. -- Subject: Session 259 has been terminated -- Defined-By: systemd -- Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel -- Documentation: http://www.freedesktop.org/wiki/Software/systemd/multiseat
Service-Center
测试环境服务在运行了一段时间后,发现部分服务未知原因挂掉了,所以排查一下挂掉的原因。首先查看日志,发现日志并未提供有价值的信息。
2018-08-02 18:03:59,522 Thread-12 INFO org.springframework.boot.context.embedded.AnnotationConfigEmbeddedWebApplicationContext Closing org.springframework.boot.context.embedded.AnnotationConfigEmbeddedWebApplicationContext@6d8a00e3: startup date [Tue Jul 24 17:11:00 CST 2018]; parent: org.springframework.context.annotation.AnnotationConfigApplicationContext@58d25a40 2018-08-02 18:03:59,523 Thread-12 INFO org.springframework.cloud.netflix.eureka.serviceregistry.EurekaServiceRegistry Unregistering application HJ-ServiceCenter with eureka with status DOWN 2018-08-02 18:03:59,523 Thread-12 WARN com.netflix.discovery.DiscoveryClient Saw local status change event StatusChangeEvent [timestamp=1533204239523, current=DOWN, previous=UP] 2018-08-02 18:03:59,523 DiscoveryClient-InstanceInfoReplicator-0 INFO com.netflix.discovery.DiscoveryClient DiscoveryClient_HJ-SERVICECENTER/192.168.1.130:8761: registering service... 2018-08-02 18:03:59,524 Thread-12 INFO org.springframework.context.support.DefaultLifecycleProcessor Stopping beans in phase 0 2018-08-02 18:03:59,534 DiscoveryClient-InstanceInfoReplicator-0 INFO com.netflix.discovery.DiscoveryClient DiscoveryClient_HJ-SERVICECENTER/192.168.1.130:8761 - registration status: 204 2018-08-02 18:03:59,548 Thread-12 INFO org.springframework.boot.actuate.endpoint.jmx.EndpointMBeanExporter Unregistering JMX-exposed beans on shutdown 2018-08-02 18:03:59,550 Thread-12 INFO org.springframework.boot.actuate.endpoint.jmx.EndpointMBeanExporter Unregistering JMX-exposed beans 2018-08-02 18:03:59,551 Thread-12 INFO org.springframework.jmx.export.annotation.AnnotationMBeanExporter Unregistering JMX-exposed beans on shutdown 2018-08-02 18:03:59,551 Thread-12 INFO org.springframework.jmx.export.annotation.AnnotationMBeanExporter Unregistering JMX-exposed beans 2018-08-02 18:03:59,552 Thread-12 INFO com.netflix.discovery.DiscoveryClient Shutting down DiscoveryClient ... 2018-08-02 18:03:59,553 Thread-12 INFO com.netflix.discovery.DiscoveryClient Unregistering ... 2018-08-02 18:03:59,562 Thread-12 INFO com.netflix.discovery.DiscoveryClient DiscoveryClient_HJ-SERVICECENTER/192.168.1.130:8761 - deregister status: 200 2018-08-02 18:03:59,575 Thread-12 INFO com.netflix.discovery.DiscoveryClient Completed shut down of DiscoveryClient 2018-08-02 18:03:59,606 Thread-12 INFO io.undertow.servlet Destroying Spring FrameworkServlet 'dispatcherServlet'
使用dmesg查看服务关闭记录
dmesg –T / dmesg –T | grep java
其中1号有一个因为内存溢出引发的终止进程,2号并未发现异常。
[Wed Aug 1 17:58:22 2018] java invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0 [Wed Aug 1 17:58:22 2018] java cpuset=/ mems_allowed=0 [Wed Aug 1 17:58:22 2018] CPU: 1 PID: 13965 Comm: java Not tainted 3.10.0-327.el7.x86_64 #1
[Thu Aug 2 13:32:36 2018] UDP: bad checksum. From 192.168.1.37:137 to 192.168.1.255:137 ulen 58 [Thu Aug 2 14:32:37 2018] UDP: bad checksum. From 192.168.1.37:137 to 192.168.1.255:137 ulen 58 [Fri Aug 3 04:32:56 2018] UDP: bad checksum. From 192.168.1.37:137 to 192.168.1.255:137 ulen 58 [Fri Aug 3 09:33:03 2018] UDP: bad checksum. From 192.168.1.37:137 to 192.168.1.255:137 ulen 58
查看messagelog
/var/log 下的message 列表
-rw------- 1 root root 16252 Aug 23 10:01 messages -rw------- 1 root root 136365 Jul 29 03:07 messages-20180729 -rw------- 1 root root 82456 Aug 5 03:36 messages-20180805 -rw------- 1 root root 28364 Aug 13 03:12 messages-20180813 -rw------- 1 root root 21378 Aug 19 03:17 messages-20180819查看八月五号的message日志cat messages-20180805 | grep ttbrainAug 1 17:53:23 localhost kernel: Out of memory: Kill process 10644 (memtester) score 220 or sacrifice child Aug 1 17:53:23 localhost kernel: Killed process 10644 (memtester) total-vm:3288060kB, anon-rss:3218348kB, file-rss:4kBjournalctl -xb | egrep -i 'killed process'
©版权声明:本文为【翰林小院】(huhanlin.com)原创文章,转载时请注明出处!
发表评论