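How do you configure a crash recovery policy for a Windows Service?
Our team has a service deployed as a Windows Service, and we want Windows to restart it automatically when it crashes. I recently looked into how to do this and am recording what I found here.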
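Setting the service's Startup Type
When you create a Windows Service, you can set its StartupType, which controls how the service is started after the operating system boots. StartupType can be one of: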
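- Automatic: the service is started automatically, as soon as possible after the operating system boots.
- AutomaticDelayedStart: the service is started automatically after the operating system boots, following a delay whose length is not fixed.
- Manual: the service is not started automatically after boot; it has to be started manually.
- Disabled: the service cannot be started.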
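Automatic vs. AutomaticDelayedStart
Automatic takes priority over AutomaticDelayedStart. Early in boot, the Service Control Manager (SCM) starts all Automatic services first. Only once the system has settled does the SCM start the AutomaticDelayedStart services.
Automatic services can slow down boot, while AutomaticDelayedStart services do not, because they start after the early boot phase.
Use Automatic for critical services, and AutomaticDelayedStart for non-critical or resource-heavy ones.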
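For an existing service, the startup type can also be changed from the command line with `sc.exe config <ServiceName> start= <value>`. The snippet below is a minimal sketch that maps the StartupType names used in this post to the corresponding `start=` values. The helper name `build_config_command` and the dictionary are my own, introduced only for illustration.

```python
# Hypothetical helper: maps the StartupType names used in this post to the
# values accepted by `sc.exe config <ServiceName> start= <value>`.
START_VALUES = {
    "Automatic": "auto",
    "AutomaticDelayedStart": "delayed-auto",
    "Manual": "demand",
    "Disabled": "disabled",
}


def build_config_command(service_name, startup_type):
    """Return the argument list for `sc.exe config ... start= ...`."""
    if startup_type not in START_VALUES:
        raise ValueError(f"unknown startup type: {startup_type}")
    # "start=" and its value are passed as two separate arguments,
    # which matches the "option= value" spacing that sc.exe expects.
    return ["sc.exe", "config", service_name, "start=", START_VALUES[startup_type]]


if __name__ == "__main__":
    # Only prints the command; running it requires Windows and an elevated prompt.
    print(" ".join(build_config_command("TestWindowsService", "AutomaticDelayedStart")))
```

On a Windows machine the returned list could be handed to `subprocess.run`, but the sketch only builds and prints the command.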
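Setting the service's crash recovery policy
StartupType only controls how the service starts after the system boots. If the service crashes while it is running, restarting it requires a separately configured crash recovery policy.
Commands
The command for setting the crash recovery policy is: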
sc.exe failure <ServiceName> reset= <ResetPeriodInSeconds> actions= <action1>/<delay1InMs>/<action2>/<delay2InMs>/... reboot= <RebootMessage> command= <CommandLine>
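where:
- actions= specifies the recovery actions to take after a crash.
  - Supported action types:
    - restart: restart the service
    - run: run a specified program (must be combined with command=)
    - reboot: reboot the machine (requires administrator privileges)
    - none (or any other string): do nothing
  - <delayXInMs> is how long to wait before executing the action.
  - Once the crash count exceeds the number of actions, every later crash uses the last action, i.e. the last action repeats indefinitely.
- reset= specifies how long to wait before the crash count is reset.
  - The timer starts from the most recent crash, not from the first crash.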
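To make the argument layout concrete, here is a minimal sketch that assembles the `sc.exe failure` command from a list of (action, delay) pairs. The function `build_failure_command` and its parameter names are hypothetical. It also restricts actions to the four documented names, which is stricter than the "any other string" behavior described above.

```python
VALID_ACTIONS = {"restart", "run", "reboot", "none"}


def build_failure_command(service_name, reset_seconds, actions,
                          command_line=None, reboot_message=None):
    """Build the argument list for `sc.exe failure`.

    actions is a list of (action, delay_in_ms) pairs, in the order in which
    they apply to the first, second, third, ... crash.
    """
    if not actions:
        raise ValueError("at least one action is required")
    parts = []
    for action, delay_ms in actions:
        if action not in VALID_ACTIONS:
            raise ValueError(f"unsupported action: {action}")
        if delay_ms < 0:
            raise ValueError("delay must not be negative")
        parts.append(f"{action}/{delay_ms}")
    # The run action only makes sense together with command=.
    if any(action == "run" for action, _ in actions) and not command_line:
        raise ValueError("the run action needs command_line")
    args = ["sc.exe", "failure", service_name,
            "reset=", str(reset_seconds),
            "actions=", "/".join(parts)]
    if reboot_message:
        args += ["reboot=", reboot_message]
    if command_line:
        args += ["command=", command_line]
    return args


if __name__ == "__main__":
    # Only prints the command; it is not executed here.
    print(" ".join(build_failure_command(
        "TestWindowsService", 10, [("restart", 0), ("restart", 3000)])))
```

With the arguments shown, the printed command is the same one used in Example 1 later in this post.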
The command for viewing the crash recovery policy is:
sc.exe qfailure <ServiceName>
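Trigger conditions
Stopping the service manually does not trigger crash recovery. It is triggered only when the service terminates unexpectedly, which includes the following cases:
- The service throws an unhandled exception while running, which crashes the process.
- The service exits with a non-zero exit code, for example by calling Environment.Exit(1).
- The process is forcibly killed, for example with the command taskkill /f /im <ProcessName>.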
Viewing the event log
Crash recovery entries appear under Windows Logs -> System, with Level set to Error and Source set to Service Control Manager.
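If you export these entries and want to pull out the crash count and the scheduled action, a regular expression over the message text is enough. The sketch below is based only on the message wording quoted in the examples in this post. Whether other Windows versions or display languages use the same wording is an assumption, and `parse_scm_message` is a name I made up.

```python
import re

# Pattern derived from the messages quoted in this post; the wording on
# other Windows versions or locales may differ (assumption).
PATTERN = re.compile(
    r"^The (?P<service>.+) service terminated unexpectedly\. "
    r"It has done this (?P<count>\d+) time\(s\)\."
    r"(?: The following corrective action will be taken in "
    r"(?P<delay_ms>\d+) milliseconds: (?P<action>.+?)\.)?$"
)


def parse_scm_message(message):
    """Return a dict describing one SCM crash entry, or None if it does not match."""
    match = PATTERN.match(message.strip())
    if match is None:
        return None
    delay = match.group("delay_ms")
    return {
        "service": match.group("service"),
        "count": int(match.group("count")),
        # delay_ms and action are None when no corrective action is listed.
        "delay_ms": int(delay) if delay is not None else None,
        "action": match.group("action"),
    }


if __name__ == "__main__":
    sample = ("The Test Windows Service service terminated unexpectedly. "
              "It has done this 2 time(s). The following corrective action "
              "will be taken in 3000 milliseconds: Restart the service.")
    print(parse_scm_message(sample))
```

The trailing part of the pattern is optional so that entries without a corrective action, such as the third crash in Example 2, still parse.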
Examples
Example 1: unlimited restarts
>> sc.exe failure TestWindowsService reset= 10 actions= restart/0/restart/3000
[SC] ChangeServiceConfig2 SUCCESS
>> sc.exe qfailure TestWindowsService
[SC] QueryServiceConfig2 SUCCESS
SERVICE_NAME: TestWindowsService
RESET_PERIOD (in seconds) : 10
REBOOT_MESSAGE :
COMMAND_LINE :
FAILURE_ACTIONS : RESTART -- Delay = 0 milliseconds.
RESTART -- Delay = 3000 milliseconds.
- First crash: restart immediately
  The Test Windows Service service terminated unexpectedly. It has done this 1 time(s). The following corrective action will be taken in 0 milliseconds: Restart the service.
- Second crash: restart after 3 s
  The Test Windows Service service terminated unexpectedly. It has done this 2 time(s). The following corrective action will be taken in 3000 milliseconds: Restart the service.
- Third crash: restart after 3 s
  The Test Windows Service service terminated unexpectedly. It has done this 3 time(s). The following corrective action will be taken in 3000 milliseconds: Restart the service.
- 10 s after the third crash, the crash count is reset
- First crash: restart immediately
  The Test Windows Service service terminated unexpectedly. It has done this 1 time(s). The following corrective action will be taken in 0 milliseconds: Restart the service.
Example 2: limiting the number of restarts
>> sc.exe failure TestWindowsService reset= 10 actions= restart/0/restart/3000/none/0
[SC] ChangeServiceConfig2 SUCCESS
>> sc.exe qfailure TestWindowsService
[SC] QueryServiceConfig2 SUCCESS
SERVICE_NAME: TestWindowsService
RESET_PERIOD (in seconds) : 10
REBOOT_MESSAGE :
COMMAND_LINE :
FAILURE_ACTIONS : RESTART -- Delay = 0 milliseconds.
RESTART -- Delay = 3000 milliseconds.
- First crash: restart immediately
  The Test Windows Service service terminated unexpectedly. It has done this 1 time(s). The following corrective action will be taken in 0 milliseconds: Restart the service.
- Second crash: restart after 3 s
  The Test Windows Service service terminated unexpectedly. It has done this 2 time(s). The following corrective action will be taken in 3000 milliseconds: Restart the service.
- Third crash: no restart
  The Test Windows Service service terminated unexpectedly. It has done this 3 time(s).
- 10 s after the third crash: still not restarted
Example 3: setting reset= to 0
>> sc.exe failure TestWindowsService reset= 0 actions= restart/0/restart/3000/none/0
[SC] ChangeServiceConfig2 SUCCESS
>> sc.exe qfailure TestWindowsService
[SC] QueryServiceConfig2 SUCCESS
SERVICE_NAME: TestWindowsService
RESET_PERIOD (in seconds) : 0
REBOOT_MESSAGE :
COMMAND_LINE :
FAILURE_ACTIONS : RESTART -- Delay = 0 milliseconds.
RESTART -- Delay = 3000 milliseconds.
- First crash: restart immediately; the crash count is reset right away
  The Test Windows Service service terminated unexpectedly. It has done this 1 time(s). The following corrective action will be taken in 0 milliseconds: Restart the service.
- Next crash, counted as the first again: restart immediately; the crash count is reset right away
  The Test Windows Service service terminated unexpectedly. It has done this 1 time(s). The following corrective action will be taken in 0 milliseconds: Restart the service.
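The behavior in the three examples is consistent with a simple model: each crash increments a counter, the counter selects an action (the last action repeats once the list runs out), and the counter goes back to zero when reset= seconds have passed since the most recent crash. The sketch below simulates that model. It reflects my reading of the observed behavior and is not SCM's actual implementation; the function name `simulate` and the crash timestamps are made up for illustration.

```python
def simulate(actions, reset_seconds, crash_times):
    """Return the (count, action, delay_ms) chosen for each crash.

    actions: list of (action, delay_ms) pairs, as passed to actions=.
    crash_times: crash timestamps in seconds, in ascending order.
    """
    results = []
    count = 0
    last_crash = None
    for t in crash_times:
        # Assumption: the count resets once reset_seconds have elapsed
        # since the most recent crash (not since the first one).
        if last_crash is not None and t - last_crash >= reset_seconds:
            count = 0
        count += 1
        # Past the end of the list, the last action keeps repeating.
        action, delay_ms = actions[min(count, len(actions)) - 1]
        results.append((count, action, delay_ms))
        last_crash = t
    return results


if __name__ == "__main__":
    # Example 1: crashes at 0 s, 2 s and 7 s, then again at 20 s.
    print(simulate([("restart", 0), ("restart", 3000)], 10, [0, 2, 7, 20]))
    # Example 2: the third crash hits the none action.
    print(simulate([("restart", 0), ("restart", 3000), ("none", 0)], 10, [0, 2, 7]))
    # Example 3: with reset= 0 every crash counts as the first.
    print(simulate([("restart", 0), ("restart", 3000), ("none", 0)], 0, [0, 5]))
```

Whether the real comparison is "at least" or "more than" reset= seconds is something I did not verify, so the boundary case in the sketch is a guess.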