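Our team runs a service that is deployed as a Windows Service, and we want Windows itself to restart it automatically when it crashes. I recently looked into how to do this and am writing down what I found.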

Setting the service's Startup Type

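When you create a Windows Service, you can set its StartupType, which specifies how the service is started after the operating system boots. StartupType can take the following values: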

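  • Automatic: the service is started automatically, as early as possible after the operating system boots.
  • AutomaticDelayedStart: the service is started automatically after the operating system boots, following a delay of unspecified length.
  • Manual: the service does not start on its own after the operating system boots and has to be started manually.
  • Disabled: the service cannot be started.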
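As a minimal sketch, the startup type of an existing service can also be changed from a script through `sc.exe config <ServiceName> start= <value>`. The helper below only builds the argument list. The function name and the mapping table are hypothetical, written for this post; the `start=` values are the ones I believe `sc.exe` accepts for these four startup types.

```python
# Hypothetical helper: maps the StartupType names used in this post to the
# values passed to `sc.exe config <ServiceName> start= <value>`.
SC_START_VALUES = {
    "Automatic": "auto",
    "AutomaticDelayedStart": "delayed-auto",
    "Manual": "demand",
    "Disabled": "disabled",
}


def build_sc_config_args(service_name, startup_type):
    """Return the argv list that changes a service's startup type."""
    if startup_type not in SC_START_VALUES:
        raise ValueError(f"unknown startup type: {startup_type}")
    # sc.exe expects "start=" and its value as two separate arguments.
    return ["sc.exe", "config", service_name, "start=", SC_START_VALUES[startup_type]]
```

On a Windows machine the returned list could be passed to `subprocess.run(args, check=True)` from an elevated prompt. Building the list separately keeps the mapping easy to check without touching a real service.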

Automatic vs. AutomaticDelayedStart

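Automatic takes priority over AutomaticDelayedStart. Early in the boot process, the Service Control Manager (SCM) starts all Automatic services first. Only once the system has settled does the SCM start the AutomaticDelayedStart services.

Because Automatic services are started during boot, they add to boot time; AutomaticDelayedStart services do not.

As a rule of thumb, choose Automatic for critical services and AutomaticDelayedStart for non-critical, resource-heavy services.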

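Setting the service's crash recovery policy

StartupType only controls how the service is started after the system boots. If the service crashes while it is running, restarting it requires a separately configured crash recovery policy.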

Command

The crash recovery policy is set with the following command:

sc.exe failure <ServiceName> reset= <ResetPeriodInSeconds> actions= <action1>/<delay1InMs>/<action2>/<delay2InMs>/... reboot= <RebootMessage> command= <CommandLine>

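where:

  • actions= lists the recovery actions to take after a crash.
    • Supported action types:
      • restart: restart the service
      • run: run a specified program (must be combined with command=)
      • reboot: reboot the machine (requires administrator privileges)
      • none (or any other string): do nothing
    • <delayXInMs> is how long to wait, in milliseconds, before the action is executed.
    • Once the number of crashes exceeds the number of actions, the last action is used for every subsequent crash, i.e. the last action is repeated indefinitely.
  • reset= is how long, in seconds, until the crash count is reset.
    • The period is measured from the most recent crash, not from the first one.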
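The `actions=` value is easy to get wrong by hand, so here is a minimal sketch that assembles the `sc.exe failure` arguments from a list of (action, delay) pairs. The function names are hypothetical. The helper deliberately rejects unknown action names instead of relying on them being treated as none, and it only covers the `reset=`, `actions=` and `command=` parameters described above.

```python
VALID_ACTIONS = {"restart", "run", "reboot", "none"}


def build_actions_value(actions):
    """Turn [("restart", 0), ("restart", 3000)] into "restart/0/restart/3000"."""
    parts = []
    for action, delay_ms in actions:
        if action not in VALID_ACTIONS:
            raise ValueError(f"unsupported action: {action}")
        if delay_ms < 0:
            raise ValueError("delay must not be negative")
        parts.append(f"{action}/{delay_ms}")
    return "/".join(parts)


def build_sc_failure_args(service_name, reset_seconds, actions, command=None):
    """Return the argv list for `sc.exe failure` (hypothetical helper)."""
    if command is None and any(action == "run" for action, _ in actions):
        # The run action needs a program to run, given via command=.
        raise ValueError("the run action requires command=")
    args = [
        "sc.exe", "failure", service_name,
        "reset=", str(reset_seconds),
        "actions=", build_actions_value(actions),
    ]
    if command is not None:
        args += ["command=", command]
    return args
```

For example, `build_sc_failure_args("TestWindowsService", 10, [("restart", 0), ("restart", 3000)])` produces the arguments of the command used in Example 1 of this post.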

The crash recovery policy can be viewed with the following command:

sc.exe qfailure <ServiceName>
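If the policy needs to be checked from a script, the text printed by `sc.exe qfailure` can be parsed. The sketch below assumes the output layout shown in the examples of this post (a `RESET_PERIOD (in seconds)` line and one `ACTION -- Delay = N milliseconds.` line per action). The function name is hypothetical, and other Windows versions or locales may print something different.

```python
import re

# Patterns written against the qfailure output shown in this post.
RESET_RE = re.compile(r"RESET_PERIOD \(in seconds\)\s*:\s*(\d+)")
ACTION_RE = re.compile(r"(\w+) -- Delay = (\d+) milliseconds\.")


def parse_qfailure(text):
    """Extract the reset period and the action list from qfailure output."""
    reset = RESET_RE.search(text)
    actions = [(name.lower(), int(delay)) for name, delay in ACTION_RE.findall(text)]
    return {
        # None when the reset period is missing or not a plain number.
        "reset_seconds": int(reset.group(1)) if reset else None,
        "actions": actions,
    }
```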

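Trigger conditions

Stopping the service manually does not trigger crash recovery. Recovery is triggered only when the service terminates unexpectedly, which includes the following cases:

  • An uncaught exception is thrown while the service is running and crashes the process.
  • The service process exits with a non-zero exit code, e.g. by calling Environment.Exit(1).
  • The process is forcibly killed, e.g. with the command taskkill /f /im <ProcessName>.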

Viewing the event log

The crash recovery entries can be found under Windows Logs -> System, with Level set to Error and Source set to Service Control Manager.
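When many such entries have to be checked, the message text can be picked apart with a regular expression. This is a minimal sketch that assumes the English message wording quoted in the examples of this post; the function name is hypothetical, and how the messages are exported from the event log is left out.

```python
import re

# Matches the "terminated unexpectedly" messages quoted in this post.
# The corrective-action part is optional because it is absent when no
# action is taken.
EVENT_RE = re.compile(
    r"The (?P<service>.+?) service terminated unexpectedly\.\s+"
    r"It has done this (?P<count>\d+) time\(s\)\."
    r"(?:\s+The following corrective action will be taken in "
    r"(?P<delay_ms>\d+) milliseconds: (?P<action>.+?)\.)?"
)


def parse_crash_event(message):
    """Return service name, crash count, delay and action, or None."""
    match = EVENT_RE.search(message)
    if match is None:
        return None
    delay = match.group("delay_ms")
    return {
        "service": match.group("service"),
        "count": int(match.group("count")),
        "delay_ms": int(delay) if delay is not None else None,
        "action": match.group("action"),
    }
```

Note that the service name in the message is the display name (Test Windows Service in the examples), not the name passed to sc.exe.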

Examples

Example 1: unlimited restarts

>> sc.exe failure TestWindowsService reset= 10 actions= restart/0/restart/3000
[SC] ChangeServiceConfig2 SUCCESS

>> sc.exe qfailure TestWindowsService
[SC] QueryServiceConfig2 SUCCESS

SERVICE_NAME: TestWindowsService
        RESET_PERIOD (in seconds)    : 10
        REBOOT_MESSAGE               :
        COMMAND_LINE                 :
        FAILURE_ACTIONS              : RESTART -- Delay = 0 milliseconds.
                                       RESTART -- Delay = 3000 milliseconds.
  • First crash: restarted immediately
      The Test Windows Service service terminated unexpectedly.  It has done this 1 time(s).  The following corrective action will be taken in 0 milliseconds: Restart the service.
    
  • Second crash: restarted after 3 s
      The Test Windows Service service terminated unexpectedly.  It has done this 2 time(s).  The following corrective action will be taken in 3000 milliseconds: Restart the service.
    
  • Third crash: restarted after 3 s
      The Test Windows Service service terminated unexpectedly.  It has done this 3 time(s).  The following corrective action will be taken in 3000 milliseconds: Restart the service.
    
  • 10 s after the third crash, the crash count is reset
  • First crash after the reset: restarted immediately
      The Test Windows Service service terminated unexpectedly.  It has done this 1 time(s).  The following corrective action will be taken in 0 milliseconds: Restart the service.
    

Example 2: limiting the number of restarts

>> sc.exe failure TestWindowsService reset= 10 actions= restart/0/restart/3000/none/0
[SC] ChangeServiceConfig2 SUCCESS

>> sc.exe qfailure TestWindowsService
[SC] QueryServiceConfig2 SUCCESS

SERVICE_NAME: TestWindowsService
        RESET_PERIOD (in seconds)    : 10
        REBOOT_MESSAGE               :
        COMMAND_LINE                 :
        FAILURE_ACTIONS              : RESTART -- Delay = 0 milliseconds.
                                       RESTART -- Delay = 3000 milliseconds.
  • First crash: restarted immediately
      The Test Windows Service service terminated unexpectedly.  It has done this 1 time(s).  The following corrective action will be taken in 0 milliseconds: Restart the service.
    
  • Second crash: restarted after 3 s
      The Test Windows Service service terminated unexpectedly.  It has done this 2 time(s).  The following corrective action will be taken in 3000 milliseconds: Restart the service.
    
  • Third crash: not restarted
      The Test Windows Service service terminated unexpectedly.  It has done this 3 time(s).
    
  • 10 s after the third crash: still not restarted

Example 3: setting reset= to 0

>> sc.exe failure TestWindowsService reset= 0 actions= restart/0/restart/3000/none/0
[SC] ChangeServiceConfig2 SUCCESS

>> sc.exe qfailure TestWindowsService
[SC] QueryServiceConfig2 SUCCESS

SERVICE_NAME: TestWindowsService
        RESET_PERIOD (in seconds)    : 0
        REBOOT_MESSAGE               :
        COMMAND_LINE                 :
        FAILURE_ACTIONS              : RESTART -- Delay = 0 milliseconds.
                                       RESTART -- Delay = 3000 milliseconds.
  • First crash: restarted immediately, and the crash count is reset right away
      The Test Windows Service service terminated unexpectedly.  It has done this 1 time(s).  The following corrective action will be taken in 0 milliseconds: Restart the service.
    
  • Next crash: counted as the first crash again, so it is restarted immediately and the crash count is reset right away
      The Test Windows Service service terminated unexpectedly.  It has done this 1 time(s).  The following corrective action will be taken in 0 milliseconds: Restart the service.
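The behavior observed in the three examples can be summarized in a small model. This is only a sketch of what I observed, not how the SCM is actually implemented: the function name is hypothetical, crash times are given in seconds, and treating a gap equal to the reset period as "reset" is my assumption.

```python
def simulate_crashes(actions, reset_seconds, crash_times):
    """Model which action is chosen for each crash.

    actions:      list of (action, delay_ms), as configured with actions=
    crash_times:  ascending crash timestamps in seconds
    Returns a list of (crash_count, action, delay_ms), one per crash.
    """
    if not actions:
        raise ValueError("at least one action is required")
    results = []
    count = 0
    last_crash = None
    for t in crash_times:
        # The reset period is measured from the most recent crash.
        if last_crash is not None and t - last_crash >= reset_seconds:
            count = 0
        count += 1
        last_crash = t
        # Once crashes outnumber the actions, the last action is reused.
        action, delay_ms = actions[min(count, len(actions)) - 1]
        results.append((count, action, delay_ms))
    return results
```

With the configuration of Example 1 and crashes at 0, 2, 8 and 30 seconds, the model yields counts 1, 2, 3 and then 1 again. With reset= set to 0 as in Example 3, every crash is counted as the first one. The model does not know that a service left stopped by a none action cannot crash again until someone starts it.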
    
