1. Checkpoint and almost Restart in Open MPI

    Checkpoint/restart with CRIU has been possible since Fedora 19, so I started adding CRIU support to Open MPI. With my commit 30772 it is now possible to checkpoint a process running under Open MPI. The restart functionality is not yet implemented but should be available soon. I have a test case (orte-test) that prints its PID and sleeps for one second in a loop, which I start under orterun like this:

    /path/to/orterun --mca ft_cr_enabled 1 --mca opal_cr_use_thread 1 --mca oob tcp --mca crs_criu_verbose 30 --np 1 orte-test

    The options have the following meaning:

    • --mca ft_cr_enabled 1
      • ft stands for fault tolerance
      • cr stands for checkpoint/restart
      • this option enables the checkpoint/restart functionality
    • --mca opal_cr_use_thread 1: use an additional thread to control checkpoint/restart operations
    • --mca oob tcp: use TCP instead of Unix domain sockets for out-of-band communication (the socket code needs some additional changes for C/R to work)
    • --mca crs_criu_verbose 30: print all CRIU debug messages
    • --np 1: spawn one instance of the test case
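The behavior of the test case described above (print the PID, sleep one second, repeat) can be sketched in a few lines. This is only a stand-in for illustration, not the real orte-test: the function name `print_pid_loop` and its parameters are hypothetical, and the sketch does not initialize the Open MPI runtime, so it would not load the crs:criu component the way the real test case does.

```python
import os
import sys
import time


def print_pid_loop(iterations=None, interval=1.0, out=None):
    """Print this process's PID, sleep, and repeat.

    iterations=None loops until the process is killed, which mirrors the
    behavior described for the test case. The bounded variant exists only
    so the sketch can be exercised quickly (an assumption, not part of
    the original test case).
    """
    if out is None:
        out = sys.stdout
    pid = os.getpid()
    count = 0
    while iterations is None or count < iterations:
        # Same line format as the "Process <pid>" lines in the output.
        print(f"Process {pid}", file=out, flush=True)
        time.sleep(interval)
        count += 1
    return count
```

Calling `print_pid_loop()` without arguments loops forever with a one-second pause; `print_pid_loop(iterations=3)` prints three lines and returns.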

    The output of the test case looks like this:

    [dcbz:12563] crs:criu: open()
    [dcbz:12563] crs:criu: open: priority = 10
    [dcbz:12563] crs:criu: open: verbosity = 30
    [dcbz:12563] crs:criu: open: log_file = criu.log
    [dcbz:12563] crs:criu: open: log_level = 0
    [dcbz:12563] crs:criu: open: tcp_established = 1
    [dcbz:12563] crs:criu: open: shell_job = 1
    [dcbz:12563] crs:criu: open: ext_unix_sk = 1
    [dcbz:12563] crs:criu: open: leave_running = 1
    [dcbz:12563] crs:criu: component_query()
    [dcbz:12563] crs:criu: module_init()
    [dcbz:12563] crs:criu: opal_crs_criu_prelaunch
    [dcbz:12565] crs:criu: open()
    [dcbz:12565] crs:criu: open: priority = 10
    [dcbz:12565] crs:criu: open: verbosity = 30
    [dcbz:12565] crs:criu: open: log_file = criu.log
    [dcbz:12565] crs:criu: open: log_level = 0
    [dcbz:12565] crs:criu: open: tcp_established = 1
    [dcbz:12565] crs:criu: open: shell_job = 1
    [dcbz:12565] crs:criu: open: ext_unix_sk = 1
    [dcbz:12565] crs:criu: open: leave_running = 1
    [dcbz:12565] crs:criu: component_query()
    [dcbz:12565] crs:criu: module_init()
    [dcbz:12565] crs:criu: opal_crs_criu_reg_thread
    Process 12565
    Process 12565
    Process 12565
    

    To start the checkpoint operation the Open MPI tool orte-checkpoint is used:

    /path/to/orte-checkpoint -V 10 `pidof orterun`

    which outputs the following:

    [dcbz:12570] orte_checkpoint: Checkpointing...
    [dcbz:12570] PID 12563
    [dcbz:12570] Connected to Mpirun [[56676,0],0]
    [dcbz:12570] orte_checkpoint: notify_hnp: Contact Head Node Process PID 12563
    [dcbz:12570] orte_checkpoint: notify_hnp: Requested a checkpoint of jobid [INVALID]
    [dcbz:12570] orte_checkpoint: hnp_receiver: Receive a command message.
    [dcbz:12570] orte_checkpoint: hnp_receiver: Status Update.
    [dcbz:12570] [ 0.00 / 0.08] Requested - ...
    [dcbz:12570] orte_checkpoint: hnp_receiver: Receive a command message.
    [dcbz:12570] orte_checkpoint: hnp_receiver: Status Update.
    [dcbz:12570] [ 0.00 / 0.08] Pending - ...
    [dcbz:12570] orte_checkpoint: hnp_receiver: Receive a command message.
    [dcbz:12570] orte_checkpoint: hnp_receiver: Status Update.
    [dcbz:12570] [ 0.00 / 0.08] Running - ...
    [dcbz:12570] orte_checkpoint: hnp_receiver: Receive a command message.
    [dcbz:12570] orte_checkpoint: hnp_receiver: Status Update.
    [dcbz:12570] [ 0.06 / 0.14] Locally Finished - ...
    [dcbz:12570] orte_checkpoint: hnp_receiver: Receive a command message.
    [dcbz:12570] orte_checkpoint: hnp_receiver: Status Update.
    [dcbz:12570] [ 0.00 / 0.14] Checkpoint Established - ompi_global_snapshot_12563.ckpt
    [dcbz:12570] orte_checkpoint: hnp_receiver: Receive a command message.
    [dcbz:12570] orte_checkpoint: hnp_receiver: Status Update.
    [dcbz:12570] [ 0.00 / 0.14] Continuing/Recovered - ompi_global_snapshot_12563.ckpt
    Snapshot Ref.: 0 ompi_global_snapshot_12563.ckpt
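For scripting around orte-checkpoint, the interesting part of this output is the final "Snapshot Ref." entry. A minimal sketch of extracting it from captured output follows. The helper name `find_snapshot_ref` is hypothetical, and reading the number as a sequence number and the following token as the global snapshot name is an assumption based on the output shown here.

```python
import re

# Matches e.g. "Snapshot Ref.: 0 ompi_global_snapshot_12563.ckpt"
SNAPSHOT_REF = re.compile(r"Snapshot Ref\.:\s*(\d+)\s+(\S+)")


def find_snapshot_ref(output):
    """Return (number, snapshot_name) from orte-checkpoint output, or None.

    The number is assumed to be the checkpoint sequence number and the
    name the global snapshot reference; both are read off the sample
    output, not taken from Open MPI documentation.
    """
    match = SNAPSHOT_REF.search(output)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)
```

Applied to the output above, this would yield the number 0 and the name ompi_global_snapshot_12563.ckpt.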
    

    orte-checkpoint connects to the previously started orterun process and requests a checkpoint. After receiving the checkpoint request, orterun outputs the following:

    [dcbz:12565] crs:criu: checkpoint(12565, ---)
    [dcbz:12565] crs:criu: criu_init_opts() returned 0
    [dcbz:12565] crs:criu: opening snapshot directory /home/adrian/ompi_global_snapshot_12563.ckpt/0/opal_snapshot_0.ckpt
    [dcbz:12563] 12563: Checkpoint established for process [56676,0].
    [dcbz:12563] 12563: Successfully restarted process [56676,0].
    Process 12565
    

    At this point the checkpoint has been written to disk and the process continues (printing its PID).
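The snapshot directory in the log above appears to be composed of a base directory, the global snapshot name, a number, and a per-process directory. The following sketch rebuilds that path. The function name `local_snapshot_dir` and the interpretation of the components (checkpoint sequence number, process rank) are assumptions read off this single log line.

```python
from pathlib import PurePosixPath


def local_snapshot_dir(base_dir, global_snapshot, sequence, rank):
    """Build <base>/<global snapshot>/<sequence>/opal_snapshot_<rank>.ckpt.

    The layout is inferred from the one path printed in the log; the
    meaning of 'sequence' and 'rank' is an assumption.
    """
    return (
        PurePosixPath(base_dir)
        / global_snapshot
        / str(sequence)
        / f"opal_snapshot_{rank}.ckpt"
    )
```

With the values from the log (`"/home/adrian"`, `"ompi_global_snapshot_12563.ckpt"`, `0`, `0`) this reproduces the directory that crs:criu reports opening.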

    To complete the checkpoint/restart functionality, I still have to implement restart in Open MPI and take care of the Unix domain sockets (shutting them down for checkpointing).

    This requires the latest criu package (criu-1.1-4), which includes the headers needed to build Open MPI against CRIU as well as the CRIU service.

