Chapter 38. Workflows

38.1. Overview

lb-workflow is a language and toolset to specify and execute complete batch procedures. Figure 38.1, “Architecture overview” shows an overview of lb-workflow’s architecture. In lb-workflow, batch procedures are called workflows and are specified in the lb-workflow language. The workflow specification is translated into logic and installed in a workspace. The state of the execution of the workflow is very precisely maintained in the workspace. The lb-workflow driver is a process that continuously polls the workspace for tasks to execute. The driver executes tasks, mostly via service invocations, and returns results to the workspace. The termination of a task may enable new tasks, which are subsequently polled and executed by the driver.

The lb-workflow workspace exposes an lb-web protobuf status service that provides detailed information about installed workflows. The lb-workflow console is a utility to monitor and interact with the workflow. It uses the status service to query the workspace, and executes actions directly in the workspace (e.g., to cancel tasks). Currently, the console is implemented as a command line utility, but it is designed so that it can be replaced with a separate administration user-interface that is multi-tenant.

Figure 38.1. Architecture overview

Arrows represent invocations; large arrow-heads indicate the direction of the request, whereas small arrow-heads indicate the response. Colored elements represent independent processes; only the lb-workflow workspace persists its data. Services most often include lb-web services.

A workflow specification is a tree of lb-workflow processes (which need not be OS processes). A process can be a task or a compositional operator. A task is a process that instructs the driver to execute some functionality. A compositional operator is a process that ties other processes together.

For example, there is a task to export TDX files, and there are operators to execute tasks in sequence or in parallel. Since the focus of lb-workflow is on orchestration, most tasks involve service invocations. There are also some simple utility tasks, but the general idea is to push all complexity to the services.

Example 38.1. A sequence of two tasks

The first task is enabled when the workflow starts. When the driver polls for work to do, it will receive instructions to start this task. The driver will then execute the task, which in this case means printing “Hello World!” to its output, and will send a success action to the workspace. Logic in the workspace will respond by enabling the second task. This time, the execution of the task will cause the driver to print “Oops!” and send a fail action. The workflow will stop execution and will wait for user intervention.

main {
  lb.Log(msg = "Hello World!")
  ;
  lb.Fail(msg = "Oops!")
}

Note

Note that lb.Log and lb.Fail tasks are designed to facilitate tests and should not be used in production.

In lb-workflow, developers specify a number of tasks and then compose them into a workflow. A task is an instance of an operation predefined by the lb-workflow standard library (stdlib). Here are some examples of these tasks:

  • tasks calling TDX services for importing or exporting data;
  • tasks to communicate with lb-jobs, for example to create a job or to wait for its termination;
  • tasks to communicate with the workbook framework, for example to create or delete workbooks;
  • trivial task types, such as logging or getting the values of environment variables.

A complete overview of all the tasks and their specification can be found in the lb-workflow API documentation.

The following composition operators are supported:

  • sequential: the ; operator
  • parallel: the || operator
  • looping: the forall operator

A workflow is a description of a tree of processes. The leaves of the tree are tasks, and the inner nodes are composition operators.

Example 38.2. 

Here is a simplified version of the tree we would create for a trivial workflow that logs something, then exports some data in parallel with creating some workbook, then imports some data. Note that the leaves are tasks.

seq
  Log
  par
    TdxExport
    CreateWorkbook
  TdxImport 

The description is installed in a workspace that contains the lb-workflow library. This library implements the semantics of the composition operators and exposes essentially two services:

  1. a poll service that clients can use to ask which tasks are ready to execute;
  2. an outcome service that allows clients to report the outcome of the execution of a task.

The workflow driver is a Java program that constantly polls the workspace for tasks to perform, executes them, and sends the results back to the workspace.

So, in our example, when the workflow starts, the workspace will report that the task Log is ready. The driver will execute the task and send back a "success" event. Then the workspace will report that the tasks TdxExport and CreateWorkbook are ready, so the driver will start their execution in parallel. It is only after they have both reported successful outcomes that the TdxImport task will be ready for execution.
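The walkthrough above can be sketched in a few lines of Python. This is only an illustrative model, not the lb-workflow library or its Java driver: the names Node, poll, report_success and finished are hypothetical, every task is assumed to succeed as soon as it is polled, and task names are assumed to be unique within the tree.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    kind: str                                  # "task", "seq" or "par"
    name: str = ""
    children: list = field(default_factory=list)
    done: bool = False                         # only meaningful for tasks


def task(name):
    return Node("task", name)


def seq(*children):
    return Node("seq", children=list(children))


def par(*children):
    return Node("par", children=list(children))


def finished(node):
    """A task is finished once it reported success; an operator once all children are."""
    if node.kind == "task":
        return node.done
    return all(finished(child) for child in node.children)


def poll(node):
    """Hypothetical poll service: names of the tasks that are ready to execute."""
    if node.kind == "task":
        return [] if node.done else [node.name]
    if node.kind == "seq":
        # Only the first unfinished child of a sequence is enabled.
        for child in node.children:
            if not finished(child):
                return poll(child)
        return []
    # par: every unfinished child is enabled at the same time.
    ready = []
    for child in node.children:
        ready.extend(poll(child))
    return ready


def report_success(node, name):
    """Hypothetical outcome service: record a successful outcome for a task."""
    if node.kind == "task":
        if node.name == name:
            node.done = True
    else:
        for child in node.children:
            report_success(child, name)


tree = seq(
    task("Log"),
    par(task("TdxExport"), task("CreateWorkbook")),
    task("TdxImport"),
)

# A toy driver loop: poll, "execute", report success, repeat.
rounds = []
while not finished(tree):
    ready = poll(tree)
    rounds.append(ready)
    for name in ready:
        report_success(tree, name)

print(rounds)
```

Each inner list in rounds is the set of tasks reported as ready by one poll. TdxImport only shows up after both children of the par node have reported success.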

Execution of a task is not atomic, but goes through the following stages:

  • A task is in state INIT when it is ready to execute;
  • then it moves to StartRequested when the driver attempts to start it;
  • it moves to Executing when the driver acknowledges that it has started;
  • it moves to CleanupRequested when the driver reports success;
  • finally, it moves to END when the driver has finished cleaning up for the task.
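As a rough sketch, the stages listed above can be modelled as a transition table. The state names come from the list; the event names ack_start and cleanup_done are placeholders invented for this illustration, and failure, cancellation and abort transitions are deliberately left out.

```python
from enum import Enum


class State(Enum):
    INIT = "INIT"
    START_REQUESTED = "StartRequested"
    EXECUTING = "Executing"
    CLEANUP_REQUESTED = "CleanupRequested"
    END = "END"


# Happy-path transitions only. Event names are hypothetical.
TRANSITIONS = {
    (State.INIT, "start"): State.START_REQUESTED,
    (State.START_REQUESTED, "ack_start"): State.EXECUTING,
    (State.EXECUTING, "success"): State.CLEANUP_REQUESTED,
    (State.CLEANUP_REQUESTED, "cleanup_done"): State.END,
}


def step(state, event):
    """Return the next state, or raise if the event is not allowed in this state."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(
            f"event {event!r} is not allowed in state {state.value}"
        ) from None


def run(events, state=State.INIT):
    """Apply a list of events and return the trace of visited states."""
    trace = [state]
    for event in events:
        state = step(state, event)
        trace.append(state)
    return trace


trace = run(["start", "ack_start", "success", "cleanup_done"])
print([s.value for s in trace])
```

Keeping the transitions in a table makes illegal moves explicit: for example, a success event for a task that is still in INIT is rejected rather than silently accepted.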

38.1.1. Principles

lb-workflow is a language to specify complete batch procedures, and a tool to execute those batch procedures. The tool has extensive administrative features to inspect status, intervene, and handle failures.

Workflows use task implementations to get the actual work done. The generic parts of lb-workflow do nothing but coordinate the execution of these task implementations. Currently, almost all the task implementations call out to services, for example to import CSV data, to install workspaces, etc. The implementation of lb-workflow is modular, in the sense that it can be extended with new kinds of tasks.

The state of the execution of a workflow is very precisely maintained in a database. The driver daemon continuously polls the database for work to do. The lb-workflow command-line utility interacts with the workflow database to start the execution of a workflow, abort workflows, etc.

38.2. The lb-workflow Workspace

  • The goal is to have very detailed information about the state and history of the workflow: all relevant information is made persistent to allow restarting, monitoring, etc.
  • Everything that should persist is in the workspace, so the driver can die without losing workflow state.
  • A single workflow database can contain multiple workflows, for example, different batch procedures that are to be run monthly, weekly and daily.

Figure 38.2. The lb-workflow Workspace

Multiple workflow specifications can be installed by-name in a workspace (e.g., daily and backup workflows). They conform to the lb-workflow schema. A workflow can then be instantiated (started) multiple times. Each workflow instance gives rise to a process tree which is identified by a root process instance. The leaves of the process tree are tasks, and the inner nodes are composition operators. Each task instance has its state tracked by its own state machine. Thus, the states of individual task instances are independent from each other (as illustrated by failed instances in red).

Figure 38.3. SyncTask State Machine

Each process instance of a SyncTask has its state tracked by this state machine. When the task is enabled (ready to execute), the state machine is instantiated and the state is INIT. Arrows in blue represent transitions that are automatically triggered by actions performed by the driver, which include the normal control flow that leads to the successful END state. Arrows in green represent actions triggered by the console, when manual intervention is required due to failure. Arrows in red are internal transitions triggered by having the root process instance of the tree aborted, which causes all tasks in the tree to be automatically aborted as well.

  • !start is an output action.
  • When in StartRequested, the driver is trying to execute the task. It will answer with success or fail.
  • startup is needed to recover when the driver crashes and is restarted.
  • Failure handling: retry vs ignore; ack_failure.
  • END vs. ABORTED: Note that the END state allows the workflow to continue (i.e., it enables subsequent processes), whereas ABORTED represents a dead-end: the workflow instance is completely aborted and cannot be restarted.

Figure 38.4. AsyncTask State Machine

This state machine represents the state of an AsyncTask instance. This is more complex than SyncTask because it supports task cancellation, and because the driver keeps state about AsyncTask executions. Nodes in gray are abstract states; they represent a high level state of the task and have no direct counterpart in the actual state machine. Below we will show the concrete states and transitions under some of these abstract states.

  • Note that the last action before END or ABORTED is an output, !cleanup.
  • The other difference is support for cancel.

Figure 38.5. AsyncTask Execution State Machine

This is a refinement of the Task Execution abstract state of the AsyncTask state machine. It focuses on the normal execution steps, when no exceptions occur, such as failures or cancellations. Arrows without a target state connect to the Failure, Cancel or Abort abstract states described before.

  • The actual execution of the task only starts when the request was acknowledged.
  • These tasks often hold resources, so there’s a final cleanup step that allows tasks to release resources.
  • When the driver restarts, if the task is still in StartRequested, it can safely go back to INIT because we know that the driver never started it. If it is in state Executing, a previous instance of the driver may have started it, so the task goes to failed to await manual intervention (to be retried or ignored). If in CleanupRequested, we know that the task was successfully executed by a previous instance, so it stays in that state until the new driver eventually cleans it up.
  • At any point, abort may be requested. If the task is in INIT, it can go directly to ABORTED.
  • A task can be cancelled while it is being executed (StartRequested or Executing), but not after it’s been executed (CleanupRequested).
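The restart and cancellation rules above can be summarized in a small Python sketch. The function names are hypothetical, and the state name Failed is an assumption standing in for the failed state mentioned in the text; the actual transitions are defined by the lb-workflow state machines.

```python
# States in which a driver restart changes the task state.
# "Failed" is an assumed name for the state that awaits manual intervention.
RECOVERY_ON_RESTART = {
    "StartRequested": "INIT",   # the driver never started the task: safe to restart
    "Executing": "Failed",      # a previous driver may have started it: needs a human
}

# A task can be cancelled only while it is being executed.
CANCELLABLE = {"StartRequested", "Executing"}


def recover_on_driver_restart(state):
    """Return the state a task should be in after the driver restarts."""
    # Any other state (for example CleanupRequested) is left untouched.
    return RECOVERY_ON_RESTART.get(state, state)


def can_cancel(state):
    """Return True if a cancel request is accepted in the given state."""
    return state in CANCELLABLE


for state in ["INIT", "StartRequested", "Executing", "CleanupRequested"]:
    print(state, "->", recover_on_driver_restart(state), "| cancellable:", can_cancel(state))
```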

Figure 38.6. AsyncTask Abort State Machine

This is a refinement of the Abort abstract state of the AsyncTask state machine. Once an internal abort action is triggered, the task starts the abort process, which invariably leads to the ABORTED end state.

  • !do_cancel allows the driver to attempt to cancel the task gracefully.
  • !cleanup allows the driver to release resources and remove internal records of the task.
  • Note that success and fail transitions are handled to ensure that the task is aborted even if the driver concurrently finished its execution.
  • Cancel and Failure abstract states are very similar; see stdlib.wf code in the lb-workflow API documentation.

38.3. The lb-workflow Language

38.3.1. Controlling concurrency

lb-workflow is designed to maximize task concurrency: the || (parallel) operator executes tasks concurrently; the forall operator by default executes all children in parallel. This may lead to excessive concurrency, for example, if multiple nodes execute requests against a server that is not capable of handling them concurrently. lb-workflow provides two ways of controlling concurrency: forall max and concurrency groups.

The forall operator accepts a max parameter that limits the number of children that are enabled concurrently. The value must be an integer. It can be a constant or a variable (as long as the variable is resolved statically). Consider the following workflow:

workflow forall_max_example() {
  forall<max=1>(i in {1, 2}) {
    lb.string.Join(
      parts = {
        "$(i)"
      }
    )
    ;
    lb.tdx.Import(input = {"..."})
  }
}

The children of the forall are seq nodes with two children, one for Join and one for Import. But since only one of these seq nodes can be executing at a certain point in time, the next Join will only start after the previous Import is finished. So the workflow is really just a serialization of the seq nodes: Join; Import; Join; Import. But often what you really want is to control the concurrency on the Import, whereas instances of Join could all run in parallel.
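To see why max=1 serializes whole children rather than individual tasks, here is a small simulation in Python using asyncio. It only mimics the scheduling behavior described above; forall, run_task and max_children are names made up for this sketch and are not part of lb-workflow.

```python
import asyncio


async def run_task(name, log):
    """Stand-in for a task: record that it started, then yield to other children."""
    log.append(name)
    await asyncio.sleep(0)


async def forall(items, body, max_children=None):
    """Run body(i) for every item, with at most max_children bodies enabled at once."""
    limit = asyncio.Semaphore(max_children or max(len(items), 1))

    async def child(i):
        # The limit covers the whole child (Join; Import), not a single task.
        async with limit:
            await body(i)

    await asyncio.gather(*(child(i) for i in items))


async def main(max_children):
    log = []

    async def body(i):
        await run_task(f"Join {i}", log)      # first step of the sequence
        await run_task(f"Import {i}", log)    # second step of the sequence

    await forall([1, 2], body, max_children=max_children)
    return log


serialized = asyncio.run(main(max_children=1))
print(serialized)  # ['Join 1', 'Import 1', 'Join 2', 'Import 2']
```

With max_children=1 the second Join cannot start until the first Import has finished, which is exactly the serialization described above.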

For this reason there is a second way to control concurrency, called concurrency groups. A task can pass meta information to the driver, so you can remove the max and control the concurrency at the task level instead:

workflow driver_meta_example() {
  forall(i in {1, 2}) {
    lb.string.Join(
      parts = {
        "$(i)"
      }
    )
    ;
    lb.tdx.Import(
      input = {"..."},
      driver_meta = "{ group: 'foo', max: 1}"
    )
  }
}

All the instances of Join would run in parallel, but the driver would execute at most one Import at a time.
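The following Python sketch simulates that behavior with one semaphore per group. It is a toy model, not the driver's implementation: the Driver class and its attributes are hypothetical, and driver_meta is passed as an already parsed dictionary rather than as the string used in the workflow above.

```python
import asyncio


class Driver:
    """Toy driver that limits concurrency per group (hypothetical model)."""

    def __init__(self):
        self.groups = {}    # group name -> asyncio.Semaphore
        self.running = {}   # group name -> number of tasks currently executing
        self.peak = {}      # group name -> highest concurrency observed
        self.log = []

    async def execute(self, name, driver_meta=None):
        if driver_meta is None:
            # Tasks without a group are never throttled.
            await self._run(name)
            return
        group, limit = driver_meta["group"], driver_meta["max"]
        sem = self.groups.setdefault(group, asyncio.Semaphore(limit))
        async with sem:
            self.running[group] = self.running.get(group, 0) + 1
            self.peak[group] = max(self.peak.get(group, 0), self.running[group])
            try:
                await self._run(name)
            finally:
                self.running[group] -= 1

    async def _run(self, name):
        self.log.append(f"start {name}")
        await asyncio.sleep(0)   # yield so that other tasks can make progress
        self.log.append(f"end {name}")


async def main():
    driver = Driver()

    async def child(i):
        await driver.execute(f"Join {i}")
        await driver.execute(f"Import {i}", {"group": "foo", "max": 1})

    # forall without max: both children are enabled at once.
    await asyncio.gather(child(1), child(2))
    return driver


driver = asyncio.run(main())
print(driver.log)
print("peak concurrency in group foo:", driver.peak["foo"])
```

Both Join tasks start before any Import does, while the peak concurrency recorded for the foo group never exceeds one.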

However, there are two caveats:

  • Groups are global, and “the last max wins”. This means that if somewhere else in your tree you have some other task that participates in the same group, it is indeed the same group. This is great because you can control access to a certain server globally (by using the server name as the group name, for example). But when the driver picks such a task, it reads the max value and overwrites the previous max of the group, so you should use max consistently.

    The following example shows the possible consequences of these features. The two forall loops are executed in parallel. The driver can pick Export and Import tasks in parallel but, since they are in the same group, only one of them will execute at a certain point in time, even if they are in distinct branches of the process tree. Also note that the max value is used consistently in all tasks within the foo group.

    workflow driver_meta_global_example() {
      (
        forall(j in {3, 4}) {
          lb.tdx.Export(
            input = {"..."},
            driver_meta = "{ group: 'foo', max: 1 }"
          )
        }
      ) || (
        forall(i in {1, 2}) {
          lb.string.Join(
            parts = {
              "$(i)"
            }
          )
          ;
          lb.tdx.Import(
            input = {"..."},
            driver_meta = "{ group: 'foo', max: 1 }"
          )
        }
      )
    }

  • This only works for tasks, not workflows. That’s because only tasks are executed by the driver.
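The "last max wins" caveat can be illustrated with a minimal Python sketch. The GroupLimits class is hypothetical and only models the overwrite behavior described in the first caveat; it does not show how the driver actually stores group limits.

```python
class GroupLimits:
    """Hypothetical registry of per-group concurrency limits."""

    def __init__(self):
        self.max_by_group = {}

    def observe(self, group, max_value):
        """Called whenever the driver picks a task that names a group."""
        # The value carried by the most recently picked task overwrites
        # whatever limit the group had before: the last max wins.
        self.max_by_group[group] = max_value
        return self.max_by_group[group]


limits = GroupLimits()
limits.observe("foo", 1)   # an Export task declares max 1
limits.observe("foo", 5)   # an Import task elsewhere in the tree declares max 5
print(limits.max_by_group["foo"])  # 5
```

In this model the group silently switches from one concurrent task to five, which is why every task in a group should declare the same max.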