LogicBlox, Inc.

The lb-workflow Guide

Overview

lb-workflow is a language and toolset to specify and execute complete batch procedures. Figure 1 shows an overview of lb-workflow’s architecture. In lb-workflow, batch procedures are called workflows and are specified in the lb-workflow language. The workflow specification is translated into logic and installed in a workspace. The state of the execution of the workflow is very precisely maintained in the workspace. The lb-workflow driver is a process that continuously polls the workspace for tasks to execute. The driver executes tasks, mostly via service invocations, and returns results to the workspace. The termination of a task may enable new tasks, which are subsequently polled and executed by the driver.

The lb-workflow workspace exposes an lb-web protobuf status service that provides detailed information about installed workflows. The lb-workflow console is a utility to monitor and interact with the workflow. It uses the status service to query the workspace, and executes actions directly in the workspace (e.g., to cancel tasks). Currently, the console is implemented as a command line utility, but it is designed so that it can be replaced with a separate administration user-interface that is multi-tenant.

Architecture overview

Figure 1 - Architecture overview. Arrows represent invocations; large arrow-heads indicate the direction of the request, whereas small arrow-heads indicate the response. Colored elements represent independent processes; only the lb-workflow workspace persists its data. The services invoked are most often lb-web services.

A workflow specification is a tree of processes. A process is either a task – a process that instructs the driver to execute some functionality – or a compositional operator – a process that ties other processes together.

For example, there is a task to export TDX files, and there are operators to execute tasks in sequence or in parallel.

  • Since the focus of lb-workflow is on orchestration, most tasks involve service invocations. There are also some simple utility tasks, but the goal is to push all of that complexity to the services.

Example 1 - A sequence of 2 tasks. The first task is enabled when the workflow starts. When the driver polls for work to do, it will receive instructions to start this task. The driver will then execute the task, which in this case means printing “Hello World!” to its output, and will send a success action to the workspace. Logic in the workspace will respond by enabling the second task. This time, the execution of the task will cause the driver to print “Oops!” and send a fail action. The workflow will stop execution and will await user intervention.

Note that lb.Log and lb.Fail tasks are designed to facilitate tests and should not be used in production.

main {
  lb.Log(msg = "Hello World!")
  ;
  lb.Fail(msg = "Oops!")
}

In lb-workflow you specify a set of “tasks” and then compose them together.

A task is an instance of some operation predefined by the lb-workflow stdlib. For example, there’s a task type that calls a TDX service for import and another that calls it for export; there are task types to communicate with steve jobs (create jobs and wait for their termination), to create workbooks, etc. There are also trivial task types, like logging or reading environment variable values.

Then you have composition operators, which are the usual sequential (denoted by ‘;’), parallel (denoted by ‘||’) and forall (for looping).
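To make the operators concrete, here is a small sketch that runs one task, then two in parallel, then loops. It uses only the lb.Log utility task; the messages are purely illustrative:

main {
  lb.Log(msg = "first")
  ;
  (
    lb.Log(msg = "second, left branch")
    ||
    lb.Log(msg = "second, right branch")
  )
  ;
  forall(i in {1, 2}) {
    lb.Log(msg = "third, iteration $(i)")
  }
}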

So a workflow is a tree of these “processes” (process is just the name we use to talk about tasks and composition operators together). The leaves of the tree are tasks, and the inner nodes are composition operators. For example, this is a simplification of the tree we would create for a trivial workflow that logs something, then exports some data in parallel with creating some workbook, then imports some data. Note that the leaves are tasks.

seq
├── Log
├── par
│   ├── TdxExport
│   └── CreateWorkbook
└── TdxImport
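In the lb-workflow language itself, this tree could be written along the following lines. Task parameters are elided with “...”, and the exact parameters of lb.wb.CreateWorkbook are omitted here:

main {
  lb.Log(msg = "...")
  ;
  (
    lb.tdx.Export(input = {"..."})
    ||
    lb.wb.CreateWorkbook(...)
  )
  ;
  lb.tdx.Import(input = {"..."})
}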

This specification is then installed in a workspace that contains the lb-workflow library. This library implements the semantics of the composition operators and exposes essentially two services: 1) a “poll” service that clients can use to ask which tasks are ready to execute, and 2) an “outcome” service that allows clients to report the outcome of the execution of a task. Then there is the so-called “driver”, a Java program that constantly asks the workspace for work to do, executes it, and sends back the results. This shouldn’t surprise you much, because it is similar to the architecture of WAG.

So, in our example, when the workflow starts, the workspace will report that Log is ready. The driver would execute that and send a “success” event. Then the workspace will report that TdxExport and CreateWorkbook are ready, so the driver starts tasks in parallel for them. Only when both have reported successful outcomes to the workspace will TdxImport become ready.

What I omitted so far is that task execution is not atomic at all. In fact, there’s a fairly complex state machine that describes the life-cycle of a task. For example, a task is in state INIT when it is ready to execute; it moves to StartRequested when the driver is about to start it; it moves to Executing when the driver acknowledges that it started, and to CleanupRequested when the driver reports success. Finally, it moves to END when the driver has finished cleaning up after the task.

Principles

  • the driver can die at any time; everything that must survive is persisted in the workspace
  • minimal processing: it’s all about orchestration

lb-workflow is a language to specify complete batch procedures, and a tool to execute those batch procedures with extensive administrative features to inspect status, intervene, and handle failures.

Workflows use task implementations to get the actual work done. The generic parts of lb-workflow do nothing but coordinate the execution of these task implementations. Currently, almost all the task implementations call out to services, for example to import CSV data, to install workspaces, etc. The implementation of lb-workflow is modular, in the sense that it can be extended with new task types.

The state of the execution of a workflow is very precisely maintained in a database. The driver daemon continuously polls the database for work to do. The lb-workflow command-line utility interacts with the workflow database to start the execution of a workflow, abort workflows, etc.

The workflow database is designed so that potentially the command-line lb-workflow utility can be replaced with a separate administration user-interface that is multi-tenant. The functions performed from that user-interface would be similar to the commands supported by the command-line utility.

A single workflow database can contain multiple workflows, for example for monthly vs. weekly vs. daily batch procedures.
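For instance, a single specification could define two independent workflows, installed by name in the same workspace. The workflow names and task parameters below are purely illustrative:

workflow daily() {
  lb.tdx.Import(input = {"..."})
}

workflow monthly() {
  lb.tdx.Export(input = {"..."})
}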

The lb-workflow workspace

  • goal is to have very detailed information about the state and history of the workflow: all relevant info persisted to allow restarting, monitoring, etc
  • everything that should persist is in the workspace - the driver can die
  • supports multiple workflows with a single workspace
The lb-workflow workspace

Figure 2 - The lb-workflow workspace. Multiple workflow specifications can be installed by-name in a workspace (e.g. daily and backup workflows). They conform to the lb-workflow schema. A workflow can then be instantiated (started) multiple times. Each workflow instance gives rise to a process tree which is identified by a root process instance. The leaves of the process tree are tasks, and the inner nodes are composition operators. Each task instance has its state tracked by its own state machine. Thus, the state of individual task instances is independent of the others (as illustrated by the failed instances in red).

  • describe the simple process tree for Example 1


SimpleTask State Machine

Figure 3 - SimpleTask State Machine. Each process instance of a SimpleTask has its state tracked by this state machine. When the task is enabled (ready to execute), the state machine is instantiated and the state is INIT. Arrows in blue represent transitions that are automatically triggered by actions performed by the driver, which include the normal control flow that leads to the successful END state. Arrows in green represent actions triggered by the console, when manual intervention is required due to failure. Arrows in red are internal transitions triggered by having the root process instance of the tree aborted, which causes all tasks in the tree to be automatically aborted as well.

  • !start is an output action.
  • when in StartRequested, the driver is trying to execute the task. It will answer with success or fail.
  • the startup action is needed to recover when the driver crashes.
  • failure handling: retry vs ignore; ack_failure
  • END vs ABORTED: Note that the END state allows the workflow to continue (i.e., it enables subsequent processes), whereas ABORTED represents a dead-end: the workflow instance is completely aborted and cannot be restarted.


BaseTask State Machine

Figure 4 - BaseTask State Machine. This state machine represents the state of a BaseTask instance. This is more complex than SimpleTask because it supports task cancellation, and because the driver keeps state about BaseTask executions. Nodes in gray are abstract states; they represent a high level state of the task and have no direct counterpart in the actual state machine. Below we will show the concrete states and transitions under some of these abstract states.

  • note that the last action before END or ABORTED is an output, !cleanup.
  • the other difference is support for cancel


BaseTask Task Execution State Machine

Figure 5 - BaseTask Task Execution State Machine. This is a refinement of the Task Execution abstract state of the BaseTask state machine. It focuses on the normal execution steps, when no exceptions occur, such as failures or cancellations. Arrows without a target state connect to the Failure, Cancel or Abort abstract states described before.

  • the actual execution of the task only starts when the request was acknowledged.
  • these tasks often hold resources, so there’s a final cleanup step that allows tasks to release resources
  • when the driver restarts, if the task is still in StartRequested, it can safely go back to INIT because we know that the driver never started it. If it is in Executing, a previous instance of the driver may have started it, so the task goes to the failed state to await manual intervention (to be retried or ignored). If in CleanupRequested, we know that the task was successfully executed by a previous instance, so it stays in that state so that the new driver eventually cleans it up.
  • at any point, abort may be requested. If the task is in INIT, it can go directly to ABORTED.
  • a task can be cancelled while it is being executed (StartRequested or Executing), but not after it’s been executed (CleanupRequested).


BaseTask Abort State Machine

Figure 6 - BaseTask Abort State Machine. This is a refinement of the Abort abstract state of the BaseTask state machine. Once an internal abort action is triggered, the task starts the abort process, which invariably leads to the ABORTED end state.

  • !do_cancel allows the driver to attempt to cancel the task gracefully.
  • !cleanup allows the driver to release resources and remove internal records of the task
  • note that success and fail transitions are handled to ensure that the task is aborted even if the driver concurrently finished its execution.
  • Cancel and Failure abstract states are very similar; see stdlib.wf code.

The lb-workflow Language

Processes

  • Tasks

  • Control Flow – Seq – Par – ForAll – Restartable

Workflows

  • mostly for modularization; gets abstracted away by the compiler
  • explain a bit about instantiation
  • the main workflow

Parameters (cardinality, typing, varargs)

Instance Variables, Variable References

Process Reference (implicit bindings, alternative output syntax)

Predicate Bindings

Types: String Templates, Predicate Bindings, Boolean, Integer, Set Literals

Imports

The Driver

  • designed to be a daemon
  • what happens to tasks when the driver crashes (send the startup action when it restarts)
  • currently talks directly to the workspace; we may change to use services
  • explain some command line arguments:
    • auto-terminate
    • frequency
    • profile
    • max-retries, commit-mode

The Status Service

  • explain the protocol a bit
  • explain how to select workflows and processes

The lb-workflow Console

  • mostly about the action command

Documentation Generator

  • Markdown

Testing

  • Explain the python framework

Techniques

Controlling Concurrency

lb-workflow is designed to maximize task concurrency: the || (_parallel_) operator executes tasks concurrently; the forall operator by default executes all children in parallel. This may lead to excessive concurrency, for example, if multiple nodes execute requests against a server that is not capable of handling them concurrently. lb-workflow provides two ways of controlling concurrency: forall max and concurrency groups.

The forall operator accepts a max parameter that limits the number of children that are enabled concurrently. The value must be an integer. It can be a constant or a variable (as long as the variable is resolved statically). Consider the following workflow:

workflow forall_max_example() {
  forall<max=1>(i in {1, 2}) {
    lb.string.Join(
      parts = {
        "$(i)"
      }
    )
    ;
    lb.tdx.Import(input = {"..."})
  }
}

The children of the forall are Seq nodes with 2 children, one for the Join and one for the Import. But since only one of these Seq nodes can be executing at any point in time, the next Join will only start after the previous Import has finished. So the workflow is really just a serialization of the Seq nodes: Join; Import; Join; Import… But often what you really want is to control the concurrency of the Imports, while the Joins could all run in parallel.

The second way to control concurrency is called concurrency groups. When you declare a task, you can pass meta-information to the driver. So you could remove the forall max and control the concurrency at the task level:

workflow driver_meta_example() {
  forall(i in {1, 2}) {
    lb.string.Join(
      parts = {
        "$(i)"
      }
    )
    ;
    lb.tdx.Import(
      input = {"..."},
      driver_meta = "{ group: 'foo', max: 1 }"
    )
  }
}

All the Joins would run in parallel, but the driver would execute at most one Import concurrently.

There are two caveats.

1) Groups are global and the “last max wins”. This means that if, somewhere else in your tree, some other task participates in the same group, it is indeed the same group. This is useful because you can control access to a certain server globally, for example by using the server name as the group name. Also, when the driver picks such a task, it reads the max value and overwrites the previous max of the group, so you should use max consistently.

The following example shows the consequences of these features. The two foralls are executed in parallel. The driver can pick Export and Import tasks in parallel but, since they are in the same group, only one of them will execute at any point in time, even though they are in distinct branches of the process tree. Also note that the max value is used consistently in all tasks in the foo group.

workflow driver_meta_global_example() {
  (
    forall(j in {3, 4}) {
      lb.tdx.Export(
        input = {"..."},
        driver_meta = "{ group: 'foo', max: 1 }"
      )
    }
  ) || (
    forall(i in {1, 2}) {
      lb.string.Join(
        parts = {
          "$(i)"
        }
      )
      ;
      lb.tdx.Import(
        input = {"..."},
        driver_meta = "{ group: 'foo', max: 1 }"
      )
    }
  )
}

2) This only works for tasks, not workflows, because only tasks are executed by the driver. But the workbook API actually has some workflows, like lb.wb.CreateWorkbook, which encapsulate one or more tasks. It is not yet clear what the best way is to expose driver_meta at that level. We could expose multiple optional driver_metas, one for each task that the workflow calls, for example. Or perhaps expose a single driver_meta and use it only for the tasks that we know we may want to control (e.g. do not use it in ParseJsonObject, use it only in JsonService).

Predicate Bindings

Replaying Workflow Runs

Type Checking with Protobufs