Step Functions is a way to coordinate the components of a distributed application. Every component performs a well-defined, separate task whose output becomes the input of other tasks. A workflow in Step Functions is described as a state machine.

Amazon States Language

Amazon States Language (ASL) is a JSON-based language for defining the state machines used by Step Functions. A definition is a collection of states leading the execution from the start state (there can be only one) to a final state (many states can be final). An example of ASL:

{
  "Comment": "First state machine",
  "StartAt": "HelloSt",
  "States": {
    "HelloSt": {
      "Type": "Pass",
      "Result": "Hello",
      "Next": "WorldSt"
    },
    "WorldSt": {
      "Type": "Pass",
      "Result": "World",
      "End": true
    }
  }
}

The state machine above starts in the HelloSt state and passes the result "Hello" as input to the WorldSt state. Below is the graphical representation of this ASL:

Hello World state
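To make the StartAt/Next/End flow concrete, here is a minimal Ruby sketch that checks a definition for transitions pointing at undefined states and then walks it. The helper names `dangling_targets` and `walk_pass_states` are my own, and the walk is a toy that only understands Pass states; it says nothing about how the Step Functions service works internally.

```ruby
require 'json'

# A two-state definition in the spirit of the example above.
HELLO_DEFINITION = JSON.parse(<<~ASL)
  {
    "StartAt": "HelloSt",
    "States": {
      "HelloSt": { "Type": "Pass", "Result": "Hello", "Next": "WorldSt" },
      "WorldSt": { "Type": "Pass", "Result": "World", "End": true }
    }
  }
ASL

# Returns the names referenced by "StartAt" or "Next" that are not defined
# under "States". An empty array means every transition has a target.
def dangling_targets(definition)
  states = definition.fetch('States')
  referenced = [definition.fetch('StartAt')] + states.values.map { |s| s['Next'] }.compact
  referenced.uniq - states.keys
end

# Follows StartAt/Next through Pass states only and returns the visited names.
def walk_pass_states(definition)
  states = definition.fetch('States')
  visited = []
  current = definition.fetch('StartAt')
  while current
    raise ArgumentError, "cycle detected at #{current}" if visited.include?(current)

    state = states.fetch(current)
    raise ArgumentError, "only Pass states are handled, got #{state['Type']}" unless state['Type'] == 'Pass'

    visited << current
    current = state['End'] ? nil : state.fetch('Next')
  end
  visited
end

puts dangling_targets(HELLO_DEFINITION).inspect
puts walk_pass_states(HELLO_DEFINITION).join(' -> ')
```

A Next value that names a state missing from States (for instance "World" instead of "WorldSt") shows up in the list returned by `dangling_targets`, which makes this kind of typo easy to spot before deploying.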

Conditional statements are also supported:

{
  "Comment": "First state machine",
  "StartAt": "CheckName",
  "States": {
    "CheckName": {
      "Type": "Choice",
      "Choices": [
        {
          "Not": {
            "Variable": "$.name",
            "StringEquals": ""
          },
          "Next": "HelloName"
        },
        {
          "Variable": "$.name",
          "StringEquals": "",
          "Next": "HelloWorld"
        }
      ]
    },
    "HelloName": {
      "Type": "Pass",
      "Result": "Hello",
      "End": true
    },
    "HelloWorld": {
      "Type": "Pass",
      "Result": "World",
      "End": true
    }
  }
}

The first state, CheckName, decides which state the flow is directed to next, based on the name input parameter. The code above produces this state machine:

Conditional Hello World state
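How the Choice rules pick the next state can be sketched in a few lines of Ruby. This is only an illustration under simplifying assumptions: `lookup`, `rule_matches?` and `next_state` are names I made up, only the Not and StringEquals rules are covered, and only top-level `$.field` paths are resolved.

```ruby
# Resolves a "$.field" path against the input; only top-level fields are handled.
# A missing field raises KeyError in this sketch.
def lookup(input, path)
  input.fetch(path.delete_prefix('$.'))
end

# Evaluates one rule; only "Not" and "StringEquals" are covered.
def rule_matches?(rule, input)
  return !rule_matches?(rule['Not'], input) if rule.key?('Not')

  lookup(input, rule['Variable']) == rule['StringEquals']
end

# Returns the "Next" of the first matching rule, or the state's "Default" (nil if absent).
def next_state(choice_state, input)
  matched = choice_state['Choices'].find { |rule| rule_matches?(rule, input) }
  matched ? matched['Next'] : choice_state['Default']
end

# The CheckName state from the definition above, written as a Ruby hash.
CHECK_NAME = {
  'Type' => 'Choice',
  'Choices' => [
    { 'Not' => { 'Variable' => '$.name', 'StringEquals' => '' }, 'Next' => 'HelloName' },
    { 'Variable' => '$.name', 'StringEquals' => '', 'Next' => 'HelloWorld' }
  ]
}.freeze

puts next_state(CHECK_NAME, { 'name' => 'Ada' })
puts next_state(CHECK_NAME, { 'name' => '' })
```

Rules are tried in order and the first match wins, so with these two rules a non-empty name goes to HelloName and an empty one to HelloWorld.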

For more information visit the official documentation of Amazon States Language.

The use case I was most interested in

The use case I had in mind was something like

Send an email to the customer once a defined time has passed since some event occurred

This looks like a good fit for the Wait state type. A Wait state delays the state machine from continuing for a specified time: a fixed number of seconds (Seconds), a fixed timestamp (Timestamp), or a value read from the execution input (SecondsPath, TimestampPath). The state machine is very simple:

{
  "Comment": "Send an incentive email",
  "StartAt": "Wait",
  "States": {
    "Wait": {
      "Type": "Wait",
      "TimestampPath": "$.scheduledAt",
      "Next": "SendEmail"
    },
    "SendEmail": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:eu-central-1:074085690123:function:send-incentive-email",
      "End": true
    }
  }
}

Send incentive email state machine

I chose to provide the exact time at which the Wait state passes execution on to the SendEmail state through the TimestampPath attribute in the state definition. Now I can start the state machine with an example event, where scheduledAt is the moment the execution moves on to SendEmail:

{
    "scheduledAt": "2019-12-31T08:13:00Z",
    "locale": "en",
    "airline": "Lufthansa",
    "departure_airport": "Frankfurt",
    "arrival_airport": "Gdańsk",
    "compensation": "600"
}
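Since the use case is "some time after an event", the scheduledAt value will usually be computed rather than typed in. A minimal sketch, assuming the event time is known as a Ruby Time; the helper name `scheduled_at` and the seven-day delay are made up for illustration.

```ruby
require 'json'
require 'time'

# Returns the UTC timestamp `delay_seconds` after `occurred_at`, in the
# ISO 8601 form used by the example event.
def scheduled_at(occurred_at, delay_seconds)
  (occurred_at.getutc + delay_seconds).iso8601
end

# Hypothetical: the triggering event happened a week before the send time.
occurred_at = Time.utc(2019, 12, 24, 8, 13, 0)
seven_days = 7 * 24 * 60 * 60

event = { scheduledAt: scheduled_at(occurred_at, seven_days), locale: 'en' }
puts JSON.generate(event)
```

Converting to UTC before formatting keeps the trailing `Z` regardless of the time zone the original Time object was created in.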

An execution can be started with the official SDK, for example:

require 'aws-sdk-states'
require 'json'

client = Aws::States::Client.new

# assuming there is only one state machine defined
state_machine_arn = client.list_state_machines.state_machines[0].state_machine_arn

client.start_execution({
  state_machine_arn: state_machine_arn,
  input: {
    scheduledAt: (Time.now.utc + 60).strftime('%Y-%m-%dT%H:%M:%SZ'),
    locale: "en",
    airline: "Lufthansa",
    departure_airport: "Frankfurt",
    arrival_airport: "Gdańsk",
    compensation: 600
  }.to_json
})

An execution can be named. The name must be unique within the scope of the state machine, which sounds like a good way to achieve exactly one execution per name. The SendEmail state can be extended with Retry (max attempts and exponential backoff supported!) and Catch attributes. Both can be limited to particular error names through ErrorEquals.
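Here is a sketch of what SendEmail could look like with both attributes, built as a Ruby hash and printed as ASL. The retry numbers and the NotifyFailure fallback state are hypothetical, and `retry_delays` is my own helper that simply spells out the waits implied by IntervalSeconds and BackoffRate.

```ruby
require 'json'

# SendEmail with hypothetical retry and fallback settings.
SEND_EMAIL_STATE = {
  'Type' => 'Task',
  'Resource' => 'arn:aws:lambda:eu-central-1:074085690123:function:send-incentive-email',
  'Retry' => [
    { 'ErrorEquals' => ['States.TaskFailed'], 'IntervalSeconds' => 2, 'MaxAttempts' => 3, 'BackoffRate' => 2.0 }
  ],
  'Catch' => [
    { 'ErrorEquals' => ['States.ALL'], 'Next' => 'NotifyFailure' } # hypothetical fallback state
  ],
  'End' => true
}.freeze

# Waits before each retry: IntervalSeconds first, then multiplied by
# BackoffRate for every further attempt.
def retry_delays(retrier)
  Array.new(retrier['MaxAttempts']) do |attempt|
    retrier['IntervalSeconds'] * retrier['BackoffRate']**attempt
  end
end

puts JSON.pretty_generate(SEND_EMAIL_STATE)
puts retry_delays(SEND_EMAIL_STATE['Retry'].first).inspect
```

With these values the task would be retried after 2, 4 and 8 seconds, and only once the retries are exhausted would the Catch entry route the execution to the fallback state.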

Pricing

The Free Tier includes 4,000 state transitions per month. To keep it simple, a state transition is an edge between states. Sending the delayed email took three state transitions. Every further 1,000 state transitions cost $0.025.
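A back-of-the-envelope estimate for the delayed-email workflow, using only the figures quoted above. It assumes transitions beyond the free allowance are billed proportionally rather than in whole blocks of 1,000, and `transition_cost` is a name I picked for the sketch.

```ruby
FREE_TRANSITIONS = 4_000            # Free Tier figure quoted above
PRICE_PER_1000_TRANSITIONS = 0.025  # USD, figure quoted above
TRANSITIONS_PER_EMAIL = 3           # transitions used by one delayed email

# Cost in USD for a number of delayed emails, after subtracting the free transitions.
def transition_cost(emails)
  billable = [emails * TRANSITIONS_PER_EMAIL - FREE_TRANSITIONS, 0].max
  (billable / 1000.0 * PRICE_PER_1000_TRANSITIONS).round(4)
end

puts transition_cost(1_000)    # 3,000 transitions, all within the free allowance
puts transition_cost(100_000)  # 300,000 transitions, 296,000 of them billable
```

So roughly the first 1,333 emails fit in the free allowance, and 100,000 emails come out at about $7.40 under these assumptions.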

Summary

Step Functions looks quite interesting. It was introduced in December 2016 and I hadn’t played with it until today. It may look very limited and simple at first, but I can imagine some big state machines orchestrating complex execution logic.

A state machine can be started

  • via an API action (like in the above example)
  • by CloudWatch Events (haven’t tried)
  • through Amazon API Gateway (abstraction over abstraction?)
  • from another state machine (we need to go deeper …)

What a state machine can play with

  • Lambda functions (like in the above example)
  • DynamoDB (read & write)
  • SNS (publishing message)
  • SQS (putting a message into the queue)
  • some other AWS services I haven’t heard of

Where I find it useful

  • data processing and ETLs
  • delaying Lambda execution (like in the above example)
  • kind of continuous integration with Activities?

What I haven’t touched

Step Functions also offers the Parallel and Map state types. The first can execute several branches at once, for example sending an email and sending an SMS at the same time (saving duration). The second can execute the same steps for every item in an input array concurrently (saving duration and state transitions).
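For the email-plus-SMS case, a Parallel state might look roughly like the definition generated below. I haven't run this against Step Functions; the function ARNs are placeholders, `task_branch` is a helper of my own, and Map is left out of the sketch.

```ruby
require 'json'

# Builds a branch that consists of a single Task state.
def task_branch(name, resource)
  { 'StartAt' => name, 'States' => { name => { 'Type' => 'Task', 'Resource' => resource, 'End' => true } } }
end

# Function ARNs below are placeholders, not real resources.
NOTIFY_STATE = {
  'Type' => 'Parallel',
  'Branches' => [
    task_branch('SendEmail', 'arn:aws:lambda:eu-central-1:123456789012:function:send-email'),
    task_branch('SendSms', 'arn:aws:lambda:eu-central-1:123456789012:function:send-sms')
  ],
  'End' => true
}.freeze

puts JSON.pretty_generate(NOTIFY_STATE)
```

Each entry in Branches is a small state machine of its own, with its own StartAt and States.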

All of the “real executors” - Task, Parallel and Map - offer the Catch and Retry options. In my opinion, Catch should be used as a workflow-level fallback in case of an error, rather than handling errors inside the Lambda function. Retry is … well, retry.

Every step in a workflow, and the workflow itself, can have a time limit defined. When the limit is exceeded, the execution either fails with an error (no output from the state/workflow) or moves on to a defined fallback state.

Any logging capabilities for workflow executions.

Activities - a special kind of task whose work is performed by a worker that can run on Amazon EC2, Amazon Elastic Container Service, a mobile device, …

Pros

  • relatively easy to start
  • Terraform support
  • Standard and Express workflows
  • maximum execution time for a Standard workflow is 1 year
  • Express workflows look fast: according to the documentation, they support event rates greater than 100,000 events per second
  • tooling inside the AWS Console is nice: it lets you start executions and check when each state was executed, with its inputs and outputs
  • Lambda functions can be independent, without dependency on each other, without referencing other resources, pure input and output

Cons

  • pricing for complex and long-running workflows
  • can’t be started directly by an SNS event
  • workflow source code inside Terraform
  • while there is a linter for Amazon States Language, I find the lack of a testing tool a drawback
  • lack of scheduling vs. scheduled Lambda functions