How to ensure RDS instance startup in AWS?

Tim Rosenblüh

7. March 2023



Introduction

Last year, I wrote a blog post about reducing the cost of RDS instances. I implemented a straightforward time-based solution with EventBridge and Lambda that starts the corresponding instances some time before they are used in a Step Functions workflow.

The drawback of this solution (if you want to call it that) is that the instances do not always take the same time to boot, so you need to build in a time buffer to ensure that the database(s) are available.
For example, you could start the instances 30 minutes before the workflow is executed. This still doesn’t guarantee that the startup runs smoothly, but if the instances take that long, manual intervention will likely be required anyway.

I therefore wondered whether the database could be started at the beginning of the workflow itself, with the actual processing waiting until the database is available. This would make the time buffer obsolete and reduce the database costs further.

Components

The following services were used for the solution:

– AWS Step Functions
– Amazon EventBridge
– Amazon SQS
– AWS Lambda
– (Amazon RDS)

Overview

These components were then used to build the following architecture:

[Figure: architecture diagram of the solution]

Let’s start with how the main workflow works:

1. The database is started as the first step in the Step Functions workflow.
2. We place a TaskToken in the associated queue and wait within this step for the external process to finish.
3. After the external process has passed the TaskToken back to the workflow, the instance status of the database can be checked.
   – This is mainly to make sure that the database is actually in the *available* status.
4. Finally, the queue is completely emptied so that no messages remain that could cause problems in further executions.
   – Normally, the external process should have already deleted the message in the queue. This step is only to make sure that no messages remain in the queue.
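
Steps 1 and 4 of this list could be sketched as two small helpers. This is only an illustration under assumptions: the post does not say whether these steps run in Lambda functions or as direct service integrations, the function names `startDatabase` and `emptyQueue` are hypothetical, and `rds` and `sqs` are assumed to be AWS SDK for JavaScript v2 clients that the caller passes in.

```javascript
// Sketch only: `rds` and `sqs` are assumed to be AWS SDK for JavaScript v2
// clients (new AWS.RDS(), new AWS.SQS()) created by the caller.

// Step 1: request the start of the instance. The call returns once the start
// has been requested, not once the database is available.
async function startDatabase(rds, dbInstanceIdentifier) {
    const result = await rds
        .startDBInstance({ DBInstanceIdentifier: dbInstanceIdentifier })
        .promise();
    return result.DBInstance.DBInstanceStatus;
}

// Step 4: remove every message that is still in the queue.
async function emptyQueue(sqs, queueUrl) {
    await sqs.purgeQueue({ QueueUrl: queueUrl }).promise();
}
```

Passing the clients in keeps the helpers easy to test with stubs. Note that SQS allows only one purge per queue every 60 seconds, so back-to-back executions may need to tolerate a rejected purge request.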

The external process consists of the following steps:

1. The EventBridge rule captures the specific database event that is emitted when the instance is fully started and sends it to a Lambda function.
2. In the function itself, the TaskToken is then fetched from the queue and returned to the Step Functions workflow with the call *sendTaskSuccess*. This resumes the paused workflow.

Solution

Let’s take a closer look at some of the individual parts of the solution:

AWS Step Functions

To pause the Step Functions workflow, the option *Wait for callback* must be used for a step. The step is then issued a TaskToken, which an external process returns to the workflow once it has completed so that the execution can continue.

SQS was used to store the TaskToken:

```json
"SQS SendMessage": {
  "Type": "Task",
  "Resource": "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
  "HeartbeatSeconds": 1800,
  "Parameters": {
    "QueueUrl": "URL_OF_THE_SQS_QUEUE",
    "MessageBody": {
      "TaskToken.$": "$$.Task.Token"
    }
  },
  "Next": "..."
}
```

Amazon EventBridge

The EventBridge rule is configured to capture the specific event that represents the start of the RDS instance:

```json
{
  "source": ["aws.rds"],
  "detail-type": ["RDS DB Instance Event"],
  "detail": {
    "SourceArn": ["ARN_OF_THE_RDS_INSTANCE"],
    "EventID": ["RDS-EVENT-0088"]
  }
}
```

*(RDS-EVENT-0088 signals that the instance has been started and is in the ‘available’ state)*

To learn more about the various RDS instance events, take a look at the documentation.
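
As a defensive measure, the Lambda function could re-check the incoming event against the same fields the rule filters on. The following is a minimal sketch: the helper name `isInstanceStartedEvent` is hypothetical, and the field names are taken from the event pattern above.

```javascript
// Returns true only for the "instance started" event of the expected instance.
// The checked fields mirror the EventBridge pattern shown above.
function isInstanceStartedEvent(event, expectedInstanceArn) {
    const detail = (event && event.detail) || {};
    return Boolean(event)
        && event.source === 'aws.rds'
        && event['detail-type'] === 'RDS DB Instance Event'
        && detail.SourceArn === expectedInstanceArn
        && detail.EventID === 'RDS-EVENT-0088';
}
```

The rule already filters the events, so this check is redundant in the normal case. It mainly protects against the function being invoked manually or by a rule that was changed later.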

AWS Lambda

The Lambda function of the external process then extracts the TaskToken from the queue and sends it back to the state machine via the Step Functions SDK call *sendTaskSuccess*.

```javascript
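// Runs inside the async Lambda handler; queueURL holds the URL of the SQS queue
const AWS = require('aws-sdk'); // AWS SDK for JavaScript v2
const sqs = new AWS.SQS();
const sf = new AWS.StepFunctions();

// Fetch the message that holds the TaskToken from the queue
const message = await sqs.receiveMessage({QueueUrl: queueURL}).promise();
const messageBody = JSON.parse(message.Messages[0].Body);
const taskToken = messageBody.TaskToken;

// Return the TaskToken to resume the paused workflow
await sf.sendTaskSuccess({
    output: JSON.stringify({message: 'You can add a message here.'}),
    taskToken: taskToken
}).promise();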
```
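
The snippet above assumes that the queue always contains a message and leaves the message in the queue. A slightly more defensive variant might look like the following sketch. The function name `returnTaskToken`, the output message, and the long-polling value are assumptions, and `sqs` and `sf` are AWS SDK for JavaScript v2 clients passed in by the caller.

```javascript
// Sketch only: `sqs` and `sf` are assumed to be AWS SDK for JavaScript v2
// clients (new AWS.SQS(), new AWS.StepFunctions()) created by the caller.
async function returnTaskToken(sqs, sf, queueUrl) {
    const received = await sqs.receiveMessage({
        QueueUrl: queueUrl,
        MaxNumberOfMessages: 1,
        WaitTimeSeconds: 5 // long polling; the value is an assumption
    }).promise();

    const messages = received.Messages || [];
    if (messages.length === 0) {
        // No workflow is waiting, e.g. the instance was started manually.
        return false;
    }

    // Resume the paused workflow with the stored TaskToken.
    const { TaskToken: taskToken } = JSON.parse(messages[0].Body);
    await sf.sendTaskSuccess({
        output: JSON.stringify({ message: 'Database started.' }),
        taskToken
    }).promise();

    // Delete the message so it cannot affect a later execution.
    await sqs.deleteMessage({
        QueueUrl: queueUrl,
        ReceiptHandle: messages[0].ReceiptHandle
    }).promise();
    return true;
}
```

The message is deleted only after *sendTaskSuccess* has succeeded, so a failed callback leaves the TaskToken in the queue for another attempt.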

To make sure that the instance was actually started, a second Lambda function checks the status of the instance:

```javascript
// Runs inside the async Lambda handler; rds is an AWS.RDS client (AWS SDK for JavaScript v2)
const result = await rds.describeDBInstances({DBInstanceIdentifier: 'DBInstanceIdentifier'}).promise();
const dbInstanceStatus = result.DBInstances[0].DBInstanceStatus;

let response;
if (dbInstanceStatus !== 'available') {
    response = {
        statusCode: 400,
        body: JSON.stringify('The DB-Instance is not available!')
    };
} else {
    response = {
        statusCode: 200,
        body: JSON.stringify('The DB-Instance was started successfully!')
    };
}
return response;
```
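
A returned `statusCode` of 400 does not by itself fail the Step Functions task, because the Lambda invocation still succeeds. The workflow would have to inspect the response, for example in a Choice state. An alternative is to throw an error, which fails the task and lets Retry or Catch rules react. The following sketch assumes this alternative. The error class `DatabaseNotAvailableError` and the function name are hypothetical, and `rds` is an AWS SDK for JavaScript v2 client passed in by the caller.

```javascript
// Hypothetical error type; its name is what a Retry or Catch rule would match on.
class DatabaseNotAvailableError extends Error {
    constructor(status) {
        super(`The DB instance is not available (status: ${status}).`);
        this.name = 'DatabaseNotAvailableError';
    }
}

// Sketch only: `rds` is assumed to be an AWS SDK for JavaScript v2 client.
async function assertDatabaseAvailable(rds, dbInstanceIdentifier) {
    const result = await rds
        .describeDBInstances({ DBInstanceIdentifier: dbInstanceIdentifier })
        .promise();
    const status = result.DBInstances[0].DBInstanceStatus;

    if (status !== 'available') {
        // Throwing makes the Lambda invocation, and with it the task, fail.
        throw new DatabaseNotAvailableError(status);
    }
    return status;
}
```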

Additional Information

Error handling has been omitted here, as this would have made the post too long. In an ideal case, the various instance states would also have to be taken into account, since the database could have already been started at the beginning of the workflow or be in a completely different state. Here, only the basic functionality is described, which can then be extended as required.
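
As a starting point for such an extension, the current instance status could be mapped to the next action before the database is started. This is a minimal sketch: the function name and the action names are hypothetical, and treating every other status as a failure is an assumption that may be too strict for some setups.

```javascript
// Maps the DBInstanceStatus reported by RDS to the next action of the workflow.
function decideStartAction(dbInstanceStatus) {
    switch (dbInstanceStatus) {
        case 'available':
            return 'continue'; // already running, skip the start and the wait
        case 'stopped':
            return 'start';    // start the instance and wait for the callback
        case 'starting':
            return 'wait';     // a start is in progress, only wait for the callback
        default:
            return 'fail';     // e.g. 'stopping': assumed to need manual attention
    }
}
```

In the workflow, the result could drive a Choice state that decides whether the start and wait steps are executed at all.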

Summary

The presented solution starts databases within their associated workflow using services such as Step Functions, EventBridge, and Lambda. Instances no longer have to be booted long before they are used, which can save additional costs compared to a solution that starts the database some time in advance. Of course, this approach makes the most sense for databases that are used exclusively by a workflow whose steps depend on a running RDS instance. If the database must also be available outside the time period in which the Step Functions workflow is running, you either need a completely different approach or may be fine with the rough time-based scheduling I described in my previous blog post.