Disaster recovery is something every company needs to plan for. With the cloud and Infrastructure as Code (IaC), we can set up a robust cloud backup solution that requires minimal configuration to add additional servers.

Our company recently got hit with a one-in-a-million RAID array failure that caused the loss of our primary database server. Thankfully we were able to recover the backup files since they were stored on a different RAID array, but without that insane amount of luck we were looking at 24 hours of data loss. (This has also led to a minor promotion for me to take over the care and feeding of the database servers.)

The 35,000 foot view

All of our servers use Ola Hallengren's Backup and Maintenance Scripts to take nightly full and log backups. Before I set this up, the backup files just sat on disk and were backed up nightly. That was fine for corruption or accidental deletions, but when it comes to hardware failures, one is none.

After the backups are taken, the files are uploaded to AWS S3 via the AWS CLI command aws s3 sync. This makes sure only new files are uploaded to S3, without needing to upload every file every time, which cuts down on network bandwidth since the full backup files are quite large.

Once the backups are uploaded to AWS S3, a Lambda function triggered off the s3:ObjectCreated event tags the files, since aws s3 sync does not have the ability to add tags to the uploaded files.

The final step is a set of AWS S3 Lifecycle rules that transition the files between storage classes and expire the backups based on the tags assigned.

With this setup, we are able to achieve a Recovery Point Objective of 30 minutes, with log backups taken every 15 minutes and uploaded to AWS S3 as soon as each backup completes.

The 30 minute target leaves time to handle any job alerts for upload failures.

Defining a retention policy

With any backup storage system, we need to define a retention policy. This is important so everyone in the business knows how long things will be kept around, and secondarily, what the cost of storage will be.

For SQL Servers, I like to have three different classes of backups: Nightly Backups, Weekly Backups, and Monthly Backups, each retained for a different period of time based on corruption checking. After discussing the retention periods with the Senior Leadership Team, this is our current retention policy (and it is similar to what quite a few other companies I have worked for use).

For Transaction Log backups, we retain the files for 7 days to maintain the ability to do point-in-time recovery.

For the Nightly Backups, we retain the full backup file for 7 days. We do our corruption checks weekly and need the ability to fall back to the last known good backup taken before the corruption check.

For the Weekly Backups, we retain the full backup file for 90 days. However, instead of taking a second backup, we keep the Sunday night backup as the weekly backup.

For the Monthly Backups, we retain the full backup file for 1 year. Again, instead of taking a second backup, we just retain the nightly backup from the first of the month.

Since the Weekly and Monthly backups are just nightly backups retained for longer than 7 days, this retention policy does not introduce any additional load on the SQL Servers. The servers take their nightly backups and upload the files to S3, and AWS handles the expiration of the files at the appropriate time.

Setting up all the AWS Services

This starts with our AWS S3 configuration. For each server, we create an S3 bucket, an IAM user, IAM access keys stored in AWS Secrets Manager, and a trigger on the bucket that invokes the Lambda function to tag the files. We use the same Lambda function for all of our buckets to ensure consistency in the tag values used.

Since we are using a stack per server, all of the user names, bucket names, etc. are generated from a ServerName parameter to the CloudFormation script.
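
As a sketch, the only input the server stack needs is that parameter (the description text here is mine, not pulled from the full script):

Parameters:
  ServerName:
    Type: String
    Description: Short name for the SQL Server, used to build the bucket, IAM user, and secret names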

Full CloudFormation scripts here

The Bucket

The first thing to set up is the S3 bucket (without the bucket we can't add the necessary permissions to an IAM user, so let's talk about it first).

It's just a basic bucket with some lifecycle rules. The transitions between storage classes are for cost. While it may be over-optimizing, the cost to add the transitions is minimal and the storage savings help minimize the cost of retaining backups that may never be accessed again once written.

  PackageBucket:
    Type: AWS::S3::Bucket
    DeletionPolicy: Retain
    Properties: 
      BucketName: !Join [ '-', [!Ref ServerName, 'backups' ] ]
      LifecycleConfiguration:
        Rules:
          - Id: Logs Backup Cleanup
            Status: Enabled
            TagFilters: 
              - Key: logBackup
                Value: ""
            ExpirationInDays: 7
          - Id: Nightly Backup Cleanup
            Status: Enabled
            TagFilters: 
              - Key: nightlyBackup
                Value: ""
            ExpirationInDays: 7
          - Id: Weekly Backup Cleanup
            Status: Enabled
            TagFilters: 
              - Key: firstOfWeekBackup
                Value: ""
            ExpirationInDays: 90
            Transitions:
              - TransitionInDays: 30
                StorageClass: STANDARD_IA
          - Id: Monthly Backup Cleanup
            Status: Enabled
            TagFilters: 
              - Key: firstOfMonthBackup
                Value: ""
            Transitions:
              - TransitionInDays: 30
                StorageClass: STANDARD_IA
              - TransitionInDays: 90
                StorageClass: GLACIER
            ExpirationInDays: 365

The IAM Users

So how do we securely upload the files? Let's create a service account for each server with access only to that server's bucket. While the script below does not list out all the permissions needed (instead opting for full S3 access to the specific bucket), setting just the necessary permissions is possible. Additionally, depending on your organization's policies, inline policies on a user may not be permitted and an IAM role would need to be created.

  S3User:
    Type: AWS::IAM::User
    Properties:
      UserName: !Join [ '_', ['svc', !Ref ServerName, 'backup' ] ]
      Policies:
        - PolicyName: WriteToS3
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: Allow
                Action: "s3:*"
                Resource:
                  - !Sub arn:aws:s3:::${PackageBucket}
                  - !Sub arn:aws:s3:::${PackageBucket}/*
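
The access keys mentioned earlier are stored in AWS Secrets Manager. The full script has its own naming, but a minimal sketch of those two resources looks roughly like this (the logical names and the secret name format are placeholders of mine):

  S3UserAccessKey:
    Type: AWS::IAM::AccessKey
    Properties:
      UserName: !Ref S3User

  S3UserCredentialsSecret:
    Type: AWS::SecretsManager::Secret
    Properties:
      Name: !Join [ '_', ['svc', !Ref ServerName, 'backup', 'keys' ] ]
      SecretString: !Sub '{"AccessKeyId":"${S3UserAccessKey}","SecretAccessKey":"${S3UserAccessKey.SecretAccessKey}"}'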

The Tagging Lambda Function

The AWS documentation is a wonderful resource. Most of the code comes from the AWS documentation on how to use S3 to trigger a Lambda function. The only major difference is the tagging. The CloudFormation stack can be found here. I am only pulling out the important pieces of the code since I just put the code inline in the CloudFormation yaml.

The Lambda function receives the trigger, retrieves the object metadata (via an s3.headObject call), looks at the date it was uploaded (via Last Modified, which is the upload date since this is an ObjectCreated trigger), and applies a tag to the file with an s3.putObjectTagging call. There are some additional comments here that are not in the full code.

// Assumes the AWS SDK for JavaScript v2 that ships with the Node.js Lambda runtime.
const AWS = require('aws-sdk');
const s3 = new AWS.S3();

exports.handler = async (event) => {
    const bucket = event.Records[0].s3.bucket.name;
    const key = decodeURIComponent(event.Records[0].s3.object.key.replace(/\+/g, ' '));
    const params = {
        Bucket: bucket,
        Key: key,
    };

    //.trn files are always log file backups.  We retain those for 7 days.
    if (key.endsWith(".trn")) {
        await s3.putObjectTagging({Bucket: bucket, Key: key, Tagging: {TagSet: [{Key: "logBackup", Value: ""}]}}).promise();
        return "added log tag";
    }

    //Get the last modified date.  s3.headObject gets the metadata, which is all we need.  If we needed the full file, s3.getObject would be the way to go.
    const { LastModified } = await s3.headObject(params).promise();

    //A debugging call, but let's leave it in.
    console.log('LAST MODIFIED:', LastModified);

    //Get the day of the month for the backup.  We tag and return for the first of the month, so anything left will be a nightly or weekly backup.
    const dayOfMonth = LastModified.getDate();

    //Our backups are taken after 00:00 UTC, so we need to use the 2nd of the month since Last Modified is in UTC.
    //It doesn't matter much to our company whether the backup is from the first or the second of the month, but it may be important to others.
    if (dayOfMonth === 2) {
        await s3.putObjectTagging({Bucket: bucket, Key: key, Tagging: {TagSet: [{Key: "firstOfMonthBackup", Value: ""}]}}).promise();
        return "added first of month tag";
    }

    const dayOfWeek = LastModified.getDay();
    //Since it's UTC time, we have to use Monday.
    //Again, because our backups run after midnight UTC, whether we keep the Monday or Sunday backup is not a huge deal.
    //All that matters is the retention time.
    if (dayOfWeek === 1) {
        await s3.putObjectTagging({Bucket: bucket, Key: key, Tagging: {TagSet: [{Key: "firstOfWeekBackup", Value: ""}]}}).promise();
        return "added first of week tag";
    }

    //If we haven't applied a tag, tag it as a nightly backup since that's all that's left.
    //Yes, we could tag the log files with this as well, but this allows different retention periods for log backups vs full backups.
    await s3.putObjectTagging({Bucket: bucket, Key: key, Tagging: {TagSet: [{Key: "nightlyBackup", Value: ""}]}}).promise();
    return "added nightly backup tag";
};

The final step is to create an IAM role for the Lambda function. For the purposes of the script here, I have used AWS managed policies for logging and S3 access. To truly get least-privilege permissions, a custom policy scoped to just the backup buckets would need to be created.

  DatabaseBackupTaggingLambdaRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: DatabaseBackupTaggingLambdaRole
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: "Allow"
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole
        - arn:aws:iam::aws:policy/AmazonS3FullAccess
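
If AmazonS3FullAccess is broader than you are comfortable with, the function only ever reads object metadata and writes tags. A rough sketch of an inline policy that could replace the S3 managed policy under the role's Properties (the bucket ARN pattern is my assumption, based on the ServerName-backups naming above):

      Policies:
        - PolicyName: TagBackupObjects
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: Allow
                Action:
                  - s3:GetObject        # headObject is authorized by s3:GetObject
                  - s3:PutObjectTagging # required for putObjectTagging
                Resource: arn:aws:s3:::*-backups/*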

Adding the Triggers

This is the fun part. When adding triggers, S3 needs permission to invoke the Lambda function before the notification can be added to the bucket, but that Lambda permission references the bucket, so it can't be created before the bucket exists.

The full details can be found here in an AWS blog post.

There are a couple of ways to solve this. The way I handled it was to run the CloudFormation creation in multiple steps. In the first pass, I created just the bucket with its lifecycle rules, the IAM user, and the IAM user credentials. (If the Lambda function was created first, the Lambda permission can be created in the initial pass as well.)
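
For reference, the Lambda permission itself is a single resource. A sketch of what it might look like in the server stack, assuming the Lambda ARN export shown below (the logical name is a placeholder of mine):

  TaggingLambdaInvokePermission:
    Type: AWS::Lambda::Permission
    Properties:
      Action: lambda:InvokeFunction
      FunctionName: !ImportValue "sqlbackup-taggingLambda:LambdaArn"
      Principal: s3.amazonaws.com
      # Built from the ServerName parameter instead of !Ref PackageBucket to avoid a circular dependency
      SourceArn: !Sub arn:aws:s3:::${ServerName}-backups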

Once the initial creation is done, the Notification Configuration can be added to the CloudFormation template and the Update Stack called.

      NotificationConfiguration:
        LambdaConfigurations:
          - Event: s3:ObjectCreated:*
            Function: !ImportValue "sqlbackup-taggingLambda:LambdaArn"

Order to set up the AWS services

The first step is to create the CloudFormation stack for the Lambda function. Since we use the same Lambda function for all of the server-specific stacks, having it set up and exporting the ARN of the Lambda function makes the individual server stacks easier to create. Lambda CloudFormation Script
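
The export name has to match what the server stacks import. As a sketch, the Outputs section of the Lambda stack could look like this (DatabaseBackupTaggingLambda is a placeholder for however the function resource is actually named):

Outputs:
  LambdaArn:
    Value: !GetAtt DatabaseBackupTaggingLambda.Arn
    Export:
      Name: "sqlbackup-taggingLambda:LambdaArn"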

The second step is to create the server-specific stack. The Create Stack needs to be done without the Notification Configuration (uploaded here as the “First” CloudFormation script) and then updated to include the Notification Configuration (uploaded here as the “Full” CloudFormation script).

Once the S3 Bucket and IAM User Credentials are generated, the SQL Server side of things can be set up.

Setting up the SQL Servers to Upload to S3

Prerequisites

The easiest way to sync the backup files is with the AWS CLI, and it is the only extra application that needs to be installed on the server.

Syncing the files

The easiest way to upload the files is with a PowerShell script. How you access the IAM user credentials is up to the reader. The easiest, most portable way is with environment variables assigned in the script. It's not the safest way, but for the purposes of this blog post it is how we will handle things.

The PowerShell script is simple: set the environment variables and call aws s3 sync. Since the tagging is handled by the Lambda function above, this is all that is needed on the servers. In our case, all the backups are written to the server's local storage. If the backups are written to a central file server, this script could be configured to run on the file server instead of on each individual server.

$env:AWS_ACCESS_KEY_ID = 'IdGoesHere'
$env:AWS_SECRET_ACCESS_KEY = 'KeyGoesHere'
& 'Install Path\aws.exe' s3 sync 'ServerBackupFolderGoesHere' s3://S3BucketGoesHere/

Agent Job

In order to schedule the upload, SQL Server Agent is a great option. It's installed with SQL Server by default, normally enabled, and normally already used to execute the backup jobs (especially if Ola Hallengren's scripts are used).

I have a couple of different ways to make sure everything is synced. The first is a job that runs every 4 hours to sync all the files. Gist link

But wait, how can I meet the 30 minute Recovery Point Objective? I add a second step to Ola's log and full backup jobs that runs the same sync as part of each backup job. When the log backups complete, the files are immediately uploaded to AWS S3.

All the jobs have failure alert notifications, so a failed job, either the backup or the upload, will alert the DBA team, who can take action to restart the backups or run the upload job to catch any files that haven't been pushed up to S3 yet.

Finishing up

So far, this process has been running with no issues. There are a couple of technical failure points, but by using automation to handle the uploading, tagging, and expiration of all the backup files, the IT team can now say we have the files necessary to meet our Recovery Point Objective in the case of a catastrophic server failure where all the drives make magic smoke at the same time.
