Aggregating Data through DynamoDB Streams

DynamoDB is a NoSQL database service that is fully managed by Amazon Web Services (AWS). Unlike tables in relational databases, tables in NoSQL databases are not restricted to a fixed structure, which allows development teams to adapt quickly to changing requirements. NoSQL databases can also scale horizontally because their documents are self-contained, which makes them a better choice for businesses that work with large data sets.

On top of the perks of being a NoSQL database, DynamoDB also provides the following benefits.

  • Reduced operational effort, as AWS handles hardware provisioning and replication.
  • Linear scalability, as AWS handles partitioning and load balancing.
  • Encryption at rest to protect sensitive data.
  • Time to live (TTL) to optionally expire records and reduce storage.

However, DynamoDB only supports key-value queries. Its lack of native support for SQL features such as triggers, GROUP BY and joins makes it challenging for products built on it to support analytics, summaries or aggregation of historical data.

Fortunately, AWS also offers DynamoDB Streams. In this article, we will go through how DynamoDB Streams was used to work around this limitation in a particular scenario. This eventually allowed end users to monitor aggregated data to make more informed decisions.


DynamoDB Streams

DynamoDB Streams is a service that captures a time-ordered sequence of item modifications in a DynamoDB table and makes it available to other consumers. This allows data from a source table to be processed in another service.

The most common approach to processing the stream records is through an AWS Lambda function. Once configured, AWS invokes this function whenever a mutating action (INSERT, MODIFY or REMOVE) is performed on the source table. Optionally, the function can also receive a snapshot of the mutated item in the source table before and / or after the action was performed.

Blog_Aggregating_Data_Streams_DynamoDBStreams_1

By making use of this service, we can use the snapshot and update the aggregated data in another AWS DynamoDB table through the AWS Lambda function.


Use Case

Let us consider a scenario in which we already have an AWS DynamoDB table into which a record is inserted each time an application scans a file. The project team had chosen AWS DynamoDB because they wanted to avoid managing database servers and because they needed to integrate the application with other AWS services.

Blog_Aggregating_Data_Streams_DynamoDBStreams_2

The product owner has recently submitted a new feature request: a view of the number of times files were scanned each month, per project. Users would like to have this feature in the application so that they can monitor the quota that they have used per month.

To achieve that, we will need to create another DynamoDB table with the following schema.

Blog_Aggregating_Data_Streams_DynamoDBStreams_3
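As a rough illustration, an aggregate table of this kind could be defined as in the Python sketch below. The table name `FileScanMonthlyCount`, the key attributes `project_id` and `year_month`, and the choice of on-demand billing are assumptions made for this example; they are not taken from the schema above. The `client` argument is expected to be a boto3 DynamoDB client.

```python
def aggregate_table_definition(table_name="FileScanMonthlyCount"):
    """Build keyword arguments for the DynamoDB CreateTable call.

    Assumed layout: one item per project per month, keyed by
    project_id (partition key) and year_month such as "2021-03" (sort key).
    The running count lives in a non-key attribute, so it is not declared here.
    """
    return {
        "TableName": table_name,
        "AttributeDefinitions": [
            {"AttributeName": "project_id", "AttributeType": "S"},
            {"AttributeName": "year_month", "AttributeType": "S"},
        ],
        "KeySchema": [
            {"AttributeName": "project_id", "KeyType": "HASH"},
            {"AttributeName": "year_month", "KeyType": "RANGE"},
        ],
        "BillingMode": "PAY_PER_REQUEST",
    }


def create_aggregate_table(client, table_name="FileScanMonthlyCount"):
    """Create the aggregate table using a boto3 DynamoDB client."""
    return client.create_table(**aggregate_table_definition(table_name))
```

With boto3 installed and credentials configured, this could be called as `create_aggregate_table(boto3.client("dynamodb"))`.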

We then need to enable DynamoDB Streams on the source table and select “New Image”. As records are not modified or removed in this scenario, we would only require the snapshot of the row after it was inserted.

Blog_Aggregating_Data_Streams_DynamoDBStreams_4
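The same setting can also be applied programmatically. The sketch below assumes a boto3 DynamoDB client is passed in as `client`; the function name is made up for this example. `NEW_IMAGE` is the stream view type that corresponds to the “New Image” option.

```python
def enable_new_image_stream(client, table_name):
    """Enable a stream on `table_name` that carries only the item as it looks after the write."""
    response = client.update_table(
        TableName=table_name,
        StreamSpecification={
            "StreamEnabled": True,
            "StreamViewType": "NEW_IMAGE",
        },
    )
    # The stream ARN is needed later when wiring the stream to a Lambda function.
    return response.get("TableDescription", {}).get("LatestStreamArn")
```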

Next, we need to create a new AWS Lambda function, provide the necessary IAM permissions to read and write to the relevant DynamoDB tables and configure it to be triggered by DynamoDB Streams.

Blog_Aggregating_Data_Streams_DynamoDBStreams_5

Once a new record is inserted into the source table, the AWS Lambda function should be invoked with an event body that is similar to the following structure.

Blog_Aggregating_Data_Streams_DynamoDBStreams_6.jpg
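To make the shape of such an event concrete, here is a trimmed-down sample expressed as a Python dictionary, together with a small helper that extracts the inserted items. The attribute names (`scan_id`, `project_id`, `scanned_at`, `file_size`) and their values are invented for illustration. The helper only understands string (`S`) and number (`N`) attributes, which is a simplification.

```python
from decimal import Decimal

# A minimal stream event with one INSERT record (illustrative values only).
SAMPLE_EVENT = {
    "Records": [
        {
            "eventID": "1",
            "eventName": "INSERT",
            "eventSource": "aws:dynamodb",
            "dynamodb": {
                "Keys": {"scan_id": {"S": "scan-0001"}},
                "NewImage": {
                    "scan_id": {"S": "scan-0001"},
                    "project_id": {"S": "project-a"},
                    "scanned_at": {"S": "2021-03-14T09:26:53Z"},
                    "file_size": {"N": "2048"},
                },
                "StreamViewType": "NEW_IMAGE",
            },
        }
    ]
}


def unwrap(attribute):
    """Turn a typed attribute such as {"S": "abc"} into a plain Python value."""
    type_code, value = next(iter(attribute.items()))
    if type_code == "S":
        return value
    if type_code == "N":
        return Decimal(value)  # numbers arrive as strings
    raise ValueError(f"Unsupported attribute type: {type_code}")


def inserted_items(event):
    """Return the new image of every INSERT record in the event as a plain dict."""
    items = []
    for record in event.get("Records", []):
        if record.get("eventName") != "INSERT":
            continue
        image = record["dynamodb"].get("NewImage", {})
        items.append({name: unwrap(attr) for name, attr in image.items()})
    return items
```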

Finally, to update the aggregated table, we need to set up the AWS Lambda function with the relevant layers and packages and deploy the following code.

Blog_Aggregating_Data_Streams_DynamoDBStreams_7.jpg
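As a rough sketch of what such a function might look like in Python, the example below counts the inserted scans in a batch per project and month, then adds each count to the aggregate table. Several things here are assumptions for illustration: the source attributes `project_id` and `scanned_at` (an ISO 8601 string), the aggregate key `project_id` / `year_month`, the counter attribute `scan_count`, and the `AGGREGATE_TABLE` environment variable.

```python
import os
from collections import Counter


def monthly_increments(event):
    """Count inserted scans per (project_id, "YYYY-MM") in one stream batch."""
    counts = Counter()
    for record in event.get("Records", []):
        if record.get("eventName") != "INSERT":
            continue  # records are never modified or removed in this scenario
        image = record["dynamodb"]["NewImage"]
        project_id = image["project_id"]["S"]
        year_month = image["scanned_at"]["S"][:7]  # "2021-03-14T..." -> "2021-03"
        counts[(project_id, year_month)] += 1
    return counts


def apply_increments(table, counts):
    """Add each count to the stored total without reading it first."""
    for (project_id, year_month), increment in counts.items():
        table.update_item(
            Key={"project_id": project_id, "year_month": year_month},
            # ADD creates scan_count if it is missing, otherwise increments it.
            UpdateExpression="ADD scan_count :inc",
            ExpressionAttributeValues={":inc": increment},
        )


def lambda_handler(event, context):
    import boto3  # imported here so the helpers above can be tested without AWS

    table = boto3.resource("dynamodb").Table(os.environ["AGGREGATE_TABLE"])
    counts = monthly_increments(event)
    apply_increments(table, counts)
    return {"updated_keys": len(counts)}
```

Grouping the increments per key keeps the number of writes to the aggregate table at or below the number of records in the batch.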


Best Practices

This implementation is a simple solution for the scenario mentioned above. To maximise the value of DynamoDB Streams across other projects, here are some pointers to take into consideration.

Error Handling

DynamoDB Streams may resend a batch of records if the AWS Lambda function fails while processing it. To prevent this, always catch exceptions in the AWS Lambda function code. In the event of a failure, log the error, notify system administrators or developers, and store the stream data in a dead-letter destination such as an SQS queue or an S3 bucket. Also consider setting the batch size to 1. This may reduce performance, but it reduces the risk of losing data and makes troubleshooting easier.
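One possible shape for this pattern is sketched below in Python. The function name, the `process_record` callback and the queue URL parameter are placeholders for this example; `sqs_client` is expected to be a boto3 SQS client. Failed records are logged and parked in the queue rather than allowed to fail the whole invocation.

```python
import json
import logging

logger = logging.getLogger(__name__)


def handle_batch(event, process_record, sqs_client, dead_letter_queue_url):
    """Process each stream record; park failures in a queue instead of raising."""
    processed = 0
    failed = 0
    for record in event.get("Records", []):
        try:
            process_record(record)
            processed += 1
        except Exception:
            failed += 1
            # Log with the stack trace so the failure can be investigated later.
            logger.exception("Failed to process stream record %s", record.get("eventID"))
            sqs_client.send_message(
                QueueUrl=dead_letter_queue_url,
                MessageBody=json.dumps(record, default=str),
            )
    return {"processed": processed, "failed": failed}
```

Notifying administrators is left out of the sketch; an alarm on the depth of the queue is one way this could be done.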

Throughput Limit

To reduce throughput exceptions on DynamoDB, the write capacity of the aggregated DynamoDB table may be set higher than that of the source table. Also avoid batch writes on the source table. A single batch write on the source table results in multiple events on the DynamoDB stream, so the throughput capacity of the aggregated DynamoDB table would have to be set according to the maximum number of records written in a batch write. Alternatively, if the application traffic is difficult to predict or control, set the capacity mode of the aggregated DynamoDB table to “on-demand”.
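A back-of-the-envelope calculation can help when sizing provisioned capacity. The sketch below assumes standard (non-transactional) writes, where one write capacity unit covers one write per second of an item up to 1 KB, and assumes the worst case in which every stream record leads to its own write on the aggregated table. The function name and the figures are illustrative.

```python
import math


def required_write_capacity_units(writes_per_second, item_size_bytes):
    """Estimate write capacity units for standard writes of a given item size."""
    # Each write is billed in 1 KB steps, rounded up, with a minimum of one unit.
    units_per_write = max(1, math.ceil(item_size_bytes / 1024))
    return writes_per_second * units_per_write


# A batch write of 25 items per second on the source table, each producing
# one update of a small (200-byte) aggregate item:
print(required_write_capacity_units(25, 200))
```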

Update Expression

When multiple Lambda invocations run at the same time, they may overwrite each other's updates to the aggregated value. This can happen if the code logic in the Lambda function looks up the previous aggregate count before updating it. To prevent this from happening, use the DynamoDB client’s `UpdateExpression`, as shown in the code example, to add to the previous value without looking it up in the code layer.
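The difference between the two approaches can be illustrated with a small in-memory simulation in Python. A plain dictionary stands in for the aggregated table here, and the function names are made up for this example. In the first function every invocation reads the count before any of them writes, so increments are lost. The second function mimics an update applied at write time, which is how an `ADD` action in an update expression behaves.

```python
def read_modify_write(store, key, invocations):
    """Simulate invocations that all read the count before any of them writes."""
    snapshots = [store.get(key, 0) for _ in range(invocations)]  # all see the same value
    for seen in snapshots:
        store[key] = seen + 1  # each write overwrites the previous one
    return store[key]


def increment_at_write_time(store, key, invocations):
    """Simulate increments that are applied to the current value at write time."""
    for _ in range(invocations):
        store[key] = store.get(key, 0) + 1
    return store[key]


print(read_modify_write({}, "project-a#2021-03", 3))        # 1: two increments were lost
print(increment_at_write_time({}, "project-a#2021-03", 3))  # 3: every increment counted
```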

Elasticsearch

Imagine there is another feature request to view the yearly aggregation from the aforementioned scenario. This means that another table would have to be created and streamed from the monthly aggregation table. DynamoDB is not designed to search large volumes of data and may not be able to support complicated search features.

An alternative would be to index the items into an Amazon Elasticsearch Service cluster from the DynamoDB stream. Elasticsearch supports search queries and aggregation. Compared to saving aggregated data into another DynamoDB table, this approach may be more extensible. If search functionality or multiple aggregated reports are required or expected, this alternative may be more suitable.

Blog_Aggregating_Data_Streams_DynamoDBStreams_8

 

Conclusion

AWS DynamoDB comes with many benefits, such as reduced operational effort, scalability and flexibility. Its lack of SQL functions may make it challenging to handle complex queries, but DynamoDB Streams makes up for this and allows developers to extend and combine AWS DynamoDB with other services. Development teams can then build solutions that provide capabilities such as data aggregation and analytics, allowing end users to make more informed decisions.
