AWS Kinesis optimal shards and cost estimation

AWS Kinesis optimal shards and cost estimation, with a hands-on demo using long-running producer and consumer tasks

Amazon Kinesis makes it easy to collect, process, and analyze real-time streaming data at any scale, cost-effectively. It supports real-time data such as video, audio, application logs, website clickstreams, and IoT telemetry for a wide range of applications.

Calculating the optimal number of shards is important for improving the efficiency and lowering the cost of a data stream.

We are going to run a simple producer and consumer process using the code base in this repo: Kinesis shard estimation

Producer

Generates random characters and puts them into the stream as records.
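The producer's behaviour can be sketched roughly as follows. This is a minimal illustration, not the repo's producer.py: the function names and the RecordingClient stand-in are hypothetical, and the client is assumed to accept boto3-style keyword arguments (StreamName, Data, PartitionKey) rather than the positional arguments of the legacy boto package.

```python
import random
import string


def make_record(length=1000):
    """Return `length` random lowercase letters to use as one record payload."""
    return "".join(random.choice(string.ascii_lowercase) for _ in range(length))


def put_random_records(client, stream_name, count, length=1000):
    """Put `count` random records into `stream_name`, one call per record.

    Assumption: `client` accepts boto3-style keyword arguments, as in
    boto3.client("kinesis").put_record(StreamName=..., Data=..., PartitionKey=...).
    """
    responses = []
    for i in range(count):
        payload = make_record(length).encode("utf-8")
        # A varying partition key spreads records across shards.
        responses.append(
            client.put_record(StreamName=stream_name, Data=payload, PartitionKey=str(i))
        )
    return responses


class RecordingClient:
    """Hypothetical stand-in for a Kinesis client, for dry runs without AWS."""

    def __init__(self):
        self.calls = []

    def put_record(self, **kwargs):
        self.calls.append(kwargs)
        return {"SequenceNumber": str(len(self.calls))}


client = RecordingClient()
put_random_records(client, "test", count=3, length=16)
print(len(client.calls))  # 3
```

Passing the client in as an argument keeps the sketch runnable without AWS credentials; a real Kinesis client could be substituted for RecordingClient.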

Consumer

Gets batches of records, searches each record for the search pattern, and prints the matches to the terminal.
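The search step might look something like the sketch below. It assumes the pattern is the string "egg" and that records have already been decoded to text; the function names are hypothetical and the repo's worker.py may be organized differently.

```python
def find_pattern(data, pattern="egg"):
    """Return every start index at which `pattern` occurs in `data`."""
    locations = []
    start = data.find(pattern)
    while start != -1:
        locations.append(start)
        # Resume one character later so overlapping matches are not skipped.
        start = data.find(pattern, start + 1)
    return locations


def scan_records(records, pattern="egg"):
    """Print and return the match locations for each record containing `pattern`."""
    hits = []
    for data in records:
        locations = find_pattern(data, pattern)
        if locations:
            print("+--> {} location: {} <--+".format(pattern, locations))
            hits.append(locations)
    return hits


scan_records(["xxeggyyegg", "nothing here"])
# prints: +--> egg location: [2, 7] <--+
```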

Now install the boto Python package to interact with AWS (pip install boto), then start the long-running tasks.

Long-running tasks

nohup python producer.py test --shard_count 1 --poster_count 50 --poster_time 34560 --quiet &

nohup python worker.py test --sleep_interval 0.1 --worker_time 34560 > 01consumer.out 2> 01worker.err < /dev/null &

Once both tasks are running, the consumer starts reporting the positions at which it finds the pattern in the data:

+-> shard_worker:0 Got 25 Worker Records
+--> egg location: [797, 1893] <--+
+--> egg location: [1113] <--+

With this basic understanding and hands-on experience, we can look at a real-world example and perform shard and cost estimation.

Shard estimation

Question

Twenty stock exchange servers each generate 10 records of 250 KB per second. Three trading servers each consume this data at 50,000 KB per second. Estimate the number of shards required in AWS Kinesis for this workload.

Solution

AWS defines the following formula for calculating the number of shards:

Number_of_shards = max(incoming_write_bandwidth_in_KiB/1024, outgoing_read_bandwidth_in_KiB/2048)

In our case,

incoming_write_bandwidth_in_KiB = average data size in KiB * records per second
                                = 250 * (20 * 10)
                                = 50000

outgoing_read_bandwidth_in_KiB = incoming_write_bandwidth_in_KiB * number of consumers
                               = 50000 * 3
                               = 150000

So,

Number_of_shards = max(50000/1024, 150000/2048)
                 = max(48.8, 73.2)
                 = 73.2

Since a stream can only have a whole number of shards, we round up: 74 shards.
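The same arithmetic can be checked with a few lines of Python. This is only a sketch of the formula above; the function and parameter names are chosen for illustration, and the input sizes are treated as KiB.

```python
import math


def estimate_shards(avg_record_kib, records_per_second, consumers):
    """Apply max(write KiB/s / 1024, read KiB/s / 2048) and round up."""
    write_kib = avg_record_kib * records_per_second
    # Each consumer is assumed to read the full incoming stream.
    read_kib = write_kib * consumers
    return math.ceil(max(write_kib / 1024, read_kib / 2048))


# 20 servers x 10 records per second, 250 KiB per record, 3 consumers
shards = estimate_shards(avg_record_kib=250, records_per_second=20 * 10, consumers=3)
print(shards)  # 74
```

In this example the read side is the limiting factor: writes alone would need only 49 shards.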

Cost estimation

Total number of shards = 74

Hours in a month = 730

74 shards x 730 hours in a month = 54,020 shard hours per month

54,020 shard hours per month x 0.015 USD per shard hour = 810.30 USD

Shard hours per month cost: 810.30 USD
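As a quick check, the shard-hour cost can be computed as below. The rate of 0.015 USD per shard hour and the 730-hour month are the figures used in this article and should be treated as assumptions, since actual pricing varies by region and over time. The names are illustrative.

```python
from decimal import Decimal

HOURS_PER_MONTH = 730
# Assumed rate from this article; check current pricing for your region.
SHARD_HOUR_USD = Decimal("0.015")


def monthly_shard_cost(shards, rate=SHARD_HOUR_USD, hours=HOURS_PER_MONTH):
    """Return the monthly shard-hour cost in USD, rounded to cents."""
    return (shards * hours * rate).quantize(Decimal("0.01"))


print(monthly_shard_cost(74))  # 810.30
```

Decimal is used instead of float so that the result matches the hand calculation exactly.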

Additional charges may apply if features such as extended data retention or enhanced fan-out are used.