Case Study: How We Created a Self-Hosted Solution Able to Handle Billions of Events
Find out how to receive, process, and save thousands of events continuously based on our client's product.
Dec 23, 2022 | 7 min read
My name is Filip and I've been working as a Backend developer in MasterBorn since January 2021. While working on one of the projects for our American client - my team and I faced the challenge of creating a self-hosted solution for storing events created by the client's systems.
When I heard about the project goal for the first time, I got really excited! The project goal sounded like an intriguing problem to solve. Our client needed a performant system that would be able to receive, process, and save thousands of events coming from their systems per second. Continuously!
It sounded like a challenging but satisfying task. Firstly, because it was different from anything I've worked on before, and secondly, because of the huge amount of data the system needed to handle.
A client of ours used one of the most popular SaaS services to collect events from their applications. From the collected data, various metrics are created. Conclusions from those metrics can be used to build better interfaces or to improve processes. The stored data is also useful during the debugging process - thanks to that, the developer can see what actions preceded the error.
As time passed, the volume of stored data grew along with SaaS costs. After 3 years, the price of SaaS products became a problem and the client decided to hire MasterBorn to create a self-hosted solution.
Using SaaS products are convenient, but expensive. Simply pay a monthly subscription and that's all - your company can store events from its systems and employees can create metrics from the stored data. Your company does not have to invest in a complex system and maintain infrastructure for it, but every day millions of new events are inserted, the volume grows, and with the volume, monthly costs grow too.
Eventually, the cost of your SaaS service will become so overwhelming that you start to think about potential options to reduce it. The answer is to migrate to a self-hosted system for storing those events. This has a bigger up-front cost, but then your monthly expenses are reduced significantly for years.
First, we had to pick a database system suitable for the project. Our assumptions for the database were:
- Must scale well
- Must be able to run custom SQL queries (SaaS product was limited in this term)
- The possibility to remove duplicated events
- The structure of each event type can change in time (for example a new field can appear)
Based on the above we looked for a database that would match all the above points and we found a database called ClickHouse.
ClickHouse is an OLAP database so it can run SQL queries that generate reports in real-time. It is also a DBMS database which means the database stores saved data in a different way on the drive and thanks to that it can be compressed better.
ClickHouse also has a deduplication feature and supports ALTER queries with which we could update the event table structure. It was a perfect choice as we needed to store a huge amount of data that after insertion, would never change.
All services required for the project were deployed in a Kubernetes cluster.
Kubernetes is a system for automating deployment, scaling, and the management of containerized applications. It will start extra instances of service when traffic is high and reduce the number of running instances after the spike drops.
We found a fantastic open-source project called Jitsu that works with ClickHouse and can alter the structure of tables dynamically. Thanks to it, we don't need to worry about changing an event table structure, Jitsu will detect those changes and alter the database accordingly before writing events. We deployed scalable Jitsu and ClickHouse services in the cluster.
Get informed about the most interesting MasterBorn news.
After that, we faced another challenge. We needed to migrate data from their SaaS product to our new system. Luckily the service had an export option. The results are returned as a zipped archive of NDJSON files.
We created a simple utility script that called request endpoint for each since our client had started using the service, and streamed the response directly into S3 upload endpoint. We ran a single instance of the script on AWS Lambda and 2 days later we had all the data copied and stored in our bucket.
Then we had to decompress archives. For that purpose, we created another Lambda that read files one by one, decompressed them and uploaded them back into the S3 bucket. After the script finished running, we had decompressed files that we could work on. Thankfully, the files were in NDJSON format, so it was easy to process them efficiently in a streaming manner. Each line contained a valid JSON string which was one event we needed to insert into a ClickHouse table.
To insert events into the database we had to request the endpoint exposed by our Jitsu service. Then it updated the table schema if needed and inserted the event into the proper table.
Because there were billions of historical events, we decided to keep them in an SQS queue, before sending them to Jitsu. Thanks to the queue we could also control how many events are sent to Jitsu in one request. A second DLQ, Dead Letter Queue, was created to keep failed requests. It allowed us to inspect failed events and retry their insertion gracefully.
After preparing that part of the infrastructure, we ran a script that went through decompressed files and enqueued each event onto the SQS queue. After a while, we could see new events saved in the ClickHouse database! This process lasted for about a week. It resulted in 2.34 terabytes written to the ClickHouse drive.
The last part was to save new events into ClickHouse so our client could fully abandon their costly SaaS service. Thanks to our approach with SQS queues, it was really simple. All we had to simply do was use Lambda to expose the HTTP endpoint which accepted the events in a JSON format and then put them on the SQS queue. After doing that, new events coming from the client system started to be saved in the database.
The new solution costs only 270 USD monthly. The price included SQS queues, AWS Lambda, Kubernetes cluster and 3 terabytes of EKS drive.
Since our successful deployment, our client has saved around 1730 USD monthly, while not losing a single feature. Moreover, he gained the option to run more complex SQL queries to create advanced reports.
In my opinion, a good technical solution should work to achieve the business goal. My goal is not to make "cool JS toys" - but to deliver real value to the client. In this case, it was a very specific and measurable financial goal.
I hope you'll find this article useful and that it will save you a lot of time and effort! Sometimes we spend hours, days or even weeks to find the right solution ... right?
Bonus Tip: When I feel that I've been swimming in code and been up to my neck in programming - I go to the pool! Swimming is perfect - especially after days spent in front of the screen. Nothing refreshes your body and mind like swimming.
How to start a project with Test-Driven Development (TDD)
What is Test-Driven Development? Learn how to perform TDD with real code examples that focus on the Payment process.
Serverless Computing - 5 pitfalls to avoid in your project
The top 5 pitfalls of Serverless Computing and how to overcome them. Learn how to avoid problems with Microservices, Timeouts, Vendor Lock, Cold start and Running dry database connections.
Why Cloud Computing Architecture Components Are Like... LEGO Blocks?
Cloud architecture is like… putting together LEGO blocks and creating wonderful things. It is like having nearly endless opportunities to build your own app.