Twitter Analytics on AWS

A highly performant, scalable, and reliable analytical web service on 1TB of raw tweets

Designed, developed, deployed, and optimized functional web servers under a strict budget, and explored methods to identify potential bottlenecks in a cloud-based web service and to improve system performance.


This academic project introduced HTTP web servers and how to make them serve thousands of requests per second. It also introduced databases (SQL and NoSQL) and the bottlenecks that appear when many web server instances share a single database instance.

The main goal of this project was to learn how to:
- build a reliable web service on the cloud within a specified budget
- implement ETL on a large data set (~ 1 TB) and load the data into SQL and NoSQL systems
- explore various methods, tools, configurations and optimizations to improve the performance of a web service deployed on cloud managed services

The challenge

With a budget of $0.85/hour, how do you serve 30K unique requests per second?

Query 1: Mine this transaction for this blockchain

introduction to Vert.x and Undertow

After careful research on various benchmarking platforms, I decided to write my request-handling server in Java with Vert.x.
To date, I prefer it over Express, Flask, or Undertow.

I needed something to handle thousands of JSON payloads like the one in the attached image on the right, only much, much bigger, each one zlib-compressed and base64-encoded as a query parameter.
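To make the payload format concrete, here is a minimal Python sketch of packing and unpacking such a parameter. It is an illustration only, not the project's server code (which was written in Java): the URL-safe base64 alphabet, the function names, and the sample field names are all assumptions.

```python
import base64
import json
import zlib


def encode_payload(obj: dict) -> str:
    """Serialize to compact JSON, zlib-compress, then base64-encode for a query string."""
    raw = json.dumps(obj, separators=(",", ":")).encode("utf-8")
    # Assumption: the URL-safe base64 alphabet is used so the value survives in a URL.
    return base64.urlsafe_b64encode(zlib.compress(raw)).decode("ascii")


def decode_payload(param: str) -> dict:
    """Reverse of encode_payload: base64-decode, inflate, then parse the JSON."""
    # Restore '=' padding in case a client stripped it from the query parameter.
    padded = param + "=" * (-len(param) % 4)
    compressed = base64.urlsafe_b64decode(padded)
    return json.loads(zlib.decompress(compressed).decode("utf-8"))


if __name__ == "__main__":
    # Hypothetical request body; the real field names are not shown in this write-up.
    sample = {"chain": [{"id": 1, "hash": "abc"}], "new_tx": {"amt": 5}}
    token = encode_payload(sample)
    print(token)
    print(decode_payload(token) == sample)
```

Every request pays for a base64 decode, an inflate, and a JSON parse before any real work starts, which is one reason the choice of server framework matters at this request rate.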

The idea is: verify all the hashes and signatures given in the blockchain and then verify the validity of the new transaction. Simple, right?
Not at all! Try calculating the proof-of-work for each of the transactions (sometimes more than 100 in one request)!
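For a feel of why proof-of-work is the expensive part, here is a generic Python sketch. It assumes SHA-256 and a "hash must start with N zero hex digits" target, which is a common textbook scheme and not necessarily the one used in this project. All names are hypothetical, and signature verification is left out.

```python
import hashlib
from itertools import count


def block_hash(payload: str, nonce: int) -> str:
    """Hex SHA-256 digest of the payload combined with a nonce (assumed scheme)."""
    return hashlib.sha256(f"{payload}|{nonce}".encode("utf-8")).hexdigest()


def is_valid_pow(payload: str, nonce: int, difficulty: int) -> bool:
    """A nonce is valid if the digest starts with `difficulty` zero hex digits."""
    return block_hash(payload, nonce).startswith("0" * difficulty)


def mine(payload: str, difficulty: int) -> int:
    """Brute-force the smallest nonce that satisfies the target."""
    for nonce in count():
        if is_valid_pow(payload, nonce, difficulty):
            return nonce


def verify_chain(blocks: list, difficulty: int) -> bool:
    """Re-check every block: each payload is chained to the previous block's hash."""
    prev = ""
    for block in blocks:
        payload = prev + block["data"]
        if not is_valid_pow(payload, block["nonce"], difficulty):
            return False
        prev = block_hash(payload, block["nonce"])
    return True


def build_chain(items: list, difficulty: int) -> list:
    """Helper for the demo: mine a nonce for each item in order."""
    blocks, prev = [], ""
    for data in items:
        nonce = mine(prev + data, difficulty)
        blocks.append({"data": data, "nonce": nonce})
        prev = block_hash(prev + data, nonce)
    return blocks


if __name__ == "__main__":
    chain = build_chain(["alice->bob:5", "bob->carol:2"], difficulty=3)
    print(verify_chain(chain, difficulty=3))
```

Verifying a nonce costs one hash, but finding one costs on the order of 16^difficulty hashes under this scheme, so mining a new transaction dominates the request time.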

40,000 RPS under $0.85/hour on AWS
The challenge

With a budget of $0.89/hour, how do you serve 15K unique requests per second from a database of 200 million tweets?

Query 2: Recommend people to follow to this Twitter user

introduction to extraction, transformation, and loading

How do you load 1 TB of compressed tweets sitting in an AWS S3 bucket into MySQL and HBase?

Here, I tried to implement a custom ranking algorithm (interaction with other users, hashtags used and followed, and user-specific keywords) as a web service.
It was physically impossible to filter and process 200 million of these intricate tweet objects on my local machine.
But Scala + Spark on AWS can! My favorite combination: Scala reads like a concise Java, and Spark is a very fast bulk-processing framework.

I explored every possible way to optimize the databases and reduce query fetching time.

As the names suggest, SQL and NoSQL systems are drastically different, even in how you go about tuning and optimizing them.
I must have redesigned the databases at least three times, starting from the basics (indexing, sharding, tweaking the instance type for CPU/RAM) and moving on to the more advanced techniques ...
... which I cannot discuss publicly because I am bound by academic integrity. Please contact me to know more. I'd be happy to discuss.

22,000 RPS under $0.89/hour on AWS
The challenge

With a budget of $1.28/hour, how do you maintain the RPS while serving double-range database queries?

Query 3: Find the topic words from this location + time

introduction to managed AWS services

Why spin things up manually when a few button clicks can do it for you?

You just have to find the right buttons!

When you identify a database bottleneck and you are on a budget, that is when you bring out a calculator to figure out how many instances to give the web servers, how many to give the databases, and whether or not to write your own load balancer.
Do you go serverless with AWS Fargate, manage instances with Amazon Lightsail, or push images to Amazon ECR and deploy Docker containers? ...
... again, academic integrity. Please contact me to know what I actually did. (not if you are a grad student from Carnegie Mellon!)
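The calculator exercise described above can be sketched generically in Python. This is not the configuration used in the project: the hourly prices and per-instance throughput figures are made-up placeholders, and the model simply assumes that throughput scales linearly with instance count and that the slower tier caps the whole service.

```python
from itertools import product


def best_split(budget_cents, web_price_cents, db_price_cents,
               web_rps, db_rps, max_instances=20):
    """Enumerate web/db instance counts within budget and keep the best one.

    End-to-end throughput is modelled as the minimum of the two tiers;
    ties are broken in favour of the cheaper configuration.
    Prices are integer cents per hour to avoid floating-point comparisons.
    """
    best = None
    for web, db in product(range(1, max_instances + 1), repeat=2):
        cost = web * web_price_cents + db * db_price_cents
        if cost > budget_cents:
            continue
        throughput = min(web * web_rps, db * db_rps)
        candidate = (throughput, -cost, web, db)
        if best is None or candidate > best:
            best = candidate
    if best is None:
        return None  # nothing fits the budget
    throughput, neg_cost, web, db = best
    return {"web": web, "db": db, "rps": throughput, "cost_cents": -neg_cost}


if __name__ == "__main__":
    # Hypothetical inputs: 128 cents/hour budget, 10-cent web instances serving
    # 2,000 RPS each, 20-cent database instances serving 3,000 RPS each.
    print(best_split(128, 10, 20, 2000, 3000))
```

Real throughput rarely scales linearly, so a calculation like this is only a starting point for load tests.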

6,000 RPS under $1.28/hour on AWS

Did I top the leaderboard for the mixed query? Yes.

I developed fault-tolerant, scalable web servers to respond to a live load, designed the schemas, and configured and optimized MySQL and HBase databases to deal with scale and improve throughput.
Also fell in love with Vert.x ❤️