Cab Service like Uber/Lyft

Processing a large stream of data using Kafka & Samza
Predicting cab fares using Machine Learning

Joined and processed multiple Kafka streams of GPS data using the Samza API to set up a driver matching service; deployed an end-to-end pipeline of cloud ML APIs and custom-trained ML models on Google App Engine (GAE)


Over the past few years, companies have found real financial value in processing streams of data in real time. As a result, data that used to be generated and processed at infrequent intervals is now processed as it arrives. This imposes strict latency and throughput requirements that batch-oriented frameworks such as MapReduce or Apache Spark cannot meet efficiently unless specific techniques or libraries are used.

The main goal of this project was to learn how to:
- describe the fault tolerance model of Kafka and Samza & generate streams of data using Kafka and make them available to a Samza consumer
- design and implement a solution to join and process streams of IoT device data and static data using the Samza API to set up an advertising system
- manage and monitor Samza jobs on YARN & explore different debugging techniques when running a Samza job on YARN
- inspect and visualize the provided data set to identify discriminating features
- train a machine learning predictor (XGBoost) on Google ML Engine using the constructed features and training data

The challenge

What are stream messages? How to consume them?
How are they fault-tolerant in a distributed environment?

Creating streams using a Kafka Producer

introduction to publish-subscribe messaging system

Publish-subscribe refers to a pattern of communication in distributed systems where the producers/publishers of data produce data categorized in different classes without any knowledge of how the data will be used by the subscribers.


How many streams?

I used the Kafka Producer API, hosted on an AWS EC2 instance, to create topics based on the type of the request: ENTERING_BLOCK, LEAVING_BLOCK, RIDE_REQUEST, RIDE_COMPLETE.
A topic is a category to which messages are published. A data consumer reads from one or more topics. Topics are generally maintained as a partitioned log (see figure above). I created 2 streams: events and driver-location.
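A minimal sketch of how a producer might decide which of the two streams a message belongs to. This is not the code used in the project: the field names (type, blockId, driverId) and the rule that anything other than the four request types goes to driver-location are assumptions made for illustration, and the actual send call to a Kafka client is left out.

```python
import json

# Assumption: the four request types named above go to "events"; any other
# message (e.g. a periodic location update) goes to "driver-location".
EVENT_TYPES = {"ENTERING_BLOCK", "LEAVING_BLOCK", "RIDE_REQUEST", "RIDE_COMPLETE"}


def topic_for(message: dict) -> str:
    """Pick the stream (topic) for a message from its hypothetical 'type' field."""
    return "events" if message.get("type") in EVENT_TYPES else "driver-location"


def encode(message: dict) -> bytes:
    """Serialize a message to UTF-8 JSON bytes, the form a producer would send."""
    return json.dumps(message, sort_keys=True).encode("utf-8")


def route(messages):
    """Yield (topic, key, value) triples; the block id is used as the message key."""
    for msg in messages:
        key = str(msg["blockId"]).encode("utf-8")
        yield topic_for(msg), key, encode(msg)
```

Keying by block id is one plausible choice: it keeps all messages for a block in the same partition, which matters once a consumer needs to join the two streams.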

How many partitions?

After analyzing the number and the patterns of the blocks in the data, I started with 3 partitions. The throughput was lower than I expected. After some experimentation, I settled on 5 partitions.
Topics are divided into partitions. A partition represents the unit of parallelism in Kafka. In general, a higher number of partitions means higher throughput. Within each partition, each message has a specific offset that consumers use to keep track of how far they have progressed through the stream.
In some sense, you can think of Kafka as categorizing your data and providing it to you grouped by key, much like the Map and Shuffle stages in MapReduce. Ordering is guaranteed only within a partition.
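To make partitions and offsets concrete, here is a toy in-memory model of a single topic. It is only a sketch and not the Kafka client API: the class and method names are hypothetical, and crc32 stands in for the key hash that Kafka's default partitioner uses.

```python
import zlib


class PartitionedLog:
    """Toy in-memory stand-in for one topic; not the Kafka client API."""

    def __init__(self, num_partitions: int):
        self.partitions = [[] for _ in range(num_partitions)]

    def partition_for(self, key: bytes) -> int:
        # Assumption: crc32 is a deterministic stand-in for Kafka's key hash.
        return zlib.crc32(key) % len(self.partitions)

    def append(self, key: bytes, value: str):
        """Append a message and return its (partition, offset)."""
        p = self.partition_for(key)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1

    def read(self, partition: int, offset: int):
        """Return messages from `offset` onward, as a resuming consumer would."""
        return self.partitions[partition][offset:]


log = PartitionedLog(num_partitions=5)
p1, o1 = log.append(b"block-12", "ENTERING_BLOCK")
p2, o2 = log.append(b"block-12", "RIDE_REQUEST")
# Same key -> same partition, and offsets grow by one per message.
assert p1 == p2 and o2 == o1 + 1
```

A consumer that remembers the last offset it processed can call read with the next offset after a restart and pick up where it left off.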

Is the messaging system reliable? Yes. Replication comes to the rescue ... as always!

The system is deployed on distributed clusters, where any node might time out or fail. Kafka uses replication for fault tolerance. For each partition of a Kafka topic, Kafka chooses a leader and zero or more followers from the servers in the cluster and replicates the log across them. The number of replicas, including the leader, is determined by the replication factor. Brokers in Kafka are responsible for message persistence and replication. Producers talk to brokers to publish messages to the Kafka cluster, and consumers talk to brokers to consume messages from it.
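The leader/follower idea can be illustrated with a deliberately simplified model. This sketch is not Kafka's actual replication protocol (it ignores in-sync replica tracking, acknowledgements, and controller election); the class name and broker names are hypothetical. It only shows why a partition with replication factor N can keep serving already replicated messages after up to N-1 broker failures.

```python
class ReplicatedPartition:
    """Toy model of one partition's replica set; not Kafka's real protocol."""

    def __init__(self, brokers, replication_factor):
        if replication_factor > len(brokers):
            raise ValueError("replication factor exceeds number of brokers")
        # The first replica acts as leader, the rest as followers.
        self.replicas = list(brokers[:replication_factor])
        self.logs = {broker: [] for broker in self.replicas}

    @property
    def leader(self):
        return self.replicas[0] if self.replicas else None

    def publish(self, message):
        """Write to the leader and copy to every live follower."""
        if self.leader is None:
            raise RuntimeError("no live replica left")
        for broker in self.replicas:
            self.logs[broker].append(message)

    def fail(self, broker):
        """Drop a failed broker; the next live replica becomes leader."""
        if broker in self.replicas:
            self.replicas.remove(broker)

    def read(self):
        """Read the full log from the current leader."""
        if self.leader is None:
            raise RuntimeError("no live replica left")
        return list(self.logs[self.leader])


partition = ReplicatedPartition(["b1", "b2", "b3"], replication_factor=3)
partition.publish("m1")
partition.fail("b1")                 # the leader goes down
assert partition.leader == "b2"      # a follower takes over
assert partition.read() == ["m1"]    # the message survived
```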

Consume Kafka's streams using Samza

introduction to stream processing

What actually happens when you request a ride on Uber/Lyft?! How are you matched with the driver? Solely based on location? What if a driver just a few yards away hasn't had a ride all day?

Here, let's try to come up with a formula to match customers to drivers. We have 2 incoming streams: events and driver-location. We have 1 Samza processor. It is imperative to keep updating the driver locations, or the formula would simply fail.

Deriving the formula

Distance: Of course, distance should be the largest factor contributing to the score, but not the only one. I used the Euclidean distance between the driver and the client.
Gender: Does the customer or the driver have a gender preference?
Salary: If a driver has already earned above average in a day, the other drivers should get the opportunity.
Rating: Finally, the customer's and the driver's ratings. These are sometimes missing, for example for a new customer or a new driver.
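One way such a score could be put together is a weighted sum of the four factors. The sketch below is an illustration, not the formula used in the project: the weights, the normalization constants, the field names, and the choice to use only the driver's rating are all assumptions. Only the use of Euclidean distance and the idea that distance weighs most come from the description above.

```python
import math

# Hypothetical weights and scales; the text only says distance should weigh most.
WEIGHTS = {"distance": 0.4, "gender": 0.2, "rating": 0.2, "salary": 0.2}
MAX_DISTANCE = 100.0   # assumed scale used to normalize distance
MAX_RATING = 5.0       # assumed rating scale
DEFAULT_RATING = 3.0   # assumed fallback when a new driver has no rating yet


def match_score(client, driver, avg_salary):
    """Combine the four factors into one score; higher means a better match."""
    dist = math.hypot(client["x"] - driver["x"], client["y"] - driver["y"])
    distance_score = max(0.0, 1.0 - dist / MAX_DISTANCE)

    pref = client.get("gender_preference", "N")  # "N" means no preference
    gender_score = 1.0 if pref in ("N", driver["gender"]) else 0.0

    rating = driver.get("rating")
    rating_score = (DEFAULT_RATING if rating is None else rating) / MAX_RATING

    # Drivers at or below the average earn full marks; higher earners are penalized.
    salary = driver["salary"]
    salary_score = 1.0 if salary <= avg_salary else avg_salary / salary

    return (WEIGHTS["distance"] * distance_score
            + WEIGHTS["gender"] * gender_score
            + WEIGHTS["rating"] * rating_score
            + WEIGHTS["salary"] * salary_score)


def best_driver(client, drivers):
    """Pick the highest-scoring driver from a non-empty list of candidates."""
    avg_salary = sum(d["salary"] for d in drivers) / len(drivers)
    return max(drivers, key=lambda d: match_score(client, d, avg_salary))
```

With two drivers at the same distance and rating, the one who has earned less that day wins the match, which is the behavior the Salary factor is meant to produce.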

Processing the Kafka Streams - finally!

I started by writing a Samza Job. A job is written using the Samza API to read and process data from one or more streams. A job may be broken into smaller units of execution called tasks.
A lot more marvellous stuff went in here, felt like I made an ungodly technological advance working on this real-time, highly scalable, fault-tolerant, beautiful framework. Unfortunately, I cannot make it public because I am bound by academic integrity. Please contact me to know more. I'd be happy to discuss.
As seen in the image above, I wrote the output to one Kafka Stream with the results.
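Since the actual job cannot be shared, here is a generic plain-Python sketch of the stream-join pattern such a task follows: keep local state from one kind of message and look it up when another kind arrives. It does not use the Samza API (which is Java), and the message fields, the event semantics, and the nearest-driver rule are assumptions for illustration only.

```python
class DriverMatchTask:
    """Plain-Python stand-in for a stream task; not the Samza API."""

    def __init__(self):
        # Local state: block id -> {driver id -> latest (x, y)} of free drivers.
        self.available = {}
        self.output = []  # stands in for the output stream

    def process(self, stream, msg):
        """Handle one message from either input stream."""
        drivers = self.available.setdefault(msg["blockId"], {})
        kind = msg.get("type")
        if stream == "driver-location":
            # Location updates only refresh drivers already known to be free.
            if msg["driverId"] in drivers:
                drivers[msg["driverId"]] = (msg["x"], msg["y"])
        elif kind in ("ENTERING_BLOCK", "RIDE_COMPLETE"):
            drivers[msg["driverId"]] = (msg["x"], msg["y"])
        elif kind == "LEAVING_BLOCK":
            drivers.pop(msg["driverId"], None)
        elif kind == "RIDE_REQUEST" and drivers:
            cx, cy = msg["x"], msg["y"]
            # Simplest possible rule: the nearest free driver in the block.
            best = min(drivers, key=lambda d: (drivers[d][0] - cx) ** 2
                                              + (drivers[d][1] - cy) ** 2)
            del drivers[best]  # the matched driver is no longer free
            self.output.append({"clientId": msg["clientId"], "driverId": best})


task = DriverMatchTask()
task.process("events", {"type": "ENTERING_BLOCK", "blockId": 1, "driverId": "d1", "x": 0, "y": 0})
task.process("events", {"type": "ENTERING_BLOCK", "blockId": 1, "driverId": "d2", "x": 10, "y": 10})
task.process("driver-location", {"blockId": 1, "driverId": "d2", "x": 1, "y": 1})
task.process("events", {"type": "RIDE_REQUEST", "blockId": 1, "clientId": "c1", "x": 1, "y": 1})
assert task.output == [{"clientId": "c1", "driverId": "d2"}]
```

Because both streams are keyed by block in this sketch, all state needed for a request lives in the task that receives it, which is what lets the work be split across tasks.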

Bugs in Samza code cannot be found using a simple debugger!

A distributed environment, with thousands of messages coming in every second in no particular order, can be very painful to debug. I found some hacks around it ...
... which I cannot discuss here. Please contact me to know more.

Can we also serve ads using the same framework? Yes! That's how ads are served on social media platforms ... or should be.

Serving Ads is intricate... there is no stopping point... one can dig deeper and deeper and find new data every time. That's what it is all about. Data.
Using a large enough data set about users and companies, I created a new Kafka stream of user data that is updated less frequently and is processed by the same Samza processor to display ads.
... same warning as above. You know what to do if you are interested!

Kafka Streams are generated in an AWS EC2 instance and processed by Samza in a distributed environment on AWS EMR. How to monitor if something goes wrong? YARN!

YARN UI proved to be extremely useful in managing Samza Jobs. Pretty self-explanatory, and helpful with debugging. Now while writing this project here, I should have taken a screenshot.

I created Kafka Messages at certain intervals, processed them using Samza on AWS EMR, served user-specific ads using the same architecture. Bit much. I will end this part of the project with a visualization made using and react-map-gl.

The challenge

With a strong backend in place, how to leverage Machine Learning for accurate fare prediction and bonus features?

Feature Engineering and Hyperparameter Tuning

Is human interpretation of the problem required to train an ML model better?

tldr; yes.

I have learned that most of the gains often come from good features and from large, high-quality data, not from sophisticated machine learning algorithms. The quality of the features directly affects the quality of the predictive models. With well-engineered features, you can get good results even with a simple model. A bad feature will often have a negative effect regardless of how sophisticated the model is.

Hence, most of the machine learning problems we face become engineering problems rather than algorithm problems.
Harshal's Insightful Tip: Use Facets Overview to get some quick insight about the distribution of values across features. You will also uncover several uncommon and common issues such as unexpected feature values, missing feature values, training/serving skew. Set up the Cloud Datalab using this official guide.

You can see the improvement by adding more good features to the model. First, I visualized the data to gain insight; then I translated the insight into code, such as data cleaning or feature improvement; and finally, I retrained the model and validated the improvement. Sounds easy! It is not!

data transformation techniques
I cannot discuss the exact approach, but most of my cleaning consisted of: Binarization, Quantization or Binning, Log Transformation, One-hot/Dummy encoding, Hashing, and Dimensionality Reduction.
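For readers unfamiliar with these terms, here are generic textbook versions of five of them in plain Python. This is not the project's cleaning code: the function names, thresholds, bin edges, and categories are hypothetical, and dimensionality reduction is left out because it needs a numerical library.

```python
import bisect
import math
import zlib


def binarize(value, threshold):
    """Binarization: 1 if the value exceeds the threshold, else 0."""
    return 1 if value > threshold else 0


def bin_index(value, edges):
    """Quantization/binning: index of the bucket the value falls into."""
    return bisect.bisect_right(edges, value)


def log_transform(value):
    """Log transformation: log(1 + x) compresses long-tailed values."""
    return math.log1p(value)


def one_hot(value, categories):
    """One-hot encoding: a 0/1 vector with a single 1 at the matching category."""
    return [1 if value == c else 0 for c in categories]


def hash_bucket(value, num_buckets):
    """Feature hashing: map an arbitrary string to a fixed number of buckets."""
    return zlib.crc32(value.encode("utf-8")) % num_buckets


# Hypothetical trip features, used only to show the calls.
assert binarize(5, threshold=3) == 1
assert bin_index(7, edges=[5, 10]) == 1          # 5 <= 7 < 10 -> middle bucket
assert one_hot("b", ["a", "b", "c"]) == [0, 1, 0]
```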
Be very careful about removing outliers from the training set, if you choose to do so. Removing outliers can definitely improve your cross-validation score, since you are removing the data points that disagree the most with your model. However, a model trained this way may not generalize well to the test set. Remember, the purpose of your model is to make good predictions on data that it has not seen in training.

hyperparameter tuning
tldr; Follow this official guide by Google.

With Google ML Engine, I did not have to host or configure my own virtual machine, manage the dependencies required for machine learning, or write code from scratch to orchestrate multiple workers and distribute the workload to achieve parallel training. This greatly simplified the process of scaling machine learning to elastic resources on the cloud: simply upload the code that trains the model and ML Engine takes care of the rest. It supported scikit-learn, XGBoost, Keras and TensorFlow.

Hyperparameters cannot be learned directly during training; they are set before the learning process begins. Because they affect training and therefore model performance, they need to be tuned to improve the accuracy of the model. For example, you could train your model, measure its error (RMSE in my case), and adjust the hyperparameters until you find a combination that yields good accuracy. The scikit-learn framework provides a systematic way of exhaustively searching all combinations of the hyperparameters, an approach known as Grid Search. While hyperparameter tuning can help to improve your model, it comes at the cost of additional computation, since you have to retrain the model for each combination of hyperparameters.
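The idea behind grid search fits in a few lines of plain Python. This sketch is a hand-rolled illustration rather than scikit-learn's implementation, and the toy fare model, its parameter names (base, rate), and the data are made up. In the toy, train_fn simply plugs the candidate values into a linear fare formula; with a real learner it would fit a model on training data using those hyperparameters.

```python
import itertools
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def grid_search(train_fn, param_grid, x_val, y_val):
    """Try every combination in param_grid and keep the one with the lowest RMSE."""
    names = sorted(param_grid)
    best_params, best_score = None, float("inf")
    for values in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        predict = train_fn(**params)  # one full (re)training per combination
        score = rmse(y_val, [predict(x) for x in x_val])
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_fare_model(base, rate):
    """Hypothetical stand-in for training: returns a fare predictor."""
    return lambda distance: base + rate * distance


distances = [1.0, 2.0, 4.0]
fares = [4.0, 5.5, 8.5]  # made-up data generated with base=2.5, rate=1.5
grid = {"base": [2.0, 2.5, 3.0], "rate": [1.0, 1.5, 2.0]}
best, score = grid_search(toy_fare_model, grid, distances, fares)
assert best == {"base": 2.5, "rate": 1.5}
```

The grid above has 3 x 3 = 9 combinations, so the model is trained 9 times; the cost grows multiplicatively with every hyperparameter added to the grid.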

The challenge

How to deploy an end-to-end pipeline of Cloud ML APIs and custom trained ML models on Google App Engine?

Final Machine Learning Application