Now, you might have noticed that we have been talking about “storage solutions”, not “databases”. So let’s have a look at databases now!
Well, there are a few factors that decide what kind of database to go with: the structure of the data, the query pattern, and the scale.
A little confusing, I know! This is why I have added the link to the video I referred to for this article. They have explained it beautifully, but we are also going to go over it in the coming sections, so read on.
Now, scale, structure, and query pattern. Right. If the information is structured and can be represented as a table, and if we need our transactions to be atomic, consistent, isolated, and durable (ACID), we go with a relational database. A commonly used one is MySQL.
Now if the ACID properties are not required, well you could still use a relational database, or you could go with a NoSQL alternative. But if your data lacks a structure, it cannot be represented as a table and now we need to use a NoSQL DB like MongoDB, Cassandra, HBase, Couchbase, etc. And this is where the query pattern becomes a deciding factor.
Pssst: Elasticsearch is sort of a special case of document DB.
If we have a vast variety of attributes in our data and a vast variety of queries, we use a Document DB like MongoDB or Couchbase. But if we have to work on an extremely large scale but have only a few types of queries to run, then we go for a Columnar DB (strictly speaking, a wide-column store) like Cassandra or HBase. And even between columnar DBs there is a difference: as you might know, HBase is built on top of Hadoop.
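Before getting into the HBase versus Cassandra trade-off, here is a minimal sketch of the decision flow described so far. The `Workload` class, its field names, and the `suggest_database` function are hypothetical names made up for illustration; the branches only encode the rules of thumb from this article, not a definitive selection procedure.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical description of a workload, using the article's factors."""
    structured: bool        # data can be represented as a table
    needs_acid: bool        # transactions must be ACID
    many_query_types: bool  # vast variety of attributes and queries
    very_large_scale: bool  # extremely large scale


def suggest_database(w: Workload) -> str:
    """Return a suggestion based on the article's rules of thumb."""
    if w.structured and w.needs_acid:
        return "relational (e.g. MySQL)"
    if w.structured:
        # ACID not required: relational still works, NoSQL is an option.
        return "relational or NoSQL"
    # Unstructured data: the query pattern becomes the deciding factor.
    if w.many_query_types:
        return "document DB (e.g. MongoDB, Couchbase)"
    if w.very_large_scale:
        return "columnar DB (e.g. Cassandra, HBase)"
    # The article does not cover this case, so no specific pick is made.
    return "NoSQL (no single clear choice)"


# Example: huge volume of ride records, only a handful of query types.
rides = Workload(structured=False, needs_acid=False,
                 many_query_types=False, very_large_scale=True)
print(suggest_database(rides))  # columnar DB (e.g. Cassandra, HBase)
```

Real systems usually weigh more than four yes/no flags, so treat this as a way to remember the order of the questions rather than as a tool.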
So while setting up HBase we first need to set up Hadoop and the related components, and then set up HBase on top of it. This adds a level of complexity to setting up the system, so I would personally go for Cassandra, if only for the sake of simplicity. Performance-wise, both give similar results.
Now, the thing with columnar databases like Cassandra is that they work mainly by partitioning and duplicating the data. So if you can choose the partition key such that all of your queries use that same partition key in the WHERE clause, Cassandra is the way to go.
I came across this article by “codekarle”, which beautifully explains when it might make sense to use a columnar DB vs a document DB, with an example of how Uber interacts with the driver and rider sides of its system. Let me try to explain this using the same scenario.
Let’s say Uber has saved the ride-related information in Cassandra with driver id as the partition key. Now when we run a query to fetch all data for a particular driver on a day, it is served from the single partition for that driver id. This is the partition side of Cassandra’s solution. Now, what if we try to query a customer’s rides on a particular day by customer id? That query has to be sent out to all the partitions, and there goes the efficiency! This is where the replication side of the solution comes in. We can simply store a second copy of the whole data set, this time with customer id as the partition key. Now when a query comes in based on customer id, it will be directed to the copy that uses customer id as the partition key. And this is why Cassandra can scale out so well. Remember the query pattern for Cassandra? We mentioned it is only useful if there is a limited variety of queries. That is because we can only replicate the data so many times, and every new query pattern needs its own copy.
Previously published behind a paywall at