# Week 5: Learning Cassandra [DoK intern Series]

Hey everyone 👋 welcome back! As you might have guessed from the thumbnail and the promotions, I recently completed my 5th week as a Data on Kubernetes Community intern. In this blog, I will share my experience and learnings from week 5.

As I explained in my last blog, I have been assigned a project to work on. After going through a couple of k8ssandra docs and reference videos, I found that it would be beneficial to learn Cassandra before k8ssandra. So I started searching for resources for learning Cassandra 'the right way'. After going through a couple of resources, these are my learnings:

What is Apache Cassandra?

Apache Cassandra is an open-source, distributed NoSQL database. It provides features like scalability, elasticity, fault tolerance, and support for hybrid deployments.

The most interesting thing I found is that Cassandra doesn't have a master node that holds all the data; instead, it uses a distributed architecture to store it. Whenever data is added (by whatever medium), it is sent to the nearest node, which is chosen on the basis of a few factors (Cassandra calls this node the coordinator). That node accepts the data and passes it on to the node that actually owns it.

To explain it better, suppose we are storing student names in a Cassandra cluster. Say we make 7 nodes and distribute the names like this: [A-D], [E-H], [I-L], [M-P], [Q-T], [U-X], [Y-Z]. Now if the data contains a name that starts with "M" but the nearest node was [U-X], the data would be passed on to [M-P]. (Real Cassandra picks the owner from a hash of the partition key rather than the first letter, but the idea is the same.)

But the process does not stop here. Imagine the [E-H] node goes down or crashes: users whose names start with E-H would have a really tough time. So what's the solution? It is really interesting, and it amazed me a lot as well: every node follows a data replication approach. This means each node's data is replicated on several other nodes. Let's understand this with a diagram. Here, as you can see, the A data is replicated on two other nodes. So if Node 1 goes down, we still have a copy of its data to serve. This makes the architecture self-healing and more reliable.

You can also have multiple node rings across several cloud service providers (Azure, Google Cloud, AWS) or local infra. Whenever data is updated in any of the rings, the change is replicated to all the other rings, so they converge on the same data, which can be accessed from whichever geographical location is most convenient.
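To make the student-name example a bit more concrete, here is a tiny Python sketch of the idea. It is only a toy model under my own assumptions: the letter ranges, the `REPLICATION_FACTOR` of 3, and the function names are all made up for illustration, and copies are simply placed on the next nodes around the ring. Real Cassandra hashes the partition key into a token and its replica placement depends on the configured replication strategy, so treat this as a picture of the concept, not of the actual implementation.

```python
# Toy model of the student-name example (hypothetical names and values).
RANGES = ["A-D", "E-H", "I-L", "M-P", "Q-T", "U-X", "Y-Z"]  # one node per range
REPLICATION_FACTOR = 3  # assumption: the owner plus two more copies


def owner_index(name: str) -> int:
    """Return the index of the node whose letter range covers the name."""
    cleaned = name.strip()
    if not cleaned:
        raise ValueError("name must not be empty")
    first = cleaned[0].upper()
    for i, letter_range in enumerate(RANGES):
        low, high = letter_range.split("-")
        if low <= first <= high:
            return i
    raise ValueError(f"no node covers names starting with {first!r}")


def replicas(name: str) -> list[str]:
    """Owner node plus the next nodes on the ring, wrapping around at the end."""
    start = owner_index(name)
    return [RANGES[(start + k) % len(RANGES)] for k in range(REPLICATION_FACTOR)]


def readable_replicas(name: str, down: set[str]) -> list[str]:
    """Replicas that can still serve the name when some nodes are down."""
    return [node for node in replicas(name) if node not in down]


if __name__ == "__main__":
    # Any node may receive the write; it is forwarded to these replicas.
    print(replicas("Maya"))
    # Even with the [E-H] node down, other copies of the data remain.
    print(readable_replicas("Emma", down={"E-H"}))
```

In this sketch a name starting with "M" always ends up on the [M-P] node and two neighbours no matter which node first received it, and losing the [E-H] node still leaves two nodes that can serve names from E-H.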

Resources I follow to learn Cassandra

Written resources

Video resources

I am learning a lot of new things every day, so if you think I have explained something wrongly, please leave a comment and let me know. Also, if you know any resources that helped you learn these topics, don't forget to link them below.

Thank you so much for reading 💖

Like | Follow | Subscribe to the newsletter.

Catch me on my socials here: bio.link/kaiwalya

Did you find this article valuable?

Support Kaiwalya Koparkar by becoming a sponsor. Any amount is appreciated!