Video Summary and Transcription
Today's talk discusses microservices optimization strategies for distributed systems, specifically focusing on implementing causal consistency to ensure data synchronization. Vector clocks are commonly used to track the causal relationships between write events in distributed systems. Causal consistency allows concurrent, independent operations to proceed without synchronization, maximizing parallelism and system resource utilization. Together with consensus algorithms, it enables effective scalability, better latency, and fault tolerance in distributed systems through coordination, resilience, reconfiguration, recovery, and data replication.
1. Microservices Optimization Strategies
Hi, viewers. Today, I'll be talking about microservices optimization strategies for distributed systems. Causal consistency is crucial for ensuring data synchronization across multiple servers, and implementing causal consistency algorithms can enhance system performance. A real-world example is Google Docs, where multiple users can simultaneously edit a document. Each user's edits are sent to their respective servers, and causal consistency is preserved across them.
Hi, viewers. I'm Santosh. I'm currently working at PyTance as an engineering lead in the San Francisco Bay Area. Today, I'll be talking about microservices optimization strategies for high performance and resilience in distributed systems.
So, today, I'll be giving some examples that are very much related to our day-to-day use and explaining with those. What are these distributed systems or large-scale systems? Examples include Google Docs, or booking systems like airline or movie booking, where a lot of us make concurrent requests or edit documents in Google Docs in parallel.
And what optimization strategies need to be considered when we build such systems? First and foremost, let's get straight into it: causal consistency. We often talk about consistency in distributed systems, where we don't have only one backend server. You have multiple systems coordinating with each other; say, writes go to one system and reads go to another, and you want the data between the write server and the read server to be synchronized.
Or you have multiple nodes geographically distributed, one in the USA, another in, say, India or Europe, and users making booking requests that ultimately hit the same database. They're all trying to book the last seat on an airplane, all accessing the same seat, with their requests landing on different servers. These servers need to coordinate in some way; the data needs to be consistent across them so that they can seamlessly provide services to all the users.
So, consistency means that when you have different servers serving the requests of various users, you want the data to be the same, or consistent, across all these servers. Within that, there is something called causal consistency. As software architects, if we incorporate causal consistency, implementing causal consistency algorithms in our backend distributed systems, it can really enhance the performance of the system.
Now, let's talk about a use case. It's a very common one, as you can see here: a Google Doc. As a real-world example, causal consistency can be seen in a collaborative editing application like this. In Google Docs, multiple users can simultaneously edit a document. Each user's edits are sent to their respective servers. As you can see here, user 1 makes a write request to write a sentence to the document, and user 2 does the same thing, while multiple users, like users 3 and 4, are trying to read.
So, here, the important thing to note is that the write done by user 1 and the write done by user 2 are related. How? User 1 writes sentence 1 to the document, and user 2 writes sentence 2 after reading the document, as you can see in steps 2 and 3 in purple. The writes by user 1 and user 2 are dependent on each other; sentence 2, written by user 2, depends on user 1's write. In the distributed-systems world, we call these events causally related: a causal relationship.
So, now when users 3 and 4 try to read the documents shared with them, as you can see in step 4, the reads by users 3 and 4 in yellow and blue, they get the response.
2. Implementing Causal Consistency
First, the order of edits in a collaborative document is crucial. Causal consistency ensures that newer edits appear after older ones for a seamless editing experience. Without causal consistency, users may see different versions of the document on different devices. Incorporating causal consistency ensures a consistent view of the document's history and preserves the relationships between edits. Coordinating the nodes in a distributed system is necessary to achieve causal consistency.
First they get sentence 1, because that is the one written by user 1. Then, when they read a second time in step 5, they get sentence 2, because that is the sentence written by user 2, in that sequence. So they should read sentence 1 first and then sentence 2, in that particular order. Why? Think of it this way: when you are using a Google document and multiple people are editing it, you do not want to see the newer edits first; you want to see the older edits first.
Because these are dependent events, user 2's sentence will always appear after user 1's sentence, regardless of the order in which the edits are received by the backend server or by the other users' devices. In other words, the backend needs to identify that these two write events are dependent on each other, and that dependency must be maintained in the distributed system so that whenever read operations happen from other users, like users 3 and 4, the order is preserved. Without causal consistency, users might see different versions of the document, with edits appearing in different orders on different devices. This could lead to a lot of confusion and make it difficult for users to collaborate effectively.
Causal consistency is a critical optimization strategy that needs to be incorporated to ensure that all users have a consistent view of the document's history, preserving the causal relationships between edits and providing a seamless editing experience. Now, going into a bit more detail: as we just discussed, there are write operations arriving at nodes (backend servers) 1 and 2, and there are nodes 3 and 4 where reads arrive. There must be some way for all these nodes to coordinate, because you can't have the read data on nodes 3 and 4 without nodes 1 and 2 doing some kind of replication.
3. Achieving Causal Consistency with Vector Clocks
This replication can be asynchronous as well. In distributed systems, logic should be implemented to define and maintain the causal relationship between write events, even after replication. Vector clocks are commonly used to track this relationship, with each node maintaining a vector of logical clocks.
This replication can be asynchronous as well. When writes arrive at nodes 1 and 2, the data has to be synced or coordinated through some message passing, whatever the protocol may be. Replication can happen through a lot of topologies, asynchronous, mesh, ring, and with multi-leader or single-leader setups; those concepts are vast, and we won't cover them here.
So, across all the nodes of the backend distributed system, logic should be running such that the causal relationship between those write events is defined and that order is maintained during and after replication, so that when reads arrive at nodes 3 and 4, they appear in the right order.
So, now, how is this causal consistency achieved? Like I said, some program or logic needs to run on the backend server. One common technique is vector clocks; it is just one way of implementing this. Vector clocks are used to track causal relationships in distributed systems. Each node maintains a vector clock: think of it as a class with a data structure like a hash map containing all the logical clocks, like timestamps. Write event 1 occurred at logical time 1, write event 2 at logical time 2, and so on.
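To make this concrete, here is a minimal sketch of such a vector clock in Python. The class and method names are illustrative, not from the talk; the essential idea is the per-node map of logical clocks.

```python
# Minimal vector clock sketch (illustrative, assuming string node ids).
class VectorClock:
    def __init__(self, node_id: str):
        self.node_id = node_id
        self.clock: dict[str, int] = {}   # node id -> logical clock value

    def increment(self) -> dict[str, int]:
        """Tick the local logical clock before a local event (e.g. a write)."""
        self.clock[self.node_id] = self.clock.get(self.node_id, 0) + 1
        return dict(self.clock)           # snapshot to attach to the write
```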
4. Understanding Causal Consistency and Vector Clocks
When a node generates a write operation, it increments its logical clock and includes it with the message or data sent to other nodes. This ensures that dependent events are ordered correctly and causal consistency is maintained. Causal consistency is achieved by maintaining the ordering relationship between events, as illustrated by the example of user 1 writing content A, user 2 writing content B after reading A, and user 3 reading A followed by B.
So, when a node generates an event like a write operation, it increments its own logical clock, incrementing a unique number and putting it in the map. When a node sends a message or replicates data to another node, it includes its current vector clock with the message or data. Upon receiving a message or data, the node updates its own vector. The nodes interact with each other through a function called merge, and the vector clocks are synchronized, telling us that two events are dependent on each other, that there is a causal relationship, and therefore that causal ordering needs to be maintained in this distributed system.
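Here is a sketch of that send/receive/merge flow, written as plain functions over the dict representation from the snippet above (the function names are my own, not from the talk):

```python
# Merge-on-receive: element-wise maximum of the two clocks.
def merge(local: dict[str, int], incoming: dict[str, int]) -> dict[str, int]:
    keys = set(local) | set(incoming)
    return {k: max(local.get(k, 0), incoming.get(k, 0)) for k in keys}

# On receipt, merge the attached clock, then tick our own entry,
# because the receive itself is an event on this node.
def on_receive(local_id: str, local: dict[str, int],
               incoming: dict[str, int]) -> dict[str, int]:
    updated = merge(local, incoming)
    updated[local_id] = updated.get(local_id, 0) + 1
    return updated
```

After the merge, the receiving node's clock reflects everything the sender had seen, so any later write on the receiver is recorded as causally after the sender's write.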
So, now, what I just described is shown here; let me explain this picture. P1, P2, P3, and P4 can be considered processes or nodes in the distributed backend system. User 1 has written content A. Then user 2, using process P2, has written content B after reading content A. When user 3 reads, they must see A first and then B, not B and then A; as we just discussed, there has to be that ordering relationship. Causal consistency is maintained when the order is maintained: A is read first, followed by B.
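With vector clocks, "A happened before B" is a simple component-wise comparison of the clocks attached to the two writes. A hypothetical sketch, reusing the dict representation above:

```python
# a causally precedes b iff a <= b in every component and a < b in at least one.
def happened_before(a: dict[str, int], b: dict[str, int]) -> bool:
    keys = set(a) | set(b)
    leq = all(a.get(k, 0) <= b.get(k, 0) for k in keys)
    lt = any(a.get(k, 0) < b.get(k, 0) for k in keys)
    return leq and lt

# The document example: P1 writes A; P2 merges A's clock, then writes B.
clock_A = {"p1": 1}
clock_B = {"p1": 1, "p2": 1}
assert happened_before(clock_A, clock_B)  # A -> B, so reads must show A before B
```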
5. Achieving High Performance with Causal Consistency
If the events are not causally related, concurrent writes can happen in any order. Causal consistency allows unrelated concurrent operations to proceed without coordination, maximizing parallelism and system resource utilization. Only causally dependent operations need coordination to maintain causal ordering, reducing overhead. Causal consistency prioritizes preserving causal relationships between events, rather than enforcing a strict global ordering.
And if the events are not causally related, like the two concurrent writes in the second diagram at the bottom, then user 1 has written A and user 2 has written B independently, and reads can see them in any order. When P3 reads, it can see either A first or B first. Here, P3 reads A first and then B, while P4 reads B first and then A. That's the difference between causally related operations and merely concurrent ones.
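With the helpers above, two writes are concurrent exactly when neither clock dominates the other, so either read order is valid. A quick illustrative check:

```python
# Concurrent: neither event causally precedes the other.
def concurrent(a: dict[str, int], b: dict[str, int]) -> bool:
    return not happened_before(a, b) and not happened_before(b, a)

# Independent writes by P1 and P2, neither having seen the other's write:
clock_A = {"p1": 1}
clock_B = {"p2": 1}
assert concurrent(clock_A, clock_B)  # P3 may read A then B; P4 may read B then A
```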
Now, digging into how high performance is achieved with causal consistency: it comes from the points mentioned here. First and foremost is concurrency. Causal consistency allows concurrent operations that are unrelated to proceed without coordination or synchronization. Not all operations are related, and for the unrelated, concurrent operations I just described, this enables multiple processes or nodes to perform independent operations concurrently, maximizing parallelism and utilizing system resources more effectively.
Like in the previous example of a causally consistent system, the second diagram at the bottom, the writes by P1 and P2 are independent, so you can just let concurrency take over and reduce coordination overhead. Under causal consistency, only the causally dependent operations need to be coordinated and synchronized to maintain causal ordering; operations that are independent of, or concurrent with, each other can proceed without coordination, reducing the overhead associated with locking, blocking, waiting, or synchronization. And then, of course, communication is optimized: because causal consistency focuses on preserving causal relationships between events rather than enforcing a strict global ordering, no global ordering work is needed. A sketch of how a replica can enforce just the causal ordering follows below.
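One standard way a replica enforces exactly this, and nothing more, is causal delivery: apply an incoming write only once every write it causally depends on has been applied, buffering it otherwise. A simplified sketch of the usual vector-clock delivery rule (the function names and the apply_write callback are hypothetical):

```python
# Deliverable iff: (1) this is the next write from its sender, and
# (2) we have already applied everything the sender had seen.
def can_deliver(local: dict[str, int], sender: str,
                msg_clock: dict[str, int]) -> bool:
    if msg_clock.get(sender, 0) != local.get(sender, 0) + 1:
        return False
    return all(msg_clock.get(k, 0) <= local.get(k, 0)
               for k in msg_clock if k != sender)

def receive(local, buffer, sender, msg_clock, apply_write):
    """Buffer the write, then drain everything whose dependencies are met."""
    buffer.append((sender, msg_clock))
    progress = True
    while progress:
        progress = False
        for s, c in list(buffer):
            if can_deliver(local, s, c):
                apply_write(s, c)               # e.g. append the sentence
                local[s] = local.get(s, 0) + 1  # this write is now seen
                buffer.remove((s, c))
                progress = True
```

Causally unrelated writes always pass the check immediately, which is why no coordination cost is paid for them.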
6. Distributed Systems Optimization Strategies
Scalability is guaranteed by allowing concurrent and independent operations to proceed without synchronization; causal consistency enables effective scaling by adding more nodes or resources, and brings better latency and fault tolerance. Consensus algorithms ensure coordination and resilience in distributed systems: nodes collaborate through message passing and agree on common values, handling node failures and Byzantine failures. Reconfiguration, recovery, and data replication ensure fault tolerance and resilience.
And then scalability is guaranteed. By allowing concurrent and independent operations to proceed without any synchronization, causal consistency again enables these systems to scale more effectively. How? Because systems can add more nodes or resources to handle increasing workloads without introducing the bottlenecks or contention points associated with centralized coordination. In other words, the lack of centralized coordination for concurrent operations that are not causally related enhances scalability as well.
Then latency: of course, latency is better when things happen in parallel. And fault tolerance: causal consistency provides a flexible and resilient framework for handling failures and maintaining system availability as well. For example, independent operations can continue to execute even in the presence of partial failures or network partitions, which allows the system to tolerate faults and recover gracefully without sacrificing performance.
Now, that is it for causal consistency. The second aspect of microservices optimization is consensus algorithms, which achieve resilience. So what is consensus, first of all? I just described how distributed system nodes interact with each other, be it for replicating data between nodes; there is another use case where the nodes need to interact. At the beginning of this presentation, I talked about booking systems, say movie booking or airline reservation, where people compete to book a common spot, like the same seat; the same seat is a good example. In that case, there needs to be coordination between all the nodes in the backend. And what if there are failures during that coordination or message passing among the nodes? Even during those failures, we want the backend system to operate the same way it would if there were no failure. That is called consensus: all the nodes need to reach a consensus on how they are going to coordinate. For example, if the leader node is dead or not working, or a network failure has happened, we still want the results to be retrieved, the processing to be done, and the response provided to the user, so the user is not affected.
Say, for example, there is a node failure during coordination between the nodes, while critical-section access requiring mutual exclusion is needed, or a message has to be broadcast to the other nodes from a node that failed. In all these failure cases, we do not want the system to be affected, and that is guaranteed by consensus algorithms: if something fails, a consensus algorithm ensures that a new node is elected as leader, and things like mutual exclusion and multicasting messages are still guaranteed; some algorithms also ensure that the right communication happens between the nodes. So, like I said, multiple nodes are mutually connected, collaborate through message passing, and during computation need to agree upon a common value to coordinate among multiple processes. This is called distributed consensus. Why is it required? In a distributed system, it may happen that multiple nodes are processing large computations in a distributed manner, and each needs to know the results of the other nodes to keep the whole system up to date. In such a situation, the nodes need to agree upon a common value even in the presence of failures or network partitions. That is the whole point, and I have mentioned two types of failures here. One is a crash failure, when a node stops responding to the other nodes of the system due to some hardware or software fault. This is a very common issue in distributed systems, and it can be handled relatively easily by ignoring the node's existence: the logic can simply assume the node no longer exists when the problem arises.
So, that is simple to handle, but the Byzantine failure is the tricky one. One or more nodes have not crashed, but a node behaves abnormally and forwards a different message to different peers, as you can see with the different colors there, due to some internal or external attack on that node. Handling this situation is complicated: a Byzantine failure is about a faulty node sending different messages to different peers. Consensus algorithms address these kinds of problems, and this is one of the best optimization strategies or considerations in the microservices world. Now let's take one example of coordination and a use case. Again, like I mentioned earlier, with booking.com or airline booking, there are various algorithms to reach consensus in distributed systems, like Paxos, Raft, ZAB, and PBFT, which I have listed in the next slide. I'll be demonstrating an approach called the voting-based model and how the nodes coordinate with each other in it.
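To make the contrast concrete, here is a toy illustration of the two failure modes (the classes and return values are purely hypothetical):

```python
# Crash failure: the node simply goes silent; peers can time out and ignore it.
class CrashedNode:
    def vote(self, peer: str):
        return None

# Byzantine failure: the node answers, but tells different peers different
# things, which is much harder to detect than silence (PBFT targets this case).
class ByzantineNode:
    def vote(self, peer: str):
        return "customer-42" if peer == "node-2" else "customer-99"
```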
So, what happens in this is that users 1, 2, 3, and 4, as you can see, are trying to book the same seat on an airplane or in a theater, and their respective requests go to corresponding nodes 1, 2, 3, and 4. Each node internally may have a leader-follower setup, but we won't go into those details. After step 2, there has to be a way for all these nodes to coordinate, because they are geographically distributed. Let's assume the users are booking from, say, the USA, Europe, and Argentina. All these nodes need to coordinate such that only one person gets the booking, and that decision is usually made by the consensus algorithm.

In step 3, the algorithm selects the leader node, and the leader proposes a value, such as a customer id: this customer gets the seat. The other nodes then need to vote on that value. In this case, the leader, node 1, sends the proposed customer id to nodes 2, 3, and 4 (step 3), and all the other nodes send a response back, voting on which customer id should get the booking (step 4). After the votes are in, the leader takes the majority vote and makes a decision; this is called a deterministic approach. Once that is determined, in step 5 the leader node, node 1, communicates to all the other nodes which customer was picked.

That's how this voting model works, and there are a lot of other algorithms that help with coordination. This is a very important concept for having the most optimized microservices architecture in the backend when you're designing systems. Now, how do these algorithms guarantee the resilience I'm talking about? What is resilience?
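Here is a highly simplified sketch of that voting flow: the leader proposes a value, followers vote, and the decision commits only with a strict majority. All names are schematic; this is not a real Paxos or Raft implementation:

```python
from typing import Callable, Optional

def run_voting_round(leader: str, followers: list[str], proposal: str,
                     cast_vote: Callable[[str, str], bool]) -> Optional[str]:
    """Steps 3-5: propose, collect votes, commit on a majority quorum."""
    votes = 1                              # the leader accepts its own proposal
    for node in followers:
        if cast_vote(node, proposal):      # each follower votes yes/no
            votes += 1
    cluster_size = len(followers) + 1
    if votes > cluster_size // 2:          # strict majority required
        return proposal                    # step 5: announce the decision
    return None                            # no quorum: retry or re-elect

# Example: node 1 leads, proposing that customer "C123" gets the last seat.
decision = run_voting_round(
    "node-1", ["node-2", "node-3", "node-4"],
    proposal="C123",
    cast_vote=lambda node, value: True,    # stub: every follower accepts
)
assert decision == "C123"
```

The majority rule here is also what the quorum-based agreement point below relies on: a value counts as committed only once more than half the cluster has accepted it.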
So, resilience can be broken down into several aspects. The first is fault tolerance: these algorithms are designed to tolerate faults like node failures, message loss, or network partitions. For example, with Paxos or Raft, as long as a majority of nodes are functioning correctly and can communicate with each other, the system can make progress and reach consensus on proposed values.

The second resilience guarantee is leader election. Like I said before, algorithms like Raft or ZooKeeper's atomic broadcast protocol (ZAB) provide leader election mechanisms to ensure that only one node acts as leader at any given time. This leader is responsible for coordinating the agreement process and ensuring that all nodes converge to the same state.

Next is quorum-based agreement. These algorithms typically require a quorum of nodes to agree on a value, like the voting we just saw, before it is considered committed. By ensuring that a sufficient number of nodes participate in the agreement process, these algorithms prevent the system from reaching an inconsistent state or making incorrect decisions. Again, in Paxos or Practical Byzantine Fault Tolerance (PBFT), a value is considered committed once it has been accepted by a majority of nodes, and that guarantee is made by the algorithms.

Then, of course, there is data replication, which I covered earlier. Many consensus algorithms involve copying data from one node to another, or within a node group if you have a leader-follower setup; the copying can be done asynchronously, and the algorithm also handles this replication aspect to ensure fault tolerance and resilience. By maintaining multiple copies of the data, distributed systems continue to operate correctly even if some nodes fail or are unavailable.

And the last thing is reconfiguration and recovery. Consensus algorithms provide mechanisms for this in the event of node failures and network partitions. For example, Paxos and Raft support dynamic reconfiguration, allowing nodes to join or leave the system dynamically without disrupting its operation. In this way, these algorithms ensure that the system can recover from failures or partitions by re-electing leaders, resynchronizing data, and restoring consistency among nodes.

So, these are the two main aspects of microservices optimization strategies in the architecture of distributed systems to achieve higher performance and resilience. Thanks a lot for watching; I look forward to another session to share more insights. Thank you.