Building a Distributed Log from Scratch: Data Replication
217 points by bhattisatish 7 years ago | 21 comments
- majidazimi 7 years ago Regarding Kafka replication: what if the following scenario happens?
1. Leader replicates to followers
2. Followers get the messages and send ACK to leader
3. Leader gets the confirmation from the followers, increments the high watermark (HW), and replies to the client that the message is committed.
4. NOW: Leader fails before it could piggyback to followers that HW has been incremented.
5. The question is: since the potential leader is not aware that the HW has been incremented, it becomes the new leader and truncates its log to the old HW, which means we have lost an entry that the previous leader confirmed.
As a result, the client was told that the message was successfully written, but the message has been lost.
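The feared sequence can be sketched as a toy simulation (the `Replica` class and field names here are illustrative, not Kafka's actual code):

```python
# Toy model of the failure scenario described above. Illustrative only.

class Replica:
    def __init__(self):
        self.log = []   # replicated messages
        self.hw = 0     # high watermark: offset below which messages are committed


def scenario():
    leader, follower = Replica(), Replica()

    # 1-2. Leader replicates a message; follower appends it and ACKs.
    leader.log.append("m1")
    follower.log.append("m1")

    # 3. Leader advances its HW and acknowledges the client.
    leader.hw = 1
    client_acked = True

    # 4. Leader fails before the new HW reaches the follower,
    #    so the follower still has hw == 0.

    # 5. If the follower became leader and truncated to its *old* HW,
    #    the acknowledged message would be gone:
    follower.log = follower.log[:follower.hw]
    return client_acked, follower.log


acked, log = scenario()
# acked is True but log is empty: the data loss the commenter fears.
```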
- teraflop 7 years ago There's a more accurate and detailed explanation of the Kafka protocol here: https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Repl...
In step 5 of your example, the new leader does not truncate its log; only the new followers do.
Basically, every message that was committed is guaranteed to be in every in-sync replica's log. (Some of these may be after the replica's HW.) But the converse is not true: some replica's logs will contain messages that may not have been committed.
So the new leader has to keep its entire log -- which includes all acknowledged messages, plus some unacknowledged ones that become committed as a result of the leadership transition. But the new followers have to discard messages above their HW, because those messages might not be present on the new leader.
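A hedged sketch of that rule (the function and names are illustrative, not Kafka's actual implementation): the new leader keeps everything, while a follower truncates above its own HW and then re-fetches from the new leader.

```python
# Illustrative sketch: follower behavior on a leadership change.

def become_follower(replica_log, replica_hw, new_leader_log):
    # Discard everything above the local HW (those entries may not
    # exist on the new leader), then catch up from the new leader.
    truncated = replica_log[:replica_hw]
    return truncated + new_leader_log[len(truncated):]


# The new leader (an in-sync replica) has the committed message "m1"
# even though its own HW never advanced past 0; it keeps its whole log.
new_leader_log = ["m1"]

# A follower with hw == 0 truncates, then re-replicates "m1": the
# acknowledged message survives the transition.
follower_log = become_follower(["m1"], 0, new_leader_log)

# A follower with an uncommitted divergent tail drops it and converges:
diverged = become_follower(["m1", "x"], 1, ["m1", "m2"])
```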
- majidazimi 7 years ago The doc says:
> The replica that registers first becomes the new leader. The new leader chooses its LEO as the new HW.
Isn't it a better idea to elect the process with the largest offset? Basically, during leader election a process should get confirmation from a majority of correct processes. If another process observes that it has a larger offset in its log, it rejects the election and starts another round. Eventually a correct process with the largest offset will become leader. Regarding correctness, this sounds like a safer election to me.
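The proposed rule (elect the replica with the largest log end offset, LEO) could be sketched like this; the names are illustrative, and a tie-break would still be needed in practice:

```python
# Illustrative sketch of largest-LEO election, not Kafka's actual logic.

def elect_largest_leo(leos):
    # leos: dict mapping replica id -> log end offset.
    # Returns the id of a replica with the maximal LEO.
    return max(leos, key=leos.get)


candidates = {"broker-1": 42, "broker-2": 45, "broker-3": 45}
leader = elect_largest_leo(candidates)
# brokers 2 and 3 tie at 45; a deterministic tie-break
# (e.g. lowest broker id) would be needed in a real protocol.
```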
- teraflop 7 years ago Yeah, I think that would be a reasonable alternate approach. It's not really "safer", though.
None of the replicas are guaranteed to have all of the messages that reached the old leader (some of those might never have been replicated anywhere). So clients that didn't receive acknowledgements might need to retry on the new leader, no matter what. Since the clients have to have retry logic anyway, there's no safety issue with discarding some of the messages that weren't yet acknowledged.
Your suggestion would try to minimize the number of necessary retries, but at the cost of adding delay and complexity to the leader election process.
- caust1c 7 years ago IIRC the HW of a follower is updated after the message is written to memory, meaning that if the leader failed at that instant, it would have writes which weren't acknowledged to the client.
Kafka is an at-least-once system, not transactional.
It's been a while since I've read the code though, so it might have changed.
- geertj 7 years ago I'd like to know the answer to this as well. I've been wondering about the same scenario in the context of 2-phase commit, which I believe is essentially what this is.
- teraflop 7 years ago Two-phase commit is actually very different, because it assumes the coordinator never fails permanently. In 2PC, the coordinator's storage is always the authoritative decision-maker on whether the system is committing or aborting. If the coordinator goes down, all you have to do is wait for it to come back.
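That property can be condensed into a tiny sketch (illustrative only): the coordinator's durable record is the sole authority, and while it is unavailable everyone must block.

```python
# Illustrative sketch of 2PC outcome resolution: only the coordinator's
# durable record decides; participants that are unsure can only wait.

def resolve_outcome(coordinator_record):
    if coordinator_record is None:
        # Coordinator is down and its decision is unknown: no
        # participant may decide unilaterally, so the protocol blocks.
        return "blocked: wait for coordinator to recover"
    return coordinator_record  # "commit" or "abort"
```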
- yazaddaruvala 7 years ago Are there any distributed logs which operate more as a framework than a service/platform?
Imagine Kafka but I am able to run my processors within the same JVM. You might even call them servlets for Kafka.
- fizx 7 years ago Plenty of things available when you google "embedded kafka". Still, it's mostly a bad idea.
- dullgiulio 7 years ago Ehm, what do you mean?
If you want to work on streaming data (streaming logs), you can use Spark and others...
- polskibus 7 years ago Maybe Akka with the Persistence module?
- ddorian43 7 years ago Maybe Apache Samza
- chrisweekly 7 years ago Outstanding material. Bookmarked.
- jaequery 7 years ago I'd like to hear if blockchain can be used here to simplify all of this. Anyone?
- jteppinette 7 years ago Blockchains support untrusted nodes, which is typically not a requirement in the distributed commit log use case. Also, this wouldn't be a simplification: it would increase your costs and decrease performance.
- jaequery 7 years ago I was thinking that if you roll your own fork of a blockchain, set the costs to zero, and run your own miners as the distributed nodes, it could work as a distributed storage that takes care of consensus.
- teraflop 7 years ago On top of the extra complexity, a blockchain has much weaker consistency properties than Paxos/Raft/Kafka. Because of finite propagation delays and network packet loss, there are normally multiple valid chains at any given moment in time.
Even if you assume that all nodes are benevolent, and try their best to converge on the single longest chain without needing mining fees as an incentive, you still can't know whether the chain you're querying is the "winner" unless you have an omniscient view of the entire system. So you only get eventual consistency, as opposed to sequential consistency.
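A toy illustration of that point, assuming a simple longest-chain rule (real fork choice, e.g. Bitcoin's, compares accumulated work rather than raw length; block names here are made up):

```python
# Illustrative sketch: two honest nodes following the longest-chain
# rule can disagree while a fork is still propagating.

def longest_chain(chains):
    # Pick the longest known chain as the current "winner".
    return max(chains, key=len)


# Blocks "b2" and "b2p" were mined near-simultaneously on top of "b1",
# and each node has only seen one of the competing branches extended.
node_a_view = [["g", "b1", "b2"], ["g", "b1"]]
node_b_view = [["g", "b1", "b2p"], ["g", "b1"]]

tip_a = longest_chain(node_a_view)[-1]
tip_b = longest_chain(node_b_view)[-1]
# The nodes answer queries from different tips until one branch wins
# and propagates: eventual consistency, not sequential consistency.
```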
- majidazimi 7 years ago Distributed logs like Kafka/Pulsar/NATS are essentially blockchain technology in trusted environments.
Raft/Multi-Paxos are already the cheapest consensus protocols in terms of communication; theoretically you can't do better with current network guarantees.