428 lines
13 KiB
Markdown
428 lines
13 KiB
Markdown
|
---
|
||
|
title: CRDTs
|
||
|
subtitle: Guaranteed Eventual Consistency
|
||
|
author: Levi Pearson
|
||
|
...
|
||
|
|
||
|
# Introduction
|
||
|
|
||
|
## Goals - To Understand
|
||
|
|
||
|
+ What makes a CRDT
|
||
|
|
||
|
+ Theoretical Impact
|
||
|
|
||
|
+ Practical Relevance
|
||
|
|
||
|
+ Some Existing CRDTs
|
||
|
|
||
|
## Source Paper
|
||
|
|
||
|
+ *A Comprehensive Study of Convergent and Commutative Replicated Data Types*
|
||
|
2011 INRIA Research Report by Shapiro, Preguiça, Baquero, Zawirski
|
||
|
|
||
|
+ Many related publications on CRDTs from same authors
|
||
|
|
||
|
+ Zawirski's Ph.D. Thesis is related to size-optimizing CRDTs
|
||
|
|
||
|
## Outline
|
||
|
|
||
|
+ Basic CRDT Concept
|
||
|
|
||
|
+ Detailed Model of CRDTs
|
||
|
|
||
|
+ Portfolio of CRDT Specifications
|
||
|
|
||
|
+ Garbage Collection Issues
|
||
|
|
||
|
+ A Shopping Cart Example
|
||
|
|
||
|
# CRDTs in a Nutshell
|
||
|
|
||
|
## Motivations
|
||
|
|
||
|
+ **Eventual Consistency** for scalability
|
||
|
|
||
|
+ Improve theoretical grounding
|
||
|
|
||
|
+ Weaken coordination requirements
|
||
|
|
||
|
## Eventual Consistency
|
||
|
|
||
|
+ Updates are executed at replicas
|
||
|
|
||
|
+ Later, sent to all other replicas
|
||
|
|
||
|
+ Eventually, all replicas see all updates
|
||
|
|
||
|
+ They apply asynchronously and possibly in different orders
|
||
|
|
||
|
+ Concurrent updates may conflict and require *arbitration*
|
||
|
|
||
|
## Strong Eventual Consistency
|
||
|
|
||
|
+ Formal model of Eventual Consistency
|
||
|
|
||
|
+ No conflict-resolution or roll-back required; *Conflict-Freedom*
|
||
|
|
||
|
+ Extremely scalable and fault-tolerant
|
||
|
|
||
|
+ Based on mathematical properties
|
||
|
|
||
|
## Two Related Formulations
|
||
|
|
||
|
+ *Convergent* Replicated Data Types (CvRDT)
|
||
|
|
||
|
- State-based
|
||
|
- Weak network requirements
|
||
|
- Based on *monotonic join semilattices*
|
||
|
|
||
|
+ *Commutative* Replicated Data Types (CmRDT)
|
||
|
|
||
|
- Operation-based
|
||
|
- Requires reliable broadcast with *delivery order*
|
||
|
- Based on *commutativity of concurrent operations*
|
||
|
|
||
|
+ Both can always simulate one another
|
||
|
|
||
|
# System Model of CRDTs
|
||
|
|
||
|
## Distributed Systems
|
||
|
|
||
|
+ Processes interconnected by asynchronous network
|
||
|
+ Network can partition and recover
|
||
|
+ Nodes can operate disconnected for some time
|
||
|
+ Processes may crash and recover, memory survives crashes
|
||
|
+ Assumption of non-Byzantine behavior
|
||
|
|
||
|
## Atoms and Objects
|
||
|
|
||
|
+ Processes store **atoms** and **objects**
|
||
|
+ Atoms are base immutable data; directly comparable values
|
||
|
+ Objects are mutable, replicated data; collections of objects and atoms
|
||
|
+ Objects with the same identity but different location are **replicas**
|
||
|
+ Objects are independent; no transactions
|
||
|
|
||
|
## Operations
|
||
|
|
||
|
+ There are **clients** that query and modify objects via **operations**
|
||
|
|
||
|
+ A client acts on one replica, the **source** replica
|
||
|
|
||
|
+ The operation executes locally at the source and then remotely
|
||
|
- *Phase 1*: Operation called at source; possibly some processing occurs
|
||
|
- *Phase 2*: Update transmitted asynchronously to **downstream** replicas
|
||
|
|
||
|
## State-Based Replication
|
||
|
|
||
|
+ Update happens entirely at source replica via **update**
|
||
|
+ Modified payload transmitted downstream
|
||
|
+ Receivers use **merge** operation to update their payload
|
||
|
+ Replicas can be compared via **compare** operation
|
||
|
+ Operations guarded by **source pre-conditions**
|
||
|
|
||
|
## State-Based Causal History Model
|
||
|
|
||
|
+ Initially, set $C(x_i)$ of each replica $x_i$ is $\emptyset$
|
||
|
+ After update $f$, $C(f(x_i))$ is $C(x_i) \cup \{f\}$
|
||
|
+ After merge of $C(x_i)$ and $C(x_j)$, the state is $C(x_i) \cup C(x_j)$
|
||
|
+ The "happens-before" relation $f \to g \Leftrightarrow C(f) \subset C(g)$
|
||
|
|
||
|
## State-Based Network Assumptions
|
||
|
|
||
|
+ States transmit between replicas at unspecified times
|
||
|
|
||
|
+ Transmissions occur infinitely often
|
||
|
|
||
|
+ Communication forms a connected graph
|
||
|
|
||
|
## Operation-Based Replication
|
||
|
|
||
|
+ Phase 1 of **update** is local to source replica
|
||
|
- Precondition guard must be true
|
||
|
- Executes immediately and atomically
|
||
|
- No side-effects
|
||
|
- May return results
|
||
|
+ Phase 2 of **update** executes asynchronously on downstream replicas
|
||
|
- Downstream precondition must be true
|
||
|
- Can't return results
|
||
|
- Updates downstream state
|
||
|
- Arguments prepared by Phase 1
|
||
|
- Executes atomically
|
||
|
|
||
|
## Operation-Based Causal History Model
|
||
|
|
||
|
+ Initially, set $C(x_i)$ of each replica $x_i$ is $\emptyset$
|
||
|
+ After downstream phase, $f$, $C(f(x_i))$ is $C(x_i) \cup \{f\}$
|
||
|
+ The "happens-before" relation $f \to g \Leftrightarrow C(f) \subset C(g)$
|
||
|
+ **causal delivery** means that if $f \to g$, then $f$ is delivered before $g$
|
||
|
|
||
|
## Operation-Based Network Assumptions
|
||
|
|
||
|
+ Presence of a reliable broadcast mechanism
|
||
|
+ Delivery of every update in a **delivery order**
|
||
|
+ The delivery order maintains downstream preconditions
|
||
|
+ A delivery order of causal delivery is always sufficient
|
||
|
|
||
|
## Definition of Convergence
|
||
|
|
||
|
+ Two replicas $x_i$ and $x_j$ of $x$ *converge eventually* if:
|
||
|
|
||
|
- Safety: $\forall{i,j}: C(x_i) = C(x_j)$ implies payloads of $i$ and $j$ are equivalent
|
||
|
|
||
|
- Liveness: $\forall{i,j}: f \in C(x_i)$ implies *eventually* $f \in C(x_j)$
|
||
|
|
||
|
+ States are equivalent when all **query** operations return the same values
|
||
|
|
||
|
## Convergent Replicated Data Type (CvRDT)
|
||
|
|
||
|
+ Based on concept of **join semilattice**
|
||
|
|
||
|
- A *set* of values equipped with
|
||
|
- a *partial order* relation ($\leq$ or $\sqsubseteq$) on them
|
||
|
- and a *least upper bound* (LUB or $\sqcup$) operation, aka *join*
|
||
|
|
||
|
+ LUB: if $m = x \sqcup y$ then $m$ is a LUB under $\leq$ **iff**
|
||
|
|
||
|
- $x \leq m \land y \leq m$
|
||
|
- There is no $m' \leq m$ where $x \leq m' \land y \leq m'$
|
||
|
- The "smallest" $m$ in the set that is "at least as big" as both $x$ and $y$
|
||
|
- LUB is *commutative*, *idempotent*, and *associative*
|
||
|
|
||
|
+ An *ordered set* is a *join semilattice* if for all pairs, a LUB exists
|
||
|
|
||
|
+ For CvRDTs, **merge** is the LUB operation on the payloads
|
||
|
|
||
|
+ All operations are *non-decreasing* with respect to the ordering
|
||
|
|
||
|
## Example CvRDT
|
||
|
|
||
|
+ An integer counter; our set is $\mathbb{Z}$
|
||
|
+ Our ordering is numeric $\leq$
|
||
|
+ For merge, we use $max()$
|
||
|
+ Our payload, `x`, is initialized to `0`
|
||
|
+ Operation **value** returning an integer $j$ is `let j = x`
|
||
|
+ Operation **compare(a,b)** is `a.x ` $\leq$ ` b.x`
|
||
|
+ Operation **inc** is `x := x + 1`
|
||
|
+ Operation **merge(a,b)** is `max(a,b)`
|
||
|
|
||
|
## Commutative Replicated Data Type (CmRDT)
|
||
|
|
||
|
+ Based on a reliable broadcast strong enough to guarantee **delivery order**
|
||
|
+ **Delivery order**, $<$, is specified per CmRDT
|
||
|
+ Some operations are not ordered; they are *concurrent*, $\|$
|
||
|
+ $f \| g \Leftrightarrow f \nless g \land g \nless f$
|
||
|
+ All replicas converge to the same state when:
|
||
|
- all concurrent operations commute, so
|
||
|
- all execution orders consistent with delivery order are equivalent
|
||
|
+ Reliable causal delivery
|
||
|
- doesn't require agreement
|
||
|
- is immune to partitioning (connected subsets still make deliveries)
|
||
|
- eventually delivers all updates to all nodes
|
||
|
+ Delivery order is never stricter than causal delivery
|
||
|
|
||
|
## Relationship of CvRDT and CmRDT
|
||
|
|
||
|
+ State-based CvRDT are:
|
||
|
- easy to reason about (relatively)
|
||
|
- requires only weak network properties, no membership tracking
|
||
|
- may be inefficient for large objects
|
||
|
- used in NFS, AFS, Coda, Dynamo, Riak
|
||
|
+ Operation-based CmRDT are:
|
||
|
- harder to reason about; requires reasoning about history
|
||
|
- have greater expressive power
|
||
|
- require an underlying reliable broadcast protocol, need membership tracking
|
||
|
- used in Bayou, Rover, IceCube, Telex
|
||
|
+ They can always emulate one another (although not necessarily efficiently)
|
||
|
|
||
|
# Portfolio of CRDTs
|
||
|
|
||
|
## Categories
|
||
|
|
||
|
+ Counters
|
||
|
|
||
|
+ Registers
|
||
|
|
||
|
+ Sets
|
||
|
|
||
|
+ Graphs
|
||
|
|
||
|
+ Co-op Text Editing
|
||
|
|
||
|
## State-Based Outline
|
||
|
|
||
|
| **payload** *Payload type; instantiated at all replicas*
|
||
|
| **initial** *Inital value*
|
||
|
| **query** *Query*( *arguments* ) : *returns*
|
||
|
| **pre** *Precondition*
|
||
|
| **let** *Evaluate synchronously, no side effects*
|
||
|
| **update** *Source-local operation*( *arguments* ) : *returns*
|
||
|
| **pre** *Precondition*
|
||
|
| **let** *Evaluate at source, synchronously*
|
||
|
| *Side effects at source to execute synchronously*
|
||
|
| **compare** (value1, value2) : boolean *b*
|
||
|
| *Is value1* $\leq$ *value2 in semilattice?*
|
||
|
| **merge** (value1, value2) : payload mergedValue
|
||
|
| *LUB merge of value1 and value2, at any replica*
|
||
|
|
||
|
## Op-Based Outline
|
||
|
|
||
|
| **payload** *Payload type; intstantiated at all replicas*
|
||
|
| **initial** *Initial value*
|
||
|
| **query** *Source-local operation*( *arguments* ) : *returns*
|
||
|
| **pre** *Precondition*
|
||
|
| **let** *Execute at source, synchronously, no side-effects*
|
||
|
| **update** *Global update*( *arguments* ) : *returns*
|
||
|
| **atSource**( *arguments* ) : *returns*
|
||
|
| **pre** *Precondition at source*
|
||
|
| **let** *1st phase: synchronous, at source, no side-effects*
|
||
|
| **downstream** ( *arguments passed downstream* )
|
||
|
| **pre** *Precondition against downstream state*
|
||
|
| *2nd phase, asynchronous, side-effects to downstream state*
|
||
|
|
||
|
## Op-based Counter
|
||
|
|
||
|
| **payload** integer $i$
|
||
|
| **initial** $0$
|
||
|
| **query** *value*() : integer $j$
|
||
|
| **let** $j = i$
|
||
|
| **update** *increment*()
|
||
|
| **downstream**()
|
||
|
| $i := i + 1$
|
||
|
| **update** *decrement*()
|
||
|
| **downstream**()
|
||
|
| $i := i - 1$
|
||
|
|
||
|
## G-Counter
|
||
|
|
||
|
| **payload** integer[$n$] $P$
|
||
|
| **initial** $[0,0,\dots,0]$
|
||
|
| **update** *increment*()
|
||
|
| **let** $g = myID()$
|
||
|
| $P[g] := P[g] + 1$
|
||
|
| **query** *value*() : integer $v$
|
||
|
| **let** $v = \sum_i{P[i]}$
|
||
|
| **compare**($X, Y$) : boolean $b$
|
||
|
| **let** $b = (\forall i \in [0, n-1] : X.P[i] \leq Y.P[i])$
|
||
|
| **merge**($X, Y$) : payload $Z$
|
||
|
| **let** $\forall i \in [0, n - 1] : Z.P[i] = max(X.P[i], Y.P[i])$
|
||
|
|
||
|
## PN-Counter
|
||
|
|
||
|
| **payload** integer[$n$] $P$, integer[$n$] $N$
|
||
|
| **initial** $[0,0,\dots,0], [0,0,\dots,0]$
|
||
|
| **update** *increment*()
|
||
|
| **let** $g = myID()$; $P[g] := P[g] + 1$
|
||
|
| **update** *decrement*()
|
||
|
| **let** $g = myID()$; $N[g] := N[g] + 1$
|
||
|
| **query** *value*() : integer $v$
|
||
|
| **let** $v = \sum_i{P[i]} - \sum_i{N[i]}$
|
||
|
| **compare**($X, Y$) : boolean $b$
|
||
|
| **let** $b = (\forall i \in [0, n-1] : X.P[i] \leq Y.P[i] \land \forall i \in [0,n - 1] : X.N[i] \leq Y.N[i])$
|
||
|
| **merge**($X, Y$) : payload $Z$
|
||
|
| **let** $\forall i \in [0, n - 1] : Z.P[i] = max(X.P[i], Y.P[i])$
|
||
|
| **let** $\forall i \in [0, n - 1] : Z.N[i] = max(X.N[i], Y.N[i])$
|
||
|
|
||
|
## LWW-Register
|
||
|
|
||
|
| **payload** $X$ $x$, timestamp $t$
|
||
|
| **initial** $\bot, 0$
|
||
|
| **update** *assign*( $X$ $w$ )
|
||
|
| $x, t := w, now()$
|
||
|
| **query** *value*() : $X$ $w$
|
||
|
| **let** $w = x$
|
||
|
| **compare**($R, R'$) : boolean $b$
|
||
|
| **let** $b = (R.t \leq R'.t)$
|
||
|
| **merge**($R, R'$) : payload $R''$
|
||
|
| **if** $R.t \leq R'.t$ **then** $R''.x, R''.t = R'.x, R'.t$
|
||
|
| **else** $R''.x, R''.t = R.x, R.t$
|
||
|
|
||
|
## LWW-Register (Op-based)
|
||
|
|
||
|
| **payload** $X$ $x$, timestamp $t$
|
||
|
| **initial** $\bot, 0$
|
||
|
| **query** *value*() : $X$ $w$
|
||
|
| **let** $w = x$
|
||
|
| **update** *assign*( $X$ $x'$ )
|
||
|
| **atSource**() $t'$
|
||
|
| **let** $t' = now()$
|
||
|
| **downstream**($x', t'$)
|
||
|
| **if** $t < t'$ **then** $x, t := x', t'$
|
||
|
|
||
|
## MV-Register
|
||
|
|
||
|
| **payload** set $S$ of $(x, V)$ pairs; $x \in X$; $V$ its version vector
|
||
|
| **initial** $\{(\bot, [0,\dots,0])\}$
|
||
|
| **query** *incVV*() : integer[$n$] $V'$
|
||
|
| **let** $g = myID()$
|
||
|
| **let** $\mathcal{V} = \{V \mid \exists x : (x,V) \in S\}$
|
||
|
| **let** $V' = [ max_{V \in \mathcal{V}}(V[j]) ]_{j \neq g}$
|
||
|
| **let** $V'[g] = max_{V \in \mathcal{V}}(V[g]) + 1$
|
||
|
| **update** *assign*( set $R$ )
|
||
|
| **let** $V = incVV()$
|
||
|
| $S := R \times \{V\}$
|
||
|
|
||
|
## MV-Register (cont.)
|
||
|
|
||
|
| **query** *value*() : set $S'$
|
||
|
| **let** $S' = S$
|
||
|
| **compare**($A, B$) : boolean $b$
|
||
|
| **let** $b = (\forall(x,V) \in A, (x', V') \in B : V \leq V')$
|
||
|
| **merge**($A, B$) : payload $C$
|
||
|
| **let** $A' = \{(x,V) \in A \mid \forall(y,W) \in B : V \| W \lor V \leq W\}$
|
||
|
| **let** $B' = \{(y,W) \in B \mid \forall(x,V) \in A : W \| V \lor W \leq V\}$
|
||
|
| **let** $C = A' \cup B'$
|
||
|
|
||
|
## G-Set
|
||
|
|
||
|
| **payload** set $A$
|
||
|
| **initial** $\emptyset$
|
||
|
| **update** *add*(element $e$)
|
||
|
| $A := A \cup \{e\}$
|
||
|
| **query** *lookup*(element $e$) : boolean $b$
|
||
|
| **let** $b = (e \in A)$
|
||
|
| **compare**($S, T$) : boolean $b$
|
||
|
| **let** $b = (S.A \subseteq T.A)$
|
||
|
| **merge**($S, T$) : payload $U$
|
||
|
| **let** $U.A = S.A \cup T.A$
|
||
|
|
||
|
## 2P-Set
|
||
|
|
||
|
| **payload** set $A$, set $R$
|
||
|
| **initial** $\emptyset, \emptyset$
|
||
|
| **query** *lookup*(element $e$) : boolean $b$
|
||
|
| **let** $b = (e \in A \land e \notin R)$
|
||
|
| **update** *add*(element $e$)
|
||
|
| $A := A \cup \{e\}$
|
||
|
| **update** *remove*(element $e$)
|
||
|
| **pre** $lookup(e)$
|
||
|
| $R := R \cup \{e\}$
|
||
|
| **compare**($S, T$) : boolean $b$
|
||
|
| **let** $b = (S.A \subseteq T.A \lor S.R \subseteq T.R)$
|
||
|
| **merge**($S, T$) : payload $U$
|
||
|
| **let** $U.A = S.A \cup T.A$
|
||
|
| **let** $U.R = S.R \cup T.R$
|
||
|
|
||
|
## U-Set
|
||
|
|
||
|
| **payload** set $S$
|
||
|
| **initial** $\emptyset$
|
||
|
| **query** *lookup*(element $e$) : boolean $b$
|
||
|
| **let** $b = (e \in S)$
|
||
|
| **update** *add*(element $e$)
|
||
|
| **atSource**($e$)
|
||
|
| **pre** $e$ is unique
|
||
|
| **downstream**($e$)
|
||
|
| $S := S \cup \{e\}$
|
||
|
|
||
|
## U-Set (cont.)
|
||
|
|
||
|
| **update** *remove*(element $e$)
|
||
|
| **atSource**($e$)
|
||
|
| **pre** $lookup(e)$
|
||
|
| **downstream**($e$)
|
||
|
| **pre** $add(e)$ has been delivered
|
||
|
| $S := S \setminus \{e\}$
|
||
|
|
||
|
# Thanks!
|