A CRDT for a to-do list

Many web applications need to deal with multiple users editing the same document concurrently. Even if only one user can edit a document, they might work from both their phone and laptop, and want to be able to work when they don't have a internet connection. If they make some changes on their phone when they don't have an internet connection, then make some other changes on their laptop, the web application needs to be able to handle merging these two sets of changes.

One approach to allowing concurrent editing of a document is to describe the document using a conflict-free replicated data type (CRDT). As the name suggests, these data types are structured so that conflicting changes are impossible. In this blog post, we'll look at how a CRDT can be used to describe a to-do list.

A CRDT is based on the idea of creating a data structure where all operations are idempotent and commutative. An idempotent operation is one that can be applied multiple times and have the same effect as if it was applied only once. A commutative operation is one in which the order of operations doesn't matter. The most simple CRDT is a grow-only set (G-Set), in which only element additions are allowed. Addition of an element to a set is idempotent, as adding the same element again has no effect if the element already exists in the set. Addition of an element to a set is also commutative, as it doesn't matter what order elements are added, the end-result will always be the same.

It is easy to see how a G-Set can handle concurrent operations. If each machine in a network has their own copy of the G-Set, they can apply operations locally, adding elements to the set, and asynchronously send updates to other machines in the network when a network connection is available.

From the point of view of the CAP theorem, a CRDT provides high availability and strong network partition tolerance, while only providing eventual consistency. This means that a document may be different across machines in a network, but as long as all machines eventually form connections with each other, the document will eventually become consistent.

A basic to-do list

As a starting point for representing a to-do list, we will use a G-Set. Each item in our to-do list will have a unique identifier and a text field containing the item's title. An example JSON representation of this to-do list structure is:

{
    "items": [
        {
            "id": "2b74e20b-18d5-461d-9763-351ec3356b59",
            "title": "Buy milk"
        },
        {
            "id": "fccb6547-f3aa-4f8f-97e1-3c1e5dbf85f9",
            "title": "Mow the lawns"
        },
        {
            "id": "449beefe-f190-4fad-bf92-77ac7c1775eb",
            "title": "Iron shirts"
        },
        {
            "id": "fb212751-a86d-4d41-b912-db3ec6ecdbac",
            "title": "Wash the car"
        }
    ]
}

And the resulting list is:

Buy milk
Mow the lawns
Iron shirts
Wash the car

Adding an item to the list will only have an effect if an item with a matching id is not already present. Different users can concurrently add items to the list, and they will all be merged into one consistent representation eventually.

Marking items as done and removing items

Now that we have a to-do list that users can add items too, they might also want to mark an item as done or remove it from the list.

This can be achieved by maintaining separate sets of done and removed items. An item is present in the to-do list only if is present in the items set and is not present in the removed set. If it is present in the done set, then it is marked as done. When an item is removed, the original item remains in the items set and is known as a "tombstone".

An example of this updated to-do list structure is shown below:

{
    "items": [
        {
            "id": "2b74e20b-18d5-461d-9763-351ec3356b59",
            "title": "Buy milk"
        },
        {
            "id": "fccb6547-f3aa-4f8f-97e1-3c1e5dbf85f9",
            "title": "Mow the lawns"
        },
        {
            "id": "449beefe-f190-4fad-bf92-77ac7c1775eb",
            "title": "Iron shirts"
        },
        {
            "id": "fb212751-a86d-4d41-b912-db3ec6ecdbac",
            "title": "Wash the car"
        }
    ],
    "removed": [
        "449beefe-f190-4fad-bf92-77ac7c1775eb"
    ],
    "done": [
        "fccb6547-f3aa-4f8f-97e1-3c1e5dbf85f9"
    ]
}

Buy milk
Mow the lawns
Wash the car

In this example, we've mowed the lawns so checked that item, and decided that our shirts don't need ironing after all, so we've removed that from the list.

In the simple case of only being able to remove items from the set, this data structure is known as a two-phase or 2P-Set.

Modification

Now we have a to-do list that users can add and remove items from, as well as mark them as done, but what if they want to modify an existing item? We could use a CRDT to represent the text within an item itself (for an example, see TreeDoc) or use other concurrency techniques such as operational transformation (OT). However, a simpler approach is to treat modification of an item as a deletion of the existing item and creation of a new item. For our particular case of a to-do list, this approach seems like it should work well. If two users concurrently edit the same item, the resulting to-do list will no longer contain the original item and will have two separate, new items.

Unchecking

We may also want to allow users to uncheck items that were previously marked as done. The current data structure does not allow this, as items can't be removed from the done set. One possible approach we could use to uncheck an item is to create a new unchecked clone of the item with a different id, and then delete the original item.

Alternatively, we can use what is known as a last write wins (LWW) set. With this approach, we maintain a set of state changes, where each element in the set has an associated timestamp. We can say that when an item does not have an associated entry in this set it is not checked. Otherwise, whether or not an item is checked or not is based on the entry with the latest timestamp. For cases where two timestamps are identical, we select either the checked or unchecked operation to win. The listing below shows our to-do list updated to use a last write wins set of state changes

{
    "items": [
        {
            "id": "2b74e20b-18d5-461d-9763-351ec3356b59",
            "title": "Buy milk"
        },
        {
            "id": "fccb6547-f3aa-4f8f-97e1-3c1e5dbf85f9",
            "title": "Mow the lawns"
        },
        {
            "id": "449beefe-f190-4fad-bf92-77ac7c1775eb",
            "title": "Iron shirts"
        },
        {
            "id": "fb212751-a86d-4d41-b912-db3ec6ecdbac",
            "title": "Wash the car"
        }
    ],
    "removed": [
        "449beefe-f190-4fad-bf92-77ac7c1775eb"
    ],
    "state": [
        {
            "id": "fccb6547-f3aa-4f8f-97e1-3c1e5dbf85f9",
            "state": "checked",
            "timestamp": "1424165000"
        },
        {
            "id": "fccb6547-f3aa-4f8f-97e1-3c1e5dbf85f9",
            "state": "unchecked",
            "timestamp": "1424166000"
        },
        {
            "id": "2b74e20b-18d5-461d-9763-351ec3356b59",
            "state": "checked",
            "timestamp": "1424167000"
        }
    ]
}

Buy milk
Mow the lawns
Wash the car

In this example our "Buy milk" item is checked, but "Mow the lawns" is not, as there is an "unchecked" state change with a later timestamp than the "checked" state change for this item.

Ordering items

So far we have assumed that our to-do list items are not in any particular order, but users may want to arrange items to place more important items near the top of the list. This is where things start to get a bit tricky. Handling re-ordering of items can be achieved with CRDTs, but the data structures become much more complex. One example of an ordered CRDT is the TreeDoc data structure used for collaborative text editing.

TreeDoc assigns a path identifier to each node in the CRDT, where each node represents a letter in a text document. These position identifiers organise the document into a binary tree. In a standard binary tree structure the root node has an empty position identifier, and each node in the document has a position identifier equal to that of its parent concatenated with a 0 if it is to the left of its parent, or a 1 if it is to the right. However, TreeDoc has to allow for concurrent inserts at the same position. To deal with this situation, each entry in the position identifier includes an identifier corresponding to the site where the insert was initiated, and these site identifiers are ordered.

For our to-do list application, ordering is not as important as in a text document, so we can afford to be less precise with our ordering approach. One approach we can use is to maintain a last write wins set of sort-keys, where each sort-key is a floating point value. By using floating point sort-keys, we can always find a new sort-key between any two existing items in our list. If two users concurrently add different items at the same position, they will have the same sort key, but we can choose to then order these items based on their id. The listing below shows our to-do list with a set of sort-keys applied:

{
    "items": [
        {
            "id": "2b74e20b-18d5-461d-9763-351ec3356b59",
            "title": "Buy milk"
        },
        {
            "id": "fccb6547-f3aa-4f8f-97e1-3c1e5dbf85f9",
            "title": "Mow the lawns"
        },
        {
            "id": "449beefe-f190-4fad-bf92-77ac7c1775eb",
            "title": "Iron shirts"
        },
        {
            "id": "fb212751-a86d-4d41-b912-db3ec6ecdbac",
            "title": "Wash the car"
        }
    ],
    "removed": [
        "449beefe-f190-4fad-bf92-77ac7c1775eb"
    ],
    "state": [
        {
            "id": "fccb6547-f3aa-4f8f-97e1-3c1e5dbf85f9",
            "state": "checked",
            "timestamp": "1424165000"
        },
        {
            "id": "fccb6547-f3aa-4f8f-97e1-3c1e5dbf85f9",
            "state": "unchecked",
            "timestamp": "1424166000"
        },
        {
            "id": "2b74e20b-18d5-461d-9763-351ec3356b59",
            "state": "checked",
            "timestamp": "1424167000"
        }
    ],
    "sort_keys": [
        {
            "id": "2b74e20b-18d5-461d-9763-351ec3356b59",
            "sort": 0.0,
            "timestamp": "1424163000"
        },
        {
            "id": "fccb6547-f3aa-4f8f-97e1-3c1e5dbf85f9",
            "sort": -1.0,
            "timestamp": "1424163000"
        },
        {
            "id": "449beefe-f190-4fad-bf92-77ac7c1775eb",
            "sort": -2.0,
            "timestamp": "1424163000"
        },
        {
            "id": "fb212751-a86d-4d41-b912-db3ec6ecdbac",
            "sort": -3.0,
            "timestamp": "1424163000"
        }
        {
            "id": "fccb6547-f3aa-4f8f-97e1-3c1e5dbf85f9",
            "sort": 1.0,
            "timestamp": "1424164000"
        }
    ]
}

Mow the lawns
Buy milk
Wash the car

In this example, the original order of items was "Buy milk", "Mow the lawns", "Iron shirts", then "Wash the car", but "Mow the lawns" has been moved to the top of the list (and Iron shirts was removed).

Conclusion

We now have a data structure representing a to-do list that can be replicated across multiple machines. Each machine can independently apply operations on the to-do list and easily merge any change from others without any chance of a conflict.

One thing you've probably noticed is that the CRDT describing our to-do list could end up being quite large as time progresses and many changes are made. The last write wins sets can be easily cleaned up by removing any entries that are invalidated by more recent entries. Cleaning up a 2P-Set is somewhat more complex and requires synchronisation across machines. For more information on Garbage collection techniques for CRDTs see the paper by Shapiro et al. (2011).