Monday, February 17, 2025

Merkle Trees

Merkle Trees are a data structure that allows for efficient and secure validation of contents. They are typically implemented as binary trees. Given N pieces of data, they would have 2N nodes and log(N) height. Each leaf node is the hash of a piece of data and every non-leaf node is a hash of its children's hashes. With only a hash stored at each node Merkle trees have a small, predictable size compared to the data of a large N.

Graphical representation of the Merkle Tree. Illustration by David Göthberg

Examples of Merkle tree usage include detecting data inconsistencies between replicas in NoSQL databases, ensuring file integrity in distributed storage systems, and verifying blockchain transactions. We can easily create one with pymerkle.


└─9f42c047...
    ├──50fcd75a...
    │   ├──4c4b77fe...
    │   │   ├──e8bcd97e...
    │   │   │   ├──2215e8ac...
    │   │   │   └──fa61e3de...
    │   │   └──9c769ac2...
    │   │       ├──906c5d24...
    │   │       └──11e1f558...
    │   └──fed7af7d...
    │       ├──2b15ae18...
    │       │   ├──53304f5e...
    │       │   └──3bf9c81c...
    │       └──8007dd69...
    │           ├──797427cf...
    │           └──195f58bc...
    └──555da077...
        ├──85224a5c...
        └──5c889ef4...

A "Merkle proof" is a list of intermediate hashes from the path between a leaf node representing the data you want to prove and the root of the tree. Generating the proof is like a modified DFS. The beauty of this is that anyone can verify the data is included in the tree without the whole tree being revealed. 

{
    "metadata": {
        "algorithm": "sha256",
        "security": true,
        "size": 10
    },
    "rule": [
        0,
        0,
        1,
        0,
        0
    ],
    "subset": [],
    "path": [
        "53304f5e3fd4bcd20b39abdef2fe118031cc5ae8217bcea008dea7e27869348a",
        "3bf9c81c231cae70b678d3f3038f9f4f6d6b9d7adcf9b378f25919ae53d17686",
        "8007dd69b92a67ea6410098635fa8ba53c44a5994c7e5d92b99e27f0711c626f",
        "4c4b77fe3fc6cfb92e4d3c90b5ade42f059a1f112a49827f07edbb7bd4540e7b",
        "555da077fcadba1f23e0f2bfac8793e6a3c79a0d605902df34ab43d3e0fb487c"
    ]
}

Verifying the proof requires only calculating the root hash from the provided proof and the data being verified. If the calculated root matches the known root of the tree then the data is present in the tree. We are using less space and less compute than if we were iterating a list to check if data is present.

That was an inclusion or audit proof. We can also do a consistency proof. In an append-only tree we can verify earlier versions of the tree against later versions to make sure no tampering has occurred. The later version must include everything in the earlier version, in the same order, and all new entries come after old entries.

{
    "metadata": {
        "algorithm": "sha256",
        "security": true,
        "size": 10
    },
    "rule": [
        1,
        0,
        0,
        0,
        0
    ],
    "subset": [
        0,
        1,
        0,
        0,
        0
    ],
    "path": [
        "fa61e3dec3439589f4784c893bf321d0084f04c572c7af2b68e3f3360a35b486",
        "2215e8ac4e2b871c2a48189e79738c956c081e23ac2f2415bf77da199dfd920c",
        "9c769ac26f8d61ff40859e5201537845555136f0fd7ab604f7033180fbe76af9",
        "fed7af7d64bf0a73fcad018df1219928dbafa4d96b5d78f8a5e9be66ff0ada38",
        "555da077fcadba1f23e0f2bfac8793e6a3c79a0d605902df34ab43d3e0fb487c"
    ]
}

Given how powerful and pervasive Merkle trees are, I'm surprised they aren't discussed more along with other common data structures. It seems that with AI-generated content making verifiable content more imperative, their usage will only increase going forwards.