Monday, June 30, 2025

Zero-Knowledge Proofs: Verifying Computation and Preserving Privacy

Expanding Verifiability Beyond Merkle Trees in the Age of AI

From my previous post, you're already familiar with Merkle Trees as a powerful data structure for efficient and secure validation of contents. You know they achieve this by hashing data and then hashing those hashes up a tree, allowing for Merkle proofs that can verify data inclusion or consistency without revealing the entire dataset. This capability is becoming even more crucial as AI-generated content makes verifiable content more imperative. 

But what if you need to prove something more complex than just data inclusion? What if you need to prove a computation was performed correctly, or that you know a secret, without revealing that secret? This is where Zero-Knowledge Proofs (ZKPs) come into play, offering new dimensions of verifiability and privacy.

What are Zero-Knowledge Proofs?

A Zero-Knowledge Proof is a cryptographic protocol where a prover can convince a verifier that a statement is true, without revealing any information beyond the truth of the statement itself. Think of it like proving you're over 18 without showing your ID or revealing your name and address. ZKPs bring two main "primitives" or building blocks:

  • Computational Integrity (Succinctness): They allow you to create proofs of computations that are significantly easier and faster to verify than to perform the original computation. This means the proof itself remains small, regardless of how complex the computation being proven is. Just as a Merkle proof is small compared to the original data, a ZKP is small compared to the computation it verifies.
  • Zero-Knowledge (Privacy): They provide the option to hide parts of the computation (like sensitive inputs or even parts of the model) while still proving its correctness.

While generating ZK proofs can be very computationally intensive, advancements in cryptography, hardware, and distributed systems are making them feasible for increasingly complex computations. This expansion of capabilities opens up a vast "design space for new applications".

Programming ZKPs: A Shift in Mindset

Unlike traditional programming, which focuses on how to compute, programming ZKPs (often called circuits) focuses on defining a set of constraints. These constraints are mathematical rules that the computation must satisfy. For example, you might constrain that two secret numbers multiplied together equal a public number, without ever revealing the secret numbers.

The typical workflow for building a ZKP involves:
  1. Writing the circuit: Defining the constraints of your computation.
  2. Building the circuit: Compiling it into an R1CS binary (the constraint system) and WebAssembly code for witness generation.
  3. Trusted Setup: A crucial pre-processing step that generates a proving key (for the prover) and a verification key (for the verifier). 
  4. Generating the proof: Using your private inputs (the "witness"), the compiled circuit, and the proving key.
  5. Verifying the proof: Using the verification key, the public output, and the generated proof.

Concepts like hash functions are fundamental in ZKPs, just as they are in Merkle Trees. However, ZKPs often use "ZK-Friendly hash functions" like Poseidon, which are optimized for use within ZKP circuits, offering significant performance gains compared to traditional hashes like SHA-256 due to their arithmetic-based implementation. 

Commitments, a cryptographic primitive allowing you to "commit" to a secret value without revealing it, are also crucial, often built using these hash functions. These are key building blocks for applications like digital signatures and more advanced concepts like group signatures, where you can prove you are part of a group without revealing your specific identity.
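
A hash-based commitment is easy to sketch (a minimal illustration using SHA-256; real ZKP systems often prefer ZK-friendly or algebraic schemes such as Pedersen commitments):

import hashlib
import secrets

def commit(value: bytes) -> tuple[bytes, bytes]:
    # A random blinding factor keeps the committed value hidden,
    # even if the value itself is guessable.
    r = secrets.token_bytes(32)
    return hashlib.sha256(r + value).digest(), r

def open_commitment(commitment: bytes, value: bytes, r: bytes) -> bool:
    # Revealing value and r lets anyone check the earlier commitment.
    return hashlib.sha256(r + value).digest() == commitment

c, r = commit(b"secret value")
assert open_commitment(c, b"secret value", r)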

Programming ZKPs: An Example

Let's walk through a basic example of proving that we know two numbers whose product is 36 without revealing what those numbers are.

Write the circuit using Circom. The constraint is c <== a * b;, i.e. two numbers multiplied together must equal a third number.
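
A complete circuit along these lines might look like the following (a minimal sketch assuming Circom 2 syntax; the template name is my own):

pragma circom 2.0.0;

template Multiplier() {
    signal input a;   // private input
    signal input b;   // private input
    signal output c;  // public output
    c <== a * b;      // the single constraint
}

component main = Multiplier();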

Circom compiles it into two artifacts: a Wasm module, which we'll use to generate a witness from our private inputs when creating the proof, and a Rank 1 Constraint System (R1CS) binary file mathematically defining our single constraint.
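
Compilation is a single command (assuming the circuit is saved as example1.circom):

$ circom example1.circom --r1cs --wasm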

Then we perform a trusted setup. The generated Common Reference String (CRS) consists of a proving key and a verification key. These keys can then be reused every time we want to generate and verify proofs, respectively. They can be shared publicly, and I provide mine here as part of the example.
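
For Groth16 with snarkjs, the setup involves a "powers of tau" ceremony followed by a circuit-specific phase. A minimal local version looks roughly like this (file names are illustrative, and a real deployment would include more contributions):

$ snarkjs powersoftau new bn128 12 pot12_0000.ptau
$ snarkjs powersoftau contribute pot12_0000.ptau pot12_0001.ptau
$ snarkjs powersoftau prepare phase2 pot12_0001.ptau pot12_final.ptau
$ snarkjs groth16 setup example1.r1cs pot12_final.ptau example1_final.zkey
$ snarkjs zkey export verificationkey example1_final.zkey example1_verification_key.json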

Finally, we generate the proof using snarkjs with the Wasm file, the proving key, and private inputs that might be something like 9 and 4.
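
With the private inputs in a small JSON file, one command generates both the witness and the proof (file names again illustrative):

$ echo '{"a": 9, "b": 4}' > input.json
$ snarkjs groth16 fullprove input.json example1_js/example1.wasm example1_final.zkey proof.json public.json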

We get proof and public output JSON files.

We've proven that we know two secret values, a and b, whose product is 36. You can verify the proof (assuming you trust my verification key) with snarkjs.

$ snarkjs groth16 verify example1_verification_key.json public.json proof.json
[INFO]  snarkJS: OK!

If you change public.json to contain a different number, the proof will no longer be valid; I haven't proven that I know the factors of this new number.

ZKPs, Blockchains, and Machine Learning (ZKML)

The convergence of ZKPs, blockchains (Web3), and machine learning is a rapidly advancing area with significant potential.

Blockchain use cases include:
  • Scaling Blockchains: Public blockchains have limited computational power. ZKPs enable computations to be executed off-chain, with only a small ZK proof verified on-chain. This scales blockchains without sacrificing decentralization or security. Examples include ZK rollups like Polygon zkEVM and zkSync.
  • Privacy-Preserving Applications: The zero-knowledge property is ideal for creating applications that protect users' privacy and personal data when making cryptographic attestations. Aztec Network, for instance, uses a ZK rollup for Ethereum where users' balances and transactions are completely hidden.
  • Identity Primitives and Data Provenance: Projects like WorldID use ZKPs for privacy-preserving proof-of-personhood protocols, allowing a person to prove they are a unique human without revealing their identity.

ZKML is about applying ZK proofs to machine learning models, specifically focusing on the inference step. The core motivations for ZKML include:

  • Verifying AI-Generated Content: With AI content becoming indistinguishable from human-created content, ZKPs can help determine that a particular piece of content was produced by applying a specific model to a given input.
  • Privacy-Preserving Inference: ZKPs allow you to apply an ML model to sensitive data, where a user can get the result of the model's inference without revealing their input to any third party.

While proving something as large as current LLMs with ZKPs is not currently feasible, there's significant progress on creating proofs for smaller models. Teams are actively working on improving ZK technology, including specialized hardware and proof system architectures, to allow proving bigger models on less powerful machines in less time.

Summary

While Merkle Trees excel at verifying data inclusion and consistency, ZKPs extend this idea to verifying computations and knowledge with the added benefit of privacy. This makes them incredibly powerful for building the next generation of scalable and private applications on blockchains, especially as AI-generated content and privacy concerns continue to grow. The future of verifiable content, whether data or computation, is increasingly intertwined with these advanced cryptographic proofs.

Sources

https://zkintro.com/articles/programming-zkps-from-zero-to-hero

https://world.org/blog/engineering/intro-to-zkml

Monday, February 17, 2025

Merkle Trees

Merkle Trees are a data structure that allows for efficient and secure validation of contents. They are typically implemented as binary trees. Given N pieces of data, a tree has roughly 2N nodes (2N − 1 when perfectly balanced) and log(N) height. Each leaf node is the hash of a piece of data and every non-leaf node is a hash of its children's hashes. With only a hash stored at each node, Merkle trees have a small, predictable size compared to the underlying data, even for large N.

Graphical representation of the Merkle Tree. Illustration by David Göthberg
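
To make that concrete, here is a minimal from-scratch sketch (not pymerkle; real implementations also domain-separate leaf and interior hashes):

import hashlib

def sha256(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(items: list[bytes]) -> bytes:
    # Leaf nodes are hashes of the data itself.
    level = [sha256(item) for item in items]
    while len(level) > 1:
        # Duplicate the last node if the level has an odd count.
        if len(level) % 2 == 1:
            level.append(level[-1])
        # Each parent is the hash of its children's concatenated hashes.
        level = [sha256(level[i] + level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]

root = merkle_root([b"alpha", b"beta", b"gamma", b"delta"])
print(root.hex())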

Examples of Merkle tree usage include detecting data inconsistencies between replicas in NoSQL databases, ensuring file integrity in distributed storage systems, and verifying blockchain transactions. We can easily create one with pymerkle.


└─9f42c047...
    ├──50fcd75a...
    │   ├──4c4b77fe...
    │   │   ├──e8bcd97e...
    │   │   │   ├──2215e8ac...
    │   │   │   └──fa61e3de...
    │   │   └──9c769ac2...
    │   │       ├──906c5d24...
    │   │       └──11e1f558...
    │   └──fed7af7d...
    │       ├──2b15ae18...
    │       │   ├──53304f5e...
    │       │   └──3bf9c81c...
    │       └──8007dd69...
    │           ├──797427cf...
    │           └──195f58bc...
    └──555da077...
        ├──85224a5c...
        └──5c889ef4...

A "Merkle proof" is a list of intermediate hashes from the path between a leaf node representing the data you want to prove and the root of the tree. Generating the proof is like a modified DFS. The beauty of this is that anyone can verify the data is included in the tree without the whole tree being revealed. 

{
    "metadata": {
        "algorithm": "sha256",
        "security": true,
        "size": 10
    },
    "rule": [
        0,
        0,
        1,
        0,
        0
    ],
    "subset": [],
    "path": [
        "53304f5e3fd4bcd20b39abdef2fe118031cc5ae8217bcea008dea7e27869348a",
        "3bf9c81c231cae70b678d3f3038f9f4f6d6b9d7adcf9b378f25919ae53d17686",
        "8007dd69b92a67ea6410098635fa8ba53c44a5994c7e5d92b99e27f0711c626f",
        "4c4b77fe3fc6cfb92e4d3c90b5ade42f059a1f112a49827f07edbb7bd4540e7b",
        "555da077fcadba1f23e0f2bfac8793e6a3c79a0d605902df34ab43d3e0fb487c"
    ]
}

Verifying the proof requires only calculating the root hash from the provided proof and the data being verified. If the calculated root matches the known root of the tree then the data is present in the tree. We are using less space and less compute than if we were iterating a list to check if data is present.
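
In simplified form, ignoring the rule/subset bookkeeping pymerkle uses to encode left/right ordering, verification looks something like this:

import hashlib

def verify_inclusion(leaf_hash: bytes,
                     path: list[tuple[bytes, bool]],
                     known_root: bytes) -> bool:
    # Walk from the leaf up to the root, hashing in each sibling
    # on the correct side.
    current = leaf_hash
    for sibling, sibling_is_left in path:
        if sibling_is_left:
            current = hashlib.sha256(sibling + current).digest()
        else:
            current = hashlib.sha256(current + sibling).digest()
    return current == known_root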

That was an inclusion or audit proof. We can also do a consistency proof. In an append-only tree we can verify earlier versions of the tree against later versions to make sure no tampering has occurred. The later version must include everything in the earlier version, in the same order, and all new entries come after old entries.

{
    "metadata": {
        "algorithm": "sha256",
        "security": true,
        "size": 10
    },
    "rule": [
        1,
        0,
        0,
        0,
        0
    ],
    "subset": [
        0,
        1,
        0,
        0,
        0
    ],
    "path": [
        "fa61e3dec3439589f4784c893bf321d0084f04c572c7af2b68e3f3360a35b486",
        "2215e8ac4e2b871c2a48189e79738c956c081e23ac2f2415bf77da199dfd920c",
        "9c769ac26f8d61ff40859e5201537845555136f0fd7ab604f7033180fbe76af9",
        "fed7af7d64bf0a73fcad018df1219928dbafa4d96b5d78f8a5e9be66ff0ada38",
        "555da077fcadba1f23e0f2bfac8793e6a3c79a0d605902df34ab43d3e0fb487c"
    ]
}

Given how powerful and pervasive Merkle trees are, I'm surprised they aren't discussed more along with other common data structures. It seems that with AI-generated content making verifiable content more imperative, their usage will only increase going forwards.

Monday, January 15, 2024

CUDA Kernels: Speeding up Matrix Multiplication

The Python ecosystem benefits greatly from being able to use libraries written in C/C++ and Rust (see my previous post) to increase performance. Increasingly, though, I've been running code on GPUs instead. Libraries like PyTorch simplify this, but how do they do it? Previously I could only answer something like "it's CUDA" without knowing what that really means. In this post we'll dive deeper and see what it takes to create our own CUDA kernel. We'll (re-)implement matrix multiplication, run it on a GPU, and compare to numpy and PyTorch performance.

First, what is CUDA? CUDA is a general-purpose parallel computing platform and API. It gives direct access to the virtual instruction set and parallel compute elements of NVIDIA GPUs. It works with C, C++, and Fortran, has Java and Python wrappers, and is supposed to be easier to use than earlier APIs like OpenCL. CUDA lets us execute kernels, which are functions compiled for the GPU and run separately from the main program. The majority of deep learning taking place today runs on top of such kernels.

Second, what makes GPUs so special? GPUs are generally slower than CPUs at any single operation but excel at operations that can be parallelized; they can have thousands of cores and higher memory bandwidth. If we want to tackle problems that are more than embarrassingly parallel, like matrix multiplication, we need a parallel algorithm to make use of the GPU.

Third, how do we access it from Python? It turns out there are several options, i.e. several Python wrappers. I chose Numba, a just-in-time (JIT) compiler that can target both CPUs and GPUs and lets you write kernels directly in a subset of Python. It does not implement the complete CUDA API, but supports enough to tackle many problems. There are other (uninvestigated) options like CuPy and PyCUDA as well.

Lastly, before we get to the code, is the topic of writing efficient kernels. While this is beyond the scope of my post, I can say that I've at least learned it requires understanding several concepts beyond general concurrent programming. CUDA kernels are not for the faint of heart; it's not something you can pick up in a couple of hours or from following one tutorial. One of the first things you'll notice, as a very simple example, is the need to explicitly move data back and forth between the CPU and the GPU. And kernels can't return values; you can only pass inputs and outputs.

For a better intro, I recommend reading the four-part CUDA by Numba Examples series (part 2, part 3, part 4).

Our baseline for this experiment will be numpy's highly-optimized matmul function. It supports vectorized array operations and some multi-threading by releasing the GIL (per the "Parallel Programming with numpy and scipy" guide). The underlying BLAS routines will be optimized for your hardware. This method of matrix multiplication has been tuned over the past several decades.
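
As a sketch (matrix sizes are illustrative, timing code omitted):

import numpy as np

n = 2048
A = np.random.rand(n, n).astype(np.float32)
B = np.random.rand(n, n).astype(np.float32)

C = A @ B  # dispatches to an optimized BLAS routine (sgemm for float32)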

Here we'll try our own kernel. It's moving the computation from the CPU to the GPU, but it's likely lacking many optimizations that a battle-tested library has. Still, for medium-sized, two-dimensional matrices, we get some performance improvement.
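
A naive kernel looks something like this (a simplified sketch; block and grid sizes are illustrative):

import numpy as np
from numba import cuda

@cuda.jit
def matmul_kernel(A, B, C):
    # Each thread computes one element of the output matrix.
    i, j = cuda.grid(2)
    if i < C.shape[0] and j < C.shape[1]:
        tmp = 0.0
        for k in range(A.shape[1]):
            tmp += A[i, k] * B[k, j]
        C[i, j] = tmp

n = 2048
A = np.random.rand(n, n).astype(np.float32)
B = np.random.rand(n, n).astype(np.float32)

# Explicitly move inputs to the GPU and allocate the output there.
d_A = cuda.to_device(A)
d_B = cuda.to_device(B)
d_C = cuda.device_array((n, n), dtype=np.float32)

# Launch enough 16x16 thread blocks to cover the output matrix.
threads = (16, 16)
blocks = ((n + threads[0] - 1) // threads[0],
          (n + threads[1] - 1) // threads[1])
matmul_kernel[blocks, threads](d_A, d_B, d_C)

# Copy the result back to the CPU.
C = d_C.copy_to_host()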

Now we'll use PyTorch's matmul function. It's highly optimized like numpy and has the GPU benefit like the kernel. It's amazingly fast and works with larger, higher-dimensional matrices.
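
Something like this (a sketch, assuming a CUDA-capable device):

import torch

n = 2048
A = torch.rand(n, n, device="cuda")
B = torch.rand(n, n, device="cuda")

C = torch.matmul(A, B)    # runs on the GPU via optimized libraries
torch.cuda.synchronize()  # wait for the GPU to finish before timing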

NVIDIA has a profiling tool called Nsight Systems that we can use to see GPU utilization. GPUs are expensive, so we want them fully utilized. From the reports, I see that the PyTorch implementation used more threads, which is consistent with higher parallelism. It also seems to have a higher ratio of memory operations to kernel operations. I'm not sure I fully understand what that means, but the kernel operations are sgemm, which looks to be a matrix multiplication routine akin to numpy's use of BLAS.



Creating a CUDA kernel has become accessible enough that I could do it in a couple of hours on a laptop, yet my implementation remains far from the top implementations. Matrix multiplication is obviously common, and great libraries exist. For less common operations, even if there's a known parallel algorithm, I would hesitate to go the custom kernel route again. It's not a one-off thing you just try in order to improve performance; it requires a way of thinking and deep optimization knowledge that most software developers don't have. The libraries underlying PyTorch are, for example, optimizing use of memory caches, shared vs. global memory accesses, and thread utilization, and probably tuning kernel parameters for my specific GPU.

UPDATE:

NVIDIA Finally Adds Native Python Support to CUDA.