Polymath
Project

solutions@pearlpolymath.com

Hey there. If you stumbled on this page (or got lost in the web), you are looking at a shortlist of difficult open problems we've encountered while trying to create the first Proof-of-Work network secured by native AI computations, or more precisely: making matrix multiplication verifiably hard with minimal additional overhead on modern AI hardware.

If you're still here, here's some additional context:

The Pearl network aims to redefine the unit economics of AI training & inference by creating a new currency directly anchored to AI compute cycles, the primary asset of the future economy.

The goal of this collaborative "polymath" project is to inform and improve the algorithmic and economic design of the Pearl proof-of-useful-work protocol, so that as many AI workloads (MatMuls) in the world as possible are performed on the blockchain, making Pearl a general auditable index and the trust layer of AI.

The open problems below reflect important technical and economic questions that are either still beyond our reach or open for debate, and answers would help us understand the implications of the new economy created by integrating the Pearl "2-for-1" protocol into the mainstream AI stack. We spent many hours over the past year thinking about these problems, with world-class mathematicians and low-level engineers… if you've reached this webpage, we believe you might be interested, and able to help.

We'd love to hear any of your thoughts:

  • A proof sketch, counterexample, or attack.
  • A concrete construction or reduction.
  • An implementation detail, benchmark, or kernel-level idea.

Mathematical challenges

Reducing the Overhead of MatMul PoW Schemes

The most naïve attempt at MatMul-based PoW would be to simply hash the output h(A·B) and check if the result starts with X zeroes (note this protocol has near-zero overhead over vanilla GEMM!). Of course, this idea is not secure, since for h(A·B) to be unpredictable (random looking), A · B must itself be unpredictable, while a "dishonest" miner can pick A,B=0 (or reuse matrices he/she pre-computed at home).

To prevent this, a secure MatMul scheme must first perturb the matrices A, B to obtain "unpredictable" matrices A', B', and then apply the PoW test (block-opening condition) to the computation of A'·B'. Concretely, a MatMul PoW scheme consists of an encoder (A', B') ← f(A, B, σ), where σ is a random string (the "seed" of the previous block, as in Bitcoin). The miner computes A'·B' using the industry-standard algorithm, and the block-opening condition is some function (predicate) of the computation trace. Finally, since a "useful" miner (who is multiplying real weight and activation matrices of an LLM) is interested in the original product A·B, a decoder g(A'·B', A, B, σ) may be applied to recover A·B from the noisy A'·B'. Since we want the overall runtime to be roughly the same as the SoTA GEMM algorithm for computing A·B (i.e., "2-for-1"), both the encoder and the decoder must be computable in << O(n³) FLOPs.

On the other hand, the encoder must ensure that the computation of A'·B' required for PoW is "as hard as random MatMul" no matter which A, B the miner picked; otherwise, a useless miner may submit easy matrices and gain an advantage over a useful miner, and the protocol is broken. In summary, we are looking for a fine-grained worst-case to average-case reduction for MatMul.

The cuPOW protocol encodes A, B into A' = A + E and B' = B + F, where E, F are random low-rank noise matrices. The hashing is done on intermediate tiles instead of just hashing the (perturbed) output h(A'·B'); this prevents a dishonest miner from gaming the system by picking the all-zero matrices A = B = 0 (or any other low-rank matrices), which would reduce the task to structured MatMul (low rank, FFT, etc.), costing << the useful miner's work and breaking the protocol. The decoder then computes A·B = A'·B' − E·B − A·F − E·F. This is cheap since E, F are low-rank.
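The full encode / multiply / peel cycle can be sketched in a few lines of NumPy (a toy, dense-matrix illustration; the Gaussian low-rank factors and the exact O(n²r) peeling order are our illustrative choices, not the production kernel):

```python
import numpy as np

rng = np.random.default_rng(0)
n, r = 256, 16  # matrix size and noise rank (toy values)

A = rng.normal(size=(n, n))
B = rng.normal(size=(n, n))

# Rank-r noise matrices, kept in factored form: E = U @ Vt, F = P @ Qt.
U, Vt = rng.normal(size=(n, r)), rng.normal(size=(r, n))
P, Qt = rng.normal(size=(n, r)), rng.normal(size=(r, n))

# Encoder: perturb the inputs (O(n^2 r) work to materialize the noise).
A_enc = A + U @ Vt
B_enc = B + P @ Qt

# Miner: the full O(n^3) MatMul against which the PoW predicate is checked.
C_enc = A_enc @ B_enc

# Decoder: peel E*B' + A*F = E*B + E*F + A*F using the low-rank factors,
# so every product below costs O(n^2 r) rather than O(n^3).
C = C_enc - U @ (Vt @ B_enc) - (A @ P) @ Qt

assert np.allclose(C, A @ B)
```

Note that the peeling uses E·B' = E·B + E·F, which is algebraically the same as subtracting E·B, A·F, and E·F separately but touches the low-rank factors only once.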

While peeling the additive noise is significantly cheaper than the full MatMul computation, in some applications it may incur non-negligible overhead. Furthermore, in FP datatypes, exact peeling is not always possible. To that end, we are exploring alternative schemes. Below is a list of challenges related to the development of such schemes.

(1) Security of self-cancelling noise scheme (random rotation scheme)

In the Pearl protocol, the MatMul A⋅B is computed as (A+E) ⋅ (B+F), where E, F are rank-r matrices. The miner hashes (r × r) @ (r × r) intermediate tiles, and peels the noise E⋅(B+F) + A⋅F to recover A⋅B. For n × n matrices, hashing costs O(n³/r) and peeling costs O(n²r), yielding a total additive overhead of O(n²⋅⁵) when r = √n.
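The r = √n balance point can be checked with a back-of-the-envelope FLOP count (constants ignored, powers of two only for readability):

```python
# Additive overhead of the scheme above: hashing ~ n^3 / r, peeling ~ n^2 * r.
# The sum is minimized when the two terms balance, i.e., at r = sqrt(n).
n = 4096
overhead = {r: n**3 // r + n**2 * r for r in [8, 16, 32, 64, 128, 256, 512]}
best_r = min(overhead, key=overhead.get)
print(best_r)            # 64 == sqrt(4096)
print(overhead[best_r])  # 2 * n^2.5
```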

Can you devise an alternative secure protocol with lower asymptotic overhead?

One idea (proposed in Appendix A of the paper) suggests a different "random rotation" encoding of A, B, which doesn't require any "noise peeling" at all: sample a random orthogonal matrix S and ask the miner to compute (A·S) ⋅ (S⁻¹·B) while hashing (lg(n) × n/lg(n)) @ (n/lg(n) × lg(n)) intermediate tiles. To make the multiplication by S cheap (o(n³)), we propose sampling S as a "pseudorandom" rotation, i.e., a randomized Hadamard matrix, so that computing A·S and S⁻¹·B (and hence the entire overhead) costs O(n² lg(n)). Our conjecture is that for large enough r, the (FFT) structure of S cannot be exploited on intermediate tiles.
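The self-cancelling property (A·S)(S⁻¹·B) = A·B can be illustrated with a toy randomized Hadamard rotation S = D·H/√n, where D is a random diagonal sign matrix and H the Sylvester Hadamard matrix (a plain fast Walsh-Hadamard transform below; a real miner would fuse it into the GEMM kernel, which is an assumption on our part):

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform along the last axis (length 2^k),
    i.e., multiplication by the Sylvester Hadamard matrix H in O(n lg n)."""
    x = x.copy()
    h, n = 1, x.shape[-1]
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[..., i:i + h].copy()
            b = x[..., i + h:i + 2 * h].copy()
            x[..., i:i + h] = a + b
            x[..., i + h:i + 2 * h] = a - b
        h *= 2
    return x

rng = np.random.default_rng(0)
n = 64
A = rng.normal(size=(n, n))
B = rng.normal(size=(n, n))
d = rng.choice([-1.0, 1.0], size=n)  # diagonal of the random sign matrix D

# S = D H / sqrt(n) is orthogonal, so S^{-1} = S^T = H D / sqrt(n).
AS = fwht(A * d) / np.sqrt(n)                     # A @ S in O(n^2 lg n)
SinvB = fwht((d[:, None] * B).T).T / np.sqrt(n)   # S^{-1} @ B in O(n^2 lg n)

# The rotations cancel exactly: no decoder / noise peeling is needed.
assert np.allclose(AS @ SinvB, A @ B)
```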

Is this PoW rotation-scheme secure?

(2) "Pearl-GEMM": Fusing MatMul-PoW with Quantization?

The protocol described below (which we call "Pearl-GEMM") is our future candidate for an ultra-low-overhead MatMul-PoW algorithm that works with floating-point (FP8) as well as INT8 datatypes. This protocol no longer preserves exact MatMul, and only outputs an approximate MatMul. Nevertheless, since weight and activation matrices in AI training and inference are typically quantized (a lossy operation by definition), small approximation error is already incurred in the vanilla AI pipeline, so an extra perturbation is acceptable provided its approximation error is sufficiently small. As such, a natural idea is to leverage native quantization noise as the perturbation encoding of A, B for PoW, forcing the miner to commit to the weight and activation matrices pre-quantization (in higher precision). Let us make this more formal.

The idea is simple: assume A, B are matrices given in some arbitrary datatype, say FP16, and we need to multiply them using a tensor core of a certain datatype, say FP8. This requires quantizing them to Q(A), Q(B) in FP8 format, and then computing Q(A)·Q(B). As described above, this is not good enough for PoW, as a non-useful miner can submit A = B = 0, so we must apply an encoder that injects enough randomness.

In Pearl-GEMM, we add independent identically distributed (iid) noise matrices E, F to A, B prior to quantization, and use A'·B' = Q(A + E) · Q(B + F) as the approximation to A·B, without applying a decoder. There is a tradeoff between accuracy and security here: If the noise distribution is not "rich enough", it might be possible to come up with cheap schemes for computing Q(A + E) · Q(B + F) for tailored A, B. On the other hand, the approximation error A'·B' − A·B is proportional to the variance of the noise.
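A toy instance of the accuracy side of this tradeoff, with a symmetric INT8-style quantizer standing in for FP8 and Gaussian noise (the grid, the noise scale σ, and the error metric are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 128
A = rng.normal(size=(n, n))
B = rng.normal(size=(n, n))

def quantize(M, n_levels=127):
    """Symmetric uniform quantizer (an INT8-style stand-in for FP8)."""
    scale = np.abs(M).max() / n_levels
    return np.clip(np.round(M / scale), -n_levels, n_levels) * scale

sigma = 0.01  # noise std: larger sigma = richer noise, larger error
E = rng.normal(scale=sigma, size=(n, n))
F = rng.normal(scale=sigma, size=(n, n))

# Pearl-GEMM: quantize *after* adding iid noise; no decoder is applied.
approx = quantize(A + E) @ quantize(B + F)
exact = A @ B

rel_err = np.linalg.norm(approx - exact) / np.linalg.norm(exact)
print(rel_err)  # small: dominated by quantization step + noise variance
```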

Problem: Find a noise distribution that is both secure and results in negligible end-to-end accuracy loss in LLM inference (e.g., in MMLU).

Remark: Ideally, we want Q(A + E) to be an unbiased estimator of A (and same for B), to ensure unbiased gradients in LLM training (as in stochastic rounding).
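Stochastic rounding delivers exactly this unbiasedness; a minimal sketch of the standard technique (rounding to the integer grid for simplicity):

```python
import numpy as np

def stochastic_round(x, rng):
    """Round each entry up or down with probability proportional to
    proximity, so that E[stochastic_round(x)] == x exactly."""
    lo = np.floor(x)
    return lo + (rng.random(np.shape(x)) < (x - lo))

rng = np.random.default_rng(0)
x = 0.3
samples = stochastic_round(np.full(100_000, x), rng)
print(samples.mean())  # ~0.3: the rounding is an unbiased estimator
```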

(3) Pearl-GEMM for FP4:

FP4 is a 4-bit-per-entry datatype that can represent the numbers 𝓕 = {0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, ±6}. In order to attain reasonable accuracy using such a small constellation, a common scale γ ∈ ℝ is given to every k entries, where k is typically small. For example, in NVFP4, k = 16 and γ is represented in the FP8 E4M3 datatype. To simplify the discussion that follows, we ignore the finite precision of γ and think of it as a real-valued number of infinite precision. Thus, NVFP4 encoding is a mapping f : ℝ¹⁶ → 𝓕¹⁶ × ℝ and the decoding is a mapping g : 𝓕¹⁶ × ℝ → ℝ¹⁶. Because NVFP4 quantization is already quite noisy, adding iid noise matrices E, F prior to quantization (as above) seems to lose too much accuracy. However, it seems plausible that the choice of scales γ opens another dimension for optimization, and may allow injecting enough randomness into A', B' without significant loss of accuracy.
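A toy NVFP4-style block quantizer under the real-valued-γ simplification above (the max-based scale rule and nearest-neighbor rounding are our illustrative choices, not the hardware's):

```python
import numpy as np

FP4 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
GRID = np.concatenate([-FP4[::-1], FP4])  # the signed FP4 constellation

def encode_block(x):
    """Map a block of k reals to (FP4 codes, real-valued scale gamma)."""
    gamma = np.abs(x).max() / 6.0  # fit the block inside [-6, 6]
    codes = GRID[np.argmin(np.abs(x / gamma - GRID[:, None]), axis=0)]
    return codes, gamma

def decode_block(codes, gamma):
    return codes * gamma

rng = np.random.default_rng(0)
k = 16  # NVFP4 block size
x = rng.normal(size=k)
codes, gamma = encode_block(x)
x_hat = decode_block(codes, gamma)

# Per-entry error is at most gamma: the widest grid gap (4 -> 6) has
# half-width 1 in scaled units, i.e., gamma in real units.
assert np.max(np.abs(x - x_hat)) <= gamma + 1e-12
```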

Problem: design an encoder f(A, B, σ) that returns NVFP4 matrices A', B' such that the approximation error A·B − A'·B' is small (so that the end-to-end accuracy is not compromised), and computing A'·B' is "as hard" as multiplying two random NVFP4 matrices of the same shapes.

(4) Extending cuPOW/Pearl-GEMM for Training and Finetuning

So far, we have implemented cuPOW only for inference use cases on vLLM (due to its reduced complexity compared to training, which involves more complex dynamics and algorithmic components, and since the commitment hashing can be done at preprocessing time, lowering overhead). On the bright side, one could hope that the training and finetuning processes will be less susceptible to small (unbiased) perturbations of weight and activation matrices. The intuition is the robustness of SGD, which is already a very noisy estimator of the true gradient, yet works fantastically well in practice (enabling minibatch training, arguably the most dominant feature in the scalability of Deep Learning). If true, this could mean that no "denoising" is required in cuPOW (alternatively, that the "no-peeling" Pearl-GEMM scheme will not lose accuracy).

Despite recent success with INT8 training, training is in fact very sensitive to numerical precision (likely due to the enormous curvature of the loss function, i.e., the Hessian condition number). This raises the question:

How does numerical precision affect training (e.g., LoRA)? Is the price only a few extra epochs?

Note that, for example, in the simpler setting of linear regression or minibatch-PCA, online gradient descent is known to be extremely robust to noise. It would be very interesting to develop numerical scaling laws for finetuning of SoTA LLMs.
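The linear-regression robustness is easy to reproduce: online gradient descent with deliberately corrupted gradients still lands near the optimum (a toy setup; the noise scale and step-size schedule are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
X = rng.normal(size=(200, d))
w_true = rng.normal(size=d)
y = X @ w_true  # noiseless linear regression targets

w = np.zeros(d)
for t in range(5000):
    i = rng.integers(len(X))
    grad = (X[i] @ w - y[i]) * X[i]          # single-sample SGD gradient
    grad += rng.normal(scale=0.5, size=d)    # inject extra gradient noise
    w -= 0.01 / (1 + t / 1000) * grad        # decaying step size

print(np.linalg.norm(w - w_true))  # small despite the corrupted gradients
```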

(5) Collision-resistant Hash via Matrix Multiplication?

The cryptographic commitment hash on the multiplied matrices carries an inherent overhead in the Pearl protocol, hence it is desirable to minimize its computational footprint.

Can you propose a GPU-compatible cryptographic hash function that is 128-bit collision-resistant, even if it is not a SoTA hash on CPUs (where Blake3 achieves 13.1 int32 ops/input byte)?

Is it possible to base it on matrix multiplications and harness TensorCores?

What about TPU-compatible options?
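One known starting point for matmul-based hashing is the Ajtai/SIS family, which compresses input bits via a single matrix product mod q. Below is a toy sketch; the parameters are far below any 128-bit security level and are chosen only to show the shape of the computation (batching many inputs turns the GEMV into a GEMM, which is what would map onto TensorCores):

```python
import numpy as np

# Toy Ajtai/SIS-style compression: h(x) = A x mod q for a fixed public
# random matrix A. At *real* lattice parameters, collision resistance
# reduces to the Short Integer Solution problem; these tiny dimensions
# are purely illustrative and offer no security whatsoever.
q, n, m = 257, 16, 256
rng = np.random.default_rng(42)  # public seed -> public matrix A
A = rng.integers(0, q, size=(n, m), dtype=np.int64)

def sis_hash(x_bits):
    """Compress m input bits to n residues mod q via one GEMV."""
    assert x_bits.shape == (m,) and set(np.unique(x_bits)) <= {0, 1}
    return (A @ x_bits) % q

x = rng.integers(0, 2, size=m, dtype=np.int64)
digest = sis_hash(x)
```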

(6) Is the Pearl blockchain ASIC-resistant?

A notorious feature of Bitcoin's PoW (random hashing) is the emergence of ASIC hardware (produced solely for BTC mining and otherwise useless), which is orders of magnitude more efficient at SHA256 hashing than any other commodity hardware. This feature completely eliminated the incentive to mine using GPUs (let alone CPUs), and is the main reason there is no private mining of Bitcoin anymore.

A unique aspect of the Pearl protocol (cuPOW) is that it might, for the first time, create an ASIC-resistant PoW network, where commodity hardware (GPUs, TPUs) is as competitive as any other hardware. This is not a technical statement, but rather an economic one: while designing specialized hardware for multiplying (noisy) all-zero matrices A, B = 0 of size (say) 4096x4096 is viable (IO and silicon savings due to the fixed dimensions, etc.), it is very unlikely that such hardware would be able to run real AI workloads (training or inference). Therefore, the economic incentive for developing such ASIC hardware is diminished: the cost of production of Pearl tokens via such hypothetical hardware would have to outweigh the external rewards earned by (slower) commodity hardware (i.e., LLM price-per-token, which by free-market value already justifies purchasing GPUs).

Question: Is cuPOW ASIC-resistant? How about Pearl-GEMM? Is 1-bit hardware enough to implement Pearl-GEMM(A = 0, B = 0)? How much speedup (measured against the speed-of-light limit) would such hardware provide, compared to GEMM on an H100?

(7) A 0% overhead "proof-of-inference" scheme?

Our team thought of the following candidate alternative PoW protocol for end-to-end inference tasks, inspired by Pearl-GEMM (problem (2)), yet aiming for literally zero computational overhead. The caveat is that this scheme only seems applicable to "white-listed" (open-source) models:

Proof-of-Inference Scheme:

a. C:=Commit(LLM, x) // once every 10 minutes, say

b. Generate a low-magnitude noise vector (Z | C) ∈ {−1, 1}ᵐ, where m = # parameters of the LLM.

c. Perturb the model weights by adding the noise Zᵢ to every parameter Mᵢ of the LLM → LLM', and perturb the encoding of the input x → x'.

d. Run LLM'(x').
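Steps (a)-(c) can be sketched as follows, with a toy linear layer standing in for the LLM and a SHA-256 counter-mode expansion as our assumed way of deriving (Z | C):

```python
import hashlib
import numpy as np

def signs_from_commitment(commitment: bytes, m: int) -> np.ndarray:
    """Expand a commitment into m pseudorandom signs in {-1, +1}
    by iterating SHA-256 in counter mode (toy PRG sketch)."""
    bits = []
    ctr = 0
    while len(bits) < m:
        block = hashlib.sha256(commitment + ctr.to_bytes(4, "big")).digest()
        bits.extend((byte >> i) & 1 for byte in block for i in range(8))
        ctr += 1
    return 2.0 * np.array(bits[:m]) - 1.0

# (a) commit to the model and input (stand-in for Commit(LLM, x))
C = hashlib.sha256(b"model-weights || x").digest()

# (b)-(c) derive Z | C and perturb every parameter of a toy "model"
W = np.ones((4, 8))                # toy weight matrix
Z = signs_from_commitment(C, W.size).reshape(W.shape)
eps = 1e-3                         # low-magnitude noise scale (our choice)
W_pert = W + eps * Z

# (d) run the perturbed model on a (here unperturbed) input
x = np.ones(8)
y = W_pert @ x
```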

This scheme may be vulnerable to backdoors that compute the model faster (e.g., on malicious prompts). Can one base the security of this scheme, or variants thereof, on some simpler concrete assumption?

Another critical problem is that this scheme doesn't admit an efficient SNARK (unlike individual linear layers, i.e., MatMuls), let alone for FP computation, so an additional challenge is providing an efficient SNARK for the whole LLM computation.

Challenge: As an initial challenge, break the simplified scheme where we do not add noise to the input x. That is, find an x for which LLM'(x) can be computed faster than naively running LLM'(x). Next, break the actual scheme where noise is also added to the input, or present simpler assumptions under which it is secure.

Economic Challenges

(1) Debate: Value of (hypothetical) zero-overhead PoUW protocol

(I.e., marginal mining cost is 0!).

A basic price attribution to Bitcoin is the cost of mining (i.e., electricity and amortized ASIC costs). In that sense, PoW must "waste" energy in order to attribute value to the coin (of course, this is merely a lower bound, since the actual price is based on network effects which are hard to model). Indeed, a common criticism of (0-overhead) PoUW is that the marginal cost of mining is ~0, and hence so is the cost of attacking the network.

However, the last statement is only true if one has an infinite stream of useful work, i.e., the ability to collect AI workloads for which an external consumer is willing to pay. This observation highlights the difference between Pearl and Bitcoin: the key "scarce" resource in PoUW mining is the ability to collect useful data (matrices). At ~0 overhead, every AI workload on earth should in theory opt in to mine Pearl (a free call option), in which case the hashrate (and hence distribution/entropy) is also maximized.

What should the value/price of the Pearl coin be at 0 mining overhead? Is it 0? Infinity?

(2) Utilities and Applications of the Pearl Network:

While the Pearl coin was created to serve as AI-native money (i.e., a Store-of-Value (SoV) digital asset for intelligence), the Pearl blockchain also provides a trustless, verifiable historical record of intelligent computing: it will record (encrypted) execution traces of general AI training and inference (from LLMs to AI agents, humanoid robots and self-driving cars). This feature of the blockchain facilitates unique applications. Some examples we thought of are:

a. Voting system for AI agents: a permissionless, trustless voting system (async. coordination, GPU financing, compute marketplace, etc.)

b. Model Tracing: Retroactive proof of inference (zero-knowledge).

c. A "Stablecoin" for trading compute: In the same vein that Bitcoin is used to trade energy outside the dollar system, Pearl can natively be used to trade compute (intelligence?) outside the dollar system – which we believe could lead to a more efficient resource allocation in the AI economy. This concept is far from fleshed out, and would be interesting to further develop...

For questions and submissions, contact: solutions@pearlpolymath.com