MixHash and Public Data Storage Proofs | Ethereum Improvement Proposals

Abstract

This proposal introduces a design for “minimum value selection” storage proofs on Merkle trees. The design consists of two main components:

A hashing algorithm termed MixHash, aimed to replace the commonly used Keccak256 and SHA256 algorithms.
Public data storage proofs. This enables anyone to present a proof to a public network, verifying their possession of a copy of specific public data marked by MixHash.

Additionally, the proposal discusses the practical implementation of this design in various scenarios and suggests some improvements to the ERC-721 and ERC-1155 standards.

Motivation

The ERC-721 and ERC-1155 standards are widely used in the NFT fields. However, the current standards do not provide a mechanism for verifying the existence of public data. This is a major obstacle to the development of many applications, such as decentralized data markets, decentralized data storage, and decentralized data oracles.

Specification

The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”, “SHOULD NOT”, “RECOMMENDED”, “NOT RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be interpreted as described in RFC 2119 and RFC 8174.

MixHash

MixHash is a Merkle tree root hash value that incorporates data length information. Its structure is as follows:

     +-----------256 bits MixHash-----------+
High |-2-|----62----|----------192----------| Low

2   bits: Hash algorithm selection, where 0b00 represents SHA256, and 0b10 represents Keccak256. (0b01, 0b11 are reserved)
62  bits: File size. Hence, MixHash can support file sizes up to 2^62-1.
192 bits: The lower 192 bits of the Merkel root node value constructed by the designated hash algorithm.

Given a file, we can construct a MixHash through the following defined steps:

File MUST Split into 1KB chunks. MUST Pad zeros to the end of the last chunk if needed.
Calculate the hash for each chunk and the low 128bits is the Merkle Tree leaf value.
Construct a Merkle tree , root node hash algorithm is 256bits, other node use low 128bits of the 256bits hash result.
Return the combination of hash type, the file size, and the low 192 bits of the Merkle tree root node hash.

MixHash retains a length of 256 bits, so replacing the widely used Keccak256 and SHA256 with MixHash incurs no additional cost. Although including the file length in the upper 62 bits compromises security to some extent, the 192-bit hash length is already sufficient for defending against hash collisions.

The following is the pseudo code for generating Mixhash:

def generateMixHash(blockHeight,hashType,file):
    chunk_hash_array = []
    for chunk in file:
        if len(chunk) < 1024:
            chunk = chunk + b'\x00' * (1024-len(chunk))
        chunk_hash_array.append(getChunkHash(chunk,hashType))
    merkle_tree_root = getMerkleTreeRoot(chunk_hash_array,hash_type)
    return mix_hash(hash_type, len(file), merkle_tree_root)

Public Data Storage Proofs

When MixHash is used to identify a piece of public data, anyone can construct a storage proof to demonstrate possession of a copy of that data. Here is a typical process for using a public data storage proof:

Users eligible to submit storage proofs for rewards are referred to as Suppliers.
A Supplier prepares a storage proof for data D (with MixHash mix_hash_d) based on a block at height h. A 256-bit nonce value for the proof is derived from this block (usually directly using the block’s hash).
To generate a correct storage proof, the Supplier needs to traverse every 1KB chunk of D to find the optimal leaf node m. This is done by attempting to append the nonce value to the end of each chunk to minimize the new Merkle tree root hash. After determining m, the path m_path and leaf node value m_leaf_data of m are extracted.
The Supplier constructs the storage proof for data D at block time h using {mix_hash_d, h, m, m_path, m_leaf_data} and submits it to the public network.
The public network can validate the correctness of m, m_path, and m_leaf_data based on mix_hash_d: verifying that m is indeed a chunk of D. The timeliness of the proof can be verified through h. After passing both correctness and timeliness checks, the public network calculates proof_result_m based on the nonce value and existing proof information, and saves it.
The public network does not have enough information to verify the optimality of the proof, but other Suppliers with the full data set can submit a better {mix_hash_d, h, better_m, better_m_path, better_m_leaf_data} to challenge the published storage proof.
The public network can determine the success of the challenge by comparing proof_result_m and proof_result_better_m. A successful challenge indicates the old storage proof was forged. If no one challenges the published storage proof within a certain timeframe, it can be considered correct from a game-theoretic perspective.
To support healthy competition, the public network should design an appropriate economic model, rewarding users who provide correct storage proofs and penalizing those who submit false ones.

With an understanding of the above process, let us describe the generation and verification of storage proofs more precisely using Pseudocode.

# generate proof off chain
def generateProof(mixHash, blockHeight,file) 
    nonce = getNonce(blockHeight)
    hash_type = getHashType(mixHash)
    chunk_hash_array = buildChunkHashArray(file,hash_type)

    min_index = 0
    min_merkle_tree_root = MAX_UINT256
    min_chunk = None

    m_index = 0
    for chunk in file:
      new_chunk = chunk + nonce
      chunk_hash_array[m_index] = getChunkHash(new_chunk,hash_type)
      merkle_tree_root = getMerkleTreeRoot(chunk_hash_array,hash_type)
      chunk_hash_array[m_index] = getChunkHash(chunk,hash_type)
      if (merkle_tree_root < min_merkle_tree_root):
        min_merkle_tree_root = merkle_tree_root
        min_index = m_index
        min_chunk = chunk
      m_index = m_index + 1

// verify on chain
function verifyDataProof(mixHash, blockHeight, m_index, m_path, m_leaf_data) {
    if(current_block_height - blockHeight > MAX_BLOCK_DISTANCE) {
       revert("proof expired");
    }
    hash_type = getHashType(mixHash);
    merkle_tree_root = getMerkleTreeRootFromPath(m_path,m_leaf_data,hash_type);
    if(low192(merkle_tree_root) != low192(mixHash)) {
       revert("invalid proof");
    }

    nonce = getNonce(blockHeight);
    proof_result = getMerkleTreeRootFromPath(m_path,m_leaf_data.append(nonce),hash_type);
    last_proof_result,last_prover = getProofResult(mixHash, blockHeight);
    if(proof_result < last_proof_result) {
      emit ProofPunish(last_prover);
      updateProofResult(mixHash, blockHeight, proof_result, msg.sender);
    } 
}

To minimize the size of the storage proof as much as possible, we have optimized the implementation of getMerkleTreeRoot: besides the RootHash, the hash values of other nodes are truncated to the lower 128 bits. This approach effectively compresses the hash value of a complete Merkle tree to half its size. The full implementation details can be found in the subsequent Reference Implementation section.

Defending Sourcing Attack

As can be seen from the process described above, the core of constructing public data storage proofs is based on a public, non-repeating nonce value generated at a specific moment. It requires traversing the entire content of the file within a designated time to construct a correct proof. Without restrictions, this process is vulnerable to external data source attacks: Suppliers do not store data locally but obtain it through network requests when constructing storage proofs. How does our design prevent such attacks?

Time-Limited Response: Suppliers must submit storage proofs within a specified time. On a typical public network like Ethereum, the block time is about 15 seconds. A typical maximum block interval could be 2 (MAX_BLOCK_DISTANCE = 2), meaning Suppliers must complete the construction and submission of the storage proof within 30 seconds. This duration is insufficient for most data sources to complete transmission, thus Suppliers must store data locally to have a chance to construct storage proofs within the allotted time.
Economic Game Theory: The economic model based on public data storage proofs usually rewards the first Supplier to submit a correct storage proof. This means that, from a game-theoretic standpoint, the inherent delay in using external data sources to construct storage proofs reduces the likelihood of successful submission. Economically, it’s less profitable than the expected gains from storing data locally. The economic model incentivizes Suppliers to store data locally.

Success Rate of Defending Sourcing Attack

Using a strategy combining block interval limitations and priority for first-time submissions is often effective in defending against external data source attacks. The effectiveness of this approach primarily relies on the difference in speed between reading files from local storage and retrieving files from the network. We can define the success rate R` of defending against external data source attacks using the following formula:

R = (TNetwork - TLocal) / AvgProofTime

The larger the AvgProofTime, the lower the success rate of defending against Sourcing Attack. Currently, the most significant factor affecting AvgProofTime is the average time for on-chain transactions. For example, in the BTC network, the time for 2 blocks is approximately 20 minutes. With such a large AvgProofTime, the success rate R` decreases rapidly, making sourcing attacks more likely to succeed. We can introduce a dynamically adjustable Proof of Work (PoW) mechanism to further defend against Sourcing Attack. This modification alters the formula as follows:

R = (TNetwork - TLocal) / (AvgProofTime-AvgPoWTime)

With the introduction of the Proof of Work (PoW) concept, the strategy for submitting storage proofs becomes: constructing and submitting storage proofs within a specified time while endeavoring to complete as much PoW computation as possible. In the valid proof time window, the storage proof with the greater amount of PoW computation prevails. Such a mechanism can effectively defend against external data source attacks, especially when AvgProofTime is large.

Integrating a PoW mechanism into the design of public data storage proofs is not complex. A simple implementation could modify the second step to:

2. To generate a correct storage proof, the Supplier needs to traverse all 1KB chunks of D to find the optimal leaf node `m`. The method involves attempting to append the nonce and a self-constructed noise value to the end of each chunk to minimize the new Merkle tree root hash and, according to PoW difficulty requirements, ensuring that the last x bits of the constructed `proof_result_m` are zero. After determining `m` and the noise, the path `m_path` and the leaf node value `m_leaf_data` of `m` are extracted.

The Pseudocode adjusted according to the above modifications is as follows:

# generate proof with PoW off chain
POW_DIFFICULTY = 16
def generateProofWithPow(mixHash, blockHeight,file) 
  nonce = getNonce(blockHeight)
  hash_type = getHashType(mixHash)
  chunk_hash_array = buildChunkHashArray(file,hash_type)

  min_index = 0
  min_merkle_tree_root = MAX_UINT256
  min_chunk = None

  m_index = 0
  noise = 0
  while True:
    for chunk in file:
      new_chunk = chunk + nonce + noise
      chunk_hash_array[m_index] = getChunkHash(new_chunk,hash_type)
      merkle_tree_root = getMerkleTreeRoot(chunk_hash_array,hash_type)
      chunk_hash_array[m_index] = getChunkHash(chunk,hash_type)
      if (merkle_tree_root < min_merkle_tree_root):
        min_merkle_tree_root = merkle_tree_root
        min_index = m_index
        min_chunk = chunk
      m_index = m_index + 1
    if(last_zero_bits(min_merkle_tree_root) >= POW_DIFFICULTY):
      break
    noise = noise + 1
    
  m_path = getMerkleTreePath(chunk_hash_array, min_index)
  return storage_proof(mixHash, blockHeight, min_index, m_path, min_chunk,noise) 

Applying this mechanism increases the cost of generating storage proofs, which deviates from our initial intent to reduce the widespread effective storage of public data. Moreover, heavily relying on a PoW-based economic model may allow Suppliers with significant advantages in PoW through specialized hardware to disrupt the basic participatory nature of the game, reducing the widespread distribution of public data. Therefore, it is advised not to enable the PoW mechanism unless absolutely necessary.

Limitations

The storage proofs discussed in this paper are not suitable for storing very small files, as small files inherently struggle to defend against external data source attacks.
Public data storage proofs do not address the issue of whether the data is genuinely public. Therefore, it is important to verify the public nature of MixHash in specific scenarios (which is often not easy). Allowing Suppliers to submit storage proofs for any MixHash and receive rewards would lead to a situation where Suppliers create data only they possess and exploit this to gain rewards through constructed attacks, ultimately leading to the collapse of the entire ecosystem.

ERC Extension Suggestion: Tracking High-Value Public Data by MixHash

We can use the existing Ethereum ecosystem to confirm whether a MixHash is public data and track its value. For any contracts related to unstructured data, the ERCPublicDataOwner interface can be implemented. This interface determines whether a specific MixHash is associated with the current contract and attempts to return an Owner address corresponding to a MixHash. Additionally, for the existing and widely recognized NFT ecosystem, we suggest that new ERC-721 and ERC-1155 contracts implement a new extension interface ERC721MixHashVerify. This interface can explicitly associate an NFT with a MixHash. The specific interface definition is as follows:

/// @title ERCPublicDataOwner Standard, query Owner of the specified MixHash
///  Note: the ERC-165 identifier for this interface is <ERC-Number>.
interface ERCPublicDataOwner {
    /**
        @notice Queries Owner of public data determined by Mixhash
        @param  mixHash    Mixhash you want to query
        @return            If it is an identified public data, return the Owner address, otherwise 0x0 will be returned
    */
    function getPublicDataOwner(bytes32 mixHash) external view returns (address);
}

The ERC721MixHashVerfiy extension is OPTIONAL for ERC-721 smart contracts or ERC-1155 smart contracts. This extension can help establish a relationship between specified NFT and MixHash.

/// @title ERC721MixHashVerfiy Extension, optional extension
///  Note: the ERC-165 identifier for this interface is <ERC-Number>.
interface ERC721MixHashVerfiy{
    /**
        @notice Is the tokenId of the NFT is the Mixhash?
        @return           True if the tokenId is MixHash, false if not
    */
    function tokenIdIsMixHash() external view returns (bool); 
    
    /**
        @notice Queries NFT's MixHash
        @param  _tokenId  NFT to be querying
        @return           The target NFT corresponds to MixHash, if it is not Mixhash, it returns 0x0
    */
    function tokenDataHash(uint256 _tokenId) external view returns (bytes32);
}

Rationale

Storage proofs (often referred to as space-time proofs) have long been a subject of interest, with numerous implementations and related projects existing.

Compared to existing copy proofs based on zero-knowledge proofs, our storage proof is based on “Nash Consensus,” with its core principles being: a. The public network (on-chain) cannot verify the optimality of a proof but relies on economic game theory. This significantly reduces the costs of construction and verification. b. Data without value typically lacks game value and is naturally eliminated from the system. There is no commitment to elusive perpetual storage.
It can be fully implemented through smart contracts (although the GAS cost of the current reference implementation is somewhat high), separating storage proof from the economic model.
For public data, we do not strictly defend against Sybil attacks. A Sybil attack refers to a Supplier using multiple identities to commit to storing multiple copies of data D (e.g., n copies) while actually storing less (like just one copy) but providing n storage proofs, thereby succeeding in the attack. Strictly preventing Sybil attacks essentially means attaching more additional costs to data storage. The core of our storage proof is to increase the probability of the existence of public data copies through a combination of storage proofs and different economic models, rather than needing to strictly define how many copies exist. Therefore, from the perspective of the design of public data storage proofs, we do not need to defend against Sybil attacks.

Backwards Compatibility

Using HashType allows storage proofs to be compatible with EVM-compatible public blockchain systems, as well as BTC-Like public blockchain systems. In fact, MixHash could become a new cross-chain value anchor: it can track the value of the same data represented by MixHash across different public blockchain networks using different models, achieving the aggregation of cross-chain values. Considering the need for backward compatibility, we have set the default HashType of MixHash to SHA256. Two categories of HashType remain unused, leaving ample room for future expansion.

Test Cases

PublicDataProofDemo includes test cases written using Hardhat.

Reference Implementation

PublicDataProof Demo

A standard reference implementation

DMC public data inscription

Based on public data storage certification, a complete economic model and gameplay has been designed on ETH network and BTC inscription network

Learn more background and existing attempts

DMC Main Chain
CYFS

Security Considerations

This storage proof revolves around public data. In demonstrating storage proofs, it often involves sending 1KB segments of the data to the public network. Therefore, please do not use the storage proof design presented in this paper for private data.

The design of MixHash can support storage proofs for private files, but this requires some adjustments in the processing of the original data and the construction of the storage proof. A detailed discussion on the design of storage proofs for private files is beyond the scope of this paper. In fact, some of the projects mentioned in the Reference Implementation section use both public data storage proofs and private data storage proofs.