By: Omar Metwally, M.D.
To illustrate the utility of cryptography and peer-to-peer networking in protecting the authenticity, integrity, and availability of information.
1. Information is the useful synthesis of data.
Our email inboxes, phones, and hard drives are constantly filling up with data; however, collecting, organizing, and archiving the useful nuggets of information in an ocean of junk requires time, money, and energy. The number of useful emails in my inboxes is a small fraction of the total number of emails, which are mostly spam. I don’t pay for extra storage out of principle. Why fund a company whose spam filters are more likely to block important emails than spam? Why perpetuate the problem?
Similarly with the high-resolution photos which take up so much memory on my phone and hard disk: most of these photographs do not deserve the 2+ MB of memory they occupy on my phone and PC. I’ll commonly snap a photo of a beautiful landscape, a critter I encounter on a walk, or something I need to remember for a short period of time (for example, where I parked). Backing up every photo and video on my phone seems wasteful considering that, like my email inbox, only a small proportion are media that I actually want to preserve. The alternative, however, would be to manually go through each of my inboxes and every photo I take on my phone and make a conscious decision whether to keep or delete a file. This latter strategy often proves far too time-intensive to pursue on a consistent basis.
2. Data that exists in only one location is as good as gone.
I once asked a colleague how he backs up his digital information. “I’ve never needed to back up my data,” he answered. This is a fallacy. Every possible failure of a digital system will eventually and inevitably occur. Hard disks fail all the time. People accidentally delete and lose files. Important bits of information drown in oceans of spam and junk, to the extent that locating them becomes practically impossible. Networked systems get hacked. People lose or upgrade their phones and change platforms, only to realize years later that they never backed up their old Android or iPhone which is now resting in a landfill.
Preserving information in a way that facilitates future retrieval requires:
– a consistent schema for organizing files and directories
– multiple physical (e.g. HDDs and SSDs) and cloud-based storage systems
– a consistent version control schema
– consistency in backing up information to each of these media
In other words, if you really cherish your data, you need to be organized, anticipate what can (and inevitably will) go wrong, and back up consistently. If it’s important information, chances are you’ll also want to encrypt your disks in a way that prevents unauthorized parties from accessing the data, without accidentally losing access to your own data.
3. Cryptography is arguably one of the most useful and powerful technologies in modern-day computing.
Modern cryptography is the basis for digital tools that protect the authenticity and integrity of information. While information ends up in the wrong hands all the time, encryption ensures that only the intended recipient can “unlock” the information. To lay people, “encryption” may conjure messaging apps designed for protect one’s privacy. However, another compelling use case of cryptography, which may be unknown to lay computer users, is to mathematically prove the authenticity of digital information. Algorithms such as SHA256 [https://csrc.nist.gov/glossary/term/SHA_256] can generate a mathematically unique string of numbers and letters, which can serve as a “fingerprint” for a file’s authenticity. Altering even the slightest letter in a document changes this cryptographic fingerprint.
Just like no two individuals have the same fingerprint, so do non-identical files yield unique cryptographic hashes. For instance, an attorney who needs to ensure the authenticity of a collection of evidence can use a cryptographic hashing algorithm such as SHA256 to prove beyond a doubt that the data do indeed represent what the attorney claims they do. However, it’s important to note that these hashing algorithms do not necessarily preserve the actual data to which they refer. It is still upon the attorney to back up the evidence in a secure and redundant manner. Furthermore, the attorney must ensure that each backup is identical. Although a small discrepancy may or may not be consequential in court (for instance, accidentally adding a space, period, or comma may or may not alter the interpreted meaning of a document), the cryptographic hash will be altered, negating the utility of the hashing algorithm.
4. Distributing and decentralizing information is a key value proposition of blockchain networks
Encryption and hashing preceded cryptocurrencies. Hash functions, which are defined by the National Institute of Standards and Technology, are generally free to use and are accessible via command line on any computer. Arguably the biggest value proposition of blockchain networks, on a technical level, is their capacity to add verifiable and tamper-proof timestamps to cryptographic hashes, by propagating a verifiable and identical chronological database across numerous peers around the world. Being able to reliably exchange information with thousands of computers across the world, spanning many different geographic areas, yields redundancy that would be implausible to replicate by entrusting any one party to create thousands of backups, spread them around the world, ensure that they can be accessed reliably, and also ensure the integrity of the original information. In reality, governments restrict access to online content all the time. People in affected locations can use tools such as VPNs to try and circumvent these limitations, but as long as a critical number of nodes is online, the information will not be lost, even if it is inaccessible from a certain geographic region due to inability to run a p2p client.
Cryptocurrencies create financial incentives for people to volunteer hard disk space, broadband, their time, skills, computing resources, and energy to contribute to a peer-to-peer network. Rather than relying on one party to ensure the integrity, authenticity, and availability of data (which is typically hosted in a relatively small number of geographic locations), blockchains are essentially distributed databases (also known as “distributed ledgers” when used in the context of exchanging digital value).
5. Ensuring information availability is another value proposition of blockchain networks
I have been experimenting with IPFS (“InterPlanetary Filesystem” [https://ipfs.io/]), a peer-to-peer file-sharing networking, since 2017. Each byte stored directly on a blockchain network is relatively expensive. While all blockchains are peer-to-peer networks, not all peer-to-peer networks are blockchain. IPFS, an example of a peer-to-peer network that is not a blockchain, allows users to easily upload directories and files to the network, where they are relayed from node to node. IPFS itself is free to use; that is, there is no subscription fee to cover hosting costs because volunteers around the world share in hosting the data. However, this utopian dream of “share everything, preserve everything” ignores the reality of the cost of hosting data. Bandwidth, disk space, processing power, and electricity cost money. Data hosted on IPFS can be “pinned” using a 3rd-party service, but this crosses the line of decentralization and places trust in a 3rd-party service to ensure the persistence of these data. Furthermore, it’s unclear to me why a 3rd-party service would volunteer their resources freely without charging a hosting fee.
Filecoin is a cryptocurrency developed by the creators of IPFS (Protocol Labs) which aims to solve this missing economic incentive. The Filecoin protocol aims to incentivize miners (people with a lot of computing power and storage capacity) to host others’ data by rewarding them with the Filecoin cryptocurrency in exchange for running software that can mathematically prove that the hosted data (1) exist on their hard drive(s), and (2) can be retrieved by the party that is paying Filecoin in exchange for their data to be hosted.
I downloaded the Filecoin client (“Lotus”) and spent several days running IPFS and Lotus in parallel in order to see if hosting a 113 MB file on Filecoin was a better alternative to using traditional cloud servers, and also to learn about the economics of the Filecoin ecosystem. I provide here my impressions of this limited experience without a recommendation for or against any cryptocurrency.
It took me a few hours to sync the Filecoin mainnet to completion. I had to download a snapshot of the chain in order to sync, and I could not locate a SHA256 checksum of the snapshot used to sync. I was unable to sync by connecting to peers directly. Using snapshots hosted on a centralized server which are not associated with published checksums is never best practice because there’s otherwise no way to ensure the authenticity or integrity of what one thinks they are downloading.
The Slack channels used by the Filecoin community are active, and I received timely answers to my questions by knowledgeable contributors. Once the Filecoin chain was synced, I proceeded to upload a 113 MB file using its IPFS hash (that is, the file was already uploaded to IPFS, and I used the IPFS hash to point to the data). The process of uploading data generally entails (1) identifying storage providers (miners) who are willing and able to host one’s data; (2) uploading the data to the storage providers; and (3) paying a transaction fee to upload the data. These transactions are referred to as “deals” and can range from 180 to 540 days in duration. Miners can specify parameters such as the minimum and maximum file size they are willing to host, duration of hosting, and their cost per Gigabyte per time period (in the case of Filecoin, per 30-second epoch). Retrieving data involves a separate set of processes, but I haven’t yet made it that far.
In Filecoin, miners host others’ data, which may or may not be encrypted. This is a potential legal gray area because miners generally don’t know what they’re hosting, and miners are often located in jurisdictions separate from the party seeking hosting services. Deals can be arranged on a Slack channel or third-party reputation marketplaces, but rarely does one know whom exactly they’re dealing with. What happens if a party is uploading content that is illegal in their jurisdiction? Or perhaps legal in their jurisdiction but forbidden in the miner’s jurisdiction?
The process of trying to host data on Filecoin is far more complex than using traditional cloud servers. The average person is unlikely to succeed without a strong commitment to the steep learning curve involved in using these command-line tools. Some of the complexities can theoretically be simplified using third-party services, but this can potentially negate the advantages of using an incentivized p2p network in the first place.
The Filecoin protocol incentivizes miners to contribute their computing resources (and time) to host others’ data by rewarding them for reliably hosting others’ data and financially punishing them by deducting penalties from the collateral they have to put up. Due to the relatively early stage of development of these tools, Filecoin documentation recommends making multiple deals with up to 10 different miners to ensure the availability of one’s data, in case one or more miners’ do not make good on their deal.
On my first attempt to upload a 113 MB file, the “deal” failed for unclear reasons, despite my attempts to troubleshoot the Lotus client’s behavior with the help of technical support volunteers. My starting balance was one Filecoin (1 FIL). Here are some numbers central to the (failed) transaction:
Initial wallet balance: 1 FIL
Cost of hosting 113 MB file with a particular miner for 180 days: 0.01296 FIL ($0.225504, at an exchange rate of $17.4 per FIL on March 12th, 2022).
Wallet balance after the escrow funds were returned to my wallet (i.e. after the deal failed):
Difference between initial and final wallet balance = amount of “gas” burned (network transaction fees):
Therefore, 51.285% of the original proposed cost of hosting the file (0.01296 FIL) was burned in the form of gas. In other words, 0.006646556300701767 FIL / 0.01296 FIL = 0.5128515664121734
While the amount of burned gas may seem trivial, it accounts for a majority of the cost of the failed deal (51.285%)! If the goal is to establish 10 deals with 10 different miners, then the cost of gas associated with failed deals can quickly add up.
6. Mathematical proof of data availability may or may not be necessary
There are certainly cases in which it’s necessary to prove mathematically not just the integrity and authenticity of data (for example, using hashing functions such as SHA256), but also the availability of the data. Filecoin aims to mathematically prove both the existence and availability of data hosted on a peer to peer network while incentivizing miners to uphold deals with parties who need data hosted. However, there are also many instances where a SHA256 checksum uploaded to a blockchain with an immutable timestamp is more than sufficient. In this latter case, the responsibility of organizing, archiving, and maintaining identical copies of these data falls upon the party willing to pay for the weight of this proof. As mentioned above, there are instances where entrusting miners to store and deliver content may be undesirable for legal reasons, privacy, or simply the need to trust that at least one miner with whom one conducts a deal will uphold their end of the deal.
In conclusion, cryptography and peer-to-peer networking are powerful technologies that can help protect the integrity of information and ensure its persistence. Various blockchain networks use financial incentives in different ways to provide a variety of value propositions to network participants. Clearly understanding one’s goals as the relate to information preservation/exchange, and clearly understanding each network’s value proposition, is key to making good investments of one’s time and resources.