Have you heard about Silent Data Corruption (SDC)? It’s everywhere, and yet in the storage networking world you can hardly find a vendor talking about it.
I wrote a paper for MNCC (Malaysian National Computer Confederation) a few years ago, and one of the examples I used was what they found at CERN. CERN, the European Organization for Nuclear Research, published a paper in 2007 describing the issue of SDC. Later, in 2008, they found that approximately 38,000 files were corrupted in the 15,000TB of data they had generated. SDC is therefore very real, and yet in the storage networking industry, where data matters the most, it is one of the least talked-about issues.
What is Silent Data Corruption? No computer component that we use is perfect. It could be the memory; it could be the network interface cards (NICs); it could be the hard disk; it could also be the bus, the file system, the data block structure. Any computer component, whether hardware or software, that handles bits of data is susceptible to SDC.
Data corruption happens all the time. It occurs when a bit or a set of bits is changed unintentionally. Some of the causes are listed below:
- Hardware errors
- Data transfer noise
- Electromagnetic Interference (EMI)
- Firmware bugs
- Software bugs
- Poor electrical current distribution
- Many more …
And that is why there are published statistics for some hardware components such as memory, NICs and hard disks, and even for protocols such as Fibre Channel. These published statistics talk about BER, or bit error rate, which is the number of erroneous bits divided by the total number of bits transferred or processed. It is typically quoted as one erroneous bit per so many billions or trillions of bits.
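To put a BER figure into perspective, here is a minimal sketch of the arithmetic. The BER values used (1e-12 and 1e-15) are assumptions for illustration only, not figures for any particular device, and the function names are hypothetical.

```python
def expected_bit_errors(bytes_moved: int, ber: float) -> float:
    """Expected number of erroneous bits when moving `bytes_moved` bytes."""
    return bytes_moved * 8 * ber


def bytes_per_expected_error(ber: float) -> float:
    """Volume of data (in bytes) at which one bit error is expected on average."""
    return 1.0 / ber / 8


# Assumed BER values for illustration only; check the actual datasheet.
for ber in (1e-12, 1e-15):
    tb = bytes_per_expected_error(ber) / 10**12
    print(f"BER {ber:g}: one expected bit error per {tb:g} TB")
```

Under these assumed rates, one bit error is expected for roughly every 0.125TB and 125TB moved, respectively. A rate that sounds tiny can still matter once the volume of data reaches the scale described at CERN.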
And it is also why there are inherent mechanisms within these channels to detect data corruption. We see them all the time in things such as checksums and hashes (CRC32, SHA1, MD5 …), parity and ECC (error-correcting code). Because these mechanisms detect the corruption, we see errors and warnings about it.
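As a rough illustration of how a checksum catches corruption, the sketch below uses CRC32 from Python's standard `zlib` module. The payload and the `flip_bit` helper are made up for this example; real storage stacks compute and verify checksums in hardware or in the file system, not like this.

```python
import zlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of `data` with one bit inverted (simulated corruption)."""
    buf = bytearray(data)
    buf[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(buf)


# Hypothetical payload; the checksum is stored alongside it at write time.
original = b"storage block payload"
stored_crc = zlib.crc32(original)

# Simulate a single bit changing somewhere between write and read.
corrupted = flip_bit(original, 5)

print("clean read matches:    ", zlib.crc32(original) == stored_crc)
print("corrupted read matches:", zlib.crc32(corrupted) == stored_crc)
```

The corrupted read no longer matches the stored checksum, so the error is detected and can be reported.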
However, SILENT data corruption does not appear as errors and warnings, and it does OCCUR! And this problem is getting more and more prevalent in modern-day drives, especially solid state drives (SSDs). As the drive manufacturers come out with more compact, higher-capacity and higher-performance drives, the cell geometry of SSDs is becoming smaller and smaller. This means each cell has a smaller area in which to hold the electrical charge and maintain the bit value, either a 0 or a 1. At the same time, the smaller cell is more sensitive and susceptible to noise, electrical charge leakage and interference from nearby cells, as some SSDs have different power modes to address green requirements.
When such things happen, a 0 can look like a 1 or vice versa, and if the error goes undetected, it becomes silent data corruption.
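To show how an error can slip past a detection mechanism, here is a small sketch using a single even-parity bit, one of the simplest schemes. The data block and helper names are hypothetical; the point is only that a check which catches one flipped bit can miss two.

```python
def even_parity(data: bytes) -> int:
    """Parity bit over the whole block: 1 if the count of set bits is odd."""
    return sum(bin(byte).count("1") for byte in data) % 2


def flip_bits(data: bytes, positions) -> bytes:
    """Return a copy of `data` with the bits at `positions` inverted."""
    buf = bytearray(data)
    for pos in positions:
        buf[pos // 8] ^= 1 << (pos % 8)
    return bytes(buf)


block = b"\x0f\xf0\x55"          # made-up data block
stored_parity = even_parity(block)

one_flip = flip_bits(block, [3])       # a single bit changes
two_flips = flip_bits(block, [3, 12])  # two different bits change

print("one flip detected: ", even_parity(one_flip) != stored_parity)
print("two flips detected:", even_parity(two_flips) != stored_parity)
```

The single flip changes the parity and is caught. The double flip leaves the parity unchanged, so the block passes the check even though its contents differ from what was written. That is silent corruption in miniature.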
Most common storage networking technologies, such as RAID and file systems, were introduced during the 1980s or 1990s, when disks were 9GB, 18GB and so on, and Fast Ethernet was the standard for networking. Things have changed at a very fast pace, and data growth has been phenomenal. We need to look at storage vendors’ technology more objectively now and dig deeper into issues such as SDC.
SDC is very real, so learn about it and equip yourself with the knowledge, and don’t just take what vendors say verbatim. Find out … and be in control of what you are putting into your IT environment.