Silkenweb Example: Hackernews Clone

Practical Reed-Solomon for Programmers

229 points by heydenberk 4 years ago | 43 comments

h2odragon 4 years ago
Amusing: the linked library page, https://www.ka9q.net/code/fec/ says (from 2007) "The new Intel Macs are not yet supported. Stay tuned."
Ignore an issue for long enough, and it stops being an issue :)
coder543 4 years ago
If anyone wants to read more about Reed Solomon, this is the article that helped me understand Reed Solomon earlier this year: https://innovation.vivint.com/introduction-to-reed-solomon-b...
- ahubert 4 years ago
  (author here) Nice! Added that to the page, very worthwhile. Thanks.
  - vmilner 4 years ago
    This BBC engineering paper is also useful.
    https://downloads.bbc.co.uk/rd/pubs/whp/whp-pdf-files/WHP031...
boredprograming 4 years ago
Just to add to this topic, usually Reed Solomon codes are interleaved with Turbo codes or LDPC.
The reason is this: Reed Solomon corrects entire symbols, but can only correct so many symbols per block. LDPC and Turbo correct single bit errors. If you have a lot of single bit errors, mixed with completely corrupt blocks, it's best to use both.
Interleaving usually mixes the bits around too, to spread long chains of errors out between symbols.
This setup is used in digital radio, 3G, CD and DVD, DSL, etc
conductor 4 years ago
Thanks for the article and code.
Also check out https://github.com/Parchive/par2cmdline
I wish something like this was integrated into a modern and free archive format. RAR has it. RAR also got volume recovery: when creating split multi-volume archives you can also create recovery volumes - each recovery volume can replace any other one missing volume. It was a life-saver in floppy-disk times.
- dale_glass 4 years ago
  We're not using floppies anymore, and media is extremely reliable.
  On the internet you can just retransmit if something went wrong, and if you're worried about the media specifically then there are media specific measures, like RAID and filesystem level checksums.
  - macintux 4 years ago
    Cosmic rays are a thing, and wasn’t there just a study linked here a few days ago about how unreliable hardware is at scale?
    - dale_glass 4 years ago
      Oh, I'm all for ECC and checksumming and whatnot.
      All I'm saying is that adding recovery info to archives is probably not ideal -- we can do it in a task or media specific way that works best for whatever context the data is in.
      Redundancy in .RAR made perfect sense back when you were transporting data on floppies, and if the floppy was bad, it was a huge pain. These days we transport over the network and don't need to carry any redundant data at all. All we need is a checksum and to retransmit whatever didn't arrive correctly.
ithkuil 4 years ago
There are other erasure codes that go beyond RS. For example LRC (Local Reconstruct Codes) is very interesting; it improves on reed Solomon by requiring less bandwidth/IO for repair reads.
https://www.usenix.org/conference/atc12/technical-sessions/p...
- 4 years ago
xrd 4 years ago
I loved this article and going down the rabbit hole with the associated links was excited to see a discussion from Russ Cox (golang) about how QR codes rely on RS (and another great article)
https://research.swtch.com/qart
crazygringo 4 years ago
Are there any uses cases for Reed-Solomon at the application level?
My impression was always that error checking was implemented at the hardware level, or else at the OS/driver level.
But just curious if there are some applications I'm missing.
- gopalv 4 years ago
  > uses cases for Reed-Solomon at the application level
  Hadoop has an RS implementation inside the filesystem (called "erasure coding"), instead of storing 3 copies of the same data, it can actually instead store ~1.5 copies as (6+3) or (10+4).
  Previously, I've run into this tech in satellite internet gateways, but distributed filesystems is where I've gone through the math & probabilities of failure properly.
  I work on perf & the extra network hops (with 3 replicas, you read 100% of data local, when you stripe it that doesn't work) and math for the error correction are hot spots when you are trying to keep all cores busy.
- elisbce 4 years ago
  Linux RAID-6: RS(10,8) Google File System II (Colossus): RS(9,6) Quantcast File System: RS(9,6) Intel & Cloudera’ HDFS-EC: RS(9,6) Yahoo Cloud Object Store: RS(11,8) Backblaze’s online backup: RS(20,17) Facebook’s f4 BLOB storage system: RS(14,10) Baidu’s Atlas Cloud Storage: RS(12, 8) .....
  - oofabz 4 years ago
    Are the numbers the size in bits? If so, those are very small blocks. It seems like you would get more effective redundancy with larger blocks. Does Reed-Solomon scale so poorly that only tiny blocks are computationally feasible?
    - sliken 4 years ago
      Generally blocks. So backblaze takes 17 blocks of input, and generates 20 blocks out output. Then recoveries work if 17 or more blocks are recovered.
- the_benno 4 years ago
  I worked on a research project in undergrad that used Reed-Solomon -- we showed a series of images to users, who would later pick those images out of a set of un-trained images in order to authenticate or access some privileged data, sans encryption. the error-correcting allowed for some degree of misremembering or forgetfulness, as is normal in human image recognition tasks.
  The professor later spun up a startup that has since been acquired by Dropbox, but I'm unaware what kind of product they're currently working on.
- 4 years ago
- mook 4 years ago
  Parchive 2 was, I believe, an implementation of Reed-Solomon at the application level; mostly used (when it was introduced) to counter loss from missing chunks on binary newsgroups. I think RAR might have picked up a similar feature around that time too?
  I believe parchive version 1 was simple XOR parity.
  - akalin 4 years ago
    No, parchive v1 also used Reed-Solomon. The main difference between v1 and v2 was that v1 worked on the file level, but v2 divided the set of files into blocks.
    Also, the parity matrix for v1 would be sometimes singular (non-invertible), which v2 tried to fix but it didn't quite work.
- curtisf 4 years ago
  You may want to be able to run highly resilient databases even on non-error-detecting hardware. (Most consumer disks, and even a lot of enterprise disks, don't reliably detect/report disk corruption)
  You might also be building a distributed system, where blocks are spread across multiple disks. In that case, an "error" you're trying to correct might be the loss of a disk or server in the distributed system; there is no one single disk or OS responsible for all the relevant blocks in that case.
  But, both of these are assuming you're building some kind of data-store, which is not a typical user application. Writing robust data-stores is hard for many reasons, error correcting is just one of them.
- alphabet9000 4 years ago
  i used a reed solomon library recently in a project that encodes and decodes binary data stored in a video channel. i added it to compensate for compression/noise introduced in the re-encoding process on third party sites like youtube. here's an example of what it looks like: [0]
  [0] https://cicada.jollo.org/xg/sp.mp4
- cycomanic 4 years ago
  QR codes use Reed Solomon codes
  - jtlienwis 4 years ago
    So do pdf-417 barcodes. I remember writing the code for pdf417_decode and pdf417_encode (freeware) back in the day. The Blahut book on error correcting codes was a great help, but all the examples where using binary galois fields. The pdf-417 used a field of powers of 3 mode 929, so my brain had to figure out how that worked and put it into code.
- kevin_thibedeau 4 years ago
  If you need to transmit data over an unreliable link where retransmission isn't practical you will want some sort of FEC. If hardware isn't available to do the job then it gets done in software.
- sparc24 4 years ago
  s3, dropbox - https://dropbox.tech/infrastructure/inside-the-magic-pocket
sparc24 4 years ago
These 2 papers helped me better understand some more practical aspects of it.
https://mirrors.edge.kernel.org/pub/linux/kernel/people/hpa/...https://www.usenix.org/legacy/event/fast08/tech/full_papers/...
moreati 4 years ago
A semi-related question/puzzle. Is it possible to reverse engineer parameters of a Reed Solomoon function?
Given one or more known inputs/outputs of a Reed Solomon function with known symbol size, message length, number of parity symbols. Is it possible to calculate the other parameters (polynomial coefficients, primitive element etc)?
I failed to find an answer a few years ago, when trying to interoperate with chirp.io (now gone). https://math.stackexchange.com/questions/663643/discover-par...
- xorvoid 4 years ago
  Doesn’t seem too hard actually, given enough example input. The parity is just a linear combination of the input so you can set up a linear system and solve for those things encoding matrix rows pretty easily. From there you may need to make some assumptions about what type of matrix was used (e.g. vandermonde is popular) and perhaps the field size being used (e.g. GF(2^8), and with those, it should be easy to determine the reducing polynomial. Some of these choices are small enough to be able to brute force (finite fields used for practical problems tend to be small, naturally). If you have a dataset, I might actually toy with it myself.. sounds fun.
  - moreati 4 years ago
    Thanks. I have 30 examples I got from the API, many moons ago https://github.com/moreati/chirppy/blob/master/examples.json. My finite field knowledge is not up to solving it analytically, my attempts to brute force it are in https://github.com/moreati/chirppy/blob/master/trials.py. That includes functions to go between the characters, and integer representations.
nayuki 4 years ago
Mathematical derivation and runnable code for a Reed-Solomon error-correcting code decoder: https://www.nayuki.io/page/reed-solomon-error-correcting-cod...
Tcepsa 4 years ago
At the risk of oversimplifying, is there any value in the analogy, "It's like RAID but for data streams"?
- frabert 4 years ago
  Problem with that is, as the article states, RAID6 _can_ use RS as its error correction method, which turns this description into something a bit too circular for my taste
IAmLiterallyAB 4 years ago
What is the difference between RS and Hamming codes?
- rurban 4 years ago
  Hamming Codes can only recover from 2 wrong bits per byte, with RS it's configurable.
alessandroetc 4 years ago
what are your thoughts on storj.io? they utilize R-S for all their entire storage platform.
mrfusion 4 years ago
Is this what tcp/ip uses?
- rl1987 4 years ago
  No. TCP, IP and UDP are all using Internet Checksum (RFC 1071). Ethernet uses CRC.
  - znpy 4 years ago
    I went and skimmed the rfc you're citing and it's fro 1988, it's old enough to have samples in C code where the "register" keyword is still used (afaik nowadays most compilers will ignore it) along with assembly code for motorola 68020, Cray CPUs and the IBM-370.
    EDIT: for reference: https://datatracker.ietf.org/doc/html/rfc1071.html
  - brandmeyer 4 years ago
    All of these are techniques for verifying a message. Galileo also uses a 24-bit CRC to verify its messages after performing forward error correction.
- jandrese 4 years ago
  TCP doesn't do forward error correction at all. It uses a retransmission scheme instead. TCP's error correction is not only to insure a consistent datastream, but also to more equitably share limited bandwidth.
- stagger87 4 years ago
  Forward error correction is usually at the PHY layer, not the transport layer. So the more appropriate question is, does 10GbE/100GbE/WLAN/etc use RS coding? (and yes some of them do.)
- TonyTrapp 4 years ago
  No, but Reed-Solomon codes are used on on CDs.
  - sliken 4 years ago
    Indeed, I think there's a 3 layers of encodings to protect against different corruption scenarios. Trick is some equipment only does a partial implementation, thus the need for a tool like cdparanoia to recovery data from corrupted CDs.