How to Write a Spelling Corrector (2007)

76 points by fybs 3 years ago | 23 comments
  • Joker_vD 3 years ago
    And after you've read that, here's a related blogpost: "A Spellchecker Used to Be a Major Feat of Software Engineering" [0], because Python being "fast enough" and having enough memory for large dictionaries hasn't always been the case.

    [0] https://prog21.dadgum.com/29.html

    • tgv 3 years ago
      I'll repeat my feat in this area: a spelling corrector that had 32kB of memory (because it was tied to a keyboard trap, or something like that) and could do four 8kB reads from CD-ROM before answering the user. It included not only orthographic similarity (the actual letters) but also phonetic similarity. You rarely see that in older English spell checkers, because English is such an irregular language, but for other languages it's quite informative; in extremely regular languages, such as Spanish, you could also fold it into the weighting/probability model (e.g. v<->b is a pretty common confusion).

      The CD-ROM spelling corrector was not really great, BTW, but at least it replied in 1s on a typical end-user PC.

      Edit: this was late 1980s.

    • mtreis86 3 years ago
      One of my most common spelling mistakes is a physical mistype on the keyboard, yet no spell checker seems to account for keyboard layout and the locality of keys, or for something like my hand being one position off on the keyboard while hitting all the keys correctly relative to each other, just shifted.
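
      (For illustration, a toy sketch of what adjacency-aware weighting could look like; the neighbor table, the 0.4 substitution cost, and the example words below are invented for this comment, not taken from any real checker:)

          # Toy weighted edit distance: substituting a key for one of its physical
          # neighbors on a QWERTY layout costs less than an arbitrary substitution.
          QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
          KEY_POS = {c: (r, i) for r, row in enumerate(QWERTY_ROWS)
                     for i, c in enumerate(row)}

          def sub_cost(a, b):
              if a == b:
                  return 0.0
              if a not in KEY_POS or b not in KEY_POS:
                  return 1.0
              (ra, ca), (rb, cb) = KEY_POS[a], KEY_POS[b]
              # Crude adjacency test; a real model would use the actual key geometry.
              return 0.4 if abs(ra - rb) + abs(ca - cb) <= 1 else 1.0

          def weighted_edit_distance(s, t):
              # Standard dynamic-programming edit distance, using the cost above.
              prev = [float(j) for j in range(len(t) + 1)]
              for i, a in enumerate(s, 1):
                  cur = [float(i)]
                  for j, b in enumerate(t, 1):
                      cur.append(min(prev[j] + 1.0,                  # deletion
                                     cur[j - 1] + 1.0,               # insertion
                                     prev[j - 1] + sub_cost(a, b)))  # substitution
                  prev = cur
              return prev[-1]

          # "cst" comes out closer to "cat" (a and s are neighbors) than to "cut".
          print(weighted_edit_distance("cst", "cat"), weighted_edit_distance("cst", "cut"))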
      • wootest 3 years ago
        Many designs do this implicitly. At least one, the original iPhone keyboard autocomplete/suggestion algorithm, did it explicitly.

        Ken Kocienda's book Creative Selection has a very good chapter on how the algorithm was built piece by piece; finding words that could have been formed by the surrounding keys was part of collecting all the candidates.

        They even used this in marketing. One of the pre-original-iPhone-launch videos was focused just on the on-screen keyboard (probably because almost everyone thought it was a really kooky idea at the time), and used the example of explicitly pressing "O-U-Z-Z-A" but still getting "pizza" as the autocomplete because it was the closest recommendation.

        One of the iOS versions a few years ago became incredibly fond of including the space bar, considering alternatives in which slightly-off key presses near the space bar were split into two or more words. When you're using a language with a lot of compound words, like Swedish, this yielded some almost postmodern corrections, with one or more of the words often completely different (but close on the keyboard, of course). I don't know if this was a tweak to the manual algorithm going off the rails or an AI version that wasn't quite tuned well yet.

        • brian_cloutier 3 years ago
          Spell checkers that account for keyboard layout absolutely exist. [1] is one example, but in general it's hard to believe that any trained model would fail to notice that certain mistakes are more likely than others, and it's hard to believe that any spell checker Google releases these days would not be based on a trained model.

          [1] https://ai.googleblog.com/2017/05/the-machine-intelligence-b...

          • m463 3 years ago
            I think the problem is that there are multiple layers of correction - both the physical and the spelling/semantic.

            On iOS I just had to turn off autocorrect, because I had reminders, jotted down quickly, that missed the physical correction and were irretrievably corrected semantically into garbage.

            Now I find a note with "spekk" and I can figure out I meant "spell" (need a better example).

            • z3t4 3 years ago
              The problem is that if you take an English word and change one letter, it will likely become another valid word.
              • tyingq 3 years ago
                I would guess that the iPhone autocorrect does this implicitly since it's ML trained.
              • killingtime74 3 years ago
                Those interested in toy implementations in this area might also enjoy this blog post on FSMs: https://blog.burntsushi.net/transducers/

                Also the NLP Book on the data side https://web.stanford.edu/~jurafsky/slp3/

                • pandatigox 3 years ago
                  I've had a thought and am curious how people would solve it. Sometimes, if you copy words off a PDF lecture slide, all the words are mashed together (e.g. Hello Foo bar → HelloFoobar). Is this an AI domain, or can it be solved by simple programming?
                  • wodenokoto 3 years ago
                    This is a common problem in Japanese NLP, and while the state of the art uses deep learning, almost everyone uses dynamic programming or conditional random fields together with a dictionary to solve it.

                    There is also research on solving this problem unsupervised, which basically invents new word boundaries for a language (remember that spoken language doesn't have word boundaries; they were invented for writing, and strictly speaking, current spelling isn't the only way to place word boundaries for a given language).

                    • epilys 3 years ago
                      It's called word segmentation. There are libraries for it:

                      https://github.com/grantjenks/python-wordsegment
                      https://github.com/InstantDomain/instant-segment (Rust)

                      I used the latter to process PDF plain-text output quite successfully.

                      • gattilorenz 3 years ago
                        "AI" and "simple programming" are not mutually exclusive :)

                        Look at the Speech and Language Processing book, particularly chapter 3 about language models: https://web.stanford.edu/~jurafsky/slp3/

                        You can implement a language model based on character n-grams to calculate whether a sequence is more likely with or without a space. Of course, you would need a way of estimating the probability of each sequence, which means you need a corpus to train your language model on.
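
                        (A rough sketch of that idea, assuming you already have some plain-text corpus to count character trigrams from; the file name 'corpus.txt', the 4-character window and the add-one smoothing below are just placeholders:)

                            import collections
                            import math

                            # Count character trigrams (spaces included) from a training corpus.
                            with open('corpus.txt') as f:
                                corpus = ' '.join(f.read().split()).lower()
                            tri = collections.Counter(corpus[i:i + 3] for i in range(len(corpus) - 2))
                            bi = collections.Counter(corpus[i:i + 2] for i in range(len(corpus) - 1))

                            def avg_logp(seq):
                                # Average log probability per trigram, with crude add-one smoothing;
                                # averaging keeps the longer with-space variant comparable.
                                logs = [math.log((tri[seq[i:i + 3]] + 1) / (bi[seq[i:i + 2]] + 27))
                                        for i in range(len(seq) - 2)]
                                return sum(logs) / len(logs)

                            def better_with_space(left, right):
                                # Score a small window around the join, with and without a space.
                                window = left[-4:] + right[:4]
                                return avg_logp(left[-4:] + ' ' + right[:4]) > avg_logp(window)

                            print(better_with_space('hello', 'foobar'))  # hopefully True with a decent corpus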

                        • abecedarius 3 years ago
                          Short answer in Python: https://stackoverflow.com/questions/195010/how-can-i-split-m...

                          This version assumes non-dictionary "words" have probability 0. But it's the same basic idea as a fancier answer, and it's quick.

                          • eutectic 3 years ago
                            I would try an n-gram model with dynamic programming.

                            e.g.

                            logp(_, "") = 0

                            logp(word0, text) = max(logp_bigram(word0, word1) + logp(word1, rest) for word1, rest in prefix_words(text))
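
                            (For concreteness, a direct memoized transcription of that recurrence, extended to also return the best split; the tiny dictionary and the constant stand-in for logp_bigram are placeholders:)

                                import functools
                                import math

                                # Placeholders: a tiny dictionary and a fake bigram scorer.
                                WORDS = {'how', 'to', 'write', 'a', 'spelling', 'corrector'}

                                def logp_bigram(word0, word1):
                                    # A real model would look up smoothed bigram counts here.
                                    return -math.log(len(WORDS))

                                def prefix_words(text):
                                    # All ways to peel a dictionary word off the front of `text`.
                                    return [(text[:k], text[k:]) for k in range(1, len(text) + 1)
                                            if text[:k] in WORDS]

                                @functools.lru_cache(maxsize=None)
                                def logp(word0, text):
                                    if text == '':
                                        return 0.0, ()
                                    best = (-math.inf, ())
                                    for word1, rest in prefix_words(text):
                                        score, words = logp(word1, rest)
                                        if score == -math.inf:
                                            continue  # the remainder cannot be segmented
                                        best = max(best, (logp_bigram(word0, word1) + score, (word1,) + words))
                                    return best

                                print(logp('<s>', 'howtowriteaspellingcorrector'))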

                            • ghusbands 3 years ago
                              It's worth noting that different PDF-to-text tools (and different PDF-display tools that support text copying) give different results, and it can be worth trying a few. For the most part, the vector parts of a PDF and their association with the text are sufficient to recover the spacing information.
                              • eutectic 3 years ago
                                I got nerd-sniped and wrote a simple Python solver for this problem. You can find the ngram files at Norvig's site (https://norvig.com/ngrams/).

                                    import collections
                                    import math
                                    import heapq
                                
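                                    # Load Norvig's unigram (count_1w.txt) and bigram (count_2w.txt) counts
                                    # and convert them to log probabilities, reserving a little mass for
                                    # unseen bigrams via a rough geometric-series estimate.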
                                    with open('count_1w.txt', 'r') as f:
                                        unigrams = [l.split() for l in f]
                                    unigram_map = collections.defaultdict(lambda: 0)
                                    for word, count in unigrams:
                                        unigram_map[word] = int(count)
                                    with open('count_2w.txt', 'r') as f:
                                        bigrams = [l.split() for l in f]
                                    bigram_map = collections.defaultdict(dict)
                                    for word0, word1, count in bigrams:
                                        bigram_map[word0][word1] = int(count)
                                    log_p_unseen = collections.defaultdict(lambda: 0.)
                                    for word0, counts in bigram_map.items():
                                        for word1, count in counts.items():
                                            unigram_map[word1] += count
                                        total = sum(counts.values())
                                        #smoothing for unseen words
                                        mn, mx = min(counts.values()), max(counts.values())
                                        if mn == mx:
                                            p_unseen = 0.5
                                        else:
                                            #geometric series approximation
                                            r = (mn / mx) ** (1. / (len(counts) - 1))
                                            n = mn * r / (1. - r)
                                            p_unseen = n / (n + total)
                                        log_p_unseen[word0] = math.log(p_unseen)
                                        c = (1. - p_unseen) / total
                                        bigram_map[word0] = {word1: math.log(c * count) for word1, count in counts.items()}
                                    c = 1. / sum(unigram_map.values())
                                    unigram_map = {word: math.log(c * count) for word, count in unigram_map.items()}
                                    max_len = max(map(len, unigram_map))
                                
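                                    # Dijkstra-style shortest-path search over (start, end) word spans,
                                    # where a span's cost is its negative log bigram probability.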
                                    def optimal_parse(text):
                                        word_spans = {j: [] for j in range(len(text) + 1)}
                                        for i in range(len(text)):
                                            for j in range(i + 1, min(i + max_len, len(text)) + 1):
                                                if text[i:j] in unigram_map:
                                                    word_spans[i].append(j)
                                        min_cost = collections.defaultdict(lambda: float('inf'))
                                        parent = {}
                                        queue = [(0., 0, 0)]
                                        while queue:
                                            cost, i, j = heapq.heappop(queue)
                                            if cost > min_cost[(i, j)]:
                                                continue
                                            if j == len(text):
                                                break
                                            if j == 0:
                                                word0 = '<s>'
                                            else:
                                                word0 = text[i:j]
                                            for k in word_spans[j]:
                                                word1 = text[j:k]
                                                if word1 in bigram_map[word0]:
                                                    word1_cost = -bigram_map[word0][word1]
                                                else:
                                                    #It would technically be more correct to normalize the unigram probability only over unseen words.
                                                    word1_cost = -(log_p_unseen[word0] + unigram_map[word1])
                                                cost1 = cost + word1_cost
                                                if cost1 < min_cost[(j, k)]:
                                                    min_cost[(j, k)] = cost1
                                                    parent[(j, k)] = i
                                                    heapq.heappush(queue, (cost1, j, k))
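                                        # Walk the parent pointers back from the final span to recover the words.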
                                        words = []
                                        while j != 0:
                                            words.append(text[i:j])
                                            i, j = parent.get((i, j)), i
                                        return words[::-1]
                                
                                    print(optimal_parse('ivehadathoughtandamcurioushowpeoplewouldsolveit'))
                                • tester34 3 years ago
                                  Have you tried using an ~~XML~~ PDF parser instead? /s

                                  and then tried to find which sentence, with its spaces removed, matches your sentence

                                  • eigenhombre 3 years ago
                                    One of the more interesting parts of the post, for me, is the list of implementations in other languages, including: one for Clojure written by that language's author, Rich Hickey; an interesting one in R that clocks in at 2 lines (with a longer, more readable version further down in the linked post); and one written in functional Java. The first one in Awk is also interesting.
                                    • graycat 3 years ago
                                      My favorite, long-standard spell checker is Aspell, long part of a TeX distribution.
                                      • the-smug-one 3 years ago
                                        You could use a Hidden Markov Model; here's how to implement one: https://www.cs.sjsu.edu/~stamp/RUA/HMM.pdf

                                        Here's an impl of some kind: https://github.com/crisbal/hmm-spellcheck

                                        • da39a3ee 3 years ago
                                          From a pedagogical point of view, presenting that as a Bayesian model and then using the error model he does is a bit questionable. But as always, his Python style is inspiring.
                                          • blondin 3 years ago
                                            Just wait for old HN to show up and tell new HN that Norvig's famous spelling corrector is not efficient and doesn't teach people how to do it right.