not demonic, creative

A bunch of the regular commenters here have noted the weird word combinations thrown up by the Recaptcha anti-spam filter. I remember on one occasion it made me type ‘vodka hag’. The general consensus was that Recaptcha was possessed by Satan. The truth, however, could not be more different. It turns out that every time you copy the Recaptcha phrases you are helping digitise out-of-copyright texts.

Walrus magazine explains:

Now a growing number of websites, from e-commerce (Ticketmaster) to social networking (Facebook) to blogging (WordPress), have implemented the precocious professor’s new tool, dubbed recaptcha. If you’ve visited those sites, your squiggly-letter-reading ability has been harnessed for a massive project that aims to scan and make freely available every out-of-copyright book in the world, by deciphering words from old texts that have stumped scanning software. [snip]

Once the text is scanned, the file is sent to a server in California, where it’s run through optical character recognition software to produce a digital full-text version. For the newer books, OCR is about 90 percent accurate. But that success rate drops to as low as 60 percent for older texts, which often contain fonts that are blurry and less uniform. These troublesome scans are sent on to the reCAPTCHA servers at Carnegie Mellon University in Pittsburgh. [snip]

The program distorts a known word so that it will have a way to check that the user is human, and then pairs it up with a word OCR has failed to decipher. Each mystery word is served up in multiple reCAPTCHAs, until a consensus about the correct answer emerges. Sometimes a single user confirms the computer’s best guess, but the average is about four users per word. The system is now correcting over 10 million words a day, with 99.1 percent accuracy, von Ahn says.
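For the programmers in the audience, the pairing-and-consensus idea in that last paragraph can be roughly sketched in Python. This is only an illustration, not reCAPTCHA's actual code: the names `MysteryWord` and `submit`, the `votes_needed` threshold, and the head start given to the OCR guess are all assumptions of mine, chosen so that a single user can confirm the computer's best guess, as the article describes.

```python
from collections import Counter


class MysteryWord:
    """A word OCR could not read, awaiting human consensus (hypothetical)."""

    def __init__(self, ocr_guess=None, votes_needed=3):
        self.votes = Counter()
        self.votes_needed = votes_needed  # assumed threshold, not a real figure
        # Assumption: the OCR guess gets a head start, so that one
        # agreeing human is enough to confirm it.
        if ocr_guess:
            self.votes[ocr_guess.strip().lower()] += votes_needed - 1

    def record(self, answer):
        """Count one human transcription of the mystery word."""
        self.votes[answer.strip().lower()] += 1

    def consensus(self):
        """Return the agreed word, or None if there are too few votes so far."""
        if not self.votes:
            return None
        word, count = self.votes.most_common(1)[0]
        return word if count >= self.votes_needed else None


def submit(control_word, mystery, typed_control, typed_mystery):
    """Check the known word; only if it matches, count the mystery answer."""
    if typed_control.strip().lower() != control_word.lower():
        return False  # failed the human check, so the answer is discarded
    mystery.record(typed_mystery)
    return True
```

The known word does the anti-spam work, and the unknown word rides along for free: an answer to the mystery word is only trusted if the same person got the control word right.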

Coolest. Anti. Spam. Device. Ever.

Jeff Sparrow

Jeff Sparrow is a Walkley Award-winning writer, broadcaster and former editor of Overland.


Contribute to the conversation

  1. When I posted something about this on the OL facebook site, the reaction was quite negative, which initially surprised me cos my first reaction was simply about what a great idea it represented. But since then I’ve been thinking. I mean, should they have to tell us about this? OK, they’re digitalising books, which is a good thing, but what if they were doing something evil — or simply making money out of us? Is there something exploitative about the whole idea?

    (I just contributed the words ‘10 compressed’ to the world’s body of knowledge.)
