A bunch of the regular commentators here have noted the weird word combinations thrown up by the Recaptcha anti-spam filter. I remember on one occasion it made me type ‘vodka hag’. The general consensus was that Recaptcha was possessed by Satan. The truth, however, could not be more different. It turns out that every time you copy the Recaptcha phrases you are helping to digitise out-of-copyright texts.
Now a growing number of websites, from e-commerce (Ticketmaster) to social networking (Facebook) to blogging (WordPress), have implemented the precocious professor’s new tool, dubbed recaptcha. If you’ve visited those sites, your squiggly-letter-reading ability has been harnessed for a massive project that aims to scan and make freely available every out-of-copyright book in the world, by deciphering words from old texts that have stumped scanning software. [snip]
Once the text is scanned, the file is sent to a server in California, where it’s run through optical character recognition software to produce a digital full-text version. For the newer books, OCR is about 90 percent accurate. But that success rate drops to as low as 60 percent for older texts, which often contain fonts that are blurry and less uniform. These troublesome scans are sent on to the reCAPTCHA servers at Carnegie Mellon University in Pittsburgh. [snip]
The program distorts a known word so that it will have a way to check that the user is human, and then pairs it up with a word OCR has failed to decipher. Each mystery word is served up in multiple reCAPTCHAs, until a consensus about the correct answer emerges. Sometimes a single user confirms the computer’s best guess, but the average is about four users per word. The system is now correcting over 10 million words a day, with 99.1 percent accuracy, von Ahn says.
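For the curious, the pairing-and-consensus idea described above can be sketched in a few lines of Python. This is only a toy illustration, not the actual reCAPTCHA code: the class name `MysteryWordPool`, the `submit` method and the `votes_needed` threshold are all made up for the example, and the article does not say exactly how agreement is measured.

```python
from collections import Counter, defaultdict


class MysteryWordPool:
    """Toy model: a known control word gates each guess at an unknown word."""

    def __init__(self, votes_needed=3):
        # votes_needed is an assumed threshold, not a figure from the article.
        self.votes_needed = votes_needed
        self.votes = defaultdict(Counter)  # mystery_id -> tally of guesses
        self.resolved = {}                 # mystery_id -> agreed transcription

    def submit(self, control_answer, control_truth, mystery_id, mystery_answer):
        """Return True if the user typed the known word correctly."""
        if control_answer.strip().lower() != control_truth.lower():
            # Failed the human check, so the mystery guess is discarded.
            return False
        if mystery_id not in self.resolved:
            guess = mystery_answer.strip().lower()
            self.votes[mystery_id][guess] += 1
            # Once enough verified users agree, accept the transcription.
            if self.votes[mystery_id][guess] >= self.votes_needed:
                self.resolved[mystery_id] = guess
        return True


if __name__ == "__main__":
    pool = MysteryWordPool(votes_needed=2)
    pool.submit("vodka", "vodka", "scan-17", "hag")
    pool.submit("vodka", "vodka", "scan-17", "hag")
    print(pool.resolved)
```

Only answers from users who get the known word right count towards the mystery word, which is what lets the same puzzle both keep out bots and collect transcriptions.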
Coolest. Anti. Spam. Device. Ever.