So I spent $22 on an ebook for school.
It has this crappy DRM that only lets me view the pdf on one computer using only “Adobe Digital Editions”.
As if that weren’t bad enough, only a small subset of the text is OCR’d, so most of it isn’t even searchable!
Now I’m pissed, but wait, what do you say? These files are just RSA encrypted, and I have the key?
Some cool guy named **i♥cabbages** has released code to extract your key and then decrypt the file to a good ol’ plain pdf. If you want to reproduce my steps, you will need to use the PDF decrypter, unless you have epubs.
So I use the tool and get a pdf. Now I can use one of the most awesome tools in the world: ImageMagick.
ImageMagick can whip this pdf into shape. The first thing I’m going to do is convert each page into a tiff:
$ convert -density 200 input.pdf[1-124] -depth 8 -monochrome %05d.tif
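One thing to watch with that page range: ImageMagick counts pages in the `[...]` selector from zero, and by default it numbers the `%05d` output files from zero too. Here is a minimal sketch of a workaround. The `page_selector` helper is hypothetical (not part of the original steps), and adding `-scene 1` to start the file numbering at 00001 is an assumption about how you want the output named.

```shell
#!/bin/sh
# Hypothetical helper (not from the original post): convert 1-based page
# numbers into ImageMagick's 0-based "[first-last]" selector.
page_selector() {
    printf '[%d-%d]' "$(($1 - 1))" "$(($2 - 1))"
}

selector=$(page_selector 1 124)
echo "input.pdf$selector"   # prints: input.pdf[0-123]

# Only attempt the conversion when ImageMagick and the PDF are both present.
if command -v convert >/dev/null 2>&1 && [ -f input.pdf ]; then
    # -scene 1 is assumed here so the output starts at 00001.tif, not 00000.tif
    convert -density 200 "input.pdf$selector" -depth 8 -monochrome -scene 1 %05d.tif
fi
```

With this, page 1 of the book should end up in 00001.tif, which makes it easier to match OCR output back to a page later.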
Then I’m going to run tesseract-ocr on every tiff to get the text:
$ for f in *.tif; do tesseract "$f" "tesseract-${f%.tif}" -l eng; done
Now all I have to do is cat all the text together:
$ cat tesseract-*.txt > output.txt
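A plain cat throws away the page boundaries, which can be annoying when a search hit needs to be traced back to a page. A possible variation, as a sketch only: the `combine_pages` function and the `=== page N ===` marker format are made up for illustration, and it assumes the OCR output files are named tesseract-NNNNN.txt with zero-padded numbers so the shell glob sorts them in page order.

```shell
#!/bin/sh
# Hypothetical helper: concatenate per-page OCR text into one file,
# writing a marker line before each page.
# $1 = directory holding tesseract-NNNNN.txt files, $2 = output file
combine_pages() {
    dir=$1
    out=$2
    : > "$out"                          # start with an empty output file
    for f in "$dir"/tesseract-*.txt; do
        [ -e "$f" ] || continue         # skip if the glob matched nothing
        page=${f##*/tesseract-}         # strip the directory and prefix
        page=${page%.txt}               # strip the extension
        printf '=== page %s ===\n' "$page" >> "$out"
        cat "$f" >> "$out"
    done
}
```

Running `combine_pages . output.txt` in the directory with the OCR output would produce the same text as the cat above, plus one marker line per page.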
Now I have a fully searchable, plain text file. Exactly what I wanted in the first place!
For the REAL magic, I use agrep to search the text for strings similar to the example test questions we were given, which helps “highlight” the answers. More technical details on that magic on my wiki.
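For a rough idea of what the agrep step might look like, here is a sketch. It is not the exact method from the wiki: the `search_fuzzy` wrapper is hypothetical, and the error count is just a guess at a sensible starting point. It relies on agrep's `-#` option (a digit giving the maximum number of errors allowed in a match) and `-i` for case-insensitive matching.

```shell
#!/bin/sh
# Hypothetical wrapper around agrep for approximate matching.
# $1 = maximum number of errors allowed (a single digit)
# $2 = phrase to look for, $3 = text file to search
search_fuzzy() {
    if ! command -v agrep >/dev/null 2>&1; then
        echo "agrep is not installed" >&2
        return 1
    fi
    agrep -i "-$1" "$2" "$3"
}

# Example: allow up to 2 errors, which tolerates small OCR mistakes.
if [ -f output.txt ]; then
    search_fuzzy 2 'adobe digital editions' output.txt
fi
```

Allowing a couple of errors is what makes this useful on OCR'd text, since a misread letter or two would make an exact grep miss the line entirely.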