Re: [NTLK] PDF/CHM reader, Bluetooth and other questions

From: Scot McSweeney-Roberts (newton_at_mcsweeney-roberts.co.uk)
Date: Mon Oct 25 2004 - 10:09:39 PDT


Alexey Agapov wrote:

>1) I want to learn (or teach Newton) to read books in Microsoft's CHM
>and PDF formats. I'm yet to try PDFConv but it is claimed to convert
>PDF to a series of images and, hence, doesn't support bookmarks,
>hyperlinks and search. Different PDF to text converters are far from
>perfect also. I'm thinking of writing a PDF -> NewtonBook converter
>which ideally would preserve as much formatting as possible. I think it
>can be developed based on ghostscript interpreter. The same thing can
>be done for CHM. I wonder, why did nobody suggest that before? What do
>you think, is that possible? Or maybe nobody needs this besides me?

You might find the following useful:

Multivalent Document Tools http://multivalent.sourceforge.net/Tools/ -
one of the useful things it can do is:

"Extract Unicode text from any supported document type, including PDF,
HTML, and DVI. Since the extractors are based on the same engine that is
used for rendering, they are not confused by markup or spacing. HTML
pastes together words across style markup, and PDF tries hard to paste
together words from fragments that may have been positioned with
intervening kerning spacing. Note that sometimes text can be drawn not
with fonts but with vector shapes or in an image; to extract this, run
OCR software."

and

CHM Decompiler (and assorted other CHM goodies)
http://bonedaddy.net/pabs3/hhm/
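
As a rough illustration of the fragment pasting mentioned in the
Multivalent description above, here is a minimal Python sketch. It is
not Multivalent's actual code: the (x_start, x_end, text) fragment
layout, the join_fragments name and the gap_threshold value are all
assumptions made up for the example. The idea is simply that fragments
separated by a small kerning gap belong to one word, while a larger gap
means a real space.

```python
from typing import List, Optional, Tuple

# Hypothetical fragment layout (an assumption for this sketch):
# (x_start, x_end, text) for one piece of text on a single line.
Fragment = Tuple[float, float, str]


def join_fragments(fragments: List[Fragment], gap_threshold: float = 1.5) -> str:
    """Join the fragments of one line into a string.

    A space is inserted only when the horizontal gap between two
    neighbouring fragments is larger than gap_threshold, so pieces
    split by small kerning adjustments are pasted back into one word.
    """
    pieces: List[str] = []
    prev_end: Optional[float] = None
    # Process fragments left to right, whatever order they arrived in.
    for x_start, x_end, text in sorted(fragments, key=lambda f: f[0]):
        if prev_end is not None and x_start - prev_end > gap_threshold:
            pieces.append(" ")
        pieces.append(text)
        prev_end = x_end
    return "".join(pieces)


if __name__ == "__main__":
    line = [(0.0, 10.0, "New"), (10.3, 20.0, "ton"), (24.0, 40.0, "book")]
    print(join_fragments(line))
```

A real converter would probably need to scale the threshold with the
font size and handle multiple lines, but the same gap test is a
reasonable starting point for a PDF -> NewtonBook tool.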

cheers

Scot

-- 
This is the NewtonTalk list - http://www.newtontalk.net/ for all inquiries
Official Newton FAQ: http://www.chuma.org/newton/faq/
WikiWikiNewt for all kinds of articles: http://tools.unna.org/wikiwikinewt/

