[java] convert .pdf >> .html and .doc >> .html

Marsh Posté on 27-04-2006 at 22:58:38

hello :hello:

I'm looking for a way to convert various formats (mainly .pdf, but also .chm and .doc) to html if possible, so as to keep the layout (headings above all), or to txt at worst. Ideally a library, otherwise at least a command-line tool, so that I can automate it.
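For the command-line route, here is a minimal sketch of how the automation could be driven from Java. It assumes a hypothetical converter executable called `pdf2html-tool` that takes an input path and an output path; no such tool is named in this thread, so both the name and the argument order are placeholders.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.List;

final class BatchConvert {
    // Hypothetical executable name: replace with whatever converter you end up using.
    static final String TOOL = "pdf2html-tool";

    // manual.pdf -> manual.html, written next to the source file.
    static Path htmlTargetFor(Path source) {
        String name = source.getFileName().toString();
        int dot = name.lastIndexOf('.');
        String base = dot > 0 ? name.substring(0, dot) : name;
        return source.resolveSibling(base + ".html");
    }

    // Assumed argument order: <tool> <input> <output>.
    static List<String> commandFor(Path source) {
        return Arrays.asList(TOOL, source.toString(), htmlTargetFor(source).toString());
    }

    // Runs the converter and returns its exit code (0 usually means success).
    static int convert(Path source) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(commandFor(source)).inheritIO().start();
        return process.waitFor();
    }
}
```

Calling `BatchConvert.convert(path)` for every file in a directory would then be enough to batch a whole folder, whatever the real tool turns out to be.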

PDF
no topic here covers it (searched for "pdf" in the java section)
http://forum.hardware.fr/forum1.ph [...] deration=0

some do, but in the other direction, i.e. xxx >> .pdf, like jpedale.
iText manages it, but loses the formatting. excerpt:

Quote:

You can't 'parse' an existing PDF file using iText, you can only 'read' it page per page.
What does this mean?
The pdf format is just a canvas where text and graphics are placed without any structure information. As such there aren't any 'iText-objects' in a PDF file. In each page there will probably be a number of 'Strings', but you can't reconstruct a phrase or a paragraph using these strings. There are probably a number of lines drawn, but you can't retrieve a Table-object based on these lines. In short: parsing the content of a PDF-file is NOT POSSIBLE with iText
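To make the quoted point concrete, here is a small sketch in plain Java. It is not iText's API: the `Chunk` class and `toLines` method are made up for illustration, and they assume some reader has already handed over each string together with its position on the page. The best you can do with that is rebuild visual lines by position.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class CanvasText {
    /** One string as the page places it: a position and some characters, nothing more. */
    static final class Chunk {
        final double x;
        final double y;
        final String text;

        Chunk(double x, double y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /** Groups chunks whose y is within yTolerance into one line, top of page first. */
    static List<String> toLines(List<Chunk> chunks, double yTolerance) {
        List<Chunk> byHeight = new ArrayList<>(chunks);
        // PDF y coordinates grow upward, so the highest y is the top of the page.
        byHeight.sort(Comparator.comparingDouble((Chunk c) -> c.y).reversed());

        List<String> lines = new ArrayList<>();
        List<Chunk> current = new ArrayList<>();
        for (Chunk c : byHeight) {
            if (!current.isEmpty() && Math.abs(current.get(0).y - c.y) > yTolerance) {
                lines.add(join(current));
                current = new ArrayList<>();
            }
            current.add(c);
        }
        if (!current.isEmpty()) {
            lines.add(join(current));
        }
        return lines;
    }

    /** Orders one line left to right and joins its chunks with single spaces. */
    private static String join(List<Chunk> line) {
        line.sort(Comparator.comparingDouble((Chunk c) -> c.x));
        StringBuilder sb = new StringBuilder();
        for (Chunk c : line) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(c.text);
        }
        return sb.toString();
    }
}
```

Note what the result is missing: a list of strings, with nothing saying which line was a title, a paragraph or a table cell. That is exactly the structure the excerpt says cannot be recovered.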

now, if there are tools for parsing pdf directly in java, I'll take those too [:aras qui rit] (it exists in c#, so why not, eh)

doc to pdf, and chm to html (even though that one should be simpler, since it's compressed html): I'll take those too

thanks :jap:

Message edited by deleted profile on 10-09-2006 at 14:07:48

Marsh Posté on 10-08-2006 at 11:55:45

try checking whether FOP does it!!

Marsh Posté on 10-09-2006 at 14:07:27

titeade wrote:

try checking whether FOP does it!!

:jap:
apparently FOP goes the other way, like quite a few of the tools I found
http://xmlgraphics.apache.org/fop/

Quote:

The goals of the Apache FOP project are to deliver an XSL-FO to PDF formatter that is compliant to at least the Basic conformance level described in the W3C Recommendation from 15 October 2001, and that complies with the 11 March 1999 Portable Document Format Specification (Version 1.3) from Adobe Systems.

if anyone has other ideas, I'm still interested

here are a few leads I've found ...

in Perl and really well done :love: :
http://search.cpan.org/~antro/PDF-111/PDF/Parse.pm
it returns an object that contains, among other things:

Quote:

a title ==> GetInfo ("Title" )
a subject ==> GetInfo ("Subject" )
an author ==> GetInfo("Author" )
a creation date ==> GetInfo("CreationDate" )
a creator ==> GetInfo("Creator" )
a producer ==> GetInfo("Producer" )
a modification date ==> GetInfo("ModDate" )
some keywords ==> GetInfo("Keywords" )

plus the number of pages ...

I do have the beginning of an answer in java all the same (in case others are asking or will ask themselves the same question ...)
which can be found here
http://www.pdf-tools.com/
code-wise, it gives something fairly clean where you just extract the content and read it. I don't know whether iText was that simple. in any case, it still loses the formatting
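Since the extracted content comes out as plain text anyway, the fallback could look like the sketch below: wrap that text in minimal html, one `<p>` per blank-line-separated block. It assumes the text has already been pulled out by whichever library you use; the `TextToHtml` class is made up for illustration and does not bring back titles or any other formatting.

```java
final class TextToHtml {
    /** Escapes the characters that would otherwise be read as markup. */
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Wraps plain text in a bare html page, one paragraph per blank-line-separated block. */
    static String toHtml(String title, String text) {
        StringBuilder html = new StringBuilder();
        html.append("<html><head><title>").append(escape(title)).append("</title></head><body>\n");
        // A paragraph break is a line break, optional whitespace, then another line break.
        for (String block : text.split("\\r?\\n\\s*\\r?\\n")) {
            // Inside a paragraph, hard line breaks from the extraction become single spaces.
            String paragraph = block.trim().replaceAll("\\s*\\r?\\n\\s*", " ");
            if (!paragraph.isEmpty()) {
                html.append("<p>").append(escape(paragraph)).append("</p>\n");
            }
        }
        html.append("</body></html>\n");
        return html.toString();
    }
}
```

For example, `TextToHtml.toHtml("my doc", extractedText)` gives a page that a browser will display, but every block is a plain paragraph: the headings are still lost, as noted above.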

Message edited by deleted profile on 10-09-2006 at 14:51:57
