Today I set out to add PDF processing to the Nutch source code. The first step is to extract the text from a PDF document. After a little thought I settled on PDFBox. On inspection, the parse-tika plugin in the Nutch source already ships a PDFBox jar, but it is version 1.1.0, and a lot of PDF documents cannot be processed with it. The latest official release, 1.6.0, is already available, so I decided to swap it in. Because I don't like reading English documentation, I took a few detours along the way.
At first I simply downloaded pdfbox-1.6.0.jar and replaced the old jar with it, and the program threw errors. Out of desperation I read the official documentation carefully. The Dependencies section of the PDFBox website (http://pdfbox.apache.org/) clearly lists the PDFBox components and the libraries they require. PDFBox has three main components: besides pdfbox-1.6.0.jar, there are fontbox-1.6.0.jar and jempbox-1.6.0.jar. It also needs the commons-logging component for logging. Nutch already ships the logging component, as commons-logging-1.0.4.jar and commons-logging-api-1.0.4.jar. So if you use PDFBox in your own application, you need all five jars listed above (the logging component accounts for two of them).
Of course, for convenience the official site also provides an all-in-one jar, pdfbox-app-1.6.0.jar. If you use that jar, you don't need any of the others.
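Since my first attempt failed only because jars were missing, a quick classpath check can save some time. The following is a minimal sketch, not part of Nutch or PDFBox: the class PdfClasspathCheck and its method missingClasses are names I made up, and the one representative class per jar is my assumption about the 1.x package layout. Inside a Nutch plugin the class loader may differ, so treat the result as a hint only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of Nutch or PDFBox.
final class PdfClasspathCheck {

    // One representative class per jar (assumed from the 1.x package layout).
    static final String[] REQUIRED = {
        "org.apache.pdfbox.pdmodel.PDDocument",   // pdfbox
        "org.apache.fontbox.cmap.CMap",           // fontbox
        "org.apache.jempbox.xmp.XMPMetadata",     // jempbox
        "org.apache.commons.logging.LogFactory"   // commons-logging
    };

    /** Returns the class names that cannot be loaded from the current classpath. */
    static List<String> missingClasses(String... classNames) {
        List<String> missing = new ArrayList<String>();
        ClassLoader loader = PdfClasspathCheck.class.getClassLoader();
        for (String name : classNames) {
            try {
                // 'false' means: locate the class but do not run its static initializers.
                Class.forName(name, false, loader);
            } catch (ClassNotFoundException e) {
                missing.add(name);
            } catch (LinkageError e) {
                // The class was found but something it depends on is broken or absent.
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        List<String> missing = missingClasses(REQUIRED);
        System.out.println(missing.isEmpty()
                ? "All PDFBox dependencies found"
                : "Missing: " + missing);
    }
}
```

Running main with only pdfbox-1.6.0.jar on the classpath should list the other components as missing, which is the situation I ran into.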
Once the jars are in place, you can start extracting text. The code is fairly simple, and there are plenty of examples online. For example:
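// Needs org.apache.pdfbox.pdmodel.PDDocument and org.apache.pdfbox.util.PDFTextStripper
PDDocument doc = PDDocument.load("D:/331.pdf");
PDFTextStripper stripper = new PDFTextStripper();
String text = stripper.getText(doc);
// The title is read from the document metadata; PDFTextStripper has no getTitle method
String title = doc.getDocumentInformation().getTitle();
doc.close();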
This reads a PDF file from the local disk. If the file comes from the network, first obtain an InputStream for it (call it stream); the code is then:
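// Needs org.apache.pdfbox.pdfparser.PDFParser in addition to the classes above
PDFParser parser = new PDFParser(stream);
parser.parse();
// Take the parsed document directly; creating an empty PDDocument first is unnecessary
PDDocument doc = parser.getPDDocument();
PDFTextStripper stripper = new PDFTextStripper();
String text = stripper.getText(doc);
// The title is read from the document metadata; PDFTextStripper has no getTitle method
String title = doc.getDocumentInformation().getTitle();
doc.close();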
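For the network case, here is one possible way to obtain the stream and to check that it really looks like a PDF before handing it to the parser. This is only a sketch using the standard library: PdfStreams, open and looksLikePdf are hypothetical names of mine, and the check is deliberately strict (it expects "%PDF-" as the very first bytes, so a file with junk before the header would be rejected).

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;

// Hypothetical helper for illustration; not part of Nutch or PDFBox.
final class PdfStreams {

    /** Opens a buffered stream for a URL supplied by the caller. */
    static InputStream open(String url) throws IOException {
        return new BufferedInputStream(URI.create(url).toURL().openStream());
    }

    /**
     * Peeks at the first bytes and reports whether they are the "%PDF-" marker,
     * then rewinds so that a parser can still read the stream from the start.
     */
    static boolean looksLikePdf(InputStream in) throws IOException {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        byte[] magic = {'%', 'P', 'D', 'F', '-'};
        in.mark(magic.length);
        try {
            for (byte expected : magic) {
                if (in.read() != expected) {
                    return false;
                }
            }
            return true;
        } finally {
            // Rewind to the marked position whether or not the header matched.
            in.reset();
        }
    }
}
```

A stream returned by open supports mark/reset because it is a BufferedInputStream, so it can be passed to looksLikePdf first and then used as the stream for the parser.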
A few caveats:
(1) PDFBox cannot extract text from PDF files in certain formats, but it handles most of them.
(2) PDFBox also tries to expose further information, such as the title and abstract, but don't expect too much from it. Only fairly standard PDF documents (papers and the like) yield usable values. For the rest, the result is either null or wrong.
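Given these two caveats, the calling code probably should not trust the results blindly. The sketch below shows one way to guard against both cases. PdfFallbacks, textOrEmpty and chooseTitle are hypothetical names of mine, and using the file name as the fallback title is just an assumption about what would suit an index. It cannot detect a title that is present but wrong.

```java
import java.util.concurrent.Callable;

// Hypothetical helper for illustration; not part of Nutch or PDFBox.
final class PdfFallbacks {

    /** Runs the extraction and returns "" if it throws or yields null. */
    static String textOrEmpty(Callable<String> extractor) {
        try {
            String text = extractor.call();
            return text == null ? "" : text;
        } catch (Exception e) {
            // Caveat (1): some PDFs cannot be processed; treat them as having no text.
            return "";
        }
    }

    /** Returns the trimmed title, or the fallback when the title is null or blank. */
    static String chooseTitle(String extractedTitle, String fallback) {
        if (extractedTitle == null) {
            return fallback;
        }
        String trimmed = extractedTitle.trim();
        // Caveat (2): non-standard PDFs often carry no usable title.
        return trimmed.isEmpty() ? fallback : trimmed;
    }
}
```

In use, the extraction code shown earlier would go inside the Callable passed to textOrEmpty, and chooseTitle would receive the title read from the PDF together with something like the file name.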
PDFBox has many other features, such as decoding and the like. If you need them, go and study the API ...