1. Processing PDF documents with PDFBox
PDF stands for Portable Document Format, an electronic file format developed by Adobe. The format is independent of the operating system, so the same file can be used on Windows, Unix, or Mac OS.
A PDF file encapsulates text, fonts, formatting, colors, and device- and resolution-independent graphics and images in a single file. To extract the text from such a file, it has to be parsed according to this format. Fortunately, there are already many tools that do this for us.
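To make "parsed according to this format" slightly more concrete: a PDF file begins with a header of the form %PDF- followed by a version number. The sketch below only checks for that marker. PdfHeader and looksLikePdf are hypothetical names and not part of PDFBox, and real text extraction needs a full parser.

```java
import java.nio.charset.StandardCharsets;

final class PdfHeader {
    /** Returns true if the given leading bytes start with the "%PDF-" marker. */
    static boolean looksLikePdf(byte[] head) {
        byte[] magic = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        if (head == null || head.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (head[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A caller would read the first few bytes of the file and pass them to looksLikePdf before handing the file to a parser.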
2. Downloading PDFBox
The most commonly used PDF text extraction tool is PDFBox. Visit http://sourceforge.net/projects/pdfbox/ to reach the download page, where readers can download the latest version. This article uses PDFBox-0.7.3. PDFBox is an open source Java PDF library that gives you access to all the information in a PDF file. The following example demonstrates how to use the API that PDFBox provides to extract text from a PDF file.
3. Configuration in Eclipse
The following steps create a project in Eclipse and set it up as a tool for parsing PDF files.
(1) Create an ordinary Java project named ch7 in the Eclipse workspace.
(2) Unzip the downloaded PDFBox-0.7.3.zip.
(3) Go into the external directory, which contains all the external packages that PDFBox uses. Copy the following JAR files to the lib directory of the ch7 project (if the lib directory does not exist yet, create it).
• bcmail-jdk14-132.jar
• bcprov-jdk14-132.jar
• checkstyle-all-4.2.jar
• FontBox-0.1.0-dev.jar
• lucene-core-2.0.0.jar
Then, from the PDFBox lib directory, copy PDFBox-0.7.3.jar to the lib directory of the project.
(4) Right-click the project and choose "Build Path -> Configure Build Path -> Add JARs" from the shortcut menu, then add all the JAR files in the project's lib directory to the project's build path.
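After step (4), one way to confirm that the JAR files really are on the build path is to try loading one class from each library. The following is a minimal sketch. ClasspathCheck and isOnClasspath are hypothetical names, and the class names listed in main are taken from the imports used in the listings in this article.

```java
final class ClasspathCheck {
    /** Returns true if the named class can be found by this class's class loader. */
    static boolean isOnClasspath(String className) {
        try {
            // 'false' means: locate the class but do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String[] required = {
            "org.pdfbox.pdmodel.PDDocument",
            "org.pdfbox.util.PDFTextStripper",
            "org.apache.lucene.index.IndexWriter"
        };
        for (String name : required) {
            System.out.println(name + ": " + (isOnClasspath(name) ? "found" : "MISSING"));
        }
    }
}
```

If any line reports MISSING, the corresponding JAR was probably not added in the "Add JARs" dialog.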
4. Parsing PDF content with PDFBox
In the Eclipse project you just created, create a ch7.pdfbox package and, in it, a PdfboxTest class with a getText method for extracting the text of a PDF. A minimal standalone version of the extraction, written here as a class named PdfParser, is as follows.
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileWriter;

import org.pdfbox.pdfparser.PDFParser;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;

public class PdfParser {
    /**
     * Extracts the text of a PDF file, prints it, and writes it to a text file.
     *
     * @param args not used
     */
    public static void main(String[] args) throws Exception {
        FileInputStream fis = new FileInputStream("F:\\task\\lerman-atem2001.pdf");
        BufferedWriter writer = new BufferedWriter(new FileWriter("F:\\task\\pdf_change.txt"));
        PDDocument document = null;
        try {
            // Parse the PDF stream into an in-memory document
            PDFParser p = new PDFParser(fis);
            p.parse();
            document = p.getPDDocument();
            // Extract the text of all pages
            PDFTextStripper ts = new PDFTextStripper();
            String s = ts.getText(document);
            writer.write(s);
            System.out.println(s);
        } finally {
            // Release the document and both streams even if extraction fails
            if (document != null) {
                document.close();
            }
            fis.close();
            writer.close();
        }
    }
}
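Note that FileWriter in the listing above writes with the platform's default character encoding. If the extracted text contains non-ASCII characters, it may be safer to choose the encoding explicitly. A minimal sketch using only the standard library follows. TextOutput and writeUtf8 are hypothetical names.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

final class TextOutput {
    /** Writes the text to the stream as UTF-8 and then closes the stream. */
    static void writeUtf8(OutputStream out, String text) {
        try (Writer w = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            w.write(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With such a helper, the writer in the listing could be replaced by a call like TextOutput.writeUtf8(new FileOutputStream("F:\\task\\pdf_change.txt"), s).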
Here is an example I wrote myself, following the code in the book.
package TestPDF.pdfbox;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.URL;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocCollector;
import org.apache.lucene.search.TopDocs;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.searchengine.lucene.LucenePDFDocument;
import org.pdfbox.util.PDFTextStripper;

public class Test {
    public void getText(String file) throws Exception {
        // Whether to sort the text by position
        boolean sort = false;
        // PDF file name
        String pdfFile = file;
        // Output text file name
        String textFile = null;
        // Encoding of the output file
        String encoding = "UTF-8";
        // First page to extract
        int startPage = 1;
        // Last page to extract
        int endPage = Integer.MAX_VALUE;
        // Writer for the output text file
        Writer output = null;
        // The PDF document held in memory
        PDDocument document = null;
        try {
            try {
                // First try to load the file as a URL; if that throws, load it from the local file system
                URL url = new URL(pdfFile);
                document = PDDocument.load(url);
                String fileName = url.getFile();
                if (fileName.length() > 4) {
                    // Name the new txt file after the original pdf
                    File outputFile = new File(fileName.substring(0, fileName.length() - 4) + ".txt");
                    textFile = outputFile.getName();
                }
            } catch (Exception e) {
                // Loading as a URL failed, so load from the local file system
                document = PDDocument.load(pdfFile);
                if (pdfFile.length() > 4) {
                    textFile = pdfFile.substring(0, pdfFile.length() - 4) + ".txt";
                }
            }
            // Very short names leave textFile unset; fall back so FileOutputStream never receives null
            if (textFile == null) {
                textFile = pdfFile + ".txt";
            }
            // Open the output stream that writes to textFile
            output = new OutputStreamWriter(new FileOutputStream(textFile), encoding);
            // PDFTextStripper extracts the text
            PDFTextStripper stripper = new PDFTextStripper();
            // Set whether to sort by position
            stripper.setSortByPosition(sort);
            // Set the start page
            stripper.setStartPage(startPage);
            // Set the end page
            stripper.setEndPage(endPage);
            // Call writeText of PDFTextStripper to extract the text and write it out
            stripper.writeText(document, output);
        } finally {
            if (output != null) {
                output.close();
            }
            if (document != null) {
                document.close();
            }
        }
    }
    /**
     * Test Lucene with PDFBox.
     *
     * @throws IOException
     */
    public void LuceneTest() throws IOException {
        String path = "D:\\index";
        String pdfpath = "D:\\index\\Lucene.Net basic usage.pdf";
        IndexWriter writer = new IndexWriter(path, new StandardAnalyzer(), true);
        // writer.setMaxFieldLength(10240);
        // LucenePDFDocument returns a Lucene Document built from the PDF
        Document d = LucenePDFDocument.getDocument(new File(pdfpath));
        // System.out.println(d);
        // Write the document to the index
        writer.addDocument(d);
        writer.close();
        // Open an IndexSearcher on the index created under D:\index
        IndexSearcher searcher = new IndexSearcher(path);
        // Build the query from keywords in the "contents" field of the index
        Term t = new Term("contents", "excellent");
        Term m = new Term("contents", "of");
        PhraseQuery q = new PhraseQuery();
        q.add(t);
        q.add(m);
        // Query q = new TermQuery(t);
        TopDocCollector co = new TopDocCollector(10);
        searcher.search(q, co);
        Document document;
        TopDocs docs = co.topDocs();
        ScoreDoc[] doc = docs.scoreDocs;
        // System.out.println(doc.length);
        for (int i = 0; i < doc.length; i++) {
            System.out.println("Document Number:" + doc[i].doc);
            // document = searcher.doc(doc[i].doc);
        }
        // Release the index files held by the searcher
        searcher.close();
    }
    /**
     * @param args not used
     */
    public static void main(String[] args) {
        Test test = new Test();
        try {
            // test.getText("D:\\index\\Lucene.Net basic usage.pdf");
            test.LuceneTest();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
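The getText method derives the output name by cutting the last four characters off the PDF path, and it decides between a URL and a local file by catching the exception thrown by new URL(...). The sketch below isolates those two decisions in small helpers so they can be tried without a PDF at hand. PdfPaths, isUrl and toTextFileName are hypothetical names, and appending .txt when the name does not end in .pdf is an assumption of this sketch, not something the original code does.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Locale;

final class PdfPaths {
    /** Returns true if the string parses as a URL with a protocol the JVM knows. */
    static boolean isUrl(String location) {
        try {
            new URL(location);
            return true;
        } catch (MalformedURLException e) {
            // No protocol at all, or an unknown one such as the "D" in "D:\index\a.pdf"
            return false;
        }
    }

    /** Replaces a trailing ".pdf" (any letter case) with ".txt"; otherwise appends ".txt". */
    static String toTextFileName(String pdfPath) {
        boolean hasPdfSuffix = pdfPath.toLowerCase(Locale.ROOT).endsWith(".pdf");
        if (hasPdfSuffix && pdfPath.length() > 4) {
            return pdfPath.substring(0, pdfPath.length() - 4) + ".txt";
        }
        return pdfPath + ".txt";
    }
}
```

Checking the suffix explicitly avoids silently truncating names that do not end in a four-character extension.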