  Using PDFBox to process PDF documents
  Add Date : 2018-11-21
  1. Processing PDF documents with PDFBox

PDF stands for Portable Document Format, an electronic file format developed by Adobe. The format is independent of the operating system platform and can be used on Windows, Unix, and Mac OS alike.

A PDF file encapsulates text, fonts, formatting, colors, and device- and resolution-independent graphics and images in a single file. To extract the text from such a file, the file has to be parsed according to the PDF format. Fortunately, many tools already exist to do this for us.
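
As a small illustration of what "parsing according to the format" starts with, here is a minimal sketch that checks whether a byte sequence begins with the "%PDF-" marker that opens a PDF file. It uses only the Java standard library; the class name PdfHeaderCheck and the method looksLikePdf are hypothetical names chosen for this example and are not part of PDFBox.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of PDFBox.
class PdfHeaderCheck {
    // A PDF file starts with the ASCII marker "%PDF-" followed by a version such as 1.4.
    private static final byte[] MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the given leading bytes begin with the "%PDF-" marker. */
    static boolean looksLikePdf(byte[] head) {
        if (head == null || head.length < MAGIC.length) {
            return false;
        }
        for (int i = 0; i < MAGIC.length; i++) {
            if (head[i] != MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] sample = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(looksLikePdf(sample)); // prints true
    }
}
```

A check like this only tells you the file claims to be a PDF. Getting at the text itself still requires a real parser such as PDFBox.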

2. Downloading PDFBox

One of the most commonly used PDF text extraction tools is PDFBox. Visit http://sourceforge.net/projects/pdfbox/ to reach the download page, where readers can download the latest version. This article uses PDFBox-0.7.3. PDFBox is an open source Java PDF library that gives you access to all the information in a PDF file. The example below demonstrates how to use the API that PDFBox provides to extract text from a PDF file.

3. Configuration in Eclipse

The following steps create a project in Eclipse and set it up as a tool for parsing PDF files.

(1) Create an ordinary Java project named ch7 in the Eclipse workspace.

(2) Unzip the downloaded PDFBox-0.7.3.zip.

(3) Go into the external directory, which contains all the external packages that PDFBox uses. Copy the following JAR files to the lib directory of the ch7 project (create the lib directory if it does not exist yet).

- bcmail-jdk14-132.jar
- bcprov-jdk14-132.jar
- checkstyle-all-4.2.jar
- FontBox-0.1.0-dev.jar
- lucene-core-2.0.0.jar

Then copy PDFBox-0.7.3.jar from the PDFBox lib directory to the project's lib directory.

(4) Right-click the project, choose "Build Path -> Configure Build Path -> Add JARs" from the context menu, and add all the JAR files in the project's lib directory to the project's build path.
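
Once the build path is set up, a quick way to confirm that the JAR files are actually visible is to try loading a few of the classes used later in this article. The following is a minimal sketch using only the standard library. The class name ClasspathCheck and the method isOnClasspath are hypothetical names for this example; the class names in the array are taken from the imports in the listings in section 4.

```java
// Hypothetical helper for illustration only.
class ClasspathCheck {
    /** Returns true if the named class can be located on the current classpath. */
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Class names taken from the imports used in this article's listings.
        String[] required = {
            "org.pdfbox.pdmodel.PDDocument",
            "org.pdfbox.util.PDFTextStripper",
            "org.apache.lucene.index.IndexWriter"
        };
        for (String name : required) {
            System.out.println(name + " -> " + (isOnClasspath(name) ? "found" : "MISSING"));
        }
    }
}
```

If any line reports MISSING when run inside the ch7 project, the corresponding JAR has probably not been added to the build path.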

4. Parsing PDF content with PDFBox

In the Eclipse project you just created, create a ch7.pdfbox package and, inside it, a PdfboxTest class. This class contains a getText method that extracts the text from a PDF. A minimal version, written here as a standalone PdfParser class with a main method, is as follows.
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.FileWriter;

import org.pdfbox.pdfparser.PDFParser;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;

public class PdfParser {

    /**
     * Extracts the text of a PDF, prints it, and saves it to a text file.
     *
     * @param args not used
     */
    public static void main(String[] args) throws Exception {
        FileInputStream fis = new FileInputStream("F:\\task\\lerman-atem2001.pdf");
        BufferedWriter writer = new BufferedWriter(new FileWriter("F:\\task\\pdf_change.txt"));
        // Parse the input stream into an in-memory PDF document
        PDFParser p = new PDFParser(fis);
        p.parse();
        PDDocument document = p.getPDDocument();
        // Extract all the text from the parsed document
        PDFTextStripper ts = new PDFTextStripper();
        String s = ts.getText(document);
        writer.write(s);
        System.out.println(s);
        // Release the document, the input stream, and the output file
        document.close();
        fis.close();
        writer.close();
    }
}
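
The listing above writes its output with FileWriter, which uses the platform's default character encoding. If the extracted text contains non-ASCII characters, it may be safer to choose the encoding explicitly. The following is a minimal sketch of that idea using only the standard library, assuming Java 7 or later (the JAR names above suggest the original setup targeted an older JDK). The class name TextOutput and the method writeUtf8 are hypothetical names for this example.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration only.
class TextOutput {
    /** Writes the text to the target file as UTF-8; the writer is closed even if writing fails. */
    static void writeUtf8(Path target, String text) throws IOException {
        try (BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            writer.write(text);
        }
    }

    public static void main(String[] args) throws IOException {
        // A temporary file stands in for the output path used in the listing.
        Path out = Files.createTempFile("pdf_change", ".txt");
        writeUtf8(out, "extracted text: caf\u00e9");
        System.out.println(new String(Files.readAllBytes(out), StandardCharsets.UTF_8));
        Files.delete(out);
    }
}
```

In the PDFBox example, the string returned by getText would be passed as the text argument.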

Here is the code I wrote myself, following the example in the book.

package TestPDF.pdfbox;

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.net.URL;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.TopDocCollector;
import org.apache.lucene.search.TopDocs;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.searchengine.lucene.LucenePDFDocument;
import org.pdfbox.util.PDFTextStripper;

public class Test {

  public void getText(String file) throws Exception {
      // Whether to sort the text by position
      boolean sort = false;
      // PDF file name
      String pdfFile = file;
      // Output text file name
      String textFile = null;
      String encoding = "UTF-8";
      // First page to extract
      int startPage = 1;
      // Last page to extract
      int endPage = Integer.MAX_VALUE;
      // Writer for the output text file
      Writer output = null;
      // PDF document held in memory
      PDDocument document = null;
      try {
          try {
              // First try to load the file as a URL; if that throws, load it from the local file system
              URL url = new URL(pdfFile);
              document = PDDocument.load(url);
              String fileName = url.getFile();
              if (fileName.length() > 4) {
                  // Name the new text file after the original PDF
                  File outputFile = new File(fileName.substring(0, fileName.length() - 4) + ".txt");
                  textFile = outputFile.getName();
              }
          } catch (Exception e) {
              // Loading as a URL failed, so load from the file system instead
              document = PDDocument.load(pdfFile);
              if (pdfFile.length() > 4) {
                  textFile = pdfFile.substring(0, pdfFile.length() - 4) + ".txt";
              }
          }
          // Output stream that writes to textFile
          output = new OutputStreamWriter(new FileOutputStream(textFile), encoding);
          // PDFTextStripper extracts the text
          PDFTextStripper stripper = new PDFTextStripper();
          // Set whether to sort
          stripper.setSortByPosition(sort);
          // Set the start page
          stripper.setStartPage(startPage);
          // Set the end page
          stripper.setEndPage(endPage);
          // Call writeText to extract the text and write it to the output
          stripper.writeText(document, output);
      } finally {
          if (output != null) {
              output.close();
          }
          if (document != null) {
              document.close();
          }
      }
  }
  /**
   * Test Lucene together with PDFBox.
   *
   * @throws IOException
   */
  public void LuceneTest() throws IOException {
      String path = "D:\\index";
      String pdfpath = "D:\\index\\Lucene.Net basic usage.pdf";
      IndexWriter writer = new IndexWriter(path, new StandardAnalyzer(), true);
      // LucenePDFDocument returns a Lucene Document generated from the PDF
      Document d = LucenePDFDocument.getDocument(new File(pdfpath));
      // Write the index
      writer.addDocument(d);
      writer.close();
      // Create an IndexSearcher over the index stored under D:\index
      IndexSearcher searcher = new IndexSearcher(path);
      // Build a Query that looks for the keywords in the "contents" field of the index
      Term t = new Term("contents", "excellent");
      Term m = new Term("contents", "of");
      PhraseQuery q = new PhraseQuery();
      q.add(t);
      q.add(m);
      // Query q = new TermQuery(t);
      TopDocCollector co = new TopDocCollector(10);
      searcher.search(q, co);
      Document document;
      TopDocs docs = co.topDocs();
      ScoreDoc[] doc = docs.scoreDocs;
      for (int i = 0; i < doc.length; i++) {
          System.out.println("Document Number:" + doc[i].doc);
          // document = searcher.doc(doc[i].doc);
      }
  }
  /**
   * @param args not used
   */
  public static void main(String[] args) {
      Test test = new Test();
      try {
          // test.getText("D:\\index\\Lucene.Net basic usage.pdf");
          test.LuceneTest();
      } catch (Exception e) {
          e.printStackTrace();
      }
  }
}
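
The getText method above derives the output name by cutting the last four characters off the PDF name, and leaves textFile as null when the name is four characters or shorter. The following is a sketch of a slightly more defensive variant, not the book's method: it replaces the extension only when the name really ends in ".pdf" and otherwise appends ".txt". The class name OutputNames and the method toTextFileName are hypothetical names for this example.

```java
// Hypothetical helper for illustration only.
class OutputNames {
    /**
     * Derives a text file name from a PDF path: a trailing ".pdf" (in any case)
     * is replaced with ".txt"; any other name simply gets ".txt" appended.
     */
    static String toTextFileName(String pdfFile) {
        int stem = pdfFile.length() - 4;
        // regionMatches with ignoreCase=true accepts ".pdf", ".PDF", and so on
        if (stem > 0 && pdfFile.regionMatches(true, stem, ".pdf", 0, 4)) {
            return pdfFile.substring(0, stem) + ".txt";
        }
        return pdfFile + ".txt";
    }

    public static void main(String[] args) {
        System.out.println(toTextFileName("D:\\index\\report.pdf")); // prints D:\index\report.txt
        System.out.println(toTextFileName("notes"));                  // prints notes.txt
    }
}
```

With a helper like this, the result is never null, so the FileOutputStream in getText always receives a usable name.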

  CopyRight 2002-2022 newfreesoft.com, All Rights Reserved.