  Using PDFBox to Parse PDF Files
     
  Add Date : 2018-11-21      
         
         
         
  Today I set out to add PDF handling to the Nutch source code. One of the steps is extracting the text from a PDF document. After some thought, I decided to stick with PDFBox. The parse-tika plugin in the Nutch source already bundles PDFBox, but it is version 1.1.0, which fails on many PDF documents. The latest official release, 1.6.0, is already available, so I decided to swap it in. Because I do not like reading English documentation, I took a few detours along the way.

At first I downloaded only pdfbox-1.6.0.jar and replaced the old jar with it, and the program threw errors. With no other option, I read the official documentation carefully. The Dependencies section of the PDFBox website (http://pdfbox.apache.org/) states clearly which components PDFBox needs and how they relate. PDFBox has three main components: besides the pdfbox-1.6.0.jar mentioned above, there are fontbox-1.6.0.jar and jempbox-1.6.0.jar. It also needs the commons-logging component for log handling. Nutch already ships the logging component as commons-logging-1.0.4.jar and commons-logging-api-1.0.4.jar. So to use PDFBox in your own application you need these five jars (the logging component accounts for two of them).

For convenience, the official site also provides an all-in-one jar, pdfbox-app-1.6.0.jar. If you use that jar, you do not need any of the others.
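
A quick way to catch the "replaced one jar and the program broke" problem is to check at startup that every required component can be loaded. The following is a minimal sketch, not part of PDFBox: the helper class ClasspathCheck is hypothetical, and the four class names are assumptions chosen as one representative class per component in the PDFBox 1.x layout.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: reports which of the given classes cannot be loaded.
class ClasspathCheck {

    // One representative class per component (names are assumptions
    // based on the PDFBox 1.x package layout).
    static final String[] REQUIRED = {
        "org.apache.pdfbox.pdmodel.PDDocument",   // pdfbox
        "org.apache.fontbox.cmap.CMap",           // fontbox
        "org.apache.jempbox.xmp.XMPMetadata",     // jempbox
        "org.apache.commons.logging.LogFactory"   // commons-logging
    };

    // Returns the names that could not be found on the classpath.
    static List<String> missing(String... classNames) {
        List<String> result = new ArrayList<String>();
        for (String name : classNames) {
            try {
                // 'false' means: locate the class but do not run its static initializers.
                Class.forName(name, false, ClasspathCheck.class.getClassLoader());
            } catch (ClassNotFoundException e) {
                result.add(name);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> missing = missing(REQUIRED);
        if (missing.isEmpty()) {
            System.out.println("All PDFBox components found.");
        } else {
            System.out.println("Missing from classpath: " + missing);
        }
    }
}
```

Run it with the same classpath as the application. Any name it prints points to a jar that still has to be added.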

With the jars in place, you can start extracting text. The extraction code is fairly simple, and there are plenty of examples online. For example:

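import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

PDDocument doc = PDDocument.load("D:/331.pdf");
PDFTextStripper stripper = new PDFTextStripper();
String text = stripper.getText(doc);
// The title is read from the document metadata; PDFTextStripper has no getTitle().
String title = doc.getDocumentInformation().getTitle();
doc.close();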

That reads a PDF from a local file. If the PDF comes from the network, first obtain an InputStream for the file (call it stream). The code is then:

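import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.util.PDFTextStripper;

// Parse the stream, then take the document from the parser.
PDFParser parser = new PDFParser(stream);
parser.parse();
PDDocument doc = parser.getPDDocument();
PDFTextStripper stripper = new PDFTextStripper();
String text = stripper.getText(doc);
// The title is read from the document metadata; PDFTextStripper has no getTitle().
String title = doc.getDocumentInformation().getTitle();
doc.close();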
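
Content fetched over the network is not always a PDF. A crawler may, for instance, receive an HTML error page instead. The sketch below uses only the JDK and shows one way to buffer the stream and check for the "%PDF-" marker that a PDF file starts with before handing the bytes to the parser. The class PdfStreamCheck and its method names are hypothetical. The check is deliberately strict and only looks at the very first bytes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper: buffers a stream and checks whether it starts like a PDF.
class PdfStreamCheck {

    // Reads the whole stream into memory so it can be inspected and parsed later.
    static byte[] readAll(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        return out.toByteArray();
    }

    // True if the data begins with the PDF header marker "%PDF-".
    static boolean looksLikePdf(byte[] data) {
        byte[] magic = {'%', 'P', 'D', 'F', '-'};
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }
}
```

Typical use would be `byte[] data = PdfStreamCheck.readAll(stream);` followed by a call to `looksLikePdf(data)`. If the check passes, wrap the bytes in a `java.io.ByteArrayInputStream` and pass that to the parser. Buffering the whole file in memory is a simplification that suits ordinary documents but not very large ones.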

A few caveats:

(1) PDFBox cannot extract text from PDF files in certain formats, but it handles most of them.

(2) Beyond the body text, PDFBox can also try to return extra information such as the title, but do not expect too much of it. In the PDFBox API this information comes from the document's metadata (PDDocumentInformation), so it is available only for well-prepared PDF documents (academic papers and the like). For everything else the value is either null or wrong.
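
Because the title may come back null or empty, calling code usually needs a fallback. The following sketch shows one possible policy, which is an assumption of this example rather than anything PDFBox does: use the metadata title if it is non-blank, otherwise take the first non-blank line of the extracted text. The class TitleFallback is hypothetical. It cannot detect a title that is present but wrong.

```java
// Hypothetical helper: picks a usable title from metadata or extracted text.
class TitleFallback {

    // Returns the trimmed metadata title if non-blank, otherwise the first
    // non-blank line of the text, otherwise null.
    static String chooseTitle(String metadataTitle, String text) {
        if (metadataTitle != null && metadataTitle.trim().length() > 0) {
            return metadataTitle.trim();
        }
        if (text != null) {
            for (String line : text.split("\\r?\\n")) {
                String trimmed = line.trim();
                if (trimmed.length() > 0) {
                    return trimmed;
                }
            }
        }
        return null;
    }
}
```

With the variables from the extraction code, the call would be `TitleFallback.chooseTitle(title, text)`.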

PDFBox has many other features, such as attempting to decrypt documents. If you need them, study the API.
     
         
         
         
     
           
     
  CopyRight 2002-2022 newfreesoft.com, All Rights Reserved.