Tutorial : How to Find Text In PDF Using Apache Lucene and PDFBox

I recently came across a requirement to find out whether a given word is present in a PDF file. At first I assumed this was a really simple requirement and wrote a simple application in Java that first extracts the text from the PDF files and then does a linear character match like

mystring.contains(mysearchterm) == true

It did give me the expected output, but linear character matching is suitable only when the content you are searching is very small.
Otherwise it is very costly: in complexity terms O(np), where n = number of words to search and p = number of search terms.
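
To make the cost concrete, here is a minimal sketch of the linear approach. The class name LinearSearchSketch, its method names, and the sample text are made up for illustration, and the text is assumed to have already been extracted from the PDF. Every search term triggers another pass over the whole text, which is where the O(np) behaviour comes from.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Naive linear search: one full scan of the text per search term. */
final class LinearSearchSketch {

    /** Returns true if the text contains the term, ignoring case. */
    static boolean containsTerm(String text, String term) {
        return text.toLowerCase(Locale.ROOT).contains(term.toLowerCase(Locale.ROOT));
    }

    /** Counts how many of the terms occur in the text; each term costs another scan. */
    static int countMatchingTerms(String text, List<String> terms) {
        int matches = 0;
        for (String term : terms) {
            if (containsTerm(text, term)) {
                matches++;
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for text already extracted from a PDF.
        String text = "god is good all the time";
        System.out.println(containsTerm(text, "good"));
        System.out.println(countMatchingTerms(text, Arrays.asList("good", "well", "time")));
    }
}
```

Note that contains does substring matching, so a term can also match inside a longer word. That is one more reason to tokenize the text first.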

How to Search Text In PDF Using Java, Apache Lucene and Apache PDFBox

A better approach is to use a simple search program that first pre-parses all of your data into tokens to build an index, and then lets us query the index to retrieve matching results. This means the entire content is first broken down into terms, and each term then points to the content that contains it.


As an example, consider the raw data,

hello world
god is good all the time
all is well
the big bang theory

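The program can create an index like this, where each term points to the numbers of the lines that contain it,
all -> 2,3
hello -> 1
is -> 2,3
good -> 2
world -> 1
the -> 2,4
god -> 2
big -> 4
time -> 2
well -> 3
bang -> 4
theory -> 4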
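
The index above can be sketched in plain Java as a map from each term to the set of line numbers that contain it. This is only an illustration of the idea, not how Lucene stores its index. The class name InvertedIndexSketch and the methods buildIndex and lookup are hypothetical, and tokens are assumed to be separated by whitespace.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy inverted index: term -> sorted set of 1-based line numbers. */
final class InvertedIndexSketch {

    /** Splits every line on whitespace and records where each lowercase term occurs. */
    static Map<String, SortedSet<Integer>> buildIndex(List<String> lines) {
        Map<String, SortedSet<Integer>> index = new TreeMap<>();
        for (int i = 0; i < lines.size(); i++) {
            for (String token : lines.get(i).toLowerCase(Locale.ROOT).split("\\s+")) {
                if (!token.isEmpty()) {
                    index.computeIfAbsent(token, k -> new TreeSet<>()).add(i + 1);
                }
            }
        }
        return index;
    }

    /** Looks a term up in the index; an unknown term yields an empty set. */
    static SortedSet<Integer> lookup(Map<String, SortedSet<Integer>> index, String term) {
        return index.getOrDefault(term.toLowerCase(Locale.ROOT), new TreeSet<>());
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "hello world",
                "god is good all the time",
                "all is well",
                "the big bang theory");
        Map<String, SortedSet<Integer>> index = buildIndex(lines);
        System.out.println(lookup(index, "all")); // prints [2, 3]
        System.out.println(lookup(index, "the")); // prints [2, 4]
    }
}
```

Once the index is built, a query is a single map lookup instead of a scan over all of the text.
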

Full-text search engines are what I am referring to here; these search engines quickly and effectively search large volumes of unstructured text. There are many other things you can do with a search engine, but I am not going to cover any of them in this post. The aim is to give you the knowledge to build a simple Java application that looks for a given keyword in PDF documents and tells you whether or not each document contains that keyword.


That being said, the open source full-text search engine that I am going to use for this purpose is Apache Lucene, a high-performance, full-featured text search engine written entirely in Java. Apache Lucene cannot extract text from PDF files. All it does is create an index from text and then let us query that index to retrieve the matching results. To extract text from PDF documents, we will use Apache PDFBox, an open source Java library that extracts content from PDF documents, which can then be fed to Lucene for indexing.

Let's start by downloading the required libraries. Please use the same versions of the software that I am using, since the latest versions might need a completely different kind of implementation.

See more: http://geekonjava.blogspot.com/2015/08/search-text-in-pdf-using-java-apache.html
