Florencia Gadea
Retrieving page numbers in custom pdf search portlet
July 19, 2013 5:40 AM

Rank: Regular Member

Posts: 159

Join Date: March 27, 2012

Hi Everyone!

Me again. I'm developing a custom search portlet for Liferay 6.1 GA 1 that searches in PDF documents. In the search results I have to show the title of the document and all the page numbers where the searched expression appears, along with a snippet. Is there, by any chance, a built-in solution for this? If not, can you help me figure out how to do it?

Regards,

Flor.
Ray Augé
RE: Retrieving page numbers in custom pdf search portlet
July 26, 2013 6:41 AM

LIFERAY STAFF

Rank: Liferay Legend

Posts: 1238

Join Date: February 7, 2005

Ok, so I think that you should not modify the backend search at all. Attempting to get all that information into the index document in a usable form will be very complex and will likely not lead to a usable solution.

I suggest the following:

- After the existing index search has determined that a document is a valid match, the asset framework is used to return the search result output fields (title, fragment, etc.).
- The AssetRenderer implementation for the asset type can be used to return a summary.

By providing a custom AssetRenderer implementation, you could then identify that the result is a PDF file, load it, and return a PDF-specific summary of the result.
Florencia Gadea
RE: Retrieving page numbers in custom pdf search portlet
July 26, 2013 6:55 AM

Hi Ray,

Thanks for your help, but the most important thing I need to get is all the page numbers of the PDF document where the expression matches, each with a snippet of the matched text. For example, let's say I'm searching for this expression: "het riool" (without the double quotes)

And one of the results should look like:

Title: 2004 dl 4 van den Akker
page 9: Prioriteit voor veiligheid aan het riool
page 1: het betreden van een riool
page 13: In het boek Veilig werken

What should I do to achieve this?
Right now, I'm not using the AssetRenderer to show the results.

Regards,

Flor.
Ray Augé
RE: Retrieving page numbers in custom pdf search portlet
July 26, 2013 7:04 AM

Yes, the point I am making is that this information is not stored in the index, nor will it ever be!

In fact, it's entirely possible that not even the PDF's entire content is indexed. There is a limit on how much content is retrieved from a document during indexing, for the sake of overall scalability and performance. Imagine that you added many 500 MB files to the document repository.

Lucene would struggle with any search that might match in these documents, simply from having to sift through all of their content. (Not even Google search does this.)

Rather, if the document is larger than X kilobytes, its content is not even extracted and we rely only on the metadata.

Furthermore, page numbers are not considered at all. All of the indexed content is just one single blob.

Therefore, the indexing engine will never return page number information at all.

So, what you need to do is use the index engine only _to find_ relevant documents, and then use a custom AssetRenderer (or perhaps just custom logic of your own) to read the document and produce a summary containing the information that you want. (This is, again, more similar to what Google search does.)
Ray Augé
RE: Retrieving page numbers in custom pdf search portlet
July 26, 2013 7:07 AM

I do suppose that if you wrote a com.liferay.portal.kernel.search.IndexerPostProcessor, you could extract this information and have it indexed.

However, you would also have to customize the view side, because a normal search view would not understand how to render a summary with that information.
Florencia Gadea
RE: Retrieving page numbers in custom pdf search portlet
July 26, 2013 7:34 AM

Oh, sure, I wasn't thinking about storing the page numbers in the index, sorry if I wasn't clear about that. I would like to get all that information after I get the search results. Do you have a suggestion about how to do it?

I was thinking about splitting the terms and retrieving the matched chunks of the PDF document. I thought that maybe there was a Lucene/Liferay solution for that.
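A minimal sketch of that term-splitting idea, assuming the document text is already available as a plain string. TermChunkFinder and its methods are hypothetical names, not Liferay or Lucene API, and the matching is plain case-insensitive substring matching, so "het" would also match inside a longer word.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Liferay or Lucene.
final class TermChunkFinder {

    // Splits a query such as "het riool" into its whitespace-separated terms.
    static List<String> splitTerms(String query) {
        List<String> terms = new ArrayList<String>();
        for (String term : query.trim().split("\\s+")) {
            if (!term.isEmpty()) {
                terms.add(term);
            }
        }
        return terms;
    }

    // Maps each term to the chunk of text around its first match.
    // Terms that do not occur in the text are left out of the map.
    static Map<String, String> firstChunks(String text, String query, int context) {
        Map<String, String> chunks = new LinkedHashMap<String, String>();
        for (String term : splitTerms(query)) {
            // Pattern.quote treats the term literally, so regex characters are safe.
            Pattern pattern = Pattern.compile(
                Pattern.quote(term),
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
            Matcher matcher = pattern.matcher(text);
            if (matcher.find()) {
                int start = Math.max(0, matcher.start() - context);
                int end = Math.min(text.length(), matcher.end() + context);
                chunks.put(term, text.substring(start, end).trim());
            }
        }
        return chunks;
    }
}
```

Calling TermChunkFinder.firstChunks(text, "het riool", 5) would give one chunk per matched term, each padded with up to 5 characters of context on either side.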

I am customizing the view as you can see here: http://rioolnet.dev.rotterdamcs.com/web/guest/zoeken?query=Boarnsterhim%20hoeft

Thanks,

Flor.
Ray Augé
RE: Retrieving page numbers in custom pdf search portlet
July 26, 2013 8:08 AM

Sorry, you would have to use something like the PDFBox libraries to read the PDF, get page numbers, and search through the contents.

There is nothing particularly Liferay-specific about it.

One small thing: Liferay already contains these libraries, so you could use those rather than assembling them yourself.
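As a rough sketch of the "search through the contents" step: the code below assumes the text of each page has already been extracted into a list (for example with PDFBox's PDFTextStripper limited to one page at a time), and only shows the page-by-page phrase search that produces "page N: snippet" lines. PageSnippetFinder is a hypothetical name, and the plain phrase matching is just one possible approach.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Liferay or PDFBox.
final class PageSnippetFinder {

    // pageTexts.get(0) is assumed to hold the text of page 1, and so on.
    // Extracting that text from the PDF is assumed to happen elsewhere.
    static List<String> findMatches(List<String> pageTexts, String phrase, int context) {
        List<String> results = new ArrayList<String>();
        String needle = normalize(phrase);
        if (needle.isEmpty()) {
            return results;
        }
        // Literal, case-insensitive match of the whole phrase.
        Pattern pattern = Pattern.compile(
            Pattern.quote(needle),
            Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        for (int i = 0; i < pageTexts.size(); i++) {
            // Collapse line breaks so a phrase wrapped across lines still matches.
            String page = normalize(pageTexts.get(i));
            Matcher matcher = pattern.matcher(page);
            while (matcher.find()) {
                int start = Math.max(0, matcher.start() - context);
                int end = Math.min(page.length(), matcher.end() + context);
                results.add("page " + (i + 1) + ": " + page.substring(start, end).trim());
            }
        }
        return results;
    }

    // Trims the text and collapses every run of whitespace into one space.
    private static String normalize(String text) {
        return text.trim().replaceAll("\\s+", " ");
    }
}
```

The search index would still be used first to decide which documents are worth opening, so this per-page scan only runs on documents that already matched.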
Florencia Gadea
RE: Retrieving page numbers in custom pdf search portlet
July 26, 2013 8:05 AM

Ok, thanks, will try that.

Regards,

Flor.