Forums

RE: Is it possible to search content from the PDF Document?

Amit Doshi, modified 11 years ago.

Is it possible to search content from the PDF Document?

Liferay Master Posts: 550 Join Date: 29.12.10 Recent Posts
Can Liferay index the content of PDF documents so that it shows up in search?

If yes, then how?

Thanks in advance.
Hitoshi Ozawa, modified 11 years ago.

RE: Is it possible to search content from the PDF Document?

Liferay Legend Posts: 7942 Join Date: 24.03.10 Recent Posts
Yes, just upload PDF files to the Documents and Media library!
Amit Doshi, modified 11 years ago.

RE: Is it possible to search content from the PDF Document?

Liferay Master Posts: 550 Join Date: 29.12.10 Recent Posts
Hi Hitoshi,

I did the same thing before posting (uploaded the file and then tried to search using the Search Facet available in Liferay 6.1), but it did not search the content inside the PDF file. Is there any way to do it?

Thanks & Regards,
Amit Doshi
Hitoshi Ozawa, modified 11 years ago.

RE: Is it possible to search content from the PDF Document?

Liferay Legend Posts: 7942 Join Date: 24.03.10 Recent Posts
I'm using the Search portlet.
Amit Doshi, modified 11 years ago.

RE: Is it possible to search content from the PDF Document?

Liferay Master Posts: 550 Join Date: 29.12.10 Recent Posts
In Liferay 6.1, the Search portlet is replaced by the Search Facet, so both are effectively the same. Can you please point out what I am missing? How can I check whether the PDF has been indexed or not?

What I see at present is that it indexes only the title and metadata, not the content inside the file.
Hitoshi Ozawa, modified 11 years ago.

RE: Is it possible to search content from the PDF Document?

Liferay Legend Posts: 7942 Join Date: 24.03.10 Recent Posts
I'm talking about the "Search" portlet that can be put on a page from "Add" -> "More" -> "Tools" -> "Search".

I just did a search on a PDF file I've uploaded to the Documents and Media library and it went fine.
Alexander Chow, modified 11 years ago.

RE: Is it possible to search content from the PDF Document?

Liferay Master Posts: 518 Join Date: 20.07.05 Recent Posts
By default, the text layer of every PDF document is extracted and indexed. This index is used for any search of the document in the (faceted) search portlet or in the Documents and Media library.

The first thing I would do is try to figure out if your document has a text layer. Are you able to open your PDF document in a PDF reader and copy and paste the text to another application (say MS Word)? If not, you don't have a text layer that Liferay can extract from.

If there is a text layer then, in theory, it should be indexed. You can try to look through what is indexed using Luke.
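To illustrate why a PDF without a text layer can only be found by its title, here is a minimal sketch of a term index in plain Java. It is not Liferay's or Lucene's code: the class `TinyFieldIndex`, its methods, and the file names are hypothetical, and the "extracted text" is simply passed in as a string where the portal would use the text pulled out of the file.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical toy index: maps each term to the documents that contain it. */
final class TinyFieldIndex {

    // term -> names of documents whose title or extracted text contains the term
    private final Map<String, Set<String>> postings = new HashMap<>();

    /** Indexes a document by its title and by whatever text could be extracted from it. */
    void add(String name, String title, String extractedText) {
        indexField(name, title);
        indexField(name, extractedText);
    }

    private void indexField(String name, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(name);
            }
        }
    }

    /** Returns the names of all documents containing the given single term. */
    Set<String> search(String term) {
        Set<String> hits = postings.get(term.toLowerCase(Locale.ROOT));
        return hits == null ? Collections.<String>emptySet() : hits;
    }

    public static void main(String[] args) {
        TinyFieldIndex index = new TinyFieldIndex();
        // A PDF with a text layer: its content is searchable.
        index.add("guide.pdf", "OpenLDAP guide", "Installing OpenLDAP on Linux");
        // An image-only (scanned) PDF: nothing to extract, so only the title is indexed.
        index.add("scan.pdf", "Scanned manual", "");

        System.out.println("installing -> " + index.search("installing"));
        System.out.println("scanned    -> " + index.search("scanned"));
    }
}
```

In this sketch the scanned file is reachable through a title word only, which matches the symptom described earlier in the thread (title and metadata are found, content is not).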
Amit Doshi, modified 11 years ago.

RE: Is it possible to search content from the PDF Document?

Liferay Master Posts: 550 Join Date: 29.12.10 Recent Posts
Thanks, Alexander, for highlighting these points. It was useful for me. I found a problem in the PDF.

But I found one issue while re-indexing the Documents and Media portlet. It shows me the exception below with a new PDF.


10:25:22,921 ERROR [FileImpl:304] org.apache.tika.exception.TikaException: Not a HPSF document
org.apache.tika.exception.TikaException: Not a HPSF document
        at org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaryEntryIfExists(SummaryExtractor.java:78)
        at org.apache.tika.parser.microsoft.SummaryExtractor.parseSummaries(SummaryExtractor.java:58)
        at org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:164)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:197)
        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:135)
        at org.apache.tika.Tika.parseToString(Tika.java:357)
        at org.apache.tika.Tika.parseToString(Tika.java:386)
        at com.liferay.portal.util.FileImpl.extractText(FileImpl.java:300)
        at com.liferay.portal.kernel.util.FileUtil.extractText(FileUtil.java:135)
        at com.liferay.portal.kernel.search.DocumentImpl.addFile(DocumentImpl.java:98)
        at com.liferay.portlet.documentlibrary.util.DLIndexer.doGetDocument(DLIndexer.java:360)
        at com.liferay.portal.kernel.search.BaseIndexer.getDocument(BaseIndexer.java:110)
        at com.liferay.portlet.documentlibrary.util.DLIndexer.reindexFileEntries(DLIndexer.java:540)
        at com.liferay.portlet.documentlibrary.util.DLIndexer.reindexFileEntries(DLIndexer.java:523)
        at com.liferay.portlet.documentlibrary.util.DLIndexer.doReindex(DLIndexer.java:498)
        at com.liferay.portal.kernel.search.BaseIndexer.reindex(BaseIndexer.java:329)
        at com.liferay.portlet.documentlibrary.util.DLIndexer.reindexFolders(DLIndexer.java:581)
        at com.liferay.portlet.documentlibrary.util.DLIndexer.reindexFolders(DLIndexer.java:560)
        at com.liferay.portlet.documentlibrary.util.DLIndexer.doReindex(DLIndexer.java:490)
        at com.liferay.portal.kernel.search.BaseIndexer.reindex(BaseIndexer.java:329)
        at com.liferay.portlet.admin.action.EditServerAction.reindex(EditServerAction.java:325)
        at com.liferay.portlet.admin.action.EditServerAction.processAction(EditServerAction.java:157)
        at com.liferay.portal.struts.PortletRequestProcessor.process(PortletRequestProcessor.java:175)
        at com.liferay.portlet.StrutsPortlet.processAction(StrutsPortlet.java:190)
        at com.liferay.portlet.FilterChainImpl.doFilter(FilterChainImpl.java:70)
        at com.liferay.portal.kernel.portlet.PortletFilterUtil.doFilter(PortletFilterUtil.java:48)
        at com.liferay.portlet.InvokerPortletImpl.invoke(InvokerPortletImpl.java:651)
        at com.liferay.portlet.InvokerPortletImpl.invokeAction(InvokerPortletImpl.java:686)
        at com.liferay.portlet.InvokerPortletImpl.processAction(InvokerPortletImpl.java:361)
        at com.liferay.portal.action.LayoutAction.processPortletRequest(LayoutAction.java:856)
        at com.liferay.portal.action.LayoutAction.processLayout(LayoutAction.java:635)
        at com.liferay.portal.action.LayoutAction.execute(LayoutAction.java:246)
        at org.apache.struts.action.RequestProcessor.processActionPerform(RequestProcessor.java:431)
        at org.apache.struts.action.RequestProcessor.process(RequestProcessor.java:236)
        at com.liferay.portal.struts.PortalRequestProcessor.process(PortalRequestProcessor.java:174)
        at org.apache.struts.action.ActionServlet.process(ActionServlet.java:1196)
        at org.apache.struts.action.ActionServlet.doPost(ActionServlet.java:432)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:641)
        at com.liferay.portal.servlet.MainServlet.callParentService(MainServlet.java:538)
        at com.liferay.portal.servlet.MainServlet.service(MainServlet.java:515)
        at javax.servlet.http.HttpServlet.service(HttpServlet.java:722)



Can you please explain how to overcome the above exception?

Regards,
Amit Doshi
Alexander Chow, modified 11 years ago.

RE: Is it possible to search content from the PDF Document?

Liferay Master Posts: 518 Join Date: 20.07.05 Recent Posts
Amit, I think your file is being interpreted as an MS Office file. Do you have a .pdf extension or something else?
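One way to check what a file really is, independent of its extension, is to look at its first bytes. The sketch below is a hypothetical helper (`FileHeaderSniffer` is not part of Liferay or Tika) that only distinguishes two signatures: the `%PDF-` marker at the start of a PDF, and the 8-byte OLE2 compound-file signature used by legacy MS Office formats, which is the container the Office parser in the stack trace expects. Tika's own detection is more elaborate, so treat this as a rough first check.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

/** Hypothetical helper: guesses a container format from a file's leading bytes. */
final class FileHeaderSniffer {

    // OLE2 compound files (legacy .doc/.xls/.ppt) start with these 8 bytes.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // PDF files start with the ASCII marker "%PDF-".
    private static final byte[] PDF_MAGIC =
        "%PDF-".getBytes(StandardCharsets.US_ASCII);

    /** Classifies a header as "PDF", "OLE2", or "UNKNOWN". */
    static String sniff(byte[] header) {
        if (startsWith(header, PDF_MAGIC)) {
            return "PDF";
        }
        if (startsWith(header, OLE2_MAGIC)) {
            return "OLE2";
        }
        return "UNKNOWN";
    }

    /** Reads up to 8 bytes from the file and classifies them. */
    static String sniff(Path file) throws IOException {
        byte[] header = new byte[8];
        int total = 0;
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while (total < header.length
                    && (n = in.read(header, total, header.length - total)) > 0) {
                total += n;
            }
        }
        return sniff(Arrays.copyOf(header, total));
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java FileHeaderSniffer file1 [file2 ...]
        for (String arg : args) {
            System.out.println(arg + " -> " + sniff(Paths.get(arg)));
        }
    }
}
```

Running it over the files in a folder and comparing the result with each file's extension is a quick way to spot a file whose name and content disagree.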
Hitoshi Ozawa, modified 11 years ago.

RE: Is it possible to search content from the PDF Document?

Liferay Legend Posts: 7942 Join Date: 24.03.10 Recent Posts
How are you creating your PDF files? If you're using some Java tools, try using Word 2007/2010 or OpenOffice/LibreOffice.
If you are still having problems, please attach your PDF file here.

org.apache.tika.exception.TikaException: Not a HPSF document

BTW, this is also a well-known problem. (This won't be in the JBoss forum :-) )
Amit Doshi, modified 11 years ago.

RE: Is it possible to search content from the PDF Document?

Liferay Master Posts: 550 Join Date: 29.12.10 Recent Posts
It was working fine with Liferay 6.0.10 (the version number is approximate), but it throws an exception in Liferay 6.1 EE.

To verify once again, I used a vanilla installation of Liferay and uploaded a single PDF file (attached here) to the Documents and Media portlet.

Then I re-indexed the Documents and Media portlet. It gives me no result and no error.

Please find the attached PDF.
Hitoshi Ozawa, modified 11 years ago.

RE: Is it possible to search content from the PDF Document?

Liferay Legend Posts: 7942 Join Date: 24.03.10 Recent Posts
I just downloaded the attached PDF, uploaded it to Liferay 6.1.0 GA1 CE under the file name "OpenLDAP", added the Tools - Search portlet, and did a search on "Installing". The PDF showed up in the list and I was able to open it and view it without any problem. Please check whether you're uploading the file to a folder which the user has permission to view.

The only other difference is that you're using Liferay 6.1 EE while I'm using 6.1 CE. Maybe you should file a ticket for this issue, since you're using EE and you've paid for it.
Amit Doshi, modified 11 years ago.

RE: Is it possible to search content from the PDF Document?

Liferay Master Posts: 550 Join Date: 29.12.10 Recent Posts
The PDF shows up in the list, but the words inside the PDF are not highlighted the way they are on the liferay.com website and in Liferay 6.0.10. It does not work with Liferay 6.1 EE.

Please see the attached screenshot for what I am looking for. The screenshot is taken from Liferay 6.0.10. I want the same in Liferay 6.1 EE.

Hope I am clear.
Hitoshi Ozawa, modified 11 years ago.

RE: Is it possible to search content from the PDF Document?

Liferay Legend Posts: 7942 Join Date: 24.03.10 Recent Posts
So, it seems you are able to search for it now. It's highlighting on my setup, but I've changed the highlighting logic because it was buggy anyway. Are there any errors now?
Amit Doshi, modified 11 years ago.

RE: Is it possible to search content from the PDF Document?

Liferay Master Posts: 550 Join Date: 29.12.10 Recent Posts
Hitoshi:
The attached screenshot is from Liferay 6 EE, which by default highlights the content inside the PDF, and I am expecting the same functionality in Liferay 6.1 EE. It seems that there is a bug in Liferay 6.1 EE.

In Liferay 6.1 EE it behaves as in the screenshot attached by Alexander Chow: it does not highlight any words inside the PDF and displays only the PDF name and title.

But I still have a couple of questions:

The error is caused by an OpenOffice doc file in Documents and Media, so why does re-indexing Documents and Media give an error for doc files?
Is it a bug in Liferay 6.1 EE?

Why does the search functionality not work as it does in Liferay 6 EE?
Is it another bug in Liferay 6.1 EE?

Shall I raise a ticket with Liferay for both of them?

Please suggest.
Alexander Chow, modified 11 years ago.

RE: Is it possible to search content from the PDF Document?

Liferay Master Posts: 518 Join Date: 20.07.05 Recent Posts
Amit,

Using that file I am able to search based on the contents of the file in 6.1 CE and EE. I did not get any of your HPSF errors.

Incidentally, the preview is a little weird, but that should not affect your search problem.
Amit Doshi, modified 11 years ago.

RE: Is it possible to search content from the PDF Document?

Liferay Master Posts: 550 Join Date: 29.12.10 Recent Posts
Alexander Chow:

While re-indexing an OpenOffice doc file in Documents and Media, it gives the HPSF error.

Also, in Liferay 6.1 EE the search functionality for PDF files works the same way as in your screenshot.

But I am expecting the words inside the PDF to be highlighted, as in the screenshot I attached in my earlier post.

Liferay 6 EE provides this functionality by default, so why not Liferay 6.1 EE?

Can we say it is a bug in Liferay 6.1 EE?

Please suggest.
Alexander Chow, modified 11 years ago.

RE: Is it possible to search content from the PDF Document?

Liferay Master Posts: 518 Join Date: 20.07.05 Recent Posts
Amit,

So, let me get this straight.

  • For the PDF file, it works fine (as in it shows what I show).
  • For an OpenOffice doc, it does not index (HPSF error). Can you upload a test file?
  • For the highlighting… it seems to be something more fundamental: the summary itself is not displayed. I've emailed the developer who rewrote the search portlet to be a faceted search to ask him if it was intentional or not, so you can hold off on a ticket for now.
Amit Doshi, modified 11 years ago.

RE: Is it possible to search content from the PDF Document?

Liferay Master Posts: 550 Join Date: 29.12.10 Recent Posts
Thanks, Alexander and Hitoshi, for your cooperation.

Here I attach the doc file. It doesn't give any error, but it was not indexed in a vanilla installation of Liferay 6.1 EE.

As for the HPSF error, it occurs on our test server, where lots of documents have been uploaded, so it is difficult to find what causes it. There we have JS files, images, CSS, TXT files, etc., more than 100 files. I am trying to figure out which particular type of file triggers the error and will come back on that.

So at the current stage I need to figure out two points:

1) Doc files are not being indexed. I don't know why.
2) The highlighting point for PDFs that you mentioned.

Regards,
Amit Doshi
Alexander Chow, modified 11 years ago.

RE: Is it possible to search content from the PDF Document?

Liferay Master Posts: 518 Join Date: 20.07.05 Recent Posts
Amit Doshi:

As for the HPSF error, it occurs on our test server, where lots of documents have been uploaded, so it is difficult to find what causes it. There we have JS files, images, CSS, TXT files, etc., more than 100 files. I am trying to figure out which particular type of file triggers the error and will come back on that.


So, I just tested this against 6.1 CE and 6.1 EE and both seem to work fine. No console errors and, as you can see in the pictures, they search by content OK. Not sure if it makes any difference, but I'm testing with Tomcat.
Hitoshi Ozawa, modified 11 years ago.

RE: Is it possible to search content from the PDF Document?

Liferay Legend Posts: 7942 Join Date: 24.03.10 Recent Posts
I just tested the attached file with Liferay 6.1.0 CE Tomcat bundle and was able to search by content of the doc file.
The searched keyword was also highlighted.

Amit, your question keeps changing with each post. Please test it fully before submitting another question.
Amit Doshi, modified 11 years ago.

RE: Is it possible to search content from the PDF Document?

Liferay Master Posts: 550 Join Date: 29.12.10 Recent Posts
Hitoshi:

I just tested the attached file with Liferay 6.1.0 CE Tomcat bundle and was able to search by content of the doc file.
The searched keyword was also highlighted.


You can check Alexander's post: the searched keywords are not highlighted inside the document or PDF. Alexander has attached screenshots for both editions (6.1 CE and 6.1 EE).


Amit, your question keeps changing with each post. Please test it fully before submitting another question.


I had a problem with Lucene. After deleting the lucene folder from data and then re-indexing, it started to search the document the same way as shown by Alexander.
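The clean-up step described above (removing the lucene folder under data before re-indexing) can be sketched in plain Java as follows. This is only an illustration: `LuceneIndexCleaner` is a hypothetical name, the directory is passed in explicitly rather than guessed, and the sketch assumes the portal is stopped while the folder is removed. Re-indexing afterwards is still done from the portal itself.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper: removes an on-disk index folder so it can be rebuilt. */
final class LuceneIndexCleaner {

    /**
     * Deletes the given directory and everything below it.
     * Returns the number of files and directories removed (0 if it did not exist).
     */
    static int deleteRecursively(Path dir) throws IOException {
        if (!Files.exists(dir)) {
            return 0;
        }
        List<Path> entries;
        try (Stream<Path> walk = Files.walk(dir)) {
            // Reverse order puts children before their parent directories.
            entries = walk.sorted(Comparator.reverseOrder())
                          .collect(Collectors.toList());
        }
        for (Path entry : entries) {
            Files.delete(entry);
        }
        return entries.size();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            // The path is an assumption; pass the lucene folder of your own installation.
            System.err.println("Usage: java LuceneIndexCleaner <path-to-data/lucene>");
            return;
        }
        int removed = deleteRecursively(Paths.get(args[0]));
        System.out.println("Removed " + removed + " entries; now re-index from the portal.");
    }
}
```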

My question still remains the same: it was not highlighting the words inside the PDF or doc, and Alexander replied to that as below.

# For the highlighting… it seems to be something more fundamental: the summary itself is not displayed. I've emailed the developer who rewrote the search portlet to be a faceted search to ask him if it was intentional or not, so you can hold off on a ticket for now.


Now I am waiting for Alexander's answer on whether it is a bug or the functionality was developed like that.

Hope I am clear.

Thanks & Regards,
Amit Doshi
Hitoshi Ozawa, modified 11 years ago.

RE: Is it possible to search content from the PDF Document?

Liferay Legend Posts: 7942 Join Date: 24.03.10 Recent Posts
You can check Alexander's post: the searched keywords are not highlighted inside the document or PDF. Alexander has attached screenshots for both editions (6.1 CE and 6.1 EE).


I probably fixed it in my version then. It wasn't highlighting Japanese documents correctly anyway, even in older versions. I don't think not highlighting is a new "improved" feature, unless someone complained about it not working correctly and one of the developers decided to delete the feature rather than fix it (I've seen this at some sites).
Alexander Chow, modified 11 years ago.

RE: Is it possible to search content from the PDF Document? (Answer)

Liferay Master Posts: 518 Join Date: 20.07.05 Recent Posts
Hey Amit,

OK, so what I found is this. The search portlet was refactored significantly when they added in the faceted search. As part of the refactoring, the results were changed to give preference to AssetRenderer data. See main_search_result_form.jsp:

AssetRendererFactory assetRendererFactory = AssetRendererFactoryRegistryUtil.getAssetRendererFactoryByClassName(className);

if (assetRendererFactory != null) {
...
	entryTitle = assetRenderer.getTitle(locale);
	entrySummary = assetRenderer.getSummary(locale);
}
else {
...
	Summary summary = indexer.getSummary(document, locale, snippet, viewFullContentURL);

	if (viewInContext) {
		viewURL = viewFullContentURL.toString();
	}

	entryTitle = summary.getTitle();
	entrySummary = summary.getContent();
}


So, what that does is for any Asset, it will try to display the summary based on the AssetRenderer summary. In the case of a Document, the summary is the description. So when you upload a file and add a description, you will find that the summary results will be anything in the description. If it turns out that the search keywords are in the description, that will be highlighted. (So, for example, if you set the AssetRenderer in that file to null, you will get the same results as you did in 6.0.)

Why is the AssetRenderer the default choice? The basic theory is that the AssetRenderer is supposed to have a much richer API than the Indexer. The AssetRenderer could itself use the Indexer if it wants, but not the other way around. In the future, the AssetRenderer will also be the vehicle for execution of view templates which will provide admins a way to create new presentations for assets dynamically. The Indexer will never provide any sort of templating functionality, and so the Indexer should only be used as a fallback.
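The effect of that branch on what gets highlighted can be modelled in a few lines of plain Java. The sketch below does not use Liferay's classes: `SearchSummaryDemo`, `Entry`, and the method names are hypothetical stand-ins, the "indexer" summary is just a snippet cut from the indexed text around the first keyword hit, and highlighting is done with simple `<b>` tags.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Simplified, hypothetical model of the summary choice; not Liferay's classes. */
final class SearchSummaryDemo {

    /** Stand-in for a Documents and Media entry as seen by the results page. */
    static final class Entry {
        final String description; // plays the role of the AssetRenderer summary
        final String indexedText; // text extracted from the file at index time

        Entry(String description, String indexedText) {
            this.description = description;
            this.indexedText = indexedText;
        }
    }

    private static Pattern keywordPattern(String keyword) {
        return Pattern.compile(Pattern.quote(keyword), Pattern.CASE_INSENSITIVE);
    }

    /** Indexer-style summary: the indexed text around the first keyword hit. */
    static String indexerSummary(Entry entry, String keyword, int radius) {
        Matcher m = keywordPattern(keyword).matcher(entry.indexedText);
        if (!m.find()) {
            return "";
        }
        int start = Math.max(0, m.start() - radius);
        int end = Math.min(entry.indexedText.length(), m.end() + radius);
        return entry.indexedText.substring(start, end);
    }

    /** Wraps every occurrence of the keyword in <b> tags. */
    static String highlight(String text, String keyword) {
        return keywordPattern(keyword).matcher(text).replaceAll("<b>$0</b>");
    }

    /** Mirrors the if/else in the JSP excerpt: renderer summary first, indexer as fallback. */
    static String entrySummary(Entry entry, String keyword, boolean hasAssetRenderer) {
        String summary = hasAssetRenderer
            ? entry.description
            : indexerSummary(entry, keyword, 20);
        return highlight(summary, keyword);
    }

    public static void main(String[] args) {
        Entry pdf = new Entry("Setup guide", "Chapter 2: Installing OpenLDAP on Linux");
        System.out.println("with renderer    : " + entrySummary(pdf, "installing", true));
        System.out.println("without renderer : " + entrySummary(pdf, "installing", false));
    }
}
```

With a renderer present, the summary is the description, so a keyword that occurs only inside the file has nothing to be highlighted in. Without one, the snippet comes from the indexed text and the keyword is marked, which corresponds to the 6.0-style result described above.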

Hope that helps to clarify a few things.

Alex
Amit Doshi, modified 11 years ago.

RE: Is it possible to search content from the PDF Document?

Liferay Master Posts: 550 Join Date: 29.12.10 Recent Posts
Thanks, Alexander, for the information you shared with us.

So I changed the logic accordingly in main_search_result_form.jsp: I moved the else content into the if condition, and it worked fine for me, because I found that assetRendererFactory is never null in this flow.

So I created a hook for it, and it worked superbly, just as I expected.
Please check the screenshot below.

Thanks & Regards,
Amit Doshi
Alexander Chow, modified 11 years ago.

RE: Is it possible to search content from the PDF Document?

Liferay Master Posts: 518 Join Date: 20.07.05 Recent Posts
Brilliant! Great to hear.
Subhash Pavuskar, modified 11 years ago.

RE: Is it possible to search content from the PDF Document?

Regular Member Posts: 234 Join Date: 13.03.12 Recent Posts
Yes! You can do this. I hope this code helps you read content from the PDF file...

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;

import javax.portlet.ActionRequest;
import javax.portlet.ActionResponse;

import org.apache.poi.hssf.usermodel.HSSFCell;
import org.apache.poi.hssf.usermodel.HSSFRow;
import org.apache.poi.hssf.usermodel.HSSFSheet;
import org.apache.poi.hssf.usermodel.HSSFWorkbook;

import com.liferay.portal.kernel.upload.UploadPortletRequest;
import com.liferay.portal.util.PortalUtil;
import com.liferay.util.bridges.mvc.MVCPortlet;

/**
 * Portlet implementation class Isolate.
 *
 * Reads the uploaded file as an Excel (.xls) workbook through Apache POI's
 * HSSF API and copies contact details found in the first column of each row
 * into request attributes.
 */
public class Isolate extends MVCPortlet {

    public void processAction(ActionRequest request, ActionResponse response) {
        UploadPortletRequest uploadRequest =
            PortalUtil.getUploadPortletRequest(request);
        File file = uploadRequest.getFile("ufile");
        FileInputStream fs = null;

        try {
            fs = new FileInputStream(file);
            HSSFWorkbook wb = new HSSFWorkbook(fs);

            for (int k = 0; k < wb.getNumberOfSheets(); k++) {
                HSSFSheet sheet = wb.getSheetAt(k);
                int rows = sheet.getPhysicalNumberOfRows();

                // Scan beyond the physical row count so that gaps left by
                // empty rows do not cut the scan short.
                for (int r = 0; r < rows + 100; r++) {
                    try {
                        HSSFRow row = sheet.getRow(r);
                        HSSFCell cell1 = row.getCell(0);
                        String value = cell1.getStringCellValue();
                        String[] tokens = value.split(" ");

                        // E-mail address: a token containing '@'.
                        for (int a = 0; a < tokens.length; a++) {
                            if (tokens[a].indexOf('@') > 0) {
                                request.setAttribute("email", tokens[a]);
                            }
                        }

                        // Website: a token containing "www".
                        for (int b = 0; b < tokens.length; b++) {
                            if (tokens[b].indexOf("www") >= 0) {
                                request.setAttribute("website", tokens[b]);
                            }
                        }

                        // Phone number: a token containing '+', "91" or "080".
                        for (int b = 0; b < tokens.length; b++) {
                            String token = tokens[b];

                            if (token.indexOf('+') >= 0) {
                                request.setAttribute("number", token);
                            }
                            else if (token.indexOf("91") >= 0
                                    || token.indexOf("080") >= 0) {
                                // No '+' prefix: the number continues in the next token.
                                request.setAttribute("number", token + tokens[b + 1]);
                            }
                        }

                        // Address: a cell containing '#'; the company name is
                        // taken from the first cell of the row above.
                        if (value.indexOf('#') >= 0) {
                            request.setAttribute("address", value);
                            HSSFRow row1 = sheet.getRow(r - 1);
                            HSSFCell cell2 = row1.getCell(0);
                            request.setAttribute("company", cell2.getStringCellValue());
                        }
                    }
                    catch (Exception e) {
                        // Missing row, empty or non-text cell, or no following
                        // token: skip this row and continue with the next one.
                    }
                }
            }
        }
        catch (Exception e) {
            // The file could not be opened or parsed as a workbook; as in the
            // original post, the error is ignored and the result page is shown.
        }
        finally {
            if (fs != null) {
                try {
                    fs.close();
                }
                catch (IOException ioe) {
                    // Nothing useful can be done if closing fails.
                }
            }
        }

        file.delete();
        response.setRenderParameter("jspPage", "/html/isolate/result.jsp");
    }
}
Prabhakar Singh, modified 10 years ago.

RE: Is it possible to search content from the PDF Document?

New Member Posts: 8 Join Date: 02.08.12 Recent Posts
Hi Alexander, Hitoshi, Amit,

This is just another awesome post in the Liferay Forums... thanks a lot!
Got a much clearer picture of the whats and what-nots of Liferay search!

Thanks & Best Regards ,
Prabhakar
Rashmi S, modified 9 years ago.

RE: Is it possible to search content from the PDF Document?

New Member Posts: 11 Join Date: 03.01.14 Recent Posts
Hi Alexander, Hitoshi, Amit,

Thanks a lot, this post really helped me!

Thanks,
Rashmi S