Lucene, Indexing with Multiple Threads breaks classloader after 2nd run.

Alex Wallace, modified 15 years ago.

Liferay Master Posts: 640 Join Date: 05.11.07
Hi all!

In order to improve indexing time, I decided to use a single writer shared across multiple threads, and I wrote a separate indexer for this.

The indexer works great on the first run. In fact, even if I use only one thread to reindex, I cut the reindexing time by 1/3.

To retrieve the records to be reindexed, I use a finder exposed via a LocalServiceUtil method (say, MyUserLocalServiceUtil.findAll(0, 100)).

Each thread picks up a different range of records (i.e., the first thread does 0 - 100, the second 101 - 200, and so on). Say I use 2 threads: I make the currently executing thread (the indexer) join those 2 threads, so that it waits until they both finish before launching another 2.

This works perfectly the first time I use my indexer with threads. Again, the time improved a lot, and the resulting index is perfect.

Regardless of the number of threads I use, as long as threads are involved, the second time I run the reindexing I get a very nasty error when calling the same finder method I used the first time. Apparently the class loader has lost the portal classes, or for some strange reason the program is now using a different class loader. I'm not really sure what is taking place.

I am pasting the error below.

Does anyone have a clue as to why this happens and how to prevent it?

Thanks in advance! The error follows:


Caused by: java.lang.NoClassDefFoundError: com/liferay/portal/model/impl/BaseModelImpl
        at java.lang.ClassLoader.defineClass1(Native Method)
        at java.lang.ClassLoader.defineClass(ClassLoader.java:675)
        at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:124)
        at org.apache.catalina.loader.WebappClassLoader.findClassInternal(WebappClassLoader.java:1847)
        at org.apache.catalina.loader.WebappClassLoader.findClass(WebappClassLoader.java:873)
        at org.apache.catalina.loader.WebappClassLoader.loadClass(WebappClassLoader.java:1326)
        at org.apache.catalina.loader.WebappClassLoader.loadClass(WebappClassLoader.java:1205)
        at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:374)
        at java.lang.ClassLoader.defineClass1(Native Method)
        at java.lang.ClassLoader.defineClass(ClassLoader.java:675)
        at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:124)
        at org.apache.catalina.loader.WebappClassLoader.findClassInternal(WebappClassLoader.java:1847)
        at org.apache.catalina.loader.WebappClassLoader.findClass(WebappClassLoader.java:873)
        at org.apache.catalina.loader.WebappClassLoader.loadClass(WebappClassLoader.java:1326)
        at org.apache.catalina.loader.WebappClassLoader.loadClass(WebappClassLoader.java:1205)
        at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:374)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:247)
        at net.sf.ehcache.store.DiskStore$1.resolveClass(DiskStore.java:294)
        at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1575)
        at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1496)
        at java.io.ObjectInputStream.readClass(ObjectInputStream.java:1462)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1312)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1947)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1871)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1753)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:351)
        at java.util.ArrayList.readObject(ArrayList.java:593)
        at sun.reflect.GeneratedMethodAccessor629.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:974)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1849)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1753)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329)
        at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1947)
        at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1871)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1753)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1329)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:351)
        at net.sf.ehcache.store.DiskStore.loadElementFromDiskElement(DiskStore.java:302)
        at net.sf.ehcache.store.DiskStore.get(DiskStore.java:257)
        at net.sf.ehcache.Cache.searchInDiskStore(Cache.java:924)
        at net.sf.ehcache.Cache.get(Cache.java:735)
        at net.sf.ehcache.Cache.get(Cache.java:710)
        at com.liferay.portal.cache.PortalCacheImpl.get(Unknown Source)
        at com.liferay.portal.cache.MultiVMPoolImpl.get(Unknown Source)
        at com.liferay.portal.kernel.cache.MultiVMPoolUtil.get(MultiVMPoolUtil.java:59)
        at com.liferay.portal.spring.hibernate.FinderCache.getResult(Unknown Source)
        at com.weareteachers.model.userextras.service.persistence.UserExtrasPersistenceImpl.findAll(UserExtrasPersistenceImpl.java:208)
        at com.weareteachers.model.userextras.service.persistence.UserExtrasPersistenceImpl.findAll(UserExtrasPersistenceImpl.java:194)
        at com.weareteachers.model.userextras.service.persistence.UserExtrasUtil.findAll(UserExtrasUtil.java:149)
        at com.weareteachers.model.userextras.service.impl.UserExtrasLocalServiceImpl.findAll(UserExtrasLocalServiceImpl.java:515)
        at sun.reflect.GeneratedMethodAccessor687.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:296)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:177)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:144)
        at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:107)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:166)
        at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:204)
        at $Proxy91.findAll(Unknown Source)
        at com.weareteachers.model.userextras.service.UserExtrasLocalServiceUtil.findAll(UserExtrasLocalServiceUtil.java:270)
        at com.weareteachers.model.userextras.search.Indexer.reIndex(Indexer.java:146)
        at com.weareteachers.model.userextras.search.Indexer.reIndex(Indexer.java:93)
        at com.weareteachers.model.userextras.service.impl.UserExtrasLocalServiceImpl.reIndex(UserExtrasLocalServiceImpl.java:276)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:296)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:177)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:144)
        at org.springframework.transaction.interceptor.TransactionInterceptor.invoke(TransactionInterceptor.java:107)
        at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:166)
        at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:204)
        at $Proxy91.reIndex(Unknown Source)
        at com.weareteachers.model.userextras.service.UserExtrasLocalServiceUtil.reIndex(UserExtrasLocalServiceUtil.java:169)
        at com.weareteachers.portlets.search.bean.Reindex.performReindex(Reindex.java:89)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.myfaces.el.MethodBindingImpl.invoke(MethodBindingImpl.java:129)
        ... 78 more
Caused by: java.lang.ClassNotFoundException: com.liferay.portal.model.impl.BaseModelImpl
        at org.apache.catalina.loader.WebappClassLoader.loadClass(WebappClassLoader.java:1359)
        at org.apache.catalina.loader.WebappClassLoader.loadClass(WebappClassLoader.java:1205)
        at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:374)
        ... 164 more
Alex Wallace, modified 15 years ago.

RE: Lucene, Indexing with Multiple Threads breaks classloader after 2nd run

Liferay Master Posts: 640 Join Date: 05.11.07
I've narrowed it down so far to iterating through the lists returned by the finders. Even if I don't do anything with the data, just iterating causes this behavior.
Alex Wallace, modified 15 years ago.

RE: Lucene, Indexing with Multiple Threads breaks classloader after 2nd run

Liferay Master Posts: 640 Join Date: 05.11.07
Narrowed it down further. If the number of records indexed is small, the problem does not occur. It sounds like some sort of leak in the finders.
Alex Wallace, modified 15 years ago.

RE: Lucene, Indexing with Multiple Threads breaks classloader after 2nd run

Liferay Master Posts: 640 Join Date: 05.11.07
OK, it is related to the db cache: if I clear the db cache between runs, the thing works just fine.
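
One possible reading of the stack trace above (an interpretation, not something confirmed in this thread): the failure happens while ehcache's DiskStore deserializes a cached finder result, and its resolveClass step calls Class.forName against a WebappClassLoader that cannot see com.liferay.portal.model.impl.BaseModelImpl. The sketch below only illustrates that general mechanism in plain Java. The Row class, the LoaderObjectInputStream class, and the restricted loader are hypothetical stand-ins, and the demo fails with ClassNotFoundException rather than the NoClassDefFoundError seen in the trace.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamClass;
import java.io.Serializable;
import java.io.UncheckedIOException;

class LoaderAwareDeserialization {

    // Hypothetical stand-in for a cached model object.
    static class Row implements Serializable {
        private static final long serialVersionUID = 1L;
        final int id;
        Row(int id) { this.id = id; }
    }

    // Resolves every class through one explicit loader, similar in spirit
    // to the resolveClass frame that appears in the stack trace.
    static class LoaderObjectInputStream extends ObjectInputStream {
        private final ClassLoader loader;

        LoaderObjectInputStream(InputStream in, ClassLoader loader) throws IOException {
            super(in);
            this.loader = loader;
        }

        @Override
        protected Class<?> resolveClass(ObjectStreamClass desc) throws ClassNotFoundException {
            return Class.forName(desc.getName(), false, loader);
        }
    }

    static byte[] serialize(Object value) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(value);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    // Returns false when the given loader cannot resolve a class in the stream.
    static boolean canDeserialize(byte[] data, ClassLoader loader) {
        try (ObjectInputStream in =
                 new LoaderObjectInputStream(new ByteArrayInputStream(data), loader)) {
            in.readObject();
            return true;
        } catch (ClassNotFoundException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // A loader with no parent: it sees JDK bootstrap classes only, not Row.
    static ClassLoader restrictedLoader() {
        return new ClassLoader(null) { };
    }

    public static void main(String[] args) {
        byte[] data = serialize(new Row(42));
        System.out.println("owning loader:     " + canDeserialize(data, Row.class.getClassLoader()));
        System.out.println("restricted loader: " + canDeserialize(data, restrictedLoader()));
    }
}
```

If this reading is right, it would also fit the observation that the problem disappears when the cache is cleared, since nothing is then read back from the cache at all.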
Alex Wallace, modified 15 years ago.

RE: Lucene, Indexing with Multiple Threads breaks classloader after 2nd ru

Liferay Master Posts: 640 Join Date: 05.11.07
I did two things, each of which can prevent the issue by itself, probably at different levels:

1 - a special finder using hibernate query.setCacheable(false)
2 - a call to com.liferay.portal.spring.hibernate.CacheRegistry.clear();

The second option is probably more drastic, but the most effective. While option 1 works, any use of other finders inside my indexer could make the problem come back.

I am probably going to use both.

These are, however, workarounds. I believe the current behavior to be a bug.
Channing K Jackson, modified 14 years ago.

RE: Lucene, Indexing with Multiple Threads breaks classloader after 2nd ru

Junior Member Posts: 67 Join Date: 13.11.08
Hi Alex,

I'm very interested in your multi-threaded indexing solution.

We are experiencing extremely long indexing periods when our servers start up in production, using file-based indexing against a document library with thousands (~15000) of documents and GBs (~12GB) of data volume. We see stack traces in the logs, and we can't easily determine whether the indexing is finishing successfully.

Are you willing to share your multi-threaded solution?
Alex Wallace, modified 14 years ago.

RE: Lucene, Indexing with Multiple Threads breaks classloader after 2nd ru

Liferay Master Posts: 640 Join Date: 05.11.07
I can certainly help.

But before you write your own indexers: how is the property index.with.thread set in your portal-ext.properties?

What version of LR are you using? The Messaging Hub in LR is already multithreaded as well.

Answers to these questions can help figure out what you need to do.

Let me know!
Channing K Jackson, modified 14 years ago.

RE: Lucene, Indexing with Multiple Threads breaks classloader after 2nd ru

Junior Member Posts: 67 Join Date: 13.11.08
Alex, thanks for your willingness!

index.with.thread is currently set to true.

We are on Liferay Portal v5.2.2 with some bug fix patches.

I ran our indexer over the weekend to see how long it would take to complete. This was in a Staging env, so not as much horsepower as in a Production env, but still relatively roomy and fast. With ~9000 users, ~15000 documents, and a total of approximately 12 to 15GB of data, file-based indexing took over 25 hours to complete. This seems exceedingly long.

lucene.merge.factor=1000
lucene.optimize.interval=1000

The biggest one was the documentlibrary, and I believe the culprit there is our custom hook for the content repository we are using at our company. We use Oracle UCM, and we implemented a CISHook that uses the Java API to access documents in the document store. I suspect that a jackrabbit repository would be much faster, but we don't have a choice in the matter. I did, however, notice that user indexing was pretty slow too, which is just using out-of-box storage options in our Oracle database.

More info as needed, and thanks again for your willingness to help.
Alex Wallace, modified 14 years ago.

RE: Lucene, Indexing with Multiple Threads breaks classloader after 2nd ru

Liferay Master Posts: 640 Join Date: 05.11.07
If I were you, I would do a little profiling to find out exactly where your time is going. Most likely it is going to a number of places anyway, but it would give you clarity.

Threading the reindexing will only let you take better advantage of your hardware, if it has enough horsepower.

For threaded reindexing, I have 3 main pieces:

1 - A 'DocumentWrapper', which receives a single db row and wraps it inside a search document. It declares a class variable per document field, adds all the fields to the document, and updates the fields as needed from the db entity. By reusing Lucene field instances and a single document instance, I saved some cycles in garbage collection. However, if you are using Liferay's SearchUtil to index documents, it is not a good idea to reuse the fields, because SearchUtil returns control to your code before the document has actually been reindexed, and you will very likely process more db entities before a single doc is reindexed. So, if you choose to use a 'DocumentWrapper' for good abstraction but index through SearchUtil, DO NOT reuse Lucene documents/fields; instantiate a new set for each db entity.

2 - An IndexingThread, a class that iterates through a batch of db records (say 100 at a time) and uses a DocumentWrapper to create and index documents, in its own thread.

3 - The main loop in your Indexer, which creates a configured number of threads, gives each of them a batch of db records, and waits for them to finish before spawning a new set of threads. Also, very important: it reuses a single Lucene index writer across each set of threads before calling the 'write' command on it.
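
The warning in item 1 about reusing a document when indexing is deferred can be illustrated with a small plain-Java sketch. Everything here is a hypothetical stand-in: a Map plays the role of the search document, and a queue that is drained later plays the role of an indexing call that returns before the document is actually written. It is not Liferay or Lucene code.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

class ReuseHazardDemo {

    // Queues one "document" per id, then drains the queue afterwards,
    // simulating an indexer that processes documents after the caller
    // has already moved on to the next db entity.
    static List<String> indexAll(List<String> ids, boolean reuseDocument) {
        Queue<Map<String, String>> pending = new ArrayDeque<>();
        Map<String, String> shared = new HashMap<>();

        for (String id : ids) {
            // Either mutate the one shared document or build a fresh one.
            Map<String, String> doc = reuseDocument ? shared : new HashMap<>();
            doc.put("id", id);
            pending.add(doc); // returns immediately; nothing is indexed yet
        }

        // The deferred indexing step: by now a reused document only holds
        // the values of the last entity that was written into it.
        List<String> indexed = new ArrayList<>();
        Map<String, String> doc;
        while ((doc = pending.poll()) != null) {
            indexed.add(doc.get("id"));
        }
        return indexed;
    }

    public static void main(String[] args) {
        List<String> ids = Arrays.asList("a", "b", "c");
        System.out.println(indexAll(ids, true));  // prints [c, c, c]
        System.out.println(indexAll(ids, false)); // prints [a, b, c]
    }
}
```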

Now, this code was written for LR 4.3.1, which back then used a very different indexing mechanism.

LR now has a messaging hub which uses threads itself. I think you may want to look into giving more power to the messaging hub. AFAIK, it can be configured to use more threads.

However, how often a writer is asked to write depends on the search hook and, as far as I've seen, most likely every document being reindexed calls write on the writer. This is supposed to be mitigated by the merge factor and optimize interval, though.

There are tons of little variables to play with that can make a difference.

Again, I suggest you see if you can make the messaging hub use more threads for reindexing, but in any event, I am very willing to share what the 3 pieces of code do if you feel you want to see them.

In my case, my main improvements were in:

1 - carefully fine-tuning our finders to minimize future db calls
2 - caching common data in the DocumentWrapper that was otherwise being retrieved too many times
3 - using 2 or more threads via the IndexingThread class (again, we wrote this before there was a SearchUtil which uses the Messaging hub)
4 - reusing the single writer before calling write.
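
As a rough illustration of item 2, caching data that is looked up repeatedly while building documents could look like the sketch below. LookupCache and its loader function are hypothetical names, and the real DocumentWrapper may cache quite differently; the load counter exists only to show how many lookups actually reach the backing source.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

class LookupCache<K, V> {

    private final Map<K, V> cache = new ConcurrentHashMap<>();
    private final Function<K, V> loader;
    private final AtomicInteger loads = new AtomicInteger();

    // The loader stands in for an expensive call such as a db lookup.
    // It must not return null, because ConcurrentHashMap cannot store null.
    LookupCache(Function<K, V> loader) {
        this.loader = loader;
    }

    // Loads the value on first use and serves it from memory afterwards.
    V get(K key) {
        return cache.computeIfAbsent(key, k -> {
            loads.incrementAndGet();
            return loader.apply(k);
        });
    }

    int loadCount() {
        return loads.get();
    }

    public static void main(String[] args) {
        LookupCache<Long, String> names = new LookupCache<>(id -> "category-" + id);
        for (int i = 0; i < 1000; i++) {
            names.get((long) (i % 3)); // only three distinct keys
        }
        System.out.println(names.loadCount()); // prints 3
    }
}
```

A ConcurrentHashMap is used here so that one cache could be shared by several indexing threads; a cache held per thread could use a plain HashMap instead.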

I have to go right now but I'll paste some sample of how it is done, which is really simple...

Hope this helps.
Alex Wallace, modified 14 years ago.

RE: Lucene, Indexing with Multiple Threads breaks classloader after 2nd ru

Liferay Master Posts: 640 Join Date: 05.11.07
IndexingThread example:

 ...
    // reuse one document wrapper for the whole iterator to save a lot on GC
    private DocumentWrapper dw = new DocumentWrapper();
    private Iterator<MyEntity> iterator;

    /** Creates a new instance of IndexingThread */
    public IndexingThread(Iterator<MyEntity> iterator) {
        this.iterator = iterator;
    }

    public void run() {
        MyEntity c = null;
        while (iterator.hasNext()) {
            try {
                _log.trace("run before creating/validating document");
                c = iterator.next();
                // create() returns true only if a valid document was built
                if (dw.create(c)) {
                    _log.trace("run before addDocument");
                    SearchEngineUtil.addDocument(companyId, dw.getDocument());
                    Thread.yield();
                    _log.trace("run after addDocument");
                }
                _log.trace("run after creating/validating document");
            } catch (Exception e) {
                // Continue indexing even if one entry fails
                _log.error("run unable to index entry: " + e.getMessage());
            }
        }
...


DocumentWrapper is going to be very different for each application. It is pretty much a class that has methods to populate a search document from a db entity. In my case it has methods to update a reusable search document and search fields, a method to create a whole new search document with new fields, methods to validate the document and each of its fields, and also caching for commonly retrieved data. This is the class that knows what to index, in what fields, and how. It knows about boosting, index-time fields for sorting, etc.

And finally, the main loop inside Indexer.java:

 ...
            Thread[] threads = new Thread[threadCount];
            while (currentRow < totalRows) {
                for (int t = 0; t < threadCount; t++) {
                    // finder end indexes are non-inclusive
                    endRow = currentRow + rowsPerBatch;
                    _log.trace("reIndex before finder call for next batch");
                    // iterator for each thread
                    Iterator<MyEntity> itr = MyEntity.findByCompanyIdAndAvailable(companyId, true, currentRow, endRow).iterator();
                    _log.trace("reIndex after finder call for next batch");

                    // add new worker
                    threads[t] = new Thread(new IndexingThread(itr));
                    if (_log.isDebugEnabled()) {
                        _log.debug("thread " + t + " will process rows " + currentRow + " through " + endRow);
                    }
                    // increase current row for next worker
                    currentRow += rowsPerBatch;
                    if (_log.isDebugEnabled()) {
                        _log.debug("reassigned currentRow to " + currentRow);
                    }
                    threads[t].start();
                    if (_log.isDebugEnabled()) {
                        _log.debug("started thread " + t);
                    }
                }
                // wait for all threads to finish.
                for (int t = 0; t < threadCount; t++) {
                    threads[t].join();
                }
                if (_log.isInfoEnabled()) {
                    // endRow can overshoot totalRows on the last batch
                    _log.info("processed row " + Math.min(endRow, totalRows) + " of " + totalRows);
                }
                Thread.yield();
            }
...


That's basically a loop that creates a configured number of threads at a time, waits for them to finish, and then creates more until all rows of the entity are processed.
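
For anyone who wants to try the pattern outside the portal, here is a minimal self-contained sketch of the same wave-by-wave loop. It is not the original Indexer: the find method is a hypothetical stand-in for a finder with an exclusive end index, a synchronized list stands in for the shared index writer, and "indexing" a row just means adding it to that list. Note that with an exclusive end the ranges are 0 to 100, then 100 to 200, and so on.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class BatchedIndexer {

    // Hypothetical finder: returns rows in [start, end), clamped to the table size.
    static List<Integer> find(List<Integer> table, int start, int end) {
        int from = Math.min(start, table.size());
        int to = Math.min(end, table.size());
        return new ArrayList<>(table.subList(from, to));
    }

    // Starts up to threadCount workers per wave, each with its own batch,
    // and joins the whole wave before starting the next one.
    static List<Integer> reIndex(List<Integer> table, int threadCount, int rowsPerBatch) {
        if (threadCount < 1 || rowsPerBatch < 1) {
            throw new IllegalArgumentException("threadCount and rowsPerBatch must be positive");
        }
        List<Integer> indexed = Collections.synchronizedList(new ArrayList<>());
        int currentRow = 0;
        while (currentRow < table.size()) {
            List<Thread> wave = new ArrayList<>();
            for (int t = 0; t < threadCount && currentRow < table.size(); t++) {
                // Batches are fetched on the calling thread, as in the loop above.
                List<Integer> batch = find(table, currentRow, currentRow + rowsPerBatch);
                Thread worker = new Thread(() -> indexed.addAll(batch));
                wave.add(worker);
                worker.start();
                currentRow += rowsPerBatch;
            }
            for (Thread worker : wave) {
                try {
                    worker.join();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting for workers", e);
                }
            }
        }
        return new ArrayList<>(indexed);
    }

    public static void main(String[] args) {
        List<Integer> table = new ArrayList<>();
        for (int i = 0; i < 250; i++) {
            table.add(i);
        }
        List<Integer> result = reIndex(table, 2, 100);
        System.out.println("rows indexed: " + result.size());
    }
}
```

Unlike the original loop, this sketch stops starting workers once the rows run out, so the last wave may hold fewer threads. The order of the returned rows depends on thread scheduling.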

All of the above being said, unless you look at where your bottlenecks are, this may not help. It may be a shot in the dark until you research where you need to optimize the most.

I hope this helps!
Channing K Jackson, modified 14 years ago.

RE: Lucene, Indexing with Multiple Threads breaks classloader after 2nd ru

Junior Member Posts: 67 Join Date: 13.11.08
Thanks Alex.

I am going through some profiling right now to try to better pinpoint where the pain points are.

I will post back when I find out...