One of the first things I noticed when I first looked at
IndexedCatalog was that it stored its indexes as OOBTrees whose values
were references to the objects themselves. I suspected this would
cause a significant slowdown when querying, because it could wake up
lots of objects unnecessarily. This proved to be true when we made the
first profile: there were around 2000 calls to __setstate__ on a
normal query, which were responsible for around 75% of the total time.
There was also an intersection between OOSets (containing object
references) involved, which is undoubtedly slower than an intersection
using IISets.
So, we decided to go ahead with the plan of converting the OO*s to II*s
and added a new feature to the plan, after a discussion over Chinese
food: we would try to delay loading the objects until it was strictly
necessary. That would be possible because the objects are normally
fetched from a search result, and keeping only OIDs in the indexes
would allow us to return the object for a given OID when the user
iterates through the search results.
So, the workflow is more or less like this now:
- User does a query
- Catalog delegates the query to the indexes
- Indexes return a list of OIDs (actually an IISet)
- Catalog builds a Result object with the intersection of the
  OIDs received from the indexes
- Result, when asked for an item, does a lookup by the OID and
  returns the actual object.
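
To make that workflow concrete, here is a minimal sketch of the idea.
It is not IndexedCatalog's actual code: the class names, the query()
signature and the load callback are all made up for illustration, and
plain Python sets and dicts stand in for the IISets and BTrees so the
example runs without ZODB installed.

```python
class Index:
    """Hypothetical index: maps a field value to a set of integer OIDs."""

    def __init__(self):
        # A plain dict of sets stands in for the BTrees structures.
        self._oids_by_value = {}

    def add(self, value, oid):
        self._oids_by_value.setdefault(value, set()).add(oid)

    def query(self, value):
        # Only OIDs are returned; no object is touched here.
        return self._oids_by_value.get(value, set())


class Result:
    """Holds only OIDs and resolves each one to an object on demand."""

    def __init__(self, oids, load):
        self._oids = sorted(oids)
        self._load = load  # callable: oid -> object (assumed interface)

    def __len__(self):
        return len(self._oids)

    def __getitem__(self, i):
        # The object is fetched only at this point.
        return self._load(self._oids[i])

    def __iter__(self):
        for oid in self._oids:
            yield self._load(oid)


class Catalog:
    """Hypothetical catalog that delegates queries to its indexes."""

    def __init__(self, indexes, load):
        self._indexes = indexes  # dict: field name -> Index
        self._load = load

    def query(self, **criteria):
        oid_sets = [self._indexes[name].query(value)
                    for name, value in criteria.items()]
        oids = set.intersection(*oid_sets) if oid_sets else set()
        return Result(oids, self._load)


# Toy "database": OID -> object, with a log of which OIDs were loaded.
storage = {1: "object one", 2: "object two", 3: "object three"}
loaded = []

def load(oid):
    loaded.append(oid)
    return storage[oid]

color = Index()
size = Index()
for oid, c, s in [(1, "red", "big"), (2, "red", "small"), (3, "red", "big")]:
    color.add(c, oid)
    size.add(s, oid)

catalog = Catalog({"color": color, "size": size}, load)
result = catalog.query(color="red", size="big")
```

After the query, len(result) is already known but the loaded list is
still empty. Objects are only fetched when the caller indexes or
iterates over the result, which is the point of keeping nothing but
OIDs in the indexes.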
Needless to say, the improvement was overwhelming. Not only was the
query far faster, but the database, after replacing the indexes, was
also 20% smaller.
I must admit: the BTrees package is one of the most amazing ones I’ve
used in all the time I’ve been involved with Python, and when you
deploy it the right way, it can make a world of difference.