Re: Need a way to modify text before indexing (was SearchUpdate)
On Wed, Jan 15, 2014 at 12:07 AM, Vitaliy Filippov <[hidden email]> wrote:
>> SearchEngine subclasses can implement getTextFromContent() if they want to
>> override the normal text fetching behavior.
> I can't put it into SearchEngine subclass because Tika isn't a search
> engine, it's rather a java application that runs separately and extracts
> text from binary files like *.doc, *.pdf and so on.
> TikaMW is a plugin that should work with any search engine - it just
> modifies indexed text for pages in File: namespace.
The problem is you can't make that assumption. Different search backends index
in different ways, and munging file text into the same content field won't allow
them to do the right thing. With lsearchd and Elasticsearch, we absolutely wouldn't
munge file text into page content (with sql-backed search, you might, maybe).
Most of the code in the SearchEngine and related classes is infrastructure for the
sql-backed options, leaving MWSearch and CirrusSearch to reinvent a lot of it.
If we cleaned this up a bunch (I want to do this anyway, but time) we might be able
to add a hook back in that only affects search engines that are based on the
core sql search...
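
For illustration only, the split described above might look roughly like the sketch below. This is not MediaWiki code: it is written in Python for brevity, and every class, method, and hook name in it is hypothetical. It assumes a base engine that returns page text unchanged, a sql-backed engine that runs registered text hooks before indexing, and an external engine that deliberately ignores those hooks because it keeps file text separate.

```python
# Hypothetical sketch, not MediaWiki code: all names here are invented.


class SearchEngine:
    """Base class: hands the page text to the indexer unchanged."""

    def get_text_from_content(self, title, text):
        return text


class SqlSearchEngine(SearchEngine):
    """SQL-backed engine: registered hooks may rewrite text before indexing."""

    text_hooks = []  # callables taking (title, text) and returning text

    def get_text_from_content(self, title, text):
        for hook in self.text_hooks:
            text = hook(title, text)
        return text


class ExternalSearchEngine(SearchEngine):
    """Stand-in for an engine that stores file text outside page content."""

    def get_text_from_content(self, title, text):
        # Text hooks are deliberately not consulted here.
        return text


def append_extracted_file_text(title, text):
    """Stand-in for a Tika-style extractor (the real one is a separate process)."""
    if title.startswith("File:"):
        return text + "\n[extracted file text]"
    return text


# The extension registers its hook once; only the sql-backed engine uses it.
SqlSearchEngine.text_hooks.append(append_extracted_file_text)
```

With this arrangement an extension only registers a hook and never needs to know which engine is active, while engines that handle file text their own way are unaffected.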