Need a way to modify text before indexing (was SearchUpdate)

vitalif
Hi!

Change https://gerrit.wikimedia.org/r/#/c/79025/ that was merged to 1.22
breaks my TikaMW extension - I used that hook to extract contents from
binary files so the user can then search on it.

Maybe you can add some other hook for this purpose?

See also https://github.com/mediawiki4intranet/TikaMW/issues/2

_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: Need a way to modify text before indexing (was SearchUpdate)

Chad
On Tue, Jan 14, 2014 at 2:33 PM, <[hidden email]> wrote:

> Hi!
>
> Change https://gerrit.wikimedia.org/r/#/c/79025/ that was merged to 1.22
> breaks my TikaMW extension - I used that hook to extract contents from
> binary files so the user can then search on it.
>
> Maybe you can add some other hook for this purpose?
>
> See also https://github.com/mediawiki4intranet/TikaMW/issues/2
>
>
SearchEngine subclasses can implement getTextFromContent() if they want
to override the normal text fetching behavior.

-Chad
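
The subclass approach described above can be sketched roughly as follows. This is a language-neutral illustration in Python, not MediaWiki's PHP API: only the idea of overriding getTextFromContent() comes from the message, while the class names, the method signature, and the `fake_extractor` helper are assumptions made for the example.

```python
class SearchEngine:
    """Stand-in for a search backend base class (hypothetical, not MediaWiki's)."""

    def get_text_from_content(self, title, content):
        # Default behaviour: index the page content as-is.
        return content or ""


class FileAwareSearchEngine(SearchEngine):
    """Hypothetical subclass that changes how text is fetched for File: pages."""

    def __init__(self, extract_file_text):
        # extract_file_text: callable taking a title and returning extracted
        # text, e.g. a wrapper around an external extractor (assumption).
        self.extract_file_text = extract_file_text

    def get_text_from_content(self, title, content):
        text = super().get_text_from_content(title, content)
        if title.startswith("File:"):
            extracted = self.extract_file_text(title)
            if extracted:
                text = (text + "\n" + extracted).strip()
        return text


def fake_extractor(title):
    # Placeholder for a real extractor; returns canned text for one file.
    return {"File:Report.pdf": "quarterly revenue figures"}.get(title, "")


engine = FileAwareSearchEngine(fake_extractor)
indexed_file = engine.get_text_from_content("File:Report.pdf", "Uploaded report")
indexed_page = engine.get_text_from_content("Main Page", "Welcome")
```

Note that the override lives inside one particular search backend, so each backend would need its own subclass to get the same behaviour.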

Re: Need a way to modify text before indexing (was SearchUpdate)

vitalif
> SearchEngine subclasses can implement getTextFromContent() if they want  
> to override the normal text fetching behavior.

I can't put it into SearchEngine subclass because Tika isn't a search  
engine, it's rather a java application that runs separately and extracts  
text from binary files like *.doc, *.pdf and so on.

TikaMW is a plugin that should work with any search engine - it just  
modifies indexed text for pages in File: namespace.

--
With best regards,
   Vitaliy Filippov
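
For comparison, the backend-independent behaviour requested in this message might look like the sketch below: a callback that runs before indexing and only touches pages in the File: namespace. It is an illustration in Python under stated assumptions, not the removed hook's actual signature. The handler registry, `run_before_index`, and the `EXTRACTED` table are all hypothetical; the only MediaWiki fact used is that the File: namespace has number 6.

```python
# Hypothetical hook registry: each handler receives (namespace, title, text)
# and returns the text that should be passed on to the index.
before_index_handlers = []

FILE_NAMESPACE = 6  # number of the File: namespace in MediaWiki

# Canned extractor output standing in for a separate extraction service.
EXTRACTED = {"Report.pdf": "quarterly revenue figures"}


def run_before_index(namespace, title, text):
    # Pass the text through every registered handler in order.
    for handler in before_index_handlers:
        text = handler(namespace, title, text)
    return text


def append_extracted_text(namespace, title, text):
    # Leave every page outside the File: namespace untouched.
    if namespace != FILE_NAMESPACE:
        return text
    extracted = EXTRACTED.get(title, "")
    return text + "\n" + extracted if extracted else text


before_index_handlers.append(append_extracted_text)

file_text = run_before_index(FILE_NAMESPACE, "Report.pdf", "Uploaded report")
article_text = run_before_index(0, "Main Page", "Welcome")
```

Because the handler never refers to a search backend, the same extension code would apply to whichever engine consumes the returned text.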


Re: Need a way to modify text before indexing (was SearchUpdate)

Chad
On Wed, Jan 15, 2014 at 12:07 AM, Vitaliy Filippov <[hidden email]> wrote:

>> SearchEngine subclasses can implement getTextFromContent() if they want to
>> override the normal text fetching behavior.
>
> I can't put it into SearchEngine subclass because Tika isn't a search
> engine, it's rather a java application that runs separately and extracts
> text from binary files like *.doc, *.pdf and so on.
>
> TikaMW is a plugin that should work with any search engine - it just
> modifies indexed text for pages in File: namespace.
>
>
The problem is you can't make that assumption. Different search indexes treat
text in different ways, and munging them into the same content field won't
allow them to do the right thing. With lsearchd and Elasticsearch, we
absolutely wouldn't want to munge file text into page content (with sql-backed
search, you might maybe).

Most of the code in the SearchEngine and related classes is infrastructure for
the sql-backed options, leaving MWSearch and CirrusSearch to reinvent a lot of
wheels. If we cleaned this up a bunch (I want to do this anyway, but time) we
might be able to add a hook back in that only affects search engines that are
implementing the core sql search...

-Chad
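
The point about separate fields can be illustrated with a toy example. This is a sketch in Python under stated assumptions, not how lsearchd, Elasticsearch, or CirrusSearch actually score documents: the field names `text` and `file_text`, the weights, and the term-counting `score` function are all invented for illustration.

```python
def build_merged_document(page_text, file_text):
    # Everything in one field: the backend can no longer tell the sources apart.
    return {"text": (page_text + "\n" + file_text).strip()}


def build_fielded_document(page_text, file_text):
    # Separate fields: the backend can analyze and weight each one differently.
    return {"text": page_text, "file_text": file_text}


def score(document, term, weights):
    # Toy scoring: weighted count of exact term occurrences per field.
    total = 0.0
    for field, value in document.items():
        hits = value.lower().split().count(term.lower())
        total += weights.get(field, 1.0) * hits
    return total


page_text = "Budget report upload"
file_text = "budget budget budget forecast"
weights = {"text": 2.0, "file_text": 0.5}  # hypothetical per-field boosts

merged_score = score(build_merged_document(page_text, file_text), "budget", weights)
fielded_score = score(build_fielded_document(page_text, file_text), "budget", weights)
```

With the merged document, all four occurrences of "budget" receive the page-text weight (score 8.0). With separate fields, the three occurrences inside the file are weighted down (score 3.5), which is the kind of distinction a backend loses once file text is merged into page content.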