Need a way to modify text before indexing (was SearchUpdate)

vitalif
I wrote about my problem ~2 years ago:

http://wikitech-l.wikimedia.narkive.com/6G0YPmWQ/need-a-way-to-modify-text-before-indexing-was-searchupdate

It seems I missed the last message, so I want to reply to it now:

> With lsearchd and Elasticsearch, we absolutely wouldn't want to munge
> file text into page content (with sql-backed search, you might maybe).

Why? Aren't these also just full-text search backends? As I
understand it, they're much faster than SQL-backed search engines. What
would prevent them from storing file text?

Personally I use Sphinx (http://sphinxsearch.com) with TikaMW, and of
course everything is fine.

_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
Re: Need a way to modify text before indexing (was SearchUpdate)

Federico Leva (Nemo)
FWIW, we do index the full text of (PDF and?) DjVu files on Commons
(because it's stored in img_metadata). It's probably the biggest
improvement CirrusSearch brought for Commons.

Nemo

Re: Need a way to modify text before indexing (was SearchUpdate)

vitalif
> FWIW, we do index the full text of (PDF and?) DjVu files on Commons
> (because it's stored in img_metadata). It's probably the biggest
> improvement CirrusSearch brought for Commons.

And we also index office documents via Tika (*.doc and similar).

And I think it should not be a feature of the search engine at all! It's
a separate feature that's completely independent of the search engine
used (that's how it's implemented in my TikaMW).
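
To illustrate the separation argued for above, here is a minimal sketch in
Python rather than PHP. Every name in it (Indexer, InMemoryBackend,
append_file_text, EXTRACTED) is hypothetical and is not part of MediaWiki,
TikaMW, or any search backend. It only assumes that the indexing step lets
registered filters rewrite the text before the backend sees it.

```python
from typing import Callable, Dict, List

# Hypothetical names throughout; none of this is MediaWiki or TikaMW API.
TextFilter = Callable[[str, str], str]  # (title, text) -> modified text


class InMemoryBackend:
    """Stand-in for any full-text backend (Sphinx, Elasticsearch, SQL)."""

    def __init__(self) -> None:
        self.docs: Dict[str, str] = {}

    def index(self, title: str, text: str) -> None:
        self.docs[title] = text

    def search(self, word: str) -> List[str]:
        # Naive case-insensitive substring match, enough for the sketch.
        needle = word.lower()
        return sorted(t for t, body in self.docs.items() if needle in body.lower())


class Indexer:
    """Runs registered filters over the text, then hands it to the backend."""

    def __init__(self, backend: InMemoryBackend) -> None:
        self.backend = backend
        self.filters: List[TextFilter] = []

    def register(self, text_filter: TextFilter) -> None:
        self.filters.append(text_filter)

    def update(self, title: str, text: str) -> None:
        for text_filter in self.filters:
            text = text_filter(title, text)
        self.backend.index(title, text)


# Placeholder for text already extracted from uploaded files (e.g. by Tika).
EXTRACTED = {"File:Report.doc": "quarterly revenue figures"}


def append_file_text(title: str, text: str) -> str:
    """Append extracted file text, if any, to the page text."""
    extra = EXTRACTED.get(title)
    return f"{text}\n{extra}" if extra else text


backend = InMemoryBackend()
indexer = Indexer(backend)
indexer.register(append_file_text)
indexer.update("File:Report.doc", "Uploaded by an editor.")
indexer.update("Main Page", "Welcome to the wiki.")
```

In this sketch the backend never learns where the extra text came from, so
swapping InMemoryBackend for another engine would leave the filter untouched.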

So, is there any replacement for the SearchUpdate hook to modify the
indexed text?

Of course I can just bring SearchUpdate back by including a patch in
our distribution, mediawiki4intranet, but I would prefer that TikaMW not
require patching...