Request for Comments: New Search

Nikolas Everett
So Chad and I feel like we've gotten far enough in our prototype of our new
search backend for MediaWiki that we're ready to request comments.  So here
is our formal RFC:
https://www.mediawiki.org/wiki/Requests_for_comment/CirrusSearch

You'll note that the plugin is called CirrusSearch.  SolrSearch seems to
have been taken by an unrelated project so we had to pick a different name.

Please read and comment in whatever way is normal for these things.

Thanks so much for your attention,

Nik Everett
_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: Request for Comments: New Search

Nikolas Everett
Everyone,

I'm reviving this old thread to update everyone on the status of the RFC:

We've continued working on implementation and everything seems to be
proceeding smoothly.  We evaluated Elasticsearch and were super impressed
and decided it was very likely to be worth switching from Solr4 to it.  The
evaluation and the switch did cost some time but in my opinion doing it was
time well spent.

Thanks so much for your comments a month ago when I first posted this. If
you are interested please give the page another look.  Just to be helpful,
here is a link to what I changed:
http://www.mediawiki.org/w/index.php?title=Requests_for_comment%2FCirrusSearch&diff=740790&oldid=728213

Nik Everett

On Fri, Jun 14, 2013 at 4:21 PM, Nikolas Everett <[hidden email]> wrote:

> So Chad and I feel like we've gotten far enough in our prototype of our
> new search backend for MediaWiki that we're ready to request comments.  So
> here is our format RFC:
> https://www.mediawiki.org/wiki/Requests_for_comment/CirrusSearch
>
> You'll note that the plugin is called CirrusSearch.  SolrSearch seems to
> have been taken by an unrelated project so we had to pick a different name.
>
> Please read and comment in whatever way is normal for these things.
>
> Thanks so much for your attention,
>
> Nik Everett
>

Re: Request for Comments: New Search

C. Scott Ananian
I wonder if there are queries or use cases we can support that *aren't*
already better handled by google.  Granted, users of private wikis can't
simply use the 'site:' trick to reuse Google search results -- but users of
private wikis also probably don't need superduper scalability.

Trying to brainstorm here, not start a flame war.  What sorts of useful
searches could we excel at?  (Maybe these are searches/use cases that will
facilitate editor engagement?)
 --scott

--
(http://cscott.net)

Re: Request for Comments: New Search

Nikolas Everett
Scott,

I was going to respond to this a while ago but couldn't really do it
justice.  I'm still pretty sure my explanation won't be great, which is an
indication of just how good Google is.

For straight search there is nothing we can do that Google can't.  It might
cost them more time and money to make searching mediawiki awesome but they
have lots of both so we're just not going to beat them there.  There are a few
things that we can do more easily/cheaply than Google:
1.  We can update our search index right when changes are made including
when changes are made to transcluded pages.
2.  We can search based on redirects to a page.
3.  We can filter (and maybe one day facet) based on categories.
4.  We could search based on citations.
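Points 2 and 3 above could end up in a single query body.  Here is a rough
sketch in Python, assuming hypothetical index fields named title, redirect,
text and category (this thread does not describe the actual CirrusSearch
mapping).  The dict follows the shape of the bool query found in current
Elasticsearch releases; it only builds the request body and does not talk to
a server.

```python
import json


def build_query(terms, category=None):
    """Build an Elasticsearch-style query body as a plain dict.

    The field names (title, redirect, text, category) are hypothetical.
    """
    query = {
        "bool": {
            "must": [
                {
                    "multi_match": {
                        "query": terms,
                        # Matching on "redirect" lets a page be found by
                        # the titles that redirect to it.
                        "fields": ["title", "redirect", "text"],
                    }
                }
            ]
        }
    }
    if category is not None:
        # A filter narrows the result set without affecting scoring.
        query["bool"]["filter"] = [{"term": {"category": category}}]
    return {"query": query}


if __name__ == "__main__":
    body = build_query("solar eclipse", category="Astronomy")
    print(json.dumps(body, indent=2))
```

Keeping the category restriction in the filter clause rather than in must
means it narrows the results without changing how they are ranked.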

We will, on the other hand, be better about listening to what the community
needs with regard to search.  Part of the problem here is that
historically we've let search languish and my first foray into making
search nicer isn't going to provide much new stuff for the community.
Instead it's a solid platform on which to build things that the community
needs and which should make search less exciting for operations engineers.
That really isn't exciting for the community to hear and for that I am
sorry.  I can only promise that we'll do more later.

There are some deeper integrations into mediawiki that I don't see
google doing but that we could work on in the future:
1.  We could create a section that allowed users to easily find "similar"
pages.  I'm a little fuzzy on exactly how we'd calculate similarity.
2.  We could automatically dig around in Commons for useful media for an
article.  We could use this to automatically provide extra media which
might be relevant or as a curation aid.  On second thought the curation aid
sounds much better.
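
On the open question of how "similar" might be calculated in point 1, one
very simple possibility is cosine similarity over word counts.  The sketch
below is only an illustration of that idea, not something CirrusSearch is
known to do; the function names and the sample pages are made up.  A real
search engine would more likely weight terms by rarity and work from the
index rather than from raw text.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text and count its word tokens."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors (0.0 to 1.0)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar(title, pages):
    """Rank every other page by similarity to the named page, best first."""
    target = term_counts(pages[title])
    scores = [
        (cosine_similarity(target, term_counts(text)), other)
        for other, text in pages.items()
        if other != title
    ]
    return sorted(scores, reverse=True)


if __name__ == "__main__":
    pages = {
        "Cat": "cats are small furry animals",
        "Dog": "dogs are furry animals",
        "Rocket": "rockets fly to space",
    }
    for score, other in most_similar("Cat", pages):
        print(f"{other}: {score:.2f}")
```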

Actually, some kind of game around tagging media as relevant to an article
might be quite a decent way to encourage engagement.  By game I mean
something like Galaxy Zoo or LinkedIn's endorsements.  You could do this
without a nice search but it'd help produce much more relevant results.

And then there is the cynic in me that says that it is worth doing just so
we aren't reliant on external (corporate) entities.  I'm really not sure
how I would feel if the only way to find stuff on WMF's wikis was with
Google/Bing/Yahoo....

Finally we have the private wikis like you mentioned - they mostly can't
use google.  We are trying to make sure CirrusSearch works for them.  The
idea there is to provide something that is better at finding results than
the database-based search because it uses the same analysis that we've
optimized for WMF.  Elasticsearch isn't some kind of precision-tuned
machine - you can actually get quite decent behaviour out of downloading
the deb or rpm and installing it.  You only really need one instance.

So now that I've created this wall of text I don't feel that I've really
answered your question well, but I've answered it.  That is the thing about
hard questions: they are harder to answer than to ask.

I'd really love more brainstorming.  Cross wiki search was another good
idea someone added to the page a while ago.

Nik

On Fri, Jul 19, 2013 at 2:24 PM, C. Scott Ananian <[hidden email]> wrote:

> I wonder if there are queries or use cases we can support that *aren't*
> already better handled by google.  Granted, users of private wikis can't
> simply use the 'site:' trick to reuse Google search results -- but users of
> private wikis also probably don't need superduper scalability.
>
> Trying to brainstorm here, not start a flame war.  What sorts of useful
> searches could we excel at?  (Maybe these are searches/use cases that will
> facilitate editor engagement?)
>  --scott
>
> --
> (http://cscott.net)

Re: Request for Comments: New Search

C. Scott Ananian
It seems like there are also a bunch of hacky search-alike features built
into the mediawiki database.  For example, "all pages linking to this
page", "my contributions", etc.  From a code cleanup standpoint, it would
also be worthwhile if these were all unified and brought together under a
single search engine.

It would be really nice if the search engine allowed me to make these sorts
of queries in a query language, so that I could combine features.  "All
pages which I have contributed to which link to Foo.jpg and have the word
Bar in them", for example.
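
To make the idea of combinable queries concrete, here is a toy in-memory
sketch in Python.  Everything in it (the Page class, the predicate helpers,
the sample data) is hypothetical and only shows how separate features could
compose into one query; it says nothing about how MediaWiki or any search
backend would implement this.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    title: str
    text: str
    contributors: set = field(default_factory=set)
    links: set = field(default_factory=set)


def contributed_by(user):
    """Match pages the given user has edited."""
    return lambda page: user in page.contributors


def links_to(target):
    """Match pages that link to the given target."""
    return lambda page: target in page.links


def contains_word(word):
    """Match pages whose text contains the word (case-insensitive)."""
    return lambda page: word.lower() in page.text.lower().split()


def all_of(*predicates):
    """Combine predicates with logical AND."""
    return lambda page: all(p(page) for p in predicates)


def search(pages, predicate):
    """Return the titles of the pages that satisfy the predicate."""
    return [page.title for page in pages if predicate(page)]


if __name__ == "__main__":
    pages = [
        Page("A", "Foo meets Bar", {"scott"}, {"Foo.jpg"}),
        Page("B", "Bar alone", {"nik"}, {"Foo.jpg"}),
        Page("C", "No match here", {"scott"}, {"Foo.jpg"}),
    ]
    query = all_of(
        contributed_by("scott"), links_to("Foo.jpg"), contains_word("Bar")
    )
    print(search(pages, query))
```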

This would potentially simplify the codebase as well as provide a
capability google.com does not.
 --scott

--
(http://cscott.net)

Re: Request for Comments: New Search

Robert Elwell
In reply to this post by Nikolas Everett
Hi Nik,

Just a quick comment on choosing ElasticSearch over Solr:

We use Solr at Wikia, and we have a lot we can offer the Foundation in terms of knowledge sharing. It might be a good idea to consider future opportunities to collaborate while vetting ElasticSearch.

Even if ElasticSearch is your final call, you may still be able to use some of the code from our Search extension (https://github.com/wikia/app/tree/dev/extensions/wikia/Search). It uses the Solarium library for query abstraction, and I'm wondering if adding ElasticSearch support to that library and starting with some of the libraries we've written might get you most of the way there in your CirrusSearch efforts.

And code aside, both solutions have very similar engines behind them. When it comes to generating schemata, analyzing fields, handling language support, scaling, or backend architecture, please feel free to reach out. We'd love to help.

Robert Elwell

On Jul 19, 2013, at 5:14 PM, Rob Lanphier <[hidden email]> wrote:

> Everyone,
>
> I'm reviving this old thread to update everyone on the status of the RFC:
>
> We've continued working on implementation and everything seems to be
> proceeding smoothly.  We evaluated Elasticsearch and were super impressed
> and decided it was very likely to be worth switching from Solr4 to it.  The
> evaluation and the switch did cost some time but in my opinion doing it was
> time well spent.
>
> Thanks so much for your comments a month ago when I first posted this. If
> you are interested please give the page another look.  Just to be helpful,
> here is a link to what I changed:
> http://www.mediawiki.org/w/index.php?title=Requests_for_comment%2FCirrusSearch&diff=740790&oldid=728213
>
> Nik Everett

