Search query returns missing pages


Simon Lehmann
Hello list,

I am wondering if the search module should return pages that don't
exist. If I am searching for something, I probably want to find
something that already exists, especially if I use srwhat=text. I don't
even know where the API gets the text to search in, for a page that
doesn't even exist.

Just look at the example:

http://en.wikipedia.org/w/api.php?format=xml&action=query&gsrsearch=Does%20not%20exist&generator=search&gsrnamespace=0
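
For anyone who wants to reproduce the request from a script, here is a minimal sketch of how the same URL could be assembled. The function name and constant are hypothetical; only the parameters already present in the URL above are used, and the code builds the string without sending anything.

```python
from urllib.parse import urlencode, quote

# Endpoint taken from the example URL in this message.
API_ENDPOINT = "http://en.wikipedia.org/w/api.php"


def build_search_url(search_text, namespace=0):
    """Build a generator=search query URL (hypothetical helper)."""
    params = {
        "format": "xml",
        "action": "query",
        "generator": "search",
        "gsrsearch": search_text,
        "gsrnamespace": namespace,
    }
    # quote_via=quote encodes spaces as %20 rather than "+".
    return API_ENDPOINT + "?" + urlencode(params, quote_via=quote)


print(build_search_url("Does not exist"))
```

The parameter order differs from the hand-written URL, which should not matter to the API.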

Besides that, it also seems to find stuff that wouldn't even belong in
the main namespace, even if it existed, like:

 - Http://en.wikipedia.org/wiki/Talk:State-sponsored terrorism by the
United States/Archive 9 (This isn't even a valid title)

 - User tаIk:Jj137/Archivе 5

Is this desired behaviour or is it a bug?

Simon Lehmann




_______________________________________________
Mediawiki-api mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-api

Re: Search query returns missing pages

Ran Ari-Gur-2
Those are all pages that have been deleted. Deleted pages remain in
the database so they can be undeleted. (I'm not sure if they get
purged eventually or not.)

Your examples are, in fact, all in the main namespace, since the
namespaces Http: and User taIk: (note: that's a capital "i", not a
lowercase "L") do not currently exist.

I don't think the search should be returning deleted pages.

-Ran
(User:Ruakh)



Re: Search query returns missing pages

Stephen Bain
In reply to this post by Simon Lehmann
On Mon, Aug 11, 2008 at 9:46 PM, Simon Lehmann <[hidden email]> wrote:

>
> I am wondering if the search module should return pages that don't
> exist. If I am searching for something, I probably want to find
> something that already exists, especially if I use srwhat=text. I don't
> even know where the API gets the text to search in, for a page that
> doesn't even exist.
>
> Just look at the example:
>
> http://en.wikipedia.org/w/api.php?format=xml&action=query&gsrsearch=Does%20not%20exist&generator=search&gsrnamespace=0

The first few in the results there have been deleted. No "pageid"
attribute and the presence of the "missing" attribute indicates a
deleted page. The question of whether such entries should be returned
by default, as seems to follow from your observation, is still open,
but the software didn't make this up, it's just from a deleted
revision.
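
On the client side, the check described above could be sketched roughly as follows. This is only an illustration: it assumes the request was made with format=json instead of format=xml and that the "pages" part of the response has already been parsed into a dict. The sample data and the function name are made up, not output from the live API.

```python
def split_existing_and_missing(pages):
    """Separate page titles by whether the entry looks like a real page.

    An entry is treated as missing if it carries a "missing" key or
    has no "pageid" key.
    """
    existing, missing = [], []
    for page in pages.values():
        if "missing" in page or "pageid" not in page:
            missing.append(page["title"])
        else:
            existing.append(page["title"])
    return existing, missing


# Hypothetical sample modelled on the shape of a query result.
sample_pages = {
    "-1": {"ns": 0, "title": "Does not exist", "missing": ""},
    "12345": {"pageid": 12345, "ns": 0, "title": "Existence"},
}

existing, missing = split_existing_and_missing(sample_pages)
print("existing:", existing)
print("missing:", missing)
```

A client that only wants live pages can simply discard the second list.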

> Besides that, it also seems to find stuff that wouldn't even belong in
> the main namespace, even if it existed, like:
...

The ones that look like user talk pages had been moved, and the move
destination was misspelled ("User taIk", with a capital I instead of a
lowercase l - you may need a serif font to see the difference). If the
software doesn't recognise the namespace, then it treats it as if
there is simply a colon in the title and puts it in the mainspace.
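
To illustrate that fallback, here is a simplified sketch. It is not MediaWiki's actual title parser; the namespace table is a small hypothetical subset and the function name is invented for this example.

```python
# Hypothetical subset of a wiki's namespace table (lower-cased name -> id).
KNOWN_NAMESPACES = {"talk": 1, "user": 2, "user talk": 3}
MAIN_NAMESPACE = 0


def resolve_namespace(title):
    """Return (namespace_id, page_name) for a title string.

    If the text before the first colon is not a known namespace, the
    whole title, colon included, stays in the main namespace.
    """
    prefix, colon, rest = title.partition(":")
    key = prefix.replace("_", " ").strip().lower()
    if colon and key in KNOWN_NAMESPACES:
        return KNOWN_NAMESPACES[key], rest
    return MAIN_NAMESPACE, title


print(resolve_namespace("User talk:Jj137/Archive 5"))   # lowercase l
print(resolve_namespace("User taIk:Jj137/Archive 5"))   # capital I
print(resolve_namespace("Http://en.wikipedia.org/wiki/Talk:Example"))
```

The first title resolves to the user talk namespace, while the misspelled one and the "Http:" one both fall through to the main namespace with their full text as the page name.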

--
Stephen Bain
[hidden email]


Re: Search query returns missing pages

Simon Lehmann
On Monday, 11.08.2008, at 23:04 +1000, Stephen Bain wrote:

> On Mon, Aug 11, 2008 at 9:46 PM, Simon Lehmann <[hidden email]> wrote:
> >
> > I am wondering if the search module should return pages that don't
> > exist. If I am searching for something, I probably want to find
> > something that already exists, especially if I use srwhat=text. I don't
> > even know where the API gets the text to search in, for a page that
> > doesn't even exist.
> >
> > Just look at the example:
> >
> > http://en.wikipedia.org/w/api.php?format=xml&action=query&gsrsearch=Does%20not%20exist&generator=search&gsrnamespace=0
>
> The first few in the results there have been deleted. No "pageid"
> attribute and the presence of the "missing" attribute indicates a
> deleted page. The question of whether such entries should be returned
> by default, as seems to follow from your observation, is still open,
> but the software didn't make this up, it's just from a deleted
> revision.
I am not quite sure if "no id and missing" is indicating a deleted page.
If I send a query with a title that never existed it also returns a page
without an id and is marked as missing, so it can't be distinguished
from a deleted page (or vice versa).

But my point was, as you have already said, that by default it shouldn't
return missing pages at all, no matter if they never have existed or
don't exist anymore. It would be good to have an option which toggles
the inclusion of deleted or even missing pages. The latter would
obviously only make sense with a title-search.

>
> > Besides that, it also seems to find stuff that wouldn't even belong in
> > the main namespace, even if it existed, like:
> ...
>
> The ones that look like user talk pages had been moved, and the move
> destination was misspelled ("User taIk", with a capital I instead of a
> lowercase l - you may need a serif font to see the difference). If the
> software doesn't recognise the namespace, then it treats it as if
> there is simply a colon in the title and puts it in the mainspace.
Now that you mention it, I can see that too.

Simon Lehmann



Re: Search query returns missing pages

Roan Kattouw
In reply to this post by Simon Lehmann
---- Simon Lehmann <[hidden email]> wrote:
> I am not quite sure if "no id and missing" is indicating a deleted page.
The "missing" attribute definitely indicates a non-existent page. That convention is followed throughout the API.

> If I send a query with a title that never existed it also returns a page
> without an id and is marked as missing, so it can't be distinguished
> from a deleted page (or vice versa).
The distinction between deleted pages and pages that never existed does not exist in MediaWiki (for people who don't have the right to view deleted revisions, that is). Pages either exist right now, or they don't.

> But my point was, as you have already said, that by default it shouldn't
> return missing pages at all, no matter if they never have existed or
> don't exist anymore.
It's probably a good idea to drop missing pages. The reason they show up at all is that Wikipedia uses a Lucene-based search extension whose index is only updated periodically. This means recently deleted pages stay in the search index until the next update. The standard MW search doesn't have this "bug".

Roan Kattouw (Catrope)
