"see also" parser strange output


Max Vlasov
Hi, 

I'm using a two-step approach to retrieve the links in an article's "See also" section: first I retrieve the list of sections, then I pass the section index to the links action.
Today I noticed that for the article "Synchronous programming language" this gives unexpected results.

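The page

https://en.wikipedia.org/w/api.php?action=parse&prop=sections&page=Synchronous%20programming%20language&format=xml

returns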
...  <s toclevel="1" level="2" line="See also" number="4" index="4" .....

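and the following query

https://en.wikipedia.org/w/api.php?action=parse&prop=links&page=Synchronous%20programming%20language&section=4&format=xml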

gives a long list of links that is very different from the correct list shown in the Wikipedia article. The same algorithm works correctly for other articles.
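For reference, the two steps described above could be sketched roughly as follows. This is only an illustration, not the poster's actual code: the function names and the abbreviated sample response are made up, and the URLs use only the parameters visible in the two queries quoted in this thread.

```python
import xml.etree.ElementTree as ET
from urllib.parse import quote, urlencode

API = "https://en.wikipedia.org/w/api.php"


def sections_url(page):
    """Step 1: URL that asks action=parse for the section list of a page."""
    params = {"action": "parse", "prop": "sections", "page": page, "format": "xml"}
    return API + "?" + urlencode(params, quote_via=quote)


def links_url(page, section_index):
    """Step 2: URL that asks for the links of a single section."""
    params = {"action": "parse", "prop": "links", "page": page,
              "section": section_index, "format": "xml"}
    return API + "?" + urlencode(params, quote_via=quote)


def find_section_index(sections_xml, title="See also"):
    """Return the index attribute of the first <s> element whose line matches title."""
    for s in ET.fromstring(sections_xml).iter("s"):
        if s.get("line", "").strip().lower() == title.lower():
            return s.get("index")
    return None


# Abbreviated, hand-written sample in the shape of the response shown above.
SAMPLE = ('<api><parse><sections>'
          '<s toclevel="1" level="2" line="See also" number="4" index="4" />'
          '</sections></parse></api>')

index = find_section_index(SAMPLE)
print(links_url("Synchronous programming language", index))
```

Fetching the two URLs is left out on purpose, so the sketch runs without network access.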

Is this a bug?

Thanks

Max



_______________________________________________
Mediawiki-api mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-api

Re: "see also" parser strange output

bawolff
Presumably this is because the infobox is at the end of the article,
so is counted as being part of the final section, which is the see
also section. So you are also getting everything in
https://en.wikipedia.org/wiki/Template:Types_of_programming_languages.

You could maybe adjust this algorithm to look for any pagelinks in ns
10 (the Template namespace), look at all the things that they link to,
and subtract them from your results, although I doubt that will work
perfectly.
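A minimal sketch of that subtraction idea, assuming the section links have already been fetched as (namespace, title) pairs and that the links of each template have been looked up separately. The function name, the data layout and all sample titles are hypothetical.

```python
TEMPLATE_NS = 10  # namespace number of Template: pages


def subtract_template_links(section_links, links_of_template):
    """Drop links that also appear in any template linked from the section.

    section_links: list of (namespace, title) pairs returned for the section.
    links_of_template: dict mapping a template title to the titles it links to.
    """
    templates = [title for ns, title in section_links if ns == TEMPLATE_NS]
    from_templates = set()
    for template in templates:
        from_templates.update(links_of_template.get(template, ()))
    # Keep only article-namespace links that no template accounts for.
    return [title for ns, title in section_links
            if ns == 0 and title not in from_templates]


# Made-up data: three hand-written links plus two that come from a navbox.
section_links = [
    (0, "Example A"), (0, "Example B"), (0, "Example C"),
    (0, "Navbox item 1"), (0, "Navbox item 2"),
    (10, "Template:Hypothetical navbox"),
]
links_of_template = {
    "Template:Hypothetical navbox": ["Navbox item 1", "Navbox item 2", "Example C"],
}

result = subtract_template_links(section_links, links_of_template)
print(result)  # ['Example A', 'Example B']
```

Note that "Example C" is lost even though it was written by hand, because the navbox links to it too. This is one way the approach can fail to work perfectly.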

You could also try fetching the wikitext of the see also section, and
attempting to parse it just for a list of links, but that's probably
hard to get right.
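As a rough illustration of the wikitext route, the sketch below keeps only links found on bulleted lines, so a navbox transclusion at the end of the section is ignored. It assumes the section wikitext has already been fetched (for example with prop=wikitext of action=parse). The regular expression is a simplification that does not handle nested templates, comments or links outside list items, and the sample text is invented.

```python
import re

# [[Target]], [[Target|label]] or [[Target#anchor|label]]; captures Target only.
LINK_RE = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")


def see_also_links(wikitext):
    """Return link targets from bulleted lines, in order, without duplicates."""
    links = []
    for line in wikitext.splitlines():
        if not line.lstrip().startswith("*"):
            continue  # skip the heading, template calls such as navboxes, and prose
        for match in LINK_RE.finditer(line):
            title = match.group(1).strip()
            if title and title not in links:
                links.append(title)
    return links


SAMPLE = """==See also==
* [[Example A]]
* [[Example B (topic)|Example B]]
* [[Example C#History]]
{{Hypothetical navbox}}
"""

found = see_also_links(SAMPLE)
print(found)  # ['Example A', 'Example B (topic)', 'Example C']
```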

--
Brian

On Wed, Aug 23, 2017 at 2:32 PM, Max Vlasov <[hidden email]> wrote:

> Hi,
>
> I'm using two step approach to retrieve "see also" section links from
> articles by retrieving sections and using the section index for links
> action.
> Today I noticed that for the article "Synchronous programming language" this
> gives unexpected results.
>
> The page
>
> https://en.wikipedia.org/w/api.php?action=parse&prop=sections&page=Synchronous%20programming%20language&format=xml
>
> returns
> ...  <s toclevel="1" level="2" line="See also" number="4" index="4" .....
>
> and the following query
>
> https://en.wikipedia.org/w/api.php?action=parse&prop=links&page=Synchronous%20programming%20language&section=4&format=xml
>
> gives a long list of links very different to the actual correct list shown
> in the wikipedia article. Other usage of the same algorithm with other
> articles works correctly.
>
> Is this a bug?
>
> Thanks
>
> Max

Re: "see also" parser strange output

Brad Jorsch (Anomie)
In reply to this post by Max Vlasov
On Wed, Aug 23, 2017 at 10:32 AM, Max Vlasov <[hidden email]> wrote:
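
> I'm using two step approach to retrieve "see also" section links from
> articles by retrieving sections and using the section index for links
> action.
> Today I noticed that for the article "Synchronous programming language" this
> gives unexpected results.
>
> The page
>
> https://en.wikipedia.org/w/api.php?action=parse&prop=sections&page=Synchronous%20programming%20language&format=xml
>
> returns
> ...  <s toclevel="1" level="2" line="See also" number="4" index="4" .....
>
> and the following query
>
> https://en.wikipedia.org/w/api.php?action=parse&prop=links&page=Synchronous%20programming%20language&section=4&format=xml
>
> gives a long list of links very different to the actual correct list shown
> in the wikipedia article. Other usage of the same algorithm with other
> articles works correctly.
>
> Is this a bug?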

No, it's not a bug. The "See also" section on that article also happens to contain the navbox, so you're getting all the links from the navbox as well as the links you expect.

Most articles don't have this problem because the "References" and "External links" sections usually come after "See also", as described at https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Layout#ORDER. So the navboxes would wind up being part of those sections instead.
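One way a client might detect the risky case is to check whether anything follows "See also" at the same or a higher heading level. If nothing does, the section runs to the end of the page and may pick up a trailing navbox. The sketch below is an assumption about how such a check could look: the function name and the sample responses are made up, and only the `line` and `level` attributes shown in the sections output are used.

```python
import xml.etree.ElementTree as ET


def section_runs_to_end(sections_xml, title="See also"):
    """True if no later section has the same or a higher heading level."""
    sections = list(ET.fromstring(sections_xml).iter("s"))
    for pos, s in enumerate(sections):
        if s.get("line", "").strip().lower() == title.lower():
            level = int(s.get("level"))
            # Deeper levels are subsections and still belong to this section.
            return all(int(later.get("level")) > level
                       for later in sections[pos + 1:])
    return False


# Hand-written samples: "See also" last, and "See also" followed by "References".
LAST = ('<api><parse><sections>'
        '<s level="2" line="Overview" index="1" />'
        '<s level="2" line="See also" index="2" />'
        '</sections></parse></api>')
NOT_LAST = ('<api><parse><sections>'
            '<s level="2" line="See also" index="1" />'
            '<s level="2" line="References" index="2" />'
            '</sections></parse></api>')

print(section_runs_to_end(LAST), section_runs_to_end(NOT_LAST))  # True False
```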


--
Brad Jorsch (Anomie)
Senior Software Engineer
Wikimedia Foundation

Re: "see also" parser strange output

Max Vlasov
In reply to this post by bawolff
Oh, thanks, I see. So this is intentional. I once wanted to somehow exclude backlinks that originate from infoboxes; now I see that infoboxes also affect section links. I should probably do the parsing myself to control this.

Personally, I think the API should offer a way to control links that originate from infoboxes. I don't believe they represent a valuable information product carefully prepared by human beings. They give some information, but most of them might as well have been inserted automatically by an algorithm doing some word-count matching.

On Wed, Aug 23, 2017 at 5:41 PM, bawolff <[hidden email]> wrote:
> Presumably this is because the infobox is at the end of the article,
> so is counted as being part of the final section, which is the see
> also section. So you are also getting everything in
> https://en.wikipedia.org/wiki/Template:Types_of_programming_languages.


