Ignore template links?

Brian Keegan-2
Hi all,

I'm trying to scrape some data from en.wiki about the outlinks from the body of articles. However, the API also returns article outlinks contained within templates. While I can write a routine to get a list of all the templates and identify the article links inside them to remove from the outlinks, this is problematic if a link appears in both the body and a template. Thus if article X has a link to Y in the body as well as links to Y and Z in templates, I want to capture Y but not Z.

Ideally, I'd like to either (1) count the number of times an article links out to another article (e.g. if X links to Y twice) and then decrement this count for each appearance in a template, or (2) count only the links occurring in the body, without parsing the links in templates.

Thank you in advance for your suggestions!

Best,

Brian

_______________________________________________
Mediawiki-api mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/mediawiki-api

Re: Ignore template links?

Roan Kattouw-2
On Wed, Oct 10, 2012 at 2:15 PM, Brian Keegan <[hidden email]> wrote:

> Hi all,
>
> I'm trying to scrape some data from en.wiki about the outlinks from the body
> of articles. However, the API returns article outlinks contained within
> templates. While I can write a routine to get a list of all the templates
> and identify the article links inside these templates to remove from the
> outlinks, this is problematic if a link appears in both the body and a
> template. Thus if article X has a link to Y in the body as well as links to
> Y and Z in templates, I want to capture Y but not Y & Z.
>
> Ideally, I'd like to either (1) be able to count the number of times an
> article links out to another article (if X links to Y twice) and then
> iterate this count down for each appearance in a template or (2) count only
> the links occurring in the body and not parsing the links in templates.
>
> Thank you in advance for your suggestions!
>
Neither of these things is supported by the API, because the
underlying functionality in MediaWiki (the links tables and the
ParserOutput metadata) doesn't provide or store this information. You
would have to do some kind of processing of your own to get this
information.
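
As one possible form of that processing, here is a minimal Python sketch that counts how often each target is linked in an article's raw wikitext. It relies on the fact that links produced by expanding a template do not appear literally in the page source (only links passed as template arguments do). The regex, the helper names `normalize_title` and `count_wikilinks`, and the sample text are illustrative assumptions, not anything the API provides, and the sketch does not treat File:, Category: or interwiki links specially.

```python
import re
from collections import Counter

# Matches [[Target]], [[Target|label]] and [[Target#Section|label]].
# Assumption: this simple pattern is good enough for ordinary article links.
WIKILINK = re.compile(r"\[\[([^\[\]|#]+)(?:#[^\[\]|]*)?(?:\|[^\[\]]*)?\]\]")


def normalize_title(raw):
    """Approximate title normalization: trim, underscores to spaces,
    first letter upper-cased."""
    title = raw.strip().replace("_", " ")
    return title[:1].upper() + title[1:]


def count_wikilinks(wikitext):
    """Return a Counter mapping each link target to its number of occurrences."""
    return Counter(
        normalize_title(m.group(1)) for m in WIKILINK.finditer(wikitext)
    )


# Hypothetical sample: two body links to Y, one to Z, one template call.
sample = "[[Y]] is related to [[y|why]] and [[Z#History]]. {{Navbox X}}"
counts = count_wikilinks(sample)
```

The template call contributes nothing to `counts`, so whatever links it generates once expanded are left out, while repeated body links to the same target are counted separately.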

Roan

Re: Ignore template links?

b-jorsch
On Wed, Oct 10, 2012 at 02:34:14PM -0700, Roan Kattouw wrote:
> On Wed, Oct 10, 2012 at 2:15 PM, Brian Keegan <[hidden email]> wrote:
> >
> > (2) count only the links occurring in the body and not parsing the
> > links in templates.
>
> You would have to do some kind of processing of your own to get this
> information.

Since you're dealing with articles you shouldn't have to worry much
about parser functions within the actual body text, so you should be
able to get a decent approximation by just removing/replacing "{{"
throughout the wikitext and then passing it to action=parse&prop=links.
That would miss cases like {{see also|Foo}} which generates a link to
Foo that isn't entirely due to the template, but it might be enough for
your purposes.
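
A rough Python sketch of that approach, for illustration only: it drops every "{{" from the wikitext and builds the parameters for an action=parse request with prop=links. The helper names `neutralize_templates` and `build_parse_request`, the sample text, and the choice to pass `contentmodel=wikitext` and `format=json` are assumptions of this sketch. It only builds the request and does not send it.

```python
from urllib.parse import urlencode

API_URL = "https://en.wikipedia.org/w/api.php"


def neutralize_templates(wikitext):
    """Drop every "{{" so the parser no longer sees template calls.

    The leftover "}}" and template arguments remain as plain text.
    """
    return wikitext.replace("{{", "")


def build_parse_request(wikitext):
    """Build parameters asking the API to parse the modified text
    and report the links it contains."""
    return {
        "action": "parse",
        "prop": "links",
        "contentmodel": "wikitext",  # assumption: needed when no title is given
        "format": "json",
        "text": neutralize_templates(wikitext),
    }


# Hypothetical sample: one body link and two template calls.
sample = "[[Y]] in the body. {{Navbox with Y and Z}} {{see also|Foo}}"
params = build_parse_request(sample)
query = urlencode(params)  # encoded form of the parameters
```

For a full article the text would probably have to be sent as POST data to `API_URL` rather than in the query string, since it can easily exceed URL length limits.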
