Fwd: Reg. Research using Wikipedia

Fwd: Reg. Research using Wikipedia

David Gerard-2
Is there a standard answer to this question: how much are researchers
allowed to hammer the site?


- d.



---------- Forwarded message ----------
From: ramesh kumar <[hidden email]>
Date: 9 March 2011 09:47
Subject: Reg. Research using Wikipedia
To: [hidden email], [hidden email]


Dear Members,
I am Ramesh, pursuing my PhD at Monash University, Malaysia. My
research is on blog classification using Wikipedia categories.
For my experiment, I use the 12 main categories of Wikipedia, and I
want to identify which of those 12 main categories a particular
article belongs to.
So I wrote a program to collect the subcategories of each article and
classify it against the 12 categories offline.
I have already downloaded a wiki dump, which contains around 3 million
article titles.
My program takes these 3 million article titles, goes to the live
Wikipedia website, and fetches the subcategories.
Our university network administrators are worried that Wikipedia
would treat this as a DDoS attack and could block our IP address if my
program runs.
While searching for a way to get permission from Wikipedia, I found
that wikien-l members might be able to help me.
Could you please suggest whom to contact and what the procedure is to
get approval for our IP address to run this process, or offer any
other suggestions?
Eagerly waiting for a positive reply.
Thanks and regards,
Ramesh
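
For illustration, here is a rough sketch of the offline classification
step described above: a breadth-first walk up the category graph from
an article's own categories until one of the 12 main categories is
reached. The tiny in-memory graph, the category names, and the
classify() helper are placeholders, not part of the original program;
the real parent relations would come from the dumps or the API.

    from collections import deque

    # Placeholder stand-ins for the 12 main categories; toy category graph.
    MAIN_CATEGORIES = {"Technology", "Society", "Culture"}

    # article -> categories it is directly in (toy data)
    article_cats = {"Blog": ["Social media", "Websites"]}

    # category -> parent categories (toy data)
    parent_cats = {
        "Social media": ["Internet culture"],
        "Websites": ["Technology"],
        "Internet culture": ["Culture"],
    }

    def classify(article, max_depth=10):
        """Return the main categories reachable from the article's categories."""
        found = set()
        seen = set(article_cats.get(article, []))
        queue = deque((cat, 0) for cat in seen)
        while queue:
            cat, depth = queue.popleft()
            if cat in MAIN_CATEGORIES:
                found.add(cat)
                continue
            if depth >= max_depth:
                continue  # the category graph has cycles; cap the walk
            for parent in parent_cats.get(cat, []):
                if parent not in seen:
                    seen.add(parent)
                    queue.append((parent, depth + 1))
        return found

    print(classify("Blog"))  # {'Culture', 'Technology'} with the toy data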

_______________________________________________
Wikitech-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Re: Fwd: Reg. Research using Wikipedia

Roan Kattouw-2
2011/3/9 David Gerard <[hidden email]>:
> Is there a standard answer to this question: how much are researchers
> allowed to hammer the site?
>
If they use the API and wait for one request to finish before they
start the next one (i.e. don't make parallel requests), that's pretty
much always fine.
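
As a concrete illustration of that pattern, here is a minimal Python
sketch that fetches each article's categories one request at a time.
It assumes the third-party requests library and the standard English
Wikipedia API endpoint; the User-Agent string and the title list are
placeholders.

    import time
    import requests

    API = "https://en.wikipedia.org/w/api.php"
    # Identify the client; the contact address is a placeholder.
    HEADERS = {"User-Agent": "CategoryResearchBot/0.1 (researcher@example.edu)"}

    titles = ["Blog", "Malaysia", "Monash University"]  # placeholders for the dump titles

    session = requests.Session()
    for title in titles:
        params = {
            "action": "query",
            "prop": "categories",
            "titles": title,
            "cllimit": "max",
            "maxlag": 5,       # ask the servers to reject the request when they are lagged
            "format": "json",
        }
        while True:
            # One request at a time: the next request is only sent after
            # this one has returned.
            data = session.get(API, headers=HEADERS, params=params, timeout=30).json()
            if "error" in data and data["error"].get("code") == "maxlag":
                time.sleep(5)  # servers are busy; back off and retry
                continue
            break
        for page in data.get("query", {}).get("pages", {}).values():
            for cat in page.get("categories", []):
                print(title, "->", cat["title"])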

Roan Kattouw (Catrope)


Re: Fwd: Reg. Research using Wikipedia

Platonides
In reply to this post by David Gerard-2
> Dear Members,
> I am Ramesh, pursuing my PhD at Monash University, Malaysia. My
> research is on blog classification using Wikipedia categories.
> For my experiment, I use the 12 main categories of Wikipedia, and I
> want to identify which of those 12 main categories a particular
> article belongs to.
> So I wrote a program to collect the subcategories of each article and
> classify it against the 12 categories offline.
> I have already downloaded a wiki dump, which contains around 3 million
> article titles.
> My program takes these 3 million article titles, goes to the live
> Wikipedia website, and fetches the subcategories.

Why do you need to access the live Wikipedia for this?
Using categorylinks.sql and page.sql you should be able to fetch the
same data, probably faster.
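
For reference, once page.sql and categorylinks.sql have been imported
into a local MySQL database, the article-to-category mapping is a
single join. A rough sketch in Python, assuming the pymysql library, a
local database named enwiki, and placeholder credentials (all
assumptions, not part of this thread):

    import pymysql

    # Credentials and database name are placeholders.
    conn = pymysql.connect(host="localhost", user="wiki",
                           password="wikipass", database="enwiki")

    QUERY = """
        SELECT p.page_title, cl.cl_to
        FROM page AS p
        JOIN categorylinks AS cl ON cl.cl_from = p.page_id
        WHERE p.page_namespace = 0  -- namespace 0 = articles
    """

    # SSCursor streams rows instead of buffering millions of them in memory.
    with conn.cursor(pymysql.cursors.SSCursor) as cur:
        cur.execute(QUERY)
        for page_title, category in cur:
            # Dump columns are binary, so the values arrive as bytes.
            print(page_title.decode("utf-8"), "->", category.decode("utf-8"))

    conn.close()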



Re: Fwd: Reg. Research using Wikipedia

James Linden-3
> Why do you need to access the live Wikipedia for this?
> Using categorylinks.sql and page.sql you should be able to fetch the
> same data, probably faster.

In my research, the answer to this question is two-fold:

A) Creating a local copy of Wikipedia (using MediaWiki and various
import tools) is quite a process, and requires a significant
investment of time and research unto itself.

B) A few months ago, I pulled 333 semi-random articles from the live
API -- of those, 329 had significant changes relative to the 20100312
dump (which was the newest dump at the time). A new check against the
20110115 dump shows a similar percentage.

Caveat -- my research is largely centered on infobox template usage,
which is relatively new, so those templates are generally updated
frequently.

-- James


Re: Fwd: Reg. Research using Wikipedia

Platonides
James Linden wrote:
>> Why do you need to access the live Wikipedia for this?
>> Using categorylinks.sql and page.sql you should be able to fetch the
>> same data, probably faster.
>
> In my research, the answer to this question is two-fold
>
> A) Creating a local copy of Wikipedia (using MediaWiki and various
> import tools) is quite a process, and requires a significant
> investment of time and research unto itself.

You don't need to do a full copy to, e.g., fetch infoboxes.


> B) A few months ago, I pulled 333 semi-random articles from the live
> API -- of those, 329 had significant changes relative to the 20100312
> dump (which was the newest dump at the time). A new check against the
> 20110115 dump shows a similar percentage.

Getting updated data may be a reason, but I don't think that's what
Ramesh wanted.
Plus, you wanted 333 articles, not the 3 million...



Re: Fwd: Reg. Research using Wikipedia

Thomas Dalton
In reply to this post by Platonides
On 9 March 2011 16:00, Platonides <[hidden email]> wrote:

>> Dear Members,
>> I am Ramesh, pursuing my PhD at Monash University, Malaysia. My
>> research is on blog classification using Wikipedia categories.
>> For my experiment, I use the 12 main categories of Wikipedia, and I
>> want to identify which of those 12 main categories a particular
>> article belongs to.
>> So I wrote a program to collect the subcategories of each article and
>> classify it against the 12 categories offline.
>> I have already downloaded a wiki dump, which contains around 3 million
>> article titles.
>> My program takes these 3 million article titles, goes to the live
>> Wikipedia website, and fetches the subcategories.
>
> Why do you need to access the live Wikipedia for this?
> Using categorylinks.sql and page.sql you should be able to fetch the
> same data, probably faster.

I concur. Everything required for this project should be in the dumps.


Re: Fwd: Reg. Research using Wikipedia

Alex Zaddach
In reply to this post by James Linden-3
On 3/9/2011 11:29 AM, James Linden wrote:
>> Why do you need to access the live Wikipedia for this?
>> Using categorylinks.sql and page.sql you should be able to fetch the
>> same data, probably faster.
>
> In my research, the answer to this question is two-fold
>
> A) Creating a local copy of Wikipedia (using MediaWiki and various
> import tools) is quite a process, and requires a significant
> investment of time and research unto itself.

You don't need a local copy of MediaWiki or any special tools to use the
SQL dumps, just MySQL.
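
For completeness, a sketch of the import step itself, streaming the
gzipped dumps straight into the stock mysql client. The database name
enwiki, the credentials expected in ~/.my.cnf, and the exact file names
are assumptions; the names follow the pattern used on
dumps.wikimedia.org.

    import gzip
    import shutil
    import subprocess

    # The target database "enwiki" must already exist; login credentials
    # are read from ~/.my.cnf.
    DUMPS = [
        "enwiki-latest-page.sql.gz",
        "enwiki-latest-categorylinks.sql.gz",
    ]

    for dump in DUMPS:
        with gzip.open(dump, "rb") as sql:
            proc = subprocess.Popen(["mysql", "enwiki"], stdin=subprocess.PIPE)
            # Stream the decompressed SQL into mysql without unpacking the
            # multi-gigabyte file on disk.
            shutil.copyfileobj(sql, proc.stdin)
            proc.stdin.close()
            if proc.wait() != 0:
                raise RuntimeError("import failed for " + dump)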

--
Alex (wikipedia:en:User:Mr.Z-man)
