Wikipedia Citations: A comprehensive dataset of citations with identifiers extracted from English Wikipedia

Giovanni1085
Dear all,

we have recently released a dataset which might be of interest to some of you. Wikipedia Citations contains the nearly 30M citations to be found in English Wikipedia (as of May 1st 2020), of which approx. 4M are to scientific publications. Most of these 4M have also been equipped with identifiers (ISBN, DOI, etc.). All code is there for replication and updates.
 
Pre-print: https://arxiv.org/abs/2007.07022
Data and code: https://zenodo.org/record/3940692#.XyQjaPj7SL8
 
We welcome feedback, ideas for collaboration and any question you might have in order to use the dataset for your research and work.
 
Best regards,
Giovanni Colavizza (with Harshdeep Singh and Bob West)
_______________________________________________
Wiki-research-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

Re: Wikipedia Citations: A comprehensive dataset of citations with identifiers extracted from English Wikipedia

Andy Mabbett-2
On Fri, 31 Jul 2020 at 15:06, Giovanni Colavizza
<[hidden email]> wrote:

> we have recently released a dataset which might be of interest to some of you.
> Wikipedia Citations contains the nearly 30M citations to be found in English
> Wikipedia (as of May 1st 2020), of which approx. 4M are to scientific publications.
> Most of these 4M have also been equipped with identifiers (ISBN, DOI, etc.).

> We welcome feedback, ideas for collaboration and any question you might have

Thank you. Three thoughts occur to me:

1) It would be good to merge metadata on the 4 million scientific
articles (or at least those with identifiers) into Wikidata; this
could be done under the WikiCite umbrella.

2) We could use this data, or the tools that extracted it, for quality
control - for example, highlighting multiple citations of the same
work (say, matched by DOI), with different authors, dates, venues,
etc. We could even do some simple data repairs, for example, if an
author name in three articles is spelled "Mabbett", but in a fourth it
is spelled "Mabbott", then the latter is likely a spelling error.

3) The current model is madness^W inefficient. Citations which call
metadata from Wikidata (for example using the 'Cite Q' template [1])
would be far more sensible.
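
To make point 2 concrete, here is a minimal sketch of the kind of check I have in mind. It assumes each citation is available as a plain dict; the field names ("doi", "author", "year", "venue"), the function name find_conflicts and the sample DOI are hypothetical and are not taken from the released dataset's schema. It groups citations by DOI, reports fields whose values disagree, and suggests a repair only when one value holds a strict majority.

```python
from collections import Counter, defaultdict


def find_conflicts(citations, fields=("author", "year", "venue")):
    """Group citation records by DOI and report fields whose values disagree.

    `citations` is an iterable of dicts; the keys used here are hypothetical.
    """
    by_doi = defaultdict(list)
    for record in citations:
        doi = record.get("doi")
        if doi:  # citations without a DOI cannot be matched this way
            by_doi[doi.lower()].append(record)

    conflicts = {}
    for doi, records in by_doi.items():
        for field in fields:
            counts = Counter(r[field] for r in records if r.get(field))
            if len(counts) > 1:
                top_value, top_count = counts.most_common(1)[0]
                # Suggest a repair only if one value has a strict majority.
                has_majority = top_count > sum(counts.values()) / 2
                conflicts.setdefault(doi, {})[field] = {
                    "values": dict(counts),
                    "suggestion": top_value if has_majority else None,
                }
    return conflicts


# Made-up sample: the same work cited in four articles, one with a misspelling.
sample = [
    {"doi": "10.1000/example", "author": "Mabbett", "year": "2020"},
    {"doi": "10.1000/example", "author": "Mabbett", "year": "2020"},
    {"doi": "10.1000/EXAMPLE", "author": "Mabbett", "year": "2020"},
    {"doi": "10.1000/example", "author": "Mabbott", "year": "2020"},
    {"doi": None, "author": "Someone", "year": "1999"},
]
report = find_conflicts(sample)
print(report)
```

A suggestion from a check like this should only ever be a hint for a human reviewer: a majority spelling can still be the wrong one, and exact string comparison will also flag harmless formatting differences.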


[1] https://en.wikipedia.org/wiki/Template:Cite_Q

--
Andy Mabbett
@pigsonthewing
http://pigsonthewing.org.uk
