Citation Project - Comments Welcome!

Citation Project - Comments Welcome!

Andrea Forte
Hi all,


One of my PhD students, Meen Chul Kim, is a data scientist with experience
in bibliometrics and we will be working on some citation-related research
together with Aaron and Dario in the coming months. Our main goal in the
short term is to develop an enhanced citation dataset that will allow for
future analyses of citation data associated with article quality,
lifecycle, editing trends, etc.


The project page is here:
https://meta.wikimedia.org/wiki/Research:Understanding_the_context_of_citations_in_Wikipedia


The project is just getting started so this is a great time to offer
feedback and suggestions, especially for features of citations that we
should mine as a first step, since this will affect what the dataset can be
used for in the future.


Looking forward to seeing some of you at WikiCite!!

Andrea




--
 :: Andrea Forte
 :: Associate Professor
 :: College of Computing and Informatics, Drexel University
 :: http://www.andreaforte.net
_______________________________________________
Wiki-research-l mailing list
[hidden email]
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l

Re: Citation Project - Comments Welcome!

Kerry Raymond
Just a couple of thoughts that cross my mind ...

If people use the {{cite book}} etc. templates, it will be relatively easy to work out what the components of the citation are. However, if people roll their own, e.g.

<ref>[http://someurl This And That], Blah Blah 2000</ref>

you may have some difficulty working out what is what. I've just been through a tedious exercise of updating a set of URLs using AWB over some thousands of articles, and some of the ways people roll their own citations were quite remarkable (and often quite unhelpful). It may be that you can't extract much from such citations. However, the good news is that if they have a URL in them, it will probably be in plain sight.
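To make the distinction concrete, here is a minimal sketch (not the project's actual parser) of how a hand-rolled reference might be separated from a templated one using only regular expressions. The function names, the sample wikitext, and the `600001` argument passed to {{cite QHR}} are illustrative assumptions; real wikitext has many more edge cases (nested templates, slashes inside ref names, and so on).

```python
import re

# <ref>...</ref> pairs only; self-closing <ref name="x" /> reuses are skipped.
REF_RE = re.compile(r"<ref\b[^>/]*>(.*?)</ref>", re.DOTALL | re.IGNORECASE)
# Bracketed external link: [http://example.org Optional label]
BRACKET_LINK_RE = re.compile(r"\[(https?://[^\s\]]+)(?:\s+([^\]]*))?\]")
# Any URL written out in the wikitext.
BARE_URL_RE = re.compile(r"https?://[^\s\]<|}]+")
# Name of the first template call, e.g. "cite QHR" in {{cite QHR|...}}.
TEMPLATE_RE = re.compile(r"\{\{\s*([^|}]+)")


def classify_ref(body):
    """Describe one <ref> body as templated or hand-rolled."""
    template = TEMPLATE_RE.search(body)
    if template:
        # A template may generate its URL internally, so none may be visible.
        url = BARE_URL_RE.search(body)
        return {
            "kind": "template",
            "template": template.group(1).strip(),
            "url": url.group(0) if url else None,
        }
    link = BRACKET_LINK_RE.search(body)
    if link:
        return {"kind": "hand-rolled", "url": link.group(1), "label": link.group(2)}
    url = BARE_URL_RE.search(body)
    return {"kind": "hand-rolled", "url": url.group(0) if url else None, "label": None}


def extract_refs(wikitext):
    """Classify every <ref>...</ref> body found in the wikitext."""
    return [classify_ref(m.group(1)) for m in REF_RE.finditer(wikitext)]


# Hypothetical sample; the template argument is made up for illustration.
sample = (
    "Foo.<ref>[http://someurl This And That], Blah Blah 2000</ref> "
    "Bar.<ref>{{cite QHR|600001}}</ref> "
    'Baz.<ref name="x" />'
)
refs = extract_refs(sample)
```

In this sketch the hand-rolled reference yields its URL and link label directly, while the templated one reports the template name and no visible URL. Recovering that URL would require expanding the template.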

By contrast, there are a number of templates that I regularly use for citation, like {{cite QHR}} (currently 1234 transclusions), {{cite QPN}} (currently 2738 transclusions) and {{Census 2011 AUS}} (4400 transclusions), all of which generate their URLs. I'm not sure how you will deal with these in terms of extracting URLs.

But whatever the limitations, it will be a useful dataset to answer some interesting questions.

One phenomenon I often see is new users updating information (e.g. changing the population of a town) while leaving behind the old citation for the previous value. So it superficially looks like the new information is cited to a reliable source when in fact it isn't. I've often wished we could automatically detect and raise a "warning" when the "text being supported" by the citation changes yet the citation does not. The problem, of course, is that we only know where the citation appears in the text, and we presume it is in support of "some earlier" text (without being clear exactly where that text is). And if an article is reorganised, the citation may well end up "drifting away" from the text it supports, or even be left supporting text that has been deleted. So I think it is important to know what text preceded the citation at the time the citation first appears in the article history, as it may be useful to compare it against the text that *now* appears before it. It is a great pity that (in these digital times) we have not developed a citation model where you select chunks of text and link your citation to them, so that the relationship between the text and the citation is more apparent.
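As a rough illustration of the "warning" idea, the sketch below compares two revisions of an article and flags citations whose body is unchanged while the sentence immediately before them differs. It assumes the preceding sentence is the supported text, which is exactly the presumption questioned above, and it uses a crude sentence splitter that abbreviations such as "e.g." will break. All names and the sample revisions are hypothetical, and the template parameters are omitted.

```python
import re

REF_RE = re.compile(r"<ref\b[^>/]*>(.*?)</ref>", re.DOTALL | re.IGNORECASE)
# Crude sentence boundary: whitespace that follows ., ! or ?
SENTENCE_SPLIT_RE = re.compile(r"(?<=[.!?])\s+")


def cited_sentences(wikitext):
    """Map each citation body to the sentence that precedes it."""
    contexts = {}
    for m in REF_RE.finditer(wikitext):
        # Drop earlier references so they do not pollute the context.
        before = REF_RE.sub("", wikitext[: m.start()]).strip()
        # Note: identical citation bodies overwrite each other in this sketch.
        contexts[m.group(1).strip()] = SENTENCE_SPLIT_RE.split(before)[-1]
    return contexts


def stale_citation_candidates(old_text, new_text):
    """Citations present in both revisions whose preceding sentence changed."""
    old, new = cited_sentences(old_text), cited_sentences(new_text)
    flagged = []
    for ref_body, old_sentence in old.items():
        new_sentence = new.get(ref_body)
        if new_sentence is not None and new_sentence != old_sentence:
            flagged.append((ref_body, old_sentence, new_sentence))
    return flagged


# Hypothetical revisions of a made-up article.
old_rev = ("Exampleton is a town in Queensland. "
           "Its population was 1,200.<ref>{{Census 2011 AUS}}</ref>")
new_rev = ("Exampleton is a rural town in Queensland. "
           "Its population was 3,950.<ref>{{Census 2011 AUS}}</ref>")
flagged = stale_citation_candidates(old_rev, new_rev)
```

Anything flagged this way is only a candidate for human review: a copy-edit of the sentence would be flagged just like a changed population figure.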

Kerry

-----Original Message-----
From: Wiki-research-l [mailto:[hidden email]] On Behalf Of Andrea Forte
Sent: Tuesday, 2 May 2017 5:18 AM
To: Research into Wikimedia content and communities <[hidden email]>
Subject: [Wiki-research-l] Citation Project - Comments Welcome!


Re: Citation Project - Comments Welcome!

Robert West
In reply to this post by Andrea Forte
Hi,

This looks like a great project!

One quick thought: it would be extremely useful for your purposes to
be able to study not only the static structure of references pointing
from Wikipedia to external documents, but also how these references
are used. Tracking this traffic is currently impossible, since it will
only leave a footprint on the webserver of the link target, not on
Wikimedia's webservers.

Have you thought about the possibility of funneling external links
through a Wikimedia URL, which would allow you to record the links
through which users leave Wikipedia?

I know this would be a major change to the infrastructure, and I'm not
sure how the privacy implications would line up with Wikipedia's
guidelines, but it's worthwhile giving it some serious thought. At the
very least, Wikimedia could store counts of external-link clicks,
without linking those clicks to users' Wikimedia-internal browse
traces.

Bob

On Mon, May 1, 2017 at 9:17 PM, Andrea Forte <[hidden email]> wrote:


Re: Citation Project - Comments Welcome!

Leila Zia
+ John and Lauren

On Tue, May 2, 2017 at 5:49 AM, Robert West <[hidden email]> wrote:

>
>
> One quick thought: it would be extremely useful for your purposes to
> be able to study not only the static structure of references pointing
> from Wikipedia to external documents, but also how these references
> are used. Tracking this traffic is currently impossible, since it will
> only leave a footprint on the webserver of the link target, not on
> Wikimedia's webservers.
>
> Have you thought about the possibility of funneling external links
> through a Wikimedia URL, which would allow you to record the links
> through which users leave Wikipedia?
>
> I know this would be a major change to the infrastructure, and I'm not
> sure how the privacy implications would line up with Wikipedia's
> guidelines, but it's worthwhile giving it some serious thought. At the
> very least, Wikimedia could store counts of external-link clicks,
> without linking those clicks to users' Wikimedia-internal browse
> traces.


ha! interesting that you say this. :)

John and Lauren reached out to us some months ago after we published
the work on Why We Read Wikipedia and asked about the possibility of
doing exactly what you say above. We met a few weeks ago and are now
exploring that space together. Pending Board and FDC approvals,
this is an item which is part of our next annual plan programs.

Bob, you may have seen https://purl.stanford.edu/ny213kn0075 before?
John wrote this some years ago because they had access to the server
logs of the Stanford Encyclopedia of Philosophy and they could learn
more about the relation between Wikipedia reference usage and that. It
would be great if we can repeat that kind of analysis, but this time
beyond just that one specific resource.

We will post the proposal on meta as soon as the steps are solidified.
In the meantime, feel free to chat with John and Lauren directly.

Andrea, I know Lauren will be at WikiCite as well. The two of you may
enjoy having a chat with each other. :)

Best,
Leila
p.s. John, Lauren: I'm not sure if you're on this public list. If
you're not, please feel free to subscribe at
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l


Re: Citation Project - Comments Welcome!

Robert West
Thanks for the update, Leila -- great to see this is about to be addressed!


Bob

On Tue, May 2, 2017 at 5:16 PM, Leila Zia <[hidden email]> wrote:


Re: Citation Project - Comments Welcome!

Andrea Forte
In reply to this post by Robert West
What a great idea! I think this would be fascinating - it's outside the
scope of what we'll accomplish in the initial project phase but it would be
an interesting complementary project or something to think about as a
second effort that would increase the value of the data.

Andrea

On Tue, May 2, 2017 at 8:49 AM, Robert West <[hidden email]> wrote:





Re: Citation Project - Comments Welcome!

Andrea Forte
In reply to this post by Kerry Raymond
Yes, we've parsed some citation data in the past and there are graduated
levels of interpretability... especially since we aim to look at citations
over time, early revisions are likely to have more variation than those in
more recent years when there have been more tools available to help people
format. Around 2005 I built a MediaWiki extension (well, it turned out to
be a fork really) that structured the insertion of reference data in an
article and stored it in a separate reference table in the database. How I
wish I had figured out how to make that a scalable tool then, so we
wouldn't have this problem now!

One thing we've discussed is that although what we are really interested in
is the sources--what references point to--our ability to understand what
those sources are is limited by how well we can parse and
extract the reference text itself.



On Tue, May 2, 2017 at 7:59 AM, Kerry Raymond <[hidden email]>
wrote:




Re: Citation Project - Comments Welcome!

Andrea Forte
In reply to this post by Kerry Raymond
...and YES, cases where a reference has changed but the adjacent text
has not will be detectable with the dataset we aim to produce. That's a
great idea!

On Tue, May 2, 2017 at 7:59 AM, Kerry Raymond <[hidden email]>
wrote:




Re: Citation Project - Comments Welcome!

Kerry Raymond
The only thing is that the “real life” problem is the text changing while the citation stays the same. I don’t see the opposite happen much.

Another thought I had was of course to preserve details of the edit which added the citation initially: user, timestamp, edit summary, etc.

It would be interesting to find “cliques” (in the loose social sense, not the strict mathematical sense) of users who seem to use the same “clique of citations”. Such groups might be sockpuppets, meatpuppets, etc. Of course, they might just be good-faith editors accessing the same very useful resources for their favourite topic area. But I guess if you “smell a rat” with one user or one source, then it might be handy to explore any “cliques” they appear to be operating within to look for suspicious activity by the others.
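One simple way to look for such loose “cliques” is to compare the sets of sources each user has cited. The sketch below is only an illustration under stated assumptions: the input mapping from user to cited URLs, the user names, the example URLs and both thresholds are hypothetical, and Jaccard similarity is just one possible overlap measure.

```python
from itertools import combinations


def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similar_citers(citations_by_user, min_similarity=0.5, min_shared=2):
    """Return user pairs whose cited sources overlap heavily."""
    pairs = []
    for (u1, s1), (u2, s2) in combinations(sorted(citations_by_user.items()), 2):
        shared = s1 & s2
        score = jaccard(s1, s2)
        if len(shared) >= min_shared and score >= min_similarity:
            pairs.append((u1, u2, round(score, 2), sorted(shared)))
    # Highest overlap first.
    return sorted(pairs, key=lambda p: p[2], reverse=True)


# Hypothetical data: which URLs each user has added as citations.
citations_by_user = {
    "UserA": {"http://a.example/1", "http://a.example/2", "http://b.example/x"},
    "UserB": {"http://a.example/1", "http://a.example/2"},
    "UserC": {"http://c.example/z"},
}
pairs = similar_citers(citations_by_user)
```

As noted above, a high overlap does not by itself indicate bad faith: editors working in the same topic area will naturally share sources, so pairs found this way are a starting point for a closer look and nothing more.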

 

I am not quite sure what we might learn from the edit summaries, but I guess if they are not collected, we will never know if they do contain any interesting patterns.

Another thought that occurs to me is that there is at least one situation where some of the text of interest may follow the citation rather than precede it, and that is a list, e.g.

The presidents of the USA are:<ref>one reliable source about all of the presidents</ref>

* George Washington
* …
* Donald Trump

Also citations within tables pose a bit of a problem in terms of their “span”. Is it just the cell with the citation? Is it more? I see tables with the last column being used to hold citations for data that populates that whole row.

Also citations in infoboxes, where there is one field carrying some data followed by a corresponding citation field, e.g. pop and pop_footnotes (for population in infobox Australian place).

The more I think about this issue, the more I despair. Not so much for this project to build a citation database, but rather for the fact that without any binding of article text to the citation, the connection between them is likely to degrade as successive contributors come along and modify the article, particularly so if they cannot access the source. I think we have let ourselves be seduced into thinking that so long as we can *see* a lot of inline citations [1][2][3] in our article, it is well-sourced, but if we really can’t explain what text is supported by which source, is it really well-sourced? You might as well just add a bibliography to the end and forget in-line citations.

Now one might argue this is just as true of a traditional journal article (again, no explicit binding of text to source), but the difference is that a traditional journal article has a single author or a group of tightly-coupled authors writing over a relatively short period of time (weeks rather than years), who are likely to have shared access to every source being cited and are able to confer among themselves if needed to sort out any issue relating to citations. So we can expect the citations to remain close to the text they support.

In Wikipedia, we have a disconnected set of authors operating over different time frames over an article lifetime of many years who are unable to share their source materials, and so I think the coupling between text and citation is likely to be lost because we leave no trace of the coupling for the next contributor to uphold, even when everyone is acting in good faith. Let’s call it “cite rot”, which I’ll define as a loss of verifiability due to disconnect between article text and source.

It seems to me that we need to make the connection between text and source more explicit. Think of it from a reader's perspective: in most e-readers you can select a word or phrase and a dictionary lookup is performed to tell you the meaning of the word(s). How about if, in the Wikipedia of 2030 (since we are discussing movement strategy at the moment), the reader could select some words and get back the sources that support them? E.g. currently we might write

 

Joe Smith was born in London in 1830.[1][2]

 

Where [1] supports that he was born in London and [2] that he was born in 1830.

 

In my 2030 Wikipedia, if we clicked on London, cite [1] would highlight (or something), if we clicked on 1830, [2] would highlight, and if we clicked on born, both would highlight. That is, the words “Joe Smith was born in London” would be tagged as being [1] and “Joe Smith was born … in 1830” would be tagged as being [2]. And probably a little pop-up with the exact quote out of the source document might appear for your verification pleasure.

 

Now of course we have enough problems getting our contributors to supply any sources, let alone binding them to chunks of text as my proposal would entail. But I hear the Movement Strategy conversation is talking about improved quality and improved verifiability, so maybe it’s part of the quality assessment: if you want a VGA (verifiable good article), the text-to-cite mapping must be embedded in the article and almost all of the text must be “covered” (in the mathematical sense) by the mapping. Indeed, the extent of coverage could be a verifiability metric.
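
To make the idea concrete, here is a rough Python sketch of what a text-to-cite mapping and a coverage metric might look like for the Joe Smith sentence. Everything in it is an assumption made up for illustration (the CitedSpan record, the character-offset representation, the function names); nothing like this exists in MediaWiki today.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CitedSpan:
    """Hypothetical record binding a run of article text to one citation."""
    start: int    # inclusive character offset into the text
    end: int      # exclusive character offset
    cite_id: str  # footnote label, e.g. "1"


def cites_at(offset, spans):
    """Return the citations whose spans include the character at `offset`."""
    return sorted({s.cite_id for s in spans if s.start <= offset < s.end})


def coverage(text_length, spans):
    """Fraction of characters covered by at least one cited span."""
    if text_length == 0:
        return 0.0
    covered = 0
    last_end = 0  # end of the region already counted
    for s in sorted(spans, key=lambda s: s.start):
        start = max(s.start, last_end)  # skip any overlap already counted
        end = min(s.end, text_length)
        if end > start:
            covered += end - start
            last_end = end
    return covered / text_length


sentence = "Joe Smith was born in London in 1830."


def span_for(phrase, cite_id):
    """Build a span for the first occurrence of `phrase` in the sentence."""
    start = sentence.index(phrase)
    return CitedSpan(start, start + len(phrase), cite_id)


spans = [
    span_for("Joe Smith was born in London", "1"),
    span_for("Joe Smith was born", "2"),
    span_for("in 1830", "2"),
]

print(cites_at(sentence.index("London"), spans))  # citations behind "London"
print(cites_at(sentence.index("1830"), spans))    # citations behind "1830"
print(cites_at(sentence.index("born"), spans))    # citations behind "born"
print(coverage(len(sentence), spans))             # share of text that is cited
```

Character offsets are the simplest thing to sketch but would be fragile under editing; a real design would presumably need anchors that survive successive revisions.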

 

OK, maybe what I am proposing is not the way to go, but I think we ought to be thinking about this issue of cite rot, because I think it’s a real problem. I suspect it’s already out there but we don’t notice it because we *see* lots of inline citations and assume all is well.

 

Kerry

 

From: Andrea Forte [mailto:[hidden email]]
Sent: Wednesday, 3 May 2017 11:46 PM
To: [hidden email]
Cc: Research into Wikimedia content and communities <[hidden email]>
Subject: Re: [Wiki-research-l] Citation Project - Comments Welcome!

 

 

...and YES, detecting when a reference has changed but the adjacent text has not is something that will be detectable with the dataset we aim to produce. That's a great idea!

 


Re: Citation Project - Comments Welcome!

Andrea Forte
(meant to reply all!)

Kerry - right, re: real world problem, I meant that the other way around:
we will have both the reference text and the preceding text, so if either
one changes, it'll be detectable. That won't do everything you are
describing, but it will provide the data required to think through a lot of
these problems and come up with future approaches to understanding the link
between a reference and the text in which it is embedded.

You raise a really good point that I had totally missed - in past work I
captured revision-related data like username/timestamp/etc. I do want to
capture revisionid and revision-related metadata along with the reference
text itself. I've added a note about that to the proposed data structure,
thank you!
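
A minimal sketch of how such records could be compared across revisions, in Python. The field names and the CitationSnapshot structure are assumptions for illustration only, not the project's actual data structure, and the example town, usernames and revision ids are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CitationSnapshot:
    """Hypothetical record: one reference as it appears in one revision."""
    rev_id: int
    timestamp: str
    username: str
    ref_text: str        # wikitext of the <ref>...</ref> itself
    preceding_text: str  # article text immediately before the reference


def classify_change(old, new):
    """Compare two snapshots of the same reference and label what changed."""
    ref_changed = old.ref_text != new.ref_text
    text_changed = old.preceding_text != new.preceding_text
    if text_changed and not ref_changed:
        return "text_changed_only"  # the case Kerry describes
    if ref_changed and not text_changed:
        return "ref_changed_only"
    if ref_changed and text_changed:
        return "both_changed"
    return "unchanged"


# Invented example: a population figure is updated but the old citation stays.
before = CitationSnapshot(
    rev_id=1001,
    timestamp="2016-01-01T00:00:00Z",
    username="ExampleUserA",
    ref_text="<ref>Example census source, 2011</ref>",
    preceding_text="Exampletown had a population of 1,200.",
)
after = CitationSnapshot(
    rev_id=1002,
    timestamp="2017-01-01T00:00:00Z",
    username="ExampleUserB",
    ref_text="<ref>Example census source, 2011</ref>",
    preceding_text="Exampletown had a population of 1,350.",
)

print(classify_change(before, after))
```

Exact string comparison is the crudest possible test; whitespace or formatting edits would also be flagged, so some normalisation would probably be needed in practice.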

On Wed, May 3, 2017 at 11:07 AM, Kerry Raymond <[hidden email]>
wrote:

> The only thing is that the “real life” problem is the text changing but
> the citations stays the same. I don’t see the opposite happen much.
>
>
>
> Another thought I had was of course to preserve details of the edit which
> added the citation initially, user, timestamp, edit summary, etc

Re: Citation Project - Comments Welcome!

Ziko van Dijk-3
Hello Kerry,

A lot of good points. It is about how to work collaboratively using
footnotes. That does not really happen right now, and it is difficult to
verify something without having the book at hand - or even then.

We would need much stricter rules on how to show exactly where each piece
of information comes from, and where to look for it. But in reality, I
write e.g. one paragraph in Wikipedia, summing up 2-3 pages of a book, and
then put a footnote referring to these 2-3 pages at the end of the
paragraph. Otherwise, my process of writing would be much slower, and
possibly my summary would be worse.

This is indeed a "real world problem" (=difficult to solve for Wikipedia),
but one that is particular to a text with a large number of (partially
anonymous) contributors. It would help if a mouse-over on existing text
could at least indicate which Wikipedia account was responsible for exactly
that text.

Kind regards,
Ziko

PS:
In your example, I'd say:
Joe Smith was born in London[1] in 1830[2].

Where [1] supports that he was born in London and [2] that he was born in
1830.




2017-05-09 17:26 GMT+02:00 Andrea Forte <[hidden email]>:

> (meant to reply all!)
>
> Kerry - right, re: real world problem, I meant that the other way around,
> we will have both the reference text and the preceding text, if either one
> changes, it'll be detectable. That won't do everything you are describing
> but it will provide the data required to think through a lot of these
> problems and come up with future approaches to understanding the link
> between a reference and the text in which it is embedded.