ground truth for section alignment across languages

ground truth for section alignment across languages

Leila Zia
Hi all,

==Question==
Do you know of a dataset we can use as ground truth for aligning
sections of one article in two languages? I'm thinking a tool such as
Content Translation may capture this data somewhere, or there may be
some other community initiative that has matched a subset of the
sections between two versions of one article in two languages. Any
insights/directions are appreciated. :) I'm not going to worry about
which language pairs we have this dataset for right now; the first
question is: do we have anything? :)

==Context==
As part of the research we are doing to build recommendation systems
that can recommend sections (or templates) for already existing
Wikipedia articles, we are looking at the problem of section alignment
between languages, i.e., given two languages x and y and two versions
of an article a in these two languages, can an algorithm (with relatively
high accuracy) tell us which section in the article in language x
corresponds to which section in the article in language y?

Thanks,
Leila

--
Leila Zia
Senior Research Scientist
Wikimedia Foundation


Re: ground truth for section alignment across languages

Scott Hale
Dear Leila,

==Question==
> Do you know of a dataset we can use as ground truth for aligning
> sections of one article in two languages?
>

This question is super interesting to me. I am not aware of any ground
truth data, but could imagine trying to build some from
[[Template:Translated_page]]. At least on enwiki it has a "section"
parameter that is to be set:

> If the inserted translation is contained in one section of the target
> page, insert its name here. (A direct link to that section will be created.)
>
It also has a "version" parameter, and it might be possible to identify
cases where a section was added to the source after the translation was
made. This could then become a corpus to "learn the missing section". I
guess something similar could be done with articles created with the
Content Translation tool where a section was later added to the source.
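
In rough Python (untested, just to illustrate what mining that template
might look like; it assumes the mwparserfromhell library, that the wikitext
of pages transcluding the template has already been fetched, and that the
parameter names follow the enwiki documentation):

import mwparserfromhell

def translated_page_params(wikitext):
    """Yield the source page plus "section"/"version" for each
    {{Translated page}} transclusion found in the given wikitext."""
    code = mwparserfromhell.parse(wikitext)
    for tpl in code.filter_templates():
        if str(tpl.name).strip().lower() != "translated page":
            continue
        def param(key):
            return str(tpl.get(key).value).strip() if tpl.has(key) else None
        yield {
            "source_lang": param(1),      # positional: language code of the source wiki
            "source_page": param(2),      # positional: title of the source article
            "section": param("section"),  # section of the target page holding the translation
            "version": param("version"),  # revision of the source used for the translation
        }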


>
> ==Context==
> As part of the research we are doing to build recommendation systems
> that can recommend sections (or templates) for already existing
> Wikipedia articles, we are looking at the problem of section alignment
> between languages, i.e., given two languages x and y and two versions
> of an article a in these two languages, can an algorithm (with relatively
> high accuracy) tell us which section in the article in language x
> corresponds to which section in the article in language y?
>


While I am not aware of research on Wikipedia section alignment per se,
there is a good amount of work on sentence alignment and on building
parallel/bilingual corpora that seems relevant to this [1-4]. I can
imagine an approach that would look for near matches across two Wikipedia
articles in different languages and then examine the distribution of these
sentences within sections to see if one or more sections look to be
omitted. One challenge is the sub-article problem [5], with which of course
you are already familiar. I wonder whether computing the overlap in article
links a la Omnipedia [6] and then examining the distribution of these
between sections would work and be much less computationally intensive. I
fear, however, that this could over-identify sections further down an
article as missing, given that (I believe) article links are often
concentrated towards the beginning of an article.
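
To make the link-overlap idea a bit more concrete, here is a very rough
sketch (assuming, which is the hard part I am skipping, that each article
has already been split into sections and its outgoing links mapped to
language-independent Wikidata IDs; the 0.1 threshold is arbitrary):

def jaccard(a, b):
    # Overlap of two link sets; 0.0 when both are empty.
    return len(a & b) / len(a | b) if (a or b) else 0.0

def align_sections(sections_x, sections_y, threshold=0.1):
    """sections_x / sections_y map section titles to sets of Wikidata Q-ids
    linked from that section. Sections of x with no counterpart in y above
    the threshold are reported as candidate missing sections in y."""
    alignments, missing_in_y = {}, []
    for title_x, links_x in sections_x.items():
        best_title, best_links = max(
            sections_y.items(),
            key=lambda item: jaccard(links_x, item[1]),
            default=(None, set()))
        score = jaccard(links_x, best_links)
        if score >= threshold:
            alignments[title_x] = (best_title, score)
        else:
            missing_in_y.append(title_x)
    return alignments, missing_in_y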

[1] Learning Joint Multilingual Sentence Representations with Neural
Machine Translation. 2017.
https://arxiv.org/abs/1704.04154

[2] Fast and Accurate Sentence Alignment of Bilingual Corpora. 2002.
https://www.microsoft.com/en-us/research/publication/fast-and-accurate-sentence-alignment-of-bilingual-corpora/

[3] Large scale parallel document mining for machine translation. 2010.
http://www.aclweb.org/anthology/C/C10/C10-1124.pdf

[4] Building Bilingual Parallel Corpora Based on Wikipedia. 2010.
http://www.academia.edu/download/39073036/building_bilingual_parallel_corpora.pdf

[5] Problematizing and Addressing the Article-as-Concept Assumption in
Wikipedia. 2017.
http://www.brenthecht.com/publications/cscw17_subarticles.pdf

[6] Omnipedia: Bridging the Wikipedia Language Gap. 2012.
http://www.brenthecht.com/papers/bhecht_CHI2012_omnipedia.pdf

Best wishes,
Scott

--
Dr Scott Hale
Senior Data Scientist
Oxford Internet Institute, University of Oxford
Turing Fellow, Alan Turing Institute
http://www.scotthale.net/
[hidden email]

Re: ground truth for section alignment across languages

Gerard Meijssen-3
In reply to this post by Leila Zia
Hoi,
Sorry to state the obvious (for me): we data-mine the Wikipedias for
statements that end up in Wikidata. Consequently, much information that
could be / should be in an article (in any and all languages) is reflected
in Wikidata. There is much that is not found in every language, and
information on some subjects can easily be provided from Wikidata as a
list (think awards, books published, etc.); the good news is that Wikidata
will provide lists for this purpose. For all other topics, like date of
death/birth, place of death/birth, where people studied, etc., you have the
benefit of existing articles in a Wikipedia and the work done at Wikidata.
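
For illustration only: pulling such a list out of Wikidata is a small
query against the public query service. The item (Q937) and the "award
received" property (P166) below are just an example of the kind of list
meant here:

import requests

# Sketch: list the awards recorded in Wikidata for one item, e.g. to suggest
# an "Awards" section that exists in one language version but not another.
QUERY = """
SELECT ?award ?awardLabel WHERE {
  wd:Q937 p:P166 ?stmt .              # Q937 = Albert Einstein, as an example
  ?stmt ps:P166 ?award .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
"""

resp = requests.get("https://query.wikidata.org/sparql",
                    params={"query": QUERY, "format": "json"})
for row in resp.json()["results"]["bindings"]:
    print(row["awardLabel"]["value"])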

Hope this helps.
Thanks,
      GerardM


Re: ground truth for section alignment across languages

Leila Zia
In reply to this post by Scott Hale
Hi Scott,


On Mon, Aug 28, 2017 at 2:01 AM, Scott Hale <[hidden email]> wrote:

> Dear Leila,
>
> ==Question==
>> Do you know of a dataset we can use as ground truth for aligning
>> sections of one article in two languages?
>>
>
> This question is super interesting to me. I am not aware of any ground
> truth data, but could imagine trying to build some from
> [[Template:Translated_page]]. At least on enwiki it has a "section"
> parameter that is to be set:

Nice! :) Thanks for sharing it. It is definitely worth looking into.
I did some searching across a few languages and the usage is limited
(around 600 in es, for example), and once you start slicing and dicing
it, the labels become too few. But still, we may be able to use it now
or in the future.
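
(For anyone who wants to repeat that quick check: counting transclusions
per wiki can be done with the standard API's embeddedin list. A minimal
sketch follows; note that the template usually has a different local name
on each wiki, which this ignores.)

import requests

def count_transclusions(wiki="es.wikipedia.org",
                        template="Template:Translated page"):
    # Count pages that embed the given template via list=embeddedin,
    # following API continuation until all results have been seen.
    api = f"https://{wiki}/w/api.php"
    params = {"action": "query", "list": "embeddedin", "eititle": template,
              "eilimit": "max", "format": "json"}
    total = 0
    while True:
        data = requests.get(api, params=params).json()
        total += len(data["query"]["embeddedin"])
        if "continue" not in data:
            return total
        params.update(data["continue"])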

>> ==Context==
>> As part of the research we are doing to build recommendation systems
>> that can recommend sections (or templates) for already existing
>> Wikipedia articles, we are looking at the problem of section alignment
>> between languages, i.e., given two languages x and y and two versions
>> of an article a in these two languages, can an algorithm (with relatively
>> high accuracy) tell us which section in the article in language x
>> corresponds to which section in the article in language y?
>>
>
>
> While I am not aware of research on Wikipedia section alignment per se,
> there is a good amount of work on sentence alignment and on building
> parallel/bilingual corpora that seems relevant to this [1-4]. I can
> imagine an approach that would look for near matches across two Wikipedia
> articles in different languages and then examine the distribution of these
> sentences within sections to see if one or more sections look to be
> omitted. One challenge is the sub-article problem [5], with which of course
> you are already familiar. I wonder whether computing the overlap in article
> links a la Omnipedia [6] and then examining the distribution of these
> between sections would work and be much less computationally intensive. I
> fear, however, that this could over-identify sections further down an
> article as missing, given that (I believe) article links are often
> concentrated towards the beginning of an article.

exactly.

A side note: we are trying to stay away, as much as possible, from
research/results that rely on NLP techniques, as the introduction of
NLP usually translates relatively quickly into limitations on which
languages our methodologies can scale to.

Thanks, again! :)

Leila


Re: ground truth for section alignment across languages

Lucie-Aimée Kaffee-2
In reply to this post by Scott Hale
Hi Leila,

Off the top of my head, I can only think of this paper, which I read a
while ago:
https://eprints.soton.ac.uk/403386/1/tweb_gottschalk_demidova_multiwiki.pdf

I assume what also needs to be considered is the (lack of) content overlap
between articles in different languages in general, as in, for example,
http://dl.acm.org/citation.cfm?id=1753370, which also compares different
language Wikipedias, but more in the sense of completeness.

Sounds like interesting work, looking forward to seeing what you come up
with!

All the best,

Lucie


Re: ground truth for section alignment across languages

Martin Potthast
In reply to this post by Leila Zia
Hi Leila,

I can point you to two methods: CL-ESA and CL-CNG.

Cross-Language Explicit Semantic Analysis (CL-ESA):
http://www.uni-weimar.de/medien/webis/publications/papers/stein_2008b.pdf

This model allows for language-independent comparison of texts without
relying on parallel corpora or translation dictionaries for training.
Rather, it exploits the cross-language links of Wikipedia articles to embed
documents from two or more languages in a joint vector space, rendering
them directly comparable, e.g., using cosine similarity. The more language
links exist between two Wikipedia languages, the higher the dimensionality
of the joint vector space can be made, and the better a cross-language
ranking will perform. At the document level, near-perfect recall on a
ranking task is achieved at 100,000 dimensions (= articles linked across
languages); see Table 2 of the paper. The model is easy to implement but
somewhat expensive to compute.
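
The core of the model is small enough to sketch. Below is a rough sketch
of the idea in Python (not our actual implementation, just to show the
shape of it): each document is represented by its similarity to a fixed
set of interlanguage-linked "concept" articles, one dimension per linked
concept, and the resulting concept vectors are compared directly across
languages.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def esa_index(concept_texts):
    # Index the concept articles of one language; rows = concepts.
    vectorizer = TfidfVectorizer()
    return vectorizer, vectorizer.fit_transform(concept_texts)

def esa_embed(document, vectorizer, concept_matrix):
    # The ESA vector of a document = its similarity to every concept article.
    return cosine_similarity(vectorizer.transform([document]), concept_matrix)

# concept_texts_x[i] and concept_texts_y[i] must be the texts of the i-th
# interlanguage-linked concept article in languages x and y:
# vx, Cx = esa_index(concept_texts_x)
# vy, Cy = esa_index(concept_texts_y)
# score = cosine_similarity(esa_embed(section_x, vx, Cx),
#                           esa_embed(section_y, vy, Cy))[0, 0]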

Cross-Language Character N-Gram model (CL-CNG):
In subsequent experiments, we compared the model with alternatives: one
that is trained on the basis of a parallel corpus, and another that simply
exploits the lexical overlap of character N-grams between pairs of documents
from different languages:
http://www.uni-weimar.de/medien/webis/publications/papers/stein_2011b.pdf

As it turns out, CL-C3G (i.e., N=3) is also extremely effective on
language pairs that share an alphabet and where lexical overlap can be
expected, e.g., because the languages have a common ancestor. So it works
very well for German-Dutch, but less so for English-Russian; in the latter
case, CL-ESA works, though. The CL-CNG model is even easier to implement
and very scalable. Depending on the language pairs you are investigating,
this model may help a great deal.
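
The character n-gram model is almost trivial to sketch (again, just an
illustration of the idea, not our actual code):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def cl_c3g_similarity(text_a, text_b):
    # CL-C3G as cosine similarity over character 3-gram counts; only useful
    # where the two languages share an alphabet and overlapping vocabulary.
    vectorizer = CountVectorizer(analyzer="char", ngram_range=(3, 3),
                                 lowercase=True)
    counts = vectorizer.fit_transform([text_a, text_b])
    return cosine_similarity(counts[0], counts[1])[0, 0]

# e.g. cl_c3g_similarity(section_de, section_nl) should be high for
# corresponding German-Dutch sections, much less so for English-Russian.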

Perhaps these models may be of use when building a cross-language alignment
tool.

Best,
Martin



--
Dr. Martin Potthast
Bauhaus-Universität Weimar
Digital Bauhaus Lab
Bauhausstr. 9a
99423 Weimar
Germany

+49 3643 58 3567
+49 171 809 1945

www.potthast.net