[Wikimedia-l] machine translation

Amir E. Aharoni
2017-05-02 18:20 GMT+03:00 John Erling Blad <[hidden email]>:

> Brute force solution; turn the ContentTranslation off. Really stupid
> solution.


... Then I guess you don't mind that I'm changing the thread name :)


> The next solution; turn the Yandex engine off. That would solve a
> part of the problem. Kind of lousy solution though.
>

> What about adding a language model that warns when the language constructs
> get too weird? It is like a "test" for the translation. The CT is used for
> creating a translation, but the language model is used for verifying if the
> translation is good enough. If it does not validate against the language
> model it should simply not be published to the main name space. It will
> still be possible to create a draft, but then the user is completely aware
> that the translation isn't good enough.
>
> Such a language model should be available as a test for any article, as it
> can be used as a quality measure for the article. It is really a quantity
> measure for the well-spokenness of the article, but that isn't quite so
> intuitive.
>

So, I'll allow myself to guess that you are talking about one particular
language, probably Norwegian.

Several technical facts:

1. In the past there were several cases in which translators to different
languages reported common translation mistakes to me. I passed them on
to Yandex developers, with whom I communicate quite regularly. They
acknowledged receiving all of them. I am aware of at least one such common
mistake that was fixed; possibly there were more. If you can give me a list
of such mistakes for Norwegian, I'll be very happy to pass them on. I
absolutely cannot promise that they will be fixed upstream, but it's
possible.

2. In Norwegian, Apertium is used for translating between the two varieties
of Norwegian itself (Bokmål and Nynorsk), and from other Scandinavian
languages. That's probably why it works so well—they are similar in
grammar, vocabulary, and narrative style (I'll pass it on to Apertium
developers—I'm sure they'll be happy to hear it). Unfortunately, machine
translation from English is not available in Apertium. Apertium works best
with very similar languages, and English has two characteristics, which are
unfortunate when combined: it is both the most popular source for
translation into almost all other languages (including Norwegian), and it
is not _very_ similar to any other languages (except maybe Scots). Machine
translation from English into Norwegian is only possible with Yandex at the
moment. More engines may be added in the future, but at the moment that's
all we have. That's why disabling Yandex completely would indeed be a lousy
solution: A lot of people say that without machine translation integration
Content Translation is useless. Not all users think like that, but many do.

3. We can define a numerical threshold for the acceptable percentage of machine
translation post-editing. Currently it's 75%. It's a tad embarrassing that
it's hard-coded at the moment, but it can very easily be made into a
variable per language. If the translator tries to publish a page in which
less than that is modified, a warning will be shown.

4. I'm not sure what you mean by "language model". If it's any kind of a
linguistic engine, then it's definitely not within the resources that the
Language team itself can currently dedicate. However, if somebody who knows
Norwegian and some programming writes a script that analyzes common bad
constructs in a Wikipedia dump, it will be very useful. This would
basically be an upgraded version of suggestion #1 above. (In my spare time
as a volunteer I'm doing something comparable for Hebrew, although not for
translation, but for improving how MediaWiki link trails work.)
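
To make point 3 concrete, here is a minimal sketch of a per-language post-editing threshold. It is an illustration only, not the ContentTranslation code: the word-level diff, the function names, and the THRESHOLDS table are all assumptions, and the thread does not say how the extension actually measures "modified".

```python
from difflib import SequenceMatcher

# Hypothetical per-language thresholds. The thread only mentions a single
# hard-coded value of 75%; the language codes here are examples.
DEFAULT_THRESHOLD = 0.75
THRESHOLDS = {"nb": 0.75, "bn": 0.75}


def modified_fraction(machine_text: str, published_text: str) -> float:
    """Return the share of machine-translated words the translator changed."""
    mt_words = machine_text.split()
    final_words = published_text.split()
    if not mt_words:
        # Nothing came from the machine, so treat the text as fully human.
        return 1.0
    matcher = SequenceMatcher(a=mt_words, b=final_words, autojunk=False)
    # Words that survive unchanged, in order, from the machine output.
    kept = sum(block.size for block in matcher.get_matching_blocks())
    return 1.0 - kept / len(mt_words)


def needs_warning(machine_text: str, published_text: str, lang: str) -> bool:
    """Warn when less than the threshold share of the output was modified."""
    threshold = THRESHOLDS.get(lang, DEFAULT_THRESHOLD)
    return modified_fraction(machine_text, published_text) < threshold
```

Under this measure, correcting "Oppland er en fylket i Norge" to "Oppland er et fylke i Norge" changes only two of six words, so a fully corrected short sentence would still trigger the warning.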
_______________________________________________
Wikimedia-l mailing list, guidelines at: https://meta.wikimedia.org/wiki/Mailing_lists/Guidelines and https://meta.wikimedia.org/wiki/Wikimedia-l
New messages to: [hidden email]
Unsubscribe: https://lists.wikimedia.org/mailman/listinfo/wikimedia-l, <mailto:[hidden email]?subject=unsubscribe>

Re: [Wikimedia-l] machine translation

John Erling Blad
Actually, this _is_ about turning ContentTranslation off; that is what
several users in the community want. They block people using the extension
and delete the translated articles. Use of ContentTranslation has become a
rather contentious case.

Yandex as a general translation engine to be able to read some alien
language is quite good, but as an engine to produce written text it is not
very good at all. In fact it often creates quite horrible Norwegian, even
for closely related languages. One quite common problem is reordering of
words into meaningless constructs; another problem is reordering lexical
gender in weird ways. The English indefinite article "a" is often translated as
"en" in a noun phrase, and then the gender is added to the
following phrase. That gives a translation of "Oppland is a county in…"
into something like "Oppland er en fylket i…" This should be "Oppland er
et fylke i…".

(I just checked and it seems like Yandex messes up a lot less now than
previously, but it is still pretty bad.)

Apertium works because the languages are closely related; Yandex does not
work because it is used between very different languages. People try to use
Yandex and get disappointed, and falsely conclude that all language
translations are equally weird. They are not, but Yandex translations are
weird.

The numerical threshold does not work. The reason is simple: the number of
fixes depends on which language constructs fail, and that is simply not a
constant for small text fragments. Perhaps we could flag specific
language constructs that are known to give a high percentage of failures,
and require the translator to check those sentences. One such language
construct is a discrepancy between the article and the gender of the
following term in a noun phrase. If they do not agree, then the
sentence must be checked. It is not always wrong to write "en jenta" in
Norwegian, but it is likely to be wrong.
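
The check described above could be sketched roughly as follows. This is only an illustration under stated assumptions: the tiny NOUN_FORMS lexicon, the ARTICLES table, and the function name flag_article_mismatches are made up for the example, and a real check would need a full Norwegian lexicon. It flags an indefinite article followed by a definite noun form or by a noun of a gender the article cannot take. Hits are reported for review rather than rejected, since such a pair is not always wrong.

```python
import re

# Bokmål indefinite articles and the noun genders each can precede.
# "en" is also accepted with feminine nouns in Bokmål.
ARTICLES = {"en": {"m", "f"}, "ei": {"f"}, "et": {"n"}}

# Hypothetical mini-lexicon: word form -> (gender, is_definite_form).
NOUN_FORMS = {
    "fylke": ("n", False),
    "fylket": ("n", True),
    "jente": ("f", False),
    "jenta": ("f", True),
    "by": ("m", False),
    "byen": ("m", True),
}


def flag_article_mismatches(sentence: str) -> list[str]:
    """Return 'article noun' pairs that a translator should double-check."""
    words = re.findall(r"\w+", sentence.lower())
    flagged = []
    for article, noun in zip(words, words[1:]):
        if article not in ARTICLES or noun not in NOUN_FORMS:
            continue
        gender, definite = NOUN_FORMS[noun]
        # Suspicious: indefinite article before a definite form,
        # or an article that does not match the noun's gender.
        if definite or gender not in ARTICLES[article]:
            flagged.append(f"{article} {noun}")
    return flagged
```

With this lexicon, "Oppland er en fylket i Norge" yields one flagged pair, "en fylket", while "Oppland er et fylke i Norge" yields none.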

A language model could be a statistical model for the language itself, not
for the translation into that language. We don't want a perfect language
model, but a sufficient language model to mark weird constructs. A very
simple solution could simply be to mark tri-grams that do not already
exist in the text base for the destination as possible errors. It is not
necessary to do a live check, but at least do it before the page can be
saved.
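
A minimal sketch of the tri-gram idea, assuming a plain list of reference sentences stands in for the destination-language text base. The function names are hypothetical, and a realistic text base would be far larger: with a small one, many perfectly good tri-grams would be reported as unseen.

```python
import re


def tokenize(text: str) -> list[str]:
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def trigrams(tokens: list[str]) -> list[tuple[str, str, str]]:
    """Return every run of three consecutive tokens."""
    return list(zip(tokens, tokens[1:], tokens[2:]))


def build_trigram_set(sentences: list[str]) -> set[tuple[str, str, str]]:
    """Collect all tri-grams that occur in the reference text base."""
    known = set()
    for sentence in sentences:
        known.update(trigrams(tokenize(sentence)))
    return known


def unseen_trigrams(text: str, known: set) -> list[tuple[str, str, str]]:
    """Return the tri-grams of `text` that never occur in the text base."""
    return [t for t in trigrams(tokenize(text)) if t not in known]


# Usage: mark possible errors before the page is saved.
text_base = ["Oppland er et fylke i Norge", "Hedmark er et fylke i Norge"]
known = build_trigram_set(text_base)
suspicious = unseen_trigrams("Oppland er en fylket i Norge", known)
```

Here every tri-gram that touches "en fylket" is unseen, so the faulty sentence is marked, while the corrected sentence produces no marks.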

Note the difference in what Yandex does and what we want to achieve; Yandex
translates a text between two different languages, without any clear reason
why. It is not too important if there are weird constructs in the text, as
long as it is usable in "some" context. We translate a text for the purpose
of republishing it. The text should be usable and easily readable in that
language.




Re: [Wikimedia-l] machine translation

Amir E. Aharoni
2017-05-02 21:47 GMT+03:00 John Erling Blad <[hidden email]>:

> Yandex as a general translation engine to be able to read some alien
> language is quite good, but as an engine to produce written text it is not
> very good at all.


... Nor is it supposed to be.

A translator is a person. Machine translation software is not a person,
it's software. It's a tool that is supposed to help a human translator
produce a good written text more quickly. If it doesn't make this work
faster, it can and should be disabled. If no translator


> In fact it often creates quite horrible Norwegian, even
> for closely related languages. One quite common problem is reordering of
> words into meaningless constructs; another problem is reordering lexical
> gender in weird ways. The English indefinite article "a" is often translated as
> "en" in a noun phrase, and then the gender is added to the
> following phrase. That gives a translation of "Oppland is a county in…"
> into something like "Oppland er en fylket i…" This should be "Oppland er
> et fylke i…".
>

I suggest making a page with a list of such examples, so that the machine
translation developers could read it.


> (I just checked and it seems like Yandex messes up a lot less now than
> previously, but it is still pretty bad.)
>

I guess that this is something that Yandex developers will be happy to hear
:)

More seriously, it's quite possible that they already used some of the
translations made by the Norwegian Wikipedia community. In addition to
being published as an article, each translated paragraph is saved into
parallel corpora, and machine translation developers read the edited text
and use it to improve their software. This is completely open and usable by
all machine translation developers, not only Yandex.



> The numerical threshold does not work. The reason is simple: the number of
> fixes depends on which language constructs fail, and that is simply not a
> constant for small text fragments. Perhaps we could flag specific
> language constructs that are known to give a high percentage of failures,
> and require the translator to check those sentences. One such language
> construct is a discrepancy between the article and the gender of the
> following term in a noun phrase.
>

The question is how we would do it with our software. I simply cannot
imagine doing it with the current MediaWiki platform, unless we develop a
sophisticated NLP engine, although it's possible I'm exaggerating or
forgetting something.


> A language model could be a statistical model for the language itself, not
> for the translation into that language. We don't want a perfect language
> model, but a sufficient language model to mark weird constructs. A very
> simple solution could simply be to mark tri-grams that do not already
> exist in the text base for the destination as possible errors. It is not
> necessary to do a live check, but at least do it before the page can be
> saved.
>

See above—we don't have support for plugging something like that into our
workflow.

Perhaps one day some AI/machine-learning system like ORES would be able to
do it. Maybe it could be an extension to ORES itself.


> Note the difference in what Yandex does and what we want to achieve; Yandex
> translates a text between two different languages, without any clear reason
> why. It is not too important if there are weird constructs in the text, as
> long as it is usable in "some" context. We translate a text for the purpose
> of republishing it. The text should be usable and easily readable in that
> language.
>

This is a well-known problem in machine translation: domain.

Professional industrial translation powerhouses use internally-customized
machine translation engines that specialize in particular domains, such as
medicine, law, or news. In theory, it would make a lot of sense to have a
customized machine translation engine for encyclopedic articles, or maybe
even for several different styles of encyclopedic articles (biography,
science, history, etc.). For now what we have is a very general-purpose
consumer-oriented engine. I hope it changes in the future.

Re: [Wikimedia-l] machine translation

Bodhisattwa Mandal
In reply to this post by John Erling Blad
Content translation with Yandex is also a problem in Bengali Wikipedia.
Some users have developed a tendency to create meaningless machine-translated
articles with this extension to increase their edit count and article count. This
has increased the workload of admins, who have to find and delete those articles.

Yandex is not ready for many languages and it would be better to shut it off. We
don't need it in Bengali.

Regards

Re: [Wikimedia-l] machine translation

Pharos-3
I think it all depends on the level of engagement of the human translator.

When the tool is used in the right way, it is a fantastic tool.

Maybe we can find better methods to nudge people toward taking their time
and really doing work on their translations.

Thanks,
Pharos

On Tue, May 2, 2017 at 4:09 PM, Bodhisattwa Mandal <
[hidden email]> wrote:

> Content translation with Yandex is also a problem in Bengali Wikipedia.
> Some users have grown a tendency to create machine translated meaningless
> articles with this extension to increase edit count and article count. This
> has increased the workloads of admins to find and delete those articles.
>
> Yandex is not ready for many languages and it is better to shut it. We
> don't need it in Bengali.
>
> Regards
> On May 3, 2017 12:17 AM, "John Erling Blad" <[hidden email]> wrote:
>
> > Actually this _is_ about turning ContentTranslation off, that is what
> > several users in the community want. They block people using the
> extension
> > and delete the translated articles. Use of ContentTranslation has become
> a
> >  rather contentious case.
> >
> > Yandex as a general translation engine to be able to read some alien
> > language is quite good, but as an engine to produce written text it is
> not
> > very good at all. In fact it often creates quite horrible Norwegian, even
> > for closely related languages. One quite common problem is reordering of
> > words into meaningless constructs, an other problem is reordering lexical
> > gender in weird ways. The English preposition "a" is often translated as
> > "en" in a propositional phrase, and then the gender is added to the
> > following phrase. That gives a translation of  "Oppland is a county in…"
> >  into something like "Oppland er en fylket i…" This should be "Oppland er
> > et fylke i…".
> >
> > (I just checked and it seems like Yandex messes up a lot less now than
> > previously, but it is still pretty bad.)
> >
> > Apertium works because the language is closely related, Yandex does not
> > work because it is used between very different languages. People try to
> use
> > Yandex and gets disappointed, and falsely conclude that all language
> > translations are equally weird. They are not, but Yandex translations are
> > weird.
> >
> > The numerical threshold does not work. The reason is simple, the number
> of
> > fixes depends on language constructs that fails, and that is simply not a
> > constant for small text fragments. Perhaps if we could flag specific
> > language constructs that is known to give a high percentage of failures,
> > and if the translator must check those sentences. One such language
> > construct is disappearances between the preposition and the gender of the
> > following term in a prepositional phrase. If they are not similar, then
> the
> > sentence must be checked. It is not always wrong to write "en jenta" in
> > Norwegian, but it is likely to be wrong.
> >
> > A language model could be a statistical model for the language itself,
> not
> > for the translation into that language. We don't want a perfect language
> > model, but a sufficient language model to mark weird constructs. A very
> > simple solution could simply be to mark tri-grams that does not  already
> > exist in the text base for the destination as possible errors. It is not
> > necessary to do a live check, but  at least do it before the page can be
> > saved.
> >
> > Note the difference in what Yandex do and what we want to achieve; Yandex
> > translates a text between two different languages, without any clear
> reason
> > why. It is not to important if there are weird constructs in the text, as
> > long as it is usable in "some" context. We translate a text for the
> purpose
> > of republishing it. The text should be usable and easily readable in that
> > language.
> >
> >
> >
> > On Tue, May 2, 2017 at 7:07 PM, Amir E. Aharoni <
> > [hidden email]> wrote:
> >
> > > 2017-05-02 18:20 GMT+03:00 John Erling Blad <[hidden email]>:
> > >
> > > > Brute force solution; turn the ContentTranslation off. Really stupid
> > > > solution.
> > >
> > >
> > > ... Then I guess you don't mind that I'm changing the thread name :)
> > >
> > >
> > > > The next solution; turn the Yandex engine off. That would solve a
> > > > part of the problem. Kind of lousy solution though.
> > > >
> > >
> > > > What about adding a language model that warns when the language
> > > constructs
> > > > gets to weird? It is like a "test" for the translation. The CT is
> used
> > > for
> > > > creating a translation, but the language model is used for verifying
> if
> > > the
> > > > translation is good enough. If it does not validate against the
> > language
> > > > model it should simply not be published to the main name space. It
> will
> > > > still be possible to create a draft, but then the user is completely
> > > aware
> > > > that the translation isn't good enough.
> > > >
> > > > Such a language model should be available as a test for any article,
> as
> > > it
> > > > can be used as a quality measure for the article. It is really a
> > quantity
> > > > measure for the well-spokenness of the article, but that isn't quite
> so
> > > > intuitive.
> > > >

Re: [Wikimedia-l] machine translation

Yaroslav Blanter
Creating machine translations only in the draft space (or in the user space
on projects that do not have a draft space) could help.

Cheers
Yaroslav

On Tue, May 2, 2017 at 10:16 PM, Pharos <[hidden email]> wrote:

> I think it all depends on the level of engagement of the human translator.
>
> When the tool is used in the right way, it is a fantastic tool.
>
> Maybe we can find better methods to nudge people toward taking their time
> and really doing work on their translations.
>
> Thanks,
> Pharos
>
> On Tue, May 2, 2017 at 4:09 PM, Bodhisattwa Mandal <[hidden email]> wrote:
>
> > Content translation with Yandex is also a problem on Bengali Wikipedia.
> > Some users have developed a tendency to use this extension to create
> > meaningless machine-translated articles in order to increase their edit
> > count and article count. This has increased the workload of admins, who
> > have to find and delete those articles.
> >
> > Yandex is not ready for many languages, and it is better to shut it off. We
> > don't need it in Bengali.
> >
> > Regards
> > On May 3, 2017 12:17 AM, "John Erling Blad" <[hidden email]> wrote:
> >
> > > Actually this _is_ about turning ContentTranslation off; that is what
> > > several users in the community want. They block people using the extension
> > > and delete the translated articles. Use of ContentTranslation has become a
> > > rather contentious case.
> > >
> > > Yandex as a general translation engine to be able to read some alien
> > > language is quite good, but as an engine to produce written text it is not
> > > very good at all. In fact it often creates quite horrible Norwegian, even
> > > for closely related languages. One quite common problem is reordering of
> > > words into meaningless constructs; another problem is reordering lexical
> > > gender in weird ways. The English article "a" is often translated as
> > > "en" in a prepositional phrase, and then the gender is added to the
> > > following phrase. That gives a translation of "Oppland is a county in…"
> > > into something like "Oppland er en fylket i…" This should be "Oppland er
> > > et fylke i…".
> > >
> > > (I just checked and it seems like Yandex messes up a lot less now than
> > > previously, but it is still pretty bad.)
> > >
> > > Apertium works because the languages are closely related; Yandex does not
> > > work because it is used between very different languages. People try to use
> > > Yandex, get disappointed, and falsely conclude that all language
> > > translations are equally weird. They are not, but Yandex translations are
> > > weird.
> > >
> > > The numerical threshold does not work. The reason is simple: the number of
> > > fixes depends on the language constructs that fail, and that is simply not a
> > > constant for small text fragments. Perhaps we could flag specific
> > > language constructs that are known to give a high percentage of failures,
> > > and require the translator to check those sentences. One such language
> > > construct is a discrepancy between the article and the gender of the
> > > following term in a prepositional phrase. If they are not similar, then the
> > > sentence must be checked. It is not always wrong to write "en jenta" in
> > > Norwegian, but it is likely to be wrong.
> > >
> > > A language model could be a statistical model for the language itself, not
> > > for the translation into that language. We don't want a perfect language
> > > model, but a sufficient language model to mark weird constructs. A very
> > > simple solution could simply be to mark tri-grams that do not already
> > > exist in the text base for the destination as possible errors. It is not
> > > necessary to do a live check, but at least do it before the page can be
> > > saved.
> > >
> > > Note the difference between what Yandex does and what we want to achieve:
> > > Yandex translates a text between two different languages, without any clear
> > > reason why. It is not too important if there are weird constructs in the
> > > text, as long as it is usable in "some" context. We translate a text for the
> > > purpose of republishing it. The text should be usable and easily readable in
> > > that language.

Re: [Wikimedia-l] machine translation

Wojciech Pędzich-2
It does depend a lot on the engagement level of the human behind the
keyboard. When I deal with machine-translated text, I simply wonder
whether the person behind the keyboard made the effort to actually read
the piece.

Now whether this would work if limited to namespaces outside "main" - I
do not want to demonise the issue, but if the person submitting the text
for machine translation does not read it, what will stop them from a
quick ctrl+c / ctrl+v? Just asking.

Wojciech

On 2017-05-03 at 09:33, Yaroslav Blanter wrote:

> Creating machine translations only in the draft space (or in the user space
> in the projects which do not have draft) could help.
>
> Cheers
> Yaroslav

Re: [Wikimedia-l] machine translation

John Erling Blad
In reply to this post by Amir E. Aharoni
>
> More seriously, it's quite possible that they already used some of the
> translations made by the Norwegian Wikipedia community. In addition to
> being published as an article, each translated paragraph is saved into
> parallel corpora, and machine translation developers read the edited text
> and use it to improve their software. This is completely open and usable by
> all machine translation developers, not only for Yandex.


It is quite possible the Yandex people have done something, as the
translations are a lot better now than previously. It also implies that it is
really important to correct the text inside CT.

> The question is how would we do it with our software. I simply cannot
> imagine doing it with the current MediaWiki platform, unless we develop a
> sophisticated NLP engine, although it's possible I'm exaggerating or
> forgetting something.


There are several places this can be inserted, both in VE and in MW. What I
want is a kind of rather simple language model, but Aharoni proposed
LanguageTool in private communication. That lib is very interesting.
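
For illustration, the article/gender check raised earlier in this thread (the "en fylket" and "en jenta" cases) could be sketched roughly as below. This is plain Python and does not use LanguageTool; the tiny lexicon, the reason strings, and the function name `suspicious_pairs` are all made up for the example, and a real check would need a full morphological dictionary for Norwegian.

```python
import re

# Hypothetical toy lexicon: noun form -> (gender, is_definite).
# A real check would load this from a morphological dictionary.
NOUNS = {
    "fylke": ("neuter", False),
    "fylket": ("neuter", True),
    "jente": ("feminine", False),
    "jenta": ("feminine", True),
    "by": ("masculine", False),
    "byen": ("masculine", True),
}

# Bokmål indefinite articles and the noun genders they can precede.
ARTICLES = {
    "en": {"masculine", "feminine"},
    "ei": {"feminine"},
    "et": {"neuter"},
}


def suspicious_pairs(sentence):
    """Return (article, noun, reason) for article+noun pairs worth a human check."""
    tokens = re.findall(r"\w+", sentence.lower())
    flagged = []
    for article, noun in zip(tokens, tokens[1:]):
        if article in ARTICLES and noun in NOUNS:
            gender, definite = NOUNS[noun]
            if gender not in ARTICLES[article]:
                flagged.append((article, noun, "gender mismatch"))
            elif definite:
                # Not always wrong, but likely to be, so only flag it.
                flagged.append((article, noun, "definite noun after indefinite article"))
    return flagged


print(suspicious_pairs("Oppland er en fylket i Norge"))
print(suspicious_pairs("Oppland er et fylke i Norge"))
```

The function only flags sentences for review rather than rejecting them, which matches the point that such constructs are likely, but not certain, to be wrong.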

> Perhaps one day some AI/machine-learning system like ORES would be able to
> do it. Maybe it could be an extension to ORES itself.


I've seen language models implemented as neural nets, but it is not
necessary to do it like that. Actually it is more common to do it with
plain statistics.
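
As a rough sketch of the plain-statistics approach, the tri-gram idea from earlier in the thread (mark tri-grams that do not already exist in the destination text base) might look something like the following. The two-sentence corpus is a stand-in assumption for a real Wikipedia dump, and the function names are hypothetical; a usable model would also need smoothing or frequency thresholds, since many perfectly good tri-grams will be missing from any finite corpus.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def trigrams(tokens):
    """Return all consecutive three-word windows."""
    return list(zip(tokens, tokens[1:], tokens[2:]))


def build_model(corpus_sentences):
    """Count every tri-gram seen in the reference corpus."""
    counts = Counter()
    for sentence in corpus_sentences:
        counts.update(trigrams(tokenize(sentence)))
    return counts


def unseen_trigrams(model, text):
    """List the tri-grams in `text` that never occur in the reference corpus."""
    return [t for t in trigrams(tokenize(text)) if model[t] == 0]


# Stand-in for text extracted from a dump of the destination wiki.
corpus = [
    "Oppland er et fylke i Norge",
    "Hedmark er et fylke i Norge",
]
model = build_model(corpus)

# Each unseen tri-gram marks a spot the translator should look at.
for gram in unseen_trigrams(model, "Oppland er en fylket i Norge"):
    print(" ".join(gram))
```

Because the check runs on finished text only, it could be applied once before saving rather than live while typing.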

On Tue, May 2, 2017 at 9:25 PM, Amir E. Aharoni <[hidden email]> wrote:

> 2017-05-02 21:47 GMT+03:00 John Erling Blad <[hidden email]>:
>
> > Yandex as a general translation engine to be able to read some alien
> > language is quite good, but as an engine to produce written text it is not
> > very good at all.
>
>
> ... Nor is it supposed to be.
>
> A translator is a person. Machine translation software is not a person,
> it's software. It's a tool that is supposed to help a human translator
> produce a good written text more quickly. If it doesn't make this work
> faster, it can and should be disabled. If no translator
>
>
> > In fact it often creates quite horrible Norwegian, even
> > for closely related languages. One quite common problem is reordering of
> > words into meaningless constructs; another problem is reordering lexical
> > gender in weird ways. The English article "a" is often translated as
> > "en" in a prepositional phrase, and then the gender is added to the
> > following phrase. That gives a translation of "Oppland is a county in…"
> > into something like "Oppland er en fylket i…" This should be "Oppland er
> > et fylke i…".
> >
>
> I suggest making a page with a list of such examples, so that the machine
> translation developers could read it.
>
>
> > (I just checked and it seems like Yandex messes up a lot less now than
> > previously, but it is still pretty bad.)
> >
>
> I guess that this is something that Yandex developers will be happy to hear
> :)
>
> More seriously, it's quite possible that they already used some of the
> translations made by the Norwegian Wikipedia community. In addition to
> being published as an article, each translated paragraph is saved into
> parallel corpora, and machine translation developers read the edited text
> and use it to improve their software. This is completely open and usable by
> all machine translation developers, not only for Yandex.
>
>
>
> > The numerical threshold does not work. The reason is simple: the number of
> > fixes depends on the language constructs that fail, and that is simply not a
> > constant for small text fragments. Perhaps we could flag specific
> > language constructs that are known to give a high percentage of failures,
> > and require the translator to check those sentences. One such language
> > construct is a discrepancy between the article and the gender of the
> > following term in a prepositional phrase.
> >
>
> The question is how would we do it with our software. I simply cannot
> imagine doing it with the current MediaWiki platform, unless we develop a
> sophisticated NLP engine, although it's possible I'm exaggerating or
> forgetting something.
>
>
> > A language model could be a statistical model for the language itself, not
> > for the translation into that language. We don't want a perfect language
> > model, but a sufficient language model to mark weird constructs. A very
> > simple solution could simply be to mark tri-grams that do not already
> > exist in the text base for the destination as possible errors. It is not
> > necessary to do a live check, but at least do it before the page can be
> > saved.
> >
>
> See above—we don't have support for plugging something like that into our
> workflow.
>
> Perhaps one day some AI/machine-learning system like ORES would be able to
> do it. Maybe it could be an extension to ORES itself.
>
>
> > Note the difference between what Yandex does and what we want to achieve:
> > Yandex translates a text between two different languages, without any clear
> > reason why. It is not too important if there are weird constructs in the
> > text, as long as it is usable in "some" context. We translate a text for the
> > purpose of republishing it. The text should be usable and easily readable in
> > that language.
> >
>
> This is a well-known problem in machine translation: domain.
>
> Professional industrial translation powerhouses use internally-customized
> machine translation engines that specialize in particular domains, such as
> medicine, law, or news. In theory, it would make a lot of sense to have a
> customized machine translation engine for encyclopedic articles, or maybe
> even for several different styles of encyclopedic articles (biography,
> science, history, etc.). For now what we have is a very general-purpose
> consumer-oriented engine. I hope it changes in the future.

Re: [Wikimedia-l] machine translation

Amir E. Aharoni
[ Meta-comment: We usually call it "CX" and not "CT".[1] ]

2017-05-03 13:37 GMT+03:00 John Erling Blad <[hidden email]>:

> >
> > More seriously, it's quite possible that they already used some of the
> > translations made by the Norwegian Wikipedia community. In addition to
> > being published as an article, each translated paragraph is saved into
> > parallel corpora, and machine translation developers read the edited text
> > and use it to improve their software. This is completely open and usable by
> > all machine translation developers, not only by Yandex.
>
> It is quite possible that the Yandex people have done something, as the
> translations are a lot better now than previously. It also implies that it is
> really important to correct the text inside CT.
>

Absolutely.

All CX users must be encouraged to do this. Translation is done by humans.
That's the whole point. Content Translation is not a machine translation
tool. It's an article creation tool, which includes optional machine
translation for some language pairs. The Content Translation user interface
has three warning messages that discourage publishing unedited machine
translation,[2][3][4] and several of the CX FAQs address this as well.[1]

If a user publishes an unedited machine translation, it should be handled
just like any other problematic page: it must be edited, moved to a draft,
or deleted, and the creating user should be warned.

[1] https://www.mediawiki.org/wiki/Content_translation/Documentation/FAQ
[2] https://translatewiki.net/w/i.php?title=Special:Translations&message=MediaWiki%3ACx-tools-instructions-text4%2Fhe
[3] https://translatewiki.net/w/i.php?title=Special:Translations&message=MediaWiki%3ACx-mt-abuse-warning-title%2Fhe
[4] https://translatewiki.net/w/i.php?title=Special:Translations&message=MediaWiki%3ACx-mt-abuse-warning-text%2Fhe

--
Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי
http://aharoni.wordpress.com
‪“We're living in pieces,
I want to live in peace.” – T. Moore‬

Re: [Wikimedia-l] machine translation

David Cuenca Tudela
In reply to this post by Wojciech Pędzich-2
Perhaps it would be a good idea to compare the translated text to the text
that the user wants to save.

If the two are more than 95% the same, that suggests the user didn't make
the effort to correct the text.
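
A minimal sketch of such a comparison, assuming word-level matching with Python's standard difflib module. The 0.95 threshold and the function names are only illustrative, not how Content Translation actually measures post-editing:

```python
from difflib import SequenceMatcher


def similarity(machine_text, edited_text):
    """Word-level similarity ratio between 0.0 and 1.0."""
    matcher = SequenceMatcher(
        None, machine_text.split(), edited_text.split(), autojunk=False
    )
    return matcher.ratio()


def looks_unedited(machine_text, edited_text, threshold=0.95):
    """True when the saved text is nearly identical to the machine output."""
    return similarity(machine_text, edited_text) > threshold


machine = "Oppland er en fylket i Norge"
print(looks_unedited(machine, machine))                        # saved as-is
print(looks_unedited(machine, "Oppland er et fylke i Norge"))  # corrected
```

A check like this only measures how much was changed. It cannot tell whether the machine output was already correct, so it fits a warning better than a hard block.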

Cheers,
Micru

On Wed, May 3, 2017 at 10:31 AM, Wojciech Pędzich <[hidden email]>
wrote:

> It does depend a lot on the engagement level of the human behind the
> keyboard. When I deal with machine-translated text, I simply wonder whether
> the person behind the keyboard made the effort to actually read the piece.
>
> Now whether this would work if limited to namespaces outside "main" - I do
> not want to demonise the issue, but if the person submitting the text for
> machine translation does not read it, what will stop them from a quick
> ctrl+c / ctrl+v? Just asking.
>
> Wojciech
>
> On 2017-05-03 at 09:33, Yaroslav Blanter wrote:
>
>> Creating machine translations only in the draft space (or in the user space
>> in the projects which do not have draft) could help.
>>
>> Cheers
>> Yaroslav
>>
>> On Tue, May 2, 2017 at 10:16 PM, Pharos <[hidden email]>
>> wrote:
>>
>>> I think it all depends on the level of engagement of the human translator.
>>>
>>> When the tool is used in the right way, it is a fantastic tool.
>>>
>>> Maybe we can find better methods to nudge people toward taking their time
>>> and really doing work on their translations.
>>>
>>> Thanks,
>>> Pharos
>>>
>>> On Tue, May 2, 2017 at 4:09 PM, Bodhisattwa Mandal <
>>> [hidden email]> wrote:
>>>
>>>> Content translation with Yandex is also a problem in Bengali Wikipedia.
>>>> Some users have developed a tendency to create meaningless
>>>> machine-translated articles with this extension to increase edit count
>>>> and article count. This has increased the workload of admins, who have
>>>> to find and delete those articles.
>>>>
>>>> Yandex is not ready for many languages and it is better to shut it off. We
>>>> don't need it in Bengali.
>>>>
>>>> Regards
>>>> On May 3, 2017 12:17 AM, "John Erling Blad" <[hidden email]> wrote:
>>>>
>>>>> Actually this _is_ about turning ContentTranslation off; that is what
>>>>> several users in the community want. They block people using the
>>>>> extension and delete the translated articles. Use of ContentTranslation
>>>>> has become a rather contentious case.
>>>>>
>>>>> Yandex as a general translation engine to be able to read some alien
>>>>> language is quite good, but as an engine to produce written text it is
>>>>> not very good at all. In fact it often creates quite horrible Norwegian,
>>>>> even for closely related languages. One quite common problem is
>>>>> reordering of words into meaningless constructs; another problem is
>>>>> reordering lexical gender in weird ways. The English article "a" is
>>>>> often translated as "en" in a prepositional phrase, and then the gender
>>>>> is added to the following phrase. That gives a translation of "Oppland
>>>>> is a county in…" into something like "Oppland er en fylket i…" This
>>>>> should be "Oppland er et fylke i…".
>>>>>
>>>>> (I just checked and it seems like Yandex messes up a lot less now than
>>>>> previously, but it is still pretty bad.)
>>>>>
>>>>> Apertium works because the languages are closely related; Yandex does not
>>>>> work because it is used between very different languages. People try to
>>>>> use Yandex and get disappointed, and falsely conclude that all language
>>>>> translations are equally weird. They are not, but Yandex translations
>>>>> are weird.
>>>>>
>>>>> The numerical threshold does not work. The reason is simple: the number
>>>>> of fixes depends on the language constructs that fail, and that is simply
>>>>> not a constant for small text fragments. Perhaps we could flag specific
>>>>> language constructs that are known to give a high percentage of
>>>>> failures, and require the translator to check those sentences. One such
>>>>> language construct is a discrepancy between the article and the gender
>>>>> of the following term in a prepositional phrase. If they are not
>>>>> similar, then the sentence must be checked. It is not always wrong to
>>>>> write "en jenta" in Norwegian, but it is likely to be wrong.
>>>>>
>>>>> A language model could be a statistical model for the language itself,
>>>>> not for the translation into that language. We don't want a perfect
>>>>> language model, but a sufficient language model to mark weird
>>>>> constructs. A very simple solution could simply be to mark tri-grams
>>>>> that do not already exist in the text base for the destination as
>>>>> possible errors. It is not necessary to do a live check, but at least do
>>>>> it before the page can be saved.
>>>>>
>>>>> Note the difference in what Yandex does and what we want to achieve;
>>>>> Yandex translates a text between two different languages, without any
>>>>> clear reason why. It is not too important if there are weird constructs
>>>>> in the text, as long as it is usable in "some" context. We translate a
>>>>> text for the purpose of republishing it. The text should be usable and
>>>>> easily readable in that language.
>>>>>



--
Etiamsi omnes, ego non

Re: [Wikimedia-l] machine translation

Lodewijk
Reading this, I get a strong impression the problem may very well be in
setting expectations for the users of this translation tool. If they expect
the automated translation to be rather good, they may get fed up more
easily than when they consider it primarily a glorified dictionary.

Lodewijk

On Wed, May 3, 2017 at 1:06 PM, David Cuenca Tudela <[hidden email]>
wrote:

> Perhaps it would be a good idea to compare the translated text to the text
> that the user wants to save.
>
> If they are more than 95% the same, that means that the user didn't take
> the effort to correct the text.
>
> Cheers,
> Micru

Re: [Wikimedia-l] machine translation

John Erling Blad
In reply to this post by David Cuenca Tudela
Note that some language pairs could easily be 100% correct.

On Wed, May 3, 2017 at 1:06 PM, David Cuenca Tudela <[hidden email]>
wrote:

> Perhaps it would be a good idea to compare the translated text to the text
> that the user wants to save.
>
> If they are more than 95% the same, that means that the user didn't take
> the effort to correct the text.
>
> Cheers,
> Micru

Re: [Wikimedia-l] machine translation

John Erling Blad
In reply to this post by Lodewijk
Agree! I also wonder if translators adapt to specific errors if they are
repeated too often. I wonder if it works like priming the brain to a
specific pattern.

On Wed, May 3, 2017 at 1:15 PM, Lodewijk <[hidden email]>
wrote:

> Reading this, I get a strong impression the problem may very well be in
> setting expectations for the users of this translation tool. If they expect
> the automated translation to be rather good, they may get fed up more
> easily than when they consider it primarily a glorified dictionary.
>
> Lodewijk

Re: [Wikimedia-l] machine translation

Ziko van Dijk-3
Hello,
This seems to me like a social problem, rather than a technical one.
Shutting down the tool would be a disadvantage for those people who benefit
from the tool and do good things with it.
What is the general opinion among the Norwegians about this issue? Is there
consensus on how to deal with this kind of "articles"? If most people
agree they should be speedy-deleted, would that not be a useful deterrent
for those who are not careful enough when using the tool?
Kind regards
Ziko



2017-05-03 13:22 GMT+02:00 John Erling Blad <[hidden email]>:

> Agree! I also wonder if translators adapt to specific errors if they are
> repeated too often. I wonder if it works like priming the brain to a
> specific pattern.

Re: [Wikimedia-l] machine translation

Amir E. Aharoni
In reply to this post by David Cuenca Tudela
2017-05-03 14:06 GMT+03:00 David Cuenca Tudela <[hidden email]>:

> Perhaps it would be a good idea to compare the translated text to the text
> that the user wants to save.
>
> If they are more than 95% the same, that means that the user didn't take
> the effort to correct the text.
>
> Cheers,
> Micru
>
>
As I noted, this already exists. The threshold is set at 75% and can be changed.
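
For readers who want to see what such a check could look like, here is a minimal sketch in Python. It is not the Content Translation code. The function names, the word-level comparison, and the reading of the threshold as "maximum share of machine output left unchanged" are all assumptions made for illustration. The thread mentions both 95% (the proposal above) and 75% (the current setting), so the value is a parameter.

```python
from difflib import SequenceMatcher


def unchanged_share(machine_text: str, edited_text: str) -> float:
    """Return a word-level similarity ratio in [0, 1] between the raw
    machine translation and the text the user wants to publish."""
    machine_words = machine_text.split()
    edited_words = edited_text.split()
    if not machine_words and not edited_words:
        return 1.0  # two empty texts are treated as identical
    matcher = SequenceMatcher(None, machine_words, edited_words, autojunk=False)
    return matcher.ratio()


def should_warn(machine_text: str, edited_text: str,
                max_unchanged: float = 0.75) -> bool:
    """Hypothetical rule: warn when the share of unchanged machine output
    is above the threshold (0.75 is an assumed reading of the 75% setting)."""
    return unchanged_share(machine_text, edited_text) > max_unchanged


if __name__ == "__main__":
    raw = "Oppland er en fylket i Norge"
    fixed = "Oppland er et fylke i Norge"
    print(should_warn(raw, raw))    # unedited machine output
    print(should_warn(raw, fixed))  # two of six words corrected
```

A ratio like this only measures how much was changed, not whether the result is good, which is the objection raised earlier in the thread about purely numerical thresholds.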


--
Amir Elisha Aharoni · אָמִיר אֱלִישָׁע אַהֲרוֹנִי
http://aharoni.wordpress.com
‪“We're living in pieces,
I want to live in peace.” – T. Moore‬

Re: [Wikimedia-l] machine translation

Yaroslav Blanter
My idea was that adding an extra button to press lowers the probability that
the whole process gets completed. If someone is determined to systematically
add bad machine translations to the main namespace, I guess only blocks could
help. On the other hand, an extra button gives at least an opportunity to read
the result and reflect on it.

Cheers
Yaroslav