Native Cherokee XML Dumps 20060619 Posted



The Native Cherokee Language Translation Project has posted XML dumps
against enwiki 06-19-2006 at

Please feel free to download and review. This translation uses
conjugation and verb stem decomposition and reconstruction.
Translation runs are posted in the Sequoyah Syllabary and in text phonetics.
This release improves the XML parser to support
translation and auto-link generation, image translation parsing, and
templates. Since the English Wikipedia XML dumps appear to
support multilanguage tags, this version of the translator
converts them into English and then into Cherokee
for image control statements (right, rucht, etc.).

The website has been used for XML translation import
testing for the past several weeks while I made corrections,
added support for link translations, and tuned the AI engine to
compress, conjugate, decompose, and reconstruct verb stems and
tensing. The site is now fully populated and will remain updated from
now on.

The current translation is up to 92% Cherokee, with a word list of less
than 20MB left to be translated and tensed.

There have been several enhancements to the translation to detect and
correct language drift between the various dialects.

There are four dialects of the Cherokee Language:

Otali (Overhill) - spoken in Oklahoma, 30,000 native speakers
Giduwa (Keetoowah) - spoken in North Carolina, 5,000 native speakers
Southern - formerly spoken in Southern Alabama, Southern Georgia and
Florida (Extinct)
Ahniyvwiya - spoken in New Mexico and Missouri (AniKutani), 500 native
speakers. This dialect is the ancient written
form of the Cherokee Language, which uses the AniKutani Syllabary and is
used by the religious organization
for record keeping. Since this dialect was written and numerous ancient
texts exist, the modern spoken form has not
experienced language drift. The Otali dialect has drifted due to
English influences, and sentence structure in the
Otali dialect more resembles English than that of any of the other Cherokee
dialects.

Of the four dialects, only the Giduwa and Ahniyvwiya dialects are still
100% mappable to the Sequoyah Syllabary.
The Otali dialect is approximately 98% mappable; however, Otali has
contracted verb roots to the point that they are
no longer recognizable in their original form in many words, due to
language drift, synthesized newer sounds,
and hybridization with English.

"do" is now spoken "to" in many words in Otali
"du" is now spoken "tu" in many words in Otali
"l" has replaced "i" in Otali and, as a result, is a new construct in the
language in some words
Many original inflections are now contracted and use English sounds
rather than Syllabary constructs.
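
The contrast between "do"/"du" and "to"/"tu" above can be checked mechanically. What follows is a minimal sketch, not the project's translator. It assumes the Unicode Cherokee character names, looked up through Python's unicodedata module, as a stand-in for the project's own syllabary table, and it assumes words arrive already split into romanized syllables. The function names syllable_to_cherokee and map_word are hypothetical.

```python
import unicodedata


def syllable_to_cherokee(syllable):
    """Return the Cherokee letter for a romanized syllable, or None.

    Assumption: Unicode character names such as "CHEROKEE LETTER DO"
    stand in for the project's own syllabary table.
    """
    if not syllable.isalpha():
        return None
    try:
        return unicodedata.lookup("CHEROKEE LETTER " + syllable.upper())
    except KeyError:
        # No letter with that name: the syllable is not mappable.
        return None


def map_word(syllables):
    """Map pre-split syllables; return (text, list of unmappable syllables)."""
    out, unmappable = [], []
    for syl in syllables:
        letter = syllable_to_cherokee(syl)
        if letter is None:
            unmappable.append(syl)
            out.append(syl)  # fall back to the phonetic spelling
        else:
            out.append(letter)
    return "".join(out), unmappable


print(map_word(["tsa", "la", "gi"]))  # every syllable has a letter
print(map_word(["to", "du"]))         # "to" is reported as unmappable
```

With this table, "do" and "du" resolve to letters while "to" and "tu" do not, because the Unicode Cherokee block defines no letters with those names. Words containing such syllables are the kind that would need correcting back to a mappable form.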

The Cherokee New Testament, translated by Elias Boudinot and his
associates in the early 1800s, is one of the
few surviving documents written in the Giduwa dialect and published
before the language drift began in
Oklahoma. This older dialect is the one most commonly
understood by modern speakers of
Otali. It forms the basis of this translation, together with common
words from the Otali dialect
that are still mappable to the Syllabary and corrected words with the
original verb roots. This project has the
additional purpose of forcing standardization in our immersion efforts
to prevent further language drift by
restandardizing all written works into the Giduwa dialect and converting
non-conforming Otali words back to the Sequoyah Syllabary. Most of the
Cherokee Language spoken in Oklahoma now uses an English Phonetics System
devised by Cherokee Linguist
Dr. Durbin Feeling of Oklahoma University in order to be written
in a way that reflects the modern spoken form, and
is no longer mappable to the Syllabary.

Internal discussions with Dr. Feeling and other members of this project
have resulted in the conclusion that written Cherokee must use
the Sequoyah Syllabary and we are planning to force standardization to
correct this language drift in all translations.  There have been
several committees proposing expanding the syllabary to incorporate the
new sounds, but these efforts may not address the issues of
language drift.  When a language is written and forced to use a standard
alphabet or syllabary it typically does not drift much.

These translations now provide the following additional files, which
address these issues for our project members and force
retranslation of Otali words into the Ahniyvwiya or Giduwa dialects. Many
of the words have been resynthesized into
their original forms in order to be mappable to the Syllabary.

Each file set corresponds to a translation run of the wikitrans Cherokee
Machine Translator:

cherokee-syntax-errors-<date>.txt.bz2 - words in the Otali dialect not
mappable to the syllabary but displayed in text phonetics in the translation
phwiki-<date>-pages-articles.xml - text phonetic translation
sylwiki-<date>-pages-articles.xml - Sequoyah Syllabary translation
untranslated-log-<date>.txt.bz2 - words not yet translated or mapped to
a thesaurus for their Cherokee equivalents
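
For anyone scripting against a file set, the naming scheme above can be sketched as follows. This is an illustration under stated assumptions rather than part of the wikitrans tooling: the date is assumed to be a string such as "20060619" (as in the subject line), the .txt.bz2 logs are assumed to hold one UTF-8 entry per line, and the names file_set and count_entries are hypothetical.

```python
import bz2


def file_set(date):
    """Return the four file names for one translation run.

    Assumption: `date` is a string such as "20060619".
    """
    return {
        "syntax_errors": f"cherokee-syntax-errors-{date}.txt.bz2",
        "phonetic": f"phwiki-{date}-pages-articles.xml",
        "syllabary": f"sylwiki-{date}-pages-articles.xml",
        "untranslated": f"untranslated-log-{date}.txt.bz2",
    }


def count_entries(path):
    """Count non-empty lines in a bz2-compressed text log.

    Assumption: one entry per line, UTF-8 encoded.
    """
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())
```

Calling count_entries on the untranslated log for successive runs would give a simple way to watch the remaining word count fall over time.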

The end goal is to reduce the untranslated log file output to zero and
achieve 100% translation by the end of the summer. There has been a lot of
debate on the translation and on which dialect structure would have the
broadest coverage. Corrected Otali and Giduwa combined, which remap to the
syllabary, have been chosen as the current standard for this effort.

The machine translation is a work in progress.

Jeff Merkey

foundation-l mailing list
[hidden email]

Re: Native Cherokee Lexicons 20060619 Posted


I have received several inquiries and requests from Wikimedia community
members for release of the Cherokee lexicons for incorporation into
the Wiktionary and several adjunct projects which support Wiktionary
projects for the Native Cherokee Language.

Cherokee Language Lexicons have been released and posted to


The released lexicons are in text format and follow the format


for individual words.

These lexicons do not contain the verb parsing and decomposition rules
for pure translation of the 14 tenses, and are simply
dictionaries of common words used by most modern speakers. Complex
sentence construction requires the AI inference
engine, which reorders English sentences into Cherokee constructs, then
synthesizes the verb stem and pronoun modifiers
for each phrase. These lexicons are, however, an ideal beginning for
construction of a Wiktionary for the Cherokee Language.

Jeff V. Merkey