I haven't tried this, but it sounds like translators and Language Engineering folks might be interested.
---------- Forwarded message ----------
From: Alex Brollo <[hidden email]>
Date: Tue, Aug 1, 2017 at 12:10 AM
Subject: [Wikisource-l] An it.source gadget to manage diacritics
To: wikisource list <[hidden email]>
Just to let you know, some it.source contributors are using a convenient gadget to manage diacritics. It can delete, replace or add any of a fairly large list of diacritical marks on any character with a single click.
It uses the .normalize() string method to decompose and (when possible) recompose Unicode characters, which allows the diacritics to be managed on their own, independently of the base ASCII character.
Perhaps this gadget is "reinventing the wheel"...? Anyway, the code is here: https://it.wikisource.
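To illustrate the decompose/edit/recompose idea described above, here is a minimal sketch. It is not the gadget's code: it is written in Python, with unicodedata.normalize standing in for the JavaScript .normalize() method, and the function names (strip_marks, replace_mark, add_mark) are invented for this example.

```python
import unicodedata

def strip_marks(text):
    """Remove combining marks: decompose, drop the marks, recompose."""
    decomposed = unicodedata.normalize("NFD", text)
    # combining() returns the canonical combining class; marks whose
    # class is 0 (a few exist) are deliberately left alone in this sketch.
    kept = "".join(ch for ch in decomposed if unicodedata.combining(ch) == 0)
    return unicodedata.normalize("NFC", kept)

def replace_mark(text, old_mark, new_mark):
    """Swap one combining mark for another on every character carrying it."""
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFC", decomposed.replace(old_mark, new_mark))

def add_mark(text, mark):
    """Append a combining mark to the end of text and recompose if possible."""
    return unicodedata.normalize("NFC", text + mark)
```

For instance, replace_mark applied to "è" with U+0300 (grave) as old_mark and U+0301 (acute) as new_mark should give "é", because the base letter and the mark are separate code points while the text is in NFD.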
Wikisource-l mailing list
Translators-l mailing list
It is worth mentioning that diacritics are not the only decomposable characters: Korean Hangul syllables are also algorithmically decomposable. This could be used to avoid retyping a whole syllabic cluster after 2 jamos (a leading consonant and a vowel) or 3 jamos (a leading consonant, a vowel and a trailing consonant) have been composed.
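The Hangul decomposition is plain arithmetic on the code point. The sketch below uses the constants from the Unicode Standard's Hangul syllable algorithm; the function name decompose_hangul is just an illustrative choice, and Python is used here only for demonstration.

```python
# Constants from the Unicode Hangul syllable decomposition algorithm.
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT   # 588 syllables per leading consonant
S_COUNT = L_COUNT * N_COUNT   # 11172 precomposed syllables in total

def decompose_hangul(syllable):
    """Split one precomposed Hangul syllable into 2 or 3 conjoining jamos."""
    s_index = ord(syllable) - S_BASE
    if not 0 <= s_index < S_COUNT:
        return syllable  # not a precomposed Hangul syllable: return unchanged
    leading = chr(L_BASE + s_index // N_COUNT)
    vowel = chr(V_BASE + (s_index % N_COUNT) // T_COUNT)
    t_index = s_index % T_COUNT
    trailing = chr(T_BASE + t_index) if t_index else ""  # index 0 = no trailing jamo
    return leading + vowel + trailing
```

The result should agree with what NFD normalization produces for the same syllable, so an editor could rely on either.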
Also, I hope that the method will recompose the characters to NFC form once they have been edited and the selection caret goes to another position.
A simple way to do this: don't add any new button, but just press Alt+Backspace to remove only the last character in the NFD form of a character and recompose it immediately after the deletion. This way the text in the editable buffer is always in NFC form; the NFD form is only used internally and temporarily when handling the Alt+Backspace key (which may be repeated and should be able to remove even a non-decomposable character).
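A rough sketch of that Alt+Backspace behaviour, assuming the buffer before the caret is a plain string already in NFC (the function name and the idea of passing the buffer as a string are assumptions for illustration, not an existing editor API):

```python
import unicodedata

def delete_last_decomposed(buffer):
    """Drop the last code point of the NFD form of the final character.

    Only the final character is decomposed; the result is recomposed so
    the buffer returned to the editor is in NFC again.
    """
    if not buffer:
        return buffer
    head, last = buffer[:-1], buffer[-1]
    pieces = unicodedata.normalize("NFD", last)
    # A non-decomposable character has a single piece, so it is removed whole.
    return unicodedata.normalize("NFC", head + pieces[:-1])
```

Pressing the key repeatedly on U+1EBF (e with circumflex and acute) would then give "ê", then "e", then the empty string; on a Hangul syllable with a trailing consonant, the first press removes only that trailing jamo.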
* Some characters that look composed in NFC cannot be decomposed by NFD, because that decomposition is not allowed in NFD either. This is the case for "overstriking" diacritics such as the slash when they occur in some canonical composition pairs, and for a few "compatibility" diacritics that can be decomposed but can only recompose with a base letter: when two of them follow a letter, one will compose and the other will remain after the composed character.
* Ideally, when any composable diacritic, Hangul vowel jamo or Hangul trailing consonant jamo is entered anywhere, the character(s) before it should be checked to see whether this forms an NFC composition. In some cases you'll need to look backward over possibly long sequences because of canonical reordering, but never over more than 254 code points: reordering can only occur in sequences of diacritics with distinct non-zero combining classes, and there cannot be more than 254 classes (in fact, not even 254 classes are assigned in Unicode). Canonical reordering also never occurs between jamos in Hangul syllables, and their composable sequences are limited to 2 or 3 jamos. So you don't need to scan backward over large buffers, and this will still be very fast during input.
* There's no easy way to select an isolated diacritic, but the Alt+Backspace keystroke that drops a diacritic could place it in the clipboard, to allow pasting it somewhere else: press Alt+Backspace, then CTRL+Z to cancel the deletion; the diacritic is now in the clipboard and you can easily paste it anywhere else. This can be useful to fix a text where not all diacritics have been entered (notably for Arabic or Hebrew). It will be faster than using long palettes of letters with diacritics, and instead of palettes with precomposed characters, only isolated diacritics and Hangul vowels or trailing consonants would be placed in the palette. ==> This would greatly improve the usability for Latin as well: we would show only the base letters, and the diacritics would have more space to be selectable, starting with the most frequent ones: acute, grave, diaeresis, circumflex, cedilla, caron, macron, hacek, dot above, and hook. (A-Z and a-z Basic Latin could be dropped from the palette, or hidden by default, as they are on all keyboards and never difficult to enter, but additional letters such as the open o will be useful.) If the palette has a setup for a particular language, it should still show its "natural" alphabet, and then its own diacritics, before listing other rare diacritics.
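The first two points above can be sketched as follows, again in Python and with hypothetical function names. The sketch assumes the buffer is a string in NFC and the caret is an index into it. It walks back only to the nearest character with combining class 0 and renormalizes that short window, leaving the rest of the buffer untouched. It does not handle every case (for example, a Hangul jamo that directly follows the caret is not pulled into the window).

```python
import unicodedata

def has_canonical_decomposition(ch):
    """True if NFD changes the character (false for e.g. o with stroke)."""
    return unicodedata.normalize("NFD", ch) != ch

def type_at_caret(buffer, caret, typed):
    """Insert typed text at the caret and recompose only the nearby cluster."""
    # Walk back over combining marks (non-zero combining class) ...
    start = caret
    while start > 0 and unicodedata.combining(buffer[start - 1]) != 0:
        start -= 1
    # ... then include the character they attach to (or a preceding jamo/syllable).
    start = max(start - 1, 0)
    # Marks that follow the caret belong to the same cluster, so include them too.
    end = caret
    while end < len(buffer) and unicodedata.combining(buffer[end]) != 0:
        end += 1
    window = buffer[start:caret] + typed + buffer[caret:end]
    return buffer[:start] + unicodedata.normalize("NFC", window) + buffer[end:]
```

Typing a dot below (U+0323) after "â" shows why the backward look is needed: canonical reordering moves the dot before the circumflex, and the whole cluster then composes to the single character U+1EAD.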
2017-08-04 0:34 GMT+02:00 Pine W <[hidden email]>: