Diacritization Operations
Overview and Functionality
The Diacritization Operations Processor performs various diacritization
operations on a given text. It is therefore only relevant for languages
that use diacritics, such as German and Czech. English, by contrast, uses
almost none.
This module offers two main kinds of operation: reconstructing the
diacritization of a text that lacks diacritics, and removing the diacritics
that are present. Its main features are:
- supports diacritization removal and two methods of diacritization
restoration
- based on statistical data derived from corpora
- reconstruction support for 6 languages
Reconstruction is useful if you want to bring a text from an email or SMS
into publishable form. Removal helps you transfer a text with diacritization
to a medium that cannot display diacritized characters.
You can either type (or copy and paste) the text to be processed into the
text area of the form, or upload a file from your local computer. If you
both enter text and upload a file, the text is discarded and only the file
is processed.
In detail, there are 3 modes of operation:
- 1st Fit (reconstruction operation). The most probable
alternative of a word (either with or without diacritization) is
used. Because the "most probable" alternative (see Restrictions
below) is not always the correct one, there is also the Choose mode.
- Choose (reconstruction operation). If a word has more than
one alternative, all of them are presented so that the user can make their
own choice. Although this mode requires the most interaction on the user's
side, the preselected alternative is the same as in the '1st Fit' mode and
is correct in more than 90% of all cases.
- Remove (removal operation). All diacritics are removed from
the given text. This is the inverse of the reconstruction
operations.
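The removal operation can be illustrated with a minimal Python sketch. It
assumes that removing diacritics means stripping Unicode combining marks after
canonical decomposition. This is one common technique, not necessarily what
the processor does internally. The function name remove_diacritics is
hypothetical. Letters that have no canonical decomposition (for example the
German "ß") pass through unchanged.

```python
import unicodedata


def remove_diacritics(text: str) -> str:
    """Strip combining marks from text (hypothetical helper, not the tool's API)."""
    # NFD splits a precomposed letter such as "ř" into "r" plus a combining caron.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop every combining mark, keep all base characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose whatever is left into the usual NFC form.
    return unicodedata.normalize("NFC", stripped)


print(remove_diacritics("Příliš žluťoučký kůň"))  # prints: Prilis zlutoucky kun
```

Text without diacritics is returned unchanged, so applying the function twice
gives the same result as applying it once.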
For diacritics reconstruction, the system needs to know the language of the
text. By default, it tries to identify the language
automatically. By setting the Language hint manually, you both
save processing resources and eliminate a possible source of uncertainty from
the identification process.
Examples
Restrictions
Reconstruction tries to restore information that is lost or has been removed
from the text. For this to be 100% reliable, a deep semantic analysis of the
given text would be necessary. Because the restoration this module provides
is a statistical process, it may get things wrong.
Moreover, "statistical" here means that the system may know a wealth of
word forms in the given language, but it cannot know all of them. Thus
a correct reconstruction is sometimes not recognized.
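Both restrictions can be seen in a small Python sketch of a frequency-based
lookup. It is only an illustration under stated assumptions, not the
processor's actual implementation: the word counts in TOY_COUNTS are invented,
all names are hypothetical, and the sketch handles lowercase words only.
The function alternatives returns every known form of a word, most frequent
first, which corresponds roughly to what the Choose mode would present. The
function first_fit always takes the first entry, as the 1st Fit mode does.

```python
import re
import unicodedata
from collections import defaultdict

# Invented toy counts; a real system would derive them from large corpora.
TOY_COUNTS = {"für": 120, "fur": 1, "über": 80, "schon": 60, "schön": 40}


def strip_marks(word: str) -> str:
    """Return the word without combining marks (its undiacritized key)."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_index(counts):
    """Map each undiacritized key to its known forms, most frequent first."""
    grouped = defaultdict(list)
    for form, count in counts.items():
        grouped[strip_marks(form)].append((count, form))
    return {
        key: [form for _, form in sorted(forms, reverse=True)]
        for key, forms in grouped.items()
    }


def alternatives(word, index):
    """All known forms of a word; an unknown word is returned as its only option."""
    return index.get(strip_marks(word), [word])


def first_fit(text, index):
    """Replace every word by its most frequent known form."""
    return re.sub(r"\w+", lambda m: alternatives(m.group(0), index)[0], text)


INDEX = build_index(TOY_COUNTS)
print(alternatives("schon", INDEX))  # prints: ['schon', 'schön']
print(first_fit("das ist schon fur dich", INDEX))  # prints: das ist schon für dich
```

In the last line, "fur" is restored to "für", but "schon" stays as it is
because that form is the more frequent one in the toy data, even where
"schön" was intended. Words missing from the table, such as "das", are passed
through unchanged.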