nlp.petamem.com
  Powered by PetaMem technology
Wednesday 08 Feb 2012
Help page

Diacritization Operations

Overview and Functionality

The Diacritization Operations Processor lets you perform various diacritization operations on a given text. It is therefore relevant only for languages that use diacritics, such as German and Czech. English has no diacritics.

This module offers two main kinds of operation: reconstructing diacritics in a text that lacks them, and removing diacritics that are present. Further features:

  • supports diacritization removal and two methods of diacritization restoration
  • based on statistical data derived from corpora
  • reconstruction support for 6 languages

Reconstruction is useful when you want to bring a text from an email or SMS into publishable form. Removal helps you transfer a diacritized text to a medium that cannot display diacritized characters.
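
To make the removal operation concrete, here is a minimal Python sketch. It is not the service's own implementation: it assumes that removal can be approximated by Unicode canonical decomposition followed by dropping combining marks, and the function name remove_diacritics is hypothetical.

```python
import unicodedata


def remove_diacritics(text: str) -> str:
    """Approximate diacritics removal (an assumption, not the service's code)."""
    # NFD splits a character such as "ö" into the base letter "o"
    # plus a combining diaeresis.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop every combining mark; keep all other characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose whatever is left into the usual NFC form.
    return unicodedata.normalize("NFC", stripped)


print(remove_diacritics("Příliš žluťoučký kůň"))  # Prilis zlutoucky kun
print(remove_diacritics("schön"))                 # schon
```

Characters that have no canonical decomposition, such as the German "ß", pass through this sketch unchanged. Whether the service treats them differently is not stated here.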

You can either type (or cut & paste) the text to be processed into the text area of the form, or upload a file from your local computer. If you both enter text and upload a file, the text is discarded and only the file is processed.

In detail, there are 3 modes of operation:

  1. 1st Fit: Reconstruction operation. The most probable alternative of each word (with or without diacritics) is used. Because the "most probable" alternative (see Restrictions below) is not always the correct one, there is also:
  2. Choose: Reconstruction operation. If a word has more than one alternative, all of them are presented so that the user can make the choice. This mode requires the most interaction on the user's side, but the preselected alternative is the same as in '1st Fit' mode and is correct in more than 90% of all cases.
  3. Remove: Removal operation. All diacritics are removed from the given text. This is the inverse of the reconstruction operations.
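
The difference between the two reconstruction modes can be sketched in Python as follows. Everything in this sketch is an assumption made for illustration: the CANDIDATES table, its counts, and the function names are hypothetical, and the real system derives its statistics from corpora rather than from a hand-written dictionary.

```python
from typing import Dict, List

# Hypothetical toy data: diacritics-free form -> possible spellings with
# made-up corpus counts. Keys and input are assumed to be lowercase.
CANDIDATES: Dict[str, Dict[str, int]] = {
    "schon": {"schon": 900, "schön": 400},
    "drucken": {"drucken": 120, "drücken": 150},
    "fur": {"für": 5000},
}


def alternatives(word: str) -> List[str]:
    """Return all known spellings, most frequent first."""
    counts = CANDIDATES.get(word)
    if not counts:
        # Unknown word form: nothing to reconstruct, keep it as it is.
        return [word]
    return sorted(counts, key=lambda spelling: -counts[spelling])


def first_fit(text: str) -> str:
    """'1st Fit': always take the most frequent alternative."""
    return " ".join(alternatives(word)[0] for word in text.split())


def choose(text: str) -> List[List[str]]:
    """'Choose': list every alternative per word, preselected one first."""
    return [alternatives(word) for word in text.split()]


print(first_fit("das ist schon fur dich"))  # das ist schon für dich
print(choose("schon drucken"))  # [['schon', 'schön'], ['drücken', 'drucken']]
```

In the first call the toy data keeps "schon" even where "schön" might have been meant, which is the kind of error the Choose mode lets the user correct.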

To reconstruct diacritics, the system needs to know the language of the text. By default, it tries to identify the language automatically. By setting the Language hint manually, you both save processing resources and eliminate possible uncertainty from the identification process.
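
The role of the Language hint can be pictured with a small Python sketch. The stopword-counting identifier below is only a stand-in chosen for brevity, not the method the service uses, and the names STOPWORDS, identify_language and resolve_language as well as the language codes "de" and "cs" are hypothetical.

```python
from typing import Dict, Optional, Set

# Hypothetical, tiny stopword lists written without diacritics, because
# the input text is expected to lack them.
STOPWORDS: Dict[str, Set[str]] = {
    "de": {"der", "die", "das", "und", "ist", "nicht"},
    "cs": {"je", "se", "na", "ze", "to", "a"},
}


def identify_language(text: str) -> str:
    """Guess the language by counting stopword hits (a crude stand-in)."""
    words = text.lower().split()
    scores = {lang: sum(w in sw for w in words) for lang, sw in STOPWORDS.items()}
    return max(scores, key=lambda lang: scores[lang])


def resolve_language(text: str, hint: Optional[str] = None) -> str:
    """Use the manual hint if given; otherwise fall back to identification."""
    if hint is not None:
        # No identification pass is needed, and no misidentification can occur.
        return hint
    return identify_language(text)


print(resolve_language("das ist nicht schon"))       # de
print(resolve_language("to je na stole"))            # cs
print(resolve_language("kratky text", hint="cs"))    # cs
```

With a hint, the identification step is skipped entirely, which is where both the saved processing and the removed uncertainty come from.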

Examples

Restrictions

Reconstruction tries to restore information that is missing or has been removed from the text. Making this 100% reliable would require a deep semantic analysis of the text. Because the restoration provided by this module is a statistical process, it may get things wrong.
Moreover, "statistical" here means that the system may know a wealth of word forms in the given language, but it cannot know all of them. As a result, a correct reconstruction is sometimes not recognized.