nlp.petamem.com
  Powered by PetaMem technology
Wednesday 08 Feb 2012
Help page

Language Identification

Overview and Functionality

This module identifies the language of a given text. You can either type (or cut & paste) the text into the large text field in the form, or upload a file from your local computer. If you both enter text and upload a file, the text is discarded and only the file is processed.

Note: Uploaded texts are processed without human intervention and discarded after analysis. We do not store these texts or use them for any other purpose.

You can also steer the identification process by setting the "Recognition Method" parameter. It can take one of the following values:

  • Dict A dictionary-based language identification method. The text is processed word by word, and each word is looked up in the dictionaries of the system. This method is computationally quite expensive, but well suited to short texts, where statistical methods often fail.
  • NGram This is the industry standard for language identification. NGram language identification is fast and reliable for mid-sized texts, provided that good so-called "language models" are available.
  • NVect This is a PetaMem proprietary technology for language identification. It is in some respects a generalization of the NGram method. It is computationally expensive and therefore not suited to long texts, but it yields better results than NGram for short texts.
  • Smart This is the default. As the name suggests, it switches recognition to a mode that tries to be smart, i.e. to apply the method best suited to the given text. It should be the best choice in all cases.
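
To illustrate the ideas behind the Dict, NGram and Smart methods listed above, here is a minimal Python sketch. It is not the code that runs on this site. The word lists, the training sentences, the function names and the five-word threshold are all assumptions made up for the example, and the trigram-overlap score is a deliberately simplified stand-in for a real NGram language model. NVect is proprietary and is not sketched.

```python
# Toy data only: real systems use full dictionaries and large training corpora.
DICTIONARIES = {
    "en": {"the", "is", "and", "house", "small", "this"},
    "de": {"das", "ist", "und", "haus", "klein", "dies"},
}
TRAINING_TEXT = {
    "en": "the quick brown fox jumps over the lazy dog and this is the house",
    "de": "der schnelle braune fuchs springt und der hund ist faul das ist das haus",
}
SHORT_TEXT_WORDS = 5  # hypothetical cut-off between "short" and "longer" texts


def char_ngrams(text, n=3):
    """Return all character n-grams of the text, padded with one space per side."""
    padded = " " + " ".join(text.lower().split()) + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


# One very small "language model" per language: the set of trigrams seen in training.
PROFILES = {lang: set(char_ngrams(sample)) for lang, sample in TRAINING_TEXT.items()}


def identify_by_dict(text):
    """Dict idea: look up every word and pick the language with the most hits."""
    words = text.lower().split()
    hits = {lang: sum(w in vocab for w in words) for lang, vocab in DICTIONARIES.items()}
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else None


def identify_by_ngram(text):
    """NGram idea (simplified): pick the language whose profile covers most trigrams."""
    grams = char_ngrams(text)
    if not grams:
        return None
    scores = {lang: sum(g in profile for g in grams) / len(grams)
              for lang, profile in PROFILES.items()}
    return max(scores, key=scores.get)


def identify_smart(text):
    """Smart idea: dictionary lookup for short texts, trigram overlap otherwise."""
    if len(text.split()) < SHORT_TEXT_WORDS:
        result = identify_by_dict(text)
        if result is not None:
            return result
    return identify_by_ngram(text)
```

With this toy data, identify_smart("das Haus ist klein") takes the dictionary route, while identify_smart("the dog is over the house") is long enough to be scored by trigram overlap.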

If you enable the "Show Details" checkbox, the result will include more detail about the identification process. The information given depends on the method chosen.

Examples

Restrictions

The success of language identification depends on clean input data. The result may be distorted if you enter text containing markup (e.g. HTML, LaTeX, ...).

For optimal results, enter plain text only (the encoding does not matter). If your text is a Word document, a PDF file or similar, convert it to plain text first.
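
If your text is HTML, one way to obtain plain text before submitting it is to strip the markup locally. The following Python sketch uses only the standard library module html.parser. The class and function names are chosen for this example and are not part of the service, and the approach handles HTML only, not LaTeX, Word or PDF.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text content and skips the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # turns &amp; etc. into characters
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_plain_text(html):
    """Return the visible text of an HTML fragment with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flushes any text still buffered by the parser
    # Pieces are joined with a space, so a tag inside a word would split that word.
    return " ".join(" ".join(parser.parts).split())


if __name__ == "__main__":
    print(html_to_plain_text("<p>Das <b>Haus</b> ist klein.</p>"))
```

The output of html_to_plain_text can then be pasted into the text field or saved to a file for upload.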