Language Identification
Overview and Functionality
This module identifies the language of a given text. You can either
type (or cut & paste) the text into the large text field in the form, or
upload a file from your local computer. If you both enter text and
upload a file, the text is discarded and only the file is processed.
Note: The uploaded texts are processed with no human
intervention and discarded after analysis. We do not store these texts or
use them for any other purpose.
You can also steer the identification process by choosing the
"Recognition Method" parameter. It can have one of the following
values:
- Dict A dictionary-based language identification
method. The given text is processed word by word, and each word is looked up
in the dictionaries of the system. This method is computationally
quite expensive, but suited for short texts, where statistical methods
often fail.
- NGram This is the industry standard of language
identification. NGram language identification is fast and reliable for
mid-sized texts, provided that good so-called "language models" are available.
- NVect This is a PetaMem proprietary technology for
language identification. It is in some respects a generalization of the NGram
method. It is computationally expensive and therefore not suited for long
texts, but it yields better results than NGram for short texts.
- Smart As the name of this default parameter
suggests, it switches recognition to a mode that tries to
apply the most suitable method for a given text. This should be the best
choice in all cases.
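To make the difference between the Dict and NGram methods more tangible, here is a minimal Python sketch of both ideas. It is not the code used by this module: the word lists, the sample sentences, the trigram size, the cosine scoring and all function names are assumptions chosen for illustration only. A real system would use full dictionaries and language models trained on large corpora.

```python
import math
import re
from collections import Counter

# Hypothetical toy word lists; real dictionaries are far larger.
WORDLISTS = {
    "en": {"the", "and", "is", "of", "to", "this", "a"},
    "de": {"der", "die", "und", "ist", "das", "ein", "nicht"},
    "fr": {"le", "la", "et", "est", "les", "un", "une"},
}

# Hypothetical one-sentence "corpora" used to build toy language models.
SAMPLES = {
    "en": "this is the house of the king and the queen is not there",
    "de": "das ist das haus von dem könig und die königin ist nicht da",
    "fr": "c'est la maison du roi et la reine n'est pas là",
}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())


def identify_by_dictionary(text, wordlists=WORDLISTS):
    """Dictionary idea: look up every word and count hits per language."""
    words = tokenize(text)
    hits = {lang: sum(w in vocab for w in words)
            for lang, vocab in wordlists.items()}
    best = max(hits, key=hits.get)
    return best if hits[best] > 0 else None  # None: no word was recognized


def trigram_profile(text):
    """Count character trigrams, with spaces marking word boundaries."""
    padded = " " + " ".join(tokenize(text)) + " "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    """Cosine similarity between two trigram count profiles."""
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


# Toy "language models": one trigram profile per language.
MODELS = {lang: trigram_profile(sample) for lang, sample in SAMPLES.items()}


def identify_by_ngrams(text, models=MODELS):
    """N-gram idea: pick the model whose profile is most similar to the text."""
    profile = trigram_profile(text)
    scores = {lang: cosine(profile, model) for lang, model in models.items()}
    return max(scores, key=scores.get)


if __name__ == "__main__":
    print(identify_by_dictionary("This is a test"))
    print(identify_by_ngrams("the king is in the house"))
```

The dictionary function needs only a few recognized words to decide, which is why such a method can work on very short inputs. The trigram function relies on frequency statistics, so it becomes more reliable as the text gets longer and the models get better.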
If you enable the "Show Details" checkbox, the result will be more verbose
about the identification process. The information given depends on the
method chosen.
Examples
Restrictions
The success of language identification depends on clean input
data. The result of the identification process may be distorted if you enter
text containing markup (e.g. HTML, LaTeX, ...).
For optimal results just enter plain text (the encoding does not matter). If
your text is a Word document, a PDF file or similar, convert it to plain text
first.
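If your source is HTML, one way to obtain plain text before submitting it is sketched below in Python, using only the standard library. The helper name html_to_plain_text is hypothetical and not part of this module. The sketch handles HTML only; LaTeX, Word or PDF input needs a different converter.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text content, skipping the bodies of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_plain_text(markup):
    """Return the visible text of an HTML fragment with whitespace collapsed.

    Text chunks are joined with a space, so a word that is split by inline
    tags (e.g. "Hel<b>lo</b>") comes out as two tokens.
    """
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join(" ".join(parser.parts).split())


if __name__ == "__main__":
    print(html_to_plain_text("<p>Hello <b>world</b></p>"))
```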