Standard Input Processors Chain
Introduction
An Input Processor (IP) pre-processes inputs so that the Teneo Engine can perform different processes on them, such as normalization and tokenization of inputs or spelling correction. Each language supported by the Teneo Platform has a chain of Input Processors which know how to process that particular language. The Standard Input Processor chain supports a large number of the languages available in Teneo.
Supported Languages
Currently the below listed languages are supported by the Standard Input Processors chain.
| Supported languages | ||||||
|---|---|---|---|---|---|---|
| Afrikaans | Czech | Georgian | Kinyarwanda | Nepali | Sango | Tigrinya |
| Albanian | Danish | German | Kirundi (Rundi) | Norwegian (Nynorsk/Bokmål) | Scottish Gaelic | Tsonga |
| Amharic | Dutch | Greek | Kyrgyz | Odia | Serbian | Tswana (Setswana) |
| Armenian | English | Gujarati | Latvian | Oromo | Shona | Turkmen |
| Azerbaijani | Esperanto | Hindi | Lithuanian | Papiamento | Sinhala | Ukrainian |
| Basque | Estonian | Hungarian | Luxembourgish | Polish | Slovak | Uzbek |
| Belarusian | Ewe | Icelandic | Macedonian | Portuguese | Slovene | Vietnamese |
| Bengali/Bangla | Faroese | Igbo | Malagasy | Quechuan (Quechua) | Somali | Welsh |
| Bosnian | Finnish* | Indonesian | Malay | Romanian | Spanish | Yoruba |
| Bulgarian | French | Irish | Maltese | Romansh | Swahili (Kiswahili) | Zulu (isiZulu) |
| Catalan | Frisian | Italian | Marathi | Russian | Swazi | |
| Croatian | Galician | Kazakh | Mongolian | Sámi | Swedish | |
* The Input Processor chain for Finnish language also contains the Finnish Splitting Input Processor on top of the IPs in the Standard Input Processor chain.
IP Chain Setup
The below graph displays the default setup of the Standard Input Processors chain.
* The Input Processors marked with a star (*) in the above graph are currently available as NL Analyzers for a selection of the languages only; for more information on available languages, please see the NL Analyzers section.
The default Input Processors are listed below with a short description of the input processor's functionality; the following sections and sub-sections also provide further details.
- Standard Splitting: divides the user input text into sentences and words, considering abbreviations that should not be split.
- Standard Auto Correction: applies spelling correction to the existing words, based on a fixed list of auto-correct mappings.
- Predict: classifies user inputs based on a machine learning model and annotates the user input with the predicted top intent classes and a confidence score; note that as of Teneo 7.3, deferred intent classification is applied.
- Standard Similarity Match Correction: applies spelling correction to the existing words, based on similarity match to the words in the solution dictionary.
- System Annotation: sets a number of annotations based on properties of the user input text.
- Basic Number Recognizer: recognizes all Arabic numbers and annotates them with a NUMBER annotation and a variable which contains the found number.
- Language Detector: identifies the language of the input sentence provided and annotates it with the predicted language together with a confidence score.
Property Reference
Properties can be referenced by other properties using the schema:
${<property name>}
The expression is replaced by the value of the property and the characters ${ and } are removed. This is applicable to property values.
The Java system properties can be referenced by the expression
${systemProperties.<system property name>}
If the web app controller module is used, the servlet context init parameter (defined in the element <context-param> of the web.xml deployment descriptor file) and the servlet configuration init parameters (defined in element <servlet> in the web.xml) can be referenced by the expressions:
${servletContextParameters.<parameter name>}
${servletConfigParameters.<parameter name>}
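As a rough illustration of the substitution rule described above, the following Python sketch resolves `${<property name>}` references inside property values. The function name `resolve_references` and the sample property `base.dir` are hypothetical; the sketch does not show how the Engine itself implements resolution, and leaving unknown references untouched is an assumption.

```python
import re

# Matches ${name}; the name may contain dots, e.g. systemProperties.user.home
_REFERENCE = re.compile(r"\$\{([^}]+)\}")


def resolve_references(value, properties, max_depth=10):
    """Replace ${name} in value with properties[name], repeating for nested references."""
    for _ in range(max_depth):
        # Unknown names are left as-is (an assumption of this sketch)
        resolved = _REFERENCE.sub(
            lambda m: properties.get(m.group(1), m.group(0)), value
        )
        if resolved == value:
            break
        value = resolved
    return value


# Hypothetical property set, for illustration only
props = {
    "base.dir": "/opt/teneo",
    "properties.file.path": "${base.dir}/config",
}
print(resolve_references(props["properties.file.path"], props))
# /opt/teneo/config
```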
General Properties
The following properties are generally available.
| Name | Value |
|---|---|
| properties.file.path | The absolute path of the folder containing additional configuration files for the input processors |
Standard Simplifier
The Standard Simplifier is a simplifier implementation with support for configurable character decomposition and normalization as well as character mapping and it executes the following processing steps:
- Conversion to lower case, considering the configured language locale.
- Optional compatibility simplification: this is Unicode compatibility decomposition (like mapping ² to 2, etc.), with optional exceptions defined by the property excludeFromCompatibilitySimplify. This step is disabled by default, see compatibilitySimplify.
- Optional canonical simplification: Unicode canonical decomposition is applied, then by default all combining characters are deleted (exceptions may be defined with the property excludeFromCanonicalSimplify, which lists the characters whose letter and combining character combinations are left untouched).
- Conversion to Unicode composed form.
- Optional simplification mapping: character/substring replacements as specified by the properties simplificationMapping.* are applied. No mappings are set by default.
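The processing steps can be approximated with Python's standard `unicodedata` module. This is a minimal sketch under stated assumptions, not the Engine's implementation: the function `simplify` and its parameter names are hypothetical stand-ins for the properties described in this section, locale-specific lower-casing is not modeled, and the initial composition of the input is an added convenience so that the exclusion lists can be checked per character.

```python
import unicodedata


def simplify(text, compatibility=False, canonical=True,
             exclude_compatibility="", exclude_canonical="", mapping=None):
    # Lower case first; composing the input is an assumption of this sketch
    text = unicodedata.normalize("NFC", text.lower())
    out = []
    for original in text:
        ch = original
        # Optional compatibility decomposition, e.g. "²" becomes "2"
        if compatibility and original not in exclude_compatibility:
            ch = unicodedata.normalize("NFKD", ch)
        # Optional canonical decomposition, then delete non-spacing marks
        if canonical and original not in exclude_canonical:
            decomposed = unicodedata.normalize("NFD", ch)
            ch = "".join(c for c in decomposed
                         if unicodedata.category(c) != "Mn")
        out.append(ch)
    # Conversion back to Unicode composed form
    text = unicodedata.normalize("NFC", "".join(out))
    # Optional custom mappings, applied last
    for source, replacement in (mapping or {}).items():
        text = text.replace(source, replacement)
    return text


print(simplify("Café"))                                            # cafe
print(simplify("x²", compatibility=True))                          # x2
print(simplify("Bär", exclude_canonical="ä", mapping={"ä": "ae"}))  # baer
```

Without the exclusion in the last call, the accent would already be removed by the canonical step and the mapping for ä would never apply.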
Configuration Properties
Canonical Simplify
| Name | Type | Required | Default |
|---|---|---|---|
| canonicalSimplify | true/false | no | true |
The canonicalSimplify property enables/disables simplification based on canonical decomposition of Unicode characters; see Unicode normalization forms for more information. An exception list can be defined in excludeFromCanonicalSimplify.
If enabled:
- First, canonical decomposition is applied, meaning that accented characters are decomposed into the base letter and combining marks (non-spacing mark) for the accent(s).
- In a second step, all non-spacing marks are deleted, i.e., á becomes a, etc.
- Finally, canonical composition is applied.
Exclude From Canonical Simplify
| Name | Type | Required | Default |
|---|---|---|---|
| excludeFromCanonicalSimplify | string | no | empty |
All characters in the string given here are excluded from the canonical simplification defined above. More precisely, for these characters the character combinations resulting from step one are kept and step two is skipped; see the steps above.
Compatibility Simplify
| Name | Type | Required | Default |
|---|---|---|---|
| compatibilitySimplify | true/false | no | false |
The compatibilitySimplify property enables/disables simplification based on compatibility decomposition of Unicode characters, for example, ⁵ becomes 5.
Exclude From Compatibility Simplify
| Name | Type | Required | Default |
|---|---|---|---|
| excludeFromCompatibilitySimplify | string | no | empty |
All characters in the string given here are excluded from the compatibility simplification described above.
Simplification Mapping
| Name | Type | Required | Default |
|---|---|---|---|
| simplificationMapping.* | Format: simplificationMapping.<n> = <letter(s)>=<replacement>; <n>: number, which must be unique within the simplification mappings of one file; <letter(s)>: string, letter(s) to be replaced; <replacement>: string, replacement | no | empty |
Custom simplification mapping which is applied after canonical and compatibility simplification.
Be aware that, for example, an accented character - for which a custom simplification mapping is applied - must be listed also under excludeFromCanonicalSimplify if canonical simplification isn't disabled.
For example: simplificationMapping.1 = ä=ae also requires excludeFromCanonicalSimplify = ...ä...
Input Processors
Standard Splitting
The Standard Splitting Input Processor splits the user input text into sentences and words based on configurable sentence and word delimiters; splitting exceptions may be defined as a configurable list of abbreviations and a configurable regular expression.
The Standard Splitting IP generates one or more sentences with zero or more words. The generated WordData objects contain the original and simplified form of the word. The final wordform is initialized with the simplified wordform.
Other Considerations
Extra request parameters read by this input processor: (none)
Processing options read by this input processor: (none)
Annotations generated by this input processor: (none)
Configuration Properties
Abbreviation Related Properties
| Name | Type | Required | Default |
|---|---|---|---|
| abbreviations.item.* | Format: abbreviations.item.<n> = <abbreviation>; <n>: number, which must be unique within the abbreviation definitions of one file; <abbreviation>: an abbreviation | no | none |
List of abbreviations; the abbreviations are considered in the sentence separation process and sentence delimiters within abbreviations will not lead to separated sentences.
| Name | Type | Required | Default |
|---|---|---|---|
| abbreviations.file.name | string (filename) | no | empty |
Filename (including path) of an extra file containing abbreviations. A relative filename relates to the location of the properties file.
| Name | Type | Required | Default |
|---|---|---|---|
| abbreviations.file.encoding | string (encoding name) | no | UTF-8 |
Encoding of the extra file containing abbreviations.
Sentences and Words Separation Related Properties
| Name | Type | Required | Default |
|---|---|---|---|
| inputSeparation.sentenceDelimiters | string | no | . ¡ ! ¿ ? … |
List of characters that are used to separate sentences (unless part of an abbreviation).
| Name | Type | Required | Default |
|---|---|---|---|
| inputSeparation.wordDelimiters | string | no | ^ " “ ” ' ‘ ’ ` ´ # $ € £ % & § | ~ ° • [ ] ( ) < > { } = + - ÷ * / \ , : ; . ¡ ! ¿ ? … <SP> <CR> <LF> <HT> |
List of characters that are used to separate words. Delimiting characters are kept as separate words, except for those that are listed under inputSeparation.nonWordCharacters (see below).
| Name | Type | Required | Default |
|---|---|---|---|
| inputSeparation.additionalWordDelimiterRegEx | string | no | empty |
Optional regular expression for delimiting words or defining (optionally zero-width) word boundaries. It may be specified in addition to, or as an alternative to, inputSeparation.wordDelimiters.
Note: in Java 6 & 7, a positive look-behind construct in the regex does not work with Unicode blocks outside the BMP if the block is specified with the \p{In...} construct, probably due to a bug in the Java regex implementation. Instead the characters must be specified directly as a range.
| Name | Type | Required | Default |
|---|---|---|---|
| inputSeparation.nonWordCharacters | string | no | " “ ” ' ‘ ’ ` ´ , ; . ¡ ! ¿ ? … <SP> <CR> <LF> <HT> |
Word separators that are not kept as words; the set of characters specified here should be a subset of inputSeparation.wordDelimiters and the characters matched by inputSeparation.additionalWordDelimiterRegEx.
Example (assuming defaults):
Argh$%, separate this!
will be separated into:
Argh
$
%
separate
this
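The example can be reproduced with a small Python sketch that uses the default values of inputSeparation.wordDelimiters and inputSeparation.nonWordCharacters listed in this section. The function name `split_words` is hypothetical, and the sketch deliberately ignores sentence separation, abbreviations and the regular-expression based properties; it only illustrates how delimiters are either kept as words or dropped.

```python
# Default delimiter sets as listed in this section (<SP>, <CR>, <LF>, <HT> as escapes)
WORD_DELIMITERS = set("^\"“”'‘’`´#$€£%&§|~°•[]()<>{}=+-÷*/\\,:;.¡!¿?… \r\n\t")
NON_WORD_CHARACTERS = set("\"“”'‘’`´,;.¡!¿?… \r\n\t")


def split_words(text):
    words, current = [], []
    for ch in text:
        if ch in WORD_DELIMITERS:
            # A delimiter closes the word collected so far
            if current:
                words.append("".join(current))
                current = []
            # Delimiters are kept as words unless listed as non-word characters
            if ch not in NON_WORD_CHARACTERS:
                words.append(ch)
        else:
            current.append(ch)
    if current:
        words.append("".join(current))
    return words


print(split_words("Argh$%, separate this!"))
# ['Argh', '$', '%', 'separate', 'this']
```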
| Name | Type | Required | Default |
|---|---|---|---|
| inputSeparation.excludeWordDelimitersRegEx | string | no | (?<=([<SP><HT><CR><LF> "“”,;.¡!¿?…\d]|^))[.,](?=\d) |
Regular expression that specifies exceptions to the splitting of a sentence into words; the default regular expression prevents the characters , (comma) and . (dot) from acting as word delimiters when they appear in the context of a number.
Note that the text matched by the regular expression will be excluded from splitting, thus any word splitting characters used only as context condition should be given as zero-width look-behind/look-ahead construct.
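To see which characters such an expression protects, the sketch below uses a Python adaptation of the default expression with a digit look-ahead. Two things are assumptions of the sketch: the placeholders <SP>, <HT>, <CR> and <LF> are written as a space and escape sequences, and the look-behind is split into `^` and a character class because Python's `re` module only accepts fixed-width look-behinds (the Java form does not need this).

```python
import re

# Python adaptation of the default expression; only the "." or "," itself is matched,
# the surrounding context is checked with zero-width constructs.
EXCLUDE_WORD_DELIMITERS = re.compile(
    r'(?:^|(?<=[ \t\r\n"“”,;.¡!¿?…\d]))[.,](?=\d)'
)

text = "Pi is 3.14, not 3,14159."
protected = [(m.start(), m.group()) for m in EXCLUDE_WORD_DELIMITERS.finditer(text)]
print(protected)
# [(7, '.'), (17, ',')]
```

The comma after 3.14 and the final dot are not matched because no digit follows them, so they still act as word delimiters.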
Standard Auto Correction
The Standard Auto Correction Input Processor applies spelling correction based on a configurable list of auto-correction mappings; the corrections are applied to the final form of the sentence words. The Standard Auto Correction IP works on the existing sentences and words passed in and it may modify the final form of words. The count of sentences and words is not modified.
Other Considerations
Extra request parameters read by this input processor: (none)
Processing options read by this input processor: (none)
Annotations generated by this input processor: (none)
Configuration Properties
Properties Related to Autocorrection Mapping
| Name | Type | Required | Default |
|---|---|---|---|
| autoCorrections.item.* | Format: autoCorrections.item.<n> = <incorrect word>=<correct word> <n>: number, which must be unique within the autocorrection definitions of one file <incorrect word>: misspelled word that shall be mapped to a corrected version <correct word>: the corrected version (it must be a single word; word splitting is not supported) | no | none |
List of word mappings for direct replacement of typical misspellings. The replacement takes place after the simplification.
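A minimal Python sketch of how such mappings could be read and applied is shown below. The helper names `load_auto_corrections` and `auto_correct` and the two sample mappings are hypothetical; the sketch only illustrates the one-word-in, one-word-out replacement and is not the Engine's implementation.

```python
def load_auto_corrections(lines):
    """Parse 'autoCorrections.item.<n> = <incorrect word>=<correct word>' lines."""
    corrections = {}
    for line in lines:
        key, _, value = line.partition("=")
        if not key.strip().startswith("autoCorrections.item."):
            continue
        wrong, _, right = value.strip().partition("=")
        corrections[wrong] = right
    return corrections


def auto_correct(final_words, corrections):
    # Each word maps to exactly one word, so the word count never changes
    return [corrections.get(word, word) for word in final_words]


# Hypothetical mappings, for illustration only
config = [
    "autoCorrections.item.1 = teh=the",
    "autoCorrections.item.2 = recieve=receive",
]
corrections = load_auto_corrections(config)
print(auto_correct(["i", "recieve", "teh", "mail"], corrections))
# ['i', 'receive', 'the', 'mail']
```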
Properties Related to External Autocorrection List
| Name | Type | Required | Default |
|---|---|---|---|
| autoCorrections.file.name | string | no | empty |
Filename (including path) of an extra file containing autocorrection mappings in the format:
<incorrect word>=<correct word>
incorrect word: the misspelled word that shall be mapped to a corrected version
correct word: the corrected spelling of the word (it must be a single word, word splitting is not supported).
A relative filename relates to the location of the properties file.
| Name | Type | Required | Default |
|---|---|---|---|
| autoCorrections.file.encoding | string (encoding name) | no | UTF-8 |
Encoding of the extra file containing autocorrection mappings.
Predict
The Predict Input Processor makes use of an intent model generated when classes are available in a Teneo Studio solution to annotate user inputs with the defined classes; intent models can be generated either with Teneo Learn or CLU. Note that as of Teneo 7.3, deferred intent classification is applied and annotations are only created by Predict if references to class annotations are found during the input matching process.
When Predict receives a user input, confidence scores are calculated for each class based on the model, and annotations are created for the most confident class and for each other class that matches the following criteria:
- the confidence is above the minimum confidence (defaults to 0.01)
- the confidence is higher than 0.5 times the confidence value of the top class.
For each selected class, an annotation with the scheme <CLASS_NAME>.INTENT is created, with the value of the model's confidence in the class as well as an annotation variable specifying the used classifier (i.e., Learn, CLU or LearnFallback) and an Order variable defining the order of the selected classes (i.e., 0 for the class with the highest confidence score and 4 for the selected class with the lowest confidence score).
A special annotation <CLASS_NAME>.TOP_INTENT is created for the class with the highest confidence score.
| Annotation | Variable | Variable | Variable | Description |
|---|---|---|---|---|
| <CLASS_NAME>.TOP_INTENT | classifier | confidence | | Annotation created for the class with the highest confidence score |
| <CLASS_NAME>.INTENT | classifier | confidence | Order | Annotation given to each selected class with a maximum of five top classes |
The Predict Input Processor creates a maximum of 5 annotations, regardless of how many classes match the criteria.
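The selection rules can be sketched in Python as follows. The function `select_intents`, the dictionary layout of the annotations and the sample class names are hypothetical; the default values are those stated in this section. Two points are assumptions of the sketch: both criteria are applied as strict comparisons, and the limit of five is applied to the .INTENT annotations, with TOP_INTENT created in addition.

```python
def select_intents(scores, classifier="Learn", min_confidence=0.01,
                   similarity_distance=0.5, max_annotations=5):
    """Return annotation dicts for the classes that would be kept (sketch)."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if not ranked:
        return []
    top_name, top_confidence = ranked[0]
    # The most confident class is always kept; the others must pass both criteria
    selected = [ranked[0]] + [
        (name, confidence)
        for name, confidence in ranked[1:]
        if confidence > min_confidence
        and confidence > similarity_distance * top_confidence
    ]
    annotations = [{"name": top_name + ".TOP_INTENT",
                    "classifier": classifier,
                    "confidence": top_confidence}]
    # Assumption: the limit applies to the .INTENT annotations
    for order, (name, confidence) in enumerate(selected[:max_annotations]):
        annotations.append({"name": name + ".INTENT",
                            "classifier": classifier,
                            "confidence": confidence,
                            "Order": order})
    return annotations


# Hypothetical class names and confidence scores
scores = {"BOOK_FLIGHT": 0.7, "CANCEL_FLIGHT": 0.4, "GREETING": 0.2, "OTHER": 0.005}
for annotation in select_intents(scores):
    print(annotation["name"], annotation["confidence"])
```

With these scores, GREETING is dropped because 0.2 is below 0.5 x 0.7 = 0.35, and OTHER is dropped because it is below both limits.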
Configuration Properties
Min Confidence Similarity Distance
| Name | Type | Required | Default |
|---|---|---|---|
| minConfidenceSimilarityDistance | float | no | 0.5 |
Fraction of the top class's confidence that a class must reach in order to be considered, e.g., if the top confidence class has a confidence of 0.7, classes with confidence lower than 0.5 x 0.7 = 0.35 will be discarded.
Max Number of Annotations
| Name | Type | Required | Default |
|---|---|---|---|
| maxNumberOfAnnotations | int | no | 5 |
The maximum number of class annotations created for each user input.
Min Confidence Threshold
| Name | Type | Required | Default |
|---|---|---|---|
| minConfidenceThreshold | float | no | 0.01 |
The minimum value of confidence a model must have for a class in order to add it as one of the candidate annotations.
Intent Model File Name
| Name | Type | Required | Default |
|---|---|---|---|
| intent.model.file.name | string (filename) | no | empty |
Name of the file containing the machine learning model; it is usually set automatically by Teneo Studio so no configuration is required.
Standard Similarity Match Correction
The Standard Similarity Match Correction Input Processor applies spelling correction based on a similarity matching of sentence words against words provided by a dictionary. The corrections are applied to the final form of the sentence words.
This Input Processor works on the existing sentences and words passed in and it may modify the final form of a word; the count of sentences and words is not modified.
The dictionary used by this Input Processor is the solution dictionary which is solution specific and generated on (re)load of the Engine. It is formed of the bare words in TLML syntaxes in the solution and in referenced libraries. Adding a word to the dictionary is done by adding the word to the syntax of a new or existing Language Object, Entity, Trigger or Transition (in case of the latter two, as TLML Syntax).
Other Considerations
Extra request parameters read by this input processor: (none)
Processing options read by this input processor: (none)
Annotations generated by this input processor: (none)
The various spelling distance constants define the spelling tolerance behavior. For fine-tuning, they can be changed from their default values, although generally fine-tuning should not be required. The defaults are sensible and tested. It is not recommended to change settings due to an isolated problem as it may compromise the IP.
All values are given as percentages. The spelling tolerance process adds up all distance values and divides the sum by the length of the word in the syntax; the result is compared to the spelling tolerance threshold.
The distance from similarities (defined below) takes precedence over the standard distances defined here, for example:
similarities.1 = ah=a:5
Now, blah will have a distance of 5 to bla no matter which value is given under spellingDistance.missingEndLetter.
Configuration Properties
Spelling Tolerance
| Name | Type | Required | Default |
|---|---|---|---|
| spellingTolerance | Integer number 0-100 (0=off) | no | 15 |
The spelling tolerance limit.
The accumulated distance value of comparing a user input word with a syntax word, divided by the length of the syntax word, must not be greater than this limit to consider a user input word similar to a syntax word.
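The comparison itself is simple arithmetic, sketched below in Python. The function `within_tolerance` is hypothetical and takes the individual distance values as given; how the Engine detects which spelling differences apply is not modeled. The value 75 is the default documented in this section for spellingDistance.missingLetter and spellingDistance.keyAdjacentLetter.

```python
def within_tolerance(distances, syntax_word, spelling_tolerance=15):
    """Accumulated distance divided by syntax word length must not exceed the limit."""
    relative_distance = sum(distances) / len(syntax_word)
    return relative_distance <= spelling_tolerance


# "abd" against syntax word "abcd": one missing letter, 75 / 4 = 18.75
print(within_tolerance([75], "abcd"))   # False

# "hrllo" against syntax word "hello": one key-adjacent letter, 75 / 5 = 15
print(within_tolerance([75], "hello"))  # True

# "blah" against syntax word "bla" with similarities.1 = ah=a:5, 5 / 3 = 1.67
print(within_tolerance([5], "bla"))     # True
```

Because the accumulated distance is divided by the word length, the same single error is tolerated in longer words but not in short ones.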
Spelling Distance Extra End Letter
| Name | Type | Required | Default |
|---|---|---|---|
| spellingDistance.extraEndLetter | Integer number >=0 | no | 100 |
Spelling distance for an extra letter at the end of the word, for example:
Syntax: abcd
User input word: abcdx
Spelling Distance Double Instead Single Letter
| Name | Type | Required | Default |
|---|---|---|---|
| spellingDistance.doubleInsteadSingleLetter | Integer number >=0 | no | 62 |
Spelling distance for double letter where a single letter should be, for example:
Syntax: abcd
User input word: abbcd
Spelling Distance Single Instead Double Letter
| Name | Type | Required | Default |
|---|---|---|---|
| spellingDistance.singleInsteadDoubleLetter | Integer number >=0 | no | 62 |
Spelling distance for a single letter where a double letter should be, for example:
Syntax: abbcd
User input word: abcd
Spelling Distance Swapped Letter
| Name | Type | Required | Default |
|---|---|---|---|
| spellingDistance.swappedLetter | Integer number >=0 | no | 100 |
Spelling distance for swapped letters, for example:
Syntax: abcd
User input word: acbd
Spelling Distance Extra Letter
| Name | Type | Required | Default |
|---|---|---|---|
| spellingDistance.extraLetter | Integer number >=0 | no | 75 |
Spelling distance for an extra letter that should not be there, for example:
Syntax: abcd
User input word: abxcd
Spelling Distance Missing Letter
| Name | Type | Required | Default |
|---|---|---|---|
| spellingDistance.missingLetter | Integer number >=0 | no | 75 |
Spelling distance for a missing letter, for example:
Syntax: abcd
User input word: abd
Spelling Distance Wrong Letter
| Name | Type | Required | Default |
|---|---|---|---|
| spellingDistance.wrongLetter | Integer number >=0 | no | 100 |
Spelling distance for a completely wrong letter, for example:
Syntax: abcd
User input word: abxd
Spelling Distance Key Adjacent Letter
| Name | Type | Required | Default |
|---|---|---|---|
| spellingDistance.keyAdjacentLetter | Integer number >=0 | no | 75 |
Spelling distance for a wrong letter, which is adjacent to the correct one on the keyboard, for example:
On qwerty or qwertz keyboards
Syntax: hello
User input word: hrllo
Similarities
| Name | Type | Required | Default |
|---|---|---|---|
| similarities.* | Format: similarities.<n> = <letter(s)>=<letter(s)>:<d> or similarities.<n> = <letter(s)>><letter(s)>:<d> (in the second case, note the > symbol between the two <letter(s)> strings); <n>: number, which must be unique within the similarity definitions of one file; <letter(s)>: string, letter(s) on which a similarity is defined; <d>: positive number indicating the spelling distance given as a percentage value | no | none |
Similarity definitions, where:
Equal sign (=) indicates that the defined similarity works bidirectionally, and
Greater-than sign (>) indicates that the first letter combination in the user input is regarded similar to the second in the syntax, but not vice versa.
The word matching process that takes similarities into account usually runs after simplification, so defining similarities between letters that are replaced by simplification has no effect.
The number <d> is the spelling distance given as a percentage, for example:
similarities.4 = f?ph:25
Upper keyboard row:
| Name | Type | Required | Default |
|---|---|---|---|
| keyboard.row1 | string | no | qwertyuiop |
Middle keyboard row:
| Name | Type | Required | Default |
|---|---|---|---|
| keyboard.row2 | string | no | asdfghjkl |
Lower keyboard row:
| Name | Type | Required | Default |
|---|---|---|---|
| keyboard.row3 | string | no | zxcvbnm |
System Annotation
The System Annotation Input Processor performs simple analysis of the sentence texts to set some annotations and the decision algorithms are configurable by various properties.
Further customization is possible by sub-classing this Input Processor and overriding one or more of the following methods: decideBinary, decideBrackets, decideEmpty, decideExclamation, decideNonsense, decideQuestion, decideQuote.
This Input Processor works on the sentences passed in, but does not modify them.
Other Considerations
Extra request parameters read by this input processor: (none)
Processing options read by this input processor: (none)
Annotations this input processor may generate:
| Annotation | Description |
|---|---|
| _BINARY | The input consists only of characters specified by the properties binaryCharacters (at least one of them) and binaryIgnoredCharacters (zero or more of them) |
| _BRACKETPAIR | The input contains at least one matching pair of the bracket characters specified under the property bracketPairCharacters |
| _EXCLAMATION | The input contains at least one of the characters specified for the property exclamationMarkCharacters |
| _EM3 | The input contains three or more characters in a row of the characters specified with exclamationMarkCharacters |
| _EMPTY | The input contains no text / the sentence text is empty |
| _NONSENSE | The input probably contains nonsense text as configured with the properties consonants, nonsenseThreshold.absolute and nonsenseThreshold.relative |
| _QUESTION | The input contains at least one of the characters specified in property questionMarkCharacters |
| _QT3 | The input contains three or more characters in a row of the characters specified in property questionMarkCharacters |
| _QUOTE | The input contains at least one of the characters specified with property quoteCharacters |
| _DBLQUOTE | The input contains at least one of the characters specified with property doubleQuoteCharacters |
Configuration Properties
Consonants
| Name | Type | Required | Default |
|---|---|---|---|
| consonants | string | no | BCDFGHJKLMNPQRSTVWXYZ bcdfghjklmnpqrstvwxyz |
Contains all letters (upper and lower case) that are considered consonants in the language.
Together with the following two properties for absolute and relative nonsense threshold, the defined consonants are used for detecting probable nonsense inputs like kljljljljjlj.
Nonsense Threshold Absolute
| Name | Type | Required | Default |
|---|---|---|---|
| nonsenseThreshold.absolute | Positive integer number | no | 6 |
For nonsense detection, an input consisting exclusively of the defined number of consonants without any non-consonants is considered nonsense.
Nonsense Threshold Relative
| Name | Type | Required | Default |
|---|---|---|---|
| nonsenseThreshold.relative | Positive integer number | no | 10 |
For nonsense detection, an input containing the defined number of consonants in a row is considered nonsense.
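One possible reading of the two thresholds is sketched below in Python. The function `is_nonsense` is hypothetical, and several details are assumptions: both thresholds are treated as minimum counts ("at least"), the consonant list is used without the space shown in the default value, and the actual decision logic of the Input Processor may differ.

```python
import re

# Default consonants from this section, written without the separating space
CONSONANTS = "BCDFGHJKLMNPQRSTVWXYZbcdfghjklmnpqrstvwxyz"


def is_nonsense(text, absolute=6, relative=10, consonants=CONSONANTS):
    consonant_class = "[" + re.escape(consonants) + "]"
    # Absolute threshold: the whole input is nothing but consonants
    only_consonants = re.fullmatch(
        consonant_class + "{%d,}" % absolute, text) is not None
    # Relative threshold: a long run of consonants somewhere in the input
    long_run = re.search(
        consonant_class + "{%d,}" % relative, text) is not None
    return only_consonants or long_run


print(is_nonsense("kljljljljjlj"))   # True
print(is_nonsense("hello"))          # False
print(is_nonsense("ok zzzzzzzzzz"))  # True
```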
Exclamation Mark Characters
| Name | Type | Required | Default |
|---|---|---|---|
| exclamationMarkCharacters | string | no | ! |
List of characters considered exclamation mark in the language where at least one must occur in the input to set the annotation _EXCLAMATION and a sequence of at least three must occur to set the annotation _EM3.
Question Mark Characters
| Name | Type | Required | Default |
|---|---|---|---|
| questionMarkCharacters | string | no | ? |
List of characters considered question mark in the language where at least one must occur in the input to set the annotation _QUESTION and a sequence of at least three must occur to set the annotation _QT3.
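The exclamation and question mark rules follow the same pattern, which the following Python sketch illustrates. The function `punctuation_annotations` is hypothetical and returns the annotation names as a plain set; the defaults are the ones listed for exclamationMarkCharacters and questionMarkCharacters.

```python
import re


def punctuation_annotations(text, exclamation_chars="!", question_chars="?"):
    annotations = set()
    rules = ((exclamation_chars, "_EXCLAMATION", "_EM3"),
             (question_chars, "_QUESTION", "_QT3"))
    for chars, single, triple in rules:
        if not chars:
            continue
        char_class = "[" + re.escape(chars) + "]"
        # At least one occurrence sets the basic annotation
        if re.search(char_class, text):
            annotations.add(single)
        # A sequence of at least three sets the additional annotation
        if re.search(char_class + "{3,}", text):
            annotations.add(triple)
    return annotations


print(sorted(punctuation_annotations("What???")))
# ['_QT3', '_QUESTION']
```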
Double Quote Characters
| Name | Type | Required | Default |
|---|---|---|---|
| doubleQuoteCharacters | string | no | " |
List of characters considered double quotes in the language; at least one must occur in the input to set the annotation _DBLQUOTE.
Quote Characters
| Name | Type | Required | Default |
|---|---|---|---|
| quoteCharacters | string | no | ‘ |
List of characters considered quote in the language; at least one must occur in the input to set the annotation _QUOTE.
Binary Characters
| Name | Type | Required | Default |
|---|---|---|---|
| binaryCharacters | string | no | 01 |
List of characters to be recognized as binary, sets the annotation _BINARY.
Binary Ignored Characters
| Name | Type | Required | Default |
|---|---|---|---|
| binaryIgnoredCharacters | string | no | ! ? , . - ; : # \r \n \t " ' |
List of characters additionally allowed in binary inputs.
Bracket Pair Characters
| Name | Type | Required | Default |
|---|---|---|---|
| bracketPairCharacters | string | no | ( ) [ ] { } |
List of bracketing characters of which at least one pair (opening and closing bracket of the same type) must occur in the input to set the annotation _BRACKETPAIR.
Special System Annotations
The following two special annotations are set by the Teneo Engine. These special system annotations are not related to individual inputs but rather to whole dialogues and are dependent on the session state.
| Annotation | Description |
|---|---|
| _INIT | Indicates session start, i.e., the first input in a dialogue |
| _TIMEOUT | Indicates the continuation of a previously timed-out session/dialogue |
Basic Number Recognizer
The Basic Number Recognizer identifies all Arabic numbers of the type 123 and 3.14 in the user input and annotates each of them with an annotation associated with a variable which holds the actual numeric value of the number found.
The Basic Number Recognizer is language dependent and each language has its own configuration defining the decimal point characters and the thousands separator character to be ignored.
| Annotation | Variable | Description |
|---|---|---|
| NUMBER | numericValue | Annotation created for identified Arabic numbers in user inputs |
For the annotation and its numeric value variable to be added, a number in the user input must meet the following syntax:
It must match the regular expression:
[,]?[0-9]+([,][0-9]+)*([.][0-9]+)?|[.][0-9]+
It must be parseable by Java's BigDecimal to ensure it is a number
The above syntax provides the following guarantees:
- The sign is not included in the annotated token
- The numericValue variable contains a BigDecimal representation of the number.
The decimal marker(s) and the thousands separator(s) can be configured; in the above regex, the dot is used as a decimal marker and the comma as a thousands separator.
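The two conditions can be tried out with a short Python sketch that uses the regular expression given above and Python's `Decimal` as a stand-in for Java's BigDecimal. The function `numeric_value` is hypothetical, and dropping the comma before parsing is an assumption based on the statement that thousands separators are ignored.

```python
import re
from decimal import Decimal, InvalidOperation

# Regular expression as given in this section (dot = decimal marker, comma = thousands separator)
NUMBER = re.compile(r"[,]?[0-9]+([,][0-9]+)*([.][0-9]+)?|[.][0-9]+")


def numeric_value(token):
    """Return the numeric value of a token, or None if it is not a number."""
    if NUMBER.fullmatch(token) is None:
        return None
    try:
        # Assumption: thousands separators are removed before parsing
        return Decimal(token.replace(",", ""))
    except InvalidOperation:
        return None


print(numeric_value("123"))      # 123
print(numeric_value("1,234.5"))  # 1234.5
print(numeric_value("-5"))       # None, the sign is not part of the token
```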
Language Detector
The Language Detector uses a machine learning model to predict the language of a given user input and adds an annotation, as seen in below table, to the input together with a confidence score of the prediction.
| Annotation | Variable | Description |
|---|---|---|
| <language label>.LANG, e.g., %$DA.LANG | Confidence | Annotation created for the predicted language |
The Language Detector can predict the following 45 languages; the language label used to create the annotation name is in brackets:
Arabic (AR), Bulgarian (BG), Bengali (BN), Catalan (CA), Czech (CS), Danish (DA), German (DE), Greek (EL), English (EN), Esperanto (EO), Spanish (ES), Estonian (ET), Basque (EU), Persian (FA), Finnish (FI), French (FR), Hebrew (HE), Hindi (HI), Hungarian (HU), Indonesian-Malay (ID_MS), Icelandic (IS), Italian (IT), Japanese (JA), Korean (KO), Lithuanian (LT), Latvian (LV), Macedonian (MK), Dutch (NL), Norwegian (NO), Polish (PL), Portuguese (PT), Romanian (RO), Russian (RU), Slovak (SK), Slovenian (SL), Serbian-Croatian-Bosnian (SR_HR), Swedish (SV), Tamil (TA), Telugu (TE), Thai (TH), Tagalog (TL), Turkish (TR), Urdu (UR), Vietnamese (VI) and Chinese (ZH).
Serbian, Bosnian and Croatian are treated as one language under the label SR_HR, and Indonesian and Malay are treated as one language under the label ID_MS.
A number of regexes are also in use by the Input Processor, helping the model to not predict a language for fully numerical inputs, URLs or other type of nonsense inputs.
The Language Detector provides an annotation when the confidence of the prediction is above the threshold of 0.2. For the following languages, however, language annotations are always created (even for predictions below 0.2), since the Language Detector is mostly accurate when predicting them: Arabic, Bengali, Greek, Hebrew, Hindi, Japanese, Korean, Tamil, Telugu, Thai, Chinese, Vietnamese, Persian and Urdu.
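The annotation rule can be summarized in a few lines of Python. The function `language_annotation` and the dictionary it returns are hypothetical; the labels are taken from the language list in this section, and the prediction itself (the machine learning model and the regexes) is not modeled.

```python
# Labels of the languages that are always annotated, regardless of confidence
ALWAYS_ANNOTATED = {"AR", "BN", "EL", "HE", "HI", "JA", "KO",
                    "TA", "TE", "TH", "ZH", "VI", "FA", "UR"}


def language_annotation(label, confidence, threshold=0.2):
    """Return the annotation for a predicted language, or None if none is created."""
    if confidence > threshold or label in ALWAYS_ANNOTATED:
        return {"name": label + ".LANG", "Confidence": confidence}
    return None


print(language_annotation("DA", 0.9))  # {'name': 'DA.LANG', 'Confidence': 0.9}
print(language_annotation("DA", 0.1))  # None
print(language_annotation("JA", 0.1))  # {'name': 'JA.LANG', 'Confidence': 0.1}
```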
Finnish Input Processors Chain
The input processing chain for the Finnish language shares its Input Processors with the Standard Input Processor chain, but additionally includes the Finnish Splitting Input Processor, which comes between the Standard Splitting and Standard Auto Correction Input Processors as displayed in the below graph.
* The DateTime Recognizer (in the above graph) is also available in the Finnish Input Processors chain, but it is currently not supported by the Approach in the Teneo Platform for understanding and interpretation of date and time expressions.
The Input Processors shared with the Standard Input Processor chain are:
- Standard Splitting
- Standard Auto Correction
- Predict
- Standard Similarity Match Correction
- System Annotation
- Basic Number Recognizer
- Language Detector
These Input Processors are all described in detail in the sections above; below, please find details related to the Finnish Splitting Input Processor.
Finnish Splitting
The Finnish Splitting splits off suffixes from the existing sentence words passed in, using configurable word lists in its algorithm. It may modify an existing word (it is set to the word stem) and add one or more words after it (the suffixes split off). These added words all have the same original word form and begin index as the modified word. Words shorter than five characters or contained in the no-cut list will not be split. The count of sentences is not modified.
The suffixes are grouped into five lists:
- clitic,
- participe,
- poss,
- cases, and
- comparison.
Suffixes are searched for and split off in the order of the groups listed above. Within each group, the suffixes are searched in the order given in the configuration file containing the suffixes of the group.
Other Considerations
Extra request parameters read by this input processor: (none)
Processing options read by this input processor: (none)
Annotations generated by this input processor: (none)
Configuration Properties
No Cut
| Name | Type | Required | Default |
|---|---|---|---|
| nocut.file.name | string (filename) | no | empty |
Filename (including path) of a file containing words not to split; a relative filename relates to the location of the properties file.
| Name | Type | Required | Default |
|---|---|---|---|
| nocut.file.encoding | string (encoding name) | no | UTF-8 |
The encoding of the file containing the words not to split.
Clitic
| Name | Type | Required | Default |
|---|---|---|---|
| clitic.file.name | string (filename) | no | empty |
Filename (including path) of a file containing the clitic suffixes; a relative filename relates to the location of the properties file.
| Name | Type | Required | Default |
|---|---|---|---|
| clitic.file.encoding | string (encoding name) | no | UTF-8 |
Encoding of the file containing the clitic suffixes.
Participe
| Name | Type | Required | Default |
|---|---|---|---|
| participe.file.name | string (filename) | no | empty |
Filename (including path) of a file containing participe suffixes; a relative filename relates to the location of the properties file.
| Name | Type | Required | Default |
|---|---|---|---|
| participe.file.encoding | string (encoding name) | no | UTF-8 |
Encoding of the file containing the participe suffixes.
Poss
| Name | Type | Required | Default |
|---|---|---|---|
| poss.file.name | string (filename) | no | empty |
Filename (including path) of a file containing possessive suffixes; a relative filename relates to the location of the properties file.
| Name | Type | Required | Default |
|---|---|---|---|
| poss.file.encoding | string (encoding name) | no | UTF-8 |
Encoding of the file containing the possessive suffixes.
Cases
| Name | Type | Required | Default |
|---|---|---|---|
| cases.file.name | string (filename) | no | empty |
Filename (including path) of a file containing the cases suffixes; a relative filename relates to the location of the properties file.
| Name | Type | Required | Default |
|---|---|---|---|
| cases.file.encoding | string (encoding name) | no | UTF-8 |
Encoding of the file containing the cases suffixes.
Comparison
| Name | Type | Required | Default |
|---|---|---|---|
| comparison.file.name | string (filename) | no | empty |
Filename (including path) of a file containing the comparison suffixes; a relative filename relates to the location of the properties file.
| Name | Type | Required | Default |
|---|---|---|---|
| comparison.file.encoding | string (encoding name) | no | UTF-8 |
Encoding of the file containing the comparison suffixes.