Korean Input Processors Chain

Introduction

An Input Processor (IP) pre-processes inputs so that the Teneo Engine can perform different processes on them, such as normalization, tokenization or spelling correction. Each language supported by the Teneo Platform has a chain of Input Processors that know how to process that particular language. This page details the Input Processors chain for the Korean language.

IP Chain Setup

The following graph displays the setup of the Korean Input Processors chain; each Input Processor is described further in the following sections.

The default Input Processors are briefly described below and in further detail in the following sections on this page.

  • Standard Splitting: divides the user input text into sentences and words, considering abbreviations that should not be split.
  • Korean Morphological Analyzer: runs Komoran on each user input, returning the root of every word as well as a tag containing Part-of-Speech and morphological information.
  • System Annotation: sets a number of annotations based on properties of the user input text.
  • Basic Number Recognizer: recognizes all Arabic numbers and annotates them with a NUMBER annotation and a variable which contains the found number.
  • Language Detector: identifies the language of the input sentence provided and annotates it with the predicted language together with a confidence score.
  • Predict: classifies user inputs based on a machine learning model and annotates the user input with the predicted top intent classes and a confidence score; note that as of Teneo 7.3, deferred intent classification is applied.

Korean Simplifier

The Korean Simplifier is a special kind of processor that is used to normalize the user input by:

  • converting full width Latin letters and Arabic digits into their half width version, and
  • lowercasing the uppercase Latin letters.

This Simplifier is special because it is not run as part of the Input Processors chain, but rather by the Tokenizer when it puts the tokens generated by Komoran into a Teneo data structure. The Simplifier is also run by the condition parser inside the Teneo Engine, which normalizes the Language Object syntax words before adding them to the internal Engine dictionary.
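
The two normalization steps listed above can be sketched as follows. This is a minimal illustration, not the Teneo implementation: the function name `simplify` is hypothetical, and the sketch assumes that only the full-width Latin letters and digits (U+FF10 to U+FF19, U+FF21 to U+FF3A, U+FF41 to U+FF5A) are folded, leaving all other characters untouched.

```python
def simplify(text: str) -> str:
    """Fold full-width Latin letters/digits to half width, then lowercase A-Z.

    Hypothetical helper for illustration; not part of the Teneo API.
    """
    out = []
    for ch in text:
        code = ord(ch)
        # Full-width digits and Latin letters sit exactly 0xFEE0 above
        # their ASCII (half-width) counterparts.
        if (0xFF10 <= code <= 0xFF19
                or 0xFF21 <= code <= 0xFF3A
                or 0xFF41 <= code <= 0xFF5A):
            ch = chr(code - 0xFEE0)
        # Lowercase only the basic Latin uppercase letters.
        if "A" <= ch <= "Z":
            ch = ch.lower()
        out.append(ch)
    return "".join(out)


# Full-width "TENEO" and "7" written with escapes, followed by Hangul and ASCII.
sample = "\uff34\uff25\uff2e\uff25\uff2f \uff17 한국어 ABC"
print(simplify(sample))
```

Hangul characters pass through unchanged, since they have neither case nor a full-width Latin form.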

Input Processors

Standard Splitting

The Standard Splitting Input Processor splits the user input text into sentences and words based on configurable sentence and word delimiters; splitting exceptions may be defined as a configurable list of abbreviations and a configurable regular expression.
The Standard Splitting IP generates one or more sentences with zero or more words. The generated WordData objects contain the original and simplified form of the word. The final wordform is initialized with the simplified wordform.

Other Considerations

Extra request parameters read by this input processor: (none)
Processing options read by this input processor: (none)
Annotations generated by this input processor: (none)

Configuration Properties

Name: abbreviations.item.*
Type: Format: abbreviations.item.<n> = <abbreviation>
  <n>: number, which must be unique within the abbreviation definitions of one file
  <abbreviation>: an abbreviation
Required: no
Default: none

List of abbreviations; the abbreviations are considered in the sentence separation process and sentence delimiters within abbreviations will not lead to separated sentences.

Name: abbreviations.file.name
Type: string (filename)
Required: no
Default: empty

Filename (including path) of an extra file containing abbreviations. A relative filename relates to the location of the properties file.

Name: abbreviations.file.encoding
Type: string (encoding name)
Required: no
Default: UTF-8

Encoding of the extra file containing abbreviations.

Name: inputSeparation.sentenceDelimiters
Type: string
Required: no
Default: . ¡ ! ¿ ? …

List of characters that are used to separate sentences (unless part of an abbreviation).

Name: inputSeparation.wordDelimiters
Type: string
Required: no
Default: $ € £ % & ^ " “ ” # | ~ § ° [ ] ( ) < > { } = + - ÷ \ * / , : ; \r \n \t • . ¡ ! ¿ ? …

List of characters that are used to separate words. Delimiting characters are kept as separate words, except for those that are listed under inputSeparation.nonWordCharacters (see below).

Name: inputSeparation.additionalWordDelimiterRegEx
Type: string
Required: no
Default: empty

Additional word-delimiting regular expression. This optional regular expression delimits words or defines (optionally zero-width) word boundaries, and may be specified in addition to, or as an alternative to, inputSeparation.wordDelimiters.

Note: in Java 6 and 7, a positive look-behind construct in the regex does not work with Unicode blocks outside the BMP if the block is specified with the \p{In...} construct, probably due to a bug in the Java regex implementation. Instead, the characters must be specified directly as a range.

Name: inputSeparation.nonWordCharacters
Type: string
Required: no
Default: " “ ” . ¡ ! ¿ ? … , ; \t \r \n

Word separators that are not kept as words; the set of characters specified here should be a subset of inputSeparation.wordDelimiters and the characters matched by inputSeparation.additionalWordDelimiterRegEx.

Example (assuming defaults):

Argh$%, separate this!

will be separated into:

Argh
$
%
separate
this
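
The example above can be reproduced with a small sketch. This is an illustration only, not the Engine's splitting code: the function name `split_words` is hypothetical, the delimiter sets are copied from the defaults listed on this page, and the sketch assumes that a plain space also separates words and is never kept as a word.

```python
# Defaults copied from inputSeparation.wordDelimiters and
# inputSeparation.nonWordCharacters as listed on this page.
WORD_DELIMITERS = set('$€£%&^"“”#|~§°[]()<>{}=+-÷\\*/,:;\r\n\t•.¡!¿?…')
NON_WORD_CHARACTERS = set('"“”.¡!¿?…,;\t\r\n')


def split_words(text):
    """Split text on delimiters; keep delimiters as words unless non-word."""
    words, current = [], []
    for ch in text:
        if ch == " " or ch in WORD_DELIMITERS:
            # A delimiter closes the word collected so far.
            if current:
                words.append("".join(current))
                current = []
            # Delimiters are kept as separate words unless they are listed
            # as non-word characters (assumption: spaces are never kept).
            if ch != " " and ch not in NON_WORD_CHARACTERS:
                words.append(ch)
        else:
            current.append(ch)
    if current:
        words.append("".join(current))
    return words


print(split_words("Argh$%, separate this!"))
```

The sketch ignores abbreviations, sentence delimiters and the exclusion regular expression described below.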

Name: inputSeparation.excludeWordDelimitersRegEx
Type: string
Required: no
Default: (?<=([<SP><HT><CR><LF> "“”,;.¡!¿?…\d]|^))[.,](?=\d)

Regular expression that specifies exceptions to the splitting of a sentence into words; the default regular expression prevents the characters , (comma) and . (dot) from acting as word delimiters when they appear in the context of a number.

Note that the text matched by the regular expression will be excluded from splitting, thus any word splitting characters used only as context condition should be given as zero-width look-behind/look-ahead construct.
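
To see which characters an exclusion pattern of this kind protects, the sketch below applies an adapted version in Python. Python's `re` module only accepts fixed-width look-behind, so the "start of input" alternative is moved outside the look-behind; the pattern is therefore an adaptation for illustration rather than the Java default itself, and `protected_positions` is a hypothetical helper name.

```python
import re

# Adapted pattern: a dot or comma that is preceded by whitespace, a quote,
# punctuation or a digit (or sits at the start of the input) and is followed
# by a digit. Only the dot/comma itself is matched; the context is zero-width.
EXCLUDE = re.compile(r'(?:(?<=[ \t\r\n"“”,;.¡!¿?…\d])|^)[.,](?=\d)')


def protected_positions(text):
    """Return the indexes of characters that would be excluded from splitting."""
    return [match.start() for match in EXCLUDE.finditer(text)]


for sample in ["Pi is 3.14, ok.", "1,000.5", "a.5"]:
    print(sample, protected_positions(sample))
```

In "Pi is 3.14, ok." only the dot inside 3.14 is protected; the comma and the final dot are not followed by a digit and still act as delimiters.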

Korean Morphological Analyzer

The Korean Morphological Analyzer runs Komoran on every sentence from the user input as provided by the Standard Splitting Input Processor. Komoran returns the root of every word in the sentence as well as a tag that contains both Part-of-Speech (POS) and morphological information. The Korean Morphological Analyzer then converts the root, the Part-of-Speech and the morphological information of every word into Teneo annotations.

The table below lists how the tags from Komoran are mapped to annotations in Teneo.

| Komoran tag | Description | Maps to Teneo annotation(s) |
| --- | --- | --- |
| VA | Adjective | ADJ.POS |
| JKG | Adnominal case marker | JKG.MST |
| MAG | Adverb | ADV.POS |
| JKB | Adverbial case marker | JKB.MST |
| VCP | Affirmation/positive | VCP.MST |
| NA | Analytical Category | NA.MST |
| JX | Auxiliary postpositional particle | JX.MST |
| VX | Auxiliary predicate element | VX.MST |
| NNB | Bound noun | NN.POS, NNB.MST |
| SN | Cardinal number | CARDINAL.POS |
| JKC | Complement case marker | JKC.MST |
| JC | Conjunctive postpositional particle | JC.MST |
| MAJ | Connective adverb | ADV.POS, CONNECTIVE.MST |
| EC | Connective ending | EC.MST |
| VCN | Denial/negative | VCN.MST |
| MM | Determiner | DET.POS |
| ET | Ending of a word | ET.MST |
| ETM | Ending of a word | ETM.MST |
| SL | Foreign word | FOREIGN.POS |
| SH | Foreign word | FOREIGN.POS |
| IC | Interjection | INTERJ.POS |
| NNG | Noun | NN.POS |
| NF | Noun estimation category | NF.MST |
| NR | Numeral | NR.MST |
| JKO | Object case marker | JKO.MST |
| JKQ | Postposition, postpositional particle | JKQ.MST |
| EP | Pre-final ending | EP.MST |
| XPN | Prefix | XPN.MST |
| NP | Pronoun | PRON.POS |
| NNP | Proper Noun | NN.POS, PROPER.POS |
| SF | Punctuation | PUNCT.POS |
| SP | Punctuation | PUNCT.POS |
| SS | Punctuation | PUNCT.POS |
| SE | Punctuation | PUNCT.POS |
| SO | Punctuation | PUNCT.POS |
| XR | Root | XR.MST |
| EF | Sentence-closing ending | EF.MST |
| JKS | Subject case marker | JKS.MST |
| XSN | Suffix | XSN.MST |
| XSV | Suffix | XSV.MST |
| XSA | Suffix | XSA.MST |
| SW | Symbol | SYM.POS |
| VV | Verb | VB.POS |
| NV | Verb estimation category | NV.MST |
| JKV | Vocative case marker | JKV.MST |
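
The mapping in the table is essentially a lookup from a Komoran tag to one or more annotation names. A minimal sketch of that lookup is shown below; it covers only a handful of rows, the names `KOMORAN_TO_TENEO` and `annotations_for` are hypothetical, and the (root, tag) pairs are hand-written for illustration rather than real Komoran output.

```python
# Excerpt of the tag mapping from the table above (not the complete list).
KOMORAN_TO_TENEO = {
    "NNG": ["NN.POS"],
    "NNP": ["NN.POS", "PROPER.POS"],
    "NNB": ["NN.POS", "NNB.MST"],
    "VV": ["VB.POS"],
    "VA": ["ADJ.POS"],
    "MAJ": ["ADV.POS", "CONNECTIVE.MST"],
    "JKS": ["JKS.MST"],
    "SF": ["PUNCT.POS"],
}


def annotations_for(tag):
    """Return the annotation names for a tag, or an empty list if unmapped."""
    return KOMORAN_TO_TENEO.get(tag, [])


# Hand-written (root, tag) pairs standing in for an analyzer result.
analysis = [("서울", "NNP"), ("가", "VV"), (".", "SF")]
for root, tag in analysis:
    print(root, tag, annotations_for(tag))
```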

System Annotation

The System Annotation Input Processor performs a simple analysis of the sentence texts to set a number of annotations. The decision algorithms are configurable through various properties.
Further customization is possible by sub-classing this Input Processor and overriding one or more of the following methods: decideBinary, decideBrackets, decideEmpty, decideExclamation, decideNonsense, decideQuestion, decideQuote.

This Input Processor works on the sentences passed in, but does not modify them.

Other Considerations

Extra request parameters read by this input processor: (none)
Processing options read by this input processor: (none)
Annotations this input processor may generate:

| Annotation | Description |
| --- | --- |
| _BINARY | The input consists only of characters specified by the properties binaryCharacters (at least one of them) and binaryIgnoredCharacters (zero or more of them) |
| _BRACKETPAIR | The input contains at least one matching pair of the bracket characters specified under the property bracketPairCharacters |
| _EXCLAMATION | The input contains at least one of the characters specified for the property exclamationMarkCharacters |
| _EM3 | The input contains three or more characters in a row of the characters specified with exclamationMarkCharacters |
| _EMPTY | The input contains no text / the sentence text is empty |
| _NONSENSE | The input probably contains nonsense text as configured with the properties consonants, nonsenseThreshold.absolute and nonsenseThreshold.relative |
| _QUESTION | The input contains at least one of the characters specified in property questionMarkCharacters |
| _QT3 | The input contains three or more characters specified in property questionMarkCharacters |
| _QUOTE | The input contains at least one of the characters specified with property quoteCharacters |
| _DBLQUOTE | The input contains at least one of the characters specified with property doubleQuoteCharacters |

Configuration Properties

Consonants
Name: consonants
Type: string
Required: no
Default: BCDFGHJKLMNPQRSTVWXZ bcdfghjklmnpqrsßtvwxz ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ

Contains all letters (upper and lower case) that are considered consonants in the language.
Together with the following two properties for absolute and relative nonsense threshold, the defined consonants are used for detecting probable nonsense inputs like kljljljljjlj.

Nonsense Threshold Absolute
Name: nonsenseThreshold.absolute
Type: positive integer number
Required: no
Default: 6

For nonsense detection, an input consisting exclusively of the defined number of consonants without any non-consonants is considered nonsense.

Nonsense Threshold Relative
Name: nonsenseThreshold.relative
Type: positive integer number
Required: no
Default: 10

For nonsense detection, an input containing the defined number of consonants in a row is considered nonsense.

Exclamation Mark Characters
Name: exclamationMarkCharacters
Type: string
Required: no
Default: !!

List of characters considered exclamation mark in the language where at least one must occur in the input to set the annotation _EXCLAMATION and a sequence of at least three must occur to set the annotation _EM3.

Question Mark Characters
Name: questionMarkCharacters
Type: string
Required: no
Default: ??

List of characters considered question mark in the language where at least one must occur in the input to set the annotation _QUESTION and a sequence of at least three must occur to set the annotation _QT3.

Double Quote Characters
Name: doubleQuoteCharacters
Type: string
Required: no
Default: " “ ”"『』

List of characters considered double quotes in the language; at least one must occur in the input to set the annotation _DBLQUOTE.

Quote Characters
Name: quoteCharacters
Type: string
Required: no
Default: ''「」

List of characters considered quote in the language; at least one must occur in the input to set the annotation _QUOTE.

Binary Characters
Name: binaryCharacters
Type: string
Required: no
Default: 01

List of characters to be recognized as binary, sets the annotation _BINARY.

Binary Ignored Characters
Name: binaryIgnoredCharacters
Type: string
Required: no
Default: ! ? , . - ; : # \r \n \t " '

List of characters additionally allowed in binary inputs.

Bracket Pair Characters
Name: bracketPairCharacters
Type: string
Required: no
Default: () [] {}()〔〕[]{}〈〉《》〚〛〘〙〖〗【】⦅⦆

List of bracketing characters of which at least one pair (opening and closing bracket of the same type) must occur in the input to set the annotation _BRACKETPAIR.
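
A few of the character-based decisions described in this section can be sketched as below. This is an assumption-laden illustration and not the Input Processor's code: all names are hypothetical, only ASCII subsets of the configured character lists are used, a "sequence of at least three" is read as three or more listed characters in a row, a bracket pair is read as an opening bracket followed later by its closing bracket, and nonsense and quote detection are left out.

```python
# Simplified ASCII subsets of the configured character lists (assumption).
EXCLAMATION_MARKS = "!"
QUESTION_MARKS = "?"
BINARY_CHARACTERS = "01"
BINARY_IGNORED = "!?,.-;:#\r\n\t\"'"
BRACKET_PAIRS = ["()", "[]", "{}"]


def has_run(text, chars, length=3):
    """True if `text` contains `length` or more characters from `chars` in a row."""
    run = 0
    for ch in text:
        run = run + 1 if ch in chars else 0
        if run >= length:
            return True
    return False


def has_bracket_pair(text):
    """True if an opening bracket is followed later by its closing bracket."""
    for opening, closing in BRACKET_PAIRS:
        start = text.find(opening)
        if start != -1 and text.find(closing, start + 1) != -1:
            return True
    return False


def system_annotations(text):
    """Return the set of (sketched) system annotations for one sentence text."""
    if text == "":
        return {"_EMPTY"}
    found = set()
    if any(ch in EXCLAMATION_MARKS for ch in text):
        found.add("_EXCLAMATION")
    if has_run(text, EXCLAMATION_MARKS):
        found.add("_EM3")
    if any(ch in QUESTION_MARKS for ch in text):
        found.add("_QUESTION")
    if has_run(text, QUESTION_MARKS):
        found.add("_QT3")
    # Binary: at least one binary character, nothing outside the two lists.
    if any(ch in BINARY_CHARACTERS for ch in text) and all(
        ch in BINARY_CHARACTERS or ch in BINARY_IGNORED for ch in text
    ):
        found.add("_BINARY")
    if has_bracket_pair(text):
        found.add("_BRACKETPAIR")
    return found


print(sorted(system_annotations("Really???")))
print(sorted(system_annotations("0110!")))
```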

Special System Annotations

The following two special annotations are set by the Teneo Engine. These special system annotations are not related to individual inputs but rather to whole dialogues, and depend on the session state.

| Annotation | Description |
| --- | --- |
| _INIT | Indicates session start, i.e., the first input in a dialogue |
| _TIMEOUT | Indicates the continuation of a previously timed-out session/dialogue |

Basic Number Recognizer

The Basic Number Recognizer identifies all Arabic numbers of the type 123 and 3.14 in the user input and annotates each of them with an annotation associated with a variable which holds the actual numeric value of the number found.
The Basic Number Recognizer is language dependent and each language has its own configuration defining the decimal point characters and the thousands separator character to be ignored.

| Annotation | Variable | Description |
| --- | --- | --- |
| NUMBER | numericValue | Annotation created for identified Arabic numbers in user inputs |

For the annotation and its numeric value variable to be added, a number in the user input must meet the following syntax:

It must match the regular expression:

[,]?[0-9]+([,][0-9]+)*([.][0-9]+)?|[.][0-9]+

It must be parseable by Java's BigDecimal to ensure it is a number

The above syntax provides the following guarantees:

  • The sign is not included in the annotated token
  • The numericValue variable contains a BigDecimal representation of the number.

The decimal marker(s) and the thousands separator(s) can be configured; in the above regex, the dot is used as the decimal marker and the comma as the thousands separator.
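
The two conditions above (regex match, then numeric parsing) can be sketched as follows. The sketch uses Python's `decimal.Decimal` as a stand-in for Java's BigDecimal, treats the comma as a character to ignore before parsing, and uses a hypothetical function name `numeric_value`; it is not the recognizer's actual code.

```python
import re
from decimal import Decimal, InvalidOperation

# Regular expression quoted in this section (dot as decimal marker,
# comma as an ignored grouping character).
NUMBER_PATTERN = re.compile(r"[,]?[0-9]+([,][0-9]+)*([.][0-9]+)?|[.][0-9]+")


def numeric_value(token):
    """Return the Decimal value of a token, or None if it is not a number."""
    if NUMBER_PATTERN.fullmatch(token) is None:
        return None
    try:
        # Drop the ignored characters before parsing.
        return Decimal(token.replace(",", ""))
    except InvalidOperation:
        return None


for token in ["123", "3.14", "1,234.5", ".5", "-3", "abc"]:
    print(token, numeric_value(token))
```

A token such as -3 is rejected as a whole here because the sign is not part of the pattern, which matches the guarantee that the sign is not included in the annotated token.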

Configuration Properties

Name: decimalMarkers
Default: ..

The default decimal markers in Korean.

Name: charactersToIgnore
Default: , ,

The default characters to ignore.

Language Detector

The Language Detector uses a machine learning model to predict the language of a given user input and adds an annotation, as shown in the table below, to the input together with a confidence score of the prediction.

| Annotation | Variable | Description |
| --- | --- | --- |
| <language label>.LANG, e.g., %$DA.LANG | Confidence | Annotation created for the predicted language |

The Language Detector can predict the following 45 languages; the language label used to create the annotation name is in brackets:

Arabic (AR), Bulgarian (BG), Bengali (BN), Catalan (CA), Czech (CS), Danish (DA), German (DE), Greek (EL), English (EN), Esperanto (EO), Spanish (ES), Estonian (ET), Basque (EU), Persian (FA), Finnish (FI), French (FR), Hebrew (HE), Hindi (HI), Hungarian (HU), Indonesian-Malay (ID_MS), Icelandic (IS), Italian (IT), Japanese (JA), Korean (KO), Lithuanian (LT), Latvian (LV), Macedonian (MK), Dutch (NL), Norwegian (NO), Polish (PL), Portuguese (PT), Romanian (RO), Russian (RU), Slovak (SK), Slovenian (SL), Serbian-Croatian-Bosnian (SR_HR), Swedish (SV), Tamil (TA), Telugu (TE), Thai (TH), Tagalog (TL), Turkish (TR), Urdu (UR), Vietnamese (VI) and Chinese (ZH).

Serbian, Bosnian and Croatian are treated as one language under the label SR_HR, and Indonesian and Malay are treated as one language under the label ID_MS.

The Input Processor also uses a number of regexes that keep the model from predicting a language for fully numerical inputs, URLs or other types of nonsense input.

The Language Detector provides an annotation when the confidence of the prediction is above the 0.2 threshold. For the following languages, however, language annotations are always created (even for predictions below 0.2), since the Language Detector is mostly accurate when predicting them: Arabic, Bengali, Greek, Hebrew, Hindi, Japanese, Korean, Tamil, Telugu, Thai, Chinese, Vietnamese, Persian and Urdu.
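
The threshold rule described above can be written down as a short sketch. The function name `should_annotate` and the constant names are hypothetical; the labels are taken from the language list on this page, and the sketch only models the decision, not the prediction itself.

```python
CONFIDENCE_THRESHOLD = 0.2

# Labels of the languages that are always annotated, per the list above.
ALWAYS_ANNOTATED = {
    "AR", "BN", "EL", "HE", "HI", "JA", "KO",
    "TA", "TE", "TH", "ZH", "VI", "FA", "UR",
}


def should_annotate(label, confidence):
    """Decide whether a <label>.LANG annotation would be created."""
    return label in ALWAYS_ANNOTATED or confidence > CONFIDENCE_THRESHOLD


print(should_annotate("DA", 0.1))  # below the threshold, not in the list
print(should_annotate("KO", 0.1))  # below the threshold, but always annotated
print(should_annotate("DA", 0.3))  # above the threshold
```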

Predict

The Predict Input Processor makes use of an intent model generated when classes are available in a Teneo Studio solution to annotate user inputs with the defined classes; intent models can be generated either with Teneo Learn or CLU. Note that as of Teneo 7.3, deferred intent classification is applied and annotations are only created by Predict if references to class annotations are found during the input matching process.

When Predict receives a user input, confidence scores are calculated for each class based on the model, and annotations are created for the most confident class and for each other class that matches the following criteria:

  • the confidence is above the minimum confidence (defaults to 0.01)
  • the confidence is higher than 0.5 times the confidence value of the top class.

For each selected class, an annotation with the scheme <CLASS_NAME>.INTENT is created, with the value of the model's confidence in the class as well as an annotation variable specifying the used classifier (i.e., Learn, CLU or LearnFallback) and an Order variable defining the order of the selected classes (i.e., 0 for the class with the highest confidence score and 4 for the selected class with the lowest confidence score).
A special annotation <CLASS_NAME>.TOP_INTENT is created for the class with the highest confidence score.

| Annotation | Variable | Variable | Variable | Description |
| --- | --- | --- | --- | --- |
| <CLASS_NAME>.TOP_INTENT | classifier | confidence | | Annotation created for the class with the highest confidence score |
| <CLASS_NAME>.INTENT | classifier | confidence | Order | Annotation given to each selected class with a maximum of five top classes |

The Predict Input Processor creates a maximum of 5 annotations, regardless of how many classes match the criteria.
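
The selection rule described in this section can be sketched as below. This is a simplified illustration under stated assumptions, not the Predict implementation: the function name `select_intents` is hypothetical, the class scores are made up, the classifier variable and the TOP_INTENT annotation are left out, and the limit of five is applied to the INTENT annotations.

```python
def select_intents(scores, min_confidence=0.01, similarity=0.5, max_annotations=5):
    """Return (annotation name, confidence, order) tuples for the selected classes.

    `scores` maps class names to confidence values.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if not ranked:
        return []
    top_confidence = ranked[0][1]
    # The most confident class is always selected.
    selected = [ranked[0]]
    for name, confidence in ranked[1:]:
        # Other classes must clear the minimum confidence and stay within
        # the configured fraction of the top confidence.
        if confidence > min_confidence and confidence > similarity * top_confidence:
            selected.append((name, confidence))
    selected = selected[:max_annotations]
    return [
        (f"{name}.INTENT", confidence, order)
        for order, (name, confidence) in enumerate(selected)
    ]


# Made-up scores: with a top confidence of 0.7 the cut-off is 0.35.
example = {"BOOK_FLIGHT": 0.7, "CANCEL": 0.4, "GREET": 0.3, "OTHER": 0.005}
print(select_intents(example))
```

With these scores only BOOK_FLIGHT (order 0) and CANCEL (order 1) are selected; GREET falls below the cut-off of 0.35.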

Configuration Properties

Min Confidence Similarity Distance
Name: minConfidenceSimilarityDistance
Type: float
Required: no
Default: 0.5

Fraction of the top class's confidence that a class must reach in order to be considered; e.g., if the top class has a confidence of 0.7, classes with a confidence lower than 0.5 x 0.7 = 0.35 are discarded.

Max Number of Annotations
Name: maxNumberOfAnnotations
Type: int
Required: no
Default: 5

The maximum number of class annotations created for each user input.

Min Confidence Threshold
Name: minConfidenceThreshold
Type: float
Required: no
Default: 0.01

The minimum value of confidence a model must have for a class in order to add it as one of the candidate annotations.

Intent Model File Name
Name: intent.model.file.name
Type: string (filename)
Required: no
Default: empty

Name of the file containing the machine learning model; it is usually set automatically by Teneo Studio so no configuration is required.