Package ai.djl.modality.nlp.bert
Class BertFullTokenizer
java.lang.Object
ai.djl.modality.nlp.preprocess.SimpleTokenizer
ai.djl.modality.nlp.bert.BertTokenizer
ai.djl.modality.nlp.bert.BertFullTokenizer
- All Implemented Interfaces:
TextProcessor, Tokenizer
BertFullTokenizer runs end-to-end tokenization of input text.
It first runs basic preprocessors to clean the input text and then runs WordpieceTokenizer to split the result into word pieces.
Reference implementation: Google Research Bert Tokenizer
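The two stages described above (basic cleanup, then word-piece splitting) can be illustrated with a small, dependency-free sketch. This is not the DJL implementation: the class name WordPieceSketch, the toy vocabulary, the "##" continuation prefix, and the "[UNK]" fallback are assumptions taken from the common BERT convention, and the cleanup stage is reduced to trimming, optional lowercasing, and whitespace splitting.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Illustrative two-stage tokenizer (hypothetical, not the DJL class). */
final class WordPieceSketch {
    // Assumed placeholder for words that cannot be split into known pieces.
    private static final String UNKNOWN = "[UNK]";

    /** Stage 1: trim, optionally lowercase, and split on whitespace. */
    static List<String> basicTokenize(String input, boolean lowerCase) {
        String cleaned = input.trim();
        if (lowerCase) {
            cleaned = cleaned.toLowerCase(Locale.ROOT);
        }
        List<String> words = new ArrayList<>();
        for (String word : cleaned.split("\\s+")) {
            if (!word.isEmpty()) {
                words.add(word);
            }
        }
        return words;
    }

    /** Stage 2: greedy longest-match-first split of one word into vocabulary pieces. */
    static List<String> wordPiece(String word, Set<String> vocab) {
        List<String> pieces = new ArrayList<>();
        int start = 0;
        while (start < word.length()) {
            int end = word.length();
            String match = null;
            // Shrink the candidate from the right until it is found in the vocabulary.
            while (start < end) {
                String candidate = word.substring(start, end);
                if (start > 0) {
                    candidate = "##" + candidate; // continuation piece
                }
                if (vocab.contains(candidate)) {
                    match = candidate;
                    break;
                }
                end--;
            }
            if (match == null) {
                return List.of(UNKNOWN); // no piece fits: the whole word is unknown
            }
            pieces.add(match);
            start = end;
        }
        return pieces;
    }

    /** Runs both stages over a sentence. */
    static List<String> tokenize(String input, Set<String> vocab, boolean lowerCase) {
        List<String> tokens = new ArrayList<>();
        for (String word : basicTokenize(input, lowerCase)) {
            tokens.addAll(wordPiece(word, vocab));
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> vocab = Set.of("un", "##aff", "##able", "play", "##ing");
        // prints [un, ##aff, ##able, play, ##ing]
        System.out.println(tokenize("Unaffable playing", vocab, true));
    }
}
```

In this sketch a word with no matching piece collapses to a single unknown token instead of being partially split, which keeps the output length predictable.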
Constructor Summary
Constructor: BertFullTokenizer(Vocabulary vocabulary, boolean lowerCase)
Description: Creates an instance of BertFullTokenizer.
Method Summary
Modifier and Type, Method, Description:
- String buildSentence(List<String> tokens): Combines a list of tokens to form a sentence.
- static List<TextProcessor> getPreprocessors(boolean lowerCase): Get a list of TextProcessors to process input text for Bert models.
- Vocabulary getVocabulary(): Returns the Vocabulary used for tokenization.
- List<String> tokenize(String input): Breaks down the given sentence into a list of tokens that can be represented by embeddings.
Methods inherited from class ai.djl.modality.nlp.bert.BertTokenizer
encode, encode, pad, tokenToString
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
Methods inherited from interface ai.djl.modality.nlp.preprocess.Tokenizer
preprocess
Constructor Details
BertFullTokenizer
Creates an instance of BertFullTokenizer.
- Parameters:
vocabulary - the BERT vocabulary
lowerCase - whether to convert tokens to lowercase
Method Details
getVocabulary
Returns the Vocabulary used for tokenization.
- Returns: the Vocabulary used for tokenization
tokenize
Breaks down the given sentence into a list of tokens that can be represented by embeddings.
- Specified by: tokenize in interface Tokenizer
- Overrides: tokenize in class BertTokenizer
- Parameters: input - the sentence to tokenize
- Returns: a List of tokens
buildSentence
Combines a list of tokens to form a sentence.
- Specified by: buildSentence in interface Tokenizer
- Overrides: buildSentence in class SimpleTokenizer
- Parameters: tokens - the List of tokens
- Returns: the sentence built from the given tokens
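As a rough illustration of what combining word-piece tokens into a sentence can look like, the sketch below joins tokens with spaces and glues each continuation piece back onto the preceding token. It assumes the common BERT convention that continuation pieces start with "##"; the class BuildSentenceSketch is hypothetical and the actual library behavior may differ in details.

```java
import java.util.List;

/** Hypothetical helper showing one way to rebuild a sentence from word pieces. */
final class BuildSentenceSketch {
    static String buildSentence(List<String> tokens) {
        // Join with single spaces, then remove the " ##" boundary so that
        // continuation pieces attach to the token before them.
        return String.join(" ", tokens).replace(" ##", "").trim();
    }

    public static void main(String[] args) {
        // prints "unaffable playing"
        System.out.println(buildSentence(List.of("un", "##aff", "##able", "play", "##ing")));
    }
}
```

Rebuilding this way does not restore the original casing, accents, or spacing around punctuation if preprocessing changed them.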
getPreprocessors
Get a list of TextProcessors to process input text for Bert models.
- Parameters: lowerCase - whether to convert input to lowercase
- Returns: List of TextProcessors
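The idea of an ordered list of preprocessors, with one step included only when lowerCase is true, can be sketched in plain Java. Everything here is an assumption made for illustration: the class PreprocessorChainSketch, the use of UnaryOperator in place of TextProcessor, and the particular steps chosen (whitespace splitting, lowercasing, accent stripping) are not taken from the library, whose actual list of processors may be different.

```java
import java.text.Normalizer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

/** Hypothetical preprocessing chain; each step maps a token list to a new token list. */
final class PreprocessorChainSketch {
    static List<UnaryOperator<List<String>>> preprocessors(boolean lowerCase) {
        List<UnaryOperator<List<String>>> steps = new ArrayList<>();
        // Step 1: trim each entry and split it on runs of whitespace.
        steps.add(tokens -> tokens.stream()
                .flatMap(t -> Arrays.stream(t.trim().split("\\s+")))
                .filter(t -> !t.isEmpty())
                .collect(Collectors.toList()));
        // Step 2 (optional): lowercase every token.
        if (lowerCase) {
            steps.add(tokens -> tokens.stream()
                    .map(t -> t.toLowerCase(Locale.ROOT))
                    .collect(Collectors.toList()));
        }
        // Step 3: decompose accented characters (NFD) and drop the combining marks.
        steps.add(tokens -> tokens.stream()
                .map(t -> Normalizer.normalize(t, Normalizer.Form.NFD).replaceAll("\\p{Mn}", ""))
                .collect(Collectors.toList()));
        return steps;
    }

    /** Applies the steps in order, starting from the raw input as a single token. */
    static List<String> apply(String input, List<UnaryOperator<List<String>>> steps) {
        List<String> tokens = List.of(input);
        for (UnaryOperator<List<String>> step : steps) {
            tokens = step.apply(tokens);
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [cafe, deja, vu]
        System.out.println(apply("  Caf\u00e9  D\u00e9j\u00e0 vu ", preprocessors(true)));
    }
}
```

Keeping the steps in a list makes the order explicit and lets a caller reuse the same chain for every input sentence.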