gm.text.Tokenizer
- class gemma.gm.text.Tokenizer(
- *,
- path: str | os.PathLike,
- custom_tokens: dict[int, str] = <factory>,
Bases: object
Base class for tokenizers.
tokenizer = gm.text.Gemma2Tokenizer()
tokenizer.encode('Hello world!')
tokenizer.decode([10, 20, 30, 40, 50])
print(tokenizer.tokens[:200])  # Print the first 200 tokens.
assert (
    tokenizer.tokens[tokenizer.special_tokens.START_OF_TURN]
    == '<start_of_turn>'
)
- path
Path to the vocab file.
- Type:
str | os.PathLike
- custom_tokens
The Gemma tokenizer has a few unused tokens, which the user can overwrite here. Expects a dictionary mapping the unused id (0-98) to the token string (e.g. `{0: '<start_of_audio>'}`).
- Type:
dict[int, str]
- VERSION
The Gemma version of the tokenizer (2, 3).
- Type:
ClassVar[int | str]
- FORBIDDEN_TOKENS
Default forbidden tokens.
- Type:
ClassVar[tuple[int, ...]]
- FORMAT
The dialog format to use for the tokenizer (gemma3, …).
- Type:
ClassVar[dialog._src.string.str_compat.Format]
- FORMAT_TO_CONVERT
Used when the tokenizer uses different token names than the default.
- Type:
ClassVar[dialog._src.string.str_compat.Format | None]
- path: str | os.PathLike
- custom_tokens: dict[int, str]
- VERSION: ClassVar[int | str] = 0
- FORBIDDEN_TOKENS: ClassVar[tuple[int, ...]] = ()
- FORMAT: ClassVar[dialog._src.string.str_compat.Format] = 'gemma3'
- FORMAT_TO_CONVERT: ClassVar[dialog._src.string.str_compat.Format | None] = None
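The `custom_tokens` constraint described above (unused ids 0-98 mapped to token strings) can be checked before a tokenizer is built. The following is a minimal sketch under that documented constraint only; `check_custom_tokens` and `UNUSED_ID_RANGE` are hypothetical names introduced for illustration and are not part of the gemma library, which may validate its input differently or not at all.

```python
# Hypothetical helper, NOT part of the gemma library. It only mirrors the
# documented constraint: custom tokens map an unused id (0-98) to a string.
UNUSED_ID_RANGE = range(0, 99)  # ids 0..98 inclusive, as stated in the docs


def check_custom_tokens(custom_tokens: dict) -> dict:
    """Return `custom_tokens` unchanged if it matches the documented shape."""
    for token_id, token_str in custom_tokens.items():
        # bool is a subclass of int, so reject it explicitly.
        is_int = isinstance(token_id, int) and not isinstance(token_id, bool)
        if not is_int or token_id not in UNUSED_ID_RANGE:
            raise ValueError(f'Custom token id must be in 0-98, got {token_id!r}')
        if not isinstance(token_str, str) or not token_str:
            raise ValueError(f'Custom token must be a non-empty str, got {token_str!r}')
    return custom_tokens


custom_tokens = check_custom_tokens({
    0: '<start_of_audio>',
    1: '<end_of_audio>',  # illustrative token name, an assumption
})
```

A mapping that passes this check could then be supplied through the keyword-only `custom_tokens` argument shown in the class signature.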
- classmethod from_version(
- version: int | str,
Create a tokenizer from a version.
- encode(
- text: str | list[str],
- *,
- add_bos: bool = False,
- add_eos: bool = False,
Encode a text into a list of token ids.
tokenizer = gm.text.Gemma2Tokenizer()
tokenizer.encode('Hello world!')

pieces = tokenizer.split('Hello world!')
assert pieces == ['Hello', ' world', '!']
tokenizer.encode(pieces)
- Parameters:
text – The text to encode. Can be a single string or a list of tokens.
add_bos – Whether to prepend the BOS (begin of sentence) token, id 2.
add_eos – Whether to append the EOS (end of sentence) token, id 1.
- Returns:
The list of token ids.
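The effect of `add_bos` and `add_eos` on the returned ids can be illustrated without loading a vocabulary. This is a toy sketch, assuming only the ids stated in the parameter descriptions (BOS = 2, EOS = 1); `wrap_ids` is a hypothetical stand-in and not the library's implementation, and the sample ids are placeholders rather than real Gemma token ids.

```python
BOS_ID = 2  # begin of sentence, id taken from the parameter docs
EOS_ID = 1  # end of sentence, id taken from the parameter docs


def wrap_ids(ids, *, add_bos=False, add_eos=False):
    """Hypothetical stand-in for the BOS/EOS handling of `encode`."""
    out = list(ids)  # copy so the caller's list is not mutated
    if add_bos:
        out.insert(0, BOS_ID)  # prepend
    if add_eos:
        out.append(EOS_ID)  # append
    return out


ids = [10, 20, 30]  # placeholder ids, not real Gemma token ids
wrapped = wrap_ids(ids, add_bos=True, add_eos=True)
print(wrapped)  # [2, 10, 20, 30, 1]
```

Both flags default to `False`, so a plain `encode(text)` call would return only the ids of the text itself.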
- decode(
- ids: int | list[int] | etils.enp.array_types.typing.Array,
Decode one or more token ids into text.
- split(text: str) -> list[str]
Split a text into pieces.
- property vocab_size: int
Size of the vocabulary.
- property tokens: list[str]
Returns the list of all token strings in the vocabulary.
- property special_tokens: type[gemma.gm.text._tokenizer.SpecialTokens]
Returns the special tokens.
- plot_logits(
- logits: enp.typing.Array,
- *,
- keep_top: int = 30,
Plot the distribution of logits.
- Parameters:
logits – The logits to plot, before softmax is applied (as returned by the model).
keep_top – Number of tokens to display.
- Returns:
The plot as a plotly figure.
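To make the `logits` and `keep_top` parameters concrete, the sketch below turns raw (pre-softmax) logits into probabilities and keeps the most likely tokens. It is an illustration only: `top_probabilities` is a hypothetical helper, and it is an assumption that `plot_logits` normalizes over the full vocabulary before truncating to `keep_top`; the method's actual internals are not described here.

```python
import math


def top_probabilities(logits, tokens, *, keep_top=30):
    """Hypothetical helper: softmax over all logits, then keep the top entries."""
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Sort (token, probability) pairs from most to least likely.
    ranked = sorted(zip(tokens, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:keep_top]


tokens = ['a', 'b', 'c', 'd']   # toy vocabulary, not real Gemma tokens
logits = [2.0, 0.5, 3.0, -1.0]  # toy pre-softmax scores
top = top_probabilities(logits, tokens, keep_top=2)
for token, prob in top:
    print(f'{token!r}: {prob:.3f}')
```

Because the softmax is taken before truncation in this sketch, the kept probabilities generally sum to less than 1.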