gm.text.Tokenizer

class gemma.gm.text.Tokenizer( *, path: str | os.PathLike, custom_tokens: dict[int, str]=<factory>, )[source]

Bases: object

Base class for tokenizers.

from gemma import gm

tokenizer = gm.text.Gemma2Tokenizer()

tokenizer.encode('Hello world!')
tokenizer.decode([10, 20, 30, 40, 50])

print(tokenizer.tokens[:200])  # Print the first 200 tokens.

assert (
    tokenizer.tokens[tokenizer.special_tokens.START_OF_TURN]
    == '<start_of_turn>'
)

path

Path to the vocab file.

Type: str | os.PathLike

custom_tokens

The Gemma tokenizer has a few unused tokens that the user can overwrite here. Expects a dictionary mapping each unused id (0-98) to its token string, e.g. `{0: '<start_of_audio>'}`.

Type: dict[int, str]
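As a minimal sketch of the constraint described above, the hypothetical helper below checks that every key of a `custom_tokens` mapping falls in the unused id range 0-98 before the mapping is handed to a tokenizer. `validate_custom_tokens` and `UNUSED_ID_RANGE` are names invented for illustration and are not part of the Gemma API; only the 0-98 range comes from the attribute description.

```python
# Hypothetical helper for illustration; not part of the gemma library.
UNUSED_ID_RANGE = range(0, 99)  # Ids 0-98, per the attribute description.


def validate_custom_tokens(custom_tokens: dict[int, str]) -> dict[int, str]:
    """Returns the mapping unchanged, or raises if an entry is invalid."""
    for token_id, token_str in custom_tokens.items():
        if token_id not in UNUSED_ID_RANGE:
            raise ValueError(
                f'Custom token id {token_id} is outside the unused range 0-98.'
            )
        if not token_str:
            raise ValueError(f'Custom token for id {token_id} is empty.')
    return custom_tokens


custom_tokens = validate_custom_tokens(
    {0: '<start_of_audio>', 1: '<end_of_audio>'}
)
```

The validated dictionary can then be passed through the constructor's `custom_tokens` keyword argument shown in the class signature.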

VERSION

The Gemma version of the tokenizer (e.g. 2, 3).

Type: ClassVar[int | str]

FORBIDDEN_TOKENS

Default forbidden tokens.

Type: ClassVar[tuple[int, ...]]

FORMAT

The dialog format to use for the tokenizer (gemma3, …).

Type: ClassVar[dialog._src.string.str_compat.Format]

FORMAT_TO_CONVERT

Set when the tokenizer uses token names that differ from the default.

Type: ClassVar[dialog._src.string.str_compat.Format | None]

path: str | os.PathLike

custom_tokens: dict[int, str]

VERSION: ClassVar[int | str] = 0

FORBIDDEN_TOKENS: ClassVar[tuple[int, ...]] = ()

FORMAT: ClassVar[dialog._src.string.str_compat.Format] = 'gemma3'

FORMAT_TO_CONVERT: ClassVar[dialog._src.string.str_compat.Format | None] = None

classmethod from_version( version: int | str, ) → gemma.gm.text._tokenizer.Tokenizer[source]: Create a tokenizer from a version.

encode( text: str | list[str], *, add_bos: bool = False, add_eos: bool = False, ) → list[int][source]

Encode a text into a list of token ids.

tokenizer = gm.text.Gemma2Tokenizer()
tokenizer.encode('Hello world!')

pieces = tokenizer.split('Hello world!')
assert pieces == ['Hello', ' world', '!']
tokenizer.encode(pieces)

Parameters:

text – The text to encode. Can be a single string or a list of pieces, such as the output of split().
add_bos – Whether to prepend the BOS (beginning of sentence) token, id 2.
add_eos – Whether to append the EOS (end of sentence) token, id 1.

Returns:

The list of token ids.
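To make the effect of `add_bos` and `add_eos` concrete, here is a toy sketch of the wrapping they describe. It does not call the Gemma library: `wrap_ids` is a hypothetical function, and the body ids are placeholders rather than real encodings. Only the BOS id (2) and EOS id (1) are taken from the parameter descriptions above.

```python
# Toy illustration only; not the gemma implementation.
BOS_ID = 2  # Begin of sentence, per the parameter description.
EOS_ID = 1  # End of sentence, per the parameter description.


def wrap_ids(
    ids: list[int], *, add_bos: bool = False, add_eos: bool = False
) -> list[int]:
    """Prepends BOS and/or appends EOS to already-encoded token ids."""
    out = list(ids)  # Copy so the caller's list is not mutated.
    if add_bos:
        out.insert(0, BOS_ID)
    if add_eos:
        out.append(EOS_ID)
    return out


body = [10, 20, 30]  # Placeholder ids standing in for an encoded text.
print(wrap_ids(body))                              # [10, 20, 30]
print(wrap_ids(body, add_bos=True))                # [2, 10, 20, 30]
print(wrap_ids(body, add_bos=True, add_eos=True))  # [2, 10, 20, 30, 1]
```

Both flags default to `False`, so a plain `encode(text)` call returns only the ids of the text itself.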

decode( ids: int | list[int] | etils.enp.array_types.typing.Array, ) → str[source]: Decode one or more token ids into text.

split(text: str) → list[str][source]: Split a text into pieces.

property vocab_size: int: Size of the vocabulary.

property tokens: list[str]: Returns the list of all token strings in the vocabulary.

property special_tokens: type[gemma.gm.text._tokenizer.SpecialTokens]: Returns the special tokens.

plot_logits( logits: enp.typing.Array, *, keep_top: int = 30, ) → go.Figure[source]

Plot the distribution of logits.

Parameters:

logits – The logits to plot, before softmax is applied (as returned by the model).
keep_top – Number of tokens to display.

Returns:

The plot as a plotly figure.
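The following sketch illustrates what selecting the `keep_top` most likely tokens from raw logits could look like. It is an assumption about the kind of preprocessing such a plot needs, not the library's actual implementation: `top_probabilities` is a hypothetical helper, and whether `plot_logits` applies a softmax internally is not stated in this reference.

```python
import math


def top_probabilities(
    logits: list[float], keep_top: int = 30
) -> list[tuple[int, float]]:
    """Returns (token_id, probability) pairs for the most likely token ids."""
    max_logit = max(logits)  # Subtract the max for numerical stability.
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Sort token ids by descending probability and keep the first `keep_top`.
    ranked = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:keep_top]


top = top_probabilities([2.0, 0.5, -1.0, 3.0], keep_top=2)
print([token_id for token_id, _ in top])  # [3, 0]
```

In this toy input the list index plays the role of the token id; with a real model the logits vector has one entry per vocabulary item.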