gm.text.Tokenizer

class gemma.gm.text.Tokenizer(
*,
path: str | os.PathLike,
custom_tokens: dict[int, str] = <factory>,
)

Bases: object

Base class for tokenizers.

from gemma import gm

tokenizer = gm.text.Gemma2Tokenizer()

tokenizer.encode('Hello world!')
tokenizer.decode([10, 20, 30, 40, 50])

print(tokenizer.tokens[:200])  # Print the first 200 tokens.

assert (
    tokenizer.tokens[tokenizer.special_tokens.START_OF_TURN]
    == '<start_of_turn>'
)
path

Path to the vocab file.

Type:

str | os.PathLike

custom_tokens

The Gemma tokenizer has a few unused tokens that the user can overwrite here. Expects a dictionary mapping an unused id (0-98) to the token string, e.g. `{0: '<start_of_audio>'}`.

Type:

dict[int, str]
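
As a rough illustration of the mapping described above, the following sketch overlays custom token strings onto a toy vocabulary. It is not the library's implementation: `apply_custom_tokens`, the toy vocabulary, and the assumption that unused slots are named `<unused0>` to `<unused98>` are all hypothetical and only serve to show the shape of the `custom_tokens` argument.

```python
def apply_custom_tokens(
    vocab: list[str],
    custom_tokens: dict[int, str],
) -> list[str]:
    """Return a copy of `vocab` with the selected unused slots renamed.

    Hypothetical helper: assumes the i-th unused slot appears in the
    vocabulary as the placeholder string '<unused{i}>'.
    """
    new_vocab = list(vocab)
    for unused_id, token_str in custom_tokens.items():
        if not 0 <= unused_id <= 98:
            raise ValueError(
                f'Unused id {unused_id} is outside the documented range 0-98.'
            )
        placeholder = f'<unused{unused_id}>'
        # Replace the placeholder in place, keeping its position in the vocab.
        new_vocab[new_vocab.index(placeholder)] = token_str
    return new_vocab


# Toy vocabulary (hypothetical layout, not the real Gemma vocab).
toy_vocab = ['<pad>', '<eos>', '<bos>', '<unk>']
toy_vocab += [f'<unused{i}>' for i in range(99)]
toy_vocab += ['Hello', ' world', '!']

patched_vocab = apply_custom_tokens(toy_vocab, {0: '<start_of_audio>'})
```

The overwritten token keeps the position of the placeholder it replaces, so the size of the vocabulary does not change.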

VERSION

The Gemma version of the tokenizer (e.g. 2 or 3).

Type:

ClassVar[int | str]

FORBIDDEN_TOKENS

Default forbidden tokens.

Type:

ClassVar[tuple[int, ...]]

FORMAT

The dialog format to use for the tokenizer (gemma3, …).

Type:

ClassVar[dialog._src.string.str_compat.Format]

FORMAT_TO_CONVERT

Set when the tokenizer uses token names that differ from the default.

Type:

ClassVar[dialog._src.string.str_compat.Format | None]

path: str | os.PathLike
custom_tokens: dict[int, str]
VERSION: ClassVar[int | str] = 0
FORBIDDEN_TOKENS: ClassVar[tuple[int, ...]] = ()
FORMAT: ClassVar[dialog._src.string.str_compat.Format] = 'gemma3'
FORMAT_TO_CONVERT: ClassVar[dialog._src.string.str_compat.Format | None] = None
classmethod from_version(
version: int | str,
) -> gemma.gm.text._tokenizer.Tokenizer

Create a tokenizer from a version.

encode(
text: str | list[str],
*,
add_bos: bool = False,
add_eos: bool = False,
) -> list[int]

Encode a text into a list of token ids.

from gemma import gm

tokenizer = gm.text.Gemma2Tokenizer()
tokenizer.encode('Hello world!')

pieces = tokenizer.split('Hello world!')
assert pieces == ['Hello', ' world', '!']
tokenizer.encode(pieces)
Parameters:
  • text – The text to encode. Can be a single string or a list of already-split pieces (strings).

  • add_bos – Whether to prepend the BOS (beginning-of-sentence) token (id 2).

  • add_eos – Whether to append the EOS (end-of-sentence) token (id 1).

Returns:

The list of token ids.
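
The effect of `add_bos` and `add_eos` can be sketched with a toy encoder. This is a minimal illustration, not the library's code: `toy_encode` and `TOY_VOCAB` are hypothetical, and only the BOS id (2) and EOS id (1) are taken from the parameter descriptions above.

```python
BOS_ID = 2  # Begin-of-sentence id, as stated in the parameter docs.
EOS_ID = 1  # End-of-sentence id, as stated in the parameter docs.

# Hypothetical piece-to-id table, for illustration only.
TOY_VOCAB = {'Hello': 10, ' world': 11, '!': 12}


def toy_encode(
    pieces: list[str],
    *,
    add_bos: bool = False,
    add_eos: bool = False,
) -> list[int]:
    """Map pieces to ids, optionally wrapping them with BOS and EOS."""
    ids = [TOY_VOCAB[piece] for piece in pieces]
    if add_bos:
        ids.insert(0, BOS_ID)
    if add_eos:
        ids.append(EOS_ID)
    return ids


ids_plain = toy_encode(['Hello', ' world', '!'])
ids_wrapped = toy_encode(['Hello', ' world', '!'], add_bos=True, add_eos=True)
```

With both flags set, the returned list is two ids longer than the plain encoding: one prepended and one appended.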

decode(
ids: int | list[int] | etils.enp.array_types.typing.Array,
) -> str

Decode one or more token ids into text.

split(text: str) -> list[str]

Split a text into pieces.

property vocab_size: int

Size of the vocabulary.

property tokens: list[str]

Returns the list of all token strings in the vocabulary.

property special_tokens: type[gemma.gm.text._tokenizer.SpecialTokens]

Returns the special tokens.

plot_logits(
logits: enp.typing.Array,
*,
keep_top: int = 30,
) -> go.Figure

Plot the distribution of logits.

Parameters:
  • logits – The logits to plot, before softmax is applied (as returned by the model).

  • keep_top – Number of tokens to display.

Returns:

The plot as a plotly figure.
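
To make the role of `keep_top` concrete, the sketch below computes the data such a plot would plausibly display: it applies softmax to raw logits and keeps the most probable tokens. It uses only the standard library, and `top_token_probs` is a hypothetical helper; the actual method may compute or order its values differently.

```python
import math


def top_token_probs(
    logits: list[float],
    tokens: list[str],
    *,
    keep_top: int = 30,
) -> list[tuple[str, float]]:
    """Return the `keep_top` most probable (token, probability) pairs."""
    # Subtract the max logit before exponentiating for numerical stability.
    max_logit = max(logits)
    exps = [math.exp(x - max_logit) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Sort by probability, highest first, and truncate.
    ranked = sorted(zip(tokens, probs), key=lambda tp: tp[1], reverse=True)
    return ranked[:keep_top]


toy_logits = [2.0, 0.5, -1.0, 3.0]
toy_tokens = ['a', 'b', 'c', 'd']
top_two = top_token_probs(toy_logits, toy_tokens, keep_top=2)
```

Passing logits rather than probabilities matters here: the softmax is applied inside the helper, so the input should be the raw model output.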