Tokenizer#

This tutorial shows how to use the Gemma tokenizer. Understanding the tokenizer is important for feeding input to the model correctly.

For more information on tokenizers, see the excellent talk from Andrej Karpathy.

!pip install -q gemma
# Common imports

# Gemma imports
from gemma import gm

Tokenizer basics#

Gemma tokenizers are directly available:

tokenizer = gm.text.Gemma3Tokenizer()

The total number of tokens is available through .vocab_size:

tokenizer.vocab_size
256000

Encoding#

You can encode a string:

  • Into token ids with .encode:

tokenizer.encode('Derinkuyu is an underground city.')
[8636, 979, 78904, 603, 671, 30073, 3413, 235265]
  • Into token string with .split:

tokenizer.split('Derinkuyu is an underground city.')
['Der', 'ink', 'uyu', ' is', ' an', ' underground', ' city', '.']
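
To inspect which id corresponds to which piece of text, the two calls above can be combined. The sketch below is a hypothetical helper (not part of the gemma library); it only relies on the .encode and .split methods shown above and assumes both return lists of the same length, as they do in the example.

```python
def tokens_with_ids(tokenizer, text):
    """Pair each token id with its token string (hypothetical helper).

    Assumes `tokenizer.encode(text)` and `tokenizer.split(text)` return
    lists of equal length, as in the example output above.
    """
    ids = tokenizer.encode(text)
    pieces = tokenizer.split(text)
    if len(ids) != len(pieces):
        raise ValueError(
            f'Got {len(ids)} ids but {len(pieces)} token strings.'
        )
    return list(zip(ids, pieces))
```

With the tokenizer created earlier, calling tokens_with_ids(tokenizer, 'Derinkuyu is an underground city.') would be expected to return pairs built from the two outputs above.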

Note that whitespace is part of the tokens. For the model, ' hello' (with a leading space) and 'hello' therefore map to two different token ids.

print(tokenizer.encode(' hello'))  # With a leading space
print(tokenizer.encode('hello'))  # Without a leading space
[25612]
[17534]

When doing next-word prediction, do not add a trailing space to the prompt, as it would make the input out of distribution.

# When encoding this sentence, the last token will be an empty whitespace,
# which is unusual for the model.
tokenizer.split('The capital of France is ')
['The', ' capital', ' of', ' France', ' is', ' ']
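
One simple way to guard against this is to strip trailing spaces before encoding. This is a minimal sketch with a hypothetical helper name; it removes only space characters and deliberately leaves trailing newlines alone, since whether those should be kept depends on the prompt format.

```python
def prepare_completion_prompt(prompt: str) -> str:
    """Remove trailing spaces from a prompt (hypothetical helper).

    Only ' ' characters are stripped; trailing newlines are kept.
    """
    return prompt.rstrip(' ')


cleaned = prepare_completion_prompt('The capital of France is ')
print(repr(cleaned))  # 'The capital of France is'
```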

Decoding#

Tokens can be decoded with .decode. You can decode a single id or an entire sentence.

tokenizer.decode([8636, 979, 78904, 603, 671, 30073, 3413, 235265])
'Derinkuyu is an underground city.'
tokenizer.decode(4567)
'Med'
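
A round-trip check is a quick way to sanity-check encoding and decoding together. The helper below is a hypothetical sketch that uses only the .encode and .decode methods shown above; it assumes that decoding a list of ids returns a plain string. Some inputs may not round-trip exactly (for example if the tokenizer normalizes text), so treat a mismatch as something to inspect rather than as an error.

```python
def roundtrip_matches(tokenizer, text):
    """Return True if decode(encode(text)) reproduces `text` exactly.

    Hypothetical helper: relies only on `.encode` and `.decode`.
    """
    token_ids = tokenizer.encode(text)
    return tokenizer.decode(token_ids) == text
```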

Control tokens#

Some tokens have a special meaning. Forgetting about them may significantly degrade model quality.

Special token ids can be accessed through the tokenizer.special_tokens attribute.

<bos> / <eos>#

In Gemma models, the beginning-of-sentence token (<bos>) should appear only once, at the beginning of the input. You can add it either explicitly or with add_bos=True:

token_ids = tokenizer.encode('Hello world!')
token_ids = [tokenizer.special_tokens.BOS] + token_ids
token_ids
[<_Gemma2SpecialTokens.BOS: 2>, 4521, 2134, 235341]
tokenizer.encode('Hello world!', add_bos=True)
[<_Gemma2SpecialTokens.BOS: 2>, 4521, 2134, 235341]
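
When token ids come from several sources (for example, pre-tokenized data that may or may not already include <bos>), a small guard can enforce the "only once" rule. This is a sketch with a hypothetical helper; the id 2 is taken from the output above and is an assumption here, so in real code prefer reading it from tokenizer.special_tokens.BOS.

```python
BOS_ID = 2  # Assumption: taken from the output shown above.


def ensure_single_bos(token_ids, bos_id=BOS_ID):
    """Return ids starting with exactly one <bos> (hypothetical helper).

    Leading <bos> ids are collapsed into one; a missing <bos> is added.
    Only the start of the sequence is inspected.
    """
    ids = [int(t) for t in token_ids]
    start = 0
    # Skip every <bos> already present at the start of the sequence.
    while start < len(ids) and ids[start] == bos_id:
        start += 1
    return [bos_id] + ids[start:]


print(ensure_single_bos([4521, 2134, 235341]))  # [2, 4521, 2134, 235341]
```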

Similarly, the model can output an <eos> token to indicate that the prediction is complete.

When fine-tuning Gemma, you can train the model to predict <eos> tokens.

tokenizer.encode('Hello world!', add_eos=True)
[4521, 2134, 235341, <_Gemma2SpecialTokens.EOS: 1>]

<start_of_turn> / <end_of_turn>#

When using the instruction-tuned version of Gemma, the <start_of_turn> / <end_of_turn> tokens specify whether the user or the model is talking.

The <start_of_turn> should be followed by either:

  • user

  • model

Example of dialogue with user and model:

token_ids = tokenizer.encode("""<start_of_turn>user
Knock knock.<end_of_turn>
<start_of_turn>model
Who's there ?<end_of_turn>
<start_of_turn>user
Gemma.<end_of_turn>
<start_of_turn>model
Gemma who?<end_of_turn>""")
tokenizer.decode(token_ids[0])
'<start_of_turn>'
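
Writing the turn markers by hand is error-prone, so the dialogue string can be built from (role, text) pairs instead. The function below is a hypothetical sketch, not part of the gemma library; it reproduces the layout of the example above (role on the first line, turns separated by a newline) and rejects roles other than user and model.

```python
ALLOWED_ROLES = ('user', 'model')


def format_dialogue(turns):
    """Build a dialogue prompt from (role, text) pairs (hypothetical helper).

    Each turn is rendered as:
        <start_of_turn>{role}\n{text}<end_of_turn>
    and turns are joined with a newline, as in the example above.
    """
    parts = []
    for role, text in turns:
        if role not in ALLOWED_ROLES:
            raise ValueError(f'Unknown role: {role!r}')
        parts.append(f'<start_of_turn>{role}\n{text}<end_of_turn>')
    return '\n'.join(parts)


prompt = format_dialogue([
    ('user', 'Knock knock.'),
    ('model', "Who's there ?"),
])
print(prompt)
```

The resulting string can then be passed to tokenizer.encode like the hand-written prompt above.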

<start_of_image>#

In Gemma 3, the prompt should contain the special <start_of_image> token to indicate the position of an image in the text. Internally, the Gemma model automatically expands this token to insert the soft image tokens.

(Note: There’s also an <end_of_image> token, but it is handled internally by the model.)
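
Before sending a multimodal prompt to the model, it can help to verify that the number of <start_of_image> placeholders matches the number of images. This is a sketch with hypothetical helper names, and it assumes one placeholder per image, which is how the description above reads.

```python
IMAGE_TOKEN = '<start_of_image>'


def check_image_placeholders(prompt: str, num_images: int) -> int:
    """Check that `prompt` has one placeholder per image (hypothetical helper).

    Assumes each image corresponds to exactly one <start_of_image> token.
    Returns the number of placeholders found.
    """
    found = prompt.count(IMAGE_TOKEN)
    if found != num_images:
        raise ValueError(
            f'Prompt has {found} {IMAGE_TOKEN} token(s) '
            f'but {num_images} image(s) were provided.'
        )
    return found


check_image_placeholders('Describe this image: <start_of_image>', num_images=1)
```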

Custom tokens#

In all Gemma versions, a few tokens (99) are unused. This allows applications to define and fine-tune their own custom tokens. Those tokens are available through tokenizer.special_tokens.CUSTOM + xx, with xx being a number between 0 and 98.

tokenizer.decode(tokenizer.special_tokens.CUSTOM + 17)
'<unused17>'
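
Because only indices 0 to 98 are valid, it can be worth validating the index before computing the id. The helper below is a hypothetical sketch. The base value 7 is an assumption inferred from the outputs on this page (CUSTOM + 17 evaluates to 24) and may differ between tokenizer versions, so in real code pass tokenizer.special_tokens.CUSTOM as base_id.

```python
NUM_CUSTOM_TOKENS = 99  # Indices 0..98, as stated above.
CUSTOM_BASE_ID = 7  # Assumption: inferred from CUSTOM + 17 == 24 on this page.


def custom_token_id(index: int, base_id: int = CUSTOM_BASE_ID) -> int:
    """Return the token id of custom token `index` (hypothetical helper)."""
    if not 0 <= index < NUM_CUSTOM_TOKENS:
        raise ValueError(
            f'Custom token index must be in [0, {NUM_CUSTOM_TOKENS - 1}], '
            f'got {index}.'
        )
    return base_id + index


print(custom_token_id(17))  # 24
```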

You can customize what the custom tokens correspond to when constructing the tokenizer.

tokenizer = gm.text.Gemma3Tokenizer(
    custom_tokens={
        0: '<my_custom_tag>',
        17: '<my_other_tag>',
    },
)

tokenizer.encode('<my_other_tag>')
[24]

The custom token strings are encoded to the matching token ids.

tokenizer.special_tokens.CUSTOM + 17
24
tokenizer.decode(tokenizer.special_tokens.CUSTOM + 17)
'<my_other_tag>'