gm.data.Tokenize

class gemma.gm.data.Tokenize(*, key: kauldron.kontext.Key | typing.Sequence[kauldron.kontext.Key] | dict[kauldron.kontext.Key, kauldron.kontext.Key], tokenizer: gemma.gm.text._tokenizer.Tokenizer, add_eos: bool = False, add_bos: bool = False)

Bases: kauldron.data.transforms.base.ElementWiseTransform

Tokenize a string to ids.

tokenizer

The tokenizer to use.

Type: gemma.gm.text._tokenizer.Tokenizer

add_eos

Whether to append the EOS token (id 1) to the end of the sequence.

Type: bool

add_bos

Whether to prepend the BOS token (id 2) to the beginning of the sequence.

Type: bool

tokenizer: gemma.gm.text._tokenizer.Tokenizer

add_eos: bool = False

add_bos: bool = False

map_element(element: str)
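
The effect of add_bos and add_eos on a single element can be illustrated with a small self-contained sketch. It does not import gemma: ToyTokenizer and ToyTokenize are hypothetical stand-ins written for this example, and the whitespace-based vocabulary is an assumption. Only the special token ids (BOS = 2, EOS = 1) are taken from the attribute descriptions above.

```python
from dataclasses import dataclass

BOS_ID = 2  # BOS id, as given in the add_bos description
EOS_ID = 1  # EOS id, as given in the add_eos description


class ToyTokenizer:
    """Hypothetical stand-in for a real tokenizer: one id per whitespace-separated word."""

    def __init__(self):
        self._vocab = {}

    def encode(self, text: str) -> list[int]:
        ids = []
        for word in text.split():
            # Regular ids start at 4 in this toy so they never collide with BOS/EOS.
            ids.append(self._vocab.setdefault(word, len(self._vocab) + 4))
        return ids


@dataclass(frozen=True)
class ToyTokenize:
    """Sketch of the documented behavior; not the gemma implementation."""

    tokenizer: ToyTokenizer
    add_eos: bool = False
    add_bos: bool = False

    def map_element(self, element: str) -> list[int]:
        ids = self.tokenizer.encode(element)
        if self.add_bos:
            ids = [BOS_ID] + ids  # prepend BOS
        if self.add_eos:
            ids = ids + [EOS_ID]  # append EOS
        return ids


tokenizer = ToyTokenizer()
plain = ToyTokenize(tokenizer=tokenizer).map_element("hello world")
wrapped = ToyTokenize(
    tokenizer=tokenizer, add_bos=True, add_eos=True
).map_element("hello world")

print(plain)    # [4, 5]
print(wrapped)  # [2, 4, 5, 1]
```

With both flags left at their default of False, the ids come back unchanged. Enabling them only adds the special tokens at the two ends of the sequence.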