gm.data.Tokenize

class gemma.gm.data.Tokenize(*, key: kontext.Key | Sequence[kontext.Key] | dict[kontext.Key, kontext.Key], tokenizer: gemma.gm.text._tokenizer.Tokenizer, add_eos: bool = False, add_bos: bool = False)

Bases: kauldron.data.transforms.base.ElementWiseTransform

Tokenize a string into token ids.

tokenizer

The tokenizer to use.

Type:

gemma.gm.text._tokenizer.Tokenizer

add_eos

Whether to add the EOS token (1) to the end of the sequence.

Type:

bool

add_bos

Whether to add the BOS token (2) to the beginning of the sequence.

Type:

bool

tokenizer: gemma.gm.text._tokenizer.Tokenizer
add_eos: bool = False
add_bos: bool = False
map_element(element: str)
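
The effect of the add_bos and add_eos flags can be illustrated with a minimal sketch. This is not the library's implementation: ToyTokenizer and tokenize are hypothetical stand-ins written for this example, and only the special token ids (BOS = 2, EOS = 1) are taken from the attribute descriptions above.

```python
BOS_ID = 2  # BOS id as stated in the add_bos description
EOS_ID = 1  # EOS id as stated in the add_eos description


class ToyTokenizer:
    """Hypothetical stand-in: assigns one id per whitespace-separated word."""

    def __init__(self):
        self._vocab = {}

    def encode(self, text: str) -> list[int]:
        ids = []
        for word in text.split():
            # Ids start at 3 so they never collide with the special tokens.
            ids.append(self._vocab.setdefault(word, len(self._vocab) + 3))
        return ids


def tokenize(text: str, tokenizer, *, add_bos: bool = False, add_eos: bool = False) -> list[int]:
    """Sketch of the documented behavior: encode, then optionally wrap with BOS/EOS."""
    ids = tokenizer.encode(text)
    if add_bos:
        ids = [BOS_ID] + ids  # BOS goes at the beginning of the sequence
    if add_eos:
        ids = ids + [EOS_ID]  # EOS goes at the end of the sequence
    return ids


tok = ToyTokenizer()
plain = tokenize("hello world", tok)
wrapped = tokenize("hello world", tok, add_bos=True, add_eos=True)
```

In this sketch, wrapped is the same sequence as plain with one extra id on each side. Both flags default to False, matching the signature above, so no special tokens are added unless requested.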