gm.data.Tokenize
- class gemma.gm.data.Tokenize(*, key: kauldron.kontext.Key | typing.Sequence[kauldron.kontext.Key] | dict[kauldron.kontext.Key, kauldron.kontext.Key], tokenizer: gemma.gm.text._tokenizer.Tokenizer, add_eos: bool = False, add_bos: bool = False)
Bases:
kauldron.data.transforms.base.ElementWiseTransform
Tokenize a string to ids.
- tokenizer
The tokenizer to use.
- Type:
gemma.gm.text._tokenizer.Tokenizer
- add_eos
Whether to add the EOS token (1) to the end of the sequence.
- Type:
bool
- add_bos
Whether to add the BOS token (2) to the beginning of the sequence.
- Type:
bool
- tokenizer: gemma.gm.text._tokenizer.Tokenizer
- add_eos: bool = False
- add_bos: bool = False
- map_element(element: str)