From the course: Generative AI and Open Source Models: Hands-On Practice with Hugging Face Models

Encoding and decoding text

- [Instructor] All right, so now that we understand what a tokenizer is and what's under the hood of a tokenizer, let's get an intuition for what goes on during tokenization. During tokenization, we're converting raw text into tokens. Each token is then mapped to a numerical ID, and there's also some handling for special tokens and for padding. The tokenizer also carries instructions for any model-specific pre-processing of that input. So what happens when you feed a string of text into a tokenizer? Well, it produces a structured output, which is typically a dictionary-like object, and that object holds the different elements the model requires. So let's first talk about the tokenizer vocabulary size, just to give you an idea of the vocabulary of the tokenizers that we have. In this case, the Gemma2 tokenizer's vocabulary is unusually large. It's going to be larger than most tokenizers that you'll come across, but in…
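To make the steps described above concrete, here is a minimal sketch of a toy tokenizer in plain Python. It is an illustration only, not the Hugging Face implementation and not the course's code: the whitespace splitting, the special-token names (`<pad>`, `<bos>`, `<eos>`, `<unk>`), and the helper names `build_vocab`, `encode`, and `decode` are all assumptions made for this example. Real tokenizers such as Gemma2's use learned subword vocabularies rather than whole words.

```python
# Toy whitespace tokenizer: illustrative only, NOT the Hugging Face implementation.
# The special-token names below are hypothetical choices for this sketch.
SPECIAL_TOKENS = ["<pad>", "<bos>", "<eos>", "<unk>"]


def build_vocab(corpus):
    """Assign an integer ID to each special token and each distinct word."""
    vocab = {tok: i for i, tok in enumerate(SPECIAL_TOKENS)}
    for text in corpus:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def encode(text, vocab, max_length):
    """Raw text -> tokens -> IDs, adding special tokens and padding."""
    tokens = ["<bos>"] + text.lower().split() + ["<eos>"]
    tokens = tokens[:max_length]  # naive truncation, enough for a sketch
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in tokens]
    n_pad = max_length - len(ids)
    # Dictionary-like output: IDs plus a mask marking real tokens (1) vs padding (0).
    return {
        "input_ids": ids + [vocab["<pad>"]] * n_pad,
        "attention_mask": [1] * len(ids) + [0] * n_pad,
    }


def decode(ids, vocab, skip_special_tokens=True):
    """IDs -> tokens -> text, optionally dropping special tokens."""
    id_to_token = {i: tok for tok, i in vocab.items()}
    tokens = [id_to_token[i] for i in ids]
    if skip_special_tokens:
        tokens = [tok for tok in tokens if tok not in SPECIAL_TOKENS]
    return " ".join(tokens)


vocab = build_vocab(["hello world", "hello tokenizer"])
encoded = encode("hello world", vocab, max_length=6)

print("vocabulary size:", len(vocab))
print(encoded)
print(decode(encoded["input_ids"], vocab))
```

With the `transformers` library, the analogous steps would typically go through `AutoTokenizer.from_pretrained(...)`, calling the tokenizer on a string to get `input_ids` and `attention_mask`, and `tokenizer.decode(...)` to go back to text. The checkpoint name to load is left out here because it depends on the course setup.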
