From the course: Generative AI and Open Source Models: Hands-On Practice with Hugging Face Models
Encoding and decoding text
- [Instructor] All right, so now that we understand what a tokenizer is and what's under the hood of one, let's get an intuition for what actually goes on during tokenization. During tokenization, we're converting raw text into tokens. Each of those tokens is then mapped to a numerical ID, and there's also some handling for special tokens and for padding. The tokenizer will also carry instructions for model-specific pre-processing of that input. So what happens when you feed a string of text into a tokenizer? It produces a structured output, typically a dictionary-like object, containing the different elements the model requires. Let's first look at the tokenizer vocabulary size, just to give you an idea of how large these vocabularies are. In this case, the Gemma 2 tokenizer's vocabulary is unusually large. It's larger than most tokenizers you'll come across, but in…
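To make this concrete, here is a minimal sketch of encoding and decoding with a Hugging Face tokenizer. The checkpoint name google/gemma-2-2b is an assumption for illustration (Gemma 2 checkpoints may require accepting a license on the Hub); any model ID would show the same dictionary-like output with input_ids and attention_mask.

from transformers import AutoTokenizer

# Load a tokenizer from the Hub. The checkpoint name here is an assumption
# for illustration; any Hugging Face model ID works the same way.
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b")

texts = [
    "Tokenizers turn raw text into numbers.",
    "Short sentence.",
]

# Encoding: raw text -> dictionary-like output (input_ids, attention_mask, ...).
# padding=True pads the shorter sequence so both have the same length.
encoded = tokenizer(texts, padding=True)
print(encoded["input_ids"])       # numerical token IDs, including special tokens
print(encoded["attention_mask"])  # 1 for real tokens, 0 for padding

# The string tokens behind the IDs of the first sentence.
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"][0]))

# Vocabulary size; for the Gemma 2 tokenizer this is noticeably larger
# than for most tokenizers you'll come across.
print(tokenizer.vocab_size)

# Decoding: IDs -> text, optionally stripping special and padding tokens.
print(tokenizer.decode(encoded["input_ids"][0], skip_special_tokens=True))

Running this prints the token IDs, the attention mask that marks real tokens versus padding, the underlying string tokens, the vocabulary size, and finally the decoded text recovered from the IDs.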