05 Tokenization
Introduction
- In the early days, neural networks were trained at the character level. This keeps the vocabulary small, but it demands a lot of model capacity just to predict the next letter.
- Later, neural networks were trained on whole words, but the vocabulary becomes enormous; even a small change to a word produces a new vocabulary entry.
- Today, models work with chunks of characters called tokens. This elegantly handles word stems that share the same beginning but have different endings. Whitespace is also included when counting tokens.
Token Calculation
- Rough estimates for typical English text:
- 1 token ~ 4 characters or 0.75 words
- 1,000 tokens ~ 750 words
- The token count is higher for math and scientific terms.
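
A minimal Python sketch of this rule of thumb; the function name `estimate_tokens` and the averaging of the two heuristics are illustrative assumptions, not an exact method:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate for typical English prose (heuristic only)."""
    by_chars = len(text) / 4             # 1 token ~ 4 characters
    by_words = len(text.split()) / 0.75  # 1 token ~ 0.75 words
    # Average the two rules of thumb; real counts vary, and math or
    # scientific text usually tokenizes into more pieces than this suggests.
    return round((by_chars + by_words) / 2)

print(estimate_tokens("Tokenization splits text into chunks called tokens."))
```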
Tiktoken
- Encoding text to tokens.
- Decoding tokens back to text.
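
A minimal sketch using the tiktoken library; the `cl100k_base` encoding and the sample sentence are assumptions chosen for illustration:

```python
import tiktoken

# Load an encoding (cl100k_base is used by recent OpenAI chat models).
enc = tiktoken.get_encoding("cl100k_base")

# Encoding: text -> list of integer token IDs.
tokens = enc.encode("Tokenization handles word stems elegantly.")
print(tokens)

# Decoding: token IDs -> the original text.
print(enc.decode(tokens))

# Decode one token at a time; note the leading whitespace attached to tokens.
print([enc.decode([t]) for t in tokens])
```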
References
Tokenizer - visually displays how a sentence is broken down into tokens.