
05 Tokenization

Introduction

  • In the early days, neural networks were trained at the character level. This keeps the vocabulary small, but the model has to spend a great deal of capacity predicting the next letter.
  • Later, neural networks were trained on whole words, but the vocabulary becomes enormous, and a small change to a word yields an entirely new vocabulary entry.
  • Today's models work with chunks of characters called tokens. This elegantly handles word stems that share the same beginning but have different endings, as illustrated in the sketch after this list. Whitespace is also counted as part of the tokens.
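  • A minimal sketch of subword splitting with the tiktoken library (introduced below), assuming the cl100k_base encoding as an example:
    import tiktoken

    # cl100k_base is one of tiktoken's built-in encodings, used here as an example
    encoding = tiktoken.get_encoding('cl100k_base')

    # words that share a stem often share their leading token chunk(s);
    # the leading space belongs to the first token, so whitespace is counted too
    for word in [' running', ' runner', ' runs']:
        token_ids = encoding.encode(word)
        print(word, '->', [encoding.decode([t]) for t in token_ids])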

Token Calculation

  • Below are approximate estimates for typical English writing (a quick check against tiktoken is sketched after this list):
    • 1 token ~ 4 characters or 0.75 words
    • 1,000 tokens ~ 750 words
  • Token counts tend to be higher for mathematical and scientific terms.
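  • A rough check of the rule of thumb above (a sketch, assuming tiktoken with the cl100k_base encoding):
    import tiktoken

    text = 'Tokenization splits text into chunks that the model can work with.'

    # rule of thumb: roughly one token per four characters
    estimated = len(text) / 4

    # actual token count from tiktoken
    encoding = tiktoken.get_encoding('cl100k_base')
    actual = len(encoding.encode(text))

    print(f'estimated: {estimated:.0f}, actual: {actual}')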

Tiktoken

  • Encoding text to tokens.
    import tiktoken

    # pick the encoding used by the target model
    encoding = tiktoken.encoding_for_model('gpt-4.1-mini')

    tokens = encoding.encode('Hi, My name is Gopalakrishnan.')
    tokens
    
  • Decoding tokens back to text.
    # decode each token id individually to see which text chunk it maps to
    for token_id in tokens:
        token_text = encoding.decode([token_id])
        print(f'{token_id}: {token_text}')
    
  • Decoding a single token id.
    encoding.decode([275])
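  • Decoding the whole token list in one call round-trips back to the original string (a sketch, assuming the encoding and tokens variables from above).
    # decode the full list of token ids at once
    encoding.decode(tokens)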

References

Tokenizer - Visually displays how a sentence is broken down into tokens.