
05 Tokenization

Introduction

  • In the early days, neural networks were trained at the character level. This keeps the vocabulary small, but the model has to spend a great deal of capacity predicting the next letter.
  • Later, neural networks were trained on whole words, but the vocabulary becomes enormous, and a small change to a word yields an entirely new vocabulary entry.
  • Today's models work with chunks of characters called tokens. This elegantly handles word stems that share the same beginning but have different endings, as illustrated in the sketch after this list. Whitespace is also counted as part of the tokens.
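  • A minimal sketch of subword splitting with the tiktoken library (introduced below), assuming the cl100k_base encoding as an example:
    import tiktoken

    # cl100k_base is one of tiktoken's built-in encodings, used here as an example
    encoding = tiktoken.get_encoding('cl100k_base')

    # words that share a stem often share their leading token chunk(s);
    # the leading space belongs to the first token, so whitespace is counted too
    for word in [' running', ' runner', ' runs']:
        token_ids = encoding.encode(word)
        print(word, '->', [encoding.decode([t]) for t in token_ids])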

Token Calculation

  • Below are approximate estimates for typical English writing (a quick check against tiktoken is sketched after this list):
    • 1 token ~ 4 characters or 0.75 words
    • 1,000 tokens ~ 750 words
  • Token counts tend to be higher for mathematical and scientific terms.
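  • A rough check of the rule of thumb above (a sketch, assuming tiktoken with the cl100k_base encoding):
    import tiktoken

    text = 'Tokenization splits text into chunks that the model can work with.'

    # rule of thumb: roughly one token per four characters
    estimated = len(text) / 4

    # actual token count from tiktoken
    encoding = tiktoken.get_encoding('cl100k_base')
    actual = len(encoding.encode(text))

    print(f'estimated: {estimated:.0f}, actual: {actual}')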

Tiktoken

  • Encoding text to tokens.
    import tiktoken

    # pick the encoding used by the target model
    encoding = tiktoken.encoding_for_model('gpt-4.1-mini')

    tokens = encoding.encode('Hi, My name is Gopalakrishnan.')
    tokens
    
  • Decoding tokens back to text.
    # decode each token id individually to see which text chunk it maps to
    for token_id in tokens:
        token_text = encoding.decode([token_id])
        print(f'{token_id}: {token_text}')
    
  • Decoding a single token id.
    encoding.decode([275])
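  • Decoding the whole token list in one call round-trips back to the original string (a sketch, assuming the encoding and tokens variables from above).
    # decode the full list of token ids at once
    encoding.decode(tokens)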

References

Tokenizer - Visually displays how a sentence is broken down into tokens.