Python library: tiktoken

tiktoken is a fast BPE (Byte pair encoding) tokenizer for use with OpenAI's models.

A feature of byte pair encoders is that they can encode any arbitrary string. When an encoder encounters a word that is not present in its vocabulary, it breaks the word down into smaller tokens that are in the vocabulary.
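
The fallback to smaller tokens can be illustrated with a toy byte-level encoder. This is only a sketch and not tiktoken's implementation: toy_bpe_encode and TOY_MERGES are hypothetical names, the merge list is made up, and merges are simply applied in list order. Because the starting point is the individual UTF-8 bytes, a string that matches no merge rule still gets encoded, just with more and shorter tokens.

```python
def toy_bpe_encode(text, merges):
    """Toy byte-level BPE (illustration only, not tiktoken's algorithm).

    merges is a hypothetical list of (left, right) byte pairs,
    applied in list order.
    """
    # Start with one token per UTF-8 byte, so every string is encodable.
    parts = [bytes([b]) for b in text.encode("utf-8")]
    for left, right in merges:
        merged, i = [], 0
        while i < len(parts):
            # Fuse an adjacent (left, right) pair into a single token.
            if i + 1 < len(parts) and parts[i] == left and parts[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(parts[i])
                i += 1
        parts = merged
    return parts


# Made-up merge rules that happen to build up the word "select".
TOY_MERGES = [(b"s", b"e"), (b"se", b"l"), (b"e", b"c"),
              (b"sel", b"ec"), (b"selec", b"t")]

print(toy_bpe_encode("select", TOY_MERGES))  # one token: [b'select']
print(toy_bpe_encode("xyz", TOY_MERGES))     # no rule applies: single bytes
print(toy_bpe_encode("sélect", TOY_MERGES))  # partly merged, partly bytes
```

Joining the returned byte strings always gives back the UTF-8 bytes of the input, which is why no input is ever "out of vocabulary".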

Use different encoders to encode the same text

import tiktoken

from tiktoken_ext.openai_public import ENCODING_CONSTRUCTORS

def tokenize(encName):

    print(encName)

    # Each constructor returns a dict with the arguments for tiktoken.Encoding
    encFunc = ENCODING_CONSTRUCTORS[encName]
    encDict = encFunc()

    enc = tiktoken.Encoding(encDict['name'],
             pat_str         = encDict['pat_str'        ],
             mergeable_ranks = encDict['mergeable_ranks'],
             special_tokens  = encDict['special_tokens' ])

    # encode_ordinary() treats special tokens as ordinary text
    tokens = enc.encode_ordinary('''
select
   o.id,
   o.order_date,
   o.amount,
   i.article_nr,
   i.price
from
   orders  o                         join
   items   i on o.id = i.order_id
    ''')
    
    print(tokens)
    print('')

#   --------------------------------------------------------

tokenize('gpt2'       )
tokenize('r50k_base'  )
tokenize('p50k_base'  )
tokenize('cl100k_base')

Encode, then decode

import tiktoken
from tiktoken_ext.openai_public import ENCODING_CONSTRUCTORS

encFunc = ENCODING_CONSTRUCTORS['gpt2']
encDict = encFunc()

enc = tiktoken.Encoding(encDict['name'],
         pat_str         = encDict['pat_str'        ],
         mergeable_ranks = encDict['mergeable_ranks'],
         special_tokens  = encDict['special_tokens' ])

tokens = enc.encode('''
def F(txt):
    print(txt)
''')

print(tokens)

print(enc.decode(tokens))
