Supported Languages

#72
by vpkprasanna - opened

Which languages does the LLM support?
Is there an easy way to find this out from the tokenizer vocab file?

Google org

Hi @vpkprasanna ,

  1. Load the tokenizer associated with the model.
  2. Retrieve the tokenizer's vocabulary and inspect the tokens to see whether the supported languages can be inferred from them.
  3. Filter the tokens for language-specific characters, such as Devanagari (Hindi) or Chinese characters. Finding many such tokens indicates that the model is likely to support those languages.
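
The steps above can be sketched roughly as follows. This is only a minimal illustration, not the code from the notebook: the helper names (`SCRIPT_RANGES`, `scripts_in_token`, `count_tokens_by_script`, `load_vocab`) are hypothetical, and the two Unicode ranges are the basic Devanagari and CJK Unified Ideographs blocks, which I am assuming are a good enough proxy for Hindi and Chinese.

```python
from collections import Counter

# Hypothetical script table: name -> inclusive Unicode code point ranges.
# Only the basic blocks are listed; extend as needed for other languages.
SCRIPT_RANGES = {
    "Devanagari (Hindi)": [(0x0900, 0x097F)],
    "CJK ideographs (Chinese)": [(0x4E00, 0x9FFF)],
}


def scripts_in_token(token):
    """Return the set of script names that have at least one character in token."""
    found = set()
    for ch in token:
        cp = ord(ch)
        for name, ranges in SCRIPT_RANGES.items():
            if any(lo <= cp <= hi for lo, hi in ranges):
                found.add(name)
    return found


def count_tokens_by_script(vocab):
    """Step 3: count how many vocabulary tokens contain characters of each script."""
    counts = Counter()
    for token in vocab:  # iterating a dict yields its keys, i.e. the token strings
        counts.update(scripts_in_token(token))
    return counts


def load_vocab(model_id="google/gemma-2b"):
    """Steps 1 and 2: load the tokenizer and return its vocabulary (token -> id)."""
    # Imported lazily so the counting helpers work without transformers installed.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    return tokenizer.get_vocab()
```

Calling `count_tokens_by_script(load_vocab())` should then give a per-script token count; downloading the tokenizer may require access to the model on the Hub. Keep in mind that a high count only shows that the vocabulary covers the script. It does not say how well the model performs in that language.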

You can refer to the following IPython notebook, where I use the google/gemma-2b model to check whether it supports Hindi and Chinese.
https://colab.research.google.com/gist/Gopi-Uppari/2600403197351f4a746b988f937adc4e/supported-languagesipynb.ipynb

Thank you.
