---
license: gemma
language:
- en
tags:
- conversational
quantized_by: qnixsynapse
---

## Llamacpp Quantizations of the official GGUF of gemma-2-9b-it from the Kaggle repo

Using llama.cpp PR 8156 for quantization.

Original model: https://huggingface.co/google/gemma-2-9b-it

## Downloading using huggingface-cli

First, make sure you have huggingface-cli installed:

```
pip install -U "huggingface_hub[cli]"
```

Then, you can target the specific file you want:

```
huggingface-cli download qnixsynapse/Gemma-V2-9B-Instruct-GGUF --include "" --local-dir ./
```

or you can download it directly.

## Prompt format

The prompt format is the same as Gemma v1; however, it is not included in the GGUF file. It can be added later with the gguf Python scripts by setting a new `chat_template` metadata key.

```
<start_of_turn>user
{prompt}<end_of_turn>
<start_of_turn>model
```

The model should stop at either `<end_of_turn>` or `<eos>`. If it doesn't, the stop tokens need to be added to the GGUF metadata.

## Quants

Currently only two quants are available:

| Quant  | Size   |
|--------|--------|
| Q4_K_S | 5.5 GB |
| Q3_K_M | 4.8 GB |

If Q4_K_S causes an OOM error when offloading all the layers to the GPU, consider decreasing the batch size or using Q3_K_M.

Minimum VRAM needed: 8 GB
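Since the GGUF file ships without a chat template, clients may need to assemble the prompt by hand. Below is a minimal Python sketch of the Gemma turn format; the helper name `build_gemma_prompt` and the `STOP_SEQUENCES` list are my own illustrations, not part of the model card:

```python
# Minimal sketch: assemble a Gemma-style chat prompt by hand,
# for use when the GGUF metadata lacks a chat_template.

START = "<start_of_turn>"
END = "<end_of_turn>"


def build_gemma_prompt(user_message: str) -> str:
    """Wrap a single user message in Gemma's turn markers and
    open the model turn so generation continues from there."""
    return f"{START}user\n{user_message}{END}\n{START}model\n"


# Sequences at which generation should be cut off (assumed, per
# the prompt-format section above).
STOP_SEQUENCES = [END, "<eos>"]

print(build_gemma_prompt("Why is the sky blue?"))
```

Passing `STOP_SEQUENCES` as the stop strings to your inference client should make the model halt cleanly at the end of its turn even without the metadata fix.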