
h2o-danube3-4b-chat-GGUF

Description

This repo contains GGUF format model files for h2o-danube3-4b-chat, quantized with the llama.cpp framework.

The table below summarizes the different quantized versions of h2o-danube3-4b-chat and shows the trade-off between size, speed, and quality of the models.

| Name | Quant method | Model size | MT-Bench AVG | Perplexity | Tokens per second |
|------|--------------|------------|--------------|------------|-------------------|
| h2o-danube3-4b-chat-F16.gguf | F16 | 7.92 GB | 6.43 | 6.17 | 479 |
| h2o-danube3-4b-chat-Q8_0.gguf | Q8_0 | 4.21 GB | 6.49 | 6.17 | 725 |
| h2o-danube3-4b-chat-Q6_K.gguf | Q6_K | 3.25 GB | 6.37 | 6.20 | 791 |
| h2o-danube3-4b-chat-Q5_K_M.gguf | Q5_K_M | 2.81 GB | 6.25 | 6.24 | 927 |
| h2o-danube3-4b-chat-Q4_K_M.gguf | Q4_K_M | 2.39 GB | 6.31 | 6.37 | 967 |
| h2o-danube3-4b-chat-Q3_K_M.gguf | Q3_K_M | 1.94 GB | 5.87 | 6.99 | 1099 |
| h2o-danube3-4b-chat-Q2_K.gguf | Q2_K | 1.51 GB | 3.71 | 9.42 | 1299 |

The columns in the table are:

  • Name -- model file name and link
  • Quant method -- quantization method
  • Model size -- size of the model in gigabytes
  • MT-Bench AVG -- MT-Bench benchmark score, on a scale from 1 to 10; the higher, the better
  • Perplexity -- perplexity on the WikiText-2 dataset, as reported by the llama.cpp perplexity test; the lower, the better
  • Tokens per second -- generation speed in tokens per second, as reported by the llama.cpp perplexity test; the higher, the better. Speed tests were run on a single H100 GPU
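As a minimal sketch of how one of these files can be used (not the only option; this assumes the third-party llama-cpp-python and huggingface_hub packages, and picks the Q4_K_M variant from the table above as an example):

```python
# Minimal sketch: download one of the quantized files listed above and load it
# with llama-cpp-python. Assumes: pip install llama-cpp-python huggingface_hub
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

# Download the Q4_K_M variant from the Hub (file name taken from the table above).
model_path = hf_hub_download(
    repo_id="h2oai/h2o-danube3-4b-chat-GGUF",
    filename="h2o-danube3-4b-chat-Q4_K_M.gguf",
)

# Load the model; n_gpu_layers=-1 offloads all layers to the GPU when one is available.
llm = Llama(model_path=model_path, n_ctx=4096, n_gpu_layers=-1)
```

Any of the other quantized files from the table can be substituted for the `filename` argument; the same GGUF files also work directly with the llama.cpp command-line tools.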

Prompt template

<|prompt|>Why is drinking water so healthy?</s><|answer|>
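A hedged example of applying this template, continuing with the `llm` object from the sketch above (llama-cpp-python assumed; the stop string mirrors the `</s>` terminator in the template):

```python
# Build the prompt exactly as in the template above and generate a reply.
prompt = "<|prompt|>Why is drinking water so healthy?</s><|answer|>"

output = llm(
    prompt,
    max_tokens=256,
    stop=["</s>"],  # stop at the end-of-turn token used by the template
)
print(output["choices"][0]["text"])
```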