
h2o-danube3-4b-chat-GGUF

Description

This repo contains GGUF format model files for h2o-danube3-4b-chat, quantized with the llama.cpp framework.

The table below summarizes the different quantized versions of h2o-danube3-4b-chat and shows the trade-off between size, speed, and quality of the models.

| Name | Quant method | Model size | MT-Bench AVG | Perplexity | Tokens per second |
|------|--------------|------------|--------------|------------|-------------------|
| h2o-danube3-4b-chat-F16.gguf | F16 | 7.92 GB | 6.43 | 6.17 | 479 |
| h2o-danube3-4b-chat-Q8_0.gguf | Q8_0 | 4.21 GB | 6.49 | 6.17 | 725 |
| h2o-danube3-4b-chat-Q6_K.gguf | Q6_K | 3.25 GB | 6.37 | 6.20 | 791 |
| h2o-danube3-4b-chat-Q5_K_M.gguf | Q5_K_M | 2.81 GB | 6.25 | 6.24 | 927 |
| h2o-danube3-4b-chat-Q4_K_M.gguf | Q4_K_M | 2.39 GB | 6.31 | 6.37 | 967 |
| h2o-danube3-4b-chat-Q3_K_M.gguf | Q3_K_M | 1.94 GB | 5.87 | 6.99 | 1099 |
| h2o-danube3-4b-chat-Q2_K.gguf | Q2_K | 1.51 GB | 3.71 | 9.42 | 1299 |

The columns in the table are:

  • Name -- model file name and link
  • Quant method -- quantization method
  • Model size -- size of the model in gigabytes
  • MT-Bench AVG -- MT-Bench benchmark score, on a scale from 1 to 10; the higher, the better
  • Perplexity -- perplexity on the WikiText-2 dataset, as reported by the llama.cpp perplexity test; the lower, the better
  • Tokens per second -- generation speed in tokens per second, as reported by the llama.cpp perplexity test; the higher, the better. Speed tests were run on a single H100 GPU
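As a minimal sketch of how one of these files can be used (not the only option; this assumes the third-party llama-cpp-python and huggingface_hub packages, and picks the Q4_K_M variant from the table above as an example):

```python
# Minimal sketch: download one of the quantized files listed above and load it
# with llama-cpp-python. Assumes: pip install llama-cpp-python huggingface_hub
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

# Download the Q4_K_M variant from the Hub (file name taken from the table above).
model_path = hf_hub_download(
    repo_id="h2oai/h2o-danube3-4b-chat-GGUF",
    filename="h2o-danube3-4b-chat-Q4_K_M.gguf",
)

# Load the model; n_gpu_layers=-1 offloads all layers to the GPU when one is available.
llm = Llama(model_path=model_path, n_ctx=4096, n_gpu_layers=-1)
```

Any of the other quantized files from the table can be substituted for the `filename` argument; the same GGUF files also work directly with the llama.cpp command-line tools.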

Prompt template

<|prompt|>Why is drinking water so healthy?</s><|answer|>
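A hedged example of applying this template, continuing with the `llm` object from the sketch above (llama-cpp-python assumed; the stop string mirrors the `</s>` terminator in the template):

```python
# Build the prompt exactly as in the template above and generate a reply.
prompt = "<|prompt|>Why is drinking water so healthy?</s><|answer|>"

output = llm(
    prompt,
    max_tokens=256,
    stop=["</s>"],  # stop at the end-of-turn token used by the template
)
print(output["choices"][0]["text"])
```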