Incorrect IFEval benchmark

#879
by DavidGF - opened

Hello everyone,

Apparently the IFEval evaluation went wrong for our model, and unfortunately I can't explain how this could have happened. As you can see in the IFEval result dataset, most responses are simply empty (a quick way to check this is sketched below the link). In our internal tests (run according to the HF leaderboard documentation) everything worked correctly.
You can also tell from the remaining benchmarks that the IFEval score cannot have been computed correctly, as all other scores are close to our internal results (we have also included charts in the model card where you can verify this).

https://huggingface.co/datasets/open-llm-leaderboard/VAGOsolutions__SauerkrautLM-gemma-2-2b-it-details/viewer/VAGOsolutions__SauerkrautLM-gemma-2-2b-it__leaderboard_ifeval?row=16
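
Here is a rough way to count the empty generations in the linked details dataset. This is only a sketch: the split handling and the `resps` column name are assumptions based on how the lm-evaluation-harness logs samples, so adjust them if the viewer shows different names.

```python
from datasets import get_dataset_split_names, load_dataset

repo = "open-llm-leaderboard/VAGOsolutions__SauerkrautLM-gemma-2-2b-it-details"
config = "VAGOsolutions__SauerkrautLM-gemma-2-2b-it__leaderboard_ifeval"

# Each evaluation run is stored as its own split; pick the most recent one.
splits = sorted(get_dataset_split_names(repo, config))
ds = load_dataset(repo, config, split=splits[-1])

print(ds.column_names)


def is_empty(resps):
    # In harness sample logs, "resps" is typically a list (of lists) of strings.
    flat = [s for group in resps for s in (group if isinstance(group, list) else [group])]
    return all(not str(s).strip() for s in flat)


# "resps" is assumed to hold the raw model generations; check column_names above.
empty = sum(1 for row in ds if is_empty(row["resps"]))
print(f"{empty} of {len(ds)} generations are empty")
```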

Do you have any idea what might have happened, or perhaps even a solution?

Thanks in advance,
David

Open LLM Leaderboard org

Hi @DavidGF ,

Thank you for reporting this issue! We will need some time to look into it; I'll get back to you as soon as I have more information.

Open LLM Leaderboard org

Hi @DavidGF ,

It looks like the issue with your model's responses on the IFEval benchmark is hard to pinpoint: sometimes the model responds as expected, but other times it doesn't, particularly on more complex prompts.

From what I can see, everything seems set up correctly, like the BOS token and the chat template, so the problem might be related to how the model handles the specific generation settings used for the benchmark. These settings might be making the model too rigid, which could explain why it occasionally fails to generate a response. I have also tried to re-evaluate your model and got the same results. Have you tried evaluating your model with the BOS token added?
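
For reference, this is the kind of minimal check I mean: load the model with `transformers`, apply its chat template, confirm the BOS token is there, and generate greedily. It is only an illustrative sketch, not the exact leaderboard pipeline; the prompt and generation settings are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "VAGOsolutions/SauerkrautLM-gemma-2-2b-it"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# An IFEval-style instruction-following prompt (illustrative only).
messages = [{
    "role": "user",
    "content": "Write a short poem about autumn. Your answer must contain exactly 3 bullet points.",
}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

# Check whether the chat template already prepends <bos>; if it doesn't,
# the harness's add_bos_token setting becomes relevant.
print("starts with BOS:", inputs[0, 0].item() == tokenizer.bos_token_id)

# Greedy decoding, similar in spirit to the leaderboard's generative settings.
output = model.generate(inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output[0, inputs.shape[-1]:], skip_special_tokens=True))
```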

Hello @alozowski ,
First of all, thank you very much for your efforts!
We have also evaluated the model several times with the lm-eval harness and have not had any problems (a sketch of that kind of run is below).
If the issue were specific to our model, the same behavior should not also appear in the other models I mentioned.
Many Gemma 2 fine-tunes are affected by this behavior.
The results of the remaining benchmarks also show that the model performs well, so I don't think it is simply overwhelmed by the complexity of certain IFEval prompts.
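
For reference, this is roughly the kind of run we mean. It is a sketch only, not our exact script; the task and option names assume a recent lm-evaluation-harness release that ships the `leaderboard_ifeval` task and the chat-template option.

```python
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args=(
        "pretrained=VAGOsolutions/SauerkrautLM-gemma-2-2b-it,"
        "dtype=bfloat16,add_bos_token=True"  # add_bos_token mirrors the BOS question above
    ),
    tasks=["leaderboard_ifeval"],  # the Open LLM Leaderboard's IFEval task
    apply_chat_template=True,      # evaluate the instruct model through its chat template
    batch_size="auto",
)

# Inspect the strict/loose, prompt- and instruction-level accuracies.
print(results["results"]["leaderboard_ifeval"])
```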

Open LLM Leaderboard org

Hi @DavidGF ,

Thank you for the additional context! I need more time to investigate this issue, but I'll get back to you as soon as I have more info or a potential solution.
