eval_harness_v043_updates

#10
by meg HF staff - opened
No description provided.

Change 1

WHAT: Updates requirements.txt to the newest lm_eval version, 0.4.3, which also requires accelerate>=0.26.0.
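
For reference, a minimal sketch of the two relevant pins (the exact entries and formatting in the repo's requirements.txt may differ):

```
lm_eval==0.4.3
accelerate>=0.26.0
```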

Change 2

WHAT: Removes the no_cache argument from the lm_eval simple_evaluate function.
WHY: no_cache (bool) was replaced with use_cache (str), a path to a sqlite db file for caching model responses, or None if not caching; see https://github.com/EleutherAI/lm-evaluation-harness/commit/fbd712f723d39e60949abeabd588f1a6f7fb8dcb#diff-6cc182ce4ebf9431fdf0ef577412f518d45396d4153a3825496304fa0f857c2d (a hedged sketch of the updated call follows the file list).
FILES AFFECTED:

  • src/backend/run_eval_suite_harness.py
  • main_backend_harness.py
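
For illustration, a minimal sketch of the updated call with placeholder model/task values (not the exact template code):

```python
from lm_eval import evaluator

# Sketch: the boolean no_cache argument is gone; use_cache takes a path for a
# sqlite cache of model responses, or None to disable caching entirely.
results = evaluator.simple_evaluate(
    model="hf",
    model_args="pretrained=gpt2",  # placeholder model
    tasks=["hellaswag"],           # placeholder task
    num_fewshot=0,
    use_cache=None,                # was: no_cache=True
)
```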

Change 3

WHAT: Changes the import used by run_auto_eval so the evaluation suite is called from the lm_eval Harness, not lighteval (sketch below).
WHY: The description of the templates specifies that the Harness is being used: "launches evaluations through the main_backend.py file, using the Eleuther AI Harness."
FILES AFFECTED:

  • main_backend_harness.py
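
A hedged sketch of the import swap; the module and function names here are assumptions based on the file list above and may not match the template exactly:

```python
# Before (lighteval backend), assumed module name:
# from src.backend.run_eval_suite_lighteval import run_evaluation

# After (Eleuther AI Harness backend):
from src.backend.run_eval_suite_harness import run_evaluation
```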

Change 4

WHAT: Sets batch_size to "auto".
WHY: The Harness will automatically determine the batch size based on the compute the user has set up (see the sketch below).
FILES AFFECTED:

  • main_backend_harness.py
  • src/backend/run_eval_suite_harness.py (a typing change to accept the "auto" string)
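
A minimal sketch of the typing change, with assumed parameter names (the actual signature in run_eval_suite_harness.py may differ):

```python
from typing import Union

# batch_size now accepts either an int or the literal string "auto"; with
# "auto", lm_eval probes for the largest batch that fits on the hardware.
def run_evaluation(eval_request, task_names: list, num_fewshot: int,
                   batch_size: Union[int, str] = "auto",
                   device: str = "cuda:0", limit: int = None):
    ...
```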

Change 5

WHAT: Additional updates to src/backend/run_eval_suite_harness.py for running the Harness code in v0.4.3 (a consolidated sketch follows the list):

  • The ALL_TASKS constant as previously defined is deprecated. This commit introduces another way to get the same values, using TaskManager(). NB: there appears to be another alternative that I have not tested: from lm_eval.api.registry import ALL_TASKS.
  • Specifies "hf" as the model value, which is the recommended default. The previously defined "hf-causal-experimental" has been deprecated. See: https://github.com/EleutherAI/lm-evaluation-harness/issues/1235#issuecomment-1873940238
  • Removes the output_path argument, which is no longer supported in lm_eval simple_evaluate. See: https://github.com/EleutherAI/lm-evaluation-harness/commit/6a2620ade383b8d30592fc2342eb1d213ad4b4cb NB: There may be a way to add something similar or comparable elsewhere, which I'm not experimenting with here. The log_samples argument, for example, might be added and set to True.
  • Additional minor: The definition of device uses the string "gpu:0" -- I think "cuda:0" is meant.

FILES AFFECTED:

  • src/backend/run_eval_suite_harness.py
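
Putting the pieces together, a hedged sketch of the v0.4.3 call described above; the model and task values are placeholders and log_samples is only a suggestion, not the exact template code:

```python
from lm_eval import evaluator
from lm_eval.tasks import TaskManager

# Replaces the deprecated ALL_TASKS constant: TaskManager exposes the full
# task registry via its all_tasks property.
task_manager = TaskManager()
all_tasks = task_manager.all_tasks

results = evaluator.simple_evaluate(
    model="hf",                    # replaces the deprecated "hf-causal-experimental"
    model_args="pretrained=gpt2",  # placeholder model
    tasks=["hellaswag"],           # placeholder; can be validated against all_tasks
    batch_size="auto",
    device="cuda:0",               # "gpu:0" is not a valid torch device string
    use_cache=None,
    log_samples=True,              # possible stand-in for the removed output_path
    task_manager=task_manager,
)
```
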
meg changed pull request status to open

LGTM, thanks!

clefourrier changed pull request status to merged
