Tell me this isn't just copy-and-pasted weights from deepseek-coder-6.7b-instruct? Because that's what it looks like.

#4
by rombodawg - opened

I ran both models through HumanEval; here are the results. They scored exactly the same. Like bruh, this is a scam.

deepseek-coder-6.7b-instruct

{
  "humaneval": {
    "pass@1": 0.725609756097561
  },
  "config": {
    "prefix": "",
    "do_sample": true,
    "temperature": 0.1,
    "top_k": 0,
    "top_p": 0.95,
    "n_samples": 1,
    "eos": "<|endoftext|>",
    "seed": 0,
    "model": "deepseek-ai/deepseek-coder-6.7b-instruct",
    "modeltype": "causal",
    "peft_model": null,
    "revision": null,
    "use_auth_token": false,
    "trust_remote_code": false,
    "tasks": "humaneval",
    "instruction_tokens": null,
    "batch_size": 1,
    "max_length_generation": 4000,
    "precision": "fp16",
    "load_in_8bit": true,
    "load_in_4bit": false,
    "left_padding": false,
    "limit": null,
    "limit_start": 0,
    "save_every_k_tasks": -1,
    "postprocess": true,
    "allow_code_execution": true,
    "generation_only": false,
    "load_generations_path": null,
    "load_data_path": null,
    "metric_output_path": "evaluation_results.json",
    "save_generations": false,
    "load_generations_intermediate_paths": null,
    "save_generations_path": "generations.json",
    "save_references": false,
    "save_references_path": "references.json",
    "prompt": "prompt",
    "max_memory_per_gpu": null,
    "check_references": false
  }
}

AutoCoder_S_6.7B

{
  "humaneval": {
    "pass@1": 0.725609756097561
  },
  "config": {
    "prefix": "",
    "do_sample": true,
    "temperature": 0.1,
    "top_k": 0,
    "top_p": 0.95,
    "n_samples": 1,
    "eos": "<|endoftext|>",
    "seed": 0,
    "model": "Bin12345/AutoCoder_S_6.7B",
    "modeltype": "causal",
    "peft_model": null,
    "revision": null,
    "use_auth_token": false,
    "trust_remote_code": false,
    "tasks": "humaneval",
    "instruction_tokens": null,
    "batch_size": 1,
    "max_length_generation": 4000,
    "precision": "fp16",
    "load_in_8bit": true,
    "load_in_4bit": false,
    "left_padding": false,
    "limit": null,
    "limit_start": 0,
    "save_every_k_tasks": -1,
    "postprocess": true,
    "allow_code_execution": true,
    "generation_only": false,
    "load_generations_path": null,
    "load_data_path": null,
    "metric_output_path": "evaluation_results.json",
    "save_generations": false,
    "load_generations_intermediate_paths": null,
    "save_generations_path": "generations.json",
    "save_references": false,
    "save_references_path": "references.json",
    "prompt": "prompt",
    "max_memory_per_gpu": null,
    "check_references": false
  }
}
Owner

Thanks for the verification. Could you check whether all of the generated outputs are the same?

There are 164 problems in the HumanEval dataset, so if both models correctly solved 119 of them, the final accuracy will be the same even if the generated outputs differ.
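
To make that arithmetic concrete, here is a quick check (the solved count of 119 is inferred from the reported score, not taken from the harness output):

# An identical pass@1 only implies an identical *count* of solved problems.
total_problems = 164                      # size of the HumanEval dataset
reported_pass_at_1 = 0.725609756097561    # score reported by both runs above
solved = round(reported_pass_at_1 * total_problems)
print(solved)                             # 119
print(solved / total_problems)            # ~0.7256, the same score for both models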

Also:

  1. We added some special tokens to Deepseek-Coder and fine-tuned the model, so these two models are different. You can use the following code to check the two models' weights (a side-by-side comparison sketch also follows this list):
from transformers import AutoTokenizer, AutoModelForCausalLM

# Set to "Bin12345/AutoCoder_S_6.7B" or "deepseek-ai/deepseek-coder-6.7b-instruct"
model_path = ""
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")

# Print the name, shape, and first entries of every weight tensor.
model_weights = model.state_dict()
for name, param in model_weights.items():
    print(f"Layer: {name} | Size: {param.size()} | Values: {param[:2]}")
  2. The results shown in our paper were obtained with greedy sampling. The steps to reproduce the experiments are shown at https://github.com/bin123apple/AutoCoder

  3. Could you share your experiment code? I will run it on my side and check the results.
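
For a more direct check than eyeballing the printed values, here is a minimal side-by-side comparison sketch (not from the AutoCoder repo; it assumes there is enough RAM to hold both fp16 checkpoints at once):

import torch
from transformers import AutoModelForCausalLM

# Load both checkpoints on the CPU in fp16 so they can be compared tensor by tensor.
model_a = AutoModelForCausalLM.from_pretrained(
    "Bin12345/AutoCoder_S_6.7B", torch_dtype=torch.float16)
model_b = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/deepseek-coder-6.7b-instruct", torch_dtype=torch.float16)

state_a, state_b = model_a.state_dict(), model_b.state_dict()

# Count tensors that are bit-for-bit identical between the two checkpoints.
# torch.equal returns False when shapes differ (e.g. embeddings resized for new special tokens).
identical = 0
for name, tensor_a in state_a.items():
    tensor_b = state_b.get(name)
    if tensor_b is not None and torch.equal(tensor_a, tensor_b):
        identical += 1

print(f"{identical} / {len(state_a)} tensors are bit-for-bit identical")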

Yes, I ran the benchmark in Google Colab; here is the code:

!pip install -q -U transformers
!pip install -q -U accelerate
!pip install -q -U bitsandbytes
!git clone https://github.com/bigcode-project/bigcode-evaluation-harness.git
%cd bigcode-evaluation-harness
!pip install -r requirements.txt
!accelerate launch main.py --tasks humaneval --model Bin12345/AutoCoder_S_6.7B --load_in_8bit --allow_code_execution --max_length_generation 4000 --precision fp16 --temperature 0.1

And for the other model, just replace the model name:

!accelerate launch main.py --tasks humaneval --model deepseek-ai/deepseek-coder-6.7b-instruct --load_in_8bit --allow_code_execution --max_length_generation 4000 --precision fp16 --temperature 0.1

You have to add a flag to get it to save the results, though; it won't save them without the extra flag. I forget what it is, but you should be able to find it here:
https://github.com/bigcode-project/bigcode-evaluation-harness

Owner

Thanks for the feedback! I will try it.

If you want to check whether these two models are the same, you can use the snippet above to print out their weights.


You will see that they are different.

Owner

Hey @rombodawg, I just reproduced your experiment, and the outputs of the two models are different. You can add the two flags --save_generations and --save_generations_path to save and check the generated code. Thanks!
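
For anyone re-running this with those flags, here is a minimal sketch for diffing the two saved generation files (the filenames are placeholders, and it assumes each file is a JSON list holding one list of generated samples per HumanEval problem):

import json

# Placeholder filenames -- use whatever was passed to --save_generations_path for each run.
with open("autocoder_generations.json") as f:
    gens_autocoder = json.load(f)
with open("deepseek_generations.json") as f:
    gens_deepseek = json.load(f)

# Compare the generations problem by problem.
differing = [i for i, (a, b) in enumerate(zip(gens_autocoder, gens_deepseek)) if a != b]
print(f"{len(differing)} of {len(gens_autocoder)} problems have different generations")
print("First differing problem indices:", differing[:10])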

rombodawg changed discussion status to closed
