Tell me this isn't just copy-and-pasted weights from deepseek-coder-6.7b-instruct? Because that's what it looks like.

#4
by rombodawg - opened

I ran both models through HumanEval; here are the results. They scored exactly the same. Like bruh, this is a scam.

deepseek-coder-6.7b-instruct

{
  "humaneval": {
    "pass@1": 0.725609756097561
  },
  "config": {
    "prefix": "",
    "do_sample": true,
    "temperature": 0.1,
    "top_k": 0,
    "top_p": 0.95,
    "n_samples": 1,
    "eos": "<|endoftext|>",
    "seed": 0,
    "model": "deepseek-ai/deepseek-coder-6.7b-instruct",
    "modeltype": "causal",
    "peft_model": null,
    "revision": null,
    "use_auth_token": false,
    "trust_remote_code": false,
    "tasks": "humaneval",
    "instruction_tokens": null,
    "batch_size": 1,
    "max_length_generation": 4000,
    "precision": "fp16",
    "load_in_8bit": true,
    "load_in_4bit": false,
    "left_padding": false,
    "limit": null,
    "limit_start": 0,
    "save_every_k_tasks": -1,
    "postprocess": true,
    "allow_code_execution": true,
    "generation_only": false,
    "load_generations_path": null,
    "load_data_path": null,
    "metric_output_path": "evaluation_results.json",
    "save_generations": false,
    "load_generations_intermediate_paths": null,
    "save_generations_path": "generations.json",
    "save_references": false,
    "save_references_path": "references.json",
    "prompt": "prompt",
    "max_memory_per_gpu": null,
    "check_references": false
  }
}

AutoCoder_S_6.7B

{
  "humaneval": {
    "pass@1": 0.725609756097561
  },
  "config": {
    "prefix": "",
    "do_sample": true,
    "temperature": 0.1,
    "top_k": 0,
    "top_p": 0.95,
    "n_samples": 1,
    "eos": "<|endoftext|>",
    "seed": 0,
    "model": "Bin12345/AutoCoder_S_6.7B",
    "modeltype": "causal",
    "peft_model": null,
    "revision": null,
    "use_auth_token": false,
    "trust_remote_code": false,
    "tasks": "humaneval",
    "instruction_tokens": null,
    "batch_size": 1,
    "max_length_generation": 4000,
    "precision": "fp16",
    "load_in_8bit": true,
    "load_in_4bit": false,
    "left_padding": false,
    "limit": null,
    "limit_start": 0,
    "save_every_k_tasks": -1,
    "postprocess": true,
    "allow_code_execution": true,
    "generation_only": false,
    "load_generations_path": null,
    "load_data_path": null,
    "metric_output_path": "evaluation_results.json",
    "save_generations": false,
    "load_generations_intermediate_paths": null,
    "save_generations_path": "generations.json",
    "save_references": false,
    "save_references_path": "references.json",
    "prompt": "prompt",
    "max_memory_per_gpu": null,
    "check_references": false
  }
}
Owner

Thanks for the verification. Could you check whether all of the generated outputs are the same?

There are 164 problems in the HumanEval dataset, so if both models correctly solved 119 of them, the final accuracy will be the same even if the generated outputs differ.
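
To make that arithmetic concrete, here is a quick check (the solved count of 119 is inferred from the reported score, not taken from the harness output):

# An identical pass@1 only implies an identical *count* of solved problems.
total_problems = 164                      # size of the HumanEval dataset
reported_pass_at_1 = 0.725609756097561    # score reported by both runs above
solved = round(reported_pass_at_1 * total_problems)
print(solved)                             # 119
print(solved / total_problems)            # ~0.7256, the same score for both models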

Also:

  1. We added some special tokens to Deepseek-Coder and fine-tuned the model, so these two models are different. You can use the following code to check the two models' weights (a side-by-side comparison sketch also follows this list):
from transformers import AutoTokenizer, AutoModelForCausalLM

# Set to "Bin12345/AutoCoder_S_6.7B" or "deepseek-ai/deepseek-coder-6.7b-instruct"
model_path = ""
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")

# Print the name, shape, and first entries of every weight tensor.
model_weights = model.state_dict()
for name, param in model_weights.items():
    print(f"Layer: {name} | Size: {param.size()} | Values: {param[:2]}")
  2. The results shown in our paper were obtained with greedy sampling. The steps to reproduce the experiments are shown at https://github.com/bin123apple/AutoCoder

  3. Could you share your experiment code? I will run it on my side and check the results.
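
For a more direct check than eyeballing the printed values, here is a minimal side-by-side comparison sketch (not from the AutoCoder repo; it assumes there is enough RAM to hold both fp16 checkpoints at once):

import torch
from transformers import AutoModelForCausalLM

# Load both checkpoints on the CPU in fp16 so they can be compared tensor by tensor.
model_a = AutoModelForCausalLM.from_pretrained(
    "Bin12345/AutoCoder_S_6.7B", torch_dtype=torch.float16)
model_b = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/deepseek-coder-6.7b-instruct", torch_dtype=torch.float16)

state_a, state_b = model_a.state_dict(), model_b.state_dict()

# Count tensors that are bit-for-bit identical between the two checkpoints.
# torch.equal returns False when shapes differ (e.g. embeddings resized for new special tokens).
identical = 0
for name, tensor_a in state_a.items():
    tensor_b = state_b.get(name)
    if tensor_b is not None and torch.equal(tensor_a, tensor_b):
        identical += 1

print(f"{identical} / {len(state_a)} tensors are bit-for-bit identical")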

Yes, I ran the benchmark in Google Colab; here is the code:

!pip install -q -U transformers
!pip install -q -U accelerate
!pip install -q -U bitsandbytes
!git clone https://github.com/bigcode-project/bigcode-evaluation-harness.git
%cd bigcode-evaluation-harness
!pip install -r requirements.txt
!accelerate launch main.py --tasks humaneval --model Bin12345/AutoCoder_S_6.7B --load_in_8bit --allow_code_execution --max_length_generation 4000 --precision fp16 --temperature 0.1

And for the other model, just replace the model name:

!accelerate launch main.py --tasks humaneval --model deepseek-ai/deepseek-coder-6.7b-instruct --load_in_8bit --allow_code_execution --max_length_generation 4000 --precision fp16 --temperature 0.1

You have to add a flag to get it to save the results, though; it won't save them without the extra flag. I forget what it is, but you should be able to find it here:
https://github.com/bigcode-project/bigcode-evaluation-harness

Owner

Thanks for the feedback! I will try it.

If you want to check whether these two models are the same, you can use the snippet above to print out their weights.


You will see that they are different.

Owner

Hey @rombodawg, I just reproduced your experiment, and the outputs of the two models are different. You can add the two flags --save_generations and --save_generations_path to save and check the generated code. Thanks!
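
For anyone re-running this with those flags, here is a minimal sketch for diffing the two saved generation files (the filenames are placeholders, and it assumes each file is a JSON list holding one list of generated samples per HumanEval problem):

import json

# Placeholder filenames -- use whatever was passed to --save_generations_path for each run.
with open("autocoder_generations.json") as f:
    gens_autocoder = json.load(f)
with open("deepseek_generations.json") as f:
    gens_deepseek = json.load(f)

# Compare the generations problem by problem.
differing = [i for i, (a, b) in enumerate(zip(gens_autocoder, gens_deepseek)) if a != b]
print(f"{len(differing)} of {len(gens_autocoder)} problems have different generations")
print("First differing problem indices:", differing[:10])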

rombodawg changed discussion status to closed
