🍭 Fine-tuning support for Qwen2-VL-7B-Instruct

#1
by study-hjt - opened

The open-source release of Qwen2-VL is truly exciting 😊. We have added support for VQA, OCR, grounding, and video fine-tuning for Qwen2-VL; a minimal example command is sketched after the links below.

English fine-tuning document:
https://swift.readthedocs.io/en/latest/Multi-Modal/qwen2-vl-best-practice.html

Chinese fine-tuning document:
https://swift.readthedocs.io/zh-cn/latest/Multi-Modal/qwen2-vl%E6%9C%80%E4%BD%B3%E5%AE%9E%E8%B7%B5.html

GitHub: https://github.com/modelscope/ms-swift
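
A minimal LoRA fine-tuning run from the best-practice document looks roughly like the sketch below. This assumes the ms-swift 2.x CLI; the flag names and the example dataset (latex-ocr-print) may differ in newer versions, so please follow the documents above for the exact command.

CUDA_VISIBLE_DEVICES=0 swift sft \
    --model_type qwen2-vl-7b-instruct \
    --sft_type lora \
    --dataset latex-ocr-print#20000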

Nice! Thank you!

Thank you, @study-hjt!
Is there a way to freeze the vision encoder and fine-tune only the LM decoder, so the model can handle typical multi-turn conversations with images/videos as part of the conversation?
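Something like the following is what I have in mind (a rough sketch, assuming the vision encoder is exposed as model.visual in transformers' Qwen2VLForConditionalGeneration):

import torch
from transformers import Qwen2VLForConditionalGeneration

model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen2-VL-7B-Instruct", torch_dtype=torch.float16, device_map="auto"
)

# Freeze the vision encoder so only the language model is updated during fine-tuning.
for param in model.visual.parameters():
    param.requires_grad = False

# Sanity check: report how many parameters remain trainable.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable params: {trainable:,} / {total:,}")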

Error occurred on a V100:

RuntimeError: CUDA error: too many resources requested for launch
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.


Try removing torch_dtype="auto"!

Using the following code works; I'm not sure why torch_dtype="auto" failed. (Presumably torch_dtype="auto" picks up bfloat16 from the model config, which V100 GPUs don't support natively, so loading explicitly in float16 avoids the problem.)

import torch
from transformers import Qwen2VLForConditionalGeneration

# Load the model explicitly in float16 instead of relying on torch_dtype="auto".
model = Qwen2VLForConditionalGeneration.from_pretrained(
    "Qwen2-VL-7B-Instruct", torch_dtype=torch.float16, device_map="auto"
)
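
For a quick sanity check after loading, generation works as in the model card (a rough sketch; demo.jpg is just a placeholder image path, and process_vision_info comes from the qwen-vl-utils package):

from transformers import AutoProcessor
from qwen_vl_utils import process_vision_info

processor = AutoProcessor.from_pretrained("Qwen2-VL-7B-Instruct")

messages = [
    {"role": "user", "content": [
        {"type": "image", "image": "demo.jpg"},  # placeholder image path
        {"type": "text", "text": "Describe this image."},
    ]},
]

# Build the chat-formatted prompt and collect the image/video inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

generated = model.generate(**inputs, max_new_tokens=128)
# Strip the prompt tokens before decoding the answer.
answer = processor.batch_decode(generated[:, inputs.input_ids.shape[1]:],
                                skip_special_tokens=True)[0]
print(answer)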
