Avoid Loading the CPU Kernel if the User Has a GPU and CUDA Environment
Thank you for providing this model for low-GPU-memory users.
There is room for improvement: I encountered several issues while setting up the environment on a Windows 10 machine. It turns out this model can run in a normal Win10 environment without a gcc compiler or WSL support; you just need to skip the CPU kernel loading process.
To achieve this, modify the `load_cpu_kernel` method in the `quantization.py` file located in the model folder `chatglm-6b-int4`, so that the actual loading is not triggered when a GPU device is available:
```python
def load_cpu_kernel(**kwargs):
    if not torch.cuda.is_available():  # check before loading the CPU kernel
        global cpu_kernels
        cpu_kernels = CPUKernel(**kwargs)
        assert cpu_kernels.load
```
If the user does not have a GPU, the normal CPU kernel loading process is triggered, which requires `gcc` and a WSL environment on a Windows machine.
After making the above modification, users can load the model from any front-end code without hitting the `assert cpu_kernels.load` error. For example, in the chatglm-webui project, simply download the model folder and name it `chatglm-6b-int4`. Use the following command to load it on a Windows 10 machine (assuming CUDA and the required Python libraries are already installed):
```shell
python webui.py --model-path chatglm-6b-int4 --precision int4
```
I successfully loaded the model on a Win10 + 2080 (8GB) machine without gcc or WSL installed.
Thanks again for this awesome model!
It is possible that someone has a GPU but wants to use CPU inference (for example, when there is not enough GPU memory to load the model).

Currently, if the `load_cpu_kernel` method fails, the exception is caught and the program only prints a warning. The program fails only if both CPU kernel loading and GPU kernel loading fail. Therefore, I don't think it is necessary to skip the CPU kernel loading process based on CUDA availability.
Not all `load_cpu_kernel` calls are wrapped in a try/except block. For example, in `modeling_chatglm.py` on line 1430, the loading is triggered directly, which causes an AssertionError on `assert cpu_kernels.load`.
An issue has already been reported about the same error: https://github.com/THUDM/ChatGLM-6B/issues/676
A possible fix would be to place the try/except block inside the `load_cpu_kernel` function itself.
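A minimal sketch of that idea follows. Note that `CPUKernel` here is a stand-in stub (the real class lives in the model's quantization code and is not importable here); it simulates a machine without gcc/WSL by reporting that loading failed, so the fallback path is exercised:

```python
import logging

logging.basicConfig(level=logging.WARNING)
logger = logging.getLogger(__name__)

# Stand-in for the real CPUKernel class from the model's quantization code.
# Here it simulates a machine without gcc/WSL by failing to load.
class CPUKernel:
    def __init__(self, **kwargs):
        self.load = False  # the real class sets this only after compilation succeeds

cpu_kernels = None

def load_cpu_kernel(**kwargs):
    """Try to load the CPU kernel; on failure, warn instead of raising."""
    global cpu_kernels
    try:
        kernels = CPUKernel(**kwargs)
        assert kernels.load  # raises AssertionError if loading failed
        cpu_kernels = kernels
    except Exception:
        cpu_kernels = None
        logger.warning("Failed to load CPU kernels; GPU kernels may still be used.")

load_cpu_kernel()
print(cpu_kernels)  # prints "None" instead of crashing with AssertionError
```

With the try/except inside the function, every caller (including the direct call in `modeling_chatglm.py`) gets the warn-and-continue behavior instead of an unhandled AssertionError.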
It's not a major issue but quite confusing, as the error appears to be related to the CPU, yet users might not want to use the CPU at all.
Just my two cents.
Thank you for your advice. Removed the assert in `load_cpu_kernel`.