Deploying to dedicated Inference Endpoints

#29 opened by stmackcat

I was wondering if anyone has been able to deploy 3.1-70B to a dedicated Inference Endpoint? I have tried both 4xA100 and 4xH100 instances and get the following error:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 448.00 MiB. GPU 0 has a total capacity of 79.10 GiB of which 400.00 MiB is free. Including non-PyTorch memory, this process has 0 bytes memory in use. Of the allocated memory 78.02 GiB is allocated by PyTorch, and 86.47 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management
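
My rough understanding (an assumption on my part, not something stated in the error) is that ~70B parameters in bf16 are about 131 GiB of weights alone, so the model only fits if TGI shards it across all four GPUs. The trace showing ~78 GiB allocated on GPU 0 makes me suspect it is either not being sharded, or the prefill/KV-cache allocation on top of each shard is too large:

```python
# Back-of-the-envelope memory check (my assumptions: bf16 weights, ~70.6B parameters).
params = 70.6e9
bytes_per_param = 2  # bfloat16

weights_gib = params * bytes_per_param / 2**30
print(f"weights total:             {weights_gib:.0f} GiB")      # ~131 GiB -> cannot fit on one 80 GiB card
print(f"weights per GPU, 4 shards: {weights_gib / 4:.0f} GiB")  # ~33 GiB  -> leaves headroom for the KV cache
```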

I have tried both the default container (ghcr.io/huggingface/text-generation-inference:sha-f852190) and the trick suggested for 3.1-8B (ghcr.io/huggingface/text-generation-inference:2.2.0 with MODEL_ID=/repository).
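
Is something like the following the right direction? This is only a sketch of the configuration I think should work, via `huggingface_hub.create_inference_endpoint`; the instance names, repo id, and token limits are guesses on my part, and `NUM_SHARD=4` is my attempt at forcing tensor parallelism across the four GPUs rather than something I have confirmed:

```python
# Sketch of the endpoint configuration I believe is needed -- not verified.
# Assumptions: the instance_type/instance_size names ("nvidia-a100", "x4") match the
# Inference Endpoints catalog, and the TGI env vars below are enough to shard the
# model across 4 GPUs and cap the 128k context so the KV cache fits in memory.
from huggingface_hub import create_inference_endpoint

endpoint = create_inference_endpoint(
    "llama-31-70b",                                # hypothetical endpoint name
    repository="meta-llama/Meta-Llama-3.1-70B-Instruct",  # or the base 70B repo
    framework="pytorch",
    task="text-generation",
    accelerator="gpu",
    vendor="aws",
    region="us-east-1",
    type="protected",
    instance_size="x4",
    instance_type="nvidia-a100",
    custom_image={
        "health_route": "/health",
        "url": "ghcr.io/huggingface/text-generation-inference:2.2.0",
        "env": {
            "MODEL_ID": "/repository",             # the 3.1-8B trick
            "NUM_SHARD": "4",                      # shard across all 4 GPUs
            "MAX_INPUT_TOKENS": "8000",            # cap well below the 128k default
            "MAX_TOTAL_TOKENS": "8192",
            "MAX_BATCH_PREFILL_TOKENS": "8192",
        },
    },
)
endpoint.wait()
print(endpoint.url)
```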
