Unexpected behavior of Phi-3.5-vision model with respect to 4K token length.

#14
by ita9naiwa - opened

While testing the Phi-3.5 vision model, I noticed that if the prompt length is less than 4096 tokens but len(prompt) + len(generated tokens) exceeds 4096, the model either produces dummy output or stops generating.
In this case, the logit distribution shows the model trying to emit <|end|> tokens, but it keeps producing dummy output because <|end|> is not registered as a stop token in the HF implementation.
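To make the failure mode concrete, here is a minimal toy sketch of why an unregistered stop token causes runaway generation (the token ids below are assumptions for illustration, not guaranteed to match the real Phi-3.5 vocabulary):

```python
# Hypothetical token ids for illustration only (assumed, not the real vocab).
END_ID = 32007  # assumed id for <|end|>
EOS_ID = 32000  # assumed id for <|endoftext|>

def generate_until_stop(token_stream, stop_ids, max_new_tokens=512):
    """Collect tokens until a stop id appears or the budget is exhausted."""
    out = []
    for tok in token_stream:
        if tok in stop_ids:
            break
        out.append(tok)
        if len(out) >= max_new_tokens:
            break
    return out

stream = [10, 11, END_ID, 99, 99]

# With only <|endoftext|> as a stop token, <|end|> slips through and the
# degenerate tail is kept:
print(generate_until_stop(stream, {EOS_ID}))          # [10, 11, 32007, 99, 99]

# Adding <|end|> to the stop set halts generation where the model intended:
print(generate_until_stop(stream, {EOS_ID, END_ID}))  # [10, 11]
```

If this diagnosis is right, a plausible workaround is to pass the <|end|> id explicitly, e.g. `eos_token_id=processor.tokenizer.convert_tokens_to_ids(["<|end|>", "<|endoftext|>"])` in the `generate` call, since `eos_token_id` accepts a list of ids.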

Here is the script I used to reproduce this behavior, adapted from the code snippet at https://huggingface.co/microsoft/Phi-3.5-vision-instruct
I only reduced the number of images in the prompt.

from PIL import Image 
import requests 
from transformers import AutoModelForCausalLM 
from transformers import AutoProcessor 

model_id = "/opt/models/phi/Phi-3.5-vision-instruct/" 

# Note: set _attn_implementation='eager' if you don't have flash_attn installed
model = AutoModelForCausalLM.from_pretrained(
  model_id, 
  device_map="cuda", 
  trust_remote_code=True, 
  torch_dtype="auto", 
  _attn_implementation='eager'    
)

# for best performance, use num_crops=4 for multi-frame, num_crops=16 for single-frame.
processor = AutoProcessor.from_pretrained(model_id, 
  trust_remote_code=True, 
  num_crops=4
) 


images = []
placeholder = ""

# Note: if you hit OOM, consider reducing the number of frames in this example.
for i in range(1, 6):
    url = f"https://image.slidesharecdn.com/azureintroduction-191206101932/75/Introduction-to-Microsoft-Azure-Cloud-{i}-2048.jpg" 
    images.append(Image.open(requests.get(url, stream=True).raw))
    placeholder += f"<|image_{i}|>\n"

messages = [
    {"role": "user", "content": placeholder+"Summarize the deck of slides."},
]

prompt = processor.tokenizer.apply_chat_template(
  messages, 
  tokenize=False, 
  add_generation_prompt=True
)

inputs = processor(prompt, images, return_tensors="pt").to("cuda:0") 

generation_args = { 
    "max_new_tokens": 512, 
    "temperature": 0.0, 
} 

generate_ids = model.generate(**inputs, 
  eos_token_id=processor.tokenizer.eos_token_id, 
  **generation_args
)
print("input prompt size", inputs.input_ids.shape[1])

generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
print("output tokens size", generate_ids.shape[1])

response = processor.batch_decode(generate_ids, 
  skip_special_tokens=True, 
  clean_up_tokenization_spaces=False)[0] 
print(response)

It prints:

input prompt size 3817
output tokens size 512
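For reference, these numbers line up with the 4K boundary: the prompt already consumes 3817 tokens, so fewer than 300 new tokens fit under 4096, and the output below stays coherent for roughly that long before degenerating. A quick sanity check:

```python
prompt_len = 3817      # "input prompt size" printed above
max_new_tokens = 512   # from generation_args
context_4k = 4096

# New tokens that fit before the total length crosses 4096:
headroom = context_4k - prompt_len
print(headroom)                                  # 279
# The requested 512 new tokens overshoot the boundary:
print(prompt_len + max_new_tokens > context_4k)  # True
```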

To encapsulate, the slides feature the following segments:

- Introduction to Azure: 
The presentation introduces Microsoft Azure, a cloud computing platform. It highlights the three types of Azure services: Enterprise, Hybrid, and Hyper-scale. The presenter is Dinesh Kumar Wickramasinghe, a Senior Software Engineer from CMS Private Limited in Sri Lanka.

- Azure Services Overview: 
Azure offers a continuously expanding set of cloud services to help organizations meet their current and future business challenges. It provides the freedom to build, manage, and deploy applications on a massive global network using favorite tools and frameworks.

- Cloud Computing Models: 
The presentation explains the three main models of cloud computing: IaaS (Infrastructure-as-a-Service), PaaS (Platform-as-a-Service), and SaaS (Software-as-a-Service). Each model is represented by a unique icon and color.

- Cloud Service Comparison: 
The presentation compares the roles of the user in different cloud service models using a dining table analogy. In IaaS, the user manages the infrastructure. In PaaS, the user manages the platform. In SaaS, the user manages, and and and services. The, and the and the service. The the and and the service.

[... the rest of the 512-token output is degenerate filler: repeated fragments such as "and", "the", and "service" interleaved with blank lines, continuing until max_new_tokens is exhausted ...]

How can this be solved, or is it intended behavior?
