It uses the same vision encoder, so I expect that nothing changes.
The large model is finally SoTA for both image and text multilingual retrieval!
The models are available on the hub (a quick usage sketch follows the list):
- visheratin/nllb-siglip-mrl-base
- visheratin/nllb-siglip-mrl-large
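For context, here is a minimal retrieval sketch. This is an assumption-heavy sketch, not the documented API: I assume the repos ship custom code loadable with `trust_remote_code=True` and expose CLIP-style `get_image_features`/`get_text_features` helpers. Check the model cards for the exact entry points.

```python
# Hedged sketch: the feature helpers below are assumed CLIP-style methods,
# not verified against the repo's custom code.
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model_id = "visheratin/nllb-siglip-mrl-base"
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("cat.jpg")
texts = ["a photo of a cat", "une photo d'un chat"]  # NLLB covers 200+ languages

inputs = processor(images=image, text=texts, padding=True, return_tensors="pt")
with torch.no_grad():
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )

# Normalize and compare: higher cosine similarity means a better match.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
print(image_emb @ text_emb.T)
```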
I used 8xA100 80GB GPUs. With LoRA and a smaller batch size, it should be possible to train on smaller GPUs, but it is still very resource-intensive.
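For anyone trying this on smaller hardware, here is a hedged sketch of what a LoRA setup could look like with peft. The rank, alpha, and target module names below are illustrative assumptions, not the configuration used for these models.

```python
# Illustrative LoRA setup via peft; r, alpha, and target_modules are assumptions.
from peft import LoraConfig, get_peft_model
from transformers import AutoModel

model = AutoModel.from_pretrained("visheratin/nllb-siglip-mrl-base", trust_remote_code=True)

lora_config = LoraConfig(
    r=16,                                 # low-rank dimension; smaller r -> less memory
    lora_alpha=32,                        # scaling factor for the LoRA updates
    target_modules=["q_proj", "v_proj"],  # attention projections (assumed names)
    lora_dropout=0.05,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights require gradients
```

With only the adapters trainable, the gradients and optimizer states shrink dramatically, which is where most of the memory savings come from.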
You are right. The method requires multiple passes through the vision encoder, which increases memory usage. This is not a big problem during inference, but it makes training harder because of the stored gradients. At the moment, I don't have a solution to make it more efficient. But this is an ongoing project, so maybe I will find one =)
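To make the memory point concrete, here is a schematic sketch (placeholder code, not the actual implementation): each crop is a separate forward pass through the vision encoder, so at inference the activations can be discarded after each pass, while during training every pass keeps its activations for the backward pass.

```python
# Schematic only: vision_encoder and the crop handling are placeholders.
import torch

def encode_crops(vision_encoder, crops, training=False):
    features = []
    for crop in crops:  # one encoder pass per crop
        if training:
            # Each pass keeps its activations for backprop,
            # so memory grows roughly linearly with the number of crops.
            features.append(vision_encoder(crop))
        else:
            # At inference the graph is discarded after each pass,
            # so peak memory stays close to a single-crop forward.
            with torch.no_grad():
                features.append(vision_encoder(crop))
    return torch.cat(features, dim=1)  # concatenate per-crop tokens
```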
There are links to existing papers in the blog post if you want to dive into the field.
I mainly used the LLaVA training codebase, with some changes to support multi-crop. I'll be working on the next post about fine-tuning MC-LLaVA on a task-specific dataset and will open-source all the code.
Check it out, and let me know what you think!
Other notable updates:
- I use SigLIP from Transformers, so you don't need to install additional libraries.
- The model now supports auto classes, so you can create the model and processor with only two lines (see the snippet after this list).
- Performance increased by 10%+ across all benchmarks.
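To illustrate the two-line loading, here is roughly what it should look like; `AutoModel` is my assumption for the auto class (check the model card for the exact one), and `trust_remote_code=True` is needed because the repo ships custom modeling code.

```python
from transformers import AutoModel, AutoProcessor

# The repo ships custom code, hence trust_remote_code=True.
model = AutoModel.from_pretrained("visheratin/MC-LLaVA-3b", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("visheratin/MC-LLaVA-3b", trust_remote_code=True)
```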
The work is far from over, but it feels like good progress.
The model on the hub: visheratin/MC-LLaVA-3b
You can try the model here: visheratin/mc-llava-3b