It uses the same vision encoder, so I expect that nothing changes.
The large model is finally SoTA for both image and text multilingual retrieval!
The models are available on the hub (a quick usage sketch follows the list):
- visheratin/nllb-siglip-mrl-base
- visheratin/nllb-siglip-mrl-large
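For context, here is a minimal retrieval sketch. This is an assumption-heavy sketch, not the documented API: I assume the repos ship custom code loadable with `trust_remote_code=True` and expose CLIP-style `get_image_features`/`get_text_features` helpers. Check the model cards for the exact entry points.

```python
# Hedged sketch: the feature helpers below are assumed CLIP-style methods,
# not verified against the repo's custom code.
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

model_id = "visheratin/nllb-siglip-mrl-base"
model = AutoModel.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("cat.jpg")
texts = ["a photo of a cat", "une photo d'un chat"]  # NLLB covers 200+ languages

inputs = processor(images=image, text=texts, padding=True, return_tensors="pt")
with torch.no_grad():
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )

# Normalize and compare: higher cosine similarity means a better match.
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
print(image_emb @ text_emb.T)
```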
I used 8xA100 80GB GPUs. With LoRA and a smaller batch size, it should be possible to train on smaller GPUs, but it is still very resource-intensive.
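For anyone trying this on smaller hardware, here is a hedged sketch of what a LoRA setup could look like with peft. The rank, alpha, and target module names below are illustrative assumptions, not the configuration used for these models.

```python
# Illustrative LoRA setup via peft; r, alpha, and target_modules are assumptions.
from peft import LoraConfig, get_peft_model
from transformers import AutoModel

model = AutoModel.from_pretrained("visheratin/nllb-siglip-mrl-base", trust_remote_code=True)

lora_config = LoraConfig(
    r=16,                                 # low-rank dimension; smaller r -> less memory
    lora_alpha=32,                        # scaling factor for the LoRA updates
    target_modules=["q_proj", "v_proj"],  # attention projections (assumed names)
    lora_dropout=0.05,
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights require gradients
```

With only the adapters trainable, the gradients and optimizer states shrink dramatically, which is where most of the memory savings come from.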
You are right. The method requires multiple passes through the vision encoder, which increases memory usage. This is not a big problem during inference, but it makes training harder because of the stored gradients. At the moment, I don't have a solution to make it more efficient. But this is an ongoing project, so maybe I will find one =)
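To make the memory point concrete, here is a schematic sketch (placeholder code, not the actual implementation): each crop is a separate forward pass through the vision encoder, so at inference the activations can be discarded after each pass, while during training every pass keeps its activations for the backward pass.

```python
# Schematic only: vision_encoder and the crop handling are placeholders.
import torch

def encode_crops(vision_encoder, crops, training=False):
    features = []
    for crop in crops:  # one encoder pass per crop
        if training:
            # Each pass keeps its activations for backprop,
            # so memory grows roughly linearly with the number of crops.
            features.append(vision_encoder(crop))
        else:
            # At inference the graph is discarded after each pass,
            # so peak memory stays close to a single-crop forward.
            with torch.no_grad():
                features.append(vision_encoder(crop))
    return torch.cat(features, dim=1)  # concatenate per-crop tokens
```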
There are links to existing papers in the blog post if you want to dive into the field.
I mainly used the LLaVA training codebase, with some changes to support multi-crop. I'll be working on the next post about fine-tuning MC-LLaVA on a task-specific dataset and will open-source all the code.
Check it out, and let me know what you think!
Other notable updates:
- I use SigLIP from Transformers, so you don't need to install additional libraries.
- The model now supports auto classes, so you can create the model and processor with only two lines (see the snippet after this list).
- Performance increased by 10%+ across all benchmarks.
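To illustrate the two-line loading, here is roughly what it should look like; `AutoModel` is my assumption for the auto class (check the model card for the exact one), and `trust_remote_code=True` is needed because the repo ships custom modeling code.

```python
from transformers import AutoModel, AutoProcessor

# The repo ships custom code, hence trust_remote_code=True.
model = AutoModel.from_pretrained("visheratin/MC-LLaVA-3b", trust_remote_code=True)
processor = AutoProcessor.from_pretrained("visheratin/MC-LLaVA-3b", trust_remote_code=True)
```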
The work is far from over, but it feels like good progress.
The model on the hub: visheratin/MC-LLaVA-3b
You can try the model here: visheratin/mc-llava-3b