# ColPali: Efficient Document Retrieval with Vision Language Models [[Blog]](https://huggingface.co/blog/manu/colpali) [[Paper]](https://arxiv.org/abs/2407.01449) [[ColPali Model card]](https://huggingface.co/vidore/colpali) [[ViDoRe Benchmark]](https://huggingface.co/vidore) [[HuggingFace Demo]](https://huggingface.co/spaces/manu/ColPali-demo)

## Associated Paper

**ColPali: Efficient Document Retrieval with Vision Language Models**  
Manuel Faysse, Hugues Sibille, Tony Wu, Bilel Omrani, Gautier Viaud, Céline Hudelot, Pierre Colombo

This repository contains the code for training custom ColBERT-style retriever models. Notably, we train ColBERT-style retrievers with LLM (decoder) backbones as well as vision-language model backbones!

## Installation

### From git

```bash
pip install git+https://github.com/illuin-tech/colpali
```

### From source

```bash
git clone https://github.com/illuin-tech/colpali
cd colpali
pip install -r requirements.txt
```

## Usage

Example usage of the model is shown in the `scripts` directory.

```bash
# hackable example script to adapt
python scripts/infer/run_inference_with_python.py
```

```python
import torch
import typer
from torch.utils.data import DataLoader
from tqdm import tqdm
from transformers import AutoProcessor
from PIL import Image

from colpali_engine.models.paligemma_colbert_architecture import ColPali
from colpali_engine.trainer.retrieval_evaluator import CustomEvaluator
from colpali_engine.utils.colpali_processing_utils import process_images, process_queries
from colpali_engine.utils.image_from_page_utils import load_from_dataset


def main() -> None:
    """Example script to run inference with ColPali"""

    # Load model
    model_name = "vidore/colpali"
    model = ColPali.from_pretrained("google/paligemma-3b-mix-448", torch_dtype=torch.bfloat16, device_map="cuda").eval()
    model.load_adapter(model_name)
    processor = AutoProcessor.from_pretrained(model_name)

    # select images -> load_from_pdf(), load_from_image_urls([""]), load_from_dataset()
    images = load_from_dataset("vidore/docvqa_test_subsampled")
    queries = ["From which university does James V. Fiorca come ?", "Who is the japanese prime minister?"]

    # run inference - docs
    dataloader = DataLoader(
        images,
        batch_size=4,
        shuffle=False,
        collate_fn=lambda x: process_images(processor, x),
    )
    ds = []
    for batch_doc in tqdm(dataloader):
        with torch.no_grad():
            batch_doc = {k: v.to(model.device) for k, v in batch_doc.items()}
            embeddings_doc = model(**batch_doc)
        ds.extend(list(torch.unbind(embeddings_doc.to("cpu"))))

    # run inference - queries
    dataloader = DataLoader(
        queries,
        batch_size=4,
        shuffle=False,
        collate_fn=lambda x: process_queries(processor, x, Image.new("RGB", (448, 448), (255, 255, 255))),
    )
    qs = []
    for batch_query in dataloader:
        with torch.no_grad():
            batch_query = {k: v.to(model.device) for k, v in batch_query.items()}
            embeddings_query = model(**batch_query)
        qs.extend(list(torch.unbind(embeddings_query.to("cpu"))))

    # run evaluation
    retriever_evaluator = CustomEvaluator(is_multi_vector=True)
    scores = retriever_evaluator.evaluate(qs, ds)
    print(scores.argmax(axis=1))


if __name__ == "__main__":
    typer.run(main)
```

Details are also given in the model card of the base ColPali model on HuggingFace: [ColPali Model card](https://huggingface.co/vidore/colpali).
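`CustomEvaluator.evaluate` returns a query-by-page score matrix, and the script above only prints the single best page per query. Below is a minimal sketch of extending it to top-k retrieval; it assumes `scores` and `queries` come from the example above and that `scores` is array-like with shape `(n_queries, n_pages)`.

```python
import torch

# `scores` and `queries` are taken from the inference example above.
# Assumption: `scores` is array-like with shape (n_queries, n_pages).
scores_t = torch.as_tensor(scores, dtype=torch.float32)

# Best k pages per query (k capped by the number of indexed pages).
k = min(3, scores_t.shape[1])
top_scores, top_pages = scores_t.topk(k, dim=1)

for query, pages, vals in zip(queries, top_pages.tolist(), top_scores.tolist()):
    ranking = ", ".join(f"page {p} ({v:.2f})" for p, v in zip(pages, vals))
    print(f"{query} -> {ranking}")
```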
## Training

```bash
USE_LOCAL_DATASET=0 python scripts/train/train_colbert.py scripts/configs/siglip/train_siglip_model_debug.yaml
```

or

```bash
accelerate launch scripts/train/train_colbert.py scripts/configs/train_colidefics_model.yaml
```

### Configurations

All training arguments can be set through a configuration file. The configuration file is a YAML file containing all the arguments for training; it is mapped onto the following dataclass:

```python
@dataclass
class ColModelTrainingConfig:
    model: PreTrainedModel
    tr_args: TrainingArguments = None
    output_dir: str = None
    max_length: int = 256
    run_eval: bool = True
    run_train: bool = True
    peft_config: Optional[LoraConfig] = None
    add_suffix: bool = False
    processor: Idefics2Processor = None
    tokenizer: PreTrainedTokenizer = None
    loss_func: Optional[Callable] = ColbertLoss()
    dataset_loading_func: Optional[Callable] = None
    eval_dataset_loader: Optional[Dict[str, Callable]] = None
    pretrained_peft_model_name_or_path: Optional[str] = None
```

### Example

An example configuration file is:

```yaml
config:
  (): colpali_engine.utils.train_colpali_engine_models.ColModelTrainingConfig
  output_dir: !path ../../../models/without_tabfquad/train_colpali-3b-mix-448
  processor:
    (): colpali_engine.utils.wrapper.AutoProcessorWrapper
    pretrained_model_name_or_path: "./models/paligemma-3b-mix-448"
    max_length: 50
  model:
    (): colpali_engine.utils.wrapper.AutoColModelWrapper
    pretrained_model_name_or_path: "./models/paligemma-3b-mix-448"
    training_objective: "colbertv1"
    # attn_implementation: "eager"
    torch_dtype: !ext torch.bfloat16
    # device_map: "auto"
    # quantization_config:
    #   (): transformers.BitsAndBytesConfig
    #   load_in_4bit: true
    #   bnb_4bit_quant_type: "nf4"
    #   bnb_4bit_compute_dtype: "bfloat16"
    #   bnb_4bit_use_double_quant: true

  dataset_loading_func: !ext colpali_engine.utils.dataset_transformation.load_train_set
  eval_dataset_loader: !import ../data/test_data.yaml

  max_length: 50
  run_eval: true
  add_suffix: true
  loss_func:
    (): colpali_engine.loss.colbert_loss.ColbertPairwiseCELoss
  tr_args: !import ../tr_args/default_tr_args.yaml
  peft_config:
    (): peft.LoraConfig
    r: 32
    lora_alpha: 32
    lora_dropout: 0.1
    init_lora_weights: "gaussian"
    bias: "none"
    task_type: "FEATURE_EXTRACTION"
    target_modules: '(.*(language_model).*(down_proj|gate_proj|up_proj|k_proj|q_proj|v_proj|o_proj).*$|.*(custom_text_proj).*$)'
```

#### Local training

```bash
USE_LOCAL_DATASET=0 python scripts/train/train_colbert.py scripts/configs/siglip/train_siglip_model_debug.yaml
```

#### SLURM

```bash
sbatch --nodes=1 --cpus-per-task=16 --mem-per-cpu=32GB --time=20:00:00 --gres=gpu:1 -p gpua100 --job-name=colidefics --output=colidefics.out --error=colidefics.err --wrap="accelerate launch scripts/train/train_colbert.py scripts/configs/train_colidefics_model.yaml"

sbatch --nodes=1 --time=5:00:00 -A cad15443 --gres=gpu:8 --constraint=MI250 --job-name=colpali --wrap="python scripts/train/train_colbert.py scripts/configs/train_colpali_model.yaml"
```

A sketch of loading a trained adapter for inference is given at the end of this README.

## Citation

```bibtex
@misc{faysse2024colpaliefficientdocumentretrieval,
  title={ColPali: Efficient Document Retrieval with Vision Language Models},
  author={Manuel Faysse and Hugues Sibille and Tony Wu and Bilel Omrani and Gautier Viaud and Céline Hudelot and Pierre Colombo},
  year={2024},
  eprint={2407.01449},
  archivePrefix={arXiv},
  primaryClass={cs.IR},
  url={https://arxiv.org/abs/2407.01449},
}
```
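## Loading a trained adapter

Training with a `peft_config` produces a LoRA adapter rather than a full model. The snippet below is a minimal sketch of loading such an adapter for inference, mirroring the usage example at the top of this README; the adapter path is illustrative (point it at the `output_dir` of your run), and reusing the base checkpoint's processor is an assumption, not something guaranteed by the training scripts.

```python
import torch
from transformers import AutoProcessor

from colpali_engine.models.paligemma_colbert_architecture import ColPali

# Load the base model, then attach the trained LoRA adapter.
# The adapter path is illustrative: use the `output_dir` from your configuration.
model = ColPali.from_pretrained(
    "google/paligemma-3b-mix-448",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
).eval()
model.load_adapter("./models/without_tabfquad/train_colpali-3b-mix-448")

# Assumption: the base checkpoint's processor is reused for images and queries,
# as in the released `vidore/colpali` setup.
processor = AutoProcessor.from_pretrained("google/paligemma-3b-mix-448")
```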