# LLMs can see and hear without any training

Official implementation of the paper *LLMs can see and hear without any training*.

## Installation

Install and activate the conda environment:

```
conda env create -f environment.yml
conda activate MILS
```

## Dataset and checkpoints

Download the following datasets, annotations, and checkpoints.

### MS-COCO

Download the MS-COCO validation dataset from the official website here. Also download the 5,000-sample test split used in Karpathy et al., "Deep visual-semantic alignments for generating image descriptions", CVPR 2015.

```
wget http://images.cocodataset.org/zips/val2014.zip
wget http://images.cocodataset.org/annotations/annotations_trainval2014.zip
unzip val2014.zip
unzip annotations_trainval2014.zip
```

### Clotho

Download the Clotho dataset from the official website here. We use the test split of this dataset for our benchmarking.

```
wget https://zenodo.org/records/3490684/files/clotho_audio_evaluation.7z
pip3 install dtrx
wget https://www.7-zip.org/a/7z2107-linux-x64.tar.xz
tar xf 7z2107-linux-x64.tar.xz
./7zz e clotho_audio_evaluation.7z
wget https://zenodo.org/records/3490684/files/clotho_captions_evaluation.csv
```

### MSR-VTT

Download the dataset from here. We use the test split of this dataset.

```
wget https://www.robots.ox.ac.uk/~maxbain/frozen-in-time/data/MSRVTT.zip
unzip MSRVTT.zip
```

### ViClip-InternVid-10M-FLT.pth

Download from here and set the correct path in `task_utils/video/viclip.py`.

## Updating the paths

Update the variables in `paths.py` to set the dataset directory and the output folder.

## Running the code

MILS is an inference-only method that can be run on a single A100 GPU. We run the experiments on eight A100 GPUs; the commands below can be adjusted to any number of GPUs.

### Image captioning

Generate captions using

```
CUDA_VISIBLE_DEVICES=0 python main_image_captioning.py --process 0 --num_processes 8 --batch_size 32 &
CUDA_VISIBLE_DEVICES=1 python main_image_captioning.py --process 1 --num_processes 8 --batch_size 32 &
CUDA_VISIBLE_DEVICES=2 python main_image_captioning.py --process 2 --num_processes 8 --batch_size 32 &
# ...one process per remaining GPU
```
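Since the per-GPU commands above all follow the same pattern (process `i` on GPU `i`), they can also be launched with a small shell loop. The sketch below is not part of the repository; it simply assumes one captioning process per visible GPU with the same flags as above:

```
# Hypothetical launcher loop: adjust NUM_GPUS to your machine.
NUM_GPUS=8
for i in $(seq 0 $((NUM_GPUS - 1))); do
  CUDA_VISIBLE_DEVICES=$i python main_image_captioning.py \
    --process $i --num_processes $NUM_GPUS --batch_size 32 &
done
wait  # block until all background captioning processes finish
```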