LLMs can see and hear without any training


Summary

Official implementation of the paper "LLMs can see and hear without any training".

Installation

Install the conda environment using

    conda env create -f environment.yml
    conda activate MILS

Dataset and checkpoints

Download the following datasets, annotations, and checkpoints.

MS-COCO: Download the MS-COCO validation dataset from the official website. Also download the 5000-sample test split used in Karpathy et al., Deep visual-semantic alignments for generating image descriptions, CVPR 2015.

    wget http://images.cocodataset.org/zips/val2014.zip
    wget http://images.cocodataset.org/annotations/annotations_trainval2014.zip
    unzip val2014.zip
    unzip annotations_trainval2014.zip

Clotho: Download the Clotho dataset from the official website. We use the test split of this dataset for our benchmarking.

    wget https://zenodo.org/records/3490684/files/clotho_audio_evaluation.7z
    pip3 install dtrx
    wget https://www.7-zip.org/a/7z2107-linux-x64.tar.xz
    tar xf 7z2107-linux-x64.tar.xz
    ./7zz e clotho_audio_evaluation.7z
    wget https://zenodo.org/records/3490684/files/clotho_captions_evaluation.csv

MSR-VTT: Download the dataset. We use the test split of this dataset.

    wget https://www.robots.ox.ac.uk/~maxbain/frozen-in-time/data/MSRVTT.zip
    unzip MSRVTT.zip

ViClip-InternVid-10M-FLT.pth: Download the checkpoint and set the correct path in task_utils/video/viclip.py.

Updating the paths

Update the variables in paths.py to set the dataset directory and the output folder.

Running the code

MILS is an inference-only method that can be run on a single A100 GPU. We run the experiments on eight A100 GPUs, and the code below can be adjusted for any number of GPUs.

Image captioning

Generate captions using

    CUDA_VISIBLE_DEVICES=0 python main_image_captioning.py --process 0 --num_processes 8 --batch_size 32 &
    CUDA_VISIBLE_DEVICES=1 python main_image_captioning.py --process 1 --num_processes 8 --batch_size 32 &
    CUDA_VISIBLE_DEVICES=2 python main_image_cap...
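
The per-GPU commands above follow a regular pattern: process index i runs on GPU i, and --num_processes equals the number of GPUs. As a minimal sketch of adjusting this for a different GPU count, the helper below (build_launch_commands is a hypothetical name, not part of the repository) only builds the command strings; it assumes nothing about the script beyond the flags shown in the summary.

```python
def build_launch_commands(num_gpus, script="main_image_captioning.py", batch_size=32):
    """Return one shell command per GPU, mirroring the pattern in the summary.

    Hypothetical helper: it only formats strings and does not launch anything.
    """
    commands = []
    for gpu in range(num_gpus):
        commands.append(
            f"CUDA_VISIBLE_DEVICES={gpu} python {script} "
            f"--process {gpu} --num_processes {num_gpus} --batch_size {batch_size}"
        )
    return commands


if __name__ == "__main__":
    # Print the commands with a trailing "&" so they can be pasted into a shell
    # and run in the background, as in the original instructions.
    for cmd in build_launch_commands(8):
        print(cmd + " &")
```

Calling build_launch_commands(1) gives the single-GPU variant with --process 0 --num_processes 1.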

First seen: 2025-04-26 14:07

Last seen: 2025-04-27 05:13
