LLMs are the present of AI
Video understanding is the future
That's why we're organizing the 2nd VidLLM workshop at CVPR 2026!
We'll have paper submissions and 3 challenges. More info coming soon!
I agree. In general though, it just seems like the tokens used in LLMs are far too fine-grained. Baking more info into each token can be done in other ways than rendering images too. For instance, @parskatt.bsky.social pointed me to CompLLM by @berton-gabri.bsky.social arxiv.org/abs/2509.19228
This is exactly what I thought when deepseek OCR came out!
These two and many other works build on the same assumptions, and I don't understand how come we're still using ~one token per word in NLP
How to select pre-training data for LLMs?
Two papers came out last week from AllenAI and Nvidia that do it in a similar way, building on the intuition that good data is good regardless of the size of the LLM.
This intuition can be used to select good data in a cheap manner (training a large LLM on many subsets would be unfeasibly expensive).
Here are some similarities and differences between these two papers:
Both papers split the whole available training data into subsets, train a small LLM on each subset, and see how it performs: its performance is used as a proxy for data quality.
The main difference is that DataDecide splits the data according to its data source (training datasets are usually a collection of multiple datasets), while CLIMB creates clusters from each document's embeddings (meaning documents from different datasets can end up in the same cluster). Intuitively, the first method is cheaper, while the latter is more expensive and better performing.
DataDecide: arxiv.org/abs/2504.11393
CLIMB: arxiv.org/abs/2504.13161
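The selection recipe described above (split the data into subsets, train a small proxy model on each, and keep the subsets whose proxies perform best) can be sketched roughly as follows. All names are illustrative, and `toy_proxy` is a stand-in for the expensive step of actually training and evaluating a small LLM:

```python
# Hedged sketch of proxy-based pre-training data selection.
# Function and variable names are illustrative, not from either paper.

def select_best_subsets(subsets, train_and_eval, top_k=2):
    """Score each subset by training a small proxy model on it,
    then keep the top_k highest-scoring subsets."""
    scores = {name: train_and_eval(docs) for name, docs in subsets.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_k], scores

# Toy stand-in for "train a small LLM and measure downstream performance":
# here we just reward subsets with longer average documents.
def toy_proxy(docs):
    return sum(len(d) for d in docs) / len(docs)

subsets = {
    "web": ["short", "texts"],
    "books": ["a much longer document", "another long document"],
    "code": ["def f(): pass"],
}

best, scores = select_best_subsets(subsets, toy_proxy, top_k=1)
print(best)  # the subset whose proxy score is highest
```

In DataDecide the subsets would come from the data sources themselves, while in CLIMB they would come from clustering document embeddings; the selection loop is the same.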
To Match or Not to Match: Revisiting Image Matching for Reliable Visual Place Recognition
Davide Sferrazza, @berton-gabri.bsky.social, @gabtriv.bsky.social, Carlo Masone
tl;dr: VPR datasets saturate; re-ranking not good; image matching -> uncertainty -> inlier counts -> confidence
arxiv.org/abs/2504.06116
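The "inlier counts -> confidence" idea from the tl;dr can be sketched as below. This is only an illustration of the general mechanism (thresholds, the normalization, and all names are my assumptions, not the paper's actual method):

```python
# Hedged sketch: turning inlier counts from image matching into a
# confidence signal for a retrieval result. Values are illustrative.

def match_confidence(num_inliers, max_expected=100):
    """Map an inlier count to a [0, 1] confidence score."""
    return min(num_inliers / max_expected, 1.0)

def should_trust_top1(inlier_counts, threshold=0.3):
    """Accept the top-1 retrieved image only if its matching
    confidence clears the threshold."""
    return match_confidence(inlier_counts[0]) >= threshold

print(should_trust_top1([55, 12, 3]))  # 55 inliers -> confidence 0.55 -> True
```

The point is that the geometric verification step already produces a free uncertainty estimate (the inlier count), which can gate whether to trust the retrieval at all.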
When I read a paper, the only way I have to remember something about it six months from now is to use Anki
Probably nobody knows how to pronounce his name and so they avoid talking about him
And it gets better... for MCoT (Multimodal Chain-of-Thought) they should say "in recent weeks"
I find it mind-blowing that LLM papers should start saying "in recent months" instead of years. OpenAI o1 and DeepSeek R1 are literally a few months old
The FastAPLoss gave us worse results than average, but again, it was preliminary results with batch size 32.
The SmoothAP and Recall@k are not in the PML so we didn't even consider them (we had already over 30 losses to try). It might be helpful to add your Recall@k to PML :)
Cool stuff :)
bsky.app/profile/bert...
Yeah intuitively it makes sense to perturb the student's images, not sure why it doesn't work in the 2021 distillation paper.
Someone should make a benchmark for distillation across tasks...
I believe Beyer et al 2021 distillation paper says the images should be the same for teacher and student
Big news! Just got my O-1 visa, booked my flight to San Francisco, and I'm really happy to join Amazon in Palo Alto! Ready for this exciting new chapter
I'll be doing a PostDoc on Vision-Language Models!
The line is so blurry...
Two images of the same car are the same instance? (yes)
If it's the same car but re-painted?
If it's the same car but re-made?
If it's two different cars, same model with same color?
If same model, different color?
Same brand, different model?
Someone should add the GLDv2 dataset to the PML library datasets.
It should take a couple of hours to write the code (maybe 10 minutes with Cursor), and you'd be a contributor to the most important metric learning library
github.com/cvdfoundatio...
kevinmusgrave.github.io/pytorch-metr...
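For anyone picking this up: the first step would be parsing the GLDv2 index into (image id, landmark label) pairs that a dataset class can consume. A minimal sketch, assuming a train.csv with `id,url,landmark_id` columns (hedged: check the actual repo for the real file layout before relying on it):

```python
# Hedged sketch: parse a GLDv2-style train.csv into (image_id, landmark_id)
# pairs. The column names are an assumption based on the public release.
import csv
import io

def load_gldv2_index(csv_text):
    """Return a list of (image_id, landmark_id) pairs from the index CSV."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["id"], int(row["landmark_id"])) for row in reader]

sample = (
    "id,url,landmark_id\n"
    "abc123,http://example.com/a.jpg,17\n"
    "def456,http://example.com/b.jpg,42\n"
)
pairs = load_gldv2_index(sample)
print(pairs)  # [('abc123', 17), ('def456', 42)]
```

From there it's mostly wiring these pairs into a Dataset class in the same style as the library's existing datasets.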
Interesting work, happy to see people working in the field!
Also a bit disappointed not to see them comparing with methods that we found to be SOTA on the task, like RoMa and SIFT+LightGlue
I won't have time to run new experiments (starting a new job on Monday) but if anyone wants to add results with other losses or anything else I'm happy to update the paper :)
Interesting point, are you referring to e.g. the FastAPLoss?
To be fair, our preliminary results, which were used to select the shortlist of 12 losses (out of 34, all those in the pytorch-metric-learning library), were run on a batch size of 32, so there's a chance we missed out on good losses
I think I see your point, for you image retrieval is about retrieving an image of exactly the same object (e.g. exactly that one car, not a car of the same model)?
Then isn't that instance retrieval?
But anyway, naming conventions are very blurry in our field
Also, the paper is only on arxiv, we have no plans to submit, and the code is super simple
If anyone wants to add results we're pretty flexible with it, and we can add new authors
My main goal is to have a good reference paper for anyone doing retrieval, so I'm happy to update the paper as needed
And I'd call GLD, Oxford, etc "landmark retrieval"
To be fair they're all image retrieval datasets, but GLD/Oxford and CUB/Cars are just different subcategories of it
The nice thing about the datasets we used is that the train-test splits are well defined, whereas e.g. Oxford and Paris have no train sets
I'll have to pay a visit 🪴
The one and only fern! Where is it?
While writing this I've realized that fern is an anagram of NeRF, definitely not a coincidence
Arxiv: arxiv.org/abs/2503.13045
Code: github.com/gmberton/ima...
Pytorch Metric Learning Library: kevinmusgrave.github.io/pytorch-metric
All this comes from a tiny yet powerful 400-LOC codebase, thanks to the PyTorch Metric Learning Library - whose developer is co-author of this paper!
So many thanks to co-authors Kevin Musgrave and Carlo Masone!