Even the hottest multimodal models stumble—capped at 50% on simple visual entity tasks. What does this reveal about current vision‑language gaps? Dive into the benchmarks and see why AI still has a long way to go. #MultimodalLearning #VisionLanguage #AIPerformance
🔗 aidailypost.com/news/top-mul...
New research shows how to fool CLIP‑style vision‑language models with fresh adversarial tricks. Could this expose hidden AI security gaps? Dive into the latest evasion techniques and what they mean for multimodal ML. #AdversarialAttacks #VisionLanguage #AIsecurity
🔗 aidailypost.com/news/researc...
Beyond Memorization: A Multi-Modal Ordinal Regression Benchmark to Expose Popularity Bias in Vision-Language Models
Chia-Jui Chang, He Syu et al.
#VisionLanguage #OrdinalRegression #BiasBenchmark
Just saw an open‑source OCR model, PaddleOCR‑VL (built on the compact ERNIE‑4.5‑0.3B), hit 82.4 on olmOCR‑bench: it handles equations, tables, and multilingual docs, and scales like a champ. Dive into the details! #OCR #olmOCRbench #VisionLanguage
🔗 aidailypost.com/news/open-so...
Black Forest Labs just dropped Flux 2, packing the new Mistral‑3 24B vision‑language model with a hybrid Rectified Flow Transformer + VAE encoder. The BFL API makes it super easy to experiment—check out the details! #Flux2 #Mistral324B #VisionLanguage
🔗 aidailypost.com/news/black-f...
#VisionLanguage models are increasingly used for a wide range of problems, but seem complex to build. I wrote some code and recorded a tutorial in my lab yesterday to help others demystify how to create these models. #keepbuilding
EMM1 evaluates how AI understands images and text together. It highlights where models excel and where they fall short, helping build more reliable multimodal systems.
#AI #Data #VisionLanguage
encord.com/multimodal-d...
Back from the break with Phillip Isola @phillipisola.bsky.social on
“On the Perceptual Distance Between Images and Text.”
A fascinating and interactive look at how models (and humans!) measure similarity 👏🏻
#HiCV2025 #ICCV2025 #VisionLanguage
Training-Free Explainable Vision-Language Model for Medical Imaging
A training-free, explainable vision-language model for medical imaging has been announced. Read more: getnews.me/training-free-explainabl... #medimaging #visionlanguage #explainable
Probabilistic Language-Image Pre-Training Boosts Vision-Language Models
A new probabilistic language-image pre-training approach is reported to boost performance of vision-language models. Read more: getnews.me/probabilistic-language-i... #visionlanguage #pretraining #ai
Cross-modal Backward-Compatible Learning for Vision-Language Models
A new study introduces cross-modal backward-compatible learning for vision-language models. Read more: getnews.me/cross-modal-backward-com... #visionlanguage #crossmodal #machinelearning
Vision-Language Models Boost Efficiency of Indoor Robot Navigation
Vision‑language models guide indoor robot navigation, selecting subgoals that reduce path length by about 10% in simulation, working zero‑shot with the DYNUS planner. Read more: getnews.me/vision-language-models-b... #visionlanguage #robotics
Zero-Shot Fine-Grained Classification with Vision-Language Models
The study reframes zero‑shot classification as Q&A and adds an attention intervention, boosting top‑1 accuracy on bird, flower, and vehicle benchmarks. Code is on GitHub. Read more: getnews.me/zero-shot-fine-grained-c... #visionlanguage #zeroshot
Spatial‑ViLT Improves 3D Spatial Reasoning with Multi‑Task Learning
Spatial‑ViLT adds depth maps, 3D coordinate grids and edge maps to vision‑language models, achieving top results on the Visual Spatial Reasoning benchmark. Read more: getnews.me/spatial-vilt-improves-3d... #spatialvilt #visionlanguage
Large Vision‑Language Models Boost Carotid Plaque Risk Prediction
Fine‑tuned LLaVA‑NeXT‑Vicuna with LoRA boosted specificity and balanced accuracy in carotid plaque stroke‑risk prediction, especially when paired with patient data. 3 Oct 2025. getnews.me/large-vision-language-mo... #visionlanguage #carotid #stroke
MaskCD Cuts Hallucinations in Vision‑Language Models
MaskCD, a new contrastive decoding method that masks the image head, cuts hallucination rates in LVLMs like LLaVA‑1.5‑7B and Qwen‑VL‑7B without hurting overall performance. Read more: getnews.me/maskcd-cuts-hallucinatio... #maskcd #lvlm #visionlanguage
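The core idea behind contrastive decoding of this kind can be sketched in a few lines: run a normal pass and a degraded pass (for MaskCD, with image-attending heads masked), then penalize tokens the degraded pass also favors. A minimal toy sketch follows; the `alpha`/`beta` values and the 4-token vocabulary are illustrative, not MaskCD's actual hyperparameters.

```python
import math

def contrastive_decode(logits_full, logits_masked, alpha=1.0, beta=0.1):
    """Combine logits from the full pass and a degraded pass,
    down-weighting tokens the degraded pass also favours.
    Generic contrastive-decoding sketch, not MaskCD's exact rule."""
    max_logit = max(logits_full)
    scores = []
    for lf, lm in zip(logits_full, logits_masked):
        # Adaptive plausibility cutoff: prune tokens far below the best one
        if lf < max_logit + math.log(beta):
            scores.append(float("-inf"))
        else:
            scores.append((1 + alpha) * lf - alpha * lm)
    return scores

# Toy 4-token vocabulary; the degraded pass over-favours token 0
full = [2.0, 1.9, 0.5, -1.0]
masked = [2.5, 0.2, 0.4, -1.0]
out = contrastive_decode(full, masked)
best = out.index(max(out))  # token 1 wins: strong in full, weak in masked
```

The intuition: a token that stays likely even when the model cannot "see" the image is a hallucination candidate, so contrasting the two distributions suppresses it.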
Explainability Shows Limits of Vision‑Language Models on Rebus Puzzles
A study of 221 rebus puzzles shows vision‑language models excel at visual composition but falter on missing elements and cultural symbols. The paper was submitted on 3 Oct 2025. getnews.me/explainability-shows-lim... #visionlanguage #rebuspuzzles
AdaRD-Key Boosts Query-Driven Frame Selection for Long-Form Video AI
AdaRD‑Key selects query‑relevant, diverse keyframes in real time on a single GPU, achieving state‑of‑the‑art results on LongVideoBench and Video‑MME. getnews.me/adard-key-boosts-query-d... #adardkey #visionlanguage
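Query-relevant, diverse selection is typically a greedy trade-off between relevance to the query and redundancy with already-picked frames. Here is a generic MMR-style sketch under that assumption (AdaRD-Key's actual scoring and real-time machinery are in the paper; the 2-D embeddings are toy data):

```python
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def select_keyframes(frame_embs, query_emb, k=2, lam=0.5):
    """Greedy selection balancing query relevance against redundancy
    with frames already chosen (MMR-style sketch)."""
    selected, candidates = [], list(range(len(frame_embs)))
    while candidates and len(selected) < k:
        def score(i):
            rel = cosine(frame_embs[i], query_emb)
            red = max((cosine(frame_embs[i], frame_embs[j])
                       for j in selected), default=0.0)
            return lam * rel - (1 - lam) * red
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Frames 0 and 1 are near-duplicates; frame 2 is different
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
query = [1.0, 0.0]
picks = select_keyframes(frames, query, k=2, lam=0.3)  # diversity-weighted
```

With diversity weighted heavily (`lam=0.3`), the near-duplicate frame 1 loses to the dissimilar frame 2, which is exactly the behavior that keeps long-video context windows from filling with redundant frames.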
AGILE boosts visual perception and reasoning in Vision‑Language Models
The AGILE framework raised 2x2 jigsaw accuracy from 9.5% to 82.8% and added roughly 3% average gain across nine vision tasks, according to the authors. Read more: getnews.me/agile-boosts-visual-perc... #visionlanguage #agile #multimodal
AgenticIQA: Adaptive, Interpretable Image Quality Assessment Framework
AgenticIQA uses a planner‑executor‑summarizer workflow and ships with AgenticIQA‑200K, a 200,000‑example dataset. It beats strong baselines on Pearson and Spearman correlation. getnews.me/agenticiqa-adaptive-inte... #agenticiqa #imagequality #visionlanguage
Vision-Language Process Reward Models Enhance Test-Time Scaling
Hybrid pipeline merging Monte Carlo Tree Search with a strong vision‑language model makes reliable step‑level labels, boosting benchmarks like MMMU and MathVista. Read more: getnews.me/vision-language-process-... #multimodal #visionlanguage
TDBench Launches Rotational Benchmark for Top‑Down Vision Models
TDBench offers a benchmark for top‑down vision‑language models with 2,000 questions for each of four rotational views. The dataset and code are available on GitHub. Read more: getnews.me/tdbench-launches-rotatio... #tdbench #visionlanguage
Visual Self-Refinement Boosts Autoregressive Vision‑Language Models
A plug‑and‑play visual self‑refinement module refines token sequences after generation, improving coherence of vision‑language models. Accepted at EMNLP 2025. Read more: getnews.me/visual-self-refinement-b... #visionlanguage #selfrefinement
MULTI‑TAP: Multi‑Objective Predictor for Image‑Text Alignment
MULTI‑TAP adds a lightweight ridge‑regression layer to frozen LVLMs, staying under a 7‑8 B‑parameter size while matching GPT‑4o‑based predictors. Read more: getnews.me/multi-tap-multi-objectiv... #multitap #visionlanguage
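A lightweight ridge head on top of frozen embeddings is a cheap, well-understood recipe. A minimal sketch of that pattern, with toy "frozen LVLM embeddings" and toy alignment scores standing in for the real features (MULTI-TAP's actual feature extraction and objectives are in the paper):

```python
def ridge_fit(X, y, lam=0.1, lr=0.1, steps=500):
    """Fit a ridge-regression head on frozen features by gradient descent.
    (A closed-form solve works too; GD keeps this dependency-free.)"""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(steps):
        grad = [lam * wj for wj in w]          # L2 penalty gradient
        for xi, yi in zip(X, y):
            err = sum(wj * xj for wj, xj in zip(w, xi)) - yi
            for j in range(d):
                grad[j] += err * xi[j] / n     # squared-error gradient
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w

# Hypothetical frozen embeddings and human alignment scores (toy data)
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 0.0, 1.0]
w = ridge_fit(X, y)
pred = sum(wj * xj for wj, xj in zip(w, [1.0, 0.0]))  # ~0.84 after shrinkage
```

The appeal of this design is that only the tiny head is trained: the expensive LVLM stays frozen, which is how the predictor stays under the 7-8B-parameter budget mentioned above.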
Adaptive Event Slicing Boosts Open‑Vocabulary Detection
A hybrid SNN‑CNN framework adaptively slices event streams for open‑vocabulary object detection with CLIP; the paper was submitted in October 2025. Read more: getnews.me/adaptive-event-slicing-b... #eventcameras #visionlanguage
GUI-KV Improves Efficiency of Vision‑Language GUI Agents
GUI‑KV, a KV cache compression for vision‑language GUI agents, cuts decoding FLOPs by 38.9% and boosts step‑wise accuracy by 4.1% on the AgentNetBench 5‑screenshot benchmark. Read more: getnews.me/gui-kv-improves-efficien... #guikv #visionlanguage
MathSticks: Visual Symbolic Reasoning Benchmark Using Matchsticks
MathSticks offers ~1.4 million matchstick puzzles where fixing an equation requires moving one or two sticks. Humans score 90%, while vision‑language models lag. Read more: getnews.me/mathsticks-visual-symbol... #mathsticks #benchmark #visionlanguage
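To see why these puzzles are hard for pattern-matchers, note that any legal move preserves the total stick count (each digit and operator has a fixed seven-segment cost). A sketch of that invariant as a candidate filter; it checks count preservation only, not the actual segment geometry of which stick moves where:

```python
# Seven-segment stick costs for digits, plus operator stick counts
STICKS = {'0': 6, '1': 2, '2': 5, '3': 5, '4': 4, '5': 5,
          '6': 6, '7': 3, '8': 7, '9': 6, '+': 2, '-': 1, '=': 2}

def stick_count(eq):
    return sum(STICKS[c] for c in eq)

def candidate_fixes(eq):
    """Enumerate true single-digit equations with the same total stick
    count as the broken input. Count preservation is necessary for a
    one-stick move, but full geometry is not modeled here."""
    target = stick_count(eq)
    fixes = []
    for a in range(10):
        for b in range(10):
            for op in '+-':
                c = a + b if op == '+' else a - b
                if 0 <= c <= 9:
                    cand = f"{a}{op}{b}={c}"
                    if stick_count(cand) == target and cand != eq:
                        fixes.append(cand)
    return fixes

fixes = candidate_fixes("6+4=4")  # broken equation from a classic puzzle
```

For "6+4=4" the filter surfaces "0+4=4" (move the 6's bottom-left stick up) and "8-4=4" (move a stick from the plus onto the 6), both genuine one-move solutions; a model has to reason over the visual stick layout, not just the symbols, to find them.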
TRIPS Enhances Vision‑Language Pre‑Training via Text Patch Selection
TRIPS selects text‑relevant image patches for vision‑language models, cutting training time by 40% with no loss in accuracy and no extra parameters; presented at EMNLP 2022. Read more: getnews.me/trips-enhances-vision-la... #trips #visionlanguage
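The mechanism is easy to picture: score each image patch against the text, keep the top-scoring ones, and fuse the rest into a single token so no information is discarded outright. A toy sketch under those assumptions (the real model scores patches with cross-attention inside the encoder; these 2-D embeddings are illustrative):

```python
def select_patches(patch_embs, text_emb, keep_ratio=0.5):
    """Keep the patches most similar to the text embedding and mean-pool
    the rest into one fused token (TRIPS-style sketch)."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    scores = [dot(p, text_emb) for p in patch_embs]
    k = max(1, int(len(patch_embs) * keep_ratio))
    keep = sorted(range(len(patch_embs)),
                  key=lambda i: scores[i], reverse=True)[:k]
    dropped = [p for i, p in enumerate(patch_embs) if i not in keep]
    fused = ([sum(col) / len(dropped) for col in zip(*dropped)]
             if dropped else None)
    return sorted(keep), fused

# Patches 0 and 2 align with the text; 1 and 3 get pooled away
patches = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.2], [0.1, 0.9]]
text = [1.0, 0.0]
keep, fused = select_patches(patches, text, keep_ratio=0.5)
```

Halving the patch sequence is where the reported 40% training-time saving comes from: attention cost scales with sequence length, and the fused token preserves a summary of what was pruned.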
Dual Active Learning Multimodal Model Boosts Source-Free Domain Adaptation
The Dual Active Learning (DAM) framework combines targets from a vision‑language model with a small set of human labels, achieving state‑of‑the‑art results on SFADA benchmarks. Read more: getnews.me/dual-active-learning-mul... #sfada #visionlanguage
Geometry-Based Fine-Tuning Boosts Spatial Skills in Vision-Language Models
Fine‑tuning on Euclid30K (~30 k geometry problems) raised VSI‑Bench accuracy from 34.5% to 40.5% in zero‑shot tests and gave RoboBrain2.0‑Euclid‑7B a 49.6% score. Read more: getnews.me/geometry-based-fine-tuni... #visionlanguage #spatialai