1/30 https://arxiv.org/abs/2602.09281
2/30 https://arxiv.org/abs/2603.00782
3/30 https://arxiv.org/abs/2602.12176
4/30 https://arxiv.org/abs/2602.20159
5/30 https://arxiv.org/abs/2602.11988
6/30 https://arxiv.org/abs/2602.12670
7/30 https://arxiv.org/abs/2602.10177
8/30 https://arxiv.org/abs/2602.15763
9/30 https://arxiv.org/abs/2602.11632
10/30 https://arxiv.org/abs/2602.13964
11/30 https://arxiv.org/abs/2602.13517
12/30 https://arxiv.org/abs/2602.08222
13/30 https://arxiv.org/abs/2602.21548
14/30 https://arxiv.org/abs/2602.08354
15/30 https://arxiv.org/abs/2602.10388
16/30 https://arxiv.org/abs/2602.10693
17/30 https://arxiv.org/abs/2602.15827
18/30 https://arxiv.org/abs/2602.07274
19/30 https://arxiv.org/abs/2602.09856
20/30 https://arxiv.org/abs/2602.09877
21/30 https://arxiv.org/abs/2602.23152
22/30 https://arxiv.org/abs/2602.10604
23/30 https://arxiv.org/abs/2602.07085
24/30 https://arxiv.org/abs/2602.16800
25/30 https://arxiv.org/abs/2603.03281
26/30 https://arxiv.org/abs/2602.11358
27/30 https://arxiv.org/abs/2602.20392
28/30 https://arxiv.org/abs/2602.15171
29/30 https://arxiv.org/abs/2602.09082
30/30 https://arxiv.org/abs/2602.08794
Top 30 most popular arXiv papers in the last 30 days.
08.03.2026 00:07
Audio is indispensable for real-world video, yet generative models have largely ignored the audio component.
Current approaches to producing audio-visual content often rely on cascaded pipelines, which increase cost, accumulate errors, and degrade overall quality.
While systems such as Veo 3 and Sora 2 underscore the value of simultaneous generation, joint multimodal modeling presents its own challenges in architecture, data, and training.
Moreover, the closed-source nature of existing systems limits progress in this field.
In this work we propose MOVA (MOSS Video and Audio), an open-source model capable of generating high-quality, synchronized audio-visual content, including realistic lip-synced speech, environment-aware sound effects, and content-aligned music.
MOVA adopts a Mixture-of-Experts (MoE) architecture with 32B total parameters, of which 18B are active during inference.
It supports the IT2VA (Image-Text to Video-Audio) generation task.
By releasing the model weights and code, we aim to advance research and foster a vibrant creator community.
The released codebase provides comprehensive support for efficient inference, LoRA fine-tuning, and prompt enhancement.
2602.08794
08.03.2026 00:07
Audio is indispensable for real-world video, yet generation models have largely overlooked audio components.
Current approaches to producing audio-visual content often rely on cascaded pipelines, which increase cost, accumulate errors, and degrade overall quality.
While systems such as Veo 3 and Sora 2 emphasize the value of simultaneous generation, joint multimodal modeling introduces unique challenges in architecture, data, and training.
Moreover, the closed-source nature of existing systems limits progress in the field.
In this work, we introduce MOVA (MOSS Video and Audio), an open-source model capable of generating high-quality, synchronized audio-visual content, including realistic lip-synced speech, environment-aware sound effects, and content-aligned music.
MOVA employs a Mixture-of-Experts (MoE) architecture, with a total of 32B parameters, of which 18B are active during inference.
It supports the IT2VA (Image-Text to Video-Audio) generation task.
By releasing the model weights and code, we aim to advance research and foster a vibrant community of creators.
The released codebase features comprehensive support for efficient inference, LoRA fine-tuning, and prompt enhancement.
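The 32B-total versus 18B-active split follows from the MoE design: each token is routed to only a few experts, so only part of the weights participates in any single forward pass. Below is a minimal toy sketch of top-k expert routing to illustrate that distinction; the sizes, routing rule, and top-k value are illustrative assumptions, not MOVA's actual configuration.

```python
# Toy Mixture-of-Experts layer illustrating why "active" parameters < total parameters.
# Sizes and the top-k routing here are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyMoE(nn.Module):
    def __init__(self, d_model=64, d_ff=256, n_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])
        self.top_k = top_k

    def forward(self, x):                                   # x: (tokens, d_model)
        gate = F.softmax(self.router(x), dim=-1)            # routing probabilities
        weights, idx = gate.topk(self.top_k, dim=-1)        # each token picks top-k experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                       # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k:k+1] * expert(x[mask])
        return out

moe = ToyMoE()
y = moe(torch.randn(10, 64))                                # only top_k experts run per token
total = sum(p.numel() for p in moe.parameters())
per_expert = sum(p.numel() for p in moe.experts[0].parameters())
active = total - (len(moe.experts) - moe.top_k) * per_expert
print(f"total={total:,}  active-per-token approx {active:,}")
```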
[30/30] 154 Likes, 4 Comments, 1 Posts
2602.08794, cs.CV | cs.SD, 10 Feb 2026
MOVA: Towards Scalable and Synchronized Video-Audio Generation
SII-OpenMOSS Team, :, Donghua Yu, Mingshu Chen, Qi Chen, Qi Luo, Qianyi Wu, Qinyuan Cheng, Ruixiao Li, Tianyi Liang, Wenbo Zhang, Wenming Tu, Xiangyu ...
08.03.2026 00:07
Current multi-view indoor 3D object detectors rely on costly-to-obtain sensor geometry (i.e., precisely calibrated multi-view camera poses) to fuse multi-view information into a global scene representation, limiting deployment in real-world scenes.
We target a more practical setting: Sensor-Geometry-Free (SG-Free) multi-view indoor 3D object detection, where no sensor-provided geometric inputs (multi-view poses or depth) exist.
The recent Visual Geometry Grounded Transformer (VGGT) shows that strong 3D cues can be inferred directly from images.
Building on this insight, we present VGGT-Det, the first framework tailored to SG-Free multi-view indoor 3D object detection.
Rather than merely consuming VGGT predictions, our method integrates the VGGT encoder into a transformer-based pipeline.
To effectively exploit both the semantic and geometric priors inside VGGT, we introduce two novel key components:
(i) Attention-Guided Query Generation (AG): exploits VGGT attention maps as semantic priors to initialize object queries, improving localization by focusing on object regions while preserving the global spatial structure;
(ii) Query-Driven Feature Aggregation (QD): a learnable See-Query interacts with object queries to see what they need, then dynamically aggregates multi-level geometric features across the VGGT layers, progressively lifting 2D features into 3D.
Experiments show that VGGT-Det surpasses the best-performing method in the SG-Free setting by 4.4 and 8.6 mAP@0.25 on ScanNet and ARKitScenes, respectively.
An ablation study shows that VGGT's internally learned semantic and geometric priors are effectively leveraged by our AG and QD.
2603.00912
07.03.2026 00:17
Current multi-view indoor 3D object detectors rely on sensor geometry that is costly to obtain (i.e., precisely calibrated multi-view camera poses) to fuse multi-view information into a global scene representation, limiting deployment in real-world scenes.
We target a more practical setting: Sensor-Geometry-Free (SG-Free) multi-view indoor 3D object detection, where there are no sensor-provided geometric inputs (multi-view poses or depth).
Recent Visual Geometry Grounded Transformer (VGGT) shows that strong 3D cues can be inferred directly from images.
Building on this insight, we present VGGT-Det, the first framework tailored for SG-Free multi-view indoor 3D object detection.
Rather than merely consuming VGGT predictions, our method integrates the VGGT encoder into a transformer-based pipeline.
To effectively leverage both the semantic and geometric priors from inside VGGT, we introduce two novel key components:
(i) Attention-Guided Query Generation (AG): exploits VGGT attention maps as semantic priors to initialize object queries, improving localization by focusing on object regions while preserving global spatial structure;
(ii) Query-Driven Feature Aggregation (QD): a learnable See-Query interacts with object queries to 'see' what they need, and then dynamically aggregates multi-level geometric features across VGGT layers that progressively lift 2D features into 3D.
Experiments show that VGGT-Det significantly surpasses the best-performing method in the SG-Free setting by 4.4 and 8.6 mAP@0.25 on ScanNet and ARKitScenes, respectively.
Ablation study shows that VGGT's internally learned semantic and geometric priors can be effectively leveraged by our AG and QD.
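The attention-guided query idea in component (i) can be sketched generically: seed object queries from the image locations that receive the most encoder attention rather than from a uniform grid. The sketch below uses toy tensors and an assumed aggregation over heads; it is not the paper's AG module or the real VGGT interface.

```python
# Hypothetical sketch of attention-guided query initialization: pick the tokens with the
# highest incoming attention mass and seed object queries from the features there.
# Shapes, the head/source aggregation, and num_queries are assumptions for illustration.
import torch

def init_queries_from_attention(feats, attn, num_queries=32):
    """
    feats: (B, N, C)    patch/token features from a ViT-style encoder
    attn:  (B, H, N, N) attention maps from some encoder layer (H heads)
    returns: queries (B, num_queries, C) and the chosen token indices (B, num_queries)
    """
    # Collapse heads and source positions into a per-token saliency score.
    saliency = attn.mean(dim=1).sum(dim=1)             # (B, N): attention each token receives
    topk = saliency.topk(num_queries, dim=-1).indices  # (B, num_queries)
    queries = torch.gather(
        feats, 1, topk.unsqueeze(-1).expand(-1, -1, feats.size(-1))
    )                                                  # (B, num_queries, C)
    return queries, topk

# Toy usage with random tensors standing in for encoder outputs.
B, H, N, C = 2, 8, 196, 256
feats = torch.randn(B, N, C)
attn = torch.softmax(torch.randn(B, H, N, N), dim=-1)
q, idx = init_queries_from_attention(feats, attn)
print(q.shape, idx.shape)   # torch.Size([2, 32, 256]) torch.Size([2, 32])
```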
[5/30] 33 Likes, 3 Comments, 1 Posts
2603.00912, cs.CV, 01 Mar 2026
VGGT-Det: Mining VGGT Internal Priors for Sensor-Geometry-Free Multi-View Indoor 3D Object Detection
Yang Cao, Feize Wu, Dave Zhenyu Chen, Yingji Zhong, Lanqing Hong, Dan Xu
07.03.2026 00:16
Diffusion models have become the dominant tool for high-fidelity image and video generation, but inference speed is a critical bottleneck because Diffusion Transformers require many iterative passes.
To reduce the computational load, recent works adopt feature caching and reuse, skipping network evaluations at selected diffusion steps by using features cached at previous steps.
However, their preliminary designs rely solely on local approximation, so errors grow rapidly as skips become larger and sample quality degrades at high speedups.
In this work we propose the spectral diffusion feature forecaster (Spectrum), a training-free method that enables global, long-range feature reuse under tightly controlled error.
In particular, we view the denoiser's latent features as functions of time and approximate them with Chebyshev polynomials.
Specifically, we estimate the coefficient of each basis component via ridge regression and use these coefficients to forecast features at multiple future diffusion steps.
We theoretically show that our method exhibits more favorable long-horizon behavior and yields an error bound that does not compound with the step size.
Extensive experiments on various state-of-the-art image and video diffusion models consistently demonstrate the superiority of our method.
Notably, we achieve up to 4.79x speedup on FLUX.1 and up to 4.67x on Wan2.1-14B while maintaining much higher sample quality than the baselines.
2603.01623
07.03.2026 00:16
Diffusion models have become the dominant tool for high-fidelity image and video generation, yet are critically bottlenecked by their inference speed due to the numerous iterative passes of Diffusion Transformers.
To reduce the exhaustive compute, recent works resort to the feature caching and reusing scheme that skips network evaluations at selected diffusion steps by using cached features in previous steps.
However, their preliminary design solely relies on local approximation, causing errors to grow rapidly with large skips and leading to degraded sample quality at high speedups.
In this work, we propose spectral diffusion feature forecaster (Spectrum), a training-free approach that enables global, long-range feature reuse with tightly controlled error.
In particular, we view the latent features of the denoiser as functions over time and approximate them with Chebyshev polynomials.
Specifically, we fit the coefficient of each basis function via ridge regression, and these coefficients are then leveraged to forecast features at multiple future diffusion steps.
We theoretically reveal that our approach admits more favorable long-horizon behavior and yields an error bound that does not compound with the step size.
Extensive experiments on various state-of-the-art image and video diffusion models consistently verify the superiority of our approach.
Notably, we achieve up to 4.79$\times$ speedup on FLUX.1 and 4.67$\times$ speedup on Wan2.1-14B, while maintaining much higher sample quality compared with the baselines.
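The numerical core described above, fitting a Chebyshev expansion to recent feature values over time with ridge regression and then evaluating the polynomial at future timesteps, can be sketched in a few lines. This is a generic reconstruction from the abstract, not the released implementation; the degree, window length, and ridge strength are placeholders.

```python
# Sketch of spectral feature forecasting: fit Chebyshev coefficients to cached features
# over past diffusion steps via ridge regression, then extrapolate to future steps.
# Degree, window size, and lambda are illustrative; the paper's exact choices are not stated here.
import numpy as np

def fit_chebyshev_ridge(ts, feats, degree=3, lam=1e-3):
    """ts: (T,) timesteps in [0, 1]; feats: (T, D) cached features; returns coeffs (degree+1, D)."""
    x = 2.0 * ts - 1.0                                   # map to the Chebyshev domain [-1, 1]
    Phi = np.polynomial.chebyshev.chebvander(x, degree)  # (T, degree+1) basis matrix
    A = Phi.T @ Phi + lam * np.eye(degree + 1)           # ridge-regularized normal equations
    return np.linalg.solve(A, Phi.T @ feats)             # (degree+1, D)

def forecast(coeffs, t_future):
    """Evaluate the fitted expansion at future timesteps t_future (values in [0, 1])."""
    x = 2.0 * np.asarray(t_future) - 1.0
    Phi = np.polynomial.chebyshev.chebvander(x, coeffs.shape[0] - 1)
    return Phi @ coeffs                                  # (K, D) forecasted features

# Toy usage: 6 cached steps of a 4-dim feature trajectory, forecast the next 2 steps.
ts = np.linspace(0.0, 0.5, 6)
feats = np.stack([np.sin(3 * ts + i) for i in range(4)], axis=1)
coeffs = fit_chebyshev_ridge(ts, feats)
print(forecast(coeffs, [0.6, 0.7]).shape)                # (2, 4)
```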
[6/30] 32 Likes, 5 Comments, 1 Posts
2603.01623, cs.CV | cs.LG, 02 Mar 2026
Adaptive Spectral Feature Forecasting for Diffusion Sampling Acceleration
Jiaqi Han, Juntong Shi, Puheng Li, Haotian Ye, Qiushan Guo, Stefano Ermon
07.03.2026 00:16
While music generation models have evolved to handle complex multimodal inputs combining text, lyrics, and reference audio, evaluation mechanisms have lagged behind.
In this paper, we bridge this critical gap by establishing a comprehensive ecosystem for music reward modeling under Compositional Multimodal Instruction (CMI), where the generated music may be conditioned on text descriptions, lyrics, and audio prompts.
We first introduce CMI-Pref-Pseudo, a large-scale preference dataset of 110k pseudo-labeled samples, and CMI-Pref, a high-quality human-annotated corpus tailored to fine-grained alignment tasks.
To unify the evaluation landscape, we propose CMI-RewardBench, a unified benchmark that evaluates music reward models on heterogeneous samples across musicality, text-music alignment, and compositional instruction alignment.
Leveraging these resources, we develop CMI reward models (CMI-RMs), a parameter-efficient family of reward models capable of processing heterogeneous inputs.
We evaluate their correlation with human judgment scores on musicality and alignment on CMI-Pref, along with existing datasets.
Further experiments demonstrate that CMI-RM not only correlates strongly with human judgments but also enables effective inference-time scaling via top-k filtering.
The necessary training data, benchmarks, and reward models are publicly available.
2603.00610
07.03.2026 00:16
While music generation models have evolved to handle complex multimodal inputs mixing text, lyrics, and reference audio, evaluation mechanisms have lagged behind.
In this paper, we bridge this critical gap by establishing a comprehensive ecosystem for music reward modeling under Compositional Multimodal Instruction (CMI), where the generated music may be conditioned on text descriptions, lyrics, and audio prompts.
We first introduce CMI-Pref-Pseudo, a large-scale preference dataset comprising 110k pseudo-labeled samples, and CMI-Pref, a high-quality, human-annotated corpus tailored for fine-grained alignment tasks.
To unify the evaluation landscape, we propose CMI-RewardBench, a unified benchmark that evaluates music reward models on heterogeneous samples across musicality, text-music alignment, and compositional instruction alignment.
Leveraging these resources, we develop CMI reward models (CMI-RMs), a parameter-efficient reward model family capable of processing heterogeneous inputs.
We evaluate their correlation with human judgment scores on musicality and alignment on CMI-Pref, along with previous datasets.
Further experiments demonstrate that CMI-RM not only correlates strongly with human judgments, but also enables effective inference-time scaling via top-k filtering.
The necessary training data, benchmarks, and reward models are publicly available.
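Inference-time scaling via top-k filtering, mentioned at the end of the abstract, amounts to best-of-N selection: generate several candidates, score each with the reward model, and keep the highest-scoring ones. A minimal hedged sketch follows; the `generate` and `reward_model` callables are placeholders, not the released CMI-RM interface.

```python
# Hypothetical best-of-N / top-k filtering with a reward model.
# `generate` and `reward_model` are stand-ins; the actual CMI-RM API is not shown in the abstract.
import heapq
import random

def generate(instruction: str, seed: int) -> str:
    # Placeholder music generator: returns a fake sample identifier.
    random.seed(seed)
    return f"sample-{seed}-{random.randint(0, 9999)}"

def reward_model(instruction: str, sample: str) -> float:
    # Placeholder scalar reward (e.g., musicality + alignment); replace with a real RM.
    return random.random()

def top_k_filter(instruction: str, n_candidates: int = 16, k: int = 2):
    """Generate n candidates, score them, and return the k best (score, sample) pairs."""
    scored = [(reward_model(instruction, s), s)
              for s in (generate(instruction, i) for i in range(n_candidates))]
    return heapq.nlargest(k, scored)

print(top_k_filter("upbeat piano piece matching the given lyrics"))
```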
[8/30] 32 Likes, 2 Comments, 1 Posts
2603.00610, cs.SD | cs.AI | cs.LG | cs.MM | eess.AS, 04 Mar 2026
CMI-RewardBench: Evaluating Music Reward Models with Compositional Multimodal Instruction
Yinghao Ma, Haiwen Xia, Hewei Gao, Weixiong Chen, Yuxin Ye, Yuchen Yang, Sungkyun Chang, Mingshu...
07.03.2026 00:16
Recent advances in Generative Reward Models (GRMs) have demonstrated that scaling the length of Chain-of-Thought (CoT) reasoning substantially improves evaluation reliability.
However, current work relies mainly on unstructured length scaling and ignores the divergent efficacy of different reasoning mechanisms: Breadth-CoT (B-CoT, i.e., multi-dimensional principle coverage) and Depth-CoT (D-CoT, i.e., substantive judgment soundness).
To address this, we introduce Mix-GRM, a framework that reconfigures raw rationales into structured B-CoT and D-CoT through a modular synthesis pipeline, then applies Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR) to internalize and optimize these mechanisms.
Comprehensive experiments demonstrate that Mix-GRM establishes a new state of the art across five benchmarks, surpassing leading open-source RMs by 8.2% on average.
Our results reveal a clear divergence in reasoning: B-CoT benefits subjective preference tasks, whereas D-CoT excels at objective correctness tasks.
Consequently, misaligning the reasoning mechanism with the task directly degrades performance.
Furthermore, we demonstrate that RLVR acts as a switching amplifier, inducing an emergent polarization in which the model spontaneously allocates its reasoning style to match task demands.
The synthesized data and models are released on \href{https://huggingface.co/collections/DonJoey/mix-grm}{Hugging Face}, and the code is released on \href{https://github.com/Don-Joey/Mix-GRM}{Github}.
2603.01571
07.03.2026 00:16
Recent advancements in Generative Reward Models (GRMs) have demonstrated that scaling the length of Chain-of-Thought (CoT) reasoning considerably enhances the reliability of evaluation.
However, current works predominantly rely on unstructured length scaling, ignoring the divergent efficacy of different reasoning mechanisms: Breadth-CoT (B-CoT, i.e., multi-dimensional principle coverage) and Depth-CoT (D-CoT, i.e., substantive judgment soundness).
To address this, we introduce Mix-GRM, a framework that reconfigures raw rationales into structured B-CoT and D-CoT through a modular synthesis pipeline, subsequently employing Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR) to internalize and optimize these mechanisms.
Comprehensive experiments demonstrate that Mix-GRM establishes a new state-of-the-art across five benchmarks, surpassing leading open-source RMs by an average of 8.2\%.
Our results reveal a clear divergence in reasoning: B-CoT benefits subjective preference tasks, whereas D-CoT excels in objective correctness tasks.
Consequently, misaligning the reasoning mechanism with the task directly degrades performance.
Furthermore, we demonstrate that RLVR acts as a switching amplifier, inducing an emergent polarization where the model spontaneously allocates its reasoning style to match task demands.
The synthesized data and models are released at \href{https://huggingface.co/collections/DonJoey/mix-grm}{Hugging Face}, and the code is released at \href{https://github.com/Don-Joey/Mix-GRM}{Github}.
[9/30] 32 Likes, 2 Comments, 1 Posts
2603.01571, cs.AI, 02 Mar 2026
Beyond Length Scaling: Synergizing Breadth and Depth for Generative Reward Models
Qiyuan Zhang, Yufei Wang, Tianhe Wu, Can Xu, Qingfeng Sun, Kai Zheng, Xue Liu, Chen Ma
07.03.2026 00:16
We present Nanbeige4.1-3B, a unified generalist language model that simultaneously achieves strong agentic behavior, code generation, and general reasoning with only 3B parameters.
To the best of our knowledge, it is the first open-source small language model (SLM) to achieve such versatility in a single model.
To improve reasoning and preference alignment, we combine point-wise and pair-wise reward modeling, ensuring high-quality, human-aligned responses.
For code generation, we design complexity-aware reward functions in reinforcement learning, optimizing both correctness and efficiency.
For deep search, we perform complex data synthesis and incorporate turn-level supervision during training.
This enables stable long-horizon tool interactions, allowing Nanbeige4.1-3B to reliably execute up to 600 tool-call turns for complex problem solving.
Extensive experimental results show that Nanbeige4.1-3B significantly outperforms prior models of similar scale, such as Nanbeige4-3B-2511 and Qwen3-4B, and even achieves superior performance compared with much larger models such as Qwen3-30B-A3B.
Our results show that small models can achieve both broad competence and strong specialization simultaneously, redefining the potential of 3B-parameter models.
2602.13367
07.03.2026 00:16
We present Nanbeige4.1-3B, a unified generalist language model that simultaneously achieves strong agentic behavior, code generation, and general reasoning with only 3B parameters.
To the best of our knowledge, it is the first open-source small language model (SLM) to achieve such versatility in a single model.
To improve reasoning and preference alignment, we combine point-wise and pair-wise reward modeling, ensuring high-quality, human-aligned responses.
For code generation, we design complexity-aware rewards in Reinforcement Learning, optimizing both correctness and efficiency.
In deep search, we perform complex data synthesis and incorporate turn-level supervision during training.
This enables stable long-horizon tool interactions, allowing Nanbeige4.1-3B to reliably execute up to 600 tool-call turns for complex problem-solving.
Extensive experimental results show that Nanbeige4.1-3B significantly outperforms prior models of similar scale, such as Nanbeige4-3B-2511 and Qwen3-4B, even achieving superior performance compared to much larger models, such as Qwen3-30B-A3B.
Our results demonstrate that small models can achieve both broad competence and strong specialization simultaneously, redefining the potential of 3B parameter models.
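The point-wise plus pair-wise reward-modeling combination mentioned in the abstract can be sketched as two losses on a single scoring head: a classification-style loss on absolute quality labels and a Bradley-Terry loss on preference pairs. The weighting and head design below are assumptions for illustration, not Nanbeige's actual recipe.

```python
# Hedged sketch: combining a point-wise loss (absolute quality label per response) with a
# pair-wise Bradley-Terry loss (chosen vs. rejected) on one scalar reward head.
# The 0.5/0.5 weighting and the toy scorer are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

reward_head = nn.Sequential(nn.Linear(128, 64), nn.Tanh(), nn.Linear(64, 1))

def combined_reward_loss(h_point, y_point, h_chosen, h_rejected, w_point=0.5, w_pair=0.5):
    """
    h_point:   (Np, 128) pooled hidden states with absolute labels y_point in [0, 1]
    h_chosen / h_rejected: (Nq, 128) pooled hidden states of preferred / dispreferred responses
    """
    s_point = reward_head(h_point).squeeze(-1)
    point_loss = F.binary_cross_entropy_with_logits(s_point, y_point)  # point-wise term
    margin = reward_head(h_chosen) - reward_head(h_rejected)           # pair-wise margin
    pair_loss = -F.logsigmoid(margin).mean()                           # Bradley-Terry term
    return w_point * point_loss + w_pair * pair_loss

loss = combined_reward_loss(
    torch.randn(4, 128), torch.tensor([1.0, 0.0, 1.0, 0.5]),
    torch.randn(3, 128), torch.randn(3, 128),
)
loss.backward()
print(float(loss))
```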
[10/30] 31 Likes, 3 Comments, 1 Posts
2602.13367, cs.AI | cs.CL, 13 Feb 2026
Nanbeige4.1-3B: A Small General Model that Reasons, Aligns, and Acts
Chen Yang, Guangyue Peng, Jiaying Zhu, Ran Le, Ruixiang Feng, Tao Zhang, Xiyun Xu, Yang Song, Yiming Jia, Yuntao Wen, Yunzhi Xu, Zekai Wang, Zhen...
07.03.2026 00:16
Extended reality (XR) requires generative models that respond to users' tracked real-world motion; however, current video world models accept only coarse control signals such as text or keyboard input, limiting their applicability to embodied interaction.
We propose a human-centric video world model conditioned on both tracked head pose and joint-level hand poses.
To this end, we evaluate existing diffusion-transformer conditioning strategies and propose an effective mechanism for 3D head and hand control, enabling dexterous hand-object interactions.
Using this strategy, we train a bidirectional video diffusion model as a teacher and distill it into a causal, interactive system that generates egocentric virtual environments.
Evaluating this generated-reality system with human subjects, we demonstrate improved task performance and a significantly higher perceived sense of control over the performed actions compared with relevant baselines.
2602.18422
07.03.2026 00:16
Extended reality (XR) demands generative models that respond to users' tracked real-world motion, yet current video world models accept only coarse control signals such as text or keyboard input, limiting their utility for embodied interaction.
We introduce a human-centric video world model that is conditioned on both tracked head pose and joint-level hand poses.
For this purpose, we evaluate existing diffusion transformer conditioning strategies and propose an effective mechanism for 3D head and hand control, enabling dexterous hand--object interactions.
We train a bidirectional video diffusion model teacher using this strategy and distill it into a causal, interactive system that generates egocentric virtual environments.
We evaluate this generated reality system with human subjects and demonstrate improved task performance as well as a significantly higher level of perceived amount of control over the performed actions compared with relevant baselines.
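One way to picture the conditioning described above is to embed the tracked head pose and joint-level hand poses into extra tokens that a diffusion transformer can attend to. The sketch below is a hedged illustration only: the dimensions, joint count, and the choice to expose poses as tokens (rather than, say, adaLN modulation) are assumptions, not the paper's mechanism.

```python
# Hedged sketch: packing tracked head pose and joint-level hand poses into conditioning
# tokens for a video diffusion transformer. All sizes and the token-based interface are
# assumptions for illustration.
import torch
import torch.nn as nn

class PoseConditioner(nn.Module):
    def __init__(self, d_model=512, n_hand_joints=21):
        super().__init__()
        self.head_proj = nn.Linear(7, d_model)                      # position (3) + quaternion (4)
        self.hand_proj = nn.Linear(2 * n_hand_joints * 3, d_model)  # two hands, 3D joint positions

    def forward(self, head_pose, hand_joints):
        """
        head_pose:   (B, T, 7)        per-frame head position + orientation
        hand_joints: (B, T, 2, J, 3)  per-frame joint positions for both hands
        returns conditioning tokens of shape (B, T, 2, d_model)
        """
        B, T = head_pose.shape[:2]
        head_tok = self.head_proj(head_pose)                        # (B, T, d_model)
        hand_tok = self.hand_proj(hand_joints.reshape(B, T, -1))    # (B, T, d_model)
        return torch.stack([head_tok, hand_tok], dim=2)             # appended to the DiT sequence

cond = PoseConditioner()(torch.randn(2, 8, 7), torch.randn(2, 8, 2, 21, 3))
print(cond.shape)   # torch.Size([2, 8, 2, 512])
```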
[11/30] 30 Likes, 5 Comments, 1 Posts
2602.18422, cs.CV, 20 Feb 2026
Generated Reality: Human-centric World Simulation using Interactive Video Generation with Hand and Camera Control
Linxi Xie, Lisong C. Sun, Ashley Neall, Tong Wu, Shengqu Cai, Gordon Wetzstein
07.03.2026 00:16
Internal world modeling, that is, predicting the transition from a past state $X$ to the next state $Y$ under an action $Z$, is essential for reasoning and planning in LLMs and VLMs.
However, training such models usually requires costly action-labeled trajectories.
We propose SWIRL, a self-improving framework that learns from state-only sequences by treating the action as a latent variable and alternating between forward world modeling (FWM), $P_\theta(Y|X,Z)$, and inverse dynamics modeling (IDM), $Q_\phi(Z|X,Y)$.
SWIRL iterates two phases: (1) Variational Information Maximisation updates the FWM to generate the next state while maximizing the conditional mutual information between the prior state and the latent action, promoting identifiable consistency; (2) ELBO Maximisation updates the IDM to explain the observed transitions, effectively performing coordinate ascent.
Both models are trained with reinforcement learning (specifically GRPO), using the log-likelihood of the opposite, frozen model as the reward signal.
We provide theoretical learnability guarantees for both updates and evaluate SWIRL with LLMs and VLMs across multiple environments (single- and multi-turn open-world, visual dynamics, and synthetic testbeds for physics, the web, and tool calling).
SWIRL yields gains of 16% on AURORABench, 28% on ByteMorph, 16% on WorldPredictionBench, and 14% on StableToolBench.
2602.06130
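A hedged, pseudocode-style reading of the alternating loop described above: the forward world model $P_\theta(Y|X,Z)$ and the inverse dynamics model $Q_\phi(Z|X,Y)$ take turns, each rewarded by the frozen counterpart's log-likelihood (the paper uses GRPO; a generic policy-gradient placeholder stands in here). All interfaces and the exact reward wiring in each phase are assumptions, not the released method.

```python
# Hedged sketch of SWIRL-style alternating training on state-only transitions.
# `sample`, `log_prob`, and `policy_gradient_step` are placeholder interfaces assumed
# for illustration; the paper trains both models with GRPO.
from typing import List, Tuple

class LatentActionModel:
    """Wraps an LLM/VLM used either as FWM P_theta(Y|X,Z) or IDM Q_phi(Z|X,Y)."""
    def sample(self, *conditioning) -> Tuple[str, float]:
        raise NotImplementedError   # returns (output, log_prob_of_output)
    def log_prob(self, conditioning, output) -> float:
        raise NotImplementedError

def policy_gradient_step(model: LatentActionModel, log_prob: float, reward: float) -> None:
    """Placeholder for a GRPO/REINFORCE-style update that reinforces high-reward samples."""
    ...

def swirl_round(fwm: LatentActionModel, idm: LatentActionModel,
                transitions: List[Tuple[str, str]]) -> None:
    """One round over state-only transitions (X, Y); the action Z stays latent."""
    # Phase 1 (variational information maximisation): update the FWM so the latent
    # action remains recoverable from the next state it generates (frozen IDM scores it).
    for X, Y in transitions:
        Z, _ = idm.sample(X, Y)                 # propose a latent action (IDM frozen)
        Y_hat, lp_y = fwm.sample(X, Z)          # FWM rolls the world forward
        reward = idm.log_prob((X, Y_hat), Z)    # frozen IDM's log-likelihood as reward
        policy_gradient_step(fwm, lp_y, reward)
    # Phase 2 (ELBO maximisation): update the IDM to explain the observed transition,
    # scored by the frozen FWM; together the phases act as coordinate ascent.
    for X, Y in transitions:
        Z, lp_z = idm.sample(X, Y)              # infer the latent action
        reward = fwm.log_prob((X, Z), Y)        # frozen FWM's log-likelihood as reward
        policy_gradient_step(idm, lp_z, reward)
```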
07.03.2026 00:16