Paper (@paper) — bluesky.baby

1/30 https://arxiv.org/abs/2602.09281
2/30 https://arxiv.org/abs/2603.00782
3/30 https://arxiv.org/abs/2602.12176
4/30 https://arxiv.org/abs/2602.20159
5/30 https://arxiv.org/abs/2602.11988
6/30 https://arxiv.org/abs/2602.12670
7/30 https://arxiv.org/abs/2602.10177
8/30 https://arxiv.org/abs/2602.15763
9/30 https://arxiv.org/abs/2602.11632
10/30 https://arxiv.org/abs/2602.13964
11/30 https://arxiv.org/abs/2602.13517
12/30 https://arxiv.org/abs/2602.08222
13/30 https://arxiv.org/abs/2602.21548
14/30 https://arxiv.org/abs/2602.08354
15/30 https://arxiv.org/abs/2602.10388
16/30 https://arxiv.org/abs/2602.10693
17/30 https://arxiv.org/abs/2602.15827
18/30 https://arxiv.org/abs/2602.07274
19/30 https://arxiv.org/abs/2602.09856
20/30 https://arxiv.org/abs/2602.09877
21/30 https://arxiv.org/abs/2602.23152
22/30 https://arxiv.org/abs/2602.10604
23/30 https://arxiv.org/abs/2602.07085
24/30 https://arxiv.org/abs/2602.16800
25/30 https://arxiv.org/abs/2603.03281
26/30 https://arxiv.org/abs/2602.11358
27/30 https://arxiv.org/abs/2602.20392
28/30 https://arxiv.org/abs/2602.15171
29/30 https://arxiv.org/abs/2602.09082
30/30 https://arxiv.org/abs/2602.08794

Top 30 most popular arXiv papers in the last 30 days.

08.03.2026 00:07 👍 0 🔁 0 💬 0 📌 0

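Audio is indispensable to real-world video, yet generative models have largely ignored the audio component. Current approaches to producing audio-visual content often rely on cascaded pipelines, which increase cost, accumulate errors, and degrade overall quality. While systems such as Veo 3 and Sora 2 highlight the value of simultaneous generation, joint multimodal modeling poses its own challenges in architecture, data, and training. Moreover, the closed-source nature of existing systems limits progress in the field. In this work, we propose MOVA (MOSS Video and Audio), an open-source model that generates high-quality, synchronized audio-visual content, including realistic lip-synced speech, environment-aware sound effects, and content-aligned music. MOVA adopts a Mixture-of-Experts (MoE) architecture with 32B total parameters, of which 18B are active during inference. It supports the IT2VA (Image-Text to Video-Audio) generation task. By releasing the model weights and code, we aim to advance research and foster a vibrant community of creators. The released codebase offers comprehensive support for efficient inference, LoRA fine-tuning, and prompt enhancement.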

2602.08794
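Audio is indispensable to real-world video, yet generative models have largely ignored the audio component. Current approaches to producing audio-visual content often rely on cascaded pipelines, which increase cost, accumulate errors, and degrade overall quality. While systems such as Veo 3 and Sora 2 highlight the value of simultaneous generation...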

08.03.2026 00:07 👍 0 🔁 0 💬 0 📌 0

MOVA: Towards Scalable and Synchronized Video-Audio Generation Audio is indispensable for real-world video, yet generation models have largely overlooked audio components. Current approaches to producing audio-visual content often rely on cascaded pipelines, whic...

Links: abs, pdf
Search: Bluesky, Twitter, Reddit, Hacker News, Hugging Face, alphaXiv

08.03.2026 00:07 👍 0 🔁 0 💬 1 📌 0

Paper page - MOVA: Towards Scalable and Synchronized Video-Audio Generation Join the discussion on this paper page

(1/1) 154 Likes, 4 Comments, 10 Feb 2026, Hugging Face

08.03.2026 00:07 👍 0 🔁 0 💬 1 📌 0

Audio is indispensable for real-world video, yet generation models have largely overlooked audio components. Current approaches to producing audio-visual content often rely on cascaded pipelines, which increase cost, accumulate errors, and degrade overall quality. While systems such as Veo 3 and Sora 2 emphasize the value of simultaneous generation, joint multimodal modeling introduces unique challenges in architecture, data, and training. Moreover, the closed-source nature of existing systems limits progress in the field. In this work, we introduce MOVA (MOSS Video and Audio), an open-source model capable of generating high-quality, synchronized audio-visual content, including realistic lip-synced speech, environment-aware sound effects, and content-aligned music. MOVA employs a Mixture-of-Experts (MoE) architecture with a total of 32B parameters, of which 18B are active during inference. It supports the IT2VA (Image-Text to Video-Audio) generation task. By releasing the model weights and code, we aim to advance research and foster a vibrant community of creators. The released codebase features comprehensive support for efficient inference, LoRA fine-tuning, and prompt enhancement.

[30/30] 154 Likes, 4 Comments, 1 Posts
2602.08794, cs.CV | cs.SD, 10 Feb 2026

🆕MOVA: Towards Scalable and Synchronized Video-Audio Generation

SII-OpenMOSS Team: Donghua Yu, Mingshu Chen, Qi Chen, Qi Luo, Qianyi Wu, Qinyuan Cheng, Ruixiao Li, Tianyi Liang, Wenbo Zhang, Wenming Tu, Xiangyu ...

08.03.2026 00:07 👍 0 🔁 0 💬 1 📌 0

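Current multi-view indoor 3D object detectors rely on sensor geometry that is costly to obtain (i.e., precisely calibrated multi-view camera poses) to fuse multi-view information into a global scene representation, which limits deployment in real-world scenes. We target a more practical setting: Sensor-Geometry-Free (SG-Free) multi-view indoor 3D object detection, where no sensor-provided geometric inputs (multi-view poses or depth) are available. The recent Visual Geometry Grounded Transformer (VGGT) shows that strong 3D cues can be inferred directly from images. Building on this insight, we propose VGGT-Det, the first framework tailored to SG-Free multi-view indoor 3D object detection. Rather than merely consuming VGGT predictions, our method integrates the VGGT encoder into a transformer-based pipeline. To make effective use of both the semantic and the geometric priors inside VGGT, we introduce two novel key components: (i) Attention-Guided Query Generation (AG), which uses VGGT attention maps as semantic priors to initialize object queries, improving localization by focusing on object regions while preserving global spatial structure; and (ii) Query-Driven Feature Aggregation (QD), in which a learnable See-Query interacts with object queries to "see" what they need and then dynamically aggregates multi-level geometric features across the VGGT layers that progressively lift 2D features into 3D. Experiments show that VGGT-Det surpasses the best-performing method in the SG-Free setting by 4.4 and 8.6 mAP@0.25 on ScanNet and ARKitScenes, respectively. An ablation study shows that the semantic and geometric priors VGGT learns internally can be effectively leveraged by our AG and QD.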

2603.00912
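Current multi-view indoor 3D object detectors rely on sensor geometry that is costly to obtain (i.e., precisely calibrated multi-view camera poses) to fuse multi-view information into a global scene representation, which limits deployment in real-world scenes. We target a more practical setting: ...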

07.03.2026 00:17 👍 0 🔁 0 💬 0 📌 0

VGGT-Det: Mining VGGT Internal Priors for Sensor-Geometry-Free Multi-View Indoor 3D Object Detection Current multi-view indoor 3D object detectors rely on sensor geometry that is costly to obtain (i.e., precisely calibrated multi-view camera poses) to fuse multi-view information into a global scene r...

Links: abs, pdf
Search: Bluesky, Twitter, Reddit, Hacker News, Hugging Face, alphaXiv

07.03.2026 00:17 👍 0 🔁 0 💬 1 📌 0

Paper page - VGGT-Det: Mining VGGT Internal Priors for Sensor-Geometry-Free Multi-View Indoor 3D Object Detection Join the discussion on this paper page

(1/1) 33 Likes, 3 Comments, 03 Mar 2026, Hugging Face

07.03.2026 00:17 👍 0 🔁 0 💬 1 📌 0

Current multi-view indoor 3D object detectors rely on sensor geometry that is costly to obtain (i.e., precisely calibrated multi-view camera poses) to fuse multi-view information into a global scene representation, limiting deployment in real-world scenes. We target a more practical setting: Sensor-Geometry-Free (SG-Free) multi-view indoor 3D object detection, where there are no sensor-provided geometric inputs (multi-view poses or depth). The recent Visual Geometry Grounded Transformer (VGGT) shows that strong 3D cues can be inferred directly from images. Building on this insight, we present VGGT-Det, the first framework tailored for SG-Free multi-view indoor 3D object detection. Rather than merely consuming VGGT predictions, our method integrates the VGGT encoder into a transformer-based pipeline. To effectively leverage both the semantic and geometric priors from inside VGGT, we introduce two novel key components: (i) Attention-Guided Query Generation (AG): exploits VGGT attention maps as semantic priors to initialize object queries, improving localization by focusing on object regions while preserving global spatial structure; (ii) Query-Driven Feature Aggregation (QD): a learnable See-Query interacts with object queries to 'see' what they need, and then dynamically aggregates multi-level geometric features across VGGT layers that progressively lift 2D features into 3D. Experiments show that VGGT-Det significantly surpasses the best-performing method in the SG-Free setting by 4.4 and 8.6 mAP@0.25 on ScanNet and ARKitScenes, respectively. An ablation study shows that VGGT's internally learned semantic and geometric priors can be effectively leveraged by our AG and QD.
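
For readers unfamiliar with the metric: in mAP@0.25, a predicted box counts as a true positive when its 3D intersection-over-union (IoU) with a same-class ground-truth box is at least 0.25. The sketch below illustrates that threshold test only. It assumes axis-aligned boxes in a hypothetical (xmin, ymin, zmin, xmax, ymax, zmax) layout; the function names are illustrative and are not taken from the paper or its code, and benchmarks with oriented boxes need a more involved IoU.

```python
def iou_3d_axis_aligned(a, b):
    """IoU of two axis-aligned 3D boxes given as
    (xmin, ymin, zmin, xmax, ymax, zmax). Assumed layout, for illustration."""
    inter = 1.0
    for i in range(3):
        lo = max(a[i], b[i])          # start of the overlap on this axis
        hi = min(a[i + 3], b[i + 3])  # end of the overlap on this axis
        if hi <= lo:                  # no overlap on one axis means no overlap at all
            return 0.0
        inter *= hi - lo

    def volume(box):
        return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])

    return inter / (volume(a) + volume(b) - inter)


def is_true_positive(pred_box, gt_box, threshold=0.25):
    """A detection matches a ground-truth box when IoU >= threshold."""
    return iou_3d_axis_aligned(pred_box, gt_box) >= threshold
```

Average precision is then computed per class from the ranked detections, and the mean over classes gives mAP.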

[5/30] 33 Likes, 3 Comments, 1 Posts
2603.00912, cs.CV, 01 Mar 2026

🆕VGGT-Det: Mining VGGT Internal Priors for Sensor-Geometry-Free Multi-View Indoor 3D Object Detection

Yang Cao, Feize Wu, Dave Zhenyu Chen, Yingji Zhong, Lanqing Hong, Dan Xu

07.03.2026 00:16 👍 0 🔁 0 💬 1 📌 0

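Diffusion models have become the dominant tool for high-fidelity image and video generation, yet inference speed is a critical bottleneck because Diffusion Transformers require many iterative passes. To reduce this compute, recent work caches and reuses features, skipping network evaluations at selected diffusion steps by using features cached at previous steps. However, these preliminary designs rely solely on local approximation, so errors grow rapidly as the skips get larger and sample quality degrades at high speedups. In this work, we propose the spectral diffusion feature forecaster (Spectrum), a training-free approach that enables global, long-range feature reuse with tightly controlled error. In particular, we view the latent features of the denoiser as functions over time and approximate them with Chebyshev polynomials. Specifically, we fit the coefficient of each basis component via ridge regression and use the result to forecast features at multiple future diffusion steps. We show theoretically that our approach has more favorable long-horizon behavior and yields an error bound that does not compound with the step size. Extensive experiments on various state-of-the-art image and video diffusion models consistently demonstrate the superiority of our approach. Notably, we achieve up to 4.79× speedup on FLUX.1 and 4.67× speedup on Wan2.1-14B while maintaining much higher sample quality than the baselines.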

2603.01623
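Diffusion models have become the dominant tool for high-fidelity image and video generation, yet inference speed is a critical bottleneck because Diffusion Transformers require many iterative passes. To reduce this compute, recent work caches and reuses features, which uses features cached at previous steps...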

07.03.2026 00:16 👍 0 🔁 0 💬 0 📌 0

Adaptive Spectral Feature Forecasting for Diffusion Sampling Acceleration Diffusion models have become the dominant tool for high-fidelity image and video generation, yet are critically bottlenecked by their inference speed due to the numerous iterative passes of Diffusion ...

Links: abs, pdf
Search: Bluesky, Twitter, Reddit, Hacker News, Hugging Face, alphaXiv

07.03.2026 00:16 👍 0 🔁 0 💬 1 📌 0

From the comfyui community on Reddit: ComfyUI-Spectrum-SDXL: Accelerate SDXL inference by ~1.5-2x with no noticeable quality loss! Explore this post and more from the comfyui community

(1/1) 32 Likes, 5 Comments, 05 Mar 2026, Reddit

07.03.2026 00:16 👍 0 🔁 0 💬 1 📌 0

Diffusion models have become the dominant tool for high-fidelity image and video generation, yet are critically bottlenecked by their inference speed due to the numerous iterative passes of Diffusion Transformers. To reduce the exhaustive compute, recent works resort to the feature caching and reusing scheme that skips network evaluations at selected diffusion steps by using cached features in previous steps. However, their preliminary design solely relies on local approximation, causing errors to grow rapidly with large skips and leading to degraded sample quality at high speedups. In this work, we propose spectral diffusion feature forecaster (Spectrum), a training-free approach that enables global, long-range feature reuse with tightly controlled error. In particular, we view the latent features of the denoiser as functions over time and approximate them with Chebyshev polynomials. Specifically, we fit the coefficient for each basis via ridge regression, which is then leveraged to forecast features at multiple future diffusion steps. We theoretically reveal that our approach admits more favorable long-horizon behavior and yields an error bound that does not compound with the step size. Extensive experiments on various state-of-the-art image and video diffusion models consistently verify the superiority of our approach. Notably, we achieve up to 4.79× speedup on FLUX.1 and 4.67× speedup on Wan2.1-14B, while maintaining much higher sample quality compared with the baselines.
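
The core idea in the abstract (treat cached features as functions of time, fit Chebyshev coefficients by ridge regression, then evaluate the fit at future steps) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the polynomial degree, the regularization strength `lam`, and the linear mapping of step indices onto [-1, 1] are all hypothetical choices made for the example.

```python
import numpy as np
from numpy.polynomial import chebyshev as C


def to_unit_interval(step, n_steps):
    """Map a step index in [0, n_steps - 1] onto [-1, 1], the natural
    domain of Chebyshev polynomials (assumed mapping, for illustration)."""
    return 2.0 * step / (n_steps - 1) - 1.0


def fit_chebyshev_ridge(t, feats, degree=3, lam=1e-3):
    """Fit ridge-regularized Chebyshev coefficients to cached features.

    t:     (n,) time points already mapped into [-1, 1]
    feats: (n, d) cached feature vectors, one row per time point
    Returns a (degree + 1, d) coefficient matrix.
    """
    phi = C.chebvander(t, degree)               # (n, degree + 1) basis matrix
    a = phi.T @ phi + lam * np.eye(degree + 1)  # ridge normal equations
    return np.linalg.solve(a, phi.T @ feats)


def forecast(coef, t_new):
    """Evaluate the fitted Chebyshev expansion at new time points."""
    phi_new = C.chebvander(np.atleast_1d(t_new), coef.shape[0] - 1)
    return phi_new @ coef                       # (len(t_new), d) forecasts
```

In use, one would fit on the features from the steps where the network was actually evaluated and call `forecast` for the skipped steps. How the paper selects those steps and bounds the resulting error is not covered by this sketch.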

[6/30] 32 Likes, 5 Comments, 1 Posts
2603.01623, cs.CV | cs.LG, 02 Mar 2026

🆕Adaptive Spectral Feature Forecasting for Diffusion Sampling Acceleration

Jiaqi Han, Juntong Shi, Puheng Li, Haotian Ye, Qiushan Guo, Stefano Ermon

07.03.2026 00:16 👍 0 🔁 0 💬 1 📌 0

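While music generation models have evolved to handle complex multimodal inputs that mix text, lyrics, and reference audio, evaluation mechanisms have lagged behind. In this paper, we bridge this critical gap by establishing a comprehensive ecosystem for music reward modeling under Compositional Multimodal Instruction (CMI), where the generated music may be conditioned on text descriptions, lyrics, and audio prompts. We first introduce CMI-Pref-Pseudo, a large-scale preference dataset of 110k pseudo-labeled samples, and CMI-Pref, a high-quality, human-annotated corpus tailored to fine-grained alignment tasks. To unify the evaluation landscape, we propose CMI-RewardBench, a unified benchmark that evaluates music reward models on heterogeneous samples across musicality, text-music alignment, and compositional instruction alignment. Leveraging these resources, we develop CMI reward models (CMI-RMs), a parameter-efficient family of reward models capable of processing heterogeneous inputs. We evaluate their correlation with human judgment scores on musicality and alignment on CMI-Pref, along with previous datasets. Further experiments demonstrate that CMI-RM not only correlates strongly with human judgments but also enables effective inference-time scaling via top-k filtering. The necessary training data, benchmarks, and reward models are publicly available.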

2603.00610
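While music generation models have evolved to handle complex multimodal inputs that mix text, lyrics, and reference audio, evaluation mechanisms have lagged behind. In this paper, for Compositional Multimodal Instruction (CMI), where the generated music may be conditioned on text descriptions, lyrics, and audio prompts, ...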

07.03.2026 00:16 👍 0 🔁 0 💬 0 📌 0

CMI-RewardBench: Evaluating Music Reward Models with Compositional Multimodal Instruction While music generation models have evolved to handle complex multimodal inputs mixing text, lyrics, and reference audio, evaluation mechanisms have lagged behind. In this paper, we bridge this critica...

Links: abs, pdf
Search: Bluesky, Twitter, Reddit, Hacker News, Hugging Face, alphaXiv

07.03.2026 00:16 👍 0 🔁 0 💬 1 📌 0

Paper page - CMI-RewardBench: Evaluating Music Reward Models with Compositional Multimodal Instruction Join the discussion on this paper page

(1/1) 32 Likes, 2 Comments, 03 Mar 2026, Hugging Face

07.03.2026 00:16 👍 0 🔁 0 💬 1 📌 0

While music generation models have evolved to handle complex multimodal inputs mixing text, lyrics, and reference audio, evaluation mechanisms have lagged behind. In this paper, we bridge this critical gap by establishing a comprehensive ecosystem for music reward modeling under Compositional Multimodal Instruction (CMI), where the generated music may be conditioned on text descriptions, lyrics, and audio prompts. We first introduce CMI-Pref-Pseudo, a large-scale preference dataset comprising 110k pseudo-labeled samples, and CMI-Pref, a high-quality, human-annotated corpus tailored for fine-grained alignment tasks. To unify the evaluation landscape, we propose CMI-RewardBench, a unified benchmark that evaluates music reward models on heterogeneous samples across musicality, text-music alignment, and compositional instruction alignment. Leveraging these resources, we develop CMI reward models (CMI-RMs), a parameter-efficient reward model family capable of processing heterogeneous inputs. We evaluate their correlation with human judgment scores on musicality and alignment on CMI-Pref along with previous datasets. Further experiments demonstrate that CMI-RM not only correlates strongly with human judgments, but also enables effective inference-time scaling via top-k filtering. The necessary training data, benchmarks, and reward models are publicly available.
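
"Inference-time scaling via top-k filtering" generally means sampling several candidates, scoring each with a reward model, and keeping only the k best. The sketch below shows that selection step in isolation. It is an assumption-laden illustration: `top_k_filter` and `reward_fn` are hypothetical names, and `reward_fn` stands in for a reward model's scoring call, not for the actual CMI-RM interface.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def top_k_filter(candidates: Sequence[T],
                 reward_fn: Callable[[T], float],
                 k: int) -> List[T]:
    """Return the k highest-scoring candidates, best first.

    reward_fn is a placeholder for a reward model's scoring function;
    ties keep their original relative order because sorted() is stable.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    ranked = sorted(candidates, key=reward_fn, reverse=True)
    return list(ranked[:k])
```

Generating more candidates before filtering is what spends the extra inference-time compute; the filter itself is cheap compared with generation.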

[8/30] 32 Likes, 2 Comments, 1 Posts
2603.00610, cs.SD | cs.AI | cs.LG | cs.MM | eess.AS, 04 Mar 2026

🆕CMI-RewardBench: Evaluating Music Reward Models with Compositional Multimodal Instruction

Yinghao Ma, Haiwen Xia, Hewei Gao, Weixiong Chen, Yuxin Ye, Yuchen Yang, Sungkyun Chang, Mingshu...

07.03.2026 00:16 👍 0 🔁 0 💬 1 📌 0

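Recent advances in Generative Reward Models (GRMs) have demonstrated that scaling the length of Chain-of-Thought (CoT) reasoning considerably improves the reliability of evaluation. However, current work relies predominantly on unstructured length scaling and ignores the differing efficacy of distinct reasoning mechanisms: Breadth-CoT (B-CoT, i.e., multi-dimensional principle coverage) and Depth-CoT (D-CoT, i.e., substantive judgment soundness). To address this, we introduce Mix-GRM, a framework that reconfigures raw rationales into structured B-CoT and D-CoT through a modular synthesis pipeline, then applies Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR) to internalize and optimize these mechanisms. Comprehensive experiments demonstrate that Mix-GRM establishes a new state of the art across five benchmarks, surpassing leading open-source RMs by an average of 8.2%. Our results reveal a clear divergence in reasoning: B-CoT benefits subjective preference tasks, whereas D-CoT excels in objective correctness tasks. Consequently, misaligning the reasoning mechanism with the task directly degrades performance. Furthermore, we demonstrate that RLVR acts as a switching amplifier, inducing an emergent polarization in which the model spontaneously allocates its reasoning style to match task demands. The synthesized data and models are released on Hugging Face (https://huggingface.co/collections/DonJoey/mix-grm), and the code is released on GitHub (https://github.com/Don-Joey/Mix-GRM).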

2603.01571
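Recent advances in Generative Reward Models (GRMs) have demonstrated that scaling the length of Chain-of-Thought (CoT) reasoning considerably improves the reliability of evaluation. However, current work relies predominantly on unstructured length scaling and ignores the differing efficacy of distinct reasoning mechanisms: Breadth-CoT (B-CoT, i.e., multi-...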

07.03.2026 00:16 👍 0 🔁 0 💬 0 📌 0

Beyond Length Scaling: Synergizing Breadth and Depth for Generative Reward Models Recent advancements in Generative Reward Models (GRMs) have demonstrated that scaling the length of Chain-of-Thought (CoT) reasoning considerably enhances the reliability of evaluation. However, curre...

Links: abs, pdf
Search: Bluesky, Twitter, Reddit, Hacker News, Hugging Face, alphaXiv

07.03.2026 00:16 👍 0 🔁 0 💬 1 📌 0

Paper page - Beyond Length Scaling: Synergizing Breadth and Depth for Generative Reward Models Join the discussion on this paper page

(1/1) 32 Likes, 2 Comments, 04 Mar 2026, Hugging Face

07.03.2026 00:16 👍 0 🔁 0 💬 1 📌 0

Recent advancements in Generative Reward Models (GRMs) have demonstrated that scaling the length of Chain-of-Thought (CoT) reasoning considerably enhances the reliability of evaluation. However, current works predominantly rely on unstructured length scaling, ignoring the divergent efficacy of different reasoning mechanisms: Breadth-CoT (B-CoT, i.e., multi-dimensional principle coverage) and Depth-CoT (D-CoT, i.e., substantive judgment soundness). To address this, we introduce Mix-GRM, a framework that reconfigures raw rationales into structured B-CoT and D-CoT through a modular synthesis pipeline, subsequently employing Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR) to internalize and optimize these mechanisms. Comprehensive experiments demonstrate that Mix-GRM establishes a new state-of-the-art across five benchmarks, surpassing leading open-source RMs by an average of 8.2%. Our results reveal a clear divergence in reasoning: B-CoT benefits subjective preference tasks, whereas D-CoT excels in objective correctness tasks. Consequently, misaligning the reasoning mechanism with the task directly degrades performance. Furthermore, we demonstrate that RLVR acts as a switching amplifier, inducing an emergent polarization where the model spontaneously allocates its reasoning style to match task demands. The synthesized data and models are released on Hugging Face (https://huggingface.co/collections/DonJoey/mix-grm), and the code is released on GitHub (https://github.com/Don-Joey/Mix-GRM).

[9/30] 32 Likes, 2 Comments, 1 Posts
2603.01571, cs.AI, 02 Mar 2026

🆕Beyond Length Scaling: Synergizing Breadth and Depth for Generative Reward Models

Qiyuan Zhang, Yufei Wang, Tianhe Wu, Can Xu, Qingfeng Sun, Kai Zheng, Xue Liu, Chen Ma

07.03.2026 00:16 👍 1 🔁 0 💬 1 📌 0

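We present Nanbeige4.1-3B, a unified generalist language model that simultaneously achieves strong agentic behavior, code generation, and general reasoning with only 3B parameters. To the best of our knowledge, it is the first open-source small language model (SLM) to achieve such versatility in a single model. To improve reasoning and preference alignment, we combine point-wise and pair-wise reward modeling, ensuring high-quality, human-aligned responses. For code generation, we design complexity-aware rewards in reinforcement learning that optimize both correctness and efficiency. For deep search, we perform complex data synthesis and incorporate turn-level supervision during training. This enables stable long-horizon tool interactions, allowing Nanbeige4.1-3B to reliably execute up to 600 tool-call turns for complex problem-solving. Extensive experimental results show that Nanbeige4.1-3B significantly outperforms prior models of similar scale, such as Nanbeige4-3B-2511 and Qwen3-4B, and even achieves superior performance compared with much larger models such as Qwen3-30B-A3B. Our results demonstrate that small models can achieve broad competence and strong specialization at the same time, redefining the potential of 3B-parameter models.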

2602.13367
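We present Nanbeige4.1-3B, a unified generalist language model that simultaneously achieves strong agentic behavior, code generation, and general reasoning with only 3B parameters. To the best of our knowledge, it is the first open-source small language model (SLM) to achieve such versatility in a single model. To improve reasoning and preference alignment, point-...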

07.03.2026 00:16 👍 0 🔁 0 💬 1 📌 0

Nanbeige4.1-3B: A Small General Model that Reasons, Aligns, and Acts We present Nanbeige4.1-3B, a unified generalist language model that simultaneously achieves strong agentic behavior, code generation, and general reasoning with only 3B parameters. To the best of our ...

Links: abs, pdf
Search: Bluesky, Twitter, Reddit, Hacker News, Hugging Face, alphaXiv

07.03.2026 00:16 👍 0 🔁 0 💬 1 📌 0

Paper page - Nanbeige4.1-3B: A Small General Model that Reasons, Aligns, and Acts Join the discussion on this paper page

(1/1) 31 Likes, 3 Comments, 17 Feb 2026, Hugging Face

07.03.2026 00:16 👍 0 🔁 0 💬 1 📌 0

We present Nanbeige4.1-3B, a unified generalist language model that simultaneously achieves strong agentic behavior, code generation, and general reasoning with only 3B parameters. To the best of our knowledge, it is the first open-source small language model (SLM) to achieve such versatility in a single model. To improve reasoning and preference alignment, we combine point-wise and pair-wise reward modeling, ensuring high-quality, human-aligned responses. For code generation, we design complexity-aware rewards in Reinforcement Learning, optimizing both correctness and efficiency. In deep search, we perform complex data synthesis and incorporate turn-level supervision during training. This enables stable long-horizon tool interactions, allowing Nanbeige4.1-3B to reliably execute up to 600 tool-call turns for complex problem-solving. Extensive experimental results show that Nanbeige4.1-3B significantly outperforms prior models of similar scale, such as Nanbeige4-3B-2511 and Qwen3-4B, even achieving superior performance compared to much larger models, such as Qwen3-30B-A3B. Our results demonstrate that small models can achieve both broad competence and strong specialization simultaneously, redefining the potential of 3B parameter models.

[10/30] 31 Likes, 3 Comments, 1 Posts
2602.13367, cs.AI | cs.CL, 13 Feb 2026

🆕Nanbeige4.1-3B: A Small General Model that Reasons, Aligns, and Acts

Chen Yang, Guangyue Peng, Jiaying Zhu, Ran Le, Ruixiang Feng, Tao Zhang, Xiyun Xu, Yang Song, Yiming Jia, Yuntao Wen, Yunzhi Xu, Zekai Wang, Zhen...

07.03.2026 00:16 👍 1 🔁 0 💬 1 📌 0

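Extended reality (XR) demands generative models that respond to users' tracked real-world motion, yet current video world models accept only coarse control signals such as text or keyboard input, which limits their utility for embodied interaction. We propose a human-centric video world model conditioned on both tracked head pose and joint-level hand poses. To this end, we evaluate existing diffusion transformer conditioning strategies and propose an effective mechanism for 3D head and hand control, enabling dexterous hand-object interactions. Using this strategy, we train a bidirectional video diffusion model as a teacher and distill it into a causal, interactive system that generates egocentric virtual environments. We evaluate this generated reality system with human subjects and demonstrate improved task performance as well as a significantly higher perceived level of control over the performed actions compared with relevant baselines.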

2602.18422
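Extended reality (XR) demands generative models that respond to users' tracked real-world motion, yet current video world models accept only coarse control signals such as text or keyboard input, which limits their utility for embodied interaction. Tracked head pose and joint-level hand...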

07.03.2026 00:16 👍 0 🔁 0 💬 0 📌 0

Generated Reality: Human-centric World Simulation using Interactive Video Generation with Hand and Camera Control Extended reality (XR) demands generative models that respond to users' tracked real-world motion, yet current video world models accept only coarse control signals such as text or keyboard input, limi...

Links: abs, pdf
Search: Bluesky, Twitter, Reddit, Hacker News, Hugging Face, alphaXiv

07.03.2026 00:16 👍 0 🔁 0 💬 1 📌 0

Paper page - Generated Reality: Human-centric World Simulation using Interactive Video Generation with Hand and Camera Control Join the discussion on this paper page

(1/1) 30 Likes, 5 Comments, 23 Feb 2026, Hugging Face

07.03.2026 00:16 👍 0 🔁 0 💬 1 📌 0

Extended reality (XR) demands generative models that respond to users' tracked real-world motion, yet current video world models accept only coarse control signals such as text or keyboard input, limiting their utility for embodied interaction. We introduce a human-centric video world model that is conditioned on both tracked head pose and joint-level hand poses. For this purpose, we evaluate existing diffusion transformer conditioning strategies and propose an effective mechanism for 3D head and hand control, enabling dexterous hand-object interactions. We train a bidirectional video diffusion model teacher using this strategy and distill it into a causal, interactive system that generates egocentric virtual environments. We evaluate this generated reality system with human subjects and demonstrate improved task performance as well as a significantly higher perceived level of control over the performed actions compared with relevant baselines.

[11/30] 30 Likes, 5 Comments, 1 Posts
2602.18422, cs.CV, 20 Feb 2026

🆕Generated Reality: Human-centric World Simulation using Interactive Video Generation with Hand and Camera Control

Linxi Xie, Lisong C. Sun, Ashley Neall, Tong Wu, Shengqu Cai, Gordon Wetzstein

07.03.2026 00:16 👍 0 🔁 0 💬 1 📌 0

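Internal modelling of the world, that is, predicting transitions between previous states X and next states Y under actions Z, is essential to reasoning and planning in LLMs and VLMs. Learning such models typically requires costly action-labelled trajectories. We propose SWIRL, a self-improvement framework that learns from state-only sequences by treating actions as a latent variable and alternating between Forward World Modelling (FWM) P_θ(Y|X,Z) and Inverse Dynamics Modelling (IDM) Q_φ(Z|X,Y). SWIRL iterates two phases: (1) Variational Information Maximisation, which updates the FWM to generate next states that maximise conditional mutual information with the latent actions given the prior states, encouraging identifiable consistency; and (2) ELBO Maximisation, which updates the IDM to explain observed transitions, effectively performing coordinate ascent. Both models are trained with reinforcement learning (specifically, GRPO), using the log-probability of the opposite, frozen model as the reward signal. We provide theoretical learnability guarantees for both updates and evaluate SWIRL on LLMs and VLMs across multiple environments: single-turn and multi-turn open-world visual dynamics, and synthetic textual environments for physics, web, and tool calling. SWIRL achieves gains of 16% on AURORABench, 28% on ByteMorph, 16% on WorldPredictionBench, and 14% on StableToolBench.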
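
For orientation, the two phases named in the abstract correspond to two textbook variational bounds for a latent-action model. The block below states those generic bounds, not the paper's exact objectives, which the abstract does not spell out. The prior p(Z | X) over latent actions is an assumption introduced here for illustration.

```latex
% Generic evidence lower bound (ELBO) on the state-only likelihood,
% with the IDM Q_\phi acting as the variational posterior:
\log P_\theta(Y \mid X)
  \;\ge\;
  \mathbb{E}_{Z \sim Q_\phi(Z \mid X, Y)}\!\left[\log P_\theta(Y \mid X, Z)\right]
  - \mathrm{KL}\!\left(Q_\phi(Z \mid X, Y) \,\middle\|\, p(Z \mid X)\right)

% Generic variational lower bound on the conditional mutual information,
% with Y sampled from the FWM and scored by the IDM:
I(Y; Z \mid X)
  \;\ge\;
  H(Z \mid X)
  + \mathbb{E}_{Z \sim p(Z \mid X),\; Y \sim P_\theta(Y \mid X, Z)}\!\left[\log Q_\phi(Z \mid X, Y)\right]
```

Read this way, using the frozen IDM's log-probability as the FWM's reward raises the second bound, and using the frozen FWM's log-probability as the IDM's reward raises the first term of the ELBO, which matches the alternating scheme the abstract describes.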

2602.06130
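Internal modelling of the world, that is, predicting transitions between previous states X and next states Y under actions Z, is essential to reasoning and planning in LLMs and VLMs. Learning such models typically requires costly action-labelled trajectories. We treat actions as a latent variable, and Forward World Modelling (FWM) P_θ(Y|X,Z) and inverse...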

07.03.2026 00:16 👍 0 🔁 0 💬 0 📌 0
