Paper (@paper) — bluesky.baby

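Current multi-view indoor 3D object detectors rely on sensor geometry that is costly to obtain (i.e., precisely calibrated multi-view camera poses) to fuse multi-view information into a global scene representation, which limits their deployment in real-world scenes. We target a more practical setting: Sensor-Geometry-Free (SG-Free) multi-view indoor 3D object detection, where no sensor-provided geometric inputs (multi-view poses or depth) are available. The recent Visual Geometry Grounded Transformer (VGGT) shows that strong 3D cues can be inferred directly from images. Building on this insight, we present VGGT-Det, the first framework tailored for SG-Free multi-view indoor 3D object detection. Rather than merely consuming VGGT predictions, our method integrates the VGGT encoder into a transformer-based pipeline. To effectively leverage both the semantic and geometric priors inside VGGT, we introduce two novel key components: (i) Attention-Guided Query Generation (AG), which exploits VGGT attention maps as semantic priors to initialize object queries, improving localization by focusing on object regions while preserving global spatial structure; (ii) Query-Driven Feature Aggregation (QD), in which a learnable See-Query interacts with object queries to "see" what they need and then dynamically aggregates multi-level geometric features across the VGGT layers that progressively lift 2D features into 3D. Experiments show that VGGT-Det surpasses the best-performing method in the SG-Free setting by 4.4 and 8.6 mAP@0.25 on ScanNet and ARKitScenes, respectively. An ablation study shows that the semantic and geometric priors learned internally by VGGT can be effectively leveraged by our AG and QD.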

2603.00912

07.03.2026 00:17 👍 0 🔁 0 💬 0 📌 0

VGGT-Det: Mining VGGT Internal Priors for Sensor-Geometry-Free Multi-View Indoor 3D Object Detection Current multi-view indoor 3D object detectors rely on sensor geometry that is costly to obtain (i.e., precisely calibrated multi-view camera poses) to fuse multi-view information into a global scene r...

Links: abs, pdf
Search: Bluesky, Twitter, Reddit, Hacker News, Hugging Face, alphaXiv

07.03.2026 00:17 👍 0 🔁 0 💬 1 📌 0

Paper page - VGGT-Det: Mining VGGT Internal Priors for Sensor-Geometry-Free Multi-View Indoor 3D Object Detection Join the discussion on this paper page

(1/1) 33 Likes, 3 Comments, 03 Mar 2026, Hugging Face

07.03.2026 00:17 👍 0 🔁 0 💬 1 📌 0

Current multi-view indoor 3D object detectors rely on sensor geometry that is costly to obtain (i.e., precisely calibrated multi-view camera poses) to fuse multi-view information into a global scene representation, limiting deployment in real-world scenes. We target a more practical setting: Sensor-Geometry-Free (SG-Free) multi-view indoor 3D object detection, where there are no sensor-provided geometric inputs (multi-view poses or depth). The recent Visual Geometry Grounded Transformer (VGGT) shows that strong 3D cues can be inferred directly from images. Building on this insight, we present VGGT-Det, the first framework tailored for SG-Free multi-view indoor 3D object detection. Rather than merely consuming VGGT predictions, our method integrates the VGGT encoder into a transformer-based pipeline. To effectively leverage both the semantic and geometric priors from inside VGGT, we introduce two novel key components: (i) Attention-Guided Query Generation (AG): exploits VGGT attention maps as semantic priors to initialize object queries, improving localization by focusing on object regions while preserving global spatial structure; (ii) Query-Driven Feature Aggregation (QD): a learnable See-Query interacts with object queries to 'see' what they need, and then dynamically aggregates multi-level geometric features across the VGGT layers that progressively lift 2D features into 3D. Experiments show that VGGT-Det significantly surpasses the best-performing method in the SG-Free setting by 4.4 and 8.6 mAP@0.25 on ScanNet and ARKitScenes, respectively. An ablation study shows that VGGT's internally learned semantic and geometric priors can be effectively leveraged by our AG and QD.
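For readers unfamiliar with the metric, mAP@0.25 counts a predicted 3D box as a match when its intersection-over-union (IoU) with a ground-truth box is at least 0.25. The sketch below illustrates only that matching criterion. It assumes axis-aligned boxes stored as (xmin, ymin, zmin, xmax, ymax, zmax); the function names are hypothetical and it is not the evaluation code used in the paper.

```python
from typing import Sequence

# Assumed box layout: (xmin, ymin, zmin, xmax, ymax, zmax), axis-aligned.
Box = Sequence[float]


def volume(box: Box) -> float:
    """Volume of an axis-aligned 3D box."""
    return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])


def iou_3d(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned 3D boxes."""
    inter = 1.0
    for lo_a, hi_a, lo_b, hi_b in zip(a[:3], a[3:], b[:3], b[3:]):
        overlap = min(hi_a, hi_b) - max(lo_a, lo_b)
        if overlap <= 0.0:
            return 0.0  # the boxes do not overlap along this axis
        inter *= overlap
    return inter / (volume(a) + volume(b) - inter)


def is_match(pred: Box, gt: Box, threshold: float = 0.25) -> bool:
    """A prediction matches a ground-truth box if IoU >= threshold."""
    return iou_3d(pred, gt) >= threshold
```

The full metric additionally ranks predictions by confidence and averages precision over recall levels and classes, which this sketch leaves out.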

[5/30] 33 Likes, 3 Comments, 1 Posts
2603.00912, cs․CV, 01 Mar 2026

🆕VGGT-Det: Mining VGGT Internal Priors for Sensor-Geometry-Free Multi-View Indoor 3D Object Detection

Yang Cao, Feize Wu, Dave Zhenyu Chen, Yingji Zhong, Lanqing Hong, Dan Xu

07.03.2026 00:16 👍 0 🔁 0 💬 1 📌 0

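Diffusion models have become the dominant tool for high-fidelity image and video generation, but inference speed is a critical bottleneck because Diffusion Transformers require many iterative passes. To reduce the compute, recent works adopt feature caching and reuse, which skips network evaluations at selected diffusion steps by using features cached at previous steps. However, these early designs rely solely on local approximation, so errors grow rapidly as the skips get larger and sample quality degrades at high speedups. In this work, we propose the spectral diffusion feature forecaster (Spectrum), a training-free approach that enables global, long-range feature reuse with tightly controlled error. In particular, we view the latent features of the denoiser as functions over time and approximate them with Chebyshev polynomials. Specifically, we fit the coefficient of each basis function via ridge regression and use the fit to forecast features at multiple future diffusion steps. We show theoretically that our approach has more favorable long-horizon behavior and yields an error bound that does not compound with the step size. Extensive experiments on various state-of-the-art image and video diffusion models consistently verify the superiority of our approach. Notably, we achieve up to 4.79× speedup on FLUX.1 and 4.67× speedup on Wan2.1-14B while maintaining much higher sample quality than the baselines.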

2603.01623

07.03.2026 00:16 👍 0 🔁 0 💬 0 📌 0

Adaptive Spectral Feature Forecasting for Diffusion Sampling Acceleration Diffusion models have become the dominant tool for high-fidelity image and video generation, yet are critically bottlenecked by their inference speed due to the numerous iterative passes of Diffusion ...

Links: abs, pdf
Search: Bluesky, Twitter, Reddit, Hacker News, Hugging Face, alphaXiv

07.03.2026 00:16 👍 0 🔁 0 💬 1 📌 0

From the comfyui community on Reddit: ComfyUI-Spectrum-SDXL: Accelerate SDXL inference by ~1.5-2x with no noticeable quality loss! Explore this post and more from the comfyui community

(1/1) 32 Likes, 5 Comments, 05 Mar 2026, Reddit

07.03.2026 00:16 👍 0 🔁 0 💬 1 📌 0

Diffusion models have become the dominant tool for high-fidelity image and video generation, yet are critically bottlenecked by their inference speed due to the numerous iterative passes of Diffusion Transformers. To reduce the exhaustive compute, recent works resort to the feature caching and reusing scheme that skips network evaluations at selected diffusion steps by using cached features in previous steps. However, their preliminary design solely relies on local approximation, causing errors to grow rapidly with large skips and leading to degraded sample quality at high speedups. In this work, we propose spectral diffusion feature forecaster (Spectrum), a training-free approach that enables global, long-range feature reuse with tightly controlled error. In particular, we view the latent features of the denoiser as functions over time and approximate them with Chebyshev polynomials. Specifically, we fit the coefficient for each basis via ridge regression, which is then leveraged to forecast features at multiple future diffusion steps. We theoretically reveal that our approach admits more favorable long-horizon behavior and yields an error bound that does not compound with the step size. Extensive experiments on various state-of-the-art image and video diffusion models consistently verify the superiority of our approach. Notably, we achieve up to 4.79× speedup on FLUX.1 and 4.67× speedup on Wan2.1-14B, while maintaining much higher sample quality compared with the baselines.
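The core idea in the abstract (treat each cached feature as a function of time, fit Chebyshev coefficients by ridge regression, then evaluate the fit at future steps) can be sketched in a few lines of NumPy. This is a minimal illustration under assumptions not stated in the abstract: time is rescaled to [-1, 1], the degree and regularization strength are arbitrary, and `fit_chebyshev_ridge` and `forecast` are hypothetical names. It is not the authors' implementation.

```python
import numpy as np
from numpy.polynomial import chebyshev as C


def fit_chebyshev_ridge(t, feats, degree=3, lam=1e-3):
    """Fit Chebyshev coefficients to cached features with ridge regression.

    t:     (n,) time stamps of cached steps, assumed rescaled to [-1, 1]
    feats: (n, d) cached feature vectors, one row per step
    Returns an array of shape (degree + 1, d), one coefficient row per basis.
    """
    V = C.chebvander(np.asarray(t, dtype=float), degree)  # (n, degree + 1)
    gram = V.T @ V + lam * np.eye(degree + 1)              # ridge-regularized Gram matrix
    return np.linalg.solve(gram, V.T @ np.asarray(feats, dtype=float))


def forecast(coef, t_new):
    """Evaluate the fitted expansion at future time stamps."""
    t_new = np.atleast_1d(np.asarray(t_new, dtype=float))
    V = C.chebvander(t_new, coef.shape[0] - 1)
    return V @ coef  # (len(t_new), d)


# Toy usage: smooth synthetic "features" observed at early steps only.
t_seen = np.linspace(-1.0, 0.2, 8)
feats_seen = np.stack([t_seen**2, 2 * t_seen**3 - t_seen], axis=1)
coef = fit_chebyshev_ridge(t_seen, feats_seen, degree=3, lam=1e-8)
pred = forecast(coef, [0.6, 1.0])  # features predicted for two skipped steps
```

In a sampler, the forecast would stand in for the network evaluation at the skipped steps. How the paper chooses which steps to skip and which layers to forecast is not covered by this sketch.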

[6/30] 32 Likes, 5 Comments, 1 Posts
2603.01623, cs․CV | cs․LG, 02 Mar 2026

🆕Adaptive Spectral Feature Forecasting for Diffusion Sampling Acceleration

Jiaqi Han, Juntong Shi, Puheng Li, Haotian Ye, Qiushan Guo, Stefano Ermon

07.03.2026 00:16 👍 0 🔁 0 💬 1 📌 0

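While music generation models have evolved to handle complex multimodal inputs that mix text, lyrics, and reference audio, evaluation mechanisms have lagged behind. In this paper, we bridge this critical gap by establishing a comprehensive ecosystem for music reward modeling under Compositional Multimodal Instruction (CMI), where the generated music may be conditioned on text descriptions, lyrics, and audio prompts. We first introduce CMI-Pref-Pseudo, a large-scale preference dataset comprising 110k pseudo-labeled samples, and CMI-Pref, a high-quality, human-annotated corpus tailored for fine-grained alignment tasks. To unify the evaluation landscape, we propose CMI-RewardBench, a unified benchmark that evaluates music reward models on heterogeneous samples across musicality, text-music alignment, and compositional instruction alignment. Leveraging these resources, we develop CMI reward models (CMI-RMs), a parameter-efficient reward model family capable of processing heterogeneous inputs. We evaluate their correlation with human judgment scores on musicality and alignment on CMI-Pref as well as on previous datasets. Further experiments demonstrate that CMI-RM not only correlates strongly with human judgments but also enables effective inference-time scaling via top-k filtering. The necessary training data, benchmarks, and reward models are publicly available.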

2603.00610

07.03.2026 00:16 👍 0 🔁 0 💬 0 📌 0

CMI-RewardBench: Evaluating Music Reward Models with Compositional Multimodal Instruction While music generation models have evolved to handle complex multimodal inputs mixing text, lyrics, and reference audio, evaluation mechanisms have lagged behind. In this paper, we bridge this critica...

Links: abs, pdf
Search: Bluesky, Twitter, Reddit, Hacker News, Hugging Face, alphaXiv

07.03.2026 00:16 👍 0 🔁 0 💬 1 📌 0

Paper page - CMI-RewardBench: Evaluating Music Reward Models with Compositional Multimodal Instruction Join the discussion on this paper page

(1/1) 32 Likes, 2 Comments, 03 Mar 2026, Hugging Face

07.03.2026 00:16 👍 0 🔁 0 💬 1 📌 0

While music generation models have evolved to handle complex multimodal inputs mixing text, lyrics, and reference audio, evaluation mechanisms have lagged behind. In this paper, we bridge this critical gap by establishing a comprehensive ecosystem for music reward modeling under Compositional Multimodal Instruction (CMI), where the generated music may be conditioned on text descriptions, lyrics, and audio prompts. We first introduce CMI-Pref-Pseudo, a large-scale preference dataset comprising 110k pseudo-labeled samples, and CMI-Pref, a high-quality, human-annotated corpus tailored for fine-grained alignment tasks. To unify the evaluation landscape, we propose CMI-RewardBench, a unified benchmark that evaluates music reward models on heterogeneous samples across musicality, text-music alignment, and compositional instruction alignment. Leveraging these resources, we develop CMI reward models (CMI-RMs), a parameter-efficient reward model family capable of processing heterogeneous inputs. We evaluate their correlation with human judgment scores on musicality and alignment on CMI-Pref along with previous datasets. Further experiments demonstrate that CMI-RM not only correlates strongly with human judgments, but also enables effective inference-time scaling via top-k filtering. The necessary training data, benchmarks, and reward models are publicly available.
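Inference-time scaling via top-k filtering, as mentioned in the abstract, generally means sampling several candidates, scoring each with a reward model, and keeping the k best. The sketch below shows that pattern generically. `top_k_filter` is a hypothetical helper, and `reward_fn` stands in for a reward model such as a CMI-RM, whose actual interface is not described in the post.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def top_k_filter(
    candidates: Sequence[T],
    reward_fn: Callable[[T], float],
    k: int,
) -> List[T]:
    """Keep the k candidates with the highest reward, best first.

    reward_fn is a placeholder for a learned reward model that maps a
    generated sample (plus its conditioning, if any) to a scalar score.
    """
    if k <= 0:
        return []
    ranked = sorted(candidates, key=reward_fn, reverse=True)
    return list(ranked[:k])
```

Generating more candidates before filtering spends extra inference compute in exchange for higher expected quality among the kept samples, which is the sense in which this scales at inference time.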

[8/30] 32 Likes, 2 Comments, 1 Posts
2603.00610, cs․SD | cs․AI | cs․LG | cs․MM | eess․AS, 04 Mar 2026

🆕CMI-RewardBench: Evaluating Music Reward Models with Compositional Multimodal Instruction

Yinghao Ma, Haiwen Xia, Hewei Gao, Weixiong Chen, Yuxin Ye, Yuchen Yang, Sungkyun Chang, Mingshu...

07.03.2026 00:16 👍 0 🔁 0 💬 1 📌 0

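Recent advances in Generative Reward Models (GRMs) have demonstrated that scaling the length of Chain-of-Thought (CoT) reasoning considerably improves the reliability of evaluation. However, current work relies predominantly on unstructured length scaling and ignores the differing efficacy of distinct reasoning mechanisms: Breadth-CoT (B-CoT, i.e., multi-dimensional principle coverage) and Depth-CoT (D-CoT, i.e., substantive judgment soundness). To address this, we introduce Mix-GRM, a framework that reconfigures raw rationales into structured B-CoT and D-CoT through a modular synthesis pipeline, and then applies Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR) to internalize and optimize these mechanisms. Comprehensive experiments demonstrate that Mix-GRM establishes a new state of the art across five benchmarks, surpassing leading open-source RMs by an average of 8.2%. Our results reveal a clear divergence in reasoning: B-CoT benefits subjective preference tasks, whereas D-CoT excels in objective correctness tasks. Consequently, misaligning the reasoning mechanism with the task directly degrades performance. Furthermore, we demonstrate that RLVR acts as a switching amplifier, inducing an emergent polarization in which the model spontaneously allocates its reasoning style to match task demands. The synthesized data and models are released on Hugging Face (https://huggingface.co/collections/DonJoey/mix-grm), and the code is released on GitHub (https://github.com/Don-Joey/Mix-GRM).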

2603.01571

07.03.2026 00:16 👍 0 🔁 0 💬 0 📌 0

Beyond Length Scaling: Synergizing Breadth and Depth for Generative Reward Models Recent advancements in Generative Reward Models (GRMs) have demonstrated that scaling the length of Chain-of-Thought (CoT) reasoning considerably enhances the reliability of evaluation. However, curre...

Links: abs, pdf
Search: Bluesky, Twitter, Reddit, Hacker News, Hugging Face, alphaXiv

07.03.2026 00:16 👍 0 🔁 0 💬 1 📌 0

Paper page - Beyond Length Scaling: Synergizing Breadth and Depth for Generative Reward Models Join the discussion on this paper page

(1/1) 32 Likes, 2 Comments, 04 Mar 2026, Hugging Face

07.03.2026 00:16 👍 0 🔁 0 💬 1 📌 0

Recent advancements in Generative Reward Models (GRMs) have demonstrated that scaling the length of Chain-of-Thought (CoT) reasoning considerably enhances the reliability of evaluation. However, current works predominantly rely on unstructured length scaling, ignoring the divergent efficacy of different reasoning mechanisms: Breadth-CoT (B-CoT, i.e., multi-dimensional principle coverage) and Depth-CoT (D-CoT, i.e., substantive judgment soundness). To address this, we introduce Mix-GRM, a framework that reconfigures raw rationales into structured B-CoT and D-CoT through a modular synthesis pipeline, subsequently employing Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR) to internalize and optimize these mechanisms. Comprehensive experiments demonstrate that Mix-GRM establishes a new state-of-the-art across five benchmarks, surpassing leading open-source RMs by an average of 8.2%. Our results reveal a clear divergence in reasoning: B-CoT benefits subjective preference tasks, whereas D-CoT excels in objective correctness tasks. Consequently, misaligning the reasoning mechanism with the task directly degrades performance. Furthermore, we demonstrate that RLVR acts as a switching amplifier, inducing an emergent polarization where the model spontaneously allocates its reasoning style to match task demands. The synthesized data and models are released at Hugging Face (https://huggingface.co/collections/DonJoey/mix-grm), and the code is released at GitHub (https://github.com/Don-Joey/Mix-GRM).

[9/30] 32 Likes, 2 Comments, 1 Posts
2603.01571, cs․AI, 02 Mar 2026

🆕Beyond Length Scaling: Synergizing Breadth and Depth for Generative Reward Models

Qiyuan Zhang, Yufei Wang, Tianhe Wu, Can Xu, Qingfeng Sun, Kai Zheng, Xue Liu, Chen Ma

07.03.2026 00:16 👍 1 🔁 0 💬 1 📌 0

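We present Nanbeige4.1-3B, a unified generalist language model that simultaneously achieves strong agentic behavior, code generation, and general reasoning with only 3B parameters. To the best of our knowledge, it is the first open-source small language model (SLM) to achieve such versatility in a single model. To improve reasoning and preference alignment, we combine point-wise and pair-wise reward modeling, ensuring high-quality, human-aligned responses. For code generation, we design complexity-aware rewards in reinforcement learning, optimizing both correctness and efficiency. For deep search, we perform complex data synthesis and incorporate turn-level supervision during training. This enables stable long-horizon tool interactions, allowing Nanbeige4.1-3B to reliably execute up to 600 tool-call turns for complex problem-solving. Extensive experimental results show that Nanbeige4.1-3B significantly outperforms prior models of similar scale, such as Nanbeige4-3B-2511 and Qwen3-4B, and even achieves superior performance compared with much larger models such as Qwen3-30B-A3B. Our results demonstrate that small models can achieve broad competence and strong specialization at the same time, redefining the potential of 3B-parameter models.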

2602.13367

07.03.2026 00:16 👍 0 🔁 0 💬 1 📌 0

Nanbeige4.1-3B: A Small General Model that Reasons, Aligns, and Acts We present Nanbeige4.1-3B, a unified generalist language model that simultaneously achieves strong agentic behavior, code generation, and general reasoning with only 3B parameters. To the best of our ...

Links: abs, pdf
Search: Bluesky, Twitter, Reddit, Hacker News, Hugging Face, alphaXiv

07.03.2026 00:16 👍 0 🔁 0 💬 1 📌 0

Paper page - Nanbeige4.1-3B: A Small General Model that Reasons, Aligns, and Acts Join the discussion on this paper page

(1/1) 31 Likes, 3 Comments, 17 Feb 2026, Hugging Face

07.03.2026 00:16 👍 0 🔁 0 💬 1 📌 0

We present Nanbeige4.1-3B, a unified generalist language model that simultaneously achieves strong agentic behavior, code generation, and general reasoning with only 3B parameters. To the best of our knowledge, it is the first open-source small language model (SLM) to achieve such versatility in a single model. To improve reasoning and preference alignment, we combine point-wise and pair-wise reward modeling, ensuring high-quality, human-aligned responses. For code generation, we design complexity-aware rewards in Reinforcement Learning, optimizing both correctness and efficiency. In deep search, we perform complex data synthesis and incorporate turn-level supervision during training. This enables stable long-horizon tool interactions, allowing Nanbeige4.1-3B to reliably execute up to 600 tool-call turns for complex problem-solving. Extensive experimental results show that Nanbeige4.1-3B significantly outperforms prior models of similar scale, such as Nanbeige4-3B-2511 and Qwen3-4B, even achieving superior performance compared to much larger models, such as Qwen3-30B-A3B. Our results demonstrate that small models can achieve both broad competence and strong specialization simultaneously, redefining the potential of 3B parameter models.

[10/30] 31 Likes, 3 Comments, 1 Posts
2602.13367, cs․AI | cs․CL, 13 Feb 2026

🆕Nanbeige4.1-3B: A Small General Model that Reasons, Aligns, and Acts

Chen Yang, Guangyue Peng, Jiaying Zhu, Ran Le, Ruixiang Feng, Tao Zhang, Xiyun Xu, Yang Song, Yiming Jia, Yuntao Wen, Yunzhi Xu, Zekai Wang, Zhen...

07.03.2026 00:16 👍 1 🔁 0 💬 1 📌 0

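Extended reality (XR) demands generative models that respond to users' tracked real-world motion, yet current video world models accept only coarse control signals such as text or keyboard input, limiting their utility for embodied interaction. We introduce a human-centric video world model that is conditioned on both tracked head pose and joint-level hand poses. For this purpose, we evaluate existing diffusion transformer conditioning strategies and propose an effective mechanism for 3D head and hand control, enabling dexterous hand-object interactions. We train a bidirectional video diffusion model teacher using this strategy and distill it into a causal, interactive system that generates egocentric virtual environments. We evaluate this generated reality system with human subjects and demonstrate improved task performance as well as a significantly higher perceived level of control over the performed actions compared with relevant baselines.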

2602.18422

07.03.2026 00:16 👍 0 🔁 0 💬 0 📌 0

Generated Reality: Human-centric World Simulation using Interactive Video Generation with Hand and Camera Control Extended reality (XR) demands generative models that respond to users' tracked real-world motion, yet current video world models accept only coarse control signals such as text or keyboard input, limi...

Links: abs, pdf
Search: Bluesky, Twitter, Reddit, Hacker News, Hugging Face, alphaXiv

07.03.2026 00:16 👍 0 🔁 0 💬 1 📌 0

Paper page - Generated Reality: Human-centric World Simulation using Interactive Video Generation with Hand and Camera Control Join the discussion on this paper page

(1/1) 30 Likes, 5 Comments, 23 Feb 2026, Hugging Face

07.03.2026 00:16 👍 0 🔁 0 💬 1 📌 0

Extended reality (XR) demands generative models that respond to users' tracked real-world motion, yet current video world models accept only coarse control signals such as text or keyboard input, limiting their utility for embodied interaction. We introduce a human-centric video world model that is conditioned on both tracked head pose and joint-level hand poses. For this purpose, we evaluate existing diffusion transformer conditioning strategies and propose an effective mechanism for 3D head and hand control, enabling dexterous hand-object interactions. We train a bidirectional video diffusion model teacher using this strategy and distill it into a causal, interactive system that generates egocentric virtual environments. We evaluate this generated reality system with human subjects and demonstrate improved task performance as well as a significantly higher perceived level of control over the performed actions compared with relevant baselines.

[11/30] 30 Likes, 5 Comments, 1 Posts
2602.18422, cs․CV, 20 Feb 2026

🆕Generated Reality: Human-centric World Simulation using Interactive Video Generation with Hand and Camera Control

Linxi Xie, Lisong C. Sun, Ashley Neall, Tong Wu, Shengqu Cai, Gordon Wetzstein

07.03.2026 00:16 👍 0 🔁 0 💬 1 📌 0

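Internal modelling of the world -- predicting transitions between previous states $X$ and next states $Y$ under actions $Z$ -- is essential to reasoning and planning for LLMs and VLMs. Learning such models typically requires costly action-labelled trajectories. We propose SWIRL, a self-improvement framework that learns from state-only sequences by treating actions as a latent variable and alternating between Forward World Modelling (FWM) $P_θ(Y|X,Z)$ and Inverse Dynamics Modelling (IDM) $Q_φ(Z|X,Y)$. SWIRL iterates two phases: (1) Variational Information Maximisation, which updates the FWM to generate next states that maximise conditional mutual information with latent actions given prior states, encouraging identifiable consistency; and (2) ELBO Maximisation, which updates the IDM to explain observed transitions, effectively performing coordinate ascent. Both models are trained with reinforcement learning (specifically, GRPO), using the log-probability of the opposite, frozen model as the reward signal. We provide theoretical learnability guarantees for both updates and evaluate SWIRL on LLMs and VLMs across multiple environments: single-turn and multi-turn open-world visual dynamics, and synthetic textual environments for physics, web, and tool calling. SWIRL achieves gains of 16% on AURORABench, 28% on ByteMorph, 16% on WorldPredictionBench, and 14% on StableToolBench.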

2602.06130

07.03.2026 00:16 👍 0 🔁 0 💬 0 📌 0

Self-Improving World Modelling with Latent Actions Internal modelling of the world -- predicting transitions between previous states $X$ and next states $Y$ under actions $Z$ -- is essential to reasoning and planning for LLMs and VLMs. Learning such m...

Links: abs, pdf
Search: Bluesky, Twitter, Reddit, Hacker News, Hugging Face, alphaXiv

07.03.2026 00:16 👍 0 🔁 0 💬 1 📌 0

Paper page - Self-Improving World Modelling with Latent Actions Join the discussion on this paper page

(1/1) 30 Likes, 2 Comments, 09 Feb 2026, Hugging Face

07.03.2026 00:15 👍 0 🔁 0 💬 1 📌 0

Internal modelling of the world -- predicting transitions between previous states $X$ and next states $Y$ under actions $Z$ -- is essential to reasoning and planning for LLMs and VLMs. Learning such models typically requires costly action-labelled trajectories. We propose SWIRL, a self-improvement framework that learns from state-only sequences by treating actions as a latent variable and alternating between Forward World Modelling (FWM) $P_θ(Y|X,Z)$ and Inverse Dynamics Modelling (IDM) $Q_φ(Z|X,Y)$. SWIRL iterates two phases: (1) Variational Information Maximisation, which updates the FWM to generate next states that maximise conditional mutual information with latent actions given prior states, encouraging identifiable consistency; and (2) ELBO Maximisation, which updates the IDM to explain observed transitions, effectively performing coordinate ascent. Both models are trained with reinforcement learning (specifically, GRPO) with the opposite frozen model's log-probability as a reward signal. We provide theoretical learnability guarantees for both updates, and evaluate SWIRL on LLMs and VLMs across multiple environments: single-turn and multi-turn open-world visual dynamics and synthetic textual environments for physics, web, and tool calling. SWIRL achieves gains of 16% on AURORABench, 28% on ByteMorph, 16% on WorldPredictionBench, and 14% on StableToolBench.
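The two phases named in the abstract correspond to two textbook variational bounds, written out below in the abstract's notation. This is a sketch of the standard forms, not the paper's exact objectives: it assumes a latent-action prior $p(Z \mid X)$, which the abstract does not specify, and the paper's formulation may differ in details such as the prior or any weighting terms.

```latex
% Assumed generative story: Z ~ p(Z | X), then Y ~ P_\theta(Y | X, Z).

% (1) Variational lower bound on the conditional mutual information (FWM update).
%     The learnable term is the IDM log-probability of the latent action.
I(Y; Z \mid X) \;\ge\; H(Z \mid X)
  + \mathbb{E}_{p(X)\, p(Z \mid X)\, P_\theta(Y \mid X, Z)}
    \big[ \log Q_\phi(Z \mid X, Y) \big]

% (2) Evidence lower bound on the transition likelihood (IDM update).
%     The learnable term is the FWM log-probability of the observed next state.
\log P_\theta(Y \mid X) \;\ge\;
  \mathbb{E}_{Q_\phi(Z \mid X, Y)} \big[ \log P_\theta(Y \mid X, Z) \big]
  - \mathrm{KL}\big( Q_\phi(Z \mid X, Y) \,\|\, p(Z \mid X) \big)
```

Read this way, each bound contains the other model's log-probability, which is consistent with the abstract's statement that each model is trained using the opposite, frozen model's log-probability as its reward.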

[12/30] 30 Likes, 2 Comments, 1 Posts
2602.06130, cs․LG | cs․AI | cs․CL, 15 Feb 2026

🆕Self-Improving World Modelling with Latent Actions

Yifu Qiu, Zheng Zhao, Waylon Li, Yftah Ziser, Anna Korhonen, Shay B. Cohen, Edoardo M. Ponti

07.03.2026 00:15 👍 1 🔁 0 💬 1 📌 0

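Despite sustained scaling of model capacity and data acquisition, Vision-Language-Action (VLA) models remain brittle in contact-rich and dynamic manipulation tasks, where minor execution deviations can cascade into failure. While reinforcement learning (RL) offers a principled path to robustness, on-policy RL in the physical world is constrained by safety risks, hardware cost, and environment resets. To bridge this gap, we propose RISE, a scalable framework for robotic reinforcement learning via imagination. At its core is a compositional world model that (i) predicts multi-view futures through a controllable dynamics model, and (ii) evaluates the imagined outcomes with a progress value model, providing informative signals for policy improvement. This compositional design allows state and value to be handled by architectures and objectives that are best suited to each yet distinct from one another. These components are integrated into a closed-loop self-improvement pipeline that continuously generates imagined rollouts, estimates advantages, and updates the policy in imagined space, without costly physical interaction. On three challenging real-world tasks, RISE achieves substantial improvements over prior work: an absolute performance gain of more than 35% on dynamic brick sorting, 45% on backpack packing, and 35% on box closing.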

2602.11075

07.03.2026 00:15 👍 0 🔁 0 💬 0 📌 0

RISE: Self-Improving Robot Policy with Compositional World Model Despite the sustained scaling on model capacity and data acquisition, Vision-Language-Action (VLA) models remain brittle in contact-rich and dynamic manipulation tasks, where minor execution deviation...

Links: abs, pdf
Search: Bluesky, Twitter, Reddit, Hacker News, Hugging Face, alphaXiv

07.03.2026 00:15 👍 0 🔁 0 💬 1 📌 0
