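VLA Unified Survey

For robots that see, understand, and act:
Everything about Vision-Language-Action Models

An atomic-level integration of 14 survey papers × VLA textbook slides
DGIST APRL Lab · Prof. Giseop Kim · April 2026

Table of Contents & Structure Map

This document presents VLA in the order definition → history → taxonomy → architecture → representation → training → efficiency → applications → evaluation → outlook. The order follows a natural progression: first understand what a VLA is, then dissect how it works, analyze how it is trained, explore where it is deployed, and finally survey what remains open.

1. Introduction → 2. Chronology → 3. Taxonomy → 4. Architecture → 5. Tokenization → 6. Training → 7. Efficiency → 8. Applications → 9. Evaluation → 10. Outlook → Ref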


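Selection criteria for sub-items: the sub-items of each section were selected by three criteria: (1) topics covered in common by at least 3 of the 14 surveys, (2) topics covered by only a single survey but of high technical importance, and (3) topics that can only be derived from cross-survey analysis.

Comparison Table of the 14 Reference Surveys

Legend: ● covered in depth ◐ partial/indirect ○ not covered | The columns correspond to the 10 core topics of this unified survey.

#	Survey	arXiv	Date	Architecture	Taxonomy	Action tokenization	Training paradigm	RL post-training	Efficiency / lightweighting	Manipulation	Autonomous driving	Benchmarks	Future outlook	Specialization and value
[1]	Ma, Q. et al.	2405.14093	2024.05	◐	◐	○	◐	○	○	◐	◐	◐	●	Bird's-eye view of embodied AI as a whole. Not VLA-only: a survey of the embodied-AI foundation-model ecosystem spanning LLM/VLM/VLA, and the first comprehensive reference to place VLA within the broader AI landscape. As one of the earliest surveys, it centers on RT-2/PaLM-E and does not include 2025 models.
[2]	Kawaharazuka, K. et al.	2402.05741	2024.02	◐	○	○	●	○	○	●	○	●	◐	Specialized in real-world deployment experience. Its core value is the practical insight accumulated from deploying real robots outside the lab (data-collection know-how, failure-mode analysis). The only survey that focuses on "what breaks in practice" rather than on architectural diversity.
[3]	Zhong, Z. et al.	2509.19012	2025.09	●	●	●	●	◐	◐	●	○	●	●	The most systematic Pure VLA classification framework. Covers only Pure VLAs, which "output image + language → action end-to-end with a single model", and proposes a three-axis taxonomy of architecture/training/data. It deliberately excludes modular pipelines such as SayCan.
[4]	Yu, Z. et al.	2510.24795	2025.10	●	◐	◐	◐	○	●	◐	○	●	◐	The deepest analysis of efficiency and lightweighting. The only survey that quantitatively compares deployment-optimization techniques such as quantization (PTQ/QAT), pruning, distillation, token caching, and LoRA across 12 models. It systematically answers the question "can performance be maintained going from 55B to 450M parameters?"
[5]	Liu, N. & Shao, R. et al.	2508.13073	2025.08	●	●	●	●	◐	◐	●	○	●	●	The most detailed benchmark analysis for the manipulation domain. Focused on manipulation (grasping, bimanual, dexterous), it gives the deepest analysis of the LIBERO/CALVIN/SimplerEnv benchmarks. It clearly presents the monolithic/hierarchical architecture dichotomy and deliberately excludes navigation and autonomous driving.
[6]	Zhang, Y. et al.	2505.04769	2025.05	●	◐	◐	●	○	○	●	◐	●	●	A broad overview of VLA concepts and applications. Its strength is breadth of concepts and application domains rather than technical depth. It is optimized for giving VLA newcomers a quick map of the whole field, and covers non-mainstream domains such as medicine, agriculture, and GUI.
[7]	Chen, Y. et al.	2507.01925	2025.07	◐	◐	●	◐	○	○	◐	○	◐	○	The deepest and most specialized analysis of action tokenization. Systematically classifies 8 token types (language/code/affordance/trajectory/goal/latent/raw/reasoning) and dissects discrete vs. continuous representations, the frequency spectrum, and the trade-offs of action chunking. Modules other than tokenization receive only minimal coverage.
[8]	Xu, C. et al.	2512.11362	2025.12	●	●	●	●	○	◐	●	◐	●	●	The only approach that analyzes VLA as an "anatomy". Systematizes per-module design choices using the organ analogy perception (eyes) → brain (VLM) → action (hands). Among the most recent surveys (2025.12), it includes GR00T N1 and π0.5 but barely covers RL post-training.
[9]	Jin, A. et al.	2506.20966	2025.06	◐	○	◐	●	●	○	◐	○	◐	●	A specialist survey at the frontier of VLA + RL. Gives the most detailed analysis of post-training techniques (PPO/GRPO/DPO/ConRFT [69]) that use RL to overcome the limits of BC. It spans online/offline RL, reward design, and preference optimization, but its analysis of the pretraining stage is weak.
[10]	Jiang, H. et al.	2506.24044	2025.06	●	●	◐	●	◐	◐	○	●	●	●	The most detailed specialist analysis of autonomous-driving (AD) VLA. Systematizes the four-generation evolution of AD-VLA, including EMMA/ORION [47]/DriveMoE [49]/AutoVLA [48], and analyzes safety, V2V cooperation, and simulators (CARLA/Bench2Drive) in depth. The manipulation and navigation domains are excluded entirely.
[41]	Hu, T. et al.	2512.16760	2025.12	●	●	◐	◐	○	◐	○	●	●	●	A fine-grained taxonomy of AD-VLA. A 2×2 classification of End-to-End VLA (textual/numerical action) and Dual-System VLA (explicit guidance/implicit transfer); proposes the WorldBench unified evaluation platform. A more fine-grained AD-specific taxonomy than Jiang et al. [10].
[42]	Edge Survey	2603.16952	2026.03	◐	○	○	○	○	●	◐	◐	◐	●	A system-level analysis of edge deployment. Identifies the "Deployment Gauntlet" of 7 coupled constraints (size/weight/power/memory/compute/timing/safety). Proposes architecture-specific optimization: VLA is bottlenecked by memory bandwidth, diffusion by compute latency.
[43]	Guan, W. et al.	2510.17111	2025.10	◐	◐	◐	◐	○	●	●	○	◐	◐	A four-dimensional taxonomy of efficient manipulation. Classifies efficiency along four axes: architecture / perception feature extraction / action generation / training and inference. Whereas Yu et al. [4] centers on model compression, this survey treats efficiency of the whole perception-action pipeline as an independent dimension.
[44]	Large Model Embodied AI	2508.10399	2025.08	●	◐	○	●	◐	○	◐	◐	◐	●	A decision-making framework based on large models. Splits decision-making into hierarchical vs. end-to-end paradigms and analyzes world models as a third axis connecting the two. Views VLA from a decision-making perspective.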

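Section 1: Introduction - What Is a VLA?

1.1 Defining VLA: Three Perspectives

A Vision-Language-Action (VLA) model is a unified neural network with which a robot observes a scene through cameras, understands natural-language instructions, and directly generates physical actions. The research community, however, has not yet reached full agreement on what the term "VLA" means. Three different definitions currently coexist, and understanding the scope of each is the first step in navigating the field.

Narrow definition: the original meaning coined by RT-2 [11]. In 2023, Google DeepMind's RT-2 [11] paper used the term "VLA" for the first time. Here VLA carries a clear technical prescription: a model that directly fine-tunes a large pretrained Vision-Language Model (VLM) to predict robot actions. RT-2 [11] took the existing VLMs PaLI-X (55B) and PaLM-E [18] (12B), expressed robot actions as text tokens, and mapped those discretized actions onto tokens in the VLM's output vocabulary. Under this definition the essence of VLA is "transfer of internet-scale vision-language knowledge to robot actions", and the presence of a VLM backbone is a necessary condition.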

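Broad definition: the inclusive category of Ma et al. [1] (2024). Surveying embodied AI as a whole, Ma et al. [1] define VLA far more broadly. For them, VLA is a category covering every model that takes vision and language inputs and generates actions. Under this definition VLA includes modular systems such as SayCan [14], where an LLM does only high-level planning and a separate policy handles low-level control; models such as RT-1, designed with their own architecture and no VLM backbone; and even code-generating systems such as Code as Policies. This broad definition is useful for mapping the whole field, but it is criticized for diluting the technical specificity of the term "VLA".

Pure definition: the "Pure VLA" of Zhong et al. [3] (2025). Proposing the most recent taxonomy, Zhong et al. [3] introduced the concept of "Pure VLA" to resolve the tension between the two definitions above. A Pure VLA is a model that unifies perception, language understanding, and action generation within a single end-to-end sequence-modeling framework. There are three key criteria: (1) both vision and language are used as inputs, (2) actions are the direct output of the model (with no separate low-level controller in between), and (3) the entire pipeline is integrated into a single trainable model. By these criteria SayCan [14] (modular structure) and Code as Policies (code output) are not VLAs, whereas RT-2 [11], OpenVLA [15], and π0 [16] are Pure VLAs. Zhong et al. [3] further subdivide Pure VLA into four categories: (1) autoregressive (RT-2 [11], OpenVLA [15]), (2) diffusion-based (π0 [16], CogACT [23]), (3) reinforcement-learning-based fine-tuning, and (4) hybrid and specialized methods. This taxonomy considers the generation method of the action decoder together with the training paradigm.

Beyond these three definitions, Kawaharazuka et al. [2] propose their own boundary criterion: only "systems that take visual observations and natural-language instructions as required inputs and directly generate control commands" count as VLAs, and "high-level policies that select from predefined skill indices" are explicitly excluded. This places SayCan [14]-style skill-selection systems outside VLA, in line with the Pure VLA definition of Zhong et al. [3], but it differs in taking "direct generation of control commands", rather than the presence of a VLM backbone, as the key criterion.

This document acknowledges all four definitions but in practice uses the Pure VLA definition of Zhong et al. [3] as its central axis. Modular systems (SayCan [14], Inner Monologue [22]) and hierarchical structures (π0.5 [31], GR00T N1 [21]) are nevertheless covered where appropriate as important parts of the VLA ecosystem.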

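1.2 Why VLA: The Nature of the Paradigm Shift

For some 50 years, the traditional robotics architecture has stood on a three-part structure of perception, planning, and control (the sense-plan-act pipeline). A perception module processes sensor data, a planning module searches for a path to the goal, and a control module generates joint commands. Each module is designed independently, and modules communicate through hand-defined representations (e.g., object poses, grid maps, joint trajectories).

This separation offered mathematical rigor and ease of debugging, but it carried three fundamental bottlenecks. First, the representation bottleneck: the information passed between modules is limited by the expressiveness of the hand-crafted representations. To execute the instruction "put the blue plate next to the red mug into the sink", a perception module must detect the objects, a language module must parse the instruction, a grasping module must compute a grasp pose, and a motion planner must generate a collision-free path. Information is lost at every interface, and the failure of one module brings down the whole pipeline. Second, the generalization bottleneck: because each module is tuned independently for specific environments or objects, adapting to a new environment or object requires re-engineering the entire pipeline. Third, the knowledge-utilization bottleneck: the internet holds billions of images and trillions of words of text, but the traditional pipeline has no path for using this large-scale prior knowledge.

VLA attacks all three bottlenecks at once. By connecting sensor input to action output as a single differentiable function it removes the representation bottleneck, and by inheriting the pretrained knowledge of an internet-scale VLM it relieves the generalization and knowledge-utilization bottlenecks together. The process by which a robot "sees (Vision), understands (Language), and acts (Action)" happens end-to-end inside one neural network.

An analogy captures the shift. If traditional robotics was "an international conference conducted through a chain of interpreters", VLA is "a one-on-one conversation in a shared native language". The errors and delays of intermediate translation disappear, and context and nuance are conveyed more fully.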

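1.3 Limits of the 14 Surveys and the Reason for This Document

From early 2024 through early 2026, survey papers on VLA were published at an explosive rate. The 14 core surveys this document draws on are as follows:

Survey	Core perspective	Unique strength	Main limitation
Ma et al. [1] (2024)	Bird's-eye view of embodied AI	Places VLA within the large foundation-model ecosystem	Lacks technical depth on VLA itself
Kawaharazuka et al. [2] (2025)	Real-world deployment	Practical insight drawn from deployment experience; 7-category architecture taxonomy (VLM+Discrete, VLM+Diffusion, VLM+Flow Matching, etc.)	Weak analysis of training paradigms (pretraining, RL post-training)
Zhong et al. [3] (2025)	Pure VLA taxonomy	The most systematic VLA classification framework	Excludes non-Pure VLA families
Yu et al. [4] (2025)	Efficiency and lightweighting	In-depth analysis of inference cost, quantization, and caching	Does not cover training paradigms as a whole
Liu & Shao [5] (2025)	Manipulation	Detailed manipulation-centered benchmark analysis	Excludes navigation, autonomous driving, etc.
Zhang et al. [6] (2025)	Concepts and applications	Broad survey of concepts and application list	Emphasizes overview over technical depth
Chen et al. [7] (2025)	Action tokenization	The deepest analysis of tokenization techniques	Architectural elements other than tokenization are absent
Xu et al. [8] (2025)	VLA anatomy	Module-by-module dissection of input-processing-output	Does not cover post-training
Jin et al. [9] (2025)	RL post-training	Latest trends in combining VLA + RL	Weak analysis of the pretraining stage
Jiang et al. [10] (2025)	Autonomous-driving VLA	The most detailed analysis of AD-VLA	Excludes the manipulation/navigation domains
Hu et al. [41] (2025)	Past/present/future of autonomous-driving VLA	AD-specific taxonomy of End-to-End vs Dual-System VLA; proposes WorldBench	A more fine-grained AD-VLA taxonomy than Jiang et al. [10]
2026 Edge Survey [42] (2026)	System bottlenecks of edge deployment	"Deployment Gauntlet" of 7 coupled constraints; VLA = memory-bandwidth bottleneck, diffusion = compute-latency bottleneck	System-level analysis rather than model compression
Guan et al. [43] (2025)	Efficient VLA for manipulation	Four-dimensional efficiency taxonomy (architecture/perception/action/training)	An efficiency perspective independent of Yu et al. [4]
Large Model Embodied AI [44] (2025)	Large-model-based embodied-AI decision-making	Hierarchical vs end-to-end decision-making; world-model integration	Views VLA from a decision-making perspective

※ Ma et al. [1] is a living survey, continuously updated from its first 2024 version through v7 (2026.02).

Each survey illuminates VLA through its own lens, but no single survey captures the whole picture. Surveys that treat architecture in depth neglect deployment, surveys that focus on efficiency omit training paradigms, and surveys specialized to one domain miss the intersections with other domains.

This document decomposes these 14 surveys down to the atomic level and builds a superset of their information. This is not a simple merge. Where different surveys analyze the same model from different angles, the document integrates those crossing perspectives to give a richer understanding than any individual survey. For example, for π0 [16], Zhong et al. [3] analyze its architectural classification, Yu et al. [4] its inference efficiency, and Jin et al. [9] the application of RL post-training; this document weaves the three perspectives into a single integrated profile.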

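How This Document Differs

If the 14 surveys above are specialist surveys that each dig deep along one axis, this document is a meta-survey (survey of surveys) that crosses those axes. Concretely, it offers the following contributions that appear in no individual survey:

Axis of differentiation	Limitation of individual surveys	Contribution of this document
Five-axis unified taxonomy	Each survey uses an independent taxonomy (architecture axis, anatomy axis, function axis, etc. are separate)	Unifies the five axes of architecture, action generation, anatomy, function, and post-training into one meta-taxonomy and states the correspondences between surveys
Cross-model profiles	Different surveys analyze the same model (e.g., π0) piecemeal from different perspectives	Provides per-model integrated profiles that combine the perspectives of the 14 surveys (architecture + efficiency + RL post-training viewed together)
ICLR 2026 trends	Most surveys include literature only up to mid-2025	Analyzes 164 ICLR 2026 submissions (Reuss, 2026) and reflects the latest trends such as discrete-diffusion VLA, ECoT [92], and self-improving RL
Emergent insights	The conclusions of individual surveys are drawn only within their own perspective	Cross-analyzes the 14 surveys to derive 10 emergent insights not stated in any individual survey (Section 11)
System view of edge deployment	The efficiency surveys [4][43] focus on model compression and do not analyze system-level constraints	Integrates the 7 Deployment Gauntlet constraints of the 2026 Edge Survey [42] to present a model-system co-design perspective
Decision-making framework	Hierarchical and end-to-end are compared only technically, from an architectural perspective	Adopts the decision-making paradigms of Large Model Embodied AI [44] and positions world models as a third axis between the two paradigms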

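1.4 Structure of This Document

The document is organized as follows:

Section 1 (this section): definition and importance of VLA, and the purpose of this document
Section 2: Chronology of VLA evolution, the story of its development from 2017 to 2026
Section 3: Unified taxonomy, bringing the taxonomies of the 14 surveys into one
Section 4: Architecture in depth: Vision Encoder, VLM Backbone, Action Decoder
Section 5: Action tokenization: how continuous actions are translated into the model's language
Section 6: Training paradigms: Behavior Cloning, pretraining strategies, RL post-training
Section 7: Efficiency: lightweighting and optimization for real-world deployment
Section 8: Application domains: manipulation, humanoids, autonomous driving, drones, medicine
Section 9: Datasets, benchmarks, simulators
Section 10-11: Open problems, integrated insights, conclusion

Each section digests all the relevant content of the 14 surveys and, on top of that, explicitly provides integrated cross-survey insights.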

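Section 2: Chronology of VLA Evolution (2017–2026)

VLA did not appear out of nowhere. Three independent streams, computer vision, natural-language processing, and robot learning, each carved their own canyon for decades before meeting at a single confluence in the early 2020s. This section traces the story of that confluence in four phases.

Phase 0 - Convergence of the Foundational Technologies (2017–2021)

The vision revolution: from CNN to ViT, and then CLIP

After AlexNet overwhelmed traditional computer vision in the 2012 ImageNet competition, CNNs evolved rapidly. ResNet (2015) showed residual learning that allows networks as deep as 152 layers, and EfficientNet (2019) showed balanced scaling of width, depth, and resolution. The real turning point, however, was the Vision Transformer (ViT [28]) in 2020. By splitting an image into 16x16 patches and processing them with Transformer self-attention, ViT showed that the scaling behavior validated in NLP also carries over to vision. More data, larger models, better performance: this simple formula fundamentally changed the direction of vision research.

What influenced VLA more directly than ViT, however, was OpenAI's CLIP in 2021. CLIP performed contrastive learning on 400 million image-text pairs and aligned vision and language in the same embedding space, so that a photo of a cat and the text "a photo of a cat" end up close together in vector space. This vision-language alignment became a core prerequisite for VLA: when a robot is told "pick up the red mug", the ability to connect the language concept "red mug" to the visual object seen by the camera comes precisely from CLIP [27]-style pretraining. CLIP's successor SigLIP (2023) achieved more efficient training with a sigmoid loss, and DINOv2 [30] (2023) learned label-free visual representations by self-supervised learning; both were later widely adopted as VLA vision encoders.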
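
The alignment idea described above can be sketched in a few lines. This is a minimal NumPy illustration of a symmetric image-text contrastive loss, not CLIP's actual training code; the function name `clip_style_loss`, the toy embeddings, and the temperature value are assumptions made for illustration.

```python
import numpy as np

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of N paired embeddings.

    Row i of image_emb is assumed to be paired with row i of text_emb.
    """
    # L2-normalize so the dot product is a cosine similarity.
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # shape (N, N)

    def cross_entropy_on_diagonal(l):
        # Numerically stable log-softmax per row; the correct class is the diagonal.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))               # toy stand-in for encoder outputs
aligned_loss = clip_style_loss(emb, emb)    # every pair matches
shuffled_loss = clip_style_loss(emb, emb[::-1])  # every pair is mismatched
```

Matched pairs yield a lower loss than mismatched pairs, which is the pressure that pulls an image and its caption together in the shared embedding space.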

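The language explosion: from GPT-3 to the era of large language models

After the Transformer architecture appeared in 2017, NLP entered an era of rapid scaling. BERT (2018) demonstrated bidirectional context understanding and GPT-2 (2019) demonstrated autoregressive text generation, but GPT-3 [29] (2020), with 175 billion parameters, changed the paradigm itself. GPT-3 [29] showed "emergent capabilities", performing new tasks from few-shot prompts alone without explicit training, and this gave robotics researchers a decisive inspiration: if a sufficiently large model learns from sufficiently large data, abilities that were never explicitly programmed can appear on their own.

This observation led directly to two questions. First, could the world knowledge of LLMs (common sense, physical intuition, task procedures) be used for robot planning? Second, might LLM scaling laws also apply to robot policies? The first question was answered by SayCan [14] and Inner Monologue [22] in 2022, and the second by RT-2 [11] in 2023.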

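The wall in robot learning: Behavior Cloning and the data limit

In the same period, robot learning was wrestling with difficulties of its own. Behavior Cloning (BC) was the most direct way to learn a policy, by imitating expert demonstrations, but distribution shift acted as a fundamental limit: once the policy drifts into states not seen during training, errors compound over the horizon (quadratically in the horizon length in the worst case). DAgger [97] (2011) addressed this problem in theory, but repeatedly obtaining real-time expert corrections in the real world was impractical.

Offline reinforcement learning (Offline RL) drew attention as an alternative to BC. CQL (2020), IQL (2021), and others proposed ways to improve a policy using only previously collected data. These too, however, suffered from the limits of conservative estimation and from hyperparameter sensitivity, and they presupposed large datasets of tens to hundreds of thousands of episodes. The problem was that collecting robot data is incomparably slower and more expensive than in NLP or CV. ImageNet had millions of images and GPT-3 [29] was trained on hundreds of billions of tokens, but the largest robot datasets of the time had at most a few thousand episodes.

Meanwhile, the simulator ecosystem began to partially relieve this data bottleneck. SAPIEN (2020) provided manipulation environments based on detailed physics simulation, AI2-THOR (2017) provided photorealistic household environments, and Habitat (2019) provided large-scale navigation environments. These simulators later became core infrastructure for VLA training and evaluation.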

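The key precursor: CLIPort (2021)

The point where these three streams (vision-language alignment, large language models, and robot learning) first crossed was CLIPort [26] (2021). Shridhar et al. combined CLIP's vision-language representation with the Transporter Network (a spatial action map for robot manipulation) to perform tabletop manipulation from language instructions. CLIPort [26] was an early demonstration of the core hypothesis that "vision-language knowledge pretrained at internet scale can transfer to robot actions". CLIPort [26] itself was not an end-to-end VLA but a hybrid that injects CLIP [27] features into Transporter; even so, this demonstration became a direct inspiration for RT-2 [11] and OpenVLA [15].

Phase 1 - The Birth of VLA (2022–2023)

The foundational technologies that converged in Phase 0 began to combine explosively from 2022. Like a chemical reaction that has crossed its activation energy, a handful of breakthrough models appeared in quick succession in this period and gave birth to a new field.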

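Gato and SayCan: the start of two approaches (2022)

In the first half of 2022, DeepMind and Google released two divergent experiments at almost the same time. DeepMind's Gato [13] (2022) was a "generalist agent" that converted text, images, robot actions, and game inputs all into tokens and trained them with a single 1.2B-parameter Transformer. Gato's innovation was conceptual: a proof that completely different modalities can be unified in one sequence-modeling framework. Its performance on each individual task fell short of specialist models, but it was the first to show that "one model can see, read, and act".

In the same period Google's SayCan [14] (2022) took the opposite philosophy. Instead of using an LLM (PaLM) to generate actions directly, it used the LLM only as a high-level planner. The structure is elegant: the LLM scores candidate next steps for a natural-language instruction (e.g., "bring me a cola" → "go to the kitchen" → "open the fridge" → "pick up the cola" → ...), a pretrained affordance function tied to the low-level policies evaluates how feasible each candidate is in the current state, and the LLM score is multiplied by the affordance to select the best executable action. SayCan [14] does not fit the Pure VLA definition of Zhong et al. [3], but as the first real-world demonstration of the idea of "connecting an LLM's world knowledge to a robot", it is a key ancestor in the VLA lineage.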
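
The "LLM score times affordance" selection rule described above can be sketched as follows. This is a toy illustration only: the function `select_skill` and all the numbers are hypothetical, whereas in SayCan the LLM scores come from a language model and the affordances from learned value functions.

```python
def select_skill(llm_scores, affordances):
    """Pick the skill maximizing (LLM usefulness) x (feasibility in the current state)."""
    combined = {
        skill: llm_scores[skill] * affordances.get(skill, 0.0)
        for skill in llm_scores
    }
    best = max(combined, key=combined.get)
    return best, combined

# Hypothetical scores for the instruction "bring me a cola", robot in the hallway.
llm_scores = {
    "pick up the cola": 0.6,   # the LLM thinks this is the most relevant step
    "go to the kitchen": 0.3,
    "open the fridge": 0.1,
}
affordances = {
    "pick up the cola": 0.05,  # no cola is visible, so picking is infeasible now
    "go to the kitchen": 0.9,
    "open the fridge": 0.2,
}

best_skill, combined_scores = select_skill(llm_scores, affordances)
print(best_skill)
```

The language model alone would choose "pick up the cola"; multiplying by the affordance shifts the choice to the step the robot can actually execute from where it stands.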

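Inner Monologue [22] (2022) took SayCan [14]'s idea one step further. Rather than having the LLM plan once and stop, it introduced a closed-loop structure in which environment feedback on execution results (success/failure, object-detection results, human corrections) is fed back as text and the plan is revised dynamically. This "inner monologue" is a direct ancestor of later reasoning-integrated VLAs such as CoT-VLA [55].

VIMA [45]: following multimodal prompts (2022)

VIMA [45] (2022), which appeared in the same period, was an encoder-decoder Transformer that generates robot actions from prompts in various modalities, including not only text but also image goals, video demonstrations, and bounding boxes. VIMA [45] advanced the view that "language is not the only instruction channel" and became a basis for later research on multimodal instruction following.

RT-1: proof of large-scale real-world learning (2022)

If SayCan [14] was a strategy of borrowing the "wisdom" of an LLM, RT-1 [12] (Robotics Transformer 1, 2022) chose a head-on approach. Over 17 months Google collected 130,000 real-world demonstration episodes with 13 robots and trained a Transformer model of about 35M parameters on them. RT-1's architecture was relatively modest (EfficientNet + TokenLearner + Transformer decoder). What RT-1 proved, however, was not architecture but scale: trained on sufficiently diverse, large-scale real-world data, a single model can perform more than 700 different tasks.

RT-1 also exposed an important failure. Generalization to objects or environments absent from the training data was extremely limited. This limitation led directly to the question "what if we inject the knowledge of internet-scale pretraining?", and the answer was RT-2 [11].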

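RT-2 and PaLM-E: the official birth of VLA (2023)

2023 is the year VLA got its name. RT-2 [11] (2023) took the existing VLMs PaLI-X (55B) and PaLM-E [18] (12B), encoded robot actions as text tokens (each action dimension is discretized into 256 bins and written as a string of integers), and co-fine-tuned by mixing a small amount of robot data into the VLM's original training data. The result was dramatic: although trained on the same robot data as RT-1, RT-2 [11] generalized to objects unseen in training ("put the dinosaur in the correct bin") and to abstract concepts ("pick up the object that is different from the others"). The internet-scale knowledge of the VLM had transferred to robot actions.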
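
The 256-bin encoding mentioned above can be illustrated with a short sketch. It assumes uniform bins over a known action range and a space-separated integer string; the helper names and the range [-1, 1] are illustrative choices, not details taken from the RT-2 paper.

```python
import numpy as np

NUM_BINS = 256  # bins per action dimension, as stated in the text

def discretize(action, low, high):
    """Map each continuous action dimension to an integer bin in [0, 255]."""
    clipped = np.clip(action, low, high)
    scaled = (clipped - low) / (high - low)              # now in [0, 1]
    return np.minimum((scaled * NUM_BINS).astype(int), NUM_BINS - 1)

def to_token_string(bins):
    """Write the bins as a string of integers, the form a language model can emit."""
    return " ".join(str(int(b)) for b in bins)

def undiscretize(bins, low, high):
    """Recover the bin-center value; the error is at most half a bin width."""
    return low + (np.asarray(bins) + 0.5) / NUM_BINS * (high - low)

action = np.array([-1.0, 0.0, 1.0])   # toy 3-dimensional action in [-1, 1]
bins = discretize(action, -1.0, 1.0)
token_string = to_token_string(bins)
recovered = undiscretize(bins, -1.0, 1.0)
print(token_string)  # prints "0 128 255"
```

The round trip loses at most half a bin width per dimension, which is the quantization error that later sections weigh against continuous action decoders.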

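PaLM-E [18] (2023), announced at almost the same time, is a giant 562B-parameter multimodal model that merges vision tokens, language tokens, and robot-state tokens into a single input sequence. PaLM-E [18] focused on multimodal understanding and planning rather than direct action generation, but it was the most extreme push of the scaling hypothesis that "one giant Transformer can digest vision, language, and robot actions together".

The message of RT-2 [11] and PaLM-E [18] was clear: a VLM is not just an image captioner; with appropriate fine-tuning it can become a policy that generates actions in the physical world. This is the core proposition of the VLA paradigm, and 2023, when the proposition was demonstrated, is year one of the VLA field.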

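Diffusion Policy: a new grammar for action generation (2023)

At almost the same time that RT-2 [11] proposed generating actions autoregressively as discrete tokens, Diffusion Policy [17] (Chi et al., 2023) opened a completely different paradigm for action generation. It applied DDPM (Denoising Diffusion Probabilistic Model), the family of models behind the image-generation revolution of DALL-E 2 and Stable Diffusion, to robot action generation.

The core insight of Diffusion Policy [17] lies in the multimodality of robot actions. For the instruction "pick up the cup" there is no single correct action: the robot can grasp from the right, from above, or by the handle. Training with a conventional mean-squared-error (MSE) loss learns the mean of this multimodal distribution and produces a meaningless action that belongs to none of the modes. Diffusion Policy [17] starts from Gaussian noise and generates an action sequence (action chunk) through iterative denoising, which represents multimodal distributions naturally. It also combines naturally with action chunking, generating a whole action sequence at once, and securing temporal consistency in this way is another key contribution.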
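
The mode-averaging failure described above is easy to reproduce numerically. The sketch below is a one-dimensional toy, assuming demonstrations that grasp either from the left (-1) or from the right (+1); it is not Diffusion Policy itself, and resampling the demonstrations merely stands in for any policy that samples from the action distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy demonstrations: the expert grasps from the left (-1) or the right (+1).
modes = rng.choice([-1.0, 1.0], size=1000)
demos = modes + 0.05 * rng.normal(size=1000)

# The constant prediction that minimizes MSE is the mean of the targets.
mse_optimal_action = demos.mean()
distance_to_nearest_mode = min(abs(mse_optimal_action - 1.0),
                               abs(mse_optimal_action + 1.0))

# A sampling-based policy draws from the distribution, so each draw lands in a mode.
sampled_actions = rng.choice(demos, size=5)

print(f"MSE-optimal action: {mse_optimal_action:+.3f}")
print(f"sampled actions:    {np.round(sampled_actions, 2)}")
```

The MSE-optimal action sits near zero, between the two valid grasps, while every sampled action stays close to one of them.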

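This approach had a fundamental influence on the later design of VLA action decoders. The Flow Matching of π0 [16], the DiT-based decoder of CogACT [23], and the diffusion transformer of RDT-1B [24] all developed on the grammar opened by Diffusion Policy [17].

Open X-Embodiment: a turning point in data integration (2023)

In late 2023, the Open X-Embodiment [19] (OXE [19]) project led by Google DeepMind fundamentally changed the data infrastructure of the VLA field. By integrating more than 1 million episodes collected on 22 robot platforms into a standardized format (RLDS), the dataset broke the paradigm in which each lab collected data only with its own robots. The key finding of OXE [19] was that policies trained on cross-embodiment data generalize better than policies trained on single-robot data. Data from different robots acted not as "noise" but as "diversity", preventing overfitting.

OXE [19] became the training-data foundation of almost every large VLA that followed. Octo [25], OpenVLA [15], and π0 [16] all used OXE [19] as core training data, and the very existence of OXE [19] made possible the research direction of the "generalist robot policy".

Phase 2 - Diversification and Explosive Growth (2024)

2024 was the Cambrian explosion of the VLA field. Building on the two paradigms demonstrated by RT-2 [11] and Diffusion Policy [17], autoregressive token generation and diffusion-based action generation, dozens of new models appeared and the landscape of the whole field was rapidly redrawn.

The arrival of generalist policies: Octo and OpenVLA

Octo [25] (2024) was the first generalist cross-platform policy to make full use of the OXE [19] dataset. A relatively small 93M-parameter model, it has a flexible Transformer-based architecture that supports both autoregressive and diffusion action heads. Octo's core contribution was less an architectural innovation than the establishment of the training recipe "cross-robot pretraining → target-robot fine-tuning". By showing that a new platform can be adapted to with only a small amount of target-robot data, it demonstrated the transfer-learning paradigm for VLA.

OpenVLA [15] (2024) was the turning point in the democratization of VLA. Whereas RT-2 [11] was a closed 55B-parameter model, OpenVLA [15] is a fully open-source 7B-parameter model: a Llama 2-based VLM fine-tuned on OXE [19] data, giving everyone a reproducible VLA. OpenVLA [15] showed a 16.5% higher absolute success rate than RT-2-X while reducing model size to about one seventh. This result carried the important implication that "a VLA does not necessarily need tens to hundreds of billions of parameters" and laid the groundwork for later research on small VLAs.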

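A new architectural paradigm: π0 and CogACT

The most influential architectural innovation of 2024 came from π0 [16] (Physical Intelligence, 2024). π0 [16] proposed a new architecture that reconciles the autoregressive approach of the RT-2 [11] family with the diffusion approach of Diffusion Policy [17]: a VLM backbone (based on PaliGemma, ~3B parameters) understands vision and language, and on top of it a Flow Matching-based Action Expert (~0.3B) generates continuous actions (about 3.3B parameters in total). Unlike DDPM, which iterates a stochastic reverse process, Flow Matching directly learns a deterministic ODE path (velocity field) from noise to data, enabling faster and more stable action generation.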
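
The noise-to-data ODE described above can be sketched as follows. This is a minimal illustration assuming the common linear interpolation path between noise and data; the function names, the step count, and the use of an oracle velocity in place of a trained network are all assumptions for illustration, not π0's implementation.

```python
import numpy as np

def flow_matching_training_pair(noise, action, t):
    """Build one regression example for conditional flow matching.

    Assumes the linear path x_t = (1 - t) * noise + t * action, whose
    velocity along the path is the constant (action - noise).
    """
    x_t = (1.0 - t) * noise + t * action
    target_velocity = action - noise
    return x_t, target_velocity

def euler_sample(velocity_fn, noise, num_steps=10):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (action) with Euler steps."""
    x = noise.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        x = x + dt * velocity_fn(x, k * dt)
    return x

rng = np.random.default_rng(0)
true_action = np.array([0.2, -0.5, 0.9])  # toy 3-dimensional action
noise = rng.normal(size=3)

# Stand-in for a trained network: an oracle that returns the exact target velocity.
x_half, oracle_velocity = flow_matching_training_pair(noise, true_action, t=0.5)
generated = euler_sample(lambda x, t: oracle_velocity, noise)
```

With a perfect velocity field a handful of deterministic steps carries the noise sample to the action, which is why flow-based decoders can get by with few integration steps; a learned field is only approximately correct, so real systems trade step count against accuracy.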

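The real impact of π0 [16] came not only from its architecture but also from its performance. On long-horizon compound manipulation tasks such as folding shirts and clearing a table, it outperformed the existing VLAs it was compared against, and the combination "VLM backbone + diffusion/flow action decoder" became a new reference point for VLA architecture.

CogACT [23] (2024) uses a DiT (Diffusion Transformer) as its action decoder and introduces a technique that adaptively ensembles action candidates generated along multiple denoising paths. RDT-1B [24] (2024) uses a DiT at the 1B-parameter scale as a diffusion policy, showing that a scalable Diffusion Transformer is also effective for VLA action generation. GR-2 (2024) uses large-scale web video as pretraining data, proposing a strategy that sidesteps the scarcity of robot data through video pretraining.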

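FAST tokenization: a breakthrough in compressing action sequences

Another key innovation of this period was FAST [20] (Fast Action Tokenization), developed in late 2024 and released in early 2025. The way existing VLAs converted continuous actions into tokens (bin discretization) was highly inefficient: representing one action chunk (16 timesteps) of a 7-DoF robot takes 7 × 16 = 112 tokens. FAST extracts the frequency components of the action sequence with the discrete cosine transform (DCT) and compresses recurring patterns with byte-pair encoding (BPE), representing the same action sequence with up to about 13 times fewer tokens (the maximum reported in the original paper). This compression directly improved the inference speed of autoregressive VLAs and later became the core technology behind π0-FAST [20].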
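
The DCT half of this idea can be sketched with NumPy alone. The example below is an illustration under stated assumptions: the smooth synthetic trajectory, the quantization scale `SCALE`, and the hand-built DCT matrix are illustrative, and the BPE stage that FAST applies on top of the quantized coefficients is omitted.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix D, so that coeffs = D @ signal and signal = D.T @ coeffs."""
    k = np.arange(n)[:, None]   # frequency index
    i = np.arange(n)[None, :]   # time index
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    m[0, :] /= np.sqrt(2.0)
    return m

T, DOF = 16, 7                  # chunk length and degrees of freedom from the text
t = np.linspace(0.0, 1.0, T)
# Synthetic smooth trajectory: one slow sinusoid per joint (an assumption).
chunk = np.stack([np.sin(np.pi * t + d) for d in range(DOF)], axis=1)  # shape (16, 7)

D = dct_matrix(T)
coeffs = D @ chunk              # per-joint frequency coefficients

SCALE = 10.0                    # hypothetical quantization scale
quantized = np.round(coeffs * SCALE).astype(int)

naive_tokens = T * DOF                          # 112 with per-step binning
nonzero_coeffs = int(np.count_nonzero(quantized))

reconstructed = D.T @ (quantized / SCALE)
max_error = float(np.abs(reconstructed - chunk).max())
print(naive_tokens, nonzero_coeffs, round(max_error, 4))
```

Because smooth motion concentrates its energy in low frequencies, many quantized coefficients are zero; those zero runs are what a BPE stage can then merge into a small number of tokens.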

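Combining 3D understanding and world models

3D-VLA [37] (2024) was a pioneering attempt to bring 3D spatial understanding into VLA. It tried to overcome the fundamental limits of 2D-image-based VLA (no depth perception, no reasoning about occluded objects) with a generative 3D world model. 3D-VLA [37] predicts the future state of the 3D scene before executing an action and feeds that prediction back into action generation. This approach later developed into SpatialVLA [39], PointVLA [77], and others, opening a research direction that adds "imagination" to the perception-action loop.

Phase 3 - Efficiency and Deployment Readiness (2025–2026)

If the Cambrian explosion of 2024 explored "what is possible", research from 2025 onward shifted its center of gravity to "how to make it practical". The shift proceeded along three axes at once: hierarchical architectures, extreme efficiency, and built-in safety and reliability.

The rise of hierarchical architectures

GR00T N1 [21] (NVIDIA, 2025) proposed an architecture inspired by the dual process theory of human cognitive science. System 2 (VLM-based, 10Hz) handles high-level understanding and planning, and System 1 (diffusion-based, 120Hz) generates reflexive low-level actions. The split comes from a practical observation: the VLM's reasoning is rich but slow, while the diffusion head is lightweight enough to run at control rate but has no reasoning of its own. Combining the two systems achieves depth of high-level understanding and responsiveness of low-level control at the same time.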
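
The two-rate split can be pictured as a control loop in which the slow module refreshes a context every few fast ticks. The sketch below only illustrates the scheduling implied by 10Hz and 120Hz; `slow_vlm` and `fast_policy` are hypothetical stand-ins, and the synchronous loop is a simplification rather than GR00T N1's runtime.

```python
SYSTEM1_HZ = 120   # fast action generation rate from the text
SYSTEM2_HZ = 10    # slow VLM rate from the text
TICKS_PER_PLAN = SYSTEM1_HZ // SYSTEM2_HZ   # 12 fast ticks per VLM update

def slow_vlm(observation):
    """Hypothetical System 2: returns a context computed from the observation."""
    return {"computed_at_tick": observation}

def fast_policy(observation, context):
    """Hypothetical System 1: acts on the newest observation and the latest context."""
    return (observation, context["computed_at_tick"])

def run(num_ticks):
    context = None
    actions = []
    for tick in range(num_ticks):
        if tick % TICKS_PER_PLAN == 0:
            context = slow_vlm(tick)      # refreshed only every 12th tick
        actions.append(fast_policy(tick, context))
    return actions

actions = run(36)   # 0.3 seconds of control at 120Hz
```

Every action uses a fresh observation, but the context it is conditioned on can be up to 11 ticks old, which is the staleness a dual-system design accepts in exchange for a high control rate.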

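π0.5 [31] (Physical Intelligence, 2025) chose a different kind of hierarchy. A high-level VLM generates a sequence of natural-language subtasks (e.g., "grasp the cup" → "move over the sink" → "release the cup"), and a low-level π0 [16] executes each subtask. This approach opened a path to handling long tasks of 30 minutes or more (such as cleaning an entire kitchen) with a VLA.

The pursuit of extreme efficiency

The most direct barrier to real-world deployment of VLA was compute cost. Inference with a 7B-parameter model needs 16-24GB of VRAM, and latency reaching hundreds of milliseconds is fatal for robot control. 2025 was the year in which solutions to this problem were proposed in rapid succession.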
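
A back-of-the-envelope calculation shows where figures of this size come from. The sketch counts weight storage only; the helper name and the precision table are illustrative, and attributing the gap between the fp16 figure and the quoted 16-24GB to activations, KV cache, and runtime overhead is an assumption rather than a measurement.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Memory for model weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9

PARAMS_7B = 7e9
BYTES_PER_PARAM = {"fp16/bf16": 2.0, "int8": 1.0, "int4": 0.5}

weight_memory = {
    precision: weight_memory_gb(PARAMS_7B, nbytes)
    for precision, nbytes in BYTES_PER_PARAM.items()
}

for precision, gb in weight_memory.items():
    print(f"{precision:>9}: {gb:.1f} GB of weights")
```

At 16-bit precision the weights alone take 14 GB, so quantizing to 8 or 4 bits is the most direct lever for fitting a 7B model onto smaller accelerators.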

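SmolVLA [32] (2025, 450M) answered the question "can a VLA work without a large VLM?" With 450M parameters it can be trained on a single GPU while approaching the performance of OpenVLA [15] (7B) on simple tasks. BitVLA [33] (2025) went further, applying 1.58-bit ternary quantization to VLA and sharply reducing memory use. TinyVLA [34] (proposed in 2024, a forerunner of the efficiency trend) proposed a distillation technique that shrinks the VLM backbone while retaining performance; EdgeVLA (2025) proposed optimizations for deployment on edge devices; and VLA-Cache (2025) proposed removing repeated computation by caching vision tokens. DeeR-VLA [35] (2024) uses dynamic inference that activates a different depth of layers depending on input difficulty, and MoLe-VLA (2025) improves efficiency with a Mixture-of-Layers router that dynamically skips layers.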
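
What "1.58-bit ternary" means can be shown with a short sketch. It uses absmean scaling, the scheme known from BitNet b1.58; whether BitVLA uses exactly this rule is an assumption here, and the function name and toy weight matrix are illustrative.

```python
import math
import numpy as np

def ternary_quantize(weights, eps=1e-8):
    """Quantize weights to {-1, 0, +1} with a single per-tensor scale (absmean)."""
    scale = np.abs(weights).mean()
    quantized = np.clip(np.round(weights / (scale + eps)), -1, 1).astype(np.int8)
    return quantized, scale

rng = np.random.default_rng(0)
w = rng.normal(scale=0.02, size=(64, 64))   # toy weight matrix

w_ternary, scale = ternary_quantize(w)
w_dequantized = w_ternary * scale           # what the layer effectively multiplies by

bits_per_weight = math.log2(3)              # three states carry about 1.58 bits
compression_vs_fp16 = 16 / bits_per_weight
print(f"{bits_per_weight:.2f} bits per weight, about {compression_vs_fp16:.1f}x smaller than fp16")
```

Three possible values per weight carry log2(3) ≈ 1.58 bits, which is where the name comes from and why the memory saving relative to 16-bit weights is roughly tenfold before packing overhead.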

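The shared message of these efficiency studies is clear: the core value of a VLA lies in the knowledge of a large VLM, but a large model is not necessarily required to deliver that knowledge. With techniques such as distillation, quantization, caching, and dynamic inference, the knowledge of a large model can be compressed into a small model in a form suited to real-time robot control.

The arrival of RL post-training

To overcome the fundamental limits of VLAs trained with Behavior Cloning alone (demonstration quality caps performance, and behaviors absent from the demonstrations cannot be discovered), research on post-training pretrained VLAs with reinforcement learning took off in 2025. VLA-RL [68] (2025) proposed scalable online RL based on PPO, ConRFT [69] (2025) a reinforced fine-tuning scheme that combines offline and online stages, SimpleVLA-RL [70] (2025) a simplified recipe built on GRPO (Group Relative Policy Optimization) with outcome rewards, and RIPT-VLA [71] (2025) iterative RL-based refinement.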
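
The group-relative idea behind GRPO, named above, fits in a few lines. The sketch assumes binary task-success rewards for several rollouts of the same instruction; the function name and the reward values are illustrative, and the policy-gradient update that consumes these advantages is omitted.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward by the mean and std of its own group.

    No learned value function is needed: the group itself is the baseline.
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Hypothetical outcome rewards for 8 rollouts of one instruction (1 = task success).
rewards = [1, 0, 0, 1, 1, 0, 0, 0]
advantages = group_relative_advantages(rewards)
print(np.round(advantages, 3))
```

Successful rollouts receive positive advantages and failed ones negative, so the update raises the likelihood of the action sequences that solved the task relative to the group average, including behaviors that never appeared in the demonstrations.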

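These studies share the recipe "pretrain with BC to obtain a reasonable initial policy, then explore and improve with RL to reach performance beyond the demonstrations". This is structurally the same as the path in the LLM field from GPT-3 [29] (pretraining) to InstructGPT (RLHF post-training), and it shows that the VLA field is rapidly tracing the maturation path of LLMs.

Integrating reasoning and building in safety

CoT-VLA [55] (2025) brought the Chain-of-Thought reasoning of LLMs into VLA. Instead of generating actions immediately, it first generates a visual reasoning step (a predicted future image that serves as a subgoal) and then generates actions conditioned on it. This is an attempt to overcome the "reactive" limits of VLA and is the latest evolution of the "thinking robot" lineage that began with Inner Monologue [22] (2022).

SafeVLA [75] (2025) is an early model that builds safety constraints into the VLA training process. Whereas earlier VLAs handled safety by post-hoc filtering, SafeVLA [75] includes safety violations in the training objective itself, suppressing dangerous actions at generation time. This approach is a first systematic response to the warning that "a VLA hallucination is a physical accident".

Humanoid-VLA (2025) extended the scope of VLA from tabletop manipulation to whole-body humanoid control. Controlling the whole-body motion of a humanoid with dozens of degrees of freedom is a challenge of a different order from conventional robot arms (6-7 DoF), because the action space is so much larger.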

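Autonomous-driving VLA: a new application front

The field that shows most dramatically that VLA is not confined to robot manipulation is autonomous driving (AD). EMMA [46] (Waymo, 2024) applied the Gemini VLM to autonomous driving and showed the potential of an autonomous-driving VLA that generates the driving path end-to-end from sensor input. ORION [47] (2025) unified visual reasoning and driving-action generation, AutoVLA [48] (2025) proposed a VLA architecture specialized for autonomous driving, and DriveMoE [49] (2025) proposed an efficient structure that uses Mixture-of-Experts to activate specialist modules per driving scenario.

Autonomous-driving VLA shares its technical DNA with robot-manipulation VLA (VLM backbone, action tokenization, end-to-end training), but the domain characteristics differ greatly: higher speeds, stricter safety requirements, and more varied environments. Cross-pollination between the two fields is accelerating the maturation of VLA technology.

Explosive quantitative growth of VLA research: evidence from ICLR 2026

The clearest indicator of how fast the VLA field is growing is the submission statistics of ICLR 2026 (Reuss, 2026). ICLR 2024 had just 1 VLA-related submission (rejected) and ICLR 2025 had 9, but ICLR 2026 received 164, an 18-fold increase over the previous year. This number means that VLA is no longer a niche topic but has become a mainstream research direction in the machine-learning community. Some analyses even expect more than 1,000 submissions at ICLR 2027.

The key trends observed in these 164 papers are as follows:

Discrete-diffusion VLA: 4 concurrent papers that replace the slow sequential generation of autoregression with parallel diffusion
Embodied Chain-of-Thought (ECoT [92]): integrating spatially grounded reasoning with actions
Cross-action-space learning: transfer across heterogeneous embodiments with X-VLA [53], XR-1, HiMoE-VLA [54], etc.
Self-improving RL: 99% on LIBERO achieved with residual RL, accelerating benchmark saturation
New benchmarks: RoboArena (real-to-sim), RoboCasa365 (365 tasks / 2000+ kitchen scenes), WorldGym (world-model-based evaluation)

A notable finding is the VLM4VLA (ICLR 2026) study, which showed that there is no correlation between performance on standard VLM benchmarks and downstream VLA performance. This suggests that the choice of VLM backbone for a VLA should be made on criteria specific to robot tasks rather than on a VLM's general benchmark ranking.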

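Platform-scale orchestration

Another feature of 2025-2026 is that VLA is evolving beyond the level of individual models to the level of platforms. Gemini Robotics (Google DeepMind, 2025) makes Gemini 2.0 the hub of robot control, proposing a structure in which a single VLM orchestrator coordinates diverse robot platforms and tasks. NVIDIA's GR00T ecosystem is building a full-stack pipeline that runs from Cosmos (simulation) → GR00T N1 [21] (VLA policy) → Jetson (edge inference).

This platform strategy means that VLA is moving from a purely academic topic to an industrial product. Physical Intelligence's π series (π0 [16] → π0-FAST [20] → π0.5 [31]) positioning itself as the "Android of robots" and Figure AI mounting its Helix VLA on its own humanoid to become a full-stack robot company belong to the same trend.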

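The gap between frontier models and open-weight research models

As of 2025-2026, the sharpest dividing line is the real-world generalization gap between closed frontier models and open-weight research models. Closed models such as Gemini Robotics and π0.5 [31] are demonstrating zero-shot real-world generalization, while open-source research VLAs come close on simulation benchmarks but are not narrowing the gap in the real world. The causes cited for this gap are (1) differences in the quality and diversity of training data, (2) the ceiling effect of simulation benchmarks, which hides real progress, and (3) differences in the scale of research infrastructure. From this analysis Reuss (2026) draws a key conclusion: the current academic focus on raising numbers on saturated benchmarks such as LIBERO and SimplerEnv risks hiding the gap to real-world deployment, and the real breakthroughs will come from (1) data curation and quality control, (2) stronger in-context learning, and (3) established real-world evaluation protocols. He notes in particular that data curation and in-context learning were the most under-represented research directions at ICLR 2026 and concludes that this is precisely where the largest opportunity lies.

The key transition: from "proof of concept" to "deployment readiness"

The VLA of 2022-2023 was at the stage of proving "this is possible". RT-2 [11] showing that VLM knowledge can transfer to robot actions, and Diffusion Policy [17] opening the possibility of multimodal action generation, were the core of this period. 2024 was a period of explosion that explored "in how many ways is it possible". Dozens of models experimented with different architectures, training strategies, and application domains.

In 2025-2026, the VLA field is converging on the question "how do we make it work in the real world?" The forces driving this transition are multi-layered: lightweighting techniques lower the physical barrier to edge deployment, RL post-training breaks the performance ceiling of BC, built-in safety constraints address the reliability problem structurally, and hierarchical architectures respond to the challenge of long-horizon compound tasks. The simultaneous progress of these four axes is what is moving VLA from lab demo to real-world product.

The most important pattern in this chronology is the acceleration of cross-pollination between technologies. Just as CLIP's vision-language alignment produced CLIPort [26], GPT-3 [29]'s few-shot learning produced SayCan [14], and image diffusion models produced Diffusion Policy [17], the core innovations of VLA have come from adapting breakthroughs in neighboring fields to robotics. The pattern is likely to continue: LLM reasoning techniques (CoT, MCTS) are already being transplanted into VLA, progress in video-generation models should accelerate world-model-based VLA, and progress in multi-agent LLM systems may lead to multi-robot VLA collaboration.

The history of VLA is still in its opening chapter. But the density of that chapter, going from proof of concept to industrial deployment readiness in barely four years, gives a sense of how fast the field may unfold from here.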

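Motivation Chain: The Chain of Motivations Behind Key VLA Models

Limits of RT-1 (single robot, no VLM and hence no use of internet knowledge, weak generalization beyond the training data)

→ RT-2 [11] appears (fine-tunes an existing VLM to transfer internet knowledge to robot actions)

→ Limits of RT-2 (55B parameters, closed, no real-time control, 330-1000ms inference)

→ OpenVLA [15] appears (7B open source, reproducible by anyone, 16.5% higher success rate)

→ Limits of OpenVLA (autoregressive decoding limits multimodal action representation, slow inference at ~166ms)

→ π0 [16] appears (multimodal action generation with Flow Matching, ~73ms inference, advantage on dexterous tasks)

→ Limits of π0 (long-horizon compound tasks are hard for a single model)

→ π0.5 [31] appears (hierarchy of high-level VLM planning + low-level π0 execution, tasks of 30+ minutes possible)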

Motivation Chain

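Limits of Behavior Cloning (demonstration quality caps performance, behaviors outside the demonstrations cannot be discovered)

→ RL post-training research appears (VLA-RL [68], RIPT-VLA [71], ConRFT [69], etc.)

→ Obtain an initial policy with BC, then use RL to reach performance beyond the demonstrations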

Motivation Chain

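The OXE [19] dataset appears (limits of single-lab data → unified cross-embodiment data from 22 robot types)

→ Octo [25] (establishes the recipe of cross-robot pretraining → target fine-tuning)

→ OpenVLA [15] (OXE-based open-source democratization of VLA)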

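Easily Confused Models: Key Differences

Comparison	Key difference
RT-1 vs RT-2 [11]	RT-1 uses its own architecture (35M); RT-2 fine-tunes an existing VLM (55B). The key is whether internet knowledge is transferred
RT-2 [11] vs OpenVLA [15]	Same "VLM → action token" paradigm, but OpenVLA is a 7B open-source model focused on democratization
OpenVLA [15] vs π0 [16]	OpenVLA uses autoregressive decoding (discrete tokens), π0 uses Flow Matching decoding (continuous actions). The difference is expressiveness for multimodal actions
SayCan [14] vs RT-2 [11]	In SayCan the LLM only plans and a separate policy executes (modular); in RT-2 one model plans and executes (end-to-end)
Gato [13] vs RT-2 [11]	Gato is a generalist agent (games + robots + text), RT-2 is a robot-specific VLA. Gato is a proof of concept, RT-2 delivers practical performance
Diffusion Policy [17] vs π0 [16]	Diffusion Policy is a standalone diffusion policy, π0 combines a VLM with Flow Matching. π0 has language understanding built in
GR00T N1 [21] vs π0.5 [31]	Both are hierarchical, but GR00T separates by rate with System 1 (120Hz) + System 2 (10Hz), while π0.5 separates by function with VLM planning + π0 execution
Octo [25] vs OpenVLA [15]	Octo is a small 93M model with a diffusion head, specialized for cross-platform use; OpenVLA is 7B, VLM-based, autoregressive, and transfers general knowledge

One-Line Intuitions

RT-2 [11]: "Just as Google Translate maps Korean → English, it makes a VLM translate images → robot actions"
OpenVLA [15]: "The core idea of RT-2, made 7 times smaller and released as open source"
π0 [16]: "A VLM understands the situation, and on that understanding a flow-based generative model draws smooth, precise motion"
SayCan [14]: "ChatGPT gives the recipe and a chef robot actually cooks it: separating what is known from what can be done"
Diffusion Policy [17]: "Just as Stable Diffusion draws a picture from noise, it draws robot motion from noise"
Octo [25]: "A 'universal driver's license' pretrained on diverse robot data: a new robot needs only a little adaptation"
OXE [19]: "Just as ImageNet changed CV, the ImageNet of robotics, unifying 1M+ episodes from 22 robot types"
FAST: "Compresses robot actions in the frequency domain like JPEG, cutting the token count by up to 13 times"
GR00T N1 [21]: "A division of labor between a brain that thinks slowly but deeply (VLM, 10Hz) and a cerebellum that reacts quickly (diffusion, 120Hz)"
CoT-VLA [55]: "A robot that reasons about what it is trying to achieve before it acts: an evolution from reflex to thought"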

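Self-Check Questions: Section 1-2

Q1: Of the three (plus one) definitions of VLA, under which is SayCan classified as a VLA and under which is it excluded?

A: Under the broad definition of Ma et al. it is a VLA (any system that generates actions from vision + language). Under the Pure VLA definition of Zhong et al. it is excluded (its modular structure does not satisfy the end-to-end integration condition). Under the definition of Kawaharazuka et al. it is also excluded (it is a high-level policy that selects from predefined skill indices and so fails the "direct generation of control commands" condition). Under RT-2's narrow definition it is excluded as well (it does not use a VLM backbone directly for action generation).

Q2: Why did RT-2 generalize better than RT-1 even though it was trained on the same robot data?

A: Because RT-2 inherited the vision-language knowledge of a VLM pretrained at internet scale (PaLI-X 55B). The VLM already understood abstract concepts such as "dinosaur" or "the object that is different from the others", so even without having seen such objects in the robot data it could generalize from the VLM's prior knowledge. This is the core hypothesis of VLA: "transfer of internet-scale knowledge to robot actions".

Q3: Explain, from the perspective of multimodality, the key reason Diffusion Policy outperforms conventional MSE-based BC.

A: For the instruction "pick up the cup" there are several correct actions (from the right, from above, by the handle, etc.). Training with an MSE loss learns the mean of this multimodal distribution and produces a meaningless in-between action that belongs to no mode (mode averaging). Diffusion Policy generates actions from noise by iterative denoising and can therefore sample each mode of the multimodal distribution naturally. In addition, action chunking (generating a sequence of future actions in one shot) secures temporal consistency.

Open Research Questions: Section 1-2

The boundary of the VLA definition: are systems that generate code (Code as Policies), systems that output keypoints, or systems that generate reward functions VLAs? How well can the "direct action generation" criterion of the Pure VLA definition accommodate the diverse action representations of the future?

The next stream of cross-pollination: so far VLA has absorbed breakthroughs from NLP (Transformer), CV (ViT, CLIP), and generative AI (Diffusion). Which neighboring field will influence VLA most next? (Candidates: video generation, 3D foundation models, neurosymbolic AI)

The ceiling of scaling: if performance holds while parameters shrink more than 100-fold from RT-2 (55B) to SmolVLA (450M), what form do scaling laws take for VLA? Which axis matters most: parameter count, data diversity, or architectural efficiency?

Closing the data gap: even the 1 million episodes of OXE are tiny compared with internet text and images. Which strategy can narrow this gap most effectively: simulation, video pretraining, or synthetic data?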

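Section 3: Unified Taxonomy - Bringing 14 Surveys into One

"Like the blind men touching the elephant, each survey was describing a different part of the huge animal called VLA. This chapter gathers the blind men's hands and reconstructs the whole elephant."

Within the short span from early 2024 to early 2026, at least 14 survey papers appeared in the VLA field. Each survey presented its own taxonomy, and it is not rare for different papers to place the same model in different categories. Some papers classify RT-2 [11] as a "monolithic VLA", while others classify it under "autoregressive action generation". The goal of this chapter is to reinterpret these taxonomies as complementary perspectives rather than competitors, and to build a meta-taxonomy that can place every VLA model within a single coordinate system.

3.1 The Architecture Perspective - Based on Liu/Shao (2025)

The survey of Liu and Shao focuses on the structural form of a VLA. Their central question is "how are the VLM and the action generator connected?"

3.1.1 Monolithic Architecture

The whole system is a single end-to-end model, and the boundaries between internal modules form naturally during training.

Single-system: one unified forward pass generates everything from observation to action. An input image and a language instruction go in, and the robot action comes out directly as the output of a single network. Representative examples are RT-2 [11], which interprets the output tokens of the PaLI-X VLM directly as action tokens, and OpenVLA [15], which reuses the language-model head of the Prismatic VLM for action prediction. NORA likewise processes vision, language, and action together within a single transformer.

Dual-system: there are two distinct modules, a VLM backbone (System 2) and an action expert (System 1), which cooperate inside one model. The distinction is inspired by Daniel Kahneman's dual-process theory. The VLM handles the slow thinking (System 2) of understanding and reasoning about the situation, and the action expert handles fast, reflexive action generation (System 1).

Dual-system models split again into two styles of information flow:

Cascade-based: the VLM runs first and produces a feature representation, which is passed on sequentially to the action expert. In CogACT [23], the VLM extracts vision-language features and a separate diffusion-based action generator takes them and outputs an action sequence. In GR00T N1 [21], the Eagle-2 VLM produces a situation embedding that is passed to a DiT (Diffusion Transformer) action head. Fast-in-Slow also follows the sequential structure of slow VLM processing followed by fast action generation.

Parallel-based: VLM tokens and action tokens are processed together through a shared attention mechanism. In π0 [16], the tokens of the PaliGemma VLM and the flow-matching tokens of the action expert attend to each other within shared transformer blocks. π0.5 [31] extends this so that high-level planning and low-level action are processed in the same attention space. GraspVLA [81] adopts a structure in which grasp-specific tokens are processed in parallel with VLM tokens.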

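3.1.2 Hierarchical Architecture

Planning and execution are explicitly separated. The upper level decides "what to do", and the lower level decides "how to move".

Planner-Only: the VLM generates only the plan, and action execution is delegated to a separate low-level controller (MPC, PID, etc.). SayCan [14] is the early representative, and COME-Robot, Inner Monologue [22], and others continue this lineage.
Planner + Policy: a VLM planner generates an intermediate representation, and a learned policy converts it into low-level actions. VoxPoser [57], RT-H, RoboPoint, and others fall into this category.

These can be classified further by the form of the intermediate representation:

Keypoint (K): specifying the goal through keypoint coordinates (RoboPoint, ReKep)
Subtask (S): describing subtasks in language (SayCan [14], ProgPrompt)
Program (P): generating executable code/programs (Code as Policies [36], VoxPoser [57])

Beyond these, Liu & Shao [5] identify affordance (A) as a separate auxiliary representation type. A3VLM [58], CoA-VLA, and others combine affordance maps with the other representations (K, S, P) to specify graspable regions explicitly.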

3.2 액션 생성 관점 --- Zhong et al. (2025) 기반

Zhong 등의 서베이는 분류의 렌즈를 아키텍처의 전체 형태가 아닌 행동이 어떻게 생성되는가에 맞춘다. 동일한 VLM 백본을 사용하더라도 행동 생성 방식이 다르면 완전히 다른 성능 특성을 보인다는 점에서, 이 관점은 실용적으로 매우 중요하다.

자기회귀(Autoregressive) 방식

연속 행동을 이산 토큰으로 변환한 후, 언어 모델과 동일한 next-token prediction으로 순차 생성한다. RT-2 [11]가 256-bin 양자화로 이 방식을 개척했고, OpenVLA [15], Octo [25] (AR 및 디퓨전 양 모드 지원), RT-2-X 등이 따랐다. FAST 토큰화 [20]는 DCT + BPE를 적용하여 자기회귀의 프레임워크를 유지하면서도 정보 손실을 줄이고 사전학습 속도를 5배 높였다.

장점: LLM 인프라(KV 캐시, 양자화, speculative decoding 등)를 그대로 재활용할 수 있다. 구현이 단순하다.
단점: 양자화 오류가 누적되고, 다중모달 행동 분포(예: 물체를 좌로 돌릴 수도, 우로 돌릴 수도 있는 상황)를 표현하기 어렵다. 토큰 단위 순차 생성이므로 고주파 제어에 느리다.

디퓨전(Diffusion) 방식

DDPM, DDIM, Flow Matching, VAE 등 확률적 생성 모델을 사용하여 행동 분포에서 샘플링한다. Diffusion Policy [17]가 DDPM 기반 행동 생성의 가능성을 처음 보여주었고, CogACT [23]는 DDIM으로 디노이징 스텝을 줄였으며, π0 [16]는 Flow Matching으로 ODE 기반의 더 빠르고 안정적인 생성을 달성했다.

장점: 다중모달 분포를 자연스럽게 표현한다. 부드러운(smooth) 궤적을 생성한다. 연속 공간에서 직접 동작하므로 양자화 오류가 없다.
단점: 다중 디노이징 스텝이 필요하여 추론이 느리다(DDPM: 50-100 스텝, DDIM: 10-20 스텝, Flow Matching: 5-10 스텝). 학습 안정성 확보에 기술적 노력이 필요하다.

이산 디퓨전(Discrete Diffusion) 방식

이산 디퓨전(Discrete Diffusion) VLA: ICLR 2026에서 4개의 독립 연구가 동시에 제안한 새로운 패러다임이다. 기존 디퓨전이 연속 공간에서 작동하는 것과 달리, 이산 디퓨전은 토큰화된 행동 시퀀스에 직접 적용되어 자기회귀의 순차 생성 없이도 이산 토큰을 병렬로 생성한다. dVLA [65], DIVA, UNIFIED DIFFUSION VLA 등이 이 범주에 속하며, LIBERO에서 95-98% 성공률을 보고했다. 이 접근은 자기회귀의 해석 가능성과 디퓨전의 다중 모드성을 결합하려는 시도로, 2026년 가장 활발한 연구 방향 중 하나이다.

강화학습(RL) 기반 방식

보상 신호(reward signal)를 기반으로 정책을 직접 최적화한다. 순수 RL 기반 VLA는 드물지만, BC로 사전학습된 VLA를 RL로 미세조정하는 패턴이 급부상하고 있다. GRPO(Group Relative Policy Optimization), RLVF(Reinforcement Learning from Visual Feedback), π^*_0.6 [157] 등이 대표적이다.

하이브리드(Hybrid) 방식

자기회귀와 디퓨전의 장점을 결합한다. HybridVLA [79]는 고수준 의미 토큰은 AR로 생성하고, 저수준 연속 행동은 디퓨전으로 생성하는 이중 디코딩 구조를 제안했다. UniVLA [80]는 월드 모델의 잠재 표현과 행동 생성을 단일 프레임워크에서 결합한다.

특수 도메인(Specialized) 방식

3D 포인트 클라우드 입력(PointVLA [77], 3D-VLA [37]), 촉각 센서 통합(ForceVLA [78], TactileVLA), 교차 체현(cross-embodiment) 학습(Octo [25], CrossFormer [50]), 자율주행 특화(EMMA [46], DriveVLM [91]) 등 특정 도메인의 요구사항에 맞춰 설계된 아키텍처들이다.

교차 행동 공간 학습: 서로 다른 형태의 로봇(팔, 휴머노이드, 이동 로봇 등) 사이의 행동 공간 차이를 극복하는 연구도 활발해지고 있다. X-VLA [53]는 소프트 프롬프팅 토큰으로 이종 embodiment를 조건화하고, XR-1은 Unified Vision-Motion Codes(UVMC)를 도입하며, HiMoE-VLA [54]는 계층적 Mixture-of-Experts로 행동 공간별 전문가를 할당한다.

자율주행 VLA 분류(Hu et al., 2025): 자율주행 도메인에 특화된 VLA 분류도 발전하고 있다. Hu et al.은 AD-VLA를 두 가지 패러다임으로 구분한다: (1) End-to-End VLA — 인지, 추론, 계획을 단일 모델에 통합(textual action vs numerical action 하위 구분), (2) Dual-System VLA — 느린 숙고(VLM)와 빠른 안전 실행(플래너)을 분리(explicit guidance vs implicit representation transfer 하위 구분). 이 분류는 Liu & Shao [5]의 단일체/계층적 분류와 구조적으로 유사하지만, 안전-실시간성 trade-off를 핵심 축으로 놓는다는 점에서 차별화된다.

With these sub-divisions, Hu et al. [41] give a finer-grained AD-VLA taxonomy than Jiang et al. [10], one that reflects the safety-efficiency trade-off specific to autonomous driving. They also propose WorldBench, a unified evaluation platform that runs open-loop and closed-loop evaluation within a single framework.

3.3 해부학 관점 --- Xu et al. (2025) 기반

Xu 등의 서베이는 VLA를 생물학적 유비(biological analogy)로 해부한다. 그들의 프레임워크는 세 개의 기관으로 구성된다:

지각(Perception) --- 로봇의 감각 기관: 시각 인코더(SigLIP, DINOv2 [30], CLIP [27]), 고유감각 인코더(proprioception MLP), 촉각/깊이/힘 센서 등이 외부 세계의 정보를 내부 표현으로 변환한다.
두뇌(Brain) --- 중추 신경계: VLM 백본이 지각 정보와 언어 지시를 통합하여 "이해"와 "계획"을 수행한다. 순수 트랜스포머에서 VLM으로, 다시 CoT 추론이 가능한 VLM으로 진화해왔다.
행동(Action) --- 운동 신경계: 두뇌의 의도를 물리적 움직임으로 변환한다. 이산 토큰 헤드, 디퓨전 헤드, 플로우 매칭 헤드 등이 여기에 해당한다.

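The strength of this perspective is its intuitiveness. Any VLA model can be described by three questions: "What eyes does it have?", "What brain does it have?", and "What hands does it have?" For example, π0 [16] has "SigLIP eyes (the vision encoder inside PaliGemma), a PaliGemma brain, and Flow Matching hands", whereas OpenVLA [15] pairs SigLIP + DINOv2 [30] eyes with a discrete-token hand.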

Xu et al. [8]의 논문이 기여한 것은 모듈 해부에 그치지 않는다. 이들의 핵심 기여는 VLA가 직면한 5대 도전 과제(Five Challenges) 체계이다: (1) 표현(Representation) — 시각·언어·행동을 어떻게 통합 표현할 것인가, (2) 실행(Execution) — 안정적이고 정밀한 행동 생성을 어떻게 달성할 것인가, (3) 일반화(Generalization) — 새로운 환경·물체·작업으로의 전이를 어떻게 보장할 것인가, (4) 안전(Safety) — 물리적 세계에서의 위험한 행동을 어떻게 방지할 것인가, (5) 데이터셋 및 평가(Dataset & Evaluation) — 공정하고 재현 가능한 벤치마킹을 어떻게 설계할 것인가. 이 5대 과제는 이후 Section 7-10에서 다루는 일반화, 효율화, 안전, 벤치마크 논의의 체계적 기반이 된다.

3.4 Functional perspective --- based on Kawaharazuka et al. (2024)

Kawaharazuka 등은 일본 로보틱스 커뮤니티의 실용적 전통을 반영하여, VLA를 구성 요소가 아닌 수행하는 기능으로 분류한다:

저수준 지각(Low-level Perception): 물체 검출, 깊이 추정, 포즈 추정 등 원시 감각 처리
고수준 지각(High-level Perception): 장면 이해, 관계 추론, 어포던스(affordance) 인식 등 의미론적 처리
고수준 계획(High-level Planning): 과제 분해, 하위 목표 설정, 추상적 행동 시퀀스 생성
저수준 계획(Low-level Planning): 구체적 궤적 생성, 모터 명령 계획, 충돌 회피
데이터 증강(Data Augmentation): 시뮬레이션 데이터 생성, 비디오 예측을 통한 데이터 증식, 언어 재라벨링 등

이 관점의 독특한 가치는 하나의 모델이 여러 기능을 동시에 수행할 수 있다는 점을 자연스럽게 포착한다는 것이다. RT-2 [11]는 고수준 지각 + 저수준 계획을 하나의 모델에서 처리하고, SayCan [14]은 고수준 계획에 특화되어 저수준 실행은 별도의 정책에 위임한다.

3.5 후처리 관점 --- Jin et al. (2025) 기반

Jin 등의 서베이는 사전학습된 VLM을 로봇 행동 생성에 적응(adaptation)시키는 과정에 초점을 맞춘다. 이들의 질문은 "VLM이 이미 가진 지식을 로봇에게 어떻게 전달할 것인가?"이다:

환경 지각 강화(Environment Perception Enhancement): VLM의 시각 이해력을 로봇 환경에 맞게 강화한다. 깊이 정보 통합, 다중 뷰 처리, 시간적 맥락 추가 등이 포함된다. SpatialVLA [39]의 깊이 통합, HPT [96]의 다중 카메라 처리가 대표적이다.
체현 인식 개선(Embodiment Awareness Improvement): VLM에 로봇 자체의 물리적 특성(관절 구조, 행동 공간, 역학)을 주입한다. 고유감각 토큰화, 교차 체현 학습, 로봇별 어댑터가 여기에 해당한다.
과제 이해 심화(Task Understanding Deepening): 언어 지시의 이해를 단순 의미 매칭에서 추론적 이해로 끌어올린다. CoT 추론(ECoT [92], CoT-VLA [55]), 하위목표 분해, 시각적 추론 등이 포함된다.
다중 요소 통합(Multi-component Integration): 위의 세 차원을 하나의 프레임워크로 통합하는 방법론이다. 멀티태스크 학습, 모듈형 아키텍처, 점진적 학습 등의 전략이 사용된다.

3.6 메타 분류 --- 분류체계들의 분류

여기서 이 장의 핵심 통찰에 도달한다. 위의 다섯 가지 분류체계는 서로 경쟁하는 것이 아니다. 이들은 동일한 풍경을 서로 다른 고도에서 촬영한 항공사진이다. 건축가가 건물을 구조도, 배관도, 전기도, 조감도로 각각 그리듯, 각 서베이는 VLA라는 복잡한 시스템의 서로 다른 단면을 포착한다.

분류 차원의 대응 관계

다음 표는 각 분류체계가 동일한 모델을 어떻게 기술하는지를 보여준다:

모델	Liu/Shao [5] (아키텍처)	Zhong (액션 생성)	Xu (해부학)	Kawaharazuka (기능)	Jin (후처리)
RT-2 [11]	Single-system Monolithic	Autoregressive	PaLI-X Brain + Discrete Head	고수준지각 + 저수준계획	과제이해 심화
π0 [16]	Parallel Dual-system	Flow Matching(디퓨전의 변형)	PaliGemma Brain + Flow Hand	고수준지각 + 저수준계획	다중요소 통합
OpenVLA [15]	Single-system Monolithic	Autoregressive	Prismatic Brain + Discrete Head	고수준지각 + 저수준계획	환경지각 강화
GR00T N1 [21]	Cascade Dual-system	Diffusion (DiT)	Eagle-2 Brain + DiT Hand	저수준지각 + 저수준계획	체현인식 개선
CogACT [23]	Cascade Dual-system	Diffusion (DDIM)	CogVLM Brain + Diffusion Hand	고수준지각 + 저수준계획	환경지각 강화
SayCan [14]	Hierarchical Planner-Only	N/A (행동 생성 없음)	LLM Brain Only	고수준계획 특화	과제이해 심화
HybridVLA [79]	Dual-system	Hybrid (AR+Diffusion)	VLM Brain + Hybrid Hand	저수준계획 특화	다중요소 통합
CoT-VLA [55]	Single-system Monolithic	Autoregressive	VLM Brain + CoT + Discrete Head	고수준계획 + 저수준계획	과제이해 심화
SpatialVLA [39]	Single-system Monolithic	Autoregressive	Depth-enhanced Eye + VLM Brain	저수준지각 + 저수준계획	환경지각 강화

상보적 분류 축

이 표에서 드러나는 패턴은 명확하다. 각 분류체계는 서로 상보적(complementary) 축을 기술한다:

구조 축(Liu/Shao [5]): "모듈들이 어떤 위상(topology)으로 연결되어 있는가?" --- 단일체 vs 이중체, 캐스케이드 vs 병렬, 계층적

생성 축(Zhong): "행동이 어떤 수학적 메커니즘으로 만들어지는가?" --- AR, 디퓨전, RL, 하이브리드

해부 축(Xu): "각 구성 요소가 무엇인가?" --- 어떤 인코더, 어떤 VLM, 어떤 행동 헤드

기능 축(Kawaharazuka): "시스템이 어떤 인지 기능을 수행하는가?" --- 지각, 계획, 실행, 증강

적응 축(Jin): "사전학습 지식을 어떤 차원에서 보강했는가?" --- 지각, 체현, 과제, 통합

따라서 모든 VLA 모델은 이 5차원 좌표 공간의 한 점으로 표현할 수 있다. 예를 들어, π0 [16]의 좌표는 다음과 같다:

π0 [16] = (Parallel Dual-system, Flow Matching, PaliGemma (SigLIP + Gemma) + FlowHead, high-level perception + low-level planning, multi-component integration)

이 메타 분류의 실용적 가치는 세 가지다. 첫째, 새로운 VLA 모델이 등장했을 때 다섯 축 위에 즉시 위치시킬 수 있다. 둘째, 아직 탐색되지 않은 조합(예: "계층적 구조 + Flow Matching + 촉각 강화")을 체계적으로 식별할 수 있다. 셋째, 서로 다른 서베이의 결론을 충돌 없이 통합하여 해석할 수 있다.
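The idea of treating each model as a point in a five-axis space can be sketched in a few lines of Python. This is only an illustration: the class name `VLACoordinate`, the helper `unexplored`, and the English labels are assumptions introduced here, and the four entries are transcribed from the comparison table in Section 3.6 rather than from any official dataset.

```python
from dataclasses import dataclass
from itertools import product

@dataclass(frozen=True)
class VLACoordinate:
    structure: str    # Liu/Shao axis
    generation: str   # Zhong axis
    anatomy: str      # Xu axis
    function: str     # Kawaharazuka axis
    adaptation: str   # Jin axis

# Four rows transcribed from the Section 3.6 table (labels translated).
MODELS = {
    "RT-2": VLACoordinate("Single-system Monolithic", "Autoregressive",
                          "PaLI-X Brain + Discrete Head",
                          "high-level perception + low-level planning",
                          "task understanding"),
    "pi0": VLACoordinate("Parallel Dual-system", "Flow Matching",
                         "PaliGemma Brain + Flow Hand",
                         "high-level perception + low-level planning",
                         "multi-component integration"),
    "GR00T N1": VLACoordinate("Cascade Dual-system", "Diffusion (DiT)",
                              "Eagle-2 Brain + DiT Hand",
                              "low-level perception + low-level planning",
                              "embodiment awareness"),
    "SayCan": VLACoordinate("Hierarchical Planner-Only", "N/A",
                            "LLM Brain Only",
                            "high-level planning",
                            "task understanding"),
}

def unexplored(models, axes=("structure", "generation")):
    """List value combinations on the chosen axes that no catalogued model occupies."""
    seen = {tuple(getattr(m, a) for a in axes) for m in models.values()}
    values = [sorted({getattr(m, a) for m in models.values()}) for a in axes]
    return [combo for combo in product(*values) if combo not in seen]

gaps = unexplored(MODELS)
```

With only these four entries, `gaps` contains pairs such as `("Hierarchical Planner-Only", "Flow Matching")`, which mirrors the kind of unexplored combination mentioned in the text. A fuller catalogue would shrink the list.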

분류체계 자체의 메타 패턴

한 발 더 물러서서 분류체계들을 관찰하면, 흥미로운 메타 패턴이 보인다:

구조 중심 분류(Liu/Shao [5], Xu)는 "이 모델을 어떻게 구축하는가?"에 답한다 --- 엔지니어의 관점
프로세스 중심 분류(Zhong, Jin)는 "이 모델을 어떻게 학습시키는가?"에 답한다 --- 연구자의 관점
기능 중심 분류(Kawaharazuka)는 "이 모델이 무엇을 할 수 있는가?"에 답한다 --- 사용자의 관점

이 세 관점의 수렴이 바로 VLA 연구의 성숙을 나타내는 지표다. 분야가 성숙할수록, 구축 방법, 학습 방법, 활용 방법이 독립적으로 발전하면서도 상호 정합적인 체계를 이루게 된다.

Section 4: 아키텍처 심층 해부

"VLA는 세 개의 모듈로 이루어진 하나의 유기체다. 눈이 세계를 보고, 뇌가 이해하며, 손이 행동한다. 이 장에서는 각 기관을 해부대 위에 올려놓는다."

4.1 지각 모듈 --- 로봇의 눈

VLA의 첫 번째 모듈은 원시 감각 데이터를 의미 있는 내부 표현으로 변환하는 지각(perception) 모듈이다. 인간의 시각 피질이 망막의 광자를 "빨간 컵", "책상 위", "기울어진"이라는 개념으로 변환하듯, 지각 모듈은 픽셀 배열을 로봇이 이해할 수 있는 토큰 시퀀스로 변환한다.

4.1.1 언어 지도 인코더(Language-supervised Encoders)

CLIP [27] (Contrastive Language-Image Pretraining)과 SigLIP(Sigmoid Loss for Language-Image Pretraining)은 이미지-텍스트 쌍으로 대조 학습된 인코더다. 수억 장의 이미지-캡션 쌍에서 학습했기 때문에, 이들의 시각 표현은 본질적으로 의미론적(semantic)이다. "빨간 컵"과 "파란 컵"은 가깝지만, "컵"과 "접시"는 적당히 떨어져 있다. 이러한 특성은 자연어 지시로 조건화되는 VLA에 자연스럽게 적합하다.

SigLIP은 CLIP의 softmax 대조 손실을 sigmoid 손실로 대체하여, 배치 크기에 대한 의존성을 줄이고 학습 효율을 높였다. 2024-2025년 기준으로 SigLIP이 CLIP을 대체하는 추세가 뚜렷하다.

장점: 풍부한 의미론적 정렬, 언어 조건화에 최적, 대규모 사전학습의 혜택
한계: 기하학적 정밀도가 부족하다. "컵의 손잡이가 어느 방향을 향하는가?"와 같은 세밀한 공간 정보를 놓치는 경향이 있다.
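The difference between the softmax contrastive loss and the sigmoid loss can be made concrete with a small numerical sketch. The similarity matrices, the temperature `t`, and the bias `b` below are illustrative values chosen for this example, not settings taken from the CLIP or SigLIP papers.

```python
import math

def softmax_contrastive_loss(sim, t=10.0):
    """CLIP-style loss: symmetric cross-entropy over the rows and columns of a batch."""
    n = len(sim)

    def cross_entropy(rows):
        total = 0.0
        for i, row in enumerate(rows):
            logits = [t * s for s in row]
            m = max(logits)
            log_z = m + math.log(sum(math.exp(l - m) for l in logits))
            total += log_z - logits[i]          # -log softmax of the matching pair
        return total / n

    cols = [list(c) for c in zip(*sim)]
    return 0.5 * (cross_entropy(sim) + cross_entropy(cols))

def sigmoid_pairwise_loss(sim, t=10.0, b=-10.0):
    """SigLIP-style loss: every image-text pair is an independent binary decision,
    so no normalization over the whole batch is needed."""
    n = len(sim)
    total = 0.0
    for i in range(n):
        for j in range(n):
            z = 1.0 if i == j else -1.0         # +1 for matching pairs, -1 otherwise
            x = z * (t * sim[i][j] + b)
            total += math.log1p(math.exp(-x))   # -log sigmoid(x)
    return total / n

aligned = [[1.0, 0.0], [0.0, 1.0]]   # each image matches its own caption
swapped = [[0.0, 1.0], [1.0, 0.0]]   # each image matches the wrong caption
```

Both losses are lower for `aligned` than for `swapped`. The structural difference is that the softmax version couples every pair in a row through the normalizer, while the sigmoid version sums independent per-pair terms, which is why it depends less on batch size.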

4.1.2 자기 지도 인코더(Self-supervised Encoders)

DINOv2 [30]는 마스크된 이미지 모델링과 자기 증류(self-distillation)로 학습된 ViT [28] 인코더다. 언어 감독 없이 이미지 자체의 구조에서 학습했기 때문에, 이 인코더의 표현은 기하학적(geometric)이다. 물체의 경계, 표면 법선, 공간적 배치가 정밀하게 인코딩된다.

장점: 기하학적 정밀도가 높다. 접촉이 풍부한 조작(pick-and-place, 삽입 등)에서 의미론적 인코더보다 우수하다. 텍스처나 소재의 미세한 차이를 포착한다.
한계: 언어와의 정렬이 없으므로, "빨간 컵을 집어라"와 같은 언어 조건화에는 추가적인 다리(bridge)가 필요하다.

4.1.3 하이브리드 SigLIP + DINOv2 --- 현재의 지배적 표준

Since late 2024, hybrid encoding that uses SigLIP and DINOv2 [30] together has become the de facto standard. The Prismatic VLM behind OpenVLA [15], OpenVLA-OFT, GraspVLA [81], and UniVLA [80] all adopt this combination.

왜 이 조합이 지배적인가? 답은 상보성(complementarity)에 있다. SigLIP은 "무엇(what)"을, DINOv2 [30]는 "어디에, 어떻게(where, how)"를 인코딩한다. "빨간 컵을 집어라"라는 지시를 처리할 때, SigLIP은 장면에서 "빨간 컵"이라는 의미 개체를 식별하는 데 기여하고, DINOv2 [30]는 그 컵의 손잡이 방향과 정확한 위치를 파악하는 데 기여한다. 두 인코더의 출력은 일반적으로 토큰 수준에서 연결(concatenate)되거나 프로젝션 레이어를 통해 통합된다.

실증적으로도 이 조합의 우위가 반복 확인된다. Prismatic VLM 논문에서 SigLIP 단독, DINOv2 [30] 단독, 그리고 SigLIP+DINOv2 [30] 조합을 비교한 결과, 하이브리드 조합이 모든 벤치마크에서 일관되게 우수했다.
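One common way to merge two encoders is to concatenate their per-patch features along the channel dimension and then project them to the language model's width. The sketch below assumes exactly that scheme. The function name `fuse_patch_features`, the tiny dimensions, and the random stand-in features are hypothetical and are not taken from any model's code.

```python
import random

def fuse_patch_features(semantic, geometric, weight):
    """Concatenate per-patch features channel-wise, then apply a linear projection.

    semantic:  [num_patches][d_sem]  (e.g. from a language-supervised encoder)
    geometric: [num_patches][d_geo]  (e.g. from a self-supervised encoder)
    weight:    [d_sem + d_geo][d_model]
    """
    assert len(semantic) == len(geometric), "both encoders must share a patch grid"
    d_model = len(weight[0])
    fused = []
    for s, g in zip(semantic, geometric):
        x = s + g                                     # channel-wise concatenation
        fused.append([sum(xi * w[k] for xi, w in zip(x, weight))
                      for k in range(d_model)])
    return fused

rng = random.Random(0)
num_patches, d_sem, d_geo, d_model = 4, 3, 2, 5
siglip_like = [[rng.gauss(0, 1) for _ in range(d_sem)] for _ in range(num_patches)]
dino_like = [[rng.gauss(0, 1) for _ in range(d_geo)] for _ in range(num_patches)]
projection = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_sem + d_geo)]
tokens = fuse_patch_features(siglip_like, dino_like, projection)
```

The number of visual tokens stays at `num_patches`; only the channel width changes. Concatenating along the token axis instead would double the sequence length that the language model has to attend over.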

4.1.4 전체 VLM을 인코더로 사용

일부 모델은 별도의 시각 인코더 대신 사전학습된 VLM 전체를 인코더로 활용한다. RT-H는 PaLI-X를 사용하고, π0 [16]는 PaliGemma를 사용하며, VTLA는 Qwen-VL을 사용한다. 이 접근법의 장점은 VLM이 이미 시각-언어 통합을 내재적으로 수행하므로, 별도의 융합 모듈이 필요 없다는 것이다. 단점은 계산 비용이 크다는 것이며, 이를 LoRA, QLoRA 등의 효율적 미세조정 기법으로 완화한다.

4.1.5 CNN의 잔존

ViT [28] 기반 인코더가 주류를 이루지만, CNN(ResNet, EfficientNet)은 아직 사라지지 않았다. RT-1은 EfficientNet-B3를 사용했고, 일부 경량화 모델(LiteVLA [63] 등)에서는 계산 효율성을 위해 여전히 CNN을 선택한다. 실시간 제약이 극도로 엄격한 환경(산업용 로봇의 1kHz 제어 루프)이나 에지 디바이스에서 CNN의 결정론적 추론 속도와 작은 메모리 풋프린트는 여전히 유효한 장점이다.

4.1.6 다중 모달 지각

최첨단 VLA는 RGB 카메라를 넘어 다양한 감각 양식을 통합한다:

깊이(Depth): SpatialVLA [39]는 깊이 정보를 별도 채널로 인코딩하여 3D 공간 이해를 강화한다. 단안(monocular) 깊이 추정 네트워크(MiDaS, Depth Anything)의 출력을 추가 입력으로 사용하는 방식도 널리 쓰인다.
촉각(Tactile): ForceVLA [78]는 6축 힘/토크 센서 데이터를, TactileVLA는 GelSight 촉각 이미지를 시각 토큰과 함께 처리한다. 접촉 감각은 "물체를 너무 세게 쥐지 않으면서 미끄러지지 않게" 하는 섬세한 조작에 필수적이다.
힘(Force): OmniVTLA는 시각-촉각-언어를 하나의 프레임워크에서 통합하며, 힘 프로파일을 시간 시퀀스로 인코딩한다.
Audio: AudioCLIP-based audio encoders are used in exploratory work. They could support audio-conditioned behaviors such as "stop when you hear a click."

4.1.7 고유감각(Proprioception) 처리

로봇의 관절 각도, 속도, 엔드이펙터 위치 등의 고유감각 정보는 대부분 MLP(Multi-Layer Perceptron)를 통해 고정 차원의 벡터로 변환된다. 이 벡터를 시각-언어 표현과 통합하는 방식은 크게 두 가지다:

연결(Concatenation): 고유감각 벡터를 시각/언어 토큰에 단순히 이어붙인다. 구현이 간단하고 대부분의 모델이 채택하는 기본 방식이다.
FiLM Conditioning: Feature-wise Linear Modulation으로, 고유감각 정보가 시각 특징의 스케일과 바이어스를 조절한다. 고유감각이 시각 처리 자체에 영향을 미치므로, 정보 통합이 더 밀접하다. Octo [25], HPT [96] 등이 이 방식을 사용한다.
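The contrast between the two integration styles can be sketched as follows. The weights `w_gamma` and `w_beta` stand in for a learned projection from proprioception to per-channel scale and shift. All names and numbers are illustrative assumptions, not code from Octo or HPT.

```python
def concat_token(features, proprio_token):
    """Concatenation: append the embedded proprioception as one extra token."""
    return features + [proprio_token]

def film(features, proprio, w_gamma, w_beta):
    """FiLM: proprioception predicts a per-channel scale (gamma) and shift (beta)
    that modulate every visual token."""
    channels = len(features[0])
    gamma = [sum(p * w_gamma[i][c] for i, p in enumerate(proprio)) for c in range(channels)]
    beta = [sum(p * w_beta[i][c] for i, p in enumerate(proprio)) for c in range(channels)]
    return [[gamma[c] * tok[c] + beta[c] for c in range(channels)] for tok in features]

visual = [[1.0, 2.0], [3.0, 4.0]]      # two visual tokens, two channels
joint_state = [1.0]                    # one proprioceptive value
modulated = film(visual, joint_state, w_gamma=[[2.0, 0.5]], w_beta=[[0.0, 1.0]])
extended = concat_token(visual, [0.1, 0.2])
```

Here `modulated` is `[[2.0, 2.0], [6.0, 3.0]]`: the token count is unchanged, but every visual feature has been rescaled by the robot state. `extended` simply has three tokens, leaving the visual features untouched.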

4.2 두뇌 모듈 --- VLM이 로봇의 뇌가 되다

지각 모듈이 "눈"이라면, 두뇌 모듈은 지각된 정보를 이해하고, 추론하고, 계획하는 "중추 신경계"다. VLA의 두뇌는 2022년부터 2025년까지 네 단계의 진화를 거쳤다.

4.2.1 진화 4단계

1단계: 순수 트랜스포머 (2022-2023)

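Gato [13] (DeepMind, 2022) was the first "generalist agent" to handle text, images, Atari games, and robot control with a single transformer. VIMA [45] proposed a transformer that understands multimodal prompts, and GR-1 [51] proposed one that combines video generation with action prediction. The "brain" of this period was a transformer trained largely from scratch instead of inheriting an internet-scale pretrained VLM. Without that inherited knowledge, these models faced fundamental limits in language understanding and visual commonsense reasoning.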

2단계: 디퓨전 트랜스포머/DiT (2023-2024)

RDT-1B [24] (Robotics Diffusion Transformer)는 1.2B 파라미터의 DiT를 행동 생성의 중심 아키텍처로 사용했다. TriVLA는 삼중 시스템에서 DiT를 핵심 행동 생성기로 배치했다. DiT는 트랜스포머의 확장성(scalability)과 디퓨전의 다중모달 표현력을 결합하여, 복잡한 행동 분포를 대규모 모델로 학습할 수 있게 했다.

3단계: VLM + 생성 헤드 (2024)

π0 [16](Physical Intelligence, 2024)는 패러다임 전환의 결정적 지점이었다. PaliGemma(SigLIP + Gemma 2B)를 "뇌"로 사용하고, 별도의 Flow Matching 헤드를 "손"으로 부착했다. VLM이 이미 보유한 방대한 시각-언어 지식을 Flow Matching 기반 행동 생성과 결합한 첫 번째 상업적 성공 사례다. 이 단계에서 핵심 통찰은 "로봇의 뇌를 처음부터 만들 필요가 없다 --- 인터넷의 지식으로 이미 학습된 VLM을 가져와 로봇의 손만 연결하면 된다"는 것이었다.

4단계: 완전한 VLM 기반 뇌 (2024-2025)

RT-2 [11]에서 시작된 "VLM을 곧바로 로봇 정책으로 사용" 패러다임이 OpenVLA [15], π0.5 [31], CoT-VLA [55], SafeVLA [75]로 이어지며 성숙했다. 이 계보의 모델들은 VLM의 언어 생성 능력을 행동 생성에 그대로 활용한다. CoT-VLA [55]는 여기서 한 걸음 더 나아가, 행동 생성 전에 자연어로 추론 과정을 명시적으로 출력한다. SafeVLA [75]는 안전 제약을 VLM의 추론 과정에 내재화한다.

4.2.2 추론 패러다임

VLA의 "뇌"가 단순한 반사(reflex)를 넘어 사고(reasoning)하는 방향으로 진화하고 있다.

Chain-of-Thought(CoT) 추론: 행동을 생성하기 전에 자연어로 사고 과정을 출력한다. ECoT(Embodied Chain-of-Thought)는 "1. 빨간 컵이 테이블 왼쪽에 있다. 2. 그리퍼가 현재 테이블 오른쪽에 있다. 3. 먼저 왼쪽으로 이동해야 한다."와 같은 추론 체인을 생성한 후 행동을 출력한다. CoT-VLA [55]는 이 패러다임을 대규모로 학습하여, 추론이 행동 성능을 향상시킨다는 것을 실증했다.

ICLR 2026에서는 Embodied Chain-of-Thought(ECoT [92])가 주요 트렌드로 부상했다. ACTIONS AS LANGUAGE [98], InstructVLA, EMBODIED-R1 등이 공간적으로 기반된(spatially-grounded) 추론을 행동 예측과 통합하여, 단순한 텍스트 추론을 넘어 시각적 장면에 직접 기반한 추론 과정을 VLA에 도입하고 있다.

ReAct 패러다임: 추론(Reasoning)과 행동(Acting)을 교대로 수행한다. "관찰 → 추론 → 행동 → 관찰 → ..."의 반복적 루프로, 환경의 피드백을 추론에 반영할 수 있다.

시각적 하위목표 예측: 언어 대신 미래 이미지를 예측하여 "다음에 어떤 상태가 되어야 하는가?"를 시각적으로 상상한다. SuSIE [59], UniPi [93] 등이 이 방식을 탐구했다.

4.2.3 월드 모델 통합

VLA의 뇌에 "상상력"을 부여하는 것이 월드 모델(World Model) 통합이다. 두 가지 방향이 있다:

정책 강화(Policy Enhancement): 월드 모델이 생성한 미래 예측을 정책 학습의 보조 데이터나 추가 입력으로 사용한다. UniVLA [80]는 잠재 공간에서 미래 상태를 예측하고, 이 예측이 행동 생성을 안내한다. WorldVLA [76]는 비디오 예측 모듈이 정책 네트워크와 공동 학습된다.

명시적 계획(Explicit Planning): 월드 모델로 여러 가능한 미래를 시뮬레이션하고, 가장 유리한 미래로 이어지는 행동을 선택한다. LUMOS [94]는 잠재 공간 월드 모델로 트리 탐색을 수행하고, MinD [95]는 잠재 월드 모델 안에서 정신적 시뮬레이션(mental simulation)을 실행한다.

두 방향의 차이는 월드 모델의 역할에 있다. 정책 강화에서 월드 모델은 "조언자"이고, 명시적 계획에서 월드 모델은 "시뮬레이터"이다. 현재 추세는 두 방향의 수렴이다 --- 월드 모델의 예측이 정책의 행동 생성에 직접 개입하면서도, 복잡한 상황에서는 명시적 시뮬레이션을 통한 계획이 가능한 유연한 구조를 향해 나아가고 있다.
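The "simulator" role of a world model in explicit planning can be sketched with the simplest possible planner, random shooting. This is not the tree search of LUMOS or the latent simulation of MinD. The one-dimensional `world_model` below is a toy stand-in for a learned dynamics model, and every name is hypothetical.

```python
import random

def world_model(state, action):
    """Hypothetical learned dynamics; here a toy 1-D integrator."""
    return state + action

def rollout(state, actions):
    """Imagine the final state reached by an action sequence, without executing it."""
    for a in actions:
        state = world_model(state, a)
    return state

def plan(state, goal, horizon=5, num_candidates=64, rng=None):
    """Random-shooting planner: sample candidate futures and keep the best one."""
    rng = rng or random.Random(0)
    best_cost, best_seq = float("inf"), None
    for _ in range(num_candidates):
        seq = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        cost = abs(goal - rollout(state, seq))
        if cost < best_cost:
            best_cost, best_seq = cost, seq
    return best_seq, best_cost

best_actions, best_cost = plan(state=0.0, goal=2.0)
```

In the policy-enhancement setting, the same `rollout` prediction would instead be passed to the policy as an extra input or auxiliary training target, with no search over candidates.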

이러한 월드 모델 통합의 의미를 더 넓은 시각에서 조망할 수 있다. Large Model Embodied AI 서베이 [44]는 대형 모델 기반 체화 AI의 의사결정 패러다임을 계층적(hierarchical) 방식과 end-to-end 방식으로 양분한다. 계층적 방식에서는 고수준 계획(VLM/LLM)과 저수준 제어(전문 정책)가 분리되어 해석 가능성과 안전성이 높지만 모듈 간 정보 손실이 발생하며, end-to-end 방식에서는 단일 모델이 인지부터 행동까지 직접 매핑하여 표현력은 극대화되지만 디버깅과 안전 보장이 어렵다. 월드 모델은 이 두 패러다임을 연결하는 제3의 축으로 기능한다 --- end-to-end 모델에 "상상"을 통한 계획 능력을 부여하여 계층적 구조의 장점(안전성, 장기 추론)을 흡수하면서도, 통합된 표현 공간의 이점을 유지할 수 있기 때문이다.

4.3 행동 모듈 --- 의도를 움직임으로

두뇌가 "빨간 컵을 집어야 한다"고 결정한 후, 이 의도를 실제 모터 명령으로 변환하는 것이 행동 모듈의 역할이다. 여기서 근본적인 도전은 연속적이고 고차원적인 행동 공간을 어떻게 효과적으로 표현하고 생성하느냐이다.

4.3.1 이산 토큰화 --- RT-2 방식

RT-2 [11]가 개척한 접근법이다. 연속 행동값(예: 관절 각도 0.732rad)을 0-255 사이의 정수 bin으로 양자화한 후, 이를 VLM의 어휘(vocabulary)에 추가하여 언어 토큰과 동일하게 처리한다. 7-DoF 로봇 팔의 경우, 각 타임스텝의 행동은 7개의 토큰(+ 그리퍼 개폐 1개)으로 표현된다.

장점: 언어 모델의 기존 인프라(토크나이저, 생성 알고리즘, KV 캐시)를 그대로 재활용한다. 구현이 직관적이고 간단하다. 언어 생성과 행동 생성이 동일한 디코딩 과정이므로, 멀티태스크 학습이 자연스럽다.
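Disadvantages: 256-bin quantization limits the precision of the action space (1/256, about 0.4% of each dimension's range). Multimodal action distributions are hard to express in practice: the decoder commits to a single value per token, and when an object could be turned either left or right the policy can end up near the average of the two options ("do not turn"), which is the mode-averaging problem. Because decoding is autoregressive, latency grows in proportion to the number of tokens.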
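A minimal sketch of uniform 256-bin discretization makes the resolution limit concrete. The range `[-1.0, 1.0]` and the choice to decode to bin centres are assumptions made for illustration. The source does not specify RT-2's exact normalization.

```python
def discretize(value, low, high, bins=256):
    """Map a continuous action value to an integer bin index in [0, bins - 1]."""
    clipped = min(max(value, low), high)
    idx = int((clipped - low) / (high - low) * bins)
    return min(idx, bins - 1)          # the upper edge falls into the last bin

def undiscretize(idx, low, high, bins=256):
    """Map a bin index back to the centre of its bin."""
    return low + (idx + 0.5) * (high - low) / bins

LOW, HIGH = -1.0, 1.0                  # assumed normalized action range
token = discretize(0.732, LOW, HIGH)
decoded = undiscretize(token, LOW, HIGH)
half_bin = (HIGH - LOW) / (2 * 256)    # worst-case round-trip error
```

For the value 0.732 this gives token 221 and a decoded value of about 0.7305. The round-trip error is always at most `half_bin` (about 0.0039 on this range), which is the precision ceiling described above.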

4.3.2 디퓨전 정책 --- 확률적 행동 생성

DDPM(Denoising Diffusion Probabilistic Model): Diffusion Policy [17] (Chi et al., 2023)가 개척한 방식으로, 순수 가우시안 노이즈에서 출발하여 반복적 디노이징으로 행동 시퀀스를 생성한다. 50-100회의 디노이징 스텝이 필요하지만, 다중모달 분포를 충실히 표현한다.

DDIM(Denoising Diffusion Implicit Model): CogACT [23]가 채택한 방식으로, 결정론적 샘플링 과정을 통해 디노이징 스텝을 10-20회로 줄인다. 품질과 속도의 절충점을 제공한다.

Flow Matching: π0 [16]가 채택한 방식으로, 노이즈에서 데이터로의 경로를 ODE(상미분방정식)로 모델링한다. DDPM/DDIM보다 학습이 안정적이고, 5-10회의 스텝으로 고품질 샘플을 생성한다. Rectified Flow는 경로를 직선에 가깝게 학습하여 스텝 수를 더 줄인다.

디퓨전 기반 방식의 공통적 장점은 다중모달 분포 표현이다. "컵을 좌로 돌릴 수도, 우로 돌릴 수도 있다"는 양쪽 모드를 동시에 표현하고, 실행 시 하나를 샘플링한다. 부드러운 궤적을 자연스럽게 생성하며, 연속 공간에서 직접 동작하므로 양자화 오류가 없다.
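The ODE view of Flow Matching can be sketched with forward Euler integration from noise to an action. The `velocity` function below is a hand-written stand-in for a trained network: it is the exact velocity field when the data distribution is a single target action, which is an assumption made so the example stays self-contained. The step counts are illustrative.

```python
import random

def velocity(x, t, target):
    """Stand-in for a learned network v_theta(x, t | observation).
    For a single target action, the straight-line flow has this closed form."""
    return [(a - xi) / (1.0 - t) for xi, a in zip(x, target)]

def sample_action(target, num_steps=10, rng=None):
    """Integrate dx/dt = v(x, t) from t = 0 (noise) to t = 1 (action)."""
    rng = rng or random.Random(0)
    x = [rng.gauss(0.0, 1.0) for _ in target]        # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]   # forward Euler step
    return x

target_action = [0.3, -0.2, 0.732]
ten_step = sample_action(target_action, num_steps=10)
two_step = sample_action(target_action, num_steps=2)
```

Because the path in this toy case is a straight line, two Euler steps reach the target as accurately as ten. This is the intuition behind Rectified Flow: the straighter the learned paths, the fewer integration steps are needed. With a real multimodal distribution the learned field is only approximately straight, so more steps are typically used.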

4.3.3 FAST 토큰화 --- 주파수 도메인의 혁신

FAST [20](Fast Action Tokenization)는 자기회귀 방식의 단순함과 연속 행동 표현의 정밀도를 동시에 잡으려는 시도다. 핵심 아이디어는 연속 행동 청크를 주파수 도메인으로 변환한 후 토큰화하는 것이다.

구체적인 과정은 다음과 같다:

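1. Transform the continuous action sequence (e.g., 16 timesteps x 7 DoF) into the frequency domain with the DCT (Discrete Cosine Transform).
2. Compress by dropping high-frequency components, which are mostly noise.
3. Apply BPE (Byte Pair Encoding) to the compressed frequency coefficients to obtain discrete tokens.
4. Predict these tokens with the LLM's autoregressive generation.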

이 접근법의 결과는 인상적이다. RT-2 [11] 방식 대비 5배의 사전학습 가속을 달성하면서도, 정보 손실은 무시할 수 있는 수준이다. 주파수 도메인에서의 압축이 시간 도메인에서의 양자화보다 훨씬 효율적이기 때문이다. 또한 토큰 수가 크게 줄어들어 자기회귀 생성의 속도 문제도 완화된다.
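The DCT-and-compress part of this pipeline can be sketched for a single degree of freedom. This is not the FAST implementation: the BPE stage is omitted, the rounding `scale` is an arbitrary assumption, and the example trajectory is deliberately built from two low-frequency components so that the sparsity is easy to see.

```python
import math

def dct(x):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        out.append(s * math.sqrt((1.0 if k == 0 else 2.0) / n))
    return out

def idct(c):
    """Inverse of dct (orthonormal DCT-III)."""
    n = len(c)
    out = []
    for i in range(n):
        s = c[0] * math.sqrt(1.0 / n)
        s += sum(c[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (i + 0.5) * k / n)
                 for k in range(1, n))
        out.append(s)
    return out

def compress(chunk, scale=10.0):
    """Round scaled DCT coefficients to integers; small terms collapse to zero."""
    return [round(scale * c) for c in dct(chunk)]

def decompress(codes, scale=10.0):
    return idct([q / scale for q in codes])

# One smooth DoF over a 16-step chunk, composed of two low-frequency cosines.
chunk = [0.5 * math.cos(math.pi * (t + 0.5) / 16)
         + 0.2 * math.cos(2 * math.pi * (t + 0.5) / 16) for t in range(16)]
codes = compress(chunk)
recon = decompress(codes)
max_error = max(abs(a - b) for a, b in zip(chunk, recon))
```

Sixteen continuous values reduce to two non-zero integers (`codes` is `[0, 14, 6, 0, ...]`), and the reconstruction error stays below 0.02. A tokenizer such as BPE would then merge the long run of zeros into very few tokens, which is where the reduction in sequence length comes from.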

4.3.4 정규화 흐름(Normalizing Flows) --- 단일 스텝의 꿈

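NinA (Normalizing Flows in Action) applies normalizing flows to action generation. Unlike diffusion, a normalizing flow models the distribution as a composition of invertible transformations, so a sample can be drawn in a single forward pass. No multi-step denoising is needed, which makes inference very fast.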
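The "invertible transformation" at the heart of a normalizing flow can be illustrated with one affine coupling layer, a standard building block of such models. The functions `toy_scale` and `toy_shift` are hypothetical stand-ins for small networks that would normally also be conditioned on the observation. This sketch makes no claim about NinA's actual architecture.

```python
import math
import random

def coupling_forward(z, scale_fn, shift_fn):
    """Affine coupling: the first half passes through unchanged and
    parameterizes a scale and shift applied to the second half."""
    h = len(z) // 2
    z1, z2 = z[:h], z[h:]
    s, t = scale_fn(z1), shift_fn(z1)
    return z1 + [b * math.exp(si) + ti for b, si, ti in zip(z2, s, t)]

def coupling_inverse(x, scale_fn, shift_fn):
    """Exact inverse: recompute scale and shift from the untouched half."""
    h = len(x) // 2
    x1, x2 = x[:h], x[h:]
    s, t = scale_fn(x1), shift_fn(x1)
    return x1 + [(b - ti) * math.exp(-si) for b, si, ti in zip(x2, s, t)]

def toy_scale(v):
    return [math.tanh(a) for a in v]

def toy_shift(v):
    return [2.0 * a for a in v]

rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(4)]
action = coupling_forward(noise, toy_scale, toy_shift)     # one pass: noise -> sample
recovered = coupling_inverse(action, toy_scale, toy_shift)
```

Sampling costs a single call to `coupling_forward` (or a short stack of such layers), with no iterative refinement. The exact inverse is what allows the likelihood of a demonstrated action to be computed directly during training.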

4.3.5 주파수 도메인 Flow Matching --- FreqPolicy

FreqPolicy는 Flow Matching을 주파수 도메인에서 수행하여, 단일 스텝 추론을 달성한다. 행동 시퀀스를 주파수 성분으로 분해한 후, 주파수 공간에서 플로우를 학습한다. 시간 도메인에서의 복잡한 다중모달 분포가 주파수 도메인에서는 더 단순한 구조를 가지므로, 적은 스텝으로도 충분한 품질의 샘플을 생성할 수 있다.

4.3.6 Action Chunking --- 시간 스케일의 분리

Action Chunking은 행동 생성의 시간 스케일을 의미 수준과 운동 수준으로 분리하는 전략이다. 고수준(저주파)에서는 "다음 청크의 의미적 방향"을 자기회귀로 결정하고, 저수준(고주파)에서는 "청크 내부의 세밀한 궤적"을 병렬로 생성한다.

ACT(Action Chunking with Transformers)가 이 개념을 대중화했다. 한 번의 예측으로 16-32 타임스텝의 행동 시퀀스를 청크 단위로 생성하여, 단일 스텝 예측의 근시안적(myopic) 행동 문제를 완화한다. 이후 연구들은 청크 간 접합의 매끄러움, 가변 길이 청크, 계층적 청킹 등으로 확장되었다.
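How chunked prediction changes the control loop can be sketched as a receding-horizon executor. The policy, the `execute` parameter (how many actions of each chunk are run before re-planning), and the toy environment are assumptions for illustration. ACT itself additionally blends overlapping chunks, which is omitted here.

```python
def run_with_chunks(policy, observe, act, total_steps, chunk_size=16, execute=8):
    """Predict `chunk_size` actions per forward pass, run the first `execute`,
    then observe again and re-plan. Returns the number of policy calls."""
    calls = 0
    step = 0
    while step < total_steps:
        chunk = policy(observe(), chunk_size)   # one forward pass -> many actions
        calls += 1
        for action in chunk[:execute]:
            if step >= total_steps:
                break
            act(action)
            step += 1
    return calls

# Toy environment: the observation is the number of actions executed so far,
# and the policy simply counts upward from the observation.
executed = []
calls = run_with_chunks(
    policy=lambda obs, k: [obs + i for i in range(k)],
    observe=lambda: len(executed),
    act=executed.append,
    total_steps=32,
)
```

Thirty-two control steps need only four policy calls here instead of thirty-two. The price is that the robot acts open-loop within each executed segment, so the choice of `execute` trades reactivity against inference cost.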

4.4 이중 시스템 아키텍처 --- 인지과학이 로봇공학을 만나다

4.4.1 Kahneman의 이중 처리 이론

Daniel Kahneman이 Thinking, Fast and Slow (2011)에서 제시한 이중 처리 이론은, 인간의 사고가 두 시스템으로 구성된다고 주장한다:

System 1: 빠르고, 자동적이며, 노력이 적게 드는 직관적 사고. "공이 날아오면 손을 뻗어 잡는다"
System 2: 느리고, 의식적이며, 노력이 많이 드는 분석적 사고. "체스에서 다음 수를 계산한다"

이 이론의 로봇공학 적용은 직관적으로 타당하다. 로봇도 두 가지 종류의 "사고"가 필요하기 때문이다:

빠른 반사(10-120Hz): 장애물을 피하고, 미끄러지는 물체를 재빨리 다시 잡고, 부드러운 궤적을 유지하는 운동 제어
느린 숙고(1-10Hz): "어떤 물체를 집을 것인가?", "어떤 순서로 과제를 수행할 것인가?", "이 상황이 안전한가?"를 판단하는 고수준 추론

핵심 통찰은 이 두 시스템이 서로 다른 시간 스케일에서 작동한다는 것이다. System 2가 매 프레임마다 실행될 필요는 없고, System 1이 복잡한 추론을 수행할 필요도 없다.

4.4.2 로봇 이중 시스템의 구현

System 1 --- 행동 전문가: 디퓨전 정책, Flow Matching, 경량 MLP 등 빠른 생성 모델이 담당한다. 10-120Hz의 주파수로 실행되어 부드럽고 반응적인 모터 명령을 생성한다. 이 모듈은 VLM의 무거운 추론 없이, 주어진 상황 임베딩(context embedding)으로부터 직접 행동을 생성한다.

System 2 --- VLM 기반 추론/계획: 대규모 VLM이 담당한다. 1-10Hz의 주파수로 실행되어 장면을 이해하고, 과제를 분해하고, 안전 조건을 확인한다. 이 모듈의 출력은 행동 전문가에게 "무엇을 해야 하는가"의 지시(context)로 전달된다.

비동기 실행: 두 시스템의 핵심은 독립적인 주파수로 실행된다는 것이다. System 2가 다음 계획을 추론하는 동안, System 1은 이전 계획에 기반하여 행동을 계속 생성한다. 이 비동기성이 VLM의 느린 추론 속도를 실시간 제어와 양립 가능하게 만든다.
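The timing relationship between the two systems can be sketched with a single-threaded simulation. A real deployment would run the two modules concurrently. Here the slow planner refreshes the context only every `slow_every` control ticks, and the fast policy keeps acting on the most recent context in between. All names and rates are hypothetical.

```python
def run_dual_system(slow_planner, fast_policy, observe, ticks, slow_every=10):
    """Simulate System 2 updating the context every `slow_every` ticks while
    System 1 emits an action on every tick from the latest available context."""
    context = slow_planner(observe(0))                # initial plan
    actions = []
    for tick in range(ticks):
        if tick > 0 and tick % slow_every == 0:
            context = slow_planner(observe(tick))     # e.g. 5 Hz if ticks run at 50 Hz
        actions.append(fast_policy(context, observe(tick)))
    return actions

# Toy modules: the "plan" records when it was made, and the action pairs the
# (possibly stale) plan with the fresh observation.
trace = run_dual_system(
    slow_planner=lambda obs: obs,
    fast_policy=lambda ctx, obs: (ctx, obs),
    observe=lambda tick: tick,
    ticks=25,
)
```

At tick 9 the action is `(0, 9)`: a fresh observation combined with a plan that is nine ticks old. This staleness is the cost of asynchrony, and it is acceptable only because high-level intent changes far more slowly than motor commands.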

4.4.3 구현 사례

GR00T N1 [21] (NVIDIA): Eagle-2 VLM(System 2)이 카메라 이미지와 언어 지시를 처리하여 상황 임베딩을 생성한다. 이 임베딩은 DiT 기반 행동 전문가(System 1)에게 전달되어, Flow Matching으로 행동 청크를 생성한다. 전형적인 캐스케이드 구조로, System 2 → System 1의 순차적 정보 흐름을 따른다.

π0 [16] (Physical Intelligence): PaliGemma VLM(System 2)의 토큰과 Flow Matching 행동 전문가(System 1)의 토큰이 공유 어텐션 블록에서 동시에 처리된다. 병렬 구조로, 두 시스템이 동일한 트랜스포머 레이어를 공유하면서 서로의 정보에 접근한다.

MinD [95]: 잠재 월드 모델(latent world model)이 System 2의 역할을 하며, 미래 상태를 "정신적으로 시뮬레이션"한다. 행동 정책(System 1)은 이 시뮬레이션 결과를 바탕으로 행동을 생성한다. 여기서 System 2는 언어적 추론이 아닌 잠재 공간에서의 예측적 시뮬레이션이라는 점이 독특하다.

TriVLA: 삼중 시스템(triple system)을 제안한다. VLM(System 3, 가장 느림)이 전략적 계획을, DiT(System 2, 중간)가 전술적 행동 생성을, 경량 실행기(System 1, 가장 빠름)가 실시간 보정을 담당한다. 이중 시스템을 삼중으로 확장하여 시간 스케일의 분리를 더 세밀하게 구현한다.

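Hume: brings System-2 thinking into a dual-system VLA. The slow system samples several candidate action chunks and uses a learned value estimate to choose among them, and a lightweight fast system then refines the selected actions for real-time execution.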

4.4.4 핵심 설계 선택: 공유 어텐션 vs 캐스케이드

이중 시스템 아키텍처의 가장 중요한 설계 선택은 두 시스템 간의 정보 흐름 방향이다.

캐스케이드(GR00T N1 [21] 방식):

정보가 System 2 → System 1으로 단방향으로 흐른다
System 1이 System 2에 피드백을 줄 수 없다
장점: 모듈 분리가 명확하여 각 시스템을 독립적으로 학습/교체할 수 있다. System 2의 VLM을 업그레이드하더라도 System 1의 행동 전문가는 그대로 유지할 수 있다.
단점: 행동 생성 과정의 정보가 추론에 반영되지 않으므로, 행동 실행 중 발생하는 미세한 변화에 System 2가 반응하지 못한다.

공유 어텐션(π0 [16] 방식):

정보가 System 1 ↔ System 2 양방향으로 흐른다
언어/시각 토큰과 행동 토큰이 동일한 어텐션 메커니즘에서 상호작용한다
장점: 두 시스템이 서로의 상태를 참조할 수 있어, 행동 생성이 추론에 영향을 주고 추론이 행동을 안내하는 밀접한 협력이 가능하다.
단점: 모듈 분리가 불명확하여, 한 시스템을 교체하면 다른 시스템도 재학습이 필요할 수 있다. 공유 어텐션의 계산 비용이 크다.

이 선택은 모듈성(modularity) vs 통합성(integration)의 근본적 트레이드오프를 반영한다. 캐스케이드는 레고 블록처럼 부품을 교체할 수 있는 유연성을 제공하고, 공유 어텐션은 유기체처럼 밀접하게 통합된 시스템을 만든다.

현재(2025년 기준) 추세를 보면, 연구 커뮤니티에서는 두 접근법이 공존하되, 상업적 성공을 거둔 모델들(π0 [16], π0.5 [31])은 공유 어텐션 쪽에, 플랫폼/모듈 교체 용이성을 중시하는 모델들(GR00T N1 [21])은 캐스케이드 쪽에 가깝다. 최종적으로 어느 쪽이 우세해질지는 아직 열린 질문이다. 다만 분명한 것은, "빠른 반사"와 "느린 숙고"의 분리라는 인지과학적 통찰이 VLA 아키텍처 설계의 핵심 원리로 자리 잡았다는 점이다.

4.5 3대 프런티어 VLA를 정직하게 비교하기 — NVIDIA GR00T, Google Gemini Robotics, Physical Intelligence π

"셋이 같은 패러다임의 변주라는 진단은 옳다. 그러나 그 진단에서 멈추면 무엇을 연구해야 할지 보이지 않는다. 우리는 한 번 더 들어가야 한다."

박사 과정에 들어와 VLA 분야의 논문을 본격적으로 읽기 시작하면, 거의 모든 사람이 비슷한 경로를 밟는다. 처음에는 GR00T [21], Gemini Robotics, π [16]를 마치 세 개의 다른 동물처럼 본다. 각 회사의 데모 영상이 너무 다르기 때문이다. 휴머노이드가 물건을 옮기는 NVIDIA의 영상, 자연어로 복잡한 추론을 펼치는 Google의 영상, 13시간 동안 에스프레소를 만드는 PI의 영상은 시각적으로 전혀 다른 인상을 준다. 그러다 본 서베이의 Section 3–6을 끝까지 읽고 나면 정반대의 인식이 찾아온다 — "이거 다 같은 거 아닌가?"

이 두 인식 모두 부분적으로 옳고, 부분적으로 틀리다. 이 절의 목적은 그 사이의 정확한 지점을 찾는 것이다. 박사 1–2년차가 이 분야에 진입할 때 가장 위험한 것은 두 극단 중 하나에 정착하는 것이다. "셋이 다 다르다"고 보면 표면에 끌려다니고, "셋이 다 같다"고 보면 자신이 어느 흐름 위에서 연구하고 있는지 잃어버린다. 우리는 정직하게, 무엇이 같고 무엇이 진짜로 다른가를 분리해내야 한다.

4.5.1 먼저 인정해야 할 것: 통일장 진단의 70%는 옳다

지난 1–2년간 VLA 분야를 충분히 깊이 추적해왔다면 다음 명제에 동의하지 않기 어렵다.

세 모델은 모두 "인터넷 사전학습된 VLM을 뇌로 두고, 그 위에 생성형 행동 디코더를 얹고, 두 모듈을 서로 다른 시간 스케일로 비동기 실행하는" 단일 패러다임의 변주다.

이건 본 서베이의 Insight 1(수렴의 증거)이 정면으로 지적한 결론이고, Section 3.6의 5축 좌표계 위에 셋을 강제로 박아놓고 보면 그 사실이 더 또렷해진다.

분류 축	NVIDIA GR00T N1 [21]	Google Gemini Robotics	PI π0 / π0.5 [31]
구조(Liu/Shao)	Dual-system, Cascade	Dual-system, Cascade	Dual-system, Parallel
생성(Zhong)	Diffusion (DiT)	Diffusion 계열	Flow Matching
해부(Xu)	Eagle-2 Brain + DiT Hand	Gemini VLM + Action Head	PaliGemma/Gemma 3 + Flow Hand
기능(Kawaharazuka)	저수준 지각 + 저수준 계획	고수준 추론 강조 + 저수준 계획	고수준 지각 + 저수준 계획
후처리(Jin)	체현 인식 개선	과제 이해 심화	다중 요소 통합 + RL

Gemini Robotics는 본 서베이 §3.6의 공식 5축 분류표에 포함되어 있지 않다. 위 행은 공개된 기술 보고를 근거로 한 본 절의 추정 위치이며, GR00T N1/π0의 행은 §3.6 표와 일치한다.

표를 보면 차이가 있긴 하지만, 카테고리 자체가 다른 것이 아니라 같은 카테고리 안에서의 위치 차이다. 셋 다 dual-system이고, 셋 다 generative action head를 쓰고, 셋 다 사전학습된 VLM 백본 위에 서 있다. 본 서베이의 4가지 공리(인터넷 지식 상속, 시간 스케일 분리, 생성형 디코더, 데이터 피라미드)는 정말로 셋의 공통 분모다.

이걸 인정하지 않고 시작하는 모든 비교는 결국 마케팅 카피의 재배열이 된다. 셋의 차이는 "다른 종(species)"이 아니라 "같은 종 안의 다른 품종(breed)"이라는 통일장 논변의 결론은 받아들여야 한다. 그럼 이 절은 끝나는가? 아니다. 박사 과정 연구자에게 진짜 흥미로운 질문은 이 다음에 시작된다.

4.5.2 그러나 평탄화하면 안 되는 것: 같은 종 안의 차이도 진짜다

생물학자에게 "골든 리트리버와 시베리안 허스키는 같은 종이다"라고 말하면 그는 동의할 것이다. 그런데 그가 평생 연구하는 것은 그 두 품종 사이의 차이다. 카테고리상 같다는 것이 차이가 무의미하다는 뜻이 아니다. 패러다임이 같다는 사실이 그 패러다임 위에서의 분기가 사라진다는 뜻은 아니다. 자동차가 100년간 "내연기관 + 변속기 + 차체"로 통일된 후에야 도요타·BMW·포르쉐의 진짜 차이가 본격화되었다.

VLA에서 진짜 차이가 어디서 나는지를 보려면, 본 서베이가 기술 축으로 짜여 있어 명시적으로 드러내지 못한 세 가지 횡단 축을 도입해야 한다. 이 축들은 통일장 논변이 의도적으로 평탄화한 지점이고, 박사 과정 연구자가 어느 흐름 위에서 자신의 연구를 위치시킬지 결정할 때 결정적이다.

축 1 — 데이터 출처가 결정하는 일반화의 천장

본 서베이 §10.12가 명시적으로 지적한 것은, 시뮬레이션 벤치마크(LIBERO, CALVIN)에서 프런티어 모델과 오픈 웨이트 모델의 성능이 수렴하고 있음에도 실세계 zero-shot 일반화에서는 격차가 좁혀지지 않는다는 점이었다. Reuss(2026)의 ICLR 분석이 가리키는 핵심 원인은 데이터 큐레이션과 인프라 규모의 격차다. 그리고 이 격차는 우연이 아니라, 세 그룹이 데이터를 어디서 얻는가의 구조적 차이에서 나온다.

NVIDIA는 Cosmos를 통한 시뮬레이션 합성과 Isaac Sim/Lab을 중심에 둔다. 시뮬레이션은 무한 확장 가능하지만 sim-to-real gap이라는 영구적 세금을 낸다. Google은 인터넷 스케일 멀티모달 사전학습을 자산으로 둔다. 이는 zero-shot 의미 이해에서 압도적이지만, 모터 제어의 정밀도와는 다른 차원의 능력이다. PI는 자체 로봇 함대에서 수집한 실세계 데이터에 의존한다. 이는 sim-to-real gap이 없는 가장 깨끗한 신호이지만, 확장성이 가장 제약된다.

박사 과정 연구자가 자신의 실험실에서 어느 데이터 전략을 모방할지는 결정적이다. 시뮬레이션 인프라가 있다면 NVIDIA 노선을 따라가는 것이 자연스럽고, 그렇지 않다면 PI가 공개한 openpi 위에서 소량의 실세계 데이터로 fine-tune하는 것이 현실적이다. 이 선택은 단순한 도구 선택이 아니라 무엇을 일반화의 한계로 받아들일 것인가의 선택이다.

축 2 — 학습 단계 중 어디에 자원을 쏟는가

본 서베이 §6.6의 3단계 성숙 모델(인터넷 사전학습 → BC fine-tuning → RL post-training)을 다시 떠올려보자. 흥미로운 사실은, 세 그룹이 이 세 단계 중 서로 다른 단계를 자신의 차별화 지점으로 삼고 있다는 점이다. 그리고 이건 우연이 아니라 각 그룹의 자산 구조에 의해 결정된다.

Google은 1단계(인터넷 사전학습)에 자산이 압도적으로 쏠려 있다. 거대 데이터센터, 인터넷 스케일 멀티모달 코퍼스, TPU 인프라. 그래서 Gemini Robotics의 차별화는 "Gemini라는 거대 백본을 그대로 이식한다"가 된다. NVIDIA는 모델 자체는 1–2단계의 정통 경로(Eagle-2 VLM 상속 + BC fine-tuning)를 따르되, 그 1–2단계를 먹여 살리는 데이터 공급망을 Cosmos/Isaac Sim으로 차별화한다. 다른 말로, 모델 레시피가 아니라 데이터 파이프라인이 차별화 지점이다. PI는 1, 2단계에서는 Google의 오픈 PaliGemma/Gemma 3을 가져다 쓰면서, 3단계의 RL post-training과 그 이후의 deployment-time 학습을 자신의 영역으로 정의했다. π^*_0.6 [157]의 advantage conditioning(§10.13.1)이 Flow Matching VLA에 RL을 적용하는 실용적 경로를 최초로 입증한 것, π_0.6-MEM [158](§10.13.2)이 다중 스케일 메모리로 15분 장시간 작업을 풀어낸 것은 모두 이 3단계 영역에서 일어난 일이다.

박사 과정 연구자에게 이건 매우 실용적인 질문으로 번역된다. 자신이 향후 3–5년간 어느 학습 단계에서 기여할 수 있는가? 1단계에서 거대 백본과 경쟁하는 건 학계 단일 실험실로서는 사실상 불가능하다. 2단계의 BC fine-tuning은 이미 잘 닦여 있다. 가장 열려 있는 frontier는 3단계의 RL post-training과 그 변형들이고, 이는 PI가 가장 빠르게 밀어내고 있는 영역이지만 동시에 가장 진입 장벽이 낮은 영역이기도 하다. ICLR 2026에서 자기개선 잔차 RL이 LIBERO 99%에 도달한 흐름은 이 진단을 뒷받침한다.

축 3 — 모델 가중치 공개 정책이 만드는 비대칭 생태계

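This is a political-economy dimension that this survey does not address directly, yet it is the difference that most directly shapes a PhD student's daily work. NVIDIA releases GR00T N1 through Hugging Face and similar channels. openpi releases π0, π0-FAST, and π0.5, while later models such as π^*_0.6 [157] remain closed. Gemini Robotics has been an API/partnership model from the start.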

이 차이는 단순한 기업 정책이 아니라 누가 어떤 종류의 후속 연구를 할 수 있는가를 결정한다. 학계에서 weight access 없이 ablation study나 mechanism interpretation을 시도하기란 거의 불가능하다. 그래서 학술 VLA 연구의 실질적 베이스라인은 OpenVLA [15], π0, GR00T N1 같은 오픈 모델이고, Gemini Robotics는 인용 대상이지 비교 대상이 되기 어렵다. 이건 본 서베이 §10.12의 "프런티어 vs 오픈 웨이트 격차"가 단순히 성능 격차가 아니라 연구 가능성 자체의 격차임을 의미한다.

박사 1–2년차가 새 연구 주제를 잡을 때 이 점을 명시적으로 의식해야 한다. "Gemini Robotics와 비교하는 연구"는 거의 항상 제한된 형태(공개된 결과 인용, API 호출 비교)에 머물 수밖에 없다. 반면 "openpi 위에서 새로운 RL 후처리 방법을 검증하는 연구"는 즉시 실행 가능하고, 결과를 다른 연구자가 재현할 수 있다.

4.5.3 그래서 셋의 진짜 좌표는

Combining the three axes above, the differences among the three can be summarized honestly in a single table.

차원	NVIDIA GR00T	Google Gemini Robotics	PI π 시리즈
패러다임상 위치	같은 종 (VLM Brain + Generative Action Head)	같은 종	같은 종
베팅하는 병목	체현(embodiment) 일반화, 휴머노이드 폼팩터	거대 백본의 추론 능력을 행동으로 전이	정책 자체의 개선 능력 (RL + 메모리)
자산 구조	GPU·시뮬레이션·엣지 칩 풀스택	거대 데이터센터·인터넷 사전학습	자체 로봇 함대·실세계 데이터
차별화 학습 단계	1–2단계의 데이터 공급망(Cosmos/Isaac)	1단계 (사전학습)	3단계 (RL post-training)
모델 공개 정책	오픈 (인프라로 수익화)	클로즈드 (API로 수익화)	이중 트랙 (기초는 오픈, 최전선은 비공개)
학계와의 관계	베이스라인 + 인프라 채택 유도	인용 대상, 비교 어려움	베이스라인 (openpi) + 최전선 추적 대상

이 표와 4.5.1 표의 관계에 주목하라. 4.5.1이 기술 축에서의 수렴을 보여준다면, 이 표는 그 위에 겹쳐지는 전략·생태계 축에서의 분기를 보여준다. 두 표 모두 진실이고, 어느 한쪽만 보면 분야 전체를 잘못 이해한다.

4.5.4 박사 과정 연구자의 시각에서 — 무엇을 추적하고 무엇을 따라가지 않을 것인가

이 절을 절충적 결론으로 끝내기보다, 박사 1–2년차가 실제로 마주칠 결정에 대한 구체적 가이드로 마무리하는 것이 더 정직할 것이다.

셋을 모두 추적해야 하는 영역. 이중 시스템 아키텍처의 진화, 행동 디코더의 수학적 메커니즘(diffusion vs flow matching vs discrete diffusion), VLM 백본의 선택과 fine-tuning 전략. 이 영역에서는 셋이 진짜로 한 게임을 하고 있어서 한 그룹의 진전이 다른 그룹의 가까운 미래를 예고한다. 본 서베이의 Section 4–5가 이 영역에 해당한다.

그룹별로 분기해서 추적해야 하는 영역. 데이터 수집·증강 전략(NVIDIA의 Cosmos 흐름, Google의 인터넷 멀티모달 흐름, PI의 실세계 함대 흐름), 학습 후 단계의 혁신(특히 PI의 RL post-training과 메모리 통합), 도메인 특화 응용(NVIDIA의 휴머노이드, Google의 자율주행 EMMA 계보). 이 영역에서는 그룹별 자산 구조가 다르기 때문에 한 그룹의 결과를 다른 그룹에 그대로 적용하기 어렵다.

한 그룹만 깊이 추적하면 충분한 영역. PI의 π^*_0.6와 π_0.6-MEM 계보. 본 서베이가 §10.13에서 두 절을 통째로 할애한 이유는, 현재 이 흐름이 VLA 분야에서 가장 빠르게 미해결 과제(BC의 성능 천장, 장시간 작업의 메모리 부재)를 정면 돌파하고 있기 때문이다. 박사 과정 연구자가 RL post-training이나 메모리 메커니즘에 관심이 있다면, openpi와 π 시리즈의 후속 논문들을 베이스라인으로 두고 출발하는 것이 가장 효율적이다.

의도적으로 따라가지 말아야 할 함정. 회사별 데모 영상의 인상에 끌려 "어느 회사가 앞서고 있는가"를 묻는 질문은 학술적으로 대답할 수 없는 질문이다. 시뮬레이션 벤치마크의 숫자만 추적하다가 실세계 일반화 격차(§10.12)를 놓치는 함정도 흔하다. 그리고 가장 큰 함정 — 통일장 논변에 너무 일찍 정착해서 "어차피 다 같은 패러다임"이라며 그룹별 차이를 무시하는 것. 패러다임이 같다는 것은 분기를 무시할 면허가 아니라, 분기가 어디서 일어날지를 더 정확히 예측할 수 있는 도구다.

4.5.5 한 줄 요약

세 그룹은 수학적 아키텍처 수준에서는 한 패러다임의 변주이고 이 진단은 정직하게 받아들여야 한다. 그러나 데이터 출처·학습 단계 자원 배분·모델 공개 정책이라는 횡단 축에서는 진짜로 다른 좌표에 있고, 이 차이가 곧 박사 과정 연구자가 자신의 연구를 어느 흐름 위에 위치시킬지를 결정한다. 5년 후 셋의 모델 자체는 더 닮아갈 가능성이 높지만, 그들이 만들어낼 생태계와 학계와의 관계는 더 분기할 가능성이 높다. 이 두 흐름을 동시에 보는 것이 박사 1–2년차가 이 분야에 정착할 때 필요한 균형이다.

4장 요약: VLA 해부학의 현재 좌표

지각 모듈은 SigLIP + DINOv2 [30] 하이브리드가 지배적 표준으로 수렴하고 있다. 두뇌 모듈은 사전학습된 VLM을 로봇의 뇌로 직접 전용하는 방향으로 확정적 진화를 이루었다. 행동 모듈은 아직 확정적 승자가 없으며, 이산 토큰화, 디퓨전, Flow Matching, FAST 토큰화 [20]가 경쟁하고 있다. 그리고 이 세 모듈을 연결하는 방식으로서 이중 시스템 아키텍처가 부상하며, 인지과학의 통찰이 공학적 설계 원리로 번역되고 있다.

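The next chapter (Section 5) turns to action tokenization, the design decision that determines how these architectures represent actions. Section 6 then covers how they are trained, through the three paradigms of behavior cloning, reinforcement learning, and world-model learning.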

Motivation Chain: 아키텍처 진화의 동기 사슬

모듈형 파이프라인의 한계(sense-plan-act 분리 → 정보 손실, 엔지니어링 부담)

→ 단일체(Monolithic) 아키텍처 등장(RT-2: 하나의 VLM이 모든 것을 처리)

→ 단일체의 한계(VLM 추론이 느려 실시간 제어 어려움)

→ 이중 시스템(Dual-system) 등장(GR00T N1: 빠른 System 1 + 느린 System 2)

→ 이중 시스템의 한계(장시간 복합 작업에서 계획 능력 부족)

→ 계층적(Hierarchical) 아키텍처(π0.5: VLM 플래너 + VLA 실행기)

Motivation Chain: evolution of the action decoder

자기회귀 디코딩의 한계(단일 모드만 생성, 이산화 정보 손실, 느린 순차 생성)

→ Diffusion Policy [17] 등장(다중 모드 행동 표현, action chunk 일괄 생성)

→ DDPM의 한계(50-100 디노이징 스텝으로 느린 생성)

→ Flow Matching(π0 [16]) 등장(선형 보간으로 5-10 스텝에 수렴)

→ DiT 기반 디코더(CogACT [23], RDT-1B [24]) 등장(Transformer의 스케일링 법칙을 디퓨전에 적용)

유사 아키텍처 차별점 비교

비교 대상	핵심 차별점
단일체 vs 이중체	단일체는 하나의 모델이 이해+행동을 모두 처리(단순하지만 속도 제약). 이중체는 이해와 행동을 분리하여 각각 최적 주파수로 운영
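Cascade vs Parallel dual-system	Cascade passes information sequentially from System 2 to System 1 (GR00T N1 [21]). Parallel processes VLM tokens and action tokens jointly in shared attention layers (π0 [16])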
Planner-Only vs Planner+Policy 계층	Planner-Only는 고수준 계획만 VLM이 담당(SayCan). Planner+Policy는 저수준 VLA 정책까지 학습(π0.5 [31])
자기회귀 vs 디퓨전 디코더	AR은 토큰을 순차 생성(이산적, 느림, VLM 어휘 재활용). 디퓨전은 노이즈→행동 반복 정제(연속적, 다중모드, 병렬 chunk)
DDPM vs Flow Matching	DDPM은 확률적 역과정 반복(50-100 스텝). Flow Matching은 결정론적 ODE 경로 학습(5-10 스텝, 더 빠르고 안정적)

직관적 한줄 설명: 아키텍처 편

단일체 VLA: "통역 없이 외국어를 바로 알아듣고 대답하는 사람 — 빠르지만 깊은 사고는 어려움"
이중 시스템: "CEO(전략적 판단, 느림)와 현장 작업자(즉각 실행, 빠름)의 분업"
계층적 구조: "요리 레시피(고수준 계획)와 칼질 기술(저수준 기술)의 분리 — 레시피를 바꿔도 칼질은 재학습 불필요"
자기회귀 디코딩: "한 글자씩 타이핑하듯 행동을 순서대로 하나씩 생성"
디퓨전 디코딩: "대리석 조각처럼 전체 형태에서 불필요한 부분(노이즈)을 깎아내어 행동을 드러냄"
Flow Matching: "디퓨전의 '조각'을 직선 경로로 단축 — 같은 결과를 더 적은 스텝으로"
Action Chunk: "한 글자씩이 아니라 한 문장을 통째로 생성 — 시간적 일관성 확보"

Self-Check Questions: Section 3-4

Q1: π0의 아키텍처를 Liu & Shao의 분류와 Zhong et al.의 분류에서 각각 어떻게 위치시킬 수 있는가?

답: Liu & Shao 분류에서 π0는 Parallel 이중 시스템(병렬 이중 시스템)이다 — VLM(PaliGemma)이 시각-언어 특징을 추출하고, Flow Matching Action Expert가 연속 행동을 생성하는 병렬 구조(VLM 토큰과 Action Expert 토큰이 공유 어텐션에서 동시 처리). Zhong et al. 분류에서는 디퓨전 기반 Pure VLA에 해당한다 — Flow Matching이 디퓨전의 변형이므로.

Q2: 자기회귀 방식의 VLA(RT-2)가 다중 모드 행동을 잘 표현하지 못하는 이유는 무엇인가?

답: 자기회귀 방식은 각 행동 차원을 이산 bin으로 양자화하여 토큰으로 생성한다. 이 과정에서 (1) 양자화 오차로 연속 행동의 정밀도가 손실되고, (2) 각 토큰이 순차적으로 이전 토큰에 조건화되므로, "오른쪽으로 집기"와 "위에서 집기"가 동시에 존재하는 다중 모드 분포를 자연스럽게 표현하기 어렵다. 분포의 평균 방향(mode averaging)으로 수렴하는 경향이 있다.

Q3: Xu et al.의 5대 도전 과제 중, 현재(2026년 기준) 가장 큰 진전을 보인 과제와 가장 미해결인 과제는 각각 무엇인가?

답: 가장 큰 진전은 (1) 표현 — VLM 기반 통합 표현, FAST 토큰화, Flow Matching 등으로 표현 문제가 크게 개선됨. 가장 미해결인 과제는 (4) 안전 — SafeVLA [75]가 첫 시도이지만, 형식적 검증(formal verification)이 부재하고, VLA의 환각이 물리적 사고로 이어질 수 있는 근본적 위험이 해결되지 않음.

Open Research Questions: Section 3-4

최적 아키텍처 선택 기준: 주어진 태스크(단순 pick-and-place vs 30분 요리)에 대해, 단일체/이중체/계층적 아키텍처 중 어떤 것이 최적인지를 사전에 예측할 수 있는 이론적 프레임워크가 가능한가?

디퓨전과 자기회귀의 통합: 자기회귀의 장점(VLM 어휘 재활용, 추론 해석 가능성)과 디퓨전의 장점(다중 모드, 연속 행동)을 결합하는 하이브리드 디코더의 최적 설계는?

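Predictive power of the five taxonomies: Of the five classification perspectives presented in this section (architecture / action generation / anatomy / function / post-processing), which one best predicts a model's actual performance?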

Action Expert 스케일링: π0의 Action Expert(0.3B)와 VLM backbone(3B)의 파라미터 비율이 최적인가? Action Expert를 더 키우면 성능이 비례하여 향상되는가?

5. 액션 토큰화 — VLA의 핵심 설계 결정

VLA(Vision-Language-Action) 모델을 설계할 때 수많은 아키텍처적 선택이 존재하지만, 그중 가장 본질적이고 파급력이 큰 결정은 "행동을 어떻게 토큰으로 표현할 것인가"이다. Chen et al. [7] (2025)은 이를 액션 토큰화(Action Tokenization) 관점이라 명명하고, VLA 모델 간의 차이를 가장 명확하게 설명하는 축으로 제시했다. 이 절에서는 그들의 분류를 기반으로, 다른 서베이들의 통찰을 통합하여 8가지 액션 토큰 유형을 체계적으로 정리한다.

왜 토큰화가 핵심인가? VLA 모델은 본질적으로 대규모 언어 모델(LLM)을 기반으로 한다. LLM은 이산적 토큰 시퀀스를 입출력으로 처리하는 시스템이므로, 로봇의 연속적 행동 공간을 이산적 토큰 공간으로 변환하는 방식이 모델의 표현력, 제어 정밀도, 추론 속도를 근본적으로 결정짓는다. 토큰화 방식의 선택은 단순한 구현 세부사항이 아니라, VLA 시스템의 능력과 한계를 규정하는 가장 상위의 설계 결정이다.

하나의 모델이 복수 유형을 조합할 수 있으며, 예를 들어 CoT-VLA [55]는 추론 토큰(유형 8)과 목표 토큰(유형 5)을 함께 활용한다.

5.1 8가지 액션 토큰 유형

유형 1: 언어 토큰 (Language Tokens)

가장 직관적인 접근은 로봇의 행동을 자연어 텍스트로 표현하는 것이다. "빨간 컵을 집어 올려라(pick up the red cup)"와 같은 자연어 명령을 LLM이 생성하면, 하위 정책(low-level policy)이나 사전 정의된 기능(skill primitive)이 이를 실제 모터 명령으로 변환한다.

대표 모델:

SayCan [14] (Ahn et al., 2022): LLM이 생성한 행동 후보에 대해 어포던스 점수를 매겨 실행 가능한 행동을 선택한다. 언어 모델의 세계 지식과 로봇의 물리적 능력을 곱(product) 연산으로 결합하는 핵심 아이디어를 제시했다.
Inner Monologue [22] (Huang et al., 2022): Converts feedback from the environment (success/failure, object recognition results) into language and feeds it into the LLM's next action plan.
SayTap [38] (Tang et al., 2023): 보행 로봇의 발 접촉 패턴을 텍스트 시퀀스로 표현하여, LLM이 보행 리듬을 직접 계획할 수 있게 했다.

장점: LLM의 강력한 언어 생성 능력과 상식 추론을 직접 활용할 수 있다. 사전학습된 언어 지식이 그대로 전이되므로, 새로운 과제에 대한 제로샷(zero-shot) 일반화가 뛰어나다.

한계: 연속적 운동 제어의 정밀도가 본질적으로 부족하다. "컵을 3cm 왼쪽으로 이동"과 같은 세밀한 조작을 자연어로 충분히 표현하기 어렵다. 제어 주파수는 1-3 Hz로 모든 토큰 유형 중 가장 낮아, 빠른 반응이 필요한 과제(예: 동적 물체 잡기)에는 적합하지 않다.
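The product rule behind SayCan-style skill selection fits in a few lines. The skill names and both score tables below are invented for illustration. In the real system the first score would come from a language model and the second from a learned value function.

```python
def select_skill(skills, llm_score, affordance):
    """Choose the skill that maximizes
    p_LLM(skill | instruction) * p_affordance(skill | current state)."""
    return max(skills, key=lambda s: llm_score[s] * affordance[s])

skills = ["pick up the sponge", "go to the sink", "pick up the apple"]

# Hypothetical scores for the instruction "clean up the spill".
llm_score = {"pick up the sponge": 0.6, "go to the sink": 0.3, "pick up the apple": 0.1}
# Hypothetical affordances: the sponge is not currently reachable.
affordance = {"pick up the sponge": 0.1, "go to the sink": 0.9, "pick up the apple": 0.8}

chosen = select_skill(skills, llm_score, affordance)
```

The language model alone would pick the sponge, but the product favors "go to the sink" (0.27 versus 0.06), because the skill that is semantically best is not feasible from the current state.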

유형 2: 코드 토큰 (Code Tokens)

행동을 실행 가능한 프로그램 코드로 표현하는 접근이다. LLM이 Python 함수나 API 호출 시퀀스를 생성하면, 이를 로봇 런타임에서 직접 실행한다.

대표 모델:

Code as Policies [36] (Liang et al., 2023): LLM이 로봇 API를 호출하는 Python 코드를 직접 생성한다. 공간적 추론, 반복문, 조건문 등 프로그래밍 구조를 활용하여 복잡한 행동 시퀀스를 구성할 수 있다.
ProgPrompt [102] (Singh et al., 2023): 과제를 프로그래밍적 형식(함수 호출, assert 문)으로 구조화하여 LLM의 계획 능력을 강화한다.
Voyager [103] (Wang et al., 2023): Minecraft 환경에서 실행 가능한 코드를 생성하고, 성공한 코드를 기술 라이브러리에 저장하여 점진적으로 역량을 확장한다.
ChatGPT for Robotics [104] (Vemprala et al., 2024): 대화형 인터페이스를 통해 사용자의 의도를 로봇 제어 코드로 변환한다.

장점: 구조적이고 재사용 가능하며, 디버깅이 용이하다. 반복문과 조건문을 통해 복잡한 행동 논리를 간결하게 표현할 수 있고, 생성된 코드를 라이브러리로 축적하여 장기적으로 기술 기반(skill repertoire)을 확장할 수 있다.

한계: 사전 정의된 API(예: pick(obj), place(x, y, z))가 반드시 필요하며, API가 지원하지 않는 미세 운동(dexterous manipulation)은 표현할 수 없다. 물리적 상호작용의 연속적 특성을 이산적 API 호출로 온전히 포착하기 어렵다.

유형 3: 어포던스 토큰 (Affordance Tokens)

물체의 조작 가능 영역과 방식을 공간적으로 표현하는 접근이다. "어디를" 잡아야 하는지, "어떤 방향으로" 밀어야 하는지를 3D 공간 상의 히트맵(heatmap)이나 벡터 필드(vector field)로 나타낸다.

대표 모델:

VoxPoser [57] (Huang et al., 2023): LLM과 VLM을 이용하여 3D 복셀(voxel) 공간에 어포던스 맵(가치 맵)과 제약 맵을 생성한다. 이 맵을 기반으로 모션 플래너가 궤적을 합성한다.
A3VLM [58] (Huang et al., 2024): VLM을 확장하여 3D 포인트클라우드에서 직접 어포던스를 예측하게 한다.
RT-Affordance [105] (Brohan et al., 2023): 시각적 어포던스를 조건으로 활용하여 조작 정책의 일반화를 돕는다.
A0 [106] (Ren et al., 2025): 어포던스 기반의 통합 조작 프레임워크로, 물체-행동 간 관계를 명시적으로 모델링한다.

핵심 가치: 어포던스 토큰은 "어디를" + "어떻게" 조작할지에 대한 중간 표현(intermediate representation)을 제공한다. 이는 고수준 언어 계획과 저수준 모터 제어 사이의 의미론적 다리(semantic bridge) 역할을 하며, 특히 새로운 물체에 대한 일반화에서 강점을 보인다.

유형 4: 궤적 토큰 (Trajectory Tokens)

엔드이펙터(end-effector)의 시공간적 경로를 토큰 시퀀스로 표현하는 접근이다. 2D 이미지 위에 미래 궤적을 스케치하거나, 3D 공간에서의 웨이포인트(waypoint) 시퀀스를 생성한다.

대표 모델:

RT-Trajectory [62] (Ahn et al., 2024): 이미지 위에 2D 궤적 스케치를 오버레이하여 시각적 프롬프트로 활용한다. 인간이 궤적을 그려주거나, 모델이 궤적을 예측할 수 있다.
LATTE (Liu et al., 2024): 언어 명령을 3D 궤적 시퀀스로 변환하는 언어-궤적 변환기.
TraceVLA [40] (Zheng et al., 2025): VLA 모델이 시각적 궤적 트레이스(trace)를 중간 표현으로 생성하여, 행동 예측의 해석 가능성과 정확도를 동시에 향상시킨다.

비디오 사전학습과의 연결: 궤적 토큰은 비디오 예측 모델과 자연스럽게 연결된다. 비디오 프레임 시퀀스는 본질적으로 시각적 궤적의 연속이므로, 대규모 비디오 데이터로 사전학습된 모델의 시공간적 이해를 궤적 예측에 직접 전이할 수 있다.

유형 5: 목표 토큰 (Goal Tokens)

미래 관찰(future observation)을 예측함으로써 목표 상태를 표현하는 접근이다. "지금 이 상태에서 행동 후 세상이 어떻게 보일 것인가"를 이미지나 포인트클라우드로 생성한다.

대표 모델:

SuSIE [59] (Black et al., 2024): 현재 관찰과 언어 명령을 입력받아 미래 하위목표(subgoal) 이미지를 생성하고, 이를 따르는 저수준 정책을 학습한다.
UniPi [93] (Du et al., 2024): 로봇 계획을 비디오 생성 문제로 프레이밍한다. 텍스트-투-비디오(text-to-video) 디퓨전 모델이 미래 프레임 시퀀스를 생성하면, 역동역학(inverse dynamics) 모델이 행동을 추출한다.
3D-VLA [37] (Zhen et al., 2024): 3D 표현 공간에서 미래 장면을 예측하여 깊이 있는 공간 이해를 반영한 목표를 생성한다.
CoT-VLA [55] (Kang et al., 2025): 시각적 하위목표(visual subgoal)를 사고의 연쇄(Chain-of-Thought)로 활용하여, "다음에 세상이 어떻게 보여야 하는가"를 명시적으로 추론한 후 행동을 결정한다.

월드 모델과의 통합: 목표 토큰은 월드 모델(world model)과의 가장 자연스러운 접점을 제공한다. 미래를 시뮬레이션하여 목표를 설정하는 것은, 내적 월드 모델을 통해 행동의 결과를 미리 상상하는 것과 본질적으로 같다. 이는 계획(planning)과 실행(execution)을 통합하는 강력한 경로를 제시한다.

유형 6: 잠재 토큰 (Latent Tokens)

학습된 잠재 공간(learned latent space)에서 행동을 표현하는 접근이다. 원시 행동 데이터를 오토인코더(autoencoder) 등으로 압축하여, 의미론적으로 풍부한 잠재 벡터로 변환한다.

대표 모델:

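LAPA [61] (Ye et al., 2025): Uses a VQ-VAE (Vector Quantized Variational Autoencoder) objective to quantize the change between video frames into a latent action codebook, without requiring action labels. The VLA model is then trained to predict these latent action tokens.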
UniVLA [80] (Li et al., 2025): 통합 잠재 행동 공간을 학습하여 다양한 로봇 형태와 과제를 하나의 모델로 처리한다.
VQ-VLA [107] (Qu et al., 2025): 벡터 양자화를 통해 행동 공간을 이산화하되, 잠재 공간의 구조를 보존하여 의미론적 일관성을 유지한다.

체현 갭 극복의 열쇠: 잠재 토큰의 가장 혁신적인 가치는 체현 갭(embodiment gap) 극복에 있다. 인간의 손 동작과 로봇 그리퍼의 동작은 물리적으로 전혀 다르지만, 잠재 공간에서는 "물체를 잡는다"라는 의미론적 행동이 유사하게 표현될 수 있다. 이를 통해 대규모 인간 비디오 데이터(Ego4D, Something-Something 등)에서 추출한 행동 지식을 로봇에 전이하는 것이 가능해진다. 이는 로봇 데이터의 만성적 부족 문제를 우회하는 핵심 전략이다.

도메인 불가지론적(domain-agnostic) 특성: 잠재 행동 공간은 특정 로봇의 관절 구성이나 행동 공간 차원에 의존하지 않으므로, 교차 체현(cross-embodiment) 학습의 자연스러운 매개체가 된다.

유형 7: 원시 행동 토큰 (Raw Action Tokens)

관절 각도, 엔드이펙터 포즈(위치 + 자세), 그리퍼 상태 등 저수준 행동 값을 직접 이산화(discretize)하여 토큰으로 변환하는 접근이다. 가장 직접적이고 단순한 토큰화 방식이다.

대표 모델:

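RT-2 [11] (Brohan et al., 2023): Uniformly discretizes each dimension of the 7-dimensional action vector (6-DoF pose + gripper) into 256 bins and maps each bin to a token in the LLM's existing vocabulary, so that actions are emitted through the ordinary language-model head.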
OpenVLA [15] (Kim et al., 2024): RT-2 [11]와 유사한 256-bin 이산화를 채택하되, 오픈소스로 재현 가능하게 구현했다.
Gato [13] (Reed et al., 2022): 다양한 과제의 행동을 1024개 구간으로 이산화하여 하나의 범용 모델로 처리한다.

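The problem of quantization error: The fundamental limitation of raw action discretization is quantization error. With 256 bins, each bin spans 1/256 ≈ 0.39% of the action range (a width of about 0.0078 when actions are normalized to [-1, 1]), so the worst-case rounding error is half a bin. In precision manipulation these errors can accumulate and lead to failure. In addition, tokenizing each action dimension independently can lose the correlations between dimensions.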
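The uniform binning described above can be sketched in a few lines. This is a minimal illustration, assuming actions are normalized to [-1, 1]; the function names `discretize` and `undiscretize` are hypothetical and do not come from RT-2 or OpenVLA.

```python
def discretize(value, low=-1.0, high=1.0, bins=256):
    """Map a continuous value in [low, high] to a bin index in [0, bins - 1]."""
    clipped = min(max(value, low), high)
    index = int((clipped - low) / (high - low) * bins)
    return min(index, bins - 1)  # value == high falls into the last bin


def undiscretize(index, low=-1.0, high=1.0, bins=256):
    """Return the center of the given bin as the decoded action value."""
    width = (high - low) / bins
    return low + (index + 0.5) * width


bin_width = 2.0 / 256            # about 0.0078 on a [-1, 1] range
x = 0.1234                       # hypothetical action value
x_hat = undiscretize(discretize(x))
error = abs(x - x_hat)           # never larger than half a bin
```

Decoding to the bin center keeps the round-trip error within half a bin width. Whether that is acceptable depends on the physical range each dimension covers.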

FAST의 혁신: FAST [20](Fast Action Tokenization) (Pertsch et al., 2025)는 이 문제에 대한 우아한 해법을 제시했다. 행동 시퀀스에 이산 코사인 변환(DCT)을 적용하여 주파수 도메인으로 변환한 후, 바이트 쌍 인코딩(BPE)으로 토큰화한다. 이를 통해:

시간적 상관관계를 DCT가 포착하여 정보 압축률 향상
BPE가 빈번한 행동 패턴을 단일 토큰으로 묶어 시퀀스 길이 단축
결과적으로 LLM의 기존 어휘 확장 메커니즘과 자연스럽게 통합

이 혁신은 원시 행동 토큰화의 효율성을 극적으로 개선하여, 기존 256-bin 방식 대비 동일 정밀도에서 토큰 수를 최대 약 13배(원논문 기준 최대치) 줄이는 성과를 거뒀다.
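The frequency-domain step can be illustrated with a small, self-contained sketch. It covers only the DCT and coefficient rounding and omits the BPE stage entirely. The trajectory, the quantization `scale`, and the helper names are assumptions made for illustration, and this is not the FAST implementation.

```python
import math


def dct(x):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
        norm = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(norm * s)
    return out


def idct(c):
    """Inverse of the orthonormal DCT-II above."""
    n = len(c)
    out = []
    for t in range(n):
        s = c[0] * math.sqrt(1.0 / n)
        s += sum(
            c[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (t + 0.5) * k / n)
            for k in range(1, n)
        )
        out.append(s)
    return out


# A smooth 16-step trajectory for one action dimension (hypothetical data).
chunk = [math.sin(math.pi * t / 15) for t in range(16)]

scale = 10.0  # assumed quantization scale; a larger value keeps more detail
coeffs = [round(c * scale) for c in dct(chunk)]
nonzero = sum(1 for c in coeffs if c != 0)

recon = idct([c / scale for c in coeffs])
max_err = max(abs(a - b) for a, b in zip(chunk, recon))
```

For smooth motions most rounded coefficients are zero, so the sequence handed to a subsequent BPE stage is sparse and compresses well. That is the intuition behind the shorter token sequences.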

ICLR 2026에서는 FAST를 넘어선 차세대 토큰화 기법들이 제안되었다. FASTer는 RVQ(Residual Vector Quantization)에 주파수/시간 도메인 손실을 결합하여 더 높은 압축률과 재구성 품질을 동시에 달성했다. OMNISAT는 B-Spline 인코더를 사용하여 매끄러운 장시간 행동 출력에 특화된 컴팩트 표현을 제공한다.

유형 8: 추론 토큰 (Reasoning Tokens)

행동을 결정하기 전에 사고 과정을 명시적 토큰으로 생성하는 접근이다. "왜 이 행동을 해야 하는가"를 먼저 추론하고, 그 추론 결과에 기반하여 행동을 예측한다.

대표 모델:

ECoT [92] (Zawalski et al., 2024): Embodied Chain-of-Thought. 행동 예측 전에 장면 설명, 과제 분석, 하위계획 등을 텍스트로 생성한다.
CoT-VLA [55] (Kang et al., 2025): 시각적 추론 토큰(미래 하위목표 이미지)과 언어적 추론 토큰을 결합하여 다중모달 사고 연쇄를 구성한다.
ThinkAct [67] (Xu et al. [8], 2025): 자율적으로 "언제 생각할 것인가"를 결정하는 적응적 추론 메커니즘을 도입했다.
Embodied-R1 [108] (Liu et al., 2025): DeepSeek-R1 스타일의 장기 추론을 체현 과제에 적용하여, 복잡한 다단계 조작에서 자발적 추론 경로를 생성한다.

성능 향상 효과: 추론 토큰의 효과는 상당하다. SC-VLA [56]의 실험에 따르면, 추론 토큰을 포함한 경우 행동 예측 품질이 약 35% 향상되었다. 이는 "생각한 후 행동하기"가 단순한 반사적 행동 대비 명확한 이점을 제공함을 보여준다.

트레이드오프: 추론 토큰은 추가적인 토큰 생성을 요구하므로 추론 지연(latency)이 증가한다. ThinkAct [67]이 "언제 생각할 것인가"를 학습하는 접근은 이 트레이드오프를 관리하려는 시도이다 — 단순한 과제에서는 즉각 행동하고, 복잡한 상황에서만 추론을 활성화한다.

5.2 토큰화와 제어 주파수의 관계

토큰화 방식은 VLA 모델의 제어 대역폭(control bandwidth)을 직접 결정한다. 이는 해당 모델이 수행할 수 있는 과제의 범위를 본질적으로 규정한다.

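Tokenization scheme	Control frequency	Representative models	Suitable tasks
Language tokens	1-3 Hz	SayCan [14], Inner Monologue [22]	High-level planning, exploration
Autoregressive raw tokens	3-6 Hz	OpenVLA [15] (~6 Hz), RT-2 [11]	Simple pick-and-place
Diffusion / flow matching + chunking	10-50 Hz	π0 [16] (flow matching, 20-50 Hz)	Flexible manipulation, contact-rich tasks
Flow matching + chunking (dual-system)	50-120 Hz	GR00T N1 [21] (~120 Hz motor output)	Agile manipulation, bimanual coordination
FAST + chunking	~50 Hz	FAST-VLA	General-purpose manipulation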

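The key insight from this table is that the tokenization scheme bounds the control bandwidth, which in turn bounds the range of feasible tasks. A 1-3 Hz control loop can handle a simple task such as "put the red block in the blue bowl," but it is poorly suited to grasping an egg without breaking it, where the controller must react to contact quickly. Delicate manipulation that relies on force control generally calls for control rates of roughly 50 Hz or higher.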

이러한 주파수 격차의 근본 원인은 토큰 생성 메커니즘에 있다:

자기회귀 디코딩은 토큰을 하나씩 순차적으로 생성하므로, 행동 차원 수에 비례하여 지연이 증가한다 (7DoF → 7개 토큰 순차 생성).
디퓨전/플로우 매칭 기반의 행동 청킹(action chunking)은 한 번의 디노이징(denoising) 과정으로 수십 스텝의 행동을 동시에 생성한다. 추론 빈도는 낮지만, 청크 내의 행동을 고주파로 실행할 수 있다.
FAST [20]는 DCT+BPE를 통해 행동 시퀀스를 압축된 소수의 토큰으로 표현하여, 자기회귀 방식임에도 높은 실효 주파수를 달성한다.
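The relationship between inference latency, chunk size, and achievable control rate can be made concrete with a short calculation. This sketch assumes inference is the only bottleneck and that every action in a chunk is executed. The function name and the chunk size of 16 are hypothetical.

```python
def control_frequency_hz(latency_ms, actions_per_inference=1):
    """Upper bound on executed actions per second when inference is the bottleneck."""
    if latency_ms <= 0:
        raise ValueError("latency_ms must be positive")
    return actions_per_inference * 1000.0 / latency_ms


# One action per 166 ms forward pass (the OpenVLA latency quoted in this document).
single_step = control_frequency_hz(166)

# The same latency, but each forward pass yields a 16-step chunk (hypothetical).
chunked = control_frequency_hz(166, actions_per_inference=16)
```

With one action per pass the rate is about 6 Hz. Emitting a chunk multiplies the bound by the chunk length, which is why chunked policies reach much higher control rates without faster inference.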

5.3 토큰화 선택이 성능에 미치는 영향

이산 vs 연속: 다중모달 분포 문제

토큰화 선택이 성능에 미치는 가장 심각한 영향은 다중모달 분포(multimodal distribution) 상황에서 나타난다. 예를 들어, 테이블 위의 물체를 왼쪽이나 오른쪽 어느 방향으로든 밀어도 되는 상황을 생각해보자. 이 경우 올바른 행동 분포는 이봉 분포(bimodal distribution) — 왼쪽과 오른쪽에 각각 확률 질량이 집중된 형태 — 를 갖는다.

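How severe this is depends on the output head and the loss. A continuous action head trained with a mean-squared-error regression loss converges toward the mean of the two modes, that is, an action that pushes in neither direction. This is the mode-averaging problem, and it worsens as the demonstrations become more diverse. Discrete raw tokens (256 bins) trained with standard cross-entropy output a categorical distribution and can in principle place probability mass on both modes, but they pay for this with quantization error and one sequential decoding step per action dimension.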

디퓨전 기반 연속 행동 생성은 이 문제에 대한 자연스러운 해법을 제공한다. 디퓨전 모델은 본질적으로 다중모달 분포를 표현할 수 있어, 이봉 분포의 양쪽 모드를 모두 포착한다. 이것이 π0 [16], GR00T N1 [21] 등이 디퓨전/플로우 매칭을 채택한 핵심 이유 중 하나이다.
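A toy experiment makes the push-left-or-right example concrete. The sketch compares a point estimate fit by squared error with a sampler that preserves both modes. The sampler simply draws from the demonstrations and stands in for a multimodal generative model. All names and numbers here are illustrative assumptions.

```python
import random

random.seed(0)

# Demonstrations push left (-1.0) or right (+1.0) with equal probability.
demos = [random.choice([-1.0, 1.0]) for _ in range(1000)]

# The constant prediction that minimizes squared error is the sample mean,
# which sits between the two modes and matches neither demonstrated action.
mse_prediction = sum(demos) / len(demos)


def sample_multimodal(n):
    """Stand-in for a multimodal generative policy: every sample lands on a real mode."""
    return [random.choice(demos) for _ in range(n)]


samples = sample_multimodal(5)
```

The squared-error prediction lands near 0, an action that pushes in neither direction. Every sample from the multimodal stand-in commits to one valid direction.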

행동 공간과 디코딩의 공결정

토큰화 방식과 디코딩 전략은 독립적으로 선택되는 것이 아니라 공결정(co-determined)된다:

이산 행동 공간 ↔ 자기회귀 디코딩: 행동을 이산 토큰으로 표현하면 LLM의 기존 언어 모델 헤드(language model head)를 그대로 재사용할 수 있다. RT-2 [11], OpenVLA [15]가 이 경로를 택한다. 장점은 아키텍처 수정이 최소화되어 LLM의 사전학습 지식을 최대한 보존할 수 있다는 것이다.
연속 행동 공간 ↔ 디퓨전/플로우 기반 생성 헤드: 행동을 연속 벡터로 유지하면 전용 생성 헤드(diffusion head, flow matching head)가 필요하다. π0 [16], GR00T N1 [21]이 이 경로를 택한다. 장점은 다중모달 분포 표현과 높은 제어 주파수이며, 단점은 LLM 백본과의 통합에 추가적인 설계가 필요하다는 것이다.

최근에는 이 두 경로를 결합하려는 시도도 나타나고 있다. 예를 들어 VQ-VLA는 벡터 양자화를 통해 연속 행동을 이산 토큰으로 변환하되, 잠재 공간의 구조를 보존하여 두 세계의 장점을 결합하려 한다.

행동 청킹의 트레이드오프

행동 청킹(action chunking)은 한 번의 추론으로 여러 타임스텝의 행동을 동시에 예측하는 기법이다. ACT (Zhao et al., 2023)에서 제안된 이래 VLA 설계의 핵심 요소로 자리잡았다.

청크 크기(chunk size)를 증가시키면:

추론 빈도가 감소하여 계산 효율이 향상된다 (예: 청크 크기 16 → 추론 횟수 1/16)
궤적 일관성이 향상된다 — 매 스텝 독립적으로 예측하면 궤적이 불안정해질 수 있지만, 청크 단위로 예측하면 시간적 일관성이 보장된다
그러나 환경 변화에 대한 반응성이 감소한다 — 청크 실행 중간에 예상치 못한 상황이 발생해도 즉각 대응할 수 없다

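To manage this trade-off, systems such as π0 [16] tune the chunk size to the task, while ACT introduced temporal ensembling, which averages the overlapping predictions of successive chunks so that the executed actions change smoothly.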
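Temporal ensembling, as introduced with ACT, can be sketched as a weighted average over all chunk predictions that cover the current timestep. The weighting below follows the exponential scheme w_i = exp(-m * i), with i = 0 for the oldest prediction. The function name and the default value of `m` are assumptions for illustration.

```python
import math


def temporal_ensemble(predictions, m=0.1):
    """Blend actions predicted for the same timestep by overlapping chunks.

    predictions: scalar actions for the current timestep, oldest chunk first.
    m: decay rate; smaller values incorporate new predictions more slowly.
    """
    weights = [math.exp(-m * i) for i in range(len(predictions))]
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, predictions)) / total


# Three overlapping chunks disagree slightly about the current action.
blended = temporal_ensemble([0.50, 0.54, 0.60])
```

The blended action always lies between the smallest and largest prediction. That smooths the transition between chunks but delays the effect of the newest observation.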

5.4 소결: 토큰화 관점의 통합적 이해

8가지 액션 토큰 유형은 추상화 수준(abstraction level)의 스펙트럼으로 이해할 수 있다:

높은 추상화  ←─────────────────────────────────→  낮은 추상화
언어 → 코드 → 추론 → 목표 → 궤적 → 어포던스 → 잠재 → 원시

왼쪽으로 갈수록 인간에게 해석 가능하고 일반화가 뛰어나지만 제어 정밀도가 낮으며, 오른쪽으로 갈수록 정밀하지만 일반화와 해석 가능성이 떨어진다. 현대의 가장 성공적인 VLA 시스템은 이 스펙트럼의 여러 수준을 계층적으로 결합한다. 예컨대 추론 토큰(높은 추상화)으로 계획을 세운 후, 원시 행동 토큰(낮은 추상화)으로 실행하는 구조이다.

이 관점에서 VLA 연구의 핵심 질문은 "어떤 토큰 유형이 최선인가?"가 아니라, "어떤 과제와 체현에 대해, 어떤 토큰 유형의 조합이 최적인가?"이다.

위 추상화 순서는 하나의 관점이며, 도메인과 태스크에 따라 어포던스와 궤적의 상대적 추상화 수준이 달라질 수 있다.

6. 학습 패러다임의 진화

VLA 모델의 학습은 단순한 지도 학습을 넘어, 사전학습에서 후처리(post-training)까지 이르는 다층적 과정으로 진화하고 있다. 이 절에서는 이 진화의 각 단계를 체계적으로 살펴보고, 인간의 운동 학습 이론과의 흥미로운 병행 관계를 탐구한다.

6.1 사전학습 — 인터넷에서 로봇으로

VLA 모델의 사전학습은 2단계 공동 학습(two-phase joint training)이 사실상 표준으로 자리잡았다.

Phase 1: 인터넷 스케일 이미지-텍스트 사전학습

첫 번째 단계에서는 인터넷에서 수집한 대규모 이미지-텍스트 데이터로 비전-언어 모델(VLM)을 학습한다. LAION-5B(50억 이미지-텍스트 쌍), COCO, Visual Genome 등의 데이터셋을 통해 모델은 시각적 세계에 대한 풍부한 의미론적 사전지식(semantic prior)을 획득한다.

이 단계에서 모델이 학습하는 것은 로봇 제어 자체가 아니라, 그 전제가 되는 세계 이해이다:

물체 인식과 분류 ("이것은 머그잔이다")
공간적 관계 이해 ("머그잔이 테이블 위에 있다")
물리적 속성 추론 ("유리잔은 깨질 수 있다")
상식적 행동 지식 ("머그잔을 마시려면 손잡이를 잡는다")

Phase 2: 로봇 궤적 데이터 미세조정

두 번째 단계에서는 실제 로봇 궤적(trajectory) 데이터로 모델을 미세조정(fine-tuning)한다. 핵심 데이터셋으로는:

Open X-Embodiment (OXE) [19]: A cross-embodiment dataset covering 22 robot embodiments and more than 1 million episodes. It is currently the de facto standard data source for VLA training.
BridgeData V2: WidowX 로봇의 다양한 환경에서의 조작 데이터 (~60,000 궤적).
RT-1 [12] 데이터: Google의 Everyday Robots에서 수집한 대규모 단일-형태 데이터셋.

이 2단계 구조의 핵심 통찰은 의미론적 사전지식과 감각운동 기술의 분리이다. 세계를 이해하는 능력(Phase 1)과 세계에서 행동하는 능력(Phase 2)은 서로 다른 데이터 소스에서 효율적으로 학습될 수 있다.

비디오 사전학습의 부상

최근에는 이미지-텍스트 데이터를 넘어 비디오 데이터를 사전학습에 활용하는 흐름이 가속화되고 있다:

GR-2 [100] (Cheang et al., 2024): 웹스케일 비디오로 사전학습한 비디오 생성 모델을 로봇 정책의 기초로 활용한다. 비디오에 내재된 물리적 역학(dynamics) 이해가 로봇 제어에 전이된다.
자아중심 비디오 (Ego4D, EPIC-Kitchens): 인간이 직접 조작하는 1인칭 시점 비디오는 로봇의 시점과 유사하여, 특히 조작 과제에 대한 풍부한 사전지식을 제공한다.

비디오 사전학습이 이미지-텍스트 사전학습에 비해 갖는 결정적 장점은 시간적 역학(temporal dynamics)의 이해이다. 이미지는 정적 장면의 이해를, 비디오는 "이 행동 이후에 세상이 어떻게 변하는가"에 대한 이해를 제공한다.

시뮬레이션 데이터의 역할

UniSim [101] (Yang et al., 2023): 행동 조건부 비디오 디퓨전 모델로, 가상 환경에서의 상호작용 시뮬레이션을 통해 무한한 학습 데이터를 생성한다.
Genesis (Xian et al., 2024): GPU 가속 물리 시뮬레이터로, 현실과 유사한 물리적 상호작용 데이터를 대규모로 생성한다.

시뮬레이션 데이터의 주요 과제는 심-투-리얼 갭(sim-to-real gap) — 시뮬레이션과 현실 사이의 시각적, 물리적 차이 — 이다. 도메인 랜덤화(domain randomization), 도메인 적응(domain adaptation) 등의 기법이 이 갭을 줄이기 위해 활발히 연구되고 있다.

스케일링 법칙

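Scaling behavior is also being observed in VLA training. Zhang et al. [6] note that doubling the amount of trajectory data improves task success rate by roughly 8-12% (the figure is taken from the underlying original paper; the survey itself does not state an exact number). This suggests that more data translates fairly directly into better performance, and it supports the importance of large-scale data collection infrastructure such as OXE [19] and DROID.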

다만, 데이터 스케일링만으로는 한계가 있다. 궤적 데이터의 질(quality), 다양성(diversity), 커버리지(coverage)가 양(quantity) 못지않게 중요하며, 단순히 데이터 양을 늘리는 것보다 더 효율적인 학습 알고리즘의 개발도 병행되어야 한다.

6.2 모방학습(Behavioral Cloning)의 한계

모방학습(Behavioral Cloning, BC)은 전문가 시연을 지도 학습 방식으로 따라하는 가장 기본적인 정책 학습 방법이다. 관찰-행동 쌍 $(o_t, a_t)$이 주어지면, 정책 $\pi(a|o)$를 최대우도 추정(MLE)으로 학습한다. VLA 모델의 대부분은 이 BC 프레임워크 위에 구축되어 있다.

그러나 BC에는 근본적인 한계들이 존재한다:

분포 이동(Distribution Shift)

학습 시 모델이 보는 상태 분포와 실행 시 모델이 마주치는 상태 분포가 다르다. 전문가 시연에서는 전문가의 정책 $\pi^*$가 생성한 상태 분포를 따르지만, 실행 시에는 학습된 (불완전한) 정책 $\hat{\pi}$가 생성한 상태 분포를 따른다. 이 분포 불일치는 학습 데이터에서 벗어난 상태에서의 예측 불가능한 행동으로 이어진다.

공변량 오류 누적(Covariate Shift & Compounding Errors)

각 타임스텝에서의 작은 예측 오류가 시간이 지남에 따라 누적된다. 한 스텝에서의 약간의 위치 오차가 다음 스텝에서는 더 큰 오차를 초래하고, 이것이 연쇄적으로 증폭되어 장기 과제에서 치명적 실패를 야기한다. 예를 들어, 30초 동안 진행되는 복잡한 조립 과제에서는 초반의 미세한 오차가 후반에 완전한 실패로 귀결될 수 있다.
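The BC objective and the error-compounding argument above can be written out explicitly. The bound is the classical result of Ross and Bagnell (2010). It assumes that the per-step cost lies in $[0, 1]$ and that the learned policy $\hat{\pi}$ deviates from the expert $\pi^*$ with probability at most $\epsilon$ on states drawn from the expert's distribution. The symbols $\mathcal{D}$ (demonstration set), $T$ (task horizon), and $J$ (expected total cost) are notation introduced here for illustration.

$$\hat{\pi} = \arg\max_{\pi} \; \mathbb{E}_{(o_t, a_t) \sim \mathcal{D}} \left[ \log \pi(a_t \mid o_t) \right]$$

$$J(\hat{\pi}) \le J(\pi^*) + T^2 \epsilon$$

The quadratic dependence on $T$ is the formal counterpart of the compounding described above. A single mistake can move the policy into states the expert never visited, where it may keep erring for the rest of the episode.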

차선 시연에서의 개선 불가

BC는 본질적으로 시연의 상한(upper bound)에 제약된다. 시연 자체가 최적이 아니거나 노이즈가 포함된 경우, 모델은 그 수준을 넘어설 수 없다. 더 나은 행동을 발견하는 메커니즘이 부재하다.

안전/선호 신호의 부재

BC는 "무엇을 해야 하는가"만 학습하고, "무엇을 하지 말아야 하는가"나 "어떤 행동이 더 선호되는가"에 대한 신호를 활용하지 못한다. 안전 제약(예: 인간 근처에서의 속도 제한)이나 사용자 선호(예: 부드러운 동작 선호)를 명시적으로 반영할 수 없다.

핵심 결론

BC는 VLA 학습의 필요조건이지만 충분조건은 아니다. 대규모 시연 데이터를 효율적으로 활용하는 데 BC는 불가결하지만, 그 한계를 극복하기 위한 후처리(post-training)의 필요성이 점점 더 명확해지고 있다.

6.3 강화학습 후처리

Jin et al. [9] (2025)을 중심으로, 최근 VLA 모델의 성능을 BC 이후에 강화학습(RL)으로 한 단계 더 끌어올리는 연구가 폭발적으로 증가하고 있다. 이는 대규모 언어 모델(LLM) 분야에서 SFT(Supervised Fine-Tuning) 이후 RLHF(Reinforcement Learning from Human Feedback)로 모델을 정렬(alignment)하는 패러다임과 정확히 병행한다.

온라인 RL (Online Reinforcement Learning)

모델이 실제 환경(또는 시뮬레이션)에서 직접 상호작용하며 보상을 받아 학습한다:

PPO 기반:
VLA-RL [68] (Tan et al., 2025): VLA 모델에 PPO를 적용하여 온라인 환경 상호작용으로 성능 향상
RIPT-VLA [71] (Su et al., 2025): Reinforcement learning via Iterative Policy Training. RIPT-VLA [71]는 원 논문(Su et al., 2025)에 따르면, 특정 태스크에서 SFT 4% 성공률에서 출발하여 PPO 15회 반복 후 97% 성공률에 도달했다. 단, Jin et al. [9]의 LIBERO 벤치마크 비교에서는 평균 74.7%로, 벤치마크에 따라 성능 차이가 크다.
iRe-VLA (Xu et al. [8], 2025): 반복적 RL을 통해 점진적으로 정책을 개선

GRPO 기반:
ThinkAct [67] (Xu et al. [8], 2025): Group Relative Policy Optimization을 적용하여 추론과 행동을 동시에 강화
TGRPO [66] (Li et al., 2025): 추론 일관성 보상을 포함하는 확장된 GRPO
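The core of GRPO is that advantages are computed relative to a group of rollouts for the same task, with no learned critic. A minimal sketch of that normalization step follows. The function name and the reward values are hypothetical, and the policy-gradient update itself is not shown.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward by the mean and std of its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four rollouts of the same instruction: two succeed (1.0), two fail (0.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Successful rollouts receive positive advantages and failed ones negative advantages. If every rollout in a group gets the same reward, all advantages are zero and the group contributes no learning signal.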

오프라인 RL (Offline Reinforcement Learning)

기존에 수집된 데이터만으로 정책을 개선한다. 추가 환경 상호작용 없이도 차선 시연에서 더 나은 행동을 추출할 수 있다:

PA-RL: CalQL(Calibrated Q-Learning) 기반 재순위화를 통해 오프라인 데이터에서 최적 궤적을 선별
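ConRFT [69] (Li et al., 2025): Reinforced fine-tuning with a consistency policy. It starts with an offline stage on pre-collected demonstrations and then continues with online fine-tuning, so it bridges the offline and online categories.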

선호 최적화 (Preference Optimization)

인간의 선호를 직접 학습 신호로 활용한다:

HAPO [84] (Li et al., 2025): DPO(Direct Preference Optimization)를 VLA에 적용. 쌍별(pairwise) 궤적 비교를 통해 선호되는 행동 패턴을 학습
RAPL [83] (Tian et al., 2025): 시각적 선호 인코딩(visual preference encoding)을 통해, 인간이 비디오 클립을 비교하는 것만으로 보상 함수를 학습
GRAPE [109] (Wang et al., 2025): 다중 스케일 선호 학습 — 궤적 수준, 세그먼트 수준, 스텝 수준에서의 선호를 동시에 반영
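For reference, the standard DPO loss on one preferred/rejected trajectory pair can be sketched as below. Each argument is the summed log-probability of a trajectory's actions under the trained policy or a frozen reference policy. The function name, `beta`, and the example numbers are assumptions, and the methods listed above may use modified objectives.

```python
import math


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for one (preferred, rejected) trajectory pair.

    logp_w / logp_l: policy log-probs of the preferred / rejected trajectory.
    ref_logp_w / ref_logp_l: the same quantities under the frozen reference policy.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))


# Before training the policy equals the reference, so the loss is log(2).
initial = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# Raising the preferred trajectory's likelihood relative to the reference lowers the loss.
improved = dpo_loss(-10.0, -15.0, -12.0, -15.0)
```

The loss depends only on how much the policy has shifted probability toward the preferred trajectory relative to the reference, so no explicit reward model is needed.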

보상 설계의 스펙트럼

RL 후처리의 핵심 도전 과제는 적절한 보상 함수의 설계이다. 현재까지 제안된 보상 유형은 다음과 같다:

보상 유형	특성	대표 방법
과제 성공 보상 (이진/희소)	성공=1, 실패=0. 설계 간단하지만 학습 효율 낮음	대부분의 온라인 RL 방법
VLM 생성 밀집 보상	VLM이 자동으로 중간 보상 함수 생성. 인간 설계 불필요	IKER [85]
선호 기반 보상 (RLHF 스타일)	인간 비교 피드백에서 보상 학습	HAPO, RAPL [83]
안전 제약 보상	안전 위반에 대한 페널티 부여	SafeVLA [75]
추론 일관성 보상	추론 과정과 행동 결과의 일관성 보상	TGRPO [66], ThinkAct [67]

핵심 성과 수치

RL 후처리의 효과를 보여주는 인상적인 수치들:

RIPT-VLA [71]: 원 논문(Su et al., 2025) 기준 특정 태스크에서 SFT 4% → PPO 15회 반복 후 97% 성공률 (Jin et al. [9] LIBERO 기준 평균 74.7%)
SimpleVLA-RL [70]: 17.3% → 91.7% (과제당 궤적 단 1개로) (원논문 직접 인용; 14개 서베이 외 출처)
이러한 수치들은 RL 후처리가 "부가적 개선"이 아닌 본질적 성능 도약을 가져올 수 있음을 증명한다.

ICLR 2026의 자기 개선 잔차 RL 방법들은 LIBERO에서 99% 성공률에 도달하여, RL 후처리의 잠재력이 벤치마크 포화 수준까지 끌어올릴 수 있음을 보여주었다. 단계 인식 강화학습(stage-aware reinforcement)은 태스크를 의미론적 구성 요소로 분해하여 각 단계별로 최적화하는 새로운 접근이다.

BC→RL 전환 불안정성 해결

BC로 초기화된 모델에 RL을 적용할 때 가장 큰 실무적 도전은 학습 불안정성이다. RL 업데이트가 BC에서 학습된 유용한 행동 패턴을 파괴(catastrophic unlearning)할 수 있다. 이를 해결하기 위한 전략들:

BC 손실 정규화: RL 목적함수에 BC 손실을 정규화 항으로 추가하여, BC에서 학습된 기본 능력이 보존되도록 한다.
VL 인코더 동결: 비전-언어 인코더의 가중치를 고정하고 정책 헤드만 RL로 업데이트하여, 사전학습된 시각-언어 이해 능력을 보존한다.
이중-Q/앙상블 크리틱: 가치 함수 추정의 과대평가(overestimation)를 억제하여 학습 안정성을 확보한다.

6.4 인간 운동학습과의 병행

Jin et al. [9] (2025)은 VLA 학습 패러다임이 인간의 운동학습(motor learning) 이론과 놀라울 정도로 유사한 구조를 가지고 있음을 지적했다. 이 비유는 단순한 은유를 넘어, VLA 연구의 미래 방향에 대한 실질적인 통찰을 제공한다.

Newell의 제약-주도 이론 (1986)

Karl Newell은 운동 행동이 세 가지 제약의 상호작용으로 출현한다고 주장했다. 이 프레임워크는 VLA 설계와 직접적으로 대응된다:

환경 제약 (Environmental Constraints):

인간: 중력, 마찰, 물체의 물리적 속성 등
VLA: 어포던스 인식, 지각 강화 모듈 → 환경의 물리적 제약을 모델에 인코딩

유기체 제약 (Organismic Constraints):

인간: 신체 크기, 근력, 관절 가동 범위 등
VLA: 체현 인식(embodiment awareness) → 순운동학(forward kinematics), 역운동학(inverse kinematics) 학습

과제 제약 (Task Constraints):

인간: 과제의 목표, 규칙, 시간 제한 등
VLA: 계층적 과제 분해, Chain-of-Thought 추론 → 복잡한 과제를 관리 가능한 하위과제로 분해

신경과학적 대응

VLA의 각 구성 요소는 인간 뇌의 특정 시스템과 기능적으로 대응된다:

뇌 시스템 / 메커니즘	기능	VLA 대응 요소
게놈(유전적 사전지식)	선천적 운동 능력의 기초	인터넷 스케일 사전학습
기술 습득(연습을 통한 학습)	구체적 운동 기술의 숙달	RL 후처리, 과제별 미세조정
소뇌 순방향 모델	행동 결과 예측	순운동학 학습, 월드 모델
기저핵 청킹	운동 시퀀스의 자동화	행동 청킹 (Action Chunking)
전문가 코칭	외부 피드백을 통한 교정	인간-로봇 상호작용(HRI)
보상 예측 오류(basal ganglia의 도파민 시스템과 유사)	기대와 결과의 차이 신호	RL 보상 신호 (TD 오류)
내부 월드 모델	환경의 심적 시뮬레이션	시각적 상호작용 예측(VIP)

이 대응 관계에서 특히 주목할 만한 것은 기저핵 청킹(basal ganglia chunking)과 행동 청킹(action chunking)의 유사성이다. 인간이 복잡한 운동 시퀀스(예: 피아노 연주)를 반복 연습을 통해 하나의 "청크"로 자동화하는 과정은, VLA 모델이 여러 타임스텝의 행동을 하나의 청크로 묶어 생성하는 메커니즘과 놀랍도록 유사하다.

또한, 보상 예측 오류(basal ganglia의 도파민 시스템)와 RL의 시간차(TD) 오류의 대응은 우연이 아니다. 두 시스템 모두 "기대했던 것과 실제 결과의 차이"를 학습 신호로 사용하여 행동을 점진적으로 개선한다.
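The temporal-difference error referred to above has a standard one-step form. Here $V$ is a learned state-value function, $\gamma$ a discount factor, and $\alpha$ a learning rate. This notation is introduced here and is not taken from Jin et al. [9].

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t), \qquad V(s_t) \leftarrow V(s_t) + \alpha \, \delta_t$$

When the outcome is better than expected, $\delta_t > 0$ and the value estimate is raised. When it is worse, $\delta_t < 0$ and the estimate is lowered. This is the sense in which $\delta_t$ plays the role of a reward prediction error.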

이 비유의 실용적 함의

이 병행 관계는 단순한 지적 유희가 아니라, VLA 연구의 미래 방향을 제안한다:

인간이 신체 도식(body schema)을 유연하게 확장하는 능력(도구 사용)은 VLA의 교차 체현 일반화 연구에 영감을 준다.
인간의 운동 기억(motor memory)이 수면 중 강화되는 현상은 오프라인 RL 및 리플레이(experience replay)의 중요성을 시사한다.
인간이 관찰만으로도 운동 기술을 학습하는 능력(거울 뉴런 시스템)은 인간 비디오에서의 잠재 행동 학습과 직접 연결된다.

6.5 자기개선과 평생학습

VLA 시스템이 실제 환경에 배포된 후에도 지속적으로 성능을 개선해 나가는 능력은, 실용적 관점에서 가장 중요한 연구 방향 중 하나이다.

자율 데이터 수집

SOAR (Zhou et al., 2024): An autonomous data collection framework guided by foundation models (VLM, LLM). The model itself judges which data is lacking and autonomously collects data in that region.
핵심 아이디어: 능동 학습(active learning)의 체현 버전 — 모델의 불확실성이 높은 상황을 자동으로 탐색하고 경험한다.

온라인 자기개선

RoboCat [110] (Bousmalis et al., 2024): 자기개선 루프(self-improvement loop)를 구현한 선구적 시스템. 모델이 생성한 궤적 중 성공한 것들을 학습 데이터에 추가하여 반복적으로 개선한다.
VLA-RL [68] (Tan et al., 2025): 온라인 RL을 통해 배포 후에도 환경 상호작용으로부터 지속적으로 학습한다.

자기개선의 핵심 도전은 자기강화 편향(self-reinforcement bias)이다. 모델이 자신의 (불완전한) 출력을 학습 데이터로 사용하면, 기존의 오류나 편향이 증폭될 수 있다. 이를 방지하기 위해 품질 필터링, 다양성 보장 메커니즘, 인간 개입(human-in-the-loop) 등이 필요하다.

평생학습의 핵심 과제: 치명적 망각

VLA 시스템이 새로운 과제나 환경에 적응할 때 직면하는 가장 심각한 문제는 치명적 망각(catastrophic forgetting)이다. 새로운 데이터로 미세조정하면 이전에 학습한 능력이 손실되는 현상이다.

VLA에서의 치명적 망각의 구체적 양상:

VL 인코더를 완전히 해동(unfreeze)하여 미세조정하면, 인터넷 스케일 사전학습에서 획득한 풍부한 시각-언어 이해 능력이 점진적으로 손실된다.
특정 환경에 과적합(overfit)되면, 다른 환경에서의 일반화 능력이 저하된다.
특정 로봇 형태에 특화되면, 교차 체현 전이 능력이 약화된다.

해결 전략:

선택적 해동(Selective Unfreezing): 모든 파라미터를 업데이트하는 대신, 과제 관련 레이어만 선택적으로 미세조정. LoRA(Low-Rank Adaptation) 등의 파라미터 효율적 미세조정(PEFT) 기법이 대표적이다.
ReVLA [111] (Shi et al., 2025): 가역적 학습(reversible learning) 메커니즘을 도입하여, 새로운 과제 학습 시 이전 지식을 가역적으로 보존한다.
π0.5-KI [31]: Uses gradient blocking to selectively stop gradients from propagating into specific modules, which protects pretrained knowledge.

교차 체현 일반화

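Ultimately, a VLA system should not be tied to a specific robot but should generalize across embodiments. A skill such as "picking up an object" learned on a 7-axis articulated arm should also work with a parallel gripper, a dexterous hand, and a mobile manipulator.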

HPT [96] (Wang et al., 2024): Heterogeneous Pretrained Transformer. 공유 잠재 공간(shared latent space)과 체현별 헤드(embodiment-specific head)를 분리한 아키텍처. 공유 트랜스포머가 과제 의미론(task semantics)을 처리하고, 각 로봇 형태에 맞는 전용 헤드가 해당 행동 공간으로 변환한다.
UniAct [112] (Ning et al., 2025): 통합 행동 공간을 3D 공간으로 정의하여, 로봇 형태에 무관한 범용 행동 표현을 학습한다.
BridgeVLA [113] (Li et al., 2025): 서로 다른 로봇 데이터셋 간의 브릿지 역할을 하는 VLA 모델로, 교차 데이터셋 전이를 촉진한다.

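The core challenge of cross-embodiment generalization is the heterogeneity of action spaces. A 7-DoF robot arm, a 12-DoF dexterous hand, and a humanoid with 20 or more DoF differ fundamentally in both the dimensionality and the meaning of their action spaces. To overcome this heterogeneity, latent action tokens (Type 6) and task-space representations are being used as the key intermediaries.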

6.6 소결: 학습 패러다임의 3단계 성숙

VLA 학습 패러다임의 진화를 종합하면, LLM 학습의 발전 경로와 놀라울 정도로 유사한 3단계 성숙 모델이 드러난다:

단계	LLM	VLA	핵심 기여
1단계: 사전학습	대규모 텍스트 코퍼스	인터넷 스케일 이미지/비디오 + 로봇 궤적	기초 능력 형성
2단계: 지도 미세조정	SFT (지시 따르기)	BC (시연 따르기)	과제 수행 능력
3단계: RL 후처리	RLHF/DPO (정렬)	RL/선호 최적화 (정렬 + 초월)	BC 한계 극복, 최적 성능

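VLA research is currently in transition from stage 2 to stage 3. The results of RIPT-VLA [71] (4% → 97% on a specific task, per the original paper) and SimpleVLA-RL [70] (17.3% → 91.7%, quoted directly from the original paper) suggest that this transition can bring a paradigm-level leap and not merely incremental improvement. RL post-training appears likely to join BC as a standard part of the VLA training pipeline.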

동시에, 인간 운동학습과의 비유가 시사하듯, 학습은 단일 단계의 문제가 아니라 평생에 걸친 지속적 과정이다. 배포 후 자기개선, 새로운 환경에의 적응, 치명적 망각 없는 지식 축적 — 이러한 평생학습 능력의 구현은 VLA 연구의 장기적 과제이자, 진정으로 범용적인 로봇 시스템을 향한 필수 요건이다.

Motivation Chain: 학습 패러다임의 진화


수동 규칙 기반 제어의 한계(환경마다 재설계 필요)

→ Behavior Cloning 등장(시연만 보여주면 학습)

→ BC의 한계(분포 이탈, 시연 품질이 성능 천장, 다중 모드 행동 미표현)

→ Diffusion Policy [17] 등장(다중 모드 행동을 확산으로 표현)

→ VLM 사전학습 활용(인터넷 지식 전이로 일반화 향상)

→ BC의 근본적 한계 잔존(시연 밖 행동 발견 불가, 안전·선호 신호 부재)

→ RL 후처리 등장(BC로 초기 정책 → RL로 시연 초월)

→ RL 후처리의 과제(학습 불안정, reward 설계 어려움, catastrophic forgetting)

→ VLM-생성 보상, 선호 최적화 등 안정화 기법 등장

Motivation Chain: Evolution of Action Tokenization

Bin 이산화의 한계(RT-2: 7DoF × 16스텝 = 112토큰, 느린 추론)

→ FAST 등장(DCT+BPE로 최대 13배 압축)

→ Latent 토큰화(VQ-BeT [60], LAPA [61]: 연속 행동을 학습된 잠재 공간으로 압축)

→ 다양한 표현의 공존(태스크 특성에 따라 최적 토큰 유형이 다름)

8가지 행동 토큰 유형: 핵심 차별점

토큰 유형	한줄 핵심	대표 모델	제어 주파수	장점	한계
Language	자연어로 행동 기술	SayCan, Inner Monologue	1-3Hz	해석 가능, VLM 직접 활용	정밀 제어 불가
Code	프로그램으로 행동 기술	Code-as-Policies	1-5Hz	루프·조건문으로 복잡 로직 표현	새 API마다 재설계
Affordance	파지 가능 영역/자세	VoxPoser [57], A3VLM [58]	3-10Hz	3D 공간 이해	비조작 태스크에 부적합
Trajectory	경로점/궤적	RT-Trajectory [62], TraceVLA	5-10Hz	시각적 직관성	힘 제어 부재
Goal	목표 상태 이미지/포인트	SuSIE [59], 3D-VLA	1-5Hz	태스크 독립적	중간 과정 미지정
Latent	학습된 잠재 벡터	VQ-BeT [60], LAPA [61], UniVLA [80]	10-30Hz	압축 효율, 정보 보존	해석 불가
Raw Action	직접 이산화된 관절값	RT-2, OpenVLA	3-10Hz	단순, VLM 어휘 재활용	토큰 수 폭발
Reasoning	추론 과정+행동	CoT-VLA [55], SC-VLA [56]	1-5Hz	추론 가능, 자기교정	추론 오버헤드

직관적 한줄 설명: 행동 토큰화와 학습 편

Bin 이산화(RT-2 방식): "온도계의 연속 눈금을 '춥다/시원하다/따뜻하다/뜨겁다'처럼 칸으로 나누는 것"
FAST: "로봇 행동을 MP3처럼 주파수 압축 — 사람이 못 느끼는 미세 변화는 버리고 핵심만 보존"
VQ-BeT [60]: "연속 행동을 '행동 단어장'의 단어로 매핑하여 GPT처럼 다음 단어를 예측"
Diffusion Policy [17]: "대리석에서 조각상을 깎아내듯, 순수 노이즈에서 행동을 정제해 나감"
Flow Matching(π0): "출발지(노이즈)에서 목적지(행동)까지 직선 고속도로를 뚫은 것 — 디퓨전의 구불길 대신"
Behavior Cloning: "선생님이 푸는 걸 보고 따라 푸는 것 — 선생님보다 잘할 수 없고, 안 본 문제는 못 품"
RL 후처리: "BC로 기본기를 익힌 뒤, 스스로 연습하며 선생님을 넘어서는 단계"
VLM-생성 보상: "채점자(VLM)가 로봇의 행동을 보고 점수를 매겨주는 것 — 사람이 일일이 채점할 필요 없음"

Self-Check Questions: Section 5-6

Q1: FAST 토큰화가 기존 bin 이산화 대비 어떤 원리로 토큰 수를 줄이는가?

답: FAST는 두 단계 압축을 적용한다. (1) DCT(이산 코사인 변환)로 행동 시퀀스를 시간 영역에서 주파수 영역으로 변환하여, 고주파 성분(미세 진동)을 제거하고 저주파 성분(핵심 운동 패턴)만 보존한다. (2) BPE(바이트 페어 인코딩)로 반복되는 주파수 패턴을 합쳐 토큰 수를 추가 압축한다. 결과적으로 7DoF×16스텝=112토큰이 최대 약 13배 압축된다.

Q2: RL 후처리가 BC 단독보다 우수한 이유를 "탐색(exploration)"의 관점에서 설명하라.

답: BC는 시연 데이터의 분포 안에서만 학습하므로, 시연에 없는 더 나은 행동을 발견할 수 없다(exploitation only). RL 후처리는 현재 정책에서 벗어나 새로운 행동을 시도(exploration)하고, 보상 신호를 통해 더 나은 행동을 강화한다. BC가 제공하는 합리적 초기 정책 덕분에 RL의 탐색이 무작위가 아닌 유의미한 영역에서 시작되어, cold start 문제를 피하면서도 시연을 초월하는 성능에 도달할 수 있다.

Q3: 8가지 행동 토큰 유형 중, "제어 주파수"와 "추상 수준"은 어떤 trade-off 관계에 있는가?

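Answer: Tokens at a high level of abstraction (Language, Code, Goal) are interpretable and compatible with VLMs, but because they do not specify low-level control directly, their control frequency is low (1-5 Hz in the table above). Tokens at a low level of abstraction (Raw Action, Latent) specify joint-level control directly. In the table above they run at 3-10 Hz and 10-30 Hz respectively, and rates of around 50 Hz or more are reached only when they are combined with action chunking and diffusion or flow-matching heads. They are also harder to interpret and make less direct use of the VLM's language knowledge. The best token type therefore depends on the precision requirements and planning complexity of the task.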

Open Research Questions: Section 5-6

최적 토큰 유형 자동 선택: 주어진 태스크에 대해 8가지 토큰 유형 중 최적을 자동으로 선택하는 메타-학습 프레임워크가 가능한가?

RL의 안정성-성능 trade-off: BC→RL 전환 시 catastrophic forgetting 없이 안정적으로 성능을 개선하는 이론적 보장이 가능한가?

보상 설계의 자동화: VLM-생성 보상이 인간 보상과 얼마나 잘 일치하는가? VLM의 환각(hallucination)이 보상 신호를 오염시키는 경우 어떻게 대처하는가?

연속-이산 스펙트럼의 최적점: 완전 이산(bin)과 완전 연속(디퓨전) 사이에서 최적의 행동 표현 해상도는 태스크에 따라 어떻게 달라지는가?

신경과학 영감의 실질적 적용: Motor learning 이론(소뇌 내부 모델, 기저핵 행동 청킹)이 VLA 아키텍처 설계에 구체적으로 적용된 성공 사례가 있는가, 아니면 아직 비유 수준인가?

7. 효율성 — 실세계 배포를 위한 필수 과제

VLA(Vision-Language-Action) 모델이 학술 벤치마크에서 놀라운 성능을 보여주고 있지만, 이를 실제 로봇에 탑재하여 현장에서 구동하는 것은 완전히 다른 차원의 문제다. 수십억 파라미터의 거대 모델을 실시간으로 추론하면서, 제한된 하드웨어 위에서, 안전하고 경제적으로 운용해야 하기 때문이다. 이 장에서는 Yu et al. [4] (2025)의 효율적 VLA 서베이를 핵심 축으로 삼아, 효율성 문제의 전체 지형도를 그린다.

7.1 왜 효율성인가: 현실과 이상의 간극

현재 VLA 모델의 자원 소모량은 실세계 배포의 관점에서 비현실적인 수준에 놓여 있다.

훈련 비용의 규모:

OpenVLA [15] 학습에는 약 21,500 A100-GPU 시간이 소요되었다. 이는 64-GPU 클러스터를 약 2주간 연속 가동한 것에 해당한다.
π0 [16]의 학습에는 10,000시간 이상의 로봇 궤적 데이터가 사용되었다. 단일 기관이 이 규모의 데이터를 자체적으로 수집하기란 사실상 불가능하다.

추론 지연시간의 벽:

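RT-2-PaLI-X (55B) has an inference latency of 330-1000 ms, which corresponds to a control frequency of 1-3 Hz. This falls short of the minimum frequency required for tabletop manipulation (5-10 Hz) and is far from the 30 Hz or more needed for dynamic tasks.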
비교적 효율적인 OpenVLA [15] (7B)조차 166ms의 지연(약 6Hz)으로 빠른 반응이 필요한 과제에는 부적합하다.

실세계 배포의 4대 요구사항:

요구사항	설명	현재 갭
지연시간	<100ms (10Hz+)	대부분의 대형 VLA가 미달
비용	클라우드 API 비용 최소화	대형 모델은 GPU당 비용 과다
프라이버시	온디바이스 추론 필수	가정/의료 환경에서 데이터 외부 전송 불가
에너지	배터리 구동 로봇의 전력 제약	수십 와트급 엣지 디바이스에서 구동 필요

이러한 간극을 메우기 위해, 2024년 후반부터 2025년에 걸쳐 효율적 VLA에 관한 연구가 폭발적으로 증가하였다. 연구의 방향은 크게 모델 효율성, 훈련 효율성, 데이터 효율성의 세 축으로 나뉜다.

7.2 모델 효율성: 추론을 빠르고 가볍게

모델 효율성은 이미 학습된 VLA의 추론 단계에서 지연시간과 메모리를 줄이는 기법들을 총칭한다. 양자화, 가지치기, 지식 증류, 토큰 최적화, 효율적 아키텍처의 다섯 가지 전략이 존재한다.

7.2.1 양자화(Quantization)

양자화는 모델 가중치(및 활성값)의 수치 정밀도를 줄여 메모리와 연산량을 절감하는 가장 직접적인 기법이다.

OpenVLA [15] 4비트 PTQ(Post-Training Quantization): 학습 후 양자화만으로 GPU 메모리 사용량을 절반으로 줄이면서도 성능 저하가 관측되지 않았다. 이는 VLA 모델의 가중치가 상당한 수치적 여유(redundancy)를 포함하고 있음을 시사한다.
SQIL [114] (Shang et al., 2024): 4비트 현저도 인식(salience-aware) 양자화를 적용하여 2.5배 추론 가속을 달성하였다. 핵심은 행동 예측에 중요한 가중치를 식별하여 선별적으로 높은 정밀도를 유지하는 것이다.
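BitVLA [33]: Applies extreme "1-bit" quantization with ternary weights ({-1, 0, 1}), which corresponds to about 1.58 bits of information per weight, and reports 3.36x memory compression. That meaningful action generation remains possible with weights restricted to three values is noteworthy.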
QAIL(Quantization-Aware Imitation Learning) [115] (Heo et al., 2025): 양자화를 학습 단계에 통합하여 엣지 디바이스 배포에 최적화된 모델을 직접 학습한다.
SQAP-VLA [116] (Li et al., 2025): 양자화와 토큰 가지치기를 공동 설계(co-design)하여, 각 기법을 개별 적용했을 때보다 더 나은 효율성-성능 균형을 달성하였다.
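The two ends of the quantization spectrum discussed above can be sketched in plain Python. The first is symmetric per-tensor 4-bit quantization. The second is ternary quantization with an absolute-mean scale, in the style of BitNet b1.58. Both are simplified illustrations with hypothetical function names and weights, not the schemes used by any specific model listed above.

```python
def quantize_int4(weights):
    """Symmetric per-tensor quantization to integers in [-8, 7]."""
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / 7.0
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale


def ternarize(weights):
    """Ternary quantization to {-1, 0, +1} with an absolute-mean scale."""
    scale = (sum(abs(w) for w in weights) / len(weights)) or 1.0
    q = [max(-1, min(1, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floating-point weights."""
    return [v * scale for v in q]


weights = [0.42, -0.13, 0.07, -0.91, 0.30]   # hypothetical weight values

q4, s4 = quantize_int4(weights)
err4 = max(abs(w - r) for w, r in zip(weights, dequantize(q4, s4)))

q3, s3 = ternarize(weights)
```

With 4 bits the reconstruction error stays within half a quantization step. With ternary weights only the sign and a coarse magnitude survive, which is why such models are usually trained with quantization in the loop.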

7.2.2 가지치기(Pruning)

가지치기는 모델에서 불필요한 구성 요소(레이어, 뉴런, 토큰 등)를 제거하여 경량화하는 기법이다. VLA에서는 특히 LLM 백본의 레이어 중복성이 높다는 관찰에 기반한 연구들이 활발하다.

레이어 수준 가지치기:

인접한 LLM 레이어의 출력 사이에 높은 코사인 유사도가 관측되며, 이를 근거로 최대 50%의 레이어를 제거할 수 있다.
DeeR-VLA [35]: 동적 다중 출구(dynamic early exit) 전략을 사용한다. 각 레이어에서 행동 예측의 일관성을 확인하고, 일관성이 확보되면 나머지 레이어를 스킵한다. 추가 학습이 필요 없다는 점이 큰 장점이다.
SmolVLA [32]: 극도로 단순한 접근법으로, LLM의 L/2개 레이어를 단순 스킵한다. 절반의 레이어만으로도 조작 과제를 수행할 수 있음을 보여주었다.
MoLe-VLA [117] (Qu et al., 2025): STAR 라우터를 사용하여 입력별로 동적으로 활성화할 레이어를 선택한다. 쉬운 과제에서는 적은 레이어를, 복잡한 과제에서는 많은 레이어를 활성화하여 연산량을 적응적으로 조절한다.
EfficientVLA [118] (Niu et al., 2025): 학습 없이 레이어 가지치기와 시각 토큰 가지치기를 동시에 적용하는 프레임워크이다.
FLOWER [119] (Cheng et al., 2025): 인코더-디코더 구조 VLM에서는 디코더 전체를 제거하고, 디코더 전용 구조에서는 말단 30%의 레이어를 제거한다.

구조적 가지치기:

RLRC [120] (Zhao et al., 2025): Taylor 중요도 점수에 기반한 구조적 가지치기로, 90% 희소성까지 달성하면서도 유의미한 성능을 유지하였다.

7.2.3 지식 증류(Distillation)

대형 VLA의 지식을 소형 모델로 전이하는 증류 기법은 처음부터 작은 모델을 만드는 것보다 높은 성능을 달성할 수 있다.

TinyVLA [34]: 대형 VLA에서 1.4B 미만의 소형 모델로 증류한다. LoRA 가중치로 초기화하여 증류 효율을 높인다.
CEED-VLA [121] (Wen et al., 2025): 일관성 증류(consistency distillation)와 Jacobi 병렬 디코딩을 결합한다. 자기회귀적 토큰 생성의 직렬 병목을 병렬화하여 추론 속도를 크게 향상시킨다.
RPD(Robot Policy Distillation) [122] (Wang et al., 2025): VLA에서 소형 RL 전문가 정책으로 증류한다. 특정 과제에 대해서는 범용 VLA보다 증류된 전문가가 더 빠르고 정확할 수 있다.
SP-VLA [123] (Shen et al., 2025): 행동 인식 스케줄링(action-aware scheduling)으로 무거운 VLA와 가벼운 행동 생성기 사이를 동적으로 전환한다. 복잡한 판단이 필요한 순간에만 대형 VLA를 호출하고, 단순 실행 구간에서는 경량 생성기를 사용한다.

7.2.4 토큰 최적화

VLA에서 시각 토큰은 전체 입력 시퀀스의 대부분을 차지한다. 단일 이미지가 수백 개의 패치 토큰으로 변환되고, 비디오 입력에서는 이 수가 수천 개로 폭증한다. 이를 줄이는 것이 토큰 최적화의 핵심이다.

시각 토큰 압축:

SmolVLA [32]: Pixel shuffle 기법으로 프레임당 64개 토큰으로 압축한다. 원래 수백 개였던 토큰을 공간적으로 재배열하여 극단적으로 줄인다.
FlashVLA [52] (Zhu et al., 2025): ICS(Importance-based Compression and Selection) 가지치기로 중요도가 낮은 시각 토큰을 제거한다.
EfficientVLA [118]: Applies layer pruning and visual token pruning together.

시각 토큰 캐싱:

VLA-Cache [124] (Gao et al., 2025), CronusVLA [125] (Lin et al., 2025): 정적 배경에 해당하는 토큰이 연속 프레임 간에 거의 변하지 않는다는 시간적 일관성(temporal coherence)을 활용한다. 변하지 않는 배경 토큰을 캐싱하고 변화가 있는 전경 토큰만 갱신하여 약 40-50% 빠른 추론(Zhang et al. [6] 기준; 원논문에서는 최대 2배 이상 가속 보고)을 달성한다.
이 접근법이 유효한 근본적 이유는, 로봇 조작 과제에서 대부분의 패치 토큰이 공간적으로 중복되기 때문이다. 카메라가 고정된 테이블탑 환경에서는 배경의 80% 이상이 프레임 간에 동일하다.
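The caching idea can be sketched as a per-patch change test: tokens for patches that barely changed are reused, and only the rest are re-encoded. This is a deliberately simplified illustration and does not reproduce either cited method. The function name, the `encode` callback, and the threshold are hypothetical.

```python
def update_token_cache(prev_patches, new_patches, cached_tokens, encode, threshold=0.01):
    """Reuse cached tokens for static patches and re-encode only changed ones.

    prev_patches / new_patches: lists of patches, each a list of pixel values.
    cached_tokens: tokens computed for prev_patches.
    encode: function mapping one patch to its token (stand-in for a vision encoder).
    Returns the updated token list and the number of patches that were re-encoded.
    """
    tokens, recomputed = [], 0
    for prev, new, cached in zip(prev_patches, new_patches, cached_tokens):
        change = sum(abs(a - b) for a, b in zip(prev, new)) / len(new)
        if change < threshold:
            tokens.append(cached)          # static background: reuse
        else:
            tokens.append(encode(new))     # moving foreground: recompute
            recomputed += 1
    return tokens, recomputed


# Toy example: two patches, only the second one changes between frames.
toy_encode = sum
prev = [[0.0, 0.0], [1.0, 1.0]]
new = [[0.0, 0.0], [1.0, 2.0]]
tokens, recomputed = update_token_cache(prev, new, [0.0, 2.0], toy_encode)
```

In a fixed-camera tabletop scene most patches pass the change test, so the fraction of tokens that must be recomputed per frame stays small.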

7.2.5 효율적 아키텍처

기존 Transformer 구조의 근본적 한계(이차 복잡도의 어텐션)를 극복하기 위한 아키텍처 수준의 혁신이다.

선형 복잡도 아키텍처:

SARA-RT [126] (Shridhar et al., 2024): 표준 소프트맥스 어텐션을 선형 어텐션으로 업트레이닝(up-training)한다. 복잡도가 O(n^2)에서 O(n)으로 줄어든다.
RoboMamba [127] (Liu et al., 2024): Mamba SSM(Selective State Space Model) 기반 VLA로, 선형 복잡도에서 3배 이상의 속도 향상을 달성하였다. 긴 시퀀스에서 Transformer 대비 이점이 커진다.

MoE(Mixture of Experts):

GeRM [128] (Xu et al., 2025): 사족보행 로봇의 RL에 MoE를 적용하여, 전체 파라미터 중 일부 전문가만 활성화한다.
FedVLA [72] (Zhang et al., 2025): 이중 게이팅 MoE로 연합학습(federated learning) 환경에서 효율적 VLA를 구현한다.
DriveMoE [49] (Huang et al., 2025): 자율주행 도메인에서 MoE 구조를 활용하여 다양한 주행 시나리오에 전문가를 할당한다.

병렬 디코딩:

OpenVLA-OFT (built on OpenVLA [15]): Uses bidirectional attention to generate multiple action tokens simultaneously.
PD-VLA [129] (Chen et al., 2025): Jacobi 고정점 반복법으로 자기회귀적 디코딩을 병렬화한다.
Spec-VLA [130] (Wu et al., 2025): 투기적 디코딩(speculative decoding)을 VLA에 적용하여 1.42배 가속을 달성한다. 소형 드래프트 모델이 후보 토큰을 빠르게 생성하고, 대형 모델이 이를 검증하는 방식이다.

7.2.6 효율적 어텐션(Efficient Attention)

Yu et al. [4]는 추가로 효율적 어텐션(Efficient Attention) 기법들을 별도 연구 방향으로 식별한다. KV-Efficient VLA(RNN 게이트 기반 KV 캐시 압축), Long-VLA [73](장시간 태스크를 위한 phase-aware 입력 마스킹), RetoVLA [74](레지스터 토큰 재사용), dVLA [65](디퓨전 VLA를 위한 prefix 어텐션 마스킹) 등이 이 범주에 속한다. 이들은 기존 모델 압축(양자화, 프루닝)과는 독립적인 차원의 효율화로, Transformer의 어텐션 메커니즘 자체를 최적화한다.

7.3 훈련 효율성: 적은 자원으로 더 잘 학습하기

모델 자체의 경량화와는 별도로, 학습 과정의 효율성을 높이는 연구도 활발하다.

파라미터 효율적 미세조정(PEFT):

LoRA(Low-Rank Adaptation)를 비롯한 PEFT 기법들은 전체 파라미터의 0.1~1%만을 학습하면서도 전체 미세조정에 준하는 성능을 달성한다. 이는 GPU 시간을 약 70% 절감하며, 단일 GPU에서도 대형 VLA의 미세조정을 가능하게 한다.
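The small trainable fraction follows directly from the low-rank factorization. For a weight matrix of shape d_out x d_in, LoRA trains two factors of shapes d_out x r and r x d_in in its place. The calculation below uses a hypothetical 4096 x 4096 projection with rank 8. The function name is an assumption.

```python
def lora_trainable_fraction(d_out, d_in, rank):
    """Fraction of a dense layer's parameters that LoRA trains at a given rank."""
    full_params = d_out * d_in
    lora_params = rank * (d_out + d_in)   # factors of shape d_out x r and r x d_in
    return lora_params / full_params


fraction = lora_trainable_fraction(4096, 4096, 8)   # 65536 / 16777216
```

For this layer the fraction is 1/256, about 0.39%, which falls within the 0.1-1% range quoted above. Higher ranks or smaller layers raise it proportionally.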

혼합 학습 전략:

커리큘럼 학습: 쉬운 과제에서 어려운 과제로 점진적으로 난이도를 높이는 전략이다.
다단계 학습: π0 [16]는 (1) VLM 사전학습 → (2) 로봇 데이터 사전학습 → (3) 과제별 미세조정의 3단계 파이프라인을 사용한다.

FAST 토큰화 [20]의 혁신:

Pertsch et al.이 제안한 FAST(Fast Action Tokenization)는 로봇 행동 시퀀스에 DCT(이산 코사인 변환) + BPE(바이트 쌍 인코딩)를 적용한다. 이를 통해 행동 시퀀스를 극도로 압축하여 사전학습 속도를 5배 가속하였다. 원시 행동(raw actions) 대신 FAST 토큰이나 잠재 행동(latent actions)을 사용하는 것이 효율적 행동 표현의 핵심 트렌드이다.

7.4 데이터 효율성: 적은 로봇 데이터로 더 많이 배우기

로봇 데이터 수집의 높은 비용은 VLA 연구의 가장 근본적인 병목이다. 데이터 효율성 연구는 이 병목을 우회하거나 완화하는 전략들을 탐구한다.

인간 비디오 활용:

EgoVLA [131] (Chen et al., 2025), Being-H0 [138], RynnVLA-001 등은 인터넷에 풍부한 1인칭(ego-centric) 인간 활동 비디오를 대리 학습 데이터로 활용한다. 인간의 손 움직임에서 조작 전략을 학습하고, 이를 로봇 행동에 전이한다. 이 접근법의 핵심 통찰은 인간과 로봇이 동일한 물리 세계에서 유사한 조작 과제를 수행한다는 것이다.

시뮬레이션 데이터:

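UniSim [101] (Yang et al., 2023), Genesis: Generate large-scale synthetic data, UniSim with a learned action-conditioned video model and Genesis with a GPU-accelerated physics simulator.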
GraspVLA [81] (Qian et al., 2025): 10억 스케일의 합성 파지 데이터를 생성하여 사전학습에 활용한다.

데이터 증강:

언어 증강: DIAL [90] 등은 과제 지시문을 다양하게 패러프레이징하여 언어 이해의 강건성을 높인다.
시각 증강: GenAug [87], CACTI [88], ROSIE [89] 등은 생성 모델을 이용해 시각적 다양성을 확대한다.
궤적 증강: DemoGen 등은 기존 시연 데이터에서 새로운 궤적을 합성한다.

능동적 데이터 선정:

AMF(Active Model Feedback): 정보 이득(information gain)이 높은 데이터를 우선 선정하여 학습 효율을 극대화한다.
SWBT(Success Weighted by Trial): 실패 시도까지 학습 데이터에 포함하여, 실패로부터도 유용한 신호를 추출한다.

자율 수집:

SOAR (Zhou et al., 2024): The robot collects data autonomously under the guidance of foundation models. This points to a way of continuously acquiring training data without human demonstrators.

7.5 주요 경량 모델 비교

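The table below compares representative VLA models by parameter count, inference performance, and key technique. Between RT-2 (2023) and SmolVLA (2025), model size fell from 55B to about 450M parameters, while reported control frequency rose from about 1 Hz to about 120 Hz.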

Model	Parameters	Inference latency	Control frequency	Key technique
RT-2-PaLI-X	55B	330-1000ms	1-3Hz	Baseline (large VLM used directly)
OpenVLA [15]	7B	166ms	6Hz	Open-source baseline
π0 [16]	3.3B	73ms	20-50Hz	Flow matching action head
GR00T N1 [21]	2.2B	64ms	~120Hz (motor output rate, distinct from model inference rate; not reported in Yu et al. [4], Table 1)	Dual-system (slow VLM + fast policy)
NORA [132] (Jiang et al., 2025)	3B	—	—	FAST+ tokenization
CLIP-RT	~1B	—	—	Frozen CLIP [27]; +24% over OpenVLA [15]
EdgeVLA [133] (Huang et al., 2025)	1B	—	—	Designed specifically for edge devices
TinyVLA [34]	<1.4B	—	—	Distilled from a large VLA
SmolVLA [32]	~450M	—	—	Trainable on a single GPU
BitVLA [33]	~2B (reduced effective size)	—	—	Ternary (about 1.58-bit) weight quantization
DiVLA-2B [134] (Kim et al., 2025)	2B	~12ms	82Hz	Runs on a single A6000 GPU
RoboMamba [127]	—	—	—	Linear complexity based on Mamba SSM

7.6 핵심 인사이트 — 효율성-성능 트레이드오프의 재발견

효율적 VLA 연구에서 도출되는 인사이트들은 단순한 기술적 최적화를 넘어, VLA 설계 철학 자체에 대한 재고를 요구한다.

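1) Scale inversion: The result that CLIP-RT (~1B, built on CLIP [27]) outperforms OpenVLA [15] (7B) by 24% suggests that the naive scaling assumption, "more parameters guarantee better performance," may not hold in the robot domain. Even a small model can surpass a much larger one when appropriate representation learning is combined with data-efficient fine-tuning.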

2) 양자화는 거의 무료 점심: 4비트 PTQ로 메모리를 절반으로 줄이면서도 성능 저하가 없다는 사실은, 현재 VLA의 가중치에 상당한 중복성이 존재함을 의미한다. 이는 배포 단계에서 양자화를 기본 적용해야 할 강력한 근거가 된다.

3) 계층적 분리는 로봇에 고유하게 적합: GR00T N1 [21]이 보여준 느린 VLM(1-5Hz) + 빠른 정책 헤드(50Hz+)의 비동기 실행은 로봇 제어의 본질적 구조와 잘 맞는다. 높은 수준의 의미 이해는 매 프레임 갱신할 필요가 없지만, 저수준 모터 명령은 고주파로 생성되어야 한다. 이 "인지는 느리게, 행동은 빠르게" 패러다임은 인간 신경계의 구조와도 유사하다.

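4) RL post-training as a complement to compression: RIPT-VLA [71] showed that RL post-training can substantially raise VLA performance, from 4% with the SFT (Supervised Fine-Tuning) baseline to 97% after PPO post-training. Note that this 4% → 97% result is a gain over a BC/SFT baseline, not a recovery of accuracy lost to quantization or pruning (per the original paper, Su et al., 2025; performance varies by benchmark). It therefore motivates, but does not by itself demonstrate, a "lightweight model + RL post-training" pipeline.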

5) 인간 비디오는 로봇 데이터의 실행 가능한 대체재: EgoVLA 계열의 연구들은 인터넷 스케일의 인간 비디오가 로봇 데이터를 부분적으로 대체할 수 있음을 보여준다. 로봇 데이터 수집의 병목을 우회하는 가장 확장 가능한(scalable) 경로 중 하나이다.

6) 지배적 연구 추세: 효율적 VLA 연구는 2024년 후반부터 2025년 사이에 폭발적으로 성장하였다. 이는 연구 커뮤니티가 "일단 크게 만들고 나중에 줄인다"는 전략에서 "처음부터 효율적으로 설계한다"는 방향으로 전환하고 있음을 반영한다.

7.7 엣지 배포: 시스템 레벨 병목 분석

2026년의 Edge Embodied Foundation Models 서베이는 VLA 배포를 모델 압축 문제가 아닌 시스템 공학 문제로 재정의했다. 이 서베이가 제안한 "Deployment Gauntlet"은 엣지 배포를 가로막는 7가지 결합 제약(coupled constraints)을 식별한다: 크기, 무게, 전력, 메모리 트래픽, 연산 지연, 타이밍 변동, 안전 마진 등이 상호 작용하여, 하나의 최적화만으로는 해결되지 않는 복합 문제를 형성한다.

핵심 발견은 병목의 유형이 컨트롤러 아키텍처에 따라 다르다는 것이다:

자기회귀 VLA(RT-2, OpenVLA류): 주로 메모리 대역폭에 의해 제약
디퓨전 기반 컨트롤러(π0류): 주로 연산 지연과 지속 실행 비용에 의해 제약

이 분석은 '빠른 제어(fast control)'와 '느린 의미 추론(slow semantic reasoning)'을 분리하는 아키텍처(GR00T N1, π0.5 [31])가 엣지 배포에서도 유리함을 시사한다. 효율적 배포를 위해서는 메모리 아키텍처, 스케줄링 전략, 통신 프로토콜, 모델 설계를 통합적으로 고려하는 시스템 레벨 공동 설계(co-design)가 필요하다.

7.8 보완적 효율화 분류 체계

Guan et al. [43] (2025)은 Yu et al. [4]와 독립적으로 효율화 VLA를 조사하여, 4차원 분류 — (1) 모델 아키텍처, (2) 인지 특징 추출, (3) 행동 생성 메커니즘, (4) 학습/추론 전략 — 를 제안했다. Yu et al. [4]이 모델 압축(양자화, 프루닝, 증류)에 초점을 맞춘 것과 달리, Guan et al.은 인지 특징 추출 효율화(예: 다중 해상도 토큰 풀링, 선택적 어텐션)와 행동 생성 메커니즘 효율화(예: Action Chunking 최적화, 병렬 디코딩)를 독립적 차원으로 분석한다는 점에서 상호 보완적이다. 이 관점은 FlashVLA [52]나 RetoVLA [74] 같은 최근 모델이 왜 단순한 모델 압축이 아닌, 인지-행동 파이프라인 전체의 효율화를 추구하는지를 설명한다.

ICLR 2026에서 제안된 HyperVLA [135] (Park et al., 2026)는 하이퍼네트워크로 태스크별 정책을 동적 생성하여 추론을 가속한다. AutoQVLA [136] (Liu et al., 2026)는 개선된 양자화 기법으로 VRAM을 30% 절감했다. 이들은 7.2절에서 다룬 모델 효율화 기법들의 최전선에 위치하며, 양자화와 아키텍처 혁신이 여전히 활발한 연구 방향임을 보여준다.

8. 응용 도메인 — VLA가 만드는 세계

VLA 기술은 다양한 로봇 응용 분야로 확산되고 있다. 각 도메인은 고유한 행동 공간, 안전 요구, 실시간성 제약을 가지며, 이에 따라 VLA의 적용 방식도 크게 달라진다. 이 장에서는 현재 VLA가 활용되고 있는 주요 도메인을 순회하며, 각 영역의 현황과 고유한 도전 과제를 정리한다.

8.1 테이블탑 매니퓰레이션 — 주류 연구 도메인

테이블탑 매니퓰레이션은 VLA 연구의 핵심 무대이다. 전체 VLA 모델의 70% 이상이 이 도메인을 대상으로 개발되고 평가된다.

벤치마크 성능의 급격한 향상:

LIBERO: 성공률이 16개월 만에 76.5%에서 98.1%로 상승하였다.
CALVIN: 시퀀스 길이(연속 성공 과제 수)가 3.57에서 4.44로 향상되었다.
RLBench, Meta-World: 다양한 조작 과제에 대한 표준 평가 플랫폼으로 활용된다.

현재 수준과 남은 과제: 단기 과제(single-step manipulation)는 98% 이상의 성공률로 거의 해결 단계에 도달하였다. 그러나 장기 과제(long-horizon tasks) — 여러 단계의 조작을 순서대로 수행해야 하는 과제 — 는 여전히 핵심 병목으로 남아 있다. 각 단계의 오류가 누적되는 컴파운딩 에러(compounding error) 문제가 근본 원인이다.
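The compounding-error argument can be made concrete with a short calculation. The sketch below assumes, as a simplification, that each step of a long-horizon task succeeds independently with the same probability; real tasks have correlated failures, so it should be read as an illustration only. Both function names are hypothetical.

```python
def chain_success(per_step_success: float, num_steps: int) -> float:
    """Probability that all steps succeed, assuming independent steps."""
    return per_step_success ** num_steps


def required_per_step(target_success: float, num_steps: int) -> float:
    """Per-step success rate needed to reach a target whole-task success rate."""
    return target_success ** (1.0 / num_steps)


for steps in (1, 5, 10, 20):
    print(steps, round(chain_success(0.98, steps), 4))

print(round(required_per_step(0.90, 10), 4))
```

Even a 98% per-step policy falls to roughly two-thirds success over 20 sequential steps under this model, which is why near-saturated single-step benchmarks say little about long-horizon performance.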

특수 조작 연구:

양손 조작: Bi-VLA (Xue et al., 2025), ALOHA (Zhao et al., 2023) 등은 두 팔의 협응 제어를 다룬다. 행동 공간이 단일 팔의 2배로 확장되며, 양팔 간의 동기화(synchronization)가 핵심 과제이다.
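Contact-rich manipulation: ForceVLA [78] (Lee et al., 2025), TactileVLA [153] (Kim et al., 2025), and related work integrate force and tactile sensors into the VLA. They sense properties such as stiffness, weight, and slippage that cannot be determined from vision alone.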
손재주 파지(Dexterous Grasping): DexVLA (Wen et al., 2025), DexVLG (Zhang et al., 2025) 등은 다지(multi-finger) 핸드의 고차원 제어를 VLA로 학습한다. 자유도가 20개 이상으로 증가하며, 행동 공간의 복잡성이 급격히 높아진다.

8.2 휴머노이드 로봇 — 전신 제어의 도전

휴머노이드 로봇에 VLA를 적용하는 것은 테이블탑 매니퓰레이션과는 질적으로 다른 수준의 도전을 수반한다.

근본적 어려움:

30개 이상의 자유도: 팔, 다리, 몸통, 머리를 포함한 전신 관절의 제어가 필요하다.
균형 유지: 이족 보행의 동적 균형은 밀리초 단위의 빠른 반응을 요구한다.
보행과 조작의 동시 수행: 걸어가면서 물건을 집는 것처럼, 이동과 조작을 동시에 수행해야 한다.
다중 접촉점 관리: 발, 손, 때로는 몸통까지 환경과 접촉하며, 이 모든 접촉점의 힘을 조율해야 한다.

주요 모델:

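GR00T N1 [21] (NVIDIA): Positioned as a foundation model for humanoids. Its dual-system architecture reaches a high control frequency of about 120 Hz and targets general-purpose whole-body control. The 120 Hz figure is the motor-command output rate and should be distinguished from the model inference rate; Table 1 of Yu et al. [4] does not report it.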
Humanoid-VLA [137] (Zhang et al., 2025): 온라인에 존재하는 인간 비디오에서 포즈 복원(pose estimation)을 수행하여 동작의 다양성을 확보한다. 인간의 움직임을 직접 참고 데이터로 활용하는 접근법이다.
Being-H0 [138] (Li et al., 2025): 자아중심(ego-centric) 비디오를 사전학습 데이터로 활용하여, 1인칭 시점에서의 환경 이해 능력을 강화한다.
FP3 [139] (Chen et al., 2025): 3D 정책 사전학습으로 공간적 추론 능력을 강화한다.

핵심 미해결 과제: 균형 유지와 정밀 조작의 동시 수행은 현재 VLA의 가장 어려운 도전 중 하나다. 균형을 위한 빠른 반사적 제어와 조작을 위한 신중한 계획적 제어가 충돌하는 상황이 빈번하며, 이를 하나의 통합 모델 내에서 조화시키는 것이 핵심 과제이다.

8.3 자율주행 — 또 다른 VLA의 최전선

자율주행은 VLA의 두 번째로 큰 응용 도메인이다. Jiang et al. [10]의 분류에 따르면, 자율주행 VLA는 4단계의 진화를 거쳐왔다.

진화의 4단계:

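1) VLM as Explainer: The VLM is used to describe driving scenes and to generate rationales for decisions. Control is handled by a separate module.
2) Modular VLA: VLM outputs are fed into the modules of a conventional driving pipeline (perception → prediction → planning).
3) Unified E2E VLA: A single model maps camera input directly to steering and acceleration output.
4) Reasoning-Augmented VLA: CoT (Chain-of-Thought) reasoning is integrated to make the decision process transparent.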

핵심 모델:

EMMA [46] (Hwang et al., 2024): Gemini 백본을 사용한 Waymo의 E2E 주행 모델이다.
ORION [47] (Wang et al., 2024): 메모리 메커니즘과 CoT 추론을 결합하여 과거 주행 경험을 활용한다.
DriveMoE [49]: MoE 구조로 다양한 주행 시나리오(고속도로, 교차로, 주차 등)에 전문가를 할당한다.
AutoVLA [48] (Chen et al., 2025): 적응형 CoT로, 단순 상황에서는 빠른 추론을, 복잡한 상황에서는 깊은 추론을 수행한다.

주행 vs 조작: 핵심 차이점

자율주행과 로봇 조작은 모두 VLA 프레임워크를 공유하지만, 본질적으로 매우 다른 도전을 수반한다.

차원	로봇 조작	자율주행
행동 공간	3D 그리퍼 위치/방향(6-7DoF)	조향/가속 + BEV 경로 + 고수준 경로(다중 추상화 수준)
공간 규모	테이블탑(~1m)	도시 규모(수백 미터~수 km)
실시간 요구	5-50Hz	30Hz+ 필수 (자동차 하드웨어 기준)
안전 임계성	물체 파손 정도	법적/물리적 인명 안전
환각의 결과	파지 실패(재시도 가능)	인명 위험(되돌릴 수 없음)
사회적 상호작용	거의 없음	양보, 합류, 다른 운전자 의도 파악 필수

SafeAuto [140] (Li et al., 2025)는 심볼릭 거부권(symbolic veto)을 도입하여, VLA의 출력이 안전 규칙을 위반하면 실행을 차단한다.
LangCoop V2V [141] (Wei et al., 2025)는 차량 간(Vehicle-to-Vehicle) 자연어 통신으로 의도를 공유하여 사회적 상호작용 문제를 해결한다.

벤치마크와 남은 갭: BDD100K, nuScenes, Bench2Drive, Reason2Drive 등의 벤치마크가 존재하지만, 통합적인 "AI 운전면허" 벤치마크의 부재가 핵심 갭이다. 인간 운전면허 시험처럼 다양한 시나리오, 안전 판단, 윤리적 딜레마를 포괄적으로 평가하는 표준이 아직 없다.

8.4 드론 및 항법

공중 및 지상 이동 로봇에서도 VLA의 적용이 확대되고 있다.

CognitiveDrone [142] (Wang et al., 2025): 자연어 지시에 따라 드론을 제어하는 인지적 드론 시스템이다. "저 빨간 건물 오른쪽으로 돌아가"와 같은 지시를 해석하고 실행한다.
RaceVLA [143] (Zhao et al., 2025): 드론 레이싱이라는 고속 환경에서 VLA를 적용한다. 밀리초 단위의 반응 속도와 정밀한 경로 추적이 동시에 요구되는 극한의 테스트베드이다.
NaVILA [144] (Cheng et al., 2025), Uni-NaVid [145] (Zhang et al., 2024): Apply VLA to indoor navigation for legged robots. They interpret instructions such as "Go to the kitchen and bring the red cup" and perform path planning and obstacle avoidance.
Mobility VLA [146] (Liu et al., 2025): 바퀴형 이동 로봇을 위한 VLA로, 실내외 환경에서의 자율 주행과 물체 상호작용을 통합한다.

이 도메인의 공통적 과제는 3D 공간에서의 실시간 항법과 동적 장애물 회피를 하나의 언어-시각-행동 프레임워크로 통합하는 것이다.

8.5 의료 및 수술 로봇

의료 분야는 VLA 적용의 높은 잠재력과 함께 가장 엄격한 제약 조건을 동시에 갖는 도메인이다.

대표 연구:

RoboNurse-VLA [147] (Li et al., 2024): 수술 환경에서의 정밀 파지를 목표로 한다. 수술 도구의 정확한 파지와 전달이 핵심 과제이다.

도메인 고유의 제약:

환자 데이터 프라이버시: 의료 데이터의 외부 전송이 법적으로 제한되므로, 온프레미스(on-premise) 추론이 필수이다. 클라우드 API에 의존하는 VLA 배포 전략은 이 도메인에서 사용할 수 없다.
소량 데이터 문제: 특정 수술 절차나 환자별 데이터는 본질적으로 소량이다. 대규모 사전학습 백본을 소량 데이터로 효과적으로 미세조정하는 능력이 핵심이다.
안전-크리티컬 시스템: 수술 로봇의 오작동은 환자 생명에 직결된다. 형식 검증(formal verification)이나 안전 보증(safety assurance)에 대한 요구가 어떤 도메인보다도 엄격하다.

이러한 제약은 효율성(7장)의 모든 차원 — 모델 경량화, 온디바이스 추론, 데이터 효율성 — 이 의료 도메인에서 특히 절실함을 의미한다.

8.6 농업 및 산업

실용적 가치가 높은 산업 응용에서도 VLA의 잠재력이 탐색되고 있다.

과수원 사과 수확 (Zhang et al. [6]): 자연어 지시("익은 사과만 수확해")에 따라 로봇이 과일을 선별적으로 수확하는 시스템이다. 비정형 환경(나뭇가지, 잎, 다양한 조명)에서의 시각 이해와 부드러운 파지가 동시에 요구된다.
CIPHER (Park et al., 2025): 자연어 지시로 3D 프린팅 검사 작업을 전환하는 시스템이다. "이 부분의 표면 품질을 검사해"와 같은 지시에 따라 검사 절차를 동적으로 변경한다. 산업 공정의 유연성을 VLA로 구현하는 사례이다.
ObjectVLA [148] (Chen et al., 2025): 사전 시연(demonstration) 없이 새로운 물체를 조작할 수 있는 VLA이다. 산업 현장에서 새로운 부품이나 제품이 투입될 때마다 시연 데이터를 수집하는 비용을 제거한다.

산업 도메인의 공통적 요구사항은 유연성(flexibility)이다. 제품 종류, 작업 내용, 환경 조건이 빈번히 변하는 산업 현장에서, 자연어 지시만으로 작업을 전환할 수 있는 VLA의 능력은 높은 실용적 가치를 지닌다.

8.7 인터랙티브 AR 및 GUI 에이전트

물리적 로봇을 넘어, VLA의 "행동 생성" 능력은 디지털 인터페이스의 자율 조작으로도 확장된다.

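ShowUI [149] (Lin et al., 2024): Implements a GUI (Graphical User Interface) agent within the VLA framework. It understands the visual content of the screen and, given an instruction such as "Open the settings menu and turn off Wi-Fi," generates actions such as clicks, scrolls, and text input.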
공간 접지(Spatial Grounding): AR(Augmented Reality) 환경에서 가상 객체를 물리 세계에 정확히 배치하기 위해 VLA의 공간 이해 능력을 활용한다.
인간-AI 협력 항법: 증강현실 환경에서 사용자와 AI가 협력하여 복잡한 환경을 탐색하는 시나리오이다.

이 도메인은 VLA의 핵심 구성 요소 — 시각 이해, 언어 추론, 행동 생성 — 가 물리적 로봇 이외의 영역에서도 강력한 프레임워크가 될 수 있음을 보여준다. "행동"의 정의를 물리적 모터 명령에서 디지털 인터페이스 조작으로 확장하는 것이다.

8.8 도메인 간 비교 요약

도메인	행동 공간	안전 수준	실시간 요구	데이터 가용성	VLA 성숙도
테이블탑 조작	6-7 DoF	낮음	5-50Hz	풍부	높음
휴머노이드	30+ DoF	중간	50-120Hz	부족	초기
자율주행	다중 추상화	매우 높음	30Hz+	풍부	중간
드론/항법	4-6 DoF	중간	30Hz+	중간	초기
의료/수술	6-7 DoF	매우 높음	10-30Hz	매우 부족	매우 초기
농업/산업	6-7 DoF	낮음-중간	5-10Hz	부족	초기
GUI 에이전트	디지털 조작	낮음	실시간 불요	풍부	중간

※ 이 주파수 범위는 각 도메인의 일반적 요구사항을 정리한 것이며, 개별 서베이에서 직접 제시한 수치가 아닌 저자의 종합 정리이다.

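The pattern that stands out in this table is that VLA maturity tends to rise with data availability and fall as safety requirements tighten. Progress is fastest in tabletop manipulation, where data is abundant and safety constraints are loose, and slowest in medicine, where data is scarce and safety is paramount. Autonomous driving shows the tension directly: data is abundant, yet very high safety requirements hold maturity at a medium level. Narrowing this gap is a central challenge for the next stage of VLA research.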

Motivation Chain: 효율화의 동기 사슬

대형 VLA의 배포 불가(RT-2 55B: 330-1000ms 추론, 수백GB 메모리 필요)

→ 모델 축소 연구 시작(OpenVLA 7B: 8분의 1 크기로 성능 유지)

→ 7B도 여전히 무겁다(16-24GB VRAM, 166ms 지연)

→ 극단적 경량화(SmolVLA 450M, BitVLA 1비트 양자화)

→ 경량화의 성능 저하 우려

→ RL 후처리로 회복(경량 모델 + RL = 대형 모델 수준 성능)

→ 동적 추론(DeeR-VLA: 쉬운 입력은 얕은 레이어, 어려운 입력은 깊은 레이어)

→ 토큰 최적화(FAST, VLA-Cache: 연산량 자체를 줄임)

효율화 기법 비교: 핵심 차별점

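Technique	Core principle	Representative models	Compression effect	Performance impact
Quantization	Reduce the number of bits per weight	BitVLA (1-bit, ternary weights), SQIL (INT4)	3.36× less memory	Minor degradation
Pruning	Remove unneeded layers/neurons	SmolVLA (removes L/2 layers), FLOWER	Up to 50% of layers removable	Task-dependent
Distillation	Transfer knowledge from a large model to a small one	TinyVLA	Large reduction in parameter count	Close to teacher-model performance
Token optimization	Reduce the number of visual/action tokens	FAST, VLA-Cache, VOTE	5-13× fewer tokens	Performance maintained
Efficient architecture	Improve attention or the architecture itself	SARA-RT, MoLE-VLA	2-5× faster inference	Design-dependent
Dynamic inference	Adjust compute to input difficulty	DeeR-VLA	30-50% less compute on average	No retraining required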

직관적 한줄 설명: 효율화와 응용 편

양자화: "고화질 사진을 적절히 압축한 JPEG — 파일은 작아지지만 눈에는 거의 같아 보임"
Pruning: "Cutting the dead branches off a tree: the tree (the model) becomes lighter and moves more nimbly in the wind (inference)."
증류: "명인의 기술을 제자에게 전수 — 제자는 작지만 핵심 기술은 보존"
토큰 캐싱(VLA-Cache): "매 프레임 배경을 다시 그리지 않고 캐시 — 변하는 부분만 새로 계산"
동적 추론(DeeR-VLA): "쉬운 문제는 빨리 풀고 넘기고, 어려운 문제만 깊이 고민하는 시험 전략"
MoE(Mixture-of-Experts): "모든 의사가 모든 환자를 보는 게 아니라, 전문 과목별로 배정 — 각 전문가는 작지만 전체 역량은 큼"

Self-Check Questions: Section 7-8

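Q1: How does BitVLA's 1-bit (ternary) quantization work, and why is performance preserved?

A: BitVLA restricts model weights to the three values {-1, 0, +1} (ternary quantization). Three states carry log2(3) ≈ 1.58 bits per weight, so "1-bit" is a naming convention rather than a literal bit count. Because every weight is -1, 0, or +1, the multiplications in a matrix product reduce to additions and subtractions, which sharply cuts memory and compute (3.36× compression). Performance is preserved because (1) quantization-aware training (QAT) compensates for quantization error during training, and (2) VLA action outputs often do not require high numerical precision.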
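The answer to Q1 can be illustrated with a minimal ternary-quantization sketch. It assumes an absmean scaling rule (scale by the mean absolute weight, then round and clip to {-1, 0, +1}), which is a common choice for ternary networks; whether BitVLA uses exactly this rule is not stated in the text, so treat `ternarize` and `ternary_dot` as hypothetical helpers rather than the paper's implementation.

```python
def ternarize(weights):
    """Quantize a list of floats to {-1, 0, +1} plus one shared scale.

    Assumption: absmean scaling (scale = mean |w|), then round and clip.
    """
    scale = sum(abs(w) for w in weights) / len(weights) or 1.0
    q = [max(-1, min(1, round(w / scale))) for w in weights]
    return q, scale


def ternary_dot(q, scale, x):
    """Dot product using only additions and subtractions, then one rescale."""
    acc = 0.0
    for qi, xi in zip(q, x):
        if qi == 1:
            acc += xi      # +1 weight: add the activation
        elif qi == -1:
            acc -= xi      # -1 weight: subtract the activation
        # 0 weight: skip entirely
    return acc * scale


weights = [0.9, -0.05, -1.1, 0.4]
q, scale = ternarize(weights)
print(q, scale, ternary_dot(q, scale, [1.0, 2.0, 3.0, 4.0]))
```

The inner loop contains no multiplications; the single multiply by `scale` happens once per output, which is the source of the compute savings described above.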

Q2: 테이블탑 조작과 자율주행 VLA의 핵심적 도메인 차이 3가지를 설명하라.

답: (1) 안전 수준: 테이블탑은 실패해도 물체 파손 수준이지만, 자율주행은 인명 피해 가능. (2) 제어 주파수: 테이블탑은 5-50Hz로 충분하지만, 자율주행은 30Hz+ 실시간 응답 필수. (3) 환경 다양성: 테이블탑은 제한된 작업 공간이지만, 자율주행은 무한히 다양한 도로·날씨·교통 상황. 이 차이로 인해 두 도메인의 VLA는 같은 기술적 DNA를 공유하면서도 독립적으로 진화하고 있다.

Q3: LIBERO 벤치마크가 "포화(saturation)"에 도달했다는 것은 무엇을 의미하며, 이것이 VLA 연구에 주는 시사점은?

답: LIBERO-Object(99.8%), LIBERO-Spatial(98.8%) 등에서 성공률이 거의 100%에 도달하여, 이 벤치마크로는 더 이상 모델 간 성능 차이를 구분할 수 없게 되었다. 이는 (1) 단순한 단일 환경 조작 태스크는 VLA가 사실상 해결했음을 의미하지만, (2) LIBERO-Long(96.6%)처럼 장시간 복합 태스크는 여전히 도전적이고, (3) 실세계 일반화 능력을 측정하는 새로운 벤치마크의 필요성을 시사한다.

Open Research Questions: Section 7-8

효율화의 이론적 한계: VLA에서 성능 저하 없이 달성 가능한 최대 압축률의 이론적 상한은? 정보 이론적 관점에서 로봇 행동 생성에 실제로 필요한 최소 비트 수는?

도메인 간 전이: 테이블탑 VLA의 효율화 기법이 자율주행이나 의료 로봇에도 동일하게 적용 가능한가? 도메인 특성에 따른 효율화 전략의 차이는?

의료 로봇 VLA: 안전 요구가 극도로 높고 데이터가 희소한 의료 도메인에서 VLA가 실용화되려면 어떤 돌파구가 필요한가?

벤치마크 설계: LIBERO 포화 이후, 실세계 일반화를 측정할 수 있는 차세대 벤치마크는 어떤 특성을 가져야 하는가?

9. 데이터셋, 벤치마크, 시뮬레이터

VLA 연구의 진보는 모델 아키텍처의 혁신만으로 이루어지지 않는다. 대규모 데이터셋, 신뢰할 수 있는 벤치마크, 그리고 현실적인 시뮬레이터가 삼위일체를 이루어야 비로소 연구가 전진한다. 이 장에서는 VLA 생태계를 떠받치는 데이터 인프라를 체계적으로 조망한다.

9.1 로봇 학습 데이터셋

9.1.1 대규모 교차 체현 데이터셋의 부상

VLA 연구 초기에는 각 연구 그룹이 자체 로봇과 환경에서 소규모 데이터셋을 구축하는 것이 일반적이었다. MIME, RoboTurk, RoboNet 등이 이 시기의 대표적 산물이다. 이들은 수천에서 수만 에피소드 규모로, 특정 로봇 플랫폼과 제한된 과제에 초점을 맞추었다. 그러나 사전학습된 대형 모델의 잠재력을 끌어내기 위해서는 훨씬 더 크고 다양한 데이터가 필요했다.

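The dataset that drove this paradigm shift is Open X-Embodiment (OXE) [19]. Built through a collaboration of 21 institutions, it consolidates more than one million episodes collected on 22 robot embodiments into a single unified format, and it can fairly be called the ImageNet of VLA research. It covers 527 skills and was the first to demonstrate the feasibility of cross-embodiment learning at scale. When RT-2-X was trained on OXE, it showed a performance gain of more than 50% over models trained on a single dataset, a striking demonstration of the value of data diversity.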

다음 표는 주요 로봇 학습 데이터셋을 정리한 것이다.

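Dataset	Scale	Robot platform	Key characteristics
Open X-Embodiment (OXE) [19]	1M+ episodes, 22 robot embodiments, 60+ constituent datasets	22 platforms	Largest cross-embodiment dataset; covers 527 skills
BridgeData V2	71 tasks	WidowX	Cross-domain language annotations; diverse environments
DROID	564 scenes, 86 tasks	Various	"In the wild" teleoperation; real-world environment diversity
RT-1 [12] Kitchen	130K+ real-world demonstrations	Everyday Robots	700+ everyday activities; large-scale real-world collection
BC-Z	25K+ episodes	7-DoF robot arm	100 tasks; semi-autonomous collection protocol
MIME, RoboTurk, RoboNet	Varies (thousands to tens of thousands)	Various	Early benchmark datasets; historical significance
RH20T	147 tasks	Various	Supports one-shot learning
EgoDex	829 hours	Human hands	Dense 3D hand/finger tracking; dexterity learning
Ego4D / EPIC-Kitchens	Thousands of hours	Human	Egocentric video; used for VLM pretraining
GraspVerse (source outside the 14 surveys)	1B+ samples	Simulation	Synthetic grasp data; large-scale synthetic generation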

9.1.2 인간 비디오 데이터의 전략적 활용

로봇 데이터의 수집 비용은 인터넷 텍스트나 이미지에 비해 압도적으로 높다. 이 병목을 우회하는 핵심 전략 중 하나가 인간 비디오 데이터의 활용이다. Ego4D(3,670시간), EPIC-Kitchens(100시간+), EgoDex(829시간) 등은 인간이 일상에서 수행하는 조작 활동을 자아중심 시점에서 촬영한 것으로, 로봇이 직접 수집하지 않고도 "어떻게 물체를 다루는가"에 대한 풍부한 시각적 사전지식을 제공한다.

GR-2가 인간 비디오 사전학습 후 로봇 미세조정으로 우수한 성능을 달성한 사례, 그리고 HPT [96] (Wang et al., 2024)가 인간 손 데이터와 로봇 데이터를 혼합하여 교차 체현 일반화를 향상시킨 사례는, 이 전략의 유효성을 입증한다. 다만, 인간과 로봇의 형태학적(morphological) 차이로 인한 도메인 갭은 여전히 해결해야 할 과제이다. EgoDex의 밀집 3D 손가락 추적 데이터는 이 갭을 줄이기 위한 구체적 시도로, 로봇 손의 정밀 제어에 직접 활용 가능한 형태의 인간 데이터를 제공한다.

9.1.3 합성 데이터와 자율 수집

데이터 병목의 또 다른 해법은 합성 데이터 생성과 자율 수집이다. GraspVerse(14개 서베이 외 출처)는 10억 건 이상의 합성 파지 데이터를 생성하여, 시뮬레이션에서의 대규모 사전학습을 가능케 했다. SOAR (Luo et al., 2025)와 같은 자율 수집 파이프라인은 로봇이 스스로 데이터를 수집하고 레이블링하는 방식으로, 인간 감독 없이도 데이터셋을 확장할 수 있는 가능성을 보여준다.

특히 주목할 점은, SmolVLA [32] (450M 파라미터)가 데이터 품질과 커리큘럼에 집중하여 훨씬 큰 모델들과 경쟁력 있는 성능을 달성했다는 사실이다. 이는 단순한 데이터 규모 확대보다 데이터 품질, 다양성, 그리고 학습 커리큘럼의 설계가 더 중요할 수 있음을 시사한다.

9.2 시뮬레이션 벤치마크

9.2.1 매니퓰레이션 벤치마크

시뮬레이션 벤치마크는 VLA 모델의 성능을 체계적으로 비교하고, 실세계 실험 전 신속한 프로토타이핑을 가능케 하는 핵심 인프라이다. 다음 표는 주요 벤치마크의 현황을 정리한 것이다.

벤치마크	도메인	핵심 지표	2025년 최고 성능
LIBERO-Spatial/Object/Goal/Long	매니퓰레이션	성공률(%)	Spatial 98.8%, Object 99.8%, Goal 98.2%, Long 96.6%
CALVIN	다단계 조작	평균 시퀀스 길이(1-5)	4.44 (DreamVLA [64])
RLBench / RLBench2	RGB-D 조작	성공률	다양
Meta-World	다중 기술	성공률	다양
SIMPLER	심-투-리얼 전이	교정된 성공률	다양
THE COLOSSEUM	분포 이동 강건성	성공률	다양
VLABench	언어 조건 조작	성공률	다양
MIKASA-Robo	메모리 중심	부분 관측 조작 성공률	다양

LIBERO 스위트는 VLA 연구에서 가장 널리 사용되는 벤치마크 중 하나로, 난이도에 따라 네 가지 하위 벤치마크를 제공한다. Spatial(공간 관계 이해), Object(객체 식별), Goal(목표 달성), Long(장기 과제)로 구성되며, 2025년 현재 Spatial과 Object에서는 거의 포화 상태(98-99%)에 도달했다. 이는 단기적, 단순한 조작 과제에서 VLA 모델이 이미 충분한 성능을 달성했음을 의미하며, 연구의 초점이 더 복잡한 장기 과제로 이동해야 함을 시사한다.

CALVIN은 다단계 연속 과제를 평가하는 벤치마크로, 모델이 최대 5개의 연속 명령을 수행하는 능력을 측정한다. 평균 시퀀스 길이(Average Sequence Length)로 성능을 측정하며, DreamVLA [64] (Wen et al., 2025)가 4.44로 최고 성능을 기록했다. 이 벤치마크에서 월드 모델 기반 방법(VIP, DreamVLA [64], WorldVLA [76] (Chen et al., 2025))이 지배적인 성능을 보이는 것은, 장기 과제에서 미래 예측 능력의 중요성을 방증한다.
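The average sequence length metric described above can be computed as follows. This is a minimal sketch under the assumption that each evaluation chain is recorded as an ordered list of per-instruction success flags and that a chain stops counting at its first failure; the function name and the example data are hypothetical.

```python
def average_sequence_length(chains):
    """Mean number of consecutively completed instructions per chain.

    chains: list of lists of bools, one flag per instruction in order
    (CALVIN uses chains of up to 5 instructions).
    """
    total = 0
    for chain in chains:
        completed = 0
        for ok in chain:
            if not ok:
                break          # later instructions do not count after a failure
            completed += 1
        total += completed
    return total / len(chains)


# Hypothetical results for four evaluation chains.
results = [
    [True, True, True, True, True],
    [True, True, False, True, True],
    [False, True, True, True, True],
    [True, True, True, True, False],
]
print(average_sequence_length(results))
```

Because counting stops at the first failure, a score of 4.44 out of 5 means that most chains are completed end to end, which is why the metric is sensitive to long-horizon reliability.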

차세대 벤치마크 (2025-2026)

기존 벤치마크의 포화 문제를 해결하기 위한 새로운 평가 프레임워크들이 등장하고 있다:

RoboArena [150] (Li et al., 2025): 실세계-시뮬레이션 자동 변환 프레임워크로, 실세계 태스크를 자동으로 시뮬레이션 환경에 재현하여 대규모 벤치마킹을 가능하게 한다.
RoboCasa365 [151] (Nasiriany et al., 2025): 365개 태스크, 2,000개 이상의 주방 장면을 포함하는 대규모 가정환경 벤치마크
WorldGym [152] (Zhang et al., 2025): 행동 조건부 월드 모델을 평가 환경으로 활용하는 새로운 패러다임
WorldBench [41] (Hu et al., 2025): 자율주행 VLA를 위한 통합 평가 플랫폼, 개방형/폐쇄형 루프 평가 통합

이들 벤치마크는 기존 LIBERO, CALVIN의 포화를 넘어, 실세계 일반화와 도메인 다양성을 측정하는 방향으로 평가 패러다임을 확장하고 있다.

9.2.2 자율주행 벤치마크

자율주행 VLA를 위한 벤치마크는 매니퓰레이션과 다른 고유한 요구사항을 가진다. 안전성, 실시간성, 그리고 사회적 규범 준수가 핵심 평가 축이다.

벤치마크	도메인	핵심 지표	특성
Bench2Drive	자율주행 (CARLA)	폐루프 경로 성공률	220 경로, 44 시나리오
nuScenes / nuPlan	자율주행 (실세계)	L2 궤적 오차	대규모 실세계 데이터
Reason2Drive	주행 추론	CoT QA 일관성	600K video-text 쌍(CoT QA 주석 포함), 추론 과정 평가

Bench2Drive (Jia et al., 2024)는 CARLA 시뮬레이터 위에 구축된 폐루프(closed-loop) 벤치마크로, 220개 경로와 44개 시나리오에서 에이전트의 종합적 주행 능력을 평가한다. 개루프(open-loop) 평가에서 높은 점수를 받은 모델이 폐루프에서는 실패하는 경우가 빈번하여, 폐루프 평가의 필수성이 강조되고 있다. Reason2Drive (Nie et al., 2024)는 단순한 경로 추적을 넘어 "왜 이 행동을 선택했는가"에 대한 추론 과정을 평가하는 새로운 패러다임의 벤치마크이다.

9.3 시뮬레이터 생태계

VLA 연구를 지탱하는 시뮬레이터 생태계는 도메인별로 다양하게 발전해 왔다.

매니퓰레이션 시뮬레이터:

MuJoCo: 물리 시뮬레이션의 사실상 표준. 빠른 연산 속도와 정확한 접촉 역학이 강점이다.
SAPIEN: 관절체(articulated object) 조작에 특화된 시뮬레이터로, 서랍 열기, 수도꼭지 조작 등 일상 환경의 상호작용을 지원한다.
RLBench: A benchmark and simulator built on CoppeliaSim that provides more than 100 predefined tasks.
AI2-THOR / Habitat: 실내 네비게이션과 조작을 결합한 시뮬레이터로, 체현 AI(embodied AI) 연구의 주요 플랫폼이다.
Isaac Gym (NVIDIA): GPU 가속 대규모 병렬 시뮬레이션을 지원하며, 수천 개의 환경을 동시에 실행할 수 있어 RL 학습에 최적화되어 있다.

자율주행 시뮬레이터:

CARLA: 오픈소스 자율주행 시뮬레이터의 대표격. 다양한 날씨, 교통 시나리오, 센서 모달리티를 지원한다.
nuPlan: nuScenes 데이터를 기반으로 한 폐루프 계획 벤치마크 겸 시뮬레이터이다.

차세대 범용 시뮬레이터:

Genesis (14개 서베이 외 출처): GPU 가속 물리 엔진으로, 다양한 물리 솔버를 통합하여 범용적인 로봇 시뮬레이션을 목표로 한다. 기존 시뮬레이터 대비 10-100배의 속도 향상을 주장한다.
UniSim [101] (Yang et al., 2024): 행동 조건부 비디오 디퓨전(action-conditioned video diffusion) 기반의 "학습된 시뮬레이터"로, 명시적 물리 엔진 없이 데이터에서 직접 환경 역학을 학습한다. 이는 전통적 시뮬레이터의 현실성 한계를 우회하는 혁신적 접근이다.

핵심 갭: 현재 시뮬레이터 생태계의 가장 큰 한계는 통합된 교차 체현/교차 과제 벤치마크의 부재이다. 각 시뮬레이터가 고유한 과제 정의, 로봇 모델, 평가 프로토콜을 사용하기 때문에, 서로 다른 시뮬레이터에서 보고된 결과를 직접 비교하는 것은 사실상 불가능하다. 이는 컴퓨터 비전 분야에서 ImageNet이 수행했던 통합 벤치마크의 역할이 로봇 학습 분야에서는 아직 부재함을 의미한다.

9.4 평가 프로토콜의 한계와 개선 방향

9.4.1 재현성의 위기

VLA 연구에서 보고되는 성공률 수치는 종종 오해를 불러일으킨다. 일부 연구에서 시드(seed)만 변경해도 성공률이 30% 이상 변동하는 것이 관찰되었다. 이는 보고된 "최고 성능"이 통계적으로 유의미하지 않을 수 있음을 의미한다. 환경의 초기 조건, 물체의 미세한 배치 변화, 시뮬레이터의 물리 엔진 비결정성 등이 이러한 분산의 원인이다.

9.4.2 시뮬레이션-실세계 괴리

시뮬레이션에서 높은 성공률을 달성한 모델이 실세계에서 실패하는 현상은 여전히 만연하다. 접촉 역학의 부정확성, 시각적 사실성의 한계, 그리고 실세계의 예측 불가능한 교란 등이 주요 원인이다. SIMPLER 벤치마크가 교정된(calibrated) 심-투-리얼 평가를 제공하려는 시도를 하고 있지만, 근본적 해결에는 이르지 못했다.

9.4.3 평가 지표의 단일성

현재 대부분의 벤치마크는 "성공률"이라는 단일 지표에 의존한다. 그러나 실세계 배포를 고려하면 이것만으로는 부족하다.

충돌 회피: 과제를 완수하더라도 환경과의 불필요한 충돌은 위험하다.
실패 복구: 실패 시 안전한 상태로 복귀하는 능력은 보고되지 않는다.
에너지 효율: 동일한 과제를 더 적은 에너지로 수행하는 것은 실용적으로 중요하다.
적대적 강건성: 의도적인 교란에 대한 저항성은 안전 관련 응용에서 필수적이다.
추론 지연 시간: 모델의 추론 속도는 실시간 제어 가능 여부를 결정하지만, 성공률과 함께 체계적으로 보고되는 경우가 드물다.

자율주행 분야에서는 이 문제가 더욱 심각하다. 제어 안전성과 언어 충실도를 동시에 평가하는 통합 "AI 운전면허" 벤치마크가 부재하며, 개루프 L2 오차와 같은 프록시 지표가 실제 주행 안전성과 약한 상관관계를 보이는 것이 반복적으로 지적되고 있다.

9.4.4 개선 제안

이러한 한계를 극복하기 위해 다음 두 가지 트랙의 평가 체계를 제안한다.

(i) 시뮬레이션 트랙: 고정된 시드, 데이터 분할, 기준 모델을 공유하는 표준화된 시뮬레이션 평가. 모든 연구가 동일 조건에서 비교 가능하도록 환경 설정을 완전히 재현 가능한 형태로 공개한다. 최소 10개 이상의 시드에서 평균과 분산을 보고하는 것을 의무화한다.

(ii) 실세계 커뮤니티 트랙: 공유 하드웨어 프로토콜에 기반한 실세계 평가. 표준화된 로봇 플랫폼(예: Franka Emika, UR5), 과제 정의, 평가 절차를 커뮤니티가 합의하여 정의하고, 각 연구 그룹이 동일한 프로토콜로 실세계 성능을 보고한다.
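The reporting rule proposed for the simulation track (mean and variance over at least 10 seeds) could be enforced with a small helper like the one below. This is a sketch only: the function name, the normal-approximation 95% interval, and the example success rates are assumptions introduced for illustration.

```python
import statistics


def summarize_seeds(success_rates, min_seeds=10):
    """Aggregate per-seed success rates; refuse to report with too few seeds."""
    n = len(success_rates)
    if n < min_seeds:
        raise ValueError(f"need at least {min_seeds} seeds, got {n}")
    mean = statistics.mean(success_rates)
    std = statistics.stdev(success_rates)      # sample standard deviation
    half_width = 1.96 * std / n ** 0.5         # normal-approximation 95% CI
    return {"n": n, "mean": mean, "std": std, "ci95_half_width": half_width}


# Hypothetical per-seed success rates for one model on one benchmark.
rates = [0.90, 0.80, 0.85, 0.95, 0.70, 0.90, 0.88, 0.92, 0.60, 0.90]
print(summarize_seeds(rates))
```

Reporting the interval alongside the mean makes it visible when two "state-of-the-art" numbers are statistically indistinguishable, which is the reproducibility concern raised in 9.4.1.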

10. 미해결 문제와 미래 전망

A cross-cutting analysis of the 14 VLA survey papers identified 11 core challenges. Individual surveys treat them only in part; their overall structure becomes visible only through cross-survey analysis.

10.1 데이터 병목

VLA 연구의 가장 근본적인 제약은 데이터이다. 현재 최대 규모인 OXE [19] 데이터셋도 확장 버전 기준 약 250만 에피소드(원본 v1은 100만+ 에피소드)에 불과하며, 이는 GPT-2의 학습 코퍼스(WebText, 수십억 토큰)와 비교하면 극히 미미한 수준이다. 더구나 로봇 데이터는 인터넷 텍스트와 달리 수집 비용이 에피소드당 수십 달러에 달하며, 각 로봇 플랫폼의 고유한 형태학에 종속된다.

해결 방향:

시뮬레이션 합성: GraspVerse(14개 서베이 외 출처)와 같은 대규모 합성 데이터 생성. 도메인 무작위화(domain randomization)와 결합하여 심-투-리얼 전이를 촉진한다.
인간 비디오 활용: Ego4D, EPIC-Kitchens 등에서 조작의 시각적 사전지식을 추출한다.
자율 수집(SOAR): 로봇이 스스로 탐색하고 데이터를 수집하는 자기 지도 파이프라인이다.
능동적 선정(active curation): 모든 데이터가 동등하지 않다. 모델의 약점을 타겟으로 데이터를 선별 수집한다.

교차 인사이트: SmolVLA [32]가 450M 파라미터로도 7B 모델과 경쟁하는 사례는, 데이터 스케일링보다 데이터 품질과 다양성이 더 중요할 수 있음을 시사한다. 이는 "더 많은 데이터"가 아닌 "더 나은 데이터"로의 패러다임 전환을 예고한다.

10.2 일반화의 벽

VLA 모델의 일반화 성능은 평가 조건에 따라 극적으로 변화한다.

도메인 내(in-domain): 학습 환경과 동일한 조건에서 80-90%의 성공률
교차 도메인(cross-domain): 새로운 객체나 환경에서 40-70%로 하락
제로샷(zero-shot): 완전히 새로운 과제에서 20-50%까지 하락

이 격차를 좁히기 위한 시도가 다각도로 진행 중이다. HPT [96](Heterogeneous Pretrained Transformers)는 다양한 체현에서의 사전학습을 통해 교차 체현 일반화를, UniAct (Qian et al., 2025)는 행동 공간의 통합 표현을 통해 체현 불가지론적(embodiment-agnostic) 정책을, BridgeVLA (Li et al., 2025)는 웹 규모 시각 지식과 로봇 행동의 연결을 각각 시도한다.

시뮬레이션에서 실세계로의 전이(sim-to-real transfer)는 여전히 미해결 과제이다. 물리적 접촉의 부정확성, 시각적 도메인 갭, 그리고 실세계의 비정형적(non-stationary) 환경이 주요 장벽이다. GEN-0 (Team, 2025)의 스케일링 법칙 연구는 모델과 데이터의 규모를 키우면 일반화가 예측 가능하게 개선된다는 초기 증거를 제시하지만, 이 법칙이 어디까지 유효한지는 아직 불분명하다.

10.3 실시간 추론

대형 VLM 백본(7B-55B 파라미터)과 디퓨전 기반 행동 생성의 조합은 강력하지만, 실시간 제어에는 치명적인 지연 시간 문제를 야기한다. 자율주행에서는 최소 30Hz, 매니퓰레이션의 정밀 제어에서는 50Hz 이상의 제어 주파수가 요구되는데, 단순한 전방 패스(forward pass)만으로도 이 요구를 충족하기 어려운 경우가 많다.

해결 전략:

계층적 비동기 실행: 고수준 VLM은 낮은 주파수(1-5Hz)로 하위 목표를 생성하고, 경량 저수준 정책이 높은 주파수(50-100Hz)로 실제 제어를 수행한다. GR00T N1 [21], CogACT [23] 등이 이 접근을 채택한다.
토큰 캐싱: 이전 추론의 키-값(KV) 캐시를 재활용하여 중복 연산을 제거한다.
양자화(quantization): FP16, INT8, INT4 등으로 모델 정밀도를 낮추어 추론 속도를 높인다. 4비트 양자화에서도 성능 저하가 2% 미만인 경우가 보고되었다.
가지치기(pruning)와 증류(distillation): 불필요한 파라미터를 제거하거나, 대형 모델의 지식을 소형 모델로 전이한다.

핵심은 "지능적 희소성(intelligent sparsity)"이다. 모든 입력에 대해 전체 모델을 활성화하는 대신, 입력의 복잡도에 따라 연산량을 동적으로 조절하는 접근이 부상하고 있다.
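The hierarchical execution strategy listed above (a slow planner at 1-5 Hz feeding a fast policy at 50-100 Hz) can be sketched as a two-rate loop. The version below is a deterministic, single-threaded simulation of the rate split, not a truly asynchronous implementation, and `slow_planner` and `fast_policy` are hypothetical stand-ins for the VLM and the low-level policy.

```python
def slow_planner(tick):
    """Stand-in for the high-level VLM: returns a new subgoal label."""
    return f"subgoal@{tick}"


def fast_policy(subgoal, tick):
    """Stand-in for the lightweight low-level policy: returns one action."""
    return (subgoal, tick)


def run_dual_rate(duration_s, control_hz=50, planner_hz=5):
    """Run the fast loop every tick and refresh the subgoal at the slow rate."""
    ticks_per_plan = control_hz // planner_hz
    subgoal = None
    planner_calls = 0
    actions = []
    for tick in range(int(duration_s * control_hz)):
        if tick % ticks_per_plan == 0:
            subgoal = slow_planner(tick)   # expensive call, made rarely
            planner_calls += 1
        actions.append(fast_policy(subgoal, tick))  # cheap call, every tick
    return planner_calls, actions


calls, actions = run_dual_rate(2.0)
print(calls, len(actions))
```

In a real system the planner would run in its own thread or process so that a slow VLM forward pass never blocks the control loop; the fast policy would simply keep using the most recent subgoal.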

10.4 장기 과제와 계층적 추론

순수 종단간(end-to-end) 모델은 단일 동작 수준의 과제에서는 뛰어나지만, 다단계 합성 과제에서는 체계적으로 실패한다. "서랍을 열고, 컵을 꺼내서, 선반에 놓아라"와 같은 과제는 계획, 하위 목표 설정, 진행 상황 모니터링, 그리고 실패 시 재계획을 요구하며, 이는 단일 정책으로는 처리하기 어렵다.

해결 접근:

계층적 분해: π0.5 [31]는 고수준 VLM 계획기와 저수준 행동 정책을 명시적으로 분리한다.
사고의 연쇄(Chain-of-Thought): CoT-VLA [55]는 행동 생성 전 명시적 추론 단계를 삽입하여, 모델이 "왜" 특정 행동을 선택하는지를 추론한다.
기술 라이브러리(Skill Library): ReLEP (Park et al., 2025) 등은 재사용 가능한 기술 원형을 학습하고 조합하여 복잡한 과제를 구성한다.

CALVIN 벤치마크에서의 경향이 이 방향의 유효성을 입증한다. 월드 모델 기반 방법(DreamVLA [64] (Wen et al., 2025): 4.44, WorldVLA [76] (Chen et al., 2025): 4.38)이 순수 반응적(reactive) 정책 대비 압도적인 성능을 보이며, 미래 상태를 예측하고 이를 계획에 활용하는 능력이 장기 과제 성공의 열쇠임을 보여준다.

10.5 안전과 정렬

VLA의 안전 문제는 순수 소프트웨어 AI와 질적으로 다르다. LLM의 환각(hallucination)이 잘못된 텍스트를 생성하는 것에 그치지만, VLA의 환각은 물리적 충돌, 파손, 심지어 인명 피해로 이어질 수 있다. 물리적 실패의 비가역성이 핵심적 차이이다.

현재의 시도:

SafeVLA [75] (Chen et al., 2025): VLA에 안전 제약을 명시적으로 통합한 최초의 시도. 안전 관련 학습 데이터와 제약 위반 페널티를 결합한다.
SafeAuto [140]: 자율주행에서 교통 법규 기반의 심볼릭 거부권(symbolic veto)을 구현한다. 신경망의 출력이 규칙 기반 안전 검증을 통과해야만 실행되는 이중 구조이다.
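The dual structure described for the symbolic veto, in which a learned policy proposes and a rule layer must approve, can be sketched as a simple filter. The rules, field names, and fallback action below are hypothetical examples chosen for illustration; they are not SafeAuto's actual rule set or interface.

```python
from dataclasses import dataclass


@dataclass
class Action:
    speed_mps: float
    steering_deg: float


def within_speed_limit(action, ctx):
    return action.speed_mps <= ctx["speed_limit_mps"]


def stops_at_red_light(action, ctx):
    return not (ctx["red_light"] and action.speed_mps > 0.0)


# Hypothetical rule set; a real system would encode traffic law formally.
RULES = [
    ("speed_limit", within_speed_limit),
    ("red_light", stops_at_red_light),
]
SAFE_FALLBACK = Action(speed_mps=0.0, steering_deg=0.0)


def veto_filter(proposed, ctx):
    """Return the proposed action only if every rule passes, else a fallback."""
    violated = [name for name, rule in RULES if not rule(proposed, ctx)]
    if violated:
        return SAFE_FALLBACK, violated
    return proposed, []


ctx = {"speed_limit_mps": 13.9, "red_light": True}
print(veto_filter(Action(10.0, 2.0), ctx))
```

The key design property is that the rule layer is independent of the network: it can be audited and tested on its own, even though the policy that proposes actions cannot be formally verified.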

그러나 형식적 검증(formal verification)의 부재는 심각한 과제로 남아 있다. 전통적 제어 시스템은 Lyapunov 안정성, 도달 가능성(reachability) 분석 등의 수학적 도구로 안전성을 보장할 수 있지만, 언어 조건 신경망 정책에 대해서는 이러한 검증 방법이 확립되어 있지 않다.

자율주행에서 이 문제는 특히 심각하다. 매니퓰레이션에서의 환각이 물체를 떨어뜨리는 수준에 그칠 수 있지만, 자율주행에서의 환각은 교통사고로 직결된다. 이 "환각 위험의 비대칭성"은 자율주행 VLA가 매니퓰레이션 VLA와 근본적으로 다른 안전 요구사항을 가짐을 의미한다.

10.6 환각과 추론 안정성

LLM 기반 계획기가 물리적으로 불가능한 행동을 생성하는 문제는 VLA의 근본적 약점이다. "물컵을 90도 기울여서 옮겨라"와 같은 물리적으로 불합리한 계획은 LLM의 상식 추론이 물리적 현실과 괴리될 때 발생한다.

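SC-VLA [56] (Self-Correcting VLA; Guo et al., 2025) introduces explicit failure detection and recovery reasoning, and this self-correction is reported to reduce the task failure rate by 35% (as cited in Zhang et al. [6]). It implements a feedback loop in which the model monitors the outcome of its own actions and generates an alternative action when the observed result differs from the expected one.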

그러나 개방 세계(open world)에서의 환각 검증은 근본적으로 어렵다. 학습 데이터에 없는 상황에서 모델의 출력이 "물리적으로 실현 가능한가"를 판단하려면, 모델 자체가 정확한 물리 시뮬레이터 역할을 해야 하는 순환적 문제에 봉착한다.

10.7 다중 모달 통합

현재 VLA 연구는 시각 중심의 편향을 보인다. 대부분의 모델이 RGB 이미지만을 감각 입력으로 사용하며, 촉각, 힘/토크, 소리, 온도 등 인간이 조작에 활용하는 다른 감각은 거의 무시된다.

ForceVLA [78] (Lee et al., 2025): 힘/토크 센서 데이터를 VLA에 통합하여, 섬세한 물체 조작에서의 성능을 향상시켰다.
TactileVLA [153] (Kim et al., 2025): 촉각 센서 입력을 활용하여, 시각만으로는 판단하기 어려운 물체의 물성(경도, 질감 등)을 인지한다.
OmniVTLA [154] (Wang et al., 2025): 시각, 촉각, 언어를 동시에 처리하는 통합 아키텍처를 제안한다.

인간은 불확실성이 높은 상황에서 자동적으로 모달리티를 재가중한다. 시각이 불충분할 때 촉각에 더 의존하고, 소음이 심할 때 시각적 단서에 더 집중한다. 이러한 적응적 모달리티 재가중(adaptive modality reweighting)은 로봇에서 아직 체계적으로 구현되지 않았으며, 다중 모달 VLA의 중요한 미래 방향이다.

10.8 인간-로봇 상호작용

현재의 VLA는 "유사 상호작용(pseudo-interaction)"에 머물러 있다. 인간이 지시를 내리면 로봇이 이행하는 단방향 소통이 지배적이며, 진정한 양방향 대화형 협업은 거의 구현되지 않았다.

진정한 인간-로봇 상호작용을 위해서는 다음이 필요하다.

적응형 대화: 로봇이 모호한 지시에 대해 명확화 질문을 하고, 인간의 피드백에 따라 행동을 조정한다.
선호 학습(preference learning): 인간의 암묵적 선호(속도, 안전성, 미적 기준 등)를 상호작용을 통해 학습한다.
인간 피드백 루프: 배포 후에도 인간의 교정 피드백을 통해 지속적으로 개선한다.

이 분야는 NLP에서의 RLHF(Reinforcement Learning from Human Feedback) 성공에 힘입어, "RLHF for Robotics"라는 새로운 연구 방향이 형성되고 있다.

10.9 평가와 벤치마킹

9.4절에서 논의한 평가 한계는 미해결 문제로서 더 근본적인 차원에서 재조명할 필요가 있다. 현재 로봇 학습 분야에는 컴퓨터 비전의 ImageNet, NLP의 GLUE/SuperGLUE에 해당하는 통합 벤치마크가 부재하다.

이 부재의 결과는 심각하다. 논문 A가 LIBERO에서 98%를, 논문 B가 CALVIN에서 4.44를 보고할 때, 어떤 모델이 "더 나은" 것인지를 판단할 수 없다. 시드 무작위성으로 인한 재현성 문제까지 더해지면, VLA 연구의 실질적 진보를 정량적으로 추적하는 것 자체가 어려워진다.

10.10 윤리와 사회적 영향

VLA의 실세계 배포는 기술적 과제를 넘어 윤리적, 사회적 질문을 제기한다.

프라이버시: 가정이나 직장에서 작동하는 VLA 로봇은 지속적으로 환경을 촬영하고 해석한다. 이 데이터의 수집, 저장, 활용에 대한 명확한 가이드라인이 필요하다.
고용 대체: 조작 능력의 발전은 물류, 제조, 서비스 산업에서의 자동화를 가속화하며, 고용 구조의 변화를 초래할 수 있다.
의사결정 편향: VLM 백본이 인터넷 데이터에서 학습한 편향이 물리적 행동으로 발현될 수 있다. 예를 들어, 특정 인종이나 성별에 대한 편향이 인간-로봇 상호작용에서 차별적 행동으로 이어질 위험이 있다.
규제 프레임워크: 자율주행을 제외하면, VLA 로봇의 배포에 대한 규제 프레임워크는 거의 존재하지 않는다. 인증, 책임 소재, 사고 보고 체계 등이 시급히 마련되어야 한다.

10.11 교차 서베이 통합 인사이트

14개의 서베이를 교차 분석하여 도출한 다음 10가지 인사이트는, 개별 서베이에서는 명시적으로 드러나지 않는 창발적(emergent) 패턴이다.

인사이트 1 -- 수렴의 증거

14개 서베이는 각각 다른 분류체계(taxonomy)를 사용하지만, 궁극적으로 동일한 풍경(landscape)의 서로 다른 투영(projection)이다. 아키텍처 서베이는 "백본-행동 헤드" 축으로, 학습 서베이는 "사전학습-미세조정" 축으로, 응용 서베이는 "도메인-과제" 축으로 VLA를 분류한다. 그러나 이 모든 관점에서 "VLM Brain + Generative Action Head"가 최적점으로 수렴하고 있다. 이는 2024년 말부터 2025년에 걸쳐 명확해진 추세로, RT-2 [11]에서 시작된 "언어 모델을 행동 모델로" 패러다임이 이제 보편적 합의에 도달했음을 의미한다.

인사이트 2 -- 스케일 역전 현상

"더 큰 모델이 더 나은 성능을 낸다"는 스케일링 법칙의 직관이 VLA에서는 반드시 성립하지 않는다. 구체적 증거가 이를 뒷받침한다.

CLIP-RT (1B) outperforms OpenVLA [15] (7B) on many tasks.
SmolVLA [32] (450M)가 LIBERO에서 7B급 모델과 경쟁적 성능을 보인다.
3B급 모델(CogACT [23], SpatialVLA [39])이 7B 모델과 동등하거나 우수한 성능을 달성한다.

이는 데이터 품질, 토큰화 전략, 아키텍처 설계가 파라미터 수보다 중요할 수 있음을 시사한다. VLA에서는 로봇 데이터의 희소성 때문에, 대형 모델이 과적합하거나 불필요한 용량을 낭비하는 현상이 발생할 수 있다.

인사이트 3 -- 토큰화가 제어 대역폭을 결정

행동 토큰화 방식은 단순한 구현 세부사항이 아니라, 시스템의 근본적 능력을 규정하는 설계 선택이다.

이산 빈(discrete bin) 토큰화: 구현이 단순하지만 정밀도가 제한된다. 1-5Hz 제어에 적합하다.
디퓨전 기반: 연속적이고 다봉(multimodal) 분포를 표현할 수 있지만, 역확산 과정의 반복이 추론 속도를 저하시킨다. 5-20Hz 범위이다.
플로우 매칭(flow matching): 디퓨전 대비 빠른 수렴으로 20-50Hz를 달성한다.
FAST 토큰화 [20]: 이산 방식의 속도와 연속 방식의 정밀도를 동시에 추구하며, 50-120Hz까지의 제어 주파수를 가능케 한다.

이 관점은 단일 서베이에서 명시적으로 다루어지지 않는 교차적 통찰이다. 토큰화 방식의 선택이 1Hz에서 120Hz까지의 제어 주파수를 결정하고, 이것이 수행 가능한 과제의 범위를 근본적으로 규정한다. 느린 제어는 거친 조작만 가능하고, 빠른 제어는 정밀 삽입, 봉합, 악기 연주와 같은 고난도 과제를 가능케 한다.
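The precision limit of discrete-bin tokenization mentioned above can be seen in a minimal encoder/decoder. The sketch assumes 256 uniform bins over a normalized action range of [-1, 1] and decoding to bin centers; the bin count, range, and function names are illustrative assumptions rather than the settings of any particular model.

```python
def encode_action(value, n_bins=256, lo=-1.0, hi=1.0):
    """Map a continuous action value to a discrete token id in [0, n_bins)."""
    clipped = max(lo, min(hi, value))
    idx = int((clipped - lo) / (hi - lo) * n_bins)
    return min(idx, n_bins - 1)   # the upper edge falls into the last bin


def decode_action(token, n_bins=256, lo=-1.0, hi=1.0):
    """Map a token id back to the center of its bin."""
    width = (hi - lo) / n_bins
    return lo + (token + 0.5) * width


for a in (-1.0, -0.5, 0.0, 0.3333, 1.0):
    t = encode_action(a)
    print(a, t, decode_action(t))
```

With these settings the round-trip error is bounded by half a bin width (1/256 of the normalized range here), and each action dimension costs one token, so higher precision or more dimensions directly increases the number of tokens to decode per control step.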

인사이트 4 -- 이중 시스템은 선택이 아닌 필수

장기 과제에서 순수 종단간 모델의 한계는 반복적으로 입증되고 있다. Daniel Kahneman의 System 1(빠른 직관)/System 2(느린 숙고) 구분이 로봇공학에서 공학적 필수사항으로 입증된 것이다.

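GR00T N1 [21] adopts a dual-system architecture in which a high-level VLM (System 2) generates subgoals and a low-level diffusion policy (System 1) executes them. As reported in the original GR00T N1 paper, this yields a 17% higher success rate and a 28% lower collision rate than a single-system baseline. This is strong evidence that a theoretical distinction from cognitive science can translate directly into an engineering design principle.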

인사이트 5 -- RL 후처리는 BC의 필수 보완재

행동 복제(BC)만으로 학습된 VLA는 구조적 한계를 가진다. 시연 데이터의 분포를 벗어나면 성능이 급격히 저하되는 분포 이동(distributional shift) 문제가 대표적이다. 강화학습(RL) 후처리(post-training)는 이 한계를 돌파하는 핵심 수단으로 부상했다.

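A dramatic case illustrates this: a success rate stuck at 4% with SFT alone was reported to rise to 97% after only 15 iterations of PPO (Proximal Policy Optimization), the RIPT-VLA [71] result. This is a gain over the SFT baseline rather than a recovery of lost performance. If BC teaches "how to do it," RL teaches "what is good." The combination is not optional but a required pipeline stage.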

인사이트 6 -- 효율성과 성능의 파레토 프론티어 이동

2025년 들어 VLA 연구의 핵심 경쟁 축이 "절대 성능"에서 "컴퓨트 효율성"으로 이동하고 있다. 55B 파라미터의 RT-2 [11]가 달성한 성능을, 450M의 SmolVLA [32]가 유사하게 달성하는 것은 100배 이상의 효율성 혁명이다.

"지능적 희소성(Intelligent Sparsity)" 패러다임이 부상하고 있다. 이는 단순히 모델을 줄이는 것이 아니라, 필요한 곳에만 연산을 집중하는 것이다. LoRA 기반 효율적 미세조정, 전문가 혼합(MoE) 아키텍처, 조기 종료(early exit) 메커니즘 등이 이 패러다임의 구현체이다. 단순 스케일링 법칙보다 "컴퓨트당 성능(performance per FLOP)"이 핵심 지표로 전환되고 있다.

인사이트 7 -- 자율주행 VLA는 별도 진화 경로

매니퓰레이션 VLA와 자율주행 VLA는 동일한 "VLM + Action" 프레임워크를 공유하지만, 실제로는 상당히 다른 진화 경로를 걷고 있다. 자율주행은 매니퓰레이션에 비해 다음과 같은 고유한 요구사항을 가진다.

안전 요구: 실패의 결과가 치명적이며, 사회적 수용 기준이 훨씬 높다.
실시간 요구: 30Hz 이상의 제어 주파수가 절대적으로 필수이다.
사회적 규범: 교통 법규, 양보, 신호 준수 등 사회적 규약의 이해와 준수가 요구된다.

두 도메인 간의 기술 교환이 충분히 이루어지지 않고 있다는 점은 아쉬운 대목이다. 매니퓰레이션의 정밀 제어 기법이 자율주행의 미세 조향에, 자율주행의 안전 검증 프레임워크가 매니퓰레이션의 안전 정책에 기여할 수 있는 잠재적 교차 수분(cross-pollination) 기회가 존재한다.

인사이트 8 -- 인간 운동학습 이론이 VLA 연구의 미래 지도

Jin et al. [9]이 제안한 Newell의 운동학습 이론과 VLA의 매핑은 단순한 비유가 아닌 체계적 연구 프레임워크로 기능할 잠재력을 가진다. Newell의 "자유도 동결-해제(freezing-freeing degrees of freedom)" 이론은, VLA의 계층적 기술 학습에서 저차원 행동 공간에서 시작하여 점차 자유도를 확장하는 커리큘럼과 직접적으로 대응된다.

아직 미탐색된 영역이 풍부하다.

소뇌 모델의 로봇 구현: 인간 소뇌의 전방 모델(forward model)과 역모델(inverse model)의 조합이, VLA의 월드 모델과 역역학 정책의 조합으로 번역될 수 있다.
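Contextual interference effect: In human motor learning, random practice favors long-term retention over blocked practice; the same effect could be applied to VLA training curricula.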

인사이트 9 -- 월드 모델이 장기 과제의 열쇠

CALVIN 벤치마크에서의 성능 경향은 명확한 메시지를 전달한다. 시각적 상호작용 예측(VIP, Visual Interaction Prediction) 방법이 지배적 성능을 보이며, WorldVLA [76], DreamVLA [64], CoT-VLA [55] 등 월드 모델을 통합한 접근이 상위권을 휩쓸고 있다.

월드 모델은 "행동 전에 상상한다"는 원리를 구현한다. 특정 행동을 실행했을 때 세계가 어떻게 변화할지를 내부적으로 시뮬레이션하고, 그 결과가 목표에 부합하는지를 평가한 후에야 실제 행동을 실행한다. 이는 장기 과제에서의 계획 능력을 근본적으로 향상시키며, 차세대 VLA의 핵심 분화점(differentiator)이 될 것이다.

인사이트 10 -- "방어적 AI" 패러다임의 부상

강건성(robustness)이 성능과 동급의 1등 설계 목표로 격상되고 있다. 실험실에서 98% 성공률을 달성하더라도, 실세계의 예측 불가능한 교란 하에서 50%로 하락한다면 배포할 수 없기 때문이다.

BYOVLA (Bring Your Own VLA): A modular design in which the robustness of each component can be verified and replaced independently.
DreamVLA [64]: 월드 모델을 통한 상상 기반 강건성 향상. 예상치 못한 상황을 내부적으로 시뮬레이션하여 대비한다.
SafeVLA [75]: 명시적 안전 제약 통합으로 위험 행동을 사전 차단한다.

실세계 배포에서 강건성은 단순한 "있으면 좋은(nice-to-have)" 속성이 아니라, 시스템의 생존 조건(survival condition)이다. 이 인식이 연구 커뮤니티에 확산되면서, "방어적 AI(Defensive AI)" 패러다임이 형성되고 있다.

10.12 프런티어 모델과 오픈 웨이트 모델의 일반화 격차

2026년 현재 VLA 분야의 가장 뚜렷한 분단선은 비공개 프런티어 모델(Gemini Robotics, π0.5)과 오픈 웨이트 연구 모델 사이의 실세계 일반화 격차이다. 시뮬레이션 벤치마크(LIBERO, CALVIN)에서 양쪽의 성능이 수렴하고 있음에도, 실세계 제로샷 일반화에서는 여전히 큰 격차가 존재한다. RoboArena 리더보드에서 경쟁력 있는 제로샷 행동을 보이는 것은 π 계열 모델뿐이라는 분석이 있다(Reuss, 2026).

이 격차의 원인으로는 세 가지가 지목된다: (1) 데이터 품질·다양성 격차 — 프런티어 랩의 비공개 데이터가 공개 데이터셋보다 품질과 다양성이 우수, (2) 벤치마크 천장 효과 — 시뮬레이션 벤치마크의 포화로 실제 진전이 가려지는 현상, (3) 인프라 규모 격차 — 연구실 규모 vs 산업 규모의 학습 인프라 차이.

ICLR 2026에서는 데이터 품질 큐레이션과 인컨텍스트 학습이 가장 과소 대표된 연구 방향으로 식별되었으며, 이 두 방향이 격차 해소의 열쇠가 될 수 있다. 이 문제는 10.1절의 데이터 병목과 10.2절의 일반화 벽 모두와 긴밀히 연결되며, 오픈소스 커뮤니티의 핵심 도전 과제이다.

10.13 최전선 사례 연구: π 시리즈가 열어가는 두 갈래 프런티어 (2025.11 – 2026.03)

앞선 10.1~10.12절에서 VLA의 핵심 미해결 과제와 통합 인사이트를 도출했다. 그렇다면 이 과제들에 대해 실제로 얼마나 진전이 이루어지고 있는가? 2025년 11월과 2026년 3월, Physical Intelligence(PI)가 잇달아 발표한 두 논문은 이 질문에 대한 가장 구체적인 답변을 제공한다. π^*_0.6 [157]은 6.3절과 Insight 5에서 논의한 "BC→RL 전환"을, π_0.6-MEM [158]은 10.4절의 "장기 과제와 메모리 부재"를 각각 정면으로 공략한다. 두 논문 모두 π0.6 모델(Gemma 3 4B VLM + 860M Action Expert)을 기반으로 하며, 각각 VLA의 핵심 한계인 "BC의 성능 천장"과 "메모리 부재"를 실세계 규모에서 돌파한다.

10.13.1 π^*_0.6: 경험으로부터 배우는 VLA

[157] · Physical Intelligence · 2025.11

π^*_0.6 [157]은 RECAP(RL with Experience and Corrections via Advantage-conditioned Policies)이라는 방법론을 통해 VLA 모델이 실세계 배포 경험으로부터 스스로 개선할 수 있도록 하는 범용 RL 후처리 프레임워크이다. 6.3절에서 다룬 RL 후처리 연구들(VLA-RL, RIPT-VLA, ConRFT 등)이 대부분 시뮬레이션 벤치마크에서의 검증에 머문 반면, π^*_0.6는 실세계 장시간 복합 조작 태스크에서 대규모 VLA의 end-to-end RL 학습을 최초로 성공적으로 시연했다는 점에서 질적 전환점이다.

핵심 기술 혁신: Advantage Conditioning. 기존 VLA RL 후처리 방법들이 PPO나 GRPO 같은 정책 경사(policy gradient) 기반 추출을 사용한 것과 달리, RECAP은 advantage conditioning이라는 근본적으로 다른 정책 추출 방식을 채택한다. 핵심 아이디어는 다음과 같다:

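(1) A distributional value function is trained separately. It uses a small 670M-parameter VLM backbone and predicts, as a distribution over 201 discrete bins, the number of steps remaining until successful completion from each state.
(2) The advantage of each action is estimated from the value function and binarized, and a text token reading "Advantage: positive" or "Advantage: negative" is added to the VLA input.
(3) The VLA is trained with advantage-conditioned supervised learning on all data (demonstrations, autonomous rollouts, and human corrections); at inference time it is always conditioned on "Advantage: positive" to extract the improved policy.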
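The labeling step of advantage conditioning can be sketched as plain prompt construction. The sketch assumes the advantage estimate for each sample has already been computed from the value function, and it uses a zero threshold for binarization; the threshold, the prompt layout, and all function names are hypothetical and are not taken from the RECAP implementation.

```python
POSITIVE = "Advantage: positive"
NEGATIVE = "Advantage: negative"


def advantage_token(advantage, is_human_correction=False, threshold=0.0):
    """Binarize an advantage estimate into a text token.

    Human corrections are always labeled positive, as described in the text.
    The threshold value is an assumption made for this sketch.
    """
    if is_human_correction or advantage > threshold:
        return POSITIVE
    return NEGATIVE


def build_training_prompt(instruction, advantage, is_human_correction=False):
    """Training: condition on the label the data actually earned."""
    return f"{instruction}\n{advantage_token(advantage, is_human_correction)}"


def build_inference_prompt(instruction):
    """Inference: always ask for the positive-advantage behavior."""
    return f"{instruction}\n{POSITIVE}"


print(build_training_prompt("fold the shirt", -0.3))
print(build_training_prompt("fold the shirt", -0.3, is_human_correction=True))
print(build_inference_prompt("fold the shirt"))
```

Because the policy update is ordinary supervised learning on labeled prompts, no action log-likelihood is needed, which is the property that makes the approach compatible with Flow Matching action heads.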

이 접근법의 결정적 장점은 Flow Matching 기반 VLA와의 호환성이다. PPO/GRPO는 log-likelihood의 명시적 계산을 요구하는데, Flow Matching 모델은 이를 직접 제공하지 못하여 근사가 필요하다. Advantage conditioning은 이 문제를 완전히 우회하여, 단순한 조건부 지도학습만으로 정책 개선을 달성한다. 실험에서 π^*_0.6는 동일한 데이터로 학습한 AWR 및 PPO 기반 방법을 크게 능가했다.

세 가지 데이터 소스의 통합. RECAP은 (1) 시연 데이터(초기 SFT용), (2) 자율 롤아웃(로봇이 스스로 수행한 시도, 성공/실패 레이블 포함), (3) 인간 교정(human-gated DAgger 방식, 자율 실행 중 인간이 개입하여 실수를 교정)을 하나의 프레임워크 안에서 결합한다. 인간 교정 데이터에는 항상 positive advantage를 부여하고, 나머지 데이터는 가치 함수의 추정에 따라 advantage를 할당한다.

실세계 성과. π^*_0.6는 다음 세 가지 복합 태스크에서 검증되었다:

태스크	소요 시간	π^*_0.6 효과 (throughput)	성공률	연속 운용
에스프레소 제조	~200초/회	2배 이상 향상	90%+	13시간 연속
다양한 빨래 접기(11종)	~500초/회	2배 이상 향상	~70%(가장 어려운 버튼셔츠 기준)	2시간+ (새 집에서)
박스 조립(공장 배포)	~600초/회	2배 향상(2회 반복 후)	~90%	공장 실배포

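On the hardest tasks, π^*_0.6 more than doubled throughput (successful completions per hour) and cut the failure rate roughly in half. In the "targeted failure removal" experiment, a specific failure mode (wrong collar orientation) was largely eliminated, reaching a 97% success rate after just two iterations of 600 autonomous trajectories each.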

서베이 맥락에서의 의의. π^*_0.6는 6.3절에서 논의한 "BC→RL 전환"의 가장 완성된 실현이다. 기존 연구들이 시뮬레이션 벤치마크(LIBERO 등)에서 RL 후처리의 가능성을 입증했다면, π^*_0.6는 실세계 장시간 복합 태스크에서 대규모 Flow Matching VLA에 대한 end-to-end RL의 실용성을 최초로 입증했다. 이는 LLM 분야에서 GPT-3(사전학습) → InstructGPT(SFT) → ChatGPT(RLHF)로 이어진 발전 경로가 VLA에서도 현실화되고 있음을 의미한다.

10.13.2 π_0.6-MEM: VLA를 위한 다중 스케일 체화 메모리

[158] · Physical Intelligence · 2026.03

MEM(Multi-Scale Embodied Memory)은 VLA에 다중 모달·다중 시간 스케일의 메모리를 부여하는 시스템이다. 10.4절에서 "장기 과제와 계층적 추론"을 핵심 미해결 과제로 식별했는데, MEM은 이 문제에 대한 가장 직접적인 해법을 제시한다.

핵심 통찰: 메모리의 이중 표현. 로봇이 "주방 전체를 정리하라"는 15분짜리 태스크를 수행할 때, 필요한 메모리는 두 가지 성격이 전혀 다르다. (1) 단기 메모리: 최근 몇 초간의 시각 정보(팔이 물체를 가려서 보이지 않을 때, 실패한 파지 전략을 기억할 때), (2) 장기 메모리: 수 분에 걸친 의미론적 이벤트(어떤 재료를 이미 꺼냈는지, 어떤 서랍을 열었는지). MEM의 핵심 통찰은 이 두 종류의 메모리를 서로 다른 모달리티로 표현해야 한다는 것이다.

아키텍처 구성요소:

(1) 단기 비디오 메모리 (Video Encoder). 기존 ViT를 확장하여 비디오 입력을 처리하되, 새로운 학습 파라미터를 추가하지 않는다. 4번째 레이어마다 공간 어텐션(spatial attention)에 시간 어텐션(causal temporal attention)을 추가하는 space-time separable attention 구조를 사용한다. 과거 타임스텝의 토큰은 상위 레이어에서 드롭하여, VLA 백본에 전달되는 토큰 수를 단일 프레임 VLA와 동일하게 유지한다. 결과적으로 16프레임 입력에서도 추론 지연이 300ms 실시간 제약 이내에 머무른다(나이브 방식은 4초 이상 소요).

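(2) Long-term language memory (Language Memory). The high-level policy compresses past semantic events into a natural-language summary m_t and incrementally updates it at every step to m_{t+1}. Compression is the key: "placed the light green bowl, the dark blue bowl, and the light yellow bowl in the upper-right cabinet" becomes "placed three bowls in the upper-right cabinet," dropping unnecessary detail. This is critical for reducing the mismatch between the training and inference distributions.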

핵심 설계 원칙: 사전학습된 VLM 가중치로부터 초기화. 비디오 인코더는 K=1(단일 이미지)일 때 기존 VLM과 정확히 동일한 초기화를 보장하도록 설계되었다(시간 위치 인코딩의 t=0 값을 0으로 설정). 이 덕분에 기존 VLM의 사전학습 지식을 완벽히 보존하면서 메모리 능력을 추가할 수 있다. 실험에서 사전학습 없이 후처리 단계에서만 메모리를 도입한 경우 성능이 현저히 저하되어, 다양한 데이터로의 메모리 사전학습이 핵심임이 확인되었다.

실세계 성과.

능력	태스크 예시	결과
15분 장시간 태스크	레시피 재료 준비, 주방 전체 청소, 그릴드 치즈 샌드위치 조리	메모리 없는 π0.6 대비 과제 진행률 2-4배 향상
인컨텍스트 적응	젓가락 파지 높이 조정, 냉장고 문 열기 방향 전환	성공률 +11%~+62% (메모리 없는 모델 대비)
부분 관측성 처리	서랍 속 물체 위치 기억, 장보기 봉투 내용물 추적, 커피 스쿱 카운팅	모든 핵심 메모리 능력에서 유일하게 강한 성능
비메모리 태스크 성능	셔츠 접기, 침대 정리, 박스 조립 등	메모리 없는 π0.6와 동등 (메모리 추가로 인한 성능 저하 없음)

특히 주목할 점은, 기존 연구에서 반복적으로 보고된 causal confusion(인과 혼동) 문제, 즉 메모리를 추가하면 오히려 성능이 저하되는 현상이 π_0.6-MEM에서는 관찰되지 않았다는 것이다. 이는 다양한 최적성·속도·제어 주파수를 포함하는 대규모 사전학습 데이터 혼합이 spurious correlation을 방지했기 때문으로 분석된다.

서베이 맥락에서의 의의. MEM은 4.2절(두뇌 모듈)의 추론 패러다임과 10.4절(장기 과제)의 핵심 미해결 과제에 직접 응답한다. π0.5가 "VLM 계획기 + VLA 실행기"라는 계층적 분리로 장시간 작업을 접근했다면, MEM은 메모리라는 직교적 차원에서 같은 문제를 해결한다. 계층적 계획이 "무엇을 할지"를 분리하는 것이라면, 메모리는 "무엇을 했는지"를 기억하는 것이다. 이 두 접근은 상호 배타적이 아니라 상보적이며, 향후 결합될 가능성이 높다.

10.13.3 두 논문의 통합적 의의

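Dimension	π^*_0.6 [157]	π_0.6-MEM [158]
Limitation addressed	BC performance ceiling; inability to discover behaviors outside the demonstrations	No memory; inability to handle long-horizon tasks; partial observability
Base model	π0.6 (Gemma 3 4B + 860M Action Expert)	π0.6 (same)
Core innovation	Advantage conditioning: RL policy extraction applicable to Flow Matching VLAs	Video encoder (no added parameters) + compressive language memory
Data sources	Demonstrations + autonomous rollouts + human corrections (DAgger)	Robot demonstrations + video data + vision-language data
Headline result	13 hours of continuous espresso making; failure rate cut by about 50%	Solves 15-minute tasks such as kitchen cleaning and grilled cheese
Survey links	Section 6.3 (RL post-training), Insight 5	Section 10.4 (long-horizon tasks), Section 4.2 (reasoning)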

Motivation Chain: π 시리즈의 진화 (업데이트)

π0 (2024): VLM + Flow Matching Action Expert의 첫 결합

→ π0.5 (2025): 계층적 VLM 계획 + π0 실행, 30분+ 장시간 작업

→ π0.6 (2025): Gemma 3 4B 백본 + 860M Action Expert로 업그레이드, KI 학습 레시피

→ π^*_0.6 (2025.11): Advantage conditioning으로 실세계 RL 후처리. BC의 성능 천장 돌파

→ π_0.6-MEM (2026.03): 다중 스케일 메모리로 15분+ 장시간 태스크 해결. 메모리 부재 한계 돌파

이 두 논문을 종합하면, PI의 π 시리즈는 VLA 연구의 두 가지 핵심 프런티어를 동시에 밀어내고 있다: "더 잘하기"(π^*_0.6)와 "더 오래 하기"(MEM). π^*_0.6가 개별 행동의 품질을 시연 수준을 넘어 끌어올린다면, MEM은 개별 행동들이 수십 분에 걸친 일관된 과제 수행으로 엮이도록 한다. 두 접근이 하나의 모델에 결합된다면, "15분짜리 주방 정리를 시행착오를 통해 스스로 개선하는 로봇"이 실현 가능해진다. 이는 본 서베이가 제시한 "배포 준비(deployment readiness)" 패러다임의 가장 구체적인 진전이며, 앞서 식별한 미해결 과제들이 더 이상 이론적 추측이 아닌 공학적 도전으로 전환되고 있음을 보여준다.

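11. Conclusion

11.1 VLA -- Integrated Intelligence Realized

A VLA (Vision-Language-Action) model is the embodiment of integrated intelligence that lets a robot "see, understand, and act" in the world. By fusing three axes (visual perception, linguistic reasoning, and physical action) inside a single neural network, VLA is fundamentally redefining the modular pipeline of traditional robotics (perception-planning-control).

11.2 Three Years of History, 200 Models

Since RT-2 [11] first proposed the name "Vision-Language-Action Model" in 2023, more than 200 VLA models have appeared in just three years. This explosive growth results from three converging trends: (1) the maturation of large language models, (2) advances in vision-language pretraining, and (3) the arrival of large-scale robot datasets. The Cambrian explosion of VLA research began in 2023-2024, when all three reached a critical threshold at the same time.

11.3 Current Achievements and the Frontier

Short-horizon, single-domain manipulation tasks are close to solved. Scores of 98.8% on LIBERO-Spatial and 99.8% on LIBERO-Object show that the ability to perform defined tasks in defined environments is already approaching human level.

The real frontier, however, starts here.

Long-horizon tasks: planning and execution in multi-step compositional tasks
Cross-domain generalization: adapting to unseen environments and objects
Real-world deployment: reliable operation in uncontrolled environments

These three challenges are the "last mile" of VLA research and, at the same time, its hardest stretch.

11.4 The Efficiency Revolution

One of the most notable trends in VLA research is the dramatic gain in efficiency. From the 55B parameters of RT-2 [11] to the 450M parameters of SmolVLA [32], model size has shrunk by more than 100× while competitive performance is retained, shifting the Pareto frontier. This is a decisive factor in bringing practical VLA deployment closer: real-time inference on edge devices, cost-effective large-scale deployment, and energy efficiency all benefit from this efficiency revolution.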
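As a quick sanity check on the "more than 100×" figure, the ratio can be computed directly from the two parameter counts quoted above. The sketch below uses only those two numbers; the 16-bit weight format used for the memory estimate is an assumption added for illustration and is not a claim about how either model is actually stored or served.

```python
# Parameter counts quoted in the text.
rt2_params = 55e9        # RT-2: 55B parameters
smolvla_params = 450e6   # SmolVLA: 450M parameters

# How many times smaller the compact model is.
reduction = rt2_params / smolvla_params

# Assumption for illustration: weights stored at 16 bits (2 bytes) each.
BYTES_PER_PARAM = 2
rt2_gb = rt2_params * BYTES_PER_PARAM / 1e9
smolvla_gb = smolvla_params * BYTES_PER_PARAM / 1e9

print(f"size reduction: {reduction:.1f}x")
print(f"weights at 16-bit: {rt2_gb:.0f} GB vs {smolvla_gb:.1f} GB")
```

Under that assumption the weights alone drop from roughly 110 GB to under 1 GB, which is the difference between needing a multi-GPU server and fitting on a single edge accelerator.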

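11.5 The Shift from BC to RL

A pipeline that starts with behavior cloning (BC) and finishes with reinforcement learning (RL) post-training is becoming the standard for VLA training. The combination of the stable initialization provided by BC and the exploratory optimization provided by RL reaches performance levels that neither achieves alone. The jump from 4% after SFT to 97% after 15 PPO iterations illustrates the power of this combination.

11.6 Accelerating Domain Specialization

The general-purpose VLA framework is expanding into a variety of domains.

Autonomous driving: DriveVLM [91], DriveLM, and others implement VLAs specialized for road environments.
Humanoids: GR00T N1 [21], HumanPlus, and others apply VLA to whole-body control of humanoid robots.
Medical: VLA applications in surgical robots, rehabilitation assistance, and related areas are being explored.

Each domain has its own safety requirements, control frequencies, and interaction patterns, and domain-specific design is accelerating accordingly.

11.7 Safety, Ethics, and Formal Verification -- The Gatekeepers of Deployment

The final gate blocking real-world VLA deployment is not technical performance but safety and ethics. Efforts such as SafeVLA [75] and SafeAuto are under way, but no formal verification method for language-conditioned neural network policies has yet been established. This is a gate VLA must pass to move beyond the lab into everyday life, and it is an area where cooperation among regulators, industry, and academia is essential.

Societal impacts such as privacy, job displacement, and decision-making bias must also be discussed alongside technical progress. Starting the ethical discussion only after the technology has been deployed into society is too late.

11.8 Toward the Next Leap

The next leap in VLA research is expected to occur along four axes simultaneously.

First, world model integration. The ability to imagine outcomes before acting improves long-horizon performance, safety, and generalization alike. The success of DreamVLA [64] and WorldVLA [76] demonstrates the validity of this direction, and in the next generation of VLAs the world model will be a core module rather than an optional component (see Section 4.2.3 and the Large Model Embodied AI survey [44]).

Second, lifelong learning (continual learning). Today's VLAs are frozen after deployment, but a truly intelligent robot must keep improving through experience. Lifelong learning, which means adapting to new tasks and environments without forgetting past learning (avoiding catastrophic forgetting), is the long-term vision for VLA.

Third, general embodied intelligence. A general-purpose policy in which a single model operates across diverse embodiments, such as robot arms, humanoids, autonomous vehicles, and drones, is the ultimate goal of VLA research. OXE [19] and HPT [96] have taken the first steps in this direction, and scaling cross-embodiment generalization is the key challenge.

Fourth, human-robot co-evolution. As VLA robots become deeply integrated into human life, a co-evolution in which humans and robots change each other will begin. This feedback loop, in which robots learn from human behavior and humans adjust how they interact to match robot capabilities, is the future that VLA research ultimately aims for.

VLA is more than a technical advance: it is building an answer to the fundamental question of whether machines can understand the physical world and act meaningfully within it. In the three years since its naming in 2023, the field has advanced at a remarkable pace, and that acceleration continues. The changes of the next three years will surpass those so far.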

Full VLA Taxonomy Tree

VLA (Vision-Language-Action)
├── By definition
│   ├── Narrow definition (original RT-2): VLM fine-tuning based
│   ├── Extended definition (Ma et al.): any V+L→A system
│   ├── Pure VLA (Zhong et al.): end-to-end integration
│   └── Direct control (Kawaharazuka et al.): directly generates control commands
│
├── By architecture (Liu & Shao [5])
│   ├── Monolithic
│   │   ├── Single-system: RT-2, OpenVLA
│   │   └── Dual-system
│   │       ├── Cascade: GR00T N1, π0
│   │       └── Parallel: (run concurrently, then combined)
│   └── Hierarchical
│       ├── Planner-Only: SayCan, Inner Monologue
│       └── Planner+Policy: π0.5
│
├── By action generation method (Zhong et al. [3])
│   ├── Autoregressive: RT-2, OpenVLA, Octo (AR mode)
│   ├── Diffusion: Diffusion Policy, CogACT, RDT-1B
│   │   ├── Flow Matching (variant): π0, π0-FAST
│   │   └── Discrete Diffusion: diffusion in a discrete token space
│   ├── Reinforcement-learning based: VLA-RL [68], RIPT-VLA [71], ConRFT [69]
│   └── Hybrid/special: HybridVLA [79], GR00T N1
│
├── By action token type (Chen et al. [7])
│   ├── Language Tokens: SayCan, SayTap
│   ├── Code Tokens: Code-as-Policies
│   ├── Affordance Tokens: VoxPoser [57], A3VLM [58]
│   ├── Trajectory Tokens: RT-Trajectory [62], TraceVLA
│   ├── Goal Tokens: SuSIE [59], 3D-VLA
│   ├── Latent Tokens: VQ-BeT [60], LAPA [61], UniVLA [80]
│   ├── Raw Action Tokens: RT-2, OpenVLA, FAST
│   └── Reasoning Tokens: CoT-VLA [55], SC-VLA [56]
│
├── By efficiency technique (Yu et al. [4])
│   ├── Quantization: BitVLA, SQIL
│   ├── Pruning: SmolVLA, FLOWER, DeeR-VLA
│   ├── Distillation: TinyVLA
│   ├── Token optimization: FAST, VLA-Cache, VOTE
│   ├── Efficient attention: KV-Efficient VLA, Long-VLA [73]
│   └── Efficient architecture: SARA-RT, MoLe-VLA
│
├── By training paradigm (Jin et al. [9])
│   ├── Phase 1: Internet pretraining (VLM)
│   ├── Phase 2: BC/SFT (robot demonstrations)
│   └── Phase 3: RL post-training
│       ├── Online RL: PPO (RIPT-VLA [71]), GRPO (VLA-RL [68])
│       ├── Offline-to-online RL: ConRFT [69]
│       └── Preference optimization: HAPO [84], GRAPE
│
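├── By application domain
│   ├── Tabletop manipulation: mainstream, abundant data
│   ├── Humanoids: high DoF, whole-body control
│   ├── Autonomous driving: evolved separately, highest safety requirements
│   ├── Drones/navigation: outdoor, real-time
│   ├── Medical/surgical: extreme precision, scarce data
│   └── Industrial/agriculture: repetitive work, robustness-focused
│
└── By benchmark
    ├── Manipulation: LIBERO, CALVIN, RLBench, Meta-World, VLABench
    ├── Autonomous driving: Bench2Drive, nuScenes, Reason2Drive, WorldBench
    └── Next-generation: RoboArena, RoboCasa365, WorldGym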

Motivation Chain: The Chain of Motivations for Deployment and Safety

Motivation Chain

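Limits of lab demos (work only in controlled environments; not deployable in the real world)

→ Efficiency research (model compression, quantization → runnable on edge devices)

→ Real-world deployment attempts (unexpected failure modes discovered)

→ Safety research begins (SafeVLA [75]: safety constraints internalized in training)

→ Fundamental limits of safety (VLA hallucination → possibility of physical accidents)

→ Need for formal verification emerges (still unsolved)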

Motivation Chain

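Limits of single benchmarks (LIBERO saturation: simple tasks effectively solved)

→ Need for composite benchmarks (long-horizon, multi-step, real-world variation)

→ Simulation-to-real gap (the sim-to-real gap persists)

→ Hybrid evaluation proposed (simulation + real world + human evaluation)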

Self-Check Questions: Section 9-10-11

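Q1: Explain the paradigm shift that the OXE dataset brought to the VLA field from the perspective of "data diversity."

A: Before OXE, each lab trained only on small-scale data (thousands to tens of thousands of episodes) collected with its own robots. OXE unified 1M+ episodes collected from 22 different robot platforms. The key finding is that data from different robots acts as "diversity" rather than "noise," preventing overfitting to a particular robot or environment and improving generalization. This is analogous to how multilingual training in NLP improves performance on individual languages.

Q2: Why is VLA "hallucination" fundamentally different from LLM hallucination?

A: An LLM hallucination is the generation of incorrect text, and its consequence is an informational error (stating something that is not true). A VLA hallucination is the generation of an incorrect action that is executed in the physical world, so its consequence can be a physical accident (collision, damage, injury). It is the difference between stating "a historical fact that does not exist" and "swinging an arm to grasp an object that does not exist." For this reason, safety is a fundamentally more serious problem for VLAs than for LLMs, and the need for formal verification is greater.

Q3: What are the biggest limitations of the current VLA benchmark ecosystem?

A: (1) No unified cross-benchmark standard: there is no standard benchmark comparable to ImageNet or SuperGLUE, so results from different simulators (LIBERO, CALVIN, RLBench) cannot be compared directly. (2) Sim-to-real gap: high performance in simulation does not guarantee performance in the real world. (3) Saturation: simple-task benchmarks have already reached 99% and have lost their discriminative power. (4) No evaluation of long-horizon, unstructured tasks: benchmarks that measure composite tasks of 30 minutes or more and the ability to cope with unexpected situations are lacking.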

Open Research Questions: Section 9-10-11

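Data gap: How can the gap of 5-6 orders of magnitude between robot data (OXE, 1M episodes) and internet data (trillions of tokens) be closed? Which strategy is most effective: video pretraining, simulation, or synthetic data?

Benchmark 2.0: After LIBERO saturation, what design principles should a next-generation benchmark follow in order to measure real-world generalization, long-horizon tasks, and safety at the same time?

Formal safety verification: Is formal verification of natural-language-conditioned neural network policies theoretically possible? If so, what mathematical framework is required?

The economics of VLA: When will the deployment cost of VLA-based robots (training, hardware, maintenance) become economically viable relative to conventional industrial robots?

General embodied intelligence: Is "General Embodied Intelligence," in which a single VLA model controls arms, humanoids, vehicles, and drones, an achievable goal, or is domain specialization inevitable?

References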

[1] Ma, Q. et al. (2024). A Survey on Vision-Language-Action Models for Embodied AI. arXiv:2405.14093. [arXiv]
[2] Kawaharazuka, K. et al. (2024). Real-World Robot Applications of Foundation Models: A Review. arXiv:2402.05741. [arXiv]
[3] Zhong, Z. et al. (2025). Pure Vision Language Action (VLA) Models: A Comprehensive Survey. arXiv:2509.19012. [arXiv]
[4] Yu, Z. et al. (2025). A Survey on Efficient Vision-Language-Action Models. arXiv:2510.24795. [arXiv]
[5] Liu, N. & Shao, R. et al. (2025). Large VLM-based VLA Models for Robotic Manipulation: A Survey. arXiv:2508.13073. [arXiv]
[6] Zhang, Y. et al. (2025). VLA Models: Concepts, Progress, Applications and Challenges. arXiv:2505.04769. [arXiv]
[7] Chen, Y. et al. (2025). A Survey on VLA Models: An Action Tokenization Perspective. arXiv:2507.01925. [arXiv]
[8] Xu, C. et al. (2025). An Anatomy of Vision-Language-Action Models. arXiv:2512.11362. [arXiv]
[9] Jin, A. et al. (2025). Parallels Between VLA Model Post-Training and Human Motor Learning. arXiv:2506.20966. [arXiv]
[10] Jiang, H. et al. (2025). A Survey on VLA Models for Autonomous Driving. arXiv:2506.24044. [arXiv]
[11] Brohan, A. et al. (2023). RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. arXiv:2307.15818. [arXiv]
[12] Brohan, A. et al. (2022). RT-1: Robotics Transformer for Real-World Control at Scale. arXiv:2212.06817. [arXiv]
[13] Reed, S. et al. (2022). A Generalist Agent (Gato). arXiv:2205.06175. [arXiv]
[14] Ahn, M. et al. (2022). Do As I Can, Not As I Say: Grounding Language in Robotic Affordances (SayCan). arXiv:2204.01691. [arXiv]
[15] Kim, M. et al. (2024). OpenVLA: An Open-Source Vision-Language-Action Model. arXiv:2406.09246. [arXiv]
[16] Black, K. et al. (2024). pi0: A Vision-Language-Action Flow Model for General Robot Control. arXiv:2410.24164. [arXiv]
[17] Chi, C. et al. (2023). Diffusion Policy: Visuomotor Policy Learning via Action Diffusion. arXiv:2303.04137. [arXiv]
[18] Driess, D. et al. (2023). PaLM-E: An Embodied Multimodal Language Model. arXiv:2303.03378. [arXiv]
[19] Open X-Embodiment Collaboration (2023). Open X-Embodiment: Robotic Learning Datasets and RT-X Models. arXiv:2310.08864. [arXiv]
[20] Pertsch, K. et al. (2025). FAST: Efficient Action Tokenization for Vision-Language-Action Models (pi0-FAST). arXiv:2501.09747. [arXiv]
[21] Bjorck, J. et al. (2025). GR00T N1: An Open Foundation Model for Generalist Humanoid Robots. arXiv:2503.14734. [arXiv]
[22] Huang, W. et al. (2023). Inner Monologue: Embodied Reasoning through Planning with Language Models. arXiv:2207.05608. [arXiv]
[23] Liu, H. et al. (2024). CogACT: A Foundational VLA Model with Cognitive-Inspired Action Chunking Transformer. arXiv:2411.19650. [arXiv]
[24] Liu, H. et al. (2024). RDT-1B: A Diffusion Foundation Model for Bimanual Manipulation. arXiv:2410.07864. [arXiv]
[25] Octo Model Team (2024). Octo: An Open-Source Generalist Robot Policy. arXiv:2405.12213. [arXiv]
[26] Shridhar, M. et al. (2021). CLIPort: What and Where Pathways for Robotic Manipulation. arXiv:2109.12098. [arXiv]
[27] Radford, A. et al. (2021). Learning Transferable Visual Models From Natural Language Supervision (CLIP). arXiv:2103.00020. [arXiv]
[28] Dosovitskiy, A. et al. (2020). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ViT). arXiv:2010.11929. [arXiv]
[29] Brown, T. et al. (2020). Language Models are Few-Shot Learners (GPT-3). arXiv:2005.14165. [arXiv]
[30] Oquab, M. et al. (2023). DINOv2: Learning Robust Visual Features without Supervision. arXiv:2304.07193. [arXiv]
[31] Physical Intelligence (2025). pi0.5: a Vision-Language-Action Model with Open-World Generalization. arXiv:2503.01222. [arXiv]
[32] Shukor, M. et al. (2025). SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics. arXiv:2506.01844. [arXiv]
[33] Ma, Y. et al. (2025). BitVLA: 1-bit Vision-Language-Action Models. arXiv:2505.07256. [arXiv]
[34] Wu, J. et al. (2025). TinyVLA: Towards Fast and Data-Efficient VLA. arXiv:2409.12514. [arXiv]
[35] Yue, W. et al. (2024). DeeR-VLA: Dynamic Inference of Multimodal LLMs for Efficient Robot Execution. arXiv:2411.02359. [arXiv]
[36] Liang, J. et al. (2023). Code as Policies: Language Model Programs for Embodied Control. arXiv:2209.07753. [arXiv]
[37] Zhen, H. et al. (2024). 3D-VLA: A 3D Vision-Language-Action Generative World Model. arXiv:2403.09631. [arXiv]
[38] Huang, W. et al. (2022). Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents. arXiv:2201.07207. [arXiv]
[39] Xu, Z. et al. (2024). SpatialVLA: Exploring Spatial Representations for VLA Models. arXiv:2501.15830. [arXiv]
[40] Wen, B. et al. (2024). TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for VLA. arXiv:2412.10345. [arXiv]
[41] Hu, T. et al. (2025). Vision-Language-Action Models for Autonomous Driving: Past, Present, and Future. arXiv:2512.16760. [arXiv]
[42] Edge Survey (2026). Embodied Foundation Models at the Edge: A Survey of Deployment Constraints and Mitigation Strategies. arXiv:2603.16952. [arXiv]
[43] Guan, W. et al. (2025). Efficient Vision-Language-Action Models for Embodied Manipulation: A Systematic Survey. arXiv:2510.17111. [arXiv]
[44] Large Model Embodied AI (2025). Large Model Empowered Embodied AI: A Survey on Decision-Making and Embodied Learning. arXiv:2508.10399. [arXiv]
[45] Jiang, Y. et al. (2022). VIMA: General Robot Manipulation with Multimodal Prompts. arXiv:2210.03094. [arXiv]
[46] Hwang, J. et al. (2024). EMMA: End-to-End Multimodal Model for Autonomous Driving. arXiv:2410.23262. [arXiv]
[47] Fu, H. et al. (2025). ORION: A Holistic End-to-End Autonomous Driving Framework. arXiv:2503.19755. [arXiv]
[48] Zhou, X. et al. (2025). AutoVLA: Autonomous Driving with Adaptive Reasoning and RL Fine-Tuning. arXiv:2506.13757. [arXiv]
[49] Yang, Z. et al. (2025). DriveMoE: Mixture-of-Experts for End-to-End Autonomous Driving. arXiv:2505.16278. [arXiv]
[50] Doshi, R. et al. (2024). CrossFormer: Scaling Cross-Embodied Learning. arXiv:2408.11812. [arXiv]
[51] Wu, H. et al. (2023). GR-1: Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation. arXiv:2312.13139. [arXiv]
[52] Tan, Z. et al. (2025). FlashVLA: Token-Aware Compression and Action Reuse for Efficient VLA Inference. arXiv:2505.21200. [arXiv]
[53] Zheng, K. et al. (2025). X-VLA: Cross-Embodiment Vision-Language-Action Model. arXiv:2510.10274. [arXiv]
[54] Du, Y. et al. (2025). HiMoE-VLA: Hierarchical Mixture-of-Experts for Generalist VLA Policies. arXiv:2512.05693. [arXiv]
[55] Zhao, Y. et al. (2025). CoT-VLA: Visual Chain-of-Thought Reasoning for VLA Models. arXiv:2503.22020. [arXiv]
[56] Li, X. et al. (2024). SC-VLA: A Self-Correcting VLA Model for Fast and Slow System Manipulation. arXiv:2405.17418. [arXiv]
[57] Huang, W. et al. (2023). VoxPoser: Composable 3D Value Maps for Robotic Manipulation. arXiv:2307.05973. [arXiv]
[58] Huang, S. et al. (2024). A3VLM: Actionable Articulation-Aware Vision Language Model. arXiv:2406.07549. [arXiv]
[59] Black, K. et al. (2023). SuSIE: Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models. arXiv:2310.10639. [arXiv]
[60] Lee, S. et al. (2024). VQ-BeT: Behavior Generation with Latent Actions. arXiv:2403.03181. [arXiv]
[61] Ye, D. et al. (2024). LAPA: Latent Action Pretraining from Videos. arXiv:2410.11758. [arXiv]
[62] Gu, Y. et al. (2023). RT-Trajectory: Robotic Task Generalization via Hindsight Trajectory Sketches. arXiv:2311.01977. [arXiv]
[63] Williams, R. et al. (2025). Lite VLA: Efficient VLA Control on CPU-Bound Edge Robots. arXiv:2511.05642. [arXiv]
[64] Zhang, H. et al. (2025). DreamVLA: A VLA Model Dreamed with Comprehensive World Knowledge. arXiv:2507.04447. [arXiv]
[65] Wen, C. et al. (2025). dVLA: Diffusion VLA with Multimodal Chain-of-Thought. arXiv:2509.25681. [arXiv]
[66] Chen, Z. et al. (2025). TGRPO: Fine-tuning VLA via Trajectory-wise Group Relative Policy Optimization. arXiv:2506.08440. [arXiv]
[67] Huang, J. et al. (2025). ThinkAct: VLA Reasoning via Reinforced Visual Latent Planning. arXiv:2507.16815. [arXiv]
[68] Lu, Y. et al. (2025). VLA-RL: Towards Masterful Robotic Manipulation with Scalable RL. arXiv:2505.18719. [arXiv]
[69] Chen, R. et al. (2025). ConRFT: A Reinforced Fine-tuning Method for VLA via Consistency Policy. arXiv:2502.05450. [arXiv]
[70] Li, Q. et al. (2025). SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning. arXiv:2509.09674. [arXiv]
[71] Tan, W. et al. (2025). RIPT-VLA: Interactive Post-Training for VLA Models. arXiv:2505.17016. [arXiv]
[72] Miao, L. et al. (2025). FedVLA: Federated VLA Learning with Dual Gating MoE. arXiv:2508.02190. [arXiv]
[73] Fan, Y. et al. (2025). Long-VLA: Unleashing Long-Horizon Capability of VLA for Robot Manipulation. arXiv:2508.19958. [arXiv]
[74] Koo, J. et al. (2025). RetoVLA: Reusing Register Tokens for Spatial Reasoning in VLA. arXiv:2509.21243. [arXiv]
[75] Zhang, S. et al. (2025). SafeVLA: Towards Safety Alignment of VLA via Constrained Learning. arXiv:2503.03480. [arXiv]
[76] Cen, J. et al. (2025). WorldVLA: Towards Autoregressive Action World Model. arXiv:2506.21539. [arXiv]
[77] Li, Z. et al. (2025). PointVLA: Injecting the 3D World into VLA Models. arXiv:2503.07511. [arXiv]
[78] Yu, F. et al. (2025). ForceVLA: Enhancing VLA with Force-aware MoE for Contact-rich Manipulation. arXiv:2505.22159. [arXiv]
[79] Liu, J. et al. (2025). HybridVLA: Collaborative Diffusion and Autoregression in a Unified VLA Model. arXiv:2503.10631. [arXiv]
[80] Bu, Z. et al. (2025). UniVLA: Learning to Act Anywhere with Task-centric Latent Actions. arXiv:2505.06111. [arXiv]
[81] Deng, Y. et al. (2025). GraspVLA: Grasping Foundation Model Pre-trained on Billion-scale Synthetic Data. arXiv:2505.03233. [arXiv]
[83] Tian, R. et al. (2023). RAPL: What Matters to You? Visual Representation Alignment for Robot Learning. arXiv:2310.07932. [arXiv]
[84] Xia, Z. et al. (2025). HAPO: Human-assisted Robotic Policy Refinement via Action Preference Optimization. arXiv:2506.07127. [arXiv]
[85] Patel, D. et al. (2025). IKER: Real-to-Sim-to-Real with VLM-Generated Iterative Keypoint Rewards. arXiv:2502.08643. [arXiv]
[86] Xu, J. et al. (2025). KV-Efficient VLA: Speed up VLMs with RNN-Gated Chunked KV Cache. arXiv:2509.21354. [arXiv]
[87] Chen, X. et al. (2023). GenAug: Retargeting Behaviors to Unseen Situations via Generative Augmentation. arXiv:2302.06671. [arXiv]
[88] Mandi, Z. et al. (2022). CACTI: A Framework for Scalable Multi-Task Multi-Scene Visual Imitation Learning. arXiv:2212.05711. [arXiv]
[89] Yu, T. et al. (2023). ROSIE: Scaling Robot Learning with Semantically Imagined Experience. arXiv:2302.11550. [arXiv]
[90] Xiao, T. et al. (2022). DIAL: Robotic Skill Acquisition via Instruction Augmentation with VLMs. arXiv:2211.11736. [arXiv]
[91] Tian, X. et al. (2024). DriveVLM: The Convergence of Autonomous Driving and Large VLMs. arXiv:2402.12289. [arXiv]
[92] Zawalski, K. et al. (2024). ECoT: Robotic Control via Embodied Chain-of-Thought Reasoning. arXiv:2407.08693. [arXiv]
[93] Du, Y. et al. (2023). UniPi: Learning Universal Policies via Text-Guided Video Generation. arXiv:2302.00111. [arXiv]
[94] Nematollahi, I. et al. (2025). LUMOS: Language-Conditioned Imitation Learning with World Models. arXiv:2503.10370. [arXiv]
[95] Chi, B. et al. (2025). MinD: Learning A Dual-System World Model for Real-Time Planning. arXiv:2506.18897. [arXiv]
[96] Wang, L. et al. (2024). HPT: Scaling Proprioceptive-Visual Learning with Heterogeneous Pre-trained Transformers. arXiv:2409.20537. [arXiv]
[97] Ross, S. et al. (2011). DAgger: A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning. arXiv:1011.0686. [arXiv]
[98] Hancock, W. et al. (2025). Actions as Language: Fine-Tuning VLMs into VLAs Without Catastrophic Forgetting. arXiv:2509.22195. [arXiv]
[99] GraspVerse (2025). Large-scale Synthetic Grasp Data Generation. (14개 서베이 외 출처)
[100] Cheang, C. et al. (2024). GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation. arXiv:2410.06158. [arXiv]
[101] Yang, S. et al. (2023). UniSim: Learning Interactive Real-World Simulators. arXiv:2310.06114. [arXiv]
[102] Singh, I. et al. (2023). ProgPrompt: Generating Situated Robot Task Plans using Large Language Models. arXiv:2209.11302. [arXiv]
[103] Wang, G. et al. (2023). Voyager: An Open-Ended Embodied Agent with Large Language Models. arXiv:2305.16291. [arXiv]
[104] Vemprala, S. et al. (2024). ChatGPT for Robotics: Design Principles and Model Abilities. arXiv:2306.17582. [arXiv]
[105] Nasiriany, S. et al. (2024). RT-Affordance: Affordances are Versatile Intermediate Representations for Robot Learning. arXiv:2411.02704. [arXiv]
[106] Xu, Z. et al. (2025). A0: An Autonomous Agent with Adaptive Action Generation. arXiv:2504.12636. [arXiv]
[107] Wang, H. et al. (2025). VQ-VLA: Vector Quantized Vision-Language-Action Model. arXiv:2507.01016. [arXiv]
[108] Liu, B. et al. (2025). Embodied-R1: Incentivizing Reasoning in Embodied VLA Models. arXiv:2508.13998. [arXiv]
[109] Wang, Z. et al. (2025). GRAPE: Generalizing Robot Policy via Preference Alignment. arXiv:2411.19309. [arXiv]
[110] Bousmalis, K. et al. (2024). RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation. arXiv:2306.11706. [arXiv]
[111] Shi, L. et al. (2025). ReVLA: Reverting Visual Domain from LLM to VLA. arXiv:2409.15250. [arXiv]
[112] Zheng, Z. et al. (2025). UniAct: Universal Action Representation for Robotic Learning. arXiv:2501.10105. [arXiv]
[113] Li, J. et al. (2025). BridgeVLA: Bridging the Gap Between VLA and Low-Level Robot Control. arXiv:2506.07961. [arXiv]
[114] Park, D. et al. (2025). SQIL: Sub-4-bit Quantization of Large VLAs via Self-play Fine-tuning. arXiv:2505.15304. [arXiv]
[115] Heo, J. et al. (2025). QAIL: Quantization-Aware Imitation Learning for Resource-Efficient VLA. arXiv:2412.01034. [arXiv]
[116] Li, S. et al. (2025). SQAP-VLA: Stochastic Quantization with Adaptive Precision for VLA. arXiv:2509.09090. [arXiv]
[117] Qu, L. et al. (2025). MoLe-VLA: Mixture of Lightweight Experts for VLA. arXiv:2503.20384. [arXiv]
[118] Niu, X. et al. (2025). EfficientVLA: An Efficient Vision-Language-Action Model. arXiv:2506.10100. [arXiv]
[119] Cheng, Y. et al. (2025). FLOWER: Flow-based World Model for Efficient Robot Learning. arXiv:2509.04996. [arXiv]
[120] Zhao, Y. et al. (2025). RLRC: Reinforcement Learning with Reasoning Consistency for VLA. arXiv:2506.17639. [arXiv]
[121] Wen, Z. et al. (2025). CEED-VLA: Confidence-Enhanced Early-Exit Decoding for VLA. arXiv:2506.13725. [arXiv]
[122] Julg, M. et al. (2025). RPD: Robot Policy Distillation from Vision-Language-Action Models. arXiv:2503.05833. [arXiv]
[123] Shen, W. et al. (2025). SP-VLA: Spatial-aware Parallel Decoding VLA. arXiv:2506.12723. [arXiv]
[124] Xu, Y. et al. (2025). VLA-Cache: Accelerating VLA Inference via KV Cache Compression. arXiv:2502.02175. [arXiv]
[125] Lin, X. et al. (2025). CronusVLA: Efficient VLA with Temporal Cronus Attention. arXiv:2506.19816. [arXiv]
[126] Shridhar, M. et al. (2024). SARA-RT: Scaling Up Robot Action with Linear Attention. arXiv:2312.01990. [arXiv]
[127] Liu, Y. et al. (2024). RoboMamba: Efficient Vision-Language-Action Model with Mamba SSM. arXiv:2406.04339. [arXiv]
[128] Xu, J. et al. (2025). GeRM: A Generalist Robotic Model via Foundation Models. arXiv:2403.13358. [arXiv]
[129] Chen, Q. et al. (2025). PD-VLA: Parallel Decoding for Efficient VLA Inference. arXiv:2503.02310. [arXiv]
[130] Wang, X. et al. (2025). Spec-VLA: Speculative Decoding for Accelerating VLA Models. arXiv:2507.22424. [arXiv]
[131] Yang, Z. et al. (2025). EgoVLA: Egocentric Vision-Language-Action Model. arXiv:2507.12440. [arXiv]
[132] Hung, Y. et al. (2025). NORA: Normalizing Flow-based Robot Action Generation. arXiv:2504.19854. [arXiv]
[133] Budzianowski, P. et al. (2025). EdgeVLA: Efficient VLA Deployment on Edge Devices. arXiv:2507.14049. [arXiv]
[134] Kim, D. et al. (2025). DiVLA-2B: Diffusion VLA at 2B Scale. arXiv:2412.03293. [arXiv]
[135] Park, J. et al. (2026). HyperVLA: Dynamic Policy Generation via Hypernetworks. arXiv:2510.04898. [arXiv]
[136] Liu, Q. et al. (2026). AutoQVLA: Automated Quantization for VLA Models. arXiv:2602.03782. [arXiv]
[137] Zhang, S. et al. (2025). Humanoid-VLA: Vision-Language-Action for Humanoid Robots. arXiv:2502.14795. [arXiv]
[138] Li, W. et al. (2025). Being-H0: Humanoid Robot Foundation Model. arXiv:2507.15597. [arXiv]
[139] Chen, X. et al. (2025). FP3: Foundation Policy with Predictive Planning. arXiv:2503.08950. [arXiv]
[140] Li, J. et al. (2025). SafeAuto: Safety-Aware Autonomous Driving with VLA. arXiv:2503.00211. [arXiv]
[141] Wei, H. et al. (2025). LangCoop V2V: Language-based Cooperative Driving. arXiv:2504.13406. [arXiv]
[142] Wang, D. et al. (2025). CognitiveDrone: VLA for Cognitive Drone Control. arXiv:2503.01378. [arXiv]
[143] Zhao, R. et al. (2025). RaceVLA: Vision-Language-Action for Autonomous Racing. arXiv:2503.02572. [arXiv]
[144] Cheng, H. et al. (2025). NaVILA: Navigation with VLA. arXiv:2412.04453. [arXiv]
[145] Zhang, J. et al. (2024). Uni-NaVid: Unified Navigation with Video Diffusion. arXiv:2412.06224. [arXiv]
[146] Liu, M. et al. (2025). Mobility VLA: VLA for Mobile Robot Navigation. arXiv:2407.07775. [arXiv]
[147] Li, Z. et al. (2024). RoboNurse-VLA: Robotic Nursing Assistant with VLA. arXiv:2409.19590. [arXiv]
[148] Chen, J. et al. (2025). ObjectVLA: Object-Centric VLA Model. arXiv:2502.19250. [arXiv]
[149] Lin, K. et al. (2024). ShowUI: Vision-Language-Action Models for GUI Automation. arXiv:2411.17465. [arXiv]
[150] Li, H. et al. (2025). RoboArena: A Benchmark Arena for VLA Evaluation. arXiv:2506.18123. [arXiv]
[151] Nasiriany, S. et al. (2025). RoboCasa365: Large-Scale Robot Simulation Benchmark. arXiv:2603.04356. [arXiv]
[152] Zhang, W. et al. (2025). WorldGym: World Model Training Environments. arXiv:2506.00613. [arXiv]
[153] Huang, Z. et al. (2025). TactileVLA: Tactile-Enhanced VLA for Dexterous Manipulation. arXiv:2507.09160. [arXiv]
[154] Wang, Y. et al. (2025). OmniVTLA: Omni Vision-Tactile-Language-Action Model. arXiv:2508.08706. [arXiv]
[155] Tang, Y. et al. (2023). SayTap: Language to Quadrupedal Locomotion. arXiv:2306.07580. [arXiv]
[156] Wang, L. et al. (2024). HPT: Scaling Proprioceptive-Visual Learning with Heterogeneous Pre-trained Transformers. arXiv:2409.20537. [arXiv]
[157] Physical Intelligence (2025). π^*_0.6: a VLA That Learns From Experience. arXiv:2511.14759. [arXiv]
[158] Physical Intelligence (2026). MEM: Multi-Scale Embodied Memory for Vision Language Action Models. arXiv:2603.03596. [arXiv]

VLA Unified Survey

Vision-Language-Action Models for Robots That See, Understand, and Act
A Comprehensive Synthesis

Atomic-level integration of 14 survey papers × VLA textbook slides
DGIST APRL Lab · Prof. Giseop Kim · April 2026

Table of Contents & Structure

This document proceeds from definition → history → taxonomy → architecture → tokenization → training → efficiency → applications → evaluation → outlook. This order follows a natural intellectual journey: understand what VLA is, dissect how it works, analyze how it learns, explore where it is deployed, and anticipate what remains.

1. Intro→ 2. Timeline→ 3. Taxonomy→ 4. Architecture→ 5. Tokenization→ 6. Training→ 7. Efficiency→ 8. Applications→ 9. Evaluation→ 10. Outlook→ Ref

Sec 1

Introduction — What is VLA?

Three definitions (RT-2/Ma/Zhong), comparison of 14 surveys, scope and methodology

Why? Readers must first grasp VLA's scope to contextualize everything that follows.

Sec 2

Timeline of VLA Evolution (2017–2026)

Phase 0-3: CLIP/ViT → Gato/SayCan → RT-2/Diffusion Policy → OpenVLA/π0 → GR00T N1/SmolVLA

Why? Shows each advance as a causal consequence of its predecessors, grounding the taxonomy (Sec 3).

Sec 3

Unified Taxonomy — 10 Surveys into One

5 axes (architecture/action gen./anatomy/function/post-training) + meta-taxonomy mapping

Why? Proves the rival taxonomies are complementary projections, not competing frameworks.

Sec 4

Architecture Deep Dive

Perception (SigLIP+DINOv2) → Brain (VLM 4 stages) → Action (Diffusion/Flow/FAST) → Dual System

Why? Taxonomy (Sec 3) asks "what exists"; anatomy (Sec 4) asks "what happens inside."

Sec 5

Action Tokenization — The Core Design Decision

8 token types, control frequency spectrum (1 Hz→120 Hz), tokenization-performance mechanism

Why? As Chen et al. [7] demonstrated, tokenization is the primary differentiator among VLA models.

Sec 6

Evolution of Training Paradigms

Pre-training (internet→robot), BC limits, RL post-training (GRPO/DPO/PPO), Newell's theory, lifelong learning

Why? Architecture and tokenization (Sec 4-5) are "structure"; training (Sec 6) is the process of "breathing intelligence into that structure."

Sec 7

Efficiency — The Deployment Imperative

Quantization/pruning/distillation, 12-model comparison (55B→450M), 6 Pareto insights

Why? Bridges the gap between academic performance (Sec 4-6) and real-world deployment (Sec 8).

Sec 8

Application Domains — Where VLA Meets the World

Manipulation/Humanoid/AD/Drone/Medical/Agriculture/GUI — 7-domain comparison

Why? Shows where technology (Sec 3-7) is deployed, and how domain requirements feed back into design.

Sec 9

Datasets, Benchmarks, Simulators

OXE/BridgeData/DROID, LIBERO/CALVIN/Bench2Drive benchmarks, evaluation protocol limitations

Why? Provides the "report card" for models and applications; evaluation gaps directly shape research priorities.

Sec 10-11

Open Problems, Cross-Survey Insights, Frontier Case Study, Conclusion

11 core challenges + 10 emergent cross-survey insights + π series frontier case study (π^*_0.6 & MEM) + 4 future axes

Why? After identifying open problems, the latest breakthroughs show how those challenges are being attacked—completing the "theory→evidence" narrative arc.

Sub-item selection criteria: Each section's sub-items were selected based on three criteria: (1) topics covered by 3+ of the 14 surveys, (2) topics covered by a single survey but of high technical importance, and (3) topics derivable only from cross-survey analysis.

Comparative Analysis of 14 Reference Surveys

Legend: ● In-depth ◐ Partial/indirect ○ Not covered | Columns correspond to the 10 core topics of this unified survey.

#	Survey	arXiv	Date	Architecture	Taxonomy	Action Token.	Training Paradigm	RL Post-training	Efficiency	Manipulation	Auton. Driving	Benchmarks	Future Outlook	Specialization & Unique Value
[1]	Ma, Q. et al.	2405.14093	2024.05	◐	◐	○	◐	○	○	◐	◐	◐	●	Bird's-eye view of embodied AI. Not VLA-specific but the first comprehensive survey to position VLA within the broader LLM/VLM/embodied foundation model ecosystem. As one of the earliest surveys in this set (2024.05; only [2] predates it), it centers on RT-2/PaLM-E and does not cover 2025 models.
[2]	Kawaharazuka, K. et al.	2402.05741	2024.02	◐	○	○	●	○	○	●	○	●	◐	Real-world deployment experience. The only survey focused on practical insights from deploying robots outside the lab — data collection know-how, failure mode analysis, and what actually breaks in practice, rather than architecture diversity.
[3]	Zhong, Z. et al.	2509.19012	2025.09	●	●	●	●	◐	◐	●	○	●	●	Most systematic Pure VLA taxonomy. Restricts scope to end-to-end models that output actions from a single image+language pipeline, providing a rigorous 3-axis (architecture/training/data) classification. Deliberately excludes modular pipelines like SayCan.
[4]	Yu, Z. et al.	2510.24795	2025.10	●	◐	◐	◐	○	●	◐	○	●	◐	Deepest analysis of efficiency and compression. The only survey to quantitatively compare quantization (PTQ/QAT), pruning, distillation, and token caching across 12 models. Systematically addresses whether 55B→450M parameter reduction can preserve performance.
[5]	Liu, N. & Shao, R. et al.	2508.13073	2025.08	●	●	●	●	◐	◐	●	○	●	●	Deepest manipulation-domain benchmark analysis. Provides the most detailed analysis of LIBERO/CALVIN/SimplerEnv benchmarks centered on grasping, bimanual, and dexterous manipulation. Clearly articulates the monolithic vs. hierarchical architecture dichotomy. Intentionally excludes navigation and autonomous driving.
[6]	Zhang, Y. et al.	2505.04769	2025.05	●	◐	◐	●	○	○	●	◐	●	●	Broad conceptual overview of VLA. Prioritizes breadth of concept coverage and application domain listing over technical depth. Optimal for newcomers seeking a rapid landscape overview, covering non-mainstream domains like medical, agriculture, and GUI agents.
[7]	Chen, Y. et al.	2507.01925	2025.07	◐	◐	●	◐	○	○	◐	○	◐	○	Deepest specialized analysis of action tokenization. Systematically classifies 8 token types (language/code/affordance/trajectory/goal/latent/raw/reasoning) and deeply dissects discrete vs. continuous tradeoffs, frequency spectrum, and Action Chunking. Covers other architectural modules only minimally.
[8]	Xu, C. et al.	2512.11362	2025.12	●	●	●	●	○	◐	●	◐	●	●	Unique "anatomical" analysis of VLA. The only survey to decompose VLA using an organ metaphor: perception (eyes) → brain (VLM) → action (hands). As one of the most recent surveys (2025.12), it covers GR00T N1 and π0.5, but barely addresses RL post-training.
[9]	Jin, A. et al.	2506.20966	2025.06	◐	○	◐	●	●	○	◐	○	◐	●	Frontier specialist in VLA+RL integration. The most detailed analysis of RL post-training for overcoming BC limitations — PPO/GRPO/DPO/ConRFT [69], online/offline RL, reward design, and preference optimization. Weak on the pre-training stage.
[10]	Jiang, H. et al.	2506.24044	2025.06	●	●	◐	●	◐	◐	○	●	●	●	Most detailed specialist on AD-VLA. Systematizes four generations of autonomous driving VLA evolution (EMMA/ORION [47]/DriveMoE [49]/AutoVLA [48]) and deeply analyzes safety, V2V cooperation, and simulators (CARLA/Bench2Drive). Completely excludes manipulation and navigation domains.
[41]	Hu, T. et al.	2512.16760	2025.12	●	●	◐	◐	○	◐	○	●	●	●	Granular AD-VLA taxonomy. End-to-End VLA (textual/numerical action) × Dual-System VLA (explicit guidance/implicit transfer) 2×2 classification; proposes WorldBench unified evaluation platform.
[42]	Edge Survey	2603.16952	2026.03	◐	○	○	○	○	●	◐	◐	◐	●	Systems-level edge deployment analysis. Identifies the "Deployment Gauntlet" of 8 coupled constraints; VLA = memory-bandwidth bottleneck, diffusion = compute-latency bottleneck.
[43]	Guan, W. et al.	2510.17111	2025.10	◐	◐	◐	◐	○	●	●	○	◐	◐	4-dimensional efficiency taxonomy for manipulation. Architecture / perception / action generation / learning-inference as four independent efficiency axes, complementing Yu et al. [4]'s compression focus.
[44]	Large Model Embodied AI	2508.10399	2025.08	●	◐	○	●	◐	○	◐	◐	◐	●	Decision-making framework for large-model embodied AI. Hierarchical vs end-to-end decision paradigm dichotomy; world model as a third axis bridging both. Views VLA from decision-making perspective.

Section 1: Introduction — What Is a VLA?

1.1 Defining VLA: Three Perspectives

A Vision-Language-Action (VLA) model is a unified neural network that enables a robot to observe a scene through its cameras, understand natural-language instructions, and directly generate physical actions. However, the term "VLA" has yet to reach full consensus within the research community. Three distinct definitions currently coexist, and understanding the precise scope and meaning of each is the first gateway to navigating this field.

Narrow definition: the original meaning coined by RT-2 [11]. In 2023, Google DeepMind's RT-2 [11] paper was the first to use the term "VLA," and it carried a precise technical prescription: a model that takes a large-scale pretrained Vision-Language Model (VLM) and directly fine-tunes it for robot action prediction. RT-2 [11] took existing VLMs — PaLI-X (55B) and PaLM-E [18] (12B) — represented robot actions as text tokens, and extended the VLM's output vocabulary with action tokens. Under this definition, the essence of a VLA is "transferring internet-scale visual-language knowledge to robot actions," and the presence of a VLM backbone is a necessary condition.

Broad definition: the inclusive category of Ma et al. [1] (2024). The survey by Ma et al. [1], which offers a bird's-eye view of embodied AI as a whole, defines VLA far more expansively. For them, VLA encompasses any model that takes Vision and Language inputs and produces Actions. Under this definition, modular systems such as SayCan [14] — where an LLM handles only high-level planning while a separate policy manages low-level control — are included, as are models like RT-1 [12] that use their own architectures without a VLM backbone, and even systems like Code-as-Policy that generate code as output. This expansive definition is useful for mapping the full landscape of the field, but has been criticized for diluting the technical specificity of the term "VLA."

Pure definition: "Pure VLA" proposed by Zhong et al. [3] (2025). The most recent classification, proposed by Zhong et al. [3] to resolve the tension between the two definitions above, introduces the concept of "Pure VLA." A Pure VLA integrates perception, language understanding, and action generation within a single end-to-end sequence-modeling framework. The three key criteria are: (1) both vision and language must be used as inputs; (2) actions must be the direct output of the model (without passing through a separate low-level controller); and (3) the entire pipeline must be integrated into a single trainable model. By these criteria, SayCan [14] (modular structure) and Code-as-Policy (code output) are not VLAs, whereas models such as RT-2 [11], OpenVLA [15], and π0 [16] qualify as Pure VLAs. Zhong et al. [3] further subdivide Pure VLAs into four categories: (1) autoregressive VLAs (RT-2 [11], OpenVLA [15]), (2) diffusion VLAs (π0 [16], CogACT [23]), (3) reinforcement learning-based fine-tuning, and (4) hybrid and specialized methods. This classification scheme jointly considers both the action decoder's generation mechanism and the learning paradigm.
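The three Pure VLA criteria can be read as a simple conjunction. The sketch below is only an illustration of that reading: the `SystemProfile` type, its field names, and the per-system flags are assumptions made here for the example (they paraphrase the verdicts stated above and are not taken from Zhong et al. [3]).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemProfile:
    """Hypothetical summary of a system against the three criteria."""
    name: str
    uses_vision_and_language: bool   # criterion (1): both modalities are inputs
    outputs_actions_directly: bool   # criterion (2): no separate low-level controller
    single_trainable_model: bool     # criterion (3): one end-to-end trainable model


def is_pure_vla(profile: SystemProfile) -> bool:
    """A system counts as a Pure VLA only if all three criteria hold."""
    return (
        profile.uses_vision_and_language
        and profile.outputs_actions_directly
        and profile.single_trainable_model
    )


# Flags are this example's reading of the text above, not measured properties.
systems = [
    SystemProfile("RT-2", True, True, True),
    SystemProfile("OpenVLA", True, True, True),
    SystemProfile("SayCan", True, False, False),          # modular: LLM planner + separate policies
    SystemProfile("Code-as-Policy", True, False, False),  # emits code, not actions
]

verdicts = {s.name: is_pure_vla(s) for s in systems}
```

Failing any single criterion is enough to place a system outside the Pure VLA boundary, which is why the modular and code-generating systems are excluded even though they consume vision and language.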

Beyond these three definitions, Kawaharazuka et al. [2] proposed their own boundary criterion: only systems that "take visual observations and natural-language instructions as mandatory inputs and directly generate control commands" qualify as VLAs, while "high-level policies that select from a pre-defined skill index" are explicitly excluded. This aligns with Zhong et al. [3]'s Pure VLA definition in placing SayCan [14]-style skill-selection systems outside the VLA boundary, but differs in that it uses "direct control command generation" rather than the presence of a VLM backbone as the key criterion.

This document acknowledges all four definitions, but in practice centers on Zhong et al. [3]'s Pure VLA definition. Nonetheless, modular systems (SayCan [14], Inner Monologue [22]) and hierarchical architectures (π0.5 [31], GR00T N1 [21]) are treated as important components of the broader VLA ecosystem.

1.2 Why VLA: The Essence of the Paradigm Shift

The traditional architecture of robotics has rested for fifty years on the tripartite "sense-plan-act pipeline": a perception module processes sensor data, a planning module searches for a path to the goal, and a control module generates joint commands. Each module is designed independently, and inter-module interfaces communicate through hand-crafted representations (e.g., object poses, grid maps, joint trajectories).

This modular structure offered strengths in mathematical rigor and debuggability, but it carried three fundamental bottlenecks. First, the representation bottleneck: information transmitted between modules is constrained by the expressivity of hand-crafted representations. Executing an instruction such as "put the blue plate next to the red mug in the sink" requires the perception module to detect the objects, the language module to parse the instruction, the grasping module to compute the grasp pose, and the motion planner to generate a collision-free path. Information is lost at each interface, and a failure in any one module collapses the entire pipeline. Second, the generalization bottleneck: because each module is independently tuned to specific environments or objects, adapting to a new environment or object requires re-engineering the entire pipeline. Third, the knowledge utilization bottleneck: although the internet contains billions of images and trillions of words of text, traditional pipelines have no pathway to exploit this large-scale prior knowledge.

VLA addresses all three bottlenecks simultaneously. By connecting sensor inputs to action outputs as a single differentiable function, it eliminates the representation bottleneck; and by inheriting the pretrained knowledge of internet-scale VLMs, it resolves the generalization and knowledge-utilization bottlenecks at once. The robot's process of "seeing (Vision), understanding (Language), and acting (Action)" takes place end-to-end within a single neural network.

To frame this transition with an analogy: if traditional robotics was an "international conference communicating through multiple interpreters," VLA is a "one-on-one conversation in one's native language." The errors and delays of intermediate translation disappear, and context and nuance are conveyed intact.

1.3 The Limitations of Fourteen Surveys and the Purpose of This Document

From the second half of 2024 through the first half of 2025, survey papers on VLA were published at an explosive rate. The fourteen core surveys referenced in this document are as follows:

Survey	Core Perspective	Unique Strength	Main Limitation
Ma et al. [1] (2024)	Bird's-eye view of embodied AI	Situates VLA within the large foundation-model ecosystem	Lacks technical depth on VLA itself
Kawaharazuka et al. [2] (2025)	Real-world deployment	Practical insights drawn from deployment experience; 7-category architecture taxonomy (VLM+Discrete, VLM+Diffusion, VLM+Flow Matching, etc.)	Weak analysis of learning paradigms (pretraining strategies, RL post-training)
Zhong et al. [3] (2025)	Pure VLA classification	Most systematic VLA classification framework	Excludes non-Pure VLA families
Yu et al. [4] (2025)	Efficiency and lightweighting	In-depth analysis of inference cost, quantization, and caching	Does not cover the full training paradigm
Liu & Shao [5] (2025)	Manipulation	Detailed benchmark analysis focused on manipulation	Excludes navigation, autonomous driving, etc.
Zhang et al. [6] (2025)	Concepts and applications	Broad conceptual coverage and extensive application catalog	Prioritizes overview over technical depth
Chen et al. [7] (2025)	Action tokenization	Most in-depth analysis of tokenization techniques	Omits architectural elements beyond tokenization
Xu et al. [8] (2025)	VLA anatomy	Module-by-module dissection of input–processing–output	Does not cover post-training
Jin et al. [9] (2025)	RL post-training	Latest trends in VLA+RL integration	Weak analysis of the pretraining stage
Jiang et al. [10] (2025)	Autonomous driving VLA	Most detailed analysis of AD-VLA	Excludes manipulation/navigation domains
Hu et al. [41] (2025)	Past/present/future of AD-VLA	End-to-End vs Dual-System VLA taxonomy specialized for AD; proposes WorldBench	More granular AD-VLA taxonomy than Jiang et al. [10]
2026 Edge Survey [42] (2026)	System bottlenecks of edge deployment	"Deployment Gauntlet" of 8 coupled constraints; VLA = memory-bandwidth bound, diffusion = compute-latency bound	System-level analysis beyond model compression
Guan et al. [43] (2025)	Efficient VLA for manipulation	Four-dimensional efficiency taxonomy (architecture/perception/action/learning)	Efficiency perspective independent of Yu et al. [4]
Large Model Embodied AI [44] (2025)	Large-model-based embodied AI decision-making	Hierarchical vs end-to-end decision-making; world model integration	Frames VLA from a decision-making perspective

Note: Ma et al. [1] is a living survey that has been continuously updated from its 2024 first edition through v7 (2026.02).

Each survey illuminates VLA through its own lens, yet no single survey captures the complete picture. Surveys that treat architecture in depth neglect deployment; surveys focused on efficiency omit training paradigms; surveys specialized in a particular domain miss the intersections with other domains.

This document disassembles all fourteen surveys down to the atomic level to construct a superset of their information. This is not a simple merger. Where different surveys analyze the same model from different angles, this document integrates those cross-survey perspectives to provide an understanding richer than any individual survey. For example, for π0 [16], Zhong et al. [3] contribute an architectural classification, Yu et al. [4] contribute an inference-efficiency analysis, and Jin et al. [9] contribute an RL post-training analysis; this document weaves all three perspectives into a single unified profile.

Differentiating Contributions of This Document

If the fourteen surveys above are specialist surveys that probe specific axes, this document is a meta-survey (survey of surveys) that cross-cuts those axes. Specifically, it provides the following contributions absent from any individual survey:

Differentiation Axis	Limitation of Individual Surveys	This Document's Contribution
5-Axis Unified Taxonomy	Each survey uses its own independent taxonomy	Integrates architecture, action generation, anatomy, function, and post-processing into a single meta-taxonomy with explicit cross-survey mappings
Cross-Model Profiles	Same model analyzed fragmentarily from different angles	Combines perspectives from 14 surveys into unified per-model profiles (architecture + efficiency + RL post-training)
ICLR 2026 Trends	Most surveys only cover literature up to mid-2025	Analyzes 164 ICLR 2026 submissions (Reuss, 2026): discrete diffusion VLA, ECoT [92], self-improving RL
Emergent Insights	Each survey's conclusions drawn within its own scope	Cross-analysis of 14 surveys yields 10 emergent insights absent from any individual survey (Section 11)
Edge Deployment Systems View	Efficiency surveys focus on model compression	Integrates Deployment Gauntlet (8 coupled constraints) from Edge Survey [42] for model-system co-design
Decision-Making Framework	Hierarchical vs end-to-end compared only architecturally	Introduces decision-making paradigm from [44], positioning world models as the third axis

1.4 Structure of This Document

This document unfolds in the following structure:

Section 1 (this section): The definition of VLA, its significance, and the purpose of this document
Section 2: A chronology of VLA's evolution — the developmental narrative from 2017 to 2026
Section 3: Unified taxonomy — integrating fourteen surveys into one
Section 4: In-depth architectural dissection — Vision Encoder, VLM Backbone, Action Decoder
Section 5: Action tokenization — translating continuous actions into the model's language
Section 6: Training paradigms (Behavior Cloning, pretraining strategies, RL post-training)
Section 7: Efficiency — lightweighting and optimization for real-world deployment
Section 8: Application domains — manipulation, humanoids, autonomous driving, drones, medical
Section 9: Datasets, benchmarks, and simulators
Sections 10–11: Open problems, integrated insights, and conclusion

Each section synthesizes all relevant content from the fourteen surveys and explicitly provides cross-survey integrated insights.

Section 2: A Chronology of VLA's Evolution (2017–2026)

VLA did not emerge overnight. Three independent streams — computer vision, natural language processing, and robot learning — each carved out their own canyon over decades, until they converged at a single confluence point in the early 2020s. This section traces that confluence narrative across four phases.

Phase 0 — Convergence of Foundational Technologies (2017–2021)

The Vision Revolution: From CNN to ViT, and CLIP

After AlexNet overwhelmed traditional computer vision at the ImageNet competition in 2012, CNNs evolved rapidly. ResNet (2015) demonstrated residual learning that could deepen to 152 layers, and EfficientNet (2019) showed balanced scaling of width, depth, and resolution. Yet the true turning point came in 2020 with the Vision Transformer (ViT [28]). By dividing an image into 16×16 patches and processing them with Transformer self-attention, ViT [28] proved that the scaling laws validated in NLP applied equally to vision: more data, larger models, better performance — this simple formula fundamentally redirected vision research.

However, the development with the most direct influence on VLA was not ViT [28] but OpenAI's CLIP [27] in 2021. CLIP [27] performed contrastive learning on 400 million image-text pairs to align vision and language in the same embedding space: a photo of a cat and the text "a photo of a cat" end up close together in vector space. This visual-language alignment became a core prerequisite for VLA: when given the instruction "pick up the red mug," a robot's ability to connect the linguistic concept "red mug" with the object visible in the camera image derives directly from CLIP [27]-style pretraining. SigLIP (2023), a successor to CLIP [27], achieved more efficient training using a sigmoid loss, and DINOv2 [30] (2023) learned label-free visual representations through self-supervised learning; both went on to be widely adopted as Vision Encoders in subsequent VLAs.
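As a rough illustration of the contrastive objective described above, the following sketch computes a symmetric image-to-text and text-to-image cross-entropy over cosine similarities. It is a minimal stdlib-only toy, not CLIP's actual implementation: the function names, the random stand-in embeddings, and the fixed temperature of 0.07 are assumptions made for the example.

```python
import math
import random


def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_on_diagonal(logits):
    """Mean of -log softmax(row_i)[i]: each row's matching pair sits on the diagonal."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings."""
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts] for i in imgs]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_on_diagonal(logits) + cross_entropy_on_diagonal(transposed))


random.seed(0)
texts = [[random.gauss(0, 1) for _ in range(32)] for _ in range(8)]
# "Aligned" images sit almost on top of their captions; "unrelated" ones are independent.
aligned_images = [[x + 0.01 * random.gauss(0, 1) for x in t] for t in texts]
unrelated_images = [[random.gauss(0, 1) for _ in range(32)] for _ in range(8)]

loss_aligned = clip_style_loss(aligned_images, texts)
loss_unrelated = clip_style_loss(unrelated_images, texts)
```

The loss is low only when each image is closer to its own caption than to every other caption in the batch, which is the property that later lets a policy ground a phrase such as "red mug" in pixels.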

The Language Explosion: From GPT-3 to the Era of Large Language Models

Following the emergence of the Transformer architecture in 2017, NLP entered an era of rapid scaling. If BERT (2018) demonstrated bidirectional contextual understanding and GPT-2 (2019) demonstrated autoregressive text generation, GPT-3 [29] (2020) changed the paradigm itself with 175 billion parameters. GPT-3 [29] exhibited "emergent capabilities" (the ability to perform new tasks through few-shot prompting alone, without explicit training), and this gave robotics researchers a decisive insight: if a sufficiently large model is trained on enough data, capabilities that were never explicitly programmed appear on their own.

This observation led directly to two questions. First, could the world knowledge of LLMs (common sense, physical intuition, task procedures) be leveraged for robot planning? Second, do the scaling laws of LLMs also apply to robot policies? The first question was answered in 2022 by SayCan [14] and Inner Monologue [22]; the second was answered in 2023 by RT-2 [11].

The Wall of Robot Learning: Behavior Cloning and the Data Bottleneck

At the same time, the field of robot learning was grappling with its own challenges. Behavior Cloning (BC) was the most straightforward method for learning a policy by imitating expert demonstrations, but distribution shift acted as a fundamental limitation: small errors push the policy into states not seen during training, where further errors compound, so that in the worst case the total error grows quadratically with the task horizon. DAgger [97] (2011) theoretically resolved this problem, but obtaining repeated expert corrections in real time in the real world was impractical.

Offline Reinforcement Learning (Offline RL) attracted attention as an alternative to BC. CQL (2020), IQL (2021), and others proposed methods for improving policies using only previously collected data. However, these methods also suffered from conservative value estimation and hyperparameter sensitivity, and presupposed large-scale datasets of tens of thousands to hundreds of thousands of episodes. The problem was that collecting robot data is incomparably slower and more expensive than collecting data in NLP or CV. While ImageNet had millions of images and GPT-3 [29]'s training data numbered hundreds of billions of tokens, the largest robot datasets at the time contained only a few thousand episodes at most.

Meanwhile, simulator ecosystems began to partially alleviate this data bottleneck. SAPIEN (2020) provided a manipulation environment based on sophisticated physics simulation, AI2-THOR (2017) provided a photorealistic home environment, and Habitat (2019) provided a large-scale navigation environment. These simulators would go on to become the core infrastructure for VLA training and evaluation.

The Key Precursor: CLIPort (2021)

The first point at which these three streams — visual-language alignment, large language models, and robot learning — crossed was CLIPort [26] (2021). Shridhar et al. combined CLIP [27]'s visual-language representations with the Transporter Network (a spatial action map for robot manipulation) to perform table-top manipulation following language instructions. CLIPort [26] was the first to empirically demonstrate the core hypothesis that "internet-scale pretrained visual-language knowledge can transfer to robot actions." Although CLIPort [26] itself was not an end-to-end VLA but a hybrid structure injecting CLIP [27] features into the Transporter, this proof of concept became the direct inspiration for RT-2 [11] and OpenVLA [15].

Phase 1 — The Birth of VLA (2022–2023)

The foundational technologies that converged in Phase 0 began to be combined explosively from 2022. This period was like crossing the activation energy of a chemical reaction: a handful of landmark models appeared in rapid succession, giving birth to an entirely new field.

Gato and SayCan: The Beginning of Two Approaches (2022)

In the first half of 2022, two contrasting lines of experimentation were unveiled almost simultaneously. DeepMind's Gato [13] (2022) was a "generalist agent" that converted text, images, robot actions, and game inputs all into tokens and trained a single 1.2B-parameter Transformer on all of them. Gato [13]'s innovation was conceptual: a proof that modalities as different as these could be unified within a single sequence-modeling framework. Although its performance on each individual task fell short of specialist models, it was the first to demonstrate the possibility that "a single model can see, read, and act."

At the same time, SayCan [14] (2022) took the opposite philosophy. Rather than using an LLM (PaLM) to generate actions directly, it used it only as a high-level planner. SayCan [14]'s structure was elegant: the LLM decomposes a natural-language instruction into a sequence of step-by-step sub-tasks (e.g., "get me a Coke" → "move to the kitchen" → "open the fridge" → "pick up the Coke" → ...), pre-trained low-level policies (affordance functions) evaluate the feasibility of each sub-task, and the best executable action is selected by multiplying the LLM's plan with the affordance. SayCan [14] does not satisfy Zhong et al. [3]'s Pure VLA definition, but it stands as a key ancestor in the VLA lineage as the first system to demonstrate "connecting LLM world knowledge to a robot" in the real world.
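The selection rule described above (multiply the LLM's preference for a skill by that skill's affordance, then take the best product) can be sketched in a few lines. The skill names, the numeric scores, and the `select_skill` helper are hypothetical values chosen for illustration; they are not taken from the SayCan paper.

```python
def select_skill(llm_scores, affordances):
    """Pick the skill maximizing (LLM usefulness score) x (affordance / feasibility score)."""
    combined = {
        skill: llm_scores[skill] * affordances.get(skill, 0.0)
        for skill in llm_scores
    }
    best = max(combined, key=combined.get)
    return best, combined


# Hypothetical scores for the instruction "get me a Coke" while the robot is in the hallway.
llm_scores = {
    "open the fridge": 0.50,      # linguistically the most relevant next step
    "pick up the Coke": 0.35,
    "move to the kitchen": 0.15,
}
affordances = {
    "open the fridge": 0.05,      # no fridge within reach yet
    "pick up the Coke": 0.02,     # no Coke visible yet
    "move to the kitchen": 0.90,  # navigation is feasible right now
}

best_skill, combined_scores = select_skill(llm_scores, affordances)
```

In this toy setting the language model alone would prefer opening the fridge, but the affordance term vetoes skills that cannot succeed from the current state, so navigation is chosen first.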

Inner Monologue [22] (2022) advanced SayCan [14]'s idea by one step. Rather than having the LLM plan once and stop, it introduced a closed-loop structure in which the LLM receives textual environmental feedback about execution outcomes (success/failure, object detection results, human corrections) and dynamically revises its plan. This "inner monologue" is the direct ancestor of reasoning-integrated VLAs such as CoT-VLA [55].

VIMA [45]: Multimodal Prompt Following (2022)

Also emerging in this period, VIMA [45] (2022) was an encoder-decoder Transformer that understood prompts across diverse modalities — not just text, but also image goals, video demonstrations, and bounding boxes — to generate robot actions. VIMA [45] introduced the perspective that "language is not the only instruction channel," and laid the groundwork for subsequent research on multimodal instruction following.

RT-1: Proof of Large-Scale Real-World Learning (2022)

If SayCan [14] was a strategy for borrowing the LLM's "wisdom," RT-1 [12] (Robotics Transformer 1, 2022) took a head-on approach. Google spent 17 months collecting 130,000 real-world demonstration episodes with 13 robots, and trained an approximately 35M-parameter Transformer model on them. RT-1 [12]'s architecture was comparatively modest (EfficientNet + TokenLearner + Transformer decoder). Yet what RT-1 [12] proved was not its architecture but scale: training on sufficiently diverse and large-scale real-world data allows a single model to perform more than 700 different tasks.

RT-1 [12] also revealed an important failure: generalization to objects or environments not seen in the training data was extremely limited. This limitation led directly to the question "could we inject knowledge from internet-scale pretraining?", and the answer was RT-2 [11].

RT-2 and PaLM-E: The Official Birth of VLA (2023)

2023 was the year VLA got its name. RT-2 [11] (2023) took the existing VLMs PaLI-X (55B) and PaLM-E [18] (12B), encoded robot actions as text tokens (discretizing each action dimension into 256 bins and representing them as integer strings), and jointly fine-tuned them by mixing a small amount of robot data into the VLM's existing training data. The results were dramatic: even though RT-2 [11] was trained on the same robot data as RT-1 [12], it generalized to objects not seen during training ("put the dinosaur in the correct bin") and to abstract concepts ("pick up the object that is different from the others"). The internet-scale knowledge of the VLM had transferred to robot actions.
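The action encoding described above (each dimension discretized into 256 bins and written out as integers) can be sketched as follows. This is a minimal illustration under stated assumptions: the normalized action range of [-1, 1], the 7-dimensional example action, and the helper names are choices made here, not details confirmed by the RT-2 paper.

```python
def action_to_tokens(action, low, high, n_bins=256):
    """Map each continuous action dimension to an integer bin index in [0, n_bins - 1]."""
    tokens = []
    for value, lo, hi in zip(action, low, high):
        clipped = min(max(value, lo), hi)
        fraction = (clipped - lo) / (hi - lo)
        tokens.append(min(int(fraction * n_bins), n_bins - 1))
    return tokens


def tokens_to_action(tokens, low, high, n_bins=256):
    """Decode bin indices back to continuous values at the bin centers."""
    return [lo + (t + 0.5) / n_bins * (hi - lo) for t, lo, hi in zip(tokens, low, high)]


def tokens_to_text(tokens):
    """Render the bin indices as a space-separated integer string for the language model."""
    return " ".join(str(t) for t in tokens)


# Assumed normalized bounds for a 7-dimensional action (e.g. 6-DoF delta pose + gripper).
LOW = [-1.0] * 7
HIGH = [1.0] * 7
action = [0.0, 0.5, -1.0, 1.0, 0.25, -0.3, 0.9]

tokens = action_to_tokens(action, LOW, HIGH)
text = tokens_to_text(tokens)
decoded = tokens_to_action(tokens, LOW, HIGH)
max_error = max(abs(a - d) for a, d in zip(action, decoded))
```

The round trip is lossy by at most half a bin width per dimension, which is the price paid for letting the VLM emit actions with its ordinary next-token machinery.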

PaLM-E [18] (2023), published nearly simultaneously, was a massive multimodal model with 562B parameters that integrated visual tokens, language tokens, and robot-state tokens into a single input sequence. PaLM-E [18] focused more on multimodal understanding and planning than on direct action generation, but it was the most extreme push of the scaling hypothesis that "a single giant Transformer can digest vision, language, and robot actions all at once."

The message from RT-2 [11] and PaLM-E [18] was clear: a VLM is not merely an image captioner — with appropriate fine-tuning, it can become a policy that generates actions in the physical world. This is the core proposition of the VLA paradigm, and 2023 — the year this proposition was empirically demonstrated — is the founding year of the VLA field.

Diffusion Policy: A New Grammar for Action Generation (2023)

Almost simultaneously with RT-2 [11]'s proposal of an approach that autoregressively generates actions as discrete tokens, Diffusion Policy [17] (Chi et al., 2023) opened an entirely different action-generation paradigm. It applied the DDPM (Denoising Diffusion Probabilistic Model), the family of methods that had revolutionized image generation through DALL-E 2 and Stable Diffusion, to robot action generation.

The core insight of Diffusion Policy [17] lay in the multimodality of robot actions. For the instruction "pick up the cup," there is no single correct action: the robot might grasp from the right, from above, or by the handle. Training with a conventional mean squared error (MSE) loss would learn the average of this multimodal distribution, generating meaningless actions belonging to none of the modes. Diffusion Policy [17] naturally represented the multimodal distribution by starting from Gaussian noise and generating an action sequence (action chunk) through iterative denoising. Furthermore, its natural compatibility with action chunking (generating an entire action sequence at once) was a key contribution in securing temporal consistency.

This approach had a fundamental influence on the design of the Action Decoder in subsequent VLAs. The Flow Matching in π0 [16], the DiT-based decoder in CogACT [23], and the diffusion Transformer in RDT-1B [24] all developed on the grammar that Diffusion Policy [17] opened.
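The mode-averaging argument that motivates Diffusion Policy [17] can be made concrete with a toy example. The sketch below is not a diffusion model; it only contrasts the MSE-optimal point prediction with a sample drawn from the demonstration distribution. The 1-D "grasp offset" data and all names are hypothetical.

```python
import random
import statistics

random.seed(0)

# Hypothetical demonstrations: half the experts approach from the left (-1.0),
# half from the right (+1.0), with a little execution noise.
demos = [random.choice([-1.0, 1.0]) + random.gauss(0.0, 0.05) for _ in range(1000)]


def distance_to_nearest_mode(action):
    """How far an action is from the closest demonstrated grasp mode."""
    return min(abs(action + 1.0), abs(action - 1.0))


# The constant that minimizes mean squared error is the sample mean,
# which lands between the two modes where no expert ever acted.
mse_prediction = statistics.fmean(demos)

# A generative policy instead samples from the distribution; drawing one
# demonstration at random stands in for that sampling step here.
sampled_action = random.choice(demos)

mse_gap = distance_to_nearest_mode(mse_prediction)
sample_gap = distance_to_nearest_mode(sampled_action)
```

The averaged prediction sits roughly one unit away from either valid grasp, while a sampled action stays near one of them. Iterative denoising is one way to obtain such samples from a learned model.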

Open X-Embodiment: A Turning Point in Data Integration (2023)

In late 2023, the Open X-Embodiment [19] (OXE [19]) project, led by Google DeepMind, fundamentally transformed the data infrastructure of the VLA field. This dataset, which integrated more than one million episodes collected from 22 robotic platforms into a standardized format (RLDS), shattered the paradigm in which individual research labs collected data only from their own robots. OXE [19]'s key finding was that policies trained on "cross-embodiment data" generalized better than policies trained on single-robot data. Data from different robots acted as "diversity" rather than "noise," preventing overfitting.

OXE [19] became the training data foundation for nearly all subsequent large VLAs. Octo [25], OpenVLA [15], and π0 [16] all used OXE [19] as their core training data, and the very existence of OXE [19] made the research direction of "generalist robot policies" possible.

Phase 2 — Diversification and Explosive Growth (2024)

2024 was the Cambrian explosion of the VLA field. Building on the two paradigms proven by RT-2 [11] and Diffusion Policy [17] — autoregressive token generation and diffusion-based action generation — dozens of new models emerged, rapidly reshaping the landscape of the entire field.

The Emergence of Generalist Policies: Octo and OpenVLA

Octo [25] (2024) was the first generalist cross-platform policy to fully leverage the OXE [19] dataset. A relatively small model at 93M parameters, it featured a Transformer-based architecture with a flexible design supporting both autoregressive and diffusion action heads. Octo [25]'s core contribution was less in architectural innovation than in establishing the training recipe of "cross-robot pretraining → target robot fine-tuning." By demonstrating that the pretrained policy could be adapted to a new platform with only a small amount of target robot data, it empirically validated the transfer learning paradigm for VLA.

OpenVLA [15] (2024) was a turning point in the democratization of VLA. Whereas RT-2 [11] was a proprietary model with 55B parameters, OpenVLA [15] was a fully open-source model at 7B parameters, built by fine-tuning a Llama 2-based VLM on OXE [19] data, that provided a VLA anyone could reproduce. OpenVLA [15] achieved an absolute success rate 16.5 percentage points higher than RT-2-X while using roughly one-seventh as many parameters. This result provided the important implication that "VLA does not necessarily require tens or hundreds of billions of parameters," and laid the groundwork for subsequent research into small VLAs.

New Architectural Paradigms: π0 and CogACT

The most influential architectural innovation of 2024 came from π0 [16] (Physical Intelligence, 2024). π0 [16] proposed a new architecture that mediates between the autoregressive approach of the RT-2 [11] family and the diffusion approach of Diffusion Policy [17]: a VLM backbone (PaliGemma-based, ~3B parameters) handles vision and language understanding, and on top of it a separate Action Expert (~0.3B parameters) generates continuous actions via Flow Matching, for a total of approximately 3.3B parameters. Unlike DDPM, which iterates stochastic reverse processes, Flow Matching directly learns a deterministic ODE path (velocity field) from noise to data, enabling faster and more stable action generation.
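To make the "deterministic ODE path from noise to data" concrete, the sketch below integrates a velocity field with fixed-step Euler updates. It is a toy under explicit assumptions: `oracle_velocity` is a closed-form stand-in for the learned Action Expert (it assumes the target action is known), the straight-line path and the 10 integration steps are illustrative choices, and none of the names come from the π0 codebase.

```python
import random


def oracle_velocity(x, t, target):
    """Velocity of the straight path x_t = (1 - t) * noise + t * target, evaluated at x."""
    return [(goal - xi) / (1.0 - t) for xi, goal in zip(x, target)]


def sample_action(velocity_fn, dim, num_steps=10, seed=0):
    """Start from Gaussian noise at t = 0 and Euler-integrate the velocity field to t = 1."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt
        velocity = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, velocity)]
    return x


# Hypothetical 3-D target action; a trained model would predict the velocity from
# the observation and instruction instead of being told the target.
target = [0.2, -0.4, 0.1]
action = sample_action(lambda x, t: oracle_velocity(x, t, target), dim=3)
```

Because the path is deterministic once the initial noise is drawn, a handful of Euler steps suffices in this toy. In a real system each step costs one forward pass of the action head, so the step count directly trades accuracy against latency.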

π0 [16]'s true impact came not only from its architecture but from its performance. Because it decisively outperformed prior VLAs on long-horizon, complex manipulation tasks such as folding shirts and cleaning a table, the combination of "VLM backbone + diffusion/flow action decoder" became the new reference point for VLA architecture.

CogACT [23] (2024) used a DiT (Diffusion Transformer) as the Action Decoder and introduced a technique for adaptively ensembling action candidates generated from multiple denoising paths. RDT-1B [24] (2024) used a 1B-parameter DiT as a diffusion policy, demonstrating that a Scalable Diffusion Transformer is also effective for action generation in VLA. GR-2 (2024) leveraged large-scale web video as pretraining data, proposing a strategy that circumvents the scarcity of robot data through video pretraining.

FAST Tokenization: A Breakthrough in Action Sequence Compression

Another key innovation, developed in late 2024 and released in early 2025, was FAST [20] (Frequency-space Action Sequence Tokenization). The method previously used in VLAs to convert continuous actions into tokens (bin discretization) was extremely inefficient: representing an action chunk of 16 timesteps for a 7-DoF robot required 16 × 7 = 112 tokens. FAST [20] extracts the frequency components of an action sequence using the Discrete Cosine Transform (DCT) and compresses repeated patterns using Byte Pair Encoding (BPE), representing the same action sequence with up to roughly 13x fewer tokens. This compression directly improved the inference speed of autoregressive VLAs and would become the core foundational technology of π0-FAST [20].
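The intuition behind frequency-space compression can be sketched with a hand-written DCT. The example below only covers the DCT-and-quantize half of the idea and uses the number of nonzero quantized coefficients as a crude proxy for token count; the BPE stage is omitted, and the smooth ramp trajectory, the quantization scale of 10, and all names are assumptions made for illustration rather than details of FAST [20].

```python
import math


def dct2(signal):
    """Orthonormal DCT-II of a 1-D sequence (direct O(n^2) evaluation)."""
    n = len(signal)
    coeffs = []
    for k in range(n):
        total = sum(x * math.cos(math.pi * (i + 0.5) * k / n) for i, x in enumerate(signal))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        coeffs.append(scale * total)
    return coeffs


HORIZON, DOF = 16, 7

# Hypothetical smooth action chunk: each joint ramps linearly to a different end value.
chunk = [[(d + 1) / DOF * t / (HORIZON - 1) for d in range(DOF)] for t in range(HORIZON)]

# Per-dimension binning spends one token per timestep per joint.
naive_tokens = HORIZON * DOF

# Transform each joint's trajectory, quantize, and count the coefficients that survive.
QUANT_SCALE = 10  # assumed quantization resolution
nonzero_coeffs = 0
for d in range(DOF):
    series = [chunk[t][d] for t in range(HORIZON)]
    quantized = [round(c * QUANT_SCALE) for c in dct2(series)]
    nonzero_coeffs += sum(1 for q in quantized if q != 0)

compression_ratio = naive_tokens / nonzero_coeffs
```

Smooth motions concentrate their energy in a few low-frequency coefficients, so most quantized entries are zero. The real tokenizer then applies BPE to the quantized coefficients to shorten the sequence further.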

Combining 3D Understanding and World Models

3D-VLA [37] (2024) was a pioneering attempt to integrate 3D spatial understanding into VLA. It sought to overcome the fundamental limitations of 2D image-based VLAs — the absence of depth perception and the inability to reason about occluded objects — through a generative 3D world model. 3D-VLA [37] proposed a structure that predicts the future state of the 3D scene before executing an action, and feeds this prediction back into action generation. This approach went on to develop into SpatialVLA [39], PointVLA [77], and others, opening a research direction that adds "imagination" to the "perception-action" loop.

Phase 3 — Efficiency and Deployment Readiness (2025–2026)

If the Cambrian explosion of 2024 explored "what is possible," research from 2025 onward shifted its center of gravity toward "how to make it practical." This transition proceeded simultaneously along three axes: hierarchization of architectures, extreme efficiency, and the internalization of safety and reliability.

The Rise of Hierarchical Architectures

GR00T N1 [21] (NVIDIA, 2025) proposed an architecture inspired by the Dual Process Theory of human cognitive science. System 2 (VLM-based, 10 Hz) handles high-level understanding and planning, while System 1 (diffusion-based, 120 Hz) generates reflexive low-level actions. This separation arose from a practical insight: VLM reasoning is slow but rich, while diffusion generation is fast but simple. Combining the two systems achieves both the depth of high-level understanding and the responsiveness of low-level control.
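The two-rate split can be pictured as a loop in which the slow module refreshes its output only once every 12 fast ticks (120 Hz / 10 Hz). The sketch below is synchronous and uses hypothetical stand-ins (`slow_planner`, `fast_policy`) for the VLM and the diffusion head; a real system would more plausibly run the two modules asynchronously.

```python
SYSTEM1_HZ = 120  # fast, reactive action generation
SYSTEM2_HZ = 10   # slow, deliberative scene understanding
TICKS_PER_PLAN = SYSTEM1_HZ // SYSTEM2_HZ  # 12 fast ticks per slow update

def slow_planner(observation):
    # Hypothetical stand-in for the VLM: remembers when the context was made.
    return {"context_from_tick": observation}

def fast_policy(observation, context):
    # Hypothetical stand-in for the action head: pairs the current tick
    # with the (possibly stale) context it was conditioned on.
    return (observation, context["context_from_tick"])

def run(duration_s=1.0):
    planner_calls, actions, context = 0, [], None
    for tick in range(int(duration_s * SYSTEM1_HZ)):
        if tick % TICKS_PER_PLAN == 0:
            context = slow_planner(tick)  # refresh the high-level context
            planner_calls += 1
        actions.append(fast_policy(tick, context))
    return planner_calls, actions

planner_calls, actions = run(1.0)
max_staleness = max(tick - ctx for tick, ctx in actions)
print(planner_calls, len(actions), max_staleness)
```

In this arrangement the fast policy always acts on a context that is at most 11 ticks old, which is the price paid for keeping the control loop at 120 Hz.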

π0.5 [31] (Physical Intelligence, 2025) adopted a different form of hierarchization. A high-level VLM generates a sequence of natural-language sub-tasks (e.g., "grab the cup" → "move over the sink" → "set the cup down"), and a low-level π0 [16] executes each sub-task. This approach opened a path to handling long-horizon tasks exceeding 30 minutes (such as cleaning an entire kitchen) with VLA.

The Extreme Pursuit of Efficiency

The most direct barrier to real-world VLA deployment was computational cost. Inference with a 7B-parameter model requires 16–24 GB of VRAM and imposes latency of hundreds of milliseconds — both fatal for robot control. 2025 was a year in which solutions to this problem were proposed at an explosive rate.

SmolVLA [32] (2025, 450M) answered the question "can VLA work without a large VLM?" With 450M parameters, it enables single-GPU training while achieving performance close to OpenVLA [15] (7B) on simple tasks. BitVLA [33] (2025) went further, applying 1.58-bit ternary quantization to VLA to dramatically reduce memory usage. TinyVLA [34] (proposed in 2024, a pioneer of the efficiency trend) proposed distillation techniques for shrinking the VLM backbone while maintaining performance; EdgeVLA (2025) proposed optimizations for edge-device deployment; and VLA-Cache (2025) proposed eliminating redundant computation by caching visual tokens. DeeR-VLA [35] (2025) adopted dynamic inference that activates layers of different depth depending on the difficulty of the input, while MoLe-VLA (2025) pursued parameter efficiency through a Mixture-of-Experts structure.

The common message from these efficiency research efforts is clear: the core value of VLA lies in the knowledge of large VLMs, but delivering that knowledge does not necessarily require a large model. Through distillation, quantization, caching, and dynamic inference, the knowledge of large models can be compressed into small models and shaped into a form suitable for real-time robot control.
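A back-of-the-envelope calculation shows why parameter count and bit-width dominate the deployment budget. The sketch counts weight storage only; activations, KV cache, and framework overhead are ignored, which is why the 7B figure lands below the 16–24 GB quoted above. The function name `weight_memory_gb` is an assumption made for illustration.

```python
import math

def weight_memory_gb(num_params, bits_per_weight):
    """Weight storage in decimal gigabytes (weights only, no activations)."""
    return num_params * bits_per_weight / 8 / 1e9

# A ternary weight takes one of three values, so it carries log2(3) bits,
# which is where the "1.58-bit" label comes from.
TERNARY_BITS = math.log2(3)

fp16_7b = weight_memory_gb(7e9, 16)               # 7B model, 16-bit weights
int4_7b = weight_memory_gb(7e9, 4)                # 7B model, 4-bit weights
ternary_7b = weight_memory_gb(7e9, TERNARY_BITS)  # 7B model, ternary weights
fp16_450m = weight_memory_gb(450e6, 16)           # 450M model, 16-bit weights

for name, gb in [("7B fp16", fp16_7b), ("7B int4", int4_7b),
                 ("7B ternary", ternary_7b), ("450M fp16", fp16_450m)]:
    print(f"{name}: {gb:.2f} GB")
```

Shrinking the model and shrinking the bit-width are independent levers, so they can in principle be combined.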

The Emergence of RL Post-Processing

VLAs trained solely with Behavior Cloning share a fundamental limitation: demonstration quality becomes the performance ceiling, and behaviors absent from the demonstrations cannot be discovered. To overcome it, 2025 saw the full-scale emergence of research on post-training pretrained VLAs with reinforcement learning. VLA-RL [68] (2025) applied GRPO (Group Relative Policy Optimization); ConRFT [69] (2025) used online RL; SimpleVLA-RL [70] (2025) used a simplified REINFORCE-based RL; and RIPT-VLA [71] (2025) used iterative RL-based refinement.

These works share a common recipe: "pretrain with BC to obtain a reasonable initial policy, then explore and improve with RL to achieve performance that surpasses demonstrations." This is structurally identical to the LLM field's trajectory from GPT-3 [29] (pretraining) → InstructGPT (RLHF post-training), showing that the VLA field is rapidly retracing the maturation path of LLMs.
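To make the RL stage concrete, the group-relative advantage at the heart of GRPO can be sketched as follows. The binary success rewards and the group of eight rollouts are assumptions for illustration; the sketch shows only the advantage computation, not a full policy update or any specific paper's training loop.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts for the same task.

    Each rollout is scored relative to its siblings, so no separate
    value network is needed to provide a baseline.
    """
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Hypothetical group: 8 rollouts of one task, reward 1.0 on success.
rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0]
adv = group_relative_advantages(rewards)

# If every rollout gets the same reward, all advantages are zero
# and the group contributes no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])

print(np.round(adv, 3), flat)
```

Successful rollouts receive positive advantages and failures negative ones, so the update pushes the BC-initialized policy toward behaviors that actually succeed, including ones that were rare in the demonstrations.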

Integration of Reasoning and Internalization of Safety

CoT-VLA [55] (2025) introduced Chain-of-Thought reasoning from LLMs into VLA. Rather than generating actions immediately, it first generates a visual reasoning process (key region masks, task decomposition text) and then generates actions conditioned on this. This is an attempt to overcome VLA's "reactive" limitation and represents the latest evolution in the lineage of "thinking robots" that began with Inner Monologue [22] (2022).

SafeVLA [75] (2025) is the first model to internalize safety constraints into VLA's training process. Whereas existing VLAs handled safety as a post-hoc filter, SafeVLA [75] includes safety-violation prediction as part of the training objective itself, suppressing dangerous actions at the generation stage. This approach is the first systematic response to the warning that "hallucination in VLA means physical accidents."

Humanoid-VLA (2025) extended the application range of VLA from table-top manipulation to whole-body humanoid control. Controlling the full-body motion of a humanoid with dozens of degrees of freedom through VLA is a challenge of an entirely different order from conventional robot arms (6–7 DoF) in terms of action space dimensionality.

Autonomous Driving VLA: A New Application Frontier

The most dramatic demonstration that VLA applications are not limited to robot manipulation came from the autonomous driving (AD) domain. EMMA [46] (Waymo, 2024) applied the Gemini VLM to autonomous driving, showing that a VLA can generate driving paths end-to-end from sensor inputs. ORION [47] (2025) integrated visual reasoning with driving action generation; AutoVLA [48] (2025) proposed a VLA architecture specialized for autonomous driving; and DriveMoE [49] (2025) proposed an efficient structure that activates specialized modules per driving scenario using Mixture-of-Experts.

Autonomous-driving VLAs share their technical DNA with robot manipulation VLAs (VLM backbone, action tokenization, end-to-end learning), but domain characteristics differ substantially: higher speeds, stricter safety requirements, and greater environmental variability. The cross-pollination between these two domains is accelerating the maturation of VLA technology.

Explosive Quantitative Growth of VLA Research: Evidence from ICLR 2026

The most dramatic evidence of the growth rate of the VLA field comes from the submission statistics of ICLR 2026 (Reuss, 2026). At ICLR 2024, there was only a single VLA-related submission (rejected); at ICLR 2025, there were 9; but at ICLR 2026, 164 papers were submitted, representing an 18-fold year-over-year explosion. This figure signals that VLA is no longer a niche research topic but has established itself as a mainstream research direction within the machine learning community. Some analyses project over 1,000 submissions at ICLR 2027.

The key trends observed across these 164 papers include:

Discrete Diffusion VLA: Four concurrent papers replacing the slow sequential generation of autoregression with parallel diffusion
Embodied Chain-of-Thought (ECoT [92]): Integrating spatially-grounded reasoning with action generation
Cross-Action-Space Learning: Transfer across heterogeneous embodiments via X-VLA [53], XR-1, HiMoE-VLA [54], and others
Self-Improving RL: Residual RL achieving 99% on LIBERO, accelerating benchmark saturation
New Benchmarks: RoboArena (real-sim transfer), RoboCasa365 (365 tasks / 2000+ kitchen scenes), WorldGym (world-model-based evaluation)

A particularly noteworthy finding comes from VLM4VLA (ICLR 2026), which demonstrated that standard VLM benchmark performance has no correlation with downstream VLA performance. This implies that the choice of VLM backbone for VLA should be guided not by general VLM benchmark rankings but by criteria specialized for robotic tasks.

Platform-Scale Orchestration

Another hallmark of 2025–2026 is that VLA is evolving beyond the level of individual models to the platform level. Gemini Robotics (Google DeepMind, 2025) placed Gemini 2.0 at the center of robot control, proposing a structure in which a single VLM orchestrator coordinates diverse robotic platforms and tasks. NVIDIA's GR00T ecosystem is constructing a full-stack pipeline spanning Cosmos (simulation) → GR00T N1 [21] (VLA policy) → Jetson (edge inference).

This platform strategy signals that VLA is no longer purely a research topic but is transitioning into an industrial product. Physical Intelligence's π series (π0 [16] → π0-FAST [20] → π0.5 [31]) positions itself as the "Android for robots," and Figure AI follows the same logic by embedding its Helix VLA in its own humanoid as part of a full-stack robotics strategy.

The Frontier-vs-Open-Weight Gap

The most salient dividing line as of 2025–2026 is the real-world generalization gap between closed frontier models and open-weight research models. Closed models such as Gemini Robotics and π0.5 [31] are demonstrating zero-shot real-world generalization, while open-source research VLAs, despite approaching parity on simulation benchmarks, have not narrowed the gap in real-world settings. Three factors are cited as root causes of this divide: (1) differences in training data quality and diversity, (2) a ceiling effect in simulation benchmarks that masks actual progress, and (3) disparities in research infrastructure scale.

Based on this analysis, Reuss (2026) draws a critical conclusion: the current academic focus on pushing numbers on saturated benchmarks such as LIBERO and SimplerEnv risks masking the real-world deployment gap rather than closing it. The true breakthrough paths, he argues, lie in (1) data curation and quality control, (2) strengthening in-context learning capabilities, and (3) establishing real-world evaluation protocols. He further notes that data curation and in-context learning were the most underrepresented research directions at ICLR 2026, which he frames as precisely where the greatest opportunity lies.

The Key Transition: From "Proof of Concept" to "Deployment Readiness"

VLAs of 2022–2023 were in the stage of proving "this is possible." The essence of that era was RT-2 [11] showing that VLM knowledge can transfer to robot actions, and Diffusion Policy [17] opening the possibility of multimodal action generation. 2024 was an era of explosive exploration of "how many different ways is it possible?" Dozens of models experimented with different architectures, training strategies, and application domains.

And in 2025–2026, the VLA field is converging on the question "how do we make it work in the real world?" The forces driving this transition are multilayered: lightweighting technology lowers the physical barriers to edge deployment; RL post-processing breaks the performance ceiling of BC; internalization of safety constraints structurally resolves reliability issues; and hierarchical architectures address the challenge of long-horizon, complex tasks. The simultaneous advancement along these four axes is the driving force converting VLA from a laboratory demo to a real-world product.

The most important pattern observed in this chronology is the acceleration of cross-pollination between technologies. Just as CLIP [27]'s visual-language alignment gave birth to CLIPort [26], GPT-3 [29]'s few-shot learning gave birth to SayCan [14], and image diffusion models gave birth to Diffusion Policy [17], every core innovation in VLA has been the result of adapting a breakthrough from an adjacent field to robotics. This pattern will continue: LLM reasoning techniques (CoT, MCTS) are already being transplanted into VLA, advances in video generation models will accelerate world model-based VLA, and advances in multi-agent LLM systems will lead to multi-robot VLA collaboration.

The history of VLA is still in its opening chapter. Yet the density of this opening chapter — reaching from proof of concept to industrial deployment readiness in just four years — gives us a sense of the pace at which this field will unfold going forward.

Motivation Chain: The Causal Lineage of Key VLA Models

Motivation Chain

RT-1's limitations (single robot, no internet knowledge transfer without a VLM, poor generalization beyond training data)

→ RT-2 [11] emerges (fine-tunes an existing VLM to transfer internet-scale knowledge to robot actions)

→ RT-2's limitations (55B parameters, proprietary, real-time control infeasible, inference 330–1000 ms)

→ OpenVLA [15] emerges (7B open-source, reproducible by anyone, 16.5% higher absolute success rate than the 55B RT-2-X)

→ OpenVLA's limitations (autoregressive decoding struggles with multimodal action distributions, slow inference ~166 ms)

→ π0 [16] emerges (Flow Matching enables multimodal action generation, ~73 ms inference, superior on dexterous tasks)

→ π0's limitations (a single model struggles with long-horizon, complex tasks)

→ π0.5 [31] emerges (hierarchical architecture with high-level VLM planning + low-level π0 execution, tasks over 30 min)
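The latency figures in this chain translate directly into control rates. The sketch below assumes one action per inference call; models that emit action chunks can execute several actions per call, so their effective control rate can be higher than this simple inversion suggests.

```python
def control_rate_hz(latency_ms):
    """Upper bound on control frequency if each inference yields one action."""
    return 1000.0 / latency_ms

# Latencies quoted in the motivation chain above (milliseconds).
latencies_ms = {
    "RT-2 (slow end)": 1000,
    "RT-2 (fast end)": 330,
    "OpenVLA": 166,
    "pi0": 73,
}

rates = {name: control_rate_hz(ms) for name, ms in latencies_ms.items()}
for name, hz in rates.items():
    print(f"{name}: {hz:.1f} Hz")
```

Read this way, the chain is a progression from roughly 1–3 Hz, through about 6 Hz, to more than 13 Hz per inference call.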

Motivation Chain

Behavior Cloning's limitations (demonstration quality becomes the performance ceiling, cannot discover actions outside demonstrations)

→ RL post-training research emerges (VLA-RL [68], RIPT-VLA [71], ConRFT [69], etc.)

→ Pretrain with BC to obtain a reasonable initial policy, then explore and improve with RL to surpass demonstration-level performance

Motivation Chain

The OXE [19] dataset emerges (overcoming single-lab data limits → integrated cross-embodiment data from 22 robot types)

→ Octo [25] (establishes the cross-robot pretraining → target fine-tuning recipe)

→ OpenVLA [15] (OXE-based open-source VLA democratization)

Discriminative Features: Easily Confused Model Pairs

Comparison	Key Differentiator
RT-1 vs RT-2 [11]	RT-1 uses a bespoke architecture (35M); RT-2 fine-tunes an existing VLM (55B) — the presence or absence of internet knowledge transfer is the crux
RT-2 [11] vs OpenVLA [15]	Same "VLM → action tokens" paradigm, but OpenVLA is a 7B open-source model focused on democratization
OpenVLA [15] vs π0 [16]	OpenVLA uses autoregressive decoding (discrete tokens); π0 uses Flow Matching decoding (continuous actions) — the difference lies in multimodal action expressiveness
SayCan [14] vs RT-2 [11]	SayCan has the LLM plan only while a separate policy executes (modular); RT-2 plans and executes within a single model (end-to-end)
Gato [13] vs RT-2 [11]	Gato is a generalist agent (games + robotics + text); RT-2 is a robotics-specialized VLA — Gato is a proof of concept, RT-2 targets practical performance
Diffusion Policy [17] vs π0 [16]	Diffusion Policy is a standalone diffusion policy; π0 combines VLM + Flow Matching — π0 internalizes language understanding
GR00T N1 [21] vs π0.5 [31]	Both are hierarchical, but GR00T separates by speed (System 1 at 120 Hz + System 2 at 10 Hz); π0.5 separates by function (VLM planning + π0 execution)
Octo [25] vs OpenVLA [15]	Octo is a small 93M model with diffusion heads specialized for cross-platform transfer; OpenVLA is a 7B VLM-based model with autoregressive decoding and general knowledge transfer

Intuitive One-Liners

RT-2 [11]: "Just as Google Translate converts Korean to English, RT-2 makes a VLM translate images into robot actions."
OpenVLA [15]: "Take the core idea of RT-2, shrink it roughly 8x (55B → 7B), and release it as open source for everyone."
π0 [16]: "A VLM understands the situation, and a flow-matching model (a close relative of diffusion) paints smooth, precise motions based on that understanding."
SayCan [14]: "ChatGPT gives the recipe; the robot chef actually cooks — knowing what to do and being able to do it are separated."
Diffusion Policy [17]: "Just as Stable Diffusion paints images from noise, this paints robot motions from noise."
Octo [25]: "A 'universal driver's license' pretrained on diverse robot data — just a small fine-tune adapts it to a new robot."
OXE [19]: "Just as ImageNet transformed CV, OXE is robotics' ImageNet — 1M+ episodes from 22 robot types, unified."
FAST: "Compress robot actions the way JPEG compresses images — frequency-domain encoding that cuts token count by up to 13x."
GR00T N1 [21]: "A slow but deep-thinking cerebrum (VLM, 10 Hz) and a fast-reacting cerebellum (diffusion, 120 Hz) working in tandem."
CoT-VLA [55]: "A robot that reasons 'why should I do this?' before acting — the evolution from reflex to deliberation."

Self-Check Questions: Sections 1–2

Q1: Among the three (+one) definitions of VLA, under which definitions is SayCan classified as a VLA, and under which is it excluded?

Answer: Under Ma et al.'s broad definition, SayCan is included (any system that takes vision + language and produces actions). However, under Zhong et al.'s Pure VLA definition, it is excluded (its modular structure fails the end-to-end integration criterion). Under Kawaharazuka et al.'s definition, it is also excluded (as a high-level policy selecting from pre-defined skill indices, it does not satisfy the "direct control command generation" criterion). Under RT-2's narrow definition, it is likewise excluded (the VLM backbone is not directly used for action generation).

Q2: Why did RT-2 exhibit better generalization than RT-1 despite being trained on the same robot data?

Answer: RT-2 inherited the visual-language knowledge of a VLM (PaLI-X, 55B) that had been pretrained at internet scale. Because the VLM already understood abstract concepts such as "dinosaur" and "the object that is different from the others," it could generalize to objects and concepts never encountered in the robot training data by leveraging the VLM's prior knowledge. This is the core proposition of the VLA paradigm: "transfer of internet-scale knowledge to robot actions."

Q3: Explain, from the perspective of "multimodality," why Diffusion Policy is superior to conventional MSE-based Behavior Cloning.

Answer: For the instruction "pick up the cup," there are multiple valid actions (grasping from the right, from above, by the handle, etc.). Training with an MSE loss learns the mean of this multimodal distribution, producing meaningless intermediate actions that belong to none of the modes (mode averaging). Diffusion Policy generates actions through iterative denoising from noise, naturally sampling from each mode of the multimodal distribution. Additionally, action chunking (generating an entire future action sequence at once) ensures temporal coherence.
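Mode averaging is easy to reproduce numerically. In the sketch below, the one-dimensional "grasp from the left (-1) or from the right (+1)" demonstrations are hypothetical, and drawing a sample from the demonstrations merely stands in for a generative policy; no diffusion model is trained here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical bimodal demonstrations: half approach at -1, half at +1.
modes = rng.choice([-1.0, 1.0], size=1000)
demos = modes + 0.05 * rng.standard_normal(1000)

def dist_to_nearest_mode(a):
    return min(abs(a - 1.0), abs(a + 1.0))

# The constant prediction that minimizes MSE is the mean of the data,
# which lands between the two modes and matches neither.
mse_action = float(demos.mean())

# A generative policy instead draws from the distribution,
# so each sample commits to one of the valid modes.
sampled_action = float(rng.choice(demos))

print(round(mse_action, 3), round(dist_to_nearest_mode(mse_action), 3))
print(round(sampled_action, 3), round(dist_to_nearest_mode(sampled_action), 3))
```

The MSE-optimal action sits near 0, about one unit away from either valid grasp, while the sampled action stays close to one of them.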

Open Research Questions: Sections 1–2

The boundary of the VLA definition: Are systems that generate code (Code-as-Policy), output keypoints, or produce reward functions VLAs? How well can the Pure VLA criterion of "direct action generation" accommodate the diverse action representations that may emerge in the future?

The next stream of cross-pollination: VLA has absorbed breakthroughs from NLP (Transformer), CV (ViT, CLIP), and generative AI (Diffusion). What adjacent field will have the greatest impact on VLA next? (Candidates: video generation, 3D foundation models, neurosymbolic AI)

The ceiling of scaling: If performance is maintained even as parameters shrink more than 100x from RT-2 (55B) to SmolVLA (450M), what form do scaling laws take in VLA? Among parameter count, data diversity, and architectural efficiency, which axis matters most?

Bridging the data gap: Even OXE's 1M+ episodes are minuscule compared to internet-scale text and images. Among simulation, video pretraining, and synthetic data, which strategy can most effectively narrow this gap?

Section 3: A Unified Taxonomy — Integrating Fourteen Surveys into One

"Like blind men touching different parts of an elephant, each survey was describing a different region of the vast animal called VLA. This chapter joins their hands to reconstruct the elephant's entire body."

Between early 2024 and early 2026, more than fourteen survey papers on VLA were published in rapid succession. Each survey proposed its own taxonomy, and it was not uncommon for the same model to be placed in different categories across papers. Some classify RT-2 [11] as a "monolithic VLA," while others classify it under "autoregressive action generation." The goal of this chapter is to reinterpret these taxonomies not as competing alternatives but as complementary perspectives, and to construct a meta-taxonomy capable of locating every VLA model within a single coordinate system.

3.1 Architectural Perspective — Based on Liu/Shao (2025)

Liu and Shao's survey focuses on the structural form of VLAs. Their central question is: "In what relationship are the VLM and the action generator connected?"

3.1.1 Monolithic Architecture

The entire system is composed of a single end-to-end model, with boundaries between internal modules emerging naturally during training.

Single-system: Observation-to-action generation occurs in a single unified forward pass. Given input images and language instructions, the network directly produces robot actions as output. Representatively, RT-2 [11] interprets the output tokens of the PaLI-X VLM directly as action tokens, and OpenVLA [15] repurposes the language model head of the Prismatic VLM for action prediction. NORA likewise integrates vision, language, and action within a single transformer.

Dual-system: Two distinct modules — a VLM backbone (System 2) and an action expert (System 1) — exist and cooperate within a single model. This distinction is inspired by Daniel Kahneman's dual-process theory. The VLM handles slow, deliberative thinking (System 2) for scene understanding and reasoning, while the action expert handles fast, reactive action generation (System 1).

Dual-system models are further divided by their information flow pattern:

Cascade-based: The VLM runs first to produce a feature representation, which is then passed sequentially to the action expert. In CogACT [23], the VLM extracts visual-language features, after which a separate diffusion-based action generator receives them to output action sequences. In GR00T N1 [21], the Eagle-2 VLM produces context embeddings that are passed to a DiT (Diffusion Transformer) action head. Fast-in-Slow also follows a sequential structure of slow VLM processing followed by fast action generation.

Parallel-based: VLM tokens and action tokens are processed simultaneously through a shared attention mechanism. In π0 [16], the tokens from the PaliGemma VLM and the flow-matching tokens from the action expert attend to one another within shared transformer blocks. π0.5 [31] extends this so that high-level planning and low-level action are processed within the same attention space. GraspVLA [81] adopts a structure where grasp-specialized tokens are processed in parallel with VLM tokens.

3.1.2 Hierarchical Architecture

Planning and execution are explicitly separated. The higher level decides "what to do," while the lower level decides "how to move."

Planner-Only: The VLM generates only a plan; action execution is delegated to a separate low-level controller (MPC, PID, etc.). SayCan [14] is the seminal example, followed by COME-Robot, Inner Monologue [22], and others in this lineage.
Planner + Policy: A VLM planner generates an intermediate representation, which a learned policy then translates into low-level actions. VoxPoser [57], RT-H, and RoboPoint fall into this category.

These can be further classified by the form of intermediate representation:

Keypoint (K): Goal specification via keypoint coordinates (RoboPoint, ReKep)
Subtask (S): Linguistic description of subtasks (SayCan [14], ProgPrompt)
Program (P): Generation of executable code/programs (Code-as-Policies [36], VoxPoser [57])

In addition, Liu & Shao [5] identify Affordance (A) as a separate auxiliary representation type. A3VLM [58], CoA-VLA, and similar models combine affordance maps with other representations (K, S, P) to explicitly specify graspable regions.

3.2 Action Generation Perspective — Based on Zhong et al. (2025)

Zhong et al. [3]'s survey shifts the lens of classification from the overall architectural form to how actions are generated. This perspective is practically important because, even with the same VLM backbone, different action generation methods can yield vastly different performance characteristics.

Autoregressive Method

Continuous actions are converted into discrete tokens, then generated sequentially with the same next-token prediction used by language models. RT-2 [11] pioneered this approach with 256-bin quantization, followed by OpenVLA [15], Octo [25] (which supports both AR and diffusion modes), and RT-2-X. FAST (Frequency-space Action Sequence Tokenization) [20] applies DCT + BPE to preserve the autoregressive framework while reducing information loss and accelerating pretraining by 5x.

Advantages: LLM infrastructure (KV cache, quantization, speculative decoding, etc.) can be reused directly. Implementation is simple.
Disadvantages: Quantization errors accumulate, and multimodal action distributions (e.g., situations where an object can be rotated either left or right) are hard to represent. Sequential token-by-token generation is slow for high-frequency control.
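The quantization error mentioned above can be made tangible with a uniform 256-bin tokenizer. This is a minimal sketch under stated assumptions: the normalized range [-1, 1], the sample action vector, and the `encode`/`decode` helpers are illustrative and do not reproduce the exact binning scheme of any specific model.

```python
import numpy as np

N_BINS = 256

def encode(actions, low, high):
    """Map continuous actions to integer bin ids in [0, N_BINS - 1]."""
    actions = np.clip(actions, low, high)
    ids = np.floor((actions - low) / (high - low) * N_BINS).astype(int)
    return np.minimum(ids, N_BINS - 1)  # the upper edge falls in the last bin

def decode(ids, low, high):
    """Map bin ids back to the center of each bin."""
    width = (high - low) / N_BINS
    return low + (ids + 0.5) * width

low, high = -1.0, 1.0
# Hypothetical normalized 7-DoF action: one token per dimension.
actions = np.array([-1.0, -0.33, 0.0, 0.5, 1.0, 0.123, -0.9])

ids = encode(actions, low, high)
recon = decode(ids, low, high)
max_err = float(np.abs(recon - actions).max())
half_bin = (high - low) / N_BINS / 2

print(ids, max_err <= half_bin + 1e-12)
```

Each dimension loses at most half a bin width per step; the concern raised above is that these small per-step errors compound over a long trajectory.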

Diffusion Method

Probabilistic generative models such as DDPM, DDIM, Flow Matching, and VAE are used to sample from the action distribution. Diffusion Policy [17] was the first to demonstrate the potential of DDPM-based action generation. CogACT [23] reduced denoising steps using DDIM, and π0 [16] achieved faster and more stable generation via Flow Matching with ODE-based trajectories.

Advantages: Naturally represents multimodal distributions. Generates smooth trajectories. Operates directly in continuous space with no quantization error.
Disadvantages: Inference is slow due to multiple denoising steps (DDPM: 50–100 steps, DDIM: 10–20 steps, Flow Matching: 5–10 steps). Technical effort is required to ensure training stability.
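The link between step count and inference cost is visible in a bare Euler integration of a flow. The sketch assumes a convention in which t runs from 0 (noise) to 1 (action) and replaces the learned velocity network with a hypothetical closed-form field `toy_velocity` that points at a single known target, so it illustrates the sampling loop only, not training and not multimodality.

```python
import numpy as np

def toy_velocity(x, t, target):
    # Hypothetical stand-in for a learned velocity network: along the
    # straight path from noise to target, the remaining displacement
    # must be covered in the remaining time (1 - t).
    return (target - x) / (1.0 - t)

def sample_action(target, num_steps, rng):
    x = rng.standard_normal(target.shape)  # start from Gaussian noise
    dt = 1.0 / num_steps
    for i in range(num_steps):             # one network call per step
        t = i * dt
        x = x + dt * toy_velocity(x, t, target)
    return x

rng = np.random.default_rng(0)
target = np.array([0.2, -0.5, 0.1, 0.0, 0.3, -0.1, 1.0])  # hypothetical 7-DoF action
action = sample_action(target, num_steps=10, rng=rng)

print(np.round(action, 6))
```

Every step costs one forward pass of the action head, so moving from 50–100 steps to 5–10 steps cuts sampling cost roughly in proportion.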

Discrete Diffusion Method

Discrete diffusion VLAs are a novel paradigm proposed concurrently by four independent ICLR 2026 submissions. Unlike conventional diffusion methods that operate in continuous space, discrete diffusion is applied directly to tokenized action sequences, generating discrete tokens in parallel rather than one at a time as autoregressive methods do. dVLA [65], DIVA, and UNIFIED DIFFUSION VLA belong to this category, reporting 95–98% success rates on LIBERO. This approach attempts to combine the interpretability of autoregressive methods with the multimodality of diffusion, and has emerged as one of the most active research directions in 2026.

Reinforcement Learning (RL) Method

The policy is directly optimized using a reward signal. Pure RL-based VLAs are rare, but the pattern of fine-tuning BC-pretrained VLAs with RL is rapidly emerging. Representative examples include GRPO (Group Relative Policy Optimization), RLVF (Reinforcement Learning from Visual Feedback), and π*0.6 [157].

Hybrid Method

Combines the strengths of autoregressive and diffusion methods. HybridVLA [79] proposes a dual-decoding structure in which high-level semantic tokens are generated autoregressively and low-level continuous actions are generated via diffusion. UniVLA [80] combines latent world model representations and action generation within a single framework.

Specialized Method

Architectures designed for the requirements of specific domains: 3D point cloud input (PointVLA [77], 3D-VLA [37]), tactile sensor integration (ForceVLA [78], TactileVLA), cross-embodiment learning (Octo [25], CrossFormer [50]), and autonomous driving specialization (EMMA [46], DriveVLM [91]).

Cross-Action-Space Learning: Research on bridging the action-space differences among heterogeneous robot morphologies (arms, humanoids, mobile robots, etc.) is also gaining momentum. X-VLA [53] conditions on heterogeneous embodiments via soft prompting tokens, XR-1 introduces Unified Vision-Motion Codes (UVMC), and HiMoE-VLA [54] assigns action-space-specific experts through a hierarchical Mixture-of-Experts architecture.

Autonomous Driving VLA Taxonomy (Hu et al., 2025): Domain-specific VLA taxonomies are also advancing, particularly for autonomous driving. Hu et al. distinguish two AD-VLA paradigms: (1) End-to-End VLA — integrating perception, reasoning, and planning within a single model (with sub-categories of textual action vs. numerical action), and (2) Dual-System VLA — separating slow deliberation (VLM) from fast safe execution (planner) (with sub-categories of explicit guidance vs. implicit representation transfer). While structurally similar to Liu & Shao [5]'s monolithic/hierarchical classification, this taxonomy is differentiated by placing the safety–real-time trade-off as its central axis.

Compared with Jiang et al. [10], Hu et al. [41] provide the more granular AD-VLA taxonomy: the sub-categories above reflect the safety-efficiency trade-offs unique to the autonomous driving domain. They also propose WorldBench, a unified evaluation platform that enables open-loop and closed-loop assessment within a single framework.

3.3 Anatomical Perspective — Based on Xu et al. (2025)

Xu et al. [8]'s survey dissects VLA through a biological analogy. Their framework consists of three organs:

Perception — the robot's sensory organs: visual encoders (SigLIP, DINOv2 [30], CLIP [27]), proprioception encoders (proprioception MLP), and tactile/depth/force sensors convert information from the external world into internal representations.
Brain — the central nervous system: the VLM backbone integrates perceptual information and language instructions to perform "understanding" and "planning." The brain has evolved from pure transformers to VLMs and further to VLMs capable of CoT reasoning.
Action — the motor nervous system: translates the brain's intentions into physical movements. Discrete token heads, diffusion heads, and flow matching heads fall into this category.

The strength of this perspective is its intuitive clarity. Any VLA model can be described by three questions: "What eyes does it have?", "What brain does it have?", and "What hands does it have?" For example, π0 [16] is a model with "eyes of SigLIP, a brain of PaliGemma, and hands of Flow Matching."
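The three-question description maps naturally onto a small record type. The sketch below is illustrative only: the class `VLAAnatomy` and its field names are assumptions, the entries are limited to components named in this chapter, and an encoder the text does not specify is left as `None` rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class VLAAnatomy:
    name: str
    eyes: Optional[str]  # perception: visual encoder (None if not stated here)
    brain: str           # VLM backbone
    hands: str           # action head

    def describe(self) -> str:
        eyes = self.eyes if self.eyes is not None else "unspecified"
        return (f"{self.name}: eyes of {eyes}, a brain of {self.brain}, "
                f"and hands of {self.hands}")

MODELS = [
    VLAAnatomy("pi0", "SigLIP", "PaliGemma", "Flow Matching"),
    VLAAnatomy("RT-2", None, "PaLI-X", "discrete action tokens"),
    VLAAnatomy("OpenVLA", None, "Prismatic", "discrete action tokens"),
]

for model in MODELS:
    print(model.describe())

# Grouping by one organ exposes shared design choices across models.
same_hands = [m.name for m in MODELS if m.hands == "discrete action tokens"]
print(same_hands)
```

Filtering on a single field, as in the last lines, is the anatomical view's practical payoff: models that differ in backbone can still be compared by the kind of action head they share.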

Xu et al. [8]'s contribution extends beyond modular anatomy. Their central contribution is a systematic Five Challenges framework for VLAs: (1) Representation: how to unify visual, linguistic, and action representations, (2) Execution: how to achieve stable and precise action generation, (3) Generalization: how to guarantee transfer to novel environments, objects, and tasks, (4) Safety: how to prevent dangerous behaviors in the physical world, and (5) Dataset & Evaluation: how to design fair and reproducible benchmarking. These five challenges provide the systematic foundation for the discussions of generalization, efficiency, safety, and benchmarking in Sections 7–10.

3.4 Functional Perspective — Based on Kawaharazuka et al. (2024)

Reflecting the practical tradition of the Japanese robotics community, Kawaharazuka et al. [2] classify VLAs not by their components but by the functions they perform:

Low-level Perception: Raw sensory processing such as object detection, depth estimation, and pose estimation
High-level Perception: Semantic processing such as scene understanding, relational reasoning, and affordance recognition
High-level Planning: Task decomposition, subgoal setting, and abstract action sequence generation
Low-level Planning: Concrete trajectory generation, motor command planning, and collision avoidance
Data Augmentation: Simulation data generation, data augmentation via video prediction, language re-labeling, etc.

The distinctive value of this perspective is that it naturally captures the fact that a single model can simultaneously perform multiple functions. RT-2 [11] handles both high-level perception and low-level planning within a single model, while SayCan [14] specializes in high-level planning and delegates low-level execution to a separate policy.

3.5 Post-processing Perspective — Based on Jin et al. (2025)

Jin et al. [9]'s survey focuses on the process of adapting a pretrained VLM for robotic action generation. Their question is: "How do we transfer the knowledge a VLM already possesses to a robot?"

Environment Perception Enhancement: Strengthening the VLM's visual understanding for the robot's environment. This includes integrating depth information, handling multiple views, and adding temporal context. SpatialVLA [39]'s depth integration and HPT [96]'s multi-camera processing are representative examples.
Embodiment Awareness Improvement: Injecting the physical characteristics of the robot itself (joint structure, action space, dynamics) into the VLM. This includes proprioception tokenization, cross-embodiment learning, and robot-specific adapters.
Task Understanding Deepening: Elevating the understanding of language instructions from simple semantic matching to inferential comprehension. This includes CoT reasoning (ECoT [92], CoT-VLA [55]), subgoal decomposition, and visual reasoning.
Multi-component Integration: Methodologies that integrate the above three dimensions into a single framework. Strategies such as multitask learning, modular architectures, and continual learning are employed.

3.6 Meta-classification — Classifying the Taxonomies

Here we arrive at the central insight of this chapter. The five taxonomies above are not in competition with each other. They are aerial photographs of the same landscape taken from different altitudes. Just as an architect draws a building using structural, plumbing, electrical, and aerial plans, each survey captures a different cross-section of the complex system that is VLA.

Correspondences Among Classification Dimensions

The following table shows how each taxonomy describes the same models:

Model	Liu/Shao [5] (Architecture)	Zhong [3] (Action Generation)	Xu [8] (Anatomy)	Kawaharazuka [2] (Function)	Jin [9] (Post-processing)
RT-2 [11]	Single-system Monolithic	Autoregressive	PaLI-X Brain + Discrete Head	High-level Perception + Low-level Planning	Task Understanding Deepening
π0 [16]	Parallel Dual-system	Flow Matching (variant of Diffusion)	PaliGemma Brain + Flow Hand	High-level Perception + Low-level Planning	Multi-component Integration
OpenVLA [15]	Single-system Monolithic	Autoregressive	Prismatic Brain + Discrete Head	High-level Perception + Low-level Planning	Environment Perception Enhancement
GR00T N1 [21]	Cascade Dual-system	Diffusion (DiT)	Eagle-2 Brain + DiT Hand	Low-level Perception + Low-level Planning	Embodiment Awareness Improvement
CogACT [23]	Cascade Dual-system	Diffusion (DDIM)	CogVLM Brain + Diffusion Hand	High-level Perception + Low-level Planning	Environment Perception Enhancement
SayCan [14]	Hierarchical Planner-Only	N/A (no action generation)	LLM Brain Only	High-level Planning Specialized	Task Understanding Deepening
HybridVLA [79]	Dual-system	Hybrid (AR+Diffusion)	VLM Brain + Hybrid Hand	Low-level Planning Specialized	Multi-component Integration
CoT-VLA [55]	Single-system Monolithic	Autoregressive	VLM Brain + CoT + Discrete Head	High-level Planning + Low-level Planning	Task Understanding Deepening
SpatialVLA [39]	Single-system Monolithic	Autoregressive	Depth-enhanced Eye + VLM Brain	Low-level Perception + Low-level Planning	Environment Perception Enhancement

Complementary Classification Axes

The pattern revealed by this table is clear. Each taxonomy describes complementary axes:

Structural axis (Liu/Shao [5]): "In what topology are the modules connected?" — monolithic vs. dual, cascade vs. parallel, hierarchical

Generative axis (Zhong): "By what mathematical mechanism are actions produced?" — AR, diffusion, RL, hybrid

Anatomical axis (Xu): "What are the constituent components?" — which encoder, which VLM, which action head

Functional axis (Kawaharazuka): "What cognitive functions does the system perform?" — perception, planning, execution, augmentation

Adaptation axis (Jin): "In what dimension was the pretrained knowledge supplemented?" — perception, embodiment, task, integration

Accordingly, every VLA model can be represented as a point in this 5-dimensional coordinate space. For example, the coordinates of π0 [16] are:

π0 [16] = (Parallel Dual-system, Flow Matching, PaliGemma+SigLIP+FlowHead, High-level Perception+Low-level Planning, Multi-component Integration)

The practical value of this meta-classification is threefold. First, when a new VLA model appears, it can be immediately positioned on the five axes. Second, unexplored combinations (e.g., "hierarchical structure + Flow Matching + tactile enhancement") can be systematically identified. Third, conclusions from different surveys can be integrated and interpreted without contradiction.
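The coordinate idea can be made concrete with a small data structure. The following is a minimal sketch, not part of any surveyed work: the class name `VLACoordinates` and the helper `differing_axes` are hypothetical, and the axis values are copied from the correspondence table above.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class VLACoordinates:
    """One VLA model as a point on the five classification axes."""
    structure: str    # Liu/Shao: module topology
    generation: str   # Zhong: action generation mechanism
    anatomy: str      # Xu: constituent components
    function: str     # Kawaharazuka: functions performed
    adaptation: str   # Jin: how pretrained knowledge was supplemented


# Values copied from the correspondence table.
PI0 = VLACoordinates(
    "Parallel Dual-system", "Flow Matching", "PaliGemma Brain + Flow Hand",
    "High-level Perception + Low-level Planning", "Multi-component Integration")
RT2 = VLACoordinates(
    "Single-system Monolithic", "Autoregressive", "PaLI-X Brain + Discrete Head",
    "High-level Perception + Low-level Planning", "Task Understanding Deepening")


def differing_axes(a, b):
    """Return the names of the axes on which two models differ."""
    return [f.name for f in fields(a) if getattr(a, f.name) != getattr(b, f.name)]


print(differing_axes(PI0, RT2))
```

Comparing two models this way shows at a glance which axes separate them; here RT-2 and π0 share only the functional coordinate.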

Meta-patterns Within the Taxonomies Themselves

Stepping back further to observe the taxonomies themselves, an interesting meta-pattern emerges:

Structure-centric taxonomies (Liu/Shao [5], Xu) answer: "How do you build this model?" — the engineer's perspective
Process-centric taxonomies (Zhong, Jin) answer: "How do you train this model?" — the researcher's perspective
Function-centric taxonomy (Kawaharazuka) answers: "What can this model do?" — the user's perspective

The convergence of these three perspectives is itself an indicator of the maturation of VLA research. As a field matures, the methods of building, training, and deploying models develop independently while forming a mutually coherent system.

Section 4: In-Depth Anatomical Dissection of VLA Architecture

"VLA is an organism composed of three modules. The eyes see the world, the brain understands it, and the hands act. In this chapter, each organ is placed on the dissection table."

4.1 Perception Module — The Robot's Eyes

The first module of a VLA is the perception module, which transforms raw sensory data into meaningful internal representations. Just as the human visual cortex converts photons from the retina into concepts like "red cup," "on the table," and "tilted," the perception module transforms pixel arrays into token sequences interpretable by the robot.

4.1.1 Language-supervised Encoders

CLIP [27] (Contrastive Language-Image Pretraining) and SigLIP (Sigmoid Loss for Language-Image Pretraining) are encoders trained via contrastive learning on image-text pairs. Because they were trained on hundreds of millions of image-caption pairs, their visual representations are inherently semantic. "Red cup" and "blue cup" are close together, while "cup" and "plate" are moderately distant. These characteristics make them naturally suited for VLAs conditioned on natural language instructions.

SigLIP replaces CLIP [27]'s softmax contrastive loss with a sigmoid loss, reducing dependence on batch size and improving training efficiency. As of 2024–2025, the trend of SigLIP replacing CLIP [27] is pronounced.
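The difference between the two losses can be illustrated with a short NumPy sketch. It assumes L2-normalized embeddings; the temperature, scale, and bias values are illustrative defaults rather than settings taken from this survey, and the function names are hypothetical.

```python
import numpy as np


def softmax_contrastive_loss(img, txt, temperature=0.07):
    """CLIP-style loss: every row and column is normalized over the whole batch."""
    logits = img @ txt.T / temperature

    def cross_entropy_on_diagonal(l):
        l = l - l.max(axis=1, keepdims=True)                 # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average of image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))


def sigmoid_pairwise_loss(img, txt, scale=10.0, bias=-10.0):
    """SigLIP-style loss: each image-text pair is an independent binary problem."""
    logits = scale * (img @ txt.T) + bias
    labels = 2.0 * np.eye(len(img)) - 1.0                    # +1 matched, -1 unmatched
    # -log(sigmoid(z)) equals log(1 + exp(-z)); logaddexp keeps it stable.
    return np.sum(np.logaddexp(0.0, -labels * logits)) / len(img)
```

The softmax version needs the full batch to compute each normalizer, whereas every term of the sigmoid version depends on a single pair, which is why the latter is less sensitive to batch size.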

Advantages: Rich semantic alignment, optimal for language conditioning, benefits from large-scale pretraining
Limitations: Lacks geometric precision. Tends to miss fine-grained spatial information such as "which direction does the cup handle face?"

4.1.2 Self-supervised Encoders

DINOv2 [30] is a ViT [28] encoder trained with masked image modeling and self-distillation. Because it learned from the structure of images themselves without language supervision, its representations are geometric. Object boundaries, surface normals, and spatial layouts are encoded with precision.

Advantages: High geometric precision. Outperforms semantic encoders in contact-rich manipulation (pick-and-place, insertion, etc.). Captures subtle differences in texture and material.
Limitations: No alignment with language, so an additional bridge is required for language-conditioned tasks such as "pick up the red cup."

4.1.3 Hybrid SigLIP + DINOv2 — The Current Dominant Standard

Since late 2024, hybrid encoding that uses SigLIP and DINOv2 [30] together has become the de facto standard. OpenVLA [15] (through its Prismatic VLM backbone), OpenVLA-OFT, GraspVLA [81], UniVLA [80], and others adopt this combination.

Why does this combination dominate? The answer lies in complementarity. SigLIP encodes "what," while DINOv2 [30] encodes "where and how." When processing the instruction "pick up the red cup," SigLIP contributes to identifying the semantic entity "red cup" in the scene, while DINOv2 [30] contributes to determining the orientation of the cup's handle and its precise location. The outputs of the two encoders are typically concatenated at the token level or integrated through a projection layer.
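One plausible form of this fusion is sketched below, assuming both encoders emit the same number of patch features. The random arrays stand in for real encoder outputs, the dimensions are illustrative, and the single matrix `W` stands in for a learned projector.

```python
import numpy as np

rng = np.random.default_rng(0)
num_patches, d_siglip, d_dino, d_llm = 16, 1152, 1024, 256   # illustrative sizes

siglip_feats = rng.normal(size=(num_patches, d_siglip))  # stand-in for SigLIP patch features
dino_feats = rng.normal(size=(num_patches, d_dino))      # stand-in for DINOv2 patch features

# Fuse per patch: concatenate along the feature axis so that every visual
# token carries both the semantic and the geometric description of its patch.
fused = np.concatenate([siglip_feats, dino_feats], axis=-1)

# A linear map stands in for the learned projector into the language model width.
W = rng.normal(scale=0.02, size=(d_siglip + d_dino, d_llm))
visual_tokens = fused @ W
print(visual_tokens.shape)  # (16, 256)
```

Concatenating along the feature axis keeps the token count unchanged, so the sequence length seen by the language model does not grow when the second encoder is added.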

Empirically, the superiority of this combination has been confirmed repeatedly. In the Prismatic VLM paper, a comparison of SigLIP alone, DINOv2 [30] alone, and the SigLIP + DINOv2 [30] combination showed that the hybrid consistently outperformed both single-encoder variants on all benchmarks.

4.1.4 Using the Full VLM as an Encoder

Some models use an entire pretrained VLM as the encoder rather than a separate visual encoder. RT-H uses PaLI-X, π0 [16] uses PaliGemma, and VTLA uses Qwen-VL. The advantage of this approach is that the VLM already performs visual-language integration intrinsically, eliminating the need for a separate fusion module. The drawback is high computational cost, which is mitigated using efficient fine-tuning methods such as LoRA and QLoRA.

4.1.5 The Persistence of CNNs

Although ViT [28]-based encoders are mainstream, CNNs (ResNet, EfficientNet) have not disappeared. RT-1 [12] used EfficientNet-B3, and some lightweight models (LiteVLA [63], etc.) still choose CNNs for computational efficiency. In environments with extremely strict real-time constraints (1 kHz control loops in industrial robots) or on edge devices, the deterministic inference speed and small memory footprint of CNNs remain valid advantages.

4.1.6 Multimodal Perception

State-of-the-art VLAs integrate various sensory modalities beyond RGB cameras:

Depth: SpatialVLA [39] encodes depth information as a separate channel to enhance 3D spatial understanding. Using the output of monocular depth estimation networks (MiDaS, Depth Anything) as additional input is also widely adopted.
Tactile: TactileVLA processes GelSight tactile images alongside visual tokens. Tactile sensing is essential for delicate manipulation that requires "grasping firmly enough not to slip without gripping too hard."
Force: ForceVLA [78] processes 6-axis force/torque sensor data, and OmniVTLA integrates vision, touch, and language in a single framework, encoding force profiles as temporal sequences.
Audio: AudioCLIP-based auditory encoders are being explored experimentally. They have potential applications in auditory conditional behaviors such as "stop when you hear a click."

4.1.7 Proprioception Processing

Proprioceptive information such as the robot's joint angles, velocities, and end-effector position is typically converted into a fixed-dimensional vector through an MLP (Multi-Layer Perceptron). There are two main approaches for integrating this vector with visual-language representations:

Concatenation: The proprioceptive vector is simply appended to the visual/language tokens. This is the straightforward approach adopted by most models, and is easy to implement.
FiLM Conditioning: Feature-wise Linear Modulation, in which proprioceptive information modulates the scale and bias of visual features. Since proprioception directly influences visual processing, information integration is tighter. Octo [25] and HPT [96] use this approach.
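The two integration strategies can be contrasted in a few lines of NumPy. This is a sketch under simplifying assumptions: the weight matrices `W_gamma`, `W_beta`, and `W_embed` are hypothetical stand-ins for learned layers, and a single linear map replaces the MLP.

```python
import numpy as np


def film(visual_feats, proprio, W_gamma, W_beta):
    """Feature-wise Linear Modulation: scale and shift each visual channel
    using parameters predicted from the proprioceptive vector."""
    gamma = proprio @ W_gamma          # (channels,) per-channel scale
    beta = proprio @ W_beta            # (channels,) per-channel bias
    return gamma * visual_feats + beta  # broadcast over all visual tokens


def concat_token(visual_feats, proprio, W_embed):
    """Baseline: embed proprioception as one extra token and append it."""
    proprio_token = (proprio @ W_embed)[None, :]
    return np.concatenate([visual_feats, proprio_token], axis=0)
```

With concatenation the proprioceptive state is one more token that attention may or may not use; with FiLM it rescales every visual feature directly, which is the tighter coupling described above.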

4.2 Brain Module — The VLM Becomes the Robot's Brain

If the perception module is the "eyes," the brain module is the "central nervous system" that understands, reasons about, and plans from the perceived information. The VLA brain has undergone four stages of evolution from 2022 to 2025.

4.2.1 Four Stages of Evolution

Stage 1: Pure Transformer (2022–2023)

Gato [13] (DeepMind, 2022) was the first "generalist agent" to handle text, images, Atari games, and robot control with a single transformer. VIMA [45] proposed a transformer for understanding multimodal prompts, and GR-1 [51] proposed a transformer combining video generation and action prediction. The "brain" of this era was a transformer trained from scratch without general-purpose pretraining. Because it learned solely from robot data, it had fundamental limitations in language comprehension and visual commonsense reasoning.

Stage 2: Diffusion Transformer / DiT (2023–2024)

RDT-1B [24] (Robotics Diffusion Transformer) used a 1.2B-parameter DiT as the central architecture for action generation. TriVLA placed DiT as the core action generator in a triple system. DiT combined the scalability of transformers with the multimodal expressiveness of diffusion, enabling complex action distributions to be learned with large-scale models.

Stage 3: VLM + Generative Head (2024)

π0 [16] (Physical Intelligence, 2024) was the decisive turning point. It used PaliGemma (SigLIP + Gemma 2B) as the "brain" and attached a separate Flow Matching head as the "hand." (Note: Prismatic VLM, which combines SigLIP + DINOv2 with a Llama 2 7B language model, is the backbone of OpenVLA [15], not of π0.) This was the first commercially successful instance of directly transferring a VLM's extensive visual-language knowledge to serve as a robot's brain. The key insight of this stage was: "There is no need to build a robot's brain from scratch; simply take a VLM already trained on internet-scale knowledge and connect a hand to it."

Stage 4: Fully VLM-based Brain (2024–2025)

The paradigm of "using a VLM directly as a robot policy," which began with RT-2 [11], matured through OpenVLA [15], π0.5 [31], CoT-VLA [55], and SafeVLA [75]. Models in this lineage directly leverage the VLM's language generation capability for action generation. CoT-VLA [55] goes one step further, explicitly outputting a reasoning process in natural language before generating actions. SafeVLA [75] internalizes safety constraints within the VLM's reasoning process.

4.2.2 Reasoning Paradigms

The VLA "brain" is evolving beyond simple reflex to reasoning.

Chain-of-Thought (CoT) Reasoning: A reasoning process is output in natural language before generating actions. ECoT [92] (Embodied Chain-of-Thought) generates reasoning chains such as "1. The red cup is on the left side of the table. 2. The gripper is currently on the right side of the table. 3. I must first move to the left." before outputting an action. CoT-VLA [55] trained this paradigm at scale and empirically demonstrated that reasoning improves action performance.

At ICLR 2026, Embodied Chain-of-Thought (ECoT [92]) emerged as a major trend. ACTIONS AS LANGUAGE [98], InstructVLA, and EMBODIED-R1, among others, integrate spatially-grounded reasoning with action prediction, moving beyond purely textual reasoning to introduce reasoning processes directly grounded in the visual scene into VLAs.

ReAct Paradigm: Reasoning and Acting are performed in alternation. The iterative loop of "observation → reasoning → action → observation → ..." allows environmental feedback to be incorporated into reasoning.

Visual Subgoal Prediction: Instead of language, future images are predicted to visually "imagine" what state should come next. SuSIE [59] and UniPi [93] explored this approach.

4.2.3 World Model Integration

World model integration endows the VLA's brain with "imagination." Two directions exist:

Policy Enhancement: Future predictions generated by the world model are used as auxiliary data or additional inputs for policy learning. UniVLA [80] predicts future states in latent space, and these predictions guide action generation. In WorldVLA [76], the video prediction module is jointly trained with the policy network.

Explicit Planning: The world model is used to simulate multiple possible futures, and the action leading to the most favorable future is selected. LUMOS [94] performs tree search with a latent space world model, and MinD [95] executes mental simulation within a latent world model.

The difference between these two directions lies in the role of the world model. In policy enhancement, the world model is an "advisor"; in explicit planning, the world model is a "simulator." The current trend is convergence of the two — toward a flexible architecture in which world model predictions directly intervene in policy action generation, while explicit simulation-based planning remains available for complex situations.
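The "simulator" role can be illustrated with the simplest possible planner. The sketch below uses random shooting, which is swapped in here for brevity and is not the tree search or latent simulation of the cited systems; `world_model` and `score` are toy stand-ins for learned components.

```python
import numpy as np


def world_model(state, action):
    """Hypothetical learned dynamics; here a toy 1-D integrator."""
    return state + action


def score(state, goal):
    """Higher is better: negative distance to the goal."""
    return -abs(goal - state)


def plan(state, goal, horizon=5, num_candidates=256, seed=0):
    """Random-shooting planner: imagine many futures, keep the best first action."""
    rng = np.random.default_rng(seed)
    candidates = rng.uniform(-1.0, 1.0, size=(num_candidates, horizon))
    best_action, best_score = 0.0, -np.inf
    for sequence in candidates:
        imagined = state
        for action in sequence:
            imagined = world_model(imagined, action)   # roll out in imagination
        if score(imagined, goal) > best_score:
            best_action, best_score = sequence[0], score(imagined, goal)
    return best_action
```

Only the first action of the best imagined sequence is executed; replanning at the next step lets fresh observations correct errors in the model.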

This world model integration can be situated within a broader decision-making framework. The Large Model Embodied AI survey [44] divides large-model-based embodied AI into hierarchical and end-to-end decision-making paradigms. In the hierarchical approach, high-level planning (VLM/LLM) and low-level control (specialist policies) are separated, yielding higher interpretability and safety but incurring inter-module information loss. In the end-to-end approach, a single model directly maps from perception to action, maximizing expressiveness but complicating debugging and safety assurance. The world model serves as a third axis that bridges these two paradigms — by endowing end-to-end models with plan-through-imagination capabilities, it absorbs the strengths of hierarchical architectures (safety, long-horizon reasoning) while preserving the benefits of a unified representation space.

4.3 Action Module — From Intention to Movement

After the brain decides "I should pick up the red cup," the action module's role is to translate this intention into actual motor commands. The fundamental challenge here is how to effectively represent and generate a continuous, high-dimensional action space.

4.3.1 Discrete Tokenization — The RT-2 Approach

This is the approach pioneered by RT-2 [11]. Continuous action values (e.g., joint angle 0.732 rad) are quantized into integer bins between 0 and 255, added to the VLM's vocabulary, and treated identically to language tokens. For a 7-DoF robot arm, the action at each timestep is represented as 7 tokens (plus 1 for gripper open/close).

Advantages: Reuses the existing infrastructure of language models (tokenizer, generation algorithms, KV cache) as-is. Implementation is intuitive and simple. Since language generation and action generation share the same decoding process, multitask learning is natural.
Disadvantages: 256-bin quantization limits the precision of the action space (1/256 ≈ 0.4% resolution). Multimodal action distributions cannot be represented — the model can only output a single mode, resulting in an averaging problem where a situation with two valid choices (rotate left or rotate right) yields "no rotation" as output. The autoregressive nature introduces latency proportional to the number of tokens.
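The binning step and its resolution limit can be sketched as follows. Uniform bins over a fixed range are an assumption made for illustration; the function names are hypothetical and the mapping of bin indices onto vocabulary entries is omitted.

```python
import numpy as np

NUM_BINS = 256


def tokenize(action, low, high):
    """Map continuous values in [low, high] to integer bins 0..255."""
    unit = (np.clip(action, low, high) - low) / (high - low)
    return np.minimum((unit * NUM_BINS).astype(int), NUM_BINS - 1)


def detokenize(tokens, low, high):
    """Map each bin index back to the center of its bin."""
    return low + (tokens + 0.5) / NUM_BINS * (high - low)


low, high = -np.pi, np.pi
joint_angle = np.array([0.732])
token = tokenize(joint_angle, low, high)
recovered = detokenize(token, low, high)
print(token, recovered, abs(recovered - joint_angle))
```

The round-trip error is bounded by half a bin width, (high - low) / 512, which is the precision ceiling referred to above.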

4.3.2 Diffusion Policy — Stochastic Action Generation

DDPM (Denoising Diffusion Probabilistic Model): The approach pioneered by Diffusion Policy [17] (Chi et al., 2023), generating action sequences through iterative denoising starting from pure Gaussian noise. While 50–100 denoising steps are required, it faithfully represents multimodal distributions.

DDIM (Denoising Diffusion Implicit Model): The approach adopted by CogACT [23], which reduces denoising steps to 10–20 through a deterministic sampling process. It provides a trade-off between quality and speed.

Flow Matching: The approach adopted by π0 [16], modeling the path from noise to data as an ODE (ordinary differential equation). Training is more stable than DDPM/DDIM, and high-quality samples are generated in 5–10 steps. Rectified Flow learns approximately straight paths, further reducing the number of steps.

The common advantage of diffusion-based methods is multimodal distribution representation. They can simultaneously represent both modes of "the cup can be rotated left or right," sampling one at execution time. They naturally generate smooth trajectories and operate directly in continuous space with no quantization error.
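The few-step ODE sampling used by Flow Matching can be sketched with forward Euler integration. To keep the example self-contained, a closed-form velocity field for straight noise-to-target paths replaces the trained network; `toy_velocity` and `sample_action` are hypothetical names, and a real policy would condition the velocity on observations.

```python
import numpy as np


def toy_velocity(x, t, target):
    """Closed-form velocity for straight paths x_t = (1 - t) * noise + t * target.
    A trained network v(x, t, observation) would take the place of this function."""
    return (target - x) / (1.0 - t)


def sample_action(target, num_steps=10, seed=0):
    """Integrate dx/dt = v(x, t) from t = 0 (noise) to t = 1 (action)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=target.shape)       # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt                          # t stays below 1, so no division by zero
        x = x + dt * toy_velocity(x, t, target)   # forward Euler step
    return x
```

Because the paths in this toy case are exactly straight, Euler integration reaches the target with any number of steps; learned velocity fields are only approximately straight, which is why a handful of steps is still used in practice.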

4.3.3 FAST Tokenization — Innovation in the Frequency Domain

FAST [20] (Frequency-space Action Sequence Tokenization) is an attempt to combine the simplicity of the autoregressive approach with the precision of continuous action representation. The core idea is to transform continuous action chunks into the frequency domain before tokenizing.

The specific process is as follows:

Transform a continuous action sequence (e.g., 16 timesteps × 7 DoF) into the frequency domain using DCT (Discrete Cosine Transform)

Remove high-frequency components (mostly noise) for compression

Apply BPE (Byte Pair Encoding) to the compressed frequency coefficients to convert them into discrete tokens

Predict these tokens using LLM autoregressive generation
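The first two steps above can be sketched in NumPy. This is an illustration under stated assumptions, not the FAST implementation: the DCT is built by hand to avoid extra dependencies, scale-and-round quantization with an arbitrary `scale` stands in for the compression step, and the BPE stage is omitted.

```python
import numpy as np


def dct_matrix(n):
    """Orthonormal DCT-II basis; rows are frequencies, columns are timesteps."""
    k = np.arange(n)[:, None]
    t = np.arange(n)[None, :]
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (t + 0.5) * k / n)
    m[0] /= np.sqrt(2.0)
    return m


def compress_chunk(chunk, scale=10.0):
    """DCT along time, then scale-and-round so small coefficients become 0."""
    coeffs = dct_matrix(len(chunk)) @ chunk       # (T, DoF) frequency coefficients
    return np.round(scale * coeffs).astype(int)


def decompress_chunk(quantized, scale=10.0):
    """Undo the scaling and invert the orthonormal DCT (inverse = transpose)."""
    return dct_matrix(len(quantized)).T @ (quantized / scale)


T, dof = 16, 7
time = np.linspace(0.0, 1.0, T)[:, None]
chunk = np.sin(2 * np.pi * time * np.linspace(0.5, 1.0, dof))   # smooth toy trajectory
quantized = compress_chunk(chunk)
reconstructed = decompress_chunk(quantized)
print("nonzero coefficients:", np.count_nonzero(quantized), "of", quantized.size)
print("max reconstruction error:", np.abs(reconstructed - chunk).max())
```

For smooth trajectories most of the energy sits in a few low-frequency coefficients, so many entries round to zero and the integer sequence handed to the tokenizer is highly compressible.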

The reported results are strong. According to the FAST authors, a π0 model trained with FAST tokens matches the performance of the diffusion-based π0 while reducing training time by up to 5×, with little information loss from the compression. Compression in the frequency domain is far more efficient than the per-timestep time-domain quantization of the RT-2 [11] approach, and the significantly reduced token count also alleviates the speed problem of autoregressive generation.

4.3.4 Normalizing Flows — The Dream of a Single Step

NinA (Normalizing Flows in Action) applies Normalizing Flows to action generation. Unlike diffusion, the distribution is modeled as a composition of invertible transformations, so an action can be sampled in a single forward pass. Because no iterative denoising is required, inference is very fast.
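The single-pass property can be shown with the smallest possible flow, one invertible affine layer. This is a toy sketch, not NinA's architecture: in a real model `mu` and `log_sigma` would be predicted from the observation, and many such layers would be stacked to represent non-Gaussian distributions.

```python
import numpy as np


def sample(mu, log_sigma, rng):
    """One forward pass: push Gaussian noise through an invertible affine map."""
    z = rng.normal(size=mu.shape)
    return mu + np.exp(log_sigma) * z


def log_prob(action, mu, log_sigma):
    """Exact likelihood via the change-of-variables formula."""
    z = (action - mu) * np.exp(-log_sigma)              # inverse transform
    base = -0.5 * np.sum(z ** 2 + np.log(2 * np.pi))    # standard normal log-density
    return base - np.sum(log_sigma)                     # subtract log|det Jacobian|
```

Invertibility gives both properties at once: sampling needs no iteration, and the exact log-likelihood is available for training.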

4.3.5 Frequency-domain Flow Matching — FreqPolicy

FreqPolicy performs Flow Matching in the frequency domain to achieve single-step inference. Action sequences are decomposed into frequency components, and a flow is learned in frequency space. Because complex multimodal distributions in the time domain have simpler structure in the frequency domain, sufficient-quality samples can be generated with fewer steps.

4.3.6 Action Chunking — Separating Temporal Scales

Action Chunking is a strategy that separates the temporal scales of action generation into a semantic level and a motor level. At the high level (low frequency), the "semantic direction of the next chunk" is decided autoregressively; at the low level (high frequency), "the fine-grained trajectory within the chunk" is generated in parallel.

ACT (Action Chunking with Transformers) popularized this concept. By generating action sequences of 16–32 timesteps in chunk units with a single prediction, it alleviates the myopic action problem of single-step prediction. Subsequent research has extended this to smoothness of inter-chunk transitions, variable-length chunks, and hierarchical chunking.
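The basic control flow of chunked execution can be sketched as below. The function and the callables passed to it are hypothetical, and refinements such as smoothing between chunks are left out.

```python
def run_with_chunking(policy, observe, step, num_steps, chunk_size=16):
    """Query the policy once per chunk, then execute the chunk open-loop."""
    policy_calls = 0
    t = 0
    while t < num_steps:
        chunk = policy(observe(), chunk_size)    # one forward pass, chunk_size actions
        policy_calls += 1
        for action in chunk[: num_steps - t]:    # do not run past the episode end
            step(action)
            t += 1
    return policy_calls


executed = []
calls = run_with_chunking(
    policy=lambda obs, k: [obs] * k,     # toy policy: repeat the observation k times
    observe=lambda: len(executed),       # toy observation: number of actions so far
    step=executed.append,
    num_steps=50,
)
print(calls, len(executed))  # 4 50
```

Fifty control steps require only four policy queries here, which is the source of both the latency savings and the reduced myopia.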

4.4 Dual-System Architecture — When Cognitive Science Meets Robotics

4.4.1 Kahneman's Dual-Process Theory

The dual-process theory presented by Daniel Kahneman in Thinking, Fast and Slow (2011) argues that human thought is composed of two systems:

System 1: Fast, automatic, low-effort intuitive thinking. "A ball comes flying and you reach out your hand to catch it."
System 2: Slow, conscious, high-effort analytical thinking. "Calculating the next move in chess."

The application of this theory to robotics is intuitively compelling. Robots also require two kinds of "thinking":

Fast reflex (10–120 Hz): Motor control for avoiding obstacles, quickly re-grasping slipping objects, and maintaining smooth trajectories
Slow deliberation (1–10 Hz): High-level reasoning for determining "which object to pick up?", "in what order to perform the task?", and "is this situation safe?"

The key insight is that these two systems operate on different time scales. System 2 does not need to run every frame, and System 1 does not need to perform complex reasoning.

4.4.2 Implementing the Robotic Dual System

System 1 — Action Expert: Handled by fast generative models such as diffusion policies, Flow Matching, and lightweight MLPs. Runs at 10–120 Hz to generate smooth and responsive motor commands. This module generates actions directly from a given context embedding, without the heavy reasoning of the VLM.

System 2 — VLM-based Reasoning/Planning: Handled by large-scale VLMs. Runs at 1–10 Hz to understand scenes, decompose tasks, and verify safety conditions. The output of this module is passed to the action expert as context — a directive of "what to do."

Asynchronous Execution: The core of the two systems is that they run at independent frequencies. While System 2 is reasoning about the next plan, System 1 continues to generate actions based on the previous plan. This asynchrony makes the slow inference speed of VLMs compatible with real-time control.
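The rate separation can be sketched with a single-threaded simulation. This is a simplification: a real system would run the planner in a separate thread or process so that its latency never blocks the control loop. The function name, the callables, and the 50 Hz / 5 Hz rates are illustrative assumptions.

```python
def run_dual_system(slow_planner, fast_controller, observe, num_ticks,
                    fast_hz=50, slow_hz=5):
    """Simulate two rates: the planner refreshes its context every
    fast_hz // slow_hz ticks, while the controller acts on every tick."""
    period = fast_hz // slow_hz
    context, actions = None, []
    for tick in range(num_ticks):
        obs = observe(tick)
        if tick % period == 0:
            context = slow_planner(obs)                 # System 2: slow, infrequent
        actions.append(fast_controller(obs, context))   # System 1: uses latest context
    return actions


actions = run_dual_system(
    slow_planner=lambda obs: obs,               # toy planner: remember when it last ran
    fast_controller=lambda obs, ctx: (obs, ctx),
    observe=lambda tick: tick,
    num_ticks=25,
)
print(actions[9], actions[10])  # (9, 0) (10, 10)
```

Between planner updates the controller keeps acting on a stale context, which is exactly the trade that makes slow VLM inference compatible with real-time control.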

4.4.3 Implementation Examples

GR00T N1 [21] (NVIDIA): The Eagle-2 VLM (System 2) processes camera images and language instructions to produce context embeddings. These embeddings are passed to the DiT-based action expert (System 1), which generates action chunks via Flow Matching. This is a classic cascade structure following sequential information flow from System 2 → System 1.

π0 [16] (Physical Intelligence): The tokens from the PaliGemma VLM (System 2) and the tokens from the Flow Matching action expert (System 1) are processed simultaneously in shared attention blocks. In this parallel structure, the two systems share the same transformer layers and have access to each other's information.

MinD [95]: A latent world model plays the role of System 2, "mentally simulating" future states. The action policy (System 1) generates actions based on the results of this simulation. What is distinctive here is that System 2 is not linguistic reasoning but predictive simulation in latent space.

TriVLA: Proposes a triple system. The VLM (System 3, slowest) handles strategic planning, a DiT (System 2, intermediate) handles tactical action generation, and a lightweight executor (System 1, fastest) handles real-time correction. This extends the dual system to a triple system for finer-grained separation of time scales.

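Hume: Adds value-guided System 2 thinking to a dual-system VLA. System 2 samples multiple candidate actions and selects one using an estimated state-action value, and a lightweight System 1 policy refines the selected action through cascaded action denoising.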

4.4.4 Key Design Choice: Shared Attention vs. Cascade

The most important design choice in a dual-system architecture is the direction of information flow between the two systems.

Cascade (GR00T N1 [21] approach):

Information flows unidirectionally from System 2 → System 1
System 1 cannot provide feedback to System 2
Advantage: Clear module separation allows each system to be trained/replaced independently. The VLM of System 2 can be upgraded while keeping the action expert of System 1 intact.
Disadvantage: Since information from the action generation process is not reflected in reasoning, System 2 cannot respond to subtle changes that occur during action execution.

Shared Attention (π0 [16] approach):

Information flows bidirectionally between System 1 ↔ System 2
Language/visual tokens and action tokens interact within the same attention mechanism
Advantage: The two systems can reference each other's state, enabling close cooperation in which action generation influences reasoning and reasoning guides action.
Disadvantage: Unclear module separation means that replacing one system may require retraining the other. The computational cost of shared attention is high.

This choice reflects a fundamental trade-off between modularity and integration. Cascade provides the flexibility to swap components like Lego blocks, while shared attention creates a tightly integrated system like an organism.

As of 2025, both approaches coexist in the research community. Models that have achieved commercial success (π0 [16], π0.5 [31]) lean toward shared attention, while models prioritizing platform/module replaceability (GR00T N1 [21]) lean toward cascade. Which will ultimately prevail remains an open question. What is clear, however, is that the cognitive science insight of separating "fast reflex" from "slow deliberation" has become a core principle of VLA architecture design.

4.5 Three Frontier VLAs, Honestly Compared — NVIDIA GR00T, Google Gemini Robotics, Physical Intelligence π

"The diagnosis that the three are variants of the same paradigm is correct. But if we stop there, we cannot see what to research next. We have to go one layer deeper."

When PhD students first start reading VLA papers in earnest, almost everyone follows the same trajectory. At first, GR00T [21], Gemini Robotics, and π [16] look like three different animals, because the demo videos from each company look so different. NVIDIA's humanoid moving objects, Google's videos of complex linguistic reasoning, PI's 13-hour espresso-making — they give wildly different visual impressions. Then, after reading Sections 3–6 of this survey to the end, the opposite recognition arrives: "aren't these all the same thing?"

Both intuitions are partly right and partly wrong. The purpose of this section is to find the precise point between them. The most dangerous thing for a first- or second-year PhD student entering this field is to settle at one of the two extremes. Viewing the three as "all different" makes you chase surfaces; viewing them as "all the same" makes you lose track of which current your research actually sits on. We have to honestly separate what is the same and what is genuinely different.

4.5.1 What to concede first: the unified-field diagnosis is 70% correct

If you have tracked the VLA field closely for the past 1–2 years, it is hard to disagree with the following proposition.

All three models are variants of a single paradigm: "an internet-pretrained VLM as the brain, a generative action decoder mounted on top, and the two modules executed asynchronously at different time scales."

This is exactly what this survey's Insight 1 (Evidence of Convergence) identifies, and forcing the three onto the five-axis coordinate system of Section 3.6 makes the point even sharper.

Classification axis	NVIDIA GR00T N1 [21]	Google Gemini Robotics	PI π0 / π0.5 [31]
Structure (Liu/Shao)	Dual-system, Cascade	Dual-system, Cascade	Dual-system, Parallel
Generation (Zhong)	Diffusion (DiT)	Diffusion-family	Flow Matching
Anatomy (Xu)	Eagle-2 Brain + DiT Hand	Gemini VLM + Action Head	PaliGemma/Gemma 3 + Flow Hand
Function (Kawaharazuka)	Low-level perception + Low-level planning	High-level reasoning-oriented + Low-level planning	High-level perception + Low-level planning
Post-training (Jin)	Embodiment awareness	Task understanding	Multi-component integration + RL

Gemini Robotics is not included in this survey's official five-axis classification table in §3.6. The row above is this section's inferred position based on public technical reports, while the rows for GR00T N1 and π0 match the §3.6 table exactly.

The differences in the table exist, but they are positional differences within the same category, not category-level differences. All three are dual-system, all three use generative action heads, and all three stand on pretrained VLM backbones. The four axioms identified by this survey (inheriting internet knowledge, time-scale separation, generative decoder, data pyramid) are genuinely the common ground of all three.

Any comparison that refuses to concede this ends up being a reshuffling of marketing copy. The difference among the three is not "different species" but "different breeds of the same species" — the conclusion of the unified-field argument must be accepted. Does the section end here? No. For a PhD researcher, the genuinely interesting question starts only after this.

4.5.2 What must not be flattened: intra-species differences are real too

If you tell a biologist "golden retrievers and Siberian huskies are the same species," they will agree. And yet their entire career is spent on the differences between those two breeds. Being in the same category does not mean the differences are meaningless. The paradigm being shared does not imply that divergence on top of that paradigm disappears. It was only after cars converged on "internal combustion + transmission + chassis" for 100 years that the real differences between Toyota, BMW, and Porsche became visible.

To see where the real differences in VLA lie, we must introduce three transverse axes that this survey, organized along technical axes, did not make explicit. These are precisely the points the unified-field argument intentionally flattened, and they are decisive when a PhD researcher decides which current to position their own work on.

Axis 1 — Data source determines the ceiling of generalization

§10.12 of this survey points out that although frontier models and open-weight models are converging on simulation benchmarks (LIBERO, CALVIN), the gap in real-world zero-shot generalization is not closing. The ICLR analysis by Reuss (2026) attributes this to a gap in data curation and infrastructure scale. That gap is not a coincidence: it follows from a structural difference in where each of the three groups obtains its data.

NVIDIA centers on simulation synthesis via Cosmos and Isaac Sim/Lab. Simulation is infinitely scalable but pays a permanent tax called the sim-to-real gap. Google treats internet-scale multimodal pretraining as its asset. This is overwhelming for zero-shot semantic understanding, but it is a different dimension from the precision of motor control. PI relies on real-world data collected by its own robot fleet. This is the cleanest signal — no sim-to-real gap — but it is the most scalability-constrained.

Which data strategy a PhD researcher imitates in their own lab is decisive. If you have simulation infrastructure, following the NVIDIA line is natural; if not, fine-tuning openpi (which PI has released) with a small amount of real-world data is the realistic path. This choice is not merely a tool choice — it is a choice of what you accept as the upper bound of generalization.

Axis 2 — Which training stage you pour resources into

Recall the three-stage maturity model from §6.6 (internet pretraining → BC fine-tuning → RL post-training). Each of the three groups has picked a different stage as its point of differentiation, and this is not a coincidence: the choice is determined by each group's asset structure.

Google's assets are overwhelmingly concentrated on Stage 1 (internet pretraining): massive data centers, internet-scale multimodal corpora, TPU infrastructure. So Gemini Robotics' differentiation becomes "transplanting the enormous Gemini backbone as-is." NVIDIA follows the orthodox path on Stages 1–2 at the model level (inheriting the Eagle-2 VLM + BC fine-tuning), but differentiates the data pipeline that feeds Stages 1–2 via Cosmos/Isaac Sim. In other words, its differentiation is the data pipeline, not the model recipe. PI takes Google's open PaliGemma/Gemma 3 for Stages 1–2, and defines Stage 3 (RL post-training) and deployment-time learning beyond it as its own territory. Two results fall within this Stage-3 territory: π*0.6 [157], whose advantage conditioning (§10.13.1) first demonstrated a practical path for applying RL to Flow Matching VLAs, and π0.6-MEM [158] (§10.13.2), which solved 15-minute long-horizon tasks with multi-scale memory.

For a PhD researcher, this translates into a very practical question. At which training stage can you contribute over the next 3–5 years? Competing with massive backbones on Stage 1 is effectively impossible for a single academic lab. Stage 2 BC fine-tuning is already well-paved. The most open frontier is Stage 3 RL post-training and its variants — the territory PI is pushing fastest, but also the territory with the lowest barrier to entry. The trajectory of self-improving residual RL reaching 99% on LIBERO at ICLR 2026 supports this diagnosis.

Axis 3 — Model-weight release policy creates an asymmetric ecosystem

This is a political-economic dimension the survey does not directly address, but it is the difference that most directly affects the daily life of a PhD researcher. NVIDIA releases GR00T N1 on Hugging Face and elsewhere; PI releases π0/π0-FAST through openpi but keeps everything after π0.5 closed; Gemini Robotics has been an API/partnership model from the start.

This difference is not merely corporate policy — it determines who can do what kind of follow-up research. Attempting ablation studies or mechanism interpretation in academia without weight access is nearly impossible. So the de facto baselines of academic VLA research are open models like OpenVLA [15], π0, and GR00T N1; Gemini Robotics is a citation target, not a comparison target. This means the "frontier vs. open-weight gap" in §10.12 is not merely a performance gap but a gap in the very possibility of research.

A first- or second-year PhD student should be explicitly conscious of this when choosing a new topic. "Research that compares against Gemini Robotics" will almost always be confined to limited forms (citing published numbers, comparing API calls). In contrast, "research that verifies a new RL post-training method on top of openpi" is immediately executable and reproducible by others.

4.5.3 So, the real coordinates of the three

Combining the three axes, the differences among the three can be honestly summarized in a single table.

Dimension	NVIDIA GR00T	Google Gemini Robotics	PI π series
Paradigm position	Same species (VLM Brain + Generative Action Head)	Same species	Same species
Bottleneck they bet on	Embodiment generalization, humanoid form factor	Transferring the reasoning of a massive backbone into action	The policy's own capacity to improve (RL + memory)
Asset structure	GPU · simulation · edge-chip full stack	Massive data centers · internet pretraining	Own robot fleet · real-world data
Differentiating training stage	The Stage 1–2 data pipeline (Cosmos/Isaac)	Stage 1 (pretraining)	Stage 3 (RL post-training)
Model release policy	Open (monetize via infrastructure)	Closed (monetize via API)	Dual-track (foundation open, frontier closed)
Relationship to academia	Baseline + infrastructure adoption driver	Citation target, hard to compare	Baseline (openpi) + frontier to track

Note the relationship between this table and the table in 4.5.1. If 4.5.1 shows convergence along the technical axis, this table shows divergence along the strategic/ecosystem axis that sits on top of it. Both tables are true; looking at only one misreads the field as a whole.

4.5.4 From the PhD researcher's view — what to track and what not to follow

Rather than ending this section on a neutral, compromising note, it is more honest to close with concrete guidance for the decisions a first- or second-year PhD student will actually face.

Areas where you must track all three. Evolution of dual-system architectures, the mathematical mechanics of action decoders (diffusion vs. flow matching vs. discrete diffusion), VLM backbone choice and fine-tuning strategies. In this area, all three are genuinely playing the same game, and progress in one group foreshadows the near future of the others. Sections 4–5 of this survey correspond to this area.

Areas to track by group, divergently. Data collection/augmentation strategies (NVIDIA's Cosmos current, Google's internet-multimodal current, PI's real-world-fleet current), post-training-stage innovations (especially PI's RL post-training and memory integration), domain-specialized applications (NVIDIA's humanoid line, Google's EMMA autonomous-driving lineage). In this area, group-level asset structures differ, so a result from one group is not easy to transplant onto another.

Areas where tracking only one group is enough. PI's π*0.6 and π0.6-MEM lineage. This survey devotes all of §10.13 to two subsections on these papers because this line of work is attacking the field's most urgent open problems (BC's performance ceiling, the absence of memory in long-horizon tasks) head-on. If a PhD researcher is interested in RL post-training or memory mechanisms, starting from openpi and the π-series follow-ups as baselines is the most efficient path.

Traps to consciously avoid. The question "which company is ahead" — driven by the impressions of corporate demo videos — is a question that academia cannot answer. Another common trap is tracking only simulation-benchmark numbers and missing the real-world generalization gap (§10.12). And the biggest trap — settling on the unified-field argument too early and dismissing group-level differences with "they're all the same paradigm anyway." That the paradigm is shared is not a license to ignore divergence — it is a tool that lets you predict where divergence will occur more precisely.

4.5.5 One-line summary

The three groups are variants of a single paradigm at the level of mathematical architecture, and that diagnosis must be taken seriously. But along the transverse axes of data source, training-stage resource allocation, and model release policy, they genuinely sit at different coordinates, and that difference is what decides which current a PhD researcher places their own research on. In five years, the models themselves are more likely to resemble one another even more closely; but the ecosystems they produce and their relationships to academia are more likely to diverge further. Seeing both currents at the same time is the balance a first- or second-year PhD student needs when settling into this field.

Chapter 4 Summary: Current Coordinates of VLA Anatomy

The perception module is converging on a SigLIP + DINOv2 [30] hybrid as the dominant standard. The brain module has shifted decisively toward directly repurposing pretrained VLMs as the robot's brain. The action module has no clear winner yet: discrete tokenization, diffusion, Flow Matching, and FAST tokenization [20] are all in competition. As the approach for connecting these three modules, the dual-system architecture is rising in prominence, translating insights from cognitive science into engineering design principles.

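The next chapter takes up action tokenization, the design decision that governs how these architectures emit actions. Chapter 6 then examines how to train the architectures constructed in this way, delving into the three paradigms of behavioral cloning, reinforcement learning, and world model learning.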

Motivation Chain: The Causal Logic of Architectural Evolution

Limitations of modular pipelines (sense-plan-act separation leads to information loss and engineering overhead)
→ Monolithic architecture emerges (RT-2 [11]: a single VLM handles everything)
→ Limitations of monolithic models (VLM inference is too slow for real-time control)
→ Dual-system architecture emerges (GR00T N1 [21]: fast System 1 + slow System 2)
→ Limitations of dual systems (insufficient planning capability for long-horizon compound tasks)
→ Hierarchical architecture (π0.5 [31]: VLM planner + VLA executor)

Limitations of autoregressive decoding (single-mode generation only, discretization information loss, slow sequential generation)
→ Diffusion Policy [17] emerges (multimodal action representation, batch action chunk generation)
→ Limitations of DDPM (50–100 denoising steps make generation slow)
→ Flow Matching (π0 [16]) emerges (linear interpolation converges in 5–10 steps)
→ DiT-based decoders (CogACT [23], RDT-1B [24]) emerge (applying transformer scaling laws to diffusion)

Discriminative Features of Similar Architectures

Comparison	Key Discriminative Feature
Monolithic vs. Dual-system	Monolithic: a single model handles understanding + action (simple but speed-constrained). Dual-system: understanding and action are separated, each operating at its optimal frequency
Cascade vs. Parallel Dual-system	Cascade: sequential transfer from System 2 to System 1 (GR00T N1 [21]). Parallel: both systems execute simultaneously with shared attention (π0 [16])
Planner-Only vs. Planner+Policy Hierarchical	Planner-Only: VLM handles only high-level planning (SayCan [14]). Planner+Policy: a learned low-level VLA policy is also present (π0.5 [31])
Autoregressive vs. Diffusion Decoder	AR: tokens generated sequentially (discrete, slow, reuses VLM vocabulary). Diffusion: noise-to-action iterative refinement (continuous, multimodal, parallel chunk generation)
DDPM vs. Flow Matching	DDPM: stochastic reverse process (50–100 steps). Flow Matching: deterministic ODE velocity field (5–10 steps, faster and more stable)
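
The "deterministic ODE velocity field" in the last row can be made concrete with a toy sketch. The code below integrates dx/dt = v(x, t) from noise (t = 0) to an action (t = 1) with a handful of Euler steps. The function make_oracle_velocity is a hypothetical stand-in for a trained network: it returns the exact straight-line velocity toward one known target, which is an assumption made purely for illustration. A real VLA learns this field from demonstrations and conditions it on observations.

```python
import random


def sample_with_euler(velocity_fn, dim, num_steps, rng):
    """Integrate dx/dt = v(x, t) from t=0 (noise) to t=1 (action) with Euler steps."""
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # start from Gaussian noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_oracle_velocity(target):
    """Hypothetical oracle (not a trained model): exact velocity toward `target`.

    On the straight path x_t = (1 - t) * x0 + t * x1 the velocity is x1 - x0,
    which can be rewritten as (x1 - x_t) / (1 - t) using only the current state.
    """
    def velocity_fn(x, t):
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity_fn


rng = random.Random(0)
target_action = [0.3, -0.1, 0.5]  # made-up 3-D action
action = sample_with_euler(make_oracle_velocity(target_action), dim=3, num_steps=5, rng=rng)
print(action)
```

Because the oracle path is a straight line, a few Euler steps land on the target up to floating-point error. With a learned, imperfect field the step count becomes a speed/accuracy knob, which is where the 5–10 step figure in the table comes from.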

Intuitive One-Liners: Architecture Edition

Monolithic VLA: "A person who understands a foreign language and responds directly without an interpreter: fast, but deep deliberation is difficult."
Dual-system: "A CEO (strategic decisions, slow) and a shop-floor worker (immediate execution, fast) dividing labor."
Hierarchical architecture: "Separating the recipe (high-level plan) from knife skills (low-level technique): changing the recipe does not require relearning how to cut."
Autoregressive decoding: "Typing one character at a time: actions are generated one token after another in sequence."
Diffusion decoding: "Sculpting from marble: the action emerges as unnecessary material (noise) is chiseled away from the whole."
Flow Matching: "A shortcut through diffusion's 'sculpting': reaching the same result in fewer steps via a straight-line path."
Action Chunking: "Generating an entire sentence at once rather than one character at a time, securing temporal coherence."

Self-Check Questions: Sections 3–4

Q1: How is π0 [16]'s architecture classified under Liu & Shao's taxonomy and under Zhong et al.'s taxonomy, respectively?

Answer: Under Liu & Shao's taxonomy, π0 [16] is a Monolithic Parallel Dual-system: the PaliGemma VLM (System 2) and the Flow Matching Action Expert (System 1) process tokens simultaneously within shared attention blocks. Under Zhong et al.'s taxonomy, π0 [16] falls under Diffusion-based (Flow Matching) action generation, since Flow Matching is a variant of the diffusion family.

Q2: Why does the autoregressive approach (e.g., RT-2 [11]) struggle to represent multimodal action distributions?

Answer: The autoregressive approach quantizes each action dimension into discrete bins and generates tokens sequentially. This process incurs two problems: (1) quantization error degrades the precision of continuous actions, and (2) because each token is conditioned sequentially on the preceding tokens, the model can only represent a single mode of the distribution. When two equally valid modes exist (e.g., "grasp from the right" and "grasp from above"), the model tends to converge on the average direction (mode averaging), outputting an action that corresponds to neither valid option.

Q3: Among Xu et al.'s Five Challenges, which has seen the greatest progress and which remains the most unresolved as of 2026?

Answer: The greatest progress has been in (1) Representation: VLM-based unified representations, FAST tokenization [20], and Flow Matching have substantially improved the representation problem. The most unresolved challenge is (4) Safety: SafeVLA [75] represents an initial attempt, but formal verification is absent, and the fundamental risk that VLA hallucinations can lead to physical accidents remains unaddressed.

Open Research Questions: Sections 3–4

Optimal architecture selection criteria: Given a specific task (simple pick-and-place vs. a 30-minute cooking sequence), is it possible to develop a theoretical framework that predicts a priori which architecture (monolithic, dual-system, or hierarchical) will be optimal?

Unification of diffusion and autoregressive methods: What is the optimal design for a hybrid decoder that combines the strengths of autoregressive generation (VLM vocabulary reuse, interpretable reasoning) with those of diffusion (multimodal distribution, continuous action space)?

Predictive power of classification perspectives: Among the classification perspectives presented in this section (architecture, action generation, anatomy, function), which perspective best predicts the real-world performance of a model?

Action Expert scaling: Is the parameter ratio between π0 [16]'s Action Expert (~0.3B) and its VLM backbone (~3B) optimal? Does scaling the Action Expert further yield proportional performance improvements?

5. Action Tokenization — The Core Design Decision in VLA

When designing a Vision-Language-Action (VLA) model, countless architectural choices exist, but the most fundamental and far-reaching decision is "how to represent actions as tokens." Chen et al. [7] (2025) termed this the Action Tokenization perspective and presented it as the axis that most clearly differentiates VLA models from one another. This section systematically organizes the 8 action token types of Chen et al.'s classification, integrating insights from other surveys.

Why is tokenization central? VLA models are fundamentally built on large language models (LLMs). Since LLMs are systems that process discrete token sequences as input and output, the method by which a robot's continuous action space is converted into a discrete token space fundamentally determines the model's expressive power, control precision, and inference speed. The choice of tokenization method is not merely an implementation detail — it is the highest-level design decision that defines the capabilities and limitations of the VLA system.

A single model can combine multiple token types; for example, CoT-VLA [55] jointly utilizes reasoning tokens (Type 8) and goal tokens (Type 5).

5.1 Eight Action Token Types

Type 1: Language Tokens

The most intuitive approach is to represent robot actions as natural language text. An LLM generates a natural language command such as "pick up the red cup," and a low-level policy or predefined skill primitive then converts this into actual motor commands.

Representative models:

SayCan [14] (Ahn et al., 2022): Weights the LLM's score for each candidate skill by an affordance score for the current state and selects the skill with the highest product. Introduced the key idea of combining the LLM's world knowledge with the robot's physical capabilities via a product operation.
Inner Monologue [22] (Huang et al., 2023): Converts feedback from the environment (success/failure, object recognition results) into language and incorporates it into the LLM's next action plan.
SayTap [38] (Tang et al., 2023): Represents a walking robot's foot contact patterns as text sequences, enabling the LLM to directly plan locomotion rhythms.
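
The product operation attributed to SayCan above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the skill names and all scores are made-up numbers, and select_skill is a hypothetical helper. In the real system the language score would come from an LLM and the affordance score from a learned value function.

```python
def select_skill(llm_scores, affordance_scores):
    """Pick the skill maximizing (LLM usefulness score) * (affordance score)."""
    combined = {
        skill: llm_scores[skill] * affordance_scores.get(skill, 0.0)
        for skill in llm_scores
    }
    best = max(combined, key=combined.get)
    return best, combined


# Hypothetical scores for the instruction "bring me a drink".
llm_scores = {
    "pick up the soda can": 0.6,   # most useful according to the language model
    "pick up the sponge": 0.1,
    "go to the fridge": 0.3,
}
affordance_scores = {
    "pick up the soda can": 0.1,   # no can is within reach right now
    "pick up the sponge": 0.8,
    "go to the fridge": 0.9,
}

best_skill, combined = select_skill(llm_scores, affordance_scores)
print(best_skill)
```

With these numbers the language model prefers picking up the can, but the product favors going to the fridge first, because the can is not currently graspable. That is the grounding effect the product is meant to provide.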

Advantages: The LLM's powerful language generation capability and commonsense reasoning can be directly leveraged. Because pre-trained linguistic knowledge transfers intact, zero-shot generalization to new tasks is excellent.

Limitations: Control precision for continuous motor control is inherently insufficient. It is difficult to adequately express fine-grained manipulation such as "move the cup 3 cm to the left" in natural language. The control frequency is 1–3 Hz — the lowest among all token types — making this approach unsuitable for tasks requiring rapid response (e.g., catching a moving object).

Type 2: Code Tokens

This approach represents actions as executable program code. The LLM generates Python functions or API call sequences, which are then executed directly on the robot runtime.

Representative models:

Code as Policies [36] (Liang et al., 2023): The LLM directly generates Python code that calls robot APIs. Complex action sequences can be composed by combining spatial reasoning with programming constructs such as loops and conditionals.
ProgPrompt [102] (Singh et al., 2023): Structures tasks in a programmatic format (function calls, assert statements) to enhance the LLM's planning capability.
Voyager [103] (Wang et al., 2023): Generates executable code in the Minecraft environment and stores successful code in a skill library, progressively expanding capability.
ChatGPT for Robotics [104] (Vemprala et al., 2024): Converts user intent into robot control code through a conversational interface.

Advantages: Structured, reusable, and easy to debug. Complex action logic can be expressed concisely via loops and conditionals, and generated code can be accumulated in a library to expand the skill repertoire over time.

Limitations: Predefined APIs (e.g., pick(obj), place(x, y, z)) are required, and fine-grained dexterous manipulation not supported by the API cannot be expressed. The continuous nature of physical interaction is difficult to capture fully through discrete API calls.
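
To make the code-token idea concrete, the sketch below shows the kind of program an LLM might emit against a small predefined API. Everything here is hypothetical: FakeRobot is a logging stand-in for a robot runtime, and its pick(obj) and place(x, y, z) methods only mirror the example API names mentioned above. stack_blocks is an illustrative generated policy, not output from any particular model.

```python
class FakeRobot:
    """Hypothetical stand-in for a robot runtime; it only logs API calls."""

    def __init__(self, scene):
        self.scene = dict(scene)  # object name -> (x, y, z)
        self.log = []

    def pick(self, obj):
        self.log.append(("pick", obj))

    def place(self, x, y, z):
        self.log.append(("place", (x, y, z)))


def stack_blocks(robot, block_names, base_xyz, block_height=0.05):
    """Illustrative 'generated' policy: stack every visible block at base_xyz."""
    x, y, z = base_xyz
    level = 0
    for name in block_names:            # a loop replaces a long list of primitives
        if name not in robot.scene:     # a conditional guards against missing objects
            continue
        robot.pick(name)
        robot.place(x, y, z + level * block_height)
        level += 1


robot = FakeRobot({"red": (0.1, 0.2, 0.0), "blue": (0.3, 0.1, 0.0)})
stack_blocks(robot, ["red", "green", "blue"], base_xyz=(0.4, 0.0, 0.0))
for call in robot.log:
    print(call)
```

The limitation noted above is visible here as well: anything the API does not expose, such as regulating grip force during the pick, cannot be expressed no matter how the code is structured.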

Type 3: Affordance Tokens

This approach spatially represents the manipulable regions and modes of objects. "Where" to grasp and "in which direction" to push are expressed as heatmaps or vector fields in 3D space.

Representative models:

VoxPoser [57] (Huang et al., 2023): Uses an LLM and VLM to generate affordance maps (value maps) and constraint maps in a 3D voxel space. A motion planner synthesizes trajectories based on these maps.
A3VLM [58] (Huang et al., 2024): Extends a VLM to directly predict affordances from 3D point clouds.
RT-Affordance [105] (Nasiriany et al., 2024): Uses visual affordances as a conditioning signal to aid generalization of the manipulation policy.
A0 [106] (Ren et al., 2025): A unified affordance-based manipulation framework that explicitly models object–action relationships.

Core value: Affordance tokens provide an intermediate representation of "where" and "how" to manipulate. They serve as a semantic bridge between high-level language planning and low-level motor control, and are particularly strong at generalizing to novel objects.

Type 4: Trajectory Tokens

This approach represents the spatiotemporal path of an end-effector as a token sequence. Future trajectories are sketched onto 2D images, or waypoint sequences in 3D space are generated.

Representative models:

RT-Trajectory [62] (Gu et al., 2024): Overlays 2D trajectory sketches onto images for use as visual prompts. Humans can draw trajectories, or the model can predict them.
LATTE (Liu et al., 2024): A language-to-trajectory converter that transforms language commands into 3D trajectory sequences.
TraceVLA [40] (Zheng et al., 2025): The VLA model generates visual trajectory traces as an intermediate representation, simultaneously improving the interpretability and accuracy of action prediction.

Connection to video pre-training: Trajectory tokens connect naturally with video prediction models. Since a video frame sequence is essentially a continuous visual trajectory, the spatiotemporal understanding learned by models pre-trained on large-scale video data can be directly transferred to trajectory prediction.

Type 5: Goal Tokens

This approach represents goal states by predicting future observations. The answer to "How will the world look after acting from the current state?" is generated as an image or point cloud.

Representative models:

SuSIE [59] (Black et al., 2024): Takes the current observation and a language command as input, generates future subgoal images, and trains a low-level policy to follow them.
UniPi [93] (Du et al., 2024): Frames robot planning as a video generation problem. A text-to-video diffusion model generates a future frame sequence, and an inverse dynamics model extracts the actions.
3D-VLA [37] (Zhen et al., 2024): Predicts future scenes in a 3D representation space, generating goals that reflect rich spatial understanding.
CoT-VLA [55] (Zhao et al., 2025): Uses visual subgoals as a Chain-of-Thought, explicitly reasoning about "how the world should look next" before deciding on an action.

Integration with world models: Goal tokens offer the most natural interface with world models. Setting a goal by simulating the future is fundamentally equivalent to imagining the consequences of an action through an internal world model. This presents a powerful pathway for integrating planning and execution.

Type 6: Latent Tokens

This approach represents actions in a learned latent space. Raw action data is compressed with an autoencoder or similar mechanism, converting it into semantically rich latent vectors.

Representative models:

LAPA [61] (Ye et al., 2025): Uses a VQ-VAE (Vector Quantized Variational Autoencoder) to quantize action trajectories into a latent codebook. The VLA model is trained to predict these latent action tokens.
UniVLA [80] (Li et al., 2025): Learns a unified latent action space to handle diverse robot embodiments and tasks within a single model.
VQ-VLA [107] (Qu et al., 2025): Discretizes the action space via vector quantization while preserving the structure of the latent space to maintain semantic coherence.
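
The quantization step shared by these VQ-based approaches can be illustrated with a nearest-codebook lookup. This is a minimal sketch under simplifying assumptions: the 2-D codebook and the latent vectors are made up, whereas in a real VQ-VAE both the encoder that produces the latents and the codebook entries are learned.

```python
def quantize(vector, codebook):
    """Return (index, code) of the codebook entry nearest to `vector` (squared L2)."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    index = min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))
    return index, codebook[index]


# Hypothetical 4-entry codebook in a 2-D latent space.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

# Hypothetical encoder outputs for three short action segments.
latents = [(0.9, 0.1), (0.1, 0.8), (0.2, 0.1)]

latent_tokens = [quantize(z, codebook)[0] for z in latents]
print(latent_tokens)
```

The indices are what the VLA predicts as latent action tokens; a decoder (not shown) would map each code back to embodiment-specific motor commands.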

Key to bridging the embodiment gap: The most innovative value of latent tokens lies in overcoming the embodiment gap. Although human hand movements and robot gripper movements are physically entirely different, the semantic action of "grasping an object" can be represented similarly in the latent space. This makes it possible to transfer action knowledge extracted from large-scale human video data (Ego4D, Something-Something, etc.) to robots — a key strategy for circumventing the chronic shortage of robot data.

Domain-agnostic properties: Because the latent action space does not depend on the joint configuration or action-space dimensionality of a specific robot, it becomes a natural medium for cross-embodiment learning.

Type 7: Raw Action Tokens

This approach directly discretizes low-level action values — such as joint angles, end-effector pose (position + orientation), and gripper state — and converts them into tokens. It is the most direct and simple tokenization method.

Representative models:

RT-2 [11] (Brohan et al., 2023): Each dimension of a 7-dimensional action vector (6-DoF pose + gripper) is uniformly discretized into 256 bins and added to the LLM's vocabulary.
OpenVLA [15] (Kim et al., 2024): Adopts the same 256-bin discretization as RT-2 [11], implemented in an open-source, reproducible manner.
Gato [13] (Reed et al., 2022): Discretizes actions from diverse tasks into 1024 bins and processes them with a single general-purpose model.

The problem of quantization error: The fundamental limitation of raw action discretization is quantization error. For example, discretizing an action range normalized to [-1, 1] into 256 bins gives each bin a width of 2/256 ≈ 0.0078 (about 0.4% of the full range), so the rounding error per dimension can reach half a bin width; in precision manipulation, this error can accumulate and cause failures. Moreover, tokenizing each action dimension independently loses inter-dimensional correlations.
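
The size of this error is easy to check numerically. The sketch below discretizes a made-up 7-dimensional action into 256 uniform bins and decodes each token back to its bin center. The [-1, 1] normalization range and the bin-center decoding are assumptions for illustration; actual models differ in how they normalize and clip actions.

```python
def discretize(value, low=-1.0, high=1.0, num_bins=256):
    """Map a continuous value to a bin index in [0, num_bins - 1]."""
    clipped = min(max(value, low), high)
    width = (high - low) / num_bins
    return min(int((clipped - low) / width), num_bins - 1)


def undiscretize(index, low=-1.0, high=1.0, num_bins=256):
    """Map a bin index back to the center of its bin."""
    width = (high - low) / num_bins
    return low + (index + 0.5) * width


# Made-up 7-D action: 6-DoF pose delta plus gripper, already normalized.
action = [0.1234, -0.5678, 0.9, 0.0, -1.0, 1.0, 0.3]

tokens = [discretize(a) for a in action]
decoded = [undiscretize(t) for t in tokens]
max_error = max(abs(a - d) for a, d in zip(action, decoded))

bin_width = 2.0 / 256
print(tokens)
print(max_error, "<= half a bin width =", bin_width / 2)
```

The round-trip error never exceeds half a bin width per dimension, but it is incurred at every control step, which is why it matters for long, precise manipulation sequences.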

The innovation of FAST: FAST [20] (Frequency-space Action Sequence Tokenization; Pertsch et al., 2025) proposed an elegant solution to this problem. It applies a Discrete Cosine Transform (DCT) to action sequences to convert them to the frequency domain, quantizes the resulting coefficients, and then compresses them into tokens using Byte Pair Encoding (BPE). This achieves:

Improved information compression as the DCT captures temporal correlations
Shorter sequence lengths as BPE groups frequent action patterns into single tokens
Natural integration with the LLM's existing vocabulary expansion mechanism

This innovation dramatically improves the efficiency of raw action tokenization, reducing the number of tokens by up to roughly 13x compared to the conventional 256-bin method at equivalent precision.
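
Why the frequency domain helps can be seen on a toy trajectory. The sketch below is not the FAST implementation. It covers only the DCT and coefficient-rounding stages on one made-up action dimension, with a hypothetical quantization scale, and it omits the BPE stage, which would further merge frequent integer patterns into single tokens.

```python
import math


def dct2(signal):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(signal)
    out = []
    for k in range(n):
        s = sum(x * math.cos(math.pi * (i + 0.5) * k / n) for i, x in enumerate(signal))
        norm = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(norm * s)
    return out


def idct2(coeffs):
    """Inverse of dct2 (orthonormal DCT-III)."""
    n = len(coeffs)
    out = []
    for i in range(n):
        s = coeffs[0] * math.sqrt(1.0 / n)
        s += sum(
            c * math.sqrt(2.0 / n) * math.cos(math.pi * (i + 0.5) * k / n)
            for k, c in enumerate(coeffs)
            if k > 0
        )
        out.append(s)
    return out


# Made-up smooth 32-step trajectory for a single action dimension.
chunk = [0.5 * math.sin(2 * math.pi * t / 32) + 0.01 * t for t in range(32)]

scale = 10.0  # hypothetical quantization scale: larger keeps more detail
quantized = [round(c * scale) for c in dct2(chunk)]
nonzero = sum(1 for q in quantized if q != 0)

reconstructed = idct2([q / scale for q in quantized])
max_error = max(abs(a - b) for a, b in zip(chunk, reconstructed))

print("nonzero coefficients:", nonzero, "of", len(chunk))
print("max reconstruction error:", max_error)
```

For a smooth trajectory most high-frequency coefficients round to zero, so the chunk is described by far fewer nonzero integers than timesteps. That sparsity is what a subsequent BPE pass can exploit.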

At ICLR 2026, next-generation tokenization techniques surpassing FAST were proposed. FASTer combines Residual Vector Quantization (RVQ) with frequency- and time-domain losses to simultaneously achieve higher compression ratios and superior reconstruction quality. OMNISAT employs a B-Spline encoder to provide a compact representation specialized for smooth, long-horizon action outputs.

Type 8: Reasoning Tokens

This approach generates the reasoning process as explicit tokens before deciding on an action. The model first reasons about "why this action should be taken," and then predicts the action based on that reasoning.

Representative models:

ECoT [92] (Zawalski et al., 2024): Embodied Chain-of-Thought. Generates text describing scene descriptions, task analysis, and sub-plans before predicting actions.
CoT-VLA [55] (Zhao et al., 2025): Combines visual reasoning tokens (future subgoal images) with linguistic reasoning tokens to construct a multimodal chain of thought.
ThinkAct [67] (Huang et al., 2025): Introduces an adaptive reasoning mechanism that autonomously determines "when to think."
Embodied-R1 [108] (Liu et al., 2025): Applies DeepSeek-R1-style long-chain reasoning to embodied tasks, generating spontaneous reasoning paths for complex multi-step manipulation.

Performance improvement effects: The effect of reasoning tokens can be substantial. According to experiments reported for SC-VLA [56], including reasoning tokens improved action prediction quality by approximately 35%. This suggests that "think before acting" offers a clear advantage over purely reactive behavior.

Trade-offs: Reasoning tokens require additional token generation, which increases inference latency. ThinkAct [67]'s approach of learning "when to think" is an attempt to manage this trade-off — acting immediately for simple tasks and activating reasoning only for complex situations.

5.2 The Relationship Between Tokenization and Control Frequency

The tokenization method directly determines the control bandwidth of a VLA model, which fundamentally defines the range of tasks the model can perform.

Tokenization Method	Control Frequency	Representative Models	Suitable Tasks
Language tokens	1–3 Hz	SayCan [14], Inner Monologue [22]	High-level planning, navigation
Autoregressive raw tokens	3–6 Hz	OpenVLA [15] (~6 Hz), RT-2 [11]	Simple pick-and-place
Diffusion + chunking	10–50 Hz	π0 [16] (20–50 Hz)	Flexible manipulation, contact-rich tasks
Flow Matching + chunking	50–120 Hz	GR00T N1 [21] (~120 Hz)	Agile manipulation, bimanual coordination
FAST + chunking	~50 Hz	FAST-VLA	General-purpose manipulation

The key insight of this table is that the tokenization method determines the control bandwidth, which in turn defines the range of executable tasks. A control frequency of 1–3 Hz can handle simple tasks such as "put the red block in the blue bowl," but it cannot grasp an egg without crushing it. Delicate manipulation that requires force control becomes feasible only at roughly 50 Hz and above.

The root cause of this frequency gap lies in the token generation mechanism:

Autoregressive decoding generates tokens one by one sequentially, so latency increases proportionally with the number of action dimensions (7-DoF → 7 tokens generated sequentially).
Diffusion/flow matching-based action chunking generates dozens of action steps at once in a single inference call (a short run of denoising or integration steps). Although the inference frequency is low, the actions within the chunk are executed at high frequency.
FAST [20] uses DCT+BPE to compress action sequences into a small number of tokens, achieving a high effective frequency even with autoregressive decoding.
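
The arithmetic behind these three mechanisms can be written down directly. The sketch below is a back-of-the-envelope model, and every latency in it is a made-up number chosen for illustration rather than a measurement of any model. It ignores details such as prompt processing time and overlapping inference with execution.

```python
def autoregressive_rate_hz(tokens_per_action, seconds_per_token):
    """One action requires `tokens_per_action` sequential decode steps."""
    return 1.0 / (tokens_per_action * seconds_per_token)


def chunked_rate_hz(chunk_size, seconds_per_inference):
    """One inference call yields `chunk_size` actions to execute."""
    return chunk_size / seconds_per_inference


# Hypothetical: 7 tokens per action at 25 ms per decoded token.
print(autoregressive_rate_hz(tokens_per_action=7, seconds_per_token=0.025))

# Hypothetical: a 16-step chunk produced by one 320 ms inference call.
print(chunked_rate_hz(chunk_size=16, seconds_per_inference=0.32))

# Hypothetical: a compressed tokenizer emitting 30 tokens for a 16-step chunk.
print(chunked_rate_hz(chunk_size=16, seconds_per_inference=30 * 0.025))
```

The second and third cases show that the effective control rate is set by the number of actions delivered per unit of inference time, not by how often the model itself runs.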

5.3 The Impact of Tokenization Choice on Performance

Discrete vs. Continuous: The Multimodal Distribution Problem

The most serious impact of the tokenization choice on performance appears in multimodal distribution scenarios. Consider, for example, a situation where an object on a table can be pushed in either direction — left or right. In this case, the correct action distribution has a bimodal form — probability mass concentrated at both left and right.

When trained with discrete raw tokens (256-bin) and standard cross-entropy loss, the model tends to predict the action corresponding to the average of the two modes (i.e., pushing in neither direction). This is the mode averaging problem (sometimes loosely called mode collapse), which becomes more severe as the diversity of demonstration data increases.

Diffusion-based continuous action generation provides a natural solution to this problem. Diffusion models can inherently represent multimodal distributions, capturing both modes of the bimodal distribution. This is one of the core reasons π0 [16], GR00T N1 [21], and others have adopted diffusion/flow matching.
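
The averaging failure is easiest to see in the simplest possible setting: a single point estimate fit to bimodal demonstrations. The sketch below uses made-up data and does not reproduce the discrete-token training setup described above. It only shows why an averaged action is invalid while sampled actions stay on a mode. The empirical demonstration set stands in for a generative policy, which is an assumption for illustration.

```python
import random

rng = random.Random(0)

# Made-up demonstrations: push left (about -1) or right (about +1), equally often.
demos = [rng.choice([-1.0, 1.0]) + rng.gauss(0.0, 0.05) for _ in range(1000)]

# A single point estimate that minimizes squared error is the mean of the data.
mean_action = sum(demos) / len(demos)

# A generative policy draws samples instead; here we resample the demonstrations.
sampled_actions = [rng.choice(demos) for _ in range(5)]

print("averaged action:", mean_action)   # close to 0: pushes in neither direction
print("sampled actions:", sampled_actions)  # each close to -1 or +1
```

The averaged action sits between the modes, where no demonstration ever went, while every sampled action commits to one valid direction.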

Co-determination of Action Space and Decoding

Tokenization methods and decoding strategies are not chosen independently — they are co-determined:

Discrete action space ↔ autoregressive decoding: Representing actions as discrete tokens allows the LLM's existing language model head to be reused as-is. RT-2 [11] and OpenVLA [15] take this path. The advantage is that architectural modifications are minimized, maximally preserving the LLM's pre-trained knowledge.
Continuous action space ↔ diffusion/flow-based generation head: Keeping actions as continuous vectors requires a dedicated generation head (diffusion head, flow matching head). π0 [16] and GR00T N1 [21] take this path. The advantages are multimodal distribution representation and high control frequency; the disadvantage is that additional design is needed for integration with the LLM backbone.

Recent work has also attempted to combine these two paths. For example, VQ-VLA [107] converts continuous actions into discrete tokens via vector quantization while preserving the structure of the latent space, aiming to combine the advantages of both.

Trade-offs of Action Chunking

Action chunking is a technique for simultaneously predicting actions across multiple timesteps in a single inference pass. Proposed in ACT (Zhao et al., 2023), it has since become a core element of VLA design.

Increasing the chunk size:

Reduces inference frequency, improving computational efficiency (e.g., chunk size 16 → 1/16 the number of inferences)
Improves trajectory consistency: predicting independently at each step can lead to an unstable trajectory, whereas chunk-level prediction keeps the actions within a chunk temporally coherent
However, reduces responsiveness to environmental changes — if an unexpected situation arises in the middle of chunk execution, immediate response is not possible

To manage this trade-off, π0 [16] and others adjust chunk size to match task characteristics, or apply temporal ensembling, which averages the overlapping predictions of successive chunks so that the executed actions transition smoothly.
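
Temporal ensembling can be sketched in a few lines. This is a minimal illustration, assuming a new chunk is predicted at every step and that overlapping predictions are combined with exponentially decaying weights (oldest prediction weighted most, in the spirit of ACT); the function name `temporal_ensemble` and the decay constant `m` are hypothetical.

```python
import numpy as np

def temporal_ensemble(chunks, t, m=0.1):
    """Blend every prediction made for timestep t.

    chunks[s] is the chunk predicted at step s, with shape (horizon, action_dim);
    its row k is the action proposed for timestep s + k.
    """
    preds = []
    for s, chunk in enumerate(chunks):
        offset = t - s
        if 0 <= offset < len(chunk):
            preds.append(chunk[offset])
    preds = np.stack(preds)                      # ordered oldest first
    weights = np.exp(-m * np.arange(len(preds)))
    weights /= weights.sum()
    return weights @ preds

# Two overlapping chunks that disagree: the blended action lies between them.
chunks = [np.zeros((4, 2)), np.ones((4, 2))]
print(temporal_ensemble(chunks, t=1))
```

A larger `m` trusts older predictions more and yields smoother motion; a smaller `m` weights all predictions more evenly, so the newest observation has more influence.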

5.4 Summary: An Integrated Understanding of the Tokenization Perspective

The 8 action token types can be understood as a spectrum of abstraction levels:

High abstraction  ←─────────────────────────────────→  Low abstraction
Language → Code → Reasoning → Goal → Trajectory → Affordance → Latent → Raw

Moving to the left yields greater human interpretability and generalization but lower control precision; moving to the right yields greater precision but weaker generalization and interpretability. The most successful modern VLA systems hierarchically combine multiple levels of this spectrum — for example, forming a plan with reasoning tokens (high abstraction) and then executing with raw action tokens (low abstraction).

From this perspective, the central question in VLA research is not "which token type is best?" but rather: "For which tasks and embodiments is which combination of token types optimal?"

The ordering above represents one perspective; the relative abstraction levels of affordances and trajectories may vary depending on the domain and task.

6. The Evolution of Learning Paradigms

The training of VLA models has evolved beyond simple supervised learning into a multi-layered process spanning from pre-training to post-training. This section systematically examines each stage of this evolution and explores a fascinating parallel with theories of human motor learning.

6.1 Pre-training — From the Internet to Robots

Two-phase joint training has become the de facto standard for VLA model pre-training.

Phase 1: Internet-Scale Image–Text Pre-training

In the first phase, a vision-language model (VLM) is trained on large-scale image–text data collected from the internet. Through datasets such as LAION-5B (5 billion image–text pairs), COCO, and Visual Genome, the model acquires a rich semantic prior about the visual world.

What the model learns at this stage is not robot control itself, but the world understanding that underlies it:

Object recognition and classification ("this is a mug")
Understanding of spatial relationships ("the mug is on the table")
Reasoning about physical properties ("a glass can break")
Commonsense action knowledge ("to drink from a mug, hold the handle")

Phase 2: Fine-tuning on Robot Trajectory Data

In the second phase, the model is fine-tuned on real robot trajectory data. Key datasets include:

Open X-Embodiment (OXE) [19]: A cross-embodiment dataset containing 22 robot morphologies and more than 1M episodes. It is currently the de facto standard data source for VLA training.
BridgeData V2: Manipulation data from a WidowX robot across diverse environments (~60,000 trajectories).
RT-1 [12] data: A large-scale single-embodiment dataset collected by Google's Everyday Robots.

The core insight of this two-phase structure is the separation of semantic prior knowledge from sensorimotor skills. The ability to understand the world (Phase 1) and the ability to act in the world (Phase 2) can be efficiently learned from different data sources.

The Rise of Video Pre-training

Recently, the trend of utilizing video data in pre-training — beyond image–text data — has been accelerating:

GR-2 [100] (Cheang et al., 2024): Uses a video generation model pre-trained on web-scale video as the foundation for a robot policy. Understanding of physical dynamics embedded in video transfers to robot control.
Egocentric video (Ego4D, EPIC-Kitchens): First-person-view videos of humans performing direct manipulation are similar to the robot's perspective, providing particularly rich prior knowledge for manipulation tasks.

The decisive advantage of video pre-training over image–text pre-training is the understanding of temporal dynamics. Images provide understanding of static scenes, while video provides understanding of "how the world changes after this action."

The Role of Simulation Data

UniSim [101] (Yang et al., 2023): An action-conditioned video diffusion model that can generate training data at scale through simulated interaction in virtual environments.
Genesis (Xian et al., 2024): A GPU-accelerated physics simulator that generates large-scale, physically realistic interaction data.

The primary challenge with simulation data is the sim-to-real gap — visual and physical differences between simulation and the real world. Techniques such as domain randomization and domain adaptation are being actively researched to reduce this gap.

Scaling Laws

Scaling laws have also been observed in VLA training. According to Zhang et al. [6], doubling the amount of trajectory data improves task success rate by approximately 8 to 12% (this figure is cited from the original paper and does not appear in the surveys themselves). This suggests that acquiring more data translates fairly directly into performance improvement, underscoring the importance of large-scale data collection infrastructure (OXE [19], DROID, etc.).

However, data scaling alone has limits. The quality, diversity, and coverage of trajectory data are just as important as quantity, and developing more efficient learning algorithms must proceed in parallel with simply increasing the amount of data.

6.2 Limitations of Behavioral Cloning

Behavioral Cloning (BC) is the most basic policy learning method, which imitates expert demonstrations through supervised learning. Given observation–action pairs $(o_t, a_t)$, a policy $\pi(a|o)$ is trained via maximum likelihood estimation (MLE). The majority of VLA models are built on this BC framework.
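
Written out, the MLE objective described above takes the following form. The dataset symbol $\mathcal{D}$ is an assumption introduced here for illustration; $\hat{\pi}$ denotes the trained policy.

$$\hat{\pi} = \arg\max_{\pi} \sum_{(o_t, a_t) \in \mathcal{D}} \log \pi(a_t \mid o_t)$$

For discrete action tokens this is equivalent to minimizing the cross-entropy between the predicted token distribution and the demonstrated token.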

However, BC has fundamental limitations:

Distribution Shift

The state distribution the model sees during training differs from the one it encounters during execution. The training data follow the state distribution induced by the expert policy $\pi^*$, whereas at execution time the states are generated by the trained (imperfect) policy $\hat{\pi}$. This distributional mismatch leads to unpredictable behavior in states not covered by the training data.

Covariate Shift and Compounding Errors

Small prediction errors at each timestep accumulate over time. A slight positional error at one step leads to a larger error at the next step, which cascades and amplifies to cause catastrophic failures in long-horizon tasks. For example, in a complex assembly task lasting 30 seconds, a minute early error can result in complete failure at a later stage.
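
A back-of-the-envelope calculation shows why long horizons are so unforgiving. The sketch below assumes, purely for illustration, that each step fails independently with a fixed probability and that the policy runs at 10 Hz; neither assumption comes from the source, and real compounding can be worse because early errors push the robot into unfamiliar states.

```python
def rollout_success_probability(per_step_error: float, num_steps: int) -> float:
    """Probability that no step fails, assuming independent per-step errors."""
    return (1.0 - per_step_error) ** num_steps

# A 30-second task controlled at 10 Hz has 300 decision steps.
p = rollout_success_probability(per_step_error=0.01, num_steps=30 * 10)
print(f"{p:.3f}")  # roughly 0.05: a 1% per-step error rate leaves about a 5% chance of a clean rollout
```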

Inability to Improve Beyond Suboptimal Demonstrations

BC performance is capped by the quality of the demonstrations. If the demonstrations are suboptimal or noisy, the model cannot surpass that level, because BC has no mechanism for discovering better actions.

Absence of Safety/Preference Signals

BC learns only "what to do" and cannot leverage signals about "what not to do" or "which actions are preferred." It cannot explicitly reflect safety constraints (e.g., speed limits near humans) or user preferences (e.g., preference for smooth motions).

Core Conclusion

BC is a necessary but not sufficient condition for VLA training. While BC is indispensable for efficiently leveraging large-scale demonstration data, the need for post-training to overcome its limitations is becoming increasingly clear.

6.3 Reinforcement Learning Post-training

Research on using reinforcement learning (RL) to push VLA performance beyond BC has been growing explosively, as surveyed by Jin et al. [9] (2025). This closely parallels the paradigm in the large language model (LLM) field of aligning models with RLHF (Reinforcement Learning from Human Feedback) after SFT (Supervised Fine-Tuning).

Online Reinforcement Learning

The model learns by directly interacting with a real environment (or simulation) and receiving rewards:

PPO-based:
VLA-RL [68] (Tan et al., 2025): Applies PPO to a VLA model for performance improvement through online environment interaction
RIPT-VLA [71] (Su et al., 2025): Reinforcement learning via Iterative Policy Training. According to the original paper (Su et al., 2025), on a specific task it starts from a 4% SFT success rate and reaches a 97% success rate after 15 PPO iterations. Note, however, that in Jin et al. [9]'s LIBERO benchmark comparison the average is 74.7%, indicating substantial variation across benchmarks.
iRe-VLA (Xu et al. [8], 2025): Progressively improves the policy through iterative RL

GRPO-based:
ThinkAct [67] (Xu et al. [8], 2025): Applies Group Relative Policy Optimization to simultaneously reinforce reasoning and action
TGRPO [66] (Li et al., 2025): An extended GRPO that includes reasoning consistency rewards
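
The core of GRPO is that advantages are computed relative to a group of rollouts for the same task, with no learned value function. The following is a minimal sketch of that normalization step only; the function name and the binary success rewards are illustrative assumptions, and the reward shaping used by ThinkAct or TGRPO is not reproduced here.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward by the mean and std of its group."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Four rollouts of the same instruction: two succeed, two fail.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Successful rollouts receive positive advantages and failed ones negative advantages; if every rollout in the group gets the same reward, all advantages are zero and the group contributes no gradient.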

Offline Reinforcement Learning

Improves the policy using only previously collected data. Better actions can be extracted from suboptimal demonstrations without additional environment interaction:

PA-RL: Selects optimal trajectories from offline data through CalQL (Calibrated Q-Learning)-based re-ranking
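ConRFT [69] (Li et al., 2025): Reinforced fine-tuning using a consistency policy. It begins with an offline stage on previously collected data and then continues with online fine-tuning, so it spans both categories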

Preference Optimization

Directly uses human preferences as learning signals:

HAPO [84] (Li et al., 2025): Applies DPO (Direct Preference Optimization) to VLA. Learns preferred action patterns through pairwise trajectory comparisons
RAPL [83] (Tian et al., 2025): Learns a reward function through visual preference encoding, simply by having humans compare video clips
GRAPE [109] (Wang et al., 2025): Multi-scale preference learning — simultaneously incorporates preferences at the trajectory level, segment level, and step level
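
For pairwise trajectory comparisons, the standard DPO objective can be written directly in terms of trajectory log-likelihoods. The sketch below shows that generic loss only, under the assumption that each argument is the summed log-probability of a whole trajectory; the function name and `beta` value are hypothetical, and the exact losses used by HAPO, RAPL, or GRAPE are not given in the text and may differ.

```python
import math

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid(beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)]).

    *_w: preferred trajectory, *_l: rejected trajectory;
    ref_*: log-likelihoods under the frozen reference (e.g., BC) policy.
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # Numerically stable softplus(-margin), which equals -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# When the policy equals the reference, the loss is log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
```

The loss decreases as the policy raises the likelihood of the preferred trajectory relative to the reference more than it raises the rejected one.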

The Spectrum of Reward Design

The central challenge of RL post-training is designing an appropriate reward function. Reward types proposed to date include:

Reward Type	Characteristics	Representative Methods
Task success reward (binary/sparse)	Success=1, failure=0. Simple to design but low learning efficiency	Most online RL methods
VLM-generated dense reward	VLM automatically generates intermediate reward functions. No human design needed	IKER [85]
Preference-based reward (RLHF-style)	Reward learned from human comparative feedback	HAPO, RAPL [83]
Safety constraint reward	Penalty for safety violations	SafeVLA [75]
Reasoning consistency reward	Reward for consistency between the reasoning process and action outcomes	TGRPO [66], ThinkAct [67]

Key Performance Figures

Impressive numbers demonstrating the effect of RL post-training:

RIPT-VLA [71]: SFT 4% → 97% success rate after 15 PPO iterations on a specific task per the original paper (Su et al., 2025); Jin et al. [9] report a LIBERO average of 74.7%.
SimpleVLA-RL [70]: 17.3% → 91.7% (with just 1 trajectory per task) (original paper citation; source outside the 14 surveys).
These figures suggest that RL post-training can deliver a fundamental performance leap, not merely incremental improvement, although both are reported by the original papers for specific settings.

Self-improving residual RL methods presented at ICLR 2026 have reached 99% success rates on LIBERO, demonstrating that the potential of RL post-training can push performance to benchmark-saturation levels. Stage-aware reinforcement learning is a novel approach that decomposes tasks into semantic constituents and optimizes each stage independently.

Resolving BC→RL Transition Instability

The greatest practical challenge when applying RL to a BC-initialized model is training instability. RL updates can destroy useful action patterns learned during BC (catastrophic unlearning). Strategies to address this:

BC loss regularization: Adds BC loss as a regularization term to the RL objective, preserving the basic capabilities learned during BC.
VL encoder freezing: Freezes the vision-language encoder weights and updates only the policy head with RL, preserving pre-trained visual-language understanding.
Dual-Q/ensemble critics: Suppresses overestimation of the value function to secure training stability.
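
As a sketch, BC loss regularization can be written as a weighted sum of the two objectives. The weight $\lambda$, the parameter symbol $\theta$, and the dataset symbol $\mathcal{D}$ are assumptions introduced here; individual methods differ in how the terms are defined and scheduled.

$$\mathcal{L}(\theta) = \mathcal{L}_{\mathrm{RL}}(\theta) + \lambda \, \mathbb{E}_{(o, a) \sim \mathcal{D}} \left[ -\log \pi_\theta(a \mid o) \right]$$

A larger $\lambda$ keeps the policy closer to the demonstrated behavior; a smaller $\lambda$ gives the RL term more freedom to depart from it.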

6.4 Parallels with Human Motor Learning

Jin et al. [9] (2025) noted that VLA learning paradigms have a structure remarkably similar to human motor learning theory. This analogy goes beyond mere metaphor and offers practical insights into the future direction of VLA research.

Newell's Constraints-Led Theory (1986)

Karl Newell argued that motor behavior emerges from the interaction of three types of constraints. This framework directly corresponds to VLA design:

Environmental Constraints:

Human: gravity, friction, physical properties of objects, etc.
VLA: affordance recognition, perceptual enhancement modules → encoding physical constraints of the environment into the model

Organismic Constraints:

Human: body size, muscle strength, range of joint motion, etc.
VLA: embodiment awareness → learning forward kinematics and inverse kinematics

Task Constraints:

Human: task goals, rules, time limits, etc.
VLA: hierarchical task decomposition, Chain-of-Thought reasoning → decomposing complex tasks into manageable subtasks

Neuroscientific Correspondences

Each component of VLA functionally corresponds to a specific system in the human brain:

Brain System / Mechanism	Function	VLA Counterpart
Genome (genetic prior knowledge)	Foundation for innate motor abilities	Internet-scale pre-training
Skill acquisition (learning through practice)	Mastery of specific motor skills	RL post-training, task-specific fine-tuning
Cerebellar forward model	Predicting consequences of actions	Forward kinematics learning, world models
Basal ganglia chunking	Automation of motor sequences	Action Chunking
Expert coaching	Correction through external feedback	Human-Robot Interaction (HRI)
Reward prediction errors (analogous to basal ganglia dopaminergic system)	Signal of difference between expectation and outcome	RL reward signal (TD error)
Internal world model	Mental simulation of the environment	Visual Interaction Prediction (VIP)

Particularly noteworthy in this correspondence is the similarity between basal ganglia chunking and action chunking. The process by which humans automate complex motor sequences (e.g., piano playing) into a single "chunk" through repeated practice is strikingly similar to the mechanism by which VLA models bundle actions across multiple timesteps into a single chunk for generation.

Furthermore, the correspondence between reward prediction errors (analogous to the basal ganglia dopaminergic system) and RL's temporal difference (TD) error is not coincidental. Both systems use "the difference between what was expected and the actual outcome" as a learning signal to progressively improve behavior.
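
For reference, the one-step TD error in its standard textbook form is shown below, where $V$ is a value estimate, $r_t$ the reward, and $\gamma$ the discount factor (symbols not defined in the source):

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

A positive $\delta_t$ means the outcome was better than expected and a negative one means it was worse, which is the same signed "surprise" signal attributed to reward prediction errors.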

Practical Implications of This Analogy

This parallel is not merely an intellectual curiosity — it suggests future directions for VLA research:

Humans' ability to flexibly extend their body schema (tool use) inspires cross-embodiment generalization research in VLA.
The phenomenon of human motor memory being consolidated during sleep suggests the importance of offline RL and experience replay.
Humans' ability to learn motor skills through observation alone (mirror neuron system) directly connects to learning latent actions from human video.

6.5 Self-improvement and Lifelong Learning

The ability of a VLA system to continue improving its performance after deployment in a real environment is one of the most important research directions from a practical standpoint.

Autonomous Data Collection

SOAR (Fan et al., 2025): An autonomous data collection framework guided by foundation models (VLM, LLM). The model identifies which data is lacking and then collects data in those areas without human supervision.
Core idea: An embodied version of active learning — the model automatically explores and experiences situations of high uncertainty.

Online Self-improvement

RoboCat [110] (Bousmalis et al., 2024): A pioneering system implementing a self-improvement loop. Successful trajectories generated by the model are added to the training data for iterative improvement.
VLA-RL [68] (Tan et al., 2025): Continues to learn from environmental interaction after deployment through online RL.

The central challenge of self-improvement is self-reinforcement bias. When a model uses its own (imperfect) outputs as training data, existing errors or biases can be amplified. Quality filtering, diversity assurance mechanisms, and human-in-the-loop intervention are needed to prevent this.

The Core Challenge of Lifelong Learning: Catastrophic Forgetting

The most serious problem VLA systems face when adapting to new tasks or environments is catastrophic forgetting — the loss of previously learned capabilities when fine-tuned on new data.

Specific manifestations of catastrophic forgetting in VLA:

Fully unfreezing the VL encoder and fine-tuning progressively erodes the rich visual-language understanding acquired during internet-scale pre-training.
Overfitting to a specific environment degrades generalization ability in other environments.
Specializing to a specific robot embodiment weakens cross-embodiment transfer capability.

Resolution strategies:

Selective Unfreezing: Instead of updating all parameters, selectively fine-tunes only task-relevant layers. Parameter-efficient fine-tuning (PEFT) methods such as LoRA (Low-Rank Adaptation) are representative.
ReVLA [111] (Shi et al., 2025): Introduces a reversible learning mechanism that preserves prior knowledge when learning new tasks.
π0.5-KI [31]: Blocks gradient propagation into specific modules (gradient blocking), protecting pre-trained knowledge.
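
The LoRA idea mentioned above can be sketched as a frozen weight matrix plus a trainable low-rank update. This is a minimal NumPy illustration with a forward pass only; the class name, rank, and scaling value are assumptions, and a real setup would use a deep learning framework with gradient updates restricted to `A` and `B`.

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / rank) * B @ A."""

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight                                # frozen pre-trained weight
        self.A = 0.01 * rng.standard_normal((rank, d_in))   # trainable
        self.B = np.zeros((d_out, rank))                    # trainable, zero at init
        self.scale = alpha / rank

    def __call__(self, x):
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

    def trainable_params(self):
        return self.A.size + self.B.size

rng = np.random.default_rng(1)
W = rng.standard_normal((64, 32))
layer = LoRALinear(W)
print(layer.trainable_params(), "trainable vs", W.size, "frozen")
```

Because `B` starts at zero, the adapted layer initially reproduces the pre-trained layer exactly, and the frozen weights are never overwritten during fine-tuning.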

Cross-embodiment Generalization

Ultimately, VLA systems must be able to generalize to diverse embodiments rather than being tied to a specific robot. The skill of "grasping an object" learned on a 7-axis articulated robot must also work on a parallel gripper, a dexterous hand, and a mobile manipulator.

HPT [96] (Wang et al., 2024): Heterogeneous Pretrained Transformer. An architecture that separates a shared latent space from embodiment-specific heads. The shared transformer processes task semantics, and a dedicated head for each robot morphology converts them into the corresponding action space.
UniAct [112] (Ning et al., 2025): Defines a unified action space in 3D space, learning a universal action representation independent of robot morphology.
BridgeVLA [113] (Li et al., 2025): A VLA model that acts as a bridge between different robot datasets, facilitating cross-dataset transfer.

The core challenge of cross-embodiment generalization is the heterogeneity of action spaces. A 7-DoF robot arm, a 12-DoF dexterous hand, and a 20+-DoF humanoid differ fundamentally in the dimensionality and meaning of their action spaces. To overcome this heterogeneity, latent action tokens (Type 6) or task-space representations are being used as key intermediaries.
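
The shared-latent-plus-dedicated-head pattern described for HPT can be sketched as follows. This is a toy illustration, not the HPT architecture: the class name, the single linear trunk, the common observation dimension, and the embodiment names are all hypothetical simplifications.

```python
import numpy as np

class SharedTrunkPolicy:
    """One shared trunk produces a latent; each embodiment has its own action head."""

    def __init__(self, obs_dim, latent_dim, action_dims, seed=0):
        rng = np.random.default_rng(seed)
        self.trunk = 0.1 * rng.standard_normal((obs_dim, latent_dim))
        self.heads = {
            name: 0.1 * rng.standard_normal((latent_dim, dim))
            for name, dim in action_dims.items()
        }

    def act(self, obs, embodiment):
        latent = np.tanh(obs @ self.trunk)        # embodiment-agnostic representation
        return latent @ self.heads[embodiment]    # embodiment-specific action space

policy = SharedTrunkPolicy(
    obs_dim=16,
    latent_dim=8,
    action_dims={"arm_7dof": 7, "hand_12dof": 12, "humanoid_20dof": 20},
)
obs = np.ones(16)
for name in policy.heads:
    print(name, policy.act(obs, name).shape)
```

Only the heads need to know the dimensionality of each action space; adding a new robot means adding a head rather than retraining the trunk from scratch.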

6.6 Summary: Three Stages of Maturation in Learning Paradigms

Synthesizing the evolution of VLA learning paradigms reveals a three-stage maturation model strikingly similar to the developmental trajectory of LLM training:

Stage	LLM	VLA	Core Contribution
Stage 1: Pre-training	Large-scale text corpus	Internet-scale images/video + robot trajectories	Formation of foundational capabilities
Stage 2: Supervised fine-tuning	SFT (instruction following)	BC (demonstration following)	Task execution capability
Stage 3: RL post-training	RLHF/DPO (alignment)	RL/preference optimization (alignment + transcendence)	Overcoming BC limitations, achieving optimal performance

VLA research is currently at the transition from Stage 2 to Stage 3. The results of RIPT-VLA [71] (4%→97% on a specific task per the original paper; LIBERO avg 74.7% per Jin et al. [9]) and SimpleVLA-RL [70] (17.3%→91.7%; original paper citation) suggest that this transition can deliver a paradigm-level leap, not merely incremental improvement. RL post-training appears likely to join BC as a standard component of the VLA training pipeline going forward.

At the same time, as the analogy with human motor learning suggests, learning is not a single-stage problem but a continuous lifelong process. Post-deployment self-improvement, adaptation to new environments, and knowledge accumulation without catastrophic forgetting — implementing these lifelong learning capabilities is a long-term challenge for VLA research and an essential requirement on the path toward truly general-purpose robotic systems.

Enhancements for Chapters 5--6

E-1. Motivation Chain for Learning Paradigms

Why VLA learning has evolved through three stages is best understood as a chain of motivations, in which each stage's limitations directly necessitate the next.

Stage 1 -- Pre-training: "See and understand the world." Motivation: A robot that has never seen a mug cannot be told to grasp one. Internet-scale image-text (and increasingly video) pre-training provides the broad semantic prior -- object recognition, spatial reasoning, physical commonsense -- that makes downstream robot learning sample-efficient.

Stage 2 -- Behavioral Cloning: "Imitate expert demonstrations." Motivation: World understanding alone does not produce motor commands. BC bridges perception and action by mapping observations to demonstrated actions via supervised learning. However, BC is bounded by demonstration quality and suffers from distribution shift and compounding errors.

Stage 3 -- RL Post-training: "Go beyond the demonstrations." Motivation: BC cannot discover better-than-demonstrated behaviors, handle multimodal action distributions gracefully, or incorporate safety and preference signals. RL post-training (PPO, GRPO, DPO) addresses all three, enabling the model to self-improve through trial-and-error or human feedback -- mirroring the LLM trajectory from SFT to RLHF.

Ongoing -- Lifelong Learning: "Never stop improving." Motivation: Deployment environments are non-stationary. The robot must adapt to new objects, tasks, and embodiments without catastrophic forgetting, closing the loop between experience and knowledge.

E-2. Comparison of Eight Action Token Types

#	Token Type	Abstraction	Control Freq.	Multimodal Dist.	Interpretability	Representative Models	Best Suited For
1	Language	Very High	1--3 Hz	N/A (discrete plans)	Excellent	SayCan [14], Inner Monologue [22]	High-level planning, navigation
2	Code	High	1--5 Hz	N/A	Excellent (debuggable)	Code as Policies [36], Voyager	Structured multi-step tasks
3	Affordance	Medium-High	Task-dependent	Partial	Good (spatial maps)	VoxPoser [57], A3VLM [58]	Novel object generalization
4	Trajectory	Medium	5--20 Hz	Partial	Good (visual traces)	RT-Trajectory [62], TraceVLA [40]	Path-centric manipulation
5	Goal	Medium	Task-dependent	Yes (via generation)	Moderate (future images)	SuSIE [59], UniPi [93], 3D-VLA [37]	Planning-heavy tasks, world-model integration
6	Latent	Low-Medium	10--50 Hz	Yes	Low	LAPA [61], UniVLA [80], VQ-VLA	Cross-embodiment transfer
7	Raw	Low	3--6 Hz (AR); ~50 Hz (FAST)	Limited (AR); Yes (diffusion)	Low	RT-2 [11], OpenVLA [15], FAST	Direct motor control
8	Reasoning	Variable (wraps others)	Adds latency	Depends on base type	Excellent (explicit CoT)	ECoT [92], ThinkAct [67], Embodied-R1	Complex multi-step decisions

Key takeaway: No single token type dominates. State-of-the-art VLA systems increasingly combine multiple types hierarchically -- e.g., reasoning tokens for planning + raw/latent tokens for execution.

E-3. Intuitive One-Liners

Action tokenization is to VLA what vocabulary design is to an LLM: it determines what the model can say and how precisely it can say it.
Behavioral Cloning is like learning to cook only by watching videos -- you can reproduce recipes you have seen, but you cannot improvise when an ingredient is missing.
RL post-training is the deliberate practice session after the lecture: the robot tries, fails, and refines until it surpasses the instructor.
Action chunking is the robotics equivalent of muscle memory: individual keystrokes become fluid words, individual timesteps become smooth trajectories.
FAST tokenization treats action sequences the way MP3 treats audio: transform to the frequency domain, discard redundancy, and compress dramatically.
Latent tokens are the Esperanto of robot actions: a universal language that lets different embodiments share the same motor concepts.
Catastrophic forgetting is the price of plasticity -- a VLA that adapts too eagerly to a new task may forget everything it learned before.
The sim-to-real gap is the uncanny valley of robotics data: simulation looks close enough to fool the eye, but not the physics.

E-4. Self-Check Questions

Q1. A VLA model trained with 256-bin raw action tokenization and cross-entropy loss consistently predicts the "average" action in bimodal scenarios (e.g., the robot hesitates instead of pushing left or right). (a) Explain the root cause of this failure mode. (b) Name two tokenization or decoding strategies that can mitigate it, and briefly explain why each works.

Q2. RIPT-VLA [71] reports a jump from 4% to 97% success rate after RL post-training (Su et al., 2025), yet Jin et al. [9] report a LIBERO average of 74.7% for the same model. Discuss at least two reasons why such a large discrepancy can arise when evaluating the same RL post-training method across different benchmarks or task suites.

Q3. Consider the neuroscientific analogy between basal ganglia chunking and VLA action chunking. (a) In what specific way does increasing the chunk size improve computational efficiency? (b) What practical risk does a large chunk size introduce, and how do systems like π0 [16] address it?

E-5. Open Research Questions

Adaptive tokenization selection. Can a single VLA model learn to dynamically choose the most appropriate token type (e.g., language for high-level planning, raw for fine manipulation) depending on the task phase, rather than relying on a fixed, hand-designed hierarchy?

Scaling laws for RL post-training. Pre-training scaling laws (more data leads to predictable improvement) are relatively well-characterized [6]. Do analogous scaling laws exist for RL post-training -- e.g., does doubling the number of online rollouts yield predictable gains, and if so, at what rate does the return diminish?

Forgetting-free continual RL. Current strategies for mitigating catastrophic forgetting (LoRA, selective unfreezing, gradient blocking) are largely borrowed from supervised learning. What RL-specific continual learning algorithms can allow a deployed VLA to acquire new skills indefinitely without degrading previously mastered ones?

Reward design beyond binary success. Most VLA RL post-training methods rely on sparse binary rewards (success/failure). How can dense, automatically generated reward signals -- from VLMs, from physics simulators, or from human preference models -- be made reliable and scalable enough to replace hand-crafted reward functions across diverse manipulation tasks?

Cross-embodiment action spaces at scale. Latent action tokens and task-space representations (UniAct, HPT [96]) show promise for cross-embodiment transfer, but have been validated on a limited set of morphologies. What architectural and representational innovations are needed to scale cross-embodiment generalization to the full diversity of real-world robots -- from 6-DoF arms to 30+-DoF humanoids -- without sacrificing per-embodiment performance?

7. Efficiency — An Essential Challenge for Real-World Deployment

VLA (Vision-Language-Action) models are demonstrating remarkable performance on academic benchmarks, yet deploying them on physical robots in the field is an entirely different order of problem. These models must perform real-time inference with billions of parameters, on constrained hardware, while operating safely and economically. This chapter maps the full landscape of the efficiency problem, centering on the efficient VLA survey by Yu et al. [4] (2025).

7.1 Why Efficiency: The Gap Between Reality and Ideal

The resource demands of current VLA models are at an impractical level from the perspective of real-world deployment.

Scale of training costs:

Training OpenVLA [15] required approximately 21,500 A100-GPU hours — equivalent to running a 64-GPU cluster continuously for about two weeks.
Training π0 [16] used over 10,000 hours of robot trajectory data. Collecting data at this scale independently within a single institution is essentially infeasible.

The inference latency wall:

RT-2-PaLI-X (55B) has an inference latency of 330–1000ms, corresponding to a control frequency of only 1–3Hz. This falls short even of the minimum frequency (5–10Hz) required for tabletop manipulation, let alone the 30Hz+ needed for dynamic tasks.
Even the comparatively efficient OpenVLA [15] (7B) has a latency of 166ms (approximately 6Hz), making it unsuitable for tasks requiring rapid response.
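
The figures above follow from simple arithmetic, which the short sketch below reproduces (the helper names are hypothetical): control frequency is the reciprocal of per-step latency, and wall-clock training time is GPU-hours divided by cluster size.

```python
def control_frequency_hz(latency_ms: float) -> float:
    """Control frequency when one action is produced per inference call."""
    return 1000.0 / latency_ms

def cluster_days(gpu_hours: float, num_gpus: int) -> float:
    """Wall-clock days needed to spend gpu_hours on num_gpus GPUs running in parallel."""
    return gpu_hours / num_gpus / 24.0

print(round(control_frequency_hz(166), 1))   # OpenVLA: about 6 Hz
print(round(control_frequency_hz(330), 1))   # RT-2-PaLI-X, best case: about 3 Hz
print(round(cluster_days(21_500, 64), 1))    # about 14 days on 64 GPUs
```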

Four key requirements for real-world deployment:

Requirement	Description	Current Gap
Latency	<100ms (10Hz+)	Most large VLAs fall short
Cost	Minimize cloud API costs	Large models incur excessive per-GPU costs
Privacy	On-device inference required	Data cannot be transmitted externally in home/medical environments
Energy	Power constraints for battery-powered robots	Must run on edge devices in the tens-of-watts range

To close these gaps, research on efficient VLAs exploded from late 2024 through 2025. The research directions fall broadly into three axes: model efficiency, training efficiency, and data efficiency.

7.2 Model Efficiency: Making Inference Faster and Lighter

Model efficiency is a collective term for techniques that reduce latency and memory at the inference stage of an already-trained VLA. Five strategies exist: quantization, pruning, knowledge distillation, token optimization, and efficient architectures.

7.2.1 Quantization

Quantization is the most direct technique for reducing memory and computation by lowering the numerical precision of model weights (and activations).

OpenVLA [15] 4-bit PTQ (Post-Training Quantization): Post-training quantization alone halved GPU memory usage with no observed performance degradation. This suggests that VLA model weights contain considerable numerical redundancy.
SQIL [114] (Shang et al., 2024): Applying 4-bit salience-aware quantization achieved 2.5× inference speedup. The key is identifying weights critical for action prediction and selectively maintaining higher precision for those weights.
BitVLA [33]: A study applying extreme low-bit quantization with ternary weights ({-1, 0, 1}), commonly described as 1-bit (about 1.58 bits per weight), reporting 3.36× memory compression. It is notable that meaningful action generation remains possible when weights are represented with only three values.
QAIL (Quantization-Aware Imitation Learning) [115] (Heo et al., 2025): Integrates quantization into the training phase to directly learn a model optimized for edge-device deployment.
SQAP-VLA [116] (Li et al., 2025): Co-designs quantization and token pruning together, achieving a better efficiency-performance balance than applying each technique individually.
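
The basic mechanics of weight quantization can be sketched with two generic schemes: symmetric per-tensor 4-bit rounding and ternary rounding with an absolute-mean scale. These are simplified textbook variants shown for illustration, assuming a nonzero weight tensor; they are not the specific algorithms of the cited methods, and the function names are hypothetical.

```python
import numpy as np

def quantize_int4(w):
    """Symmetric per-tensor 4-bit quantization: integer codes in [-8, 7] plus one scale."""
    scale = np.abs(w).max() / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def quantize_ternary(w):
    """Ternary quantization to {-1, 0, 1} using the mean absolute weight as the scale."""
    scale = np.abs(w).mean()
    q = np.clip(np.round(w / scale), -1, 1).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
w = rng.standard_normal((16, 16))

q4, s4 = quantize_int4(w)
q3, s3 = quantize_ternary(w)
print("4-bit max abs error:", np.abs(w - q4 * s4).max())
print("ternary values used:", np.unique(q3))
```

Dequantization is the product `q * scale`; the 4-bit variant keeps the per-weight error within half a quantization step, while the ternary variant trades much larger error for a far smaller footprint.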

7.2.2 Pruning

Pruning is a technique for lightweighting a model by removing unnecessary components (layers, neurons, tokens, etc.). In VLAs, research is particularly active based on the observation that LLM backbone layers exhibit high redundancy.

Layer-level pruning:

High cosine similarity is observed between the outputs of adjacent LLM layers, providing grounds for removing up to 50% of layers.
DeeR-VLA [35]: Uses a dynamic early-exit strategy. It checks the consistency of action predictions at each layer and skips the remaining layers once consistency is confirmed. A major advantage is that no additional training is required.
SmolVLA [32]: Takes an extremely simple approach, skipping L/2 of the LLM's layers. It demonstrates that manipulation tasks can be performed with only half the layers.
MoLe-VLA [117] (Qu et al., 2025): Uses a STAR router to dynamically select which layers to activate per input. Fewer layers are activated for easy tasks and more for complex ones, adaptively modulating computation.
EfficientVLA [118] (Niu et al., 2025): A framework that simultaneously applies layer pruning and visual token pruning without training.
FLOWER [119] (Cheng et al., 2025): In encoder-decoder VLMs, removes the entire decoder; in decoder-only architectures, removes the bottom 30% of layers.
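
The early-exit idea described for DeeR-VLA can be sketched as a loop that stops as soon as two consecutive layers predict nearly the same action. This is a minimal sketch under stated assumptions: the per-layer action predictions, the `tolerance` value, and the helper `lazy_layers` are hypothetical and stand in for a real model with an action head attached to intermediate layers.

```python
def early_exit_action(layer_actions, tolerance=0.01):
    """Return (action, layers_used); stop once two consecutive layers agree.

    layer_actions is any iterable that yields one action prediction per layer,
    in layer order. Later layers are never evaluated after an exit.
    """
    previous, depth = None, 0
    for depth, action in enumerate(layer_actions, start=1):
        if previous is not None:
            change = max(abs(a - b) for a, b in zip(action, previous))
            if change < tolerance:
                return action, depth
        previous = action
    return previous, depth


def lazy_layers(predictions, evaluated):
    """Hypothetical stand-in for running layers one by one."""
    for index, prediction in enumerate(predictions):
        evaluated.append(index)  # record which layers actually ran
        yield prediction


evaluated = []
predictions = [[0.5, 0.1], [0.30, 0.20], [0.301, 0.199], [0.3, 0.2], [0.3, 0.2]]
action, layers_used = early_exit_action(lazy_layers(predictions, evaluated))
print(action, layers_used, evaluated)
```

Because the layers are consumed lazily, the last two layers in this example are never computed, which is where the savings come from.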

Structured pruning:

RLRC [120] (Zhao et al., 2025): Structured pruning based on Taylor importance scores, achieving up to 90% sparsity while maintaining meaningful performance.

7.2.3 Distillation

Distillation — transferring the knowledge of a large VLA into a smaller model — can achieve higher performance than training a small model from scratch.

TinyVLA [34]: Distills from a large VLA into a sub-1.4B small model. Initializes with LoRA weights to improve distillation efficiency.
CEED-VLA [121] (Wen et al., 2025): Combines consistency distillation with Jacobi parallel decoding. Parallelizes the serial bottleneck of autoregressive token generation, significantly improving inference speed.
RPD (Robot Policy Distillation) [122] (Wang et al., 2025): Distills from a VLA into a small RL specialist policy. For specific tasks, a distilled specialist can be faster and more accurate than a general-purpose VLA.
SP-VLA [123] (Shen et al., 2025): Uses action-aware scheduling to dynamically switch between a heavy VLA and a lightweight action generator. The large VLA is invoked only at moments requiring complex judgment; the lightweight generator handles straightforward execution phases.

7.2.4 Token Optimization

In VLAs, visual tokens account for the majority of the entire input sequence. A single image is converted into hundreds of patch tokens, and with video input this number can explode into the thousands. Reducing this is the core of token optimization.

Visual token compression:

SmolVLA [32]: Compresses to 64 tokens per frame via pixel shuffle. Tokens that originally numbered in the hundreds are spatially rearranged and reduced to an extreme degree.
FlashVLA [52] (Zhu et al., 2025): Removes low-importance visual tokens via ICS (Importance-based Compression and Selection) pruning.
EfficientVLA [118]: Applies layer pruning and visual token pruning in an integrated fashion.
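
To make the token-count arithmetic of pixel-shuffle compression concrete, the sketch below merges each block of neighboring patch tokens into one wider token (a space-to-depth rearrangement). The 32×32 patch grid, the 4-dimensional features, and the ratio of 4 are assumptions chosen so that the result has 64 tokens; they are not taken from the SmolVLA implementation, and the learned projection that usually follows is omitted.

```python
def pixel_shuffle_tokens(grid, ratio):
    """Merge each ratio x ratio block of tokens into a single, wider token."""
    height, width = len(grid), len(grid[0])
    if height % ratio or width % ratio:
        raise ValueError("grid size must be divisible by ratio")
    merged = []
    for top in range(0, height, ratio):
        for left in range(0, width, ratio):
            token = []
            for dy in range(ratio):
                for dx in range(ratio):
                    token.extend(grid[top + dy][left + dx])
            merged.append(token)
    return merged


# Assumed sizes: 32 x 32 patch tokens with 4 features each.
grid = [[[float(y), float(x), 0.0, 1.0] for x in range(32)] for y in range(32)]
tokens = pixel_shuffle_tokens(grid, ratio=4)
print(len(tokens), len(tokens[0]))
```

The sequence length drops by a factor of ratio squared while the total number of values is unchanged, so no information is discarded at this step; the cost moves from sequence length to feature width.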

Visual token caching:

VLA-Cache [124] (Gao et al., 2025), CronusVLA [125] (Lin et al., 2025): Exploit temporal coherence: tokens corresponding to a static background change very little between consecutive frames. By caching unchanged background tokens and updating only the foreground tokens where change occurs, these methods achieve roughly 40-50% faster inference as summarized by Zhang et al. [6]; the original paper reports acceleration of up to 2× or more.
The fundamental reason this approach is effective is that most patch tokens are temporally redundant in robotic manipulation tasks. In tabletop environments with a fixed camera, more than 80% of the background is identical across frames.
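
The caching idea can be sketched as follows: a patch is re-encoded only when it differs from the same patch in the previous frame by more than a threshold. This is a simplified sketch, assuming each patch can be encoded independently; real systems cache intermediate states inside the model, and `toy_encoder`, the threshold, and the frame data are hypothetical.

```python
def encode_with_cache(patches, prev_patches, cache, encode, threshold=0.05):
    """Re-encode only patches whose mean absolute change exceeds threshold."""
    tokens, recomputed = [], 0
    for i, patch in enumerate(patches):
        is_static = False
        if prev_patches is not None and cache is not None:
            diff = sum(abs(a - b) for a, b in zip(patch, prev_patches[i])) / len(patch)
            is_static = diff <= threshold
        if is_static:
            tokens.append(cache[i])  # reuse the token from the previous frame
        else:
            tokens.append(encode(patch))
            recomputed += 1
    return tokens, recomputed


def toy_encoder(patch):
    """Hypothetical stand-in for a vision encoder: mean intensity of the patch."""
    return sum(patch) / len(patch)


frame0 = [[0.1, 0.1], [0.5, 0.5], [0.9, 0.9], [0.2, 0.2]]
frame1 = [[0.1, 0.1], [0.5, 0.5], [0.3, 0.4], [0.2, 0.2]]  # only patch 2 changed
tokens0, n0 = encode_with_cache(frame0, None, None, toy_encoder)
tokens1, n1 = encode_with_cache(frame1, frame0, tokens0, toy_encoder)
print(n0, n1)
```

In this toy case the first frame encodes all four patches and the second encodes one, which mirrors the fixed-camera tabletop setting where most of the background is static.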

7.2.5 Efficient Architectures

Efficient architectures are architecture-level innovations aimed at overcoming a fundamental limitation of the standard Transformer: attention whose cost grows quadratically with sequence length.

Linear-complexity architectures:

SARA-RT [126] (Shridhar et al., 2024): Up-trains standard softmax attention into linear attention, reducing complexity from O(n²) to O(n).
RoboMamba [127] (Liu et al., 2024): A VLA based on the Mamba SSM (Selective State Space Model), achieving more than 3× speedup at linear complexity. Its advantage over Transformers grows with longer sequences.

MoE (Mixture of Experts):

GeRM [128] (Xu et al., 2025): Applies MoE to RL for quadruped robots, activating only a subset of expert parameters from the full parameter set.
FedVLA [72] (Zhang et al., 2025): Implements efficient VLA in a federated learning setting using a dual-gating MoE.
DriveMoE [49] (Huang et al., 2025): Leverages MoE in the autonomous driving domain to assign experts to various driving scenarios.
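
The common mechanism behind these MoE models, activating only a subset of experts per input, can be sketched with top-k routing. The experts, gate logits, and value of `k` below are hypothetical; in a real model the gate logits come from a learned router, and none of this reproduces the specific gating of GeRM, FedVLA, or DriveMoE.

```python
import math


def top_k_moe(x, gate_logits, experts, k=2):
    """Evaluate only the k highest-scoring experts and mix their outputs."""
    ranked = sorted(range(len(experts)), key=lambda i: gate_logits[i], reverse=True)
    chosen = ranked[:k]
    peak = max(gate_logits[i] for i in chosen)
    scores = [math.exp(gate_logits[i] - peak) for i in chosen]  # stable softmax
    total = sum(scores)
    weights = [s / total for s in scores]
    output = sum(w * experts[i](x) for w, i in zip(weights, chosen))
    return output, chosen, weights


# Hypothetical experts and router scores, for illustration only.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: -x, lambda x: x * x]
gate_logits = [0.1, 2.0, -1.0, 2.0]
output, chosen, weights = top_k_moe(3.0, gate_logits, experts, k=2)
print(output, chosen, weights)
```

Only two of the four experts are evaluated here, so compute per input scales with `k` rather than with the total number of experts.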

Parallel decoding:

OpenVLA-OFT (built on OpenVLA [15]): Uses bidirectional attention to generate multiple action tokens simultaneously.
PD-VLA [129] (Chen et al., 2025): Parallelizes autoregressive decoding via Jacobi fixed-point iteration.
Spec-VLA [130] (Wu et al., 2025): Applies speculative decoding to VLAs, achieving 1.42× speedup. A small draft model rapidly generates candidate tokens, which a large model then verifies.
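
The Jacobi fixed-point view of decoding mentioned above can be illustrated with a toy deterministic "model". All positions are refined at once from the previous guess, and the iteration stops when the guess no longer changes; the fixed point equals the output of ordinary sequential greedy decoding. The function `next_token` is a hypothetical stand-in for a model's greedy next-token choice, not part of PD-VLA or CEED-VLA.

```python
def next_token(prefix):
    """Toy deterministic 'model': the next token depends on the whole prefix."""
    return (sum(prefix) * 31 + len(prefix)) % 17


def decode_sequential(prompt, length):
    """Ordinary autoregressive greedy decoding, one token per step."""
    tokens = list(prompt)
    for _ in range(length):
        tokens.append(next_token(tokens))
    return tokens[len(prompt):]


def decode_jacobi(prompt, length):
    """Refine every position in parallel until the guess stops changing."""
    guess = [0] * length
    for iteration in range(1, length + 1):
        # Each position uses only the previous guess, so in a real model
        # these evaluations could run as one batched forward pass.
        updated = [next_token(list(prompt) + guess[:i]) for i in range(length)]
        if updated == guess:
            return guess, iteration
        guess = updated
    return guess, length


tokens, iterations = decode_jacobi([3, 5], 6)
print(tokens, iterations)
```

After t iterations the first t positions are guaranteed correct, so the loop needs at most `length` iterations. A speedup appears only when the guess converges in fewer iterations than there are tokens, which depends on the model and is not guaranteed by this toy example.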

7.2.6 Efficient Attention

Yu et al. [4] additionally identify Efficient Attention techniques as a distinct research direction. These methods optimize the Transformer attention mechanism itself, operating on an independent axis from model compression approaches such as quantization and pruning.

KV-Efficient VLA: Compresses the KV cache using RNN-gated mechanisms, reducing the memory footprint of attention without discarding contextual information.
Long-VLA [73]: Addresses long-horizon tasks via phase-aware input masking, selectively attending to task-relevant temporal segments rather than processing the entire observation history uniformly.
RetoVLA [74]: Reuses register tokens across layers to avoid redundant computation in the attention mechanism.
dVLA [65]: Applies prefix attention masking for diffusion-based VLAs, enabling efficient conditioning while preserving the generative quality of the diffusion action head.

These approaches are complementary to the compression techniques described above and can be combined with quantization, pruning, or token optimization for compounding efficiency gains.

7.3 Training Efficiency: Learning More with Fewer Resources

Alongside lightweighting the model itself, research into improving the efficiency of the training process is also active.

Parameter-Efficient Fine-Tuning (PEFT):

PEFT methods including LoRA (Low-Rank Adaptation) train only 0.1–1% of total parameters while achieving performance comparable to full fine-tuning. This reduces GPU hours by approximately 70% and makes fine-tuning large VLAs feasible even on a single GPU.
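
A short calculation shows where a figure in the 0.1–1% range can come from. For a weight matrix of size d_out × d_in, LoRA trains two low-rank factors with rank × (d_out + d_in) parameters in total. The matrix size of 4096 × 4096 and the rank of 8 below are assumed values for illustration, not settings reported for any specific VLA.

```python
def lora_trainable_fraction(d_out, d_in, rank):
    """Fraction of a d_out x d_in weight matrix's parameter count that LoRA trains."""
    full = d_out * d_in
    adapter = rank * (d_out + d_in)  # B is d_out x rank, A is rank x d_in
    return adapter / full


# Assumed layer size and rank.
fraction = lora_trainable_fraction(4096, 4096, rank=8)
print(f"{fraction:.4%}")
```

For this layer the adapter is about 0.39% of the original matrix; the overall fraction for a model also depends on which layers receive adapters.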

Mixed training strategies:

Curriculum learning: A strategy of progressively increasing task difficulty from easy to hard.
Multi-stage training: π0 [16] uses a three-stage pipeline of (1) VLM pre-training → (2) robot data pre-training → (3) task-specific fine-tuning.

The innovation of FAST tokenization [20]:

FAST (Frequency-space Action Sequence Tokenization), proposed by Pertsch et al., applies DCT (Discrete Cosine Transform) followed by BPE (Byte Pair Encoding) to robot action sequences. This strongly compresses action sequences and accelerates pre-training by 5×. Using FAST tokens or latent actions instead of raw actions is a key trend in efficient action representation.
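
The DCT stage of this pipeline can be sketched in a few lines. A smooth action chunk is transformed to frequency space and the coefficients are rounded to integers, so that small high-frequency terms become zero. This is a minimal sketch, assuming a single action dimension, an 8-step chunk, and a quantization scale of 10; the BPE stage that would then compress the integer sequence is omitted, and the code is not the FAST reference implementation.

```python
import math


def dct(signal):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(signal)
    coeffs = []
    for k in range(n):
        norm = math.sqrt((1.0 if k == 0 else 2.0) / n)
        total = sum(x * math.cos(math.pi * (i + 0.5) * k / n) for i, x in enumerate(signal))
        coeffs.append(norm * total)
    return coeffs


def idct(coeffs):
    """Inverse of dct (orthonormal DCT-III)."""
    n = len(coeffs)
    signal = []
    for i in range(n):
        total = 0.0
        for k, c in enumerate(coeffs):
            norm = math.sqrt((1.0 if k == 0 else 2.0) / n)
            total += norm * c * math.cos(math.pi * (i + 0.5) * k / n)
        signal.append(total)
    return signal


def compress_chunk(actions, scale=10.0):
    """Quantize DCT coefficients to integers; small coefficients round to zero."""
    return [round(c * scale) for c in dct(actions)]


# Hypothetical 8-step trajectory of one joint.
chunk = [0.50, 0.52, 0.55, 0.57, 0.58, 0.58, 0.57, 0.55]
codes = compress_chunk(chunk)
restored = idct([c / 10.0 for c in codes])
print(codes, [round(v, 3) for v in restored])
```

Smooth trajectories concentrate their energy in the first few coefficients, so the integer sequence contains many zeros and compresses well in a subsequent BPE step.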

7.4 Data Efficiency: Learning More from Less Robot Data

The high cost of robot data collection is the most fundamental bottleneck in VLA research. Data efficiency research explores strategies that circumvent or alleviate this bottleneck.

Leveraging human video:

Models such as EgoVLA [131] (Chen et al., 2025), Being-H0 [138] (Li et al., 2025), and RynnVLA-001 use first-person (ego-centric) human activity videos — abundant on the internet — as surrogate training data. Manipulation strategies are learned from human hand movements and transferred to robot actions. The core insight of this approach is that humans and robots perform similar manipulation tasks in the same physical world.

Simulation data:

UniSim [101] (Yang et al., 2024), Genesis: Generate large-scale synthetic data from physics simulators.
GraspVLA [81] (Qian et al., 2025): Generates billion-scale synthetic grasping data for use in pre-training.

Data augmentation:

Language augmentation: Methods such as DIAL [90] paraphrase task instructions in diverse ways to improve the robustness of language understanding.
Visual augmentation: Methods such as GenAug [87], CACTI [88], and ROSIE [89] use generative models to expand visual diversity.
Trajectory augmentation: Methods such as DemoGen synthesize new trajectories from existing demonstration data.

Active data selection:

AMF (Active Model Feedback): Prioritizes data with high information gain to maximize training efficiency.
SWBT (Success Weighted by Trial): Includes failed attempts in training data, extracting useful signal even from failures.

Autonomous collection:

SOAR (Luo et al., 2025): Robots autonomously collect data under the guidance of foundation models. This presents a pathway to continuously acquiring training data without human demonstrators.

7.5 Comparison of Key Lightweight Models

The table below compares representative VLA models by parameter scale, inference performance, and core techniques. It shows that in roughly two years (from RT-2 in 2023 to SmolVLA in 2025), model size fell from 55B to 450M parameters while reported control frequency rose from 1Hz to 120Hz.

Model	Parameters	Inference Latency	Control Frequency	Core Technique
RT-2-PaLI-X	55B	330-1000ms	1-3Hz	Baseline (direct use of large VLM)
OpenVLA [15]	7B	166ms	6Hz	Open-source baseline
π0 [16]	3.3B	73ms	20-50Hz	Flow Matching action head
GR00T N1 [21]	2.2B	64ms	~120Hz (see note below)	Dual-system (slow VLM + fast policy)
NORA [132]	3B	—	—	FAST+ tokenization
CLIP-RT	~1B	—	—	Frozen CLIP [27]; +24% over OpenVLA [15]
EdgeVLA [133]	1B	—	—	Designed exclusively for edge devices
TinyVLA [34]	<1.4B	—	—	Distilled from large VLA
SmolVLA [32]	~450M	—	—	Trainable on a single GPU
BitVLA [33]	~2B (effective capacity reduced)	—	—	1-bit ternary quantization
DiVLA-2B [134]	2B	~12ms	82Hz	Runs on a single A6000 GPU
RoboMamba [127]	—	—	—	Mamba SSM-based linear complexity

※ The ~120Hz figure for GR00T N1 is a motor output frequency and should be distinguished from model inference frequency; it is not reported in Yu et al. [4] Table 1.

7.6 Key Insights — Rediscovering the Efficiency-Performance Trade-off

The insights emerging from efficient VLA research go beyond mere technical optimization, calling for a reconsideration of VLA design philosophy itself.

1) The scale inversion phenomenon: The result that CLIP-RT (~1B, built on CLIP [27]) outperforms OpenVLA [15] (7B) by 24% suggests that a simple reading of scaling laws ("more parameters guarantee better performance") may not hold in the robotics domain. Even a smaller model, when combined with appropriate representation learning and data-efficient fine-tuning, can surpass a much larger model.

2) Quantization is nearly a free lunch: The fact that 4-bit PTQ halves memory with no performance degradation means that considerable redundancy exists in current VLA weights. This provides a strong rationale for applying quantization by default at the deployment stage.

3) Hierarchical separation is uniquely well-suited to robotics: The asynchronous execution of a slow VLM (1–5Hz) + fast policy head (50Hz+) demonstrated by GR00T N1 [21] aligns naturally with the intrinsic structure of robot control. High-level semantic understanding does not need to be updated every frame, but low-level motor commands must be generated at high frequency. This "cognition slowly, action quickly" paradigm is also analogous to the structure of the human nervous system.

4) RL post-processing may offset compression losses: RIPT-VLA [71] demonstrated that RL post-processing can dramatically improve VLA performance: a BC/SFT (Behavioral Cloning / Supervised Fine-Tuning) baseline achieving only 4% was boosted to 97% through PPO post-processing (Su et al., 2025). Critically, this result is an improvement of a BC/SFT baseline via RL, not a recovery of performance lost to quantization or pruning, and results vary by benchmark. It therefore suggests, rather than validates, the feasibility of a "lightweight model + RL post-processing" pipeline.

5) Human video is a viable partial substitute for robot data: Research in the EgoVLA family shows that internet-scale human video can partially substitute for robot data. It is one of the most scalable pathways for circumventing the robot data collection bottleneck.

6) Dominant research trends: Efficient VLA research grew explosively from late 2024 through 2025. This reflects a shift in the research community from the strategy of "build it big first, shrink it later" toward "design it efficiently from the start."

7.7 Edge Deployment: System-Level Bottleneck Analysis

The 2026 Edge Embodied Foundation Models survey reframed VLA deployment as a systems engineering problem rather than a model compression problem. The "Deployment Gauntlet" proposed by this survey identifies seven coupled constraints that impede edge deployment: size, weight, power, memory traffic, compute latency, timing variability, and safety margins. These constraints interact to form a compound problem that no single optimization can resolve.

The central finding is that the nature of the bottleneck differs by controller architecture:

Autoregressive VLAs (RT-2, OpenVLA-class): Primarily constrained by memory bandwidth
Diffusion-based controllers (π0-class): Primarily constrained by compute latency and sustained execution cost

This analysis suggests that architectures separating "fast control" from "slow semantic reasoning" (GR00T N1, π0.5 [31]) are advantageous for edge deployment as well. Efficient deployment requires system-level co-design that holistically considers memory architecture, scheduling strategies, communication protocols, and model design.
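
The separation of "fast control" from "slow semantic reasoning" can be sketched as a two-rate loop in which a slow model refreshes a latent plan occasionally while a fast policy acts on every control tick. This is a synchronous simulation under stated assumptions: the 5Hz and 50Hz rates, the toy `slow_model`, `fast_policy`, and `observe` functions are hypothetical, and a real system would run the two components asynchronously on separate threads or devices.

```python
def run_dual_rate(duration_s, control_hz, reasoning_hz, slow_model, fast_policy, observe):
    """Call slow_model at reasoning_hz and fast_policy at control_hz."""
    ticks_per_plan = control_hz // reasoning_hz
    latent, actions, slow_calls = None, [], 0
    for tick in range(int(duration_s * control_hz)):
        obs = observe(tick)
        if tick % ticks_per_plan == 0:
            latent = slow_model(obs)  # expensive semantic update, done rarely
            slow_calls += 1
        actions.append(fast_policy(obs, latent))  # cheap update, done every tick
    return actions, slow_calls


# Hypothetical stand-ins for the perception source and the two models.
actions, slow_calls = run_dual_rate(
    duration_s=1,
    control_hz=50,
    reasoning_hz=5,
    slow_model=lambda obs: 0.1 * obs,
    fast_policy=lambda obs, latent: latent + 0.01 * obs,
    observe=lambda tick: float(tick),
)
print(len(actions), slow_calls)
```

With these rates the expensive model runs on one tick in ten, which is the property that makes the dual-system design attractive when memory bandwidth or compute latency is the binding constraint.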

7.8 Complementary Efficiency Taxonomies

Guan et al. [43] (2025) independently surveyed efficient VLAs and proposed a four-dimensional taxonomy: (1) model architecture, (2) perceptual feature extraction, (3) action generation mechanisms, and (4) training/inference strategies. While Yu et al. [4] focus on model compression (quantization, pruning, distillation) and efficient design, Guan et al. analyze perceptual feature extraction efficiency (e.g., multi-resolution token pooling, selective attention) and action generation efficiency (e.g., Action Chunking optimization, parallel decoding) as independent dimensions, which makes the two frameworks complementary.

This perspective explains why recent models such as FlashVLA [52] and RetoVLA [74] pursue efficiency across the entire perception-action pipeline rather than through model compression alone.

HyperVLA [135] (Park et al., 2026), proposed at ICLR 2026, uses hypernetworks to dynamically generate task-specific policies, accelerating inference. AutoQVLA [136] (Liu et al., 2026) achieves a 30% reduction in VRAM through improved quantization techniques. These works represent the cutting edge of the model efficiency techniques discussed in Section 7.2, demonstrating that quantization and architectural innovation remain active research directions.

8. Application Domains — The World VLA Is Making

VLA technology is spreading across a diverse range of robotic applications. Each domain has its own unique action space, safety requirements, and real-time constraints, and the manner in which VLAs are applied varies significantly as a result. This chapter surveys the major domains in which VLAs are currently being used, summarizing the current state of the art and the unique challenges of each area.

8.1 Tabletop Manipulation — The Mainstream Research Domain

Tabletop manipulation is the primary arena of VLA research. Over 70% of all VLA models are developed and evaluated in this domain.

Rapid improvement on benchmarks:

LIBERO: Success rate rose from 76.5% to 98.1% within 16 months.
CALVIN: Sequence length (number of consecutive tasks successfully completed) improved from 3.57 to 4.44.
RLBench, Meta-World: Used as standard evaluation platforms for diverse manipulation tasks.

Current state and remaining challenges: Single-step manipulation tasks have nearly reached a solved state, with success rates above 98%. However, long-horizon tasks — tasks requiring multiple manipulation steps to be performed in sequence — remain a core bottleneck. The fundamental cause is the compounding error problem, in which errors accumulate across steps.
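
The compounding error problem can be made concrete with simple arithmetic. If each step succeeds with the same probability and steps are treated as independent (a simplifying assumption, since real failures are often correlated), the chance of completing the whole sequence is the per-step success rate raised to the number of steps.

```python
def chain_success(per_step_success, num_steps):
    """Probability that every step succeeds, assuming independent steps."""
    return per_step_success ** num_steps


for steps in (1, 5, 10, 20):
    print(steps, round(chain_success(0.98, steps), 3))
```

Under this assumption, a policy that is 98% reliable per step completes a 10-step task only about 82% of the time, which is why long-horizon tasks remain hard even when single-step benchmarks look nearly solved.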

Specialized manipulation research:

Bimanual manipulation: Models such as Bi-VLA (Xue et al., 2025) and ALOHA (Zhao et al., 2023) address the coordinated control of two arms. The action space doubles compared to a single arm, and synchronization between the two arms is the key challenge.
Contact-rich manipulation: Models such as ForceVLA [78] (Lee et al., 2025) and TactileVLA (Kim et al., 2025) integrate force/tactile sensors into VLAs, detecting properties of objects — such as stiffness, weight, and slippage — that cannot be discerned through vision alone.
Dexterous grasping: Models such as DexVLA (Wen et al., 2025) and DexVLG (Zhang et al., 2025) use VLAs to learn high-dimensional control of multi-finger hands. With degrees of freedom exceeding 20, the complexity of the action space increases sharply.

8.2 Humanoid Robots — The Challenge of Whole-Body Control

Applying VLAs to humanoid robots involves qualitatively different challenges from tabletop manipulation.

Fundamental difficulties:

30+ degrees of freedom: Control of the entire body — arms, legs, torso, and head — is required.
Balance maintenance: The dynamic balance of bipedal locomotion demands fast responses on the order of milliseconds.
Simultaneous locomotion and manipulation: Tasks such as picking up an object while walking require locomotion and manipulation to be performed simultaneously.
Multi-contact-point management: Coordination is required across all contact points — feet, hands, and sometimes the torso — as they interact with the environment.

Key models:

GR00T N1 [21] (NVIDIA): Positioned as a foundation model for humanoids. Its dual-system architecture achieves a high control frequency of ~120Hz, targeting general-purpose whole-body control. This figure is a motor output frequency, distinct from model inference frequency, and is not reported in Yu et al. [4] Table 1.
Humanoid-VLA [137] (Zhang et al., 2025): Performs pose estimation from human videos available online to secure diversity in motion. It uses human movement directly as reference data.
Being-H0 [138] (Li et al., 2025): Uses ego-centric video as pre-training data to strengthen the ability to understand the environment from a first-person perspective.
FP3 [139] (Chen et al., 2025): Strengthens spatial reasoning through 3D policy pre-training.

Key unsolved challenges: The simultaneous achievement of balance maintenance and precise manipulation remains one of the hardest challenges for current VLAs. Situations frequently arise in which the fast, reflexive control needed for balance conflicts with the deliberate, planned control needed for manipulation, and harmonizing these within a single unified model is the central challenge.

8.3 Autonomous Driving — Another Frontier for VLA

Autonomous driving is the second largest application domain for VLAs. According to the taxonomy of Jiang et al. [10], autonomous driving VLAs have undergone four stages of evolution.

The four stages of evolution:

1) VLM as Explainer: The VLM is used to generate descriptions of driving scenes and rationales for decisions. A separate module handles control.

2) Modular VLA: The VLM's outputs are fed into modules of an existing autonomous driving pipeline (perception → prediction → planning).

3) Unified E2E VLA: A single model integrates everything from camera input to steering/acceleration output.

4) Reasoning-Augmented VLA: Integrates CoT (Chain-of-Thought) reasoning to make the decision-making process transparent.

Key models:

EMMA [46] (Hwang et al., 2024): Waymo's end-to-end driving model using a Gemini backbone.
ORION [47] (Wang et al., 2024): Combines a memory mechanism and CoT reasoning to leverage past driving experience.
DriveMoE [49] (Huang et al., 2025): Uses a MoE structure to assign experts to various driving scenarios (highway, intersection, parking, etc.).
AutoVLA [48] (Chen et al., 2025): Uses adaptive CoT — fast reasoning in simple situations, deeper reasoning in complex ones.

Driving vs. manipulation: key differences

Autonomous driving and robotic manipulation both share the VLA framework, but inherently involve very different challenges.

Dimension	Robotic Manipulation	Autonomous Driving
Action space	3D gripper position/orientation (6-7 DoF)	Steering/acceleration + BEV path + high-level route (multiple abstraction levels)
Spatial scale	Tabletop (~1m)	Urban scale (hundreds of meters to several km)
Real-time requirement	5-50Hz	30Hz+ required (automotive hardware standard)
Safety criticality	Object damage at most	Legal/physical risk to human life
Consequence of hallucination	Grasp failure (retryable)	Risk to human life (irreversible)
Social interaction	Almost none	Yielding, merging, inferring other drivers' intent required

SafeAuto [140] (Li et al., 2025) introduces a symbolic veto that blocks execution when the VLA's output violates a safety rule.
LangCoop V2V [141] (Wei et al., 2025) addresses social interaction by sharing intent through vehicle-to-vehicle (V2V) natural language communication.
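
The idea of a symbolic veto can be sketched as a rule check that sits between the model's proposed action and the actuators. The rule names, state fields, and fallback action below are hypothetical and chosen only for illustration; they do not reproduce the rule set or mechanism of SafeAuto.

```python
def apply_veto(state, action, rules, fallback):
    """Return the proposed action unless a rule flags it; otherwise the fallback."""
    violated = [name for name, is_violated in rules.items() if is_violated(state, action)]
    if violated:
        return fallback, violated
    return action, []


# Hypothetical safety rules expressed as predicates over (state, action).
rules = {
    "speed_limit": lambda s, a: a["speed_mps"] > s["limit_mps"],
    "red_light": lambda s, a: s["light"] == "red" and a["speed_mps"] > 0.0,
}
safe_stop = {"speed_mps": 0.0}
proposed = {"speed_mps": 8.0}

chosen, reasons = apply_veto({"limit_mps": 13.9, "light": "red"}, proposed, rules, safe_stop)
print(chosen, reasons)
```

Keeping the rules outside the learned model means a violation is blocked regardless of how confident the model is, and the list of violated rules gives a readable reason for each veto.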

Benchmarks and remaining gaps: Benchmarks such as BDD100K, nuScenes, Bench2Drive, and Reason2Drive exist, but a key gap is the absence of a comprehensive "AI driving license" benchmark. There is as yet no standard that comprehensively evaluates diverse scenarios, safety judgment, and ethical dilemmas — analogous to a human driving license test.

8.4 Drones and Navigation

The application of VLAs is also expanding to aerial and ground mobile robots.

CognitiveDrone [142] (Wang et al., 2025): A cognitive drone system that controls drones according to natural language instructions. It interprets and executes instructions such as "go around to the right of that red building."
RaceVLA [143] (Zhao et al., 2025): Applies VLAs in the high-speed environment of drone racing. This is an extreme testbed in which millisecond-level reaction times and precise path tracking are simultaneously required.
NaVILA [144] (Cheng et al., 2025), Uni-NaVid [145] (Zhang et al., 2024): Apply VLAs to indoor navigation of legged robots. They understand instructions such as "go to the kitchen and bring the red cup," performing path planning and obstacle avoidance.
Mobility VLA [146] (Liu et al., 2025): A VLA for wheeled mobile robots that integrates autonomous navigation in indoor and outdoor environments with object interaction.

The common challenge in this domain is integrating real-time 3D navigation and dynamic obstacle avoidance within a single language-vision-action framework.

8.5 Medical and Surgical Robots

Medicine is a domain that simultaneously holds high potential for VLA application and the most stringent constraints.

Representative research:

RoboNurse-VLA [147] (Li et al., 2024): Targets precision grasping in surgical environments. The precise grasping and transfer of surgical instruments is the core task.

Domain-specific constraints:

Patient data privacy: Since the external transmission of medical data is legally restricted, on-premise inference is mandatory. VLA deployment strategies that rely on cloud APIs cannot be used in this domain.
Small data problem: Data for specific surgical procedures or individual patients is inherently limited in quantity. The ability to effectively fine-tune a large pre-trained backbone on small data is critical.
Safety-critical systems: Malfunction of a surgical robot directly affects patient life. Requirements for formal verification and safety assurance are more stringent than in any other domain.

These constraints mean that every dimension of efficiency (Chapter 7) — model lightweighting, on-device inference, and data efficiency — is especially urgent in the medical domain.

8.6 Agriculture and Industry

The potential of VLAs is also being explored in industrial applications of high practical value.

Orchard apple harvesting (Zhang et al. [6]): A system in which a robot selectively harvests fruit according to natural language instructions ("harvest only ripe apples"). Simultaneous visual understanding in unstructured environments (branches, leaves, varying lighting) and gentle grasping are required.
CIPHER (Park et al., 2025): A system that switches 3D printing inspection tasks via natural language instructions. It dynamically modifies the inspection procedure in response to instructions such as "inspect the surface quality of this part." This is a case of implementing process flexibility in industrial settings via VLA.
ObjectVLA [148] (Chen et al., 2025): A VLA that can manipulate novel objects without prior demonstrations. It eliminates the cost of collecting demonstration data each time a new part or product is introduced on the factory floor.

The common requirement in the industrial domain is flexibility. In industrial settings where product types, task contents, and environmental conditions change frequently, the ability of VLAs to switch tasks on natural language instructions alone is of high practical value.

8.7 Interactive AR and GUI Agents

Beyond physical robots, the "action generation" capability of VLAs also extends to the autonomous manipulation of digital interfaces.

ShowUI [149] (Lin et al., 2024): Implements a GUI (Graphical User Interface) agent using the VLA framework. It understands the visual content of a screen and generates actions such as clicking, scrolling, and typing in response to instructions like "open the settings menu and turn off Wi-Fi."
Spatial grounding: Leverages the spatial understanding capability of VLAs to accurately place virtual objects in the physical world within AR (Augmented Reality) environments.
Human-AI collaborative navigation: A scenario in which users and AI collaborate to navigate complex environments within augmented reality.

This domain demonstrates that the core components of VLA — visual understanding, language reasoning, and action generation — can serve as a powerful framework in domains beyond physical robots. It is an extension of the definition of "action" from physical motor commands to manipulation of digital interfaces.

8.8 Cross-Domain Comparative Summary

Domain	Action Space	Safety Level	Real-Time Requirement	Data Availability	VLA Maturity
Tabletop manipulation	6-7 DoF	Low	5-50Hz	Abundant	High
Humanoid	30+ DoF	Medium	50-120Hz	Scarce	Early
Autonomous driving	Multiple abstraction levels	Very high	30Hz+	Abundant	Medium
Drone/navigation	4-6 DoF	Medium	30Hz+	Medium	Early
Medical/surgical	6-7 DoF	Very high	10-30Hz	Very scarce	Very early
Agriculture/industry	6-7 DoF	Low–medium	5-10Hz	Scarce	Early
GUI agents	Digital manipulation	Low	Real-time not required	Abundant	Medium

※ These frequency ranges are the author's synthesis of general domain requirements, not directly from individual surveys.

The pattern that stands out in this table is that VLA maturity tends to rise with data availability and fall as safety requirements tighten. Progress is fastest in tabletop manipulation, where data is abundant and safety constraints are low, and slowest in medicine, where data is scarce and safety is paramount. Closing this gap is a core challenge that must be addressed in the next stage of VLA research.

Motivation Chain: The Logic of Efficiency Research

Large VLAs are undeployable (RT-2 55B: 330-1000ms inference, hundreds of GB memory required)
--> Model reduction research begins (OpenVLA [15] 7B: 1/8 the size, performance retained)
--> 7B is still too heavy (16-24GB VRAM, 166ms latency)
--> Extreme lightweighting (SmolVLA [32] 450M, BitVLA [33] 1-bit quantization)
--> Concern over performance degradation from compression
--> Recovery via RL post-processing (lightweight model + RL = large-model-level performance)
--> Dynamic inference (DeeR-VLA [35]: shallow layers for easy inputs, deep layers for hard inputs)
--> Token optimization (FAST [20], VLA-Cache: reduce the amount of computation itself)

Efficiency Technique Comparison: Key Differentiators

Technique	Core Principle	Representative Models	Compression Effect	Performance Impact
Quantization	Reduce bit-width of weights	BitVLA [33] (1-bit), SQIL (INT4)	Memory 3.36x reduction	Mild degradation
Pruning	Remove unnecessary layers/neurons	SmolVLA [32] (L/2 removal), FLOWER	Up to 50% layer removal	Task-dependent
Distillation	Transfer knowledge: large --> small	TinyVLA [34]	Drastic parameter reduction	Approaches teacher performance
Token Optimization	Reduce visual/action token count	FAST [20], VLA-Cache, VOTE	5-13x token reduction	Performance maintained
Efficient Architecture	Improve attention/structure itself	SARA-RT, MoLe-VLA	2-5x inference speedup	Design-dependent
Dynamic Inference	Adapt computation to input difficulty	DeeR-VLA [35]	30-50% average compute reduction	No retraining required

Intuitive One-Liners: Efficiency and Applications

Quantization: "Compressing a high-resolution photo to JPEG — the file shrinks dramatically, but it looks nearly the same to the eye."
Pruning: "Trimming dead branches from a tree — the tree (model) becomes lighter and sways better in the wind (inference)."
Distillation: "A master craftsman passing skills to an apprentice — the apprentice is smaller but preserves the core techniques."
Token Caching (VLA-Cache): "Not redrawing the background every frame, just caching it — only the parts that change are recomputed."
Dynamic Inference (DeeR-VLA [35]): "An exam strategy: answer easy questions quickly and move on, reserving deep thought only for hard problems."
MoE (Mixture-of-Experts): "Not every doctor sees every patient — patients are routed to the relevant specialist. Each expert is small, but the collective capability is large."

Self-Check Questions: Sections 7-8

Q1: How does BitVLA's [33] 1-bit (ternary) quantization work, and why is performance preserved?

Answer: BitVLA [33] constrains model weights to three values: {-1, 0, +1} (ternary quantization). This replaces multiplication operations with additions and subtractions, drastically reducing memory and computation (3.36x compression). Performance is preserved because (1) quantization-aware training (QAT) compensates for quantization error during the learning process, and (2) VLA action outputs often do not require high numerical precision.

Q2: Identify three key domain differences between tabletop manipulation VLAs and autonomous driving VLAs.

Answer: (1) Safety level: Tabletop failures amount to object damage at most, whereas autonomous driving failures can cause loss of human life. (2) Control frequency: Tabletop tasks are adequately served at 5-50Hz, whereas autonomous driving mandates 30Hz+ real-time response as an automotive hardware standard. (3) Environmental diversity: Tabletop operates in a constrained workspace, whereas autonomous driving must handle infinitely diverse road conditions, weather, and traffic scenarios. These differences cause the two domains to evolve independently despite sharing the same underlying VLA architecture.

Q3: What does it mean that the LIBERO benchmark has reached "saturation," and what are the implications for VLA research?

Answer: Success rates on LIBERO-Object (99.8%) and LIBERO-Spatial (98.8%) have approached 100%, meaning this benchmark can no longer discriminate between model capabilities. This implies: (1) simple single-environment manipulation tasks have been effectively solved by VLAs, but (2) long-horizon composite tasks such as LIBERO-Long (96.6%) remain challenging, and (3) the community needs next-generation benchmarks that measure real-world generalization rather than in-distribution performance.

Open Research Questions: Sections 7-8

Theoretical limits of efficiency: What is the theoretical upper bound on compression achievable without performance degradation in VLAs? From an information-theoretic perspective, what is the minimum number of bits genuinely required for robot action generation?

Cross-domain transfer of efficiency techniques: Do efficiency techniques developed for tabletop VLAs transfer directly to autonomous driving or medical robotics? How should efficiency strategies be adapted to domain-specific characteristics?

Medical robot VLAs: Given the extreme safety requirements and data scarcity of the medical domain, what breakthroughs are needed for practical deployment of VLAs in surgical and clinical settings?

Benchmark design post-saturation: After LIBERO saturation, what properties must next-generation benchmarks possess to meaningfully measure real-world generalization capability?

9. Datasets, Benchmarks, and Simulators

Progress in VLA research is not driven by architectural innovation alone. Large-scale datasets, reliable benchmarks, and realistic simulators must form a unified triad for research to advance. This chapter provides a systematic overview of the data infrastructure that underpins the VLA ecosystem.

9.1 Robot Learning Datasets

9.1.1 The Rise of Large-Scale Cross-Embodiment Datasets

In the early days of VLA research, it was common for individual research groups to construct small-scale datasets using their own robots and environments. MIME, RoboTurk, and RoboNet are representative products of this era. These datasets contained on the order of thousands to tens of thousands of episodes and focused on specific robot platforms and a limited range of tasks. However, unlocking the potential of large pretrained models required far larger and more diverse data.

The paradigm shift was spearheaded by the Open X-Embodiment (OXE) [19] dataset. Assembled through a collaboration among 21 research institutions, it consolidates over one million episodes collected across 22 robot platforms into a single unified format, a dataset that deserves to be called the ImageNet of VLA research. Covering more than 527 skills, OXE [19] was the first large-scale demonstration of the viability of cross-embodiment learning. When RT-2-X was trained on OXE [19], it achieved more than a 50% performance improvement over models trained on a single dataset, dramatically illustrating the power of data diversity.

The following table summarizes the major robot learning datasets.

Dataset	Scale	Robot Platform	Key Characteristics
Open X-Embodiment (OXE) [19]	1M+ episodes, 22 robot embodiment types, 60+ constituent datasets	22 platforms	Largest cross-embodiment dataset; covers 527+ skills
BridgeData V2	71 tasks	WidowX	Cross-domain language annotations; diverse environments
DROID	564 scenes	Various	"In-the-wild" teleoperation; real-world environmental diversity
RT-1 [12] Kitchen	130K+ real demonstrations	Everyday Robots	700+ everyday activities; large-scale real-world collection
BC-Z	25K+ episodes	7-DoF robot arm	100 tasks; semi-autonomous collection protocol
MIME, RoboTurk, RoboNet	Various (thousands–tens of thousands)	Various	Early benchmark datasets; historical significance
RH20T	147 tasks	Various	Supports one-shot learning
EgoDex	829 hours	Human hands	Dense 3D hand/finger tracking; dexterity learning
Ego4D / EPIC-Kitchens	Thousands of hours	Humans	Egocentric video; for VLM pretraining
GraspVerse (source outside the 14 surveys)	1B+ samples	Simulation	Synthetic grasp data; large-scale synthetic generation

9.1.2 Strategic Use of Human Video Data

The cost of collecting robot data is overwhelmingly higher than that of internet text or images. One of the key strategies for circumventing this bottleneck is the use of human video data. Ego4D (3,670 hours), EPIC-Kitchens (100+ hours), and EgoDex (829 hours) capture manipulation activities performed by humans in everyday settings from an egocentric viewpoint, providing rich visual prior knowledge about "how to handle objects" without requiring the robot to collect it directly.

The case of GR-2, which achieved superior performance via human video pretraining followed by robot fine-tuning, and the case of HPT [96] (Wang et al., 2024), which improved cross-embodiment generalization by mixing human hand data with robot data, validate this strategy. However, the domain gap arising from morphological differences between humans and robots remains an outstanding challenge. EgoDex's dense 3D finger tracking data represents a concrete attempt to narrow this gap, providing human data in a form that can be directly applied to precise control of robotic hands.

9.1.3 Synthetic Data and Autonomous Collection

Another solution to the data bottleneck is synthetic data generation and autonomous collection. GraspVerse (source outside the 14 surveys) generated more than one billion synthetic grasp data samples, enabling large-scale pretraining in simulation. Autonomous collection pipelines such as SOAR (Luo et al., 2025; Self-supervised Autonomous Robot) demonstrate the possibility of expanding datasets without human supervision, through a self-supervised loop in which the robot explores, collects data, and labels it autonomously.

Of particular note is the fact that SmolVLA [32] (450M parameters), by focusing on data quality and curriculum, achieved performance competitive with far larger models. This suggests that data quality, diversity, and the design of the training curriculum may matter more than simple data scale — foreshadowing a paradigm shift from "more data" to "better data."

9.2 Simulation Benchmarks

9.2.1 Manipulation Benchmarks

Simulation benchmarks are a critical piece of infrastructure that enables systematic performance comparison of VLA models and rapid prototyping prior to real-world experiments. The following table summarizes the state of major benchmarks.

Benchmark	Domain	Key Metric	2025 State-of-the-Art
LIBERO-Spatial/Object/Goal/Long	Manipulation	Success rate (%)	Spatial 98.8%, Object 99.8%, Goal 98.2%, Long 96.6%
CALVIN	Multi-step manipulation	Average sequence length (0–5)	4.44 (DreamVLA [64])
RLBench / RLBench2	RGB-D manipulation	Success rate	Various
Meta-World	Multi-skill	Success rate	Various
SIMPLER	Sim-to-real transfer	Calibrated success rate	Various
THE COLOSSEUM	Distribution-shift robustness	Success rate	Various
VLABench	Language-conditioned manipulation	Success rate	Various
MIKASA-Robo	Memory-centric	Partial-observation manipulation success rate	Various

The LIBERO suite is one of the most widely used benchmarks in VLA research, offering four sub-benchmarks of increasing difficulty: Spatial (spatial relationship understanding), Object (object identification), Goal (goal achievement), and Long (long-horizon tasks). As of 2025, Spatial and Object have nearly reached saturation (98–99%). This indicates that VLA models have already achieved sufficient performance on short-horizon, simple manipulation tasks, suggesting that research focus should shift toward more complex long-horizon tasks.

CALVIN is a benchmark for evaluating multi-step sequential tasks, measuring the model's ability to execute up to five consecutive instructions. Performance is measured by Average Sequence Length, with DreamVLA [64] (Wen et al., 2025) recording the top score of 4.44. The dominant performance of world-model-based methods (VIP, DreamVLA [64], WorldVLA [76] (Chen et al., 2025)) on this benchmark corroborates the importance of future-prediction capabilities in long-horizon tasks.
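
The Average Sequence Length metric can be made concrete with a small sketch. It assumes each rollout is logged as a list of five per-instruction success flags and that a chain ends at the first failure; the function names and the sample rollouts are hypothetical.

```python
def completed_in_a_row(step_successes):
    """Count consecutive successes from the start of an instruction chain."""
    count = 0
    for ok in step_successes:
        if not ok:
            break  # the chain stops at the first failed instruction
        count += 1
    return count


def average_sequence_length(rollouts):
    """Mean number of consecutively completed instructions per rollout."""
    return sum(completed_in_a_row(r) for r in rollouts) / len(rollouts)


# Hypothetical rollouts, each a chain of five instructions.
rollouts = [
    [True, True, True, True, True],       # all 5 completed
    [True, True, True, False, False],     # 3 completed
    [True, True, True, True, False],      # 4 completed
    [False, False, False, False, False],  # 0 completed
]
avg_len = average_sequence_length(rollouts)  # (5 + 3 + 4 + 0) / 4 = 3.0
```

Under this definition a score of 4.44 means that, on average, a model completes well over four of the five chained instructions before its first failure.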

Next-Generation Benchmarks (2025–2026)

New evaluation frameworks are emerging to address the saturation of existing benchmarks:

RoboArena [150] (Li et al., 2025): A real-world-to-simulation automatic conversion framework that reproduces real-world tasks in simulation environments, enabling large-scale benchmarking.
RoboCasa365 [151] (Nasiriany et al., 2025): A large-scale household environment benchmark comprising 365 tasks and over 2,000 kitchen scenes.
WorldGym [152] (Zhang et al., 2025): A new paradigm that leverages action-conditioned world models as evaluation environments.
WorldBench [41] (Hu et al., 2025): A unified evaluation platform for autonomous driving VLAs, integrating open-loop and closed-loop evaluation.

These benchmarks extend the evaluation paradigm beyond the saturation of LIBERO and CALVIN, toward measuring real-world generalization and domain diversity.

9.2.2 Autonomous Driving Benchmarks

Benchmarks for autonomous-driving VLAs have unique requirements distinct from those of manipulation. Safety, real-time performance, and adherence to social norms are the core evaluation axes.

Benchmark	Domain	Key Metric	Characteristics
Bench2Drive	Autonomous driving (CARLA)	Closed-loop route completion rate	220 routes, 44 scenarios
nuScenes / nuPlan	Autonomous driving (real-world)	L2 trajectory error	Large-scale real-world data
Reason2Drive	Driving reasoning	CoT QA consistency	600K video-text pairs (with CoT QA annotations); evaluates reasoning process

Bench2Drive (Jia et al., 2024) is a closed-loop benchmark built on the CARLA simulator that evaluates agents' comprehensive driving capabilities across 220 routes and 44 scenarios. Models that score highly under open-loop evaluation frequently fail in the closed-loop setting, underscoring the necessity of closed-loop evaluation. Reason2Drive (Nie et al., 2024) is a new-paradigm benchmark that goes beyond simple path tracking to evaluate the reasoning process behind why a particular action was chosen.

9.3 The Simulator Ecosystem

The simulator ecosystem supporting VLA research has evolved diversely across domains.

Manipulation simulators:

MuJoCo: The de facto standard for physics simulation. Its strengths are fast computation speed and accurate contact dynamics.
SAPIEN: A simulator specialized for articulated object manipulation, supporting everyday-environment interactions such as opening drawers and operating faucets.
RLBench: A benchmark-cum-simulator based on CoppeliaSim, providing more than 100 predefined tasks.
AI2-THOR / Habitat: Simulators that combine indoor navigation with manipulation and serve as the primary platforms for embodied AI research.
Isaac Gym (NVIDIA): Supports GPU-accelerated large-scale parallel simulation, running thousands of environments simultaneously; it is optimized for RL training.

Autonomous driving simulators:

CARLA: The leading open-source autonomous driving simulator. Supports diverse weather, traffic scenarios, and sensor modalities.
nuPlan: A closed-loop planning benchmark and simulator based on the nuScenes dataset.

Next-generation general-purpose simulators:

Genesis (source outside the 14 surveys): A GPU-accelerated physics engine that integrates various physics solvers and targets general-purpose robot simulation. Claims a 10–100× speedup over existing simulators.
UniSim [101] (Yang et al., 2024): A "learned simulator" based on action-conditioned video diffusion, which learns environment dynamics directly from data without an explicit physics engine. This is an innovative approach that circumvents the realism limitations of traditional simulators.

Key gap: The greatest current limitation of the simulator ecosystem is the absence of a unified cross-embodiment, cross-task benchmark. Because each simulator uses its own task definitions, robot models, and evaluation protocols, directly comparing results reported across different simulators is practically impossible. This means that the unifying benchmark role that ImageNet played in computer vision has yet to be filled in robot learning.

9.4 Limitations of Evaluation Protocols and Directions for Improvement

9.4.1 The Reproducibility Crisis

Success-rate figures reported in VLA research are often misleading. In some studies, simply changing the random seed causes success rates to vary by more than 30%. This means that reported "state-of-the-art" numbers may not be statistically significant. The initial conditions of the environment, subtle variations in object placement, and non-determinism in the simulator's physics engine are among the causes of this variance.

9.4.2 The Simulation–Real-World Gap

The phenomenon of models that achieve high success rates in simulation failing in the real world remains pervasive. The main causes are inaccurate contact dynamics, limited visual realism, and unpredictable disturbances in the real world. The SIMPLER benchmark attempts to provide calibrated sim-to-real evaluation, but has yet to offer a fundamental solution.

9.4.3 Monolithic Evaluation Metrics

Most current benchmarks rely on a single metric — "success rate." However, this is insufficient when considering real-world deployment.

Collision avoidance: Even if a task is completed, unnecessary collisions with the environment are dangerous.
Failure recovery: The ability to return to a safe state upon failure is rarely reported.
Energy efficiency: Completing the same task with less energy is practically important.
Adversarial robustness: Resistance to intentional perturbations is essential in safety-critical applications.
Inference latency: A model's inference speed determines whether real-time control is feasible, yet it is rarely reported systematically alongside success rates.

In autonomous driving, this problem is even more acute. There is no integrated "AI driving license" benchmark that simultaneously evaluates control safety and language fidelity, and proxy metrics such as open-loop L2 error have been repeatedly noted to correlate weakly with actual driving safety.
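
To show what reporting beyond a single success rate might look like, the sketch below aggregates several of the quantities listed above from per-episode logs. The `Episode` fields, the metric names, and the sample values are all hypothetical; no existing benchmark is claimed to use this schema.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    success: bool
    collisions: int        # unintended contacts with the environment
    energy_joules: float   # energy consumed during the episode
    latency_ms: float      # mean per-step inference latency


def report(episodes):
    """Aggregate several deployment-relevant metrics, not only success rate."""
    n = len(episodes)
    return {
        "success_rate": sum(e.success for e in episodes) / n,
        # Success only counts here if the episode had no collisions at all.
        "collision_free_success_rate": sum(
            e.success and e.collisions == 0 for e in episodes
        ) / n,
        "mean_energy_joules": sum(e.energy_joules for e in episodes) / n,
        "worst_latency_ms": max(e.latency_ms for e in episodes),
    }


# Hypothetical evaluation log.
episodes = [
    Episode(True, 0, 120.0, 18.0),
    Episode(True, 2, 150.0, 22.0),
    Episode(False, 1, 90.0, 35.0),
    Episode(True, 0, 110.0, 19.0),
]
metrics = report(episodes)
```

In this toy log the plain success rate is 0.75, while the collision-free success rate is only 0.5, which is the kind of gap a single-metric report hides.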

9.4.4 Proposed Improvements

To overcome these limitations, we propose a two-track evaluation framework.

(i) Simulation track: Standardized simulation evaluation sharing fixed seeds, data splits, and baseline models. Experimental configurations should be published in a fully reproducible form so that all research can be compared under identical conditions. Reporting mean and variance across a minimum of ten seeds should be mandatory.

(ii) Real-world community track: Real-world evaluation based on shared hardware protocols. The community should collectively define standardized robot platforms (e.g., Franka Emika, UR5), task definitions, and evaluation procedures, and each research group should report real-world performance using the same protocol.
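
The seed-reporting requirement of the simulation track could be enforced with a helper along the following lines. This is a minimal sketch: the function name `summarize_seeds`, the returned keys, and the sample success rates are hypothetical, and the sample variance (n - 1 denominator) is one reasonable choice among several.

```python
import math
import statistics


def summarize_seeds(success_rates, min_seeds=10):
    """Summarize per-seed success rates; refuse to report on too few seeds."""
    n = len(success_rates)
    if n < min_seeds:
        raise ValueError(f"need at least {min_seeds} seeds, got {n}")
    mean = statistics.mean(success_rates)
    variance = statistics.variance(success_rates)  # sample variance (n - 1)
    return {
        "n": n,
        "mean": mean,
        "variance": variance,
        "std_error": math.sqrt(variance / n),  # standard error of the mean
    }


# Hypothetical success rates from ten seeds of the same configuration.
rates = [0.90, 0.84, 0.88, 0.92, 0.80, 0.86, 0.94, 0.82, 0.90, 0.84]
summary = summarize_seeds(rates)
```

Reporting the standard error alongside the mean makes it easier to judge whether a claimed improvement over a baseline exceeds seed-to-seed noise.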

10. Open Problems and Future Prospects

A cross-cutting analysis of fourteen major VLA survey papers identified eleven core challenges. Individual surveys addressed these only partially; their full structure emerges only through cross-survey analysis.

10.1 The Data Bottleneck

The most fundamental constraint in VLA research is data. Even the largest existing dataset, OXE [19], contains only about 2.5 million episodes in its expanded version (the original v1 comprised 1M+ episodes) — a negligible amount compared to the training corpus of GPT-2 (WebText, billions of tokens). Moreover, unlike internet text, robot data costs tens of dollars per episode to collect and is bound to the unique morphology of each robot platform.

Directions for resolution:

Simulation synthesis: Large-scale synthetic data generation such as GraspVerse (source outside the 14 surveys). Combined with domain randomization to facilitate sim-to-real transfer.
Human video utilization: Extracting visual prior knowledge about manipulation from Ego4D, EPIC-Kitchens, etc.
Autonomous collection (SOAR): A self-supervised pipeline in which the robot explores and collects data on its own.
Active curation: Not all data is equal. Data is selectively collected by targeting the model's weaknesses.

Cross-survey insight: The case of SmolVLA [32] competing with 7B models despite having only 450M parameters suggests that data quality and diversity may matter more than data scale. This foreshadows a paradigm shift from "more data" to "better data."

10.2 The Generalization Wall

The generalization performance of VLA models changes dramatically with evaluation conditions.

In-domain: 80–90% success rate under the same conditions as the training environment
Cross-domain: Drops to 40–70% with novel objects or environments
Zero-shot: Falls to 20–50% on entirely new tasks

Multiple approaches are underway to narrow this gap. HPT [96] (Wang et al., 2024; Heterogeneous Pretrained Transformers) attempts cross-embodiment generalization through pretraining across diverse embodiments; UniAct (Qian et al., 2025) attempts an embodiment-agnostic policy through a unified representation of the action space; and BridgeVLA (Li et al., 2025) attempts to connect web-scale visual knowledge with robot behavior.

Sim-to-real transfer remains an unsolved problem. Inaccurate physical contact, visual domain gaps, and the non-stationary nature of real-world environments are the primary barriers. The scaling-law study by GEN-0 (Team, 2025) provides early evidence that increasing model and data scale yields predictable improvements in generalization, but the limits of this law's validity remain unclear.

10.3 Real-Time Inference

The combination of large VLM backbones (7B–55B parameters) and diffusion-based action generation is powerful, but causes a critical latency problem for real-time control. Autonomous driving requires a minimum control frequency of 30Hz and precise manipulation requires 50Hz or more, which leaves per-step time budgets of roughly 33 ms and 20 ms respectively. A single forward pass of a multi-billion-parameter model often exceeds these budgets.

Solution strategies:

Hierarchical asynchronous execution: The high-level VLM generates subgoals at a low frequency (1–5Hz) while a lightweight low-level policy performs actual control at a high frequency (50–100Hz). GR00T N1 [21], CogACT [23], and others adopt this approach.
Token caching: Reusing key-value (KV) caches from previous inference passes to eliminate redundant computation.
Quantization: Lowering model precision to FP16, INT8, or INT4 to increase inference speed. Cases in which performance degradation remains below 2% under 4-bit quantization have been reported.
Pruning and distillation: Removing unnecessary parameters, or transferring knowledge from a large model to a small one.

The key concept is "intelligent sparsity." Rather than activating the entire model for every input, approaches that dynamically adjust the amount of computation according to input complexity are emerging.
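
The hierarchical execution strategy listed above can be sketched as a fixed-rate control loop that refreshes its subgoal only occasionally. This is a simplified, single-threaded simulation under stated assumptions: the `plan` and `act` callables stand in for the high-level VLM and the low-level policy, all names are hypothetical, and a real system would run the planner asynchronously in a separate thread or process.

```python
def run_hierarchical(control_hz, planner_hz, duration_s, plan, act):
    """Run a control loop at control_hz, replanning at the slower planner_hz."""
    if control_hz % planner_hz != 0:
        raise ValueError("control_hz must be a multiple of planner_hz in this sketch")
    ticks_per_plan = control_hz // planner_hz
    subgoal = None
    planner_calls = 0
    actions = []
    for tick in range(control_hz * duration_s):
        if tick % ticks_per_plan == 0:
            subgoal = plan(tick)          # slow, expensive high-level call
            planner_calls += 1
        actions.append(act(subgoal, tick))  # fast low-level call on every tick
    return planner_calls, actions


# Toy stand-ins: the subgoal is an index, the action echoes (subgoal, tick).
planner_calls, actions = run_hierarchical(
    control_hz=50,
    planner_hz=5,
    duration_s=2,
    plan=lambda tick: tick // 10,
    act=lambda subgoal, tick: (subgoal, tick),
)
```

Over two simulated seconds the low-level policy emits 100 actions while the planner is invoked only 10 times, which is the source of the latency savings.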

10.4 Long-Horizon Tasks and Hierarchical Reasoning

Pure end-to-end models excel at single-action-level tasks but systematically fail on multi-step compositional tasks. Tasks such as "open the drawer, take out the cup, and place it on the shelf" require planning, subgoal setting, progress monitoring, and replanning upon failure — all of which are difficult for a single policy to handle.

Solution approaches:

Hierarchical decomposition: π0.5 [31] explicitly separates a high-level VLM planner from a low-level action policy.
Chain-of-Thought (CoT): CoT-VLA [55] inserts an explicit reasoning step before action generation, allowing the model to reason about why it selects a particular action.
Skill library: Systems such as ReLEP (Park et al., 2025) learn reusable skill primitives and compose them to construct complex tasks.

Trends on the CALVIN benchmark validate this direction. World-model-based methods (DreamVLA [64]: 4.44, WorldVLA [76]: 4.38) show overwhelming performance over purely reactive policies, demonstrating that the ability to predict future states and use those predictions for planning is the key to success on long-horizon tasks.

10.5 Safety and Alignment

The safety challenges of VLAs are qualitatively different from those of purely software AI. Whereas hallucinations in an LLM result in incorrect text, hallucinations in a VLA can lead to physical collisions, damage, or even injury. The irreversibility of physical failures is the essential distinction.

Current attempts:

SafeVLA [75] (Chen et al., 2025): The first attempt to explicitly integrate safety constraints into a VLA, combining safety-relevant training data with constraint-violation penalties.
SafeAuto [140]: Implements a traffic-rule-based symbolic veto in autonomous driving. Neural network outputs are only executed once they pass a rule-based safety check — a dual-layer structure.

However, the absence of formal verification remains a serious challenge. Traditional control systems can guarantee safety through mathematical tools such as Lyapunov stability analysis and reachability analysis, but these verification methods have not been established for language-conditioned neural network policies.

This problem is particularly acute in autonomous driving. Whereas hallucinations in manipulation may amount to dropping an object, hallucinations in autonomous driving directly translate to traffic accidents. This "asymmetry of hallucination risk" means that autonomous-driving VLAs have fundamentally different safety requirements than manipulation VLAs.
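
The dual-layer structure described above, in which a rule-based check can veto a neural network's output, might look like the following in its simplest form. The rule set, state fields, and fallback action are hypothetical and are not taken from SafeAuto's implementation.

```python
def veto_filter(state, proposed_action, rules, fallback_action):
    """Pass the proposed action through only if every symbolic rule accepts it."""
    violated = [name for name, is_ok in rules if not is_ok(state, proposed_action)]
    if violated:
        return fallback_action, violated
    return proposed_action, []


# Hypothetical traffic rules expressed as (name, predicate) pairs.
RULES = [
    ("speed_limit", lambda s, a: a["target_speed_mps"] <= s["speed_limit_mps"]),
    ("stop_on_red", lambda s, a: s["light"] != "red" or a["target_speed_mps"] == 0.0),
]
BRAKE = {"target_speed_mps": 0.0}  # conservative fallback action

# The policy proposes driving at 8 m/s while the light is red: vetoed.
red_state = {"speed_limit_mps": 13.9, "light": "red"}
action, violated = veto_filter(red_state, {"target_speed_mps": 8.0}, RULES, BRAKE)

# The same proposal under a green light passes unchanged.
green_state = {"speed_limit_mps": 13.9, "light": "green"}
ok_action, ok_violated = veto_filter(green_state, {"target_speed_mps": 8.0}, RULES, BRAKE)
```

Returning the names of the violated rules along with the fallback keeps the veto auditable, although a rule check of this kind does not amount to formal verification of the policy itself.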

10.6 Hallucination and Reasoning Stability

The problem of LLM-based planners generating physically impossible actions is a fundamental weakness of VLAs. Physically unreasonable plans such as "tilt the water glass 90 degrees and move it" arise when the LLM's common-sense reasoning diverges from physical reality.

SC-VLA [56] (Self-Correcting VLA) (Guo et al., 2025) introduced explicit failure detection and recovery reasoning mechanisms, reducing task failure rates by 35% (Zhang et al. [6]). It implements a feedback loop in which the model monitors the results of its own actions and generates an alternative action when an unexpected outcome is observed.

However, validating hallucinations in the open world is fundamentally difficult. To judge whether the model's output is "physically realizable" in situations not present in the training data, the model itself would need to function as an accurate physics simulator — a circular problem.

10.7 Multimodal Integration

Current VLA research exhibits a vision-centric bias. Most models use only RGB images as sensory input, largely ignoring the other senses that humans employ in manipulation: touch, force/torque, sound, temperature, and others.

ForceVLA [78] (Lee et al., 2025): Integrates force/torque sensor data into the VLA, improving performance on delicate object manipulation.
TactileVLA [153] (Kim et al., 2025): Uses tactile sensor inputs to perceive material properties (hardness, texture, etc.) that are difficult to judge from vision alone.
OmniVTLA [154] (Wang et al., 2025): Proposes a unified architecture that simultaneously processes vision, touch, and language.

Humans automatically reweight modalities in high-uncertainty situations: relying more on touch when vision is insufficient, and focusing more on visual cues when the environment is noisy. This adaptive modality reweighting has yet to be systematically implemented in robots and represents an important future direction for multimodal VLAs.
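
One simple way to express adaptive modality reweighting is inverse-variance weighting, in which each modality's estimate is weighted by its reported confidence. The sketch below is an assumption chosen for illustration, not a method from the cited works; the modality names, values, and variances are hypothetical, and all variances must be strictly positive.

```python
def fuse(estimates):
    """Fuse per-modality estimates given as {modality: (value, variance)}.

    Each weight is proportional to 1 / variance, so a modality that becomes
    noisier automatically contributes less to the fused estimate.
    """
    precision = {m: 1.0 / var for m, (_, var) in estimates.items()}
    total = sum(precision.values())
    weights = {m: p / total for m, p in precision.items()}
    fused = sum(weights[m] * estimates[m][0] for m in estimates)
    return fused, weights


# Good lighting: vision is reliable, touch is comparatively noisy.
fused_clear, w_clear = fuse({"vision": (10.0, 1.0), "touch": (12.0, 4.0)})

# Poor lighting: vision variance grows, so the weight shifts toward touch.
fused_dark, w_dark = fuse({"vision": (10.0, 16.0), "touch": (12.0, 4.0)})
```

In the first case vision receives weight 0.8 and the fused estimate is 10.4; in the second the weights flip to 0.2 and 0.8 and the estimate moves to 11.6.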

10.8 Human–Robot Interaction

Current VLAs remain at the level of "pseudo-interaction." Unidirectional communication in which a human issues an instruction and the robot executes it is dominant; genuinely bidirectional, conversational collaboration is almost entirely unrealized.

True human–robot interaction requires the following:

Adaptive dialogue: The robot asks clarifying questions about ambiguous instructions and adjusts its behavior based on human feedback.
Preference learning: The robot learns human implicit preferences (speed, safety, aesthetic standards, etc.) through interaction.
Human feedback loop: Continued improvement through human corrective feedback even after deployment.

Inspired by the success of RLHF (Reinforcement Learning from Human Feedback) in NLP, a new research direction of "RLHF for Robotics" is taking shape in this field.

10.9 Evaluation and Benchmarking

The evaluation limitations discussed in Section 9.4 merit re-examination at a more fundamental level as an open problem. The field of robot learning currently lacks a unified benchmark equivalent to ImageNet in computer vision or GLUE/SuperGLUE in NLP.

The consequences of this absence are serious. When Paper A reports 98% on LIBERO and Paper B reports 4.44 on CALVIN, there is no basis on which to judge which model is "better." Add to this the reproducibility issues caused by seed randomness, and quantitatively tracking genuine progress in VLA research becomes difficult in itself.

10.10 Ethics and Societal Impact

The real-world deployment of VLAs raises ethical and societal questions that go beyond technical challenges.

Privacy: VLA robots operating in homes or workplaces continuously capture and interpret their environment. Clear guidelines for the collection, storage, and use of this data are needed.
Job displacement: Advances in manipulation capabilities accelerate automation in logistics, manufacturing, and service industries, potentially causing structural changes in employment.
Decision-making bias: Biases that VLM backbones have learned from internet data may manifest in physical actions. For example, biases related to race or gender could lead to discriminatory behavior in human–robot interaction.
Regulatory framework: Outside of autonomous driving, regulatory frameworks for the deployment of VLA robots barely exist. Certification standards, liability attribution, and incident reporting systems need to be established urgently.

10.11 Cross-Survey Integrated Insights

The following ten insights, derived through cross-analysis of fourteen surveys, represent emergent patterns not explicitly visible in any individual survey.

Insight 1 — Evidence of Convergence

The fourteen surveys each use different taxonomies, but they are ultimately different projections of the same landscape. Architecture surveys classify VLAs along a "backbone–action head" axis; learning surveys along a "pretraining–fine-tuning" axis; application surveys along a "domain–task" axis. Yet across all these perspectives, "VLM Brain + Generative Action Head" is converging as the optimal point. This trend, which became clear from late 2024 through 2025, means that the "language model as action model" paradigm initiated by RT-2 [11] has now reached universal consensus.

Insight 2 — The Scale Inversion Phenomenon

The scaling-law intuition that "larger models yield better performance" does not necessarily hold for VLAs. Concrete evidence supports this.

CLIP [27]-RT (1B) outperforms OpenVLA [15] (7B) on many tasks.
SmolVLA [32] (450M) achieves performance competitive with 7B-class models on LIBERO.
3B-class models (CogACT [23], SpatialVLA [39]) achieve performance equal to or better than 7B models.

This suggests that data quality, tokenization strategy, and architectural design may matter more than parameter count. In VLAs, the scarcity of robot data can cause large models to overfit or waste unnecessary capacity.

Insight 3 — Tokenization Determines Control Bandwidth

The choice of action tokenization is not a mere implementation detail but a design decision that defines the fundamental capabilities of the system.

Discrete bin tokenization: Simple to implement but limited in precision. Suitable for 1–5Hz control.
Diffusion-based: Can represent continuous, multimodal distributions, but the iterative reverse-diffusion process degrades inference speed. Range: 5–20Hz.
Flow matching: Achieves faster convergence than diffusion, enabling 20–50Hz.
FAST [20] tokenization: Simultaneously pursues the speed of discrete methods and the precision of continuous methods, enabling control frequencies up to 50–120Hz.

This perspective is a cross-cutting insight not explicitly addressed in any single survey. The choice of tokenization determines control frequency from 1Hz to 120Hz, and this fundamentally defines the range of tasks that can be performed. Slow control enables only coarse manipulation, while fast control makes high-difficulty tasks such as precision insertion, suturing, and playing musical instruments possible.
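
The precision limit of discrete bin tokenization can be seen in a minimal sketch of uniform binning. The action range of [-1, 1], the 256-bin default, and the function names are assumptions for illustration; individual models choose their own ranges and bin counts.

```python
def encode(value, low=-1.0, high=1.0, bins=256):
    """Map a continuous action value to a discrete token index in [0, bins - 1]."""
    clipped = min(max(value, low), high)
    index = int((clipped - low) / (high - low) * bins)
    return min(index, bins - 1)  # the upper bound itself falls into the last bin


def decode(token, low=-1.0, high=1.0, bins=256):
    """Map a token index back to the center of its bin."""
    width = (high - low) / bins
    return low + (token + 0.5) * width


# Round-trip a few hypothetical action values and record the reconstruction error.
samples = [-1.0, -0.37, 0.0, 0.3, 0.999, 1.0]
errors = [abs(decode(encode(v)) - v) for v in samples]
max_error_bound = (1.0 - (-1.0)) / 256 / 2  # half a bin width
```

With 256 bins over [-1, 1], the round-trip error is bounded by half a bin width, about 0.0039. Finer control requires more bins or a continuous action head.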

Insight 4 — Dual Systems Are Necessary, Not Optional

The limitations of pure end-to-end models on long-horizon tasks are repeatedly demonstrated. Daniel Kahneman's distinction between System 1 (fast, intuitive) and System 2 (slow, deliberate) has been validated as an engineering necessity in robotics.

GR00T N1 [21] adopts a dual-system architecture in which the high-level VLM (System 2) generates subgoals and the low-level diffusion policy (System 1) executes them. This yielded a 17% improvement in success rate and a 28% reduction in collision rate compared to a single-system approach — strong evidence that a theoretical distinction from cognitive science can be directly translated into an engineering design principle.

Insight 5 — RL Post-Training Is an Essential Complement to BC

VLAs trained on behavioral cloning (BC) alone have structural limitations. The distributional shift problem — in which performance degrades sharply when deviating from the distribution of the demonstration data — is a prime example. Reinforcement learning (RL) post-training has emerged as the key means of breaking through this limitation.

One reported case illustrates this dramatically: a success rate that stalled at 4% with SFT alone rose to 97% after just 15 iterations of PPO (Proximal Policy Optimization). If BC teaches "what to do," RL teaches "what is good." The combination of the two is not an option but a necessary pipeline step.

Insight 6 — The Efficiency–Performance Pareto Frontier Is Shifting

In 2025, the central competitive axis of VLA research has shifted from "absolute performance" to "compute efficiency." SmolVLA [32] (450M parameters) reaching performance comparable to that of RT-2 [11] (55B parameters) corresponds to a parameter-count reduction of more than 100× (55B / 450M ≈ 122×).

The "Intelligent Sparsity" paradigm is emerging. This is not simply about making models smaller, but about concentrating computation only where it is needed. LoRA-based efficient fine-tuning, Mixture-of-Experts (MoE) architectures, and early-exit mechanisms are the embodiment of this paradigm. "Performance per FLOP" is becoming the key metric, superseding simple scaling laws.

Insight 7 — Autonomous Driving VLAs Are on a Separate Evolutionary Path

Manipulation VLAs and autonomous-driving VLAs share the same "VLM + Action" framework, but in practice they follow quite different evolutionary paths. Autonomous driving has unique requirements compared to manipulation:

Safety requirements: The consequences of failure are fatal, and social acceptance standards are far higher.
Real-time requirements: A control frequency of 30Hz or more is absolutely non-negotiable.
Social norms: Understanding and complying with social conventions such as traffic laws, yielding, and signal adherence is required.

It is regrettable that technology transfer between the two domains is insufficient. There are potential cross-pollination opportunities: precise control techniques from manipulation could contribute to fine steering in autonomous driving, and safety verification frameworks from autonomous driving could contribute to safety policies in manipulation.

Insight 8 — Human Motor Learning Theory as a Future Map for VLA Research

The mapping between Newell's motor learning theory and VLAs, proposed by Jin et al. [9], has the potential to function not merely as an analogy but as a systematic research framework. Newell's "freezing-freeing degrees of freedom" theory directly corresponds to a curriculum for hierarchical skill learning in VLAs that starts in a low-dimensional action space and gradually expands the degrees of freedom.

There are abundant unexplored territories:

Robotic implementation of cerebellar models: The combination of a forward model and an inverse model from the human cerebellum could be translated into a combination of a world model and an inverse dynamics policy in VLAs.
Contextual interference effects: The finding from human motor learning that random practice is superior to blocked practice for long-term retention could be applied to training curricula for VLAs.
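
The difference between blocked and random practice can be stated concretely as two orderings of the same training episodes. The sketch below is illustrative only: the task names and function names are hypothetical, and it makes no claim about which schedule actually benefits a VLA.

```python
import random


def blocked_schedule(tasks, repeats):
    """All repetitions of one task before moving to the next."""
    return [task for task in tasks for _ in range(repeats)]


def random_schedule(tasks, repeats, seed=0):
    """The same episodes as the blocked schedule, presented in shuffled order."""
    order = blocked_schedule(tasks, repeats)
    random.Random(seed).shuffle(order)  # seeded for reproducibility
    return order


tasks = ["reach", "grasp", "place"]  # hypothetical skill set
blocked = blocked_schedule(tasks, repeats=2)
interleaved = random_schedule(tasks, repeats=2, seed=0)
```

Both schedules contain exactly the same episodes, so any difference in long-term retention would be attributable to ordering alone.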

Insight 9 — World Models Are the Key to Long-Horizon Tasks

The performance trends on the CALVIN benchmark deliver a clear message. Visual Interaction Prediction (VIP) methods show dominant performance, and approaches integrating world models — WorldVLA [76], DreamVLA [64], CoT-VLA [55] — dominate the top ranks.

World models implement the principle of "imagining before acting." Before executing a particular action, the system internally simulates how the world will change as a result, evaluates whether that outcome aligns with the goal, and only then executes the actual action. This fundamentally improves planning capability in long-horizon tasks and will be the key differentiator of the next generation of VLAs.
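The simulate-evaluate-execute loop described above can be sketched as a simple planner that rolls each candidate action through a world model and keeps the best-scoring one. The names `plan_by_imagination`, `world_model`, and `score` are hypothetical, and the 1-D toy dynamics stand in for a learned model; real systems such as those cited predict images or latent states rather than a scalar.

```python
from typing import Callable, Sequence, Tuple


def plan_by_imagination(state: float,
                        candidates: Sequence[float],
                        world_model: Callable[[float, float], float],
                        score: Callable[[float], float],
                        horizon: int = 1) -> Tuple[float, float]:
    """Pick the candidate action whose imagined outcome scores highest."""
    best_action, best_score = candidates[0], float("-inf")
    for action in candidates:
        imagined = state
        for _ in range(horizon):           # internal simulation, no real execution
            imagined = world_model(imagined, action)
        outcome_score = score(imagined)    # does the imagined outcome match the goal?
        if outcome_score > best_score:
            best_action, best_score = action, outcome_score
    return best_action, best_score


# Toy example: state is a position on a line, the goal is position 5.
GOAL = 5.0


def toy_world_model(x: float, a: float) -> float:
    return x + a


def toy_score(x: float) -> float:
    return -abs(GOAL - x)


action, value = plan_by_imagination(2.0, [-1.0, 0.0, 1.0],
                                    toy_world_model, toy_score, horizon=2)
```

Only the selected action would then be sent to the robot; the imagined rollouts for the other candidates are discarded.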

Insight 10 — The Rise of the "Defensive AI" Paradigm

Robustness is being elevated to a first-class design objective on par with performance. Even if a 98% success rate is achieved in the laboratory, a system that drops to 50% under unpredictable real-world perturbations cannot be deployed.

BYOVLA (Bring Your Own VLA): A run-time observation intervention that identifies the image regions a VLA is sensitive to and minimally edits task-irrelevant regions, improving visual robustness without retraining the model.
DreamVLA [64]: Imagination-based robustness improvement through a world model. Unexpected situations are internally simulated to prepare for them.
SafeVLA [75]: Proactively blocks dangerous actions through explicit safety constraint integration.

In real-world deployment, robustness is not a "nice-to-have" property but a survival condition for the system. As this recognition spreads through the research community, the "Defensive AI" paradigm is taking shape.

10.12 The Generalization Gap Between Frontier and Open-Weight Models

As of 2026, the most prominent dividing line in the VLA field is the real-world generalization gap between proprietary frontier models (Gemini Robotics, π0.5 [31]) and open-weight research models. Although performance on simulation benchmarks (LIBERO, CALVIN) is converging between the two camps, a substantial gap persists in real-world zero-shot generalization. One analysis (Reuss, 2026) suggests that only the π family of models demonstrates competitive zero-shot behavior on the RoboArena leaderboard.

Three factors are identified as causes of this gap: (1) Data quality and diversity gap: the proprietary data of frontier labs exceeds public datasets in both quality and diversity; (2) Benchmark ceiling effect: the saturation of simulation benchmarks masks genuine progress; and (3) Infrastructure scale gap: the difference between laboratory-scale and industrial-scale training infrastructure.

At ICLR 2026, data quality curation and in-context learning were identified as the most underrepresented research directions, and these two directions may hold the key to closing the gap. This problem is closely linked to both the data bottleneck of Section 10.1 and the generalization wall of Section 10.2, and constitutes a central challenge for the open-source community.

10.13 Frontier Case Study: Two Breakthroughs from the π Series (2025.11 – 2026.03)

Sections 10.1–10.12 identified the core open problems and cross-survey insights for VLA research. How much concrete progress is actually being made on these challenges? In November 2025 and March 2026, Physical Intelligence (PI) published two papers in quick succession that provide the most tangible answer to this question. π^*_0.6 [157] directly attacks the "BC→RL transition" discussed in Section 6.3 and Insight 5, while π_0.6-MEM [158] confronts the "long-horizon tasks and memory absence" problem from Section 10.4. Both papers build on the π0.6 model (Gemma 3 4B VLM + 860M Action Expert) and break through two core VLA limitations—the performance ceiling of behavioral cloning and the absence of memory—at real-world scale.

10.13.1 π^*_0.6: A VLA That Learns from Experience

[157] · Physical Intelligence · 2025.11

π^*_0.6 [157] is a VLA trained with RECAP (RL with Experience and Corrections via Advantage-conditioned Policies), a general-purpose RL post-training method that enables VLA models to self-improve through real-world deployment experience. While the RL post-training methods surveyed in Section 6.3 (VLA-RL, RIPT-VLA, ConRFT, etc.) were primarily validated on simulation benchmarks, π^*_0.6 represents a qualitative turning point: it is the first to successfully demonstrate end-to-end RL training of a large-scale VLA on real-world, long-horizon, complex manipulation tasks.

Core Technical Innovation: Advantage Conditioning. Unlike prior VLA RL methods that use policy gradient–based extraction such as PPO or GRPO, RECAP adopts advantage conditioning—a fundamentally different policy extraction mechanism. The key idea proceeds as follows:

(1) A distributional value function is trained separately. This value function uses a compact 670M-parameter VLM backbone and predicts the distribution of remaining steps to successful completion at each state (201 discrete bins). (2) The advantage value of each action is estimated from the value function and binarized into a text token "Advantage: positive/negative" appended to the VLA input. (3) The VLA is trained on all data (demonstrations + autonomous rollouts + human corrections) via advantage-conditioned supervised learning, but at inference time is always conditioned on "Advantage: positive" to extract the improved policy.

The decisive advantage of this approach is its compatibility with Flow Matching–based VLAs. PPO/GRPO require explicit computation of log-likelihoods, which Flow Matching models cannot directly provide and must approximate. Advantage conditioning completely bypasses this issue, achieving policy improvement through simple conditional supervised learning. In experiments, π^*_0.6 significantly outperformed AWR and PPO-based methods trained on the same data.
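The three steps above can be sketched in a few lines. This is a simplified illustration, not the RECAP implementation: it assumes each of the 201 bins corresponds to one unit of remaining steps, uses an n-step estimate with a cost of one per step, and binarizes at a threshold of zero. The function names and the prompt layout are hypothetical.

```python
from typing import List, Optional

NUM_BINS = 201  # from the text: 201 discrete bins


def expected_steps_to_go(probs: List[float], bin_width: float = 1.0) -> float:
    """Mean of the predicted distribution over remaining steps (assumed bin layout)."""
    if len(probs) != NUM_BINS:
        raise ValueError(f"expected {NUM_BINS} bins, got {len(probs)}")
    return sum(i * bin_width * p for i, p in enumerate(probs))


def advantage(before: List[float], after: List[float], steps_elapsed: float) -> float:
    """Progress made minus steps spent; positive means better than expected."""
    return expected_steps_to_go(before) - expected_steps_to_go(after) - steps_elapsed


def advantage_token(adv: float, threshold: float = 0.0) -> str:
    """Binarize the advantage into the text token appended to the VLA input."""
    label = "positive" if adv > threshold else "negative"
    return f"Advantage: {label}"


def build_prompt(task: str, adv: Optional[float] = None) -> str:
    """Training: token from the estimated advantage. Inference (adv=None): always positive."""
    token = "Advantage: positive" if adv is None else advantage_token(adv)
    return f"{task}\n{token}"


def one_hot(index: int) -> List[float]:
    """Helper for the example: a value prediction concentrated on one bin."""
    return [1.0 if i == index else 0.0 for i in range(NUM_BINS)]


good = advantage(one_hot(100), one_hot(90), steps_elapsed=5)   # 10 steps of progress in 5
bad = advantage(one_hot(100), one_hot(99), steps_elapsed=5)    # 1 step of progress in 5
```

Because the policy only ever sees the advantage as an input token, training remains ordinary supervised learning, so no action log-likelihood is needed from the Flow Matching head.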

Unifying Three Data Sources. RECAP combines (1) demonstration data (for initial SFT), (2) autonomous rollouts (robot’s own attempts with success/failure labels), and (3) human corrections (human-gated DAgger—humans intervene during autonomous execution to correct mistakes) within a single framework. Human correction data always receives positive advantage, while the remaining data is assigned advantage based on the value function’s estimates.
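The source-dependent labeling rule can be written down directly. The sketch below assumes a precomputed scalar advantage per sample and a zero threshold; the `Sample` class, the source strings, and the threshold are illustrative choices, not part of the published method.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    source: str        # "demo", "rollout", or "correction" (hypothetical labels)
    advantage: float   # estimate from the value function; ignored for corrections


def advantage_label(sample: Sample, threshold: float = 0.0) -> str:
    """Human corrections are always positive; everything else follows the value function."""
    if sample.source == "correction":
        return "positive"
    return "positive" if sample.advantage > threshold else "negative"


batch = [
    Sample("demo", 0.8),
    Sample("rollout", -1.5),     # a failed or slow autonomous attempt
    Sample("correction", -0.3),  # positive regardless of the estimate
]
labels = [advantage_label(s) for s in batch]
```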

Real-World Results. π^*_0.6 was validated on three complex tasks:

Task	Duration	π^*_0.6 Effect (throughput)	Success Rate	Continuous Operation
Espresso making	~200s/trial	2×+ improvement	90%+	13 hours continuous
Laundry folding (11 types)	~500s/trial	2×+ improvement	~70% (hardest: button shirt)	2+ hours (in new home)
Box assembly (factory deployment)	~600s/trial	2× improvement (after 2 iterations)	~90%	Factory deployment

On the most challenging tasks, π^*_0.6 more than doubled throughput (successful completions per hour) and cut the failure rate roughly in half. Notably, a "targeted failure removal" experiment addressed a specific failure mode (collar orientation error), raising success to 97% using only 600 autonomous trajectories over 2 iterations.
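Throughput as defined above depends on both the success rate and the time per trial, so it can double even when neither factor doubles on its own. The sketch below assumes back-to-back trials with no reset time, and the numbers in the example are hypothetical, not taken from [157].

```python
def throughput_per_hour(success_rate: float, seconds_per_trial: float) -> float:
    """Successful completions per hour, assuming trials run back to back."""
    trials_per_hour = 3600.0 / seconds_per_trial
    return success_rate * trials_per_hour


# Hypothetical before/after values chosen only to illustrate the arithmetic.
before = throughput_per_hour(success_rate=0.6, seconds_per_trial=300.0)
after = throughput_per_hour(success_rate=0.9, seconds_per_trial=200.0)
speedup = after / before
```

In this made-up case a 1.5× gain in success rate and a 1.5× gain in speed multiply into a 2.25× throughput gain.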

Significance in the Survey Context. π^*_0.6 is the most complete realization of the "BC→RL transition" discussed in Section 6.3. Where prior work demonstrated the feasibility of RL post-training on simulation benchmarks (LIBERO, etc.), π^*_0.6 is the first to prove the practicality of end-to-end RL for large-scale Flow Matching VLAs on real-world, long-horizon, complex tasks. This signals that the LLM development trajectory of pre-training (GPT-3), then SFT, then RLHF (InstructGPT, ChatGPT) is now materializing in VLA research.

10.13.2 π_0.6-MEM: Multi-Scale Embodied Memory for VLAs

[158] · Physical Intelligence · 2026.03

MEM (Multi-Scale Embodied Memory) endows VLAs with multi-modal, multi-timescale memory. Section 10.4 identified "long-horizon tasks and hierarchical reasoning" as a core open problem; MEM offers the most direct solution to this challenge.

Key Insight: Dual Memory Representation. When a robot executes a 15-minute task such as "clean the entire kitchen," two fundamentally different kinds of memory are required. (1) Short-term memory: recent visual information spanning a few seconds (when the arm occludes an object, or when recalling a failed grasp strategy). (2) Long-term memory: semantic events spanning several minutes (which ingredients have already been retrieved, which drawers have been opened). MEM’s core insight is that these two types of memory must be represented in different modalities.

Architectural Components:

(1) Short-term Video Memory (Video Encoder). The existing ViT is extended to process video input without adding any new learnable parameters. A space-time separable attention structure inserts causal temporal attention alongside spatial attention every 4th layer. Tokens from past timesteps are dropped in upper layers so that the total tokens passed to the VLA backbone remain identical to a single-frame VLA. This keeps inference latency within the 300ms real-time constraint even with 16-frame input (a naïve approach would exceed 4 seconds).
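The bookkeeping behind this design can be sketched without any attention math: which layers receive temporal attention, what the causal mask looks like, and how dropping past-frame tokens keeps the backbone input constant. The layer count, the 256 tokens per frame, and the function names are assumptions for illustration; the text specifies only the every-4th-layer pattern and the 16-frame input.

```python
from typing import List, Tuple


def temporal_layer_indices(num_layers: int, every: int = 4) -> List[int]:
    """0-based indices of layers that also run causal temporal attention."""
    return [layer for layer in range(num_layers) if (layer + 1) % every == 0]


def causal_temporal_mask(num_frames: int) -> List[List[bool]]:
    """mask[t][s] is True when frame t may attend to frame s (only s <= t)."""
    return [[s <= t for s in range(num_frames)] for t in range(num_frames)]


def drop_past_frames(frame_tokens: List[List[Tuple[int, int]]]) -> List[Tuple[int, int]]:
    """Keep only the current (last) frame's tokens for the VLA backbone."""
    return frame_tokens[-1]


NUM_FRAMES, TOKENS_PER_FRAME = 16, 256  # 256 is a hypothetical per-frame token count
frames = [[(t, i) for i in range(TOKENS_PER_FRAME)] for t in range(NUM_FRAMES)]

naive_count = sum(len(f) for f in frames)      # every frame's tokens forwarded
kept_count = len(drop_past_frames(frames))     # same as a single-frame VLA
```

Information from earlier frames still reaches the backbone, because the temporal layers mix it into the current frame's tokens before the past tokens are dropped.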

(2) Long-term Language Memory. A high-level policy compresses past semantic events into a natural-language summary (m_t) and incrementally updates it at each step (m_t → m_{t+1}). The key is compression: "placed the light-green bowl, dark-blue bowl, and light-yellow bowl into the upper-right cabinet" → "placed three bowls into the upper-right cabinet," stripping unnecessary detail. This is critically important for reducing the train–inference distribution mismatch.
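The incremental update from m_t to m_{t+1} can be illustrated with a toy, rule-based stand-in for the summarizer. In MEM the summary is produced by a learned high-level policy; the event tuple layout, the counting rule, and the function names below are hypothetical and serve only to show how compression discards detail such as color.

```python
from collections import Counter
from typing import Tuple

Event = Tuple[str, str, str, str]  # (verb, detail, category, location), assumed layout


def update_memory(memory: Counter, event: Event) -> Counter:
    """Compute m_{t+1} from m_t and one new event, dropping the fine detail."""
    verb, _detail, category, location = event
    updated = memory.copy()
    updated[(verb, category, location)] += 1
    return updated


def render(memory: Counter) -> str:
    """Turn the compressed memory into a short natural-language summary."""
    parts = []
    for (verb, category, location), count in memory.items():
        noun = category if count == 1 else category + "s"
        parts.append(f"{verb} {count} {noun} into the {location}")
    return "; ".join(parts)


memory = Counter()
for color in ["light-green", "dark-blue", "light-yellow"]:
    memory = update_memory(memory, ("placed", color, "bowl", "upper-right cabinet"))
summary = render(memory)
```

The summary length stays constant as more bowls are placed, which is the property that keeps the memory seen at inference time close to what was seen in training.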

Core Design Principle: Initialization from Pre-trained VLM Weights. The video encoder is designed to guarantee exact equivalence with the original VLM at K=1 (single image) by setting the t=0 temporal position encoding to zero. This preserves the pre-trained VLM’s knowledge perfectly while adding memory capabilities. Experiments showed that introducing memory only during the post-training phase without pre-training caused marked degradation, confirming that diverse-data memory pre-training is essential.

Real-World Results.

Capability	Example Tasks	Results
15-min long-horizon tasks	Recipe ingredient prep, full kitchen cleanup, grilled cheese sandwich cooking	2–4× task progress improvement vs. memoryless π0.6
In-context adaptation	Chopstick grip height adjustment, fridge door opening direction switch	+11%–+62% success rate vs. memoryless model
Partial observability	Remembering object location in drawer, tracking grocery bag contents, coffee scoop counting	Only model with strong performance across all core memory capabilities
Non-memory task performance	Shirt folding, bed making, box assembly, etc.	On par with memoryless π0.6 (no degradation from memory addition)

Notably, the causal confusion problem repeatedly reported in prior work—where adding memory actually degrades performance—was not observed in π_0.6-MEM. This is attributed to the large-scale pre-training data mix encompassing diverse optimality levels, speeds, and control frequencies, which prevents spurious correlations.

Significance in the Survey Context. MEM directly addresses the reasoning paradigms of Section 4.2 (the brain module) and the core open problem of Section 10.4 (long-horizon tasks). While π0.5 approached long-horizon work through hierarchical separation ("VLM planner + VLA executor"), MEM tackles the same problem along the orthogonal dimension of memory. If hierarchical planning separates "what to do," memory remembers "what has been done." The two approaches are complementary rather than mutually exclusive, and their future combination is highly likely.

10.13.3 Integrated Significance of the Two Papers

Dimension	π^*_0.6 [157]	π_0.6-MEM [158]
Limitation addressed	BC performance ceiling; inability to discover behavior beyond demonstrations	Memory absence; inability to handle long-horizon tasks; partial observability
Base model	π0.6 (Gemma 3 4B + 860M Action Expert)	π0.6 (same)
Core innovation	Advantage conditioning: RL policy extraction compatible with Flow Matching VLA	Video encoder (zero added parameters) + compressed language memory
Data sources	Demonstrations + autonomous rollouts + human corrections (DAgger)	Robot demonstrations + video data + vision-language data
Headline results	13 hours continuous espresso making; 50% failure reduction	15-min kitchen cleanup & grilled cheese sandwich tasks solved
Survey connection	Section 6.3 (RL post-training), Insight 5	Section 10.4 (long-horizon tasks), Section 4.2 (reasoning)

Motivation Chain: Evolution of the π Series (Updated)

π0 (2024): First fusion of VLM + Flow Matching Action Expert

→ π0.5 (2025): Hierarchical VLM planning + π0 execution, 30 min+ tasks

→ π0.6 (2025): Gemma 3 4B backbone + 860M Action Expert upgrade, KI training recipe

→ π^*_0.6 (2025.11): Real-world RL post-training via advantage conditioning. Breaking the BC ceiling

→ π_0.6-MEM (2026.03): Multi-scale memory for 15 min+ long-horizon tasks. Breaking the memory barrier

Taken together, PI’s π series is simultaneously pushing two core frontiers of VLA research: "doing it better" (π^*_0.6) and "doing it longer" (MEM). π^*_0.6 elevates individual action quality beyond demonstration level, while MEM weaves individual actions into coherent task execution spanning tens of minutes. If the two approaches were combined in a single model, a robot that "self-improves at a 15-minute kitchen cleanup through trial and error" becomes feasible. This represents the most concrete advancement toward the "deployment readiness" paradigm put forth in this survey, demonstrating that the open problems identified above are transitioning from theoretical speculation to engineering challenges.

11. Conclusion

11.1 VLA — Realizing Unified Intelligence

Vision-Language-Action (VLA) models are the embodiment of a unified intelligence in which robots "see, understand, and act" in the world. By fusing visual perception, linguistic reasoning, and physical action within a single neural network, VLAs are fundamentally redefining the modular pipeline (perception–planning–control) of traditional robotics.

11.2 Three Years of History, 200 Models

Since RT-2 [11] first proposed the term "Vision-Language-Action Model" in 2023, more than 200 VLA models have emerged in just three years. This explosive growth results from the convergence of three factors: (1) the maturation of large language models, (2) advances in vision-language pretraining, and (3) the emergence of large-scale robot datasets. The Cambrian explosion of VLA research began in 2023–2024, when these three elements simultaneously reached a critical threshold.

11.3 Current Achievements and the Frontier

Short-horizon, single-domain manipulation tasks have nearly been solved. Success rates of 98.8% on LIBERO-Spatial and 99.8% on LIBERO-Object show that the ability to perform defined tasks in defined environments has already approached human-level performance.

But the true frontier begins now.

Long-horizon tasks: Planning and execution in multi-step compositional tasks
Cross-domain generalization: Adaptation to environments and objects not encountered during training
Real-world deployment: Stable operation in uncontrolled environments

These three challenges constitute the "last mile" of VLA research — and simultaneously the most difficult stretch.

11.4 The Efficiency Revolution

One of the most noteworthy trends in VLA research is the dramatic improvement in efficiency. The Pareto frontier is shifting: model size has been reduced by more than 100×, from the 55B parameters of RT-2 [11] to the 450M parameters of SmolVLA [32], while maintaining competitive performance. This is a decisive factor in accelerating the practical deployment of VLAs. Real-time inference on edge devices, cost-effective large-scale deployment, and energy efficiency all benefit from this efficiency revolution.
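The size reduction quoted above translates directly into weight memory. The sketch below assumes 16-bit weights (2 bytes per parameter) and counts weights only, ignoring activations, caches, and optimizer state; the helper name is hypothetical.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes), weights only."""
    return num_params * bytes_per_param / 1e9


RT2_PARAMS = 55e9       # RT-2, as cited in the text
SMOLVLA_PARAMS = 450e6  # SmolVLA, as cited in the text

size_ratio = RT2_PARAMS / SMOLVLA_PARAMS   # about 122x
rt2_gb = weight_memory_gb(RT2_PARAMS)      # about 110 GB at 16-bit
smolvla_gb = weight_memory_gb(SMOLVLA_PARAMS)  # about 0.9 GB at 16-bit
```

Under these assumptions the smaller model's weights fit comfortably in the memory of a single consumer GPU or embedded accelerator, while the larger model's do not.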

11.5 The Transition from BC to RL

A pipeline that begins with behavioral cloning (BC) and concludes with reinforcement learning (RL) post-training is becoming the standard for VLA training. The combination of the stable initialization provided by BC and the exploratory optimization provided by RL enables performance levels unattainable by either approach alone. The leap from 4% with SFT to 97% after 15 iterations of PPO strikingly illustrates the power of this combination.

11.6 Accelerating Domain-Specific Specialization

The general-purpose VLA framework is expanding into diverse domains.

Autonomous driving: DriveVLM [91], DriveLM, and others implement VLAs specialized for road environments.
Humanoids: GR00T N1 [21], HumanPlus, and others apply VLAs to whole-body control of humanoid robots.
Medicine: VLA applications in surgical robots and rehabilitation assistance are being explored.

Each domain has its own unique safety requirements, control frequencies, and interaction patterns, and domain-specific design is accelerating accordingly.

11.7 Safety, Ethics, and Formal Verification — The Gatekeepers of Deployment

The final gatekeeper blocking real-world deployment of VLAs is not technical performance but safety and ethics. While efforts such as SafeVLA [75] and SafeAuto are underway, formal verification methods for language-conditioned neural network policies have yet to be established. This is a gateway that VLAs must pass through to move from the laboratory into everyday life, and it is an area that requires collaboration among regulatory bodies, industry, and academia.

Societal impacts such as privacy, job displacement, and decision-making bias must also be discussed in parallel with technological progress. It is too late to begin ethical deliberation only after a technology has been deployed in society.

11.8 Toward the Next Leap

The next leap in VLA research is expected to occur simultaneously along four axes.

First, world model integration. The ability to imagine outcomes before acting improves long-horizon tasks, safety, and generalization alike. The success of DreamVLA [64] and WorldVLA [76] validates this direction; in the next generation of VLAs, the world model will be a core module rather than an optional component (see Section 4.2.3 and the Large Model Embodied AI survey [44]).

Second, continual learning. Current VLAs are frozen after deployment, but a truly intelligent robot must continuously improve through experience. Continual learning — adapting to new tasks and environments without forgetting past learning (preventing catastrophic forgetting) — is the long-term vision for VLAs.

Third, general embodied intelligence. A universal policy in which a single model operates across diverse embodiments — robotic arms, humanoids, autonomous vehicles, drones — is the ultimate goal of VLA research. OXE [19] and HPT [96] have taken the first steps in this direction; scaling cross-embodiment generalization is the central challenge.

Fourth, human–robot co-evolution. As VLA robots become deeply integrated into human life, a co-evolution will begin in which humans and robots change each other. The feedback loop in which robots learn from human behavior and humans adjust their interaction patterns to match robot capabilities is the future that VLA research ultimately aspires to.

VLAs represent more than a technological advance; they are constructing an answer to the fundamental question of "whether machines can understand the physical world and act meaningfully within it." Three years after the naming of the field in 2023, this area has advanced at a remarkable pace, and that acceleration continues. The changes that the next three years will bring will surpass all that has come before.

Full VLA Taxonomy Tree

VLA (Vision-Language-Action)
├── Classification by Definitional Scope
│   ├── Narrow Definition (RT-2 original): VLM fine-tuning based
│   ├── Broad Definition (Ma et al.): All V+L→A systems
│   ├── Pure VLA (Zhong et al.): End-to-end integration
│   └── Direct Control (Kawaharazuka et al.): Direct control command generation
│
├── Classification by Architecture (Liu & Shao [5])
│   ├── Monolithic
│   │   ├── Single-system: RT-2, OpenVLA
│   │   └── Dual-system
│   │       ├── Cascade: GR00T N1, pi-0
│   │       └── Parallel: (simultaneous execution and fusion)
│   └── Hierarchical
│       ├── Planner-Only: SayCan, Inner Monologue
│       └── Planner+Policy: pi-0.5
│
├── Classification by Action Generation Method (Zhong et al. [3])
│   ├── Autoregressive: RT-2, OpenVLA, Octo (AR mode)
│   ├── Diffusion: Diffusion Policy, CogACT, RDT-1B
│   │   ├── Flow Matching (variant): pi-0, pi-0-FAST
│   │   └── Discrete Diffusion: diffusion in discrete token spaces
│   ├── Reinforcement Learning-based: VLA-RL [68], RIPT-VLA [71], ConRFT [69]
│   └── Hybrid/Special: HybridVLA [79], GR00T N1
│
├── Classification by Action Token Type (Chen et al. [7])
│   ├── Language Tokens: SayCan, SayTap
│   ├── Code Tokens: Code-as-Policies
│   ├── Affordance Tokens: VoxPoser [57], A3VLM [58]
│   ├── Trajectory Tokens: RT-Trajectory [62], TraceVLA
│   ├── Goal Tokens: SuSIE [59], 3D-VLA
│   ├── Latent Tokens: VQ-BeT [60], LAPA [61], UniVLA [80]
│   ├── Raw Action Tokens: RT-2, OpenVLA, FAST
│   └── Reasoning Tokens: CoT-VLA [55], SC-VLA [56]
│
├── Classification by Efficiency Technique (Yu et al. [4])
│   ├── Quantization: BitVLA, SQIL
│   ├── Pruning: SmolVLA, FLOWER, DeeR-VLA
│   ├── Distillation: TinyVLA
│   ├── Token Optimization: FAST, VLA-Cache, VOTE
│   ├── Efficient Attention: KV-Efficient VLA, Long-VLA [73]
│   └── Efficient Architecture: SARA-RT, MoLE-VLA
│
├── Classification by Learning Paradigm (Jin et al. [9])
│   ├── Phase 1: Internet Pretraining (VLM)
│   ├── Phase 2: BC/SFT (Robot Demonstrations)
│   └── Phase 3: RL Post-Training
│       ├── Online RL: PPO (RIPT-VLA [71]), GRPO (VLA-RL [68])
│       ├── Offline-to-Online RL: ConRFT [69]
│       └── Preference Optimization: HAPO [84], GRAPE
│
├── Classification by Application Domain
│   ├── Tabletop Manipulation: mainstream, data-rich
│   ├── Humanoids: high-DoF, whole-body control
│   ├── Autonomous Driving: separate evolutionary path, highest safety requirements
│   ├── Drones/Navigation: outdoor, real-time
│   ├── Medical/Surgical: extreme precision, data-scarce
│   └── Industrial/Agricultural: repetitive tasks, robustness-centric
│
└── Classification by Benchmark
    ├── Manipulation: LIBERO, CALVIN, RLBench, Meta-World, VLABench
    ├── Autonomous Driving: Bench2Drive, nuScenes, Reason2Drive, WorldBench
    └── Next-Generation: RoboArena, RoboCasa365, WorldGym

Motivation Chain: Deployment and Safety

Limitations of laboratory demonstrations (operate only in controlled environments; real-world deployment infeasible)
→ Efficiency research (lightweight models, quantization → executable on edge devices)
→ Real-world deployment attempts (unexpected failure modes discovered)
→ Initiation of safety research (SafeVLA [75]: internalizing safety constraints into training)
→ Fundamental limits of safety (VLA hallucination → possibility of physical accidents)
→ Need for formal verification emerges (still unresolved)

Limitations of single benchmarks (LIBERO saturation: simple tasks effectively solved)
→ Need for composite benchmarks (long-horizon, multi-step, real-world variation)
→ Simulation-to-real gap (sim-to-real gap persists)
→ Hybrid evaluation proposed (simulation + real-world + human evaluation)

Self-Check Questions: Sections 9–11

Q1: Explain the paradigm shift that the OXE dataset brought to the VLA field from the perspective of "data diversity."

Answer: Before OXE, each laboratory trained on small-scale data (thousands to tens of thousands of episodes) collected with its own robots. OXE unified over 1M episodes collected across 22 distinct robot embodiment types into a single format. The key finding was that data from diverse robots acts not as noise but as diversity, preventing overfitting to specific robots and environments and improving generalization. This is analogous to the phenomenon in NLP where multilingual training improves performance in each individual language.

Q2: Why is "hallucination" in VLAs fundamentally different from hallucination in LLMs?

Answer: LLM hallucination produces incorrect text, resulting in informational errors (stating things that are not factual). VLA hallucination generates erroneous actions that are executed in the physical world, meaning the consequences can include physical accidents (collisions, damage, injury). The difference is between "stating a nonexistent historical fact" and "swinging a robot arm to grasp an object that does not exist." For this reason, safety challenges in VLAs are fundamentally more severe than in LLMs, and the need for formal verification is correspondingly greater.

Q3: What is the greatest limitation of the current VLA benchmark ecosystem?

Answer: (1) Absence of a unified cross-benchmark: There is no standard benchmark equivalent to ImageNet or SuperGLUE, making it impossible to directly compare results across different simulators (LIBERO, CALVIN, RLBench). (2) Simulation-to-real gap: High performance in simulation does not guarantee real-world performance. (3) Saturation problem: Simple task benchmarks have already reached 99%, losing discriminative power. (4) Absence of long-horizon and unstructured task evaluation: Benchmarks for composite tasks lasting 30+ minutes or measuring the ability to cope with unexpected situations remain scarce.

Open Research Questions: Sections 9–11

Data gap: How can the 5–6 orders-of-magnitude gap between robot data (OXE, ~1M episodes) and internet data (trillions of tokens) be closed? Among video pretraining, simulation, and synthetic data generation, which strategy is most effective?

Benchmark 2.0: Now that LIBERO has saturated, what design principles should guide the next-generation benchmark that simultaneously measures real-world generalization, long-horizon task performance, and safety?

Formal safety verification: Is formal verification of language-conditioned neural network policies theoretically feasible? If so, what mathematical framework is required?

Economics of VLA: At what point will the deployment cost of VLA-based robots (training, hardware, maintenance) become economically viable relative to conventional industrial robots?

General embodied intelligence: Is "General Embodied Intelligence," a single VLA model capable of controlling arms, humanoids, vehicles, and drones, an attainable goal, or is domain specialization inevitable?

References

[1] Ma, Q. et al. (2024). A Survey on Vision-Language-Action Models for Embodied AI. arXiv:2405.14093. [arXiv]
[2] Kawaharazuka, K. et al. (2024). Real-World Robot Applications of Foundation Models: A Review. arXiv:2402.05741. [arXiv]
[3] Zhong, Z. et al. (2025). Pure Vision Language Action (VLA) Models: A Comprehensive Survey. arXiv:2509.19012. [arXiv]
[4] Yu, Z. et al. (2025). A Survey on Efficient Vision-Language-Action Models. arXiv:2510.24795. [arXiv]
[5] Liu, N. & Shao, R. et al. (2025). Large VLM-based VLA Models for Robotic Manipulation: A Survey. arXiv:2508.13073. [arXiv]
[6] Zhang, Y. et al. (2025). VLA Models: Concepts, Progress, Applications and Challenges. arXiv:2505.04769. [arXiv]
[7] Chen, Y. et al. (2025). A Survey on VLA Models: An Action Tokenization Perspective. arXiv:2507.01925. [arXiv]
[8] Xu, C. et al. (2025). An Anatomy of Vision-Language-Action Models. arXiv:2512.11362. [arXiv]
[9] Jin, A. et al. (2025). Parallels Between VLA Model Post-Training and Human Motor Learning. arXiv:2506.20966. [arXiv]
[10] Jiang, H. et al. (2025). A Survey on VLA Models for Autonomous Driving. arXiv:2506.24044. [arXiv]
[11] Brohan, A. et al. (2023). RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. arXiv:2307.15818. [arXiv]
[12] Brohan, A. et al. (2022). RT-1: Robotics Transformer for Real-World Control at Scale. arXiv:2212.06817. [arXiv]
[13] Reed, S. et al. (2022). A Generalist Agent (Gato). arXiv:2205.06175. [arXiv]
[14] Ahn, M. et al. (2022). Do As I Can, Not As I Say: Grounding Language in Robotic Affordances (SayCan). arXiv:2204.01691. [arXiv]
[15] Kim, M. et al. (2024). OpenVLA: An Open-Source Vision-Language-Action Model. arXiv:2406.09246. [arXiv]
[16] Black, K. et al. (2024). pi0: A Vision-Language-Action Flow Model for General Robot Control. arXiv:2410.24164. [arXiv]
[17] Chi, C. et al. (2023). Diffusion Policy: Visuomotor Policy Learning via Action Diffusion. arXiv:2303.04137. [arXiv]
[18] Driess, D. et al. (2023). PaLM-E: An Embodied Multimodal Language Model. arXiv:2303.03378. [arXiv]
[19] Open X-Embodiment Collaboration (2023). Open X-Embodiment: Robotic Learning Datasets and RT-X Models. arXiv:2310.08864. [arXiv]
[20] Pertsch, K. et al. (2025). FAST: Efficient Action Tokenization for Vision-Language-Action Models (pi0-FAST). arXiv:2501.09747. [arXiv]
[21] Bjorck, J. et al. (2025). GR00T N1: An Open Foundation Model for Generalist Humanoid Robots. arXiv:2503.14734. [arXiv]
[22] Huang, W. et al. (2022). Inner Monologue: Embodied Reasoning through Planning with Language Models. arXiv:2207.05608. [arXiv]
[23] Liu, H. et al. (2024). CogACT: A Foundational VLA Model with Cognitive-Inspired Action Chunking Transformer. arXiv:2411.19650. [arXiv]
[24] Liu, H. et al. (2024). RDT-1B: A Diffusion Foundation Model for Bimanual Manipulation. arXiv:2410.07864. [arXiv]
[25] Team, Octo Model et al. (2024). Octo: An Open-Source Generalist Robot Policy. arXiv:2405.12213. [arXiv]
[26] Shridhar, M. et al. (2021). CLIPort: What and Where Pathways for Robotic Manipulation. arXiv:2109.12098. [arXiv]
[27] Radford, A. et al. (2021). Learning Transferable Visual Models From Natural Language Supervision (CLIP). arXiv:2103.00020. [arXiv]
[28] Dosovitskiy, A. et al. (2020). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ViT). arXiv:2010.11929. [arXiv]
[29] Brown, T. et al. (2020). Language Models are Few-Shot Learners (GPT-3). arXiv:2005.14165. [arXiv]
[30] Oquab, M. et al. (2023). DINOv2: Learning Robust Visual Features without Supervision. arXiv:2304.07193. [arXiv]
[31] Physical Intelligence (2025). pi0.5: a Vision-Language-Action Model with Open-World Generalization. arXiv:2503.01222. [arXiv]
[32] Shukor, M. et al. (2025). SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics. arXiv:2506.01844. [arXiv]
[33] Ma, Y. et al. (2025). BitVLA: 1-bit Vision-Language-Action Models. arXiv:2505.07256. [arXiv]
[34] Wu, J. et al. (2025). TinyVLA: Towards Fast and Data-Efficient VLA. arXiv:2409.12514. [arXiv]
[35] Yue, W. et al. (2024). DeeR-VLA: Dynamic Inference of Multimodal LLMs for Efficient Robot Execution. arXiv:2411.02359. [arXiv]
[36] Liang, J. et al. (2023). Code as Policies: Language Model Programs for Embodied Control. arXiv:2209.07753. [arXiv]
[37] Zhen, H. et al. (2024). 3D-VLA: A 3D Vision-Language-Action Generative World Model. arXiv:2403.09631. [arXiv]
[38] Huang, W. et al. (2022). Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents. arXiv:2201.07207. [arXiv]
[39] Qu, D. et al. (2025). SpatialVLA: Exploring Spatial Representations for VLA Models. arXiv:2501.15830. [arXiv]
[40] Wen, B. et al. (2024). TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for VLA. arXiv:2412.10345. [arXiv]
[41] Hu, T. et al. (2025). Vision-Language-Action Models for Autonomous Driving: Past, Present, and Future. arXiv:2512.16760. [arXiv]
[42] Edge Survey (2026). Embodied Foundation Models at the Edge: A Survey of Deployment Constraints and Mitigation Strategies. arXiv:2603.16952. [arXiv]
[43] Guan, W. et al. (2025). Efficient Vision-Language-Action Models for Embodied Manipulation: A Systematic Survey. arXiv:2510.17111. [arXiv]
[44] Large Model Embodied AI (2025). Large Model Empowered Embodied AI: A Survey on Decision-Making and Embodied Learning. arXiv:2508.10399. [arXiv]
[45] Jiang, Y. et al. (2022). VIMA: General Robot Manipulation with Multimodal Prompts. arXiv:2210.03094. [arXiv]
[46] Hwang, J. et al. (2024). EMMA: End-to-End Multimodal Model for Autonomous Driving. arXiv:2410.23262. [arXiv]
[47] Fu, H. et al. (2025). ORION: A Holistic End-to-End Autonomous Driving Framework. arXiv:2503.19755. [arXiv]
[48] Zhou, X. et al. (2025). AutoVLA: Autonomous Driving with Adaptive Reasoning and RL Fine-Tuning. arXiv:2506.13757. [arXiv]
[49] Yang, Z. et al. (2025). DriveMoE: Mixture-of-Experts for End-to-End Autonomous Driving. arXiv:2505.16278. [arXiv]
[50] Doshi, R. et al. (2024). CrossFormer: Scaling Cross-Embodied Learning. arXiv:2408.11812. [arXiv]
[51] Wu, H. et al. (2023). GR-1: Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation. arXiv:2312.13139. [arXiv]
[52] Tan, Z. et al. (2025). FlashVLA: Token-Aware Compression and Action Reuse for Efficient VLA Inference. arXiv:2505.21200. [arXiv]
[53] Zheng, K. et al. (2025). X-VLA: Cross-Embodiment Vision-Language-Action Model. arXiv:2510.10274. [arXiv]
[54] Du, Y. et al. (2025). HiMoE-VLA: Hierarchical Mixture-of-Experts for Generalist VLA Policies. arXiv:2512.05693. [arXiv]
[55] Zhao, Y. et al. (2025). CoT-VLA: Visual Chain-of-Thought Reasoning for VLA Models. arXiv:2503.22020. [arXiv]
[56] Li, X. et al. (2024). SC-VLA: A Self-Correcting VLA Model for Fast and Slow System Manipulation. arXiv:2405.17418. [arXiv]
[57] Huang, W. et al. (2023). VoxPoser: Composable 3D Value Maps for Robotic Manipulation. arXiv:2307.05973. [arXiv]
[58] Huang, S. et al. (2024). A3VLM: Actionable Articulation-Aware Vision Language Model. arXiv:2406.07549. [arXiv]
[59] Black, K. et al. (2023). SuSIE: Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models. arXiv:2310.10639. [arXiv]
[60] Lee, S. et al. (2024). VQ-BeT: Behavior Generation with Latent Actions. arXiv:2403.03181. [arXiv]
[61] Ye, D. et al. (2024). LAPA: Latent Action Pretraining from Videos. arXiv:2410.11758. [arXiv]
[62] Gu, Y. et al. (2023). RT-Trajectory: Robotic Task Generalization via Hindsight Trajectory Sketches. arXiv:2311.01977. [arXiv]
[63] Williams, R. et al. (2025). Lite VLA: Efficient VLA Control on CPU-Bound Edge Robots. arXiv:2511.05642. [arXiv]
[64] Zhang, H. et al. (2025). DreamVLA: A VLA Model Dreamed with Comprehensive World Knowledge. arXiv:2507.04447. [arXiv]
[65] Wen, C. et al. (2025). dVLA: Diffusion VLA with Multimodal Chain-of-Thought. arXiv:2509.25681. [arXiv]
[66] Chen, Z. et al. (2025). TGRPO: Fine-tuning VLA via Trajectory-wise Group Relative Policy Optimization. arXiv:2506.08440. [arXiv]
[67] Huang, J. et al. (2025). ThinkAct: VLA Reasoning via Reinforced Visual Latent Planning. arXiv:2507.16815. [arXiv]
[68] Lu, Y. et al. (2025). VLA-RL: Towards Masterful Robotic Manipulation with Scalable RL. arXiv:2505.18719. [arXiv]
[69] Chen, R. et al. (2025). ConRFT: A Reinforced Fine-tuning Method for VLA via Consistency Policy. arXiv:2502.05450. [arXiv]
[70] Li, Q. et al. (2025). SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning. arXiv:2509.09674. [arXiv]
[71] Tan, W. et al. (2025). RIPT-VLA: Interactive Post-Training for VLA Models. arXiv:2505.17016. [arXiv]
[72] Miao, L. et al. (2025). FedVLA: Federated VLA Learning with Dual Gating MoE. arXiv:2508.02190. [arXiv]
[73] Fan, Y. et al. (2025). Long-VLA: Unleashing Long-Horizon Capability of VLA for Robot Manipulation. arXiv:2508.19958. [arXiv]
[74] Koo, J. et al. (2025). RetoVLA: Reusing Register Tokens for Spatial Reasoning in VLA. arXiv:2509.21243. [arXiv]
[75] Zhang, S. et al. (2025). SafeVLA: Towards Safety Alignment of VLA via Constrained Learning. arXiv:2503.03480. [arXiv]
[76] Cen, J. et al. (2025). WorldVLA: Towards Autoregressive Action World Model. arXiv:2506.21539. [arXiv]
[77] Li, Z. et al. (2025). PointVLA: Injecting the 3D World into VLA Models. arXiv:2503.07511. [arXiv]
[78] Yu, F. et al. (2025). ForceVLA: Enhancing VLA with Force-aware MoE for Contact-rich Manipulation. arXiv:2505.22159. [arXiv]
[79] Liu, J. et al. (2025). HybridVLA: Collaborative Diffusion and Autoregression in a Unified VLA Model. arXiv:2503.10631. [arXiv]
[80] Bu, Z. et al. (2025). UniVLA: Learning to Act Anywhere with Task-centric Latent Actions. arXiv:2505.06111. [arXiv]
[81] Deng, Y. et al. (2025). GraspVLA: Grasping Foundation Model Pre-trained on Billion-scale Synthetic Data. arXiv:2505.03233. [arXiv]
[83] Tian, R. et al. (2023). RAPL: What Matters to You? Visual Representation Alignment for Robot Learning. arXiv:2310.07932. [arXiv]
[84] Xia, Z. et al. (2025). HAPO: Human-assisted Robotic Policy Refinement via Action Preference Optimization. arXiv:2506.07127. [arXiv]
[85] Patel, D. et al. (2025). IKER: Real-to-Sim-to-Real with VLM-Generated Iterative Keypoint Rewards. arXiv:2502.08643. [arXiv]
[86] Xu, J. et al. (2025). KV-Efficient VLA: Speed up VLMs with RNN-Gated Chunked KV Cache. arXiv:2509.21354. [arXiv]
[87] Chen, X. et al. (2023). GenAug: Retargeting Behaviors to Unseen Situations via Generative Augmentation. arXiv:2302.06671. [arXiv]
[88] Mandi, Z. et al. (2022). CACTI: A Framework for Scalable Multi-Task Multi-Scene Visual Imitation Learning. arXiv:2212.05711. [arXiv]
[89] Yu, T. et al. (2023). ROSIE: Scaling Robot Learning with Semantically Imagined Experience. arXiv:2302.11550. [arXiv]
[90] Xiao, T. et al. (2022). DIAL: Robotic Skill Acquisition via Instruction Augmentation with VLMs. arXiv:2211.11736. [arXiv]
[91] Tian, X. et al. (2024). DriveVLM: The Convergence of Autonomous Driving and Large VLMs. arXiv:2402.12289. [arXiv]
[92] Zawalski, K. et al. (2024). ECoT: Robotic Control via Embodied Chain-of-Thought Reasoning. arXiv:2407.08693. [arXiv]
[93] Du, Y. et al. (2023). UniPi: Learning Universal Policies via Text-Guided Video Generation. arXiv:2302.00111. [arXiv]
[94] Nematollahi, I. et al. (2025). LUMOS: Language-Conditioned Imitation Learning with World Models. arXiv:2503.10370. [arXiv]
[95] Chi, B. et al. (2025). MinD: Learning A Dual-System World Model for Real-Time Planning. arXiv:2506.18897. [arXiv]
[96] Wang, L. et al. (2024). HPT: Scaling Proprioceptive-Visual Learning with Heterogeneous Pre-trained Transformers. arXiv:2409.20537. [arXiv]
[97] Ross, S. et al. (2011). DAgger: A Reduction of Imitation Learning to No-Regret Online Learning. arXiv:1011.0686. [arXiv]
[98] Hancock, W. et al. (2025). Actions as Language: Fine-Tuning VLMs into VLAs Without Catastrophic Forgetting. arXiv:2509.22195. [arXiv]
[99] GraspVerse (2025). Large-scale Synthetic Grasp Data Generation. (14개 서베이 외 출처)
[100] Cheang, C. et al. (2024). GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation. arXiv:2410.06158. [arXiv]
[101] Yang, S. et al. (2023). UniSim: Learning Interactive Real-World Simulators. arXiv:2310.06114. [arXiv]
[102] Singh, I. et al. (2023). ProgPrompt: Generating Situated Robot Task Plans using Large Language Models. arXiv:2209.11302. [arXiv]
[103] Wang, G. et al. (2023). Voyager: An Open-Ended Embodied Agent with Large Language Models. arXiv:2305.16291. [arXiv]
[104] Vemprala, S. et al. (2024). ChatGPT for Robotics: Design Principles and Model Abilities. arXiv:2306.17582. [arXiv]
[105] Nasiriany, S. et al. (2024). RT-Affordance: Affordances are Versatile Intermediate Representations for Robot Learning. arXiv:2411.02704. [arXiv]
[106] Xu, Z. et al. (2025). A0: An Autonomous Agent with Adaptive Action Generation. arXiv:2504.12636. [arXiv]
[107] Wang, H. et al. (2025). VQ-VLA: Vector Quantized Vision-Language-Action Model. arXiv:2507.01016. [arXiv]
[108] Liu, B. et al. (2025). Embodied-R1: Incentivizing Reasoning in Embodied VLA Models. arXiv:2508.13998. [arXiv]
[109] Wang, Z. et al. (2025). GRAPE: Generalizing Robot Policy via Preference Alignment. arXiv:2411.19309. [arXiv]
[110] Bousmalis, K. et al. (2024). RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation. arXiv:2306.11706. [arXiv]
[111] Shi, L. et al. (2025). ReVLA: Reverting Visual Domain from LLM to VLA. arXiv:2409.15250. [arXiv]
[112] Zheng, Z. et al. (2025). UniAct: Universal Action Representation for Robotic Learning. arXiv:2501.10105. [arXiv]
[113] Li, J. et al. (2025). BridgeVLA: Bridging the Gap Between VLA and Low-Level Robot Control. arXiv:2506.07961. [arXiv]
[114] Park, D. et al. (2025). SQIL: Sub-4-bit Quantization of Large VLAs via Self-play Fine-tuning. arXiv:2505.15304. [arXiv]
[115] Heo, J. et al. (2025). QAIL: Quantization-Aware Imitation Learning for Resource-Efficient VLA. arXiv:2412.01034. [arXiv]
[116] Li, S. et al. (2025). SQAP-VLA: Stochastic Quantization with Adaptive Precision for VLA. arXiv:2509.09090. [arXiv]
[117] Qu, L. et al. (2025). MoLe-VLA: Mixture of Lightweight Experts for VLA. arXiv:2503.20384. [arXiv]
[118] Niu, X. et al. (2025). EfficientVLA: An Efficient Vision-Language-Action Model. arXiv:2506.10100. [arXiv]
[119] Cheng, Y. et al. (2025). FLOWER: Flow-based World Model for Efficient Robot Learning. arXiv:2509.04996. [arXiv]
[120] Zhao, Y. et al. (2025). RLRC: Reinforcement Learning with Reasoning Consistency for VLA. arXiv:2506.17639. [arXiv]
[121] Wen, Z. et al. (2025). CEED-VLA: Confidence-Enhanced Early-Exit Decoding for VLA. arXiv:2506.13725. [arXiv]
[122] Julg, M. et al. (2025). RPD: Robot Policy Distillation from Vision-Language-Action Models. arXiv:2503.05833. [arXiv]
[123] Shen, W. et al. (2025). SP-VLA: Spatial-aware Parallel Decoding VLA. arXiv:2506.12723. [arXiv]
[124] Xu, Y. et al. (2025). VLA-Cache: Accelerating VLA Inference via KV Cache Compression. arXiv:2502.02175. [arXiv]
[125] Lin, X. et al. (2025). CronusVLA: Efficient VLA with Temporal Cronus Attention. arXiv:2506.19816. [arXiv]
[126] Leal, I. et al. (2024). SARA-RT: Scaling up Robotics Transformers with Self-Adaptive Robust Attention. arXiv:2312.01990. [arXiv]
[127] Liu, Y. et al. (2024). RoboMamba: Efficient Vision-Language-Action Model with Mamba SSM. arXiv:2406.04339. [arXiv]
[128] Xu, J. et al. (2025). GeRM: A Generalist Robotic Model via Foundation Models. arXiv:2403.13358. [arXiv]
[129] Chen, Q. et al. (2025). PD-VLA: Parallel Decoding for Efficient VLA Inference. arXiv:2503.02310. [arXiv]
[130] Wang, X. et al. (2025). Spec-VLA: Speculative Decoding for Accelerating VLA Models. arXiv:2507.22424. [arXiv]
[131] Yang, Z. et al. (2025). EgoVLA: Egocentric Vision-Language-Action Model. arXiv:2507.12440. [arXiv]
[132] Hung, Y. et al. (2025). NORA: Normalizing Flow-based Robot Action Generation. arXiv:2504.19854. [arXiv]
[133] Budzianowski, P. et al. (2025). EdgeVLA: Efficient VLA Deployment on Edge Devices. arXiv:2507.14049. [arXiv]
[134] Kim, D. et al. (2025). DiVLA-2B: Diffusion VLA at 2B Scale. arXiv:2412.03293. [arXiv]
[135] Park, J. et al. (2026). HyperVLA: Dynamic Policy Generation via Hypernetworks. arXiv:2510.04898. [arXiv]
[136] Liu, Q. et al. (2026). AutoQVLA: Automated Quantization for VLA Models. arXiv:2602.03782. [arXiv]
[137] Zhang, S. et al. (2025). Humanoid-VLA: Vision-Language-Action for Humanoid Robots. arXiv:2502.14795. [arXiv]
[138] Li, W. et al. (2025). Being-H0: Humanoid Robot Foundation Model. arXiv:2507.15597. [arXiv]
[139] Chen, X. et al. (2025). FP3: Foundation Policy with Predictive Planning. arXiv:2503.08950. [arXiv]
[140] Li, J. et al. (2025). SafeAuto: Safety-Aware Autonomous Driving with VLA. arXiv:2503.00211. [arXiv]
[141] Wei, H. et al. (2025). LangCoop V2V: Language-based Cooperative Driving. arXiv:2504.13406. [arXiv]
[142] Wang, D. et al. (2025). CognitiveDrone: VLA for Cognitive Drone Control. arXiv:2503.01378. [arXiv]
[143] Zhao, R. et al. (2025). RaceVLA: Vision-Language-Action for Autonomous Racing. arXiv:2503.02572. [arXiv]
[144] Cheng, H. et al. (2025). NaVILA: Navigation with VLA. arXiv:2412.04453. [arXiv]
[145] Zhang, J. et al. (2024). Uni-NaVid: Unified Navigation with Video Diffusion. arXiv:2412.06224. [arXiv]
[146] Liu, M. et al. (2025). Mobility VLA: VLA for Mobile Robot Navigation. arXiv:2407.07775. [arXiv]
[147] Li, Z. et al. (2024). RoboNurse-VLA: Robotic Nursing Assistant with VLA. arXiv:2409.19590. [arXiv]
[148] Chen, J. et al. (2025). ObjectVLA: Object-Centric VLA Model. arXiv:2502.19250. [arXiv]
[149] Lin, K. et al. (2024). ShowUI: Vision-Language-Action Models for GUI Automation. arXiv:2411.17465. [arXiv]
[150] Li, H. et al. (2025). RoboArena: A Benchmark Arena for VLA Evaluation. arXiv:2506.18123. [arXiv]
[151] Nasiriany, S. et al. (2025). RoboCasa365: Large-Scale Robot Simulation Benchmark. arXiv:2603.04356. [arXiv]
[152] Zhang, W. et al. (2025). WorldGym: World Model Training Environments. arXiv:2506.00613. [arXiv]
[153] Huang, Z. et al. (2025). TactileVLA: Tactile-Enhanced VLA for Dexterous Manipulation. arXiv:2507.09160. [arXiv]
[154] Wang, Y. et al. (2025). OmniVTLA: Omni Vision-Tactile-Language-Action Model. arXiv:2508.08706. [arXiv]
[155] Tang, Y. et al. (2023). SayTap: Language to Quadrupedal Locomotion. arXiv:2306.07580. [arXiv]
[156] Wang, L. et al. (2024). HPT: Scaling Proprioceptive-Visual Learning with Heterogeneous Pre-trained Transformers. arXiv:2409.20537. [arXiv]
[157] Physical Intelligence (2025). π^*_0.6: a VLA That Learns From Experience. arXiv:2511.14759. [arXiv]
[158] Physical Intelligence (2026). MEM: Multi-Scale Embodied Memory for Vision Language Action Models. arXiv:2603.03596. [arXiv]

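A Unified VLA Survey

Everything About Vision-Language-Action Models
for Robots That See, Understand, and Act

🎤 Lecture-style version (PhD-level graduate course)
DGIST APRL Lab · Prof. Giseop Kim · April 2026

Table of Contents & Structure Map

1. Introduction→ 2. Chronology→ 3. Taxonomy→ 4. Architecture→ 5. Tokenization→ 6. Training→ 7. Efficiency→ 8. Applications→ 9. Evaluation→ 10. Outlook→ Ref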

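Sec 1

Introduction: What Is a VLA?

Three definitions of VLA (RT-2/Ma/Zhong), the 14-survey comparison table, and the purpose and methodology of this document

Why? Readers need to grasp the scope of VLA and where this document sits before they can put the rest in context.

Sec 2

A Chronology of VLA Evolution (2017–2026)

Phase 0-3: CLIP/ViT → Gato/SayCan → RT-2/Diffusion Policy → OpenVLA/π0 → GR00T N1/SmolVLA

Why? Showing that each advance is a causal consequence of the previous one provides the logical basis for the taxonomy (Sec 3).

Sec 3

A Unified Taxonomy: 14 Surveys into One

Five axes (architecture / action generation / anatomy / function / post-training) plus a meta-taxonomy (correspondence table)

Why? The unified framework only holds if the individual surveys' taxonomies are shown to be "complementary projections" rather than "competitors".

Sec 4

Architecture Deep Dive

Perception (SigLIP+DINOv2) → brain (VLM-as-Brain, 4 stages) → action (Diffusion/Flow/FAST) → dual system

Why? If the taxonomy (Sec 3) says "what exists", the dissection (Sec 4) explains "what happens inside".

Sec 5

Action Tokenization: The Key Design Decision

Eight token types, the control-frequency spectrum (1Hz→120Hz), and the mechanism by which tokenization determines performance

Why? As Chen et al. [7] argue, the main source of differences between VLA models is the tokenization scheme, so it warrants a separate in-depth analysis.

Sec 6

The Evolution of Training Paradigms

Pretraining (internet→robot), limits of BC, RL post-training (GRPO/DPO/PPO), Newell's theory, lifelong learning

Why? If architecture (Sec 4) and tokenization (Sec 5) are the "structure", training (Sec 6) is the process of "breathing intelligence" into that structure.

Sec 7

Efficiency: A Prerequisite for Real-World Deployment

Quantization/pruning/distillation, a comparison table of 12 lightweight models (55B→450M), six Pareto insights

Why? It is the "bridge" across the gap between academic performance (Sec 4-6) and real deployment (Sec 8). It examines how far performance holds up when a 55B model is shrunk toward 450M.

Sec 8

Application Domains: The World VLA Is Building

Manipulation/humanoids/autonomous driving/drones/medical/agriculture/GUI: a comparison table of 7 domains

Why? It shows "where" the technology (Sec 3-7) is used and demonstrates that domain-specific requirements feed back into technical design.

Sec 9

Datasets, Benchmarks, and Simulators

OXE/BridgeData/DROID datasets, LIBERO/CALVIN/Bench2Drive benchmarks, and the limits of evaluation protocols

Why? It provides the "report card" for the models and applications (Sec 3-8), and the limits of evaluation methods in turn determine research directions.

Sec 10-11

Open Problems, Unified Insights, Frontier Case Studies, and Conclusion

11 key challenges + 10 emergent cross-survey insights + frontier case studies from the π series (π^*_0.6 & MEM) + four future axes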

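Comparative Analysis of the 14 Reference Surveys

Legend: ● covered in depth ◐ partial/indirect ○ not covered | Columns correspond to the 10 core topics of this unified survey.

#	Survey	arXiv	Date	Architecture	Taxonomy	Action tokenization	Training paradigm	RL post-training	Efficiency / compression	Manipulation	Autonomous driving	Benchmarks	Outlook	Focus and value
[1]	Ma, Q. et al.	2405.14093	2024.05	◐	◐	○	◐	○	○	◐	◐	◐	●	Bird's-eye view of embodied AI as a whole. Not VLA-specific: a survey of the embodied-AI foundation-model ecosystem spanning LLM/VLM/VLA, and the first comprehensive work to place VLA within the larger AI landscape. As one of the earliest surveys, it centers on RT-2/PaLM-E and does not include 2025 models.
[2]	Kawaharazuka, K. et al.	2402.05741	2024.02	◐	○	○	●	○	○	●	○	●	◐	Specialized in real-world deployment experience. Its core value is practical insight accumulated from deploying real robots outside the lab (data-collection know-how, failure-mode analysis). The only survey that focuses on "what breaks in practice" rather than on architectural diversity.
[3]	Zhong, Z. et al.	2509.19012	2025.09	●	●	●	●	◐	◐	●	○	●	●	The most systematic Pure VLA taxonomy. Covers only Pure VLAs, which "output actions end-to-end from image + language with a single model", and presents a three-axis classification of architecture/training/data. Deliberately excludes modular pipelines such as SayCan.
[4]	Yu, Z. et al.	2510.24795	2025.10	●	◐	◐	◐	○	●	◐	○	●	◐	The deepest analysis of efficiency and model compression. The only survey that quantitatively compares deployment optimizations (quantization (PTQ/QAT), pruning, distillation, token caching, LoRA) across 12 models. Systematically addresses the question "can performance be preserved going from 55B to 450M parameters?"
[5]	Liu, N. & Shao, R. et al.	2508.13073	2025.08	●	●	●	●	◐	◐	●	○	●	●	The most detailed benchmark analysis for the manipulation domain. Centered on manipulation (grasping, bimanual, dexterous), with the deepest analysis of the LIBERO/CALVIN/SimplerEnv benchmarks. Clearly lays out the monolithic/hierarchical architecture dichotomy and deliberately excludes navigation and autonomous driving.
[6]	Zhang, Y. et al.	2505.04769	2025.05	●	◐	◐	●	○	○	●	◐	●	●	A broad overview of VLA concepts and applications. Its strength is breadth of concepts and application domains rather than technical depth. Optimized for giving newcomers a quick map of the field, and covers less mainstream domains such as medical, agriculture, and GUI.
[7]	Chen, Y. et al.	2507.01925	2025.07	◐	◐	●	◐	○	○	◐	○	◐	○	The deepest, most specialized analysis of action tokenization. Systematically classifies eight token types (language/code/affordance/trajectory/goal/latent/raw/reasoning) and dissects discrete vs. continuous actions, the frequency spectrum, and action-chunking trade-offs. Modules other than tokenization are covered only minimally.
[8]	Xu, C. et al.	2512.11362	2025.12	●	●	●	●	○	◐	●	◐	●	●	The only approach that analyzes VLA as "anatomy". Organizes per-module design choices with an organ analogy: perception (eyes) → brain (VLM) → action (hands). One of the most recent surveys (2025.12), covering up to GR00T N1 and π0.5, but it barely addresses RL post-training.
[9]	Jin, A. et al.	2506.20966	2025.06	◐	○	◐	●	●	○	◐	○	◐	●	The specialist survey at the frontier of VLA+RL. The most detailed analysis of post-training methods (PPO/GRPO/DPO/ConRFT [69]) that use RL to overcome the limits of BC. Spans online/offline RL, reward design, and preference optimization, but is weak on the pretraining stage.
[10]	Jiang, H. et al.	2506.24044	2025.06	●	●	◐	●	◐	◐	○	●	●	●	The most detailed specialist analysis of autonomous-driving (AD) VLA. Organizes the four-generation evolution of AD-VLA (EMMA, ORION [47], DriveMoE [49], AutoVLA [48], and others) and analyzes safety, V2V cooperation, and simulators (CARLA/Bench2Drive) in depth. Completely excludes the manipulation and navigation domains.
[41]	Hu, T. et al.	2512.16760	2025.12	●	●	◐	◐	○	◐	○	●	●	●	A finer-grained AD-VLA taxonomy. A 2×2 classification of End-to-End VLA (textual/numerical action) and Dual-System VLA (explicit guidance/implicit transfer); proposes the WorldBench unified evaluation platform. More fine-grained and AD-specific than Jiang et al. [10].
[42]	Edge Survey	2603.16952	2026.03	◐	○	○	○	○	●	◐	◐	◐	●	System-level analysis of edge deployment. Identifies the seven coupled constraints of the "Deployment Gauntlet" (size/weight/power/memory/compute/timing/safety). Proposes architecture-specific optimization: VLAs are bottlenecked by memory bandwidth, diffusion models by compute latency.
[43]	Guan, W. et al.	2510.17111	2025.10	◐	◐	◐	◐	○	●	●	○	◐	◐	A four-dimensional taxonomy of manipulation efficiency. Four axes: architecture / perceptual feature extraction / action generation / training and inference. Whereas Yu et al. [4] centers on model compression, this survey treats efficiency across the whole perception-action pipeline as independent dimensions.
[44]	Large Model Embodied AI	2508.10399	2025.08	●	◐	○	●	◐	○	◐	◐	◐	●	A decision-making framework built on large models. Splits decision-making into hierarchical vs. end-to-end paradigms and analyzes world models as a third axis connecting the two. Views VLA from a decision-making perspective.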

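Section 1: Introduction (What Is a VLA?)

1.1 Defining VLA: Three Perspectives

OK, so what is a VLA? It turns out even the definition is contested. A Vision-Language-Action model can be described as a unified neural network in which a robot looks at a scene through its camera, understands a natural-language instruction, and directly generates physical actions. The catch is that the research community still has not agreed on what "VLA" means. Broadly there are three perspectives, or strictly speaking four, coexisting, and understanding them precisely is the first gate you have to pass in this field.

The narrow definition: the original meaning coined by RT-2 [11]. Google DeepMind's RT-2 [11] paper introduced the term "VLA" in 2023, and there it came with a very specific technical prescription: take a large pretrained Vision-Language Model (VLM) and fine-tune it directly to predict robot actions. Concretely, RT-2 [11] took two existing VLMs, PaLI-X (55B) and PaLM-E [18] (12B), represented robot actions as text tokens, and made those action tokens part of the VLM's output vocabulary. The key point is that under this definition the essence of a VLA is "transferring internet-scale vision-language knowledge to robot actions". A VLM backbone is a necessary condition.

The broad definition: the inclusive category of Ma et al. [1] (2024). Now a definition that widens the scope considerably. Ma et al. [1], surveying embodied AI as a whole, define VLA much more broadly: any model that takes vision and language as input and generates actions. What falls out of that? A modular system like SayCan [14], where an LLM does only high-level planning and a separate policy handles low-level control, is a VLA. A model like RT-1 [12], built on its own architecture with no VLM backbone, is a VLA. Even a code-generating system like Code as Policies [36] counts. This broad definition is useful for mapping the whole field, but it has been criticized for diluting the technical specificity of the term. Why does this matter? If you cannot tell which definition an author is using, papers get confusing fast.

The pure definition: "Pure VLA" from Zhong et al. [3] (2025). Zhong et al. [3], who proposed a recent taxonomy, introduce "Pure VLA" to resolve the tension between the two definitions above. A Pure VLA integrates perception, language understanding, and action generation within a single end-to-end sequence-modeling framework. There are three criteria: (1) both vision and language are used as input; (2) actions are the model's direct output, with no separate low-level controller in between; (3) the whole pipeline is one trainable model. Under these criteria SayCan [14] is out because it is modular, Code as Policies [36] is out because it outputs code, while RT-2 [11], OpenVLA [15], and π0 [16] qualify as Pure VLAs.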
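One way to keep the three Pure VLA criteria straight is to write them down as a checklist. The sketch below is only an illustration: the class name, field names, and the per-system flag values are assumptions made for this example (a rough reading of the paragraph above), not a classification published by Zhong et al. [3].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemProfile:
    """Hypothetical checklist mirroring the three Pure VLA criteria."""
    name: str
    uses_vision_and_language: bool   # criterion (1)
    outputs_actions_directly: bool   # criterion (2)
    single_trainable_model: bool     # criterion (3)


def is_pure_vla(p: SystemProfile) -> bool:
    # A system counts as Pure VLA only if all three criteria hold.
    return (p.uses_vision_and_language
            and p.outputs_actions_directly
            and p.single_trainable_model)


# Flag values are illustrative assumptions based on the text above.
PROFILES = [
    SystemProfile("SayCan", True, False, False),            # modular: LLM planner + separate policies
    SystemProfile("Code as Policies", True, False, False),  # emits code rather than actions
    SystemProfile("RT-2", True, True, True),
    SystemProfile("OpenVLA", True, True, True),
]

verdicts = {p.name: is_pure_vla(p) for p in PROFILES}
for name, pure in verdicts.items():
    print(f"{name}: {'Pure VLA' if pure else 'not Pure VLA'}")
```

The point of the checklist form is that a system is excluded as soon as any one criterion fails, which is why both modular planners and code generators fall outside the definition.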

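This one could come up in an interview: Zhong et al. [3] further subdivide Pure VLA into four categories: (1) autoregressive (RT-2 [11], OpenVLA [15]), (2) diffusion-based (π0 [16], CogACT [23]), (3) reinforcement-learning-based fine-tuning, and (4) hybrid and specialized methods. The scheme considers the action decoder's generation mechanism and the training paradigm together.

Beyond these three definitions, Kawaharazuka et al. [2] propose their own boundary criterion. They accept as VLA only "a system that takes visual observations and natural-language instructions as required inputs and directly generates control commands", and explicitly exclude "high-level policies that select an index from a set of predefined skills". In placing SayCan [14]-style skill-selection systems outside VLA this agrees with Zhong et al. [3]'s Pure VLA definition, but it is distinctive in making "direct generation of control commands", rather than the presence of a VLM backbone, the central criterion.

To sum up so far: this document keeps all four definitions in view but in practice uses Zhong et al. [3]'s Pure VLA definition as its central axis. That said, modular systems (SayCan [14], Inner Monologue [22]) and hierarchical architectures (π0.5 [31], GR00T N1 [21]) are important parts of the VLA ecosystem and will be covered where appropriate.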

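1.2 Why VLA: The Nature of the Paradigm Shift

Now let's talk about the "why". For some fifty years the traditional robotics architecture rested on a three-part structure, the sense-plan-act pipeline. A perception module processes sensor data, a planning module searches for a path to the goal, and a control module generates joint commands. Each module is designed independently, and modules communicate through hand-defined representations such as object poses, grid maps, and joint trajectories.

This separation did deliver mathematical rigor and ease of debugging, but it carried three fundamental bottlenecks. That matters because VLA emerged precisely as an attempt to remove them.

First, the representation bottleneck. The information passed between modules is limited by the expressiveness of the hand-designed representations. Take the instruction "put the blue plate next to the red mug into the sink". The perception module has to detect the objects, a language module has to parse the instruction, a grasping module has to compute a grasp pose, and a motion planner has to generate a collision-free path. Information is lost at every interface, and a failure in one module brings down the whole pipeline.

Second, the generalization bottleneck. Because each module is tuned independently for particular environments and objects, adapting to a new environment or object means re-engineering the entire pipeline.

Third, the knowledge-utilization bottleneck. The internet holds billions of images and trillions of words of text, yet the traditional pipeline had no route at all for exploiting that prior knowledge.

VLA goes after all three at once. Connecting sensor input to action output as a single differentiable function removes the representation bottleneck, and inheriting the pretrained knowledge of an internet-scale VLM addresses generalization and knowledge utilization together. The whole process of seeing (Vision), understanding (Language), and acting (Action) happens end-to-end inside one neural network.

Here is an analogy. If traditional robotics was "an international conference where you communicate through a chain of interpreters", VLA is "a one-on-one conversation in your native language". The errors and delays of intermediate translation disappear, and context and nuance come through intact.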

1.3 The Limits of the 14 Surveys and Why This Document Exists

Now, why is this document needed? From 2024 through early 2026, survey papers on VLA were published at an explosive rate. The core surveys this document draws on are summarized below:

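Survey	Core perspective	Distinctive strength	Main limitation
Ma et al. [1] (2024)	Bird's-eye view of embodied AI	Places VLA within the large foundation-model ecosystem	Lacks technical depth on VLA itself
Kawaharazuka et al. [2] (2025)	Real-world deployment	Practical insight drawn from deployment experience; 7-category architecture taxonomy (VLM+Discrete, VLM+Diffusion, VLM+Flow Matching, etc.)	Weak analysis of training paradigms (pretraining, RL post-training)
Zhong et al. [3] (2025)	Pure VLA taxonomy	The most systematic VLA classification framework	Excludes non-Pure VLA families
Yu et al. [4] (2025)	Efficiency and compression	In-depth analysis of inference cost, quantization, and caching	Does not cover training paradigms as a whole
Liu & Shao [5] (2025)	Manipulation	Detailed manipulation-centered benchmark analysis	Excludes navigation, autonomous driving, etc.
Zhang et al. [6] (2025)	Concepts and applications	Broad conceptual overview and list of applications	Favors overview over technical depth
Chen et al. [7] (2025)	Action tokenization	The deepest analysis of tokenization techniques	Architectural elements beyond tokenization are absent
Xu et al. [8] (2025)	VLA anatomy	Module-by-module dissection of input-processing-output	Does not cover post-training
Jin et al. [9] (2025)	RL post-training	Latest trends in combining VLA with RL	Weak analysis of the pretraining stage
Jiang et al. [10] (2025)	Autonomous-driving VLA	The most detailed analysis of AD-VLA	Excludes the manipulation/navigation domains
Hu et al. [41] (2025)	Past/present/future of autonomous-driving VLA	AD-specific End-to-End vs Dual-System VLA taxonomy; proposes WorldBench	A finer-grained AD-VLA taxonomy than Jiang et al. [10]
2026 Edge Survey [42] (2026)	System bottlenecks in edge deployment	"Deployment Gauntlet" of seven coupled constraints; VLA = memory-bandwidth bottleneck, diffusion = compute-latency bottleneck	System-level analysis rather than model compression
Guan et al. [43] (2025)	Efficient manipulation VLA	4-dimensional efficiency taxonomy (architecture/perception/action/training)	An efficiency perspective independent of Yu et al. [4]
Large Model Embodied AI [44] (2025)	Embodied-AI decision-making with large models	Hierarchical vs end-to-end decision-making; world-model integration	Views VLA from a decision-making perspective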

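※ Ma et al. [1] is a living survey that has been continuously updated since its first version in 2024, up to v7 (2026.02).

Each survey illuminates VLA through its own lens. The problem is that no single survey captures the whole picture. Surveys that go deep on architecture neglect deployment, surveys focused on efficiency skip training paradigms, and surveys specialized in one domain miss the intersections with other domains.

This document decomposes the 14 surveys down to the atomic level and builds a superset of their information. This is not a simple merge. Where different surveys analyze the same model from different angles, we integrate those crossing perspectives into an understanding richer than any individual survey offers. For example, for π0 [16], Zhong et al. [3] analyze its architectural classification, Yu et al. [4] its inference efficiency, and Jin et al. [9] the application of RL post-training; this document weaves the three into a single integrated profile.

What Sets This Document Apart

If the 14 surveys above are specialist surveys that each dig deep along a particular axis, this document is a meta-survey (a survey of surveys) that crosses those axes. Specifically, it offers the following contributions, none of which appear in any individual survey: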

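Axis of differentiation	Limitation of individual surveys	Contribution of this document
Five-axis unified taxonomy	Each survey uses its own taxonomy (architecture axis, anatomy axis, function axis, etc. are kept separate)	Integrates the five axes of architecture, action generation, anatomy, function, and post-training into one meta-taxonomy and makes the correspondences between surveys explicit
Cross-survey model profiles	The same model (e.g., π0) is analyzed piecemeal by different surveys from different angles	Provides per-model integrated profiles combining the perspectives of the 14 surveys (architecture + efficiency + RL post-training viewed together)
ICLR 2026 trends	Most surveys cover the literature only up to mid-2025	Analysis of 164 ICLR 2026 submissions (Reuss, 2026), reflecting recent trends such as discrete-diffusion VLA, ECoT [92], and self-improving RL
Emergent insights	Each survey's conclusions are drawn only within its own perspective	Cross-analyzes the 14 surveys to derive 10 emergent insights not stated in any individual survey (Section 11)
System view of edge deployment	Efficiency surveys [4][43] focus on model compression and do not analyze system-level constraints	Integrates the seven Deployment Gauntlet constraints of the 2026 Edge Survey [42] to present a model-system co-design perspective
Decision-making framework	Hierarchical and end-to-end are compared only technically, from an architectural standpoint	Adopts the decision-making paradigms of Large Model Embodied AI [44] and positions world models as a third axis alongside the two paradigms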

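1.4 Structure of This Document

Let's take a quick pass over the overall structure. It will serve as a map later when you need to look up a specific topic:

Section 1 (this section): Definition of VLA, why it matters, and the purpose of this document
Section 2: A chronology of VLA evolution, the story of development from 2017 to 2026
Section 3: Unified taxonomy, bringing the 14 surveys' classifications into one
Section 4: Architecture deep dive: Vision Encoder, VLM Backbone, Action Decoder
Section 5: Action tokenization: how continuous actions are translated into the model's language
Section 6: Training paradigms: Behavior Cloning, pretraining strategies, RL post-training
Section 7: Efficiency: compression and optimization for real-world deployment
Section 8: Application domains: manipulation, humanoids, autonomous driving, drones, medical
Section 9: Datasets, benchmarks, and simulators
Section 10-11: Open problems, unified insights, and conclusion

Each section digests all the relevant material from the 14 surveys and then explicitly provides cross-survey integrated insights. You will start to feel this in earnest from Section 3 onward.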

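Section 2: A Chronology of VLA Evolution (2017–2026)

Now let's follow the history of how VLA came to be. VLA did not appear out of nowhere. Three independent streams, computer vision, natural language processing, and robot learning, each carved its own canyon for decades before meeting at a single confluence in the early 2020s. This section traces that story in four phases.

Phase 0: Convergence of Foundational Technologies (2017–2021)

The Vision Revolution: From CNNs to ViT to CLIP

Let's start with vision. After AlexNet overwhelmed traditional computer vision at the 2012 ImageNet competition, CNNs evolved rapidly. ResNet (2015) showed that residual learning lets networks go as deep as 152 layers, and EfficientNet (2019) showed balanced scaling of width, depth, and resolution. The real turning point, though, was the Vision Transformer, ViT [28], in 2020. It splits an image into 16x16 patches and processes them with Transformer self-attention, and it showed that the scaling behavior established in NLP carries over to vision. More data, bigger models, better performance: that simple recipe fundamentally changed the direction of vision research.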
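To make "split the image into 16x16 patches" concrete, here is a minimal NumPy sketch of the patchify step. The 224×224 RGB input size and the function name are assumptions chosen for illustration; the linear projection and position embeddings that a real ViT applies afterwards are left out.

```python
import numpy as np


def patchify(image: np.ndarray, patch: int = 16) -> np.ndarray:
    """Cut an (H, W, C) image into non-overlapping patch x patch tiles and
    flatten each tile into one vector, giving one 'token' per tile."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image must tile evenly"
    gh, gw = h // patch, w // patch
    tiles = image.reshape(gh, patch, gw, patch, c)
    tiles = tiles.transpose(0, 2, 1, 3, 4)          # (gh, gw, patch, patch, c)
    return tiles.reshape(gh * gw, patch * patch * c)


# Assumed input size for illustration: a 224x224 RGB image.
image = np.arange(224 * 224 * 3, dtype=np.float32).reshape(224, 224, 3)
tokens = patchify(image)
print(tokens.shape)  # 14 x 14 = 196 tokens, each 16 * 16 * 3 = 768 values
```

Once the image is a sequence of tokens like this, the rest of the model is an ordinary Transformer, which is what lets the NLP scaling recipe carry over.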

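But the model with a more direct influence on VLA than ViT was OpenAI's CLIP [27] in 2021. Here is why it matters: CLIP ran contrastive learning on 400 million image-text pairs and aligned vision and language in a shared embedding space. A picture of a cat and the text "a photo of a cat" end up close together in vector space. This vision-language alignment became a core precondition for VLA. When a robot is told "pick up the red mug", its ability to connect the linguistic concept "red mug" to the visual object in the camera image comes from CLIP [27]-style pretraining.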
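The contrastive objective can be sketched in a few lines. The following is a minimal NumPy version of a symmetric image-text contrastive loss in the style of CLIP, not the actual CLIP training code: the random vectors stand in for encoder outputs, and the temperature value and function names are assumptions for illustration.

```python
import numpy as np


def log_softmax(x: np.ndarray, axis: int) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)          # numerical stability
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def clip_style_loss(img_emb: np.ndarray, txt_emb: np.ndarray,
                    temperature: float = 0.07) -> float:
    """Symmetric cross-entropy over cosine similarities.
    Row i of img_emb and row i of txt_emb are assumed to be a matched pair."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature                # (N, N), diagonal = matches
    idx = np.arange(logits.shape[0])
    loss_i2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return float(0.5 * (loss_i2t + loss_t2i))


rng = np.random.default_rng(0)
txt = rng.normal(size=(4, 8))                         # stand-in text embeddings
img_aligned = txt + 0.01 * rng.normal(size=(4, 8))    # images close to their captions
img_shuffled = txt[::-1].copy()                       # every image paired with the wrong caption

loss_aligned = clip_style_loss(img_aligned, txt)
loss_shuffled = clip_style_loss(img_shuffled, txt)
print(loss_aligned, loss_shuffled)
```

Minimizing this loss pulls matched image-text pairs together and pushes mismatched pairs apart, which is exactly the "shared embedding space" the paragraph describes.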

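SigLIP (2023), which followed, achieved more efficient training with a sigmoid loss, and DINOv2 [30] (2023) learned label-free visual representations through self-supervised learning; both went on to be widely adopted as VLA Vision Encoders. We will cover this in more detail in Section 4, but the choice of Vision Encoder has a sizable effect on VLA performance.

The Language Explosion: From GPT-3 to the Era of Large Language Models

Now the NLP side. After the Transformer architecture appeared in 2017, NLP entered an era of rapid scaling. BERT (2018) demonstrated bidirectional context understanding and GPT-2 (2019) demonstrated autoregressive text generation; GPT-3 [29] (2020), with 175 billion parameters, changed the paradigm itself.

Why was GPT-3 [29] such a decisive inspiration for robotics researchers? Because it exhibited "emergent capabilities": performing new tasks from few-shot prompts alone, with no task-specific training. Train a large enough model on enough data, and abilities nobody explicitly programmed show up on their own.

That observation leads directly to two questions. First, could an LLM's world knowledge (common sense, physical intuition, task procedures) be used for robot planning? Second, might LLM scaling laws apply to robot policies too? The first question was answered in 2022 by SayCan [14] and Inner Monologue [22], the second in 2023 by RT-2 [11].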

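The Wall in Robot Learning: Behavior Cloning and Data Limits

Meanwhile, robot learning was wrestling with its own obstacles. Behavior Cloning (BC), which learns a policy by imitating expert demonstrations, was the most direct approach, but distribution shift was a fundamental limit: in states never seen during training, small errors compound, and the worst-case cost grows quadratically with the task horizon rather than linearly. DAgger [97] (2011) addressed this in theory, but repeatedly obtaining real-time corrections from an expert in the real world was impractical.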
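For reference, the compounding-error argument is usually summarized by the two bounds below. This is a sketch in notation chosen here for illustration, following the standard analysis in the DAgger line of work [97]; constants and exact assumptions are in the original paper.

```latex
% T          : task horizon (number of time steps)
% J(\pi)     : expected total cost of policy \pi over T steps
% \epsilon   : per-step imitation error of the BC policy under the expert's state distribution
% \epsilon_N : per-step error of the best policy in hindsight on the aggregated DAgger dataset
% u          : bound on how much the expert's cost-to-go increases after one wrong action
\begin{aligned}
\text{Behavior Cloning:}\quad & J(\hat{\pi}_{\mathrm{BC}}) \;\le\; J(\pi^{*}) + T^{2}\,\epsilon \\
\text{DAgger:}\quad & J(\hat{\pi}_{\mathrm{DAgger}}) \;\le\; J(\pi^{*}) + u\,T\,\epsilon_{N} + O(1)
\end{aligned}
```

The gap between the $T^{2}$ and $T$ terms is the formal version of "errors pile up once the policy drifts off the expert's states": DAgger closes it by querying the expert on the states the learner itself visits, which is precisely the step that is expensive on a real robot.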

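Offline reinforcement learning (offline RL) drew attention as an alternative to BC, with CQL (2020), IQL (2021), and others proposing ways to improve a policy using only previously collected data. But these too suffered from the limits of conservative estimation and from hyperparameter sensitivity, and they presupposed large datasets of tens to hundreds of thousands of episodes.

And here is the crux. Collecting robot data is slow and expensive to a degree that NLP and CV never had to face. ImageNet has millions of images and GPT-3 [29] was trained on hundreds of billions of tokens, whereas typical robot datasets of the time amounted to a few thousand episodes at best. This data gap remains one of the fundamental challenges of the VLA field to this day.

At the same time, the simulator ecosystem began to partially ease this data bottleneck. SAPIEN (2020) offered manipulation environments built on detailed physics simulation, AI2-THOR (2017) photorealistic household environments, and Habitat (2019) large-scale navigation environments; these simulators later became core infrastructure for training and evaluating VLAs.

The Key Precursor: CLIPort (2021)

The first point where the three streams (vision-language alignment, large language models, and robot learning) crossed was CLIPort [26] (2021). Shridhar et al. combined CLIP's vision-language representations with Transporter Networks (spatial action maps for robot manipulation) to perform language-conditioned tabletop manipulation.

Why does CLIPort [26] matter? Because it was the first to demonstrate empirically the core hypothesis that "vision-language knowledge pretrained at internet scale can transfer to robot actions". CLIPort [26] itself was not an end-to-end VLA but a hybrid that injected CLIP [27] features into a Transporter, yet this demonstration directly inspired RT-2 [11] and OpenVLA [15] later.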

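Phase 1: The Birth of VLA (2022–2023)

Now we reach the period in which VLA is actually born. The foundations that converged in Phase 0 began combining explosively from 2022. Like a chemical reaction that has just cleared its activation energy, a handful of breakthrough models appeared in quick succession and gave birth to a new field.

Gato and SayCan: The Start of Two Approaches (2022)

In the first half of 2022, DeepMind and Google released two contrasting experiments almost simultaneously. DeepMind's Gato [13] (2022) was a "generalist agent" that converted text, images, robot actions, and game inputs all into tokens and trained a single 1.2B-parameter Transformer on them. Gato's innovation was conceptual: proof that completely different modalities can be unified under one sequence-modeling framework. Its performance on each individual task fell short of specialist models, but it was the first to show that "one model can see, read, and act".

Around the same time, Google's SayCan [14] (2022) took the opposite philosophy. Instead of using an LLM (PaLM) to generate actions directly, it used the LLM only as a high-level planner. The structure is rather elegant. The LLM breaks a natural-language instruction into step-by-step subtasks: for example, "bring me a Coke" becomes "go to the kitchen" → "open the fridge" → "pick up the Coke" → and so on. At each step, the feasibility of every candidate subtask is evaluated by the affordance function (a value function) of a pretrained low-level policy, and the LLM's score is multiplied by the affordance to select the best action that can actually be executed.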
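The "multiply the LLM score by the affordance" step is simple enough to write out. This is a minimal sketch of the selection rule only: the skill names, the score values, and the function name are hypothetical, and in the real system the two score tables would come from a language model and from learned value functions.

```python
from typing import Dict


def select_skill(llm_scores: Dict[str, float],
                 affordances: Dict[str, float]) -> str:
    """Pick the skill that maximizes (LLM usefulness) x (feasibility now)."""
    combined = {skill: llm_scores[skill] * affordances[skill]
                for skill in llm_scores}
    return max(combined, key=combined.get)


# Hypothetical numbers for the first step of "bring me a Coke",
# with the robot standing far from the kitchen.
llm_scores = {            # how useful the LLM thinks each skill is for the instruction
    "go to the kitchen": 0.30,
    "open the fridge": 0.25,
    "pick up the Coke": 0.45,
}
affordances = {           # how likely each skill is to succeed from the current state
    "go to the kitchen": 0.90,
    "open the fridge": 0.10,
    "pick up the Coke": 0.05,
}

chosen = select_skill(llm_scores, affordances)
print(chosen)
```

Note how the LLM alone would pick "pick up the Coke", but the affordance term vetoes it because no Coke is reachable yet; grounding the language model in what the robot can actually do is the whole idea.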

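SayCan [14] does not meet Zhong et al. [3]'s Pure VLA definition, but it is a key ancestor in the VLA lineage because it was the first to demonstrate in the real world the idea of "connecting an LLM's world knowledge to a robot".

Inner Monologue [22] (2022) took SayCan [14]'s idea one step further. Rather than planning once and stopping, the LLM receives environment feedback on execution (success/failure, object-detection results, human corrections) as text and revises its plan dynamically, in a closed loop. This "inner monologue" is a direct ancestor of later reasoning-integrated VLAs such as CoT-VLA [55].

VIMA [45]: Following Multimodal Prompts (2022)

VIMA [45] (2022), from the same period, was an encoder-decoder Transformer that generated robot actions from prompts in a variety of modalities, not just text but also image goals, video demonstrations, and bounding boxes. It put forward the view that "language is not the only instruction channel" and became a basis for later work on multimodal instruction following.

RT-1: Proof of Large-Scale Real-World Learning (2022)

If SayCan [14] was a strategy of borrowing the LLM's "wisdom", RT-1 [12] (Robotics Transformer 1, 2022) went for a frontal assault. Google collected 130,000 real-world demonstration episodes with 13 robots over 17 months and trained a Transformer of about 35M parameters on them. The RT-1 architecture itself was fairly modest: EfficientNet + TokenLearner + Transformer decoder.

What RT-1 proved was not architecture but scale: trained on sufficiently diverse, large-scale real-world data, a single model could perform more than 700 different tasks.

RT-1 also exposed an important failure. Generalization to objects and environments absent from the training data was extremely limited. That limit led straight to the question "what if we inject knowledge from internet-scale pretraining?", and the answer was RT-2 [11].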

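RT-2 and PaLM-E: The Official Birth of VLA (2023)

2023 is the year VLA got its name. If this comes up in an interview, you should be able to answer "RT-2, 2023" immediately.

What did RT-2 [11] (2023) do? It took the existing VLMs PaLI-X (55B) and PaLM-E [18] (12B) and encoded robot actions as text tokens. Each action dimension was discretized into 256 bins and written as a string of integers, and the model was then co-fine-tuned on the VLM's original training data with a small amount of robot data mixed in.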
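Here is a minimal sketch of what "discretize each action dimension into 256 bins and write it as a string of integers" can look like. The normalized action range of [-1, 1], the 7-dimensional example action, the space-separated string format, and the function names are all assumptions made for illustration; the exact encoding used by RT-2 [11] may differ.

```python
import numpy as np


def action_to_text(action, low, high, n_bins: int = 256) -> str:
    """Uniformly discretize each action dimension and emit an integer string."""
    action = np.clip(np.asarray(action, dtype=float), low, high)
    bins = np.floor((action - low) / (high - low) * n_bins).astype(int)
    bins = np.minimum(bins, n_bins - 1)          # value == high falls in the last bin
    return " ".join(str(b) for b in bins)


def text_to_action(text: str, low, high, n_bins: int = 256) -> np.ndarray:
    """Decode the integer string back to bin-center action values."""
    bins = np.array([int(tok) for tok in text.split()])
    return low + (bins + 0.5) / n_bins * (high - low)


# Hypothetical 7-D action: xyz delta, rpy delta, gripper, all scaled to [-1, 1].
low, high = -1.0, 1.0
action = np.array([0.0, -1.0, 1.0, 0.5, -0.25, 0.1, 1.0])

text = action_to_text(action, low, high)
decoded = text_to_action(text, low, high)
print(text)
print(np.abs(decoded - action).max())   # quantization error, at most half a bin width
```

The design trade-off is visible directly: the VLM can emit actions with its ordinary text decoder, but every action value is rounded to one of 256 levels and every dimension costs at least one token.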

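The results were dramatic. Although RT-2 [11] was trained on the same robot data as RT-1, it generalized to objects unseen during training (as in "put the dinosaur in the correct bin") and to abstract concepts (as in "pick up the object that is different from the others"). Knowledge the VLM had learned from the internet transferred to robot actions.

PaLM-E [18] (2023), announced at almost the same time, was a giant 562B-parameter multimodal model that combined visual tokens, language tokens, and robot-state tokens into a single input sequence. PaLM-E [18] focused on multimodal understanding and planning more than on direct action generation, but it was the most extreme push of the scaling hypothesis that "one giant Transformer can digest vision, language, and robot actions all together".

The message of RT-2 [11] and PaLM-E [18] boils down to this: a VLM is not just an image captioner; with the right fine-tuning it can become a policy that generates actions in the physical world. That is the core proposition of the VLA paradigm, and 2023, the year it was demonstrated, is year one of the VLA field.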

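Diffusion Policy: A New Grammar for Action Generation (2023)

At almost the same time that RT-2 [11] proposed generating actions autoregressively as discrete tokens, Diffusion Policy [17] (Chi et al., 2023) opened an entirely different paradigm for action generation. It applied DDPM, the denoising diffusion approach behind the image-generation revolution of DALL-E and Stable Diffusion, to generating robot actions.

The key insight of Diffusion Policy [17] concerns the multimodality of robot actions. Concretely: for the instruction "pick up the cup" there is no single correct action. You can grasp from the right, from above, or by the handle. Training with a mean-squared-error (MSE) loss learns the mean of this multimodal distribution, which produces a meaningless action that belongs to none of the modes. Diffusion Policy [17] starts from Gaussian noise and generates an action sequence (action chunk) by iterative denoising, which represents multimodal distributions naturally. It also pairs naturally with action chunking, generating a whole action sequence at once, and the temporal consistency this provides is another core contribution.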
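The mode-averaging failure is easy to reproduce numerically. The toy below is an illustration only: the 1-D "approach direction" data is made up, and drawing a demonstration at random merely stands in for what a generative policy is trying to do (sample from the data distribution); it is not a diffusion model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D demonstrations of the same task: half the experts approach
# the cup from the left (action near -1), half from the right (action near +1).
demos = np.concatenate([
    rng.normal(-1.0, 0.05, size=500),
    rng.normal(+1.0, 0.05, size=500),
])

# For a fixed observation, the prediction that minimizes MSE is the mean.
mse_optimal = demos.mean()

# How far is that prediction from anything an expert actually did?
gap = np.abs(demos - mse_optimal).min()
print(f"MSE-optimal action: {mse_optimal:+.3f}; nearest demonstration is {gap:.3f} away")

# A policy that samples from the data distribution always lands on a real mode.
sampled = rng.choice(demos)
print(f"sampled action: {sampled:+.3f}")
```

The MSE-optimal action sits between the two modes, heading straight at the cup from a direction no expert ever used, while any sampled action is a valid left or right approach.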

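We will cover this in more detail in Section 4, but this approach went on to shape VLA Action Decoder design at a fundamental level. π0 [16]'s Flow Matching, CogACT [23]'s DiT-based decoder, and RDT-1B [24]'s diffusion transformer all developed on top of the grammar that Diffusion Policy [17] opened up.

Open X-Embodiment: A Turning Point in Data Integration (2023)

In late 2023, the Open X-Embodiment (OXE) [19] project led by Google DeepMind fundamentally changed the data infrastructure of the VLA field. It unified more than one million episodes collected on 22 robot platforms into a standardized format (RLDS), breaking the paradigm in which each lab collected data only on its own robots.

OXE [19]'s key finding is interesting: a policy trained on cross-embodiment data generalized better than a policy trained on single-robot data. Data from different robots acted as "diversity" rather than "noise" and prevented overfitting. Intuitively, when the model sees many kinds of robots perform the same task in different ways, it learns the essential structure of the task rather than the quirks of one particular robot.

OXE [19] became the training-data foundation of nearly every large VLA that followed. Octo [25], OpenVLA [15], and π0 [16] all used OXE [19] as core training data, and its very existence made the research direction of the "generalist robot policy" possible.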

Phase 2 — 다양화와 폭발적 성장 (2024)

2024년은 VLA 분야의 캄브리아기 폭발이었어요. RT-2 [11]와 Diffusion Policy [17]가 증명한 두 가지 패러다임 — 자기회귀 토큰 생성과 디퓨전 기반 행동 생성 — 을 기반으로, 수십 개의 새로운 모델이 등장하면서 분야 전체의 지형이 급격히 재편됐습니다.

범용 정책의 등장: Octo와 OpenVLA

Octo [25] (2024)는 OXE [19] 데이터셋을 본격 활용한 최초의 범용 교차 플랫폼 정책이었습니다. 93M 파라미터의 비교적 소형 모델인데요, Transformer 기반 아키텍처에 자기회귀 및 디퓨전 양쪽 행동 헤드를 모두 지원하는 유연한 구조를 갖추고 있었어요. Octo의 핵심 기여는 아키텍처 혁신보다는 "교차 로봇 사전학습 → 타겟 로봇 파인튜닝"이라는 학습 레시피를 확립한 데 있었습니다. 소량의 타겟 로봇 데이터만으로 새로운 플랫폼에 적응할 수 있음을 보여줘서, VLA의 전이 학습 패러다임을 실증한 거예요.

그 다음에 OpenVLA [15] (2024)가 나왔는데, 이게 VLA 민주화의 전환점이었습니다. RT-2 [11]가 55B 파라미터의 비공개 모델이었던 것에 반해, OpenVLA [15]는 7B 파라미터의 완전 오픈소스 모델이에요. Llama 2 기반 VLM에 OXE [19] 데이터로 파인튜닝해서, 누구나 재현 가능한 VLA를 제공한 겁니다.

성능도 인상적이었어요. OpenVLA [15]는 RT-2-X 대비 16.5% 높은 절대 성공률을 보여주면서도, 모델 크기를 약 7분의 1로 줄였습니다. 이 결과가 주는 시사점이 중요한데요 — "VLA에 반드시 수십에서 수백 억 파라미터가 필요한 것은 아니다"라는 거예요. 이후 소형 VLA 연구의 기반이 됩니다.

새로운 아키텍처 패러다임: π0과 CogACT

2024년의 가장 영향력 있는 아키텍처 혁신은 π0 [16] (Physical Intelligence, 2024)에서 나왔습니다. π0 [16]가 왜 중요하냐면, RT-2 [11] 계열의 자기회귀 접근과 Diffusion Policy [17]의 확산 접근을 절충하는 새로운 아키텍처를 제시했기 때문이에요.

구조를 설명하면요 — VLM backbone(PaliGemma 기반, ~3B 파라미터)이 시각과 언어를 이해하고, 그 위에 별도의 Action Expert(~0.3B)를 둬서 전체 약 3.3B 파라미터로 구성됩니다. 이 Action Expert가 Flow Matching 기반으로 연속 행동을 생성하는데, Flow Matching은 DDPM이 확률적 역과정을 반복하는 것과 달리, 노이즈에서 데이터로의 결정론적 ODE 경로(velocity field)를 직접 학습해서, 더 빠르고 안정적인 행동 생성을 가능하게 했어요.

π0 [16]의 진정한 파급력은 아키텍처뿐 아니라 성능에서 나왔습니다. 셔츠 접기라든가 식탁 정리 같은 장시간 복합 조작 작업에서 기존 모든 VLA를 압도하는 성능을 보여줬거든요. 이러면서 "VLM backbone + 디퓨전/플로우 행동 디코더"라는 조합이 VLA 아키텍처의 새로운 참조점이 됐습니다.

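CogACT [23] (2024) used a DiT (Diffusion Transformer) as its Action Decoder and introduced a technique for adaptively ensembling action candidates produced along multiple denoising paths. RDT-1B [24] (2024) used a DiT of about 1.2B parameters as a diffusion policy, showing that a Scalable Diffusion Transformer is also effective for VLA action generation. GR-2 (2024) used large-scale web video as pretraining data, proposing a strategy that sidesteps the scarcity of robot data through video pretraining.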

FAST 토큰화: 행동 시퀀스 압축의 돌파구

2024년의 또 다른 핵심 혁신이, 2024년 말에 개발되어 2025년 초 공개된 FAST(Fast Action Tokenization)인데요, 이건 상당히 실용적인 기여입니다. 기존 VLA에서 연속 행동을 토큰으로 변환하는 방식, 즉 bin 이산화가 극도로 비효율적이었거든요. 구체적으로 말하면 7 DoF 로봇의 행동 청크(16 타임스텝)를 표현하는 데 112개의 토큰이 필요했습니다.

FAST는 이걸 어떻게 해결했냐면, 이산 코사인 변환(DCT)으로 행동 시퀀스의 주파수 성분을 추출하고, 바이트 페어 인코딩(BPE)으로 반복 패턴을 압축해서, 동일한 행동 시퀀스를 최대 약 13배(원논문 기준 최대치) 압축된 토큰으로 표현한 거예요. 이 압축이 자기회귀 VLA의 추론 속도를 직접적으로 개선했고, 이후 π0-FAST [20]의 핵심 기반 기술이 됩니다.

3D 이해와 월드 모델의 결합

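3D-VLA [37] (2024) was a pioneering attempt to combine VLAs with 3D spatial understanding. VLAs built on 2D images have two fundamental limitations: they lack depth perception, and they cannot reason about occluded objects. 3D-VLA [37] tried to overcome both with a generative 3D world model. It predicts the future state of the 3D scene before executing an action and feeds that prediction back into action generation. This approach later developed into SpatialVLA [39], PointVLA [77], and others, opening a research direction that adds "imagination" to the perception-action loop.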

Phase 3 — 효율화와 배포 준비 (2025–2026)

자, 이제 현재 시점으로 넘어옵니다. 2024년의 캄브리아기 폭발이 "무엇이 가능한가"를 탐색했다면, 2025년 이후의 연구는 "어떻게 실용화할 것인가"로 무게중심이 이동했어요. 이 전환이 세 가지 축에서 동시에 진행되고 있습니다: 아키텍처의 계층화, 극단적 효율화, 그리고 안전과 신뢰성의 내재화.

계층적 아키텍처의 부상

GR00T N1 [21] (NVIDIA, 2025)은 인간 인지과학의 이중 과정 이론(Dual Process Theory)에서 영감을 받은 아키텍처를 제시했습니다. 이게 상당히 직관적인 설계인데요 — System 2(VLM 기반, 10Hz)가 고수준 이해와 계획을 담당하고, System 1(디퓨전 기반, 120Hz)이 반사적 저수준 행동을 생성합니다. VLM의 추론은 느리지만 풍부하고, 디퓨전의 생성은 빠르지만 단순하거든요. 두 시스템을 결합하면 고수준 이해의 깊이와 저수준 제어의 반응성을 동시에 달성할 수 있는 겁니다.
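
The two-rate division of labor can be sketched as a simple control loop. This is only an illustration under assumptions: `slow_system` and `fast_system` are hypothetical stand-ins for the VLM and the diffusion head, and only the 10 Hz and 120 Hz rates come from the text.

```python
SLOW_HZ = 10     # System 2 rate quoted in the text
FAST_HZ = 120    # System 1 rate quoted in the text
TICKS_PER_PLAN = FAST_HZ // SLOW_HZ  # 12 fast ticks per slow update


def slow_system(observation):
    """Hypothetical stand-in for the VLM: returns a latent 'plan'."""
    return {"plan_from_tick": observation}


def fast_system(plan, observation):
    """Hypothetical stand-in for the diffusion head: returns one action."""
    return (plan["plan_from_tick"], observation)


def run(num_ticks):
    plan, slow_calls, actions = None, 0, []
    for tick in range(num_ticks):
        if tick % TICKS_PER_PLAN == 0:        # refresh the plan at 10 Hz
            plan = slow_system(tick)
            slow_calls += 1
        actions.append(fast_system(plan, tick))  # act at 120 Hz on the latest plan
    return slow_calls, actions


slow_calls, actions = run(FAST_HZ)  # simulate one second of control
```

The point of the design is that the fast loop never waits for the slow one: it keeps acting on the most recent plan until a new one arrives.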

π0.5 [31] (Physical Intelligence, 2025)는 다른 방식의 계층화를 택했어요. 고수준 VLM이 자연어 하위 작업 시퀀스를 생성하고 — 예를 들면 "컵을 잡아" → "싱크대 위로 이동" → "컵을 놓아" 이런 식으로 — 저수준 π0 [16]가 각 하위 작업을 실행하는 구조입니다. 이 접근이 왜 중요하냐면, 30분 이상의 장시간 작업, 전체 주방 정리 같은 것을 VLA로 처리할 수 있는 경로를 열었기 때문이에요.

효율화의 극단적 추구

VLA의 실세계 배포를 가로막는 가장 직접적인 장벽이 연산 비용입니다. 7B 파라미터 모델의 추론에 16-24GB VRAM이 필요하고, 지연 시간이 수백 밀리초에 달하는 건 로봇 제어에 치명적이거든요. 2025년은 이 문제에 대한 해결책이 폭발적으로 제안된 해였습니다.

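Running through the main models: SmolVLA [32] (2025, 450M) answered the question "Can a VLA work without a large VLM?" With 450M parameters it can be trained on a single GPU, yet on simple tasks it comes close to OpenVLA [15] (7B). BitVLA [33] (2025) went further and applied 1.58-bit ternary quantization to a VLA, sharply reducing memory use. TinyVLA [34] (proposed in 2024, a forerunner of the efficiency trend) proposed distillation for shrinking the VLM backbone, EdgeVLA (2025) proposed optimizations for edge-device deployment, and VLA-Cache (2025) proposed visual-token caching to eliminate repeated computation. DeeR-VLA [35] (2025) adopted dynamic inference that activates a different depth of layers depending on input difficulty, and MoLe-VLA (2025) adopted a Mixture-of-Layers router that dynamically skips layers of the language backbone to cut computation.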

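The shared message of these efficiency studies is clear: the core value of a VLA lies in the knowledge of a large VLM, but delivering that knowledge does not necessarily require a large model. Techniques such as distillation, quantization, caching, and dynamic inference can compress a large model's knowledge into a small one, in a form suited to real-time robot control. We will go into this in much more depth later in Section 7.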
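
As a rough illustration of what "1.58-bit ternary quantization" means, here is a minimal sketch that assumes an absmean-style scheme in which each weight is scaled by the mean absolute value and rounded to -1, 0, or +1. The function name and the scheme are assumptions for illustration; BitVLA's actual procedure may differ.

```python
import numpy as np


def ternary_quantize(weights, eps=1e-8):
    """Map weights to {-1, 0, +1} plus one floating-point scale (assumed absmean scheme)."""
    scale = np.abs(weights).mean() + eps
    q = np.clip(np.round(weights / scale), -1, 1)
    return q.astype(np.int8), scale


rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8)).astype(np.float32)

q, scale = ternary_quantize(w)
w_hat = q * scale  # dequantized approximation used at inference time

# Three possible values per weight carry log2(3) bits of information.
bits_per_weight = np.log2(3)
```

The "1.58" figure is simply log2(3): a ternary weight needs about 1.58 bits, compared with 16 or 32 bits for a floating-point weight.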

RL 후처리의 등장

Behavior Cloning만으로 학습된 VLA의 근본적 한계가 있습니다 — 시연 품질이 성능의 천장이 되고, 시연에 없는 행동은 발견할 수 없다는 거예요. 이걸 극복하기 위해서 2025년에는 사전학습된 VLA를 강화학습으로 후처리(post-training)하는 연구가 본격화됐습니다.

VLA-RL [68](2025)은 GRPO(Group Relative Policy Optimization)를, ConRFT [69](2025)는 온라인 RL을, SimpleVLA-RL [70](2025)은 REINFORCE 기반의 단순화된 RL을, RIPT-VLA [71](2025)는 반복적 RL-기반 정교화를 각각 제안했어요.

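There is an interesting pattern here: these studies all follow the same recipe, "pretrain with BC to obtain a reasonable initial policy, then explore and improve with RL to reach performance beyond the demonstrations." This is structurally identical to the path in LLMs from GPT-3 [29] (pretraining) to InstructGPT (RLHF post-training). It shows that the VLA field is rapidly retracing the maturation path of LLMs. We will cover this in more detail later in Section 6.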
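
For a feel of how a GRPO-style update scores rollouts without a learned value function, here is a minimal sketch of the group-relative advantage: each rollout's reward is normalized against the other rollouts sampled for the same task. The reward values and the function name are hypothetical, and the full policy-gradient update is omitted.

```python
import numpy as np


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward by the mean and std of its own group."""
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)


# Hypothetical success (1.0) / failure (0.0) rewards for four rollouts of one task
rollout_rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rollout_rewards)
```

Successful rollouts get positive advantages and failed ones negative, so the policy is pushed toward whatever the better half of its own attempts did, which is how it can move beyond the demonstrations.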

추론의 통합과 안전의 내재화

CoT-VLA [55](2025)는 LLM의 Chain-of-Thought 추론을 VLA에 도입한 건데요, 행동을 즉시 생성하는 대신 먼저 시각적 추론 과정 — 주요 영역 마스크라든가, 작업 분해 텍스트 같은 것 — 을 생성한 뒤에 이를 조건으로 행동을 생성합니다. VLA의 "반사적(reactive)" 한계를 극복하려는 시도로, Inner Monologue [22] (2022)에서 시작된 "생각하는 로봇" 계보의 최신 진화예요.

SafeVLA [75](2025)는 안전 제약을 VLA의 학습 과정에 내재화한 최초의 모델인데, 기존 VLA가 안전을 사후적 필터링으로 처리한 반면에 SafeVLA [75]는 안전 위반 예측을 학습 목표 자체에 포함시켜서 위험한 행동을 생성 단계에서 억제합니다. "VLA의 환각은 물리적 사고"라는 경고에 대한 첫 번째 체계적 대응이에요. 이건 나중에 Section 10에서 더 다루겠습니다.

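Humanoid-VLA (2025) extended the scope of VLAs from tabletop manipulation to whole-body humanoid control. Controlling the whole-body motion of a humanoid with dozens of degrees of freedom is a much harder problem than controlling a conventional robot arm (6-7 DoF), because the action space has several times as many dimensions.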

자율주행 VLA: 새로운 응용 전선

VLA의 적용이 로봇 조작에 국한되지 않는다는 걸 가장 극적으로 보여준 게 자율주행(AD) 분야입니다. EMMA [46](Waymo, 2024)는 Gemini VLM을 자율주행에 적용해서 센서 입력에서 주행 경로까지를 end-to-end로 생성하는 자율주행 VLA의 가능성을 제시했고요, ORION [47](2025)은 시각적 추론과 주행 행동 생성을 통합했고, AutoVLA [48](2025)는 자율주행에 특화된 VLA 아키텍처를, DriveMoE [49](2025)는 Mixture-of-Experts로 주행 시나리오별 전문 모듈을 활성화하는 효율적 구조를 제안했습니다.

자율주행 VLA는 로봇 조작 VLA와 기술적 DNA를 공유하지만 — VLM backbone, 행동 토큰화, end-to-end 학습 — 도메인 특성은 크게 달라요. 더 높은 속도, 더 엄격한 안전 요구, 더 다양한 환경 변이가 있거든요. 이 두 분야의 교차 수정(cross-pollination)이 VLA 기술의 성숙을 가속하고 있습니다. 이건 나중에 Section 8에서 더 자세히 다루겠습니다.

VLA 연구의 폭발적 양적 성장: ICLR 2026의 증거

VLA 분야의 성장 속도를 가장 극적으로 보여주는 게 ICLR 2026의 제출 통계입니다(Reuss, 2026). ICLR 2024에는 VLA 관련 제출이 단 1건이었고, 그것도 거부됐어요. ICLR 2025에는 9건이었는데, ICLR 2026에는 164건이 제출돼서 전년 대비 18배의 폭발적 성장을 기록했습니다. 이 숫자가 의미하는 건 VLA가 더 이상 틈새 연구 주제가 아니라 기계학습 커뮤니티의 주류 연구 방향으로 자리잡았다는 거예요. ICLR 2027에는 1,000건 이상의 제출이 예상된다는 분석도 있습니다.

이 164건의 논문에서 관찰되는 핵심 트렌드는 다음과 같습니다:

이산 디퓨전 VLA: 자기회귀의 느린 순차 생성을 병렬 디퓨전으로 대체하는 4개의 동시 논문 등장
Embodied Chain-of-Thought(ECoT [92]): 공간적으로 기반된(spatially-grounded) 추론을 행동과 통합
교차 행동 공간 학습(Cross-Action-Space): X-VLA [53], XR-1, HiMoE-VLA [54] 등 이종 embodiment 간 전이
자기 개선 RL: 잔차 RL로 LIBERO 99% 달성, 벤치마크 포화 가속
새로운 벤치마크: RoboArena (실-심 변환), RoboCasa365 (365 태스크/2000+ 주방 장면), WorldGym (월드 모델 기반 평가)

주목할 만한 발견으로 VLM4VLA(ICLR 2026) 연구가 있는데요, 표준 VLM 벤치마크 성능과 하류 VLA 성능 사이에 상관관계가 없다는 걸 밝혔습니다. 이거 면접에서 나올 수 있는 포인트인데요 — VLA의 VLM backbone 선택이 VLM의 일반적 벤치마크 순위가 아니라, 로봇 태스크에 특화된 기준으로 이루어져야 한다는 걸 시사하는 거예요.

플랫폼 스케일 오케스트레이션

2025-2026년의 또 다른 특징은 VLA가 개별 모델 수준을 넘어 플랫폼 수준으로 진화하고 있다는 점입니다. Gemini Robotics(Google DeepMind, 2025)는 Gemini 2.0을 로봇 제어의 중심축으로 삼아서, 다양한 로봇 플랫폼과 작업을 하나의 VLM 오케스트레이터가 조율하는 구조를 제시했고요. NVIDIA의 GR00T 생태계는 Cosmos(시뮬레이션) → GR00T N1 [21] (VLA 정책) → Jetson(에지 추론)으로 이어지는 풀스택 파이프라인을 구축하고 있습니다.

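What this platform strategy means is that VLA is no longer a purely academic topic but is turning into an industrial product. Physical Intelligence's π series (π0 [16] → π0-FAST [20] → π0.5 [31]) aspires to be "the Android of robots," and Figure AI mounting its Helix VLA on its own humanoid to become a full-stack robot company is part of the same trend.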

프런티어 모델과 오픈 웨이트 연구 모델의 격차

여기서 잠깐 정리하면, 2025-2026년 시점에서 가장 뚜렷한 분단선은 비공개 프런티어 모델과 오픈 웨이트 연구 모델 사이의 실세계 일반화 격차입니다. Gemini Robotics이라든가 π0.5 [31] 같은 비공개 모델은 제로샷 실세계 일반화를 시연하고 있는데, 오픈 소스 연구 VLA들은 시뮬레이션 벤치마크에서는 근접하면서도 실세계에서의 격차가 좁혀지지 않고 있어요.

이 격차의 원인으로는 세 가지가 지목됩니다(Reuss, 2026): (1) 학습 데이터의 품질과 다양성 차이, (2) 시뮬레이션 벤치마크의 천장 효과(ceiling effect)로 실제 진전이 가려지는 현상, (3) 연구 인프라 규모의 차이. 그런데 Reuss(2026)는 여기서 한 발 더 나아가서 핵심적인 결론을 제시합니다: 현재 학계가 LIBERO나 SimplerEnv 같은 포화된 벤치마크에서 숫자를 올리는 데 집중하는 건 실세계 배포와의 격차를 오히려 가리는 위험이 있다는 거예요. 진정한 돌파구는 (1) 데이터 큐레이션과 품질 관리, (2) in-context learning 능력의 강화, (3) 실세계 평가 프로토콜의 확립에서 올 것이라고 주장합니다. 특히 ICLR 2026에서 데이터 큐레이션과 in-context learning이 가장 과소대표된 연구 방향이었다는 점을 지적하면서, 이게 곧 가장 큰 기회라고 결론짓는데요 — 이건 면접에서도 굉장히 좋은 포인트가 될 수 있습니다.

핵심 전환: "개념 증명"에서 "배포 준비"로

자 여기서 전체 연대기를 큰 그림으로 정리해 봅시다. 2022-2023년의 VLA는 "이것이 가능하다"를 증명하는 단계였습니다. RT-2 [11]가 VLM의 지식이 로봇 행동으로 전이될 수 있음을 보여주고, Diffusion Policy [17]가 다중 모드 행동 생성의 가능성을 연 게 이 시기의 핵심이었어요. 2024년은 "얼마나 다양하게 가능한가"를 탐색하는 폭발의 시기였고요. 수십 개의 모델이 서로 다른 아키텍처, 학습 전략, 응용 도메인을 실험했습니다.

그리고 2025-2026년, VLA 분야는 "실세계에서 어떻게 작동하게 할 것인가"라는 질문으로 수렴하고 있어요. 이 전환을 추동하는 힘이 다층적입니다: 경량화 기술이 엣지 배포의 물리적 장벽을 낮추고, RL 후처리가 BC의 성능 천장을 깨뜨리고, 안전 제약의 내재화가 신뢰성 문제를 구조적으로 해결하고, 계층적 아키텍처가 장시간 복합 작업이라는 난제에 대응하고 있습니다. 이 네 가지 축의 동시적 발전이 VLA를 연구실의 데모에서 실세계의 제품으로 전환시키는 원동력이에요.

이 연대기에서 관찰되는 가장 중요한 패턴은 기술 간 교차 수정(cross-pollination)의 가속입니다. CLIP의 시각-언어 정렬이 CLIPort [26]를 낳고, GPT-3 [29]의 few-shot 학습이 SayCan [14]을 낳고, 이미지 확산 모델이 Diffusion Policy [17]를 낳았듯이, VLA의 모든 핵심 혁신은 인접 분야의 돌파구를 로보틱스에 적응시킨 결과예요. 이 패턴은 앞으로도 계속될 겁니다: LLM의 추론 기법(CoT, MCTS)은 이미 VLA에 이식되고 있고, 비디오 생성 모델의 발전은 월드 모델 기반 VLA를 가속할 거고, 멀티에이전트 LLM 시스템의 발전은 멀티로봇 VLA 협업으로 이어질 겁니다.

VLA의 역사는 아직 초장입니다. 그러나 이 초장의 밀도 — 불과 4년 만에 개념 증명에서 산업적 배포 준비까지 도달한 속도 — 는 이 분야가 앞으로 어떤 속도로 전개될지를 가늠하게 합니다.

Motivation Chain: VLA 핵심 모델의 동기 사슬

자 이제 핵심 모델들이 왜 등장했는지, 그 동기의 연쇄를 정리해 봅시다. 이건 각 모델의 "존재 이유"를 이해하는 데 아주 유용합니다.

Motivation Chain

RT-1의 한계(단일 로봇, VLM 없이 인터넷 지식 활용 불가, 학습 데이터 외 일반화 취약)

→ RT-2 [11] 등장(기존 VLM을 파인튜닝하여 인터넷 지식을 로봇 행동에 전이)

→ RT-2의 한계(55B 파라미터, 비공개, 실시간 제어 불가, 추론 330-1000ms)

→ OpenVLA [15] 등장(7B 오픈소스, 누구나 재현 가능, 16.5% 더 높은 성공률)

→ OpenVLA의 한계(자기회귀 디코딩의 다중모드 행동 표현 한계, 느린 추론 ~166ms)

→ π0 [16] 등장(Flow Matching으로 다중모드 행동 생성, ~73ms 추론, dexterous task 우위)

→ π0의 한계(단일 모델로 장시간 복합 작업 어려움)

→ π0.5 [31] 등장(고수준 VLM 계획 + 저수준 π0 실행의 계층 구조, 30분+ 작업 가능)

Motivation Chain

Behavior Cloning의 한계(시연 품질이 성능 천장, 시연 밖 행동 발견 불가)

→ RL 후처리 연구 등장(VLA-RL [68], RIPT-VLA [71], ConRFT [69] 등)

→ BC로 초기 정책 확보 후 RL로 시연을 초월하는 성능 달성

Motivation Chain

OXE [19] 데이터셋의 등장(단일 연구실 데이터 한계 → 22종 로봇 교차 데이터 통합)

→ Octo [25](교차 로봇 사전학습 → 타겟 파인튜닝 레시피 확립)

→ OpenVLA [15](OXE 기반 오픈소스 VLA 민주화)

이 동기 사슬을 보면 각 모델이 이전 모델의 한계를 명확히 인식하고 그걸 해결하려는 시도로 등장했다는 걸 알 수 있어요. 연구를 할 때도 "현재 모델의 한계가 뭔지"를 정확히 파악하는 게 다음 연구 방향을 잡는 핵심이라는 걸 보여주는 좋은 사례입니다.

헷갈리기 쉬운 모델 비교: 핵심 차별점

이거 면접에서 "RT-1이랑 RT-2 차이가 뭐예요?" 이런 식으로 물어보는 경우가 많거든요. 정리해 놓겠습니다:

비교 대상	핵심 차별점
RT-1 vs RT-2 [11]	RT-1은 독자 아키텍처(35M), RT-2는 기존 VLM(55B)을 파인튜닝 — 인터넷 지식 전이 유무가 핵심
RT-2 [11] vs OpenVLA [15]	동일한 "VLM→행동 토큰" 패러다임이지만, OpenVLA는 7B 오픈소스로 민주화에 초점
OpenVLA [15] vs π0 [16]	OpenVLA는 자기회귀 디코딩(이산 토큰), π0는 Flow Matching 디코딩(연속 행동) — 다중모드 행동 표현력의 차이
SayCan [14] vs RT-2 [11]	SayCan은 LLM이 계획만 하고 별도 정책이 실행(모듈형), RT-2는 하나의 모델이 계획+실행(end-to-end)
Gato [13] vs RT-2 [11]	Gato는 범용 에이전트(게임+로봇+텍스트), RT-2는 로봇 특화 VLA — Gato는 개념 증명, RT-2는 실용적 성능
Diffusion Policy [17] vs π0 [16]	Diffusion Policy는 독립적 디퓨전 정책, π0는 VLM+Flow Matching 결합 — π0는 언어 이해를 내장
GR00T N1 [21] vs π0.5 [31]	둘 다 계층적이지만, GR00T은 System 1(120Hz)+System 2(10Hz) 속도 분리, π0.5는 VLM 계획+π0 실행의 기능 분리
Octo [25] vs OpenVLA [15]	Octo는 93M 소형+디퓨전 헤드+교차 플랫폼 특화, OpenVLA는 7B VLM 기반+자기회귀+범용 지식 전이

직관적 한줄 설명

각 모델을 한 줄로 이해할 수 있는 비유를 정리했습니다. 누군가한테 설명해야 할 때 유용할 거예요:

RT-2 [11]: "구글 번역기가 한국어→영어를 하듯, VLM이 이미지→로봇 행동을 번역하게 만든 것"
OpenVLA [15]: "RT-2의 핵심 아이디어를 7배 작게 만들어 오픈소스로 풀어놓은 것"
π0 [16]: "VLM이 상황을 이해하고, 그 이해를 바탕으로 확산 모델이 부드럽고 정교한 동작을 그려내는 것"
SayCan [14]: "ChatGPT가 레시피를 알려주면, 요리사 로봇이 실제로 만드는 것 — 아는 것과 할 수 있는 것을 분리"
Diffusion Policy [17]: "Stable Diffusion이 노이즈에서 그림을 그리듯, 노이즈에서 로봇 동작을 그려내는 것"
Octo [25]: "다양한 로봇 데이터로 사전학습된 '범용 운전면허' — 새 로봇에 소량 적응만 하면 됨"
OXE [19]: "ImageNet이 CV를 바꿨듯, 22종 로봇의 100만+ 데이터를 통합한 로보틱스의 ImageNet"
FAST: "로봇 행동을 JPEG처럼 주파수 압축하여 토큰 수를 13배 줄인 것"
GR00T N1 [21]: "느리지만 깊이 생각하는 뇌(VLM, 10Hz)와 빠르게 반응하는 소뇌(디퓨전, 120Hz)의 분업"
CoT-VLA [55]: "행동 전에 '왜 이렇게 해야 하지?'라고 스스로 추론하는 로봇 — 반사에서 사고로의 진화"

Self-Check Questions: Section 1-2

Q1: VLA의 세 가지(+한 가지) 정의 중, SayCan은 어떤 정의에서 VLA로 분류되고 어떤 정의에서 제외되는가?

Q2: RT-2가 RT-1과 동일한 로봇 데이터로 학습했음에도 더 나은 일반화를 보인 이유는 무엇인가?

Q3: Diffusion Policy가 기존 MSE 기반 BC보다 우수한 핵심 이유를 "다중 모드성" 관점에서 설명하라.

Open Research Questions: Section 1-2

마지막으로 아직 열려 있는 연구 질문들을 정리합니다. 연구 주제를 찾고 계신 분들한테 특히 유용할 거예요:

Section 3: 통합 분류체계 --- 14개 서베이를 하나로

"코끼리를 만지는 장님들처럼, 각 서베이는 VLA라는 거대한 동물의 서로 다른 부위를 묘사하고 있었다. 이 장에서는 장님들의 손을 모아 코끼리의 전신을 재구성한다."

자, 이번 장은 좀 특이합니다. 보통 서베이 논문을 읽으면 "아, 이 분야는 이렇게 분류되는구나" 하고 넘어가잖아요. 그런데 VLA 분야는 상황이 좀 다르거든요. 2024년 말부터 2026년 초까지, 약 2년 사이에 14편 넘는 서베이가 쏟아졌습니다. 각 서베이가 자기만의 분류체계(taxonomy)를 들고 나왔는데, 문제는 같은 모델을 서로 다른 범주에 넣는다는 겁니다.

예를 들어볼게요. RT-2 [11]를 어떤 논문은 "monolithic VLA"로 분류하고, 어떤 논문은 "autoregressive action generation"으로 분류합니다. 둘 다 맞는 말인데, 처음 공부하는 입장에서는 "그래서 RT-2가 뭔데?" 하고 혼란스럽거든요.

이게 왜 이렇게 됐냐면, 각 서베이가 바라보는 각도가 다르기 때문입니다. 마치 코끼리를 만지는 장님이라는 비유가 딱 맞는 거죠. 누군가는 다리를 만지고 "기둥이다", 누군가는 코를 만지고 "호스다" 하는 것처럼요.

이 장의 목표는 이들 분류체계를 경쟁 관계가 아닌 상보적 관점으로 재해석하고, 모든 VLA 모델을 하나의 좌표계 안에 위치시킬 수 있는 메타 분류체계를 구축하는 것입니다. 좀 야심찬 목표인데, 한번 해보겠습니다.

3.1 아키텍처 관점 --- Liu/Shao (2025) 기반

자, 첫 번째 렌즈입니다. Liu와 Shao의 서베이는 VLA의 구조적 형태에 초점을 맞춥니다. 이들이 던지는 핵심 질문은 심플해요. "VLM과 행동 생성기가 어떤 관계로 연결되어 있는가?" 이것 하나입니다. 건축가가 건물을 볼 때 "이 건물의 구조가 어떻게 되어 있지?" 하고 보는 것과 같은 관점이죠.

3.1.1 단일체(Monolithic) 아키텍처

여기서 핵심은, 전체 시스템이 하나의 end-to-end 모델로 구성된다는 겁니다. 내부 모듈 간의 경계가 학습 과정에서 자연스럽게 형성되지, 사람이 명시적으로 "여기까지가 인지, 여기서부터 행동"이라고 나누지 않습니다.

단일 시스템(Single-system): 가장 심플한 구조입니다. 하나의 통합 순전파(forward pass)로 관측에서 행동까지를 생성합니다. 입력 이미지와 언어 지시가 들어가면, 단일 네트워크의 출력으로 로봇 행동이 직접 나오는 거죠. 비유하자면, 통역 없이 외국어를 바로 알아듣고 대답하는 사람 같은 겁니다.

대표적으로 RT-2 [11]는 PaLI-X VLM의 출력 토큰을 곧바로 행동 토큰으로 해석합니다. OpenVLA [15]는 Prismatic VLM의 언어 모델 헤드를 행동 예측에 재활용하고요. NORA 역시 단일 트랜스포머 내에서 시각-언어-행동을 통합 처리합니다.

이중 시스템(Dual-system): 이게 재밌는 구조인데요. VLM 백본(System 2)과 행동 전문가(System 1)라는 두 개의 구분된 모듈이 존재하되, 하나의 모델 내에서 협력합니다. 이 구분은 Daniel Kahneman의 이중 처리 이론에서 영감을 받았거든요. 나중에 4.4절에서 자세히 다루겠지만, 간단히 말하면 VLM이 상황을 이해하고 추론하는 느린 사고(System 2)를 담당하고, 행동 전문가가 빠른 반사적 행동 생성(System 1)을 담당합니다. CEO와 현장 작업자의 분업이라고 생각하시면 됩니다.

이중 시스템은 다시 두 가지 정보 흐름 방식으로 나뉘는데, 이게 왜 중요하냐면, 정보가 어떻게 흐르느냐에 따라 모듈 교체 가능성이나 학습 효율이 완전히 달라지기 때문입니다.

캐스케이드 기반(Cascade-based): VLM이 먼저 실행되어 특징 표현(feature representation)을 생성하고, 이것이 행동 전문가에게 순차적으로 전달됩니다. 물이 위에서 아래로 흐르듯, 정보가 한 방향으로만 흐르는 구조죠. CogACT [23]에서는 VLM이 시각-언어 특징을 추출한 후, 별도의 디퓨전 기반 행동 생성기가 이를 받아 행동 시퀀스를 출력합니다. GR00T N1 [21]에서는 Eagle-2 VLM이 상황 임베딩을 생성하고, 이것이 DiT(Diffusion Transformer) 행동 헤드로 전달됩니다. Fast-in-Slow 역시 느린 VLM 처리 후 빠른 행동 생성이라는 순차 구조를 따릅니다.

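Parallel-based: This one is a bit different. VLM tokens and action tokens are processed simultaneously through a shared attention mechanism. In π0 [16], the PaliGemma VLM tokens and the action expert's flow-matching tokens attend to each other inside shared transformer blocks. π0.5 [31] extends this so that high-level planning and low-level actions are handled in the same attention space. GraspVLA [81] adopts a structure in which grasp-specific tokens are processed in parallel with the VLM tokens.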

캐스케이드와 병렬의 차이를 비유하면, 캐스케이드는 릴레이 경주(바톤을 넘기는 구조)이고, 병렬은 합주(동시에 연주하면서 서로의 소리를 듣는 구조)입니다.

3.1.2 계층적(Hierarchical) 아키텍처

자, 여기서 또 다른 갈래가 나옵니다. 계획(planning)과 실행(execution)이 명시적으로 분리된 구조입니다. 비유하면 요리 레시피(고수준 계획)와 칼질 기술(저수준 기술)의 분리인데, 레시피를 바꿔도 칼질은 재학습할 필요가 없다는 게 핵심이거든요.

상위 수준에서 "무엇을 할 것인가"를 결정하고, 하위 수준에서 "어떻게 움직일 것인가"를 결정합니다.

Planner-Only: VLM이 계획만 생성하고, 행동 실행은 별도의 저수준 제어기(MPC, PID 등)에 위임합니다. SayCan [14]이 초기 대표 사례이고, COME-Robot, Inner Monologue [22] 등이 이 계보를 잇습니다.
Planner + Policy: VLM 플래너가 중간 표현을 생성하고, 학습된 정책이 이를 저수준 행동으로 변환합니다. VoxPoser [57], RT-H, RoboPoint 등이 이 범주에 해당합니다.

여기서 중요한 건, 중간 표현(intermediate representation)의 형태에 따라 또 세분화된다는 겁니다:

Keypoint (K): goal specification through keypoint coordinates (RoboPoint, ReKep)
Subtask(S): 하위 과제 언어 기술 (SayCan [14], ProgPrompt)
Program(P): 실행 가능한 코드/프로그램 생성 (Code-as-Policies, VoxPoser [57])

이 외에도 Liu & Shao [5]는 어포던스(Affordance, A)를 별도의 보조 표현 유형으로 식별합니다. A3VLM [58], CoA-VLA 등이 어포던스 맵을 다른 표현(K, S, P)과 결합하여 파지 가능 영역을 명시적으로 지정합니다.

3.2 액션 생성 관점 --- Zhong et al. (2025) 기반

자 이제 두 번째 렌즈로 넘어갑니다. Zhong 등의 서베이는 분류의 렌즈를 아키텍처의 전체 형태가 아닌 행동이 어떻게 생성되는가에 맞춥니다.

이게 왜 중요하냐면, 동일한 VLM 백본을 사용하더라도 행동 생성 방식이 다르면 완전히 다른 성능 특성을 보이기 때문입니다. 같은 엔진을 달아도 변속기가 다르면 차의 성격이 달라지는 것과 비슷하다고 보시면 됩니다.

자기회귀(Autoregressive) 방식

가장 직관적인 방식부터 보겠습니다. 연속 행동을 이산 토큰으로 변환한 후, 언어 모델과 동일한 next-token prediction으로 순차 생성하는 겁니다. 한 글자씩 타이핑하듯 행동을 순서대로 하나씩 생성한다고 보시면 됩니다.

RT-2 [11]가 256-bin 양자화로 이 방식을 개척했고, OpenVLA [15], Octo [25] (AR 및 디퓨전 양 모드 지원), RT-2-X 등이 따랐습니다. 그리고 FAST 토큰화 [20]는 DCT + BPE를 적용하여 자기회귀의 프레임워크를 유지하면서도 정보 손실을 줄이고 사전학습 속도를 5배 높였는데, 이건 나중에 행동 모듈에서 자세히 다루겠습니다.

장점: LLM 인프라(KV 캐시, 양자화, speculative decoding 등)를 그대로 재활용할 수 있습니다. 구현이 단순합니다. 이게 실무에서 엄청난 장점이거든요.
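Cons: Quantization error accumulates, and multimodal action distributions are hard to capture in practice. A softmax over bins can in principle put probability on both "rotate left" and "rotate right," but each dimension is discretized and decoded one token at a time, usually greedily, so the joint structure across dimensions and timesteps is easily lost. The outright collapse to the average, "do not rotate," is the failure mode of regression-style (MSE) heads described earlier for Diffusion Policy [17]. Because generation is sequential at the token level, it is also slow for high-frequency control.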

디퓨전(Diffusion) 방식

디퓨전은 발상이 완전히 다릅니다. DDPM, DDIM, Flow Matching, VAE 등 확률적 생성 모델을 사용하여 행동 분포에서 샘플링합니다. 비유하자면, 대리석 조각처럼 전체 형태에서 불필요한 부분(노이즈)을 깎아내어 행동을 드러내는 방식입니다.

Diffusion Policy [17] first demonstrated the potential of DDPM-based action generation, CogACT [23] cut the number of denoising steps with DDIM, and π0 [16] achieved faster and more stable ODE-based generation with Flow Matching.

장점: 다중모달 분포를 자연스럽게 표현합니다. "좌로 돌리기"와 "우로 돌리기" 두 모드를 동시에 표현하고, 실행 시 하나를 샘플링하는 거죠. 부드러운(smooth) 궤적을 생성하고, 연속 공간에서 직접 동작하므로 양자화 오류가 없습니다.
단점: 다중 디노이징 스텝이 필요하여 추론이 느립니다. DDPM은 50-100 스텝, DDIM은 10-20 스텝, Flow Matching은 5-10 스텝이 필요합니다. 학습 안정성 확보에 기술적 노력도 필요합니다.

이산 디퓨전(Discrete Diffusion) 방식

자, 여기서 최근에 매우 흥미로운 전개가 있었습니다. 이산 디퓨전(Discrete Diffusion) VLA인데, ICLR 2026에서 4개의 독립 연구가 동시에 제안한 새로운 패러다임입니다. 4개 팀이 동시에 같은 아이디어를 낸다는 건, 그 아이디어가 자연스러운 다음 스텝이었다는 방증이기도 하죠.

이게 뭐냐면, 기존 디퓨전이 연속 공간에서 작동하는 것과 달리, 이산 디퓨전은 토큰화된 행동 시퀀스에 직접 적용됩니다. 자기회귀의 순차 생성 없이도 이산 토큰을 병렬로 생성할 수 있거든요. dVLA [65], DIVA, UNIFIED DIFFUSION VLA 등이 이 범주에 속하며, LIBERO에서 95-98% 성공률을 보고했습니다. 자기회귀의 해석 가능성과 디퓨전의 다중 모드성을 결합하려는 시도로, 2026년 가장 활발한 연구 방향 중 하나입니다.

강화학습(RL) 기반 방식

This family optimizes the policy directly from a reward signal. Pure RL-based VLAs are rare because, as you know, pure RL on robots suffers from severe sample-efficiency problems. However, the pattern of fine-tuning a BC-pretrained VLA with RL is rising fast. Representative examples include GRPO (Group Relative Policy Optimization), RLVF (Reinforcement Learning from Visual Feedback), and π*0.6 [157].

하이브리드(Hybrid) 방식

여기서 핵심은, 자기회귀와 디퓨전의 장점을 굳이 하나만 택할 이유가 없다는 발상입니다. HybridVLA [79]는 고수준 의미 토큰은 AR로 생성하고, 저수준 연속 행동은 디퓨전으로 생성하는 이중 디코딩 구조를 제안했습니다. UniVLA [80]는 월드 모델의 잠재 표현과 행동 생성을 단일 프레임워크에서 결합합니다.

특수 도메인(Specialized) 방식

교차 행동 공간 학습에 대해서도 짚고 넘어가야 하는데요. 서로 다른 형태의 로봇(팔, 휴머노이드, 이동 로봇 등) 사이의 행동 공간 차이를 극복하는 연구도 활발해지고 있습니다. X-VLA [53]는 소프트 프롬프팅 토큰으로 이종 embodiment를 조건화하고, XR-1은 Unified Vision-Motion Codes(UVMC)를 도입하며, HiMoE-VLA [54]는 계층적 Mixture-of-Experts로 행동 공간별 전문가를 할당합니다.

자율주행 VLA 분류(Hu et al., 2025)도 언급하겠습니다. 자율주행 도메인에 특화된 VLA 분류가 따로 발전하고 있거든요. Hu et al.은 AD-VLA를 두 가지 패러다임으로 구분합니다: (1) End-to-End VLA --- 인지, 추론, 계획을 단일 모델에 통합(textual action vs numerical action 하위 구분), (2) Dual-System VLA --- 느린 숙고(VLM)와 빠른 안전 실행(플래너)을 분리(explicit guidance vs implicit representation transfer 하위 구분). 이 분류는 Liu & Shao [5]의 단일체/계층적 분류와 구조적으로 유사하지만, 안전-실시간성 trade-off를 핵심 축으로 놓는다는 점에서 차별화됩니다.

3.3 해부학 관점 --- Xu et al. (2025) 기반

세 번째 렌즈입니다. Xu 등의 서베이는 VLA를 생물학적 유비(biological analogy)로 해부합니다. 이 관점이 왜 좋으냐면, 직관적이기 때문입니다. 세 개의 기관으로 구성된 유기체로 보는 거죠.

지각(Perception) --- 로봇의 감각 기관: 시각 인코더(SigLIP, DINOv2 [30], CLIP [27]), 고유감각 인코더(proprioception MLP), 촉각/깊이/힘 센서 등이 외부 세계의 정보를 내부 표현으로 변환합니다.
두뇌(Brain) --- 중추 신경계: VLM 백본이 지각 정보와 언어 지시를 통합하여 "이해"와 "계획"을 수행합니다. 순수 트랜스포머에서 VLM으로, 다시 CoT 추론이 가능한 VLM으로 진화해왔습니다.
행동(Action) --- 운동 신경계: 두뇌의 의도를 물리적 움직임으로 변환합니다. 이산 토큰 헤드, 디퓨전 헤드, 플로우 매칭 헤드 등이 여기에 해당합니다.

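The strength of this view is its intuitiveness. Any VLA model can be described with three questions: "What eyes does it have?", "What brain does it have?", and "What hands does it have?" For example, π0 [16] is a model with "SigLIP eyes (the vision encoder inside PaliGemma), a PaliGemma brain, and Flow Matching hands." Explained this way, even someone hearing it for the first time gets the idea right away.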

그런데 Xu et al. [8]의 논문이 기여한 것은 모듈 해부에 그치지 않습니다. 이들의 핵심 기여는 VLA가 직면한 5대 도전 과제(Five Challenges) 체계인데요: (1) 표현(Representation) --- 시각/언어/행동을 어떻게 통합 표현할 것인가, (2) 실행(Execution) --- 안정적이고 정밀한 행동 생성을 어떻게 달성할 것인가, (3) 일반화(Generalization) --- 새로운 환경/물체/작업으로의 전이를 어떻게 보장할 것인가, (4) 안전(Safety) --- 물리적 세계에서의 위험한 행동을 어떻게 방지할 것인가, (5) 데이터셋 및 평가(Dataset & Evaluation) --- 공정하고 재현 가능한 벤치마킹을 어떻게 설계할 것인가. 이 5대 과제는 이후 Section 7-10에서 다루는 일반화, 효율화, 안전, 벤치마크 논의의 체계적 기반이 됩니다.

3.4 기능 관점 --- Kawaharazuka et al. (2025) 기반

네 번째 렌즈인데, 이건 시각이 꽤 다릅니다. Kawaharazuka 등은 일본 로보틱스 커뮤니티의 실용적 전통을 반영하여, VLA를 구성 요소가 아닌 수행하는 기능으로 분류합니다. "이 모델이 뭘 할 수 있느냐?"라는 사용자의 관점이죠.

저수준 지각(Low-level Perception): 물체 검출, 깊이 추정, 포즈 추정 등 원시 감각 처리
고수준 지각(High-level Perception): 장면 이해, 관계 추론, 어포던스(affordance) 인식 등 의미론적 처리
고수준 계획(High-level Planning): 과제 분해, 하위 목표 설정, 추상적 행동 시퀀스 생성
저수준 계획(Low-level Planning): 구체적 궤적 생성, 모터 명령 계획, 충돌 회피
데이터 증강(Data Augmentation): 시뮬레이션 데이터 생성, 비디오 예측을 통한 데이터 증식, 언어 재라벨링 등

이 관점의 독특한 가치는 하나의 모델이 여러 기능을 동시에 수행할 수 있다는 점을 자연스럽게 포착한다는 것입니다. 앞의 세 분류가 모델을 하나의 상자에 넣는다면, 이 분류는 모델에 여러 개의 태그를 붙일 수 있는 거죠. RT-2 [11]는 고수준 지각 + 저수준 계획을 하나의 모델에서 처리하고, SayCan [14]은 고수준 계획에 특화되어 저수준 실행은 별도의 정책에 위임합니다.

3.5 후처리 관점 --- Jin et al. (2025) 기반

다섯 번째 렌즈입니다. Jin 등의 서베이는 사전학습된 VLM을 로봇 행동 생성에 적응(adaptation)시키는 과정에 초점을 맞춥니다. 이들의 질문은 "VLM이 이미 가진 지식을 로봇에게 어떻게 전달할 것인가?"인데, 이건 연구자의 관점에서 매우 실용적인 질문이거든요.

환경 지각 강화(Environment Perception Enhancement): VLM의 시각 이해력을 로봇 환경에 맞게 강화합니다. 깊이 정보 통합, 다중 뷰 처리, 시간적 맥락 추가 등이 포함됩니다. SpatialVLA [39]의 깊이 통합, HPT [96]의 다중 카메라 처리가 대표적입니다.
체현 인식 개선(Embodiment Awareness Improvement): VLM에 로봇 자체의 물리적 특성(관절 구조, 행동 공간, 역학)을 주입합니다. 고유감각 토큰화, 교차 체현 학습, 로봇별 어댑터가 여기에 해당합니다.
과제 이해 심화(Task Understanding Deepening): 언어 지시의 이해를 단순 의미 매칭에서 추론적 이해로 끌어올립니다. CoT 추론(ECoT [92], CoT-VLA [55]), 하위목표 분해, 시각적 추론 등이 포함됩니다.
다중 요소 통합(Multi-component Integration): 위의 세 차원을 하나의 프레임워크로 통합하는 방법론입니다. 멀티태스크 학습, 모듈형 아키텍처, 점진적 학습 등의 전략이 사용됩니다.

3.6 메타 분류 --- 분류체계들의 분류

자, 여기서 이 장의 핵심 통찰에 도달합니다. 여기가 가장 중요한 부분이에요.

위의 다섯 가지 분류체계는 서로 경쟁하는 것이 아닙니다. 이들은 동일한 풍경을 서로 다른 고도에서 촬영한 항공사진입니다. 건축가가 건물을 구조도, 배관도, 전기도, 조감도로 각각 그리듯, 각 서베이는 VLA라는 복잡한 시스템의 서로 다른 단면을 포착하고 있는 겁니다.

분류 차원의 대응 관계

이걸 표로 보면 훨씬 명확해집니다. 같은 모델이 다섯 개 렌즈에서 어떻게 기술되는지 보세요:

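Model	Liu/Shao [5] (architecture)	Zhong (action generation)	Xu (anatomy)	Kawaharazuka (function)	Jin (post-training)
RT-2 [11]	Single-system Monolithic	Autoregressive	PaLI-X Brain + Discrete Head	High-level perception + low-level planning	Task understanding deepening
π0 [16]	Parallel Dual-system	Flow Matching (a diffusion variant)	PaliGemma Brain + Flow Hand	High-level perception + low-level planning	Multi-component integration
OpenVLA [15]	Single-system Monolithic	Autoregressive	Prismatic Brain + Discrete Head	High-level perception + low-level planning	Environment perception enhancement
GR00T N1 [21]	Cascade Dual-system	Diffusion (DiT)	Eagle-2 Brain + DiT Hand	Low-level perception + low-level planning	Embodiment awareness improvement
CogACT [23]	Cascade Dual-system	Diffusion (DDIM)	CogVLM Brain + Diffusion Hand	High-level perception + low-level planning	Environment perception enhancement
SayCan [14]	Hierarchical Planner-Only	N/A (no action generation)	LLM Brain Only	High-level planning specialist	Task understanding deepening
HybridVLA [79]	Dual-system	Hybrid (AR+Diffusion)	VLM Brain + Hybrid Hand	Low-level planning specialist	Multi-component integration
CoT-VLA [55]	Single-system Monolithic	Autoregressive	VLM Brain + CoT + Discrete Head	High-level planning + low-level planning	Task understanding deepening
SpatialVLA [39]	Single-system Monolithic	Autoregressive	Depth-enhanced Eye + VLM Brain	Low-level perception + low-level planning	Environment perception enhancement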

상보적 분류 축

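Look at the pattern that emerges from this table. Each taxonomy describes an axis that is complementary to the others and, to a first approximation, orthogonal to them. Orthogonal here means that knowing a model's position on one axis does not let you predict its position on another, just as knowing the x-coordinate of a point in 3D space does not tell you its y-coordinate.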

구조 축(Liu/Shao [5]): "모듈들이 어떤 위상(topology)으로 연결되어 있는가?" --- 단일체 vs 이중체, 캐스케이드 vs 병렬, 계층적

생성 축(Zhong): "행동이 어떤 수학적 메커니즘으로 만들어지는가?" --- AR, 디퓨전, RL, 하이브리드

해부 축(Xu): "각 구성 요소가 무엇인가?" --- 어떤 인코더, 어떤 VLM, 어떤 행동 헤드

기능 축(Kawaharazuka): "시스템이 어떤 인지 기능을 수행하는가?" --- 지각, 계획, 실행, 증강

적응 축(Jin): "사전학습 지식을 어떤 차원에서 보강했는가?" --- 지각, 체현, 과제, 통합

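Every VLA model can therefore be expressed as a point in this five-dimensional coordinate space. For example, the coordinates of π0 [16] are as follows:

π0 [16] = (Parallel Dual-system, Flow Matching, PaliGemma (SigLIP + Gemma) + FlowHead, high-level perception + low-level planning, multi-component integration)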

이 메타 분류의 실용적 가치는 세 가지입니다. 첫째, 새로운 VLA 모델이 등장했을 때 다섯 축 위에 즉시 위치시킬 수 있습니다. 둘째, 아직 탐색되지 않은 조합(예: "계층적 구조 + Flow Matching + 촉각 강화")을 체계적으로 식별할 수 있습니다. 셋째, 서로 다른 서베이의 결론을 충돌 없이 통합하여 해석할 수 있습니다. 연구 아이디어를 찾을 때 이 5차원 공간에서 빈 영역을 찾아보는 것도 좋은 전략이겠죠.
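
The "find the empty regions" idea can be sketched in a few lines. This is a minimal illustration that uses only two of the five axes (structure and generation) and a handful of rows from the comparison table; the class and function names are hypothetical, and a "gap" here means only that no model in this tiny list occupies the cell, not that the literature has never explored it.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class VLACoordinate:
    name: str
    structure: str    # Liu/Shao axis
    generation: str   # Zhong axis


# A few rows taken from the comparison table, with coarsened labels
MODELS = [
    VLACoordinate("RT-2", "single-system", "autoregressive"),
    VLACoordinate("OpenVLA", "single-system", "autoregressive"),
    VLACoordinate("pi0", "parallel dual-system", "flow matching"),
    VLACoordinate("GR00T N1", "cascade dual-system", "diffusion"),
    VLACoordinate("CogACT", "cascade dual-system", "diffusion"),
]


def unexplored(models):
    """Return (structure, generation) pairs that no listed model occupies."""
    structures = sorted({m.structure for m in models})
    generations = sorted({m.generation for m in models})
    taken = {(m.structure, m.generation) for m in models}
    return [cell for cell in product(structures, generations) if cell not in taken]


gaps = unexplored(MODELS)
for structure, generation in gaps:
    print(f"open cell: {structure} + {generation}")
```

Extending the dataclass with the remaining three axes turns the same loop into a systematic scan of the full five-dimensional space.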

분류체계 자체의 메타 패턴

한 발 더 물러서서 분류체계들을 관찰하면, 흥미로운 메타 패턴이 보입니다. 이것이 왜 중요하냐면, 이 패턴이 VLA 분야의 성숙도를 보여주기 때문입니다:

구조 중심 분류(Liu/Shao [5], Xu)는 "이 모델을 어떻게 구축하는가?"에 답합니다 --- 엔지니어의 관점
프로세스 중심 분류(Zhong, Jin)는 "이 모델을 어떻게 학습시키는가?"에 답합니다 --- 연구자의 관점
기능 중심 분류(Kawaharazuka)는 "이 모델이 무엇을 할 수 있는가?"에 답합니다 --- 사용자의 관점

이 세 관점의 수렴이 바로 VLA 연구의 성숙을 나타내는 지표입니다. 분야가 성숙할수록, 구축 방법, 학습 방법, 활용 방법이 독립적으로 발전하면서도 상호 정합적인 체계를 이루게 됩니다. 마치 소프트웨어 공학에서 아키텍처 패턴, 개발 방법론, 사용자 요구사항이 각각 독립적으로 발전하면서도 하나의 시스템으로 수렴하는 것과 같은 이치입니다.

Section 4: 아키텍처 심층 해부

"VLA는 세 개의 모듈로 이루어진 하나의 유기체다. 눈이 세계를 보고, 뇌가 이해하며, 손이 행동한다. 이 장에서는 각 기관을 해부대 위에 올려놓는다."

자, 이제 Section 3에서 분류 체계를 잡았으니, 본격적으로 VLA의 내부를 해부해 보겠습니다. 앞에서 Xu et al.의 "눈-뇌-손" 비유를 소개했는데, 이번 장에서는 그 각 기관을 정밀하게 들여다봅니다.

4.1 지각 모듈 --- 로봇의 눈

VLA의 첫 번째 모듈은 원시 감각 데이터를 의미 있는 내부 표현으로 변환하는 지각(perception) 모듈입니다. 인간의 시각 피질이 망막의 광자를 "빨간 컵", "책상 위", "기울어진"이라는 개념으로 변환하듯, 지각 모듈은 픽셀 배열을 로봇이 이해할 수 있는 토큰 시퀀스로 변환합니다.

이 모듈의 선택이 왜 중요하냐면, "눈이 무엇을 보느냐"가 뇌와 손의 성능 상한을 결정하기 때문입니다. 흐릿한 눈으로는 아무리 뛰어난 뇌도 제대로 판단할 수 없잖아요.

4.1.1 언어 지도 인코더(Language-supervised Encoders)

CLIP [27] (Contrastive Language-Image Pretraining)과 SigLIP(Sigmoid Loss for Language-Image Pretraining)은 이미지-텍스트 쌍으로 대조 학습된 인코더입니다. 수억 장의 이미지-캡션 쌍에서 학습했기 때문에, 이들의 시각 표현은 본질적으로 의미론적(semantic)입니다. 쉽게 말하면, "빨간 컵"과 "파란 컵"은 가깝지만, "컵"과 "접시"는 적당히 떨어져 있는 표현 공간을 학습한 거죠. 이러한 특성은 자연어 지시로 조건화되는 VLA에 자연스럽게 적합합니다.

SigLIP은 CLIP의 softmax 대조 손실을 sigmoid 손실로 대체하여, 배치 크기에 대한 의존성을 줄이고 학습 효율을 높였습니다. 2024-2025년 기준으로 SigLIP이 CLIP을 대체하는 추세가 뚜렷합니다.

장점: 풍부한 의미론적 정렬, 언어 조건화에 최적, 대규모 사전학습의 혜택
한계: 기하학적 정밀도가 부족합니다. "컵의 손잡이가 어느 방향을 향하는가?"와 같은 세밀한 공간 정보를 놓치는 경향이 있습니다.

4.1.2 자기 지도 인코더(Self-supervised Encoders)

DINOv2 [30]는 마스크된 이미지 모델링과 자기 증류(self-distillation)로 학습된 ViT [28] 인코더입니다. 여기서 핵심은, 언어 감독 없이 이미지 자체의 구조에서 학습했다는 점입니다. 그래서 이 인코더의 표현은 기하학적(geometric)입니다. 물체의 경계, 표면 법선, 공간적 배치가 정밀하게 인코딩됩니다.

비유하자면, CLIP/SigLIP은 "무엇이 있는지" 잘 보는 눈이고, DINOv2 [30]는 "그것이 정확히 어디에 어떤 모양으로 있는지" 잘 보는 눈입니다.

장점: 기하학적 정밀도가 높습니다. 접촉이 풍부한 조작(pick-and-place, 삽입 등)에서 의미론적 인코더보다 우수합니다. 텍스처나 소재의 미세한 차이를 포착합니다.
한계: 언어와의 정렬이 없으므로, "빨간 컵을 집어라"와 같은 언어 조건화에는 추가적인 다리(bridge)가 필요합니다.

4.1.3 하이브리드 SigLIP + DINOv2 --- 현재의 지배적 표준

Now, this is the most important part in practice. Since late 2024, hybrid encoding that uses SigLIP and DINOv2 [30] together has become the de facto standard. The Prismatic VLM behind OpenVLA [15], OpenVLA-OFT, GraspVLA [81], UniVLA [80], and others adopt this combination.

왜 이 조합이 지배적인가? 답은 상보성(complementarity)에 있습니다. SigLIP은 "무엇(what)"을, DINOv2 [30]는 "어디에, 어떻게(where, how)"를 인코딩합니다. "빨간 컵을 집어라"라는 지시를 처리할 때, SigLIP은 장면에서 "빨간 컵"이라는 의미 개체를 식별하는 데 기여하고, DINOv2 [30]는 그 컵의 손잡이 방향과 정확한 위치를 파악하는 데 기여합니다. 두 인코더의 출력은 일반적으로 토큰 수준에서 연결(concatenate)되거나 프로젝션 레이어를 통해 통합됩니다.
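
One common way to read "concatenate, then project" is a per-patch, channel-wise concatenation followed by a linear layer. The sketch below shows only the tensor shapes involved: the random arrays are stand-ins for real SigLIP and DINOv2 outputs, and all dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
NUM_PATCHES, D_SEM, D_GEO, D_MODEL = 16, 8, 6, 12  # hypothetical sizes

# Stand-ins for per-patch features from the two encoders (same patch grid assumed)
semantic_tokens = rng.normal(size=(NUM_PATCHES, D_SEM))    # "what" features
geometric_tokens = rng.normal(size=(NUM_PATCHES, D_GEO))   # "where/how" features

# Channel-wise concatenation keeps one token per patch, with both views side by side
fused = np.concatenate([semantic_tokens, geometric_tokens], axis=-1)

# A linear projection maps the fused features into the language model's embedding size
W = rng.normal(size=(D_SEM + D_GEO, D_MODEL)) * 0.1
b = np.zeros(D_MODEL)
visual_tokens = fused @ W + b
```

Because the two feature sets stay in separate channels until the projection, the downstream model can learn how much weight to give the semantic view versus the geometric view for each task.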

이게 인간 시각 시스템의 "what 경로"(복측 경로)와 "where 경로"(배측 경로)의 분리와 놀라울 정도로 유사하다는 점도 흥미롭습니다.

실증적으로도 이 조합의 우위가 반복 확인됩니다. Prismatic VLM 논문에서 SigLIP 단독, DINOv2 [30] 단독, 그리고 SigLIP+DINOv2 [30] 조합을 비교한 결과, 하이브리드 조합이 모든 벤치마크에서 일관되게 우수했습니다.

4.1.4 전체 VLM을 인코더로 사용

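Some models use an entire pretrained VLM as the encoder instead of a separate vision encoder. RT-H uses PaLI-X, π0 [16] uses PaliGemma, and VTLA uses Qwen-VL. The advantage of this approach is that the VLM already performs vision-language integration internally, so no separate fusion module is needed. The drawback is the high computational cost, which is mitigated with efficient fine-tuning techniques such as LoRA and QLoRA.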

4.1.5 CNN의 잔존

ViT [28] 기반 인코더가 주류를 이루지만, CNN(ResNet, EfficientNet)은 아직 사라지지 않았습니다. RT-1은 EfficientNet-B3를 사용했고, 일부 경량화 모델(LiteVLA [63] 등)에서는 계산 효율성을 위해 여전히 CNN을 선택합니다. 실시간 제약이 극도로 엄격한 환경(산업용 로봇의 1kHz 제어 루프)이나 에지 디바이스에서 CNN의 결정론적 추론 속도와 작은 메모리 풋프린트는 여전히 유효한 장점입니다.

이건 마치 전기차 시대에도 경주용 엔진이나 항공기 엔진에서 내연기관이 남아있는 것과 비슷합니다. 주류에서 밀렸다고 쓸모가 없어진 것은 아닌 거죠.

4.1.6 다중 모달 지각

최첨단 VLA는 RGB 카메라를 넘어 다양한 감각 양식을 통합합니다. 인간이 눈만으로 세상을 파악하는 게 아니듯, 로봇도 여러 감각이 필요하거든요:

깊이(Depth): SpatialVLA [39]는 깊이 정보를 별도 채널로 인코딩하여 3D 공간 이해를 강화합니다. 단안(monocular) 깊이 추정 네트워크(MiDaS, Depth Anything)의 출력을 추가 입력으로 사용하는 방식도 널리 쓰입니다.
촉각(Tactile): ForceVLA [78]는 6축 힘/토크 센서 데이터를, TactileVLA는 GelSight 촉각 이미지를 시각 토큰과 함께 처리합니다. 접촉 감각은 "물체를 너무 세게 쥐지 않으면서 미끄러지지 않게" 하는 섬세한 조작에 필수적입니다.
힘(Force): OmniVTLA는 시각-촉각-언어를 하나의 프레임워크에서 통합하며, 힘 프로파일을 시간 시퀀스로 인코딩합니다.
Audio: AudioCLIP-based auditory encoders are being used on an exploratory basis. They could be applied to audio-conditioned behaviors such as "stop when you hear a click."

4.1.7 고유감각(Proprioception) 처리

이건 의외로 간과되기 쉬운 부분인데요. 로봇의 관절 각도, 속도, 엔드이펙터 위치 등의 고유감각 정보는 대부분 MLP(Multi-Layer Perceptron)를 통해 고정 차원의 벡터로 변환됩니다. 이 벡터를 시각-언어 표현과 통합하는 방식은 크게 두 가지입니다:

연결(Concatenation): 고유감각 벡터를 시각/언어 토큰에 단순히 이어붙입니다. 구현이 간단하고 대부분의 모델이 채택하는 기본 방식입니다.
FiLM Conditioning: Feature-wise Linear Modulation으로, 고유감각 정보가 시각 특징의 스케일과 바이어스를 조절합니다. 고유감각이 시각 처리 자체에 영향을 미치므로, 정보 통합이 더 밀접합니다. Octo [25], HPT [96] 등이 이 방식을 사용합니다.

FiLM이 왜 더 밀접한 통합이냐면, 단순 연결은 "여기 추가 정보 있어" 정도인 반면, FiLM은 "이 정보에 따라 시각 정보를 다르게 해석해"라고 하는 것이기 때문입니다. 팔이 이미 높이 뻗어있다면(고유감각) 위쪽 물체에 대한 시각 정보를 더 주의깊게 봐야 한다는 식의 상호작용이 가능해지는 거죠.
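
The difference from plain concatenation is easiest to see in code. Here is a minimal FiLM sketch, assuming a single linear layer produces the per-channel scale and shift from the proprioceptive vector; the shapes and weights are hypothetical, and real models would typically use a small MLP.

```python
import numpy as np


def film(visual_features, proprio, W_gamma, b_gamma, W_beta, b_beta):
    """Feature-wise Linear Modulation: proprioception scales and shifts each channel."""
    gamma = proprio @ W_gamma + b_gamma     # per-channel scale, shape (CHANNELS,)
    beta = proprio @ W_beta + b_beta        # per-channel shift, shape (CHANNELS,)
    return gamma * visual_features + beta   # broadcast over all visual tokens


rng = np.random.default_rng(0)
NUM_TOKENS, CHANNELS, PROPRIO_DIM = 4, 8, 7
visual = rng.normal(size=(NUM_TOKENS, CHANNELS))
proprio = rng.normal(size=(PROPRIO_DIM,))

# With zero weights, a scale bias of 1 and a shift bias of 0, FiLM is the identity
zeros = np.zeros((PROPRIO_DIM, CHANNELS))
identity_out = film(visual, proprio, zeros, np.ones(CHANNELS), zeros, np.zeros(CHANNELS))

# With non-zero weights, the same visual features are reinterpreted per robot state
W_gamma = rng.normal(size=(PROPRIO_DIM, CHANNELS))
W_beta = rng.normal(size=(PROPRIO_DIM, CHANNELS))
modulated = film(visual, proprio, W_gamma, np.ones(CHANNELS), W_beta, np.zeros(CHANNELS))
```

Concatenation adds proprioception as extra input, whereas here it changes how every visual channel is weighted, which is the "interpret the image differently depending on the arm's state" behavior described above.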

4.2 두뇌 모듈 --- VLM이 로봇의 뇌가 되다

자, 이제 가장 핵심적인 부분입니다. 지각 모듈이 "눈"이라면, 두뇌 모듈은 지각된 정보를 이해하고, 추론하고, 계획하는 "중추 신경계"입니다. 그리고 이 두뇌 모듈의 진화사가 곧 VLA 분야의 역사라고 해도 과언이 아닙니다.

4.2.1 진화 4단계

1단계: 순수 트랜스포머 (2022-2023)

Gato [13] (DeepMind, 2022)는 텍스트, 이미지, Atari 게임, 로봇 제어를 단일 트랜스포머로 처리한 최초의 "제너럴리스트 에이전트"였습니다. VIMA [45]는 멀티모달 프롬프트를 이해하는 트랜스포머를, GR-1 [51]은 비디오 생성과 행동 예측을 결합한 트랜스포머를 제안했고요.

이 시기의 "뇌"는 범용 사전학습 없이 처음부터(from scratch) 학습된 트랜스포머였습니다. 로봇 데이터만으로 학습했기 때문에, 언어 이해나 시각적 상식 추론에서 근본적 한계가 있었습니다. 비유하면, 공장에서만 자란 사람에게 "주방에서 요리해"라고 하는 것과 비슷합니다. 기계 조작은 할 줄 알아도, "소금은 음식 왼쪽에 있을 가능성이 높다"는 상식은 모르는 거죠.

2단계: 디퓨전 트랜스포머/DiT (2023-2024)

RDT-1B [24] (Robotics Diffusion Transformer)는 1.2B 파라미터의 DiT를 행동 생성의 중심 아키텍처로 사용했습니다. TriVLA는 삼중 시스템에서 DiT를 핵심 행동 생성기로 배치했고요. DiT는 트랜스포머의 확장성(scalability)과 디퓨전의 다중모달 표현력을 결합하여, 복잡한 행동 분포를 대규모 모델로 학습할 수 있게 했습니다.

3단계: VLM + 생성 헤드 (2024)

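This is where the paradigm shifts. π0 [16] (Physical Intelligence, 2024) used PaliGemma (SigLIP + Gemma 2B) as the "brain" and attached a separate Flow Matching head as the "hands." It is the first commercially successful example of combining the vast vision-language knowledge already held by a VLM with Flow Matching-based action generation.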

이 단계에서 핵심 통찰이 뭐냐면, "로봇의 뇌를 처음부터 만들 필요가 없다 --- 인터넷의 지식으로 이미 학습된 VLM을 가져와 로봇의 손만 연결하면 된다"는 겁니다. 이게 진짜 게임 체인저였습니다.

4단계: 완전한 VLM 기반 뇌 (2024-2025)

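The paradigm of "using a VLM directly as the robot policy," which began with RT-2 [11], matured through OpenVLA [15], π0.5 [31], CoT-VLA [55], and SafeVLA [75]. Models in this lineage reuse the VLM's generative ability for action generation as is. CoT-VLA [55] goes one step further and explicitly outputs an intermediate reasoning step before generating actions. SafeVLA [75] internalizes safety constraints in the VLM's reasoning process.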

Summing up this four-stage evolution, it is a journey from "a small brain of the robot's own" to "borrowing the big brain of the internet."

4.2.2 추론 패러다임

VLA의 "뇌"가 단순한 반사(reflex)를 넘어 사고(reasoning)하는 방향으로 진화하고 있습니다. 이게 2025-2026년의 가장 뜨거운 연구 방향 중 하나거든요.

Chain-of-Thought(CoT) 추론: 행동을 생성하기 전에 자연어로 사고 과정을 출력합니다. ECoT(Embodied Chain-of-Thought)는 "1. 빨간 컵이 테이블 왼쪽에 있다. 2. 그리퍼가 현재 테이블 오른쪽에 있다. 3. 먼저 왼쪽으로 이동해야 한다."와 같은 추론 체인을 생성한 후 행동을 출력합니다. CoT-VLA [55]는 이 패러다임을 대규모로 학습하여, 추론이 행동 성능을 향상시킨다는 것을 실증했습니다.

ICLR 2026에서는 Embodied Chain-of-Thought(ECoT [92])가 주요 트렌드로 부상했습니다. ACTIONS AS LANGUAGE [98], InstructVLA, EMBODIED-R1 등이 공간적으로 기반된(spatially-grounded) 추론을 행동 예측과 통합하여, 단순한 텍스트 추론을 넘어 시각적 장면에 직접 기반한 추론 과정을 VLA에 도입하고 있습니다.

이게 왜 중요하냐면, 추론 과정이 자연어로 나오면 디버깅이 가능해지기 때문입니다. 로봇이 왜 그런 행동을 했는지 알 수 있다는 거죠. 블랙박스였던 VLA에 해석 가능성의 창을 여는 셈입니다.

ReAct 패러다임: 추론(Reasoning)과 행동(Acting)을 교대로 수행합니다. "관찰 -> 추론 -> 행동 -> 관찰 -> ..."의 반복적 루프로, 환경의 피드백을 추론에 반영할 수 있습니다.

시각적 하위목표 예측: 언어 대신 미래 이미지를 예측하여 "다음에 어떤 상태가 되어야 하는가?"를 시각적으로 상상합니다. SuSIE [59], UniPi [93] 등이 이 방식을 탐구했습니다. 비유하면, 체스를 할 때 "나이트를 f3로" 하는 것이 아니라, "이런 판세가 되면 좋겠다"라고 미래 보드 상태를 상상하는 것과 비슷합니다.

4.2.3 월드 모델 통합

VLA의 뇌에 "상상력"을 부여하는 것이 월드 모델(World Model) 통합입니다. 이건 정말 매력적인 연구 방향인데요. 두 가지 방향이 있습니다:

정책 강화(Policy Enhancement): 월드 모델이 생성한 미래 예측을 정책 학습의 보조 데이터나 추가 입력으로 사용합니다. UniVLA [80]는 잠재 공간에서 미래 상태를 예측하고, 이 예측이 행동 생성을 안내합니다. WorldVLA [76]는 비디오 예측 모듈이 정책 네트워크와 공동 학습됩니다.

명시적 계획(Explicit Planning): 월드 모델로 여러 가능한 미래를 시뮬레이션하고, 가장 유리한 미래로 이어지는 행동을 선택합니다. LUMOS [94]는 잠재 공간 월드 모델로 트리 탐색을 수행하고, MinD [95]는 잠재 월드 모델 안에서 정신적 시뮬레이션(mental simulation)을 실행합니다.

두 방향의 차이는 월드 모델의 역할에 있습니다. 정책 강화에서 월드 모델은 "조언자"이고, 명시적 계획에서 월드 모델은 "시뮬레이터"입니다. 현재 추세는 두 방향의 수렴입니다 --- 월드 모델의 예측이 정책의 행동 생성에 직접 개입하면서도, 복잡한 상황에서는 명시적 시뮬레이션을 통한 계획이 가능한 유연한 구조를 향해 나아가고 있습니다.

비유하자면, 조언자 역할의 월드 모델은 바둑에서 "이 수가 좋을 것 같다"고 직감적으로 알려주는 것이고, 시뮬레이터 역할의 월드 모델은 "이 수를 두면 상대가 이렇게 두고, 그러면 내가 이렇게 두고..."를 몇 수 앞까지 실제로 시뮬레이션하는 것입니다.
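
The "simulator" role can be sketched as exhaustive look-ahead: roll every candidate action sequence through the world model and keep the one whose imagined outcome is best. Everything below is a toy under assumptions: the one-dimensional `world_model` stands in for a learned model, and real systems search a latent space with far smarter strategies than brute force.

```python
from itertools import product

ACTIONS = (-1, 0, 1)  # hypothetical discrete action set


def world_model(state, action):
    """Hypothetical 1-D dynamics standing in for a learned world model."""
    return state + action


def rollout(state, plan):
    """Imagine the final state reached by executing the plan in the world model."""
    for action in plan:
        state = world_model(state, action)
    return state


def plan_by_simulation(state, goal, horizon=3):
    """Simulate every action sequence and return the one ending closest to the goal."""
    candidates = product(ACTIONS, repeat=horizon)
    return max(candidates, key=lambda plan: -abs(goal - rollout(state, plan)))


best_plan = plan_by_simulation(state=0, goal=3)
```

In the "advisor" role, by contrast, the predicted future would be fed to the policy as an extra input rather than used to score candidate plans explicitly.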

이 월드 모델 통합의 의미를 좀 더 넓은 시각에서 보면요, Large Model Embodied AI 서베이 [44]가 중요한 프레임을 제공합니다. 이 서베이는 체화 AI의 의사결정을 계층적(hierarchical) 방식과 end-to-end 방식으로 양분하는데요 --- 계층적 방식은 해석 가능성과 안전성이 높지만 모듈 간 정보가 손실되고, end-to-end는 표현력은 극대화되지만 디버깅이 어렵습니다. 월드 모델은 이 둘을 연결하는 제3의 축입니다. end-to-end 모델에 "상상을 통한 계획" 능력을 부여해서, 계층적 구조의 장점(안전성, 장기 추론)을 흡수하면서도 통합된 표현 공간의 이점을 유지할 수 있거든요.

4.3 행동 모듈 --- 의도를 움직임으로

두뇌가 "빨간 컵을 집어야 한다"고 결정한 후, 이 의도를 실제 모터 명령으로 변환하는 것이 행동 모듈의 역할입니다. 여기서 근본적인 도전은 연속적이고 고차원적인 행동 공간을 어떻게 효과적으로 표현하고 생성하느냐입니다.

앞서 Section 3.2에서 액션 생성 방식의 분류를 봤는데, 이번에는 각 방식의 내부 메커니즘을 좀 더 깊이 파고들겠습니다.

4.3.1 이산 토큰화 --- RT-2 방식

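This is the approach pioneered by RT-2 [11]. Each continuous action value (for example, a joint angle of 0.732 rad) is quantized into an integer bin between 0 and 255, and the bins are mapped onto tokens in the VLM vocabulary so that they are processed exactly like language tokens. For a 7-dimensional action (6-DoF end-effector pose plus gripper), each timestep is therefore represented by 7 tokens, one per dimension; RT-2 itself uses one additional token to signal episode termination.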

이건 결국 "로봇 행동을 로봇의 언어로 만든다"는 발상인데요. "hello"를 "h", "e", "l", "l", "o" 토큰으로 생성하듯, 로봇의 움직임을 "bin_128", "bin_64", "bin_200", ... 으로 생성하는 겁니다.

장점: 언어 모델의 기존 인프라(토크나이저, 생성 알고리즘, KV 캐시)를 그대로 재활용합니다. 구현이 직관적이고 간단합니다. 언어 생성과 행동 생성이 동일한 디코딩 과정이므로, 멀티태스크 학습이 자연스럽습니다.
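Disadvantages: 256-bin quantization limits the precision of the action space (1/256, about 0.4% of each dimension's range). It also handles multimodal action distributions poorly: decoding commits to a single value per dimension, so when an object could be turned either left or right, the policy cannot keep both options open and, in the worst case, settles on an in-between action such as "do not turn" (the mode-averaging problem). Finally, because generation is autoregressive, latency grows in proportion to the number of tokens.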
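A minimal sketch of the 256-bin scheme may help make the precision limit concrete. It assumes each action dimension has been normalized to the range [-1, 1]; that range, the function names, and the example action vector are illustrative assumptions and are not taken from the RT-2 implementation.

```python
import numpy as np

N_BINS = 256

def tokenize(action, low=-1.0, high=1.0):
    """Map each continuous action dimension to an integer bin id in [0, 255]."""
    scaled = (np.asarray(action, dtype=float) - low) / (high - low)
    return np.clip((scaled * N_BINS).astype(int), 0, N_BINS - 1)

def detokenize(tokens, low=-1.0, high=1.0):
    """Map bin ids back to the centre of each bin."""
    return low + (np.asarray(tokens) + 0.5) * (high - low) / N_BINS

action = np.array([0.732, -0.25, 0.0, 0.1, -0.9, 0.5, 1.0])  # hypothetical 7-D action
tokens = tokenize(action)
recovered = detokenize(tokens)
max_err = np.max(np.abs(action - recovered))   # never exceeds half a bin width
print(tokens, max_err)
```

Under these assumptions the round-trip error is bounded by half a bin width, that is (2 / 256) / 2 in normalized units, which is where the resolution limit of this tokenization comes from.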

4.3.2 디퓨전 정책 --- 확률적 행동 생성

여기서 세 가지 변형을 비교해야 합니다.

DDPM(Denoising Diffusion Probabilistic Model): Diffusion Policy [17] (Chi et al., 2023)가 개척한 방식으로, 순수 가우시안 노이즈에서 출발하여 반복적 디노이징으로 행동 시퀀스를 생성합니다. 50-100회의 디노이징 스텝이 필요하지만, 다중모달 분포를 충실히 표현합니다.

DDIM(Denoising Diffusion Implicit Model): CogACT [23]가 채택한 방식으로, 결정론적 샘플링 과정을 통해 디노이징 스텝을 10-20회로 줄입니다. 품질과 속도의 절충점을 제공합니다.

Flow Matching: pi_0가 채택한 방식으로, 노이즈에서 데이터로의 경로를 ODE(상미분방정식)로 모델링합니다. DDPM/DDIM보다 학습이 안정적이고, 5-10회의 스텝으로 고품질 샘플을 생성합니다. Rectified Flow는 경로를 직선에 가깝게 학습하여 스텝 수를 더 줄입니다. 비유하면, DDPM이 미로를 탐색하듯 구불구불하게 노이즈에서 데이터로 가는 경로를 따라간다면, Flow Matching은 직선으로 가는 경로를 학습하는 겁니다.

디퓨전 기반 방식의 공통적 장점은 다중모달 분포 표현입니다. "컵을 좌로 돌릴 수도, 우로 돌릴 수도 있다"는 양쪽 모드를 동시에 표현하고, 실행 시 하나를 샘플링합니다. 부드러운 궤적을 자연스럽게 생성하며, 연속 공간에서 직접 동작하므로 양자화 오류가 없습니다.
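The few-step sampling of Flow Matching can be illustrated with a toy Euler integrator. This is a sketch under a strong assumption: instead of a learned network, it uses the oracle velocity field of a straight path toward one known target chunk `x1`, so it only shows the sampling loop, not training. The function name and the target values are hypothetical.

```python
import numpy as np

def sample_with_flow(x1, n_steps=10, seed=0):
    """Integrate dx/dt = v(x, t) from noise (t = 0) toward data (t = 1) with Euler steps."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(x1.shape)      # start from pure Gaussian noise
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt
        v = (x1 - x) / (1.0 - t)           # oracle velocity of the straight path to x1
        x = x + dt * v                     # one Euler step
    return x

target_chunk = np.linspace(0.0, 1.0, 16 * 7).reshape(16, 7)  # hypothetical 16-step, 7-DoF chunk
sample = sample_with_flow(target_chunk, n_steps=10)
print(np.max(np.abs(sample - target_chunk)))
```

In an actual policy the velocity would be predicted by a network conditioned on the observation, and different noise draws would land on different modes of the action distribution; the loop structure and the small step count are the parts this sketch is meant to show.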

4.3.3 FAST 토큰화 --- 주파수 도메인의 혁신

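This one is a genuinely clever idea. FAST (Frequency-space Action Sequence Tokenization) tries to get both the simplicity of autoregressive modeling and the precision of a continuous action representation. The core idea is to transform a chunk of continuous actions into the frequency domain and tokenize it there.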

구체적인 과정은 다음과 같습니다:

연속 행동 시퀀스(예: 16 타임스텝 x 7 DoF)를 DCT(Discrete Cosine Transform)로 주파수 도메인으로 변환

고주파 성분(대부분 노이즈)을 제거하여 압축

압축된 주파수 계수에 BPE(Byte Pair Encoding)를 적용하여 이산 토큰으로 변환

이 토큰을 LLM의 자기회귀 생성으로 예측

왜 이게 잘 되냐면, 로봇 행동의 에너지 대부분은 저주파 성분에 집중되어 있기 때문입니다. JPEG 이미지 압축이 DCT를 쓰는 것과 같은 원리죠. 고주파 성분을 버려도 정보 손실이 거의 없는 겁니다.

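The results are impressive. The FAST paper reports that π0-FAST matches the performance of the flow-matching π0 while cutting training time by up to 5x, with negligible information loss. The reason is that compression in the frequency domain is far more efficient than per-timestep quantization in the time domain, as used in RT-2 [11]. The token count also drops sharply, which eases the speed problem of autoregressive generation.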
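The first stages of the pipeline above (DCT, discarding high frequencies, coarse quantization) can be sketched with NumPy alone. This is an illustrative approximation, not the FAST implementation: the synthetic chunk, the number of kept coefficients `KEEP`, and the quantization scale of 50 are all assumptions, and the BPE stage is omitted.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix; row k is the k-th cosine basis function."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    m = np.sqrt(2.0 / n) * np.cos(np.pi * (i + 0.5) * k / n)
    m[0] /= np.sqrt(2.0)
    return m

T, DOF, KEEP = 16, 7, 6                     # chunk length, action dims, kept frequencies
t = np.linspace(0.0, 1.0, T)
# Smooth synthetic trajectories standing in for a real action chunk.
chunk = np.stack([np.sin(0.5 * np.pi * t * (d + 1) / DOF) for d in range(DOF)], axis=1)

D = dct_matrix(T)
coeffs = D @ chunk                          # time domain -> frequency domain, per dimension
coeffs[KEEP:] = 0.0                         # drop high-frequency components
quantized = np.round(coeffs * 50).astype(int)   # integer symbols that a BPE stage would compress
recon = D.T @ (quantized / 50.0)            # inverse transform (D is orthonormal)

max_err = np.max(np.abs(recon - chunk))
n_symbols = int(np.count_nonzero(quantized))
print(n_symbols, "non-zero symbols instead of", T * DOF, "| max error", max_err)
```

For smooth trajectories like these, most of the signal survives in the few low-frequency coefficients, so far fewer symbols than the 112 raw values need to be encoded.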

4.3.4 정규화 흐름(Normalizing Flows) --- 단일 스텝의 꿈

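NinA (Normalizing Flows in Action) applies normalizing flows to action generation. Unlike diffusion, it models the distribution as a composition of invertible transformations, so a sample can be drawn in a single forward pass. No multi-step denoising is needed, which makes inference very fast.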

4.3.5 주파수 도메인 Flow Matching --- FreqPolicy

FreqPolicy는 Flow Matching을 주파수 도메인에서 수행하여, 단일 스텝 추론을 달성합니다. 행동 시퀀스를 주파수 성분으로 분해한 후, 주파수 공간에서 플로우를 학습합니다. 시간 도메인에서의 복잡한 다중모달 분포가 주파수 도메인에서는 더 단순한 구조를 가지므로, 적은 스텝으로도 충분한 품질의 샘플을 생성할 수 있습니다.

FAST 토큰화와 FreqPolicy는 모두 주파수 도메인을 활용한다는 공통점이 있는데요. 하나는 AR 프레임워크에서, 다른 하나는 Flow Matching 프레임워크에서 주파수 변환의 이점을 활용한다는 점이 차이입니다.

4.3.6 Action Chunking --- 시간 스케일의 분리

Action Chunking은 행동 생성의 시간 스케일을 의미 수준과 운동 수준으로 분리하는 전략입니다. 고수준(저주파)에서는 "다음 청크의 의미적 방향"을 자기회귀로 결정하고, 저수준(고주파)에서는 "청크 내부의 세밀한 궤적"을 병렬로 생성합니다.

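By analogy, this is like generating a whole sentence at once rather than one word at a time. Instead of producing "I" ... "went" ... "to school" word by word, the model produces "I went to school" in one shot. This is what secures temporal consistency.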

ACT(Action Chunking with Transformers)가 이 개념을 대중화했습니다. 한 번의 예측으로 16-32 타임스텝의 행동 시퀀스를 청크 단위로 생성하여, 단일 스텝 예측의 근시안적(myopic) 행동 문제를 완화합니다. 이후 연구들은 청크 간 접합의 매끄러움, 가변 길이 청크, 계층적 청킹 등으로 확장되었습니다.

4.4 이중 시스템 아키텍처 --- 인지과학이 로봇공학을 만나다

자, 이 절은 제가 개인적으로 매우 흥미롭게 생각하는 부분입니다. 인지과학의 이론이 어떻게 공학적 설계 원리로 번역되는지를 보여주는 아름다운 사례이거든요.

4.4.1 Kahneman의 이중 처리 이론

Daniel Kahneman이 Thinking, Fast and Slow (2011)에서 제시한 이중 처리 이론은, 인간의 사고가 두 시스템으로 구성된다고 주장합니다:

System 1: 빠르고, 자동적이며, 노력이 적게 드는 직관적 사고. "공이 날아오면 손을 뻗어 잡는다"
System 2: 느리고, 의식적이며, 노력이 많이 드는 분석적 사고. "체스에서 다음 수를 계산한다"

이 이론의 로봇공학 적용은 직관적으로 타당합니다. 로봇도 두 가지 종류의 "사고"가 필요하기 때문이죠:

빠른 반사(10-120Hz): 장애물을 피하고, 미끄러지는 물체를 재빨리 다시 잡고, 부드러운 궤적을 유지하는 운동 제어
느린 숙고(1-10Hz): "어떤 물체를 집을 것인가?", "어떤 순서로 과제를 수행할 것인가?", "이 상황이 안전한가?"를 판단하는 고수준 추론

여기서 핵심 통찰은 이 두 시스템이 서로 다른 시간 스케일에서 작동한다는 것입니다. System 2가 매 프레임마다 실행될 필요는 없고, System 1이 복잡한 추론을 수행할 필요도 없습니다. 이 분리가 계산 효율성의 핵심입니다.

실생활에서도 마찬가지잖아요. 운전할 때 대부분의 조향과 브레이크는 System 1(무의식적 반사)으로 처리되고, "이 교차로에서 좌회전할까 우회전할까"는 System 2(의식적 판단)로 처리됩니다. System 2가 매 밀리초마다 작동할 필요가 없는 거죠.

4.4.2 로봇 이중 시스템의 구현

System 1 --- 행동 전문가: 디퓨전 정책, Flow Matching, 경량 MLP 등 빠른 생성 모델이 담당합니다. 10-120Hz의 주파수로 실행되어 부드럽고 반응적인 모터 명령을 생성합니다. 이 모듈은 VLM의 무거운 추론 없이, 주어진 상황 임베딩(context embedding)으로부터 직접 행동을 생성합니다.

System 2 --- VLM 기반 추론/계획: 대규모 VLM이 담당합니다. 1-10Hz의 주파수로 실행되어 장면을 이해하고, 과제를 분해하고, 안전 조건을 확인합니다. 이 모듈의 출력은 행동 전문가에게 "무엇을 해야 하는가"의 지시(context)로 전달됩니다.

비동기 실행: 두 시스템의 핵심은 독립적인 주파수로 실행된다는 것입니다. System 2가 다음 계획을 추론하는 동안, System 1은 이전 계획에 기반하여 행동을 계속 생성합니다. 이 비동기성이 VLM의 느린 추론 속도를 실시간 제어와 양립 가능하게 만듭니다.
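The asynchronous execution described above can be mimicked with a single-threaded simulation of two rates. This is only a sketch: real systems run the two modules concurrently, and the period, latency, and `plan@...` labels below are hypothetical values chosen for illustration.

```python
def run_dual_system(n_ticks=12, s2_period=4, s2_latency=2):
    """Simulate a fast System 1 loop that keeps acting while a slow System 2 plan is pending."""
    context = "plan@init"     # latest plan available to System 1
    pending = None            # (ready_tick, plan) while System 2 is still computing
    log = []
    for tick in range(n_ticks):
        # A finished System 2 result replaces the context as soon as it is ready.
        if pending is not None and tick >= pending[0]:
            context, pending = pending[1], None
        # System 2 starts a new (slow) inference every s2_period ticks.
        if tick % s2_period == 0 and pending is None:
            pending = (tick + s2_latency, f"plan@{tick}")
        # System 1 acts on every tick using whatever plan is currently available.
        log.append((tick, context))
    return log

log = run_dual_system()
for tick, ctx in log:
    print(tick, ctx)
```

The log shows System 1 acting on the previous plan during the ticks in which System 2 is still "thinking", which is the property that lets a slow VLM coexist with real-time control.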

4.4.3 구현 사례

GR00T N1 [21] (NVIDIA): Eagle-2 VLM(System 2)이 카메라 이미지와 언어 지시를 처리하여 상황 임베딩을 생성합니다. 이 임베딩은 DiT 기반 행동 전문가(System 1)에게 전달되어, Flow Matching으로 행동 청크를 생성합니다. 전형적인 캐스케이드 구조로, System 2 -> System 1의 순차적 정보 흐름을 따릅니다.

pi_0 (Physical Intelligence): PaliGemma VLM(System 2)의 토큰과 Flow Matching 행동 전문가(System 1)의 토큰이 공유 어텐션 블록에서 동시에 처리됩니다. 병렬 구조로, 두 시스템이 동일한 트랜스포머 레이어를 공유하면서 서로의 정보에 접근합니다.

MinD [95]: 잠재 월드 모델(latent world model)이 System 2의 역할을 하며, 미래 상태를 "정신적으로 시뮬레이션"합니다. 행동 정책(System 1)은 이 시뮬레이션 결과를 바탕으로 행동을 생성합니다. 여기서 System 2는 언어적 추론이 아닌 잠재 공간에서의 예측적 시뮬레이션이라는 점이 독특합니다.

TriVLA: 삼중 시스템(triple system)을 제안합니다. VLM(System 3, 가장 느림)이 전략적 계획을, DiT(System 2, 중간)가 전술적 행동 생성을, 경량 실행기(System 1, 가장 빠름)가 실시간 보정을 담당합니다. 이중 시스템을 삼중으로 확장하여 시간 스케일의 분리를 더 세밀하게 구현합니다. 군대로 비유하면 사령관(전략) - 중대장(전술) - 병사(실행)의 3단계 지휘 체계와 비슷합니다.

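Hume: introduces System-2 "thinking" into a dual-system VLA. As described in the Hume paper, the slow system samples several candidate action chunks and uses a learned value estimate to choose among them, while the fast system refines the selected chunk through cascaded action denoising for real-time control.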

4.4.4 핵심 설계 선택: 공유 어텐션 vs 캐스케이드

자, 이중 시스템 아키텍처의 가장 중요한 설계 선택은 두 시스템 간의 정보 흐름 방향입니다. 이건 현재 진행형인 논쟁이기도 합니다.

캐스케이드(GR00T N1 [21] 방식):

정보가 System 2 -> System 1으로 단방향으로 흐릅니다
System 1이 System 2에 피드백을 줄 수 없습니다
장점: 모듈 분리가 명확하여 각 시스템을 독립적으로 학습/교체할 수 있습니다. System 2의 VLM을 업그레이드하더라도 System 1의 행동 전문가는 그대로 유지할 수 있습니다. 마치 레고 블록처럼 부품 교체가 가능한 거죠.
단점: 행동 생성 과정의 정보가 추론에 반영되지 않으므로, 행동 실행 중 발생하는 미세한 변화에 System 2가 반응하지 못합니다.

공유 어텐션(pi_0 방식):

정보가 System 1 <-> System 2 양방향으로 흐릅니다
언어/시각 토큰과 행동 토큰이 동일한 어텐션 메커니즘에서 상호작용합니다
장점: 두 시스템이 서로의 상태를 참조할 수 있어, 행동 생성이 추론에 영향을 주고 추론이 행동을 안내하는 밀접한 협력이 가능합니다.
단점: 모듈 분리가 불명확하여, 한 시스템을 교체하면 다른 시스템도 재학습이 필요할 수 있습니다. 공유 어텐션의 계산 비용이 큽니다.

이 선택은 모듈성(modularity) vs 통합성(integration)의 근본적 트레이드오프를 반영합니다. 캐스케이드는 레고 블록처럼 부품을 교체할 수 있는 유연성을 제공하고, 공유 어텐션은 유기체처럼 밀접하게 통합된 시스템을 만듭니다.

현재(2025년 기준) 추세를 보면, 연구 커뮤니티에서는 두 접근법이 공존하되, 상업적 성공을 거둔 모델들(pi_0, pi_0.5)은 공유 어텐션 쪽에, 플랫폼/모듈 교체 용이성을 중시하는 모델들(GR00T N1 [21])은 캐스케이드 쪽에 가깝습니다. 최종적으로 어느 쪽이 우세해질지는 아직 열린 질문입니다. 다만 분명한 것은, "빠른 반사"와 "느린 숙고"의 분리라는 인지과학적 통찰이 VLA 아키텍처 설계의 핵심 원리로 자리 잡았다는 점입니다.

4.5 3대 프런티어 VLA를 정직하게 비교하기 — NVIDIA GR00T, Google Gemini Robotics, Physical Intelligence π

"셋이 같은 패러다임의 변주라는 진단은 옳습니다. 그런데 그 진단에서 멈추면 무엇을 연구해야 할지 보이지 않습니다. 우리는 한 번 더 들어가야 합니다."

자, 이제 흥미로운 장면에 도달했는데요. 박사 과정에 들어와 VLA 논문을 본격적으로 읽기 시작하면 거의 모두가 비슷한 경로를 밟습니다. 처음엔 GR00T [21], Gemini Robotics, π [16]를 마치 세 개의 다른 동물처럼 봅니다. 데모 영상이 워낙 달라 보이니까요 — 휴머노이드가 물건을 옮기는 NVIDIA, 자연어로 복잡한 추론을 하는 Google, 13시간 에스프레소를 만드는 PI. 시각적 인상이 전혀 달라요. 그런데 Section 3–6을 끝까지 읽고 나면 정반대의 인식이 찾아옵니다. "어라, 이거 다 같은 거 아닌가?" 하고요.

이 두 인식이 모두 부분적으로 옳고, 부분적으로 틀린데요. 이 절에서 찾으려는 건 그 사이의 정확한 지점입니다. 박사 1–2년차가 이 분야에 진입할 때 가장 위험한 건 두 극단 중 하나에 정착해버리는 겁니다. "셋이 다 다르다"고 보면 표면에 끌려다니게 되고, "셋이 다 같다"고 보면 자신이 어느 흐름 위에서 연구하고 있는지 잃어버리게 되죠. 그래서 정직하게 무엇이 같고 무엇이 진짜로 다른가를 분리해내야 합니다.

4.5.1 먼저 인정해야 할 것: 통일장 진단의 70%는 옳습니다

지난 1–2년간 VLA를 충분히 깊이 추적해왔다면, 다음 명제에 동의하지 않기 어려울 겁니다.

세 모델은 모두 "인터넷 사전학습된 VLM을 뇌로 두고, 그 위에 생성형 행동 디코더를 얹고, 두 모듈을 서로 다른 시간 스케일로 비동기 실행하는" 단일 패러다임의 변주입니다.

이건 본 서베이의 Insight 1(수렴의 증거)이 정면으로 지적한 결론이고요. Section 3.6의 5축 좌표계에 셋을 강제로 박아놓고 보면 이 사실이 더 또렷해집니다.

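Axis	NVIDIA GR00T N1 [21]	Google Gemini Robotics	PI π0 / π0.5 [31]
Structure (Liu/Shao)	Dual-system, Cascade	Dual-system, Cascade	Dual-system, Parallel
Generation (Zhong)	Diffusion family (DiT, flow matching)	Diffusion family	Flow Matching
Anatomy (Xu)	Eagle-2 Brain + DiT Hand	Gemini VLM + Action Head	PaliGemma/Gemma 3 + Flow Hand
Function (Kawaharazuka)	Low-level perception + low-level planning	Emphasis on high-level reasoning + low-level planning	High-level perception + low-level planning
Post-training (Jin)	Improved embodiment awareness	Deeper task understanding	Multi-component integration + RL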

참고로 Gemini Robotics는 §3.6의 공식 5축 분류표에 포함되어 있지 않습니다. 위 행은 공개된 기술 보고를 근거로 한 본 절의 추정 위치예요. GR00T N1/π0의 행은 §3.6 표와 정확히 일치합니다.

표를 보면 차이가 있긴 한데, 카테고리 자체가 다른 게 아니라 같은 카테고리 안에서의 위치 차이입니다. 셋 다 dual-system이고, 셋 다 generative action head를 쓰고, 셋 다 사전학습된 VLM 백본 위에 서 있어요. 4가지 공리(인터넷 지식 상속, 시간 스케일 분리, 생성형 디코더, 데이터 피라미드)는 정말로 셋의 공통 분모입니다.

이걸 인정하지 않고 시작하는 모든 비교는 결국 마케팅 카피의 재배열이 돼버려요. 셋의 차이는 "다른 종(species)"이 아니라 "같은 종 안의 다른 품종(breed)"이라는 통일장 논변의 결론은 받아들여야 합니다. 그런데 여기서 끝나느냐? 아닙니다. 박사 과정 연구자에게 진짜 흥미로운 질문은 이 다음에 시작됩니다.

4.5.2 평탄화하면 안 되는 것: 같은 종 안의 차이도 진짜입니다

생물학자에게 "골든 리트리버와 시베리안 허스키는 같은 종이다"라고 말하면 동의할 겁니다. 그런데 그가 평생 연구하는 건 그 두 품종 사이의 차이예요. 카테고리상 같다는 것이 차이가 무의미하다는 뜻은 아니라는 거죠. 패러다임이 같다는 사실이 그 패러다임 위에서의 분기가 사라진다는 뜻은 아닙니다. 자동차가 100년간 "내연기관 + 변속기 + 차체"로 통일된 후에야 도요타·BMW·포르쉐의 진짜 차이가 본격화됐잖아요.

VLA에서 진짜 차이가 어디서 나는지를 보려면, 본 서베이가 기술 축으로 짜여 있어 명시적으로 드러내지 못한 세 가지 횡단 축을 도입해야 합니다. 이 축들은 통일장 논변이 의도적으로 평탄화한 지점이고, 박사 과정 연구자가 어느 흐름 위에서 연구할지 결정할 때 결정적이에요.

축 1 — 데이터 출처가 결정하는 일반화의 천장

§10.12가 명시적으로 지적한 건, 시뮬레이션 벤치마크(LIBERO, CALVIN)에서 프런티어와 오픈 웨이트의 성능이 수렴하고 있음에도 실세계 zero-shot 일반화에서는 격차가 좁혀지지 않는다는 점이었습니다. Reuss(2026)의 ICLR 분석이 가리키는 핵심 원인은 데이터 큐레이션과 인프라 규모의 격차고요. 이게 우연이 아니라 세 그룹이 데이터를 어디서 얻는가의 구조적 차이에서 나옵니다.

NVIDIA는 Cosmos 기반 시뮬레이션 합성과 Isaac Sim/Lab을 중심에 둡니다. 시뮬레이션은 무한 확장 가능하지만 sim-to-real gap이라는 영구적 세금을 내야 하죠. Google은 인터넷 스케일 멀티모달 사전학습을 자산으로 두고 있어요. zero-shot 의미 이해에서는 압도적이지만, 모터 제어의 정밀도와는 다른 차원의 능력이고요. PI는 자체 로봇 함대에서 수집한 실세계 데이터에 의존합니다. sim-to-real gap이 없는 가장 깨끗한 신호지만, 확장성은 가장 제약됩니다.

그래서 박사 과정 연구자가 자신의 실험실에서 어느 전략을 모방할지가 결정적인데요. 시뮬레이션 인프라가 있다면 NVIDIA 노선이 자연스럽고, 없다면 openpi 위에서 소량의 실세계 데이터로 fine-tune하는 게 현실적입니다. 이건 단순한 도구 선택이 아니라 무엇을 일반화의 한계로 받아들일 것인가의 선택이에요.

축 2 — 학습 단계 중 어디에 자원을 쏟는가

§6.6의 3단계 성숙 모델(인터넷 사전학습 → BC fine-tuning → RL post-training)을 다시 떠올려보면요. 흥미로운 사실은, 세 그룹이 이 세 단계 중 서로 다른 단계를 자신의 차별화 지점으로 삼고 있다는 점입니다. 이것도 우연이 아니라 각 그룹의 자산 구조에 의해 결정돼요.

Google은 1단계(인터넷 사전학습)에 자산이 압도적으로 쏠려 있습니다. 거대 데이터센터, 인터넷 스케일 멀티모달 코퍼스, TPU 인프라요. 그래서 Gemini Robotics의 차별화는 "Gemini 백본을 그대로 이식한다"가 돼요. NVIDIA는 모델 자체는 1–2단계의 정통 경로(Eagle-2 VLM 상속 + BC fine-tuning)를 따르되, 그 1–2단계를 먹여 살리는 데이터 공급망을 Cosmos/Isaac Sim으로 차별화합니다. 모델 레시피가 아니라 데이터 파이프라인이 차별화 지점인 거죠. PI는 1, 2단계에서는 Google의 오픈 PaliGemma/Gemma 3을 가져다 쓰면서, 3단계의 RL post-training과 그 이후의 deployment-time 학습을 자기 영역으로 정의했습니다. π^*_0.6 [157]의 advantage conditioning(§10.13.1)이 Flow Matching VLA에 RL을 적용하는 실용적 경로를 최초로 입증한 것, π_0.6-MEM [158](§10.13.2)이 다중 스케일 메모리로 15분 장시간 작업을 풀어낸 것도 모두 이 3단계에서 일어난 일이에요.

박사 과정 연구자에게 이건 매우 실용적인 질문으로 번역됩니다. 향후 3–5년간 어느 학습 단계에서 기여할 수 있는가? 1단계에서 거대 백본과 경쟁하는 건 학계 단일 실험실로서는 사실상 불가능하고요. 2단계 BC fine-tuning은 이미 잘 닦여 있습니다. 가장 열려 있는 frontier는 3단계 RL post-training과 그 변형들인데, PI가 가장 빠르게 밀어내고 있는 영역이면서 동시에 진입 장벽이 가장 낮은 영역이기도 해요. ICLR 2026에서 자기개선 잔차 RL이 LIBERO 99%에 도달한 흐름이 이 진단을 뒷받침합니다.

축 3 — 모델 가중치 공개 정책이 만드는 비대칭 생태계

이건 본 서베이가 직접 다루지 않은 정치경제적 차원인데요, 박사 과정 연구자의 일상에 가장 직접적으로 영향을 미치는 차이입니다. NVIDIA는 GR00T N1을 Hugging Face에 공개하고, openpi는 π0/π0-FAST를 공개하지만 π0.5 이후는 비공개이며, Gemini Robotics는 처음부터 API/파트너십 모델이고요.

이 차이가 단순한 기업 정책이 아니라 누가 어떤 종류의 후속 연구를 할 수 있는가를 결정합니다. 학계에서 weight access 없이 ablation study나 mechanism interpretation을 시도하기란 거의 불가능해요. 그래서 학술 VLA의 실질적 베이스라인은 OpenVLA [15], π0, GR00T N1 같은 오픈 모델이고, Gemini Robotics는 인용 대상이지 비교 대상이 되기 어렵습니다. 이게 §10.12의 "프런티어 vs 오픈 웨이트 격차"가 단순히 성능 격차가 아니라 연구 가능성 자체의 격차라는 뜻이에요.

박사 1–2년차가 새 연구 주제를 잡을 때 이 점을 명시적으로 의식해야 합니다. "Gemini Robotics와 비교하는 연구"는 거의 항상 제한된 형태(공개된 결과 인용, API 호출 비교)에 머물 수밖에 없고요. 반면 "openpi 위에서 새로운 RL 후처리 방법을 검증하는 연구"는 즉시 실행 가능하고 다른 연구자가 재현할 수 있습니다.

4.5.3 그래서 셋의 진짜 좌표는

위 세 축을 종합하면, 셋의 차이를 정직하게 한 줄로 요약할 수 있습니다.

차원	NVIDIA GR00T	Google Gemini Robotics	PI π 시리즈
패러다임상 위치	같은 종 (VLM Brain + Generative Action Head)	같은 종	같은 종
베팅하는 병목	체현(embodiment) 일반화, 휴머노이드 폼팩터	거대 백본의 추론 능력을 행동으로 전이	정책 자체의 개선 능력 (RL + 메모리)
자산 구조	GPU·시뮬레이션·엣지 칩 풀스택	거대 데이터센터·인터넷 사전학습	자체 로봇 함대·실세계 데이터
차별화 학습 단계	1–2단계의 데이터 공급망(Cosmos/Isaac)	1단계 (사전학습)	3단계 (RL post-training)
모델 공개 정책	오픈 (인프라로 수익화)	클로즈드 (API로 수익화)	이중 트랙 (기초는 오픈, 최전선은 비공개)
학계와의 관계	베이스라인 + 인프라 채택 유도	인용 대상, 비교 어려움	베이스라인 (openpi) + 최전선 추적 대상

이 표와 4.5.1 표의 관계에 주목하세요. 4.5.1이 기술 축에서의 수렴을 보여준다면, 이 표는 그 위에 겹쳐지는 전략·생태계 축에서의 분기를 보여줍니다. 두 표 모두 진실이고, 어느 한쪽만 보면 분야 전체를 잘못 이해하게 됩니다.

4.5.4 박사 과정 연구자의 시각에서 — 무엇을 추적하고 무엇을 따라가지 않을 것인가

이 절을 절충적 결론으로 끝내기보다, 박사 1–2년차가 실제로 마주칠 결정에 대한 구체적 가이드로 마무리하는 게 더 정직할 것 같습니다.

셋을 모두 추적해야 하는 영역. 이중 시스템 아키텍처의 진화, 행동 디코더의 수학적 메커니즘(diffusion vs flow matching vs discrete diffusion), VLM 백본 선택과 fine-tuning 전략. 이 영역에서는 셋이 진짜로 한 게임을 하고 있어서 한 그룹의 진전이 다른 그룹의 가까운 미래를 예고합니다. Section 4–5가 이 영역이에요.

그룹별로 분기해서 추적할 영역. 데이터 수집·증강 전략(NVIDIA의 Cosmos 흐름, Google의 인터넷 멀티모달 흐름, PI의 실세계 함대 흐름), 학습 후 단계 혁신(특히 PI의 RL post-training과 메모리 통합), 도메인 특화 응용(NVIDIA 휴머노이드, Google의 자율주행 EMMA 계보). 이 영역에서는 그룹별 자산 구조가 다르기 때문에 한 그룹의 결과를 다른 그룹에 그대로 적용하기 어렵습니다.

한 그룹만 깊이 추적해도 충분한 영역. PI의 π^*_0.6와 π_0.6-MEM 계보요. §10.13에서 두 절을 통째로 할애한 이유는, 현재 이 흐름이 가장 빠르게 미해결 과제(BC의 성능 천장, 장시간 작업의 메모리 부재)를 정면 돌파하고 있기 때문입니다. RL post-training이나 메모리 메커니즘에 관심이 있다면, openpi와 π 시리즈 후속 논문을 베이스라인으로 두고 출발하는 게 가장 효율적이에요.

의도적으로 따라가지 말아야 할 함정. 회사별 데모 영상 인상에 끌려 "어느 회사가 앞서고 있는가"를 묻는 질문은 학술적으로 대답할 수 없습니다. 시뮬레이션 벤치마크 숫자만 추적하다 실세계 일반화 격차(§10.12)를 놓치는 함정도 흔하고요. 그리고 가장 큰 함정 — 통일장 논변에 너무 일찍 정착해서 "어차피 다 같은 패러다임"이라며 그룹별 차이를 무시하는 것. 패러다임이 같다는 것은 분기를 무시할 면허가 아니라, 분기가 어디서 일어날지 더 정확히 예측할 수 있는 도구입니다.

4.5.5 한 줄 요약

세 그룹은 수학적 아키텍처 수준에서는 한 패러다임의 변주이고 이 진단은 정직하게 받아들여야 합니다. 그런데 데이터 출처·학습 단계 자원 배분·모델 공개 정책이라는 횡단 축에서는 진짜로 다른 좌표에 있고, 이 차이가 박사 과정 연구자가 자신의 연구를 어느 흐름 위에 위치시킬지를 결정하죠. 5년 후 셋의 모델 자체는 더 닮아갈 가능성이 높지만, 그들이 만들어낼 생태계와 학계와의 관계는 더 분기할 가능성이 높습니다. 이 두 흐름을 동시에 보는 것이, 박사 1–2년차가 이 분야에 정착할 때 필요한 균형이에요.

4장 요약: VLA 해부학의 현재 좌표

자, 이번 장을 정리하겠습니다.

지각 모듈은 SigLIP + DINOv2 [30] 하이브리드가 지배적 표준으로 수렴하고 있습니다. "무엇(what)" + "어디(where)"의 상보적 조합이 핵심이었습니다. 두뇌 모듈은 사전학습된 VLM을 로봇의 뇌로 직접 전용하는 방향으로 확정적 진화를 이루었습니다. "뇌를 처음부터 만들지 말고 빌려오라"는 통찰이 핵심이었고요. 행동 모듈은 아직 확정적 승자가 없으며, 이산 토큰화, 디퓨전, Flow Matching, FAST 토큰화 [20]가 경쟁하고 있습니다. 그리고 이 세 모듈을 연결하는 방식으로서 이중 시스템 아키텍처가 부상하며, 인지과학의 통찰이 공학적 설계 원리로 번역되고 있습니다.

다음 장에서는 이렇게 구축된 아키텍처를 어떻게 학습시키는가 --- 행동 복제, 강화학습, 월드 모델 학습의 세 패러다임을 심층적으로 다룹니다.

Motivation Chain: 아키텍처 진화의 동기 사슬

Motivation Chain

모듈형 파이프라인의 한계(sense-plan-act 분리 → 정보 손실, 엔지니어링 부담)

→ 단일체(Monolithic) 아키텍처 등장(RT-2: 하나의 VLM이 모든 것을 처리)

→ 단일체의 한계(VLM 추론이 느려 실시간 제어 어려움)

→ 이중 시스템(Dual-system) 등장(GR00T N1: 빠른 System 1 + 느린 System 2)

→ 이중 시스템의 한계(장시간 복합 작업에서 계획 능력 부족)

→ 계층적(Hierarchical) 아키텍처(π0.5: VLM 플래너 + VLA 실행기)

Motivation Chain

자기회귀 디코딩의 한계(단일 모드만 생성, 이산화 정보 손실, 느린 순차 생성)

→ Diffusion Policy [17] 등장(다중 모드 행동 표현, action chunk 일괄 생성)

→ DDPM의 한계(50-100 디노이징 스텝으로 느린 생성)

→ Flow Matching(π0 [16]) 등장(선형 보간으로 5-10 스텝에 수렴)

→ DiT 기반 디코더(CogACT [23], RDT-1B [24]) 등장(Transformer의 스케일링 법칙을 디퓨전에 적용)

이 동기 사슬을 보시면, VLA 아키텍처의 발전이 "한계를 발견하고 극복하는" 변증법적 과정이라는 걸 알 수 있습니다. 각 단계의 한계가 다음 단계의 동기가 되는 거죠.

유사 아키텍처 차별점 비교

이건 헷갈리기 쉬운 부분을 정리한 표입니다. 시험 문제로 나올 수 있는 부분이니 잘 보시기 바랍니다.

비교 대상	핵심 차별점
단일체 vs 이중체	단일체는 하나의 모델이 이해+행동을 모두 처리(단순하지만 속도 제약). 이중체는 이해와 행동을 분리하여 각각 최적 주파수로 운영
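Cascade vs. Parallel dual-system	Cascade passes information sequentially from System 2 to System 1 (GR00T N1). Parallel processes the tokens of both systems together in shared attention blocks (pi_0)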
Planner-Only vs Planner+Policy 계층	Planner-Only는 고수준 계획만 VLM이 담당(SayCan). Planner+Policy는 저수준 VLA 정책까지 학습(pi_0.5)
자기회귀 vs 디퓨전 디코더	AR은 토큰을 순차 생성(이산적, 느림, VLM 어휘 재활용). 디퓨전은 노이즈->행동 반복 정제(연속적, 다중모드, 병렬 chunk)
DDPM vs Flow Matching	DDPM은 확률적 역과정 반복(50-100 스텝). Flow Matching은 결정론적 ODE 경로 학습(5-10 스텝, 더 빠르고 안정적)

직관적 한줄 설명: 아키텍처 편

이건 다른 사람에게 VLA를 설명해야 할 때 쓸 수 있는 비유들입니다:

단일체 VLA: "통역 없이 외국어를 바로 알아듣고 대답하는 사람 --- 빠르지만 깊은 사고는 어려움"
이중 시스템: "CEO(전략적 판단, 느림)와 현장 작업자(즉각 실행, 빠름)의 분업"
계층적 구조: "요리 레시피(고수준 계획)와 칼질 기술(저수준 기술)의 분리 --- 레시피를 바꿔도 칼질은 재학습 불필요"
자기회귀 디코딩: "한 글자씩 타이핑하듯 행동을 순서대로 하나씩 생성"
디퓨전 디코딩: "대리석 조각처럼 전체 형태에서 불필요한 부분(노이즈)을 깎아내어 행동을 드러냄"
Flow Matching: "디퓨전의 '조각'을 직선 경로로 단축 --- 같은 결과를 더 적은 스텝으로"
Action Chunk: "한 글자씩이 아니라 한 문장을 통째로 생성 --- 시간적 일관성 확보"

Self-Check Questions: Section 3-4

Q1: pi_0의 아키텍처를 Liu & Shao의 분류와 Zhong et al.의 분류에서 각각 어떻게 위치시킬 수 있는가?

답: Liu & Shao 분류에서 pi_0는 Parallel 이중 시스템(병렬 이중 시스템)입니다 --- VLM(PaliGemma)이 시각-언어 특징을 추출하고, Flow Matching Action Expert가 연속 행동을 생성하는 병렬 구조(VLM 토큰과 Action Expert 토큰이 공유 어텐션에서 동시 처리)입니다. Zhong et al. 분류에서는 디퓨전 기반 Pure VLA에 해당합니다 --- Flow Matching이 디퓨전의 변형이므로.

Q2: 자기회귀 방식의 VLA(RT-2)가 다중 모드 행동을 잘 표현하지 못하는 이유는 무엇인가?

답: 자기회귀 방식은 각 행동 차원을 이산 bin으로 양자화하여 토큰으로 생성합니다. 이 과정에서 (1) 양자화 오차로 연속 행동의 정밀도가 손실되고, (2) 각 토큰이 순차적으로 이전 토큰에 조건화되므로, "오른쪽으로 집기"와 "위에서 집기"가 동시에 존재하는 다중 모드 분포를 자연스럽게 표현하기 어렵습니다. 분포의 평균 방향(mode averaging)으로 수렴하는 경향이 있습니다.

Q3: Xu et al.의 5대 도전 과제 중, 현재(2026년 기준) 가장 큰 진전을 보인 과제와 가장 미해결인 과제는 각각 무엇인가?

답: 가장 큰 진전은 (1) 표현 --- VLM 기반 통합 표현, FAST 토큰화, Flow Matching 등으로 표현 문제가 크게 개선됨. 가장 미해결인 과제는 (4) 안전 --- SafeVLA [75]가 첫 시도이지만, 형식적 검증(formal verification)이 부재하고, VLA의 환각이 물리적 사고로 이어질 수 있는 근본적 위험이 해결되지 않음.

Open Research Questions: Section 3-4

마지막으로 열린 연구 질문들입니다. 박사 연구 주제를 찾고 계신 분이라면 여기서 아이디어를 얻으실 수도 있을 겁니다.

Action Expert 스케일링: pi_0의 Action Expert(0.3B)와 VLM backbone(3B)의 파라미터 비율이 최적인가? Action Expert를 더 키우면 성능이 비례하여 향상되는가?

5. 액션 토큰화 — VLA의 핵심 설계 결정

자, 이번 시간에는 VLA 설계에서 가장 중요한 질문 하나를 다루겠습니다. "로봇의 행동을 어떻게 토큰으로 표현할 것인가?" 이 질문이거든요. 이게 왜 중요하냐면, VLA 모델은 결국 대규모 언어 모델(LLM) 위에 만들어진 것이잖습니까. LLM은 기본적으로 이산적 토큰 시퀀스를 입출력으로 처리하는 시스템인데요, 로봇의 행동은 연속적입니다. 관절 각도가 37.284도 같은 연속값이란 말이죠. 이 연속적인 행동 공간을 이산적인 토큰 공간으로 바꾸는 방식이, 모델의 표현력, 제어 정밀도, 추론 속도를 근본적으로 결정짓게 됩니다.

Chen et al. [7] (2025)이 이걸 액션 토큰화(Action Tokenization) 관점이라고 명명했는데요, VLA 모델 간의 차이를 가장 깔끔하게 설명하는 축이라고 제시했습니다. 비유하자면, 프로그래밍 언어에서 자료형(data type)을 고르는 것과 비슷합니다. int로 할지 float로 할지 string으로 할지에 따라 할 수 있는 연산이 완전히 달라지는 것처럼, 액션 토큰의 유형에 따라 VLA가 할 수 있는 일의 범위가 결정되는 겁니다.

이 절에서는 8가지 액션 토큰 유형을 체계적으로 정리하겠습니다. 참고로, 하나의 모델이 복수 유형을 조합할 수 있습니다. 예를 들어 CoT-VLA [55]는 추론 토큰(유형 8)과 목표 토큰(유형 5)을 함께 활용합니다.

5.1 8가지 액션 토큰 유형

유형 1: 언어 토큰 (Language Tokens)

가장 직관적인 접근부터 보겠습니다. 로봇의 행동을 그냥 자연어 텍스트로 표현하는 거예요. "빨간 컵을 집어 올려라(pick up the red cup)" 이런 식으로요. LLM이 텍스트를 생성하면, 하위 정책(low-level policy)이나 미리 정의해 둔 기능(skill primitive)이 이걸 실제 모터 명령으로 변환합니다.

이걸 사람으로 비유하면, 요리사에게 "스테이크를 미디엄 레어로 구워줘"라고 말하는 것과 같습니다. 구체적으로 몇 도에서 몇 분간 굽라는 게 아니라, 고수준 지시만 내리는 거죠.

대표 모델들을 보겠습니다:

SayCan [14] (Ahn et al., 2022): 이게 매우 중요한 모델인데요, LLM이 생성한 행동 후보에 대해 어포던스 점수를 매겨서 실행 가능한 행동을 선택합니다. 핵심 아이디어가 뭐냐면, 언어 모델의 세계 지식과 로봇의 물리적 능력을 곱(product) 연산으로 결합한 겁니다. LLM이 "이 상황에서 뭘 하면 좋겠다"라고 제안하면, 로봇이 "그거 지금 물리적으로 가능한가?"를 판단해서 곱하는 거예요.
Inner Monologue [22] (Huang et al., 2023): 환경으로부터의 피드백(성공/실패, 물체 인식 결과)을 다시 언어로 변환해서 LLM에 넣어주는 접근입니다. 로봇이 "아, 방금 그거 실패했네요" 하고 LLM에 알려주면, LLM이 다음 계획을 수정하는 거죠.
SayTap [155] (Tang et al., 2023): 이건 좀 독특한데요, 보행 로봇의 발 접촉 패턴을 텍스트 시퀀스로 표현해서, LLM이 보행 리듬을 직접 계획할 수 있게 했습니다.
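The product rule attributed to SayCan above can be sketched as follows. The skill names and all scores are made-up numbers for illustration; in the actual system the first score would come from the language model and the second from learned affordance (value) functions.

```python
# Hypothetical scores for three candidate skills.
llm_score = {            # "how useful is this skill for the instruction?"
    "pick up the sponge": 0.50,
    "go to the sink": 0.30,
    "pick up the apple": 0.20,
}
affordance = {           # "can the robot actually do this right now?"
    "pick up the sponge": 0.10,   # e.g. the sponge is not in view
    "go to the sink": 0.90,
    "pick up the apple": 0.80,
}

def select_skill(llm_score, affordance):
    """Combine the two scores by multiplication and return the best skill."""
    combined = {s: llm_score[s] * affordance[s] for s in llm_score}
    return max(combined, key=combined.get), combined

best, combined = select_skill(llm_score, affordance)
print(best, combined)
```

With these numbers the skill the language model likes most is rejected because it is not currently feasible, and the product picks "go to the sink" instead.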

장점: LLM의 강력한 언어 생성 능력과 상식 추론을 직접 활용할 수 있습니다. 사전학습된 언어 지식이 그대로 전이되므로, 새로운 과제에 대한 제로샷(zero-shot) 일반화가 뛰어나요.

한계: 그런데 근본적인 문제가 있거든요. 연속적 운동 제어의 정밀도가 본질적으로 부족합니다. "컵을 3cm 왼쪽으로 이동"이라는 세밀한 조작을 자연어로 충분히 표현하기 어렵잖아요. 제어 주파수가 1-3 Hz로, 모든 토큰 유형 중 가장 낮습니다. 이게 무슨 의미냐면, 1초에 1~3번밖에 행동을 결정 못 한다는 거예요. 빠르게 날아오는 공을 잡는 건 상상도 못 하는 속도입니다.

유형 2: 코드 토큰 (Code Tokens)

두 번째는 행동을 실행 가능한 프로그램 코드로 표현하는 접근입니다. LLM이 Python 함수나 API 호출 시퀀스를 생성하면, 이걸 로봇 런타임에서 직접 실행하는 거죠.

이건 언어 토큰보다 한 단계 더 구조적인데요, 자연어로 "블록을 세 개 쌓아라"라고 하는 대신에 for i in range(3): pick(block[i]); place(stack_position + i*height) 이런 식으로 코드를 짜는 겁니다. 반복문도 쓸 수 있고, 조건문도 쓸 수 있으니까 훨씬 정밀한 로직 표현이 가능해지죠.

대표 모델:

Code as Policies [36] (Liang et al., 2023): LLM이 로봇 API를 호출하는 Python 코드를 직접 생성합니다. 공간적 추론, 반복문, 조건문 등 프로그래밍 구조를 활용하여 복잡한 행동 시퀀스를 구성할 수 있어요.
ProgPrompt [102] (Singh et al., 2023): 과제를 프로그래밍적 형식(함수 호출, assert 문)으로 구조화하여 LLM의 계획 능력을 강화합니다.
Voyager [103] (Wang et al., 2023): Minecraft 환경에서 실행 가능한 코드를 생성하고, 성공한 코드를 기술 라이브러리에 저장하여 점진적으로 역량을 확장합니다. 일종의 자기학습 게임 에이전트인 거죠.
ChatGPT for Robotics [104] (Vemprala et al., 2024): 대화형 인터페이스를 통해 사용자의 의도를 로봇 제어 코드로 변환합니다.

장점: 구조적이고 재사용 가능하며, 디버깅이 용이합니다. 생성된 코드를 라이브러리로 축적하여 장기적으로 기술 기반(skill repertoire)을 확장할 수 있고요.

한계: 그런데 여기도 큰 제약이 있습니다. 사전 정의된 API(예: pick(obj), place(x, y, z))가 반드시 필요하거든요. API가 지원하지 않는 미세 운동(dexterous manipulation)은 표현할 수 없습니다. 달걀 껍질을 살살 까는 동작을 crack_egg_gently() 같은 API 없이는 만들 수가 없다는 겁니다.
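A code-token policy in the style of the stacking example above might look like the sketch below. `FakeRobot`, its `pick` and `place` methods, and the coordinates are hypothetical stand-ins for whatever predefined API a real robot runtime exposes; they are not an actual library.

```python
class FakeRobot:
    """Hypothetical robot runtime that only records the API calls it receives."""
    def __init__(self):
        self.log = []

    def pick(self, obj):
        self.log.append(("pick", obj))

    def place(self, x, y, z):
        self.log.append(("place", x, y, z))

def stack_blocks(robot, blocks, base=(0.4, 0.0, 0.0), height=0.05):
    """The kind of program an LLM could emit for 'stack the three blocks'."""
    x, y, z = base
    for i, block in enumerate(blocks):
        robot.pick(block)
        robot.place(x, y, z + i * height)   # each block goes one block-height higher

robot = FakeRobot()
stack_blocks(robot, ["block_a", "block_b", "block_c"])
for call in robot.log:
    print(call)
```

The sketch also makes the limitation visible: everything the policy can do is bounded by what `pick` and `place` can express.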

유형 3: 어포던스 토큰 (Affordance Tokens)

세 번째는 물체의 조작 가능 영역과 방식을 공간적으로 표현하는 접근입니다. "어디를" 잡아야 하는지, "어떤 방향으로" 밀어야 하는지를 3D 공간 상의 히트맵(heatmap)이나 벡터 필드(vector field)로 나타내는 거예요.

어포던스라는 개념이 뭐냐면, 제임스 깁슨의 생태 심리학에서 온 건데요, 환경이 행위자에게 제공하는 행동 가능성을 말합니다. 문손잡이를 보면 "잡고 돌릴 수 있다"는 정보가 시각적으로 주어지는 것처럼요.

대표 모델:

VoxPoser [57] (Huang et al., 2023): LLM과 VLM을 이용하여 3D 복셀(voxel) 공간에 어포던스 맵(가치 맵)과 제약 맵을 생성합니다. 이 맵을 기반으로 모션 플래너가 궤적을 합성하는 거죠.
A3VLM [58] (Huang et al., 2024): VLM을 확장하여 3D 포인트클라우드에서 직접 어포던스를 예측하게 합니다.
RT-Affordance [105] (Nasiriany et al., 2024): conditions the manipulation policy on visual affordances, which helps the policy generalize.
A0 [106] (Ren et al., 2025): 어포던스 기반의 통합 조작 프레임워크로, 물체-행동 간 관계를 명시적으로 모델링합니다.

핵심 가치: 어포던스 토큰이 왜 중요하냐면, "어디를" + "어떻게" 조작할지에 대한 중간 표현(intermediate representation)을 제공하기 때문입니다. 고수준 언어 계획("컵을 집어라")과 저수준 모터 제어(각 관절 각도) 사이의 의미론적 다리(semantic bridge) 역할을 하는 겁니다. 특히 새로운 물체에 대한 일반화에서 강점을 보이는데요, 처음 보는 머그잔이라도 손잡이 부분에 어포던스가 높게 나오면 그걸 잡으면 되니까요.

유형 4: 궤적 토큰 (Trajectory Tokens)

네 번째는 엔드이펙터(end-effector)의 시공간적 경로를 토큰 시퀀스로 표현하는 접근입니다. 2D 이미지 위에 미래 궤적을 스케치하거나, 3D 공간에서의 웨이포인트(waypoint) 시퀀스를 생성하는 거예요.

이걸 비유하면, 네비게이션에서 목적지를 찍으면 경로가 나오는 것과 비슷합니다. "어디를 거쳐서 어디로 가라"를 시각적으로 보여주는 거죠.

대표 모델:

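RT-Trajectory [62] (Gu et al., 2024): overlays a 2D trajectory sketch on the image and uses it as a visual prompt. The sketch can be drawn by a person or predicted by a model.
LATTE (Bucker et al., 2023): a language-trajectory transformer that maps a language command to a 3D trajectory sequence.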
TraceVLA [40] (Zheng et al., 2025): VLA 모델이 시각적 궤적 트레이스(trace)를 중간 표현으로 생성하여, 행동 예측의 해석 가능성과 정확도를 동시에 향상시킵니다.

비디오 사전학습과의 연결: 궤적 토큰은 비디오 예측 모델과 자연스럽게 연결되거든요. 비디오 프레임 시퀀스는 본질적으로 시각적 궤적의 연속이니까, 대규모 비디오 데이터로 사전학습된 모델의 시공간적 이해를 궤적 예측에 직접 전이할 수 있습니다. YouTube 비디오를 수백만 개 본 모델이 "물체가 이렇게 움직일 것이다"를 이미 알고 있는 셈이죠.

유형 5: 목표 토큰 (Goal Tokens)

다섯 번째는 미래 관찰(future observation)을 예측함으로써 목표 상태를 표현하는 접근입니다. "지금 이 상태에서 행동 후 세상이 어떻게 보일 것인가"를 이미지나 포인트클라우드로 생성하는 거예요.

이건 아주 흥미로운 접근인데요, 행동 자체를 출력하는 게 아니라, "행동 결과"를 출력하는 겁니다. 사람으로 치면, 체스를 둘 때 "비숍을 E4로 옮긴다"(행동)가 아니라 "이 수를 두면 보드가 이렇게 될 것이다"(결과 상태)를 상상하는 거죠.

대표 모델:

SuSIE [59] (Black et al., 2024): 현재 관찰과 언어 명령을 입력받아 미래 하위목표(subgoal) 이미지를 생성하고, 이를 따르는 저수준 정책을 학습합니다.
UniPi [93] (Du et al., 2024): 이건 아주 대담한 접근인데요, 로봇 계획을 비디오 생성 문제로 프레이밍합니다. 텍스트-투-비디오(text-to-video) 디퓨전 모델이 미래 프레임 시퀀스를 생성하면, 역동역학(inverse dynamics) 모델이 행동을 추출하는 구조입니다.
3D-VLA [37] (Zhen et al., 2024): 3D 표현 공간에서 미래 장면을 예측하여 깊이 있는 공간 이해를 반영한 목표를 생성합니다.
CoT-VLA [55] (Zhao et al., 2025): uses visual subgoals as a chain of thought, explicitly reasoning about "what the world should look like next" before deciding on an action.

월드 모델과의 통합: 목표 토큰이 특히 중요한 이유가 있는데요, 월드 모델(world model)과의 가장 자연스러운 접점을 제공하기 때문입니다. 미래를 시뮬레이션하여 목표를 설정하는 것은, 머릿속에서 "이렇게 하면 어떻게 될까?"를 상상하는 것과 본질적으로 같습니다. 계획(planning)과 실행(execution)을 통합하는 매우 강력한 경로인 거죠.

유형 6: 잠재 토큰 (Latent Tokens)

여섯 번째는 학습된 잠재 공간(learned latent space)에서 행동을 표현하는 접근입니다. 원시 행동 데이터를 오토인코더(autoencoder) 등으로 압축하여, 의미론적으로 풍부한 잠재 벡터로 변환하는 거예요.

이걸 비유하면, 음악의 악보와 비슷합니다. 실제 음파(연속적 공기 진동)를 직접 기록하는 대신, 음표라는 압축된 기호 체계로 표현하는 거죠. 이 기호 체계는 음악의 핵심 구조를 보존하면서도 훨씬 간결합니다.

대표 모델:

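LAPA [61] (Ye et al., 2025): uses a VQ-VAE (Vector Quantized Variational Autoencoder) to quantize the change between video frames into a latent action codebook, without requiring action labels. The VLA model is then trained to predict these latent action tokens.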
UniVLA [80] (Bu et al., 2025): learns a unified latent action space so that a single model can handle diverse robot embodiments and tasks.
VQ-VLA [107] (Qu et al., 2025): 벡터 양자화를 통해 행동 공간을 이산화하되, 잠재 공간의 구조를 보존하여 의미론적 일관성을 유지합니다.
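The vector-quantization step shared by the models above reduces to a nearest-neighbour lookup in a codebook. The sketch below assumes a tiny hand-written 2-D codebook and made-up latent vectors; in a real VQ-VAE both the encoder output and the codebook are learned.

```python
import numpy as np

# Hypothetical codebook with four 2-D entries; real codebooks are learned and much larger.
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

def quantize(latents, codebook):
    """Replace each latent vector by its nearest codebook entry and return the entry ids."""
    dists = np.linalg.norm(latents[:, None, :] - codebook[None, :, :], axis=-1)
    ids = dists.argmin(axis=1)       # discrete latent action tokens
    return ids, codebook[ids]

latents = np.array([[0.1, -0.2], [0.9, 0.8], [0.2, 0.7]])   # stand-in encoder outputs
ids, quantized = quantize(latents, codebook)
print(ids)
```

The integer ids are what the VLA predicts as tokens; a decoder then maps the corresponding codebook vectors back to embodiment-specific actions.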

체현 갭 극복의 열쇠: 잠재 토큰의 가장 혁신적인 가치가 뭐냐면요, 체현 갭(embodiment gap) 극복에 있습니다. 이게 왜 중요하냐면, 인간의 손 동작과 로봇 그리퍼의 동작은 물리적으로 전혀 다르잖아요. 인간은 손가락이 다섯 개지만 로봇 그리퍼는 두 개짜리 집게일 수도 있고요. 그런데 잠재 공간에서는 "물체를 잡는다"라는 의미론적 행동이 유사하게 표현될 수 있습니다. 이걸 통해 대규모 인간 비디오 데이터(Ego4D, Something-Something 등)에서 추출한 행동 지식을 로봇에 전이하는 것이 가능해집니다. 로봇 데이터가 만성적으로 부족한 상황에서, 인간 비디오라는 거대한 데이터 소스를 활용할 수 있게 되는 거죠.

도메인 불가지론적(domain-agnostic) 특성: 잠재 행동 공간은 특정 로봇의 관절 구성이나 행동 공간 차원에 의존하지 않으므로, 교차 체현(cross-embodiment) 학습의 자연스러운 매개체가 됩니다.

유형 7: 원시 행동 토큰 (Raw Action Tokens)

일곱 번째는 관절 각도, 엔드이펙터 포즈(위치 + 자세), 그리퍼 상태 등 저수준 행동 값을 직접 이산화(discretize)하여 토큰으로 변환하는 접근입니다. 가장 직접적이고 단순한 토큰화 방식인데요, 온도계의 연속 눈금을 "춥다/시원하다/따뜻하다/뜨겁다"처럼 칸으로 나누는 것과 같은 원리입니다.

대표 모델:

RT-2 [11] (Brohan et al., 2023): 7차원 행동 벡터(6DoF 포즈 + 그리퍼)의 각 차원을 256개 구간(bin)으로 균등 이산화하여, LLM의 어휘(vocabulary)에 추가합니다.
OpenVLA [15] (Kim et al., 2024): RT-2 [11]와 유사한 256-bin 이산화를 채택하되, 오픈소스로 재현 가능하게 구현했습니다.
Gato [13] (Reed et al., 2022): 다양한 과제의 행동을 1024개 구간으로 이산화하여 하나의 범용 모델로 처리합니다.

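The quantization error problem: the fundamental limitation of discretizing raw actions is quantization error. With 256 bins, each bin spans about 0.4% of the action range (1/256), so the rounding error per step is at most about 0.2% of the range. In precision manipulation these errors can accumulate and cause failure: in assembly work that needs millimetre-level precision, a small error at every step becomes a serious problem after only a few steps. Tokenizing each action dimension independently also discards the correlations between dimensions.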

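FAST's innovation: this is where FAST (Frequency-space Action Sequence Tokenization; Pertsch et al., 2025) comes in. The motivation is that RT-2-style binning needs 112 tokens to represent a single action chunk for a 7-DoF robot (7 dimensions x 16 steps), which makes autoregressive decoding far too slow. FAST transforms the chunk into the frequency domain with the discrete cosine transform (DCT) and then compresses it with byte pair encoding (BPE), reducing the token count by up to roughly 13x. The principle is similar to the way MP3 compresses music in the frequency domain. Specifically: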

시간적 상관관계를 DCT가 포착하여 정보 압축률 향상
BPE가 빈번한 행동 패턴을 단일 토큰으로 묶어 시퀀스 길이 단축
결과적으로 LLM의 기존 어휘 확장 메커니즘과 자연스럽게 통합

ICLR 2026에서는 FAST를 넘어선 차세대 토큰화 기법들이 제안되었습니다. FASTer는 RVQ(Residual Vector Quantization)에 주파수/시간 도메인 손실을 결합하여 더 높은 압축률과 재구성 품질을 동시에 달성했고요. OMNISAT는 B-Spline 인코더를 사용하여 매끄러운 장시간 행동 출력에 특화된 컴팩트 표현을 제공합니다.

유형 8: 추론 토큰 (Reasoning Tokens)

마지막 여덟 번째는 행동을 결정하기 전에 사고 과정을 명시적 토큰으로 생성하는 접근입니다. "왜 이 행동을 해야 하는가"를 먼저 추론하고, 그 추론 결과에 기반하여 행동을 예측하는 거예요.

이건 사람이 복잡한 일을 할 때 "음, 먼저 뚜껑을 열고... 그 다음에 컵을 잡고... 그리고 물을 따르면 되겠다" 하고 속으로 계획을 세우는 것과 같습니다. 그냥 반사적으로 행동하는 게 아니라, 생각을 한 다음에 행동하는 거죠.

대표 모델:

ECoT [92] (Zawalski et al., 2024): Embodied Chain-of-Thought. 행동 예측 전에 장면 설명, 과제 분석, 하위계획 등을 텍스트로 생성합니다.
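CoT-VLA [55] (Zhao et al., 2025): combines visual reasoning tokens (future subgoal images) with linguistic reasoning tokens to form a multimodal chain of thought.
ThinkAct [67] (Huang et al., 2025): introduces an adaptive reasoning mechanism that decides autonomously "when to think." This matters because thinking at every step is simply too slow.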
Embodied-R1 [108] (Liu et al., 2025): DeepSeek-R1 스타일의 장기 추론을 체현 과제에 적용하여, 복잡한 다단계 조작에서 자발적 추론 경로를 생성합니다.

성능 향상 효과: 추론 토큰의 효과가 상당히 큽니다. SC-VLA [56]의 실험에 따르면, 추론 토큰을 포함한 경우 행동 예측 품질이 약 35% 향상되었습니다. "생각한 후 행동하기"가 단순한 반사적 행동 대비 명확한 이점을 제공하는 거죠.

트레이드오프: 그런데 추론 토큰은 추가적인 토큰 생성을 요구하므로 추론 지연(latency)이 증가합니다. 생각하는 데 시간이 걸린다는 거잖아요. ThinkAct [67]이 "언제 생각할 것인가"를 학습하는 접근은 이 트레이드오프를 관리하려는 시도인데요 -- 컵을 그냥 집는 단순한 과제에서는 즉각 행동하고, 복잡한 조립 같은 상황에서만 추론을 활성화하는 겁니다. 시험 볼 때 쉬운 문제는 바로 풀고, 어려운 문제만 깊이 생각하는 것과 같은 전략이죠.

5.2 토큰화와 제어 주파수의 관계

자, 여기서 매우 중요한 테이블을 보겠습니다. 토큰화 방식은 VLA 모델의 제어 대역폭(control bandwidth)을 직접 결정하거든요. 이게 곧 해당 모델이 수행할 수 있는 과제의 범위를 본질적으로 규정합니다.

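Tokenization	Control frequency	Representative models	Suitable tasks
Language tokens	1-3 Hz	SayCan [14], Inner Monologue [22]	High-level planning, exploration
Autoregressive raw tokens	3-6 Hz	OpenVLA [15] (~6Hz), RT-2 [11]	Simple pick-and-place
Diffusion / Flow Matching + chunking	10-50 Hz	π0 [16] (flow matching, 20-50Hz)	Compliant manipulation, contact-rich tasks
Flow Matching + chunking	50-120 Hz	GR00T N1 [21] (~120Hz)	Agile manipulation, bimanual coordination
FAST + chunking	~50 Hz	π0-FAST	General-purpose manipulation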

이 표에서 드러나는 핵심 인사이트를 말씀드리겠습니다. 토큰화 방식이 제어 대역폭을 결정하며, 이는 수행 가능한 작업의 범위를 규정한다는 겁니다. 1-3Hz의 제어 주파수로는 "빨간 블록을 파란 그릇에 넣어라"와 같은 단순 과제는 수행할 수 있지만, 달걀을 깨뜨리지 않고 잡는 것은 불가능합니다. 50Hz 이상의 주파수가 되어야 비로소 힘 제어(force control)가 필요한 섬세한 조작이 가능해지거든요.

이게 왜 그런지 근본 원인을 보면:

자기회귀 디코딩은 토큰을 하나씩 순차적으로 생성하므로, 행동 차원 수에 비례하여 지연이 증가합니다. 7DoF면 7개 토큰을 하나하나 찍어내야 하니까요.
디퓨전/플로우 매칭 기반의 행동 청킹(action chunking)은 한 번의 디노이징(denoising) 과정으로 수십 스텝의 행동을 동시에 생성합니다. 추론 자체는 느릴 수 있지만, 청크 안의 행동을 고주파로 실행할 수 있는 거죠. 한번에 50스텝치 행동을 만들어 놓고 빠르게 실행하는 겁니다.
FAST는 DCT+BPE를 통해 행동 시퀀스를 압축된 소수의 토큰으로 표현하여, 자기회귀 방식임에도 높은 실효 주파수를 달성합니다.
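The arithmetic behind these three points can be written down directly. The per-token latency of 25 ms and the one-second chunk inference time below are hypothetical numbers chosen only to land in the ranges of the table; they are not measurements of any model.

```python
def ar_latency_s(n_tokens, per_token_s):
    """Latency of autoregressive decoding: tokens are produced one after another."""
    return n_tokens * per_token_s

def effective_hz(actions_per_inference, inference_s):
    """Actions made available per second of inference time."""
    return actions_per_inference / inference_s

# Per-step autoregressive policy: 7 tokens (one per action dimension) for ONE action.
ar_hz = effective_hz(1, ar_latency_s(n_tokens=7, per_token_s=0.025))

# Chunked generative policy: one (slower) inference yields 50 actions at once.
chunk_hz = effective_hz(50, inference_s=1.0)

print(round(ar_hz, 1), chunk_hz)
```

Even though a single chunked inference is slower than decoding one action, the number of actions it makes available per second is much higher, which is the effect described above.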

5.3 토큰화 선택이 성능에 미치는 영향

이산 vs 연속: 다중모달 분포 문제

토큰화 선택이 성능에 미치는 가장 심각한 영향은 다중모달 분포(multimodal distribution) 상황에서 나타납니다. 구체적인 예를 들어보겠습니다.

테이블 위의 물체를 왼쪽이나 오른쪽 어느 방향으로든 밀어도 되는 상황을 생각해 보세요. 이 경우 올바른 행동 분포는 이봉 분포(bimodal distribution) -- 왼쪽과 오른쪽에 각각 확률 질량이 집중된 형태 -- 를 갖습니다.

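The problem is that when a policy is trained on discrete raw tokens (256 bins) with the standard cross-entropy loss and then decoded to a single action, it tends to end up near the average of the two modes. With a 50% chance of pushing left and a 50% chance of pushing right, the model decides to "push down the middle." That action belongs to neither correct behaviour and is simply wrong. This is called the mode-averaging problem (not to be confused with mode collapse, where a model covers only one of the modes), and it becomes more severe as the demonstrations become more diverse.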

비유하자면, "서울에서 부산을 갈 때 경부고속도로나 중부내륙고속도로 중 하나를 타세요"라는 시연 데이터를 보고, 모델이 "그럼 두 도로의 중간 지점을 직진하겠습니다" 하고 논밭을 달리는 것과 같습니다.

디퓨전 기반 연속 행동 생성은 이 문제에 대한 자연스러운 해법을 제공합니다. 디퓨전 모델은 본질적으로 다중모달 분포를 표현할 수 있어서, 이봉 분포의 양쪽 모드를 모두 포착하거든요. 이것이 π0 [16], GR00T N1 [21] 등이 디퓨전/플로우 매칭을 채택한 핵심 이유 중 하나입니다.
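A tiny numerical example shows the difference between collapsing a bimodal distribution to a point estimate and sampling from it. It uses the cleanest case of averaging, a squared-error point estimate, on synthetic left/right demonstrations; the data and the encoding of "left" as -1 and "right" as +1 are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic demonstrations: push left (-1.0) or push right (+1.0), roughly half each.
demos = rng.choice([-1.0, 1.0], size=1000)

# A single point estimate that minimises squared error is the mean of the data,
# which lies between the two modes and matches neither demonstrated behaviour.
point_estimate = demos.mean()

# A generative policy samples from the distribution, so it lands on one of the modes.
sampled_action = rng.choice(demos)

print(point_estimate, sampled_action)
```

The point estimate ends up close to 0 ("push down the middle"), while the sampled action is always one of the two valid pushes.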

행동 공간과 디코딩의 공결정

토큰화 방식과 디코딩 전략은 독립적으로 선택되는 것이 아니라 공결정(co-determined)됩니다. 이 점을 반드시 이해하셔야 합니다.

이산 행동 공간 <-> 자기회귀 디코딩: 행동을 이산 토큰으로 표현하면 LLM의 기존 언어 모델 헤드(language model head)를 그대로 재사용할 수 있습니다. RT-2 [11], OpenVLA [15]가 이 경로를 택했는데요, 장점은 아키텍처 수정이 최소화되어 LLM의 사전학습 지식을 최대한 보존할 수 있다는 것입니다.
연속 행동 공간 <-> 디퓨전/플로우 기반 생성 헤드: 행동을 연속 벡터로 유지하면 전용 생성 헤드(diffusion head, flow matching head)가 필요합니다. π0 [16], GR00T N1 [21]이 이 경로를 택했고요. 장점은 다중모달 분포 표현과 높은 제어 주파수이며, 단점은 LLM 백본과의 통합에 추가적인 설계가 필요하다는 것입니다.

최근에는 이 두 경로를 결합하려는 시도도 나타나고 있습니다. VQ-VLA는 벡터 양자화를 통해 연속 행동을 이산 토큰으로 변환하되, 잠재 공간의 구조를 보존하여 두 세계의 장점을 결합하려 합니다.

행동 청킹의 트레이드오프

행동 청킹(action chunking)은 한 번의 추론으로 여러 타임스텝의 행동을 동시에 예측하는 기법인데요, ACT (Zhao et al., 2023)에서 제안된 이래 VLA 설계의 핵심 요소로 자리잡았습니다.

청크 크기(chunk size)를 증가시키면 세 가지가 동시에 변합니다:

추론 빈도가 감소하여 계산 효율이 향상됩니다. 청크 크기가 16이면 추론 횟수가 1/16로 줄어드니까요.
궤적 일관성이 향상됩니다. 매 스텝 독립적으로 예측하면 궤적이 "덜덜덜" 떨릴 수 있는데, 청크 단위로 예측하면 시간적 일관성이 보장되거든요.
그러나 환경 변화에 대한 반응성이 감소합니다. 16스텝짜리 청크를 실행하는 도중에 갑자기 고양이가 뛰어들면? 남은 스텝을 다 실행할 때까지 대응을 못 합니다.

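To manage this trade-off, systems such as π0 [16] tune the chunk size to the characteristics of the task, and ACT-style temporal ensembling averages the overlapping predictions of consecutive chunks so that the executed actions stay smooth.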
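Temporal ensembling can be sketched as an exponentially weighted average over all predictions made for the same timestep. The weighting w_i = exp(-m * i), with i = 0 for the oldest prediction, follows the scheme described in the ACT paper; the value of `m` and the example predictions are assumptions.

```python
import numpy as np

def temporal_ensemble(predictions, m=0.1):
    """Average the actions predicted for one timestep by several overlapping chunks.

    predictions: list of action vectors, ordered from oldest to newest prediction.
    """
    preds = np.asarray(predictions, dtype=float)
    weights = np.exp(-m * np.arange(len(preds)))   # w_i = exp(-m * i), oldest first
    weights /= weights.sum()
    return weights @ preds

# Two chunks disagree about the action for the current timestep.
blended = temporal_ensemble([[0.0], [1.0]])
# When every chunk agrees, the ensemble returns that same action.
agreed = temporal_ensemble([[1.0], [1.0], [1.0]])
print(blended, agreed)
```

A smaller `m` spreads the weight more evenly across old and new predictions (smoother but slower to react), while a larger `m` concentrates it on a single prediction.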

5.4 소결: 토큰화 관점의 통합적 이해

8가지 액션 토큰 유형을 추상화 수준(abstraction level)의 스펙트럼으로 정리하면 이렇게 됩니다:

높은 추상화  <------------------------------------->  낮은 추상화
언어 -> 코드 -> 추론 -> 목표 -> 궤적 -> 어포던스 -> 잠재 -> 원시

왼쪽으로 갈수록 인간에게 해석 가능하고 일반화가 뛰어나지만 제어 정밀도가 낮으며, 오른쪽으로 갈수록 정밀하지만 일반화와 해석 가능성이 떨어집니다. 현대의 가장 성공적인 VLA 시스템은 이 스펙트럼의 여러 수준을 계층적으로 결합합니다. 예컨대 추론 토큰(높은 추상화)으로 계획을 세운 후, 원시 행동 토큰(낮은 추상화)으로 실행하는 구조인 거죠.

이 관점에서 VLA 연구의 핵심 질문은 "어떤 토큰 유형이 최선인가?"가 아니라, "어떤 과제와 체현에 대해, 어떤 토큰 유형의 조합이 최적인가?"입니다. 정답은 하나가 아닙니다.

참고로, 위 추상화 순서는 하나의 관점이며, 도메인과 태스크에 따라 어포던스와 궤적의 상대적 추상화 수준이 달라질 수 있습니다.

6. 학습 패러다임의 진화

자, 이제 VLA 모델을 어떻게 학습시키느냐의 문제로 넘어가겠습니다. VLA 모델의 학습은 단순한 지도 학습을 넘어, 사전학습에서 후처리(post-training)까지 이르는 다층적 과정으로 진화하고 있는데요, 이 진화의 각 단계를 체계적으로 살펴보고, 인간의 운동 학습 이론과의 흥미로운 병행 관계도 탐구하겠습니다.

6.1 사전학습 -- 인터넷에서 로봇으로

VLA 모델의 사전학습은 2단계 공동 학습(two-phase joint training)이 사실상 표준으로 자리잡았습니다.

Phase 1: 인터넷 스케일 이미지-텍스트 사전학습

첫 번째 단계에서는 인터넷에서 수집한 대규모 이미지-텍스트 데이터로 비전-언어 모델(VLM)을 학습합니다. LAION-5B(50억 이미지-텍스트 쌍), COCO, Visual Genome 등의 데이터셋을 사용하는데요, 50억 쌍이라는 규모를 생각해 보세요. 인터넷에 올라온 거의 모든 종류의 이미지와 그에 대한 설명을 본 셈입니다.

이 단계에서 모델이 학습하는 것은 로봇 제어 자체가 아니라, 그 전제가 되는 세계 이해입니다:

물체 인식과 분류 ("이것은 머그잔이다")
공간적 관계 이해 ("머그잔이 테이블 위에 있다")
물리적 속성 추론 ("유리잔은 깨질 수 있다")
상식적 행동 지식 ("머그잔을 마시려면 손잡이를 잡는다")

이걸 비유하면, 아기가 태어나서 처음 몇 년간 세상을 관찰하면서 물리 법칙을 체득하는 과정과 비슷합니다. 아직 직접 물건을 능숙하게 다루진 못하지만, "이건 단단하다", "이건 무겁다", "이건 굴러간다"를 이미 알고 있는 거죠.

Phase 2: 로봇 궤적 데이터 미세조정

두 번째 단계에서는 실제 로봇 궤적(trajectory) 데이터로 모델을 미세조정(fine-tuning)합니다. 핵심 데이터셋들을 보겠습니다:

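Open X-Embodiment (OXE) [19]: a cross-embodiment dataset covering 22 robot embodiments and more than 1 million episodes. It is currently the de facto standard data source for VLA training, pooling data from 22 different robots performing hundreds of tasks.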
BridgeData V2: WidowX 로봇의 다양한 환경에서의 조작 데이터로, 약 60,000 궤적이 포함되어 있습니다.
RT-1 [12] 데이터: Google의 Everyday Robots에서 수집한 대규모 단일-형태 데이터셋입니다.

이 2단계 구조의 핵심 통찰은 의미론적 사전지식과 감각운동 기술의 분리입니다. 세계를 이해하는 능력(Phase 1)과 세계에서 행동하는 능력(Phase 2)은 서로 다른 데이터 소스에서 효율적으로 학습될 수 있다는 거죠. 마치 해부학 교과서를 먼저 공부한 다음(Phase 1), 수술실에서 실습하는 것(Phase 2)과 같습니다.

비디오 사전학습의 부상

최근에는 이미지-텍스트 데이터를 넘어 비디오 데이터를 사전학습에 활용하는 흐름이 가속화되고 있습니다:

GR-2 [100] (Cheang et al., 2024): 웹스케일 비디오로 사전학습한 비디오 생성 모델을 로봇 정책의 기초로 활용합니다. 비디오에 내재된 물리적 역학(dynamics) 이해가 로봇 제어에 전이되는 거예요.
자아중심 비디오 (Ego4D, EPIC-Kitchens): 인간이 직접 조작하는 1인칭 시점 비디오는 로봇의 시점과 유사하여, 특히 조작 과제에 대한 풍부한 사전지식을 제공합니다.

비디오 사전학습이 이미지-텍스트 사전학습에 비해 갖는 결정적 장점이 뭐냐면요, 시간적 역학(temporal dynamics)의 이해입니다. 이미지는 정적 장면의 이해를 제공하지만, 비디오는 "이 행동 이후에 세상이 어떻게 변하는가"에 대한 이해를 제공합니다. 사진만 보고 수영을 배울 수 없지만, 영상을 보면 물속에서의 동작 흐름을 이해할 수 있는 것과 같은 차이입니다.

시뮬레이션 데이터의 역할

UniSim [101] (Yang et al., 2023): 행동 조건부 비디오 디퓨전 모델로, 가상 환경에서의 상호작용 시뮬레이션을 통해 무한한 학습 데이터를 생성합니다.
Genesis (Xian et al., 2024): GPU 가속 물리 시뮬레이터로, 현실과 유사한 물리적 상호작용 데이터를 대규모로 생성합니다.

시뮬레이션 데이터의 주요 과제는 심-투-리얼 갭(sim-to-real gap)입니다. 시뮬레이션에서는 완벽하게 동작하는 정책이 실제 환경에서는 실패하는 거죠. 왜냐하면 시뮬레이션의 물리 엔진이 현실의 모든 것을 완벽히 재현하지 못하니까요. 마찰력, 변형, 조명 조건 등이 다 다릅니다. 도메인 랜덤화(domain randomization), 도메인 적응(domain adaptation) 등의 기법이 이 갭을 줄이기 위해 활발히 연구되고 있습니다.

스케일링 법칙

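Scaling laws are being observed in VLA training as well. Zhang et al. [6] indicate that doubling the trajectory data improves task success rate by roughly 8-12% (this figure is quoted from the underlying original paper; the survey itself does not state an exact number). This suggests that securing more data translates directly into better performance, and it supports the importance of large-scale data collection infrastructure such as OXE [19] and DROID.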

다만, 데이터 스케일링만으로는 한계가 있거든요. 궤적 데이터의 질(quality), 다양성(diversity), 커버리지(coverage)가 양(quantity) 못지않게 중요하며, 단순히 데이터 양을 늘리는 것보다 더 효율적인 학습 알고리즘의 개발도 병행되어야 합니다.

6.2 모방학습(Behavioral Cloning)의 한계

모방학습(Behavioral Cloning, BC)은 전문가 시연을 지도 학습 방식으로 따라하는 가장 기본적인 정책 학습 방법입니다. 관찰-행동 쌍 $(o_t, a_t)$이 주어지면, 정책 $\pi(a|o)$를 최대우도 추정(MLE)으로 학습합니다. VLA 모델의 대부분은 이 BC 프레임워크 위에 구축되어 있는데요.
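
As a sketch of the objective just described, maximum-likelihood BC over a demonstration set can be written as below. The symbols $\mathcal{D}$ (the demonstration dataset) and $\theta$ (the policy parameters) are notation assumed here for illustration; they do not come from the surveys.

$$\hat{\theta} = \arg\max_{\theta} \sum_{(o_t, a_t) \in \mathcal{D}} \log \pi_{\theta}(a_t \mid o_t)$$

For discretized action tokens this is the usual cross-entropy loss on the next token; for continuous actions under a fixed-variance Gaussian policy it reduces to mean-squared error on the action.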

BC를 한마디로 요약하면, "선생님이 푸는 걸 보고 따라 푸는 것"입니다. 선생님보다 잘할 수 없고, 안 본 문제는 못 푸는 거죠. 이게 근본적인 한계들을 만들어냅니다.

분포 이동(Distribution Shift)

학습 시 모델이 보는 상태 분포와 실행 시 모델이 마주치는 상태 분포가 다릅니다. 전문가 시연에서는 전문가의 정책 $\pi^*$가 생성한 상태 분포를 따르지만, 실행 시에는 학습된 (불완전한) 정책 $\hat{\pi}$가 생성한 상태 분포를 따르거든요. 이 분포 불일치는 학습 데이터에서 벗어난 상태에서의 예측 불가능한 행동으로 이어집니다.

이걸 비유하면, 운전을 항상 강남대로에서만 배운 사람이 갑자기 비포장 산길에 가면 어떻게 될까요? 배운 적 없는 상황이라 대처를 못 하는 겁니다.

Compounding Errors

이게 정말 심각한 문제인데요, 각 타임스텝에서의 작은 예측 오류가 시간이 지남에 따라 누적됩니다. 한 스텝에서의 약간의 위치 오차가 다음 스텝에서는 더 큰 오차를 초래하고, 이것이 연쇄적으로 증폭되어 장기 과제에서 치명적 실패를 야기합니다.

비유하면, 복사기로 문서를 복사하고, 그 복사본을 다시 복사하고, 또 그걸 복사하면... 100번째 복사본은 원본과 전혀 다르게 되잖아요. 각 단계의 미세한 열화가 누적되는 거죠. 30초 동안 진행되는 복잡한 조립 과제에서는 초반의 미세한 오차가 후반에 완전한 실패로 귀결될 수 있습니다.
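
One hedged way to make this accumulation concrete is the classical imitation-learning analysis of Ross and Bagnell (2010), which lies outside the 14 surveys. Assume that on states visited by the expert $\pi^*$, the learned policy $\hat{\pi}$ picks a different action with probability at most $\epsilon$, and that the per-step cost lies in $[0, 1]$. Writing $J(\cdot)$ for the expected total cost over a horizon of $T$ steps (notation assumed here):

$$J(\hat{\pi}) \le J(\pi^*) + T^2 \epsilon$$

In ordinary supervised learning with i.i.d. inputs the gap would grow only as $T\epsilon$. The extra factor of $T$ is the price of drifting into states the expert never visited, which is why long-horizon tasks suffer disproportionately.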

차선 시연에서의 개선 불가

BC는 본질적으로 시연의 상한(upper bound)에 제약됩니다. 시연 자체가 최적이 아니거나 노이즈가 포함된 경우, 모델은 그 수준을 넘어설 수 없어요. 더 나은 행동을 발견하는 메커니즘이 부재하기 때문입니다.

안전/선호 신호의 부재

BC는 "무엇을 해야 하는가"만 학습하고, "무엇을 하지 말아야 하는가"나 "어떤 행동이 더 선호되는가"에 대한 신호를 활용하지 못합니다. 안전 제약(예: 인간 근처에서의 속도 제한)이나 사용자 선호(예: 부드러운 동작 선호)를 명시적으로 반영할 수 없는 거죠.

핵심 결론

BC는 VLA 학습의 필요조건이지만 충분조건은 아닙니다. 대규모 시연 데이터를 효율적으로 활용하는 데 BC는 불가결하지만, 그 한계를 극복하기 위한 후처리(post-training)의 필요성이 점점 더 명확해지고 있습니다.

6.3 강화학습 후처리

자, 이 부분이 최근 VLA 연구에서 가장 뜨거운 주제입니다. Jin et al. [9] (2025)을 중심으로, BC 이후에 강화학습(RL)으로 성능을 한 단계 더 끌어올리는 연구가 폭발적으로 증가하고 있거든요.

이게 왜 중요하냐면, 대규모 언어 모델(LLM) 분야의 발전 경로와 정확히 병행하기 때문입니다. LLM에서 SFT(Supervised Fine-Tuning) 이후 RLHF(Reinforcement Learning from Human Feedback)로 모델을 정렬(alignment)하는 패러다임이 ChatGPT의 성공을 이끌었잖아요. VLA에서도 BC(SFT에 해당) 이후 RL(RLHF에 해당)로 한 단계 더 가는 겁니다.

비유하면, BC는 교과서로 기본기를 익히는 단계이고, RL 후처리는 실전 연습과 코치 피드백을 통해 교과서 수준을 넘어서는 단계입니다.

온라인 RL (Online Reinforcement Learning)

모델이 실제 환경(또는 시뮬레이션)에서 직접 상호작용하며 보상을 받아 학습합니다:

PPO 기반:
VLA-RL [68] (Tan et al., 2025): VLA 모델에 PPO를 적용하여 온라인 환경 상호작용으로 성능 향상
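RIPT-VLA [71] (Su et al., 2025): A reinforcement-learning-based interactive post-training method. The results are striking: according to the original paper (Su et al., 2025), on a specific task the policy starts from a 4% SFT success rate and reaches 97% after 15 PPO iterations, going from almost never succeeding to almost always succeeding. Note, however, that in the LIBERO benchmark comparison by Jin et al. [9] it averages 74.7%, so the reported performance varies widely by benchmark.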
iRe-VLA (Xu et al. [8], 2025): 반복적 RL을 통해 점진적으로 정책을 개선

GRPO 기반:
ThinkAct [67] (Xu et al. [8], 2025): Group Relative Policy Optimization을 적용하여 추론과 행동을 동시에 강화
TGRPO [66] (Li et al., 2025): 추론 일관성 보상을 포함하는 확장된 GRPO
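
To illustrate what "group relative" means in GRPO, here is a minimal sketch of the advantage computation, assuming several rollouts of the same task and instruction are scored with a scalar reward. The function name and the choice of population standard deviation are assumptions made for illustration; this is not the code of ThinkAct or TGRPO.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Score each rollout relative to the other rollouts in its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # Rollouts better than the group average get a positive advantage.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four rollouts of the same task: two succeed (1.0), two fail (0.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Because the baseline is the group mean, no separate value network (critic) is needed, which is the main practical difference from PPO.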

오프라인 RL (Offline Reinforcement Learning)

기존에 수집된 데이터만으로 정책을 개선합니다. 추가 환경 상호작용 없이도 차선 시연에서 더 나은 행동을 추출할 수 있는데요, 이게 실무적으로 매우 매력적입니다. 로봇을 실제로 돌리는 건 비용이 많이 드니까요.

PA-RL: CalQL(Calibrated Q-Learning) 기반 재순위화를 통해 오프라인 데이터에서 최적 궤적을 선별
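ConRFT [69] (Li et al., 2025): Reinforced fine-tuning built on a consistency policy. It begins with an offline stage on pre-collected data and then continues with online fine-tuning, so it straddles the offline/online boundary rather than being purely offline.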

선호 최적화 (Preference Optimization)

인간의 선호를 직접 학습 신호로 활용하는 접근입니다:

HAPO [84] (Li et al., 2025): DPO(Direct Preference Optimization)를 VLA에 적용합니다. 쌍별(pairwise) 궤적 비교를 통해 선호되는 행동 패턴을 학습하는 건데요, "A 궤적과 B 궤적 중에 어느 게 나아요?"라고 물어보는 방식이죠.
RAPL [83] (Tian et al., 2025): 시각적 선호 인코딩(visual preference encoding)을 통해, 인간이 비디오 클립을 비교하는 것만으로 보상 함수를 학습합니다. 두 비디오를 보여주고 "이게 더 나아요" 하나만 골라주면 되는 거예요.
GRAPE [109] (Wang et al., 2025): 다중 스케일 선호 학습 -- 궤적 수준, 세그먼트 수준, 스텝 수준에서의 선호를 동시에 반영합니다.
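
The pairwise comparison behind DPO can be sketched as a loss over one preferred and one rejected trajectory. This is the generic DPO objective, assuming the trajectory log-probabilities under the current policy and under a frozen reference policy are already available; the argument names and `beta=0.1` are illustrative assumptions, not the settings of HAPO or GRAPE.

```python
import math

def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """DPO loss for one (preferred, rejected) trajectory pair."""
    # How much more the policy favors the preferred trajectory than the
    # reference policy does, scaled by beta.
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    # -log(sigmoid(margin)) written as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))

# Policy identical to the reference: no preference has been learned yet.
loss_before = dpo_loss(-10.0, -10.0, -10.0, -10.0)
# Policy has shifted probability toward the preferred trajectory.
loss_after = dpo_loss(-8.0, -12.0, -10.0, -10.0)
```

The loss drops as the policy raises the preferred trajectory's likelihood relative to the reference, without ever fitting an explicit reward model.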

보상 설계의 스펙트럼

RL 후처리의 핵심 도전 과제는 적절한 보상 함수의 설계입니다. 보상 함수가 잘못되면 RL이 엉뚱한 방향으로 학습하게 되거든요. 현재까지 제안된 보상 유형은 다음과 같습니다:

보상 유형	특성	대표 방법
과제 성공 보상 (이진/희소)	성공=1, 실패=0. 설계 간단하지만 학습 효율 낮음	대부분의 온라인 RL 방법
VLM 생성 밀집 보상	VLM이 자동으로 중간 보상 함수 생성. 인간 설계 불필요	IKER [85]
선호 기반 보상 (RLHF 스타일)	인간 비교 피드백에서 보상 학습	HAPO, RAPL [83]
안전 제약 보상	안전 위반에 대한 페널티 부여	SafeVLA [75]
추론 일관성 보상	추론 과정과 행동 결과의 일관성 보상	TGRPO [66], ThinkAct [67]

이진/희소 보상이 왜 비효율적인지 비유를 들어보겠습니다. 미로를 탈출하는데 보상이 "탈출 성공=1, 그 외=0"밖에 없으면, 출구를 우연히 찾기 전까지 모든 행동이 똑같이 0점이니까 뭘 개선해야 할지 알 수 없잖아요. 반면 VLM 생성 밀집 보상은 "출구에 좀 더 가까워졌네, +0.1" 같은 중간 피드백을 주니까 학습이 훨씬 빠른 겁니다.

핵심 성과 수치

RL 후처리의 효과를 보여주는 인상적인 수치들입니다:

RIPT-VLA [71]: 원 논문(Su et al., 2025) 기준 특정 태스크에서 SFT 4% -> PPO 15회 반복 후 97% 성공률 (Jin et al. [9] LIBERO 기준 평균 74.7%)
SimpleVLA-RL [70]: 17.3% -> 91.7% (과제당 궤적 단 1개로) (원논문 직접 인용; 14개 서베이 외 출처)
이러한 수치들이 보여주는 것은, RL 후처리가 "부가적 개선"이 아니라 본질적 성능 도약을 가져올 수 있다는 점입니다.

ICLR 2026의 자기 개선 잔차 RL 방법들은 LIBERO에서 99% 성공률에 도달하여, RL 후처리의 잠재력이 벤치마크 포화 수준까지 끌어올릴 수 있음을 보여주었습니다. 단계 인식 강화학습(stage-aware reinforcement)은 태스크를 의미론적 구성 요소로 분해하여 각 단계별로 최적화하는 새로운 접근입니다.

BC->RL 전환 불안정성 해결

실무적으로 가장 큰 도전이 뭐냐면요, BC로 초기화된 모델에 RL을 적용할 때의 학습 불안정성입니다. RL 업데이트가 BC에서 학습된 유용한 행동 패턴을 파괴(catastrophic unlearning)할 수 있거든요. 비유하면, 피아노 기본기를 배운 사람이 재즈 즉흥연주를 연습하다가 기본기까지 잊어버리는 것과 같습니다.

이를 해결하기 위한 전략들이 있습니다:

BC 손실 정규화: RL 목적함수에 BC 손실을 정규화 항으로 추가하여, BC에서 학습된 기본 능력이 보존되도록 합니다. "기본기를 잊지 마라"라는 제약을 거는 거죠.
VL 인코더 동결: 비전-언어 인코더의 가중치를 고정하고 정책 헤드만 RL로 업데이트하여, 사전학습된 시각-언어 이해 능력을 보존합니다.
이중-Q/앙상블 크리틱: 가치 함수 추정의 과대평가(overestimation)를 억제하여 학습 안정성을 확보합니다.
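
The first strategy, BC loss regularization, can be written as a combined objective. This is a generic sketch: $\lambda$ (the regularization weight), $\mathcal{D}$ (the demonstration set), and $\mathcal{L}_{\mathrm{RL}}$ (whichever RL loss is being used, for example the PPO surrogate) are notation assumed here rather than taken from a specific paper.

$$\mathcal{L}(\theta) = \mathcal{L}_{\mathrm{RL}}(\theta) + \lambda \, \mathbb{E}_{(o, a) \sim \mathcal{D}} \left[ -\log \pi_{\theta}(a \mid o) \right]$$

A larger $\lambda$ keeps the policy close to the demonstrations and stabilizes training, at the cost of limiting how far RL can move beyond them.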

6.4 인간 운동학습과의 병행

Jin et al. [9] (2025)이 아주 흥미로운 분석을 했는데요, VLA 학습 패러다임이 인간의 운동학습(motor learning) 이론과 놀라울 정도로 유사한 구조를 가지고 있다는 겁니다. 이 비유는 단순한 은유를 넘어, VLA 연구의 미래 방향에 대한 실질적인 통찰을 제공합니다.

Newell의 제약-주도 이론 (1986)

Karl Newell은 운동 행동이 세 가지 제약의 상호작용으로 출현한다고 주장했습니다. 이 프레임워크가 VLA 설계와 직접적으로 대응되거든요:

환경 제약 (Environmental Constraints):

인간: 중력, 마찰, 물체의 물리적 속성 등
VLA: 어포던스 인식, 지각 강화 모듈 -> 환경의 물리적 제약을 모델에 인코딩

유기체 제약 (Organismic Constraints):

인간: 신체 크기, 근력, 관절 가동 범위 등
VLA: 체현 인식(embodiment awareness) -> 순운동학(forward kinematics), 역운동학(inverse kinematics) 학습

과제 제약 (Task Constraints):

인간: 과제의 목표, 규칙, 시간 제한 등
VLA: 계층적 과제 분해, Chain-of-Thought 추론 -> 복잡한 과제를 관리 가능한 하위과제로 분해

신경과학적 대응

VLA의 각 구성 요소가 인간 뇌의 어떤 시스템과 대응되는지 보겠습니다. 이 테이블이 매우 인상적입니다:

뇌 시스템 / 메커니즘	기능	VLA 대응 요소
게놈(유전적 사전지식)	선천적 운동 능력의 기초	인터넷 스케일 사전학습
기술 습득(연습을 통한 학습)	구체적 운동 기술의 숙달	RL 후처리, 과제별 미세조정
소뇌 순방향 모델	행동 결과 예측	순운동학 학습, 월드 모델
기저핵 청킹	운동 시퀀스의 자동화	행동 청킹 (Action Chunking)
전문가 코칭	외부 피드백을 통한 교정	인간-로봇 상호작용(HRI)
보상 예측 오류(기저핵 도파민 시스템)	기대와 결과의 차이 신호	RL 보상 신호 (TD 오류)
내부 월드 모델	환경의 심적 시뮬레이션	시각적 상호작용 예측(VIP)

여기서 특히 주목할 만한 것이 두 가지 있습니다.

첫째, 기저핵 청킹(basal ganglia chunking)과 행동 청킹(action chunking)의 유사성입니다. 피아노를 처음 배울 때는 "도-레-미" 각 음을 하나하나 의식적으로 눌러야 하지만, 충분히 연습하면 "도레미파솔라시도" 스케일 전체가 하나의 자동화된 "청크"가 됩니다. VLA 모델이 여러 타임스텝의 행동을 하나의 청크로 묶어 생성하는 메커니즘과 놀랍도록 유사한 거죠.

둘째, 보상 예측 오류(기저핵의 도파민 시스템)와 RL의 시간차(TD) 오류의 대응입니다. 이건 우연이 아니거든요. 두 시스템 모두 "기대했던 것과 실제 결과의 차이"를 학습 신호로 사용하여 행동을 점진적으로 개선합니다. 실제로 TD 학습 알고리즘이 도파민 뉴런의 발화 패턴과 일치한다는 것은 1990년대 Schultz 등의 연구에서 이미 밝혀진 바 있습니다.
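
For reference, the TD error mentioned above has the standard form below, where $V$ is the learned value function, $r_t$ the reward, and $\gamma$ the discount factor. The state symbol $s_t$ is assumed here; elsewhere this document writes observations as $o_t$.

$$\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)$$

A positive $\delta_t$ means the outcome was better than expected, and a negative one means it was worse, which is the same "expectation versus outcome" signal attributed to dopamine neurons.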

이 비유의 실용적 함의

이 병행 관계가 그저 학문적 흥미거리가 아니라, VLA 연구의 미래 방향을 실질적으로 제안합니다:

인간이 신체 도식(body schema)을 유연하게 확장하는 능력(도구를 사용하면 그게 몸의 연장이 되는 현상)은 VLA의 교차 체현 일반화 연구에 영감을 줍니다.
인간의 운동 기억(motor memory)이 수면 중 강화되는 현상은 오프라인 RL 및 리플레이(experience replay)의 중요성을 시사합니다.
인간이 관찰만으로도 운동 기술을 학습하는 능력(거울 뉴런 시스템)은 인간 비디오에서의 잠재 행동 학습과 직접 연결됩니다.

6.5 자기개선과 평생학습

VLA 시스템이 실제 환경에 배포된 후에도 지속적으로 성능을 개선해 나가는 능력은, 실용적 관점에서 가장 중요한 연구 방향 중 하나입니다. 실험실에서 잘 되는 것만으로는 부족하거든요.

자율 데이터 수집

SOAR (Fan et al., 2025): 파운데이션 모델(VLM, LLM)이 가이드하는 자율적 데이터 수집 프레임워크입니다. 모델이 스스로 "어떤 데이터가 부족한지"를 판단하고, 해당 영역의 데이터를 자율적으로 수집합니다.
핵심 아이디어: 능동 학습(active learning)의 체현 버전입니다. 모델의 불확실성이 높은 상황을 자동으로 탐색하고 경험하는 거죠. 시험공부를 할 때 "내가 약한 단원"을 집중적으로 공부하는 것과 같은 원리입니다.

온라인 자기개선

RoboCat [110] (Bousmalis et al., 2024): 자기개선 루프(self-improvement loop)를 구현한 선구적 시스템입니다. 모델이 생성한 궤적 중 성공한 것들을 학습 데이터에 추가하여 반복적으로 개선합니다. "자기가 잘한 것을 기억하고 더 잘하게 되는" 선순환 구조인 거죠.
VLA-RL [68] (Tan et al., 2025): 온라인 RL을 통해 배포 후에도 환경 상호작용으로부터 지속적으로 학습합니다.

자기개선의 핵심 도전은 자기강화 편향(self-reinforcement bias)입니다. 모델이 자신의 (불완전한) 출력을 학습 데이터로 사용하면, 기존의 오류나 편향이 증폭될 수 있거든요. GPT가 자기가 쓴 글을 훈련 데이터로 쓰면 점점 이상해지는 것과 같은 문제입니다. 이를 방지하기 위해 품질 필터링, 다양성 보장 메커니즘, 인간 개입(human-in-the-loop) 등이 필요합니다.

평생학습의 핵심 과제: 치명적 망각

VLA 시스템이 새로운 과제나 환경에 적응할 때 직면하는 가장 심각한 문제는 치명적 망각(catastrophic forgetting)입니다. 새로운 데이터로 미세조정하면 이전에 학습한 능력이 손실되는 현상인데요.

VLA에서의 치명적 망각의 구체적 양상:

VL 인코더를 완전히 해동(unfreeze)하여 미세조정하면, 인터넷 스케일 사전학습에서 획득한 풍부한 시각-언어 이해 능력이 점진적으로 손실됩니다.
특정 환경에 과적합(overfit)되면, 다른 환경에서의 일반화 능력이 저하됩니다.
특정 로봇 형태에 특화되면, 교차 체현 전이 능력이 약화됩니다.

비유하면, 프랑스어를 유창하게 하던 사람이 일본어를 집중적으로 배우다 보면 프랑스어가 서서히 녹슬어 가는 것과 같습니다. 새 것을 배우면서 옛 것을 잊는 문제이죠.

해결 전략:

선택적 해동(Selective Unfreezing): 모든 파라미터를 업데이트하는 대신, 과제 관련 레이어만 선택적으로 미세조정합니다. LoRA(Low-Rank Adaptation) 등의 파라미터 효율적 미세조정(PEFT) 기법이 대표적입니다.
ReVLA [111] (Shi et al., 2025): 가역적 학습(reversible learning) 메커니즘을 도입하여, 새로운 과제 학습 시 이전 지식을 가역적으로 보존합니다.
π0.5 [31]-KI: 그래디언트 차단(gradient blocking)을 통해 특정 모듈로의 그래디언트 전파를 선택적으로 차단하여, 사전학습 지식을 보호합니다.

교차 체현 일반화

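Ultimately, a VLA system should not be tied to one robot; it must generalize across embodiments. A "pick up the object" skill learned on a 7-axis articulated arm should also work with a parallel gripper, a dexterous multi-fingered hand, or a mobile manipulator.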

HPT [96] (Wang et al., 2024): Heterogeneous Pretrained Transformer. 공유 잠재 공간(shared latent space)과 체현별 헤드(embodiment-specific head)를 분리한 아키텍처입니다. 공유 트랜스포머가 과제 의미론(task semantics)을 처리하고, 각 로봇 형태에 맞는 전용 헤드가 해당 행동 공간으로 변환합니다. 이걸 비유하면, 번역의 "의미 이해" 부분은 공통이고, "출력 언어"만 바꾸는 것과 비슷합니다.
UniAct [112] (Ning et al., 2025): 통합 행동 공간을 3D 공간으로 정의하여, 로봇 형태에 무관한 범용 행동 표현을 학습합니다.
BridgeVLA [113] (Li et al., 2025): 서로 다른 로봇 데이터셋 간의 브릿지 역할을 하는 VLA 모델로, 교차 데이터셋 전이를 촉진합니다.

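The core challenge of cross-embodiment generalization is the heterogeneity of action spaces. A 7-DoF robot arm, a 12-DoF dexterous hand, and a 20+-DoF humanoid differ fundamentally in both the dimensionality and the meaning of their action spaces. To bridge this heterogeneity, latent action tokens (type 6) and task-space representations are used as the key intermediaries.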

6.6 소결: 학습 패러다임의 3단계 성숙

자, VLA 학습 패러다임의 진화를 종합하면, LLM 학습의 발전 경로와 놀라울 정도로 유사한 3단계 성숙 모델이 드러납니다:

단계	LLM	VLA	핵심 기여
1단계: 사전학습	대규모 텍스트 코퍼스	인터넷 스케일 이미지/비디오 + 로봇 궤적	기초 능력 형성
2단계: 지도 미세조정	SFT (지시 따르기)	BC (시연 따르기)	과제 수행 능력
3단계: RL 후처리	RLHF/DPO (정렬)	RL/선호 최적화 (정렬 + 초월)	BC 한계 극복, 최적 성능

현재 VLA 연구는 2단계에서 3단계로의 전환기에 있습니다. RIPT-VLA [71](원논문 기준 특정 태스크 4%->97%)와 SimpleVLA-RL [70](17.3%->91.7%, 원논문 직접 인용)의 결과는 이 전환이 단순한 점진적 개선이 아닌 패러다임 수준의 도약을 가져올 수 있음을 시사합니다. 앞으로 RL 후처리가 BC와 함께 VLA 학습의 표준 파이프라인으로 자리잡을 것은 거의 확실합니다.

동시에, 인간 운동학습과의 비유가 시사하듯, 학습은 단일 단계의 문제가 아니라 평생에 걸친 지속적 과정입니다. 배포 후 자기개선, 새로운 환경에의 적응, 치명적 망각 없는 지식 축적 -- 이러한 평생학습 능력의 구현은 VLA 연구의 장기적 과제이자, 진정으로 범용적인 로봇 시스템을 향한 필수 요건입니다.

Motivation Chain: 학습 패러다임의 진화

수동 규칙 기반 제어의 한계(환경마다 재설계 필요)

→ Behavior Cloning 등장(시연만 보여주면 학습)

→ BC의 한계(분포 이탈, 시연 품질이 성능 천장, 다중 모드 행동 미표현)

→ Diffusion Policy [17] 등장(다중 모드 행동을 확산으로 표현)

→ VLM 사전학습 활용(인터넷 지식 전이로 일반화 향상)

→ BC의 근본적 한계 잔존(시연 밖 행동 발견 불가, 안전·선호 신호 부재)

→ RL 후처리 등장(BC로 초기 정책 → RL로 시연 초월)

→ RL 후처리의 과제(학습 불안정, reward 설계 어려움, catastrophic forgetting)

→ VLM-생성 보상, 선호 최적화 등 안정화 기법 등장

Motivation Chain: Evolution of Action Tokenization

Bin 이산화의 한계(RT-2: 7DoF × 16스텝 = 112토큰, 느린 추론)

→ FAST 등장(DCT+BPE로 최대 13배 압축)

→ Latent 토큰화(VQ-BeT [60], LAPA [61]: 연속 행동을 학습된 잠재 공간으로 압축)

→ 다양한 표현의 공존(태스크 특성에 따라 최적 토큰 유형이 다름)

8가지 행동 토큰 유형: 핵심 차별점

토큰 유형	한줄 핵심	대표 모델	제어 주파수	장점	한계
Language	자연어로 행동 기술	SayCan, Inner Monologue	1-3Hz	해석 가능, VLM 직접 활용	정밀 제어 불가
Code	프로그램으로 행동 기술	Code-as-Policies	1-5Hz	루프/조건문으로 복잡 로직 표현	새 API마다 재설계
Affordance	파지 가능 영역/자세	VoxPoser [57], A3VLM [58]	3-10Hz	3D 공간 이해	비조작 태스크에 부적합
Trajectory	경로점/궤적	RT-Trajectory [62], TraceVLA	5-10Hz	시각적 직관성	힘 제어 부재
Goal	목표 상태 이미지/포인트	SuSIE [59], 3D-VLA	1-5Hz	태스크 독립적	중간 과정 미지정
Latent	학습된 잠재 벡터	VQ-BeT [60], LAPA [61], UniVLA [80]	10-30Hz	압축 효율, 정보 보존	해석 불가
Raw Action	직접 이산화된 관절값	RT-2, OpenVLA	3-10Hz	단순, VLM 어휘 재활용	토큰 수 폭발
Reasoning	추론 과정+행동	CoT-VLA [55], SC-VLA [56]	1-5Hz	추론 가능, 자기교정	추론 오버헤드

직관적 한줄 설명: 행동 토큰화와 학습 편

Bin 이산화(RT-2 방식): "온도계의 연속 눈금을 '춥다/시원하다/따뜻하다/뜨겁다'처럼 칸으로 나누는 것"
FAST: "로봇 행동을 MP3처럼 주파수 압축 -- 사람이 못 느끼는 미세 변화는 버리고 핵심만 보존"
VQ-BeT [60]: "연속 행동을 '행동 단어장'의 단어로 매핑하여 GPT처럼 다음 단어를 예측"
Diffusion Policy [17]: "대리석에서 조각상을 깎아내듯, 순수 노이즈에서 행동을 정제해 나감"
Flow Matching(π0): "출발지(노이즈)에서 목적지(행동)까지 직선 고속도로를 뚫은 것 -- 디퓨전의 구불길 대신"
Behavior Cloning: "선생님이 푸는 걸 보고 따라 푸는 것 -- 선생님보다 잘할 수 없고, 안 본 문제는 못 품"
RL 후처리: "BC로 기본기를 익힌 뒤, 스스로 연습하며 선생님을 넘어서는 단계"
VLM-생성 보상: "채점자(VLM)가 로봇의 행동을 보고 점수를 매겨주는 것 -- 사람이 일일이 채점할 필요 없음"
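
The bin discretization in the first one-liner can be sketched as follows, assuming each action dimension is normalized to [-1, 1] and split into 256 uniform bins. The range and the helper names are assumptions made for illustration, not the RT-2 source code.

```python
def to_bin(value, low=-1.0, high=1.0, n_bins=256):
    """Map a continuous action value to an integer token in [0, n_bins - 1]."""
    clipped = min(max(value, low), high)
    index = int((clipped - low) / (high - low) * n_bins)
    return min(index, n_bins - 1)  # value == high falls into the last bin

def from_bin(index, low=-1.0, high=1.0, n_bins=256):
    """Map a token back to the center of its bin."""
    width = (high - low) / n_bins
    return low + (index + 0.5) * width

# One 7-DoF action becomes 7 tokens; a 16-step chunk would need 7 * 16 = 112.
action = [0.12, -0.5, 0.99, 0.0, 0.3, -1.0, 1.0]
tokens = [to_bin(a) for a in action]
recovered = [from_bin(t) for t in tokens]
```

The round-trip error is at most half a bin width, which is the resolution limit that motivates the compressed and latent tokenizers listed above.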

Self-Check Questions: Section 5-6

Q1: FAST 토큰화가 기존 bin 이산화 대비 어떤 원리로 토큰 수를 줄이는가?

답: FAST는 두 단계 압축을 적용한다. (1) DCT(이산 코사인 변환)로 행동 시퀀스를 시간 영역에서 주파수 영역으로 변환하여, 고주파 성분(미세 진동)을 제거하고 저주파 성분(핵심 운동 패턴)만 보존한다. (2) BPE(바이트 페어 인코딩)로 반복되는 주파수 패턴을 합쳐 토큰 수를 추가 압축한다. 결과적으로 7DoF x 16스텝=112토큰이 최대 약 13배 압축된다.
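
The first stage of that answer (DCT, then discarding small coefficients) can be sketched in plain Python. This is a toy illustration under stated assumptions: the quantization step `STEP = 0.1` and the sine-shaped trajectory are hypothetical, the BPE stage is omitted, and nothing here is the actual FAST implementation.

```python
import math

def dct(x):
    """Orthonormal DCT-II: time-domain samples -> frequency coefficients."""
    n = len(x)
    coeffs = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        coeffs.append(s * math.sqrt((1 if k == 0 else 2) / n))
    return coeffs

def idct(c):
    """Inverse of dct() above (orthonormal DCT-III)."""
    n = len(c)
    return [
        sum(c[k] * math.sqrt((1 if k == 0 else 2) / n)
            * math.cos(math.pi * (i + 0.5) * k / n) for k in range(n))
        for i in range(n)
    ]

STEP = 0.1  # hypothetical quantization step for the coefficients

# One joint over a 16-step chunk, moving smoothly.
trajectory = [math.sin(math.pi * i / 16) for i in range(16)]

# Rounding sends small (mostly high-frequency) coefficients to zero.
quantized = [round(c / STEP) for c in dct(trajectory)]
nonzero = sum(1 for q in quantized if q != 0)
reconstructed = idct([q * STEP for q in quantized])
max_error = max(abs(a - b) for a, b in zip(trajectory, reconstructed))
```

Only the nonzero integers need to be encoded, and a BPE stage would then merge frequently co-occurring integers into single tokens. How much is saved depends on how smooth the motion is and on the chosen step size.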

Q2: RL 후처리가 BC 단독보다 우수한 이유를 "탐색(exploration)"의 관점에서 설명하라.

Q3: 8가지 행동 토큰 유형 중, "제어 주파수"와 "추상 수준"은 어떤 trade-off 관계에 있는가?

Open Research Questions: Section 5-6

최적 토큰 유형 자동 선택: 주어진 태스크에 대해 8가지 토큰 유형 중 최적을 자동으로 선택하는 메타-학습 프레임워크가 가능한가?

RL의 안정성-성능 trade-off: BC->RL 전환 시 catastrophic forgetting 없이 안정적으로 성능을 개선하는 이론적 보장이 가능한가?

보상 설계의 자동화: VLM-생성 보상이 인간 보상과 얼마나 잘 일치하는가? VLM의 환각(hallucination)이 보상 신호를 오염시키는 경우 어떻게 대처하는가?

연속-이산 스펙트럼의 최적점: 완전 이산(bin)과 완전 연속(디퓨전) 사이에서 최적의 행동 표현 해상도는 태스크에 따라 어떻게 달라지는가?

7. 효율성 — 실세계 배포를 위한 필수 과제

자, 이번 시간에는 VLA 연구에서 피할 수 없는 현실적인 문제를 다루겠습니다. 바로 효율성입니다. VLA 모델이 학술 벤치마크에서 아무리 좋은 성능을 보여줘도, 이걸 실제 로봇에 올려서 돌릴 수 없으면 의미가 없잖습니까. 수십억 파라미터의 거대 모델을 실시간으로 추론하면서, 제한된 하드웨어 위에서, 안전하고 경제적으로 운용해야 합니다. 이게 왜 중요하냐면, 논문에서 잘 되는 것과 현장에서 쓸 수 있는 것은 완전히 다른 차원의 문제이기 때문입니다.

이 장에서는 Yu et al. [4] (2025)의 효율적 VLA 서베이를 핵심 축으로 삼아, 효율성 문제의 전체 지형도를 그려 보겠습니다.

7.1 왜 효율성인가: 현실과 이상의 간극

현재 VLA 모델의 자원 소모량이 어느 정도인지부터 감을 잡아 봅시다. 결론부터 말씀드리면, 실세계 배포 관점에서 비현실적인 수준입니다.

훈련 비용의 규모를 먼저 보겠습니다:

OpenVLA [15] 학습에는 약 21,500 A100-GPU 시간이 소요되었습니다. 64-GPU 클러스터를 약 2주간 쉬지 않고 돌려야 하는 양입니다.
π0 [16]의 학습에는 10,000시간 이상의 로봇 궤적 데이터가 사용되었습니다. 단일 기관이 이 규모의 데이터를 자체적으로 수집한다는 건 사실상 불가능합니다.

추론 지연시간의 벽도 심각합니다:

RT-2-PaLI-X(55B)의 추론 지연시간은 330~1000ms입니다. 초당 1~3회(1-3Hz)의 제어 주파수밖에 안 나온다는 뜻이죠. 테이블탑 매니퓰레이션에서 요구되는 최소 주파수(5-10Hz)에도 미달하고, 동적 과제에서 필요한 30Hz 이상은 꿈도 못 꿉니다.
비교적 효율적인 OpenVLA [15] (7B)조차 166ms의 지연(약 6Hz)으로, 빠른 반응이 필요한 과제에는 부적합합니다.
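
The latency-to-frequency conversions quoted above follow from a one-line calculation, sketched here under the assumption that each forward pass yields exactly one action (no action chunking). The latency figures are the ones given in the text; the 10Hz threshold corresponds to the <100ms requirement in the table that follows.

```python
def control_rate_hz(latency_ms):
    """Upper bound on control rate when one forward pass yields one action."""
    return 1000.0 / latency_ms

latencies_ms = {
    "RT-2-PaLI-X (best case)": 330,
    "RT-2-PaLI-X (worst case)": 1000,
    "OpenVLA": 166,
}
rates = {name: control_rate_hz(ms) for name, ms in latencies_ms.items()}
meets_10hz = {name: hz >= 10.0 for name, hz in rates.items()}
```

Models that emit a chunk of several actions per forward pass can exceed this bound, which is why inference latency and control frequency have to be read separately in model comparisons.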

실세계 배포의 4대 요구사항을 표로 정리하면 이렇습니다:

요구사항	설명	현재 갭
지연시간	<100ms (10Hz+)	대부분의 대형 VLA가 미달
비용	클라우드 API 비용 최소화	대형 모델은 GPU당 비용 과다
프라이버시	온디바이스 추론 필수	가정/의료 환경에서 데이터 외부 전송 불가
에너지	배터리 구동 로봇의 전력 제약	수십 와트급 엣지 디바이스에서 구동 필요

이러한 간극을 메우기 위해, 2024년 후반부터 2025년에 걸쳐 효율적 VLA에 관한 연구가 폭발적으로 증가했습니다. 연구의 방향은 크게 모델 효율성, 훈련 효율성, 데이터 효율성의 세 축으로 나뉩니다. 하나씩 살펴보겠습니다.

7.2 모델 효율성: 추론을 빠르고 가볍게

자, 먼저 모델 효율성입니다. 이미 학습된 VLA의 추론 단계에서 지연시간과 메모리를 줄이는 기법들을 총칭하는 건데요, 크게 다섯 가지 전략이 있습니다. 양자화, 가지치기, 지식 증류, 토큰 최적화, 효율적 아키텍처입니다. 순서대로 보겠습니다.

7.2.1 양자화(Quantization)

양자화는 모델 가중치(및 활성값)의 수치 정밀도를 줄여서 메모리와 연산량을 절감하는, 가장 직접적인 기법입니다. 비유하자면, 고화질 사진을 JPEG로 압축하는 것과 같습니다. 파일은 작아지지만 눈에는 거의 같아 보이는 거죠.

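OpenVLA [15] 4-bit PTQ (Post-Training Quantization): Post-training quantization alone halved GPU memory usage with no observed loss in performance. The key point is that VLA weights carry substantial numerical redundancy.
SQIL [114]: Applies 4-bit salience-aware quantization and achieves a 2.5x inference speedup. This matters because it identifies the weights that are important for action prediction and selectively keeps them at higher precision.
BitVLA [33]: Pushes quantization to the extreme with "1-bit" ternary weights ({-1, 0, 1}; three possible values carry log2(3) ≈ 1.58 bits each). It reports 3.36x memory compression, and it is quite surprising that meaningful action generation remains possible with only three weight values.
QAIL (Quantization-Aware Imitation Learning) [115]: Integrates quantization into the training stage, so that the model is learned directly in a form optimized for edge-device deployment.
SQAP-VLA [116]: Co-designs quantization and token pruning, achieving a better efficiency-performance balance than applying either technique on its own.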
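
To make the two ends of this list concrete, here is a minimal sketch of symmetric 4-bit quantization and of ternary quantization with an absolute-mean scale. Both are generic textbook schemes chosen for illustration; the per-tensor scaling, the function names, and the sample weights are assumptions, not the exact procedures of OpenVLA, SQIL, or BitVLA.

```python
def quantize_int4(weights):
    """Symmetric 4-bit quantization: integers in [-7, 7] plus one float scale."""
    scale = max(abs(w) for w in weights) / 7
    q = [max(-7, min(7, round(w / scale))) for w in weights]
    return q, scale

def quantize_ternary(weights):
    """Ternary quantization: each weight becomes -1, 0, or +1 times a scale."""
    scale = sum(abs(w) for w in weights) / len(weights)  # absolute mean
    q = [max(-1, min(1, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from integers and the scale."""
    return [v * scale for v in q]

weights = [0.42, -0.17, 0.03, -0.88, 0.61, 0.0, -0.29, 0.75]

int4, int4_scale = quantize_int4(weights)
int4_error = max(abs(w - d) for w, d in zip(weights, dequantize(int4, int4_scale)))

ternary, ternary_scale = quantize_ternary(weights)
```

The 4-bit version keeps every weight within half a quantization step of its original value, while the ternary version keeps only sign and rough magnitude, which is why extreme schemes typically rely on quantization-aware training rather than post-training conversion.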

7.2.2 가지치기(Pruning)

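Next is pruning. This removes unnecessary components (layers, neurons, tokens, and so on) to make the model lighter, much like cutting dead branches off a tree: the tree (the model) gets lighter and copes with the wind (inference) more easily. In VLAs, much of this work builds on the observation that the layers of the LLM backbone are highly redundant.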

레이어 수준 가지치기부터 보겠습니다:

인접한 LLM 레이어의 출력 사이에 높은 코사인 유사도가 관측되는데요, 이를 근거로 최대 50%의 레이어를 제거할 수 있습니다. 절반을 잘라내도 된다는 겁니다.
DeeR-VLA [35]: 동적 다중 출구(dynamic early exit) 전략을 사용합니다. 각 레이어에서 행동 예측의 일관성을 확인하고, 일관성이 확보되면 나머지 레이어를 스킵하는 거예요. 여기서 핵심은 추가 학습이 필요 없다는 점입니다. 시험 볼 때 쉬운 문제는 빨리 풀고 넘기고, 어려운 문제만 깊이 고민하는 전략과 같습니다.
SmolVLA [32]: 극도로 단순한 접근법으로, LLM의 L/2개 레이어를 단순 스킵합니다. 절반의 레이어만으로도 조작 과제를 수행할 수 있음을 보여준 거죠.
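MoLe-VLA [117]: Uses a STAR router to choose dynamically, per input, which layers to activate. It activates few layers for easy tasks and more layers for complex ones, adapting the amount of compute to the input.
EfficientVLA [118]: A training-free framework that applies layer pruning and visual-token pruning together.
FLOWER [119]: For encoder-decoder VLMs it drops the entire decoder; for decoder-only VLMs it drops the final 30% of layers.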
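
The cosine-similarity observation at the top of this list can be sketched as a simple redundancy check, assuming we have recorded one hidden-state vector per layer for some input. The 0.99 threshold, the function names, and the toy 3-dimensional vectors are illustrative assumptions; real methods average over many tokens and inputs.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def redundant_layers(hidden_states, threshold=0.99):
    """Indices of layers whose output is nearly parallel to their input."""
    return [
        i for i in range(1, len(hidden_states))
        if cosine(hidden_states[i - 1], hidden_states[i]) >= threshold
    ]

# hidden_states[i] is the (toy) output of layer i for a single token.
hidden_states = [
    [1.0, 0.0, 0.0],
    [0.6, 0.8, 0.0],
    [0.6, 0.81, 0.0],   # barely changes the representation
    [0.0, 0.6, 0.8],
]
candidates = redundant_layers(hidden_states)
```

Layers flagged this way are candidates for removal or skipping; whether task success actually survives their removal still has to be checked empirically.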

구조적 가지치기도 있습니다:

RLRC [120]: Structured pruning based on Taylor importance scores, reaching up to 90% sparsity while retaining meaningful performance.

7.2.3 지식 증류(Distillation)

자, 그러면 증류는 어떤 건가. 대형 VLA의 지식을 소형 모델로 전이하는 기법인데요, 명인의 기술을 제자에게 전수하는 것과 비슷합니다. 제자는 작지만 핵심 기술은 보존되는 거죠. 처음부터 작은 모델을 만드는 것보다 높은 성능을 달성할 수 있습니다.

TinyVLA [34]: 대형 VLA에서 1.4B 미만의 소형 모델로 증류합니다. LoRA 가중치로 초기화하여 증류 효율을 높입니다.
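CEED-VLA [121]: Combines consistency distillation with Jacobi parallel decoding. It parallelizes the serial bottleneck of autoregressive token generation and substantially speeds up inference.
RPD (Refined Policy Distillation) [122]: Distills a VLA into small RL expert policies. On a specific task, the distilled expert can be faster and more accurate than the generalist VLA.
SP-VLA [123]: A somewhat unusual approach: action-aware scheduling switches dynamically between a heavy VLA and a lightweight action generator. The large VLA is called only at moments that need complex judgment, and the lightweight generator handles simple execution segments.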

7.2.4 토큰 최적화

여기서 핵심은 이겁니다. VLA에서 시각 토큰은 전체 입력 시퀀스의 대부분을 차지합니다. 단일 이미지가 수백 개의 패치 토큰으로 변환되고, 비디오 입력에서는 이 수가 수천 개로 폭증하거든요. 이걸 줄이는 것이 토큰 최적화의 핵심입니다.

시각 토큰 압축부터 보겠습니다:

SmolVLA [32]: Pixel shuffle 기법으로 프레임당 64개 토큰으로 압축합니다. 원래 수백 개였던 토큰을 공간적으로 재배열하여 극단적으로 줄이는 거예요.
FlashVLA [52]: ICS(Importance-based Compression and Selection) 가지치기로 중요도가 낮은 시각 토큰을 제거합니다.
EfficientVLA [118]: Applies layer pruning and visual-token pruning in an integrated way.

시각 토큰 캐싱도 중요한 전략입니다:

VLA-Cache, CronusVLA: 이건 매우 직관적인 아이디어인데요, 정적 배경에 해당하는 토큰이 연속 프레임 간에 거의 변하지 않는다는 시간적 일관성(temporal coherence)을 활용합니다. 매 프레임 배경을 다시 그리지 않고 캐시해 두고, 변화가 있는 전경 토큰만 갱신하여 약 40-50% 빠른 추론(Zhang et al. [6] 기준; 원논문에서는 최대 2배 이상 가속 보고)을 달성합니다.
이 접근법이 유효한 근본적 이유는, 로봇 조작 과제에서 대부분의 패치 토큰이 공간적으로 중복되기 때문입니다. 카메라가 고정된 테이블탑 환경에서는 배경의 80% 이상이 프레임 간에 동일하거든요.
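
The caching idea can be sketched at the patch level: re-encode a patch only if it changed noticeably since the previous frame. This is a deliberately simplified toy. The `encode` callable, the tolerance `tol`, and the 2-pixel patches are hypothetical, and real systems such as VLA-Cache operate on cached intermediate states inside the Transformer rather than on raw patches.

```python
def update_tokens(prev_patches, new_patches, cached_tokens, encode, tol=0.02):
    """Reuse cached tokens for static patches; re-encode only changed ones."""
    tokens, recomputed = [], 0
    for prev, new, cached in zip(prev_patches, new_patches, cached_tokens):
        # Mean absolute pixel difference between consecutive frames.
        diff = sum(abs(a - b) for a, b in zip(prev, new)) / len(new)
        if diff <= tol:
            tokens.append(cached)        # static background: reuse
        else:
            tokens.append(encode(new))   # moving foreground: recompute
            recomputed += 1
    return tokens, recomputed

def mean_encoder(patch):
    """Stand-in for an expensive vision encoder."""
    return sum(patch) / len(patch)

frame_a = [[0.1, 0.1], [0.5, 0.5], [0.9, 0.9], [0.3, 0.3], [0.7, 0.7]]
frame_b = [[0.1, 0.1], [0.5, 0.5], [0.2, 0.4], [0.3, 0.3], [0.7, 0.7]]

cache = [mean_encoder(p) for p in frame_a]
tokens, recomputed = update_tokens(frame_a, frame_b, cache, mean_encoder)
```

Here only one of five patches is re-encoded. In a fixed-camera tabletop scene the same logic lets most of the frame skip recomputation, which is where the reported speedups come from.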

7.2.5 효율적 아키텍처

자, 그러면 아키텍처 자체를 바꾸는 방향도 있습니다. 기존 Transformer 구조의 근본적 한계, 즉 이차 복잡도의 어텐션을 극복하기 위한 것인데요.

선형 복잡도 아키텍처:

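SARA-RT [126]: Up-trains standard softmax attention into linear attention, reducing complexity from O(n^2) to O(n) in the sequence length.
RoboMamba [127]: A VLA built on the Mamba selective state space model (SSM), achieving more than a 3x speedup at linear complexity. Its advantage over Transformers grows with sequence length.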

MoE(Mixture of Experts): MoE의 핵심 아이디어는, 모든 의사가 모든 환자를 보는 게 아니라 전문 과목별로 배정하는 것과 같습니다. 각 전문가는 작지만 전체 역량은 큰 거죠.

GeRM [128]: Applies MoE to RL for quadruped robots, activating only a subset of experts out of the full parameter set.
FedVLA [72]: 이중 게이팅 MoE로 연합학습(federated learning) 환경에서 효율적 VLA를 구현합니다.
DriveMoE [49]: 자율주행 도메인에서 MoE 구조를 활용하여 다양한 주행 시나리오에 전문가를 할당합니다.
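
The "only some experts are activated" idea can be sketched with top-k gating. This is the generic sparse-MoE routing pattern, with scalar toy experts and hand-picked router logits as stand-ins; it is not the routing used by GeRM, FedVLA, or DriveMoE.

```python
import math

def top_k_gate(router_logits, k=2):
    """Pick the k highest-scoring experts and softmax over just those."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def moe_forward(x, experts, router_logits, k=2):
    """Weighted sum of the selected experts; the rest are never evaluated."""
    gates = top_k_gate(router_logits, k)
    return sum(weight * experts[i](x) for i, weight in gates.items())

# Four toy "experts"; a real expert would be a feed-forward sub-network.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: -x, lambda x: x * x]
gates = top_k_gate([2.0, 2.0, -1.0, 0.5], k=2)
y = moe_forward(3.0, experts, [2.0, 2.0, -1.0, 0.5], k=2)
```

With k=2 of 4 experts, only half of the expert parameters participate in each forward pass, so total capacity can grow without a matching increase in per-step compute.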

병렬 디코딩:

OpenVLA [15]-OFT: 양방향 어텐션을 사용하여 여러 행동 토큰을 동시에 생성합니다.
PD-VLA [129]: Parallelizes autoregressive decoding with Jacobi fixed-point iteration.
Spec-VLA [130]: Applies speculative decoding to VLAs and achieves a 1.42x speedup. A small draft model quickly proposes candidate tokens, and the large model verifies them.

7.2.6 효율적 어텐션(Efficient Attention)

Yu et al. [4]는 추가로 효율적 어텐션(Efficient Attention) 기법들을 별도 연구 방향으로 식별합니다. KV-Efficient VLA(RNN 게이트 기반 KV 캐시 압축), Long-VLA [73](장시간 태스크를 위한 phase-aware 입력 마스킹), RetoVLA [74](레지스터 토큰 재사용), dVLA [65](디퓨전 VLA를 위한 prefix 어텐션 마스킹) 등이 이 범주에 속합니다. 이들은 기존 모델 압축(양자화, 프루닝)과는 독립적인 차원의 효율화로, Transformer의 어텐션 메커니즘 자체를 최적화하는 것입니다.

7.3 훈련 효율성: 적은 자원으로 더 잘 학습하기

자, 모델 자체의 경량화와는 별도로, 학습 과정의 효율성을 높이는 연구도 활발합니다.

파라미터 효율적 미세조정(PEFT):

LoRA(Low-Rank Adaptation)를 비롯한 PEFT 기법들은 전체 파라미터의 0.1~1%만을 학습하면서도 전체 미세조정에 준하는 성능을 달성합니다. GPU 시간을 약 70% 절감하며, 단일 GPU에서도 대형 VLA의 미세조정을 가능하게 합니다. 이게 실용적으로 굉장히 큰 의미가 있습니다.
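
The "0.1 to 1% of parameters" figure follows from LoRA's structure: a frozen d_out x d_in weight matrix is augmented with two trainable low-rank factors of shapes d_out x r and r x d_in. The sketch below computes the trainable fraction for a single matrix; the 4096 x 4096 size and rank 8 are illustrative assumptions, not values from a specific VLA.

```python
def lora_trainable_fraction(d_in, d_out, rank):
    """Trainable LoRA parameters relative to the frozen full matrix."""
    full_params = d_in * d_out
    lora_params = rank * (d_in + d_out)  # factors B (d_out x r) and A (r x d_in)
    return lora_params / full_params

fraction = lora_trainable_fraction(d_in=4096, d_out=4096, rank=8)
percent = 100 * fraction  # about 0.39% for this matrix
```

Because gradients and optimizer state are kept only for the low-rank factors, the memory saved during fine-tuning is what makes single-GPU adaptation of a multi-billion-parameter model feasible.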

혼합 학습 전략:

커리큘럼 학습: 쉬운 과제에서 어려운 과제로 점진적으로 난이도를 높이는 전략입니다.
다단계 학습: π0 [16]는 (1) VLM 사전학습 → (2) 로봇 데이터 사전학습 → (3) 과제별 미세조정의 3단계 파이프라인을 사용합니다.

FAST 토큰화 [20]의 혁신:

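FAST (Frequency-space Action Sequence Tokenization), proposed by Pertsch et al., applies DCT (discrete cosine transform) + BPE (byte-pair encoding) to robot action sequences. By compressing action sequences aggressively, it sped up pretraining by 5x. Using FAST tokens or latent actions instead of raw actions is the core trend in efficient action representation.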

7.4 데이터 효율성: 적은 로봇 데이터로 더 많이 배우기

이제 데이터 쪽을 봅시다. 로봇 데이터 수집의 높은 비용은 VLA 연구의 가장 근본적인 병목입니다. 데이터 효율성 연구는 이 병목을 우회하거나 완화하는 전략들을 탐구합니다.

인간 비디오 활용:

EgoVLA, Being-H0, RynnVLA-001 등은 인터넷에 풍부한 1인칭(ego-centric) 인간 활동 비디오를 대리 학습 데이터로 활용합니다. 인간의 손 움직임에서 조작 전략을 학습하고, 이를 로봇 행동에 전이하는 건데요. 이 접근법의 핵심 통찰은 인간과 로봇이 동일한 물리 세계에서 유사한 조작 과제를 수행한다는 것입니다.

시뮬레이션 데이터:

UniSim, Genesis: 물리 시뮬레이터에서 대규모 합성 데이터를 생성합니다.
GraspVLA [81]: 10억 스케일의 합성 파지 데이터를 생성하여 사전학습에 활용합니다. 규모 자체가 압도적이죠.

데이터 증강:

언어 증강: DIAL [90] 등은 과제 지시문을 다양하게 패러프레이징하여 언어 이해의 강건성을 높입니다.
시각 증강: GenAug [87], CACTI [88], ROSIE [89] 등은 생성 모델을 이용해 시각적 다양성을 확대합니다.
궤적 증강: DemoGen 등은 기존 시연 데이터에서 새로운 궤적을 합성합니다.

능동적 데이터 선정:

AMF(Active Model Feedback): 정보 이득(information gain)이 높은 데이터를 우선 선정하여 학습 효율을 극대화합니다.
SWBT(Success Weighted by Trial): 실패 시도까지 학습 데이터에 포함하여, 실패로부터도 유용한 신호를 추출합니다.

자율 수집:

SOAR: 파운데이션 모델의 가이드 하에 로봇이 자율적으로 데이터를 수집합니다. 인간 시연자 없이도 학습 데이터를 지속적으로 확보할 수 있는 경로를 제시하는 겁니다.

7.5 주요 경량 모델 비교

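Let's check in numbers how far efficiency work has actually come. The table below compares representative VLA models by parameter count, inference performance, and core technique. In roughly two years (from RT-2 in 2023 to SmolVLA in 2025), model size has shrunk from 55B to 450M parameters and reported control rates have risen from about 1Hz to about 120Hz, although the 120Hz figure is a motor output frequency rather than a model inference frequency.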

모델	파라미터	추론 지연	제어 주파수	핵심 기법
RT-2-PaLI-X	55B	330-1000ms	1-3Hz	기준선(대형 VLM 직접 사용)
OpenVLA [15]	7B	166ms	6Hz	오픈소스 기준선
π0 [16]	3.3B	73ms	20-50Hz	Flow Matching 행동 헤드
GR00T N1 [21]	2.2B	64ms	~120Hz(모터 출력 주파수; Yu et al. [4]의 Table 1에서는 미보고. 모델 추론 주파수와 구분 필요)	이중시스템(느린 VLM + 빠른 정책)
NORA [132]	3B	—	—	FAST+ 토큰화
CLIP [27]-RT	~1B	—	—	동결 CLIP [27] 활용, OpenVLA [15] 대비 +24%
EdgeVLA [133]	1B	—	—	엣지 디바이스 전용 설계
TinyVLA [34]	<1.4B	—	—	대형 VLA 증류
SmolVLA [32]	~450M	—	—	단일 GPU 학습 가능
BitVLA [33]	~2B(실효 용량 축소)	—	—	1비트 삼진 양자화
DiVLA-2B [134]	2B	~12ms	82Hz	A6000 단일 GPU 구동
RoboMamba [127]	—	—	—	Mamba SSM 기반 선형 복잡도

7.6 핵심 인사이트 — 효율성-성능 트레이드오프의 재발견

효율적 VLA 연구에서 도출되는 인사이트들은 단순한 기술적 최적화를 넘어서, VLA 설계 철학 자체에 대한 재고를 요구합니다. 하나씩 짚어 보겠습니다.

1) 스케일 역전 현상: CLIP [27]-RT(~1B)가 OpenVLA [15] (7B)를 24% 능가한다는 결과가 있습니다. 이건 상당히 시사적인데요, "더 많은 파라미터가 더 나은 성능을 보장한다"는 스케일링 법칙의 단순한 적용이 로봇 도메인에서는 성립하지 않을 수 있음을 보여줍니다. 작은 모델이라도 적절한 표현 학습과 데이터 효율적 미세조정이 결합되면, 거대 모델을 능가할 수 있다는 겁니다.

2) 양자화는 거의 무료 점심: 4비트 PTQ로 메모리를 절반으로 줄이면서도 성능 저하가 없다는 사실은, 현재 VLA의 가중치에 상당한 중복성이 존재함을 의미합니다. 이건 배포 단계에서 양자화를 기본 적용해야 할 강력한 근거가 됩니다.

3) 계층적 분리는 로봇에 고유하게 적합: GR00T N1 [21]이 보여준 느린 VLM(1-5Hz) + 빠른 정책 헤드(50Hz+)의 비동기 실행은 로봇 제어의 본질적 구조와 잘 맞습니다. 높은 수준의 의미 이해는 매 프레임 갱신할 필요가 없지만, 저수준 모터 명령은 고주파로 생성되어야 하거든요. 이 "인지는 느리게, 행동은 빠르게" 패러다임은 인간 신경계의 구조와도 유사합니다.

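4) RL post-training lifts a weak baseline: RIPT-VLA [71] showed that RL post-training can improve a VLA's performance dramatically. Performance that was 4% at the SFT (Supervised Fine-Tuning) baseline rose to 97% after PPO post-training. This 4%→97% result is not a recovery from degradation caused by quantization or pruning; it is an improvement over the BC/SFT baseline through RL post-training (per the original paper, Su et al., 2025; note that performance varies by benchmark). It suggests that a "lightweight model + RL post-training" pipeline is worth pursuing, but it does not by itself show that RL recovers accuracy lost to compression.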

5) 인간 비디오는 로봇 데이터의 실행 가능한 대체재: EgoVLA 계열의 연구들은 인터넷 스케일의 인간 비디오가 로봇 데이터를 부분적으로 대체할 수 있음을 보여줍니다. 로봇 데이터 수집의 병목을 우회하는 가장 확장 가능한(scalable) 경로 중 하나입니다.

6) 지배적 연구 추세: 효율적 VLA 연구는 2024년 후반부터 2025년 사이에 폭발적으로 성장했습니다. 이는 연구 커뮤니티가 "일단 크게 만들고 나중에 줄인다"는 전략에서 "처음부터 효율적으로 설계한다"는 방향으로 전환하고 있음을 반영합니다.
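
The "perceive slowly, act quickly" pattern from insight 3 can be sketched as a loop in which a slow semantic module refreshes its plan only every few ticks while a fast policy emits a command on every tick. The 2Hz and 50Hz rates are picked from the ranges quoted above, the two functions are hypothetical stand-ins, and a real system would run the two modules asynchronously rather than in one loop.

```python
def slow_vlm(tick):
    """Stand-in for the slow semantic module: returns a plan label."""
    return f"plan@{tick}"

def fast_policy(plan, tick):
    """Stand-in for the fast action head: one motor command per tick."""
    return (plan, tick)

def run_dual_rate(total_ticks, fast_hz=50, slow_hz=2):
    """Simulate a fast control loop that reuses a slowly refreshed plan."""
    ticks_per_plan = fast_hz // slow_hz   # 25 fast ticks per VLM update
    plan, slow_calls, commands = None, 0, []
    for tick in range(total_ticks):
        if tick % ticks_per_plan == 0:
            plan = slow_vlm(tick)         # expensive call, done rarely
            slow_calls += 1
        commands.append(fast_policy(plan, tick))
    return commands, slow_calls

# Two seconds of control at 50Hz.
commands, slow_calls = run_dual_rate(total_ticks=100)
```

Over 100 control ticks the expensive module runs only 4 times, which shows how a multi-billion-parameter VLM can sit behind a high-frequency controller.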

7.7 엣지 배포: 시스템 레벨 병목 분석

자, 여기서 중요한 관점 전환이 있습니다. 2026년의 Edge Embodied Foundation Models 서베이는 VLA 배포를 모델 압축 문제가 아닌 시스템 공학 문제로 재정의했습니다. 이게 왜 중요하냐면, 모델만 줄인다고 해결될 문제가 아니라는 것이거든요.

이 서베이가 제안한 "Deployment Gauntlet"은 엣지 배포를 가로막는 7가지 결합 제약(coupled constraints)을 식별합니다. 크기, 무게, 전력, 메모리 트래픽, 연산 지연, 타이밍 변동, 안전 마진 등이 상호 작용하여, 하나의 최적화만으로는 해결되지 않는 복합 문제를 형성합니다.

핵심 발견은 병목의 유형이 컨트롤러 아키텍처에 따라 다르다는 것입니다:

자기회귀 VLA(RT-2, OpenVLA류): 주로 메모리 대역폭에 의해 제약
디퓨전 기반 컨트롤러(π0류): 주로 연산 지연과 지속 실행 비용에 의해 제약

이 분석은 '빠른 제어(fast control)'와 '느린 의미 추론(slow semantic reasoning)'을 분리하는 아키텍처(GR00T N1, π0.5)가 엣지 배포에서도 유리함을 시사합니다. 효율적 배포를 위해서는 메모리 아키텍처, 스케줄링 전략, 통신 프로토콜, 모델 설계를 통합적으로 고려하는 시스템 레벨 공동 설계(co-design)가 필요합니다.

7.8 보완적 효율화 분류 체계

Guan et al. (2025)은 Yu et al. [4]와 독립적으로 효율화 VLA를 조사하여, 4차원 분류 — (1) 모델 아키텍처, (2) 인지 특징 추출, (3) 행동 생성 메커니즘, (4) 학습/추론 전략 — 를 제안했습니다. Yu et al. [4]가 모델 압축과 효율적 설계에 초점을 맞춘 것과 달리, Guan et al.은 인지와 행동 생성의 효율화도 별도 차원으로 다룬다는 점에서 상호 보완적입니다.

Guan et al. [43]의 4차원 효율화 분류는 본 문서의 효율성 분석에 중요한 보완적 시각을 제공한다. Yu et al. [4]이 모델 압축(양자화, 프루닝, 증류)에 초점을 맞춘 것과 달리, Guan et al.은 인지 특징 추출 효율화(예: 다중 해상도 토큰 풀링, 선택적 어텐션)와 행동 생성 메커니즘 효율화(예: Action Chunking 최적화, 병렬 디코딩)를 독립적 차원으로 분석한다. 이 관점은 FlashVLA [52]나 RetoVLA [74] 같은 최근 모델이 왜 단순한 모델 압축이 아닌, 인지-행동 파이프라인 전체의 효율화를 추구하는지를 설명한다.

ICLR 2026에서 제안된 HyperVLA [135]는 하이퍼네트워크로 태스크별 정책을 동적 생성하여 추론을 가속합니다. AutoQVLA [136]는 개선된 양자화 기법으로 VRAM을 30% 절감했습니다. 이들은 7.2절에서 다룬 모델 효율화 기법들의 최전선에 위치하며, 양자화와 아키텍처 혁신이 여전히 활발한 연구 방향임을 보여줍니다.

8. 응용 도메인 — VLA가 만드는 세계

자, 이제 효율성 이야기를 마치고 VLA가 실제로 어디에 쓰이는지를 살펴보겠습니다. VLA 기술은 다양한 로봇 응용 분야로 확산되고 있는데요, 각 도메인은 고유한 행동 공간, 안전 요구, 실시간성 제약을 가지며, 이에 따라 VLA의 적용 방식도 크게 달라집니다. 이 장에서는 현재 VLA가 활용되고 있는 주요 도메인을 순회하며, 각 영역의 현황과 고유한 도전 과제를 정리하겠습니다.

8.1 테이블탑 매니퓰레이션 — 주류 연구 도메인

먼저 테이블탑 매니퓰레이션입니다. 여기가 VLA 연구의 핵심 무대라고 할 수 있는데요, 전체 VLA 모델의 70% 이상이 이 도메인을 대상으로 개발되고 평가됩니다.

벤치마크 성능의 급격한 향상을 보겠습니다:

LIBERO: 성공률이 16개월 만에 76.5%에서 98.1%로 상승했습니다.
CALVIN: 시퀀스 길이(연속 성공 과제 수)가 3.57에서 4.44로 향상되었습니다.
RLBench, Meta-World: 다양한 조작 과제에 대한 표준 평가 플랫폼으로 활용됩니다.

현재 수준과 남은 과제: 단기 과제(single-step manipulation)는 98% 이상의 성공률로 거의 해결 단계에 도달했습니다. 그런데 장기 과제(long-horizon tasks) — 여러 단계의 조작을 순서대로 수행해야 하는 과제 — 는 여전히 핵심 병목으로 남아 있습니다. 이게 왜 중요하냐면, 각 단계의 오류가 누적되는 컴파운딩 에러(compounding error) 문제가 근본 원인이기 때문입니다. 각 스텝에서 99%의 성공률을 보여도, 10스텝이면 0.99^10 = 약 90%로 떨어지고, 50스텝이면 60%대까지 내려갑니다.
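
The arithmetic above can be reproduced directly, assuming each step succeeds independently with the same probability. Real tasks violate independence, so treat this as a rough model rather than a prediction.

```python
def chain_success(per_step_success, n_steps):
    """Probability that every one of n independent steps succeeds."""
    return per_step_success ** n_steps

success_10 = chain_success(0.99, 10)   # about 0.904
success_50 = chain_success(0.99, 50)   # about 0.605

# Per-step success needed to keep a 50-step task at 90% overall.
required_per_step = 0.90 ** (1 / 50)
```

Under this model, holding a 50-step task at 90% overall requires per-step reliability of roughly 99.8%, which is why long-horizon performance lags far behind single-step results.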

특수 조작 연구도 활발합니다:

양손 조작: Bi-VLA, ALOHA 등은 두 팔의 협응 제어를 다룹니다. 행동 공간이 단일 팔의 2배로 확장되며, 양팔 간의 동기화(synchronization)가 핵심 과제입니다.
접촉 풍부한 조작: ForceVLA [78], TactileVLA 등은 힘/촉각 센서를 VLA에 통합합니다. 시각만으로 파악할 수 없는 물체의 강성, 무게, 미끄러짐 등을 감지하는 거죠.
손재주 파지(Dexterous Grasping): DexVLA, DexVLG 등은 다지(multi-finger) 핸드의 고차원 제어를 VLA로 학습합니다. 자유도가 20개 이상으로 증가하며, 행동 공간의 복잡성이 급격히 높아집니다.

8.2 휴머노이드 로봇 — 전신 제어의 도전

자, 그러면 휴머노이드는 어떤가. 휴머노이드 로봇에 VLA를 적용하는 것은 테이블탑 매니퓰레이션과는 질적으로 다른 수준의 도전을 수반합니다.

근본적 어려움을 보겠습니다:

30개 이상의 자유도: 팔, 다리, 몸통, 머리를 포함한 전신 관절의 제어가 필요합니다.
균형 유지: 이족 보행의 동적 균형은 밀리초 단위의 빠른 반응을 요구합니다.
보행과 조작의 동시 수행: 걸어가면서 물건을 집는 것처럼, 이동과 조작을 동시에 수행해야 합니다.
다중 접촉점 관리: 발, 손, 때로는 몸통까지 환경과 접촉하며, 이 모든 접촉점의 힘을 조율해야 합니다.

주요 모델을 보겠습니다:

GR00T N1 [21] (NVIDIA): 휴머노이드를 위한 파운데이션 모델을 표방합니다. 이중시스템 아키텍처로 높은 제어 주파수(~120Hz, 모터 출력 주파수; Yu et al. [4]의 Table 1에서는 미보고. 모델 추론 주파수와 구분 필요)를 달성하며, 범용적 전신 제어를 목표로 합니다.
Humanoid-VLA [137]: Recovers poses from human videos available online (pose estimation) to obtain diverse motions, using human movement directly as reference data.
Being-H0 [138]: Uses egocentric video as pretraining data to strengthen first-person understanding of the environment.
FP3 [139]: Strengthens spatial reasoning through 3D policy pretraining.

핵심 미해결 과제: 균형 유지와 정밀 조작의 동시 수행은 현재 VLA의 가장 어려운 도전 중 하나입니다. 균형을 위한 빠른 반사적 제어와 조작을 위한 신중한 계획적 제어가 충돌하는 상황이 빈번하며, 이를 하나의 통합 모델 내에서 조화시키는 것이 핵심 과제입니다.

8.3 자율주행 — 또 다른 VLA의 최전선

자율주행은 VLA의 두 번째로 큰 응용 도메인입니다. Jiang et al. [10]의 분류에 따르면, 자율주행 VLA는 4단계의 진화를 거쳐왔습니다.

진화의 4단계:

VLM as Explainer: VLM을 주행 장면 설명과 의사결정 근거 생성에 활용합니다. 제어는 별도 모듈이 담당합니다.

Modular VLA: VLM의 출력을 기존 자율주행 파이프라인(인식→예측→계획)의 모듈에 피드합니다.

Unified E2E VLA: 카메라 입력에서 조향/가속 출력까지 하나의 모델로 통합합니다.

Reasoning-Augmented VLA: CoT(Chain-of-Thought) 추론을 통합하여 의사결정 과정을 투명하게 만듭니다.

핵심 모델을 보겠습니다:

EMMA [46] (Waymo): Gemini 백본을 사용한 Waymo의 E2E 주행 모델입니다.
ORION [47]: 메모리 메커니즘과 CoT 추론을 결합하여 과거 주행 경험을 활용합니다.
DriveMoE [49]: MoE 구조로 다양한 주행 시나리오(고속도로, 교차로, 주차 등)에 전문가를 할당합니다.
AutoVLA [48]: 적응형 CoT로, 단순 상황에서는 빠른 추론을, 복잡한 상황에서는 깊은 추론을 수행합니다.

주행 vs 조작: 핵심 차이점

자율주행과 로봇 조작은 모두 VLA 프레임워크를 공유하지만, 본질적으로 매우 다른 도전을 수반합니다. 이 차이를 명확히 이해하는 것이 중요합니다.

차원	로봇 조작	자율주행
행동 공간	3D 그리퍼 위치/방향(6-7DoF)	조향/가속 + BEV 경로 + 고수준 경로(다중 추상화 수준)
공간 규모	테이블탑(~1m)	도시 규모(수백 미터~수 km)
실시간 요구	5-50Hz	30Hz+ 필수 (자동차 하드웨어 기준)
안전 임계성	물체 파손 정도	법적/물리적 인명 안전
환각의 결과	파지 실패(재시도 가능)	인명 위험(되돌릴 수 없음)
사회적 상호작용	거의 없음	양보, 합류, 다른 운전자 의도 파악 필수

여기서 주목할 연구들이 있습니다:

SafeAuto [140]는 심볼릭 거부권(symbolic veto)을 도입하여, VLA의 출력이 안전 규칙을 위반하면 실행을 차단합니다.
LangCoop V2V [141]는 차량 간(Vehicle-to-Vehicle) 자연어 통신으로 의도를 공유하여 사회적 상호작용 문제를 해결합니다.

벤치마크와 남은 갭: BDD100K, nuScenes, Bench2Drive, Reason2Drive 등의 벤치마크가 존재하지만, 통합적인 "AI 운전면허" 벤치마크의 부재가 핵심 갭입니다. 인간 운전면허 시험처럼 다양한 시나리오, 안전 판단, 윤리적 딜레마를 포괄적으로 평가하는 표준이 아직 없습니다.

8.4 드론 및 항법

공중 및 지상 이동 로봇에서도 VLA의 적용이 확대되고 있습니다.

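CognitiveDrone [142]: A cognitive drone system controlled by natural-language instructions. It interprets and executes commands such as "go around the right side of that red building."
RaceVLA [143]: Applies a VLA to the high-speed setting of drone racing, an extreme testbed that demands millisecond-level reaction and precise path tracking at the same time.
NaviLa, Uni-NaVid: Apply VLAs to indoor navigation for legged robots. They understand instructions such as "go to the kitchen and bring the red cup" and carry out path planning and obstacle avoidance.
Mobility VLA [146]: A VLA for wheeled mobile robots that integrates autonomous navigation and object interaction in indoor and outdoor environments.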

이 도메인의 공통적 과제는 3D 공간에서의 실시간 항법과 동적 장애물 회피를 하나의 언어-시각-행동 프레임워크로 통합하는 것입니다.

8.5 의료 및 수술 로봇

의료 분야는 VLA 적용의 높은 잠재력과 함께 가장 엄격한 제약 조건을 동시에 갖는 도메인입니다.

대표 연구:

RoboNurse-VLA [147]: Targets precise grasping in the operating room, where accurately grasping and handing over surgical instruments is the core task.

도메인 고유의 제약이 굉장히 까다롭습니다:

환자 데이터 프라이버시: 의료 데이터의 외부 전송이 법적으로 제한되므로, 온프레미스(on-premise) 추론이 필수입니다. 클라우드 API에 의존하는 VLA 배포 전략은 이 도메인에서 사용할 수 없습니다.
소량 데이터 문제: 특정 수술 절차나 환자별 데이터는 본질적으로 소량입니다. 대규모 사전학습 백본을 소량 데이터로 효과적으로 미세조정하는 능력이 핵심입니다.
안전-크리티컬 시스템: 수술 로봇의 오작동은 환자 생명에 직결됩니다. 형식 검증(formal verification)이나 안전 보증(safety assurance)에 대한 요구가 어떤 도메인보다도 엄격합니다.

이러한 제약은 효율성(7장)의 모든 차원 — 모델 경량화, 온디바이스 추론, 데이터 효율성 — 이 의료 도메인에서 특히 절실함을 의미합니다.

8.6 농업 및 산업

실용적 가치가 높은 산업 응용에서도 VLA의 잠재력이 탐색되고 있습니다.

과수원 사과 수확 (Zhang et al. [6]): 자연어 지시("익은 사과만 수확해")에 따라 로봇이 과일을 선별적으로 수확하는 시스템입니다. 비정형 환경(나뭇가지, 잎, 다양한 조명)에서의 시각 이해와 부드러운 파지가 동시에 요구됩니다.
CIPHER: 자연어 지시로 3D 프린팅 검사 작업을 전환하는 시스템입니다. "이 부분의 표면 품질을 검사해"와 같은 지시에 따라 검사 절차를 동적으로 변경합니다. 산업 공정의 유연성을 VLA로 구현하는 사례입니다.
ObjectVLA [148]: A VLA that can manipulate novel objects without prior demonstrations. It removes the cost of collecting demonstration data every time a new part or product enters an industrial site.

산업 도메인의 공통적 요구사항은 유연성(flexibility)입니다. 제품 종류, 작업 내용, 환경 조건이 빈번히 변하는 산업 현장에서, 자연어 지시만으로 작업을 전환할 수 있는 VLA의 능력은 높은 실용적 가치를 지닙니다.

8.7 인터랙티브 AR 및 GUI 에이전트

여기서 재미있는 확장이 이루어집니다. 물리적 로봇을 넘어, VLA의 "행동 생성" 능력이 디지털 인터페이스의 자율 조작으로도 확장되는 겁니다.

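ShowUI [149]: Implements a GUI (Graphical User Interface) agent within the VLA framework. It understands what is on screen and generates actions such as clicks, scrolls, and text input in response to instructions like "open the settings menu and turn off Wi-Fi."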
공간 접지(Spatial Grounding): AR(Augmented Reality) 환경에서 가상 객체를 물리 세계에 정확히 배치하기 위해 VLA의 공간 이해 능력을 활용합니다.
인간-AI 협력 항법: 증강현실 환경에서 사용자와 AI가 협력하여 복잡한 환경을 탐색하는 시나리오입니다.

이 도메인이 시사하는 바가 큽니다. VLA의 핵심 구성 요소 — 시각 이해, 언어 추론, 행동 생성 — 가 물리적 로봇 이외의 영역에서도 강력한 프레임워크가 될 수 있다는 것이거든요. "행동"의 정의를 물리적 모터 명령에서 디지털 인터페이스 조작으로 확장하는 겁니다.

8.8 도메인 간 비교 요약

자, 그러면 지금까지 본 도메인들을 한눈에 비교해 봅시다.

도메인	행동 공간	안전 수준	실시간 요구	데이터 가용성	VLA 성숙도
테이블탑 조작	6-7 DoF	낮음	5-50Hz	풍부	높음
휴머노이드	30+ DoF	중간	50-120Hz	부족	초기
자율주행	다중 추상화	매우 높음	30Hz+	풍부	중간
드론/항법	4-6 DoF	중간	30Hz+	중간	초기
의료/수술	6-7 DoF	매우 높음	10-30Hz	매우 부족	매우 초기
농업/산업	6-7 DoF	낮음-중간	5-10Hz	부족	초기
GUI 에이전트	디지털 조작	낮음	실시간 불요	풍부	중간

※ 이 주파수 범위는 각 도메인의 일반적 요구사항을 정리한 것이며, 개별 서베이에서 직접 제시한 수치가 아닌 저자의 종합 정리입니다.

이 표에서 두드러지는 패턴이 있습니다. VLA의 성숙도가 데이터 가용성과 안전 요구의 반비례로 결정된다는 것입니다. 데이터가 풍부하고 안전 제약이 낮은 테이블탑 조작에서 가장 빠르게 발전하고, 데이터가 부족하고 안전이 최우선인 의료 분야에서 가장 느리게 진행됩니다. 이 격차를 좁히는 것이 VLA 연구의 다음 단계에서 해결해야 할 핵심 과제입니다.

Motivation Chain: 효율화의 동기 사슬

Large VLAs cannot be deployed (RT-2 55B: 330-1000ms inference; the weights alone take about 110GB at 16-bit precision)

→ 모델 축소 연구 시작(OpenVLA 7B: 8분의 1 크기로 성능 유지)

→ 7B도 여전히 무겁다(16-24GB VRAM, 166ms 지연)

→ 극단적 경량화(SmolVLA 450M, BitVLA 1비트 양자화)

→ 경량화의 성능 저하 우려

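→ RL post-training as a candidate remedy (lightweight model + RL; note that RIPT-VLA's 4%→97% gain is measured against an SFT baseline, not as a recovery from compression)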

→ 동적 추론(DeeR-VLA: 쉬운 입력은 얕은 레이어, 어려운 입력은 깊은 레이어)

→ 토큰 최적화(FAST, VLA-Cache: 연산량 자체를 줄임)

효율화 기법 비교: 핵심 차별점

기법	핵심 원리	대표 모델	압축 효과	성능 영향
양자화	가중치 비트 수 축소	BitVLA(1비트), SQIL(INT4)	메모리 3.36배↓	경미한 성능 저하
프루닝	불필요한 레이어/뉴런 제거	SmolVLA(L/2 제거), FLOWER	50% 레이어 제거 가능	태스크 의존적
증류	대형→소형 지식 전달	TinyVLA	파라미터 수 대폭↓	교사 모델 성능에 근접
토큰 최적화	시각/행동 토큰 수 축소	FAST, VLA-Cache, VOTE	토큰 5-13배↓	성능 유지
효율적 아키텍처	어텐션/구조 자체 개선	SARA-RT, MoLe-VLA	추론 속도 2-5배↑	설계 의존적
동적 추론	입력 난이도별 연산량 조절	DeeR-VLA	평균 연산 30-50%↓	재학습 불필요

직관적 한줄 설명: 효율화와 응용 편

양자화: "고화질 사진을 적절히 압축한 JPEG — 파일은 작아지지만 눈에는 거의 같아 보임"
Pruning: "Cutting dead branches off a tree: the tree (the model) gets lighter and copes with the wind (inference) more easily"
증류: "명인의 기술을 제자에게 전수 — 제자는 작지만 핵심 기술은 보존"
토큰 캐싱(VLA-Cache): "매 프레임 배경을 다시 그리지 않고 캐시 — 변하는 부분만 새로 계산"
동적 추론(DeeR-VLA): "쉬운 문제는 빨리 풀고 넘기고, 어려운 문제만 깊이 고민하는 시험 전략"
MoE(Mixture-of-Experts): "모든 의사가 모든 환자를 보는 게 아니라, 전문 과목별로 배정 — 각 전문가는 작지만 전체 역량은 큼"

Self-Check Questions: Section 7-8

Q1: BitVLA의 1비트(삼진) 양자화는 어떻게 작동하며, 왜 성능이 유지되는가?

답: BitVLA는 모델 가중치를 {-1, 0, +1}의 세 값으로 제한합니다(삼진 양자화). 이로 인해 곱셈 연산이 덧셈/뺄셈으로 대체되어 메모리와 연산이 극적으로 줄어듭니다(3.36배 압축). 성능이 유지되는 이유는 (1) 양자화 인식 학습(QAT)으로 양자화 오차를 학습 과정에서 보상하고, (2) VLA의 행동 출력이 고정밀을 요구하지 않는 경우가 많기 때문입니다.

Q2: 테이블탑 조작과 자율주행 VLA의 핵심적 도메인 차이 3가지를 설명하라.

답: (1) 안전 수준: 테이블탑은 실패해도 물체 파손 수준이지만, 자율주행은 인명 피해 가능. (2) 제어 주파수: 테이블탑은 5-50Hz로 충분하지만, 자율주행은 30Hz+ 실시간 응답 필수. (3) 환경 다양성: 테이블탑은 제한된 작업 공간이지만, 자율주행은 무한히 다양한 도로/날씨/교통 상황. 이 차이로 인해 두 도메인의 VLA는 같은 기술적 DNA를 공유하면서도 독립적으로 진화하고 있습니다.

Q3: LIBERO 벤치마크가 "포화(saturation)"에 도달했다는 것은 무엇을 의미하며, 이것이 VLA 연구에 주는 시사점은?

답: LIBERO-Object(99.8%), LIBERO-Spatial(98.8%) 등에서 성공률이 거의 100%에 도달하여, 이 벤치마크로는 더 이상 모델 간 성능 차이를 구분할 수 없게 되었습니다. 이는 (1) 단순한 단일 환경 조작 태스크는 VLA가 사실상 해결했음을 의미하지만, (2) LIBERO-Long(96.6%)처럼 장시간 복합 태스크는 여전히 도전적이고, (3) 실세계 일반화 능력을 측정하는 새로운 벤치마크의 필요성을 시사합니다.

Open Research Questions: Section 7-8

도메인 간 전이: 테이블탑 VLA의 효율화 기법이 자율주행이나 의료 로봇에도 동일하게 적용 가능한가? 도메인 특성에 따른 효율화 전략의 차이는?

의료 로봇 VLA: 안전 요구가 극도로 높고 데이터가 희소한 의료 도메인에서 VLA가 실용화되려면 어떤 돌파구가 필요한가?

벤치마크 설계: LIBERO 포화 이후, 실세계 일반화를 측정할 수 있는 차세대 벤치마크는 어떤 특성을 가져야 하는가?

9. 데이터셋, 벤치마크, 시뮬레이터

자, 이번 시간에는 VLA 연구의 인프라를 다루겠습니다. 모델 아키텍처가 아무리 혁신적이어도, 그것을 뒷받침하는 데이터셋, 벤치마크, 시뮬레이터가 없으면 연구가 전진할 수 없습니다. 이 세 가지가 삼위일체를 이루어야 비로소 VLA 연구가 돌아가는 겁니다. 이번 장에서는 VLA 생태계를 떠받치는 데이터 인프라를 체계적으로 살펴보겠습니다.

9.1 로봇 학습 데이터셋

9.1.1 대규모 교차 체현 데이터셋의 부상

VLA 연구 초기에는 각 연구 그룹이 자체 로봇과 환경에서 소규모 데이터셋을 구축하는 것이 일반적이었습니다. MIME, RoboTurk, RoboNet 등이 이 시기의 대표적 산물인데요, 수천에서 수만 에피소드 규모로, 특정 로봇 플랫폼과 제한된 과제에 초점을 맞추고 있었습니다. 그런데 사전학습된 대형 모델의 잠재력을 끌어내려면 훨씬 더 크고 다양한 데이터가 필요했거든요.

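The dataset that drove this paradigm shift is Open X-Embodiment (OXE) [19]. Twenty-one research institutions pooled more than one million episodes, collected on 22 robot embodiments, into a single unified format, and you could fairly call it the ImageNet of VLA research. It covers 527 skills and was the first large-scale demonstration that cross-embodiment learning works. The key point here: in the OXE paper, RT-1-X trained on the OXE mixture reached roughly 50% higher average success than policies trained on a single dataset, and RT-2-X improved on RT-2 by about 3x on emergent skills. It is a dramatic demonstration of the power of data diversity.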

자, 그러면 주요 로봇 학습 데이터셋을 표로 정리해 보겠습니다.

데이터셋	규모	로봇 플랫폼	핵심 특성
Open X-Embodiment (OXE) [19]	1M+ episodes, 22 robot embodiments, 60+ constituent datasets	22 platforms	Largest cross-embodiment dataset, covers 527 skills
BridgeData V2	60K+ trajectories, 24 environments	WidowX	Cross-domain language annotations, diverse environments
DROID	76K trajectories, 564 scenes, 86 tasks	Franka Panda	"In the wild" teleoperation, real-world environment diversity
RT-1 [12] Kitchen	130K+ 실제 시연	Everyday Robots	700+ 일상 활동, 대규모 실세계 수집
BC-Z	25K+ 에피소드	7-DoF 로봇 팔	100개 작업, 반자율 수집 프로토콜
MIME, RoboTurk, RoboNet	다양 (수천~수만)	다양	초기 벤치마크 데이터셋, 역사적 의의
RH20T	147개 작업	다양	원샷(one-shot) 학습 지원
EgoDex	829시간	인간 손	밀집 3D 손/손가락 추적, 손재주 학습
Ego4D / EPIC-Kitchens	수천 시간	인간	자아중심(egocentric) 비디오, VLM 사전학습용
GraspVerse (14개 서베이 외 출처)	10억+ 샘플	시뮬레이션	합성 파지(grasp) 데이터, 대규모 합성 생성

9.1.2 인간 비디오 데이터의 전략적 활용

이게 왜 중요하냐면, 로봇 데이터의 수집 비용이 인터넷 텍스트나 이미지에 비해 압도적으로 높기 때문입니다. 이 병목을 우회하는 핵심 전략 중 하나가 바로 인간 비디오 데이터의 활용입니다. Ego4D(3,670시간), EPIC-Kitchens(100시간+), EgoDex(829시간) 등은 인간이 일상에서 수행하는 조작 활동을 자아중심 시점에서 촬영한 것으로, 로봇이 직접 수집하지 않고도 "어떻게 물체를 다루는가"에 대한 풍부한 시각적 사전지식을 제공합니다.

구체적인 사례를 보겠습니다. GR-2가 인간 비디오 사전학습 후 로봇 미세조정으로 우수한 성능을 달성했고요, HPT [96]가 인간 손 데이터와 로봇 데이터를 혼합하여 교차 체현 일반화를 향상시켰습니다. 이런 사례들이 이 전략의 유효성을 입증하는 거죠. 다만, 인간과 로봇의 형태학적(morphological) 차이로 인한 도메인 갭은 여전히 해결해야 할 과제입니다. EgoDex의 밀집 3D 손가락 추적 데이터는 이 갭을 줄이기 위한 구체적 시도로, 로봇 손의 정밀 제어에 직접 활용 가능한 형태의 인간 데이터를 제공합니다.

9.1.3 합성 데이터와 자율 수집

데이터 병목의 또 다른 해법을 보겠습니다. 합성 데이터 생성과 자율 수집입니다. GraspVerse(14개 서베이 외 출처)는 10억 건 이상의 합성 파지 데이터를 생성하여, 시뮬레이션에서의 대규모 사전학습을 가능케 했습니다. SOAR(Self-supervised Autonomous Robot)와 같은 자율 수집 파이프라인은 로봇이 스스로 데이터를 수집하고 레이블링하는 방식으로, 인간 감독 없이도 데이터셋을 확장할 수 있는 가능성을 보여줍니다.

여기서 핵심은, SmolVLA [32] (450M 파라미터)가 데이터 품질과 커리큘럼에 집중하여 훨씬 큰 모델들과 경쟁력 있는 성능을 달성했다는 사실입니다. 이건 단순한 데이터 규모 확대보다 데이터 품질, 다양성, 그리고 학습 커리큘럼의 설계가 더 중요할 수 있음을 시사하는 거예요. 비유하자면, 영어 공부를 할 때 영어 문장을 아무거나 100만 개 읽는 것보다, 잘 골라진 1만 개를 체계적으로 공부하는 게 더 효과적일 수 있다는 거죠.

9.2 시뮬레이션 벤치마크

9.2.1 매니퓰레이션 벤치마크

자, 그러면 시뮬레이션 벤치마크로 넘어가겠습니다. 시뮬레이션 벤치마크는 VLA 모델의 성능을 체계적으로 비교하고, 실세계 실험 전 신속한 프로토타이핑을 가능케 하는 핵심 인프라입니다. 주요 벤치마크의 현황을 표로 정리하면 다음과 같습니다.

벤치마크	도메인	핵심 지표	2025년 최고 성능
LIBERO-Spatial/Object/Goal/Long	매니퓰레이션	성공률(%)	Spatial 98.8%, Object 99.8%, Goal 98.2%, Long 96.6%
CALVIN	다단계 조작	평균 시퀀스 길이(1-5)	4.44 (DreamVLA [64])
RLBench / RLBench2	RGB-D 조작	성공률	다양
Meta-World	다중 기술	성공률	다양
SIMPLER	심-투-리얼 전이	교정된 성공률	다양
THE COLOSSEUM	분포 이동 강건성	성공률	다양
VLABench	언어 조건 조작	성공률	다양
MIKASA-Robo	메모리 중심	부분 관측 조작 성공률	다양

LIBERO 스위트는 VLA 연구에서 가장 널리 사용되는 벤치마크 중 하나인데요, 난이도에 따라 네 가지 하위 벤치마크를 제공합니다. Spatial(공간 관계 이해), Object(객체 식별), Goal(목표 달성), Long(장기 과제)로 구성되며, 2025년 현재 Spatial과 Object에서는 거의 포화 상태(98-99%)에 도달했습니다. 이게 무슨 의미냐면, 단기적이고 단순한 조작 과제에서 VLA 모델이 이미 충분한 성능을 달성했다는 뜻이고, 연구의 초점이 더 복잡한 장기 과제로 이동해야 한다는 겁니다.

CALVIN은 다단계 연속 과제를 평가하는 벤치마크로, 모델이 최대 5개의 연속 명령을 수행하는 능력을 측정합니다. 평균 시퀀스 길이(Average Sequence Length)로 성능을 측정하며, DreamVLA [64]가 4.44로 최고 성능을 기록했습니다. 이 벤치마크에서 월드 모델 기반 방법(VIP, DreamVLA [64], WorldVLA [76])이 지배적인 성능을 보이는데요, 이건 장기 과제에서 미래 예측 능력이 얼마나 중요한지를 방증하는 것입니다.
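The average sequence length metric described above can be computed as in the following sketch. It assumes each evaluation chain is summarized by how many of its (up to five) consecutive instructions were completed before the first failure; the function name `calvin_metrics` and the sample numbers are illustrative, not benchmark results.

```python
def calvin_metrics(completed_per_chain, chain_length=5):
    """Return (success rate at each chain position, average sequence length).

    completed_per_chain[i] is the number of consecutive instructions
    solved in evaluation chain i before the first failure.
    """
    n = len(completed_per_chain)
    success_at_k = [
        sum(1 for c in completed_per_chain if c >= k) / n
        for k in range(1, chain_length + 1)
    ]
    average_length = sum(completed_per_chain) / n
    return success_at_k, average_length


# Five made-up evaluation chains.
rates, avg_len = calvin_metrics([5, 5, 3, 0, 4])
print(rates, avg_len)
```

The average length equals the sum of the per-position success rates, which is why a score such as 4.44 implies that most chains are completed almost to the end.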

차세대 벤치마크 (2025-2026)

기존 벤치마크의 포화 문제를 해결하기 위한 새로운 평가 프레임워크들이 등장하고 있습니다. 하나씩 보겠습니다.

RoboArena [150]: 실세계-시뮬레이션 자동 변환 프레임워크로, 실세계 태스크를 자동으로 시뮬레이션 환경에 재현하여 대규모 벤치마킹을 가능하게 합니다.
RoboCasa365 [151]: 365개 태스크, 2,000개 이상의 주방 장면을 포함하는 대규모 가정환경 벤치마크입니다.
WorldGym [152]: 행동 조건부 월드 모델을 평가 환경으로 활용하는 새로운 패러다임입니다.
WorldBench [41] (Hu et al., 2025): 자율주행 VLA를 위한 통합 평가 플랫폼으로, 개방형/폐쇄형 루프 평가를 통합합니다.

이들 벤치마크는 기존 LIBERO, CALVIN의 포화를 넘어, 실세계 일반화와 도메인 다양성을 측정하는 방향으로 평가 패러다임을 확장하고 있는 겁니다.

9.2.2 자율주행 벤치마크

자율주행 VLA를 위한 벤치마크는 매니퓰레이션과 다른 고유한 요구사항을 가집니다. 안전성, 실시간성, 그리고 사회적 규범 준수가 핵심 평가 축입니다.

벤치마크	도메인	핵심 지표	특성
Bench2Drive	자율주행 (CARLA)	폐루프 경로 성공률	220 경로, 44 시나리오
nuScenes / nuPlan	자율주행 (실세계)	L2 궤적 오차	대규모 실세계 데이터
Reason2Drive	주행 추론	CoT QA 일관성	600K video-text 쌍(CoT QA 주석 포함), 추론 과정 평가

Bench2Drive는 CARLA 시뮬레이터 위에 구축된 폐루프(closed-loop) 벤치마크로, 220개 경로와 44개 시나리오에서 에이전트의 종합적 주행 능력을 평가합니다. 이게 왜 중요하냐면, 개루프(open-loop) 평가에서 높은 점수를 받은 모델이 폐루프에서는 실패하는 경우가 빈번하기 때문입니다. 폐루프 평가의 필수성이 강조되는 이유가 여기에 있습니다. Reason2Drive는 단순한 경로 추적을 넘어 "왜 이 행동을 선택했는가"에 대한 추론 과정을 평가하는 새로운 패러다임의 벤치마크입니다.
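For reference, the open-loop metric that closed-loop benchmarks are contrasted with is usually an L2 displacement error between predicted and logged waypoints. The sketch below assumes 2D waypoints at matching timestamps; `open_loop_l2` is a hypothetical helper, not part of any benchmark toolkit.

```python
import math


def open_loop_l2(predicted, ground_truth):
    """Mean Euclidean distance between predicted and logged waypoints."""
    if len(predicted) != len(ground_truth):
        raise ValueError("trajectories must have the same number of waypoints")
    distances = [
        math.hypot(px - gx, py - gy)
        for (px, py), (gx, gy) in zip(predicted, ground_truth)
    ]
    return sum(distances) / len(distances)


# Made-up waypoints in meters.
error = open_loop_l2([(0.0, 0.0), (3.0, 4.0)], [(0.0, 0.0), (0.0, 0.0)])
print(error)
```

Because this score is computed against a fixed log, the model's errors never feed back into what it sees next, which is one reason a low L2 error does not guarantee success once the loop is closed.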

9.3 시뮬레이터 생태계

자, 그러면 VLA 연구를 지탱하는 시뮬레이터 생태계를 도메인별로 살펴보겠습니다.

매니퓰레이션 시뮬레이터:

MuJoCo: 물리 시뮬레이션의 사실상 표준입니다. 빠른 연산 속도와 정확한 접촉 역학이 강점이죠.
SAPIEN: 관절체(articulated object) 조작에 특화된 시뮬레이터로, 서랍 열기, 수도꼭지 조작 등 일상 환경의 상호작용을 지원합니다.
RLBench: a benchmark and simulator built on CoppeliaSim that provides more than 100 predefined tasks.
AI2-THOR / Habitat: 실내 네비게이션과 조작을 결합한 시뮬레이터로, 체현 AI(embodied AI) 연구의 주요 플랫폼입니다.
Isaac Gym (NVIDIA): GPU 가속 대규모 병렬 시뮬레이션을 지원하며, 수천 개의 환경을 동시에 실행할 수 있어 RL 학습에 최적화되어 있습니다.

자율주행 시뮬레이터:

CARLA: 오픈소스 자율주행 시뮬레이터의 대표격입니다. 다양한 날씨, 교통 시나리오, 센서 모달리티를 지원합니다.
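nuPlan: a closed-loop planning benchmark and simulator built on its own large-scale real-world driving logs, released by Motional, the team behind nuScenes.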

차세대 범용 시뮬레이터:

Genesis (14개 서베이 외 출처): GPU 가속 물리 엔진으로, 다양한 물리 솔버를 통합하여 범용적인 로봇 시뮬레이션을 목표로 합니다. 기존 시뮬레이터 대비 10-100배의 속도 향상을 주장하고 있어요.
UniSim [101]: 이건 아주 흥미로운 접근인데요, 행동 조건부 비디오 디퓨전(action-conditioned video diffusion) 기반의 "학습된 시뮬레이터"입니다. 명시적 물리 엔진 없이 데이터에서 직접 환경 역학을 학습하는 거예요. 비유하자면, 물리 법칙을 명시적으로 코딩하는 대신에 수많은 물리 현상을 관찰해서 "물리가 어떻게 동작하는지"를 데이터로부터 배우는 겁니다. 전통적 시뮬레이터의 현실성 한계를 우회하는 혁신적 접근이죠.

핵심 갭: 현재 시뮬레이터 생태계의 가장 큰 한계는 통합된 교차 체현/교차 과제 벤치마크의 부재입니다. 각 시뮬레이터가 고유한 과제 정의, 로봇 모델, 평가 프로토콜을 사용하기 때문에, 서로 다른 시뮬레이터에서 보고된 결과를 직접 비교하는 것은 사실상 불가능합니다. 컴퓨터 비전 분야에서 ImageNet이 수행했던 통합 벤치마크의 역할이 로봇 학습 분야에서는 아직 부재한 셈입니다.

9.4 평가 프로토콜의 한계와 개선 방향

9.4.1 재현성의 위기

자, 이건 좀 불편한 얘기인데요, VLA 연구에서 보고되는 성공률 수치는 종종 오해를 불러일으킵니다. 일부 연구에서 시드(seed)만 변경해도 성공률이 30% 이상 변동하는 것이 관찰되었거든요. 이게 무슨 의미냐면, 보고된 "최고 성능"이 통계적으로 유의미하지 않을 수 있다는 겁니다. 환경의 초기 조건, 물체의 미세한 배치 변화, 시뮬레이터의 물리 엔진 비결정성 등이 이러한 분산의 원인입니다.

9.4.2 시뮬레이션-실세계 괴리

시뮬레이션에서 높은 성공률을 달성한 모델이 실세계에서 실패하는 현상은 여전히 만연합니다. 접촉 역학의 부정확성, 시각적 사실성의 한계, 그리고 실세계의 예측 불가능한 교란 등이 주요 원인인데요. SIMPLER 벤치마크가 교정된(calibrated) 심-투-리얼 평가를 제공하려는 시도를 하고 있지만, 근본적 해결에는 이르지 못했습니다.

9.4.3 평가 지표의 단일성

현재 대부분의 벤치마크는 "성공률"이라는 단일 지표에 의존합니다. 그런데 실세계 배포를 고려하면 이것만으로는 부족합니다.

충돌 회피: 과제를 완수하더라도 환경과의 불필요한 충돌은 위험합니다.
실패 복구: 실패 시 안전한 상태로 복귀하는 능력은 보고되지 않습니다.
에너지 효율: 동일한 과제를 더 적은 에너지로 수행하는 것은 실용적으로 중요합니다.
적대적 강건성: 의도적인 교란에 대한 저항성은 안전 관련 응용에서 필수적입니다.
추론 지연 시간: 모델의 추론 속도는 실시간 제어 가능 여부를 결정하지만, 성공률과 함께 체계적으로 보고되는 경우가 드뭅니다.

자율주행 분야에서는 이 문제가 더욱 심각합니다. 제어 안전성과 언어 충실도를 동시에 평가하는 통합 "AI 운전면허" 벤치마크가 부재하며, 개루프 L2 오차와 같은 프록시 지표가 실제 주행 안전성과 약한 상관관계를 보이는 것이 반복적으로 지적되고 있습니다.

9.4.4 개선 제안

이러한 한계를 극복하기 위해 다음 두 가지 트랙의 평가 체계를 제안합니다.

(i) 시뮬레이션 트랙: 고정된 시드, 데이터 분할, 기준 모델을 공유하는 표준화된 시뮬레이션 평가입니다. 모든 연구가 동일 조건에서 비교 가능하도록 환경 설정을 완전히 재현 가능한 형태로 공개하고, 최소 10개 이상의 시드에서 평균과 분산을 보고하는 것을 의무화해야 합니다.
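A minimal sketch of the seed-reporting rule proposed for the simulation track might look like this. The ten-seed minimum comes from the text above; the function name `summarize_seeds` and the example success rates are made up for illustration.

```python
import math
import statistics


def summarize_seeds(success_rates, min_seeds=10):
    """Summarize per-seed success rates; refuse to report with too few seeds."""
    n = len(success_rates)
    if n < min_seeds:
        raise ValueError(f"need at least {min_seeds} seeds, got {n}")
    mean = statistics.mean(success_rates)
    std = statistics.stdev(success_rates)  # sample standard deviation
    return {
        "n_seeds": n,
        "mean": mean,
        "std": std,
        "sem": std / math.sqrt(n),  # standard error of the mean
        "min": min(success_rates),
        "max": max(success_rates),
    }


report = summarize_seeds([0.9, 0.8, 0.85, 0.95, 0.7, 0.9, 0.8, 0.85, 0.9, 0.85])
print(report)
```

Reporting the minimum and maximum alongside the mean makes seed-induced swings of the kind discussed in 9.4.1 visible instead of hiding them behind a single best number.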

(ii) 실세계 커뮤니티 트랙: 공유 하드웨어 프로토콜에 기반한 실세계 평가입니다. 표준화된 로봇 플랫폼(예: Franka Emika, UR5), 과제 정의, 평가 절차를 커뮤니티가 합의하여 정의하고, 각 연구 그룹이 동일한 프로토콜로 실세계 성능을 보고하는 방식입니다.

10. 미해결 문제와 미래 전망

Okay, this is where it gets really interesting. Reading across the 14 VLA survey papers, we identified 11 core challenges. Each survey touches on some of them, but the overall structure only comes into view through cross-survey analysis.

10.1 데이터 병목

VLA 연구의 가장 근본적인 제약은 데이터입니다. 현재 최대 규모인 OXE [19] 데이터셋도 확장 버전 기준 약 250만 에피소드(원본 v1은 100만+ 에피소드)에 불과하며, 이는 GPT-2의 학습 코퍼스(WebText, 수십억 토큰)와 비교하면 극히 미미한 수준입니다. 더구나 로봇 데이터는 인터넷 텍스트와 달리 수집 비용이 에피소드당 수십 달러에 달하며, 각 로봇 플랫폼의 고유한 형태학에 종속됩니다. 비유하자면, LLM은 인터넷이라는 거대한 바다에서 자유롭게 물을 퍼 올리는 것인데, VLA는 우물 하나하나를 직접 파야 하는 상황인 거죠.

해결 방향:

시뮬레이션 합성: GraspVerse(14개 서베이 외 출처)와 같은 대규모 합성 데이터 생성. 도메인 무작위화(domain randomization)와 결합하여 심-투-리얼 전이를 촉진합니다.
인간 비디오 활용: Ego4D, EPIC-Kitchens 등에서 조작의 시각적 사전지식을 추출합니다.
자율 수집(SOAR): 로봇이 스스로 탐색하고 데이터를 수집하는 자기 지도 파이프라인입니다.
능동적 선정(active curation): 모든 데이터가 동등하지 않습니다. 모델의 약점을 타겟으로 데이터를 선별 수집하는 거예요.

교차 인사이트: SmolVLA [32]가 450M 파라미터로도 7B 모델과 경쟁하는 사례는, 데이터 스케일링보다 데이터 품질과 다양성이 더 중요할 수 있음을 시사합니다. 이건 "더 많은 데이터"가 아닌 "더 나은 데이터"로의 패러다임 전환을 예고하는 겁니다.

10.2 일반화의 벽

VLA 모델의 일반화 성능은 평가 조건에 따라 극적으로 변화합니다. 이 수치를 보시면 상황이 명확해집니다.

도메인 내(in-domain): 학습 환경과 동일한 조건에서 80-90%의 성공률
교차 도메인(cross-domain): 새로운 객체나 환경에서 40-70%로 하락
제로샷(zero-shot): 완전히 새로운 과제에서 20-50%까지 하락

시험 범위 안에서는 잘하는데, 처음 보는 문제는 못 푸는 것과 비슷한 상황이죠. 이 격차를 좁히기 위한 시도가 다각도로 진행 중입니다. HPT [96](Heterogeneous Pretrained Transformers)는 다양한 체현에서의 사전학습을 통해 교차 체현 일반화를, UniAct는 행동 공간의 통합 표현을 통해 체현 불가지론적(embodiment-agnostic) 정책을, BridgeVLA는 웹 규모 시각 지식과 로봇 행동의 연결을 각각 시도합니다.

시뮬레이션에서 실세계로의 전이(sim-to-real transfer)는 여전히 미해결 과제입니다. 물리적 접촉의 부정확성, 시각적 도메인 갭, 그리고 실세계의 비정형적(non-stationary) 환경이 주요 장벽이죠. GEN-0의 스케일링 법칙 연구는 모델과 데이터의 규모를 키우면 일반화가 예측 가능하게 개선된다는 초기 증거를 제시하지만, 이 법칙이 어디까지 유효한지는 아직 불분명합니다.

10.3 실시간 추론

대형 VLM 백본(7B-55B 파라미터)과 디퓨전 기반 행동 생성의 조합은 강력하지만, 실시간 제어에는 치명적인 지연 시간 문제를 야기합니다. 자율주행에서는 최소 30Hz, 매니퓰레이션의 정밀 제어에서는 50Hz 이상의 제어 주파수가 요구되는데요, 단순한 전방 패스(forward pass)만으로도 이 요구를 충족하기 어려운 경우가 많습니다. 이건 엄청 똑똑한데 대답이 너무 느린 사람에게 실시간 격투 게임을 시키는 것과 비슷한 문제입니다.

해결 전략:

계층적 비동기 실행: 고수준 VLM은 낮은 주파수(1-5Hz)로 하위 목표를 생성하고, 경량 저수준 정책이 높은 주파수(50-100Hz)로 실제 제어를 수행합니다. GR00T N1 [21], CogACT [23] 등이 이 접근을 채택합니다.
토큰 캐싱: 이전 추론의 키-값(KV) 캐시를 재활용하여 중복 연산을 제거합니다.
양자화(quantization): FP16, INT8, INT4 등으로 모델 정밀도를 낮추어 추론 속도를 높입니다. 4비트 양자화에서도 성능 저하가 2% 미만인 경우가 보고되었습니다.
가지치기(pruning)와 증류(distillation): 불필요한 파라미터를 제거하거나, 대형 모델의 지식을 소형 모델로 전이합니다.
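The hierarchical execution strategy in the list above can be sketched as a two-rate control loop. This is a single-threaded simulation that only illustrates the rate split; a real system would run the planner asynchronously. `run_hierarchical`, `plan_fn`, and `act_fn` are hypothetical names, and the rates are picked from the ranges quoted above.

```python
def run_hierarchical(control_hz, planner_hz, duration_s, plan_fn, act_fn):
    """Simulate a slow planner feeding subgoals to a fast low-level policy."""
    ticks_per_plan = control_hz // planner_hz
    subgoal = None
    plan_calls = 0
    actions = []
    for tick in range(control_hz * duration_s):
        if tick % ticks_per_plan == 0:
            # The expensive VLM planner runs only on every Nth control tick.
            subgoal = plan_fn(tick)
            plan_calls += 1
        # The lightweight policy runs on every tick with the latest subgoal.
        actions.append(act_fn(subgoal, tick))
    return plan_calls, actions


plan_calls, actions = run_hierarchical(
    control_hz=50,
    planner_hz=5,
    duration_s=2,
    plan_fn=lambda tick: f"subgoal@{tick}",
    act_fn=lambda subgoal, tick: (subgoal, tick),
)
print(plan_calls, len(actions))
```

With these settings the planner is invoked once for every ten control actions, so its latency budget is ten times larger than that of the low-level policy.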

여기서 핵심은 "지능적 희소성(intelligent sparsity)"입니다. 모든 입력에 대해 전체 모델을 활성화하는 대신, 입력의 복잡도에 따라 연산량을 동적으로 조절하는 접근이 부상하고 있습니다.
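One common way to adjust compute to input difficulty is early exit: stop at the first intermediate layer whose output looks confident enough. The sketch below is generic and uses toy layers; DeeR-VLA's actual exit criterion is not reproduced here, and `early_exit_forward` with its threshold is an assumption for illustration.

```python
def early_exit_forward(x, layers, exit_heads, threshold):
    """Run layers in order and stop at the first exit head that is confident.

    Each exit head returns (output, confidence). Returns (output, depth used).
    """
    if not layers or len(layers) != len(exit_heads):
        raise ValueError("need one exit head per layer and at least one layer")
    hidden = x
    for depth, (layer, head) in enumerate(zip(layers, exit_heads), start=1):
        hidden = layer(hidden)
        output, confidence = head(hidden)
        if confidence >= threshold:
            break  # easy input: skip the remaining layers
    return output, depth


# Toy model: each layer adds 1, and confidence grows with depth.
layers = [lambda h: h + 1] * 4
heads = [lambda h: (h * 2, h / 4)] * 4

easy_output, easy_depth = early_exit_forward(0, layers, heads, threshold=0.7)
hard_output, hard_depth = early_exit_forward(0, layers, heads, threshold=2.0)
print(easy_depth, hard_depth)
```

When no head reaches the threshold, the full stack runs, so the worst case costs the same as the original model while easier inputs finish early.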

10.4 장기 과제와 계층적 추론

순수 종단간(end-to-end) 모델은 단일 동작 수준의 과제에서는 뛰어나지만, 다단계 합성 과제에서는 체계적으로 실패합니다. "서랍을 열고, 컵을 꺼내서, 선반에 놓아라"와 같은 과제는 계획, 하위 목표 설정, 진행 상황 모니터링, 그리고 실패 시 재계획을 요구하며, 이건 단일 정책으로 처리하기 어렵습니다.

해결 접근:

Hierarchical decomposition: π0.5 explicitly separates a high-level VLM planner from a low-level action policy.
사고의 연쇄(Chain-of-Thought): CoT-VLA [55]는 행동 생성 전 명시적 추론 단계를 삽입하여, 모델이 "왜" 특정 행동을 선택하는지를 추론합니다.
기술 라이브러리(Skill Library): ReLEP 등은 재사용 가능한 기술 원형을 학습하고 조합하여 복잡한 과제를 구성합니다.

CALVIN 벤치마크에서의 경향이 이 방향의 유효성을 입증합니다. 월드 모델 기반 방법(DreamVLA [64]: 4.44, WorldVLA [76]: 4.38)이 순수 반응적(reactive) 정책 대비 압도적인 성능을 보이며, 미래 상태를 예측하고 이를 계획에 활용하는 능력이 장기 과제 성공의 열쇠임을 보여주고 있습니다.

10.5 안전과 정렬

자, 이건 정말 중요한 이슈입니다. VLA의 안전 문제는 순수 소프트웨어 AI와 질적으로 다릅니다. LLM의 환각(hallucination)이 잘못된 텍스트를 생성하는 것에 그치지만, VLA의 환각은 물리적 충돌, 파손, 심지어 인명 피해로 이어질 수 있거든요. 물리적 실패의 비가역성이 핵심적 차이입니다. ChatGPT가 헛소리를 하면 "아 그거 틀렸네" 하고 넘어갈 수 있지만, 로봇이 헛소리를 하면 유리잔이 깨지거나 사람이 다칠 수 있는 거죠.

현재의 시도:

SafeVLA [75]: VLA에 안전 제약을 명시적으로 통합한 최초의 시도입니다. 안전 관련 학습 데이터와 제약 위반 페널티를 결합합니다.
SafeAuto [140]: 자율주행에서 교통 법규 기반의 심볼릭 거부권(symbolic veto)을 구현합니다. 신경망의 출력이 규칙 기반 안전 검증을 통과해야만 실행되는 이중 구조입니다.
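The dual structure described for SafeAuto, where a neural proposal must pass a rule-based check before execution, can be sketched as follows. The two rules, the 5 m stop-line margin, and all class and function names are invented for illustration and are not SafeAuto's actual rule set or API.

```python
from dataclasses import dataclass


@dataclass
class DrivingState:
    speed_limit_mps: float
    light_is_red: bool
    distance_to_stop_line_m: float


@dataclass
class ProposedAction:
    target_speed_mps: float


def violated_rules(state, action):
    """Return the names of the (hypothetical) traffic rules the action breaks."""
    violations = []
    if action.target_speed_mps > state.speed_limit_mps:
        violations.append("speed_limit")
    near_stop_line = state.distance_to_stop_line_m < 5.0
    if state.light_is_red and near_stop_line and action.target_speed_mps > 0.0:
        violations.append("red_light")
    return violations


def execute_with_veto(state, proposed, fallback):
    """Execute the neural proposal only if no symbolic rule vetoes it."""
    violations = violated_rules(state, proposed)
    if violations:
        return fallback, violations
    return proposed, []


stop = ProposedAction(target_speed_mps=0.0)
at_red_light = DrivingState(speed_limit_mps=13.9, light_is_red=True,
                            distance_to_stop_line_m=3.0)
chosen, reasons = execute_with_veto(at_red_light, ProposedAction(5.0), stop)
print(chosen, reasons)
```

The network never needs to be trusted directly: whatever it proposes, only actions that pass the explicit checks reach the actuators.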

그러나 형식적 검증(formal verification)의 부재는 심각한 과제로 남아 있습니다. 전통적 제어 시스템은 Lyapunov 안정성, 도달 가능성(reachability) 분석 등의 수학적 도구로 안전성을 보장할 수 있지만, 언어 조건 신경망 정책에 대해서는 이러한 검증 방법이 확립되어 있지 않습니다.

자율주행에서 이 문제는 특히 심각합니다. 매니퓰레이션에서의 환각이 물체를 떨어뜨리는 수준에 그칠 수 있지만, 자율주행에서의 환각은 교통사고로 직결됩니다. 이 "환각 위험의 비대칭성"은 자율주행 VLA가 매니퓰레이션 VLA와 근본적으로 다른 안전 요구사항을 가짐을 의미합니다.

10.6 환각과 추론 안정성

LLM 기반 계획기가 물리적으로 불가능한 행동을 생성하는 문제는 VLA의 근본적 약점입니다. "물컵을 90도 기울여서 옮겨라"와 같은 물리적으로 불합리한 계획은, LLM의 상식 추론이 물리적 현실과 괴리될 때 발생합니다.

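SC-VLA [56] (Self-Correcting VLA) introduces explicit failure detection and recovery reasoning, and this self-correction loop is reported to cut the task failure rate by 35% (Zhang et al. [6]). The model monitors the outcome of its own actions and, when the observed result differs from what it expected, generates an alternative action.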

그러나 개방 세계(open world)에서의 환각 검증은 근본적으로 어렵습니다. 학습 데이터에 없는 상황에서 모델의 출력이 "물리적으로 실현 가능한가"를 판단하려면, 모델 자체가 정확한 물리 시뮬레이터 역할을 해야 하는 순환적 문제에 봉착하거든요. 닭이 먼저냐 달걀이 먼저냐와 비슷한 구조적 딜레마입니다.

10.7 다중 모달 통합

현재 VLA 연구는 시각 중심의 편향을 보입니다. 대부분의 모델이 RGB 이미지만을 감각 입력으로 사용하며, 촉각, 힘/토크, 소리, 온도 등 인간이 조작에 활용하는 다른 감각은 거의 무시되고 있습니다.

ForceVLA [78]: 힘/토크 센서 데이터를 VLA에 통합하여, 섬세한 물체 조작에서의 성능을 향상시켰습니다.
TactileVLA [153]: 촉각 센서 입력을 활용하여, 시각만으로는 판단하기 어려운 물체의 물성(경도, 질감 등)을 인지합니다.
OmniVTLA [154]: 시각, 촉각, 언어를 동시에 처리하는 통합 아키텍처를 제안합니다.

이게 왜 중요하냐면, 인간은 불확실성이 높은 상황에서 자동적으로 모달리티를 재가중하거든요. 시각이 불충분할 때 촉각에 더 의존하고, 소음이 심할 때 시각적 단서에 더 집중합니다. 이러한 적응적 모달리티 재가중(adaptive modality reweighting)은 로봇에서 아직 체계적으로 구현되지 않았으며, 다중 모달 VLA의 중요한 미래 방향입니다.
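One simple way to picture adaptive modality reweighting is inverse-variance fusion: the less certain a modality is, the less it counts. This is only a sketch of the idea, not a method from any of the cited models; `fuse_modalities` and the numbers are hypothetical.

```python
def fuse_modalities(estimates):
    """Fuse per-modality estimates, weighting each by 1 / variance.

    estimates maps a modality name to (value, variance).
    Returns (fused value, weight per modality).
    """
    inverse_variances = {name: 1.0 / var for name, (_, var) in estimates.items()}
    total = sum(inverse_variances.values())
    weights = {name: iv / total for name, iv in inverse_variances.items()}
    fused = sum(weights[name] * value for name, (value, _) in estimates.items())
    return fused, weights


# Vision is occluded (high variance), so touch should dominate.
fused, weights = fuse_modalities({"vision": (10.0, 4.0), "touch": (12.0, 1.0)})
print(fused, weights)
```

If the uncertainty estimates change from step to step, the weights shift automatically, which mirrors how a person leans on touch when vision becomes unreliable.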

10.8 인간-로봇 상호작용

현재의 VLA는 "유사 상호작용(pseudo-interaction)"에 머물러 있습니다. 인간이 지시를 내리면 로봇이 이행하는 단방향 소통이 지배적이며, 진정한 양방향 대화형 협업은 거의 구현되지 않았습니다. 마치 일방통행 도로 같은 거예요. 사람이 말하고 로봇은 듣기만 하는 구조죠.

진정한 인간-로봇 상호작용을 위해서는 다음이 필요합니다.

적응형 대화: 로봇이 모호한 지시에 대해 명확화 질문을 하고, 인간의 피드백에 따라 행동을 조정하는 겁니다.
선호 학습(preference learning): 인간의 암묵적 선호(속도, 안전성, 미적 기준 등)를 상호작용을 통해 학습합니다.
인간 피드백 루프: 배포 후에도 인간의 교정 피드백을 통해 지속적으로 개선합니다.

이 분야는 NLP에서의 RLHF(Reinforcement Learning from Human Feedback) 성공에 힘입어, "RLHF for Robotics"라는 새로운 연구 방향이 형성되고 있습니다.

10.9 평가와 벤치마킹

9.4절에서 논의한 평가 한계는 미해결 문제로서 더 근본적인 차원에서 재조명할 필요가 있습니다. 현재 로봇 학습 분야에는 컴퓨터 비전의 ImageNet, NLP의 GLUE/SuperGLUE에 해당하는 통합 벤치마크가 부재합니다.

이 부재의 결과는 심각합니다. 논문 A가 LIBERO에서 98%를, 논문 B가 CALVIN에서 4.44를 보고할 때, 어떤 모델이 "더 나은" 것인지를 판단할 수 없거든요. 시드 무작위성으로 인한 재현성 문제까지 더해지면, VLA 연구의 실질적 진보를 정량적으로 추적하는 것 자체가 어려워집니다.

10.10 윤리와 사회적 영향

VLA의 실세계 배포는 기술적 과제를 넘어 윤리적, 사회적 질문을 제기합니다. 이 부분은 기술 연구자로서도 반드시 인식하고 있어야 할 영역입니다.

프라이버시: 가정이나 직장에서 작동하는 VLA 로봇은 지속적으로 환경을 촬영하고 해석합니다. 이 데이터의 수집, 저장, 활용에 대한 명확한 가이드라인이 필요합니다.
고용 대체: 조작 능력의 발전은 물류, 제조, 서비스 산업에서의 자동화를 가속화하며, 고용 구조의 변화를 초래할 수 있습니다.
의사결정 편향: VLM 백본이 인터넷 데이터에서 학습한 편향이 물리적 행동으로 발현될 수 있습니다. 예를 들어, 특정 인종이나 성별에 대한 편향이 인간-로봇 상호작용에서 차별적 행동으로 이어질 위험이 있습니다.
규제 프레임워크: 자율주행을 제외하면, VLA 로봇의 배포에 대한 규제 프레임워크는 거의 존재하지 않습니다. 인증, 책임 소재, 사고 보고 체계 등이 시급히 마련되어야 합니다.

10.11 교차 서베이 통합 인사이트

자, 여기서부터가 이 서베이의 핵심 기여입니다. 14개의 서베이를 교차 분석하여 도출한 다음 10가지 인사이트는, 개별 서베이에서는 명시적으로 드러나지 않는 창발적(emergent) 패턴입니다.

인사이트 1 -- 수렴의 증거

14개 서베이는 각각 다른 분류체계(taxonomy)를 사용하지만, 궁극적으로 동일한 풍경(landscape)의 서로 다른 투영(projection)입니다. 아키텍처 서베이는 "백본-행동 헤드" 축으로, 학습 서베이는 "사전학습-미세조정" 축으로, 응용 서베이는 "도메인-과제" 축으로 VLA를 분류합니다. 그런데 이 모든 관점에서 "VLM Brain + Generative Action Head"가 최적점으로 수렴하고 있습니다. 이건 2024년 말부터 2025년에 걸쳐 명확해진 추세로, RT-2 [11]에서 시작된 "언어 모델을 행동 모델로" 패러다임이 이제 보편적 합의에 도달했음을 의미합니다.

인사이트 2 -- 스케일 역전 현상

"더 큰 모델이 더 나은 성능을 낸다"는 스케일링 법칙의 직관이 VLA에서는 반드시 성립하지 않습니다. 구체적 증거가 이를 뒷받침합니다.

CLIP-RT (1B) outperforms OpenVLA [15] (7B) on many tasks.
SmolVLA [32] (450M)가 LIBERO에서 7B급 모델과 경쟁적 성능을 보입니다.
3B급 모델(CogACT [23], SpatialVLA [39])이 7B 모델과 동등하거나 우수한 성능을 달성합니다.

이게 왜 그런가 하면, VLA에서는 로봇 데이터의 희소성 때문에, 대형 모델이 과적합하거나 불필요한 용량을 낭비하는 현상이 발생할 수 있기 때문입니다. 데이터 품질, 토큰화 전략, 아키텍처 설계가 파라미터 수보다 중요할 수 있다는 뜻이죠.

인사이트 3 -- 토큰화가 제어 대역폭을 결정

행동 토큰화 방식은 단순한 구현 세부사항이 아니라, 시스템의 근본적 능력을 규정하는 설계 선택입니다.

이산 빈(discrete bin) 토큰화: 구현이 단순하지만 정밀도가 제한됩니다. 1-5Hz 제어에 적합합니다.
디퓨전 기반: 연속적이고 다봉(multimodal) 분포를 표현할 수 있지만, 역확산 과정의 반복이 추론 속도를 저하시킵니다. 5-20Hz 범위입니다.
플로우 매칭(flow matching): 디퓨전 대비 빠른 수렴으로 20-50Hz를 달성합니다.
FAST 토큰화 [20]: 이산 방식의 속도와 연속 방식의 정밀도를 동시에 추구하며, 50-120Hz까지의 제어 주파수를 가능케 합니다.

이 관점이 중요한 이유는, 토큰화 방식의 선택이 1Hz에서 120Hz까지의 제어 주파수를 결정하고, 이것이 수행 가능한 과제의 범위를 근본적으로 규정하기 때문입니다. 느린 제어는 거친 조작만 가능하고, 빠른 제어는 정밀 삽입, 봉합, 악기 연주와 같은 고난도 과제를 가능케 합니다. 단일 서베이에서 명시적으로 다루어지지 않는 교차적 통찰이죠.
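To see why discrete-bin tokenization limits precision, consider the sketch below. It assumes a single action dimension normalized to [-1, 1] and 256 uniform bins, a common choice in autoregressive VLAs; `make_bin_tokenizer` is an illustrative helper, not a library function.

```python
def make_bin_tokenizer(low=-1.0, high=1.0, num_bins=256):
    """Build encode/decode functions for uniform discrete-bin action tokens."""
    width = (high - low) / num_bins

    def encode(value):
        clipped = min(max(value, low), high)
        # The top edge would land in bin num_bins, so clamp it into the last bin.
        return min(int((clipped - low) / width), num_bins - 1)

    def decode(token):
        # Reconstruct the action as the center of its bin.
        return low + (token + 0.5) * width

    return encode, decode, width


encode, decode, width = make_bin_tokenizer()
for action in (-1.0, -0.3, 0.0, 0.77, 1.0):
    token = encode(action)
    print(action, token, decode(token))
```

The round-trip error is bounded by half a bin width, so precision can only improve by adding bins, which enlarges the vocabulary, or by switching to a continuous action head such as diffusion or flow matching.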

인사이트 4 -- 이중 시스템은 선택이 아닌 필수

장기 과제에서 순수 종단간 모델의 한계는 반복적으로 입증되고 있습니다. Daniel Kahneman의 System 1(빠른 직관)/System 2(느린 숙고) 구분이 로봇공학에서 공학적 필수사항으로 입증된 겁니다.

GR00T N1 [21]은 이중 시스템 아키텍처를 채택하여, 고수준 VLM(System 2)이 하위 목표를 생성하고 저수준 디퓨전 정책(System 1)이 이를 실행합니다. 그 결과 단일 시스템 대비 17% 성공률 향상(GR00T N1 원논문 보고 기준)과 28% 충돌률 감소(GR00T N1 원논문 보고 기준)를 달성했습니다. 인지과학의 이론적 구분이 공학적 설계 원칙으로 직접 번역될 수 있음을 보여주는 강력한 증거입니다.

인사이트 5 -- RL 후처리는 BC의 필수 보완재

행동 복제(BC)만으로 학습된 VLA는 구조적 한계를 가집니다. 시연 데이터의 분포를 벗어나면 성능이 급격히 저하되는 분포 이동(distributional shift) 문제가 대표적입니다. 강화학습(RL) 후처리(post-training)는 이 한계를 돌파하는 핵심 수단으로 부상했습니다.

극적인 사례가 이를 증명합니다: SFT만으로 4%에 머물던 성공률이, PPO(Proximal Policy Optimization) 15회 반복만으로 97%로 회복된 경우가 보고되었습니다. BC가 "어떻게 해야 하는가"를 가르친다면, RL은 "무엇이 좋은가"를 학습하게 하는 겁니다. 이 둘의 조합은 선택이 아닌 필수 파이프라인 단계입니다.

인사이트 6 -- 효율성과 성능의 파레토 프론티어 이동

2025년 들어 VLA 연구의 핵심 경쟁 축이 "절대 성능"에서 "컴퓨트 효율성"으로 이동하고 있습니다. 55B 파라미터의 RT-2 [11]가 달성한 성능을, 450M의 SmolVLA [32]가 유사하게 달성하는 것은 100배 이상의 효율성 혁명입니다.

"지능적 희소성(Intelligent Sparsity)" 패러다임이 부상하고 있습니다. 이건 단순히 모델을 줄이는 것이 아니라, 필요한 곳에만 연산을 집중하는 것입니다. LoRA 기반 효율적 미세조정, 전문가 혼합(MoE) 아키텍처, 조기 종료(early exit) 메커니즘 등이 이 패러다임의 구현체입니다. 단순 스케일링 법칙보다 "컴퓨트당 성능(performance per FLOP)"이 핵심 지표로 전환되고 있습니다.

인사이트 7 -- 자율주행 VLA는 별도 진화 경로

매니퓰레이션 VLA와 자율주행 VLA는 동일한 "VLM + Action" 프레임워크를 공유하지만, 실제로는 상당히 다른 진화 경로를 걷고 있습니다. 자율주행은 매니퓰레이션에 비해 다음과 같은 고유한 요구사항을 가집니다.

안전 요구: 실패의 결과가 치명적이며, 사회적 수용 기준이 훨씬 높습니다.
실시간 요구: 30Hz 이상의 제어 주파수가 절대적으로 필수입니다.
사회적 규범: 교통 법규, 양보, 신호 준수 등 사회적 규약의 이해와 준수가 요구됩니다.

두 도메인 간의 기술 교환이 충분히 이루어지지 않고 있다는 점은 아쉬운 대목입니다. 매니퓰레이션의 정밀 제어 기법이 자율주행의 미세 조향에, 자율주행의 안전 검증 프레임워크가 매니퓰레이션의 안전 정책에 기여할 수 있는 잠재적 교차 수분(cross-pollination) 기회가 존재합니다.

인사이트 8 -- 인간 운동학습 이론이 VLA 연구의 미래 지도

Jin et al. [9]이 제안한 Newell의 운동학습 이론과 VLA의 매핑은 단순한 비유가 아닌 체계적 연구 프레임워크로 기능할 잠재력을 가집니다. Newell의 "자유도 동결-해제(freezing-freeing degrees of freedom)" 이론은, VLA의 계층적 기술 학습에서 저차원 행동 공간에서 시작하여 점차 자유도를 확장하는 커리큘럼과 직접적으로 대응됩니다.

아직 미탐색된 영역이 풍부합니다.

소뇌 모델의 로봇 구현: 인간 소뇌의 전방 모델(forward model)과 역모델(inverse model)의 조합이, VLA의 월드 모델과 역역학 정책의 조합으로 번역될 수 있습니다.
맥락 간섭 효과: 인간 운동학습에서 무작위 연습이 차단 연습보다 장기 파지에 유리하다는 효과가, VLA의 학습 커리큘럼에 적용될 수 있습니다.

인사이트 9 -- 월드 모델이 장기 과제의 열쇠

CALVIN 벤치마크에서의 성능 경향은 명확한 메시지를 전달합니다. 시각적 상호작용 예측(VIP, Visual Interaction Prediction) 방법이 지배적 성능을 보이며, WorldVLA [76], DreamVLA [64], CoT-VLA [55] 등 월드 모델을 통합한 접근이 상위권을 휩쓸고 있습니다.

월드 모델은 "행동 전에 상상한다"는 원리를 구현합니다. 특정 행동을 실행했을 때 세계가 어떻게 변화할지를 내부적으로 시뮬레이션하고, 그 결과가 목표에 부합하는지를 평가한 후에야 실제 행동을 실행하는 거예요. 체스 고수가 머릿속으로 여러 수를 두어보고 최선의 수를 선택하는 것과 같은 원리입니다. 이건 장기 과제에서의 계획 능력을 근본적으로 향상시키며, 차세대 VLA의 핵심 분화점(differentiator)이 될 것입니다.

인사이트 10 -- "방어적 AI" 패러다임의 부상

강건성(robustness)이 성능과 동급의 1등 설계 목표로 격상되고 있습니다. 실험실에서 98% 성공률을 달성하더라도, 실세계의 예측 불가능한 교란 하에서 50%로 하락한다면 배포할 수 없기 때문입니다.

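BYOVLA (Bring Your Own VLA): a run-time intervention that finds task-irrelevant image regions the policy is overly sensitive to and minimally edits them, improving visual robustness without fine-tuning the model.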
DreamVLA [64]: 월드 모델을 통한 상상 기반 강건성 향상입니다. 예상치 못한 상황을 내부적으로 시뮬레이션하여 대비합니다.
SafeVLA [75]: 명시적 안전 제약 통합으로 위험 행동을 사전 차단합니다.

실세계 배포에서 강건성은 단순한 "있으면 좋은(nice-to-have)" 속성이 아니라, 시스템의 생존 조건(survival condition)입니다. 이 인식이 연구 커뮤니티에 확산되면서, "방어적 AI(Defensive AI)" 패러다임이 형성되고 있습니다.

10.12 프런티어 모델과 오픈 웨이트 모델의 일반화 격차

2026년 현재 VLA 분야의 가장 뚜렷한 분단선은 비공개 프런티어 모델(Gemini Robotics, π0.5)과 오픈 웨이트 연구 모델 사이의 실세계 일반화 격차입니다. 시뮬레이션 벤치마크(LIBERO, CALVIN)에서 양쪽의 성능이 수렴하고 있음에도, 실세계 제로샷 일반화에서는 여전히 큰 격차가 존재합니다. RoboArena 리더보드에서 경쟁력 있는 제로샷 행동을 보이는 것은 π 계열 모델뿐이라는 분석이 있습니다(Reuss, 2026).

이 격차의 원인으로는 세 가지가 지목됩니다: (1) 데이터 품질·다양성 격차 — 프런티어 랩의 비공개 데이터가 공개 데이터셋보다 품질과 다양성이 우수, (2) 벤치마크 천장 효과 — 시뮬레이션 벤치마크의 포화로 실제 진전이 가려지는 현상, (3) 인프라 규모 격차 — 연구실 규모 vs 산업 규모의 학습 인프라 차이.

ICLR 2026에서는 데이터 품질 큐레이션과 인컨텍스트 학습이 가장 과소 대표된 연구 방향으로 식별되었으며, 이 두 방향이 격차 해소의 열쇠가 될 수 있습니다. 이 문제는 10.1절의 데이터 병목과 10.2절의 일반화 벽 모두와 긴밀히 연결되며, 오픈소스 커뮤니티의 핵심 도전 과제입니다.

10.13 최전선 사례 연구: π 시리즈가 열어가는 두 갈래 프런티어 (2025.11 – 2026.03)

자, 지금까지 10.1~10.12절에서 VLA의 핵심 미해결 과제와 통합 인사이트를 쭉 살펴봤는데요. 그러면 자연스럽게 드는 질문이 있죠 — "그래서 진짜로 얼마나 풀리고 있는 건데?" 2025년 11월과 2026년 3월에 Physical Intelligence(PI)가 잇달아 발표한 두 편의 논문이 이 질문에 대한 가장 구체적인 답을 줍니다. π^*_0.6 [157]은 우리가 6.3절과 Insight 5에서 이야기한 "BC에서 RL로의 전환"을, π_0.6-MEM [158]은 10.4절의 "장기 과제와 메모리 부재"를 각각 정면으로 공략하는 거예요. 두 논문 모두 π0.6 모델(Gemma 3 4B VLM + 860M Action Expert)을 기반으로 하고, VLA의 두 가지 핵심 한계 — "BC의 성능 천장"과 "메모리 부재" — 를 실세계 규모에서 돌파합니다.

10.13.1 π^*_0.6: 경험으로부터 배우는 VLA

[157] · Physical Intelligence · 2025.11

π^*_0.6 [157]은 RECAP(RL with Experience and Corrections via Advantage-conditioned Policies)이라는 방법론을 통해 VLA 모델이 실세계 경험으로부터 스스로 나아지게 하는 범용 RL 후처리 프레임워크입니다. 6.3절에서 다뤘던 RL 후처리 연구들(VLA-RL, RIPT-VLA, ConRFT 등) 기억하시죠? 그 연구들은 대부분 시뮬레이션 벤치마크에서만 검증됐었어요. π^*_0.6는 다릅니다 — 실세계 장시간 복합 조작 태스크에서 대규모 VLA의 end-to-end RL 학습을 처음으로 성공시킨 거예요. 질적 전환점이라고 할 수 있습니다.

핵심 기술 혁신: Advantage Conditioning. 여기서 정말 영리한 부분이 나옵니다. 기존 VLA RL 방법들은 PPO나 GRPO 같은 정책 경사(policy gradient) 기반이었잖아요? RECAP은 완전히 다른 길을 갑니다 — advantage conditioning이라는 방식이에요.

작동 원리를 단계별로 보면: (1) 먼저 분포적 가치 함수를 따로 학습합니다. 670M짜리 소형 VLM 백본을 써서, 각 상태에서 "성공까지 몇 스텝 남았나"를 분포로 예측하는 거예요(201개 이산 bin). (2) 이 가치 함수로 각 행동의 advantage 값을 추정하고, "Advantage: positive" 또는 "negative"라는 텍스트 토큰을 VLA 입력에 추가합니다. (3) 학습 때는 모든 데이터에 대해 이 advantage 조건부 지도학습을 하고, 추론 때는 항상 "Advantage: positive"로 조건화해서 개선된 정책을 뽑아내는 겁니다.

이게 왜 중요하냐면요 — Flow Matching 기반 VLA와 찰떡궁합이라는 거예요. PPO나 GRPO는 log-likelihood를 명시적으로 계산해야 하는데, Flow Matching 모델은 이걸 직접 제공을 못해요. 근사가 필요하죠. Advantage conditioning은 이 문제를 아예 우회해버립니다 — 그냥 조건부 지도학습만으로 정책 개선을 달성하는 거예요. 실제로 같은 데이터에서 AWR이나 PPO 기반 방법을 크게 이겼습니다.

세 가지 데이터를 하나로. RECAP의 또 다른 강점은 (1) 시연 데이터(처음 SFT할 때), (2) 자율 롤아웃(로봇이 스스로 해본 시도, 성공/실패 표시 포함), (3) 인간 교정(로봇이 자율 실행하다 실수하면 사람이 개입해서 고쳐주는 DAgger 방식)을 하나의 프레임워크로 통합한다는 점입니다. 인간 교정은 항상 positive advantage, 나머지는 가치 함수가 판단합니다.
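The advantage-conditioning recipe described above can be sketched in a few lines. The advantage formula here is a one-step simplification that assumes the value function returns the negative expected number of steps to success and that every step costs -1; the prompt format and function names are hypothetical, not taken from the RECAP paper.

```python
POSITIVE = "Advantage: positive"
NEGATIVE = "Advantage: negative"


def one_step_advantage(value_now, value_next, step_reward=-1.0):
    """Simplified advantage: did this step bring success closer than expected?"""
    return step_reward + value_next - value_now


def advantage_label(advantage, is_human_correction=False, threshold=0.0):
    """Human corrections are always labeled positive; otherwise use the estimate."""
    if is_human_correction or advantage > threshold:
        return POSITIVE
    return NEGATIVE


def training_prompt(instruction, advantage, is_human_correction=False):
    """Training: every sample is kept, tagged with its own advantage label."""
    return f"{instruction} | {advantage_label(advantage, is_human_correction)}"


def inference_prompt(instruction):
    """Inference: always condition on the positive label."""
    return f"{instruction} | {POSITIVE}"


good = one_step_advantage(value_now=-10.0, value_next=-8.0)   # ahead of schedule
bad = one_step_advantage(value_now=-10.0, value_next=-10.0)   # no progress
print(training_prompt("fold the shirt", good))
print(training_prompt("fold the shirt", bad))
print(inference_prompt("fold the shirt"))
```

Nothing in this loop needs the policy's log-likelihood, which is why the approach fits flow-matching action heads: the improvement comes from plain conditional supervised learning on relabeled data.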

실세계 성과. 자, 숫자를 봅시다:

태스크	소요 시간	π^*_0.6 효과 (throughput)	성공률	연속 운용
에스프레소 제조	~200초/회	2배 이상 향상	90%+	13시간 연속
다양한 빨래 접기(11종)	~500초/회	2배 이상 향상	~70%(버튼셔츠 기준)	2시간+ (새 집에서)
박스 조립(공장 배포)	~600초/회	2배 향상(2회 반복 후)	~90%	공장 실배포

throughput(시간당 성공 횟수)이 2배 이상 올라가고, 실패율은 절반으로 떨어졌어요. 특히 인상적인 건 "targeted failure removal" 실험인데, 옷깃 방향 오류라는 특정 실패 모드를 600개 자율 궤적으로 단 2회 반복만에 97% 성공률까지 제거했습니다.
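Since throughput is defined here as successes per hour, it depends on both speed and reliability. The sketch below shows the arithmetic under the simplifying assumption that failed attempts take as long as successful ones; the resulting figure is illustrative and is not a number reported in the paper.

```python
def throughput_per_hour(success_rate, seconds_per_attempt):
    """Successful task completions per hour of continuous operation."""
    attempts_per_hour = 3600.0 / seconds_per_attempt
    return attempts_per_hour * success_rate


# Table values for espresso making: about 200 s per attempt, about 90% success.
espresso = throughput_per_hour(success_rate=0.9, seconds_per_attempt=200.0)
print(espresso)
```

Doubling throughput therefore does not require doubling the success rate: shaving time off each attempt and removing slow recovery behavior contribute just as much.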

왜 중요한가. π^*_0.6는 6.3절에서 이야기한 "BC→RL 전환"의 가장 완성된 형태입니다. LLM 세계에서 GPT-3(사전학습) → InstructGPT(SFT) → ChatGPT(RLHF)로 진화한 것처럼, VLA 세계에서도 같은 경로가 현실이 되고 있는 거예요. 시뮬레이션이 아니라 실세계에서, 장난감 태스크가 아니라 에스프레소 13시간 연속 제조에서요.

10.13.2 π_0.6-MEM: VLA를 위한 다중 스케일 체화 메모리

[158] · Physical Intelligence · 2026.03

MEM(Multi-Scale Embodied Memory)은 VLA에 다중 모달, 다중 시간 스케일의 메모리를 부여하는 시스템입니다. 10.4절에서 "장기 과제와 계층적 추론"이 핵심 미해결 과제라고 했었죠? MEM은 이 문제를 가장 직접적으로 해결합니다.

핵심 통찰: 메모리도 종류가 다르다. 한번 생각해 보세요. 로봇이 "주방 전체를 정리해"라는 15분짜리 태스크를 수행할 때, 필요한 기억이 두 가지예요. (1) 단기 메모리: "방금 팔이 물체를 가려서 안 보이는데, 어디 있었더라?" — 최근 몇 초의 시각 정보. (2) 장기 메모리: "어떤 재료를 이미 꺼냈고, 어떤 서랍을 열었지?" — 수 분에 걸친 의미론적 이벤트. MEM의 핵심 통찰은 이 두 종류의 메모리를 서로 다른 모달리티로 표현해야 한다는 겁니다.

아키텍처 구성요소:

(1) 단기 비디오 메모리 (Video Encoder). 기존 ViT를 비디오 처리로 확장하되, 놀랍게도 새로운 학습 파라미터를 하나도 추가하지 않습니다. 4번째 레이어마다 공간 어텐션에 시간 어텐션(causal temporal attention)을 끼워넣는 space-time separable attention 구조예요. 과거 프레임의 토큰은 상위 레이어에서 버리기 때문에, VLA 백본에 전달되는 토큰 수는 단일 프레임 때와 똑같습니다. 16프레임 입력에서도 추론 지연이 300ms 이내예요. 나이브하게 하면 4초 넘게 걸리거든요 — 엄청난 차이죠.

(2) 장기 언어 메모리. 고수준 정책이 과거 이벤트를 자연어 요약(m_t)으로 압축하고, 매 스텝 업데이트합니다. 핵심은 압축이에요: "밝은 초록 그릇, 진한 파란 그릇, 밝은 노란 그릇을 윗칸 오른쪽 캐비닛에 넣었다" → "세 개의 그릇을 윗칸 오른쪽 캐비닛에 넣었다". 불필요한 디테일을 제거해서 학습-추론 분포 불일치를 줄이는 거예요.

똑똑한 설계: 사전학습 가중치 보존. 비디오 인코더는 K=1(단일 이미지)일 때 기존 VLM과 정확히 동일하게 동작하도록 설계됐습니다 — 시간 위치 인코딩의 t=0 값을 0으로 설정하는 트릭이에요. 덕분에 기존 VLM이 알고 있던 모든 것을 완벽히 보존하면서 메모리 능력만 추가할 수 있습니다. 사전학습 없이 후처리에서만 메모리를 도입하면 성능이 확 떨어지는 것도 실험으로 확인했습니다.

실세계 성과.

능력	태스크 예시	결과
15분 장시간 태스크	레시피 재료 준비, 주방 전체 청소, 그릴드 치즈 샌드위치 조리	메모리 없는 π0.6 대비 과제 진행률 2-4배 향상
인컨텍스트 적응	젓가락 파지 높이 조정, 냉장고 문 열기 방향 전환	성공률 +11%~+62%
부분 관측성 처리	서랍 속 물체 위치 기억, 장보기 봉투 내용물 추적	모든 핵심 메모리 능력에서 유일하게 강한 성능
비메모리 태스크 성능	셔츠 접기, 침대 정리, 박스 조립 등	메모리 없는 π0.6와 동등 (성능 저하 없음!)

특히 주목할 점이 하나 있어요. 기존 연구에서 반복적으로 보고됐던 causal confusion(인과 혼동) 문제 — 메모리를 추가하면 오히려 성능이 떨어지는 현상 — 이 π_0.6-MEM에서는 나타나지 않았습니다. 왜냐하면 다양한 최적성, 속도, 제어 주파수를 포함하는 대규모 사전학습 데이터가 spurious correlation을 방지했기 때문으로 분석됩니다.

왜 중요한가. MEM은 4.2절(두뇌 모듈)의 추론 패러다임과 10.4절(장기 과제)에 직접 응답합니다. π0.5가 "VLM이 계획하고 VLA가 실행하는" 계층적 분리 방식이었다면, MEM은 메모리라는 완전히 다른 차원에서 같은 문제를 풀어요. 계층적 계획이 "무엇을 할지"를 분리하는 거라면, 메모리는 "무엇을 했는지"를 기억하는 거죠. 두 접근은 상호 배타적이 아니라 상보적이에요 — 언젠가 결합될 겁니다.

10.13.3 두 논문의 통합적 의의

차원	π^*_0.6 [157]	π_0.6-MEM
해결하는 한계	BC의 성능 천장, 시연 밖 행동 발견 불가	메모리 부재, 장시간 과제 불가, 부분 관측성
기반 모델	π0.6 (Gemma 3 4B + 860M Action Expert)	π0.6 (동일)
핵심 혁신	Advantage conditioning: Flow Matching VLA에 적용 가능한 RL 정책 추출	Video encoder (추가 파라미터 없음) + 압축형 언어 메모리
데이터 소스	시연 + 자율 롤아웃 + 인간 교정(DAgger)	로봇 시연 + 비디오 데이터 + 비전-언어 데이터
대표 성과	에스프레소 13시간 연속, 실패율 50% 감소	주방 청소·그릴드 치즈 등 15분 태스크 해결
서베이 연결	6.3절 (RL 후처리), Insight 5	10.4절 (장기 과제), 4.2절 (추론)

Motivation Chain: π 시리즈의 진화 (업데이트)

π0 (2024): VLM + Flow Matching Action Expert의 첫 결합

→ π0.5 (2025): 계층적 VLM 계획 + π0 실행, 30분+ 장시간 작업

→ π0.6 (2025): Gemma 3 4B 백본 + 860M Action Expert로 업그레이드

→ π^*_0.6 (2025.11): Advantage conditioning으로 실세계 RL 후처리. BC의 성능 천장 돌파

→ π_0.6-MEM (2026.03): 다중 스케일 메모리로 15분+ 장시간 태스크 해결. 메모리 부재 한계 돌파

정리하면요, PI의 π 시리즈는 VLA 연구의 두 가지 핵심 프런티어를 동시에 밀어내고 있습니다: "더 잘하기"(π^*_0.6)와 "더 오래 하기"(MEM). π^*_0.6가 개별 행동의 품질을 시연 수준 너머로 끌어올린다면, MEM은 그 행동들을 수십 분에 걸친 일관된 과제 수행으로 엮어줍니다. 두 접근이 하나의 모델에 합쳐진다면? "15분짜리 주방 정리를 시행착오를 통해 스스로 개선하는 로봇"이 현실이 됩니다. 이건 본 서베이가 줄곧 이야기해 온 "배포 준비(deployment readiness)" 패러다임의 가장 구체적인 진전이에요. 앞서 식별한 미해결 과제들이 이제 더 이상 이론적 추측이 아니라, 실제 공학적 도전으로 전환되고 있다는 증거입니다.

11. 결론

자, 마지막으로 전체를 정리하겠습니다.

11.1 VLA -- 통합 지능의 실현

VLA(Vision-Language-Action) 모델은 로봇이 세계를 "보고(see), 이해하고(understand), 행동하는(act)" 통합 지능의 구현체입니다. 시각적 인지, 언어적 추론, 물리적 행동이라는 세 축을 하나의 신경망 안에서 융합함으로써, VLA는 전통적 로봇 공학의 모듈적 파이프라인(인지-계획-제어)을 근본적으로 재정의하고 있습니다.

11.2 3년의 역사, 200개의 모델

2023년 RT-2 [11]가 "Vision-Language-Action Model"이라는 명칭을 처음 제안한 이후, 불과 3년 만에 200개 이상의 VLA 모델이 출현했습니다. 이 폭발적 성장은 세 가지 수렴의 결과입니다: (1) 대형 언어 모델의 성숙, (2) 시각-언어 사전학습의 발전, (3) 대규모 로봇 데이터셋의 등장. 이 세 요소가 동시에 임계점에 도달한 2023-2024년에 VLA 연구의 캠브리아기 대폭발이 시작된 겁니다.

11.3 현재의 성취와 프론티어

단기적, 단일 도메인 조작 과제는 거의 해결된 수준에 도달했습니다. LIBERO-Spatial 98.8%, LIBERO-Object 99.8%라는 수치는, 정의된 환경에서 정의된 과제를 수행하는 능력은 이미 인간 수준에 근접했음을 보여줍니다.

그러나 진정한 프론티어는 이제부터입니다.

장기 과제: 다단계 합성 과제에서의 계획과 실행
교차 도메인 일반화: 학습하지 않은 환경과 객체에 대한 적응
실세계 배포: 통제되지 않은 환경에서의 안정적 작동

이 세 과제가 VLA 연구의 "라스트 마일(last mile)"이며, 동시에 가장 어려운 구간입니다.

11.4 효율성 혁명

VLA 연구에서 가장 주목할 만한 추세 중 하나는 효율성의 극적 향상입니다. RT-2 [11]의 55B 파라미터에서 SmolVLA [32]의 450M 파라미터로, 모델 크기가 100배 이상 줄어들면서도 경쟁력 있는 성능을 유지하는 파레토 프론티어의 이동이 진행 중입니다. 이건 VLA의 실용적 배포를 앞당기는 결정적 요인이에요. 에지 디바이스에서의 실시간 추론, 비용 효율적인 대규모 배포, 그리고 에너지 효율성 모두가 이 효율성 혁명의 수혜자입니다.

11.5 BC에서 RL로의 전환

행동 복제(BC)에서 출발하여 강화학습(RL) 후처리로 마무리하는 파이프라인이 VLA 학습의 표준으로 자리잡고 있습니다. BC가 제공하는 안정적 초기화와 RL이 제공하는 탐색적 최적화의 조합은, 단독으로는 달성할 수 없는 성능 수준을 가능케 합니다. SFT 4%에서 PPO 15회 반복 후 97%로의 도약은, 이 조합의 위력을 단적으로 보여줍니다.

11.6 도메인별 특화의 가속

VLA의 범용 프레임워크가 다양한 도메인으로 확장되고 있습니다.

자율주행: DriveVLM [91], DriveLM 등이 도로 환경에 특화된 VLA를 구현합니다.
휴머노이드: GR00T N1 [21], HumanPlus 등이 인간형 로봇의 전신 제어에 VLA를 적용합니다.
의료: 수술 로봇, 재활 보조 등에서의 VLA 적용이 탐색되고 있습니다.

각 도메인은 고유한 안전 요구, 제어 주파수, 상호작용 패턴을 가지며, 이에 따른 도메인 특화 설계가 가속되고 있습니다.

11.7 안전, 윤리, 형식 검증 -- 배포의 게이트키퍼

VLA의 실세계 배포를 가로막는 최종 관문은 기술적 성능이 아니라 안전과 윤리입니다. SafeVLA [75], SafeAuto 등의 시도가 진행 중이지만, 언어 조건 신경망 정책의 형식적 검증 방법은 아직 확립되지 않았습니다. 이건 VLA가 실험실을 넘어 일상으로 나아가기 위해 반드시 통과해야 하는 관문이며, 규제 기관, 산업계, 학계의 협력이 필수적인 영역입니다.

프라이버시, 고용 대체, 의사결정 편향 등의 사회적 영향도 기술 발전과 병행하여 논의되어야 합니다. 기술이 사회에 배포된 후에야 윤리적 논의를 시작하는 것은 너무 늦습니다.

11.8 다음 도약을 향하여

VLA 연구의 다음 도약은 네 가지 축에서 동시에 이루어질 것으로 전망됩니다.

첫째, 월드 모델 통합. 행동 전에 결과를 상상하는 능력은 장기 과제, 안전성, 일반화 모두를 향상시킵니다. DreamVLA [64], WorldVLA [76]의 성공은 이 방향의 유효성을 입증하며, 다음 세대의 VLA에서 월드 모델은 선택적 구성 요소가 아닌 핵심 모듈이 될 것입니다(4.2.3절 및 Large Model Embodied AI 서베이 [44] 참조).

둘째, 평생학습(continual learning). 현재의 VLA는 배포 후 고정되지만, 진정한 지능형 로봇은 경험을 통해 지속적으로 개선되어야 합니다. 과거 학습을 잊지 않으면서(catastrophic forgetting 방지) 새로운 과제와 환경에 적응하는 평생학습은 VLA의 장기적 비전입니다.

셋째, 범용 체현 지능(general embodied intelligence). 하나의 모델이 로봇 팔, 휴머노이드, 자율주행차, 드론 등 다양한 체현에서 작동하는 범용 정책은 VLA 연구의 궁극적 목표입니다. OXE [19]와 HPT [96]가 이 방향의 첫걸음을 내디뎠으며, 교차 체현 일반화의 스케일링이 핵심 과제입니다.

넷째, 인간-로봇 공진화. VLA 로봇이 인간의 삶에 깊이 통합되면서, 인간과 로봇이 서로를 변화시키는 공진화(co-evolution)가 시작될 것입니다. 로봇이 인간의 행동에서 배우고, 인간이 로봇의 능력에 맞추어 상호작용 방식을 조정하는 이 피드백 루프는, VLA 연구가 궁극적으로 지향하는 미래입니다.

VLA는 단순한 기술적 발전을 넘어, "기계가 물리적 세계를 이해하고 그 안에서 의미 있게 행동할 수 있는가"라는 근본적 질문에 대한 답을 구축하고 있습니다. 2023년의 명명 이후 3년, 이 분야는 놀라운 속도로 발전해 왔으며, 그 가속은 계속되고 있습니다. 다음 3년이 가져올 변화는 지금까지의 변화를 능가할 것입니다.

전체 VLA Taxonomy 트리

VLA (Vision-Language-Action)
├── 정의 기준별 분류
│   ├── 좁은 정의 (RT-2 원조): VLM 파인튜닝 기반
│   ├── 확장 정의 (Ma et al.): V+L→A 모든 시스템
│   ├── Pure VLA (Zhong et al.): End-to-end 통합
│   └── 직접 제어 (Kawaharazuka et al.): 제어 명령 직접 생성
│
├── 아키텍처별 분류 (Liu & Shao [5])
│   ├── 단일체 (Monolithic)
│   │   ├── Single-system: RT-2, OpenVLA
│   │   └── Dual-system
│   │       ├── Cascade: GR00T N1, π0
│   │       └── Parallel: (동시 실행 후 결합)
│   └── 계층적 (Hierarchical)
│       ├── Planner-Only: SayCan, Inner Monologue
│       └── Planner+Policy: π0.5
│
├── 행동 생성 방식별 분류 (Zhong et al. [3])
│   ├── 자기회귀 (Autoregressive): RT-2, OpenVLA, Octo(AR모드)
│   ├── 디퓨전 (Diffusion): Diffusion Policy, CogACT, RDT-1B
│   │   ├── Flow Matching (변형): π0, π0-FAST
│   │   └── 이산 디퓨전 (Discrete Diffusion): 이산 토큰 공간에서의 확산
│   ├── 강화학습 기반: VLA-RL [68], RIPT-VLA [71], ConRFT [69]
│   └── 하이브리드/특수: HybridVLA [79], GR00T N1
│
├── 행동 토큰 유형별 분류 (Chen et al. [7])
│   ├── Language Tokens: SayCan, SayTap
│   ├── Code Tokens: Code-as-Policies
│   ├── Affordance Tokens: VoxPoser [57], A3VLM [58]
│   ├── Trajectory Tokens: RT-Trajectory [62], TraceVLA
│   ├── Goal Tokens: SuSIE [59], 3D-VLA
│   ├── Latent Tokens: VQ-BeT [60], LAPA [61], UniVLA [80]
│   ├── Raw Action Tokens: RT-2, OpenVLA, FAST
│   └── Reasoning Tokens: CoT-VLA [55], SC-VLA [56]
│
├── 효율화 기법별 분류 (Yu et al. [4])
│   ├── 양자화: BitVLA, SQIL
│   ├── 프루닝: SmolVLA, FLOWER, DeeR-VLA
│   ├── 증류: TinyVLA
│   ├── 토큰 최적화: FAST, VLA-Cache, VOTE
│   ├── 효율적 어텐션: KV-Efficient VLA, Long-VLA [73]
│   └── 효율적 아키텍처: SARA-RT, MoLE-VLA
│
├── 학습 패러다임별 분류 (Jin et al. [9])
│   ├── Phase 1: 인터넷 사전학습 (VLM)
│   ├── Phase 2: BC/SFT (로봇 시연)
│   └── Phase 3: RL 후처리
│       ├── 온라인 RL: PPO (RIPT-VLA [71]), GRPO (VLA-RL [68])
│       ├── Offline-to-online RL: ConRFT [69]
│       └── 선호 최적화: HAPO [84], GRAPE
│
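├── By application domain
│   ├── Tabletop manipulation: mainstream, abundant data
│   ├── Humanoids: high DoF, whole-body control
│   ├── Autonomous driving: separate evolution, highest safety requirements
│   ├── Drones/navigation: outdoor, real time
│   ├── Medical/surgical: extreme precision, scarce data
│   └── Industrial/agricultural: repetitive tasks, robustness first
│
└── By benchmark
    ├── Manipulation: LIBERO, CALVIN, RLBench, Meta-World, VLABench
    ├── Autonomous driving: Bench2Drive, nuScenes, Reason2Drive, WorldBench
    └── Next generation: RoboArena, RoboCasa365, WorldGym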

Motivation Chain: 배포와 안전의 동기 사슬

자, 이 동기 사슬을 따라가 보면 VLA 연구의 큰 흐름이 보입니다.

Motivation Chain

연구실 데모의 한계(제어된 환경에서만 작동, 실세계 배포 불가)

→ 효율화 연구(경량화, 양자화 → 엣지 디바이스에서 실행 가능)

→ 실세계 배포 시도(예상치 못한 실패 모드 발견)

→ 안전 연구 시작(SafeVLA [75]: 안전 제약을 학습에 내재화)

→ 안전의 근본 한계(VLA 환각 → 물리적 사고 가능성)

→ 형식적 검증 필요성 대두(아직 미해결)

Motivation Chain

단일 벤치마크의 한계(LIBERO 포화: 단순 태스크 사실상 해결)

→ 복합 벤치마크 필요(장시간, 다중 스텝, 실세계 변이)

→ 시뮬레이션-실세계 격차(sim-to-real gap 여전히 존재)

→ 하이브리드 평가 제안(시뮬 + 실세계 + 인간 평가)

Self-Check Questions: Section 9-10-11

Q1: OXE 데이터셋이 VLA 분야에 가져온 패러다임 전환을 "데이터 다양성"의 관점에서 설명하라.

답: OXE 이전에는 각 연구실이 자체 로봇으로 수집한 소규모 데이터(수천-수만 에피소드)로만 학습했습니다. OXE는 22종의 서로 다른 로봇 플랫폼에서 수집된 100만+ 에피소드를 통합했습니다. 여기서 핵심은, 서로 다른 로봇의 데이터가 "노이즈"가 아니라 "다양성"으로 작용하여, 특정 로봇·환경에 대한 과적합을 방지하고 일반화 성능을 높인다는 것입니다. NLP에서 다국어 학습이 각 언어의 성능을 개선하는 현상과 유사한 원리입니다.

Q2: VLA의 "환각(hallucination)"이 LLM의 환각과 본질적으로 다른 이유는?

답: LLM의 환각은 잘못된 텍스트 생성으로, 결과는 정보적 오류입니다(사실이 아닌 내용 서술). 반면 VLA의 환각은 물리적 세계에서 실행되는 잘못된 행동 생성이므로, 결과가 물리적 사고(충돌, 파손, 부상)로 이어질 수 있습니다. "존재하지 않는 역사적 사실"을 말하는 것과 "존재하지 않는 물체를 잡으려 팔을 휘두르는 것"의 차이인 거죠. 이 때문에 VLA의 안전 문제는 LLM보다 근본적으로 더 심각하며, 형식적 검증의 필요성이 더 큽니다.

Q3: 현재 VLA 벤치마크 생태계의 가장 큰 한계는 무엇인가?

답: (1) 통일된 교차 벤치마크 부재: ImageNet이나 SuperGLUE에 해당하는 표준 벤치마크가 없어, 서로 다른 시뮬레이터(LIBERO, CALVIN, RLBench)의 결과를 직접 비교할 수 없습니다. (2) 시뮬-실세계 격차: 시뮬레이션에서 높은 성능이 실세계에서 보장되지 않습니다. (3) 포화 문제: 단순 태스크 벤치마크는 이미 99%에 도달하여 변별력을 잃었습니다. (4) 장시간·비정형 태스크 평가 부재: 30분 이상의 복합 태스크, 예상치 못한 상황 대처 능력을 측정하는 벤치마크가 부족합니다.

Open Research Questions: Section 9-10-11

벤치마크 2.0: LIBERO 포화 이후, 실세계 일반화·장시간 태스크·안전성을 동시에 측정하는 차세대 벤치마크는 어떤 설계 원칙을 따라야 하는가?

VLA의 경제학: VLA 기반 로봇의 배포 비용(학습, 하드웨어, 유지보수)이 기존 산업용 로봇 대비 경제적으로 타당해지는 시점은 언제인가?

참고문헌 (References)

[1] Ma, Q. et al. (2024). A Survey on Vision-Language-Action Models for Embodied AI. arXiv:2405.14093. [arXiv]
[2] Kawaharazuka, K. et al. (2024). Real-World Robot Applications of Foundation Models: A Review. arXiv:2402.05741. [arXiv]
[3] Zhong, Z. et al. (2025). Pure Vision Language Action (VLA) Models: A Comprehensive Survey. arXiv:2509.19012. [arXiv]
[4] Yu, Z. et al. (2025). A Survey on Efficient Vision-Language-Action Models. arXiv:2510.24795. [arXiv]
[5] Liu, N. & Shao, R. et al. (2025). Large VLM-based VLA Models for Robotic Manipulation: A Survey. arXiv:2508.13073. [arXiv]
[6] Zhang, Y. et al. (2025). VLA Models: Concepts, Progress, Applications and Challenges. arXiv:2505.04769. [arXiv]
[7] Chen, Y. et al. (2025). A Survey on VLA Models: An Action Tokenization Perspective. arXiv:2507.01925. [arXiv]
[8] Xu, C. et al. (2025). An Anatomy of Vision-Language-Action Models. arXiv:2512.11362. [arXiv]
[9] Jin, A. et al. (2025). Parallels Between VLA Model Post-Training and Human Motor Learning. arXiv:2506.20966. [arXiv]
[10] Jiang, H. et al. (2025). A Survey on VLA Models for Autonomous Driving. arXiv:2506.24044. [arXiv]
[11] Brohan, A. et al. (2023). RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. arXiv:2307.15818. [arXiv]
[12] Brohan, A. et al. (2022). RT-1: Robotics Transformer for Real-World Control at Scale. arXiv:2212.06817. [arXiv]
[13] Reed, S. et al. (2022). A Generalist Agent (Gato). arXiv:2205.06175. [arXiv]
[14] Ahn, M. et al. (2022). Do As I Can, Not As I Say: Grounding Language in Robotic Affordances (SayCan). arXiv:2204.01691. [arXiv]
[15] Kim, M. et al. (2024). OpenVLA: An Open-Source Vision-Language-Action Model. arXiv:2406.09246. [arXiv]
[16] Black, K. et al. (2024). pi0: A Vision-Language-Action Flow Model for General Robot Control. arXiv:2410.24164. [arXiv]
[17] Chi, C. et al. (2023). Diffusion Policy: Visuomotor Policy Learning via Action Diffusion. arXiv:2303.04137. [arXiv]
[18] Driess, D. et al. (2023). PaLM-E: An Embodied Multimodal Language Model. arXiv:2303.03378. [arXiv]
[19] Open X-Embodiment Collaboration (2023). Open X-Embodiment: Robotic Learning Datasets and RT-X Models. arXiv:2310.08864. [arXiv]
[20] Pertsch, K. et al. (2025). FAST: Efficient Action Tokenization for Vision-Language-Action Models (pi0-FAST). arXiv:2501.09747. [arXiv]
[21] Bjorck, J. et al. (2025). GR00T N1: An Open Foundation Model for Generalist Humanoid Robots. arXiv:2503.14734. [arXiv]
[22] Huang, W. et al. (2022). Inner Monologue: Embodied Reasoning through Planning with Language Models. arXiv:2207.05608. [arXiv]
[23] Liu, H. et al. (2024). CogACT: A Foundational VLA Model with Cognitive-Inspired Action Chunking Transformer. arXiv:2411.19650. [arXiv]
[24] Liu, H. et al. (2024). RDT-1B: A Diffusion Foundation Model for Bimanual Manipulation. arXiv:2410.07864. [arXiv]
[25] Octo Model Team et al. (2024). Octo: An Open-Source Generalist Robot Policy. arXiv:2405.12213. [arXiv]
[26] Shridhar, M. et al. (2021). CLIPort: What and Where Pathways for Robotic Manipulation. arXiv:2109.12098. [arXiv]
[27] Radford, A. et al. (2021). Learning Transferable Visual Models From Natural Language Supervision (CLIP). arXiv:2103.00020. [arXiv]
[28] Dosovitskiy, A. et al. (2020). An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ViT). arXiv:2010.11929. [arXiv]
[29] Brown, T. et al. (2020). Language Models are Few-Shot Learners (GPT-3). arXiv:2005.14165. [arXiv]
[30] Oquab, M. et al. (2023). DINOv2: Learning Robust Visual Features without Supervision. arXiv:2304.07193. [arXiv]
[31] Physical Intelligence (2025). pi0.5: a Vision-Language-Action Model with Open-World Generalization. arXiv:2503.01222. [arXiv]
[32] Shukor, M. et al. (2025). SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics. arXiv:2506.01844. [arXiv]
[33] Ma, Y. et al. (2025). BitVLA: 1-bit Vision-Language-Action Models. arXiv:2505.07256. [arXiv]
[34] Wu, J. et al. (2025). TinyVLA: Towards Fast and Data-Efficient VLA. arXiv:2409.12514. [arXiv]
[35] Yue, W. et al. (2024). DeeR-VLA: Dynamic Inference of Multimodal LLMs for Efficient Robot Execution. arXiv:2411.02359. [arXiv]
[36] Liang, J. et al. (2023). Code as Policies: Language Model Programs for Embodied Control. arXiv:2209.07753. [arXiv]
[37] Zhen, H. et al. (2024). 3D-VLA: A 3D Vision-Language-Action Generative World Model. arXiv:2403.09631. [arXiv]
[38] Huang, W. et al. (2022). Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents. arXiv:2201.07207. [arXiv]
[39] Xu, Z. et al. (2025). SpatialVLA: Exploring Spatial Representations for VLA Models. arXiv:2501.15830. [arXiv]
[40] Wen, B. et al. (2024). TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for VLA. arXiv:2412.10345. [arXiv]
[41] Hu, T. et al. (2025). Vision-Language-Action Models for Autonomous Driving: Past, Present, and Future. arXiv:2512.16760. [arXiv]
[42] Edge Survey (2026). Embodied Foundation Models at the Edge: A Survey of Deployment Constraints and Mitigation Strategies. arXiv:2603.16952. [arXiv]
[43] Guan, W. et al. (2025). Efficient Vision-Language-Action Models for Embodied Manipulation: A Systematic Survey. arXiv:2510.17111. [arXiv]
[44] Large Model Embodied AI (2025). Large Model Empowered Embodied AI: A Survey on Decision-Making and Embodied Learning. arXiv:2508.10399. [arXiv]
[45] Jiang, Y. et al. (2022). VIMA: General Robot Manipulation with Multimodal Prompts. arXiv:2210.03094. [arXiv]
[46] Hwang, J. et al. (2024). EMMA: End-to-End Multimodal Model for Autonomous Driving. arXiv:2410.23262. [arXiv]
[47] Fu, H. et al. (2025). ORION: A Holistic End-to-End Autonomous Driving Framework. arXiv:2503.19755. [arXiv]
[48] Zhou, X. et al. (2025). AutoVLA: Autonomous Driving with Adaptive Reasoning and RL Fine-Tuning. arXiv:2506.13757. [arXiv]
[49] Yang, Z. et al. (2025). DriveMoE: Mixture-of-Experts for End-to-End Autonomous Driving. arXiv:2505.16278. [arXiv]
[50] Doshi, R. et al. (2024). CrossFormer: Scaling Cross-Embodied Learning. arXiv:2408.11812. [arXiv]
[51] Wu, H. et al. (2023). GR-1: Unleashing Large-Scale Video Generative Pre-training for Visual Robot Manipulation. arXiv:2312.13139. [arXiv]
[52] Tan, Z. et al. (2025). FlashVLA: Token-Aware Compression and Action Reuse for Efficient VLA Inference. arXiv:2505.21200. [arXiv]
[53] Zheng, K. et al. (2025). X-VLA: Cross-Embodiment Vision-Language-Action Model. arXiv:2510.10274. [arXiv]
[54] Du, Y. et al. (2025). HiMoE-VLA: Hierarchical Mixture-of-Experts for Generalist VLA Policies. arXiv:2512.05693. [arXiv]
[55] Zhao, Y. et al. (2025). CoT-VLA: Visual Chain-of-Thought Reasoning for VLA Models. arXiv:2503.22020. [arXiv]
[56] Li, X. et al. (2024). SC-VLA: A Self-Correcting VLA Model for Fast and Slow System Manipulation. arXiv:2405.17418. [arXiv]
[57] Huang, W. et al. (2023). VoxPoser: Composable 3D Value Maps for Robotic Manipulation. arXiv:2307.05973. [arXiv]
[58] Huang, S. et al. (2024). A3VLM: Actionable Articulation-Aware Vision Language Model. arXiv:2406.07549. [arXiv]
[59] Black, K. et al. (2023). SuSIE: Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models. arXiv:2310.10639. [arXiv]
[60] Lee, S. et al. (2024). VQ-BeT: Behavior Generation with Latent Actions. arXiv:2403.03181. [arXiv]
[61] Ye, D. et al. (2024). LAPA: Latent Action Pretraining from Videos. arXiv:2410.11758. [arXiv]
[62] Gu, Y. et al. (2023). RT-Trajectory: Robotic Task Generalization via Hindsight Trajectory Sketches. arXiv:2311.01977. [arXiv]
[63] Williams, R. et al. (2025). Lite VLA: Efficient VLA Control on CPU-Bound Edge Robots. arXiv:2511.05642. [arXiv]
[64] Zhang, H. et al. (2025). DreamVLA: A VLA Model Dreamed with Comprehensive World Knowledge. arXiv:2507.04447. [arXiv]
[65] Wen, C. et al. (2025). dVLA: Diffusion VLA with Multimodal Chain-of-Thought. arXiv:2509.25681. [arXiv]
[66] Chen, Z. et al. (2025). TGRPO: Fine-tuning VLA via Trajectory-wise Group Relative Policy Optimization. arXiv:2506.08440. [arXiv]
[67] Huang, J. et al. (2025). ThinkAct: VLA Reasoning via Reinforced Visual Latent Planning. arXiv:2507.16815. [arXiv]
[68] Lu, Y. et al. (2025). VLA-RL: Towards Masterful Robotic Manipulation with Scalable RL. arXiv:2505.18719. [arXiv]
[69] Chen, R. et al. (2025). ConRFT: A Reinforced Fine-tuning Method for VLA via Consistency Policy. arXiv:2502.05450. [arXiv]
[70] Li, Q. et al. (2025). SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning. arXiv:2509.09674. [arXiv]
[71] Tan, W. et al. (2025). RIPT-VLA: Interactive Post-Training for VLA Models. arXiv:2505.17016. [arXiv]
[72] Miao, L. et al. (2025). FedVLA: Federated VLA Learning with Dual Gating MoE. arXiv:2508.02190. [arXiv]
[73] Fan, Y. et al. (2025). Long-VLA: Unleashing Long-Horizon Capability of VLA for Robot Manipulation. arXiv:2508.19958. [arXiv]
[74] Koo, J. et al. (2025). RetoVLA: Reusing Register Tokens for Spatial Reasoning in VLA. arXiv:2509.21243. [arXiv]
[75] Zhang, S. et al. (2025). SafeVLA: Towards Safety Alignment of VLA via Constrained Learning. arXiv:2503.03480. [arXiv]
[76] Cen, J. et al. (2025). WorldVLA: Towards Autoregressive Action World Model. arXiv:2506.21539. [arXiv]
[77] Li, Z. et al. (2025). PointVLA: Injecting the 3D World into VLA Models. arXiv:2503.07511. [arXiv]
[78] Yu, F. et al. (2025). ForceVLA: Enhancing VLA with Force-aware MoE for Contact-rich Manipulation. arXiv:2505.22159. [arXiv]
[79] Liu, J. et al. (2025). HybridVLA: Collaborative Diffusion and Autoregression in a Unified VLA Model. arXiv:2503.10631. [arXiv]
[80] Bu, Z. et al. (2025). UniVLA: Learning to Act Anywhere with Task-centric Latent Actions. arXiv:2505.06111. [arXiv]
[81] Deng, Y. et al. (2025). GraspVLA: Grasping Foundation Model Pre-trained on Billion-scale Synthetic Data. arXiv:2505.03233. [arXiv]
[83] Tian, R. et al. (2023). RAPL: What Matters to You? Visual Representation Alignment for Robot Learning. arXiv:2310.07932. [arXiv]
[84] Xia, Z. et al. (2025). HAPO: Human-assisted Robotic Policy Refinement via Action Preference Optimization. arXiv:2506.07127. [arXiv]
[85] Patel, D. et al. (2025). IKER: Real-to-Sim-to-Real with VLM-Generated Iterative Keypoint Rewards. arXiv:2502.08643. [arXiv]
[86] Xu, J. et al. (2025). KV-Efficient VLA: Speed up VLMs with RNN-Gated Chunked KV Cache. arXiv:2509.21354. [arXiv]
[87] Chen, X. et al. (2023). GenAug: Retargeting Behaviors to Unseen Situations via Generative Augmentation. arXiv:2302.06671. [arXiv]
[88] Mandi, Z. et al. (2022). CACTI: A Framework for Scalable Multi-Task Multi-Scene Visual Imitation Learning. arXiv:2212.05711. [arXiv]
[89] Yu, T. et al. (2023). ROSIE: Scaling Robot Learning with Semantically Imagined Experience. arXiv:2302.11550. [arXiv]
[90] Xiao, T. et al. (2022). DIAL: Robotic Skill Acquisition via Instruction Augmentation with VLMs. arXiv:2211.11736. [arXiv]
[91] Tian, X. et al. (2024). DriveVLM: The Convergence of Autonomous Driving and Large VLMs. arXiv:2402.12289. [arXiv]
[92] Zawalski, K. et al. (2024). ECoT: Robotic Control via Embodied Chain-of-Thought Reasoning. arXiv:2407.08693. [arXiv]
[93] Du, Y. et al. (2023). UniPi: Learning Universal Policies via Text-Guided Video Generation. arXiv:2302.00111. [arXiv]
[94] Nematollahi, I. et al. (2025). LUMOS: Language-Conditioned Imitation Learning with World Models. arXiv:2503.10370. [arXiv]
[95] Chi, B. et al. (2025). MinD: Learning A Dual-System World Model for Real-Time Planning. arXiv:2506.18897. [arXiv]
[96] Wang, L. et al. (2024). HPT: Scaling Proprioceptive-Visual Learning with Heterogeneous Pre-trained Transformers. arXiv:2409.20537. [arXiv]
[97] Ross, S. et al. (2011). A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning (DAgger). arXiv:1011.0686. [arXiv]
[98] Hancock, W. et al. (2025). Actions as Language: Fine-Tuning VLMs into VLAs Without Catastrophic Forgetting. arXiv:2509.22195. [arXiv]
[99] GraspVerse (2025). Large-scale Synthetic Grasp Data Generation. (Source outside the 14 surveys.)
[100] Cheang, C. et al. (2024). GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation. arXiv:2410.06158. [arXiv]
[101] Yang, S. et al. (2023). UniSim: Learning Interactive Real-World Simulators. arXiv:2310.06114. [arXiv]
[102] Singh, I. et al. (2023). ProgPrompt: Generating Situated Robot Task Plans using Large Language Models. arXiv:2209.11302. [arXiv]
[103] Wang, G. et al. (2023). Voyager: An Open-Ended Embodied Agent with Large Language Models. arXiv:2305.16291. [arXiv]
[104] Vemprala, S. et al. (2024). ChatGPT for Robotics: Design Principles and Model Abilities. arXiv:2306.17582. [arXiv]
[105] Nasiriany, S. et al. (2024). RT-Affordance: Affordances are Versatile Intermediate Representations for Robot Learning. arXiv:2411.02704. [arXiv]
[106] Xu, Z. et al. (2025). A0: An Autonomous Agent with Adaptive Action Generation. arXiv:2504.12636. [arXiv]
[107] Wang, H. et al. (2025). VQ-VLA: Vector Quantized Vision-Language-Action Model. arXiv:2507.01016. [arXiv]
[108] Liu, B. et al. (2025). Embodied-R1: Incentivizing Reasoning in Embodied VLA Models. arXiv:2508.13998. [arXiv]
[109] Wang, Z. et al. (2025). GRAPE: Generalizing Robot Policy via Preference Alignment. arXiv:2411.19309. [arXiv]
[110] Bousmalis, K. et al. (2024). RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation. arXiv:2306.11706. [arXiv]
[111] Shi, L. et al. (2025). ReVLA: Reverting Visual Domain from LLM to VLA. arXiv:2409.15250. [arXiv]
[112] Zheng, Z. et al. (2025). UniAct: Universal Action Representation for Robotic Learning. arXiv:2501.10105. [arXiv]
[113] Li, J. et al. (2025). BridgeVLA: Bridging the Gap Between VLA and Low-Level Robot Control. arXiv:2506.07961. [arXiv]
[114] Park, D. et al. (2025). SQIL: Sub-4-bit Quantization of Large VLAs via Self-play Fine-tuning. arXiv:2505.15304. [arXiv]
[115] Heo, J. et al. (2025). QAIL: Quantization-Aware Imitation Learning for Resource-Efficient VLA. arXiv:2412.01034. [arXiv]
[116] Li, S. et al. (2025). SQAP-VLA: Stochastic Quantization with Adaptive Precision for VLA. arXiv:2509.09090. [arXiv]
[117] Qu, L. et al. (2025). MoLe-VLA: Mixture of Lightweight Experts for VLA. arXiv:2503.20384. [arXiv]
[118] Niu, X. et al. (2025). EfficientVLA: An Efficient Vision-Language-Action Model. arXiv:2506.10100. [arXiv]
[119] Cheng, Y. et al. (2025). FLOWER: Flow-based World Model for Efficient Robot Learning. arXiv:2509.04996. [arXiv]
[120] Zhao, Y. et al. (2025). RLRC: Reinforcement Learning with Reasoning Consistency for VLA. arXiv:2506.17639. [arXiv]
[121] Wen, Z. et al. (2025). CEED-VLA: Confidence-Enhanced Early-Exit Decoding for VLA. arXiv:2506.13725. [arXiv]
[122] Julg, M. et al. (2025). RPD: Robot Policy Distillation from Vision-Language-Action Models. arXiv:2503.05833. [arXiv]
[123] Shen, W. et al. (2025). SP-VLA: Spatial-aware Parallel Decoding VLA. arXiv:2506.12723. [arXiv]
[124] Xu, Y. et al. (2025). VLA-Cache: Accelerating VLA Inference via KV Cache Compression. arXiv:2502.02175. [arXiv]
[125] Lin, X. et al. (2025). CronusVLA: Efficient VLA with Temporal Cronus Attention. arXiv:2506.19816. [arXiv]
[126] Leal, I. et al. (2024). SARA-RT: Scaling up Robotics Transformers with Self-Adaptive Robust Attention. arXiv:2312.01990. [arXiv]
[127] Liu, Y. et al. (2024). RoboMamba: Efficient Vision-Language-Action Model with Mamba SSM. arXiv:2406.04339. [arXiv]
[128] Xu, J. et al. (2025). GeRM: A Generalist Robotic Model via Foundation Models. arXiv:2403.13358. [arXiv]
[129] Chen, Q. et al. (2025). PD-VLA: Parallel Decoding for Efficient VLA Inference. arXiv:2503.02310. [arXiv]
[130] Wang, X. et al. (2025). Spec-VLA: Speculative Decoding for Accelerating VLA Models. arXiv:2507.22424. [arXiv]
[131] Yang, Z. et al. (2025). EgoVLA: Egocentric Vision-Language-Action Model. arXiv:2507.12440. [arXiv]
[132] Hung, Y. et al. (2025). NORA: Normalizing Flow-based Robot Action Generation. arXiv:2504.19854. [arXiv]
[133] Budzianowski, P. et al. (2025). EdgeVLA: Efficient VLA Deployment on Edge Devices. arXiv:2507.14049. [arXiv]
[134] Kim, D. et al. (2025). DiVLA-2B: Diffusion VLA at 2B Scale. arXiv:2412.03293. [arXiv]
[135] Park, J. et al. (2026). HyperVLA: Dynamic Policy Generation via Hypernetworks. arXiv:2510.04898. [arXiv]
[136] Liu, Q. et al. (2026). AutoQVLA: Automated Quantization for VLA Models. arXiv:2602.03782. [arXiv]
[137] Zhang, S. et al. (2025). Humanoid-VLA: Vision-Language-Action for Humanoid Robots. arXiv:2502.14795. [arXiv]
[138] Li, W. et al. (2025). Being-H0: Humanoid Robot Foundation Model. arXiv:2507.15597. [arXiv]
[139] Chen, X. et al. (2025). FP3: Foundation Policy with Predictive Planning. arXiv:2503.08950. [arXiv]
[140] Li, J. et al. (2025). SafeAuto: Safety-Aware Autonomous Driving with VLA. arXiv:2503.00211. [arXiv]
[141] Wei, H. et al. (2025). LangCoop V2V: Language-based Cooperative Driving. arXiv:2504.13406. [arXiv]
[142] Wang, D. et al. (2025). CognitiveDrone: VLA for Cognitive Drone Control. arXiv:2503.01378. [arXiv]
[143] Zhao, R. et al. (2025). RaceVLA: Vision-Language-Action for Autonomous Racing. arXiv:2503.02572. [arXiv]
[144] Cheng, H. et al. (2025). NaVILA: Navigation with VLA. arXiv:2412.04453. [arXiv]
[145] Zhang, J. et al. (2024). Uni-NaVid: Unified Navigation with Video Diffusion. arXiv:2412.06224. [arXiv]
[146] Liu, M. et al. (2025). Mobility VLA: VLA for Mobile Robot Navigation. arXiv:2407.07775. [arXiv]
[147] Li, Z. et al. (2024). RoboNurse-VLA: Robotic Nursing Assistant with VLA. arXiv:2409.19590. [arXiv]
[148] Chen, J. et al. (2025). ObjectVLA: Object-Centric VLA Model. arXiv:2502.19250. [arXiv]
[149] Lin, K. et al. (2024). ShowUI: Vision-Language-Action Models for GUI Automation. arXiv:2411.17465. [arXiv]
[150] Li, H. et al. (2025). RoboArena: A Benchmark Arena for VLA Evaluation. arXiv:2506.18123. [arXiv]
[151] Nasiriany, S. et al. (2026). RoboCasa365: Large-Scale Robot Simulation Benchmark. arXiv:2603.04356. [arXiv]
[152] Zhang, W. et al. (2025). WorldGym: World Model Training Environments. arXiv:2506.00613. [arXiv]
[153] Huang, Z. et al. (2025). TactileVLA: Tactile-Enhanced VLA for Dexterous Manipulation. arXiv:2507.09160. [arXiv]
[154] Wang, Y. et al. (2025). OmniVTLA: Omni Vision-Tactile-Language-Action Model. arXiv:2508.08706. [arXiv]
[155] Tang, Y. et al. (2023). SayTap: Language to Quadrupedal Locomotion. arXiv:2306.07580. [arXiv]
[156] Wang, L. et al. (2024). HPT: Scaling Proprioceptive-Visual Learning with Heterogeneous Pre-trained Transformers. arXiv:2409.20537. [arXiv]
[157] Physical Intelligence (2025). pi*0.6: a VLA That Learns From Experience. arXiv:2511.14759. [arXiv]
[158] Physical Intelligence (2026). MEM: Multi-Scale Embodied Memory for Vision Language Action Models. arXiv:2603.03596. [arXiv]