Reprojection PnP Calibration Bundle Adjustment Monocular Scale

author: Giseop Kim
APRL @ DGIST

Classical Visual SLAM in One Page

Reprojection Is All You Need
PnP, camera calibration, and BA share the same residual family

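Visual SLAM, Perspective-n-Point (PnP), camera calibration, and bundle adjustment all rely on the same geometric consistency condition: when a 3D point is transformed into the camera frame and projected onto the image plane through the camera model, it should land on the image feature that was actually detected. The difference between the two positions is called the reprojection residual. This residual ties image observations, camera intrinsics, lens distortion, camera poses, and 3D map points into a single, universal nonlinear least-squares problem, which is why it is the core measurement model of visual SLAM. (The optimization itself linearizes the residual around a good initial estimate and updates it iteratively with Gauss-Newton or Levenberg-Marquardt, LM.) What distinguishes the problems is not the residual itself but which parameter blocks are held fixed and which are treated as optimization variables.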

Note: Direct methods replace or complement feature reprojection residuals with photometric residuals, but the same optimization viewpoint remains.

Reprojection residual

\[ e_{ij} = u_{ij} - \pi(K, d, R_i, t_i, X_j) \] \[ \min \sum_{i,j} \| e_{ij} \|^2 \]

\(u_{ij}\): observed image point, \(X_j\): 3D object/map point, \(K\): intrinsic, \(d\): distortion, \(R_i,t_i\): camera or board pose.
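As a concrete illustration of the definition above, here is a minimal sketch of \(e = u - \pi(\cdot)\) for a single observation. It assumes an ideal pinhole camera with zero lens distortion (\(d = 0\)); the function names `project` and `reprojection_residual` and all numeric values are illustrative, not taken from any library.

```python
def project(K, R, t, X):
    """Project a 3D point X (world/object frame) to pixel coordinates.

    Assumption: ideal pinhole camera, no lens distortion (d = 0).
    K = (fx, fy, cx, cy), R = 3x3 nested list, t = length-3 list.
    """
    # World/object frame -> camera frame: X_c = R X + t
    Xc = [sum(R[r][c] * X[c] for c in range(3)) + t[r] for r in range(3)]
    if Xc[2] <= 0:
        raise ValueError("point is behind the camera")
    fx, fy, cx, cy = K
    # Perspective division, then intrinsics
    return (fx * Xc[0] / Xc[2] + cx, fy * Xc[1] / Xc[2] + cy)


def reprojection_residual(u_obs, K, R, t, X):
    """e = u_obs - pi(K, R, t, X), in pixels."""
    u_hat = project(K, R, t, X)
    return (u_obs[0] - u_hat[0], u_obs[1] - u_hat[1])


K = (500.0, 500.0, 320.0, 240.0)  # illustrative intrinsics
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 0.0]
X = [0.2, -0.1, 2.0]              # 3D point, 2 units in front of the camera
u_obs = (371.0, 214.5)            # hypothetical detected feature location

e = reprojection_residual(u_obs, K, R, t, X)
print(e)  # residual in pixels, approximately (1.0, -0.5)
```

PnP, calibration, and BA all sum the squared norm of this quantity; they differ only in which of `K`, `R`, `t`, `X` the optimizer is allowed to change.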

Why is the reprojection residual central to visual SLAM?

In visual SLAM the measurements the sensor provides directly are feature coordinates on the image plane, and the state to estimate is the camera pose and the 3D map structure. The reprojection residual is therefore the most natural likelihood term in image-based geometry, because it directly evaluates how well the current estimates of \(K,d,R_i,t_i,X_j\) explain the observed pixel coordinates \(u_{ij}\). Tracking, local bundle adjustment, global bundle adjustment, relocalization, and calibration can all be expressed by changing which of these variables are fixed and which are optimized.

Index notation

Symbol	Meaning	Example
\(i\)	image / view / frame index	the 3rd calibration image, the current SLAM frame, view \(i\)
\(j\)	index of a point / corner / landmark within a view	the 12th chessboard corner, map landmark \(j\)
\(u_{ij}\)	2D coordinate at which point \(j\) is observed in image \(i\)	pixel coordinate \((u,v)\) detected in image \(i\)
\(X_j\)	3D coordinate of point \(j\)	corner position in the board frame, or landmark position in the map frame
\(\{R_i,t_i\}, \{X_j\}\)	set notation for all poses and all 3D points	BA optimizes all camera poses and all map points jointly

In single-view PnP there is only one image, so the view index \(i\) is usually dropped: \(u_j \leftrightarrow X_j\). Calibration and multi-view BA use several images, so \(i\) and \(j\) appear together.

Pose-only

PnP

Assume \(K,d,X_j\) are known and solve only for the pose \(R,t\) of a single image. If the 3D points are given in meters, the resulting pose is metric as well.

Camera model + poses

Camera Calibration

\(X_j\) is known in calibration-board coordinates; \(K,d\) and the per-image poses \(R_i,t_i\) are estimated jointly.

Map + poses

Bundle Adjustment (SfM)

Not only the poses but also the 3D map points \(X_j\) are adjusted. In the monocular case the overall scale is arbitrary unless a metric anchor is available.

The differences, seen through the equations

Braces \(\{\cdot\}\) denote not a single value but the whole collection of indexed variables. For example, \(\{R_i,t_i\}\) means all camera poses and \(\{X_j\}\) means all 3D map points.

Legend: Variable (red), Fixed (green), Observation (blue)

PnP

\[ \min_{\var{R,t}} \sum_j \left\| \obs{u_j} - \pi(\fix{K},\fix{d},\var{R},\var{t},\fix{X_j}) \right\|^2 \]

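Green \((K,d,X_j)\) are known values and red \((R,t)\) are the optimization variables to solve for. Blue \(u_j\) is the 2D pixel coordinate actually observed in the image. Note: PnP usually estimates one pose from one image, hence \(\sum_j\). When PnP is written for several frames at once, the sum becomes \(\sum_{i,j}\).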

Camera Calibration

\[ \min_{\var{K,d,\{R_i,t_i\}}} \sum_{i,j} \left\| \obs{u_{ij}} - \pi(\var{K},\var{d},\var{R_i},\var{t_i},\fix{X_j}) \right\|^2 \]

Green \(X_j\) are the known 3D points of the calibration board (or of a known map). Red \(K,d,R_i,t_i\) are the optimization variables, solved jointly over all images.

Bundle Adjustment (SfM)

\[ \min_{\var{\{R_i,t_i\},\{X_j\}}} \sum_{i,j} \left\| \obs{u_{ij}} - \pi(\fix{K},\fix{d},\var{R_i},\var{t_i},\var{X_j}) \right\|^2 \]

Here \(K,d\) are held fixed, and the camera poses and 3D map points are the optimization variables. In self-calibrating BA, \(K,d\) can be made variables as well.

Scale in monocular SLAM

\[ \var{X_j'} = \gauge{s}\var{X_j},\qquad \var{t_i'} = \gauge{s}\var{t_i} \]

Scaling the red map points and translations together by the same magenta scale factor \(s\) leaves the monocular reprojection error unchanged. Without a metric anchor, the map and trajectory of monocular SLAM are therefore determined only up to scale.
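The invariance can be checked in two lines. This is a sketch that assumes \(\pi\) is a central projection with perspective division (pinhole-type), writes the camera-frame point as \((x,y,z)^\top = R_i X_j + t_i\), and takes \(s > 0\); the symbols \(x,y,z\) are introduced here only for illustration.

\[ R_i (s X_j) + s\,t_i = s\,(R_i X_j + t_i) = s\,(x, y, z)^\top \]

\[ \left( \frac{s x}{s z},\; \frac{s y}{s z} \right) = \left( \frac{x}{z},\; \frac{y}{z} \right) \]

The normalized image coordinates are identical before and after scaling. Since \(d\) and \(K\) act only on these normalized coordinates, the predicted pixel and hence \(e_{ij}\) do not change.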

Variable / Fixed comparison table

Problem	Known	Unknown (to solve)	Observation	Typical OpenCV function / concept
PnP	\(X_j, K, d\)	\(R,t\)	\(u_j\)	`cv2.solvePnP`, `cv2.solvePnPRansac`
Camera Calibration	\(X_j\)	\(K,d,\{R_i,t_i\}\)	\(u_{ij}\)	`cv2.calibrateCamera`
Bundle Adjustment (SfM)	usually \(K,d\), part of the gauge	\(\{R_i,t_i\}, \{X_j\}\)	\(u_{ij}\)	BA, visual SLAM back-end
Uncalibrated Monocular SfM	part of the gauge	\(K,d,\{R_i,t_i\},\{X_j\}\)	\(u_{ij}\)	uncalibrated SfM, auto-calibration

Key takeaway: PnP, camera calibration, and bundle adjustment share the same reprojection residual and the same reprojection-error objective. They differ in which parameter blocks are assumed known and which are optimized.

In order of difficulty

The main problems of multi-view geometry and visual SLAM share the same projection model, but their difficulty changes sharply depending on what is known and what must be solved for.

Legend: Unknown (to solve), Known, Observation

Problem	What is solved	Color-coded equation	Difficulty
Forward Projection	Plug in \(K,d,R,t,X\) and compute the pixel location directly	\(\hat{u} = \pi(\fix{K},\fix{d},\fix{R},\fix{t},\fix{X})\)	Direct evaluation
PnP	Given \(K,d,X,u\), estimate \(R,t\)	\(\min_{\var{R,t}}\sum_j\|\obs{u_j}-\pi(\fix{K},\fix{d},\var{R},\var{t},\fix{X_j})\|^2\)	Medium
Triangulation	Given \(K,d,R,t,u\), estimate \(X\)	\(\min_{\var{X_j}}\sum_i\|\obs{u_{ij}}-\pi(\fix{K},\fix{d},\fix{R_i},\fix{t_i},\var{X_j})\|^2\)	Medium
Bundle Adjustment (SfM)	Refine \(R,t,X\) jointly	\(\min_{\var{\{R_i,t_i\},\{X_j\}}}\sum_{i,j}\|\obs{u_{ij}}-\pi(\fix{K},\fix{d},\var{R_i},\var{t_i},\var{X_j})\|^2\)	Hard
Camera Calibration	Estimate \(K,d,R_i,t_i\) jointly	\(\min_{\var{K,d,\{R_i,t_i\}}}\sum_{i,j}\|\obs{u_{ij}}-\pi(\var{K},\var{d},\var{R_i},\var{t_i},\fix{X_j})\|^2\)	Hard
Uncalibrated Monocular SfM	Estimate nearly all of \(K,d,R,t,X\) from image correspondences alone	\(\min_{\var{K,d,\{R_i,t_i\},\{X_j\}}}\sum_{i,j}\|\obs{u_{ij}}-\pi(\var{K},\var{d},\var{R_i},\var{t_i},\var{X_j})\|^2\)	Harder

In practice: the table above uses the plain \(\|e_{ij}\|^2\) form for clarity, but real BA/SLAM systems often wrap it in a robust kernel \(\rho\) to reduce the influence of outliers.

\[ \min_{\theta} \sum_{i,j} \rho\!\left( \left\| e_{ij}(\theta) \right\|^2 \right) \]

Here \(\rho\) denotes a robust loss such as Huber or Cauchy.
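To make the effect of \(\rho\) tangible, the sketch below evaluates Huber and Cauchy kernels on a squared residual norm \(s = \|e\|^2\), matching the \(\rho(\|e\|^2)\) form above. The parametrization by a scale `delta` (in pixels) and the function names are assumptions chosen for illustration; libraries differ in their exact conventions.

```python
import math


def huber(s, delta=1.0):
    """Huber kernel on a squared residual norm s = ||e||^2.

    Quadratic (unchanged) for ||e|| <= delta, linear in ||e|| beyond that.
    """
    if s <= delta * delta:
        return s
    return 2.0 * delta * math.sqrt(s) - delta * delta


def cauchy(s, delta=1.0):
    """Cauchy kernel on s = ||e||^2: grows only logarithmically for large s."""
    return delta * delta * math.log1p(s / (delta * delta))


# Residual norms in pixels; 50 px mimics a badly mismatched feature.
for r in [0.5, 1.0, 3.0, 50.0]:
    s = r * r
    print(f"|e|={r:5.1f}  squared={s:8.2f}  huber={huber(s):7.2f}  cauchy={cauchy(s):6.2f}")
```

With `delta = 1`, the 50-pixel outlier contributes 2500 to the plain squared objective but only 99 under Huber, so it can no longer dominate the sum.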

Must-Read List

This list is ordered not by how famous the papers are but as a reading path for understanding the equations and discussion on this page in depth.

Foundations

Hartley & Zisserman, Multiple View Geometry in Computer Vision

The standard reference for the language of projection, epipolar geometry, triangulation, calibration, and SfM.

Foundations

Triggs et al., “Bundle Adjustment: A Modern Synthesis”

The key reference for understanding BA as a sparse nonlinear least-squares problem.

Feature SLAM

Klein & Murray, “Parallel Tracking and Mapping” (PTAM)

The starting point of modern keyframe-based visual SLAM, separating tracking from mapping.

Feature SLAM

Mur-Artal, Montiel, Tardós, “ORB-SLAM”

The canonical feature-based monocular SLAM system. Its tracking, local mapping, and loop closing structure is instructive.

Feature SLAM

Mur-Artal & Tardós, “ORB-SLAM2”

Good for seeing how the scale problem changes with the sensor across monocular, stereo, and RGB-D setups.

SfM

Schönberger & Frahm, “Structure-from-Motion Revisited” / COLMAP

Shows, from the offline SfM perspective, how matching, incremental reconstruction, and BA are combined into a system.

Direct

Engel, Schöps, Cremers, “LSD-SLAM”

A classic for understanding the direct / semi-dense monocular SLAM family.

Direct

Engel, Koltun, Cremers, “Direct Sparse Odometry” (DSO)

The representative photometric-residual system. It backs the caveat that feature reprojection is not the whole story.

Semi-Direct

Forster et al., “SVO: Fast Semi-Direct Monocular Visual Odometry”

The semi-direct view, between feature-based and direct methods. Good for studying the trade-offs made for speed and real-time operation.

Metric Scale

Qin, Li, Shen, “VINS-Mono”

The representative system in which monocular scale becomes metric through fusion with an IMU.

Projection Models

Omnidirectional / Fisheye DSO

Shows that fisheye and omnidirectional cameras do not change the residual family; they extend the projection model \(\pi(\cdot)\).

Discussion notes

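Why, then, is calibration usually the first thing done in the lab?

Judged by its equations, camera calibration is a fairly hard multi-view optimization that estimates \(K,d,R_i,t_i\) jointly. In experiments, however, it is the metric foundation that must be in place before the camera can be used, so it is learned first. It also helps that a chessboard or circle grid provides a known 3D target with accurately known \(X_j\), that planar homographies give a good initial estimate, and that OpenCV hides the LM refinement behind a single function call. So although it is a dense problem in theory, in practice it serves as the entry point to robotics and visual SLAM.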

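Do I really need to know the chessboard square size for calibration?

If only \(K,d\) are of interest, calibration still works with the square size set to \(1.0\). The intrinsic matrix \(K\) is a camera model in pixel units, and a uniform scaling of the object points is absorbed by the translations rather than by \(K\). The translations \(t_i\), however, inherit the unit of the object points: with the square size given in meters \(t_i\) is in meters, and with \(1.0\) the translation is in units of squares. When a metric translation is needed, as in LiDAR-camera or robot-camera extrinsic calibration, the true square size matters.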
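The following sketch illustrates this with a synthetic board. It assumes a distortion-free pinhole camera looking straight at the board (\(R = I\)); the helper names `project` and `board_points` and all numbers are hypothetical. The same pixels are produced by a metric board with a metric translation and by a unit-square board with the translation expressed in squares, while `K` stays identical.

```python
def project(K, t, X):
    """Pinhole projection with R = I and no distortion (illustrative assumptions)."""
    fx, fy, cx, cy = K
    xc, yc, zc = X[0] + t[0], X[1] + t[1], X[2] + t[2]
    return (fx * xc / zc + cx, fy * yc / zc + cy)


def board_points(cols, rows, square):
    """Planar chessboard corners in the board frame (z = 0)."""
    return [[c * square, r * square, 0.0] for r in range(rows) for c in range(cols)]


K = (600.0, 600.0, 320.0, 240.0)

# Case 1: square size in meters, translation in meters.
square_m = 0.025
t_m = [-0.05, -0.03, 0.5]
pixels_metric = [project(K, t_m, X) for X in board_points(4, 3, square_m)]

# Case 2: square size 1.0, translation expressed in units of squares.
t_squares = [v / square_m for v in t_m]
pixels_unit = [project(K, t_squares, X) for X in board_points(4, 3, 1.0)]

max_diff = max(abs(p[0] - q[0]) + abs(p[1] - q[1])
               for p, q in zip(pixels_metric, pixels_unit))
print("t in squares:", t_squares)
print("max pixel difference:", max_diff)
```

Because the image observations cannot distinguish the two cases, a calibration run with square size \(1.0\) would explain the data equally well, with the depth reported as roughly 20 squares instead of 0.5 m.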

Are OpenCV's solvePnP and calibrateCamera the same function?

They are not the same function, but they belong to the same residual family. solvePnP assumes \(K,d,X_j,u_j\) are known and solves only for \(R,t\) of one view: a pose-only problem. calibrateCamera uses \(X_j,u_{ij}\) to solve for \(K,d\) together with \(R_i,t_i\) of every view: a camera-model-plus-poses problem. Intuitively, it is a joint reprojection optimization that bundles several PnP problems and additionally promotes the intrinsics and distortion to unknowns.

Is OpenCV's \(R,t\) world-to-camera or camera-to-world?

The \(rvec,tvec\) returned by OpenCV's solvePnP and calibrateCamera normally describe the transform that takes object/world points into the camera frame, i.e. \(X_c = R X_w + t\). To express the camera pose as a position and orientation in the world frame, take the inverse: \(R_{wc}=R^\top,\; C_w=-R^\top t\). Mixing up the direction gives projections that look right but a trajectory that is interpreted backwards.
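The inversion is short enough to write out. This is a minimal sketch that assumes the rotation matrix \(R\) has already been obtained from `rvec` (for example with `cv2.Rodrigues`) and is stored as a nested list; the helper names are illustrative.

```python
def transpose(R):
    return [[R[c][r] for c in range(3)] for r in range(3)]


def matvec(R, v):
    return [sum(R[r][c] * v[c] for c in range(3)) for r in range(3)]


def invert_pose(R, t):
    """Given X_c = R X_w + t, return (R_wc, C_w).

    R_wc is the camera orientation in the world frame and
    C_w = -R^T t is the camera center in world coordinates.
    """
    R_wc = transpose(R)
    C_w = [-x for x in matvec(R_wc, t)]
    return R_wc, C_w


# World-to-camera pose: 90-degree rotation about z plus a translation.
R = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
t = [1.0, 2.0, 3.0]

R_wc, C_w = invert_pose(R, t)
print("camera center in world frame:", C_w)

# Sanity check: the camera center must map to the camera-frame origin.
origin_check = [a + b for a, b in zip(matvec(R, C_w), t)]
print("R C_w + t =", origin_check)
```

Note that `tvec` itself is not the camera position; plotting `tvec` as a trajectory is the typical symptom of mixing up the two directions.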

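Does a set of 2D-3D correspondences always yield a good pose?

Not necessarily. The pose becomes unstable when the 3D points are clustered in a small region or are nearly collinear, when the correspondences contain many outliers, or when \(K,d\) are inaccurate. Even with purely planar points, ambiguity can grow for certain motions or view configurations. Real systems therefore usually reject outliers with RANSAC and then polish the pose with LM refinement or local BA.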

How do triangulation and BA differ?

Triangulation fixes the camera poses and camera model and solves for the 3D point \(X_j\) from its 2D observations. Bundle adjustment, in contrast, adjusts the camera poses \(R_i,t_i\) together with \(X_j\) to reduce the total reprojection error. Triangulation is thus closer to the step that first creates map points, and BA to the step that jointly refines poses and map once they exist.

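Why does a good initial estimate matter?

The reprojection-error objective is nonlinear and generally non-convex. Gauss-Newton and Levenberg-Marquardt linearize the residual around the current estimate and repeat the update. If the initial estimate lies in a good basin, convergence is fast; if it is too poor, the solver can end up in a local minimum, lock onto wrong correspondences, pick the wrong pose branch, or produce implausible distortion estimates.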
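The linearize-and-update loop can be written compactly. In this sketch \(\theta\) stacks whichever parameter blocks are being optimized, \(e(\theta)\) stacks all residuals \(e_{ij}\), and \(J\), \(\Delta\theta\), and \(\lambda\) are symbols introduced here for illustration:

\[ e(\theta + \Delta\theta) \approx e(\theta) + J\,\Delta\theta, \qquad J = \frac{\partial e}{\partial \theta} \]

\[ \left( J^\top J + \lambda I \right) \Delta\theta = -J^\top e(\theta), \qquad \theta \leftarrow \theta + \Delta\theta \]

Setting \(\lambda = 0\) gives the Gauss-Newton step; \(\lambda > 0\) gives one common form of Levenberg-Marquardt damping. Because \(J\) is evaluated at the current \(\theta\), the step is only trustworthy where the linear approximation holds, which is why the starting point matters.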

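PnP gives a metric pose, so why can't monocular BA recover scale?

From the equations alone the difference is hard to see, because both use the same reprojection residual. The important difference lies in an implicit assumption that the equations do not state. In PnP, \(X_j\) are usually metric map points already known in meters or millimeters, so the translation \(t\) estimated relative to \(\fix{X_j}\) carries the same metric unit.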

In monocular BA, \(X_j\) usually has no metric anchor

BA in monocular visual SLAM also uses 2D-3D correspondences. The 3D points \(X_j\), however, are usually not given metrically from outside, as with a LiDAR map or a chessboard, but are map points triangulated from the monocular images themselves. Scaling the map and the camera translations together by the same factor then leaves the reprojection error unchanged, so the reconstruction is determined only up to scale.

So the issue is not whether 2D-3D is used, but the unit of the 3D points

With 2D-3D correspondences the pose can generally be solved, but its scale follows the scale of the 3D points. If \(X_j\) are known metric points, the PnP pose is metric; if \(X_j\) are arbitrary-scale map points built by monocular SLAM, the BA and tracking results are arbitrary-scale poses too.

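Gauge freedom is more than scale

Monocular scale is the most visible case, but gauge freedom in BA/SfM is not limited to scale. The reprojection error alone does not determine the origin or orientation of the world frame, nor, in the monocular case, the overall scale. "Fixing part of the gauge" in BA therefore means fixing the first pose, adding a scale prior, or tying down the world frame by some convention so that the optimization problem is not left free to drift.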
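The full gauge can be written as a similarity transform of the world frame. As a sketch, assuming the convention \(X_c = R_i X_j + t_i\) and introducing an arbitrary rotation \(Q\), translation \(c\), and scale \(s > 0\) (symbols not used elsewhere on this page), substitute

\[ X_j' = s\,Q X_j + c, \qquad R_i' = R_i Q^\top, \qquad t_i' = s\,t_i - R_i Q^\top c \]

so that

\[ R_i' X_j' + t_i' = s\,(R_i X_j + t_i) \]

Every camera-frame point is scaled by \(s\), so every reprojection is unchanged. That is \(3 + 3 + 1 = 7\) unobservable degrees of freedom in the monocular case: fixing the first pose removes six of them, and a scale prior or metric anchor removes the last.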

Does running BA longer produce a scale?

No. BA only adjusts the variables to satisfy the given residuals better, and the monocular reprojection residual is invariant to a joint scaling of scene points and camera translations. Without an external metric cue, running the optimization longer can make the map and trajectory more consistent, but it does not create an absolute scale in meters.

Does loop closure give monocular SLAM a scale?

Pure monocular loop closure is important for reducing drift and inconsistency, but it does not supply an absolute metric scale. Many monocular SLAM systems use a Sim(3) pose graph to correct scale drift. That aligns the relative scale between different segments of the trajectory; without an external metric anchor it cannot tell how many meters a given motion actually was.

Why is the reprojection error measured in pixels?

The reprojection residual is the difference between the observed and predicted pixel coordinates on the image plane, \(u_{ij}-\hat{u}_{ij}\), so its natural unit is the pixel. This is why feature localization noise, image resolution, pyramid level, and subpixel corner refinement all matter. The same 1-pixel error can mean different things depending on sensor resolution and feature uncertainty.

Why use a robust kernel?

The squared reprojection error is highly sensitive to outliers. A single mismatched feature with a large residual can make the optimizer drag the pose or map points to the wrong place in an attempt to fit it. Real BA and SLAM back-ends therefore often attach a robust loss such as Huber or Cauchy to reduce the influence of large residuals.

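Can all observations be trusted equally?

Usually not. Chessboard corners, ORB features, optical flow, and learned features have different localization uncertainties. Attaching a covariance or weight to each observation turns the objective into weighted least squares. The residual keeps the same form, but the result of the optimization changes with how much each observation is trusted.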
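Written out, and assuming each observation \(u_{ij}\) comes with a \(2\times 2\) covariance \(\Sigma_{ij}\) (a symbol introduced here for illustration), the weighted objective is

\[ \min_{\theta} \sum_{i,j} e_{ij}(\theta)^\top\, \Sigma_{ij}^{-1}\, e_{ij}(\theta) \]

With isotropic noise \(\Sigma_{ij} = \sigma_{ij}^2 I\) this reduces to \(\sum_{i,j} \|e_{ij}\|^2 / \sigma_{ij}^2\), so a feature detected at a coarse pyramid level, with a larger \(\sigma_{ij}\), pulls on the solution less than a subpixel-refined chessboard corner.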

Does a small reprojection error always mean a good map?

Not always. With wrong intrinsics, rolling shutter, dynamic objects, repetitive texture, or degenerate motion, a low reprojection error does not guarantee good geometry. In particular, giving the model too many degrees of freedom can overfit it to the data at hand and yield a solution that is physically implausible.

Why are rotation and translation not updated in the same way?

Translation is a vector in \(\mathbb{R}^3\), so an additive update is natural, but rotation lives on the \(SO(3)\) manifold. SLAM/BA implementations therefore commonly avoid adding to the rotation matrix directly; they solve for a Lie-algebra perturbation \(\delta\theta\) and update \(R\) through the exponential map.
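A minimal sketch of such an update uses the Rodrigues formula for the exponential map. The right-multiplicative convention \(R \leftarrow R\,\exp([\delta\theta]_\times)\) is an assumption made here; implementations also use the left-multiplicative form. Function names are illustrative.

```python
import math


def skew(w):
    """3x3 skew-symmetric matrix [w]_x of a 3-vector w."""
    return [[0.0, -w[2], w[1]],
            [w[2], 0.0, -w[0]],
            [-w[1], w[0], 0.0]]


def matmul(A, B):
    return [[sum(A[r][k] * B[k][c] for k in range(3)) for c in range(3)]
            for r in range(3)]


def so3_exp(w):
    """Rodrigues formula: exp([w]_x) = I + a [w]_x + b [w]_x^2."""
    theta = math.sqrt(sum(x * x for x in w))
    if theta < 1e-8:
        a, b = 1.0, 0.5  # small-angle limits of sin(t)/t and (1-cos(t))/t^2
    else:
        a = math.sin(theta) / theta
        b = (1.0 - math.cos(theta)) / (theta * theta)
    W = skew(w)
    W2 = matmul(W, W)
    return [[(1.0 if r == c else 0.0) + a * W[r][c] + b * W2[r][c]
             for c in range(3)] for r in range(3)]


def update_rotation(R, dtheta):
    """Assumed convention: right-multiplicative update R <- R * exp([dtheta]_x)."""
    return matmul(R, so3_exp(dtheta))


def update_translation(t, dt):
    """Translation is a plain vector: additive update."""
    return [a + b for a, b in zip(t, dt)]


I3 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
R_new = update_rotation(I3, [0.0, 0.0, math.pi / 2])  # 90 degrees about z
t_new = update_translation([0.0, 0.0, 1.0], [0.1, 0.0, 0.0])
print(R_new)
print(t_new)
```

The result of `update_rotation` stays a valid rotation matrix (orthonormal, determinant 1) by construction, which adding a \(3\times 3\) increment to \(R\) entry by entry would not guarantee.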

How do SLAM and SfM differ?

Visual SLAM is sometimes described as sequential or streamed SfM. Unlike offline SfM, which processes many images in one batch, SLAM runs tracking, mapping, local BA, and loop closure in the order the image stream arrives, under real-time or near-real-time constraints.

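Why does visual SLAM run local BA often and global BA only occasionally?

Global BA optimizes all keyframes and map points together, so it is accurate but expensive. A real-time SLAM system cannot optimize the entire map at every frame, so it frequently runs local BA over only the recent keyframes and their surrounding map points. Global BA is run occasionally, after a loop closure or in a background thread, and mainly serves to restore global consistency.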

Where do rolling shutter and time offset fit in this table?

Rolling shutter and camera-IMU time offset are not exceptions outside the residual family; they can be viewed as calibration/temporal parameters added to the projection model \(\pi(\cdot)\) or to the state vector.

\[ \min_{\var{\{R_i,t_i\},\{X_j\},\tau}} \sum_{i,j} \left\| \obs{u_{ij}} - \pi_{\mathrm{RS}}\!\left(\fix{K},\fix{d},\var{R_i(t_{ij}+\tau)},\var{t_i(t_{ij}+\tau)},\var{X_j}\right) \right\|^2 \]

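Here \(\pi_{\mathrm{RS}}\) is a rolling-shutter-aware projection model and \(\tau\) is the sensor time offset. The argument \(t_{ij}\) inside the parentheses is the capture time of observation \((i,j)\), not a translation. More complex sensor models thus enter by extending the projection function inside the residual and the set of unknown blocks.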

Where does a fisheye camera fit in this table?

Fisheye is likewise a change of projection model, not of residual family. The pinhole projection \(\pi\) is replaced by a fisheye projection \(\pi_{\mathrm{fish}}\), and the distortion parameters \(d_{\mathrm{fish}}\) are treated as a fixed or variable block.

\[ \min_{\var{K,d_{\mathrm{fish}},\{R_i,t_i\}}} \sum_{i,j} \left\| \obs{u_{ij}} - \pi_{\mathrm{fish}}\!\left(\var{K},\var{d_{\mathrm{fish}}},\var{R_i},\var{t_i},\fix{X_j}\right) \right\|^2 \]

For example, OpenCV's fisheye calibration uses a fisheye-specific projection and distortion model instead of the usual pinhole distortion model. The optimization viewpoint of reducing the difference between observed and predicted pixels stays the same.
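To show how little changes, the sketch below swaps only the projection function. The polynomial-in-\(\theta\) form with four coefficients is an assumption modeled on the model described in OpenCV's fisheye documentation, with skew ignored; the function names and numbers are illustrative, and points are given directly in the camera frame.

```python
import math


def project_pinhole(K, Xc):
    """Ideal pinhole projection of a camera-frame point (no distortion)."""
    fx, fy, cx, cy = K
    return (fx * Xc[0] / Xc[2] + cx, fy * Xc[1] / Xc[2] + cy)


def project_fisheye(K, k, Xc):
    """Fisheye projection of a camera-frame point.

    Assumed model: theta = atan(r), then
    theta_d = theta * (1 + k1*theta^2 + k2*theta^4 + k3*theta^6 + k4*theta^8).
    """
    fx, fy, cx, cy = K
    a, b = Xc[0] / Xc[2], Xc[1] / Xc[2]
    r = math.hypot(a, b)
    if r < 1e-12:
        return (cx, cy)  # point on the optical axis
    theta = math.atan(r)
    t2 = theta * theta
    theta_d = theta * (1.0 + k[0] * t2 + k[1] * t2**2 + k[2] * t2**3 + k[3] * t2**4)
    scale = theta_d / r
    return (fx * scale * a + cx, fy * scale * b + cy)


K = (300.0, 300.0, 320.0, 240.0)
k_zero = [0.0, 0.0, 0.0, 0.0]   # pure equidistant case
Xc = [1.0, 0.0, 1.0]            # 45 degrees off the optical axis

u_pin = project_pinhole(K, Xc)
u_fish = project_fisheye(K, k_zero, Xc)
print("pinhole:", u_pin)
print("fisheye:", u_fish)
```

The residual is still `u_obs - u_hat`; only the function that produces `u_hat` differs, and a ray 45 degrees off-axis lands closer to the image center under the fisheye model than under the pinhole model.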