PTAM: Parallel Tracking and Mapping for Small AR Workspaces

http://www.robots.ox.ac.uk/~gk/publications/KleinMurray2007ISMAR.pdf

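Whenever I study SLAM or read a SLAM paper, I feel that my weak mathematical background makes it hard to get started, so I tend to look for material and papers that are at least a little easier to follow. PTAM is an old paper from 2007, but it had a huge influence on the direction of SLAM, and it is a fairly friendly paper that at least briefly walks through the underlying theory.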


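SLAM is fundamentally an algorithm in which localization (tracking) and mapping are performed at the same time. PTAM runs tracking and mapping concurrently in two separate threads, which makes real-time operation possible; I think this is its biggest contribution.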

Method Overview in the Context of SLAM

Tracking and Mapping are separated, and run in two parallel threads
Mapping is based on keyframes, which are processed using batch techniques (Bundle Adjustment)
The map is densely initialized from a stereo pair (5-point Algorithm)
New points are initialized with an epipolar search
Large numbers of points are mapped

The list above is the PTAM authors' own summary of how the method works.

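PTAM's target application is hand-held AR, which is harder than typical robot motion. A robot can provide odometry and does not move very fast. When a person films with a camera held in the hand, however, there is no odometry, and the irregular motion causes many errors. The state of the art before PTAM, EKF-SLAM and FastSLAM 2.0, reportedly did not work well in this setting even when methods such as RANSAC (Random Sample Consensus) were applied to handle these errors.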

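For this reason, PTAM separates tracking from mapping, which removes the per-frame mapping cost and lets more of the budget go to tracking, improving its performance. Mapping in turn is no longer tied to tracking, so it uses only useful keyframes rather than every frame. Unlike tracking, mapping does not have to run in strict real time, which also makes it easier to build larger maps.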
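To make the split concrete, here is a minimal sketch of the two-thread layout using Python's standard `threading` and `queue` modules. It is only a toy: the function name `run_ptam_like_pipeline`, the integer frame ids, and the fixed `keyframe_gap` are my own assumptions, and no real pose estimation or bundle adjustment happens inside.

```python
import queue
import threading


def run_ptam_like_pipeline(num_frames=100, keyframe_gap=20):
    """Toy layout: the tracker touches every frame, the mapper only keyframes."""
    keyframe_queue = queue.Queue()  # hand-off from the tracker to the mapper
    map_keyframes = []              # stands in for the shared map
    map_lock = threading.Lock()
    tracked = []

    def tracking_thread():
        last_keyframe = -keyframe_gap
        for frame_id in range(num_frames):
            tracked.append(frame_id)          # per-frame pose estimation would go here
            if frame_id - last_keyframe >= keyframe_gap:
                keyframe_queue.put(frame_id)  # offer this frame to the mapper
                last_keyframe = frame_id
        keyframe_queue.put(None)              # sentinel: no more frames

    def mapping_thread():
        while True:
            frame_id = keyframe_queue.get()
            if frame_id is None:
                break
            with map_lock:
                map_keyframes.append(frame_id)  # bundle adjustment would run here

    threads = [threading.Thread(target=tracking_thread),
               threading.Thread(target=mapping_thread)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return tracked, map_keyframes
```

The tracker never waits for the mapper: it only drops keyframes into the queue and moves on to the next frame, which is the property that lets tracking stay real-time while mapping takes as long as it needs.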

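For mapping, the local map around the N most recent camera poses is optimized with bundle adjustment. Bundle adjustment is something you always hear about in SLAM, and yet I end up looking it up again every time: it is a technique that uses keypoints observed in multiple frames to jointly optimize the local map, the camera poses, and the relative motion between frames.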
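As a reminder for myself, bundle adjustment can be written in generic notation roughly as below. The symbols are my own choice for this note and not necessarily the paper's: $E_{i}$ is the pose of keyframe $i$, $p_{j}$ is map point $j$, $S_{i}$ is the set of points observed in keyframe $i$, and $(\hat{u}_{ji}, \hat{v}_{ji})$ is the measured image position of point $j$ in keyframe $i$.

$$ \{E_{i}\}, \{p_{j}\} = \operatorname{argmin} \sum_{i}\sum_{j \in S_{i}} \left\| (\hat{u}_{ji}, \hat{v}_{ji})^T - CamProj(E_{i}\,p_{j}) \right\|^2 $$

In words: move all keyframe poses and all map points together until the projected points agree with the measurements as well as possible. A robust cost can replace the squared norm to limit the influence of outliers.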

The Map

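The map consists of $M$ point features in the world coordinate frame $\mathcal{W}$ (in terms of meaning, "system" seems like the more fitting word). The $j$-th point $p_{j}$ is located at $(x_{j\mathcal{W}}, y_{j\mathcal{W}}, z_{j\mathcal{W}}, 1)^T$ in $\mathcal{W}$. Each point also has a patch normal $n_{j}$.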

Camera-centred coordinate system of a keyframe and its four-level image pyramid

The map also contains $N$ keyframes; a keyframe is, as the name suggests, a frame selected as a keyframe from among those captured by the hand-held camera. The transformation between the camera-centred coordinate frame (system) $\mathcal{K}_{i}$ of the $i$-th keyframe and $\mathcal{W}$ is written $E_{\mathcal{K}_{i}\mathcal{W}}$. Each keyframe also stores a four-level subsampled grayscale image pyramid.
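The four-level pyramid is easy to picture with a small sketch. The version below halves the image at each level by averaging 2x2 blocks; the averaging filter and the function names `downsample_2x` and `build_pyramid` are my assumptions for illustration, and the image is a plain nested list of integers so that nothing beyond the standard library is needed.

```python
def downsample_2x(image):
    """Halve width and height by averaging each 2x2 block of a grayscale image."""
    height, width = len(image), len(image[0])
    return [
        [
            (image[y][x] + image[y][x + 1]
             + image[y + 1][x] + image[y + 1][x + 1]) // 4
            for x in range(0, width - 1, 2)
        ]
        for y in range(0, height - 1, 2)
    ]


def build_pyramid(image, levels=4):
    """Return [level 0 (full resolution), level 1, ..., level levels - 1]."""
    pyramid = [image]
    for _ in range(levels - 1):
        pyramid.append(downsample_2x(pyramid[-1]))
    return pyramid
```

With a 640x480 input, level 0 is the full image and level 3 comes out at 80x60; the coarse levels are what make the first, cheap search stage of tracking possible.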

Tracking

The tracking section of PTAM assumes that a map of 3D points has already been built.

A new frame is acquired from the camera, and a prior pose estimate is generated from a motion model (Decaying Velocity model).
Map points are projected into the image according to the frame's prior pose estimate.
A small number of the coarsest-scale features are searched for in the image.
The camera pose is updated from these coarse matches.
A larger number of points is re-projected and searched for in the image.
A final pose estimate for the frame is computed from all the matches found.

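The list above is what the tracking system does for every frame. To summarize, when a new frame arrives: 1) a small set of map points (about 50) is projected into the current image using the prior pose estimate and matched against image feature points, and the camera pose gets a first update from these matches; 2) a larger set of map points (about 1000) is projected again, matched, and used for the final pose update.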
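The prior pose in the first step comes from the decaying velocity model. The sketch below shows the idea for translation only: the velocity estimate is blended with the latest measured motion and then damped, so that without new measurements the predicted camera slows down and eventually stops. The function names and the constants `blend=0.5` and `decay=0.9` are illustrative assumptions on my part, not values quoted from the paper.

```python
def update_velocity(velocity, measured=None, blend=0.5, decay=0.9):
    """Blend in the newly measured motion (if any), then damp the velocity."""
    if measured is not None:
        velocity = [blend * m + (1.0 - blend) * v
                    for m, v in zip(measured, velocity)]
    return [decay * v for v in velocity]


def predict_position(position, velocity, dt=1.0):
    """Prior position for the next frame under the current velocity estimate."""
    return [p + v * dt for p, v in zip(position, velocity)]
```

If tracking fails for a while and `measured` stays `None`, repeated calls shrink the velocity towards zero, so the prediction does not fly off along the last known direction.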

$$ p_{j\mathcal{C}}=E_{\mathcal{CW}}p_{j\mathcal{W}}$$

$$ (u_{j}, v_{j}) = CamProj(p_{j\mathcal{C}}) $$

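The equations above express the map point transformation. The world-coordinate point $p_{j\mathcal{W}}$ is transformed into camera coordinates, and the result is then projected into the image by $CamProj$, which applies the camera intrinsics (for an ideal pinhole camera, a multiplication by the intrinsic matrix followed by division by depth).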
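The two steps can be sketched in a few lines. This assumes an ideal pinhole camera with focal lengths `fx`, `fy` and principal point `cx`, `cy` (my parameter names), and it ignores lens distortion, so it is a simplification of whatever camera model a real system uses.

```python
def transform_point(E_cw, p_w):
    """Apply a 4x4 rigid transform (nested lists, row-major) to a homogeneous point."""
    return [sum(E_cw[r][c] * p_w[c] for c in range(4)) for r in range(4)]


def cam_proj(p_c, fx, fy, cx, cy):
    """Ideal pinhole projection; lens distortion is ignored in this sketch."""
    x, y, z = p_c[0], p_c[1], p_c[2]
    if z <= 0:
        raise ValueError("point is behind the camera")
    return (fx * x / z + cx, fy * y / z + cy)


# Example: a camera whose pose shifts world points by 2 units along its z axis.
E_cw = [[1.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 2.0],
        [0.0, 0.0, 0.0, 1.0]]
p_w = [1.0, 0.5, 2.0, 1.0]
p_c = transform_point(E_cw, p_w)               # [1.0, 0.5, 4.0, 1.0]
u, v = cam_proj(p_c, 400.0, 400.0, 320.0, 240.0)  # (420.0, 290.0)
```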

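If you have seen pinhole camera projection before, you will recognize $E_{\mathcal{CW}}$ as the extrinsic matrix, that is, the camera pose. To get the current pose from the previous one we need the change in camera pose, which means differentiating the second equation above with respect to the pose. Lie algebra is used to make this tractable, but I do not know that part well, so I will skip it.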

To summarize: 1) when the camera moves, the current camera pose is predicted with the decaying velocity motion model. 2) The map points we have are reprojected into the current image and matched against image feature points. 3) As in Figure 3, this leaves a reprojection error between each matched image point and the reprojected map point, and the camera pose is optimized to minimize that error. The objective function for the optimization is as follows.

$$ e_{j} = (\hat{u}_{j}, \hat{v}_{j}) - CamProj(\exp(\mu)E_{\mathcal{CW}}p_{j\mathcal{W}}) $$

$$ \mu' = \operatorname{argmin}_{\mu}\sum_{j \in S}{Obj\left(\frac{|e_{j}|}{\sigma_{j}},\sigma_{T}\right)} $$
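$Obj$ is a robust objective: as far as I understand the paper, it is the Tukey biweight, with $\sigma_{T}$ estimated robustly from the residuals. The sketch below shows what that does in practice as a per-match weight. The median-based `robust_sigma` (median absolute residual divided by 0.6745) and the tuning constant `c=4.6851` are standard textbook choices that I am assuming here, not values taken from PTAM.

```python
def robust_sigma(residuals):
    """Median-based estimate of the residual standard deviation."""
    ordered = sorted(abs(r) for r in residuals)
    n = len(ordered)
    if n % 2:
        median = ordered[n // 2]
    else:
        median = 0.5 * (ordered[n // 2 - 1] + ordered[n // 2])
    return median / 0.6745


def tukey_weight(residual, sigma_t, c=4.6851):
    """Tukey biweight: close to 1 for small residuals, exactly 0 beyond c * sigma_t."""
    r = abs(residual) / (c * sigma_t)
    if r >= 1.0:
        return 0.0
    return (1.0 - r * r) ** 2
```

A bad match with a huge reprojection error gets weight 0 and simply drops out of the pose update, while good matches keep a weight near 1. Note that `sigma_t` must be positive; with all-zero residuals this sketch would divide by zero.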

Mapping

6.1 Map Initialization

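As mentioned above, SLAM stands for Simultaneous Localization and Mapping, so PTAM of course performs mapping as well. At the very start (when the application is launched) there is no map, so the user has to initialize one. When the user moves the camera slightly (with some translation, not a pure rotation), the feature points from the first keyframe are tracked, and the five-point algorithm can then be used to build a base map. The Stereo Initialization step in Figure 4 therefore yields an initial map generated from the first two keyframes.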
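The five-point algorithm itself is too long to sketch here, but the step after it, turning two matched viewing rays into a 3D point once the relative pose is known, also shows why a pure rotation is not enough. The code below uses midpoint triangulation, which is my substitution for illustration and not necessarily what PTAM does; `triangulate_midpoint` and its arguments (camera centres and ray directions in world coordinates) are hypothetical names.

```python
def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def triangulate_midpoint(c1, d1, c2, d2, eps=1e-12):
    """Midpoint of the shortest segment between rays c1 + s*d1 and c2 + t*d2.

    Returns None when the rays are parallel, which is what happens when the
    two views share the same centre (pure rotation, no baseline).
    """
    w = [a - b for a, b in zip(c1, c2)]
    a, b, c = _dot(d1, d1), _dot(d1, d2), _dot(d2, d2)
    d, e = _dot(d1, w), _dot(d2, w)
    denom = a * c - b * b
    if abs(denom) < eps:
        return None
    s = (b * e - c * d) / denom
    t = (a * e - b * d) / denom
    point1 = [ci + s * di for ci, di in zip(c1, d1)]
    point2 = [ci + t * di for ci, di in zip(c2, d2)]
    return [0.5 * (p + q) for p, q in zip(point1, point2)]


# Two cameras one unit apart, both looking at the point (0, 0, 4).
point = triangulate_midpoint((0, 0, 0), (0, 0, 1), (1, 0, 0), (-1, 0, 4))
# Same camera centre for both views: the rays coincide and depth is undefined.
degenerate = triangulate_midpoint((0, 0, 0), (0, 0, 1), (0, 0, 0), (0, 0, 2))
```

With a baseline between the two views the rays intersect at a well-defined depth; with none, the function returns `None`, which is the geometric reason the user has to translate the camera during initialization.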

6.2 Keyframe insertion and epipolar search

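As the camera moves, keyframes are added and the map grows. A frame becomes a keyframe when 1) at least 20 frames have passed since the previous keyframe and 2) the camera is at least a certain distance away from the existing map points.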
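Written as code, the two conditions are just a small predicate. The 20-frame gap comes from the text above; the `min_distance` threshold and its default value are placeholders I made up, since the required distance is not given here.

```python
def should_insert_keyframe(frames_since_last_keyframe, distance_to_map,
                           min_frame_gap=20, min_distance=0.1):
    """Return True when both keyframe conditions described above hold."""
    enough_time = frames_since_last_keyframe >= min_frame_gap
    enough_motion = distance_to_map >= min_distance  # hypothetical threshold
    return enough_time and enough_motion
```

Both checks keep redundant frames out of the map: the frame gap limits how often the mapper is fed, and the distance check ensures the new view actually adds a different viewpoint.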

Because of the real-time constraint, tracking may have used only a subset of the feature points in the frame. The mapping stage therefore adds the remaining map points.
