- Joint representation of geometry, color, and semantics using a 3D neural field enables accurate dense labelling from very sparse interactions/inputs. The semantic classes are defined on the fly.
- The underlying model is a single MLP trained from scratch in real time to learn a joint neural scene representation. What if we used a hierarchical representation instead of an MLP for better geometric detail (see the sketch below)? The scene model (MLP) is updated and visualized in real time.
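A minimal sketch of what such a hierarchical alternative could look like: a multi-resolution dense feature grid queried by trilinear interpolation (in the spirit of NICE-SLAM's feature grids), feeding a small MLP decoder. All resolutions, feature sizes, and names here are illustrative assumptions, not the actual iLabel or NICE-SLAM architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalGridField(nn.Module):
    """Illustrative multi-resolution feature grid + tiny MLP decoder.
    Resolutions/feature sizes are made up; iLabel itself uses a single MLP."""
    def __init__(self, resolutions=(16, 64), feat_dim=8, out_dim=4):
        super().__init__()
        # One dense 3D feature volume per resolution level.
        self.grids = nn.ParameterList(
            [nn.Parameter(0.01 * torch.randn(1, feat_dim, r, r, r)) for r in resolutions]
        )
        self.decoder = nn.Sequential(
            nn.Linear(feat_dim * len(resolutions), 64), nn.ReLU(),
            nn.Linear(64, out_dim),  # e.g. (r, g, b, density)
        )

    def forward(self, p):
        # p: (M, 3) points normalised to [-1, 1]^3 (grid_sample expects x, y, z order).
        feats = []
        for grid in self.grids:
            coords = p.view(1, -1, 1, 1, 3)                        # (1, M, 1, 1, 3)
            f = F.grid_sample(grid, coords, align_corners=True)    # (1, C, M, 1, 1)
            feats.append(f.view(grid.shape[1], -1).t())            # (M, C)
        return self.decoder(torch.cat(feats, dim=-1))
```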
- iLabel is an online, interactive 3D scene capture system with a unified neural field representation, which allows a user to achieve high-quality dense scene reconstruction and semantic labelling from scratch.
- The user scans a scene and provides very sparse semantic annotations on the keyframe images. No prior training data is required, and new categories can be defined on the fly.
- Hierarchical semantic labeling can be achieved by interpreting the outputs as branches of a binary tree.
- Using neural representations for scene understanding requires an RGB-D based geometric SLAM system maintaining a dense 3D map of the scene (think NICE-SLAM's hierarchical feature grids or the MLP from NodeSLAM), plus a semantic segmentation module that labels the scene. Multi-view semantic predictions are incrementally fused into the geometric model, which gives us densely labelled 3D scenes.
- iLabel uses the same SLAM system as iMAP: a single MLP which maps a 3D coordinate to a color $c=(r,g,b)$ and volume density $\sigma$, as in NeRF. It optimizes the MLP ***(mapping)*** and the keyframe poses ***(tracking)*** through differentiable volume rendering on sampled pixels.
- This paper adds a semantic head to the MLP (easily added to a hierarchical branch as well) that predicts either a flat softmax distribution or a binary hierarchical tree. Semantic supervision comes from a sparse group of labelled pixels provided in the keyframes; the MLP densely propagates these labels through the scene. (How can we get the hierarchical representation to do this?)
- The underlying representation (MLP, or grids) is continuously optimized to output color, semantics, and volume density: $F_{\theta}(p)=(c,s,\rho)$.
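A minimal sketch of such a field, assuming a NeRF-style sinusoidal positional encoding (iMAP itself uses a Gaussian embedding) and made-up layer widths; `num_classes` is the flat-label case, or the tree depth in the hierarchical case.

```python
import torch
import torch.nn as nn

def positional_encoding(p, n_freqs=6):
    # NeRF-style sin/cos encoding of 3D points (assumption; iMAP uses a Gaussian embedding).
    out = [p]
    for k in range(n_freqs):
        out += [torch.sin(2.0 ** k * p), torch.cos(2.0 ** k * p)]
    return torch.cat(out, dim=-1)

class ILabelField(nn.Module):
    """F_theta(p) -> (colour c, semantic logits s, density rho). Sizes are illustrative."""
    def __init__(self, num_classes=10, hidden=256, n_freqs=6):
        super().__init__()
        in_dim = 3 + 3 * 2 * n_freqs
        self.trunk = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.color_head = nn.Linear(hidden, 3)           # c = (r, g, b)
        self.sem_head = nn.Linear(hidden, num_classes)   # s: flat logits, or per-level tree logits
        self.density_head = nn.Linear(hidden, 1)         # rho

    def forward(self, p):
        h = self.trunk(positional_encoding(p))
        return torch.sigmoid(self.color_head(h)), self.sem_head(h), self.density_head(h)
```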
- The scene representation is updated with respect to volumetric renderings of depth, color, and semantics, computed by compositing the queried network values along the back-projected ray of each pixel:
- $\hat{D}[u, v]=\sum_{i=1}^{N} w_{i} d_{i}, \quad \hat{I}[u, v]=\sum_{i=1}^{N} w_{i} \mathbf{c}_{i}, \quad \hat{S}[u, v]=\sum_{i=1}^{N} w_{i} \mathbf{s}_{i}$
- The weight $w_{i}$ is the ray termination probability used for volume rendering in the final step. It is the same as in NICE-SLAM and NeRF.
- $w_{i}=o_{i} \prod_{j=1}^{i-1}\left(1-o_{j}\right)$ is the ray termination probability of sample $i$ at depth $d_{i}$ along the ray; $o_{i}=1-\exp(-\rho_{i}\delta_{i})$ is the occupancy activation, and $\delta_{i}$ is the distance between two consecutive sample points on the ray.
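A minimal sketch of these rendering equations, assuming per-ray samples already queried from the field above; the tensor shapes and function name are assumptions.

```python
import torch

def render_rays(rgb, sem_logits, rho, depths):
    """Composite colour, semantics, and depth along a batch of rays.
    rgb: (R, N, 3), sem_logits: (R, N, S), rho: (R, N), depths: (R, N) sample depths d_i."""
    delta = depths[:, 1:] - depths[:, :-1]                    # spacing delta_i between samples
    delta = torch.cat([delta, delta[:, -1:]], dim=-1)         # pad the last interval
    o = 1.0 - torch.exp(-rho * delta)                         # occupancy o_i
    # Ray termination probability w_i = o_i * prod_{j<i} (1 - o_j)
    trans = torch.cumprod(
        torch.cat([torch.ones_like(o[:, :1]), 1.0 - o[:, :-1]], dim=-1), dim=-1)
    w = o * trans
    D_hat = (w * depths).sum(-1)                              # rendered depth
    I_hat = (w.unsqueeze(-1) * rgb).sum(-2)                   # rendered colour
    S_hat = (w.unsqueeze(-1) * sem_logits).sum(-2)            # rendered semantic logits
    return D_hat, I_hat, S_hat, w
```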
- Like in NICE-SLAM, the geometry (mapping) and keyframe camera poses (localization) are optimized using photometric and geometric losses on the rendered RGB and depth images. Semantics are optimized with a cross-entropy loss using the sparse labels provided in the keyframes (add this in NICE-SLAM).
- The semantic loss is either a flat semantic loss or a hierarchical loss. The flat semantic loss uses the rendering equations above with a cross-entropy loss between the GT label and the rendered semantic prediction, which is straightforward (see the sketch below).
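A minimal sketch of the combined mapping loss under these assumptions: L1 photometric/geometric terms plus a sparse cross-entropy term on the labelled pixels. The loss weights and names are made up.

```python
import torch
import torch.nn.functional as F

def ilabel_losses(D_hat, I_hat, S_hat, depth_gt, rgb_gt, label_gt, labelled_mask,
                  w_depth=1.0, w_rgb=5.0, w_sem=1.0):
    """D_hat/I_hat/S_hat: rendered depth/colour/semantic logits for a batch of sampled pixels.
    labelled_mask marks the (very sparse) pixels that carry user clicks."""
    loss_depth = torch.abs(D_hat - depth_gt).mean()              # geometric term
    loss_rgb = torch.abs(I_hat - rgb_gt).mean()                  # photometric term
    if labelled_mask.any():                                      # flat semantic term (sparse)
        loss_sem = F.cross_entropy(S_hat[labelled_mask], label_gt[labelled_mask])
    else:
        loss_sem = S_hat.sum() * 0.0                             # no labelled pixels in this batch
    return w_depth * loss_depth + w_rgb * loss_rgb + w_sem * loss_sem
```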
- The hierarchical semantic loss is a novel loss for labelling and prediction. The network output $s_{i}$ is still an $n$-dimensional vector, but it is not followed by a softmax layer over semantic labels: here $n$ represents the depth of a binary tree, not the number of semantic classes.
- The logits are rendered using $\hat{S}[u, v]=\sum_{i=1}^{N} w_{i} \mathbf{s}_{i}$, but the final activation is changed: a sigmoid is applied to the rendered logits, squashing each value between 0 and 1.
- Each logit is then thresholded at 0.5: if the value is > 0.5 the node is set to 1, else 0. The GT label corresponds to selecting a node from the binary tree. The GT label is transformed into a sequence of binary branching decisions (how?? see the next bullet), and a binary cross-entropy loss is calculated for each rendered value. A label selecting a tree node at level $L$ only contributes to the loss up to level $L$: $\hat{S}^{j}[u,v],\ j\in \{1,\dots,L\}$.
- Example: suppose the semantic head outputs three levels in the logits, i.e., 3 logits. At level L1 the user separates foreground and background; a background label is given as $[0,*,*]$, where $*$ signifies that the loss is not calculated for that level. The labels are then divided into further nodes, so at level L2 a wall label becomes $[0,1,*]$: 0 ⇒ background, 1 ⇒ wall, $*$ ⇒ doesn't count towards the loss.
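A minimal sketch of how the tree label and its masked binary cross-entropy could be implemented, assuming each user click is stored as its branching decisions down to the labelled level (e.g. background at level 1 is `[0]`, wall at level 2 is `[0, 1]`); the masking convention below stands in for the $*$ entries.

```python
import torch
import torch.nn.functional as F

def encode_tree_label(branches, tree_depth):
    """branches: list of 0/1 decisions down to the labelled level, e.g. [0, 1] for 'wall'.
    Returns (target, mask): levels past the labelled one are masked out of the loss."""
    target = torch.zeros(tree_depth)
    mask = torch.zeros(tree_depth)
    target[:len(branches)] = torch.tensor(branches, dtype=torch.float32)
    mask[:len(branches)] = 1.0
    return target, mask

def hierarchical_semantic_loss(S_hat, targets, masks):
    """S_hat: (P, tree_depth) rendered logits for the labelled pixels.
    Sigmoid + binary cross-entropy per level, counting only the levels that carry a label."""
    bce = F.binary_cross_entropy_with_logits(S_hat, targets, reduction="none")
    return (bce * masks).sum() / masks.sum().clamp(min=1.0)

def decode_tree(S_hat):
    """At query time: threshold each sigmoid at 0.5 to walk down the binary tree."""
    return (torch.sigmoid(S_hat) > 0.5).long()
```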
- The architecture also uses uncertainty-based sampling to actively propose pixel positions for label acquisition. For measuring uncertainty they use the entropy of the rendered softmax distribution: $u_{\text{entropy}}=-\sum_{c=1}^{C} \hat{\mathbf{S}}^{c}[u, v] \log \left(\hat{\mathbf{S}}^{c}[u, v]\right)$, where $C$ is the number of semantic classes. To decide which frame to query, they compute the entropy of all keyframes (sum of entropy over all pixels) and select the one with the highest entropy.
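A minimal sketch of this uncertainty measure and the keyframe ranking, assuming rendered semantic logits per keyframe; the function names and shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def pixel_entropy(sem_logits):
    """sem_logits: (H, W, C) rendered semantic logits for one keyframe.
    Returns the per-pixel entropy of the softmax distribution."""
    probs = F.softmax(sem_logits, dim=-1)
    return -(probs * torch.log(probs.clamp(min=1e-12))).sum(-1)   # (H, W)

def select_keyframe(keyframe_logits):
    """Pick the keyframe whose summed per-pixel entropy is highest."""
    totals = torch.stack([pixel_entropy(l).sum() for l in keyframe_logits])
    return int(torch.argmax(totals))
```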