Occupancy Network

Rodrigo Loro Schuller

December 2019

1 Introduction

1.1 Presentation

Click here to see it in your browser - Firefox is recommended.

Click here for the PDF version.

1.2 Abstract Definition

The problem of representing 3D structures is harder than its 2D counterpart. Good solutions for it are especially important for learning-based algorithms, since bad representations usually yield unreasonably large memory requirements, glaring inconsistencies or other difficulties.

An Occupancy Network [1] is a state-of-the-art solution that uses implicit functions (neural networks with parameters \(\theta\)) to represent 3D objects in a compact and expressive manner. Below we have its formal definition.

Definition (Occupancy Network) For a given input \(x \in X\), we want a binary classification neural network: \(f^x_\theta : \mathbb{R}^3 \to [0,1]\). We can just add \(x\) to the inputs, i.e.,

\[f_\theta : \mathbb{R}^3 \times X \to [0,1].\]

\(f_\theta\) is called the Occupancy Network.

2 Other Representations

A more detailed (and visual) comparison can be found in the presentation. To avoid redundancy, we’ll present a brief synthesis with the main takeaways.

2.1 Voxels

Pros

Simple to use;

Cons

Doesn’t quite work in low resolutions;
Requires a lot of memory;

2.2 Point clouds

Pros

Simple to use and behaves well under geometric transformations;
Doesn’t require a lot of memory;

Cons

Hard to extract the underlying geometry;

2.3 Meshes

Pros

Simple to use and behaves well under geometric transformations;
Doesn’t require a lot of memory;

Cons

Topology limitations or consistency problems - depending on the approach;

3 The Devil is in The Details

Using the abstract definition, the authors presented solutions for 4 different problems. In order to set apart the common definitions from the specificities of each problem, this section is divided into 5 subsections.

Besides the expected lack of complete coverage of all implementation nuances, a few inconsistencies were found in the article and in the supplementary material. The source code settled the questions that arose during my readings, so we’ll adopt the following typographical convention:

(see path_to_a_file) refers to a file in https://github.com/autonomousvision/occupancy_networks.

3.1 Common

3.1.1 ONet Architecture - The Big Picture

According to the supplementary material (sec 1.1 Architectures): “We employ the same occupancy network architecture (Fig. 1) in all experiments”. By studying the source code, one can see that this statement isn’t accurate, since it does not generalize to the unconditional mesh generation experiment.

For the other 3 experiments, the original assertion is correct. With these particularities in mind, we propose the following naming scheme: ONet architecture will be reserved for the architecture described in this section, and the decoder for unconditional mesh generation will be denoted variational decoder.

Naturally, the claims in the previous two paragraphs are quite strong and require reasonable justification. This will be presented in the Unconditional Mesh Generation section.

**Figure 1** Architecture of the *Occupancy Network*.

Input The output of a task-specific encoder \(c \in X = \mathbb{R}^C\) and a batch of \(T\) points \(p_i \in \mathbb{R}^3\).

Output To be consistent with our previous definition, the output is given by the numbers \[f_{\theta}(p_1, c),\, \cdots,\, f_{\theta}(p_T, c) \in [0,1].\] In other words, for each point \(p_i\) in the batch, we get a number in \([0, 1]\).

Evaluation To show how the architecture works we’ll first explain the Big Picture - how the components are connected - and afterwards tell exactly what each component does.

Use a fully-connected layer to produce a 256-dim feature vector from each of the points \(p_i\);
Do 5 times:
- Take the output from the previous step and use a ONet ResNet-block to produce a new 256-dim feature vector;
Take the output from the last ONet ResNet-block and pass it through a CBN layer and a ReLU activation;
Pass the result through a fully-connected layer to project the features down to 1-dim;
Use a Sigmoid activation to obtain a number in \([0,1]\);

Observation In the ONet article, they originally used the nomenclature ResNet-blocks. Since there are different kinds of ResNet-blocks, I’ve added specification tokens to avoid unnecessary confusion.

3.1.2 Activation Functions

ReLU From the source code (see im2mesh/layers.py), it’s clear that PyTorch’s standard ReLU was used. For both the current stable version (1.3.1) and the version used in the project (1.0.0), it is defined as

\[ \mathrm{ReLU}(x) = \max\{0, x\}. \]

Sigmoid The sigmoid function is never applied explicitly in the mesh extraction phase (see im2mesh/onet/generation.py - line 171). Instead, its inverse is applied to the threshold, so that the raw network outputs can be compared against the transformed threshold directly:

threshold = np.log(self.threshold) - np.log(1. - self.threshold)

By inverting this transformation of the threshold, we recover the activation function itself:

\[\mathrm{Sigmoid}(x) = \frac{1}{1 + e^{-x}}.\]

This is known as the logistic sigmoid.
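The equivalence that this line relies on can be checked numerically. The following is a minimal sketch rather than the repository’s code: the helper names `sigmoid` and `logit`, the threshold value and the sample logits are all assumptions made for illustration.

```python
import math

def sigmoid(x: float) -> float:
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-x))

def logit(t: float) -> float:
    """Inverse of the sigmoid, written in the same form as the quoted line."""
    return math.log(t) - math.log(1.0 - t)

threshold = 0.2                          # illustrative value (assumption)
logits = [-3.0, -1.5, -1.0, 0.0, 2.5]    # made-up outputs before the sigmoid

# Thresholding the probabilities ...
occupied_prob = [sigmoid(x) >= threshold for x in logits]
# ... gives the same decisions as thresholding the raw logits at logit(threshold).
occupied_logit = [x >= logit(threshold) for x in logits]
```

Because the sigmoid is strictly increasing, both lists always agree, so the activation can be skipped when only the thresholded decision is needed.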

3.1.3 Conditional Batch Normalization (CBN) Layer

Let \((p_i)_{i \in [1:T]}\) be the vector of the input points, as shown in Figure 1. With this indexation in mind, let us define a CBN layer [2]:

Input

\(c\), which is the output from the task-specific encoder;
The output from the last layer: \(\left(f_{in}^i\right)_{i \in [1:T]}\), in which \(f_{in}^i \in \mathbb{R}^{256}\);

Output A vector \(\left(f_{out}^i\right)_{i \in [1:T]}\), with \(f_{out}^i \in \mathbb{R}^{256}\).

Evaluation Compute the first two moments of \(\left(f_{in}^i\right)_{i \in [1:T]}\) over all \(i\in[1:T]\):

\[ \mu = \mathrm{E}\left[f_{in}^i\right] \text{ and}\]

\[ \sigma^2 = \mathrm{Var}\left[f_{in}^i\right];\]

then use two fully-connected layers to obtain 256-dim vectors \(\beta(c)\) and \(\gamma(c)\) to evaluate the final output

\[f_{out}^i = \gamma(c) \frac{f_{in}^i - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta(c),\]

in which \(\epsilon = 10^{-5}\) is a constant added for numerical stability.

Since the sum between scalars and vectors is already implicitly defined in the denominator, it’s important to highlight (as in the original article [2]) that the multiplication by \(\gamma\) is an elementwise (not inner) product.

Observation PyTorch’s BatchNorm1d class [3], used to compute \(\frac{f_{in}^i - \mu}{\sqrt{\sigma^2 + \epsilon}}\) (see im2mesh/layers.py), keeps running estimates of the first two moments. These estimates are then used for normalization during evaluation.
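The CBN evaluation can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the moments are computed per feature over the \(T\) points of a single shape, batch statistics are always used (no running estimates), and the names `linear` and `conditional_batch_norm` as well as the toy sizes are hypothetical, not taken from the repository.

```python
import math
import random

def linear(weight, bias, x):
    """Fully-connected layer; weight has shape (out, in)."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def conditional_batch_norm(f_in, c, gamma_fc, beta_fc, eps=1e-5):
    """Apply CBN to T feature vectors f_in, conditioned on the code c.

    gamma_fc and beta_fc are (weight, bias) pairs of the two
    fully-connected layers that produce gamma(c) and beta(c).
    """
    n_points, n_features = len(f_in), len(f_in[0])
    gamma = linear(*gamma_fc, c)
    beta = linear(*beta_fc, c)
    f_out = [[0.0] * n_features for _ in range(n_points)]
    for d in range(n_features):
        column = [f[d] for f in f_in]
        mu = sum(column) / n_points
        var = sum((v - mu) ** 2 for v in column) / n_points  # biased estimate
        scale = gamma[d] / math.sqrt(var + eps)               # elementwise in d
        for i in range(n_points):
            f_out[i][d] = scale * (column[i] - mu) + beta[d]
    return f_out

# Toy sizes for readability (the text uses 256 features per point).
rng = random.Random(0)
T, D, C = 16, 4, 3
f_in = [[rng.gauss(2.0, 3.0) for _ in range(D)] for _ in range(T)]
c = [rng.gauss(0.0, 1.0) for _ in range(C)]
# Zero weights give gamma(c) = 1 and beta(c) = 0, i.e. plain normalization.
gamma_fc = ([[0.0] * C for _ in range(D)], [1.0] * D)
beta_fc = ([[0.0] * C for _ in range(D)], [0.0] * D)
f_out = conditional_batch_norm(f_in, c, gamma_fc, beta_fc)
```

With \(\gamma(c) = 1\) and \(\beta(c) = 0\), each output feature has approximately zero mean and unit variance over the points, which is a quick sanity check for the normalization step.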

3.1.4 ONet ResNet-block

We’ll now describe a single ONet ResNet-block (see im2mesh/layers.py - class CResnetBlockConv1d) as the composition of previously defined components, in the order of application:

CBN layer;
ReLU activation function;
Fully-connected layer;
CBN layer;
ReLU activation function;
Fully-connected layer;

To get the output of the ONet ResNet-block, we then add the input of step 1 to the output of step 6.

3.1.5 Pre-processing Shapes

Definition (Occupancy function) Given a solid shape \(S \subset \mathbb{R}^3\), the occupancy function \(o : \mathbb{R}^3 \to \{0, 1\}\) is defined as \[ o(p) = \begin{cases} 0 && \text{if } p\notin S \\ 1 && \text{if } p\in S \end{cases}. \] In other words, the occupancy function is the characteristic function of the set of the solid shape.

All 4 experiments used the ShapeNet [4] database for training, which is a dataset composed of annotated CAD meshes. To extract the data from each of the shapes, the following procedure was used:

Make sure the mesh is watertight with code provided by Stutz et al. [5];
Normalize the shape such that its bounding box is centered at the origin and that the biggest side of the bounding box measures exactly 1;
Using a uniform random distribution, sample 100k points from the new bounding box with 0.05 padding on the sides;
Choose (with repetition) \(K=2048\) points;
For each of the \(K\) chosen points, compute \(o_i = o(p_i)\) and store both \((p_i)_{i \in [1:K]}\) and \((o_i)_{i\in [1:K]}\) in a file;
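Steps 3 to 5 of the procedure above can be sketched as follows. This is a hedged illustration, not the project’s preprocessing script: steps 1 and 2 are skipped, a ball of diameter 1 stands in for a normalized watertight mesh, the padding is assumed to apply to each side of the box, and the names `sample_training_points` and `ball_occupancy` are hypothetical.

```python
import random

def sample_training_points(occupancy, n_candidates=100_000, k=2048,
                           padding=0.05, seed=0):
    """Sample candidates uniformly in the padded box, then keep k of them."""
    rng = random.Random(seed)
    half = 0.5 + padding  # unit bounding box centered at the origin, padded
    candidates = [tuple(rng.uniform(-half, half) for _ in range(3))
                  for _ in range(n_candidates)]
    points = rng.choices(candidates, k=k)          # choice with repetition
    occupancies = [occupancy(p) for p in points]   # o_i = o(p_i)
    return points, occupancies

def ball_occupancy(p):
    """Stand-in shape (assumption): a ball of diameter 1 at the origin."""
    return 1 if sum(x * x for x in p) <= 0.25 else 0

points, occupancies = sample_training_points(ball_occupancy)
```

For a real shape, `ball_occupancy` would be replaced by an inside/outside test against the watertight mesh.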

Observation Contradicting the supplementary material, the source code defines \(K=1024\) (step 4) for the voxel super resolution experiment (see configs/voxels/onet.yaml and configs/voxels/onet_pretrained.yaml).

3.1.6 Training And The Loss Function

With the exception of unconditional mesh generation - which will be explained later - the same loss function was adopted in all experiments:

Definition (Mini-batch loss function) Let \(\mathcal{B}\) be a subset of 64 (see configs/*/onet.yaml) preprocessed shapes from ShapeNet. The mini-batch loss function is defined as

\[ \mathcal{L}_{\mathcal{B}}(\theta) = \frac{1}{|\mathcal{B}|} \sum_{i=1}^{|\mathcal{B}|} \sum_{j=1}^ K \mathcal{L}\left(f_\theta(p_{ij},c_i), o_{ij}\right), \]

in which \(c_i\) is the output from the task-specific encoder for the \(i\)th shape and \(\mathcal{L}\) is a cross-entropy classification loss. This function is explicitly defined neither in the article nor in the supplementary material, which is not a problem for an open source project. These are the 4 relevant lines (see im2mesh/onet/training.py):

logits = self.model.decode(p, z, c, **kwargs).logits
loss_i = F.binary_cross_entropy_with_logits(
    logits, occ, reduction='none')
loss = loss + loss_i.sum(-1).mean()

logits is the tensor \(f_\theta(p_{ij},c_i)\) without the logistic sigmoid and occ is the tensor of the ground truth occupancies \(o_{ij}\). binary_cross_entropy_with_logits is a PyTorch function that computes the following tensor:

\[ l(x, y) = L = \{l_1, \cdots, l_N\}^T \]

in which

\[ l_n = - \Big[ y_n \log \big(\mathrm{Sigmoid}(x_n)\big) + (1-y_n) \log\big(1-\mathrm{Sigmoid}(x_n)\big)\Big]. \]

In other words, it’s a numerically more stable version of the composition of the logistic sigmoid and a binary cross entropy function. Translating this to our naming conventions yields

\[ \mathcal{L}\left(f_\theta(p_{ij},c_i), o_{ij}\right) := -\Big[ o_{ij} \log \big(f_\theta(p_{ij},c_i)\big) + (1-o_{ij}) \log\big(1-f_\theta(p_{ij},c_i)\big)\Big]. \]

Since \(o_{ij} \in \{0, 1\}\), we can also write \(\mathcal{L}\) in the following way:

\[ \mathcal{L}\left(f_\theta(p_{ij},c_i), o_{ij}\right) = - \begin{cases} \log \big(f_\theta(p_{ij},c_i)\big) && \text{if } o_{ij} = 1 \\ \log\big(1-f_\theta(p_{ij},c_i)\big) && \text{if } o_{ij} = 0 \end{cases}. \]

Note that the enhanced numerical stability of binary_cross_entropy_with_logits over the naive composition also explains why the logistic sigmoid isn’t embedded in the architecture itself.

\(\blacksquare\)
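To make the stability argument concrete, here is a small plain-Python sketch that compares the naive composition with one standard stable rewriting, \(\max(x, 0) - xy + \log(1 + e^{-|x|})\). Whether PyTorch uses exactly this form internally is not claimed here; the function names and the toy values are assumptions.

```python
import math

def naive_bce(x, y):
    """Sigmoid followed by binary cross entropy, exactly as in the formula."""
    p = 1.0 / (1.0 + math.exp(-x))
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))

def bce_with_logits(x, y):
    """Algebraically equivalent form that never evaluates log(0)."""
    return max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))

def minibatch_loss(logits, occ):
    """Sum over the K points of each shape, then average over the shapes."""
    per_shape = [sum(bce_with_logits(x, y) for x, y in zip(lx, ly))
                 for lx, ly in zip(logits, occ)]
    return sum(per_shape) / len(per_shape)

# Two toy shapes with K = 2 points each; every logit is 0 (probability 0.5).
loss = minibatch_loss([[0.0, 0.0], [0.0, 0.0]], [[1, 0], [0, 1]])
```

For a logit such as \(x = 1000\) with \(y = 0\), the sigmoid rounds to exactly 1 in floating point, so the naive version ends up evaluating \(\log(0)\), while the stable form simply returns 1000.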

The mini-batch gradient descent is a variation of the standard gradient descent. Instead of computing the partial derivatives of the loss function using the entire set of shapes each time, the database is divided into batches or partitions. Then, for each partition, we just approximate the true gradient and update the neural network parameters \(\theta\) accordingly. This process is repeated until a stopping criterion is triggered.

Observation For all ONet experiments, the stopping criterion is based on the IoU metric.

To compute the gradient approximations and update the parameters, the Adam optimizer [6] was used with a learning rate of \(\eta=10^{-4}\) and no weight decay. PyTorch’s default values for the other hyperparameters were left untouched: \(\beta_1=.9\), \(\beta_2=.999\) and \(\epsilon=10^{-8}\).
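A single Adam update with these hyperparameters can be sketched as follows. This is a plain-Python illustration of the update rule from [6], not PyTorch’s implementation; the function name `adam_step` and the toy parameter and gradient values are assumptions.

```python
import math

def adam_step(theta, grad, m, v, t,
              lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update without weight decay; t is the 1-based step counter."""
    m = [beta1 * mi + (1.0 - beta1) * g for mi, g in zip(m, grad)]
    v = [beta2 * vi + (1.0 - beta2) * g * g for vi, g in zip(v, grad)]
    m_hat = [mi / (1.0 - beta1 ** t) for mi in m]  # bias-corrected 1st moment
    v_hat = [vi / (1.0 - beta2 ** t) for vi in v]  # bias-corrected 2nd moment
    theta = [th - lr * mh / (math.sqrt(vh) + eps)
             for th, mh, vh in zip(theta, m_hat, v_hat)]
    return theta, m, v

theta = [1.0, -2.0]    # toy parameters (assumption)
grad = [0.5, -3.0]     # toy gradient of the mini-batch loss (assumption)
theta, m, v = adam_step(theta, grad, m=[0.0, 0.0], v=[0.0, 0.0], t=1)
```

On the first step the bias-corrected update has magnitude close to \(\eta\) for every parameter, regardless of the scale of its gradient.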

3.2 Single View Image Reconstruction

Before talking about the encoder, we need the definition below.

Definition (ImageNet normalization) Let \(x \in [0,1]^{w \times h \times 3}\) be a colored image, \(\mu_{\mathrm{ImN}} := (.485,\, .456,\, .406)\) and \(\sigma_{\mathrm{ImN}} := (.229,\, .224, \, .225)\). The normalized image is then given by

\[ \hat{x}_{ij} = \frac{x_{ij} - \mu_{\mathrm{ImN}}}{\sigma_{\mathrm{ImN}}}, \] in which the division is elementwise.

\(\blacksquare\)
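The definition translates directly into code. The sketch below uses nested Python lists instead of tensors to stay dependency-free; the function name `imagenet_normalize` and the tiny \(1 \times 2\) image are assumptions made for illustration.

```python
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def imagenet_normalize(image):
    """image: rows of pixels, each pixel an (r, g, b) tuple with values in [0, 1]."""
    return [[tuple((ch - m) / s
                   for ch, m, s in zip(pixel, IMAGENET_MEAN, IMAGENET_STD))
             for pixel in row]
            for row in image]

# A 1 x 2 toy image: one pixel equal to the mean color, one white pixel.
image = [[(0.485, 0.456, 0.406), (1.0, 1.0, 1.0)]]
normalized = imagenet_normalize(image)
```

The pixel equal to the mean color maps to \((0, 0, 0)\), and the subtraction and division are applied channel by channel.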

3.2.1 Image Encoder

**Figure 2** Modified ResNet18 - the image encoder.

Input \(224 \times 224\) image, normalized according to ImageNet standards.

Output A feature vector \(c \in \mathbb{R}^C\), for \(C=256\).

Evaluation The only difference between ResNet18 [7] and the neural network used as the encoder is the last fully connected layer. Instead of mapping the 512-dim features to the ImageNet class scores, the last layer projects them down to a 256-dim vector \(c\). The encoder was pre-trained on the ImageNet dataset.

One eye-catching piece of information is that the authors of ONet (see im2mesh/encoder/conv.py) did implement several sizes of ResNets: 18, 34, 50 and 101. Since this particular encoder was pre-trained, it raises the question of why ResNet18 worked better than its deeper counterparts.

3.3 Point Cloud Completion

3.3.1 Point Cloud Encoder

**Figure 3** Encoder for point cloud completion.

Input M=300 points generated from a mesh taken from ShapeNet, in the following manner:

Make sure the mesh is watertight with code provided by Stutz et al. [5];
Normalize the shape such that its bounding box is centered at the origin and that the biggest side of the bounding box measures exactly 1;
Sample 300 points from the surface of the model;
Apply noise to the points using a Gaussian distribution with zero mean and standard deviation of 0.05 (see im2mesh/data/transforms.py);

Output A feature vector \(c \in \mathbb{R}^{C}\), for \(C=512\).

Description The network consists of 2 fully connected layers (for input and output) and 5 PointNet ResNet-blocks interleaved with pooling+expansion layers, as shown in Figure 3.

Note In the supplementary material there’s no distinction between PointNet ResNet-blocks and ONet ones. In the source code they’re different (see im2mesh/layers.py and ./im2mesh/encoder/pointnet.py).

We can find a legacy definition of the ONet architecture, and in this version both used the same simplified ResNet-block defined below. This might explain the inaccuracies in the PDFs. Without further ado, let us present the definition from the source code:

Definition (PointNet ResNet-block) The class used to represent the PointNet ResNet-blocks is the following (see im2mesh/layers.py and ./im2mesh/encoder/pointnet.py):

# Resnet Blocks
class ResnetBlockFC(nn.Module):
    ''' Fully connected ResNet Block class.

    Args:
        size_in (int): input dimension
        size_out (int): output dimension
        size_h (int): hidden dimension
    '''

For PointNet ResNet-blocks: size_in=1024, size_out=512 and size_h=512. Therefore one block is defined as the following composition, in order of application:

ReLU activation layer (1024-dim);
Fully connected NN - 1024-dim to 512-dim;
ReLU activation layer (512-dim);
Fully connected NN - 512-dim to 512-dim;

Since the input and output dimensions differ, we have an additional FCNN projecting the input for step 1 (1024-dim) to \(x_s\) (512-dim). Hence the final output is the sum of the output from step 4 and \(x_s\).

Note that this block can be described as an ONet ResNet-block without the CBN layers and with different dimensions.

\(\blacksquare\)
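The composition described in this definition can be sketched in plain Python. This is an illustration of the data flow only, not the ResnetBlockFC class itself: the helper names, the random initialization and the toy sizes (8, 4, 4 instead of 1024, 512, 512) are assumptions.

```python
import random

def relu(x):
    return [max(0.0, v) for v in x]

def make_linear(n_in, n_out, rng):
    """Random (weight, bias) pair for a fully connected layer (toy init)."""
    weight = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weight, [0.0] * n_out

def linear(layer, x):
    weight, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def resnet_block_fc(x, fc_0, fc_1, shortcut=None):
    net = linear(fc_0, relu(x))       # steps 1 and 2
    dx = linear(fc_1, relu(net))      # steps 3 and 4
    # Project the input only when input and output dimensions differ.
    x_s = linear(shortcut, x) if shortcut is not None else x
    return [a + b for a, b in zip(x_s, dx)]

rng = random.Random(0)
size_in, size_h, size_out = 8, 4, 4   # toy stand-ins for 1024, 512, 512
fc_0 = make_linear(size_in, size_h, rng)
fc_1 = make_linear(size_h, size_out, rng)
shortcut = make_linear(size_in, size_out, rng)
x = [rng.gauss(0.0, 1.0) for _ in range(size_in)]
y = resnet_block_fc(x, fc_0, fc_1, shortcut)
```

When size_in equals size_out, the `shortcut` argument can be omitted and the input itself is added to the output of step 4.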

3.4 Voxel Super Resolution

3.4.1 Voxel Encoder

**Figure 4** Encoder for voxel super resolution.

Input A grid of \(32^3\) voxels. More specifically, voxels generated from (watertight) ShapeNet meshes with the algorithm below:

Normalize and make the shape watertight using the methods described previously;
Mark all voxels that intersect the model’s surface as occupied;
For each of the remaining voxels:
- Choose 1 random point inside the voxel;
- If the point lies inside the mesh mark the corresponding voxel as occupied;

Output A feature vector \(c \in \mathbb{R}^C\), for \(C=256\).

Evaluation The input passes through 5 3D convolution layers, and a fully connected layer to project the output to the space \(\mathbb{R}^{256}\). All convolution layers use zero-padding with size 1 and \(3\times 3\times 3\) filters (see im2mesh/encoder/voxels.py). Stride is implicitly defined in Figure 4.

3.5 Unconditional Mesh Generation

We’ll get to the implementation shortly, but first I’d like to present a general idea of the process to the reader. Although the authors did use the word unsupervised in the article, it wasn’t the most descriptive choice - let me explain why.

Suppose that we want to generate shapes in the category car. We would proceed as follows:

Take the annotated meshes from ShapeNet and extract the subset of shapes that correspond to our chosen category;
Train a Variational Autoencoder using this subset;
Sample the latent space and use the decoder to generate new shapes;

A more appropriate label would be semi-supervised, since we’re both using the labels to extract subsets in a supervised fashion and learning about the distributions of these subsets in an unsupervised manner.

3.5.1 Variational Encoder - Abstract Definition

Preceding the low level definition, it’s a good idea to keep in mind a high level model of what we’re trying to achieve. The busy reader can skip this section by jumping directly to the definition of encoder latent.

Traditional learning-based autoencoders are usually defined as two neural networks:

\[ E_\psi: X \to Z \]

\[ D_\theta: Z \to X \]

One is called the encoder, which takes the input \(x \in X\) and maps it to a dimensionally smaller latent space \(Z\); the other is called the decoder, which performs the inverse, i.e., it tries to reconstruct the original input from a point in \(Z\).

Given a set of inputs \(B = \{x_1, \cdots ,x_N\} \subset X\), we can define a loss function

\[ \mathcal{L}_B(\theta, \psi) = \frac{1}{N} \sum_{i=1}^N \Big[ \mathcal{L}\big(D_\theta(E_\psi(x_i)), x_i\big) \Big], \]

that just compares each input \(x_i\) to its encoded and decoded counterpart \(\tilde x_i = D_\theta(E_\psi(x_i))\). Using \(\mathcal{L}_B\), we can just train the two neural networks to optimize the compression, which is an unsupervised training scheme because no labeling is required for the inputs \(B\).

To generate new plausible entries for the input space \(X\), a naive approach would be to simply take random points \(z \in Z\) or in an open cover \(\tilde Z \subset Z\) for \(E_\psi(B)\) and store \(D_\theta(z)\) for those random points. Unfortunately this doesn’t work for traditional learning-based autoencoders because \(D_\theta\), in general, isn’t a stable function, i.e.,

\[ \newcommand{\notimplies}{\;\not\!\!\!\implies} \big| z - E_\psi(x_i) \big|\text{ is small} \notimplies \mathcal{L}(D_\theta(z), x_i) \text{ is small}. \]

To solve this problem and regularize the latent space, we can modify our encoder to produce a probability distribution. Instead of mapping \(x_0 \in X\) to a single point \(z_0 \in Z\) as a traditional encoder, a variational encoder does the following

\[ g_\psi(x_0) = q_\psi(z | x_0) \in P(Z), \]

in which \(P(Z)\) is the space of probability measures on \(Z\).

To use this new encoder, we can define a new loss function:

\[ \mathcal{L}'_B(\theta, \psi) = \frac{1}{N} \sum_{i=1}^N \Big[ \mathcal{L}\big(D_\theta(\tilde z_i), x_i\big) + \mathcal{L}_P\big(g_\psi(x_i), \mathcal{N}(0,1)\big) \Big], \]

in which

\[ \tilde z_i\text{ is a single random sample from } g_\psi(x_i) \]

\[ \mathcal{L}_P \text{ is a loss function for two probabilities in }P(Z) \]

Naturally, one can tweak the stochastic loss function \(\mathcal{L}'_B\) by using different sampling schemes or different arguments for \(\mathcal{L}_P\), but this example encapsulates the basic idea of a variational autoencoder - VAE for short.

Since these changes force the latent space to be more regular, one can generate new plausible outputs by taking random \(z \in Z\) according to the standard normal distribution or by performing interpolations in the latent space, for example.

Definition (Encoder latent) Let \(p_i \in \mathbb{R}^3\) be a sequence of \(K = 2048\) positions in 3D space and \(o_i \in \{0, 1\}\) be their corresponding ground truth occupancies. The encoder

\[ g_\psi : (p, o) \mapsto (\mu_\psi, \sigma_\psi) \]

takes the points and their occupancies and maps them to a pair of values in \(\mathbb{R}^L\), for \(L=128\), that represent respectively the average and the standard deviation of a Gaussian distribution \(q_\psi(z |(p_i, o_i)_{i=1:K}) =\mathcal{N}(\mu_\psi, \sigma_\psi)\) in the latent space \(\mathbb{R}^L\).

Just as a side note, the labeling encoder latent isn’t a hallucination: it’s the name given in the source code (see im2mesh/onet/models/encoder_latent.py) and in the configuration files (see configs/unconditional/*.yaml).

\(\blacksquare\)

3.5.2 Variational Loss Function and Training

The variational version of the loss function is given by:

\[ \mathcal{L}^{\text{gen}}_{\mathcal{B}}(\theta, \psi) = \frac{1}{|\mathcal{B}|} \sum_{i=1}^{|\mathcal{B}|} \left[ \sum_{j=1}^ K \mathcal{L}\left(f_\theta(p_{ij},\tilde z_i), o_{ij}\right) + \mathrm{KL} \left(\mathcal{N}(\mu_\psi, \sigma_\psi)\,\|\, \mathcal{N}(0,1) \right) \right] \]

in which \(\mathcal{L}\) is the loss function defined earlier, \(\mathrm{KL}\) denotes the KL-divergence and \(\tilde z_i \in \mathbb{R}^L\) is a single random sample from the probability distribution given by the encoder: \(\mathcal{N}(\mu_\psi, \sigma_\psi)\).

More info about the KL-divergence and its interpretations can be found here. For us, the important thing is that it’s a loss function between probability distributions whose formula simplifies to

\[ \mathrm{KL} \left(\mathcal{N}(\mu, \sigma) \,\|\, \mathcal{N}(0,1) \right) = \frac{1}{2} \sum_{i=1}^L \left( \sigma_i^2 + \mu_i^2 - \ln(\sigma_i^2) - 1 \right). \]
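The closed form is straightforward to evaluate. The following sketch assumes a diagonal Gaussian given by per-coordinate means and standard deviations; the function name `kl_to_standard_normal` and the test values are hypothetical.

```python
import math

def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ) using the closed form above."""
    return 0.5 * sum(s * s + m * m - math.log(s * s) - 1.0
                     for m, s in zip(mu, sigma))

kl_zero = kl_to_standard_normal([0.0, 0.0], [1.0, 1.0])     # identical distributions
kl_shifted = kl_to_standard_normal([1.0, 0.0], [1.0, 1.0])  # mean shifted by 1
```

The divergence vanishes only when the encoder outputs the prior itself, and a unit shift of one mean coordinate costs exactly \(1/2\), which shows how the term pulls the encoded distributions towards \(\mathcal{N}(0,1)\).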

The same gradient descent method and batch size seen in the Common section were employed.

3.5.3 Variational Decoder - Architecture

As we said before, the architecture for this decoder and ONet’s one aren’t the same. In the configuration files (see configs/unconditional/*.yaml) it’s clear that the simple decoder was used. The dictionary for ONet decoders (see im2mesh/onet/models/__init__.py) defines the following:

# Decoder dictionary
decoder_dict = {
    'simple': decoder.Decoder,
    'cbatchnorm': decoder.DecoderCBatchNorm,
    'cbatchnorm2': decoder.DecoderCBatchNorm2,
    'batchnorm': decoder.DecoderBatchNorm,
    'cbatchnorm_noresnet': decoder.DecoderCBatchNormNoResnet,
}

Hence, the class that defines the simple decoder is decoder.Decoder.

Another important observation is that the optional entry model:decoder_kwargs isn’t defined in the unconditional configuration files, which means that the default values for the constructor of decoder.Decoder were used. With these pieces of information, we can now define the architecture of the variational decoder:

Input A point \(z \in Z = \mathbb{R}^L\), for \(L=128\), and a batch of \(T=2048\) points \(p_i \in \mathbb{R}^3\).

Output Exactly the same as the ONet, i.e., the approximated occupancies \[ f_\theta(p_1, z), \cdots, f_\theta(p_T, z) \in [0,1]. \]

Description The network can be described as the composition of the following components, in order of application:

Fully connected NN mapping both \(p_i\) and \(z\) to a 128-dim vector for each of the T points;
5 VarDec ResNet-Blocks;
ReLU activation layer (128-dim);
Fully connected NN - 128-dim to 1-dim;

Definition (VarDec ResNet-block) It uses the same class as the PointNet ResNet-block (ResnetBlockFC), but with different dimensions: size_in = size_out = size_h = 128. Note that size_in == size_out implies that we don’t need a FCNN mapping the input for step 1 to \(x_s\).

Therefore, one block is defined as the following composition, in order of application:

ReLU activation layer (128-dim);
Fully connected NN - 128-dim to 128-dim;
ReLU activation layer (128-dim);
Fully connected NN - 128-dim to 128-dim;

The final output is given by the sum of the output from step 4 and the input for step 1.

\(\blacksquare\)

3.5.4 Variational Encoder - Architecture

Since this layer is a bit more extensive and the description in the supplementary material is incomplete, here’s a picture describing the architecture (see im2mesh/onet/models/encoder_latent.py):

**Figure 5** Variational encoder’s architecture.

Input A batch of \(T=2048\) points \(p_i \in \mathbb{R}^3\) and their respective ground truth occupancies \(o_i\). See Pre-Processing Shapes for more details about the process of acquiring the data.

Output Two vectors \(\mu, s=\log(\sigma) \in \mathbb{R}^L\), for \(L = 128\).

Description The network can be described as the composition of the following components, in order of application:

A fully connected NN mapping the 4-dim vector \((p_i, o_i)\) to a 128-dim vector, for each \(i \in [1:T]\);
ReLU activation layer (128-dim);
Fully connected NN - 128-dim to 128-dim;
Maxpool (128-dim) expanded and concatenated w/ the output of step 3 - resulting in a 256-dim output;
ReLU activation layer (256-dim);
Fully connected NN - 256-dim to 128-dim;
Maxpool expanded and concatenated w/ the output of step 6 - resulting in a 256-dim output;
ReLU activation layer (256-dim);
Fully connected NN - 256-dim to 128-dim;
Maxpool (128-dim);
\(\mu\) = the output from a fully connected NN (128-dim to 128-dim) applied to the output of step 10;
\(s\) = the output from a fully connected NN (128-dim to 128-dim) applied to the output of step 10;
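The "maxpool expanded and concatenated" operation of steps 4 and 7 may be the least obvious part of this list, so here is a minimal plain-Python sketch of it. The function name `pool_and_concat`, the toy features and the concatenation order (per-point features first, pooled features second) are assumptions.

```python
def pool_and_concat(features):
    """features: T vectors of dimension D -> T vectors of dimension 2 * D."""
    n_features = len(features[0])
    # Maxpool over the T points, independently for each feature.
    pooled = [max(f[d] for f in features) for d in range(n_features)]
    # Expand the pooled vector to every point and concatenate it.
    return [list(f) + pooled for f in features]

features = [[1.0, -2.0], [0.5, 3.0], [-1.0, 0.0]]  # T = 3 points, D = 2
expanded = pool_and_concat(features)
```

Every point thus carries both its own features and a global summary of the whole point set, which is why the dimension doubles from 128 to 256 before the next fully connected layer.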

3.5.5 Sampling from the Latent Space

After training the VAE for a particular category, one can generate new objects by sampling from the prior probability distribution of the latent space \(Z = \mathbb{R}^L\), which is by construction \(\mathcal{N}(0, 1) \in P(\mathbb{R}^L)\).

**Figure 6** Qualitative results for unconditional mesh generation for the categories: car, airplane, sofa and chair.

4 References

[1] Occupancy Networks: Learning 3D Reconstruction in Function Space, L. Mescheder et al. (2019)

[2] Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, S. Ioffe, C. Szegedy (2015)

[3] torch.nn.modules.batchnorm

[4] ShapeNet: An Information-Rich 3D Model Repository, A. X. Chang et al. (2015)

[5] Learning 3D Shape Completion under Weak Supervision, D. Stutz, A. Geiger (2018)

[6] Adam: A Method for Stochastic Optimization, D. P. Kingma, Jimmy Ba (2014)

[7] Deep Residual Learning for Image Recognition, K. He et al. (2015)