1 Introduction
1.1 NonNegative Matrix Factorization
There has been a considerable increase in interest in NMF since the publication of a seminal work by Lee & Seung (1999) (although earlier Paatero & Tapper (1994)
had studied this field) in part because NMF tends to produce a sparse and parts based representation of the data. This sparse and parts based representation is in contrast to other dimensionality reduction techniques such as principal component analysis which tends to produce a holistic representation. The parts should represent features of the data, therefore NMF can produce a representation of the data by the addition of extracted features. This representation may be considerably more interpretable than more holistic approaches.
Consider a data matrix with dimensions and data points which has only nonnegative elements. If we define two matrices, also with only nonnegative elements: and , then nonnegative matrix factorisation (NMF) can reduce the dimensionality of V through the approximation where, generally, .
The columns of W make up the new basis directions of the dimensions we are projecting onto. Each column of H represents the coefficients of each data point in this new subspace. There are a range of algorithms to conduct NMF, most of them involving minimising an objective function such as
1.2 Using Autoencoders for NMF
Several authors have studied the addition of extra constaints to an autoencoder to perform NMF (Lemme et al., 2012; Ayinde et al., 2016; HosseiniAsl et al., 2016; Smaragdis & Venkataramani, 2017). These methods show some potential advantages over standard NMF including the implicit creation of the H matrix and straightforward adaptation to online techniques.
In Figure 1 we show a representation of a onehidden layer autoencoder for performing NMF. We feed in a datapoint , the latent representation is produced by where is an elementwise (linear or nonlinear) function with nonnegative ouputs, is the representation in the latent space and contains the weights of the first layer. It is also possible to add additional layers to make a deeper network with multiple hidden layers before arriving at the constriction layer which produces the h
output. The final set of weights must be kept nonnegative with an identity activation function such that
where . The final weights of the autoencoder then can be interpreted as the dimensions of the new subspace with the elements of h as the coefficients in that subspace. The network is then trained so that .1.3 Variational Autoencoders
NMF and autoencoders both produce a lower dimensional representation of some input data. However, neither produce a probability model, just a deterministic mapping. It is also not obvious how to generate new data from these lower dimensional spaces. Generative models solve both of these problems, enabling a probability distribution to be found linking the input and latent spaces whilst enabling new data to be created.
One of the most popular recent generative model is the variational autoencoder (VAE) (Kingma & Welling, 2013; Rezende et al., 2014)
. A VAE is a probabilistic model which utilises the autoencoder framework of a neural network to find the probabilistic mappings from the input to the latent layers and on to the output layer. Unlike a standard autoencoder the VAE finds a distribution between the latent and seen variables, which also enables the production of new data by sampling from the latent distributions.
1.4 Probabilistic NonNegative Matrix Factorisation
Several authors have presented versions of NMF with a probabilistic element (Paisley et al., 2014; Cegmil, 2008; Schmidt et al., 2009)
which involve the use of various sampling techniques to estimate posteriors. Other work has been done using hidden Markov models to produce a probabilistic NMF
(Mohammadiha et al., 2013) and utilising probabilistic NMF to perform topic modelling (Luo et al., 2017). However, to our knowledge no one has utilised the ideas behind the VAE to perform NMF.Although probabilistic methods for NMF have been developed, even a full Bayesian framework still faces the problem that, for the vast majority of problems where NMF is used, we have little idea about what is the appropriate prior. We would therefore be forced to do model selection or introduce hyperparameters and perform inference (maximum likelihood or Bayesian) based on the evidence. However, as the posterior in such cases is unlikely to be analytic this is likely to involve highly time consuming Monte Carlo. In doing so we would expect to get results close to those we obtain using PAENMF. However, for machine learning algorithms to be of value they must be practical. Our approach, following a minimum description length methodology, provides a principled method for achieving automatic regularisation. Because it fits within the framework of deep learning it is relatively straightforward and quick to implement (using software such as Keras or PyTorch with builtin automatic differentiation, fast gradient descent algorithms, and GPU support). In addition, our approach provides a considerable degree of flexibility (e.g. in continuous updating or including exogenous data), which we believe might be much more complicated to achieve in a fully probabilistic approach.
The next section lays out the key alterations needed to a VAE to allow it to perform NMF and is our main contribution in this paper.
2 PaeNmf
The model proposed in this paper provides advantages both to VAEs and to NMF. For VAEs, by forcing a nonnegative latent space we inherit many of the beneficial properties of NMF; namely we find representations that tend to be sparse and often capture a parts based representation of the objects being represented. For NMF we introduce a probabilisitic representation of the vectors
h which models the uncertainty in the parameters of the model due to the limited data.2.1 Ideas Behind PAENMF
Kingma & Welling (2013)
proposed the VAE, their aim is to perform inference where the latent variables have intractable posteriors and the datasets are too large to easily manage. Two of their contributions are showing that the “reparameterization trick” allows for use of standard stochastic gradient descent methods through the autoencoder and that the intractable posterior can be estimated.
In a standard autoencoder we take some data point and run it through a neural network to produce a latent variable where and is some nonlinear elementwise function produced by the neural network. This is the encoding part of the network. We then run h through another neural network to produce an output , which is the decoding part of the network. The hope is that due to the latent variables will contain real structure from the data.
A standard VAE differs in that instead of encoding a deterministic variable we find a mean,
, and variance,
of a Gaussian distribution. If we want to generate new data we can then sample from this distribution to get
and then run through the decoding part of the network to produce our new generated data.PAENMF utilises the same objective function as a standard VAE, so for each datapoint, we are minimising:
(1)  
(2) 
where is the KL divergence, and represent the parameters of the encoder and decoder, respectively, v is an input vector with v̂ the reconstructed vector, created by sampling from and running the h produced through the decoding part of the network. The first term represents the reconstruction error between the original and recreated datapoint, which we assume to be approximately Gaussian. The second term is the KL divergence between our prior expectation, , of the distribution of h and the representation created by the encoding part of our neural network, . We can interpret this as a regularisation term, it will prevent much of the probability density being located far from the origin, assuming we select a sensible prior. Another way of looking at it is that it will prevent the distributions for each datapoint being very different from one another as they are forced to remain close to the prior distribution.
This objective function can also be interpreted as the amount of information needed to communicate the data (Kingma et al., 2014). We can think of vectors in latent space as code words. Thus to communicate our data, we can send a code word and an error term (as the errors fall in a more concentrated distribution than the original message they can be communicated more accurately). The logprobability term can be interpreted as the message length for the errors. By associating a probability with the code words we reduce the length of the message needed to communicate the latent variables (intuitively we can think of sending the latent variables with a smaller number of significant figures). The KL divergence (or relative entropy) measures the length of code to communicate a latent variable with probability distribution . Thus by minimising the objective function we learn a model that extracts all the useful information from the data (i.e. information that allows the data to be compressed), but will not overfit the data. This interpretation shares considerable similarity with previous work which used a model of the minimum description length within the objective function when performing matrix factorisation (Squires et al., 2019).
2.2 Structure of the PAENMF
The structure of our PAENMF is given in Figure 2. The input is fed through an encoding neural network which produces two vectors k and , which are the parameters of the distribution. The latent vector for that datapoint, h
, is then created by sampling from that distribution. However, this causes a problem when training the network because backpropagation requires the ability to differentiate through the entire network. In other words, during backpropagation we need to be able to find
and , where the refers to a dimension. If we sample from the distribution we cannot perform these derivatives. In variational autoencoders this problem is removed by the “reparameterization trick” (Kingma & Welling, 2013) which pushes the stochasticity to an input node, which does not need to be differentiated through, rather than in the middle of the network. A standard variational autoencoder uses Gaussian distributions. For a univariate Gaussian the trick is to turn the latent variable, , which cannot be differentiated through, into where . The parameters and are both deterministic and can be differentiated through and the stochasticity is added from outside.We cannot use the Gaussian distribution because we need our
terms to be nonnegative to fulfill the requirement of NMF. There are several choices for probability distributions which produce only nonnegative samples, we use the Weibull distribution for reasons detailed later in this section. We can sample from the Weibull distribution using its inverse cumulative distribution and the input of a uniform distribution at an input node. In Figure
2 we display this method of imposing stochasticity from an input node through . Similarly to using a standard autoencoder for performing NMF the same restrictions, such as being forced to stay nonnegative, apply to the PAENMF.2.3 Details of the PAENMF
In this paper we have utilised the Weibull distribution which has a probability density function (PDF) of
with parameters and . The Weibull distribution satisfies our requirements: that the PDF is zero below , falls towards 0 as becomes large and is flexible enough to enable various shapes of distribution. For each data point there will be Weibull distributions generated, one for each of the subspace dimensions.
To perform PAENMF we need an analytical form of the KL divergence, so we can differentiate the objective function, and a way of extracting samples using some outside form of stochasticity. The KL divergence between two Weibull distribution is given by (Bauckhage, 2014)
(3) 
where is the EulerMascheroni constant and is the gamma function.
In other situations using NMF with probability distributions the gamma distribution has been used
(Squires et al., 2017). The reason that we have chosen the Weibull distribution is that while it is possible to apply variations on the reparameterization trick to the gamma function (Figurnov et al., 2018) it is simpler to use the Weibull distribution and use inverse transform sampling. To sample from the Weibull distribution all we need is the inverse cumulative function,. We generate a uniform random variable
at an input node and then sample from the Weibull distribution with and by . So, refering to Figure 2, we now have and each of the dimensions of z are found by .It is worth considering exactly how and where PAENMF differs from standard NMF. In Figure 1 we see that, in terms of NMF, it is fairly unimportant what occurs before the constriction layer in terms of the outputs of interest (W and H). The aim of the design before that part is to allow the network to find the best possible representation that gets the lowest value of the objective function. There are effectively two parts which make this a probabilistic method: the choice of objective function and the fact that we sample from the distribution. Without those two parts the link between the distribution parameters and h would just be a nonlinear function which might do better or worse than any other choice but would not make this a probabilistic method.
2.4 Methodology
We now have the structure of our network in Figure 2 and the distribution (Weibull) that we are using. The basic flow through the network with one hidden layer, for an input datapoint v, looks like:
(4) 
where and are nonlinear functions that work elementwise. The inverse cumulative function, , also works elementwise.
There are a range of choices to make for this network, to demonstrate the value of our technique we have kept the choices as simple as possible. We use rectified linear units (ReLU) as the activation function as these are probably the most popular current method
Nair & Hinton (2010). To keep the network simple we have only used one hidden layer. We update the weights and biases using gradient descent. We keep the values nonnegative by setting any negative terms to zero after updating the weights. We did experiment with using multiplicative updates (Lee & Seung, 1999) for the weights but found no clear improvement. We use the whole dataset in each batch, the same as is used in standard NMF. We initialise the weights using W found from standard NMF, with added noise. We use a prior of and . We calculate the subspace size individually for each dataset using the method of Squires et al. (2017). The learning rates are chosen by trial and error.To demonstrate the use of our technique we have tested it on three heterogeneous datasets, displayed in Table 1. The faces dataset, are a group of greyscale images of faces. The Genes dataset is the 5000 gene expressions of 38 samples from a leukaemia dataset and the FTSE 100 data is the share price of 94 different companies over 1305 days.
Name  Type  m  n  Source 
Faces  Image  361  2429  http://cbcl.mit.edu/softwaredatasets 
/FaceData2.html  
Genes  Biological  5000  38  http://www.broadinstitute.org/cgibin 
/cancer/datasets.cgi  
FTSE 100  Financial  1305  94  Bloomberg information terminal 

3 Results and Discussion
First we demonstrate that PAENMF will produce a reasonable recreation of the original data. We would expect the accuracy of the reconstruction to be worse than for standard NMF because we impose the KL divergence term and we sample from the distribution rather than taking the latent outputs deterministically. These two impositions should help to reduce overfitting which results in the higher reconstruction error.
In Figure 3 we show recreations of the original data for the three datasets from Table 1. The faces plots show five original images chosen at random above the recreated versions (with ). The bottom left plot shows nine stocks from the FTSE 100 with the original data as a black dashed line and the recreated results as a red dotted line (). The results here follow the trend well and appear to be ignoring some of the noise in the data. In the bottom right plot we show 1000 elements of the recreated matrix versus the equivalent elements (). There is significant noise in this data, but there is also a clear trend. The black line shows what the results would be if they were recreated perfectly.
Secondly, we want to look at the matrices (the weights of the final layer). In NMF the columns of W represent the dimensions of the new subspace we are projecting onto. In many circumstances we hope these will produce interpretable results, which is one of the key features of NMF. In Figure 4 we demonstrate the matrices for the faces dataset, where each of the 81 columns of has been converted into a image (left) and the FTSE 100 dataset (right) where the nine columns are shown. We can see that the weights of the final layer does do what we expect in that these results are similar to those found using standard NMF (Lee & Seung, 1999). The faces dataset produces a representation which can be considered to be parts of a face. The FTSE 100 dataset can be viewed as showing certain trends in the stock market.
We now want to consider empirically the effect of the sampling and KL divergence term. In Figure 5 we show the distributions of the latent space of one randomly chosen datapoint from the faces dataset. We use so that the distributions are easier to inspect. The left and right set of plots show results with and without the KL divergence term respectively. The black and blue dashed lines show samples extracted deterministically from the median of the distributions and sampled from the distribution respectively. The inclusion of the KL divergence term has several effects. First, it reduces the scale of the distributions so that the values assigned to the distributions are significantly lower in the left hand plot. This has the effect of making the distributions closer together in space. The imposition of randomness into the PAENMF through the uniform random variable has the effect of reducing the variance of the distributions. When there is a KL divergence term we can see that the distributions follow fairly close to the prior. However, once the stochasticity is added this is no longer tenable as the wide spread of data we are sampling from becomes harder to learn effectively from. When there is no KL divergence term the effect is to tighten up the distribution so that the spread is as small as possible. This then means we are approaching a point, which in fact would return us towards standard NMF. The requirement for both the KL divergence term and the stochasticity means that we do get a proper distribution and the results are prevented from simply reducing towards a standard NMF formulation.
A potentially valuable effect of PAENMF is that we would expect to see similar distributions for similar datapoints. In Figure 6 we can see the distributions of the h vectors for two pairs of faces, each pair is similar to the other one in the pair and very different from the other pair. Next to them we plot the distributions. The distributions for the two faces on the left are the black dashed lines and the distributions for the two faces on the right are given by the red dotted lines. There is very clear similarity in distribution between the similar faces, and significant differences between the dissimilar pairs.
Finally, we want to discuss the generation of new data using this model. We show results for the faces and FTSE 100 datasets in Figure 7. For the faces we use a low value and show four random faces (left column), with the median being the h drawn from the middle of the inverse cumulative distribution. The three image plots on the right are then drawn randomly from the distributions. Four stocks from the FTSE 100 data are shown on the right of the figure. The solid lines are the original data with the dotted lines representing sampled data. This ability to directly sample from the distributions and produce plausible new datapoints is one of the most useful features of using PAENMF over standard NMF.
4 Summary
We have demonstrated a novel method of probabilistic NMF using a variational autoencoder. This model extends NMF by providing us with uncertainties on our h vectors, a principled way of providing regularisation and allowing us to sample new data. The advantage over a VAE is that we should see improved interpretability due to the sparse and parts based nature of the representation formed, especially in that we can interpret the layer as the dimensions of the projected subspace.
Our method extracts the useful information from the data without overfitting due to the combination of the logprobability with the KL divergence. While the logprobability works to minimise the error, the KL divergence term acts to prevent the model overfitting by forcing the distribution
to remain close to its prior distribution. This provides a principled regularisation mechanism. Other alternatives for probabilistic NMF still require the choice of an appropriate prior. This could be done using model selection through the evidence term, but this would require computing the full posterior, which is likely to require techniques such as Markov Chain Monte Carlo and could be prohibitively time consuming.
Another advantage of this approach is that as the objective function measures the description length, we could use it to select the appropriate size for the latent space. This would provide an alternative to other description length methods (Squires et al., 2017). However, as the distribution provides a selfregularisation the size of the latent space is likely to be less critical.
References

Ayinde et al. (2016)
Babajide O Ayinde, Ehsan HosseiniAsl, and Jacek M Zurada.
Visualizing and understanding nonnegativity constrained sparse
autoencoder in deep learning.
In
International Conference on Artificial Intelligence and Soft Computing
, pp. 3–14. Springer, 2016.  Bauckhage (2014) Christian Bauckhage. Computing the kullbackleibler divergence between two generalized gamma distributions. arXiv preprint arXiv:1401.6853, 2014.
 Cegmil (2008) AT Cegmil. Bayesian inference in nonnegative matrix factorization models. Computational Intelligence and Neuroscience, 2008.
 Figurnov et al. (2018) Michael Figurnov, Shakir Mohamed, and Andriy Mnih. Implicit reparameterization gradients. arXiv preprint arXiv:1805.08498, 2018.
 HosseiniAsl et al. (2016) Ehsan HosseiniAsl, Jacek M Zurada, and Olfa Nasraoui. Deep learning of partbased representation of data using sparse autoencoders with nonnegativity constraints. IEEE transactions on neural networks and learning systems, 27(12):2486–2498, 2016.
 Kingma & Welling (2013) Diederik P Kingma and Max Welling. Autoencoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
 Kingma et al. (2014) Diederik P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. Semisupervised learning with deep generative models. In Advances in Neural Information Processing Systems, pp. 3581–3589, 2014.
 Lee & Seung (1999) Daniel D Lee and H Sebastian Seung. Learning the parts of objects by nonnegative matrix factorization. Nature, 401(6755):788–791, 1999.
 Lemme et al. (2012) Andre Lemme, René Felix Reinhart, and Jochen Jakob Steil. Online learning and generalization of partsbased image representations by nonnegative sparse autoencoders. Neural Networks, 33:194–203, 2012.
 Luo et al. (2017) Minnan Luo, Feiping Nie, Xiaojun Chang, Yi Yang, Alexander G Hauptmann, and Qinghua Zheng. Probabilistic nonnegative matrix factorization and its robust extensions for topic modeling. In AAAI, pp. 2308–2314, 2017.
 Mohammadiha et al. (2013) Nasser Mohammadiha, W Bastiaan Kleijn, and Arne Leijon. Gamma hidden markov model as a probabilistic nonnegative matrix factorization. In Signal Processing Conference (EUSIPCO), 2013 Proceedings of the 21st European, pp. 1–5. IEEE, 2013.

Nair & Hinton (2010)
Vinod Nair and Geoffrey E Hinton.
Rectified linear units improve restricted boltzmann machines.
In Proceedings of the 27th international conference on machine learning (ICML10), pp. 807–814, 2010.  Paatero & Tapper (1994) Pentti Paatero and Unto Tapper. Positive matrix factorization: A nonnegative factor model with optimal utilization of error estimates of data values. Environmetrics, 5(2):111–126, 1994.
 Paisley et al. (2014) John Paisley, D Blei, and Michael I Jordan. Bayesian nonnegative matrix factorization with stochastic variational inference. Handbook of Mixed Membership Models and Their Applications. Chapman and Hall/CRC, 2014.
 Rezende et al. (2014) Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.

Schmidt et al. (2009)
Mikkel N Schmidt, Ole Winther, and Lars Kai Hansen.
Bayesian nonnegative matrix factorization.
In
International Conference on Independent Component Analysis and Signal Separation
, pp. 540–547. Springer, 2009.  Smaragdis & Venkataramani (2017) Paris Smaragdis and Shrikant Venkataramani. A neural network alternative to nonnegative audio models. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on, pp. 86–90. IEEE, 2017.
 Squires et al. (2017) Steven Squires, Adam PrügelBennett, and Mahesan Niranjan. Rank selection in nonnegative matrix factorization using minimum description length. Neural Computation, 2017.
 Squires et al. (2019) Steven Squires, Adam Prugel Bennett, and Mahesan Niranjan. Minimum description length as an objective function for nonnegative matrix factorization. arXiv preprint arXiv:1902.01632, 2019.
Comments
There are no comments yet.