#### Michael P. J. Camilleri

So, my supervisor recently proposed the following problem to me as a potential way of getting students to think about Principal Components Analysis. Below are my thoughts.

*[Edit] Thanks to my supervisor for discussing the problem with me.*

“The MovieFlix Corporation wants to improve the algorithm it uses to recommend new movies to its customers. It has decided to release a portion of its data to the public in the hope that researchers will be able to find patterns in users’ preferences for the different movies.

The data is represented as a matrix *X*, where rows *i = {1, …, m}* correspond to movies, columns *j = {1, …, c}* correspond to customers, and *X_{ij}* is a numeric value that reflects the degree to which customer *j* enjoyed movie *i*. Assume the data is complete, i.e. we know the rating of every user for every movie.

MovieFlix is concerned about the privacy of its customers and has decided to reduce the dimensionality of the data before the public release. Explain how the lower-dimensional matrix obtained by applying PCA on this data protects the privacy of individual customers.”

## Why PCA?

So, the first thing to note is that there are two mechanisms by which PCA can provide some degree^{[1]} of privacy. The first stems from the transformation of the data into a new space: given that the eigenvector projections are not released with the data, the new dimensions, being linear combinations of the original ones, have no direct meaning in terms of them. That is, the data lives in a new space, and given only the new projections, there are infinitely many ways in which they could map back to the original one^{[2]}.

One related thing to note is that PCA (without the dimensionality reduction) is only a rotation of the original space, which means that distances are preserved. This will be important when thinking about predictive tasks.
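This is easy to verify numerically. The following is a minimal sketch (toy matrix sizes are made up) that performs full PCA via an eigen-decomposition of the covariance matrix and checks that pairwise distances between movies are unchanged by the rotation:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))          # toy data: 6 movies, 4 customers
Xc = X - X.mean(axis=0)              # centre each feature (customer column)

# Full PCA: eigenvectors of the covariance matrix form an orthogonal rotation
eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc)
Z = Xc @ eigvecs                     # rotate into the PCA space (no truncation)

def pdist(A):
    """All pairwise Euclidean distances between the rows of A."""
    return np.linalg.norm(A[:, None, :] - A[None, :, :], axis=-1)

# Distances between movies are identical before and after the rotation
assert np.allclose(pdist(Xc), pdist(Z))
```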

The other way in which some privacy is introduced is through the truncation of the projection: if we keep only the first *n* principal components, where *n* is less than the original dimensionality, then by definition we are withholding information. In this case, we do not necessarily need to hide the eigenvector projections, since the reconstruction in the original space will still be incomplete. The repercussion, however, is that we are losing information which might be useful for any downstream prediction task.
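The incompleteness of the reconstruction can be demonstrated directly. The sketch below (again with made-up sizes) keeps only the top two principal components and reconstructs the data; the discarded variance shows up as an irrecoverable reconstruction error:

```python
import numpy as np
from numpy.linalg import eigh, norm

rng = np.random.default_rng(1)
X = rng.normal(size=(8, 5))          # toy data: 8 movies, 5 customers
Xc = X - X.mean(axis=0)

eigvals, eigvecs = eigh(Xc.T @ Xc)   # eigenvalues in ascending order
W = eigvecs[:, ::-1][:, :2]          # keep the top-2 principal components
Z = Xc @ W                           # truncated projection (what is released)
X_rec = Z @ W.T                      # best rank-2 reconstruction

# The reconstruction is incomplete: the squared error equals the sum of
# the discarded eigenvalues
err = norm(Xc - X_rec) ** 2
print(err, eigvals[:3].sum())
```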

Obviously, the two can be combined, in which case more privacy is ensured.

## Which dimension?

The other question one needs to address is which dimension we are going to apply PCA to. Answering this has repercussions on what the privacy preserves, but it also requires looking at how one intends to use the data.

In particular, note that if we treat this as a matrix completion problem, then applying PCA makes the problem ill-posed: once we have transformed the original space, a new sample with some entries missing cannot be transformed into the PCA space. It might in fact be more proper to use something like Singular Value Decomposition (SVD); however, that is a discussion for another time. There is another way this can be done: we apply PCA only on a subset of the columns/rows, and our goal is to predict another column/row. This is the setting I will use to discuss the predictive task.

### PCA on each Row (Columns as Features)

So, for me the most natural way of applying PCA would be to the columns, i.e. treating the customer IDs as features. This lends itself very well to the first of our goals: privacy. By representing the customers only through a linear combination, one loses the identity of the individual. That is, for each movie, we only have a linear combination of the individual preferences and, subject to the caveats described above, we preserve the privacy of the individuals.

Note that by just rotating the data (which is what PCA does), distances are preserved: two movies that are similar (have the same preference profile across customers) in the original space are equally similar in the rotated PCA space. Hence, using this, we can learn a predictor for a new customer, given the features of each movie.
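A minimal sketch of this arrangement, using scikit-learn's `PCA` on a hypothetical ratings matrix (the sizes and rating scale are assumptions, not from the problem): rows are movies, the customer columns are the features being compressed, so each movie ends up described by mixtures of customer preferences rather than by any identifiable customer's ratings.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
# Hypothetical data: 100 movies rated 1-5 by 20 customers
ratings = rng.integers(1, 6, size=(100, 20)).astype(float)

# Compress the 20 customer columns down to 5 components
pca = PCA(n_components=5)
movie_features = pca.fit_transform(ratings)

print(movie_features.shape)  # (100, 5)
```

Each of the 5 released columns is a linear combination of all 20 customers' ratings, which is exactly the mixing that obscures individual identities.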

### PCA on each Column (Rows as Features)

There is, however, another way of looking at this: applying PCA to the movie dimensions. In this case, privacy is preserved by not explicitly specifying which movies each individual prefers. In a way, this could be seen as capturing a space of preferences over certain groupings of movies (which may or may not align with genres). However, I believe that the level of privacy is lower in this case.

As regards prediction, this lends itself to the slightly different task of proposing which customers a movie should be suggested to: the focus is now on similarity in the customer space (since the movie space itself is now ‘meaningless’).
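In code, this alternative is simply the transpose of the previous sketch (again on a hypothetical ratings matrix): rows become customers and the movie columns are the features being compressed.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
# Hypothetical data: 100 movies rated 1-5 by 20 customers
ratings = rng.integers(1, 6, size=(100, 20)).astype(float)

# Transpose so that rows are customers and the 100 movie columns are
# the features; each customer is summarised by 5 movie-grouping components
customer_features = PCA(n_components=5).fit_transform(ratings.T)

print(customer_features.shape)  # (20, 5)
```

Note that each released row still corresponds to one customer, which is why the level of privacy is arguably lower here than in the previous arrangement.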

## Conclusion

The final thing to note, however, is that at the end of the day the difference is somewhat artificial. This is because the non-zero eigenvalues of *XX^{T}* and *X^{T}X* are the same^{[3]}. In terms of information content (which has a bearing on privacy), the same information is therefore available either way. Nonetheless, thinking about it in the above way helps conceptually.
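This fact is also easy to check numerically (a quick numpy sketch with a made-up matrix): after discarding the eigenvalues that are numerically zero, the two spectra coincide.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(7, 4))          # toy rectangular matrix

ev_left = np.linalg.eigvalsh(X @ X.T)   # 7 eigenvalues, 3 of which are ~0
ev_right = np.linalg.eigvalsh(X.T @ X)  # 4 eigenvalues

# Keep only the numerically non-zero eigenvalues and compare
nz_left = np.sort(ev_left[ev_left > 1e-10])
nz_right = np.sort(ev_right[ev_right > 1e-10])
assert np.allclose(nz_left, nz_right)
```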

## Footnotes

1. I say *some* degree because, in reality, we can never fully anonymise the data: indeed, full anonymisation would render the data useless; see for example this discussion.

2. Now this is **NOT** always true. If we know something about the original data, there may be only a finite number of possibilities, and one could envision reverse-engineering the original values. For example, it is possible that one could find out the ranges of the values in the original data, in which case only some rotations would be valid. Indeed, this is a problem with many anonymisation schemes: if the data is itself highly anonymised but can be augmented with other datasets, then the level of privacy will actually drop.

3. See for example, **C.M. Bishop**, *Pattern Recognition and Machine Learning* (2006), Section 12.1.4.