ARESTY RUTGERS UNDERGRADUATE RESEARCH JOURNAL, VOLUME I, ISSUE III

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

LEARNING PREDICTORS FROM MULTIDIMENSIONAL DATA WITH TENSOR FACTORIZATIONS

SOO MIN KWON, ANAND D. SARWATE (FACULTY ADVISOR)

ABSTRACT

Statistical machine learning algorithms often involve learning a linear relationship between dependent and independent variables. This relationship is modeled as a vector of numerical values, commonly referred to as weights or predictors. These weights allow us to make predictions, and the quality of these weights influences the accuracy of our predictions. However, when the independent variables (the data) inherently possess a more complex, multidimensional structure, it becomes increasingly difficult to model the relationship with a vector. In this paper, we address this issue by investigating machine learning classification algorithms for data with multidimensional (tensor) structure. By imposing tensor factorizations on the predictors, we can better model the relationship, as the predictors take the same form as the data in question. We empirically show that our approach works more efficiently than the traditional machine learning method when the true predictor possesses either an exact or an approximate low-rank tensor structure. Additionally, we show that estimating predictors with these factorizations allows us to solve for fewer parameters, making computation more feasible for multidimensional data.

1 INTRODUCTION

Machine learning classification algorithms are widely used in many applications such as fraud and spam detection. The objective of these algorithms is to model a linear relationship between the independent (e.g., card transactions, amount spent) and dependent (e.g., fraud or not fraud) variables. This relationship is generally modeled by learning a hyperplane that best separates the two classes of data, as shown in FIGURE 1. The hyperplane is constructed from weights and a bias, which can be interpreted as the slope and intercept, respectively. One can solve for or estimate these values by learning the parameters that most accurately describe the observed data points.

FIGURE 1: Visualization of learning a line (or hyperplane in higher-dimensional space) that best separates two classes of data. Machine learning algorithms estimate the weights, $W$, and bias, $b$, through empirical risk minimization.

In machine learning, we solve for these parameters (or predictors) through empirical risk minimization (ERM). The ERM framework estimates the parameters that minimize the "risk," or error, of a loss function comparing the model's predictions with the observed outcomes. The loss function measures the "closeness" of the predictions to the true values, so a smaller objective value corresponds to a more accurate model. There are many different machine learning classification algorithms, and each algorithm has a different loss function. However, since many loss functions model a linear relationship, there is an implicit need for our data to be vectorized. If our data samples are multidimensional, vectorization makes estimation of accurate predictors much more challenging.
For example, consider a different scenario in which one would like to make movie recommendations for a user given the number of movies watched in a certain genre. This data can easily be stored in the form of a matrix, where the rows represent each user and the columns represent the movie genre. However, what happens when a user's movie preferences change over time? As shown in FIGURE 2, we can capture this third variable (and many others) by modeling the observations in the form of a tensor, as it matches the structure of the data. Clearly, the structure of this tensor is significant for accurate data analysis: if the orderings of the movies watched were swapped for two given users, incorrect recommendations could be made. Vectorizing this data does not account for these types of structure, making inference much more challenging.

FIGURE 2: Example of modeling observations: left – matrix, right – tensor.

There are many modern applications that involve analyzing data with intrinsically many more dimensions, including medical imaging,[3,13] image processing,[5,18] and seismic data analysis.[8] In most of these settings, the objective is similar to the traditional machine learning goal: to formulate a prediction problem that establishes an association between the tensor covariates (independent variables) and the outcomes (dependent variables). However, as previously mentioned, most machine learning frameworks are formulated for vector spaces, making statistical inference challenging for tensor data. In addition, in most of these domains, the tensor data is also high-dimensional. For example, in medicine, tensor data samples may be of dimensions $128 \times 128 \times 128$ or greater. Naively turning such an array into a vector for traditional machine learning would result in solving for $128^3 = 2,097,152$ coefficients. In this scenario, vectorization not only destroys the structure of the data, but also makes computation infeasible.

RELATED WORKS

Recent work on tensor-based machine learning uses tensor factorizations to reduce the number of coefficients to be estimated.[10,15,21] Specifically, tensor decompositions are imposed on the coefficients as a form of feature selection or dimensionality reduction. Integrating such decomposition structures yields a low-rank approximation of the true predictor rather than its vectorized counterpart. Zhou et al. propose a tensor regression model with additional independent variables for predicting continuous values given fMRI data.[21] For parameter estimation, they propose a maximum likelihood (ML) approach using a block relaxation algorithm, which we adopt to formulate tensor classification models. Tan et al. propose a logistic tensor regression model with a norm regularization that induces sparsity.[15] We observe that this technique efficiently exploits structure, which motivates us to generalize and formulate more classification problems with different regularizers and on different datasets.

OUR CONTRIBUTION

In this paper, we investigate the performance of machine learning classifiers with a CANDECOMP/PARAFAC (CP) decomposition structure imposed on the coefficients/predictors.
We have seen in previous literature that these methods work efficiently for solving linear regression and logistic regression coefficients.[15,21] We solve classification problems, namely Support Vector Machines and Logistic Regression, on both synthetic and real data. The rest of this paper is organized as follows. We first discuss some tensor algebra and notation that will be used throughout this paper. Then, we propose the objective functions as well as a short analysis of the CP-structured machine learning problems. We motivate and show results of our approach by fixing and solving for the true predictor, comparing the results of the tensor-structured algorithms with those of the unstructured vector algorithms. Our contributions can be summarized as follows:

• We perform experiments to show that our structured method works more efficiently than the traditional method when the true predictor exhibits either an exact or an approximate low-rank structure.

• We show that our structured approach solves for fewer coefficients more efficiently than the traditional approach paired with a dimensionality reduction step (e.g., Principal Component Analysis).

• We develop algorithms that solve machine learning problems with decompositions of $n$-dimensional tensors using an alternating minimization scheme.

2 PRELIMINARIES

We dedicate this section to the concepts used throughout this paper. Due to the theoretical nature of this work, the technical description may require some mathematical maturity. The reader interested in the empirical findings can skip to SECTION 4. For a complete introduction to tensors, see the comprehensive surveys of Kolda and Bader[6] and Rabanser et al.[14] Tensors are simply multidimensional arrays, and the two terms will be used interchangeably. We denote vectors with lowercase letters ($x$), matrices with capital letters ($X$), and tensors with bold capital letters ($\mathbf{X}$).

i. Tensor Reorderings

Let $\mathbf{X}$ be a third-order tensor $\mathbf{X} \in \mathbb{R}^{I \times J \times 2}$ with two frontal slices $X_1, X_2 \in \mathbb{R}^{I \times J}$.

VECTORIZATION

We can create a vector from any matrix or tensor by stacking its rows or columns into a row or column vector, respectively. For example, vectorizing the tensor $\mathbf{X}$ by its columns stacks the columns of the first frontal slice, $X_1$, followed by the columns of the second frontal slice, $X_2$. The resulting column vector has dimensions $x \in \mathbb{R}^{2IJ}$.

MATRICIZATION

The $n$-mode matricization (or unfolding) of a tensor $\mathbf{Y} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ is denoted $Y_{(n)}$, where the columns of $Y_{(n)}$ are the mode-$n$ fibers of $\mathbf{Y}$. Consider the same tensor $\mathbf{X}$ from the previous example. Since $\mathbf{X}$ is a third-order tensor, it has three matricizations, $X_{(1)}$, $X_{(2)}$, and $X_{(3)}$, one for each mode. One can think of matricization as a generalization of vectorization, but to matrices.

ii. Vector & Matrix Products

OUTER PRODUCT

Let $a \in \mathbb{R}^{I}$ and $b \in \mathbb{R}^{J}$ be two vectors. The outer product of $a$ and $b$, denoted $a \circ b$, is an $I \times J$ matrix with entries $(a \circ b)_{ij} = a_i b_j$. Note that the outer product is not limited to vectors and can be generalized to matrices and tensors as well.
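To make these reorderings concrete, here is a brief NumPy sketch; the $2 \times 2 \times 2$ tensor and its entries are purely illustrative stand-ins, and the `unfold` helper is our own implementation of the mode-$n$ matricization described above.

```python
import numpy as np

# Illustrative 2x2x2 tensor: two 2x2 frontal slices X[:, :, 0] and X[:, :, 1]
X = np.arange(1, 9).reshape(2, 2, 2, order="F")

# Vectorization: stack the columns of the first frontal slice, then the second
x = X.reshape(-1, order="F")          # x in R^8 -> [1, 2, ..., 8]

# Mode-n matricization: the columns of X_(n) are the mode-n fibers of X
def unfold(tensor, mode):
    return np.reshape(np.moveaxis(tensor, mode, 0),
                      (tensor.shape[mode], -1), order="F")

X_mode1, X_mode2, X_mode3 = unfold(X, 0), unfold(X, 1), unfold(X, 2)

# Outer product of a in R^3 and b in R^4 is a 3x4 matrix with entries a_i * b_j
a, b = np.array([1.0, 2.0, 3.0]), np.array([1.0, 0.0, 2.0, 4.0])
A = np.outer(a, b)
```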
KRONECKER PRODUCT

Let $A \in \mathbb{R}^{I \times J}$ and $B \in \mathbb{R}^{K \times L}$ be two matrices. The Kronecker product of $A$ and $B$, denoted $A \otimes B$, is a matrix of dimensions $(A \otimes B) \in \mathbb{R}^{IK \times JL}$. In essence, the Kronecker product is computed by multiplying every element of the first matrix, $A$, by the entire second matrix, $B$.

KHATRI-RAO PRODUCT

The Khatri–Rao product is the columnwise Kronecker product. Consider two (different) matrices $A \in \mathbb{R}^{I \times K}$ and $B \in \mathbb{R}^{J \times K}$. The Khatri–Rao product of $A$ and $B$, denoted $A \odot B$, is a matrix of dimensions $(A \odot B) \in \mathbb{R}^{IJ \times K}$. Here, we are taking the Kronecker product between every pair of corresponding columns of $A$ and $B$. Note that if $A$ and $B$ are themselves column vectors, i.e., $K = 1$, then the Khatri–Rao product is equivalent to the Kronecker product, $A \odot B = A \otimes B$.

iii. Tensor Decomposition

Tensor decompositions are generalizations of matrix factorizations to multidimensional arrays.[20] We introduce one tensor factorization scheme that is important for understanding the setting of our algorithm. In the matricized form, we show that this factorization has useful properties that allow it to be solved with an alternating minimization scheme.

CANDECOMP/PARAFAC (CP) DECOMPOSITION

The objective of the CP decomposition is to express a tensor as a sum of component rank-one tensors, i.e., outer products of vectors, as depicted in FIGURE 3. For example, consider a third-order tensor $\mathbf{X} \in \mathbb{R}^{I \times J \times K}$. We can approximate this tensor as

$$\mathbf{X} \approx \sum_{r=1}^{R} a_r \circ b_r \circ c_r,$$

where "$\circ$" denotes the outer product, $R$ represents the rank (a positive integer), and $a_r \in \mathbb{R}^{I}$, $b_r \in \mathbb{R}^{J}$, and $c_r \in \mathbb{R}^{K}$ for $r = 1, \ldots, R$. We can formalize this decomposition as the following optimization problem:

$$\min_{\widehat{\mathbf{X}}} \; \| \mathbf{X} - \widehat{\mathbf{X}} \| \quad \text{with} \quad \widehat{\mathbf{X}} = \sum_{r=1}^{R} a_r \circ b_r \circ c_r, \qquad (1)$$

where $\widehat{\mathbf{X}}$ represents a low-rank approximation of $\mathbf{X}$. The factor matrices, or CP factors, collect the vectors of the rank-one components as their columns. In the third-order case above, $A = [a_1 \; a_2 \; \cdots \; a_R] \in \mathbb{R}^{I \times R}$ would be an estimated CP factor, and similarly for $B$ and $C$.

FIGURE 3: Graphical representation of the CANDECOMP/PARAFAC decomposition – low-rank approximation of a third-order tensor.

With these definitions and the products defined previously, we can state some useful properties for the third-order case:

$$\widehat{X}_{(1)} = A (C \odot B)^{\top}, \quad \widehat{X}_{(2)} = B (C \odot A)^{\top}, \quad \widehat{X}_{(3)} = C (B \odot A)^{\top}. \qquad (2)$$

These relationships can easily be generalized to $n$-mode tensors, but for the purposes of this paper, $n = 3$ will suffice. We will show how to use these equations in our alternating minimization algorithm in the following sections. There are also other useful tensor factorizations, such as the Tucker decomposition, which is explained in detail in the survey paper.[6]
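As a quick numerical check of the identities in (2), the following sketch builds an exactly rank-$R$ third-order tensor from random factor matrices and verifies the three matricized forms. The dimensions, the `unfold` helper, and the use of SciPy's `khatri_rao` are our own illustrative choices, not code from the paper.

```python
import numpy as np
from scipy.linalg import khatri_rao   # columnwise Kronecker product

def unfold(tensor, mode):
    """Mode-n matricization: columns are the mode-n fibers."""
    return np.reshape(np.moveaxis(tensor, mode, 0),
                      (tensor.shape[mode], -1), order="F")

# Random factor matrices for a rank-R tensor of size I x J x K
I, J, K, R = 4, 5, 6, 3
rng = np.random.default_rng(0)
A, B, C = (rng.standard_normal((d, R)) for d in (I, J, K))

# X = sum_r a_r o b_r o c_r
X = np.einsum("ir,jr,kr->ijk", A, B, C)

# The matricized identities in (2)
assert np.allclose(unfold(X, 0), A @ khatri_rao(C, B).T)
assert np.allclose(unfold(X, 1), B @ khatri_rao(C, A).T)
assert np.allclose(unfold(X, 2), C @ khatri_rao(B, A).T)
```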
iv. Machine Learning Optimization Problems

Many machine learning algorithms can be framed as empirical risk minimization (ERM) problems. The empirical risk is defined in terms of a risk, or loss, function $\ell(\cdot)$. For linear classifiers, the loss of a linear predictor $w$ on the data sample $(x_i, y_i)$ can be written as $\ell(w^{\top} x_i, y_i)$, and the average empirical risk as $\frac{1}{n} \sum_{i=1}^{n} \ell(w^{\top} x_i, y_i)$. We discuss these loss functions for some common classifiers and how we can use them to solve tensor-structured ERM problems.

SUPPORT VECTOR MACHINES

Consider a dataset with $n$ samples, i.e., $\{(x_i, y_i)\}_{i=1}^{n}$, where $y_i \in \{-1, 1\}$. The Support Vector Machine (SVM), or maximum margin linear classifier, is a binary classifier that finds a hyperplane that best separates the data while making minimal margin violations.[4] SVM uses a loss function called the hinge loss, defined by

$$\ell_{\text{hinge}}(w^{\top} x_i, y_i) = \max(0, \, 1 - y_i w^{\top} x_i),$$

where $w$ contains the coefficients of the separating hyperplane. With a penalty (or regularizer), we can mathematically formulate SVM as the following ERM problem:

$$\min_{w} \; \frac{1}{n} \sum_{i=1}^{n} \max(0, \, 1 - y_i w^{\top} x_i) + \lambda \| w \|_2^2. \qquad (3)$$

The regularization term, weighted by $\lambda$, is used to penalize the features, and hence the weights, that do not necessarily contribute to the prediction outcome. Here, we are considering the $\ell_2$ penalty, but there are other regularizers, such as the $\ell_1$ penalty. We use these regularization terms in our loss function to estimate a more accurate model.

LOGISTIC REGRESSION

Similarly, consider a dataset with $n$ samples, i.e., $\{(x_i, y_i)\}_{i=1}^{n}$, where $y_i \in \{-1, 1\}$. The objective of Logistic Regression (LOGIT) is the same as that of SVM, with a different loss function called the logistic loss, defined by

$$\ell_{\text{logit}}(w^{\top} x_i, y_i) = \log\!\big(1 + \exp(-y_i w^{\top} x_i)\big).$$

The logistic loss is derived from the sigmoid function. With a regularization term, we can define Logistic Regression as the following ERM problem:

$$\min_{w} \; \frac{1}{n} \sum_{i=1}^{n} \log\!\big(1 + \exp(-y_i w^{\top} x_i)\big) + \lambda \| w \|_2^2. \qquad (4)$$

We only introduce the objective functions of these two classifiers, as we will construct the CP-structured algorithms from these functions in the following section. Note that we do not include the bias term in our hyperplane equation, as it can be absorbed into $w$ by appending a constant entry to each data sample.

3 PROBLEM FORMULATION

In this section, we propose our tensor-based classifiers in the form of an ERM framework. In general, we structure our linear predictors to admit a CP decomposition, which we can then reconstruct to make classifications. We also discuss the metrics that we will use to evaluate the performance of our models.

i. CANDECOMP/PARAFAC Structured Classifiers

SUPPORT VECTOR MACHINES

Consider a dataset $\{(\mathbf{X}_i, y_i)\}_{i=1}^{n}$, where $\mathbf{X}_i \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ denotes a tensor data sample with $y_i \in \{-1, 1\}$. By imposing the constraints from (1) onto the predictor of (3), i.e., requiring the predictor to take the CP form $\mathbf{W} = \sum_{r=1}^{R} W_1^{(r)} \circ W_2^{(r)} \circ \cdots \circ W_N^{(r)}$, where $W_j^{(r)}$ denotes the $r$-th column of the factor matrix $W_j$, we can formulate the following optimization problem:

$$\min_{W_1, \ldots, W_N} \; \frac{1}{n} \sum_{i=1}^{n} \max\!\big(0, \, 1 - y_i \langle \mathbf{W}, \mathbf{X}_i \rangle\big) + \lambda \, \mathcal{R}(W_1, \ldots, W_N), \qquad (5)$$

where $\mathcal{R}(\cdot)$ denotes the regularization term applied to the CP factors. The traditional ERM problem for SVM in (3) solves for one vector predictor of dimension $w \in \mathbb{R}^{I_1 I_2 \cdots I_N}$. The problem in (5), which we call "CP-SVM," solves for $N$ matrix-valued predictors of dimensions $W_j \in \mathbb{R}^{I_j \times R}$, for $j = 1, \ldots, N$. As a concrete example, let each tensor sample have dimensions $\mathbf{X}_i \in \mathbb{R}^{5 \times 5 \times 5}$ and let $R = 3$. The traditional problem would solve for $5 \times 5 \times 5 = 125$ coefficients, whereas the structured problem would solve for $3 \times (5 \times 3) = 45$ coefficients. As the dimensions increase, the structured problem substantially reduces the number of parameters/coefficients to be estimated.

LOGISTIC REGRESSION

Similarly, consider a dataset $\{(\mathbf{X}_i, y_i)\}_{i=1}^{n}$, where $\mathbf{X}_i \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ denotes a tensor data sample with $y_i \in \{-1, 1\}$. By imposing the constraints from (1) onto the predictor of (4), we can solve the following ERM problem:

$$\min_{W_1, \ldots, W_N} \; \frac{1}{n} \sum_{i=1}^{n} \log\!\big(1 + \exp(-y_i \langle \mathbf{W}, \mathbf{X}_i \rangle)\big) + \lambda \, \mathcal{R}(W_1, \ldots, W_N). \qquad (6)$$

This new framework, which we call "CP-LOGIT," solves for fewer parameters, similar to CP-SVM. In practice, we solve for the weights using numerical optimization methods such as gradient descent. However, solving for the weights in this new CP-structured paradigm is a non-trivial task. In order to solve for the coefficients in (5) and (6), we adopt an alternating minimization algorithm similar to the block relaxation algorithm proposed by Zhou et al.[21] At each iteration, we update the block $W_n$ while keeping the rest of the blocks fixed. To see this, when updating $W_n \in \mathbb{R}^{I_n \times R}$, we can rewrite the inner product in (5) and (6) with the properties mentioned in (2):

$$\Big\langle \sum_{r=1}^{R} W_1^{(r)} \circ \cdots \circ W_N^{(r)}, \, \mathbf{X}_i \Big\rangle = \big\langle W_n, \, X_{i(n)} (W_N \odot \cdots \odot W_{n+1} \odot W_{n-1} \odot \cdots \odot W_1) \big\rangle,$$

where $X_{i(n)}$ denotes the mode-$n$ matricization of $\mathbf{X}_i$. This alternating minimization algorithm is summarized in ALGORITHM 1, in which the objective is the ERM problem to be minimized, $\theta$ represents the collection of all parameters, and $\lambda$ is the regularization parameter. The parameter $\lambda$ was tuned by hand, but it can also be determined through cross-validation.

To understand the CP-structured algorithm, consider the loss function in (5) with $N = 3$. When updating $W_2$, we rewrite the inner product $\langle \sum_{r=1}^{R} W_1^{(r)} \circ W_2^{(r)} \circ W_3^{(r)}, \mathbf{X}_i \rangle$ as $\langle W_2, X_{i(2)} (W_3 \odot W_1) \rangle$. Note that this equation follows from the properties of tensor algebra shown in (2). We repeat this update for all the factor matrices until the stopping criterion is met.
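As a concrete illustration of the alternating scheme, the following is a minimal sketch of how CP-LOGIT could be fit on third-order samples, using SciPy's general-purpose solver for each block update. The function name, the $\ell_2$ penalty on the active block, the fixed number of outer sweeps, and the random initialization are illustrative choices, not the authors' exact settings in ALGORITHM 1.

```python
import numpy as np
from scipy.linalg import khatri_rao
from scipy.optimize import minimize

def unfold(tensor, mode):
    return np.reshape(np.moveaxis(tensor, mode, 0),
                      (tensor.shape[mode], -1), order="F")

def cp_logit_fit(X, y, rank, lam=0.01, n_iters=20, seed=0):
    """Alternating minimization for a CP-structured logistic classifier.

    X : (n, I1, I2, I3) array of third-order tensor samples
    y : (n,) array of labels in {-1, +1}
    Returns the factor matrices [W1, W2, W3], with Wk of shape (Ik, rank).
    """
    n, dims = X.shape[0], X.shape[1:]
    rng = np.random.default_rng(seed)
    W = [rng.standard_normal((d, rank)) for d in dims]

    def block_objective(wk_flat, k, Z):
        # By (2), Z[i] = X_i(k) (.. ⊙ ..) satisfies <W_full, X_i> = <Wk, Z[i]>
        Wk = wk_flat.reshape(dims[k], rank)
        scores = np.einsum("dr,ndr->n", Wk, Z)
        loss = np.mean(np.logaddexp(0.0, -y * scores))   # logistic loss
        return loss + lam * np.sum(Wk ** 2)              # assumed l2 penalty

    for _ in range(n_iters):
        for k in range(3):
            rest = [W[j] for j in range(3) if j != k]
            kr = khatri_rao(rest[1], rest[0])            # e.g. k=1 -> W3 ⊙ W1
            Z = np.stack([unfold(X[i], k) @ kr for i in range(n)])
            res = minimize(block_objective, W[k].ravel(), args=(k, Z),
                           method="L-BFGS-B")
            W[k] = res.x.reshape(dims[k], rank)
    return W
```

Each block update here is a small smooth problem of size $I_k \times R$, which is what makes the alternating strategy attractive in practice.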
The alternating minimization algorithm is useful for several reasons. First, in practice, this algorithm almost always converges to at least a local minimum.[1,20,21] To find the best solution, the algorithm can be run several times with different initial factor matrices. Second, the low-rank optimization problem over the factor matrices is non-convex.[2] Thus, the problem is difficult to solve using common unconstrained solvers, such as gradient descent. In the literature, there are two ways to handle the non-convexity of this optimization problem. One way is to relax the rank constraint by adding a convex regularization term that induces low rank (e.g., the trace norm or nuclear norm).[16,19] The other is to employ this alternating minimization algorithm, since the optimization over one factor matrix, while the others are held fixed, is convex. We chose to explore this procedure following Zhou et al.,[21] as the algorithm is straightforward to implement using statistical software such as MATLAB or Python.

ii. Performance Metrics

We evaluate the performance of our models using several measures at different sample sizes. The following four metrics help determine the "closeness" between the true and estimated predictors.

1. The Mean Squared Error (MSE) in (7) is computed over the $n$ data samples for the true predictor $\mathbf{W}$, where $\widehat{\mathbf{W}}$ is the estimated predictor obtained from solving the ERM problem.

2. The cosine distance (or similarity)[12] for the true predictor $\mathbf{W}$ and the reconstructed predictor $\widehat{\mathbf{W}}$ is computed as

$$\cos(\theta) = \frac{\langle \operatorname{vec}(\mathbf{W}), \operatorname{vec}(\widehat{\mathbf{W}}) \rangle}{\| \mathbf{W} \|_F \, \| \widehat{\mathbf{W}} \|_F}. \qquad (8)$$

Mathematically, the cosine similarity measures the cosine of the angle between two vectors in an $n$-dimensional space. As the angle $\theta$ between the two vectors becomes smaller, the cosine similarity approaches a value of 1. As the vectors become farther apart (approaching perpendicular), the cosine similarity approaches a value of 0.

3. The reconstruction error in (9) measures the discrepancy between the true predictor $\mathbf{W}$ and the estimated tensor predictor $\widehat{\mathbf{W}}$, where $\| \cdot \|_F$ denotes the Frobenius norm, a matrix generalization of the Euclidean ($\ell_2$) norm.

4. The classification accuracy for $n$ test samples is simply the fraction of correct predictions:

$$\text{accuracy} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\{\hat{y}_i = y_i\}. \qquad (10)$$

After solving for $\widehat{\mathbf{W}}$, we make predictions on test data and compare the predicted labels $\hat{y}_i$ to the true labels $y_i$. Before comparing the labels, we use the sign function to quantize our predictions to $\hat{y}_i \in \{-1, 1\}$.

4 EXPERIMENTS

We used two types of data for our experiments: synthetic data and the Modified National Institute of Standards and Technology (MNIST) database.[9] The MNIST database is a benchmark dataset used widely in machine learning that consists of 60,000 samples of handwritten digits from 0 to 9. The objective of both experiments is to compare the performance of the CP-structured algorithms and the traditional algorithms, which were implemented using the software packages TensorLy[7] and SciPy.[17] For all experiments, we use a Python environment on a MacBook Pro with a 2.2 GHz Intel Core i7 and 16 GB of RAM.
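For reference, here is one way the four metrics could be implemented. The MSE (7) and reconstruction error (9) below reflect our own reading of their descriptions (a prediction-score MSE and a relative Frobenius-norm error) and should be treated as assumptions; the cosine similarity and accuracy follow the standard definitions in (8) and (10).

```python
import numpy as np

def mse(W_true, W_hat, X):
    # Assumed form of (7): average squared gap between the scores <W, X_i>
    # produced by the true and estimated predictors over the n samples.
    s_true = np.array([np.sum(W_true * Xi) for Xi in X])
    s_hat = np.array([np.sum(W_hat * Xi) for Xi in X])
    return float(np.mean((s_true - s_hat) ** 2))

def cosine_similarity(W_true, W_hat):
    # Cosine of the angle between the vectorized predictors, as in (8).
    w, w_hat = W_true.ravel(), W_hat.ravel()
    return float(w @ w_hat / (np.linalg.norm(w) * np.linalg.norm(w_hat)))

def reconstruction_error(W_true, W_hat):
    # Assumed form of (9): relative Frobenius-norm error of the estimate.
    return float(np.linalg.norm(W_true - W_hat) / np.linalg.norm(W_true))

def classification_accuracy(W_hat, X_test, y_test):
    # Fraction of test samples whose sign-quantized prediction matches (10).
    y_hat = np.sign([np.sum(W_hat * Xi) for Xi in X_test])
    return float(np.mean(y_hat == y_test))
```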
i. Synthetic Data

For the synthetic data, we generated univariate $y_i$ responses at different sample sizes according to the model in (11), in which each entry of $\mathbf{X}_i$ is drawn independently and identically distributed (iid) from $\mathcal{N}(0,1)$, $\epsilon$ is a noise term drawn iid from $\mathcal{N}(0,1)$, and $\mathbf{W}$ is the fixed predictor shown in FIGURE 4. The objective was to observe whether our models defined in (5) and (6) can identify the true signal $\mathbf{W}$ given $(\mathbf{X}_i, y_i)$.

PERFORMANCE COMPARISON

To measure the "closeness" and classification accuracy between the true model and the predicted model, we use the performance metrics defined in (7)–(10). We compute these metrics at different sample sizes and show that, as the number of samples increases, the performance of the traditional vector approach converges to the performance of the CP-structured model. These results are displayed in FIGURES 5 and 6. In FIGURE 5, we can visually see that our method recovers the true predictor more accurately. For example, in the case of $n = 500$ in ROW 1, the "cross" figure is more accurately portrayed by the CP method (RIGHT) than by the traditional method (MIDDLE). This allows us to make more accurate predictions, as the estimated weights more closely follow the true weights.

In FIGURE 6, we can see that the MSE for both algorithms is roughly the same across all sample sizes. For the cosine distance, the CP-structured algorithm approaches a value of 1 very quickly, which indicates a strong similarity between the estimated and true coefficients. The reconstruction error and classification accuracy both show gaps between the two methods, but the gaps lessen as the sample size increases. We can conclude that these results depend on the sample size, as more samples narrow down the set of hyperplanes that separate the data, yielding coefficients closer to the true model. Based on the trends of the graphs in FIGURE 6, we also hypothesize that if the variance of the noise ($\epsilon$) distribution were higher, the CP-structured algorithms would still perform better than the traditional method.

RESULTS WITH PCA

The CP-structured algorithm significantly reduces the number of predictors to be estimated. To solve for fewer coefficients using the traditional method, we can perform Principal Component Analysis (PCA) on the dataset before running the algorithm. We apply PCA to the data with an energy capture of 95%, which reduces the number of coefficients from 225 to 189. However, even with this minimal reduction, we can see in FIGURE 6 that there is a notable decrease in performance across most metrics. The MSE seems unaffected, but the other three metrics begin to show a gap between the traditional method with PCA and both the traditional method without PCA and the CP-structured algorithm. A possible explanation for this phenomenon is that PCA does not capture tensor data efficiently in a lower-dimensional space. If we were to decrease the energy capture, the gap in performance would grow larger even for a bigger sample size. We predict that as the dimensionality of the data increases, PCA would not be an efficient feature learning method for parameter reduction, favoring the CP-structured methods.

FIGURE 4: Two 15 × 15 images used as true predictors $\mathbf{W}$ to generate synthetic data.

FIGURE 5: Reconstructed predictors from both algorithms: LEFT – true predictor, MIDDLE – reconstructed predictor from the traditional method with increasing sample sizes ($n = 500, 1000, 1500$), RIGHT – reconstructed predictor from CP-structured Logistic Regression with increasing sample sizes ($n = 500, 1000, 1500$).

FIGURE 6: Variation of performance (y-axis) with different sample sizes (x-axis) for SVM and LOGIT. COLUMNS 1–4 show plots for the MSE, Cosine Distance, Reconstruction Error, and Classification Accuracy, respectively. ROWS 1–2: performance metrics for LOGIT with the cross and square predictors, respectively. ROWS 3–4: performance metrics for SVM with the cross and square predictors, respectively. The cross and square predictors are as shown in FIGURE 4.
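The sketch below assumes one natural reading of the data-generating model (11), namely $y_i = \operatorname{sign}(\langle \mathbf{W}, \mathbf{X}_i \rangle + \epsilon_i)$, and the hard-coded cross-shaped predictor is only a stand-in for the images in FIGURE 4; both are assumptions for illustration.

```python
import numpy as np

def make_synthetic(W_true, n, noise_std=1.0, seed=0):
    """Generate n samples with iid N(0,1) entries and labels in {-1, +1}.

    Assumes model (11) has the form y_i = sign(<W, X_i> + eps_i).
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n,) + W_true.shape)    # entries of X_i ~ N(0, 1)
    eps = noise_std * rng.standard_normal(n)        # noise term ~ N(0, 1)
    scores = X.reshape(n, -1) @ W_true.ravel()      # <W, X_i> for each sample
    y = np.sign(scores + eps)
    y[y == 0] = 1.0                                 # break ties away from zero
    return X, y

# Illustrative 15 x 15 "cross" predictor, loosely mimicking FIGURE 4
W = np.zeros((15, 15))
W[6:9, :] = 1.0
W[:, 6:9] = 1.0
X_train, y_train = make_synthetic(W, n=500)
```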
ii. MNIST Data

The objective of the MNIST experiment was to observe which algorithm would be more efficient when the true predictor exhibits an "approximate" low-rank structure. In the previous experiment, the two images used as true predictors had an exact low-rank structure, as each could easily be written as a product of two low-rank factor matrices. Similar to the synthetic data setup, we generated univariate $y_i$ responses with sample size $n = 750$ using the model defined in (11). However, for the true predictor $\mathbf{W}$, we chose a "1" from the MNIST dataset, as it exhibits an "approximate" low-rank structure. We compared the CP-structured algorithms to the traditional algorithms using different rank values. These results are shown in TABLE 1.

PERFORMANCE COMPARISON

We use the same performance metrics defined for the previous experiment and display the results in TABLE 1. From this table, we can conclude that both CP-structured algorithms gave the most favorable results when the CP rank was 2. This shows that we can approximate a "1" from the MNIST dataset with matrices of rank 2. However, in all cases from rank 1 to 3, the structured algorithms gave more favorable results than the traditional algorithms. This goes to show that if the true predictor exhibits an approximate low-rank structure, it may be beneficial to use the structured algorithms for classification.

METHOD          MSE       COS DISTANCE   RECONSTRUCTION ERROR
SVM             0.00128   0.51832        0.00053
CP-SVM (R=1)    0.00088   0.66727        0.00044
CP-SVM (R=2)    0.00026   0.90259        0.00024
CP-SVM (R=3)    0.00039   0.85099        0.00029
LOGIT           0.00120   0.54759        0.00051
CP-LOGIT (R=1)  0.00089   0.66483        0.00044
CP-LOGIT (R=2)  0.00028   0.89590        0.00024
CP-LOGIT (R=3)  0.00033   0.87807        0.00026

TABLE 1: Performance metrics for the traditional and CP-structured algorithms on the MNIST experiment, where $R$ represents the rank of the CP-structured algorithm. For both methods, the best value of every metric is achieved at $R = 2$.
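The claim that a "1" digit is approximately rank 2 can be checked directly: for a matrix, a rank-$R$ CP decomposition is simply a rank-$R$ factorization, so the truncated SVD gives the best achievable reconstruction at each rank. The sketch below assumes `digit_one` is a $28 \times 28$ MNIST image loaded elsewhere; it is an illustration, not the experiment in TABLE 1.

```python
import numpy as np

def relative_rank_r_error(image, r):
    """Relative Frobenius error of the best rank-r approximation (truncated SVD)."""
    U, s, Vt = np.linalg.svd(image, full_matrices=False)
    approx = (U[:, :r] * s[:r]) @ Vt[:r, :]
    return float(np.linalg.norm(image - approx) / np.linalg.norm(image))

# `digit_one` is assumed to be a 28 x 28 array containing a "1" from MNIST;
# the error should drop sharply by rank 2 for such a digit.
# for r in (1, 2, 3):
#     print(r, relative_rank_r_error(digit_one, r))
```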
5 CONCLUSION

In this paper, we explored tensor-based classification models using a tensor decomposition. We proposed two algorithms that impose a CANDECOMP/PARAFAC factorization structure on the predictors of traditional classification algorithms: Support Vector Machines and Logistic Regression. Imposing this structure on the traditional algorithms allowed us to exploit the structure of the data, enabling efficient learning with fewer parameters. We showed with different performance metrics that our proposed method increased accuracy and, overall, produced a more accurate estimate of the weights. The experiments showed that the CP algorithms performed best when the true predictor had either an approximate or an exact low-rank structure. We also showed that solving for fewer parameters using PCA compromised the performance of the traditional method. We predict that PCA would not generalize well to data with multidimensional structure, favoring the CP-structured algorithms. However, we believe it would be interesting if one could show when PCA could be better than using the CP structure. This could possibly be the case when the data in question is known to be linear, as PCA is a linear feature learning method. One potential example is using structured data for prediction when it is known a priori that the features have a linear relationship. However, due to time constraints, we were not able to explore this possibility in detail. We also think it would be interesting to test these algorithms on more datasets. In addition, we believe an exciting direction for future research is to exploit tensor decompositions in other learning problems, such as deep learning. However, it is not clear how one would approach this problem, as deep learning algorithms have non-convex loss functions. We leave this to the reader to investigate in future work.∎

6 REFERENCES

[1] Bezdek, J., & Hathaway, R. (2003). Convergence of alternating optimization. Neural, Parallel & Scientific Computations, 11, 351–368.
[2] Candès, E. J., & Recht, B. (2009). Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6), 717–772.
[3] de Luis-García, R., Westin, C.-F., & Alberola-López, C. (2010). Gaussian mixtures on tensor fields for segmentation: Applications to medical imaging. Computerized Medical Imaging and Graphics, 35(1), 16–30.
[4] James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning with Applications in R (1st ed.). Springer New York.
[5] Jia, C., Kong, Y., Ding, Z., & Fu, Y. (2014). Latent tensor transfer learning for RGB-D action recognition. Proceedings of the 22nd ACM International Conference on Multimedia, 87–96.
[6] Kolda, T. G., & Bader, B. W. (2009). Tensor decompositions and applications. SIAM Review, 51(3), 455–500.
[7] Kossaifi, J., Panagakis, Y., Anandkumar, A., & Pantic, M. (2019). TensorLy: Tensor learning in Python. Journal of Machine Learning Research, 20.
[8] Kreimer, N., Stanton, A., & Sacchi, M. D. (2013). Tensor completion based on nuclear norm minimization for 5D seismic data reconstruction. Geophysics, 78(6), V273–V284.
[9] Lecun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.
[10] Li, X., Xu, D., Zhou, H., & Li, L. (2018). Tucker tensor regression and neuroimaging analysis. Statistics in Biosciences, 10(3), 520–545.
[11] Li, Y., Zhu, H., Shen, D., Lin, W., Gilmore, J. H., & Ibrahim, J. G. (2011). Multiscale adaptive regression models for neuroimaging data. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(4), 559–578.
[12] Nguyen, H. V., & Bai, L. (2011). Cosine similarity metric learning for face verification. In Computer Vision – ACCV 2010 (Vol. 6493, pp. 709–720). Springer Berlin Heidelberg.
[13] O'Donnell, L. J., & Westin, C.-F. (2011). An introduction to diffusion tensor image analysis. Neurosurgery Clinics of North America, 22(2), 185–196.
[14] Rabanser, S., Shchur, O., & Günnemann, S. (2017). Introduction to tensor decompositions and their applications in machine learning.
[15] Tan, X., Zhang, Y., Tang, S., Shao, J., Wu, F., & Zhuang, Y. (2013). Logistic tensor regression for classification. In Intelligent Science and Intelligent Data Engineering (Vol. 7751, pp. 573–581). Springer Berlin Heidelberg.
[16] Tomioka, R., & Suzuki, T. (2013). Convex tensor decomposition via structured Schatten norm regularization.
[17] Virtanen, P., Gommers, R., Oliphant, T. E., Haberland, M., Reddy, T., Cournapeau, D., Burovski, E., Peterson, P., Weckesser, W., Bright, J., van der Walt, S. J., Brett, M., Wilson, J., Millman, K. J., Mayorov, N., Nelson, A. R. J., Jones, E., Kern, R., Larson, E., … Moore, E. W. (2020). SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nature Methods, 17(3), 261–272.
[18] Guo, W., Kotsia, I., & Patras, I. (2012). Tensor learning for regression. IEEE Transactions on Image Processing, 21(2), 816–827.
[19] Wimalawarne, K., Tomioka, R., & Sugiyama, M. (2016). Theoretical and experimental analyses of tensor-based regression and classification. Neural Computation, 28(4), 686–715.
[20] Wright, S. J. (2015). Coordinate descent algorithms. Mathematical Programming, 151(1), 3–34.
[21] Zhou, H., Li, L., & Zhu, H. (2013). Tensor regression with applications in neuroimaging data analysis. Journal of the American Statistical Association, 108(502), 540–552.

Soo Min Kwon is currently pursuing an M.S. degree in the Department of Electrical and Computer Engineering at Rutgers, The State University of New Jersey. His research interests broadly lie in optimization, multidimensional (tensor) data analysis, differential privacy, and distributed learning. He earned his B.S. degree in Electrical and Computer Engineering from Rutgers University in 2020. He completed his undergraduate thesis under the supervision of Prof. Anand D. Sarwate on tensor-based machine learning algorithms. Soo Min hopes to pursue a Ph.D. upon completion of his M.S. degree.