ARESTY RUTGERS UNDERGRADUATE RESEARCH JOURNAL, VOLUME I, ISSUE III

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

LEARNING PREDICTORS FROM MULTIDIMENSIONAL DATA WITH TENSOR FACTORIZATIONS

SOO MIN KWON, ANAND D. SARWATE (FACULTY ADVISOR)

ABSTRACT

Statistical machine learning algorithms often involve learning a linear relationship between dependent and independent variables. This relationship is modeled as a vector of numerical values, commonly referred to as weights or predictors. These weights allow us to make predictions, and the quality of these weights influences the accuracy of our predictions. However, when the independent variables (the data) inherently possess a more complex, multidimensional structure, it becomes increasingly difficult to model the relationship with a vector. In this paper, we address this issue by investigating machine learning classification algorithms for data with multidimensional (tensor) structure. By imposing tensor factorizations on the predictors, we can better model the relationship, as the predictors take the same form as the data in question. We empirically show that our approach works more efficiently than the traditional machine learning method when the true predictor possesses either an exact or an approximate low-rank tensor structure. Additionally, we show that estimating predictors with these factorizations allows us to solve for fewer parameters, making computation more feasible for multidimensional data.

1 INTRODUCTION

Machine learning classification algorithms are widely used in many applications such as fraud and spam detection. The objective of these algorithms is to model a linear relationship between the independent (e.g., card transactions, amount spent) and dependent (e.g., fraud or not fraud) variables. This relationship is generally modeled by learning a hyperplane that best separates the two classes of data, as shown in FIGURE 1. The hyperplane is constructed from weights and a bias, which can be interpreted as the slope and intercept, respectively. One can solve for or estimate these values by learning the parameters that most accurately describe the observed data points.

FIGURE 1: Visualization of learning a line (or hyperplane in higher-dimensional space) that best separates two classes of data. Machine learning algorithms estimate the weights, $W$, and bias, $b$, through empirical risk minimization.

In machine learning, we solve for these parameters (or predictors) through empirical risk minimization (ERM). The ERM framework estimates the parameters that minimize the "risk," or error, of a loss function comparing the model's predictions with the observed outcomes. The loss function measures the "closeness" of the predictions to the true values, so a smaller objective value corresponds to a more accurate model. There are many different machine learning classification algorithms, and each algorithm has a different loss function. However, since many loss functions model a linear relationship, there is an implicit need for our data to be vectorized. If our data samples are multidimensional, vectorization makes estimation of accurate predictors much more challenging.
For example, consider a different scenario in which one would like to make movie recommendations for a user given the number of movies watched in a certain genre. This data can easily be stored in the form of a matrix, where the rows represent each user and the columns represent the movie genre. However, what happens when a user's movie preferences change over time? As shown in FIGURE 2, we can capture this third variable (and many others) by modeling the observations in the form of a tensor, as it matches the structure of the data. Clearly, the structure of this tensor is significant for accurate data analysis: if the orderings of the movies watched were swapped for two given users, incorrect recommendations could be made. Vectorizing this data does not account for these types of structure, making inference much more challenging.

FIGURE 2: Example of modeling observations: left – matrix, right – tensor.

There are many modern applications that involve analyzing data with intrinsically many more dimensions, including medical imaging,[3,13] image processing,[5,18] and seismic data analysis.[8] In most of these settings, the objective is similar to the traditional machine learning goal: to formulate a prediction problem that establishes an association between the tensor covariates (independent variables) and the outcomes (dependent variables). However, as previously mentioned, most machine learning frameworks are formulated for vector spaces, making statistical inference challenging for tensor data. In addition, in most of these domains, the tensor data is also high-dimensional. For example, in medicine, tensor data samples may be of dimensions $128 \times 128 \times 128$ or greater. Naively turning such an array into a vector for traditional machine learning would result in solving for $128^3 = 2,097,152$ coefficients. In this scenario, vectorization not only destroys the structure of the data, but also makes computation infeasible.

RELATED WORKS

Recent work on tensor-based machine learning uses tensor factorizations to reduce the number of coefficients to be estimated.[10,15,21] Specifically, tensor decompositions are imposed on the coefficients as a form of feature selection or dimensionality reduction. Integrating such decomposition structures yields a low-rank approximation of the true predictor rather than its vectorized counterpart. Zhou et al. propose a tensor regression model with additional independent variables for predicting continuous values given fMRI data.[21] For parameter estimation, they propose a maximum likelihood (ML) approach using a block relaxation algorithm, which we adopt to formulate tensor classification models. Tan et al. propose a logistic tensor regression model with a norm regularization that induces sparsity.[15] We observe that this technique efficiently exploits structure, which motivates us to generalize and formulate more classification problems with different regularizers and on different datasets.

OUR CONTRIBUTION

In this paper, we investigate the performance of machine learning classifiers with a CANDECOMP/PARAFAC (CP) decomposition structure imposed on the coefficients/predictors.
We have seen in previous literature that these methods work efficiently for solving linear regression and logistic regression coefficients.[15,21] We solve classification problems, namely Support Vector Machines and Logistic Regression, on both synthetic and real data. The rest of this paper is organized as follows. We first discuss some tensor algebra and notation that will be used throughout this paper. Then, we propose the objective functions as well as a short analysis of the CP-structured machine learning problems. We motivate and show results of our approach by fixing and solving for the true predictor, comparing the results of the tensor-structured algorithms with those of the unstructured vector algorithms. Our contributions can be summarized as follows:

• We perform experiments to show that our structured method works more efficiently than the traditional method when the true predictor exhibits either an exact or an approximate low-rank structure.

• We show that our structured approach solves for fewer coefficients more efficiently than the traditional approach paired with a dimensionality reduction step (e.g., Principal Component Analysis).

• We develop algorithms that solve machine learning problems with decompositions of $n$-dimensional tensors using an alternating minimization scheme.

2 PRELIMINARIES

We dedicate this section to the concepts used throughout this paper. Due to the theoretical nature of this work, the technical description may require some mathematical maturity. The reader interested in the empirical findings can skip to SECTION 4. For a complete introduction to tensors, see the comprehensive surveys of Kolda and Bader[6] and Rabanser et al.[14] Tensors are simply multidimensional arrays, and the two terms will be used interchangeably. We denote vectors with lowercase letters ($x$), matrices with capital letters ($X$), and tensors with bold capital letters ($\mathbf{X}$).

i. Tensor Reorderings

Let $\mathbf{X}$ be a third-order tensor $\mathbf{X} \in \mathbb{R}^{I \times J \times 2}$ with two frontal slices $X_1, X_2 \in \mathbb{R}^{I \times J}$.

VECTORIZATION

We can create a vector from any matrix or tensor by stacking its rows or columns into a row or column vector, respectively. For example, vectorizing the tensor $\mathbf{X}$ by its columns stacks the columns of the first frontal slice, $X_1$, followed by the columns of the second frontal slice, $X_2$. The resulting column vector has dimensions $x \in \mathbb{R}^{2IJ}$.

MATRICIZATION

The $n$-mode matricization (or unfolding) of a tensor $\mathbf{Y} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_N}$ is denoted $Y_{(n)}$, where the columns of $Y_{(n)}$ are the mode-$n$ fibers of $\mathbf{Y}$. Consider the same tensor $\mathbf{X}$ from the previous example. Since $\mathbf{X}$ is a third-order tensor, it has three matricizations, $X_{(1)}$, $X_{(2)}$, and $X_{(3)}$, one for each mode. One can think of matricization as a generalization of vectorization, but to matrices.

ii. Vector & Matrix Products

OUTER PRODUCT

Let $a \in \mathbb{R}^{I}$ and $b \in \mathbb{R}^{J}$ be two vectors. The outer product of $a$ and $b$, denoted $a \circ b$, is an $I \times J$ matrix with entries $(a \circ b)_{ij} = a_i b_j$. Note that the outer product is not limited to vectors and can be generalized to matrices and tensors as well.
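To make these reorderings concrete, here is a brief NumPy sketch; the $2 \times 2 \times 2$ tensor and its entries are purely illustrative stand-ins, and the `unfold` helper is our own implementation of the mode-$n$ matricization described above.

```python
import numpy as np

# Illustrative 2x2x2 tensor: two 2x2 frontal slices X[:, :, 0] and X[:, :, 1]
X = np.arange(1, 9).reshape(2, 2, 2, order="F")

# Vectorization: stack the columns of the first frontal slice, then the second
x = X.reshape(-1, order="F")          # x in R^8 -> [1, 2, ..., 8]

# Mode-n matricization: the columns of X_(n) are the mode-n fibers of X
def unfold(tensor, mode):
    return np.reshape(np.moveaxis(tensor, mode, 0),
                      (tensor.shape[mode], -1), order="F")

X_mode1, X_mode2, X_mode3 = unfold(X, 0), unfold(X, 1), unfold(X, 2)

# Outer product of a in R^3 and b in R^4 is a 3x4 matrix with entries a_i * b_j
a, b = np.array([1.0, 2.0, 3.0]), np.array([1.0, 0.0, 2.0, 4.0])
A = np.outer(a, b)
```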
KRONECKER PRODUCT

Let $A \in \mathbb{R}^{I \times J}$ and $B \in \mathbb{R}^{K \times L}$ be two matrices. The Kronecker product of $A$ and $B$, denoted $A \otimes B$, is a matrix of dimensions $(A \otimes B) \in \mathbb{R}^{IK \times JL}$. In essence, the Kronecker product is computed by multiplying every element of the first matrix, $A$, by the entire second matrix, $B$.

KHATRI-RAO PRODUCT

The Khatri–Rao product is the columnwise Kronecker product. Consider two (different) matrices $A \in \mathbb{R}^{I \times K}$ and $B \in \mathbb{R}^{J \times K}$. The Khatri–Rao product of $A$ and $B$, denoted $A \odot B$, is a matrix of dimensions $(A \odot B) \in \mathbb{R}^{IJ \times K}$. Here, we are taking the Kronecker product between every pair of corresponding columns of $A$ and $B$. Note that if $A$ and $B$ are themselves column vectors, i.e., $K = 1$, then the Khatri–Rao product is equivalent to the Kronecker product, $A \odot B = A \otimes B$.

iii. Tensor Decomposition

Tensor decompositions are generalizations of matrix factorizations to multidimensional arrays.[20] We introduce one tensor factorization scheme that is important for understanding the setting of our algorithm. In the matricized form, we show that this factorization has useful properties that allow it to be solved with an alternating minimization scheme.

CANDECOMP/PARAFAC (CP) DECOMPOSITION

The objective of the CP decomposition is to express a tensor as a sum of component rank-one tensors, i.e., outer products of vectors, as depicted in FIGURE 3. For example, consider a third-order tensor $\mathbf{X} \in \mathbb{R}^{I \times J \times K}$. We can approximate this tensor as

$$\mathbf{X} \approx \sum_{r=1}^{R} a_r \circ b_r \circ c_r,$$

where "$\circ$" denotes the outer product, $R$ represents the rank (a positive integer), and $a_r \in \mathbb{R}^{I}$, $b_r \in \mathbb{R}^{J}$, and $c_r \in \mathbb{R}^{K}$ for $r = 1, \ldots, R$. We can formalize this decomposition as the following optimization problem:

$$\min_{\widehat{\mathbf{X}}} \; \| \mathbf{X} - \widehat{\mathbf{X}} \| \quad \text{with} \quad \widehat{\mathbf{X}} = \sum_{r=1}^{R} a_r \circ b_r \circ c_r, \qquad (1)$$

where $\widehat{\mathbf{X}}$ represents a low-rank approximation of $\mathbf{X}$. The factor matrices, or CP factors, collect the vectors of the rank-one components as their columns. In the third-order case above, $A = [a_1 \; a_2 \; \cdots \; a_R] \in \mathbb{R}^{I \times R}$ would be an estimated CP factor, and similarly for $B$ and $C$.

FIGURE 3: Graphical representation of the CANDECOMP/PARAFAC decomposition – low-rank approximation of a third-order tensor.

With these definitions and the products defined previously, we can state some useful properties for the third-order case:

$$\widehat{X}_{(1)} = A (C \odot B)^{\top}, \quad \widehat{X}_{(2)} = B (C \odot A)^{\top}, \quad \widehat{X}_{(3)} = C (B \odot A)^{\top}. \qquad (2)$$

These relationships can easily be generalized to $n$-mode tensors, but for the purposes of this paper, $n = 3$ will suffice. We will show how to use these equations in our alternating minimization algorithm in the following sections. There are also other useful tensor factorizations, such as the Tucker decomposition, which is explained in detail in the survey paper.[6]
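As a quick numerical check of the identities in (2), the following sketch builds an exactly rank-$R$ third-order tensor from random factor matrices and verifies the three matricized forms. The dimensions, the `unfold` helper, and the use of SciPy's `khatri_rao` are our own illustrative choices, not code from the paper.

```python
import numpy as np
from scipy.linalg import khatri_rao   # columnwise Kronecker product

def unfold(tensor, mode):
    """Mode-n matricization: columns are the mode-n fibers."""
    return np.reshape(np.moveaxis(tensor, mode, 0),
                      (tensor.shape[mode], -1), order="F")

# Random factor matrices for a rank-R tensor of size I x J x K
I, J, K, R = 4, 5, 6, 3
rng = np.random.default_rng(0)
A, B, C = (rng.standard_normal((d, R)) for d in (I, J, K))

# X = sum_r a_r o b_r o c_r
X = np.einsum("ir,jr,kr->ijk", A, B, C)

# The matricized identities in (2)
assert np.allclose(unfold(X, 0), A @ khatri_rao(C, B).T)
assert np.allclose(unfold(X, 1), B @ khatri_rao(C, A).T)
assert np.allclose(unfold(X, 2), C @ khatri_rao(B, A).T)
```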
iv. Machine Learning Optimization Problems

Many machine learning algorithms can be framed as empirical risk minimization (ERM) problems. The empirical risk is defined in terms of a risk, or loss, function $\ell(\cdot)$. For linear classifiers, the loss of a linear predictor $w$ on the data sample $(x_i, y_i)$ can be written as $\ell(w^{\top} x_i, y_i)$, and the average empirical risk as $\frac{1}{n} \sum_{i=1}^{n} \ell(w^{\top} x_i, y_i)$. We discuss these loss functions for some common classifiers and how we can use them to solve tensor-structured ERM problems.

SUPPORT VECTOR MACHINES

Consider a dataset with $n$ samples, i.e., $\{(x_i, y_i)\}_{i=1}^{n}$, where $y_i \in \{-1, 1\}$. The Support Vector Machine (SVM), or maximum margin linear classifier, is a binary classifier that finds a hyperplane that best separates the data while making minimal margin violations.[4] SVM uses a loss function called the hinge loss, defined by

$$\ell_{\text{hinge}}(w^{\top} x_i, y_i) = \max(0, \, 1 - y_i w^{\top} x_i),$$

where $w$ contains the coefficients of the separating hyperplane. With a penalty (or regularizer), we can mathematically formulate SVM as the following ERM problem:

$$\min_{w} \; \frac{1}{n} \sum_{i=1}^{n} \max(0, \, 1 - y_i w^{\top} x_i) + \lambda \| w \|_2^2. \qquad (3)$$

The regularization term, weighted by $\lambda$, is used to penalize the features, and hence the weights, that do not necessarily contribute to the prediction outcome. Here, we are considering the $\ell_2$ penalty, but there are other regularizers, such as the $\ell_1$ penalty. We use these regularization terms in our loss function to estimate a more accurate model.

LOGISTIC REGRESSION

Similarly, consider a dataset with $n$ samples, i.e., $\{(x_i, y_i)\}_{i=1}^{n}$, where $y_i \in \{-1, 1\}$. The objective of Logistic Regression (LOGIT) is the same as that of SVM, with a different loss function called the logistic loss, defined by

$$\ell_{\text{logit}}(w^{\top} x_i, y_i) = \log\!\big(1 + \exp(-y_i w^{\top} x_i)\big).$$

The logistic loss is derived from the sigmoid function. With a regularization term, we can define Logistic Regression as the following ERM problem:

$$\min_{w} \; \frac{1}{n} \sum_{i=1}^{n} \log\!\big(1 + \exp(-y_i w^{\top} x_i)\big) + \lambda \| w \|_2^2. \qquad (4)$$

We only introduce the objective functions of these two classifiers, as we will construct the CP-structured algorithms from these functions in the following section. Note that we do not include the bias term in our hyperplane equation, as it can be absorbed into $w$ by appending a constant entry to each data sample.

3 PROBLEM FORMULATION

In this section, we propose our tensor-based classifiers in the form of an ERM framework. In general, we structure our linear predictors to admit a CP decomposition, which we can then reconstruct to make classifications. We also discuss the metrics that we will use to evaluate the performance of our models.

i. CANDECOMP/PARAFAC Structured Classifiers

SUPPORT VECTOR MACHINES

Consider a dataset $\{(\mathbf{X}_i, y_i)\}_{i=1}^{n}$, where $\mathbf{X}_i \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ denotes a tensor data sample with $y_i \in \{-1, 1\}$. By imposing the constraints from (1) onto the predictor of (3), i.e., requiring the predictor to take the CP form $\mathbf{W} = \sum_{r=1}^{R} W_1^{(r)} \circ W_2^{(r)} \circ \cdots \circ W_N^{(r)}$, where $W_j^{(r)}$ denotes the $r$-th column of the factor matrix $W_j$, we can formulate the following optimization problem:

$$\min_{W_1, \ldots, W_N} \; \frac{1}{n} \sum_{i=1}^{n} \max\!\big(0, \, 1 - y_i \langle \mathbf{W}, \mathbf{X}_i \rangle\big) + \lambda \, \mathcal{R}(W_1, \ldots, W_N), \qquad (5)$$

where $\mathcal{R}(\cdot)$ denotes the regularization term applied to the CP factors. The traditional ERM problem for SVM in (3) solves for one vector predictor of dimension $w \in \mathbb{R}^{I_1 I_2 \cdots I_N}$. The problem in (5), which we call "CP-SVM," solves for $N$ matrix-valued predictors of dimensions $W_j \in \mathbb{R}^{I_j \times R}$, for $j = 1, \ldots, N$. As a concrete example, let each tensor sample have dimensions $\mathbf{X}_i \in \mathbb{R}^{5 \times 5 \times 5}$ and let $R = 3$. The traditional problem would solve for $5 \times 5 \times 5 = 125$ coefficients, whereas the structured problem would solve for $3 \times (5 \times 3) = 45$ coefficients. As the dimensions increase, the structured problem substantially reduces the number of parameters/coefficients to be estimated.

LOGISTIC REGRESSION

Similarly, consider a dataset $\{(\mathbf{X}_i, y_i)\}_{i=1}^{n}$, where $\mathbf{X}_i \in \mathbb{R}^{I_1 \times \cdots \times I_N}$ denotes a tensor data sample with $y_i \in \{-1, 1\}$. By imposing the constraints from (1) onto the predictor of (4), we can solve the following ERM problem:

$$\min_{W_1, \ldots, W_N} \; \frac{1}{n} \sum_{i=1}^{n} \log\!\big(1 + \exp(-y_i \langle \mathbf{W}, \mathbf{X}_i \rangle)\big) + \lambda \, \mathcal{R}(W_1, \ldots, W_N). \qquad (6)$$

This new framework, which we call "CP-LOGIT," solves for fewer parameters, similar to CP-SVM. In practice, we solve for the weights using numerical optimization methods such as gradient descent. However, solving for the weights in this new CP-structured paradigm is a non-trivial task. In order to solve for the coefficients in (5) and (6), we adopt an alternating minimization algorithm similar to the block relaxation algorithm proposed by Zhou et al.[21] At each iteration, we update the block $W_n$ while keeping the rest of the blocks fixed. To see this, when updating $W_n \in \mathbb{R}^{I_n \times R}$, we can rewrite the inner product in (5) and (6) with the properties mentioned in (2):

$$\Big\langle \sum_{r=1}^{R} W_1^{(r)} \circ \cdots \circ W_N^{(r)}, \, \mathbf{X}_i \Big\rangle = \big\langle W_n, \, X_{i(n)} (W_N \odot \cdots \odot W_{n+1} \odot W_{n-1} \odot \cdots \odot W_1) \big\rangle,$$

where $X_{i(n)}$ denotes the mode-$n$ matricization of $\mathbf{X}_i$. This alternating minimization algorithm is summarized in ALGORITHM 1, in which the objective is the ERM problem to be minimized, $\theta$ represents the collection of all parameters, and $\lambda$ is the regularization parameter. The parameter $\lambda$ was tuned by hand, but it can also be determined through cross-validation.

To understand the CP-structured algorithm, consider the loss function in (5) with $N = 3$. When updating $W_2$, we rewrite the inner product $\langle \sum_{r=1}^{R} W_1^{(r)} \circ W_2^{(r)} \circ W_3^{(r)}, \mathbf{X}_i \rangle$ as $\langle W_2, X_{i(2)} (W_3 \odot W_1) \rangle$. Note that this equation follows from the properties of tensor algebra shown in (2). We repeat this update for all the factor matrices until the stopping criterion is met.
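As a concrete illustration of the alternating scheme, the following is a minimal sketch of how CP-LOGIT could be fit on third-order samples, using SciPy's general-purpose solver for each block update. The function name, the $\ell_2$ penalty on the active block, the fixed number of outer sweeps, and the random initialization are illustrative choices, not the authors' exact settings in ALGORITHM 1.

```python
import numpy as np
from scipy.linalg import khatri_rao
from scipy.optimize import minimize

def unfold(tensor, mode):
    return np.reshape(np.moveaxis(tensor, mode, 0),
                      (tensor.shape[mode], -1), order="F")

def cp_logit_fit(X, y, rank, lam=0.01, n_iters=20, seed=0):
    """Alternating minimization for a CP-structured logistic classifier.

    X : (n, I1, I2, I3) array of third-order tensor samples
    y : (n,) array of labels in {-1, +1}
    Returns the factor matrices [W1, W2, W3], with Wk of shape (Ik, rank).
    """
    n, dims = X.shape[0], X.shape[1:]
    rng = np.random.default_rng(seed)
    W = [rng.standard_normal((d, rank)) for d in dims]

    def block_objective(wk_flat, k, Z):
        # By (2), Z[i] = X_i(k) (.. ⊙ ..) satisfies <W_full, X_i> = <Wk, Z[i]>
        Wk = wk_flat.reshape(dims[k], rank)
        scores = np.einsum("dr,ndr->n", Wk, Z)
        loss = np.mean(np.logaddexp(0.0, -y * scores))   # logistic loss
        return loss + lam * np.sum(Wk ** 2)              # assumed l2 penalty

    for _ in range(n_iters):
        for k in range(3):
            rest = [W[j] for j in range(3) if j != k]
            kr = khatri_rao(rest[1], rest[0])            # e.g. k=1 -> W3 ⊙ W1
            Z = np.stack([unfold(X[i], k) @ kr for i in range(n)])
            res = minimize(block_objective, W[k].ravel(), args=(k, Z),
                           method="L-BFGS-B")
            W[k] = res.x.reshape(dims[k], rank)
    return W
```

Each block update here is a small smooth problem of size $I_k \times R$, which is what makes the alternating strategy attractive in practice.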
The alternating minimization algorithm is useful for several reasons. First, in practice, this algorithm almost always converges to at least a local minimum.[1,20,21] To find the best solution, the algorithm can be run several times with different initial factor matrices. Second, the low-rank optimization problem over the factor matrices is non-convex.[2] Thus, the problem is difficult to solve using common unconstrained solvers, such as gradient descent. In the literature, there are two ways to handle the non-convexity of this optimization problem. One way is to relax the rank constraint by adding a convex regularization term that induces low rank (e.g., the trace norm or nuclear norm).[16,19] The other is to employ this alternating minimization algorithm, since the optimization over one factor matrix, while the others are held fixed, is convex. We chose to explore this procedure following Zhou et al.,[21] as the algorithm is straightforward to implement using statistical software such as MATLAB or Python.

ii. Performance Metrics

We evaluate the performance of our models using several measures at different sample sizes. The following four metrics help determine the "closeness" between the true and estimated predictors.

1. The Mean Squared Error (MSE) in (7) is computed over the $n$ data samples for the true predictor $\mathbf{W}$, where $\widehat{\mathbf{W}}$ is the estimated predictor obtained from solving the ERM problem.

2. The cosine distance (or similarity)[12] for the true predictor $\mathbf{W}$ and the reconstructed predictor $\widehat{\mathbf{W}}$ is computed as

$$\cos(\theta) = \frac{\langle \operatorname{vec}(\mathbf{W}), \operatorname{vec}(\widehat{\mathbf{W}}) \rangle}{\| \mathbf{W} \|_F \, \| \widehat{\mathbf{W}} \|_F}. \qquad (8)$$

Mathematically, the cosine similarity measures the cosine of the angle between two vectors in an $n$-dimensional space. As the angle $\theta$ between the two vectors becomes smaller, the cosine similarity approaches a value of 1. As the vectors become farther apart (approaching perpendicular), the cosine similarity approaches a value of 0.

3. The reconstruction error in (9) measures the discrepancy between the true predictor $\mathbf{W}$ and the estimated tensor predictor $\widehat{\mathbf{W}}$, where $\| \cdot \|_F$ denotes the Frobenius norm, a matrix generalization of the Euclidean ($\ell_2$) norm.

4. The classification accuracy for $n$ test samples is simply the fraction of correct predictions:

$$\text{accuracy} = \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\{\hat{y}_i = y_i\}. \qquad (10)$$

After solving for $\widehat{\mathbf{W}}$, we make predictions on test data and compare the predicted labels $\hat{y}_i$ to the true labels $y_i$. Before comparing the labels, we use the sign function to quantize our predictions to $\hat{y}_i \in \{-1, 1\}$.

4 EXPERIMENTS

We used two types of data for our experiments: synthetic data and the Modified National Institute of Standards and Technology (MNIST) database.[9] The MNIST database is a benchmark dataset used widely in machine learning that consists of 60,000 samples of handwritten digits from 0 to 9. The objective of both experiments is to compare the performance of the CP-structured algorithms and the traditional algorithms, which were implemented using the software packages TensorLy[7] and SciPy.[17] For all experiments, we use a Python environment on a MacBook Pro with a 2.2 GHz Intel Core i7 and 16 GB of RAM.
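For reference, here is one way the four metrics could be implemented. The MSE (7) and reconstruction error (9) below reflect our own reading of their descriptions (a prediction-score MSE and a relative Frobenius-norm error) and should be treated as assumptions; the cosine similarity and accuracy follow the standard definitions in (8) and (10).

```python
import numpy as np

def mse(W_true, W_hat, X):
    # Assumed form of (7): average squared gap between the scores <W, X_i>
    # produced by the true and estimated predictors over the n samples.
    s_true = np.array([np.sum(W_true * Xi) for Xi in X])
    s_hat = np.array([np.sum(W_hat * Xi) for Xi in X])
    return float(np.mean((s_true - s_hat) ** 2))

def cosine_similarity(W_true, W_hat):
    # Cosine of the angle between the vectorized predictors, as in (8).
    w, w_hat = W_true.ravel(), W_hat.ravel()
    return float(w @ w_hat / (np.linalg.norm(w) * np.linalg.norm(w_hat)))

def reconstruction_error(W_true, W_hat):
    # Assumed form of (9): relative Frobenius-norm error of the estimate.
    return float(np.linalg.norm(W_true - W_hat) / np.linalg.norm(W_true))

def classification_accuracy(W_hat, X_test, y_test):
    # Fraction of test samples whose sign-quantized prediction matches (10).
    y_hat = np.sign([np.sum(W_hat * Xi) for Xi in X_test])
    return float(np.mean(y_hat == y_test))
```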
i. Synthetic Data

For the synthetic data, we generated univariate $y_i$ responses at different sample sizes according to the model in (11), in which each entry of $\mathbf{X}_i$ is drawn independently and identically distributed (iid) from $\mathcal{N}(0,1)$, $\epsilon$ is a noise term drawn iid from $\mathcal{N}(0,1)$, and $\mathbf{W}$ is the fixed predictor shown in FIGURE 4. The objective was to observe whether our models defined in (5) and (6) can identify the true signal $\mathbf{W}$ given $(\mathbf{X}_i, y_i)$.

PERFORMANCE COMPARISON

To measure the "closeness" and classification accuracy between the true model and the predicted model, we use the performance metrics defined in (7)–(10). We compute these metrics at different sample sizes and show that, as the number of samples increases, the performance of the traditional vector approach converges to the performance of the CP-structured model. These results are displayed in FIGURES 5 and 6. In FIGURE 5, we can visually see that our method recovers the true predictor more accurately. For example, in the case of $n = 500$ in ROW 1, the "cross" figure is more accurately portrayed by the CP method (RIGHT) than by the traditional method (MIDDLE). This allows us to make more accurate predictions, as the estimated weights more closely follow the true weights.

In FIGURE 6, we can see that the MSE for both algorithms is roughly the same across all sample sizes. For the cosine distance, the CP-structured algorithm approaches a value of 1 very quickly, which indicates a strong similarity between the estimated and true coefficients. The reconstruction error and classification accuracy both show gaps between the two methods, but the gaps lessen as the sample size increases. We can conclude that these results depend on the sample size, as more samples narrow down the set of hyperplanes that separate the data, yielding coefficients closer to the true model. Based on the trends of the graphs in FIGURE 6, we also hypothesize that if the variance of the noise ($\epsilon$) distribution were higher, the CP-structured algorithms would still perform better than the traditional method.

RESULTS WITH PCA

The CP-structured algorithm significantly reduces the number of predictors to be estimated. To solve for fewer coefficients using the traditional method, we can perform Principal Component Analysis (PCA) on the dataset before running the algorithm. We apply PCA to the data with an energy capture of 95%, which reduces the number of coefficients from 225 to 189. However, even with this minimal reduction, we can see in FIGURE 6 that there is a notable decrease in performance across most metrics. The MSE seems unaffected, but the other three metrics begin to show a gap between the traditional method with PCA and both the traditional method without PCA and the CP-structured algorithm. A possible explanation for this phenomenon is that PCA does not capture tensor data efficiently in a lower-dimensional space. If we were to decrease the energy capture, the gap in performance would grow larger even for a bigger sample size. We predict that as the dimensionality of the data increases, PCA would not be an efficient feature learning method for parameter reduction, favoring the CP-structured methods.

FIGURE 4: Two 15 × 15 images used as true predictors $\mathbf{W}$ to generate synthetic data.

FIGURE 5: Reconstructed predictors from both algorithms: LEFT – true predictor, MIDDLE – reconstructed predictor from the traditional method with increasing sample sizes ($n = 500, 1000, 1500$), RIGHT – reconstructed predictor from CP-structured Logistic Regression with increasing sample sizes ($n = 500, 1000, 1500$).

FIGURE 6: Variation of performance (y-axis) with different sample sizes (x-axis) for SVM and LOGIT. COLUMNS 1–4 show plots for the MSE, Cosine Distance, Reconstruction Error, and Classification Accuracy, respectively. ROWS 1–2: performance metrics for LOGIT with the cross and square predictors, respectively. ROWS 3–4: performance metrics for SVM with the cross and square predictors, respectively. The cross and square predictors are as shown in FIGURE 4.
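The sketch below assumes one natural reading of the data-generating model (11), namely $y_i = \operatorname{sign}(\langle \mathbf{W}, \mathbf{X}_i \rangle + \epsilon_i)$, and the hard-coded cross-shaped predictor is only a stand-in for the images in FIGURE 4; both are assumptions for illustration.

```python
import numpy as np

def make_synthetic(W_true, n, noise_std=1.0, seed=0):
    """Generate n samples with iid N(0,1) entries and labels in {-1, +1}.

    Assumes model (11) has the form y_i = sign(<W, X_i> + eps_i).
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n,) + W_true.shape)    # entries of X_i ~ N(0, 1)
    eps = noise_std * rng.standard_normal(n)        # noise term ~ N(0, 1)
    scores = X.reshape(n, -1) @ W_true.ravel()      # <W, X_i> for each sample
    y = np.sign(scores + eps)
    y[y == 0] = 1.0                                 # break ties away from zero
    return X, y

# Illustrative 15 x 15 "cross" predictor, loosely mimicking FIGURE 4
W = np.zeros((15, 15))
W[6:9, :] = 1.0
W[:, 6:9] = 1.0
X_train, y_train = make_synthetic(W, n=500)
```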
ii. MNIST Data

The objective of the MNIST experiment was to observe which algorithm would be more efficient when the true predictor exhibits an "approximate" low-rank structure. In the previous experiment, the two images used as true predictors had an exact low-rank structure, as each could easily be written as a product of two low-rank factor matrices. Similar to the synthetic data setup, we generated univariate $y_i$ responses with sample size $n = 750$ using the model defined in (11). However, for the true predictor $\mathbf{W}$, we chose a "1" from the MNIST dataset, as it exhibits an "approximate" low-rank structure. We compared the CP-structured algorithms to the traditional algorithms using different rank values. These results are shown in TABLE 1.

PERFORMANCE COMPARISON

We use the same performance metrics defined for the previous experiment and display the results in TABLE 1. From this table, we can conclude that both CP-structured algorithms gave the most favorable results when the CP rank was 2. This shows that we can approximate a "1" from the MNIST dataset with matrices of rank 2. However, in all cases from rank 1 to 3, the structured algorithms gave more favorable results than the traditional algorithms. This goes to show that if the true predictor exhibits an approximate low-rank structure, it may be beneficial to use the structured algorithms for classification.

METHOD          MSE       COS DISTANCE   RECONSTRUCTION ERROR
SVM             0.00128   0.51832        0.00053
CP-SVM (R=1)    0.00088   0.66727        0.00044
CP-SVM (R=2)    0.00026   0.90259        0.00024
CP-SVM (R=3)    0.00039   0.85099        0.00029
LOGIT           0.00120   0.54759        0.00051
CP-LOGIT (R=1)  0.00089   0.66483        0.00044
CP-LOGIT (R=2)  0.00028   0.89590        0.00024
CP-LOGIT (R=3)  0.00033   0.87807        0.00026

TABLE 1: Performance metrics for the traditional and CP-structured algorithms on the MNIST experiment, where $R$ represents the rank of the CP-structured algorithm. For both methods, the best value of every metric is achieved at $R = 2$.
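The claim that a "1" digit is approximately rank 2 can be checked directly: for a matrix, a rank-$R$ CP decomposition is simply a rank-$R$ factorization, so the truncated SVD gives the best achievable reconstruction at each rank. The sketch below assumes `digit_one` is a $28 \times 28$ MNIST image loaded elsewhere; it is an illustration, not the experiment in TABLE 1.

```python
import numpy as np

def relative_rank_r_error(image, r):
    """Relative Frobenius error of the best rank-r approximation (truncated SVD)."""
    U, s, Vt = np.linalg.svd(image, full_matrices=False)
    approx = (U[:, :r] * s[:r]) @ Vt[:r, :]
    return float(np.linalg.norm(image - approx) / np.linalg.norm(image))

# `digit_one` is assumed to be a 28 x 28 array containing a "1" from MNIST;
# the error should drop sharply by rank 2 for such a digit.
# for r in (1, 2, 3):
#     print(r, relative_rank_r_error(digit_one, r))
```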
5 CONCLUSION

In this paper, we explored tensor-based classification models using a tensor decomposition. We proposed two algorithms that impose a CANDECOMP/PARAFAC factorization structure on the predictors of traditional classification algorithms: Support Vector Machines and Logistic Regression. Imposing this structure on the traditional algorithms allowed us to exploit the structure of the data, enabling efficient learning with fewer parameters. We showed with different performance metrics that our proposed method increased accuracy and, overall, produced a more accurate estimate of the weights. The experiments showed that the CP algorithms performed best when the true predictor had either an approximate or an exact low-rank structure. We also showed that solving for fewer parameters using PCA compromised the performance of the traditional method. We predict that PCA would not generalize well to data with multidimensional structure, favoring the CP-structured algorithms. However, we believe it would be interesting if one could show when PCA could be better than using the CP structure. This could possibly be the case when the data in question is known to be linear, as PCA is a linear feature learning method. One potential example is using structured data for prediction when it is known a priori that the features have a linear relationship. However, due to time constraints, we were not able to explore this possibility in detail. We also think it would be interesting to test these algorithms on more datasets. In addition, we believe an exciting direction for future research is to exploit tensor decompositions in other learning problems, such as deep learning. However, it is not clear how one would approach this problem, as deep learning algorithms have non-convex loss functions. We leave this to the reader to investigate in future work.∎

6 REFERENCES

[1] Bezdek, J., & Hathaway, R. (2003). Convergence of alternating optimization. Neural, Parallel & Scientific Computations, 11, 351–368.
[2] Candès, E. J., & Recht, B. (2009). Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6), 717–772.
[3] de Luis-García, R., Westin, C.-F., & Alberola-López, C. (2010). Gaussian mixtures on tensor fields for segmentation: Applications to medical imaging. Computerized Medical Imaging and Graphics, 35(1), 16–30.
[4] James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical Learning with Applications in R (1st ed.). Springer New York.
[5] Jia, C., Kong, Y., Ding, Z., & Fu, Y. (2014). Latent tensor transfer learning for RGB-D action recognition. Proceedings of the 22nd ACM International Conference on Multimedia, 87–96.
[6] Kolda, T. G., & Bader, B. W. (2009). Tensor decompositions and applications. SIAM Review, 51(3), 455–500.
[7] Kossaifi, J., Panagakis, Y., Anandkumar, A., & Pantic, M. (2019). TensorLy: Tensor learning in Python. Journal of Machine Learning Research, 20.
[8] Kreimer, N., Stanton, A., & Sacchi, M. D. (2013). Tensor completion based on nuclear norm minimization for 5D seismic data reconstruction. Geophysics, 78(6), V273–V284.
[9] Lecun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.
[10] Li, X., Xu, D., Zhou, H., & Li, L. (2018). Tucker tensor regression and neuroimaging analysis. Statistics in Biosciences, 10(3), 520–545.
[11] Li, Y., Zhu, H., Shen, D., Lin, W., Gilmore, J. H., & Ibrahim, J. G. (2011). Multiscale adaptive regression models for neuroimaging data. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 73(4), 559–578.
[12] Nguyen, H. V., & Bai, L. (2011). Cosine similarity metric learning for face verification. In Computer Vision – ACCV 2010 (Vol. 6493, pp. 709–720). Springer Berlin Heidelberg.
[13] O'Donnell, L. J., & Westin, C.-F. (2011). An introduction to diffusion tensor image analysis. Neurosurgery Clinics of North America, 22(2), 185–196.
[14] Rabanser, S., Shchur, O., & Günnemann, S. (2017). Introduction to tensor decompositions and their applications in machine learning.
[15] Tan, X., Zhang, Y., Tang, S., Shao, J., Wu, F., & Zhuang, Y. (2013). Logistic tensor regression for classification. In Intelligent Science and Intelligent Data Engineering (Vol. 7751, pp. 573–581). Springer Berlin Heidelberg.
[16] Tomioka, R., & Suzuki, T. (2013). Convex tensor decomposition via structured Schatten norm regularization.
[17] Virtanen, P., Gommers, R., Oliphant, T. E., Haberland, M., Reddy, T., Cournapeau, D., Burovski, E., Peterson, P., Weckesser, W., Bright, J., van der Walt, S. J., Brett, M., Wilson, J., Millman, K. J., Mayorov, N., Nelson, A. R. J., Jones, E., Kern, R., Larson, E., … Moore, E. W. (2020). SciPy 1.0: Fundamental algorithms for scientific computing in Python. Nature Methods, 17(3), 261–272.
[18] Guo, W., Kotsia, I., & Patras, I. (2012). Tensor learning for regression. IEEE Transactions on Image Processing, 21(2), 816–827.
[19] Wimalawarne, K., Tomioka, R., & Sugiyama, M. (2016). Theoretical and experimental analyses of tensor-based regression and classification. Neural Computation, 28(4), 686–715.
[20] Wright, S. J. (2015). Coordinate descent algorithms. Mathematical Programming, 151(1), 3–34.
[21] Zhou, H., Li, L., & Zhu, H. (2013). Tensor regression with applications in neuroimaging data analysis. Journal of the American Statistical Association, 108(502), 540–552.

Soo Min Kwon is currently pursuing an M.S. degree in the Department of Electrical and Computer Engineering at Rutgers, The State University of New Jersey. His research interests broadly lie in optimization, multidimensional (tensor) data analysis, differential privacy, and distributed learning. He earned his B.S. degree in Electrical and Computer Engineering from Rutgers University in 2020. He completed his undergraduate thesis under the supervision of Prof. Anand D. Sarwate on tensor-based machine learning algorithms. Soo Min hopes to pursue a Ph.D. upon completion of his M.S. degree.