INTERNATIONAL JOURNAL OF COMPUTERS COMMUNICATIONS & CONTROL
Online ISSN 1841-9844, ISSN-L 1841-9836, Volume: 16, Issue: 4, Month: August, Year: 2021
Article Number: 4217, https://doi.org/10.15837/ijccc.2021.4.4217
CCC Publications

Regression Loss in Transformer-based Supervised Neural Machine Translation
D.X. Li, Z.Y. Luo

Dongxing Li
School of Artificial Intelligence, Beijing Normal University, Beijing 100875, China
lidx@bnu.edu.cn

Zuying Luo*
School of Artificial Intelligence, Beijing Normal University, Beijing 100875, China
*Corresponding author: luozy@bnu.edu.cn

Abstract
The Transformer-based model has achieved human-level performance in supervised neural machine translation (SNMT), far surpassing models based on recurrent neural networks (RNNs) or convolutional neural networks (CNNs). The original Transformer-based model is trained through maximum likelihood estimation (MLE), which regards the machine translation task as a multi-label classification problem and takes the sum of the cross entropy losses of all the target tokens as the loss function. However, this model assumes that token generation is partially independent, without recognizing that tokens are the components of a sequence. To solve the problem, this paper proposes a semantic regression loss for Transformer training, treating the generated sequence as a whole. Upon finding that the semantic difference is proportional to the candidate-reference distance, the authors considered the machine translation problem as a multi-task problem, and took the linear combination of cross entropy loss and semantic regression loss as the overall loss function. The semantic regression loss was shown to significantly enhance SNMT performance, with a slight reduction in convergence speed.
Keywords: supervised neural machine translation (SNMT), Transformer, attention mechanism, semantic regression loss, evaluation metric.

1 Introduction

Relying on sequence-to-sequence deep neural networks (DNNs), supervised neural machine translation (SNMT) aims to automatically convert a sequence from one language to another, with true sequence pairs as inputs [12, 31]. The most prevalent SNMT approach employs the encoder-to-decoder structure, which encodes the source sequence into a context representation in one neural network and generates the target sequence in the other [4]. Notably, the two neural networks are trained simultaneously in an end-to-end fashion. In addition, the current neural machine translation (NMT) systems mostly adopt the attention mechanism [1, 12, 23, 32].

There are generally three model architectures for training the NMT neural network. The earliest architecture is the recurrent neural network (RNN), which faces problems like vanishing gradients, exploding gradients, and long-range dependency. To solve the first two problems, many improved RNNs have been designed, including long short-term memory (LSTM) proposed by Hochreiter and Schmidhuber [14], gated recurrent units (GRU) proposed by Cho et al. [4], and bidirectional LSTM (BiLSTM) proposed by Schuster and Paliwal [27]. To address the long-range dependency problem, the convolutional neural network (CNN) was introduced by Gehring et al. [11, 12], in which a succession of convolutional layers captures the dependencies among a few tokens (phrases) and concatenates the local dependency representations into the sequence representation. Recent years have seen the emergence of a novel and competitive model called Transformer for NMT [32].
Solely based on the attention mechanism, Transformer uses a self-attention network (SAN) to compute the mutual relationship scores of all the tokens within the source sequence or the target sequence. Hassan et al. proved that Transformer can achieve human-level performance for some language pairs [13]. Therefore, Transformer-based architectures have been widely used in the field of NMT [20, 32, 33].

The NMT performance is mainly affected by the following factors: network architecture, optimization algorithm, loss function, and evaluation metric. The original Transformer uses the Adam optimizer, the cross entropy loss function, and the metric of bilingual evaluation understudy (BLEU) [26]. This Transformer-based model is trained through maximum likelihood estimation (MLE) to learn the conditional probability distribution of the target token step by step, which can be regarded as a token-level target [29]. However, the model training only focuses on the loss of the target token, and hardly pays attention to the semantic loss of the globally generated sequence. Besides, the sum of the probability distribution losses is calculated under the assumption that token generation is partially independent, without recognizing that tokens are the components of a sequence. Therefore, it is reasonable to include the semantic regression loss of the global target sequence in training. To disclose sentence-level influences, most scholars have resorted to ensemble learning approaches to improve translation quality [5, 24, 34], such as the bag-of-words (BOW) model [24] and machine learning (ML) [5]. The simple BOW model only pays attention to token frequency, failing to consider token order and sequence semantics. Meanwhile, ML models like the supervised ML used by Cohn-Gordon and Goodman [5] and the reinforcement learning (RL) used by Wu et al. [34] would greatly complicate Transformer training.

This paper improves the Transformer and uses it as the basic learning network. Compared with the original Transformer proposed by Vaswani et al. [32], the improved Transformer has the following unique features: (1) the optimizer is AdamW, proposed by Loshchilov and Hutter [22]; (2) the evaluation metrics are translation edit rate (TER) proposed by Snover et al. [30], metric for evaluation of translation with explicit ordering (METEOR) proposed by Banerjee and Lavie [2], recall-oriented understudy for gisting evaluation-longest common subsequence (ROUGE-L) proposed by Lin [21], as well as BLEU [26]; (3) the loss function innovatively computes the semantic loss of the global sequence between the candidate and the reference, in addition to the cross entropy loss of all the point-wise tokens. That is, the NMT problem is treated as a regression problem from the perspective of the global sequence, in addition to the multi-label classification problem from the perspective of the tokens.

The improved Transformer was compared with the original Transformer through NMT experiments on three evaluation datasets. On dataset IWSLT2014 DE->EN, the improved Transformer outperformed the original Transformer by 0.55 BLEU/-2.07 TER/0.20 ROUGE-L, and by 0.49 BLEU/-2.70 TER/0.60 ROUGE-L, with the transformer-small configuration and the transformer-base configuration, respectively. On dataset IWSLT2016 DE->EN, the improved Transformer outperformed the original Transformer by 0.51 BLEU/-0.66 TER/0.78 METEOR/0.68 ROUGE-L, and by 0.56 BLEU/-0.07 TER/0.37 METEOR/0.87 ROUGE-L, with the transformer-small configuration and the transformer-base configuration, respectively.
On dataset WMT17 EN->DE, the improved Transformer outperformed the original Transformer by 0.86 BLEU/-1.08 TER/0.68 METEOR/0.80 ROUGE-L, and by 1.21 BLEU/-1.14 TER/0.62 METEOR/0.83 ROUGE-L, with the transformer-base configuration and the transformer-big configuration, respectively.

The remainder of this paper is organized as follows: Section 2 introduces the preliminaries of this work, including three SNMT architectures, attention mechanisms, and the training strategy, with particular emphasis on Transformer attention mechanisms; Section 3 puts forward our model, highlighting the realization of the novel loss function; Section 4 carries out a series of experiments, explains the selection of some hyper-parameters, and describes the implementation of our model; Section 5 summarizes our work and looks forward to future research.

2 Preliminaries

2.1 Transformer architecture

Transformer is implemented entirely on the basis of the attention mechanism, whose maximum path length and minimum number of sequential operations are both O(1) [32]. Consisting of stacked encoder layers and decoder layers, this model architecture can capture long-range dependencies more directly than RNNs and CNNs. An encoder layer contains a multi-head self-attention layer, followed by a position-wise feedforward layer. Both layers are wrapped with a residual connection and a layer normalization layer. Multi-head self-attention is implemented by multiple SANs via a linear transformation [35]. The layers of an encoder can be summarized in sequence as: self-attention -> residual connection -> layer normalization -> feed-forward -> residual connection -> layer normalization. The encoder layers compute the hidden representations of all the tokens in the source sequence.

A decoder layer has a similar structure to the encoder layer, except that an encoder-decoder attention layer is inserted, which follows the multi-head self-attention layer. The insertion aims to compute the attention scores between the hidden representations of the source sequence and the target token representation. The layers of a decoder can be summarized in sequence as: self-attention -> residual connection -> layer normalization -> encoder-decoder attention -> residual connection -> layer normalization -> feed-forward -> residual connection -> layer normalization.

Transformer combines the merits and eliminates the defects of RNN-based and CNN-based NMT. It relies on an encoder multi-head self-attention network and a decoder multi-head self-attention network to capture the token dependencies in the source sequence and the target sequence, respectively. However, the SAN itself fails to consider the token order. To remedy this issue, the relative or absolute positions of the tokens in the sequence are added to the corresponding token embeddings via trigonometric functions.

2.2 Attention mechanisms in Transformer

Self-attention. The Transformer architecture relies entirely on two attention mechanisms: self-attention and encoder-decoder attention. During training, Transformer employs the SAN to compute the attention scores between one token and another within the encoder and the decoder, respectively. Notably, the decoder computes the attention scores between the current token and its previous tokens, masking the subsequent tokens. Meanwhile, it applies encoder-decoder attention to compute the attention scores between the target token and all the source tokens.
These attention mechanisms operate on an input sequence X = (x_1, x_2, ..., x_n), where n is its length and x_i ∈ R^{d_x} is the i-th token. The goal is to obtain a new feature matrix O = (o_1, o_2, ..., o_n) of the same dimension as X, where o_i ∈ R^{d_o} is the token representation of x_i. Here, d_x and d_o represent the dimensions of the input and the latent representation, respectively, which are usually equal to the model dimension. Then, the i-th hidden embedding o_i can be calculated by:

o_i = \sum_{j=1}^{n} \alpha_{ij}(x_j W^V), \quad \alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{n}\exp(e_{ik})}, \quad e_{ij} = \frac{(x_i W^Q)(x_j W^K)^T}{\sqrt{d_o}}    (1)

where W^Q ∈ R^{d_x×d_q}, W^K ∈ R^{d_x×d_k}, W^V ∈ R^{d_x×d_v} (d_q = d_k) are learned parameter matrices; e_{ij} is the scaled dot-product attention score; α_{ij} is the attention score obtained by a softmax function, subject to \sum_{j=1}^{n} \alpha_{ij} = 1. Similarly, the representation matrix at the sequence level can be obtained by:

O = \mathrm{softmax}\Big(\frac{QK^T}{\sqrt{d_o}} + MASK\Big)V, \quad Q = XW^Q, \; K = XW^K, \; V = XW^V    (2)

where O is the attention head computed on the three matrices Q ∈ R^{n×d_q}, K ∈ R^{n×d_k}, V ∈ R^{n×d_v}, whose dimensions are d_q, d_k and d_v (d_q = d_k = d_v = d_o = d_model), respectively. These matrices are packed together from the queries, keys, and values which constitute the input. For the encoder, MASK ∈ R^{n×n} is named the padding mask and is used to align a batch of examples. Its elements are 0 or −∞, corresponding to non-padding or padding elements in X, respectively: 0 indicates that the tokens can attend to each other, and −∞ means the opposite. For the decoder, MASK ∈ R^{m×m} is a triangular matrix whose elements are 0 on and below the diagonal and −∞ elsewhere. This means that a target token can attend to itself and its previous tokens, but cannot attend to the subsequent tokens.

Encoder-Decoder Attention. This attention mainly computes the representation of the target token in the decoder. On the top layer of the encoders, there are three feature matrices of the source sequence X: Q ∈ R^{n×d_q}, K ∈ R^{n×d_k}, V ∈ R^{n×d_v}. Let m be the length of the target sequence X′, and Q′ ∈ R^{m×d_q}, K′ ∈ R^{m×d_k}, V′ ∈ R^{m×d_v} be its hidden states. Then, the final hidden state of the generated sequence can be calculated by:

O′ = \mathrm{softmax}\Big(\frac{Q′K^T}{\sqrt{d_o}} + MASK\Big)V    (3)

where the row vector of MASK ∈ R^{m×n} is the same as that of the padding mask in the encoder.

Multi-head Self-attention. Transformer actually employs h attention heads (h = 8 in the original base implementation [32]). This multi-head mechanism has a great advantage: the model is allowed to capture the token representations in different vector subspaces. Many experiments have shown that different heads can also capture different linguistic information [32]: some subspaces contain syntactic information and some contain semantic information. Under the multi-head mechanism, Q, K, V are first split into h parts, each head computes its own representation, and all the representations are concatenated as the final output:

Q_i, K_i, V_i = \mathrm{split}(Q, K, V)
head_i = \mathrm{Attention}(Q_i, K_i, V_i)
\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(head_1, head_2, \cdots, head_h)W^O    (4)
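To make formulas (1)-(4) concrete, the following minimal NumPy sketch implements masked scaled dot-product attention and the multi-head split/concat. The toy dimensions, random weights, and function names are illustrative assumptions; the paper's actual system was implemented in TensorFlow 1.14.

import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q: (n, d_q), K: (m, d_k), V: (m, d_v); mask: (n, m) of 0 / -inf."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # e_ij of formula (1)
    if mask is not None:
        scores = scores + mask                # padding or causal mask
    alpha = softmax(scores, axis=-1)          # alpha_ij, each row sums to 1
    return alpha @ V                          # O of formula (2)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h, mask=None):
    """Split Q, K, V into h heads, attend per head, concatenate (formula (4))."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    heads = []
    for Qi, Ki, Vi in zip(np.split(Q, h, axis=-1),
                          np.split(K, h, axis=-1),
                          np.split(V, h, axis=-1)):
        heads.append(scaled_dot_product_attention(Qi, Ki, Vi, mask))
    return np.concatenate(heads, axis=-1) @ W_O

# Toy usage: n = 5 tokens, d_model = 16, h = 4 heads, causal (decoder) mask.
n, d_model, h = 5, 16, 4
rng = np.random.default_rng(0)
X = rng.normal(size=(n, d_model))
W_Q, W_K, W_V, W_O = (rng.normal(size=(d_model, d_model)) for _ in range(4))
causal_mask = np.triu(np.full((n, n), -np.inf), k=1)  # 0 on/below the diagonal
O = multi_head_attention(X, W_Q, W_K, W_V, W_O, h, causal_mask)
print(O.shape)  # (5, 16)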
2.3 Model training strategy

Token-level learning. In the original Transformer architecture, NMT aims to maximize the mapping score of a sequence pair, or to maximize the conditional probability ŷ = arg max_y p(y|x), where x and y are the source and target sequences, respectively. This task is typically regarded as a multi-label classification problem, where the class size equals the size of the target language vocabulary V. The MLE is usually adopted to address this problem:

ζ_1 = -\sum_{j=1}^{n} \sum_{y_j \in V} p_{data} \log p_{model}    (5)

where n is the length of the target sequence; V is the target vocabulary; p_{data} and p_{model} are the probability distribution of the ground-truth data with label smoothing, and the probability distribution of the model output, respectively.

Sequence-level learning. Ignoring the global sequence-level semantic loss, the token-level learning approach only computes the isolated token-level loss, which merely focuses on the cross entropy of each token. A more recent sequence-level method adds BOW as another target loss, and assumes that the sequence-level probability of each token is independent of its position in the sequence [24]. Despite its simplicity and improved performance, BOW only takes account of token frequency, without considering token order and sequence meaning. Hence, the reference is not easily exposed due to the high similarity of the candidate. Some ML models, like supervised ML and RL, have also been integrated into the Transformer model [5, 34]. For example, Cohn-Gordon and Goodman built a Bayesian model on top of the Transformer to reduce meaning loss: they applied the rational speech acts (RSA) model to produce speakers and listeners that can be modeled as Bayesian agents, and utilized the roles of speakers and listeners as double-sided mirrors to understand the overall sequence information [5]. The RL approach rewards and punishes the target translation sequence by setting an effective reward function [34]. The problem is that supervised ML and RL algorithms inevitably complicate Transformer training. To solve this problem, our method adopts a simple structure and considers the semantic loss between the candidate and the reference.

3 Modeling

This section explains the details of our model. Besides presenting the input/output notations and the training network architecture, the authors expound at length on the implementation of the semantic regression loss during training.

3.1 Notations

Some notations are necessary to facilitate the model description. For an SNMT task, the i-th sample (x_i, y_i) consists of a source language sequence x_i and a target language sequence y_i. In the model architecture, x_i relates to the encoder input x_enc_in(i), and y_i relates to the decoder input y_dec_in(i) and the decoder output y_dec_out(i), which have a one-position misalignment. The semantic output of y_i is a vector denoted as y_sem_out(i). All inputs and outputs of our model can be formatted as:

x_enc_in(i) = {x_i^1, x_i^2, ..., x_i^m}
y_dec_in(i) = {y_i^1, y_i^2, ..., y_i^n}
y_dec_out(i) = {y_i^1, y_i^2, ..., y_i^n}
y_sem_out(i) = y_i^{sm}    (6)

where m and n are the lengths of the source sequence and the target sequence, respectively; sm is the dimension of the semantic vector of the generated sequence, which equals the target vocabulary size |V|. Despite being the same in length, the decoder input and decoder output differ due to the one-position misalignment required by next-token prediction in the language model; x_enc_in(i) and y_dec_out(i) are the ground-truth data; y_sem_out(i) is the sum of all the label smoothing vectors.

3.2 Network architecture

As shown in Figure 1, our model first preprocesses all the samples. The first step is to insert 〈/s〉 (end of the sequence) at the last position of the source sequence, and then to insert 〈s〉 (start of the sequence) and 〈/s〉 at the first position and last position of the target sequence, respectively.
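The following short sketch illustrates this special-token preprocessing and the one-position misalignment between decoder input and decoder output described in Section 3.1. The helper name and the concrete token strings are assumptions for illustration only.

def make_example(src_tokens, tgt_tokens, bos="<s>", eos="</s>"):
    enc_in = src_tokens + [eos]          # source ends with </s>
    tgt = [bos] + tgt_tokens + [eos]     # target wrapped with <s> ... </s>
    dec_in = tgt[:-1]                    # decoder input:  <s> y1 ... yn
    dec_out = tgt[1:]                    # decoder output: y1 ... yn </s>
    return enc_in, dec_in, dec_out

enc_in, dec_in, dec_out = make_example(["wie", "geht", "es"], ["how", "are", "you"])
# enc_in  = ['wie', 'geht', 'es', '</s>']
# dec_in  = ['<s>', 'how', 'are', 'you']
# dec_out = ['how', 'are', 'you', '</s>']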
Before a preprocessed sample is imported into the encoder, all the token embeddings are initialized and their absolute positional encodings are computed. Next, the token embeddings and positional encodings are added together and regarded as the real input of the encoder.

x_enc_in(i,j) = we(x_i^j) + pe(x_i^j)
y_dec_in(i,j) = we(y_i^j) + pe(y_i^j)
x_enc_in(i) = concat(x_enc_in(i,j))
y_dec_in(i) = concat(y_dec_in(i,j))
x_enc_out(i) = encoders(x_enc_in(i))    (7)

where i indexes the i-th sample and j indexes the j-th token. Through nonlinear computing via the stacked encoder layers, it is possible to obtain three latent representation matrices of the source sequence: Q, K, V. Then, the matrices K and V are transferred into the encoder-decoder attention layer of all decoder layers.

Figure 1: Architecture of our model. Note: The red part is the semantics of the candidate translation, and the other part is the original Transformer.

Similarly, the stacked decoders are utilized to generate the probability distribution of the target token, conditioned on the encoder output and the previously generated tokens, as well as the masked attention scores.

token_i^t = decoders(x_enc_out(i), y_dec_in(i)^{[:t]})
score_i^t = W × token_i^t + b
P(token_i^t) = softmax(score_i^t)    (8)

Through the nonlinear transform of the stacked decoders, the token probability distribution at the t-th step is obtained by the softmax function. The dimension of the probability distribution p_j equals the target vocabulary size |V| and is subject to the constraint \sum_{j=1}^{|V|} p_{ij} = 1. So far, learning the target token distribution has been treated as a multi-label classification problem.

Additionally, the authors analyzed the semantic distribution of the target sequence. Because these token distributions contain positional encodings, their weighted sum is taken as the semantics of the translation sequence.

s_i^t = \sum_{j=1}^{|V|} p_j ⊗ we_j
y_sem_out(i) = \sum_{t=1}^{n} W^t s_i^t    (9)

where s_i^t is the semantics of the t-th reference token; we_j is the token embedding; n is the length of the target sequence. Therefore, y_sem_out(i) can be regarded as the semantics of the generated sequence. For simplicity, the average of the weighted sum, \frac{1}{n}\sum_{t=1}^{n} s_i^t, is applied as the approximate value of y_sem_out(i).
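A sketch of the input construction we(x) + pe(x) in formula (7) is given below. The sin/cos formulation follows the original Transformer [32]; since the paper only states that trigonometric functions are used, this exact formulation is an assumption, and the function names are illustrative.

import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal absolute positional encodings, shape (max_len, d_model)."""
    pos = np.arange(max_len)[:, None]                       # (max_len, 1)
    i = np.arange(d_model)[None, :]                         # (1, d_model)
    angle = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angle), np.cos(angle))

def encoder_input(token_ids, embedding_table):
    """we(x) + pe(x): look up the embeddings and add positional encodings."""
    emb = embedding_table[token_ids]                        # (n, d_model)
    return emb + positional_encoding(len(token_ids), emb.shape[-1])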
3.3 Semantic regression loss function

At the t-th step, the original Transformer regards the distribution of the target output as a multi-label classification problem. Let n and |V| be the length of the target sequence and the size of the target vocabulary, respectively. The mean sum of the token cross entropies is taken as the first loss, which is computed by formula (5) and denoted as ζ_1.

To improve the translation quality during the iterative training, the meaning of the candidate translation should be close to that of the reference sequence, and the semantic distance between the best choice and the reference should be minimized, as in a Procrustes problem. For a given vocabulary on a specific dataset, it is possible to obtain the token embeddings and denote them as a matrix E_v ∈ R^{|V|×d}, where |V| is the vocabulary size and d is the dimension of the token embedding. This matrix is always relatively static; in other words, the token embeddings change only slightly with corpora and vocabularies. For a sequence, the mean sum of all the token embeddings is taken as its eigenvector. Therefore, a target sequence can be expressed as a one-hot style encoding s_sen ∈ R^{|V|} without taking label smoothing into account, e.g., [1, 0, 0, 0, 1, ..., 0, 1]. Here, the authors compute the multiplication between each component p_{ij} and \vec{e}_j, then sum up the matrix along axis = 0, treating the transpose of the resulting vector as the approximate semantics of the target sequence:

M_{sen} = \begin{bmatrix} p_{11} & p_{21} & \cdots & p_{n1} \\ p_{12} & p_{22} & \cdots & p_{n2} \\ \vdots & \vdots & \ddots & \vdots \\ p_{1|V|} & p_{2|V|} & \cdots & p_{n|V|} \end{bmatrix}, \quad
E_v = \begin{bmatrix} e_{11} & e_{12} & \cdots & e_{1d} \\ e_{21} & e_{22} & \cdots & e_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ e_{|V|1} & e_{|V|2} & \cdots & e_{|V|d} \end{bmatrix} = \begin{bmatrix} \vec{e}_1 \\ \vec{e}_2 \\ \vdots \\ \vec{e}_{|V|} \end{bmatrix}    (10)

where the column vectors of M_{sen} ∈ R^{|V|×n} are the target token probability distributions; E_v ∈ R^{|V|×d} is the target embedding lookup table, in which each row vector is the embedding of the corresponding token in the target vocabulary. Then, the latent representations of the target tokens, i.e., the column vectors of s_{token}, can be obtained in matrix form, and the semantics of the sequence can be calculated by summing the products between M_{sen} and E_v, as in formula (11):

s_{token} = \begin{bmatrix} p_{11}\vec{e}_1 & p_{21}\vec{e}_1 & \cdots & p_{n1}\vec{e}_1 \\ p_{12}\vec{e}_2 & p_{22}\vec{e}_2 & \cdots & p_{n2}\vec{e}_2 \\ \vdots & \vdots & \ddots & \vdots \\ p_{1|V|}\vec{e}_{|V|} & p_{2|V|}\vec{e}_{|V|} & \cdots & p_{n|V|}\vec{e}_{|V|} \end{bmatrix}

s_{sen} = \sum_{axis=0} M_{sen}^T \otimes E_v
        = \sum_{axis=0}\left( \begin{bmatrix} p_{11} & p_{21} & \cdots & p_{n1} \\ p_{12} & p_{22} & \cdots & p_{n2} \\ \vdots & \vdots & \ddots & \vdots \\ p_{1|V|} & p_{2|V|} & \cdots & p_{n|V|} \end{bmatrix}^T \otimes \begin{bmatrix} \vec{e}_1 \\ \vec{e}_2 \\ \vdots \\ \vec{e}_{|V|} \end{bmatrix} \right)
        = \sum_{axis=0}\left( \begin{bmatrix} p_{11}\vec{e}_1 + p_{21}\vec{e}_1 + \cdots + p_{n1}\vec{e}_1 \\ p_{12}\vec{e}_2 + p_{22}\vec{e}_2 + \cdots + p_{n2}\vec{e}_2 \\ \vdots \\ p_{1|V|}\vec{e}_{|V|} + p_{2|V|}\vec{e}_{|V|} + \cdots + p_{n|V|}\vec{e}_{|V|} \end{bmatrix} \right)    (11)

where s_{sen} is the convex combination of the semantic embeddings of the target sequence [25]. In this way, it is possible to acquire the semantics of the predicted sequence and of the target sequence. The semantic regression loss can be derived from the distance between the semantics of the two sequences. Furthermore, the target tokens are mapped into the same d-dimensional vector space, where the vectors are not highly variable. Hence, the semantic distance is approximately taken as the second loss over the global sequence:

D_s(y, ŷ) = s_{sen} − ŝ_{sen} = M_{sen} ⊗ E_v − \hat{M}_{sen} ⊗ E_v = (M_{sen} − \hat{M}_{sen}) ⊗ E_v ∝ (M_{sen} − \hat{M}_{sen})
D_s(y, ŷ) ≈ (M_{sen} − \hat{M}_{sen})
ζ_2 = |D_s(y, ŷ)|    (12)

where y and ŷ are the reference sequence and the candidate sequence, respectively; s_{sen} and ŝ_{sen} are the semantics of the reference sequence and the candidate sequence, respectively. If the vector space of the tokens does not change, the semantic difference between the reference and the candidate must be proportional to the vector distance.

In the original Transformer, the decoder is trained in the teacher-forcing way, that is, the translation system actually knows the correct answer when it predicts the subsequent token. However, the decoder works in an autoregressive manner during inference, in which case it does not know what the subsequent token is. To bridge the gap, it is necessary to make an overall analysis of the semantic differences between the reference sequence and the candidate sequence. Therefore, the decoder can be trained from the perspectives of the token and the global target sequence simultaneously. The first loss aims to maximize the probability of the generated token, while the second loss aims to maximize the similarity between the reference and the candidate, and thus to preserve the meaning. To weight the two losses, a scalar λ is introduced as a hyper-parameter to balance the loss functions:

ζ = λζ_1 + (1 − λ)ζ_2, \quad s.t. \; λ ∈ [0, 1]    (13)
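A minimal NumPy sketch of the sequence-level semantics of formula (11) and the combined loss of formula (13) is given below. Function names are illustrative, and the Euclidean-distance reduction corresponds to the ED variant selected in Section 4.4; the paper additionally approximates ζ_2 by dropping the fixed embedding table, as in formula (12).

import numpy as np

def sequence_semantics(P, E):
    """P: (n, |V|) per-step token probability distributions (M_sen transposed);
    E: (|V|, d) target embedding table E_v.
    Returns the convex combination of token embeddings, summed over steps."""
    return (P @ E).sum(axis=0)          # shape (d,)

def semantic_regression_loss(P_cand, P_ref, E):
    """zeta_2 with the ED metric: squared Euclidean distance between the
    candidate and reference sequence semantics. For the reference, P_ref would
    hold the label-smoothed one-hot distributions of the ground-truth tokens."""
    s_cand = sequence_semantics(P_cand, E)
    s_ref = sequence_semantics(P_ref, E)
    return float(np.sum((s_cand - s_ref) ** 2))

def combined_loss(ce_loss, P_cand, P_ref, E, lam=0.5):
    """Formula (13): linear combination of cross entropy and semantic loss."""
    return lam * ce_loss + (1.0 - lam) * semantic_regression_loss(P_cand, P_ref, E)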
4 Experiments

This section verifies the performance of our model through experiments. First, the sequence pairs of the three evaluation datasets (Section 4.1) were preprocessed. Then, the experimental settings, evaluation metrics (Section 4.2) and baseline systems based on Transformer (Section 4.3) are introduced in turn. Finally, the experimental implementation is detailed, including hyper-parameter selection (Section 4.4) and results analysis (Section 4.5).

4.1 Datasets

Our experiments were conducted on three translation tasks: IWSLT2014 DE->EN, IWSLT2016 DE->EN, and WMT17 EN->DE, which are widely used as evaluation benchmarks for NMT [32]. Each dataset was split into a training set, a validation set, and a test set. Then, the data were preprocessed through normalization and sub-word segmentation. The IWSLT datasets contain the data extracted from the IWSLT Evaluation Campaign [3, 9]. IWSLT2014 offers 160k/7k sentence pairs as training/validation sets; the authors concatenated dev2010, dev2012, tst2010, tst2011 and tst2012 as the test set, including about 7k sentence pairs. For IWSLT2016, the data consist of 180k/12k sentence pairs as training/validation sets; the authors took the concatenation of tst2010/2011/2012/2013/2014 as the test set, including about 12k sentence pairs. For WMT17, the original training set was adopted as the training set of our model, which contains 5.9 million sentence pairs. The concatenation of newstest2013/2014/2015/2016 was taken as the validation set, involving more than 1 million sentence pairs, and newstest2017 was treated as the test set, containing about 3k sentence pairs (see Table 1).

The three datasets above were preprocessed by Moses, a de-facto standard toolkit for SMT [16]. First, the sentence pairs were tokenized to deal with the punctuation. Considering the huge range of token counts in the tokenized sequences, sequence pairs of more than 100/80 tokens were discarded for IWSLT2014/2016 and WMT17, respectively, so as to improve translation quality and accelerate training. Finally, every subset of each dataset was truncated. To mitigate the influence of out-of-vocabulary (OOV) tokens and rare tokens, the sequences were tokenized into sub-word units by SentencePiece ([39]) [18] or WordPiece ([40]) [17], which serve the same function as byte pair encoding (BPE) [28]. In addition, each dataset uses a shared vocabulary of 32,000 sub-words, because English and German belong to the Germanic language family.

4.2 Experimental setting and evaluation metrics

The implementation of our experiments is based on empirical results. Our experiments were built with appropriate settings, such as the optimizer, the learning rate and other hyper-parameters. For the optimizer, we choose AdamW [22] with weight_decay = 10^{-5}, β_1 = 0.9, β_2 = 0.999, as used in BERT [6]. For the learning rate, we also follow the learning rate warm-up strategy [32] with warmup_steps = 8000.
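A minimal sketch of this warm-up schedule follows, assuming the standard formulation of [32], lr = d_model^{-0.5} · min(step^{-0.5}, step · warmup_steps^{-1.5}); the d_model = 512 default corresponds to the transformer-base configuration and is an assumption of the sketch.

def transformer_lr(step, d_model=512, warmup_steps=8000):
    """Learning rate warm-up of [32]: linear increase for warmup_steps steps,
    then decay proportional to the inverse square root of the step number."""
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

print(transformer_lr(1000), transformer_lr(8000), transformer_lr(100000))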
Table 1: The sequence pair statistics after preprocessing by Moses (a) and the sub-word statistics after segmentation by BPE (b) on the IWSLT2014/2016 DE->EN and WMT17 EN->DE translation datasets. The terms "Train", "Eval" and "Test" represent the training set, validation set and test set, respectively. "S" and "T" represent the source language and the target language, respectively. Note: the units k and m stand for thousand and million, respectively.

(a) Sentence pairs
Dataset      Train    Eval     Test
IWSLT2014    161k     7.3k     6.7k
IWSLT2016    181k     12k      11.8k
WMT17        5.85m    1.12m    3k

(b) Sub-words
Dataset      S/T    Train      Eval    Test
IWSLT2014    DE     4.13m      188k    167k
             EN     4.06m      184k    -
IWSLT2016    DE     4.65m      306k    290k
             EN     4.59m      302k    -
WMT17        EN     155.36m    423k    94k
             DE     169.71m    460k    -

During training, the label smoothing rate is set to 0.1 and all dropout values are set to 0.1. Our experiments were implemented under the framework of TensorFlow 1.14.0. The obtained model checkpoint files can also be easily converted into PyTorch bin files using the transformers library. All the experiments were completed on two NVIDIA 1080Ti GPUs. During inference, the beam search size was set to 4 for the validation set and the test set.

There are many evaluation metrics, each of which has its own strengths and weaknesses. For diversity and reliability, four metrics were selected to evaluate our model, namely BLEU [26], METEOR [2], TER [30] and ROUGE-L [21]. The authors computed the BLEU score with the standard Moses tool multi-bleu.perl ([36]), the TER score with pyter ([38]), and the METEOR and ROUGE-L scores with nlg-eval ([37]). BLEU, the earliest automatic evaluation method for machine translation, analyzes the degree of n-gram co-occurrence between the candidate and the reference; its main component is n-gram precision combined via geometric averaging. METEOR is based on explicit word-to-word matches, which include identical words in their surface forms, morphological variants in their stemmed forms, and synonyms in meaning between the candidate and the reference. This metric combines unigram precision, unigram recall, and a direct measure of how out-of-order the words in the candidate translation are. TER is a distance-based metric of the workload of post-editing the candidate translation. The distance is defined as the minimum number of edits that transforms one sequence into another; TER considers edit operations like insertion, deletion, and substitution of single words, as well as shifts of word sequences. ROUGE is a recall-based metric commonly used for machine translation and text summarization. Chin-Yew Lin introduced four different ROUGE measures: ROUGE-N, ROUGE-L, ROUGE-W, and ROUGE-S. Here, ROUGE-L is chosen as the metric to evaluate machine translation; L stands for the longest common subsequence (LCS) of the corresponding sequences of the candidate and the reference.
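As an illustration of the LCS-based metric, the following simplified sentence-level ROUGE-L sketch computes the F-measure from LCS precision and recall. The experiments themselves used the nlg-eval package; the beta value weighting recall over precision is an assumption of this sketch.

def lcs_length(a, b):
    """Length of the longest common subsequence of token lists a and b."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[-1][-1]

def rouge_l(candidate, reference, beta=1.2):
    lcs = lcs_length(candidate, reference)
    if lcs == 0:
        return 0.0
    prec = lcs / len(candidate)
    rec = lcs / len(reference)
    return (1 + beta**2) * prec * rec / (rec + beta**2 * prec)

print(rouge_l("the cat sat on the mat".split(), "the cat is on the mat".split()))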
4.3 Baseline systems of machine translation based on the original Transformer

Our model was compared to the original Transformer models with the small, base, and big configurations. The difference between our model and the original Transformer lies in the application of the semantic regression loss during training. Drawing on Vaswani et al.'s work [32], the transformer-base model and the transformer-big model were designed to contain 6 layers and 12 layers, respectively. Facing limited GPU resources, the number of stacked encoders and decoders was set to 6 in all the configurations for our model. Each transformer configuration (small, base, and big) is unique: the configurations differ in the number of heads, hidden size, filter size, and number of blocks (layers) of encoders and decoders. The corresponding hyper-parameters of the three models are listed in Table 2. According to the different sizes of the three datasets, the transformer-small and transformer-base settings were adopted for IWSLT DE->EN, and the transformer-base and transformer-big settings for WMT17 EN->DE.

Table 2: The hyper-parameters of the transformer-small, transformer-base and transformer-big configurations of the Transformer-based machine translation systems

Transformer    Heads    Blocks    Hidden size    Filter size
small          4        6         256            1024
base           8        6         512            2048
big            16       6         1024           4096

4.4 SNMT based on our model

1) Different distance measurement algorithms

The candidate sequence and the reference sequence were each taken as a whole. In the computing graph of the network architecture, the two sequences are essentially two tensors or multi-dimensional vectors, and their distance is denoted as ζ_2. In this section, our model was tested with three different metrics for the distance between the two vectors x and y (sketched in code below), where x is the semantic representation of the candidate sequence and y is the semantic representation of the reference sequence:

Euclidean distance (squared): ED(x, y) = \|x − y\|_2^2 = (x − y)^T (x − y);

Cosine distance (cosine similarity): Cos(x, y) = x · y = x^T y = y^T x. Note that the cosine distance is generalized without normalization, which is equivalent to the dot product operation, without affecting the final result;

Max-pooling distance (MPD): MPD(x, y) = \max(x_i, y_i)\big|_{i=1}^{n}. Inspired by max-pooling, this metric takes the maximum of the corresponding components of the two vectors.

Our model was implemented on the three datasets with the transformer-base configuration. Some experimental results are listed in Table 3. Multiple trials and comparisons were carried out to contrast the cross entropy loss of each token with the Kullback–Leibler divergence (KLD, [19]) and the Jensen-Shannon divergence (JSD, [8]), both of which are distribution metrics. Under these circumstances, the metric scores were slightly improved. However, cross entropy (CE) is the simplest algorithm in terms of computation. From the left side of Table 3, it can be seen that the use of CE brings stable effects; therefore, the experiments on the right side of Table 3 all use the CE algorithm. From the right side of Table 3, it can be inferred that the three distances achieved apparently different performances (ED > MPD > Cos), with the conventional ED being the best performer. Moreover, MPD had a similar effect to ED. These experiments were implemented with formula (13). Overall, the linear combination of CE and ED was determined as our loss function.
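The three candidate distance measures compared above can be sketched as follows, with x and y being the sequence-level semantic vectors of the candidate and the reference. NumPy is used for illustration; the scalar reduction of MPD is an assumption of this sketch.

import numpy as np

def euclidean_sq(x, y):
    """ED: squared Euclidean distance, (x - y)^T (x - y)."""
    d = x - y
    return float(d @ d)

def cosine_unnormalized(x, y):
    """Cos: generalized (unnormalized) cosine distance, i.e. the dot product."""
    return float(x @ y)

def max_pooling_distance(x, y):
    """MPD: element-wise maximum of corresponding components, summed here to
    obtain a scalar loss (this reduction is assumed, not stated in the paper)."""
    return float(np.maximum(x, y).sum())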
Table 3: The evaluation scores using different distance evaluation algorithms on IWSLT2014, IWSLT2016 and WMT17, each with the transformer-base configuration. CE, KLD [19] and JSD [8] are ζ_1 losses, which stand for cross entropy, Kullback–Leibler divergence and Jensen-Shannon divergence, respectively. ED, Cos and MPD are ζ_2 losses, which stand for Euclidean distance, Cosine distance and Max-pooling distance, respectively. ED+CE, Cos+CE and MPD+CE are the linear combinations between CE and the different distance losses.

Loss                    CE      KLD     JSD     ED+CE   Cos+CE  MPD+CE
IWSLT2014   BLEU        33.96   33.90   33.97   34.45   34.32   34.21
            TER         50.71   50.97   50.98   48.01   48.15   49.07
            METEOR      33.73   32.77   33.29   33.07   33.71   33.80
            ROUGE_L     62.95   63.04   63.26   63.55   63.47   63.11
IWSLT2016   BLEU        32.45   32.40   32.41   33.01   34.32   34.21
            TER         51.38   51.37   50.48   51.31   48.15   49.07
            METEOR      32.52   31.07   31.29   32.89   33.71   33.80
            ROUGE_L     60.23   59.04   59.26   61.10   63.47   63.11
WMT17       BLEU        28.10   28.25   28.29   28.96   28.41   28.66
            TER         55.11   54.97   54.98   54.03   54.76   54.49
            METEOR      28.73   28.75   28.71   29.41   28.58   28.87
            ROUGE_L     56.22   56.24   56.26   57.02   56.43   56.68

2) Effect of hyper-parameter λ

In order to weight the two loss functions, our model was tested with the λ value changing from 0.1 to 0.9 with a step length of 0.2. As shown in Table 4, the four evaluation scores followed a roughly bell-shaped curve over λ, peaking at λ = 0.5. Hence, the distance loss is as important as the cross entropy loss for our model: in terms of performance, the semantic loss of the global sequence matters as much as the cross entropy loss of all the tokens. Moreover, in the decoder, each step is a multi-label classification problem from the perspective of token translation; from the perspective of global sequence translation, however, each step is a regression task. The results in Table 4 show that λ = 0.5 is the best choice for our model.

Table 4: The evaluation scores using different λ on IWSLT2014, IWSLT2016 and WMT17, each with the transformer-base configuration. λ = 1 means that only ζ_1 is used. All results are obtained by using ED+CE.

λ                       0.1     0.3     0.5     0.7     0.9     1
IWSLT2014   BLEU        33.98   34.13   34.45   34.28   33.80   33.96
            TER         49.76   48.38   48.01   49.30   49.92   50.71
            METEOR      32.89   32.51   33.07   33.51   33.53   33.73
            ROUGE_L     63.47   63.43   63.55   63.80   62.84   62.95
IWSLT2016   BLEU        32.73   32.54   33.01   32.15   32.37   32.45
            TER         53.09   52.28   51.31   51.30   52.95   51.38
            METEOR      30.98   31.63   32.89   31.42   32.53   32.52
            ROUGE_L     59.37   60.33   61.10   59.19   60.09   60.23
WMT17       BLEU        27.73   28.54   28.96   28.41   28.53   28.10
            TER         55.99   55.28   54.03   57.30   55.95   55.11
            METEOR      27.89   28.51   29.41   28.51   28.53   28.73
            ROUGE_L     55.17   55.93   57.02   55.80   55.84   56.22

3) Effect of the Adam optimizer cluster

The next step is to test the impact of different Adam-style optimization algorithms on machine translation performance. The authors focused on comparing three optimization algorithms, namely Adam [15], NAdam [7] and AdamW [22], especially Adam and AdamW. The Adam optimizer integrates traditional momentum with RMSProp. The NAdam optimizer combines the Nesterov accelerated gradient (NAG) with Adam, replacing the traditional momentum in the original Adam with Nesterov momentum [7]. The AdamW optimizer was adopted for our model; it targets Adam's problematic handling of weight decay. To avoid overfitting, the Adam optimizer employs L2 regularization to update the weights, but this approach is susceptible to large weight decay. Hence, the decoupled weight decay of AdamW should be adopted instead of L2 regularization. The experimental results in Table 5 demonstrate the effectiveness of the AdamW optimizer. Compared to the Adam optimizer, AdamW achieved a performance improvement of 0.38 BLEU/0.56 TER/1.42 METEOR/0.79 ROUGE-L on the IWSLT2014 dataset, 0.34 BLEU/1.05 TER/0.40 METEOR/1.09 ROUGE-L on the IWSLT2016 dataset, and 0.26 BLEU/0.45 TER/0.30 METEOR/0.85 ROUGE-L on the WMT17 dataset.
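A sketch of a single AdamW parameter update [22] is given below, contrasting decoupled weight decay with Adam's usual L2-in-the-gradient approach. The hyper-parameter values mirror those stated in Section 4.2; the NumPy formulation is illustrative only.

import numpy as np

def adamw_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-5):
    m = beta1 * m + (1 - beta1) * grad            # first moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2       # second moment estimate
    m_hat = m / (1 - beta1 ** t)                  # bias correction
    v_hat = v / (1 - beta2 ** t)
    # Decoupled weight decay: applied directly to the weights rather than
    # added to the gradient (Adam with L2 would add weight_decay * theta to grad).
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * theta)
    return theta, m, v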
4) Effect of beam size

To test its performance more accurately, our model was decoded with the beam search algorithm [10], using different beam sizes during inference (a simplified sketch of beam search is given after Table 6). Beam = 1 means that greedy search is used for machine translation. As shown in Table 6, the results of greedy search were far worse than those of the beam search algorithm. In addition, Beam = 3 brought a significant improvement and can be seen as an inflection point during inference. Further, little fluctuation occurred when the beam size was greater than 3. To save time and space, Beam = 4 was usually the best choice. Compared to greedy search, the performance improvement was 3.19 BLEU/5.31 TER/2.60 METEOR/3.26 ROUGE-L on the IWSLT2014 dataset at Beam = 4, 1.78 BLEU/3.90 TER/3.62 METEOR/3.63 ROUGE-L on the IWSLT2016 dataset at Beam = 4, and 1.79 BLEU/4.09 TER/2.15 METEOR/3.14 ROUGE-L on the WMT17 dataset at Beam = 4.

5) Experimental results

The experiments (1)-(4) help to identify the key hyper-parameters that ensure the success of our model.

Table 5: The evaluation scores using different optimizers on IWSLT2014, IWSLT2016 and WMT17, each with the transformer-base configuration. All results are obtained by using ED+CE and λ = 0.5.

Optimizer               Adam    Nadam   AdamW
IWSLT2014   BLEU        34.07   34.28   34.45 (+0.38)
            TER         48.57   48.32   48.01 (-0.56)
            METEOR      31.65   32.24   33.07 (+1.42)
            ROUGE_L     62.76   62.81   63.55 (+0.79)
IWSLT2016   BLEU        32.67   32.62   33.01 (+0.34)
            TER         52.36   52.53   51.31 (-1.05)
            METEOR      32.49   32.48   32.89 (+0.40)
            ROUGE_L     60.01   59.81   61.10 (+1.09)
WMT17       BLEU        28.70   28.73   28.96 (+0.26)
            TER         54.48   54.90   54.03 (-0.45)
            METEOR      29.11   28.89   29.41 (+0.30)
            ROUGE_L     56.17   56.17   57.02 (+0.85)

Table 6: The evaluation scores using different beam sizes on IWSLT2014, IWSLT2016 and WMT17, each with the transformer-base configuration. All results are obtained by using ED+CE, λ = 0.5 and the AdamW optimizer.

Beam size               1       2       3       4       5       6       7       8       9       10
IWSLT2014   BLEU        31.26   32.74   34.12   34.45   33.41   32.89   32.94   33.74   33.94   34.09
            TER         53.31   52.03   49.51   48.01   47.95   48.03   49.32   49.25   48.23   48.54
            METEOR      30.47   31.60   31.95   33.07   32.64   31.85   31.93   32.32   32.71   32.92
            ROUGE_L     60.29   61.66   62.17   63.55   62.35   61.75   62.27   62.48   63.31   63.68
IWSLT2016   BLEU        31.23   31.75   32.40   33.01   32.54   32.18   31.87   32.45   33.10   32.66
            TER         55.21   54.13   53.81   51.31   52.39   51.93   53.14   53.78   51.56   51.82
            METEOR      29.27   31.28   31.46   32.89   31.27   31.28   30.43   31.82   32.57   32.32
            ROUGE_L     57.47   58.28   58.98   61.10   58.88   56.49   57.45   57.49   58.99   59.72
WMT17       BLEU        27.17   27.74   28.35   28.96   28.49   27.17   27.74   28.35   28.70   28.49
            TER         58.12   56.11   55.01   54.03   54.93   58.03   56.02   55.02   54.87   54.95
            METEOR      27.26   27.30   28.75   29.41   28.87   27.27   27.39   28.79   29.11   29.01
            ROUGE_L     53.88   54.23   55.38   57.02   56.00   53.92   54.23   55.49   56.17   55.97
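The simplified beam search sketch referenced above is shown here, corresponding to the inference setting of Table 6 (Beam = 4 in the final experiments). The step_fn callable, which returns log-probabilities over the vocabulary for a prefix, is hypothetical; length normalization and the other refinements of [10] are omitted.

import numpy as np

def beam_search(step_fn, bos_id, eos_id, beam_size=4, max_len=100):
    beams = [([bos_id], 0.0)]                   # (prefix, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            log_probs = step_fn(prefix)         # shape (|V|,)
            top_ids = np.argsort(log_probs)[-beam_size:]
            for tok in top_ids:
                candidates.append((prefix + [int(tok)], score + float(log_probs[tok])))
        # Keep only the best beam_size partial hypotheses.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            (finished if prefix[-1] == eos_id else beams).append((prefix, score))
        if not beams:
            break
    return max(finished + beams, key=lambda c: c[1])[0]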
The overall results of our experiments are shown in Table 7. All metric scores were computed on uncased tokens. For the IWSLT2014/2016 DE->EN translation tasks, our model was tested on the test sets. As shown in Table 7, our approach improved the performance over the baseline system using the transformer-small and transformer-base configurations, respectively. The performances on IWSLT2014/IWSLT2016 were improved by 0.55 BLEU/-2.07 TER/0.20 ROUGE-L and 0.51 BLEU/-0.66 TER/0.78 METEOR/0.68 ROUGE-L with the transformer-small configuration, respectively, and by 0.49 BLEU/-2.70 TER/0.60 ROUGE-L and 0.56 BLEU/-0.07 TER/0.37 METEOR/0.87 ROUGE-L with the transformer-base configuration, respectively. The slight improvement may relate to the size of the datasets. For the WMT17 EN->DE translation task, our model achieved a progressive improvement when using the transformer-big configuration. For the transformer-base configuration, our model outperformed the baseline system by 0.86/-1.08/0.68/0.80 in BLEU, TER, METEOR and ROUGE-L, respectively. For the transformer-big configuration, our model outperformed the baseline system by 1.21/-1.14/0.62/0.83 in BLEU, TER, METEOR and ROUGE-L, respectively. These performances were achieved under the setting of ED+CE, λ = 0.5, the AdamW optimizer and Beam = 4.

Table 7: The overall evaluation scores with and without the semantic distance loss on the IWSLT2014, IWSLT2016 and WMT17 test sets, using ED+CE, λ = 0.5, the AdamW optimizer and Beam = 4

Models                      Metrics     IWSLT2014       IWSLT2016       WMT17
Transformer-small           BLEU        33.78           32.17           -
                            TER         51.77           52.83           -
                            METEOR      33.29           31.93           -
                            ROUGE_L     62.91           60.05           -
Transformer-base            BLEU        33.96           32.45           28.10
                            TER         50.71           51.38           55.11
                            METEOR      33.73           32.52           28.73
                            ROUGE_L     62.95           60.23           56.22
Transformer-big             BLEU        -               -               29.08
                            TER         -               -               54.37
                            METEOR      -               -               29.09
                            ROUGE_L     -               -               56.78
Transformer-small+ours      BLEU        34.33 (+0.55)   32.68 (+0.51)   -
                            TER         49.70 (-2.07)   52.17 (-0.66)   -
                            METEOR      32.60           32.71 (+0.78)   -
                            ROUGE_L     63.11 (+0.20)   60.73 (+0.68)   -
Transformer-base+ours       BLEU        34.45 (+0.49)   33.01 (+0.56)   28.96 (+0.86)
                            TER         48.01 (-2.70)   51.31 (-0.07)   54.03 (-1.08)
                            METEOR      33.07           32.89 (+0.37)   29.41 (+0.68)
                            ROUGE_L     63.55 (+0.60)   61.10 (+0.87)   57.02 (+0.80)
Transformer-big+ours        BLEU        -               -               30.29 (+1.21)
                            TER         -               -               53.23 (-1.14)
                            METEOR      -               -               29.71 (+0.62)
                            ROUGE_L     -               -               57.61 (+0.83)

6) Training time cost

Further, the time cost of the original Transformer was compared with that of our model. The following can be inferred from the results in Table 8. Horizontal comparison: the convergence speed is negatively correlated with dataset size, while the time to reach convergence is positively correlated with dataset size. Longitudinal comparison: our model consumed more time than the original Transformer, especially for the small datasets; compared with WMT17, the relative extra training time introduced by our model was larger on IWSLT2014/2016.

4.5 Discussion

This paper shows that the semantic distance loss between the translation output and the reference is as important as the cross entropy loss during training. The translation quality can be improved with the linear combination of the two losses as the overall loss, at the cost of a slight increase in time overhead. As in the original Transformer, the generation task in the decoder essentially solves a multi-label classification problem from the perspective of token generation. This paper additionally focuses on the semantic distance between the generated sequence and the reference sequence from the perspective of sequence generation.

There are three possible reasons for the limited improvement in performance: (1) The limited improvement on IWSLT2014 and IWSLT2016 comes from a certain degree of overfitting due to the small training sets. (2) The semantics of a sequence are more than the average meaning of its tokens; each token has a unique weight in the semantics of the sequence. Thus, the weights in formula (9) should be learned during model training. (3) At the start of training, there is not much prior knowledge about the generation of the target sequence. Thus, it is more suitable to use ζ_1 as the loss function at this stage.
With the growing number of training iterations, it becomes better to adopt the linear combination of the cross entropy loss and the semantic regression loss as the loss function. However, this paper applies the linear combination from the very beginning, without determining the best time to start using this loss function. Hence, future research will try to determine when to replace the loss function.

Table 8: The time costs on the IWSLT2014, IWSLT2016 and WMT17 training sets. Note: the units s, h and d stand for second, hour and day, respectively

Models                      Time     IWSLT2014    IWSLT2016    WMT17
Transformer-small           Total    7h           9h           -
                            Step     0.064s       0.065s       -
Transformer-base            Total    10h          15h          11.8d
                            Step     0.188s       0.189s       0.205s
Transformer-big             Total    -            -            23.8d
                            Step     -            -            0.435s
Transformer-small+ours      Total    9h           11h          -
                            Step     0.068s       0.067s       -
Transformer-base+ours       Total    12.5h        16.5h        12.5d
                            Step     0.197s       0.194s       0.212s
Transformer-big+ours        Total    -            -            23.9d
                            Step     -            -            0.438s

5 Conclusions

This paper proposes a novel semantic regression loss function for SNMT based on the Transformer architecture. The authors find that the semantic loss is proportional to the distance between the reference sequence and the candidate sequence, conditioned on the given target language dataset and vocabulary. Hence, the linear combination of the cross entropy loss (a classification objective) and the distance loss (a regression objective) is synthesized as our training objective function. Then, our model is implemented on three datasets and evaluated by four metrics. During the experiments, our model is coupled with the Euclidean distance metric, the AdamW optimizer and the beam search algorithm. The results show that our model can effectively improve machine translation performance. However, the semantic loss proposed here is not actually a true semantic loss; it is approximately equivalent to the distance loss between the candidate sequence and the reference sequence. In addition, the linear combination of ζ_1 and ζ_2 is applied from the beginning to the end of training, which limits the performance improvement. In future work, the authors will use a pre-trained language model like BERT to compute the exact semantic loss in the decoder, apply prior knowledge to the loss function, and determine the golden section points for choosing between the different losses.

Funding

This work was supported by the National Natural Science Foundation of China under Grant 61977009.

Author contributions

The authors contributed equally to this work.

Conflict of interest

The authors declare no conflict of interest.

References

[1] Bahdanau, D.; Cho, K.; Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate, arXiv preprint arXiv:1409.0473, 2014.

[2] Banerjee, S.; Lavie, A. (2005). METEOR: An automatic metric for MT evaluation with improved correlation with human judgments, In Proceedings of the ACL workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, 65-72, 2005.

[3] Cettolo, M.; Niehues, J.; Stüker, S.; Bentivogli, L.; Federico, M. (2014). Report on the 11th IWSLT evaluation campaign, In Proceedings of the International Workshop on Spoken Language Translation, Hanoi, 2014.

[4] Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation, arXiv preprint arXiv:1406.1078, 2014.
[5] Cohn-Gordon, R.; Goodman, N. (2019). Lost in machine translation: A method to reduce meaning loss, arXiv preprint arXiv:1902.09514, 2019.

[6] Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805, 2018.

[7] Dozat, T. (2016). Incorporating Nesterov momentum into Adam, ICLR 2016 Workshop, 2016.

[8] Endres, D.M.; Schindelin, J.E. (2003). A new metric for probability distributions, IEEE Transactions on Information Theory, 49(7): 1858-1860, 2003.

[9] Federmann, C.; Lewis, W.D. (2016). Microsoft speech language translation (MSLT) corpus: The IWSLT 2016 release for English, French and German, In International Workshop on Spoken Language Translation, 2016.

[10] Freitag, M.; Al-Onaizan, Y. (2017). Beam search strategies for neural machine translation, arXiv preprint arXiv:1702.01806, 2017.

[11] Gehring, J.; Auli, M.; Grangier, D.; Dauphin, Y.N. (2016). A convolutional encoder model for neural machine translation, arXiv preprint arXiv:1611.02344, 2016.

[12] Gehring, J.; Auli, M.; Grangier, D.; Yarats, D.; Dauphin, Y.N. (2017). Convolutional sequence to sequence learning, In International Conference on Machine Learning, 1243-1252, 2017.

[13] Hassan, H.; Aue, A.; Chen, C.; Chowdhary, V.; Clark, J.; Federmann, C. (2018). Achieving human parity on automatic Chinese to English news translation, arXiv preprint arXiv:1803.05567, 2018.

[14] Hochreiter, S.; Schmidhuber, J. (1997). Long short-term memory, Neural Computation, 9(8): 1735-1780, 1997.

[15] Kingma, D.P.; Ba, J. (2014). Adam: A method for stochastic optimization, CoRR abs/1412.6980, 2014.

[16] Koehn, P.; Hoang, H.; Birch, A.; Callison-Burch, C.; Federico, M.; Bertoldi, N. (2007). Moses: Open source toolkit for statistical machine translation, In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Companion Volume: Proceedings of the Demo and Poster Sessions, 177-180, 2007.

[17] Kudo, T. (2018). Subword regularization: Improving neural network translation models with multiple subword candidates, arXiv preprint arXiv:1804.10959, 2018.

[18] Kudo, T.; Richardson, J. (2018). SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing, arXiv preprint arXiv:1808.06226, 2018.

[19] Kullback, S.; Leibler, R.A. (1951). On information and sufficiency, The Annals of Mathematical Statistics, 22(1): 79-86, 1951.

[20] Li, Y.; Wang, Q.; Xiao, T.; Liu, T.; Zhu, J. (2020). Neural machine translation with joint representation, In Proceedings of the AAAI Conference on Artificial Intelligence, 34(5): 8285-8292, 2020.

[21] Lin, C.Y. (2004). ROUGE: A package for automatic evaluation of summaries, In Text Summarization Branches Out, 74-81, 2004.

[22] Loshchilov, I.; Hutter, F. (2017). Decoupled weight decay regularization, arXiv preprint arXiv:1711.05101, 2017.

[23] Luong, M.T.; Pham, H.; Manning, C.D. (2015). Effective approaches to attention-based neural machine translation, arXiv preprint arXiv:1508.04025, 2015.

[24] Ma, S.; Sun, X.; Wang, Y.; Lin, J. (2018). Bag-of-words as target for neural machine translation, arXiv preprint arXiv:1805.04871, 2018.

[25] Norouzi, M.; Mikolov, T.; Bengio, S.; Singer, Y.; Shlens, J.; Frome, A. (2013). Zero-shot learning by convex combination of semantic embeddings, arXiv preprint arXiv:1312.5650, 2013.
[26] Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. (2002). BLEU: A method for automatic evaluation of machine translation, In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 311-318, 2002.

[27] Schuster, M.; Paliwal, K.K. (1997). Bidirectional recurrent neural networks, IEEE Transactions on Signal Processing, 45(11): 2673-2681, 1997.

[28] Sennrich, R.; Haddow, B.; Birch, A. (2015). Neural machine translation of rare words with subword units, arXiv preprint arXiv:1508.07909, 2015.

[29] Shen, S.; Cheng, Y.; He, Z.; He, W.; Wu, H.; Sun, M.; Liu, Y. (2015). Minimum risk training for neural machine translation, arXiv preprint arXiv:1512.02433, 2015.

[30] Snover, M.; Dorr, B.; Schwartz, R.; Micciulla, L.; Makhoul, J. (2006). A study of translation edit rate with targeted human annotation, In Proceedings of the Association for Machine Translation in the Americas, 2006.

[31] Sutskever, I.; Vinyals, O.; Le, Q.V. (2014). Sequence to sequence learning with neural networks, Advances in Neural Information Processing Systems, 3104-3112, 2014.

[32] Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. (2017). Attention is all you need, arXiv preprint arXiv:1706.03762, 2017.

[33] Wang, Q.; Li, B.; Xiao, T.; Zhu, J.; Li, C.; Wong, D.F.; Chao, L.S. (2019). Learning deep transformer models for machine translation, arXiv preprint arXiv:1906.01787, 2019.

[34] Wu, L.; Tian, F.; Qin, T.; Lai, J.; Liu, T.Y. (2018). A study of reinforcement learning for neural machine translation, arXiv preprint arXiv:1808.08866, 2018.

[35] Yang, B.; Li, J.; Wong, D.F.; Chao, L.S.; Wang, X.; Tu, Z. (2019). Context-aware self-attention networks, In Proceedings of the AAAI Conference on Artificial Intelligence, 33: 387-394, 2019.

[36] [Online]. Available: https://github.com/moses-smt/mosesdecoder, Accessed on 30 December 2017.

[37] [Online]. Available: https://github.com/Maluuba/nlg-eval, Accessed on 23 November 2019.

[38] [Online]. Available: https://pypi.org/project/pyter, Accessed on 7 December 2012.

[39] [Online]. Available: https://github.com/google/sentencepiece, Accessed on 10 January 2021.

[40] [Online]. Available: https://github.com/lovit/WordPieceModel, Accessed on 5 November 2018.

Copyright ©2021 by the authors. Licensee Agora University, Oradea, Romania. This is an open access article distributed under the terms and conditions of the Creative Commons Attribution-NonCommercial 4.0 International License. Journal's webpage: http://univagora.ro/jour/index.php/ijccc/

This journal is a member of, and subscribes to the principles of, the Committee on Publication Ethics (COPE). https://publicationethics.org/members/international-journal-computers-communications-and-control

Cite this paper as: D.X. Li, Z.Y. Luo (2021). Regression Loss in Transformer-based Supervised Neural Machine Translation, International Journal of Computers Communications & Control, 16(4), 4217, 2021. https://doi.org/10.15837/ijccc.2021.4.4217