Instruction


FACTA UNIVERSITATIS  
Series: Electronics and Energetics Vol. 29, N

o
 2, June 2016, pp. 243 - 260 

DOI: 10.2298/FUEE1602243M 

PARALLEL EXECUTION TRACING:  

AN ALTERNATIVE SOLUTION TO EXPLOIT  

UNDER-UTILIZED RESOURCES IN MULTI-CORE 

ARCHITECTURES FOR CONTROL-FLOW CHECKING

 

Mohammad Maghsoudloo, Hamid R. Zarandi  

Amirkabir University of Technology (Tehran Polytechnic), Iran 

Abstract. In this paper, a software behavior-based technique is presented to detect 

control-flow errors in multi-core architectures. The analysis of a key point leads to  

introduction of the proposed technique: employing under-utilized CPU resources in multi-

core processors to check the execution flow of the programs concurrently and in parallel 

with the main executions. To evaluate the proposed technique, a quad-core processor 

system was used as the simulation environment, and the behavior of SPEC CPU2006 

benchmarks were studied as the target to compare with conventional techniques. The 

experimental results, with regard to both detection coverage and performance overhead, 

demonstrate that on average, about 94% of the control-flow errors can be detected by the 

proposed technique, with less performance overhead compared to previous techniques. 

Key words: On-line error detection, control-flow error, error detection coverage, 

multi-core processor, CPU utilization. 

1. INTRODUCTION 

Advances in CMOS technology have provided reduction in transistor size and voltage 

levels [1]. As the number of available transistors continues to grow exponentially, adding 

and using more hardware resources has emerged as a convenient solution to increase the 

performance of the microprocessors. While these additional resources yield performance 

enhancements, they lead to increases in the level of power dissipation [2]. In order to 

improve the performance per cost ratio of processors, computer architects are forced to 

shift from single-core single-threaded processors to multi-core multi-threaded processors 

or Chip Multi-Processors (CMPs) [2], [3]. 

However, current trends in computer architectures have shown that the forthcoming of 

new processors will involve new challenges related to increasing the vulnerability of them 

against hardware transient faults [4]. Most soft errors are the result of energetic particle 

                                                           
Received April 6, 2015; received in revised form December 27, 2015 

Corresponding author: Mohammad Maghsoudloo 

Amirkabir University of Technology (Tehran Polytechnic), 424 Hafez Ave, Iran 
(e-mail: m.maghsoudloo@aut.ac.ir) 


244 M. MAGHOSUDLOO, H. R. ZARANDI 
 

strikes, induced by high-energy neutrons from cosmic rays, and alpha particles from 

decaying radioactive impurities in packaging and interconnect materials [5]. 

Soft errors, that occur in both system memory and combinational logic of the computer 

system, will need to be addressed during the design phase of the systems, especially for 

safety-critical applications [2]. To counter the errors, present in different level of memories, 

designers typically employ redundant information or hardware, such as error correcting 

codes (ECCs), and bit interleaving to protect memories. 

Similarly, combinational logic within the processor should be protected. Therefore, 

high-availability systems need much more hardware redundancy than that provided by ECC 

and parity bits. For example, IBM has historically added 20-30% of additional logic within 

its mainframe processors for fault tolerance [6]. There have been several approaches to add 

extra hardware redundancy and modification for tolerating hardware faults or detecting soft 

errors in CMPs, called HardWare Fault Tolerant techniques (HWFT), such as Core 

Cannibalization Architecture (CCA) [7], Core Salvaging [8], Mixed-Mode Multi-core 

(MMM) [9], which imposes more power consumption, and complexity challenges [10]. The 

inclusion of redundant hardware design and modification may negatively impact the design 

cycles of systems and also area- and power-efficiency of the new and modern processor 

[11], [12], [13]. 

On the other hand, SoftWare Fault Tolerant techniques (SWFT) usually provide 

adequate levels of fault tolerance, hence system reliability [14], such as mSWAT [1], and 

Detouring [11]. Despite the flexibility of SWFT techniques which can moderate the 

negative impacts of HWFT ones, they can lead to huge and considerable performance 

degradation during the execution, because of the nature of their structures which are based 

on execution replication [10].  

In the other category, Control-Flow Checking (CFC) methods, key methods used for 

monitoring the behavior of a program, have shown to provide effective and cost-efficient 

error detection coverage [15], [16], [17], [18], [20]. Unfortunately, due to the crucial 

drawbacks of the hardware-based methods, Hardware-based CFC (HCFC) methods are 

not considered to be appropriate for current architectures [23]. Moreover, the structure of 

conventional Software-based CFC (SCFC) techniques has not been adapted with the 

inherent features of CMPs. Although using the SCFC techniques can reduce the huge 

performance overhead of the software-based methods, however, these techniques still have 

potential of imposing high performance overhead (due to inserting some instructions into 

the programs) on the systems that would undermine the obtained benefits of the modern 

processors and would waste the efforts taken by designers to improve the execution time of 

the programs and utilization of the processors. Typically, the performance overhead of these 

techniques grows to 40% for only detection, and also grows to more than 100% for detection 

and correction [18] that would be intolerable in current architectures and applications. 

So, one effective way for enhancing the reliability of CMPs, is to modify the configuration 

of software-centric methods, so that the adaptability with the status and load of the CPU 

become visible in their structure. This paper presents an adaptive and efficient SCFC 

technique which takes the advantages of multi-core multi-threaded processors, such as parallel 

execution capabilities, to remove the main drawbacks of the conventional SCFC techniques 

[21], [22]. 

Shifting from single-core to multi-core processor means that programmers must write 

concurrent multi-threaded programs for the optimal use of hardware capacities. Unfortunately, 

some challenges, such as load balancing, sequential and synchronization dependencies, 


 Parallel Execution Tracing: an Alternative Solution to Exploit Under-utilized Resources... 245 

 
cause that parallelism is growing slowly and, the resources in multi-core processors may 

not be employed properly [19], [28], [29]. So, there are some idle cycles and resources 

during the execution of each core, which is noticeable when the number of cores increases. 

Therefore, the availability of idle hardware resources in current processors has motivated us 

to use under-utilized CPU capacities for employing a technique for parallel and concurrent 

execution tracing which is compatible with the features of multi-core multi-threaded 

processors. Using this idea will lead to moderation of the negative effects of the serial control-

flow checking, proposed by the previous techniques, that causes undesirable performance and 

memory overheads. 

To evaluate the proposed technique, eleven well-known programs in the SPEC CPU2006 

Benchmark suite [30] were utilized. These benchmarks were run on a quad-core processor 

system, Intel® core™ i7-740QM with 6.00 GB Memory RAM, with a real operating system 

(Red Hat Enterprise Linux AS release 10). The results of injecting about 12000 errors reveal 

that on average, about 94% of the CFEs were detected by the proposed technique. Moreover, 

based on a comparison metric, which considers the effects of methods in CFE detection 

coverage and performance overheads, the proposed technique is found to have higher 

coverage with lower overhead compared to the previous works.  

The structure of this paper is as follows: section 2 introduces terminology. Section 3 

describes the main problem and motivation. The structure of the proposed technique is 

explained in Section 4. The simulation environment and the experimental results are shown in 

Section 5. Finally, section 6 concludes the paper. 

2. BACKGROUND 

All of the hardware techniques have been traditionally considered to achieve reliability 
requirements. However, it is not feasible for those where cost is a critical issue. The use of 
Commercial Off-The-Shelf (COTS) components for safety-critical applications has been 
suggested to accelerate the development cycle and produce cost effective systems. COTS 
components require specific approaches to take into account the effect of possible hardware 
faults [17]. 

Therefore, SWFT techniques are recently preferred. Re-execution of the program in 
different level of the code (thread, process, or instruction) is the basis of the most SWFT 
techniques to detect faults. However, these methods require around 100% performance 
overhead, which may not be suitable for most of the current applications and systems [11], 
[12], [13]. 

Consequently, the need for state-of-the-art high computing performances, coupled with 
cost containment, provides a strong motivation for investigating feasible alternatives to 
traditional solutions [17]. Transient or intermittent faults induced in hardware have an 
impact on software running on it, which is either data error or control flow error. Control-
flow errors occur when a processor jumps to an incorrect next instruction, which have been 
demonstrated to account for more than 70% of all errors [17]. So, due to the incidence and 
importance of control-flow errors and the advantages of software-based techniques, using 
the SCFC techniques for detecting control-flow errors has been shown as an effective and 
low-cost alternative solution to enhance the reliability of the processors. Moreover, previous 
studies on the effects of Multiple Bit Upset (MBU), the new types of errors, arisen due to 
shrinking the feature size of transistors, has demonstrated that the probability of occurrence of 
control-flow errors has been recently increased compared to the past [31]. 


246 M. MAGHOSUDLOO, H. R. ZARANDI 
 

Table 1 summarizes the features of different methods for enhancing the reliability of the 

CMPs. As Table 1 demonstrates, if the problem of high performance degradation, imposed by 

the SCFC techniques, can be moderated, the SCFC techniques will be more compatible with 

the features of the modern processors, compared to the other types of the proposed methods. 

Table 1 A general comparison among ifferent methods for enhancing the reliability of CMPS 

Categories 

Detecting hardware faults/errors Need 

program 

modification 

Need 

hardware 

modification 

Performance 

degradation In  

memory 

In  

combinational logic 

HWFT Most Almost all No Yes Low 

ECC Almost all None No Yes Low 

HWFT+ECC Almost all Almost all No Yes High 

HCFC Most Most No Yes Low 

SWFT Most Most Yes No Very high 

SCFC Most Most Yes No High 

3. PROBLEM AND MOTIVATION 

In the general case, almost all of the previous SCFC techniques are based on inserting 
some instructions into the program code. These instructions are responsible for monitoring the 
flow of the program execution and detecting any violation of run-time behavior. In fact, the 
fault model, considered in the paper, is the transient faults which lead to illegal sequence 
of basic blocks execution and control flow errors. 

Extracting the Control Flow Graph (CFG) from relations among basic blocks of a 
program code is always considered a prerequisite step in both of SCFC and HCFC methods. 
Any incorrectness and limitation in capturing the control dependencies among nodes of the 
CFG causes that the flow of a given program will not be precisely followed in checking 
phase. As Fig. 1(a) shows, after determining control dependencies among basic blocks of 
the program, each node of the CFG should be labeled by a unique signature, and to store the 
values of run-time signatures, a global variable or register should be reserved (for example 
variable Sig in Fig. 1(a)). 

After assigning signatures to each basic block (BB), types and locations of the checking 

and updating instructions should be specified. Fig. 1(b) demonstrates the method, used by 

the conventional techniques, for locating and specifying additional statements in the 

program code. Most of the previous SCFC techniques added these instructions at the 

beginning and/or at the end of the BBs in order to achieve high detection coverage. The 

sequence of the signatures is checked at run-time by conditional branch instructions. These 

instructions compare the value of the run-time signature with the pre-defined value assigned 

to each block at the design or compile time in order to detect any misbehavior. The run-time 

signatures should be updated, so that after checking instructions they confirm the correct 

execution. The major differences among previous related methods are in the content of 

updating statements. For example, the Control-Flow Checking by Software Signatures 

(CFCSS) technique [16] computes the signature of the destination blocks from the signature 

of the source block by implementing the XOR functions between the signature of the 

current node and the destination node. In addition, the Yet Another Control-Flow Checking 

using Assertions (YACCA) technique [17] defines a set of instructions that updates the 


 Parallel Execution Tracing: an Alternative Solution to Exploit Under-utilized Resources... 247 

 
signature using AND operation and XOR function with two constant masks. Finally, the 

Control-Flow Error Detection through Assertions (CEDA) technique [15] inserts fewer 

instructions (only a XOR operation) than the previous works by calculating signatures 

differently. Moreover, this technique can uniquely identify the node from which the illegal 

jump occurred, because for eventual CFE correction, stronger requirements are required. 

None of the above mentioned techniques can be used here as proposed. Therefore, this 

technique has been shown to be more efficient and effective compared to the prior methods 

[15], [18]. Fig. 1(b) shows three typical basic blocks of a program with the added instructions, 

specified according to CEDA. Due to the advantages of the CEDA, this technique is selected 

as the most efficient previous SCFC techniques to compare with the proposed method, in the 

rest of this paper. 

.L1: Br    Sig != 0001, Err

movl  Sig, Sig Xor 0011

pushl %ebp

movl %esp, %ebp

pushl %edi

pushl %esi

subl $44, %esp

movl $0, -28(%ebp)

movl $0, -36(%ebp)

Br    Sig != 0010, Err

movl  Sig, Sig Xor 0001

jmp .L3

.L2: Br    Sig != 0101, Err

movl  Sig, Sig Xor 0011

movl $0, -32(%ebp)

addl %eax, %eax

addl %edx, %eax

addl %edi, %eax

Br    Sig != 0110, Err

movl  Sig, Sig Xor 0001

jmp .L4

.L3: Br    Sig != 0011, Err

movl  Sig, Sig Xor 0111

movl $0, -28(%ebp)

addl %eax, %eax

addl %edx, %eax

addl %edi, %eax

cmpl $2, -22(%ebp)

addl $1, -30(%ebp)

Br    Sig != 0100, Err

movl  Sig, Sig Xor 0001

jle .L2

BB1

BB2

BB4

BB3

Sig = 1

Sig = 2

Sig = 3

Sig = 4

Sig = 5

Sig = 6

Sig = 7

Sig = 8

Sig = 9

 
 (a) (b) 

 
Fig. 1 The conventional algorithm for (a): assigning signatures, (b): locating the added 

instructions, and specifying types of the added instructions 


248 M. MAGHOSUDLOO, H. R. ZARANDI 
 

3.1. The problem 

Unfortunately, due to the execution of checking and updating statements, added at the 

beginning and at the end of each BB, applying previous SCFC techniques on the current 

processors will lead to high performance degradation. The conditional branch instructions, 

used as the usual checking statements by the previous SCFC techniques, take two clock 

cycles; and the XOR operation, used for updating the values of run-time signature, takes 

one clock cycle. So, the negative impacts of the checking statements on the execution time are 

more than the updating ones. On average, about 75% of performance degradation due to the 

previous SCFC techniques is imposed just because of executing the checking statements. This 

drawback of the SCFC techniques can undermine the obtained benefits of the modern 

processors and would waste the efforts taken by designers to improve the execution time of 

the programs and utilization of the processors. 

3.2. The motivation 

In order to reduce the negative impacts of the checking instructions, the features of 

HCFC techniques can be modeled. The HCFC techniques check the behavior of the main 

processor in parallel with the execution of the main program [24], [25]. In general case, an 

external checker or watchdog processor is responsible to check the sequence of the signatures 

of the main program, or the accesses of the main processor to the memory for detecting CFEs 

during the execution. Due to the reasons mentioned in the previous section, although HCFC 

techniques can moderate the main drawback (performance overhead) of SCFC techniques, 

these techniques require some modifications in hardware, which make it impossible to be 

used in COTS and current processors. 

The one effective way to take the advantages of parallel CFC in current architectures 

(without adding any hardware redundancy) is to use the under-utilized CPU capacities in 

multi-core processors. Utilization is the percentage of time that a component is actually 

occupied, compared to the total time that the component is available for use [10], [19], [29]. 

Unfortunately, developing parallel software for shared-memory multi-cores, using today’s 

programming languages, can be challenging. Therefore, due to the lack of high-level 

parallelization constructs, multi-core processors are not being fully leveraged [19], [29]. 

Fig. 2 shows the average percent of CPU usage of the system for three real applications 

from SPEC CPU2006 benchmarks suite when running on a real quad-core system (the 

execution times of applications in this figure are not real). In the example, the cores are only 

executing the designated application. The results show that since parallelism is not thoroughly 

possible, the CPU is not fully occupied during the execution of these applications (even 

when the number of running programs grows to 3) which becomes worse when the number 

of cores in the system increases. Therefore, there are enough resources in the system to 

run CFC mechanism during the idle cycles of CPUs. 


 Parallel Execution Tracing: an Alternative Solution to Exploit Under-utilized Resources... 249 

 
(a) 

 
(b) 

 
(c) 

Fig. 2 Percentage of CPU usage for three benchmarks: (a): bzip2, (b): bzip2*, (c): mcf 

4. THE PROPOSED TECHNIQUE FOR PARALLEL SCFC 

The idea, for exploiting under-utilized resources in multi-cores, is to develop a 

watchdog thread, which is responsible for checking the sequence of signatures, in parallel 

with updating phases of signatures by the main threads. As Fig. 3 shows, although redundancy 

at thread-level allows the operating system to freely schedule the competing threads across 

all available resources [1], the structure of the watchdog thread should be organized with 

regard to some important facts: 

0

10

20

30

40

50

60

70

1 4 7 10 13 16 19 22 25 28 31 34 37 40

%
 C

P
U

 U
sa

g
e
 

Time (s) 

0

10

20

30

40

50

60

70

1 4 7 10 13 16 19 22 25 28 31 34 37 40

%
 C

P
U

 U
sa

g
e
 

Time (s) 

0

10

20

30

40

50

60

70

1 4 7 10 13 16 19 22 25 28 31 34 37 40

%
 C

P
U

 U
sa

g
e
 

Time (s) 


250 M. MAGHOSUDLOO, H. R. ZARANDI 
 

1. Too many new dependencies should not appear between the watchdog and the 

main threads. Increasing the synchronization and communication dependencies 

among threads of a process will directly lead to the increased number of 

interruptions, and also reduction in the level of parallelism in the multi-threaded 

programs. So, as much as possible, the watchdog thread should not interrupt the 

execution of the main ones. 

2. The watchdog thread can check the value of the signatures, just between two 

consecutive updating phases of each main thread. Otherwise, the number of false 

positive CFEs (number of wrongly CFE detection) will be increased. Supposed 

that, the watchdog thread compares the value of the signature with the expected 

value, after two updating phases of the main thread. Then, the mismatch is detected as 

a CFE, while in reality it is not. 

According to these two facts, there is a conflict between them: how could the watchdog 

thread guarantee that the values of the signatures will be checked exactly between two 

consecutive updating phases of the main ones, while the execution of the main threads 

cannot be interrupted? This conflict between facts would be removed, if the values of the 

signatures are not overwritten till the certain time, so the watchdog has enough time to 

check the sequence of the signatures in each thread. This solution can be implemented by 

replacing a register with an array for storing the values of the signature. It should be noted 

that the virtual cores in Fig. 3 show a physical core that is assigned to a virtual machine. By 

default, virtual machines are allocated one core each. If the physical host has multiple cores 

at its disposal, however, then a core scheduler assigns execution contexts and the virtual 

core essentially becomes a series of time slots on logical processors. Therefore, the virtual 

cores are the software/hardware co-elements that is used by the system and does not show 

the virtualization in the operation of watchdog thread. 

V1 V2 V3 V4

C4C3C2C1

Application

Operating System

 
System

Software

System

Hardware

Main 

Thread

Watchdog 

Thread

Virtual Cores

Physical Cores

 
Fig. 3 The presence of watchdog thread in the architectural model of a quad-core processor 

Fig. 4 illustrates this idea by implementing two updating phases of a thread. The first 

bit of each element of the array is reserved, as a tag bit, to show the state of each element 

(U: Un-checked, C: Che cked). After initialization of the elements (step 1 in Fig. 4), the 

main thread update the value of the signature, stored in the first element, and the state of 

this element is changed to Un-checked (step 2). In the second updating phase, the second 

element of the array is updated, and the value of signature in the first element is still 


 Parallel Execution Tracing: an Alternative Solution to Exploit Under-utilized Resources... 251 

 
maintained. This process continues until the last element of the array is updated. During 

the updating phase and from the first element, the watchdog can check the elements of 

the array which is tagged with Un-checked. If any mismatch is observed, occurrence of 

the CFE is reported, otherwise, the state of the element is changed to Checked. When the 

last element of array is also updated (step 4), the main thread is suspended until the 

watchdog checks all elements, and the main thread is let to overwrite the elements of the 

array from the first one (steps 6, 7, and 8). 

0000

0000

0000

...

0001

0000

0000

...

0001

0001

0000

...
0001

0001

0001

...

S11

S12

S1b

0001

0001

0001

...

0010

0001

0001

...

0010

0010

0001

...

0010

0010

0010

...

C

C

C

C

C

C

U

C

C

U

U

C

C

C

C

C

U

U

C

C

U

U

C

C

C

U

                   (a)                                                  (b)                                                     (c)                                                  (d)

                  (e)                                                    (f)                                                    (g)                                                  (h)

S11

S12

S1b

S11

S12

S1b

S11

S12

S1b

S11

S12

S1b

S11

S12

S1b

S11

S12

S1b

S11

S12

S1b

 
Fig. 4 The steps of updating signatures stored in the signature array of each thread: 

(a):step1, ... , (h):step 8 (C:Checked, U:Un-Checked) 

 
Ignoring dependencies among running threads in phase of designing the CFGs causes 

that interactions among threads to be considered as CFEs. Using one shared global variable 

or register for storing the value of the signatures is the main reason of wrong CFE detection, 

since the running threads can access and alter the value of the signature register, 

simultaneously and without any limitation. For instance, in a dual-threaded program, if one 

thread updates the value of the signature between two consecutive checking phases of the 

other one, unexpected value of the signature is wrongly detected as a CFE by the second 

thread. These weaknesses in detection is known as false positive CFEs, when a technique 

detects some behavior in execution as CFEs, while in reality they are not. In other words, 

this mistake means that a positive inference about a CFE occurrence is actually false.  
 

However, reserving separated arrays for storing the run-time signatures of each thread, 

introduced as one of the perquisite steps of the proposed technique for parallel CFC. 

Threads can alter the values of the signatures, stored in their own specific arrays, and their 

accesses to the arrays of the other threads are limited to check the values of them. So, the 

proposed technique for parallel CFC guarantees that the described scenario for checking 

and updating the signatures in multi-threaded programs will not be happen during the 

run-time. 


252 M. MAGHOSUDLOO, H. R. ZARANDI 
 

1     i: a number assigned to each thread (id)

2     j: a number assigned to each element of arrays (index)

3     Sij: the value of element j in the array of thread i as the signature

4     Lij: the tag bit of element j in the array of thread i

5     Eij: the value of the expected value for element j in the array of thread i

6     Ti: the thread, identified by the value of variable i as its id

7     b: the number of elements in a array

8     n: the number of threads in the program

9     begin

10 x = 0,   i = 1

11 for  (j = 1,   j <= b,   j++)

12 begin

13 while  (Lij == 1)

14 begin

15 i++

16 if  (i >  n) then

17 i = 1

18 end

19 end

20 if  (Sij == Eij) then

21 x++,   Lij = 1

22 end

23 else

24 “CFE Occurrence”

25 end

26 i++

27 if  (i > n) then 

28 if  (x == n) then

29 i = 1

30 end

31 else

32 i = 1,   goto  13

33 end

34 end

35 if  (j == b)

36 Continue (Ti)

37 end

38 end

39 goto  10

40   end
 

Fig. 5 The algorithm of watchdog thread used in the proposed technique 

The configuration of the watchdog thread should be organized at the design time 

considering the information provided by the algorithm in Fig. 5. The following steps 

describe the structure of the watchdog thread: 


 Parallel Execution Tracing: an Alternative Solution to Exploit Under-utilized Resources... 253 

 
 To select the execution flow of the first thread for checking the sequence of 
signatures, the value of i is initialized to 1 (line 10). To ensure that the element j in the 

array of each thread has been checked (before increasing the value of the j), variable x 

is reserved to count the number of threads during each round of the algorithm. 

 To determine the state of selected element, whether it is Checked or Un-checked, 
the value of Lij is compared with 0 (line 13). 0 represents the Un-Checked, and 1 

indicates the Checked state. If this element of the selected array has been checked 

in the previous round, the value of i is increased by 1 (line 15), and the next thread 

from the list of threads is selected for checking the state of element j in its array. 

 The value of the signature is compared with the expected value to detect possible 
CFEs during execution (line 20). 

 To count the number of threads, which have been checked in this round, the value 
of variable x in increased by 1, and the state of element j is changed to Checked 

(line 21). 

 The value of i is increased by 1, to select the next thread from the list of threads 
(line 26). 

 If the value of i is still less than the number of threads in the program, this round is 
repeated until the element j of the last thread in the list of threads is also checked 

(line 27). 

 One step should be intended to ensure that the elements j of all threads has been 
checked (line 28). After the first round of each checking phase, if the value of x is 

less than the number of threads in the program, the tag bit of elements j for all 

thread will be re-checked. In the cases, where the main threads did not have 

enough time to update the signature stored in element j, state of their elements 

remains Un-Checked, and the process of checking the signature should be 

performed for them in the next rounds (line 32). 

 At line 38, the watchdog has already checked the element j of each thread. So, the 
value of j is increased by 1 to check the next element in the array of main threads 

(line 11). 

 At line 35, if the value of j is equal to the number of elements in array of each 
thread and all elements of each array have been already checked, then the watchdog 

let the main threads to continue the execution and overwrite the array from the first 

node (line 36), and the value of i and j are re-initialized (lines 10 and 11). 

Fig. 6 shows the structure of BBs in the proposed technique. Suppose that the size of 

arrays is five. Therefore, after five updating phases of the main threads, they will be 

suspended until the watchdog let them continue their executions and overwrite the arrays. 

This interruption in the executions of the main threads is inevitable, because the main 

threads should be ensured that overwriting the elements of the array will not be detected 

as a CFE in the cases where the watchdog has not had enough time to check the previous 

value of those elements. The experimental results will show that these interruptions have 

negligible side effects on the execution time of the program. Moreover, the code of watchdog 

thread algorithm can be also divided to basic blocks. So, the proposed techniques can be 

easily applied on the code of watchdog thread. 


254 M. MAGHOSUDLOO, H. R. ZARANDI 
 

L11=0 , Updating S11

L12=0 , Updating S12

L13=0 , Updating S13

L14=0 , Updating S14

L15=0 , Updating S15

Suspending

L11=0 , Updating S11

L12=0 , Updating S12

L13=0 , Updating S13

Li1=0 , Updating Si1

Li2=0 , Updating Si2

Li3=0 , Updating Si3

Li4=0 , Updating Si4

Li5=0 , Updating Si5

Suspending

Li1=0 , Updating Si1

Li2=0 , Updating Si2

Li3=0 , Updating Si3

T1 Ti...

...

...

 
Fig. 6 The structure of the BBs in the proposed technique 


 Parallel Execution Tracing: an Alternative Solution to Exploit Under-utilized Resources... 255 

 
5. EXPERIMENTAL RESULTS 

In order to evaluate our design decisions, a quad-core processor system, Intel® core™ 

i7-740QM with 6.00 GB Memory RAM, is used as the simulation environment with a 

real operating system (Red Hat Enterprise Linux AS release 10). Also, the behaviors of 

eleven well-known programs of SPEC CPU2006 [30] have been studied on the simulation 

environment. 

To implement the real behaviors of soft errors which lead to program sequence 

changes, a software function was written as a saboteur thread that runs simultaneously 

with the main program. This thread has accesses to the registers, such as instruction pointer, 

stack pointer, and also memory spaces with which program’s threads have interactions. 

During simulation, the saboteur thread manipulates the content of registers and variables in a 

random fashion (like the effects of the bit-flips and stuck-at fault models). 

Also, to observe the effects of the injected errors and to see what is going on inside a 

program after error injection, the GNU Project Debugger (GDB) [32] has been used. The 

GDB can perform several operations to help you catch misbehaviors of the execution in 

operation, such as examine what has happened when your program has stopped, make your 

program stop on specified conditions, and so on. In this work, the GDB has been utilized to 

identify the destinations of the illegal branches, occurred due to the error injections. 

Fig. 7 shows the performance overhead of previous SCFC techniques due to checking 

and updating instructions. The information provided by the figure confirms the fact that 

the checking instructions have a greater impact on the performance degradation. On 

average, about 26.48% out of 35.31% performance overhead of the previous SCFC 

techniques is imposed just due to the execution time of checking statements. 

 
Fig. 7 The percentage of performance overhead due to the previous techniques 

0

10

20

30

40

50

%
 P

e
r
fo

r
m

a
n

c
e
 O

v
e
r
h

e
a

d
 

Benchmarks 

Due to the execution of updating statements

Due to the execution of checking statements


256 M. MAGHOSUDLOO, H. R. ZARANDI 
 

Fig. 8 compares the performance overhead of the related techniques. As illustrates by 

the figure, while array size increases, the number of interruptions during the execution and 

subsequently the performance overhead, is considerably reduced. For example, when the 

array size increases to 5, 10, and 15 the performance degradation is reduced to 24%, 20% 

and 19%, respectively. On the other hand, when the array size is equal to 1, as similar as 

previous SCF C techniques, the checking phase should take place exactly before each 

updating phase. Furthermore, in this case, adding a new phase to wait for watchdog thread 

causes increased performance overhead compared to the other cases. Therefore, increasing 

the size of the arrays plays an impressive role in moderating the negative effects of the 

conventional SCFC techniques on the execution time of the programs. 

 
Fig. 8 Comparison of the performance overhead of the related techniques 

 
Fig. 9 represents the average percentage of CPU usage for each benchmark during 

their executions in the presence of the watchdog thread compared to the average CPU 

usage for the original programs. With regard to the figure, the average percentage of CPU 

usage for the original programs is less than 22%. So, there are more underutilized CPU 

resources in multi-core processors during the execution which can be exploited for 

parallel SCFC. Also, adding the watchdog thread to the set of the running threads in the 

programs increases the CPU utilization up to 19%, relatively. 

As similar as some previous works, to give a general comparison among all the 

methods, which also takes the error detection coverage and the performance overhead into 

account; a metric called Method Efficiency [15], [18], [26], [27] is defined to estimate the 

efficiency of the methods: 

 
0

10

20

30

40

50

60

70

80

%
 P

e
r
fo

r
m

a
n

c
e
 O

v
e
r
h

e
a

d
 

Benchmarks 

Due to the conventional techniques Due to the proposed technique (array size = 1)

Due to the proposed technique (array size = 5) Due to the proposed technique (array size = 10)

Due to the proposed technique (array size = 15)


 Parallel Execution Tracing: an Alternative Solution to Exploit Under-utilized Resources... 257 

 
                     [
(

 
)

                   
]      

 
Fig. 9 Average percentage of the CPU usage for the programs during their executions 

Table 2 compares the value of four important parameters obtained after applying the 

related techniques on the programs. The conventional techniques with two sets of the added 

instructions at the end and at the beginning of each BB is shown as The Type1 Conventional 

Techniques, and the conventional techniques with one set of the added instructions at the 

end and/or at the beginning of each BB is shown as The Type2 Conventional Techniques. 

Table 2 General comparison among related techniques in terms of four impressive factors 

 
Category Technique 

Error detection 
coverage 

(%) 

Performance 
overhead 

(%) 

Method efficiency 
Error detection 

latency 

(cycle) 
bzip2 mcf sjeng astar bzip2 mcf sjeng astar bzip2 mcf sjeng astar bzip2 mcf sjeng astar 

T
h
e
 

c
o
n
v
e
n
ti

o
n

a
l 

te
c
h
n
iq

u
e
s Type1 

(like [15]) 
93.0 98.9 89.9 95.7 34.7 39.9 46.4 38.6 0.41 2.28 0.21 0.60 7.1 6.3 6.4 6.8 

Type2 

(like [16]) 
82.1 86.5 80.4 84.1 17.9 21.3 25.1 20.2 0.31 0.35 0.20 0.31 9.0 8.2 7.8 8.8 

T
h
e
 

p
ro

p
o
se

d
 

te
c
h
n
iq

u
e
s With L=1 93.0 98.9 89.9 95.7 56.4 66.5 79.7 64.0 0.25 1.37 0.12 0.36 11.9 10.3 9.4 11.2 

With L=5 93.0 98.9 89.9 95.7 24.2 28.5 34.2 27.4 0.59 3.19 0.29 0.85 19.1 16.2 16.2 18.1 
With L=10 93.0 98.9 89.9 95.7 20.2 23.7 28.5 22.8 0.70 3.84 0.35 1.01 35.2 33.0 32.8 34.9 

With L=15 93.0 98.9 89.9 95.7 18.8 22.2 26.6 21.3 0.76 4.09 0.37 1.09 45.0 42.9 41.0 44.6 

According to the table, the efficiency of the proposed techniques is more than the 

efficiency of the conventional techniques, especially when the length of the arrays grows to 5, 

10 and 15. It brings to this concept that the proposed technique can effectively moderate the 

negative impacts of the previous techniques on the execution time of the programs. The only 

0

10

20

30

40

50

60

70

%
 A

v
e
r
a

g
e
 C

P
U

 U
sa

g
e
 

Benchmarks 

Without watchdog thread With watchdog thread


258 M. MAGHOSUDLOO, H. R. ZARANDI 
 

drawback of the proposed technique is the increase of the error detection latency which can be 

controlled by adjusting the length of the array with respect to the type of applications and 

systems. On average, the error detection latency of the previous techniques is about 7 cycles, 

while this value is increased to 18, 35, and 44 cycles for the proposed techniques, with the 

arrays size of 5, 10 and 15, respectively. It is concluded that the size of array should be 

selected with respect to the type of applications and systems. For example, for real-time 

systems that error detection latency is important, the arrays with shorter lengths are better than 

the other cases. The efficiency of the proposed techniques is more than the efficiency of the 

conventional techniques, especially when the length of the arrays grows to 5, 10 and 15. 

Therefore, for high performance computing, where performance is the main goal of design 

process, the arrays with longer lengths are better than the other cases. 

6. CONCLUSIONS 

In this paper, an efficient software behavior-based technique was presented in order to 

detect control-flow errors in multi-threaded architectures. The goal is to enhance the 

applicability of the related techniques in order to be employed in multi-core processors. 

The key innovation is to make some changes in the structure of software-based control-

flow checking techniques to exploit under-utilized resources in multi-core processors for 

parallel control-flow checking. Error injection experiments have shown that the proposed 

technique, when applied on the programs, can detect the CFEs in over 94%. The latency 

needed for detecting the CFEs is considerably less than the related techniques which have 

been recently published. A metric for estimating and comparing the efficiency of the 

methods was defined, and it was shown that the proposed technique is more efficient 

compared to conventional methods in order to be used in multi-core architectures. 

REFERENCES 

[1] S. Kumar, S. Hari, M. Li, P. Ramachandran, B. Choi and S. V. Adve, “mSWAT: Low-Cost Hardware 
Fault Detection and Diagnosis for Multicore Systems,” In Proc of the 42th Annual International 
Symposium on Microarchitecture, 2009. 

[2] G. A. Reis, J. Chang, N. Vachharajani, R. Rangan and D. I. August, “SWIFT: Software Implemented Fault 
Tolerance,” In Proc. of the 3rd International Symposium on Code Generation and Optimization, 2005, pp. 

243-254. 
[3] W. Shi, H. Hsin, S. L. Laura Falk and M. Ghosh, “An Integrated Framework for Dependable and Revivable 

Architecture Using Multicore Processors,” In Proc. of the 33th Annual International Symposium on Computer 

Architecture, 2006. 
[4] N. Aggarwal, P. Ranganathan, N. Jouppi and J. Smith, “Configurable Isolation: Building High Availability 

Systems with Commodity Multi-Core Processors,” In Proc. of the 34th Annual International Symposium on 

Computer Architecture, 2007, pp. 340-347.  
[5] M. Manoochehri, M. Annavaram and M. Dubois, “CPPC: Correctable Parity Protected Cache,” In Proc. 

of the 44th Annual International Symposium on Microarchitecture, 2011. 

[6] T. Slegel, R. Averill III, M. Check, B. Giamei, B. Krumm, C. Krygowski, W. Li, J. Liptay, J. MacDougall, T. 
McPherson, J. Navarro, E. Schwarz, K. Shum, and C. Webb, “IBM’s S/390 G5 Microprocessor design,” IEEE 

Micro, vol. 19, pp. 12–23, 1999. 


 Parallel Execution Tracing: an Alternative Solution to Exploit Under-utilized Resources... 259 

 
[7] B. F. Romanescu and D. J. Sprin, “Core Cannibalization Architecture: Improving Lifetime Chip Performance 

for Multicore Processors in the Presence of Hard Faults,” In Proc. of the 17th International Conference on 
Parallel Architecture and Compilation Techniques, 2008, pp. 43-50. 

[8] M. D. Powell, A. Biswas, S. Gupta and S. S. Mukherjee, “Architectural Core Salvaging in a Multi-Core 
Processor for Hard-Error Tolerance,” In Proc. of the 36th Annual International Symposium on Computer 
Architecture, 2009, pp. 93-104. 

[9] P. M.Wells, K. Chakraborty and G. S. Sohi, “Mixed-Mode Multicore Reliability,” In Proc. of the 14th 
International Conference on Architectural Support for Programming Languages and Operating Systems, 2009, 
pp. 169-180. 

[10] H. Aliee, H. R. Zarandi and A. Tajary, “Dynamically Scheduled Process-Level Redundancy to Tolerate Faults 
in Multi-cores,” In Proc. of the 9th IEEE International Conference on High Performance Computing & 

Simulation, 2011. 

[11] A. Meixner and D. J. Sorin, “Detouring: Translating Software to Circumvent Hard Faults in Simple Cores,” In 
Proc. of the 38th IEEE/IFIP International Conference on Dependable Systems and Networks, 2008 pp. 80-89. 

[12] A. Shye, T. Moseley, V. J. Reddi, J. Blomstedt and D. A. Connors, “Using Process-Level Redundancy to 
Exploit Multiple Cores for Transient Fault Tolerance,” In Proc. of the 37th IEEE/IFIP International Conference 

on Dependable Systems and Networks, 2007. 
[13] A. Shye, J. Blomstedt, T. Moseley, V. J. Reddi and D. A. Connors, “PLR: A Software Approach to 

Transient Fault Tolerance for Multi-Core Architectures,” IEEE Transactions on Dependable and Secure 

Computing, 2008. 
[14] M. Fazeli, R. Farivar and G. Miremadi, “Error Detection Enhancement in PowerPC Architecture-based 

Embedded Processors,” Journal of Electronic Testing: Theory and Applications, vol. 24, pp. 21-33, 2008. 

[15] R. Vemu and J. A. Abraham, “CEDA: Control-flow Error Detection through Assertions,” In Proc. of the 12th 
IEEE International On-Line Testing Symposium, 2006, pp. 151–158. 

[16] N. Oh, P. Shirvani and E. McCluskey, “Control-flow Checking by Software Signatures,” IEEE Transactions on 
Reliability, vol. 51, pp. 111–122, 2002. 

[17] O. Goloubeva, M. Rebaudengo, M. Sonza Reorda and M. Violante, “Improved Software-Based Processor 
Control-Flow Errors Detection Technique,” In Proc. of the 50th Reliability and Maintainability Syposium, 

2005, pp. 583-589. 

[18] R. Vemu, S. Gurumurthy and J. A. Abraham, “ACCE: Automatic Correction of Control-flow Errors,” In Proc. 
of the IEEE International Test Conference, 2007, pp. 1-10. 

[19] G. Blake, R. G. Dreslinski, T. Mudge, K. Flautner, “Evolution of Thread-Level Parallelism in 
DesktopApplications,” In Proc. of the 43th Annual International Symposium on Microarchitecture, ISCA, 

2010. 

[20] M. Rimén, J. Ohlsson and J. Karlsson, “Experimental evaluation of control flow errors,” In Proc. of the 
Pacific Rim International Symposium on Fault Tolerant Systems, 1995, pp. 238–243. 

[21] M.A. Schuette, J.P. Shen, "Processor Control Flow Monitoring Using Signatured Instruction Streams," 
IEEE Transactions on Computers, vol. 36, pp. 264-276, 1987. 

[22] A. Rajabzadeh and G. Miremadi, “Transient detection in COTS processors using software approach,” 
Journal of Microelectronics Reliability, vol. 46, pp. 124-133, 2006.  

[23] A. Rajabzadeh, G. Miremadi and M. Mohandespour, “Error detection enhancement in COTS superscalar 
processors with performance monitoring features,” Journal of Electronic Testing: Theory and Applications, vol. 

20, pp. 553-567, 2004. 
[24] A. Rajabzadeh and G. Miremadi, “CFCET: A Hardware-Based Control Flow Checking Technique in 

COTS Processors Using Execution Tracing,” Elsevier Journal of Microelectronics Reliability, vol. 46, 

pp. 959-972, 2006. 
[25] U. Schiffel, A. Schmitt, M. Süßkraut and C. Fetzer, “ANB- and ANBDmem-Encoding: Detecting Hardware 

Errors in Software,” In Proc. of the 29th International Conference on Computer Safety, Reliability and Security, 

2010, pp. 169-182. 
[26] H. R. Zarandi, M. Maghsoudloo and N. Khoshavi,“Two Efficient Software Techniques to Detect and Correct 

Control-flow Errors,” In Proc. of the 16th IEEE Pacific Rim International Symposium on Dependable 

Computing,2010, pp. 141-148. 
[27] M. Maghsoudloo, H.R. Zarandi, S. Pour-Mozaffari and N. Khoshavi, “Soft Error Detection Technique in 

Multi-threaded Architectures Using Control-Flow Monitoring,” In Proc. of the 14th Euromicro 

Conference on Digital System Design, Architecture, Methods and Tools, 2011. 
[28] E. D. Berger, T. Yang, T. Liu and G. Novark, “Grace: Safe Multithreaded Programming for C/C++,” In 

Proc. of the 24th SIGPLAN Conference on Object Oriented Programming Systems Languages and 

Applications, 2009, pp.81-96. 
[29] David Patterson, “The Trouble with Multi-Core,” Journal of IEEE Spectrum, vol. 47, pp. 28-32, 2010. 

http://www.springerlink.com/content/?Author=Ute+Schiffel
http://www.springerlink.com/content/?Author=Andr%c3%a9+Schmitt
http://www.springerlink.com/content/?Author=Martin+S%c3%bc%c3%9fkraut
http://www.springerlink.com/content/?Author=Christof+Fetzer


260 M. MAGHOSUDLOO, H. R. ZARANDI 
 
[30] Standard Performance Evaluation Corporation. SPEC CPU2006 Benchmarks, http://www.specbench.org/ 

cpu2006, 2006. 
[31] Sh. Parsaeeian, and A. Rajabzadeh, “Multi-Bit Upset Faults Correction in Embedded Systems”, Journal 

of Iran Computer Society, 2011. 

[32] GDB: The GNU Project Debugger, http://www.gnu.org/s/gdb.