FACTA UNIVERSITATIS Series: Electronics and Energetics Vol. 35, No 2, June 2022, pp. 155-186
https://doi.org/10.2298/FUEE2202155N
© 2022 by University of Niš, Serbia | Creative Commons License: CC BY-NC-ND

Review paper

FIFTY YEARS OF MICROPROCESSOR EVOLUTION: FROM SINGLE CPU TO MULTICORE AND MANYCORE SYSTEMS

Goran Nikolić, Bojan Dimitrijević, Tatjana Nikolić, Mile Stojčev
University of Niš, Faculty of Electronic Engineering, Niš, Serbia

Abstract. Nowadays, microprocessors are among the most complex electronic systems ever designed. One small silicon chip can contain the complete processor, a large memory and the logic needed to connect it to input-output devices. The performance of today's processors implemented on a single chip surpasses that of a room-sized supercomputer from just 50 years ago, which cost over $10 million [1]. Even the embedded processors found in everyday devices such as mobile phones are far more powerful than computer developers once imagined. The main components of a modern microprocessor are a number of general-purpose cores, a graphics processing unit, a shared cache, memory and input-output interfaces, and a network-on-chip to interconnect all of these components [2]. The speed of a microprocessor is determined by its clock frequency and cannot exceed a certain limit: as the frequency increases, the power dissipation increases too, and the amount of heating becomes critical. Silicon manufacturers therefore decided to design a new processor architecture, called the multicore processor [3]. To increase performance and efficiency, the multiple cores execute multiple instructions simultaneously, increasing the amount of parallel computing, or parallelism [4]. In spite of these advantages, numerous challenges must be addressed carefully when more cores and more parallelism are used. This paper presents a review of microprocessor microarchitectures, discussing their generations over the past 50 years. It then describes the currently used implementations of the microarchitecture of modern microprocessors, pointing out the specifics of parallel computing in heterogeneous microprocessor systems. To use the possibilities of multi-core technology efficiently, software applications must be multithreaded: program execution must be distributed among the cores so that they can operate simultaneously. To use multithreading, it is imperative for the programmer to understand the basic principles of parallel computing and parallel hardware. Finally, the paper provides details of how hardware parallelism is implemented in multicore systems.

Key words: Microprocessor, Pipelining, Superscalar, Multicore, Multithreading

Received April 13, 2022
Corresponding author: Goran Nikolić
University of Niš, Faculty of Electronic Engineering, 18106 Niš, Aleksandra Medvedeva 14, Serbia
E-mail: goran.nikolic@elfak.ni.ac.rs

1. INTRODUCTION

A microprocessor (a processor implemented on a single chip) is one of the most inventive technological innovations in electronics since the invention of the transistor in 1947. This amazing device has brought many innovations to the field of digital electronics and has become a part of people's everyday life. The microprocessor is the central processing unit (CPU) and an essential component of the computer [5]. Nowadays, it is a silicon chip composed of millions to billions of transistors and other electronic components.
The CPU can execute several hundred million, or even billions of, instructions per second. A microprocessor is preprogrammed to execute software in conjunction with memory and special-purpose chips. It accepts digital data as input and processes it according to the instructions stored in memory [6]. The microprocessor performs numerous functions, including data storage, interaction with input-output devices, and time-critical execution, among others. Applications of microprocessors range from very complex process controllers to simple devices and even toys. Therefore, it is necessary for every electronics engineer to have a solid knowledge of microprocessors. This article discusses the types of microprocessors and their 50-year evolution.

The evolution of microprocessors throughout history has been turbulent. The first microprocessor, the Intel 4004, was designed by Intel in 1971. It was composed of about 2,300 transistors, was clocked at 740 kHz and delivered 92,000 instructions per second while dissipating around 0.5 W. After that, almost every year a new microprocessor was launched with significant performance improvements over its predecessors. The growth in performance was exponential, on the order of 50% per year, resulting in a cumulative growth of over three orders of magnitude over a two-decade period [7]. These improvements have been driven by advances in the semiconductor manufacturing process and by innovations in processor architecture [8].

Multicore processing has posed new challenges for both hardware designers and application developers. Parallel applications place new demands on the processing system. Although a multicore architecture designed for a specific target problem gives excellent results, it should be borne in mind that the main goal in computer system design should be the ability to handle different types of problems efficiently. However, a single "one size fits all" architecture, able to solve all challenges effectively, has not been found so far, and many are convinced that it never will be [9].

This article presents a review of the microarchitecture of contemporary microprocessors. The discussion starts with 50 years of microprocessor history and its generations. It then describes the currently used microarchitecture implementations of modern microprocessors, and finally points to the specifics of parallel computing in heterogeneous microprocessor systems. This article is intended for an advanced course on computer architecture, suitable for graduate students or senior undergraduates in computer and electrical engineering. It can also be useful for practitioners in industry in the area of microprocessor design.

2. DEFINITION OF MICROPROCESSOR

The Central Processing Unit (CPU), also known as the processor or microprocessor, is the controlling unit of a micro-computer, contained in a small chip. The CPU is often referred to as the brain and heart of all computer (digital) systems and is responsible for doing all the work: it performs every single action a computer does and executes programs. In essence, the CPU performs Arithmetic Logic Unit (ALU) operations and communicates with the input/output devices and auxiliary storage units connected to it. In modern computers, the CPU is contained on an integrated circuit chip in which several functions are combined [10].
In general, all CPUs, whether single-chip microprocessors or multi-chip implementations, run programs by performing the following steps:
1. Read an instruction and decode it.
2. Find any associated data needed to process the instruction.
3. Process the instruction.
4. Write the results out.
The instruction cycle is repeated continuously until the power is turned off.

A microprocessor is built from the following three basic circuit blocks [11]: 1. registers, 2. the ALU, and 3. the Control Unit (CU). Registers can exist in two forms: either as an array of static memory elements such as flip-flops, or as a portion of a Random Access Memory (RAM), which may be of the dynamic or static type. The ALU usually provides, at a minimum, facilities for addition, subtraction, OR, AND, complement, and shift operations. The CU regulates and integrates the operations of the computer: it selects and retrieves instructions from main memory in the appropriate order and interprets them so as to activate the other functional building blocks of the system at the appropriate time to perform their proper operations.

3. GENERATION AND MICROPROCESSOR HISTORY

The transistor was invented at Bell Laboratories on December 23rd, 1947, whereas the integrated circuit was invented at Texas Instruments in 1958. In 1971, Intel (INTegrated ELectronics) introduced the first microprocessor. The evolution of the CPU can be divided into five generations [12], and the characteristics of these generations are discussed in the sequel.

1st Generation: The first-generation microprocessors were introduced in 1971-1972, when Intel launched the first microprocessor, the 4004, running at a clock speed of 740 kHz. Other microprocessors belonging to this generation are the Rockwell International PPS-4, the Intel 8008, and the National Semiconductor IMP-16. Instruction processing in these CPUs was serial: the fetch, decode and execute phases were performed sequentially. When the current instruction finished, the CPU updated the instruction pointer and fetched the next instruction in the program sequence, and so on for each instruction in turn.

2nd Generation: This was the period from 1973 to 1978, in which very efficient 8-bit microprocessors were implemented, such as the Motorola 6800 and 6801, the Intel 8085, and the Zilog Z80, which were among the most popular. The second generation of microprocessors is characterized by overlapped fetch, decode, and execute phases: while the first instruction is processed in the execution unit, the second instruction is decoded and the third instruction is fetched. Compared to the first generation, a novelty of the second generation was the use of new semiconductor technologies for chip manufacture. The gains were a significant increase in instruction execution speed and in chip density.

3rd Generation: The third-generation microprocessors were introduced in 1978, as represented by the Intel 8086 and the Zilog Z8000. From 1979 to 1980, the Intel 8086/80186/80286 and the Motorola 68000 and 68010 were developed. Processors of this generation were 16-bit, four times faster than those of the previous generation, and with performance comparable to minicomputers [13], [14], [15]. A novelty of this generation was the development of proprietary microprocessor architectures, each based on the vendor's own instruction set.
4th Generation: The fourth generation is characterized by the development of 32-bit microprocessors during the period from 1981 to 1995. Typical products were the Intel 80386 and Motorola's 68020/68030. Microprocessors of this generation are characterized by higher chip density, up to a million transistors. High-end microprocessors of the time, such as Motorola's 88100 and Intel's 80960CA, could issue and retire more than one instruction per clock cycle [16], [17].

5th Generation: From 1995 until now, this generation has been characterized by 64-bit processors that have high performance and run at high speeds. Typical representatives are the Pentium, Celeron, and dual- and quad-core processors, which use superscalar processing and whose chip designs exceed 10 million transistors. The 64-bit processors became mainstream in the 2000s. Microprocessor speeds were limited by power dissipation; in order to avoid implementing expensive cooling systems, manufacturers were forced to use parallel computing in the form of multi-core and many-core processors. The microprocessor has thus evolved through all these generations, and the fifth-generation microprocessors represent an advancement in specifications. Some of the fifth-generation processors, with their specifications, are briefly discussed in the text that follows.

4. CLASSIFICATION OF PROCESSORS

Processors can be classified along several orthogonal dimensions; here we briefly point to some of the most commonly used. The first classification is based on microarchitecture specifics, the second on the market segment, the third on the type of processing, etc. In this article we focus on the first classification scheme. For more details about these matters, the reader can consult Reference [10].

4.1. Classification by microarchitecture specifics

In general, we distinguish the following classifications:

4.1.1. Pipelined vs Non-Pipelined Processors

A non-pipelined processor executes only a single instruction at a given time. The start of the next instruction is delayed until the current one ends, not because of hazards but unconditionally. When the CPU is free, the scheduler chooses the next instruction from the pool of waiting instructions. Pipelining, in contrast, is a technique in which multiple instructions are overlapped during execution. A pipelined processor is divided into several processing stages (segments), mutually connected in the form of a pipe. The constituents of each stage are an input register and a combinational circuit: the role of the register is to hold the data, and the role of the combinational circuit is to process it. The combinational circuit outputs the processed data to the input register of the next segment.

The pipeline technique is divided into two categories: a) arithmetic pipelines, which are mainly used for floating-point operations, multiplication of fixed-point numbers, etc.; and b) instruction pipelines, in which instructions are executed with overlapped fetch, decode and execute phases. The pipeline technique increases instruction-level parallelism (ILP) and is used by all processors nowadays [18].
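The timing benefit of overlapped stages can be made concrete with a small simulation. The following C sketch is illustrative only; the stage names, stage count, and instruction count are our own assumptions, not those of any particular processor. It prints which instruction occupies which stage in each cycle of an ideal, hazard-free three-stage pipeline, showing that N instructions finish in S + N − 1 cycles instead of the S × N cycles a non-pipelined CPU would need.

```c
/* Illustrative sketch (not any real processor): a cycle-by-cycle view of an
 * ideal 3-stage fetch/decode/execute pipeline with no hazards or stalls. */
#include <stdio.h>

#define STAGES  3
#define N_INSTR 6

int main(void) {
    const char *stage_name[STAGES] = { "FETCH", "DECODE", "EXECUTE" };
    int total_cycles = STAGES + N_INSTR - 1;      /* 3 + 6 - 1 = 8 cycles */

    for (int cycle = 1; cycle <= total_cycles; cycle++) {
        printf("cycle %d:", cycle);
        /* Instruction i (1-based) occupies stage s (0-based) in cycle i + s,
         * so in a given cycle the instructions in flight are cycle - s. */
        for (int s = 0; s < STAGES; s++) {
            int i = cycle - s;
            if (i >= 1 && i <= N_INSTR)
                printf("  I%d:%s", i, stage_name[s]);
        }
        printf("\n");
    }
    printf("pipelined: %d cycles, non-pipelined: %d cycles\n",
           total_cycles, STAGES * N_INSTR);
    return 0;
}
```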
Table 1 Difference between pipelining and non-pipelining systems

Pipelining system | Non-pipelining system
Multiple instructions are overlapped during execution | The fetch, decode, execute and memory-write phases are merged into a single unit (step)
Several instructions are executed at the same time | Only one instruction is executed at a time
Efficiency is determined by the design of the CPU scheduler | Efficiency does not depend on the CPU scheduler
Execution time is shorter (fewer cycles) | Execution takes more time (a greater number of cycles)

In addition to the fact that pipelining increases overall system performance, there are several factors that cause conflicts and degrade performance. Among the most important factors are the following:
1. Timing variations - The processing time of instructions is not the same, because different instructions may require different operands (constants, registers, memory). Accordingly, the pipeline stages do not always consume the same amount of time.
2. Data hazards - This problem arises when several instructions are partially executed in the pipeline and two or more of them refer to the same data. In that case, it must be ensured that the next instruction stalls until the current instruction has finished processing that data, because otherwise an incorrect result will occur.
3. Branching - The next instruction is fetched during the execution of the current one. However, if the current instruction is a conditional branch, the next instruction will not be known until the current one completes data processing and the branch outcome is determined.
4. Interrupts - Interrupts affect the execution of instructions by inserting unwanted instructions into the current instruction stream.
5. Data dependency - This problem occurs when the result of the previous instruction is not yet available but is already needed as data for the current instruction.
The main advantages of pipelining are a higher clock frequency and increased system throughput. The disadvantages of this technique are primarily the greater complexity of the design and the increased latency of individual instructions.

4.1.2. In-Order vs Out-of-Order Processors

A processor that executes instructions strictly sequentially usually uses its resources inefficiently, resulting in poor performance. Two approaches can be used to improve processor performance. The first deals with simultaneously executing different sub-steps of consecutive instructions, or even executing instructions completely simultaneously. The second refers to out-of-order instruction execution, which is achieved by executing instructions in a different order from the original one [1], [19]. The instruction order is determined by the compiler, but it is not necessary to execute the instructions in that order. They may be: a) issued in order and completed in order; b) issued in order but completed out of order; c) issued out of order but completed in order; or d) issued out of order and completed out of order [1].

First- and second-generation microprocessors process instructions in order. An in-order processor performs the following steps:
1. It retrieves an instruction from program memory.
2. If the input operands are available in the register file, it sends the instruction to the execution unit for execution.
3. If the input operands are not available during the current clock cycle, the processor waits for them.
This case occurs when the processor retrieves data from slow memory. This implies that instructions are statically scheduled.
4. The instruction is then executed by the appropriate execution unit.
5. After that, the result is written back into the destination register.

Out-of-order execution is the approach used in third-, fourth-, and fifth-generation microprocessors. This approach significantly reduces the latency of instruction execution. Its specificity is that the processor executes instructions in the order of data or operand availability, rather than in the original order of the instructions generated by the compiler. In this way, the processor avoids waiting states, because during the execution of the current instruction it obtains the operands for the next one. For example, let I1 and I2 be two instructions, where I1 is first and I2 second. In out-of-order execution, the processor may execute instruction I2 before instruction I1 has completed. This feature improves CPU performance, as it allows execution with less latency. The steps performed by an out-of-order processor are as follows:
1. It retrieves an instruction from program memory.
2. The instruction is sent to an instruction queue (also called an instruction buffer).
3. The instruction waits in the queue until its input operands are available, and leaves the queue when they are. This implies that instructions are dynamically scheduled.
4. The instruction is sent to the appropriate execution unit for execution.
5. The results are then queued.
6. Once all previous instructions have had their results written back to the register file, the current result is written back into the destination register.
The main goal of out-of-order instruction execution is to increase the amount of ILP. Note, however, that the hardware complexity of out-of-order processors is significantly higher than that of in-order ones.

5. SCALAR VS SUPERSCALAR PROCESSORS

A scalar processor is one in which instructions are executed in a pipeline, as presented in Fig. 1a), but only a single instruction can be fetched or decoded in a single cycle. A superscalar processor, on the other hand, can have multiple parallel instruction pipelines [20], [21]. A 2-way superscalar processor (see Fig. 1b)) can fetch two instructions per cycle and supports two parallel pipelines. The terms "scalar" and "superscalar" are not to be confused with "single-core"/"multi-core": scalar processors are single-core, while superscalar processors may be either single- or multi-core [22]. The key point is that a scalar processor cannot perform more than one operation (i.e., carry out more than one instruction) per clock cycle, whereas a superscalar processor can perform two or more. This means that if you have a CPU with three cores, one of them being an old scalar core, and you run an application that utilizes all three cores, the old third core will be no more than half as fast as it would be if it were superscalar. The main thing to remember is that certain instruction sets are better suited to certain optimizations. A superscalar processor can execute basic operations such as add and load on separate registers simultaneously, whereas a scalar processor has to complete one operation before moving on to the next. For example, a scalar processor may be able to run multiple threads, but they will all share the same core and therefore only run as fast as the slowest thread.
Superscalar processors can provide much higher performance because each thread gets its own core/execution unit.

Fig. 1 Scalar processor (a), superscalar processor (b)

The main challenge in superscalar processing is how many instructions can be issued per cycle. If a processor can issue k instructions per cycle, it is called a k-degree superscalar processor. For a superscalar processor to take full advantage of parallelism, k instructions must be executable in parallel. The key idea of a superscalar processor is thus to exploit more instruction-level parallelism (ILP) [18].

The implementation of superscalar processing requires special hardware (see Fig. 2 for more details). The data path grows with the degree of superscalar processing. For instance, if a 2-degree superscalar processor is used and the instruction size is 32 bits, then 64 bits of data are fetched from the instruction memory and two instruction registers are required.

Fig. 2 Comparison between a scalar and a superscalar processor. Notice: the superscalar processor implements one pipeline dedicated to memory access and one pipeline for arithmetic operations.

The main feature of superscalar processors is the issuing of more than one instruction in each cycle (usually up to 8 instructions). Note that the instructions may change order to make better use of the processor architecture. In order to reduce data dependencies in superscalar processing, more complex parallel hardware is necessary. Hardware parallelism ensures the availability of more resources and is one way of exploiting parallelism. An alternative is to use the ILP that can be exposed by transforming the source code with an optimizing compiler. Typical commercial superscalar processors are the IBM RS/6000, DEC 21064, MIPS R4000, PowerPC, Pentium, etc.

Very-Long-Instruction-Word (VLIW) processors are a variant of superscalar processors in that they can process multiple instructions in all pipeline stages [23]. The VLIW processor has the following features: (a) it is an in-order processor; (b) the binary code defines which instructions will be executed in parallel. The size of the VLIW instruction word can be in the hundreds of bits.
The compiler forms the layout of the VLIW instruction by compacting the instruction words of the source program. The processor must have a sufficient number of hardware resources to execute all the operations specified in a VLIW word simultaneously. For instance, as shown in Fig. 3, one VLIW instruction word is compacted to contain a load/store operation, an FP addition, an FP multiply, a branch, and an integer ALU operation.

Fig. 3 a) VLIW instruction word; b) VLIW processor

All functional units (shown in Fig. 3b)) are engaged according to the VLIW instruction word (given in Fig. 3a)). A large register file is shared by all the functional units in the processor. The parallelism in the instruction and data flow is specified at compile time. Trace scheduling is used for handling branch instructions: it is based on predicting branch decisions at compile time, with the prediction based on heuristic methods.

In Table 2, a comparison between VLIW and superscalar processors from the aspect of ILP implementation is given. In conclusion, when we compare VLIW and superscalar processors, we can say that a VLIW machine differs from a superscalar machine in the following: a) the instruction decoding process is simpler; b) the ILP is higher but the code density is lower; and c) object-code compatibility with a larger family of non-parallel machines is lower.

Table 2 Instruction-level parallelism: VLIW vs superscalar

Superscalar: the instruction scheduling mechanism is implemented with complex hardware; execution is out of order; dedicated logic checks the dependencies between parallel instructions and the hazards among the working functional units; more efficient execution of pipeline-dependent code.

VLIW: more functional units are needed; the instruction code word is larger; a complex compiler is needed - if a compiler that performs efficient code optimization is not implemented, more effort is needed to create executable code, and longer execution time and higher power consumption are potential consequences; the hardware is simpler owing to the use of predicated execution to avoid branching.

A simple hardware structure and instruction set are the crucial advantages of the VLIW architecture. The VLIW processor is suitable for scientific applications, where program behavior is more predictable.

Super-pipelining is an alternative performance technique to the superscalar approach. In this approach, pipeline stages are segmented into n distinct non-overlapping parts, each of which executes in 1/n of a clock cycle; i.e., super-pipelining is based on dividing the stages of a pipeline into sub-stages, thus increasing the number of instructions that are active in the pipeline at a given moment. By dividing each stage into two, the cycle period τ is reduced to half, τ/2, so at maximum capacity a result is produced every τ/2 s (see Fig. 4). For a given architecture and the corresponding instruction set there is an optimal number of pipeline stages; increasing the number of stages beyond this limit reduces the overall performance [24].

By analyzing Fig. 4 we can observe the following:
1. Base pipeline: i) issues one instruction per clock cycle; ii) can perform one pipeline stage per clock cycle; iii) several instructions are executing concurrently; iv) only one instruction is in its execution stage at any one time; and v) the total time to execute 6 instructions is 10 cycles.
2. Super-pipelined implementation: i) capable of performing two pipeline stages per clock cycle; ii) each stage can be split into two non-overlapping parts, each executing in half a clock cycle; iii) the total time to execute 6 instructions is 7.5 cycles; and iv) the execution time is reduced by 1 − 7.5/10 = 25% (a theoretical speedup of 10/7.5 ≈ 1.33).
3. Superscalar implementation: i) capable of executing two instances of each stage in parallel; ii) the total time to execute 6 instructions is 7 cycles; and iii) the execution time is reduced by 1 − 7/10 = 30% (a theoretical speedup of 10/7 ≈ 1.43).

Fig. 4 Comparison of superscalar and super-pipeline approaches

From Fig. 4 we can notice that both the super-pipelined and the superscalar implementations: a) have the same number of instructions executing at the same time; b) nevertheless, the super-pipelined processor falls behind the superscalar processor; and c) parallelism enables greater performance. A better solution for further improving speed is therefore the superscalar architecture.

6. VECTOR PROCESSOR

A vector is an ordered set of scalar data items of the same type, which can be floating-point numbers, integers, or logical values. Vector processing is arithmetic or logical computation applied to vectors, whereas in scalar processing only one item (or one pair of items) is processed at a time. Therefore, vector processing is faster than scalar processing. The conversion of scalar code to vector form is called vectorization. A vector processor is a special accelerator building block designed to handle vector computations [25], [26]. There are the following types of vector instructions:

a) Vector-vector instructions: vector operands are fetched from a vector register and, after processing, the generated results are stored in another vector register. These instructions are described by the following function mappings:
P1: V1 → V2
P2: V1 × V2 → V3
For example, a P1-type instruction is the vector square root, and a P2-type instruction is the addition (or multiplication) of two vectors.

b) Vector-scalar instructions: a scalar operand and a vector operand are fetched, and the result is stored in a vector register. These instructions are denoted by the following function mapping:
P3: S × V1 → V2, where S is the scalar item
For example, P3-type instructions are vector-scalar subtraction or division.

c) Vector-reduction instructions: this type of instruction is used when operations on a vector are reduced to a scalar result. These instructions are presented by the following function mappings:
P4: V1 → S
P5: V1 × V2 → S
For example, P4-type instructions find the maximum, the minimum or the sum of all the elements of a vector, while the P5 type is used for the dot product of two vectors.

d) Vector-memory instructions: this type of instruction is used when vector operations involve memory M. These instructions are marked with the following function mappings:
P6: M → V1
P7: V1 → M
For example, P6 corresponds to the vector load and P7 to the vector store operation.

Typical examples of vector operations are the following:
1. V2 ← complement(V1) ; complement all elements
2. S ← min(V1), max(V1), sum(V1)
3. V3 ← V1 + V2 ; vector addition, multiplication, division
4. V2 ← V1 × S ; multiply or add a scalar to a vector
5. S ← V1 · V2 ; calculate an element of a matrix (dot product)

Vector processing with pipelining: owing to the repetition of the same computation on different operands, vector processing is very well suited to pipelining.
A vector processor performs better when the vectors are longer, although long vectors cause problems in storing and manipulating them.

Efficiency of vector processing over scalar processing: as already mentioned, a sequential computer processes a vector item by item. Therefore, to process a vector of length n on a sequential computer, the operation must be divided into n scalar steps, executed one by one. For example, consider the addition of two vectors of length 1000:

C = A + B

The sequential computer implements this operation with 1000 add instructions, in the following way:

C[1] = A[1] + B[1]
C[2] = A[2] + B[2]
...
C[1000] = A[1000] + B[1000]

A vector processor does not divide the operation into 1000 add statements, because its set of vector instructions allows the operation to be specified in a single vector instruction:

C(1:1000) = A(1:1000) + B(1:1000)

The comparative execution of the addition instruction by a scalar and a vector processor is presented in Fig. 5.

Fig. 5 Scalar vs vector operations execution

Thus, the main advantage of vector over scalar processing is the elimination of the overhead caused by loop control.

Properties of vector instructions: a) a single instruction implies many operations, which reduces the number of instruction fetches and decodes; b) the operations are independent of each other, which allows a simple design and lets multiple operations run in parallel; c) data hazards have to be checked once per vector operation, not per element operation; d) control hazards are reduced because branches are reduced; e) the memory access pattern is known.

Nowadays, a large number of microprocessors contain a set of instructions that manipulate relatively small vectors (e.g., up to 8 single-precision FP elements in the Intel AVX extensions [27]). These instructions are often referred to as SIMD (single instruction, multiple data) instructions. Table 3 shows the comparative properties (advantages vs disadvantages) of vector processors.

Table 3 Comparative properties of vector processors

Advantages of vector processors:
▪ Instruction bandwidth is lower
▪ Fetch and decode phases are reduced
▪ Main memory addressing is easier
▪ Load/store units use known patterns for memory access
▪ Memory wastage is eliminated - there are no cache misses, and latency only occurs during vector loading
▪ Control hazard logic is simple - loop-related control hazards are eliminated
▪ Scalable platform - a larger number of hardware resources increases performance
▪ Code size is reduced - N operations are described by a single instruction

Disadvantages of vector processors:
▪ Works (only) if the parallelism is regular (data/SIMD parallelism); very inefficient if the parallelism is irregular
▪ Memory (bandwidth) can easily become a bottleneck, especially if: a) the compute/memory operation balance is not maintained; b) data is not mapped appropriately to memory banks

Vector processing applications include problems that can be efficiently formulated in terms of vectors, such as: a) long-range weather forecasting; b) petroleum exploration; c) seismic data analysis; d) medical diagnosis; e) aerodynamics and space flight simulations; f) artificial intelligence and expert systems; g) mapping the human genome; and h) image processing.
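As a concrete illustration of the C(1:1000) = A(1:1000) + B(1:1000) example above, the following C sketch uses the Intel AVX intrinsics mentioned in connection with [27]. It is a minimal sketch, assuming an AVX-capable x86 CPU and a compiler flag such as -mavx; the function name and array size are our own illustrative choices, not part of any standard API beyond the intrinsics themselves.

```c
/* Minimal SIMD sketch: one AVX instruction adds 8 single-precision
 * elements at a time, instead of 1000 scalar add instructions. */
#include <immintrin.h>

#define N 1000

void vector_add(const float *a, const float *b, float *c) {
    int i = 0;
    /* Vectorized part: 8 floats per _mm256_add_ps instruction. */
    for (; i + 8 <= N; i += 8) {
        __m256 va = _mm256_loadu_ps(&a[i]);   /* vector load  (P6-type)       */
        __m256 vb = _mm256_loadu_ps(&b[i]);
        __m256 vc = _mm256_add_ps(va, vb);    /* vector-vector add (P2-type)  */
        _mm256_storeu_ps(&c[i], vc);          /* vector store (P7-type)       */
    }
    /* Scalar epilogue for the remaining N % 8 elements (none when N = 1000). */
    for (; i < N; i++)
        c[i] = a[i] + b[i];
}
```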
7. MULTICORE PROCESSORS

Nowadays, large uniprocessors no longer scale in performance, because conventional superscalar techniques for instruction issue allow only a limited amount of parallelism to be extracted from the instruction flow. In addition, it is not possible to increase the clock speed further, because the power dissipation would become prohibitive. For more than thirty years (the period between 1972 and 2003, often called the era of intensive microarchitecture design), a variety of modifications were undertaken with one of two goals: 1) increasing the number of instructions that can be issued per cycle; or 2) increasing the clock frequency faster than Moore's law and Dennard scaling would normally allow [28]. Pipelining and super-pipelining of individual instruction execution into a sequence of stages has allowed designers to increase clock rates. Superscalar processors were designed to execute multiple instructions from an instruction stream on each cycle; they function by dynamically examining sets of instructions from the stream to find candidates capable of parallel execution on each cycle, which can often be executed out of order with respect to the original sequence. This concept is referred to as instruction-level parallelism (ILP). Typical instruction streams have only a limited amount of usable parallelism among instructions [1], [29], so superscalar processors that can issue more than about four instructions per cycle achieve very little additional benefit on most applications.

Today, advances in processor core development have slowed dramatically because of a simple physical limit: power dissipation. In modern pipelined and superscalar processors, typical high-end power exceeds 100 W. In order to bypass these design constraints, processor manufacturers have switched to a new microprocessor design paradigm: multicore (also called chip multiprocessor, or CMP for short) and many-core. A multi-core processor is a single computing component with two or more independent processing units (called "cores", made up of computation units and caches [30]), which are functional units that read and execute program instructions. Multiple cores can run multiple instructions (ordinary CPU instructions) at the same time, increasing overall speed for programs amenable to parallel computing. Coupling multiple cores on a single chip aims to achieve the performance of a single faster processor. The individual cores on a multi-core processor do not need to run as fast as the highest-performing single-core processors; in general, they improve overall performance by executing more tasks in parallel [31]. The increase in performance can be seen by considering the way single-core and multi-core processors execute programs. A single-core processor running multiple programs assigns different time slices to the programs, and they run sequentially; if one of the processes lasts longer, all the other processes start to lag behind. With a multi-core processor, however, if there are multiple tasks that can run in parallel at the same time, each of them is executed by a separate core, which improves performance.

Depending on the application requirements, multi-core processors can be implemented in different ways: as a group of heterogeneous cores, a group of homogeneous cores, or a combination of both.
In a homogeneous core architecture, all cores in the processor are identical [32]; to improve overall performance, a computationally intensive application is broken down into less intensive parts that run in parallel [4]. Significant advantages of a homogeneous multi-core processor are reduced design complexity, reusability, and reduced verification effort [33]. Heterogeneous multicores, on the other hand, consist of dedicated, application-specific processor cores that run various applications [34]. Cores in multi-core systems, like single-processor systems, can implement architectures such as VLIW, superscalar, vector, or multithreaded. Multicore processors are used in many application domains, such as general-purpose computing, embedded systems, multimedia, networking, digital signal processing (DSP) and graphics (GPU). They can be harnessed as complex cores that address computationally intensive applications, or as simpler cores that deal with less computationally intensive applications [24].

Software algorithms and their implementations greatly influence the performance improvement obtained by using multi-core processors. In particular, the possible gains are limited by the fraction of the software that can run in parallel simultaneously on multiple cores. At best, parallel problems can achieve speedup factors close to the number of cores, or even more if the problem is split well enough to fit in the local core cache(s), which reduces the number of accesses to the much slower main system memory. However, most applications do not speed up this much without effort by programmers to reshape the whole problem. Software parallelization is currently a significant ongoing research topic [4]. A comparison between single-core and multi-core processors is given in Table 4.

Table 4 Comparison of single-core and multi-core processors

Parameter | Single-core processor | Multi-core processor
Number of cores on a die | Single | Multiple
Instruction execution | A single instruction is executed at a time | Multiple instructions are executed using multiple cores
Gain | Speeds up every program | Speeds up programs intended for multi-core processors
Performance | Depends on the clock frequency | Depends on the clock frequency, the number of cores and the program
Examples | 80386, 80486, AMD 29000, AMD K6, Pentium I, II, III, etc. | Core 2 Duo, Athlon 64 X2, i3, i5, i7, etc.

7.1. Multicore topologies

In the sequel we point out four types of multicore topologies: symmetric (or homogeneous), asymmetric, dynamic, and composed (alternatively referred to as "fused" or "heterogeneous") [20], [35].

The symmetric multicore topology is composed of multiple copies of the same core, operating at the same frequency and voltage. In this topology, resources such as the power and area budget are evenly distributed across all cores. In Fig. 6a) a symmetric multicore processor is presented in which each block is a Basic Core Equivalent (BCE) and contains L1 and L2 caches as constituents; the L3 cache and the on-chip network are not presented.

The asymmetric multicore topology is composed of one large monolithic core and a number of identical small cores. This topology uses a large high-performance core to execute the serial part of the code, and uses the small cores, together with the large core, to take advantage of the parallel part of the code.
In Figs. 6b) and 6c) asymmetric multicore processors are presented with: b) one complex core and 12 BCEs; c) two complex cores and 8 BCEs.

The dynamic multicore topology is a modification of the asymmetric topology. The parallel parts of the code are executed by the small cores while the large core is off, and the serial part of the code is executed only by the large core, while the small cores are inoperative. In Figs. 6d) and 6e) dynamic multicore processors are presented with: d) 16 BCEs or one large core; e) four cores with frequency scaling using a power budget of 8 BCEs (currently one core is at the full core thermal design point (TDP), two cores are at 0.5 core TDP, and one core is switched off).

Fig. 6 a) Symmetric multicore processor; b) and c) asymmetric multicore processors; d) and e) dynamic multicore processors; f) heterogeneous multicore

The heterogeneous (composed) multicore topology is composed of a set of small cores that are logically combined to assemble a large high-performance core for serial code execution. In both the serial and the parallel case, exclusively large or small cores are used. In Fig. 6f) a heterogeneous multicore is presented with a large core, four BCEs, and two accelerators or co-processors of each of the types A, B, D and E.

Some of the limits of multicore processors are the following [36]:
1. Present-day CMPs are designed to exploit both instruction-level parallelism (ILP) and thread-level parallelism (TLP). In such solutions, the number of processors and the complexity of each processor are fixed at design time.
2. Performance improvement achieved mainly by increasing the number of cores cannot always lead to an effective design solution, due to: a) the dark silicon problem (not all cores can be powered at the same time); and b) declining yield in TLP.

Nowadays, with multicore processors all over the place, single-threaded programs are no longer an option. In essence, we moved from single-core to multicore not because the software community was ready for concurrency but because the hardware community could not afford to neglect the power issue.

Today, multi-core technology is commonly used in most personal electronic devices. In order to take advantage of the multiple cores on such machines, creating parallel programs is crucial for achieving high performance and enabling large-scale data processing. In addition to multicore technology (mainly realized as shared-memory systems) [37], parallel computing can take the form of distributed systems. Unlike multicore shared-memory systems, distributed systems can solve problems that do not fit in the memory of a single machine; on the other hand, communication and data replication in distributed systems cause high additional overheads. Compared to distributed-memory systems, multicores with shared memory are more efficient for programs that can fit in memory, the efficiency being reflected in reduced hardware, cost, and power consumption [3].

Today's multicore CPUs use most of their transistors for processing logic and cache memory. During operation, most of the power is consumed by non-computational units. An alternative strategy is heterogeneous architectures, i.e., multi-core architectures in combination with accelerator cores.
Accelerators are specialized hardware cores designed with fewer transistors, operating at lower frequencies than traditional CPUs, and enabling increased system performance.

8. MULTITHREADED PROCESSORS

As the hardware complexity and capabilities of modern processors have increased, so have the demands for higher performance, which in turn has required CPU resources to be used more efficiently. The main idea is that the time during which the processor is waiting for certain tasks, i.e., is in the idle state, is used to perform other activities. To achieve this goal, software designers introduced a new operating-system capability that supports running pieces of programs, called threads. Threads are small tasks that can run independently [38]. During execution, each thread gets its own time period; as a consequence, the processor time is efficiently utilized. Fig. 7 shows multithreaded execution on a single processor and on a two-way superscalar processor.

Fig. 7 Multithreading in a CPU: (a) single processor running a single thread; (b) single processor running several threads; (c) two-way superscalar processor running a single thread; (d) two-way superscalar processor running multiple (two) threads

8.1. Difference between Multitasking, Multiprocessing and Multithreading

Several threads make up one process (task) and share access to the processor resources. This concept of operating systems, known as multithreading, ensures that one thread runs while another is waiting for an event. Contemporary commercially available PC machines and servers, mainly based on Intel or AMD processors running Microsoft Windows, support multithreading [28].

Each program requires resources that are held by its process (task). The process is assigned a virtual address space, executable code, handles for system object manipulation, a security context, a unique process identifier, environment variables, a priority class, working set sizes, and at least one thread of execution. A thread is a single entity within a process that can be scheduled for execution. All threads that are part of a process share its previously mentioned resources. In addition, each thread maintains exception-handling code, a scheduling priority, local thread memory, a thread identifier, and structures that the system uses to preserve the context of the thread. The thread context consists of a set of machine registers, a kernel stack, an environment block, and a user thread stack. Each thread is characterized by: 1) a thread ID; 2) register state, including the PC and SP; 3) a stack; 4) a signal mask; 5) a priority; and 6) thread-private memory. Threads share the instructions and data of the process to which they belong, so all threads in the process can see changes made to the shared data by any thread. Threads in the same process can interact with each other without involving the operating environment.

Multitasking is a mode of operation in which the CPU performs multiple tasks at the same time (see Fig. 8). It is characterized by the CPU switching between multiple tasks so that users can work with each program. Unlike multithreading, in multitasking the processes have separate memory and resources. In multitasking, CPU switching between tasks is relatively fast.

Fig. 8 Multitasking operating system for a single processor

Multithreading is an operating mode in which many threads are active during the execution of a process.
In this manner, higher computing power is achieved. In multithreading (see Fig. 9), the CPU executes many threads that are part of one process at a time, and the threads share the same memory and resources. A property of multithreading is that two or more threads can run concurrently; therefore, multithreading is also referred to as concurrency [39].

Fig. 9 Multithreading system for a single processor

The difference between multitasking and multithreading [40] is presented in Table 5.

Table 5 Difference between multitasking and multithreading

No. | Multitasking | Multithreading
1. | The CPU performs many user tasks | Many threads are created within a process
2. | The CPU switches among tasks | The CPU switches among threads
3. | Processes have separate memory | Threads share the same memory
4. | Multiprocessing can be involved | Multiprocessing cannot be involved
5. | The CPU executes many tasks at a time | The CPU executes many threads at a time
6. | Each process has separate resources | Threads share the same resources
7. | Multitasking is slower | Multithreading is faster
8. | Termination of a process takes longer | Termination of a thread takes less time

A computer system composed of two or more processors is called a multiprocessing system (see Fig. 10). In this way, the computing speed of the system is increased. In such systems, each processor has its own registers and main memory, and the division of processes and resources among the processors is done dynamically. The main characteristics of multiprocessing are the following: i) the organization of memory determines the type of multiprocessing; ii) system reliability is improved; and iii) decomposing programs into tasks that execute in parallel increases performance.

Fig. 10 Multiprocessing system

The advantages of multiprocessing are the following: a) more activity can be performed in a shorter time; b) the code is simple; c) the system is composed of multiple CPUs and cores; d) synchronization is simplified; e) child processes are interruptible/killable; and f) it is cost-efficient because the processors share resources. The disadvantages of multiprocessing are: a) inter-process communication involves time overhead; and b) more memory is needed.

The main characteristics of multithreading are the following: i) each thread executes in parallel with the others; and ii) program performance is increased since the threads share the same memory area. The advantages of multithreading are the following: a) the address space is shared by all threads; b) less memory is needed; c) communication between threads is cost-efficient and fast; d) context switching is fast; e) it is suitable for input/output-oriented applications; and f) the switching time between two threads is short. The disadvantages of multithreading are the following: a) threads are not interruptible/killable; b) manual synchronization is often necessary; and c) the program code is harder to understand, and testing and debugging are harder due to race conditions.

Both multiprocessing and multithreading are operating modes that increase computing power [41]. A multiprocessing system is composed of multiple processors, whereas multithreading comprises multiple threads (see Table 6).

Table 6 Multiprocessing vs multithreading

Multiprocessing | Multithreading
Multiple CPUs increase computing power | Multiple threads of a single process increase computing power
Multiple processes are executed concurrently | Multiple threads of a single process are executed concurrently
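The shared-memory property contrasted in Tables 5 and 6 can be illustrated with POSIX threads. The following minimal C sketch (the array size, the number of threads and the function names are our own illustrative choices) creates two threads of one process that each sum half of a shared array; no inter-process communication or data copying is needed, because both threads see the same memory. Compile with, e.g., gcc -pthread.

```c
/* Minimal pthreads sketch: threads of one process share its memory. */
#include <pthread.h>
#include <stdio.h>

#define N 1000

static int  data[N];       /* shared by all threads of the process */
static long partial[2];    /* one result slot per thread           */

static void *worker(void *arg) {
    int id = *(int *)arg;  /* thread 0 sums the lower half, thread 1 the upper */
    long sum = 0;
    for (int i = id * N / 2; i < (id + 1) * N / 2; i++)
        sum += data[i];
    partial[id] = sum;     /* written directly into shared memory */
    return NULL;
}

int main(void) {
    pthread_t t[2];
    int id[2] = { 0, 1 };

    for (int i = 0; i < N; i++) data[i] = i;
    for (int i = 0; i < 2; i++) pthread_create(&t[i], NULL, worker, &id[i]);
    for (int i = 0; i < 2; i++) pthread_join(t[i], NULL);  /* wait for both */
    printf("sum = %ld\n", partial[0] + partial[1]);        /* prints 499500 */
    return 0;
}
```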
9. MULTITHREADING: EXECUTION MODEL

About a decade after software architects, hardware architects designed the multithreaded processor, which can run more than one thread on some of its cores at the same time. A multithreaded architecture is one in which a single processor has the ability to follow multiple streams of execution without the aid of software context switches. For a conventional processor to stop executing one thread and start executing instructions from another thread, special software is required. The role of this software is to transfer the state of the running thread to memory (usually to stack memory) and then load the state of the selected thread into the processor. This process usually requires hundreds (or thousands) of cycles, especially if the operating system is involved. A multithreaded architecture, on the other hand, can access the state of multiple threads in, or near, the processor core. This allows the multithreaded architecture to switch between threads quickly, and potentially to use the processor resources more efficiently and effectively [42] (see Fig. 11 for an illustration).

Fig. 11 Multithreaded pipeline example

Multicore and multithreading can be used simultaneously because they are two orthogonal concepts. For instance, the Intel Core i7 processor has multiple cores, and each core is two-way multithreaded [43]. When multiple threads are executed simultaneously, they use mostly different hardware resources in a multicore, whereas they share most of the hardware resources in a multithreaded processor. To achieve this, a multithreaded architecture must be able to store the state of multiple threads in hardware. This storage is referred to as hardware contexts, and the number of supported hardware contexts defines the level of multithreading (the number of threads that can share the processor without software intervention). The state of a thread consists primarily of the program counter (PC), the contents of the general-purpose registers, and the special-purpose and program status registers. It does not include memory (because that remains in place), or dynamic state that can be rebuilt or retained between thread invocations (branch predictor, cache, or TLB contents).

9.1. Instruction issue

Multithreaded processors are divided into two groups depending on how many threads can issue instructions in a given cycle. When instructions can be issued only from a single thread in a given cycle, explicit multithreading is used. In that case, the following two main techniques can be applied [44] (see Fig. 12): i) coarse-grain multithreading (CGMT), or blocked multithreading (BMT); and ii) fine-grain multithreading (FGMT), or interleaved multithreading (IMT). When instructions can be issued from multiple threads in a given cycle, simultaneous multithreading (SMT) is used.

Fig. 12 Explicit multithreading

Coarse-grain multithreading, also called blocked multithreading or switch-on-event multithreading, has multiple hardware contexts associated with each processor core [45]. The instructions of a thread are executed successively, but when an event occurs that may cause latency, a context switch is performed. The instructions of one thread continue to be executed until a long delay occurs, such as on a branch or a cache miss (see Fig. 13).
When such a delay occurs, the processor switches to another thread, which is also executed until a long delay occurs, and this process is constantly repeated. This strategy makes it possible to hide long delays, but it cannot hide shorter delays, for which the cost of switching is higher than the cost of tolerating the delay. A hardware context is the program counter, the register file, and the other data required to enable a software thread to execute on a core. A coarse-grain multithreaded processor operates similarly to a software time-shared system, but with hardware support for fast context switching, allowing it to switch within a small number of cycles (e.g., fewer than 10) rather than thousands or tens of thousands.

Fig. 13 Coarse-grain multithreading

Fine-grain multithreading, also called interleaved multithreading, likewise has multiple hardware contexts associated with each core, but can switch between them with no additional delay. An instruction of another thread is fetched and entered into the execution pipeline in each cycle, so the processor can execute an instruction (or instructions) from a different thread in each cycle. Unlike coarse-grain multithreading, then, a fine-grain multithreaded processor has instructions from different threads active in the processor at once, in different pipeline stages [46]; within a single pipeline stage (or, given our particular definitions, within the issue stage), however, only one thread is represented. In this approach, the CPU executes one instruction of each thread in succession (one after the other) before going back, in a circular fashion, to execute the next instruction of the first thread. During execution, the CPU skips the instructions of any thread that is stalled waiting for an event with a long delay. In this manner, the processor stays busy, because the pipeline is almost always full. Such a processor has a significantly more complex hardware structure, because it needs a separate copy of the register file and program counter for each thread. Since the next instruction of a thread is fed into the pipeline only after the previous instruction of that thread has retired, control and data dependencies between instructions do not occur in FGMT. The pipeline is simple and potentially very fast, because there is no need for complex hazard-detection hardware. In addition, the context-switching time between threads is zero cycles. Memory latency is tolerated by not scheduling a thread until its memory access has completed. In this model, the number of CPU pipeline stages determines the number of threads that can be executed, and the processing power available to one thread is limited by the instruction interleaving from the other threads.

In simultaneous multithreading (SMT), instructions are issued simultaneously from multiple threads to the execution units of a superscalar CPU. In this way, superscalar multiple-instruction issue is combined with the hardware resources of the multiple-context approach. The CPU can issue multiple instructions from multiple threads in each cycle, so both unused cycles caused by latencies and unused issue slots within a cycle can be filled with instructions from alternative threads. On the one hand, TLP exists as a consequence of multithreading, parallel programs, or multiple independent programs in a multiprogramming workload; on the other hand, ILP arises from the execution of the individual threads.
An SMT processor achieves better throughput and speedup than a single-threaded superscalar processor on multithreaded workloads, because it efficiently exploits both coarse- and fine-grain parallelism, at the cost of a more complex hardware architecture. SMT, CGMT and FGMT are approaches often used in RISC or VLIW processors [39], [47]. Intel's Pentium 4 has implemented SMT since 2002, starting with the 3.06 GHz model; Intel calls its SMT technique Hyper-Threading. Other processors that use SMT are the Alpha AXP 21464, IBM Power5, and Intel Nehalem i7 [48]. A simple SMT architecture is presented in Fig. 14.

Fig. 14 SMT processor architecture

Much like pipelining, the superscalar architecture (presented in Fig. 14) also extends very naturally to the support of multiple instruction threads. A multithreaded superscalar processor executes instructions from multiple threads. Each thread executes its logical instruction stream and uses separate registers, etc., but shares most of the available physical resources. The additional hardware required to support multiple-thread execution is minor, yet performance is significantly improved.

Short remarks related to explicit multithreading: coarse-grain multithreaded processors directly execute one thread at a time, but can switch contexts relatively quickly, in a matter of a few cycles. This allows them to switch to a new thread to hide long latencies (such as memory accesses), but they are less effective at hiding short latencies. Fine-grain multithreaded processors can context switch every cycle with no delay. This allows them to hide even short latencies by interleaving instructions from different threads while one thread is stalled. However, such a processor cannot hide single-cycle latencies: since it still issues from only one thread per cycle, unused issue slots within a cycle remain unfilled. A simultaneous multithreaded processor can issue instructions from multiple threads in the same cycle, allowing it to fill out the full issue width of the processor even when one thread does not have sufficient ILP to use the entire issue bandwidth.

For illustration purposes only, the approaches possible with single-issue (scalar) processors and with multiple-issue processors are given in Fig. 15 and Fig. 16, respectively [37]. Fig. 17 presents two cases of issuing from multiple threads in a cycle [38].

Fig. 15 Different approaches possible with single-issue (scalar) processors: a) single-threaded scalar, b) interleaved multithreading scalar, c) blocked multithreading scalar

Fig. 16 Different approaches possible with multiple-issue processors: (a) single-threaded four-wide superscalar, (b) interleaved multithreading four-wide superscalar, (c) blocked multithreading four-wide superscalar

Notice: vertical waste corresponds to the darker marked boxes, while horizontal waste corresponds to the lighter marked boxes.

Fig. 17 Issuing from multiple threads in a cycle: a) simultaneous multithreading, b) chip multiprocessor

10. PARALLEL VS SERIAL COMPUTING

What is Serial Computing: Computer software is conventionally written for serial execution: the algorithm divides the problem into smaller parts, i.e. instructions, which are then executed serially on the CPU of the computer, one by one [18]. After the current instruction completes, the next one begins. So, in short, serial (sequential) computing is the following (see Fig. 18):
▪ A problem is broken into a discrete series of instructions;
▪ Instructions are executed sequentially, one after another;
▪ Execution takes place on a single processor;
▪ Only one instruction may execute at any moment in time.

Fig. 18 Serial Computing generic example

What is Parallel Computing: Contrary to the serial approach, parallelism can be defined as an approach of dividing big problems into smaller ones, which are then solved simultaneously by multiple processors. The terms parallelism and concurrency are often confused. Parallelism means that two or more program sequences are executed independently of each other by a number of processors, while in concurrent execution there are dependencies between program sequences, so that the execution of one program sequence must wait for the execution of another to continue. Not all parallel processing needs to be considered concurrent; for example, bit-level parallelism is not concurrent. As can be seen from Fig. 19, to solve a computational problem Parallel Computing involves the simultaneous usage of multiple computing resources [49]:

▪ A problem is decomposed into several parts that can be solved concurrently;
▪ Each part is further decomposed into a series of instructions;
▪ Instructions from each part execute simultaneously on different processors;
▪ A general control/coordination mechanism is implemented.

Concurrency vs Parallelism: We can see how concurrency and parallelism work with the example below. As shown in Fig. 20, there are two cores and two tasks. In the concurrent approach, each core executes both tasks by switching between them over time. In contrast, the parallel approach does not switch between tasks, but instead executes them in parallel over time [50]. A simple example of concurrent processing is any user-interactive program, such as a text editor. In such a program there can be I/O operations that waste CPU cycles: when we save or print a file, the user can concurrently keep typing. The main thread launches many threads for typing, saving, and similar activities concurrently. They may run in the same time period; however, they are not actually running in parallel.

Types of Parallelism: In essence, parallelism can be implemented at two levels: hardware and software [51].

Fig. 19 Parallel Computing generic example

Parallelism at the hardware level is built into the machine's architecture and hardware multiplicity, so it is also known as machine parallelism. This type of parallelism is a function of cost and performance trade-offs. It also reflects the resource utilization patterns of simultaneously executable operations and indicates the peak performance of the processor resources. It is characterized by the number of instructions issued per machine cycle [1].

Fig. 20 Concurrent vs Parallel execution

In summary, we distinguish the following types of hardware parallelism (a minimal code sketch of the multiprocessor-level case follows the list):
1. Parallelism at the uniprocessor level, implemented as: i) pipelining and super-pipelining; ii) superscalar, VLIW, etc.
2. Parallelism implemented with SIMD instructions, vector processors, and GPUs.
3. Parallelism at the multiprocessor level: i) symmetric shared-memory multiprocessors; ii) distributed-memory multiprocessors; iii) chip multiprocessors, a.k.a. multicores; iv) multicomputers, a.k.a. clusters.
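As a concrete illustration of the multiprocessor-level category, and of the generic decomposition of Fig. 19, the following minimal C++ sketch (identifiers such as partial_sum and parts are ours, purely illustrative) breaks a summation problem into four parts, solves the parts simultaneously on worker threads, and uses join() as the control/coordination mechanism:

    #include <functional>
    #include <iostream>
    #include <numeric>
    #include <thread>
    #include <vector>

    // Each worker solves one part of the problem: summing its own chunk.
    void partial_sum(const std::vector<int>& data,
                     std::size_t begin, std::size_t end, long long& out) {
        out = std::accumulate(data.begin() + begin, data.begin() + end, 0LL);
    }

    int main() {
        std::vector<int> data(1 << 20, 1);   // the "big problem"
        const std::size_t parts = 4;         // decomposed into four parts
        std::vector<long long> results(parts, 0);
        std::vector<std::thread> workers;

        const std::size_t chunk = data.size() / parts;
        for (std::size_t p = 0; p < parts; ++p)   // parts run simultaneously
            workers.emplace_back(partial_sum, std::cref(data),
                                 p * chunk, (p + 1) * chunk,
                                 std::ref(results[p]));

        for (auto& w : workers) w.join();         // control/coordination
        std::cout << std::accumulate(results.begin(), results.end(), 0LL) << '\n';
        return 0;
    }

With the data split among as many parts as there are cores, each part executes on its own processor, exactly as the four bullets above describe.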
Parallelism at the software level is exploited through the concurrent execution of machine-language instructions in a program. This type of parallelism is a function of the algorithm, the programming style, and compiler optimization. It also reveals patterns of simultaneously executable operations through the program flow graph [52]. We distinguish the following two types of software parallelism:
1. Control parallelism – two or more operations are performed simultaneously;
2. Data parallelism – the same operation is performed over many data elements by many processors simultaneously.

Data-level parallelism (DLP) arises from executing essentially the same code on a large number of objects [53], while control-level parallelism (CLP) arises from executing different threads of control concurrently [54]. Parallelism at the software level can be implemented as instruction-, task-, data- or transaction-level parallelism [55].

10.1. Implementations of the Most Common Types of Parallelism

Bit-level parallelism: This type of parallelism (implemented at the hardware level) relies on doubling the processor word size. It provides faster execution of arithmetic operations on large numbers. For instance, an 8-bit CPU executes a 16-bit addition in two cycles, whereas a 16-bit processor needs just one cycle for the same operation. This level of parallelism carried over into today's 64-bit processors.

Instruction-level parallelism (ILP): This type of parallelism (implemented at the hardware level) exploits the potential overlap between instructions in a program. In most cases, ILP is implemented in each processor's hardware as: i) instruction pipelining; ii) superscalar processing; iii) out-of-order execution; and iv) speculative execution/branch prediction. Most processors use a combination of the aforementioned ILP techniques to achieve higher performance. Very Long Instruction Word (VLIW) processors use specialized compilers to extract ILP statically, at the software level: the compiler prepares parallel instruction streams for the VLIW processor so that they take full advantage of a number of execution units organized in multiple pipelines.

Task/Thread-level parallelism (TLP): This type of high-level parallel computing (implemented at the software level) is based on partitioning the application into distinct tasks or threads, which can then be executed simultaneously; a short sketch contrasting it with data parallelism is given after the next paragraph. Threads are executed on different computing units and can work on independent data or share data. Until recently, programming was done sequentially, with a single thread representing the entire application. Today, it is necessary to adopt the new paradigm of multithreaded programming in order to take full advantage of the available multicore processors. In that sense, modern operating systems (OSs) schedule different processes on different cores. However, for complex applications such as bioinformatics, the OS cannot efficiently distribute the computational load of each process across the available cores; to improve performance, such applications need to be redeveloped to achieve thread-level parallelism.

Data parallelism: This form of high-level parallelism partitions the data across the available computing units. The data are assigned to cores that independently execute the same task code on each fragment of data. This type of parallelism therefore requires advanced code-development skills and can only be applied to specific problems.
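To make the contrast announced above concrete: where the earlier sketch partitioned data and ran the same code on every fragment (data parallelism), the task-parallel sketch below (again illustrative; std::async is standard C++) runs two different operations, a minimum search and a summation, as independent tasks over the same read-only data:

    #include <algorithm>
    #include <future>
    #include <iostream>
    #include <vector>

    int main() {
        const std::vector<int> data(1 << 20, 3);

        // Task (control) parallelism: two distinct operations run as
        // independent tasks, potentially on different cores.
        auto fmin = std::async(std::launch::async, [&data] {
            return *std::min_element(data.begin(), data.end());
        });
        auto fsum = std::async(std::launch::async, [&data] {
            long long s = 0;
            for (int x : data) s += x;   // same data, different operation
            return s;
        });

        std::cout << "min = " << fmin.get() << ", sum = " << fsum.get() << '\n';
        return 0;
    }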
Computer graphics is an important application area of high-level data parallelism. The design of graphics processing units (GPUs) enables the efficient execution of every graphics-processing task: each frame is first divided into regions, and then, on command, hundreds of processing units perform the task independently on each data region. Incarnations of DLP architectures over the decades include the following [49]: a) early vector processors (the Cray processors: Cray-1, Cray-2, ..., Cray X1); b) SIMD extensions (Intel SSE and AVX units, and the Alpha Tarantula, which never saw the light of day); c) early massively parallel computers (the Connection Machines and MasPar machines); and d) modern GPUs (NVIDIA, AMD, Qualcomm, ...). In general, DLP architectures focus on throughput rather than latency.

Fig. 21 gives a classification scheme of parallel computer architectures based on the type of instruction processing.

Fig. 21 Classification of Parallel Architectures based on type of instruction processing

The difference among the three major categories that are nowadays mainly used in computer systems to exploit parallelism is sketched in Fig. 22 [56]:
▪ Instruction-Level Parallelism (ILP) – multiple instructions from one instruction stream are executed simultaneously;
▪ Thread-Level Parallelism (TLP) – multiple instruction streams are executed simultaneously;
▪ Vector Data Parallelism (VDP) – the same operation is performed simultaneously on arrays of elements.

There are many reasons to use parallel computing [4]. First of all, the real world is dynamic in nature: many things happen at the same time but in different places, and the resulting data are far too large to manage serially, calling for dynamic simulation and modeling. It is parallel computing that provides the concurrency needed to organize and manage such complex, large datasets while saving money and time. It also provides efficient use of hardware resources and enables real-time system implementation.

Fig. 22 Differences in execution among ILP, TLP and VDP

Parallel computing finds application in many areas of science and engineering, as well as in databases and data mining, real-time system simulation, advanced graphics, and augmented and virtual reality. Along with its many advantages, parallel computing also has limitations. The main problem is the difficulty of achieving communication and synchronization between multiple subtasks and processes. Also, algorithms or programs must exhibit low coupling and high cohesion, and must be amenable to execution by a parallel mechanism. Developers must be expert and technically skilled to code a parallelism-based program effectively.

11. CONCLUSION

Nowadays, the microprocessor represents one of the most complex applications of the transistor, with the most powerful microprocessors containing well over 10 billion transistors. In fact, throughout its 50 years of evolution, the microprocessor has always used the technology of the day. The intention to permanently increase performance has led to rapid technological improvements that have made it possible to build ever more complex microprocessors. Advances in semiconductor fabrication processes, computer architecture and organization, as well as CMOS VLSI IC design methodologies, were all needed to create today's microprocessor.
The development of microprocessors since 1971 has aimed at: (a) improving the architecture, (b) improving the instruction set, (c) increasing speed, (d) simplifying power requirements [57], [58], and (e) embedding ever more memory and I/O facilities on the same chip (yielding single-chip computers).

This paper first discusses fifty years of microprocessor history and its generations. It then describes the benefits of switching from a non-pipelined processor to a single-core pipelined processor, and from single-core pipelined and superscalar processors to multicore pipelined and superscalar processors. Finally, it presents the design of a multicore processor. The transition from single-core to multicore is inevitable, because past techniques for accelerating processor architectures that do not modify the basic von Neumann computer model, such as pipelining and superscalar execution, are running into hard limits.

The question is, why have multicore machines become so widespread in the last decade? According to Moore's law, the density of transistors doubles approximately every 18 months [59], and according to Dennard scaling, the power density of transistors remains constant [60]. Historically, this corresponded to an increase in the clock speed of single-core machines of approximately 30% per year since the mid-1970s [61]. Since the mid-2000s, however, Dennard scaling has broken down due to physical hardware limitations, and new mechanisms have therefore been needed. To improve performance, hardware vendors have focused on developing processors with multiple cores [62]. As a result, the microprocessor industry is moving towards multicore architectures. However, the full potential of these architectures will not be exploited until the software industry fully embraces parallel programming. Multiprocessor programming is much more complex than programming single-processor machines and requires an understanding of new algorithms, computational principles, and software tools; only a small number of developers currently master these skills. There are many techniques that can ease the transition to multicore processors, but to take full advantage of the potential offered by such systems, some form of parallel programming will always be needed [4].

Multicore technology has become ubiquitous, present today in most personal computers and even mobile phones [63], so writing parallel programs is crucial to achieving scalable performance and enabling large-scale data processing. In addition, to take full advantage of multicore technology, software applications must be multithreaded: it must be possible to distribute the total work among the execution units of a multicore processor in such a way that they can execute at the same time. To consider multithreading in more detail, it is first necessary to understand parallel hardware and parallel computing [44]. Finally, this paper has provided some details on how to implement hardware parallelism in multicore systems.

Acknowledgement: This work was supported by the Serbian Ministry of Education and Science, Project No. TR-32009 – "Low power reconfigurable fault-tolerant platforms".

REFERENCES

[1] D. Patterson and J. Hennessy, Computer Architecture: A Quantitative Approach, 6th Ed., Morgan Kaufmann, 2017.
[2] J.-L. Baer, Microprocessor Architecture: From Simple Pipelines to Chip Multiprocessors, Cambridge University Press, 2009.
[3] Y. Solihin, Fundamentals of Parallel Multicore Architecture, Chapman & Hall/CRC, 2015.
[4] R. Kuhn and D. Padua, Parallel Processing, 1980 to 2020, Morgan & Claypool, 2021.
[5] M. Stojčev, Microprocessor Architectures, Part I, in Serbian, Elektronski fakultet Niš, 2004.
[6] B. Parhami, Computer Architecture: From Microprocessors to Supercomputers, Oxford University Press, 2005.
[7] Semiconductor Industry Association, International Technology Roadmap for Semiconductors (ITRS), 2013 edition, 2013.
[8] K. Olukotun, L. Hammond and J. Laudon, Chip Multiprocessor Architecture: Techniques to Improve Throughput and Latency, Morgan & Claypool, 2007.
[9] R. V. Mehta, K. R. Bhatt and V. V. Dwivedi, "Multicore Processor Challenges – Design Aspects", J. Emerg. Technol. Innov. Res. (JETIR), vol. 8, no. 5, pp. C171-C174, May 2021.
[10] A. Gonzalez, F. Latorre and G. Magklis, Processor Microarchitecture: An Implementation Perspective, Morgan & Claypool, 2011.
[11] M. Stojčev and P. Krtolica, Computer Systems: Principles of Digital Systems, in Serbian, Elektronski fakultet Niš and Prirodno-matematički fakultet Niš, 2005.
[12] "Microprocessor Chronology", av. at https://en.wikipedia.org/wiki/Microprocessor_chronology, last access 28.03.2022.
[13] M. Stojčev, Contemporary 16-bit Microprocessors, Vol. I, in Serbian, Naučna knjiga, Beograd, 1988.
[14] M. Stojčev, Contemporary 16-bit Microprocessors, Vol. II, in Serbian, Naučna knjiga, Beograd, 1988.
[15] M. Stojčev, Contemporary 16-bit Microprocessors, Vol. III, in Serbian, Naučna knjiga, Beograd, 1988.
[16] M. Stojčev, RISC, CISC and DSP Processors, in Serbian, Elektronski fakultet Niš, 1997.
[17] M. Stojčev and B. Petrović, Architectures and Programming of Microcomputer Systems Based on the 80x86 Processor Family, in Serbian, Elektronski fakultet Niš, 1999.
[18] D. Patterson and J. Hennessy, Computer Organization and Design: The Hardware/Software Interface, 5th Ed., Morgan Kaufmann, 2014.
[19] Y. Etsion, "Computer Architecture: Out-of-order Execution", av. at https://iis-people.ee.ethz.ch/~gmichi/asocd/addinfo/Out-of-Order_execution.pdf, last access 28.03.2022.
[20] "Superscalar Processors", av. at https://www.cambridge.org/core/terms, last access 28.03.2022.
[21] M. Stojčev and T. Nikolić, Pipeline Processing and Scalar RISC Processors, in Serbian, Elektronski fakultet Niš, 2012.
[22] M. Stojčev and T. Nikolić, Superscalar and VLIW Processors, in Serbian, Elektronski fakultet Niš, 2012.
[23] Philips Semiconductors, Introduction to VLIW Computer Architecture, av. at https://www.isi.edu/~youngcho/cse560m/vliw.pdf, last access 28.03.2022.
[24] N. P. Jouppi and D. W. Wall, "Available Instruction Level Parallelism for Superscalar and Superpipelined Machines", WRL Research Report 89/7, av. at https://www.hpl.hp.com/techreports/Compaq-DEC/WRL-89-7.pdf, last access 28.03.2022.
[25] C. E. Kozyrakis and D. A. Patterson, "Scalable Vector Processors for Embedded Systems", IEEE Micro, vol. 23, no. 6, pp. 36–45, Nov.-Dec. 2003.
[26] E. Aldakheel, G. Chandrasekaran and A. Kshemkalyani, "Vector Processors", av. at https://www.cs.uic.edu/~ajayk/c566/VectorProcessors.pdf, last access 29.03.2022.
[27] C. Lomont, "Introduction to Intel® Advanced Vector Extensions", av. at https://hpc.llnl.gov/sites/default/files/intelAVXintro.pdf, last access 29.03.2022.
[28] M. Stojčev, E. Milovanović and T. Nikolić, Multiprocessor Systems on Chip, in Serbian, Elektronski fakultet Niš, 2012.
[29] J. L. Lo and S. J. Eggers, "Improving Balanced Scheduling with Compiler Optimizations that Increase Instruction-Level Parallelism", av. at https://homes.cs.washington.edu/~eggers/Research/bsOpt.pdf, last access 29.03.2022.
[30] S. Akhter and J. Roberts, Multi-Core Programming, Intel Press, 2006.
[31] G. Koch, "Intel's Road to Multi-Core Chip Architecture", av. at http://www.intel.com/cd/ids/developer/asmo-na/eng/220997.htm.
[32] G. Koch, "Transitioning to Multi-core Architecture", av. at http://www.intel.com/cd/ids/developer/asmo-na/eng/recent/221170.htm, last access 29.03.2022.
[33] M. Brorsson, "Multi-core and Many-core Processor Architectures", Chapter 2 in Programming Many-Core Chips, Ed. A. Vajda, Springer, 2011.
[34] M. Zahran, Heterogeneous Computing: Hardware and Software Perspectives, ACM Books #26, 2019.
[35] M. Mitić, M. Stojčev and Z. Stamenković, "An Overview of SoC Buses", Chapter 7 in Embedded Systems Handbook: Digital Systems and Applications, Ed. V. Oklobdzija, pp. 7.1-7.16, CRC Press, Boca Raton, 2008.
[36] J. Rehman, "Advantages and Disadvantages of Multi-core Processors", av. at https://www.itrelease.com/2020/07/advantages-and-disadvantages-of-multi-core-processors/, last access 29.03.2022.
[37] J. Shun, Shared-Memory Parallelism Can Be Simple, Fast, and Scalable, Morgan & Claypool, 2017.
[38] T. Ungerer, B. Robic and J. Silc, "Multithreaded Processors", Comput. J., vol. 45, no. 3, pp. 320–348, 2002.
[39] A. Silberschatz, G. Gagne and P. B. Galvin, "Multithreaded Programming", Chapter 4 in Operating System Concepts, 8th Ed., John Wiley, 2009.
[40] DifferenceBetween.com, "Difference Between Multithreading and Multitasking", av. at https://www.differencebetween.com/difference-between-multithreading-and-vs-multitasking/, last access 29.03.2022.
[41] TechDifferences, "Difference Between Multitasking and Multithreading in OS", av. at https://techdifferences.com/difference-between-multitasking-and-multithreading-in-os.html, last access 29.03.2022.
[42] tutorialspoint, "Multi-Threading Models", av. at https://www.tutorialspoint.com/multi-threading-models, last access 22.03.2022.
[43] Wikipedia, "List of Intel Core i7 Processors", av. at https://en.wikipedia.org/wiki/List_of_Intel_Core_i7_processors, last access 29.03.2022.
[44] M. Nemirovsky and D. M. Tullsen, Multithreading Architecture, Morgan & Claypool, 2013.
[45] O. Mutlu, "Computer Architecture: Multithreading", av. at https://rmd.ac.in/dept/ece/Supporting_Online_%20Materials/5/CAO/unit5.pdf, last access 22.03.2022.
[46] N. Manjikian, "Implementation of Hardware Multithreading in a Pipelined Processor", In Proceedings of the IEEE North-East Workshop on Circuits and Systems, 2006, pp. 145–148.
[47] P. Manadhata and V. Sekar, "Simultaneous Multithreading", av. at https://www.cs.cmu.edu/afs/cs/academic/class/15740-f03/www/lectures/smt.pdf, last access 29.03.2022.
[48] Intel, "Products formerly Nehalem EP", av. at https://ark.intel.com/content/www/us/en/ark/products/codename/54499/products-formerly-nehalem-ep.html, last access 29.03.2022.
[49] K. Hwang and Z. Xu, Scalable Parallel Computing: Technology, Architecture, Programming, McGraw-Hill, 1998.
[50] D. Malkhi, Concurrency: The Works of Leslie Lamport, ACM Books #29, 2019.
[51] CS4/MSc Parallel Architectures, "Lect. 2: Types of Parallelism", av. at https://www.inf.ed.ac.uk/teaching/courses/pa/Notes/lecture02-types.pdf, last access 29.03.2022.
[52] Chapter 3: "Understanding Parallelism", av. at https://courses.cs.washington.edu/courses/cse590o/06au/LNLCh-3-4.pdf, last access 29.03.2022.
[53] J. Owens, "Data Level Parallelism", av. at https://www.ece.ucdavis.edu/~jowens/171/lectures/dlp3.pdf, last access 29.03.2022.
[54] A. A. Freitas and S. H. Lavington, "Data Parallelism, Control Parallelism, and Related Issues", in Mining Very Large Databases with Parallel Processing, Springer, 2000.
[55] E. I. Milovanović, T. R. Nikolić, M. K. Stojčev and I. Ž. Milovanović, "Multi-functional Systolic Array with Reconfigurable Micro-Power Processing Elements", Microelectron. Reliab., vol. 49, no. 7, pp. 813–820, July 2009.
[56] C. Severance and K. Dowd, High Performance Computing, Connexions, Rice University, Houston, Texas, 2012.
[57] G. Nikolić, M. Stojčev, Z. Stamenković, G. Panić and B. Petrović, "Wireless Sensor Node with Low-Power Sensing", Facta Univ. Ser.: Elec. Energ., vol. 27, no. 3, pp. 435–453, Sept. 2014.
[58] T. Nikolić, M. Stojčev, G. Nikolić and G. Jovanović, "Energy Harvesting Techniques in Wireless Sensor Networks", Facta Univ. Ser.: Aut. Cont. Rob., vol. 17, no. 2, pp. 117–142, Dec. 2018.
[59] G. E. Moore, "Cramming More Components onto Integrated Circuits", Electronics, vol. 38, no. 8, pp. 114–117, April 1965.
[60] R. H. Dennard, F. H. Gaensslen, H.-N. Yu, V. L. Rideout, E. Bassous and A. R. LeBlanc, "Design of Ion-Implanted MOSFET's with Very Small Physical Dimensions", IEEE J. Solid-State Circuits, vol. 9, no. 5, pp. 256–268, Oct. 1974.
[61] S. Naffziger, J. Warnock and H. Knapp, "When Processors Hit the Power Wall", In Proceedings of the IEEE International Solid-State Circuits Conference (ISSCC), 2005, pp. 16–17.
[62] S. Borkar and A. A. Chien, "The Future of Microprocessors", Commun. ACM, vol. 54, no. 5, pp. 67–77, May 2011.
[63] M. D. Hill and M. R. Marty, "Amdahl's Law in the Multicore Era", IEEE Comput. Mag., vol. 41, no. 7, pp. 33–38, July 2008.