11_Armen Kostanyan's article_116-121 116 Mathematical Problems of Computer Science 54, 116--121, 2020. UDC 519.688 Fuzzy String Matching Using a Prefix Table Armen H. Kostanyan IT Educational and Research Center Yerevan State University, Armenia e-mail: armko@ysu.am Abstract The string matching problem (that is, the problem of finding all occurrences of a pattern in the text) is one of the well-known problems in symbolic computations with applications in many areas of artificial intelligence. The most famous algorithms for solving it are the finite state machine method and the Knuth-Morris-Pratt algorithm (KMP). In this paper, we consider the problem of finding all occurrences of a fuzzy pattern in the text. Such a pattern is defined as a sequence of fuzzy properties of text characters. To construct a solution to this problem, we introduce a two-dimensional prefix table, which is a generalization of the one-dimensional prefix array used in the KMP algorithm. Keywords: KMP algorithm, Approximate string matching, Fuzzy string matching 1. Introduction String matching is a well-known computational problem, the earliest references to which date to the 1960s. Interest in this problem really grew with the publication of the Boyer-Moor (BM) [1] and Knuth-Morris-Pratt (KMP) [2] algorithms in the 1970s. Since then, quite a lot of exact string matching algorithms have been proposed using one or another modification of BM or KMP. Along with investigations on exact string matching, there have also been investigations on approximate string matching that looked at the problem of finding a generic pattern in the text. Unlike the regular pattern, specified as a list of characters, the generic pattern is rather some description of the substring to be found. Research on approximate string matching has focused on distance-based string matching [3], string matching using meta-characters in patterns, and more generally, string matching using a regular expression as a pattern [4]. A detailed survey of these works is given in the monograph [5]. In this paper, we formulate a fuzzy string matching problem, in which the pattern is represented as a sequence of fuzzy characters, and the problem of finding all occurrences of such A. Kostanyan 117 a pattern in a text with a given accuracy is defined. This problem was investigated in [6], where a non-deterministic transition system was constructed to describe the possibilities of processing the given text in order to find all occurrences of a fuzzy pattern in it, and an efficient algorithm for determining some occurrences was proposed. In contrast, here we propose an efficient algorithm to determine all occurrences of a fuzzy pattern in the text, mimicking the KMP algorithm. The paper is organized as follows. Section 2 presents the problem definitions. Section 3 introduces the concept of a two- dimensional prefix table, which generalizes the notion of a one dimensional prefix array used in the KMP algorithm. The text processing algorithm using the prefix table is presented in Section 4. The complexity of the proposed algorithm is analyzed in Section 5. Finally, the conclusion summarizes the obtained results. 2. The Fuzzy String Matching Problem Suppose that (L, ∨, ∧, 0, 1) is a finite lattice with the smallest element 0 and the largest element 1. According to [7], the fuzzy subset A of the universal set U is defined by the membership function µA: U→L that associates with each element x from U the number µA(x) in L, representing the grade of membership of x in A. A fuzzy subset A from U can be represented by the additive form ( ) ∈ = Ux A xxA µ/ . We say that an element x definitely belongs to A if µA(x) = 1, and it definitely does not belong to A if µA(x) = 0. On the contrary, if 0 < µA(x) < 1, we say that x belongs to A with degree µA(x). Let us define a fuzzy symbol α over the alphabet Σ to be a fuzzy subset of Σ. Given a character x ∈ Σ, we say that x matches α with the grade µα(x). We define the fuzzy pattern P[1 .. m] as a sequence of fuzzy symbols of length m. For a given text T[1 .. n], a fuzzy pattern P[1 .. m] and a threshold λ, we formulate the fuzzy string matching problem as the problem of finding all positions w (or, valid shifts), 1 ≤ w ≤ n-m+1, in the text such so that µ P[k] (T[w + k - 1]) ≥ λ for all k, 1 ≤ k ≤ m. 3. Prefix Table Let x = x[1 .. q] be an array of text symbols of length q and λ ≥ 0. Let us define the x-border of P = P[1..m] as a subarray P[1 .. k] (k < q) such that µ P[i] (x[q – k + i]) ≥ λ for all i, 1 ≤ i ≤ k, (that is, the last k characters of x match the first k symbols of P with degree of at least λ). Let LBP(x) denote the longest x-border of P. Example 3.1: Choose Σ = {1, 2, 3, 4, 5}, L = { 0, 0.25, 0.5, 0.75, 1 } and define the fuzzy symbols S (small), M (middle) and L (large) as follows: S = 1/1 + 2/0.75 + 3/0.5 + 4/0.25 + 5/0, M = 1/0 + 2/0.75 + 3/1 + 4/0.75 + 5/0, L = 1/0 + 2/0.25 + 3/0.5 + 4/0.75 + 5/1. Fuzzy String Matching Using Prefix Table 118 Then, for x = 14252, P = SLMLS and λ = 0.75, the following statement holds LBP(x) = SLM. � Let us define the prefix table as a matrix π[1 .. m, 1 .. n] such that Claim 1: For all 2 ≤ q ≤ m, 1 ≤ i ≤ n–q+1 Proof: Follows from the following statements:  [|LBP ( T[i .. i+q-1] )| = q-1 ]  [ µ P[s] (T[i+s]) ≥ λ for all s, 1 ≤ s ≤ q-2 ]   [|LBP(T[i .. q-1])| = q-2, µ P[q-1] (T[i+q-2]) ≥ λ ]   [π [q-1, i] = q-2, µ P[q-1] (T[i+q-2]) ≥ λ ]  [π [q, i] = q-1].  [|LBP ( T[i .. i+q-1] )| < q-1]  [ |LBP (T[i .. i+q-1])| = |LBP ( T[i+1 .. i+q-1])| ]   [ π [q, i] = π [q-1, i+1] ]. � The formula (1) allows us to construct the prefix table by filling it from top to bottom and from left to right. The procedure below provides additional details. Compute-Prefix-Table(P, T, λ) 1 m = P.length, n = T.length 2 let π[1..m, 1.. n] be the prefix table to be filled accordingly 3 for i = 1 to n 4 π [1, i] = 0 5 for q = 2 to m //substring length 6 for i = 1 to n-q+1 //position in the text 7 if π [q-1, i] == q - 2 and µ P[q-1] (T[i+q-1]) ≥ λ 8 π [q, i] = q - 1 9 else π [q, i] = π [q-1, i+1] 10 return π Example 3.2: Suppose that P = MSMSLM, T = 223141325422414251, where S, M and L are the fuzzy symbols defined in Example 3.1. The matrix in Fig. 1 presents а prefix table built by the Compute-Prefix-Table algorithm based on P, T and λ = 0.75. 0, if q = 1 π [q, i] = |LBP(T[i .. i+q-1])|, if 2 ≤ q ≤ m, 1 ≤ i ≤ n–q+1 undefined, if i+q-1 > n q - 1, if π [q-1, i] = q-2 and µ P[q-1] (T[i+q-1]) ≥ λ π [q, i] = (1) π [q-1, i+1], otherwise A. Kostanyan 119 Fig. 1. Prefix table for P = MSMSLM, T = 223141325422414251. 4. Text Processing using a Prefix Table The following procedure is an adaptation of the KMP algorithm to solve the fuzzy strings matching problem using a prefix table. KMP-FUZZY-STRING-MATCHING ( P, T, λ ) 1 m = P.length, n = T.length, 2 π = Compute-Prefix-Table(P, T, λ) 3 q = 0 // number of characters matched 4 for i = 1 to n 5 while q > 0 and µ P[q+1](T[i]) < λ 6 q = π[q, i - q] 7 if µ P[q+1](T[i]) ≥ λ 8 q = q + 1 9 if q == m //the entire pattern matches 10 Print( i – q +1 ) 11 q = π[q, i – q + 1] Example 4.1. Fig. 2 illustrates the fuzzy string matching process for T, P and λ from Example 3.2 and the prefix table in Fig. 1. Fig. 2. Fuzzy string matching process. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 T: 2 2 3 1 4 1 3 2 5 4 2 2 4 1 4 2 5 1 1 2 3 4 5 # P: M S M S L M 1 2 3 4 # M S M S L 1 2 3 4 5 6 M S M S L M - matching 1 2 3 # M S M S 1 2 3 4 5 6 M S M S L M - matching 1 2 3 4 5 # M S M S L M # M 2 2 3 1 4 1 3 2 5 4 2 2 4 1 4 2 5 1 M 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 S 1 1 0 1 0 1 1 0 1 1 1 1 0 1 1 0 0 - M 1 2 1 2 1 2 0 1 2 2 1 2 1 2 0 0 - - S 2 3 2 3 2 0 1 2 3 3 2 3 2 0 0 - - - L 3 4 3 4 0 1 2 3 3 4 3 4 0 0 - - - - M 4 3 4 5 1 2 3 3 4 5 4 5 0 - - - - - Fuzzy String Matching Using Prefix Table 120 5. Analysis To estimate the complexity of the KMP-FUZZY-STRING-MATCHING algorithm, let us take advantage of the potential method. Define the potential before every execution of the body of the for loop equal to q, which is initially 0 and never becomes negative. Note that the while loop has an amortized cost of O(1) since the total number of executions of the assignment statement in line 6 does not exceed the potential value before this loop is executed. The two subsequent if statements obviously have O(1) amortized complexities. Thus, the amortized complexity of the execution of the body of the for loop is O(1), which gives O(n) complexity for the processing phase. The complexity of the preprocessing phase represented by the Compute-Prefix-Table procedure is obviously O(mn). The algorithm uses O(mn) extra memory needed to represent the prefix table π. 5. Conclusion The problem of finding occurrences of a fuzzy pattern in a given text with a given accuracy has been considered in this paper. A generalization of the KMP string matching algorithm is proposed to solve this problem. Like the KMP algorithm, the proposed algorithm consists of preprocessing and processing phases. The preprocessing phase creates a two-dimensional prefix table that is used in text processing. For a pattern of length m and a text of length n, the preprocessing phase takes O(mn) time, and the processing phase takes O(n) time. The extra space used by the algorithm is O(mn). References [1] R. S. Boyer and J. S. Moore, “A fast string matching algorithm,” Association for Computing Machinery, vol. 20, no. 10, pp. 762-772, 1977. [2] D. Knuth and M. Pratt, “Fast pattern matching in strings,” SIAM J. Comput. vol. 6, no. 2, pp. 323-350, 1977. [3] G. Landau and U. Vishkin, “Efficient string matching with k mismatches,” TCS, vol. 43, pp. 239-249, 1986. [4] R. Baeza-Yates and N. Gonzaio, “A faster algorithm for approximate string matching,” in Proc. of Seventh Annual Symp. Combinatorial Pattern Matching, Spinger-Verlag, 1996, pp. 1-23. [5] B. Smyth, “Computing Patterns in Strings”, Addison-Wesley UK, 2003. [6] A. Kostanyan, “Fuzzy String Matching with Finite Automata”, in Proceedings on 2017 IEEE Conference CSIT-2017, Yerevan, Armenia, pp. 25-29. IEEE Press, USA 2018. [7] L. A. Zadeh, “The concept of a linguistic variable and its application to approximate reasoning-I,” Information Sciences, vol. 8, pp. 199-249, 1975. Submitted 16.07.20, accepted 12.11.20. A. Kostanyan 121 Տողում ենթատողի ոչ հստակ որոնում նախածանցների աղյուսակի օգտագործմամբ Արմեն Հ․ Կոստանյան ՏՏ կրթական և հետազոտական կենտրոն Երևանի պետական համալսարան, Հայաստան e-mail: armko@ysu.am Ամփոփում Տողում ենթատողի որոնման խնդիրը արհեստական բանականության տարբեր բնագավառներոմ լայն կիրառում ունեցող խնդիրներից է։ Նշված խնդրի լուծման հանրահայտ ալգորիթմներն են՝ վերջավոր ավտոմատների մեթոդը և Կնուտի- Մորիսի-Պրատի (ԿՄՊ) ալգորիթմը։ Տվյալ հոդվածում դիտարկվում է հստակ տողում (տեքստում) ոչ հստակ ենթատողի (շաբլոնի) որոնման խնդիրը։ Ոչ հստակ ենթատողը սահմանվում է որպես տեքստի սիմվոլների ոչ հստակ հատկությունների հաջորդականություն։ Խնդրի լուծումը կառուցվում է ԿՄՊ ալգորիթմում օգտագործվող նախածանցների միաչափ աղյուսակի երկչափ ընդլայնման եղանակով։ Բանալի բառեր` ԿՄՊ ալգորիթմ, տողում ենթատողի մոտավոր որոնում, տողում ենթատողի ոչ հստակ որոնում: Нечеткий поиск подстроки в строки с использованием таблицы префиксов Армен Г. Костанян Образовательный и научный центр ИТ Ереванский государственный университет, Армения e-mail: armko@ysu.am Аннотация Задача поиска подстроки в строке одна из часто встречающихся задач в области символических вычислений, имеющая многочисленные применения в задачах искусственного интеллекта. К числу наиболее известных алгоритмов для ее решения относятся метод конечных автоматов и алгоритм Кнута-Морриса-Прата (алгоритм КМП). В настоящей статье рассматривается задача поиска всех вхождений нечеткой подстроки в строке. При этом нечеткая подстрока определяется как последовательность нечетких свойств символов строки (текста). Поставленная задача решается с помощью двумерной таблицы префиксов, являющейся обобщением одномерной таблицы префиксов, используемой в алгоритме КМП. Ключевые слова: алгоритм КМП, приближенный поиск подстроки в строке, нечеткий поиск подстроки в строке.