CHEMICAL ENGINEERING TRANSACTIONS VOL. 97, 2022
A publication of The Italian Association of Chemical Engineering
Online at www.cetjournal.it
Guest Editors: Jeng Shiun Lim, Nor Alafiza Yunus, Jiří Jaromír Klemeš
Copyright © 2022, AIDIC Servizi S.r.l.
ISBN 978-88-95608-96-9; ISSN 2283-9216
DOI: 10.3303/CET2297094
Paper Received: 24 September 2022; Revised: 10 November 2022; Accepted: 23 November 2022
Please cite this article as: Hou Y., 2022, Design of Mathematical Formula Information Retrieval System, Chemical Engineering Transactions, 97, 559-564 DOI:10.3303/CET2297094

Design of Mathematical Formula Information Retrieval System
Yong Hou
Bengbu University, Anhui, 233030, China
aspnetcs@163.com

A mathematical formula information retrieval system (MFIRS) is designed and implemented, and its architecture is discussed. A similarity indexing method based on the sub-formulas of Presentation MathML is proposed, which makes the system math-aware. The mathreteval dataset was created from more than 4,500,000,000 arXiv documents and 158,106,118 mathematical formulas, and the scalability of the system is verified on this dataset. The front end of the system is a web interface that lets users submit complex queries consisting of plain text and mathematical formulas, which can be written in TeX or MathML. When a user queries with TeX, the system converts the input on the fly into a MathML tree representation and indexes it. The system is thus a math-aware formula retrieval engine that supports retrieval by sub-formula similarity and implements indexing of adjacent mathematical formulas.

1. Introduction
By searching a digital library, people can find much of what they need. Mainstream search technology, however, targets plain-text retrieval: text documents are treated as bags of words, and mathematical formulas are not processed.
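As a rough illustration of why a bag-of-words representation cannot handle formulas, the following Python sketch tokenizes two different formulas written in a TeX-like linear form and compares the resulting bags. The function name `bag_of_words`, the tokenization rule, and the two sample formulas are assumptions made for this illustration only; they are not part of the described system.

```python
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Count alphanumeric tokens, ignoring their order and nesting.

    Hypothetical helper: a simplified stand-in for plain-text indexing.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


# Two formulas with different meanings (sample inputs chosen for illustration).
pythagoras = "a^2 + b^2 = c^2"
exponentials = "2^a + 2^b = 2^c"

bag_a = bag_of_words(pythagoras)
bag_b = bag_of_words(exponentials)

# The operators and the base/exponent roles are discarded, so both formulas
# collapse to the same multiset of tokens.
same_bag = bag_a == bag_b
print(same_bag)
```

Because the two bags compare equal, an index built only on such tokens has no way to distinguish the formulas, which is the gap a structure-aware formula index is meant to close.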

Scientific literature is full of indices and complex mathematical formulas, even in the basic metadata, titles, and abstracts of papers. Experience with Google Scholar shows that ignoring mathematical formulas in references can lead to serious retrieval problems. The standard for exchanging mathematics between software tools is W3C's MathML. Few people want to write MathML directly; most prefer a compact TeX-style notation such as LaTeX or AMS-LaTeX. As a result, a mathematical retrieval system should let users query in their preferred notation (such as TeX or (AMS-)LaTeX) to meet their different retrieval preferences, and the data must therefore be converted into a unified format. Presentation MathML or Content MathML is used only for the output of software systems. In scientific and technical literature retrieval, the unresolved problem of mathematical retrieval has become prominent and attracts great interest, because a system that does not support the retrieval of mathematical formulas is incomplete. Currently popular mathematical retrieval systems include MathDex (Chan C, 2020), MathFind (Gardesten M., 2021), EgoMath (Liu H et al., 2021), Egothor (MD A et al., 2021), LaTeXSearch (Perepu P K, 2021), LeActiveMath (Shen Y et al., 2021), MathWebSearch (T V. Bakhteeva et al., 2021), and TUW-University of Technology (Zhai J et al., 2022).

2. Design of the system
When indexing XHTML, HTML, and other documents, the system divides the content into a mathematical formula index and an ordinary text index. The two types of content are indexed differently; ordinary text is indexed in the conventional manner. Figure 1 shows the overall architecture of the system.

Figure 1.
The overall architecture of the system

The index module of the system mainly normalizes the input and preprocesses the mathematical formulas.

2.1 Input Normalization
The MathML in each document is normalized using the UMCL toolset, which avoids the problem of formulas with the same semantics being represented by different MathML markup.

2.2 Mathematical formula Unification
The system uses three different types of unification algorithms. To produce multiple common representations of every formula, the unification algorithms perform a tokenization process. The system returns matches similar to the user's query while preserving formula structure and α-equivalence.

2.3 Coding mathematical formula
The mathematical formula in math format is then encoded using hash coding and path-based coding techniques. Path-based coding splits a mathematical formula into three types of information: sibling-node information for subtrees, ordered path information, and path-free information.

2.4 Extract text information and mathematical formulas
Text information is extracted at four levels, including the body paragraph level, the document level, and the math level. The system recognizes mathematical formulas, which may appear in several parts of a document, both as a whole and in part, in the following way. All documents are read, and Python regular expressions are used to extract, parse, and store the contents of the MathML formulas in the query documents. The specific steps are as follows.
1). Iterate through each document and find all formulas in it with regular expression (1).
Pattern=re.compile('(.*?)',re.S) (1)
2). For each formula in the document, use regular expressions (2), (3), (4), (5) to extract the formula ID number and get the parts of the formula.
pattern = re.compile('