Peer Reviewed Article Vol.3(2) September 2001

Automatic extraction and analysis of financial data from the EDGAR database

Christoph Leinemann schlottmann@aifb.uni-karlsruhe.de Institute AIFB, University of Karlsruhe (TH)
Frank Schlottmann schlottmann@aifb.uni-karlsruhe.de Institute AIFB, University of Karlsruhe (TH)
Detlef Seese seese@aifb.uni-karlsruhe.de Institute AIFB, University of Karlsruhe (TH)
Thomas Stuempert stuempert@aifb.uni-karlsruhe.de Institute AIFB, University of Karlsruhe (TH)

Contents

1. Introduction
2. Structure of SEC 10-K filings containing semistructured company data
3. Dextrapi wrapper
4. Edgar2xml
5. Ongoing work
6. References

1. Introduction

Contemporary financial markets cause a growing need for quick access to information supporting trading decisions. Since the amount of information available on the World-Wide Web is growing exponentially, it is a huge problem to find useful and reliable information efficiently and at the right time. Intelligent systems (e.g. Almeida Ribeiro et al. 1999; Goonatilake and Treleaven 1995; Frick et al. 1996; Hermann et al. 1998) and the new technology of software agents (Klusch 1999) can be useful tools for accomplishing this task. We follow the latter approach by implementing Edgar2xml, a software agent which extracts fundamental company data from the Electronic Data Gathering, Analysis and Retrieval (EDGAR) database of the United States Securities and Exchange Commission (SEC) and outputs these data in a format which is useful to support stock market trading decisions.

The SEC is a regulatory authority for securities markets in the United States whose task is to protect investors and to maintain fair, honest and efficient markets. Registered companies are required to file certain financial company data on forms (e.g. 'form 10-Q' for quarterly financial reports and 'form 10-K' for annual reports), which are then made publicly available on the EDGAR database. The database can be accessed via a WWW interface and contains documents reaching back to 1994. Owing to their high relevance for investor decisions, we concentrate on data extraction from form 10-K filings, although the methods presented in this article can be applied to any other type of filing as well. While companies use some SGML tags in their filings, documents in the EDGAR database generally contain only very few tags, for example at the beginning and at the end of balance sheet information on some but not all 10-K forms.
The balance sheet itself is pure ASCII text and can easily exceed a size of 200 KB. There is a need for automatic extraction of relevant data, because investors who are interested in quantitative balance sheet information naturally prefer immediate online access to relevant data over having to read the entire 10-K filing.

Currently, several EDGAR agents exist, for example EDGAR Online, 10-K Wizard and FRAANK (Kogan et al. 1998). They are used on portal sites for general financial information (e.g. at Yahoo! and BigCharts). These EDGAR agents, however, do not fragment a balance sheet into its components but can only extract well-structured passages, that is, a whole balance sheet. They are unable to extract detailed information like single balance sheet items from the ASCII passages. If not only the online extraction of entire balance sheets but also the process of online financial analysis is supported by software, investors will be able to analyse companies faster and more conveniently. This is the goal of our software agent, Edgar2xml, which is introduced in the following sections.

2. Structure of SEC 10-K filings containing semistructured company data

Automatic extraction of unstructured information is an almost impossible task. Hence our agent uses the semistructure found in some SEC 10-K filings. Valid for a specific company and a single year, form 10-K is divided into four parts. Each part consists of several different sections of the annual report of the company, which are organized as numbered items. An overview of the structure of form 10-K, with special attention to those items containing balance sheets and other financial information, is given below (Skousen 1991; Leinemann 2000).

Figure 1 Overview of form 10-K structure

We focused our work on items 8 and 14, which contain audited balance sheets for two years as well as three audited annual statements of income and cash flow, and supplemental financial data schedules (FDS). The FDS consist of aggregated financial information and selected financial ratios. Only the FDS are sufficiently tagged to be machine understandable in the sense of XML. Our software agent, however, implements a new Java-based methodology for mining even other, semistructured parts of a balance sheet.

We cannot query SEC filings in a database-like fashion based on their underlying structure. However, we can provide database-like querying for semistructured sources by building wrappers around these sources. Each source is wrapped with a translator (or wrapper) that logically converts the underlying data objects into a common information format. Before we introduce the Edgar2xml agent, it is therefore necessary to take a closer look at the dextrapi wrapper, a general API for Data EXTRaction.

3. Dextrapi wrapper

3.1 Overview

The dextrapi wrapper is a framework for extracting information, for example financial data, from any text-based resource. The extracted information is transformed into XML syntax. Most wrapper approaches (Azavant and Sahuguet 1999; Huck et al. 1998) need a fixed structure, for example HTML structure, in the file they extract from. Dextrapi is able to process even ASCII text sources with changing structure, that is, dextrapi detects the beginning of a balance sheet whether that beginning is marked with a tag or with special ASCII text.
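To make this concrete, a minimal Java sketch of such a detection rule could look as follows; the class name and the pattern are illustrative assumptions, not part of the actual dextrapi code:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SectionDetectionSketch {

    // Hypothetical rule: one pattern matches either a tag such as <TABLE>
    // or the plain ASCII heading of a consolidated balance sheet.
    private static final Pattern BALANCE_SHEET_BEGINNING = Pattern.compile(
        "(<TABLE>|CONSOLIDATED\\s+BALANCE\\s+SHEETS?)", Pattern.CASE_INSENSITIVE);

    public static boolean marksBalanceSheetBeginning(String line) {
        Matcher m = BALANCE_SHEET_BEGINNING.matcher(line);
        return m.find();
    }

    public static void main(String[] args) {
        System.out.println(marksBalanceSheetBeginning("Consolidated balance sheets")); // true
        System.out.println(marksBalanceSheetBeginning("Item 1. Business"));            // false
    }
}

Because the same rule fires on both the tagged and the untagged variant, the downstream extraction logic does not need to know which kind of filing it is processing.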
The following picture shows the architecture of dextrapi:

Figure 2 Dextrapi architecture

The data extraction process is accomplished as follows:

1. Text from external sources is written into an input buffer.
2. The text in the input buffer is read by a parser.
3. The knowledge database of the document manager provides the parser with regular expressions. These regular expressions are needed for section and keyword identification. We define section identification as identification of the position where the extraction process starts or ends, for example the beginning of some table. Keyword identification detects items which should be extracted. The document manager organizes the parsing process.
4. For section identification, the parser fires an event if a regular expression in the parser's knowledge base matches some ASCII text in the input buffer. This event activates the data listener.
5. The data listener is activated on receiving an event from the parser. Extraction within a document's section is done by the parser. The parser reads out the input buffer until the data listener detects a structure specified by a regular expression. If the specified structure is detected, relevant information has been found and the parser's data event class fires an event to the listener.
6. Each detected keyword is transformed into a Document Object Model (DOM) element by the listener, that is, the content of a section is transformed to be part of a DOM tree; elements of this DOM tree are generated and sent to the document manager.
7. The internal data structure of the document manager generates a DOM tree from the DOM fragments of the data listener. Each DOM element represents relevant information that is specified by regular expressions and that is to be extracted. The section listener generates a representation of the text document's structure, which is then passed to the internal data structure of the document manager.
8. The document manager writes elements of the DOM tree to an XML output stream which conforms to an XML schema.

3.2 Parser

The parser gets its information from the knowledge database, which describes when to fire an event. We define parser knowledge as a set of keywords and regular expressions that the parser needs to fire events if an ASCII text matches some element of this set. Parser knowledge consists of section knowledge and data knowledge. The section knowledge for identification of the sections (e.g. SEC header for company identification, balance sheet, FDS) has the columns shown in Table 1.

Table 1 Section knowledge

name: name of a section
sectionID: key for identification of the section
sectionBeginning: contains a regular expression matching the beginning of a section
sectionEnding: contains a regular expression matching the ending of a section
subSectionOf: contains the sectionID of its parent section, or '-1' if it is the root section; there can only be one root section
beginningScope: can have the value 'cursor' or 'buffer' and relates to the scope which is subject to the match of the beginning regular expression
endingScope: can have the value 'cursor' or 'buffer' and relates to the scope which is subject to the match of the ending regular expression
burnable: can have the value true or false; if true, the section occurs only once per encapsulating section, if false, the section can occur several times within its parent section

We use section knowledge and data knowledge to optimise the parsing process, which is organized in two steps. In the first step, interesting sections (SEC header, a whole balance sheet, the financial data schedules) are identified. In the second step, data knowledge is used for keyword detection, identifying the data (financial items) in a section that are to be extracted. The separation of structural information, like the beginning of a section, from data extraction information (e.g. keywords detecting a certain structure in the underlying ASCII text) results in an efficient parsing process. The data knowledge consists of knowledge about the data in a section that are to be extracted and has the columns shown in Table 2.

Table 2 Data knowledge

name: name of the data to be extracted
dataId: serves as a primary key
data: a regular expression matching the data associated with the name
parent: the sectionID of the section within which this data occurs; this is a foreign key and relates to the field 'sectionID' in the section knowledge entity
priority: an integer determining the order in which the events are fired in case they occur in the same scope
scope: this field is the same as the scope of the section knowledge entity
depth: the number of bracket pairs in which the data of the match is embraced
burnable: can have the value true or false; if true, the data occurs only once per encapsulating section, if false, the data can occur several times within its parent section
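As an illustration, the rows of the two knowledge entities could be modelled in Java as follows; the field names mirror Tables 1 and 2, while the record and enum declarations themselves are a sketch rather than actual dextrapi code:

import java.util.regex.Pattern;

// Scope of a regular expression match, as described in the tables above.
enum Scope { CURSOR, BUFFER }

// One row of the section knowledge (Table 1).
record SectionKnowledge(
        String name,              // name of a section
        int sectionID,            // key for identification of the section
        Pattern sectionBeginning, // regex matching the beginning of the section
        Pattern sectionEnding,    // regex matching the ending of the section
        int subSectionOf,         // sectionID of the parent section, -1 for the root
        Scope beginningScope,     // scope for the beginning match
        Scope endingScope,        // scope for the ending match
        boolean burnable          // true: occurs at most once per parent section
) {}

// One row of the data knowledge (Table 2).
record DataKnowledge(
        String name,     // name of the data to be extracted
        int dataId,      // primary key
        Pattern data,    // regex matching the data associated with the name
        int parent,      // foreign key: sectionID in the section knowledge
        int priority,    // firing order for events within the same scope
        Scope scope,     // same meaning as in the section knowledge
        int depth,       // number of enclosing bracket pairs
        boolean burnable // true: occurs at most once per parent section
) {}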
The parser reads the SEC filing through a buffer that consists of several lines. Within the buffer, one line represents the cursor. While the buffer moves through the file, it fires two kinds of events: section events and data events. The list of section events which are fired changes dynamically according to the section knowledge from the knowledge database. For every section that is entered, the possible subsection events and the section ending event are loaded into the firing list. The firing list for data events is changed with the parser's plugKnowledge(String section) method, which queries the database for the data events that occur in a section and adds them to the firing list. This is useful when only data of a certain section is of interest, like a balance sheet section, because it prevents redundant data events from being fired throughout the whole document.

The parser registers and unregisters data event listeners and section event listeners. When dealing with section events, the parser performs event single-casting, that is, only one listener object can register with the parser. Concerning data events, the parser serves as an event multi-caster, that is, several data listener objects can register to receive data events. The document manager registers itself as a section listener and manages which regular expressions of its knowledge database (see Figure 2) are activated and therefore sent to the listener registry. The registry administers these regular expressions; if a regular expression is not needed any more, it is deleted from the registry. One or several data listener objects can register to map data. The firing process is called every time a new line is added to the buffer. According to the regular expressions that have matched the buffer and the cursor, the corresponding events are instantiated and the methods of the registered listeners are called, passing the appropriate event as a parameter.
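The following Java fragment sketches this event multi-casting; all class and method names are illustrative assumptions rather than the actual dextrapi API:

import java.util.ArrayList;
import java.util.List;

// Hypothetical event type: carries the name of the matched data item
// and the matched text itself.
class DataEvent {
    private final String name;
    private final String match;

    DataEvent(String name, String match) {
        this.name = name;
        this.match = match;
    }

    String getName() { return name; }
    String getMatch() { return match; }
}

interface DataListener {
    void data(DataEvent event);
}

class ParserSketch {
    // Several data listeners may register at the same time (multi-casting).
    private final List<DataListener> dataListeners = new ArrayList<>();

    void register(DataListener listener)   { dataListeners.add(listener); }
    void unregister(DataListener listener) { dataListeners.remove(listener); }

    // Called whenever a regular expression from the firing list matches
    // the buffer or the cursor line.
    void fireDataEvent(String name, String match) {
        DataEvent event = new DataEvent(name, match);
        for (DataListener listener : dataListeners) {
            listener.data(event); // every registered listener receives the event
        }
    }
}

Keeping the list of registered listeners short, as described above, directly reduces the number of data() calls per fired event.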
3.3 Listener

The listener interface requires the method data(DataEvent). The data() method is supplied with a formal parameter of type DataEvent. The DataEvent object encapsulates two objects: a match object encapsulating the data that has been retrieved, and a string object representing the name of the data. The match object can be accessed through its toString() method; invoking the getName() method on the data event returns a string representing the name of the data.

The listener stores information in its internal DOM representation. The data will be stored in an appropriate place within an internal DOM element, for example by appending a DOM text node that includes the data, by appending an attribute name-value pair, or by creating a sub-element and then appending a text node. A mapping to an internal DOM structure can thus be performed. Invoking the getData() method on a data listener returns the internal DOM element that has been modified through data() method invocations. This element now contains a document fragment that can be further processed by the document manager that has invoked getData(). The section listener returns a DOM object that represents the whole document that has been parsed. Two more methods, sectionBeginning(SectionEvent) and sectionEnding(SectionEvent), are required; they are invoked when sections as defined in the section knowledge begin and end.

3.4 Document manager

The document manager consists of section listeners and holds a reference to the parser, creating data listener objects when it receives an event for extracting a section. The sectionBeginning() method supplies the parser with a certain class of events to fire and registers the matching data listener object. This class of events can be queried from the parser knowledge database according to the current section. The number of events the parser fires and the number of listeners that are registered are crucial to the performance of the parser. To achieve high performance, the sectionEnding() method removes data knowledge from the parser by deleting the class of events associated with the ending section from the parser's firing list, and the corresponding data listener objects are unregistered. This improves the speed of the parsing process because the number of target methods which are called while firing an event is reduced.
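As a sketch of the listener side, the following class (reusing the hypothetical DataEvent and DataListener types from the previous sketch; everything else is likewise an assumption) appends each received data item to an internal DOM fragment, which the document manager can collect through getData():

import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

class BalanceSheetListener implements DataListener {
    private final Document doc;
    private final Element root;

    BalanceSheetListener() throws Exception {
        doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        root = doc.createElement("balanceSheet"); // hypothetical element name
    }

    @Override
    public void data(DataEvent event) {
        // Create a sub-element named after the data item and append the
        // matched text as a text node (the event name is assumed here to
        // be a valid XML element name).
        Element item = doc.createElement(event.getName());
        item.appendChild(doc.createTextNode(event.getMatch()));
        root.appendChild(item);
    }

    // Returns the document fragment built so far to the document manager.
    Element getData() { return root; }
}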
Now we are able to describe Edgar2xml, the agent for extracting financial data from SEC filings.

4. Edgar2xml

First we give an overview of the Edgar2xml agent architecture and explain its components, that is, the different layers (see Figure 3).

Figure 3 Overview of Edgar2xml's architecture

4.1 Data acquisition layer

The data acquisition layer is responsible for retrieving the target file and for opening a data input stream that is passed to the data extraction layer, that is, it extracts the text-based data from the EDGAR database and writes it to the input buffer. The data acquisition layer receives a request with the parameters company identification and year from the processing layer, and it sends the text-based data to the data extraction layer.

4.2 Data extraction layer

The data extraction layer performs the text mining process. It receives the data in its proprietary format and processes it. A parser reads the data input stream and fires corresponding events identifying relevant data that have been found. The parser reads the regular expressions from the knowledge database.

In Figure 4 a typical sample input file is presented, showing the structure of a 10-K filing. Keywords for detecting the balance sheet section are in this case 'balance', 'sheet', '' and 'Current assets'. In this example there are only two listed balance sheet items, namely 'Cash and cash equivalents' and 'Short-term investments'.

Figure 4 Excerpt from form 10-K for Intel Corp.

0001012870-00-001562-index.html : 20000324
0001012870-00-001562.hdr.sgml : 20000324
ACCESSION NUMBER: 0001012870-00-001562
CONFORMED SUBMISSION TYPE: 10-K
PUBLIC DOCUMENT COUNT: 6
CONFORMED PERIOD OF REPORT: 19991225
FILED AS OF DATE: 20000323
FILER:
  COMPANY DATA:
    COMPANY CONFORMED NAME: INTEL CORP
    CENTRAL INDEX KEY: 0000050863
    STANDARD INDUSTRIAL CLASSIFICATION: SEMICONDUCTORS & RELATED DEVICES [3674]
    IRS NUMBER: 941672743
.
.
Page 14

Consolidated balance sheets
December 25, 1999 and December 26, 1998
(In millions--except per share amounts)

                                            1999       1998
                                          --------   --------
Assets
Current assets:
Cash and cash equivalents                 $  3,695   $  2,038
Short-term investments                       7,705      5,272
.
.(..... items of a balance sheet)
.
Liabilities
.
.(.... items of a balance sheet)

The task of the listeners is the following: listeners get the events and generate an internal object representation of the events they receive. This object representation of the data can be accessed by the data processing layer.
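As an example of data knowledge at work, a regular expression for a two-column balance sheet line such as 'Cash and cash equivalents $ 3,695 $ 2,038' could capture the item name together with both yearly amounts. The pattern below is an illustrative assumption, not the one actually stored in the knowledge database:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LineItemSketch {

    // Captures an item name followed by two amounts, with optional
    // dollar signs and thousands separators (hypothetical pattern).
    private static final Pattern LINE_ITEM = Pattern.compile(
        "^\\s*([A-Za-z][A-Za-z ,.&-]+?)\\s+\\$?\\s*([\\d,]+)\\s+\\$?\\s*([\\d,]+)\\s*$");

    public static void main(String[] args) {
        Matcher m = LINE_ITEM.matcher(
            "Cash and cash equivalents              $  3,695   $  2,038");
        if (m.matches()) {
            System.out.println(m.group(1).trim()); // Cash and cash equivalents
            System.out.println(m.group(2));        // 3,695
            System.out.println(m.group(3));        // 2,038
        }
    }
}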
The data extraction layer applies both object-model and event-driven parsing: we use the event-driven concept for the identification of the different sections, that is, the items of a form 10-K, and the object-model approach, that is a DOM parser, for the representation of the data extracted from these sections. This combination guarantees good performance without a lack of power in rendering the data.

4.3 Data processing layer

The extraction layer stores balance sheet items in a DOM object. The data processing layer then transforms the DOM elements into an XML data output stream. Figure 5 displays such an XML output, which contains each detected financial item; the balance sheet items have been converted to XML. These XML data conform to an XML schema, XML Data Reduced (XDR), which is registered at BizTalk, a public repository for XML schemata. An XML schema is a set of rules describing the underlying document structure of an XML document.

Figure 5 Excerpt from an XML output for form 10-K of Intel Corp. (the elements carry, among other values, the standard industrial classification 'SEMICONDUCTORS & RELATED DEVICES', the company name 'INTEL CORP', the IRS number '941672743', the fiscal years 1999 and 1998, the multiplier '1000000' with currency 'USD', the period end 'DEC-25-1999' and the legend 'THIS SCHEDULE CONTAINS SUMMARY INFORMATION EXTRACTED FROM INTEL CORPORATION'S CONSOLIDATED STATEMENTS OF INCOME AND CONSOLIDATED BALANCE SHEETS')

Several recommendations exist at the World-Wide Web Consortium for XML schemas, for example XML-Schema, XML-Data, Document Content Description (DCD) and Document Type Definitions (DTDs). Each XML schema language has its own syntax for constructing these rules and a different set of features for defining them. Document Type Definitions, for instance, do not allow the definition of data types like integer or real numbers, which are necessary for processing financial data. Therefore we use the XDR schema, in which data types for each financial ratio can be defined. The XML data can be used as a message for other applications, like a financial analysis tool that uses the XML data for calculating ratios etc. For presentation purposes the XML data are transformed by an XSL processor into formats like HTML or WML.
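In Java, this last step could be sketched with the standard JAXP transformation API, which serializes a DOM tree to an XML output stream; the surrounding class and method names are illustrative assumptions. An XSL stylesheet for the HTML or WML output could be passed to the same factory via newTransformer(Source):

import java.io.OutputStream;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

public class XmlOutputSketch {

    // Writes the DOM tree built by the document manager to an XML output
    // stream; an identity transformation serializes the tree as XML.
    static void writeXml(Document doc, OutputStream out) throws Exception {
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        transformer.transform(new DOMSource(doc), new StreamResult(out));
    }
}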
5. Ongoing work

The activities of building agents for the automatic transformation of fundamental company data into a representation that enables and supports quick trading decisions cannot be separated from standardization activities. In September 1999 the XFRML working group was founded by the AICPA (American Institute of Certified Public Accountants) with the aim of developing a standardized computer language for describing financial reports with only one XML vocabulary. The XFRML specification, an XML dialect for financial applications named XBRL (Extensible Business Reporting Language), is still under development. At the moment XFRML provides only a sample 10-K filing which adapts XML tags to each ASCII 10-K output. Therefore there is a strong need for a single specification for all 10-K filings which enables quick access to 10-K filings, supporting financial analysis and fast trading decisions.

We have shown that it is possible to automatically extract financial data and transform these data into XML. With the usage of XML for describing financial data, the data become machine understandable and reusable. At the moment we do not detect all balance sheet items. Therefore the scope of Edgar2xml can be extended in two dimensions:

Improving extraction quality: include more financial items in a single balance sheet and detect synonyms (financial items with different names but the same meaning). This task could be done by modelling financial data with an ontology.

Extending extraction scope: extract not only balance sheet information but also the consolidated statement of income and the consolidated statement of cash flow.

Up to now Edgar2xml has been capable of extracting the financial data of balance sheets and the financial data schedules. By specifying new data events in the database and by implementing new listeners, the functionality can be extended to other applications.