[cord-020841-40f2p3t4] Hofstätter, Sebastian; Zlabinger, Markus; Hanbury, Allan: "Neural-IR-Explorer: A Content-Focused Tool to Explore Neural Re-Ranking Results." Advances in Information Retrieval, 2020-03-24. DOI: 10.1007/978-3-030-45442-5_58.

Abstract: In this paper we look beyond metrics-based evaluation of information retrieval systems and explore the reasons behind ranking results. We present the content-focused Neural-IR-Explorer, which empowers users to browse through retrieval results and inspect the inner workings and fine-grained results of neural re-ranking models. The explorer includes a categorized overview of the available queries, as well as an individual query result view with various options to highlight semantic connections between query-document pairs. The Neural-IR-Explorer is available at: https://neural-ir-explorer.ec.tuwien.ac.at/.

The prevalent evaluation of information retrieval systems, based on metrics that are averaged across a set of queries, distills a large variety of information into a single number. This approach makes it possible to compare models and configurations; however, it also decouples the explanation from the evaluation. With the adoption of neural re-ranking models, where the scoring process is arguably more complex than in traditional retrieval methods, the divide between the result score and the reasoning behind it becomes even stronger. Because neural models learn from data, they are more likely to evade our intuition about how their components should behave. Having a thorough understanding of neural re-ranking models is important for anybody who wants to analyze or deploy these models [6, 7].

In this paper we present the Neural-IR-Explorer: a system to explore the output of neural re-ranking models. The explorer complements metrics-based evaluation by focusing on the content of queries and documents, and on how the neural models relate them to each other. We enable users to efficiently browse the output of a batched retrieval run. We start with an overview page showing all evaluated queries. We cluster the queries using their term representations taken from the neural model. Users can explore each query result in more detail: we show the internal partial scores and the content of the returned documents with different highlighting modes to surface the inner workings of a neural re-ranking model. Here, users can also select different query terms to individually highlight their connections to the terms in each document. In our demo we focus on the kernel-pooling models KNRM [14] and TK [8] evaluated on the MSMARCO-Passage [2] collection. The kernel-pooling makes it easy to analyze intermediate scoring results. Finally, we discuss some of the insights we gained about the KNRM model using the Neural-IR-Explorer. The Neural-IR-Explorer is available at: https://neural-ir-explorer.ec.tuwien.ac.at/.

Our work sits at the intersection of visual IR evaluation and the interpretability of neural networks with semantic word representations. The IR community has mainly focused on tools to visualize result metrics over different configurations: CLAIRE allows users to select and evaluate a broad range of different settings [1]; AVIATOR integrates basic metric visualization directly in the experimentation process [4]; and the RETRIEVAL tool provides a data-management platform for multimedia retrieval, including differently scoped metric views [9]. Lipani et al.
[11] created a tool to inspect different pooling strategies, including an overview of the relevant result positions of retrieval runs.

From a visualization point of view, term-by-term similarities are similar to attention, as both map a single value to a token. Lee et al. [10] created a visualization system for attention in a translation task. Transformer-based models provide ample opportunity to visualize different aspects of the many attention layers used [3, 13]. Visualizing simpler word embeddings is possible via a neighborhood of terms [5].

Now we showcase the capabilities of the Neural-IR-Explorer (Sect. 3.1) and how we already used it to gain novel insights (Sect. 3.2). The explorer displays data created by a batched evaluation run of a neural re-ranking model. The back-end is written in Python and uses Flask as web server; the front-end uses Vue.js. The source code is available at: github.com/sebastian-hofstaetter/neural-ir-explorer.

When users first visit our website they are greeted with a short introduction to neural re-ranking and the selected neural model. We provide short explanations throughout the application, so that new users can effectively use our tool. We expect this tool's audience to be not only neural re-ranking experts, but anyone who is interested in IR.

The central hub of the Neural-IR-Explorer is the query overview (Fig. 1). We organize the queries by clustering them in visually separated cards. By default, we collapse the cards to show only a couple of queries each. This is especially useful for collections with a large number of queries, such as the MSMARCO collection we use in this demo (the dev set contains over 6,000 queries). In the cluster header we display a manually assigned summary title, the median result of the queries, and the median difference to the initial BM25 ranking, as this is the basis for the re-ranking. Each query is displayed with the rank of the first relevant document, the difference to BM25, and the query text. The controls at the top allow users to sort the queries and clusters, including a random option to discover new queries. Users can expand all clusters or apply a term-prefix filter to search for specific words in the queries.

Once a user clicks on a query, they are redirected to the query result view (Fig. 2). Here, we offer an information-rich view of the top documents returned by the neural re-ranking model. Each document is displayed in full with its rank, overall score, and kernel-specific scores. The header controls allow users to highlight the connections between the query and document terms in two different ways. First, users can choose a minimum cosine similarity that a term pair must exceed to be colored, which is a simple way of exploring the semantic similarity of the word representations. Second, for the kernel-pooling models that we support, we offer a highlight mode much closer to how the neural model sees the document: based on the association of a term with a kernel. Users can select one or more kernels, and terms are highlighted based on their value after the kernel transformation. Additionally, we enable users to select two documents and compare them side-by-side (Fig. 3). Users can highlight query-document connections as in the list view. Additionally, we display the different kernel scores in the middle, so that users can effectively investigate which kernels of the neural model have the deciding influence on the different scores for the two documents.
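The two highlighting modes described above can be sketched in a few lines. The following is a minimal NumPy illustration, assuming the commonly used KNRM kernel setting (soft-match kernels with means from -0.9 to 0.9 and width 0.1, plus an exact-match kernel in the original model); it is not the explorer's actual front-end code.

```python
import numpy as np

def cosine_highlight(query_emb, doc_emb, threshold=0.7):
    """Mark document terms whose best cosine similarity to any query term exceeds a threshold."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    sim = d @ q.T                      # (doc_len, query_len) cosine similarities
    return sim.max(axis=1) >= threshold

def kernel_activations(query_emb, doc_emb, mus=np.arange(-0.9, 1.01, 0.2), sigma=0.1):
    """Per-term activation of each soft-match kernel, as used for the kernel highlight mode."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    d = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    sim = d @ q.T                      # (doc_len, query_len)
    # RBF kernel transformation: one activation per (document term, kernel)
    acts = np.exp(-((sim[..., None] - mus) ** 2) / (2 * sigma ** 2))
    return acts.sum(axis=1)            # sum over query terms -> (doc_len, n_kernels)
```

Selecting a kernel in the interface then amounts to coloring each document term according to its activation in the corresponding column of this matrix.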
We already found the Neural-IR-Explorer to be a useful tool to analyze the KNRM neural model and understand its behavior better. The KNRM model includes a kernel for exact matches (cosine similarity of exactly 1); however, judging from the displayed kernel scores, this kernel is not a deciding factor. Most of the time the kernels for 0.9 and 0.7 (meaning quite close cosine similarities) are in fact the deciding factor for the overall score of the model. We assume this is due to the fact that every candidate document (retrieved via exact-matching BM25) contains exact matches, and therefore exact matching is not a differentiating factor anymore, a specific property of the re-ranking task. Additionally, the Neural-IR-Explorer also illuminates the pool bias [12] of the MSMARCO ranking collection: the small number of judged documents per query makes the evaluation fragile. Users can see how unjudged but seemingly relevant documents are actually ranked higher than the judged relevant documents, wrongly decreasing the model's measured score.

We presented the content-focused Neural-IR-Explorer to complement metrics-based evaluation of retrieval models. The key contribution of the Neural-IR-Explorer is to empower users to efficiently explore retrieval results at different depths. The explorer is a first step to open the black boxes of neural re-ranking models, as it investigates neural network internals in the retrieval task setting. The seamless and instantly updated visualizations of the Neural-IR-Explorer offer a great foundation to inspire future work, both on neural ranking models and on how we evaluate them.

References:
- CLAIRE: a combinatorial visual analytics system for information retrieval evaluation
- MS MARCO: a human generated machine reading comprehension dataset
- Visualizing and measuring the geometry of BERT
- A progressive visual analytics tool for incremental experimental evaluation
- Interactive analysis of word vector embeddings
- Let's measure run time! Extending the IR replicability infrastructure to include performance aspects
- On the effect of low-frequency terms on neural-IR models
- Interpretable & time-budget-constrained contextualization for re-ranking
- RETRIEVAL: an online performance evaluation tool for information retrieval methods
- Interactive visualization and manipulation of attention-based neural machine translation
- Visual Pool: a tool to visualize and interact with the pooling method
- The impact of fixed-cost pooling strategies on test collection bias
- A multiscale visualization of attention in the Transformer model
- End-to-end neural ad-hoc ranking with kernel pooling

Acknowledgements. This work has received funding from the European Union's Horizon 2020 research and innovation program under grant agreement No. 822670.

[cord-020834-ch0fg9rp] Grand, Adrien; Muir, Robert; Ferenczi, Jim; Lin, Jimmy: "From MaxScore to Block-Max WAND: The Story of How Lucene Significantly Improved Query Evaluation Performance." Advances in Information Retrieval, 2020-03-24. DOI: 10.1007/978-3-030-45442-5_3.

Abstract: The latest major release of Lucene (version 8) in March 2019 incorporates block-max indexes and exploits the block-max variant of WAND for query evaluation, which are innovations that originated in academia. This paper shares the story of how this came to be, which provides an interesting case study at the intersection of reproducibility and academic research achieving impact in the "real world".
We offer additional thoughts on the often idiosyncratic processes by which academic research makes its way into deployed solutions.

We share the story of how an innovation that originated in academia, block-max indexes and the corresponding Block-Max WAND query evaluation algorithm of Ding and Suel [6], made its way into the open-source Lucene search library. This represents not only a case study in widespread reproducibility, since every recent deployment of Lucene has access to these features and thus their performance benefits can be easily measured, but also a case study of academic research achieving significant impact. How did these innovations make their way from the "ivory tower" into the "real world"? We recount the sequence of events, including false starts, that finally led to the inclusion of Block-Max WAND in the latest major version of Lucene (version 8), released in March 2019.

We see this paper as having two main contributions beyond providing a narrative of events. First, we report results of experiments that attempt to match the original conditions of Ding and Suel [6] and present additional results on a number of standard academic IR test collections. These experiments characterize the performance of Lucene's implementation and show the extent to which performance improvements are retained when moving from a research prototype to a production codebase. Second, we offer a number of observations about the adoption of academic innovations, perhaps providing some insight into how academics might achieve greater real-world impact with their work.

From its very beginnings in 1999, Lucene has mostly existed in a "parallel universe" from academic IR researchers. Part of this can be attributed to its "target audience": developers who wish to build real-world search applications, as opposed to researchers who wish to write papers. Academic IR researchers have a long history of building and sharing search engines, dating back to the mid 1980s with Cornell's SMART system [4]. The tradition continues to this day, with Lemur/Indri [12, 13] and Terrier [8, 14] being the most successful examples of open-source academic search engines, still popular with many researchers today. Until recently, there has been little exchange between Lucene and these systems, other than a few academic workshops [16, 21].

Lucene has, for the longest time, been somewhat maligned in the academic IR community. For much of its existence, its default ranking model was a variant of tf-idf that was not only ad hoc, but demonstrably less effective than ranking models that were widely available in academic systems [18]. Okapi BM25 was not added to Lucene until 2011, more than a decade after it gained widespread adoption in the research community; the consensus had long emerged that it was more effective than tf-idf variants. This lag has contributed to the broad perception by researchers that Lucene produces poor search results and is ill-suited for information retrieval research.

This negative perception of Lucene, however, began to change a few years ago. In 2015, an evaluation exercise known as the "Open-Source Reproducibility Challenge" [7] benchmarked seven open-source search engines and demonstrated that Lucene was quite competitive in terms of both effectiveness and efficiency. It was the fourth fastest system (of seven) in terms of query evaluation, beating all the systems that were better than it in terms of effectiveness.
Since then, there has been a resurgence of interest in adopting Lucene for information retrieval research, including a number of workshops that brought together like-minded researchers over the past few years [1, 2]. Anserini [19, 20] is an open-source toolkit built on Lucene that was specifically designed to support replicable information retrieval research by providing many research-oriented features missing from Lucene, such as out-of-the-box support for a variety of common test collections. The project aims to better align IR researchers and practitioners, as Lucene has become the de facto platform used in industry to build production search solutions (typically via systems such as Elasticsearch and Solr). The experiments in this paper were conducted with Anserini.

From MaxScore to Block-Max WAND. At Berlin Buzzwords in 2012, Stefan Pohl gave a presentation about MaxScore [17] to raise awareness about efficient retrieval techniques in the Lucene community [15]. The presentation was accompanied by a working prototype. This contribution was exciting but also challenging to integrate, as it conflicted with some of the flexibility that Lucene provides and required an index rewrite. There were ideas on how to address these issues, but they entailed a lot of effort, and so the issue remained stalled for about five years.

Five years is a long time, and many changes occurred meanwhile. The switch from tf-idf to BM25 as Lucene's default scoring function in 2015 created a natural upper bound on scores due to BM25's saturation effect, which made it possible to implement retrieval algorithms that reasoned about maximum scores without changes to Lucene's index format. This led to an effort to implement a general-purpose WAND [3], based on a previous implementation for BooleanQuery. Lucene received support for WAND at the end of 2017 (although it wasn't released until version 8.0, together with block-max indexes). Implementing WAND introduced two new issues. First, the total hit count would no longer be accurate, since not all matches are visited. Common analytics use cases depend on this count, and many search engines display this value in their interfaces (see additional discussion in Sect. 5). Second, the fact that some Lucene queries could produce negative scores became problematic, so Lucene now requires positive scores.

Support for block-max indexes was the final feature that was implemented, based on the developers' reading of the paper by Ding and Suel [6], which required invasive changes to Lucene's index format. Note that the paper describes directly storing the maximum impact score per block, which fixes the scoring function at indexing time. To provide flexibility in being able to swap in different scoring functions, the Lucene implementation instead stores all (tf, dl) pairs (term frequency and document length) that might yield the maximum score. If we have one such pair (tf_i, dl_i), then we can remove all other pairs (tf_j, dl_j) where tf_j <= tf_i and dl_j >= dl_i, since they are guaranteed to yield lower (or equal) scores, based on the assumption that scores increase monotonically with increasing tf and decreasing dl. This is implemented by accumulating all such pairs in a tree-like structure during the indexing process. These pairs are stored in skip lists, so the information is available for groups of 8, 64, 512, 4096, ... blocks, allowing query evaluation to skip over more than one block at a time.
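Lucene itself is written in Java; the following Python sketch only illustrates the two ideas described above, the pruning rule for (tf, dl) pairs and the query-time computation of a block's upper bound. The BM25 parameterization and the function names are our own assumptions for illustration, not Lucene's API.

```python
import math

def prune_pairs(pairs):
    """Keep only (tf, dl) pairs that are not dominated by a pair with tf' >= tf and dl' <= dl."""
    kept = []
    for tf, dl in pairs:
        if any(tf2 >= tf and dl2 <= dl and (tf2, dl2) != (tf, dl) for tf2, dl2 in pairs):
            continue  # dominated: some other pair can only score higher or equal
        kept.append((tf, dl))
    return kept

def bm25_upper_bound(pairs, idf, avg_dl, k1=1.2, b=0.75):
    """Upper-bound score of a block: the maximum BM25 score over its surviving (tf, dl) pairs."""
    def bm25(tf, dl):
        return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * dl / avg_dl))
    return max(bm25(tf, dl) for tf, dl in pairs)

# Example: (3, 80) dominates (2, 120), so only (3, 80) and (5, 400) survive,
# and the block bound is computed from those survivors with the current scorer.
block = prune_pairs([(3, 80), (2, 120), (5, 400)])
bound = bm25_upper_bound(block, idf=2.0, avg_dl=150.0)
```

Because only the surviving pairs are stored, the bound can be recomputed on the fly for whichever scoring function is configured at search time, which is exactly the flexibility the per-block impact scores of the original proposal would not allow.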
An interesting coda to this story is that academic researchers were exploring alternatives to per-block impact scores circa 2017, for exactly the same reason (to allow the scoring model to be defined at search time). For example, Macdonald and Tonellotto [10] showed how to derive tight approximate upper bounds for Block-Max WAND, based on work that dates back to 2011 [9]. Similarly, the recently released PISA research system stores flexible block-level metadata [11]. Unfortunately, the Lucene developers were not aware of these developments during their implementation.

The journey from MaxScore to Block-Max WAND concluded in March 2019, with the rollout of all these features in the version 8.0 release of Lucene. They are now the out-of-the-box defaults in the world's most popular search library. During the implementation of Block-Max WAND, performance improvements were quantified in terms of Lucene's internal benchmark suite, which showed a 3x to 7x improvement in query evaluation performance.

As part of a formal reproducibility effort, we present experiments that attempt to match, to the extent practical, the original conditions described by Ding and Suel [6]. According to the paper, experiments were conducted on the GOV2 web collection, on a randomly selected subset of 1000 queries from the TREC 2005 and 2006 efficiency tracks, which we were able to obtain from the authors. For their experiments, the inverted index was completely loaded into main memory and query evaluation latency was measured at retrieval depth ten. Our experiments were conducted with the Anserini IR toolkit, comparing v0.5.1, which depends on Lucene 7.6 and uses an optimized exhaustive OR query evaluation strategy [5], with v0.6.0, which depends on Lucene 8.0 and uses Block-Max WAND. We used Anserini's standard regression test settings on the different collections, as described on its homepage. Results represent averages over three trials on a warm cache. While the indexes were not explicitly loaded into memory, Lucene benefits from caching at the OS level. All experiments were conducted using a single thread on an otherwise idle server with dual Intel Xeon E5-2699 v4 processors and 1 TB RAM running RHEL (release 7.7).

Results are shown in Table 1, where the figures in the top three rows are copied from Table 1 in the original paper. It is interesting that Ding and Suel report a much larger increase in performance when comparing exhaustive OR to BMW (18x on TREC 2005 and 8x on TREC 2006) than the comparable conditions in Lucene show (a more modest improvement of around 3x). This is due to a more optimized implementation of exhaustive OR in Lucene, which, for example, implements block processing [5]. Interestingly, Ding and Suel report faster query evaluation in absolute terms, even on hardware that is much older: the differences include C++ vs. Java, as well as the simplicity of a research prototype vs. the realities of a fully-featured search library. Beyond implementation differences, Lucene must additionally compute the upper-bound scores per block from the stored (tf, dl) pairs on the fly.

We also report performance evaluations on two other standard test collections frequently used in academic information retrieval: ClueWeb09B and ClueWeb12-B13, with the same sets of queries. These results are shown in Table 2, where we report figures for different values of retrieval depth k, also averaged over three trials. These numbers are consistent with Fig. 7 in Ding and Suel's paper:
the performance of exhaustive OR drops modestly as depth k increases, but BMW performance degrades much more quickly. This is exactly as expected. Finally, we quantify the modest increase in indexing time due to the need to maintain (tf, dl) pairs in the inverted indexes, shown in Table 3 (averaged over three trials, using 44 threads in all cases). These experiments used Anserini's default regression settings on the respective collections, which build full positional indexes and also store the raw documents.

The story of Block-Max WAND in Lucene provides a case study of how an innovation that originated in academia made its way into the world's most widely used search library and achieved significant impact in the "real world" through hundreds of production deployments worldwide (if we consider the broader Lucene ecosystem, which includes systems such as Elasticsearch and Solr). As there are very few such successful case studies (the other prominent one being the incorporation of BM25 in Lucene), it is difficult to generalize these narratives into "lessons learned". However, here we attempt to offer a few observations about how academic research might achieve greater real-world impact.

In short, Block-Max WAND is in Lucene because the developers learned about Ding and Suel's work and decided to reimplement it. This is somewhat stating the obvious, but this fateful decision highlights the idiosyncratic nature of technology adoption. We could imagine alternatives where the Lucene developers had not come across the paper and developed a comparable solution in isolation, or where they knew about the paper and elected to take a different approach. In either case, the Lucene solution would likely differ from Block-Max WAND. This would be akin to convergent evolution in evolutionary biology, whereby different organisms independently evolve similar traits because they occupy similar environments. In such an "alternate reality", this paper would be comparing and contrasting different solutions to handling score outliers, not describing a reproducibility effort. To bring researchers and practitioners closer together, we recommend that the former be more proactive in "evangelizing" their innovations, and that the latter be more diligent in consulting the literature.

Eight years passed from the publication of the original paper (2011) until the release of Lucene that included Block-Max WAND (2019). The entire course of innovation was actually much longer if we trace the origins back to MaxScore (1995) and WAND (2003). One obvious question is: why did it take so long? There are many explanations, the most salient of which is the difference between a research prototype and a fully-featured search library that is already widely deployed. This decomposes into two related issues, the technical and the social.

From a technical perspective, supporting BMW required invasive changes to Lucene's index format and a host of related changes in scoring functions: for example, scores could no longer be negative, and implementations could no longer access arbitrary fields (which was an API change). These had to be staged incrementally. Concomitant with technical changes and backwards-compatibility constraints were a host of "social" changes, which required changing users' expectations about the behavior of the software. In short, BMW was not simply a drop-in replacement. For example, as discussed in Sect. 3, the hit count was no longer accurate, which required workarounds for applications that depended on that value.
Because such major changes can be somewhat painful, they need to be justified by the potential benefits. This means that only dramatic improvements really have any hope of adoption: multiple-fold, not marginal, performance gains. An interesting side effect is that entire generations of techniques might be skipped; in the case of Lucene, the implementation went directly from exhaustive OR to BMW, leapfrogging intermediate innovations such as MaxScore and WAND.

Aiming to achieve real-world impact with academic research is a worthy goal, and we believe that this case study represents an endorsement of efforts to better align research prototypes with production systems, as exemplified by Lucene-based projects like Anserini. If academic researchers are able to look ahead "down the road" to see how their innovations might benefit end applications, the path from the "ivory tower" to the "real world" might become more smoothly paved.

References:
- Proceedings of the 40th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2017)
- Lucene4IR: developing information retrieval evaluation resources using Lucene
- Efficient query evaluation using a two-level retrieval process
- Implementation of the SMART information retrieval system (Department of Computer Science TR)
- Space optimizations for total ranking
- Faster top-k document retrieval using block-max indexes
- Toward reproducible baselines: the open-source IR reproducibility challenge
- From puppy to maturity: experiences in developing Terrier
- Upper-bound approximations for dynamic pruning
- Upper bound approximation for BlockMaxWand
- PISA: performant indexes and search for academia
- Combining the language model and inference network approaches to retrieval
- Indri at TREC 2004: Terabyte track
- Terrier: a high performance and scalable information retrieval platform
- Efficient scoring in Lucene
- Open source information retrieval: a report on the SIGIR 2012 workshop
- Query evaluation: strategies and optimizations
- Yet another comparison of Lucene and Indri performance
- Anserini: enabling the use of Lucene for information retrieval research
- Anserini: reproducible ranking baselines using Lucene
- SIGIR06 workshop report: open source information retrieval systems (OSIR06). In: SIGIR Forum

Acknowledgments. This work was supported in part by the Natural Sciences and Engineering Research Council (NSERC) of Canada. We'd like to thank Craig Macdonald, Joel Mackenzie, Antonio Mallia, and Nicola Tonellotto for helpful discussions on the intricacies of computing flexible per-block score bounds, and Torsten Suel for providing us with the original queries used in their evaluations.

[cord-020843-cq4lbd0l] Almeida, Tiago; Matos, Sérgio: "Calling Attention to Passages for Biomedical Question Answering." Advances in Information Retrieval, 2020-03-24. DOI: 10.1007/978-3-030-45442-5_9.

Abstract: Question answering can be described as retrieving relevant information for questions expressed in natural language, possibly also generating a natural language answer. This paper presents a pipeline for document and passage retrieval for biomedical question answering built around a new variant of the DeepRank network model in which the recursive layer is replaced by a self-attention layer combined with a weighting mechanism.
This adaptation halves the total number of parameters and makes the network more suited for identifying the relevant passages in each document. The overall retrieval system was evaluated on the BioASQ tasks 6 and 7, achieving similar retrieval performance when compared to more complex network architectures.

Question answering (QA) is a subfield of information retrieval (IR) that specializes in producing or retrieving a single answer for a natural language question. QA has received growing interest since users often look for a precise answer to a question instead of having to inspect full documents [4]. Similarly, biomedical question answering has also gained importance given the amount of information scattered over large specialized repositories such as MEDLINE. Research on biomedical QA has been pushed forward by community efforts such as the BioASQ challenge [13], originating a range of different approaches and systems.

Recent studies on the application of deep learning methods to IR have shown very good results. These neural models are commonly subdivided into two categories based on their architecture. Representation-based models, such as the Deep Structured Semantic Model (DSSM) [5] or the Convolutional Latent Semantic Model (CLSM) [12], learn semantic representations of texts and score each query-document pair based on the similarity of their representations. On the other hand, models such as the Deep Relevance Matching Model (DRMM) [3] or DeepRank [10] follow an interaction-based approach, in which matching signals between query and document are captured and used by the neural network to produce a ranking score. The impact of neural IR approaches is also noticeable in biomedical question answering, as shown by the results of the most recent BioASQ challenges [9]. The top performing team in the document and snippet retrieval sub-tasks in 2017 [1], for example, used a variation of the DRMM [8] to rank the documents recovered by the traditional BM25 [11]. For the 2018 task, the same team extended their system with the inclusion of models based on BERT [2] and with joint training for document and snippet retrieval.

The main contribution of this work is a new variant of the DeepRank neural network architecture in which the recursive layer originally included in the final aggregation step is replaced by a self-attention layer followed by a weighting mechanism similar to the term gating layer of the DRMM. This adaptation not only halves the total number of network parameters, therefore speeding up training, but is also more suited for identifying the relevant snippets in each document. The proposed model was evaluated on the BioASQ dataset, as part of a document and passage (snippet) retrieval pipeline for biomedical question answering, achieving similar retrieval performance when compared to more complex network architectures. The full network configuration is publicly available at https://github.com/bioinformatics-ua/bioasq, together with code for replicating the results presented in this paper.

This section presents the overall retrieval pipeline and describes the neural network architecture proposed in this work for the document ranking step. The retrieval system follows the pipeline presented in Fig. 1, encompassing three major modules: fast retrieval, neural ranking, and snippet extraction.
The fast retrieval step is focused on minimizing the number of documents passed on to the computationally more demanding neural ranking module, while maintaining the highest possible recall. As in previous studies [1, 7], we adopted Elasticsearch (ES) with the BM25 ranking function as the retrieval mechanism. The documents returned by the first module are ranked by the neural network, which also directly provides to the following module the information needed for extracting relevant snippets. These modules are detailed in Sects. 2.1 and 2.2.

The network follows a similar architecture to the original version of DeepRank [10], as illustrated in Fig. 2. In particular, we build upon the best reported configuration, which uses a CNN in the measurement network and the reciprocal function as the position indicator. The inputs to the network are the query, a set of document passages aggregated by each query term, and the absolute position of each passage. For the remaining explanation, let us first define a query as a sequence of terms q = {u_0, u_1, ..., u_q}, where u_i is the i-th term of the query; a set of document passages aggregated by each query term as d(u_i) = {p_0, p_1, ..., p_P}, where p_j corresponds to the j-th passage with respect to the query term u_i; and a document passage as p_j = {v_0, v_1, ..., v_s}, where v_k is the k-th term of the passage. We chose to aggregate the passages by their respective query term at the input level, since it simplifies the neural network flow and implementation.

The detection network receives as input the query and the set of document passages and creates a similarity tensor (interaction matrix) S ∈ [-1, 1]^{q×s} for each passage, where each entry S_ij corresponds to the cosine similarity between the embeddings of the i-th query term and the j-th passage term.

The measurement network step is the same as in the original DeepRank model. It takes as input the previously computed tensors S and the absolute position of each passage and applies a 2D convolution followed by a global max pooling operation, to capture the local relevance present in each tensor S, as defined in Eq. 1. At this point, the set of document passages for each query term is represented by their respective vectors h, i.e., d(u_i) = {h_p0, h_p1, ..., h_pP}, where each vector h_pj encodes the local relevance captured by the M convolution kernels of size x × y, plus an additional feature corresponding to the position of the passage.

The next step uses a self-attention layer [6] to obtain an aggregation c_ui over the passages h_pj for each query term u_i, as defined in Eq. 2: c_ui = Σ_j a_pj · h_pj, where the weights a_pj, computed by a feed-forward network and converted to a probabilistic distribution using the softmax operation, represent the importance of each passage vector from the set d(u_i). The addition of this self-attention layer, instead of the recurrent layer present in the original architecture, allows using the attention weights, which are directly correlated with the local relevance of each passage, to identify important passages within documents. Moreover, this layer has around A × M parameters, compared to up to three times more in the GRU layer (approximately 3 × A × (A + M)), which in practice means reducing the overall number of network parameters to half. Finally, the aggregation network combines the vectors c_ui according to weights that reflect the importance of each individual query term u_i.
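Before turning to that query-term weighting, the following compact PyTorch sketch illustrates the passage-level self-attention of Eq. 2 as described above. The layer sizes, the two-layer scorer, and the class name are assumptions made for illustration; the authors' actual implementation is available in their repository.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PassageSelfAttention(nn.Module):
    """Aggregate passage vectors per query term with softmax attention (Eq. 2)."""
    def __init__(self, feat_dim: int, attn_dim: int = 64):
        super().__init__()
        self.scorer = nn.Sequential(          # small feed-forward scorer for each passage
            nn.Linear(feat_dim, attn_dim), nn.Tanh(), nn.Linear(attn_dim, 1)
        )

    def forward(self, h):                     # h: (query_terms, passages, feat_dim)
        scores = self.scorer(h).squeeze(-1)   # (query_terms, passages)
        a = F.softmax(scores, dim=-1)         # passage importance per query term
        c = torch.einsum("qp,qpf->qf", a, h)  # weighted sum: one vector per query term
        return c, a                           # attention weights are reused for snippet extraction
```

The returned attention weights are exactly the a_pj values mentioned above, which is what makes the model usable for snippet extraction in addition to document ranking.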
We chose to employ a weighting mechanism similar to the term gating layer in DRMM [3], which uses the query term embedding to compute its importance, as defined in Eq. 3. This option replaces the use of a trainable parameter for each vocabulary term, as in the original work, which is less suited for modelling a rich vocabulary as in the case of biomedical documents. The final aggregated vector c is then fed to a dense layer for computing the final ranking score.

Optimization. We used the pairwise hinge loss as the objective function, minimized with the Adadelta optimizer. In this perspective, the training data is viewed as a set of triples (q, d+, d-), composed of a query q, a positive document d+ and a negative document d-. Additionally, inspired by [14] and as successfully demonstrated by [16], we adopted a similar negative sampling strategy, where a negative document can be drawn from the following sets:
- Partially irrelevant set: irrelevant documents that share some matching signals with the query. More precisely, this corresponds to documents retrieved by the fast retrieval module but which do not appear in the training data as positive examples;
- Completely irrelevant set: documents not in the positive training instances and not sharing any matching signal with the query.

Passage extraction is accomplished by looking at the attention weights of the neural ranking model. As described, the proposed neural ranking model includes two attention mechanisms. The first one computes a local passage attention with respect to each query term, a_pi. The second is used to compute the importance of each query term, a_uk. Therefore, a global attention weight for each passage can be obtained from the product of these two terms, as shown in Eq. 4: a_g(k,i) = a_uk × a_pi.

This section presents the system evaluation results. We used the training data from the BioASQ 6b and 7b phase A challenges [13], containing 2251 and 2747 biomedical questions with the corresponding relevant documents, taken from the MEDLINE repository. The objective for a system is to retrieve the ten most relevant documents for each query, with the performance evaluated in terms of MAP@10 on five test sets containing 100 queries each. At first, a study was conducted to investigate the performance of the proposed neural ranking model. After that, the full system was compared against the results of systems submitted to the BioASQ 6 and 7 editions for the document retrieval task. Finally, we investigate whether the attention given to each passage is indeed relevant.

In the results, we compare two variants of DeepRank: BioDeepRank refers to the model with the modified aggregation network and weighting mechanism, and using word embeddings for the biomedical domain [15]; Attn-BioDeepRank refers to the final model that additionally replaces the recurrent layer by a self-attention layer.

Neural ranking models. We compared both neural ranking versions against BM25 in terms of MAP@10 and Recall@10, on a 5-fold cross-validation over the BioASQ training data. Table 1 summarizes the results. Both models successfully improved the BM25 ranking order, achieving an increase of around 0.14 in MAP and 0.31 in recall. Results of Attn-BioDeepRank, although lower, suggest that this version is at least nearly as effective at ranking the documents as the model that uses the recursive layer.

Biomedical document retrieval. We report results on the BioASQ 6b and BioASQ 7b document ranking tasks (Table 2).
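Returning briefly to the passage-extraction mechanism described above, the sketch below shows how the global weights of Eq. 4 can be turned into ranked snippets. Taking the maximum over query terms as the per-passage score is an aggregation choice assumed here for illustration; the paper itself only defines the per-term weights.

```python
import numpy as np

def extract_snippets(term_gates, passage_attn, passages, top_n=10):
    """
    term_gates:   (n_query_terms,)            softmax importance of each query term (a_u)
    passage_attn: (n_query_terms, n_passages) softmax attention per query term (a_p)
    passages:     list of n_passages strings  candidate snippets of the document
    """
    global_attn = term_gates[:, None] * passage_attn   # a_g(k, i) = a_u_k * a_p_i
    passage_score = global_attn.max(axis=0)            # assumed aggregation: best weight per passage
    ranked = np.argsort(-passage_score)[:top_n]
    return [(passages[i], float(passage_score[i])) for i in ranked]
```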
Regarding BioASQ 6b, it should be noted that the retrieved documents were evaluated against the final gold standard of the task, revised after re-evaluating the documents submitted by the participating systems. Since we expect that some of the retrieved documents would have been revised as true positives, the results presented can be considered a lower bound of the system's performance. For BioASQ 7b, the results shown are against the gold standard before the re-evaluation, since the final annotations were not available at the time of writing. In this dataset both systems achieved performance nearer to the best result, including a top result on batch 1.

Passage evaluation. Finally, we analysed whether the information used by the model for ranking the documents, as given by the attention weights, corresponded to relevant passages in the gold standard. For this, we calculated the precision of the passages, considering overlap with the gold standard, and evaluated how it related to the confidence assigned by the model. Interestingly, although the model is not trained with this information, the attention weights seem to focus on these relevant passages, as indicated by the results in Fig. 3 (quality of retrieved passages as a function of the confidence attributed by the model).

This paper describes a new neural ranking model based on the DeepRank architecture. Evaluated on a biomedical question answering task, the proposed model achieved similar performance to a range of other strong systems. We intend to further explore the proposed approach by considering semantic matching signals in the fast retrieval module, and by introducing joint learning for document and passage retrieval. The network implementation and code for reproducing these results are available at https://github.com/bioinformatics-ua/bioasq.

References:
- AUEB at BioASQ 6: document and snippet retrieval
- BERT: pre-training of deep bidirectional transformers for language understanding
- A deep relevance matching model for ad-hoc retrieval
- Natural language question answering: the view from here
- Learning deep structured semantic models for web search using clickthrough data
- A structured self-attentive sentence embedding
- MindLab neural network approach at BioASQ 6B
- Deep relevance ranking using enhanced document-query interactions
- Results of the sixth edition of the BioASQ challenge
- DeepRank
- The probabilistic relevance framework: BM25 and beyond
- A latent semantic model with convolutional-pooling structure for information retrieval
- An overview of the BioASQ large-scale biomedical semantic indexing and question answering competition
- Learning fine-grained image similarity with deep ranking
- BioWordVec, improving biomedical word embeddings with subword information and MeSH
- A hierarchical attention retrieval model for healthcare question answering

Acknowledgments. This work was partially supported by the European Regional Development Fund (ERDF) through the COMPETE 2020 Operational Programme, and by national funds through FCT - Foundation for Science and Technology, projects PTDC/EEI-ESS/6815/2014 and UID/CEC/00127/2019.

[cord-020806-lof49r72] Landin, Alfonso; Parapar, Javier; Barreiro, Álvaro: "Novel and Diverse Recommendations by Leveraging Linear Models with User and Item Embeddings." Advances in Information Retrieval, 2020-03-24. DOI: 10.1007/978-3-030-45442-5_27.

Abstract: Nowadays, item recommendation is an increasing concern for many companies.
Users tend to be more reactive than proactive when solving their information needs. Recommendation accuracy has become the most studied aspect of the quality of the suggestions. However, novel and diverse suggestions also contribute to user satisfaction. Unfortunately, it is common to harm those two aspects when optimizing recommendation accuracy. In this paper, we present EER, a linear model for the top-N recommendation task, which takes advantage of user and item embeddings to improve novelty and diversity without harming accuracy.

In recent years, the way users access services has shifted from a proactive approach, where the user actively looks for the information, to one where users take a more passive role and content is suggested to them. Within this transformation, recommender systems have played a pivotal role, enabling an increase in user engagement and revenue.

Recommender systems are usually classified into three families [1]. The first approach, content-based systems, uses item metadata to produce recommendations [7]. The second family, collaborative filtering, is composed of systems that exploit the past interactions of the users with the items to compute the recommendations [10, 17]. These interactions can take several forms, such as ratings, clicks, or purchases. Finally, hybrid approaches combine both to generate suggestions. Collaborative filtering (CF) systems can be divided into memory-based systems, which use the information about these interactions directly to compute the recommendations, and model-based systems, which build models from this information that are later used to make the recommendations.

In this paper, we present a CF model to address the top-N recommendation task [4]. The objective of a top-N recommender is to produce a ranked list of items for each user. These systems can be evaluated using traditional IR metrics over the rankings [2, 4]. In that evaluation approach, accuracy is usually the most important metric and has been the focus of previous research and competitions [3]. Nevertheless, other properties are also important, such as diversity and novelty [8, 13]. Diversity is the ability of the system to make recommendations that include items equitably from the whole catalog, which is usually desired by vendors [5, 22]. On the other hand, novelty is the capacity of the system to produce unexpected recommendations. This characteristic is a proxy for serendipity, which is associated with higher user engagement and satisfaction [6]. All these properties, accuracy, diversity and novelty, are linked to the extent that raising accuracy usually lowers the best achievable results in the other properties [11].

In this paper, we propose a method to augment an existing recommendation linear model to make more diverse and novel recommendations, while maintaining similar accuracy results. We do so by making use of user and item embeddings that are able to capture non-linear relations thanks to the way they are obtained [21]. Experiments conducted on three datasets show that our proposal outperforms the original model in both novelty and diversity while maintaining similar levels of accuracy. With reproducibility in mind, we also make the software used for the experiments publicly available.

In this section, we introduce FISM, the existing recommendation method we augment in our proposal. After that, we introduce Prefs2Vec, the user and item embedding model used to make this enhancement.
FISM is a state-of-the-art model-based recommender system proposed by Kabbur et al. [9]. This method learns a low-rank factorization of an item-item similarity matrix, which is later used to compute the scores to make the predictions. This method is an evolution of a previous method, SLIM [16], that learns this matrix without factorizing it. Factorizing the similarity matrix allows FISM to overcome SLIM's limitation of not being able to learn a similarity other than zero for items that have never both been rated by at least one user. As a side effect of this factorization, it lowers the space complexity of the similarity model from quadratic in the number of items, O(|I|^2), to linear, O(k·|I|), where k is the rank of the factorization. It also drops the non-negativity constraint and the constraint that the diagonal of the similarity matrix has to contain zeroes. As a consequence of these changes, the optimization problem can be solved using regular gradient descent algorithms, instead of the coordinate gradient descent used by SLIM, leading to faster training times.

Embedding models allow transforming high-dimensional and sparse vector representations, such as the classical one-hot and bag-of-words encodings, into a space with much lower dimensionality. In particular, word embedding models, which produce fixed-length dense representations, have proven to be more effective in several NLP tasks [14, 15, 19]. Recently, Prefs2Vec [21], a new embedding model for obtaining dense user and item representations, adapted from the CBOW model [14], has shown that these embeddings can be useful for the top-N recommendation task. When used with a memory-based recommender, they are more efficient than the classical representation [21]. The results show that not only can they improve the accuracy of the results, but also their novelty and diversity. The versatility of this embedding model, in particular of the underlying neural model and the way it is trained, is also shown in [12], where the prediction capabilities of the neural model are used directly in a probabilistic recommender.

In this section, we present our method to enhance diversity and novelty in recommendation, explaining how the model is trained and used to produce recommendations. First, we introduce how the product of user and item embeddings (based on Prefs2Vec) can be used to make recommendations, which is later used as part of the proposal.

As representations of users and items in a space with much lower dimensionality, Prefs2Vec embeddings can be viewed as latent vectors. However, there is no sense in multiplying item and user vectors directly, as they have different bases even when they have the same dimensions. This is a consequence of learning the item and user representations independently, of how Prefs2Vec initializes the parameters of the model, and of how the training is performed. However, the product becomes meaningful if we can compute a change-of-basis matrix T ∈ R^{d×d} to transform the user embeddings into the item embedding space. This way we can calculate an estimated ratings matrix R̂ using the simple matrix multiplication

R̂ = E T F^T,    (1)

where E ∈ R^{|U|×d} is the matrix of user embeddings and F ∈ R^{|I|×d} is the matrix of item embeddings, with one embedding in each row. The transformation matrix T is learned by solving the l2-regularized optimization problem

minimize_T  ||R - E T F^T||_F^2 + β_E ||T||_F^2,    (2)

where R is the ratings matrix and β_E is the regularization hyperparameter. This problem can be solved using gradient descent algorithms. Once the transformation matrix has been trained, recommendations can be produced by computing the estimated ratings matrix R̂ as described in Eq. 1.
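A minimal NumPy sketch of this embedding-product recommender (Eqs. 1 and 2) is given below: the change-of-basis matrix T is learned by full-batch gradient descent and then used to score every user-item pair. The optimizer, learning rate, and iteration count are assumptions for illustration; the authors' released software should be consulted for the exact training procedure.

```python
import numpy as np

def fit_elp(R, E, F, beta_e=1.0, lr=1e-3, iters=500):
    """Learn T mapping user embeddings E into the item-embedding space of F (Eq. 2)."""
    d = E.shape[1]
    T = np.zeros((d, d))
    for _ in range(iters):
        err = R - E @ T @ F.T                 # reconstruction residual
        grad = -2 * E.T @ err @ F + 2 * beta_e * T
        T -= lr * grad                        # plain full-batch gradient descent
    return T

def recommend_elp(R, E, F, T, n=10):
    scores = E @ T @ F.T                      # estimated ratings matrix (Eq. 1)
    scores[R > 0] = -np.inf                   # exclude items already rated by each user
    return np.argsort(-scores, axis=1)[:, :n] # top-n item indices per user
```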
Recommendations are made to each user by sorting the corresponding row and picking the top-N items not already rated by the user. We dubbed this recommender ELP, short for Embedding Linear Product, and we present its performance in Table 3 in the experiments section.

We have seen that linear methods like FISM can obtain good accuracy figures. On the other side, as the results in Table 3 show, ELP is able to provide good figures in novelty and diversity, thanks to the embedding model capturing non-linear relations between users and items. We propose to capture both properties by joining the models together in the EER model (Embedding Enhanced Recommender). We choose the RMSE variant of FISM as it matches the loss used in ELP. We also use a trainable scalar parameter α to join the models, as the scores obtained from each recommender need not be on the same scale. This results in the following equation to calculate the estimated ratings matrix:

R̂ = R P Q + α E T F^T,    (3)

where P ∈ R^{|I|×k} and Q ∈ R^{k×|I|} are the low-rank factorization of the item-item similarity matrix. The parameters of the model, P, Q, T and α, are learned by solving the joint l2-regularized optimization problem resulting from the previous joint equation, using standard gradient descent algorithms:

minimize_{P,Q,T,α}  ||R - (R P Q + α E T F^T)||_F^2 + β (||P||_F^2 + ||Q||_F^2) + β_E ||T||_F^2.    (4)

Similar to the case of ELP, once the parameters are learned, we make the recommendations by calculating the estimated ratings matrix using Eq. 3, sorting each row, and picking the top-N items not yet rated by the user corresponding to that row.

In this section, we introduce the datasets used to perform our experiments, the evaluation protocol followed, and the metrics used. After that, we present the results of our experiments. To evaluate our proposal, we conducted a series of experiments on several datasets from different domains: the MovieLens 20M dataset, a movie dataset; the book dataset LibraryThing; and the BeerAdvocate dataset, consisting of beer reviews. Table 1 shows statistics of each collection.

In order to perform the experiments, the datasets were divided randomly into train and test sets. The training dataset consisted of 80% of the ratings of each user, with the remaining 20% forming the test dataset. We follow the TestItems evaluation methodology [2] to evaluate the performance. To assess the accuracy of the rankings, we use normalized Discounted Cumulative Gain (nDCG), using the standard formulation as described in [23], with the ratings in the test set as graded relevance judgments. We considered only items with a rating of 4 or more, on a 5-point scale, to be relevant for evaluation purposes. We also measured the diversity of the recommendations using the complement of the Gini index [5]. Finally, we use the Mean Self-Information (MSI) [24] to assess the novelty of the recommendations. All the metrics are evaluated at cut-off 100 because this cut-off has been shown to be more robust with respect to the sparsity and popularity biases than shallower cut-offs [20]. We perform a Wilcoxon test [18] to assess the statistical significance of the improvements regarding nDCG@100 and MSI@100, with p < 0.01. We cannot apply it to the Gini index because we are using a paired test and Gini is a global metric. Results in Table 3 are annotated with their statistical significance. We performed a grid search over the hyperparameters of the original model and our proposal, tuning them to maximize nDCG@100.
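The joint scoring of Eq. 3 can be sketched as follows, reusing the ELP part from the previous snippet. Only inference with already-learned parameters is shown; the joint training of P, Q, T, and α (Eq. 4) is omitted, and the function names are assumptions for illustration.

```python
import numpy as np

def eer_scores(R, E, F, P, Q, T, alpha):
    """Estimated ratings of the joint EER model (Eq. 3)."""
    fism_part = R @ P @ Q          # linear item-item part: R (P Q)
    elp_part = E @ T @ F.T         # embedding part mapped through the change of basis
    return fism_part + alpha * elp_part

def recommend_eer(R, scores, n=10):
    scores = scores.copy()
    scores[R > 0] = -np.inf        # never recommend items the user already rated
    return np.argsort(-scores, axis=1)[:, :n]
```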
Although we aim to increase diversity and novelty, we want the recommendations to be effective, which is why the tuning is done over accuracy. For the parameters of the Prefs2Vec model, we took those that performed best in [21]. For reproducibility's sake, the values of the best hyperparameters for each collection can be consulted in Table 2.

Table 2. Best values of the hyperparameters for nDCG@100 for FISM and our proposals EER and ELP.

Model | MovieLens 20M              | LibraryThing                | BeerAdvocate
FISM  | β = 1, k = 1000            | β = 1000, k = 1000          | β = 50, k = 1000
ELP   | β_E = 0.1                  | β_E = 10                    | β_E = 10
EER   | β = 0.1, β_E = 1, k = 1000 | β = 500, β_E = 10, k = 1000 | β = 10, β_E = 1, k = 1000

Table 3 shows the values of nDCG@100, Gini@100 and MSI@100 for FISM, EER and ELP. The results show that EER outperforms the baseline (FISM) on both novelty and diversity. It also surpasses it on accuracy on the MovieLens 20M and LibraryThing datasets. In the case of diversity, we can see important improvements. ELP, on the other hand, obtains the best diversity and novelty values, but this comes with a big reduction in accuracy. It is common in the field of recommender systems for methods with lower accuracy to have higher values in diversity and novelty. We believe that the ability of the embeddings to find non-linear relationships contributes to the model's novelty and diversity. This property allows it, for example, to discover relationships between popular and not so popular items, leading to better diversity. Moreover, the integration in the linear model allows it to keep its advantage in terms of accuracy, clearly surpassing the use of embeddings in isolation (ELP).

In this paper, we presented EER, a method to enhance an existing recommendation algorithm to produce recommendations that are both more diverse and more novel, while maintaining similar levels of accuracy. This is done by combining two models: a linear one that is able to obtain good levels of accuracy, and a model based on an embedding technique that extracts non-linear relationships, allowing it to produce more diverse and novel recommendations. As future work, we plan to apply the same technique to other recommender systems, examining whether it can be applied in general to enhance recommendations, independently of the base algorithm chosen for the task. We also envision studying the effects that varying the value of α in Eq. 3 has on the recommendations.
References:
- Fab: content-based, collaborative recommendation
- Precision-oriented evaluation of recommender systems
- The Netflix Prize
- Performance of recommender algorithms on top-N recommendation tasks
- Blockbuster culture's next rise or fall: the impact of recommender systems on sales diversity
- Beyond accuracy: evaluating recommender systems by coverage and serendipity
- Semantics-aware content-based recommender systems
- Recommender Systems Handbook
- Evaluating collaborative filtering recommender systems
- FISM: factored item similarity models for top-N recommender systems
- Advances in collaborative filtering
- When diversity met accuracy: a story of recommender systems
- PRIN: a probabilistic recommender with item priors and neural models
- Being accurate is not enough: how accuracy metrics have hurt recommender systems
- Efficient estimation of word representations in vector space
- Distributed representations of words and phrases and their compositionality
- SLIM: sparse linear methods for top-N recommender systems
- A comprehensive survey of neighborhood-based recommendation methods
- Using score distributions to compare statistical significance tests for information retrieval evaluation
- GloVe: global vectors for word representation
- On the robustness and discriminative power of information retrieval metrics for top-N recommendation
- Collaborative filtering embeddings for memory-based recommender systems
- Item-based relevance modelling of recommendations for getting rid of long tail products. Knowl.-Based Syst.
- A theoretical analysis of NDCG ranking measures
- Solving the apparent diversity-accuracy dilemma of recommender systems

[cord-020848-nypu4w9s] Morris, David; Müller-Budack, Eric; Ewerth, Ralph: "SlideImages: A Dataset for Educational Image Classification." Advances in Information Retrieval, 2020-03-24. DOI: 10.1007/978-3-030-45442-5_36.

Abstract: In the past few years, convolutional neural networks (CNNs) have achieved impressive results in computer vision tasks, which however mainly focus on photos with natural scene content. Besides, non-sensor-derived images such as illustrations, data visualizations, figures, etc. are typically used to convey complex information or to explore large datasets. However, this kind of image has received little attention in computer vision. CNNs and similar techniques use large volumes of training data. Currently, many document analysis systems are trained in part on scene images due to the lack of large datasets of educational image data. In this paper, we address this issue and present SlideImages, a dataset for the task of classifying educational illustrations. SlideImages contains training data collected from various sources, e.g., Wikimedia Commons and the AI2D dataset, and test data collected from educational slides. We have reserved all the actual educational images as a test dataset in order to ensure that approaches using this dataset generalize well to new educational images, and potentially other domains. Furthermore, we present a baseline system using a standard deep neural architecture and discuss dealing with the challenge of limited training data.

Convolutional neural networks (CNNs) are making great strides in computer vision, driven by large datasets of annotated photos, such as ImageNet [1]. Many images relevant for information retrieval, such as charts, tables, and diagrams, are created with software rather than through photography or scanning.
there are several applications in information retrieval for a robust classifier of educational illustrations. search tools might directly expose filters by predicted label, natural language systems could choose images by type based on what information a user is seeking. further analysis systems could be used to extract more information from an image to be indexed based on its class. in this case, we have classes such as pie charts and x-y graphs that indicate what type of information is in the image (e.g., proportions, or the relationship of two numbers) and how it is symbolized (e.g., angular size, position along axes). most educational images are created with software and are qualitatively different from photos and scans. neural networks designed and trained to make sense of the noise and spatial relationships in photos are sometimes suboptimal for born-digital images and educational images in general. educational images and illustrations are under-served in training datasets and challenges. competitions such as the contest on robust reading for multi-type web images [2] and icdar detext [3] have shown that these tasks are difficult and unsolved. research on text extraction such as morris et al. [4] and nayef and ogier [5] has shown that even noiseless born-digital images are sometimes better analyzed with neural nets than with handcrafted features and heuristics. born-digital and educational images need further benchmarks on challenging information retrieval tasks in order to test generalization. in this paper, we introduce slideimages, a dataset which targets images from educational presentations. most of these educational illustrations are created with diverse software, so the same symbols are drawn in different ways in different parts of the image. as a result, we expect that effective synthetic datasets will be hard to create, and methods effective on slideimages will generalize well to other tasks with similar symbols. slideimages contains eight classes of image types (e.g. bar charts and x-y plots) and a class for photos. the labels we have created were made with information extraction for image summarization in mind. in the rest of this paper, we discuss related work in sect. 2, details about our dataset and baseline method in sect. 3, results of our baseline method in sect. 4, and conclude with a discussion of potential future developments in sect. 5. prior information retrieval publications used or could use document figure classification. charbonnier et al. [6] built a search engine with image type filters. aletras and mittal [7] automatically label topics in photos. kembhavi et al.'s [8] diagram analysis assumes the input figure is a diagram. hiippala and orekhova extended that dataset by annotating it in terms of relational structure theory, which implies that the same visual features communicate the same semantic relationships. de herrera et al. [9] seek to classify image types to filter their search for medical professionals. we intend to use document figure classification as a first step in automatic educational image summarization applications. a similar idea is followed by morash et al. [10] , who built one template for each type of image, then manually classified images and filled out the templates, and suggested automating the steps of that process. moraes et al. [11] mentioned the same idea for their sight (summarizing information graphics textually) system. a number of publications on document image classification such as afzal et al. [12] and harley et al. 
[13] use the rvl-cdip (ryerson vision lab complex document information processing) dataset, which covers scanned documents. while document scans and born-digital educational illustrations have materially different appearance, these papers show that the utility of deep neural networks is not limited to scene image tasks (fig. 1) . a classification dataset of scientific illustrations was created for the noa project [14] . however, their dataset is not publicly available, and does not draw as many distinctions between types of educational illustrations. jobin et al.'s docfigure [15] consists of 28 different categories of illustrations extracted from scientific publications totaling 33,000 images. techniques that work well on docfigure [15] do not generalize to the educational illustrations in our use case scenarios (as we also show in sect. 4.2). different intended uses or software cause sufficient differences in illustrations that a dataset of specifically educational illustrations is needed. cnns and related techniques are heavily data driven. an approach must consist of both an architecture and optimization technique, but also the data used for that optimization. in our case, we consider the dataset our main contribution. when building our taxonomy, we have chosen classes such that one class would have the same types of salient features, and appropriate summaries would also be similar in structure. our classes are also all common in educational materials. beyond the requirements of our taxonomy, our datasets needed to be representative of common educational illustrations in order to fit real-world applications, and legally shareable to promote research on educational image classification. educational illustrations are created by a variety of communities with varying expertise, techniques, and tools, so choosing a dataset from one source may eliminate certain variables in educational illustration. to identify these variables, we kept our training and test data sources separate. we assembled training and validation datasets from various sources of open access illustrations. bar charts, x-y plots, maps, photos, pie charts, slide images, table images, and technical drawings were manually selected by a student assistant (supported by the main author) using the wikimedia commons image search for related terms. we manually selected graph diagrams, which we also call node-edge diagrams or "structured diagrams," from the kembhavi et al. [8] allenai diagram understanding (ai2d) dataset; not all ai2d images contain graph edges [8] . the training dataset of slideimages consists of 2,938 images and is intended for fine-tuning cnns, not training from scratch. the slideimages test set is derived from a snapshot of slidewiki open educational resource platform (https://slidewiki.org/) datastore obtained in 2018. from that snapshot, two annotators manually selected and labeled 691 images. our data are available at our code repository: https://github.com/david-morris/slideimages/. the slideimages training dataset is small compared to datasets like imagenet [1] , with over 14 million images, rvl-cdip [13] with 400,000 images, or even docfigure [15] with 33,000 images. much of our methodology is shaped by needing to confront the challenges of a small dataset. in particular, we aim to avoid overfitting: the tendency of a classifier to identify individual images and patterns specific to the training set rather than the desired semantic concepts. 
for our pre-training dataset, a large, diverse dataset is required that contains a large proportion of educational and scholarly images. we pre-trained on a dataset of almost 60,000 images labeled by sohmen et al. [6] (noa dataset), provided by the authors on request. the images are categorized as composite images, diagrams, medical imaging, photos, or visualizations/models. to mitigate overfitting, we used data augmentation: distorting an image while keeping relevant traits. we used image stretching, brightness scaling, zooming, and color channel shifting as shown in our source code. we also added dropout with a rate of 0.1 on the extracted features before the fully connected and output layers. we used similar image augmentation for pre-training and training. we use mobilenetv2 [16] as our network architecture. we chose mobilenetv2 as a compromise between a small number of parameters and performance on imagenet. intuitively, a smaller parameter space implies a model with more bias and lower variance, which is better for smaller datasets. we initialized our weights from an imagenet model and pre-trained for a further 40 epochs with early stopping on the noa dataset using the adam (adaptive moment estimation) [17] optimizer. this additional pre-training was intended to cause the lower levels of the network to extract more features specific to born-digital images. we then trained for 40 epochs with adam and a learning rate schedule. our schedule drops the learning rate by a factor of 10 at the 15th and 30th epochs. our implementation is available at https://github.com/david-morris/slideimages/. we have performed two experiments, in order to show that this dataset represents a meaningful improvement over existing work, and to establish a baseline. because our classes are unbalanced, we report summary statistics as accuracy averages of each class weighted by the number of instances per class. we set a baseline for our dataset with the classifier described in sect. 3.2. the confusion matrix in fig. 2 shows that misclassifications do tend towards a few types of errors, but none of the classes have collapsed. while certain classes are likely to be misclassified as another specific class (such as structured diagrams as slides), those relationships do not happen in reverse, and a correct classification is more likely. figure 2 shows that our baseline leaves room for improvement, and our test set helps to identify challenges in this task. viewing individual classification errors also highlighted a few problems with our training data. the related docfigure dataset covers similar images and has much more data than slideimages. to justify slideimages, we have created a head-to-head comparison of classifiers trained in the same way (as described in sect. 3.2) on the slideimages and docfigure datasets. all the slideimages classes except slides have an equivalent in docfigure. we have shown the reduction in the data used, and the relative sizes of the datasets, in table 1. the head-to-head datasets contain only the matching classes, and in the case of the docfigure dataset, the original test set has been split into validation and test sets. after obtaining the two trained networks, we have tested each network on both the matching test set and the other test set.
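the training setup described above (mobilenetv2 initialized from imagenet, dropout of 0.1 on the extracted features, adam, and a step learning-rate schedule) can be sketched roughly as follows in keras; the augmentation ranges, image size, hidden-layer width, initial learning rate and directory layout are assumptions, since these details are only given in the authors' source code.

```python
# A minimal sketch of the fine-tuning setup described above (Keras/TensorFlow).
# Exact augmentation ranges, image size, hidden width and class count are assumptions.
import tensorflow as tf
from tensorflow.keras import layers, models, optimizers, applications, callbacks
from tensorflow.keras.preprocessing.image import ImageDataGenerator

NUM_CLASSES = 9          # eight illustration classes plus photos
IMG_SIZE = (224, 224)

# Data augmentation: stretching, brightness scaling, zooming, channel shifting.
train_gen = ImageDataGenerator(
    rescale=1.0 / 255,
    shear_range=0.1,              # "stretching" (assumed range)
    brightness_range=(0.8, 1.2),  # brightness scaling (assumed range)
    zoom_range=0.2,
    channel_shift_range=20.0,
)

base = applications.MobileNetV2(include_top=False, weights="imagenet",
                                input_shape=IMG_SIZE + (3,), pooling="avg")
model = models.Sequential([
    base,
    layers.Dropout(0.1),                      # dropout on the extracted features
    layers.Dense(200, activation="relu"),     # assumed fully connected width
    layers.Dense(NUM_CLASSES, activation="softmax"),
])

def schedule(epoch, lr):
    # Drop the learning rate by a factor of 10 at epochs 15 and 30.
    return lr * 0.1 if epoch in (15, 30) else lr

model.compile(optimizer=optimizers.Adam(1e-4),
              loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(train_gen.flow_from_directory("slideimages/train", target_size=IMG_SIZE),
#           epochs=40, callbacks=[callbacks.LearningRateScheduler(schedule)])
```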
although we were unable to reproduce the vgg-v baseline used by jobin et al., we used a linear svm with vgg-16 features and achieved comparable results on the full docfigure dataset (90% macro average compared to their 88.96% with a fully neural feature extractor). the results ( table 2) show that slideimages is a more challenging and potentially more general task. the net trained on slideimages did even better on the docfigure test set than on the slideimages test set. despite having a different source and approximately a fifth of the size of the docfigure dataset, the net trained on slideimages training set was better on our test set. in this paper, we have presented the task of classifying educational illustrations and images in slides and introduced a novel dataset slideimages. the classification remains an open problem despite our baseline and represents a useful task for information retrieval. we have provided a test set derived from actual educational illustrations, and a training set compiled from open access images. finally, we have established a baseline system for the classification task. other potential avenues for future research include experimenting with the docfigure dataset in the pre-training and training phases, and experimenting with text extraction for multimodal classification. imagenet: a large-scale hierarchical image database icpr2018 contest on robust reading for multi-type web images icdar2017 robust reading challenge on text extraction from biomedical literature figures (detext) a neural approach for text extraction from scholarly figures semantic text detection in born-digital images via fully convolutional networks noa: a search engine for reusable scientific images beyond the life sciences labeling topics with images using a neural network a diagram is worth a dozen images semi-supervised learning for image modality classification guiding novice web workers in making image descriptions using templates evaluating the accessibility of line graphs through textual summaries for visually impaired users cutting the error by half: investigation of very deep cnn and advanced training strategies for document image classification evaluation of deep convolutional nets for document image classification and retrieval figures in scientific open access publications docfigure: a dataset for scientific document figure classification 2018 ieee conference on computer vision and pattern recognition. cvpr 2018 adam: a method for stochastic optimization acknowledgement. this work is financially supported by the german federal ministry of education and research (bmbf) and european social fund (esf) (inclu-siveocw project, no. 01pe17004). key: cord-020899-d6r4fr9r authors: doinychko, anastasiia; amini, massih-reza title: biconditional generative adversarial networks for multiview learning with missing views date: 2020-03-17 journal: advances in information retrieval doi: 10.1007/978-3-030-45439-5_53 sha: doc_id: 20899 cord_uid: d6r4fr9r in this paper, we present a conditional gan with two generators and a common discriminator for multiview learning problems where observations have two views, but one of them may be missing for some of the training samples. this is for example the case for multilingual collections where documents are not available in all languages. some studies tackled this problem by assuming the existence of view generation functions to approximately complete the missing views; for example machine translation to translate documents into the missing languages. 
these functions generally require an external resource to be set and their quality has a direct impact on the performance of the learned multiview classifier over the completed training set. our proposed approach addresses this problem by jointly learning the missing views and the multiview classifier using a tripartite game with two generators and a discriminator. each of the generators is associated to one of the views and tries to fool the discriminator by generating the other missing view conditionally on the corresponding observed view. the discriminator then tries to identify if for an observation, one of its views is completed by one of the generators or if both views are completed along with its class. our results on a subset of reuters rcv1/rcv2 collections show that the discriminator achieves significant classification performance; and that the generators learn the missing views with high quality without the need of any consequent external resource. we address the problem of multiview learning with generative adversarial networks (gans) in the case where some observations may have missing views without there being an external resource to complete them. this is a typical situation in many applications where different sources generate different views of samples unevenly; like text information present in all wikipedia pages while images are more scarce. another example is multilingual text classification where documents are available in two languages and share the same set of classes while some are just written in one language. previous works supposed the existence of view generating functions to complete the missing views before deploying a learning strategy [2] . however, the performance of the global multiview approach is biased by the quality of the generating functions which generally require external resources to be set. the challenge is hence to learn an efficient model from the multiple views of training data without relying on an extrinsic approach to generate altered views for samples that have missing ones. in this direction, gans provide a propitious and broad approach with a high ability to seize the underlying distribution of the data and create new samples [11] . these models have been mostly applied to image analysis and major advances have been made on generating realistic images with low variability [7, 15, 16] . gans take their origin from the game theory and are formulated as a two players game formed by a generator g and a discriminator d. the generator takes a noise z and produces a sample g(z) in the input space, on the other hand the discriminator determines whenever a sample comes from the true distribution of the data or if it is generated by g. other works included an inverse mapping from the input to the latent representation, mostly referred to as bigans, and showed the usefulness of the learned feature representation for auxiliary discriminant problems [8, 9] . this idea paved the way for the design of efficient approaches for generating coherent synthetic views of an input image [6, 14, 21] . in this work, we propose a gan based model for bilingual text classification, called cond 2 gans, where some training documents are just written in one language. the model learns the representation of missing versions of bilingual documents jointly with the association to their respective classes, and is composed of a discriminator d and two generators g 1 and g 2 formulated as a tripartite game. 
for a given document with a missing version in one language, the corresponding generator induces the latter conditionally on the observed one. the training of the generators is carried out by minimizing a regularized version of the cross-entropy measure proposed for multi-class classification with gans [19] in a way to force the models to generate views such that the completed bilingual documents will have high class assignments. at the same time, the discriminator learns the association between documents and their classes and distinguishes between observations that have their both views and those that got a completed view by one of the generators. this is achieved by minimizing an aggregated cross-entropy measure in a way to force the discriminator to be certain of the class of observations with their complete views and uncertain of the class of documents for which one of the versions was completed. the regularization term in the objectives of generators is derived from an adapted feature matching technique [17] which is an effective way for preventing from situations where the models become unstable; and which leads to fast convergence. we demonstrate that generated views allow to achieve state-of-the-art results on a subset of reuters rcv1/rcv2 collections compared to multiview approaches that rely on machine translation (mt) for translating documents into languages in which their versions do not exist; before training the models. importantly, we exhibit qualitatively that generated documents have meaningful translated words bearing similar ideas compared to the original ones; and that, without employing any large external parallel corpora to learn the translations as it would be the case if mt were used. more precisely, this work is the first to: -propose a new tripartite gan model that makes class prediction along with the generation of high quality document representations in different input spaces in the case where the corresponding versions are not observed (sect. 3.2); -achieve state-of-the art performance compared to multiview approaches that rely on external view generating functions on multilingual document classification; and which is another challenging application than image analysis which is the domain of choice for the design of new gan models (sect. 4.2); -demonstrate the value of the generated views within our approach compared to when they are generated using mt (sect. 4.2). multiview learning has been an active domain of research these past few years. many advances have been made on both theoretic and algorithmic sides [5, 12] . the three main families of techniques for (semi-)supervised learning are (kernel) canonical correlation analysis (cca), multiple kernel learning (mkl) and co-regularization. cca finds pairs of highly correlated subspaces between the views that is used for mapping the data before training, or integrated in the learning objective [3, 10] . mkl considers one kernel per view and different approaches have been proposed for their learning. in one of the earliest work, [4] proposed an efficient algorithm based on sequential minimization techniques for learning a corresponding support vector machine defined over a convex nonsmooth optimization problem. co-regularization techniques tend to minimize the disagreement between the single-view classifiers over their outputs on unlabeled examples by adding a regularization term to the objective function [18] . 
some approaches have also tackled the tedious question of combining the predictions of the view specific classifiers [20] . however all these techniques assume that views of a sample are complete and available during training and testing. recently, many other studies have considered the generation of multiple views from a single input image using gans [14, 21, 23] and have demonstrated the intriguing capacity of these models to generate coherent unseen views. the former approaches rely mostly on an encoder-encoder network to first map images into a latent space and then generate their views using an inverse mapping. this is a very exciting problem, however, our learning objective differs from these approaches as we are mostly interested in the classification of muti-view samples with missing views. the most similar work to ours that uses gans for multiview classification is probably [6] . this approach generates missing views of images in the same latent space than the input image, while cond 2 gans learns the representations of the missing views in their respective input spaces conditionally on the observed ones which in general are from other feature spaces. furthermore, cond 2 gans benefits from low complexity and stable convergence which has been shown to be lacking in the previous approach. another work which has considered multiview learning with incomplete views, also for document classification, is [2] . the authors proposed a rademacher complexity bounds for a multiview gibbs classifier trained on multilingual collections where the missing versions of documents have been generated by machine translation systems. their bounds exhibit a term corresponding to the quality of the mt system generating the views. the bottleneck is that mt systems depend on external resources, and they require a huge amount of parallel collections containing documents and their translations in all languages of interest for their tuning. for rare languages, this can ultimately affect the performance of the learning models. regarding these aspects our proposed approach differs from all the previous studies, as we do not suppose the existence of parallel training sets nor mt systems to generate the missing versions of the training observations. in the following sections, we first present the basic definitions which will serve to our problem setting, and then the proposed model for multiview classification with missing views. we consider multiclass classification problems, where a bilingual document is defined as a sequence x = (x 1 , x 2 ) ∈ x that belongs to one and only one class y ∈ y = {0, 1} k . the class membership indicator vector y = (y k ) 1≤k≤k , of each bilingual document, has all its components equal to 0 except the one that indicates the class associated with the example which is equal to one. here we suppose that x = ( following the conclusions of the co-training study [5] , our framework is based on the following main assumption: observed views are not completely correlated, and are equally informative. furthermore, we assume that each example (x, y) is identically and independently distributed (i.i.d.) according to a fixed yet unknown distribution d over x ×y, and that at least one of its views is observed. additionally, we suppose to have access to a training set denotes the subset of training samples with their both complete views and is the subset of training samples with their second (respectively first) view that is not observed (i.e. m = m f + m 1 + m 2 ). 
it is possible to address this problem using existing techniques; for example, by learning singleview classifiers independently on the examples of s s 1 (respectively s s 2 ) for the first (respectively second) view. to make predictions, one can then combine the outputs of the classifiers [20] if both views of a test example are observed; or otherwise, use one of the outputs corresponding to the observed view. another solution is to apply multiview approaches over the training samples of s f ; or over the whole training set s by completing the views of examples in s 1 and s 2 before using external view generation functions. as an alternative, the learning objective of our proposed approach is to generate the missing views of examples in s 1 and s 2 , jointly with the learning of the association between the multiview samples (with all their views complete or completed) and their classes. the proposed model consists of three neural networks that are trained using an objective implementing a three players game between a discriminator, d, and two generators, g 1 and g 2 . the game that these models play is depicted in fig. 1 and it can be summarized as follows. at each step, if an observation is chosen with a missing view, the corresponding generator -g 1 (respectively g 2 ) if the first (respectively second) view is missingproduces the view from random noise conditionally on the observed view in a way to fool the discriminator. on the other hand, the discriminator takes as input an observation with both of its views complete or completed and, classifies it if the views are initially observed or tells if a view was produced by one of the generators. formally, both generators g 1 and g 2 take as input; samples from the training subsets s 2 and s 1 respectively; as well as random noise drawn from a uniform distribution defined over the input space of the missing view and produce the corresponding pseudo-view, which is missing; i.e. g 1 (z 1 , x 2 ) =x 1 and g 2 (x 1 , z 2 ) =x 2 . these models are learned in a way to replicate the conditional distributions p(x 1 |x 2 , z 1 ) and p(x 2 |x 1 , z 2 ); and inherently define two probability distributions, denoted respectively by p g1 and p g2 , as the distribution of samples if both views where observed i.e. (x 1 , . on the other hand, the discriminator takes as input a training sample; either from the set s f , or from one of the training subsets s 1 or s 2 where the missing view of the example is generated by one of the generators accordingly. the task of d is then to recognize observations from s 1 and s 2 that have completed views by g 1 and g 2 and to classify examples from s to their true classes. to achieve this goal we add a fake class, k + 1, to the set of classes, y, corresponding to samples that have one of their views generated by g 1 or g 2 . the dimension of the discriminator's output is hence set to k + 1 which by applying softmax is supposed to estimate the posterior probability of classes for each multiview observation (with complete or completed views) given in input. for an observation x ∈ x , we use d k+1 (x) = p d (y = k + 1|x) to estimate the probability that one of its views is generated by g 1 or g 2 . 
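to make the tripartite setup concrete, here is a minimal sketch of the three components in pytorch: two conditional generators that complete a missing view from noise plus the observed view, and a discriminator with k + 1 outputs; the framework choice, the input dimensions and the concatenation of the two views as discriminator input are assumptions (the hidden-layer sizes follow the implementation details given later in the text).

```python
# Minimal sketch of the cond2gans components: two conditional generators that
# complete a missing view from noise plus the observed view, and a discriminator
# with K+1 outputs (K real classes plus one "completed view" class).
# Dimensions, hidden sizes and the use of PyTorch are assumptions.
import torch
import torch.nn as nn

D1, D2, Z, K, H = 1000, 1000, 100, 6, 200   # view dims, noise dim, classes, hidden units

def mlp(in_dim, out_dim, out_act=None):
    layers = [nn.Linear(in_dim, H), nn.Sigmoid(), nn.Linear(H, out_dim)]
    if out_act is not None:
        layers.append(out_act)
    return nn.Sequential(*layers)

G1 = mlp(Z + D2, D1)            # G1(z1, x2) -> completed view 1 (no output activation)
G2 = mlp(D1 + Z, D2)            # G2(x1, z2) -> completed view 2
D  = mlp(D1 + D2, K + 1)        # D(x1, x2) -> K+1 logits (softmax applied in the loss)

def complete_view1(x2):
    z1 = torch.rand(x2.size(0), Z) * 2 - 1   # noise drawn from U(-1, 1)
    return G1(torch.cat([z1, x2], dim=1))

def complete_view2(x1):
    z2 = torch.rand(x1.size(0), Z) * 2 - 1
    return G2(torch.cat([x1, z2], dim=1))
```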
as the task of the generators is to produce good quality views such that the observation with the completed view will be assigned to its true class with high probability, we follow [17] by supplying the discriminator to not get fooled easily as stated in the following assumption: an observation x has one of its views generated by in the case where; d k+1 (x) ≤ k k=1 d k (x) the observation x is supposed to have its both views observed and it is affected to one of the classes following the rule; max k={1,...,k} d k (x). the overall learning objective of cond 2 gans is to train the generators to produce realistic views indistinguishable with the real ones, while the discriminator is trained to classify multiview observations having their complete views and to identify view generated samples. if we denote by p real the marginal distribution of multiview observations with their both views observed (i.e. (x 1 , x 2 ) = p real (x 1 , x 2 )); the above procedure resumes to the following discriminator objective function v d (d, g 1 , g 2 ): in this way, we stated minmax game over k + 1 component of discriminator. in addition to this objective, we made generators also learn from labels of completed samples. therefore, the following equation defines objective for the generators z) ) . note that, following assumption 1, we impose the generators to produce equally informative views by assigning the same weight to their corresponding terms in the objective functions (eqs. 1, 2). from the outputs of the discriminator for all x ∈ x we build an auxiliary function d(x) = k k=1 p d (y = k | x) equal to the sum of the first k outputs associated to the true classes. in this following, we provide a theoretical analysis of cond 2 gans involving the auxiliary function d under nonparametric hypotheses. for fixed generators g 1 and g 2 , the objective defined in (eq. 1) leads to the following optimal discriminator d * g1,g2 : where p g1,2 (x 1 , x 2 ) = 1 2 (p g1 (x 1 , x 2 ) + p g2 (x 1 , x 2 )). proof. the proof follows from [11] . let from assumption 2, and the fact that for any observation x the outputs of the discriminator sum to one i.e. k+1 k=1 d k (x) = 1, the value function v d writes: for any (α, β, γ) ∈ r 3 \{0, 0, 0}; the function z → α log z + β 2 log(1 − z) + γ 2 log(1 − z) reaches its maximum at z = α α+ 1 2 (β+γ) , which ends the proof as the discrimintaor does not need to be defined outside the supports of p data , p g1 and p g2 . by plugging back d * g1,g2 (eq. 3) into the value function v d we have the following necessary and sufficient condition for attaining the global minimum of this function: at this point, the minimum is equal to − log 4. proof. by plugging back the expression of d * (eq. 3), into the value function which from the definition of the kullback leibler (kl) and the jensen shannon divergence (jsd) can be rewritten as the jsd is always positive and jsd p real pg 1,2 = 0 if and only if p real = pg 1,2 which ends the proof. from eq. 4, it is straightforward to verify that p real (x 1 , is a global nash equilibrium but it may not be unique. in order to ensure the uniqueness, we add the jensen-shannon divergence between the distribution p g1 and p real and p g2 and p real the value function v d (eq. 1) as stated in the corollary below. is reached if and only if where v d (d, g 1 , g 2 ) is the value function defined in eq. (1) and jsd(p g1 ||p real ) is the jensen-shannon divergence between the distribution p g1 and p real . proof. 
the proof follows from the positivness of jsd and the necessary and sufficient condition for it to be equal to 0. hence,v d (d, this result suggests that at equilibrium, both generators produce views such that observations with their completed view follow the same real distribution than those which have their both views observed. in order to avoid the collapse of the generators [17] , we perform minibatch discrimination by allowing the discriminator to have access to multiple samples in combination. from this perspective, the minmax game (eqs. 1, 2) is equivalent to the maximization of a cross-entropy loss, and we use minibatch training to learn the parameters of the three models. the corresponding empirical errors estimated over a minibatch b that contains m b samples from each of the sets s f , s 1 and s 2 are: lg input: a training set s = s f s 1 s 2 initialization: size of minibatches, m b use xavier initializer to initialize discriminator and generators parameters, respectively θ sample randomly a minibatch b i of size 3m b from s 1 , s 2 and s f ; create minibatches of noise vector z 1 , z 2 from u (−1, 1) in order to be inline with the premises of corollary 1; we empirically tested different solutions and the most effective one that we found was the feature matching technique proposed in [17] , which addressed the problem of instability for the learning of generators by adding a penalty term to their corresponding objectives (eq. (8)). where, . is the 2 norm and f is the sigmoid activation function on an intermediate layer of the discriminator. the overall algorithm of cond 2 gans is shown above. the parameters of the three neural networks are first initialized using xavier. for a given number of iterations t , minibatches of size 3m b are randomly sampled from the sets s f , s 1 and s 2 . minibatches of noise vectors are randomly drawn from the uniform distribution. models parameters of the discriminator and both generators are then sequentially updated using adam optimization algorithm [13] . we implemented our method by having two layers neural networks for each of the components of cond 2 gans. these neural nets are composed of 200 nodes in hidden layers with a sigmoid activation function. since the values of the generated samples are supposed to approximate any possible real value, we do not use the activation function in the outputs of both generators. 1 in this section, we present experimental results aimed at evaluating how the generation of views by cond 2 gans can help to take advantage of existing training examples, with many having an incomplete view, in order to learn an efficient classification function. we perform experiments on a publicly available collection, extracted from reuters rcv1/rcv2, that is proposed for multilingual multiclass text categorization 2 ( table 1 ). the dataset contains numerical feature vectors of documents originally presented in five languages (en, fr, gr, it, sp). in our experiments, we consider four pairs of languages with always english as one of the views ((en,fr),(en,sp),(en,it), (en,gr) ). documents in different languages belong to one and only one class within the same set of classes (k = 6); and they also have translations into all the other languages. these translations are obtained from a state-of-the-art statistical machine translation system [22] trained over the europal parallel collection using about 8.10 6 sentences for the 4 considered pairs of languages. 
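continuing the component sketch above, the alternating updates of the training algorithm can be outlined as follows; the loss weighting, the intermediate layer used for feature matching and the optimizer settings are assumptions for illustration rather than the authors' exact code.

```python
# One training iteration of cond2gans, roughly following the algorithm described
# above: sample a minibatch drawn from s_f, s_1 and s_2, complete the missing
# views, update D with a (K+1)-class cross-entropy, then update G1/G2 to push
# completed samples towards their true classes plus a feature-matching penalty.
# Loss weights, the feature-matching layer and optimizer settings are assumptions.
import torch
import torch.nn.functional as F

opt_d = torch.optim.Adam(D.parameters(), lr=1e-4, betas=(0.5, 0.999))
opt_g = torch.optim.Adam(list(G1.parameters()) + list(G2.parameters()),
                         lr=1e-4, betas=(0.5, 0.999))
FAKE = K  # index of the extra "completed view" class

def d_features(x):
    return torch.sigmoid(D[0](x))   # intermediate layer used for feature matching (assumed)

def train_step(xf1, xf2, yf, x1_only, y1, x2_only, y2):
    # xf1, xf2, yf: complete pairs from s_f; x1_only, y1: s_1; x2_only, y2: s_2
    # --- discriminator update: true classes for complete pairs, FAKE for completed ones ---
    fake1 = torch.cat([complete_view1(x2_only).detach(), x2_only], dim=1)
    fake2 = torch.cat([x1_only, complete_view2(x1_only).detach()], dim=1)
    real = torch.cat([xf1, xf2], dim=1)
    fake_lbl = lambda n: torch.full((n,), FAKE, dtype=torch.long)
    loss_d = (F.cross_entropy(D(real), yf)
              + F.cross_entropy(D(fake1), fake_lbl(fake1.size(0)))
              + F.cross_entropy(D(fake2), fake_lbl(fake2.size(0))))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # --- generator update: true-class loss on completed samples plus feature matching ---
    fake1 = torch.cat([complete_view1(x2_only), x2_only], dim=1)
    fake2 = torch.cat([x1_only, complete_view2(x1_only)], dim=1)
    fm = ((d_features(real).mean(0) - d_features(fake1).mean(0)).pow(2).sum()
          + (d_features(real).mean(0) - d_features(fake2).mean(0)).pow(2).sum())
    loss_g = F.cross_entropy(D(fake1), y2) + F.cross_entropy(D(fake2), y1) + fm
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```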
3 in our experiments, we consider the case where the number of training documents having their two versions is much smaller than those with only one of their available versions (i.e. m f m 1 + m 2 ). this corresponds to the case where the effort of gathering documents in different languages is much less than translating them from one language to another. to this end, we randomly select m f = 300 samples having their both views, m 1 = m 2 = 6000 samples with one of their views missing and the remaining samples without their translations for test. in order to evaluate the quality of generated views by cond 2 gans we considered two scenarios. in the first one (denoted by tenṽ), we test on english documents by considering the generation of these documents with respect to the other view (v ∈ {fr, gr, it, sp}) using the corresponding generator. in the second scenario (denoted by tẽ nv ), we test on documents that are written in another language than english by considering their generation on english provided by the other generator. for evaluation, we test the following four classification approaches along with cond 2 gans; one singleview approach and four multiview approaches. in the singleview approach (denoted by c v ) classifiers are the same as the discriminator and they are trained on the part of the training set with examples having their corresponding view observed. the multiview approaches are mkl [4] , coclassification (co-classif) [1] , unanimous vote ( mv b ) [2] . results are evaluated over the test set using the accuracy and the f 1 measure which is the harmonic average of precision and recall. the reported performance are averaged over 20 random (train/test) sets, and the parameters of adam optimization algorithm are set to α = 10 −4 , β = 0.5. on the value of the generated views. we start our evaluation by comparing the f 1 scores over the test set, obtained with cond 2 gans and a neural network having the same architecture than the discriminator d of cond 2 gans trained over the concatenated views of documents in the training set where the missing views are generated by machine translation. figure 2 shows these results. each point represents a class, where its abscissa (resp. ordinate) represents the test f 1 score of the neural network trained using mt (resp. one of the generators of cond 2 gans) to complete the missing views. all of the classes, in the different language pair scenarios, are above the line of equality, suggesting that the generated views by cond 2 gans provide higher value information than translations provided by mt for learning the neural network. this is an impressive finding, as the resources necessary for the training of mt is large (8.10 6 pairs of sentences and their translations); while cond 2 gans does both view completion and discrimination using only the available training data. this is mainly because both generators induce missing views with the same distribution than real pairs of views as stated in corollary 1. comparison between multiview approaches. we now examine the gains, in terms of accuracy, of learning the different multiview approaches on a collection where for other approaches than cond 2 gans the missing views are completed by one of the generators of our model. table 2 summarizes these results obtained by cond 2 gans, mkl, co-classif, and mv b for both test scenarios. in all cases cond 2 gans, provides significantly better results than other approaches. 
this provides empirical evidence of the effectiveness of the joint view generation and class prediction of cond 2 gans. furthermore, mkl, co-classif and cond 2 gans are binary classification models and tackle the multiclass classification case with one vs all strategy making them to suffer from class imbalance problem. results obtained using the f 1 measure are in line with those of table 2 and they are not reported for the sake of space. in this paper we presented cond 2 gans for multiview multiclass classification where observations may have missing views. the model consists of three neuralnetworks implementing a three players game between a discriminator and two generators. for an observation with a missing view, the corresponding generator produces the view conditionally on the other observed one. the discriminator is trained to recognize observations with a generated view from others having their views complete and to classify the latter into one of the existing classes. we evaluate the effectiveness of our approach on another challenging application than image analysis which is the domain of choice for the design of new gan models. our experiments on a subset of reuters rcv1/rcv2 show the effectiveness of cond 2 gans to generate high quality views allowing to achieve significantly better results, compared to the case where the missing views are generated by machine translation which requires a large collection of sentences and their translations to be tuned. as future study, we will be working on the generalization of the proposed model to more than 2 views. one possible direction is the use of an aggregation function of available views as a condition to the generators. a co-classification approach to learning from multilingual corpora learning from multiple partially observed views -an application to multilingual text categorization kernel independent component analysis multiple kernel learning, conic duality, and the smo algorithm combining labeled and unlabeled data with co-training multi-view generative adversarial networks deep generative image models using a laplacian pyramid of adversarial networks adversarial feature learning adversarially learned inference two view learning: svm-2k, theory and practice generative adversarial nets pac-bayesian analysis for a two-step hierarchical multiview learning approach adam: a method for stochastic optimization pose guided person image generation conditional image synthesis with auxiliary classifier gans unsupervised representation learning with deep convolutional generative adversarial networks improved techniques for training gans an rkhs for multi-view learning and manifold co-regularization unsupervised and semi-supervised learning with categorical generative adversarial networks a unified weight learning paradigm for multi-view learning cr-gan: learning complete representations for multi-view generation nrc's portage system for wmt multi-view image generation from a single-view key: cord-020932-o5scqiyk authors: zhong, wei; rohatgi, shaurya; wu, jian; giles, c. lee; zanibbi, richard title: accelerating substructure similarity search for formula retrieval date: 2020-03-17 journal: advances in information retrieval doi: 10.1007/978-3-030-45439-5_47 sha: doc_id: 20932 cord_uid: o5scqiyk formula retrieval systems using substructure matching are effective, but suffer from slow retrieval times caused by the complexity of structure matching. 
we present a specialized inverted index and rank-safe dynamic pruning algorithm for faster substructure retrieval. formulas are indexed from their operator tree (opt) representations. our model is evaluated using the ntcir-12 wikipedia formula browsing task and a new formula corpus produced from math stackexchange posts. our approach preserves the effectiveness of structure matching while allowing queries to be executed in real-time. in information retrieval, a great deal of research has gone into creating efficient search engines for large corpora. however, few have addressed substructure search in structural content, e.g., in mathematical information retrieval (mir) [21] where efficient substructure similarity search is needed to identify shared subexpressions effectively. for example, in math formula search, to discern that a + b and b + a are equivalent (by commutativity), but that ab + cd and a + bcd are different, applying tokenization and counting common token frequencies is insufficient. instead, a hierarchical representation of mathematical operations is needed and we may want to identify shared substructures. in the most recent math similarity search competition, 1 effective systems all take a tree-based approach by extracting query terms from tree representations. for example, an operator tree (opt) is used in fig. 1 to represent math formulas where operands are represented by leaves and operators are located at internal nodes. this facilitates searching substructures shared by two math expressions. for example, we can extract paths from their tree representations and find their shared subtrees by matching their common paths grouped by subtree root nodes. however, in order to carry structure information, it is common to see structural queries with over tens or even hundreds of path tokens which is unusual for normal fulltext search. this makes query processing costly for realistic math search tasks. in text similarity search, query processing can be accelerated through dynamic pruning [18] , which typically estimates score upperbounds to prune documents unlikely to be in the top k results. however, effective substructure search requires additional matching or alignment among query terms, and this makes it hard to get a good score estimation and it prevents us applying traditional dynamically pruning effectively. in fact, reportedly few state-of-the-art mir systems have achieved practical query run times even when given a large amount of computing resources [11, 20] . in this paper we try to address this problem by introducing a specialized inverted index and we propose a dynamic pruning method based on this inverted index to boost formula retrieval efficiency. recently there has been an increasing amount of research on similarity search for math formulas, with most focusing on search effectiveness [5, 7, 11, 23] . there are many emerging issues regarding effectiveness, including handling mathematical semantics, and identifying interchangeable symbols and common subexpressions. however, the efficiency of math formula search systems is often not addressed. a number of mir systems apply text search models to math retrieval, extracting sequential features from formulas and use variants of tf-idf scoring [12, 14, 16] . these approaches incorporate a bag-of-words model, and use frequency to measure formula similarity. inevitably, they need to index different combinations of sequences or substrings to handle operator commutativity and subexpression identification. 
this index augmentation results in a non-linearly increasing index size in the number of indexed "words" [12] and thus hurts efficiency for large corpora. on the other hand, recent results [10, 20, 23] reveal that effective systems for formula retrieval use tree-based approaches distinct from text-based methods. however, tree-based systems usually need to calculate costly graph matching or edit distance metrics [9, 22] , which generally have non-linear time complexity. recently, a path-based approach [23] was developed to search substructures in formula opts approximately by assuming that identical formulas have the same leaf-root path set. although at the time of writing, it obtains the best effectiveness for the ntcir-12 dataset, the typically large number of query paths means that query run times are not ideal -maximum run times can be a couple of seconds. dynamic pruning has been recognized as an effective way to reduce query processing times [2, 8, 13, 18] . dynamic pruning speeds up query processing by skipping scoring calculations or avoiding unnecessary reads for documents which are unlikely to be ranked in the top k results. pruning methods can be based on different query processing schemes: document-at-a-time (daat) requires all relevant posting lists be merged simultaneously. term-at-a-time (taat) or score-at-a-time (saat) processes one posting list at a time for each term, requiring additional memory to store partial scores, and posting lists in this case are usually sorted by document importance (e.g, impact score [1] ), with promising documents placed at the front of inverted lists. pruning strategies are rank-safe (or safe up to rank k ) [19] if they guarantee that the top k documents are ranked in the same order before and after pruning. the most well-known rank-safe pruning strategies for daat are maxscore [8, 17, 19] and wand variants [3, 6] . shan et al. [15] show that maxscore variants (e.g. bmm, lbmm) outperform other dynamic pruning strategies for long queries, and recently mallia et al. [2] report a similar finding over a range of popular index encodings. baseline model. this work is based on our previous work [23] which extracts prefixes from opt leaf-root paths as index or query terms. the opt is parsed from a formula in l a t e x. for indexed paths, they are mapped to corresponding posting lists in an inverted index where the ids of expressions containing the path are appended. for query paths, the corresponding posting lists are merged and approximate matching is performed on candidates one expression at a time. the similarity score is measured from matched common subtree(s). because math symbols are interchangeable, paths are tokenized for better recall, e.g., variables such as a, b, c are tokenized into var. in our tokenized path representation uppercase words denote token types, which may be for operators as well as operands (e.g., times for symbols representing multiplication). in fig. 1 , when indexing "bc + xy + a + z," its expression id (or expid) will be appended to posting lists associated with tokenized prefix paths from its opt representation, i.e., var/times, var/add and var/times/add. at query processing, the shared structures highlighted in black and gray are found by matching these tokenized paths (two paths match if and only if they have the same tokenized paths, for example, "a/+" and "z/+" can be matched) and common subtree roots are identified by grouping paths by their root nodes. 
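as a rough illustration of this indexing step, the sketch below extracts tokenized leaf-root paths from a toy opt and appends an expression id to the posting list of every path prefix; the tree encoding, token names and prefix handling are simplified assumptions rather than the authors' implementation.

```python
# Toy sketch of path-based indexing: every leaf contributes its leaf-to-root path,
# symbols are already tokenized into types (var, times, add, ...), and each prefix
# of the tokenized path maps to a posting list of expression ids. Data structures
# and token names here are illustrative assumptions.
from collections import defaultdict

class Node:
    def __init__(self, token, children=()):
        self.token = token          # tokenized symbol, e.g. "var", "times", "add"
        self.children = list(children)

def leaf_root_paths(node, ancestors=()):
    """Yield tokenized leaf-to-root paths of an operator tree."""
    if not node.children:
        yield (node.token,) + tuple(reversed(ancestors))
        return
    for child in node.children:
        yield from leaf_root_paths(child, ancestors + (node.token,))

def index_expression(index, expid, opt_root):
    for path in leaf_root_paths(opt_root):
        # index prefixes starting at the leaf, e.g. var/times, var/times/add
        for end in range(2, len(path) + 1):
            index["/".join(path[:end])].append(expid)

# usage: a "bc + xy + a + z"-style expression (structure simplified)
index = defaultdict(list)
opt = Node("add", [Node("times", [Node("var"), Node("var")]),
                   Node("times", [Node("var"), Node("var")]),
                   Node("var"), Node("var")])
index_expression(index, expid=12, opt_root=opt)
# index now holds posting lists for var/times, var/add and var/times/add
```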
as a result, the posting list entry also stores the root node id for indexed paths, in order to reconstruct matches substructures at merge time. at query time, the similarity score is given by the size of matched common subtrees. specifically, the model chooses a number of "widest" matched subtree(s) (e.g., a + bc is the widest matched in fig. 1 because it has 3 common leaves and is "wider" than the other choices) and measure formula similarity based on the size of these common subtrees. the original approach0 model [23] matches up to three widest common subtrees and scores similarity by a weighted sum of the number of matched leaves (operands) and operators from different common subtreest i q ,t i d of a common forest π. operators and operand (leaf) nodes weights are controlled by parameter α, while the weight of rooted substructures from largest to smallest are given by β i . in the following, | · | indicates the size of a set: interestingly, while multiple subtree matching boosts effectiveness, using just the widest match still outperforms other systems in terms of highly relevant results [23] . the simplified similarity score based on widest common subtree between query and document opts t q , t d is the widest match w * q,d , formally where cfs(t q , t d ) are all the common formula subtrees between t q and t d . in addition to subtree isomorphism, a formula subtree requires leaves in a subtree to match leaves in the counterpart, in other words, subtrees are matched bottomup from operands in opts. in fig. 1 , the value of w * q,d is 3, produced by the widest common subtrees shown in gray. dynamic pruning. in dynamic pruning, the top k scored hits are kept throughout the querying process, with the lowest score in the top k at a given point defining the threshold θ. since at most k candidates will be returned, dynamic pruning strategies work by estimating score upperbounds before knowing the precise score of a hit so that candidate hits with a score upperbound less or equal to θ can be pruned safely, because they will not appear in the final top k results. moreover, if a subset of posting lists alone cannot produce a top k result from their upperbounds, they are called a non-requirement set, the opposite being the requirement set. posting lists in the non-requirement with ids less than the currently evaluating ids in the requirement set can be skipped safely, because posting lists in the non-requirement set alone will not produce a top k candidate. in this paper, we apply dynamic pruning to structural search. as structure search has more query terms in general, we focus on a maxscore-like strategy suggested by [2, 15] , since they do not need to sort query terms at merge iterations (which is expensive for long queries). our approach is different from the original maxscore, as upperbound scores are also calculated from the query tree representation. we also use the simplified scoring eq. (2) where a subset of query terms in the widest matched common subtreest * q ,t * d contribute to the score. in contrast, typical tf-idf scoring has all hit terms contribute to the rank score. when we merge posting lists, a set of query paths match paths from a document expression one at a time, each time a hit path set for matched query and candidate paths are examined. define p(t ) to be all paths extracted from opt t , i.e., p(t ) = {p : p ∈ leafroot paths(t n ), n ∈ t } where t n is the entire subtree of t rooted at n with all its descendants. 
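before the hit path set is modelled formally below, the following sketch shows how the widest-match width w*_{q,d} can be computed from per-node counts of tokenized paths, using the observation (made precise in the next paragraphs) that each tokenized-path partition contributes the smaller of the two counts; the input layout is an assumption.

```python
# Hedged sketch of computing the widest-match width w*_{q,d} from per-node counts
# of tokenized leaf-root paths: for every (query node, document node) pair, the
# matched width is the sum over tokenized paths of the smaller count, and w*_{q,d}
# is the maximum over node pairs. The input layout is an assumption.
def widest_match(query_counts, doc_counts):
    """query_counts / doc_counts: dict node -> dict tokenized_path -> count."""
    best = 0
    for m, q_paths in query_counts.items():
        for n, d_paths in doc_counts.items():
            width = sum(min(cnt, d_paths.get(t, 0)) for t, cnt in q_paths.items())
            best = max(best, width)
    return best

# usage with illustrative counts in the spirit of the running example:
# one query node has 5 var/times/add paths and 1 var/add path; one document node
# offers 2 var/times/add + 2 var/add, another offers 3 var/times/add.
q = {"q1": {"var/times/add": 5, "var/add": 1}}
d = {"n1": {"var/times/add": 2, "var/add": 2}, "n2": {"var/times/add": 3}}
assert widest_match(q, d) == max(2 + 1, 3)   # = 3
```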
we model the hit path set by a bipartite graph g(q, d, e) where q = {q : q ∈ p(t q )}, d = {d : d ∈ p(t d )} are query and document path sets, and edges are ordered pairs e = {(q, d) : tokenized(q) = tokenized(d), q ∈ q, d ∈ d} representing a potential match between a query path to a document path. since an edge is established only for paths with the same token sequence, we can partition the graph into disconnected smaller bipartite graphs g t = g(q t , d t , e t ), each identified by tokenized query path t: figure 2 shows the hit path set of the example in fig. 1 , this example can be partitioned into independent subgraphs associated with tokenized paths var/times/add, var/times and var/add. each partition is actually a complete bipartite graph (fully connected) because for any edge between q t and d t , it is in edge set e t . and for each complete bipartite graph g(q t , d t , e t ), we can obtain their maximum matching sizes from min(|q t |, |d t |) easily. on the other hand, to calculate score w * q,d , we need to find a pair of query and document nodes at which the widest common subtreet * q ,t * d are rooted (see eq. 2), so we also define the matching candidate relations filtered by nodes. let g (m,n) = g(q (m) , d (n) , e (m,n) ) be the subgraph matching between query subtree rooted at m and document subtree rooted at n where then, similarity score w * q,d can be calculated from selecting the best matched node pairs and summing their partition matches. specifically, define token paths of tree t rooted at n as set t(n) = {t : t = tokenized(p), p ∈ leafroot paths(t n )}, where ν(g) is the maximum matching size of bipartite graph g. t |) as our (precomputed) partial score upperbound. it is analogous to text search where each posting list has a partial score upperbound, the tf-idf score upperbound is merely their sum. in our case, the sum for partial score upperbounds is only for one node or a subtree. in the following we propose three strategies to compute w * q,d upperbound from partial score upperbounds and assign non-requirement set. max reference (maxref ) strategy. in maxscore [17, 19] , each posting list has a partial score upperbound, however, our scoring function implies each posting list can be involved with multiple partial score upperbounds. one way to select the non-requirement set in our case is using an upperbound score maxref t (for each posting list t) which is the maximum partial score from the query nodes by which this posting list gets "referenced", and if a set of posting lists alone has a sum of maxref scores less or equal to θ, they can be safely put into the non-requirement set. the rank safety can be justified, since each posting list corresponds to a unique tokenized path t, and maxref t = max m w m,t . then for m ∈ t q , n ∈ t d , greedy binary programming (gbp) strategies. inequality (6) is relaxed twice, so it spurs the motivation to get tighter upperbound value by maximizing the number of posting lists in the non-requirement set, so that more posting lists are likely to be skipped. define partial upperbound matrix w = {w i,j } |tq|×|t| where t = {t(m), m ∈ t q } are all the token paths from query opt (t is essentially the same as tokenized p(t q )), and a binary variable x |t|×1 indicating which corresponding posting lists are placed in the non-requirement set. 
one heuristic objective is to maximize the number of posting lists in the non-requirement set (gbp-num), i.e., maximize the sum of the binary indicator vector x subject to the constraint that, for every query node m, the partial upperbounds w_{m,t} of the posting lists selected for skipping sum to at most θ. however, maximizing the number of posting lists in the non-requirement set does not necessarily cause more items to be skipped, because the posting lists can be very short. instead, we can maximize the total length of the posting lists in the non-requirement set. in this case, the vector of ones in objective function (7) is replaced with the posting-list length vector l = (l_1, l_2, ..., l_|t|), where l_i is the length of posting list i. we call this strategy gbp-len. the two gbp strategies are rank-safe, since the constraints in inequality (8) imply Σ_{t∈skip} w_{m,t} ≤ θ. both strategies require solving binary programming problems, which are known to be np-complete and thus too intensive for long queries. instead, we greedily follow one branch of the binary programming sub-problems to obtain a feasible (but not optimal) solution in o(|t_q||t|^2). figure 3 illustrates formula query processing using a modified inverted index for dynamic pruning. for each internal node m of the query opt, we store the number of leaves of m as w_m = |q^(m)|. each query node points to tokenized path entries in a dictionary, where each reference is associated with w_{m,t} = |q_t^(m)|, identified by tokenized path t (denoted as m/w_{m,t} in fig. 3). in fig. 3, node q1 from the query has 6 leaves, which is also the upperbound on the number of path matches for q1, i.e., |q^(1)|. since q1 consists of 2 tokenized leaf-root paths, var/times/add and var/add, q1 is linked to two posting lists, each associated with a partial score upperbound (5 and 1). each posting list maps to a token path t ∈ t with a dynamic counter for the number of query nodes referring to it (initially |q_t|). a query node is pruned by our algorithm when its subtree width is no longer greater than the current threshold, because the corresponding subexpression cannot be in the top-k results; in this case the reference counters decrease. a posting list is removed if its reference counter is less than one. each posting list entry, identified by an expid, stores the values n and w_{n,t} = |d_t^(n)| for each subtree token path t rooted at document node n (denoted as n/w_{n,t}). as an example, in fig. 3 the hit opt (of expid 12) has 5 tokenized paths. query processing is described in algorithm 1.
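as an illustration of how a requirementset-style routine might realize the gbp-len strategy inside algorithm 1 below, the sketch greedily admits the longest posting lists into the non-requirement set while keeping every query node's skipped partial upperbounds within the threshold; the greedy order and the omission of the relaxation u(·) are assumptions, not the authors' exact procedure.

```python
# Greedy sketch of gbp-len style non-requirement-set assignment (an illustration
# under stated assumptions, not the authors' exact routine). w[m][t] is the partial
# upperbound of query node m restricted to tokenized path t, lengths[t] is the
# posting-list length, and theta is the current top-k threshold.
def assign_non_requirement(w, lengths, theta):
    query_nodes = list(w.keys())
    skipped_sum = {m: 0.0 for m in query_nodes}      # upperbound mass already skipped per node
    non_requirement = set()

    # consider the longest posting lists first, since skipping them saves the most work
    for t in sorted(lengths, key=lengths.get, reverse=True):
        contrib = {m: w[m].get(t, 0.0) for m in query_nodes}
        if all(skipped_sum[m] + contrib[m] <= theta for m in query_nodes):
            non_requirement.add(t)
            for m in query_nodes:
                skipped_sum[m] += contrib[m]

    requirement = [t for t in lengths if t not in non_requirement]
    return requirement, non_requirement

# usage (hypothetical numbers):
# w = {"q1": {"var/times/add": 5, "var/add": 1}, "q4": {"var/times/add": 2}}
# lengths = {"var/times/add": 120, "var/add": 3400}
# reqs, skips = assign_non_requirement(w, lengths, theta=3.0)
```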
qrynodematch calculates matches for the hit nodes according to eq. 5, pruning nodes whose maximum matching size is smaller than that of previously examined nodes. given query hit node q1 in fig. 3, the function qrynodematch returns $\max_{n \in T_d} \nu(G^{(1,n)}) = \max\big(\min(5,2) + \min(1,2),\ \min(5,3)\big) = 3$. the algorithm then selects the best matched query node, and its matched width (i.e., widest in algorithm 1) is our structural similarity $w^*_{q,d}$. after obtaining $w^*_{q,d}$, we compute a metric for the similarity of symbols (e.g., to differentiate $e = mc^2$ and $y = ax^2$) and penalize larger formulas, to produce a final overall similarity score [23] for ranking. because of this additional layer, we need to relax our upperbound further. following the overall scoring function in [23], our relaxing function $u$ is defined by assuming a perfect symbol similarity score in the overall scoring function; in our setting, the parameters are $\eta = 0.05$ and $n_d = 1$. whenever the threshold $\theta$ is updated, we examine all query nodes: if a query node $m$ has an upperbound less than or equal to the threshold, i.e., $u(w_m) \leq \theta$, then the corresponding subtree of this node is too "small" to make it into the top-k results. as a result, some posting lists (or iterators) may also be dropped because their reference count reaches zero. algorithm 1 (formula query processing) proceeds as follows:

  for each entry $m / w_m$ of a tokenized path $t$ rooted at a query node $m$: let $i$ be the iterator index associated with $t$
  heap := data structure to hold the top-k results
  while true do
    candidate := minimal id among the current expids of the iterators in reqs
    if candidate equals maxid then the search has terminated; return the top-k results
    let $G(Q, D, E)$ be the hit path set bipartite graph for the candidate; widest := 0
    for each hit node, compute maxmatch with qrynodematch; if maxmatch > widest then widest := maxmatch (find the widest width)
    if widest > 0 then score := calculate the final score (including symbol similarity, see [23])
    if the heap is not full or score > $\theta$ then push the candidate, or replace the lowest scored hit in the heap
    if the heap is full then update the current threshold: $\theta$ := minimal score in the current top-k results; drop small query nodes and unreferenced iterators; reqs := requirementset($\theta$, strategy) (update the requirement set)
    advance the posting list iterators: for iters[i] in reqs, if iters[i].expid = candidate then iters[i].next()

we first evaluate our system on the ntcir-12 wikipedia formula browsing task [20] (ntcir-12 for short), which is the most current benchmark for formula-only retrieval. the dataset contains over 590,000 math expressions taken from english wikipedia. since work in formula retrieval is relatively new, there are only 40 queries in ntcir-12 that can be compared with other published systems; however, these queries are well designed to cover a variety of math expressions of different complexity. there are 20 queries containing wildcards in this task (using the wildcard specifier \qvar to match an arbitrary subexpression or symbol, e.g., the query "\qvar{a}^2 + \qvar{b}^3" can match "x^2 + (y + 1)^3"). we add support for wildcards by simply treating internal nodes of formulas (each representing a rooted subexpression) as additional "leaves" (ignoring their descendants), and the wildcard specifiers in a query are treated as normal leaves so that they match the indexed wildcard paths. since the corpus of ntcir-12 is not large enough to show the full impact of pruning, we also evaluate query run times on a corpus containing over 1 million math-related documents/threads from the math stackexchange (mse) q&a website, running the same query set from ntcir-12.
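before turning to run times, the node-matching step of algorithm 1 can be summarized in a few lines of python; the per-node path counts below are hypothetical values chosen only to reproduce the q1 example from fig. 3, and the dictionary layout is our own simplification rather than the actual index structure.

```python
def structural_similarity(query_nodes, doc_nodes):
    """Widest match over all (query node, document node) pairs, cf. eq. 5.

    query_nodes: dict m -> {tokenized path t: |Q_t^(m)|}
    doc_nodes:   dict n -> {tokenized path t: |D_t^(n)|}

    Each partition G_t^(m,n) is a complete bipartite graph, so its maximum
    matching size is min(|Q_t^(m)|, |D_t^(n)|).
    """
    widest = 0
    for q_t in query_nodes.values():
        for d_t in doc_nodes.values():
            width = sum(min(count, d_t.get(t, 0)) for t, count in q_t.items())
            widest = max(widest, width)
    return widest

# hypothetical counts reproducing the worked example for query node q1
query = {"q1": {"var/times/add": 5, "var/add": 1}}
doc = {"n1": {"var/times/add": 2, "var/add": 2}, "n2": {"var/times/add": 3}}
assert structural_similarity(query, doc) == 3  # max(min(5,2)+min(1,2), min(5,3))
```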
run times are shown for the posting list merging stage (e.g., the time for parsing the query into an opt is excluded), and unless specified otherwise, posting lists are compressed and cached in memory. each system had five independent runs, and we report results over the overall distribution. the resulting uncompressed indices for the ntcir-12 and mse corpora are around 2 gb and 16 gb in size, with 961,604 and 5,764,326 posting lists respectively; the (min, max, mean, standard deviation) of the posting list lengths are (1, 262,309, 16.95, 737.84) and (1, 7,916,296, 73.74, 9,736.72), respectively. table 1 reports run time statistics; non-pruning (exhaustive search) baselines with k = 100 are also compared there. almost consistently, the gbp-len strategy achieves the best efficiency with smaller variance. this is expected, since gbp-len models the skipping potential better than gbp-num: although gbp-num gives a tighter theoretical upperbound than maxref, it only maximizes the number of posting lists in the non-requirement set and may perform badly when these posting lists are short. in a few cases the best minimal run times come from other strategies. for those with meaningful gaps, i.e., the wiki dataset with non-wildcard queries and k = 1000, maxref outperforms the others in standard deviation and maximum run time by a notable margin; however, this likely results from the small threshold caused by the large k, so that efficiency on the small ntcir dataset is less affected by pruning (a small θ means less pruning potential) than by the extra cost of assigning the requirement set, and the latter is more dominant in the gbp runs. in wildcard queries, however, many expressions can match the query, so the threshold value is expected to be larger than in the non-wildcard case. secondly, we compare the effectiveness (fig. 4) and efficiency (fig. 5) of our system with tangent-s [5], mcat [11] and our baseline system without pruning [23], which are all structure-based formula search engines that have obtained the best published bpref scores on the ntcir-12 dataset. in addition, the icst system [7] also obtains effective results on the combined math and text task, but it is trained on the previous wiki dataset and the system is currently not available. all systems are evaluated in a single thread for top-1000 results. we use our best performing strategy, i.e., gbp-len, in an on-disk version with posting lists uncompressed and always read from disk, and an in-memory version with compression. for the baseline system, only the 20 non-wildcard queries are reported because it does not support wildcards. we compare the baseline's best performing run (base-best), which uses costly multiple tree matching, as well as its specialized version (base-opd-only), which considers only the largest matched tree width (see eq. 2). tangent-s has a few outliers as a result of its costly alignment algorithm to rerank structure and find the maximum subtree similarity [22]; its non-linear complexity makes it expensive for some long queries (especially in the wildcard case). mcat reportedly has a median query execution time of around 25 s, using a server machine and multi-threading [11]. we therefore remove the tangent-s outliers and mcat from the runtime boxplot, and for space we only include the faster base-opd-only baseline in fig. 5. we outperform tangent-s in efficiency even if we exclude their outlier queries, with higher bpref on non-wildcard fully relevant results.
our efficiency is also better than the baseline system, even if the latter only considers less complex non-wildcard queries. however, our overall effectiveness is skewed by bad performance of wildcard queries because a much more expensive phase is introduced to boost accuracy by other systems to handle inherently difficult "structural wildcards." our pruning strategies are rank-safe (pruning and exhaustive version shows the same bpref scores) but there is a minor bpref difference between ours and baseline (base-opd-only) due to parser changes we have applied to support wildcards (e.g., handle single left brace array as seen in a wildcard query) and they happen to slightly improve accuracy in partially relevant cases. we have presented rank-safe dynamic pruning strategies that produce an upperbound estimation of structural similarity in order to speedup formula search using subtree matching. our dynamic pruning strategies and specialized inverted index are different from traditional linear text search pruning methods and they further associate query structure representation with posting lists. our results show we can obtain substantial improvement in efficiency over the baseline model, while still generating highly relevant non-wildcard search results. our approach can process a diverse set of structural queries in real time. pruned query evaluation using pre-computed impacts an experimental study of index compression and daat query processing methods efficient query evaluation using a two-level retrieval process retrieval evaluation with incomplete information layout and semantics: combining representations for mathematical formula search faster top-k document retrieval using block-max indexes the math retrieval system of icst for ntcir-12 mathir task efficient compressed inverted index skipping for disjunctive text-queries structural similarity search for mathematics retrieval tangent-v: math formula image search using line-of-sight graphs mcat math retrieval system for ntcir-12 mathir task a mathematics retrieval system for formulae in layout presentations upper-bound approximations for dynamic pruning technical aspects of the digital library of mathematical functions optimized top-k processing with global page scores on block-max indexes indexing and searching mathematics in digital libraries optimization strategies for complex queries efficient query processing for scalable web search query evaluation: strategies and optimizations ntcir-12 mathir task overview recognition and retrieval of mathematical expressions multi-stage math formula search: using appearance-based similarity metrics at scale structural similarity search for formulas using leaf-root paths in operator subtrees key: cord-020846-mfh1ope6 authors: zlabinger, markus; hofstätter, sebastian; rekabsaz, navid; hanbury, allan title: dsr: a collection for the evaluation of graded disease-symptom relations date: 2020-03-24 journal: advances in information retrieval doi: 10.1007/978-3-030-45442-5_54 sha: doc_id: 20846 cord_uid: mfh1ope6 the effective extraction of ranked disease-symptom relationships is a critical component in various medical tasks, including computer-assisted medical diagnosis or the discovery of unexpected associations between diseases. while existing disease-symptom relationship extraction methods are used as the foundation in the various medical tasks, no collection is available to systematically evaluate the performance of such methods. 
in this paper, we introduce the disease-symptom relation collection (dsr-collection), created by five physicians as expert annotators. we provide graded symptom judgments for diseases by differentiating between relevant symptoms and primary symptoms. further, we provide several strong baselines, based on the methods used in previous studies. the first method is based on word embeddings, and the second on co-occurrences of mesh-keywords of medical articles. for the co-occurrence method, we propose an adaption in which not only keywords are considered, but also the full text of medical articles. the evaluation on the dsr-collection shows the effectiveness of the proposed adaption in terms of ndcg, precision, and recall. disease-symptom knowledge bases are the foundation for many medical tasks -including medical diagnosis [9] or the discovery of unexpected associations between diseases [12, 14] . most knowledge bases only capture a binary relationship between diseases and symptoms, neglecting the degree of the importance between a symptoms and a disease. for example, abdominal pain and nausea are both symptoms of an appendicitis, but while abdominal pain is a key differentiating factor, nausea does only little to distinguish appendicitis from other diseases of the digestive system. while several disease-symptom extraction methods have been proposed that retrieve a ranked list of symptoms for a disease [7, 10, 13, 14] , no collection is available to systematically evaluate the performance of such methods [11] . while these method are extensively used in downstream tasks, e.g., to increase the accuracy of computer-assisted medical diagnosis [9] , their effectiveness for disease-symptom extraction remains unclear. in this paper, we introduce the disease-s ymptom relation collection (dsrcollection) for the evaluation of graded disease-symptom relations. the collection is annotated by five physicians and contains 235 symptoms for 20 diseases. we label the symptoms using graded judgments [5] , where we differentiate between: relevant symptoms (graded as 1) and primary symptoms (graded as 2). primary symptoms-also called cardinal symptoms-are the leading symptoms that guide physicians in the process of disease diagnosis. the graded judgments allow us for the first time to measure the importance of different symptoms with grade-based metrics, such as ndcg [4] . as baselines, we implement two methods from previous studies to compute graded disease-symptom relations: in the first method [10] , the relation is the cosine similarity between the word vectors of a disease and a symptom, taken from a word embedding model. in the second method [14] , the relation between a disease and symptom is calculated based on their co-occurrence in the meshkeywords 1 of medical articles. we describe limitations of the keyword-based method [14] and propose an adaption in which we calculate the relations not only on keywords of medical articles, but also on the full text and the title. we evaluate the baselines on the dsr-collection to compare their effectiveness in the extraction of graded disease-symptom relations. as evaluation metrics, we consider precision, recall, and ndcg. for all three metrics, our proposed adapted version of the keyword-based method outperforms the other methods, providing a strong baseline for the dsr-collection. the contributions of this paper are the following: -we introduce the dsr-collection for the evaluation of graded disease-symptom relations. 
we make the collection freely available to the research community. 2 -we compare various baselines on the dsr-collection to give insights on their effectiveness in the extraction of disease-symptom relations. in this section, we describe the new disease-s ymptom relation collection (dsr-collection) for the evaluation of disease-symptom relations. we create the collection in two steps: in the first step, relevant disease-symptom pairs (e.g. appendicitis-nausea) are collected by two physicians. they collect the pairs in a collaborative effort from high-quality sources, including medical textbooks and an online information service 3 that is curated by medical experts. in the second step, the primary symptoms of the collected disease-symptom pairs are annotated. the annotation of primary symptoms is conducted to incorporate a graded relevance information into the collection. for the annotation procedure, we develop guidelines that briefly describe the task and an online annotation tool. then, the annotation of primary symptoms is conducted by three physicians. the final label is obtained by a majority voting. based on the labels obtained from the majority voting, we assign the relevance score 2 to primary symptoms and 1 to the other symptoms, which we call relevant symptoms. in total, the dsr-collection contains relevant symptoms and primary symptoms for 20 diseases. we give an overview of the collection in table 1 . for the 20 diseases, the collection contains a total of 235 symptoms, of which 55 are labeled as primary symptom (about 25%). the top-3 most occurring symptoms are: fatigue which appears for 15 of the 20 diseases, fever which appears for 10, and coughing which appears for 7. notice that the diseases are selected from different medical disciplines: mental (e.g. depression), dental (e.g. periodontitis), digestive (e.g. appendicitis), and respiration (e.g. asthma). we calculate the inter-annotator agreement using fleiss' kappa [2] , a statistical measure to compute the agreement for three or more annotators. for the annotation of the primary symptoms, we measure a kappa value of κ = 0.61, which indicates a substantial agreement between the three annotators [6] . individual κ-values per disease are reported in table 1 . by analyzing the disagreements, we found that the annotators labeled primary symptoms with varying frequencies: the first annotator annotated on average 2.1 primary symptoms per disease, the second 2.8, and the third 3.8. vocabulary compatibility: we map each disease and symptom of the collection to the unified medical language system (umls) vocabulary. the umls is a compendium of over 100 vocabularies (e.g. icd-10, mesh, snomed-ct) that are cross-linked with each other. this makes the collection compatible with the umls vocabulary and also with each of the over 100 cross-linked vocabularies. although the different vocabularies are compatible with the collection, a fair comparison of methods is only possible when the methods utilize the same vocabulary since the vocabulary impacts the evaluation outcome. for instance, the symptom loss of appetite is categorized as a symptom in mesh; whereas, in the cross-linked umls vocabulary, it is categorized as a disease. therefore, the symptom loss of appetite can be identified when using the mesh vocabulary, but it cannot be identified when using the umls vocabulary. evaluation: we consider following evaluation metrics for the collection: recall@k, precision@k, and ndcg@k at the cutoff k = 5 and k = 10. 
recall measures how many of the relevant symptoms are retrieved, precision measures how many of the retrieved symptoms are relevant, and finally, ndcg is a standard metric to evaluate graded relevance [5] . in this section, we discuss disease-symptom extraction methods used in previous studies. a commonly used resource for the extraction of disease-symptom relations are the articles of the pubmed database. pubmed contains more than 30 million biomedical articles, including the abstract, title, and various metadata. previous work [3, 7] uses the abstracts of the pubmed articles together with rule-based approaches. in particular, hassan et al. [3] derive patterns of disease-symptom relations from dependency graphs, followed by the automatic selection of the best patterns based on proposed selection criteria. martin et al. [7] generate extraction rules automatically, which are then inspected for their viability by medical experts. xia et al. [13] design special queries that include the name and synonyms of each disease and symptom. they use these queries to return the relevant articles, and use the number of retrieved results to perform a ranking via pointwise mutual information (pmi). the mentioned studies use resources that are not publicly available, i.e., rules in [3, 7] and special queries in [13] . to enable reproducibility in future studies, we define our baselines based on the methods that only utilize publicly available resources, described in the next section. here, we first describe two recently proposed methods [10, 14] for the extraction of disease-symptom relations as our baselines. afterwards, we describe limitations of the method described in [14] and propose an adapted version in which the limitations are addressed. we apply the methods on the open-access subset of the pubmed central (pmc) database, containing 1,542,847 medical articles. to have a common representation for diseases/symptoms across methods (including an unique name and identifier), we consider the 382 symptoms and 4,787 diseases from the medical subject headings (mesh) vocabulary [14] . given the set of diseases (x) and symptoms (s), each method aims to compute a relation scoring function λ(x, s) ∈ r between a disease x ∈ x and a symptom s ∈ s. in the following, we explain each method in detail. embedding: proposed by shah et al. [10] , the method is based on the cosine similarity of the vector representations of a disease and a symptom. we first apply metamap [1] , a tool for the identification of medical concepts within a given text, to the full text of all pmc articles to substitute the identified diseases/symptoms by their unique names. then, we train a word2vec model [8] with 300 dimensions and a window size of 15, following the parameter setting in [10] . using the word embedding, the disease-symptom relation is defined as λ(x, s) = cos(e x , e s ), where e refers to the vector representation of a word. cooccur: this method, proposed by zhou et al. [14] , calculates the relation of a disease and a symptom, by measuring the degree of their co-occurrences in the mesh-keywords of medical articles. the raw co-occurrence of the disease x and symptom s, is denoted by co(x, s). the raw co-occurrence does not consider the overall appearance of each symptom across diseases. for instance, symptoms like pain or obesity tend to co-occur with many diseases, and are therefore less informative. 
hence, the raw co-occurrence is normalized by an inverse symptom frequency (isf) measure, defined as isf(s) = |x| ns , where |x| is the total number of diseases and n s is the number of diseases that co-occur with s at least in one of the articles. finally, the disease-symptom relation is defined as λ(x, s) = co(x, s) × isf(s). we compute three variants of the cooccur method: -kwd: the disease-symptom relations are computed using the mesh-keywords of the ≈1.5 million pmc articles. -kwdlarge: while kwd uses the 1.5 million pmc articles, zhou et al. [14] apply the exact same method on the ≈30 million articles of the pubmed database. while they did not evaluate the effectiveness of their diseasesymptom relation extraction method, they published their relation scores which we will evaluate in this paper. -fulltext: applying the cooccur method only on mesh-keywords has two disadvantages: first, keywords are not available for all articles (e.g. only 30% of the ≈1.5 million pmc articles have keywords) and second, usually only the core topics of an article occur as keywords. we address these limitations by proposing an adaption of the cooccur method, in which we use the full text, the title, and the keywords of the ≈1.5 million pmc articles. specifically, we adapt the computation of the co-occurrence co(x, s), as follows: we first retrieve a set of relevant articles to a disease x, where an article is relevant if the disease exists in either the keyword, or the title section of the article. given these relevant articles and a symptom s, we compute the adapted co-occurrence co(x, s), which is the number of relevant articles in that the symptom occurs in the full text. the identification of the diseases in the title and symptoms in the full text is done using the metamap tool [1] . we now compare the disease-symptom extraction baselines on the proposed dsrcollection. the results for various evaluation metrics are shown in table 2 . the fulltext-variant of the cooccur method outperforms the other baselines on all evaluation metrics. this demonstrates the high effectiveness of our proposed adaption to the cooccur method. further, we see a clear advantage of the cooccur-method with meshkeywords from ≈30 million pubmed articles as the resource (kwdlarge) -in comparison to the same method with keywords from approximately 1.5 million pmc articles (kwd). this highlights the importance of the number of input samples to the method. error analysis: a common error source is a result of the fine granularity of the symptoms in the medical vocabularies. for example, the utilized mesh vocabulary contains the symptoms abdominal pain and abdomen, acute 4 . both symptoms can be found in the top ranks of the evaluated methods for the disease appendicitis (see table 3 ). however, since the corpus is not labeled on such a fine-grained level, the symptom abdomen, acute is counted as a false positive. another error source is a result of the bias in medical articles towards specific disease-symptom relationships. for instance, between the symptom obesity and periodontitis 5 a special relationship exists, which is the topic of various publications. despite obesity not being a characteristic symptom of a periodontitis, all methods return the symptom in the top-3 ranks. a promising research direction is the selective extraction of symptoms from biomedical literature by also considering the context (e.g. in a sentence) in that a disease/symptom appears. 
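as a sketch of the fulltext variant described above (not the exact implementation), the scoring can be written as follows; the article representation with keyword, title and full-text term sets is assumed to be the output of a concept tagger such as metamap, and the function name is our own.

```python
from collections import defaultdict

def cooccur_fulltext(articles, diseases, symptoms):
    """lambda(x, s) = co(x, s) * isf(s) for the fulltext variant.

    articles: iterable of dicts with the term sets 'keywords', 'title' and
              'fulltext', already mapped to unique concept names (e.g., by
              running MetaMap over the articles beforehand).
    diseases, symptoms: sets of MeSH disease and symptom names.
    """
    co = defaultdict(int)
    for art in articles:
        # an article is relevant to disease x if x occurs in its keywords
        # or title; co(x, s) counts relevant articles whose full text
        # mentions symptom s
        relevant_for = (art["keywords"] | art["title"]) & diseases
        symptoms_in_text = art["fulltext"] & symptoms
        for x in relevant_for:
            for s in symptoms_in_text:
                co[(x, s)] += 1
    # inverse symptom frequency isf(s) = |X| / n_s, where n_s is the number
    # of diseases that co-occur with s in at least one article
    diseases_per_symptom = defaultdict(set)
    for (x, s) in co:
        diseases_per_symptom[s].add(x)
    return {(x, s): c * len(diseases) / len(diseases_per_symptom[s])
            for (x, s), c in co.items()}
```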
effective mapping of biomedical text to the umls metathesaurus: the metamap program measuring nominal scale agreement among many raters extracting disease-symptom relationships by learning syntactic patterns from dependency graphs cumulated gain-based evaluation of ir techniques binary and graded relevance in ir evaluations -comparison of the effects on ranking of ir systems the measurement of observer agreement for categorical data symptom extraction issue distributed representations of words and phrases and their compositionality automated medical diagnosis by ranking clusters across the symptom-disease network neural networks for mining the associations between diseases and symptoms in clinical notes enhancing ontology-driven diagnostic reasoning with a symptom-dependency-aware naïve bayes classifier evaluating wikipedia as a source of information for disease understanding mining disease-symptom relation from massive biomedical literature and its application in severe disease diagnosis human symptoms-disease network we introduced the disease-s ymptom relation collection (dsr-collection) for the evaluation of graded disease-symptom relations. we provided baseline results for two recent methods, one based on word embeddings and the second on the cooccurrence of mesh-keywords of medical articles. we proposed an adaption to the co-occurrence method to make it applicable to the full text of medical articles and showed significant improvement of effectiveness over the other methods. key: cord-020830-97xmu329 authors: ghanem, bilal; karoui, jihen; benamara, farah; rosso, paolo; moriceau, véronique title: irony detection in a multilingual context date: 2020-03-24 journal: advances in information retrieval doi: 10.1007/978-3-030-45442-5_18 sha: doc_id: 20830 cord_uid: 97xmu329 this paper proposes the first multilingual (french, english and arabic) and multicultural (indo-european languages vs. less culturally close languages) irony detection system. we employ both feature-based models and neural architectures using monolingual word representation. we compare the performance of these systems with state-of-the-art systems to identify their capabilities. we show that these monolingual models trained separately on different languages using multilingual word representation or text-based features can open the door to irony detection in languages that lack of annotated data for irony. figurative language makes use of figures of speech to convey non-literal meaning [2, 16] . it encompasses a variety of phenomena, including metaphor, humor, and irony. we focus here on irony and uses it as an umbrella term that covers satire, parody and sarcasm. irony detection (id) has gained relevance recently, due to its importance to extract information from texts. for example, to go beyond the literal matches of user queries, veale enriched information retrieval with new operators to enable the non-literal retrieval of creative expressions [40] . also, the performances of sentiment analysis systems drastically decrease when applied to ironic texts [5, 19] . most related work concern english [17, 21] with some efforts in french [23] , portuguese [7] , italian [14] , dutch [26] , hindi [37] , spanish variants [31] and arabic [11, 22] . bilingual id with one model per language has also been explored, like english-czech [32] and english-chinese [38] , but not within a cross-lingual perspective. 
in social media, such as twitter, specific hashtags (#irony, #sarcasm) are often used as gold labels to detect irony in a supervised learning setting. although recent studies pointed out the issue of false-alarm hashtags in selflabeled data [20] , id via hashtag filtering provides researchers positive examples with high precision. on the other hand, systems are not able to detect irony in languages where such filtering is not always possible. multilingual prediction (either relying on machine translation or multilingual embedding methods) is a common solution to tackle under-resourced languages [6, 33] . while multilinguality has been widely investigated in information retrieval [27, 34] and several nlp tasks (e.g., sentiment analysis [3, 4] and named entity recognition [30] ), no one explored it for irony. we aim here to bridge the gap by tackling id in tweets from both multilingual (french, english and arabic) and multicultural perspectives (indo-european languages whose speakers share quite the same cultural background vs. less culturally close languages). our approach does not rely either on machine translation or parallel corpora (which are not always available), but rather builds on previous corpus-based studies that show that irony is a universal phenomenon and many languages share similar irony devices. for example, karoui et al. [24] concluded that their multi-layer annotated schema, initially used to annotate french tweets, is portable to english and italian, observing relatively the same tendencies in terms of irony categories and markers. similarly, chakhachiro [8] studies irony in english and arabic, and shows that both languages share several similarities in the rhetorical (e.g., overstatement), grammatical (e.g., redundancy) and lexical (e.g., synonymy) usage of irony devices. the next step now is to show to what extent these observations are still valid from a computational point of view. our contributions are: i. a new freely available corpus of arabic tweets manually annotated for irony detection 1 . ii. monolingual id: we propose both feature-based models (relying on language-dependent and language-independent features) and neural models to measure to what extent id is language dependent. iii. cross-lingual id: we experiment using cross-lingual word representation by training on one language and testing on another one to measure how the proposed models are culture-dependent. our results are encouraging and open the door to id in languages that lack of annotated data for irony. arabic dataset (ar = 11,225 tweets). our starting point was the corpus built by [22] that we extended to different political issues and events related to the middle east and maghreb that hold during the years 2011 to 2018. tweets were collected using a set of predefined keywords (which targeted specific political figures or events) and containing or not arabic ironic hashtags 2 . the collection process resulted in a set of 6,809 ironic tweets (i) vs. 15,509 non ironic (ni) written using standard (formal) and different arabic language varieties: egypt, gulf, levantine, and maghrebi dialects. to investigate the validity of using the original tweets labels, a sample of 3,000 i and 3,000 ni was manually annotated by two arabic native speakers which resulted in 2,636 i vs. 2,876 ni. the inter-annotator agreement using cohen's kappa was 0.76, while the agreement score between the annotators' labels and the original labels was 0.6. 
agreements being relatively good knowing the difficulty of the task, we sampled 5,713 instances from the original unlabeled dataset to our manually labeled part. the added tweets have been manually checked to remove duplicates, very short tweets and tweets that depend on external links, images or videos to understand their meaning. french dataset (fr = 7,307 tweets). we rely on the corpus used for the deft 2017 french shared task on irony [5] which consists of tweets relative to a set of topics discussed in the media between 2014 and 2016 and contains topic keywords and/or french irony hashtags (#ironie, #sarcasme). tweets have been annotated by three annotators (after removing the original labels) with a reported cohen's kappa of 0.69. english dataset (en = 11,225 tweets). we use the corpus built by [32] which consists of 100,000 tweets collected using the hashtag #sarcasm. it was used as benchmark in several works [13, 18] . we sliced a subset of approximately 11,200 tweets to match the sizes of the other languages' datasets. table 1 shows the tweet distribution in all corpora. across the three languages, we keep a similar number of instances for train and test sets to have fair cross-lingual experiments as well (see sect. 4). also, for french, we use the original dataset without any modification, keeping the same number of records for train and test to better compare with state-of-the-art results. for the classes distribution (ironic vs. non ironic), we do not choose a specific ratio but we use the resulted distribution from the random shuffling process. it is important to note that our aim is not to outperform state-of-the-art models in monolingual id but to investigate which of the monolingual architectures (neural or feature-based) can achieve comparable results with existing systems. the result can show which kind of features works better in the monolingual settings and can be employed to detect irony in a multilingual setting. in addition, it can show us to what extend id is language dependent by comparing their results to multilingual results. two models have been built, as explained below. prior to learning, basic preprocessing steps were performed for each language (e.g., removing foreign characters, ironic hashtags, mentions, and urls). feature-based models. we used state-of-the-art features that have shown to be useful in id: some of them are language-independent (e.g., punctuation marks, positive and negative emoticons, quotations, personal pronouns, tweet's length, named entities) while others are language-dependent relying on dedicated lexicons (e.g., negation, opinion lexicons, opposition words). several classical machine learning classifiers were tested with several feature combinations, among them random forest (rf) achieved the best result with all features. neural model with monolingual embeddings. we used convolutional neural network (cnn) network whose structure is similar to the one proposed by [25] . for the embeddings, we relied on arav ec [36] for arabic, fasttext [15] for french, and word2vec google news [29] for english 3 . for the three languages, the size of the embeddings is 300 and the embeddings were fine-tuned during the training process. the cnn network was tuned with 20% of the training corpus using the hyperopt 4 library. table 2 shows the results obtained when using train-test configurations for each language. for english, our results, in terms of macro f-score (f ), were not comparable to those of [32, 39] , as we used 11% of the original dataset. 
for french, our scores are in line with those reported in state of the art (cf. best system in the irony shared task achieved f = 78.3 [5] ). they outperform those obtained for arabic (a = 71.7) [22] and are comparable to those recently reported in the irony detection shared task in arabic tweets [11, 12] (f = 84.4). overall, the results show that semantic-based information captured by the embedding space are more productive comparing to standard surface and lexicon-based features. we use the previous cnn architecture with bilingual embedding and the rf model with surface features (e.g., use of personal pronoun, presence of interjections, emoticon or specific punctuation) 5 to verify which pair of the three languages: (a) has similar ironic pragmatic devices, and (b) uses similar textbased pattern in the narrative of the ironic tweets. as continuous word embedding spaces exhibit similar structures across (even distant) languages [28] , we use a multilingual word representation which aims to learn a linear mapping from a source to a target embedding space. many methods have been proposed to learn this mapping such as parallel data supervision and bilingual dictionaries [28] or unsupervised methods relying on monolingual corpora [1, 10, 41] . for our experiments, we use conneau et al.'s approach as it showed superior results with respect to the literature [10] . we perform several experiments by training on one language (lang 1 ) and testing on another one (lang 2 ) (henceforth lang 1 → lang 2 ). we get 6 configurations, plus two others to evaluate how irony devices are expressed cross-culturally, i.e. in european vs. non european languages. in each experiment, we took 20% from the training to validate the model before the testing process. table 3 presents the results. from a semantic perspective, despite the language and cultural differences between arabic and french languages, cnn results show a high performance comparing to the other languages pairs when we train on each of these two languages and test on the other one. similarly, for the french and english pair, but when we train on french they are quite lower. we have a similar case when we train on arabic and test on english. we can justify that by, the language presentation of the arabic and french tweets are quite informal and have many dialect words that may not exist in the pretrained embeddings we used comparing to the english ones (lower embeddings coverage ratio), which become harder for the cnn to learn a clear semantic pattern. another point is the presence of arabic dialects, where some dialect words may not exist in the multilingual pretrained embedding model that we used. on the other hand, from the text-based perspective, the results show that the text-based features can help in the case when the semantic aspect shows weak detection; this is the case for the ar −→ en configuration. it is worthy to mention that the highest result we get in this experiment is from the en → fr pair, as both languages use latin characters. finally, when investigating the relatedness between european vs. non european languages (cf. (en/fr) → ar), we obtain similar results than those obtained in the monolingual experiment (macro f-score 62.4 vs. 68.0) and best results are achieved by ar → (en/fr). this shows that there are pragmatic devices in common between both sides and, in a similar way, similar text-based patterns in the narrative way of the ironic tweets. this paper proposes the first multilingual id in tweets. 
we show that simple monolingual architectures (either neural or feature-based) trained separately on each language can be successfully used in a multilingual setting providing a crosslingual word representation or basic surface features. our monolingual results are comparable to state of the art for the three languages. the cnn architecture trained on cross-lingual word representation shows that irony has a certain similarity between the languages we targeted despite the cultural differences which confirm that irony is a universal phenomena, as already shown in previous linguistic studies [9, 24, 35] . the manual analysis of the common misclassified tweets across the languages in the multilingual setup, shows that classification errors are due to three main factors. (1) first, the absence of context where writers did not provide sufficient information to capture the ironic sense even in the monolingual setting, as in (let's start again, get off get off mubarak!! ) where the writer mocks the egyptian revolution, as the actual president "sisi" is viewed as mubarak's fellows. (2) second, the presence of out of vocabulary (oov) terms because of the weak coverage of the multilingual embeddings which make the system fails to generalize when the oov set of unseen words is large during the training process. we found tweets in all the three languages written in a very informal way, where some characters of the words were deleted, duplicated or written phonetically (e.g phat instead of fat). (3) another important issue is the difficulty to deal with the arabic language. arabic tweets are often characterized by non-diacritised texts, a large variations of unstandardized dialectal arabic (recall that our dataset has 4 main varieties, namely egypt, gulf, levantine, and maghrebi), presence of transliterated words (e.g. the word table becomes (tabla)), and finally linguistic code switching between modern standard arabic and several dialects, and between arabic and other languages like english and french. we found some tweets contain only words from one of the varieties and most of these words do not exist in the arabic embeddings model. for example in (since many days mubarak didn't die .. is he sick or what? #egypt), only the words (day), (mubarak), and (he) exist in the embeddings. clearly, considering only these three available words, we are not able to understand the context or the ironic meaning of the tweet. to conclude, our multilingual experiments confirmed that the door is open towards multilingual approaches for id. furthermore, our results showed that id can be applied to languages that lack of annotated data. our next step is to experiment with other languages such as hindi and italian. 
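the cross-lingual setup above can be sketched with a much simpler classifier than the cnn actually used, just to show where the shared embedding space enters; the aligned vector dictionaries, the whitespace tokenization and the choice of logistic regression are all simplifying assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def tweet_vectors(tweets, aligned_vectors, dim=300):
    """Average cross-lingually aligned word vectors per tweet."""
    matrix = np.zeros((len(tweets), dim))
    for i, tweet in enumerate(tweets):
        vecs = [aligned_vectors[w] for w in tweet.split() if w in aligned_vectors]
        if vecs:  # tweets that are fully out of vocabulary keep the zero vector
            matrix[i] = np.mean(vecs, axis=0)
    return matrix

def train_lang1_test_lang2(train_tweets, train_labels, test_tweets,
                           vectors_lang1, vectors_lang2):
    """Train an irony classifier on language 1 and apply it to language 2.

    vectors_lang1 and vectors_lang2 are assumed to be monolingual embeddings
    already mapped into one shared space (e.g., a MUSE-style alignment), so
    the classifier sees comparable features for both languages."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(tweet_vectors(train_tweets, vectors_lang1), train_labels)
    return clf.predict(tweet_vectors(test_tweets, vectors_lang2))
```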
unsupervised neural machine translation irony as relevant inappropriateness comparative experiments using supervised learning and machine translation for multilingual sentiment analysis bilingual sentiment embeddings: joint projection of sentiment across languages analyse d'opinion et langage figuratif dans des tweets présentation et résultats du défi fouille de textes deft2017 multilingual natural language processing applications: from theory to practice clues for detecting irony in user-generated contents: oh s "so easy";-) translating irony in political commentary texts from english into arabic irony as indirectness cross-linguistically: on the scope of generic mechanisms word translation without parallel data idat@fire2019: overview of the track on irony detection in arabic tweets idat@fire2019: overview of the track on irony detection in arabic tweets ldr at semeval-2018 task 3: a low dimensional text representation for irony detection annotating irony in a novel italian corpus for sentiment analysis learning word vectors for 157 languages logic and conversation semeval-2018 task 3: irony detection in english tweets sentiment polarity classification of figurative language: exploring the role of irony-aware and multifaceted affect features irony detection in twitter: the role of affective content disambiguating false-alarm hashtag usages in tweets for irony detection irony detection with attentive recurrent neural networks soukhria: towards an irony detection system for arabic in social media towards a contextual pragmatic model to detect irony in tweets exploring the impact of pragmatic phenomena on irony detection in tweets: a multilingual corpus study convolutional neural networks for sentence classification the perfect solution for detecting sarcasm in tweets# not unsupervised cross-lingual information retrieval using monolingual data only efficient estimation of word representations in vector space linguistic regularities in continuous space word representations improving multilingual named entity recognition with wikipedia entity type mapping overview of the task on irony detection in spanish variants sarcasm detection on czech and english twitter a survey of cross-lingual embedding models cross-lingual learning-torank with shared representations a contrastive study of ironic expressions in english and arabic aravec: a set of arabic word embedding models for use in arabic nlp a corpus of english-hindi code-mixed tweets for sarcasm detection chinese irony corpus construction and ironic structure analysis reasoning with sarcasm by reading inbetween creative language retrieval: a robust hybrid of information retrieval and linguistic creativity unsupervised cross-lingual word embedding by multilingual neural language models key: cord-020815-j9eboa94 authors: kamphuis, chris; de vries, arjen p.; boytsov, leonid; lin, jimmy title: which bm25 do you mean? a large-scale reproducibility study of scoring variants date: 2020-03-24 journal: advances in information retrieval doi: 10.1007/978-3-030-45442-5_4 sha: doc_id: 20815 cord_uid: j9eboa94 when researchers speak of bm25, it is not entirely clear which variant they mean, since many tweaks to robertson et al.’s original formulation have been proposed. when practitioners speak of bm25, they most likely refer to the implementation in the lucene open-source search library. does this ambiguity “matter”? we attempt to answer this question with a large-scale reproducibility study of bm25, considering eight variants. 
experiments on three newswire collections show that there are no significant effectiveness differences between them, including lucene’s often maligned approximation of document length. as an added benefit, our empirical approach takes advantage of databases for rapid ir prototyping, which validates both the feasibility and methodological advantages claimed in previous work. bm25 [8] is perhaps the most well-known scoring function for "bag of words" document retrieval. it is derived from the binary independence relevance model to include within-document term frequency information and document length normalization in the probabilistic framework for ir [7] . although learning-to-rank approaches and neural ranking models are widely used today, they are typically deployed as part of a multi-stage reranking architecture, over candidate documents supplied by a simple term-matching method using traditional inverted indexes [1] . often, this is accomplished using bm25, and thus this decades-old scoring function remains a critical component of search applications today. as many researchers have previously observed, e.g., trotman et al. [11] , the referent of bm25 is quite ambiguous. there are, in fact, many variants of the scoring function: beyond the original version proposed by robertson et al. [8] , many variants exist that include small tweaks by subsequent researchers. also, researchers using different ir systems report (sometimes quite) different effectiveness measurements for their implementation of bm25, even on the same test collections; consider for example the results reported in osirrc 2019, the open-source ir replicability challenge at sigir 2019 [2] . furthermore, bm25 is parameterized in terms of k 1 and b (plus k 2 , k 3 in the original formulation), and researchers often neglect to include the parameter settings in their papers. our goal is a large-scale reproducibility study to explore the nuances of different variants of bm25 and their impact on retrieval effectiveness. we include in our study the specifics of the implementation of bm25 in the lucene open-source search library, a widely-deployed variant "in the real world". outside of a small number of commercial search engine companies, lucene-either stand-alone or via higher-level platforms such as solr and elasticsearch-has today become the de facto foundation for building search applications in industry. our approach enlists the aid of relational databases for rapid prototyping, an idea that goes back to the 1990s and was more recently revived by mühleisen et al. [6] . adding or revising scoring functions in any search engine requires custom code within some framework for postings traversal, making the exploration of many different scoring functions (as in our study) a tedious and error-prone process. as an alternative, it is possible to "export" the inverted index to a relational database and recast the document ranking problem into a database (specifically, sql) query. varying the scoring function, then, corresponds to varying the expression for calculating the score in the sql query, allowing us to explore different bm25 variants by expressing them declaratively (instead of programming imperatively). we view our work as having two contributions: -we conducted a large-scale reproducibility study of bm25 variants, focusing on the lucene implementation and variants described by trotman et al. [11] . 
their findings are confirmed: effectiveness differences in ir experiments are unlikely to be the result of the choice of bm25 variant a system implements. -from the methodological perspective, our work can be viewed as reproducing and validating the work of mühleisen et al. [6], the most recent advocate of using databases for rapid ir prototyping. lucene (default) differs from the formulation of robertson et al. [8] in two ways. first, whereas the idf component of robertson et al. is negative when $df_t > N/2$, lucene adds a constant one before calculating the log value. second, the document length used in the scoring function is compressed (in a lossy manner) into a one-byte value, denoted $l_{d_{lossy}}$. with only 256 distinct document lengths, lucene can pre-compute the value of $k_1 \cdot (1 - b + b \cdot (l_{d_{lossy}}/l_{avg}))$ for each possible length, resulting in fewer computations at query time. lucene (accurate) represents our attempt to measure the impact of lucene's lossy document length encoding: we implemented a variant that uses exact document lengths but is otherwise identical to the lucene default. atire [10] implements the idf component of bm25 as $\log(N/df_t)$, which also avoids negative values. the tf component is multiplied by $k_1 + 1$ to make it look more like the classic rsj weight; this has no effect on the resulting ranked list, as all scores are scaled linearly by this factor. bm25l [5] builds on the observation that bm25 penalizes longer documents too much compared to shorter ones. its idf component differs, to avoid negative values, and the tf component is reformulated in terms of $c_{td} = tf_{td} / (1 - b + b \cdot (l_d/l_{avg}))$; the $c_{td}$ component is further modified by adding a constant $\delta$ to it, boosting the score for longer documents. the authors report using $\delta = 0.5$ for highest effectiveness. bm25+ [4] encodes a general approach for dealing with the issue that ranking functions unfairly prefer shorter documents over longer ones. the proposal is to add a lower-bound bonus when a term appears at least once in a document; the difference with bm25l is that the constant $\delta$ is added to the tf component directly. the idf component is again changed to a variant that disallows negative values. bm25-adpt [3] varies $k_1$ per term (i.e., it uses term-specific $k_1$ values). in order to determine the optimal value of $k_1$, the method starts by estimating the probability of a term occurring at least once in a document as $(df_r + 0.5)/(N + 1)$, and the probability of the term occurring one more time as $(df_{r+1} + 0.5)/(df_r + 1)$. the information gain of a term occurring $r + 1$ instead of $r$ times is defined as $G^r_q = \log_2\big((df_{r+1} + 0.5)/(df_r + 1)\big) - \log_2\big((df_t + 0.5)/(N + 1)\big)$, where $df_r$ is defined as $|\{d : c_{td} \geq r - 0.5\}|$ if $r > 1$, $df_t$ if $r = 1$, and $N$ if $r = 0$ ($c_{td}$ is the same as in bm25l). the information gain is calculated for $r \in \{0, \ldots, T\}$, until $G^r_q > G^{r+1}_q$. the optimal value for $k_1$ is then the one under which the bm25 term-frequency saturation curve best approximates the normalized information gain $G^r_q / G^1_q$ [3]; essentially, this gives a value of $k_1$ that maximizes the information gain for that specific term. $k_1$ and $G^1_q$ are then plugged into the bm25-adpt formula. we found that the optimal value of $k_1$ is actually not defined for about 90% of the terms: a unique optimal value for $k_1$ only exists when $r > 1$ while calculating $G^r_q$, and for many terms, especially those with a low $df$, $G^r_q > G^{r+1}_q$ occurs before $r > 1$. in these cases, picking different values for $k_1$ has virtually no effect on retrieval effectiveness. for undefined values, we set $k_1$ to 0.001, the same as trotman et al. [11].
tf$_{l \circ \delta \circ p}\times$idf [9] models the non-linear gain of a term occurring multiple times in a document as $1 + \log(1 + \log(tf_{td}))$. to ensure that terms occurring at least once in a document get boosted, the approach adds a fixed component $\delta$, following bm25+. these parts are combined into the tf component using the length-normalized term frequency $tf_{td} / (1 - b + b \cdot (l_d/l_{avg}))$, and the same idf component as in bm25+ is used. our experiments were conducted using anserini (v0.6.0) on java 11 to create an initial index, and subsequently using relational databases for rapid prototyping, a setup we dub "olddog" after mühleisen et al. [6]; following that work we use monetdb as well. evaluations with lucene (default) and lucene (accurate) were performed directly in anserini; the latter was based on previously-released code that we updated and incorporated into anserini. the inverted index was exported from lucene to olddog, ensuring that all experiments share exactly the same document processing pipeline (tokenization, stemming, stopword removal, etc.). while exporting the inverted index, we precalculate all $k_1$ values for bm25-adpt as suggested by lv and zhai [3]. as an additional verification step, we implemented both lucene (default) and lucene (accurate) in olddog and compared the results to the output from anserini; we are able to confirm that the results are the same, setting aside unavoidable differences related to floating point precision. all bm25 variants are then implemented in olddog as minor variations upon the original sql query provided in mühleisen et al. [6]. the term-specific parameter optimization for the adpt variant was already calculated during the index extraction stage, allowing us to upload the optimal $(t, k_1)$ pairs and directly use the term-specific $k_1$ values in the sql query. the advantage of our experimental methodology is that we did not need to implement a single new ranking function from scratch. all the sql variants implemented for this paper can be found on github. the experiments use three trec newswire test collections: trec disks 4 and 5, excluding congressional record, with topics and relevance judgments from the trec 2004 robust track (robust04); the new york times annotated corpus, with topics and relevance judgments from the trec 2017 common core track (core17); and the trec washington post corpus, with topics and relevance judgments from the trec 2018 common core track (core18). following standard experimental practice, we assess ranked list output in terms of average precision (ap) and precision at rank 30 (p@30). the parameters shared by all models are set to $k_1 = 0.9$ and $b = 0.4$, anserini's defaults; the parameter $\delta$ is set to the value reported as best in the corresponding source publication. table 2 presents the effectiveness scores for the implemented retrieval functions on all three test collections. all experiments were run on a linux desktop (fedora 30, kernel 5.2.18, selinux enabled) with 4 cores (intel xeon cpu e3-1226 v3 @ 3.30 ghz) and 16 gb of main memory; the monetdb 11.33.11 server was compiled from source using the --enable-optimize flag. table 3 presents the average retrieval time per query in milliseconds (without standard deviation for anserini, which does not report per-query times). monetdb uses all cores for both inter- and intra-query parallelism, while anserini is single-threaded.
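outside of sql, the same "swap only the scoring expression" idea can be illustrated in a few lines; the statistics passed in (term frequency, document frequency, document length) are assumed to come from the exported index, and the shared tf component below is a simplification rather than the exact per-variant definitions.

```python
import math

def idf_robertson(N, df):
    return math.log((N - df + 0.5) / (df + 0.5))

def idf_lucene(N, df):
    # adds one inside the logarithm, so the value can never become negative
    return math.log(1 + (N - df + 0.5) / (df + 0.5))

def idf_atire(N, df):
    return math.log(N / df)

def score_term(tf, dl, avgdl, N, df, idf_fn, k1=0.9, b=0.4):
    """One query term, one document; only the idf expression is swapped.

    The (k1 + 1) factor only rescales scores and does not change rankings."""
    norm = k1 * (1 - b + b * dl / avgdl)
    return idf_fn(N, df) * tf * (k1 + 1) / (tf + norm)

# a term with tf=3 in a document of length 120 (average length 100),
# occurring in 1,000 out of 500,000 documents
for name, idf_fn in [("robertson", idf_robertson),
                     ("lucene", idf_lucene),
                     ("atire", idf_atire)]:
    print(name, round(score_term(3, 120, 100, 500_000, 1_000, idf_fn), 3))
```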
the observed differences in effectiveness are very small and can be fully attributed to variations in the scoring function; our methodology fixes all other parts of the indexing pipeline (tag cleanup, tokenization, stopwords, etc.). both an anova and tukey's hsd show no significant differences between any variant, on all test collections. this confirms the findings of trotman et al. [11] : across the ir literature, we find that differences due to more mundane settings (such as the choice of stopwords) are often larger than the differences we observe here. although we find no significant improvements over the original robertson et al. [8] formulation, it might still be worthwhile to use a variant of bm25 that avoids negative ranking scores. comparing lucene (default) and lucene (accurate), we find negligible differences in effectiveness. however, the differences in retrieval time are also negligible, which calls into question the motivation behind the original length approximation. currently, the similarity function and thus the document length encoding are defined at index time. storing exact document lengths would allow for different ranking functions to be swapped at query time more easily, as no information would be discarded at index time. accurate document lengths might additionally benefit downstream modules that depend on lucene. we therefore suggest that lucene might benefit from storing exact document lengths. in summary, this work describes a double reproducibility study-we methodologically validate the usefulness of databases for ir prototyping claimed by mühleisen et al. [6] and performed a large-scale study of bm25 to confirm the findings of trotman et al. [11] . returning to our original motivating question regarding the multitude of bm25 variants: "does it matter?", we conclude that the answer appears to be "no, it does not". effectiveness/efficiency tradeoffs for candidate generation in multi-stage retrieval architectures ceur workshop proceedings of the open-source ir replicability challenge (osirrc 2019) at sigir 2009 adaptive term frequency normalization for bm25 lower-bounding term frequency normalization when documents are very long old dogs are great at new tricks: column stores for ir prototyping the probabilistic relevance framework: bm25 and beyond okapi at trec-3 composition of tf normalizations: new insights on scoring functions for ad hoc ir towards an efficient and effective search engine improvements to bm25 and language models examined acknowledgements. this work is part of the research program commit2data with project number 628.011.001, which is (partly) financed by the nwo. additional support was provided by the natural sciences and engineering research council (nserc) of canada. key: cord-020814-1ty7wzlv authors: berrendorf, max; faerman, evgeniy; melnychuk, valentyn; tresp, volker; seidl, thomas title: knowledge graph entity alignment with graph convolutional networks: lessons learned date: 2020-03-24 journal: advances in information retrieval doi: 10.1007/978-3-030-45442-5_1 sha: doc_id: 20814 cord_uid: 1ty7wzlv in this work, we focus on the problem of entity alignment in knowledge graphs (kg) and we report on our experiences when applying a graph convolutional network (gcn) based model for this task. variants of gcn are used in multiple state-of-the-art approaches and therefore it is important to understand the specifics and limitations of gcn-based models. 
despite serious efforts, we were not able to fully reproduce the results from the original paper and after a thorough audit of the code provided by the authors, we concluded that their implementation is different from the architecture described in the paper. in addition, several tricks are required to make the model work and some of them are not very intuitive. we provide an extensive ablation study to quantify the effects these tricks and changes of architecture have on the final performance. furthermore, we examine current evaluation approaches and systematize available benchmark datasets. we believe that people interested in kg matching might profit from our work, as well as novices entering the field. (code: https://github.com/valentyn1997/kg-alignment-lessons-learned). the success of information retrieval in a given task critically depends on the quality of the underlying data. another issue is that in many domains knowledge bases are spread across various data sources [14] and it is crucial to be able to combine information from different sources. in this work, we focus on knowledge bases in the form of knowledge graphs (kgs), which are particularly suited for information retrieval [17]. joining information from different kgs is non-trivial, as there is no unified schema or vocabulary. the goal of the entity alignment task is to overcome this problem by learning a matching between entities in different kgs. in the typical setting some of the alignments are known in advance (seed alignments) and the task is therefore supervised. more formally, we are given graphs g_l = (v_l, e_l) and g_r = (v_r, e_r) with a seed alignment a = {(l_i, r_i)}_i ⊆ v_l × v_r. it is commonly assumed that an entity v ∈ v_l can match at most one entity v' ∈ v_r. thus the goal is to infer alignments for the remaining nodes only. graph convolutional networks (gcn) [7, 9], which have recently become increasingly popular, are at the core of state-of-the-art methods for entity alignment in kgs [3, 6, 22, 24, 27]. in this paper, we thoroughly analyze one of the first gcn-based entity alignment methods, gcn-align [22]. since the other methods we are studying can be considered as extensions of this first paper and have a similar architecture, our goal is to understand the importance of its individual components and architecture choices. in summary, our contribution is as follows: 1. we investigate the reproducibility of the published results of a recent gcn-based method for entity alignment and uncover differences between the method's description in the paper and the authors' implementation. 2. we perform an ablation study to demonstrate the individual components' contribution. 3. we apply the method to numerous additional datasets of different sizes to investigate the consistency of results across datasets. in this section, we review previous work on entity alignment for knowledge graphs and revisit the current evaluation process. we believe that this is useful for practitioners, since we discover some pitfalls, especially when implementing evaluation scores and selecting datasets for comparison. an overview of methods, datasets and metrics is provided in table 1. methods. while the problem of entity alignment in knowledge graphs has historically been tackled by researching vocabularies which are as broad as possible and establishing them as a standard, recent approaches take a more data-driven view.
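to make the task setup just formalized concrete, here is a minimal sketch (ours, not the authors' code) of how a seed alignment can be split and how alignments for the remaining nodes can be predicted and scored by nearest-neighbour search in a shared embedding space. the split ratio, the cosine similarity, the left-to-right direction only, and ranking against all right-graph entities are simplifying assumptions; restricting candidates to test-alignment entities, as is common practice, would be a one-line change.

```python
import numpy as np

def split_seed_alignment(seed_pairs, train_fraction=0.3, seed=0):
    """seed_pairs: list of (left_id, right_id) tuples; a random subset is used
    for training, the rest is held out for evaluation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(seed_pairs))
    cut = int(train_fraction * len(seed_pairs))
    return [seed_pairs[i] for i in idx[:cut]], [seed_pairs[i] for i in idx[cut:]]

def similarity(emb_left, emb_right):
    """cosine similarity between all left and right entities; emb_* are
    (num_entities, dim) arrays produced by whatever embedding model is used."""
    l = emb_left / np.linalg.norm(emb_left, axis=1, keepdims=True)
    r = emb_right / np.linalg.norm(emb_right, axis=1, keepdims=True)
    return l @ r.T

def hits_at_k(sim, test_pairs, k=1):
    """fraction of held-out pairs (i, j) for which j is among the top-k right
    candidates for i (left-to-right direction only, as a simplification)."""
    hits = sum(int(j in np.argsort(-sim[i])[:k]) for i, j in test_pairs)
    return hits / len(test_pairs)
```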
early methods use classical knowledge graph link prediction models such as transe [2] to embed the entities of the individual knowledge graphs using an intra-kg link prediction loss, and differ in what they do with the aligned entities. for instance, mtranse [5] learns a linear transformation between the embedding spaces of the individual graphs using an l 2 -loss. bootea [19] adopts a bootstrapping approach and iteratively labels the most likely alignments to utilize them for further training. in addition to the alignment loss, embeddings of aligned entities are swapped regularly to calibrate embedding spaces against each other. sea [15] learns a mapping between embedding spaces in both directions and additionally adds a cycle-consistency loss. thereby, the distance between the original embedding of an entity, and the result of translating this embedding to the opposite space and back again, is penalized. iptranse [26] embeds both kgs into the same embedding space and uses a margin-based loss to enforce the embeddings of aligned entities to become similar. rsn [8] generates sequences using different types of random walks which can move between graphs when visiting aligned entities. the generated sequences are feed to an adapted recurrent model. jape [18] , kdcoe [4] , multike [25] and attre [20] utilize attributes available for some entities and additional information like the names of entities and relationships. graph convolutional network (gcn) based models [3, 6, 22, 24, 27] 1 have in common that they use gcn to create node representations by aggregating node representations together with representations of their neighbors. most of gcn approaches do not distinguish between different relations and either consider all neighbors equally [6, 22, 24] or use attention [3] to weight the representations of the neighbors for the aggregation. datasets. the datasets used by entity alignments methods are generally based on large-scale open-source data sources such as dbpedia [1] , yago [13] , or wikidata [23] . while there is the dwy-100k dataset, which comprises 100 k aligned entities across the three aforementioned individual knowledge graphs, most of the datasets, such as dbp15k, or wk3l are generated from a single multi-lingual database. there, subsets are formed according to a specific language, and entities which are linked across languages are used as alignments. a detailed description of most-used datasets can be found in table 2 . as an interesting observation we found out that all papers which evaluate on dbp15, do not evaluate on the full dbp15k dataset 2 (which we refer to as dbp15k (full)), but rather use a smaller subset provided by the authors of jape [18] in their github repository 3 , which we call dbp15k (jape). the smaller subsets were created by selecting a portion of entities (around 20k of 100k) which are popular, i.e. appear in many triples as head or tail. the number of aligned entities stays the same (15k). as [18] only reports the dataset statistics of the larger dataset, and does not mention the reduction of the dataset, subsequent papers also report the statistics of the larger dataset, although experiments use the smaller variant [3, 18, 19, 22, 26] . as the metrics rely on absolute ranks, the numbers are better than on the full dataset (cf. table 3 ). scores. it is common practice to only consider the entities being part of the test alignment as potential matching candidates. 
we argue, however, that ignoring entities exclusive to a single graph as potential candidates does not reflect a realistic scenario well. table 2 gives an overview of the used datasets with their sizes in the number of triples (edges), entities (nodes), relations (different edge types) and alignments. for wk3l, the alignment is provided as a directed mapping on the entity level; however, there are additional triple alignments. following common practice, e.g. [15], we can assume that an alignment should be symmetric and that we can extract entity alignments from the triple alignments; thereby, we obtain the number of alignments given in brackets. gcn-align [22] is a gcn-based approach to embed all entities from both graphs into a common embedding space. each entity i is associated with structural features h_i ∈ r^d, which are initialized randomly and updated during training. the features of all entities in a single graph are combined into the feature matrix h. subsequently, a two-layer gcn is applied. a single gcn layer is described by h^(i+1) = σ(d^(−1/2) â d^(−1/2) h^(i) w^(i)), where â denotes the adjacency matrix described below and d_ii = Σ_j â_ij is the diagonal node degree matrix. the input of the first layer is set to h^(0) = h, and σ is a non-linear activation function, chosen as relu. the output of the last layer is considered as the structural representation, denoted by s_i = h_i^(2) ∈ r^d. both graphs are equipped with their own node features, but the convolution weights w^(i) are shared across the graphs. the adjacency matrix is derived from the knowledge graph by first computing a score, called functionality, for each relation r as the ratio α_r between the number of different entities which occur as head and the number of triples in which the relation occurs. analogously, the inverse functionality α'_r is obtained by replacing the numerator by the number of different tail entities. the final adjacency matrix is obtained as a_ij = Σ_{(e_i,r,e_j)} α'_r + Σ_{(e_j,r,e_i)} α_r. note that, analogously to the structural features, gcn-align is able to process the attributes and integrate them in the final representation. however, since attributes have little effect on the final score, and to be consistent with other gnn models, here we focus only on structural representations. implementation specifics. the code provided by the authors differs in a few aspects from the method described in the paper. first, when computing the adjacency matrix, fun(r) and ifun(r) are set to at least 0.3. second, the node embeddings are initialized with values drawn from a normal distribution with variance n^(−1/2), where n is the number of nodes. additionally, the node features are always normalised to unit euclidean length before passing them into the network. finally, there are no convolution weights. this means that the whole gcn does not contain a single parameter, but is just a fixed function on the learned node embeddings. in initial experiments we were able to reproduce the results reported in the paper using the implementation provided by the authors. moreover, we are able to reproduce the results using our own implementation, and settings adjusted to the authors' code. in addition, we replaced the adjacency matrix based on functionality and inverse functionality by a simpler version, where we additionally used the row normalisation d^(−1) â instead of the symmetric normalization. in total, we see no difference in performance between our simplified adjacency matrix and the authors' one. we identified two aspects which affect the model's performance: not using convolutional weights, and normalizing the variance when initializing node embeddings.
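the adjacency construction and the parameter-free propagation just described can be sketched as follows; this is our reading of the description, not the released gcn-align code. in particular, the direction in which functionality and inverse functionality are added, the use of relu without weights, and the row normalisation d^(−1)â are assumptions taken from the text above.

```python
import numpy as np
from collections import defaultdict

def functionality_adjacency(triples, num_entities, min_score=0.3):
    """build the relation-weighted adjacency matrix described above.
    triples: list of (head, relation, tail) id tuples for one kg."""
    heads, tails, counts = defaultdict(set), defaultdict(set), defaultdict(int)
    for h, r, t in triples:
        heads[r].add(h)
        tails[r].add(t)
        counts[r] += 1
    fun = {r: max(len(heads[r]) / counts[r], min_score) for r in counts}
    ifun = {r: max(len(tails[r]) / counts[r], min_score) for r in counts}
    a = np.zeros((num_entities, num_entities))
    for h, r, t in triples:
        a[h, t] += ifun[r]   # which score goes in which direction follows our reading above
        a[t, h] += fun[r]
    return a

def propagate(a, x, num_layers=2):
    """parameter-free 'gcn': no convolution weights, just repeated neighbourhood
    averaging of unit-normalised node embeddings, as in the authors' code."""
    x = x / np.linalg.norm(x, axis=1, keepdims=True)
    d_inv = 1.0 / np.maximum(a.sum(axis=1, keepdims=True), 1e-9)
    for _ in range(num_layers):
        x = np.maximum(d_inv * (a @ x), 0.0)   # relu(d^-1 a x)
    return x
```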
we provide empirical evidence for this finding across numerous datasets. our results regarding hits@1 (h@1) are summarised in table 3 . node embedding initialization. comparing the columns of table 3 we can observe the influence of the node embedding initialization. using the settings from the authors' code, i.e. not using weights, a choosing a variance of n −1/2 actually results in inferior performance in terms of h@1, as compared to use a standard normal distribution. these findings are consistent across datasets. convolution weights. the first column of table 3 corresponds to the weight usage and initialization settings used in the code for gcn-align. we achieve slightly better results than published in [22] , which we attribute to a more exhaustive parameter search. interestingly, all best configurations use adam optimizer instead of sgd. adding convolution weights degrades the performance across all datasets and subsets thereof but one as witnessed by comparing the first two columns with the last two columns. in this work, we reported our experiences when implementing the knowledge graph alignment method gcn-align. we pointed at important differences between the model described in the paper and the actual implementation and quantified their effects in the ablation study. for future work, we plan to include other methods for entity alignments in our framework. dbpedia: a nucleus for a web of open data translating embeddings for modeling multi-relational data multi-channel graph neural network for entity alignment co-training embeddings of knowledge graphs and entity descriptions for cross-lingual entity alignment multilingual knowledge graph embeddings for cross-lingual knowledge alignment deep graph matching consensus neural message passing for quantum chemistry learning to exploit long-term relational dependencies in knowledge graphs semi-supervised classification with graph convolutional networks proceedings of the 57th conference of the association for computational linguistics (acl 2019) proceedings of the twenty-eighth international joint conference on artificial intelligence (ijcai 2019) proceedings of the twenty-seventh international joint conference on artificial intelligence (ijcai 2018) yago3: a knowledge base from multilingual wikipedias a review of relational machine learning for knowledge graphs semi-supervised entity alignment via knowledge graph embedding with awareness of degree difference proceedings of the twenty-sixth international joint conference on artificial intelligence (ijcai 2017) introducing the knowledge graph: things, not strings cross-lingual entity alignment via joint attribute-preserving embedding bootstrapping entity alignment with knowledge graph embedding entity alignment between knowledge graphs using attribute embeddings graph attention networks cross-lingual knowledge graph alignment via graph convolutional networks cross-lingual knowledge graph alignment via graph matching neural network multi-view knowledge graph embedding for entity alignment iterative entity alignment via joint knowledge embeddings neighborhood-aware attentional representation for multilingual knowledge graphs acknowledgements. this work has been funded by the german federal ministry of education and research (bmbf) under grant no. 01is18036a and by the bavarian ministry for economic affairs, infrastructure, transport and technology through the center for analytics-data-applications (ada-center) within the framework of "bay-ern digital ii". 
the authors of this work take full responsibility for its content. key: cord-020820-cbikq0v0 authors: papadakos, panagiotis; kalipolitis, orfeas title: dualism in topical relevance date: 2020-03-24 journal: advances in information retrieval doi: 10.1007/978-3-030-45442-5_40 sha: doc_id: 20820 cord_uid: cbikq0v0 there are several concepts whose interpretation and meaning is defined through their binary opposition with other opposite concepts. to this end, in this paper we elaborate on the idea of leveraging the available antonyms of the original query terms for eventually producing an answer which provides a better overview of the related conceptual and information space. specifically, we sketch a method in which antonyms are used for producing dual queries, which can in turn be exploited for defining a multi-dimensional topical relevance based on the antonyms. we motivate this direction by providing examples and by conducting a preliminary evaluation that shows its importance to specific users. dualism denotes the state of two parts. the term was originally coined to denote co-eternal binary opposition and has been especially studied in philosophy. for example, there is duality in ethics (good - bad), in human beings (man - nietzsche's übermensch or man - god) and in logic (true - false). in addition, dualism determines to a great extent our everyday lives (ugly - beautiful, happy - unhappy, etc.), and our relations with other people (rich - poor, black - white, love - hate, etc.). none of these concepts can be understood without their dual concepts, since this duality and opposition generates their meaning and interpretation. dualism is also crucial in mathematics and physics (e.g., matter - antimatter), and is the power behind our whole information society and our binary data. moving from philosophy, sciences and everyday life to information retrieval, we find a very vague situation. users of search engines are 'dictated' to provide a very concise and specific query that is extremely efficient for focalized search (e.g., looking for a specific hotel). on the other hand, studies show that 60% of user tasks are of exploratory nature [12]. in such tasks users do not accurately know their information need and cannot be satisfied by a single 'hit' [5]. consequently, users spend a lot of time reformulating queries and investigating results, in order to construct a conceptual model regarding their information need. information needs that include non-monosemous terms can be considered such exploratory tasks. however, the simplicity of inserting terms in an empty text box and 'magically' returning the most relevant object(s) will always be a desired feature. in this paper we elaborate on the idea of leveraging the available antonyms of the original query terms (if they exist), for eventually producing an answer which provides a better overview of the related information and conceptual space. we sketch a method in which antonyms are used for producing dual queries, which in turn can be exploited for defining a multi-dimensional topical relevance. this approach can be applied on demand, helping users to be aware of the various opposing dimensions and aspects of their topic of interest. a preliminary evaluation shows the value of the approach for some exploratory tasks and users. to the best of our knowledge, the proposed direction is not covered by the existing literature. antonyms have been studied in fuzzy logic [7] showing a relation with negates.
in the ir domain, query expansion methods are based on synonyms and semantically related terms, but do not exploit antonyms explicitly, while in relevance and pseudo-relevance feedback techniques the antonyms are essentially penalized [1] . results diversification can produce a kind of dual clusters, but this is neither guaranteed nor controlled [3] . "capitalism and war". consider a user exploring the relationship between capitalism and war. the user submits to a wse (web search engine) the query "capitalism and war" and starts inspecting the results. the left part of fig. 1 shows the top-5 results for this query from a popular wse. the results include articles about the connection of capitalism with war from research and academic domains, as well as from socialistic, communistic and theological sites. considering a different direction, the user might also be interested about how capitalism can support peace, the dual of war. the top-5 results for the query "capitalism and peace" are shown at the right side of fig. 1 . they contain a wikipedia and a research article about the capitalist peace theory, and articles about the importance of capitalism for the prosperity of modern societies and its association to peace from policy research organizations. analogously, since socialism is the economic system that opposes capitalism, the user could be interested about how socialism may promote war or support peace, by inspecting the results of the queries "socialism and war" and "socialism and peace" respectively. the top-5 results for each of the above queries are shown in fig. 2 . the results for the former query include the socialism and war pamphlet written by lenin, a collection of articles by the economist and philosopher friedrich hayek, a list of articles from two marxist domains, and a critical article for both left and right views from the foundation for economic education. for the latter query, the results include articles connecting socialism with peace, like a chapter from the encyclopedia of anti-revisionism, a wikipedia article about the theoretical magazine problems of peace and socialism, and an article from a site supporting a far left u.s. party. the above hits indicate interesting directions to the original information need of the user. we argue that users should get aware of these directions for a better exploration of the domain at hand, since they can provide a more comprehensive view of the information and conceptual space. furthermore, the exploration of these directions let available supportive or counter arguments of dual concepts to emerge, leading to better informed and responsible humans and citizens. "aloe". a comprehensive view of the various different directions can be beneficial also for reducing false-positive results. for example, consider a pregnant woman that was advised to take aloe vera by mouth to relieve digestive discomfort. to check if this is true, she submits to a wse the query "aloe vera indications". however, since aloe can stimulate uterine contractions, increasing the risk of miscarriage or premature birth, it is crucial to know also its contraindications. the proposed direction can alleviate this problem, because this information would be contained in the results of the query "aloe vera contraindications". one can imagine various ways for leveraging antonyms. we shall hereafter use t t to denote that the terms t and t are antonyms. 
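one way to obtain the antonyms relied on above is wordnet's lemma-level antonym links; a minimal sketch using nltk follows (the library choice and the part-of-speech handling are ours, not part of the paper).

```python
from nltk.corpus import wordnet as wn   # assumes the wordnet corpus is installed

def antonyms(term, pos=None):
    """return the set of wordnet antonyms of a term; as noted above this works
    best for nouns and adjectives. pos can restrict the lookup, e.g. wn.NOUN."""
    found = set()
    for synset in wn.synsets(term, pos=pos):
        for lemma in synset.lemmas():
            for ant in lemma.antonyms():
                found.add(ant.name().replace('_', ' '))
    return found

# e.g. antonyms('war') is expected to contain 'peace', and antonyms('capitalism')
# 'socialism', provided the installed wordnet version exposes these links.
```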
building on the "capitalistic" example of the previous section, according to the online dictionary wordnet, the antonym of capitalism is socialism, and the antonym of war is peace. now, we can generate all possible queries, denoted by q, where non-monosemous terms of the original query are substituted by their dual ones, as expressed by their antonyms. for example, the query "capitalism and war" will generate three extra queries: "socialism and peace", "capitalism and peace" and "socialism and war". based on q we can now define two vector spaces. in the first case, the space has |q| dimensions, where each query is a dimension of the space. each document is placed in this space according to its relevance to each query. in the second case we assume a space with only |q|/2 dimensions. each dimension represents a pair of dual queries, where each query in the pair contains the antonyms of the other. we call two such queries q and q' dual. for our running example, the first pair is ("capitalism and war", "socialism and peace") and the second one is ("capitalism and peace", "socialism and war"). each pair defines an axis, therefore the two pairs define a 2d space against which we can evaluate the "value" of each document. for each axis we can consider policies for composing the relevance scores of each document to each member of a dual query. generally, there are various criteria that can be considered for assessing the value of each document or set of documents. such criteria include the bias of documents to specific queries (e.g., the original user query), the purity to a specific query, the overview factor of a document regarding either a dual query or all queries, and the diversity of the returned set of documents with respect to these queries. in general, we need to define appropriate ranking methods that will take into account the relevance of the documents to the available queries for different criteria. therefore, we will explore whether the existing multiple-criteria approaches described in [4, 6, 9, 13] are appropriate for the problem at hand. regarding the process of finding the corresponding antonyms, we can use existing dictionaries like wordnet for nouns and adjectives or word-embedding antonym detection approaches like [8, 11]. the case of verbs and adverbs is more complicated since they require a kind of grammatical and language analysis (e.g., exist - not exist, lot - total, a lot - bit, etc.). there are three categories of antonyms: (a) gradable, (b) relational and (c) complementary. we have gradable antonyms (e.g., hot - cold) in cases where the definitions of the words lie on a continuous spectrum. we have relational antonyms (e.g., teacher - student) in cases where the two meanings are opposite only within the context of their relationship. the rest are called complementary antonyms (e.g., day - night). in general, the selection of the "right" antonyms raises various questions. in many cases more than one antonym exists, so one should decide which one(s) to select. sometimes this can depend on the context, e.g., the antonym of "action" is "apathy", but in terms of physics or sociology the dual of "action" is "reaction". notice that the proposed approach can be exploited in any context where the aim is to retrieve semantically opposing entities, information, etc. as an example consider the argument web [2], where the approach could be used for retrieving contradicting arguments and providing support for each one of them.
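the generation of q and of the dual pairs can be sketched in a few lines, reusing a lookup such as the antonyms() helper sketched above; this is an illustration, not the authors' implementation, and it assumes that stop words like "and" are left out of the substitution and that each term has at most one relevant antonym when forming pairs.

```python
from itertools import product

def dual_queries(query_terms, antonyms_of):
    """all extra queries obtained by substituting terms of the original query
    with one of their antonyms; antonyms_of(term) returns a set of antonyms."""
    options = [sorted({t} | set(antonyms_of(t))) for t in query_terms]
    queries = {' '.join(combo) for combo in product(*options)}
    queries.discard(' '.join(query_terms))   # keep only the extra, dual queries
    return sorted(queries)

def dual_of(terms, antonyms_of):
    """the dual of a query: substitute every substitutable term (assuming one
    antonym per term, and a symmetric antonym lookup, for simplicity)."""
    return ' '.join(next(iter(antonyms_of(t))) if antonyms_of(t) else t
                    for t in terms)

# e.g. with terms ["capitalism", "war"], dual_queries yields "socialism war",
# "capitalism peace" and "socialism peace", and pairing each query q with
# dual_of(q.split(), antonyms_of) recovers the two axes of the 2d space above.
```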
from a system's perspective, the approach can be realized in various levels and settings. in the setting of an ir system, it can be implemented by changing accordingly the query processor and the ranking module, while in a meta-search setting, by changing the query rewriting, the query forwarding and the ranking components. it could also be exploited in the query autocompletion layer. to start with, we have conducted a preliminary evaluation. we have specified 15 information tasks which are shown in table 1 , that can exploit the proposed approach. the tasks are of exploratory nature and were created using the task refinement steps described in [10] . we have identified the following types of tasks: explore domain (ed), medical treatment (mt), explore product reviews (epr) and person qualities (pq). for each task we provide a description of the information need, a representative query and the relevant antonyms, which were manually selected from the list of the respective wordnet antonyms. we conducted our experiment over 9 female and 25 male users of various ages. for each task, they were given two lists of results. one contained the results of the query from a popular wse, and the other one was constructed by interleaving the results of the same wse for the dual queries of this task (i.e., first the top result of the original query, then the first result of its dual, etc.). the two kinds of lists were given in random order for each task. the users were asked to select the most preferred list and to provide a grade of preference taking values in {1, 2, 3, 4, 5}, where 5 means that the selected list was preferred much more than the other one. in the background, when users prefer the results of the dual approach, we change the sign of the score and make it negative. the users were not aware how the lists were constructed and were not guided in any way by the evaluator. in fig. 3 we provide two graphs that describe the results of the evaluation. figure 3 (a), shows the aggregated scores given by all users to each query, while fig. 3 (b) shows the aggregated scores given by each participant to all queries. regarding the first one the results are not the expected ones, although we hypothesize that the users mainly penalized the dual approach because of the 'irrelevant' results to the original query in terms of query tokens and not in terms of relevant information. for eleven of the queries there is a strong preference towards the non-dual approach. the epr type of queries belong to this category, showing that users are probably not interested for reviews with the opposite direction of what they are looking for. this is especially true for q 12 , where the dual approach provided results about winter vacations and was the least preferred. for two of the tasks, the approaches are almost incomparable. both of these tasks belong to the mt group. there are also two queries, q 3 and q 15 , where the dual approach is better, especially in the last one. in their comments for these queries, users mention that the selected (i.e., dual) list "provides a more general picture" and "more relevant and interesting results, although contradicting". regarding the second graph we have the interesting result that the proposed approach appeals to specific users. it seems that nine users (26% of the participants) have an exploratory nature and generally prefer the dual approach (six of them strongly), while for four of them the two approaches are incomparable. the rest are better served with the non-dual approach. 
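the interleaved list shown to the participants and the signed preference grade can be reproduced with a few lines; here "runs" are plain ranked lists of result urls, and the deduplication step is our addition rather than something stated in the text.

```python
def interleave(runs):
    """round-robin interleaving of several ranked lists, e.g. the results of a
    query and of its dual: first the top result of the first run, then the top
    result of the second, and so on."""
    merged, seen = [], set()
    for rank in range(max(len(r) for r in runs)):
        for run in runs:
            if rank < len(run) and run[rank] not in seen:
                seen.add(run[rank])
                merged.append(run[rank])
    return merged

def preference_score(preferred_dual, grade):
    """signed preference used in the analysis: grades in {1..5} stay positive
    when the plain list is preferred and become negative for the dual list."""
    return -grade if preferred_dual else grade
```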
this is an interesting outcome, and in the future we plan to identify the types of users that prefer the dual approach. we have motivated with examples why it is worth investigating dualism for nonmonosemous terms in the context of exploratory search and we have shown its importance at least for some types of users and tasks. for the future, we plan to define the appropriate antonyms selection algorithms and relevance metrics, implement the proposed functionality in a meta-search setting, and conduct a large scale evaluation with real users over exploratory tasks, to identify in which queries the dual approach is beneficial and to what types of users. query expansion techniques for information retrieval: a survey implementing the argument web evaluating subtopic retrieval methods: clustering versus diversification of search results multidimensional relevance: a new aggregation criterion supporting exploratory search multidimensional relevance: prioritized aggregation in a personalized information retrieval setting on antonym and negate in fuzzy logic improving word embeddings for antonym detection using thesauri and sentiwordnet negotiating a multidimensional framework for relevance space creating exploratory tasks for a faceted search interface word embedding-based antonym detection using thesauri and distributional information understanding user goals in web search relevance: a review of the literature and a framework for thinking on the notion in information science. part ii: nature and manifestations of relevance key: cord-020813-0wc23ixy authors: hashemi, helia; aliannejadi, mohammad; zamani, hamed; croft, w. bruce title: antique: a non-factoid question answering benchmark date: 2020-03-24 journal: advances in information retrieval doi: 10.1007/978-3-030-45442-5_21 sha: doc_id: 20813 cord_uid: 0wc23ixy considering the widespread use of mobile and voice search, answer passage retrieval for non-factoid questions plays a critical role in modern information retrieval systems. despite the importance of the task, the community still feels the significant lack of large-scale non-factoid question answering collections with real questions and comprehensive relevance judgments. in this paper, we develop and release a collection of 2,626 open-domain non-factoid questions from a diverse set of categories. the dataset, called antique, contains 34k manual relevance annotations. the questions were asked by real users in a community question answering service, i.e., yahoo! answers. relevance judgments for all the answers to each question were collected through crowdsourcing. to facilitate further research, we also include a brief analysis of the data as well as baseline results on both classical and neural ir models. with the rising popularity of information access through devices with small screens, e.g., smartphones, and voice-only interfaces, e.g., amazon's alexa and google home, there is a growing need to develop retrieval models that satisfy user information needs with sentence-level and passage-level answers. this has motivated researchers to study answer sentence and passage retrieval, in particular in response to non-factoid questions [1, 18] . non-factoid questions are defined as open-ended questions that require complex answers, like descriptions, opinions, or explanations, which are mostly passage-level texts. questions like "how to cook burgers?" are non-factoid. 
we believe this type of questions plays a pivotal role in the overall quality of question answering systems, since their technologies are not as mature as those for factoid questions, which seek precise facts, such as "at what age did rossini stop writing opera?". despite the widely-known importance of studying answer passage retrieval for non-factoid questions [1, 2, 8, 18] , the research progress for this task is limited by the availability of high-quality public data. some existing collections, e.g., [8, 13] , consist of few queries, which are not sufficient to train sophisticated machine learning models for the task. some others, e.g., [1] , significantly suffer from incomplete judgments. most recently, cohen et al. [3] developed a publicly available collection for non-factoid question answering with a few thousands questions, which is called wikipassageqa. although wikipassageqa is an invaluable contribution to the community, it does not cover all aspects of the non-factoid question answering task and has the following limitations: (i) it only contains an average of 1.7 relevant passages per question and does not cover many questions with multiple correct answers; (ii) it was created from the wikipedia website, containing only formal text; (iii) more importantly, the questions in the wikipassageqa dataset were generated by crowdworkers, which is different from the questions that users ask in real-world systems; (iv) the relevant passages in wikipassageqa contain the answer to the question in addition to some surrounding text. therefore, some parts of a relevant passage may not answer any aspects of the question; (v) it only provides binary relevance labels. to address these shortcomings, in this paper, we create a novel dataset for non-factoid question answering research, called antique, with a total of 2,626 questions. in more detail, we focus on the non-factoid questions that have been asked by users of yahoo! answers, a community question answering (cqa) service. non-factoid cqa data without relevance annotation has been previously used in [1] , however, as mentioned by the authors, it significantly suffers from incomplete judgments (see sect. 2 for more information on existing collections). we collected four-level relevance labels through a careful crowdsourcing procedure involving multiple iterations and several automatic and manual quality checks. note that we paid extra attention to collect reliable and comprehensive relevance judgments for the test set. therefore, we annotated the answers after conducting result pooling among several term-matching and neural retrieval models. in summary, antique provides annotations for 34,011 question-answer pairs, which is significantly larger than many comparable datasets. we further provide brief analysis to uncover the characteristics of antique. moreover, we conduct extensive experiments with antique to present benchmark results of various methods, including classical and neural ir models on the created dataset, demonstrating the unique challenges antique introduces to the community. to foster research in this area, we release antique. 1 factoid qa datasets. trec qa [14] and wikiqa [17] are examples of factoid qa datasets whose answers are typically brief and concise facts, such as named entities and numbers. insuranceqa [5] is another factoid dataset in the domain of insurance. antique, on the other hand, consists of open-domain non-factoid questions that require explanatory answers. 
the answers to these questions are often passage level, which is contrary to the factoid qa datasets. non-factoid qa datasets. there have been efforts for developing non-factoid question answering datasets [7, 8, 16] . keikha et al. [8] introduced the webap dataset, which is a non-factoid qa dataset with 82 queries. the questions and answers in webap were not generated by real users. there exist a number of datasets that partially contain non-factoid questions and were collected from cqa websites, such as yahoo! webscope l6, qatar living [9] , and stackexchange. these datasets are often restricted to a specific domain, suffer from incomplete judgments, and/or do not contain sufficient non-factoid questions for training sophisticated machine learning models. the nfl6 dataset [1] is a collection of non-factoid questions extracted from the yahoo! webscope l6. its main drawback is the absence of complete relevance annotation. previous work assumes that the only answer that the question writer has marked as correct is relevant, which is far from being realistic. that is why we aim to collect a complete set of relevance annotations. wikipassageqa is another non-factoid qa dataset that has been recently created by cohen et al. [3] . as mentioned in sect. 1, despite its great potentials, it has a number of limitations. antique addresses these limitations to provide a complementary benchmark for nonfactoid question answering (see sect. 1). more recently, microsoft has released the ms marco v2.1 passage re-ranking dataset [10] , containing a large number of queries sampled from the bing search engine. in addition to not being specific to non-factoid qa, it significantly suffers from incomplete judgments. in contrast, antique provides a reliable collection with complete relevance annotations for evaluating non-factoid qa models. following cohen et al. [1] , we used the publicly available dataset of non-factoid questions collected from the yahoo! webscope l6, called nfl6. we conducted the following steps for pre-processing and question sampling: (i) questions with less than 3 terms were omitted (excluding punctuation marks); (ii) questions with no best answer (â) were removed; (iii) duplicate or near-duplicate questions were removed. we calculated term overlap between questions and from the questions with more than 90% term overlap, we only kept one, randomly; (iv) we omitted the questions under the categories of "yahoo! products" and "computers & internet" since they are beyond the expertise of most workers; (v) from the remaining data, we randomly sampled 2,626 questions (out of 66,634). each question q in nfl6 corresponds to a list of answers named 'nbest answers', which we denote with a = {a 1 , . . . , a n }. for every question, one answer is marked by the question author on the community web site as the best answer, denoted byâ. it is important to note that as different people have different information needs, this answer is not necessarily the best answer to the question. also, many relevant answers have been added after the user has chosen the correct answer. nevertheless, in this work, we respect the users' explicit feedback, assuming that the candidates selected by the actual user are relevant to the query. therefore, we do not collect relevance assessments for those answers. we created a human intelligence task (hit) on amazon mechanical turk, in which we presented workers with a question-answer pair, and instructed them to annotate the answer with a label between 1 to 4. 
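before the annotation interface is described further, the pre-processing steps (i)-(v) above can be sketched as follows; the field names, the random seed, and the exact definition of term overlap (here relative to the shorter question) are assumptions rather than the released scripts.

```python
import random
import string

EXCLUDED_CATEGORIES = {"yahoo! products", "computers & internet"}

def tokens(text):
    return text.lower().translate(str.maketrans('', '', string.punctuation)).split()

def overlap(q1, q2):
    a, b = set(tokens(q1)), set(tokens(q2))
    return len(a & b) / max(1, min(len(a), len(b)))

def filter_questions(questions, sample_size=2626, seed=0):
    """questions: list of dicts with 'text', 'category' and 'best_answer' keys
    (placeholder schema, not the nfl6 field names)."""
    kept = []
    for q in questions:
        if len(tokens(q['text'])) < 3:                 # (i) too short
            continue
        if not q.get('best_answer'):                   # (ii) no best answer
            continue
        if q.get('category') in EXCLUDED_CATEGORIES:   # (iv) excluded categories
            continue
        # (iii) near-duplicate removal: keep one of any pair with >90% overlap
        if any(overlap(q['text'], k['text']) > 0.9 for k in kept):
            continue
        kept.append(q)
    rng = random.Random(seed)
    return rng.sample(kept, min(sample_size, len(kept)))   # (v) random sampling
```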
the instructions started with a short introduction to the task and its motivations, followed by detailed annotation guidelines. since workers needed background knowledge for answering the majority of the questions, we also includedâ in the instructions and called it a "possibly correct answer." in some cases, the question is very subjective and could have multiple correct answers. this is why it is called "possibly correct answer" to make it clear in the instructions that other answers could potentially be different from the provided answer, but still be correct. label definitions. to facilitate the labeling procedure, we described labels in the form of a flowchart to users. our aim was to preserve the notion of relevance in qa systems as we discriminate it with the typical topical relevance definition in ad-hoc retrieval tasks. the definition of each label is as follows: label 4: it looks reasonable and convincing. its quality is on par with or better than the "possibly correct answer". note that it does not have to provide the same answer as the "possibly correct answer". label 3: it can be an answer to the question, however, it is not sufficiently convincing. there should be an answer with much better quality for the question. label 2: it does not answer the question or if it does, it provides an unreasonable answer, however, it is not out of context. therefore, you cannot accept it as an answer to the question. label 1: it is completely out of context or does not make any sense. we included 15 diverse examples of annotated qa pairs with explanation of why and how the annotations were done. overall, we launched 7 assignment batches, appointing 3 workers to each qa pair. in cases where the workers could agree on a label (i.e., majority vote), we considered the label as the ground truth. we then added all qa pairs with no agreement to a new batch and performed a second round of annotation. it is interesting to note that the ratio of pairs with no agreement was nearly the same among the 7 batches (∼13%). in the very rare cases of no agreement after two rounds of annotation (776 pairs), an expert annotator decided on the final label. to allow further analysis, we have added a flag in the dataset identifying the answers annotated by the expert annotator. in total, the annotation task costed 2,400 usd. to ensure the quality of the data, we limited the hit to the workers with over 98% approval rate, who have completed at least 5,000 assignments. 3% of qa pairs were selected from a set of quality check questions with obviously objective labels. it enabled us to identify workers who did not provide high-quality labels. moreover, we recorded the click log of the workers to detect any abnormal behavior (e.g., employing automatic labeling scripts) that would affect the quality of the data. finally, we constantly performed manual quality checks by reading the qa pairs and their respective labels. the manual inspection was done on the 20% of each worker's submission as well as the qa pairs with no agreement. training set. in the training set, we annotate the list a (see sect. 3) for each query, and assume that for each question, answers to the other questions are irrelevant. as we removed similar questions from the dataset, this assumption is fair. to test this assumption, we sampled 100 questions from the filtered version of nfl6 and annotated the top 10 results retrieved by bm25 using the same crowdsourcing procedure. 
the results showed that only 13.7% of the documents (excluding a) were annotated as relevant (label 3 or 4). this error rate can be tolerated in the training process as it enables us to collect significantly larger amount of training labels. on the other hand, for the test set we performed pooling to label all possibly relevant answers. in total, the antique's training set contains 27,422 answer annotations as it shown in table 1 , that is 11.3 annotated candidate answers per training question, which is significantly larger than its similar datasets, e.g., wikipassageqa [3] . test set. the test set in antique consists of 200 questions which were randomly sampled from nfl6 after pre-processing and filtering. statistics of the test set can be found in table 1 . the set of candidate questions for annotation was selected by performing depth-k (k = 10) pooling. to do so, we considered the union of the top k results of various retrieval models, including term-matching and neural models (listed in table 2 ). we took the union of this set and "nbest answers" (set a) for annotation. here, we present a brief analysis of antique to highlight its characteristics. table 1 lists general statistics of antique. as we see, antique consists of 2,426 non-factoid questions that can be used for training, followed by 200 questions as a test set. furthermore, antique contains 27.4k and 6.5k annotations (judged answers) for the train and test sets, respectively. we also report the total number of answers with specific labels. workers performance. overall, we launched 7 crowdsourcing batches to collect antique. this allowed us to identify and ban less accurate workers. as reported in table 1 , a total number of 577 workers made over 148k annotations (257 per worker), out of which we rejected 12% because they failed to satisfy the quality criteria. questions distribution. figure 1 shows how questions are distributed in antique by reporting the top 40 starting trigrams of the questions. as shown in the figure, majority of the questions start with "how" and "why," constituting 38% and 36% of the questions, respectively. it is notable that, according to fig. 1 , a considerable number of questions start with "how do you," "how can you," "what do you," and "why do you," suggesting that their corresponding answers would be highly subjective and opinion based. also, we can see a major fraction of questions start with "how can i" and "how do i," indicating the importance and dominance of personal questions. fig. 2 , we plot the distribution for the number of 'nbest answers' (|a|). we see that the majority of questions have 9 or less nbest answers (=54%) and 82% of questions have 14 or less nbest answers. the distribution, however, has a long tail which is not shown in the figure. in this section, we provide benchmark results on the antique dataset. we report the results for a wide range of retrieval models in table 2 . in this experiment, we report a wide range of standard precision-and recall-oriented retrieval metrics (see table 2 ). note that for the metrics that require binary labels (i.e., map, mrr, and p@k), we assume that the labels 3 and 4 are relevant, while 1 and 2 are non-relevant. due to the definition of our labels (see sect. 3), we recommend this setting for future work. for ndcg, we use the four-level relevance annotations (we mapped our 1 to 4 labels to 0 to 3). as shown in the table, the neural models significantly outperform bm25, an effective term-matching retrieval model. 
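two pieces of the benchmark construction described above can be sketched in python (ours, not the released scripts): the majority-vote label aggregation with a second round and expert tie-breaking, and the depth-k pooling of the test set together with the recommended binarization of the four-level labels.

```python
from collections import Counter

def aggregate(labels_per_item, expert=None):
    """labels_per_item: {qa_id: [label, label, label]} from three workers.
    returns final labels plus the ids that need another annotation round."""
    final, needs_round_2 = {}, []
    for qa_id, labels in labels_per_item.items():
        top_label, votes = Counter(labels).most_common(1)[0]
        if votes >= 2:                     # majority agreement
            final[qa_id] = top_label
        else:
            needs_round_2.append(qa_id)    # re-launch in a new batch
    if expert is not None:                 # remaining ties go to an expert annotator
        for qa_id in needs_round_2:
            final[qa_id] = expert(qa_id)
    return final, needs_round_2

def depth_k_pool(runs, nbest_answers, k=10):
    """runs: {system_name: {qid: ranked list of answer ids}}. returns the
    (qid, answer_id) pairs to annotate: union of top-k results and nbest answers."""
    pool = set()
    for ranking_per_query in runs.values():
        for qid, ranking in ranking_per_query.items():
            pool.update((qid, aid) for aid in ranking[:k])
    for qid, answers in nbest_answers.items():
        pool.update((qid, aid) for aid in answers)
    return pool

def binarize(label):
    # labels 3 and 4 count as relevant for map/mrr/p@k; ndcg keeps the graded
    # labels, shifted from 1..4 to 0..3
    return 1 if label >= 3 else 0
```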
among all, bert [4] provides the best performance. recent work on passage retrieval also made similar observations [11, 12] . since map is a recall-oriented metric, the results suggest that all the models still fail at retrieving all relevant answers. there is still a large room for improvement, in terms of both precision-and recall-oriented metrics. this paper introduced antique; a non-factoid question answering dataset. the questions in antique were sampled from a wide range of categories on yahoo! answers, a community question answering service. we collected four-level relevance annotations through a multi-stage crowdsourcing as well as expert annotation. in summary, antique consists of 34,011 qa-pair relevance annotations for 2,426 and 200 questions in the training and test sets, respectively. additionally, we reported the benchmark results for a set of retrieval models, ranging from term-matching to recent neural ranking models, on antique. our data analysis and retrieval experiments demonstrated that antique introduces unique challenges while fostering research for non-factoid question answering. end to end long short term memory networks for nonfactoid question answering a hybrid embedding approach to noisy answer passage retrieval wikipassageqa: a benchmark collection for research on non-factoid answer passage retrieval bert: pre-training of deep bidirectional transformers for language understanding applying deep learning to answer selection: a study and an open task a deep relevance matching model for ad-hoc retrieval new collection announcement: focused retrieval over the web evaluating answer passages using summarization measures semeval-2017 task 3: community question answering ms marco: a human generated machine reading comprehension dataset passage re-ranking with bert investigating the successes and failures of bert for passage re-ranking evaluating and predicting answer quality in community qa what is the jeopardy model? a quasisynchronous grammar for qa anmm: ranking short answer texts with attention-based neural matching model beyond factoid qa: effective methods for non-factoid answer sentence retrieval wikiqa: a challenge dataset for open-domain question answering document summarization for answering non-factoid queries acknowledgement. this work was supported in part by the center for intelligent information retrieval and in part by nsf iis-1715095. any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsor. key: cord-020801-3sbicp3v authors: macavaney, sean; soldaini, luca; goharian, nazli title: teaching a new dog old tricks: resurrecting multilingual retrieval using zero-shot learning date: 2020-03-24 journal: advances in information retrieval doi: 10.1007/978-3-030-45442-5_31 sha: doc_id: 20801 cord_uid: 3sbicp3v while billions of non-english speaking users rely on search engines every day, the problem of ad-hoc information retrieval is rarely studied for non-english languages. this is primarily due to a lack of data set that are suitable to train ranking algorithms. in this paper, we tackle the lack of data by leveraging pre-trained multilingual language models to transfer a retrieval system trained on english collections to non-english queries and documents. our model is evaluated in a zero-shot setting, meaning that we use them to predict relevance scores for query-document pairs in languages never seen during training. 
our results show that the proposed approach can significantly outperform unsupervised retrieval techniques for arabic, chinese mandarin, and spanish. we also show that augmenting the english training collection with some examples from the target language can sometimes improve performance. every day, billions of non-english speaking users [22] interact with search engines; however, commercial retrieval systems have been traditionally tailored to english queries, causing an information access divide between those who can and those who cannot speak this language [39] . non-english search applications have been equally under-studied by most information retrieval researchers. historically, ad-hoc retrieval systems have been primarily designed, trained, and evaluated on english corpora (e.g., [1, 5, 6, 23] ). more recently, a new wave of supervised state-of-the-art ranking models have been proposed by researchers [11, 14, 21, 24, 26, 35, 37] ; these models rely on neural architectures to rerank the head of search results retrieved using a traditional unsupervised ranking algorithm, such as bm25. like previous ad-hoc ranking algorithms, these methods are almost exclusively trained and evaluated on english queries and documents. the absence of rankers designed to operate on languages other than english can largely be attributed to a lack of suitable publicly available data sets. this aspect particularly limits supervised ranking methods, as they require samples for training and validation. for english, previous research relied on english collections such as trec robust 2004 [32] , the 2009-2014 trec web track [7] , and ms marco [2] . no datasets of similar size exist for other languages. while most of recent approaches have focused on ad hoc retrieval for english, some researchers have studied the problem of cross-lingual information retrieval. under this setting, document collections are typically in english, while queries get translated to several languages; sometimes, the opposite setup is used. throughout the years, several cross lingual tracks were included as part of trec. trec 6, 7, 8 [4] offer queries in english, german, dutch, spanish, french, and italian. for all three years, the document collection was kept in english. clef also hosted multiple cross-lingual ad-hoc retrieval tasks from 2000 to 2009 [3] . early systems for these tasks leveraged dictionary and statistical translation approaches, as well as other indexing optimizations [27] . more recently, approaches that rely on cross-lingual semantic representations (such as multilingual word embeddings) have been explored. for example, vulić and moens [34] proposed bwesg, an algorithm to learn word embeddings on aligned documents that can be used to calculate document-query similarity. sasaki et al. [28] leveraged a data set of wikipedia pages in 25 languages to train a learning to rank algorithm for japanese-english and swahili-english cross-language retrieval. litschko et al. [20] proposed an unsupervised framework that relies on aligned word embeddings. ultimately, while related, these approaches are only beneficial to users who can understand documents in two or more languages instead of directly tackling non-english document retrieval. a few monolingual ad-hoc data sets exist, but most are too small to train a supervised ranking method. for example, trec produced several non-english test collections: spanish [12] , chinese mandarin [31] , and arabic [25] . 
other languages were explored, but the document collections are no longer available. the clef initiative includes some non-english monolingual datasets, though these are primarily focused on european languages [3] . recently, zheng et al. [40] introduced sogou-qcl, a large query log dataset in mandarin. such datasets are only available for languages that already have large, established search engines. inspired by the success of neural retrieval methods, this work focuses on studying the problem of monolingual ad-hoc retrieval on non english languages using supervised neural approaches. in particular, to circumvent the lack of training data, we leverage transfer learning techniques to train arabic, mandarin, and spanish retrieval models using english training data. in the past few years, transfer learning between languages has been proven to be a remarkably effective approach for low-resource multilingual tasks (e.g. [16, 17, 29, 38] ). our model leverages a pre-trained multi-language transformer model to obtain an encoding for queries and documents in different languages; at train time, this encoding is used to predict relevance of query document pairs in english. we evaluate our models in a zero-shot setting; that is, we use them to predict relevance scores for query document pairs in languages never seen during training. by leveraging a pre-trained multilingual language model, which can be easily trained from abundant aligned [19] or unaligned [8] web text, we achieve competitive retrieval performance without having to rely on language specific relevance judgements. during the peer review of this article, a preprint [30] was published with similar observations as ours. in summary, our contributions are: -we study zero shot transfer learning for ir in non-english languages. -we propose a simple yet effective technique that leverages contextualized word embedding as multilingual encoder for query and document terms. our approach outperforms several baselines on multiple non-english collections. -we show that including additional in-language training samples may help further improve ranking performance. -we release our code for pre-processing, initial retrieval, training, and evaluation of non-english datasets. 1 we hope that this encourages others to consider cross-lingual modeling implications in future work. zero-shot multi-lingual ranking. because large-scale relevance judgments are largely absent in languages other than english, we propose a new setting to evaluate learning-to-rank approaches: zero-shot cross-lingual ranking. this setting makes use of relevance data from one language that has a considerable amount of training data (e.g., english) for model training and validation, and applies the trained model to a different language for testing. more formally, let s be a collection of relevance tuples in the source language, and t be a collection of relevance judgments from another language. each relevance tuple q, d, r consists of a query, document, and relevance score, respectively. in typical evaluation environments, s is segmented into multiple splits for training (s train ) and testing (s test ), such that there is no overlap of queries between the two splits. a ranking algorithm is tuned on s train to define the ranking function r strain (q, d) ∈ r, which is subsequently tested on s test . we propose instead tuning a model on all data from the source language (i.e., training r s (·)), and testing on a collection from the second language (t ). 
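the zero-shot protocol just formalized amounts to the following split logic; a minimal sketch in which relevance tuples are (query, document, relevance) triples, and the query-level validation split is our simplification (the actual experiments use predefined folds of the source collection).

```python
import random

def zero_shot_split(source, target, validation_fraction=0.2, seed=0):
    """source, target: lists of (query, doc, relevance) tuples. train and
    validate on the source language only; the entire target-language
    collection t is held out for testing."""
    rng = random.Random(seed)
    queries = sorted({q for q, _, _ in source})
    rng.shuffle(queries)
    n_valid = int(len(queries) * validation_fraction)
    valid_queries = set(queries[:n_valid])
    train = [t for t in source if t[0] not in valid_queries]
    valid = [t for t in source if t[0] in valid_queries]
    return train, valid, list(target)   # test = all of t, never seen in training
```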
we evaluate on monolingual newswire datasets from three languages: arabic, mandarin, and spanish. the arabic document collection contains 384k documents (ldc2001t55), and we use topics/relevance information from the 2001-02 trec multilingual track (25 and 50 topics, respectively). for mandarin, we use 130k news articles from ldc2000t52. mandarin topics and relevance judgments are utilized from trec 5 and 6 (26 and 28 topics, respectively). finally, the spanish collection contains 58k articles from ldc2000t51, and we use topics from trec 3 and 4 (25 topics each). we use the topics, rather than the query descriptions, in all cases except trec spanish 4, in which only descriptions are provided. the topics more closely resemble real user queries than descriptions. 2 we test on these collections because they are the only document collections available from trec at this time. 3 we index the text content of each document using a modified version of anserini with support for the languages we investigate [36] . specifically, we add anserini support for lucene's arabic and spanish light stemming and stop word list (via spanishanalyzer and arabicanalyzer). we treat each character in mandarin text as a single token. modeling. we explore the following ranking models: -unsupervised baselines. we use the anserini [36] implementation of bm25, rm3 query expansion, and the sequential dependency model (sdm) as unsupervised baselines. in the spirit of the zero-shot setting, we use the default parameters from anserini (i.e., assuming no data of the target language). -pacrr [14] models n-gram relationships in the text using learned 2d convolutions and max pooling atop a query-document similarity matrix. -knrm [35] uses learned gaussian kernel pooling functions over the querydocument similarity matrix to rank documents. -vanilla bert [21] uses the bert [10] transformer model, with a dense layer atop the classification token to compute a ranking score. to support multiple languages, we use the base-multilingual-cased pretrained weights. these weights were trained on wikipedia text from 104 languages. we use the embedding layer output from base-multilingual-cased model for pacrr and knrm. in pilot studies, we investigated using cross-lingual muse vectors [8] and the output representations from bert, but found the bert embeddings to be more effective. experimental setup. we train and validate models using trec robust 2004 collection [32] . trec robust 2004 contains 249 topics, 528k documents, and 311k relevance judgments in english (folds 1-4 from [15] for training, fold 5 for validation). thus, the model is only exposed to english text in the training and validation stages (though the embedding and contextualized language models are trained on large amounts of unlabeled text in the languages). the validation dataset is used for parameter tuning and for the selection of the optimal training epoch (via ndcg@20). we train using pairwise softmax loss with adam [18] . we evaluate the performance of the trained models by re-ranking the top 100 documents retrieved with bm25. we report map, precision@20, and ndcg@20 to gauge the overall performance of our approach, and the percentage of judged documents in the top 20 ranked documents (judged@20) to evaluate how suitable the datasets are to approaches that did not contribute to the original judgments. we present the ranking results in table 1 . 
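a minimal sketch of the vanilla bert re-ranker and the pairwise softmax loss described above, using the huggingface transformers library; the class names, the truncation settings, and the single linear layer over the [cls] token are our rendering of the description, not the released implementation.

```python
import torch
from torch import nn
from transformers import BertModel, BertTokenizerFast

class VanillaBertRanker(nn.Module):
    """dense scoring layer on top of the [cls] token of multilingual bert."""
    def __init__(self, name='bert-base-multilingual-cased'):
        super().__init__()
        self.bert = BertModel.from_pretrained(name)
        self.tokenizer = BertTokenizerFast.from_pretrained(name)
        self.score = nn.Linear(self.bert.config.hidden_size, 1)

    def forward(self, queries, docs):
        enc = self.tokenizer(queries, docs, padding=True, truncation=True,
                             max_length=512, return_tensors='pt')
        cls = self.bert(**enc).last_hidden_state[:, 0]   # [cls] representation
        return self.score(cls).squeeze(-1)               # one relevance score per pair

def pairwise_softmax_loss(pos_scores, neg_scores):
    """softmax over (relevant, non-relevant) score pairs, trained to prefer
    the relevant document, as in the training setup above."""
    logits = torch.stack([pos_scores, neg_scores], dim=1)
    targets = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return nn.functional.cross_entropy(logits, targets)
```

at test time such a model re-ranks the top 100 bm25 results of the target-language collection without any change in parameters, which is exactly what makes the zero-shot comparison possible.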
we first point out that there is considerable variability in the performance of the unsupervised baselines; in some cases, rm3 and sdm outperform bm25, whereas in other cases they underperform. similarly, the pacrr and knrm neural models also vary in effectiveness, though more frequently perform much worse than bm25. this makes sense because these models capture matching characteristics that are specific to english. for instance, n-gram patterns captured by pacrr for english do not necessarily transfer well to languages with different constituent order, such as arabic (vso instead of svo). an interesting observation is that the vanilla bert model (which recall is only tuned on english text) generally outperforms a variety of approaches across three test languages. this is particularly remarkable because it is a single trained model that is effective across all three languages, without any difference in parameters. the exceptions are the arabic 2001 dataset, in which it performs only comparably to bm25 and the map results for spanish. for spanish, rm3 is able to substantially improve recall (as evidenced by map), and since vanilla bert acts as a re-ranker atop bm25, it is unable to take advantage of this improved recall, despite significantly improving the precision-focused metrics. in all cases, vanilla bert exhibits judged@20 above 85%, indicating that these test collections are still valuable for evaluation. to test whether a small amount of in-language training data can further improve bert ranking performance, we conduct an experiment that uses the other collection for each language as additional training data. the in-language samples are interleaved into the english training samples. results for this fewshot setting are shown in table 2 . we find that the added topics for arabic 2001 (+50) and spanish 4 (+25) significantly improve the performance. this results in a model significantly better than bm25 for arabic 2001, which suggests that there may be substantial distributional differences in the english trec 2004 training and arabic 2001 test collections. we further back this up by training an "oracle" bert model (training on the test data) for arabic 2001, which yields a model substantially better (p@20 = 0.7340, ndcg@20 = 0.8093, map = 0.4250). we introduced a zero-shot multilingual setting for evaluation of neural ranking methods. this is an important setting due to the lack of training data available in many languages. we found that contextualized languages models (namely, bert) have a big upper-hand, and are generally more suitable for cross-lingual performance than prior models (which may rely more heavily on phenomena exclusive to english). we also found that additional in-language training data may improve the performance, though not necessarily. by releasing our code and models, we hope that cross-lingual evaluation will become more commonplace. 
probabilistic models of information retrieval based on measuring the divergence from randomness ms marco: a human generated machine reading comprehension dataset clef 2003 -overview of results cross-language information retrieval (clir) track overview learning to rank: from pairwise approach to listwise approach a survey of automatic query expansion in information retrieval trec 2014 web track overview word translation without parallel data deeper text understanding for ir with contextual neural language modeling bert: pre-training of deep bidirectional transformers for language understanding overview of the fourth text retrieval conference (trec-4) overview of the third text retrieval conference (trec-3) pacrr: a position-aware neural ir model for relevance matching parameters learned in the comparison of retrieval models using term dependencies google's multilingual neural machine translation system: enabling zero-shot translation cross-lingual transfer learning for pos tagging without cross-lingual resources adam: a method for stochastic optimization cross-lingual language model pretraining unsupervised cross-lingual information retrieval using monolingual data only cedr: contextualized embeddings for document ranking a markov random field model for term dependencies an introduction to neural information retrieval the trec 2002 arabic/english clir track neural information retrieval: at the end of the early years multilingual information retrieval: from research to practice cross-lingual learning-torank with shared representations cross-lingual transfer learning for multilingual task oriented dialog cross-lingual relevance transfer for document retrieval the sixth text retrieval conference (trec-6) overview of the trec 2005 robust retrieval track overview of the fifth text retrieval conference (trec-5) monolingual and cross-lingual information retrieval models based on (bilingual) word embeddings end-to-end neural ad-hoc ranking with kernel pooling anserini: reproducible ranking baselines using lucene simple applications of bert for ad hoc document retrieval transfer learning for sequence tagging with hierarchical recurrent networks the digital language divide the 41st international acm sigir conference on research & development in information retrieval this work was supported in part by arcs foundation. key: cord-020794-d3oru1w5 authors: leekha, maitree; goswami, mononito; jain, minni title: a multi-task approach to open domain suggestion mining using language model for text over-sampling date: 2020-03-24 journal: advances in information retrieval doi: 10.1007/978-3-030-45442-5_28 sha: doc_id: 20794 cord_uid: d3oru1w5 consumer reviews online may contain suggestions useful for improving commercial products and services. mining suggestions is challenging due to the absence of large labeled and balanced datasets. furthermore, most prior studies attempting to mine suggestions, have focused on a single domain such as hotel or travel only. in this work, we introduce a novel over-sampling technique to address the problem of class imbalance, and propose a multi-task deep learning approach for mining suggestions from multiple domains. experimental results on a publicly available dataset show that our over-sampling technique, coupled with the multi-task framework outperforms state-of-the-art open domain suggestion mining models in terms of the f-1 measure and auc. consumers often express their opinions towards products and services through online reviews and discussion forums. 
these reviews may include useful suggestions that can help companies better understand consumer needs and improve their products and services. however, manually mining suggestions amid vast numbers of non-suggestions can be cumbersome, and equated to finding needles in a haystack. therefore, designing systems that can automatically mine suggestions is essential. the recent semeval [6] challenge on suggestion mining saw many researchers using different techniques to tackle the domain-specific task (in-domain suggestion mining). however, open-domain suggestion mining, which obviates the need for developing separate suggestion mining systems for different domains, is still an emerging research problem. we formally define the problem of open-domain suggestion mining as follows: building on the work of [5] , we design a framework to detect suggestions from multiple domains. we formulate a multitask classification problem to identify both the domain and nature (suggestion or non-suggestion) of reviews. furthermore, we also propose a novel language model-based text over-sampling approach to address the class imbalance problem. we use the first publicly available and annotated dataset for suggestion mining from multiple domains created by [5] . it comprises of reviews from four domains namely, hotel, electronics, travel and software. during pre-processing, we remove all urls (eg. https:// ...) and punctuation marks, convert the reviews to lower case and lemmatize them. we also pad the text with start s and end e symbols for over-sampling. one of the major challenges in mining suggestions is the imbalanced distribution of classes, i.e. the number of non-suggestions greatly outweigh the number of suggestions (refer table 1 ). to this end, studies frequently utilize synthetic minority over-sampling technique (smote) [1] to over-sample the minority class samples using the text embeddings as features. however, smote works in table 1 . datasets and their sources used in our study [5] . the class ratio column highlights the extent of class imbalance in the datasets. the travel datasets have lower inter-annotator agreement than the rest, indicating that they may contain confusing reviews which are hard to confidently classify as suggestions or non-suggestions. this also reflects in our classification results. the euclidean space and therefore does not allow an intuitive understanding and representation of the over-sampled data, which is essential for qualitative and error analysis of the classification models. we introduce a novel over-sampling technique, language model-based over-sampling technique (lmote), exclusively for text data and note comparable (and even slightly better sometimes) performance to smote. we use lmote to over-sample the number of suggestions before training our classification model. for each domain, lmote uses the following procedure to over-sample suggestions: find top η n-grams: from all reviews labelled as suggestions (positive samples), sample the top η = 100 most frequently occurring n-grams (n = 5). for example, the phrase "nice to be able to" occurred frequently in many domains. train a bilstm language model on the positive samples (suggestions). the bilstm model predicts the probability distribution of the next word (w t ) over the whole vocabulary (v ∪ e) based on the last n = 5 words (w t−5 , . . . 
, w t−1 ), i.e., the model learns to predict the probability distribution $p(w_t \mid w_{t-5}, \ldots, w_{t-1})$. generate n-grams: using the language model and a randomly chosen frequent 5-gram as the seed, we generate text by repeatedly predicting the most probable next word ($w_t$), until the end symbol e is predicted. table 2 comprises the most frequent 5-grams and their corresponding suggestions 'sampled' using lmote. in our study, we generate synthetic positive reviews till the number of suggestion and non-suggestion class samples becomes equal in the training set.
5: seed ← random(n-grams)
6: sample ← lmotegenerate(language model, seed)
7: s ← s ∪ sample
8: end while
9: return s
algorithm 1 summarizes the lmote over-sampling methodology. following is a brief description of the sub-procedures used in the algorithm: • lmotegenerate(language model, seed): the procedure takes as input the trained language model and a randomly chosen n-gram from the set of top η n-grams as seed, and starts generating a review till the end tag e is produced. the procedure is repeated until we have a total of n suggestion reviews. multi-task learning (mtl) has been successful in many applications of machine learning since sharing representations between auxiliary tasks allows models to generalize better on the primary task. figure 1b illustrates a 3-dimensional umap [4] visualization of text embeddings of suggestions, coloured by their domain. these embeddings are outputs of the penultimate layer (the dense layer before the final softmax layer) of the single-task (stl) ensemble baseline. it can be clearly seen that suggestions from different domains may have varying feature representations. therefore, we hypothesize that we can identify suggestions better by leveraging domain-specific information using mtl. in the mtl setting, given a review $r_i$ in the dataset $d$, we aim to identify both the domain of the review and its nature. we use an ensemble of three architectures, namely a cnn [2] to mirror the spatial perspective and preserve the n-gram representations; an attention network to learn the most important features automatically; and a bilstm-based text rcnn [3] model to capture the context of a text sequence (fig. 2). in the mtl setting, the ensemble has two output softmax layers, to predict the domain and nature of a review. the stl baselines, on the contrary, only have a single softmax layer to predict the nature of the review. we use elmo [7] word embeddings trained on the dataset as input to the models. we conducted experiments to assess the impact of over-sampling, the performance of lmote and the multi-task model. we used the same train-test split as provided in the dataset for our experiments. all comparisons have been made in terms of the f-1 score of the suggestion class for a fair comparison with prior work on representational learning for open domain suggestion mining [5] (see the baseline in table 3). for a more insightful evaluation, we also compute the area under the receiver operating characteristic (roc) curve for all models used in this work. tables 3 and 4 report these results. over-sampling improves performance. to examine the impact of over-sampling, we compared the performance of our ensemble classifier with and without over-sampling, i.e. we compared results under the stl, stl + smote and stl + lmote columns. our results confirm that, in general, over-sampling suggestions to obtain a balanced dataset improves the performance (f-1 score & auc) of our classifiers.
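the following python sketch mirrors the lmote sampling loop of algorithm 1 above; the bilstm language model is not implemented here and is assumed to be available as a next_word callable, and the remaining names and defaults (e.g. max_len) are illustrative.

```python
# sketch of the lmote sampling loop (algorithm 1); the bilstm language model
# is assumed to be available as `next_word`, a callable returning the most
# probable next token given the last five tokens.
import random
from collections import Counter
from typing import Callable, List, Set

END = "e"  # end symbol padded onto every review during pre-processing

def top_ngrams(suggestions: List[List[str]], n: int = 5, eta: int = 100):
    counts = Counter(tuple(s[i:i + n]) for s in suggestions
                     for i in range(len(s) - n + 1))
    return [list(gram) for gram, _ in counts.most_common(eta)]

def lmote(suggestions: List[List[str]], n_needed: int,
          next_word: Callable[[List[str]], str],
          max_len: int = 60, max_tries: int = 100000) -> Set[str]:
    seeds = top_ngrams(suggestions)
    synthetic: Set[str] = set()
    tries = 0
    while len(synthetic) < n_needed and tries < max_tries:
        tries += 1
        tokens = list(random.choice(seeds))
        while tokens[-1] != END and len(tokens) < max_len:
            tokens.append(next_word(tokens[-5:]))   # generate until the end symbol
        synthetic.add(" ".join(tokens))
    return synthetic
```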
we compared the performance of smote and lmote in the single task settings (stl + smote and stl + lmote ) and found that lmote performs comparably to smote (and even outperforms it in the electronics and software domains). lmote also has the added advantage of resulting in intelligible samples which can be used to qualitatively analyze and troubleshoot deep learning based systems. for instance, consider suggestions created by lmote in table 2 . while the suggestions may not be grammatically correct, their constituent phrases are nevertheless semantically sensible. multi-task learning outperforms single-task learning. we compared the performance of our classifier in single and multi-task settings (stl + lmote and mtl + lmote ) and found that by multi-task learning improves the performance of our classifier. we qualitatively analysed the single and multi task models, and found many instances where by leveraging domain-specific information the multi task model was able to accurately identify suggestions. for instance, consider the following review: "bring a lan cable and charger for your laptop because house-keeping doesn't provide it." while the review appears to be an assertion (non-suggestion), by predicting its domain (hotel), the multitask model was able to accurately classify it as a suggestion. in this work, we proposed a multi-task learning framework for open domain suggestion mining along with a novel language model based over-sampling technique for text-lmote. our experiments revealed that multi-task learning combined with lmote over-sampling outperformed considered alternatives in terms of both the f1-score of the suggestion class and auc. smote: synthetic minority over-sampling technique convolutional neural networks for sentence classification recurrent convolutional neural networks for text classification umap: uniform manifold approximation and projection for dimension reduction suggestion mining from text semeval-2019 task 9: suggestion mining from online reviews and forums. in: semeval@naacl-hlt deep contextualized word representations key: cord-020835-n9v5ln2i authors: jangra, anubhav; jatowt, adam; hasanuzzaman, mohammad; saha, sriparna title: text-image-video summary generation using joint integer linear programming date: 2020-03-24 journal: advances in information retrieval doi: 10.1007/978-3-030-45442-5_24 sha: doc_id: 20835 cord_uid: n9v5ln2i automatically generating a summary for asynchronous data can help users to keep up with the rapid growth of multi-modal information on the internet. however, the current multi-modal systems usually generate summaries composed of text and images. in this paper, we propose a novel research problem of text-image-video summary generation (tivs). we first develop a multi-modal dataset containing text documents, images and videos. we then propose a novel joint integer linear programming multi-modal summarization (jilp-mms) framework. we report the performance of our model on the developed dataset. advancement in technology has led to rapid growth of multimedia data on the internet, which prevent users from obtaining important information efficiently. summarization can help tackle this problem by distilling the most significant information from the plethora of available content. recent research in summarization [2, 11, 31] has proven that having multi-modal data can improve the quality of summary in comparison to uni-modal summaries. multi-modal information can help users gain deeper insights. 
including supportive representation of text can reach out to a larger set of people including those who have reading disabilities, users who have less proficiency in the language of text and skilled readers who are looking to skim the information quickly [26] . although visual representation of information is more expressive and comprehensive in comparison to textual description of the same information, it is still not a thorough model of representation. encoding abstract concepts like guilt or freedom [11] , geographical locations or environmental features like temperature, humidity etc. via images is impractical. also images are a static medium and cannot represent dynamic and sequential information efficiently. including videos could then help overcome these barriers since video contains both visual and verbal information. to the best of our knowledge, all the previous works have focused on creating text or text-image summaries, and the task of generating an extractive multimodal output containing text, images and videos from a multi-modal input has not been done before. we thus focus on a novel research problem of text-imagevideo summary generation (tivs). to tackle the tivs task, we design a novel integer linear programming (ilp) framework that extracts the most relevant information from the multimodal input. we set up three objectives for this task, (1) salience within modality, (2) diversity within modality and (3) correspondence across modalities. for preprocessing the input, we convert the audio into text using an automatic speech recognition (asr) system, and we extract the key-frames from video. the most relevant images and videos are then selected in accordance with the output generated by our ilp model. to sum up, we make the following contributions: (1) we present a novel multimodal summarization task which takes news with images and videos as input, and outputs text, images and video as summary. (2) we create an extension of the multi-modal summarization dataset [12] by constructing multi-modal references containing text, images and video for each topic. (3) we design a joint ilp framework to address the proposed multi-modal summarization task. text summarization techniques are used to extract important information from textual data. a lot of research has been done in the area of extractive [10, 21] and abstractive [3, 4, 19, 23] summarization. various techniques like graph-based methods [6, 15, 16] , artificial neural networks [22] and deep learning based approaches [18, 20, 29] have been developed for text summarization. integer linear programming (ilp) has also shown promising results in extractive document summarization [1, 9] . duan et al. [5] proposed a joint-ilp framework that produces summaries from temporally separate text documents. recent years have shown great promise in the emerging field of multi-modal summarization. multi-modal summarization has various applications ranging from meeting recordings summarization [7] , sports video summarization [25] , movie summarization [8] to tutorial summarization [13] . video summarization [17, 28, 30] is also a major sub-domain of multi-modal summarization. a few deep learning frameworks [2, 11, 31] show promising results, too. li et al. [12] uses an asynchronous dataset containing text, images and videos to generate a textual summary. although some work on document summarization has been done using ilp, to the best of our knowledge no one has ever used an ilp framework in the area of multi-modal summarization. 
our objective is to generate a multimodal summary $s = \{x_{sum}, i_{sum}, v_{sum}\}$ such that the final summary $s$ covers all the important information in the original data while minimizing the length of the summary, where each topic in our dataset comprises text documents, images, audio and videos. as shown in fig. 1, we first extract key-frames from the videos [32]. these key-frames together with images from the original data form the image-set. the audio is transcribed into text (ibm watson speech-to-text service: www.ibm.com/watson/developercloud/speech-to-text.html), which contributes to the text-set together with the sentences from the text documents. the images from the image-set are encoded by the vgg model [24] and the 4,096-dimensional vector from the pre-softmax layer is used as the image representation. every sentence from the text-set is encoded using the hybrid gaussian-laplacian mixture model (hglmm) into a 6,000-dimensional vector. for text-image matching, these image and sentence vectors are fed into a two-branch neural network [27] to obtain a 512-dimensional vector for images and sentences in a shared space. ilp is a global optimization technique, used to maximize or minimize an objective function subject to some constraints. in this paper, we propose a joint-ilp technique to optimize the output to have high salience, diversity and cross-modal correlation. the idea of joint-ilp is similar to the one applied in the field of across-time comparative summarization [5]. however, to the best of our knowledge, an ilp framework has not been used before to solve multi-modal summarization (the gurobi optimizer is used for ilp optimization: https://www.gurobi.com/). decision variables. $m^{txt}$ is an $n \times n$ binary matrix such that $m^{txt}_{i,i}$ indicates whether sentence $s_i$ is selected as an exemplar or not, and $m^{txt}_{i,j \neq i}$ indicates whether sentence $s_i$ votes for $s_j$ as its representative. similarly, $m^{img}$ is a $p \times p$ binary matrix that indicates the exemplars chosen in the image set. $m^{c}$ is an $n \times p$ binary matrix that indicates the cross-modal correlation. $m^{c}_{i,j}$ is true when there is some correlation between sentence $s_i$ and image $i_j$. the notation $(mod, t, item) \in \{(txt, n, s), (img, p, i)\}$ is used to represent multiple modalities together in a simple way. we need to maximize the objective function in eq. 1, containing the salience of text, the salience of images and the cross-modal correlation. similar to the joint-ilp formulation in [5], the diversity objective is implicit in this model. equation 4 generates the set of entities that are part of the cluster whose exemplar is $item_i$. the salience is calculated by eqs. 2 and 3 by taking the cosine similarity of all the exemplars with the items belonging to their representative clusters, separately for each modality. the cross-modal correlation score is calculated in eq. 5. equation 7 ensures that exactly $k^{txt}$ and $k^{img}$ clusters are formed in their respective uni-modal vector spaces. equation 8 guarantees that an entity can either be an exemplar or be part of a single cluster. according to eq. 9, a sentence or image must be an exemplar in its respective vector space to be included in the sentence-image summary pairs. the values of $m$, $k^{txt}$ and $k^{img}$ are set to 10, same as in [5]. the joint-ilp framework outputs the text summary ($x_{sum}$) and the top-$m$ images from the image-set. this output is used to prepare the image and video summary. equation 11 selects all those images from the top-10 images that are not key-frames.
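a minimal sketch of the joint-ilp core is given below using pulp (the paper uses the gurobi optimizer; pulp is chosen here only to keep the example licence-free). the similarity matrices, weights and the exemplar/voting constraints follow the description above, but eqs. 1-9 are not reproduced in the text, so the exact formulation here is an assumption.

```python
# pulp sketch of the joint-ilp core; sim_t, sim_i and sim_c are precomputed
# cosine-similarity matrices in the shared space, weights are illustrative.
import pulp

def joint_ilp(sim_t, sim_i, sim_c, k_txt=10, k_img=10, w=(1.0, 1.0, 1.0)):
    n, p = len(sim_t), len(sim_i)
    prob = pulp.LpProblem("joint_mms", pulp.LpMaximize)
    Mt = pulp.LpVariable.dicts("Mtxt", (range(n), range(n)), cat="Binary")
    Mi = pulp.LpVariable.dicts("Mimg", (range(p), range(p)), cat="Binary")
    Mc = pulp.LpVariable.dicts("Mc", (range(n), range(p)), cat="Binary")

    sal_t = pulp.lpSum(sim_t[i][j] * Mt[i][j] for i in range(n) for j in range(n) if i != j)
    sal_i = pulp.lpSum(sim_i[i][j] * Mi[i][j] for i in range(p) for j in range(p) if i != j)
    corr = pulp.lpSum(sim_c[i][j] * Mc[i][j] for i in range(n) for j in range(p))
    prob += w[0] * sal_t + w[1] * sal_i + w[2] * corr          # objective (eq. 1, sketched)

    prob += pulp.lpSum(Mt[i][i] for i in range(n)) == k_txt    # exactly k_txt exemplars
    prob += pulp.lpSum(Mi[i][i] for i in range(p)) == k_img    # exactly k_img exemplars
    for i in range(n):
        prob += pulp.lpSum(Mt[i][j] for j in range(n)) == 1    # exemplar or one cluster
        for j in range(n):
            if i != j:
                prob += Mt[i][j] <= Mt[j][j]                   # vote only for exemplars
    for i in range(p):
        prob += pulp.lpSum(Mi[i][j] for j in range(p)) == 1
        for j in range(p):
            if i != j:
                prob += Mi[i][j] <= Mi[j][j]
    for i in range(n):
        for j in range(p):
            prob += Mc[i][j] <= Mt[i][i]   # cross-modal pairs link exemplars only
            prob += Mc[i][j] <= Mi[j][j]

    prob.solve()
    return Mt, Mi, Mc
```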
assuming that images which look similar would have similar annotation scores and would help users gain more insight, the images relevant to the images in i sum1 (at least with α cosine similarity) but not too similar (at max with β cosine similarity) to avoid redundancy are also selected to be a part of the final image summary i sum (eq. 12). α is set to 0.4 and β is 0.8 in our experiments. extracting video. for each video, weighted sum of visual (eq. 13) and verbal (eq. 14) scores is computed. the video with the highest score is selected as our video summary. where kf is the set of all key-frames and st is the set of speech transcriptions. there is no benchmark dataset for the tivs task. therefore, we created our own text-image-video dataset by extending and manually annotating the multi-modal summarization dataset introduced by li et al. [12] . their dataset comprised of 25 new topics. each topic was composed of 20 text documents, 3 to 9 images, and 3 to 8 videos. the final summary however was unimodal, that is, in the form of only a textual summary containing around 300 words. we then extended it by selecting some images and a video for each topic that summarize the topic well. three undergraduate students were employed to score the images and videos with respect to the benchmark text references. all annotators scored each image and video on a scale of 1 to 5, on the basis of similarity between the image/video and the text references (1 indicating no similarity and 5 denoting the highest level of similarity). average annotation scores (aas) were calculated for each image and video. the value of the minimum average annotation score for images was kept as a hyper-parameter to evaluate the performance of our model in various settings 2 . the video with the highest score is chosen to be the video component of the multi-modal summary 3 . we evaluate the performance of our model using the dataset as described above. we use the rouge scores [14] to evaluate the textual summary, and based on them we compare our results with the ones of three baselines. we use the multi-document summarization model proposed in [1] . for baseline-1 we feed the model with embedded sentences from all the original documents together. the central vector is calculated as the average of all the sentence vectors. the model is given vectors for sentences from the text-set and images from the image-set in the joint space for other baselines. for baseline-2, the average of all the vectors is taken as the central vector. for baseline-3, the central vector is calculated as the weighted average of all the sentence and image vectors. we give equal weights to text, speech and images for simplicity. as shown in table 1 , our model produces better results than the prepared baselines in terms of rouge-2 and rouge-l scores. table 2 shows the average precision and recall scores as well as the variance. we set various threshold values for the annotation scores to generate multiple image test sets in order to evaluate the performance of our model. we get a higher precision score for low aas value, because the number of images in the final solution increases on decreasing the threshold values. the proposed model gave 44% accuracy in extracting the most appropriate video (whereas random selection of images for 10 different iterations gives an average 16% accuracy). unlike other problems that focus on text-image summarization, we propose to generate a truly multi-modal summary comprising of text, images and video. 
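the sketch below illustrates the image expansion with the α/β cosine-similarity thresholds and the weighted visual+verbal video scoring described earlier in this section; the precise score definitions of eqs. 13-14 are not reproduced in the text, so the scoring here is an assumption and all names are illustrative.

```python
# sketch of the image expansion (alpha/beta cosine thresholds) and of a
# weighted visual+verbal video score; the scoring details are assumptions.
import numpy as np

def cos(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def expand_images(i_sum1, candidates, alpha=0.4, beta=0.8):
    selected = list(i_sum1)
    for img in candidates:
        sims = [cos(img, s) for s in i_sum1]
        if any(s >= alpha for s in sims) and all(s <= beta for s in sims):
            selected.append(img)                 # relevant but not redundant
    return selected

def best_video(videos, text_summary_vecs, image_summary_vecs, w_visual=0.5):
    def score(v):
        visual = np.mean([max(cos(kf, s) for s in image_summary_vecs)
                          for kf in v["keyframes"]])
        verbal = np.mean([max(cos(st, x) for x in text_summary_vecs)
                          for st in v["transcript_vecs"]])
        return w_visual * visual + (1 - w_visual) * verbal
    return max(videos, key=score)
```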
we also develop a dataset for this task, and propose a novel joint ilp framework to tackle this problem. multi-document summarization model based on integer linear programming abstractive text-image summarization using multi-modal attentional hierarchical rnn fast abstractive summarization with reinforce-selected sentence rewriting abstractive sentence summarization with attentive recurrent neural networks across-time comparative summarization of news articles lexrank: graph-based lexical centrality as salience in text summarization multimodal summarization of meeting recordings multimodal saliency and fusion for movie summarization based on aural, visual, and textual attention extractive multi-document summarization with integer linear programming and support vector regression a trainable document summarizer multi-modal sentence summarization with modality attention and image filtering multi-modal summarization for asynchronous collection of text, image multimodal abstractive summarization for open-domain videos rouge: a package for automatic evaluation of summaries graph-based ranking algorithms for sentence extraction, applied to text summarization textrank: bringing order into text streaming non-monotone submodular maximization: personalized video summarization on the fly summarunner: a recurrent neural network based sequence model for extractive summarization of documents abstractive text summarization using sequence-to-sequence rnns and beyond classify or select: neural architectures for extractive document summarization constructing literature abstracts by computer: techniques and prospects extractive single document summarization using multi-objective optimization: exploring self-organized differential evolution, grey wolf optimizer and water cycle algorithm. knowl.-based syst get to the point: summarization with pointergenerator networks very deep convolutional networks for large-scale image recognition multi-modal summarization of key events and top players in sports tournament videos multimodal summarization of complex sentences learning deep structure-preserving image-text embeddings video summarization via semantic attended networks multiview convolutional neural networks for multidocument extractive summarization deep reinforcement learning for unsupervised video summarization with diversity-representativeness reward msmo: multimodal summarization with multimodal output adaptive key frame extraction using unsupervised clustering key: cord-020903-qt0ly5d0 authors: tamine, lynda; melgarejo, jesús lovón; pinel-sauvagnat, karen title: what can task teach us about query reformulations? date: 2020-03-17 journal: advances in information retrieval doi: 10.1007/978-3-030-45439-5_42 sha: doc_id: 20903 cord_uid: qt0ly5d0 a significant amount of prior research has been devoted to understanding query reformulations. the majority of these works rely on time-based sessions which are sequences of contiguous queries segmented using time threshold on users’ activities. however, queries are generally issued by users having in mind a particular task, and time-based sessions unfortunately fail in revealing such tasks. in this paper, we are interested in revealing in which extent time-based sessions vs. task-based sessions represent significantly different background contexts to be used in the perspective of better understanding users’ query reformulations. 
using insights from large-scale search logs, our findings clearly show that task is an additional relevant search unit that helps better understanding user’s query reformulation patterns and predicting the next user’s query. the findings from our analyses provide potential implications for model design of task-based search engines. query reformulation is a critical user behaviour in modern search engines and it is still addressed by a significant amount of research studies [10] [11] [12] 17, 23, 26, 33] . a salient behavioural facet that has been widely captured and analysed by those studies is query history. the latter is generally structured into "query sessions" which are sequences of queries submitted by a user while completing a search activity with a search system. in the literature review, there are many definitions of query sessions. the widely used definitions are the following [19, 25] : (1) a time-based session, also called physical session in [6] , is a set of consecutive queries automatically delimited using a time-out threshold on user's activities. time-gap values of 30 min and 90 min have been the most commonly used in previous research [4, 6, 9, 19] ; (2) a task-based session, also called mission in [6] , is a set of queries that are possibly neither consecutive nor within the same timebased session. the queries belong to related information needs that are driven by a goal-oriented search activity, called search task (eg., job search task). the latter could be achieved by subsets of consecutive related queries called logical sessions in [6] or subtasks in [9] . previous research [4, 7, 20, 21] showed that: (1) users have a natural multitasking behaviour by intertwining different tasks during the same time-based session; and that (2) users possibly interleave the same task at different timestamps in the same time-based session or throughout multiple time-based sessions (ie., multi-session tasks). such long-term tasks are acknowledged as being complex tasks [7, 9] . figure 1 shows a sample of 3 time-based search sessions extracted from the webis-smc-12 search corpus [6] for a single user. the sessions are manually annotated with tasks. as can be seen, 6 tasks (task 1 -task 6) are performed by the user during these 3 sessions. we can observe that all these sessions are multi-tasking, since they include queries that relate to multiple tasks (eg., session 1 is multi-tasking since it includes queries that relate to task 1, 2, 3 and 4). we can also see that task 1 and task 3 are interleaved within and across sessions (eg., task 1 is interleaved within session 1 and across session 1, 2 and 3). thus, tasks 1 and 3 are multi-session tasks. while it is well-known that time-based session detection methods fail in revealing tasks [6, 19] , most of previous research work has employed time-based sessions as the focal units of analysis for understanding query reformulations [10] [11] [12] 26, 33] . other works rather studied users' query reformulations from the task perspective through user studies [15, 17, 29] . however, the authors analysed low-scale pre-designed search tasks conducted in controlled laboratory settings. in addition to their limited ability to observe natural search behaviour, there is a clear lack of comparability in search tasks across those studies. 
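for reference, a minimal sketch of the time-based segmentation described above, where a new session starts whenever the gap between consecutive queries of the same user exceeds the 30- or 90-minute threshold:

```python
# time-based segmentation: a new session starts whenever the gap between two
# consecutive queries of the same user exceeds the time-out threshold.
from datetime import timedelta

def time_based_sessions(user_queries, gap_minutes=30):
    """user_queries: list of (timestamp, query) tuples sorted by timestamp."""
    sessions, current = [], []
    for ts, q in user_queries:
        if current and ts - current[-1][0] > timedelta(minutes=gap_minutes):
            sessions.append(current)
            current = []
        current.append((ts, q))
    if current:
        sessions.append(current)
    return sessions
```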
to design support processes for task-based search systems, we argue that we need to: (1) fully understand how user's task performed in natural settings drives the query reformulations changes; and (2) gauge the level of similarity of these changes trends with those observed in time-based sessions. our ultimate goal is to gain insights regarding the relevance of using user's tasks as the focal units of search to both understand and predict query reformulations. with this in mind, we perform large-scale log analyses of users naturally engaged in tasks to examine query reformulations from both the time-based session vs. task-based session perspectives. moreover, we show the role of the task characteristics in predicting the next user's query. our findings clearly show that task is an additional relevant search unit that helps to better understand user's query reformulation patterns and to predict the next user's query. query reformulation has been the focus of a large body of work. a high number of related taxonomies have been proposed [5, 11, 16] . to identify query reformulation patterns, most of the previous works used large-scale log analyses segmented into time-based sessions. different time gaps have been used including 10-15 min [8] , 30 min [4, 19] and 90 min [6, 9] . in a significant body of work, authors categorised the transitions made from one query to the subsequent queries through syntactic changes [11, 12, 23, 26] and query semantic changes [10, 12, 33] . syntactic changes include word substitution, removing, adding and keeping. the results highlighted that the query and its key terms evolve throughout the session regardless of the query position in the session. moreover, such strategies are more likely to cause clicks on highly ranked documents. further experiments on semantic query changes through generalisation vs. specialisation [10, 12] showed that a trend exists toward going from generalisation to specialisation. this behavioural pattern represents a standard building-box strategy while specialisation occurs early in the session. another category of work rather employed lab user studies to understand how different task characteristics impact users' query reformulations [15, 17, 18, 28, 31, 32] . the results mainly revealed that: (1) the domain knowledge of the task doer significantly impacts query term changes. for instance, wildemuth [31] found that search tactics changed while performing the task as users' domain knowledge evolved; (2) the cognitive complexity and structure of the task (eg., simple, hierarchical, parallel) has a significant effect on users' query reformulation behavior. for instance, liu et al. [17] found that specialisation in parallel tasks was significantly less frequent than in simple and hierarchical tasks. a few work [4, 22] used large-scale web search logs annotated with tasks to understand query reformulations. the findings in [4] were consistent with log-based studies [26] showing that page visits have significant influence on the vocabulary of subsequent queries. odijk et al. [22] studied the differences in users' reformulation strategies within successful vs. unsuccessful tasks. using a crowd-sourcing methodology, the authors showed that query specialisation through term adding is substantially more common in successful tasks than in unsuccessful tasks. it also appeared that actions such as formulating the same query than the previous one and reformulating completely a new query are rather relevant signals of unsuccessful tasks. 
we make several contributions over prior work. first, to the best of our knowledge, no previous study examined the differences in query reformulation strategies from the two perspectives of time-based sessions and task-based sessions viewed as background contexts. insights gleaned from our data analysis have implications for designing task-based search systems. second, although there has been intensive research on query reformulation, we provide a new insight into the variation of query reformulation strategies. the latter are analysed in relation with search episode size (short, medium and long) and search stage (start, middle and end ) from two different viewpoints (stream of query history and the search task progress). third, building on the characterisation of search tasks, we provide insights on how considering task features might improve a supervised predictive model of query reformulations. this analysis is carried out using the freely available webis-smc-12 search corpus 1 [1, 6] extracted from the 2006 aol query log which is a very large collection of web queries. the released corpus comprises 8800 queries. we remove the repeated successive queries that were automatically generated following a click instead of a user's reformulation. we also remove all non-alphanumeric characters from the queries and apply a lowercasing. the cleaned data finally include 4734 queries submitted by 127 unique users. the query log is automatically segmented into time-based sessions using a time-gap threshold on users' activities. since there is so far no agreement about the most accurate time-out threshold for detecting session boundaries [9, 19] , we consider the two widely used time-gap values between successive queries: 30 min as done in [4, 19] and 90 min as done in [6, 9] . we also use the provided manual annotations to segment the query log into task-based sessions. for care of simplicity, we subsequently refer to time-based session as "session" and we refer to task-based session as "task ". table 1 presents the data collection statistics. one immediate observation is that the average number of queries in tasks (3.45) is higher than that of the sessions (eg., 2.04 in the 30 min-sessions) as reported in [9, 19] . the total percentage of multi-tasking sessions is roughly 13% (resp. 16%) of the 30 min-session (resp. 90 min-session). higher statistics (50%) were reported in [19] . however, we found that there are only 30.28% (resp. 31.27%) of the 30-min sessions (resp. 90-min sessions) that include only 1 task that is non interleaved throughout the user's search history. thus, the 70% remaining sessions are either multi-tasking or include interleaved tasks that reoccur in multiple sessions. similar statistics were observed in previous work (eg., 68% in [9] ). another interesting observation is that a high percentage of tasks (23.23%) are interleaved, which is roughly comparable to that of previous studies (eg., 17% in [14] ), or spanned over multiple sessions (e.g, 27.09% of tasks spanned over multiple 30-min sessions). sim(qi, qi+1) jaccard query pair similarity to study query reformulations, we consider the three usual categories of syntactic changes [11, 13, 26] between successive query pairs (q i , q i+1 ) composed of s(q i ) and s(q i+1 ) term sets respectively: (1) query term-retention rr; (2) query termremoval rm acts as search generalisation [12, 13] ; and (3) query term-adding ra acts as search specialisation [12, 13] . 
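the sketch below computes the per-pair features just introduced (term-retention rr, term-removal rm, term-adding ra, and jaccard similarity); since table 2 is not reproduced here, the exact normalisation of each ratio is an assumption, as is the whitespace tokenisation.

```python
# per-pair reformulation features; the denominators and the whitespace
# tokenisation are assumptions.
def reformulation_features(q_i: str, q_next: str) -> dict:
    s1, s2 = set(q_i.lower().split()), set(q_next.lower().split())
    union = s1 | s2
    return {
        "rr": len(s1 & s2) / len(s2) if s2 else 0.0,          # term retention
        "rm": len(s1 - s2) / len(s1) if s1 else 0.0,          # term removal (generalisation)
        "ra": len(s2 - s1) / len(s2) if s2 else 0.0,          # term adding (specialisation)
        "sim": len(s1 & s2) / len(union) if union else 0.0,   # jaccard similarity
    }
```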
for each query pair, we compute the similarity and the query reformulation features presented in table 2 , both at the sessions and tasks levels (sect. 5). here, our objective is twofold: (1) we investigate how query length (ie., # query terms) varies across the search stages within sessions and tasks of different sizes (ie., # queries); and (2) we examine in what extent the trends of query length changes observed within tasks are similar to those observed within sessions. to make direct comparisons of trends between sessions and tasks with different sizes in a fair way, we first statistically partition the search sessions and tasks into three balanced categories (short, medium and long). to do so, we compute the cumulative distribution function (cdf) of session size values for the 30-min and the 90-min sessions, as well as the cdf of task size values in relation with the number of included queries. then, we compute the cdf of the search stage values in relation with the query position boundary (start, middle and end ) along each size-based category of sessions vs. tasks. since short sessions and tasks only contain 1 query and consequently do not contain query reformulations, we do not distinguish between the search stages nor consider this category of sessions and tasks in the remainder of the paper. table 3 shows the statistics of the search stages (start, middle, end ) with respect to medium and long sessions and tasks. based on those categorisations, fig. 2 shows the variation of the query length limit within each category of sessions and tasks and along the different search stages. we can see two clear trends. first, queries in both longer sessions and longer tasks generally tend to contain more terms (2.60-2.87 vs. 2.41-2.51 in average). this trend remains along all the different search stages. regarding sessions, previous studies [2] have also shown similar trends in log-based data. regarding tasks, our results suggest that long tasks require to issue more search terms. one could argue that long tasks, that more likely involve complex information needs, lead users to formulate more informative queries. we also relate this observation with previous findings [2] showing that increased success is associated with longer queries, particularly in complex search tasks. second we can surprisingly see that in general, queries observed within sessions whatever their sizes, are slightly longer in average than queries issued within tasks of the same category except at the end of the search stage. by cross-linking with the cdf results presented in table 3 , we expect that this observation particularly relates to long sessions. one possible explanation is that since long sessions are more likely to be multi-tasking (eg., there are 1.57 task in average in the long 90-min sessions vs. 1.29 in the 30-min sessions), the average query length is particularly increased within sessions that include queries at late search stages of the associated tasks (middle, end ). inspired by [13] , we examine query term frequency along the search with respect to session vs. task search context. in contrast to [13] , our underlying intent here is rather to learn more about the impact of search context (ie., session vs. task) on the level of query term reuse. for a query q i belonging to session s and task t and not submitted at the beginning (ie., i > 1), we compute the frequency of each of its terms from the previous queries within the same session q s j (resp. same task q t j ), j = 1..i − 1. 
then, we take the maximal value t r as "maximum term repeat" for query q i if the latter contains at least one term used t r times in previous queries. figure 3a plots the average "maximum term repeat values" for all the queries within all the sessions and tasks ranged by size (short, medium and long). we can see that the term repeat trend across sessions is similar to that reported in [13] . by comparing between the term repeat trends in sessions and tasks, we clearly observe that there are less reformulated queries that do not share any identical terms with the previous queries in tasks (eg., 70% of medium tasks) in comparison to sessions (eg., 75-78% of medium sessions). interestingly, we can see that the difference is particularly higher in the case of long tasks and long sessions (33% vs. 53-54%). however, we can notice that even if the percentage of queries sharing an increased number of terms with previous queries decreases for both medium sessions and medium tasks, the difference is reversed between long sessions and long tasks. it is more likely that query terms are renewed during long tasks which could be explained by shifts in information needs related to the same driving long-term task. figure 3b shows the percentage of reformulated queries for which each reused term occurs at the first time at a given position within sequences from length 1 to 6. it appears that the sources of reused query terms in both tasks and sessions are limited to the two previous queries. more particularly, while we find terms used in the previous query in all (100%) of the reformulated queries in medium sessions and medium tasks, it is more likely to observe reformulated queries containing terms from the two previous queries in long sessions than in long tasks (71% of sessions vs. 46% of tasks). to sum up, the context used for driving query actions is limited to the two previous queries even for long sessions and tasks, with however, a lower level of term reuse in long tasks. given each query q i belonging to session s (resp. task t ), table 4 gives the query reformulation feature values (see table 2 ) for both medium (m) and long (l) sessions and tasks and are computed over: (1) the short-term context (sc), by considering the query reformulation pair observed within the same session s (resp. task t ) (q i , q i+1 ) s (resp. (q i , q i+1 ) t ), i ≥ 1; and (2) the long-term context (lc), by considering the set of successive query reformulation pairs within the same session s (resp. task t ), (q k , q k+1 ) s (resp. (q k , q k+1 ) t ), 1 ≤ k ≤ i. significance of the differences between the "within session" scenario and the "within task" scenario considering either the short-term context (sc) or the long-term context (lc) is computed using the non-paired student t-test. we can see from table 4 that for the whole set of search actions (ie., term-retention rr, termremoval rm and term-adding ra) and similarity values (ie., avg sim), most of the differences between task-based and session-based scenarios are highlighted as significant. more particularly, we can make two key observations: (1) successive queries in both medium and long tasks are significantly more similar (avg sim of 0.27 and 0.25 respectively) than they are in medium and long sessions for both time-out thresholds (avg sim of 0.20-0.23) with higher ratios of term-retention (34% vs. 
25-29%); and (2) the query history along long tasks exhibits a higher topical cohesion (avg sim of 0.24) than it does in long sessions (avg sim of 0.18-0.20) with a higher ratio of term-retention (30% vs. 23-26%) and a lower ratio of term-adding (70% vs. 74-77%) for tasks. all these results are consistent with those obtained through the analysis of query term repeat (sect. 4.2). they suggest that longer tasks more likely include topically and lexically closer information needs that might drive subtasks in comparison with long sessions. unlikely, the latter might include multiple and topically different information needs that belong to distinct tasks. to better understand the changes trends along the search, we also examine (fig. 4 ) the query reformulation similarities at different stages of the search sessions vs. tasks by considering both short-term context (sc) and long-term context (lc). we can make from fig. 4 depending on the context used (session vs. task) to make the observation. as outlined earlier through query length analysis (sect. 4.1), sessions might include different ongoing tasks that lead to formulate lexically distinct queries. unlikely, tasks might include different ongoing related subtasks. however, queries are still overall more similar (m = 0.13, sd = 0.23, avg = 0.20) across the search stages in long tasks than they are in long sessions (m = 0.11, sd = 0.17, avg = 0.16), particularly at the end of the search stage. this observation might be related to the better cohesiveness of tasks with increased number of queries since, unlike sessions, they are goal-oriented. through the analyses presented in the previous sections, we have shown that there are significant differences in query reformulation patterns depending potentially on the context used (session or task) to make the observations. the results also indicate that time threshold value used to segment the sessions has no impact on the differences trends. in general, the most significant differences are observed regarding long tasks. informed by these findings, we show in the final contribution of this paper the potential of the task features studied in sects. 4 and 5 for enhancing the performance of a query reformulation predictive model. given a session s = {q 1 , q 2 , . . . , q m −1 , q m }, we aim to predict for each query sequence s k ⊂ s, s k = {q 1 , q 2 . . . , q k−1 , q k }, 1 < k < m, the target query q k given the context c q k defined by queries {q 1 , q 2 . . . , q k−1 ), where q k−1 is the anchor query. evaluation protocol. as usually done in previous work for query autocompletion [13] and next query prediction [3, 24, 27] , we adopt a train-test methodology. we first sort the 30 min-sessions time-wise and partition them into two parts. we use the first 60 day-data for training the predictive model and the remaining 30 days for testing. we use 718 sessions (including 2418 queries) which represent 70% of the dataset as our training set, and 300 sessions (including 998 queries) which represent 30% of the dataset as our testing set. to enable the evaluation of the learning approach, we first produce a set of ground truth suggestions for each test query. to do so, we follow a standard procedure [3, 13, 27] : for each session in the training-test sets, we select as the candidate set, the top-20 queries q k that follows each anchor query q k−1 , ranked by query frequency. 
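a small sketch of the candidate-set construction described above; whether candidates are ranked by follower co-occurrence frequency or by overall query frequency is not fully specified in the text, so follower frequency is assumed here.

```python
# candidate-set construction: for every anchor query, keep the 20 most
# frequent follow-up queries observed in the sessions.
from collections import Counter, defaultdict

def build_candidates(sessions, top_k=20):
    followers = defaultdict(Counter)
    for session in sessions:                      # session: list of query strings
        for anchor, nxt in zip(session, session[1:]):
            followers[anchor][nxt] += 1
    return {anchor: [q for q, _ in counts.most_common(top_k)]
            for anchor, counts in followers.items()}
```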
to assess the contributions of the task context features in predicting the next user's query, we use the baseline ranker, a competitive learning to rank query suggestion model that relies on contextual features [3, 27] . model training. we design the task-aware baseline ranker which we refer to as taskranker. for training purpose, we first generate from the 718 training sessions, 1395 task-based query sequences that are built with respect to the task labels provided in the webis-smc-12 search corpus. we remove the task-based query sequences with only 1 query candidate. for instance, using task labels provided in fig. 1 , we built and then select from session 1 the task-based query sequences {q1, q6}; {q3, q4} with respectively q6 and q4 as the ground truth queries. besides, to guarantee the candidate set includes the target query, we remove the task-based query sequences whose ground truth is not included in the associated candidate sets. after filtering, we obtain 215 cleaned task-based query sequences used for training the taskranker model. similarly to [3, 27] , we use the state-of-the-art boosted regression tree ranking algorithm lamdamart as our supervised ranker. we tune the lamdamart model with parameters of 500 decision trees across all experiments. we use 2 sets of features (30 in total): (1) 10 features related to the analyses conducted in previous sections of the paper (sects. 4, 5) . we use the user-action related features including ratios of termretention (rr ), term-adding (ra), term-removal (rm), and term-repeat (tr ), that are measured using both the short-term (sc) and long-term (lc) contexts. we also use query-similarity related features (avg sim) based on the similarity of the target query q k with short-term context sc (anchor query q k−1 ) and long-term context lc (with the previous queries in c q k ); (2) 20 features that are similar to those previously used for a learning to rank suggestion model, and described in detail in [3, 27] . this set of features includes (a) pairwise and suggestion features based on target query characteristics and anchor query characteristics including length and frequency in the dataset; (b) contextual features that include n-gram similarity values between the suggestion and the 10 most recent queries. note that we extended the baseline ranker released by sordoni et al. [27] 2 . baselines and evaluation metric. we use the conventional models widely used in the literature [3, 13, 27] with respectively q2, q3, q4, q5 and q6 as the ground truth queries. we obtain 1700 session-based query sequences that are then cleaned, similarly to the taskranker by removing query sequences with only 1 query candidate and those with ground truth not included in the associated candidate sets. finally, the ses-sionranker has been trained on 302 cleaned session-based query sequences. similarly to the taskranker, we use the same sets of features (30 in total) learned here at the session level, and we tune it using the lamdamart model. we use the mean reciprocal rank (mrr) which is the commonly used metric for evaluating next query prediction models [3, 24, 27] . the mrr performance of the taskranker and the baselines is measured using the same test subset that includes 150 cleaned session-based query sequences built up on the subset of 698 session-based query sequences generated from the 300 test sessions. the task annotations of the testing test are ignored. table 5 shows the mrr performance for the taskranker and the baselines. 
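for completeness, a minimal sketch of the mrr metric reported in table 5: the reciprocal rank of the ground-truth next query within each ranked candidate list, averaged over test sequences.

```python
# mean reciprocal rank over test sequences (0 when the ground truth is not
# retrieved in the candidate list).
def mean_reciprocal_rank(ranked_candidates, ground_truths):
    rr = []
    for candidates, truth in zip(ranked_candidates, ground_truths):
        rank = candidates.index(truth) + 1 if truth in candidates else None
        rr.append(1.0 / rank if rank else 0.0)
    return sum(rr) / len(rr) if rr else 0.0
```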
the taskranker achieves an improvement of +152.8% with respect to the mps model and an improvement of +10.2% with respect to the sessionranker model. the differences in mrr are statistically significant by the t-test (p < 0.01). it has been shown in previous work [3, 27] that session size has an impact on the performance of context-aware next query prediction models. thus, we report in fig. 5 separate mrr results for each of the medium (2 queries) and the long sessions (≥3 queries) studied in our analyses (sects. 4 and 5). as can be seen, the task-based contextual features particularly help predicting the next query in long sessions (+14, 1% in comparison to the sessionranker, p = 7× 10 −3 ). prediction performance for medium sessions is slightly but not significantly lower (−1, 3% in comparison to the sessionranker, p = 0.65). this result can be expected from the findings risen from our analyses, since long sessions include queries related to 89.9% of long tasks whose cohesive contexts enable more accurate predictions of user's future search intent. better understanding user's query reformulations is important for designing task completion engines. through the analysis of large-scale query logs annotated with task labels, we have revealed significant differences in the query changes trends along the search depending on the retrospective context used, either session or task. we found that queries are even longer in longer tasks with however a lower level of term reuse in tasks than in sessions. in addition, terms are particularly renewed in long tasks indicating clear shifts in information needs. using lexical similarity measures, we have also shown that the query reformulations exhibit a clearer cohesiveness within tasks than within sessions along the different search stages, with however a decreasing level of similarity. finally, we provided insights on the usefulness of task features to enhance the user's next query prediction accuracy. given the crucial lack of query logs with annotated tasks, we acknowledge that the predictive model has been trained and tested with limited amount of data. however, the features used are based on the analysis performed on a large-scale data provided in the webis corpus. thus, we believe that the trend of our results would remain reliable. there are several promising research directions for future work. firstly, evidence related to the characterization of tasks through query length variation and query reformulation similarities along the search, presented in sects. 4 and 5, may benefit research on automatic task boundary detection. in sect. 6, we showed that learning from query streams annotated with tasks helps the query suggestion process particularly for long-term tasks. it will be interesting to design a predictive model of query trails associated with subtasks, by analogy to search trails [30] . this might help users in completing complex tasks by issuing fewer queries. this would decrease the likeliness of search struggling as shown in previous work [22] . 
references.
[1] webis corpus archive
[2] leading people to longer queries
[3] learning to attend, copy, and generate for session-based query suggestion
[4] lessons from the journey: a query log analysis of within-session learning
[5] a unified and discriminative model for query refinement
[6] from search session detection to search mission detection
[7] supporting complex search tasks
[8] detecting session boundaries from web user logs
[9] user behaviour and task characteristics: a field study of daily information behaviour
[10] learning to rewrite queries
[11] analyzing and evaluating query reformulation strategies in web search logs
[12] patterns of query reformulation during web searching
[13] learning user reformulation behavior for query auto-completion
[14] beyond the session timeout: automatic hierarchical segmentation of search topics in query logs
[15] human-computer interaction: the impact of users' cognitive styles on query reformulation behaviour during web searching
[16] patterns of search: analyzing and modeling web query refinement
[17] analysis and evaluation of query reformulations in different task types
[18] factors that influence query reformulations and search performance in health information retrieval: a multilevel modeling approach
[19] identifying task-based sessions in search engine query logs
[20] characterizing users' multi-tasking behavior in web search
[21] uncovering task based behavioral heterogeneities in online search behavior
[22] struggling and success in web search
[23] analysis of multiple query reformulations on the web: the interactive information retrieval context
[24] learning to rank query suggestions for adhoc and diversity search
[25] analysis of a very large web search engine query log
[26] a term-based methodology for query reformulation understanding
[27] a hierarchical recurrent encoder-decoder for generative context-aware query suggestion
[28] on the impact of domain expertise on query formulation, relevance assessment and retrieval performance in clinical settings
[29] a theory of the task-based information retrieval
[30] assessing the scenic route: measuring the value of search trails in web logs
[31] the effects of domain knowledge on search tactic formulation
[32] examining the impact of domain and cognitive complexity on query formulation and reformulation
[33] application of automatic topic identification on excite web search engine data logs
key: cord-020912-tbq7okmj authors: batra, vishwash; haldar, aparajita; he, yulan; ferhatosmanoglu, hakan; vogiatzis, george; guha, tanaya title: variational recurrent sequence-to-sequence retrieval for stepwise illustration date: 2020-03-17 journal: advances in information retrieval doi: 10.1007/978-3-030-45439-5_4 sha: doc_id: 20912 cord_uid: tbq7okmj we address and formalise the task of sequence-to-sequence (seq2seq) cross-modal retrieval. given a sequence of text passages as query, the goal is to retrieve a sequence of images that best describes and aligns with the query. this new task extends the traditional cross-modal retrieval, where each image-text pair is treated independently, ignoring broader context. we propose a novel variational recurrent seq2seq (vrss) retrieval model for this seq2seq task. unlike most cross-modal methods, we generate an image vector corresponding to the latent topic obtained from combining the text semantics and context. this synthetic image embedding point associated with every text embedding point can then be employed for either image generation or image retrieval as desired.
we evaluate the model for the application of stepwise illustration of recipes, where a sequence of relevant images is retrieved to best match the steps described in the text. to this end, we build and release a new stepwise recipe dataset for research purposes, containing 10k recipes (sequences of image-text pairs) with a total of 67k image-text pairs. to our knowledge, it is the first publicly available dataset to offer rich semantic descriptions in a focused category such as food or recipes. our model is shown to outperform several competitive and relevant baselines in the experiments. we also provide a qualitative analysis of how semantically meaningful the results produced by our model are, through human evaluation and comparison with relevant existing methods. there is growing interest in cross-modal analytics and search in multimodal data repositories. a fundamental problem is to associate images with some corresponding descriptive text. such associations often rely on semantic understanding, beyond traditional similarity search or image labelling, to provide human-like visual understanding of the text and reflect abstract ideas in the image. (fig. 1 caption: stepwise recipe illustration example showing a few text recipe instruction steps alongside one full sequence of recipe images. note that retrieval of an accurate illustration of step 4, for example, depends on previously acquired context information.) cross-modal retrieval systems must return outputs of one modality from a data repository, while a different modality is used as the input query. the multimodal repository usually consists of paired objects from two modalities, and may be labelled or unlabelled. classical approaches to compare data across modalities include canonical correlation analysis [12], partial least squares regression [28], and their numerous variants. more recently, various deep learning models have been developed to learn shared embedding spaces from paired image-text data, either unsupervised or supervised using image class labels. popular deep models include deep belief networks [23], correspondence autoencoders [9], deep metric learning [13], and convolutional neural networks (cnns) [33]. with all these models, it is expected that by learning from pairwise aligned data, the common representation space will capture semantic similarities across modalities. most such systems, however, do not consider sequences of related data in the query or result. in traditional image retrieval using text queries, for example, each image-text pair is considered in isolation, ignoring any broader 'context'. a context-aware image-from-text retrieval model must look at pairwise associations and also consider sequential relationships. such sequence-to-sequence (seq2seq) cross-modal retrieval is possible when contextual information and semantic meaning are both encoded and used to inform the retrieval step. for stepwise recipe illustration, an effective retrieval system must identify and align a set of relevant images corresponding to each step of a given text sequence of recipe instructions. more generally, for the task of automatic story picturing, a series of suitable images must be chosen to illustrate the events and abstract concepts found in a sequential text taken from a story. an example of the instruction steps and illustrations of a recipe taken from our new stepwise recipe dataset is shown in fig. 1.
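to make the classical shared-space idea mentioned above concrete, here is a minimal illustrative sketch using cca in scikit-learn, not the model proposed in this paper; the features, dimensions and sample sizes are random placeholders standing in for real text and image descriptors.

    # illustrative only: project paired text/image features into a shared space
    # with cca, then retrieve images for a text query by cosine similarity.
    import numpy as np
    from sklearn.cross_decomposition import CCA

    rng = np.random.default_rng(0)
    text_feats = rng.normal(size=(300, 300))    # placeholder text descriptors
    image_feats = rng.normal(size=(300, 512))   # placeholder image descriptors

    cca = CCA(n_components=32, max_iter=500)
    cca.fit(text_feats, image_feats)
    _, image_shared = cca.transform(text_feats, image_feats)

    def retrieve(query_text_feat, k=5):
        # indices of the k images closest to the query in the shared space
        q = cca.transform(query_text_feat.reshape(1, -1))[0]
        img = image_shared / np.linalg.norm(image_shared, axis=1, keepdims=True)
        q = q / np.linalg.norm(q)
        return np.argsort(-(img @ q))[:k]

    print(retrieve(text_feats[0]))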
in this paper, we present a variational recurrent learning model to enable seq2seq retrieval, called the variational recurrent sequence-to-sequence (vrss) model. vrss produces a joint representation of the image-text repository, where the semantic associations are grounded in context by making use of the sequential nature of the data. stepwise query results are then obtained by searching this representation space. more concretely, we incorporate the global context information encoded in the entire text sequence (through the attention mechanism) into a variational autoencoder (vae) at each time step, which converts the input text into an image representation in the image embedding space. to capture the semantics of the images retrieved so far (in a story/recipe), we assume the prior distribution of the topic given the text input is conditional on the latent topic from the previous time step. by doing so, our model can naturally capture sequential semantic structure. our main contributions can be summarised as follows:
- we formalise the task of sequence-to-sequence (seq2seq) retrieval for stepwise illustration of text.
- we propose a new variational recurrent seq2seq (vrss) retrieval model for seq2seq retrieval, which employs temporally-dependent latent variables to capture the sequential semantic structure of text-image sequences.
- we release a new stepwise recipe dataset (10k recipes, 67k total image-text pairs) for research purposes, and show that vrss outperforms several cross-modal retrieval alternatives on this dataset, using various performance metrics.
our work is related to: cross-modal retrieval, story picturing, variational recurrent neural networks, and cooking recipe datasets. cross-modal retrieval. a number of pairwise-based methods over the years have attempted to address the cross-modal retrieval problem in different ways, such as metric learning [26] and deep neural networks [32]. for instance, an alignment model [16] was devised that learns inter-modal correspondences using the ms-coco [19] and flickr-30k [25] datasets. other work [18] proposed unifying joint image-text embedding models with multimodal neural language models, using an encoder-decoder pipeline. a later method [8] used hard negatives to improve its ranking loss function, yielding significant gains in retrieval performance. such systems focus only on isolated image retrieval given a text query, and do not address the seq2seq retrieval problem that we study here. in a slight variation [2], the goal was to retrieve an image-text multimodal unit when given a text query. for this, the authors proposed a gated neural architecture to create an embedding space from the query texts and query images along with the multimodal units that form the retrieval result set, and then performed semantic matching in this space. the training minimised a structured hinge loss, and there was no sequential nature to the data used. story picturing. an early story picturing system [15] retrieved landscape and art images to illustrate ten short stories, based on key terms in the stories and image descriptions as well as a similarity linking of images. the idea was pursued further with a system [11] for helping people with limited literacy to read, which split a sentence into three categories and then retrieved a set of explanatory pictorial icons for each category.
to our knowledge, an application [17] that ranks and retrieves image sequences given longer text paragraphs as queries was the first to extend the pairwise image-text relationship to the matching of image sequences with longer paragraphs. it employed a structural ranking support vector machine with latent variables and used a custom-built disneyland dataset, consisting of blog posts with associated images, as the parallel corpus from which to learn joint embeddings. we follow a similar approach, creating our parallel corpus from sequential stepwise cooking recipes rather than unstructured blog posts, and design an entirely new seq2seq model to learn our embeddings. the visual storytelling dataset (vist) [14] was built with a motivation similar to our own, but for generating text descriptions of image sequences rather than the other way around. relying on human annotators to generate captions, vist contains sequential image-text pairs with a focus on abstract visual concepts, temporal event relations, and storytelling. in our work, we produce a similar sequenced dataset in a simple, automated manner. a recent joint sequence-to-sequence model [20] learned a common image-text semantic space and generated paragraphs to describe photo streams. this bidirectional attention recurrent neural network was evaluated on both of the above datasets. despite being built for the inverse problem, vist has also been used for retrieving images given text, in work related to ours. in an approach called coherent neural story illustration (cnsi), an encoder-decoder network [27] was built to first encode sentences using a hierarchical two-level sentence-story gated recurrent unit (gru), and then sequentially decode into a corresponding sequence of illustrative images. a previously proposed coherence model [24] was used to explicitly model co-references between sentences. variational recurrent neural networks. our model is partly inspired by the variational recurrent neural network (vrnn) [6], which introduces latent random variables into the hidden state of an rnn by combining it with a variational autoencoder (vae). it was shown that, using high-level latent random variables, the vrnn can model the variability observed in structured sequential data such as natural speech and handwriting. the vrnn has recently been applied to other sequential modelling tasks such as machine translation [31]. our proposed vrss model introduces temporally-dependent latent variables to capture the sequential semantic structure of text/image sequences. unlike existing approaches, we take into account the global context information encoded in the entire query sequence. we use the vae for cross-modal generation by converting the text into a representation in the image embedding space, instead of using it to reconstruct the text input. finally, we use a max-margin hinge loss to enforce similarity between text and paired image representations. cooking recipe datasets. an early effort on the automatic classification of food images produced the food-101 dataset [3], with 101k images across 101 categories. since then, the recipe1m dataset [29], which pairs each recipe with several images to build a collection of 13m food images for 1m recipes, has gained wide attention. recent work [4] proposed a cross-modal retrieval model that aligns recipe1m images and recipes in a shared representation space.
as this dataset does not offer any sequential data for stepwise illustration, the association provided is between images of the final dish and the corresponding entire recipe text. our stepwise recipe dataset, by comparison, provides an image for each instruction step, resulting in a sequence of image-text pairs for each recipe. the authors of [5] release a dataset of sequenced image-text pairs in the cooking domain, with a focus on text generation conditioned on images. recipeqa [34] is another popular dataset, used for multimodal comprehension and reasoning, with 36k questions about 20k recipes and illustrative images for each step of the recipes. recent work [1] used it to analyse image-text coherence relations, thereby producing a human-annotated corpus with coherence labels to characterise different inferential relationships. the recipeqa dataset reveals associations between image-text pairs much like our stepwise recipe dataset, and we therefore utilise it to augment our own dataset. we construct the stepwise recipe dataset, composed of illustrated, step-by-step recipes from three websites. recipes were automatically web-scraped and cleaned of html tags. information about the data and scripts will be made available on github. the construction of such an image-text parallel corpus poses several challenges, as highlighted in previous work [17]. the text is often unstructured, without information about the canonical association between image-text pairs. each image is semantically associated with some portion of the text in the same recipe, and we assume that the images chosen by the author to augment the text are semantically meaningful. we thus perform text segmentation to divide the recipe text and associate each segment with a single image. we perform text-based filtering [30] to ensure text quality: (1) descriptions should have a high unique-word ratio covering various part-of-speech tags, so descriptions with a high noun ratio are discarded; (2) descriptions with high repetition of tokens are discarded; and (3) some predefined boilerplate prefix and suffix sequences are removed. our constructed dataset consists of about 2k recipes with 44k associated images. furthermore, we augment our parallel corpus using similarly filtered recipeqa data [34], which contains images for each step of the recipes in addition to visual question answering data. the final dataset contains over 10k recipes in total and 67k images. the seq2seq retrieval task is formalised as follows: given a sequence of text passages x = {x_1, x_2, ..., x_T}, retrieve a sequence of images i = {i_1, i_2, ..., i_T} (from a data repository) which best describes the semantic meanings of the text passages, i.e. which maximises p(i|x) = ∏_{t=1}^{T} p(i_t | i_{<t}, x). we address the seq2seq retrieval problem by considering three aspects: (1) encoding the contextual information of text passages; (2) capturing the semantics of the images retrieved (in a story/recipe); and (3) learning the relatedness between each text passage and its corresponding image. it is natural to use rnns to encode a sequence of text passages. here, we encode a text sequence using a bi-directional gru (bi-gru). given a text passage, we use an attention mechanism to capture the contextual information of the whole recipe. we map the text embedding into a latent topic z_t using a vae.
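a minimal pytorch-style sketch of the per-step encoding just described follows: a bi-gru over a passage, dot-product attention over the other passages of the recipe, and a reparameterised latent topic z_t. the framework, the dot-product attention form and all dimensions are assumptions for illustration, and the temporally-dependent prior on z_{t-1} described next is omitted.

    import torch
    import torch.nn as nn

    class StepEncoder(nn.Module):
        def __init__(self, emb_dim=300, hid_dim=256, z_dim=128):
            super().__init__()
            self.gru = nn.GRU(emb_dim, hid_dim, bidirectional=True, batch_first=True)
            self.mu = nn.Linear(4 * hid_dim, z_dim)       # input is [h_t ; c_t]
            self.logvar = nn.Linear(4 * hid_dim, z_dim)

        def encode_passage(self, tokens):
            # tokens: (1, seq_len, emb_dim); concatenate last forward/backward states
            _, h = self.gru(tokens)                        # h: (2, 1, hid_dim)
            return torch.cat([h[0], h[1]], dim=-1)         # (1, 2*hid_dim)

        def forward(self, step_tokens, recipe_tokens):
            h_t = self.encode_passage(step_tokens)                         # (1, 2h)
            ctx = torch.cat([self.encode_passage(p) for p in recipe_tokens], dim=0)
            att = torch.softmax(ctx @ h_t.squeeze(0), dim=0)               # dot-product attention, (L,)
            c_t = (att.unsqueeze(1) * ctx).sum(dim=0, keepdim=True)        # (1, 2h)
            hc = torch.cat([h_t, c_t], dim=-1)
            mu, logvar = self.mu(hc), self.logvar(hc)
            z_t = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)      # reparameterisation
            return z_t, mu, logvar

    enc = StepEncoder()
    step = torch.randn(1, 12, 300)                         # one instruction, 12 token embeddings
    recipe = [torch.randn(1, 10, 300) for _ in range(5)]   # the other passages of the recipe
    z_t, mu, logvar = enc(step, recipe)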
in order to capture the semantics of the images retrieved so far (in a story/recipe), we assume the prior distribution of the topic given the text input is conditional on the latent topic z_{t-1} from the previous step. we decode the corresponding image vector i_t conditional on the latent topic using a multi-layer perceptron, learning the relatedness between text and image and obtaining a synthetic image embedding point generated from its associated text embedding point. our proposed variational recurrent seq2seq (vrss) model is illustrated in fig. 2. below, we describe each of the main components of the vrss model. text encoder. we use a bi-gru to learn the hidden representations of the text passage (e.g. one recipe instruction) in the forward and backward directions. the two learned hidden states are then concatenated to form the text segment representation. to encode a sequence of such text passages (e.g. one recipe), a hierarchical bi-gru is used, which first encodes each text segment and subsequently combines them. image encoder. to generate the vector representation of an image, we use the pre-trained modified resnet50 cnn [22]. in experiments, this model produced a well-distributed feature space when trained on the limited domain, namely food-related images. this was verified using t-sne visualisations [21], which showed less clustering in the generated embedding space compared to embeddings obtained from models pre-trained on imagenet [7]. to capture global context, we feed the bi-gru encodings into a top-level bi-gru. assuming the hidden state output of each text passage x_l in the global context is h^c_l, we use an attention mechanism to capture its similarity with the hidden state output of the t-th text passage h_t, yielding attention weights α_l. the context vector is then encoded as the combination of the L text passages weighted by the attentions: c_t = Σ_{l=1}^{L} α_l h^c_l. this ensures that any given text passage is influenced more by others that are semantically similar. at the t-th step of the text sequence, for the text passage x_t, the bi-gru output h_t is combined with the context c_t and fed into a vae to generate the latent topic z_t. two prior networks f^μ_θ and f^σ_θ define the prior distribution of z_t conditional on the previous z_{t-1}, i.e. the prior mean and standard deviation are given by f^μ_θ(z_{t-1}) and f^σ_θ(z_{t-1}). we also define two inference networks f^μ_φ and f^σ_φ, which produce the parameters of the approximate posterior as functions of h_t, c_t, and z_{t-1}. unlike the typical vae setup, where the text input x_t would be reconstructed by the generation networks, here we generate the corresponding image vector i_t. to generate the image vector conditional on z_t, generation networks are defined which are also conditional on z_{t-1}. the generation loss for image i_t is then: − kl(q(z t |x ≤t , z